CN111079594B - Video action classification and identification method based on a two-stream cooperative network - Google Patents
- Publication number
- CN111079594B (application CN201911228675.0A / CN201911228675A)
- Authority
- CN
- China
- Prior art keywords
- feature
- time domain
- video
- time
- features
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Bioinformatics & Computational Biology (AREA)
- General Engineering & Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Multimedia (AREA)
- Image Analysis (AREA)
Abstract
The invention relates to a video action category identification method based on a two-stream cooperative network. First, the heterogeneous spatial-domain and temporal-domain features exchange information: the heterogeneous temporal and spatial features are fused, the complementary temporal and spatial parts are extracted from the fused spatio-temporal features, and each complementary part is fused back into the original temporal or spatial features; the features obtained after fusing in the complementary parts form the temporal sequence features and the spatial sequence features. Next, sequence-feature aggregation is applied to the spatial and temporal sequence features to obtain aggregated spatial features and aggregated temporal features. Finally, a classifier model is pre-trained and used to test and classify the video to be identified. The invention allows complementary information to flow between the two input modalities, achieving a more accurate action recognition effect.
Description
Technical Field
The invention belongs to the technical field of video action classification and identification, and particularly relates to a video action classification and identification method based on a two-stream cooperative network.
Background
With the popularity of smart phones, public surveillance, portable cameras, and the like, short-video data is growing rapidly because it is so easy to capture. Action recognition on short video has important academic value and supports business applications such as intelligent security and user recommendation. The two-stream network has long been the most widely adopted and most effective framework in the field of action recognition, but most two-stream action recognition solutions focus on how to design structures that fuse the features of the different streams, and the stream networks are trained separately, so end-to-end inference cannot be achieved.
The goal of video action category identification is to identify the category of the actions occurring in a video. Existing two-stream action category identification methods mainly comprise the following steps:
(1) Spatial-domain feature extraction stream: spatial features are extracted from the input RGB video frames by a convolutional network (existing methods use both 2D and 3D convolutional networks); this branch extracts appearance information from the video as a basis for later fusion;
(2) Temporal-domain feature extraction stream: temporal features are extracted from the pre-computed optical flow field by a convolutional network (again, 2D or 3D convolutional networks can serve as the base network); this branch extracts motion information from the video as a basis for later fusion.
Most existing two-stream action category identification methods fuse features only at the back end of the network: the features of the two sub-streams must be extracted separately, and only the fusion stage is improved. This has the following drawbacks:
(1) Information representing the same pattern in the two heterogeneous input streams is processed separately; the complementary information between them is, in fact, not processed cooperatively at the front end of the network, so some key features that contribute to action recognition may be lost;
(2) End-to-end inference and learning cannot be performed: the two branches must be processed separately, and the mutual flow of information between the heterogeneous feature-extraction streams cannot be ensured, so the discriminability of the features cannot be maintained.
Disclosure of Invention
The invention provides a video action classification and identification method based on a two-stream cooperative network, addressing the problems in the prior art that key features may be lost, that video frames and optical flow fields are processed separately, that information does not flow between streams, and that end-to-end processing cannot be performed. By constructing a connection unit that lets the heterogeneous spatial-domain and temporal-domain features interact, the method achieves two-stream information complementation and mutual information flow, and at the same time enables end-to-end inference and learning.
The invention comprises the following specific contents:
A video action classification and identification method based on a two-stream cooperative network comprises: first, simultaneously extracting, through convolutional networks, the temporal sequence features $X^o$ from the video optical flow field and the spatial sequence features $X^f$ from the video frames; then constructing a connection unit that lets the heterogeneous temporal sequence features $X^o$ and spatial sequence features $X^f$ interact; then constructing a sharing unit that aggregates the interacted temporal sequence features $x'^{o}_j$ and the interacted spatial sequence features $x'^{f}_i$ separately, obtaining the aggregated temporal features $Z^o$ and the aggregated spatial features $Z^f$;
The information interaction specifically comprises the following steps:
step one: time domain sequence feature X extracted from video optical flow field o And spatial sequence feature X extracted from video frames f Fusing to obtain a heterogeneous correlation matrix Y of the time-space domain sequence characteristics;
step two: extracting complementary time domain sequence features according to the heterogeneous correlation matrix Y obtained in the step oneAnd complementary spatial sequence features->And complementary time domain sequence features->Fusion back to time domain sequence feature X o Generating a fused time domainSequence feature x' j o Complementary spatial sequence characteristics->Fused return space domain sequence feature X f Generating the fused airspace sequence characteristic x' i f 。
To better implement the invention, further, the temporal features $Z^o$ and spatial features $Z^f$ are regularized simultaneously and then fed into a shared weight layer, from which the temporal feature classification scores and spatial feature classification scores are extracted. Finally, the temporal and spatial classification scores are fused into a predicted spatio-temporal classification score vector used for actual video action recognition. The predicted spatio-temporal classification score vector is divided into the correct spatio-temporal classification score vector, which is the score vector of the true class, and the incorrect spatio-temporal classification score vectors, which are those of the other action categories extracted from the video being identified.
To better implement the invention, further, a pre-training sample set is selected for training, generating a classifier model containing the correct spatio-temporal classification score for each action category; the cross-entropy loss function $L_1$ is obtained from the correct spatio-temporal classification scores.
To better implement the invention, further, the shared weight layer also uses the input temporal features $Z^o$ and spatial features $Z^f$ to construct the heterogeneous triplet pairs of the temporal features $Z^o$ and of the spatial features $Z^f$, and determines the heterogeneous triplet pair loss function $L_2$ from these triplet pairs.
To better implement the invention, further, the shared weight layer also computes the class centers of the temporal features $Z^o$ and the class centers of the spatial features $Z^f$, and calculates the discriminative embedding limit loss function $L_3$ from these class centers.
To better implement the invention, further, a combination of the cross-entropy loss function $L_1$, the heterogeneous triplet pair loss function $L_2$, and the discriminative embedding limit loss function $L_3$ is adopted as the training loss function L.
To better implement the invention, further, in actual video identification, the generated predicted spatio-temporal classification scores are sorted in descending order and the largest one is selected; the largest score is the correct spatio-temporal classification score of the video being identified, and the class index corresponding to it is the category of the action.
To better implement the invention, further, the heterogeneous correlation matrix is computed as

$Y_{ij} = g_\theta(x_i^{f}, x_j^{o})$

where $g_\theta(\cdot,\cdot)$ measures the similarity of its arguments and $W_K$ is a function to be learned; Y is the heterogeneous correlation matrix of the spatio-temporal features, a matrix whose numbers of rows and columns both equal the number of video samples;
the fused time domain sequence characteristic x 'obtained in the step two' j o And the fused airspace sequence characteristic x' i f The specific expression of (2) is:
in the method, in the process of the invention,and->The interactive function for separating complementary features of the space domain and the time domain is expressed as followsAnd->w f ,w o To learn parameters, fused time domain sequence features x' i f The method comprises the steps of carrying out a first treatment on the surface of the Fused airspace sequence characteristic x' j o The expressions are X 'respectively' f ={x′ 1 f ,x′ 2 f,…,x′ T f }、X′ o ={x′ 1 o ,x′ 2 o ,…,x′ T o };
The cross-entropy loss function $L_1$ is expressed as follows:

$L_1 = -\sum_i \log \dfrac{e^{s_i^{l_i}}}{\sum_j e^{s_i^{j}}}$

where $L_1$ is the cross-entropy loss value, $s_i^{l_i}$ is the correct spatio-temporal classification score of the true class output for the i-th sample, and $s_i^{j}$ is the spatio-temporal classification score of the i-th sample output for class j;
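A minimal NumPy sketch of this cross-entropy, computed directly from the fused score vectors and averaged over the batch (a stand-in, not the patented implementation):

```python
import numpy as np

def cross_entropy_loss(scores, labels):
    """L_1 = -(1/N) * sum_i log( exp(s_i^{l_i}) / sum_j exp(s_i^j) ).
    scores: (N, C) predicted spatio-temporal classification scores;
    labels: (N,) true class indices l_i."""
    shifted = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()
```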
the expression of the heterogeneous triplet pair of the airspace characteristics isThe expression of the heterogeneous triplet pair of the time domain characteristics is +.>Wherein subscripts a, p, n respectively represent anchor points, positive example points, and negative example points, and i and j represent sample pair action category indexes;
the heterogeneous triplet pair loss function is specifically:
wherein L is 2 Representing tripletLoss value of the pair;representing a 2-norm distance measure; if x is greater than 0, [ x ]] + X, if x is 0 or less, [ x ]] + =0;α 1 Is a threshold value;
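A single heterogeneous triplet term of $L_2$ can be sketched as follows (anchor from one stream, positive and negative points from the other; the function name is hypothetical, and $L_2$ sums such terms over both stream pairings):

```python
import numpy as np

def hetero_triplet_term(z_a, z_p, z_n, alpha1=0.5):
    """[ ||z_a - z_p||_2^2 - ||z_a - z_n||_2^2 + alpha1 ]_+ for one
    heterogeneous triplet (anchor, positive, negative)."""
    d_ap = np.sum((z_a - z_p) ** 2)   # anchor-positive squared distance
    d_an = np.sum((z_a - z_n) ** 2)   # anchor-negative squared distance
    return max(d_ap - d_an + alpha1, 0.0)
```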
class center of the airspace featureThe expression is->Class center of time domain feature->The expression is->Wherein c= { C 1 ,c 2 ,…,c s "is a table label, l i For the label of the i-th sample, 1 () is an indication function;
the distinguishing embedding limiting loss function specifically comprises the following steps:
wherein L is 3 Representing the loss value of discriminating embedding, alpha 2 ,α 3 Is a threshold value;
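The class centers and the discriminative embedding limit loss can be sketched as below. Since the original formula images are not reproduced in the text, the exact form of $L_3$ (same-class spatial and temporal centers pulled within $\alpha_2$, different-class centers pushed beyond $\alpha_3$) is an assumed reconstruction.

```python
import numpy as np

def class_centers(Z, labels, num_classes):
    """c_s: mean embedding of the samples whose label l_i equals s."""
    return np.stack([Z[labels == s].mean(axis=0) for s in range(num_classes)])

def discriminative_embedding_loss(C_f, C_o, alpha2=0.1, alpha3=1.0):
    """Assumed form of L_3: align the spatial (C_f) and temporal (C_o)
    centers of each class, separate centers of different classes."""
    S = len(C_f)
    loss = 0.0
    for s in range(S):
        # same-class, cross-stream alignment term
        loss += max(np.sum((C_f[s] - C_o[s]) ** 2) - alpha2, 0.0)
        for t in range(S):
            if t != s:
                # different-class separation term
                loss += max(alpha3 - np.sum((C_f[s] - C_f[t]) ** 2), 0.0)
    return loss
```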
the expression of the loss function L is:
L=λ 1 L 1 +λ 2 L 2 +λ 3 L 3 。
compared with the prior art, the invention has the following advantages:
(1) The two heterogeneous input streams are processed simultaneously and their complementary information is processed cooperatively, avoiding the loss of key features that contribute to action recognition;
(2) Processing the two streams together enables end-to-end inference and learning and ensures the mutual flow of information between the two heterogeneous feature-extraction streams.
Drawings
FIG. 1 is a general flow diagram of the present invention;
FIG. 2 is a schematic diagram of a dual stream feature process framework;
FIG. 3 is a schematic diagram of a connection network for different modality flow branches;
FIG. 4 is a schematic representation of the effect of the present invention in comparison with the prior art.
Detailed Description
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, they are described below clearly and completely with reference to the accompanying drawings. It should be understood that the described embodiments are only some embodiments of the present invention, not all of them, and therefore should not be considered as limiting the scope of protection. All other embodiments obtained by a person of ordinary skill in the art without creative effort, based on the embodiments of the present invention, fall within the protection scope of the present invention.
Example 1:
A video action classification and identification method based on a two-stream cooperative network is disclosed. With reference to FIGS. 1, 2, and 3: first, convolutional networks simultaneously extract the spatial sequence features from the video frames and the temporal sequence features from the video optical flow field, the spatial sequence features being $X^f = \{x_1^f, \ldots, x_T^f\}$ and the temporal sequence features being $X^o = \{x_1^o, \ldots, x_T^o\}$, where $x_t^f, x_t^o \in \mathbb{R}^d$ and d is the feature dimension; a connection unit is constructed to let the heterogeneous spatial and temporal sequence features interact; a sharing unit is then constructed to aggregate the fused spatial sequence features and the fused temporal sequence features separately, giving the aggregated spatial features $Z^f$ and the aggregated temporal features $Z^o$;
The information interaction specifically comprises the following steps:
step one: the method comprises the steps of fusing the airspace sequence features extracted from video frames with the time domain sequence features extracted from video optical flow fields, wherein the specific formula is as follows:
in the formula g θ () For measuring the similarity of variables, the function is expressed asWherein->But->W K Is a function to be learned; y is a heterogeneous correlation matrix of time-space domain characteristics, and the size of the heterogeneous correlation matrix is a matrix with the number of rows and columns equal to the number of video samples;
step two: according to the heterogeneous correlation matrix Y obtained in the first step, complementary time domain sequence features and complementary space domain sequence features are separated from the fused space domain sequence features and time domain sequence features, and the separated complementary time domain sequence features and the complementary space domain sequence features are respectively fused back to the space domain sequence featuresAnd time domain sequence feature->Obtaining the fused time domain sequenceThe specific formulas of the column characteristics and the fused airspace sequence characteristics are as follows:
in the method, in the process of the invention,and->Interaction functions separating complementary features for spatial and temporal domains, respectively +.>And->w f ,w o For the parameters to be learned' i f The time domain sequence characteristics are fused; x's' j o The spatial sequence characteristics after fusion; the expression of the spatial domain sequence characteristics after fusion is +.>The fused time domain sequence features have the expression +.>
Working principle: the prior art extracts the features in the video frames and the optical flow field separately and fuses them afterwards, so some complementary information is easily lost. Here, the separated complementary features are added back into the originally extracted spatial and temporal sequence features, so that both contain complementary information before being passed on for further processing. In practice, a video segment usually contains a large number of frames; using all of them as input to subsequent operations would incur a huge computational cost, and much of the information is similar and therefore redundant, so the video is sampled before feature extraction. The video is sampled in a global sparse manner: M RGB frames are acquired, and since the optical flow field images have x and y directions there are 2M flow images in total. Features are then extracted from the sampled video frames and optical flow images by convolutional networks such as Inception and Inception-v3, and the extracted features are fed into the connection unit for processing.
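The global sparse sampling step can be sketched as follows. This is one common reading of the scheme: the frame indices are split into M equal segments and one frame is taken per segment; the random-offset variant for training is an assumption.

```python
import numpy as np

def global_sparse_sample(num_frames, M, rng=None):
    """Return M frame indices sampled globally and sparsely.
    rng=None: deterministic centre of each segment (test time);
    otherwise: a uniform-random index inside each segment (training)."""
    edges = np.linspace(0, num_frames, M + 1)
    if rng is None:
        return ((edges[:-1] + edges[1:]) / 2).astype(int)
    return np.array([int(rng.integers(int(edges[i]),
                                      max(int(edges[i + 1]), int(edges[i]) + 1)))
                     for i in range(M)])
```

The x- and y-direction optical flow images would be taken at the same M positions, giving 2M flow images in total.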
Example 2:
On the basis of embodiment 1 above, to better implement the invention, as shown in FIGS. 1 and 2, a sharing unit is further constructed to aggregate the fused spatial sequence features and the fused temporal sequence features separately: the fused spatial sequence features are aggregated into the spatial features $Z^f$ and the fused temporal sequence features into the temporal features $Z^o$. The temporal features $Z^o$ and spatial features $Z^f$ are regularized simultaneously and then fed into the shared weight layer, from which the temporal and spatial classification scores are extracted. Finally, the temporal and spatial classification scores are fused into a predicted spatio-temporal classification score vector used for actual video action recognition. The predicted spatio-temporal classification score vector is divided into the correct spatio-temporal classification score vector, which is the score vector of the true class, and the incorrect spatio-temporal classification score vectors, which are those of the other action categories extracted from the video being identified.
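A sketch of this shared-weight scoring step under simple assumptions: L2 normalisation as the regularization, one weight matrix scoring both streams, and equal-weight fusion of the two score vectors (the fusion weighting is an assumption).

```python
import numpy as np

def shared_weight_scores(Z_o, Z_f, W, fuse=0.5):
    """Z_o, Z_f: (N, d) aggregated temporal / spatial features.
    W: (d, C) weight matrix shared by both streams.
    Returns the (N, C) predicted spatio-temporal classification scores."""
    def l2norm(z):
        return z / (np.linalg.norm(z, axis=-1, keepdims=True) + 1e-12)
    s_o = l2norm(Z_o) @ W            # temporal classification scores
    s_f = l2norm(Z_f) @ W            # spatial classification scores
    return fuse * s_o + (1.0 - fuse) * s_f
```

Because the weight layer is shared, both streams are scored in the same embedding space, which is what makes the heterogeneous triplet and center losses below meaningful.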
Other portions of this embodiment are the same as those of embodiment 1, and thus will not be described in detail.
Example 3:
On the basis of either of embodiments 1-2 above, to better implement the invention, further, a sample set is selected for training, generating a classifier model containing the correct spatio-temporal classification score for each action category; a combination of the cross-entropy loss function, the heterogeneous triplet pair loss function, and the discriminative embedding limit loss function is adopted as the training loss function.
Working principle: a sample set is selected for pre-training the classifier model, and a combination of the cross-entropy loss function, the heterogeneous triplet pair loss function, and the discriminative embedding limit loss function is introduced as the training loss function, so that the pre-trained classifier model is more reliable and the classes are more tightly clustered.
Other portions of this embodiment are the same as any of embodiments 1-2 described above, and thus will not be described again.
Example 4:
On the basis of any one of embodiments 1-3 above, to better implement the invention, further, the shared weight layer also uses the input temporal features $Z^o$ and spatial features $Z^f$ to construct the heterogeneous triplet pairs of the spatial features and of the temporal features. The heterogeneous triplet pair of the spatial features is expressed as $(z_{a,i}^{f}, z_{p,i}^{o}, z_{n,j}^{o})$ and that of the temporal features as $(z_{a,i}^{o}, z_{p,i}^{f}, z_{n,j}^{f})$, where the subscripts a, p, n denote the anchor point, positive point, and negative point respectively, and i and j denote the action category indices of the sample pair. The heterogeneous triplet pair loss function is specifically:

$L_2 = \sum_{i \neq j} \big[\, \lVert z_{a,i}^{f} - z_{p,i}^{o} \rVert_2^2 - \lVert z_{a,i}^{f} - z_{n,j}^{o} \rVert_2^2 + \alpha_1 \big]_+ + \sum_{i \neq j} \big[\, \lVert z_{a,i}^{o} - z_{p,i}^{f} \rVert_2^2 - \lVert z_{a,i}^{o} - z_{n,j}^{f} \rVert_2^2 + \alpha_1 \big]_+$

where $L_2$ is the loss value of the triplet pairs; $\lVert \cdot \rVert_2$ is the 2-norm distance measure; $[x]_+ = x$ if $x > 0$ and $[x]_+ = 0$ if $x \le 0$; and $\alpha_1$ is a threshold.

Meanwhile, the class centers of the spatial features and of the temporal features are also obtained:

$c_s^{f} = \dfrac{\sum_i \mathbb{1}(l_i = s)\, z_i^{f}}{\sum_i \mathbb{1}(l_i = s)}, \quad c_s^{o} = \dfrac{\sum_i \mathbb{1}(l_i = s)\, z_i^{o}}{\sum_i \mathbb{1}(l_i = s)}$

where $C = \{c_1, c_2, \ldots, c_S\}$ are the class labels, $l_i$ is the label of the i-th sample, and $\mathbb{1}(\cdot)$ is the indicator function. The discriminative embedding limit loss function is specifically:

$L_3 = \sum_s \big[\, \lVert c_s^{f} - c_s^{o} \rVert_2^2 - \alpha_2 \big]_+ + \sum_{s \neq t} \big[\, \alpha_3 - \lVert c_s - c_t \rVert_2^2 \big]_+$

where $L_3$ is the discriminative embedding loss value and $\alpha_2, \alpha_3$ are thresholds.

The cross-entropy loss function is expressed as:

$L_1 = -\sum_i \log \dfrac{e^{s_i^{l_i}}}{\sum_j e^{s_i^{j}}}$

where $L_1$ is the cross-entropy loss value, $s_i^{l_i}$ is the correct spatio-temporal classification score of the true class output for the i-th sample, and $s_i^{j}$ is the spatio-temporal classification score of the i-th sample output for class j; through this loss function, the features of the true classes can be clustered more prominently.

The loss function for training the whole network is:

$L = \lambda_1 L_1 + \lambda_2 L_2 + \lambda_3 L_3$;

empirically, $L = L_1 + 0.5 L_2 + 0.5 L_3$.
Other portions of this embodiment are the same as any of embodiments 1 to 3 described above, and thus will not be described again.
Example 5:
On the basis of any one of embodiments 1-4 above, to better implement the invention, further, in actual video action recognition the generated predicted spatio-temporal classification scores are sorted in descending order and the largest one is selected; the largest score is the correct spatio-temporal classification score of the video being identified, and its corresponding class index is the category of the action. The invention uses the top-k metric to evaluate the model: top-k is the proportion of video sequences whose correct label appears among the top k results of the classification scores returned by the model, and it is the most commonly used classification evaluation method. In this example, k is set to 1. The invention was tested on the large-scale video behavior classification datasets UCF-101 and HMDB-51. The UCF-101 dataset contains 101 action categories and 13,320 samples; 70% of the samples were selected as the training set and the rest as the validation set. The HMDB-51 dataset contains 51 action categories and 6,849 samples in total; 70% were selected for training and the rest for validation. As shown in FIG. 4, after information interaction the fusion recognition performance of the invention on all validation sets is superior to the existing methods: on UCF-101 the final recognition performance improves on the previous best method by 0.4%, and on HMDB-51 by 3.2%.
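The final class prediction (argmax of the fused score vector) and the top-k evaluation described above can be sketched as:

```python
import numpy as np

def predict_class(score_vector):
    """Class index of the largest predicted spatio-temporal score."""
    return int(np.argmax(score_vector))

def top_k_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest scores.
    scores: (N, C) classification score vectors; labels: (N,) true indices."""
    topk = np.argsort(scores, axis=1)[:, ::-1][:, :k]   # descending indices
    return (topk == labels[:, None]).any(axis=1).mean()
```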
The method is superior to the existing method in all measurement modes, and the identification accuracy of video behavior classification is improved.
Other portions of this embodiment are the same as any of embodiments 1 to 4 described above, and thus will not be described again.
The foregoing description is only a preferred embodiment of the present invention and is not intended to limit the present invention in any way; any simple modification, equivalent variation, etc. of the above embodiments according to the technical substance of the present invention falls within the scope of the present invention.
Claims (5)
1. A video action classification and identification method based on a two-stream cooperative network, characterized in that: first, temporal sequence features $X^o$ are extracted from the video optical flow field and spatial sequence features $X^f$ are extracted from the video frames simultaneously through convolutional networks; a connection unit is then constructed to let the heterogeneous temporal sequence features $X^o$ and spatial sequence features $X^f$ interact; a sharing unit is then constructed to aggregate the interacted temporal sequence features $x'^{o}_j$ and the interacted spatial sequence features $x'^{f}_i$ separately, obtaining the aggregated temporal features $Z^o$ and the aggregated spatial features $Z^f$;
The information interaction specifically comprises the following steps:
step one: time domain sequence feature X extracted from video optical flow field o And spatial sequence feature X extracted from video frames f Fusing to obtain a heterogeneous correlation matrix Y of the time-space domain sequence characteristics;
step two: extracting complementary time domain sequence features according to the heterogeneous correlation matrix Y obtained in the step oneAnd complementary spatial sequence features->And complementary time domain sequence features->Fusion back to time domain sequence feature X o Generating the fused time domain sequence feature +.>Complementary spatial sequence characteristics->Fused return space domain sequence feature X f Generating the fused airspace sequence characteristics ∈ ->To time domain feature Z o Sum spatial domain feature Z f Regularization is carried out at the same time, then a shared weight layer is input, and then a time domain feature classification score and a space domain feature classification score are extracted; finally, fusing the time domain feature classification score and the airspace feature classification score into a prediction space-time feature classification score vector for actual video motion recognition; the predicted spatiotemporal feature classification score vector is divided into a correct spatiotemporal feature classification score vector and an incorrect spatiotemporal feature classification score vector; the correct time-space characteristic classification score vector is a classification score vector of the category with real time-domain characteristics; the error space-time feature classification score vector is a space-time feature classification score vector of other action categories extracted from the identified video in the process of actually identifying the video;
selecting a pre-training sample set for training to generate a classifier model containing the correct space-time characteristic classification scores of each action class classification; obtaining cross entropy loss function L according to correct space-time characteristic classification score 1 ;
The shared weight layer also uses the input time domain feature Z o Sum spatial domain feature Z f Respectively constructed when constructedDomain features Z o Heterogeneous triplet pairs and spatial signature Z f Is simultaneously according to time domain feature Z o Heterogeneous triplet pairs and spatial signature Z f Determination of heterogeneous triplet pair loss function L 2 。
2. The video action classification and identification method based on a two-stream cooperative network according to claim 1, wherein the shared weight layer also obtains the class centers $c_s^{o}$ of the temporal features $Z^o$ and the class centers $c_s^{f}$ of the spatial features $Z^f$, and calculates the discriminative embedding limit loss function $L_3$ from the obtained class centers.
3. The video action classification and identification method based on a two-stream cooperative network according to claim 2, wherein a combination of the cross-entropy loss function $L_1$, the heterogeneous triplet pair loss function $L_2$, and the discriminative embedding limit loss function $L_3$ is adopted as the training loss function L.
4. The video action classification and recognition method based on a dual-stream cooperative network as claimed in claim 3, wherein, in actual video recognition, the generated predicted spatio-temporal feature classification scores are sorted in descending order and the score with the largest value is selected; the predicted spatio-temporal feature classification score with the largest value is the correct spatio-temporal feature classification score of the recognized video, and the category index corresponding to it is the category of the action.
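The descending-sort selection in claim 4 amounts to an argmax over the fused score vector. A minimal sketch; the score values and class names below are illustrative, not from the patent:

```python
import numpy as np

def predict_action(scores, class_names):
    """Sort the fused spatio-temporal classification scores in descending
    order and take the class whose score is largest."""
    order = np.argsort(scores)[::-1]   # indices sorted from largest score down
    best = int(order[0])               # index of the maximum score
    return class_names[best], float(scores[best])

# Illustrative fused scores for three hypothetical action classes.
scores = np.array([0.10, 0.65, 0.25])
label, score = predict_action(scores, ["walk", "run", "jump"])
```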
5. The video action classification and recognition method based on a dual-stream cooperative network as claimed in claim 4, wherein g_θ(·) is used to measure the similarity of variables and is expressed as

g_θ(x_i, x_j) = θ(x_i)^T φ(x_j), where θ(x_i) = W_Q x_i and φ(x_j) = W_K x_j,

with W_Q and W_K being parameters to be learned; Y is the heterogeneous correlation matrix of the spatio-temporal features, a square matrix whose numbers of rows and columns are both equal to the number of video samples;
the fused time-domain sequence feature Z_o and the fused space-domain sequence feature Z_f obtained in step two are specifically expressed as follows: the interaction functions for separating the complementary features of the space domain and the time domain are

\hat{X}_o = Y X_f and \hat{X}_f = Y^T X_o;

with w_o and w_f as parameters to be learned, the fused time-domain sequence feature and the fused space-domain sequence feature are

Z_o = X_o + w_o \hat{X}_o and Z_f = X_f + w_f \hat{X}_f;
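The exact interaction functions are rendered as images in the source, so the sketch below is an assumption: complementary features are taken as correlation-weighted exchanges between the two streams through Y, then fused back residual-style with learnable scalars w_o and w_f:

```python
import numpy as np

def fuse_streams(X_o, X_f, Y, w_o=0.5, w_f=0.5):
    """Exchange complementary features across the two streams through the
    heterogeneous correlation matrix Y (N x N samples), then fuse them back
    into each stream with learnable scalar weights w_o and w_f."""
    X_comp_o = Y @ X_f       # complementary temporal features drawn from the spatial stream
    X_comp_f = Y.T @ X_o     # complementary spatial features drawn from the temporal stream
    Z_o = X_o + w_o * X_comp_o   # fused time-domain sequence feature
    Z_f = X_f + w_f * X_comp_f   # fused space-domain sequence feature
    return Z_o, Z_f
```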
The cross-entropy loss function L_1 is expressed as follows:

L_1 = -\sum_{i=1}^{N} \log \frac{e^{\hat{z}_i}}{\sum_{j} e^{\hat{z}_{ij}}}

where L_1 denotes the cross-entropy loss value, \hat{z}_i denotes the correct spatio-temporal feature classification score output for the true class of the i-th sample, and \hat{z}_{ij} denotes the spatio-temporal feature classification score output for the i-th sample on class j;
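The cross-entropy term can be sketched directly over the (N, C) matrix of fused classification scores; this is a generic softmax cross entropy, not the patent's exact implementation:

```python
import numpy as np

def cross_entropy_loss(logits, labels):
    """L1: softmax cross entropy over the predicted spatio-temporal
    classification score vectors.
    logits: (N, C) classification scores; labels: (N,) true class indices."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()
```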
the heterogeneous triplet pair of the space-domain features is expressed as (z_{f,a}^i, z_{o,p}^i, z_{o,n}^j), and the heterogeneous triplet pair of the time-domain features is expressed as (z_{o,a}^i, z_{f,p}^i, z_{f,n}^j), where the subscripts a, p, and n denote the anchor point, positive example point, and negative example point respectively, and i and j denote the action category indexes of the sample pair;
the heterogeneous triplet pair loss function is specifically:

L_2 = \sum_{i \neq j} \left[ \|z_{f,a}^i - z_{o,p}^i\|_2^2 - \|z_{f,a}^i - z_{o,n}^j\|_2^2 + \alpha_1 \right]_+ + \sum_{i \neq j} \left[ \|z_{o,a}^i - z_{f,p}^i\|_2^2 - \|z_{o,a}^i - z_{f,n}^j\|_2^2 + \alpha_1 \right]_+

where L_2 denotes the loss value of the triplet pairs; \|\cdot\|_2 denotes the 2-norm distance measure; [x]_+ = x when x > 0 and [x]_+ = 0 when x ≤ 0; and α_1 is a threshold;
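One direction of the triplet term can be sketched as a standard hinge triplet loss, where the anchor comes from one stream and the positive/negative points come from the other stream (the cross-stream pairing is what makes it "heterogeneous"); the margin value is illustrative:

```python
import numpy as np

def hetero_triplet_loss(anchor, positive, negative, alpha1=0.2):
    """One direction of L2: hinge triplet loss with the anchor from one
    stream and positive/negative examples from the other stream."""
    d_pos = np.sum((anchor - positive) ** 2, axis=1)   # squared 2-norm to positive
    d_neg = np.sum((anchor - negative) ** 2, axis=1)   # squared 2-norm to negative
    return np.maximum(d_pos - d_neg + alpha1, 0.0).mean()   # the [x]_+ hinge
```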
the class center c_f^k of the space-domain feature is expressed as

c_f^k = \frac{\sum_i 1(l_i = c_k)\, z_{f,i}}{\sum_i 1(l_i = c_k)},

and the class center c_o^k of the time-domain feature is expressed as

c_o^k = \frac{\sum_i 1(l_i = c_k)\, z_{o,i}}{\sum_i 1(l_i = c_k)},

where C = {c_1, c_2, …, c_s} is the set of class labels, l_i is the label of the i-th sample, and 1(·) is the indicator function;
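The class centers are per-class means of the fused features, selected with the indicator function; a minimal sketch:

```python
import numpy as np

def class_centers(Z, labels, num_classes):
    """Per-class mean of the fused features, i.e. the class centers c^k,
    computed with the indicator function 1(l_i = c_k)."""
    centers = np.zeros((num_classes, Z.shape[1]))
    for k in range(num_classes):
        mask = labels == k            # indicator 1(l_i = c_k) over all samples
        if mask.any():
            centers[k] = Z[mask].mean(axis=0)
    return centers
```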
the discriminative embedding constraint loss function is specifically:

L_3 = \sum_{k=1}^{s} \left[ \|c_o^k - c_f^k\|_2^2 - \alpha_2 \right]_+ + \sum_{k \neq j} \left[ \alpha_3 - \|c_o^k - c_f^j\|_2^2 \right]_+

where L_3 denotes the loss value of the discriminative embedding, and α_2 and α_3 are thresholds;
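The constraint pulls the temporal and spatial centers of the same class together and pushes centers of different classes apart; the two-margin form below is a sketch under that reading, with illustrative threshold values:

```python
import numpy as np

def discriminative_embedding_loss(c_o, c_f, alpha2=0.1, alpha3=1.0):
    """L3 sketch: keep same-class centers from the two streams within margin
    alpha2 of each other, and keep centers of different classes at least
    margin alpha3 apart (both hinged with [x]_+)."""
    s = len(c_o)
    pull = sum(max(np.sum((c_o[k] - c_f[k]) ** 2) - alpha2, 0.0)
               for k in range(s))
    push = sum(max(alpha3 - np.sum((c_o[k] - c_f[j]) ** 2), 0.0)
               for k in range(s) for j in range(s) if k != j)
    return pull + push
```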
the expression of the loss function L is:

L = λ_1 L_1 + λ_2 L_2 + λ_3 L_3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911228675.0A CN111079594B (en) | 2019-12-04 | 2019-12-04 | Video action classification and identification method based on double-flow cooperative network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111079594A CN111079594A (en) | 2020-04-28 |
CN111079594B true CN111079594B (en) | 2023-06-06 |
Family
ID=70312816
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911228675.0A Active CN111079594B (en) | 2019-12-04 | 2019-12-04 | Video action classification and identification method based on double-flow cooperative network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111079594B (en) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111259874B (en) * | 2020-05-06 | 2020-07-28 | 成都派沃智通科技有限公司 | Campus security video monitoring method based on deep learning |
CN111312367A (en) * | 2020-05-11 | 2020-06-19 | 成都派沃智通科技有限公司 | Campus personnel abnormal psychological prediction method based on self-adaptive cloud management platform |
CN112446348B (en) * | 2020-12-08 | 2022-05-31 | 电子科技大学 | Behavior identification method based on characteristic spectrum flow |
CN113343786B (en) * | 2021-05-20 | 2022-05-17 | 武汉大学 | Lightweight video action recognition method and system based on deep learning |
CN113255570B (en) * | 2021-06-15 | 2021-09-24 | 成都考拉悠然科技有限公司 | Sequential action detection method for sensing video clip relation |
CN114943286B (en) * | 2022-05-20 | 2023-04-07 | 电子科技大学 | Unknown target discrimination method based on fusion of time domain features and space domain features |
CN115393660B (en) * | 2022-10-28 | 2023-02-24 | 松立控股集团股份有限公司 | Parking lot fire detection method based on weak supervision collaborative sparse relationship ranking mechanism |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104023226A (en) * | 2014-05-28 | 2014-09-03 | 北京邮电大学 | HVS-based novel video quality evaluation method |
CN107506712A (en) * | 2017-08-15 | 2017-12-22 | 成都考拉悠然科技有限公司 | Method for distinguishing is known in a kind of human behavior based on 3D depth convolutional networks |
CN109101896A (en) * | 2018-07-19 | 2018-12-28 | 电子科技大学 | A kind of video behavior recognition methods based on temporal-spatial fusion feature and attention mechanism |
CN109558781A (en) * | 2018-08-02 | 2019-04-02 | 北京市商汤科技开发有限公司 | A kind of multi-angle video recognition methods and device, equipment and storage medium |
CN109784269A (en) * | 2019-01-11 | 2019-05-21 | 中国石油大学(华东) | One kind is based on the united human action detection of space-time and localization method |
CN109858407A (en) * | 2019-01-17 | 2019-06-07 | 西北大学 | A kind of video behavior recognition methods based on much information stream feature and asynchronous fusion |
CN110070041A (en) * | 2019-04-23 | 2019-07-30 | 江西理工大学 | A kind of video actions recognition methods of time-space compression excitation residual error multiplication network |
CN110135369A (en) * | 2019-05-20 | 2019-08-16 | 威创集团股份有限公司 | A kind of Activity recognition method, system, equipment and computer readable storage medium |
CN110163052A (en) * | 2018-08-01 | 2019-08-23 | 腾讯科技(深圳)有限公司 | Video actions recognition methods, device and machinery equipment |
CN110334746A (en) * | 2019-06-12 | 2019-10-15 | 腾讯科技(深圳)有限公司 | A kind of image detecting method and device |
CN110390308A (en) * | 2019-07-26 | 2019-10-29 | 华侨大学 | It is a kind of to fight the video behavior recognition methods for generating network based on space-time |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP6159489B2 (en) * | 2014-04-11 | 2017-07-05 | ペキン センスタイム テクノロジー ディベロップメント カンパニー リミテッド | Face authentication method and system |
Non-Patent Citations (2)
Title |
---|
Spatiotemporal residual networks for video action recognition; Christoph R. et al.; Advances in Neural Information Processing Systems; 2016-12-05; pp. 3468-3476 *
Human action recognition based on spatio-temporal two-stream convolution and LSTM (基于时空双流卷积与LSTM的人体动作识别); Mao Zhiqiang et al.; Software (《软件》); 2018-09-30; Vol. 38, No. 09; pp. 9-12 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111079594B (en) | Video action classification and identification method based on double-flow cooperative network | |
US10152644B2 (en) | Progressive vehicle searching method and device | |
CN113936339B (en) | Fighting identification method and device based on double-channel cross attention mechanism | |
CN110059465B (en) | Identity verification method, device and equipment | |
CN111581405A (en) | Cross-modal generalization zero sample retrieval method for generating confrontation network based on dual learning | |
CN111368815A (en) | Pedestrian re-identification method based on multi-component self-attention mechanism | |
Chen et al. | Multi-label image recognition with joint class-aware map disentangling and label correlation embedding | |
CN115033670A (en) | Cross-modal image-text retrieval method with multi-granularity feature fusion | |
Gao et al. | The labeled multiple canonical correlation analysis for information fusion | |
CN109711422A (en) | Image real time transfer, the method for building up of model, device, computer equipment and storage medium | |
Beikmohammadi et al. | SWP-LeafNET: A novel multistage approach for plant leaf identification based on deep CNN | |
CN110827265B (en) | Image anomaly detection method based on deep learning | |
Wang et al. | Spatial–temporal pooling for action recognition in videos | |
CN108960142B (en) | Pedestrian re-identification method based on global feature loss function | |
Li et al. | One-class knowledge distillation for face presentation attack detection | |
CN113239159A (en) | Cross-modal retrieval method of videos and texts based on relational inference network | |
Li et al. | Image manipulation localization using attentional cross-domain CNN features | |
Fang et al. | Traffic police gesture recognition by pose graph convolutional networks | |
CN116933051A (en) | Multi-mode emotion recognition method and system for modal missing scene | |
An | Pedestrian Re‐Recognition Algorithm Based on Optimization Deep Learning‐Sequence Memory Model | |
Chen et al. | SSR-HEF: crowd counting with multiscale semantic refining and hard example focusing | |
Cai et al. | Glitch in the matrix: A large scale benchmark for content driven audio–visual forgery detection and localization | |
Ma et al. | Cascade transformer decoder based occluded pedestrian detection with dynamic deformable convolution and Gaussian projection channel attention mechanism | |
Dong et al. | A supervised dictionary learning and discriminative weighting model for action recognition | |
WO2023185074A1 (en) | Group behavior recognition method based on complementary spatio-temporal information modeling |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||