CN111079594A - Video action classification and identification method based on a dual-stream cooperative network - Google Patents

Video action classification and identification method based on a dual-stream cooperative network

Info

Publication number
CN111079594A
CN111079594A
Authority
CN
China
Prior art keywords
feature
time domain
video
time
space
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911228675.0A
Other languages
Chinese (zh)
Other versions
CN111079594B (en)
Inventor
徐行
张静然
沈复民
贾可
申恒涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Koala Youran Technology Co ltd
Original Assignee
Chengdu Koala Youran Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Koala Youran Technology Co ltd filed Critical Chengdu Koala Youran Technology Co ltd
Priority to CN201911228675.0A priority Critical patent/CN111079594B/en
Publication of CN111079594A publication Critical patent/CN111079594A/en
Application granted granted Critical
Publication of CN111079594B publication Critical patent/CN111079594B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a video action category identification method based on a dual-stream cooperative network. First, the heterogeneous spatial-domain features and time-domain features carry out information interaction: the interaction fuses the heterogeneous time-domain and spatial-domain features, extracts the complementary parts of the time domain and the spatial domain from the fused time-space-domain features, and fuses these complementary parts back into the originally extracted time-domain features and spatial-domain features; the time-domain and spatial-domain features enriched with the complementary parts respectively form the time-domain sequence features and the spatial-domain sequence features. Then, sequence feature aggregation is carried out on the spatial-domain sequence features and the time-domain sequence features to obtain the aggregated spatial-domain features and aggregated time-domain features. Finally, a classifier model is pre-trained for testing and classifying the video to be recognized. The invention realizes complementation between the information of the different input modality streams, thereby achieving a more accurate action recognition effect.

Description

Video action classification and identification method based on a dual-stream cooperative network
Technical Field
The invention belongs to the technical field of video action classification and identification, and particularly relates to a video action classification and identification method based on a dual-stream cooperative network.
Background
Short video data is growing rapidly because it is so easy to capture, thanks to the popularity of smartphones, public surveillance, portable cameras and the like. Action recognition on short videos has important academic value and can support business applications such as intelligent security and user recommendation. The two-stream network has long been the most widely adopted and best-performing framework in the action recognition field, but most existing two-stream action recognition solutions focus on how to design a structure that fuses the features of the different streams, while the individual stream networks are trained independently, so end-to-end inference cannot be realized.
Video action category identification aims at identifying the category of the action occurring in a video. Existing two-stream action category identification methods mainly comprise the following two streams (a minimal sketch of this baseline layout is given after the list):
(1) spatial-domain feature extraction stream: spatial-domain features are extracted from the input RGB video frames by a convolutional network; existing methods use 2D and 3D convolutional networks, and this branch aims to extract the appearance information in the video and provide a basis for later fusion;
(2) time-domain feature extraction stream: time-domain features are extracted by a convolutional network from the pre-extracted optical flow field given as input; 2D and 3D convolutional networks can likewise be used as the base network, and this branch aims to extract the motion information in the video and provide a basis for later fusion.
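The baseline layout above can be pictured with a minimal sketch. The backbone architecture, channel counts and feature dimension below are illustrative assumptions, not the exact networks used in the patent; the description only states that 2D or 3D convolutional networks are used for each stream.

```python
# Minimal sketch of the conventional two-stream layout described above.
# The toy backbone, channel counts and feature dimension d are assumptions.
import torch
import torch.nn as nn

def tiny_backbone(in_channels, d=256):
    """A toy 2D CNN standing in for the convolutional backbone of one stream."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 64, kernel_size=7, stride=2, padding=3),
        nn.ReLU(inplace=True),
        nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
        nn.ReLU(inplace=True),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(128, d),
    )

spatial_stream  = tiny_backbone(in_channels=3)       # one RGB frame
temporal_stream = tiny_backbone(in_channels=2 * 5)   # 5 stacked optical-flow fields (x and y)

rgb_frame  = torch.randn(8, 3, 224, 224)             # batch of sampled frames
flow_stack = torch.randn(8, 10, 224, 224)            # matching flow stacks
x_f = spatial_stream(rgb_frame)                      # spatial-domain features X_f
x_o = temporal_stream(flow_stack)                    # time-domain features X_o
print(x_f.shape, x_o.shape)                          # torch.Size([8, 256]) twice
```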
Most existing two-stream video action category identification methods fuse the features only at the back end of the structure: the features of the two branch streams must first be extracted separately, and only the fusion scheme is improved afterwards. This leads to the following defects:
(1) information representing the same pattern in the two heterogeneous input streams is processed separately; the complementary information between the two input streams is not processed cooperatively at the front end of the network, so some key features that would help action recognition may be lost;
(2) inference and learning cannot be performed end to end; the two branches must be processed separately, and the mutual flow of information between the heterogeneous feature extraction streams cannot be guaranteed, which weakens the discriminativity of the features.
Disclosure of Invention
The invention provides a video action classification and identification method based on a dual-stream cooperative network, which solves the problems in the prior art that key features may be lost, that the video frames and the optical flow field are processed separately, that information does not flow between the streams, and that end-to-end processing cannot be carried out.
The invention specifically comprises the following contents:
A video action classification and identification method based on a dual-stream cooperative network comprises the following: first, time-domain sequence features $X_o$ are extracted from the video optical flow field and spatial-domain sequence features $X_f$ are extracted from the video frames, simultaneously, through convolutional networks; then a connection unit is constructed so that the heterogeneous time-domain sequence features $X_o$ and spatial-domain sequence features $X_f$ carry out information interaction; then a sharing unit is constructed to perform sequence feature aggregation on the interacted time-domain sequence features $\hat{X}_o$ and the interacted spatial-domain sequence features $\hat{X}_f$ respectively, obtaining the aggregated time-domain features $Z_o$ and the aggregated spatial-domain features $Z_f$.
The information interaction specifically comprises the following steps (an illustrative sketch of the connection unit is given after the two steps):
Step one: the time-domain sequence features $X_o$ extracted from the video optical flow field and the spatial-domain sequence features $X_f$ extracted from the video frames are fused to obtain a heterogeneous correlation matrix $Y$ of the time-space-domain sequence features;
Step two: according to the heterogeneous correlation matrix $Y$ obtained in step one, the complementary time-domain sequence features $\tilde{X}_o$ and the complementary spatial-domain sequence features $\tilde{X}_f$ are extracted; the complementary time-domain sequence features $\tilde{X}_o$ are fused back into the time-domain sequence features $X_o$ to generate the fused time-domain sequence features $\hat{X}_o$, and the complementary spatial-domain sequence features $\tilde{X}_f$ are fused back into the spatial-domain sequence features $X_f$ to generate the fused spatial-domain sequence features $\hat{X}_f$.
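As referenced above, the following is an illustrative sketch of the connection unit. The exact similarity function $g_\theta$, the way the complementary features are read off the correlation matrix $Y$, and the form of the fusion weights are given by formula images in the original patent that are not reproduced here, so everything in this sketch (linear embeddings, softmax-normalised cross-reading, residual fusion with scalar weights $w_f$, $w_o$) is an assumption built only from the surrounding description.

```python
# Sketch of the connection unit (steps one and two above); all concrete
# operator choices below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConnectionUnit(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.theta = nn.Linear(d, d, bias=False)   # embedding of the spatial stream
        self.phi   = nn.Linear(d, d, bias=False)   # embedding of the temporal stream (W_K-like)
        self.w_f = nn.Parameter(torch.tensor(0.5)) # learnable fusion weight for X_f
        self.w_o = nn.Parameter(torch.tensor(0.5)) # learnable fusion weight for X_o

    def forward(self, x_f, x_o):
        # Step one: heterogeneous correlation matrix Y (one row/column per sample).
        y = self.theta(x_f) @ self.phi(x_o).t()            # (N, N)
        # Step two: complementary features, read off Y in both directions.
        comp_f = F.softmax(y, dim=1) @ x_o                 # what the flow stream contributes to X_f
        comp_o = F.softmax(y.t(), dim=1) @ x_f             # what the RGB stream contributes to X_o
        # Residual fusion back into the originally extracted features.
        x_f_hat = x_f + self.w_f * comp_f
        x_o_hat = x_o + self.w_o * comp_o
        return x_f_hat, x_o_hat, y

unit = ConnectionUnit(d=256)
x_f_hat, x_o_hat, y = unit(torch.randn(8, 256), torch.randn(8, 256))
```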
To better implement the invention, further, the time-domain features $Z_o$ and the spatial-domain features $Z_f$ are both regularized and then fed into a shared weight layer, from which a time-domain feature classification score and a spatial-domain feature classification score are extracted; finally, the time-domain feature classification score and the spatial-domain feature classification score are fused into a predicted spatio-temporal feature classification score vector used for actual video action recognition. The predicted spatio-temporal feature classification score vector is divided into a correct spatio-temporal feature classification score vector and incorrect spatio-temporal feature classification score vectors; the correct spatio-temporal feature classification score vector is the classification score of the true category, while the incorrect spatio-temporal feature classification score vectors are the classification scores of the other action categories extracted from the video being recognized during actual recognition.
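A minimal sketch of the shared-weight classification head described above. The regularisation is assumed to be L2 normalisation and the score fusion is assumed to be a simple average; the patent only states that the two scores are fused into one predicted score vector.

```python
# Sketch of the shared-weight classification head; normalisation and averaging
# are assumptions, the shared linear layer follows the description above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedHead(nn.Module):
    def __init__(self, d, num_classes):
        super().__init__()
        self.classifier = nn.Linear(d, num_classes)   # one weight matrix shared by both streams

    def forward(self, z_o, z_f):
        z_o = F.normalize(z_o, dim=1)                 # regularise the aggregated temporal features
        z_f = F.normalize(z_f, dim=1)                 # regularise the aggregated spatial features
        s_o = self.classifier(z_o)                    # time-domain classification scores
        s_f = self.classifier(z_f)                    # spatial-domain classification scores
        return 0.5 * (s_o + s_f)                      # fused spatio-temporal score vector

head = SharedHead(d=256, num_classes=101)
fused_scores = head(torch.randn(8, 256), torch.randn(8, 256))
```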
In order to better realize the invention, further, a pre-training sample set is selected for training, generating a classifier model that contains the correct spatio-temporal feature classification scores for all action categories; a cross-entropy loss function $L_1$ is computed from the correct spatio-temporal feature classification scores.
In order to better implement the present invention, further, the shared weight layer also uses the input time-domain features $Z_o$ and spatial-domain features $Z_f$ to construct heterogeneous triplet pairs of the time-domain features $Z_o$ and heterogeneous triplet pairs of the spatial-domain features $Z_f$, and at the same time computes the heterogeneous triplet pair loss function $L_2$ from these heterogeneous triplet pairs of $Z_o$ and $Z_f$.
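A sketch of a heterogeneous triplet pair loss of this kind. The exact pairing rule and distance are given by formula images in the patent; the common cross-modal form is assumed here (anchor from one stream, positive and negative from the other, squared 2-norm distance, hinge with margin $\alpha_1$).

```python
# Sketch of a heterogeneous (cross-stream) triplet pair loss L2; the cross-modal
# anchor/positive/negative arrangement and the margin value are assumptions.
import torch

def hetero_triplet_loss(z_a, z_p, z_n, alpha1=0.3):
    d_ap = (z_a - z_p).pow(2).sum(dim=1)        # distance anchor -> positive (other stream)
    d_an = (z_a - z_n).pow(2).sum(dim=1)        # distance anchor -> negative (other stream)
    return torch.clamp(d_ap - d_an + alpha1, min=0).mean()   # [x]_+ hinge

# anchors from the temporal stream with positives/negatives from the spatial
# stream, plus the symmetric term with the roles of the two streams swapped
z_o, z_f_pos, z_f_neg = torch.randn(8, 256), torch.randn(8, 256), torch.randn(8, 256)
z_f, z_o_pos, z_o_neg = torch.randn(8, 256), torch.randn(8, 256), torch.randn(8, 256)
l2 = hetero_triplet_loss(z_o, z_f_pos, z_f_neg) + hetero_triplet_loss(z_f, z_o_pos, z_o_neg)
```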
In order to better implement the present invention, further, the shared weight layer also computes the class centers $c^o$ of the time-domain features $Z_o$ and the class centers $c^f$ of the spatial-domain features $Z_f$, and calculates a discriminative embedding constraint loss function $L_3$ from the obtained class centers $c^o$ and $c^f$.
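A sketch of the class centers and a discriminative embedding constraint. The class center is taken to be the mean of the aggregated features carrying that class label, as suggested by the indicator-function description given later; the exact form of $L_3$ is a formula image in the patent, so the margin form below (same-class centers of the two streams pulled within $\alpha_2$, different-class centers pushed beyond $\alpha_3$) is only an assumption.

```python
# Sketch of per-class centers and an assumed discriminative embedding loss L3.
import torch

def class_centers(z, labels, num_classes):
    centers = torch.zeros(num_classes, z.size(1))
    for k in range(num_classes):
        mask = labels == k                      # 1(l_i = c_k)
        if mask.any():
            centers[k] = z[mask].mean(dim=0)    # mean feature of class k
    return centers

def discriminative_embedding_loss(c_o, c_f, alpha2=0.1, alpha3=1.0):
    same = (c_o - c_f).pow(2).sum(dim=1)                        # same class, different streams
    diff = torch.cdist(c_o, c_f).pow(2)                         # all cross-class center pairs
    off_diag = diff[~torch.eye(diff.size(0), dtype=torch.bool)]
    return torch.clamp(same - alpha2, min=0).mean() + torch.clamp(alpha3 - off_diag, min=0).mean()

labels = torch.randint(0, 5, (8,))
c_o = class_centers(torch.randn(8, 256), labels, num_classes=5)
c_f = class_centers(torch.randn(8, 256), labels, num_classes=5)
l3 = discriminative_embedding_loss(c_o, c_f)
```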
To better implement the present invention, further, a combination of the cross-entropy loss function $L_1$, the heterogeneous triplet pair loss function $L_2$ and the discriminative embedding constraint loss function $L_3$ is adopted as the training loss function $L$.
In order to better realize the invention, further, during actual video recognition the generated predicted spatio-temporal feature classification scores are sorted from largest to smallest and the largest one is selected; this largest score is the correct spatio-temporal feature classification score of the recognized video, and the category index corresponding to it is the category of the action.
To better implement the invention, further:
The spatial-domain features $X_f$ and the time-domain features $X_o$ are each expressed as a sequence of $d$-dimensional feature vectors, where $d$ is the dimension of the features.
The heterogeneous correlation matrix $Y$ obtained in step one is computed by a similarity function $g_\theta(\cdot)$ that measures the similarity between the two feature streams through learned embedding functions, $W_K$ being a parameter to be learned; $Y$ is the heterogeneous correlation matrix of the time-space-domain features, a square matrix whose numbers of rows and columns both equal the number of video samples.
The fused time-domain sequence features $\hat{X}_o$ and the fused spatial-domain sequence features $\hat{X}_f$ obtained in step two are produced by interaction functions that separate the complementary spatial-domain features $\tilde{X}_f$ and the complementary time-domain features $\tilde{X}_o$ and fuse them back into the originally extracted features $X_f$ and $X_o$, with $w_f$ and $w_o$ as the parameters to be learned.
The cross-entropy loss function $L_1$ takes the softmax form
$$L_1=-\sum_i \log\frac{\exp\left(s_i^{y_i}\right)}{\sum_j \exp\left(s_i^{j}\right)},$$
where $L_1$ denotes the cross-entropy loss value, $s_i^{y_i}$ denotes the correct spatio-temporal feature classification score of the true category for the $i$-th sample, and $s_i^{j}$ denotes the spatio-temporal feature classification score of the $i$-th sample for class $j$.
The heterogeneous triplet pairs of the spatial-domain features and of the time-domain features are built across the two streams, where the subscripts $a$, $p$, $n$ denote the anchor, positive and negative points respectively, and $i$ and $j$ denote the sample and action-class indices.
The heterogeneous triplet pair loss function $L_2$ is a hinge loss over these triplet pairs based on the 2-norm distance metric $\|\cdot\|_2$, where $[x]_+=x$ if $x>0$, $[x]_+=0$ if $x\le 0$, and $\alpha_1$ is a threshold value; $L_2$ denotes the triplet loss value.
The class centers $c^f$ of the spatial-domain features and $c^o$ of the time-domain features are computed per action class from the features carrying that class label, where $C=\{c_1,c_2,\ldots,c_s\}$ is the set of class labels, $l_i$ is the label of the $i$-th sample, and $\mathbb{1}(\cdot)$ is the indicator function.
The discriminative embedding constraint loss function $L_3$ is defined on these class centers, where $L_3$ denotes the discriminative embedding loss value and $\alpha_2$, $\alpha_3$ are threshold values.
The expression of the overall loss function $L$ is:
$$L=\lambda_1 L_1+\lambda_2 L_2+\lambda_3 L_3.$$
Compared with the prior art, the invention has the following advantages and beneficial effects:
(1) the two heterogeneous input streams are processed simultaneously and their complementary information is processed cooperatively, so the loss of key features that help action recognition is avoided;
(2) processing the two streams together enables end-to-end inference and learning and guarantees the mutual flow of information between the heterogeneous feature extraction streams.
Drawings
FIG. 1 is a schematic overall flow diagram of the present invention;
FIG. 2 is a block diagram of a dual stream feature processing process;
FIG. 3 is a schematic diagram of a connection network for different modal flow branches;
FIG. 4 is a schematic diagram comparing the effect of the present invention with the prior art.
Detailed Description
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it should be understood that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments, and therefore should not be considered as a limitation to the scope of protection. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
Example 1:
A video action classification and identification method based on a dual-stream cooperative network is described with reference to FIG. 1, FIG. 2 and FIG. 3. First, a convolutional network is used to extract the spatial-domain sequence features $X_f$ from the video frames and, simultaneously, the time-domain sequence features $X_o$ from the video optical flow field, each feature being a $d$-dimensional vector, where $d$ is the dimension of the feature. A connection unit is then constructed so that the heterogeneous spatial-domain sequence features and time-domain sequence features carry out information interaction; afterwards a sharing unit is constructed to perform sequence feature aggregation on the fused spatial-domain sequence features and the fused time-domain sequence features respectively, obtaining the aggregated spatial-domain features $Z_f$ and the aggregated time-domain features $Z_o$.
The information interaction specifically comprises the following steps:
Step one: the spatial-domain sequence features extracted from the video frames and the time-domain sequence features extracted from the video optical flow field are fused to obtain the heterogeneous correlation matrix $Y$; in the corresponding formula, $g_\theta(\cdot)$ measures the similarity of the variables through learned embedding functions, with $W_K$ a parameter to be learned; $Y$ is the heterogeneous correlation matrix of the time-space-domain features, a square matrix whose numbers of rows and columns both equal the number of video samples.
Step two: according to the heterogeneous correlation matrix $Y$ obtained in step one, the complementary time-domain sequence features $\tilde{X}_o$ and the complementary spatial-domain sequence features $\tilde{X}_f$ are separated from the fused features and fused back into the time-domain sequence features $X_o$ and the spatial-domain sequence features $X_f$ respectively, obtaining the fused time-domain sequence features $\hat{X}_o$ and the fused spatial-domain sequence features $\hat{X}_f$; the interaction functions that separate the complementary spatial-domain and time-domain features have $w_f$ and $w_o$ as the parameters to be learned.
The working principle is as follows: in prior-art processing, the features in the video frames and in the optical flow field are extracted separately and only fused afterwards, so complementary information is easily lost. Adding the separated complementary features into the originally extracted spatial-domain and time-domain sequence features makes the originally extracted features carry the complementary information, and these enriched time-domain and spatial-domain sequence features are then sent on for further processing. In actual operation, a video usually contains a large number of frames; using all of them as input would incur a huge computational cost, and much of the information in neighbouring frames is similar and redundant, so the video is sampled before features are extracted. The video is sampled in a global sparse sampling manner: M RGB frames are acquired, and the optical flow field images, having x and y directions, amount to 2M images in total. A convolutional network then performs feature extraction on the sampled video frames and the optical flow field images, using Inception and Inception-v3 networks respectively, and the extracted features are sent to the connection unit for processing.
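A minimal sketch of the global sparse sampling described above: the video is divided into M equal segments, one RGB frame index is drawn per segment, and the x- and y-direction flow images at the sampled positions give 2M flow images in total. The segment count, the train/test sampling rule and the index bookkeeping are illustrative assumptions; frame decoding and optical-flow computation are assumed to happen elsewhere.

```python
# Sketch of global sparse sampling; only frame indices are produced here.
import random

def sparse_sample(num_frames, m=3, train=True):
    seg_len = num_frames // m
    indices = []
    for s in range(m):
        start = s * seg_len
        end = start + seg_len
        # random position inside the segment when training, centre when testing
        indices.append(random.randrange(start, end) if train else start + seg_len // 2)
    return indices

rgb_idx = sparse_sample(num_frames=300, m=3)
flow_idx = [(i, "x") for i in rgb_idx] + [(i, "y") for i in rgb_idx]   # 2M flow images
print(rgb_idx, len(flow_idx))
```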
Example 2:
On the basis of embodiment 1 above, to better implement the present invention, as shown in FIG. 1 and FIG. 2, a sharing unit is further constructed to perform sequence feature aggregation on the fused spatial-domain sequence features and the fused time-domain sequence features respectively: the fused spatial-domain sequence features are aggregated into the spatial-domain features $Z_f$ and the fused time-domain sequence features are aggregated into the time-domain features $Z_o$. The time-domain features $Z_o$ and the spatial-domain features $Z_f$ are both regularized and then fed into a shared weight layer, from which a time-domain feature classification score and a spatial-domain feature classification score are extracted; finally, the two scores are fused into a predicted spatio-temporal feature classification score vector used for actual video action recognition. The predicted spatio-temporal feature classification score vector is divided into a correct spatio-temporal feature classification score vector and incorrect spatio-temporal feature classification score vectors; the correct one is the classification score of the true category, while the incorrect ones are the classification scores of the other action categories extracted from the video being recognized during actual recognition.
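The aggregation operator of the sharing unit is not spelled out in this passage; a simple segmental consensus (mean over the per-segment features) is assumed in the sketch below purely for illustration.

```python
# Sketch of the sharing unit's sequence feature aggregation (assumed mean consensus).
import torch

def aggregate(seq_features):
    """(batch, M, d) per-segment features -> (batch, d) aggregated feature."""
    return seq_features.mean(dim=1)

x_f_hat = torch.randn(8, 3, 256)   # interacted spatial-domain sequence features
x_o_hat = torch.randn(8, 3, 256)   # interacted time-domain sequence features
z_f, z_o = aggregate(x_f_hat), aggregate(x_o_hat)
```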
Other parts of this embodiment are the same as embodiment 1, and thus are not described again.
Example 3:
On the basis of any one of embodiments 1-2 above, to better implement the present invention, a sample set is further selected for training, generating a classifier model that contains the correct spatio-temporal feature classification scores for action classification; a combination of the cross-entropy loss function, the heterogeneous triplet pair loss function and the discriminative embedding constraint loss function is adopted as the training loss function.
The working principle is as follows: a sample set is selected for pre-training the classifier model, and the combination of the cross-entropy loss, the heterogeneous triplet pair loss and the discriminative embedding constraint loss is introduced as the training loss function, so that the pre-trained classifier model is more reliable and the features of each class are more compactly aggregated.
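A minimal sketch of one pre-training step with the combined loss, using the empirical weights reported later in the description ($\lambda_1=1$, $\lambda_2=\lambda_3=0.5$). The `model`, `triplet_loss_fn` and `embedding_loss_fn` names and signatures are assumptions standing in for the components sketched earlier.

```python
# Minimal sketch of one pre-training step with L = L1 + 0.5*L2 + 0.5*L3.
import torch
import torch.nn.functional as F

def train_step(model, optimizer, rgb, flow, labels,
               triplet_loss_fn, embedding_loss_fn):
    scores, z_o, z_f = model(rgb, flow)                 # fused scores + aggregated features (assumed interface)
    l1 = F.cross_entropy(scores, labels)                # correct-class cross entropy
    l2 = triplet_loss_fn(z_o, z_f, labels)              # heterogeneous triplet pair loss
    l3 = embedding_loss_fn(z_o, z_f, labels)            # discriminative embedding constraint
    loss = l1 + 0.5 * l2 + 0.5 * l3
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```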
Other parts of this embodiment are the same as any of embodiments 1-2 described above, and thus are not described again.
Example 4:
On the basis of any one of embodiments 1 to 3 above, to better implement the present invention, further, the shared weight layer also uses the input time-domain features $Z_o$ and spatial-domain features $Z_f$ to construct the heterogeneous triplet pairs of the spatial-domain features and the heterogeneous triplet pairs of the time-domain features, where the subscripts $a$, $p$, $n$ denote the anchor, positive and negative points respectively, and $i$ and $j$ denote the sample and action-class indices. The heterogeneous triplet pair loss function $L_2$ is a hinge loss over these triplet pairs based on the 2-norm distance metric $\|\cdot\|_2$, where $[x]_+=x$ if $x>0$, $[x]_+=0$ if $x\le 0$, and $\alpha_1$ is a threshold value; $L_2$ denotes the triplet loss value.
At the same time, the class centers of the spatial-domain features and of the time-domain features are computed per action class from the features carrying that class label, where $C=\{c_1,c_2,\ldots,c_s\}$ is the set of class labels, $l_i$ is the label of the $i$-th sample, and $\mathbb{1}(\cdot)$ is the indicator function; the discriminative embedding constraint loss function $L_3$ is defined on these class centers, where $L_3$ denotes the discriminative embedding loss value and $\alpha_2$, $\alpha_3$ are threshold values.
The cross-entropy loss function $L_1$ takes the softmax form
$$L_1=-\sum_i \log\frac{\exp\left(s_i^{y_i}\right)}{\sum_j \exp\left(s_i^{j}\right)},$$
where $L_1$ denotes the cross-entropy loss value, $s_i^{y_i}$ denotes the classification score of the true category for the $i$-th sample, and $s_i^{j}$ denotes the classification score of the $i$-th sample for class $j$; through this loss function, the features of the true classification category are aggregated more prominently.
The loss function for training the whole network is
$$L=\lambda_1 L_1+\lambda_2 L_2+\lambda_3 L_3,$$
and empirically the weights are set as $L=L_1+0.5\,L_2+0.5\,L_3$.
Other parts of this embodiment are the same as any of embodiments 1 to 3, and thus are not described again.
Example 5:
On the basis of any one of embodiments 1 to 4 above, to better implement the present invention, further, during actual video action recognition the generated spatio-temporal feature classification scores are sorted from largest to smallest and the largest one is selected; this largest score is the correct spatio-temporal feature classification score of the recognized video, and the category index corresponding to it is the category of the action. The invention specifically uses the top-k index to evaluate the model: top-k refers to the proportion of video sequences whose correct label appears among the top k results of the classification scores returned by the model, and is the most common classification evaluation metric. In this example, k is set to 1. The invention was tested on the large-scale video behavior classification datasets UCF-101 and HMDB-51. The UCF-101 dataset comprises 101 action categories with 13,320 samples, of which 70% are selected as the training set and the rest as the validation set; the HMDB-51 dataset comprises 51 action categories with 6,849 samples, of which 70% are selected as the training set and the rest as the validation set. The comparison result is shown in FIG. 4: the fused recognition performance of the invention after information interaction is superior to the existing methods on all validation sets. On the UCF-101 dataset, the final recognition performance of the method is 0.4% higher than the previous best method, and on HMDB-51 it is 3.2% higher. The method outperforms the existing methods under all evaluation settings and improves the recognition accuracy of video behavior classification.
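A sketch of the top-k metric used in the evaluation above: the proportion of videos whose true label appears among the k highest classification scores (k = 1 in this embodiment).

```python
# Sketch of the top-k accuracy metric described above.
import torch

def top_k_accuracy(scores, labels, k=1):
    topk = scores.topk(k, dim=1).indices                 # (N, k) highest-scoring classes
    hits = (topk == labels.unsqueeze(1)).any(dim=1)      # true label among the top k?
    return hits.float().mean().item()

scores = torch.randn(6, 101)            # e.g. 101 UCF-101 classes
labels = torch.randint(0, 101, (6,))
print(top_k_accuracy(scores, labels, k=1))
```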
Other parts of this embodiment are the same as any of embodiments 1 to 4, and thus are not described again.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way, and all simple modifications and equivalent variations of the above embodiments according to the technical spirit of the present invention are included in the scope of the present invention.

Claims (8)

1. A video action classification and identification method based on a dual-stream cooperative network, characterized in that time-domain sequence features $X_o$ are extracted from the video optical flow field and spatial-domain sequence features $X_f$ are extracted from the video frames, simultaneously, through convolutional networks; then a connection unit is constructed so that the heterogeneous time-domain sequence features $X_o$ and spatial-domain sequence features $X_f$ carry out information interaction; then a sharing unit is constructed to perform sequence feature aggregation on the interacted time-domain sequence features $\hat{X}_o$ and the interacted spatial-domain sequence features $\hat{X}_f$ respectively, obtaining the aggregated time-domain features $Z_o$ and the aggregated spatial-domain features $Z_f$;
the information interaction specifically comprises the following steps:
step one: fusing the time-domain sequence features $X_o$ extracted from the video optical flow field and the spatial-domain sequence features $X_f$ extracted from the video frames to obtain a heterogeneous correlation matrix $Y$ of the time-space-domain sequence features;
step two: according to the heterogeneous correlation matrix $Y$ obtained in step one, extracting the complementary time-domain sequence features $\tilde{X}_o$ and the complementary spatial-domain sequence features $\tilde{X}_f$, fusing the complementary time-domain sequence features $\tilde{X}_o$ back into the time-domain sequence features $X_o$ to generate the fused time-domain sequence features $\hat{X}_o$, and fusing the complementary spatial-domain sequence features $\tilde{X}_f$ back into the spatial-domain sequence features $X_f$ to generate the fused spatial-domain sequence features $\hat{X}_f$.
2. The video action classification and identification method based on the dual-stream cooperative network according to claim 1, characterized in that the time-domain features $Z_o$ and the spatial-domain features $Z_f$ are both regularized and then fed into a shared weight layer, from which a time-domain feature classification score and a spatial-domain feature classification score are extracted; finally, the time-domain feature classification score and the spatial-domain feature classification score are fused into a predicted spatio-temporal feature classification score vector used for actual video action recognition; the predicted spatio-temporal feature classification score vector is divided into a correct spatio-temporal feature classification score vector and incorrect spatio-temporal feature classification score vectors; the correct spatio-temporal feature classification score vector is the classification score of the true category, and the incorrect spatio-temporal feature classification score vectors are the classification scores of the other action categories extracted from the video being recognized during actual recognition.
3. The video action classification and identification method based on the dual-stream cooperative network according to claim 2, characterized in that a pre-training sample set is selected for training, generating a classifier model that contains the correct spatio-temporal feature classification scores for each action category; a cross-entropy loss function $L_1$ is computed from the correct spatio-temporal feature classification scores.
4. The video action classification and identification method based on the dual-stream cooperative network according to claim 3, characterized in that the shared weight layer also uses the input time-domain features $Z_o$ and spatial-domain features $Z_f$ to construct heterogeneous triplet pairs of the time-domain features $Z_o$ and heterogeneous triplet pairs of the spatial-domain features $Z_f$, and at the same time computes the heterogeneous triplet pair loss function $L_2$ from these heterogeneous triplet pairs.
5. The video action classification and identification method based on the dual-stream cooperative network according to claim 4, characterized in that the shared weight layer also computes the class centers $c^o$ of the time-domain features $Z_o$ and the class centers $c^f$ of the spatial-domain features $Z_f$, and calculates a discriminative embedding constraint loss function $L_3$ from the obtained class centers $c^o$ and $c^f$.
6. The video action classification and identification method based on the dual-stream cooperative network according to claim 5, characterized in that a combination of the cross-entropy loss function $L_1$, the heterogeneous triplet pair loss function $L_2$ and the discriminative embedding constraint loss function $L_3$ is adopted as the training loss function $L$.
7. The video action classification and identification method based on the dual-stream cooperative network according to claim 2, characterized in that, during actual video recognition, the generated predicted spatio-temporal feature classification scores are sorted from largest to smallest and the largest one is selected; this largest score is the correct spatio-temporal feature classification score of the recognized video, and the category index corresponding to it is the category of the action.
8. The video action classification and identification method based on the dual-stream cooperative network according to claim 7, characterized in that:
the spatial-domain features $X_f$ and the time-domain features $X_o$ are each expressed as a sequence of $d$-dimensional feature vectors, where $d$ is the dimension of the features;
the heterogeneous correlation matrix $Y$ obtained in step one is computed by a similarity function $g_\theta(\cdot)$ that measures the similarity between the two feature streams through learned embedding functions, $W_K$ being a parameter to be learned; $Y$ is the heterogeneous correlation matrix of the time-space-domain features, a square matrix whose numbers of rows and columns both equal the number of video samples;
the fused time-domain sequence features $\hat{X}_o$ and the fused spatial-domain sequence features $\hat{X}_f$ obtained in step two are produced by interaction functions that separate the complementary spatial-domain features $\tilde{X}_f$ and the complementary time-domain features $\tilde{X}_o$ and fuse them back into the originally extracted features $X_f$ and $X_o$, with $w_f$ and $w_o$ as the parameters to be learned;
the cross-entropy loss function $L_1$ takes the softmax form
$$L_1=-\sum_i \log\frac{\exp\left(s_i^{y_i}\right)}{\sum_j \exp\left(s_i^{j}\right)},$$
where $L_1$ denotes the cross-entropy loss value, $s_i^{y_i}$ denotes the correct spatio-temporal feature classification score of the true category for the $i$-th sample, and $s_i^{j}$ denotes the spatio-temporal feature classification score of the $i$-th sample for class $j$;
the heterogeneous triplet pairs of the spatial-domain features and of the time-domain features are built across the two streams, where the subscripts $a$, $p$, $n$ denote the anchor, positive and negative points respectively, and $i$ and $j$ denote the sample and action-class indices;
the heterogeneous triplet pair loss function $L_2$ is a hinge loss over these triplet pairs based on the 2-norm distance metric $\|\cdot\|_2$, where $[x]_+=x$ if $x>0$, $[x]_+=0$ if $x\le 0$, and $\alpha_1$ is a threshold value;
the class centers $c^f$ of the spatial-domain features and $c^o$ of the time-domain features are computed per action class from the features carrying that class label, where $C=\{c_1,c_2,\ldots,c_s\}$ is the set of class labels, $l_i$ is the label of the $i$-th sample, and $\mathbb{1}(\cdot)$ is the indicator function;
the discriminative embedding constraint loss function $L_3$ is defined on these class centers, where $L_3$ denotes the discriminative embedding loss value and $\alpha_2$, $\alpha_3$ are threshold values;
the expression of the loss function $L$ is:
$$L=\lambda_1 L_1+\lambda_2 L_2+\lambda_3 L_3.$$
CN201911228675.0A 2019-12-04 2019-12-04 Video action classification and identification method based on double-flow cooperative network Active CN111079594B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911228675.0A CN111079594B (en) 2019-12-04 2019-12-04 Video action classification and identification method based on double-flow cooperative network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911228675.0A CN111079594B (en) 2019-12-04 2019-12-04 Video action classification and identification method based on double-flow cooperative network

Publications (2)

Publication Number Publication Date
CN111079594A true CN111079594A (en) 2020-04-28
CN111079594B CN111079594B (en) 2023-06-06

Family

ID=70312816

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911228675.0A Active CN111079594B (en) 2019-12-04 2019-12-04 Video action classification and identification method based on double-flow cooperative network

Country Status (1)

Country Link
CN (1) CN111079594B (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170147868A1 (en) * 2014-04-11 2017-05-25 Beijing Sesetime Technology Development Co., Ltd. A method and a system for face verification
CN104023226A (en) * 2014-05-28 2014-09-03 北京邮电大学 HVS-based novel video quality evaluation method
CN107506712A (en) * 2017-08-15 2017-12-22 成都考拉悠然科技有限公司 Method for distinguishing is known in a kind of human behavior based on 3D depth convolutional networks
CN109101896A (en) * 2018-07-19 2018-12-28 电子科技大学 A kind of video behavior recognition methods based on temporal-spatial fusion feature and attention mechanism
CN110163052A (en) * 2018-08-01 2019-08-23 腾讯科技(深圳)有限公司 Video actions recognition methods, device and machinery equipment
CN109558781A (en) * 2018-08-02 2019-04-02 北京市商汤科技开发有限公司 A kind of multi-angle video recognition methods and device, equipment and storage medium
CN109784269A (en) * 2019-01-11 2019-05-21 中国石油大学(华东) One kind is based on the united human action detection of space-time and localization method
CN109858407A (en) * 2019-01-17 2019-06-07 西北大学 A kind of video behavior recognition methods based on much information stream feature and asynchronous fusion
CN110070041A (en) * 2019-04-23 2019-07-30 江西理工大学 A kind of video actions recognition methods of time-space compression excitation residual error multiplication network
CN110135369A (en) * 2019-05-20 2019-08-16 威创集团股份有限公司 A kind of Activity recognition method, system, equipment and computer readable storage medium
CN110334746A (en) * 2019-06-12 2019-10-15 腾讯科技(深圳)有限公司 A kind of image detecting method and device
CN110390308A (en) * 2019-07-26 2019-10-29 华侨大学 It is a kind of to fight the video behavior recognition methods for generating network based on space-time

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHRISTOPH R. et al.: "Spatiotemporal residual networks for video action recognition", Advances in Neural Information Processing Systems *
MAO Zhiqiang et al.: "Human action recognition based on spatio-temporal two-stream convolution and LSTM" (基于时空双流卷积与LSTM的人体动作识别), Software (《软件》) *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111259874A (en) * 2020-05-06 2020-06-09 成都派沃智通科技有限公司 Campus security video monitoring method based on deep learning
CN111312367A (en) * 2020-05-11 2020-06-19 成都派沃智通科技有限公司 Campus personnel abnormal psychological prediction method based on self-adaptive cloud management platform
CN112446348A (en) * 2020-12-08 2021-03-05 电子科技大学 Behavior identification method based on characteristic spectrum flow
CN112446348B (en) * 2020-12-08 2022-05-31 电子科技大学 Behavior identification method based on characteristic spectrum flow
CN113343786A (en) * 2021-05-20 2021-09-03 武汉大学 Lightweight video action recognition network, method and system based on deep learning
CN113343786B (en) * 2021-05-20 2022-05-17 武汉大学 Lightweight video action recognition method and system based on deep learning
CN113255570A (en) * 2021-06-15 2021-08-13 成都考拉悠然科技有限公司 Sequential action detection method for sensing video clip relation
CN113255570B (en) * 2021-06-15 2021-09-24 成都考拉悠然科技有限公司 Sequential action detection method for sensing video clip relation
CN114943286A (en) * 2022-05-20 2022-08-26 电子科技大学 Unknown target discrimination method based on fusion of time domain features and space domain features
CN114943286B (en) * 2022-05-20 2023-04-07 电子科技大学 Unknown target discrimination method based on fusion of time domain features and space domain features
CN115393660A (en) * 2022-10-28 2022-11-25 松立控股集团股份有限公司 Parking lot fire detection method based on weak supervision collaborative sparse relationship ranking mechanism
CN115393660B (en) * 2022-10-28 2023-02-24 松立控股集团股份有限公司 Parking lot fire detection method based on weak supervision collaborative sparse relationship ranking mechanism

Also Published As

Publication number Publication date
CN111079594B (en) 2023-06-06

Similar Documents

Publication Publication Date Title
CN111079594B (en) Video action classification and identification method based on double-flow cooperative network
Li et al. Collaborative spatiotemporal feature learning for video action recognition
US10152644B2 (en) Progressive vehicle searching method and device
CN108537136B (en) Pedestrian re-identification method based on attitude normalization image generation
CN113936339B (en) Fighting identification method and device based on double-channel cross attention mechanism
CN115033670A (en) Cross-modal image-text retrieval method with multi-granularity feature fusion
CN109727246A (en) Comparative learning image quality evaluation method based on twin network
CN111506773B (en) Video duplicate removal method based on unsupervised depth twin network
WO2022160772A1 (en) Person re-identification method based on view angle guidance multi-adversarial attention
CN110827265B (en) Image anomaly detection method based on deep learning
CN113221641A (en) Video pedestrian re-identification method based on generation of confrontation network and attention mechanism
CN108960142B (en) Pedestrian re-identification method based on global feature loss function
CN108647621A (en) A kind of video analysis processing system and method based on recognition of face
CN113963170A (en) RGBD image saliency detection method based on interactive feature fusion
CN117152459A (en) Image detection method, device, computer readable medium and electronic equipment
WO2023185074A1 (en) Group behavior recognition method based on complementary spatio-temporal information modeling
CN116933051A (en) Multi-mode emotion recognition method and system for modal missing scene
Sarker et al. Transformer-Based Person Re-Identification: A Comprehensive Review
CN115439791A (en) Cross-domain video action recognition method, device, equipment and computer-readable storage medium
Sun et al. Video-based parent-child relationship prediction
CN115705756A (en) Motion detection method, motion detection device, computer equipment and storage medium
CN113468540A (en) Security portrait processing method based on network security big data and network security system
Liu et al. Text detection based on bidirectional feature fusion and sa attention mechanism
CN115631530B (en) Fair facial expression recognition method based on face action unit
Xie et al. Pedestrian attribute recognition based on multi-scale fusion and cross attention

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant