CN111079594A - Video action classification and identification method based on double-current cooperative network - Google Patents
- Publication number: CN111079594A (application CN201911228675.0A)
- Authority
- CN
- China
- Prior art keywords
- feature
- time domain
- video
- time
- space
- Prior art date
- Legal status: Granted (the legal status is an assumption, not a legal conclusion; Google has not performed a legal analysis)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
Abstract
The invention relates to a video action category identification method based on a dual-stream cooperative network. First, heterogeneous spatial-domain and temporal-domain features exchange information: the interaction fuses the heterogeneous temporal and spatial features, extracts the complementary parts of the temporal and spatial domains from the fused spatio-temporal features, and fuses those complementary parts back into the originally extracted temporal and spatial features; the temporal and spatial features that have absorbed the complementary parts then form the temporal and spatial sequence features, respectively. Next, the spatial and temporal sequence features undergo sequence-feature aggregation to obtain aggregated spatial and temporal features. Finally, a classifier model is pre-trained for testing and classifying the video to be recognized. The invention enables complementary information to flow between the different input modalities, thereby achieving a more accurate action-recognition effect.
Description
Technical Field
The invention belongs to the technical field of video action classification and identification, and particularly relates to a video action classification and identification method based on a dual-stream cooperative network.
Background
Short video data is growing rapidly thanks to its ready availability via smart phones, public surveillance, portable cameras, and the like. Action recognition on short video has important academic value and can support commercial applications such as intelligent security and user recommendation. The two-stream network has long been the most widely adopted and best-performing framework in the field of action recognition, but most existing two-stream action-recognition solutions focus on designing a structure to fuse the different stream features, and the two stream networks are trained independently, so end-to-end inference cannot be realized.
Video action category identification aims to identify the category of the action occurring in a video; the existing dual-stream action category identification methods mainly comprise the following steps:
(1) spatial-domain feature extraction stream: spatial-domain features are extracted from the input RGB video frames through a convolutional network (existing methods use 2D and 3D convolutional networks); this branch aims to extract morphological information in the video and provide a basis for later fusion;
(2) temporal-domain feature extraction stream: temporal-domain features are extracted from the input pre-extracted optical-flow field by a convolutional network (2D and 3D convolutional networks can also serve as the backbone); this branch aims to extract motion information in the video and provide a basis for later fusion.
Most existing dual-stream video action category identification methods fuse features at the back end of the structure: the features of the two branch streams must first be extracted separately, and only the fusion scheme is improved, which has the following defects:
(1) information that represents the same pattern in the two heterogeneous input streams is processed separately; since the complementary information between the two input streams is not processed cooperatively at the front end of the network, key features that benefit action recognition may be lost;
(2) inference learning cannot be performed end to end: the two branches must be processed separately, and the mutual flow of information between the heterogeneous feature extraction streams cannot be guaranteed to maintain feature discriminability.
Disclosure of Invention
The invention provides a video action classification and identification method based on a dual-stream cooperative network, to solve the problems in the prior art that key features may be lost, the video frames and the optical-flow field are processed separately, information does not flow between the streams, and end-to-end processing cannot be performed.
The invention specifically comprises the following contents:
A video action classification and identification method based on a dual-stream cooperative network: first, temporal sequence features X_o are extracted from the video optical-flow field and spatial sequence features X_f are extracted from the video frames, simultaneously, through convolutional networks; then a connection unit is constructed so that the heterogeneous temporal sequence features X_o and spatial sequence features X_f exchange information; then a sharing unit is constructed to perform sequence-feature aggregation on the interacted temporal sequence features X̃_o and the interacted spatial sequence features X̃_f respectively, obtaining the aggregated temporal feature Z_o and the aggregated spatial feature Z_f;
The information interaction specifically comprises the following steps:
Step one: the temporal sequence features X_o extracted from the video optical-flow field and the spatial sequence features X_f extracted from the video frames are fused to obtain the heterogeneous correlation matrix Y of the spatio-temporal sequence features;
Step two: according to the heterogeneous correlation matrix Y obtained in step one, the complementary temporal sequence features X̂_o and the complementary spatial sequence features X̂_f are extracted; the complementary temporal features X̂_o are fused back into the temporal sequence features X_o to generate the fused temporal sequence features X̃_o, and the complementary spatial features X̂_f are fused back into the spatial sequence features X_f to generate the fused spatial sequence features X̃_f.
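The two interaction steps above can be sketched in code. The patent leaves the similarity function, the interaction functions, and the fusion weights learnable, so the plain dot-product correlation and the fixed residual weight below are illustrative stand-ins only, not the patented implementation:

```python
# Hypothetical sketch of the connection unit: step one builds a heterogeneous
# correlation matrix Y between samples; step two weights the other stream's
# features by Y and fuses them back residually (w stands in for w_f / w_o).

def correlation(Xf, Xo):
    # Step one: Y[i][j] is a similarity between spatial sample i and
    # temporal sample j (rows/columns = number of video samples).
    return [[sum(a * b for a, b in zip(xf, xo)) for xo in Xo] for xf in Xf]

def fuse_back(X, X_other, Y, w=0.5):
    # Step two: complementary part of each feature, weighted by its row of Y,
    # is added back into the originally extracted feature.
    fused = []
    for i, x in enumerate(X):
        comp = [sum(Y[i][j] * X_other[j][k] for j in range(len(X_other)))
                for k in range(len(x))]
        fused.append([xi + w * c for xi, c in zip(x, comp)])
    return fused

# Toy features: 2 video samples, feature dimension d = 2.
Xf = [[1.0, 0.0], [0.0, 1.0]]    # spatial sequence features
Xo = [[0.5, 0.5], [1.0, 0.0]]    # temporal sequence features
Y = correlation(Xf, Xo)
YT = [list(r) for r in zip(*Y)]  # transpose for the temporal direction
Xf_tilde = fuse_back(Xf, Xo, Y)   # fused spatial features
Xo_tilde = fuse_back(Xo, Xf, YT)  # fused temporal features
```

After this interaction, both feature sets carry complementary information from the other stream while retaining their originally extracted content.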
To better implement the invention, further, the temporal feature Z_o and the spatial feature Z_f are simultaneously regularized and then fed into a shared weight layer, from which a temporal-feature classification score and a spatial-feature classification score are extracted; finally, the temporal and spatial classification scores are fused into a predicted spatio-temporal classification score vector for actual video action recognition. The predicted spatio-temporal classification score vectors divide into correct spatio-temporal classification score vectors and erroneous spatio-temporal classification score vectors: a correct spatio-temporal classification score vector is the classification score vector of the true category of the sample; an erroneous spatio-temporal classification score vector is a spatio-temporal classification score vector of any other action category extracted from the video being recognized.
In order to better realize the invention, further, a pre-training sample set is selected for training to generate a classifier model containing the correct spatio-temporal classification scores for all action categories; the cross-entropy loss function L1 is computed from the correct spatio-temporal classification scores.
In order to better implement the present invention, further, the shared weight layer also constructs, from the input temporal feature Z_o and spatial feature Z_f, the heterogeneous triplets of the temporal feature Z_o and the heterogeneous triplets of the spatial feature Z_f, and computes the heterogeneous triplet loss function L2 from them.
In order to better implement the present invention, further, the shared weight layer also computes the class centers c_o of the temporal feature Z_o and the class centers c_f of the spatial feature Z_f, and calculates the discriminative embedding restriction loss function L3 from the obtained class centers.
To better implement the present invention, further, a combination of the cross-entropy loss function L1, the heterogeneous triplet loss function L2, and the discriminative embedding restriction loss function L3 is adopted as the training loss function L.
In order to better realize the invention, further, in actual video recognition the entries of the generated predicted spatio-temporal classification score vector are sorted from largest to smallest and the largest entry is selected; this largest entry is the correct spatio-temporal classification score for the video being recognized, and the category index corresponding to it is the category of the action.
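The selection of the largest score can be sketched as follows; the score values are hypothetical and stand in for a fused spatio-temporal classification score vector:

```python
# Sketch: the predicted class is the index of the largest entry in the
# fused spatio-temporal classification score vector.
def predict(score_vector):
    best = max(range(len(score_vector)), key=lambda j: score_vector[j])
    return best, score_vector[best]

scores = [0.1, 2.3, 0.7, 1.9]   # hypothetical fused classification scores
cls, val = predict(scores)       # cls is the action category index
```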
In order to better implement the invention, further, the heterogeneous correlation matrix Y of step one is obtained by applying a similarity-measuring function g_θ(·) to the extracted features, in which W_K is a parameter to be learned; Y is the heterogeneous correlation matrix of the spatio-temporal features, a square matrix whose numbers of rows and columns both equal the number of video samples.
The fused temporal sequence features X̃_o and fused spatial sequence features X̃_f obtained in step two are expressed as:
X̃_f = X_f + w_f · h_f(Y, X_o),  X̃_o = X_o + w_o · h_o(Y, X_f)
where h_f(·) and h_o(·) are the interaction functions of the complementary features separated for the spatial and temporal domains respectively, and w_f, w_o are parameters to be learned; X̃_o denotes the fused temporal sequence features and X̃_f the fused spatial sequence features.
The expression of the cross-entropy loss function L1 is:
L1 = −Σ_i log( exp(s_i^{l_i}) / Σ_j exp(s_i^j) )
where L1 is the cross-entropy loss value, s_i^{l_i} is the correct spatio-temporal classification score of the true category l_i output for the i-th sample, and s_i^j is the spatio-temporal classification score of the i-th sample for class j.
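A minimal sketch of this cross-entropy over classification scores, with hypothetical score values:

```python
import math

# Sketch of the cross-entropy loss L1 over classification scores:
# L1 = -sum_i log( exp(s_i[l_i]) / sum_j exp(s_i[j]) ).
def cross_entropy(scores, labels):
    total = 0.0
    for s_i, l_i in zip(scores, labels):
        denom = sum(math.exp(v) for v in s_i)      # softmax denominator
        total += -math.log(math.exp(s_i[l_i]) / denom)
    return total

# Two samples, three classes; labels give the true class indices.
L1 = cross_entropy([[2.0, 0.5, 0.1], [0.2, 0.1, 1.5]], [0, 2])
```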
The heterogeneous triplet of a spatial feature is (Z_f^{a,i}, Z_o^{p,i}, Z_o^{n,j}), and the heterogeneous triplet of a temporal feature is (Z_o^{a,i}, Z_f^{p,i}, Z_f^{n,j}), where the subscripts a, p, n denote the anchor, positive, and negative points respectively, and i and j denote the sample and action-class indices.
The heterogeneous triplet loss function is specifically:
L2 = Σ_i [ ‖Z_f^{a,i} − Z_o^{p,i}‖² − ‖Z_f^{a,i} − Z_o^{n,j}‖² + α1 ]_+ + Σ_i [ ‖Z_o^{a,i} − Z_f^{p,i}‖² − ‖Z_o^{a,i} − Z_f^{n,j}‖² + α1 ]_+
where L2 is the triplet loss value; ‖·‖² is the squared 2-norm distance metric; [x]_+ = x if x > 0 and [x]_+ = 0 if x ≤ 0; and α1 is a threshold.
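A sketch of one triplet term of this loss; the margin value and the toy feature vectors are hypothetical:

```python
# Sketch of a heterogeneous triplet term with squared 2-norm distance and
# the hinge [x]+; alpha1 is the margin threshold (hypothetical value).
def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, alpha1=0.2):
    x = sq_dist(anchor, positive) - sq_dist(anchor, negative) + alpha1
    return x if x > 0 else 0.0   # the [x]+ hinge

# Anchor from one stream, positive/negative from the other stream.
easy = triplet_loss([1.0, 0.0], [0.9, 0.1], [0.0, 1.0])  # already separated
hard = triplet_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.1])  # violates margin
```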
The class centers of the spatial features are expressed as c_f^s = Σ_i 1(l_i = s)·Z_f^i / Σ_i 1(l_i = s), and the class centers of the temporal features as c_o^s = Σ_i 1(l_i = s)·Z_o^i / Σ_i 1(l_i = s), where C = {c_1, c_2, …, c_s} is the set of class labels, l_i is the label of the i-th sample, and 1(·) is the indicator function.
The discriminative embedding restriction loss function is specifically:
L3 = Σ_i [ ‖Z^i − c^{l_i}‖² − α2 ]_+ + Σ_{s≠t} [ α3 − ‖c^s − c^t‖² ]_+
applied to both the temporal features Z_o (with centers c_o) and the spatial features Z_f (with centers c_f); L3 is the discriminative embedding loss value, and α2, α3 are thresholds: each feature is pulled to within α2 of its class center while different class centers are pushed at least α3 apart.
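The class-center computation can be sketched as a per-label mean; the toy features and labels are hypothetical:

```python
# Sketch: the class center c^s is the mean of the features whose label
# equals s, i.e. sum_i 1(l_i = s) * Z_i / sum_i 1(l_i = s).
def class_center(features, labels, s):
    members = [f for f, l in zip(features, labels) if l == s]  # 1(l_i = s)
    n = len(members)
    return [sum(f[k] for f in members) / n for k in range(len(members[0]))]

Z = [[0.0, 0.0], [0.2, 0.0], [1.0, 1.0]]   # three aggregated features
lab = [0, 0, 1]                             # their class labels
c0 = class_center(Z, lab, 0)                # center of class 0
c1 = class_center(Z, lab, 1)                # center of class 1
```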
The expression of the loss function L is:
L = λ1·L1 + λ2·L2 + λ3·L3
where λ1, λ2, λ3 are weighting coefficients.
compared with the prior art, the invention has the following advantages and beneficial effects:
(1) the two heterogeneous input streams are processed simultaneously and their complementary information is processed cooperatively, so that key features helpful for action recognition are not lost;
(2) processing the dual-stream information together realizes end-to-end inference learning and guarantees the mutual flow of information between the two heterogeneous feature extraction streams.
Drawings
FIG. 1 is a schematic overall flow diagram of the present invention;
FIG. 2 is a block diagram of a dual stream feature processing process;
FIG. 3 is a schematic diagram of a connection network for different modal flow branches;
fig. 4 is a schematic diagram comparing the effect of the present invention with the prior art.
Detailed Description
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it should be understood that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments, and therefore should not be considered as a limitation to the scope of protection. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
Example 1:
A video action classification and identification method based on a dual-stream cooperative network is described with reference to figures 1, 2, and 3. First, convolutional networks simultaneously extract spatial sequence features from the video frames and temporal sequence features from the video optical-flow field; the spatial sequence features are denoted X_f and the temporal sequence features X_o, each a sequence of feature vectors of dimension d. A connection unit is constructed so that the heterogeneous spatial and temporal sequence features exchange information; a sharing unit is then constructed to perform sequence-feature aggregation on the fused spatial sequence features and the fused temporal sequence features respectively, obtaining the aggregated spatial feature Z_f and the aggregated temporal feature Z_o.
The information interaction specifically comprises the following steps:
Step one: the spatial sequence features extracted from the video frames and the temporal sequence features extracted from the video optical-flow field are fused into the heterogeneous correlation matrix Y by a similarity-measuring function g_θ(·), in which W_K is a parameter to be learned; Y is a square matrix whose numbers of rows and columns both equal the number of video samples.
Step two: according to the heterogeneous correlation matrix Y obtained in step one, the complementary temporal sequence features and complementary spatial sequence features are separated from the fused features and fused back into the original spatial sequence features X_f and temporal sequence features X_o respectively, giving the fused features:
X̃_f = X_f + w_f · h_f(Y, X_o),  X̃_o = X_o + w_o · h_o(Y, X_f)
where h_f(·) and h_o(·) are the interaction functions of the complementary features separated for the spatial and temporal domains respectively, and w_f, w_o are parameters to be learned; X̃_o denotes the fused temporal sequence features and X̃_f the fused spatial sequence features.
The working principle is as follows: in the prior art, the features of the video frames and of the optical-flow field are extracted separately and only then fused, so complementary information is easily lost. Here the separated complementary features are added into the originally extracted spatial and temporal sequence features, so that those features already contain the complementary information before being sent for further processing. In actual operation, because a video usually has a large number of frames, taking all frames as input would incur a huge computation cost, and much of the information across frames is similar and redundant; the video is therefore sampled before features are extracted. The video is sampled in a global sparse manner: M RGB frame images are acquired, and the optical-flow field, having x and y directions, yields 2M images in total. A convolutional network (an Inception-type network such as Inception-v3) then extracts features from the sampled video frames and optical-flow images respectively, and the extracted features are sent to the connection unit for processing.
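Global sparse sampling can be sketched as dividing the video into M equal segments and taking one frame per segment; the segment-middle choice below is one common variant, used here for illustration:

```python
# Sketch of global sparse sampling: split the N frames into M equal
# segments and take the middle frame of each segment. The optical-flow
# stream would correspondingly use 2M images (x and y directions).
def sparse_sample(num_frames, m):
    seg = num_frames / m
    return [int(seg * i + seg / 2) for i in range(m)]

idx = sparse_sample(300, 5)   # 5 frame indices spread over a 300-frame video
```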
Example 2:
On the basis of the above embodiment 1, in order to better implement the present invention, as shown in fig. 1 and fig. 2, a sharing unit is further constructed to perform sequence-feature aggregation on the fused spatial sequence features and the fused temporal sequence features respectively: the fused spatial sequence features are aggregated into the spatial feature Z_f, and the fused temporal sequence features into the temporal feature Z_o. The temporal feature Z_o and the spatial feature Z_f are simultaneously regularized and fed into a shared weight layer, from which a temporal-feature classification score and a spatial-feature classification score are extracted; finally, the temporal and spatial classification scores are fused into a predicted spatio-temporal classification score vector for actual video action recognition. The predicted spatio-temporal classification score vectors divide into correct and erroneous spatio-temporal classification score vectors: a correct one is the classification score vector of the true category of the sample, while an erroneous one is a spatio-temporal classification score vector of any other action category extracted from the video being recognized.
Other parts of this embodiment are the same as embodiment 1, and thus are not described again.
Example 3:
On the basis of either of embodiments 1 and 2, in order to better implement the present invention, a sample set is further selected for training, generating a classifier model containing the correct spatio-temporal classification scores for action classification; a combination of the cross-entropy loss function, the heterogeneous triplet loss function, and the discriminative embedding restriction loss function is adopted as the training loss function.
The working principle is as follows: a sample set is selected for pre-training the classifier model, and a combination of the cross-entropy loss function, the heterogeneous triplet loss function, and the discriminative embedding restriction loss function is introduced as the training loss, so that the pre-trained classifier model is more reliable and the classes are more tightly clustered.
Other parts of this embodiment are the same as any of embodiments 1-2 described above, and thus are not described again.
Example 4:
On the basis of any one of the above embodiments 1 to 3, in order to better implement the present invention, further, the shared weight layer constructs, from the input temporal feature Z_o and spatial feature Z_f, the heterogeneous triplets of the spatial feature and the heterogeneous triplets of the temporal feature respectively. The heterogeneous triplet of a spatial feature is (Z_f^{a,i}, Z_o^{p,i}, Z_o^{n,j}), and that of a temporal feature is (Z_o^{a,i}, Z_f^{p,i}, Z_f^{n,j}), where the subscripts a, p, n denote the anchor, positive, and negative points respectively, and i and j denote the sample and action-class indices. The heterogeneous triplet loss function is specifically:
L2 = Σ_i [ ‖Z_f^{a,i} − Z_o^{p,i}‖² − ‖Z_f^{a,i} − Z_o^{n,j}‖² + α1 ]_+ + Σ_i [ ‖Z_o^{a,i} − Z_f^{p,i}‖² − ‖Z_o^{a,i} − Z_f^{n,j}‖² + α1 ]_+
where L2 is the triplet loss value; ‖·‖² is the squared 2-norm distance metric; [x]_+ = x if x > 0 and [x]_+ = 0 if x ≤ 0; and α1 is a threshold.
Meanwhile, the class centers of the spatial feature and of the temporal feature are also computed: c_f^s = Σ_i 1(l_i = s)·Z_f^i / Σ_i 1(l_i = s) and c_o^s = Σ_i 1(l_i = s)·Z_o^i / Σ_i 1(l_i = s), where C = {c_1, c_2, …, c_s} is the set of class labels, l_i is the label of the i-th sample, and 1(·) is the indicator function. The discriminative embedding restriction loss function is specifically:
L3 = Σ_i [ ‖Z^i − c^{l_i}‖² − α2 ]_+ + Σ_{s≠t} [ α3 − ‖c^s − c^t‖² ]_+
applied to both streams; L3 is the discriminative embedding loss value, and α2, α3 are thresholds.
The cross-entropy loss function is expressed as:
L1 = −Σ_i log( exp(s_i^{l_i}) / Σ_j exp(s_i^j) )
where L1 is the cross-entropy loss value, s_i^{l_i} is the correct spatio-temporal classification score of the true category l_i output for the i-th sample, and s_i^j is the spatio-temporal classification score of the i-th sample for class j; through this loss function, the features of the true classification categories are more prominently aggregated.
The loss function for training the whole network is:
L = λ1·L1 + λ2·L2 + λ3·L3
and the weights are obtained empirically as λ1 = 1, λ2 = 0.5, λ3 = 0.5, i.e. L = L1 + 0.5·L2 + 0.5·L3.
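The weighted combination with the empirical coefficients can be sketched directly; the individual loss values passed in are hypothetical:

```python
# Sketch of the total training loss with the empirically chosen weights
# lambda = (1, 0.5, 0.5): L = L1 + 0.5*L2 + 0.5*L3.
def total_loss(L1, L2, L3, lambdas=(1.0, 0.5, 0.5)):
    return lambdas[0] * L1 + lambdas[1] * L2 + lambdas[2] * L3

L = total_loss(0.8, 0.4, 0.2)   # hypothetical per-term loss values
```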
Other parts of this embodiment are the same as any of embodiments 1 to 3, and thus are not described again.
Example 5:
On the basis of any one of the above embodiments 1 to 4, to better implement the present invention, further, in the actual video action recognition process, the entries of the generated spatio-temporal classification score vector are sorted in descending order and the largest entry is selected; the largest entry is the correct spatio-temporal classification score for the recognized video, and the category index corresponding to it is the category of the action. The invention uses the top-k index to evaluate the model: top-k is the proportion of video sequences whose correct label appears among the top k results of the classification scores returned by the model, and is the most common classification evaluation method; in this example, k is set to 1. The invention was tested on the large-scale video behavior classification datasets UCF-101 and HMDB-51. The UCF-101 dataset comprises 101 action categories with 13,320 samples, of which 70% are selected as the training set and the rest as the validation set; the HMDB-51 dataset comprises 51 action categories with 6,849 samples, of which 70% are selected as the training set and the rest as the validation set. The comparison results are shown in fig. 4: the fusion-recognition performance of the invention after information interaction on all validation sets is superior to that of existing methods.
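The top-k evaluation can be sketched as follows; the score vectors and labels below are hypothetical:

```python
# Sketch of the top-k evaluation: a prediction counts as correct if the
# true label appears among the k highest-scoring classes (k = 1 here).
def top_k_correct(score_vector, true_label, k=1):
    ranked = sorted(range(len(score_vector)),
                    key=lambda j: score_vector[j], reverse=True)
    return true_label in ranked[:k]

# Two hypothetical (score vector, true label) pairs.
preds = [([0.1, 0.8, 0.1], 1), ([0.6, 0.3, 0.1], 2)]
top1 = sum(top_k_correct(s, l, 1) for s, l in preds) / len(preds)
```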
On the UCF-101 data set, the final identification performance of the method is improved by 0.4% compared with the prior optimal method, and the final identification performance of the method on the HMDB-51 is improved by 3.2% compared with the prior optimal method. The method is superior to the existing method in all measurement modes, and the identification accuracy of video behavior classification is improved.
Other parts of this embodiment are the same as any of embodiments 1 to 4, and thus are not described again.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way, and all simple modifications and equivalent variations of the above embodiments according to the technical spirit of the present invention are included in the scope of the present invention.
Claims (8)
1. A video action classification and identification method based on a dual-stream cooperative network, characterized in that temporal sequence features X_o are extracted from the video optical-flow field and spatial sequence features X_f are extracted from the video frames, simultaneously, through convolutional networks; then a connection unit is constructed so that the heterogeneous temporal sequence features X_o and spatial sequence features X_f exchange information; then a sharing unit is constructed to perform sequence-feature aggregation on the interacted temporal sequence features X̃_o and the interacted spatial sequence features X̃_f respectively, obtaining the aggregated temporal feature Z_o and the aggregated spatial feature Z_f;
The information interaction specifically comprises the following steps:
step one: the temporal sequence features X_o extracted from the video optical-flow field and the spatial sequence features X_f extracted from the video frames are fused to obtain the heterogeneous correlation matrix Y of the spatio-temporal sequence features;
step two: according to the heterogeneous correlation matrix Y obtained in step one, the complementary temporal sequence features X̂_o and the complementary spatial sequence features X̂_f are extracted; the complementary temporal features X̂_o are fused back into the temporal sequence features X_o to generate the fused temporal sequence features X̃_o, and the complementary spatial features X̂_f are fused back into the spatial sequence features X_f to generate the fused spatial sequence features X̃_f.
2. The video action classification and identification method based on the dual-stream cooperative network as claimed in claim 1, characterized in that the temporal feature Z_o and the spatial feature Z_f are then simultaneously regularized and fed into a shared weight layer, from which a temporal-feature classification score and a spatial-feature classification score are extracted; finally, the temporal and spatial classification scores are fused into a predicted spatio-temporal classification score vector for actual video action recognition; the predicted spatio-temporal classification score vectors divide into correct spatio-temporal classification score vectors and erroneous spatio-temporal classification score vectors; a correct spatio-temporal classification score vector is the classification score vector of the true category of the sample; an erroneous spatio-temporal classification score vector is a spatio-temporal classification score vector of any other action category extracted from the video being recognized.
3. The video action classification and identification method based on the dual-stream cooperative network as claimed in claim 2, characterized in that a pre-training sample set is selected for training to generate a classifier model containing the correct spatio-temporal classification scores for each action category; the cross-entropy loss function L1 is computed from the correct spatio-temporal classification scores.
4. The video action classification and identification method based on the dual-stream cooperative network as claimed in claim 3, characterized in that the shared weight layer further constructs, from the input temporal feature Z_o and spatial feature Z_f, the heterogeneous triplets of the temporal feature Z_o and the heterogeneous triplets of the spatial feature Z_f, and computes the heterogeneous triplet loss function L2 from them.
5. The video action classification and identification method based on the dual-stream cooperative network as claimed in claim 4, characterized in that the shared weight layer further computes the class centers c_o of the temporal feature Z_o and the class centers c_f of the spatial feature Z_f, and calculates the discriminative embedding restriction loss function L3 from the obtained class centers.
6. The video action classification and identification method based on the dual-stream cooperative network as claimed in claim 5, characterized in that a combination of the cross-entropy loss function L1, the heterogeneous triplet loss function L2, and the discriminative embedding restriction loss function L3 is adopted as the training loss function L.
7. The method as claimed in claim 2, characterized in that during actual video recognition, the generated spatio-temporal feature classification scores are sorted in descending order and the largest score is selected; the largest score is the correct spatio-temporal feature classification score of the video being recognized, and the category index corresponding to it is the category of the action.
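The selection rule of claim 7 (sort the fused scores in descending order and take the largest) reduces to an argmax over the score vector. A small sketch, with hypothetical class names:

```python
def predict_action(fused_scores, class_names):
    """Claim 7's rule: the index of the largest fused score is the recognized action category."""
    best = max(range(len(fused_scores)), key=lambda i: fused_scores[i])
    return class_names[best]
```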
8. The video action classification and recognition method based on the dual-stream cooperative network as claimed in claim 7, characterized in that gθ(·) is a function measuring the similarity of variables, in which WK is a function to be learned; Y is the heterogeneous correlation matrix of the time-domain and spatial-domain features, a square matrix whose numbers of rows and columns both equal the number of video samples;
the fused time-domain sequence features and the fused spatial-domain sequence features obtained in step two are constructed through interaction functions of the complementary, separated spatial-domain and time-domain features, with wf and wo as the parameters to be learned;
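The exact interaction functions and the roles of wf and wo appear only in the original specification's formula images, which did not survive extraction. As one hedged illustration of "interactive fusion of complementary features with learnable scalar weights", the sketch below augments each stream with a weighted elementwise cross-stream product; the interaction h(a, b) = a·b is purely an assumption.

```python
def interactive_fusion(x_f, x_o, w_f, w_o):
    """Assumed sketch: each stream keeps its own features and adds a learnable
    fraction of a cross-stream interaction term (h(a, b) = a * b is an assumption)."""
    h = [a * b for a, b in zip(x_f, x_o)]            # cross-stream interaction term
    fused_f = [a + w_f * c for a, c in zip(x_f, h)]  # fused spatial-domain features
    fused_o = [b + w_o * c for b, c in zip(x_o, h)]  # fused time-domain features
    return fused_f, fused_o
```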
the cross-entropy loss function L1 is expressed as:

L1 = −∑i log( exp(si^{yi}) / ∑j exp(si^{j}) )

where L1 represents the cross-entropy loss value, si^{yi} is the correct spatio-temporal feature classification score of the i-th sample for its true category, and si^{j} is the classification score when the i-th sample is output to class j;
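Under the usual softmax reading of the classification scores, the cross-entropy loss described above can be computed as follows (a pure-Python sketch; averaging over the samples is an assumption):

```python
import math

def softmax(scores):
    m = max(scores)                       # subtract max for numerical stability
    e = [math.exp(s - m) for s in scores]
    t = sum(e)
    return [x / t for x in e]

def cross_entropy_loss(score_vectors, labels):
    """L1: mean negative log-probability assigned to each sample's true class."""
    total = 0.0
    for scores, y in zip(score_vectors, labels):
        total += -math.log(softmax(scores)[y])
    return total / len(score_vectors)
```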
the heterogeneous triplet pair of the spatial-domain feature is expressed as (Zf,a^i, Zo,p^i, Zo,n^j), and the heterogeneous triplet pair of the time-domain feature as (Zo,a^i, Zf,p^i, Zf,n^j), where the subscripts a, p and n denote the anchor, positive and negative points respectively, and i and j denote the sample and action-class indices;
the heterogeneous triplet pair loss function is specifically:

L2 = ∑ [ ‖za − zp‖2² − ‖za − zn‖2² + α1 ]+

where L2 represents the triplet loss value; ‖·‖2 represents the 2-norm distance metric; [x]+ = x if x > 0 and [x]+ = 0 if x ≤ 0; α1 is a threshold;
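A sketch of one heterogeneous triplet term under the stated hinge [x]+ and threshold α1, assuming the squared 2-norm as the distance metric (the claim names only "a 2-norm distance metric"):

```python
def sq_dist(u, v):
    """Squared 2-norm distance between two feature vectors (an assumed metric)."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def hinge(x):
    """[x]+ : x if x > 0, else 0."""
    return x if x > 0 else 0.0

def triplet_term(anchor, positive, negative, alpha1):
    """One heterogeneous triplet term: anchor from one stream, pos/neg from the other."""
    return hinge(sq_dist(anchor, positive) - sq_dist(anchor, negative) + alpha1)
```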
the class center of the spatial-domain feature is expressed as cf^s = ∑i 1(li = cs) Zf^i / ∑i 1(li = cs), and the class center of the time-domain feature as co^s = ∑i 1(li = cs) Zo^i / ∑i 1(li = cs), where C = {c1, c2, …, cs} is the set of class labels, li is the label of the i-th sample, and 1(·) is the indicator function;
the discriminative embedding constraint loss function is specifically:

L3 = ∑s [ ‖cf^s − co^s‖2² − α2 ]+ + ∑s≠t [ α3 − ‖c^s − c^t‖2² ]+

where L3 represents the discriminative embedding loss value, and α2, α3 are thresholds;
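The class centers follow directly from the indicator-function expression. The L3 form below (pull same-class temporal/spatial centers within margin α2, push different-class center pairs apart by margin α3) is an assumed instantiation, since the claim's own formula is not reproduced in this text:

```python
def class_centers(features, labels, num_classes):
    """c_j = sum_i 1(l_i = j) * z_i / sum_i 1(l_i = j): mean feature per class."""
    dim = len(features[0])
    centers = []
    for j in range(num_classes):
        members = [z for z, l in zip(features, labels) if l == j]
        if members:
            centers.append([sum(col) / len(members) for col in zip(*members)])
        else:
            centers.append([0.0] * dim)  # empty class: keep a zero center
    return centers

def discriminative_embedding_loss(c_f, c_o, alpha2, alpha3):
    """Assumed L3: hinge-pull heterogeneous same-class centers together (margin alpha2),
    hinge-push different-class center pairs apart (margin alpha3)."""
    def sq(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    pull = sum(max(sq(cf, co) - alpha2, 0.0) for cf, co in zip(c_f, c_o))
    push = sum(max(alpha3 - sq(c_f[i], c_o[j]), 0.0)
               for i in range(len(c_f)) for j in range(len(c_o)) if i != j)
    return pull + push
```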
the expression of the loss function L is:
L = λ1L1 + λ2L2 + λ3L3.
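The final training objective is then a plain weighted sum of the three losses; the λ values weight the terms against each other:

```python
def total_loss(l1, l2, l3, lam1=1.0, lam2=1.0, lam3=1.0):
    """L = lam1*L1 + lam2*L2 + lam3*L3 (lam1..lam3 are tunable weighting coefficients)."""
    return lam1 * l1 + lam2 * l2 + lam3 * l3
```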
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911228675.0A CN111079594B (en) | 2019-12-04 | 2019-12-04 | Video action classification and identification method based on double-flow cooperative network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111079594A true CN111079594A (en) | 2020-04-28 |
CN111079594B CN111079594B (en) | 2023-06-06 |
Family
ID=70312816
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911228675.0A Active CN111079594B (en) | 2019-12-04 | 2019-12-04 | Video action classification and identification method based on double-flow cooperative network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111079594B (en) |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104023226A (en) * | 2014-05-28 | 2014-09-03 | 北京邮电大学 | HVS-based novel video quality evaluation method
US20170147868A1 (en) * | 2014-04-11 | 2017-05-25 | Beijing Sesetime Technology Development Co., Ltd. | A method and a system for face verification
CN107506712A (en) * | 2017-08-15 | 2017-12-22 | 成都考拉悠然科技有限公司 | Human behavior recognition method based on 3D deep convolutional networks
CN109101896A (en) * | 2018-07-19 | 2018-12-28 | 电子科技大学 | Video behavior recognition method based on spatio-temporal fusion features and an attention mechanism
CN109558781A (en) * | 2018-08-02 | 2019-04-02 | 北京市商汤科技开发有限公司 | Multi-view video recognition method and apparatus, device and storage medium
CN109784269A (en) * | 2019-01-11 | 2019-05-21 | 中国石油大学(华东) | Human action detection and localization method based on joint spatio-temporal features
CN109858407A (en) * | 2019-01-17 | 2019-06-07 | 西北大学 | Video behavior recognition method based on multi-information-stream features and asynchronous fusion
CN110070041A (en) * | 2019-04-23 | 2019-07-30 | 江西理工大学 | Video action recognition method using a spatio-temporal squeeze-and-excitation residual multiplication network
CN110135369A (en) * | 2019-05-20 | 2019-08-16 | 威创集团股份有限公司 | Behavior recognition method, system, device and computer-readable storage medium
CN110163052A (en) * | 2018-08-01 | 2019-08-23 | 腾讯科技(深圳)有限公司 | Video action recognition method, apparatus and machine device
CN110334746A (en) * | 2019-06-12 | 2019-10-15 | 腾讯科技(深圳)有限公司 | Image detection method and apparatus
CN110390308A (en) * | 2019-07-26 | 2019-10-29 | 华侨大学 | Video behavior recognition method based on spatio-temporal adversarial generative networks
Non-Patent Citations (2)
Title |
---|
CHRISTOPH R et al.: "Spatiotemporal residual networks for video action recognition", Advances in Neural Information Processing Systems *
MAO Zhiqiang et al.: "Human action recognition based on spatio-temporal two-stream convolution and LSTM", 《软件》 (Software) *
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111259874A (en) * | 2020-05-06 | 2020-06-09 | 成都派沃智通科技有限公司 | Campus security video monitoring method based on deep learning |
CN111312367A (en) * | 2020-05-11 | 2020-06-19 | 成都派沃智通科技有限公司 | Campus personnel abnormal psychological prediction method based on self-adaptive cloud management platform |
CN112446348A (en) * | 2020-12-08 | 2021-03-05 | 电子科技大学 | Behavior identification method based on characteristic spectrum flow |
CN112446348B (en) * | 2020-12-08 | 2022-05-31 | 电子科技大学 | Behavior identification method based on characteristic spectrum flow |
CN113343786A (en) * | 2021-05-20 | 2021-09-03 | 武汉大学 | Lightweight video action recognition network, method and system based on deep learning |
CN113343786B (en) * | 2021-05-20 | 2022-05-17 | 武汉大学 | Lightweight video action recognition method and system based on deep learning |
CN113255570A (en) * | 2021-06-15 | 2021-08-13 | 成都考拉悠然科技有限公司 | Sequential action detection method for sensing video clip relation |
CN113255570B (en) * | 2021-06-15 | 2021-09-24 | 成都考拉悠然科技有限公司 | Sequential action detection method for sensing video clip relation |
CN114943286A (en) * | 2022-05-20 | 2022-08-26 | 电子科技大学 | Unknown target discrimination method based on fusion of time domain features and space domain features |
CN114943286B (en) * | 2022-05-20 | 2023-04-07 | 电子科技大学 | Unknown target discrimination method based on fusion of time domain features and space domain features |
CN115393660A (en) * | 2022-10-28 | 2022-11-25 | 松立控股集团股份有限公司 | Parking lot fire detection method based on weak supervision collaborative sparse relationship ranking mechanism |
CN115393660B (en) * | 2022-10-28 | 2023-02-24 | 松立控股集团股份有限公司 | Parking lot fire detection method based on weak supervision collaborative sparse relationship ranking mechanism |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111079594B (en) | Video action classification and identification method based on double-flow cooperative network | |
US10152644B2 (en) | Progressive vehicle searching method and device | |
CN113936339B (en) | Fighting identification method and device based on double-channel cross attention mechanism | |
CN108537136B (en) | Pedestrian re-identification method based on attitude normalization image generation | |
CN115033670A (en) | Cross-modal image-text retrieval method with multi-granularity feature fusion | |
CN111506773B (en) | Video duplicate removal method based on unsupervised depth twin network | |
CN109727246A (en) | Comparative learning image quality evaluation method based on twin network | |
CN113221641A (en) | Video pedestrian re-identification method based on generation of confrontation network and attention mechanism | |
WO2022160772A1 (en) | Person re-identification method based on view angle guidance multi-adversarial attention | |
CN110827265B (en) | Image anomaly detection method based on deep learning | |
CN110287879A (en) | A kind of video behavior recognition methods based on attention mechanism | |
WO2023185074A1 (en) | Group behavior recognition method based on complementary spatio-temporal information modeling | |
CN117152459B (en) | Image detection method, device, computer readable medium and electronic equipment | |
CN113963170A (en) | RGBD image saliency detection method based on interactive feature fusion | |
CN115410078A (en) | Low-quality underwater image fish target detection method | |
Sarker et al. | Transformer-based person re-identification: a comprehensive review | |
CN112560668A (en) | Human behavior identification method based on scene prior knowledge | |
CN116311504A (en) | Small sample behavior recognition method, system and equipment | |
CN115439791A (en) | Cross-domain video action recognition method, device, equipment and computer-readable storage medium | |
Sun et al. | Video-based parent-child relationship prediction | |
CN117011539A (en) | Target detection method, training method, device and equipment of target detection model | |
CN115705756A (en) | Motion detection method, motion detection device, computer equipment and storage medium | |
CN113468540A (en) | Security portrait processing method based on network security big data and network security system | |
CN111931788A (en) | Image feature extraction method based on complex value | |
Liu et al. | Text detection based on bidirectional feature fusion and sa attention mechanism |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||