CN109409257A - A video temporal action detection method based on weakly supervised learning - Google Patents

A video temporal action detection method based on weakly supervised learning Download PDF

Info

Publication number
CN109409257A
Authority
CN
China
Prior art keywords
video
classifier
segment
motion detection
detection method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811181395.4A
Other languages
Chinese (zh)
Inventor
李革
钟家兴
李楠楠
孔伟杰
张涛
黄靖佳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University Shenzhen Graduate School
Original Assignee
Peking University Shenzhen Graduate School
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University Shenzhen Graduate School filed Critical Peking University Shenzhen Graduate School
Priority to CN201811181395.4A priority Critical patent/CN109409257A/en
Publication of CN109409257A publication Critical patent/CN109409257A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133Distances to prototypes
    • G06F18/24137Distances to cluster centroïds
    • G06F18/2414Smoothing the distance, e.g. radial basis function networks [RBFN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/254Fusion techniques of classification results, e.g. of results related to same input data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The present invention relates to the field of digital image processing, and specifically to a video temporal action detection method based on weakly supervised learning. The concrete steps of the method are: Step 1: input the video into the classifiers, obtaining different detection confidences; Step 2: fuse the video's scores across the different classifiers; Step 3: refine the result with a conditional random field. The steps of the detection phase are: Step 4: input the video to be detected into the trained classifiers to obtain different detection confidences; Step 5: optimize and fuse the different detection confidences through an FC-CRF. The method combines human prior knowledge with the output of neural networks; experimental results show that the FC-CRF improves detection performance by 20.8% mAP@0.5 on ActivityNet.

Description

A video temporal action detection method based on weakly supervised learning
Technical field
The present invention relates to the field of digital image processing, and specifically to a video temporal action detection method based on weakly supervised learning.
Background art
In the past few years, inspired by the immense success of deep learning in image-based analysis tasks, many deep learning architectures, especially convolutional neural networks (CNNs) and recurrent neural networks (RNNs), have been introduced into video-based action analysis. Karpathy et al. were the first to apply deep learning to action recognition in video, designing various deep learning models to process single frames or sequences of frames. Tran et al. built the C3D model, which performs 3D convolutions over spatio-temporal video volumes to better integrate appearance and motion cues. Wang et al. proposed the Temporal Segment Network (TSN), which inherits the advantages of the two-stream feature extraction structure and copes with longer videos through a sparse sampling scheme. Qiu et al. proposed the Pseudo-3D (P3D) residual network to recycle off-the-shelf 2D networks for 3D CNNs. Beyond action recognition, other works address action detection or proposal generation. Shou et al. performed temporal action localization with a multi-stage CNN detection network. Escorcia et al. proposed the DAPs model, which encodes video sequences with an RNN and retrieves action proposals in a single pass. Lin et al. skipped the proposal generation step with the Single Shot Action Detector (SSAD). Shou et al. designed the Convolutional-De-Convolutional (CDC) network to determine precise temporal boundaries.
In the past few years, behavior analysis and understanding in video has attracted much attention. Many studies have addressed this problem, based either on hand-crafted feature representations or on deep learning architectures. A large body of work handles the action analysis task in a strongly supervised manner, in which the training data consist of manually annotated or trimmed action instances without background, and in recent years some strongly supervised methods have achieved satisfactory results. However, as video datasets grow ever larger, annotating the precise temporal positions of action instances becomes time-consuming and labor-intensive. Moreover, as has been pointed out, unlike object boundaries, the definition of the exact temporal extent of an action is usually subjective and inconsistent across observers, which may introduce additional bias and error.
Adopting a weakly supervised method is a reasonable choice to overcome these limitations of temporal action detection. The prior art builds deep learning models from precise temporal annotations or trimmed videos, whereas the model of the present invention is trained directly on untrimmed videos and requires only video-level class labels.
Summary of the invention
The object of the present invention is a video temporal action detection method based on weakly supervised learning. To solve temporal action detection, the model of the invention predicts both the action category and the temporal positions of action instances in the video. In the weakly supervised learning task, only video-level classification labels are provided as supervisory signals, and during training the video clips, which contain action instances mixed with background, are not trimmed.
In order to achieve the object of the present invention, the following technical solution is adopted:
A video temporal action detection method based on weakly supervised learning, whose training proceeds in the following specific steps:
Step 1: input the video into the classifiers, obtaining different detection confidences;
Step 2: fuse the video's scores across the different classifiers;
Step 3: refine the result with a conditional random field.
The above step 1 proceeds in the following order (a code sketch follows the list):
a) The video is divided into non-overlapping equal-length segments, and features are extracted per segment.
b) According to the features of these segments, the classifiers output corresponding detection confidences for the different action categories.
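As an illustration of step 1, a minimal sketch in Python follows. It assumes per-frame features have already been extracted and uses a plain linear layer as a stand-in for the patent's classifier backbone; all names and values are illustrative, not taken from the patent.

```python
import numpy as np

def split_into_segments(frame_features: np.ndarray, seg_len: int) -> np.ndarray:
    """Step a): split per-frame features (T, D) into non-overlapping
    equal-length segments of seg_len frames and average-pool each one,
    giving one feature vector per segment (N, D). Trailing frames that
    do not fill a whole segment are dropped."""
    T, D = frame_features.shape
    n = T // seg_len
    return frame_features[: n * seg_len].reshape(n, seg_len, D).mean(axis=1)

def segment_confidences(seg_feats: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Step b): per-segment detection confidences phi (N, C) for C action
    categories. A linear layer stands in for the real classifier."""
    return seg_feats @ W + b

# Toy usage: 300 frames of 64-d features, 16-frame segments, 5 categories.
rng = np.random.default_rng(0)
frames = rng.normal(size=(300, 64))
W, b = rng.normal(size=(64, 5)), np.zeros(5)
phi = segment_confidences(split_into_segments(frames, 16), W, b)
print(phi.shape)  # (18, 5): one confidence vector per segment
```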
Step 2 proceeds in the following order (a sketch of one erasing pass follows the list):
c) Given the video segments, the corresponding category scores are obtained through the current classifier (see step 1);
d) According to the scores, part of the video content is erased, yielding new video segments. Concretely: from the category scores of the video segments, the class probability of each segment is computed; then, with a probability given by that score, the corresponding video segments are randomly erased and removed from the training set.
e) All videos of the training set are traversed once, removing partial video segments as above, which yields a new training set.
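A minimal sketch of one erasing pass (steps c to e). The erase score used here, the thresholded maximum class probability, is an assumption: the patent's exact α and δ_τ formulas are reproduced only as images in the source text.

```python
import numpy as np

def erase_pass(phi: np.ndarray, tau: float = 0.9, rng=None) -> np.ndarray:
    """One erasing pass over a video. phi (N, C) are segment scores from
    the current classifier (step c). Returns a boolean mask over segments:
    True = segment kept in the new training set."""
    rng = rng or np.random.default_rng()
    # step c): softmax-normalized class probability of every segment
    p = np.exp(phi - phi.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    # erase score per segment: its highest class probability, zeroed below
    # tau (an assumed reading; the patent's alpha / delta_tau definitions
    # are not reproduced in the extracted text)
    s = np.where(p >= tau, p, 0.0).max(axis=1)
    # step d): drop each segment at random with probability s[i]
    return rng.random(len(s)) >= s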
Step 3 proceeds in the following order:
f) A new classifier is trained on the videos of the new training set;
g) Training convergence is judged: if not converged, steps 2 and 3 are repeated; if converged, a series of trained classifiers is obtained.
During training, the segments in which an action occurs with high confidence are gradually erased. By doing so, a series of classifiers with different preferences is obtained, each attending to different kinds of action segments. A sketch of this outer loop follows.
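The alternation of steps 2 and 3 could be organized as below. Stopping when the training set stops shrinking stands in for the patent's convergence judgement, which the text does not spell out; `fit` is an assumed helper.

```python
def train_classifier_series(train_set, fit, erase_pass, max_rounds=10):
    """train_set is a list of (N_i, D) segment-feature arrays;
    fit(train_set) is an assumed helper returning an object with a
    .scores(video) -> (N_i, C) method. Returns the series of classifiers."""
    classifiers = []
    for _ in range(max_rounds):
        clf = fit(train_set)                 # step f): train on current set
        classifiers.append(clf)
        # steps c)-e): erase high-confidence segments from every video
        new_set = [v[erase_pass(clf.scores(v))] for v in train_set]
        if sum(len(v) for v in new_set) == sum(len(v) for v in train_set):
            break                            # step g): nothing erased, stop
        train_set = new_set
    return classifiers
```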
At the inference stage, the segments constituting action instances are selected according to the repeatedly trained classifiers, and the fused result is optimized by a fully connected conditional random field (FC-CRF). The steps of the detection phase are as follows:
Step 4: input the video to be detected into the trained classifiers to obtain different detection confidences;
Step 5: optimize and fuse the different detection confidences through the FC-CRF.
The above step 4 proceeds in the following order:
I) The video to be detected is divided into non-overlapping equal-length segments, and features are extracted per segment.
II) According to the features of these segments, the trained classifiers output corresponding detection confidences for the different action categories.
The above step 5 proceeds in the following order (a mean-field sketch follows the list):
III) From the category scores of the video segments, the class probability of each segment is computed.
IV) A fully connected conditional random field (FC-CRF) receives the class probabilities as input in the form of a probability graph and, according to the positions of the video segments on the time axis, optimizes and fuses the result, outputting the final detection probabilities.
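A minimal mean-field sketch of this FC-CRF fusion, assuming a Gaussian pairwise kernel over segment positions on the time axis in the spirit of [22]; the kernel width, pairwise weight, and iteration count are illustrative hyper-parameters, not values from the patent.

```python
import numpy as np

def fccrf_fuse(p_avg, positions, sigma=2.0, w=1.0, iters=10):
    """Fuse averaged per-segment class probabilities p_avg (N, C) with a
    fully connected CRF whose pairwise term is a temporal Gaussian kernel,
    refined by mean-field inference (Potts-model compatibility)."""
    unary = -np.log(np.clip(p_avg, 1e-8, 1.0))     # unary potentials
    diff = positions[:, None] - positions[None, :]
    K = w * np.exp(-0.5 * (diff / sigma) ** 2)     # Gaussian kernel on time axis
    np.fill_diagonal(K, 0.0)                       # no message to oneself
    Q = p_avg.copy()
    for _ in range(iters):                         # mean-field updates
        logits = -unary + K @ Q
        logits -= logits.max(axis=1, keepdims=True)
        Q = np.exp(logits)
        Q /= Q.sum(axis=1, keepdims=True)
    return Q                                       # fused detection probabilities

# Toy usage: 6 segments, 2 classes; the isolated low-confidence spike at
# position 3 is pulled toward its temporal neighbours.
p = np.array([[.9, .1], [.8, .2], [.85, .15], [.3, .7], [.9, .1], [.88, .12]])
print(fccrf_fuse(p, np.arange(6.0)).round(2))
```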
Owing to the above technical means, the present invention has the following advantages and positive effects:
1. The invention proposes a weakly supervised model to detect temporal actions in untrimmed videos. The model obtains a series of classifiers by gradually erasing the videos. At the test stage, applying the model of the invention is convenient: the detection results are collected from the classifiers one by one.
2. To the best of the inventors' knowledge, this is the first work to introduce the fully connected conditional random field [22] (fully connected conditional random field, FC-CRF) into the temporal action detection task; it is used to combine human prior knowledge with the output of neural networks. Experimental results show that the FC-CRF improves detection performance by 20.8% mAP@0.5 on ActivityNet.
3. The invention has been extensively evaluated on two challenging untrimmed video datasets, ActivityNet [11] and THUMOS'14 [20], proving that in mean average precision (mAP) the detection performance of the method exceeds all other weakly supervised temporal action detection methods and is even comparable to some strongly supervised methods.
To illustrate the conception and technical solution of the invention more clearly, the invention is further described below through specific embodiments with reference to the accompanying drawings.
Description of the drawings
Fig. 1 is the flow chart of the video temporal action detection method of the present invention;
Fig. 2 is the training flow chart of the invention.
Specific embodiment
Fig. 1 is the flow chart of the video temporal action detection method of the present invention. As shown in Fig. 1, a video temporal action detection method based on weakly supervised learning includes the following steps: 1. the video is input into each classifier (S1), obtaining different detection confidences; 2. the video's scores are fused across the different classifiers (S2); 3. the result is refined by a conditional random field (S3).
Fig. 2 is the training flow chart of the invention. As shown in Fig. 2, the training flow includes the following steps: the video segments pass through the current classifier to obtain the corresponding category scores (11); according to the scores, part of the video content is erased, yielding new video segments (12); a classifier is trained on the new videos (13); training convergence is judged: if not converged (14), steps 12 and 13 are repeated; if converged, the next step (15) is entered, obtaining a series of trained classifiers (15).
The specific steps of the model training process of the method are as follows:
Given a video V containing N clips, together with its K video-level class labels, and a classifier parameterized by θ, the present invention obtains classification scores φ(V;θ) ∈ R^{N×C}, where C is the number of categories. In the t-th erasing step, the remaining segments of the training videos are denoted V_t and the classifier θ_t. For the i-th row φ_{i,:} of φ(V_t;θ_t), the original classification score of the i-th clip, the softmax-normalized class probability p_{i,j}(V_t) of the j-th category is computed:

p_{i,j}(V_t) = exp(φ_{i,j}(V_t;θ_t)) / Σ_{c=1}^{C} exp(φ_{i,c}(V_t;θ_t))
In addition, the present invention defines a weight factor α_{i,j}:
where δ_τ is defined as follows:
where τ is a decay factor, a hyperparameter. The erase probability s_{i,j} is then:

s_{i,j}(V_t) = α_{i,j}(V_t) p_{i,j}(V_t)
After the erase probabilities s_{i,j}(V_t) of round t are obtained, the present invention completes the training process as described above (erase, retrain, and repeat until convergence).
Step 2: use of the model.
From the series of obtained classifiers, p_{i,j} and α_{i,j} are computed and averaged to give p̄ and ᾱ. The present invention establishes a fully connected conditional random field, whose energy function is:
Here the label variables l_i and l_j denote the class labels of the i-th and j-th segments. Thereafter, mean-field approximation is used to optimize the CRF, and the averaged ᾱp̄ values serve as the detection confidence of each segment. By computing the maximum a posteriori probability according to the fully connected conditional random field, the final score of each video segment is obtained.
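The energy function itself is reproduced only as an image in the source. A standard fully connected CRF energy in the style of [22], with unaries from the averaged probabilities p̄ and a Gaussian pairwise term over the segment time positions t_i, would read as follows; this is an assumed form, not necessarily the patent's exact formula:

```latex
E(l) \;=\; \sum_{i}\psi_u(l_i)
      \;+\; w\sum_{i<j}[\,l_i \neq l_j\,]\,
      \exp\!\Bigl(-\frac{(t_i - t_j)^2}{2\sigma^2}\Bigr),
\qquad
\psi_u(l_i) \;=\; -\log \bar{p}_{i,\,l_i}
```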
The method of the invention was evaluated on ActivityNet and THUMOS'14, with the following results.
In the tables below, the compared index is the mean average precision under different temporal-axis intersection-over-union thresholds, i.e. mAP (mean average precision), which measures the precision of the retrieved videos under different temporal IoU thresholds. Higher is better.
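For concreteness, the temporal IoU underlying this metric can be computed as below; this is a hypothetical helper for illustration, not part of the patent.

```python
def temporal_iou(a, b):
    """Temporal intersection-over-union of two intervals a=(start, end)
    and b=(start, end), the overlap measure behind mAP@tIoU."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

# A prediction at 3.0-9.0 s against ground truth 5.0-10.0 s:
print(temporal_iou((3.0, 9.0), (5.0, 10.0)))  # 4/7 ~ 0.571, a hit at tIoU=0.5
```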
Strongly supervised learning means that the annotation of the training samples includes both video category information and temporal information.
Weakly supervised learning means that the annotation of the training samples includes only video category information.
Single-stage, cascade, single-classifier and multi-classifier refer to the different methods proposed in the respective references; the other methods proposed in the remaining references are not enumerated individually.
Table 1: mean average precision on the ActivityNet dataset under different temporal-axis IoU thresholds.
Table 2: mAP@tIoU on THUMOS'14, i.e. the mean average precision on the THUMOS'14 dataset under different temporal-axis IoU thresholds.
Here Strong/Weak Supervision denotes strongly supervised / weakly supervised learning, and each method in the first column of the tables is the method provided by the corresponding reference and its authors.
According to other embodiments of the invention, in the above technical solution:
1. The classifier may be based on any neural network, and may also be combined with traditional hand-crafted features.
2. The fully connected conditional random field may be replaced by a conditional random field of any other kind.
References (each number in square brackets is the document number; e.g., [53] denotes document 53 and [59] denotes document 59):
[1]A.Karpathy,G.Toderici,S.Shetty,T.Leung,R.Sukthankar,and L.Fei- Fei.2014.Large-scale video classification with convolutional neural networks.In CVPR.1725–1732.
[2]P.Bojanowski,R.Lajugie,F.R.Bach,I.Laptev,J.Ponce,C.Schmid,and J.Sivic.2014.Weakly supervised action labeling in videos under ordering constraints.In ECCV.628–643.
[3]P.Bojanowski,R.Lajugie,E.Grave,F.Bach,I.Laptev,J.Ponce,and C.Schmid.2015. Weakly-supervised alignment of video with text.In ICCV.4462– 4470.
[4]A.Pinz C.Feichtenhofer and A.Zisserman.2016.Convolutional two- stream network fusion for video action recognition.In CVPR.1933–1941.
[5]Joao Carreira and Andrew Zisserman.2017.Quo Vadis,Action Recognition A New Model and the Kinetics Dataset.In IEEE Conference on Computer Vision and Pattern Recognition.4724–4733.
[6]Xiyang Dai,Bharat Singh,Guyue Zhang,Larry S.Davis,and Yan Qiu Chen.2017.Temporal Context Network for Activity Localization in Videos.In IEEE International Conference on Computer Vision.5727–5736.
[7]Oneata Dan,Jakob Verbeek,and Cordelia Schmid.2014.The LEAR submission at Thumos 2014. Computer Vision and Pattern Recognition[cs.CV] (2014).
[8]J.Donahue,L.Anne Hendricks,S.Guadarrama,M.Rohrbach,S.Venugopalan, K.Saenko,and T. Darrell.2015.Long-term recurrent convolutional networks for visual recognition and description.In CVPR. 2625–2634.
[9]V.Escorcia,F.C.Heilbron,J.C.Niebles,and B.Ghanem.2016.Daps:Deep action proposals for action understanding.In In European Conference on Computer Vision.768–784.
[10]Victor Escorcia,Fabian Caba Heilbron,Juan Carlos Niebles,and Bernard Ghanem.2016.DAPs:Deep Action Proposals for Action Understanding.In European Conference on Computer Vision.768–784.
[11]B.Ghanem F.Caba Heilbron,V.Escorcia and J.Carlos Niebles.2015.Activitynet:A large-scale video benchmark for human activity understanding.In In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.961–970.
[12]C.Gan,C.Sun,L.Duan,and B.Gong.2016.Webly-supervised video recognition by mutually voting for relevant web images and web video frames.In ECCV.849–866.
[13]Jiyang Gao,Zhenheng Yang,Chen Sun,Kan Chen,and Ram Nevatia.2017.TURN TAP:Temporal Unit Regression Network for Temporal Action Proposals.arXiv:1703.06189(2017).
[14]A.Richard H.Kuehne and J.Gall.2016.Weakly supervised learning of actions from transcripts.CoRR, abs/1610.02237(2016).
[15]Fabian Caba Heilbron,Wayner Barrios,Victor Escorcia,and Bernard Ghanem.2017.SCC:Semantic Context Cascade for Efficient Action Detection.In IEEE Conference on Computer Vision and Pattern Recognition.
[16]Fabian Caba Heilbron,Juan Carlos Niebles,and Bernard Ghanem.2016.Fast Tem-poral Activity Proposals for Efficient Detection of Human Actions in Untrimmed Videos.In Computer Vision and Pattern Recognition.1914–1923.
[17]D.Huang,L.Fei-Fei,and J.C.Niebles.2016.Connectionist temporal modeling for weakly supervised action labeling.In ECCV.137–153.
[18]Dinesh Jayaraman and Kristen Grauman.2016.Slow and Steady Feature Analysis:Higher Order Temporal Coherence in Video.In Computer Vision and Pattern Recognition.3852–3861.
[19]Yangqing Jia,Evan Shelhamer,Jeff Donahue,Sergey Karayev,Jonathan Long,Ross Girshick,Sergio Guadarrama,and Trevor Darrell.2014.Caffe: Convolutional Architecture for Fast Feature Embedding.arXiv preprint arXiv: 1408.5093(2014).
[20]Y.-G.Jiang,J.Liu,A.Roshan Zamir,G.Toderici,I.Laptev,M.Shah,and R.Suk-thankar.2014. THUMOS challenge:Action recognition with a large number of classes.http://crcv.ucf.edu/THUMOS14/(2014).
[21]Svebor Karaman,Lorenzo Seidenari,and Alberto Del Bimbo.[n.d.] .Fast saliency based pooling of Fisher encoded dense trajectories.([n.d.]).
[22]P.Krähenbühl and V.Koltun.2011.Efficient inference in fully connected crfs with gaussian edge potentials.In NIPS.109–117.
[23]Y.Qiao L.Wang and X.Tang.2016.MoFAP:A multi-level representation for action recognition.IJCV 119,3(2016),254–271.
[24]Ivan Laptev and Tony Lindeberg.2003.Space-time interest points.In 9th Interna-tional Conference on Computer Vision.432–439.
[25]I.Laptev,M.Marszalek,C.Schmid,and B.Rozenfeld.2008.Learning realistic human actions from movies.In CVPR.1–8.
[26]Colin Lea,Michael D.Flynn,Rene Vidal,Austin Reiter,and Gregory D.Hager.2017.Temporal Convolutional Networks for Action Segmentation and Detection.In IEEE Conference on Computer Vision and Pattern Recognition.1003– 1012.
[27]Tianwei Lin,Xu Zhao,and Zheng Shou.2017.Single Shot Temporal Action Detection.In ACM on Multimedia Conference.
[28]L.Wang,Y.Xiong,D.Lin,and L.V.Gool.2017.UntrimmedNets for Weakly Super-vised Action Recognition and Detection.arXiv:1703.03329(2017).
[29]L.Wang,Y.Xiong,Z.Wang,Y.Qiao,D.Lin,X.Tang,and L.Van Gool.2016.Temporal segment networks: Towards good practices for deep action recognition.In ECCV.20–36.
[30]Cordelia Schmid Marcin Marszalek,Ivan Laptev.2009.Actions in context.In CVPR.2929–2936.
[31]Hossein Mobahi,Ronan Collobert,and Jason Weston.2009.Deep learning from temporal coherence in video..In International Conference on Machine Learning,ICML 2009,Montreal,Quebec,Canada,June.93.
[32]Li Nannan,Xu Dan,Ying Zhenqiang,Li Zhihao,and Li Ge.2016.Searching Action Propsoals via Spatial Actionness estimation and Temporal Path Inference and Tracking.In Asian Conference on Computer Vision.384–399.
[33]J.Sivic F.R.Bach O.Duchenne,I.Laptev and J.Ponce.2009.Automatic annotation of human actions in video.In ICCV.1491–1498.
[34]Zhaofan Qiu,Ting Yao,and Tao Mei.2017.Learning Spatio-Temporal Represen-tation with Pseudo-3D Residual Networks.In ICCV.
[35]Alexander Richard and Juergen Gall.2016.Temporal Action Detection Using a Statistical Language Model.In Computer Vision and Pattern Recognition.
[36]Suman Saha,Gurkirt Singh,Michael Sapienza,Philip H.S.Torr,and Fabio Cuz-zolin.2016.Deep Learning for Detecting Multiple Space-Time Action Tubes in Videos.arXiv:1608.01529(2016).
[37]Zheng Shou,Jonathan Chan,Alireza Zareian,Kazuyuki Miyazawa,and Shih Fu Chang.2017.CDC: Convolutional-De-Convolutional Networks for Precise Tem-poral Action Localization in Untrimmed Videos. (2017).
[38]Zheng Shou,Dongang Wang,and Shih-Fu Chang.2016.Temporal action lo-calization in untrimmed videos via multi-stage cnns.In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1049–1058.
[39]Gunnar A.Sigurdsson,Olga Russakovsky,and Abhinav Gupta.2017.What Actions are Needed for Understanding Human Actions in Videos? CoRR abs/ 1708.02696(2017).arXiv:1708.02696 http://arxiv.org/abs/1708.02696
[40]Karen Simonyan and Andrew Zisserman.2014.Two-stream convolutional net-works for action recognition in videos.In Advances in neural information process-ing systems.568–576.
[41]Krishna Kumar Singh and Yong Jae Lee.2017.Hide-and-Seek:Forcing a Net-work to be Meticulous for Weakly-supervised Object and Action Localization.arXiv:1704.04232(2017).
[42]S.Satkin and M.Hebert.2010.Modeling the temporal extent of actions.In ECCV.536–548.
[43]Chen Sun,Sanketh Shetty,Rahul Sukthankar,and Ram Nevatia.2015.Temporal Localization of Fine-Grained Actions in Videos by Domain Transfer from Web Images.In ACM International Conference on Multimedia.371–380.
[44]Du Tran,Lubomir Bourdev,Rob Fergus,Lorenzo Torresani,and Manohar Paluri.2015.Learning spatiotemporal features with 3d convolutional networks.In Proceedings of the IEEE International Conference on Computer Vision.4489–4497.
[45]Heng Wang and Cordelia Schmid.2013.Action recognition with improved trajectories.In Proceedings of the IEEE International Conference on Computer Vision.3551–3558.
[46]Limin Wang,Yu Qiao,and Xiaoou Tang.[n.d.].Action Recognition and Detection by Combining Motion and Appearance Features.([n.d.]).
[47]L.Wang,Y.Qiao,and X.Tang.2015.Action recognition with trajectory- pooled deep-convolutional descriptors.In CVPR.4305–4314.
[48]Limin Wang,Yuanjun Xiong,Zhe Wang,Yu Qiao,Dahua Lin,Xiaoou Tang, and Luc Van Gool.2017. Temporal Segment Networks for Action Recognition in Videos.CoRR abs/1705.02953(2017).arXiv:1705.02953 http://arxiv.org/abs/ 1705.02953
[49]Xiaolong Wang,Ross Girshick,Abhinav Gupta,and Kaiming He.2017.Non-local Neural Networks. arXiv preprint arXiv:1711.07971(2017).
[50]Yunchao Wei,Jiashi Feng,Xiaodan Liang,Ming-Ming Cheng,Yao Zhao, and Shuicheng Yan.2017. Object Region Mining with Adversarial Erasing:A Simple Classification to Semantic Segmentation Approach. arXiv:1703.08448 (2017).
[51]Yunchao Wei,Wei Xia,Junshi Huang,Bingbing Ni,Jian Dong,Yao Zhao, and Shuicheng Yan.2014. CNN:Single-label to Multi-label.Computer Science (2014).
[52]L Wiskott and T Sejnowski.2002.Slow feature analysis:unsupervised learning of invariances.Neural Computation 14,4(2002),715.
[53]Yuanjun Xiong,Yue Zhao,Limin Wang,Dahua Lin,and Xiaoou Tang.2017.A Pursuit of Temporal Accuracy in General Activity Detection.arXiv: 1703.02716(2017).
[54]Huijuan Xu,Abir Das,and Kate Saenko.2017.R-C3D:Region Convolutional 3D Network for Temporal Activity Detection.In IEEE International Conference on Computer Vision.5794–5803.
[55]Serena Yeung,Olga Russakovsky,Greg Mori,and Li Fei-Fei.2016.End- to-end learning of action detection from frame glimpses in videos.In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.2678–2687.
[56]Jun Yuan,Bingbing Ni,Xiaokang Yang,and Ashraf A.Kassim.2016.Temporal Action Localization with Pyramid of Score Distribution Features.In Computer Vision and Pattern Recognition.3093–3102.
[57]Zehuan Yuan,Jonathan C.Stroud,Tong Lu,and Jia Deng.2017.Temporal Action Localization by Structured Maximal Sums.In IEEE Conference on Computer Vision and Pattern Recognition.3215–3223.
[58]Yimeng Zhang and Tsuhan Chen.2012.Efficient inference for fully- connected CRFs with stationarity. 2012 IEEE Conference on Computer Vision and Pattern Recognition(CVPR)00(2012),582–589.
[59]Yue Zhao,Yuanjun Xiong,Limin Wang,Zhirong Wu,Xiaoou Tang,and Dahua Lin.2017.Temporal Action Detection with Structured Segment Networks.In IEEE International Conference on Computer Vision. 2933–2942.
[60]Yi Zhu and Shawn Newsam.2016.Efficient Action Detection in Untrimmed Videos via Multi-Task Learning.arXiv:1612.07403(2016) 。

Claims (9)

1. A video temporal action detection method based on weakly supervised learning, the specific steps of which are as follows:
Step 1: input the video into the classifiers, obtaining different detection confidences;
Step 2: fuse the video's scores across the different classifiers;
Step 3: refine the result with a conditional random field.
2. The video temporal action detection method based on weakly supervised learning according to claim 1, characterized in that said step 1 proceeds in the following order:
a) The video is divided into non-overlapping equal-length segments, and features are extracted per segment.
b) According to the features of these segments, the classifiers output corresponding detection confidences for the different action categories.
3. The video temporal action detection method based on weakly supervised learning according to claim 1, characterized in that said step 2 proceeds in the following order:
c) Given the video segments, the corresponding category scores are obtained through the current classifier (see step 1);
d) According to the scores, part of the video content is erased, yielding new video segments. Concretely: from the category scores of the video segments, the class probability of each segment is computed; then, with a probability given by that score, the corresponding video segments are randomly erased and removed from the training set.
e) All videos of the training set are traversed once, removing partial video segments as above, which yields a new training set.
4. The video temporal action detection method based on weakly supervised learning according to claim 1, characterized in that said step 3 proceeds in the following order:
f) A new classifier is trained on the videos of the new training set;
g) Training convergence is judged: if not converged, steps 2 and 3 are repeated; if converged, a series of trained classifiers is obtained.
5. The video temporal action detection method based on weakly supervised learning according to any one of claims 1-4, further comprising a detection phase after step 3, the concrete steps of which are:
Step 4: input the video to be detected into the trained classifiers to obtain different detection confidences;
Step 5: optimize and fuse the different detection confidences through the FC-CRF.
6. The video temporal action detection method based on weakly supervised learning according to claim 5, characterized in that said step 4 proceeds in the following order:
I) The video to be detected is divided into non-overlapping equal-length segments, and features are extracted per segment.
II) According to the features of these segments, the trained classifiers output corresponding detection confidences for the different action categories.
7. The video temporal action detection method based on weakly supervised learning according to claim 5, characterized in that said step 5 proceeds in the following order:
III) From the category scores of the video segments, the class probability of each segment is computed;
IV) A fully connected conditional random field (FC-CRF) receives the class probabilities as input in the form of a probability graph and, according to the positions of the video segments on the time axis, optimizes and fuses the result, outputting the final detection probabilities.
8. The video temporal action detection method based on weakly supervised learning according to any one of claims 1 to 4, characterized in that the model training process of the trained classifiers is as follows:
Given a video V containing N clips, together with its K video-level class labels, and a classifier parameterized by θ, we obtain classification scores φ(V;θ) ∈ R^{N×C}, where C is the number of categories. In the t-th erasing step, we denote the remaining segments of the training videos as V_t and the classifier as θ_t. For the i-th row φ_{i,:} of φ(V_t;θ_t), the original classification score of the i-th clip, we compute the softmax-normalized class probability p_{i,j}(V_t) of the j-th category:

p_{i,j}(V_t) = exp(φ_{i,j}(V_t;θ_t)) / Σ_{c=1}^{C} exp(φ_{i,c}(V_t;θ_t))
In addition, we define a weight factor α_{i,j}:
where δ_τ is defined as follows:
where τ is a decay factor, a hyperparameter. The erase probability s_{i,j} is then:

s_{i,j}(V_t) = α_{i,j}(V_t) p_{i,j}(V_t)
After the erase probabilities s_{i,j}(V_t) of round t are obtained, we complete the training process as described above.
9. The video temporal action detection method based on weakly supervised learning according to any one of claims 1 to 4, characterized in that the trained classifiers are used as follows:
From the series of obtained classifiers, p_{i,j} and α_{i,j} are computed and averaged to give p̄ and ᾱ. We establish a fully connected conditional random field, whose energy function is:
Here the label variables l_i and l_j denote the class labels of the i-th and j-th segments; thereafter, mean-field approximation is used to optimize the CRF, and the averaged ᾱp̄ values serve as the detection confidence of each segment; by computing the maximum a posteriori probability according to the fully connected conditional random field, the final score of each video segment is obtained.
CN201811181395.4A 2018-10-11 2018-10-11 A video temporal action detection method based on weakly supervised learning Pending CN109409257A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811181395.4A CN109409257A (en) 2018-10-11 2018-10-11 A video temporal action detection method based on weakly supervised learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811181395.4A CN109409257A (en) 2018-10-11 2018-10-11 A video temporal action detection method based on weakly supervised learning

Publications (1)

Publication Number Publication Date
CN109409257A true CN109409257A (en) 2019-03-01

Family

ID=65467544

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811181395.4A Pending CN109409257A (en) A video temporal action detection method based on weakly supervised learning

Country Status (1)

Country Link
CN (1) CN109409257A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110189800A (en) * 2019-05-06 2019-08-30 浙江大学 Furnace oxygen content soft-measuring modeling method based on more granularities cascade Recognition with Recurrent Neural Network
CN110490055A (en) * 2019-07-08 2019-11-22 中国科学院信息工程研究所 A kind of Weakly supervised Activity recognition localization method and device recoded based on three
CN111079646A (en) * 2019-12-16 2020-04-28 中山大学 Method and system for positioning weak surveillance video time sequence action based on deep learning
CN111104855A (en) * 2019-11-11 2020-05-05 杭州电子科技大学 Workflow identification method based on time sequence behavior detection
CN113516032A (en) * 2021-04-29 2021-10-19 中国科学院西安光学精密机械研究所 Weak supervision monitoring video abnormal behavior detection method based on time domain attention

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JIA-XING ZHONG et al.: "Step-by-step Erasion, One-by-one Collection: A Weakly Supervised Temporal Action Detector", arXiv *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110189800A (en) * 2019-05-06 2019-08-30 浙江大学 Furnace oxygen content soft-measuring modeling method based on more granularities cascade Recognition with Recurrent Neural Network
CN110189800B (en) * 2019-05-06 2021-03-30 浙江大学 Furnace oxygen content soft measurement modeling method based on multi-granularity cascade cyclic neural network
CN110490055A (en) * 2019-07-08 2019-11-22 中国科学院信息工程研究所 A kind of Weakly supervised Activity recognition localization method and device recoded based on three
CN111104855A (en) * 2019-11-11 2020-05-05 杭州电子科技大学 Workflow identification method based on time sequence behavior detection
CN111104855B (en) * 2019-11-11 2023-09-12 杭州电子科技大学 Workflow identification method based on time sequence behavior detection
CN111079646A (en) * 2019-12-16 2020-04-28 中山大学 Method and system for positioning weak surveillance video time sequence action based on deep learning
CN111079646B (en) * 2019-12-16 2023-06-06 中山大学 Weak supervision video time sequence action positioning method and system based on deep learning
CN113516032A (en) * 2021-04-29 2021-10-19 中国科学院西安光学精密机械研究所 Weak supervision monitoring video abnormal behavior detection method based on time domain attention
CN113516032B (en) * 2021-04-29 2023-04-18 中国科学院西安光学精密机械研究所 Weak supervision monitoring video abnormal behavior detection method based on time domain attention

Similar Documents

Publication Publication Date Title
Zhong et al. Step-by-step erasion, one-by-one collection: a weakly supervised temporal action detector
Shi et al. Weakly-supervised action localization by generative attention modeling
Liu et al. Completeness modeling and context separation for weakly supervised temporal action localization
Kukleva et al. Unsupervised learning of action classes with continuous temporal embedding
Huang et al. Foreground-action consistency network for weakly supervised temporal action localization
CN109409257A (en) A video temporal action detection method based on weakly supervised learning
Li et al. Temporal action segmentation from timestamp supervision
Xiong et al. A pursuit of temporal accuracy in general activity detection
Zhao et al. Temporal action detection with structured segment networks
Richard et al. Neuralnetwork-viterbi: A framework for weakly supervised video learning
Richard et al. Weakly supervised action learning with rnn based fine-to-coarse modeling
Liu et al. Multi-shot temporal event localization: a benchmark
Wang et al. Untrimmednets for weakly supervised action recognition and detection
Shou et al. Online detection of action start in untrimmed, streaming videos
Fayyaz et al. Sct: Set constrained temporal transformer for set supervised action segmentation
CN108537119B (en) Small sample video identification method
Vahdani et al. Deep learning-based action detection in untrimmed videos: A survey
CN110969166A (en) Small target identification method and system in inspection scene
Ji et al. Learning temporal action proposals with fewer labels
Shou et al. Online action detection in untrimmed, streaming videos-modeling and evaluation
CN112560827B (en) Model training method, model training device, model prediction method, electronic device, and medium
Javed et al. Replay and key-events detection for sports video summarization using confined elliptical local ternary patterns and extreme learning machine
Ge et al. Deep snippet selective network for weakly supervised temporal action localization
CN114782997A (en) Pedestrian re-identification method and system based on multi-loss attention adaptive network
Chen et al. Refinement of Boundary Regression Using Uncertainty in Temporal Action Localization.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20190301)