CN108830212B - Video behavior time axis detection method - Google Patents

Video behavior time axis detection method

Info

Publication number
CN108830212B
CN108830212B (Application CN201810597905.XA)
Authority
CN
China
Prior art keywords
behavior
video
time axis
detection
timeline
Prior art date
Legal status
Active
Application number
CN201810597905.XA
Other languages
Chinese (zh)
Other versions
CN108830212A (en)
Inventor
李革
张涛
李楠楠
林凯
孔伟杰
李宏
Current Assignee
Peking University Shenzhen Graduate School
Original Assignee
Peking University Shenzhen Graduate School
Priority date
Filing date
Publication date
Application filed by Peking University Shenzhen Graduate School filed Critical Peking University Shenzhen Graduate School
Priority to CN201810597905.XA priority Critical patent/CN108830212B/en
Publication of CN108830212A publication Critical patent/CN108830212A/en
Application granted granted Critical
Publication of CN108830212B publication Critical patent/CN108830212B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 - Movements or behaviour, e.g. gesture recognition
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00 - Television systems
    • H04N7/18 - Closed-circuit television [CCTV] systems, i.e. systems in which the video signal is not broadcast

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video behavior time axis detection method based on deep learning and temporal structure modeling, which combines coarse-grained and fine-grained detection to detect the video behavior time axis. On the basis of the existing SSN model, a two-stream model is used to extract the spatio-temporal features of the video; the temporal structure of a behavior is modeled by dividing a single behavior into three stages; a new feature pyramid that can effectively extract the temporal boundary information of video behaviors is then proposed; finally, coarse-grained and fine-grained detection are combined so that the detection result is more accurate. The method has high detection precision, exceeding all existing methods, and has wide applicability: it can be used to detect video segments of interest to humans in intelligent monitoring or human-computer interaction systems, facilitating subsequent analysis and processing, and therefore has important application value.

Description

Video behavior time axis detection method
Technical Field
The invention relates to the technical field of video analysis, in particular to a video behavior time axis detection method.
Background
Videos containing human behavior can be divided into two categories: one is manually cropped video that contains only the human behavior and no irrelevant background footage; the other is uncut video straight from shooting, which contains not only human behavior but also irrelevant background segments such as titles and spectators. Video behavior time axis detection means locating, in a section of video that has not been manually cut, the start time and end time at which a human behavior occurs, and identifying the category of that behavior. Existing video behavior time axis detection methods mainly follow a two-step strategy: first, a large number of video behavior time axis candidate frames, i.e. video clips likely to contain human motion, are extracted; then the position and length of the extracted candidate frames are fine-tuned and the located behaviors are classified. Generally, although candidate frame extraction can roughly locate human behaviors in a video, its positioning precision is low and its overlap rate with the action clips is low, so refining and accurately classifying the video behavior time axis candidate frames is very important. In practical application scenarios, accurately locating the start time and end time of a human behavior is essential. On the basis of existing video behavior time axis candidate frame extraction methods, the invention mainly addresses the task of video behavior time axis detection and performs accurate video behavior time axis positioning based on deep learning.
At present, according to different video behavior time axis detection modes, the existing video behavior time axis detection models can be divided into two types:
the first type is a one-stage process. The term "one-stage method" refers to a method for directly finding and locating human behavior from an uncut video. The efficiency of this type of process is relatively high. However, due to the huge amount of information contained in the video, it is difficult to achieve a good positioning result by this method of directly positioning in one step.
The second method is a two-stage method, namely, extracting a video behavior time axis candidate frame, and then adjusting and classifying the video behavior time axis candidate frame. Most of the existing video behavior time axis detection methods are two-stage methods. The method extracts a large number of video segments which are likely to contain human behavior clips from the video through a plurality of rapid video behavior time axis candidate frame extraction algorithms. Although the video behavior timeline candidate frame can be used as a rough positioning result, the positioning accuracy is poor, and a large number of useless background video segments are contained in the video behavior timeline candidate frame. Therefore, some algorithms focus on fine-tuning the position of the video behavior timeline candidate frame in the second stage, so as to correct the position of the video behavior timeline candidate frame and improve the positioning accuracy. Meanwhile, in the second stage, the candidate frames of the video behavior time axis are screened and classified again, useless background clips are removed, and therefore a good video behavior time axis detection result is achieved.
Better existing video behavior time axis detection methods include R-C3D and SSN (Zhao, Yue, et al. "Temporal action detection with structured segment networks." The IEEE International Conference on Computer Vision (ICCV). Vol. 8. 2017.). SSN proposes that the temporal structure of the behavior in the video should be modeled in order to achieve accurate positioning. SSN divides a time axis candidate frame into three stages, start, middle and end, and builds a structured feature pyramid on each stage to extract temporal structure information. On the extracted pyramid features, SSN builds two classifiers, used for behavior classification and for candidate-frame integrity judgment respectively, and achieves good video behavior time axis detection results. However, the SSN model itself has two shortcomings: first, SSN attempts to accurately locate the temporal boundaries of behaviors but ignores the information of the temporal boundary portions; second, SSN judges the integrity of a candidate frame as a whole and directly discards candidate frames containing incomplete behaviors, so the candidate frames are not fully utilized and efficiency can still be improved.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a novel video behavior time axis detection method, TBN, which is based on deep learning, fuses the low-level and high-level features of a video, and combines coarse-grained and fine-grained detection to realize time axis detection of human behaviors in video. The method can be applied in many fields, such as intelligent surveillance video and human-computer interaction, to capture behaviors of interest to humans, filtering out a large amount of useless video data and facilitating subsequent analysis and operation of the system.
The principle of the invention is as follows: the method disclosed by the invention is based on deep learning, integrates the low-level characteristics and the high-level characteristics of the video, integrates coarse-grained detection and fine-grained detection, and realizes the detection of the behavior time axis in the video. The method comprises the steps of firstly modeling a time structure of human behaviors in a video based on a current excellent video behavior time axis detection model SSN, and dividing a video behavior time axis candidate frame into a starting stage, a middle stage and an ending stage. Then, a brand-new structured time axis boundary attention feature pyramid is used for fusing high-level features and low-level features of the video and extracting time boundary information of behaviors; and finally, using one behavior classifier to classify behaviors, using two integrity classifiers with different granularities to detect coarse granularity and fine granularity, and fusing the two detection results to achieve the effect of accurately positioning the video behavior time axis.
The method provides a new model, TBN, based on SSN to detect the video behavior time axis: it fuses features of the video at all levels using a time axis boundary attention feature pyramid, extracts time axis boundary information, and fuses detection results of different granularities. The method is tested and verified on the THUMOS'14 data set; the obtained experimental results exceed all existing methods, achieving accurate positioning of the video behavior time axis.
The technical scheme provided by the invention is as follows:
a video behavior time axis detection method combining coarse grain detection and fine grain detection. Extracting video behavior time axis candidate frames by using a TAG (TAG) of a video to be detected, extracting the video behavior time axis candidate frames by using an efficient video behavior time axis candidate frame extraction method, and extending 1/2 two sections of each video behavior time axis candidate frame to obtain an extended video behavior time axis candidate frame. The method carries out staged modeling on the candidate frames of the video behavior time axis after being prolonged, carries out time structure modeling on each candidate frame of the video behavior time axis, and divides a single candidate frame into three stages; and a classical double-current model (RGB + optical flow) is used for carrying out feature extraction on the video of each stage, a video classification depth model is used for extracting the space-time features of the video, and the features are stored in a memory. In each stage, a time axis boundary attention feature pyramid capable of effectively extracting video behavior time axis edge information is constructed, a multilayer feature pyramid is established to effectively extract time boundary information of video behaviors, down sampling is carried out on features of a behavior time boundary information part which does not contain the behavior time boundary information part, the down sampling is not carried out on the behavior time edge part, and the features of the two parts are connected, so that the information of the behavior time edge part accounts for a larger proportion in the final global features. Connecting the features of each stage to obtain a global feature vector, constructing 3 classifiers on the global feature vector, respectively classifying behaviors, evaluating the integrity of the behaviors in the candidate frame, evaluating the staged integrity of the behaviors in the candidate frame, constructing three classifiers A, B, C and a regressor R on the pyramid feature, and respectively classifying the behaviors, evaluating the integrity of the behaviors in the candidate frame and evaluating the staged integrity of the behaviors in the candidate frame by the three classifiers; and (4) utilizing the candidate frame grading integrity evaluation result of the classifier C, and combining complete behavior fragments from incomplete behavior fragments by using a combination method provided in the TAG. Through a large amount of training, a model capable of judging behavior integrity from two granularities is trained, and behavior time axis detection of two different granularities is carried out by combining a behavior classifier; the detection results of the two granularities are fused, and the purpose of accurately positioning the time axis of human behaviors in the video is achieved.
Preferably, the video classification depth model is a two-stream model based on the BN-Inception network; the model inputs the RGB images and the optical flow features of a video into two BN-Inception networks with the same structure respectively, and trains them to learn the color features and motion features of the video; the length of the extracted feature vector is 1024.
Preferably, for each timeline candidate frame, each of its two ends is extended by 1/2 of the candidate frame length so as to contain the contextual semantic information of the behavior; each extended video behavior timeline candidate frame is divided into three stages, namely a start stage, a middle stage and an end stage, corresponding to the start, progress and end of a behavior.
Preferably, for the 'initial stage', 'intermediate stage' and 'end stage' of each video behavior time axis candidate frame, their two-stream features are extracted, a K-layer feature pyramid is established on these features, each layer of the pyramid divides the corresponding stage into T_k parts, and the features sampled in each part are merged together through mean pooling.
Preferably, on the features of the 'middle stage', a behavior classifier is constructed, and the behaviors in the video behavior time axis candidate frame are classified into C + 1 classes to obtain the category of the behavior; here C represents the C behavior classes and 1 represents the background class.
Preferably, on the global features, two integrity classifiers B and C are constructed; the classifier B performs integrity evaluation on the entire video behavior time axis candidate frame and judges whether it contains a complete behavior; the classifier C uniformly divides a video behavior time axis candidate frame into N_S parts, and performs integrity evaluation on each part to judge whether it is an integral part of a behavior.
The specific implementation mode comprises the following steps:
1) Input: video frames (RGB + optical flow) sampled from within the video behavior timeline candidate frame;
2) Modeling the temporal structure of the video behavior timeline candidate frame: each end of the candidate frame is extended by 1/2 of its length, and the extended candidate frame is divided into three stages: start, middle and end.
3) Feature extraction: feature extraction is performed on each stage of the candidate frame with a two-stream network based on the BN-Inception structure, split into an RGB branch and an optical-flow branch; the features of the three stages are denoted H_s, H_c and H_e, respectively.
4) Establishing the time axis boundary attention feature pyramid: a multi-layer feature pyramid is established on the features of each candidate stage to extract the temporal edge information of the behavior.
5) Constructing global features: the features of each stage of the candidate frame are connected to form a global feature.
6) Constructing classifiers: on the pyramid features, three classifiers and one regressor are established. The three classifiers are used for behavior classification, integrity evaluation of the behavior within the candidate frame, and staged integrity evaluation of the candidate frame, respectively, and the regressor is used for fine-tuning the position of the candidate frame. The three classifiers are built on the features H_c, (H_s, H_c, H_e) and (H_s, H_c, H_e), respectively.
7) Generating fine-grained candidate frame detection results: complete video behavior segments are combined from incomplete video behavior segments.
8) Fusing classifier results: the outputs of the three classifiers are fused by multiplying the result of each classifier, yielding a confidence score for every candidate frame (see the sketch after this list).
9) Regression: regression of position and length is performed on all candidate frames so that the positioning result is more accurate.
10) Non-maximum suppression: non-maximum suppression is performed on all positioning results, and results with high confidence are retained as the final positioning result.
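A minimal sketch, under stated assumptions, of the score fusion in step 8) and the non-maximum suppression in step 10): the outputs of the three classifiers are multiplied into a single confidence score per candidate frame, and temporal non-maximum suppression keeps only high-confidence, non-overlapping detections. The 0.5 IoU threshold and the example scores are illustrative choices, not values fixed by the patent.

```python
def fuse_scores(action_score, completeness_score, stagewise_score):
    # Step 8): per-candidate confidence = product of the three classifier outputs.
    return action_score * completeness_score * stagewise_score

def temporal_nms(detections, iou_thr=0.5):
    # detections: list of (start, end, score); step 10) non-maximum suppression.
    def t_iou(a, b):
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = (a[1] - a[0]) + (b[1] - b[0]) - inter
        return inter / union if union > 0 else 0.0
    kept = []
    for det in sorted(detections, key=lambda d: d[2], reverse=True):
        if all(t_iou(det, k) < iou_thr for k in kept):
            kept.append(det)
    return kept

dets = [(10.0, 20.0, fuse_scores(0.9, 0.8, 0.7)),
        (11.0, 21.0, fuse_scores(0.6, 0.5, 0.5)),
        (30.0, 40.0, fuse_scores(0.8, 0.9, 0.6))]
print(temporal_nms(dets))   # the second, heavily overlapping detection is suppressed
```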
Compared with the prior art, the invention has the beneficial effects that:
the invention provides a novel method for detecting a human behavior time axis in a video, which provides a better video behavior time axis detection model TBN on the basis of the existing SSN (simple sequence number network); secondly, in order to solve the defect that the existing video behavior time axis detection method is not accurate enough, the invention provides a time axis boundary attention feature pyramid which can effectively extract time edge information of video behaviors and helps a model to judge the starting time and the ending time of the video behaviors more accurately; and finally, the coarse-grained and fine-grained positioning results are combined, so that a more accurate detection result is obtained. The method can be applied to capture the interesting video behaviors in an intelligent video monitoring system or a man-machine interaction system, so that useless video data are filtered out, and the subsequent analysis and operation are facilitated.
Compared with the prior art, the method disclosed by the invention is tested and evaluated on the test portion of the THUMOS'14 data set, currently the most widely used video behavior time axis detection data set, and its detection effect is superior to all other methods publicly disclosed so far, illustrating the technical superiority of the disclosed method.
Drawings
Fig. 1 is an overall framework diagram of the video behavior timeline detection method provided by the present invention. 1-the time axis boundary attention feature pyramid proposed by the invention, capable of effectively extracting video behavior time boundary information.
Fig. 2 is a structural comparison between the feature pyramid proposed in the present invention and the feature pyramid in SSN. 2-structured temporal pyramid in SSN; 3-time axis boundary attention feature pyramid proposed by the present invention; 4-dual stream feature per video slice (snippet); 5-global features concatenated by stages; 6-1 × 1 convolution downsampling.
Fig. 3 is a schematic diagram of a method used in the invention and candidate box combination. 7-integrity assessment score of candidate box; 8-a timeline candidate box divided into five sections, where 1 represents that the candidate box clip contains a full behavior clip, 0 represents that the candidate box clip does not contain a full behavior clip, and 1 represents that the entire candidate box contains a full behavior. 9-combined detection results.
FIG. 4 is a method flow diagram of the video behavior timeline detection method provided by the present invention.
Detailed Description
The invention will be further described by way of examples, without in any way limiting the scope of the invention, with reference to the accompanying drawings.
Video behavior timeline detection is a sub-problem of video behavior detection that focuses primarily on locating the temporal start and end points at which behaviors occur within a video. It provides key technical support in fields such as intelligent monitoring systems, human-computer interaction and video search. Because video contains a large amount of information, most of which is unwanted noise, it is important to automatically locate the timeline of video segments of human interest before subsequent processing. In practical applications, relatively high requirements are placed on the accuracy of video behavior localization. In order to accurately locate the start time and end time of a video behavior, extracting the temporal boundary information of the behavior is essential. Addressing this difficulty, the method adopts the SSN basic network structure to structurally model the behavior in the time domain, and proposes a novel time axis boundary attention feature pyramid structure capable of effectively extracting the temporal boundary information of video behaviors; on this feature pyramid, three classifiers and a regressor are trained so that coarse-grained and fine-grained detection can be combined and the length and position of the video behavior time axis candidate frames can be effectively adjusted, achieving accurate positioning. Finally, experiments on the THUMOS'14 data set show that the video behavior detection effect of the method is superior to all existing methods, with an average detection precision reaching 33.75.
Fig. 1 is an overall frame diagram of a video behavior timeline detection method provided by the present invention, fig. 3 is a schematic diagram of a method for combining incomplete behavior video clips in the present invention, and fig. 4 is a flowchart of a method of a video behavior timeline detection method provided by the present invention, which specifically includes the following steps:
1) Video data processing S1. The method extracts features with a two-stream algorithm, so the video first needs to be decoded into RGB frames and its optical flow features need to be extracted.
2) Candidate box extraction S2. Existing mainstream video behavior time axis detection methods follow a two-step strategy: first extracting video behavior time axis candidate frames, and then fine-tuning, re-screening and classifying them. The invention uses the TAG algorithm described in the literature (Xiong, Yuanjun, et al. "A Pursuit of Temporal Accuracy in General Activity Detection." arXiv preprint arXiv:1703.02716 (2017).) to extract the behavior time axis candidate frames of the video. This method trains a number of category-related behavior/background binary classifiers to classify the video into behavior and background, and takes the scores produced by the classifiers as the probability that a video segment contains human behavior, thereby generating a probability sequence for the whole long video. Candidate boxes of different lengths are then generated on the probability sequence using a watershed algorithm. The candidate frames extracted by the TAG algorithm are of high quality and have a high recall rate.
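A minimal, simplified sketch of this grouping step, not the exact TAG watershed algorithm: given a per-snippet actionness probability sequence, consecutive snippets whose probability exceeds a flooding threshold are merged into candidate frames. The thresholds and minimum length below are illustrative assumptions.

```python
from typing import List, Tuple

def generate_candidates(probs: List[float],
                        thresholds=(0.3, 0.5, 0.7),
                        min_len: int = 2) -> List[Tuple[int, int]]:
    candidates = set()
    for tau in thresholds:                      # multiple "water levels"
        start = None
        for t, p in enumerate(probs + [0.0]):   # sentinel closes the last run
            if p >= tau and start is None:
                start = t
            elif p < tau and start is not None:
                if t - start >= min_len:
                    candidates.add((start, t - 1))   # inclusive snippet indices
                start = None
    return sorted(candidates)

# Example: a probability sequence with two high-actionness regions.
print(generate_candidates([0.1, 0.8, 0.9, 0.7, 0.1, 0.2, 0.6, 0.9, 0.4, 0.1]))
# [(1, 3), (6, 7), (6, 8)]
```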
3) Behavior temporal structure modeling S3. Since a candidate box is likely to contain a behavior video segment, the invention divides the candidate box into stages to model the temporal structure of the behavior. Given a candidate frame p = [s, e], where s and e are the start frame and end frame of p respectively, the invention extends both ends of p by 1/2 of its length to obtain the extended candidate frame p'. p' is divided into three stages, starting, course and ending, corresponding respectively to the start, progress and end of an action in the video.
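A minimal sketch of the proposal extension and three-stage split just described. Reading the starting and ending stages as the extended portions before and after the original interval follows the SSN-style design referenced above; clamping to the video length is an added assumption for robustness.

```python
def extend_and_split(s: int, e: int, num_frames: int):
    length = e - s + 1
    ext = length // 2                          # extend each end by 1/2 of the length
    s_ext = max(0, s - ext)
    e_ext = min(num_frames - 1, e + ext)
    # Three stages of the extended proposal p': starting / course / ending.
    return {"starting": (s_ext, s - 1),
            "course":   (s, e),
            "ending":   (e + 1, e_ext)}

print(extend_and_split(s=100, e=199, num_frames=1000))
# {'starting': (50, 99), 'course': (100, 199), 'ending': (200, 249)}
```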
4) Sampling and feature extraction S4. Since the amount of data in a video is very large, processing an entire video requires large computational resources, and not all of the data is needed to achieve accurate positioning. Therefore, the invention adopts the sparse sampling method proposed in the TSN, and then performs feature extraction with an image classification or video classification depth model pre-trained on the Kinetics data set. From each extended candidate box p', 9 video slices (snippets) are uniformly extracted using the sampling strategy proposed in the TSN. There are currently two main approaches to extracting video features: one is the two-stream model, which extracts color information and motion information from the video separately; the other is the 3D CNN model, which extracts joint spatio-temporal information from the video with 3D convolutions. Existing results on video behavior recognition and detection show that methods based on the two-stream model generally outperform those based on 3D CNN models, so a two-stream model based on the BN-Inception network is adopted as the feature extraction tool, and the output feature dimension is 1024.
5) A time axis boundary attention feature pyramid is established S5. In order to accurately locate the start time and end time of human behavior within a video, the model should focus more on the boundary between the behavior and the background, i.e. the 'behavior time boundary' proposed by the present invention. As shown in fig. 2, the invention builds a time axis boundary attention feature pyramid of one or more layers on each of the three aforementioned stages of every candidate box. Taking the 'middle stage' as an example, the invention establishes a feature pyramid with K layers, and each layer uniformly divides the 'middle stage' into T_k parts. The j-th part of the k-th layer covers the time interval [s_kj, e_kj], and its pyramid feature is expressed as:
u_kj = (1 / (e_kj - s_kj + 1)) * sum_{t = s_kj}^{e_kj} v_t        (1)
where v_t is the two-stream feature of one video slice and u_kj is the feature of the j-th part of the k-th pyramid layer. Equation (1) states that each part of the pyramid feature is obtained simply by mean pooling the features of the video slices it contains. By connecting the partial features of every layer of the feature pyramid, the invention obtains the global feature expression of the 'middle stage':
H_c = concat_{k, j} ( Θ(u_kj) if part (k, j) does not contain behavior time boundary information, u_kj otherwise )        (2)
where Θ represents a 1 x 1 convolution operation. Formula (2) shows that, in this embodiment, the pyramid features of the parts that do not contain time boundary information are further down-sampled, shortening their feature vectors, so that their weight in the overall feature is reduced and the network focuses more on the information of the temporal boundary portion of the behavior. The structured pyramid in SSN can efficiently model time, but does not focus the network on the temporal boundaries of behavior in the way the feature pyramid proposed in the present invention does. Fig. 2 gives a more intuitive representation of the proposed feature pyramid. For the 'start stage', 'middle stage' and 'end stage', 1-layer, 2-layer and 1-layer feature pyramids are used, respectively. Through the feature pyramid, this embodiment obtains the features H_s, H_c and H_e of the three stages, respectively.
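A minimal PyTorch-style sketch of the time axis boundary attention feature pyramid described above, under stated assumptions: snippet features are 1024-d two-stream vectors, each pyramid layer mean-pools T_k uniform parts (Eq. (1)), and parts that do not touch a behavior time boundary are down-sampled with the 1 x 1 convolution Θ before all parts are concatenated (Eq. (2)). The layer sizes, the reduced dimension, and the choice of the first and last parts of each multi-part layer as the boundary parts are illustrative assumptions, not values fixed by the patent.

```python
import torch
import torch.nn as nn

class BoundaryAttentionPyramid(nn.Module):
    def __init__(self, feat_dim=1024, parts_per_layer=(1, 4), reduced_dim=256):
        super().__init__()
        self.parts_per_layer = parts_per_layer
        # 1x1 convolution used to shrink non-boundary parts (Theta in Eq. (2)).
        self.theta = nn.Conv1d(feat_dim, reduced_dim, kernel_size=1)

    def forward(self, snippet_feats):            # (num_snippets, feat_dim)
        pooled, is_boundary = [], []
        n = snippet_feats.size(0)
        for t_k in self.parts_per_layer:
            bounds = torch.linspace(0, n, t_k + 1).long().tolist()
            for j in range(t_k):
                part = snippet_feats[bounds[j]:max(bounds[j] + 1, bounds[j + 1])]
                pooled.append(part.mean(dim=0))             # Eq. (1): mean pooling
                is_boundary.append(t_k > 1 and j in (0, t_k - 1))
        out = []
        for feat, boundary in zip(pooled, is_boundary):
            if boundary:                                     # keep boundary parts intact
                out.append(feat)
            else:                                            # down-sample interior parts
                out.append(self.theta(feat.view(1, -1, 1)).view(-1))
        return torch.cat(out)                                # stage-level global feature

# Usage: 9 sampled snippets of the "middle stage", 1024-d each.
stage_feat = BoundaryAttentionPyramid()(torch.randn(9, 1024))
print(stage_feat.shape)   # torch.Size([2816]) with the illustrative sizes above
```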
6) Classifier construction S6. SSN builds two classifiers and a regressor on the feature pyramid. In order to combine detection at two granularities, the invention establishes three classifiers A, B and C and a regressor R on top of the feature pyramid. The three classifiers are used for behavior classification, integrity evaluation of the behavior within the candidate frame, and staged integrity evaluation of the behavior within the candidate frame, respectively. The input of A is the feature H_c, while the inputs of B, C and R are all (H_s, H_c, H_e). Specifically, a given candidate box is considered to contain a complete behavior when its IoU with some behavior video segment is greater than 0.7; otherwise it contains background or an incomplete behavior. Classifier B performs this integrity evaluation on the behavior within the entire candidate box. In addition, a candidate box is uniformly divided into N_s parts w_1, w_2, ..., w_{N_s}, and each part is integrity-evaluated using classifier C: when the overlap rate of w_i with a certain behavior video segment is higher than 0.5, it is considered complete, otherwise incomplete. Position regression has proven effective in improving localization accuracy, so the regressor R is used to adjust the position and length of the candidate frame.
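A small sketch of the integrity labelling rules just described: a candidate frame counts as complete when its temporal IoU with some ground-truth behavior segment exceeds 0.7, and each of its N_s uniform parts counts as complete when its overlap with a behavior segment exceeds 0.5. Interpreting the per-part overlap as the fraction of the part covered by the behavior segment, and the default of 5 parts, are assumptions for illustration (Fig. 3 shows a five-part example).

```python
def t_iou(a, b):
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def integrity_labels(candidate, gt_segments, n_parts=5):
    # Label for classifier B: does the whole candidate contain a complete behavior?
    whole = any(t_iou(candidate, g) > 0.7 for g in gt_segments)
    # Labels for classifier C: stage-wise integrity of the N_s uniform parts w_i.
    s, e = candidate
    step = (e - s) / n_parts
    parts = [(s + i * step, s + (i + 1) * step) for i in range(n_parts)]
    part_labels = []
    for w in parts:
        covered = max((max(0.0, min(w[1], g[1]) - max(w[0], g[0])) for g in gt_segments),
                      default=0.0)
        part_labels.append(covered / (w[1] - w[0]) > 0.5)
    return whole, part_labels

print(integrity_labels(candidate=(10.0, 20.0), gt_segments=[(12.0, 18.0)]))
# (False, [False, True, True, True, False])
```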
7) Referring to fig. 3, this embodiment adopts the combination method proposed in the TAG to combine complete behavior segments from incomplete video behavior segments S7. The recombined behavior video segments have more accurate temporal boundaries because they are combined from finer-grained segments. In addition, this design effectively improves the utilization efficiency of the candidate frames, thereby improving the detection effect.
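A minimal sketch of this fine-grained recombination step: given classifier C's stage-wise integrity scores for the N_s parts of a candidate frame, consecutive parts judged complete are merged into a new, tighter detection. The 0.5 score threshold is an illustrative assumption; the actual TAG combination rule may differ in detail.

```python
def recombine(candidate, part_scores, thr=0.5):
    s, e = candidate
    step = (e - s) / len(part_scores)
    detections, run_start = [], None
    for i, score in enumerate(part_scores + [0.0]):      # sentinel closes the last run
        if score > thr and run_start is None:
            run_start = i
        elif score <= thr and run_start is not None:
            detections.append((s + run_start * step, s + i * step))
            run_start = None
    return detections

# Parts scored as in Fig. 3 ("0 1 1 1 0"): only the middle of the candidate is kept.
print(recombine(candidate=(10.0, 20.0), part_scores=[0.1, 0.9, 0.8, 0.7, 0.2]))
# [(12.0, 18.0)]
```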
8) Regression S8. The invention trains a regressor to correct the position and length of the video behavior time axis candidate frames, making the positioning result more accurate.
9) All test results are collected, duplicate test results are discarded using non-maximal suppression, and test results are evaluated S9.
The above is a specific embodiment that effectively extracts the temporal boundary information of video behaviors and detects the video behavior time axis by combining coarse-grained and fine-grained detection. Training and testing of the above embodiment are performed on an actual video data set. The method is trained on the validation set of the THUMOS'14 data set, where the RGB branch and the optical flow branch are trained for 1000 and 6000 iterations at learning rates of 0.001 and 0.005 respectively, and their learning rates are multiplied by 0.1 every 400 and 2500 iterations respectively. The embodiment is then tested on the test set of the THUMOS'14 data set, and the experimental results are evaluated using the accepted evaluation criterion mAP (mean Average Precision); a simplified sketch of this evaluation criterion is given after Table 1. A quantity associated with the mAP is the IoU between the detection result and the actual video behavior segment; the higher the required IoU, the higher the positioning accuracy demanded. Table 1 is the evaluation of the inventive method and the 8 methods compared with it on the THUMOS'14 data set; the methods are compared by the mAP at IoU = 0.5, where a higher mAP indicates better results and higher algorithm performance. As can be seen from Table 1, the detection performance (mAP) of the method of the present invention exceeds all methods disclosed so far and is improved by nearly 3% compared with the baseline SSN. In addition, when IoU is greater than 0.5, the video behavior time axis detection effect of the method is clearly superior to the other methods, demonstrating that the feature pyramid and multi-granularity detection proposed by the method can effectively improve the positioning accuracy. The above experimental results are sufficient to illustrate the superiority of the present method.
Table 1. Comparison of evaluation results for video behavior timeline detection on the THUMOS'14 dataset
Method                 tIoU=0.3  tIoU=0.4  tIoU=0.5  tIoU=0.6  tIoU=0.7
Oneata et al. [1]      27.0      20.8      14.4      8.5       3.2
Richard et al. [2]     30.0      23.2      15.2      -         -
S-CNN [3]              36.3      28.7      19.0      10.3      5.3
Yeung et al. [4]       36.0      26.4      17.1      -         -
Yuan et al. [5]        33.6      26.1      18.8      -         -
CDC [6]                40.1      29.4      23.3      13.1      7.9
R-C3D [7]              44.8      35.6      28.9      -         -
CBR-TS [8]             50.1      41.3      31.0      19.1      9.9
SS-TAD [9]             45.7      -         29.2      -         9.6
SSN (ImageNet) [10]    -         -         27.36     -         -
SSN (Kinetics) [10]    -         -         32.50     -         -
TBN (ImageNet)         48.82     39.11     30.30     20.65     11.32
TBN (Kinetics)         54.8      44.60     33.75     22.75     12.98
Note: "ImageNet" and "Kinetics" represent networks pre-trained using the ImageNet dataset and networks pre-trained using Kinetics, respectively.
The prior art methods for comparison in table 1 are described in the following respective documents:
[1] Dan Oneata, Jakob Verbeek, and Cordelia Schmid. The LEAR submission at THUMOS 2014. 2014.
[2]Alexander Richard and Juergen Gall.Temporal action detection using a statistical language model.In CVPR,pages 3131–3140,2016.
[3]Zheng Shou,Dongang Wang,and Shih-Fu Chang.Temporal action localization in untrimmed videos via multi-stage cnns.In CVPR,pages 1049–1058,2016.
[4]Serena Yeung,Olga Russakovsky,Greg Mori,and Li Fei-Fei.End-to-end learning of action detection from frame glimpses in videos.In CVPR,pages 2678–2687,2016.
[5]Jun Yuan,Bingbing Ni,Xiaokang Yang,and Ashraf A Kassim.Temporal action lo-calization with pyramid of score distribution features.In CVPR,pages 3093–3102,2016.
[6]Zheng Shou,Jonathan Chan,Alireza Zareian,Kazuyuki Miyazawa,and Shih-Fu Chang.Cdc:convolutional-de-convolutional networks for precise temporal action lo-calization in untrimmed videos.In CVPR,pages 1417–1426,2017.
[7]Huijuan Xu,Abir Das,and Kate Saenko.R-c3d:Region convolutional 3d network for temporal activity detection.In ICCV,volume 6,page 8,2017.
[8] Jiyang Gao, Zhenheng Yang, and Ram Nevatia. Cascaded boundary regression for temporal action detection. arXiv preprint arXiv:1705.01180, 2017.
[9]S Buch,V Escorcia,B Ghanem,L Fei-Fei,and JC Niebles.End-to-end,single-stream temporal action detection in untrimmed videos.In BMVC,2017.
[10] Yue Zhao, Yuanjun Xiong, Limin Wang, Zhirong Wu, Xiaoou Tang, and Dahua Lin. Temporal action detection with structured segment networks. In ICCV, volume 8, 2017.
it is noted that the disclosed embodiments are intended to aid in further understanding of the invention, but those skilled in the art will appreciate that: various substitutions and modifications are possible without departing from the spirit and scope of the invention and appended claims. Therefore, the invention should not be limited to the embodiments disclosed, but the scope of the invention is defined by the appended claims.

Claims (8)

1. A video behavior time axis detection method comprises the following steps:
1) extracting video behavior timeline candidate frames of a video to be detected by using TAG;
2) performing time structure modeling on each video behavior time axis candidate frame, and dividing a single candidate frame into three stages;
3) extracting the spatiotemporal characteristics of the video by using a video classification depth model, and storing the characteristics in a memory;
4) at each stage, establishing a multi-layer feature pyramid to effectively extract the time boundary information of video behaviors, performing down-sampling on the features of the parts that do not contain behavior time boundary information while not down-sampling the behavior time edge parts, and connecting the features of the two parts, so that the information of the behavior time edge parts accounts for a larger proportion in the final global features;
5) on the pyramid characteristics, three classifiers A, B, C and a regressor R are constructed, and the three classifiers are respectively used for behavior classification, candidate in-frame behavior integrity evaluation and candidate in-frame behavior staged integrity evaluation;
6) combining complete behavior fragments from incomplete behavior fragments by utilizing the staged integrity evaluation results of the classifier C for the candidate frames and using the combination method provided in the TAG;
7) performing two kinds of behavior time axis detection with different granularities; and the detection results of the two granularities are fused, so that the aim of accurately positioning the time axis of the video behavior is fulfilled.
2. The video behavior timeline detection method according to claim 1, wherein said video classification depth model in step 3) is a two-stream model based on the BN-Inception network; the model inputs the RGB images and the optical flow features of a video into two BN-Inception networks with the same structure respectively, and trains them to learn the color features and motion features of the video; the length of the extracted feature vector is 1024.
3. The video behavior timeline detection method of claim 2, wherein for each timeline candidate frame, each end of the candidate frame is extended by 1/2 of the candidate frame length so as to contain the contextual semantic information of the behavior; each extended video behavior timeline candidate frame is divided into three stages, namely a start stage, an intermediate stage and an end stage, which correspond to the start, progress and end of a behavior.
4. The video behavior timeline detection method of claim 3, wherein for each of the "start stage", "middle stage" and "end stage" of the video behavior timeline candidate frame, the two-stream features thereof are extracted, a K-layer feature pyramid is established on the two-stream features, each layer of the pyramid divides the corresponding stage into T_k parts, and the features sampled in each part are merged together through mean pooling.
5. The video behavior timeline detection method of claim 1, wherein on the global features, two integrity classifiers B and C are constructed; the classifier B carries out integrity evaluation on the whole video behavior time axis candidate frame and judges whether a complete behavior is contained or not; the classifier C uniformly divides a video behavior time axis candidate frame into N_S parts, and each part is subjected to integrity evaluation to judge whether it is an integral part of a behavior.
6. The video behavior timeline detection method of claim 5, wherein for candidate frames judged to be incomplete by the classifier B, the integrity evaluation result of the classifier C is used, and the combination method proposed in TAG is used to combine the complete parts in different candidate frames into a new detection result, thereby combining complete behavior segments from incomplete behavior segments.
7. The method according to claim 1, wherein a regressor is constructed on the global features and trained to adjust the positions of the candidate frames, so that the positioning result is more accurate.
8. The video behavior timeline detection method according to claim 7, wherein the accurate video behavior timeline detection result is obtained by combining the detection results of two granularities generated by the classifier B and the classifier C, respectively.
CN201810597905.XA 2018-06-12 2018-06-12 Video behavior time axis detection method Active CN108830212B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810597905.XA CN108830212B (en) 2018-06-12 2018-06-12 Video behavior time axis detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810597905.XA CN108830212B (en) 2018-06-12 2018-06-12 Video behavior time axis detection method

Publications (2)

Publication Number Publication Date
CN108830212A CN108830212A (en) 2018-11-16
CN108830212B true CN108830212B (en) 2022-04-22

Family

ID=64145077

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810597905.XA Active CN108830212B (en) 2018-06-12 2018-06-12 Video behavior time axis detection method

Country Status (1)

Country Link
CN (1) CN108830212B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109672936B (en) * 2018-12-26 2021-10-26 上海众源网络有限公司 Method and device for determining video evaluation set and electronic equipment
CN109784269A (en) * 2019-01-11 2019-05-21 中国石油大学(华东) One kind is based on the united human action detection of space-time and localization method
CN109948446B (en) * 2019-02-20 2021-07-16 北京奇艺世纪科技有限公司 Video clip processing method and device and computer readable storage medium
CN109978021B (en) * 2019-03-07 2022-09-16 北京大学深圳研究生院 Double-flow video generation method based on different feature spaces of text
CN109934845B (en) * 2019-03-15 2022-11-29 腾讯科技(深圳)有限公司 Time sequence behavior capturing frame generation method and device based on self-attention network
CN110096617B (en) * 2019-04-29 2021-08-10 北京百度网讯科技有限公司 Video classification method and device, electronic equipment and computer-readable storage medium
CN110458038B (en) * 2019-07-19 2021-10-26 天津理工大学 Small data cross-domain action identification method based on double-chain deep double-current network
CN110647812B (en) * 2019-08-19 2023-09-19 平安科技(深圳)有限公司 Tumble behavior detection processing method and device, computer equipment and storage medium
CN110569773B (en) * 2019-08-30 2020-12-15 江南大学 Double-flow network behavior identification method based on space-time significance behavior attention
CN110796071B (en) * 2019-10-28 2021-02-19 广州云从博衍智能科技有限公司 Behavior detection method, system, machine-readable medium and device
CN111611847B (en) * 2020-04-01 2021-04-30 杭州电子科技大学 Video motion detection method based on scale attention hole convolution network
CN113204675B (en) * 2021-07-07 2021-09-21 成都考拉悠然科技有限公司 Cross-modal video time retrieval method based on cross-modal object inference network

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106897714A (en) * 2017-03-23 2017-06-27 北京大学深圳研究生院 A kind of video actions detection method based on convolutional neural networks
CN107292249A (en) * 2017-06-08 2017-10-24 深圳市唯特视科技有限公司 A kind of time motion detection method based on structuring segmented network
CN108038420A (en) * 2017-11-21 2018-05-15 华中科技大学 A kind of Human bodys' response method based on deep video

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8538109B2 (en) * 2009-03-18 2013-09-17 Siemens Aktiengesellschaft Method and system for dynamic pulmonary trunk modeling and intervention planning
CN104200218B (en) * 2014-08-18 2018-02-06 中国科学院计算技术研究所 A kind of across visual angle action identification method and system based on timing information

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106897714A (en) * 2017-03-23 2017-06-27 北京大学深圳研究生院 A kind of video actions detection method based on convolutional neural networks
CN107292249A (en) * 2017-06-08 2017-10-24 深圳市唯特视科技有限公司 A kind of time motion detection method based on structuring segmented network
CN108038420A (en) * 2017-11-21 2018-05-15 华中科技大学 A kind of Human bodys' response method based on deep video

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A Pursuit of Temporal Accuracy in General Activity Detection; Yuanjun Xiong et al.; arXiv (https://arxiv.org/abs/1704.06228v1); 2017-03-08; see pages 1-10 *
CDC: Convolutional-De-Convolutional Networks for Precise Temporal Action Localization in Untrimmed Videos; Zheng Shou et al.; 2017 IEEE Conference on Computer Vision and Pattern Recognition; 2017-07-26; pages 1417-1426 *
Temporal Action Detection with Structured Segment Networks; Yue Zhao et al.; 2017 IEEE International Conference on Computer Vision; 2017-10-29; see pages 2933-2942 *

Also Published As

Publication number Publication date
CN108830212A (en) 2018-11-16

Similar Documents

Publication Publication Date Title
CN108830212B (en) Video behavior time axis detection method
CN110348445B (en) Instance segmentation method fusing void convolution and edge information
CN108664931B (en) Multi-stage video motion detection method
Min et al. Tased-net: Temporally-aggregating spatial encoder-decoder network for video saliency detection
US9251425B2 (en) Object retrieval in video data using complementary detectors
CN109948446B (en) Video clip processing method and device and computer readable storage medium
CN111951212A (en) Method for identifying defects of contact network image of railway
CN108875610B (en) Method for positioning action time axis in video based on boundary search
CN105761263A (en) Video key frame extraction method based on shot boundary detection and clustering
CN108898076B (en) Method for positioning video behavior time axis and extracting candidate frame
CN111460914A (en) Pedestrian re-identification method based on global and local fine-grained features
CN111046821A (en) Video behavior identification method and system and electronic equipment
CN104036023A (en) Method for creating context fusion tree video semantic indexes
Moon et al. Integralaction: Pose-driven feature integration for robust human action recognition in videos
CN107315795A (en) The instance of video search method and system of joint particular persons and scene
CN112784673A (en) Computing system for extracting video data
CN108268598A (en) A kind of analysis system and analysis method based on vedio data
CN111832351A (en) Event detection method and device and computer equipment
KR102268027B1 (en) Method for single image dehazing based on deep learning, recording medium and device for performing the method
CN105488099B (en) A kind of vehicle retrieval method based on similarity study
Sella Veluswami et al. Face mask detection using SSDNET and lightweight custom CNN
US9275140B2 (en) Method of optimizing the search for a scene on the basis of a stream of images archived in a video database
CN113010736A (en) Video classification method and device, electronic equipment and storage medium
CN114782997A (en) Pedestrian re-identification method and system based on multi-loss attention adaptive network
Guru et al. Histogram based split and merge framework for shot boundary detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant