CN108898076B - Method for positioning video behavior time axis and extracting candidate frame - Google Patents

Method for positioning video behavior time axis and extracting candidate frame

Info

Publication number
CN108898076B
CN108898076B
Authority
CN
China
Prior art keywords
video
window
time axis
action
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810607040.0A
Other languages
Chinese (zh)
Other versions
CN108898076A (en)
Inventor
李革
张涛
李楠楠
黄靖佳
钟家兴
李宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University Shenzhen Graduate School
Original Assignee
Peking University Shenzhen Graduate School
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University Shenzhen Graduate School filed Critical Peking University Shenzhen Graduate School
Priority to CN201810607040.0A priority Critical patent/CN108898076B/en
Publication of CN108898076A publication Critical patent/CN108898076A/en
Application granted granted Critical
Publication of CN108898076B publication Critical patent/CN108898076B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

A video behavior timeline positioning and candidate frame extraction method, namely a video behavior timeline candidate frame extraction method based on untrimmed video data and deep reinforcement learning, comprises the following specific steps: first, a Markov model is established on the video behavior timeline positioning task, converting the task into a Markov decision process; then, the Markov decision process is solved with the classic deep reinforcement learning algorithm DQN, so that the algorithm automatically adjusts the length and position of a timeline window; finally, the trained agent model and an action/background binary classifier are used to locate human behavior in the video and to generate timeline candidate boxes for subsequent, more accurate positioning and analysis. The invention surpasses most current state-of-the-art algorithms in both efficiency and effectiveness, and can be used to locate human behaviors in videos.

Description

Method for positioning video behavior time axis and extracting candidate frame
Technical Field
The invention relates to the technical field of video analysis, in particular to a method for positioning a video behavior time axis and extracting a candidate frame.
Background
Videos containing human behavior can be divided into two categories: one is manually trimmed video that contains only human behavior and no irrelevant background footage; the other is untrimmed video as shot, which contains not only human behavior but also irrelevant background segments such as opening titles and shots of the audience. Video behavior timeline detection means locating, within an untrimmed video, the start and end times at which human behaviors occur, and recognizing the categories of those behaviors. Existing video behavior timeline detection methods mainly follow a two-step strategy: first, a large number of timeline candidate frames that are likely to contain human action segments are extracted; then the extracted candidate frames are fine-tuned in position and length, and the located behaviors are classified. Extracting high-quality timeline candidate frames is a critical step for accurate video behavior detection. The present invention mainly addresses the task of extracting video behavior timeline candidate frames, efficiently extracting high-quality candidate frames based on deep reinforcement learning. In addition, the method can also be used directly for coarse video behavior timeline positioning.
Existing video behavior timeline candidate frame extraction methods mainly include the following:
1. Sliding-window-based methods. This is currently the simplest approach: timeline windows of different lengths are set manually and slid over the whole video with a fixed step, generating a large number of video clips. An action/background binary classifier is then applied to the generated clips, and the resulting foreground clips are recorded as timeline candidate frames. The main problems of the sliding-window method are that it is computationally inefficient and that the timeline candidate frames it produces are of poor quality.
2. Methods based on recurrent neural networks. These methods were proposed after the recent revival of deep learning; they model long video sequences with a recurrent neural network and search for video clips likely to contain human behavior by extracting temporal information from the video. They are computationally efficient. However, existing recurrent neural networks struggle to process the huge amount of information in video, so the quality of the extracted timeline candidate frames is not high, i.e. the overlap between the candidate frames and the human behaviors in the video is not high enough, and missed detections are serious.
3. Bottom-up merging of short video segments. This method divides the whole video into short segments of equal length, performs action/background binary classification on each segment, and finally merges the foreground segments to obtain timeline candidate frames. It has good computational efficiency and a low miss rate, but the quality of its candidate frames is still not high enough, so there is considerable room for improvement.
Disclosure of Invention
In order to overcome the deficiencies of the prior art in efficiency, candidate frame quality and intelligence, the invention provides a novel video behavior timeline positioning and candidate frame extraction method which, based on deep reinforcement learning, realizes efficient and intelligent extraction of video behavior timeline candidate frames. The method can be applied to video segment extraction, obtaining the video segments of interest from long untrimmed videos for subsequent video analysis.
The principle of the invention is as follows: the method adopts a deep reinforcement learning algorithm to extract video behavior timeline candidate frames. Drawing on the deep reinforcement learning research of the DeepMind team at Google, the method applies the classic deep Q-network (Mnih, Volodymyr, et al., "Human-level control through deep reinforcement learning," Nature, 2015) to the video behavior timeline candidate frame extraction task, obtaining a good extraction effect with high computational efficiency. In that work (Mnih et al., Nature, 2015), a convolutional neural network is used to extract features of the game screen, and the extracted features are input into a fully-connected network that learns which action to take in the corresponding game state, so that the network learns to play the game like a human. Borrowing the approach proposed by DeepMind, the method first establishes a Markov decision process on the video behavior timeline positioning task, converting video behavior timeline positioning into the solution of a Markov decision process; it then uses a classic 3D convolutional neural network (C3D) to extract the spatio-temporal information of the video and trains an agent network to learn a strategy for automatically adjusting the position and length of a timeline window, thereby achieving video behavior timeline positioning. Unlike previous methods, this method adopts deep reinforcement learning and automatically searches for the target candidate frame through a continuous sequence of decisions, achieving true "intelligence", with higher computational efficiency and higher-quality candidate frames.
The method extracts video behavior timeline candidate frames based on deep reinforcement learning and is tested on real videos from the public THUMOS'14 dataset; the experimental results surpass most current methods, and the computational efficiency exceeds that of current methods.
The technical scheme provided by the invention is as follows:
a video behavior time axis candidate frame extraction method is characterized in that a C3D convolutional neural network is used for extracting video space-time characteristics, an intelligent neural network is trained by using a Q-learning algorithm to learn a strategy for automatically adjusting the position and the length of a time axis window, and video clips containing human behaviors are automatically searched. The method comprises the following steps:
1) establishing a Markov decision process on a video behavior time axis positioning task;
2) performing feature extraction by using the video behavior classification depth model C3D, and storing feature maps generated by each convolution layer in the depth model in a memory;
3) connecting the depth features with motion vectors of four actions executed in the past to form feature vectors fusing current window video information and historical actions;
4) inputting the characteristic vector into the DQN by adopting a classic deep reinforcement learning algorithm DQN, training the DQN to solve a Markov decision process, and learning a method for automatically adjusting the position and the length of a window according to the characteristics of the current time axis window;
5) utilizing the trained DQN to automatically and continuously adjust the position and the length of a window on the basis of initializing the window so as to enable the window to be accurately close to a human action segment in a video;
6) and training an action/background two classifier to judge whether the video clips under the window contain human behaviors or not, so as to search and position the human behavior clips in the video, thereby achieving the purposes of positioning the video behavior time axis and extracting the candidate frames.
Preferably, the video behavior classification depth model is a C3D deep convolutional network model with 8 convolutional layers; the C3D deep convolutional network extracts the spatio-temporal information of the video through 3D convolution and pooling operations; the spatio-temporal features of the video are extracted with a C3D network pre-trained on the Sports1M dataset, and the feature vector output by the first fully-connected layer (fc-6) of the C3D network is used as the video feature vector, whose length is 1024.
Preferably, a Markov model is built on the video behavior timeline positioning task, and its action set A contains six actions on the timeline window: "move left", "move right", "jump", "extend left", "extend right" and "shorten".
Preferably, the action vectors of the four most recently performed actions are concatenated with the fc-6 layer features of the C3D network to form a feature vector F that fuses the current window's video information with the action history.
Preferably, the DQN used to solve the Markov decision process consists of three fully-connected layers; the DQN extracts picture features with a convolutional neural network, inputs the features into the fully-connected layers, and solves the Markov decision process with the fully-connected network; the output of the DQN is an action vector, and the network selects the action with the maximum probability to execute. Here, the feature vector output by the fc-6 layer of the C3D network is used as the video feature vector and fused with the vectors of the 4 actions the network executed most recently to form the DQN input feature vector; the vector output by the DQN has length 6, corresponding to the 6 operations on the timeline window.
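By way of illustration only, the decision network described above might be sketched as follows in Python with PyTorch; the input dimension and the hidden-layer width are not specified in the patent and are therefore assumptions here, while the 6-way output follows the six window operations.

    import torch.nn as nn

    class WindowDQN(nn.Module):
        """Sketch of the decision network: fully-connected layers mapping the
        fused video/history feature vector to Q-values for the six window
        operations. Hidden sizes are assumptions; the patent does not give them."""
        def __init__(self, feature_dim, hidden_dim=1024, num_actions=6):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(feature_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, num_actions),  # one Q-value per window operation
            )

        def forward(self, features):
            return self.net(features)

    # Usage sketch: the operation with the largest Q-value is executed on the window.
    # q_values = WindowDQN(feature_dim=state_vector.shape[-1])(state_vector)
    # action = q_values.argmax(dim=-1)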
Preferably, a category-independent action/background binary classifier is used to determine whether the video segment under the window contains human behavior.
Preferably, if the window is determined by the action/background binary classifier to contain human behavior, it is recorded as a candidate frame, and the position of the window is updated to the right end of the current window.
Preferably, if the window is judged by the action/background binary classifier not to contain human behavior, the feature vector F is input into the DQN network, the action with the maximum probability is selected and executed on the current window, and the position and length of the current window are updated.
Preferably, the position and length of the window are updated using a fixed ratio of 0.2.
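The six window operations with the fixed ratio of 0.2 can be illustrated by the following Python sketch. Only the "move left" update rule is given explicitly later in the description; the rules used here for the other operations are plausible assumptions that reuse the same ratio, and the helper name apply_window_action is not taken from the patent.

    import random

    ALPHA = 0.2  # fixed ratio used when adjusting the window (stated above)

    def apply_window_action(s, e, action, video_length):
        """Apply one of the six operations to the timeline window [s, e].

        The 'move left' rule matches the formula given later in the description;
        the rules for the remaining operations are assumptions built on the same
        fixed ratio, and 'jump' relocates the window to a random position."""
        d = e - s
        step = ALPHA * d
        if action == "move_left":
            s, e = s - step, e - step
        elif action == "move_right":
            s, e = s + step, e + step
        elif action == "extend_left":
            s = s - step
        elif action == "extend_right":
            e = e + step
        elif action == "shorten":
            s, e = s + step / 2, e - step / 2
        elif action == "jump":
            s = random.uniform(0, max(0.0, video_length - d))
            e = s + d
        # clamp the window to the extent of the video
        s = max(0.0, s)
        e = min(float(video_length), max(e, s + 1))
        return s, e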
Preferably, when the window reaches the end of the video 5 times, the positioning of behaviors within the video is considered complete.
The method for extracting video behavior timeline candidate frames specifically comprises the following steps:
1) Establishing a Markov decision process M = {S, A, P_{s,a}, R_{s,a}} on the video behavior positioning task, where S is the state set, A is the action set, P_{s,a} is the probability of performing action a in state s, and R_{s,a} is the reward obtained by performing action a in state s.
2) Initializing a window: a timeline window [s, e] is initialized at the beginning of the video to be processed, with its length initialized randomly; s and e are the start and end frames of the window.
3) Feature extraction: uniformly sampling 16 frames from the video within the current window, inputting them into the C3D network, and extracting the fc-6 layer features of the C3D network.
4) Feature concatenation: the decision vectors corresponding to the four most recently performed actions are concatenated with the fc-6 layer features of the C3D network to form the feature F that is input to the DQN.
5) Action/background classification: the fc-6 layer features of the C3D network for the video under the current window are input into the action/background binary classifier, which judges whether the video segment covered by the current window contains a behavior of interest. If so, the current window is recorded as a candidate frame, a new window [s', e'] is set at the right end of the current window, and the procedure returns to step 3) to restart the search. Otherwise, it continues to step 6).
6) Decision making: F is input into the DQN to obtain an action decision a for the current window, and operation a is executed on the position or length of the current window to obtain a new window [s', e'].
7) Repeating steps 3) to 6) until the current window reaches the end of the video.
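For illustration, steps 2) to 7) can be sketched as the following Python loop; all behaviour-specific components (feature extraction, the action/background classifier, the DQN decision and the window operations) are passed in as callables, and the initial window length and the max_steps safety cap are assumptions rather than values given in the patent.

    def extract_candidate_boxes(num_frames, extract_features, is_action,
                                choose_action, apply_action,
                                max_end_hits=5, max_steps=1000):
        """Sketch of the search loop in steps 2)-7). The behaviour-specific pieces
        (C3D feature extraction, the action/background classifier, the DQN
        decision and the window operations) are passed in as callables, since
        the patent describes them separately. `max_steps` is an extra safety
        cap, not part of the patent's description."""
        candidates = []
        s, e = 0, min(num_frames, 64)   # step 2: initial window (length here is illustrative)
        history = []                    # the four most recently executed actions
        end_hits = 0
        for _ in range(max_steps):
            if end_hits >= max_end_hits:            # the text stops after reaching the end 5 times
                break
            features = extract_features(s, e, history)   # steps 3-4: C3D fc-6 + action history
            if is_action(s, e):                           # step 5: action/background test
                candidates.append((s, e))
                s, e = e, min(num_frames, 2 * e - s)      # restart just to the right of the window
            else:
                action = choose_action(features)          # step 6: decision by the DQN
                history = (history + [action])[-4:]
                s, e = apply_action(s, e, action)
            if e >= num_frames:                           # step 7: count arrivals at the video end
                end_hits += 1
        return candidates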
Compared with the prior art, the beneficial effects of the invention are as follows: the invention provides a novel video behavior timeline candidate frame extraction method, drawing on the outstanding deep reinforcement learning research of the DeepMind team and applying it to the field of video behavior timeline detection, thereby achieving a good candidate frame extraction effect with high computational efficiency and intelligence. The method can be used for candidate frame extraction and coarse behavior timeline positioning on untrimmed video, and the extracted candidate frames can be used for subsequent fine adjustment of window position and behavior classification. The method can be widely applied to capturing human behaviors in intelligent surveillance systems or human-computer interaction for subsequent behavior recognition and analysis, and can also be applied to fields such as video search.
Compared with the prior art, the invention is tested and evaluated on THUMOS'14, currently the most widely used video behavior detection dataset. Both the candidate frame extraction results and the behavior positioning results on THUMOS'14 are superior to those of existing methods, with higher computational efficiency, demonstrating the technical superiority of the method.
Drawings
Fig. 1 is the overall framework diagram of the present invention, wherein: 1 - the video to be detected, in which the solid-line box is the timeline window before the operation is executed and the dashed-line box is the timeline window after the action is executed; 2 - feature extraction using the C3D network; 3 - learning the strategy with the DQN network and determining the action executed on the window; 4 - the action decision vector output by the DQN network; 5 - executing the operation selected by the DQN on the timeline window.
Fig. 2 is a diagram of the six operations on a window according to the present invention, which are "shift left", "shift right", "jump", "extend left", "extend right" and "shorten", respectively. 6 - the current timeline window; 7 - the timeline window after the action is performed.
Fig. 3 is a flowchart of a method for extracting candidate frames for video behavior timeline detection according to the present invention.
Fig. 4A is a visualization of the experimental evaluation results of the present invention, namely the recall vs. number-of-candidate-frames curve.
Fig. 4B is a visualization of the experimental evaluation results of the present invention, namely the recall vs. IoU curve. Wherein: 8 - the recall vs. number-of-candidate-frames curve of the timeline candidate frame extraction of the present invention; 9 - the recall vs. IoU curve of the timeline candidate frame extraction of the present invention.
Detailed Description
The invention will be further described by way of examples, without in any way limiting the scope of the invention, with reference to the accompanying drawings.
Extraction of video behavior timeline candidate frames is a sub-problem of video behavior detection and provides key technical support for fields such as intelligent surveillance, human-computer interaction and video search. Unlike ordinary image-based object detection, video behavior detection requires processing a large amount of untrimmed video data, which places higher requirements on the effectiveness and computational efficiency of video behavior detection techniques. Deep reinforcement learning algorithms have been shown to have strong learning ability and to handle sequential decision problems well. The invention applies the classic DQN algorithm to the task of video behavior timeline positioning. First, a classic C3D network is used to extract spatio-temporal features from the video; then a Markov decision process is established on the video behavior timeline positioning task, and the Q-learning algorithm is used to train a DQN network to solve it. The trained DQN network can automatically adjust the position and length of a window according to the video content covered by the timeline window, efficiently searching for video clips containing human behavior through a series of window operations. Experimental results on the THUMOS'14 dataset show that the computational efficiency, the quality of the timeline candidate frames, and the video behavior timeline positioning results of the method of the present invention are superior to most existing methods. Without re-screening and fine-tuning the candidate frames, the mean average precision (mAP) of the method on the THUMOS'14 dataset reaches 24.4.
Table 1. Structure of the C3D feature extraction network used in the present invention
[The table is provided as an image in the original publication and is not reproduced here.]
FIG. 1 is the overall framework diagram of video behavior timeline positioning and candidate frame extraction according to the present invention, comprising: the video 1 to be detected, in which the solid-line box is the timeline window before an operation is executed and the dashed-line box is the timeline window after the action is executed; feature extraction 2 using the C3D network; learning the strategy with the DQN network and determining the action 3 to be executed on the window; the action decision vector 4 output by the DQN network; and executing the operation 5 selected by the DQN on the timeline window.
Fig. 2 is a diagram of the six kinds of operations on a window, which are respectively "shift left", "shift right", "jump", "extend left", "extend right" and "shorten"; the current timeline window 6; the timeline window 7 after the action is performed.
Fig. 3 is a flowchart of a method for extracting candidate frames for video behavior timeline detection, which specifically includes the following steps:
First, a Markov decision process is established (S1): a Markov decision process M = {S, A, P_{s,a}, R_{s,a}} is established on the video behavior positioning task, where S is the state set, A is the action set, P_{s,a} is the probability of performing action a in state s, and R_{s,a} is the reward obtained by performing action a in state s. The state set S here consists of feature vectors of the video. The design of the action set A is very critical; the invention designs A = {"move left", "move right", "extend left", "extend right", "shorten", "jump"}, representing six operations on the window [s, e]. P_{s,a} is constantly 1, meaning that every action selected by the network is executed with certainty. R_{s,a} takes values in {+1, -1, +5, -5}: when the network selects the correct action it receives a reward of +1, otherwise it receives -1. The "jump" action is special: it means that the window jumps randomly to any position in the video, and the network is rewarded +5 when the "jump" action is performed correctly and -5 otherwise. The ultimate goal of the method is to solve the Markov decision process so that the total reward received by the network is maximized.
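The reward scheme above can be illustrated with the Python sketch below. The patent does not state how the "correctness" of an action is judged; the sketch assumes, purely for illustration, that an ordinary action is correct when it increases the window's temporal IoU with the ground-truth segment, and that a "jump" is correct when the pre-jump window had negligible overlap with the ground truth.

    def temporal_iou(window, gt):
        """Intersection-over-union of two timeline intervals given as (start, end)."""
        inter = max(0.0, min(window[1], gt[1]) - max(window[0], gt[0]))
        union = max(window[1], gt[1]) - min(window[0], gt[0])
        return inter / union if union > 0 else 0.0

    def reward(old_window, new_window, gt, action):
        """Reward set {+1, -1, +5, -5} from the Markov decision process above.

        Assumption: a regular action is 'correct' when it increases IoU with the
        ground-truth segment, and a 'jump' is 'correct' when the old window
        barely overlapped the ground truth; the patent only gives the values."""
        if action == "jump":
            return 5 if temporal_iou(old_window, gt) < 0.1 else -5
        better = temporal_iou(new_window, gt) > temporal_iou(old_window, gt)
        return 1 if better else -1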
Second, the video data is processed and decoded into pictures (S2): the video is extracted as still video frames.
Third, a window is initialized (S3): the window is initialized at the start of the video, and its length is set to the average length of all video segments containing human behavior in the training set.
Fourth, sampling and feature extraction (S4): the invention uniformly samples 16 frames from the video under the current window. Two methods are currently most commonly used for extracting video features. The first is the two-stream method (Simonyan, Karen, and Andrew Zisserman, "Two-stream convolutional networks for action recognition in videos," Advances in Neural Information Processing Systems, 2014). The two-stream method feeds the RGB images and the optical-flow features of the video into two convolutional neural networks with the same structure, training the two networks to learn two different kinds of features, namely the spatial features and the motion features of the video. Finally, the two-stream network fuses the results of the two networks to achieve feature complementarity. The second is the 3D convolutional neural network method (Tran, Du, et al., "Learning spatiotemporal features with 3D convolutional networks," International Conference on Computer Vision, 2015). Unlike the two-stream method, which processes temporal and spatial information separately, the 3D convolutional neural network directly feeds multiple frames into the convolutional network and extracts the spatio-temporal information of the video with 3D convolution and pooling operations, so that the extracted features fuse color features and motion features. Although existing research shows that the two-stream method is slightly superior to the 3D convolutional neural network in experimental results, the two-stream method needs to extract optical-flow features, which is time-consuming and computationally inefficient. Therefore, the invention adopts the 3D CNN method to extract video features. The invention inputs the 16 frames into a C3D network pre-trained on the Sports1M dataset and extracts the feature vector of the network's fc-6 layer, whose length is 4096. In addition, the invention encodes the four operations most recently executed by the network, obtaining a history vector of length 4 × 6 = 24. The invention concatenates the fc-6 layer features of C3D with the history vector to obtain a feature vector of length 5120. The structure of the C3D network used to extract the spatio-temporal features of the video is shown in Table 1.
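A minimal Python sketch of this feature construction follows, assuming PyTorch; the C3D forward pass is represented by a placeholder callable, and the history encoding is taken to be a one-hot encoding of the last four operations (4 × 6 = 24 values), as described above.

    import torch

    NUM_ACTIONS = 6   # six window operations
    HISTORY_LEN = 4   # the four most recently executed operations

    def encode_history(history):
        """One-hot encode the last four actions into a 4 x 6 = 24-dimensional vector."""
        vec = torch.zeros(HISTORY_LEN * NUM_ACTIONS)
        for slot, action_id in enumerate(history[-HISTORY_LEN:]):
            vec[slot * NUM_ACTIONS + action_id] = 1.0
        return vec

    def build_state_vector(c3d_fc6, clip, history):
        """Fuse the current-window video features with the action history.

        `c3d_fc6` stands for the pre-trained C3D network truncated at its fc-6
        layer, and `clip` is the 16-frame sample from the current window; both
        names are placeholders, not identifiers from the patent."""
        video_feat = c3d_fc6(clip)                 # fc-6 feature of the sampled clip
        return torch.cat([video_feat, encode_history(history)], dim=-1)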
Fifth, action/background classification (S5): the fc-6 layer feature vector of the C3D network for the video clip under the current window is obtained and input into a binary classification network composed of two fully-connected layers, whose structure is shown in Fig. 1. The binary classification network determines whether the video segment covered by the current window contains a complete action. A complete action here refers to a video segment whose IoU with the ground-truth action is at least 0.5. If the current window is an action segment, it is recorded as a candidate frame (S8 in Fig. 3) and the window is moved to the right end of the current window; otherwise, the position and length of the current window are corrected according to the action decision of the DQN so that the window approaches the action segment.
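A minimal sketch of the action/background binary classification network, assuming PyTorch; the description only states that it consists of two fully-connected layers operating on the fc-6 features, so the hidden width used here is an assumption.

    import torch.nn as nn

    class ActionBackgroundClassifier(nn.Module):
        """Two fully-connected layers deciding whether the windowed clip contains
        a complete action; the hidden width is an assumption."""
        def __init__(self, feature_dim=4096, hidden_dim=1024):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(feature_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, 2),  # two classes: action vs. background
            )

        def forward(self, fc6_features):
            return self.net(fc6_features)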
Sixth, judging whether the whole video has been processed (S9): when the window has reached the end of the video five times, the invention considers the whole video processed and ends the candidate frame generation flow; otherwise, the search for action segments and the generation of candidate frames continue.
Seventh, action decision (S6, S7): if the video has not been fully processed, the feature vector of the video segment under the current window is obtained and input into the DQN. The DQN network here consists of three fully-connected layers, whose structure is shown in Fig. 1. The DQN decides the operation on the current window according to the features of the video under the current window and the four operations performed in the past. The DQN outputs a vector of length 6 corresponding to the 6 operations on the window, respectively "shift left", "shift right", "extend left", "extend right", "shorten" and "jump". The network selects the action with the maximum probability and executes it to obtain a new window. Assume the current window is [s, e], with length d = e - s. If the network chooses to perform the action "move left", the new window is [s', e'], where s' = max(0, s - α × d) and e' = max(0, e - α × d); α is the ratio used when an operation is performed on the window. In order to balance processing time and positioning accuracy, the invention sets α = 0.2.
Eighth, the above flow is repeated until the processing of the whole video is finished. After processing is completed, non-maximum suppression is applied to the obtained timeline candidate frames to remove highly overlapped candidates.
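Non-maximum suppression over the one-dimensional timeline candidates can be sketched as follows; it assumes each candidate carries a confidence score (for example, from the action/background classifier), and the overlap threshold is an assumption since the patent does not specify it.

    def temporal_nms(candidates, iou_threshold=0.7):
        """Remove highly overlapped timeline candidates.

        `candidates` is a list of (start, end, score) tuples; the threshold value
        is an assumption, as the patent only states that overlapping candidate
        frames are removed."""
        kept = []
        for s, e, score in sorted(candidates, key=lambda c: c[2], reverse=True):
            suppressed = False
            for ks, ke, _ in kept:
                inter = max(0.0, min(e, ke) - max(s, ks))
                union = max(e, ke) - min(s, ks)
                if union > 0 and inter / union > iou_threshold:
                    suppressed = True
                    break
            if not suppressed:
                kept.append((s, e, score))
        return kept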
The above is a specific implementation in which the method uses a reinforcement learning algorithm to convert the process of video behavior positioning into an automatic decision-making process, thereby extracting video behavior timeline candidate frames. The training and testing of the above embodiment are performed on real videos; the scheme is tested on the THUMOS'14 dataset. Tables 2 and 3 give the evaluation results on the THUMOS'14 dataset for the present invention and the five other algorithms to which it is compared. The results in Table 2 measure the invention's ability to extract video behavior timeline candidate frames and its computational efficiency; recall is typically used to measure candidate frame extraction capability. The results in Table 3 measure the invention's ability to directly perform video behavior timeline positioning, with mean average precision (mAP) as the evaluation index. For both indices, larger values are better. The experimental results show that the method achieves a higher recall rate with fewer candidate frames than other current video behavior timeline candidate frame extraction methods, obtaining high-quality timeline candidate frames while effectively reducing useless positioning results.
When the method is used for directly positioning the video behavior time axis, the detection effect equivalent to that of other methods at present can be achieved under the condition of not carrying out secondary screening. The above two criteria are sufficient to illustrate the superiority of the method.
Table 2. Test evaluation results of timeline candidate frame extraction on THUMOS'14 by different methods
[The table is provided as an image in the original publication and is not reproduced here.]
Table 3. Test evaluation results of video behavior timeline positioning on the THUMOS'14 dataset by different methods
Method          mAP
SCNN [3]        19.0
Yeung [2]       17.1
DAPs [1]        13.9
CDC [5]         23.3
SSN* [4]        29.1
The invention   24.4
Note: the result marked with * involved secondary screening and adjustment of the window.
Fig. 4A is a visualization of the experimental evaluation results of the present invention, namely the recall vs. number-of-candidate-frames curve. Fig. 4B is a visualization of the experimental evaluation results of the present invention, namely the recall vs. IoU curve. Shown are the recall vs. number-of-candidates curve 8 and the recall vs. IoU curve 9 of the timeline candidate frame extraction of the present invention.
These curves allow the candidate frame extraction performance of the invention on this task to be evaluated more intuitively. The prior-art methods compared in Tables 2 and 3 and Figs. 4A and 4B are described in the following documents:
[1] Victor Escorcia, Fabian Caba Heilbron, Juan Carlos Niebles, and Bernard Ghanem, "DAPs: Deep action proposals for action understanding," in European Conference on Computer Vision. Springer, 2016, pp. 768–784.
[2] Serena Yeung, Olga Russakovsky, Greg Mori, and Li Fei-Fei, "End-to-end learning of action detection from frame glimpses in videos," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2016, pp. 2678–2687.
[3] Zheng Shou, Dongang Wang, and Shih-Fu Chang, "Temporal action localization in untrimmed videos via multi-stage CNNs," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1049–1058.
[4] Yue Zhao, Yuanjun Xiong, Limin Wang, Zhirong Wu, Xiaoou Tang, and Dahua Lin, "Temporal action detection with structured segment networks," in The IEEE International Conference on Computer Vision (ICCV), 2017, vol. 8.
[5] Zheng Shou, Jonathan Chan, Alireza Zareian, Kazuyuki Miyazawa, and Shih-Fu Chang, "CDC: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos," arXiv preprint arXiv:1703.01515, 2017.
[6] S. Buch, V. Escorcia, C. Shen, B. Ghanem, and J. C. Niebles, "SST: Single-stream temporal action proposals," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[7] F. Caba Heilbron, J. C. Niebles, and B. Ghanem, "Fast temporal activity proposals for efficient detection of human actions in untrimmed videos," in Computer Vision and Pattern Recognition, 2016, pp. 1914–1923.
It is noted that the disclosed embodiments are intended to aid further understanding of the invention, but those skilled in the art will appreciate that various substitutions and modifications are possible without departing from the spirit and scope of the invention and the appended claims. Therefore, the invention should not be limited to the disclosed embodiments; the scope of the invention is defined by the appended claims.

Claims (10)

1. A video behavior time axis candidate frame extraction method is characterized in that: the method comprises the following steps:
1) establishing a Markov decision process on a video behavior time axis positioning task;
2) performing feature extraction by using the video behavior classification depth model C3D, and storing feature maps generated by each convolution layer in the depth model in a memory;
3) connecting the depth features with decision vectors corresponding to four actions executed in the past to form a decision vector fusing current window video information and historical actions;
4) inputting the decision vector into the DQN by adopting a classic deep reinforcement learning algorithm DQN, training the DQN to solve the Markov decision process, and learning a method for automatically adjusting the position and the length of a window according to the characteristics of the current time axis window;
5) utilizing the trained DQN to automatically and continuously adjust the position and the length of a window on the basis of initializing the window so as to enable the window to be accurately close to a human action segment in a video;
6) training an action/background two classifier to judge whether the video clips under the window contain human behaviors, so as to search for and locate the human behavior clips in the video, thereby achieving the purposes of positioning the video behavior time axis and extracting the candidate frames.
2. The video behavior timeline candidate frame extraction method of claim 1, wherein the video behavior classification depth model is a C3D deep convolutional network model with 8 convolutional layers; the C3D deep convolutional network extracts the spatio-temporal information of the video through 3D convolution and pooling operations; the spatio-temporal features of the video are extracted by using a C3D network pre-trained on the Sports1M dataset, and the feature vector output by the first fully-connected layer (fc-6) of the C3D network is used as the video feature vector, the length of which is 1024.
3. The method as claimed in claim 2, wherein a Markov model is built on the video behavior timeline positioning task, and the action set A comprises six actions on the timeline window, i.e. "move left", "move right", "jump", "extend left", "extend right" and "shorten".
4. The method as claimed in claim 2, wherein the decision vectors corresponding to four actions performed in the past are connected to the fc-6 layer features of the C3D network to form a feature vector F fusing the current window video information and the historical actions.
5. The method of extracting video behavior timeline candidate frames according to claim 1, wherein the DQN algorithm for solving the Markov decision process is composed of three fully-connected layers; the DQN extracts picture features by using a convolutional neural network, inputs the features into the fully-connected layers, and solves the Markov decision process by using the fully-connected neural network; the output of the DQN is an action vector, and the network selects the action with the maximum probability to execute; the feature vector output by the fc-6 layer of the C3D network is used as the video feature vector and is fused with the vectors of the 4 actions executed by the network in the past to form the DQN input vector; the length of the vector output by the DQN is 6, corresponding to 6 operations on the time axis window.
6. The method as claimed in claim 2, wherein an action/background two classifier is used to determine whether the video segment under the window contains human behavior.
7. The method as claimed in claim 6, wherein if the window is determined to contain human behavior by the action/background two classifier, recording the window as a candidate frame, and updating the position of the window to the rightmost side of the current window.
8. The method as claimed in claim 7, wherein if the window is determined by the action/background two classifier not to include human behavior, the feature vector F is input into the DQN network, the action with the highest probability is selected, and the action is performed on the current window to update its position and length.
9. The method as claimed in claim 8, wherein the updating of the position and length of the window is performed using a fixed ratio of 0.2.
10. The method of claim 9, wherein when the window reaches the end of the video 5 times, then the positioning of the behavior within the video is considered complete.
CN201810607040.0A 2018-06-13 2018-06-13 Method for positioning video behavior time axis and extracting candidate frame Active CN108898076B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810607040.0A CN108898076B (en) 2018-06-13 2018-06-13 Method for positioning video behavior time axis and extracting candidate frame

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810607040.0A CN108898076B (en) 2018-06-13 2018-06-13 Method for positioning video behavior time axis and extracting candidate frame

Publications (2)

Publication Number Publication Date
CN108898076A CN108898076A (en) 2018-11-27
CN108898076B true CN108898076B (en) 2022-07-01

Family

ID=64344640

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810607040.0A Active CN108898076B (en) 2018-06-13 2018-06-13 Method for positioning video behavior time axis and extracting candidate frame

Country Status (1)

Country Link
CN (1) CN108898076B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109948446B (en) * 2019-02-20 2021-07-16 北京奇艺世纪科技有限公司 Video clip processing method and device and computer readable storage medium
CN109947990A (en) * 2019-03-29 2019-06-28 北京奇艺世纪科技有限公司 A kind of wonderful detection method and system
CN111259760A (en) * 2020-01-13 2020-06-09 南京新一代人工智能研究院有限公司 Dynamic target behavior identification method and device
CN112487811B (en) * 2020-10-21 2021-07-06 上海旻浦科技有限公司 Cascading information extraction system and method based on reinforcement learning

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105894020A (en) * 2016-03-30 2016-08-24 重庆大学 Specific target candidate box generating method based on gauss model
CN106650806A (en) * 2016-12-16 2017-05-10 北京大学深圳研究生院 Cooperative type deep network model method for pedestrian detection
CN107590489A (en) * 2017-09-28 2018-01-16 国家新闻出版广电总局广播科学研究院 Object detection method based on concatenated convolutional neutral net

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160260024A1 (en) * 2015-03-04 2016-09-08 Qualcomm Incorporated System of distributed planning
DE102015207415A1 (en) * 2015-04-23 2016-10-27 Adidas Ag Method and apparatus for associating images in a video of a person's activity with an event
CN105679316A (en) * 2015-12-29 2016-06-15 深圳微服机器人科技有限公司 Voice keyword identification method and apparatus based on deep neural network
US11347751B2 (en) * 2016-12-07 2022-05-31 MyFitnessPal, Inc. System and method for associating user-entered text to database entries

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105894020A (en) * 2016-03-30 2016-08-24 重庆大学 Specific target candidate box generating method based on gauss model
CN106650806A (en) * 2016-12-16 2017-05-10 北京大学深圳研究生院 Cooperative type deep network model method for pedestrian detection
CN107590489A (en) * 2017-09-28 2018-01-16 国家新闻出版广电总局广播科学研究院 Object detection method based on concatenated convolutional neutral net

Also Published As

Publication number Publication date
CN108898076A (en) 2018-11-27

Similar Documents

Publication Publication Date Title
CN108898076B (en) Method for positioning video behavior time axis and extracting candidate frame
CN108830212B (en) Video behavior time axis detection method
Narayan et al. 3c-net: Category count and center loss for weakly-supervised action localization
Yang et al. Step: Spatio-temporal progressive learning for video action detection
CN108830252B (en) Convolutional neural network human body action recognition method fusing global space-time characteristics
Xu et al. G-tad: Sub-graph localization for temporal action detection
CN108664931B (en) Multi-stage video motion detection method
CN111476302B (en) fast-RCNN target object detection method based on deep reinforcement learning
CN107862376A (en) A kind of human body image action identification method based on double-current neutral net
CN110110648B (en) Action nomination method based on visual perception and artificial intelligence
JP2020038662A (en) Learning method and learning device for detecting lane through classification of lane candidate pixel, and test method and test device using the same
CN110111304B (en) No-reference stereoscopic image quality evaluation method based on local-global feature regression
Lee et al. A memory model based on the siamese network for long-term tracking
WO2022160772A1 (en) Person re-identification method based on view angle guidance multi-adversarial attention
CN109934846A (en) Deep integrating method for tracking target based on time and spatial network
CN111797814A (en) Unsupervised cross-domain action recognition method based on channel fusion and classifier confrontation
Ni et al. Flipreid: closing the gap between training and inference in person re-identification
CN110909741A (en) Vehicle re-identification method based on background segmentation
CN113239801B (en) Cross-domain action recognition method based on multi-scale feature learning and multi-level domain alignment
CN111027347A (en) Video identification method and device and computer equipment
CN111783729A (en) Video classification method, device, equipment and storage medium
Xu et al. Revisiting few-shot activity detection with class similarity control
Xu et al. Similarity R-C3D for few-shot temporal activity detection
Yan et al. Unloc: A unified framework for video localization tasks
Wu et al. Towards high-quality temporal action detection with sparse proposals

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant