CN109409307B - Online video behavior detection method based on space-time context analysis - Google Patents

Online video behavior detection method based on space-time context analysis

Info

Publication number
CN109409307B
CN109409307B (application CN201811298487.0A)
Authority
CN
China
Prior art keywords
behavior
video
chain
frame
online
Prior art date
Legal status
Active
Application number
CN201811298487.0A
Other languages
Chinese (zh)
Other versions
CN109409307A (en)
Inventor
李楠楠
张世雄
张子尧
李革
安欣赏
张伟民
Current Assignee
Institute Of Intelligent Video Audio Technology Longgang Shenzhen
Original Assignee
Institute Of Intelligent Video Audio Technology Longgang Shenzhen
Priority date
Filing date
Publication date
Application filed by Institute Of Intelligent Video Audio Technology Longgang Shenzhen
Priority to CN201811298487.0A priority Critical patent/CN109409307B/en
Publication of CN109409307A publication Critical patent/CN109409307A/en
Application granted granted Critical
Publication of CN109409307B publication Critical patent/CN109409307B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 - Movements or behaviour, e.g. gesture recognition
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

An online video behavior detection method based on spatio-temporal context analysis adopts a deep learning framework combined with spatio-temporal context analysis to detect, online and jointly in the temporal and spatial domains, the behaviors occurring in an input video. The invention comprises two parts: behavior detection within a video segment and linking between video segments. Within a video segment, the algorithm uses an encoding-decoding model to combine the current frame with spatio-temporal dynamic information and generate candidate action regions; linking between video segments connects the candidate action regions into action chains that continuously track a specified action object from its appearance until its end, while predicting the action category in an online manner.

Description

Online video behavior detection method based on space-time context analysis
Technical Field
The invention relates to the technical field of video behavior analysis, in particular to an online video behavior detection system based on space-time context analysis and a method thereof.
Background
Video behavior detection must not only correctly classify the behaviors occurring in a given video but also locate them in both the temporal and spatial domains; it is a key step in research on understanding human behavior in video. In general, existing methods solve this problem with a two-step process: first, single-frame action detection results, including regressed object boxes and the corresponding action classification scores, are generated with a trained action detector; then the final spatio-temporal action chain is formed by concatenating or tracking the single-frame detection results over the entire video duration, usually under certain constraints, for example requiring the overlap between action detection boxes of adjacent frames to be as large as possible. The limitations of this process are mainly reflected in two aspects: 1) single-frame behavior detection uses only the current image or motion information and ignores the temporal continuity of action behaviors; 2) the linking algorithm is usually performed in an offline, batch manner, i.e., the action chain is extended from the beginning of the video to its end, and a separate time-domain pruning algorithm is then used to eliminate false detections. The present invention solves these two problems by the following means: 1) action detection is performed by combining the current frame with spatio-temporal context information; 2) an online detection mode is adopted, so that behavior-chain generation and behavior classification and prediction are completed in a single pass.
In 2017, Zhu et al. (Zhu H., Vial R., and Lu S. 2017. "A Spatio-Temporal Convolutional Regression Network for Video Action Proposal", IEEE International Conference on Computer Vision, pp. 5814-5822) proposed a regression network model for generating action proposals; it is built on ConvLSTM (Convolutional Long Short-Term Memory) and fuses spatio-temporal information with current-frame information for action detection. The drawback of this method is that, within a short video segment, usually only the later video frames can use spatio-temporal dynamic information to assist the current detection.
Disclosure of Invention
The invention aims to provide an online video behavior detection system based on space-time context analysis, which can utilize video sequence context information when detecting the action behavior of a current frame, can incrementally generate a behavior chain along with the continuous input of video frames, and dynamically classify the video behavior.
The invention also aims to provide an online video behavior detection method based on the spatio-temporal context analysis.
Compared with existing methods, the method provided by the invention has two main improvements: 1) the method is based on ConvGRU (Convolutional Gated Recurrent Unit), which, compared with ConvLSTM, is a lightweight recurrent memory model with far fewer parameters, reducing the risk of overfitting on small datasets; 2) the model of Zhu et al. is a single forward model, so only video frames near the end of the input sequence can use the fused spatio-temporal dynamic information for behavior detection, whereas the method proposed by the invention is an encoding-decoding model in which the spatio-temporal context information of the video sequence is available to every frame during decoding.
The principle of the invention is as follows: 1) a deep convolutional neural network extracts single-frame video features; the features of several consecutive frames are input into a ConvGRU to build an encoding-decoding video sequence description model, which encodes the spatio-temporal context information during forward propagation, decodes the encoded spatio-temporal dynamic information back into each frame during backward propagation, and completes action detection by combining it with the current-frame information; 2) a dynamic behavior category candidate pool is maintained, the range of possible behavior categories is gradually narrowed as the input video sequence grows, and the currently generated behavior chains are dynamically maintained through growth, termination and time-domain pruning.
The technical scheme provided by the invention is as follows:
The spatio-temporal behavior detection method provided by the invention comprises two parts: behavior detection within a video segment and linking between video segments. Within a video segment, the algorithm uses an encoding-decoding model to combine the current frame with spatio-temporal dynamic information and generate candidate action regions; linking between video segments connects the candidate action regions into action chains that continuously track a specified action object from its appearance until its end, while predicting the action category in an online manner.
An online video behavior detection system based on spatiotemporal context analysis comprises a video behavior spatiotemporal context information fusion network and a motion frame online linking and classifying algorithm; wherein: the video behavior spatiotemporal context information fusion network is used for fusing current frame information and behavior spatiotemporal context information in a video segment; the motion frame online linking and classifying algorithm is used for linking the motion frames corresponding to the same motion target in an online mode to form a complete behavior chain and classifying the behavior classes of the motion frames.
The video behavior spatio-temporal context information fusion network specifically comprises: a single-frame feature extraction network, used to extract depth expression features of the current-frame RGB image and optical flow image in a video segment; a video-segment spatio-temporal context information fusion network, namely an encoding-decoding module based on the ConvGRU model, used to extract the spatio-temporal context expression features of the video segment and fuse them with the current-frame features to obtain fusion features; and a behavior detection network, used to perform single-frame behavior detection on the fusion features, obtaining behavior classification scores and locating where the behaviors occur to generate motion frames.
The online linking and classifying algorithm for the motion frame specifically comprises the following steps: constructing a behavior category candidate pool for maintaining a specified number of behavior categories that are currently most likely to occur for a given video; the behavior category candidate pool updating algorithm is used for scoring behavior categories, gradually reducing the range of the behavior categories to which the current video possibly belongs, and realizing online rapid classification of a behavior chain; the behavior chain online growth algorithm is used for linking the behavior candidate region corresponding to the video clip to the existing behavior chain to realize online growth of the behavior chain; or determining the behavior candidate region as a new behavior chain.
An online video behavior detection method based on spatiotemporal context analysis comprises the following steps:
step 1: calculating an optical flow image for the current frame, and extracting depth expression characteristics of the RGB image and the optical flow image;
step 2: constructing a coding-decoding network to extract space-time context information of video behaviors, and fusing the space-time context information with current frame information to obtain fusion characteristics;
step 3: classifying and position-regressing the fusion features to generate motion frames, and linking the motion frames by using the Viterbi algorithm to obtain behavior candidate regions;
step 4: constructing a behavior category candidate pool and updating the behavior categories that may appear;
step 5: linking the behavior candidate region to an existing behavior chain in an online mode, or generating a new behavior chain;
step 6: fusing the detection results of the RGB image branch and the optical flow image branch to obtain a final detection result.
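For orientation, the sketch below shows how steps 1-6 could be orchestrated in code. It is a minimal illustration only: the helper callables (extract, encode_decode, detect_and_link, update_pool, grow_chains, fuse_branches) are hypothetical names standing in for the components described in this disclosure, and the per-branch bookkeeping is simplified.

```python
from typing import Callable, Sequence

def detect_online(segments: Sequence,
                  extract: Callable,         # step 1: per-frame RGB / optical-flow features
                  encode_decode: Callable,   # step 2: ConvGRU encoding-decoding fusion
                  detect_and_link: Callable, # step 3: motion frames + intra-segment Viterbi linking
                  update_pool: Callable,     # step 4: behavior category candidate pool
                  grow_chains: Callable,     # step 5: online behavior-chain growth
                  fuse_branches: Callable):  # step 6: RGB / optical-flow fusion
    """Hypothetical orchestration of steps 1-6 over one input video."""
    chains = {"rgb": [], "flow": []}
    pool = {"rgb": None, "flow": None}
    for idx, segment in enumerate(segments):
        for branch in ("rgb", "flow"):
            feats = extract(segment, branch)                                    # step 1
            fused = encode_decode(feats)                                        # step 2
            candidates = detect_and_link(fused)                                 # step 3
            pool[branch] = update_pool(pool[branch], chains[branch], idx)       # step 4
            chains[branch] = grow_chains(chains[branch], candidates, pool[branch])  # step 5
    return fuse_branches(chains["rgb"], chains["flow"])                         # step 6
```

Passing the components in as callables keeps the skeleton independent of any particular network implementation.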
Compared with the prior art, the invention has the beneficial effects that:
By using the technical scheme provided by the invention, behavior detection on a single video frame makes use of the behavior spatio-temporal context information within the video segment, which improves the accuracy of behavior detection. Meanwhile, the video behavior can be detected online; compared with traditional offline batch-processing methods this improves the timeliness of video behavior detection, so the method can be applied in scenarios with high real-time requirements, such as intelligent robots and human-computer interaction systems. Compared with existing video behavior detection techniques, on currently popular public test sets the proposed technique achieves better detection results while using fewer candidate proposals.
The invention will be further illustrated by way of example with reference to the accompanying drawings in which:
drawings
FIG. 1 is a flow chart of the present invention.
Fig. 2 is a framework diagram of the video unit motion information encoder-decoder (En-Decoder) model.
FIG. 3 is a framework diagram of a single-frame detection model of video behavior based on spatio-temporal context analysis.
FIG. 4 is a flow chart of the online dynamic updating of the video behavior chain set T_d.
In the drawings:
1 - single-frame image representation feature p'_i; 2 - ConvGRU unit; 3 - fusion expression feature p_d; 4 - image sequence contained in a video unit; 5 - feature extraction network; 6 - dimension reduction network; 7 - RPN network; 8 - Detection Network; 9 - behavior classification result; 10 - position adjustment; 11 - motion proposal score; 12 - motion proposal; 13 - time-domain pruning; 14 - computing the behavior score; 15 - constructing the behavior candidate pool; 16 - constructing the candidate set P_t; 17 - updating the behavior chain; 18 - adding a new behavior chain.
Detailed Description
The invention discloses an online video behavior detection method based on spatio-temporal context analysis, which comprises the following steps:
1) a video sequence to be detected is uniformly divided into several video segments (each segment contains 8 frames, and adjacent segments overlap by one frame);
2) optical flow is extracted for each video segment, and the original RGB images and the optical-flow images are input into the model as two independent computation branches; the RGB branch is taken as an example below, and the optical-flow branch is handled in the same way;
3) each frame image in a video segment is input into a deep convolutional network trained in advance (for behavior classification) to extract motion features;
4) the extracted motion features are input into an encoding-decoding network built from ConvGRU units to extract the spatio-temporal context information of the video behavior, which is fused with the motion information of the current frame; each frame outputs a fusion feature;
5) the fusion features are fed into a behavior classification network and a position regression network; the behaviors appearing in each frame are classified and, at the same time, the positions where they occur are localized to generate behavior boxes;
6) the Viterbi algorithm is used to link the behavior boxes detected in each frame of the video segment by behavior category, forming several behavior candidate regions; steps 7)-9) below are then executed in a loop until the input video sequence ends;
7) if the ordinal number of the current video segment is a multiple of 10, the behavior-chain time-domain pruning algorithm is executed, the score of each behavior chain with respect to each behavior category is calculated, and the behavior category candidate pool is updated so that it contains only the several behavior categories with the highest scores;
8) for a current behavior chain, if there exist behavior candidate regions whose overlapping area with it (the overlapping area between the last behavior box of the chain and the first behavior box of the candidate region) is larger than a specified threshold, the candidate region with the largest score is linked to the chain; if no behavior candidate region has an overlapping area larger than the specified threshold, the behavior chain is terminated; this operation is performed separately for the different categories in the behavior category candidate pool;
9) if a behavior candidate region is not linked to any behavior chain, it is taken as a new behavior chain;
10) the behavior-chain detection results of the RGB branch and the Optical-flow branch are fused to obtain the final detection result.
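Step 1) of the above flow can be made concrete with a small utility. The following is a minimal sketch, assuming the frames are held in an indexable sequence; 8-frame segments overlapping by one frame imply a stride of 7.

```python
def split_into_segments(frames, seg_len=8, overlap=1):
    """Uniformly split a frame sequence into seg_len-frame segments with
    `overlap` frames shared between adjacent segments (step 1 above)."""
    stride = seg_len - overlap
    segments, start = [], 0
    while start + seg_len <= len(frames):
        segments.append(frames[start:start + seg_len])
        start += stride
    return segments

# Example: 22 frames -> segments covering frames 0-7, 7-14 and 14-21.
print([(s[0], s[-1]) for s in split_into_segments(list(range(22)))])
```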
FIG. 1 is the flow chart of the present invention, in which S1-S8 correspond in sequence to steps 1)-8) of the specific operation flow below. The specific operation flow of the online video behavior detection method based on spatio-temporal context analysis is as follows:
1) Dividing the video evenly into segments (S1): given a segment of input video, it is divided evenly into several video segments, each containing 8 frames. Each segment is taken as an independent video unit from which a behavior candidate region is extracted;
2) Extracting RGB images or optical flow (S2): for each video unit, the optical-flow image of every frame is extracted as a description of the motion information. The original RGB images and the optical-flow images are input into the model as two independent branches for computation. The following mainly describes the RGB image branch; the optical-flow branch is handled in the same way;
3) Extracting the single-frame expression feature p'_i (S3): the behavior detection network framework for a single-frame image is shown in FIG. 3, where 4 denotes the image sequence contained in a video unit. The expression feature of each frame image, denoted p_i, is extracted by the feature extraction network 5. The feature extraction network is obtained by fine-tuning a VGG-16 model (Simonyan K. and Zisserman A. 2014. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv:1409.1556), and the conv5-layer features are taken. A dimensionality reduction network 6 is constructed to reduce p_i from 512 to 128 channels, denoted p'_i, which prevents overfitting of the whole network model. The dimensionality reduction network is a convolutional module composed of convolution layers with 128 channels;
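A minimal PyTorch sketch of this single-frame feature path (feature extraction network 5 plus dimension reduction network 6). It assumes torchvision's VGG-16 as the backbone, takes the output just before the last max-pool as the "conv5" feature, and uses a single 1x1 convolution for the 512-to-128 reduction; the exact layer cut and the form of the reduction module are assumptions, and fine-tuned weights would be loaded in practice.

```python
import torch
import torch.nn as nn
import torchvision

class SingleFrameFeature(nn.Module):
    """VGG-16 conv5 features (512 channels) followed by a 1x1 reduction to the
    128-channel per-frame expression feature p'_i."""
    def __init__(self):
        super().__init__()
        vgg = torchvision.models.vgg16()        # fine-tuned weights would be loaded here
        self.backbone = vgg.features[:-1]       # up to relu5_3, i.e. before the last max-pool
        self.reduce = nn.Sequential(nn.Conv2d(512, 128, kernel_size=1),
                                    nn.ReLU(inplace=True))

    def forward(self, img):                     # img: (B, 3, H, W)
        return self.reduce(self.backbone(img))  # (B, 128, H/16, W/16)

p_prime = SingleFrameFeature()(torch.randn(1, 3, 224, 224))
print(p_prime.shape)                            # torch.Size([1, 128, 14, 14])
```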
4) Extracting the fusion expression feature p_d (S4): the expression feature p'_i of each frame image is input into the ConvGRU network 2 to build the spatio-temporal fusion motion encoding of the video unit. The structure of the motion information encoder-decoder (En-Decoder) model is shown in FIG. 2, where 1 denotes the current-frame input feature p'_i and 3 denotes the fusion expression feature p_d. The En-Decoder model works over the entire video unit and includes a forward encoding process and a backward decoding process. The feature p'_i participates in both the forward encoding and the backward decoding; the specific input arrangement is shown in FIG. 2. Forward encoding accumulates the single-frame features p'_i over time to obtain a representation of the motion sequence of the video unit; backward decoding propagates this motion sequence representation back to each frame of the video unit and fuses it with the feature p'_i, yielding the feature p_d that fuses the current frame with the spatio-temporal context information;
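The following PyTorch sketch illustrates the En-Decoder of FIG. 2: a ConvGRU cell, a forward encoding pass over the video unit, and a backward decoding pass that fuses the decoded context with each frame's p'_i. The gate layout of the ConvGRU and the fusion by concatenation plus a 1x1 convolution are assumptions; the text specifies only that p'_i participates in both passes and is fused with the decoded context.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Convolutional GRU cell: all gates are 3x3 convolutions over [input, state]."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        self.conv_zr = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=k // 2)
        self.conv_h = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=k // 2)

    def forward(self, x, h=None):
        if h is None:
            h = x.new_zeros(x.size(0), self.hid_ch, x.size(2), x.size(3))
        z, r = torch.sigmoid(self.conv_zr(torch.cat([x, h], 1))).chunk(2, 1)
        h_new = torch.tanh(self.conv_h(torch.cat([x, r * h], 1)))
        return (1 - z) * h + z * h_new

class EnDecoder(nn.Module):
    """Forward pass encodes the video unit's spatio-temporal context; the backward
    pass decodes it into every frame and fuses it with that frame's feature p'_i."""
    def __init__(self, ch=128):
        super().__init__()
        self.encoder = ConvGRUCell(ch, ch)
        self.decoder = ConvGRUCell(ch, ch)
        self.fuse = nn.Conv2d(2 * ch, ch, kernel_size=1)   # fusion layer (assumed form)

    def forward(self, feats):                       # feats: list of (B, C, H, W), length T_p
        h = None
        for x in feats:                             # forward encoding, accumulated over time
            h = self.encoder(x, h)
        out, d = [None] * len(feats), h
        for t in reversed(range(len(feats))):       # backward decoding into each frame
            d = self.decoder(feats[t], d)
            out[t] = self.fuse(torch.cat([feats[t], d], 1))   # fused feature p_d for frame t
        return out

p_d = EnDecoder(128)([torch.randn(1, 128, 14, 14) for _ in range(8)])
print(len(p_d), p_d[0].shape)                       # 8 torch.Size([1, 128, 14, 14])
```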
5) Calculating the outputs of the RPN network and the Detection Network (S5): the feature p_d is input into the RPN network 7, which computes the motion proposal score 11, denoted s_r, and the motion proposal 12, denoted p_r. The RPN network is a 2-layer 3x3 convolutional network that slides over p_d and calculates a motion proposal score s_r at each position; a position whose score is greater than a specified value (e.g. 0.5) is regarded as a motion proposal p_r. The Detection Network 8 accepts p_d and p_r as input and outputs the behavior classification result 9, denoted s_c, and the position adjustment 10, denoted delta_r. The Detection Network consists of 2 fully connected layers with 1024 hidden units; the behavior classification result s_c includes a classification score for each behavior class and the background class, and the position adjustment delta_r gives 3 position deviations (center position, width and height) for each behavior class. From p_r and delta_r, a refined behavior candidate box b_t can be calculated;
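A sketch of the two heads described above, under assumptions: anchor handling is omitted (one score per spatial position), and the Detection Network is shown operating on a fixed-size feature pooled from p_d inside each proposal (the pooling step itself, e.g. RoI pooling, is not detailed here).

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Two 3x3 convolution layers sliding over the fused feature p_d; outputs a
    motion-proposal score s_r at every spatial position (threshold e.g. 0.5)."""
    def __init__(self, ch=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 1, 3, padding=1))

    def forward(self, p_d):                        # p_d: (B, 128, H, W)
        return torch.sigmoid(self.net(p_d))        # (B, 1, H, W) proposal score map

class DetectionHead(nn.Module):
    """Two fully connected layers with 1024 hidden units; outputs classification
    scores s_c for C behavior classes plus background, and 3 position offsets
    (center, width, height) per behavior class, as described above."""
    def __init__(self, n_classes, ch=128, pool=7):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(ch * pool * pool, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, 1024), nn.ReLU(inplace=True))
        self.cls = nn.Linear(1024, n_classes + 1)      # behavior classes + background
        self.reg = nn.Linear(1024, 3 * n_classes)      # per-class (center, width, height)

    def forward(self, roi_feat):                   # roi_feat: (N, 128, 7, 7), pooled from p_d
        h = self.fc(roi_feat)
        return torch.softmax(self.cls(h), dim=1), self.reg(h)

scores, offsets = DetectionHead(n_classes=21)(torch.randn(4, 128, 7, 7))
print(scores.shape, offsets.shape)                 # torch.Size([4, 22]) torch.Size([4, 63])
```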
6) Computing the video unit behavior candidate region p (S6): denote the RPN motion proposal score of b_t as s_r(b_t). The Viterbi algorithm is used to link the boxes b_t of the different image frames within the same video unit into a behavior candidate region p, as shown in formula (1):

p = \arg\max_{\{b_t\}} \left[ \sum_{t=1}^{T_p} s_r(b_t) + \lambda \sum_{t=2}^{T_p} \mathrm{IoU}(b_t, b_{t-1}) \right]    (1)

where T_p is the video unit duration, taken here as 8; IoU(b_t, b_{t-1}) is the Intersection over Union of b_t and b_{t-1}; and the harmonic coefficient λ is taken as 0.5;
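Formula (1) can be implemented as a short dynamic-programming (Viterbi) pass over the per-frame candidate boxes. The sketch below is illustrative: boxes are assumed to be [x1, y1, x2, y2] arrays, and only the single best path per video unit is returned.

```python
import numpy as np

def iou(a, b):
    """Intersection over Union of two boxes given as [x1, y1, x2, y2]."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def link_unit_boxes(boxes, scores, lam=0.5):
    """Viterbi linking of formula (1): pick one box b_t per frame maximising
    sum_t s_r(b_t) + lam * sum_t IoU(b_t, b_{t-1}).
    boxes:  list over frames, each an array of shape (N_t, 4)
    scores: list over frames, each an array (N_t,) of RPN proposal scores s_r."""
    T = len(boxes)
    dp = np.asarray(scores[0], dtype=float)          # best score ending at each box of frame 0
    back = []
    for t in range(1, T):
        pair = np.array([[dp[i] + lam * iou(boxes[t - 1][i], boxes[t][j])
                          for i in range(len(boxes[t - 1]))]
                         for j in range(len(boxes[t]))])   # shape (N_t, N_{t-1})
        back.append(pair.argmax(axis=1))
        dp = pair.max(axis=1) + np.asarray(scores[t], dtype=float)
    idx = [int(dp.argmax())]                          # backtrack the best-scoring path
    for t in range(T - 2, -1, -1):
        idx.append(int(back[t][idx[-1]]))
    idx.reverse()
    return [boxes[t][idx[t]] for t in range(T)], float(dp.max())
```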
7) Computing the video behavior chain set T_d (S7): as video is continuously input, the behavior candidate region p corresponding to each video unit is obtained, and a dynamically growing video behavior chain set T_d is obtained through the following rules (a)-(f). FIG. 4 is the flow chart of the online dynamic updating of the video behavior chain set T_d; rules (a)-(f) are applied to each behavior chain T in T_d. The main idea is to maintain a dynamically updated behavior category candidate pool, gradually reducing the number of candidate behavior categories according to judgments made on the continuously input video, and to decide, according to the set linking method, whether a newly generated candidate region p is linked to an existing video behavior chain or used as a new behavior chain. As shown in FIG. 4, the specific steps are:
(a) Time-domain pruning 13: if the number of elements in the current set T_d is greater than the upper bound N_d, the update ends. Otherwise, time-domain pruning is performed on T with the Viterbi algorithm, as shown in formula (2):

\{l_t\} = \arg\max_{l_1, \ldots, l_{T_l}} \left[ \sum_{t=1}^{T_l} s_{l_t}(b_t) - \lambda_\omega \sum_{t=2}^{T_l} \omega(l_t, l_{t-1}) \right]    (2)

where T_l is the number of boxes b_t contained in the behavior chain T; l_t ∈ {0, c} is the category label of b_t, with 0 the background category and c the behavior category; if l_t = c, s_{l_t}(b_t) is the Detection Network class-c classification score s_c(b_t), and if l_t = 0, s_{l_t}(b_t) is defined as 1 - s_c(b_t); ω(l_t, l_{t-1}) = 0 if l_t = l_{t-1} and 0.7 otherwise; λ_ω = 0.4. Through time-domain pruning, the background boxes contained in T are removed;
(b) Calculating the behavior score 14: for T, its score s(T) with respect to each behavior category is calculated; s(T) is defined as the average of the scores s(p) of all p belonging to T; similarly, the score s(p) of p is defined as the average of the scores s_c(b_t) of all b_t belonging to p;
(c) Constructing the behavior candidate pool 15: the behavior category candidate pool is constructed in descending order of s(T), specifically: i) at the beginning, all categories are retained; ii) when processing the 10th video unit, the top 5 behavior categories are retained; iii) when processing the 20th video unit, the top 3 behavior categories are retained; iv) from the 30th video unit onward, only the top-ranked behavior category is retained. Let the upper limit of behavior categories in the current candidate pool be N_p; for each behavior category j ≤ N_p in the candidate pool, rules (d)-(e) are executed:
(d) Constructing the candidate set P_t 16: the newly generated behavior candidate region p is added to the set P_t (P_t is initially empty) if the IoU between T and p is greater than a specified threshold (e.g. 0.5). The IoU between T and p is defined as the IoU between the last behavior candidate box belonging to T and the first behavior candidate box belonging to p;
(e) Updating the behavior chain 17: if P_t is not empty, the candidate p' (p' ∈ P_t) with the largest score s(p') is linked to T, i.e. p' is appended to the end of T to form a new behavior chain T', and T is updated to T';
(f) Adding a new behavior chain 18: if a candidate p_new (p_new ∈ P_t) is not linked to any T, p_new is added to the set T_d as a new behavior chain.
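The sketch below illustrates rules (a)-(f) in simplified form: temporal_trim implements the two-label Viterbi of formula (2), pool_limit implements the schedule of rule (c), and grow_or_terminate implements rules (d)-(f). The dictionary structure used for chains and candidate regions, the handling of terminated chains as a separate "finished" list, and the restriction to a single behavior category (in the method, rules (d)-(e) run once per category in the candidate pool) are all simplifying assumptions.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes [x1, y1, x2, y2] (same helper as in the sketch above)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def temporal_trim(sc, lam_w=0.4, change=0.7):
    """Formula (2): label each box of a chain as background (0) or class c (1)
    with a two-state Viterbi pass; indices of the class-c boxes are returned.
    sc: per-box class-c Detection Network scores s_c(b_t) of the chain."""
    sc = np.asarray(sc, dtype=float)
    unary = np.stack([1.0 - sc, sc], axis=1)            # columns: label 0, label c
    T = len(sc)
    dp, back = unary[0].copy(), np.zeros((T, 2), dtype=int)
    for t in range(1, T):
        new = np.empty(2)
        for l in (0, 1):
            trans = dp - lam_w * change * (np.arange(2) != l)   # label-change penalty
            back[t, l] = int(trans.argmax())
            new[l] = trans.max() + unary[t, l]
        dp = new
    labels = np.empty(T, dtype=int)
    labels[-1] = int(dp.argmax())
    for t in range(T - 1, 0, -1):
        labels[t - 1] = back[t, labels[t]]
    return np.flatnonzero(labels == 1)

def pool_limit(unit_idx, n_classes):
    """Rule (c): all classes at the start, top 5 from the 10th video unit,
    top 3 from the 20th, only the top class from the 30th onward."""
    if unit_idx >= 30:
        return 1
    if unit_idx >= 20:
        return 3
    if unit_idx >= 10:
        return 5
    return n_classes

def grow_or_terminate(chains, candidates, iou_thresh=0.5):
    """Rules (d)-(f) for one behavior category.  A chain / candidate region is a
    dict with 'boxes' (list of [x1, y1, x2, y2]) and 'score' (illustrative)."""
    linked, active, finished = set(), [], []
    for T_chain in chains:
        P_t = [(i, p) for i, p in enumerate(candidates)
               if iou(T_chain["boxes"][-1], p["boxes"][0]) > iou_thresh]
        if P_t:                                          # rule (e): grow with the best candidate
            i, best = max(P_t, key=lambda ip: ip[1]["score"])
            T_chain["boxes"] = T_chain["boxes"] + list(best["boxes"])
            linked.add(i)
            active.append(T_chain)
        else:                                            # no overlap: the chain is terminated
            finished.append(T_chain)
    for i, p in enumerate(candidates):                   # rule (f): unlinked regions start new chains
        if i not in linked:
            active.append({"boxes": list(p["boxes"]), "score": p["score"]})
    return active, finished
```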
8) Fusing the RGB and Optical-flow detection results (S8): the behavior chain detection results of the RGB branch and the Optical-flow branch are fused to obtain the final detection result. The fusion method is as follows: let T_rgb be a behavior chain of the RGB branch and T_opt a behavior chain of the Optical-flow branch; if the IoU between T_rgb and T_opt is greater than a specified threshold (e.g. 0.7), the behavior chain corresponding to max(s(T_rgb), s(T_opt)) is retained and the other is deleted; otherwise, both behavior chains are retained.
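A sketch of this branch fusion, reusing the iou() helper and the chain dictionaries from the previous sketch. Because the text does not define the IoU between two chains, the sketch assumes it is the mean per-frame box IoU over their common length.

```python
def chain_iou(chain_a, chain_b):
    """Overlap of two behavior chains, taken here as the mean per-frame box IoU
    over their common length (assumed definition)."""
    n = min(len(chain_a["boxes"]), len(chain_b["boxes"]))
    if n == 0:
        return 0.0
    return sum(iou(chain_a["boxes"][t], chain_b["boxes"][t]) for t in range(n)) / n

def fuse_branches(chains_rgb, chains_flow, iou_thresh=0.7):
    """S8: when an RGB chain and an Optical-flow chain overlap strongly, keep only
    the higher-scoring one; otherwise keep both."""
    kept, used_flow = [], set()
    for T_rgb in chains_rgb:
        winner = T_rgb
        for j, T_opt in enumerate(chains_flow):
            if j not in used_flow and chain_iou(T_rgb, T_opt) > iou_thresh:
                winner = T_rgb if T_rgb["score"] >= T_opt["score"] else T_opt
                used_flow.add(j)                 # the matched flow chain is consumed
                break
        kept.append(winner)
    kept += [T for j, T in enumerate(chains_flow) if j not in used_flow]
    return kept
```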
Taking mAP (mean Average Precision) as the evaluation criterion, the method provided by the invention achieves the best action detection results on the J-HMDB-21 dataset; a comparison with other methods is shown in Table 1:
Method                  mAP@0.5    mAP@0.5:0.95
Gurkirt et al. [1]      72.0       41.6
ACT [2]                 72.2       43.2
Peng and Schmid [3]     73.1       -
Harkirat et al. [4]     67.3       36.1
The invention           75.9       44.8

TABLE 1. Comparison with other methods ('-' indicates not reported; higher numbers are better)
The methods compared in table 1 are listed below:
[1] G. Singh, S. Saha, and F. Cuzzolin, "Online real time multiple spatiotemporal action localisation and prediction on a single platform," arXiv, 2016.
[2] V. Kalogeiton, P. Weinzaepfel, V. Ferrari, and C. Schmid, "Action tubelet detector for spatio-temporal action localization," in IEEE International Conference on Computer Vision, 2017, pp. 4415-4423.
[3] X. Peng and C. Schmid, "Multi-region two-stream R-CNN for action detection," European Conference on Computer Vision, pp. 744-759, 2016.
[4] H. Behl, M. Sapienza, G. Singh, S. Saha, F. Cuzzolin, and P. H. Torr, "Incremental tube construction for human action detection," arXiv, 2017.
it is noted that the disclosed embodiments are intended to aid in further understanding of the invention, but those skilled in the art will appreciate that: various substitutions and modifications are possible without departing from the spirit and scope of the invention and appended claims. Therefore, the invention should not be limited to the embodiments disclosed, but the scope of the invention is defined by the appended claims.

Claims (5)

1. An online video behavior detection method based on spatio-temporal context analysis is used for detecting online video behaviors and is characterized by comprising the following steps:
step 1: calculating an optical flow image for the current frame, extracting depth expression features of the RGB image and the optical flow image, specifically, constructing an additional convolution network on a conv5 layer of a VGG16 network structure to realize extraction of depth features of specified dimensions;
step 2: constructing a coding-decoding network to extract space-time context information of video behaviors, and fusing the space-time context information with current frame information to obtain fusion characteristics, namely constructing a coding-decoding module by using a ConvGRU model to realize the fusion of the space-time context information and the current frame information;
step 3: classifying and position regressing the fusion features to generate motion frames, and linking the motion frames by using the Viterbi algorithm to obtain behavior candidate regions;
step 4: constructing a behavior category candidate pool and updating the behavior categories that may appear, namely constructing and maintaining a candidate pool of possible behavior categories and gradually reducing the number of behavior categories in the candidate pool as the video is continuously input, so as to judge the behavior category online;
step 5: linking the behavior candidate region to an existing behavior chain in an online mode, or generating a new behavior chain;
step 6: fusing the detection results of the RGB image branch and the optical flow image branch to obtain a final detection result; specifically, if the overlap rate of two behavior chains generated by the two different branches is greater than a specified threshold, the behavior chain with the higher score is retained, otherwise both behavior chains are retained.
2. The on-line video behavior detection method based on spatio-temporal context analysis according to claim 1, characterized in that: before step 1, the method further comprises the following steps:
a video sequence to be detected is evenly divided into a plurality of video segments, wherein each segment contains 8 frames and adjacent segments overlap by one frame.
3. The on-line video behavior detection method based on spatio-temporal context analysis according to claim 1, characterized in that:
in the step 3, the fusion features are accessed into a behavior classification network and a position regression network, the behaviors appearing in each frame are classified, and meanwhile, the positions where the behaviors occur are positioned, so that a behavior frame is generated.
4. The on-line video behavior detection method based on spatio-temporal context analysis according to claim 1, characterized in that: in step 4, if the ordinal number of the current video segment is a multiple of 10, a behavior chain time domain pruning algorithm is executed, the score of the behavior chain relative to each behavior category is calculated, and the behavior category candidate pool is updated so that it includes only the several behavior categories with the largest scores.
5. The on-line video behavior detection method based on spatio-temporal context analysis according to claim 1, characterized in that: in the step 5, for a current behavior chain, if there exist behavior candidate regions whose overlapping area with the chain is larger than a specified threshold, the behavior candidate region with the largest score is linked to the behavior chain; if no behavior candidate region has an overlapping area larger than the specified threshold, the behavior chain is terminated.
CN201811298487.0A 2018-11-02 2018-11-02 Online video behavior detection method based on space-time context analysis Active CN109409307B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811298487.0A CN109409307B (en) 2018-11-02 2018-11-02 Online video behavior detection method based on space-time context analysis

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811298487.0A CN109409307B (en) 2018-11-02 2018-11-02 Online video behavior detection method based on space-time context analysis

Publications (2)

Publication Number Publication Date
CN109409307A CN109409307A (en) 2019-03-01
CN109409307B true CN109409307B (en) 2022-04-01

Family

ID=65471476

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811298487.0A Active CN109409307B (en) 2018-11-02 2018-11-02 Online video behavior detection method based on space-time context analysis

Country Status (1)

Country Link
CN (1) CN109409307B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110059761A (en) * 2019-04-25 2019-07-26 成都睿沿科技有限公司 A kind of human body behavior prediction method and device
CN110348345B (en) * 2019-06-28 2021-08-13 西安交通大学 Weak supervision time sequence action positioning method based on action consistency
CN110472230B (en) * 2019-07-11 2023-09-05 平安科技(深圳)有限公司 Chinese text recognition method and device
CN111414876B (en) * 2020-03-26 2022-04-22 西安交通大学 Violent behavior identification method based on time sequence guide space attention
CN111523421B (en) * 2020-04-14 2023-05-19 上海交通大学 Multi-person behavior detection method and system based on deep learning fusion of various interaction information
CN113569824B (en) * 2021-09-26 2021-12-17 腾讯科技(深圳)有限公司 Model processing method, related device, storage medium and computer program product
CN115424207B (en) * 2022-09-05 2023-04-14 南京星云软件科技有限公司 Self-adaptive monitoring system and method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1466104A (en) * 2002-07-03 2004-01-07 Institute of Computing Technology, Chinese Academy of Sciences Statistics and rule combination based phonetic driving human face cartoon method
CN107330920A (en) * 2017-06-28 2017-11-07 Huazhong University of Science and Technology A kind of monitor video multi-target tracking method based on deep learning
CN108573246A (en) * 2018-05-08 2018-09-25 Beijing University of Technology A kind of sequential action identification method based on deep learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8027542B2 (en) * 2007-06-18 2011-09-27 The Regents Of The University Of California High speed video action recognition and localization

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1466104A (en) * 2002-07-03 2004-01-07 Institute of Computing Technology, Chinese Academy of Sciences Statistics and rule combination based phonetic driving human face cartoon method
CN107330920A (en) * 2017-06-28 2017-11-07 Huazhong University of Science and Technology A kind of monitor video multi-target tracking method based on deep learning
CN108573246A (en) * 2018-05-08 2018-09-25 Beijing University of Technology A kind of sequential action identification method based on deep learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Bo Huang et al., "Convolutional Gated Recurrent Units Fusion for Video Action Recognition", ICONIP 2017: Neural Information Processing, 2017-10-28, Section 3.3 *
Guo Huiwen, "Research on Crowd Abnormal Event Detection Methods Based on Multiple Carriers", China Doctoral Dissertations Full-text Database, Information Science and Technology, No. 5, 2018-05-15, pp. 23-60, Chapters 3-5 *
Guo Huiwen, "Research on Crowd Abnormal Event Detection Methods Based on Multiple Carriers", China Doctoral Dissertations Full-text Database, Information Science and Technology, No. 5, 2018 *

Also Published As

Publication number Publication date
CN109409307A (en) 2019-03-01

Similar Documents

Publication Publication Date Title
CN109409307B (en) Online video behavior detection method based on space-time context analysis
CN107330362B (en) Video classification method based on space-time attention
CN110188343B (en) Multi-mode emotion recognition method based on fusion attention network
CN110164476B (en) BLSTM voice emotion recognition method based on multi-output feature fusion
CN113158875B (en) Image-text emotion analysis method and system based on multi-mode interaction fusion network
CN110096950A (en) A kind of multiple features fusion Activity recognition method based on key frame
CN103943107B (en) A kind of audio frequency and video keyword recognition method based on Decision-level fusion
CN110781776B (en) Road extraction method based on prediction and residual refinement network
CN111372123B (en) Video time sequence segment extraction method based on local to global
CN111767847B (en) Pedestrian multi-target tracking method integrating target detection and association
CN101187990A (en) A session robotic system
CN115205730A (en) Target tracking method combining feature enhancement and template updating
CN104461000B (en) A kind of on-line continuous human motion identification method based on a small amount of deleted signal
CN110110648B (en) Action nomination method based on visual perception and artificial intelligence
CN111476771B (en) Domain self-adaption method and system based on distance countermeasure generation network
CN107657625A (en) Merge the unsupervised methods of video segmentation that space-time multiple features represent
CN111652357A (en) Method and system for solving video question-answer problem by using specific target network based on graph
CN110569823B (en) Sign language identification and skeleton generation method based on RNN
CN115495552A (en) Multi-round dialogue reply generation method based on two-channel semantic enhancement and terminal equipment
CN111198966A (en) Natural language video clip retrieval method based on multi-agent boundary perception network
CN115424177A (en) Twin network target tracking method based on incremental learning
CN116564355A (en) Multi-mode emotion recognition method, system, equipment and medium based on self-attention mechanism fusion
CN113989933B (en) Online behavior recognition model training and detecting method and system
Zhao et al. Robust online tracking with meta-updater
CN114399661A (en) Instance awareness backbone network training method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant