CN115641529A - Weak supervision time sequence behavior detection method based on context modeling and background suppression - Google Patents
- Publication number: CN115641529A
- Application number: CN202211208771.0A
- Authority: CN (China)
Abstract
The invention discloses a weakly supervised temporal action detection method based on context modeling and background suppression, comprising the following steps: dividing a video into multiple non-overlapping segments, extracting the spatial features and temporal motion features of the video scene, and constructing a video-level feature representation from the segments; designing a memory bank M as the learning source for action localization and modeling the contextual information of the video with a self-attention module; adding an auxiliary background class and suppressing the input features of background frames with a filtering module; and combining the refined segment features with the foreground weights in an attention-weighted pool to realize video-level prediction. The scheme introduces a self-attention module that models the latent temporal structure of action segments during the feature-modeling and prediction stages, refining action features with different attributes and ensuring the completeness of action instances. It further adds an auxiliary background class: the filtering module attenuates input features from background frames and creates negative samples of the background class, so that the network learns background-segment features, suppresses the influence of background noise, and improves the accuracy and quality of action detection.
Description
Technical Field
The invention belongs to the technical field of video temporal action detection, and in particular relates to a weakly supervised temporal action detection method based on context modeling and background suppression.
Background
With the spread of multimedia, the Internet, and recording devices, video data has grown explosively. Temporal action localization in video normally requires a large amount of annotation for training; accurate temporal boundary annotation is extremely expensive and consumes enormous manpower and financial resources, which greatly limits the application of temporal action detection algorithms and motivates weakly supervised approaches. Weakly supervised action localization uses only video-level labels during training, which reduces the waste of human resources and time as well as labeling errors, and offers good flexibility.
Most existing weakly supervised temporal action localization (WTAL) methods fall into two categories. The first, inspired by the weakly supervised semantic image segmentation task, treats WTAL as a video recognition task: a foreground-background separating attention mechanism builds video-level features, and an action classifier then recognizes the video. The second formulates the task as a Multiple Instance Learning (MIL) problem, treating each untrimmed video as a bag containing positive and negative instances, i.e. action instances and background frames (a background frame is a video segment that does not belong to any category to be detected). Segments are classified over time to generate a Class Activation Sequence (CAS); video-level predictions are obtained by temporally pooling the CAS, and action proposals are generated by thresholding the segment-level class scores.
Both kinds of method aim to learn an effective classification function that distinguishes action instances from background frames. However, existing methods do not fully model the action detection problem and still face two challenges: localization completeness and background interference:
(1) Localization completeness: for a continuous temporal action, recognition often over-relies on the feature regions that contribute most to classification, which leads to incomplete localization. Fig. 1 shows an example of a diving action: (a) is the ground-truth localization, and (b) is the prediction of a MIL-based method. The MIL framework captures only the most discriminative portions of the full diving action, which may yield high classification confidence but poor localization performance.
(2) Background interference: background frames are trained to be classified into the video's action classes even though they carry no action characteristics; pushing background frames towards action classes in this inconsistent way leads to false positives and degraded detection performance. Existing weakly supervised methods train directly on the video class labels, considering only the action classes and no background class, so erroneous detections are inevitable and limit the detection accuracy of the model.
Disclosure of Invention
The invention aims to solve the problems of incomplete localization and background interference in the prior art, and provides a weakly supervised temporal action detection method based on context modeling and background suppression.
The invention is realized with the following technical scheme. A weakly supervised temporal action detection method based on context modeling and background suppression comprises the following steps:
Step A: divide the video into multiple non-overlapping segments, extract the spatial features and temporal motion features of the video scene, and fuse them into a video-level feature representation;
Step B: design a memory bank M as the learning source for action localization, and model the contextual information of the video with a self-attention module to extract segment-level action features and train a segment-level classifier;
Step C: add an auxiliary background class, and suppress the input features of background frames with a filtering module to prevent background-noise interference and obtain the foreground attention weights;
Step D: combining step B and step C, perform iterative optimization training of the network, then combine the refined segment-level action features with the foreground attention weights in an attention-weighted pool to realize video-level prediction.
Further, the step A is specifically realized as follows:
A uniform sampling strategy divides the video V_i into T non-overlapping segments; a feature extractor yields scene spatial features x_i^{RGB} and temporal motion features x_i^{Flow}, which are fused into two-stream segment-level features x_i ∈ R^{2D}, i ∈ [1, T], building the video-level feature representation X_i^e = [x_1, ..., x_T] ∈ R^{2D×T}, where D denotes the feature dimension.
Further, the step B specifically comprises the following steps:
(1) Store the video-level features X_i^e obtained in step A in a memory bank M ∈ R^{T×2D}, and use encoders E_Q, E_K and E_V to generate the query, key and value corresponding to the video segments:

K_i = E_K(M)
V_i = E_V(M)

where K_i ∈ R^{T×2D/m} and V_i ∈ R^{T×(C+1)·2D} are the keys and values, and m is a hyper-parameter controlling the memory-reading efficiency;
(2) Encode the video-level features X_i^e into a set of queries Q_i ∈ R^{T×2D/m} with the encoder E_Q, then compute the similarity scores between the query and the video segments and use them to aggregate the contextual information into the refined segment-level action features:

X̂_i^e = (softmax(Q_i Q_i^⊤) + I) X_i^e

where I is an identity matrix that preserves the original video information, and X̂_i^e keeps the same dimensions as X_i^e; through this information exchange among the segments, global contextual information is extracted and more discriminative features, easier to classify and localize, are obtained;
(3) Compute the interaction between Q_i and K_i to obtain the correlation between different segments, giving the network a global view, and finally aggregate the correlation scores into the similarity matrix V_i^o:

V_i^o = softmax(Q_i K_i^⊤) V_i

where V_i^o ∈ R^{T×(C+1)·2D};
(4) Reshape the similarity matrix V_i^o into a set of segment-level classifiers, which adapt to the appearance or motion variation of each segment, and use V_i^o to compute the sparse loss L_s when training the segment-level classifiers:

L_s = ‖V_i^o‖_1

where ‖·‖_1 is the L1 loss, which encourages background frames to have low similarity to all action segments.
Further, the step C is specifically realized as follows:
(1) Feed the video-level features X_i^e into the filtering module, which consists of two temporal 1D convolutions and a Sigmoid function; by training with a target opposite to the background class, the background frames are suppressed and the foreground attention weights are obtained:

W_i = f_φ(X_i^e), W_i ∈ [0, 1]^T

where f_φ is a function with parameters φ;
(2) Using the ground-truth action classes ŷ_j and the predicted scores p_j, construct a binary cross-entropy loss L_sup for each class to train the filtering module:

L_sup = −(1/(C+1)) Σ_j [ŷ_j log p_j + (1 − ŷ_j) log(1 − p_j)]

where p_j is the predicted score of class j.
Further, in the step D, the video-level prediction combines step B and step C and is specifically realized as follows:
Apply the classifiers to the corresponding segments; the video-level classification result p_i is obtained from the attention-weighted pool:

p_i = Σ_{t=1}^{T} W_i(t) s_t / Σ_{t=1}^{T} W_i(t)

where s_t is the segment-level class score; the action classification loss between the predictions of the N videos and the video labels y_i is:

L_act = −(1/N) Σ_{i=1}^{N} Σ_{c=1}^{C+1} y_i(c) log p_i(c)

where L_act is the action classification loss and C is the total number of action classes.
Further, in the step D, the iterative optimization training of the network specifically proceeds as follows:
(1) Combining step B and step C, define a joint loss function:

L_tol = λ_1·L_sup + λ_2·L_act + λ_3·L_s

where λ_1, λ_2 and λ_3 are hyper-parameters that balance the contribution of each loss term;
(2) Video localization inference:
1) Threshold the video-level prediction scores, discarding the classes whose confidence score falls below a threshold θ_cls;
2) For each remaining class, apply a threshold θ_act to the foreground attention weights to generate action proposals.
To assign a confidence to each action proposal, the class activation sequence (CAS) is first computed and passed through a Softmax along the class dimension to obtain the class score at each temporal position; the confidence q of an action proposal (c, q, t_s, t_e) is then computed from these class scores within the proposal interval.
Finally, since one action instance may occur multiple times in an untrimmed video, the scheme uses class-wise non-maximum suppression (NMS) to remove highly overlapping action proposals.
Compared with the prior art, the invention has the following advantages and positive effects:
(1) For the localization-completeness problem, the scheme introduces self-attention to model the latent temporal structure of action segments during the feature-modeling and prediction stages:
traditional MIL methods treat video segments as independent instances and ignore the latent temporal structure during the feature-modeling and prediction phases, so the action proposals generated from the CAS are of low quality. This scheme first designs a memory bank M as the learning source for action localization and introduces a self-attention module to model the contextual information of the video, refining the action features, encouraging smoother temporal classification scores, and achieving complete localization;
(2) For the background-interference problem, a filtering module is designed:
besides action segments, an untrimmed video contains a large number of background frames, and a weakly supervised method with only video-level annotation cannot tell which frames are background and which are action segments, so many background segments are mistaken for the actions to be detected. The scheme adds an auxiliary background class and suppresses the input features of background frames with a filtering module to prevent background-noise interference.
Drawings
FIG. 1 is a diagram of action capture by an existing MIL framework; (a) shows the ground-truth localization and (b) the prediction of a MIL-based method;
FIG. 2 illustrates the overall network architecture of an embodiment of the present invention;
FIG. 3 is a schematic diagram of the self-attention mechanism of an embodiment of the present invention;
FIG. 4 shows localization results of the present invention on THUMOS14, wherein (a) shows the basketball result; (b) the shot-put and discus-throw results; and (c) the ice-dancing result.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention more clearly understood, the present invention will be further described with reference to the accompanying drawings and examples. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those described herein, and thus, the present invention is not limited to the specific embodiments disclosed below.
The present solution proposes the general framework shown in fig. 2, which comprises a self-attention module and a filtering module. Before delving into the details, the problem is formally stated:
problem description:
assuming N training videos, for each video V i All have real label y i ∈R C+1 Where C +1 is the number of action classes; if action class j exists in the video, then y i (j) =1, otherwise y i (j) =0. During testing, the goal of temporal motion localization is to generate a set of motion proposals for each video { (c, q, t) s ,t e ) Where c denotes the prediction class, q is the confidence score, t s And t e The representations are the action start time and the end time, respectively. The invention aims to solve the problem that the starting and stopping boundaries of behavior examples in un-edited videos are positioned and corresponding behavior categories are identified for training data only labeled by video-level categories, and the key points are designed as follows:
key point 1: how to model behavior instance integrity:
in the absence of fine-grained temporal boundary annotation for un-clipped video, it becomes very difficult to detect complete and accurate behavior instances. According to the scheme, a self-attention module is introduced, and the potential time structure of the action segment is modeled in the characteristic modeling and predicting stage, so that the action characteristics with different attributes are refined, smooth segment classification scores are encouraged, and the completeness of the behavior instance is guaranteed.
Key point 2: how to suppress background interference:
A weakly supervised method with only video-level annotation cannot distinguish background frames from action segments, so many background segments are mistaken for the actions to be detected. The scheme adds an auxiliary background class, attenuates the input features from background frames through a filtering module, and creates negative samples of the background class, so that the features of background segments are learned, the influence of background noise is suppressed, and the accuracy and quality of action detection are improved.
This embodiment provides a weakly supervised temporal action detection method based on context modeling and background suppression, comprising the following steps:
Step A: divide the video into multiple non-overlapping segments, extract the spatial features and temporal motion features of the video scene, and fuse them into video-level features;
Step B: design a memory bank M as the learning source for action localization, and introduce a self-attention module to model the contextual information of the video, extracting segment-level action features to achieve complete localization;
Step C: add an auxiliary background class, and suppress the input features of background frames with a filtering module to prevent background-noise interference and obtain the foreground attention weights;
Step D: combining step B and step C, train the network iteratively to realize video-level prediction.
Specifically, the scheme of the invention is described in detail below:
1. Feature extraction
A uniform sampling strategy divides the video V_i into T non-overlapping 16-frame segments; an I3D feature extractor yields scene spatial features x_i^{RGB} and temporal motion features x_i^{Flow}; the RGB and Flow segment-level features are then fused into x_i ∈ R^{2D}, i ∈ [1, T], building the video-level feature representation X_i^e = [x_1, ..., x_T] ∈ R^{2D×T}, where D denotes the feature dimension.
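The segment-and-fuse construction above can be sketched as follows. This is a minimal NumPy illustration with random arrays standing in for the I3D two-stream features; the function names and toy dimensions are our own, not from the patent:

```python
import numpy as np

def sample_segments(num_frames, T, seg_len=16):
    """Uniformly pick T non-overlapping segment start indices covering the video."""
    starts = np.linspace(0, max(num_frames - seg_len, 0), T).astype(int)
    return starts

def build_video_features(rgb_feats, flow_feats):
    """Fuse two-stream segment features by concatenation: (T, D) + (T, D) -> (T, 2D)."""
    return np.concatenate([rgb_feats, flow_feats], axis=1)

# Toy example: T = 5 segments, D = 4 dimensions per stream
T, D = 5, 4
rng = np.random.default_rng(0)
rgb = rng.normal(size=(T, D))    # stand-in for I3D RGB features
flow = rng.normal(size=(T, D))   # stand-in for I3D optical-flow features
X = build_video_features(rgb, flow)
print(X.shape)  # (5, 8), i.e. T x 2D
```

In practice the per-stream dimension is D = 1024 (see the implementation details below), giving 2048-dimensional fused segment features.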
2. Self-attention module
(1) Store the video-level feature representation of the T segments (each of dimension 2D) in a memory bank M ∈ R^{T×2D}, and use encoders E_Q, E_K and E_V to generate the corresponding query (Q), key (K) and value (V) for the video segments;
E_K reduces the segment dimension; its keys store appearance and motion information of the segments for efficient reading from the memory, and it is implemented by a fully connected (FC) layer. E_V is an MLP of two FC layers with a bottleneck structure between them to reduce parameters; it encodes each segment into a class-specific feature for classification.
K_i ∈ R^{T×2D/m} and V_i ∈ R^{T×(C+1)·2D} are the keys and values, and m is a hyper-parameter controlling the memory-reading efficiency. Given the memory bank M and the input video, video classification and background suppression proceed as described next.
(2) For video classification, the encoder E_Q, implemented by an FC layer, encodes the video-level features X_i^e into a set of queries Q_i ∈ R^{T×2D/m}; the similarity scores between the query and the video segments are then computed and used to aggregate the contextual information into the refined segment features, as shown in fig. 3:

X̂_i^e = (softmax(Q_i Q_i^⊤) + I) X_i^e

where I is an identity matrix that preserves the original video information, and X̂_i^e keeps the same dimensions as X_i^e. Through this information exchange among segments, global contextual information is extracted and more discriminative features for classification and localization are obtained. Fig. 3 applies self-attention to each query segment and aggregates contextual information by computing similarities with the other segments; ⊕ and ⊗ denote element-wise addition and matrix multiplication, and T and 2D denote the number of video segments and the feature dimension, respectively.
(3) Compute the interaction between Q_i and K_i to obtain the correlation between different segments, giving the network a global view, and finally aggregate the correlation scores into the similarity matrix:

V_i^o = softmax(Q_i K_i^⊤) V_i

where V_i^o ∈ R^{T×(C+1)·2D};
(4) For subsequent classification, reshape the similarity matrix V_i^o into a set of segment classifiers, which adapt to the appearance or motion variation of each segment, and use V_i^o to compute the sparse loss:

L_s = ‖V_i^o‖_1

where ‖·‖_1 is the L1 loss, which encourages background frames to have low similarity to all action segments.
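The context-aggregation-with-identity-skip step and an L1 sparsity term can be sketched as below. This is an illustrative NumPy version only: the random projection matrices, the separate query/key projections, and the 1/√d scaling are common self-attention conventions assumed here, not details quoted from the patent:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def refine_features(X, Wq, Wk):
    """X: (T, 2D) segment features. Aggregate context across segments and
    add an identity skip so each segment keeps its original information."""
    Q, K = X @ Wq, X @ Wk                       # (T, d) queries and keys
    A = softmax(Q @ K.T / np.sqrt(Q.shape[1]))  # (T, T) similarity scores
    X_hat = (A + np.eye(X.shape[0])) @ X        # identity preserves original info
    return X_hat, A

def sparse_loss(V):
    """Mean-absolute (L1) sparsity penalty on an aggregated similarity/value matrix."""
    return np.abs(V).mean()

# Toy run: T = 6 segments, 2D = 8 features, d = 4 query/key dimensions
rng = np.random.default_rng(1)
T, D2, d = 6, 8, 4
X = rng.normal(size=(T, D2))
X_hat, A = refine_features(X, rng.normal(size=(D2, d)), rng.normal(size=(D2, d)))
```

Each row of A sums to 1, so every refined segment is its own feature plus a convex combination of the other segments' features.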
3. Suppression module
(1) To create negative samples of the background class, the video-level feature representation X_i^e is fed to the filtering module, which consists of two temporal 1D convolutions and a Sigmoid function; by training with a target opposite to the background class, the background frames are suppressed and the module returns the foreground attention weights W_i = f_φ(X_i^e), W_i ∈ [0, 1]^T, where f_φ is a function with parameters φ. W_i selects the set of segments free of any background activity, which are regarded as negative samples of the background class.
(2) In this process, the ground-truth action classes ŷ_j and the predicted scores p_j are used to construct a binary cross-entropy loss L_sup for each class as a constraint:

L_sup = −(1/(C+1)) Σ_j [ŷ_j log p_j + (1 − ŷ_j) log(1 − p_j)]

(3) Finally, the classifiers are applied to the corresponding segments, and the video-level classification result p_i is obtained from the attention-weighted pool:

p_i = Σ_{t=1}^{T} W_i(t) s_t / Σ_{t=1}^{T} W_i(t)

where s_t is the segment-level class score; the action classification loss between the predictions of the N videos and the video labels y_i is:

L_act = −(1/N) Σ_{i=1}^{N} Σ_{c=1}^{C+1} y_i(c) log p_i(c)
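The filtering module (two temporal 1D convolutions followed by a Sigmoid) and the attention-weighted pool can be sketched as follows. The kernel size of 3, the ReLU between the convolutions, and all parameter shapes are illustrative assumptions; the patent only specifies "two temporal 1D convolutions and a Sigmoid":

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def temporal_conv1d(X, W, b):
    """'Same'-padded 1D convolution over time. X: (T, Cin), W: (k, Cin, Cout)."""
    k = W.shape[0]
    pad = k // 2
    Xp = np.pad(X, ((pad, pad), (0, 0)))
    out = np.stack([np.tensordot(Xp[t:t + k], W, axes=([0, 1], [0, 1]))
                    for t in range(X.shape[0])])
    return out + b

def foreground_weights(X, params):
    """Two temporal 1D convs + Sigmoid -> per-segment weight W in [0, 1]^T."""
    h = np.maximum(temporal_conv1d(X, params['W1'], params['b1']), 0)  # ReLU
    w = sigmoid(temporal_conv1d(h, params['W2'], params['b2']))        # (T, 1)
    return w[:, 0]

def attention_pool(scores, w):
    """Weighted average of segment-level class scores (T, C+1) by foreground weights."""
    return (w[:, None] * scores).sum(axis=0) / (w.sum() + 1e-8)

# Toy run: T = 8 segments, Cin = 6 input channels, H = 4 hidden channels, C+1 = 5 classes
rng = np.random.default_rng(2)
T, Cin, H = 8, 6, 4
params = {'W1': rng.normal(size=(3, Cin, H)) * 0.1, 'b1': np.zeros(H),
          'W2': rng.normal(size=(3, H, 1)) * 0.1, 'b2': np.zeros(1)}
X = rng.normal(size=(T, Cin))
w = foreground_weights(X, params)
scores = rng.normal(size=(T, 5))
video_score = attention_pool(scores, w)
```

In training, the pooled video score would feed the classification loss while w itself is supervised against the auxiliary background class.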
4. Network training and inference
(1) Combining step 2 and step 3, define a joint loss function:

L_tol = λ_1·L_sup + λ_2·L_act + λ_3·L_s (8)

where λ_1, λ_2 and λ_3 are hyper-parameters that balance the contribution of each loss term.
(2) After the model is trained, action localization is performed in two steps:
First, threshold the video-level prediction scores, discarding the classes whose confidence score falls below a threshold θ_cls;
Then, for each remaining class, apply a threshold θ_act to the foreground attention weights to generate action proposals.
To assign a confidence to each proposal, the class activation sequence (CAS) is first computed and passed through a Softmax along the class dimension to obtain the class score at each temporal position t (t denotes the segment index); the confidence q of an action proposal (c, q, t_s, t_e) is then computed from these class scores within the proposal interval.
To remove action proposals with high overlap (high overlap arises when the same action instance in an untrimmed video is covered by multiple proposals, of which only one should be kept), the scheme applies class-wise non-maximum suppression (NMS).
Implementation details:
the present embodiment uses a dual-stream I3D network as a feature extractor, applies a TV-L1 algorithm to extract optical flow from RGB data, and sets D =1024. In the formula (8), λ 1 =λ 2 =0.8,λ 3 And =0.2. In the inference process, the threshold θ cls Is set to 0.1 (the value is generally 0.1-1), theta act Is a video V i The mean of foreground weights of the corresponding category. And uses a threshold of 0.3 NMS-like to remove the highly overlapping propofol. The model is a network framework based on PyTorch deep learning, the whole experiment is carried out on a single GTX 3060GPU, adam optimization is used for training, the learning rate is 10 -4 The batch size is 20.
As shown in fig. 4, the localization results on THUMOS14 are presented. Each example has three plots with several sample frames: the first indicates the ground truth; the second and third show the segment activation sequences of the self-attention module and the filtering module respectively, where the horizontal axis is the time step of the video and the vertical axis the activation strength, ranging from 0 to 1.
Figure 4 qualitatively illustrates the results of the proposed algorithm on the THUMOS14 test set. Fig. 4 (a) concerns a frequently occurring action in which all frames share similar elements, i.e. a person and a basketball; by introducing the sparsity loss L_s during context modeling, the model seeks the slight differences between action and action and between action and background, avoiding context confusion. Fig. 4 (b) contains action instances from two different classes, "ThrowDiscus" and "Shotput"; although the visual appearance and motion patterns are very similar across all frames, the method of the invention is still able to locate most of the time intervals of the multiple actions. Fig. 4 (c) depicts a single action, ice dancing, whose background looks very similar to the foreground; even so, the model separates action from context through self-attention context modeling and the suppression of background frames by the filtering module.
The method addresses open problems in weakly supervised action recognition and localization, such as inaccurate action-boundary localization caused by background-frame interference and incomplete action localization caused by candidate segments being discarded arbitrarily. To solve these problems, the invention designs a context-modeling framework and a learned background-suppression paradigm for the weakly supervised temporal action localization task. The first problem is addressed by modeling the latent temporal structure of the action segments during the feature-modeling and prediction stages and refining the action features of different attributes, thereby encouraging smooth segment classification scores. The second problem is addressed by adding an auxiliary background class and suppressing the input features of background frames with the filtering module, preventing background-noise interference. Combining the high-quality classification scores with the accurate foreground weights markedly improves video-level prediction. Extensive experiments on the THUMOS14 and ActivityNet1.2 datasets demonstrate the effectiveness and feasibility of the method.
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the scope of the invention.
Claims (6)
1. A weakly supervised temporal action detection method based on context modeling and background suppression, characterized by comprising the following steps:
step A: dividing the video into multiple non-overlapping segments, extracting the spatial features and temporal motion features of the video scene, and fusing them into a video-level feature representation;
step B: designing a memory bank M as the learning source for action localization, and modeling the contextual information of the video with a self-attention module to extract segment-level action features and train a segment-level classifier;
step C: adding an auxiliary background class, and suppressing the input features of background frames through a filtering module to prevent background-noise interference and obtain the foreground attention weights;
step D: combining step B and step C, performing iterative optimization training of the network, and then combining the refined segment-level action features with the foreground attention weights in an attention-weighted pool to realize video-level prediction.
2. The weakly supervised temporal action detection method based on context modeling and background suppression according to claim 1, wherein step A is specifically realized as follows:
a video V_i is divided into T non-overlapping segments by a uniform sampling strategy; a feature extractor extracts the scene spatial features and the temporal motion features of each segment; the two-stream segment-level features are then fused to obtain x_i ∈ R^{2D}, i ∈ [1, T], so as to build the video-level feature representation X_i ∈ R^{T×2D}, where D denotes the feature dimension.
3. The weakly supervised temporal action detection method based on context modeling and background suppression according to claim 2, wherein step B specifically comprises the following steps:
(1) The video-level feature X_i obtained in step A is stored in a memory bank M, M ∈ R^{T×2D}; encoders E_Q, E_K and E_V generate the query, key and value of the video segments, respectively:
K_i = E_K(M)
V_i = E_V(M)
where K_i ∈ R^{T×2D/m} and V_i ∈ R^{T×(C+1)2D} are the key and value, and m is a hyper-parameter controlling the memory reading efficiency;
(2) Based on the encoder E_Q, the video-level feature X_i is encoded as a set of queries Q_i, Q_i ∈ R^{T×2D/m}; similarity scores between video segments are then computed from the queries, and contextual information is aggregated with these scores to obtain the refined segment-level action features:
X̃_i = (A_i + I) X_i
where A_i is the normalized segment-similarity matrix, and I is an identity matrix used to retain the original video information; X_i and X̃_i have the same dimensions. Through information transmission among the segments, global contextual information is extracted, yielding more discriminative features that are easier to classify and localize;
(3) The interaction between Q_i and K_i is computed to obtain the correlation between different segments, giving the network a global view; finally, the similarity matrix V_i^o is obtained by aggregation as follows:
V_i^o = softmax(Q_i K_i^T / √(2D/m)) V_i
where V_i^o ∈ R^{T×(C+1)2D};
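The memory read of sub-steps (1)-(3) can be illustrated with a minimal numpy sketch. The encoders E_Q, E_K, E_V are stood in for by random projection matrices, and the scaled-dot-product normalization is an assumption; a real implementation would use learned layers.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attend_memory(M, Wq, Wk, Wv):
    """Sketch of sub-steps (1)-(3): read queries, keys and values from
    the memory bank M (T x 2D), then aggregate the values with the
    normalized query-key similarity to obtain V_o."""
    Q, K, V = M @ Wq, M @ Wk, M @ Wv
    A = softmax(Q @ K.T / np.sqrt(Q.shape[1]), axis=-1)  # (T, T) segment similarity
    return A @ V, A  # aggregated matrix V_o and the attention map A

rng = np.random.default_rng(1)
T, twoD, m, dv = 16, 64, 4, 32
M = rng.standard_normal((T, twoD))              # memory bank of step A features
Wq = rng.standard_normal((twoD, twoD // m))     # stand-in for encoder E_Q
Wk = rng.standard_normal((twoD, twoD // m))     # stand-in for encoder E_K
Wv = rng.standard_normal((twoD, dv))            # stand-in for encoder E_V
V_o, A = attend_memory(M, Wq, Wk, Wv)
print(V_o.shape)  # (16, 32)
```

Each row of A sums to one, so every segment's output is a convex combination of all segments' values, which is what gives the network its global view.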
(4) The similarity matrix V_i^o is re-modeled into a set of segment-level classifiers that accommodates the appearance or motion variation of each segment; V_i^o is used to compute a sparse loss function L_s to train the segment-level classifiers:
L_s = (1/T) ||V_i^o||_1
where ||·||_1 denotes the L1 norm, which encourages background frames to have low similarity to all action segments.
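The sparsity objective reduces to an averaged L1 norm over the segment responses. A minimal numpy illustration follows; the exact normalization (mean over segments) is an assumption, since the claim only names the L1 norm.

```python
import numpy as np

def sparsity_loss(V_o):
    """Sketch of the sparse objective of sub-step (4): the mean L1 norm
    of the segment-level classifier responses, which pushes background
    segments toward low similarity with every action class."""
    return float(np.abs(V_o).sum(axis=1).mean())

V_o = np.array([[0.0, 0.0],    # background-like segment: no penalty
                [1.0, -2.0]])  # action-like segment contributes |1| + |-2| = 3
print(sparsity_loss(V_o))  # 1.5
```

Because the penalty is linear in the magnitudes, minimizing it drives weak (background) responses exactly to zero rather than merely shrinking them, which is the standard motivation for an L1 sparsity term.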
4. The weakly supervised temporal action detection method based on context modeling and background suppression according to claim 3, wherein step C is specifically realized as follows:
(1) The filtering module, comprising two temporal 1D convolutions and a Sigmoid function, suppresses background frames by training with an objective opposite to that of the background class, yielding the foreground attention weight W_i = f(X_i; φ), W_i ∈ [0, 1], where f is a function with parameter φ;
(2) Using the real behavior category y_j and the predicted score p_j, a binary cross-entropy loss L_sup is constructed for each class to train the filtering module:
L_sup = −(1/C) Σ_j [ y_j log p_j + (1 − y_j) log(1 − p_j) ]
where p_j denotes the predicted score of class j and L_sup denotes the binary cross-entropy loss.
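A minimal sketch of the filtering module of step C, under stated assumptions: the two learned temporal 1D convolutions are replaced by fixed hypothetical kernels `w1`/`w2` over a 1-D summary of the segment features, with a ReLU between them and a sigmoid at the end, as the claim describes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def foreground_attention(x, w1, w2):
    """Sketch of the filtering module: two temporal 1-D convolutions
    (ReLU in between, sigmoid at the end) producing one foreground
    weight per segment, each in [0, 1]. w1/w2 are hypothetical
    stand-ins for the learned parameters phi."""
    h = np.maximum(np.convolve(x, w1, mode="same"), 0.0)  # conv + ReLU
    return sigmoid(np.convolve(h, w2, mode="same"))       # conv + sigmoid

def bce_loss(y, p, eps=1e-7):
    """Per-class binary cross-entropy of sub-step (2)."""
    p = np.clip(p, eps, 1 - eps)
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)).mean())

x = np.array([0.1, 0.2, 2.0, 2.1, 0.1])  # 1-D summary of segment features
W = foreground_attention(x,
                         w1=np.array([0.25, 0.5, 0.25]),
                         w2=np.array([1.0, 1.0, 1.0]))
print(W.shape)  # (5,)
```

The sigmoid guarantees W_i ∈ [0, 1], so the weights can directly gate segment features in the pooling of step D.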
5. The weakly supervised temporal action detection method based on context modeling and background suppression according to claim 4, wherein in step D the video-level prediction is realized by combining steps B and C, specifically as follows:
the classifiers are applied to the corresponding segments, and the video-level classification result ŷ_i is obtained from the attention-weighted pooling:
the action classification loss is determined from the predictions and the true video labels y_i over N videos:
L_act = −(1/N) Σ_{i=1}^{N} Σ_{c=1}^{C+1} y_{i,c} log ŷ_{i,c}
where L_act denotes the action classification loss and C+1 denotes the total number of action classes (including the auxiliary background class).
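The attention-weighted pooling of claim 5 can be sketched as follows. The segment scores `cas` and the normalization of the pooled logits by a softmax are assumptions consistent with the surrounding claims; real values would come from the segment-level classifiers of step B and the filtering module of step C.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def video_level_prediction(cas, W, eps=1e-7):
    """Sketch of step D's attention-weighted pooling: segment scores
    cas (T x (C+1)) are pooled with foreground weights W (T,) and
    normalized into a single video-level class distribution."""
    pooled = (W[:, None] * cas).sum(axis=0) / (W.sum() + eps)
    return softmax(pooled)

def action_classification_loss(y, p, eps=1e-7):
    """Cross-entropy between the true video label y and prediction p."""
    return float(-(y * np.log(p + eps)).sum())

rng = np.random.default_rng(2)
T, C = 8, 3                        # 8 segments, 3 actions + 1 background class
cas = rng.standard_normal((T, C + 1))
W = rng.uniform(size=T)            # foreground attention weights from step C
p = video_level_prediction(cas, W)
print(p.shape)  # (4,)
```

Segments with small foreground weight contribute little to the pooled score, which is precisely how background suppression improves the video-level prediction.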
6. The weakly supervised temporal action detection method based on context modeling and background suppression according to claim 4, wherein the iterative optimization training of the network in step D specifically adopts the following method:
(1) Combining steps B and C, a joint loss function is defined:
L_tol = λ_1 L_sup + λ_2 L_act + λ_3 L_s
where λ_1, λ_2 and λ_3 are hyper-parameters balancing the contribution of each loss term;
(2) Video localization inference:
1) A threshold is set on the video-level prediction scores, and categories whose confidence scores fall below the threshold θ_cls are discarded;
2) For each remaining category, a threshold θ_act is applied to the foreground attention weights to generate action proposals:
to assign a confidence to each action proposal, the class activation sequence (CAS) is first computed and then passed through Softmax along the class dimension to obtain the class score at each temporal location; the confidence q of an action proposal (c, q, t_s, t_e) is then derived from these scores. Finally, non-maximum suppression (NMS) is applied to remove highly overlapping action proposals.
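The inference stage of claim 6 can be sketched as two small routines: thresholding the foreground attention weights into contiguous proposals, then plain temporal NMS. The merging scheme for consecutive above-threshold segments and the IoU criterion are standard assumptions; the claim itself does not spell them out.

```python
import numpy as np

def generate_proposals(W, theta_act):
    """Sketch of inference step 2): threshold the foreground attention
    weights and merge runs of consecutive above-threshold segments
    into (start, end) action proposals."""
    keep = W >= theta_act
    proposals, start = [], None
    for t, k in enumerate(keep):
        if k and start is None:
            start = t                      # open a new proposal
        elif not k and start is not None:
            proposals.append((start, t))   # close the current proposal
            start = None
    if start is not None:
        proposals.append((start, len(W)))
    return proposals

def temporal_nms(proposals, scores, iou_thresh=0.5):
    """Plain temporal non-maximum suppression: keep the highest-scoring
    proposal, drop any later one whose temporal IoU with a kept
    proposal exceeds the threshold."""
    order = np.argsort(scores)[::-1]
    kept = []
    for i in order:
        s1, e1 = proposals[i]
        overlaps = False
        for j in kept:
            s2, e2 = proposals[j]
            inter = max(0, min(e1, e2) - max(s1, s2))
            union = (e1 - s1) + (e2 - s2) - inter
            if union > 0 and inter / union > iou_thresh:
                overlaps = True
                break
        if not overlaps:
            kept.append(i)
    return [proposals[i] for i in kept]

W = np.array([0.1, 0.9, 0.8, 0.2, 0.7, 0.6, 0.1])
props = generate_proposals(W, theta_act=0.5)
print(props)  # [(1, 3), (4, 6)]
```

In the full method each surviving proposal would additionally carry the class c and the CAS-derived confidence q before NMS is applied per class.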
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211208771.0A CN115641529A (en) | 2022-09-30 | 2022-09-30 | Weak supervision time sequence behavior detection method based on context modeling and background suppression |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115641529A true CN115641529A (en) | 2023-01-24 |
Family
ID=84941570
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211208771.0A Pending CN115641529A (en) | 2022-09-30 | 2022-09-30 | Weak supervision time sequence behavior detection method based on context modeling and background suppression |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115641529A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116030538A (en) * | 2023-03-30 | 2023-04-28 | 中国科学技术大学 | Weak supervision action detection method, system, equipment and storage medium |
CN116503959A (en) * | 2023-06-30 | 2023-07-28 | 山东省人工智能研究院 | Weak supervision time sequence action positioning method and system based on uncertainty perception |
CN116503959B (en) * | 2023-06-30 | 2023-09-08 | 山东省人工智能研究院 | Weak supervision time sequence action positioning method and system based on uncertainty perception |
CN116612420A (en) * | 2023-07-20 | 2023-08-18 | 中国科学技术大学 | Weak supervision video time sequence action detection method, system, equipment and storage medium |
CN116612420B (en) * | 2023-07-20 | 2023-11-28 | 中国科学技术大学 | Weak supervision video time sequence action detection method, system, equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Chen et al. | Once for all: a two-flow convolutional neural network for visual tracking | |
Xu et al. | Segregated temporal assembly recurrent networks for weakly supervised multiple action detection | |
Cao et al. | An attention enhanced bidirectional LSTM for early forest fire smoke recognition | |
CN112115995B (en) | Image multi-label classification method based on semi-supervised learning | |
Zhou et al. | Attention-driven loss for anomaly detection in video surveillance | |
Durand et al. | Wildcat: Weakly supervised learning of deep convnets for image classification, pointwise localization and segmentation | |
Jing et al. | Videossl: Semi-supervised learning for video classification | |
CN115641529A (en) | Weak supervision time sequence behavior detection method based on context modeling and background suppression | |
CN110210335B (en) | Training method, system and device for pedestrian re-recognition learning model | |
US11640714B2 (en) | Video panoptic segmentation | |
An | Anomalies detection and tracking using Siamese neural networks | |
Yao et al. | R²IPoints: Pursuing Rotation-Insensitive Point Representation for Aerial Object Detection | |
CN113283282A (en) | Weak supervision time sequence action detection method based on time domain semantic features | |
Zhao et al. | Real-time pedestrian detection based on improved YOLO model | |
Bodesheim et al. | Pre-trained models are not enough: active and lifelong learning is important for long-term visual monitoring of mammals in biodiversity research—individual identification and attribute prediction with image features from deep neural networks and decoupled decision models applied to elephants and great apes | |
Deshpande et al. | Anomaly detection in surveillance videos using transformer based attention model | |
Vainstein et al. | Modeling video activity with dynamic phrases and its application to action recognition in tennis videos | |
Lee et al. | License plate detection via information maximization | |
Pramono et al. | Relational reasoning for group activity recognition via self-attention augmented conditional random field | |
CN113128410A (en) | Weak supervision pedestrian re-identification method based on track association learning | |
Yu et al. | Self-label refining for unsupervised person re-identification | |
Zhang et al. | Action detection with two-stream enhanced detector | |
Anusha et al. | Object detection using deep learning | |
Pham et al. | Vietnamese Scene Text Detection and Recognition using Deep Learning: An Empirical Study | |
Huang et al. | Bidirectional tracking scheme for visual object tracking based on recursive orthogonal least squares |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||