CN115641529A - Weak supervision time sequence behavior detection method based on context modeling and background suppression


Info

Publication number
CN115641529A
Authority
CN
China
Prior art keywords
video
background
action
segment
level
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211208771.0A
Other languages
Chinese (zh)
Inventor
王传旭
王静
闫春娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao University of Science and Technology
Original Assignee
Qingdao University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao University of Science and Technology filed Critical Qingdao University of Science and Technology
Priority to CN202211208771.0A priority Critical patent/CN115641529A/en
Publication of CN115641529A publication Critical patent/CN115641529A/en
Pending legal-status Critical Current

Abstract

The invention discloses a weakly supervised temporal action detection method based on context modeling and background suppression, comprising the following steps: dividing a video into a plurality of non-overlapping segments, extracting the spatial features and temporal motion features of the video scene, and constructing a video-level feature representation from the segments; designing a memory bank M as the learning source for action localization, and modeling the contextual information of the video with a self-attention module; adding an auxiliary background class and suppressing the input features of background frames with a filtering module; and combining the refined segment features and the foreground weights through attention-weighted pooling to obtain the video-level prediction. The scheme introduces a self-attention module that models the latent temporal structure of action segments during the feature modeling and prediction stages, refining action features with different attributes and guaranteeing the completeness of action instances. It further adds an auxiliary background class and attenuates input features from background frames through the filtering module, creating negative samples of the background class so as to learn the features of background segments, suppress the influence of background noise, and improve the accuracy and quality of action detection.

Description

Weak supervision time sequence behavior detection method based on context modeling and background suppression
Technical Field
The invention belongs to the technical field of video temporal action detection, and particularly relates to a weakly supervised temporal action detection method based on context modeling and background suppression.
Background
With the popularization and development of multimedia, the Internet, and recording devices, video data has grown explosively. Temporal action localization in video requires a large amount of annotation for training and learning, and accurate temporal boundary annotation is extremely costly, consuming substantial manpower and financial resources. This greatly limits the application of temporal action detection algorithms and motivates weakly supervised temporal action detection. Weakly supervised action localization uses only video-level labels during training, which further reduces the waste of human resources and time as well as annotation errors, and offers good flexibility.
Most existing weakly supervised temporal action localization (WTAL) methods fall into two categories. The first, inspired by the weakly supervised image semantic segmentation task, treats weakly supervised temporal action detection as a video recognition task: a foreground-background separation attention mechanism is introduced to construct video-level features, and an action classifier is then applied to recognize the video. The second formulates the task as a Multiple Instance Learning (MIL) problem, treating the whole untrimmed video as a bag containing positive and negative instances, i.e. action instances and background frames (a background frame is a video segment that does not belong to any category to be detected). Segment-wise classification over time generates a Class Activation Sequence (CAS), and action proposals are then generated by temporally merging the CAS obtained from the video-level prediction and thresholding the segment-level class scores.
Both of the above approaches aim to learn an effective classification function that distinguishes actions from background frames. However, existing methods do not fully model the action detection problem and still face two challenges: localization completeness and background interference.
(1) Localization completeness: for a continuous temporal action, recognition often over-relies on the feature regions that contribute most to classification, resulting in incomplete localization. Fig. 1 shows an example of a diving action: (a) is the ground-truth localization and (b) the prediction of an MIL-based method; the MIL framework captures only the most discriminative positions within the full diving action, which may yield a high classification confidence but does not yield good localization performance.
(2) Background interference: background frames are trained to be classified into the action classes of the video even though they do not exhibit the characteristics of the action; forcing background frames towards action classes in this way leads to false positives and degrades detection performance. Existing weakly supervised methods train directly on the video class labels, considering only the action classes and ignoring the background class, which inevitably produces false detections and limits the detection accuracy of the model.
Disclosure of Invention
The invention aims to solve the problems of incomplete localization and background interference in the prior art, and provides a weakly supervised temporal action detection method based on context modeling and background suppression.
The invention is realized by adopting the following technical scheme: a weakly supervised temporal action detection method based on context modeling and background suppression comprises the following steps:
Step A: dividing the video into a plurality of non-overlapping segments, extracting the spatial features and temporal motion features of the video scene, and fusing them to construct the video-level feature representation;
Step B: designing a memory bank M as the learning source for action localization, and modeling the contextual information of the video with a self-attention module to extract segment-level action features and train segment-level classifiers;
Step C: adding an auxiliary background class and suppressing the input features of background frames through a filtering module, so as to prevent interference from background noise and obtain the foreground attention weights;
Step D: combining Step B and Step C, performing iterative optimization training of the network, and combining the refined segment-level action features and the foreground attention weights through attention-weighted pooling to obtain the video-level prediction.
Further, Step A is specifically implemented as follows:
Using a uniform sampling strategy, the video $V_i$ is divided into $T$ non-overlapping segments, and a feature extractor extracts the scene spatial features $X_i^{RGB} \in \mathbb{R}^{T \times D}$ and the temporal motion features $X_i^{Flow} \in \mathbb{R}^{T \times D}$. The two-stream segment-level features are then fused to obtain $x_i \in \mathbb{R}^{2D}, i \in [1, T]$, which in turn builds the video-level feature representation $X_i^e = [x_1, \ldots, x_T] \in \mathbb{R}^{2D \times T}$, where $D$ denotes the feature dimension.
Further, Step B specifically includes the following steps:
(1) The video-level features $X_i^e$ obtained in Step A are stored in a memory bank $M$, $M \in \mathbb{R}^{T \times 2D}$, and the encoders $E_Q$, $E_K$ and $E_V$ generate the query, key and value of the video segments, respectively:
$$K_i = E_K(M), \qquad V_i = E_V(M)$$
where $K_i \in \mathbb{R}^{T \times 2D/m}$ and $V_i \in \mathbb{R}^{T \times (C+1)\cdot 2D}$ are the keys and values, and $m$ is a hyper-parameter controlling the memory-reading efficiency;
(2) The encoder $E_Q$ encodes the video-level features $X_i^e$ into a set of queries $Q_i$, $Q_i \in \mathbb{R}^{T \times 2D/m}$; similarity scores between the video segments are then computed with the queries, and the contextual information is aggregated with these scores to obtain the refined segment-level action features:
$$\hat{X}_i^e = \left(\mathrm{softmax}\!\left(Q_i K_i^\top\right) + I\right) X_i^e$$
where $I$ is an identity matrix used to retain the original video information, and $\hat{X}_i^e$ and $X_i^e$ have the same dimensions. Through the information propagation among segments, global contextual information is extracted and more discriminative features, easier to classify and localize, are obtained;
(3) The interaction between $Q_i$ and $K_i$ is computed to obtain the correlations between different segments, giving the network a global view, and the similarity matrix $V_i^o$ is finally obtained by aggregation as follows:
$$V_i^o = \mathrm{softmax}\!\left(Q_i K_i^\top\right) V_i$$
where $V_i^o \in \mathbb{R}^{T \times (C+1)\cdot 2D}$;
(4) The similarity matrix $V_i^o$ is reshaped into a set of $T$ segment-level classifiers, each of size $2D \times (C+1)$, which accommodate the appearance or motion variations of each segment; $V_i^o$ is used to compute the sparse loss $L_s$ to train the segment-level classifiers:
$$L_s = \frac{1}{T}\left\| V_i^o \right\|_1$$
where $\|\cdot\|_1$ is the L1 loss, which encourages background frames to have low similarity to all action segments.
Further, Step C is specifically implemented as follows:
(1) The video-level features $X_i^e$ are fed into the filtering module, which consists of two temporal 1D convolutions and a Sigmoid function; background frames are suppressed by training with a target opposite to that of the background class, yielding the foreground attention weights $W_i = f_\phi(X_i^e)$, $W_i \in [0,1]$, where $f_\phi$ is a function with parameters $\phi$;
(2) The ground-truth action classes $y_i$ and the predicted scores $p_j$ are used to construct a binary cross-entropy loss $L_{sup}$ over each class to train the filtering module:
$$L_{sup} = -\frac{1}{C+1}\sum_{j=1}^{C+1}\left[y_i(j)\log p_j + \left(1 - y_i(j)\right)\log\left(1 - p_j\right)\right]$$
where $p_j$ denotes the prediction score of class $j$.
Further, in Step D, the video-level prediction is realized by combining Step B and Step C, specifically as follows:
The segment-level classifiers are applied to the corresponding segments, and the video-level classification result $\hat{p}_i \in \mathbb{R}^{C+1}$ is obtained from the attention-weighted pooling:
$$\hat{p}_i = \mathrm{softmax}\!\left(\frac{\sum_{t=1}^{T} W_i(t)\, s_{i,t}}{\sum_{t=1}^{T} W_i(t)}\right)$$
where $s_{i,t}$ denotes the classification score of segment $t$ of video $V_i$. The action classification loss is determined by the predictions of the $N$ videos and the video labels $y_i$:
$$L_{act} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C+1} y_i(j)\log \hat{p}_i(j)$$
where $L_{act}$ denotes the action classification loss and $C+1$ denotes the total number of action classes (including the background class).
Further, in Step D, the iterative optimization training of the network specifically adopts the following approach:
(1) Combining Step B and Step C, a joint loss function is defined:
$$L_{tol} = \lambda_1 L_{sup} + \lambda_2 L_{act} + \lambda_3 L_s$$
where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are hyper-parameters balancing the contribution of each loss term;
(2) Video localization inference:
1) A threshold is set on the video-level prediction scores $\hat{p}_i$, and categories whose confidence score is below the threshold $\theta_{cls}$ are discarded;
2) For each remaining category, the threshold $\theta_{act}$ is applied to the foreground attention weights to generate action proposals:
To assign a confidence to each action proposal, the class activation sequence (CAS) is first computed and then passed through a Softmax along the class dimension to obtain the class score $\hat{s}_t(c)$ at each temporal location $t$; the confidence $q$ of an action proposal $(c, q, t_s, t_e)$ is then:
$$q = \frac{1}{t_e - t_s + 1}\sum_{t = t_s}^{t_e} \hat{s}_t(c)$$
Finally, since one action instance may occur multiple times in the untrimmed video, the scheme uses class-wise non-maximum suppression (NMS) to remove highly overlapping action proposals.
Compared with the prior art, the invention has the advantages and positive effects that:
(1) For the localization completeness problem, the scheme introduces self-attention to model the latent temporal structure of action segments during the feature modeling and prediction stages:
Traditional MIL methods treat video segments as independent instances and ignore the modeling of the latent temporal structure during the feature modeling and prediction stages, resulting in low-quality action proposals generated from the CAS. In this scheme, a memory bank M is first designed as the learning source for action localization, and a self-attention module is introduced to model the contextual information of the video, thereby refining the action features, encouraging smoother temporal classification scores, and achieving complete localization;
(2) For the background interference problem, a filtering module is designed:
Besides the action segments, an untrimmed video contains a large number of background frames; with only video-level annotation, a weakly supervised method cannot tell which frames are background and which segments are actions, so many background segments are mistaken for the action to be detected. This scheme adds an auxiliary background class and suppresses the input features of background frames with the filtering module to prevent interference from background noise.
Drawings
FIG. 1 is a diagram of the actions captured by an existing MIL framework: (a) the ground-truth localization, (b) the prediction of an MIL-based method;
FIG. 2 is a diagram illustrating an overall network architecture according to an embodiment of the present invention;
FIG. 3 is a schematic diagram illustrating a self-attention application principle of an embodiment of the present invention;
FIG. 4 shows the localization results of the present invention on THUMOS14, wherein (a) shows the result for basketball; (b) shows the results for shot put and discus throw; and (c) shows the result for ice dancing.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention more clearly understood, the present invention will be further described with reference to the accompanying drawings and examples. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those described herein, and thus, the present invention is not limited to the specific embodiments disclosed below.
The present solution proposes a general framework, shown in fig. 2, which includes a self-attention module and a filtering module. Before delving into its details, the problem is formally stated:
problem description:
Assume $N$ training videos; each video $V_i$ has a ground-truth label $y_i \in \mathbb{R}^{C+1}$, where $C+1$ is the number of action classes. If action class $j$ exists in the video, then $y_i(j)=1$; otherwise $y_i(j)=0$. During testing, the goal of temporal action localization is to generate a set of action proposals $\{(c, q, t_s, t_e)\}$ for each video, where $c$ denotes the predicted class, $q$ the confidence score, and $t_s$ and $t_e$ the action start and end times, respectively. The invention aims, for training data with only video-level category labels, to localize the start and end boundaries of the action instances in untrimmed videos and to recognize the corresponding action categories; the key design points are as follows:
key point 1: how to model behavior instance integrity:
In the absence of fine-grained temporal boundary annotation for untrimmed video, detecting complete and accurate action instances becomes very difficult. This scheme introduces a self-attention module that models the latent temporal structure of action segments during the feature modeling and prediction stages, thereby refining action features with different attributes, encouraging smooth segment classification scores, and guaranteeing the completeness of action instances.
Key point 2: how to suppress background interference:
For a weakly supervised method with only video-level annotation, it is unknown which frames are background and which segments are actions, so many background segments are mistaken for the action to be detected. This scheme adds an auxiliary background class and attenuates the input features from background frames through a filtering module, creating negative samples of the background class so as to learn the features of background segments, suppress the influence of background noise, and improve the accuracy and quality of action detection.
This embodiment provides a weakly supervised temporal action detection method based on context modeling and background suppression, comprising the following steps:
Step A: dividing the video into a plurality of non-overlapping segments, extracting the spatial features and temporal motion features of the video scene, and fusing them to construct the video-level features;
Step B: designing a memory bank M as the learning source for action localization, and introducing a self-attention module to model the contextual information of the video, thereby extracting segment-level action features and achieving complete localization;
Step C: adding an auxiliary background class and suppressing the input features of background frames with a filtering module, so as to prevent interference from background noise and obtain the foreground attention weights;
Step D: combining Step B and Step C, performing iterative training of the network to realize the video-level prediction.
Specifically, the following describes the scheme of the present invention in detail:
1. Feature extraction
Using a uniform sampling strategy, the video $V_i$ is divided into $T$ non-overlapping 16-frame segments, and the I3D feature extractor extracts the scene spatial features $X_i^{RGB} \in \mathbb{R}^{T \times D}$ and the temporal motion features $X_i^{Flow} \in \mathbb{R}^{T \times D}$. The segment-level features of the RGB and Flow streams are then fused to obtain $x_i \in \mathbb{R}^{2D}, i \in [1, T]$, which builds the video-level feature representation $X_i^e = [x_1, \ldots, x_T] \in \mathbb{R}^{2D \times T}$, where $D$ denotes the feature dimension.
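As an illustrative sketch only (not part of the claimed method), the two-stream feature construction above can be written in PyTorch-style code; the function extract_feat is a hypothetical stand-in for the I3D extractor, and the tensor shapes follow the notation above (T segments, D = 1024 per stream):

import torch

def build_video_features(rgb_segments, flow_segments, extract_feat):
    """Fuse two-stream segment-level features into X_e of shape (2D, T).
    `extract_feat` is a hypothetical callable standing in for the I3D
    extractor; it maps a stack of T 16-frame segments to a (T, D) tensor."""
    x_rgb = extract_feat(rgb_segments)    # spatial (appearance) features, (T, D)
    x_flow = extract_feat(flow_segments)  # temporal (motion) features,    (T, D)
    x = torch.cat([x_rgb, x_flow], dim=1) # x_i in R^{2D} for each segment, (T, 2D)
    return x.transpose(0, 1)              # X_e = [x_1, ..., x_T] in R^{2D x T}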
2. Self-attention module
(1) The video-level feature representation of the $T$ segments (of dimension $2D$) is stored in a memory bank $M$, $M \in \mathbb{R}^{T \times 2D}$, and the encoders $E_Q$, $E_K$ and $E_V$ generate the query (Q), key (K) and value (V) of the video segments, respectively.
$E_K$ reduces the dimension of the segments; its keys store information about the appearance and motion of the segments for efficient reading from memory, and it is implemented by a fully-connected (FC) layer. $E_V$ is an MLP composed of two FC layers with a bottleneck structure between them to reduce parameters; it encodes each segment into category-specific features for classification.
$$K_i = E_K(M), \qquad V_i = E_V(M)$$
where $K_i \in \mathbb{R}^{T \times 2D/m}$ and $V_i \in \mathbb{R}^{T \times (C+1)\cdot 2D}$ are the keys and values, and $m$ is a hyper-parameter controlling the memory-reading efficiency. Given the memory bank $M$ and the input video, video classification and background suppression are described next.
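The following is a minimal PyTorch sketch of the three encoders, under the assumption that $E_Q$ and $E_K$ are single FC layers and $E_V$ is a two-FC-layer MLP with a bottleneck, as described above; the bottleneck width and the default values of D, C and m are assumptions for illustration:

import torch.nn as nn

class MemoryEncoders(nn.Module):
    """Encode the memory bank M (T x 2D) into queries, keys and values."""
    def __init__(self, D=1024, C=20, m=2, bottleneck=512):
        super().__init__()
        d_in = 2 * D
        self.E_Q = nn.Linear(d_in, d_in // m)      # queries: (T, 2D/m)
        self.E_K = nn.Linear(d_in, d_in // m)      # keys:    (T, 2D/m)
        self.E_V = nn.Sequential(                  # values:  (T, (C+1)*2D)
            nn.Linear(d_in, bottleneck),
            nn.ReLU(),
            nn.Linear(bottleneck, (C + 1) * d_in),
        )

    def forward(self, M):                          # M: (T, 2D)
        return self.E_Q(M), self.E_K(M), self.E_V(M)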
(2) For video classification, the encoder $E_Q$, implemented by an FC layer, encodes the video-level features $X_i^e$ into a set of queries $Q_i$, $Q_i \in \mathbb{R}^{T \times 2D/m}$; similarity scores between the video segments are then computed with the queries, and the contextual information is aggregated with these scores to obtain the refined segment features, as shown in fig. 3:
$$\hat{X}_i^e = \left(\mathrm{softmax}\!\left(Q_i K_i^\top\right) + I\right) X_i^e$$
where $I$ is an identity matrix used to retain the original video information, and $\hat{X}_i^e$ and $X_i^e$ have the same dimensions. Through the information propagation among segments, global contextual information is extracted and more discriminative features for classification and localization are obtained. Fig. 3 applies self-attention to each query segment and aggregates contextual information by computing similarities with the other segments; $\oplus$ and $\otimes$ denote element-wise addition and matrix multiplication, and $T$ and $2D$ denote the number of video segments and the feature dimension, respectively.
(3) The interaction between $Q_i$ and $K_i$ is computed to obtain the correlations between different segments, giving the network a global view, and the correlation scores are finally aggregated into the similarity matrix as follows:
$$V_i^o = \mathrm{softmax}\!\left(Q_i K_i^\top\right) V_i$$
where $V_i^o \in \mathbb{R}^{T \times (C+1)\cdot 2D}$.
(4) For the subsequent classification, the similarity matrix $V_i^o$ is reshaped into a set of $T$ segment-level classifiers, each of size $2D \times (C+1)$, which accommodate the appearance or motion variations of each segment; $V_i^o$ is used to compute the sparse loss function:
$$L_s = \frac{1}{T}\left\| V_i^o \right\|_1$$
where $\|\cdot\|_1$ is the L1 loss, which encourages background frames to have low similarity to all action segments.
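Steps (2)-(4) can be sketched as follows; the softmax attention with an identity shortcut and the L1 penalty on $V_i^o$ are assumed forms reconstructed from the description, not a verbatim implementation of the patent:

import torch
import torch.nn.functional as F

def self_attention_refine(Q, K, V, X_e):
    """Q, K: (T, 2D/m); V: (T, (C+1)*2D); X_e: (T, 2D).
    Returns refined features, per-segment classifiers and the sparse loss."""
    T = Q.size(0)
    A = F.softmax(Q @ K.t(), dim=-1)                     # (T, T) similarity scores
    X_hat = (A + torch.eye(T, device=Q.device)) @ X_e    # refined segment features, same shape as X_e
    V_o = A @ V                                          # aggregated similarity matrix, (T, (C+1)*2D)
    classifiers = V_o.view(T, X_e.size(1), -1)           # one (2D x (C+1)) classifier per segment
    L_s = V_o.abs().mean()                               # L1 sparsity on V_o (assumed form of L_s)
    return X_hat, classifiers, L_s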
3. Suppression module
(1) To create negative samples of the background class, the video-level features $X_i^e$ are fed as input to the filtering module, which consists of two temporal 1D convolutions and a Sigmoid function; background frames are suppressed by training with a target opposite to that of the background class, and the module returns the foreground attention weights $W_i = f_\phi(X_i^e)$, $W_i \in [0,1]$, where $f_\phi$ is a function with parameters $\phi$. $W_i$ serves to pick out a set of segments without any background activity, which is regarded as a negative sample of the background class.
(2) In this process, the ground-truth action classes $y_i$ and the predicted scores $p_j$ are used to construct a binary cross-entropy loss $L_{sup}$ over each class as the constraint:
$$L_{sup} = -\frac{1}{C+1}\sum_{j=1}^{C+1}\left[y_i(j)\log p_j + \left(1 - y_i(j)\right)\log\left(1 - p_j\right)\right]$$
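A minimal sketch of the filtering module and its supervision, assuming two temporal 1D convolutions followed by a Sigmoid and a standard binary cross-entropy against the multi-hot video-level labels; the kernel size and hidden width are assumptions:

import torch.nn as nn

class FilterModule(nn.Module):
    """Two temporal 1D convolutions + Sigmoid, returning foreground
    attention weights W in [0, 1] for each of the T segments."""
    def __init__(self, D=1024, hidden=512, k=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(2 * D, hidden, kernel_size=k, padding=k // 2),
            nn.ReLU(),
            nn.Conv1d(hidden, 1, kernel_size=k, padding=k // 2),
            nn.Sigmoid(),
        )

    def forward(self, X_e):                   # X_e: (B, 2D, T)
        return self.net(X_e).squeeze(1)       # W:   (B, T)

def suppression_loss(p_pred, y_true):
    """Binary cross-entropy L_sup between per-class predictions p_j
    and the multi-hot ground-truth video labels."""
    return nn.functional.binary_cross_entropy(p_pred, y_true)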
(3) Finally, the segment-level classifiers are applied to the corresponding segments, and the video-level classification result $\hat{p}_i \in \mathbb{R}^{C+1}$ is obtained from the attention-weighted pooling, calculated as follows:
$$\hat{p}_i = \mathrm{softmax}\!\left(\frac{\sum_{t=1}^{T} W_i(t)\, s_{i,t}}{\sum_{t=1}^{T} W_i(t)}\right)$$
where $s_{i,t}$ denotes the classification score of segment $t$ of video $V_i$. The action classification loss is determined by the predictions of the $N$ videos and the video labels $y_i$:
$$L_{act} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C+1} y_i(j)\log \hat{p}_i(j)$$
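The attention-weighted pooling and the action classification loss can be sketched as follows; normalising by the summed foreground weights is an assumption consistent with the description:

import torch
import torch.nn.functional as F

def video_level_prediction(seg_scores, W):
    """seg_scores: (T, C+1) per-segment class scores obtained by applying
    each segment-level classifier to its segment; W: (T,) foreground weights."""
    pooled = (W.unsqueeze(1) * seg_scores).sum(0) / (W.sum() + 1e-8)
    return F.softmax(pooled, dim=-1)          # video-level prediction p_hat, (C+1,)

def action_classification_loss(p_hat, y):
    """Cross-entropy style loss over the C+1 classes for one video;
    averaged over the N videos of a batch in practice."""
    return -(y * torch.log(p_hat + 1e-8)).sum()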
4. Network training and inference
(1) Combining step 2 and step 3, a joint loss function is defined:
$$L_{tol} = \lambda_1 L_{sup} + \lambda_2 L_{act} + \lambda_3 L_s \qquad (8)$$
where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are hyper-parameters balancing the contribution of each loss term.
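A minimal training-step sketch combining the three losses of formula (8); the weights λ1 = λ2 = 0.8, λ3 = 0.2 and the Adam settings follow the implementation details below, and the assumption is that the model returns the three loss terms directly:

import torch

def train_step(model, optimizer, batch, lam1=0.8, lam2=0.8, lam3=0.2):
    """One optimisation step over the joint loss
    L_tol = lam1 * L_sup + lam2 * L_act + lam3 * L_s."""
    X_e, y = batch                      # video features and video-level labels
    L_sup, L_act, L_s = model(X_e, y)   # the model is assumed to return the three losses
    L_tol = lam1 * L_sup + lam2 * L_act + lam3 * L_s
    optimizer.zero_grad()
    L_tol.backward()
    optimizer.step()
    return L_tol.item()

# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)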
(2) After the model is trained, a two-step procedure realizes the action localization.
First, a threshold is set on the video-level prediction scores $\hat{p}_i$, and categories whose confidence score is below the threshold $\theta_{cls}$ are discarded.
Then, for each remaining category, the threshold $\theta_{act}$ is applied to the foreground attention weights to generate action proposals.
To assign a confidence to each proposal, the class activation sequence (CAS) is first computed and then passed through a Softmax along the class dimension to obtain the class score $\hat{s}_t(c)$ at each temporal location $t$ (where $t$ denotes the segment index). The confidence $q$ of an action proposal $(c, q, t_s, t_e)$ is then calculated as follows:
$$q = \frac{1}{t_e - t_s + 1}\sum_{t = t_s}^{t_e} \hat{s}_t(c)$$
To remove action proposals with high overlap (high overlap occurs when the same action appears multiple times in the untrimmed video and should be counted and represented only once in this embodiment), the scheme applies class-wise non-maximum suppression (NMS).
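The two-step inference can be sketched as follows; taking the proposal confidence as the mean CAS score inside the proposal interval is an assumption consistent with the description, using a single foreground-weight mean as the threshold for all categories is a simplification, and the NMS routine is a standard greedy implementation:

import numpy as np

def localize(video_scores, cas, W, theta_cls=0.1, theta_act=None):
    """video_scores: (C+1,) video-level scores; cas: (T, C+1) softmaxed CAS;
    W: (T,) foreground attention weights. Returns (class, confidence, t_s, t_e)."""
    proposals = []
    thr = W.mean() if theta_act is None else theta_act   # simplified: same threshold per class
    keep = W >= thr                                      # foreground mask
    for c in np.where(video_scores >= theta_cls)[0]:
        t = 0
        while t < len(keep):                             # group consecutive foreground segments
            if keep[t]:
                s = t
                while t < len(keep) and keep[t]:
                    t += 1
                q = cas[s:t, c].mean()                   # confidence = mean CAS inside the interval
                proposals.append((int(c), float(q), s, t - 1))
            else:
                t += 1
    return proposals

def temporal_iou(a, b):
    inter = max(0, min(a[3], b[3]) - max(a[2], b[2]) + 1)
    union = (a[3] - a[2] + 1) + (b[3] - b[2] + 1) - inter
    return inter / union

def nms(proposals, iou_thr=0.3):
    """Greedy class-wise non-maximum suppression on (c, q, t_s, t_e) tuples."""
    out = []
    for p in sorted(proposals, key=lambda x: -x[1]):
        if all(p[0] != k[0] or temporal_iou(p, k) < iou_thr for k in out):
            out.append(p)
    return out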
Implementation details:
the present embodiment uses a dual-stream I3D network as a feature extractor, applies a TV-L1 algorithm to extract optical flow from RGB data, and sets D =1024. In the formula (8), λ 1 =λ 2 =0.8,λ 3 And =0.2. In the inference process, the threshold θ cls Is set to 0.1 (the value is generally 0.1-1), theta act Is a video V i The mean of foreground weights of the corresponding category. And uses a threshold of 0.3 NMS-like to remove the highly overlapping propofol. The model is a network framework based on PyTorch deep learning, the whole experiment is carried out on a single GTX 3060GPU, adam optimization is used for training, the learning rate is 10 -4 The batch size is 20.
Fig. 4 shows the localization results on THUMOS14. For each example, there are three plots with multiple sample frames. The first plot indicates the ground truth. The second and third plots show the segment activation sequences corresponding to the self-attention module and the filtering module, respectively, where the horizontal axis represents the time step of the video and the vertical axis the activation strength, ranging from 0 to 1.
Figure 4 qualitatively illustrates the results of the proposed algorithm tested on the THUMOS14 data set. Fig. 4 (a) concerns an action example with high frequency, in which all frames of the video share similar elements, i.e. a person and a basketball. By introducing the sparsity loss $L_s$ in the context modeling process, the method seeks the subtle differences between action and action, action and background, and background and background, thereby avoiding context confusion. Fig. 4 (b) contains action instances from two different classes, namely discus throw and shot put. Although the visual appearance and motion pattern are very similar across all frames, the method of the present invention is still able to localize most of the time intervals of the multiple actions. Fig. 4 (c) depicts a single action, ice dancing, with a challenging background that looks very similar to the foreground; even so, the model achieves the separation of action from context through the self-attention context modeling and the suppression of background frames by the filtering module.
The method addresses academic problems in the task of weakly supervised action recognition and localization, such as inaccurate action boundary localization caused by background frame interference, and incomplete action localization caused by candidate segments being ignored arbitrarily. To better solve these problems, the invention designs a context modeling framework and a learned background suppression paradigm for the weakly supervised temporal action localization task. The solution to the first problem is to model the latent temporal structure of action segments during the feature modeling and prediction stages and to further refine the action features of different attributes, thereby encouraging smooth segment classification scores. The guiding idea for the second problem is to add an auxiliary background class and to suppress the input features of background frames with the filtering module, thereby preventing interference from background noise. Combining the high-quality classification scores with the accurate foreground weights remarkably improves the video-level prediction performance. Extensive experiments on the THUMOS14 and ActivityNet1.2 data sets demonstrate the effectiveness and feasibility of the method.
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the scope of the invention.

Claims (6)

1. A weakly supervised temporal behavior detection method based on context modeling and background suppression, characterized by comprising the following steps:
Step A: dividing the video into a plurality of non-overlapping segments, extracting the spatial features and temporal motion features of the video scene, and fusing them to construct the video-level feature representation;
Step B: designing a memory bank M as the learning source for action localization, and modeling the contextual information of the video with a self-attention module to extract segment-level action features and train segment-level classifiers;
Step C: adding an auxiliary background class and suppressing the input features of background frames through a filtering module, so as to prevent interference from background noise and obtain the foreground attention weights;
Step D: combining Step B and Step C, performing iterative optimization training of the network, and combining the refined segment-level action features and the foreground attention weights through attention-weighted pooling to obtain the video-level prediction.
2. The weakly supervised temporal behavior detection method based on context modeling and background suppression according to claim 1, characterized in that Step A is specifically implemented as follows:
Using a uniform sampling strategy, the video $V_i$ is divided into $T$ non-overlapping segments, and a feature extractor extracts the scene spatial features $X_i^{RGB} \in \mathbb{R}^{T \times D}$ and the temporal motion features $X_i^{Flow} \in \mathbb{R}^{T \times D}$; the two-stream segment-level features are then fused to obtain $x_i \in \mathbb{R}^{2D}, i \in [1, T]$, building the video-level feature representation $X_i^e = [x_1, \ldots, x_T] \in \mathbb{R}^{2D \times T}$, where $D$ denotes the feature dimension.
3. The weakly supervised temporal behavior detection method based on context modeling and background suppression according to claim 2, characterized in that Step B specifically includes the following steps:
(1) The video-level features $X_i^e$ obtained in Step A are stored in a memory bank $M$, $M \in \mathbb{R}^{T \times 2D}$, and the encoders $E_Q$, $E_K$ and $E_V$ generate the query, key and value of the video segments, respectively:
$$K_i = E_K(M), \qquad V_i = E_V(M)$$
where $K_i \in \mathbb{R}^{T \times 2D/m}$ and $V_i \in \mathbb{R}^{T \times (C+1)\cdot 2D}$ are the keys and values, and $m$ is a hyper-parameter controlling the memory-reading efficiency;
(2) The encoder $E_Q$ encodes the video-level features $X_i^e$ into a set of queries $Q_i$, $Q_i \in \mathbb{R}^{T \times 2D/m}$; similarity scores between the video segments are then computed with the queries, and the contextual information is aggregated with these scores to obtain the refined segment-level action features:
$$\hat{X}_i^e = \left(\mathrm{softmax}\!\left(Q_i K_i^\top\right) + I\right) X_i^e$$
where $I$ is an identity matrix used to retain the original video information, and $\hat{X}_i^e$ and $X_i^e$ have the same dimensions; through the information propagation among segments, global contextual information is extracted and more discriminative features, easier to classify and localize, are obtained;
(3) The interaction between $Q_i$ and $K_i$ is computed to obtain the correlations between different segments, giving the network a global view, and the similarity matrix $V_i^o$ is finally obtained by aggregation as follows:
$$V_i^o = \mathrm{softmax}\!\left(Q_i K_i^\top\right) V_i$$
where $V_i^o \in \mathbb{R}^{T \times (C+1)\cdot 2D}$;
(4) The similarity matrix $V_i^o$ is reshaped into a set of $T$ segment-level classifiers, each of size $2D \times (C+1)$, which accommodate the appearance or motion variations of each segment; $V_i^o$ is used to compute the sparse loss function to train the segment-level classifiers:
$$L_s = \frac{1}{T}\left\| V_i^o \right\|_1$$
where $\|\cdot\|_1$ is the L1 loss, which encourages background frames to have low similarity to all action segments.
4. The weakly supervised temporal behavior detection method based on context modeling and background suppression according to claim 3, characterized in that Step C is specifically implemented as follows:
(1) The video-level features $X_i^e$ are fed into the filtering module, which consists of two temporal 1D convolutions and a Sigmoid function; background frames are suppressed by training with a target opposite to that of the background class, yielding the foreground attention weights $W_i = f_\phi(X_i^e)$, $W_i \in [0,1]$, where $f_\phi$ is a function with parameters $\phi$;
(2) The ground-truth action classes $y_i$ and the predicted scores $p_j$ are used to construct a binary cross-entropy loss $L_{sup}$ over each class to train the filtering module:
$$L_{sup} = -\frac{1}{C+1}\sum_{j=1}^{C+1}\left[y_i(j)\log p_j + \left(1 - y_i(j)\right)\log\left(1 - p_j\right)\right]$$
where $p_j$ denotes the prediction score of class $j$ and $L_{sup}$ denotes the binary cross-entropy loss.
5. The weakly supervised temporal behavior detection method based on context modeling and background suppression according to claim 4, characterized in that in Step D the video-level prediction is realized by combining Step B and Step C, specifically as follows:
The segment-level classifiers are applied to the corresponding segments, and the video-level classification result $\hat{p}_i \in \mathbb{R}^{C+1}$ is obtained from the attention-weighted pooling:
$$\hat{p}_i = \mathrm{softmax}\!\left(\frac{\sum_{t=1}^{T} W_i(t)\, s_{i,t}}{\sum_{t=1}^{T} W_i(t)}\right)$$
where $s_{i,t}$ denotes the classification score of segment $t$ of video $V_i$; the action classification loss is determined by the predictions of the $N$ videos and the true video labels $y_i$:
$$L_{act} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C+1} y_i(j)\log \hat{p}_i(j)$$
where $L_{act}$ denotes the action classification loss and $C+1$ denotes the total number of action classes.
6. The weakly supervised temporal behavior detection method based on context modeling and background suppression according to claim 4, characterized in that in Step D the iterative optimization training of the network specifically adopts the following approach:
(1) Combining Step B and Step C, a joint loss function is defined:
$$L_{tol} = \lambda_1 L_{sup} + \lambda_2 L_{act} + \lambda_3 L_s$$
where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are hyper-parameters balancing the contribution of each loss term;
(2) Video localization inference:
1) A threshold is set on the video-level prediction scores $\hat{p}_i$, and categories whose confidence score is below the threshold $\theta_{cls}$ are discarded;
2) For each remaining category, the threshold $\theta_{act}$ is applied to the foreground attention weights to generate action proposals:
To assign a confidence to each action proposal, the class activation sequence (CAS) is first computed and then passed through a Softmax along the class dimension to obtain the class score $\hat{s}_t(c)$ at each temporal location; the confidence $q$ of an action proposal $(c, q, t_s, t_e)$ is then:
$$q = \frac{1}{t_e - t_s + 1}\sum_{t = t_s}^{t_e} \hat{s}_t(c)$$
Finally, class-wise non-maximum suppression (NMS) is used to remove highly overlapping action proposals.
CN202211208771.0A 2022-09-30 2022-09-30 Weak supervision time sequence behavior detection method based on context modeling and background suppression Pending CN115641529A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211208771.0A CN115641529A (en) 2022-09-30 2022-09-30 Weak supervision time sequence behavior detection method based on context modeling and background suppression

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211208771.0A CN115641529A (en) 2022-09-30 2022-09-30 Weak supervision time sequence behavior detection method based on context modeling and background suppression

Publications (1)

Publication Number Publication Date
CN115641529A true CN115641529A (en) 2023-01-24

Family

ID=84941570

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211208771.0A Pending CN115641529A (en) 2022-09-30 2022-09-30 Weak supervision time sequence behavior detection method based on context modeling and background suppression

Country Status (1)

Country Link
CN (1) CN115641529A (en)


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116030538A (en) * 2023-03-30 2023-04-28 中国科学技术大学 Weak supervision action detection method, system, equipment and storage medium
CN116503959A (en) * 2023-06-30 2023-07-28 山东省人工智能研究院 Weak supervision time sequence action positioning method and system based on uncertainty perception
CN116503959B (en) * 2023-06-30 2023-09-08 山东省人工智能研究院 Weak supervision time sequence action positioning method and system based on uncertainty perception
CN116612420A (en) * 2023-07-20 2023-08-18 中国科学技术大学 Weak supervision video time sequence action detection method, system, equipment and storage medium
CN116612420B (en) * 2023-07-20 2023-11-28 中国科学技术大学 Weak supervision video time sequence action detection method, system, equipment and storage medium

Similar Documents

Publication Publication Date Title
Chen et al. Once for all: a two-flow convolutional neural network for visual tracking
Xu et al. Segregated temporal assembly recurrent networks for weakly supervised multiple action detection
Cao et al. An attention enhanced bidirectional LSTM for early forest fire smoke recognition
CN112115995B (en) Image multi-label classification method based on semi-supervised learning
Zhou et al. Attention-driven loss for anomaly detection in video surveillance
Durand et al. Wildcat: Weakly supervised learning of deep convnets for image classification, pointwise localization and segmentation
Jing et al. Videossl: Semi-supervised learning for video classification
CN115641529A (en) Weak supervision time sequence behavior detection method based on context modeling and background suppression
CN110210335B (en) Training method, system and device for pedestrian re-recognition learning model
US11640714B2 (en) Video panoptic segmentation
An Anomalies detection and tracking using Siamese neural networks
Yao et al. R²IPoints: Pursuing Rotation-Insensitive Point Representation for Aerial Object Detection
CN113283282A (en) Weak supervision time sequence action detection method based on time domain semantic features
Zhao et al. Real-time pedestrian detection based on improved YOLO model
Bodesheim et al. Pre-trained models are not enough: active and lifelong learning is important for long-term visual monitoring of mammals in biodiversity research—individual identification and attribute prediction with image features from deep neural networks and decoupled decision models applied to elephants and great apes
Deshpande et al. Anomaly detection in surveillance videos using transformer based attention model
Vainstein et al. Modeling video activity with dynamic phrases and its application to action recognition in tennis videos
Lee et al. License plate detection via information maximization
Pramono et al. Relational reasoning for group activity recognition via self-attention augmented conditional random field
CN113128410A (en) Weak supervision pedestrian re-identification method based on track association learning
Yu et al. Self-label refining for unsupervised person re-identification
Zhang et al. Action detection with two-stream enhanced detector
Anusha et al. Object detection using deep learning
Pham et al. Vietnamese Scene Text Detection and Recognition using Deep Learning: An Empirical Study
Huang et al. Bidirectional tracking scheme for visual object tracking based on recursive orthogonal least squares

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination