CN115641529A - Weak supervision time sequence behavior detection method based on context modeling and background suppression - Google Patents
- Publication number: CN115641529A
- Application number: CN202211208771.0A
- Authority: CN (China)
Abstract
The invention discloses a weakly supervised temporal action detection method based on context modeling and background suppression, comprising the following steps: dividing a video into multiple non-overlapping segments, extracting the spatial features and temporal motion features of the video scene, and constructing a video-level feature representation from the segments; designing a memory bank M as the learning source for action localization and modeling the contextual information of the video with a self-attention module; adding an auxiliary background class and suppressing the input features of background frames with a filtering module; and combining the refined segment features with the foreground weights in an attention-weighted pool to realize video-level prediction. The scheme introduces a self-attention module that models the latent temporal structure of action segments during the feature-modeling and prediction stages, refining action features with different attributes and ensuring the completeness of action instances. It further adds an auxiliary background class: the filtering module attenuates input features from background frames and creates negative samples of the background class, so that the network learns background-segment features, suppresses the influence of background noise, and improves the accuracy and quality of action detection.
Description
Technical Field
The invention belongs to the technical field of video temporal action detection, and in particular relates to a weakly supervised temporal action detection method based on context modeling and background suppression.
Background
With the spread of multimedia, the Internet, and recording devices, video data has grown explosively. Temporal action localization in video normally requires a large amount of annotation for training; accurate temporal boundary annotation is extremely expensive and consumes enormous manpower and financial resources, which greatly limits the application of temporal action detection algorithms and motivates weakly supervised approaches. Weakly supervised action localization uses only video-level labels during training, which reduces the waste of human resources and time as well as labeling errors, and offers good flexibility.
Most existing weakly supervised temporal action localization (WTAL) methods fall into two categories. The first, inspired by the weakly supervised semantic image segmentation task, treats WTAL as a video recognition task: a foreground-background separating attention mechanism builds video-level features, and an action classifier then recognizes the video. The second formulates the task as a Multiple Instance Learning (MIL) problem, treating each untrimmed video as a bag containing positive and negative instances, i.e. action instances and background frames (a background frame is a video segment that does not belong to any category to be detected). Segments are classified over time to generate a Class Activation Sequence (CAS); video-level predictions are obtained by temporally pooling the CAS, and action proposals are generated by thresholding the segment-level class scores.
Both kinds of method aim to learn an effective classification function that distinguishes action instances from background frames. However, existing methods do not fully model the action detection problem and still face two challenges: localization completeness and background interference:
(1) Localization completeness: for a continuous temporal action, recognition often over-relies on the feature regions that contribute most to classification, which leads to incomplete localization. Fig. 1 shows an example of a diving action: (a) is the ground-truth localization, and (b) is the prediction of a MIL-based method. The MIL framework captures only the most discriminative portions of the full diving action, which may yield high classification confidence but poor localization performance.
(2) Background interference: background frames are trained to be classified into the video's action classes even though they carry no action characteristics; pushing background frames towards action classes in this inconsistent way leads to false positives and degraded detection performance. Existing weakly supervised methods train directly on the video class labels, considering only the action classes and no background class, so erroneous detections are inevitable and limit the detection accuracy of the model.
Disclosure of Invention
The invention aims to solve the problems of incomplete localization and background interference in the prior art, and provides a weakly supervised temporal action detection method based on context modeling and background suppression.
The invention is realized with the following technical scheme. A weakly supervised temporal action detection method based on context modeling and background suppression comprises the following steps:
Step A: divide the video into multiple non-overlapping segments, extract the spatial features and temporal motion features of the video scene, and fuse them into a video-level feature representation;
Step B: design a memory bank M as the learning source for action localization, and model the contextual information of the video with a self-attention module to extract segment-level action features and train a segment-level classifier;
Step C: add an auxiliary background class, and suppress the input features of background frames with a filtering module to prevent background-noise interference and obtain the foreground attention weights;
Step D: combining step B and step C, perform iterative optimization training of the network, then combine the refined segment-level action features with the foreground attention weights in an attention-weighted pool to realize video-level prediction.
Further, the step A is specifically realized as follows:
A uniform sampling strategy divides the video V_i into T non-overlapping segments; a feature extractor yields scene spatial features x_i^{RGB} and temporal motion features x_i^{Flow}, which are fused into two-stream segment-level features x_i ∈ R^{2D}, i ∈ [1, T], building the video-level feature representation X_i^e = [x_1, ..., x_T] ∈ R^{2D×T}, where D denotes the feature dimension.
Further, the step B specifically comprises the following steps:
(1) Store the video-level features X_i^e obtained in step A in a memory bank M ∈ R^{T×2D}, and use encoders E_Q, E_K and E_V to generate the query, key and value corresponding to the video segments:

K_i = E_K(M)
V_i = E_V(M)

where K_i ∈ R^{T×2D/m} and V_i ∈ R^{T×(C+1)·2D} are the keys and values, and m is a hyper-parameter controlling the memory-reading efficiency;
(2) Encode the video-level features X_i^e into a set of queries Q_i ∈ R^{T×2D/m} with the encoder E_Q, then compute the similarity scores between the query and the video segments and use them to aggregate the contextual information into the refined segment-level action features:

X̂_i^e = (softmax(Q_i Q_i^⊤) + I) X_i^e

where I is an identity matrix that preserves the original video information, and X̂_i^e keeps the same dimensions as X_i^e; through this information exchange among the segments, global contextual information is extracted and more discriminative features, easier to classify and localize, are obtained;
(3) Compute the interaction between Q_i and K_i to obtain the correlation between different segments, giving the network a global view, and finally aggregate the correlation scores into the similarity matrix V_i^o:

V_i^o = softmax(Q_i K_i^⊤) V_i

where V_i^o ∈ R^{T×(C+1)·2D};
(4) Reshape the similarity matrix V_i^o into a set of segment-level classifiers, which adapt to the appearance or motion variation of each segment, and use V_i^o to compute the sparse loss L_s when training the segment-level classifiers:

L_s = ‖V_i^o‖_1

where ‖·‖_1 is the L1 loss, which encourages background frames to have low similarity to all action segments.
Further, the step C is specifically realized as follows:
(1) Feed the video-level features X_i^e into the filtering module, which consists of two temporal 1D convolutions and a Sigmoid function; by training with a target opposite to the background class, the background frames are suppressed and the foreground attention weights are obtained:

W_i = f_φ(X_i^e), W_i ∈ [0, 1]^T

where f_φ is a function with parameters φ;
(2) Using the ground-truth action classes ŷ_j and the predicted scores p_j, construct a binary cross-entropy loss L_sup for each class to train the filtering module:

L_sup = −(1/(C+1)) Σ_j [ŷ_j log p_j + (1 − ŷ_j) log(1 − p_j)]

where p_j is the predicted score of class j.
Further, in the step D, the video-level prediction combines step B and step C and is specifically realized as follows:
Apply the classifiers to the corresponding segments; the video-level classification result p_i is obtained from the attention-weighted pool:

p_i = Σ_{t=1}^{T} W_i(t) s_t / Σ_{t=1}^{T} W_i(t)

where s_t is the segment-level class score; the action classification loss between the predictions of the N videos and the video labels y_i is:

L_act = −(1/N) Σ_{i=1}^{N} Σ_{c=1}^{C+1} y_i(c) log p_i(c)

where L_act is the action classification loss and C is the total number of action classes.
Further, in the step D, the iterative optimization training of the network specifically proceeds as follows:
(1) Combining step B and step C, define a joint loss function:

L_tol = λ_1·L_sup + λ_2·L_act + λ_3·L_s

where λ_1, λ_2 and λ_3 are hyper-parameters that balance the contribution of each loss term;
(2) Video localization inference:
1) Threshold the video-level prediction scores, discarding the classes whose confidence score falls below a threshold θ_cls;
2) For each remaining class, apply a threshold θ_act to the foreground attention weights to generate action proposals.
To assign a confidence to each action proposal, the class activation sequence (CAS) is first computed and passed through a Softmax along the class dimension to obtain the class score at each temporal position; the confidence q of an action proposal (c, q, t_s, t_e) is then computed from these class scores within the proposal interval.
Finally, since one action instance may occur multiple times in an untrimmed video, the scheme uses class-wise non-maximum suppression (NMS) to remove highly overlapping action proposals.
Compared with the prior art, the invention has the following advantages and positive effects:
(1) For the localization-completeness problem, the scheme introduces self-attention to model the latent temporal structure of action segments during the feature-modeling and prediction stages:
traditional MIL methods treat video segments as independent instances and ignore the latent temporal structure during the feature-modeling and prediction phases, so the action proposals generated from the CAS are of low quality. This scheme first designs a memory bank M as the learning source for action localization and introduces a self-attention module to model the contextual information of the video, refining the action features, encouraging smoother temporal classification scores, and achieving complete localization;
(2) For the background-interference problem, a filtering module is designed:
besides action segments, an untrimmed video contains a large number of background frames, and a weakly supervised method with only video-level annotation cannot tell which frames are background and which are action segments, so many background segments are mistaken for the actions to be detected. The scheme adds an auxiliary background class and suppresses the input features of background frames with a filtering module to prevent background-noise interference.
Drawings
FIG. 1 is a diagram of action capture by an existing MIL framework; (a) shows the ground-truth localization and (b) the prediction of a MIL-based method;
FIG. 2 illustrates the overall network architecture of an embodiment of the present invention;
FIG. 3 is a schematic diagram of the self-attention mechanism of an embodiment of the present invention;
FIG. 4 shows localization results of the present invention on THUMOS14, wherein (a) shows the basketball result; (b) the shot-put and discus-throw results; and (c) the ice-dancing result.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention more clearly understood, the present invention will be further described with reference to the accompanying drawings and examples. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced in other ways than those described herein, and thus, the present invention is not limited to the specific embodiments disclosed below.
The present solution proposes the general framework shown in fig. 2, which comprises a self-attention module and a filtering module. Before delving into the details, the problem is formally stated:
problem description:
assuming N training videos, for each video V i All have real label y i ∈R C+1 Where C +1 is the number of action classes; if action class j exists in the video, then y i (j) =1, otherwise y i (j) =0. During testing, the goal of temporal motion localization is to generate a set of motion proposals for each video { (c, q, t) s ,t e ) Where c denotes the prediction class, q is the confidence score, t s And t e The representations are the action start time and the end time, respectively. The invention aims to solve the problem that the starting and stopping boundaries of behavior examples in un-edited videos are positioned and corresponding behavior categories are identified for training data only labeled by video-level categories, and the key points are designed as follows:
key point 1: how to model behavior instance integrity:
in the absence of fine-grained temporal boundary annotation for un-clipped video, it becomes very difficult to detect complete and accurate behavior instances. According to the scheme, a self-attention module is introduced, and the potential time structure of the action segment is modeled in the characteristic modeling and predicting stage, so that the action characteristics with different attributes are refined, smooth segment classification scores are encouraged, and the completeness of the behavior instance is guaranteed.
Key point 2: how to suppress background interference:
A weakly supervised method with only video-level annotation cannot distinguish background frames from action segments, so many background segments are mistaken for the actions to be detected. The scheme adds an auxiliary background class, attenuates the input features from background frames through a filtering module, and creates negative samples of the background class, so that the features of background segments are learned, the influence of background noise is suppressed, and the accuracy and quality of action detection are improved.
This embodiment provides a weakly supervised temporal action detection method based on context modeling and background suppression, comprising the following steps:
Step A: divide the video into multiple non-overlapping segments, extract the spatial features and temporal motion features of the video scene, and fuse them into video-level features;
Step B: design a memory bank M as the learning source for action localization, and introduce a self-attention module to model the contextual information of the video, extracting segment-level action features to achieve complete localization;
Step C: add an auxiliary background class, and suppress the input features of background frames with a filtering module to prevent background-noise interference and obtain the foreground attention weights;
Step D: combining step B and step C, train the network iteratively to realize video-level prediction.
Specifically, the scheme of the invention is described in detail below:
1. Feature extraction
A uniform sampling strategy divides the video V_i into T non-overlapping 16-frame segments; an I3D feature extractor yields scene spatial features x_i^{RGB} and temporal motion features x_i^{Flow}; the RGB and Flow segment-level features are then fused into x_i ∈ R^{2D}, i ∈ [1, T], building the video-level feature representation X_i^e = [x_1, ..., x_T] ∈ R^{2D×T}, where D denotes the feature dimension.
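The segment-and-fuse construction above can be sketched as follows. This is a minimal NumPy illustration with random arrays standing in for the I3D two-stream features; the function names and toy dimensions are our own, not from the patent:

```python
import numpy as np

def sample_segments(num_frames, T, seg_len=16):
    """Uniformly pick T non-overlapping segment start indices covering the video."""
    starts = np.linspace(0, max(num_frames - seg_len, 0), T).astype(int)
    return starts

def build_video_features(rgb_feats, flow_feats):
    """Fuse two-stream segment features by concatenation: (T, D) + (T, D) -> (T, 2D)."""
    return np.concatenate([rgb_feats, flow_feats], axis=1)

# Toy example: T = 5 segments, D = 4 dimensions per stream
T, D = 5, 4
rng = np.random.default_rng(0)
rgb = rng.normal(size=(T, D))    # stand-in for I3D RGB features
flow = rng.normal(size=(T, D))   # stand-in for I3D optical-flow features
X = build_video_features(rgb, flow)
print(X.shape)  # (5, 8), i.e. T x 2D
```

In practice the per-stream dimension is D = 1024 (see the implementation details below), giving 2048-dimensional fused segment features.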
2. Self-attention module
(1) Store the video-level feature representation of the T segments (each of dimension 2D) in a memory bank M ∈ R^{T×2D}, and use encoders E_Q, E_K and E_V to generate the corresponding query (Q), key (K) and value (V) for the video segments;
E_K reduces the segment dimension; its keys store appearance and motion information of the segments for efficient reading from the memory, and it is implemented by a fully connected (FC) layer. E_V is an MLP of two FC layers with a bottleneck structure between them to reduce parameters; it encodes each segment into a class-specific feature for classification.
K_i ∈ R^{T×2D/m} and V_i ∈ R^{T×(C+1)·2D} are the keys and values, and m is a hyper-parameter controlling the memory-reading efficiency. Given the memory bank M and the input video, video classification and background suppression proceed as described next.
(2) For video classification, the encoder E_Q, implemented by an FC layer, encodes the video-level features X_i^e into a set of queries Q_i ∈ R^{T×2D/m}; the similarity scores between the query and the video segments are then computed and used to aggregate the contextual information into the refined segment features, as shown in fig. 3:

X̂_i^e = (softmax(Q_i Q_i^⊤) + I) X_i^e

where I is an identity matrix that preserves the original video information, and X̂_i^e keeps the same dimensions as X_i^e. Through this information exchange among segments, global contextual information is extracted and more discriminative features for classification and localization are obtained. Fig. 3 applies self-attention to each query segment and aggregates contextual information by computing similarities with the other segments; ⊕ and ⊗ denote element-wise addition and matrix multiplication, and T and 2D denote the number of video segments and the feature dimension, respectively.
(3) Compute the interaction between Q_i and K_i to obtain the correlation between different segments, giving the network a global view, and finally aggregate the correlation scores into the similarity matrix:

V_i^o = softmax(Q_i K_i^⊤) V_i

where V_i^o ∈ R^{T×(C+1)·2D};
(4) For subsequent classification, reshape the similarity matrix V_i^o into a set of segment classifiers, which adapt to the appearance or motion variation of each segment, and use V_i^o to compute the sparse loss:

L_s = ‖V_i^o‖_1

where ‖·‖_1 is the L1 loss, which encourages background frames to have low similarity to all action segments.
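The context-aggregation-with-identity-skip step and an L1 sparsity term can be sketched as below. This is an illustrative NumPy version only: the random projection matrices, the separate query/key projections, and the 1/√d scaling are common self-attention conventions assumed here, not details quoted from the patent:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def refine_features(X, Wq, Wk):
    """X: (T, 2D) segment features. Aggregate context across segments and
    add an identity skip so each segment keeps its original information."""
    Q, K = X @ Wq, X @ Wk                       # (T, d) queries and keys
    A = softmax(Q @ K.T / np.sqrt(Q.shape[1]))  # (T, T) similarity scores
    X_hat = (A + np.eye(X.shape[0])) @ X        # identity preserves original info
    return X_hat, A

def sparse_loss(V):
    """Mean-absolute (L1) sparsity penalty on an aggregated similarity/value matrix."""
    return np.abs(V).mean()

# Toy run: T = 6 segments, 2D = 8 features, d = 4 query/key dimensions
rng = np.random.default_rng(1)
T, D2, d = 6, 8, 4
X = rng.normal(size=(T, D2))
X_hat, A = refine_features(X, rng.normal(size=(D2, d)), rng.normal(size=(D2, d)))
```

Each row of A sums to 1, so every refined segment is its own feature plus a convex combination of the other segments' features.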
3. Suppression module
(1) To create negative samples of the background class, the video-level feature representation X_i^e is fed to the filtering module, which consists of two temporal 1D convolutions and a Sigmoid function; by training with a target opposite to the background class, the background frames are suppressed and the module returns the foreground attention weights W_i = f_φ(X_i^e), W_i ∈ [0, 1]^T, where f_φ is a function with parameters φ. W_i selects the set of segments free of any background activity, which are regarded as negative samples of the background class.
(2) In this process, the ground-truth action classes ŷ_j and the predicted scores p_j are used to construct a binary cross-entropy loss L_sup for each class as a constraint:

L_sup = −(1/(C+1)) Σ_j [ŷ_j log p_j + (1 − ŷ_j) log(1 − p_j)]

(3) Finally, the classifiers are applied to the corresponding segments, and the video-level classification result p_i is obtained from the attention-weighted pool:

p_i = Σ_{t=1}^{T} W_i(t) s_t / Σ_{t=1}^{T} W_i(t)

where s_t is the segment-level class score; the action classification loss between the predictions of the N videos and the video labels y_i is:

L_act = −(1/N) Σ_{i=1}^{N} Σ_{c=1}^{C+1} y_i(c) log p_i(c)
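The filtering module (two temporal 1D convolutions followed by a Sigmoid) and the attention-weighted pool can be sketched as follows. The kernel size of 3, the ReLU between the convolutions, and all parameter shapes are illustrative assumptions; the patent only specifies "two temporal 1D convolutions and a Sigmoid":

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def temporal_conv1d(X, W, b):
    """'Same'-padded 1D convolution over time. X: (T, Cin), W: (k, Cin, Cout)."""
    k = W.shape[0]
    pad = k // 2
    Xp = np.pad(X, ((pad, pad), (0, 0)))
    out = np.stack([np.tensordot(Xp[t:t + k], W, axes=([0, 1], [0, 1]))
                    for t in range(X.shape[0])])
    return out + b

def foreground_weights(X, params):
    """Two temporal 1D convs + Sigmoid -> per-segment weight W in [0, 1]^T."""
    h = np.maximum(temporal_conv1d(X, params['W1'], params['b1']), 0)  # ReLU
    w = sigmoid(temporal_conv1d(h, params['W2'], params['b2']))        # (T, 1)
    return w[:, 0]

def attention_pool(scores, w):
    """Weighted average of segment-level class scores (T, C+1) by foreground weights."""
    return (w[:, None] * scores).sum(axis=0) / (w.sum() + 1e-8)

# Toy run: T = 8 segments, Cin = 6 input channels, H = 4 hidden channels, C+1 = 5 classes
rng = np.random.default_rng(2)
T, Cin, H = 8, 6, 4
params = {'W1': rng.normal(size=(3, Cin, H)) * 0.1, 'b1': np.zeros(H),
          'W2': rng.normal(size=(3, H, 1)) * 0.1, 'b2': np.zeros(1)}
X = rng.normal(size=(T, Cin))
w = foreground_weights(X, params)
scores = rng.normal(size=(T, 5))
video_score = attention_pool(scores, w)
```

In training, the pooled video score would feed the classification loss while w itself is supervised against the auxiliary background class.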
4. Network training and inference
(1) Combining step 2 and step 3, define a joint loss function:

L_tol = λ_1·L_sup + λ_2·L_act + λ_3·L_s (8)

where λ_1, λ_2 and λ_3 are hyper-parameters that balance the contribution of each loss term.
(2) After the model is trained, action localization is performed in two steps:
First, threshold the video-level prediction scores, discarding the classes whose confidence score falls below a threshold θ_cls;
Then, for each remaining class, apply a threshold θ_act to the foreground attention weights to generate action proposals.
To assign a confidence to each proposal, the class activation sequence (CAS) is first computed and passed through a Softmax along the class dimension to obtain the class score at each temporal position t (t denotes the segment index); the confidence q of an action proposal (c, q, t_s, t_e) is then computed from these class scores within the proposal interval.
To remove action proposals with high overlap (high overlap arises when the same action instance in an untrimmed video is covered by multiple proposals, of which only one should be kept), the scheme applies class-wise non-maximum suppression (NMS).
Implementation details:
the present embodiment uses a dual-stream I3D network as a feature extractor, applies a TV-L1 algorithm to extract optical flow from RGB data, and sets D =1024. In the formula (8), λ 1 =λ 2 =0.8,λ 3 And =0.2. In the inference process, the threshold θ cls Is set to 0.1 (the value is generally 0.1-1), theta act Is a video V i The mean of foreground weights of the corresponding category. And uses a threshold of 0.3 NMS-like to remove the highly overlapping propofol. The model is a network framework based on PyTorch deep learning, the whole experiment is carried out on a single GTX 3060GPU, adam optimization is used for training, the learning rate is 10 -4 The batch size is 20.
As shown in fig. 4, the localization results on THUMOS14 are presented. Each example has three plots with several sample frames: the first indicates the ground truth; the second and third show the segment activation sequences of the self-attention module and the filtering module respectively, where the horizontal axis is the time step of the video and the vertical axis the activation strength, ranging from 0 to 1.
Figure 4 qualitatively illustrates the results of the proposed algorithm on the THUMOS14 test set. Fig. 4 (a) concerns a frequently occurring action in which all frames share similar elements, i.e. a person and a basketball; by introducing the sparsity loss L_s during context modeling, the model seeks the slight differences between action and action and between action and background, avoiding context confusion. Fig. 4 (b) contains action instances from two different classes, "ThrowDiscus" and "Shotput"; although the visual appearance and motion patterns are very similar across all frames, the method of the invention is still able to locate most of the time intervals of the multiple actions. Fig. 4 (c) depicts a single action, ice dancing, whose background looks very similar to the foreground; even so, the model separates action from context through self-attention context modeling and the suppression of background frames by the filtering module.
The method addresses open problems in weakly supervised action recognition and localization, such as inaccurate action-boundary localization caused by background-frame interference and incomplete action localization caused by candidate segments being discarded arbitrarily. To solve these problems, the invention designs a context-modeling framework and a learned background-suppression paradigm for the weakly supervised temporal action localization task. The first problem is addressed by modeling the latent temporal structure of the action segments during the feature-modeling and prediction stages and refining the action features of different attributes, thereby encouraging smooth segment classification scores. The second problem is addressed by adding an auxiliary background class and suppressing the input features of background frames with the filtering module, preventing background-noise interference. Combining the high-quality classification scores with the accurate foreground weights markedly improves video-level prediction. Extensive experiments on the THUMOS14 and ActivityNet1.2 datasets demonstrate the effectiveness and feasibility of the method.
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the scope of the invention.
Claims (6)
1. A weakly supervised temporal action detection method based on context modeling and background suppression, characterized by comprising the following steps:
step A: dividing the video into multiple non-overlapping segments, extracting the spatial features and temporal motion features of the video scene, and fusing them into a video-level feature representation;
step B: designing a memory bank M as the learning source for action localization, and modeling the contextual information of the video with a self-attention module to extract segment-level action features and train a segment-level classifier;
step C: adding an auxiliary background class, and suppressing the input features of background frames through a filtering module to prevent background-noise interference and obtain the foreground attention weights;
step D: combining step B and step C, performing iterative optimization training of the network, and then combining the refined segment-level action features with the foreground attention weights in an attention-weighted pool to realize video-level prediction.
2. The weakly supervised temporal action detection method based on context modeling and background suppression according to claim 1, wherein step A is specifically realized as follows:
a video V_i is divided into T non-overlapping segments by a uniform sampling strategy; a feature extractor extracts the scene spatial features and the temporal motion features of each segment; the two-stream segment-level features are then fused to obtain x_i ∈ R^{2D}, i ∈ [1, T], so as to build the video-level feature representation X_i ∈ R^{T×2D}, where D denotes the feature dimension.
3. The weakly supervised temporal action detection method based on context modeling and background suppression according to claim 2, wherein step B specifically comprises the following steps:
(1) The video-level feature X_i obtained in step A is stored in a memory bank M, M ∈ R^{T×2D}; encoders E_Q, E_K and E_V generate the query, key and value of the video segments, respectively:
K_i = E_K(M)
V_i = E_V(M)
where K_i ∈ R^{T×2D/m} and V_i ∈ R^{T×(C+1)2D} are the key and value, and m is a hyper-parameter controlling the memory reading efficiency;
(2) Based on the encoder E_Q, the video-level feature X_i is encoded as a set of queries Q_i, Q_i ∈ R^{T×2D/m}; similarity scores between video segments are then computed from the queries, and contextual information is aggregated with these scores to obtain the refined segment-level action features:
X̃_i = (A_i + I) X_i
where A_i is the normalized segment-similarity matrix, and I is an identity matrix used to retain the original video information; X_i and X̃_i have the same dimensions. Through information transmission among the segments, global contextual information is extracted, yielding more discriminative features that are easier to classify and localize;
(3) The interaction between Q_i and K_i is computed to obtain the correlation between different segments, giving the network a global view; finally, the similarity matrix V_i^o is obtained by aggregation as follows:
V_i^o = softmax(Q_i K_i^T / √(2D/m)) V_i
where V_i^o ∈ R^{T×(C+1)2D};
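The memory read of sub-steps (1)-(3) can be illustrated with a minimal numpy sketch. The encoders E_Q, E_K, E_V are stood in for by random projection matrices, and the scaled-dot-product normalization is an assumption; a real implementation would use learned layers.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attend_memory(M, Wq, Wk, Wv):
    """Sketch of sub-steps (1)-(3): read queries, keys and values from
    the memory bank M (T x 2D), then aggregate the values with the
    normalized query-key similarity to obtain V_o."""
    Q, K, V = M @ Wq, M @ Wk, M @ Wv
    A = softmax(Q @ K.T / np.sqrt(Q.shape[1]), axis=-1)  # (T, T) segment similarity
    return A @ V, A  # aggregated matrix V_o and the attention map A

rng = np.random.default_rng(1)
T, twoD, m, dv = 16, 64, 4, 32
M = rng.standard_normal((T, twoD))              # memory bank of step A features
Wq = rng.standard_normal((twoD, twoD // m))     # stand-in for encoder E_Q
Wk = rng.standard_normal((twoD, twoD // m))     # stand-in for encoder E_K
Wv = rng.standard_normal((twoD, dv))            # stand-in for encoder E_V
V_o, A = attend_memory(M, Wq, Wk, Wv)
print(V_o.shape)  # (16, 32)
```

Each row of A sums to one, so every segment's output is a convex combination of all segments' values, which is what gives the network its global view.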
(4) The similarity matrix V_i^o is re-modeled into a set of segment-level classifiers that accommodates the appearance or motion variation of each segment; V_i^o is used to compute a sparse loss function L_s to train the segment-level classifiers:
L_s = (1/T) ||V_i^o||_1
where ||·||_1 denotes the L1 norm, which encourages background frames to have low similarity to all action segments.
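The sparsity objective reduces to an averaged L1 norm over the segment responses. A minimal numpy illustration follows; the exact normalization (mean over segments) is an assumption, since the claim only names the L1 norm.

```python
import numpy as np

def sparsity_loss(V_o):
    """Sketch of the sparse objective of sub-step (4): the mean L1 norm
    of the segment-level classifier responses, which pushes background
    segments toward low similarity with every action class."""
    return float(np.abs(V_o).sum(axis=1).mean())

V_o = np.array([[0.0, 0.0],    # background-like segment: no penalty
                [1.0, -2.0]])  # action-like segment contributes |1| + |-2| = 3
print(sparsity_loss(V_o))  # 1.5
```

Because the penalty is linear in the magnitudes, minimizing it drives weak (background) responses exactly to zero rather than merely shrinking them, which is the standard motivation for an L1 sparsity term.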
4. The weakly supervised temporal action detection method based on context modeling and background suppression according to claim 3, wherein step C is specifically realized as follows:
(1) The filtering module, comprising two temporal 1D convolutions and a Sigmoid function, suppresses background frames by training with an objective opposite to that of the background class, yielding the foreground attention weight W_i = f(X_i; φ), W_i ∈ [0, 1], where f is a function with parameter φ;
(2) Using the real behavior category y_j and the predicted score p_j, a binary cross-entropy loss L_sup is constructed for each class to train the filtering module:
L_sup = −(1/C) Σ_j [ y_j log p_j + (1 − y_j) log(1 − p_j) ]
where p_j denotes the predicted score of class j and L_sup denotes the binary cross-entropy loss.
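A minimal sketch of the filtering module of step C, under stated assumptions: the two learned temporal 1D convolutions are replaced by fixed hypothetical kernels `w1`/`w2` over a 1-D summary of the segment features, with a ReLU between them and a sigmoid at the end, as the claim describes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def foreground_attention(x, w1, w2):
    """Sketch of the filtering module: two temporal 1-D convolutions
    (ReLU in between, sigmoid at the end) producing one foreground
    weight per segment, each in [0, 1]. w1/w2 are hypothetical
    stand-ins for the learned parameters phi."""
    h = np.maximum(np.convolve(x, w1, mode="same"), 0.0)  # conv + ReLU
    return sigmoid(np.convolve(h, w2, mode="same"))       # conv + sigmoid

def bce_loss(y, p, eps=1e-7):
    """Per-class binary cross-entropy of sub-step (2)."""
    p = np.clip(p, eps, 1 - eps)
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)).mean())

x = np.array([0.1, 0.2, 2.0, 2.1, 0.1])  # 1-D summary of segment features
W = foreground_attention(x,
                         w1=np.array([0.25, 0.5, 0.25]),
                         w2=np.array([1.0, 1.0, 1.0]))
print(W.shape)  # (5,)
```

The sigmoid guarantees W_i ∈ [0, 1], so the weights can directly gate segment features in the pooling of step D.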
5. The weakly supervised temporal action detection method based on context modeling and background suppression according to claim 4, wherein in step D the video-level prediction is realized by combining steps B and C, specifically as follows:
the classifiers are applied to the corresponding segments, and the video-level classification result ŷ_i is obtained from the attention-weighted pooling:
the action classification loss is determined from the predictions and the true video labels y_i over N videos:
L_act = −(1/N) Σ_{i=1}^{N} Σ_{c=1}^{C+1} y_{i,c} log ŷ_{i,c}
where L_act denotes the action classification loss and C+1 denotes the total number of action classes (including the auxiliary background class).
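The attention-weighted pooling of claim 5 can be sketched as follows. The segment scores `cas` and the normalization of the pooled logits by a softmax are assumptions consistent with the surrounding claims; real values would come from the segment-level classifiers of step B and the filtering module of step C.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def video_level_prediction(cas, W, eps=1e-7):
    """Sketch of step D's attention-weighted pooling: segment scores
    cas (T x (C+1)) are pooled with foreground weights W (T,) and
    normalized into a single video-level class distribution."""
    pooled = (W[:, None] * cas).sum(axis=0) / (W.sum() + eps)
    return softmax(pooled)

def action_classification_loss(y, p, eps=1e-7):
    """Cross-entropy between the true video label y and prediction p."""
    return float(-(y * np.log(p + eps)).sum())

rng = np.random.default_rng(2)
T, C = 8, 3                        # 8 segments, 3 actions + 1 background class
cas = rng.standard_normal((T, C + 1))
W = rng.uniform(size=T)            # foreground attention weights from step C
p = video_level_prediction(cas, W)
print(p.shape)  # (4,)
```

Segments with small foreground weight contribute little to the pooled score, which is precisely how background suppression improves the video-level prediction.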
6. The weakly supervised temporal action detection method based on context modeling and background suppression according to claim 4, wherein the iterative optimization training of the network in step D specifically adopts the following method:
(1) Combining steps B and C, a joint loss function is defined:
L_tol = λ_1 L_sup + λ_2 L_act + λ_3 L_s
where λ_1, λ_2 and λ_3 are hyper-parameters balancing the contribution of each loss term;
(2) Video localization inference:
1) A threshold is set on the video-level prediction scores, and categories whose confidence scores fall below the threshold θ_cls are discarded;
2) For each remaining category, a threshold θ_act is applied to the foreground attention weights to generate action proposals:
to assign a confidence to each action proposal, the class activation sequence (CAS) is first computed and then passed through Softmax along the class dimension to obtain the class score at each temporal location; the confidence q of an action proposal (c, q, t_s, t_e) is then derived from these scores. Finally, non-maximum suppression (NMS) is applied to remove highly overlapping action proposals.
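The inference stage of claim 6 can be sketched as two small routines: thresholding the foreground attention weights into contiguous proposals, then plain temporal NMS. The merging scheme for consecutive above-threshold segments and the IoU criterion are standard assumptions; the claim itself does not spell them out.

```python
import numpy as np

def generate_proposals(W, theta_act):
    """Sketch of inference step 2): threshold the foreground attention
    weights and merge runs of consecutive above-threshold segments
    into (start, end) action proposals."""
    keep = W >= theta_act
    proposals, start = [], None
    for t, k in enumerate(keep):
        if k and start is None:
            start = t                      # open a new proposal
        elif not k and start is not None:
            proposals.append((start, t))   # close the current proposal
            start = None
    if start is not None:
        proposals.append((start, len(W)))
    return proposals

def temporal_nms(proposals, scores, iou_thresh=0.5):
    """Plain temporal non-maximum suppression: keep the highest-scoring
    proposal, drop any later one whose temporal IoU with a kept
    proposal exceeds the threshold."""
    order = np.argsort(scores)[::-1]
    kept = []
    for i in order:
        s1, e1 = proposals[i]
        overlaps = False
        for j in kept:
            s2, e2 = proposals[j]
            inter = max(0, min(e1, e2) - max(s1, s2))
            union = (e1 - s1) + (e2 - s2) - inter
            if union > 0 and inter / union > iou_thresh:
                overlaps = True
                break
        if not overlaps:
            kept.append(i)
    return [proposals[i] for i in kept]

W = np.array([0.1, 0.9, 0.8, 0.2, 0.7, 0.6, 0.1])
props = generate_proposals(W, theta_act=0.5)
print(props)  # [(1, 3), (4, 6)]
```

In the full method each surviving proposal would additionally carry the class c and the CAS-derived confidence q before NMS is applied per class.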
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211208771.0A CN115641529A (en) | 2022-09-30 | 2022-09-30 | Weak supervision time sequence behavior detection method based on context modeling and background suppression |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115641529A true CN115641529A (en) | 2023-01-24 |
Family
ID=84941570
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211208771.0A Pending CN115641529A (en) | 2022-09-30 | 2022-09-30 | Weak supervision time sequence behavior detection method based on context modeling and background suppression |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115641529A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116030538A (en) * | 2023-03-30 | 2023-04-28 | 中国科学技术大学 | Weak supervision action detection method, system, equipment and storage medium |
CN116503959A (en) * | 2023-06-30 | 2023-07-28 | 山东省人工智能研究院 | Weak supervision time sequence action positioning method and system based on uncertainty perception |
CN116503959B (en) * | 2023-06-30 | 2023-09-08 | 山东省人工智能研究院 | Weak supervision time sequence action positioning method and system based on uncertainty perception |
CN116612420A (en) * | 2023-07-20 | 2023-08-18 | 中国科学技术大学 | Weak supervision video time sequence action detection method, system, equipment and storage medium |
CN116612420B (en) * | 2023-07-20 | 2023-11-28 | 中国科学技术大学 | Weak supervision video time sequence action detection method, system, equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Chen et al. | Once for all: a two-flow convolutional neural network for visual tracking | |
Xu et al. | Segregated temporal assembly recurrent networks for weakly supervised multiple action detection | |
Cao et al. | An attention enhanced bidirectional LSTM for early forest fire smoke recognition | |
CN112115995B (en) | Image multi-label classification method based on semi-supervised learning | |
Zhou et al. | Attention-driven loss for anomaly detection in video surveillance | |
Durand et al. | Wildcat: Weakly supervised learning of deep convnets for image classification, pointwise localization and segmentation | |
Jing et al. | Videossl: Semi-supervised learning for video classification | |
CN115641529A (en) | Weak supervision time sequence behavior detection method based on context modeling and background suppression | |
CN110210335B (en) | Training method, system and device for pedestrian re-recognition learning model | |
US11640714B2 (en) | Video panoptic segmentation | |
An | Anomalies detection and tracking using Siamese neural networks | |
Yao et al. | R²IPoints: Pursuing Rotation-Insensitive Point Representation for Aerial Object Detection | |
CN113283282A (en) | Weak supervision time sequence action detection method based on time domain semantic features | |
Zhao et al. | Real-time pedestrian detection based on improved YOLO model | |
Bodesheim et al. | Pre-trained models are not enough: active and lifelong learning is important for long-term visual monitoring of mammals in biodiversity research—individual identification and attribute prediction with image features from deep neural networks and decoupled decision models applied to elephants and great apes | |
Deshpande et al. | Anomaly detection in surveillance videos using transformer based attention model | |
Vainstein et al. | Modeling video activity with dynamic phrases and its application to action recognition in tennis videos | |
Lee et al. | License plate detection via information maximization | |
Pramono et al. | Relational reasoning for group activity recognition via self-attention augmented conditional random field | |
CN113128410A (en) | Weak supervision pedestrian re-identification method based on track association learning | |
Yu et al. | Self-label refining for unsupervised person re-identification | |
Zhang et al. | Action detection with two-stream enhanced detector | |
Anusha et al. | Object detection using deep learning | |
Pham et al. | Vietnamese Scene Text Detection and Recognition using Deep Learning: An Empirical Study | |
Huang et al. | Bidirectional tracking scheme for visual object tracking based on recursive orthogonal least squares |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||