CN116935303A - Weak supervision self-training video anomaly detection method

Weak supervision self-training video anomaly detection method

Info

Publication number
CN116935303A
Authority
CN
China
Prior art keywords
video
training
anomaly detection
anomaly
stage
Prior art date
Legal status
Pending
Application number
CN202211328891.4A
Other languages
Chinese (zh)
Inventor
唐俊
汪振涛
王科
朱明
Current Assignee
Anhui University
Original Assignee
Anhui University
Priority date
Filing date
Publication date
Application filed by Anhui University filed Critical Anhui University
Priority to CN202211328891.4A priority Critical patent/CN116935303A/en
Publication of CN116935303A publication Critical patent/CN116935303A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/52 - Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/08 - Neural networks; learning methods
    • G06V10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06V10/762 - Pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • G06V10/764 - Pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/82 - Pattern recognition or machine learning using neural networks
    • Y02T10/40 - Engine management systems (climate change mitigation technologies related to transportation)


Abstract

The application relates to the technical field of computer vision and solves the technical problem that weakly supervised methods introduce a large amount of label noise. It provides a weak supervision self-training video anomaly detection method comprising the following steps: S1, acquiring a video anomaly detection dataset and dividing it into a training set and a test set; S2, constructing an anomaly detection model comprising a first-stage network model and a second-stage network model, and training it with the training set; S3, inputting the video to be detected into the trained anomaly detection model to predict the anomaly score of each video segment, and detecting anomalies according to an anomaly threshold: if the anomaly score is greater than the threshold, the segment is abnormal; if the anomaly score is less than the threshold, the segment is normal. The application generates high-confidence pseudo labels from prior knowledge and thereby improves the accuracy of anomaly detection.

Description

Weak supervision self-training video anomaly detection method
Technical Field
The application relates to the technical field of computer vision, in particular to a weak supervision self-training video anomaly detection method.
Background
Because abnormal events are rare, one popular paradigm is one-class classification or unsupervised learning, i.e. only normal video segments are used during training, so that the deep network learns the characteristics of normal video and anything deviating from them is judged abnormal. Such methods have an obvious limitation: there is never enough data to learn to characterize all normal behavior, so normal segments not covered by the training set may be misdetected as anomalies.
For an abnormal video, the video-level label is not perfectly correct and can only be regarded as a noisy, low-confidence label. To address this, some studies have proposed using a graph convolutional network (GCN) to exploit the feature similarity and temporal consistency of video segments and clean the label noise iteratively; however, this approach assigns the video-level label directly to every video segment, so the abnormal segments are affected by the normal segments in the abnormal video. Other studies generate pseudo labels directly, for example by cleaning label noise with a binary clustering method based on spatio-temporal video features, with the clustering and the network complementing each other during training, or by training a pseudo-label generator with multiple-instance learning (MIL) to produce segment-level labels. However, such methods amount to supervising anomaly detection directly with the anomaly scores of the video segments; they ignore the segment distribution characteristics within an abnormal video and may introduce unnecessary label noise.
Disclosure of Invention
Aiming at the defects of the prior art, the application provides a weak supervision self-training video anomaly detection method, which solves the technical problem that weakly supervised methods introduce a large amount of label noise, and achieves the goals of generating high-confidence pseudo labels from prior knowledge and improving anomaly detection accuracy.
In order to solve the technical problems, the application provides the following technical scheme: a weak supervision self-training video anomaly detection method comprises the following steps:
s1, acquiring a video anomaly detection data set and dividing the video anomaly detection data set into a training set and a testing set;
s2, constructing an anomaly detection model and training by adopting a training set, wherein the anomaly detection model comprises a first-stage network model and a second-stage network model;
s3, inputting the video to be detected into the trained abnormality detection model to predict the abnormality score of each video segment, and detecting the abnormality in the video segment according to an abnormality threshold;
if the anomaly score is greater than the threshold, the segment is abnormal;
if the anomaly score is less than the threshold, the segment is normal.
Further, in step S1, the specific process includes the following steps:
s11, collecting monitoring videos in different scenes, wherein the different scenes comprise supermarkets, banks, university campuses, highways, parks and residential communities;
S12, labeling each surveillance video as an abnormal video v_a or a normal video v_n according to whether it contains an abnormal event, thereby obtaining the video-level label.
S13, dividing the video anomaly detection data set into a training set and a testing set.
Further, in step S2, specifically includes:
S21, first-stage training: dividing each video in the training set into several equal-length video segments, sending the segments into a video feature encoder to obtain the corresponding video segment features, training a pseudo-label generator guided by prior knowledge with the video segment features and the video-level labels, and generating the pseudo labels of the video segments;
S22, second-stage training: supervising the training of a multi-scale temporal feature network model with the video segments and the pseudo labels generated by the pseudo-label generator.
Further, in step S21, the specific process includes the steps of:
S211, dividing each video in the training set into several equal-length video segments {v_a^i} and {v_n^i};
S212, setting a first-stage training related parameter;
s213, building a first-stage network model according to related parameters, wherein the first-stage network model consists of a feature extraction module and a pseudo tag generator, and the pseudo tag generator comprises a multi-layer perceptron, a Gaussian Mixture Module (GMM) and a one-dimensional median filtering module (OMF);
s214, extracting corresponding video features from the video segments through a feature extraction module, and inputting the video features into a pseudo tag generator to generate pseudo tags with high confidence.
Further, in step S214, the specific process includes the following steps:
S2141, sending the video segments {v_a^i} and {v_n^i} into the video feature encoder to obtain the corresponding video segment features {f_a^i} and {f_n^i};
S2142, training a multi-layer perceptron by using video segment characteristics and video-level labels, wherein the multi-layer perceptron consists of three full-connection layers with different neuron numbers;
S2143, the coarse-grained anomaly scores of the normal video and the abnormal video output by the trained multi-layer perceptron are s_n and s_a respectively; guided by prior knowledge, the anomaly scores s_a and s_n are input into the Gaussian mixture module (GMM) and the one-dimensional median filtering module (OMF) respectively to obtain high-confidence anomaly scores ŝ_a and ŝ_n, and the pseudo labels ŷ of the video segments are obtained after normalization.
Further, in step S22, the specific process includes the steps of:
S221, constructing a second-stage network model, wherein the second-stage network model is a multi-scale temporal network consisting of a backbone network and three branches;
S222, combining the pseudo labels ŷ obtained in the first stage with the video segments {v_a^i} and {v_n^i} as input to the second-stage multi-scale temporal feature network; the backbone extracts deep features of the video segments, and layers 4 and 5 of the backbone output the feature maps f_b-4 and f_b-5 to the following three branches.
S223, training the three branches.
Further,
the backbone network adopts a pre-trained C3D or I3D encoder;
the three branches are an attention branch, a self-guided branch and a dilated convolution branch;
the attention branch consists of an attention module and a classification head H_c, the self-guided branch consists of two three-dimensional convolution layers and a classification head H_g, and the dilated convolution branch consists of two dilated convolutions with different dilation rates and a classification head H_c.
Further, in step S223, the specific process includes the steps of:
s2231, attention branch training;
S2232, self-guided branch training: inputting f_b-5 into Conv3d_3, reducing the dimensionality through one global average pooling and one further average pooling, obtaining the anomaly score through a softmax operation, and optimizing the self-guided branch with the cross-entropy loss L_self-guide;
S2233, dilated convolution branch training: the dilated convolution branch takes f_b-5 as input, obtains the corresponding feature maps through the two three-dimensional dilated convolutions DC3d_1 and DC3d_2 respectively, and adds them to obtain the multi-scale feature map f_multi-scale.
Further, in step S2231, the specific process includes:
S22311, inputting f_b-4 into the two three-dimensional convolution layers Conv3d_1 and Conv3d_2: the resulting feature map is added to f_b-5 to carry out the attention operation, and the feature map output by the first-layer three-dimensional convolution is denoted f';
S22312, inputting f' into the second-layer three-dimensional convolution Conv3d_2 to obtain the feature map f*;
S22313, multiplying f* with f_b-5 and adding f_b-5 to obtain the final attention feature map f_atten;
S22314, using the pseudo label ŷ as supervision, f_atten passes through the classification head H_c and a softmax operation to generate the network's final anomaly score, and L_atten is used to optimize the attention branch.
By means of the above technical scheme, the application provides a weak supervision self-training video anomaly detection method with at least the following beneficial effects:
1. To address the label noise introduced by weakly supervised methods, the application combines weakly supervised learning with self-training to construct a two-stage self-training network: in the first stage, a pseudo-label generator guided by prior knowledge is trained with video-level labels, and this generator produces high-confidence pseudo labels that conform to the distribution of video segments.
2. The application takes into account that the normal segments within a period of time before and after an abnormal event also contain part of the abnormal information. The first stage therefore models the whole course of an abnormal event with a Gaussian model. Because one abnormal video may contain several abnormal events, the label distribution of the entire abnormal video is modeled with a GMM, generating high-confidence labels that better fit the characteristics of abnormal events and alleviating the label noise introduced by weak supervision.
3. The application obtains the pseudo labels of normal video by one-dimensional median filtering. Rather than assigning the video-level label of a normal video directly to its segments, or supervising the normal segments in an abnormal video with soft labels, which hinders the network from learning the characteristics of normal segments, using the median of the anomaly scores as the pseudo label is more robust.
4. To better learn the characteristics of the abnormal parts of a video, the application adopts a multi-scale spatio-temporal feature network with three branches, which markedly improves the anomaly detection effect.
5. The application uses the prior-knowledge-guided pseudo-label generator to produce high-confidence pseudo labels, reducing the adverse effect of the label noise introduced by weakly supervised methods.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1 is a flow chart of an anomaly detection method of the present application;
FIG. 2 is a network structure diagram of a first stage network model of the present application;
FIG. 3 is a network architecture diagram of a second stage network model of the present application;
FIG. 4 is a schematic diagram comparing the results of the present application with existing encoder-based anomaly detection methods on the ShanghaiTech dataset;
FIG. 5 is a schematic diagram comparing the results of the present application with existing encoder-based anomaly detection methods on the UCF-Crime dataset;
FIG. 6 is a graph comparing the results of the present application with other methods on the ShanghaiTech dataset;
FIG. 7 is a graph comparing the results of the present application with other methods on the UCF-Crime dataset;
FIG. 8 is a schematic diagram of a pseudo tag of the present application in comparison to other pseudo tag based methods;
FIG. 9 is a schematic diagram showing the effect of the present application on the partial visualization of the results of the detection on two data sets.
Detailed Description
In order that the above-recited objects, features and advantages of the present application will become more readily apparent, a more particular description of the application is given below with reference to the appended drawings and the detailed description, so that how the technical means are applied to solve the technical problems and achieve the technical effects can be fully understood and implemented.
Those of ordinary skill in the art will appreciate that all or a portion of the steps in a method of implementing an embodiment described above may be implemented by a program to instruct related hardware, and thus, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Background overview
Video surveillance plays an important role in security work today, but manually searching surveillance video for abnormal events is time-consuming, laborious and inefficient. Video anomaly detection (VAD) methods can automatically detect anomalies in surveillance video, effectively protect people's personal and property safety, and save a great deal of manpower, achieving twice the result with half the effort.
Because abnormal events are rare, one popular paradigm is one-class classification or unsupervised learning, i.e. only normal video segments are used during training, so that the deep network learns the characteristics of normal video and anything deviating from them is judged abnormal. Such methods have an obvious limitation: there is never enough data to learn to characterize all normal behavior, so normal segments not covered by the training set may be misdetected as anomalies.
Recent research has focused on weak supervision, which provides video-level labels indicating whether a video contains abnormal segments and achieves better detection than unsupervised methods with relatively little human labeling. Previous work treated weakly supervised anomaly detection as a multiple-instance learning (MIL) task: the video is regarded as a bag containing several non-overlapping video segments, each segment is treated as an instance, and the bag-level labels are used to learn instance-level labels. For an abnormal video, the video-level label is not perfectly correct and can only be regarded as a noisy, low-confidence label. To address this, some studies have proposed using a graph convolutional network (GCN) to exploit the feature similarity and temporal consistency of video segments and clean the label noise iteratively; however, this approach assigns the video-level label directly to every video segment, so the abnormal segments are affected by the normal segments in the abnormal video. Other studies generate pseudo labels directly, for example by cleaning label noise with a binary clustering method based on spatio-temporal video features, with the clustering and the network complementing each other during training, or by training a pseudo-label generator with MIL to produce segment-level labels. However, such methods amount to supervising anomaly detection directly with the anomaly scores of the video segments; they ignore the segment distribution characteristics within an abnormal video and may introduce unnecessary label noise.
To address the technical problem in the prior art that weakly supervised methods introduce a large amount of label noise, this embodiment uses prior knowledge to generate high-confidence pseudo labels, improving the accuracy of anomaly detection, alleviating the relatively low detection precision of weakly supervised methods, and achieving high anomaly detection precision with little manual labeling.
Referring to figs. 1-9, which show a specific implementation of this embodiment: the embodiment mainly uses prior knowledge of the segment distributions of abnormal and normal videos to generate high-confidence pseudo labels, reducing the influence of the label noise introduced by weak supervision on the anomaly detection result, and uses multi-scale spatio-temporal features to assist the anomaly detection model in learning abnormal events.
Referring to fig. 1, the embodiment provides a weak supervision self-training video anomaly detection method, which includes the following steps:
S1, acquiring a video anomaly detection dataset and dividing it into a training set and a test set: acquire surveillance video of real-life scenes and construct the video anomaly detection dataset by labeling each video as abnormal or normal according to whether it contains an abnormal event;
in step S1, the specific process includes the following steps:
s11, collecting monitoring videos in different scenes, wherein the different scenes comprise supermarkets, banks, university campuses, highways, parks and residential communities;
S12, labeling each surveillance video as an abnormal video v_a or a normal video v_n according to whether it contains an abnormal event, thereby obtaining the video-level label; abnormal events include events with a significant impact on public safety.
S13, dividing the video anomaly detection dataset into a training set and a test set. The anomaly detection model is mainly trained on the training set and is optimized and improved during training to obtain the optimized model; the test set is used to test the trained anomaly detection model. Since a conventional testing procedure is adopted in this technical scheme, the testing stage is not described in detail.
This embodiment uses the public video anomaly detection datasets ShanghaiTech and UCF-Crime. UCF-Crime is a large-scale dataset consisting of long, untrimmed surveillance videos that cover 13 classes of real-world anomalies chosen because they have a significant impact on public safety. The dataset consists of 1900 videos totaling 13,769,300 frames, of which 1650 videos are used for training and the remaining 250 for testing.
The ShanghaiTech dataset contains 13 scenes with complex lighting conditions and camera angles, comprising 130 abnormal events and more than 270,000 video frames.
S2, constructing an anomaly detection model and training by adopting a training set, wherein the anomaly detection model comprises a first-stage network model and a second-stage network model.
To solve the problem that weakly supervised methods introduce label noise, prior studies clean the label noise with a binary clustering method based on spatio-temporal video features, or clean it iteratively according to the feature similarity and temporal consistency of the video. Unlike these methods, this embodiment also considers that the normal segments within a period of time before and after an abnormal event contain part of the abnormal information. In the first stage we therefore model the whole course of an abnormal event with a Gaussian model; and since one abnormal video may contain several abnormal events, we model the label distribution of the entire abnormal video with a GMM, thereby generating high-confidence labels that better fit the characteristics of abnormal events.
To address the label noise introduced by weak supervision, this embodiment combines weakly supervised learning with self-training to construct a two-stage self-training network, namely the anomaly detection model: in the first stage, a pseudo-label generator guided by prior knowledge is trained with the video-level labels, and this generator is used to produce high-confidence pseudo labels that conform to the video segment distribution.
Referring to fig. 2 and 3, in step S2, the method specifically includes:
S21, first-stage training with the training set: divide each video in the training set into several equal-length video segments, send the segments into a video feature encoder to obtain the corresponding video segment features, and use the segment features together with the video-level labels to train a pseudo-label generator guided by prior knowledge, which generates the pseudo labels of the video segments.
Normal segments make up the vast majority of both normal and abnormal videos. This embodiment obtains the pseudo labels of normal video by one-dimensional median filtering. Rather than assigning the video-level label of a normal video directly to its segments, or supervising the normal segments in an abnormal video with soft labels, which hinders the network from learning the characteristics of normal segments, using the median of the anomaly scores as the pseudo label is more robust.
Referring to fig. 2, in step S21, the specific process includes the following steps:
S211, dividing each video in the training set into several equal-length video segments {v_a^i} and {v_n^i}, typically taking every 16 frames as one video segment.
S212, setting the first-stage training parameters: the operating system is Ubuntu 18.04, the programming language is Python 3.8 with the PyTorch 1.5.0 deep learning framework, and the GPU hardware is an NVIDIA GeForce RTX 2080Ti. This embodiment optimizes the network model with the adaptive gradient optimizer (Adam), with exponential decay rates β_1 = 0.9 and β_2 = 0.999; in the experiments, the batch size is set to 40 and 3000 epochs are trained. A sketch of this optimizer set-up follows.
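For concreteness, the optimizer set-up described above might look as follows in PyTorch; the MLP definition anticipates step S2142 below (layer widths 512, 32, 1), while the input width, the activations and the learning rate are illustrative assumptions rather than values taken from the patent.

```python
import torch
import torch.nn as nn

# Multi-layer perceptron of the pseudo-label generator: three fully connected
# layers with 512, 32 and 1 neurons (see step S2142 below). The input width
# 1024 matches I3D_RGB features; the ReLU/Sigmoid activations and the
# learning rate are assumptions not stated in the text.
mlp = nn.Sequential(
    nn.Linear(1024, 512), nn.ReLU(),
    nn.Linear(512, 32), nn.ReLU(),
    nn.Linear(32, 1), nn.Sigmoid(),  # per-segment anomaly score in [0, 1]
)

# Adam with the stated exponential decay rates beta_1 = 0.9, beta_2 = 0.999.
optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-4, betas=(0.9, 0.999))
```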
S213, building a first-stage network model according to the related parameters; as shown in fig. 2, the first-stage network model is composed of a feature extraction module and a pseudo tag generator, wherein the pseudo tag generator comprises a multi-layer perceptron, a gaussian mixture module GMM and a one-dimensional median filtering module OMF;
s214, extracting corresponding video features from the video segments through a feature extraction module, and inputting the video features into a pseudo tag generator to generate pseudo tags with high confidence.
The Gaussian mixture module GMM is used to learn the distribution of the anomaly scores of abnormal videos and to output high-confidence pseudo labels; in this embodiment, the number of Gaussian components is set to 3. The one-dimensional median filtering module works as follows: first select L (L < N) consecutive segments of a video and take the median of their anomaly scores as the pseudo label of those segments, then select the next L segments, and repeat until all segments have been processed. A minimal sketch of these two modules follows.
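The sketch below assumes the coarse per-segment anomaly scores are available as NumPy arrays; the function names, the use of scikit-learn's GaussianMixture, and the block length L = 5 are illustrative assumptions, not the patent's reference implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_high_conf_scores(scores_abnormal, n_components=3):
    """Fit a 3-component GMM to the coarse anomaly scores of one abnormal
    video; use the posterior of the highest-mean component as each
    segment's high-confidence anomaly score."""
    s = np.asarray(scores_abnormal, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=n_components, random_state=0).fit(s)
    abnormal_comp = int(np.argmax(gmm.means_.ravel()))
    return gmm.predict_proba(s)[:, abnormal_comp]

def omf_high_conf_scores(scores_normal, L=5):
    """One-dimensional median filtering (OMF) as described above: assign the
    median of each run of L consecutive segment scores to those segments,
    then move on to the next L segments."""
    s = np.asarray(scores_normal, dtype=float)
    out = np.empty_like(s)
    for start in range(0, len(s), L):
        out[start:start + L] = np.median(s[start:start + L])
    return out

def min_max_normalize(scores, eps=1e-8):
    """Min-max normalization turning high-confidence scores into the pseudo
    labels used to supervise the second stage."""
    s = np.asarray(scores, dtype=float)
    return (s - s.min()) / (s.max() - s.min() + eps)
```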
In step S214, the specific process includes the following steps:
S2141, inputting the video segments {v_a^i} and {v_n^i} into a pre-trained video feature encoder to obtain the corresponding video segment features {f_a^i} and {f_n^i};
S2142, training a multi-layer perceptron with the video segment features and the video-level labels, wherein the multi-layer perceptron consists of three fully connected layers with different numbers of neurons. The loss function of this part is:

L_1 = max(0, 1 − top_k(S_a) + top_k(S_n)) + λ · Σ_{i=1}^{N} s_a^i

where N is the number of segments of a video, s_a^i and s_n^i are the anomaly scores of the i-th segment in the positive and negative bag during training, top_k(S_a) and top_k(S_n) are the means of the k largest anomaly scores in the positive and negative bag, and λ is a hyperparameter; the first term of the loss is the top-k ranking loss and the second term is the sparsity loss, with top_k(·) defined as:

top_k(S) = max_{Ω_k ⊆ S, |Ω_k| = k} (1/k) · Σ_{s_i ∈ Ω_k} s_i

The parameter k is the number of instances selected from each bag, Ω_k is a subset of the bag of size k, s_i is an anomaly score in the bag, and S is the set of anomaly scores in a bag.
In this embodiment, the input feature dimensions are 16×1024 (I3D_RGB) and 16×4096 (C3D_RGB), the numbers of neurons of the fully connected layers in the multi-layer perceptron are 512, 32 and 1 respectively, the loss hyperparameter λ = 0.01, and the top-k parameter k is set to 3 on the ShanghaiTech dataset and 7 on the UCF-Crime dataset. A PyTorch sketch of this loss follows.
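The hinge margin of 1 and the function signature below follow the reconstruction above and common MIL ranking losses, so treat them as assumptions where the text does not pin them down.

```python
import torch

def topk_mil_loss(scores_abnormal: torch.Tensor,
                  scores_normal: torch.Tensor,
                  k: int = 3, lam: float = 0.01) -> torch.Tensor:
    """Top-k MIL ranking loss plus a sparsity term on the positive bag.

    scores_abnormal / scores_normal: 1-D tensors of per-segment anomaly
    scores for one abnormal (positive) and one normal (negative) video.
    """
    # top_k(.): mean of the k largest per-segment scores in a bag.
    top_a = torch.topk(scores_abnormal, k).values.mean()
    top_n = torch.topk(scores_normal, k).values.mean()
    # Ranking term: push the positive bag's top-k mean above the negative
    # bag's top-k mean by a margin of 1 (hinge form).
    ranking = torch.clamp(1.0 - top_a + top_n, min=0.0)
    # Sparsity term: abnormal segments should be rare in the positive bag.
    return ranking + lam * scores_abnormal.sum()
```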
S2143, the coarse granularity anomaly scores of the normal video and the anomaly video output by the trained multi-layer perceptron are respectivelyAnd->Abnormality score ++>And->Respectively inputting the high confidence anomaly score and the high confidence anomaly score into a Gaussian Mixture Module (GMM) and a one-dimensional median filter module (OMF) by guiding with priori knowledge>And->Pseudo tag of video clip obtained after normalization +.>
Wherein the method comprises the steps ofNormalized pseudo tag for the i-th fragment in positive and negative packets,/for the positive and negative packets>And->Representing the maximum and minimum of the high confidence anomaly scores in the package, respectively.
S22, performing second-stage training, wherein a multi-scale time feature network model is supervised and trained by using the video clips and the pseudo tags generated by the pseudo tag generator, and the second-stage training has the following relevant parameters: the operating system used was ubuntu18.04, python3.8 was used as the programming language using python 1.5.0 deep learning framework, and GPU hardware was NVIDIA GeForce RTX 2080Ti. The application optimizes the network model by adopting an adaptive gradient optimizer (Adam), wherein the exponential decay rate is set to b_1=0.9, and b_2=0.999. In the experiment, batch_size was set to 10, and 300 epochs were trained.
Referring to fig. 3, in step S22, the specific process includes the following steps:
S221, building the second-stage network model; as shown in fig. 3, the second-stage network model is a multi-scale temporal network consisting of a backbone network and three branches;
the backbone network adopts a pre-trained C3D or I3D encoder;
the three branches are an attention branch, a self-guided branch and a dilated convolution branch;
the attention branch consists of an attention module and a classification head H_c, the self-guided branch consists of two three-dimensional convolution layers and a classification head H_g, and the dilated convolution branch consists of two dilated convolutions with different dilation rates and a classification head H_c.
To better learn the characteristics of the abnormal parts of a video, the application adopts this multi-scale spatio-temporal network with three branches: 1) the attention branch makes the network focus more on the abnormal parts of a segment; 2) the self-guided branch guides the learning of the abnormal parts; 3) the dilated convolution branch lets the network attend to the abnormal parts with multi-scale spatio-temporal features as an aid. This network markedly improves the anomaly detection effect.
S222, training the pseudo tag obtained in the first stageAnd video clip->And->Combining the input to the second stage training multi-scale time feature network, the main network extracts the deep features of the video segment, and outputs the feature map f of the 4 th layer and the 5 th layer of the main network b-4 And f b-5 And the video is input into the following three branches to better learn the abnormal information in the video clips.
S223, training the three branches.
In step S223, the specific process includes the following steps:
s2231, attention branch training;
in step S2231, the specific process includes:
S22311, inputting f_b-4 into the two three-dimensional convolution layers Conv3d_1 and Conv3d_2: the resulting feature map is added to f_b-5 to carry out the attention operation, and the feature map output by the first-layer three-dimensional convolution is denoted f';
S22312, inputting f' into the second-layer three-dimensional convolution Conv3d_2 to obtain the feature map f*;
S22313, multiplying f* with f_b-5 and adding f_b-5 to obtain the final attention feature map f_atten:
f_atten = f* · f_b-5 + f_b-5
S22314, using the pseudo label ŷ as supervision, f_atten passes through the classification head H_c and a softmax operation to generate the network's final anomaly score, and the attention branch is optimized with the cross-entropy loss L_atten:

L_atten = −(1/N) · Σ_{i=1}^{N} [ŷ_i · log h_c(S_i) + (1 − ŷ_i) · log(1 − h_c(S_i))]

where S_i is the feature of the i-th video segment in the positive or negative bag, h_c(S_i) is the anomaly score output by this branch, and ŷ_i is the pseudo label of the video segment generated in the first stage. A sketch of this branch follows.
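A minimal PyTorch sketch of the attention branch is given below. Channel counts, kernel sizes, the pooling before H_c, and the use of a sigmoid in place of the two-way softmax are assumptions; f_b-4 and f_b-5 are assumed to have been brought to the same shape, and the point at which f_b-5 is added is one plausible reading of step S22311.

```python
import torch
import torch.nn as nn

class AttentionBranch(nn.Module):
    """Attention branch: two 3-D convolutions over f_b-4, the attention
    operation f_atten = f* . f_b-5 + f_b-5, and a classification head H_c."""

    def __init__(self, channels=512):
        super().__init__()
        self.conv3d_1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.conv3d_2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.head_c = nn.Linear(channels, 1)  # classification head H_c

    def forward(self, f_b4, f_b5):
        f_prime = self.conv3d_1(f_b4)           # first-layer output f'
        f_star = self.conv3d_2(f_prime + f_b5)  # add f_b-5, then Conv3d_2 -> f*
        f_atten = f_star * f_b5 + f_b5          # f_atten = f* . f_b-5 + f_b-5
        pooled = f_atten.mean(dim=(2, 3, 4))    # global average pool -> (B, C)
        return torch.sigmoid(self.head_c(pooled)).squeeze(-1)  # anomaly score

def l_atten(pred_scores, pseudo_labels):
    """Cross-entropy (BCE) of the branch scores against the first-stage
    pseudo labels, matching the L_atten reconstruction above."""
    return nn.functional.binary_cross_entropy(pred_scores, pseudo_labels)
```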
S2232, self-guided branch training, willInput Conv3d 3 Performing dimension reduction through global average pooling once again and average pooling once again, obtaining an abnormality score through softmax operation, and using cross entropy loss L self-guide Optimizing the self-guiding branch;
s2233, training a hole convolution branch, in order to assist a feature encoder to learn the features of a video, a hole convolution module is introduced to capture the features of the video with different scales, and the hole convolution branch comprises two three-dimensional hole convolution DC3d 1 、DC3d 2 Their void fractions are 2 and 4, respectively, and the void convolution branches willAs input, DC3d is convolved by two three-dimensional holes, respectively 1 、DC3d 2 Obtaining a corresponding feature map, and adding the corresponding feature maps to obtain a multi-scale feature map f multi-scale
Also, we use pseudo tagsAs a supervision, cross entropy is used to train the optimization of this part.
In this embodiment, the void ratio of the void convolution is set to 2 and 4, respectively.
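In the sketches below, the channel widths, head layouts and the sigmoid standing in for the two-way softmax are again illustrative assumptions.

```python
import torch
import torch.nn as nn

class SelfGuidedBranch(nn.Module):
    """Self-guided branch: Conv3d_3, dimension reduction by a global average
    pooling over space followed by an average pooling over time, and a
    classification head H_g."""

    def __init__(self, channels=512):
        super().__init__()
        self.conv3d_3 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.head_g = nn.Linear(channels, 1)  # classification head H_g

    def forward(self, f_b5):
        x = self.conv3d_3(f_b5)
        x = x.mean(dim=(3, 4))  # pool over space -> (B, C, T)
        x = x.mean(dim=2)       # pool over time  -> (B, C)
        return torch.sigmoid(self.head_g(x)).squeeze(-1)

class DilatedConvBranch(nn.Module):
    """Dilated convolution branch: two 3-D dilated convolutions DC3d_1 and
    DC3d_2 (dilation rates 2 and 4) whose outputs are added to give the
    multi-scale feature map f_multi-scale, scored by a head H_c."""

    def __init__(self, channels=512):
        super().__init__()
        # padding = dilation keeps the feature-map size with kernel size 3.
        self.dc3d_1 = nn.Conv3d(channels, channels, 3, padding=2, dilation=2)
        self.dc3d_2 = nn.Conv3d(channels, channels, 3, padding=4, dilation=4)
        self.head_c = nn.Linear(channels, 1)

    def forward(self, f_b5):
        f_multi_scale = self.dc3d_1(f_b5) + self.dc3d_2(f_b5)
        pooled = f_multi_scale.mean(dim=(2, 3, 4))
        return torch.sigmoid(self.head_c(pooled)).squeeze(-1)
```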
The total loss of this step is:

L_total = λ_1 · L_atten + λ_2 · L_self-guide + λ_3 · L_multi-scale + λ_4 · L_topk

where λ_1, λ_2, λ_3, λ_4 are hyperparameters, L_multi-scale is the cross-entropy loss of the dilated convolution branch, and L_topk is the top-k ranking loss computed over the set of anomaly scores of a bag.
In this embodiment, the hyperparameters are λ_1 = λ_2 = λ_3 = 1 and λ_4 = 0.8, and the parameter k in top-k is set to 3. A sketch combining these terms follows.
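As a sketch, with the branch losses computed as above and L_topk reusing topk_mil_loss from the first stage (an assumption about how the top-k term enters the total):

```python
def total_loss(l_atten, l_self_guide, l_multi_scale, l_topk,
               lambdas=(1.0, 1.0, 1.0, 0.8)):
    """Weighted sum of the three branch losses and the top-k ranking loss,
    using the hyperparameters stated for this embodiment."""
    l1, l2, l3, l4 = lambdas
    return l1 * l_atten + l2 * l_self_guide + l3 * l_multi_scale + l4 * l_topk
```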
S3, inputting the video to be detected into a trained abnormality detection model to predict an abnormality score of each video segment, and detecting the abnormality in the video segment according to an abnormality threshold, wherein the threshold is set to be 0.75 in the embodiment;
if the abnormality score is greater than the threshold value of 0.75, the abnormal fragment is obtained;
if the abnormality score is less than the threshold value of 0.75, it is a normal segment.
And (3) performing anomaly detection by using the trained anomaly detection model, firstly dividing each unprocessed video to be detected into video segments with equal length, then inputting the segments into the trained anomaly detection model to obtain the anomaly score of each segment, setting an anomaly threshold (the threshold is 0.75 according to experience or a data set setting value), wherein the anomaly score is greater than the threshold and is the anomaly segment, and the anomaly score is smaller than the threshold and is the normal segment, so that anomaly detection is realized.
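The sketch below follows these conventions (16-frame segments, threshold 0.75); the model interface, i.e. a callable mapping one segment to one score, is an assumed placeholder.

```python
import torch

SEGMENT_LEN = 16  # frames per segment
THRESHOLD = 0.75  # anomaly threshold used in this embodiment

@torch.no_grad()
def detect_anomalies(frames: torch.Tensor, model):
    """Split a video tensor (T, C, H, W) into equal-length segments, score
    each segment with the trained model, and threshold the scores."""
    results = []
    for i in range(frames.shape[0] // SEGMENT_LEN):
        seg = frames[i * SEGMENT_LEN:(i + 1) * SEGMENT_LEN]
        score = float(model(seg.unsqueeze(0)))  # assumed segment -> score API
        results.append((i, score, score > THRESHOLD))  # True = abnormal
    return results
```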
On the UCF-Crime dataset, three different backbone networks (I3D_RGB, C3D_RGB, Video_Swin) are used to compare the experimental results of the present application with those of the advanced encoder-based weakly supervised video anomaly detection method MIST; as shown in fig. 4, the present application achieves better results than that method on all three backbones. Fig. 7 compares the experimental results on the UCF-Crime dataset with those of other weakly supervised video anomaly detection methods.
On the ShanghaiTech dataset, the I3D_RGB and C3D_RGB backbone networks are used and compared with the two most advanced encoder-based weakly supervised video anomaly detection methods (MIST, MSL); as shown in fig. 5, the experimental results of this embodiment are superior to both methods with either backbone. Fig. 6 compares the experimental results of this embodiment on the ShanghaiTech dataset with those of other weakly supervised video anomaly detection methods.
To present the experimental results more intuitively, this embodiment trains models with I3D as the backbone on the UCF-Crime and ShanghaiTech datasets, tests on the corresponding datasets, visualizes the finally predicted anomaly scores, and plots the anomaly score curve and the ground truth on the same graph, as shown in fig. 9. The curve is the anomaly score predicted by this embodiment and the shaded area is the real abnormal part (ground truth) of the video; the first row shows the visualization results for videos in the UCF-Crime dataset and the second row those in the ShanghaiTech dataset. As can be seen from fig. 9, the present application can accurately locate abnormal events.
The foregoing has described the application in detail; specific examples are used herein to explain the principles and embodiments of the application, and the above description of the embodiments is only intended to help understand the method of the application and its core concepts. Meanwhile, those skilled in the art may make changes to the specific embodiments and the scope of application in accordance with the ideas of the application; in view of the above, the contents of this description should not be construed as limiting the application.

Claims (9)

1. A weak supervision self-training video anomaly detection method, characterized by comprising the following steps:
s1, acquiring a video anomaly detection data set and dividing the video anomaly detection data set into a training set and a testing set;
s2, constructing an anomaly detection model and training by adopting a training set, wherein the anomaly detection model comprises a first-stage network model and a second-stage network model;
s3, inputting the video to be detected into the trained abnormality detection model to predict the abnormality score of each video segment, and detecting the abnormality in the video segment according to an abnormality threshold;
if the anomaly score is greater than the threshold, the segment is abnormal;
if the anomaly score is less than the threshold, the segment is normal.
2. The method of claim 1, wherein: in step S1, the specific process includes the following steps:
s11, collecting monitoring videos in different scenes, wherein the different scenes comprise supermarkets, banks, university campuses, highways, parks and residential communities;
S12, marking each surveillance video as an abnormal video v_a or a normal video v_n according to whether it contains an abnormal event, and obtaining the corresponding video-level label;
s13, dividing the video anomaly detection data set into a training set and a testing set.
3. The method of claim 1, wherein: in step S2, specifically, the method includes:
s21, training in a first stage, dividing each video in a training set into a plurality of video segments with the same length, sending the video segments into a video feature encoder to obtain corresponding video segment features, training a pseudo tag generator guided based on priori knowledge by utilizing the video segment features and the video level tags, and generating pseudo tags of the video segments;
s22, training in a second stage, and supervising and training a multi-scale time feature network model by using the video clips and the pseudo tags generated by the pseudo tag generator.
4. A detection method according to claim 3, wherein: in step S21, the specific process includes the steps of:
S211, dividing each video in the training set into several equal-length video segments {v_a^i} and {v_n^i};
S212, setting a first-stage training related parameter;
s213, building a first-stage network model according to related parameters, wherein the first-stage network model consists of a feature extraction module and a pseudo tag generator, and the pseudo tag generator comprises a multi-layer perceptron, a Gaussian Mixture Module (GMM) and a one-dimensional median filtering module (OMF);
s214, extracting corresponding video features from the video segments through a feature extraction module, and inputting the video features into a pseudo tag generator to generate pseudo tags with high confidence.
5. The method of claim 4, wherein: in step S214, the specific process includes the following steps:
S2141, inputting the video segments {v_a^i} and {v_n^i} into the video feature encoder to obtain the corresponding video segment features {f_a^i} and {f_n^i};
S2142, training a multi-layer perceptron by using video segment characteristics and video-level labels, wherein the multi-layer perceptron consists of three full-connection layers with different neuron numbers;
S2143, the coarse-grained anomaly scores of the normal video and the abnormal video output by the trained multi-layer perceptron are s_n and s_a respectively; guided by prior knowledge, the anomaly scores s_a and s_n are input into the Gaussian mixture module (GMM) and the one-dimensional median filtering module (OMF) respectively to obtain high-confidence anomaly scores ŝ_a and ŝ_n, and the pseudo labels ŷ of the video segments are obtained after normalization.
6. A detection method according to claim 3, wherein: in step S22, the specific process includes the steps of:
S221, constructing a second-stage network model, wherein the second-stage network model is a multi-scale temporal network consisting of a backbone network and three branches;
S222, combining the pseudo labels ŷ obtained in the first stage with the video segments {v_a^i} and {v_n^i} as input to the second-stage multi-scale temporal feature network; the backbone extracts deep features of the video segments, and the feature maps f_b-4 and f_b-5 output by its 4th and 5th layers are passed to the following three branches.
S223, training the three branches.
7. The method of detecting according to claim 6, wherein:
the backbone network adopts a pre-trained C3D or I3D encoder;
the three branches are an attention branch, a self-guided branch and a dilated convolution branch;
the attention branch consists of an attention module and a classification head H_c, the self-guided branch consists of two three-dimensional convolution layers and a classification head H_g, and the dilated convolution branch consists of two dilated convolutions with different dilation rates and a classification head H_c.
8. The method of detecting according to claim 6, wherein: in step S223, the specific process includes the following steps:
s2231, attention branch training;
S2232, self-guided branch training: inputting f_b-5 into Conv3d_3, reducing the dimensionality through one global average pooling and one further average pooling, obtaining the anomaly score through a softmax operation, and optimizing the self-guided branch with the cross-entropy loss L_self-guide;
S2233, dilated convolution branch training: the dilated convolution branch takes f_b-5 as input, obtains the corresponding feature maps through the two three-dimensional dilated convolutions DC3d_1 and DC3d_2 respectively, and adds them to obtain the multi-scale feature map f_multi-scale.
9. The method of claim 1, wherein: in step S2231, the specific process includes:
S22311, inputting f_b-4 into the two three-dimensional convolution layers Conv3d_1 and Conv3d_2: the resulting feature map is added to f_b-5 to carry out the attention operation, and the feature map output by the first-layer three-dimensional convolution is denoted f';
S22312, inputting f' into the second-layer three-dimensional convolution Conv3d_2 to obtain the feature map f*;
S22313, multiplying f* with f_b-5 and adding f_b-5 to obtain the final attention feature map f_atten;
S22314, using the pseudo label ŷ as supervision, f_atten passes through the classification head H_c and a softmax operation to generate the network's final anomaly score, and the attention branch is optimized with L_atten.
CN202211328891.4A 2022-10-27 2022-10-27 Weak supervision self-training video anomaly detection method Pending CN116935303A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211328891.4A CN116935303A (en) 2022-10-27 2022-10-27 Weak supervision self-training video anomaly detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211328891.4A CN116935303A (en) 2022-10-27 2022-10-27 Weak supervision self-training video anomaly detection method

Publications (1)

Publication Number Publication Date
CN116935303A 2023-10-24

Family

ID=88386776

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211328891.4A Pending CN116935303A (en) 2022-10-27 2022-10-27 Weak supervision self-training video anomaly detection method

Country Status (1)

Country Link
CN (1) CN116935303A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117690191A (en) * 2024-02-02 2024-03-12 南京邮电大学 Intelligent detection method for weak supervision abnormal behavior of intelligent monitoring system
CN117690191B (en) * 2024-02-02 2024-04-30 南京邮电大学 Intelligent detection method for weak supervision abnormal behavior of intelligent monitoring system


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination