CN116935303A - Weak supervision self-training video anomaly detection method

Weak supervision self-training video anomaly detection method

Info

Publication number
CN116935303A
Authority
CN
China
Prior art keywords
video
training
anomaly detection
anomaly
stage
Prior art date
Legal status
Pending
Application number
CN202211328891.4A
Other languages
Chinese (zh)
Inventor
唐俊
汪振涛
王科
朱明
Current Assignee
Anhui University
Original Assignee
Anhui University
Priority date
Filing date
Publication date
Application filed by Anhui University filed Critical Anhui University
Priority to CN202211328891.4A priority Critical patent/CN116935303A/en
Publication of CN116935303A publication Critical patent/CN116935303A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/52 - Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/08 - Neural networks; learning methods
    • G06V10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06V10/762 - Pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • G06V10/764 - Pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/82 - Pattern recognition or machine learning using neural networks
    • Y02T10/40 - Engine management systems (climate change mitigation technologies related to transportation)


Abstract

The application relates to the technical field of computer vision and solves the technical problem that weakly supervised methods introduce a large amount of label noise. It provides a weak supervision self-training video anomaly detection method comprising the following steps: S1, acquiring a video anomaly detection dataset and dividing it into a training set and a test set; S2, constructing an anomaly detection model comprising a first-stage network model and a second-stage network model, and training it with the training set; S3, inputting the video to be detected into the trained anomaly detection model to predict the anomaly score of each video segment, and detecting anomalies according to an anomaly threshold: if the anomaly score is greater than the threshold, the segment is abnormal; if the anomaly score is less than the threshold, the segment is normal. The application generates high-confidence pseudo labels from prior knowledge and thereby improves the accuracy of anomaly detection.

Description

Weak supervision self-training video anomaly detection method
Technical Field
The application relates to the technical field of computer vision, in particular to a weak supervision self-training video anomaly detection method.
Background
Because abnormal events are rare, one popular paradigm is one-class classification or unsupervised learning, i.e. only normal video segments are used during training, so that the deep network learns the characteristics of normal video and anything deviating from them is judged abnormal. Such methods have an obvious limitation: there is never enough data to learn to characterize all normal behavior, so normal segments not covered by the training set may be misdetected as anomalies.
For an abnormal video, the video-level label is not perfectly correct and can only be regarded as a noisy, low-confidence label. To address this, some studies have proposed using a graph convolutional network (GCN) to exploit the feature similarity and temporal consistency of video segments and clean the label noise iteratively; however, this approach assigns the video-level label directly to every video segment, so the abnormal segments are affected by the normal segments in the abnormal video. Other studies generate pseudo labels directly, for example by cleaning label noise with a binary clustering method based on spatio-temporal video features, with the clustering and the network complementing each other during training, or by training a pseudo-label generator with multiple-instance learning (MIL) to produce segment-level labels. However, such methods amount to supervising anomaly detection directly with the anomaly scores of the video segments; they ignore the segment distribution characteristics within an abnormal video and may introduce unnecessary label noise.
Disclosure of Invention
Aiming at the defects of the prior art, the application provides a weak supervision self-training video anomaly detection method, which solves the technical problem that weakly supervised methods introduce a large amount of label noise, and achieves the goals of generating high-confidence pseudo labels from prior knowledge and improving anomaly detection accuracy.
In order to solve the technical problems, the application provides the following technical scheme: a weak supervision self-training video anomaly detection method comprises the following steps:
s1, acquiring a video anomaly detection data set and dividing the video anomaly detection data set into a training set and a testing set;
s2, constructing an anomaly detection model and training by adopting a training set, wherein the anomaly detection model comprises a first-stage network model and a second-stage network model;
s3, inputting the video to be detected into the trained abnormality detection model to predict the abnormality score of each video segment, and detecting the abnormality in the video segment according to an abnormality threshold;
if the anomaly score is greater than the threshold, the segment is abnormal;
if the anomaly score is less than the threshold, the segment is normal.
Further, in step S1, the specific process includes the following steps:
s11, collecting monitoring videos in different scenes, wherein the different scenes comprise supermarkets, banks, university campuses, highways, parks and residential communities;
S12, labeling each surveillance video as an abnormal video v_a or a normal video v_n according to whether it contains an abnormal event, thereby obtaining the video-level label.
S13, dividing the video anomaly detection data set into a training set and a testing set.
Further, in step S2, specifically includes:
S21, first-stage training: dividing each video in the training set into several equal-length video segments, sending the segments into a video feature encoder to obtain the corresponding video segment features, training a pseudo-label generator guided by prior knowledge with the video segment features and the video-level labels, and generating the pseudo labels of the video segments;
S22, second-stage training: supervising the training of a multi-scale temporal feature network model with the video segments and the pseudo labels generated by the pseudo-label generator.
Further, in step S21, the specific process includes the steps of:
S211, dividing each video in the training set into several equal-length video segments {v_a^i} and {v_n^i};
S212, setting a first-stage training related parameter;
s213, building a first-stage network model according to related parameters, wherein the first-stage network model consists of a feature extraction module and a pseudo tag generator, and the pseudo tag generator comprises a multi-layer perceptron, a Gaussian Mixture Module (GMM) and a one-dimensional median filtering module (OMF);
s214, extracting corresponding video features from the video segments through a feature extraction module, and inputting the video features into a pseudo tag generator to generate pseudo tags with high confidence.
Further, in step S214, the specific process includes the following steps:
S2141, sending the video segments {v_a^i} and {v_n^i} into the video feature encoder to obtain the corresponding video segment features {f_a^i} and {f_n^i};
S2142, training a multi-layer perceptron by using video segment characteristics and video-level labels, wherein the multi-layer perceptron consists of three full-connection layers with different neuron numbers;
S2143, the coarse-grained anomaly scores of the normal video and the abnormal video output by the trained multi-layer perceptron are s_n and s_a respectively; guided by prior knowledge, the anomaly scores s_a and s_n are input into the Gaussian mixture module (GMM) and the one-dimensional median filtering module (OMF) respectively to obtain high-confidence anomaly scores ŝ_a and ŝ_n, and the pseudo labels ŷ of the video segments are obtained after normalization.
Further, in step S22, the specific process includes the steps of:
S221, constructing a second-stage network model, wherein the second-stage network model is a multi-scale temporal network consisting of a backbone network and three branches;
S222, combining the pseudo labels ŷ obtained in the first stage with the video segments {v_a^i} and {v_n^i} as input to the second-stage multi-scale temporal feature network; the backbone extracts deep features of the video segments, and layers 4 and 5 of the backbone output the feature maps f_b-4 and f_b-5 to the following three branches.
S223, training the three branches.
Further,
the backbone network adopts a pre-trained C3D or I3D encoder;
the three branches are an attention branch, a self-guided branch and a dilated convolution branch;
the attention branch consists of an attention module and a classification head H_c, the self-guided branch consists of two three-dimensional convolution layers and a classification head H_g, and the dilated convolution branch consists of two dilated convolutions with different dilation rates and a classification head H_c.
Further, in step S223, the specific process includes the steps of:
s2231, attention branch training;
S2232, self-guided branch training: inputting f_b-5 into Conv3d_3, reducing the dimensionality through one global average pooling and one further average pooling, obtaining the anomaly score through a softmax operation, and optimizing the self-guided branch with the cross-entropy loss L_self-guide;
S2233, dilated convolution branch training: the dilated convolution branch takes f_b-5 as input, obtains the corresponding feature maps through the two three-dimensional dilated convolutions DC3d_1 and DC3d_2 respectively, and adds them to obtain the multi-scale feature map f_multi-scale.
Further, in step S2231, the specific process includes:
S22311, inputting f_b-4 into the two three-dimensional convolution layers Conv3d_1 and Conv3d_2: the resulting feature map is added to f_b-5 to carry out the attention operation, and the feature map output by the first-layer three-dimensional convolution is denoted f';
S22312, inputting f' into the second-layer three-dimensional convolution Conv3d_2 to obtain the feature map f*;
S22313, multiplying f* with f_b-5 and adding f_b-5 to obtain the final attention feature map f_atten;
S22314, using the pseudo label ŷ as supervision, f_atten passes through the classification head H_c and a softmax operation to generate the network's final anomaly score, and L_atten is used to optimize the attention branch.
By means of the above technical scheme, the application provides a weak supervision self-training video anomaly detection method with at least the following beneficial effects:
1. To address the label noise introduced by weakly supervised methods, the application combines weakly supervised learning with self-training to construct a two-stage self-training network: in the first stage, a pseudo-label generator guided by prior knowledge is trained with video-level labels, and this generator produces high-confidence pseudo labels that conform to the distribution of video segments.
2. The application takes into account that the normal segments within a period of time before and after an abnormal event also contain part of the abnormal information. The first stage therefore models the whole course of an abnormal event with a Gaussian model. Because one abnormal video may contain several abnormal events, the label distribution of the entire abnormal video is modeled with a GMM, generating high-confidence labels that better fit the characteristics of abnormal events and alleviating the label noise introduced by weak supervision.
3. The application obtains the pseudo labels of normal video by one-dimensional median filtering. Rather than assigning the video-level label of a normal video directly to its segments, or supervising the normal segments in an abnormal video with soft labels, which hinders the network from learning the characteristics of normal segments, using the median of the anomaly scores as the pseudo label is more robust.
4. To better learn the characteristics of the abnormal parts of a video, the application adopts a multi-scale spatio-temporal feature network with three branches, which markedly improves the anomaly detection effect.
5. The application uses the prior-knowledge-guided pseudo-label generator to produce high-confidence pseudo labels, reducing the adverse effect of the label noise introduced by weakly supervised methods.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1 is a flow chart of an anomaly detection method of the present application;
FIG. 2 is a network structure diagram of a first stage network model of the present application;
FIG. 3 is a network architecture diagram of a second stage network model of the present application;
FIG. 4 is a schematic diagram comparing the results of the present application with existing encoder-based anomaly detection methods on the ShanghaiTech dataset;
FIG. 5 is a schematic diagram comparing the results of the present application with existing encoder-based anomaly detection methods on the UCF-Crime dataset;
FIG. 6 is a graph comparing the results of the present application with other methods on the ShanghaiTech dataset;
FIG. 7 is a graph comparing the results of the present application with other methods on the UCF-Crime dataset;
FIG. 8 is a schematic diagram of a pseudo tag of the present application in comparison to other pseudo tag based methods;
FIG. 9 is a schematic diagram showing the effect of the present application on the partial visualization of the results of the detection on two data sets.
Detailed Description
In order that the above-recited objects, features and advantages of the present application will become more readily apparent, a more particular description of the application is given below with reference to the appended drawings and the detailed description, so that how the technical means are applied to solve the technical problems and achieve the technical effects can be fully understood and implemented.
Those of ordinary skill in the art will appreciate that all or a portion of the steps in a method of implementing an embodiment described above may be implemented by a program to instruct related hardware, and thus, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Background overview
Video surveillance plays an important role in security work today, but manually searching surveillance video for abnormal events is time-consuming, laborious and inefficient. Video anomaly detection (VAD) methods can automatically detect anomalies in surveillance video, effectively protect people's personal and property safety, and save a great deal of manpower, achieving twice the result with half the effort.
Because abnormal events are rare, one popular paradigm is one-class classification or unsupervised learning, i.e. only normal video segments are used during training, so that the deep network learns the characteristics of normal video and anything deviating from them is judged abnormal. Such methods have an obvious limitation: there is never enough data to learn to characterize all normal behavior, so normal segments not covered by the training set may be misdetected as anomalies.
Recent research has focused on weak supervision, which provides video-level labels indicating whether a video contains abnormal segments and achieves better detection than unsupervised methods with relatively little human labeling. Previous work treated weakly supervised anomaly detection as a multiple-instance learning (MIL) task: the video is regarded as a bag containing several non-overlapping video segments, each segment is treated as an instance, and the bag-level labels are used to learn instance-level labels. For an abnormal video, the video-level label is not perfectly correct and can only be regarded as a noisy, low-confidence label. To address this, some studies have proposed using a graph convolutional network (GCN) to exploit the feature similarity and temporal consistency of video segments and clean the label noise iteratively; however, this approach assigns the video-level label directly to every video segment, so the abnormal segments are affected by the normal segments in the abnormal video. Other studies generate pseudo labels directly, for example by cleaning label noise with a binary clustering method based on spatio-temporal video features, with the clustering and the network complementing each other during training, or by training a pseudo-label generator with MIL to produce segment-level labels. However, such methods amount to supervising anomaly detection directly with the anomaly scores of the video segments; they ignore the segment distribution characteristics within an abnormal video and may introduce unnecessary label noise.
To address the technical problem in the prior art that weakly supervised methods introduce a large amount of label noise, this embodiment uses prior knowledge to generate high-confidence pseudo labels, improving the accuracy of anomaly detection, alleviating the relatively low detection precision of weakly supervised methods, and achieving high anomaly detection precision with little manual labeling.
Referring to figs. 1-9, which show a specific implementation of this embodiment: the embodiment mainly uses prior knowledge of the segment distributions of abnormal and normal videos to generate high-confidence pseudo labels, reducing the influence of the label noise introduced by weak supervision on the anomaly detection result, and uses multi-scale spatio-temporal features to assist the anomaly detection model in learning abnormal events.
Referring to fig. 1, the embodiment provides a weak supervision self-training video anomaly detection method, which includes the following steps:
S1, acquiring a video anomaly detection dataset and dividing it into a training set and a test set: acquire surveillance video of real-life scenes and construct the video anomaly detection dataset by labeling each video as abnormal or normal according to whether it contains an abnormal event;
in step S1, the specific process includes the following steps:
s11, collecting monitoring videos in different scenes, wherein the different scenes comprise supermarkets, banks, university campuses, highways, parks and residential communities;
S12, labeling each surveillance video as an abnormal video v_a or a normal video v_n according to whether it contains an abnormal event, thereby obtaining the video-level label; abnormal events include events with a significant impact on public safety.
S13, dividing the video anomaly detection dataset into a training set and a test set. The anomaly detection model is mainly trained on the training set and is optimized and improved during training to obtain the optimized model; the test set is used to test the trained anomaly detection model. Since a conventional testing procedure is adopted in this technical scheme, the testing stage is not described in detail.
This embodiment uses the public video anomaly detection datasets ShanghaiTech and UCF-Crime. UCF-Crime is a large-scale dataset consisting of long, untrimmed surveillance videos that cover 13 classes of real-world anomalies chosen because they have a significant impact on public safety. The dataset consists of 1900 videos totaling 13,769,300 frames, of which 1650 videos are used for training and the remaining 250 for testing.
The ShanghaiTech dataset contains 13 scenes with complex lighting conditions and camera angles, comprising 130 abnormal events and more than 270,000 video frames.
S2, constructing an anomaly detection model and training by adopting a training set, wherein the anomaly detection model comprises a first-stage network model and a second-stage network model.
To solve the problem that weakly supervised methods introduce label noise, prior studies clean the label noise with a binary clustering method based on spatio-temporal video features, or clean it iteratively according to the feature similarity and temporal consistency of the video. Unlike these methods, this embodiment also considers that the normal segments within a period of time before and after an abnormal event contain part of the abnormal information. In the first stage we therefore model the whole course of an abnormal event with a Gaussian model; and since one abnormal video may contain several abnormal events, we model the label distribution of the entire abnormal video with a GMM, thereby generating high-confidence labels that better fit the characteristics of abnormal events.
To address the label noise introduced by weak supervision, this embodiment combines weakly supervised learning with self-training to construct a two-stage self-training network, namely the anomaly detection model: in the first stage, a pseudo-label generator guided by prior knowledge is trained with the video-level labels, and this generator is used to produce high-confidence pseudo labels that conform to the video segment distribution.
Referring to fig. 2 and 3, in step S2, the method specifically includes:
S21, first-stage training with the training set: divide each video in the training set into several equal-length video segments, send the segments into a video feature encoder to obtain the corresponding video segment features, and use the segment features together with the video-level labels to train a pseudo-label generator guided by prior knowledge, which generates the pseudo labels of the video segments.
Normal segments make up the vast majority of both normal and abnormal videos. This embodiment obtains the pseudo labels of normal video by one-dimensional median filtering. Rather than assigning the video-level label of a normal video directly to its segments, or supervising the normal segments in an abnormal video with soft labels, which hinders the network from learning the characteristics of normal segments, using the median of the anomaly scores as the pseudo label is more robust.
Referring to fig. 2, in step S21, the specific process includes the following steps:
S211, dividing each video in the training set into several equal-length video segments {v_a^i} and {v_n^i}, typically taking every 16 frames as one video segment.
S212, setting the first-stage training parameters: the operating system is Ubuntu 18.04, the programming language is Python 3.8 with the PyTorch 1.5.0 deep learning framework, and the GPU hardware is an NVIDIA GeForce RTX 2080Ti. This embodiment optimizes the network model with the adaptive gradient optimizer (Adam), with exponential decay rates β_1 = 0.9 and β_2 = 0.999; in the experiments, the batch size is set to 40 and 3000 epochs are trained. A sketch of this optimizer set-up follows.
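For concreteness, the optimizer set-up described above might look as follows in PyTorch; the MLP definition anticipates step S2142 below (layer widths 512, 32, 1), while the input width, the activations and the learning rate are illustrative assumptions rather than values taken from the patent.

```python
import torch
import torch.nn as nn

# Multi-layer perceptron of the pseudo-label generator: three fully connected
# layers with 512, 32 and 1 neurons (see step S2142 below). The input width
# 1024 matches I3D_RGB features; the ReLU/Sigmoid activations and the
# learning rate are assumptions not stated in the text.
mlp = nn.Sequential(
    nn.Linear(1024, 512), nn.ReLU(),
    nn.Linear(512, 32), nn.ReLU(),
    nn.Linear(32, 1), nn.Sigmoid(),  # per-segment anomaly score in [0, 1]
)

# Adam with the stated exponential decay rates beta_1 = 0.9, beta_2 = 0.999.
optimizer = torch.optim.Adam(mlp.parameters(), lr=1e-4, betas=(0.9, 0.999))
```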
S213, building a first-stage network model according to the related parameters; as shown in fig. 2, the first-stage network model is composed of a feature extraction module and a pseudo tag generator, wherein the pseudo tag generator comprises a multi-layer perceptron, a gaussian mixture module GMM and a one-dimensional median filtering module OMF;
s214, extracting corresponding video features from the video segments through a feature extraction module, and inputting the video features into a pseudo tag generator to generate pseudo tags with high confidence.
The Gaussian mixture module GMM is used to learn the distribution of the anomaly scores of abnormal videos and to output high-confidence pseudo labels; in this embodiment, the number of Gaussian components is set to 3. The one-dimensional median filtering module works as follows: first select L (L < N) consecutive segments of a video and take the median of their anomaly scores as the pseudo label of those segments, then select the next L segments, and repeat until all segments have been processed. A minimal sketch of these two modules follows.
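The sketch below assumes the coarse per-segment anomaly scores are available as NumPy arrays; the function names, the use of scikit-learn's GaussianMixture, and the block length L = 5 are illustrative assumptions, not the patent's reference implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_high_conf_scores(scores_abnormal, n_components=3):
    """Fit a 3-component GMM to the coarse anomaly scores of one abnormal
    video; use the posterior of the highest-mean component as each
    segment's high-confidence anomaly score."""
    s = np.asarray(scores_abnormal, dtype=float).reshape(-1, 1)
    gmm = GaussianMixture(n_components=n_components, random_state=0).fit(s)
    abnormal_comp = int(np.argmax(gmm.means_.ravel()))
    return gmm.predict_proba(s)[:, abnormal_comp]

def omf_high_conf_scores(scores_normal, L=5):
    """One-dimensional median filtering (OMF) as described above: assign the
    median of each run of L consecutive segment scores to those segments,
    then move on to the next L segments."""
    s = np.asarray(scores_normal, dtype=float)
    out = np.empty_like(s)
    for start in range(0, len(s), L):
        out[start:start + L] = np.median(s[start:start + L])
    return out

def min_max_normalize(scores, eps=1e-8):
    """Min-max normalization turning high-confidence scores into the pseudo
    labels used to supervise the second stage."""
    s = np.asarray(scores, dtype=float)
    return (s - s.min()) / (s.max() - s.min() + eps)
```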
In step S214, the specific process includes the following steps:
S2141, inputting the video segments {v_a^i} and {v_n^i} into a pre-trained video feature encoder to obtain the corresponding video segment features {f_a^i} and {f_n^i};
S2142, training a multi-layer perceptron with the video segment features and the video-level labels, wherein the multi-layer perceptron consists of three fully connected layers with different numbers of neurons. The loss function of this part is:

L_1 = max(0, 1 − top_k(S_a) + top_k(S_n)) + λ · Σ_{i=1}^{N} s_a^i

where N is the number of segments of a video, s_a^i and s_n^i are the anomaly scores of the i-th segment in the positive and negative bag during training, top_k(S_a) and top_k(S_n) are the means of the k largest anomaly scores in the positive and negative bag, and λ is a hyperparameter; the first term of the loss is the top-k ranking loss and the second term is the sparsity loss, with top_k(·) defined as:

top_k(S) = max_{Ω_k ⊆ S, |Ω_k| = k} (1/k) · Σ_{s_i ∈ Ω_k} s_i

The parameter k is the number of instances selected from each bag, Ω_k is a subset of the bag of size k, s_i is an anomaly score in the bag, and S is the set of anomaly scores in a bag.
In this embodiment, the input feature dimensions are 16×1024 (I3D_RGB) and 16×4096 (C3D_RGB), the numbers of neurons of the fully connected layers in the multi-layer perceptron are 512, 32 and 1 respectively, the loss hyperparameter λ = 0.01, and the top-k parameter k is set to 3 on the ShanghaiTech dataset and 7 on the UCF-Crime dataset. A PyTorch sketch of this loss follows.
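The hinge margin of 1 and the function signature below follow the reconstruction above and common MIL ranking losses, so treat them as assumptions where the text does not pin them down.

```python
import torch

def topk_mil_loss(scores_abnormal: torch.Tensor,
                  scores_normal: torch.Tensor,
                  k: int = 3, lam: float = 0.01) -> torch.Tensor:
    """Top-k MIL ranking loss plus a sparsity term on the positive bag.

    scores_abnormal / scores_normal: 1-D tensors of per-segment anomaly
    scores for one abnormal (positive) and one normal (negative) video.
    """
    # top_k(.): mean of the k largest per-segment scores in a bag.
    top_a = torch.topk(scores_abnormal, k).values.mean()
    top_n = torch.topk(scores_normal, k).values.mean()
    # Ranking term: push the positive bag's top-k mean above the negative
    # bag's top-k mean by a margin of 1 (hinge form).
    ranking = torch.clamp(1.0 - top_a + top_n, min=0.0)
    # Sparsity term: abnormal segments should be rare in the positive bag.
    return ranking + lam * scores_abnormal.sum()
```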
S2143, the coarse granularity anomaly scores of the normal video and the anomaly video output by the trained multi-layer perceptron are respectivelyAnd->Abnormality score ++>And->Respectively inputting the high confidence anomaly score and the high confidence anomaly score into a Gaussian Mixture Module (GMM) and a one-dimensional median filter module (OMF) by guiding with priori knowledge>And->Pseudo tag of video clip obtained after normalization +.>
Wherein the method comprises the steps ofNormalized pseudo tag for the i-th fragment in positive and negative packets,/for the positive and negative packets>And->Representing the maximum and minimum of the high confidence anomaly scores in the package, respectively.
S22, performing second-stage training, wherein a multi-scale time feature network model is supervised and trained by using the video clips and the pseudo tags generated by the pseudo tag generator, and the second-stage training has the following relevant parameters: the operating system used was ubuntu18.04, python3.8 was used as the programming language using python 1.5.0 deep learning framework, and GPU hardware was NVIDIA GeForce RTX 2080Ti. The application optimizes the network model by adopting an adaptive gradient optimizer (Adam), wherein the exponential decay rate is set to b_1=0.9, and b_2=0.999. In the experiment, batch_size was set to 10, and 300 epochs were trained.
Referring to fig. 3, in step S22, the specific process includes the following steps:
S221, building the second-stage network model; as shown in fig. 3, the second-stage network model is a multi-scale temporal network consisting of a backbone network and three branches;
the backbone network adopts a pre-trained C3D or I3D encoder;
the three branches are an attention branch, a self-guided branch and a dilated convolution branch;
the attention branch consists of an attention module and a classification head H_c, the self-guided branch consists of two three-dimensional convolution layers and a classification head H_g, and the dilated convolution branch consists of two dilated convolutions with different dilation rates and a classification head H_c.
To better learn the characteristics of the abnormal parts of a video, the application adopts this multi-scale spatio-temporal network with three branches: 1) the attention branch makes the network focus more on the abnormal parts of a segment; 2) the self-guided branch guides the learning of the abnormal parts; 3) the dilated convolution branch lets the network attend to the abnormal parts with multi-scale spatio-temporal features as an aid. This network markedly improves the anomaly detection effect.
S222, training the pseudo tag obtained in the first stageAnd video clip->And->Combining the input to the second stage training multi-scale time feature network, the main network extracts the deep features of the video segment, and outputs the feature map f of the 4 th layer and the 5 th layer of the main network b-4 And f b-5 And the video is input into the following three branches to better learn the abnormal information in the video clips.
S223, training the three branches.
In step S223, the specific process includes the following steps:
s2231, attention branch training;
in step S2231, the specific process includes:
S22311, inputting f_b-4 into the two three-dimensional convolution layers Conv3d_1 and Conv3d_2: the resulting feature map is added to f_b-5 to carry out the attention operation, and the feature map output by the first-layer three-dimensional convolution is denoted f';
S22312, inputting f' into the second-layer three-dimensional convolution Conv3d_2 to obtain the feature map f*;
S22313, multiplying f* with f_b-5 and adding f_b-5 to obtain the final attention feature map f_atten:
f_atten = f* · f_b-5 + f_b-5
S22314, using the pseudo label ŷ as supervision, f_atten passes through the classification head H_c and a softmax operation to generate the network's final anomaly score, and the attention branch is optimized with the cross-entropy loss L_atten:

L_atten = −(1/N) · Σ_{i=1}^{N} [ŷ_i · log h_c(S_i) + (1 − ŷ_i) · log(1 − h_c(S_i))]

where S_i is the feature of the i-th video segment in the positive or negative bag, h_c(S_i) is the anomaly score output by this branch, and ŷ_i is the pseudo label of the video segment generated in the first stage. A sketch of this branch follows.
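A minimal PyTorch sketch of the attention branch is given below. Channel counts, kernel sizes, the pooling before H_c, and the use of a sigmoid in place of the two-way softmax are assumptions; f_b-4 and f_b-5 are assumed to have been brought to the same shape, and the point at which f_b-5 is added is one plausible reading of step S22311.

```python
import torch
import torch.nn as nn

class AttentionBranch(nn.Module):
    """Attention branch: two 3-D convolutions over f_b-4, the attention
    operation f_atten = f* . f_b-5 + f_b-5, and a classification head H_c."""

    def __init__(self, channels=512):
        super().__init__()
        self.conv3d_1 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.conv3d_2 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.head_c = nn.Linear(channels, 1)  # classification head H_c

    def forward(self, f_b4, f_b5):
        f_prime = self.conv3d_1(f_b4)           # first-layer output f'
        f_star = self.conv3d_2(f_prime + f_b5)  # add f_b-5, then Conv3d_2 -> f*
        f_atten = f_star * f_b5 + f_b5          # f_atten = f* . f_b-5 + f_b-5
        pooled = f_atten.mean(dim=(2, 3, 4))    # global average pool -> (B, C)
        return torch.sigmoid(self.head_c(pooled)).squeeze(-1)  # anomaly score

def l_atten(pred_scores, pseudo_labels):
    """Cross-entropy (BCE) of the branch scores against the first-stage
    pseudo labels, matching the L_atten reconstruction above."""
    return nn.functional.binary_cross_entropy(pred_scores, pseudo_labels)
```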
S2232, self-guided branch training, willInput Conv3d 3 Performing dimension reduction through global average pooling once again and average pooling once again, obtaining an abnormality score through softmax operation, and using cross entropy loss L self-guide Optimizing the self-guiding branch;
s2233, training a hole convolution branch, in order to assist a feature encoder to learn the features of a video, a hole convolution module is introduced to capture the features of the video with different scales, and the hole convolution branch comprises two three-dimensional hole convolution DC3d 1 、DC3d 2 Their void fractions are 2 and 4, respectively, and the void convolution branches willAs input, DC3d is convolved by two three-dimensional holes, respectively 1 、DC3d 2 Obtaining a corresponding feature map, and adding the corresponding feature maps to obtain a multi-scale feature map f multi-scale
Also, we use pseudo tagsAs a supervision, cross entropy is used to train the optimization of this part.
In this embodiment, the void ratio of the void convolution is set to 2 and 4, respectively.
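In the sketches below, the channel widths, head layouts and the sigmoid standing in for the two-way softmax are again illustrative assumptions.

```python
import torch
import torch.nn as nn

class SelfGuidedBranch(nn.Module):
    """Self-guided branch: Conv3d_3, dimension reduction by a global average
    pooling over space followed by an average pooling over time, and a
    classification head H_g."""

    def __init__(self, channels=512):
        super().__init__()
        self.conv3d_3 = nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.head_g = nn.Linear(channels, 1)  # classification head H_g

    def forward(self, f_b5):
        x = self.conv3d_3(f_b5)
        x = x.mean(dim=(3, 4))  # pool over space -> (B, C, T)
        x = x.mean(dim=2)       # pool over time  -> (B, C)
        return torch.sigmoid(self.head_g(x)).squeeze(-1)

class DilatedConvBranch(nn.Module):
    """Dilated convolution branch: two 3-D dilated convolutions DC3d_1 and
    DC3d_2 (dilation rates 2 and 4) whose outputs are added to give the
    multi-scale feature map f_multi-scale, scored by a head H_c."""

    def __init__(self, channels=512):
        super().__init__()
        # padding = dilation keeps the feature-map size with kernel size 3.
        self.dc3d_1 = nn.Conv3d(channels, channels, 3, padding=2, dilation=2)
        self.dc3d_2 = nn.Conv3d(channels, channels, 3, padding=4, dilation=4)
        self.head_c = nn.Linear(channels, 1)

    def forward(self, f_b5):
        f_multi_scale = self.dc3d_1(f_b5) + self.dc3d_2(f_b5)
        pooled = f_multi_scale.mean(dim=(2, 3, 4))
        return torch.sigmoid(self.head_c(pooled)).squeeze(-1)
```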
The total loss of this step is:

L_total = λ_1 · L_atten + λ_2 · L_self-guide + λ_3 · L_multi-scale + λ_4 · L_topk

where λ_1, λ_2, λ_3, λ_4 are hyperparameters, L_multi-scale is the cross-entropy loss of the dilated convolution branch, and L_topk is the top-k ranking loss computed over the set of anomaly scores of a bag.
In this embodiment, the hyperparameters are λ_1 = λ_2 = λ_3 = 1 and λ_4 = 0.8, and the parameter k in top-k is set to 3. A sketch combining these terms follows.
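As a sketch, with the branch losses computed as above and L_topk reusing topk_mil_loss from the first stage (an assumption about how the top-k term enters the total):

```python
def total_loss(l_atten, l_self_guide, l_multi_scale, l_topk,
               lambdas=(1.0, 1.0, 1.0, 0.8)):
    """Weighted sum of the three branch losses and the top-k ranking loss,
    using the hyperparameters stated for this embodiment."""
    l1, l2, l3, l4 = lambdas
    return l1 * l_atten + l2 * l_self_guide + l3 * l_multi_scale + l4 * l_topk
```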
S3, inputting the video to be detected into a trained abnormality detection model to predict an abnormality score of each video segment, and detecting the abnormality in the video segment according to an abnormality threshold, wherein the threshold is set to be 0.75 in the embodiment;
if the abnormality score is greater than the threshold value of 0.75, the abnormal fragment is obtained;
if the abnormality score is less than the threshold value of 0.75, it is a normal segment.
And (3) performing anomaly detection by using the trained anomaly detection model, firstly dividing each unprocessed video to be detected into video segments with equal length, then inputting the segments into the trained anomaly detection model to obtain the anomaly score of each segment, setting an anomaly threshold (the threshold is 0.75 according to experience or a data set setting value), wherein the anomaly score is greater than the threshold and is the anomaly segment, and the anomaly score is smaller than the threshold and is the normal segment, so that anomaly detection is realized.
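The sketch below follows these conventions (16-frame segments, threshold 0.75); the model interface, i.e. a callable mapping one segment to one score, is an assumed placeholder.

```python
import torch

SEGMENT_LEN = 16  # frames per segment
THRESHOLD = 0.75  # anomaly threshold used in this embodiment

@torch.no_grad()
def detect_anomalies(frames: torch.Tensor, model):
    """Split a video tensor (T, C, H, W) into equal-length segments, score
    each segment with the trained model, and threshold the scores."""
    results = []
    for i in range(frames.shape[0] // SEGMENT_LEN):
        seg = frames[i * SEGMENT_LEN:(i + 1) * SEGMENT_LEN]
        score = float(model(seg.unsqueeze(0)))  # assumed segment -> score API
        results.append((i, score, score > THRESHOLD))  # True = abnormal
    return results
```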
On the UCF-Crime dataset, three different backbone networks (I3D_RGB, C3D_RGB, Video_Swin) are used to compare the experimental results of the present application with those of the advanced encoder-based weakly supervised video anomaly detection method MIST; as shown in fig. 4, the present application achieves better results than that method on all three backbones. Fig. 7 compares the experimental results on the UCF-Crime dataset with those of other weakly supervised video anomaly detection methods.
On the ShanghaiTech dataset, the I3D_RGB and C3D_RGB backbone networks are used and compared with the two most advanced encoder-based weakly supervised video anomaly detection methods (MIST, MSL); as shown in fig. 5, the experimental results of this embodiment are superior to both methods with either backbone. Fig. 6 compares the experimental results of this embodiment on the ShanghaiTech dataset with those of other weakly supervised video anomaly detection methods.
To present the experimental results more intuitively, this embodiment trains models with I3D as the backbone on the UCF-Crime and ShanghaiTech datasets, tests on the corresponding datasets, visualizes the finally predicted anomaly scores, and plots the anomaly score curve and the ground truth on the same graph, as shown in fig. 9. The curve is the anomaly score predicted by this embodiment and the shaded area is the real abnormal part (ground truth) of the video; the first row shows the visualization results for videos in the UCF-Crime dataset and the second row those in the ShanghaiTech dataset. As can be seen from fig. 9, the present application can accurately locate abnormal events.
The foregoing has described the application in detail; specific examples are used herein to explain the principles and embodiments of the application, and the above description of the embodiments is only intended to help understand the method of the application and its core concepts. Meanwhile, those skilled in the art may make changes to the specific embodiments and the scope of application in accordance with the ideas of the application; in view of the above, the contents of this description should not be construed as limiting the application.

Claims (9)

1. A weak supervision self-training video anomaly detection method, characterized by comprising the following steps:
s1, acquiring a video anomaly detection data set and dividing the video anomaly detection data set into a training set and a testing set;
s2, constructing an anomaly detection model and training by adopting a training set, wherein the anomaly detection model comprises a first-stage network model and a second-stage network model;
s3, inputting the video to be detected into the trained abnormality detection model to predict the abnormality score of each video segment, and detecting the abnormality in the video segment according to an abnormality threshold;
if the anomaly score is greater than the threshold, the segment is abnormal;
if the anomaly score is less than the threshold, the segment is normal.
2. The method of claim 1, wherein: in step S1, the specific process includes the following steps:
s11, collecting monitoring videos in different scenes, wherein the different scenes comprise supermarkets, banks, university campuses, highways, parks and residential communities;
S12, marking each surveillance video as an abnormal video v_a or a normal video v_n according to whether it contains an abnormal event, and obtaining the corresponding video-level label;
s13, dividing the video anomaly detection data set into a training set and a testing set.
3. The method of claim 1, wherein: in step S2, specifically, the method includes:
s21, training in a first stage, dividing each video in a training set into a plurality of video segments with the same length, sending the video segments into a video feature encoder to obtain corresponding video segment features, training a pseudo tag generator guided based on priori knowledge by utilizing the video segment features and the video level tags, and generating pseudo tags of the video segments;
s22, training in a second stage, and supervising and training a multi-scale time feature network model by using the video clips and the pseudo tags generated by the pseudo tag generator.
4. A detection method according to claim 3, wherein: in step S21, the specific process includes the steps of:
S211, dividing each video in the training set into several equal-length video segments {v_a^i} and {v_n^i};
S212, setting a first-stage training related parameter;
s213, building a first-stage network model according to related parameters, wherein the first-stage network model consists of a feature extraction module and a pseudo tag generator, and the pseudo tag generator comprises a multi-layer perceptron, a Gaussian Mixture Module (GMM) and a one-dimensional median filtering module (OMF);
s214, extracting corresponding video features from the video segments through a feature extraction module, and inputting the video features into a pseudo tag generator to generate pseudo tags with high confidence.
5. The method of claim 4, wherein: in step S214, the specific process includes the following steps:
S2141, inputting the video segments {v_a^i} and {v_n^i} into the video feature encoder to obtain the corresponding video segment features {f_a^i} and {f_n^i};
S2142, training a multi-layer perceptron by using video segment characteristics and video-level labels, wherein the multi-layer perceptron consists of three full-connection layers with different neuron numbers;
S2143, the coarse-grained anomaly scores of the normal video and the abnormal video output by the trained multi-layer perceptron are s_n and s_a respectively; guided by prior knowledge, the anomaly scores s_a and s_n are input into the Gaussian mixture module (GMM) and the one-dimensional median filtering module (OMF) respectively to obtain high-confidence anomaly scores ŝ_a and ŝ_n, and the pseudo labels ŷ of the video segments are obtained after normalization.
6. A detection method according to claim 3, wherein: in step S22, the specific process includes the steps of:
S221, constructing a second-stage network model, wherein the second-stage network model is a multi-scale temporal network consisting of a backbone network and three branches;
S222, combining the pseudo labels ŷ obtained in the first stage with the video segments {v_a^i} and {v_n^i} as input to the second-stage multi-scale temporal feature network; the backbone extracts deep features of the video segments, and the feature maps f_b-4 and f_b-5 output by its 4th and 5th layers are passed to the following three branches.
S223, training the three branches.
7. The method of detecting according to claim 6, wherein:
the backbone network adopts a pre-trained C3D or I3D encoder;
the three branches are an attention branch, a self-guided branch and a dilated convolution branch;
the attention branch consists of an attention module and a classification head H_c, the self-guided branch consists of two three-dimensional convolution layers and a classification head H_g, and the dilated convolution branch consists of two dilated convolutions with different dilation rates and a classification head H_c.
8. The method of detecting according to claim 6, wherein: in step S223, the specific process includes the following steps:
s2231, attention branch training;
S2232, self-guided branch training: inputting f_b-5 into Conv3d_3, reducing the dimensionality through one global average pooling and one further average pooling, obtaining the anomaly score through a softmax operation, and optimizing the self-guided branch with the cross-entropy loss L_self-guide;
S2233, dilated convolution branch training: the dilated convolution branch takes f_b-5 as input, obtains the corresponding feature maps through the two three-dimensional dilated convolutions DC3d_1 and DC3d_2 respectively, and adds them to obtain the multi-scale feature map f_multi-scale.
9. The method of claim 1, wherein: in step S2231, the specific process includes:
S22311, inputting f_b-4 into the two three-dimensional convolution layers Conv3d_1 and Conv3d_2: the resulting feature map is added to f_b-5 to carry out the attention operation, and the feature map output by the first-layer three-dimensional convolution is denoted f';
S22312, inputting f' into the second-layer three-dimensional convolution Conv3d_2 to obtain the feature map f*;
S22313, multiplying f* with f_b-5 and adding f_b-5 to obtain the final attention feature map f_atten;
S22314, using the pseudo label ŷ as supervision, f_atten passes through the classification head H_c and a softmax operation to generate the network's final anomaly score, and the attention branch is optimized with L_atten.
CN202211328891.4A 2022-10-27 2022-10-27 Weak supervision self-training video anomaly detection method Pending CN116935303A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211328891.4A CN116935303A (en) 2022-10-27 2022-10-27 Weak supervision self-training video anomaly detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211328891.4A CN116935303A (en) 2022-10-27 2022-10-27 Weak supervision self-training video anomaly detection method

Publications (1)

Publication Number Publication Date
CN116935303A 2023-10-24

Family

ID=88386776

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211328891.4A Pending CN116935303A (en) 2022-10-27 2022-10-27 Weak supervision self-training video anomaly detection method

Country Status (1)

Country Link
CN (1) CN116935303A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117690191A (en) * 2024-02-02 2024-03-12 南京邮电大学 Intelligent detection method for weak supervision abnormal behavior of intelligent monitoring system
CN117690191B (en) * 2024-02-02 2024-04-30 南京邮电大学 Intelligent detection method for weak supervision abnormal behavior of intelligent monitoring system


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination