CN113705490B - Anomaly detection method based on reconstruction and prediction - Google Patents


Info

Publication number
CN113705490B
Authority
CN
China
Prior art keywords
frame
anomaly detection
reconstruction
video sequence
anomaly
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111016334.4A
Other languages
Chinese (zh)
Other versions
CN113705490A (en)
Inventor
Zhong Yuanhong (仲元红)
Chen Xia (陈霞)
Zhu Dong (朱冬)
Zhang Jian (张建)
Yang Yi (杨易)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University
Original Assignee
Chongqing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University
Priority to CN202111016334.4A
Publication of CN113705490A
Application granted
Publication of CN113705490B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent

Abstract

The invention relates to the technical field of video and image processing, and in particular to an anomaly detection method based on reconstruction and prediction, comprising the following steps: acquiring a test video sequence to be detected; inputting the test video sequence into a pre-trained anomaly detection model, which first extracts the spatial appearance features and temporal motion features of the test video sequence, then fuses them to obtain the corresponding spatio-temporal features, obtains the corresponding reconstructed frame based on those features, and finally calculates the corresponding anomaly score from the reconstructed frame; and taking the anomaly score of the test video sequence as the anomaly detection result. The anomaly detection method provided by the invention balances detection performance and accuracy, and thus improves both the effect and the efficiency of anomaly detection.

Description

Anomaly detection method based on reconstruction and prediction
Technical Field
The invention relates to the technical field of video and image processing, in particular to an anomaly detection method based on reconstruction and prediction.
Background
Video anomaly detection is an important research task in computer vision with many applications, such as traffic accident detection, violence detection, and abnormal crowd behavior detection. Because of the uncertainty and diversity of anomalies, accurately distinguishing abnormal events from normal events remains challenging even after years of study. Meanwhile, in the real world it is difficult to enumerate all abnormal events in order to learn the various abnormal patterns. Therefore, many studies detect anomalies with one-class classification methods rather than supervised binary classification. One-class anomaly detection learns the distribution of normal patterns from normal data and computes the probability that a test sample obeys this distribution to reflect abnormality.
Aiming at the problem that existing anomaly detection methods are sensitive to noise and time intervals, Chinese patent publication No. CN111680614A discloses an abnormal-behavior detection method based on video surveillance: after the target objects in a video frame are extracted, their features are clustered and input into an SVM classifier; the highest score is taken as the anomaly score of each target object, and the highest anomaly score among all target objects in the frame is taken as the anomaly score of that frame. The SVM classifier enables fast and accurate classification that meets real-time requirements.
The anomaly (behavior) detection methods in the prior art detect the foreground objects in each video frame with object detection, input them into a convolutional autoencoder network for reconstruction, and classify anomalies by the reconstruction error. However, existing anomaly detection methods treat all pixels in a frame equally, so the model loses focus and does not preferentially learn and reconstruct the complex regions that are hard to reconstruct during training. As a result, the model cannot effectively obtain high-quality reconstructions of the foreground (because the simple background pixels dominate the optimization of the model), which degrades anomaly detection performance, since the foreground matters more than the static background in anomaly detection. Meanwhile, existing reconstruction methods try to minimize the difference between a reconstructed frame and its ground-truth label, enforcing similarity in pixel space and even in latent space, but this one-to-one constraint ignores the similarity of different normal frames in the same scene, so the accuracy of anomaly detection is limited. Therefore, how to design an anomaly detection method that balances detection performance and accuracy is a technical problem that needs to be solved.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to solve the following technical problem: how to provide an anomaly detection method that balances anomaly detection performance and accuracy, thereby improving the effect and efficiency of anomaly detection.
In order to solve the technical problems, the invention adopts the following technical scheme:
an anomaly detection method based on reconstruction and prediction comprises the following steps:
s1: acquiring a test video sequence to be detected;
s2: inputting the test video sequence into a pre-trained anomaly detection model; the anomaly detection model first extracts the spatial appearance features and temporal motion features of the test video sequence respectively, then fuses them to obtain the corresponding spatio-temporal features, then obtains the corresponding reconstructed frame based on the spatio-temporal features, and finally calculates the corresponding anomaly score according to the reconstructed frame;
s3: and taking the anomaly score of the test video sequence as an anomaly detection result.
Preferably, the anomaly detection model comprises a reconstruction encoder for extracting spatial appearance characteristics, a prediction encoder for extracting temporal motion characteristics, a fusion module connected with the outputs of the reconstruction encoder and the prediction encoder and used for fusing to obtain space-time characteristics, and a decoder connected with the output of the fusion module and used for acquiring reconstructed frames.
Preferably, in step S2, a current frame of the test video sequence is input to the reconstruction encoder to extract corresponding spatial appearance features; a number of frames preceding the current frame of the test video sequence are input to the predictive encoder to extract corresponding temporal motion characteristics.
Preferably, when the anomaly detection model is trained, reversely erasing the video sequence input in the current round based on the reconstruction error of the previous round of the anomaly detection model so as to remove pixels with the reconstruction error smaller than a preset threshold value in the video sequence and obtain a corresponding erasing frame.
Preferably, I_t denotes the t-th frame in a video sequence, and I_{t-Δ} denotes the frame Δ frames before I_t;
the reverse erasure means: after each round of training iterations except the first, the pixel-level error between the original frame I_t and the reconstructed frame Î_t is first calculated; each pixel value of a mask is set to 1 or 0 according to whether the pixel-level error is larger than a preset threshold, so as to obtain the corresponding mask; finally, before the current round of training, each frame from I_{t-Δ} to I_t is multiplied pixel-by-pixel by the mask to obtain the erased frames of the current round of the anomaly detection model, denoted I'_{t-Δ} to I'_t.
Preferably, when the anomaly detection model is trained, a deep SVDD module is connected to the output of the decoder; the deep SVDD module is used to find the hypersphere of minimum volume that contains all or most high-level features of the reconstructed frames of normal events, and uses a compactness constraint on the high-level features of the reconstructed frames to make reconstructed normal frames similar, so as to increase the reconstruction distance between normal and abnormal frames.
Preferably, the deep SVDD module includes a mapping encoder connected to the output of the decoder, and a hypersphere connected to the output of the mapping encoder; the mapping encoder first maps the reconstructed frame Î_t into a low-dimensional latent representation, and then fits the low-dimensional representations into a hypersphere of minimum volume to force the anomaly detection model to learn the common factors of normal events;
the objective function of the deep SVDD module is defined as:

$$\min_{W,R}\; R^2+\frac{1}{\nu n}\sum_{t=1}^{n}\max\big\{0,\;\|\phi(\hat{I}_t;W)-c\|^2-R^2\big\}$$

where c and R denote the center and radius of the hypersphere, n denotes the number of frames, φ(Î_t; W) denotes the low-dimensional representation of the reconstructed frame Î_t output by the network with parameters W, and max{0, ·} keeps only the positive part.
Preferably, the anomaly detection model is optimized by a training loss function;
the reconstructed frame Î_t is constrained in both pixel space and the latent space of the deep SVDD module;
in pixel space, the anomaly detection model is optimized based on the intensity loss and the weighted RGB loss; in latent space, the anomaly detection model is optimized based on the feature compactness loss.
Preferably, the training loss function is expressed by the following formula:

$$L=\lambda_{int}L_{int}+\lambda_{rgb}L_{rgb}+\lambda_{compact}L_{compact}$$

where L_int denotes the intensity loss, L_rgb denotes the weighted RGB loss, L_compact denotes the feature compactness loss, and λ_int, λ_rgb, λ_compact are the hyper-parameters corresponding to each loss, determining their contribution to the total training loss;
the intensity loss L_int is calculated by the following formula:

$$L_{int}=\|\hat{I}_t-I_t\|_2^2$$

where t denotes the t-th frame of the video sequence and ||·||_2 denotes the l2 norm;
the weighted RGB loss L_rgb is calculated by the following formula:

$$L_{rgb}=\sum_{i=1}^{N}\frac{N-i+1}{N}\,\Big\|\,\big|\hat{I}_t-I_{t-i}\big|-\big|I_t-I_{t-i}\big|\,\Big\|_1$$

where ||·||_1 denotes the l1 norm, N denotes the number of previous frames, and frame I_{t-i} has weight (N-i+1)/N;
the feature compactness loss is calculated by the following formula:

$$L_{compact}=R^2+\frac{1}{\nu n}\sum_{t=1}^{n}\max\big\{0,\;\|\phi(\hat{I}_t;W)-c\|^2-R^2\big\}$$

where c and R denote the center and radius of the hypersphere, n denotes the number of frames, and φ(Î_t; W) denotes the low-dimensional representation of the reconstructed frame Î_t output by the network with parameters W.
Preferably, the anomaly detection model calculates the corresponding anomaly score by:
s201: the partial score of each image block in the test video sequence is defined as:

$$S(P)=\frac{1}{|P|}\sum_{(i,j)\in P}\big(\hat{I}_t(i,j)-I_t(i,j)\big)^2$$

where P denotes an image block of the frame, i and j denote the spatial positions of pixels in the image block, and |P| denotes the number of pixels in the image block; the image blocks are determined by a sliding window with a stride of 4;
s202: the anomaly score of a frame in the test video sequence is calculated as:

$$Score=\max\{S(P_1),S(P_2),\ldots,S(P_m)\}$$

where the size of P is set to 16×16 and m denotes the number of image blocks;
s203: after obtaining the score of each frame in the test video sequence, the scores of all frames are normalized to the range [0,1] to obtain the following frame-level anomaly score:

$$S_t=\frac{Score_t-\min Score}{\max Score-\min Score}$$

where min Score and max Score denote the minimum and maximum scores in the test video sequence, respectively;
s204: and smoothing the frame-level anomaly score in the time dimension by adopting a Gaussian filter to obtain the anomaly score corresponding to the test video sequence.
Compared with the prior art, the abnormality detection method has the following beneficial effects:
according to the invention, the spatial features and the temporal features of the video sequence are respectively extracted through the reconstruction method and the prediction method, and the corresponding space-time features are obtained through fusion to calculate the reconstruction frame, so that the model does not lose focus, the complex region which is difficult to reconstruct during the prior learning and reconstruction training can be preferentially studied, the reconstructed image of the high-quality foreground can be effectively obtained, and the anomaly detection performance of the anomaly detection model is further improved; meanwhile, the mode of extracting the spatial features and the time features considers the similarity of different normal frames in the same scene, namely the abnormality detection accuracy of the abnormality detection model can be improved. Therefore, the abnormality detection method provided by the invention has the advantages of both abnormality detection performance and accuracy, so that the abnormality detection effect and efficiency can be improved.
In the invention, some pixels are erased from the original frame in a reverse erasing mode to create the input data of the model (namely, the erased frame), which can keep the pixels with larger reconstruction errors in the previous training round, remove the pixels with smaller reconstruction errors, further force the model to focus on the pixels which are not reconstructed in the previous training round, ensure that the simple background and the complex foreground are reconstructed with high quality, the erased frame keeps most foreground pixels, and discards most background pixels, thereby being beneficial to the model to automatically form a focus mechanism on the foreground, and further being capable of considering the anomaly detection performance and the accuracy.
In the invention, the depth SVDD module directly acts on the reconstructed frame, so that all or most of advanced features of the reconstructed frame of the normal event can be found by the minimum hyper sphere, the similarity between reconstructed images of the normal frame is ensured by the similar low-dimensional features in the potential space, and the reconstruction distance between the normal frame and the abnormal frame can be effectively increased, thereby further improving the accuracy of abnormal detection.
Drawings
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings, in which:
FIG. 1 is a logic block diagram of an anomaly detection method;
FIG. 2 is a diagram of a network configuration during an anomaly detection model test;
FIG. 3 is a diagram of a network structure during training of an anomaly detection model;
FIG. 4 is a partial example diagram of three data sets (exception events are marked by bounding boxes);
FIG. 5 is a network block diagram of an encoder and decoder;
FIG. 6 is a schematic diagram of qualitative results of frame reconstruction for three data sets (lighter colors represent larger errors);
FIG. 7 is a comparative schematic of anomaly scores;
FIG. 8 is a graph showing a comparison of average scores of normal and abnormal frames;
FIG. 9 is a graphical representation of the visual results of reverse erasure during different training periods;
FIG. 10 is a comparative graph of model training loss with and without reverse erasure on Ped 2;
FIG. 11 is a schematic diagram of a model visualization result with and without reverse erasure on Avenue and Ped 2;
FIG. 12 is a t-SNE visualization schematic of a low-dimensional representation of a reconstructed frame in Avenue and Ped 2.
Detailed Description
The following is a further detailed description of the embodiments:
examples:
an anomaly detection method based on reconstruction and prediction is disclosed in this embodiment.
As shown in fig. 1, the anomaly detection method based on reconstruction and prediction includes the steps of:
s1: acquiring a test video sequence to be detected;
s2: inputting the test video sequence into a pre-trained anomaly detection model; the anomaly detection model first extracts the spatial appearance features and temporal motion features of the test video sequence respectively, then fuses them to obtain the corresponding spatio-temporal features, obtains the corresponding reconstructed frame based on those features, and finally calculates the corresponding anomaly score from the reconstructed frame;
s3: and taking the anomaly score of the test video sequence as an anomaly detection result.
In a specific implementation process, as shown in fig. 2, the anomaly detection model (Dual-Encoder Single-Decoder network, DESDnet) includes a reconstruction encoder for extracting spatial appearance features, a prediction encoder for extracting temporal motion features, a fusion module connected to the outputs of both encoders for fusing them into spatio-temporal features, and a decoder connected to the output of the fusion module for obtaining reconstructed frames. Specifically, the fusion module comprises a two-dimensional convolution layer and a Tanh activation layer; the convolution kernel of the two-dimensional convolution layer is 1×1 in size with 512 channels. The current frame of the test video sequence is input to the reconstruction encoder to extract the corresponding spatial appearance features, and several frames preceding the current frame are input to the prediction encoder to extract the corresponding temporal motion features. During the test phase, the frames from I_{t-Δ} to I_t are input to the reconstruction encoder and the prediction encoder to extract the spatial and temporal features of the video sequence, respectively. The appearance features a_t and motion features m_t are concatenated and input into the fusion module to obtain the corresponding spatio-temporal features; compared with fusing by direct concatenation of the features alone, this saves computation and improves the expressive capacity of the model. The spatio-temporal features are then input into the decoder, and the reconstructed frame Î_t is obtained by deconvolution.
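As an illustrative sketch of this layout, assuming PyTorch, with the encoder and decoder internals left as placeholders (the description fixes only the fusion module: a 1×1 convolution with 512 output channels followed by Tanh; the 1024 input channels are an assumption about the two concatenated feature maps):

```python
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    """Fuses appearance and motion features into spatio-temporal features."""
    def __init__(self, in_channels=1024, out_channels=512):
        super().__init__()
        # 1x1 two-dimensional convolution with 512 channels, then Tanh,
        # as specified for the fusion module.
        self.fuse = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1),
            nn.Tanh(),
        )

    def forward(self, appearance, motion):
        # Concatenate a_t and m_t along the channel axis, then fuse.
        return self.fuse(torch.cat([appearance, motion], dim=1))

class DESDNet(nn.Module):
    """Dual-encoder single-decoder skeleton: I_{t-D}..I_t in, reconstructed frame out."""
    def __init__(self, rec_encoder, pred_encoder, decoder):
        super().__init__()
        self.rec_encoder = rec_encoder    # appearance features a_t from I_t
        self.pred_encoder = pred_encoder  # motion features m_t from I_{t-D}..I_{t-1}
        self.fusion = FusionModule()
        self.decoder = decoder            # deconvolves fused features into the reconstruction

    def forward(self, current_frame, previous_frames):
        a_t = self.rec_encoder(current_frame)
        m_t = self.pred_encoder(previous_frames)
        return self.decoder(self.fusion(a_t, m_t))
```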
According to the invention, the spatial and temporal features of the video sequence are extracted by the reconstruction and prediction methods respectively and fused into the corresponding spatio-temporal features to compute the reconstructed frame. The model therefore does not lose focus and can preferentially learn and reconstruct the complex regions that are difficult to reconstruct during training, so that high-quality reconstructed images of the foreground can be obtained effectively, improving the anomaly detection performance of the model. Meanwhile, this way of extracting spatial and temporal features takes into account the similarity of different normal frames in the same scene, which improves the anomaly detection accuracy of the model. The proposed anomaly detection method thus balances detection performance and accuracy, improving both the effect and the efficiency of anomaly detection.
In the specific implementation process, as shown in fig. 3, when the anomaly detection model is trained, reverse erasure is performed on the video sequence input in the current round based on the reconstruction errors of the previous round, so as to remove the pixels whose reconstruction error is smaller than a preset threshold and obtain the corresponding erased frames.
Specifically, I_t denotes the t-th frame in the video sequence and I_{t-Δ} denotes the frame Δ frames before I_t;
reverse erasure means: after each round of training iterations except the first, the pixel-level error between the original frame I_t and the reconstructed frame Î_t is first calculated; each pixel value of a mask is set to 1 or 0 according to whether the pixel-level error exceeds a preset threshold, yielding the corresponding mask; finally, before the current round of training, each frame from I_{t-Δ} to I_t is multiplied pixel-by-pixel by the mask to obtain the erased frames of the current round of the anomaly detection model, denoted I'_{t-Δ} to I'_t. During the training phase, given the erased frames from I'_{t-Δ} to I'_t, I'_t is input to the reconstruction encoder to extract appearance features in the spatial domain, denoted a_t, and I'_{t-Δ} to I'_{t-1} are input to the prediction encoder to extract motion features in the temporal domain, denoted m_t. Compared with capturing motion patterns with optical flow, the method of the present invention avoids the inaccuracy and high computational cost of optical flow estimation.
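A minimal sketch of this erasure step, assuming frames are PyTorch tensors in [-1, 1]; the threshold value is a hypothetical choice, since the description only specifies a preset threshold:

```python
import torch

def reverse_erase(clip, original_t, recon_prev, threshold=0.1):
    """clip: (B, D+1, C, H, W) frames I_{t-D}..I_t of the current round.
    original_t, recon_prev: (B, C, H, W) frame I_t and its reconstruction
    from the previous training round."""
    # Pixel-level reconstruction error of the previous round,
    # averaged over color channels.
    error = (original_t - recon_prev).pow(2).mean(dim=1, keepdim=True)  # (B,1,H,W)
    # Mask is 1 where the error is still large (hard-to-reconstruct pixels)
    # and 0 where it is below the threshold.
    mask = (error > threshold).float()
    # Multiply every frame of the clip pixel-by-pixel by the mask.
    return clip * mask.unsqueeze(1)
```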
In the invention, some pixels are erased from the original frames by reverse erasure to create the input data of the model (i.e., the erased frames). This keeps the pixels with large reconstruction errors from the previous training round and removes the pixels with small errors, forcing the model to focus on the pixels that were not well reconstructed in the previous round and ensuring that both the simple background and the complex foreground are reconstructed with high quality. The erased frames keep most foreground pixels and discard most background pixels, which helps the model automatically form a focus mechanism on the foreground, so that detection performance and accuracy can both be achieved. Meanwhile, the uncertain variation of the input also makes the anomaly detection model more robust to noise, so the model does not lose focus, preferentially learns and reconstructs the complex regions that are difficult to reconstruct during training, and effectively obtains high-quality reconstructed images of the foreground, further improving the anomaly detection performance of the model.
In the specific implementation process, as shown in fig. 3, when the anomaly detection model is trained, a deep SVDD module is connected to the output of the decoder; the deep SVDD module is used to find the hypersphere of minimum volume that contains all or most high-level features of the reconstructed frames of normal events, and uses a compactness constraint on these high-level features to make reconstructed normal frames similar, so as to increase the reconstruction distance between normal and abnormal frames.
Specifically, the deep SVDD module comprises a mapping encoder connected to the output of the decoder and a hypersphere connected to the output of the mapping encoder; the mapping encoder first maps the reconstructed frame Î_t into a low-dimensional latent representation, and the low-dimensional representations are then fitted into a hypersphere of minimum volume to force the anomaly detection model to learn the common factors of normal events;
the objective function of the deep SVDD module is defined as:

$$\min_{W,R}\; R^2+\frac{1}{\nu n}\sum_{t=1}^{n}\max\big\{0,\;\|\phi(\hat{I}_t;W)-c\|^2-R^2\big\}$$

where c and R denote the center and radius of the hypersphere, n denotes the number of frames, φ(Î_t; W) denotes the low-dimensional representation of the reconstructed frame Î_t output by the network with parameters W, and max{0, ·} keeps only the positive part. In the objective function, the first term minimizes the volume of the hypersphere and the second term is the penalty term for samples lying outside it; the hyper-parameter ν ∈ (0, 1] trades off the volume of the hypersphere against boundary violations: a large ν allows some samples to fall outside the hypersphere, while a small ν penalizes samples falling outside it heavily. The network parameters W and the radius R are optimized by block coordinate descent with alternating minimization: R is fixed while the network is iterated k times to optimize the parameters W; after k iterations, R is optimized using the latest W.
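A sketch of this soft-boundary objective and the alternating update, following the reconstruction above; the quantile-based radius update is a common implementation choice for deep SVDD and is an assumption here:

```python
import torch

def svdd_loss(features, center, radius, nu=0.1):
    """features: (n, d) low-dimensional representations phi(I_hat; W).
    Implements R^2 + (1/(nu*n)) * sum max{0, ||phi - c||^2 - R^2}."""
    dist_sq = ((features - center) ** 2).sum(dim=1)    # squared distance to c
    slack = torch.clamp(dist_sq - radius ** 2, min=0)  # positive part only
    return radius ** 2 + slack.mean() / nu

def update_radius(features, center, nu=0.1):
    # With W fixed, choose R as the (1 - nu)-quantile of the distances to c,
    # so that roughly a nu-fraction of samples lies outside the hypersphere.
    dist = ((features - center) ** 2).sum(dim=1).sqrt()
    return torch.quantile(dist, 1 - nu)
```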
In the invention, the deep SVDD module acts directly on the reconstructed frames, so that a minimum-volume hypersphere can be found that contains all or most high-level features of the reconstructed frames of normal events. Similar low-dimensional features in the latent space ensure similarity between the reconstructed images of normal frames, which effectively increases the reconstruction distance between normal and abnormal frames and further improves the accuracy of anomaly detection.
In the specific implementation process, the anomaly detection model is optimized by a training loss function;
the reconstructed frame Î_t is constrained in both pixel space and the latent space of the deep SVDD module;
in pixel space, the anomaly detection model is optimized based on the intensity loss and the weighted RGB loss; in latent space, the anomaly detection model is optimized based on the feature compactness loss.
Specifically, the training loss function is expressed by the following formula:

$$L=\lambda_{int}L_{int}+\lambda_{rgb}L_{rgb}+\lambda_{compact}L_{compact}$$

where L_int denotes the intensity loss, L_rgb denotes the weighted RGB loss, L_compact denotes the feature compactness loss, and λ_int, λ_rgb, λ_compact are the hyper-parameters corresponding to each loss, determining their contribution to the total training loss;
the intensity loss L_int is calculated by the following formula:

$$L_{int}=\|\hat{I}_t-I_t\|_2^2$$

where t denotes the t-th frame of the video sequence and ||·||_2 denotes the l2 norm;
the weighted RGB loss L_rgb is calculated by the following formula:

$$L_{rgb}=\sum_{i=1}^{N}\frac{N-i+1}{N}\,\Big\|\,\big|\hat{I}_t-I_{t-i}\big|-\big|I_t-I_{t-i}\big|\,\Big\|_1$$

where ||·||_1 denotes the l1 norm, N denotes the number of previous frames, and frame I_{t-i} has weight (N-i+1)/N;
the feature compactness loss is calculated by the following formula:

$$L_{compact}=R^2+\frac{1}{\nu n}\sum_{t=1}^{n}\max\big\{0,\;\|\phi(\hat{I}_t;W)-c\|^2-R^2\big\}$$

where c and R denote the center and radius of the hypersphere, n denotes the number of frames, and φ(Î_t; W) denotes the low-dimensional representation of the reconstructed frame Î_t output by the network with parameters W.
In order to constrain the reconstructions of all normal frames to a reachable range, the mean of the feature vectors of the reconstructed frames extracted by the first-round training model is taken as the center c. In subsequent training, the Euclidean distance between the feature representation of each reconstructed frame and the center c is calculated, and the feature compactness loss is obtained from this distance.
In the present invention, minimizing the feature compactness loss enables the model to automatically map the reconstructions of normal frames near the center of the hypersphere, obtaining a compact description of normal events. The features of reconstructed frames containing normal events therefore lie close to the center of the hypersphere, while the features of abnormal events lie far from the center or even outside the hypersphere. This means that the reconstructed images of all normal frames in pixel space are more similar to one another, while the reconstructed images of abnormal frames differ more from those of normal frames, which increases the discriminability of anomalies and improves the detection performance and accuracy of the anomaly detection model.
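A sketch of the combined training loss with the hyper-parameter values reported in the experiments below (λ_int = 1, λ_rgb = 0.2, λ_compact = 0.01); the exact form of the weighted RGB term follows the reconstruction given above and should be read as an assumption, and `svdd_loss` is the function from the previous sketch:

```python
import torch

def intensity_loss(recon, target):
    # Squared l2 norm of the difference between reconstruction and original.
    return ((recon - target) ** 2).sum()

def weighted_rgb_loss(recon, target, prev_frames):
    # prev_frames: [I_{t-1}, ..., I_{t-N}]; frame I_{t-i} gets weight (N-i+1)/N.
    n = len(prev_frames)
    loss = recon.new_zeros(())
    for i, prev in enumerate(prev_frames, start=1):
        diff_recon = (recon - prev).abs()   # RGB difference of the reconstruction
        diff_real = (target - prev).abs()   # RGB difference of the ground truth
        loss = loss + (n - i + 1) / n * (diff_recon - diff_real).abs().sum()
    return loss

def total_loss(recon, target, prev_frames, features, center, radius,
               lam_int=1.0, lam_rgb=0.2, lam_compact=0.01, nu=0.1):
    return (lam_int * intensity_loss(recon, target)
            + lam_rgb * weighted_rgb_loss(recon, target, prev_frames)
            + lam_compact * svdd_loss(features, center, radius, nu))
```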
In the specific implementation process, the anomaly detection model calculates the corresponding anomaly score by the following steps:
s201: the partial score of each image block in the test video sequence is defined as:

$$S(P)=\frac{1}{|P|}\sum_{(i,j)\in P}\big(\hat{I}_t(i,j)-I_t(i,j)\big)^2$$

where P denotes an image block of the frame, i and j denote the spatial positions of pixels in the image block, and |P| denotes the number of pixels in the image block; the image blocks are determined by a sliding window with a stride of 4;
s202: the anomaly score of a frame in the test video sequence is calculated as:

$$Score=\max\{S(P_1),S(P_2),\ldots,S(P_m)\}$$

where the size of P is set to 16×16 and m denotes the number of image blocks;
s203: after obtaining the score of each frame in the test video sequence, the scores of all frames are normalized to the range [0,1] to obtain the following frame-level anomaly score:

$$S_t=\frac{Score_t-\min Score}{\max Score-\min Score}$$

where min Score and max Score denote the minimum and maximum scores in the test video sequence, respectively;
s204: and smoothing the frame-level anomaly score in the time dimension by adopting a Gaussian filter to obtain the anomaly score corresponding to the test video sequence.
Through the above steps, the invention can effectively calculate the anomaly score of a test video sequence and then detect abnormal behaviors or events in it based on that score, thereby helping to improve the anomaly detection effect.
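A sketch of this scoring pipeline in NumPy/SciPy (16×16 blocks, stride 4, maximum over blocks, min-max normalization, temporal Gaussian smoothing); the smoothing width `sigma` is a hypothetical choice, as the description does not fix it:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def frame_score(recon, target, block=16, stride=4):
    """recon, target: (H, W, C) arrays; returns the frame's anomaly score."""
    # Per-pixel squared error, averaged over color channels.
    err = ((recon - target) ** 2).mean(axis=-1)
    h, w = err.shape
    # Mean error of every 16x16 block under a window sliding with stride 4.
    scores = [err[y:y + block, x:x + block].mean()
              for y in range(0, h - block + 1, stride)
              for x in range(0, w - block + 1, stride)]
    return max(scores)  # the worst block determines the frame score

def video_anomaly_scores(frame_scores, sigma=3):
    s = np.asarray(frame_scores, dtype=np.float64)
    s = (s - s.min()) / (s.max() - s.min() + 1e-8)  # normalize to [0, 1]
    return gaussian_filter1d(s, sigma=sigma)        # smooth along time
```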
In order to better illustrate the advantages of the abnormality detection method of the present invention, the present embodiment also discloses the following experiment:
the experiment was performed on three publicly available data sets, as shown in fig. 4, a CUHK Avenue data set, a UCSD pedicure data set, and a certain university campus anomaly detection data set, respectively.
The model of the present invention is implemented in PyTorch according to the network structure parameters in fig. 5.
To train the model, adam's algorithm with an initial learning rate of 0.0002 was introduced and cosine annealing was used to attenuate the learning rate. The batch size was set to 4, and the number of training rounds on CUHK Avenue, UCSD Ped2, and certain university campus anomaly detection dataset were 60, and 10, respectively. For all data sets, the frame size is adjusted to 256×256 pixels, and the pixel intensity is normalized to the range of [ -1,1 ]. The total length of the input frame is set to 5, i.e., Δ=4.
For the training loss function, the hyper-parameters λ_int, λ_rgb, and λ_compact are set to 1, 0.2, and 0.01, respectively. The ν of the deep SVDD module is set to 0.1 to ensure tolerance of the model to various normal patterns. To reduce the memory required for computation, this embodiment does not compute a separate mask for each frame in the training set; instead, an OR operation is performed over these masks to generate a generic mask for erasure in the next round of training. The whole experiment is carried out on a computer running Linux Ubuntu 16.04 with an Intel Core i7-7800X CPU @ 3.50 GHz and a GeForce GTX 1080 graphics card with 8 GB of memory.
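A sketch of this optimization setup; `model`, `train_loader`, and `training_step` are placeholders for the DESDnet, the clip loader, and the combined loss described above, while the optimizer and scheduler follow the reported configuration:

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)  # initial lr 0.0002
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=60)

for epoch in range(60):            # 60 epochs, as on CUHK Avenue / Ped2
    for clip in train_loader:      # batches of 4 clips, 5 frames each,
        optimizer.zero_grad()      # resized to 256x256 and scaled to [-1, 1]
        loss = training_step(model, clip)
        loss.backward()
        optimizer.step()
    scheduler.step()               # cosine-annealed learning-rate decay
```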
CUHK Avenue dataset: contains 37 videos, of which 16 videos with 15,328 frames are used to train the model and the other 21 videos with 15,324 frames are used to evaluate its anomaly detection performance. Each frame has a resolution of 640×360. In this dataset 47 abnormal events can be observed, including loitering, throwing objects, and running.
UCSD Pedestrian dataset: comprises the Ped1 (UCSD Pedestrian 1) and Ped2 (UCSD Pedestrian 2) datasets. Experiments are performed on Ped2 but not on Ped1, because the 158×238 frame resolution of Ped1 is quite low. Ped2 contains 16 training videos and 12 test videos, each no longer than 200 frames, with a frame resolution of 360×240. The Ped2 dataset contains 12 abnormal events, mainly objects with abnormal appearance, such as bicycles and trucks on sidewalks.
University campus anomaly detection dataset: a very challenging video anomaly detection dataset consisting of 13 scenes and over 270,000 training frames. It contains 330 training videos and 107 test videos, each frame with a resolution of 856×480. The dataset includes 130 abnormal events, such as the appearance of bicycles and skateboards.
This embodiment evaluates anomaly detection performance by AUC (Area Under the Curve).
1. Anomaly detection model of the present invention
The anomaly detection model of the present invention is compared with typical conventional methods and recent deep-learning-based methods, including Deep, stacked RNN, the models proposed by Liu et al. and Lu et al., MESDnet, MemAE, STAE, ST-CaAE, and the model of Kim et al. The AUC performance of each model is shown in Table 1.
Table 1 AUC performance comparison results
As can be seen from Table 1, the model of the present invention achieves good AUC performance on three different datasets, showing strong competitiveness compared with state-of-the-art methods. On the CUHK Avenue and UCSD Ped2 datasets, the AUC of the proposed model reaches 89.9% and 97.5% respectively, outperforming the other methods. The university campus anomaly detection dataset is relatively new in video anomaly detection, so only a few studies report results on it.
Although the model of the present invention does not achieve the best AUC performance on the university campus anomaly detection dataset, its AUC is only 1.1% lower than the highest value. In addition, to observe the detection performance intuitively, fig. 6 provides qualitative frame reconstruction results of the model on the three datasets; as shown in fig. 6, normal regions are reconstructed well while abnormal regions are not.
2. Reconstruction and prediction model for the present invention
In order to evaluate the effect of fusing reconstruction and prediction in the present invention, the reconstruction encoder and the prediction encoder are combined with the decoder to obtain three different models: 1) a reconstruction model consisting of the reconstruction encoder and the decoder, taking frame I_t as input; 2) a prediction model consisting of the prediction encoder and the decoder, taking I_{t-Δ} to I_{t-1} as input; 3) the proposed model, consisting of the reconstruction encoder, the prediction encoder, and the decoder, taking I_{t-Δ} to I_t as input. To stay consistent with the proposed model, a skip connection is used between the encoder and decoder of the prediction model. Each model is trained under the supervision of the pixel intensity loss, the weighted RGB loss, and the feature compactness loss. From these models, the performance of the reconstruction model and the prediction model in detecting anomalies independently can be obtained.
FIG. 7 shows the anomaly scores of video sequences from the Avenue and Ped2 datasets on the three models described above. The results show that the model of the invention consistently produces larger reconstruction errors for abnormal frames and smaller errors for normal frames. The averages of the normal and abnormal scores and the gap between them are shown in fig. 8. Overall, the model of the invention has the largest score gap on each dataset, indicating better detection performance. In addition, the AUCs listed in Table 2 demonstrate that neither the reconstruction model nor the prediction model alone achieves the AUC performance of the combined model of the present invention.
Table 2 comparison of AUC for different models
3. Reverse erasure of reconstruction errors in connection with the present invention
Fig. 9 shows the masks used for erasure in different training periods, together with the frame images before and after erasure. As can be seen from fig. 9, the pixels erased in each round are mainly background pixels, which helps the model pay more attention to the complex foreground. As the number of training rounds increases, more background pixels remain in the erased frames, indicating that the reconstruction-error gap between foreground and background is shrinking. This reflects that reverse erasure effectively guides the model to reduce the reconstruction errors of foreground pixels, which is also verified by the reconstruction error maps provided in fig. 9.
To better demonstrate the advantages of the reverse erasure of the present invention, this embodiment performed an ablation experiment on reverse erasure. The training losses of the models with and without reverse erasure on Ped2 are shown in fig. 10. Although fig. 10 shows that the model with reverse erasure does not reduce the training loss dramatically, comparison with fig. 9 shows that its drop in training loss is dominated by the foreground pixels rather than the background pixels; conversely, the model without reverse erasure loses this guidance and treats all regions equally, so that the simple background dominates model convergence. Finally, the AUC performance of the models with and without reverse erasure is listed in Table 3, and a visual comparison is given in fig. 11. The results show that the model with reverse erasure has better detection performance.
Table 3 comparison of AUC of models without and with reverse erasure
4. Depth SVDD module relating to the present invention
Based on the t-distributed stochastic neighbor embedding (t-SNE) method, fig. 12 provides a t-SNE visualization of the low-dimensional representations of reconstructed frames on the Avenue and Ped2 datasets. It can be observed that in three-dimensional space, especially on the Ped2 dataset, most of the normal data cluster in a nearly spherical form, while the abnormal data are scattered outside the sphere. This result is due to the feature compactness loss based on deep SVDD, which aims to find a minimum-volume hypersphere that contains the normal data but not the abnormal data.
To verify the advantage of applying deep SVDD after the decoder, the experiment explored three methods: 1) removing the mapping encoder after the decoder, leaving the features unconstrained, i.e. a plain dual-encoder single-decoder structure, denoted DESD; 2) performing deep SVDD at the bottleneck between encoder and decoder, i.e. mapping the spatio-temporal representation of the input frames into a compact hypersphere, denoted DE-SVDD-SD; 3) performing deep SVDD after the decoder, denoted DESD-SVDD.
The AUC performance of the different methods is summarized in Table 4. In the table, the feature AUC is calculated from the distance between the low-dimensional feature of a frame and the center of the hypersphere. The distance is defined as:

$$D(\hat{I}_t)=\big\|\phi(\hat{I}_t;W^*)-c\big\|_2$$

where W* denotes the parameters of the pre-trained network; a large distance means that the low-dimensional feature of the frame deviates more severely from the normal pattern. The corresponding anomaly score is obtained by normalizing this distance over the test video sequence, in the same manner as the frame-level score.
From Table 4 it can be observed that DESD-SVDD achieves the highest AUC on both datasets, whether frame-based or feature-based. The frame AUC of DE-SVDD-SD is lower than that of DESD-SVDD, confirming that even when the higher-level features are constrained, the frames reconstructed by the decoder may still not be close to normal frames, owing to the strong representation capability of the CNN.
Table 4 Comparison of AUC of the latent feature space under different constraints
5. Weighted RGB loss with respect to the present invention
The effect of the weighted RGB loss was studied by comparing it with a motion loss computed from the RGB difference between two adjacent frames. Table 5 shows that the weighted RGB loss yields a higher AUC on both the Ped2 and Avenue datasets.
Table 5 AUC performance under different motion constraints
Furthermore, it was found in experiments that fixing the weighted RGB loss parameter λ_rgb to 0.2 yields good detection performance on different datasets. Taking the Ped2 dataset as an example, experiments with different values of λ_rgb were performed; the results are summarized in Table 6.
Table 6 AUC comparison of weighted RGB losses for different weights on the Ped2 dataset
6. Conclusion
This work shows that conventional deep-learning-based video anomaly detection pays no attention to which pixels dominate network optimization and ignores the similarity between different normal frames. In the present invention, each frame of the video is reconstructed by a dual-encoder single-decoder network of the anomaly detection module, and a training strategy comprising reconstruction-error-based reverse erasure and deep SVDD is proposed to regularize the training of the network. During training, pixels with small errors are removed from the original frames according to the reconstruction errors of the previous round before the frames are input to the model, focusing the model on learning the pixels with large errors and improving reconstruction quality. In addition, deep SVDD maps the reconstructions of normal frames into a hypersphere of minimum volume, so that the reconstructions of abnormal frames are easier to identify. Experimental results on the three datasets show that the proposed method is competitive with existing methods.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will understand that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Meanwhile, the common general knowledge of the specific construction and characteristics of the embodiment is not described here too much. Finally, the scope of the invention as claimed should be determined by the claims, and the description of the embodiments and the like in the specification should be construed to explain the content of the claims.

Claims (7)

1. The anomaly detection method based on reconstruction and prediction is characterized by comprising the following steps of:
s1: acquiring a test video sequence to be detected;
s2: inputting the test video sequence into a pre-trained anomaly detection model; the anomaly detection model first extracts the spatial appearance features and temporal motion features of the test video sequence respectively, then fuses them to obtain the corresponding spatio-temporal features, then obtains the corresponding reconstructed frame based on the spatio-temporal features, and finally calculates the corresponding anomaly score according to the reconstructed frame;
the anomaly detection model comprises a reconstruction encoder for extracting spatial appearance characteristics, a prediction encoder for extracting temporal motion characteristics, a fusion module which is connected with the output of the reconstruction encoder and the output of the prediction encoder and is used for fusion to obtain space-time characteristics, and a decoder which is connected with the output of the fusion module and is used for obtaining a reconstructed frame;
when the anomaly detection model is trained, reversely erasing a video sequence input in the current round based on the reconstruction error of the previous round of the anomaly detection model so as to remove pixels with the reconstruction error smaller than a preset threshold value in the video sequence and obtain a corresponding erasing frame;
I_t denotes the t-th frame in the video sequence, and I_{t-Δ} denotes the frame Δ frames before I_t;
the reverse erasure means: after each round of training iterations except the first, the pixel-level error between the original frame I_t and the reconstructed frame Î_t is first calculated; each pixel value of a mask is set to 1 or 0 according to whether the pixel-level error is larger than a preset threshold, so as to obtain the corresponding mask; finally, before the current round of training, each frame from I_{t-Δ} to I_t is multiplied pixel-by-pixel by the mask to obtain the erased frames of the current round of the anomaly detection model, denoted I'_{t-Δ} to I'_t;
S3: and taking the anomaly score of the test video sequence as an anomaly detection result.
2. The reconstruction and prediction based anomaly detection method of claim 1, wherein: in step S2, inputting the current frame of the test video sequence to the reconstruction encoder to extract corresponding spatial appearance features; a number of frames preceding the current frame of the test video sequence are input to the predictive encoder to extract corresponding temporal motion characteristics.
3. The reconstruction and prediction based anomaly detection method of claim 1, wherein: when training the anomaly detection model, a deep SVDD module is connected to the output of the decoder; the deep SVDD module is used to find the hypersphere of minimum volume that contains all or most high-level features of the reconstructed frames of normal events, and uses a compactness constraint on the high-level features of the reconstructed frames to make reconstructed normal frames similar, so as to increase the reconstruction distance between normal and abnormal frames.
4. The reconstruction and prediction based anomaly detection method of claim 3, wherein: the deep SVDD module comprises a mapping encoder connected to the output of the decoder and a hypersphere connected to the output of the mapping encoder; the mapping encoder first maps the reconstructed frame Î_t into a low-dimensional latent representation, and then fits the low-dimensional representations into a hypersphere of minimum volume to force the anomaly detection model to learn the common factors of normal events;
the objective function of the deep SVDD module is defined as:

$$\min_{W,R}\; R^2+\frac{1}{\nu n}\sum_{t=1}^{n}\max\big\{0,\;\|\phi(\hat{I}_t;W)-c\|^2-R^2\big\}$$

where c and R denote the center and radius of the hypersphere, n denotes the number of frames, φ(Î_t; W) denotes the low-dimensional representation of the reconstructed frame Î_t output by the network with parameters W, and max{0, ·} keeps only the positive part; the hyper-parameter ν ∈ (0, 1] trades off the volume of the hypersphere against boundary violations.
5. The reconstruction and prediction based anomaly detection method of claim 3, wherein: the anomaly detection model is optimized by a training loss function;
the reconstructed frame Î_t is constrained in both pixel space and the latent space of the deep SVDD module;
in pixel space, the anomaly detection model is optimized based on the intensity loss and the weighted RGB loss; in latent space, the anomaly detection model is optimized based on the feature compactness loss.
6. The reconstruction and prediction based anomaly detection method of claim 5, wherein: the training loss function is expressed by the following formula:

$$L=\lambda_{int}L_{int}+\lambda_{rgb}L_{rgb}+\lambda_{compact}L_{compact}$$

where L_int denotes the intensity loss, L_rgb denotes the weighted RGB loss, L_compact denotes the feature compactness loss, and λ_int, λ_rgb, λ_compact are the hyper-parameters corresponding to each loss, determining their contribution to the total training loss;
the intensity loss L_int is calculated by the following formula:

$$L_{int}=\|\hat{I}_t-I_t\|_2^2$$

where t denotes the t-th frame of the video sequence and ||·||_2 denotes the l2 norm;
the weighted RGB loss L_rgb is calculated by the following formula:

$$L_{rgb}=\sum_{i=1}^{N}\frac{N-i+1}{N}\,\Big\|\,\big|\hat{I}_t-I_{t-i}\big|-\big|I_t-I_{t-i}\big|\,\Big\|_1$$

where ||·||_1 denotes the l1 norm, N denotes the number of previous frames, and frame I_{t-i} has weight (N-i+1)/N;
the feature compactness loss is calculated by the following formula:

$$L_{compact}=R^2+\frac{1}{\nu n}\sum_{t=1}^{n}\max\big\{0,\;\|\phi(\hat{I}_t;W)-c\|^2-R^2\big\}$$

where c and R denote the center and radius of the hypersphere, n denotes the number of frames, and φ(Î_t; W) denotes the low-dimensional representation of the reconstructed frame Î_t output by the network with parameters W.
7. The reconstruction and prediction based anomaly detection method of claim 1, wherein: the anomaly detection model calculates a corresponding anomaly score by:
s201: the partial score of each image block in the test video sequence is defined as:

$$S(P)=\frac{1}{|P|}\sum_{(i,j)\in P}\big(\hat{I}_t(i,j)-I_t(i,j)\big)^2$$

where P denotes an image block of the frame, i and j denote the spatial positions of pixels in the image block, and |P| denotes the number of pixels in the image block; the image blocks are determined by a sliding window with a stride of 4;
s202: the anomaly score of a frame in the test video sequence is calculated as:

$$Score=\max\{S(P_1),S(P_2),\ldots,S(P_m)\}$$

where the size of P is set to 16×16 and m denotes the number of image blocks;
s203: after obtaining the score of each frame in the test video sequence, the scores of all frames are normalized to the range [0,1] to obtain the following frame-level anomaly score:

$$S_t=\frac{Score_t-\min Score}{\max Score-\min Score}$$

where min Score and max Score denote the minimum and maximum scores in the test video sequence, respectively;
s204: and smoothing the frame-level anomaly score in the time dimension by adopting a Gaussian filter to obtain the anomaly score corresponding to the test video sequence.
CN202111016334.4A 2021-08-31 2021-08-31 Anomaly detection method based on reconstruction and prediction Active CN113705490B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111016334.4A CN113705490B (en) 2021-08-31 2021-08-31 Anomaly detection method based on reconstruction and prediction


Publications (2)

Publication Number Publication Date
CN113705490A CN113705490A (en) 2021-11-26
CN113705490B true CN113705490B (en) 2023-09-12

Family

ID=78658335

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111016334.4A Active CN113705490B (en) 2021-08-31 2021-08-31 Anomaly detection method based on reconstruction and prediction

Country Status (1)

Country Link
CN (1) CN113705490B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114896307B (en) * 2022-06-30 2022-09-27 北京航空航天大学杭州创新研究院 Time series data enhancement method and device and electronic equipment
CN115527151B (en) * 2022-11-04 2023-07-11 南京理工大学 Video anomaly detection method, system, electronic equipment and storage medium
CN116450880B (en) * 2023-05-11 2023-09-01 湖南承希科技有限公司 Intelligent processing method for vehicle-mounted video of semantic detection


Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10726311B2 (en) * 2017-09-13 2020-07-28 Hrl Laboratories, Llc Independent component analysis of tensors for sensor data fusion and reconstruction
US11214268B2 (en) * 2018-12-28 2022-01-04 Intel Corporation Methods and apparatus for unsupervised multimodal anomaly detection for autonomous vehicles
US20200293033A1 (en) * 2019-03-13 2020-09-17 General Electric Company Knowledge-based systematic health monitoring system

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109116834A (en) * 2018-09-04 2019-01-01 湖州师范学院 A kind of batch process fault detection method based on deep learning
CN109359519A (en) * 2018-09-04 2019-02-19 杭州电子科技大学 A kind of video anomaly detection method based on deep learning
CN109615019A (en) * 2018-12-25 2019-04-12 吉林大学 Anomaly detection method based on space-time autocoder
CN111402237A (en) * 2020-03-17 2020-07-10 山东大学 Video image anomaly detection method and system based on space-time cascade self-encoder
CN112990279A (en) * 2021-02-26 2021-06-18 西安电子科技大学 Radar high-resolution range profile library outside target rejection method based on automatic encoder
CN113052831A (en) * 2021-04-14 2021-06-29 清华大学 Brain medical image anomaly detection method, device, equipment and storage medium
CN113240011A (en) * 2021-05-14 2021-08-10 烟台海颐软件股份有限公司 Deep learning driven abnormity identification and repair method and intelligent system
CN113255518A (en) * 2021-05-25 2021-08-13 神威超算(北京)科技有限公司 Video abnormal event detection method and chip

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Mechanical fault diagnosis of on-load tap changers based on Bayes estimation phase-space fusion and CM-SVDD; Wang Fenghua et al.; Proceedings of the CSEE; Vol. 40, No. 01; pp. 358-368 *

Also Published As

Publication number Publication date
CN113705490A (en) 2021-11-26

Similar Documents

Publication Publication Date Title
CN113705490B (en) Anomaly detection method based on reconstruction and prediction
Abu Farha et al. When will you do what?-anticipating temporal occurrences of activities
CN109101896B (en) Video behavior identification method based on space-time fusion characteristics and attention mechanism
Sun et al. Lattice long short-term memory for human action recognition
Li et al. Spatio-temporal unity networking for video anomaly detection
Islam et al. Efficient two-stream network for violence detection using separable convolutional lstm
Ahuja et al. Probabilistic modeling of deep features for out-of-distribution and adversarial detection
CN110009013A (en) Encoder training and characterization information extracting method and device
EP2164041A1 (en) Tracking method and device adopting a series of observation models with different lifespans
CN110120064B (en) Depth-related target tracking algorithm based on mutual reinforcement and multi-attention mechanism learning
CN112016500A (en) Group abnormal behavior identification method and system based on multi-scale time information fusion
Tseng et al. Regularizing meta-learning via gradient dropout
CN113052873B (en) Single-target tracking method for on-line self-supervision learning scene adaptation
CN110580472A (en) video foreground detection method based on full convolution network and conditional countermeasure network
Wang et al. A cognitive memory-augmented network for visual anomaly detection
CN112801019B (en) Method and system for eliminating re-identification deviation of unsupervised vehicle based on synthetic data
Chen et al. Recurrent semantic preserving generation for action prediction
Xu et al. AutoSegNet: An automated neural network for image segmentation
Guo et al. Exposing deepfake face forgeries with guided residuals
Taghinezhad et al. A new unsupervised video anomaly detection using multi-scale feature memorization and multipath temporal information prediction
Leyva et al. Video anomaly detection based on wake motion descriptors and perspective grids
CN112149596A (en) Abnormal behavior detection method, terminal device and storage medium
CN116129417A (en) Digital instrument reading detection method based on low-quality image
Bergaoui et al. Object-centric and memory-guided normality reconstruction for video anomaly detection
CN106372650B (en) A kind of compression tracking based on motion prediction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant