CN116543335A - Visual anomaly detection method based on time sequence spatial information enhancement - Google Patents

Visual anomaly detection method based on time sequence spatial information enhancement

Info

Publication number
CN116543335A
CN116543335A CN202310510187.9A CN202310510187A CN116543335A CN 116543335 A CN116543335 A CN 116543335A CN 202310510187 A CN202310510187 A CN 202310510187A CN 116543335 A CN116543335 A CN 116543335A
Authority
CN
China
Prior art keywords
output
encoder
optical flow
convolution layer
error
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310510187.9A
Other languages
Chinese (zh)
Inventor
王霖
李名洋
王玮
柴志鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202310510187.9A priority Critical patent/CN116543335A/en
Publication of CN116543335A publication Critical patent/CN116543335A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/42Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a visual anomaly detection method based on time sequence spatial information enhancement, which addresses the slow operation speed of anomaly detection networks. The method comprises the following steps: firstly, constructing an optical flow reconstructor based on a multi-level memory enhanced self-encoder with jump connections, and a future frame predictor; secondly, training the optical flow reconstructor with a mixed loss function combining the reconstructed pixel distance error and the memory module matching probability entropy loss; thirdly, training the future frame predictor with a mixed loss function combining the predicted frame Gaussian error and the gradient error; and finally, detecting the video stream with the trained optical flow reconstructor and future frame predictor, and calculating an anomaly score from the reconstruction error and the prediction error. The method makes better use of spatial detail information, improves detection precision, and is suitable for abnormal event detection in complex and diverse scene surveillance videos.

Description

Visual anomaly detection method based on time sequence spatial information enhancement
Technical Field
The invention relates to the technical field of general computer image processing, in particular to a visual anomaly detection method based on time sequence space information enhancement.
Background
Video abnormal event detection technology is widely applied to fault early warning in the industrial, security and transportation fields, and aims to locate abnormal events in surveillance video spatially or temporally. Broadly, a video abnormal event is formally defined as "the appearance of an abnormal appearance or motion attribute, or the appearance of a normal appearance or motion attribute at an abnormal location or time". By this definition, abnormal events are rare, unusual and sporadic in nature, so acquiring abnormal event samples is difficult. To address this detection problem of abnormal events with ill-defined boundaries and few samples, researchers model normal event behavior by analyzing the motion features and spatio-temporal context features of normal event samples, and then use the model to judge whether an abnormal event occurs in a video. With the continued development of deep learning techniques, researchers have attempted to detect anomalies with unsupervised paradigms based on reconstruction or prediction. In deep-learning-based anomaly detection, the network first learns, during training, a model of the normal activity of the subjects in the scene from a set of given video sequences that contain no abnormal behavior. Then, in the detection stage, the network extracts the same features from the video sequence under test and computes an anomaly score for each video frame. Because such methods learn the features of normal event samples and model them end to end, abnormal event samples cannot be effectively reconstructed or predicted in future frames and therefore yield higher reconstruction or prediction errors; using these errors as the criterion, whether an abnormal event has occurred is judged against a suitable threshold.
Although reconstruction-based and prediction-based anomaly detection methods have made considerable progress, the following drawbacks remain: (1) Both rely on the ability of convolutional neural networks to extract video sequence features; some methods introduce motion feature information such as optical flow before the reconstruction or prediction stage, but without achieving good results. (2) Reconstruction-based methods assume that, because normal and abnormal samples differ in morphological features, motion features and spatio-temporal context information, a model trained only on normal samples will produce large reconstruction errors when reconstructing abnormal images. However, this assumption does not always hold: the diversity of normal and abnormal samples means that the normal events recorded in the training set are neither complete nor comprehensive, so abnormal samples may still be reconstructed well. (3) Prediction-based methods predict future frames from morphological features and spatio-temporal context features and, in the anomaly discrimination process, rely on an assumption similar to that of the reconstruction-based methods, namely that abnormal frames yield extremely high prediction errors. As a result, prediction-based methods ignore the contribution of motion features to anomaly discrimination while sharing drawbacks similar to those of reconstruction-based methods.
To solve the above problems, some researchers have tried to combine these two paradigms in a serial or parallel manner to construct hybrid reconstruction-and-prediction frameworks that further improve anomaly detection performance. Hybrid models effectively combine reconstruction and prediction: the parallel hybrid approach fuses reconstruction errors and prediction errors when computing the anomaly score in the discrimination stage, while the serial hybrid approach makes modeling of abnormal events in future frames more difficult by introducing reconstruction information into the prediction branch. However, whether based on reconstruction, prediction or a hybrid model, existing network designs do not give sufficient consideration to spatial detail information. These methods also place high demands on sample data; affected by the diversity of normal samples, they may still model anomalies effectively during training, which degrades model quality and is unfavorable for effectively distinguishing anomalies in the discrimination process.
Disclosure of Invention
Aiming at the problem of the slow operation speed of anomaly detection networks, the invention provides a visual anomaly detection method based on time sequence spatial information enhancement: a hybrid framework that combines optical flow reconstruction and optical-flow-guided future frame prediction in a serial manner. The optical-flow-guided future frame prediction model accepts previous video frames and optical flow as inputs simultaneously, but the optical flow used is not the original optical flow image; it is first reconstructed by an optical flow reconstruction model built on a multi-level memory enhanced self-encoder with jump connections (ML-MemAE-SC), and the reconstructed optical flow is then input to the future frame prediction model. The method makes better use of spatial detail information, improves detection precision, and is suitable for abnormal event detection in complex and diverse scene surveillance videos.
The technical scheme of the invention is realized as follows:
a visual anomaly detection method based on time sequence space information enhancement comprises the following steps:
step one: constructing an optical flow reconstructor and a future frame predictor with a jump-connected multi-level memory enhanced self-encoder;
step two: training an optical flow reconstructor by utilizing a mixed loss function combining the reconstructed pixel distance error and the memory module matching probability entropy loss;
step three: training a future frame predictor using a hybrid loss function that combines a predicted frame gaussian error and a gradient error;
step four: and detecting the video stream by using the trained optical flow reconstructor and the future frame predictor, and calculating an anomaly score according to the reconstruction error and the prediction error.
The future frame predictor includes an optical flow encoder E_θ, an image-stream encoder and a decoder D_ψ; jump connections with U-shaped pyramid attention mechanism modules are added between the image-stream encoder and the decoder D_ψ. The encoder E_θ and the image-stream encoder each comprise a point convolution module, serial depth separable residual blocks and dense dilated downsampling modules; the decoder D_ψ comprises upsampling modules, serial depth separable residual blocks and a point convolution module. The input of the encoder E_θ is the reconstructed optical flow information ŷ_{1:t}; the input of the image-stream encoder is the mixed information of the original input frames x_{1:t} and the reconstructed optical flow information ŷ_{1:t}. The output information of the encoder E_θ is concatenated with the latent output z of the image-stream encoder and sent to the decoder D_ψ.
The mixed loss function for training the optical flow reconstructor is:
L_recon-branch = λ_recon · L_recon + λ_ent · L_ent
where L_recon = ||f − f̂||_2^2 is the reconstruction loss, f is the input and f̂ the reconstructed output; L_ent = Σ_{m=1}^{M} Σ_{k=1}^{N} −w_k^m · log(w_k^m) is the loss function of the memory modules, M is the number of memory modules, N is the size of each memory module and w_k^m is the matching probability of the k-th node in the m-th memory module; λ_recon and λ_ent are weights.
The mixed loss function for training the future frame predictor is:
L_pred-branch = λ_predict · L_CVAE + λ_gd · L_gd
where L_CVAE = ||x_{t+1} − x̂_{t+1}||_2^2 + F_KL( q(z|x_{1:t}, y_{1:t}) || p(z|y_{1:t}) ) is the prediction loss, p(z|y_{1:t}) is the output distribution of the optical flow encoder, q(z|x_{1:t}, y_{1:t}) is the output distribution of the image-stream encoder, F_KL denotes the Kullback-Leibler divergence, y_{1:t} denotes the video optical flow, x_{t+1} denotes the ground-truth future frame and x̂_{t+1} the predicted output video frame; L_gd = Σ_{i,j} ( | |x_{i,j} − x_{i−1,j}| − |x̂_{i,j} − x̂_{i−1,j}| | + | |x_{i,j} − x_{i,j−1}| − |x̂_{i,j} − x̂_{i,j−1}| | ) is the gradient loss, i, j are the spatial coordinates of pixels in the image, x denotes the ground-truth future frame, x̂ the predicted output video frame, x_{i,j} a ground-truth pixel value of the future frame and x̂_{i,j} the corresponding predicted pixel value; λ_predict and λ_gd are weights.
The anomaly score is calculated from the reconstruction error and the prediction error as:
S = ω_r · (S_r − μ_r)/σ_r + ω_p · (S_p − μ_p)/σ_p
where S is the anomaly score, S_r the reconstruction error, S_p the prediction error, ω_r the weight of the reconstruction score, ω_p the weight of the prediction score, μ_r the mean of the reconstruction errors of all training samples, σ_r the standard deviation of the reconstruction errors of all training samples, μ_p the mean of the prediction errors of all training samples, and σ_p the standard deviation of the prediction errors of all training samples.
The reconstruction error S_r and the prediction error S_p are expressed respectively as:
S_r = ||y_{1:t} − ŷ_{1:t}||_2^2,  S_p = ||x_{t+1} − x̂_{t+1}||_2^2
where ŷ_{1:t} denotes the reconstructed optical flow information and x̂_{t+1} denotes the predicted output video frame.
The serial depth separable residual block comprises a channel convolution layer I, a channel convolution layer II, a point convolution layer I and a point convolution layer II; the input end of the channel convolution layer I and the input end of the point convolution layer I are connected with the upper layer output end, the output end of the channel convolution layer I is connected with the input end of the channel convolution layer II, the output end of the channel convolution layer I is fused with the output end of the channel convolution layer II and then is connected with the input end of the point convolution layer II, and the output end of the point convolution layer II and the output end of the point convolution layer I are connected with the lower layer input end.
The dense dilated downsampling module comprises three stages: Stage 1, Stage 2 and Stage 3. First, the input x is downsampled for the first time by the ordinary convolution of Stage 1 to obtain the downscaled feature map x_1; then, the second and third downsampling operations are performed with the dilated convolutions of Stage 2 and Stage 3 respectively, to obtain the downscaled feature maps x_2 and x_3; after that, the feature maps x_1 and x_2 are fused to obtain the fusion result y_1, and the feature maps x_2 and x_3 are fused to obtain the fusion result y_2; in the final stage, x_1, y_1 and y_2 are fused to obtain the output feature map y.
The U-shaped pyramid attention mechanism module comprises a decoder-input branch on the right and a jump-connection branch on the left. The right branch passes sequentially through a convolution layer and a dilated convolution; the convolution-layer output and the dilated-convolution output are fused by a point convolution, and the output Att_D(F_D) is obtained after an activation function. The left jump-connection branch passes through a convolution layer to obtain F_1, which is fused with the output of the right branch's convolution layer and then passed through a dilated convolution to obtain the output F_2. The three features F_2, F_1 and the dilated-convolution output F_3 of the right branch are fused along the channel dimension by a point convolution, and the final output Att_skip(F_skip) is obtained after an activation function. Finally, the output Att_UPAM(F_skip, F_D) of the UPAM is multiplied element-wise with the decoder input F_D to obtain the final attention feature map F_Att.
Compared with the prior art, the invention has the beneficial effects that:
(1) Aiming at the problem of the slow operation speed of anomaly detection networks, a Serial depth separable residual Block (Serial Block) is designed as the backbone of the future frame prediction network, which effectively reduces the computation and parameter count of the convolution operations and improves the operation speed while maintaining performance.
(2) To ensure that rich and effective spatial detail information is retained during downsampling in the prediction network encoder, a dense dilated downsampling module (DRSM) is designed. The module adopts a stepped structure and makes full use of multi-scale spatial correlation information during downsampling, ensuring that the downscaled feature map retains more effective spatial detail information.
(3) Because the U-shaped structure of the future frame prediction branch contains many jump connections, some levels carry a certain amount of interfering information as it is passed down from the encoding stage. To solve this problem, a U-shaped pyramid attention module (UPAM) is designed; by extracting interaction information between the jump-connection input and the original decoder input within the attention module, the network is guided to retain richer and more effective spatial features during feature fusion, thereby improving the quality of future frame modeling.
(4) The invention can accurately detect the abnormal events in the complex and various scene monitoring videos.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of the present invention.
Fig. 2 is a block diagram of a future frame predictor network of the present invention.
Fig. 3 is a Serial depth separable residual Block (Serial Block) structure diagram of the present invention.
Fig. 4 is a structural diagram of the dense dilated downsampling module (DRSM) of the present invention.
Fig. 5 is a diagram of the U-shaped pyramid attention mechanism module (UPAM) of the present invention.
FIG. 6 is an anomaly score plot of a test video on the UCSD Ped2 dataset according to the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without any inventive effort, are intended to be within the scope of the invention.
As shown in fig. 1, an embodiment of the present invention provides a visual anomaly detection method based on temporal spatial information enhancement, which includes two components: (1) a multi-level memory enhanced self-encoder with jump connections (ML-MemAE-SC), i.e. the optical flow reconstructor, which reconstructs the optical flow image; the reconstructed optical flow image and the original image are then input together into the future frame predictor for future frame prediction. (2) A conditional variational self-encoder (CVAE) guided by the reconstructed optical flow, i.e. the future frame predictor, which predicts future frames from the original image frames; the optical flow reconstruction information introduced during prediction enlarges the prediction-error gap between abnormal and normal states, further improving detection precision. The specific steps are as follows:
step one: constructing an optical flow reconstructor and a future frame predictor with a jump-connected multi-level memory enhanced self-encoder; the optical flow reconstructor network adopts three-stage memory storage units, and directly transmits the encoded information to the decoder by utilizing jump connection, so as to provide more information for the next-stage memory storage unit to keep a more comprehensive and complete normal mode. The memory module functions to represent the characteristics input to it by a weighted sum of similar memory slots and thus has the ability to memorize the normal mode when training the normal data. Each memory module is actually a matrixIncluding N real vectors in the fixed dimension C. Each row of the matrix is called a slot m i Where i=1, 2,3,..n. Memory size n=2000. The method can effectively solve the problems that a single memory module cannot load all normal modes and a plurality of memory modules are cascaded to cause excessive loading and filtering. No hopping connection is added at the outermost layer of the network. If the connection uses the outermost layer, the reconstruction may be completed by the highest-level encoding and decoding information, and the effect of the normal mode information stored by all other lower-level memory modules is greatly weakened or even disabled, so that all other lower-level encoding, decoding and memory blocks cannot work. In order to further reduce the number of network model parameters, all convolution layers of the reconstruction network are replaced by Serial blocks, so that a lightweight version is formed.
The detailed structure of the future frame predictor network is shown in fig. 2. The future frame predictor network model is composed of connections of the following 5 main modules: (1) the Serial depth separable residual Block (Serial Block); (2) the dense dilated downsampling module (DRSM); (3) the U-shaped pyramid attention mechanism module (UPAM); (4) the upsampling module (Upsample); (5) the point convolution module (Pointwise).
Each block in the future frame predictor network represents a corresponding module. As shown in fig. 2, there are two encoders sharing a similar architecture, the optical flow encoder E_θ and the image-stream encoder, and one decoder D_ψ. Jump connections are added between the image-stream encoder and D_ψ to help generate x_{t+1}. The backbone of the future frame predictor network is implemented with Serial Blocks, and the downsampling and upsampling layers are implemented with the DRSM and the Upsample module respectively. The model contains 4 levels in total, and the feature map sizes corresponding to the levels are (32, 32, 64), (16, 16, 128), (8, 8, 128) and (4, 4, 128) respectively. The output of E_θ is concatenated with the sampled latent variable z and sent to the decoder D_ψ. The last two bottleneck layers are used to estimate the distributions and to sample from them, and they share the same layer settings. The image-stream encoder processes the mixed input of the original input frames x_{1:t} and the reconstructed optical flow information ŷ_{1:t}, while E_θ processes the reconstructed optical flow information ŷ_{1:t}. Features are fused in the bottleneck layer, and the posterior distribution q(z|x_{1:t}, y_{1:t}) and the prior distribution p(z|y_{1:t}) are then used by the decoder D_ψ to model the predicted future output frame x̂_{t+1}.
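For orientation, a compact PyTorch sketch of the two-encoder/one-decoder conditional variational layout described above is given below. The plain convolutional layers, the diagonal-Gaussian re-parameterisation and the omission of the Serial Block, DRSM and UPAM components are simplifying assumptions; only the overall data flow (flow encoder as prior, frame-plus-flow encoder as posterior, concatenation of the flow features with the sampled z before decoding) follows the description.

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    def __init__(self, in_ch, z_ch=128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, z_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.mu = nn.Conv2d(z_ch, z_ch, 1)       # bottleneck: mean
        self.logvar = nn.Conv2d(z_ch, z_ch, 1)   # bottleneck: log-variance

    def forward(self, x):
        h = self.body(x)
        return h, self.mu(h), self.logvar(h)

class TinyPredictor(nn.Module):
    """E_theta sees the reconstructed flow; the image-stream encoder sees frames + flow."""
    def __init__(self, frame_ch, flow_ch, z_ch=128):
        super().__init__()
        self.enc_flow = TinyEncoder(flow_ch, z_ch)             # optical flow encoder (prior)
        self.enc_mix = TinyEncoder(frame_ch + flow_ch, z_ch)   # image-stream encoder (posterior)
        self.dec = nn.Sequential(                              # decoder D_psi
            nn.ConvTranspose2d(2 * z_ch, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),
        )

    def forward(self, frames, rec_flow):
        h_flow, mu_p, logvar_p = self.enc_flow(rec_flow)                        # p(z | y_{1:t})
        _, mu_q, logvar_q = self.enc_mix(torch.cat([frames, rec_flow], dim=1))  # q(z | x_{1:t}, y_{1:t})
        z = mu_q + torch.exp(0.5 * logvar_q) * torch.randn_like(mu_q)           # sample z
        x_hat = self.dec(torch.cat([h_flow, z], dim=1))                         # predicted future frame
        return x_hat, (mu_q, logvar_q), (mu_p, logvar_p)
```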
Step two: training the optical flow reconstructor with a mixed loss function combining the reconstructed pixel distance error and the memory-module matching-probability entropy loss. When training the optical flow reconstruction network, the optical flow map is used as the input for model training. With the input denoted f and the reconstructed output denoted f̂, training the reconstruction network amounts to minimizing the l_2 distance between input and output, so the reconstruction loss function can be expressed as:
L_recon = ||f − f̂||_2^2
In addition, we add an entropy loss on the matching probabilities of each memory module of the optical flow reconstruction network; the loss function of the memory modules can be expressed as:
L_ent = Σ_{m=1}^{M} Σ_{k=1}^{N} −w_k^m · log(w_k^m)
where M is the number of memory modules, N is the size of each memory module, and w_k^m is the matching probability of the k-th node in the m-th memory module. These two loss functions are fused by weighting for the final reconstruction-branch training, as follows:
L_recon-branch = λ_recon · L_recon + λ_ent · L_ent
where λ_recon and λ_ent are weights.
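A minimal sketch of this mixed reconstruction-branch loss in PyTorch, assuming the reconstructor returns both the reconstructed flow and a list of per-module matching-probability matrices (one row of probabilities per addressed location), could look as follows; the default weights follow the embodiment reported later.

```python
import torch

def recon_branch_loss(flow, flow_hat, match_probs, lam_recon=1.0, lam_ent=2e-4, eps=1e-12):
    """L_recon-branch = lam_recon * L_recon + lam_ent * L_ent."""
    l_recon = torch.mean((flow - flow_hat) ** 2)                 # l2 pixel distance error
    l_ent = sum((-w * torch.log(w + eps)).sum(dim=1).mean()      # entropy of the matching
                for w in match_probs)                            # probabilities, summed over modules
    return lam_recon * l_recon + lam_ent * l_ent
```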
Step three: training the future frame predictor with a mixed loss function combining the predicted-frame Gaussian error and the gradient error. When training the future frame prediction network, the video frames and the reconstructed optical flow are used as inputs. The future frame prediction network comprises two encoders and one decoder, and the outputs of the three parts are considered jointly during training. With the output distribution of the optical flow encoder denoted p(z|y_{1:t}) and the output distribution of the image-stream encoder denoted q(z|x_{1:t}, y_{1:t}), the prediction loss of the future frame prediction network can be defined as:
L_CVAE = ||x_{t+1} − x̂_{t+1}||_2^2 + F_KL( q(z|x_{1:t}, y_{1:t}) || p(z|y_{1:t}) )
where F_KL denotes the Kullback-Leibler divergence, x_{t+1} denotes the ground truth of the future frame, x̂_{t+1} denotes the predicted output video frame, and y_{1:t} denotes the video optical flow.
Furthermore, a gradient loss is introduced, as follows:
L_gd = Σ_{i,j} ( | |x_{i,j} − x_{i−1,j}| − |x̂_{i,j} − x̂_{i−1,j}| | + | |x_{i,j} − x_{i,j−1}| − |x̂_{i,j} − x̂_{i,j−1}| | )
where i, j are the spatial coordinates of the pixels in the image, x denotes the ground-truth future frame, x̂ the predicted output video frame, x_{i,j} a ground-truth pixel value of the future frame, and x̂_{i,j} the corresponding predicted pixel value.
The prediction loss and the gradient loss are mixed by weighted fusion to obtain the total loss function used to train the prediction network, which can be expressed as:
L_pred-branch = λ_predict · L_CVAE + λ_gd · L_gd
where λ_predict and λ_gd are weights.
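The prediction-branch loss can be sketched in PyTorch as below; the closed-form KL divergence between two diagonal Gaussians and the mean (rather than sum) reductions are assumptions made for compactness.

```python
import torch

def kl_divergence(mu_q, logvar_q, mu_p, logvar_p):
    """KL( q || p ) for diagonal Gaussians, averaged over all elements."""
    return 0.5 * torch.mean(
        logvar_p - logvar_q
        + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp() - 1.0
    )

def gradient_loss(x, x_hat):
    """Difference of absolute horizontal/vertical image gradients."""
    dx, dx_hat = (x[..., :, 1:] - x[..., :, :-1]).abs(), (x_hat[..., :, 1:] - x_hat[..., :, :-1]).abs()
    dy, dy_hat = (x[..., 1:, :] - x[..., :-1, :]).abs(), (x_hat[..., 1:, :] - x_hat[..., :-1, :]).abs()
    return (dx - dx_hat).abs().mean() + (dy - dy_hat).abs().mean()

def pred_branch_loss(x, x_hat, q_stats, p_stats, lam_pred=1.0, lam_gd=1.0):
    """L_pred-branch = lam_pred * L_CVAE + lam_gd * L_gd."""
    l_cvae = torch.mean((x - x_hat) ** 2) + kl_divergence(*q_stats, *p_stats)
    return lam_pred * l_cvae + lam_gd * gradient_loss(x, x_hat)
```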
Step four: detecting the video stream with the trained optical flow reconstructor and future frame predictor, and calculating an anomaly score from the reconstruction error and the prediction error; the weighted fusion of the two errors in the anomaly score improves the anomaly discrimination performance.
In the test phase, the network makes the anomaly judgment using the reconstruction error, i.e. the difference between y_{1:t} and ŷ_{1:t}, and the prediction error, i.e. the difference between x_{t+1} and x̂_{t+1}. The quality gap between normal and abnormal optical flow reconstructed by the optical flow reconstructor is exploited to improve the detection precision of the future frame predictor: a reconstructed abnormal optical flow is typically of lower quality, leading to a future frame with a larger prediction error, whereas a reconstructed normal optical flow is typically of higher quality, so the prediction module can successfully predict the future frame with a smaller prediction error. The optical flow reconstruction error and the future frame prediction error are used as the final anomaly detection score.
In the final anomaly detection stage, the video is detected with the network models trained in the previous steps, and the reconstruction error S_r and the prediction error S_p are combined by weighted fusion to obtain the final anomaly score S, which can be expressed as:
S = ω_r · (S_r − μ_r)/σ_r + ω_p · (S_p − μ_p)/σ_p
where ω_r is the weight of the reconstruction score, ω_p the weight of the prediction score, μ_r the mean of the reconstruction errors of all training samples, σ_r the standard deviation of the reconstruction errors of all training samples, μ_p the mean of the prediction errors of all training samples, and σ_p the standard deviation of the prediction errors of all training samples. The reconstruction error S_r and the prediction error S_p are expressed respectively as:
S_r = ||y_{1:t} − ŷ_{1:t}||_2^2,  S_p = ||x_{t+1} − x̂_{t+1}||_2^2
where ŷ_{1:t} denotes the reconstructed optical flow information and x̂_{t+1} denotes the predicted output video frame.
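A minimal sketch of this scoring step, using the fusion weights reported for the UCSD Ped2 embodiment and the squared-error terms assumed above, is given below.

```python
import torch

def errors(flow, flow_hat, frame, frame_hat):
    s_r = torch.mean((flow - flow_hat) ** 2).item()     # reconstruction error S_r
    s_p = torch.mean((frame - frame_hat) ** 2).item()   # prediction error S_p
    return s_r, s_p

def anomaly_score(s_r, s_p, mu_r, sigma_r, mu_p, sigma_p, w_r=1.0, w_p=0.1):
    """Weighted, z-normalised fusion; mu/sigma are computed once over training-set errors."""
    return w_r * (s_r - mu_r) / sigma_r + w_p * (s_p - mu_p) / sigma_p
```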
The present invention proposes a Serial depth separable residual Block (Serial Block). By extracting features with channel-by-channel convolution and point convolution and combining residual and dense connections, richer spatial features can be extracted and feature reuse can be performed effectively. The Serial Block structure is shown in fig. 3. The trunk comprises two layers of channel-by-channel convolution and one layer of point convolution: richer two-dimensional spatial detail is obtained by stacking the two channel-by-channel convolutions, and channel fusion is performed by channel-dimension concatenation before the point convolution, achieving feature reuse. On the outer path, the channel difference between input and output is balanced by one layer of point convolution, and a residual term is introduced into the final output, which preserves spatial detail and alleviates the vanishing-gradient and exploding-gradient problems during training. Let the input image be denoted x, let y_i denote the output features of the convolution layers, and let C(·), D(·) and P(·) denote ordinary convolution, channel-by-channel convolution and point convolution respectively; the mathematical expression of the Serial Block is:
y_1 = D(x), y_2 = D(y_1), y = P([y_1, y_2]) + P(x)
where [·] denotes channel-dimension concatenation.
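A PyTorch sketch of a Serial Block consistent with this description is shown below; the kernel sizes, activation placement and channel counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SerialBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.dw1 = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch)  # channel conv I (depthwise)
        self.dw2 = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch)  # channel conv II (depthwise)
        self.pw_main = nn.Conv2d(2 * in_ch, out_ch, 1)                  # point conv II (after concat)
        self.pw_skip = nn.Conv2d(in_ch, out_ch, 1)                      # point conv I (residual path)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        y1 = self.act(self.dw1(x))
        y2 = self.act(self.dw2(y1))
        main = self.pw_main(torch.cat([y1, y2], dim=1))   # feature reuse via channel concatenation
        return self.act(main + self.pw_skip(x))           # residual term balances channel difference
```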
the invention provides an intensive hole downsampling module (DRSM). The module introduces a hole convolution to combine multi-scale space information in the downsampling process to guide the downsampling process of the feature map. The DRSM module structure is shown in figure 4.
The DRSM contains three stages, Stage 1, Stage 2 and Stage 3. Stage 1 performs downscaling with an ordinary convolution; Stage 2 and Stage 3 both perform downscaling with dilated convolutions, with the dilation rates set to 2 and 4 respectively. By introducing two stages of dilated convolution, multi-scale spatial information is acquired during downsampling. In the downsampling process, the module first downsamples the input x with the ordinary convolution to obtain the downscaled feature map x_1; then the second and third downsampling operations are performed with the dilated convolutions of rate 2 and rate 4 respectively, obtaining the downscaled feature maps x_2 and x_3. These three downsampling operations correspond to the three stages. After that, the feature maps x_1 and x_2 are fused to obtain the fusion result y_1, the feature maps x_2 and x_3 are fused to obtain the fusion result y_2, and in the final stage x_1, y_1 and y_2 are fused to obtain the output feature map y. Richer and more effective spatial information is preserved through the multi-level fusion of multi-scale features. The downsampling process of the DRSM module can be expressed as:
y_1 = P*([x_1, x_2]), y_2 = P*([x_2, x_3]), S_out = P*([x_1, y_1, y_2])
where P*(·) denotes the point convolution of the fusion layer, x_i (i = 1, 2, 3) denote the outputs of the three stages, [·] denotes the channel-dimension concatenation operation, and S_out denotes the downscaled output feature map.
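The following PyTorch sketch mirrors the staged fusion above. Whether Stages 2 and 3 operate on the original input or on the previous stage's output is not fully specified here, so the sketch assumes all three strided stages read the original input, which keeps x_1, x_2 and x_3 at the same scale for fusion.

```python
import torch
import torch.nn as nn

class DRSM(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.stage1 = nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1)               # ordinary conv
        self.stage2 = nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=2, dilation=2)   # dilation rate 2
        self.stage3 = nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=4, dilation=4)   # dilation rate 4
        self.fuse12 = nn.Conv2d(2 * out_ch, out_ch, 1)    # P* for [x1, x2]
        self.fuse23 = nn.Conv2d(2 * out_ch, out_ch, 1)    # P* for [x2, x3]
        self.fuse_out = nn.Conv2d(3 * out_ch, out_ch, 1)  # P* for [x1, y1, y2]

    def forward(self, x):
        x1, x2, x3 = self.stage1(x), self.stage2(x), self.stage3(x)
        y1 = self.fuse12(torch.cat([x1, x2], dim=1))
        y2 = self.fuse23(torch.cat([x2, x3], dim=1))
        return self.fuse_out(torch.cat([x1, y1, y2], dim=1))   # S_out
```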
The invention proposes a U-shaped pyramid attention mechanism module (UPAM). Dilated convolution is introduced to increase the extraction of inter-region correlation information and further optimize the attention weight matrix. The structure of the UPAM is shown in fig. 5; it comprises two structurally similar parallel branches. The input of the left branch is the jump-connection input F_skip, i.e. the output of the corresponding level of the encoder; the input of the right branch is the input F_D of the decoder at the corresponding level.
The right branch first passes through a convolution layer and then through a dilated convolution with dilation rate 2; the convolution output and the dilated-convolution output are fused by a point convolution, and the output Att_D(F_D) is obtained after an activation function. The left jump-connection branch passes through a convolution layer to obtain F_1, which is fused with the output of the right branch's convolution layer and then passed through a dilated convolution with dilation rate 2 to obtain the output F_2. The three features F_2, F_1 and the dilated-convolution output F_3 of the right branch are fused along the channel dimension by a point convolution, and the final output Att_skip(F_skip) is obtained after an activation function. Finally, the output Att_UPAM(F_skip, F_D) of the UPAM is multiplied element-wise with the decoder input F_D to obtain the final attention feature map F_Att. The complete process of the UPAM can be expressed as:
Att_D(F_D) = σ( P([C(F_D), F_3]) ), with F_3 = D(C(F_D))
F_1 = C(F_skip), F_2 = D( F_1 ⊕ C(F_D) ), Att_skip(F_skip) = σ( P([F_2, F_1, F_3]) )
F_Att = Att_UPAM(F_skip, F_D) ⊙ F_D
where Att_UPAM(F_skip, F_D) is the fusion of Att_skip(F_skip) and Att_D(F_D); C(·), D(·), P(·) and σ(·) denote the convolution layer, the dilated convolution, the point convolution and the activation function respectively, [·] denotes channel-dimension concatenation, ⊕ denotes feature fusion and ⊙ denotes element-wise multiplication.
While acquiring multi-scale information through the dilated convolutions and the jump connection within the branches, the UPAM also captures intra-region correlation information. The jump-connection input F_skip of the left branch does not undergo the transposition and serialization operations of the bottleneck layer, so it still retains certain detail information. The interaction between the left and right branches further guides the decoder branch to retain spatial-detail weights. In the final stage, the weight values of the two branches are fused to form the attention weight matrix, which optimizes the spatial detail of the future frame modeling process in the form of an element-wise product.
The model proposed by the invention was tested on the benchmark dataset UCSD Ped2. The network model is built on the PyTorch framework and optimized with the Adam optimizer; the hyperparameters are set to β_1 = 0.9 and β_2 = 0.999, and the initial learning rate is set to 1×10^-3. All training and testing procedures, including reconstruction-branch training and prediction-branch training, were performed on a 16-core 32-thread machine with an AMD Ryzen 9 5950X 3.4 GHz CPU (64 GB RAM) and an RTX 3090 GPU (24 GB memory). λ_recon, λ_ent, λ_CVAE and λ_gd are set to 1.0, 2e-4, 1.0 and 1.0 respectively. The batch size and the number of training epochs are uniformly set to 128 and 120. Furthermore, the fusion coefficients (ω_r, ω_p) of the reconstruction error and the prediction error are set to (1.0, 0.1) for the UCSD Ped2 dataset.
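The reported training configuration can be summarised in PyTorch as follows; the placeholder model is illustrative only.

```python
import torch

model = torch.nn.Conv2d(3, 3, 3, padding=1)           # placeholder for reconstructor/predictor
optimizer = torch.optim.Adam(model.parameters(),
                             lr=1e-3,                  # initial learning rate
                             betas=(0.9, 0.999))       # beta1, beta2
batch_size, epochs = 128, 120
lambda_recon, lambda_ent, lambda_cvae, lambda_gd = 1.0, 2e-4, 1.0, 1.0
omega_r, omega_p = 1.0, 0.1                            # error fusion weights for UCSD Ped2
```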
Performance testing was performed on the UCSD Ped2 test set after model training was completed. FIG. 6 is a visualization of predicted anomaly scores for anomaly videos on a UCSD Ped2 test set in accordance with the present invention; the detection result in the UCSD Ped2 test set shows that the invention can accurately detect the abnormal event in the complex multi-scene monitoring video, and verifies the time positioning capability of the invention on the abnormal event.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.

Claims (9)

1. The visual anomaly detection method based on time sequence space information enhancement is characterized by comprising the following steps of:
step one: constructing an optical flow reconstructor and a future frame predictor with a jump-connected multi-level memory enhanced self-encoder;
step two: training an optical flow reconstructor by utilizing a mixed loss function combining the reconstructed pixel distance error and the memory module matching probability entropy loss;
step three: training a future frame predictor using a hybrid loss function that combines a predicted frame gaussian error and a gradient error;
step four: and detecting the video stream by using the trained optical flow reconstructor and the future frame predictor, and calculating an anomaly score according to the reconstruction error and the prediction error.
2. The method for visual anomaly detection based on temporal spatial information enhancement according to claim 1, wherein the future frame predictor comprises an optical flow encoder E_θ, an image-stream encoder and a decoder D_ψ; jump connections with U-shaped pyramid attention mechanism modules are added between the image-stream encoder and the decoder D_ψ; the encoder E_θ and the image-stream encoder each comprise a point convolution module, serial depth separable residual blocks and dense dilated downsampling modules; the decoder D_ψ comprises upsampling modules, serial depth separable residual blocks and a point convolution module; the input of the encoder E_θ is the reconstructed optical flow information ŷ_{1:t}; the input of the image-stream encoder is the mixed information of the original input frames x_{1:t} and the reconstructed optical flow information ŷ_{1:t}; and the output information of the encoder E_θ is concatenated with the latent output z of the image-stream encoder and sent to the decoder D_ψ.
3. The method for visual anomaly detection based on temporal spatial information enhancement of claim 2, wherein the mixed loss function for training the optical flow reconstructor is:
L_recon-branch = λ_recon · L_recon + λ_ent · L_ent
wherein L_recon = ||f − f̂||_2^2 is the reconstruction loss, f is the input and f̂ the reconstructed output; L_ent = Σ_{m=1}^{M} Σ_{k=1}^{N} −w_k^m · log(w_k^m) is the loss function of the memory modules, M is the number of memory modules, N is the size of each memory module and w_k^m is the matching probability of the k-th node in the m-th memory module; λ_recon and λ_ent are weights.
4. A method of visual anomaly detection based on temporal spatial information enhancement as claimed in claim 3, wherein the mixed loss function for training the future frame predictor is:
L_pred-branch = λ_predict · L_CVAE + λ_gd · L_gd
wherein L_CVAE = ||x_{t+1} − x̂_{t+1}||_2^2 + F_KL( q(z|x_{1:t}, y_{1:t}) || p(z|y_{1:t}) ) is the prediction loss, p(z|y_{1:t}) is the output distribution of the optical flow encoder, q(z|x_{1:t}, y_{1:t}) is the output distribution of the image-stream encoder, F_KL denotes the Kullback-Leibler divergence, y_{1:t} denotes the video optical flow, x_{t+1} denotes the ground-truth future frame and x̂_{t+1} the predicted output video frame; L_gd = Σ_{i,j} ( | |x_{i,j} − x_{i−1,j}| − |x̂_{i,j} − x̂_{i−1,j}| | + | |x_{i,j} − x_{i,j−1}| − |x̂_{i,j} − x̂_{i,j−1}| | ) is the gradient loss, i, j are the spatial coordinates of pixels in the image, x denotes the ground-truth future frame, x̂ the predicted output video frame, x_{i,j} a ground-truth pixel value of the future frame and x̂_{i,j} the corresponding predicted pixel value; λ_predict and λ_gd are weights.
5. The visual anomaly detection method based on temporal spatial information enhancement according to claim 4, wherein the anomaly score is calculated from the reconstruction error and the prediction error as:
S = ω_r · (S_r − μ_r)/σ_r + ω_p · (S_p − μ_p)/σ_p
wherein S is the anomaly score, S_r the reconstruction error, S_p the prediction error, ω_r the weight of the reconstruction score, ω_p the weight of the prediction score, μ_r the mean of the reconstruction errors of all training samples, σ_r the standard deviation of the reconstruction errors of all training samples, μ_p the mean of the prediction errors of all training samples, and σ_p the standard deviation of the prediction errors of all training samples.
6. The method for visual anomaly detection based on temporal spatial information enhancement according to claim 5, wherein the reconstruction error S_r and the prediction error S_p are expressed respectively as:
S_r = ||y_{1:t} − ŷ_{1:t}||_2^2,  S_p = ||x_{t+1} − x̂_{t+1}||_2^2
wherein ŷ_{1:t} denotes the reconstructed optical flow information and x̂_{t+1} denotes the predicted output video frame.
7. The visual anomaly detection method based on temporal spatial information enhancement according to claim 2, wherein the serial depth separable residual block comprises a channel convolution layer I, a channel convolution layer II, a point convolution layer I, and a point convolution layer II; the input end of the channel convolution layer I and the input end of the point convolution layer I are connected with the upper layer output end, the output end of the channel convolution layer I is connected with the input end of the channel convolution layer II, the output end of the channel convolution layer I is fused with the output end of the channel convolution layer II and then is connected with the input end of the point convolution layer II, and the output end of the point convolution layer II and the output end of the point convolution layer I are connected with the lower layer input end.
8. The visual anomaly detection method based on temporal spatial information enhancement according to claim 2, wherein the dense dilated downsampling module comprises three stages, Stage 1, Stage 2 and Stage 3; first, the input x is downsampled for the first time by the ordinary convolution of Stage 1 to obtain the downscaled feature map x_1; then, the second and third downsampling operations are performed with the dilated convolutions of Stage 2 and Stage 3 respectively, to obtain the downscaled feature maps x_2 and x_3; after that, the feature maps x_1 and x_2 are fused to obtain the fusion result y_1, and the feature maps x_2 and x_3 are fused to obtain the fusion result y_2; in the final stage, x_1, y_1 and y_2 are fused to obtain the output feature map y.
9. The method for visual anomaly detection based on temporal spatial information enhancement according to claim 2, wherein the U-shaped pyramid attention mechanism module comprises a decoder-input branch on the right and a jump-connection branch on the left; the right branch passes sequentially through a convolution layer and a dilated convolution, the convolution-layer output and the dilated-convolution output are fused by a point convolution, and the output Att_D(F_D) is obtained after an activation function; the left jump-connection branch passes through a convolution layer to obtain F_1, which is fused with the output of the right branch's convolution layer and then passed through a dilated convolution to obtain the output F_2; the three features F_2, F_1 and the dilated-convolution output F_3 of the right branch are fused along the channel dimension by a point convolution, and the final output Att_skip(F_skip) is obtained after an activation function; finally, the output Att_UPAM(F_skip, F_D) of the UPAM is multiplied element-wise with the decoder input F_D to obtain the final attention feature map F_Att.
CN202310510187.9A 2023-05-08 2023-05-08 Visual anomaly detection method based on time sequence spatial information enhancement Pending CN116543335A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310510187.9A CN116543335A (en) 2023-05-08 2023-05-08 Visual anomaly detection method based on time sequence spatial information enhancement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310510187.9A CN116543335A (en) 2023-05-08 2023-05-08 Visual anomaly detection method based on time sequence spatial information enhancement

Publications (1)

Publication Number Publication Date
CN116543335A true CN116543335A (en) 2023-08-04

Family

ID=87448376

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310510187.9A Pending CN116543335A (en) 2023-05-08 2023-05-08 Visual anomaly detection method based on time sequence spatial information enhancement

Country Status (1)

Country Link
CN (1) CN116543335A (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109117774A (en) * 2018-08-01 2019-01-01 广东工业大学 A kind of multi-angle video method for detecting abnormality based on sparse coding
CN113569756A (en) * 2021-07-29 2021-10-29 西安交通大学 Abnormal behavior detection and positioning method, system, terminal equipment and readable storage medium
CN114612836A (en) * 2022-03-15 2022-06-10 南京邮电大学 Monitoring video abnormity detection method based on memory enhancement future video frame prediction
CN114708620A (en) * 2022-05-10 2022-07-05 山东交通学院 Pedestrian re-identification method and system applied to unmanned aerial vehicle at aerial view angle
CN114821434A (en) * 2022-05-05 2022-07-29 西藏民族大学 Space-time enhanced video anomaly detection method based on optical flow constraint
CN114973102A (en) * 2022-06-17 2022-08-30 南通大学 Video anomaly detection method based on multipath attention time sequence
CN115511650A (en) * 2022-09-22 2022-12-23 国网智联电商有限公司 Method and device for determining user propagation in crowd sensing task
CN115527150A (en) * 2022-10-31 2022-12-27 南京邮电大学 Dual-branch video anomaly detection method combined with convolution attention module
CN115690665A (en) * 2023-01-03 2023-02-03 华东交通大学 Video anomaly detection method and device based on cross U-Net network
CN115830541A (en) * 2022-12-05 2023-03-21 桂林电子科技大学 Video abnormal event detection method based on double-current space-time self-encoder

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109117774A (en) * 2018-08-01 2019-01-01 广东工业大学 A kind of multi-angle video method for detecting abnormality based on sparse coding
CN113569756A (en) * 2021-07-29 2021-10-29 西安交通大学 Abnormal behavior detection and positioning method, system, terminal equipment and readable storage medium
CN114612836A (en) * 2022-03-15 2022-06-10 南京邮电大学 Monitoring video abnormity detection method based on memory enhancement future video frame prediction
CN114821434A (en) * 2022-05-05 2022-07-29 西藏民族大学 Space-time enhanced video anomaly detection method based on optical flow constraint
CN114708620A (en) * 2022-05-10 2022-07-05 山东交通学院 Pedestrian re-identification method and system applied to unmanned aerial vehicle at aerial view angle
CN114973102A (en) * 2022-06-17 2022-08-30 南通大学 Video anomaly detection method based on multipath attention time sequence
CN115511650A (en) * 2022-09-22 2022-12-23 国网智联电商有限公司 Method and device for determining user propagation in crowd sensing task
CN115527150A (en) * 2022-10-31 2022-12-27 南京邮电大学 Dual-branch video anomaly detection method combined with convolution attention module
CN115830541A (en) * 2022-12-05 2023-03-21 桂林电子科技大学 Video abnormal event detection method based on double-current space-time self-encoder
CN115690665A (en) * 2023-01-03 2023-02-03 华东交通大学 Video anomaly detection method and device based on cross U-Net network

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
CHAO HU ET AL: "Spatio-Temporal-based Context Fusion for Video Anomaly Detection", arXiv, 18 October 2022 (2022-10-18) *
ZHIAN LIU ET AL: "A Hybrid Video Anomaly Detection Framework via Memory-Augmented Flow Reconstruction and Flow-Guided Frame Prediction", ICCV, 31 December 2021 (2021-12-31), pages 13588 - 13597 *
WANG Xiangjun et al.: "Salient object detection method with edge-information-guided multi-level scale feature fusion", Infrared and Laser Engineering, vol. 52, no. 1, 31 January 2023 (2023-01-31) *
WANG Siqi et al.: "A survey of intelligent video anomaly event detection methods", Computer Engineering & Science, no. 08, 15 August 2020 (2020-08-15) *

Similar Documents

Publication Publication Date Title
CN110189334B (en) Medical image segmentation method of residual error type full convolution neural network based on attention mechanism
CN111612790B (en) Medical image segmentation method based on T-shaped attention structure
CN112507990A (en) Video time-space feature learning and extracting method, device, equipment and storage medium
CN112598053B (en) Active significance target detection method based on semi-supervised learning
CN112052763A (en) Video abnormal event detection method based on bidirectional review generation countermeasure network
CN113298789A (en) Insulator defect detection method and system, electronic device and readable storage medium
CN114549985B (en) Target detection method and system based on self-supervision contrast learning
CN114463759A (en) Lightweight character detection method and device based on anchor-frame-free algorithm
CN110930378A (en) Emphysema image processing method and system based on low data demand
CN113705394B (en) Behavior recognition method combining long time domain features and short time domain features
CN116543335A (en) Visual anomaly detection method based on time sequence spatial information enhancement
CN116797618A (en) Multi-stage segmentation method based on multi-mode MRI (magnetic resonance imaging) heart image
US20230090941A1 (en) Processing video content using gated transformer neural networks
CN111275751A (en) Unsupervised absolute scale calculation method and system
Dastbaravardeh et al. Channel Attention‐Based Approach with Autoencoder Network for Human Action Recognition in Low‐Resolution Frames
CN115484456A (en) Video anomaly prediction method and device based on semantic clustering
CN116452472A (en) Low-illumination image enhancement method based on semantic knowledge guidance
Wang et al. A Novel Neural Network Based on Transformer for Polyp Image Segmentation
CN111160346A (en) Ischemic stroke segmentation system based on three-dimensional convolution
Fan et al. EGFNet: Efficient guided feature fusion network for skin cancer lesion segmentation
CN111899161A (en) Super-resolution reconstruction method
Guo et al. An unsupervised optical flow estimation for LiDAR image sequences
CN113947612B (en) Video anomaly detection method based on foreground and background separation
CN111242038B (en) Dynamic tongue fibrillation detection method based on frame prediction network
KR102454742B1 (en) Method for analyzing thickness of cortical region

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination