CN116543335B - Visual anomaly detection method based on time sequence spatial information enhancement - Google Patents

Visual anomaly detection method based on time sequence spatial information enhancement

Info

Publication number
CN116543335B
CN116543335B
Authority
CN
China
Prior art keywords
output
optical flow
encoder
convolution layer
error
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310510187.9A
Other languages
Chinese (zh)
Other versions
CN116543335A (en)
Inventor
王霖
李名洋
王玮
柴志鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202310510187.9A priority Critical patent/CN116543335B/en
Publication of CN116543335A publication Critical patent/CN116543335A/en
Application granted granted Critical
Publication of CN116543335B publication Critical patent/CN116543335B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/42 Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a visual anomaly detection method based on time sequence spatial information enhancement, which addresses the slow operation speed of existing anomaly detection networks. The method comprises the following steps: first, an optical flow reconstructor, built as a multi-level memory-enhanced self-encoder with skip connections, and a future frame predictor are constructed; second, the optical flow reconstructor is trained with a mixed loss function combining the reconstructed pixel distance error and the matching-probability entropy loss of the memory modules; third, the future frame predictor is trained with a mixed loss function combining the predicted-frame Gaussian error and a gradient error; finally, the video stream is detected with the trained optical flow reconstructor and future frame predictor, and an anomaly score is computed from the reconstruction error and the prediction error. The method makes better use of spatial detail information, improves detection accuracy, and is suitable for detecting abnormal events in complex and diverse scene surveillance videos.

Description

Visual anomaly detection method based on time sequence spatial information enhancement
Technical Field
The invention relates to the technical field of general computer image processing, in particular to a visual anomaly detection method based on time sequence spatial information enhancement.
Background
Video abnormal event detection technology is widely applied to fault early warning in industry, security and transportation, and aims to locate abnormal events in surveillance video spatially or temporally. Broadly, a video abnormal event is formally defined as "the appearance of an abnormal appearance or motion attribute, or the appearance of a normal appearance or motion attribute at an abnormal location or time." By this definition, abnormal events are rare, unusual and sporadic in nature, so collecting abnormal event samples is difficult. To address the detection problem posed by such unbounded, small-sample abnormal events, researchers model normal event behaviour by analysing the motion features and spatio-temporal context features of normal event samples, and then use the model to judge whether an abnormal event occurs in a video. With the continuing development of deep learning, researchers have attempted to detect anomalies with unsupervised paradigms based on reconstruction or prediction. In such deep-learning-based anomaly detection methods, the network first learns, during training, a model of the normal activity of the subjects in the scene from a number of given video sequences containing no abnormal behaviour. In the detection phase, the network extracts the same features from the video sequence under test and computes an anomaly score for each frame. Because these methods learn and model only normal event samples in an end-to-end manner, abnormal samples cannot be effectively reconstructed or predicted in future frames and therefore yield higher reconstruction or prediction errors; taking these errors as the criterion, whether an abnormal event has occurred can be judged against a suitable threshold.
Although reconstruction- and prediction-based anomaly detection methods have made considerable progress, the following drawbacks remain: (1) Both paradigms rely on the ability of convolutional neural networks to extract video sequence features; some methods introduce motion information such as optical flow before the reconstruction or prediction stage, but without achieving good results. (2) Reconstruction-based methods assume that, because the network is trained only on normal samples, reconstructing an abnormal image yields a large reconstruction error owing to the differences between normal and abnormal samples in morphological features, motion features and spatio-temporal context. However, this assumption does not always hold: the diversity of normal and abnormal samples means the normal events recorded in the training set are neither complete nor comprehensive, so abnormal samples may still be reconstructed well. (3) Prediction-based methods predict future frames from morphological and spatio-temporal context features, and during anomaly discrimination they rely on an assumption similar to that of reconstruction-based methods, namely that abnormal frames yield extremely high prediction errors. As a result, prediction-based methods neglect the contribution of motion features to anomaly discrimination while sharing drawbacks similar to those of reconstruction-based methods.
To solve the above problems, some researchers have combined the two paradigms in a serial or parallel manner, constructing hybrid reconstruction-prediction frameworks to further improve anomaly detection performance. A hybrid model effectively combines reconstruction and prediction: a parallel hybrid computes the anomaly score by fusing the reconstruction error and the prediction error in the discrimination phase, while a serial hybrid makes it harder to model abnormal events in future frames by introducing reconstruction information into the prediction branch. However, whether based on reconstruction, prediction or a hybrid model, existing networks do not give adequate consideration to spatial detail information in their design. These methods also place high demands on the sample data: affected by the diversity of normal samples, anomalies may still be modelled effectively during the modelling process, which is detrimental to distinguishing anomalies effectively in the discrimination phase.
Disclosure of Invention
To address the slow operation speed of anomaly detection networks, the invention provides a visual anomaly detection method based on time sequence spatial information enhancement, a hybrid framework that combines optical flow reconstruction and optical-flow-guided future frame prediction in a serial manner. The optical-flow-guided future frame prediction model accepts previous video frames and optical flow as inputs simultaneously, but the optical flow used is not the original optical flow image: it is first reconstructed by an optical flow reconstruction model built as a multi-level memory-enhanced self-encoder with skip connections (ML-MemAE-SC), and the reconstructed optical flow is then fed to the future frame prediction model. The method makes better use of spatial detail information, improves detection accuracy, and is suitable for detecting abnormal events in complex and diverse scene surveillance videos.
The technical scheme of the invention is realized as follows:
a visual anomaly detection method based on time sequence space information enhancement comprises the following steps:
Step one: constructing an optical flow reconstructor and a future frame predictor with a jump-connected multi-level memory enhanced self-encoder;
step two: training an optical flow reconstructor by utilizing a mixed loss function combining the reconstructed pixel distance error and the memory module matching probability entropy loss;
step three: training a future frame predictor using a hybrid loss function that combines a predicted frame gaussian error and a gradient error;
step four: and detecting the video stream by using the trained optical flow reconstructor and the future frame predictor, and calculating an anomaly score according to the reconstruction error and the prediction error.
The future frame predictor includes an optical flow encoder E_θ, an image-stream encoder and a decoder D_ψ; skip connections equipped with U-shaped pyramid attention mechanism modules are added between the image-stream encoder and the decoder D_ψ. The encoder E_θ and the image-stream encoder each comprise a point convolution module, serial depth separable residual blocks and dense hole downsampling modules; the decoder D_ψ comprises an upsampling module, serial depth separable residual blocks and a point convolution module. The input of the encoder E_θ is the reconstructed optical flow information ŷ_1:t; the inputs of the image-stream encoder are the original input frames x_1:t together with the reconstructed optical flow information ŷ_1:t. The output information of the encoder E_θ is concatenated with the sampled latent code z and then sent to the decoder D_ψ.
The mixing loss function of the training optical flow reconstructor is as follows:
L_recon-branch = λ_recon·L_recon + λ_ent·L_ent
wherein L_recon = ‖f − f̂‖₂² is the reconstruction loss function, f is the input and f̂ is the reconstructed output; L_ent = Σ_{m=1}^{M} Σ_{k=1}^{N} −w_k^m·log(w_k^m) is the loss function of the memory modules, M is the number of memory modules, N is the size of each memory module, w_k^m is the matching probability of the kth node in the mth memory module, and λ_recon, λ_ent are weights.
The mixing loss function of the training future frame predictor is:
L_predict-branch = λ_predict·L_predict + λ_gd·L_gd

wherein L_predict = ‖x_t+1 − x̂_t+1‖₂² + F_KL(q(z|x_1:t, y_1:t) ‖ p(z|y_1:t)) is the prediction loss, p(z|y_1:t) is the output distribution of the optical flow encoder, q(z|x_1:t, y_1:t) is the output distribution of the image-stream encoder, F_KL denotes the Kullback-Leibler divergence, y_1:t denotes the video optical flow, x_t+1 denotes the ground-truth future frame and x̂_t+1 denotes the predicted output video frame; L_gd = Σ_{i,j} ( | |x_i,j − x_i−1,j| − |x̂_i,j − x̂_i−1,j| | + | |x_i,j − x_i,j−1| − |x̂_i,j − x̂_i,j−1| | ) is the gradient loss, where i, j are the spatial coordinates of pixels in the image, x denotes the ground-truth future frame, x̂ denotes the predicted output video frame, x_i,j denotes a ground-truth pixel value and x̂_i,j denotes a predicted output pixel value; λ_predict, λ_gd are weights.
The method for calculating the abnormal score according to the reconstruction error and the prediction error comprises the following steps:
S = ω_r·(S_r − μ_r)/σ_r + ω_p·(S_p − μ_p)/σ_p

wherein S is the anomaly score, S_r is the reconstruction error, S_p is the prediction error, ω_r is the weight of the reconstruction score, ω_p is the weight of the prediction score, μ_r is the mean of the reconstruction errors of all training samples, σ_r is the standard deviation of the reconstruction errors of all training samples, μ_p is the mean of the prediction errors of all training samples, and σ_p is the standard deviation of the prediction errors of all training samples.
The expressions for the reconstruction error S_r and the prediction error S_p are, respectively:

S_r = ‖y_1:t − ŷ_1:t‖₂²,  S_p = ‖x_t+1 − x̂_t+1‖₂²

wherein ŷ_1:t denotes the reconstructed optical flow information and x̂_t+1 denotes the predicted output video frame.
The serial depth separable residual block comprises a channel convolution layer I, a channel convolution layer II, a point convolution layer I and a point convolution layer II; the input end of the channel convolution layer I and the input end of the point convolution layer I are connected with the upper layer output end, the output end of the channel convolution layer I is connected with the input end of the channel convolution layer II, the output end of the channel convolution layer I is fused with the output end of the channel convolution layer II and then is connected with the input end of the point convolution layer II, and the output end of the point convolution layer II and the output end of the point convolution layer I are connected with the lower layer input end.
The dense hole downsampling module comprises three stages, Stage 1, Stage 2 and Stage 3. First, the input x is downsampled a first time with the ordinary convolution of Stage 1 to obtain the scale-reduced feature map x_1; then, the second and third downsampling are carried out with the hole convolutions of Stage 2 and Stage 3 respectively to obtain the scale-reduced feature maps x_2 and x_3; thereafter, the feature maps x_1 and x_2 are fused to obtain the fusion result y_1, the feature maps x_2 and x_3 are fused to obtain the fusion result y_2, and the feature maps x_1, y_1 and y_2 are fused in the final stage to obtain the output feature map y.
The U-shaped pyramid attention mechanism module comprises a decoder branch on the right side and a skip-connection branch on the left side. The decoder branch on the right passes successively through a convolution layer and a hole convolution; the convolution-layer output and the hole-convolution output are fused by point convolution, and the output Att_D(F_D) is obtained after an activation function. The skip-connection branch on the left passes through a convolution layer to obtain F_1, which is fused with the output of the decoder-branch convolution layer and then passed through a hole convolution to obtain the output F_2; F_2, F_1 and the hole-convolution output F_3 of the decoder branch are fused along the channel dimension by point convolution, and the final output Att_skip(F_skip) is obtained after an activation function. Finally, the UPAM output Att(F_skip, F_D) is multiplied element-wise by the decoder input F_D to obtain the final attention feature map F_Att.
Compared with the prior art, the invention has the beneficial effects that:
(1) To address the slow operation speed of anomaly detection networks, a Serial depth-separable residual Block (Serial Block) is designed as the backbone of the future frame prediction network, which effectively reduces the computation and parameter count of the convolution operations and increases operation speed while maintaining performance.
(2) To ensure that rich and effective spatial detail information is retained during downsampling in the prediction-network encoder, a dense hole downsampling module (DRSM) is designed. The module adopts a stepped structure and makes full use of multi-scale spatial relevance information during downsampling, so that the scale-reduced feature map retains more effective spatial detail information.
(3) Because the U-shaped structure of the future frame prediction branch contains many skip connections, some levels carry interference information during top-down transmission in the encoding stage. To solve this problem, a U-shaped pyramid attention module (UPAM) is designed; through interactive information extraction between the skip-connection input and the original decoder input inside the attention module, the network is guided to retain richer and more effective spatial features during feature fusion, improving future-frame modelling quality.
(4) The invention can accurately detect the abnormal events in the complex and various scene monitoring videos.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of the present invention.
Fig. 2 is a block diagram of a future frame predictor network of the present invention.
Fig. 3 is a Serial depth separable residual Block (Serial Block) structure diagram of the present invention.
Fig. 4 is a diagram illustrating a structure of the dense hole downsampling module (DRSM) according to the present invention.
Fig. 5 is a diagram of the structure of the U-shaped pyramid attention mechanism module (UPAM) of the present invention.
FIG. 6 is an anomaly score plot of a test video at a UCSD Ped2 dataset of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without any inventive effort, are intended to be within the scope of the invention.
As shown in fig. 1, an embodiment of the present invention provides a visual anomaly detection method based on time sequence spatial information enhancement, comprising two components: (1) a multi-level memory-enhanced self-encoder with skip connections (ML-MemAE-SC), i.e. the optical flow reconstructor, which reconstructs the optical flow image; the reconstructed optical flow image and the original image are then input together into the future frame predictor for future frame prediction; (2) a conditional variational self-encoder guided by the reconstructed optical flow, i.e. the future frame predictor, which predicts future frames from the original image frames; introducing the optical flow reconstruction information during prediction enlarges the prediction-error gap between abnormal and normal states, further improving detection accuracy. The method comprises the following specific steps:
Step one: constructing an optical flow reconstructor, built as a multi-level memory-enhanced self-encoder with skip connections, and a future frame predictor. The optical flow reconstructor network adopts three levels of memory storage units and directly transmits the encoded information to the decoder through skip connections, providing more information for the next-level memory unit so that a more comprehensive and complete normal pattern is retained. The role of a memory module is to represent the features input to it by a weighted sum of similar memory slots, so that it acquires the ability to memorize normal patterns when trained on normal data. Each memory module is in fact a matrix M ∈ R^{N×C} containing N real-valued vectors of fixed dimension C; each row of the matrix is referred to as a slot m_i, where i = 1, 2, ..., N, and the memory size is N = 2000. This design effectively alleviates both the problem that a single memory module cannot hold all normal patterns and the problem that cascading multiple memory modules causes excessive filtering. No skip connection is added at the outermost layer of the network: if a skip connection were used at the outermost layer, reconstruction could be completed by the highest-level encoding and decoding information alone, and the effect of the normal-pattern information stored by all the lower-level memory modules would be greatly weakened or even nullified, leaving the other encoding, decoding and memory blocks unable to work. To further reduce the number of network model parameters, all convolution layers of the reconstruction network are replaced by Serial Blocks, forming a lightweight version.
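For illustration, the behaviour of a single memory module described above (addressing a learnable slot matrix M ∈ R^{N×C} and re-expressing an encoder feature as a weighted sum of similar slots) can be sketched in PyTorch as follows; the class name, the cosine-similarity addressing and the default sizes are assumptions of this sketch rather than the patented implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryModule(nn.Module):
    """Illustrative memory module: N learnable slots of dimension C."""

    def __init__(self, num_slots=2000, slot_dim=128):
        super().__init__()
        # Memory matrix M (N x C); each row is one slot m_i.
        self.memory = nn.Parameter(torch.randn(num_slots, slot_dim))

    def forward(self, z):
        # z: (B, C, H, W) feature map coming from one encoder level.
        b, c, h, w = z.shape
        q = z.permute(0, 2, 3, 1).reshape(-1, c)                  # queries, (B*H*W, C)
        # Matching probabilities: cosine similarity to every slot, then softmax.
        sim = F.linear(F.normalize(q, dim=1), F.normalize(self.memory, dim=1))
        probs = F.softmax(sim, dim=1)                             # (B*H*W, N)
        z_hat = probs @ self.memory                               # weighted sum of slots
        z_hat = z_hat.reshape(b, h, w, c).permute(0, 3, 1, 2)     # back to (B, C, H, W)
        # probs feeds the entropy term L_ent of the reconstruction-branch loss.
        return z_hat, probs
```

The matching probabilities returned here are what the entropy term of step two operates on, encouraging each normal pattern to be represented by only a few slots.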
The future frame predictor network detailed structure is shown in fig. 2. The future frame predictor network model consists of the following 5 main module connections: (1) a Serial depth separable residual Block (Serial Block); (2) an intensive hole downsampling module (DRSM); (3) a U-shaped pyramid attention mechanism module (UPAM); (4) an upsampling module (Upsample); (5) a point convolution module (Pointwise).
Each block in the future frame predictor network represents a corresponding module. As shown in fig. 2, there are two encoders, the optical flow encoder E_θ and the image-stream encoder, and one decoder D_ψ. Skip connections equipped with UPAM are added between the image-stream encoder and D_ψ to help generate x_t+1. The backbone of the future frame predictor network is implemented with Serial Blocks, and the downsampling and upsampling layers are implemented with DRSM and Upsample modules, respectively. The model contains 4 levels in total, and the feature map sizes corresponding to the levels are (32, 32, 64), (16, 16, 128), (8, 8, 128) and (4, 4, 128), respectively. The output of the encoder E_θ is concatenated with the sampled latent code z and sent to the decoder D_ψ. The last two bottleneck layers, which share the same layer settings, are used to estimate the latent distributions and sample from them. The image-stream encoder processes the original input frames x_1:t and the reconstructed optical flow information ŷ_1:t, while E_θ processes the reconstructed optical flow information ŷ_1:t as input. The features are fused in the bottleneck layer, where the posterior distribution q(z|x_1:t, y_1:t) and the prior distribution p(z|y_1:t) are estimated and fed to the decoder D_ψ to model the predicted future output frame x̂_t+1.
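The bottleneck behaviour described above (estimating a prior from the flow features, a posterior from the fused image/flow features, sampling z, and concatenating it with the flow features before decoding) can be sketched as follows; the module name, the 1×1-convolution heads and the treatment of the encoders as black boxes are assumptions of this sketch:

```python
import torch
import torch.nn as nn

class CVAEBottleneck(nn.Module):
    """Sketch of the bottleneck of the flow-guided future-frame predictor."""

    def __init__(self, feat_ch=128, z_ch=128):
        super().__init__()
        # 1x1 convolutions estimating (mu, log_var) of p(z|y) and q(z|x, y).
        self.prior_head = nn.Conv2d(feat_ch, 2 * z_ch, kernel_size=1)
        self.post_head = nn.Conv2d(feat_ch, 2 * z_ch, kernel_size=1)

    @staticmethod
    def reparameterize(mu, log_var):
        # z = mu + sigma * eps, with eps ~ N(0, I).
        return mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)

    def forward(self, flow_feat, fused_feat=None):
        # flow_feat: bottleneck output of E_theta; fused_feat: bottleneck output
        # of the image-stream encoder (available during training only).
        p_mu, p_logvar = self.prior_head(flow_feat).chunk(2, dim=1)
        if fused_feat is not None:                 # training: sample from the posterior
            q_mu, q_logvar = self.post_head(fused_feat).chunk(2, dim=1)
            z = self.reparameterize(q_mu, q_logvar)
        else:                                      # testing: sample from the prior
            q_mu = q_logvar = None
            z = self.reparameterize(p_mu, p_logvar)
        # Concatenate the sampled z with the flow features before decoding.
        dec_in = torch.cat([flow_feat, z], dim=1)
        return dec_in, (p_mu, p_logvar), (q_mu, q_logvar)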
Step two: training the optical flow reconstructor with a mixed loss function combining the reconstructed pixel distance error and the matching-probability entropy loss of the memory modules. When training the optical flow reconstruction network, the optical flow map is used as input. If the input is f and the reconstructed output is f̂, training the reconstruction network minimizes the l_2 distance between input and output, and the reconstruction loss function can be expressed as:

L_recon = ‖f − f̂‖₂²
In addition, entropy losses are added in the memory modules of the optical flow reconstruction network. With w^m denoting the matching probabilities of the mth memory module, the loss function of the memory modules can be expressed as:

L_ent = Σ_{m=1}^{M} Σ_{k=1}^{N} −w_k^m·log(w_k^m)
wherein M is the number of memory modules, N is the size of each memory module, and w_k^m is the matching probability of the kth node in the mth memory module. The two loss functions are weighted and fused for the final training of the reconstruction branch as follows:
L_recon-branch = λ_recon·L_recon + λ_ent·L_ent
wherein λ_recon and λ_ent are the weights.
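For illustration, this reconstruction-branch loss can be written as follows, assuming the l_2 reconstruction term and the entropy term given above and a list of matching-probability tensors already collected from the memory modules (the function name and default weights are illustrative):

```python
import torch

def recon_branch_loss(f, f_hat, match_probs, lambda_recon=1.0, lambda_ent=2e-4, eps=1e-12):
    """Mixed loss of the optical-flow reconstructor (illustrative sketch).

    f, f_hat    : input optical flow and its reconstruction, (B, 2, H, W)
    match_probs : list of matching-probability tensors w^m, one per memory module,
                  each of shape (..., N) and summing to 1 over the last dimension.
    """
    # Pixel-distance reconstruction term: squared l2 distance between input and output.
    l_recon = torch.mean((f - f_hat) ** 2)
    # Entropy of the matching probabilities, accumulated over all M memory modules.
    l_ent = sum((-w * torch.log(w + eps)).sum(dim=-1).mean() for w in match_probs)
    return lambda_recon * l_recon + lambda_ent * l_ent
```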
Step three: training the future frame predictor with a mixed loss function combining the predicted-frame Gaussian error and a gradient error. When training the future frame prediction network, the video frames and the reconstructed optical flow are used as inputs. The future frame prediction network includes two encoders and one decoder, and the outputs of these three parts are considered jointly during training. With the output distribution of the optical flow encoder denoted p(z|y_1:t) and the output distribution of the image-stream encoder denoted q(z|x_1:t, y_1:t), the prediction loss of the future frame prediction network can be defined as:

L_predict = ‖x_t+1 − x̂_t+1‖₂² + F_KL(q(z|x_1:t, y_1:t) ‖ p(z|y_1:t))

where F_KL denotes the Kullback-Leibler divergence, x_t+1 denotes the ground truth of the future frame, x̂_t+1 denotes the predicted output video frame, and y_1:t denotes the video optical flow.
Furthermore, a gradient loss is introduced, as follows:

L_gd = Σ_{i,j} ( | |x_i,j − x_i−1,j| − |x̂_i,j − x̂_i−1,j| | + | |x_i,j − x_i,j−1| − |x̂_i,j − x̂_i,j−1| | )

where i, j denote the spatial coordinates of pixels in the image, x denotes the ground-truth future frame, x̂ denotes the predicted output video frame, x_i,j denotes a ground-truth pixel value and x̂_i,j denotes a predicted output pixel value.
The prediction loss and the gradient loss are mixed by weighted fusion to obtain the total loss function used to train the prediction network, which can be expressed as:

L_predict-branch = λ_predict·L_predict + λ_gd·L_gd

wherein λ_predict and λ_gd are the weights.
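A sketch of this prediction-branch loss under the same assumptions, with the Kullback-Leibler term written out for a diagonal Gaussian prior and posterior and the gradient term taken over both image axes:

```python
import torch

def predict_branch_loss(x_true, x_pred, q_stats, p_stats, lambda_predict=1.0, lambda_gd=1.0):
    """Mixed loss of the future-frame predictor (illustrative sketch).

    x_true, x_pred   : ground-truth and predicted future frame, (B, C, H, W)
    q_stats, p_stats : (mu, log_var) of the posterior q(z|x, y) and prior p(z|y)
    """
    q_mu, q_logvar = q_stats
    p_mu, p_logvar = p_stats
    # Gaussian (pixel) term of the CVAE plus the KL divergence KL(q || p).
    gauss = torch.mean((x_true - x_pred) ** 2)
    kl = 0.5 * torch.mean(p_logvar - q_logvar
                          + (q_logvar.exp() + (q_mu - p_mu) ** 2) / p_logvar.exp() - 1.0)
    l_predict = gauss + kl
    # Gradient loss: l1 distance between image-gradient magnitudes along both axes.
    def grads(img):
        return (img[..., 1:, :] - img[..., :-1, :]).abs(), (img[..., :, 1:] - img[..., :, :-1]).abs()
    gt_i, gt_j = grads(x_true)
    pr_i, pr_j = grads(x_pred)
    l_gd = torch.mean((gt_i - pr_i).abs()) + torch.mean((gt_j - pr_j).abs())
    return lambda_predict * l_predict + lambda_gd * l_gd
```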
Step four: detecting the video stream with the trained optical flow reconstructor and future frame predictor, calculating an anomaly score from the reconstruction error and the prediction error, and optimizing anomaly discrimination performance through the weighted fusion of the two scores.
In the test phase, the network makes the abnormality judgment using the reconstruction error, i.e. the difference between y_1:t and ŷ_1:t, and the prediction error, i.e. the difference between x_t+1 and x̂_t+1. The quality gap between the normal and abnormal optical flow reconstructed by the optical flow reconstructor is exploited to improve the detection accuracy of the future frame predictor: reconstructed abnormal optical flow is typically of lower quality, resulting in future frames with larger prediction errors, whereas reconstructed normal optical flow is typically of higher quality, so the prediction module can successfully predict future frames with smaller prediction errors. The optical flow reconstruction error and the future frame prediction error are used as the final anomaly detection score.
In the final anomaly detection stage, the video is detected with the network models trained in the previous steps, and the reconstruction error S_r and the prediction error S_p are fused by weighting to obtain the final anomaly score S, which can be expressed as:

S = ω_r·(S_r − μ_r)/σ_r + ω_p·(S_p − μ_p)/σ_p

wherein ω_r is the weight of the reconstruction score, ω_p is the weight of the prediction score, μ_r is the mean of the reconstruction errors of all training samples, σ_r is the standard deviation of the reconstruction errors of all training samples, μ_p is the mean of the prediction errors of all training samples, and σ_p is the standard deviation of the prediction errors of all training samples. The expressions for the reconstruction error S_r and the prediction error S_p are, respectively:

S_r = ‖y_1:t − ŷ_1:t‖₂²,  S_p = ‖x_t+1 − x̂_t+1‖₂²
wherein ŷ_1:t denotes the reconstructed optical flow information and x̂_t+1 denotes the predicted output video frame.
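A minimal sketch of the scoring step, assuming the l_2 error expressions above and error statistics (μ_r, σ_r, μ_p, σ_p) collected on the training set beforehand:

```python
import torch

def anomaly_score(y_true, y_recon, x_true, x_pred, stats, omega_r=1.0, omega_p=0.1):
    """Fuse flow-reconstruction and frame-prediction errors into one score (sketch).

    stats = (mu_r, sigma_r, mu_p, sigma_p): error statistics of the training set.
    """
    mu_r, sigma_r, mu_p, sigma_p = stats
    s_r = torch.mean((y_true - y_recon) ** 2).item()   # reconstruction error S_r
    s_p = torch.mean((x_true - x_pred) ** 2).item()    # prediction error S_p
    # Normalise each error with its training statistics and fuse with the weights.
    return omega_r * (s_r - mu_r) / sigma_r + omega_p * (s_p - mu_p) / sigma_p
```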
The present invention proposes a Serial depth-separable residual Block (Serial Block). By extracting features with channel-wise convolution and point convolution and combining residual connections with dense connections, richer spatial features can be extracted and features can be effectively reused. The Serial Block structure is shown in FIG. 3. The trunk comprises two layers of channel-wise convolution and one layer of point convolution: stacking the two channel-wise convolutions yields richer two-dimensional spatial detail, and channel-dimension concatenation before the point convolution performs channel fusion and realizes feature reuse. On the outside, a single point convolution balances the channel difference between input and output, and the output finally introduces a residual term, preserving spatial detail and alleviating the gradient vanishing or explosion problem. If x denotes the input image, y_i the output features of each convolution layer, and C(·), D(·) and P(·) denote ordinary convolution, channel-wise convolution and point convolution respectively, the Serial Block can be expressed as:

y_1 = D(x), y_2 = D(y_1), y = P(y_1 ⊕ y_2) + P(x)

where ⊕ denotes channel-dimension concatenation.
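A PyTorch sketch consistent with the Serial Block wiring described above; the kernel sizes and ReLU activations are assumptions of this sketch:

```python
import torch
import torch.nn as nn

class SerialBlock(nn.Module):
    """Sketch of the serial depthwise-separable residual block (Serial Block)."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Trunk: two stacked channel-wise (depthwise) convolutions D(.).
        self.dw1 = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch)
        self.dw2 = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch)
        # Point convolution II fuses the concatenated depthwise outputs.
        self.pw2 = nn.Conv2d(2 * in_ch, out_ch, 1)
        # Point convolution I on the shortcut balances the channel difference.
        self.pw1 = nn.Conv2d(in_ch, out_ch, 1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        y1 = self.act(self.dw1(x))
        y2 = self.act(self.dw2(y1))
        trunk = self.pw2(torch.cat([y1, y2], dim=1))   # dense concat, then point conv
        return self.act(trunk + self.pw1(x))           # residual shortcut via point conv
```

Because the 3×3 kernels are depthwise and channel mixing is handled by 1×1 convolutions, such a block carries far fewer parameters and multiply-adds than a standard 3×3 convolution of the same width, which matches the stated motivation for using it as the prediction-network backbone.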
The invention provides an intensive cavity downsampling module (DRSM). The module introduces a hole convolution to combine multi-scale space information in the downsampling process to guide the downsampling process of the feature map. DRSM the module structure is shown in figure 4.
DRSM contains three stages, Stage 1, Stage 2 and Stage 3. Stage 1 performs scale reduction with an ordinary convolution, while Stage 2 and Stage 3 perform scale reduction with hole convolutions whose hole rates are set to 2 and 4 respectively; introducing these two stages of hole convolution acquires multi-scale spatial information during downsampling. In the downsampling process, the module first downsamples the input x with the ordinary convolution to obtain the scale-reduced feature map x_1; then the second and third downsampling are performed with the hole convolutions of rates 2 and 4 to obtain the scale-reduced feature maps x_2 and x_3 (these three downsampling operations correspond to the three stages). Thereafter, the feature maps x_1 and x_2 are fused to obtain the fusion result y_1, the feature maps x_2 and x_3 are fused to obtain the fusion result y_2, and x_1, y_1 and y_2 are fused in the final stage to obtain the output feature map y. The multi-level fusion of multi-scale features preserves richer and more effective spatial information. The downsampling process of the DRSM module can be expressed as:
y_1 = P*(x_1 ⊕ x_2), y_2 = P*(x_2 ⊕ x_3), S_out = P*(x_1 ⊕ y_1 ⊕ y_2)

where P*(·) denotes the point convolution of the fusion layer, x_i (i = 1, 2, 3) denotes the output of each stage, ⊕ denotes the channel-dimension concatenation operation, and S_out denotes the feature map after scale reduction.
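A PyTorch sketch of DRSM under the assumption that the three stages are parallel stride-2 branches over the same input (the later fusions require feature maps of equal spatial size); kernel sizes and channel widths are illustrative:

```python
import torch
import torch.nn as nn

class DRSM(nn.Module):
    """Sketch of the dense hole (dilated) downsampling module."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Stage 1: ordinary stride-2 convolution; Stages 2/3: dilated, rates 2 and 4.
        self.stage1 = nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1)
        self.stage2 = nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=2, dilation=2)
        self.stage3 = nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=4, dilation=4)
        # Point convolutions of the fusion layers (P* in the text).
        self.fuse12 = nn.Conv2d(2 * out_ch, out_ch, 1)
        self.fuse23 = nn.Conv2d(2 * out_ch, out_ch, 1)
        self.fuse_out = nn.Conv2d(3 * out_ch, out_ch, 1)

    def forward(self, x):
        x1, x2, x3 = self.stage1(x), self.stage2(x), self.stage3(x)   # same output size
        y1 = self.fuse12(torch.cat([x1, x2], dim=1))
        y2 = self.fuse23(torch.cat([x2, x3], dim=1))
        return self.fuse_out(torch.cat([x1, y1, y2], dim=1))          # output map y
```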
The invention further provides a U-shaped pyramid attention mechanism module (UPAM). Hole convolution is introduced to increase the extraction of inter-region relevance information and further optimize the attention weight matrix. As shown in fig. 5, UPAM comprises two structurally similar parallel branches: the left-branch input is the skip-connection input F_skip, i.e. the output of the corresponding encoder level, and the right-branch input is the input F_D of the corresponding decoder level.
The decoder branch on the right first passes through a convolution layer and then through a hole convolution with hole rate 2; the convolution output and the hole-convolution output are fused by point convolution, and the output Att_D(F_D) is obtained after an activation function. The skip-connection branch on the left passes through a convolution layer to obtain F_1, which is fused with the output of the decoder-branch convolution layer and then passed through a hole convolution with hole rate 2 to obtain the output F_2; F_2, F_1 and the hole-convolution output F_3 of the decoder branch are fused along the channel dimension by point convolution, and the final output Att_skip(F_skip) is obtained after an activation function. Finally, the UPAM output Att(F_skip, F_D) is multiplied element-wise by the decoder input F_D to obtain the final attention feature map F_Att. The complete UPAM process can be expressed as:

F_Att = Att(F_skip, F_D) ⊙ F_D

where Att(F_skip, F_D) is the attention weight matrix obtained by fusing Att_skip(F_skip) and Att_D(F_D), and ⊙ denotes element-wise multiplication.
Through the hole convolutions inside its branches and the skip connection between them, UPAM acquires multi-scale information while further capturing intra-region relevance information. The skip-connection input F_skip of the left branch does not pass through the transpose and serialization operations of the bottleneck layer, so it still retains a certain amount of detail information. The connection between the left and right branches further guides the decoder branch to retain spatial-detail weights. In the final stage, the weight values of the two branches are fused into an attention weight matrix, and the spatial detail of the future-frame modelling process is optimized in the form of an element-wise product.
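A PyTorch sketch consistent with the UPAM description; where the text leaves the operation unspecified, the additive fusion of F_1 with the decoder-branch convolution output, the sigmoid activation, and the product fusion of the two attention maps are assumptions of this sketch:

```python
import torch
import torch.nn as nn

class UPAM(nn.Module):
    """Sketch of the U-shaped pyramid attention module."""

    def __init__(self, ch):
        super().__init__()
        # Decoder branch (right): conv -> dilated conv, fused by a point convolution.
        self.d_conv = nn.Conv2d(ch, ch, 3, padding=1)
        self.d_dil = nn.Conv2d(ch, ch, 3, padding=2, dilation=2)
        self.d_point = nn.Conv2d(2 * ch, ch, 1)
        # Skip branch (left): conv -> (fuse with decoder conv) -> dilated conv,
        # then a point convolution over F2, F1 and F3.
        self.s_conv = nn.Conv2d(ch, ch, 3, padding=1)
        self.s_dil = nn.Conv2d(ch, ch, 3, padding=2, dilation=2)
        self.s_point = nn.Conv2d(3 * ch, ch, 1)
        self.act = nn.Sigmoid()

    def forward(self, f_skip, f_d):
        # Right branch: Att_D(F_D).
        d1 = self.d_conv(f_d)
        f3 = self.d_dil(d1)                                    # F_3
        att_d = self.act(self.d_point(torch.cat([d1, f3], dim=1)))
        # Left branch: Att_skip(F_skip).
        f1 = self.s_conv(f_skip)                               # F_1
        f2 = self.s_dil(f1 + d1)                               # F_2 (additive fusion assumed)
        att_skip = self.act(self.s_point(torch.cat([f2, f1, f3], dim=1)))
        # Fuse the two attention maps (product assumed) and re-weight the decoder input.
        return att_skip * att_d * f_d                          # F_Att = Att(F_skip, F_D) ⊙ F_D
```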
The model proposed by the invention was tested on the benchmark dataset UCSD Ped2. The network model was built on the PyTorch framework and optimized with the Adam optimizer, with hyper-parameters β_1 = 0.9, β_2 = 0.999 and an initial learning rate of 1×10⁻³. All training and testing procedures, including reconstruction-branch training and prediction-branch training, were performed on a 16-core, 32-thread machine with an AMD Ryzen 9 5950X 3.4 GHz CPU (64 GB RAM) and an RTX 3090 GPU (24 GB memory). λ_recon, λ_ent, λ_CVAE and λ_gd were set to 1.0, 2×10⁻⁴, 1.0 and 1.0, respectively. The batch size and the number of training epochs were set to 128 and 120. Furthermore, the fusion coefficients (ω_r, ω_p) of the reconstruction error and the prediction error were set to (1.0, 0.1) for the UCSD Ped2 dataset.
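An illustrative training-loop skeleton using these reported hyper-parameters; the function, model and data-loader names are placeholders, and the 1×10⁻³ learning rate reflects the value as corrected above:

```python
import torch

def train_branch(model, loss_fn, train_loader, epochs=120, device="cuda"):
    """Train one branch (reconstruction or prediction) with the reported settings."""
    model = model.to(device)
    # Adam with beta1 = 0.9, beta2 = 0.999 and an initial learning rate of 1e-3.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
    for _ in range(epochs):                    # 120 training epochs
        for batch in train_loader:             # DataLoader built with batch_size=128
            batch = batch.to(device)
            loss = loss_fn(model, batch)       # e.g. recon_branch_loss or predict_branch_loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```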
Performance testing was performed on the UCSD Ped2 test set after model training was completed. FIG. 6 is a visualization of predicted anomaly scores for anomaly videos on a UCSD Ped2 test set in accordance with the present invention; the detection result in the UCSD Ped2 test set shows that the invention can accurately detect the abnormal event in the complex multi-scene monitoring video, and verifies the time positioning capability of the invention on the abnormal event.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.

Claims (5)

1. The visual anomaly detection method based on time sequence space information enhancement is characterized by comprising the following steps of:
Step one: constructing an optical flow reconstructor and a future frame predictor with a jump-connected multi-level memory enhanced self-encoder;
The future frame predictor includes an optical flow encoder E_θ, an image-stream encoder and a decoder D_ψ; skip connections equipped with U-shaped pyramid attention mechanism modules are added between the image-stream encoder and the decoder D_ψ; the encoder E_θ and the image-stream encoder each comprise a point convolution module, serial depth separable residual blocks and dense hole downsampling modules; the decoder D_ψ comprises an upsampling module, serial depth separable residual blocks and a point convolution module; the input of the encoder E_θ is the reconstructed optical flow information ŷ_1:t, and the inputs of the image-stream encoder are the original input frames x_1:t and the reconstructed optical flow information ŷ_1:t; the output information of the encoder E_θ is concatenated with the sampled latent code z and then sent to the decoder D_ψ;
The serial depth separable residual block comprises a channel convolution layer I, a channel convolution layer II, a point convolution layer I and a point convolution layer II; the input end of the channel convolution layer I and the input end of the point convolution layer I are connected with the upper layer output end, the output end of the channel convolution layer I is connected with the input end of the channel convolution layer II, the output end of the channel convolution layer I is connected with the input end of the point convolution layer II after being fused with the output end of the channel convolution layer II, and the output end of the point convolution layer II and the output end of the point convolution layer I are connected with the lower layer input end;
The dense hole downsampling module comprises three stages, Stage 1, Stage 2 and Stage 3; first, the input x is downsampled a first time with the ordinary convolution of Stage 1 to obtain the scale-reduced feature map x_1; then, the second and third downsampling are carried out with the hole convolutions of Stage 2 and Stage 3 respectively to obtain the scale-reduced feature maps x_2 and x_3; thereafter, the feature maps x_1 and x_2 are fused to obtain a fusion result y_1, the feature maps x_2 and x_3 are fused to obtain a fusion result y_2, and the feature maps x_1, y_1 and y_2 are fused in a final stage to obtain an output feature map y;
The U-shaped pyramid attention mechanism module comprises a decoder branch on the right side and a skip-connection branch on the left side; the decoder branch on the right passes successively through a convolution layer and a hole convolution, the convolution-layer output and the hole-convolution output are fused by point convolution, and the output Att_D(F_D) is obtained after an activation function; the skip-connection branch on the left passes through a convolution layer to obtain F_1, which is fused with the output of the decoder-branch convolution layer and then passed through a hole convolution to obtain the output F_2; F_2, F_1 and the hole-convolution output F_3 of the decoder branch are fused along the channel dimension by point convolution, and the final output Att_skip(F_skip) is obtained after an activation function; finally, the UPAM output Att(F_skip, F_D) is multiplied element-wise by the decoder input F_D to obtain the final attention feature map F_Att;
step two: training an optical flow reconstructor by utilizing a mixed loss function combining the reconstructed pixel distance error and the memory module matching probability entropy loss;
step three: training a future frame predictor using a hybrid loss function that combines a predicted frame gaussian error and a gradient error;
step four: and detecting the video stream by using the trained optical flow reconstructor and the future frame predictor, and calculating an anomaly score according to the reconstruction error and the prediction error.
2. The method for visual anomaly detection based on temporal spatial information enhancement of claim 1, wherein the training optical flow reconstructor has a mixing loss function of:
L_recon-branch = λ_recon·L_recon + λ_ent·L_ent
wherein L_recon = ‖f − f̂‖₂² is the reconstruction loss function, f is the input and f̂ is the reconstructed output; L_ent = Σ_{m=1}^{M} Σ_{k=1}^{N} −w_k^m·log(w_k^m) is the loss function of the memory modules, M is the number of memory modules, N is the size of each memory module, w_k^m is the matching probability of the kth node in the mth memory module, and λ_recon, λ_ent are weights.
3. The method for visual anomaly detection based on temporal spatial information enhancement of claim 2, wherein the training future frame predictor's mixing loss function is:
L_predict-branch = λ_predict·L_predict + λ_gd·L_gd

wherein L_predict = ‖x_t+1 − x̂_t+1‖₂² + F_KL(q(z|x_1:t, y_1:t) ‖ p(z|y_1:t)) is the prediction loss, p(z|y_1:t) is the output distribution of the optical flow encoder, q(z|x_1:t, y_1:t) is the output distribution of the image-stream encoder, F_KL denotes the Kullback-Leibler divergence, y_1:t denotes the video optical flow, x_t+1 denotes the ground-truth future frame and x̂_t+1 denotes the predicted output video frame; L_gd = Σ_{i,j} ( | |x_i,j − x_i−1,j| − |x̂_i,j − x̂_i−1,j| | + | |x_i,j − x_i,j−1| − |x̂_i,j − x̂_i,j−1| | ) is the gradient loss, where i, j are the spatial coordinates of pixels in the image, x denotes the ground-truth future frame, x̂ denotes the predicted output video frame, x_i,j denotes a ground-truth pixel value and x̂_i,j denotes a predicted output pixel value; λ_predict, λ_gd are weights.
4. The visual anomaly detection method based on temporal spatial information enhancement according to claim 3, wherein the method of calculating anomaly score from reconstruction error and prediction error is:
S = ω_r·(S_r − μ_r)/σ_r + ω_p·(S_p − μ_p)/σ_p

wherein S is the anomaly score, S_r is the reconstruction error, S_p is the prediction error, ω_r is the weight of the reconstruction score, ω_p is the weight of the prediction score, μ_r is the mean of the reconstruction errors of all training samples, σ_r is the standard deviation of the reconstruction errors of all training samples, μ_p is the mean of the prediction errors of all training samples, and σ_p is the standard deviation of the prediction errors of all training samples.
5. The visual anomaly detection method based on temporal spatial information enhancement according to claim 4, wherein the expressions for the reconstruction error S_r and the prediction error S_p are, respectively:

S_r = ‖y_1:t − ŷ_1:t‖₂²,  S_p = ‖x_t+1 − x̂_t+1‖₂²

wherein ŷ_1:t denotes the reconstructed optical flow information and x̂_t+1 denotes the predicted output video frame.
CN202310510187.9A 2023-05-08 2023-05-08 Visual anomaly detection method based on time sequence spatial information enhancement Active CN116543335B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310510187.9A CN116543335B (en) 2023-05-08 2023-05-08 Visual anomaly detection method based on time sequence spatial information enhancement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310510187.9A CN116543335B (en) 2023-05-08 2023-05-08 Visual anomaly detection method based on time sequence spatial information enhancement

Publications (2)

Publication Number Publication Date
CN116543335A CN116543335A (en) 2023-08-04
CN116543335B (en) 2024-06-21

Family

ID=87448376

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310510187.9A Active CN116543335B (en) 2023-05-08 2023-05-08 Visual anomaly detection method based on time sequence spatial information enhancement

Country Status (1)

Country Link
CN (1) CN116543335B (en)

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109117774B (en) * 2018-08-01 2021-09-28 广东工业大学 Multi-view video anomaly detection method based on sparse coding
CN111666819B (en) * 2020-05-11 2022-06-14 武汉大学 High-precision video abnormal event detection method integrating multivariate information
CN113569756B (en) * 2021-07-29 2023-06-09 西安交通大学 Abnormal behavior detection and positioning method, system, terminal equipment and readable storage medium
CN114332053B (en) * 2021-12-31 2024-07-19 上海交通大学 Multi-mode two-stage unsupervised video anomaly detection method
CN114612836B (en) * 2022-03-15 2024-04-05 南京邮电大学 Monitoring video abnormity detection method based on memory-enhanced future video frame prediction
CN114821434A (en) * 2022-05-05 2022-07-29 西藏民族大学 Space-time enhanced video anomaly detection method based on optical flow constraint
CN114708620A (en) * 2022-05-10 2022-07-05 山东交通学院 Pedestrian re-identification method and system applied to unmanned aerial vehicle at aerial view angle
CN114973102B (en) * 2022-06-17 2024-09-27 南通大学 Video anomaly detection method based on multipath attention time sequence
CN115511650A (en) * 2022-09-22 2022-12-23 国网智联电商有限公司 Method and device for determining user propagation in crowd sensing task
CN115527150A (en) * 2022-10-31 2022-12-27 南京邮电大学 Dual-branch video anomaly detection method combined with convolution attention module
CN115830541A (en) * 2022-12-05 2023-03-21 桂林电子科技大学 Video abnormal event detection method based on double-current space-time self-encoder
CN115690665B (en) * 2023-01-03 2023-03-28 华东交通大学 Video anomaly detection method and device based on cross U-Net network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhian Liu et al.; "A Hybrid Video Anomaly Detection Framework via Memory-Augmented Flow Reconstruction and Flow-Guided Frame Prediction"; ICCV 2021; pp. 13588-13597 *

Also Published As

Publication number Publication date
CN116543335A (en) 2023-08-04

Similar Documents

Publication Publication Date Title
CN110189334B (en) Medical image segmentation method of residual error type full convolution neural network based on attention mechanism
CN109543740B (en) Target detection method based on generation countermeasure network
CN110458833B (en) Medical image processing method, medical device and storage medium based on artificial intelligence
CN112052763B (en) Video abnormal event detection method based on two-way review generation countermeasure network
CN113298789A (en) Insulator defect detection method and system, electronic device and readable storage medium
CN111612790A (en) Medical image segmentation method based on T-shaped attention structure
CN109583340A (en) A kind of video object detection method based on deep learning
CN112464851A (en) Smart power grid foreign matter intrusion detection method and system based on visual perception
CN115619743A (en) Construction method and application of OLED novel display device surface defect detection model
CN114549985B (en) Target detection method and system based on self-supervision contrast learning
CN116030538B (en) Weak supervision action detection method, system, equipment and storage medium
CN116797618A (en) Multi-stage segmentation method based on multi-mode MRI (magnetic resonance imaging) heart image
CN114677349B (en) Image segmentation method and system for enhancing edge information of encoding and decoding end and guiding attention
Kan et al. A GAN-based input-size flexibility model for single image dehazing
CN116543335B (en) Visual anomaly detection method based on time sequence spatial information enhancement
CN111667488B (en) Medical image segmentation method based on multi-angle U-Net
CN117115445A (en) Image invisible area completion method based on modeless instance segmentation
CN117788784A (en) Target detection method based on improved BiFPN network
US20230090941A1 (en) Processing video content using gated transformer neural networks
CN113962332B (en) Salient target identification method based on self-optimizing fusion feedback
CN115484456A (en) Video anomaly prediction method and device based on semantic clustering
CN116597503A (en) Classroom behavior detection method based on space-time characteristics
CN115527151A (en) Video anomaly detection method and system, electronic equipment and storage medium
CN115376178A (en) Unknown domain pedestrian re-identification method and system based on domain style filtering
CN111899161A (en) Super-resolution reconstruction method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant