CN116543335B - Visual anomaly detection method based on time sequence spatial information enhancement - Google Patents
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/20—Analysis of motion
- G06T7/246—Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/42—Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The invention provides a visual anomaly detection method based on time-sequence spatial information enhancement, aimed at the slow operation speed of anomaly detection networks. The method comprises the following steps: firstly, constructing an optical flow reconstructor (a multi-level memory-enhanced self-encoder with jump connections) and a future frame predictor; secondly, training the optical flow reconstructor with a mixed loss function combining the reconstructed pixel distance error and the memory-module matching-probability entropy loss; then training the future frame predictor with a mixed loss function combining the predicted-frame Gaussian error and the gradient error; and finally, detecting the video stream with the trained optical flow reconstructor and future frame predictor, and calculating an anomaly score from the reconstruction error and the prediction error. The method makes better use of spatial detail information, improves detection precision, and is suited to detecting abnormal events in complex and varied scene-monitoring videos.
Description
Technical Field
The invention relates to the technical field of general computer image processing, in particular to a visual anomaly detection method based on time sequence space information enhancement.
Background
Video abnormal-event detection technology is widely applied to fault early warning in industry, security and transportation, and aims to locate abnormal events in surveillance video spatially or temporally. Broadly, a video abnormal event is formally defined as "the appearance of an abnormal appearance or motion attribute, or the appearance of a normal appearance or motion attribute at an abnormal location or time." By this definition, an abnormal event is rare, unusual and sporadic in nature, so acquiring abnormal-event samples is difficult. To address this open-ended, small-sample detection problem, researchers model normal event behavior by analyzing the motion characteristics and spatio-temporal context features of normal event samples, and then use the model to judge whether an abnormal event occurs in a video. With the continued development of deep learning, researchers have attempted to detect anomalies with unsupervised paradigms based on reconstruction or prediction. In deep-learning-based anomaly detection, during training the network learns a model of the normal activity of the behavior subjects in a scene from a set of given video sequences containing no abnormal behavior. In the detection phase, the network extracts the same features from the video sequence under test and computes an anomaly score for each video frame. Because the network learns and models only the characteristics of normal event samples in an end-to-end manner, abnormal event samples cannot be effectively reconstructed or predicted in future frames and therefore yield higher reconstruction or prediction errors; with these errors as the criterion, a suitable threshold determines whether an abnormal event has occurred.
Although reconstruction- and prediction-based anomaly detection methods have made considerable progress, the following drawbacks remain. (1) Both rely on the capability of convolutional neural networks to extract video-sequence features; some methods introduce motion information such as optical flow before the reconstruction or prediction stage, but without achieving good results. (2) Reconstruction-based methods assume that a network trained only on normal samples will produce large reconstruction errors on abnormal images, owing to differences between normal and abnormal samples in morphological features, motion features and spatio-temporal context. However, this assumption does not always hold: the diversity of normal and abnormal samples means the normal events recorded in the training set are neither complete nor comprehensive, so abnormal samples may still be reconstructed well. (3) Prediction-based methods predict future frames from morphological and spatio-temporal context features and, in the anomaly discrimination process, assume that abnormal frames yield extremely high prediction errors, a hypothesis similar to that of reconstruction-based methods. Prediction-based methods therefore ignore the contribution of motion features to anomaly discrimination while sharing drawbacks similar to those of reconstruction-based methods.
To address these problems, some researchers have combined the two paradigms in serial or parallel fashion to build hybrid reconstruction-prediction frameworks that further improve anomaly detection performance. Hybrid models effectively combine reconstruction and prediction: parallel hybrid methods calculate anomaly scores by fusing reconstruction and prediction errors in the discrimination phase, while serial hybrid methods make future-frame modeling of abnormal events more difficult by introducing reconstruction information into the prediction branch. However, whether based on reconstruction, prediction or a hybrid model, none of these designs gives proper consideration to spatial detail information. Such methods place high demands on sample data and, affected by the diversity of normal samples, may still model anomalies effectively during training, which hinders effective differentiation of anomalies in the discrimination process.
Disclosure of Invention
To address the slow operation speed of anomaly detection networks, the invention provides a visual anomaly detection method based on time-sequence spatial information enhancement: a hybrid framework that combines optical flow reconstruction and optical-flow-guided future frame prediction in a serial manner. The optical-flow-guided future frame prediction model accepts both the previous video frames and the optical flow as inputs, but the optical flow used is not the original optical flow image; it is first reconstructed by an optical flow reconstruction model built from a multi-level memory-enhanced self-encoder with jump connections (ML-MemAE-SC), and the reconstructed optical flow is then input to the future frame prediction model. The method makes better use of spatial detail information, improves detection precision, and is suited to detecting abnormal events in complex and varied scene-monitoring videos.
The technical scheme of the invention is realized as follows:
a visual anomaly detection method based on time sequence space information enhancement comprises the following steps:
Step one: constructing an optical flow reconstructor (a multi-level memory-enhanced self-encoder with jump connections) and a future frame predictor;
step two: training an optical flow reconstructor by utilizing a mixed loss function combining the reconstructed pixel distance error and the memory module matching probability entropy loss;
step three: training a future frame predictor using a hybrid loss function that combines a predicted frame gaussian error and a gradient error;
step four: and detecting the video stream by using the trained optical flow reconstructor and the future frame predictor, and calculating an anomaly score according to the reconstruction error and the prediction error.
The future frame predictor includes an optical-flow encoder E_θ, an image encoder and a decoder D_ψ; a jump connection carrying the U-shaped pyramid attention mechanism module is added between the image encoder and the decoder D_ψ. The encoder E_θ and the image encoder each comprise a point convolution module, serial depth-separable residual blocks and a dense hole downsampling module; the decoder D_ψ comprises an upsampling module, serial depth-separable residual blocks and a point convolution module. The input of the encoder E_θ is the reconstructed optical flow information ŷ_{1:t}; the input of the image encoder is the mixed information of the original input frames x_{1:t} and the reconstructed optical flow information ŷ_{1:t}; the output information of the encoder E_θ and the image encoder is fused and connected to the input z of the decoder D_ψ.
The mixed loss function for training the optical flow reconstructor is:

L_recon-branch = λ_recon · L_recon + λ_ent · L_ent

where L_recon = ‖f − f̂‖₂² is the reconstruction loss, f is the input and f̂ is the reconstructed output; L_ent = Σ_{m=1}^{M} Σ_{k=1}^{N} −w_k^m · log(w_k^m) is the loss function of the memory modules, M is the number of memory modules, N is the size of each memory module, w_k^m is the matching probability of the k-th slot in the m-th memory module, and λ_recon, λ_ent are the weights.
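As an illustrative sketch (not part of the claimed implementation), the mixed reconstruction-branch loss above can be computed in NumPy as follows; the module count M, slot count N, loss weights and tensor shapes here are assumptions chosen for illustration, not values specified by the patent:

```python
import numpy as np

def recon_loss(f, f_hat):
    # L_recon: squared l2 distance between input flow and reconstruction
    return float(np.sum((f - f_hat) ** 2))

def entropy_loss(match_probs):
    # L_ent: summed entropy of the matching-probability vectors w^m
    # (one row of N slot probabilities per memory module, rows sum to 1)
    eps = 1e-12
    return float(np.sum(-match_probs * np.log(match_probs + eps)))

def recon_branch_loss(f, f_hat, match_probs, lam_recon=1.0, lam_ent=0.0002):
    # weighted fusion: L_recon-branch = lam_recon*L_recon + lam_ent*L_ent
    return lam_recon * recon_loss(f, f_hat) + lam_ent * entropy_loss(match_probs)

# toy example: 2 memory modules with 4 slots each, uniform matching probabilities
w = np.full((2, 4), 0.25)
f = np.zeros((8, 8, 2))            # toy optical-flow map
f_hat = np.ones((8, 8, 2)) * 0.1   # toy reconstruction
loss = recon_branch_loss(f, f_hat, w)
```

A uniform matching distribution maximizes the entropy term, so minimizing L_ent pushes each query toward a few dominant memory slots.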
The mixed loss function for training the future frame predictor is:

L_predict-branch = λ_predict · L_predict + λ_gd · L_gd

where L_predict = ‖x̂_{t+1} − x_{t+1}‖₂² + F_KL( q(z|x_{1:t}, y_{1:t}) ‖ p(z|y_{1:t}) ) is the prediction loss, p(z|y_{1:t}) is the output distribution of the optical flow encoder, q(z|x_{1:t}, y_{1:t}) is the output distribution of the image flow encoder, F_KL denotes the Kullback-Leibler divergence, y_{1:t} denotes the video optical flow, x_{t+1} denotes the ground-truth future frame and x̂_{t+1} denotes the predicted output video frame; L_gd = Σ_{i,j} ( | |x_{i,j} − x_{i−1,j}| − |x̂_{i,j} − x̂_{i−1,j}| | + | |x_{i,j} − x_{i,j−1}| − |x̂_{i,j} − x̂_{i,j−1}| | ) is the gradient loss, where i, j are the spatial coordinates of pixels in the image, x_{i,j} is the ground-truth pixel value of the future frame, x̂_{i,j} is the predicted pixel value, and λ_predict, λ_gd are the weights.
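The gradient-loss term can be illustrated in NumPy; the form used here, absolute differences between the horizontal and vertical intensity gradients of the two frames, is the standard gradient-loss formulation, and the toy 2x2 frames are purely illustrative:

```python
import numpy as np

def gradient_loss(x, x_hat):
    # L_gd: compare horizontal/vertical intensity gradients of the
    # ground-truth frame x and the predicted frame x_hat (2-D arrays)
    gx  = np.abs(np.diff(x, axis=1))      # horizontal gradients of x
    gy  = np.abs(np.diff(x, axis=0))      # vertical gradients of x
    gxh = np.abs(np.diff(x_hat, axis=1))  # gradients of the prediction
    gyh = np.abs(np.diff(x_hat, axis=0))
    return float(np.sum(np.abs(gx - gxh)) + np.sum(np.abs(gy - gyh)))

x = np.array([[0.0, 1.0], [0.0, 1.0]])            # frame with a vertical edge
perfect = gradient_loss(x, x.copy())              # identical prediction
blurred = gradient_loss(x, np.full((2, 2), 0.5))  # flat prediction loses the edge
```

A blurred prediction keeps a small pixel-wise error but destroys the edge, which is exactly what the gradient term penalizes.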
The anomaly score is calculated from the reconstruction error and the prediction error as:

S = ω_r · (S_r − μ_r)/σ_r + ω_p · (S_p − μ_p)/σ_p

where S is the anomaly score, S_r is the reconstruction error, S_p is the prediction error, ω_r is the weight of the reconstruction score, ω_p is the weight of the prediction score, μ_r is the mean of the reconstruction errors of all training samples, σ_r is the standard deviation of the reconstruction errors of all training samples, μ_p is the mean of the prediction errors of all training samples, and σ_p is the standard deviation of the prediction errors of all training samples.
The reconstruction error S_r and the prediction error S_p are respectively:

S_r = ‖y_{1:t} − ŷ_{1:t}‖₂²,  S_p = ‖x_{t+1} − x̂_{t+1}‖₂²

where ŷ_{1:t} denotes the reconstructed optical flow information and x̂_{t+1} denotes the predicted output video frame.
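The weighted, z-normalised score fusion can be sketched in plain Python; all statistics, weights and error values below are made-up illustrative numbers, not data from the patent:

```python
def anomaly_score(s_r, s_p, mu_r, sigma_r, mu_p, sigma_p, w_r=0.5, w_p=0.5):
    # S = w_r*(S_r - mu_r)/sigma_r + w_p*(S_p - mu_p)/sigma_p:
    # each error is z-normalised by training-set statistics before fusion,
    # so reconstruction and prediction errors contribute on a common scale
    return w_r * (s_r - mu_r) / sigma_r + w_p * (s_p - mu_p) / sigma_p

# toy training-set statistics from a hypothetical run
mu_r, sigma_r = 0.10, 0.02
mu_p, sigma_p = 0.30, 0.05
normal_score = anomaly_score(0.10, 0.30, mu_r, sigma_r, mu_p, sigma_p)
anom_score   = anomaly_score(0.18, 0.55, mu_r, sigma_r, mu_p, sigma_p)
```

A frame whose errors match the training means scores 0, while errors several standard deviations above the means produce a clearly higher score, which a threshold then converts into a normal/abnormal decision.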
The serial depth-separable residual block comprises channel convolution layer I, channel convolution layer II, point convolution layer I and point convolution layer II. The inputs of channel convolution layer I and point convolution layer I are connected to the output of the preceding layer; the output of channel convolution layer I feeds the input of channel convolution layer II; the outputs of channel convolution layers I and II are fused and connected to the input of point convolution layer II; and the outputs of point convolution layer II and point convolution layer I are connected to the input of the following layer.
The dense hole downsampling module comprises three stages: Stage 1, Stage 2 and Stage 3. First, the input x is downsampled once using the ordinary convolution of Stage 1 to obtain the downscaled feature map x_1; then the second and third downsampling are performed using the hole convolutions of Stage 2 and Stage 3, respectively, to obtain the downscaled feature maps x_2 and x_3. The feature maps x_1 and x_2 are then fused into the result y_1, the feature maps x_2 and x_3 are fused into the result y_2, and in the final stage x_1, y_1 and y_2 are fused to obtain the output feature map y.
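The effect of hole (dilated) convolution on the receptive field can be illustrated with standard convolution arithmetic; the 3x3 kernel size, stride, padding and dilation rates below are assumptions for illustration, not values fixed by this claim:

```python
def effective_kernel(k, d):
    # a k x k kernel with dilation d covers k + (k-1)*(d-1) input positions
    return k + (k - 1) * (d - 1)

def conv_out_size(n, k, stride, d, pad):
    # standard convolution output-size arithmetic with dilation
    return (n + 2 * pad - effective_kernel(k, d)) // stride + 1

# a 3x3 kernel at dilation 2 and 4 sees 5x5 and 9x9 input regions, so the
# stacked stages gather multi-scale spatial context while downsampling
k2 = effective_kernel(3, 2)
k4 = effective_kernel(3, 4)
# halving a 32x32 feature map with stride 2, dilation 2, padding 2
out = conv_out_size(32, 3, 2, 2, 2)
```

This is why stacking stages with growing dilation lets each downsampled map retain spatial context from a much larger neighbourhood than an ordinary strided convolution would.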
The U-shaped pyramid attention mechanism module comprises a decoder branch on the right and a jump-connection branch on the left. The decoder branch passes sequentially through a convolution layer and a hole convolution; the convolution-layer output and the hole-convolution output are fused by point convolution, and the output Att_D(F_D) is obtained after an activation function. In the jump-connection branch, the input passes through a convolution layer to give F_1, which is fused with the output of the decoder-branch convolution layer and passed through a hole convolution to give F_2; F_2, F_1 and the hole-convolution output F_3 of the decoder branch are then fused along the channel dimension by point convolution, and the final output Att_skip(F_skip) is obtained after the activation function. Finally, the UPAM output Att(F_skip, F_D) is multiplied element-wise with the decoder input F_D to obtain the final attention feature map F_Att.
Compared with the prior art, the invention has the beneficial effects that:
(1) Aiming at the problem of slower operation speed of an anomaly detection network, a Serial depth separable residual Block (Serial Block) is designed to serve as a backbone of a future frame prediction network, so that the calculation amount and the parameter amount of convolution operation are effectively reduced and the operation speed is improved on the premise of ensuring the performance.
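The parameter saving of depth-separable convolution over ordinary convolution can be checked with simple arithmetic; the 3x3 kernel and 128-channel figures below are illustrative examples, not values taken from the patent:

```python
def standard_conv_params(k, c_in, c_out):
    # weight count of an ordinary k x k convolution (bias omitted)
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    # channel-by-channel (depthwise) k x k conv followed by a 1x1 point conv
    return k * k * c_in + c_in * c_out

# example: a 3x3 convolution mapping 128 channels to 128 channels
full = standard_conv_params(3, 128, 128)   # 147456 weights
sep  = separable_conv_params(3, 128, 128)  # 1152 + 16384 = 17536 weights
ratio = full / sep                         # roughly 8.4x fewer parameters
```

Splitting the spatial filtering from the channel mixing is what lets the Serial Block cut computation and parameters while keeping comparable representational capacity.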
(2) In order to ensure that abundant and effective space detail information is reserved in the downsampling process in the prediction network encoder, a dense cavity downsampling module (DRSM) is designed, the module adopts a stepped structural design, and multiscale space relevance information is fully utilized in the downsampling process, so that a feature map after the scale reduction is ensured to have more effective space detail information.
(3) Because the U-shaped structure of the future-frame prediction branch contains many jump connections, some levels carry a certain amount of interference information during top-down transmission in the encoding stage. To solve this problem, a U-shaped pyramid attention module (UPAM) is designed: by extracting the interaction information between the jump-connection input and the original decoder input inside the attention module, the network is guided to keep richer and more effective spatial features during feature fusion, improving the quality of future-frame modeling.
(4) The invention can accurately detect the abnormal events in the complex and various scene monitoring videos.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of the present invention.
Fig. 2 is a block diagram of a future frame predictor network of the present invention.
Fig. 3 is a Serial depth separable residual Block (Serial Block) structure diagram of the present invention.
Fig. 4 is a diagram illustrating a structure of the dense hole downsampling module (DRSM) according to the present invention.
Fig. 5 is a diagram of the structure of the U-shaped pyramid attention mechanism module (UPAM) of the present invention.
FIG. 6 is an anomaly score plot of a test video at a UCSD Ped2 dataset of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without any inventive effort, are intended to be within the scope of the invention.
As shown in fig. 1, an embodiment of the present invention provides a visual anomaly detection method based on temporal-spatial information enhancement, comprising two components: (1) a multi-level memory-enhanced self-encoder with jump connections (ML-MemAE-SC), i.e. the optical flow reconstructor, used to reconstruct the optical flow image; the reconstructed optical flow image and the original image are input together to the future frame predictor for future frame prediction. (2) A conditional variational self-encoder guided by the reconstructed optical flow, i.e. the future frame predictor, used to predict future frames from the original image frames; introducing optical flow reconstruction information during prediction enlarges the prediction-error gap between abnormal and normal states, further improving detection accuracy. The specific steps are as follows:
Step one: construct the optical flow reconstructor (a multi-level memory-enhanced self-encoder with jump connections) and the future frame predictor. The optical flow reconstructor network adopts three levels of memory storage units and transmits the encoded information directly to the decoder via jump connections, providing more information for the next-level memory storage unit so as to retain a more comprehensive and complete normal pattern. The memory module represents the features input to it by a weighted sum of similar memory slots and thus acquires the ability to memorize the normal pattern when trained on normal data. Each memory module is in fact a matrix M ∈ R^{N×C} containing N real-valued vectors of fixed dimension C; each row of the matrix is called a slot m_i, i = 1, 2, …, N, and the memory size is N = 2000. This design effectively resolves the problems that a single memory module cannot hold all normal patterns while cascading multiple memory modules causes over-filtering. No jump connection is added at the outermost layer of the network: if one were used there, reconstruction could be completed using only the highest-level encoding and decoding information, greatly weakening or even disabling the normal-pattern information stored in all lower-level memory modules, so that the lower-level encoding, decoding and memory blocks would not work. To further reduce the number of network model parameters, all convolution layers of the reconstruction network are replaced with Serial Blocks, forming a lightweight version.
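The memory-slot addressing described above can be sketched in NumPy as weighted reading over the slots; the feature dimension C = 16 and the cosine-similarity/softmax details are assumptions for illustration (the patent does not specify them here), while N = 2000 matches the stated memory size:

```python
import numpy as np

def memory_read(q, mem):
    # represent a query feature q (C,) as a weighted sum of the N memory
    # slots mem (N, C); matching probabilities come from a softmax over
    # cosine similarity between q and each slot
    sims = mem @ q / (np.linalg.norm(mem, axis=1) * np.linalg.norm(q) + 1e-12)
    w = np.exp(sims - sims.max())   # numerically stable softmax
    w = w / w.sum()
    return w @ mem, w               # reconstructed feature, matching probs

rng = np.random.default_rng(0)
mem = rng.standard_normal((2000, 16))   # N = 2000 slots, toy dimension C = 16
q = rng.standard_normal(16)
q_hat, w = memory_read(q, mem)
```

Because the output is always a convex combination of memorized normal patterns, features of an abnormal input cannot be represented faithfully, which is what produces the large reconstruction error used for detection.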
The future frame predictor network detailed structure is shown in fig. 2. The future frame predictor network model consists of the following 5 main module connections: (1) a Serial depth separable residual Block (Serial Block); (2) an intensive hole downsampling module (DRSM); (3) a U-shaped pyramid attention mechanism module (UPAM); (4) an upsampling module (Upsample); (5) a point convolution module (Pointwise).
Each block in the future frame predictor network represents a corresponding module. As shown in fig. 2, there are two encoders, the optical-flow encoder E_θ and the image encoder, and one decoder D_ψ; the image encoder and D_ψ are used to help generate x̂_{t+1}. The backbone of the future frame predictor network is implemented with Serial Blocks, and the downsampling and upsampling layers are implemented with the DRSM and Upsample modules respectively. The model contains 4 levels in total, with feature map sizes (32, 32, 64), (16, 16, 128), (8, 8, 128) and (4, 4, 128). The output of E_θ is connected to the sampled z and sent to the decoder D_ψ. The last two bottleneck layers are used to estimate the distributions and sample from them, and share the same layer settings. The image encoder processes the original input frames x_{1:t} and the reconstructed optical flow information ŷ_{1:t}; E_θ processes the reconstructed optical flow information ŷ_{1:t} as input. Features are fused in the bottleneck layer, and the posterior distribution q(z|x_{1:t}, y_{1:t}) and the prior distribution p(z|y_{1:t}) are then passed to the decoder D_ψ to produce the predicted future output frame x̂_{t+1}.
Step two: train the optical flow reconstructor with a mixed loss function combining the reconstructed pixel distance error and the memory-module matching-probability entropy loss. The optical flow reconstruction network is trained using the optical flow map as input. Denoting the input f and the reconstructed output f̂, training minimizes the l_2 distance between input and output, and the reconstruction loss function can be expressed as:

L_recon = ‖f − f̂‖₂²

In addition, an entropy loss over the matching probabilities is added in the memory modules of the optical flow reconstruction network; with w_k^m the matching probability of the k-th slot in the m-th memory module, the loss function of the memory modules can be expressed as:

L_ent = Σ_{m=1}^{M} Σ_{k=1}^{N} −w_k^m · log(w_k^m)

where M is the number of memory modules and N is the size of each memory module. The two loss functions are fused by weighting for the final reconstruction-branch training:

L_recon-branch = λ_recon · L_recon + λ_ent · L_ent

where λ_recon and λ_ent are the weights.
Step three: train the future frame predictor with a mixed loss function combining the predicted-frame Gaussian error and the gradient error. During training, the video frames and the reconstructed optical flow are used as inputs. The future frame prediction network contains two encoders and one decoder, and the outputs of all three parts are considered jointly. With the output distribution of the optical flow encoder p(z|y_{1:t}) and the output distribution of the image flow encoder q(z|x_{1:t}, y_{1:t}), the prediction loss of the future frame prediction network is defined as:

L_predict = ‖x̂_{t+1} − x_{t+1}‖₂² + F_KL( q(z|x_{1:t}, y_{1:t}) ‖ p(z|y_{1:t}) )

where F_KL denotes the Kullback-Leibler divergence, x_{t+1} denotes the ground truth of the future frame, x̂_{t+1} denotes the predicted output video frame and y_{1:t} denotes the video optical flow.

Furthermore, a gradient loss is introduced:

L_gd = Σ_{i,j} ( | |x_{i,j} − x_{i−1,j}| − |x̂_{i,j} − x̂_{i−1,j}| | + | |x_{i,j} − x_{i,j−1}| − |x̂_{i,j} − x̂_{i,j−1}| | )

where i, j are the spatial coordinates of pixels in the image, x_{i,j} is the ground-truth pixel value of the future frame and x̂_{i,j} is the predicted pixel value.

The prediction loss and gradient loss are mixed by weighted fusion to obtain the total loss used to train the prediction network:

L_predict-branch = λ_predict · L_predict + λ_gd · L_gd

where λ_predict and λ_gd are the weights.
Step four: and detecting the video stream by using the trained optical flow reconstructor and the future frame predictor, calculating an anomaly score according to the reconstruction error and the prediction error, and optimizing anomaly discrimination performance by using the anomaly score weighting loss.
During the test phase, the network uses the reconstruction errors, i.e., y 1:t andThe difference between them, i.e. x t+1 and the prediction errorThe difference between them makes an abnormality judgment. The quality difference between the normal optical flow and the abnormal optical flow reconstructed by the optical flow reconstructor is utilized to improve the detection precision of the future frame predictor. The reconstructed abnormal optical flow is typically of lower quality, resulting in a future frame with larger prediction errors. In contrast, the reconstructed normal optical flow is typically of higher quality and the prediction module can successfully predict future frames with smaller prediction errors. The optical flow reconstruction and future frame prediction errors are used as final anomaly detection scores.
In the final anomaly detection stage, the video is detected using the network model trained in the previous steps, and the reconstruction error S r and the prediction error S p are fused by weighting to obtain the final anomaly score S, which can be expressed as:

S = ω r·(S r − μ r)/σ r + ω p·(S p − μ p)/σ p

wherein ω r is the weight of the reconstruction score, ω p is the weight of the prediction score, μ r is the mean of the reconstruction errors of all training samples, σ r is the standard deviation of the reconstruction errors of all training samples, μ p is the mean of the prediction errors of all training samples, and σ p is the standard deviation of the prediction errors of all training samples. The expressions of the reconstruction error S r and the prediction error S p are respectively:

S r = ‖y 1:t − ŷ 1:t‖₂²; S p = ‖x t+1 − x̂ t+1‖₂²

wherein ŷ 1:t represents the reconstructed optical flow information and x̂ t+1 represents the predicted output video frame.
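The score fusion above can be sketched as follows. Names are illustrative (not from the patent); `mse` stands in for the squared error between a signal and its reconstruction or prediction, and the default weights follow the (ω r, ω p) = (1.0, 0.1) setting reported later for UCSD Ped2.

```python
def mse(a, b):
    """Mean squared error between two flattened signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def anomaly_score(s_r, s_p, mu_r, sigma_r, mu_p, sigma_p,
                  w_r=1.0, w_p=0.1):
    """S = w_r*(S_r - mu_r)/sigma_r + w_p*(S_p - mu_p)/sigma_p:
    each error is standardized with training-set statistics, then
    the two standardized scores are mixed by weighting."""
    return w_r * (s_r - mu_r) / sigma_r + w_p * (s_p - mu_p) / sigma_p
```

Standardizing with the training-set mean and standard deviation puts the reconstruction and prediction errors on a comparable scale before they are mixed.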
The present invention proposes a Serial depth separable residual Block (Serial Block). Features are extracted with channel-by-channel convolution and point convolution, and residual connection is combined with dense connection, so that richer spatial features can be extracted and features can be effectively reused. The Serial Block structure is shown in FIG. 3. The trunk comprises two channel-by-channel convolution layers and one point convolution layer; richer two-dimensional spatial detail is obtained by stacking the two channel-by-channel convolutions, and channel fusion is performed by channel-dimension concatenation before the point convolution, realizing feature reuse. On the outer path, one point convolution balances the channel difference between the input and the output, and a residual term is introduced at the output, preserving spatial detail and alleviating the gradient vanishing or gradient explosion problem during training. If x denotes the input image, y i denotes the output feature of the ith convolution layer, and C(·), D(·) and P(·) denote the normal convolution, the channel-by-channel convolution and the point convolution respectively, then the mathematical expression of the Serial Block is:

y 1 = D(x); y 2 = D(y 1); y 3 = P(y 1 ⊕ y 2); y = y 3 + P(x)

where ⊕ denotes channel-dimension concatenation.
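A quick parameter count shows why the channel-by-channel (depthwise) plus point convolution factorization is attractive. The formulas are the standard ones for convolution weight counts (bias terms ignored); the helper names are illustrative.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in one standard k x k convolution layer."""
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    """Depthwise k x k convolution (one k x k filter per input channel)
    followed by a 1 x 1 point convolution mixing the channels."""
    return k * k * c_in + c_in * c_out
```

For a 3x3 convolution with 64 input and 64 output channels this is 36864 versus 4672 weights, roughly an 8x reduction, while the point convolution still mixes information across channels.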
The invention provides an intensive cavity downsampling module (DRSM). The module introduces hole convolution to combine multi-scale spatial information during downsampling and guide the downsampling of the feature map. The DRSM structure is shown in FIG. 4.
DRSM contains three stages: Stage 1, Stage 2 and Stage 3. Stage 1 performs scale reduction with a common convolution; Stage 2 and Stage 3 both perform scale reduction with hole convolutions, with hole rates set to 2 and 4 respectively, so that multi-scale spatial information is acquired during downsampling. The module first downsamples the input x with the common convolution to obtain the scale-reduced feature map x 1; it then performs the second and third downsampling with hole convolutions of hole rates 2 and 4 to obtain the scale-reduced feature maps x 2 and x 3. These three downsampling operations correspond to the three stages. Thereafter, the feature maps x 1 and x 2 are fused to obtain the fusion result y 1, the feature maps x 2 and x 3 are fused to obtain the fusion result y 2, and the feature maps x 1, y 1 and y 2 are fused in the final stage to obtain the output feature map y. Richer and more effective spatial information is preserved through this multi-level fusion of multi-scale features. The downsampling process of the DRSM module can be expressed as:

x 1 = C(x); x 2 = D 2(x 1); x 3 = D 4(x 2);
y 1 = P*(x 1 ⊕ x 2); y 2 = P*(x 2 ⊕ x 3);
S out = P*(x 1 ⊕ y 1 ⊕ y 2);

where P*(·) represents the fusion-layer point convolution, C(·) the common convolution, D 2(·) and D 4(·) the hole convolutions with hole rates 2 and 4, x i (i = 1, 2, 3) the output of each stage, ⊕ the channel-dimension concatenation operation, and S out the feature map after scale reduction.
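The effect of the hole (dilation) rates used in DRSM can be computed directly: a dilated k x k kernel covers a wider receptive field with the same number of weights. These are the standard formulas for dilated convolution geometry; the function names are illustrative.

```python
def effective_kernel(k, dilation):
    """Spatial extent covered by a k x k kernel with the given hole rate."""
    return k + (k - 1) * (dilation - 1)

def conv_out_size(n, k, stride, padding, dilation=1):
    """Output spatial size of a strided (possibly dilated) convolution."""
    return (n + 2 * padding - effective_kernel(k, dilation)) // stride + 1
```

For a 3x3 kernel, the hole rates 2 and 4 of Stage 2 and Stage 3 give effective extents of 5x5 and 9x9, so each downsampling stage sees a progressively larger spatial context at no extra parameter cost.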
The invention provides a U-shaped pyramid attention mechanism module (UPAM). It introduces hole convolution to increase the extraction of inter-region relevance information and further optimize the attention weight matrix. As shown in FIG. 5, UPAM comprises two structurally similar parallel branches: the left branch input is the jump connection F skip, i.e. the output of the corresponding encoder level, and the right branch input is the input F D of the corresponding decoder level.
The decoder branch on the right first passes through a convolution layer and then through a hole convolution with hole rate 2; a point convolution fuses the convolution output with the hole convolution output, and the output Att D(F D) is obtained after an activation function. The jump connection branch on the left passes through a convolution layer to obtain F 1; F 1 is fused with the output of the decoder-branch convolution layer and passes through a hole convolution with hole rate 2 to obtain F 2. Then F 2, F 1 and the decoder-branch hole convolution output F 3 are fused along the channel dimension by a point convolution, and the final output Att skip(F skip) is obtained after an activation function. Finally, the UPAM output Att(F skip, F D) is multiplied element-wise by the decoder input F D to obtain the final attention feature map F Att. The complete process of UPAM can be expressed as:

F Att = Att(F skip, F D) ⊙ F D

where ⊙ denotes the element-wise product.
By placing hole convolutions inside both branches and linking the branches with jump connections, UPAM acquires multi-scale information while further capturing inter-region relevance information. The jump connection input F skip of the left branch does not pass through the transpose and serialization operations of the bottleneck layer, so it still retains a certain amount of detail information. The jump connections between the left and right branches further guide the decoder branch to retain spatial detail weights. In the final stage, the weight values of the two branches are fused into an attention weight matrix, which optimizes the spatial detail of future frame modeling in the form of an element-wise product.
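The final UPAM step can be sketched as follows. This is an illustrative sketch only: the patent states that the two branch weight maps are fused and applied to the decoder input by element-wise product, but fusing them by addition before a sigmoid is an assumption made here for concreteness, and all names are invented.

```python
import math

def sigmoid(x):
    """Squash a raw weight to the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-x))

def fuse_and_attend(att_skip, att_d, f_d):
    """Fuse the two branch weight maps (assumed: by addition), squash to
    (0, 1), and reweight the decoder features element by element."""
    h, w = len(f_d), len(f_d[0])
    return [[sigmoid(att_skip[i][j] + att_d[i][j]) * f_d[i][j]
             for j in range(w)] for i in range(h)]
```

With zero raw weights the sigmoid yields 0.5, so each decoder feature is halved; larger fused weights let more of the feature pass through.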
The model proposed by the invention was tested on the benchmark dataset UCSD Ped2. The network model was built on the PyTorch framework and optimized with the Adam optimizer, with hyperparameters set to β 1 = 0.9 and β 2 = 0.999 and the initial learning rate set to 1×10⁻³. All training and testing procedures, including reconstruction-branch training and prediction-branch training, were performed on a 16-core 32-thread machine with an AMD Ryzen 9 5950X 3.4 GHz CPU (64 GB RAM) and an RTX 3090 GPU (24 GB memory). λ recon, λ ent, λ CVAE and λ gd were set to 1.0, 2×10⁻⁴, 1.0 and 1.0, respectively. The batch size and the number of training epochs were uniformly set to 128 and 120. Further, the fusion coefficients (ω r, ω p) of the reconstruction error and the prediction error were set to (1.0, 0.1) for the UCSD Ped2 dataset.
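The hyperparameter settings above can be collected into a single configuration object, a convenience sketch for reproducing the setup. The key names are invented; the values are taken from the text, with the garbled learning rate read as 1×10⁻³.

```python
# Training configuration as reported in the description (illustrative keys).
TRAIN_CONFIG = {
    "optimizer": "Adam",
    "betas": (0.9, 0.999),        # beta_1, beta_2
    "learning_rate": 1e-3,        # initial learning rate (assumed reading)
    "loss_weights": {             # lambda_recon, lambda_ent, lambda_CVAE, lambda_gd
        "recon": 1.0, "ent": 2e-4, "cvae": 1.0, "gd": 1.0,
    },
    "batch_size": 128,
    "epochs": 120,
    "fusion_weights": (1.0, 0.1), # (omega_r, omega_p) for UCSD Ped2
}
```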
Performance testing was performed on the UCSD Ped2 test set after model training was completed. FIG. 6 is a visualization of the predicted anomaly scores for abnormal videos in the UCSD Ped2 test set according to the present invention; the detection results on the UCSD Ped2 test set show that the invention can accurately detect abnormal events in complex multi-scene monitoring video, verifying the temporal localization capability of the invention for abnormal events.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.
Claims (5)
1. The visual anomaly detection method based on time sequence space information enhancement is characterized by comprising the following steps of:
Step one: constructing an optical flow reconstructor and a future frame predictor with a jump-connected multi-level memory enhanced self-encoder;
The future frame predictor includes an encoder E θ, a second encoder and a decoder D ψ; a jump connection with the U-shaped pyramid attention mechanism module is added between the second encoder and the decoder D ψ; the encoder E θ and the second encoder each comprise a point convolution module, a serial depth separable residual block and an intensive cavity downsampling module; the decoder D ψ includes an upsampling module, a serial depth separable residual block and a point convolution module; the input of the encoder E θ is the reconstructed optical flow information ŷ 1:t, and the input of the second encoder is the mixed information of the original input frames x 1:t and the reconstructed optical flow information ŷ 1:t; the output information of the encoder E θ and the output information z of the second encoder are concatenated and then sent to the decoder D ψ;
The serial depth separable residual block comprises a channel convolution layer I, a channel convolution layer II, a point convolution layer I and a point convolution layer II; the input end of the channel convolution layer I and the input end of the point convolution layer I are connected with the upper layer output end, the output end of the channel convolution layer I is connected with the input end of the channel convolution layer II, the output end of the channel convolution layer I is connected with the input end of the point convolution layer II after being fused with the output end of the channel convolution layer II, and the output end of the point convolution layer II and the output end of the point convolution layer I are connected with the lower layer input end;
The intensive cavity downsampling module comprises three stages, namely Stage 1, stage 2 and Stage 3; firstly, performing first downsampling on an input x by using common convolution of Stage 1 Stage to obtain a feature map x 1 after scale reduction; then, carrying out second and third downsampling by utilizing cavity convolution of Stage 2 Stage and Stage 3 Stage respectively so as to obtain feature graphs x 2 and x 3 after scale reduction; thereafter, the feature maps x 1 and x 2 are respectively fused to obtain a fusion result y 1, the feature maps x 2 and x 3 are fused to obtain a fusion result y 2, and the feature maps x 1、y1 and y 2 are fused in a final stage to obtain an output feature map y;
The U-shaped pyramid attention mechanism module comprises a decoder branch on the right side and a jump connection branch on the left side; the decoder branch on the right passes through a convolution layer and then a hole convolution in sequence, the convolution layer output and the hole convolution output are fused by a point convolution, and the output Att D(F D) is obtained after an activation function; the jump connection branch on the left passes through a convolution layer to obtain F 1, which is fused with the output of the decoder-branch convolution layer and passes through a hole convolution to obtain the output F 2; F 2, F 1 and the decoder-branch hole convolution output F 3 are fused along the channel dimension by a point convolution, and the final output Att skip(F skip) is obtained after an activation function; finally, the UPAM output Att(F skip, F D) is multiplied element-wise by the decoder input F D to obtain the final attention feature map F Att;
step two: training an optical flow reconstructor by utilizing a mixed loss function combining the reconstructed pixel distance error and the memory module matching probability entropy loss;
step three: training a future frame predictor using a hybrid loss function that combines a predicted frame gaussian error and a gradient error;
step four: and detecting the video stream by using the trained optical flow reconstructor and the future frame predictor, and calculating an anomaly score according to the reconstruction error and the prediction error.
2. The visual anomaly detection method based on temporal spatial information enhancement of claim 1, wherein the mixed loss function for training the optical flow reconstructor is:

L recon-branch = λ recon·L recon + λ ent·L ent;

wherein L recon = ‖f − f̂‖₂² is the reconstruction loss function, f is the input, f̂ is the reconstructed output, L ent = −Σ m=1..M Σ k=1..N p m,k·log p m,k is the loss function of the memory modules, M is the number of memory modules, N is the size of the memory module, p m,k is the matching probability of the kth node in the mth memory module, and λ recon and λ ent are the weights.
3. The visual anomaly detection method based on temporal spatial information enhancement of claim 2, wherein the mixed loss function for training the future frame predictor is:

L predict-branch = λ predict·L predict + λ gd·L gd;

wherein L predict = ‖x t+1 − x̂ t+1‖₂² + F KL(q(z|x 1:t, y 1:t) ‖ p(z|y 1:t)) is the prediction loss, p(z|y 1:t) is the output distribution of the optical flow encoder, q(z|x 1:t, y 1:t) is the output distribution of the image flow encoder, F KL represents the Kullback-Leibler divergence, y 1:t represents the video optical flow, x t+1 represents the truth diagram of the future frame, and x̂ t+1 represents the predicted output video frame; L gd = Σ i,j ( | |x i,j − x i−1,j| − |x̂ i,j − x̂ i−1,j| | + | |x i,j − x i,j−1| − |x̂ i,j − x̂ i,j−1| | ) is the gradient loss, wherein i, j represent the spatial coordinates of the pixels in the image, x represents the truth diagram of the future frame, x̂ represents the predicted output video frame, x i,j represents a truth-diagram pixel value, x̂ i,j represents a predicted pixel value, and λ predict and λ gd are the weights.
4. The visual anomaly detection method based on temporal spatial information enhancement according to claim 3, wherein the method of calculating the anomaly score from the reconstruction error and the prediction error is:

S = ω r·(S r − μ r)/σ r + ω p·(S p − μ p)/σ p;

wherein S is the anomaly score, S r is the reconstruction error, S p is the prediction error, ω r is the weight occupied by the reconstruction score, ω p is the weight occupied by the prediction score, μ r is the mean value of the reconstruction errors of all training samples, σ r is the standard deviation of the reconstruction errors of all training samples, μ p is the mean value of the prediction errors of all training samples, and σ p is the standard deviation of the prediction errors of all training samples.
5. The visual anomaly detection method based on temporal spatial information enhancement according to claim 4, wherein the expressions of the reconstruction error S r and the prediction error S p are respectively:

S r = ‖y 1:t − ŷ 1:t‖₂²; S p = ‖x t+1 − x̂ t+1‖₂²;

wherein ŷ 1:t represents the reconstructed optical flow information and x̂ t+1 represents the predicted output video frame.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310510187.9A (CN116543335B) | 2023-05-08 | 2023-05-08 | Visual anomaly detection method based on time sequence spatial information enhancement |
Publications (2)

| Publication Number | Publication Date |
|---|---|
| CN116543335A | 2023-08-04 |
| CN116543335B | 2024-06-21 |
Family

ID=87448376

Family Applications (1)

| Application Number | Status | Priority Date | Filing Date |
|---|---|---|---|
| CN202310510187.9A (CN116543335B) | Active | 2023-05-08 | 2023-05-08 |

Country Status (1)

| Country | Link |
|---|---|
| CN (1) | CN116543335B (en) |
Family Cites Families (12)

| Publication number | Priority date | Publication date | Title |
|---|---|---|---|
| CN109117774B * | 2018-08-01 | 2021-09-28 | Multi-view video anomaly detection method based on sparse coding |
| CN111666819B * | 2020-05-11 | 2022-06-14 | High-precision video abnormal event detection method integrating multivariate information |
| CN113569756B * | 2021-07-29 | 2023-06-09 | Abnormal behavior detection and positioning method, system, terminal equipment and readable storage medium |
| CN114332053B * | 2021-12-31 | 2024-07-19 | Multi-mode two-stage unsupervised video anomaly detection method |
| CN114612836B * | 2022-03-15 | 2024-04-05 | Monitoring video abnormity detection method based on memory-enhanced future video frame prediction |
| CN114821434A * | 2022-05-05 | 2022-07-29 | Space-time enhanced video anomaly detection method based on optical flow constraint |
| CN114708620A * | 2022-05-10 | 2022-07-05 | Pedestrian re-identification method and system applied to unmanned aerial vehicle at aerial view angle |
| CN114973102B * | 2022-06-17 | 2024-09-27 | Video anomaly detection method based on multipath attention time sequence |
| CN115511650A * | 2022-09-22 | 2022-12-23 | Method and device for determining user propagation in crowd sensing task |
| CN115527150A * | 2022-10-31 | 2022-12-27 | Dual-branch video anomaly detection method combined with convolution attention module |
| CN115830541A * | 2022-12-05 | 2023-03-21 | Video abnormal event detection method based on double-current space-time self-encoder |
| CN115690665B * | 2023-01-03 | 2023-03-28 | Video anomaly detection method and device based on cross U-Net network |
Non-Patent Citations (1)

Zhian Liu et al., "A Hybrid Video Anomaly Detection Framework via Memory-Augmented Flow Reconstruction and Flow-Guided Frame Prediction", ICCV 2021, pp. 13588-13597. *
Legal Events

| Code | Title |
|---|---|
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |