CN116543335B - Visual anomaly detection method based on time sequence spatial information enhancement - Google Patents

Visual anomaly detection method based on time sequence spatial information enhancement

Info

Publication number
CN116543335B
CN116543335B
Authority
CN
China
Prior art keywords
output
optical flow
encoder
convolution layer
error
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310510187.9A
Other languages
Chinese (zh)
Other versions
CN116543335A (en)
Inventor
王霖
李名洋
王玮
柴志鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202310510187.9A priority Critical patent/CN116543335B/en
Publication of CN116543335A publication Critical patent/CN116543335A/en
Application granted granted Critical
Publication of CN116543335B publication Critical patent/CN116543335B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/42 Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a visual anomaly detection method based on time sequence spatial information enhancement, which addresses the slow operation speed of existing anomaly detection networks. The method comprises the following steps: first, an optical flow reconstructor, built as a multi-level memory-enhanced self-encoder with skip connections, and a future frame predictor are constructed; second, the optical flow reconstructor is trained with a mixed loss function combining the reconstructed pixel distance error and the matching-probability entropy loss of the memory modules; third, the future frame predictor is trained with a mixed loss function combining the predicted-frame Gaussian error and a gradient error; finally, the video stream is detected with the trained optical flow reconstructor and future frame predictor, and an anomaly score is computed from the reconstruction error and the prediction error. The method makes better use of spatial detail information, improves detection accuracy, and is suitable for detecting abnormal events in complex and diverse scene surveillance videos.

Description

Visual anomaly detection method based on time sequence spatial information enhancement
Technical Field
The invention relates to the technical field of general computer image processing, in particular to a visual anomaly detection method based on time sequence spatial information enhancement.
Background
Video abnormal event detection technology is widely applied to fault early warning in industry, security and transportation, and aims to locate abnormal events in surveillance video spatially or temporally. Broadly, a video abnormal event is formally defined as "the appearance of an abnormal appearance or motion attribute, or the appearance of a normal appearance or motion attribute at an abnormal location or time." By this definition, abnormal events are rare, unusual and sporadic in nature, so collecting abnormal event samples is difficult. To address the detection problem posed by such unbounded, small-sample abnormal events, researchers model normal event behaviour by analysing the motion features and spatio-temporal context features of normal event samples, and then use the model to judge whether an abnormal event occurs in a video. With the continuing development of deep learning, researchers have attempted to detect anomalies with unsupervised paradigms based on reconstruction or prediction. In such deep-learning-based anomaly detection methods, the network first learns, during training, a model of the normal activity of the subjects in the scene from a number of given video sequences containing no abnormal behaviour. In the detection phase, the network extracts the same features from the video sequence under test and computes an anomaly score for each frame. Because these methods learn and model only normal event samples in an end-to-end manner, abnormal samples cannot be effectively reconstructed or predicted in future frames and therefore yield higher reconstruction or prediction errors; taking these errors as the criterion, whether an abnormal event has occurred can be judged against a suitable threshold.
Although reconstruction- and prediction-based anomaly detection methods have made considerable progress, the following drawbacks remain: (1) Both paradigms rely on the ability of convolutional neural networks to extract video sequence features; some methods introduce motion information such as optical flow before the reconstruction or prediction stage, but without achieving good results. (2) Reconstruction-based methods assume that, because the network is trained only on normal samples, reconstructing an abnormal image yields a large reconstruction error owing to the differences between normal and abnormal samples in morphological features, motion features and spatio-temporal context. However, this assumption does not always hold: the diversity of normal and abnormal samples means the normal events recorded in the training set are neither complete nor comprehensive, so abnormal samples may still be reconstructed well. (3) Prediction-based methods predict future frames from morphological and spatio-temporal context features, and during anomaly discrimination they rely on an assumption similar to that of reconstruction-based methods, namely that abnormal frames yield extremely high prediction errors. As a result, prediction-based methods neglect the contribution of motion features to anomaly discrimination while sharing drawbacks similar to those of reconstruction-based methods.
To solve the above problems, some researchers have combined the two paradigms in a serial or parallel manner, constructing hybrid reconstruction-prediction frameworks to further improve anomaly detection performance. A hybrid model effectively combines reconstruction and prediction: a parallel hybrid computes the anomaly score by fusing the reconstruction error and the prediction error in the discrimination phase, while a serial hybrid makes it harder to model abnormal events in future frames by introducing reconstruction information into the prediction branch. However, whether based on reconstruction, prediction or a hybrid model, existing networks do not give adequate consideration to spatial detail information in their design. These methods also place high demands on the sample data: affected by the diversity of normal samples, anomalies may still be modelled effectively during the modelling process, which is detrimental to distinguishing anomalies effectively in the discrimination phase.
Disclosure of Invention
To address the slow operation speed of anomaly detection networks, the invention provides a visual anomaly detection method based on time sequence spatial information enhancement, a hybrid framework that combines optical flow reconstruction and optical-flow-guided future frame prediction in a serial manner. The optical-flow-guided future frame prediction model accepts previous video frames and optical flow as inputs simultaneously, but the optical flow used is not the original optical flow image: it is first reconstructed by an optical flow reconstruction model built as a multi-level memory-enhanced self-encoder with skip connections (ML-MemAE-SC), and the reconstructed optical flow is then fed to the future frame prediction model. The method makes better use of spatial detail information, improves detection accuracy, and is suitable for detecting abnormal events in complex and diverse scene surveillance videos.
The technical scheme of the invention is realized as follows:
a visual anomaly detection method based on time sequence space information enhancement comprises the following steps:
Step one: constructing an optical flow reconstructor and a future frame predictor with a jump-connected multi-level memory enhanced self-encoder;
step two: training an optical flow reconstructor by utilizing a mixed loss function combining the reconstructed pixel distance error and the memory module matching probability entropy loss;
step three: training a future frame predictor using a hybrid loss function that combines a predicted frame gaussian error and a gradient error;
step four: and detecting the video stream by using the trained optical flow reconstructor and the future frame predictor, and calculating an anomaly score according to the reconstruction error and the prediction error.
The future frame predictor includes an optical flow encoder E_θ, an image-stream encoder and a decoder D_ψ; skip connections equipped with U-shaped pyramid attention mechanism modules are added between the image-stream encoder and the decoder D_ψ. The encoder E_θ and the image-stream encoder each comprise a point convolution module, serial depth separable residual blocks and dense hole downsampling modules; the decoder D_ψ comprises an upsampling module, serial depth separable residual blocks and a point convolution module. The input of the encoder E_θ is the reconstructed optical flow information ŷ_1:t; the inputs of the image-stream encoder are the original input frames x_1:t together with the reconstructed optical flow information ŷ_1:t. The output information of the encoder E_θ is concatenated with the sampled latent code z and then sent to the decoder D_ψ.
The mixing loss function of the training optical flow reconstructor is as follows:
L_recon-branch = λ_recon·L_recon + λ_ent·L_ent
wherein L_recon = ‖f − f̂‖₂² is the reconstruction loss function, f is the input and f̂ is the reconstructed output; L_ent = Σ_{m=1}^{M} Σ_{k=1}^{N} −w_k^m·log(w_k^m) is the loss function of the memory modules, M is the number of memory modules, N is the size of each memory module, w_k^m is the matching probability of the kth node in the mth memory module, and λ_recon, λ_ent are weights.
The mixing loss function of the training future frame predictor is:
L_predict-branch = λ_predict·L_predict + λ_gd·L_gd

wherein L_predict = ‖x_t+1 − x̂_t+1‖₂² + F_KL(q(z|x_1:t, y_1:t) ‖ p(z|y_1:t)) is the prediction loss, p(z|y_1:t) is the output distribution of the optical flow encoder, q(z|x_1:t, y_1:t) is the output distribution of the image-stream encoder, F_KL denotes the Kullback-Leibler divergence, y_1:t denotes the video optical flow, x_t+1 denotes the ground-truth future frame and x̂_t+1 denotes the predicted output video frame; L_gd = Σ_{i,j} ( | |x_i,j − x_i−1,j| − |x̂_i,j − x̂_i−1,j| | + | |x_i,j − x_i,j−1| − |x̂_i,j − x̂_i,j−1| | ) is the gradient loss, where i, j are the spatial coordinates of pixels in the image, x denotes the ground-truth future frame, x̂ denotes the predicted output video frame, x_i,j denotes a ground-truth pixel value and x̂_i,j denotes a predicted output pixel value; λ_predict, λ_gd are weights.
The method for calculating the abnormal score according to the reconstruction error and the prediction error comprises the following steps:
S = ω_r·(S_r − μ_r)/σ_r + ω_p·(S_p − μ_p)/σ_p

wherein S is the anomaly score, S_r is the reconstruction error, S_p is the prediction error, ω_r is the weight of the reconstruction score, ω_p is the weight of the prediction score, μ_r is the mean of the reconstruction errors of all training samples, σ_r is the standard deviation of the reconstruction errors of all training samples, μ_p is the mean of the prediction errors of all training samples, and σ_p is the standard deviation of the prediction errors of all training samples.
The expressions for the reconstruction error S_r and the prediction error S_p are, respectively:

S_r = ‖y_1:t − ŷ_1:t‖₂²,  S_p = ‖x_t+1 − x̂_t+1‖₂²

wherein ŷ_1:t denotes the reconstructed optical flow information and x̂_t+1 denotes the predicted output video frame.
The serial depth separable residual block comprises a channel convolution layer I, a channel convolution layer II, a point convolution layer I and a point convolution layer II; the input end of the channel convolution layer I and the input end of the point convolution layer I are connected with the upper layer output end, the output end of the channel convolution layer I is connected with the input end of the channel convolution layer II, the output end of the channel convolution layer I is fused with the output end of the channel convolution layer II and then is connected with the input end of the point convolution layer II, and the output end of the point convolution layer II and the output end of the point convolution layer I are connected with the lower layer input end.
The dense hole downsampling module comprises three stages, Stage 1, Stage 2 and Stage 3. First, the input x is downsampled a first time with the ordinary convolution of Stage 1 to obtain the scale-reduced feature map x_1; then, the second and third downsampling are carried out with the hole convolutions of Stage 2 and Stage 3 respectively to obtain the scale-reduced feature maps x_2 and x_3; thereafter, the feature maps x_1 and x_2 are fused to obtain the fusion result y_1, the feature maps x_2 and x_3 are fused to obtain the fusion result y_2, and the feature maps x_1, y_1 and y_2 are fused in the final stage to obtain the output feature map y.
The U-shaped pyramid attention mechanism module comprises a decoder branch on the right side and a skip-connection branch on the left side. The decoder branch on the right passes successively through a convolution layer and a hole convolution; the convolution-layer output and the hole-convolution output are fused by point convolution, and the output Att_D(F_D) is obtained after an activation function. The skip-connection branch on the left passes through a convolution layer to obtain F_1, which is fused with the output of the decoder-branch convolution layer and then passed through a hole convolution to obtain the output F_2; F_2, F_1 and the hole-convolution output F_3 of the decoder branch are fused along the channel dimension by point convolution, and the final output Att_skip(F_skip) is obtained after an activation function. Finally, the UPAM output Att(F_skip, F_D) is multiplied element-wise by the decoder input F_D to obtain the final attention feature map F_Att.
Compared with the prior art, the invention has the beneficial effects that:
(1) To address the slow operation speed of anomaly detection networks, a Serial depth-separable residual Block (Serial Block) is designed as the backbone of the future frame prediction network, which effectively reduces the computation and parameter count of the convolution operations and increases operation speed while maintaining performance.
(2) To ensure that rich and effective spatial detail information is retained during downsampling in the prediction-network encoder, a dense hole downsampling module (DRSM) is designed. The module adopts a stepped structure and makes full use of multi-scale spatial relevance information during downsampling, so that the scale-reduced feature map retains more effective spatial detail information.
(3) Because the U-shaped structure of the future frame prediction branch contains many skip connections, some levels carry interference information during top-down transmission in the encoding stage. To solve this problem, a U-shaped pyramid attention module (UPAM) is designed; through interactive information extraction between the skip-connection input and the original decoder input inside the attention module, the network is guided to retain richer and more effective spatial features during feature fusion, improving future-frame modelling quality.
(4) The invention can accurately detect the abnormal events in the complex and various scene monitoring videos.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of the present invention.
Fig. 2 is a block diagram of a future frame predictor network of the present invention.
Fig. 3 is a Serial depth separable residual Block (Serial Block) structure diagram of the present invention.
Fig. 4 is a diagram illustrating a structure of the dense hole downsampling module (DRSM) according to the present invention.
Fig. 5 is a diagram of the structure of the U-shaped pyramid attention mechanism module (UPAM) of the present invention.
FIG. 6 is an anomaly score plot of a test video at a UCSD Ped2 dataset of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without any inventive effort, are intended to be within the scope of the invention.
As shown in fig. 1, an embodiment of the present invention provides a visual anomaly detection method based on time sequence spatial information enhancement, comprising two components: (1) a multi-level memory-enhanced self-encoder with skip connections (ML-MemAE-SC), i.e. the optical flow reconstructor, which reconstructs the optical flow image; the reconstructed optical flow image and the original image are then input together into the future frame predictor for future frame prediction; (2) a conditional variational self-encoder guided by the reconstructed optical flow, i.e. the future frame predictor, which predicts future frames from the original image frames; introducing the optical flow reconstruction information during prediction enlarges the prediction-error gap between abnormal and normal states, further improving detection accuracy. The method comprises the following specific steps:
Step one: constructing an optical flow reconstructor, built as a multi-level memory-enhanced self-encoder with skip connections, and a future frame predictor. The optical flow reconstructor network adopts three levels of memory storage units and directly transmits the encoded information to the decoder through skip connections, providing more information for the next-level memory unit so that a more comprehensive and complete normal pattern is retained. The role of a memory module is to represent the features input to it by a weighted sum of similar memory slots, so that it acquires the ability to memorize normal patterns when trained on normal data. Each memory module is in fact a matrix M ∈ R^{N×C} containing N real-valued vectors of fixed dimension C; each row of the matrix is referred to as a slot m_i, where i = 1, 2, ..., N, and the memory size is N = 2000. This design effectively alleviates both the problem that a single memory module cannot hold all normal patterns and the problem that cascading multiple memory modules causes excessive filtering. No skip connection is added at the outermost layer of the network: if a skip connection were used at the outermost layer, reconstruction could be completed by the highest-level encoding and decoding information alone, and the effect of the normal-pattern information stored by all the lower-level memory modules would be greatly weakened or even nullified, leaving the other encoding, decoding and memory blocks unable to work. To further reduce the number of network model parameters, all convolution layers of the reconstruction network are replaced by Serial Blocks, forming a lightweight version.
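For illustration, the behaviour of a single memory module described above (addressing a learnable slot matrix M ∈ R^{N×C} and re-expressing an encoder feature as a weighted sum of similar slots) can be sketched in PyTorch as follows; the class name, the cosine-similarity addressing and the default sizes are assumptions of this sketch rather than the patented implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryModule(nn.Module):
    """Illustrative memory module: N learnable slots of dimension C."""

    def __init__(self, num_slots=2000, slot_dim=128):
        super().__init__()
        # Memory matrix M (N x C); each row is one slot m_i.
        self.memory = nn.Parameter(torch.randn(num_slots, slot_dim))

    def forward(self, z):
        # z: (B, C, H, W) feature map coming from one encoder level.
        b, c, h, w = z.shape
        q = z.permute(0, 2, 3, 1).reshape(-1, c)                  # queries, (B*H*W, C)
        # Matching probabilities: cosine similarity to every slot, then softmax.
        sim = F.linear(F.normalize(q, dim=1), F.normalize(self.memory, dim=1))
        probs = F.softmax(sim, dim=1)                             # (B*H*W, N)
        z_hat = probs @ self.memory                               # weighted sum of slots
        z_hat = z_hat.reshape(b, h, w, c).permute(0, 3, 1, 2)     # back to (B, C, H, W)
        # probs feeds the entropy term L_ent of the reconstruction-branch loss.
        return z_hat, probs
```

The matching probabilities returned here are what the entropy term of step two operates on, encouraging each normal pattern to be represented by only a few slots.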
The future frame predictor network detailed structure is shown in fig. 2. The future frame predictor network model consists of the following 5 main module connections: (1) a Serial depth separable residual Block (Serial Block); (2) an intensive hole downsampling module (DRSM); (3) a U-shaped pyramid attention mechanism module (UPAM); (4) an upsampling module (Upsample); (5) a point convolution module (Pointwise).
Each block in the future frame predictor network represents a corresponding module. As shown in fig. 2, there are two encoders, the optical flow encoder E_θ and the image-stream encoder, and one decoder D_ψ. Skip connections equipped with UPAM are added between the image-stream encoder and D_ψ to help generate x_t+1. The backbone of the future frame predictor network is implemented with Serial Blocks, and the downsampling and upsampling layers are implemented with DRSM and Upsample modules, respectively. The model contains 4 levels in total, and the feature map sizes corresponding to the levels are (32, 32, 64), (16, 16, 128), (8, 8, 128) and (4, 4, 128), respectively. The output of the encoder E_θ is concatenated with the sampled latent code z and sent to the decoder D_ψ. The last two bottleneck layers, which share the same layer settings, are used to estimate the latent distributions and sample from them. The image-stream encoder processes the original input frames x_1:t and the reconstructed optical flow information ŷ_1:t, while E_θ processes the reconstructed optical flow information ŷ_1:t as input. The features are fused in the bottleneck layer, where the posterior distribution q(z|x_1:t, y_1:t) and the prior distribution p(z|y_1:t) are estimated and fed to the decoder D_ψ to model the predicted future output frame x̂_t+1.
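The bottleneck behaviour described above (estimating a prior from the flow features, a posterior from the fused image/flow features, sampling z, and concatenating it with the flow features before decoding) can be sketched as follows; the module name, the 1×1-convolution heads and the treatment of the encoders as black boxes are assumptions of this sketch:

```python
import torch
import torch.nn as nn

class CVAEBottleneck(nn.Module):
    """Sketch of the bottleneck of the flow-guided future-frame predictor."""

    def __init__(self, feat_ch=128, z_ch=128):
        super().__init__()
        # 1x1 convolutions estimating (mu, log_var) of p(z|y) and q(z|x, y).
        self.prior_head = nn.Conv2d(feat_ch, 2 * z_ch, kernel_size=1)
        self.post_head = nn.Conv2d(feat_ch, 2 * z_ch, kernel_size=1)

    @staticmethod
    def reparameterize(mu, log_var):
        # z = mu + sigma * eps, with eps ~ N(0, I).
        return mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)

    def forward(self, flow_feat, fused_feat=None):
        # flow_feat: bottleneck output of E_theta; fused_feat: bottleneck output
        # of the image-stream encoder (available during training only).
        p_mu, p_logvar = self.prior_head(flow_feat).chunk(2, dim=1)
        if fused_feat is not None:                 # training: sample from the posterior
            q_mu, q_logvar = self.post_head(fused_feat).chunk(2, dim=1)
            z = self.reparameterize(q_mu, q_logvar)
        else:                                      # testing: sample from the prior
            q_mu = q_logvar = None
            z = self.reparameterize(p_mu, p_logvar)
        # Concatenate the sampled z with the flow features before decoding.
        dec_in = torch.cat([flow_feat, z], dim=1)
        return dec_in, (p_mu, p_logvar), (q_mu, q_logvar)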
Step two: training the optical flow reconstructor with a mixed loss function combining the reconstructed pixel distance error and the matching-probability entropy loss of the memory modules. When training the optical flow reconstruction network, the optical flow map is used as input. If the input is f and the reconstructed output is f̂, training the reconstruction network minimizes the l_2 distance between input and output, and the reconstruction loss function can be expressed as:

L_recon = ‖f − f̂‖₂²
In addition, entropy losses are added in the memory modules of the optical flow reconstruction network. With w^m denoting the matching probabilities of the mth memory module, the loss function of the memory modules can be expressed as:

L_ent = Σ_{m=1}^{M} Σ_{k=1}^{N} −w_k^m·log(w_k^m)
wherein M is the number of memory modules, N is the size of each memory module, and w_k^m is the matching probability of the kth node in the mth memory module. The two loss functions are weighted and fused for the final training of the reconstruction branch as follows:
L_recon-branch = λ_recon·L_recon + λ_ent·L_ent
wherein λ_recon and λ_ent are the weights.
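For illustration, this reconstruction-branch loss can be written as follows, assuming the l_2 reconstruction term and the entropy term given above and a list of matching-probability tensors already collected from the memory modules (the function name and default weights are illustrative):

```python
import torch

def recon_branch_loss(f, f_hat, match_probs, lambda_recon=1.0, lambda_ent=2e-4, eps=1e-12):
    """Mixed loss of the optical-flow reconstructor (illustrative sketch).

    f, f_hat    : input optical flow and its reconstruction, (B, 2, H, W)
    match_probs : list of matching-probability tensors w^m, one per memory module,
                  each of shape (..., N) and summing to 1 over the last dimension.
    """
    # Pixel-distance reconstruction term: squared l2 distance between input and output.
    l_recon = torch.mean((f - f_hat) ** 2)
    # Entropy of the matching probabilities, accumulated over all M memory modules.
    l_ent = sum((-w * torch.log(w + eps)).sum(dim=-1).mean() for w in match_probs)
    return lambda_recon * l_recon + lambda_ent * l_ent
```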
Step three: training the future frame predictor with a mixed loss function combining the predicted-frame Gaussian error and a gradient error. When training the future frame prediction network, the video frames and the reconstructed optical flow are used as inputs. The future frame prediction network includes two encoders and one decoder, and the outputs of these three parts are considered jointly during training. With the output distribution of the optical flow encoder denoted p(z|y_1:t) and the output distribution of the image-stream encoder denoted q(z|x_1:t, y_1:t), the prediction loss of the future frame prediction network can be defined as:

L_predict = ‖x_t+1 − x̂_t+1‖₂² + F_KL(q(z|x_1:t, y_1:t) ‖ p(z|y_1:t))

where F_KL denotes the Kullback-Leibler divergence, x_t+1 denotes the ground truth of the future frame, x̂_t+1 denotes the predicted output video frame, and y_1:t denotes the video optical flow.
Furthermore, a gradient loss is introduced, as follows:

L_gd = Σ_{i,j} ( | |x_i,j − x_i−1,j| − |x̂_i,j − x̂_i−1,j| | + | |x_i,j − x_i,j−1| − |x̂_i,j − x̂_i,j−1| | )

where i, j denote the spatial coordinates of pixels in the image, x denotes the ground-truth future frame, x̂ denotes the predicted output video frame, x_i,j denotes a ground-truth pixel value and x̂_i,j denotes a predicted output pixel value.
The prediction loss and the gradient loss are mixed by weighted fusion to obtain the total loss function used to train the prediction network, which can be expressed as:

L_predict-branch = λ_predict·L_predict + λ_gd·L_gd

wherein λ_predict and λ_gd are the weights.
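A sketch of this prediction-branch loss under the same assumptions, with the Kullback-Leibler term written out for a diagonal Gaussian prior and posterior and the gradient term taken over both image axes:

```python
import torch

def predict_branch_loss(x_true, x_pred, q_stats, p_stats, lambda_predict=1.0, lambda_gd=1.0):
    """Mixed loss of the future-frame predictor (illustrative sketch).

    x_true, x_pred   : ground-truth and predicted future frame, (B, C, H, W)
    q_stats, p_stats : (mu, log_var) of the posterior q(z|x, y) and prior p(z|y)
    """
    q_mu, q_logvar = q_stats
    p_mu, p_logvar = p_stats
    # Gaussian (pixel) term of the CVAE plus the KL divergence KL(q || p).
    gauss = torch.mean((x_true - x_pred) ** 2)
    kl = 0.5 * torch.mean(p_logvar - q_logvar
                          + (q_logvar.exp() + (q_mu - p_mu) ** 2) / p_logvar.exp() - 1.0)
    l_predict = gauss + kl
    # Gradient loss: l1 distance between image-gradient magnitudes along both axes.
    def grads(img):
        return (img[..., 1:, :] - img[..., :-1, :]).abs(), (img[..., :, 1:] - img[..., :, :-1]).abs()
    gt_i, gt_j = grads(x_true)
    pr_i, pr_j = grads(x_pred)
    l_gd = torch.mean((gt_i - pr_i).abs()) + torch.mean((gt_j - pr_j).abs())
    return lambda_predict * l_predict + lambda_gd * l_gd
```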
Step four: detecting the video stream with the trained optical flow reconstructor and future frame predictor, calculating an anomaly score from the reconstruction error and the prediction error, and optimizing anomaly discrimination performance through the weighted fusion of the two scores.
In the test phase, the network makes the abnormality judgment using the reconstruction error, i.e. the difference between y_1:t and ŷ_1:t, and the prediction error, i.e. the difference between x_t+1 and x̂_t+1. The quality gap between the normal and abnormal optical flow reconstructed by the optical flow reconstructor is exploited to improve the detection accuracy of the future frame predictor: reconstructed abnormal optical flow is typically of lower quality, resulting in future frames with larger prediction errors, whereas reconstructed normal optical flow is typically of higher quality, so the prediction module can successfully predict future frames with smaller prediction errors. The optical flow reconstruction error and the future frame prediction error are used as the final anomaly detection score.
In the final anomaly detection stage, the video is detected with the network models trained in the previous steps, and the reconstruction error S_r and the prediction error S_p are fused by weighting to obtain the final anomaly score S, which can be expressed as:

S = ω_r·(S_r − μ_r)/σ_r + ω_p·(S_p − μ_p)/σ_p

wherein ω_r is the weight of the reconstruction score, ω_p is the weight of the prediction score, μ_r is the mean of the reconstruction errors of all training samples, σ_r is the standard deviation of the reconstruction errors of all training samples, μ_p is the mean of the prediction errors of all training samples, and σ_p is the standard deviation of the prediction errors of all training samples. The expressions for the reconstruction error S_r and the prediction error S_p are, respectively:

S_r = ‖y_1:t − ŷ_1:t‖₂²,  S_p = ‖x_t+1 − x̂_t+1‖₂²
wherein ŷ_1:t denotes the reconstructed optical flow information and x̂_t+1 denotes the predicted output video frame.
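A minimal sketch of the scoring step, assuming the l_2 error expressions above and error statistics (μ_r, σ_r, μ_p, σ_p) collected on the training set beforehand:

```python
import torch

def anomaly_score(y_true, y_recon, x_true, x_pred, stats, omega_r=1.0, omega_p=0.1):
    """Fuse flow-reconstruction and frame-prediction errors into one score (sketch).

    stats = (mu_r, sigma_r, mu_p, sigma_p): error statistics of the training set.
    """
    mu_r, sigma_r, mu_p, sigma_p = stats
    s_r = torch.mean((y_true - y_recon) ** 2).item()   # reconstruction error S_r
    s_p = torch.mean((x_true - x_pred) ** 2).item()    # prediction error S_p
    # Normalise each error with its training statistics and fuse with the weights.
    return omega_r * (s_r - mu_r) / sigma_r + omega_p * (s_p - mu_p) / sigma_p
```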
The present invention proposes a Serial depth-separable residual Block (Serial Block). By extracting features with channel-wise convolution and point convolution and combining residual connections with dense connections, richer spatial features can be extracted and features can be effectively reused. The Serial Block structure is shown in FIG. 3. The trunk comprises two layers of channel-wise convolution and one layer of point convolution: stacking the two channel-wise convolutions yields richer two-dimensional spatial detail, and channel-dimension concatenation before the point convolution performs channel fusion and realizes feature reuse. On the outside, a single point convolution balances the channel difference between input and output, and the output finally introduces a residual term, preserving spatial detail and alleviating the gradient vanishing or explosion problem. If x denotes the input image, y_i the output features of each convolution layer, and C(·), D(·) and P(·) denote ordinary convolution, channel-wise convolution and point convolution respectively, the Serial Block can be expressed as:

y_1 = D(x), y_2 = D(y_1), y = P(y_1 ⊕ y_2) + P(x)

where ⊕ denotes channel-dimension concatenation.
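A PyTorch sketch consistent with the Serial Block wiring described above; the kernel sizes and ReLU activations are assumptions of this sketch:

```python
import torch
import torch.nn as nn

class SerialBlock(nn.Module):
    """Sketch of the serial depthwise-separable residual block (Serial Block)."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Trunk: two stacked channel-wise (depthwise) convolutions D(.).
        self.dw1 = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch)
        self.dw2 = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch)
        # Point convolution II fuses the concatenated depthwise outputs.
        self.pw2 = nn.Conv2d(2 * in_ch, out_ch, 1)
        # Point convolution I on the shortcut balances the channel difference.
        self.pw1 = nn.Conv2d(in_ch, out_ch, 1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        y1 = self.act(self.dw1(x))
        y2 = self.act(self.dw2(y1))
        trunk = self.pw2(torch.cat([y1, y2], dim=1))   # dense concat, then point conv
        return self.act(trunk + self.pw1(x))           # residual shortcut via point conv
```

Because the 3×3 kernels are depthwise and channel mixing is handled by 1×1 convolutions, such a block carries far fewer parameters and multiply-adds than a standard 3×3 convolution of the same width, which matches the stated motivation for using it as the prediction-network backbone.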
The invention provides an intensive cavity downsampling module (DRSM). The module introduces a hole convolution to combine multi-scale space information in the downsampling process to guide the downsampling process of the feature map. DRSM the module structure is shown in figure 4.
DRSM contains three stages, Stage 1, Stage 2 and Stage 3. Stage 1 performs scale reduction with an ordinary convolution, while Stage 2 and Stage 3 perform scale reduction with hole convolutions whose hole rates are set to 2 and 4 respectively; introducing these two stages of hole convolution acquires multi-scale spatial information during downsampling. In the downsampling process, the module first downsamples the input x with the ordinary convolution to obtain the scale-reduced feature map x_1; then the second and third downsampling are performed with the hole convolutions of rates 2 and 4 to obtain the scale-reduced feature maps x_2 and x_3 (these three downsampling operations correspond to the three stages). Thereafter, the feature maps x_1 and x_2 are fused to obtain the fusion result y_1, the feature maps x_2 and x_3 are fused to obtain the fusion result y_2, and x_1, y_1 and y_2 are fused in the final stage to obtain the output feature map y. The multi-level fusion of multi-scale features preserves richer and more effective spatial information. The downsampling process of the DRSM module can be expressed as:
y_1 = P*(x_1 ⊕ x_2), y_2 = P*(x_2 ⊕ x_3), S_out = P*(x_1 ⊕ y_1 ⊕ y_2)

where P*(·) denotes the point convolution of the fusion layer, x_i (i = 1, 2, 3) denotes the output of each stage, ⊕ denotes the channel-dimension concatenation operation, and S_out denotes the feature map after scale reduction.
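A PyTorch sketch of DRSM under the assumption that the three stages are parallel stride-2 branches over the same input (the later fusions require feature maps of equal spatial size); kernel sizes and channel widths are illustrative:

```python
import torch
import torch.nn as nn

class DRSM(nn.Module):
    """Sketch of the dense hole (dilated) downsampling module."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Stage 1: ordinary stride-2 convolution; Stages 2/3: dilated, rates 2 and 4.
        self.stage1 = nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1)
        self.stage2 = nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=2, dilation=2)
        self.stage3 = nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=4, dilation=4)
        # Point convolutions of the fusion layers (P* in the text).
        self.fuse12 = nn.Conv2d(2 * out_ch, out_ch, 1)
        self.fuse23 = nn.Conv2d(2 * out_ch, out_ch, 1)
        self.fuse_out = nn.Conv2d(3 * out_ch, out_ch, 1)

    def forward(self, x):
        x1, x2, x3 = self.stage1(x), self.stage2(x), self.stage3(x)   # same output size
        y1 = self.fuse12(torch.cat([x1, x2], dim=1))
        y2 = self.fuse23(torch.cat([x2, x3], dim=1))
        return self.fuse_out(torch.cat([x1, y1, y2], dim=1))          # output map y
```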
The invention further provides a U-shaped pyramid attention mechanism module (UPAM). Hole convolution is introduced to increase the extraction of inter-region relevance information and further optimize the attention weight matrix. As shown in fig. 5, UPAM comprises two structurally similar parallel branches: the left-branch input is the skip-connection input F_skip, i.e. the output of the corresponding encoder level, and the right-branch input is the input F_D of the corresponding decoder level.
The decoder branch on the right first passes through a convolution layer and then through a hole convolution with hole rate 2; the convolution output and the hole-convolution output are fused by point convolution, and the output Att_D(F_D) is obtained after an activation function. The skip-connection branch on the left passes through a convolution layer to obtain F_1, which is fused with the output of the decoder-branch convolution layer and then passed through a hole convolution with hole rate 2 to obtain the output F_2; F_2, F_1 and the hole-convolution output F_3 of the decoder branch are fused along the channel dimension by point convolution, and the final output Att_skip(F_skip) is obtained after an activation function. Finally, the UPAM output Att(F_skip, F_D) is multiplied element-wise by the decoder input F_D to obtain the final attention feature map F_Att. The complete UPAM process can be expressed as:

F_Att = Att(F_skip, F_D) ⊙ F_D

where Att(F_skip, F_D) is the attention weight matrix obtained by fusing Att_skip(F_skip) and Att_D(F_D), and ⊙ denotes element-wise multiplication.
Through the hole convolutions inside its branches and the skip connection between them, UPAM acquires multi-scale information while further capturing intra-region relevance information. The skip-connection input F_skip of the left branch does not pass through the transpose and serialization operations of the bottleneck layer, so it still retains a certain amount of detail information. The connection between the left and right branches further guides the decoder branch to retain spatial-detail weights. In the final stage, the weight values of the two branches are fused into an attention weight matrix, and the spatial detail of the future-frame modelling process is optimized in the form of an element-wise product.
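A PyTorch sketch consistent with the UPAM description; where the text leaves the operation unspecified, the additive fusion of F_1 with the decoder-branch convolution output, the sigmoid activation, and the product fusion of the two attention maps are assumptions of this sketch:

```python
import torch
import torch.nn as nn

class UPAM(nn.Module):
    """Sketch of the U-shaped pyramid attention module."""

    def __init__(self, ch):
        super().__init__()
        # Decoder branch (right): conv -> dilated conv, fused by a point convolution.
        self.d_conv = nn.Conv2d(ch, ch, 3, padding=1)
        self.d_dil = nn.Conv2d(ch, ch, 3, padding=2, dilation=2)
        self.d_point = nn.Conv2d(2 * ch, ch, 1)
        # Skip branch (left): conv -> (fuse with decoder conv) -> dilated conv,
        # then a point convolution over F2, F1 and F3.
        self.s_conv = nn.Conv2d(ch, ch, 3, padding=1)
        self.s_dil = nn.Conv2d(ch, ch, 3, padding=2, dilation=2)
        self.s_point = nn.Conv2d(3 * ch, ch, 1)
        self.act = nn.Sigmoid()

    def forward(self, f_skip, f_d):
        # Right branch: Att_D(F_D).
        d1 = self.d_conv(f_d)
        f3 = self.d_dil(d1)                                    # F_3
        att_d = self.act(self.d_point(torch.cat([d1, f3], dim=1)))
        # Left branch: Att_skip(F_skip).
        f1 = self.s_conv(f_skip)                               # F_1
        f2 = self.s_dil(f1 + d1)                               # F_2 (additive fusion assumed)
        att_skip = self.act(self.s_point(torch.cat([f2, f1, f3], dim=1)))
        # Fuse the two attention maps (product assumed) and re-weight the decoder input.
        return att_skip * att_d * f_d                          # F_Att = Att(F_skip, F_D) ⊙ F_D
```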
The model proposed by the invention was tested on the benchmark dataset UCSD Ped2. The network model was built on the PyTorch framework and optimized with the Adam optimizer, with hyper-parameters β_1 = 0.9, β_2 = 0.999 and an initial learning rate of 1×10⁻³. All training and testing procedures, including reconstruction-branch training and prediction-branch training, were performed on a 16-core, 32-thread machine with an AMD Ryzen 9 5950X 3.4 GHz CPU (64 GB RAM) and an RTX 3090 GPU (24 GB memory). λ_recon, λ_ent, λ_CVAE and λ_gd were set to 1.0, 2×10⁻⁴, 1.0 and 1.0, respectively. The batch size and the number of training epochs were set to 128 and 120. Furthermore, the fusion coefficients (ω_r, ω_p) of the reconstruction error and the prediction error were set to (1.0, 0.1) for the UCSD Ped2 dataset.
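An illustrative training-loop skeleton using these reported hyper-parameters; the function, model and data-loader names are placeholders, and the 1×10⁻³ learning rate reflects the value as corrected above:

```python
import torch

def train_branch(model, loss_fn, train_loader, epochs=120, device="cuda"):
    """Train one branch (reconstruction or prediction) with the reported settings."""
    model = model.to(device)
    # Adam with beta1 = 0.9, beta2 = 0.999 and an initial learning rate of 1e-3.
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))
    for _ in range(epochs):                    # 120 training epochs
        for batch in train_loader:             # DataLoader built with batch_size=128
            batch = batch.to(device)
            loss = loss_fn(model, batch)       # e.g. recon_branch_loss or predict_branch_loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```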
Performance testing was performed on the UCSD Ped2 test set after model training was completed. FIG. 6 is a visualization of predicted anomaly scores for anomaly videos on a UCSD Ped2 test set in accordance with the present invention; the detection result in the UCSD Ped2 test set shows that the invention can accurately detect the abnormal event in the complex multi-scene monitoring video, and verifies the time positioning capability of the invention on the abnormal event.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.

Claims (5)

1. The visual anomaly detection method based on time sequence space information enhancement is characterized by comprising the following steps of:
Step one: constructing an optical flow reconstructor and a future frame predictor with a jump-connected multi-level memory enhanced self-encoder;
The future frame predictor includes an optical flow encoder E_θ, an image-stream encoder and a decoder D_ψ; skip connections equipped with U-shaped pyramid attention mechanism modules are added between the image-stream encoder and the decoder D_ψ; the encoder E_θ and the image-stream encoder each comprise a point convolution module, serial depth separable residual blocks and dense hole downsampling modules; the decoder D_ψ comprises an upsampling module, serial depth separable residual blocks and a point convolution module; the input of the encoder E_θ is the reconstructed optical flow information ŷ_1:t, and the inputs of the image-stream encoder are the original input frames x_1:t and the reconstructed optical flow information ŷ_1:t; the output information of the encoder E_θ is concatenated with the sampled latent code z and then sent to the decoder D_ψ;
The serial depth separable residual block comprises a channel convolution layer I, a channel convolution layer II, a point convolution layer I and a point convolution layer II; the input end of the channel convolution layer I and the input end of the point convolution layer I are connected with the upper layer output end, the output end of the channel convolution layer I is connected with the input end of the channel convolution layer II, the output end of the channel convolution layer I is connected with the input end of the point convolution layer II after being fused with the output end of the channel convolution layer II, and the output end of the point convolution layer II and the output end of the point convolution layer I are connected with the lower layer input end;
The dense hole downsampling module comprises three stages, Stage 1, Stage 2 and Stage 3; first, the input x is downsampled a first time with the ordinary convolution of Stage 1 to obtain the scale-reduced feature map x_1; then, the second and third downsampling are carried out with the hole convolutions of Stage 2 and Stage 3 respectively to obtain the scale-reduced feature maps x_2 and x_3; thereafter, the feature maps x_1 and x_2 are fused to obtain a fusion result y_1, the feature maps x_2 and x_3 are fused to obtain a fusion result y_2, and the feature maps x_1, y_1 and y_2 are fused in a final stage to obtain an output feature map y;
The U-shaped pyramid attention mechanism module comprises a decoder branch on the right side and a skip-connection branch on the left side; the decoder branch on the right passes successively through a convolution layer and a hole convolution, the convolution-layer output and the hole-convolution output are fused by point convolution, and the output Att_D(F_D) is obtained after an activation function; the skip-connection branch on the left passes through a convolution layer to obtain F_1, which is fused with the output of the decoder-branch convolution layer and then passed through a hole convolution to obtain the output F_2; F_2, F_1 and the hole-convolution output F_3 of the decoder branch are fused along the channel dimension by point convolution, and the final output Att_skip(F_skip) is obtained after an activation function; finally, the UPAM output Att(F_skip, F_D) is multiplied element-wise by the decoder input F_D to obtain the final attention feature map F_Att;
step two: training an optical flow reconstructor by utilizing a mixed loss function combining the reconstructed pixel distance error and the memory module matching probability entropy loss;
step three: training a future frame predictor using a hybrid loss function that combines a predicted frame gaussian error and a gradient error;
step four: and detecting the video stream by using the trained optical flow reconstructor and the future frame predictor, and calculating an anomaly score according to the reconstruction error and the prediction error.
2. The method for visual anomaly detection based on temporal spatial information enhancement of claim 1, wherein the training optical flow reconstructor has a mixing loss function of:
L_recon-branch = λ_recon·L_recon + λ_ent·L_ent
wherein L_recon = ‖f − f̂‖₂² is the reconstruction loss function, f is the input and f̂ is the reconstructed output; L_ent = Σ_{m=1}^{M} Σ_{k=1}^{N} −w_k^m·log(w_k^m) is the loss function of the memory modules, M is the number of memory modules, N is the size of each memory module, w_k^m is the matching probability of the kth node in the mth memory module, and λ_recon, λ_ent are weights.
3. The method for visual anomaly detection based on temporal spatial information enhancement of claim 2, wherein the training future frame predictor's mixing loss function is:
L_predict-branch = λ_predict·L_predict + λ_gd·L_gd

wherein L_predict = ‖x_t+1 − x̂_t+1‖₂² + F_KL(q(z|x_1:t, y_1:t) ‖ p(z|y_1:t)) is the prediction loss, p(z|y_1:t) is the output distribution of the optical flow encoder, q(z|x_1:t, y_1:t) is the output distribution of the image-stream encoder, F_KL denotes the Kullback-Leibler divergence, y_1:t denotes the video optical flow, x_t+1 denotes the ground-truth future frame and x̂_t+1 denotes the predicted output video frame; L_gd = Σ_{i,j} ( | |x_i,j − x_i−1,j| − |x̂_i,j − x̂_i−1,j| | + | |x_i,j − x_i,j−1| − |x̂_i,j − x̂_i,j−1| | ) is the gradient loss, where i, j are the spatial coordinates of pixels in the image, x denotes the ground-truth future frame, x̂ denotes the predicted output video frame, x_i,j denotes a ground-truth pixel value and x̂_i,j denotes a predicted output pixel value; λ_predict, λ_gd are weights.
4. The visual anomaly detection method based on temporal spatial information enhancement according to claim 3, wherein the method of calculating anomaly score from reconstruction error and prediction error is:
S = ω_r·(S_r − μ_r)/σ_r + ω_p·(S_p − μ_p)/σ_p

wherein S is the anomaly score, S_r is the reconstruction error, S_p is the prediction error, ω_r is the weight of the reconstruction score, ω_p is the weight of the prediction score, μ_r is the mean of the reconstruction errors of all training samples, σ_r is the standard deviation of the reconstruction errors of all training samples, μ_p is the mean of the prediction errors of all training samples, and σ_p is the standard deviation of the prediction errors of all training samples.
5. The visual anomaly detection method based on temporal spatial information enhancement according to claim 4, wherein the expressions for the reconstruction error S_r and the prediction error S_p are, respectively:

S_r = ‖y_1:t − ŷ_1:t‖₂²,  S_p = ‖x_t+1 − x̂_t+1‖₂²

wherein ŷ_1:t denotes the reconstructed optical flow information and x̂_t+1 denotes the predicted output video frame.
CN202310510187.9A 2023-05-08 2023-05-08 Visual anomaly detection method based on time sequence spatial information enhancement Active CN116543335B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310510187.9A CN116543335B (en) 2023-05-08 2023-05-08 Visual anomaly detection method based on time sequence spatial information enhancement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310510187.9A CN116543335B (en) 2023-05-08 2023-05-08 Visual anomaly detection method based on time sequence spatial information enhancement

Publications (2)

Publication Number Publication Date
CN116543335A CN116543335A (en) 2023-08-04
CN116543335B (en) 2024-06-21

Family

ID=87448376

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310510187.9A Active CN116543335B (en) 2023-05-08 2023-05-08 Visual anomaly detection method based on time sequence spatial information enhancement

Country Status (1)

Country Link
CN (1) CN116543335B (en)

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109117774B (en) * 2018-08-01 2021-09-28 广东工业大学 Multi-view video anomaly detection method based on sparse coding
CN111666819B (en) * 2020-05-11 2022-06-14 武汉大学 High-precision video abnormal event detection method integrating multivariate information
CN113569756B (en) * 2021-07-29 2023-06-09 西安交通大学 Abnormal behavior detection and positioning method, system, terminal equipment and readable storage medium
CN114332053B (en) * 2021-12-31 2024-07-19 上海交通大学 Multi-mode two-stage unsupervised video anomaly detection method
CN114612836B (en) * 2022-03-15 2024-04-05 南京邮电大学 Monitoring video abnormity detection method based on memory-enhanced future video frame prediction
CN114821434A (en) * 2022-05-05 2022-07-29 西藏民族大学 Space-time enhanced video anomaly detection method based on optical flow constraint
CN114708620A (en) * 2022-05-10 2022-07-05 山东交通学院 Pedestrian re-identification method and system applied to unmanned aerial vehicle at aerial view angle
CN114973102B (en) * 2022-06-17 2024-09-27 南通大学 Video anomaly detection method based on multipath attention time sequence
CN115511650A (en) * 2022-09-22 2022-12-23 国网智联电商有限公司 Method and device for determining user propagation in crowd sensing task
CN115527150A (en) * 2022-10-31 2022-12-27 南京邮电大学 Dual-branch video anomaly detection method combined with convolution attention module
CN115830541A (en) * 2022-12-05 2023-03-21 桂林电子科技大学 Video abnormal event detection method based on double-current space-time self-encoder
CN115690665B (en) * 2023-01-03 2023-03-28 华东交通大学 Video anomaly detection method and device based on cross U-Net network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhian Liu et al.; "A Hybrid Video Anomaly Detection Framework via Memory-Augmented Flow Reconstruction and Flow-Guided Frame Prediction"; ICCV 2021; pp. 13588-13597 *

Also Published As

Publication number Publication date
CN116543335A (en) 2023-08-04

Similar Documents

Publication Publication Date Title
CN110189334B (en) Medical image segmentation method of residual error type full convolution neural network based on attention mechanism
CN109543740B (en) Target detection method based on generation countermeasure network
CN110458833B (en) Medical image processing method, medical device and storage medium based on artificial intelligence
CN112052763B (en) Video abnormal event detection method based on two-way review generation countermeasure network
CN113298789A (en) Insulator defect detection method and system, electronic device and readable storage medium
CN111612790A (en) Medical image segmentation method based on T-shaped attention structure
CN109583340A (en) A kind of video object detection method based on deep learning
CN112464851A (en) Smart power grid foreign matter intrusion detection method and system based on visual perception
CN115619743A (en) Construction method and application of OLED novel display device surface defect detection model
CN114549985B (en) Target detection method and system based on self-supervision contrast learning
CN116030538B (en) Weak supervision action detection method, system, equipment and storage medium
CN116797618A (en) Multi-stage segmentation method based on multi-mode MRI (magnetic resonance imaging) heart image
CN114677349B (en) Image segmentation method and system for enhancing edge information of encoding and decoding end and guiding attention
Kan et al. A GAN-based input-size flexibility model for single image dehazing
CN116543335B (en) Visual anomaly detection method based on time sequence spatial information enhancement
CN111667488B (en) Medical image segmentation method based on multi-angle U-Net
CN117115445A (en) Image invisible area completion method based on modeless instance segmentation
CN117788784A (en) Target detection method based on improved BiFPN network
US20230090941A1 (en) Processing video content using gated transformer neural networks
CN113962332B (en) Salient target identification method based on self-optimizing fusion feedback
CN115484456A (en) Video anomaly prediction method and device based on semantic clustering
CN116597503A (en) Classroom behavior detection method based on space-time characteristics
CN115527151A (en) Video anomaly detection method and system, electronic equipment and storage medium
CN115376178A (en) Unknown domain pedestrian re-identification method and system based on domain style filtering
CN111899161A (en) Super-resolution reconstruction method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant