CN116543335A - Visual anomaly detection method based on time sequence spatial information enhancement - Google Patents

Visual anomaly detection method based on time sequence spatial information enhancement

Info

Publication number
CN116543335A
CN116543335A CN202310510187.9A CN202310510187A CN116543335A CN 116543335 A CN116543335 A CN 116543335A CN 202310510187 A CN202310510187 A CN 202310510187A CN 116543335 A CN116543335 A CN 116543335A
Authority
CN
China
Prior art keywords
output
encoder
optical flow
convolution layer
error
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310510187.9A
Other languages
Chinese (zh)
Inventor
王霖
李名洋
王玮
柴志鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202310510187.9A priority Critical patent/CN116543335A/en
Publication of CN116543335A publication Critical patent/CN116543335A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/42Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a visual anomaly detection method based on time sequence spatial information enhancement, which addresses the slow operation speed of anomaly detection networks. The method comprises the following steps: firstly, constructing an optical flow reconstructor based on a multi-level memory enhanced self-encoder with jump connections, and a future frame predictor; secondly, training the optical flow reconstructor with a mixed loss function combining the reconstructed pixel distance error and the memory module matching probability entropy loss; thirdly, training the future frame predictor with a mixed loss function combining the predicted frame Gaussian error and the gradient error; and finally, detecting the video stream with the trained optical flow reconstructor and future frame predictor, and calculating an anomaly score from the reconstruction error and the prediction error. The method makes better use of spatial detail information, improves detection precision, and is suitable for abnormal event detection in complex and diverse scene surveillance videos.

Description

Visual anomaly detection method based on time sequence spatial information enhancement
Technical Field
The invention relates to the technical field of general computer image processing, in particular to a visual anomaly detection method based on time sequence space information enhancement.
Background
Video abnormal event detection technology is widely applied to fault early warning in the industrial, security and transportation fields, and aims to locate abnormal events in surveillance video spatially or temporally. Broadly, a video abnormal event is formally defined as "the appearance of an abnormal appearance or motion attribute, or the appearance of a normal appearance or motion attribute at an abnormal location or time". By this definition, abnormal events are rare, unusual and sporadic in nature, so acquiring abnormal event samples is difficult. To address this detection problem of abnormal events with ill-defined boundaries and few samples, researchers model normal event behavior by analyzing the motion features and spatio-temporal context features of normal event samples, and then use the model to judge whether an abnormal event occurs in a video. With the continued development of deep learning techniques, researchers have attempted to detect anomalies with unsupervised paradigms based on reconstruction or prediction. In deep-learning-based anomaly detection, the network first learns, during training, a model of the normal activity of the subjects in the scene from a set of given video sequences that contain no abnormal behavior. Then, in the detection stage, the network extracts the same features from the video sequence under test and computes an anomaly score for each video frame. Because such methods learn the features of normal event samples and model them end to end, abnormal event samples cannot be effectively reconstructed or predicted in future frames and therefore yield higher reconstruction or prediction errors; using these errors as the criterion, whether an abnormal event has occurred is judged against a suitable threshold.
Although reconstruction-based and prediction-based anomaly detection methods have made considerable progress, the following drawbacks remain: (1) Both rely on the ability of convolutional neural networks to extract video sequence features; some methods introduce motion feature information such as optical flow before the reconstruction or prediction stage, but without achieving good results. (2) Reconstruction-based methods assume that, because normal and abnormal samples differ in morphological features, motion features and spatio-temporal context information, a model trained only on normal samples will produce large reconstruction errors when reconstructing abnormal images. However, this assumption does not always hold: the diversity of normal and abnormal samples means that the normal events recorded in the training set are neither complete nor comprehensive, so abnormal samples may still be reconstructed well. (3) Prediction-based methods predict future frames from morphological features and spatio-temporal context features and, in the anomaly discrimination process, rely on an assumption similar to that of the reconstruction-based methods, namely that abnormal frames yield extremely high prediction errors. As a result, prediction-based methods ignore the contribution of motion features to anomaly discrimination while sharing drawbacks similar to those of reconstruction-based methods.
To solve the above problems, some researchers have tried to combine these two paradigms in a serial or parallel manner to construct hybrid reconstruction-and-prediction frameworks that further improve anomaly detection performance. Hybrid models effectively combine reconstruction and prediction: the parallel hybrid approach fuses reconstruction errors and prediction errors when computing the anomaly score in the discrimination stage, while the serial hybrid approach makes modeling of abnormal events in future frames more difficult by introducing reconstruction information into the prediction branch. However, whether based on reconstruction, prediction or a hybrid model, existing network designs do not give sufficient consideration to spatial detail information. These methods also place high demands on sample data; affected by the diversity of normal samples, they may still model anomalies effectively during training, which degrades model quality and is unfavorable for effectively distinguishing anomalies in the discrimination process.
Disclosure of Invention
Aiming at the problem of the slow operation speed of anomaly detection networks, the invention provides a visual anomaly detection method based on time sequence spatial information enhancement: a hybrid framework that combines optical flow reconstruction and optical-flow-guided future frame prediction in a serial manner. The optical-flow-guided future frame prediction model accepts previous video frames and optical flow as inputs simultaneously, but the optical flow used is not the original optical flow image; it is first reconstructed by an optical flow reconstruction model built on a multi-level memory enhanced self-encoder with jump connections (ML-MemAE-SC), and the reconstructed optical flow is then input to the future frame prediction model. The method makes better use of spatial detail information, improves detection precision, and is suitable for abnormal event detection in complex and diverse scene surveillance videos.
The technical scheme of the invention is realized as follows:
a visual anomaly detection method based on time sequence space information enhancement comprises the following steps:
step one: constructing an optical flow reconstructor and a future frame predictor with a jump-connected multi-level memory enhanced self-encoder;
step two: training an optical flow reconstructor by utilizing a mixed loss function combining the reconstructed pixel distance error and the memory module matching probability entropy loss;
step three: training a future frame predictor using a hybrid loss function that combines a predicted frame gaussian error and a gradient error;
step four: and detecting the video stream by using the trained optical flow reconstructor and the future frame predictor, and calculating an anomaly score according to the reconstruction error and the prediction error.
The future frame predictor includes an optical flow encoder E_θ, an image-stream encoder and a decoder D_ψ; jump connections with U-shaped pyramid attention mechanism modules are added between the image-stream encoder and the decoder D_ψ. The encoder E_θ and the image-stream encoder each comprise a point convolution module, serial depth separable residual blocks and dense dilated downsampling modules; the decoder D_ψ comprises upsampling modules, serial depth separable residual blocks and a point convolution module. The input of the encoder E_θ is the reconstructed optical flow information ŷ_{1:t}; the input of the image-stream encoder is the mixed information of the original input frames x_{1:t} and the reconstructed optical flow information ŷ_{1:t}. The output information of the encoder E_θ is concatenated with the latent output z of the image-stream encoder and sent to the decoder D_ψ.
The mixed loss function for training the optical flow reconstructor is:
L_recon-branch = λ_recon · L_recon + λ_ent · L_ent
where L_recon = ||f − f̂||_2^2 is the reconstruction loss, f is the input and f̂ the reconstructed output; L_ent = Σ_{m=1}^{M} Σ_{k=1}^{N} −w_k^m · log(w_k^m) is the loss function of the memory modules, M is the number of memory modules, N is the size of each memory module and w_k^m is the matching probability of the k-th node in the m-th memory module; λ_recon and λ_ent are weights.
The mixed loss function for training the future frame predictor is:
L_pred-branch = λ_predict · L_CVAE + λ_gd · L_gd
where L_CVAE = ||x_{t+1} − x̂_{t+1}||_2^2 + F_KL( q(z|x_{1:t}, y_{1:t}) || p(z|y_{1:t}) ) is the prediction loss, p(z|y_{1:t}) is the output distribution of the optical flow encoder, q(z|x_{1:t}, y_{1:t}) is the output distribution of the image-stream encoder, F_KL denotes the Kullback-Leibler divergence, y_{1:t} denotes the video optical flow, x_{t+1} denotes the ground-truth future frame and x̂_{t+1} the predicted output video frame; L_gd = Σ_{i,j} ( | |x_{i,j} − x_{i−1,j}| − |x̂_{i,j} − x̂_{i−1,j}| | + | |x_{i,j} − x_{i,j−1}| − |x̂_{i,j} − x̂_{i,j−1}| | ) is the gradient loss, i, j are the spatial coordinates of pixels in the image, x denotes the ground-truth future frame, x̂ the predicted output video frame, x_{i,j} a ground-truth pixel value of the future frame and x̂_{i,j} the corresponding predicted pixel value; λ_predict and λ_gd are weights.
The anomaly score is calculated from the reconstruction error and the prediction error as:
S = ω_r · (S_r − μ_r)/σ_r + ω_p · (S_p − μ_p)/σ_p
where S is the anomaly score, S_r the reconstruction error, S_p the prediction error, ω_r the weight of the reconstruction score, ω_p the weight of the prediction score, μ_r the mean of the reconstruction errors of all training samples, σ_r the standard deviation of the reconstruction errors of all training samples, μ_p the mean of the prediction errors of all training samples, and σ_p the standard deviation of the prediction errors of all training samples.
The reconstruction error S_r and the prediction error S_p are expressed respectively as:
S_r = ||y_{1:t} − ŷ_{1:t}||_2^2,  S_p = ||x_{t+1} − x̂_{t+1}||_2^2
where ŷ_{1:t} denotes the reconstructed optical flow information and x̂_{t+1} denotes the predicted output video frame.
The serial depth separable residual block comprises a channel convolution layer I, a channel convolution layer II, a point convolution layer I and a point convolution layer II; the input end of the channel convolution layer I and the input end of the point convolution layer I are connected with the upper layer output end, the output end of the channel convolution layer I is connected with the input end of the channel convolution layer II, the output end of the channel convolution layer I is fused with the output end of the channel convolution layer II and then is connected with the input end of the point convolution layer II, and the output end of the point convolution layer II and the output end of the point convolution layer I are connected with the lower layer input end.
The dense dilated downsampling module comprises three stages: Stage 1, Stage 2 and Stage 3. First, the input x is downsampled for the first time by the ordinary convolution of Stage 1 to obtain the downscaled feature map x_1; then, the second and third downsampling operations are performed with the dilated convolutions of Stage 2 and Stage 3 respectively, to obtain the downscaled feature maps x_2 and x_3; after that, the feature maps x_1 and x_2 are fused to obtain the fusion result y_1, and the feature maps x_2 and x_3 are fused to obtain the fusion result y_2; in the final stage, x_1, y_1 and y_2 are fused to obtain the output feature map y.
The U-shaped pyramid attention mechanism module comprises a decoder-input branch on the right and a jump-connection branch on the left. The right branch passes sequentially through a convolution layer and a dilated convolution; the convolution-layer output and the dilated-convolution output are fused by a point convolution, and the output Att_D(F_D) is obtained after an activation function. The left jump-connection branch passes through a convolution layer to obtain F_1, which is fused with the output of the right branch's convolution layer and then passed through a dilated convolution to obtain the output F_2. The three features F_2, F_1 and the dilated-convolution output F_3 of the right branch are fused along the channel dimension by a point convolution, and the final output Att_skip(F_skip) is obtained after an activation function. Finally, the output Att_UPAM(F_skip, F_D) of the UPAM is multiplied element-wise with the decoder input F_D to obtain the final attention feature map F_Att.
Compared with the prior art, the invention has the beneficial effects that:
(1) Aiming at the problem of the slow operation speed of anomaly detection networks, a Serial depth separable residual Block (Serial Block) is designed as the backbone of the future frame prediction network, which effectively reduces the computation and parameter count of the convolution operations and improves the operation speed while maintaining performance.
(2) To ensure that rich and effective spatial detail information is retained during downsampling in the prediction network encoder, a dense dilated downsampling module (DRSM) is designed. The module adopts a stepped structure and makes full use of multi-scale spatial correlation information during downsampling, ensuring that the downscaled feature map retains more effective spatial detail information.
(3) Because the U-shaped structure of the future frame prediction branch contains many jump connections, some levels carry a certain amount of interfering information as it is passed down from the encoding stage. To solve this problem, a U-shaped pyramid attention module (UPAM) is designed; by extracting interaction information between the jump-connection input and the original decoder input within the attention module, the network is guided to retain richer and more effective spatial features during feature fusion, thereby improving the quality of future frame modeling.
(4) The invention can accurately detect the abnormal events in the complex and various scene monitoring videos.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of the present invention.
Fig. 2 is a block diagram of a future frame predictor network of the present invention.
Fig. 3 is a Serial depth separable residual Block (Serial Block) structure diagram of the present invention.
Fig. 4 is a structural diagram of the dense dilated downsampling module (DRSM) of the present invention.
Fig. 5 is a diagram of the U-shaped pyramid attention mechanism module (UPAM) of the present invention.
FIG. 6 is an anomaly score plot of a test video on the UCSD Ped2 dataset according to the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without any inventive effort, are intended to be within the scope of the invention.
As shown in fig. 1, an embodiment of the present invention provides a visual anomaly detection method based on temporal spatial information enhancement, which includes two components: (1) a multi-level memory enhanced self-encoder with jump connections (ML-MemAE-SC), i.e. the optical flow reconstructor, which reconstructs the optical flow image; the reconstructed optical flow image and the original image are then input together into the future frame predictor for future frame prediction. (2) A conditional variational self-encoder (CVAE) guided by the reconstructed optical flow, i.e. the future frame predictor, which predicts future frames from the original image frames; the optical flow reconstruction information introduced during prediction enlarges the prediction-error gap between abnormal and normal states, further improving detection precision. The specific steps are as follows:
step one: constructing an optical flow reconstructor and a future frame predictor with a jump-connected multi-level memory enhanced self-encoder; the optical flow reconstructor network adopts three-stage memory storage units, and directly transmits the encoded information to the decoder by utilizing jump connection, so as to provide more information for the next-stage memory storage unit to keep a more comprehensive and complete normal mode. The memory module functions to represent the characteristics input to it by a weighted sum of similar memory slots and thus has the ability to memorize the normal mode when training the normal data. Each memory module is actually a matrixIncluding N real vectors in the fixed dimension C. Each row of the matrix is called a slot m i Where i=1, 2,3,..n. Memory size n=2000. The method can effectively solve the problems that a single memory module cannot load all normal modes and a plurality of memory modules are cascaded to cause excessive loading and filtering. No hopping connection is added at the outermost layer of the network. If the connection uses the outermost layer, the reconstruction may be completed by the highest-level encoding and decoding information, and the effect of the normal mode information stored by all other lower-level memory modules is greatly weakened or even disabled, so that all other lower-level encoding, decoding and memory blocks cannot work. In order to further reduce the number of network model parameters, all convolution layers of the reconstruction network are replaced by Serial blocks, so that a lightweight version is formed.
The detailed structure of the future frame predictor network is shown in fig. 2. The future frame predictor network model is composed of connections of the following 5 main modules: (1) the Serial depth separable residual Block (Serial Block); (2) the dense dilated downsampling module (DRSM); (3) the U-shaped pyramid attention mechanism module (UPAM); (4) the upsampling module (Upsample); (5) the point convolution module (Pointwise).
Each block in the future frame predictor network represents a corresponding module. As shown in fig. 2, there are two encoders sharing a similar architecture, the optical flow encoder E_θ and the image-stream encoder, and one decoder D_ψ. Jump connections are added between the image-stream encoder and D_ψ to help generate x_{t+1}. The backbone of the future frame predictor network is implemented with Serial Blocks, and the downsampling and upsampling layers are implemented with the DRSM and the Upsample module respectively. The model contains 4 levels in total, and the feature map sizes corresponding to the levels are (32, 32, 64), (16, 16, 128), (8, 8, 128) and (4, 4, 128) respectively. The output of E_θ is concatenated with the sampled latent variable z and sent to the decoder D_ψ. The last two bottleneck layers are used to estimate the distributions and to sample from them, and they share the same layer settings. The image-stream encoder processes the mixed input of the original input frames x_{1:t} and the reconstructed optical flow information ŷ_{1:t}, while E_θ processes the reconstructed optical flow information ŷ_{1:t}. Features are fused in the bottleneck layer, and the posterior distribution q(z|x_{1:t}, y_{1:t}) and the prior distribution p(z|y_{1:t}) are then used by the decoder D_ψ to model the predicted future output frame x̂_{t+1}.
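For orientation, a compact PyTorch sketch of the two-encoder/one-decoder conditional variational layout described above is given below. The plain convolutional layers, the diagonal-Gaussian re-parameterisation and the omission of the Serial Block, DRSM and UPAM components are simplifying assumptions; only the overall data flow (flow encoder as prior, frame-plus-flow encoder as posterior, concatenation of the flow features with the sampled z before decoding) follows the description.

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):
    def __init__(self, in_ch, z_ch=128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, z_ch, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.mu = nn.Conv2d(z_ch, z_ch, 1)       # bottleneck: mean
        self.logvar = nn.Conv2d(z_ch, z_ch, 1)   # bottleneck: log-variance

    def forward(self, x):
        h = self.body(x)
        return h, self.mu(h), self.logvar(h)

class TinyPredictor(nn.Module):
    """E_theta sees the reconstructed flow; the image-stream encoder sees frames + flow."""
    def __init__(self, frame_ch, flow_ch, z_ch=128):
        super().__init__()
        self.enc_flow = TinyEncoder(flow_ch, z_ch)             # optical flow encoder (prior)
        self.enc_mix = TinyEncoder(frame_ch + flow_ch, z_ch)   # image-stream encoder (posterior)
        self.dec = nn.Sequential(                              # decoder D_psi
            nn.ConvTranspose2d(2 * z_ch, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),
        )

    def forward(self, frames, rec_flow):
        h_flow, mu_p, logvar_p = self.enc_flow(rec_flow)                        # p(z | y_{1:t})
        _, mu_q, logvar_q = self.enc_mix(torch.cat([frames, rec_flow], dim=1))  # q(z | x_{1:t}, y_{1:t})
        z = mu_q + torch.exp(0.5 * logvar_q) * torch.randn_like(mu_q)           # sample z
        x_hat = self.dec(torch.cat([h_flow, z], dim=1))                         # predicted future frame
        return x_hat, (mu_q, logvar_q), (mu_p, logvar_p)
```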
Step two: training the optical flow reconstructor with a mixed loss function combining the reconstructed pixel distance error and the memory-module matching-probability entropy loss. When training the optical flow reconstruction network, the optical flow map is used as the input for model training. With the input denoted f and the reconstructed output denoted f̂, training the reconstruction network amounts to minimizing the l_2 distance between input and output, so the reconstruction loss function can be expressed as:
L_recon = ||f − f̂||_2^2
In addition, we add an entropy loss on the matching probabilities of each memory module of the optical flow reconstruction network; the loss function of the memory modules can be expressed as:
L_ent = Σ_{m=1}^{M} Σ_{k=1}^{N} −w_k^m · log(w_k^m)
where M is the number of memory modules, N is the size of each memory module, and w_k^m is the matching probability of the k-th node in the m-th memory module. These two loss functions are fused by weighting for the final reconstruction-branch training, as follows:
L_recon-branch = λ_recon · L_recon + λ_ent · L_ent
where λ_recon and λ_ent are weights.
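A minimal sketch of this mixed reconstruction-branch loss in PyTorch, assuming the reconstructor returns both the reconstructed flow and a list of per-module matching-probability matrices (one row of probabilities per addressed location), could look as follows; the default weights follow the embodiment reported later.

```python
import torch

def recon_branch_loss(flow, flow_hat, match_probs, lam_recon=1.0, lam_ent=2e-4, eps=1e-12):
    """L_recon-branch = lam_recon * L_recon + lam_ent * L_ent."""
    l_recon = torch.mean((flow - flow_hat) ** 2)                 # l2 pixel distance error
    l_ent = sum((-w * torch.log(w + eps)).sum(dim=1).mean()      # entropy of the matching
                for w in match_probs)                            # probabilities, summed over modules
    return lam_recon * l_recon + lam_ent * l_ent
```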
Step three: training the future frame predictor with a mixed loss function combining the predicted-frame Gaussian error and the gradient error. When training the future frame prediction network, the video frames and the reconstructed optical flow are used as inputs. The future frame prediction network comprises two encoders and one decoder, and the outputs of the three parts are considered jointly during training. With the output distribution of the optical flow encoder denoted p(z|y_{1:t}) and the output distribution of the image-stream encoder denoted q(z|x_{1:t}, y_{1:t}), the prediction loss of the future frame prediction network can be defined as:
L_CVAE = ||x_{t+1} − x̂_{t+1}||_2^2 + F_KL( q(z|x_{1:t}, y_{1:t}) || p(z|y_{1:t}) )
where F_KL denotes the Kullback-Leibler divergence, x_{t+1} denotes the ground truth of the future frame, x̂_{t+1} denotes the predicted output video frame, and y_{1:t} denotes the video optical flow.
Furthermore, a gradient loss is introduced, as follows:
L_gd = Σ_{i,j} ( | |x_{i,j} − x_{i−1,j}| − |x̂_{i,j} − x̂_{i−1,j}| | + | |x_{i,j} − x_{i,j−1}| − |x̂_{i,j} − x̂_{i,j−1}| | )
where i, j are the spatial coordinates of the pixels in the image, x denotes the ground-truth future frame, x̂ the predicted output video frame, x_{i,j} a ground-truth pixel value of the future frame, and x̂_{i,j} the corresponding predicted pixel value.
The prediction loss and the gradient loss are mixed by weighted fusion to obtain the total loss function used to train the prediction network, which can be expressed as:
L_pred-branch = λ_predict · L_CVAE + λ_gd · L_gd
where λ_predict and λ_gd are weights.
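The prediction-branch loss can be sketched in PyTorch as below; the closed-form KL divergence between two diagonal Gaussians and the mean (rather than sum) reductions are assumptions made for compactness.

```python
import torch

def kl_divergence(mu_q, logvar_q, mu_p, logvar_p):
    """KL( q || p ) for diagonal Gaussians, averaged over all elements."""
    return 0.5 * torch.mean(
        logvar_p - logvar_q
        + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp() - 1.0
    )

def gradient_loss(x, x_hat):
    """Difference of absolute horizontal/vertical image gradients."""
    dx, dx_hat = (x[..., :, 1:] - x[..., :, :-1]).abs(), (x_hat[..., :, 1:] - x_hat[..., :, :-1]).abs()
    dy, dy_hat = (x[..., 1:, :] - x[..., :-1, :]).abs(), (x_hat[..., 1:, :] - x_hat[..., :-1, :]).abs()
    return (dx - dx_hat).abs().mean() + (dy - dy_hat).abs().mean()

def pred_branch_loss(x, x_hat, q_stats, p_stats, lam_pred=1.0, lam_gd=1.0):
    """L_pred-branch = lam_pred * L_CVAE + lam_gd * L_gd."""
    l_cvae = torch.mean((x - x_hat) ** 2) + kl_divergence(*q_stats, *p_stats)
    return lam_pred * l_cvae + lam_gd * gradient_loss(x, x_hat)
```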
Step four: detecting the video stream with the trained optical flow reconstructor and future frame predictor, and calculating an anomaly score from the reconstruction error and the prediction error; the weighted fusion of the two errors in the anomaly score improves the anomaly discrimination performance.
In the test phase, the network makes the anomaly judgment using the reconstruction error, i.e. the difference between y_{1:t} and ŷ_{1:t}, and the prediction error, i.e. the difference between x_{t+1} and x̂_{t+1}. The quality gap between normal and abnormal optical flow reconstructed by the optical flow reconstructor is exploited to improve the detection precision of the future frame predictor: a reconstructed abnormal optical flow is typically of lower quality, leading to a future frame with a larger prediction error, whereas a reconstructed normal optical flow is typically of higher quality, so the prediction module can successfully predict the future frame with a smaller prediction error. The optical flow reconstruction error and the future frame prediction error are used as the final anomaly detection score.
In the final anomaly detection stage, the video is detected with the network models trained in the previous steps, and the reconstruction error S_r and the prediction error S_p are combined by weighted fusion to obtain the final anomaly score S, which can be expressed as:
S = ω_r · (S_r − μ_r)/σ_r + ω_p · (S_p − μ_p)/σ_p
where ω_r is the weight of the reconstruction score, ω_p the weight of the prediction score, μ_r the mean of the reconstruction errors of all training samples, σ_r the standard deviation of the reconstruction errors of all training samples, μ_p the mean of the prediction errors of all training samples, and σ_p the standard deviation of the prediction errors of all training samples. The reconstruction error S_r and the prediction error S_p are expressed respectively as:
S_r = ||y_{1:t} − ŷ_{1:t}||_2^2,  S_p = ||x_{t+1} − x̂_{t+1}||_2^2
where ŷ_{1:t} denotes the reconstructed optical flow information and x̂_{t+1} denotes the predicted output video frame.
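A minimal sketch of this scoring step, using the fusion weights reported for the UCSD Ped2 embodiment and the squared-error terms assumed above, is given below.

```python
import torch

def errors(flow, flow_hat, frame, frame_hat):
    s_r = torch.mean((flow - flow_hat) ** 2).item()     # reconstruction error S_r
    s_p = torch.mean((frame - frame_hat) ** 2).item()   # prediction error S_p
    return s_r, s_p

def anomaly_score(s_r, s_p, mu_r, sigma_r, mu_p, sigma_p, w_r=1.0, w_p=0.1):
    """Weighted, z-normalised fusion; mu/sigma are computed once over training-set errors."""
    return w_r * (s_r - mu_r) / sigma_r + w_p * (s_p - mu_p) / sigma_p
```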
The present invention proposes a Serial depth separable residual Block (Serial Block). By extracting features with channel-by-channel convolution and point convolution and combining residual and dense connections, richer spatial features can be extracted and feature reuse can be performed effectively. The Serial Block structure is shown in fig. 3. The trunk comprises two layers of channel-by-channel convolution and one layer of point convolution: richer two-dimensional spatial detail is obtained by stacking the two channel-by-channel convolutions, and channel fusion is performed by channel-dimension concatenation before the point convolution, achieving feature reuse. On the outer path, the channel difference between input and output is balanced by one layer of point convolution, and a residual term is introduced into the final output, which preserves spatial detail and alleviates the vanishing-gradient and exploding-gradient problems during training. Let the input image be denoted x, let y_i denote the output features of the convolution layers, and let C(·), D(·) and P(·) denote ordinary convolution, channel-by-channel convolution and point convolution respectively; the mathematical expression of the Serial Block is:
y_1 = D(x), y_2 = D(y_1), y = P([y_1, y_2]) + P(x)
where [·] denotes channel-dimension concatenation.
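A PyTorch sketch of a Serial Block consistent with this description is shown below; the kernel sizes, activation placement and channel counts are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SerialBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.dw1 = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch)  # channel conv I (depthwise)
        self.dw2 = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch)  # channel conv II (depthwise)
        self.pw_main = nn.Conv2d(2 * in_ch, out_ch, 1)                  # point conv II (after concat)
        self.pw_skip = nn.Conv2d(in_ch, out_ch, 1)                      # point conv I (residual path)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        y1 = self.act(self.dw1(x))
        y2 = self.act(self.dw2(y1))
        main = self.pw_main(torch.cat([y1, y2], dim=1))   # feature reuse via channel concatenation
        return self.act(main + self.pw_skip(x))           # residual term balances channel difference
```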
the invention provides an intensive hole downsampling module (DRSM). The module introduces a hole convolution to combine multi-scale space information in the downsampling process to guide the downsampling process of the feature map. The DRSM module structure is shown in figure 4.
The DRSM contains three stages, Stage 1, Stage 2 and Stage 3. Stage 1 performs downscaling with an ordinary convolution; Stage 2 and Stage 3 both perform downscaling with dilated convolutions, with the dilation rates set to 2 and 4 respectively. By introducing two stages of dilated convolution, multi-scale spatial information is acquired during downsampling. In the downsampling process, the module first downsamples the input x with the ordinary convolution to obtain the downscaled feature map x_1; then the second and third downsampling operations are performed with the dilated convolutions of rate 2 and rate 4 respectively, obtaining the downscaled feature maps x_2 and x_3. These three downsampling operations correspond to the three stages. After that, the feature maps x_1 and x_2 are fused to obtain the fusion result y_1, the feature maps x_2 and x_3 are fused to obtain the fusion result y_2, and in the final stage x_1, y_1 and y_2 are fused to obtain the output feature map y. Richer and more effective spatial information is preserved through the multi-level fusion of multi-scale features. The downsampling process of the DRSM module can be expressed as:
y_1 = P*([x_1, x_2]), y_2 = P*([x_2, x_3]), S_out = P*([x_1, y_1, y_2])
where P*(·) denotes the point convolution of the fusion layer, x_i (i = 1, 2, 3) denote the outputs of the three stages, [·] denotes the channel-dimension concatenation operation, and S_out denotes the downscaled output feature map.
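The following PyTorch sketch mirrors the staged fusion above. Whether Stages 2 and 3 operate on the original input or on the previous stage's output is not fully specified here, so the sketch assumes all three strided stages read the original input, which keeps x_1, x_2 and x_3 at the same scale for fusion.

```python
import torch
import torch.nn as nn

class DRSM(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.stage1 = nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1)               # ordinary conv
        self.stage2 = nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=2, dilation=2)   # dilation rate 2
        self.stage3 = nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=4, dilation=4)   # dilation rate 4
        self.fuse12 = nn.Conv2d(2 * out_ch, out_ch, 1)    # P* for [x1, x2]
        self.fuse23 = nn.Conv2d(2 * out_ch, out_ch, 1)    # P* for [x2, x3]
        self.fuse_out = nn.Conv2d(3 * out_ch, out_ch, 1)  # P* for [x1, y1, y2]

    def forward(self, x):
        x1, x2, x3 = self.stage1(x), self.stage2(x), self.stage3(x)
        y1 = self.fuse12(torch.cat([x1, x2], dim=1))
        y2 = self.fuse23(torch.cat([x2, x3], dim=1))
        return self.fuse_out(torch.cat([x1, y1, y2], dim=1))   # S_out
```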
The invention proposes a U-shaped pyramid attention mechanism module (UPAM). Dilated convolution is introduced to increase the extraction of inter-region correlation information and further optimize the attention weight matrix. The structure of the UPAM is shown in fig. 5; it comprises two structurally similar parallel branches. The input of the left branch is the jump-connection input F_skip, i.e. the output of the corresponding level of the encoder; the input of the right branch is the input F_D of the decoder at the corresponding level.
The right branch first passes through a convolution layer and then through a dilated convolution with dilation rate 2; the convolution output and the dilated-convolution output are fused by a point convolution, and the output Att_D(F_D) is obtained after an activation function. The left jump-connection branch passes through a convolution layer to obtain F_1, which is fused with the output of the right branch's convolution layer and then passed through a dilated convolution with dilation rate 2 to obtain the output F_2. The three features F_2, F_1 and the dilated-convolution output F_3 of the right branch are fused along the channel dimension by a point convolution, and the final output Att_skip(F_skip) is obtained after an activation function. Finally, the output Att_UPAM(F_skip, F_D) of the UPAM is multiplied element-wise with the decoder input F_D to obtain the final attention feature map F_Att. The complete process of the UPAM can be expressed as:
Att_D(F_D) = σ( P([C(F_D), F_3]) ), with F_3 = D(C(F_D))
F_1 = C(F_skip), F_2 = D( F_1 ⊕ C(F_D) ), Att_skip(F_skip) = σ( P([F_2, F_1, F_3]) )
F_Att = Att_UPAM(F_skip, F_D) ⊙ F_D
where Att_UPAM(F_skip, F_D) is the fusion of Att_skip(F_skip) and Att_D(F_D); C(·), D(·), P(·) and σ(·) denote the convolution layer, the dilated convolution, the point convolution and the activation function respectively, [·] denotes channel-dimension concatenation, ⊕ denotes feature fusion and ⊙ denotes element-wise multiplication.
While acquiring multi-scale information through the dilated convolutions and the jump connection within the branches, the UPAM also captures intra-region correlation information. The jump-connection input F_skip of the left branch does not undergo the transposition and serialization operations of the bottleneck layer, so it still retains certain detail information. The interaction between the left and right branches further guides the decoder branch to retain spatial-detail weights. In the final stage, the weight values of the two branches are fused to form the attention weight matrix, which optimizes the spatial detail of the future frame modeling process in the form of an element-wise product.
The model proposed by the invention was tested on the benchmark dataset UCSD Ped2. The network model is built on the PyTorch framework and optimized with the Adam optimizer; the hyperparameters are set to β_1 = 0.9 and β_2 = 0.999, and the initial learning rate is set to 1×10^-3. All training and testing procedures, including reconstruction-branch training and prediction-branch training, were performed on a 16-core 32-thread machine with an AMD Ryzen 9 5950X 3.4 GHz CPU (64 GB RAM) and an RTX 3090 GPU (24 GB memory). λ_recon, λ_ent, λ_CVAE and λ_gd are set to 1.0, 2e-4, 1.0 and 1.0 respectively. The batch size and the number of training epochs are uniformly set to 128 and 120. Furthermore, the fusion coefficients (ω_r, ω_p) of the reconstruction error and the prediction error are set to (1.0, 0.1) for the UCSD Ped2 dataset.
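The reported training configuration can be summarised in PyTorch as follows; the placeholder model is illustrative only.

```python
import torch

model = torch.nn.Conv2d(3, 3, 3, padding=1)           # placeholder for reconstructor/predictor
optimizer = torch.optim.Adam(model.parameters(),
                             lr=1e-3,                  # initial learning rate
                             betas=(0.9, 0.999))       # beta1, beta2
batch_size, epochs = 128, 120
lambda_recon, lambda_ent, lambda_cvae, lambda_gd = 1.0, 2e-4, 1.0, 1.0
omega_r, omega_p = 1.0, 0.1                            # error fusion weights for UCSD Ped2
```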
Performance testing was performed on the UCSD Ped2 test set after model training was completed. FIG. 6 is a visualization of predicted anomaly scores for anomaly videos on a UCSD Ped2 test set in accordance with the present invention; the detection result in the UCSD Ped2 test set shows that the invention can accurately detect the abnormal event in the complex multi-scene monitoring video, and verifies the time positioning capability of the invention on the abnormal event.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.

Claims (9)

1. The visual anomaly detection method based on time sequence space information enhancement is characterized by comprising the following steps of:
step one: constructing an optical flow reconstructor and a future frame predictor with a jump-connected multi-level memory enhanced self-encoder;
step two: training an optical flow reconstructor by utilizing a mixed loss function combining the reconstructed pixel distance error and the memory module matching probability entropy loss;
step three: training a future frame predictor using a hybrid loss function that combines a predicted frame gaussian error and a gradient error;
step four: and detecting the video stream by using the trained optical flow reconstructor and the future frame predictor, and calculating an anomaly score according to the reconstruction error and the prediction error.
2. The method for visual anomaly detection based on temporal spatial information enhancement according to claim 1, wherein the future frame predictor comprises an optical flow encoder E_θ, an image-stream encoder and a decoder D_ψ; jump connections with U-shaped pyramid attention mechanism modules are added between the image-stream encoder and the decoder D_ψ; the encoder E_θ and the image-stream encoder each comprise a point convolution module, serial depth separable residual blocks and dense dilated downsampling modules; the decoder D_ψ comprises upsampling modules, serial depth separable residual blocks and a point convolution module; the input of the encoder E_θ is the reconstructed optical flow information ŷ_{1:t}; the input of the image-stream encoder is the mixed information of the original input frames x_{1:t} and the reconstructed optical flow information ŷ_{1:t}; and the output information of the encoder E_θ is concatenated with the latent output z of the image-stream encoder and sent to the decoder D_ψ.
3. The method for visual anomaly detection based on temporal spatial information enhancement of claim 2, wherein the mixed loss function for training the optical flow reconstructor is:
L_recon-branch = λ_recon · L_recon + λ_ent · L_ent
wherein L_recon = ||f − f̂||_2^2 is the reconstruction loss, f is the input and f̂ the reconstructed output; L_ent = Σ_{m=1}^{M} Σ_{k=1}^{N} −w_k^m · log(w_k^m) is the loss function of the memory modules, M is the number of memory modules, N is the size of each memory module and w_k^m is the matching probability of the k-th node in the m-th memory module; λ_recon and λ_ent are weights.
4. A method of visual anomaly detection based on temporal spatial information enhancement as claimed in claim 3, wherein the mixed loss function for training the future frame predictor is:
L_pred-branch = λ_predict · L_CVAE + λ_gd · L_gd
wherein L_CVAE = ||x_{t+1} − x̂_{t+1}||_2^2 + F_KL( q(z|x_{1:t}, y_{1:t}) || p(z|y_{1:t}) ) is the prediction loss, p(z|y_{1:t}) is the output distribution of the optical flow encoder, q(z|x_{1:t}, y_{1:t}) is the output distribution of the image-stream encoder, F_KL denotes the Kullback-Leibler divergence, y_{1:t} denotes the video optical flow, x_{t+1} denotes the ground-truth future frame and x̂_{t+1} the predicted output video frame; L_gd = Σ_{i,j} ( | |x_{i,j} − x_{i−1,j}| − |x̂_{i,j} − x̂_{i−1,j}| | + | |x_{i,j} − x_{i,j−1}| − |x̂_{i,j} − x̂_{i,j−1}| | ) is the gradient loss, i, j are the spatial coordinates of pixels in the image, x denotes the ground-truth future frame, x̂ the predicted output video frame, x_{i,j} a ground-truth pixel value of the future frame and x̂_{i,j} the corresponding predicted pixel value; λ_predict and λ_gd are weights.
5. The visual anomaly detection method based on temporal spatial information enhancement according to claim 4, wherein the anomaly score is calculated from the reconstruction error and the prediction error as:
S = ω_r · (S_r − μ_r)/σ_r + ω_p · (S_p − μ_p)/σ_p
wherein S is the anomaly score, S_r the reconstruction error, S_p the prediction error, ω_r the weight of the reconstruction score, ω_p the weight of the prediction score, μ_r the mean of the reconstruction errors of all training samples, σ_r the standard deviation of the reconstruction errors of all training samples, μ_p the mean of the prediction errors of all training samples, and σ_p the standard deviation of the prediction errors of all training samples.
6. The method for visual anomaly detection based on temporal spatial information enhancement according to claim 5, wherein the reconstruction error S_r and the prediction error S_p are expressed respectively as:
S_r = ||y_{1:t} − ŷ_{1:t}||_2^2,  S_p = ||x_{t+1} − x̂_{t+1}||_2^2
wherein ŷ_{1:t} denotes the reconstructed optical flow information and x̂_{t+1} denotes the predicted output video frame.
7. The visual anomaly detection method based on temporal spatial information enhancement according to claim 2, wherein the serial depth separable residual block comprises a channel convolution layer I, a channel convolution layer II, a point convolution layer I, and a point convolution layer II; the input end of the channel convolution layer I and the input end of the point convolution layer I are connected with the upper layer output end, the output end of the channel convolution layer I is connected with the input end of the channel convolution layer II, the output end of the channel convolution layer I is fused with the output end of the channel convolution layer II and then is connected with the input end of the point convolution layer II, and the output end of the point convolution layer II and the output end of the point convolution layer I are connected with the lower layer input end.
8. The visual anomaly detection method based on temporal spatial information enhancement according to claim 2, wherein the dense dilated downsampling module comprises three stages, Stage 1, Stage 2 and Stage 3; first, the input x is downsampled for the first time by the ordinary convolution of Stage 1 to obtain the downscaled feature map x_1; then, the second and third downsampling operations are performed with the dilated convolutions of Stage 2 and Stage 3 respectively, to obtain the downscaled feature maps x_2 and x_3; after that, the feature maps x_1 and x_2 are fused to obtain the fusion result y_1, and the feature maps x_2 and x_3 are fused to obtain the fusion result y_2; in the final stage, x_1, y_1 and y_2 are fused to obtain the output feature map y.
9. The method for visual anomaly detection based on temporal spatial information enhancement according to claim 2, wherein the U-shaped pyramid attention mechanism module comprises a decoder-input branch on the right and a jump-connection branch on the left; the right branch passes sequentially through a convolution layer and a dilated convolution, the convolution-layer output and the dilated-convolution output are fused by a point convolution, and the output Att_D(F_D) is obtained after an activation function; the left jump-connection branch passes through a convolution layer to obtain F_1, which is fused with the output of the right branch's convolution layer and then passed through a dilated convolution to obtain the output F_2; the three features F_2, F_1 and the dilated-convolution output F_3 of the right branch are fused along the channel dimension by a point convolution, and the final output Att_skip(F_skip) is obtained after an activation function; finally, the output Att_UPAM(F_skip, F_D) of the UPAM is multiplied element-wise with the decoder input F_D to obtain the final attention feature map F_Att.
CN202310510187.9A 2023-05-08 2023-05-08 Visual anomaly detection method based on time sequence spatial information enhancement Pending CN116543335A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310510187.9A CN116543335A (en) 2023-05-08 2023-05-08 Visual anomaly detection method based on time sequence spatial information enhancement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310510187.9A CN116543335A (en) 2023-05-08 2023-05-08 Visual anomaly detection method based on time sequence spatial information enhancement

Publications (1)

Publication Number Publication Date
CN116543335A true CN116543335A (en) 2023-08-04

Family

ID=87448376

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310510187.9A Pending CN116543335A (en) 2023-05-08 2023-05-08 Visual anomaly detection method based on time sequence spatial information enhancement

Country Status (1)

Country Link
CN (1) CN116543335A (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109117774A (en) * 2018-08-01 2019-01-01 广东工业大学 A kind of multi-angle video method for detecting abnormality based on sparse coding
CN113569756A (en) * 2021-07-29 2021-10-29 西安交通大学 Abnormal behavior detection and positioning method, system, terminal equipment and readable storage medium
CN114612836A (en) * 2022-03-15 2022-06-10 南京邮电大学 Monitoring video abnormity detection method based on memory enhancement future video frame prediction
CN114708620A (en) * 2022-05-10 2022-07-05 山东交通学院 Pedestrian re-identification method and system applied to unmanned aerial vehicle at aerial view angle
CN114821434A (en) * 2022-05-05 2022-07-29 西藏民族大学 Space-time enhanced video anomaly detection method based on optical flow constraint
CN114973102A (en) * 2022-06-17 2022-08-30 南通大学 Video anomaly detection method based on multipath attention time sequence
CN115511650A (en) * 2022-09-22 2022-12-23 国网智联电商有限公司 Method and device for determining user propagation in crowd sensing task
CN115527150A (en) * 2022-10-31 2022-12-27 南京邮电大学 Dual-branch video anomaly detection method combined with convolution attention module
CN115690665A (en) * 2023-01-03 2023-02-03 华东交通大学 Video anomaly detection method and device based on cross U-Net network
CN115830541A (en) * 2022-12-05 2023-03-21 桂林电子科技大学 Video abnormal event detection method based on double-current space-time self-encoder

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109117774A (en) * 2018-08-01 2019-01-01 广东工业大学 A kind of multi-angle video method for detecting abnormality based on sparse coding
CN113569756A (en) * 2021-07-29 2021-10-29 西安交通大学 Abnormal behavior detection and positioning method, system, terminal equipment and readable storage medium
CN114612836A (en) * 2022-03-15 2022-06-10 南京邮电大学 Monitoring video abnormity detection method based on memory enhancement future video frame prediction
CN114821434A (en) * 2022-05-05 2022-07-29 西藏民族大学 Space-time enhanced video anomaly detection method based on optical flow constraint
CN114708620A (en) * 2022-05-10 2022-07-05 山东交通学院 Pedestrian re-identification method and system applied to unmanned aerial vehicle at aerial view angle
CN114973102A (en) * 2022-06-17 2022-08-30 南通大学 Video anomaly detection method based on multipath attention time sequence
CN115511650A (en) * 2022-09-22 2022-12-23 国网智联电商有限公司 Method and device for determining user propagation in crowd sensing task
CN115527150A (en) * 2022-10-31 2022-12-27 南京邮电大学 Dual-branch video anomaly detection method combined with convolution attention module
CN115830541A (en) * 2022-12-05 2023-03-21 桂林电子科技大学 Video abnormal event detection method based on double-current space-time self-encoder
CN115690665A (en) * 2023-01-03 2023-02-03 华东交通大学 Video anomaly detection method and device based on cross U-Net network

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
CHAO HU ET AL: "Spatio-Temporal-based Context Fusion for Video Anomaly Detection", arXiv, 18 October 2022 (2022-10-18) *
ZHIAN LIU ET AL: "A Hybrid Video Anomaly Detection Framework via Memory-Augmented Flow Reconstruction and Flow-Guided Frame Prediction", ICCV, 31 December 2021 (2021-12-31), pages 13588 - 13597 *
WANG Xiangjun et al.: "Salient object detection method with edge-information-guided multi-level scale feature fusion", Infrared and Laser Engineering, vol. 52, no. 1, 31 January 2023 (2023-01-31) *
WANG Siqi et al.: "A survey of intelligent video anomaly event detection methods", Computer Engineering & Science, no. 08, 15 August 2020 (2020-08-15) *

Similar Documents

Publication Publication Date Title
CN110189334B (en) Medical image segmentation method of residual error type full convolution neural network based on attention mechanism
CN111612790B (en) Medical image segmentation method based on T-shaped attention structure
CN112507990A (en) Video time-space feature learning and extracting method, device, equipment and storage medium
CN112598053B (en) Active significance target detection method based on semi-supervised learning
CN112052763A (en) Video abnormal event detection method based on bidirectional review generation countermeasure network
CN113298789A (en) Insulator defect detection method and system, electronic device and readable storage medium
CN114549985B (en) Target detection method and system based on self-supervision contrast learning
CN114463759A (en) Lightweight character detection method and device based on anchor-frame-free algorithm
CN110930378A (en) Emphysema image processing method and system based on low data demand
CN113705394B (en) Behavior recognition method combining long time domain features and short time domain features
CN116543335A (en) Visual anomaly detection method based on time sequence spatial information enhancement
CN116797618A (en) Multi-stage segmentation method based on multi-mode MRI (magnetic resonance imaging) heart image
US20230090941A1 (en) Processing video content using gated transformer neural networks
CN111275751A (en) Unsupervised absolute scale calculation method and system
Dastbaravardeh et al. Channel Attention‐Based Approach with Autoencoder Network for Human Action Recognition in Low‐Resolution Frames
CN115484456A (en) Video anomaly prediction method and device based on semantic clustering
CN116452472A (en) Low-illumination image enhancement method based on semantic knowledge guidance
Wang et al. A Novel Neural Network Based on Transformer for Polyp Image Segmentation
CN111160346A (en) Ischemic stroke segmentation system based on three-dimensional convolution
Fan et al. EGFNet: Efficient guided feature fusion network for skin cancer lesion segmentation
CN111899161A (en) Super-resolution reconstruction method
Guo et al. An unsupervised optical flow estimation for LiDAR image sequences
CN113947612B (en) Video anomaly detection method based on foreground and background separation
CN111242038B (en) Dynamic tongue fibrillation detection method based on frame prediction network
KR102454742B1 (en) Method for analyzing thickness of cortical region

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination