CN112052763A - Video abnormal event detection method based on bidirectional review generation countermeasure network - Google Patents

Video abnormal event detection method based on bidirectional review generation countermeasure network

Info

Publication number
CN112052763A
CN112052763A (application CN202010878108.6A); granted publication CN112052763B
Authority
CN
China
Prior art keywords
layer
network
convolution
frame
sequence
Prior art date
Legal status
Granted
Application number
CN202010878108.6A
Other languages
Chinese (zh)
Other versions
CN112052763B (en)
Inventor
刘静
杨智伟
Current Assignee
Xidian University
Original Assignee
Xidian University
Priority date
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202010878108.6A priority Critical patent/CN112052763B/en
Publication of CN112052763A publication Critical patent/CN112052763A/en
Application granted granted Critical
Publication of CN112052763B publication Critical patent/CN112052763B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06V 20/40: Scenes; Scene-specific elements in video content
    • G06V 20/44: Event detection
    • G06N 3/045: Neural networks; Combinations of networks
    • G06N 3/048: Neural networks; Activation functions
    • G06N 3/084: Learning methods; Backpropagation, e.g. using gradient descent


Abstract

The invention provides a video abnormal event detection method based on a bidirectional review generation countermeasure network. It addresses the problem that the prior art neither exploits the reverse mapping relation among video frame sequences nor applies motion constraints from the perspective of long-term temporal consistency of the frame sequences, so that abnormal events in video are detected with insufficient accuracy. The implementation steps are as follows: a generation countermeasure network composed of a generator, a frame discriminator and a sequence discriminator is constructed; during training, a bidirectional review scheme combining forward prediction, backward prediction and retrospective prediction is adopted, and the network is trained by alternately updating the generator, the frame discriminator and the sequence discriminator; this yields a generator that can accurately predict the future frame image of a normal event in a video but cannot accurately predict the future frame image of an abnormal event, and whether an abnormal event occurs is detected according to the prediction error.

Description

Video abnormal event detection method based on bidirectional review generation countermeasure network
Technical Field
The invention belongs to the technical field of image processing, and further relates to a video abnormal event detection method based on a bidirectional review generation countermeasure network in the technical field of computer vision. The method can be used for detecting abnormal events in video surveillance images.
Background
In recent years, intelligent security has attracted increasing attention, in particular the automatic detection of abnormal events in video surveillance. Such detection plays a vital role in improving the response to and handling of abnormal events in public places, maintaining public safety and reducing property loss. Currently, there are two main approaches to detecting abnormal events in video surveillance. The first learns video data of normal behaviors to build a model, such as a sparse dictionary or an auto-encoder, that reconstructs the current video frame image: the trained reconstruction model reconstructs the current frame of a normal event well, whereas the reconstructed frame of an abnormal event has a larger reconstruction error, and abnormal events are detected according to this error. However, the reconstruction model has a large fault-tolerant capacity and can sometimes reconstruct abnormal frames well, so the detection accuracy is low. The second learns video data of normal behaviors to build a model that predicts a future frame image: the model predicts the future frame of a normal event well, while an abnormal event is unpredictable, so a large error arises between the predicted future frame and the real frame, and whether an abnormal event occurs can be detected according to this error.
Wen Liu et al., in the paper "Future Frame Prediction for Anomaly Detection - A New Baseline" (Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 6536-6545), propose a video anomaly detection method based on future frame prediction. The method adopts Unet as the prediction network to predict future frames of video data. During training, the future frame image is predicted forward from the first four frame images; appearance constraints are imposed on the predicted future frame by minimizing the gradient loss and the intensity loss between the predicted and real future frames, motion constraints are imposed by minimizing the optical flow loss between them, and a generative adversarial network is combined to optimize the model. Because the method constrains both the appearance of the predicted future frame and, via optical flow, its motion, it predicts the future frames of normal video well, while abnormal video remains unpredictable and produces a larger error between the predicted and real future frames, which improves abnormal event detection. However, the method still has shortcomings: only forward prediction is performed during training and the reverse mapping relation between video frame sequences is not used, so the prediction of future frames of normal events is insufficient. Moreover, motion constraints are applied only between the predicted future frame and the previous frame through optical flow, not from the perspective of long-term temporal consistency of the frame sequence, so the motion constraint on the predicted future frame of a normal event is insufficient, larger errors also appear in the predicted future frames of normal events, the ability of the prediction network to distinguish normal and abnormal event patterns is reduced, and the video anomaly detection effect deteriorates.
The patent document "Video anomaly detection method based on ST-Unet" (application No. 201811501290.2, publication No. CN109711280A) filed by Beijing University of Technology proposes an ST-Unet network that exploits the spatio-temporal information of video data for video anomaly detection. The method builds a new ST-Unet network by adding ConvLSTM into Unet. When training the ST-Unet network, the method first forward-predicts a future frame image from the first four video frames and reconstructs the last frame of the input, and then optimizes the model by minimizing the differences between the predicted and real future frames and between the reconstructed image and the real last input frame, in combination with a generative adversarial network. In the testing stage, the error between the future frame predicted by the ST-Unet network and the real future frame and the error between the reconstructed image and the real input image are computed to obtain a prediction error and a reconstruction error, the two errors are combined by weighted summation into an anomaly score, and whether an anomaly occurs is judged according to this score. Adding ConvLSTM to Unet captures the spatio-temporal information of the video data, and combining the reconstruction error further improves detection. However, the method still has shortcomings: only forward prediction is performed during training and the reverse mapping relation between video frame sequences is not used; combining a prediction model with a reconstruction model makes the detection network overly complex and reduces detection efficiency; and, owing to the diversity of scenes, a suitable weighting between the reconstruction error and the prediction error is hard to choose, so the detection performance is unstable as scenes change.
Disclosure of Invention
The purpose of the invention is to provide a video abnormal event detection method based on a bidirectional review generation countermeasure network that overcomes the defects of the prior art, namely that the reverse mapping relation among video frame sequences is not exploited and that motion constraints are not applied from the perspective of long-term temporal consistency of the frame sequences, which leaves the detection accuracy for abnormal events in video insufficient.
To achieve this purpose, the invention constructs a bidirectional review generation countermeasure network consisting of a generator, a frame discriminator and a sequence discriminator. By adopting a bidirectional review training mode, the generator performs both forward prediction and backward prediction, so the bidirectional mapping relation between normal video frame sequences is fully mined and a more accurate future frame image of a normal event is predicted. In addition, a sequence discriminator built from 3D convolutional layers captures the long-term temporal information between the predicted frame and the input frame sequence, and its discrimination loss imposes a motion constraint that keeps the predicted frame image consistent in motion with the real frame image from the perspective of long-term temporal consistency, thereby strengthening the ability of the prediction network to distinguish normal from abnormal motion patterns in video and improving abnormal event detection.
The method comprises the following specific steps:
(1) constructing a generation countermeasure network:
(1a) a generator network with 15 layers is built, and the structure of the generator network sequentially comprises the following steps: input layer → first convolution layer → first normalization layer → first activation function layer → second convolution layer → second normalization layer → second activation function layer → first downsampling layer combination → second downsampling layer combination → third downsampling layer combination → first upsampling layer combination → second upsampling layer combination → third convolutional layer → output layer; the structure of each downsampling layer combination is as follows in sequence: the first max pooling layer → the first convolution layer → the first normalization layer → the first activation function layer → the second convolution layer → the second normalization layer → the second activation function layer; the structure of each up-sampling layer combination is as follows in sequence: the first deconvolution layer → the first convolution layer → the first normalization layer → the first activation function layer → the second convolution layer → the second normalization layer → the second activation function layer; the feature maps output by the first, second and third down-sampling layer combinations are spliced and fused with the feature maps output by the first, second and third up-sampling layer combinations respectively;
the parameters of each layer in the generator network are set as follows: the sizes of the convolution kernels in the first, second and third convolution layers are all set to 3×3, the convolution strides are all set to 2, and the number of convolution kernels is 64; the first and second activation function layers are realized with the ReLU function;
the pooling kernel size of the max pooling layer in each downsampling layer combination is set to 2×2 and the pooling stride to 2; the convolution kernel sizes of the convolution layers are all set to 3×3, the convolution strides are all set to 2, and the numbers of convolution kernels are 128, 256 and 512 respectively; the activation function layers are all realized with the ReLU function;
the convolution kernel size of the deconvolution layer in each upsampling layer combination is set to 2×2 and the convolution stride to 2; the convolution kernel sizes of the convolution layers are all set to 3×3, the convolution strides are all set to 2, and the numbers of convolution kernels are 256, 128 and 64; the activation function layers are all realized with the ReLU function;
(1b) a 14-layer frame discriminator network is built, and its structure is, in order: input layer → first convolution layer → first activation function layer → second convolution layer → first normalization layer → second activation function layer → third convolution layer → second normalization layer → third activation function layer → fourth convolution layer → fourth activation function layer → fifth convolution layer → fifth activation function layer → output layer;
the parameters of each layer in the frame discriminator network are set as follows: the convolution kernel sizes of the first, second, third, fourth and fifth convolution layers are all set to 3×3, the convolution strides are all set to 2, and the numbers of convolution kernels are set to 128, 256, 512 and 1 in sequence; the first and second normalization layers are both realized with the BatchNorm2d function; the first, second, third and fourth activation function layers are all realized with the LeakyReLU function with slope 0.2; the fifth activation function layer is realized with the Sigmoid function;
(1c) a 16-layer sequence discriminator network is built, and its structure is, in order: input layer → first 3D convolution layer → first 3D max pooling layer → first normalization layer → first activation function layer → second 3D convolution layer → second 3D max pooling layer → second normalization layer → second activation function layer → third 3D convolution layer → third 3D max pooling layer → third normalization layer → third activation function layer → fourth 3D convolution layer → fourth activation function layer → output layer;
the parameters of each layer in the sequence discriminator network are set as follows: the convolution kernel sizes of the first, second, third and fourth 3D convolution layers are all set to 2×3, the convolution strides are all set to 1×2, and the numbers of convolution kernels are set to 128, 256, 512 and 1 in sequence; the pooling kernel sizes of the first, second and third 3D max pooling layers are all set to 2×3 and the pooling strides to 1×2; the first, second and third normalization layers are all realized with the BatchNorm3d function; the first, second and third activation function layers are all realized with the LeakyReLU function with slope 0.2; the fourth activation function layer is realized with the Sigmoid function;
(1d) respectively cascading a generator network, a frame discriminator network and a sequence discriminator network to form a generation countermeasure network;
(2) initializing the generation of the countermeasure network:
initializing weights of all convolution layers and normalization layers in the generated countermeasure network to random values satisfying normal distribution; wherein the mean value of the normal distribution is 0, and the standard deviation is 0.02;
(3) generating a training data set:
selecting continuous monitoring video that is T minutes long and does not contain any abnormal events, and sequentially cutting it into groups of video frame sequences of length 5 and size W×H to form a training data set; where T > 10, W and H respectively denote the width and height of each frame image, 64 ≤ W ≤ 256, 64 ≤ H ≤ 256, and the units of W and H are pixels;
(4) training the generator network in a two-way review mode:
(4a) arranging the first 4 frames of each group of video frame sequences into a forward video frame sequence according to a forward time sequence, and arranging the last 4 frames of each group of video frame sequences into a reverse video frame sequence according to a reverse time sequence;
(4b) inputting the forward video frame sequence into a generator network for forward prediction, and outputting a forward prediction frame image; inputting the reverse video frame sequence into a generator network for backward prediction, and outputting a backward prediction frame image;
(4c) adding the forward predicted frame image into the previous video frame sequence used for forward prediction, arranging the last 4 frames of the expanded video frame sequence according to a reverse time sequence, inputting the arranged frames into a generator network for backward retrospective prediction, and outputting a backward retrospective predicted frame image; adding the backward prediction frame images into the previous video frame sequence used for backward prediction, then arranging the first 4 frames of the expanded video frame sequence according to the forward time sequence, inputting the arranged frames into a generator network for forward retrospective prediction, and outputting forward retrospective prediction frame images;
(4d) calculating a loss value of a generator network according to a generator network loss function constructed according to errors between a plurality of predicted frame images and real frame images generated by bidirectional prediction and retrospective prediction; reversely transmitting the loss value of the generator network by using a gradient descent method, and calculating all gradients of each convolution layer and each convolution kernel in each deconvolution layer of the generator network; iteratively updating all weights of each convolution kernel in each convolutional layer and each deconvolution layer of the generator network using an Adam optimizer according to all gradients of each convolution kernel in each convolutional layer and each deconvolution layer of the generator network; the initial learning rate of the Adam optimizer is 0.0002;
(5) training a frame discriminator network:
(5a) sequentially inputting a forward predicted frame image and a real image thereof, a backward predicted frame image and a real image thereof, a forward retrospective predicted frame image and a real image thereof, and a backward retrospective predicted frame image and a real image thereof into a frame discriminator network, and outputting corresponding true-false probabilities by the frame discriminator network;
(5b) calculating a loss value of the frame discriminator network according to a frame discriminator loss function constructed by the true and false probabilities output by the frame discriminator network; reversely transmitting the loss value of the frame discriminator network by using a gradient descent method, and calculating all gradients of each convolution kernel of each convolution layer of the frame discriminator network and all gradients of a normalization layer; iteratively updating all weights of each convolution kernel of each convolution layer of the frame arbiter network and all weights of the normalization layer by using an Adam optimizer according to all gradients of each convolution kernel of each convolution layer of the frame arbiter network and all gradients of the normalization layer; the initial learning rate of the Adam optimizer is 0.00002;
(6) training a sequence discriminator network:
(6a) a video frame sequence formed by a forward predicted frame image, a backward predicted frame image, a retrospective predicted frame image and a corresponding input frame image and a corresponding real video frame sequence are sequentially input into a sequence discriminator network, and the sequence discriminator network outputs corresponding authenticity probability;
(6b) calculating a loss value of the sequence discriminator network according to a sequence discriminator loss function constructed according to the authenticity probability output by the sequence discriminator network; reversely transmitting the loss value of the sequence discriminator network by using a gradient descent method, and calculating all gradients of each convolution kernel of each convolution layer of the sequence discriminator network and all gradients of a normalization layer; iteratively updating all weights of each convolution kernel of each convolution layer of the sequence discriminator network and all weights of the normalization layer by using an Adam optimizer according to all gradients of each convolution kernel of each convolution layer of the sequence discriminator network and all gradients of the normalization layer; the initial learning rate of the Adam optimizer is 0.00002;
(7) judging whether the generator network loss function is converged, if so, executing the step (8), otherwise, executing the step (4);
(8) finishing the training of generating the countermeasure network by two-way review to obtain the trained generator network weight, and storing all the weights of each convolution layer and each convolution kernel of each deconvolution layer of the generator network in the trained two-way review generation countermeasure network;
(9) detecting the video:
sequentially cutting the video to be detected into video frame sequences of length 5 and size M×N, inputting the first 4 frames of each sequence into the trained generator network, and outputting a predicted future frame image; calculating an anomaly score S from the error between the predicted future frame image and the real 5th frame image of the sequence; if the anomaly score S exceeds a set threshold, the future frame image is judged abnormal, otherwise it is judged not abnormal; where the values of M and N equal the values of W and H, and the anomaly score satisfies 0 ≤ S ≤ 1.
Compared with the prior art, the invention has the following advantages:
firstly, the invention constructs a generation countermeasure network composed of a generator, a frame discriminator and a sequence discriminator and trains it in a bidirectional review mode that combines forward prediction, backward prediction and retrospective prediction. This overcomes the problem in the prior art that only forward prediction is performed and the reverse mapping relation between video frame sequences is not exploited, which leaves the prediction of future frames of normal events insufficient. As a result, the prediction network of the invention distinguishes the appearance patterns of normal and abnormal events more strongly, which improves the detection of appearance anomalies in video.
Secondly, because the sequence discriminator in the generation countermeasure network of the invention is composed of 3D convolutional layers, it can capture the long-term temporal relation between video frame sequences and impose a motion constraint through its discrimination loss. This overcomes the problem in the prior art that the predicted future frame image is not motion-constrained from the perspective of long-term temporal consistency of the frame sequence, which leaves the motion constraint on predicted future frames of normal events insufficient. As a result, the video abnormal event detection network of the invention distinguishes normal from abnormal motion patterns more strongly, which improves the detection of motion anomalies in video.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of the overall architecture of the generator network of the present invention; wherein fig. 2(a) is a schematic diagram of a structure of a generator network of the present invention, fig. 2(b) is a schematic diagram of a down-sampling layer combination in the generator network, and fig. 2(c) is a schematic diagram of an up-sampling layer combination in the generator network;
FIG. 3 is a schematic diagram of a frame discriminator network in the generation countermeasure network according to the present invention;
FIG. 4 is a diagram illustrating a sequence discriminator network in the generation countermeasure network according to the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
The specific steps of the present invention are further described with reference to fig. 1:
step 1, constructing a generation countermeasure network.
The specific structure of the constructed generator network will be further described with reference to fig. 2 (a).
A generator network with 15 layers is built, and the structure of the generator network sequentially comprises the following steps: input layer → first convolution layer → first normalization layer → first activation function layer → second convolution layer → second normalization layer → second activation function layer → first downsampling layer combination → second downsampling layer combination → third downsampling layer combination → first upsampling layer combination → second upsampling layer combination → third convolutional layer → output layer.
The specific structure of the down-sampling layer combination in the generator network is further described with reference to fig. 2 (b).
The structure of each downsampling layer combination is as follows in sequence: the first max pooling layer → the first convolution layer → the first normalization layer → the first activation function layer → the second convolution layer → the second normalization layer → the second activation function layer.
The specific structure of the up-sampling layer combination in the generator network is further described with reference to fig. 2 (c).
The structure of each up-sampling layer combination is as follows: first deconvolution layer → first convolution layer → first normalization layer → first activation function layer → second convolution layer → second normalization layer → second activation function layer.
And the feature maps output by the first, second and third down-sampling layer combinations are spliced and fused with the feature maps output by the first, second and third up-sampling layer combinations respectively.
The parameters of each layer in the generator network are set as follows: the sizes of the convolution kernels in the first, second and third convolution layers are all set to 3×3, the convolution strides are all set to 2, and the number of convolution kernels is 64; the first and second activation function layers are both realized with the ReLU function.
The pooling kernel size of the max pooling layer in each downsampling layer combination is set to 2×2 and the pooling stride to 2; the convolution kernel sizes of the convolution layers are all set to 3×3, the convolution strides are all set to 2, and the numbers of convolution kernels are 128, 256 and 512 respectively; the activation function layers are all realized with the ReLU function.
The convolution kernel size of the deconvolution layer in each upsampling layer combination is set to 2×2 and the convolution stride to 2; the convolution kernel sizes of the convolution layers are all set to 3×3, the convolution strides are all set to 2, and the numbers of convolution kernels are 256, 128 and 64; the activation function layers are all realized with the ReLU function.
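For illustration, a minimal PyTorch sketch of this generator is given below. It assumes that the 4 input frames are stacked along the channel axis, that the 3×3 convolutions use stride 1 so the skip-connected feature maps match in spatial size, that skip connections join down- and up-sampling combinations of equal resolution, and that the final convolution maps back to the image channels; it is a sketch of the described architecture, not the patented implementation itself.

```python
# Hedged sketch of the U-Net-style generator described above (PyTorch).
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # convolution -> normalization -> ReLU, applied twice
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=1, padding=1),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

class Generator(nn.Module):
    def __init__(self, in_ch=12, out_ch=3):  # 4 RGB frames stacked -> 12 channels (assumption)
        super().__init__()
        self.enc0 = conv_block(in_ch, 64)
        self.down1 = nn.Sequential(nn.MaxPool2d(2), conv_block(64, 128))
        self.down2 = nn.Sequential(nn.MaxPool2d(2), conv_block(128, 256))
        self.down3 = nn.Sequential(nn.MaxPool2d(2), conv_block(256, 512))
        self.up1 = nn.ConvTranspose2d(512, 256, 2, stride=2)
        self.dec1 = conv_block(512, 256)      # concatenated with the 256-channel skip
        self.up2 = nn.ConvTranspose2d(256, 128, 2, stride=2)
        self.dec2 = conv_block(256, 128)
        self.up3 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec3 = conv_block(128, 64)
        self.out = nn.Conv2d(64, out_ch, 3, padding=1)

    def forward(self, x):
        e0 = self.enc0(x)
        e1 = self.down1(e0)
        e2 = self.down2(e1)
        e3 = self.down3(e2)
        d1 = self.dec1(torch.cat([self.up1(e3), e2], dim=1))   # splice-and-fuse skip connections
        d2 = self.dec2(torch.cat([self.up2(d1), e1], dim=1))
        d3 = self.dec3(torch.cat([self.up3(d2), e0], dim=1))
        return self.out(d3)
```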
The specific structure of the frame discriminator network constructed by the present invention will be further described with reference to fig. 3.
A 14-layer frame discriminator network is built; its structure is, in order: input layer → first convolution layer → first activation function layer → second convolution layer → first normalization layer → second activation function layer → third convolution layer → second normalization layer → third activation function layer → fourth convolution layer → fourth activation function layer → fifth convolution layer → fifth activation function layer → output layer.
The parameters of each layer in the frame discriminator network are set as follows: the convolution kernel sizes of the first, second, third, fourth and fifth convolution layers are all set to 3×3, the convolution strides are all set to 2, and the numbers of convolution kernels are set to 128, 256, 512 and 1 in sequence; the first and second normalization layers are both realized with the BatchNorm2d function; the first, second, third and fourth activation function layers are all realized with the LeakyReLU function with slope 0.2; the fifth activation function layer is realized with the Sigmoid function.
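A minimal PyTorch sketch of the frame discriminator follows; since the text lists four kernel counts (128, 256, 512, 1) for five convolution layers, the channel progression 128 → 256 → 512 → 512 → 1 and the padding used here are assumptions.

```python
# Hedged sketch of the frame discriminator (PyTorch).
import torch.nn as nn

class FrameDiscriminator(nn.Module):
    def __init__(self, in_ch=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 128, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.BatchNorm2d(256), nn.LeakyReLU(0.2),
            nn.Conv2d(256, 512, 3, stride=2, padding=1), nn.BatchNorm2d(512), nn.LeakyReLU(0.2),
            nn.Conv2d(512, 512, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(512, 1, 3, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)  # per-patch real/fake probabilities
```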
The specific structure of the sequence discriminator network constructed by the present invention will be further described with reference to fig. 4.
A 16-layer sequence discriminator network is built; its structure is, in order: input layer → first 3D convolution layer → first 3D max pooling layer → first normalization layer → first activation function layer → second 3D convolution layer → second 3D max pooling layer → second normalization layer → second activation function layer → third 3D convolution layer → third 3D max pooling layer → third normalization layer → third activation function layer → fourth 3D convolution layer → fourth activation function layer → output layer.
The parameters of each layer in the sequence discriminator network are set as follows: the convolution kernel sizes of the first, second, third and fourth 3D convolution layers are all set to 2×3, the convolution strides are all set to 1×2, and the numbers of convolution kernels are set to 128, 256, 512 and 1 in sequence; the pooling kernel sizes of the first, second and third 3D max pooling layers are all set to 2×3 and the pooling strides to 1×2; the first, second and third normalization layers are all realized with the BatchNorm3d function; the first, second and third activation function layers are all realized with the LeakyReLU function with slope 0.2; the fourth activation function layer is realized with the Sigmoid function.
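A minimal PyTorch sketch of the sequence discriminator follows, assuming that the 2×3 kernels and 1×2 strides denote (temporal, spatial) shapes of 2×3×3 and 1×2×2, that pooling acts only spatially so a 5-frame clip is reduced to a single temporal slice at the output, and that padding is chosen for convenience.

```python
# Hedged sketch of the 3D-convolutional sequence discriminator (PyTorch).
import torch.nn as nn

class SequenceDiscriminator(nn.Module):
    def __init__(self, in_ch=3):
        super().__init__()

        def block(cin, cout):
            return nn.Sequential(
                nn.Conv3d(cin, cout, kernel_size=(2, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
                nn.MaxPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2)),  # spatial pooling only (assumption)
                nn.BatchNorm3d(cout),
                nn.LeakyReLU(0.2),
            )

        self.features = nn.Sequential(block(in_ch, 128), block(128, 256), block(256, 512))
        self.head = nn.Sequential(
            nn.Conv3d(512, 1, kernel_size=(2, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
            nn.Sigmoid(),
        )

    def forward(self, x):          # x: (batch, channels, frames, height, width)
        return self.head(self.features(x))
```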
And respectively cascading the generator network with the frame discriminator network and the sequence discriminator network to form a generation countermeasure network.
And 2, initializing to generate the countermeasure network.
Initializing weights of all convolution layers and normalization layers in the generated countermeasure network to random values satisfying normal distribution; wherein, the mean value of the normal distribution is 0, and the standard deviation is 0.02.
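For illustration, a minimal PyTorch sketch of this initialization is given below; applying it through Module.apply is an assumption about how the three networks are initialized in practice.

```python
import torch.nn as nn

def init_weights(m):
    # convolution, deconvolution and normalization weights ~ N(0, 0.02), as described above
    if isinstance(m, (nn.Conv2d, nn.Conv3d, nn.ConvTranspose2d)):
        nn.init.normal_(m.weight, mean=0.0, std=0.02)
    elif isinstance(m, (nn.BatchNorm2d, nn.BatchNorm3d)):
        nn.init.normal_(m.weight, mean=0.0, std=0.02)  # the text gives mean 0; DCGAN practice uses mean 1
        nn.init.zeros_(m.bias)

# usage (assumed): generator.apply(init_weights); frame_d.apply(init_weights); seq_d.apply(init_weights)
```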
And 3, generating a training data set.
Selecting continuous monitoring video that is T minutes long and does not contain any abnormal events, and sequentially cutting it into groups of video frame sequences of length 5 and size W×H to form a training data set; where T > 10, W and H respectively denote the width and height of each frame image, 64 ≤ W ≤ 256, 64 ≤ H ≤ 256, and the units of W and H are pixels.
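The following sketch illustrates one way to build such a training set with OpenCV; the use of cv2, the non-overlapping step of 5 frames between clips and the video path argument are assumptions made for illustration only.

```python
# Hedged sketch of training-clip generation: a normal-only video is decoded,
# resized to width x height, and cut into consecutive 5-frame sequences.
import cv2
import numpy as np

def make_clips(video_path, width=256, height=256, clip_len=5):
    cap = cv2.VideoCapture(video_path)
    frames, clips = [], []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.resize(frame, (width, height)))
    cap.release()
    for start in range(0, len(frames) - clip_len + 1, clip_len):
        clips.append(np.stack(frames[start:start + clip_len]))   # shape (5, H, W, 3)
    return clips
```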
And 4, training the generator network by adopting a bidirectional review mode.
Step 1, arranging the first 4 frames of each group of video frame sequences into a forward video frame sequence according to a forward time sequence, and arranging the last 4 frames of each group of video frame sequences into a reverse video frame sequence according to a reverse time sequence;
step 2, inputting the forward video frame sequence into a generator network for forward prediction, and outputting a forward prediction frame image; and inputting the reverse video frame sequence into a generator network for backward prediction, and outputting a backward prediction frame image.
The forward prediction is realized by the following formula:

x'_{j+1} = G(X_{i:j})

where x'_{j+1} denotes the (j+1)-th video frame image output by the forward prediction of the generator network, G(·) denotes the output of the generator network in the bidirectional review generation countermeasure network, X_{i:j} denotes the forward video frame sequence formed in step 1 by arranging the first 4 frames of each group of video frame sequences in forward chronological order, and i and j respectively denote the starting and ending frames of that sequence, with j - i + 1 = 4.
The backward prediction is realized by the following formula:

x'_i = G(X̃_{j+1:i+1})

where x'_i denotes the i-th video frame image output by the backward prediction of the generator network, and X̃_{j+1:i+1} denotes the reverse video frame sequence formed in step 1 by arranging the last 4 frames of each group of video frame sequences in reverse chronological order.
Step 3, adding the forward predicted frame image into the previous video frame sequence used for forward prediction, arranging the last 4 frames of the expanded video frame sequence according to a reverse time sequence, inputting the arranged frames into a generator network for backward retrospective prediction, and outputting a backward retrospective predicted frame image; adding the backward prediction frame images into the previous video frame sequence used for backward prediction, arranging the first 4 frames of the expanded video frame sequence according to the forward time sequence, inputting the arranged frames into a generator network for forward retrospective prediction, and outputting forward retrospective prediction frame images.
The backward retrospective prediction is realized by the following formula:

x''_i = G(X̃'_{j+1:i+1})

where x''_i denotes the i-th video frame image output by the backward retrospective prediction of the generator network, and X̃'_{j+1:i+1} denotes the last 4 frames of the video frame sequence extended in step 3 with the forward predicted frame x'_{j+1}, arranged in reverse chronological order.
The forward retrospective prediction is realized by the following formula:

x''_{j+1} = G(X'_{i:j})

where x''_{j+1} denotes the (j+1)-th video frame image output by the forward retrospective prediction of the generator network, and X'_{i:j} denotes the first 4 frames of the video frame sequence extended in step 3 with the backward predicted frame x'_i, arranged in forward chronological order.
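For illustration, the four prediction passes can be sketched in PyTorch as follows, assuming a clip tensor of shape (batch, 5, channels, height, width) and a generator G that, as in the earlier generator sketch, takes 4 frames stacked along the channel axis; the frame names f1..f5 follow the clip order.

```python
# Hedged sketch of one bidirectional-review pass for a 5-frame clip.
import torch

def bidirectional_review(G, frames):
    B, T, C, H, W = frames.shape                      # T == 5
    fwd_in = frames[:, :4]                            # f1..f4, forward order
    bwd_in = frames[:, 1:].flip(dims=[1])             # f5..f2, reverse order

    x_fwd = G(fwd_in.reshape(B, -1, H, W))            # forward prediction of f5
    x_bwd = G(bwd_in.reshape(B, -1, H, W))            # backward prediction of f1

    # backward retrospective: append the predicted f5, take the last 4 frames reversed
    ext_fwd = torch.cat([frames[:, 1:4], x_fwd.unsqueeze(1)], dim=1).flip(dims=[1])
    x_bwd_retro = G(ext_fwd.reshape(B, -1, H, W))     # retrospective prediction of f1

    # forward retrospective: prepend the predicted f1, take the first 4 frames in order
    ext_bwd = torch.cat([x_bwd.unsqueeze(1), frames[:, 1:4]], dim=1)
    x_fwd_retro = G(ext_bwd.reshape(B, -1, H, W))     # retrospective prediction of f5

    return x_fwd, x_bwd, x_bwd_retro, x_fwd_retro
```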
And 4, calculating a loss value of the generator network according to a generator network loss function constructed by errors between a plurality of predicted frame images and real frame images generated by bidirectional prediction and retrospective prediction.
The generator network loss function is as follows:

L_G = 1·L + 1·L' + 0.005·L'' + 0.005·L'''

where L_G denotes the loss function of the generator network, · denotes multiplication, L denotes the intensity error loss between the predicted frame images output by the generator and the real images, L' denotes the gradient error loss between the predicted frame images and the real images, L'' denotes the frame adversarial loss of the generator network, and L''' denotes the sequence adversarial loss of the generator network;
the L, L ', L + and L' "are derived from the following equations, respectively:
Figure BDA0002653249010000106
Figure BDA0002653249010000111
Figure BDA0002653249010000112
Figure BDA0002653249010000113
wherein | · | purple sweet2Denotes a 2 norm operation, xiIs represented by x'iAnd x ″)iCorresponding real image, xjIs represented by x'jAnd x ″)jCorresponding real images, wherein K and L represent the size of each frame of image, the values of K and L are equal to the values of W and H, m and n respectively represent the position coordinates of pixels in the image, sigma represents summation operation, | | |1Represents a 1 normOperation DF(. D) shows the output of a frame arbiter network in a two-way look-back generation countermeasure networkS(. to) denotes the output of the sequence arbiter network in the two-way review generation countermeasure network, and @ denotes the time-series superposition operation.
Reversely transmitting the loss value of the generator network by using a gradient descent method, and calculating all gradients of each convolution layer and each convolution kernel in each deconvolution layer of the generator network; iteratively updating all weights of each convolution kernel in each convolutional layer and each deconvolution layer of the generator network using an Adam optimizer according to all gradients of each convolution kernel in each convolutional layer and each deconvolution layer of the generator network; the initial learning rate of the Adam optimizer is 0.0002.
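For illustration, a minimal PyTorch sketch of one generator update is given below, reusing the bidirectional_review helper from the earlier sketch; the mean-squared intensity term, the least-squares adversarial terms and the use of only the forward-prediction clip in the sequence term are simplifying assumptions rather than a verbatim implementation of the loss above.

```python
# Hedged sketch of one generator update with weights 1, 1, 0.005, 0.005.
import torch
import torch.nn.functional as F

def gradient_loss(pred, real):
    # L1 difference between the image gradients of the predicted and real frames
    dx_p, dy_p = pred[..., :, 1:] - pred[..., :, :-1], pred[..., 1:, :] - pred[..., :-1, :]
    dx_r, dy_r = real[..., :, 1:] - real[..., :, :-1], real[..., 1:, :] - real[..., :-1, :]
    return (dx_p.abs() - dx_r.abs()).abs().mean() + (dy_p.abs() - dy_r.abs()).abs().mean()

def generator_step(G, D_frame, D_seq, frames, opt_G):
    x_fwd, x_bwd, x_bwd_retro, x_fwd_retro = bidirectional_review(G, frames)
    real_last, real_first = frames[:, 4], frames[:, 0]

    preds = [(x_fwd, real_last), (x_bwd, real_first),
             (x_fwd_retro, real_last), (x_bwd_retro, real_first)]
    l_int = sum(F.mse_loss(p, r) for p, r in preds)                    # intensity loss L
    l_gd = sum(gradient_loss(p, r) for p, r in preds)                  # gradient loss L'
    l_adv_f = sum(((D_frame(p) - 1) ** 2).mean() for p, _ in preds)    # frame adversarial loss L''
    seq = torch.cat([frames[:, :4], x_fwd.unsqueeze(1)], dim=1)        # input frames + predicted frame
    l_adv_s = ((D_seq(seq.transpose(1, 2)) - 1) ** 2).mean()           # sequence adversarial loss L'''

    loss = 1 * l_int + 1 * l_gd + 0.005 * l_adv_f + 0.005 * l_adv_s
    opt_G.zero_grad(); loss.backward(); opt_G.step()
    return loss.item()
```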
And 5, training the frame discriminator network.
Step 1, a forward predicted frame image and a real image thereof, a backward predicted frame image and a real image thereof, a forward retrospective predicted frame image and a real image thereof, and a backward retrospective predicted frame image and a real image thereof are sequentially input into a frame discriminator network, and the frame discriminator network outputs corresponding true-false probabilities.
And 2, calculating a loss value of the frame discriminator network according to a frame discriminator loss function constructed by the true and false probabilities output by the frame discriminator network.
The frame discriminator loss function has the following form:

L_{D_F} = Σ [ (1/2)·(D_F(x) - 1)^2 + (1/2)·D_F(x̂)^2 ]

where L_{D_F} denotes the frame discriminator loss function, and the summation runs over the pairs formed by each real image x and the corresponding predicted frame image x̂ (the forward, backward, forward retrospective and backward retrospective predictions).
Reversely transmitting the loss value of the frame discriminator network by using a gradient descent method, and calculating all gradients of each convolution kernel of each convolution layer of the frame discriminator network and all gradients of a normalization layer; iteratively updating all weights of each convolution kernel of each convolution layer of the frame arbiter network and all weights of the normalization layer by using an Adam optimizer according to all gradients of each convolution kernel of each convolution layer of the frame arbiter network and all gradients of the normalization layer; the initial learning rate of the Adam optimizer is 0.00002.
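A corresponding sketch of one frame-discriminator update follows; pairing each detached predicted frame with its real counterpart follows the description above, while the least-squares objective is an assumption.

```python
# Hedged sketch of one frame-discriminator update.
def frame_discriminator_step(D_frame, preds_and_reals, opt_Df):
    loss = 0.0
    for pred, real in preds_and_reals:                     # e.g. the four (prediction, real) pairs
        loss = loss + 0.5 * ((D_frame(real) - 1) ** 2).mean() \
                    + 0.5 * (D_frame(pred.detach()) ** 2).mean()
    opt_Df.zero_grad(); loss.backward(); opt_Df.step()
    return loss.item()
```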
And 6, training the sequence discriminator network.
Step 1, a video frame sequence formed by a forward predicted frame image, a backward predicted frame image, a retrospective predicted frame image and a corresponding input frame image and a corresponding real video frame sequence are sequentially input into a sequence discriminator network, and the sequence discriminator network outputs corresponding authenticity probability.
And 2, calculating the loss value of the sequence discriminator network according to a sequence discriminator loss function constructed by the authenticity probability output by the sequence discriminator network.
The sequence discriminator loss function has the following form:

L_{D_S} = Σ [ (1/2)·(D_S(X_real) - 1)^2 + (1/2)·D_S(X̂)^2 ]

where L_{D_S} denotes the sequence discriminator loss function, X_real denotes a real video frame sequence, X̂ denotes the video frame sequence formed by a predicted frame image and its corresponding input frame images, and the summation runs over the sequence pairs formed from the forward, backward and retrospective predictions.
Reversely transmitting the loss value of the sequence discriminator network by using a gradient descent method, and calculating all gradients of each convolution kernel of each convolution layer of the sequence discriminator network and all gradients of a normalization layer; iteratively updating all weights of each convolution kernel of each convolution layer of the sequence discriminator network and all weights of the normalization layer by using an Adam optimizer according to all gradients of each convolution kernel of each convolution layer of the sequence discriminator network and all gradients of the normalization layer; the initial learning rate of the Adam optimizer is 0.00002.
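Similarly, a minimal sketch of one sequence-discriminator update is given below, comparing a real 5-frame clip against a clip whose last frame is the detached forward prediction; restricting the sketch to the forward branch and the least-squares objective are assumptions.

```python
# Hedged sketch of one sequence-discriminator update.
import torch

def sequence_discriminator_step(D_seq, frames, x_fwd, opt_Ds):
    real_seq = frames.transpose(1, 2)                                      # (B, C, T, H, W)
    fake_seq = torch.cat([frames[:, :4], x_fwd.detach().unsqueeze(1)], dim=1).transpose(1, 2)
    loss = 0.5 * ((D_seq(real_seq) - 1) ** 2).mean() + 0.5 * (D_seq(fake_seq) ** 2).mean()
    opt_Ds.zero_grad(); loss.backward(); opt_Ds.step()
    return loss.item()
```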
And 7, judging whether the generator network loss function is converged, if so, executing a step 8, otherwise, executing a step 4.
And 8, finishing the training of generating the countermeasure network by bidirectional review to obtain the trained generator network weight, and storing all the weights of each convolution layer and each convolution kernel of each deconvolution layer of the generator network in the trained bidirectional review generated countermeasure network.
And 9, detecting the video.
Sequentially cutting the video to be detected into video frame sequences of length 5 and size M×N, inputting the first 4 frames of each sequence into the trained generator network, and outputting a predicted future frame image; calculating an anomaly score S from the error between the predicted future frame image and the real 5th frame image of the sequence; if the anomaly score S exceeds a set threshold, the future frame image is judged abnormal, otherwise it is judged not abnormal; where the values of M and N equal the values of W and H, the anomaly score satisfies 0 ≤ S ≤ 1, and the set threshold is 0.5.
The anomaly score S is calculated by the following formulas:

PSNR(x, x') = 10·log_10( [max(x')]^2 / ( (1/F)·Σ_{l=1}^{F} (x_l - x'_l)^2 ) )

S(t) = 1 - ( PSNR(x_t, x'_t) - min_t PSNR(x_t, x'_t) ) / ( max_t PSNR(x_t, x'_t) - min_t PSNR(x_t, x'_t) )

where PSNR(x, x') denotes the peak signal-to-noise ratio between the real image and the predicted future frame image, x denotes the real image, x' denotes the predicted future frame image, log_10 denotes the base-10 logarithm, max denotes the maximum value operation, F denotes the total number of pixels in the real or predicted future frame image, l denotes the index of a pixel in the real or predicted future frame image, S(t) denotes the anomaly score of the predicted future frame image at the t-th moment, x_t denotes the real image at the t-th moment, x'_t denotes the predicted future frame image at the t-th moment, and min denotes the minimum value operation over the predicted future frame images of the video.
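A short NumPy sketch of this scoring follows; the 1 minus normalized-PSNR form, in which a higher score indicates a more abnormal frame, is an assumption about the exact normalization.

```python
# Hedged sketch of the PSNR-based anomaly score.
import numpy as np

def psnr(real, pred):
    mse = np.mean((real - pred) ** 2)
    return 10.0 * np.log10(pred.max() ** 2 / mse)

def anomaly_scores(psnr_values):
    p = np.asarray(psnr_values, dtype=np.float64)
    return 1.0 - (p - p.min()) / (p.max() - p.min())   # higher score = more abnormal

# usage: a score s(t) above the threshold (e.g. 0.5) marks frame t as abnormal
```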
The effect of the present invention is further explained by combining the simulation experiment as follows:
1. simulation experiment conditions are as follows:
the hardware platform of the simulation experiment of the invention is as follows: the processor is an Intel (R) Core i5-9400F CPU, the main frequency is 2.9GHz, the memory is 32GB, and the display card is an NVIDIA GeForce RTX 2070 Super.
The software platform of the simulation experiment of the invention is as follows: ubuntu 16.04 operating system, python 3.6, pytorch 1.2.0.
2. Simulation content and simulation result analysis:
when a training set and a test set are generated in a simulation experiment, a public standard data set CUHK Avenue (Avenue) is used. The video data set was 20 minutes in duration and contained 37 video segments, including 47 exceptional events. In the simulation experiment, 16 normal video segments in the Avenue data set are used to form a training set, and 21 abnormal video segments form a testing set.
In the simulation experiment, the invention and three prior-art methods (the future-frame-prediction anomaly detection method FFP, the deep predictive coding network video anomaly detection method AnoPCN, and the anomaly detection method PRI that integrates prediction and reconstruction) are used to detect the abnormal events in the 21 video segments of the test set.
The three prior-art methods used in the simulation experiment are as follows:
the prior art is an Anomaly Detection method based on Future Frame Prediction, which is a video Anomaly Detection method provided by W.Liu et al in Future Frame Prediction for analysis Detection-A New Baseline, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Jun.2018, pp.6536-6545, and is called an Anomaly Detection method based on Future Frame Prediction for short.
The second prior-art method, the video anomaly detection method based on a deep predictive coding network, refers to the video anomaly detection method proposed by M. Ye et al. in "AnoPCN: Video Anomaly Detection via Deep Predictive Coding Network," in Proceedings of the 27th ACM International Conference on Multimedia, 2019, pp. 1805-1813, abbreviated as the deep-predictive-coding-network video anomaly detection method (AnoPCN).
The third prior-art method, the anomaly detection method that integrates prediction and reconstruction, refers to the video anomaly detection method proposed by Y. Tang et al. in "Integrating prediction and reconstruction for anomaly detection," Pattern Recognit. Lett., vol. 129, pp. 123-130, 2020, abbreviated as the prediction-and-reconstruction anomaly detection method (PRI).
To evaluate the simulation results, AUC is adopted as the performance evaluation index and compared with the three prior-art methods; the comparison results are shown in Table 1.
As can be seen from Table 1, the AUC of the method of the invention on the Avenue data set is 88.6%, higher than that of the three prior-art methods, demonstrating that the method detects abnormal events in video more effectively.
TABLE 1. Comparison of the AUC values of the invention and the three prior-art methods on the Avenue data set

Method | AUC
FFP (future frame prediction) | -
AnoPCN (deep predictive coding network) | -
PRI (prediction and reconstruction) | -
Method of the invention | 88.6%
The above simulation experiments show that the bidirectional review generation countermeasure network constructed by the method of the invention, composed of a generator, a frame discriminator and a sequence discriminator and trained in a bidirectional review mode, lets the generator fully mine the bidirectional mapping relationship between the predicted frame and the input frame sequence and predict a more accurate future frame image of a normal event, which effectively improves the ability of the prediction network to distinguish the appearance patterns of normal and abnormal video. In addition, the sequence discriminator composed of 3D convolutional layers imposes a motion constraint on the predicted frame image from the perspective of long-term temporal consistency, which effectively improves the ability of the prediction network to distinguish the motion patterns of normal and abnormal video. The method therefore solves the problems of the prior art that only forward prediction is performed, that the reverse mapping relation between video frame sequences is not exploited, and that motion constraints are not applied from the perspective of long-term temporal consistency, which left the prediction and the motion constraint of future frames of normal events insufficient; it is thus a highly practical video abnormal event detection method.

Claims (9)

1. A video abnormal event detection method based on a bidirectional retrospective generation countermeasure network is characterized in that the generation countermeasure network consisting of a generator, a frame discriminator and a sequence discriminator is constructed, during training, the forward and backward prediction and the bidirectional retrospective mode of combined retrospective prediction are adopted, the generator, the frame discriminator and the sequence discriminator are alternately updated to train the generation countermeasure network, and the generator which can accurately predict a future frame image of a normal event in a video and cannot accurately predict the future frame image of the abnormal event in the video is obtained; the method comprises the following specific steps:
(1) constructing a generation countermeasure network:
(1a) a generator network with 15 layers is built, and the structure of the generator network sequentially comprises the following steps: input layer → first convolution layer → first normalization layer → first activation function layer → second convolution layer → second normalization layer → second activation function layer → first downsampling layer combination → second downsampling layer combination → third downsampling layer combination → first upsampling layer combination → second upsampling layer combination → third convolutional layer → output layer; the structure of each downsampling layer combination is as follows in sequence: the first max pooling layer → the first convolution layer → the first normalization layer → the first activation function layer → the second convolution layer → the second normalization layer → the second activation function layer; the structure of each up-sampling layer combination is as follows in sequence: the first deconvolution layer → the first convolution layer → the first normalization layer → the first activation function layer → the second convolution layer → the second normalization layer → the second activation function layer; the feature maps output by the first, second and third down-sampling layer combinations are spliced and fused with the feature maps output by the first, second and third up-sampling layer combinations respectively;
the parameters of each layer in the generator network are set as follows: setting the sizes of convolution kernels in the first convolution layer, the second convolution layer and the third convolution layer to be 3 multiplied by 3, setting convolution step lengths to be 2 and setting the number of the convolution kernels to be 64; the first and second activation function layers are realized by adopting a ReLU function;
the size of the pooling convolution kernel of the largest pooling layer in each downsampling layer combination is set to be 2 multiplied by 2, and the pooling step length is set to be 2; the sizes of convolution kernels of the convolution layers are all set to be 3 multiplied by 3, convolution step lengths are all set to be 2, and the number of the convolution kernels is 128, 256 and 512 respectively; the activation function layers are all realized by adopting a ReLU function;
in each up-sampling layer combination, the convolution kernel size of the deconvolution layer is set to 2×2 and the convolution stride to 2; the convolution kernel sizes of the convolution layers are all set to 3×3, the convolution strides are all set to 2, and the numbers of convolution kernels are 256, 128 and 64, respectively; the activation function layers are all implemented with the ReLU function;
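For illustration only, a minimal PyTorch-style sketch of a U-Net-style generator of this kind is given below. It assumes the 4 input frames are concatenated along the channel axis, uses stride-1 3×3 convolutions with the claimed channel widths (64/128/256/512), standard U-Net skip connections and a tanh output; these simplifications and the names below are assumptions, not the exact claimed configuration.

```python
# Illustrative sketch only: a U-Net-style single-frame predictor with the
# claimed channel widths (64/128/256/512). Stride-1 convolutions, padding,
# the channel-wise stacking of the 4 input frames and the tanh output are
# assumptions made so that the sketch runs end to end.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # two 3x3 convolution + normalization + ReLU layers, mirroring one
    # "convolution layer combination" in the claim
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

class Generator(nn.Module):
    def __init__(self, in_frames=4, channels=3):
        super().__init__()
        self.enc0 = conv_block(in_frames * channels, 64)
        self.down1 = nn.Sequential(nn.MaxPool2d(2), conv_block(64, 128))
        self.down2 = nn.Sequential(nn.MaxPool2d(2), conv_block(128, 256))
        self.down3 = nn.Sequential(nn.MaxPool2d(2), conv_block(256, 512))
        self.up1, self.dec1 = nn.ConvTranspose2d(512, 256, 2, stride=2), conv_block(512, 256)
        self.up2, self.dec2 = nn.ConvTranspose2d(256, 128, 2, stride=2), conv_block(256, 128)
        self.up3, self.dec3 = nn.ConvTranspose2d(128, 64, 2, stride=2), conv_block(128, 64)
        self.out = nn.Conv2d(64, channels, 3, padding=1)

    def forward(self, x):
        # x: (B, in_frames*channels, H, W) -- the 4 input frames stacked channel-wise
        e0 = self.enc0(x)
        e1 = self.down1(e0)
        e2 = self.down2(e1)
        e3 = self.down3(e2)
        d1 = self.dec1(torch.cat([self.up1(e3), e2], dim=1))   # skip connection
        d2 = self.dec2(torch.cat([self.up2(d1), e1], dim=1))   # skip connection
        d3 = self.dec3(torch.cat([self.up3(d2), e0], dim=1))   # skip connection
        return torch.tanh(self.out(d3))                        # predicted future frame
```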
(1b) a 14-layer frame discriminator network is built, and its structure comprises, in sequence: input layer → first convolution layer → first activation function layer → second convolution layer → first normalization layer → second activation function layer → third convolution layer → second normalization layer → third activation function layer → fourth convolution layer → fourth activation function layer → fifth convolution layer → fifth activation function layer → output layer;
the parameters of each layer in the frame discriminator network are set as follows: the convolution kernel sizes of the first, second, third, fourth and fifth convolution layers are all set to 3×3, the convolution strides are all set to 2, and the numbers of convolution kernels are set in sequence to 128, 256, 512 and 1; the first and second normalization layers are both implemented with the BatchNorm2d function; the first, second, third and fourth activation function layers are all implemented with the LeakyReLU function with the slope set to 0.2; the fifth activation function layer is implemented with the Sigmoid function;
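A corresponding sketch of such a frame discriminator follows. It keeps the claimed ordering of convolution, normalization, LeakyReLU(0.2) and Sigmoid layers; the 64-channel first layer (the claim lists only four channel counts for five convolution layers) and the padding of 1 are assumptions.

```python
# Illustrative stand-in for the frame discriminator: 3x3 stride-2 convolutions
# with LeakyReLU(0.2), BatchNorm2d on the second and third layers and a Sigmoid
# output. The 64-channel first layer and padding of 1 are assumptions.
import torch.nn as nn

class FrameDiscriminator(nn.Module):
    def __init__(self, channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 64, 3, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.BatchNorm2d(256), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(256, 512, 3, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(512, 1, 3, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        # x: (B, channels, H, W), a single real or predicted frame
        return self.net(x)   # map of true/false probabilities
```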
(1c) a 16-layer sequence discriminator network is built, and its structure comprises, in sequence: input layer → first 3D convolution layer → first 3D max pooling layer → first normalization layer → first activation function layer → second 3D convolution layer → second 3D max pooling layer → second normalization layer → second activation function layer → third 3D convolution layer → third 3D max pooling layer → third normalization layer → third activation function layer → fourth 3D convolution layer → fourth activation function layer → output layer;
the parameters of each layer in the sequence discriminator network are set as follows: the convolution kernel sizes of the first, second, third and fourth 3D convolution layers are all set to 2×3, the convolution strides are all set to 1×2, and the numbers of convolution kernels are set in sequence to 128, 256, 512 and 1; the pooling kernel sizes of the first, second and third 3D max pooling layers are all set to 2×3, and the pooling strides are all set to 1×2; the first, second and third normalization layers are all implemented with the BatchNorm3d function; the first, second and third activation function layers are all implemented with the LeakyReLU function with the slope set to 0.2; the fourth activation function layer is implemented with the Sigmoid function;
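A sketch of the 3D-convolutional sequence discriminator is given below under the assumption that the kernel and pooling sizes act over (time, height, width) as (2, 3, 3) and (1, 2, 2); since the claim text states these shapes only partially, the exact sizes here are assumptions.

```python
# Illustrative stand-in for the 3D-convolutional sequence discriminator.
# Kernel sizes (2, 3, 3) and pooling sizes (1, 2, 2) over (time, height, width)
# are assumptions; the layer ordering follows the claim.
import torch.nn as nn

class SequenceDiscriminator(nn.Module):
    def __init__(self, channels=3):
        super().__init__()
        def block(in_ch, out_ch):
            # 3D convolution -> 3D max pooling -> normalization -> LeakyReLU(0.2)
            return nn.Sequential(
                nn.Conv3d(in_ch, out_ch, kernel_size=(2, 3, 3), padding=(0, 1, 1)),
                nn.MaxPool3d(kernel_size=(1, 2, 2)),
                nn.BatchNorm3d(out_ch),
                nn.LeakyReLU(0.2, inplace=True),
            )
        self.net = nn.Sequential(
            block(channels, 128), block(128, 256), block(256, 512),
            nn.Conv3d(512, 1, kernel_size=(2, 3, 3), padding=(0, 1, 1)),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # x: (B, channels, T=5, H, W), a real or reconstructed 5-frame clip
        return self.net(x)   # authenticity probability map
```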
(1d) cascading the generator network with the frame discriminator network and with the sequence discriminator network, respectively, to form the generation countermeasure network;
(2) initializing the generation countermeasure network:
initializing the weights of all convolution layers and normalization layers in the generation countermeasure network to random values drawn from a normal distribution with mean 0 and standard deviation 0.02;
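A short sketch of this initialization step; applying N(0, 0.02) to the normalization-layer weights follows the claim text literally, and initializing the biases to zero is an added assumption.

```python
# Sketch of step (2): every convolution / deconvolution / normalization weight
# is drawn from N(0, 0.02); the zero bias initialization is an assumption.
import torch.nn as nn

def init_weights(m):
    if isinstance(m, (nn.Conv2d, nn.Conv3d, nn.ConvTranspose2d,
                      nn.BatchNorm2d, nn.BatchNorm3d)):
        nn.init.normal_(m.weight, mean=0.0, std=0.02)
        if m.bias is not None:
            nn.init.zeros_(m.bias)

# usage: generator.apply(init_weights); frame_disc.apply(init_weights); seq_disc.apply(init_weights)
```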
(3) generating a training data set:
selecting a continuous surveillance video that is T minutes long and contains no abnormal events, and sequentially cutting it into groups of video frame sequences of length 5 and size W×H to form the training data set; where T > 10, W and H denote the width and height of each frame image in pixels, with 64 ≤ W ≤ 256 and 64 ≤ H ≤ 256;
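A sketch (with a hypothetical file name) of how such a training set can be cut from a normal-only surveillance video using OpenCV; the use of non-overlapping clips and the [-1, 1] normalization are assumptions.

```python
# Sketch of step (3): a normal-only surveillance video is cut into consecutive
# length-5 clips of size W x H. Non-overlapping clips and the [-1, 1]
# normalization are assumptions; the file name is hypothetical.
import cv2
import numpy as np

def make_training_clips(video_path, W=256, H=256, clip_len=5):
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.resize(frame, (W, H)).astype(np.float32)
        frames.append(frame / 127.5 - 1.0)          # scale pixel values to [-1, 1]
    cap.release()
    return [np.stack(frames[i:i + clip_len])        # each clip: (5, H, W, 3)
            for i in range(0, len(frames) - clip_len + 1, clip_len)]

clips = make_training_clips("normal_surveillance.avi")   # hypothetical path
```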
(4) training the generator network in a two-way review mode:
(4a) arranging the first 4 frames of each group of video frame sequences into a forward video frame sequence in forward chronological order, and arranging the last 4 frames of each group of video frame sequences into a reverse video frame sequence in reverse chronological order;
(4b) inputting the forward video frame sequence into the generator network for forward prediction and outputting a forward predicted frame image; inputting the reverse video frame sequence into the generator network for backward prediction and outputting a backward predicted frame image;
(4c) adding the forward predicted frame image to the video frame sequence previously used for forward prediction, arranging the last 4 frames of the expanded video frame sequence in reverse chronological order, inputting them into the generator network for backward retrospective prediction, and outputting a backward retrospective predicted frame image; adding the backward predicted frame image to the video frame sequence previously used for backward prediction, arranging the first 4 frames of the expanded video frame sequence in forward chronological order, inputting them into the generator network for forward retrospective prediction, and outputting a forward retrospective predicted frame image;
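The data flow of steps (4a)-(4c) for a single 5-frame clip can be sketched as follows; `generator` is the U-Net-style predictor sketched above, and the assumption that frames are channel-first tensors stacked along channels is carried over from that sketch.

```python
# Sketch of the bidirectional review predictions of steps (4a)-(4c) for one
# 5-frame clip [f1..f5]; each frame is assumed to be a (B, C, H, W) tensor.
import torch

def bidirectional_review(generator, clip):
    f1, f2, f3, f4, f5 = clip
    stack = lambda frames: torch.cat(frames, dim=1)        # stack frames along channels

    fwd_pred = generator(stack([f1, f2, f3, f4]))          # forward prediction of f5
    bwd_pred = generator(stack([f5, f4, f3, f2]))          # backward prediction of f1
    # backward review: the forward-predicted f5 replaces the real f5, last 4 frames reversed
    bwd_review = generator(stack([fwd_pred, f4, f3, f2]))  # re-predicts f1
    # forward review: the backward-predicted f1 replaces the real f1, first 4 frames forward
    fwd_review = generator(stack([bwd_pred, f2, f3, f4]))  # re-predicts f5
    return fwd_pred, bwd_pred, bwd_review, fwd_review
```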
(4d) calculating the loss value of the generator network according to a generator network loss function constructed from the errors between the predicted frame images produced by the bidirectional and retrospective predictions and the corresponding real frame images; back-propagating the loss value of the generator network by gradient descent, and calculating all gradients of every convolution kernel in each convolution layer and each deconvolution layer of the generator network; iteratively updating all weights of every convolution kernel in each convolution layer and each deconvolution layer of the generator network with an Adam optimizer according to these gradients; the initial learning rate of the Adam optimizer is 0.0002;
(5) training a frame discriminator network:
(5a) sequentially inputting the forward predicted frame image and its real image, the backward predicted frame image and its real image, the forward retrospective predicted frame image and its real image, and the backward retrospective predicted frame image and its real image into the frame discriminator network, the frame discriminator network outputting the corresponding true/false probabilities;
(5b) calculating the loss value of the frame discriminator network according to a frame discriminator loss function constructed from the true/false probabilities output by the frame discriminator network; back-propagating the loss value of the frame discriminator network by gradient descent, and calculating all gradients of every convolution kernel of each convolution layer and all gradients of the normalization layers of the frame discriminator network; iteratively updating all weights of every convolution kernel of each convolution layer and all weights of the normalization layers of the frame discriminator network with an Adam optimizer according to these gradients; the initial learning rate of the Adam optimizer is 0.00002;
(6) training a sequence discriminator network:
(6a) sequentially inputting into the sequence discriminator network the video frame sequences formed by the forward predicted frame image, the backward predicted frame image and the retrospective predicted frame images together with their corresponding input frame images, as well as the corresponding real video frame sequences, the sequence discriminator network outputting the corresponding authenticity probabilities;
(6b) calculating the loss value of the sequence discriminator network according to a sequence discriminator loss function constructed from the authenticity probabilities output by the sequence discriminator network; back-propagating the loss value of the sequence discriminator network by gradient descent, and calculating all gradients of every convolution kernel of each convolution layer and all gradients of the normalization layers of the sequence discriminator network; iteratively updating all weights of every convolution kernel of each convolution layer and all weights of the normalization layers of the sequence discriminator network with an Adam optimizer according to these gradients; the initial learning rate of the Adam optimizer is 0.00002;
(7) judging whether the generator network loss function has converged; if so, executing step (8), otherwise returning to step (4);
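The alternating updates of steps (4)-(7) can be condensed into a training-step sketch like the one below. The three loss callables are placeholders for the claim 6-8 losses (whose exact formulas appear only as images in the original), the learning rates follow the claim, and `bidirectional_review` is the helper sketched under step (4c).

```python
# Condensed sketch of the alternating generator / frame discriminator /
# sequence discriminator updates; the loss functions are hypothetical stand-ins.
import torch

def make_optimizers(generator, frame_disc, seq_disc):
    return (torch.optim.Adam(generator.parameters(), lr=2e-4),
            torch.optim.Adam(frame_disc.parameters(), lr=2e-5),
            torch.optim.Adam(seq_disc.parameters(), lr=2e-5))

def train_step(clip, generator, frame_disc, seq_disc, opt_g, opt_df, opt_ds,
               g_loss_fn, df_loss_fn, ds_loss_fn):
    preds = bidirectional_review(generator, clip)

    # step (4): update the generator
    opt_g.zero_grad()
    g_loss = g_loss_fn(preds, clip, frame_disc, seq_disc)
    g_loss.backward()
    opt_g.step()

    # step (5): update the frame discriminator on real frames vs. detached predictions
    opt_df.zero_grad()
    df_loss_fn(frame_disc, [p.detach() for p in preds], clip).backward()
    opt_df.step()

    # step (6): update the sequence discriminator on real vs. predicted clips
    opt_ds.zero_grad()
    ds_loss_fn(seq_disc, [p.detach() for p in preds], clip).backward()
    opt_ds.step()

    return g_loss.item()   # monitored for the convergence check of step (7)
```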
(8) finishing the bidirectional review training of the generation countermeasure network to obtain the trained generator network weights, and saving all weights of every convolution kernel of each convolution layer and each deconvolution layer of the generator network in the trained bidirectional review generation countermeasure network;
(9) detecting the video:
sequentially cutting the video to be detected into video frame sequences of length 5 and size M×N; inputting the first 4 frames of each video frame sequence into the trained generator network and outputting a predicted future frame image; calculating an anomaly score S from the error between the predicted future frame image and the real 5th frame image in the video frame sequence; if the anomaly score S exceeds a set threshold, the future frame image is judged to be abnormal, otherwise it is judged not to be abnormal; where the values of M and N are equal to those of W and H, and the anomaly score satisfies 0 ≤ S ≤ 1.
2. The video abnormal event detection method based on a bidirectional review generation countermeasure network according to claim 1, wherein the forward prediction in step (4b) is implemented by the following formula:
x′_{j+1} = G(X_{i:j})
where x′_{j+1} denotes the (j+1)-th video frame image output by the forward prediction of the generator network, G(·) denotes the output of the generator network in the bidirectional review generation countermeasure network, X_{i:j} denotes the forward video frame sequence formed in step (4a) by arranging the first 4 frames of each group of video frame sequences in forward chronological order, and i and j denote the indices of the starting and ending frames of the video frame sequence, with j − i + 1 = 4.
3. The video abnormal event detection method based on a bidirectional review generation countermeasure network according to claim 2, wherein the backward prediction in step (4b) is implemented by the following formula:
x′_i = G(X_{j+1:i+1})
where x′_i denotes the i-th video frame image output by the backward prediction of the generator network, and X_{j+1:i+1} denotes the reverse video frame sequence formed in step (4a) by arranging the last 4 frames of each group of video frame sequences in reverse chronological order.
4. The video abnormal event detection method based on a bidirectional review generation countermeasure network according to claim 2, wherein the backward retrospective prediction in step (4c) is implemented by the following formula:
x″_i = G(X′_{j+1:i+1})
where x″_i denotes the i-th video frame image output by the backward retrospective prediction of the generator network, and X′_{j+1:i+1} denotes the last 4 frames of the expanded video frame sequence in step (4c), i.e. the sequence including the forward predicted frame x′_{j+1}, arranged in reverse chronological order.
5. The video abnormal event detection method based on a bidirectional review generation countermeasure network according to claim 2, wherein the forward retrospective prediction in step (4c) is implemented by the following formula:
x″_{j+1} = G(X′_{i:j})
where x″_{j+1} denotes the (j+1)-th video frame image output by the forward retrospective prediction of the generator network, and X′_{i:j} denotes the first 4 frames of the expanded video frame sequence in step (4c), i.e. the sequence including the backward predicted frame x′_i, arranged in forward chronological order.
6. The video abnormal event detection method based on a bidirectional review generation countermeasure network according to claim 5, wherein the generator network loss function in step (4d) is as follows:
L_G = 1*L + 1*L′ + 0.005*L″ + 0.005*L‴
where L_G denotes the generator network loss function, * denotes the multiplication operation, L denotes the intensity error loss between the predicted frame images output by the generator and the real images, L′ denotes the gradient error loss between the predicted frame images output by the generator and the real images, L″ denotes the frame countermeasure loss of the generator network, and L‴ denotes the sequence countermeasure loss of the generator network;
the L, L ', L ", and L'" are derived from the following equations, respectively:
[Formula image FDA0002653247000000056: definition of the intensity error loss L]
[Formula image FDA0002653247000000061: definition of the gradient error loss L′]
[Formula image FDA0002653247000000062: definition of the frame countermeasure loss L″]
[Formula image FDA0002653247000000063: definition of the sequence countermeasure loss L‴]
where ‖·‖₂ denotes the 2-norm operation, x_i denotes the real image corresponding to x′_i and x″_i, x_{j+1} denotes the real image corresponding to x′_{j+1} and x″_{j+1}, K and L denote the width and height of each frame image and take the same values as W and H, m and n denote the position coordinates of a pixel in the image, Σ denotes the summation operation, ‖·‖₁ denotes the 1-norm operation, D_F(·) denotes the output of the frame discriminator network in the bidirectional review generation countermeasure network, D_S(·) denotes the output of the sequence discriminator network in the bidirectional review generation countermeasure network, and ∪ denotes a time-series concatenation (superposition) operation.
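Because the component formulas above survive only as images, the following sketch uses common stand-in forms (mean-squared intensity error, L1 image-gradient difference, and adversarial terms that push the discriminator outputs toward 1); only the weights 1, 1, 0.005 and 0.005 are taken from the claim text, so this is not the patent's exact loss.

```python
# Stand-in sketch of a claim-6-style generator loss; the component forms are
# common choices, not the patent's exact formulas.
import torch
import torch.nn.functional as F

def gradient_loss(pred, real):
    # L1 difference between horizontal and vertical image gradients
    dxp, dyp = pred[..., :, 1:] - pred[..., :, :-1], pred[..., 1:, :] - pred[..., :-1, :]
    dxr, dyr = real[..., :, 1:] - real[..., :, :-1], real[..., 1:, :] - real[..., :-1, :]
    return (dxp - dxr).abs().mean() + (dyp - dyr).abs().mean()

def generator_loss(preds, reals, pred_clips, frame_disc, seq_disc):
    # preds/reals: matching lists of predicted and ground-truth frames
    # pred_clips: 5-frame clips with the predicted frame substituted in
    L_int = sum(F.mse_loss(p, r) for p, r in zip(preds, reals))
    L_grad = sum(gradient_loss(p, r) for p, r in zip(preds, reals))
    frame_scores = [frame_disc(p) for p in preds]
    L_fadv = sum(F.binary_cross_entropy(s, torch.ones_like(s)) for s in frame_scores)
    seq_scores = [seq_disc(c) for c in pred_clips]
    L_sadv = sum(F.binary_cross_entropy(s, torch.ones_like(s)) for s in seq_scores)
    return 1 * L_int + 1 * L_grad + 0.005 * L_fadv + 0.005 * L_sadv
```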
7. The video abnormal event detection method based on a bidirectional review generation countermeasure network according to claim 6, wherein the frame discriminator loss function in step (5b) is of the following form:
[Formula image FDA0002653247000000064: the frame discriminator loss function, constructed from the true/false probabilities output by the frame discriminator network]
where the symbol [image FDA0002653247000000065] denotes the frame discriminator loss function.
8. The video abnormal event detection method based on a bidirectional review generation countermeasure network according to claim 6, wherein the sequence discriminator loss function in step (6b) is of the following form:
[Formula image FDA0002653247000000071: the sequence discriminator loss function, constructed from the authenticity probabilities output by the sequence discriminator network]
where the symbol [image FDA0002653247000000072] denotes the sequence discriminator loss function.
9. The video abnormal event detection method based on a bidirectional review generation countermeasure network according to claim 6, wherein the anomaly score S in step (9) is calculated by the following formulas:
[Formula image FDA0002653247000000073: definition of the peak signal-to-noise ratio PSNR(x, x′)]
[Formula image FDA0002653247000000074: definition of the anomaly score S(t), obtained by normalizing the PSNR values over the video]
where PSNR(x, x′) denotes the peak signal-to-noise ratio between the real image and the predicted future frame image, x denotes the real image, x′ denotes the predicted future frame image, log₁₀ denotes the base-10 logarithm operation, max denotes the maximum value operation, F denotes the total number of pixels in the corresponding real image or predicted future frame image, l denotes the index over all pixels in the corresponding real image or predicted future frame image, S(t) denotes the anomaly score of the predicted future frame image at the t-th moment, x_t denotes the real image at the t-th moment, x′_t denotes the predicted future frame image at the t-th moment, and min denotes the minimum value operation.
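A sketch of this scoring step under the standard assumption that PSNR is computed per frame and then min-max normalized over the test video; taking S(t) as the complement of the normalized PSNR (so that a large S flags an anomaly, consistent with claim 1) is an assumption about the formula that survives only as an image.

```python
# Sketch of claim-9-style scoring: per-frame PSNR, min-max normalized over the
# video; the score direction (larger S = more abnormal) is an assumption.
import numpy as np

def psnr(real, pred):
    mse = np.mean((real - pred) ** 2)
    return 10.0 * np.log10((pred.max() ** 2) / (mse + 1e-8))

def anomaly_scores(real_frames, pred_frames):
    p = np.array([psnr(r, f) for r, f in zip(real_frames, pred_frames)])
    return 1.0 - (p - p.min()) / (p.max() - p.min() + 1e-8)   # S(t) in [0, 1]

# usage: scores = anomaly_scores(reals, preds); is_abnormal = scores > threshold
```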
CN202010878108.6A 2020-08-27 2020-08-27 Video abnormal event detection method based on two-way review generation countermeasure network Active CN112052763B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010878108.6A CN112052763B (en) 2020-08-27 2020-08-27 Video abnormal event detection method based on two-way review generation countermeasure network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010878108.6A CN112052763B (en) 2020-08-27 2020-08-27 Video abnormal event detection method based on two-way review generation countermeasure network

Publications (2)

Publication Number Publication Date
CN112052763A true CN112052763A (en) 2020-12-08
CN112052763B CN112052763B (en) 2024-02-09

Family

ID=73600525

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010878108.6A Active CN112052763B (en) 2020-08-27 2020-08-27 Video abnormal event detection method based on two-way review generation countermeasure network

Country Status (1)

Country Link
CN (1) CN112052763B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020037965A1 (en) * 2018-08-21 2020-02-27 北京大学深圳研究生院 Method for multi-motion flow deep convolutional network model for video prediction
US20200134804A1 (en) * 2018-10-26 2020-04-30 Nec Laboratories America, Inc. Fully convolutional transformer based generative adversarial networks
CN109711280A (en) * 2018-12-10 2019-05-03 北京工业大学 A kind of video abnormality detection method based on ST-Unet
CN109919032A (en) * 2019-01-31 2019-06-21 华南理工大学 A kind of video anomaly detection method based on action prediction
CN110568442A (en) * 2019-10-15 2019-12-13 中国人民解放军国防科技大学 Radar echo extrapolation method based on confrontation extrapolation neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
袁帅; 秦贵和; 晏婕: "Road condition video frame prediction model using a residual generative adversarial network", Journal of Xi'an Jiaotong University (西安交通大学学报), no. 10 *
陈莹; 何丹丹: "Spatio-temporal stream abnormal behavior detection model based on Bayesian fusion", Journal of Electronics & Information Technology (电子与信息学报), no. 05 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112633180B (en) * 2020-12-25 2022-05-24 浙江大学 Video anomaly detection method and system based on dual memory module
CN112633180A (en) * 2020-12-25 2021-04-09 浙江大学 Video anomaly detection method and system based on dual memory module
CN112819831A (en) * 2021-01-29 2021-05-18 北京小白世纪网络科技有限公司 Segmentation model generation method and device based on convolution Lstm and multi-model fusion
CN112819831B (en) * 2021-01-29 2024-04-19 北京小白世纪网络科技有限公司 Segmentation model generation method and device based on convolution Lstm and multi-model fusion
CN113096673A (en) * 2021-03-30 2021-07-09 山东省计算中心(国家超级计算济南中心) Voice processing method and system based on generation countermeasure network
CN113283849A (en) * 2021-07-26 2021-08-20 山东新北洋信息技术股份有限公司 Logistics abnormity intelligent detection method based on video context association
CN113283849B (en) * 2021-07-26 2021-11-02 山东建筑大学 Logistics abnormity intelligent detection method based on video context association
CN113435432A (en) * 2021-08-27 2021-09-24 腾讯科技(深圳)有限公司 Video anomaly detection model training method, video anomaly detection method and device
CN113810611A (en) * 2021-09-17 2021-12-17 北京航空航天大学 Data simulation method and device for event camera
CN113947612A (en) * 2021-09-28 2022-01-18 西安电子科技大学广州研究院 Video anomaly detection method based on foreground and background separation
CN113947612B (en) * 2021-09-28 2024-03-29 西安电子科技大学广州研究院 Video anomaly detection method based on foreground and background separation
CN114067251A (en) * 2021-11-18 2022-02-18 西安交通大学 Unsupervised monitoring video prediction frame abnormity detection method
CN114067251B (en) * 2021-11-18 2023-09-15 西安交通大学 Method for detecting anomaly of unsupervised monitoring video prediction frame
CN116756575A (en) * 2023-08-17 2023-09-15 山东科技大学 Non-invasive load decomposition method based on BGAIN-DD network
CN116756575B (en) * 2023-08-17 2023-11-03 山东科技大学 Non-invasive load decomposition method based on BGAIN-DD network

Also Published As

Publication number Publication date
CN112052763B (en) 2024-02-09

Similar Documents

Publication Publication Date Title
CN112052763B (en) Video abnormal event detection method based on two-way review generation countermeasure network
CN111476717B (en) Face image super-resolution reconstruction method based on self-attention generation countermeasure network
CN111652903B (en) Pedestrian target tracking method based on convolution association network in automatic driving scene
Li et al. DLEP: A deep learning model for earthquake prediction
CN114978613B (en) Network intrusion detection method based on data enhancement and self-supervision feature enhancement
CN115601661A (en) Building change detection method for urban dynamic monitoring
CN116522265A (en) Industrial Internet time sequence data anomaly detection method and device
Fan et al. Structural dynamic response reconstruction using self-attention enhanced generative adversarial networks
CN113222824B (en) Infrared image super-resolution and small target detection method
Wang et al. Edge computing-enabled crowd density estimation based on lightweight convolutional neural network
CN113379597A (en) Face super-resolution reconstruction method
CN116206214A (en) Automatic landslide recognition method, system, equipment and medium based on lightweight convolutional neural network and double attention
CN115830707A (en) Multi-view human behavior identification method based on hypergraph learning
Li et al. Towards accurate and reliable change detection of remote sensing images via knowledge review and online uncertainty estimation
Fang et al. An attention-based U-Net network for anomaly detection in crowded scenes
CN111553371B (en) Image semantic description method and system based on multi-feature extraction
CN114220169A (en) Lightweight real-time monitoring abnormal behavior detection method based on Yolo-TSM
He et al. Remote Sensing Image Scene Classification Based on ECA Attention Mechanism Convolutional Neural Network
CN113947612B (en) Video anomaly detection method based on foreground and background separation
Wan et al. Siamese Attentive Convolutional Network for Effective Remote Sensing Image Change Detection
CN113869514B (en) Multi-knowledge integration and optimization method based on genetic algorithm
CN115456957B (en) Method for detecting change of remote sensing image by full-scale feature aggregation
CN115100487A (en) Stereo image significance detection method based on multi-layer cross-modal integrated network
Guo et al. Semantic-driven automatic filter pruning for neural networks
KR20230086233A (en) A Reconstructed Convolutional Neural Network Using Conditional Min Pooling and Method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant