CN112052763A - Video abnormal event detection method based on bidirectional review generation countermeasure network - Google Patents

Video abnormal event detection method based on bidirectional review generation countermeasure network

Info

Publication number
CN112052763A
CN112052763A (application CN202010878108.6A); granted publication CN112052763B
Authority
CN
China
Prior art keywords
layer
network
convolution
frame
sequence
Prior art date
Legal status
Granted
Application number
CN202010878108.6A
Other languages
Chinese (zh)
Other versions
CN112052763B (en)
Inventor
刘静
杨智伟
Current Assignee
Xidian University
Original Assignee
Xidian University
Priority date
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202010878108.6A priority Critical patent/CN112052763B/en
Publication of CN112052763A publication Critical patent/CN112052763A/en
Application granted granted Critical
Publication of CN112052763B publication Critical patent/CN112052763B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06V 20/40: Scenes; Scene-specific elements in video content
    • G06V 20/44: Event detection
    • G06N 3/045: Neural networks; Combinations of networks
    • G06N 3/048: Neural networks; Activation functions
    • G06N 3/084: Learning methods; Backpropagation, e.g. using gradient descent


Abstract

The invention provides a video abnormal event detection method based on a bidirectional review generation countermeasure network. It addresses the problem that the prior art neither exploits the reverse mapping relation among video frame sequences nor applies motion constraints from the perspective of long-term temporal consistency of the frame sequences, so that abnormal events in video are detected with insufficient accuracy. The implementation steps are as follows: a generation countermeasure network composed of a generator, a frame discriminator and a sequence discriminator is constructed; during training, a bidirectional review scheme combining forward prediction, backward prediction and retrospective prediction is adopted, and the network is trained by alternately updating the generator, the frame discriminator and the sequence discriminator; this yields a generator that can accurately predict the future frame image of a normal event in a video but cannot accurately predict the future frame image of an abnormal event, and whether an abnormal event occurs is detected according to the prediction error.

Description

Video abnormal event detection method based on bidirectional review generation countermeasure network
Technical Field
The invention belongs to the technical field of image processing, and further relates to a video abnormal event detection method based on a bidirectional review generation countermeasure network in the technical field of computer vision. The method can be used for detecting abnormal events in video surveillance images.
Background
In recent years, intelligent security has attracted increasing attention, in particular the automatic detection of abnormal events in video surveillance. Such detection plays a vital role in improving the response to and handling of abnormal events in public places, maintaining public safety and reducing property loss. Currently, there are two main approaches to detecting abnormal events in video surveillance. The first learns video data of normal behaviors to build a model, such as a sparse dictionary or an auto-encoder, that reconstructs the current video frame image: the trained reconstruction model reconstructs the current frame of a normal event well, whereas the reconstructed frame of an abnormal event has a larger reconstruction error, and abnormal events are detected according to this error. However, the reconstruction model has a large fault-tolerant capacity and can sometimes reconstruct abnormal frames well, so the detection accuracy is low. The second learns video data of normal behaviors to build a model that predicts a future frame image: the model predicts the future frame of a normal event well, while an abnormal event is unpredictable, so a large error arises between the predicted future frame and the real frame, and whether an abnormal event occurs can be detected according to this error.
Wen Liu et al., in the paper "Future Frame Prediction for Anomaly Detection - A New Baseline" (Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 6536-6545), propose a video anomaly detection method based on future frame prediction. The method adopts Unet as the prediction network to predict future frames of video data. During training, the future frame image is predicted forward from the first four frame images; appearance constraints are imposed on the predicted future frame by minimizing the gradient loss and the intensity loss between the predicted and real future frames, motion constraints are imposed by minimizing the optical flow loss between them, and a generative adversarial network is combined to optimize the model. Because the method constrains both the appearance of the predicted future frame and, via optical flow, its motion, it predicts the future frames of normal video well, while abnormal video remains unpredictable and produces a larger error between the predicted and real future frames, which improves abnormal event detection. However, the method still has shortcomings: only forward prediction is performed during training and the reverse mapping relation between video frame sequences is not used, so the prediction of future frames of normal events is insufficient. Moreover, motion constraints are applied only between the predicted future frame and the previous frame through optical flow, not from the perspective of long-term temporal consistency of the frame sequence, so the motion constraint on the predicted future frame of a normal event is insufficient, larger errors also appear in the predicted future frames of normal events, the ability of the prediction network to distinguish normal and abnormal event patterns is reduced, and the video anomaly detection effect deteriorates.
The patent document "Video anomaly detection method based on ST-Unet" (application No. 201811501290.2, publication No. CN109711280A) filed by Beijing University of Technology proposes an ST-Unet network that exploits the spatio-temporal information of video data for video anomaly detection. The method builds a new ST-Unet network by adding ConvLSTM into Unet. When training the ST-Unet network, the method first forward-predicts a future frame image from the first four video frames and reconstructs the last frame of the input, and then optimizes the model by minimizing the differences between the predicted and real future frames and between the reconstructed image and the real last input frame, in combination with a generative adversarial network. In the testing stage, the error between the future frame predicted by the ST-Unet network and the real future frame and the error between the reconstructed image and the real input image are computed to obtain a prediction error and a reconstruction error, the two errors are combined by weighted summation into an anomaly score, and whether an anomaly occurs is judged according to this score. Adding ConvLSTM to Unet captures the spatio-temporal information of the video data, and combining the reconstruction error further improves detection. However, the method still has shortcomings: only forward prediction is performed during training and the reverse mapping relation between video frame sequences is not used; combining a prediction model with a reconstruction model makes the detection network overly complex and reduces detection efficiency; and, owing to the diversity of scenes, a suitable weighting between the reconstruction error and the prediction error is hard to choose, so the detection performance is unstable as scenes change.
Disclosure of Invention
The purpose of the invention is to provide a video abnormal event detection method based on a bidirectional review generation countermeasure network that overcomes the defects of the prior art, namely that the reverse mapping relation among video frame sequences is not exploited and that motion constraints are not applied from the perspective of long-term temporal consistency of the frame sequences, which leaves the detection accuracy for abnormal events in video insufficient.
To achieve this purpose, the invention constructs a bidirectional review generation countermeasure network consisting of a generator, a frame discriminator and a sequence discriminator. By adopting a bidirectional review training mode, the generator performs both forward prediction and backward prediction, so the bidirectional mapping relation between normal video frame sequences is fully mined and a more accurate future frame image of a normal event is predicted. In addition, a sequence discriminator built from 3D convolutional layers captures the long-term temporal information between the predicted frame and the input frame sequence, and its discrimination loss imposes a motion constraint that keeps the predicted frame image consistent in motion with the real frame image from the perspective of long-term temporal consistency, thereby strengthening the ability of the prediction network to distinguish normal from abnormal motion patterns in video and improving abnormal event detection.
The method comprises the following specific steps:
(1) constructing a generation countermeasure network:
(1a) a generator network with 15 layers is built, and the structure of the generator network sequentially comprises the following steps: input layer → first convolution layer → first normalization layer → first activation function layer → second convolution layer → second normalization layer → second activation function layer → first downsampling layer combination → second downsampling layer combination → third downsampling layer combination → first upsampling layer combination → second upsampling layer combination → third convolutional layer → output layer; the structure of each downsampling layer combination is as follows in sequence: the first max pooling layer → the first convolution layer → the first normalization layer → the first activation function layer → the second convolution layer → the second normalization layer → the second activation function layer; the structure of each up-sampling layer combination is as follows in sequence: the first deconvolution layer → the first convolution layer → the first normalization layer → the first activation function layer → the second convolution layer → the second normalization layer → the second activation function layer; the feature maps output by the first, second and third down-sampling layer combinations are spliced and fused with the feature maps output by the first, second and third up-sampling layer combinations respectively;
the parameters of each layer in the generator network are set as follows: the sizes of the convolution kernels in the first, second and third convolution layers are all set to 3×3, the convolution strides are all set to 2, and the number of convolution kernels is 64; the first and second activation function layers are realized with the ReLU function;
the pooling kernel size of the max pooling layer in each downsampling layer combination is set to 2×2 and the pooling stride to 2; the convolution kernel sizes of the convolution layers are all set to 3×3, the convolution strides are all set to 2, and the numbers of convolution kernels are 128, 256 and 512 respectively; the activation function layers are all realized with the ReLU function;
the convolution kernel size of the deconvolution layer in each upsampling layer combination is set to 2×2 and the convolution stride to 2; the convolution kernel sizes of the convolution layers are all set to 3×3, the convolution strides are all set to 2, and the numbers of convolution kernels are 256, 128 and 64; the activation function layers are all realized with the ReLU function;
(1b) a 14-layer frame discriminator network is built, and its structure is, in order: input layer → first convolution layer → first activation function layer → second convolution layer → first normalization layer → second activation function layer → third convolution layer → second normalization layer → third activation function layer → fourth convolution layer → fourth activation function layer → fifth convolution layer → fifth activation function layer → output layer;
the parameters of each layer in the frame discriminator network are set as follows: the convolution kernel sizes of the first, second, third, fourth and fifth convolution layers are all set to 3×3, the convolution strides are all set to 2, and the numbers of convolution kernels are set to 128, 256, 512 and 1 in sequence; the first and second normalization layers are both realized with the BatchNorm2d function; the first, second, third and fourth activation function layers are all realized with the LeakyReLU function with slope 0.2; the fifth activation function layer is realized with the Sigmoid function;
(1c) a 16-layer sequence discriminator network is built, and its structure is, in order: input layer → first 3D convolution layer → first 3D max pooling layer → first normalization layer → first activation function layer → second 3D convolution layer → second 3D max pooling layer → second normalization layer → second activation function layer → third 3D convolution layer → third 3D max pooling layer → third normalization layer → third activation function layer → fourth 3D convolution layer → fourth activation function layer → output layer;
the parameters of each layer in the sequence discriminator network are set as follows: the convolution kernel sizes of the first, second, third and fourth 3D convolution layers are all set to 2×3, the convolution strides are all set to 1×2, and the numbers of convolution kernels are set to 128, 256, 512 and 1 in sequence; the pooling kernel sizes of the first, second and third 3D max pooling layers are all set to 2×3 and the pooling strides to 1×2; the first, second and third normalization layers are all realized with the BatchNorm3d function; the first, second and third activation function layers are all realized with the LeakyReLU function with slope 0.2; the fourth activation function layer is realized with the Sigmoid function;
(1d) respectively cascading a generator network, a frame discriminator network and a sequence discriminator network to form a generation countermeasure network;
(2) initializing the generation of the countermeasure network:
initializing weights of all convolution layers and normalization layers in the generated countermeasure network to random values satisfying normal distribution; wherein the mean value of the normal distribution is 0, and the standard deviation is 0.02;
(3) generating a training data set:
selecting continuous monitoring video that is T minutes long and does not contain any abnormal events, and sequentially cutting it into groups of video frame sequences of length 5 and size W×H to form a training data set; where T > 10, W and H respectively denote the width and height of each frame image, 64 ≤ W ≤ 256, 64 ≤ H ≤ 256, and the units of W and H are pixels;
(4) training the generator network in a two-way review mode:
(4a) arranging the first 4 frames of each group of video frame sequences into a forward video frame sequence according to a forward time sequence, and arranging the last 4 frames of each group of video frame sequences into a reverse video frame sequence according to a reverse time sequence;
(4b) inputting the forward video frame sequence into a generator network for forward prediction, and outputting a forward prediction frame image; inputting the reverse video frame sequence into a generator network for backward prediction, and outputting a backward prediction frame image;
(4c) adding the forward predicted frame image into the previous video frame sequence used for forward prediction, arranging the last 4 frames of the expanded video frame sequence according to a reverse time sequence, inputting the arranged frames into a generator network for backward retrospective prediction, and outputting a backward retrospective predicted frame image; adding the backward prediction frame images into the previous video frame sequence used for backward prediction, then arranging the first 4 frames of the expanded video frame sequence according to the forward time sequence, inputting the arranged frames into a generator network for forward retrospective prediction, and outputting forward retrospective prediction frame images;
(4d) calculating a loss value of a generator network according to a generator network loss function constructed according to errors between a plurality of predicted frame images and real frame images generated by bidirectional prediction and retrospective prediction; reversely transmitting the loss value of the generator network by using a gradient descent method, and calculating all gradients of each convolution layer and each convolution kernel in each deconvolution layer of the generator network; iteratively updating all weights of each convolution kernel in each convolutional layer and each deconvolution layer of the generator network using an Adam optimizer according to all gradients of each convolution kernel in each convolutional layer and each deconvolution layer of the generator network; the initial learning rate of the Adam optimizer is 0.0002;
(5) training a frame discriminator network:
(5a) sequentially inputting a forward predicted frame image and a real image thereof, a backward predicted frame image and a real image thereof, a forward retrospective predicted frame image and a real image thereof, and a backward retrospective predicted frame image and a real image thereof into a frame discriminator network, and outputting corresponding true-false probabilities by the frame discriminator network;
(5b) calculating a loss value of the frame discriminator network according to a frame discriminator loss function constructed by the true and false probabilities output by the frame discriminator network; reversely transmitting the loss value of the frame discriminator network by using a gradient descent method, and calculating all gradients of each convolution kernel of each convolution layer of the frame discriminator network and all gradients of a normalization layer; iteratively updating all weights of each convolution kernel of each convolution layer of the frame arbiter network and all weights of the normalization layer by using an Adam optimizer according to all gradients of each convolution kernel of each convolution layer of the frame arbiter network and all gradients of the normalization layer; the initial learning rate of the Adam optimizer is 0.00002;
(6) training a sequence discriminator network:
(6a) a video frame sequence formed by a forward predicted frame image, a backward predicted frame image, a retrospective predicted frame image and a corresponding input frame image and a corresponding real video frame sequence are sequentially input into a sequence discriminator network, and the sequence discriminator network outputs corresponding authenticity probability;
(6b) calculating a loss value of the sequence discriminator network according to a sequence discriminator loss function constructed according to the authenticity probability output by the sequence discriminator network; reversely transmitting the loss value of the sequence discriminator network by using a gradient descent method, and calculating all gradients of each convolution kernel of each convolution layer of the sequence discriminator network and all gradients of a normalization layer; iteratively updating all weights of each convolution kernel of each convolution layer of the sequence discriminator network and all weights of the normalization layer by using an Adam optimizer according to all gradients of each convolution kernel of each convolution layer of the sequence discriminator network and all gradients of the normalization layer; the initial learning rate of the Adam optimizer is 0.00002;
(7) judging whether the generator network loss function is converged, if so, executing the step (8), otherwise, executing the step (4);
(8) finishing the training of generating the countermeasure network by two-way review to obtain the trained generator network weight, and storing all the weights of each convolution layer and each convolution kernel of each deconvolution layer of the generator network in the trained two-way review generation countermeasure network;
(9) detecting the video:
sequentially cutting the video to be detected into video frame sequences of length 5 and size M×N, inputting the first 4 frames of each sequence into the trained generator network, and outputting a predicted future frame image; calculating an anomaly score S from the error between the predicted future frame image and the real 5th frame image of the sequence; if the anomaly score S exceeds a set threshold, the future frame image is judged abnormal, otherwise it is judged not abnormal; where the values of M and N equal the values of W and H, and the anomaly score satisfies 0 ≤ S ≤ 1.
Compared with the prior art, the invention has the following advantages:
firstly, the invention constructs a generation countermeasure network composed of a generator, a frame discriminator and a sequence discriminator and trains it in a bidirectional review mode that combines forward prediction, backward prediction and retrospective prediction. This overcomes the problem in the prior art that only forward prediction is performed and the reverse mapping relation between video frame sequences is not exploited, which leaves the prediction of future frames of normal events insufficient. As a result, the prediction network of the invention distinguishes the appearance patterns of normal and abnormal events more strongly, which improves the detection of appearance anomalies in video.
Secondly, because the sequence discriminator in the generation countermeasure network of the invention is composed of 3D convolutional layers, it can capture the long-term temporal relation between video frame sequences and impose a motion constraint through its discrimination loss. This overcomes the problem in the prior art that the predicted future frame image is not motion-constrained from the perspective of long-term temporal consistency of the frame sequence, which leaves the motion constraint on predicted future frames of normal events insufficient. As a result, the video abnormal event detection network of the invention distinguishes normal from abnormal motion patterns more strongly, which improves the detection of motion anomalies in video.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of the overall architecture of the generator network of the present invention; wherein fig. 2(a) is a schematic diagram of a structure of a generator network of the present invention, fig. 2(b) is a schematic diagram of a down-sampling layer combination in the generator network, and fig. 2(c) is a schematic diagram of an up-sampling layer combination in the generator network;
FIG. 3 is a schematic diagram of a frame discriminator network in the generation countermeasure network according to the present invention;
FIG. 4 is a diagram illustrating a sequence discriminator network in the generation countermeasure network according to the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
The specific steps of the present invention are further described with reference to fig. 1:
step 1, constructing a generation countermeasure network.
The specific structure of the constructed generator network will be further described with reference to fig. 2 (a).
A generator network with 15 layers is built, and the structure of the generator network sequentially comprises the following steps: input layer → first convolution layer → first normalization layer → first activation function layer → second convolution layer → second normalization layer → second activation function layer → first downsampling layer combination → second downsampling layer combination → third downsampling layer combination → first upsampling layer combination → second upsampling layer combination → third convolutional layer → output layer.
The specific structure of the down-sampling layer combination in the generator network is further described with reference to fig. 2 (b).
The structure of each downsampling layer combination is as follows in sequence: the first max pooling layer → the first convolution layer → the first normalization layer → the first activation function layer → the second convolution layer → the second normalization layer → the second activation function layer.
The specific structure of the up-sampling layer combination in the generator network is further described with reference to fig. 2 (c).
The structure of each up-sampling layer combination is as follows: first deconvolution layer → first convolution layer → first normalization layer → first activation function layer → second convolution layer → second normalization layer → second activation function layer.
And the feature maps output by the first, second and third down-sampling layer combinations are spliced and fused with the feature maps output by the first, second and third up-sampling layer combinations respectively.
The parameters of each layer in the generator network are set as follows: the sizes of the convolution kernels in the first, second and third convolution layers are all set to 3×3, the convolution strides are all set to 2, and the number of convolution kernels is 64; the first and second activation function layers are both realized with the ReLU function.
The pooling kernel size of the max pooling layer in each downsampling layer combination is set to 2×2 and the pooling stride to 2; the convolution kernel sizes of the convolution layers are all set to 3×3, the convolution strides are all set to 2, and the numbers of convolution kernels are 128, 256 and 512 respectively; the activation function layers are all realized with the ReLU function.
The convolution kernel size of the deconvolution layer in each upsampling layer combination is set to 2×2 and the convolution stride to 2; the convolution kernel sizes of the convolution layers are all set to 3×3, the convolution strides are all set to 2, and the numbers of convolution kernels are 256, 128 and 64; the activation function layers are all realized with the ReLU function.
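For illustration, a minimal PyTorch sketch of this generator is given below. It assumes that the 4 input frames are stacked along the channel axis, that the 3×3 convolutions use stride 1 so the skip-connected feature maps match in spatial size, that skip connections join down- and up-sampling combinations of equal resolution, and that the final convolution maps back to the image channels; it is a sketch of the described architecture, not the patented implementation itself.

```python
# Hedged sketch of the U-Net-style generator described above (PyTorch).
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # convolution -> normalization -> ReLU, applied twice
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=1, padding=1),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

class Generator(nn.Module):
    def __init__(self, in_ch=12, out_ch=3):  # 4 RGB frames stacked -> 12 channels (assumption)
        super().__init__()
        self.enc0 = conv_block(in_ch, 64)
        self.down1 = nn.Sequential(nn.MaxPool2d(2), conv_block(64, 128))
        self.down2 = nn.Sequential(nn.MaxPool2d(2), conv_block(128, 256))
        self.down3 = nn.Sequential(nn.MaxPool2d(2), conv_block(256, 512))
        self.up1 = nn.ConvTranspose2d(512, 256, 2, stride=2)
        self.dec1 = conv_block(512, 256)      # concatenated with the 256-channel skip
        self.up2 = nn.ConvTranspose2d(256, 128, 2, stride=2)
        self.dec2 = conv_block(256, 128)
        self.up3 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec3 = conv_block(128, 64)
        self.out = nn.Conv2d(64, out_ch, 3, padding=1)

    def forward(self, x):
        e0 = self.enc0(x)
        e1 = self.down1(e0)
        e2 = self.down2(e1)
        e3 = self.down3(e2)
        d1 = self.dec1(torch.cat([self.up1(e3), e2], dim=1))   # splice-and-fuse skip connections
        d2 = self.dec2(torch.cat([self.up2(d1), e1], dim=1))
        d3 = self.dec3(torch.cat([self.up3(d2), e0], dim=1))
        return self.out(d3)
```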
The specific structure of the frame discriminator network constructed by the present invention will be further described with reference to fig. 3.
A 14-layer frame discriminator network is built; its structure is, in order: input layer → first convolution layer → first activation function layer → second convolution layer → first normalization layer → second activation function layer → third convolution layer → second normalization layer → third activation function layer → fourth convolution layer → fourth activation function layer → fifth convolution layer → fifth activation function layer → output layer.
The parameters of each layer in the frame discriminator network are set as follows: the convolution kernel sizes of the first, second, third, fourth and fifth convolution layers are all set to 3×3, the convolution strides are all set to 2, and the numbers of convolution kernels are set to 128, 256, 512 and 1 in sequence; the first and second normalization layers are both realized with the BatchNorm2d function; the first, second, third and fourth activation function layers are all realized with the LeakyReLU function with slope 0.2; the fifth activation function layer is realized with the Sigmoid function.
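A minimal PyTorch sketch of the frame discriminator follows; since the text lists four kernel counts (128, 256, 512, 1) for five convolution layers, the channel progression 128 → 256 → 512 → 512 → 1 and the padding used here are assumptions.

```python
# Hedged sketch of the frame discriminator (PyTorch).
import torch.nn as nn

class FrameDiscriminator(nn.Module):
    def __init__(self, in_ch=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 128, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.BatchNorm2d(256), nn.LeakyReLU(0.2),
            nn.Conv2d(256, 512, 3, stride=2, padding=1), nn.BatchNorm2d(512), nn.LeakyReLU(0.2),
            nn.Conv2d(512, 512, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(512, 1, 3, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x)  # per-patch real/fake probabilities
```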
The specific structure of the sequence discriminator network constructed by the present invention will be further described with reference to fig. 4.
A 16-layer sequence discriminator network is built; its structure is, in order: input layer → first 3D convolution layer → first 3D max pooling layer → first normalization layer → first activation function layer → second 3D convolution layer → second 3D max pooling layer → second normalization layer → second activation function layer → third 3D convolution layer → third 3D max pooling layer → third normalization layer → third activation function layer → fourth 3D convolution layer → fourth activation function layer → output layer.
The parameters of each layer in the sequence discriminator network are set as follows: the convolution kernel sizes of the first, second, third and fourth 3D convolution layers are all set to 2×3, the convolution strides are all set to 1×2, and the numbers of convolution kernels are set to 128, 256, 512 and 1 in sequence; the pooling kernel sizes of the first, second and third 3D max pooling layers are all set to 2×3 and the pooling strides to 1×2; the first, second and third normalization layers are all realized with the BatchNorm3d function; the first, second and third activation function layers are all realized with the LeakyReLU function with slope 0.2; the fourth activation function layer is realized with the Sigmoid function.
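A minimal PyTorch sketch of the sequence discriminator follows, assuming that the 2×3 kernels and 1×2 strides denote (temporal, spatial) shapes of 2×3×3 and 1×2×2, that pooling acts only spatially so a 5-frame clip is reduced to a single temporal slice at the output, and that padding is chosen for convenience.

```python
# Hedged sketch of the 3D-convolutional sequence discriminator (PyTorch).
import torch.nn as nn

class SequenceDiscriminator(nn.Module):
    def __init__(self, in_ch=3):
        super().__init__()

        def block(cin, cout):
            return nn.Sequential(
                nn.Conv3d(cin, cout, kernel_size=(2, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
                nn.MaxPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2)),  # spatial pooling only (assumption)
                nn.BatchNorm3d(cout),
                nn.LeakyReLU(0.2),
            )

        self.features = nn.Sequential(block(in_ch, 128), block(128, 256), block(256, 512))
        self.head = nn.Sequential(
            nn.Conv3d(512, 1, kernel_size=(2, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
            nn.Sigmoid(),
        )

    def forward(self, x):          # x: (batch, channels, frames, height, width)
        return self.head(self.features(x))
```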
And respectively cascading the generator network with the frame discriminator network and the sequence discriminator network to form a generation countermeasure network.
And 2, initializing to generate the countermeasure network.
Initializing weights of all convolution layers and normalization layers in the generated countermeasure network to random values satisfying normal distribution; wherein, the mean value of the normal distribution is 0, and the standard deviation is 0.02.
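For illustration, a minimal PyTorch sketch of this initialization is given below; applying it through Module.apply is an assumption about how the three networks are initialized in practice.

```python
import torch.nn as nn

def init_weights(m):
    # convolution, deconvolution and normalization weights ~ N(0, 0.02), as described above
    if isinstance(m, (nn.Conv2d, nn.Conv3d, nn.ConvTranspose2d)):
        nn.init.normal_(m.weight, mean=0.0, std=0.02)
    elif isinstance(m, (nn.BatchNorm2d, nn.BatchNorm3d)):
        nn.init.normal_(m.weight, mean=0.0, std=0.02)  # the text gives mean 0; DCGAN practice uses mean 1
        nn.init.zeros_(m.bias)

# usage (assumed): generator.apply(init_weights); frame_d.apply(init_weights); seq_d.apply(init_weights)
```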
And 3, generating a training data set.
Selecting continuous monitoring video that is T minutes long and does not contain any abnormal events, and sequentially cutting it into groups of video frame sequences of length 5 and size W×H to form a training data set; where T > 10, W and H respectively denote the width and height of each frame image, 64 ≤ W ≤ 256, 64 ≤ H ≤ 256, and the units of W and H are pixels.
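The following sketch illustrates one way to build such a training set with OpenCV; the use of cv2, the non-overlapping step of 5 frames between clips and the video path argument are assumptions made for illustration only.

```python
# Hedged sketch of training-clip generation: a normal-only video is decoded,
# resized to width x height, and cut into consecutive 5-frame sequences.
import cv2
import numpy as np

def make_clips(video_path, width=256, height=256, clip_len=5):
    cap = cv2.VideoCapture(video_path)
    frames, clips = [], []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.resize(frame, (width, height)))
    cap.release()
    for start in range(0, len(frames) - clip_len + 1, clip_len):
        clips.append(np.stack(frames[start:start + clip_len]))   # shape (5, H, W, 3)
    return clips
```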
And 4, training the generator network by adopting a bidirectional review mode.
Step 1, arranging the first 4 frames of each group of video frame sequences into a forward video frame sequence according to a forward time sequence, and arranging the last 4 frames of each group of video frame sequences into a reverse video frame sequence according to a reverse time sequence;
step 2, inputting the forward video frame sequence into a generator network for forward prediction, and outputting a forward prediction frame image; and inputting the reverse video frame sequence into a generator network for backward prediction, and outputting a backward prediction frame image.
The forward prediction is realized by the following formula:

x'_{j+1} = G(X_{i:j})

where x'_{j+1} denotes the (j+1)-th video frame image output by the forward prediction of the generator network, G(·) denotes the output of the generator network in the bidirectional review generation countermeasure network, X_{i:j} denotes the forward video frame sequence formed in step 1 by arranging the first 4 frames of each group of video frame sequences in forward chronological order, and i and j respectively denote the starting and ending frames of that sequence, with j - i + 1 = 4.
The backward prediction is realized by the following formula:

x'_i = G(X̃_{j+1:i+1})

where x'_i denotes the i-th video frame image output by the backward prediction of the generator network, and X̃_{j+1:i+1} denotes the reverse video frame sequence formed in step 1 by arranging the last 4 frames of each group of video frame sequences in reverse chronological order.
Step 3, adding the forward predicted frame image into the previous video frame sequence used for forward prediction, arranging the last 4 frames of the expanded video frame sequence according to a reverse time sequence, inputting the arranged frames into a generator network for backward retrospective prediction, and outputting a backward retrospective predicted frame image; adding the backward prediction frame images into the previous video frame sequence used for backward prediction, arranging the first 4 frames of the expanded video frame sequence according to the forward time sequence, inputting the arranged frames into a generator network for forward retrospective prediction, and outputting forward retrospective prediction frame images.
The backward retrospective prediction is realized by the following formula:

x''_i = G(X̃'_{j+1:i+1})

where x''_i denotes the i-th video frame image output by the backward retrospective prediction of the generator network, and X̃'_{j+1:i+1} denotes the last 4 frames of the video frame sequence extended in step 3 with the forward predicted frame x'_{j+1}, arranged in reverse chronological order.
The forward retrospective prediction is realized by the following formula:

x''_{j+1} = G(X'_{i:j})

where x''_{j+1} denotes the (j+1)-th video frame image output by the forward retrospective prediction of the generator network, and X'_{i:j} denotes the first 4 frames of the video frame sequence extended in step 3 with the backward predicted frame x'_i, arranged in forward chronological order.
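For illustration, the four prediction passes can be sketched in PyTorch as follows, assuming a clip tensor of shape (batch, 5, channels, height, width) and a generator G that, as in the earlier generator sketch, takes 4 frames stacked along the channel axis; the frame names f1..f5 follow the clip order.

```python
# Hedged sketch of one bidirectional-review pass for a 5-frame clip.
import torch

def bidirectional_review(G, frames):
    B, T, C, H, W = frames.shape                      # T == 5
    fwd_in = frames[:, :4]                            # f1..f4, forward order
    bwd_in = frames[:, 1:].flip(dims=[1])             # f5..f2, reverse order

    x_fwd = G(fwd_in.reshape(B, -1, H, W))            # forward prediction of f5
    x_bwd = G(bwd_in.reshape(B, -1, H, W))            # backward prediction of f1

    # backward retrospective: append the predicted f5, take the last 4 frames reversed
    ext_fwd = torch.cat([frames[:, 1:4], x_fwd.unsqueeze(1)], dim=1).flip(dims=[1])
    x_bwd_retro = G(ext_fwd.reshape(B, -1, H, W))     # retrospective prediction of f1

    # forward retrospective: prepend the predicted f1, take the first 4 frames in order
    ext_bwd = torch.cat([x_bwd.unsqueeze(1), frames[:, 1:4]], dim=1)
    x_fwd_retro = G(ext_bwd.reshape(B, -1, H, W))     # retrospective prediction of f5

    return x_fwd, x_bwd, x_bwd_retro, x_fwd_retro
```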
And 4, calculating a loss value of the generator network according to a generator network loss function constructed by errors between a plurality of predicted frame images and real frame images generated by bidirectional prediction and retrospective prediction.
The generator network loss function is as follows:

L_G = 1·L + 1·L' + 0.005·L'' + 0.005·L'''

where L_G denotes the loss function of the generator network, · denotes multiplication, L denotes the intensity error loss between the predicted frame images output by the generator and the real images, L' denotes the gradient error loss between the predicted frame images and the real images, L'' denotes the frame adversarial loss of the generator network, and L''' denotes the sequence adversarial loss of the generator network;
the L, L ', L + and L' "are derived from the following equations, respectively:
Figure BDA0002653249010000106
Figure BDA0002653249010000111
Figure BDA0002653249010000112
Figure BDA0002653249010000113
wherein | · | purple sweet2Denotes a 2 norm operation, xiIs represented by x'iAnd x ″)iCorresponding real image, xjIs represented by x'jAnd x ″)jCorresponding real images, wherein K and L represent the size of each frame of image, the values of K and L are equal to the values of W and H, m and n respectively represent the position coordinates of pixels in the image, sigma represents summation operation, | | |1Represents a 1 normOperation DF(. D) shows the output of a frame arbiter network in a two-way look-back generation countermeasure networkS(. to) denotes the output of the sequence arbiter network in the two-way review generation countermeasure network, and @ denotes the time-series superposition operation.
Reversely transmitting the loss value of the generator network by using a gradient descent method, and calculating all gradients of each convolution layer and each convolution kernel in each deconvolution layer of the generator network; iteratively updating all weights of each convolution kernel in each convolutional layer and each deconvolution layer of the generator network using an Adam optimizer according to all gradients of each convolution kernel in each convolutional layer and each deconvolution layer of the generator network; the initial learning rate of the Adam optimizer is 0.0002.
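For illustration, a minimal PyTorch sketch of one generator update is given below, reusing the bidirectional_review helper from the earlier sketch; the mean-squared intensity term, the least-squares adversarial terms and the use of only the forward-prediction clip in the sequence term are simplifying assumptions rather than a verbatim implementation of the loss above.

```python
# Hedged sketch of one generator update with weights 1, 1, 0.005, 0.005.
import torch
import torch.nn.functional as F

def gradient_loss(pred, real):
    # L1 difference between the image gradients of the predicted and real frames
    dx_p, dy_p = pred[..., :, 1:] - pred[..., :, :-1], pred[..., 1:, :] - pred[..., :-1, :]
    dx_r, dy_r = real[..., :, 1:] - real[..., :, :-1], real[..., 1:, :] - real[..., :-1, :]
    return (dx_p.abs() - dx_r.abs()).abs().mean() + (dy_p.abs() - dy_r.abs()).abs().mean()

def generator_step(G, D_frame, D_seq, frames, opt_G):
    x_fwd, x_bwd, x_bwd_retro, x_fwd_retro = bidirectional_review(G, frames)
    real_last, real_first = frames[:, 4], frames[:, 0]

    preds = [(x_fwd, real_last), (x_bwd, real_first),
             (x_fwd_retro, real_last), (x_bwd_retro, real_first)]
    l_int = sum(F.mse_loss(p, r) for p, r in preds)                    # intensity loss L
    l_gd = sum(gradient_loss(p, r) for p, r in preds)                  # gradient loss L'
    l_adv_f = sum(((D_frame(p) - 1) ** 2).mean() for p, _ in preds)    # frame adversarial loss L''
    seq = torch.cat([frames[:, :4], x_fwd.unsqueeze(1)], dim=1)        # input frames + predicted frame
    l_adv_s = ((D_seq(seq.transpose(1, 2)) - 1) ** 2).mean()           # sequence adversarial loss L'''

    loss = 1 * l_int + 1 * l_gd + 0.005 * l_adv_f + 0.005 * l_adv_s
    opt_G.zero_grad(); loss.backward(); opt_G.step()
    return loss.item()
```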
And 5, training the frame discriminator network.
Step 1, a forward predicted frame image and a real image thereof, a backward predicted frame image and a real image thereof, a forward retrospective predicted frame image and a real image thereof, and a backward retrospective predicted frame image and a real image thereof are sequentially input into a frame discriminator network, and the frame discriminator network outputs corresponding true-false probabilities.
And 2, calculating a loss value of the frame discriminator network according to a frame discriminator loss function constructed by the true and false probabilities output by the frame discriminator network.
The frame discriminator loss function has the following form:

L_{D_F} = Σ [ (1/2)·(D_F(x) - 1)^2 + (1/2)·D_F(x̂)^2 ]

where L_{D_F} denotes the frame discriminator loss function, and the summation runs over the pairs formed by each real image x and the corresponding predicted frame image x̂ (the forward, backward, forward retrospective and backward retrospective predictions).
Reversely transmitting the loss value of the frame discriminator network by using a gradient descent method, and calculating all gradients of each convolution kernel of each convolution layer of the frame discriminator network and all gradients of a normalization layer; iteratively updating all weights of each convolution kernel of each convolution layer of the frame arbiter network and all weights of the normalization layer by using an Adam optimizer according to all gradients of each convolution kernel of each convolution layer of the frame arbiter network and all gradients of the normalization layer; the initial learning rate of the Adam optimizer is 0.00002.
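A corresponding sketch of one frame-discriminator update follows; pairing each detached predicted frame with its real counterpart follows the description above, while the least-squares objective is an assumption.

```python
# Hedged sketch of one frame-discriminator update.
def frame_discriminator_step(D_frame, preds_and_reals, opt_Df):
    loss = 0.0
    for pred, real in preds_and_reals:                     # e.g. the four (prediction, real) pairs
        loss = loss + 0.5 * ((D_frame(real) - 1) ** 2).mean() \
                    + 0.5 * (D_frame(pred.detach()) ** 2).mean()
    opt_Df.zero_grad(); loss.backward(); opt_Df.step()
    return loss.item()
```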
And 6, training the sequence discriminator network.
Step 1, a video frame sequence formed by a forward predicted frame image, a backward predicted frame image, a retrospective predicted frame image and a corresponding input frame image and a corresponding real video frame sequence are sequentially input into a sequence discriminator network, and the sequence discriminator network outputs corresponding authenticity probability.
And 2, calculating the loss value of the sequence discriminator network according to a sequence discriminator loss function constructed by the authenticity probability output by the sequence discriminator network.
The sequence discriminator loss function has the following form:

L_{D_S} = Σ [ (1/2)·(D_S(X_real) - 1)^2 + (1/2)·D_S(X̂)^2 ]

where L_{D_S} denotes the sequence discriminator loss function, X_real denotes a real video frame sequence, X̂ denotes the video frame sequence formed by a predicted frame image and its corresponding input frame images, and the summation runs over the sequence pairs formed from the forward, backward and retrospective predictions.
Reversely transmitting the loss value of the sequence discriminator network by using a gradient descent method, and calculating all gradients of each convolution kernel of each convolution layer of the sequence discriminator network and all gradients of a normalization layer; iteratively updating all weights of each convolution kernel of each convolution layer of the sequence discriminator network and all weights of the normalization layer by using an Adam optimizer according to all gradients of each convolution kernel of each convolution layer of the sequence discriminator network and all gradients of the normalization layer; the initial learning rate of the Adam optimizer is 0.00002.
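Similarly, a minimal sketch of one sequence-discriminator update is given below, comparing a real 5-frame clip against a clip whose last frame is the detached forward prediction; restricting the sketch to the forward branch and the least-squares objective are assumptions.

```python
# Hedged sketch of one sequence-discriminator update.
import torch

def sequence_discriminator_step(D_seq, frames, x_fwd, opt_Ds):
    real_seq = frames.transpose(1, 2)                                      # (B, C, T, H, W)
    fake_seq = torch.cat([frames[:, :4], x_fwd.detach().unsqueeze(1)], dim=1).transpose(1, 2)
    loss = 0.5 * ((D_seq(real_seq) - 1) ** 2).mean() + 0.5 * (D_seq(fake_seq) ** 2).mean()
    opt_Ds.zero_grad(); loss.backward(); opt_Ds.step()
    return loss.item()
```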
And 7, judging whether the generator network loss function is converged, if so, executing a step 8, otherwise, executing a step 4.
And 8, finishing the training of generating the countermeasure network by bidirectional review to obtain the trained generator network weight, and storing all the weights of each convolution layer and each convolution kernel of each deconvolution layer of the generator network in the trained bidirectional review generated countermeasure network.
And 9, detecting the video.
Sequentially cutting the video to be detected into video frame sequences of length 5 and size M×N, inputting the first 4 frames of each sequence into the trained generator network, and outputting a predicted future frame image; calculating an anomaly score S from the error between the predicted future frame image and the real 5th frame image of the sequence; if the anomaly score S exceeds a set threshold, the future frame image is judged abnormal, otherwise it is judged not abnormal; where the values of M and N equal the values of W and H, the anomaly score satisfies 0 ≤ S ≤ 1, and the set threshold is 0.5.
The anomaly score S is calculated by the following formulas:

PSNR(x, x') = 10·log_10( [max(x')]^2 / ( (1/F)·Σ_{l=1}^{F} (x_l - x'_l)^2 ) )

S(t) = 1 - ( PSNR(x_t, x'_t) - min_t PSNR(x_t, x'_t) ) / ( max_t PSNR(x_t, x'_t) - min_t PSNR(x_t, x'_t) )

where PSNR(x, x') denotes the peak signal-to-noise ratio between the real image and the predicted future frame image, x denotes the real image, x' denotes the predicted future frame image, log_10 denotes the base-10 logarithm, max denotes the maximum value operation, F denotes the total number of pixels in the real or predicted future frame image, l denotes the index of a pixel in the real or predicted future frame image, S(t) denotes the anomaly score of the predicted future frame image at the t-th moment, x_t denotes the real image at the t-th moment, x'_t denotes the predicted future frame image at the t-th moment, and min denotes the minimum value operation over the predicted future frame images of the video.
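A short NumPy sketch of this scoring follows; the 1 minus normalized-PSNR form, in which a higher score indicates a more abnormal frame, is an assumption about the exact normalization.

```python
# Hedged sketch of the PSNR-based anomaly score.
import numpy as np

def psnr(real, pred):
    mse = np.mean((real - pred) ** 2)
    return 10.0 * np.log10(pred.max() ** 2 / mse)

def anomaly_scores(psnr_values):
    p = np.asarray(psnr_values, dtype=np.float64)
    return 1.0 - (p - p.min()) / (p.max() - p.min())   # higher score = more abnormal

# usage: a score s(t) above the threshold (e.g. 0.5) marks frame t as abnormal
```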
The effect of the present invention is further explained by combining the simulation experiment as follows:
1. simulation experiment conditions are as follows:
the hardware platform of the simulation experiment of the invention is as follows: the processor is an Intel (R) Core i5-9400F CPU, the main frequency is 2.9GHz, the memory is 32GB, and the display card is an NVIDIA GeForce RTX 2070 Super.
The software platform of the simulation experiment of the invention is as follows: ubuntu 16.04 operating system, python 3.6, pytorch 1.2.0.
2. Simulation content and simulation result analysis:
when a training set and a test set are generated in a simulation experiment, a public standard data set CUHK Avenue (Avenue) is used. The video data set was 20 minutes in duration and contained 37 video segments, including 47 exceptional events. In the simulation experiment, 16 normal video segments in the Avenue data set are used to form a training set, and 21 abnormal video segments form a testing set.
In the simulation experiment, the invention and three prior-art methods (the future-frame-prediction anomaly detection method FFP, the deep predictive coding network video anomaly detection method AnoPCN, and the anomaly detection method PRI that integrates prediction and reconstruction) are used to detect the abnormal events in the 21 video segments of the test set.
The three prior-art methods used in the simulation experiment are as follows:
the prior art is an Anomaly Detection method based on Future Frame Prediction, which is a video Anomaly Detection method provided by W.Liu et al in Future Frame Prediction for analysis Detection-A New Baseline, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Jun.2018, pp.6536-6545, and is called an Anomaly Detection method based on Future Frame Prediction for short.
The second prior-art method, the video anomaly detection method based on a deep predictive coding network, refers to the video anomaly detection method proposed by M. Ye et al. in "AnoPCN: Video Anomaly Detection via Deep Predictive Coding Network," in Proceedings of the 27th ACM International Conference on Multimedia, 2019, pp. 1805-1813, abbreviated as the deep-predictive-coding-network video anomaly detection method (AnoPCN).
The third prior-art method, the anomaly detection method that integrates prediction and reconstruction, refers to the video anomaly detection method proposed by Y. Tang et al. in "Integrating prediction and reconstruction for anomaly detection," Pattern Recognit. Lett., vol. 129, pp. 123-130, 2020, abbreviated as the prediction-and-reconstruction anomaly detection method (PRI).
To evaluate the simulation results, AUC is adopted as the performance evaluation index and compared with the three prior-art methods; the comparison results are shown in Table 1.
As can be seen from Table 1, the AUC of the method of the invention on the Avenue data set is 88.6%, higher than that of the three prior-art methods, demonstrating that the method detects abnormal events in video more effectively.
TABLE 1. Comparison of the AUC values of the invention and the three prior-art methods on the Avenue data set

Method | AUC
FFP (future frame prediction) | -
AnoPCN (deep predictive coding network) | -
PRI (prediction and reconstruction) | -
Method of the invention | 88.6%
The above simulation experiments show that the bidirectional review generation countermeasure network constructed by the method of the invention, composed of a generator, a frame discriminator and a sequence discriminator and trained in a bidirectional review mode, lets the generator fully mine the bidirectional mapping relationship between the predicted frame and the input frame sequence and predict a more accurate future frame image of a normal event, which effectively improves the ability of the prediction network to distinguish the appearance patterns of normal and abnormal video. In addition, the sequence discriminator composed of 3D convolutional layers imposes a motion constraint on the predicted frame image from the perspective of long-term temporal consistency, which effectively improves the ability of the prediction network to distinguish the motion patterns of normal and abnormal video. The method therefore solves the problems of the prior art that only forward prediction is performed, that the reverse mapping relation between video frame sequences is not exploited, and that motion constraints are not applied from the perspective of long-term temporal consistency, which left the prediction and the motion constraint of future frames of normal events insufficient; it is thus a highly practical video abnormal event detection method.

Claims (9)

1. A video abnormal event detection method based on a bidirectional retrospective generation countermeasure network is characterized in that the generation countermeasure network consisting of a generator, a frame discriminator and a sequence discriminator is constructed, during training, the forward and backward prediction and the bidirectional retrospective mode of combined retrospective prediction are adopted, the generator, the frame discriminator and the sequence discriminator are alternately updated to train the generation countermeasure network, and the generator which can accurately predict a future frame image of a normal event in a video and cannot accurately predict the future frame image of the abnormal event in the video is obtained; the method comprises the following specific steps:
(1) constructing a generation countermeasure network:
(1a) a generator network with 15 layers is built, and the structure of the generator network sequentially comprises the following steps: input layer → first convolution layer → first normalization layer → first activation function layer → second convolution layer → second normalization layer → second activation function layer → first downsampling layer combination → second downsampling layer combination → third downsampling layer combination → first upsampling layer combination → second upsampling layer combination → third convolutional layer → output layer; the structure of each downsampling layer combination is as follows in sequence: the first max pooling layer → the first convolution layer → the first normalization layer → the first activation function layer → the second convolution layer → the second normalization layer → the second activation function layer; the structure of each up-sampling layer combination is as follows in sequence: the first deconvolution layer → the first convolution layer → the first normalization layer → the first activation function layer → the second convolution layer → the second normalization layer → the second activation function layer; the feature maps output by the first, second and third down-sampling layer combinations are spliced and fused with the feature maps output by the first, second and third up-sampling layer combinations respectively;
the parameters of each layer in the generator network are set as follows: setting the sizes of convolution kernels in the first convolution layer, the second convolution layer and the third convolution layer to be 3 multiplied by 3, setting convolution step lengths to be 2 and setting the number of the convolution kernels to be 64; the first and second activation function layers are realized by adopting a ReLU function;
the size of the pooling convolution kernel of the largest pooling layer in each downsampling layer combination is set to be 2 multiplied by 2, and the pooling step length is set to be 2; the sizes of convolution kernels of the convolution layers are all set to be 3 multiplied by 3, convolution step lengths are all set to be 2, and the number of the convolution kernels is 128, 256 and 512 respectively; the activation function layers are all realized by adopting a ReLU function;
in each up-sampling layer combination, the convolution kernel size of the deconvolution layer is set to 2×2 and the convolution stride to 2; the convolution kernel sizes of the convolution layers are all set to 3×3, the convolution strides are all set to 2, and the numbers of convolution kernels are 256, 128 and 64, respectively; the activation function layers are all implemented with the ReLU function;
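For illustration only, a minimal PyTorch-style sketch of a U-Net-style generator of this kind is given below. It assumes the 4 input frames are concatenated along the channel axis, uses stride-1 3×3 convolutions with the claimed channel widths (64/128/256/512), standard U-Net skip connections and a tanh output; these simplifications and the names below are assumptions, not the exact claimed configuration.

```python
# Illustrative sketch only: a U-Net-style single-frame predictor with the
# claimed channel widths (64/128/256/512). Stride-1 convolutions, padding,
# the channel-wise stacking of the 4 input frames and the tanh output are
# assumptions made so that the sketch runs end to end.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # two 3x3 convolution + normalization + ReLU layers, mirroring one
    # "convolution layer combination" in the claim
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

class Generator(nn.Module):
    def __init__(self, in_frames=4, channels=3):
        super().__init__()
        self.enc0 = conv_block(in_frames * channels, 64)
        self.down1 = nn.Sequential(nn.MaxPool2d(2), conv_block(64, 128))
        self.down2 = nn.Sequential(nn.MaxPool2d(2), conv_block(128, 256))
        self.down3 = nn.Sequential(nn.MaxPool2d(2), conv_block(256, 512))
        self.up1, self.dec1 = nn.ConvTranspose2d(512, 256, 2, stride=2), conv_block(512, 256)
        self.up2, self.dec2 = nn.ConvTranspose2d(256, 128, 2, stride=2), conv_block(256, 128)
        self.up3, self.dec3 = nn.ConvTranspose2d(128, 64, 2, stride=2), conv_block(128, 64)
        self.out = nn.Conv2d(64, channels, 3, padding=1)

    def forward(self, x):
        # x: (B, in_frames*channels, H, W) -- the 4 input frames stacked channel-wise
        e0 = self.enc0(x)
        e1 = self.down1(e0)
        e2 = self.down2(e1)
        e3 = self.down3(e2)
        d1 = self.dec1(torch.cat([self.up1(e3), e2], dim=1))   # skip connection
        d2 = self.dec2(torch.cat([self.up2(d1), e1], dim=1))   # skip connection
        d3 = self.dec3(torch.cat([self.up3(d2), e0], dim=1))   # skip connection
        return torch.tanh(self.out(d3))                        # predicted future frame
```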
(1b) a 14-layer frame discriminator network is built, and its structure comprises, in sequence: input layer → first convolution layer → first activation function layer → second convolution layer → first normalization layer → second activation function layer → third convolution layer → second normalization layer → third activation function layer → fourth convolution layer → fourth activation function layer → fifth convolution layer → fifth activation function layer → output layer;
the parameters of each layer in the frame discriminator network are set as follows: the convolution kernel sizes of the first, second, third, fourth and fifth convolution layers are all set to 3×3, the convolution strides are all set to 2, and the numbers of convolution kernels are set in sequence to 128, 256, 512 and 1; the first and second normalization layers are both implemented with the BatchNorm2d function; the first, second, third and fourth activation function layers are all implemented with the LeakyReLU function with the slope set to 0.2; the fifth activation function layer is implemented with the Sigmoid function;
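A corresponding sketch of such a frame discriminator follows. It keeps the claimed ordering of convolution, normalization, LeakyReLU(0.2) and Sigmoid layers; the 64-channel first layer (the claim lists only four channel counts for five convolution layers) and the padding of 1 are assumptions.

```python
# Illustrative stand-in for the frame discriminator: 3x3 stride-2 convolutions
# with LeakyReLU(0.2), BatchNorm2d on the second and third layers and a Sigmoid
# output. The 64-channel first layer and padding of 1 are assumptions.
import torch.nn as nn

class FrameDiscriminator(nn.Module):
    def __init__(self, channels=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 64, 3, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.BatchNorm2d(128), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.BatchNorm2d(256), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(256, 512, 3, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(512, 1, 3, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        # x: (B, channels, H, W), a single real or predicted frame
        return self.net(x)   # map of true/false probabilities
```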
(1c) a 16-layer sequence discriminator network is built, and its structure comprises, in sequence: input layer → first 3D convolution layer → first 3D max pooling layer → first normalization layer → first activation function layer → second 3D convolution layer → second 3D max pooling layer → second normalization layer → second activation function layer → third 3D convolution layer → third 3D max pooling layer → third normalization layer → third activation function layer → fourth 3D convolution layer → fourth activation function layer → output layer;
the parameters of each layer in the sequence discriminator network are set as follows: the convolution kernel sizes of the first, second, third and fourth 3D convolution layers are all set to 2×3, the convolution strides are all set to 1×2, and the numbers of convolution kernels are set in sequence to 128, 256, 512 and 1; the pooling kernel sizes of the first, second and third 3D max pooling layers are all set to 2×3, and the pooling strides are all set to 1×2; the first, second and third normalization layers are all implemented with the BatchNorm3d function; the first, second and third activation function layers are all implemented with the LeakyReLU function with the slope set to 0.2; the fourth activation function layer is implemented with the Sigmoid function;
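A sketch of the 3D-convolutional sequence discriminator is given below under the assumption that the kernel and pooling sizes act over (time, height, width) as (2, 3, 3) and (1, 2, 2); since the claim text states these shapes only partially, the exact sizes here are assumptions.

```python
# Illustrative stand-in for the 3D-convolutional sequence discriminator.
# Kernel sizes (2, 3, 3) and pooling sizes (1, 2, 2) over (time, height, width)
# are assumptions; the layer ordering follows the claim.
import torch.nn as nn

class SequenceDiscriminator(nn.Module):
    def __init__(self, channels=3):
        super().__init__()
        def block(in_ch, out_ch):
            # 3D convolution -> 3D max pooling -> normalization -> LeakyReLU(0.2)
            return nn.Sequential(
                nn.Conv3d(in_ch, out_ch, kernel_size=(2, 3, 3), padding=(0, 1, 1)),
                nn.MaxPool3d(kernel_size=(1, 2, 2)),
                nn.BatchNorm3d(out_ch),
                nn.LeakyReLU(0.2, inplace=True),
            )
        self.net = nn.Sequential(
            block(channels, 128), block(128, 256), block(256, 512),
            nn.Conv3d(512, 1, kernel_size=(2, 3, 3), padding=(0, 1, 1)),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # x: (B, channels, T=5, H, W), a real or reconstructed 5-frame clip
        return self.net(x)   # authenticity probability map
```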
(1d) cascading the generator network with the frame discriminator network and with the sequence discriminator network, respectively, to form the generation countermeasure network;
(2) initializing the generation countermeasure network:
initializing the weights of all convolution layers and normalization layers in the generation countermeasure network to random values drawn from a normal distribution with mean 0 and standard deviation 0.02;
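A short sketch of this initialization step; applying N(0, 0.02) to the normalization-layer weights follows the claim text literally, and initializing the biases to zero is an added assumption.

```python
# Sketch of step (2): every convolution / deconvolution / normalization weight
# is drawn from N(0, 0.02); the zero bias initialization is an assumption.
import torch.nn as nn

def init_weights(m):
    if isinstance(m, (nn.Conv2d, nn.Conv3d, nn.ConvTranspose2d,
                      nn.BatchNorm2d, nn.BatchNorm3d)):
        nn.init.normal_(m.weight, mean=0.0, std=0.02)
        if m.bias is not None:
            nn.init.zeros_(m.bias)

# usage: generator.apply(init_weights); frame_disc.apply(init_weights); seq_disc.apply(init_weights)
```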
(3) generating a training data set:
selecting a continuous surveillance video that is T minutes long and contains no abnormal events, and sequentially cutting it into groups of video frame sequences of length 5 and size W×H to form the training data set; where T > 10, W and H denote the width and height of each frame image in pixels, with 64 ≤ W ≤ 256 and 64 ≤ H ≤ 256;
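A sketch (with a hypothetical file name) of how such a training set can be cut from a normal-only surveillance video using OpenCV; the use of non-overlapping clips and the [-1, 1] normalization are assumptions.

```python
# Sketch of step (3): a normal-only surveillance video is cut into consecutive
# length-5 clips of size W x H. Non-overlapping clips and the [-1, 1]
# normalization are assumptions; the file name is hypothetical.
import cv2
import numpy as np

def make_training_clips(video_path, W=256, H=256, clip_len=5):
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frame = cv2.resize(frame, (W, H)).astype(np.float32)
        frames.append(frame / 127.5 - 1.0)          # scale pixel values to [-1, 1]
    cap.release()
    return [np.stack(frames[i:i + clip_len])        # each clip: (5, H, W, 3)
            for i in range(0, len(frames) - clip_len + 1, clip_len)]

clips = make_training_clips("normal_surveillance.avi")   # hypothetical path
```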
(4) training the generator network in a two-way review mode:
(4a) arranging the first 4 frames of each group of video frame sequences into a forward video frame sequence in forward chronological order, and arranging the last 4 frames of each group of video frame sequences into a reverse video frame sequence in reverse chronological order;
(4b) inputting the forward video frame sequence into the generator network for forward prediction and outputting a forward predicted frame image; inputting the reverse video frame sequence into the generator network for backward prediction and outputting a backward predicted frame image;
(4c) adding the forward predicted frame image to the video frame sequence previously used for forward prediction, arranging the last 4 frames of the expanded video frame sequence in reverse chronological order, inputting them into the generator network for backward retrospective prediction, and outputting a backward retrospective predicted frame image; adding the backward predicted frame image to the video frame sequence previously used for backward prediction, arranging the first 4 frames of the expanded video frame sequence in forward chronological order, inputting them into the generator network for forward retrospective prediction, and outputting a forward retrospective predicted frame image;
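The data flow of steps (4a)-(4c) for a single 5-frame clip can be sketched as follows; `generator` is the U-Net-style predictor sketched above, and the assumption that frames are channel-first tensors stacked along channels is carried over from that sketch.

```python
# Sketch of the bidirectional review predictions of steps (4a)-(4c) for one
# 5-frame clip [f1..f5]; each frame is assumed to be a (B, C, H, W) tensor.
import torch

def bidirectional_review(generator, clip):
    f1, f2, f3, f4, f5 = clip
    stack = lambda frames: torch.cat(frames, dim=1)        # stack frames along channels

    fwd_pred = generator(stack([f1, f2, f3, f4]))          # forward prediction of f5
    bwd_pred = generator(stack([f5, f4, f3, f2]))          # backward prediction of f1
    # backward review: the forward-predicted f5 replaces the real f5, last 4 frames reversed
    bwd_review = generator(stack([fwd_pred, f4, f3, f2]))  # re-predicts f1
    # forward review: the backward-predicted f1 replaces the real f1, first 4 frames forward
    fwd_review = generator(stack([bwd_pred, f2, f3, f4]))  # re-predicts f5
    return fwd_pred, bwd_pred, bwd_review, fwd_review
```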
(4d) calculating the loss value of the generator network according to a generator network loss function constructed from the errors between the predicted frame images produced by the bidirectional and retrospective predictions and the corresponding real frame images; back-propagating the loss value of the generator network by gradient descent, and calculating all gradients of every convolution kernel in each convolution layer and each deconvolution layer of the generator network; iteratively updating all weights of every convolution kernel in each convolution layer and each deconvolution layer of the generator network with an Adam optimizer according to these gradients; the initial learning rate of the Adam optimizer is 0.0002;
(5) training a frame discriminator network:
(5a) sequentially inputting the forward predicted frame image and its real image, the backward predicted frame image and its real image, the forward retrospective predicted frame image and its real image, and the backward retrospective predicted frame image and its real image into the frame discriminator network, the frame discriminator network outputting the corresponding true/false probabilities;
(5b) calculating the loss value of the frame discriminator network according to a frame discriminator loss function constructed from the true/false probabilities output by the frame discriminator network; back-propagating the loss value of the frame discriminator network by gradient descent, and calculating all gradients of every convolution kernel of each convolution layer and all gradients of the normalization layers of the frame discriminator network; iteratively updating all weights of every convolution kernel of each convolution layer and all weights of the normalization layers of the frame discriminator network with an Adam optimizer according to these gradients; the initial learning rate of the Adam optimizer is 0.00002;
(6) training a sequence discriminator network:
(6a) sequentially inputting into the sequence discriminator network the video frame sequences formed by the forward predicted frame image, the backward predicted frame image and the retrospective predicted frame images together with their corresponding input frame images, as well as the corresponding real video frame sequences, the sequence discriminator network outputting the corresponding authenticity probabilities;
(6b) calculating the loss value of the sequence discriminator network according to a sequence discriminator loss function constructed from the authenticity probabilities output by the sequence discriminator network; back-propagating the loss value of the sequence discriminator network by gradient descent, and calculating all gradients of every convolution kernel of each convolution layer and all gradients of the normalization layers of the sequence discriminator network; iteratively updating all weights of every convolution kernel of each convolution layer and all weights of the normalization layers of the sequence discriminator network with an Adam optimizer according to these gradients; the initial learning rate of the Adam optimizer is 0.00002;
(7) judging whether the generator network loss function has converged; if so, executing step (8), otherwise returning to step (4);
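The alternating updates of steps (4)-(7) can be condensed into a training-step sketch like the one below. The three loss callables are placeholders for the claim 6-8 losses (whose exact formulas appear only as images in the original), the learning rates follow the claim, and `bidirectional_review` is the helper sketched under step (4c).

```python
# Condensed sketch of the alternating generator / frame discriminator /
# sequence discriminator updates; the loss functions are hypothetical stand-ins.
import torch

def make_optimizers(generator, frame_disc, seq_disc):
    return (torch.optim.Adam(generator.parameters(), lr=2e-4),
            torch.optim.Adam(frame_disc.parameters(), lr=2e-5),
            torch.optim.Adam(seq_disc.parameters(), lr=2e-5))

def train_step(clip, generator, frame_disc, seq_disc, opt_g, opt_df, opt_ds,
               g_loss_fn, df_loss_fn, ds_loss_fn):
    preds = bidirectional_review(generator, clip)

    # step (4): update the generator
    opt_g.zero_grad()
    g_loss = g_loss_fn(preds, clip, frame_disc, seq_disc)
    g_loss.backward()
    opt_g.step()

    # step (5): update the frame discriminator on real frames vs. detached predictions
    opt_df.zero_grad()
    df_loss_fn(frame_disc, [p.detach() for p in preds], clip).backward()
    opt_df.step()

    # step (6): update the sequence discriminator on real vs. predicted clips
    opt_ds.zero_grad()
    ds_loss_fn(seq_disc, [p.detach() for p in preds], clip).backward()
    opt_ds.step()

    return g_loss.item()   # monitored for the convergence check of step (7)
```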
(8) finishing the bidirectional review training of the generation countermeasure network to obtain the trained generator network weights, and saving all weights of every convolution kernel of each convolution layer and each deconvolution layer of the generator network in the trained bidirectional review generation countermeasure network;
(9) detecting the video:
sequentially cutting the video to be detected into video frame sequences of length 5 and size M×N; inputting the first 4 frames of each video frame sequence into the trained generator network and outputting a predicted future frame image; calculating an anomaly score S from the error between the predicted future frame image and the real 5th frame image in the video frame sequence; if the anomaly score S exceeds a set threshold, the future frame image is judged to be abnormal, otherwise it is judged not to be abnormal; where the values of M and N are equal to those of W and H, and the anomaly score satisfies 0 ≤ S ≤ 1.
2. The video abnormal event detection method based on a bidirectional review generation countermeasure network according to claim 1, wherein the forward prediction in step (4b) is implemented by the following formula:
x′_{j+1} = G(X_{i:j})
where x′_{j+1} denotes the (j+1)-th video frame image output by the forward prediction of the generator network, G(·) denotes the output of the generator network in the bidirectional review generation countermeasure network, X_{i:j} denotes the forward video frame sequence formed in step (4a) by arranging the first 4 frames of each group of video frame sequences in forward chronological order, and i and j denote the indices of the starting and ending frames of the video frame sequence, with j − i + 1 = 4.
3. The video abnormal event detection method based on a bidirectional review generation countermeasure network according to claim 2, wherein the backward prediction in step (4b) is implemented by the following formula:
x′_i = G(X_{j+1:i+1})
where x′_i denotes the i-th video frame image output by the backward prediction of the generator network, and X_{j+1:i+1} denotes the reverse video frame sequence formed in step (4a) by arranging the last 4 frames of each group of video frame sequences in reverse chronological order.
4. The video abnormal event detection method based on a bidirectional review generation countermeasure network according to claim 2, wherein the backward retrospective prediction in step (4c) is implemented by the following formula:
x″_i = G(X′_{j+1:i+1})
where x″_i denotes the i-th video frame image output by the backward retrospective prediction of the generator network, and X′_{j+1:i+1} denotes the last 4 frames of the expanded video frame sequence in step (4c), i.e. the sequence including the forward predicted frame x′_{j+1}, arranged in reverse chronological order.
5. The video abnormal event detection method based on a bidirectional review generation countermeasure network according to claim 2, wherein the forward retrospective prediction in step (4c) is implemented by the following formula:
x″_{j+1} = G(X′_{i:j})
where x″_{j+1} denotes the (j+1)-th video frame image output by the forward retrospective prediction of the generator network, and X′_{i:j} denotes the first 4 frames of the expanded video frame sequence in step (4c), i.e. the sequence including the backward predicted frame x′_i, arranged in forward chronological order.
6. The video abnormal event detection method based on a bidirectional review generation countermeasure network according to claim 5, wherein the generator network loss function in step (4d) is as follows:
L_G = 1*L + 1*L′ + 0.005*L″ + 0.005*L‴
where L_G denotes the generator network loss function, * denotes the multiplication operation, L denotes the intensity error loss between the predicted frame images output by the generator and the real images, L′ denotes the gradient error loss between the predicted frame images output by the generator and the real images, L″ denotes the frame countermeasure loss of the generator network, and L‴ denotes the sequence countermeasure loss of the generator network;
the L, L ', L ", and L'" are derived from the following equations, respectively:
[Formula image FDA0002653247000000056: definition of the intensity error loss L]
[Formula image FDA0002653247000000061: definition of the gradient error loss L′]
[Formula image FDA0002653247000000062: definition of the frame countermeasure loss L″]
[Formula image FDA0002653247000000063: definition of the sequence countermeasure loss L‴]
where ‖·‖₂ denotes the 2-norm operation, x_i denotes the real image corresponding to x′_i and x″_i, x_{j+1} denotes the real image corresponding to x′_{j+1} and x″_{j+1}, K and L denote the width and height of each frame image and take the same values as W and H, m and n denote the position coordinates of a pixel in the image, Σ denotes the summation operation, ‖·‖₁ denotes the 1-norm operation, D_F(·) denotes the output of the frame discriminator network in the bidirectional review generation countermeasure network, D_S(·) denotes the output of the sequence discriminator network in the bidirectional review generation countermeasure network, and ∪ denotes a time-series concatenation (superposition) operation.
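Because the component formulas above survive only as images, the following sketch uses common stand-in forms (mean-squared intensity error, L1 image-gradient difference, and adversarial terms that push the discriminator outputs toward 1); only the weights 1, 1, 0.005 and 0.005 are taken from the claim text, so this is not the patent's exact loss.

```python
# Stand-in sketch of a claim-6-style generator loss; the component forms are
# common choices, not the patent's exact formulas.
import torch
import torch.nn.functional as F

def gradient_loss(pred, real):
    # L1 difference between horizontal and vertical image gradients
    dxp, dyp = pred[..., :, 1:] - pred[..., :, :-1], pred[..., 1:, :] - pred[..., :-1, :]
    dxr, dyr = real[..., :, 1:] - real[..., :, :-1], real[..., 1:, :] - real[..., :-1, :]
    return (dxp - dxr).abs().mean() + (dyp - dyr).abs().mean()

def generator_loss(preds, reals, pred_clips, frame_disc, seq_disc):
    # preds/reals: matching lists of predicted and ground-truth frames
    # pred_clips: 5-frame clips with the predicted frame substituted in
    L_int = sum(F.mse_loss(p, r) for p, r in zip(preds, reals))
    L_grad = sum(gradient_loss(p, r) for p, r in zip(preds, reals))
    frame_scores = [frame_disc(p) for p in preds]
    L_fadv = sum(F.binary_cross_entropy(s, torch.ones_like(s)) for s in frame_scores)
    seq_scores = [seq_disc(c) for c in pred_clips]
    L_sadv = sum(F.binary_cross_entropy(s, torch.ones_like(s)) for s in seq_scores)
    return 1 * L_int + 1 * L_grad + 0.005 * L_fadv + 0.005 * L_sadv
```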
7. The video abnormal event detection method based on a bidirectional review generation countermeasure network according to claim 6, wherein the frame discriminator loss function in step (5b) is of the following form:
[Formula image FDA0002653247000000064: the frame discriminator loss function, constructed from the true/false probabilities output by the frame discriminator network]
where the symbol [image FDA0002653247000000065] denotes the frame discriminator loss function.
8. The video abnormal event detection method based on a bidirectional review generation countermeasure network according to claim 6, wherein the sequence discriminator loss function in step (6b) is of the following form:
[Formula image FDA0002653247000000071: the sequence discriminator loss function, constructed from the authenticity probabilities output by the sequence discriminator network]
where the symbol [image FDA0002653247000000072] denotes the sequence discriminator loss function.
9. The video abnormal event detection method based on a bidirectional review generation countermeasure network according to claim 6, wherein the anomaly score S in step (9) is calculated by the following formulas:
[Formula image FDA0002653247000000073: definition of the peak signal-to-noise ratio PSNR(x, x′)]
[Formula image FDA0002653247000000074: definition of the anomaly score S(t), obtained by normalizing the PSNR values over the video]
where PSNR(x, x′) denotes the peak signal-to-noise ratio between the real image and the predicted future frame image, x denotes the real image, x′ denotes the predicted future frame image, log₁₀ denotes the base-10 logarithm operation, max denotes the maximum value operation, F denotes the total number of pixels in the corresponding real image or predicted future frame image, l denotes the index over all pixels in the corresponding real image or predicted future frame image, S(t) denotes the anomaly score of the predicted future frame image at the t-th moment, x_t denotes the real image at the t-th moment, x′_t denotes the predicted future frame image at the t-th moment, and min denotes the minimum value operation.
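A sketch of this scoring step under the standard assumption that PSNR is computed per frame and then min-max normalized over the test video; taking S(t) as the complement of the normalized PSNR (so that a large S flags an anomaly, consistent with claim 1) is an assumption about the formula that survives only as an image.

```python
# Sketch of claim-9-style scoring: per-frame PSNR, min-max normalized over the
# video; the score direction (larger S = more abnormal) is an assumption.
import numpy as np

def psnr(real, pred):
    mse = np.mean((real - pred) ** 2)
    return 10.0 * np.log10((pred.max() ** 2) / (mse + 1e-8))

def anomaly_scores(real_frames, pred_frames):
    p = np.array([psnr(r, f) for r, f in zip(real_frames, pred_frames)])
    return 1.0 - (p - p.min()) / (p.max() - p.min() + 1e-8)   # S(t) in [0, 1]

# usage: scores = anomaly_scores(reals, preds); is_abnormal = scores > threshold
```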
CN202010878108.6A 2020-08-27 2020-08-27 Video abnormal event detection method based on two-way review generation countermeasure network Active CN112052763B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010878108.6A CN112052763B (en) 2020-08-27 2020-08-27 Video abnormal event detection method based on two-way review generation countermeasure network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010878108.6A CN112052763B (en) 2020-08-27 2020-08-27 Video abnormal event detection method based on two-way review generation countermeasure network

Publications (2)

Publication Number Publication Date
CN112052763A true CN112052763A (en) 2020-12-08
CN112052763B CN112052763B (en) 2024-02-09

Family

ID=73600525

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010878108.6A Active CN112052763B (en) 2020-08-27 2020-08-27 Video abnormal event detection method based on two-way review generation countermeasure network

Country Status (1)

Country Link
CN (1) CN112052763B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020037965A1 (en) * 2018-08-21 2020-02-27 北京大学深圳研究生院 Method for multi-motion flow deep convolutional network model for video prediction
US20200134804A1 (en) * 2018-10-26 2020-04-30 Nec Laboratories America, Inc. Fully convolutional transformer based generative adversarial networks
CN109711280A (en) * 2018-12-10 2019-05-03 北京工业大学 A kind of video abnormality detection method based on ST-Unet
CN109919032A (en) * 2019-01-31 2019-06-21 华南理工大学 A kind of video anomaly detection method based on action prediction
CN110568442A (en) * 2019-10-15 2019-12-13 中国人民解放军国防科技大学 Radar echo extrapolation method based on confrontation extrapolation neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
袁帅; 秦贵和; 晏婕: "Road condition video frame prediction model using a residual generative adversarial network", Journal of Xi'an Jiaotong University (西安交通大学学报), no. 10 *
陈莹; 何丹丹: "Spatio-temporal stream abnormal behavior detection model based on Bayesian fusion", Journal of Electronics & Information Technology (电子与信息学报), no. 05 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112633180B (en) * 2020-12-25 2022-05-24 浙江大学 Video anomaly detection method and system based on dual memory module
CN112633180A (en) * 2020-12-25 2021-04-09 浙江大学 Video anomaly detection method and system based on dual memory module
CN112819831A (en) * 2021-01-29 2021-05-18 北京小白世纪网络科技有限公司 Segmentation model generation method and device based on convolution Lstm and multi-model fusion
CN112819831B (en) * 2021-01-29 2024-04-19 北京小白世纪网络科技有限公司 Segmentation model generation method and device based on convolution Lstm and multi-model fusion
CN113096673A (en) * 2021-03-30 2021-07-09 山东省计算中心(国家超级计算济南中心) Voice processing method and system based on generation countermeasure network
CN113283849A (en) * 2021-07-26 2021-08-20 山东新北洋信息技术股份有限公司 Logistics abnormity intelligent detection method based on video context association
CN113283849B (en) * 2021-07-26 2021-11-02 山东建筑大学 Logistics abnormity intelligent detection method based on video context association
CN113435432A (en) * 2021-08-27 2021-09-24 腾讯科技(深圳)有限公司 Video anomaly detection model training method, video anomaly detection method and device
CN113810611A (en) * 2021-09-17 2021-12-17 北京航空航天大学 Data simulation method and device for event camera
CN113947612A (en) * 2021-09-28 2022-01-18 西安电子科技大学广州研究院 Video anomaly detection method based on foreground and background separation
CN113947612B (en) * 2021-09-28 2024-03-29 西安电子科技大学广州研究院 Video anomaly detection method based on foreground and background separation
CN114067251A (en) * 2021-11-18 2022-02-18 西安交通大学 Unsupervised monitoring video prediction frame abnormity detection method
CN114067251B (en) * 2021-11-18 2023-09-15 西安交通大学 Method for detecting anomaly of unsupervised monitoring video prediction frame
CN116756575A (en) * 2023-08-17 2023-09-15 山东科技大学 Non-invasive load decomposition method based on BGAIN-DD network
CN116756575B (en) * 2023-08-17 2023-11-03 山东科技大学 Non-invasive load decomposition method based on BGAIN-DD network

Also Published As

Publication number Publication date
CN112052763B (en) 2024-02-09

Similar Documents

Publication Publication Date Title
CN112052763B (en) Video abnormal event detection method based on two-way review generation countermeasure network
CN111476717B (en) Face image super-resolution reconstruction method based on self-attention generation countermeasure network
CN111652903B (en) Pedestrian target tracking method based on convolution association network in automatic driving scene
Li et al. DLEP: A deep learning model for earthquake prediction
CN114978613B (en) Network intrusion detection method based on data enhancement and self-supervision feature enhancement
CN115601661A (en) Building change detection method for urban dynamic monitoring
CN116522265A (en) Industrial Internet time sequence data anomaly detection method and device
Fan et al. Structural dynamic response reconstruction using self-attention enhanced generative adversarial networks
CN113222824B (en) Infrared image super-resolution and small target detection method
Wang et al. Edge computing-enabled crowd density estimation based on lightweight convolutional neural network
CN113379597A (en) Face super-resolution reconstruction method
CN116206214A (en) Automatic landslide recognition method, system, equipment and medium based on lightweight convolutional neural network and double attention
CN115830707A (en) Multi-view human behavior identification method based on hypergraph learning
Li et al. Towards accurate and reliable change detection of remote sensing images via knowledge review and online uncertainty estimation
Fang et al. An attention-based U-Net network for anomaly detection in crowded scenes
CN111553371B (en) Image semantic description method and system based on multi-feature extraction
CN114220169A (en) Lightweight real-time monitoring abnormal behavior detection method based on Yolo-TSM
He et al. Remote Sensing Image Scene Classification Based on ECA Attention Mechanism Convolutional Neural Network
CN113947612B (en) Video anomaly detection method based on foreground and background separation
Wan et al. Siamese Attentive Convolutional Network for Effective Remote Sensing Image Change Detection
CN113869514B (en) Multi-knowledge integration and optimization method based on genetic algorithm
CN115456957B (en) Method for detecting change of remote sensing image by full-scale feature aggregation
CN115100487A (en) Stereo image significance detection method based on multi-layer cross-modal integrated network
Guo et al. Semantic-driven automatic filter pruning for neural networks
KR20230086233A (en) A Reconstructed Convolutional Neural Network Using Conditional Min Pooling and Method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant