CN112052763B - Video abnormal event detection method based on a bidirectional retrospective generative adversarial network

Info

Publication number: CN112052763B (application number CN202010878108.6A; other version: CN112052763A, Chinese)
Authority: CN (China)
Inventors: 刘静 (Liu Jing), 杨智伟 (Yang Zhiwei)
Assignee: Xidian University
Legal status: Active (application granted)

Classifications

    • G06V20/40: Scenes; scene-specific elements in video content
    • G06V20/44: Event detection
    • G06N3/045: Combinations of networks
    • G06N3/048: Activation functions
    • G06N3/084: Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a video abnormal-event detection method based on a bidirectional retrospective generative adversarial network. It addresses the shortcoming of the prior art, which neither exploits the reverse mapping relation between video frame sequences nor imposes motion constraints from the perspective of their long-term temporal consistency, so that the detection accuracy for abnormal events in video is insufficient. The implementation steps are as follows: construct a generative adversarial network consisting of a generator, a frame discriminator and a sequence discriminator; train it by alternately updating the generator, the frame discriminator and the sequence discriminator in a bidirectional retrospective manner that combines forward and backward prediction with retrospective prediction; the resulting generator accurately predicts future frames of normal events in a video but not future frames of abnormal events, so that the occurrence of an abnormal event can be detected from the prediction error.

Description

Video abnormal event detection method based on a bidirectional retrospective generative adversarial network
Technical Field
The invention belongs to the technical field of image processing, and further relates to a video abnormal-event detection method based on a bidirectional retrospective generative adversarial network in the technical field of computer vision. The method can be used to detect abnormal events in video surveillance footage.
Background
In recent years, intelligent security has been receiving increasing attention, especially the automatic detection of abnormal events in video surveillance, which plays a vital role in improving the response to and handling of abnormal events in public places, maintaining public safety and reducing property loss. At present there are two main approaches to detecting abnormal events in video surveillance. The first builds a model that can reconstruct the current video frame by learning from video data of normal behavior, for example a sparse dictionary or an auto-encoder; the trained reconstruction model reconstructs current frames of normal events well, whereas for an abnormal event the reconstructed current frame exhibits a large reconstruction error, and abnormal events are detected from this error. However, the reconstruction model tolerates deviations to a degree that abnormal frames can sometimes also be reconstructed well, so the detection accuracy is low. The second approach builds a model that can predict future frames by learning from video data of normal behavior; such a model predicts future frames of normal events well, while abnormal events, being unpredictable, produce a large error between the predicted future frame and the real frame, and whether an abnormal event has occurred is detected from this error.
Wen Liu et al., in the paper "Future Frame Prediction for Anomaly Detection - A New Baseline" (Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 6536-6545), propose a video anomaly detection method based on future frame prediction. The method adopts a Unet as the prediction network to predict future frames of video data. During training, the future frame is predicted forward from the previous four frames; appearance constraints are imposed on the predicted future frame by minimizing the gradient loss and intensity loss between the predicted and real future frames, motion constraints are imposed by minimizing the optical-flow loss between them, and the model is further optimized jointly with a generative adversarial network. Because the method constrains both the appearance of the predicted future frame and, by extracting optical flow, its motion, it predicts future frames of normal videos well, while abnormal videos, being unpredictable, produce a large error between the predicted and real future frames, improving abnormal-event detection. However, the method still has shortcomings: only forward prediction is performed during training and the reverse mapping relation between video frame sequences is not exploited, so the prediction of future frames of normal events is not accurate enough. Moreover, motion is constrained only between the predicted future frame and the preceding frames via optical flow, and no motion constraint is imposed from the perspective of the long-term temporal consistency of the video frame sequence; the motion constraint on predicted future frames of normal events is therefore insufficient, large errors also appear in the predicted future frames of normal events, and the ability of the prediction network to distinguish normal from abnormal motion patterns is reduced, degrading the video anomaly detection performance.
The patent application "A method for detecting video anomalies based on ST-Unet" filed by Beijing University of Technology (application number 201811501290.2, publication number CN109711280A) proposes an ST-Unet network that exploits the temporal and spatial information of video data simultaneously for video anomaly detection. The method constructs a new ST-Unet network by adding ConvLSTM to the Unet. When training the ST-Unet network, a future frame is first predicted forward from the previous four video frames and the last frame of the input sequence is reconstructed; the model is then optimized with a generative adversarial network by minimizing the difference between the predicted and real future frames and between the reconstructed image and the real last input frame. In the test stage, the prediction error and the reconstruction error are first obtained by comparing the future frame predicted by the ST-Unet network with the real future frame, and the generated reconstruction with the real input image; the two errors are then combined by weighted summation into an anomaly score, and finally whether an anomaly has occurred is judged from this score. By adding ConvLSTM to the Unet, the method captures the spatio-temporal information of the video data simultaneously, and combining the reconstruction error further improves anomaly detection. However, the method still has shortcomings: only forward prediction is performed during training and the reverse mapping relation between video frame sequences is not exploited; combining a prediction model with a reconstruction model makes the detection network overly complex and reduces detection efficiency; and, because of the diversity of scenes, suitable weights for the reconstruction and prediction errors are difficult to choose, so the detection performance may be unstable as the scene changes.
Disclosure of Invention
The invention aims to solve the problem that the prior art neither utilizes the reverse mapping relation between video frame sequences nor imposes motion constraints from the perspective of their long-term temporal consistency, so that the detection accuracy for abnormal events in video is insufficient.
To this end, the idea of the invention is to construct a bidirectional retrospective generative adversarial network composed of a generator, a frame discriminator and a sequence discriminator. By adopting a bidirectional retrospective training mode, the generator performs both forward and backward prediction, fully mining the bidirectional mapping relation between normal video frame sequences and predicting more accurate future frames of normal events. In addition, a sequence discriminator built from 3D convolution layers captures the long-term temporal information between the predicted frame and the input frame sequence, and its discrimination loss serves as a motion constraint that keeps the predicted frame consistent in motion with the real frames from the perspective of long-term temporal consistency. This strengthens the ability of the prediction network to distinguish the motion patterns of normal and abnormal events in video and improves abnormal-event detection.
The method comprises the following specific steps:
(1) Construct the generative adversarial network:
(1a) Build a 15-layer generator network whose structure is: input layer → first convolution layer → first normalization layer → first activation function layer → second convolution layer → second normalization layer → second activation function layer → first downsampling combination → second downsampling combination → third downsampling combination → first upsampling combination → second upsampling combination → third upsampling combination → third convolution layer → output layer. Each downsampling combination is structured as: max pooling layer → first convolution layer → first normalization layer → first activation function layer → second convolution layer → second normalization layer → second activation function layer. Each upsampling combination is structured as: deconvolution layer → first convolution layer → first normalization layer → first activation function layer → second convolution layer → second normalization layer → second activation function layer. The feature maps output by the first, second and third downsampling combinations are concatenated with the feature maps output by the first, second and third upsampling combinations respectively;
The parameters of each layer in the generator network are set as follows: the convolution kernels of the first, second and third convolution layers are all 3×3 with convolution stride 2, and each layer has 64 kernels; the first and second activation function layers are implemented with the ReLU function;
In each downsampling combination the max pooling kernel is 2×2 with pooling stride 2; the convolution kernels are all 3×3 with convolution stride 2, and the numbers of kernels in the first, second and third combinations are 128, 256 and 512 respectively; the activation function layers are all implemented with the ReLU function;
In each upsampling combination the deconvolution kernel is 2×2 with stride 2; the convolution kernels are all 3×3 with convolution stride 2, and the numbers of kernels in the first, second and third combinations are 256, 128 and 64 respectively; the activation function layers are all implemented with the ReLU function;
(1b) Build a 14-layer frame discriminator network whose structure is: input layer → first convolution layer → first activation function layer → second convolution layer → first normalization layer → second activation function layer → third convolution layer → second normalization layer → third activation function layer → fourth convolution layer → fourth activation function layer → fifth convolution layer → fifth activation function layer → output layer;
The parameters of each layer in the frame discriminator network are set as follows: the convolution kernels of the first, second, third, fourth and fifth convolution layers are all 3×3 with convolution stride 2, and the numbers of kernels are set in sequence to 128, 256, 512 and 1; the first and second normalization layers are implemented with the BatchNorm2d function; the first, second, third and fourth activation function layers are implemented with the LeakyReLU function with slope 0.2; the fifth activation function layer is implemented with the Sigmoid function;
(1c) Build a 16-layer sequence discriminator network whose structure is: input layer → first 3D convolution layer → first 3D max pooling layer → first normalization layer → first activation function layer → second 3D convolution layer → second 3D max pooling layer → second normalization layer → second activation function layer → third 3D convolution layer → third 3D max pooling layer → third normalization layer → third activation function layer → fourth 3D convolution layer → fourth activation function layer → output layer;
The parameters of each layer in the sequence discriminator network are set as follows: the kernels of the first, second, third and fourth 3D convolution layers are all 2×3×3 with convolution stride 1×2×2, and the numbers of kernels are set in sequence to 128, 256, 512 and 1; the pooling kernels of the first, second and third 3D max pooling layers are all 2×3×3 with pooling stride 1×2×2; the first, second and third normalization layers are implemented with the BatchNorm3d function; the first, second and third activation function layers are implemented with the LeakyReLU function with slope 0.2; the fourth activation function layer is implemented with the Sigmoid function;
(1d) Cascade the generator network with the frame discriminator network and with the sequence discriminator network respectively to form the generative adversarial network;
(2) Initialize the generative adversarial network:
Initialize the weights of all convolution layers and normalization layers in the generative adversarial network to random values drawn from a normal distribution with mean 0 and standard deviation 0.02;
(3) Generate a training data set:
Select a continuous surveillance video of duration T minutes that contains no abnormal events, and sequentially segment it into groups of video frame sequences of length 5 and size W×H to form the training data set; where T > 10, W and H denote the width and height of each frame in pixels, with 64 ≤ W ≤ 256 and 64 ≤ H ≤ 256;
(4) Train the generator network in the bidirectional retrospective mode:
(4a) Arrange the first 4 frames of each group of video frame sequences into a forward video frame sequence in forward temporal order, and the last 4 frames into a reverse video frame sequence in reverse temporal order;
(4b) Input the forward video frame sequence into the generator network for forward prediction and output the forward predicted frame; input the reverse video frame sequence into the generator network for backward prediction and output the backward predicted frame;
(4c) Add the forward predicted frame to the video frame sequence used for forward prediction, arrange the last 4 frames of the extended sequence in reverse temporal order, input them into the generator network for backward retrospective prediction, and output the backward retrospective predicted frame; add the backward predicted frame to the video frame sequence previously used for backward prediction, arrange the first 4 frames of the extended sequence in forward temporal order, input them into the generator network for forward retrospective prediction, and output the forward retrospective predicted frame;
(4d) Calculate the loss value of the generator network from the generator loss function, which is constructed from the errors between the predicted frames generated by bidirectional and retrospective prediction and the corresponding real frames; back-propagate the generator loss value using gradient descent to compute all gradients of every convolution kernel in each convolution and deconvolution layer of the generator network, and iteratively update all weights of these kernels with an Adam optimizer according to the computed gradients; the initial learning rate of the Adam optimizer is 0.0002;
(5) Train the frame discriminator network:
(5a) Sequentially input the forward predicted frame and its real counterpart, the backward predicted frame and its real counterpart, the forward retrospective predicted frame and its real counterpart, and the backward retrospective predicted frame and its real counterpart into the frame discriminator network, which outputs the corresponding real/fake probabilities;
(5b) Calculate the loss value of the frame discriminator network from the frame discriminator loss function constructed from the real/fake probabilities it outputs; back-propagate the frame discriminator loss value using gradient descent to compute all gradients of every convolution kernel in each convolution layer and all gradients of the normalization layers of the frame discriminator network, and iteratively update the corresponding weights with an Adam optimizer according to these gradients; the initial learning rate of the Adam optimizer is 0.00002;
(6) Train the sequence discriminator network:
(6a) Sequentially input the video frame sequences formed by the forward predicted frame, the backward predicted frame and the retrospective predicted frames together with their corresponding input frames, as well as the corresponding real video frame sequences, into the sequence discriminator network, which outputs the corresponding real/fake probabilities;
(6b) Calculate the loss value of the sequence discriminator network from the sequence discriminator loss function constructed from the real/fake probabilities it outputs; back-propagate the sequence discriminator loss value using gradient descent to compute all gradients of every convolution kernel in each convolution layer and all gradients of the normalization layers of the sequence discriminator network, and iteratively update the corresponding weights with an Adam optimizer according to these gradients; the initial learning rate of the Adam optimizer is 0.00002;
(7) Judge whether the generator network loss function has converged; if so, execute step (8), otherwise execute step (4);
(8) Training of the bidirectional retrospective generative adversarial network is complete; obtain the trained generator network weights and save all weights of the convolution kernels of each convolution layer and deconvolution layer of the generator network in the trained network;
(9) Detect the video:
Sequentially divide the video to be detected into video frame sequences of length 5 and size M×N, input the first 4 frames of each sequence into the trained generator network, and output the predicted future frame; calculate an anomaly score S from the error between the predicted future frame and the real 5th frame of the sequence; if the anomaly score S exceeds a set threshold, judge that an anomaly occurs in the future frame, otherwise judge that no anomaly occurs; where M and N are equal to W and H respectively, and the anomaly score satisfies 0 ≤ S ≤ 1.
Compared with the prior art, the invention has the following advantages:
First, the invention constructs a generative adversarial network composed of a generator, a frame discriminator and a sequence discriminator and adopts a bidirectional retrospective training mode combining forward and backward prediction with retrospective prediction. This overcomes the problem in the prior art that the reverse mapping relation between video frame sequences is not utilized, which makes the prediction of future frames of normal events insufficiently accurate, and gives the prediction network a stronger ability to distinguish the appearance patterns of normal and abnormal events, thereby improving the detection of appearance anomalies in video.
Second, because the sequence discriminator in the generative adversarial network is built from 3D convolution layers, it can capture the long-term temporal relations within video frame sequences and impose a motion constraint through its discrimination loss. This overcomes the problem in the prior art that predicted future frames are not motion-constrained from the perspective of the long-term temporal consistency of the video frame sequence, which makes the motion constraint on predicted future frames of normal events insufficient, and gives the proposed detection network a stronger ability to distinguish the motion patterns of normal and abnormal events, thereby improving the detection of motion anomalies in video.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of the overall architecture of the generator network of the present invention; FIG. 2(a) shows the structure of the generator network, FIG. 2(b) the structure of a downsampling combination in the generator network, and FIG. 2(c) the structure of an upsampling combination in the generator network;
FIG. 3 is a schematic diagram of the structure of the frame discriminator network in the generative adversarial network of the present invention;
FIG. 4 is a schematic diagram of the structure of the sequence discriminator network in the generative adversarial network of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
The specific steps of the present invention are further described with reference to fig. 1:
Step 1: construct the generative adversarial network.
The specific structure of the constructed generator network is further described with reference to fig. 2 (a).
A 15-layer generator network is built with the structure: input layer → first convolution layer → first normalization layer → first activation function layer → second convolution layer → second normalization layer → second activation function layer → first downsampling combination → second downsampling combination → third downsampling combination → first upsampling combination → second upsampling combination → third upsampling combination → third convolution layer → output layer.
The specific structure of the downsampling layer combination in the generator network is further described with reference to fig. 2 (b).
The structure of each downsampling layer combination is as follows: first max pooling layer → first convolution layer → first normalization layer → first activation function layer → second convolution layer → second normalization layer → second activation function layer.
The specific structure of the up-sampling layer combination in the generator network is further described with reference to fig. 2 (c).
The structure of each upsampling combination is in turn: deconvolution layer → first convolution layer → first normalization layer → first activation function layer → second convolution layer → second normalization layer → second activation function layer.
The feature maps output by the first, second and third downsampling combinations are concatenated with the feature maps output by the first, second and third upsampling combinations respectively.
The parameters of each layer in the generator network are set as follows: the convolution kernels of the first, second and third convolution layers are all 3×3 with convolution stride 2, and each layer has 64 kernels; the first and second activation function layers are implemented with the ReLU function.
In each downsampling combination the max pooling kernel is 2×2 with pooling stride 2; the convolution kernels are all 3×3 with convolution stride 2, and the numbers of kernels in the first, second and third combinations are 128, 256 and 512 respectively; the activation function layers are all implemented with the ReLU function.
In each upsampling combination the deconvolution kernel is 2×2 with stride 2; the convolution kernels are all 3×3 with convolution stride 2, and the numbers of kernels in the first, second and third combinations are 256, 128 and 64 respectively; the activation function layers are all implemented with the ReLU function.
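As a concrete reference, the following is a minimal PyTorch sketch of this generator (the simulation platform described later uses PyTorch 1.2.0). The module names are illustrative; the input is assumed to be 4 stacked RGB frames (12 channels), and the 3×3 convolutions use stride 1 with padding 1, an assumption made only so that the skip connections match in spatial size.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # (convolution -> normalization -> ReLU activation) twice, the repeated pattern above
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
    )

class Generator(nn.Module):
    def __init__(self, in_ch=12, out_ch=3):  # 4 RGB frames in, 1 frame out (assumed)
        super().__init__()
        self.enc0 = conv_block(in_ch, 64)                 # initial convolution pair
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.enc1 = conv_block(64, 128)                   # downsampling combinations
        self.enc2 = conv_block(128, 256)
        self.enc3 = conv_block(256, 512)
        self.up3 = nn.ConvTranspose2d(512, 256, kernel_size=2, stride=2)
        self.dec3 = conv_block(512, 256)                  # 256 upsampled + 256 skip
        self.up2 = nn.ConvTranspose2d(256, 128, kernel_size=2, stride=2)
        self.dec2 = conv_block(256, 128)
        self.up1 = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)
        self.dec1 = conv_block(128, 64)
        self.out = nn.Conv2d(64, out_ch, kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        e0 = self.enc0(x)                                 # 64 channels, full resolution
        e1 = self.enc1(self.pool(e0))                     # 128 channels, 1/2
        e2 = self.enc2(self.pool(e1))                     # 256 channels, 1/4
        e3 = self.enc3(self.pool(e2))                     # 512 channels, 1/8
        d3 = self.dec3(torch.cat([self.up3(e3), e2], dim=1))  # skip connection
        d2 = self.dec2(torch.cat([self.up2(d3), e1], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e0], dim=1))
        return self.out(d1)                               # predicted frame
```

Calling `Generator()(torch.randn(1, 12, 256, 256))` returns a tensor of shape (1, 3, 256, 256), i.e. one predicted frame.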
The specific structure of the frame discriminator network constructed by the present invention will be further described with reference to fig. 3.
A 14-layer frame discriminator network is built with the structure: input layer → first convolution layer → first activation function layer → second convolution layer → first normalization layer → second activation function layer → third convolution layer → second normalization layer → third activation function layer → fourth convolution layer → fourth activation function layer → fifth convolution layer → fifth activation function layer → output layer.
The parameters of each layer in the frame discriminator network are set as follows: the convolution kernels of the first, second, third, fourth and fifth convolution layers are all 3×3 with convolution stride 2, and the numbers of kernels are set in sequence to 128, 256, 512 and 1; the first and second normalization layers are implemented with the BatchNorm2d function; the first, second, third and fourth activation function layers are implemented with the LeakyReLU function with slope 0.2; the fifth activation function layer is implemented with the Sigmoid function.
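A matching PyTorch sketch of the frame discriminator follows. Note that the text lists four kernel counts (128, 256, 512, 1) for five convolution layers; the sketch assumes the fourth layer repeats 512. The 3-channel input and padding of 1 are likewise assumptions.

```python
import torch.nn as nn

class FrameDiscriminator(nn.Module):
    def __init__(self, in_ch=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 128, kernel_size=3, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(128, 256, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(256), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(256, 512, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(512), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(512, 512, kernel_size=3, stride=2, padding=1),  # count assumed
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(512, 1, kernel_size=3, stride=2, padding=1),
            nn.Sigmoid(),  # real/fake probability map
        )

    def forward(self, frame):
        return self.net(frame)
```

Because the final convolution keeps some spatial extent, the output is a small map of per-patch real/fake probabilities; averaging it yields a scalar score for a frame.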
The specific structure of the sequence discriminator network constructed by the present invention will be further described with reference to fig. 4.
A 16-layer sequence discriminator network is built with the structure: input layer → first 3D convolution layer → first 3D max pooling layer → first normalization layer → first activation function layer → second 3D convolution layer → second 3D max pooling layer → second normalization layer → second activation function layer → third 3D convolution layer → third 3D max pooling layer → third normalization layer → third activation function layer → fourth 3D convolution layer → fourth activation function layer → output layer.
The parameters of each layer in the sequence discriminator network are set as follows: the kernels of the first, second, third and fourth 3D convolution layers are all 2×3×3 with convolution stride 1×2×2, and the numbers of kernels are set in sequence to 128, 256, 512 and 1; the pooling kernels of the first, second and third 3D max pooling layers are all 2×3×3 with pooling stride 1×2×2; the first, second and third normalization layers are implemented with the BatchNorm3d function; the first, second and third activation function layers are implemented with the LeakyReLU function with slope 0.2; the fourth activation function layer is implemented with the Sigmoid function.
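A PyTorch sketch of the sequence discriminator. The printed sizes "2×3" and "1×2" are read here as (2, 3, 3) kernels and (1, 2, 2) strides in (time, height, width) order, and time-axis padding is added in the convolutions so that a 5-frame clip remains valid through all four blocks; both choices are assumptions.

```python
import torch.nn as nn

def block3d(in_ch, out_ch):
    # 3D convolution -> 3D max pooling -> normalization -> LeakyReLU
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=(2, 3, 3),
                  stride=(1, 2, 2), padding=(1, 1, 1)),
        nn.MaxPool3d(kernel_size=(2, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        nn.BatchNorm3d(out_ch),
        nn.LeakyReLU(0.2, inplace=True),
    )

class SequenceDiscriminator(nn.Module):
    def __init__(self, in_ch=3):
        super().__init__()
        self.features = nn.Sequential(
            block3d(in_ch, 128), block3d(128, 256), block3d(256, 512))
        self.head = nn.Sequential(
            nn.Conv3d(512, 1, kernel_size=(2, 3, 3),
                      stride=(1, 2, 2), padding=(1, 1, 1)),
            nn.Sigmoid())

    def forward(self, seq):
        # seq: (batch, channels, time, height, width), e.g. a 5-frame clip
        return self.head(self.features(seq))
```

For an input of shape (B, 3, 5, 256, 256) the output is a small spatio-temporal map of real/fake probabilities, which can be averaged into a scalar score for the sequence.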
The generator network is cascaded with the frame discriminator network and with the sequence discriminator network respectively to form the generative adversarial network.
Step 2: initialize the generative adversarial network.
The weights of all convolution layers and normalization layers in the generative adversarial network are initialized to random values drawn from a normal distribution with mean 0 and standard deviation 0.02.
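A sketch of this initialization, applied with `Module.apply`; zeroing the biases is our assumption, since the text only specifies the weights.

```python
import torch.nn as nn

def init_weights(m):
    # draw conv/deconv and normalization weights from N(0, 0.02^2), per the text
    if isinstance(m, (nn.Conv2d, nn.Conv3d, nn.ConvTranspose2d)):
        nn.init.normal_(m.weight, mean=0.0, std=0.02)
        if m.bias is not None:
            nn.init.zeros_(m.bias)
    elif isinstance(m, (nn.BatchNorm2d, nn.BatchNorm3d)):
        nn.init.normal_(m.weight, mean=0.0, std=0.02)
        nn.init.zeros_(m.bias)

# generator.apply(init_weights); frame_d.apply(init_weights); seq_d.apply(init_weights)
```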
Step 3: generate a training data set.
A continuous surveillance video of duration T minutes containing no abnormal events is selected and sequentially segmented into groups of video frame sequences of length 5 and size W×H to form the training data set; where T > 10, W and H denote the width and height of each frame in pixels, with 64 ≤ W ≤ 256 and 64 ≤ H ≤ 256.
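A sketch of the training-set construction: an anomaly-free video is read frame by frame, each frame is resized to W×H, and consecutive 5-frame clips are collected. The OpenCV reading loop, the non-overlapping stride and the scaling of pixel values to [-1, 1] are illustrative choices not specified by the text.

```python
import cv2
import numpy as np

def make_training_clips(video_path, width=256, height=256, clip_len=5):
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(cv2.resize(frame, (width, height)))
    cap.release()
    clips = []
    for start in range(0, len(frames) - clip_len + 1, clip_len):
        clip = np.stack(frames[start:start + clip_len])          # (5, H, W, 3)
        clips.append(clip.astype(np.float32) / 127.5 - 1.0)      # scale to [-1, 1]
    return clips
```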
Step 4: train the generator network in the bidirectional retrospective mode.
Step 1: arrange the first 4 frames of each group of video frame sequences into a forward video frame sequence in forward temporal order, and the last 4 frames into a reverse video frame sequence in reverse temporal order.
Step 2: input the forward video frame sequence into the generator network for forward prediction and output the forward predicted frame; input the reverse video frame sequence into the generator network for backward prediction and output the backward predicted frame.
Forward prediction is realized by the following formula:

x′_{j+1} = G(X_{i:j})

where x′_{j+1} denotes the forward predicted frame at time j+1 output by the generator network, G(·) denotes the output of the generator network in the bidirectional retrospective generative adversarial network, X_{i:j} denotes the forward video frame sequence formed in Step 1 by arranging the first 4 frames of each group in forward temporal order, and i and j denote the labels of the start and end frames of the sequence, with j − i + 1 = 4.

Backward prediction is realized by the following formula:

x′_i = G(X_{j+1:i+1})

where x′_i denotes the backward predicted frame at time i output by the generator network, and X_{j+1:i+1} denotes the reverse video frame sequence formed in Step 1 by arranging the last 4 frames of each group in reverse temporal order, from frame j+1 down to frame i+1.

Step 3: add the forward predicted frame to the video frame sequence used for forward prediction, arrange the last 4 frames of the extended sequence in reverse temporal order, input them into the generator network for backward retrospective prediction, and output the backward retrospective predicted frame; add the backward predicted frame to the video frame sequence previously used for backward prediction, arrange the first 4 frames of the extended sequence in forward temporal order, input them into the generator network for forward retrospective prediction, and output the forward retrospective predicted frame.

Backward retrospective prediction is realized by the following formula:

x̂_i = G(X′_{j+1:i+1})

where x̂_i denotes the backward retrospective predicted frame at time i output by the generator network, and X′_{j+1:i+1} denotes the last 4 frames of the extended video frame sequence of Step 3, {x_i, …, x_j, x′_{j+1}}, arranged in reverse temporal order.

Forward retrospective prediction is realized by the following formula:

x̂_{j+1} = G(X″_{i:j})

where x̂_{j+1} denotes the forward retrospective predicted frame at time j+1 output by the generator network, and X″_{i:j} denotes the first 4 frames of the extended video frame sequence of Step 3, {x′_i, x_{i+1}, …, x_{j+1}}, arranged in forward temporal order.
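The four predictions can be collected into one helper, sketched below against the generator sketch above; `frames` holds the five frame tensors x_i, …, x_{j+1} in forward temporal order, the four input frames are stacked along the channel dimension, and the naming is ours.

```python
import torch

def bidirectional_review(G, frames):
    # frames: list of 5 tensors (B, C, H, W), in forward temporal order
    # forward prediction: frames 1-4 -> x'_{j+1}
    x_fwd = G(torch.cat(frames[0:4], dim=1))
    # backward prediction: frames 5-2 in reverse order -> x'_i
    x_bwd = G(torch.cat(frames[4:0:-1], dim=1))
    # backward retrospective: last 4 of {x_2..x_4, x'_{j+1}} reversed -> x^_i
    x_retro_bwd = G(torch.cat([x_fwd] + frames[3:0:-1], dim=1))
    # forward retrospective: first 4 of {x'_i, x_2..x_4} forward -> x^_{j+1}
    x_retro_fwd = G(torch.cat([x_bwd] + frames[1:4], dim=1))
    return x_fwd, x_bwd, x_retro_bwd, x_retro_fwd
```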
Step 4: calculate the loss value of the generator network from the generator loss function, which is constructed from the errors between the predicted frames generated by bidirectional and retrospective prediction and the corresponding real frames.
The generator network loss function is:

L_G = 1·L + 1·L′ + 0.005·L″ + 0.005·L‴

where L_G denotes the generator network loss function and · denotes multiplication; L denotes the intensity loss between the predicted frames output by the generator and the corresponding real frames, L′ denotes the gradient loss between them, L″ denotes the frame adversarial loss of the generator network, and L‴ denotes its sequence adversarial loss.

L, L′, L″ and L‴ are obtained from the following formulas, in which the sums run over the four predicted frames x′_{j+1}, x′_i, x̂_{j+1}, x̂_i and their real counterparts:

L = Σ ‖x′ − x‖₂²

L′ = Σ_{m=1..K} Σ_{n=1..L} ( ‖ |x′_{m,n} − x′_{m−1,n}| − |x_{m,n} − x_{m−1,n}| ‖₁ + ‖ |x′_{m,n} − x′_{m,n−1}| − |x_{m,n} − x_{m,n−1}| ‖₁ )

L″ = Σ ( D_F(x′) − 1 )²

L‴ = Σ ( D_S(X ∪ x′) − 1 )²

where ‖·‖₂ denotes the 2-norm and ‖·‖₁ the 1-norm; x denotes the real frame corresponding to a predicted frame x′; K and L denote the width and height of each frame, equal to W and H; m and n denote the pixel coordinates within a frame; Σ denotes summation; D_F(·) denotes the output of the frame discriminator network and D_S(·) the output of the sequence discriminator network in the bidirectional retrospective generative adversarial network; and ∪ denotes the temporal concatenation of an input frame sequence with its corresponding predicted frame.
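A sketch of this objective under the least-squares reading of the adversarial terms given above. `preds` and `reals` pair the four predicted frames with their real counterparts; `d_frame_scores` and `d_seq_scores` are the discriminator outputs for the predicted frames and for the concatenated sequences.

```python
import torch.nn.functional as F

def gradient_loss(pred, real):
    # 1-norm distance between the spatial gradients of predicted and real frames
    dpx = (pred[..., :, 1:] - pred[..., :, :-1]).abs()
    dpy = (pred[..., 1:, :] - pred[..., :-1, :]).abs()
    drx = (real[..., :, 1:] - real[..., :, :-1]).abs()
    dry = (real[..., 1:, :] - real[..., :-1, :]).abs()
    return (dpx - drx).abs().mean() + (dpy - dry).abs().mean()

def generator_loss(preds, reals, d_frame_scores, d_seq_scores):
    L_int = sum(F.mse_loss(p, r) for p, r in zip(preds, reals))      # intensity loss L
    L_grad = sum(gradient_loss(p, r) for p, r in zip(preds, reals))  # gradient loss L'
    L_adv_f = sum((s - 1).pow(2).mean() for s in d_frame_scores)     # frame adversarial L''
    L_adv_s = sum((s - 1).pow(2).mean() for s in d_seq_scores)       # sequence adversarial L'''
    return 1.0 * L_int + 1.0 * L_grad + 0.005 * L_adv_f + 0.005 * L_adv_s
```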
The generator loss value is back-propagated using gradient descent to compute all gradients of every convolution kernel in each convolution layer and deconvolution layer of the generator network; all weights of these kernels are then iteratively updated with an Adam optimizer according to the computed gradients. The initial learning rate of the Adam optimizer is 0.0002.
Step 5: train the frame discriminator network.
Step 1: sequentially input the forward predicted frame and its real counterpart, the backward predicted frame and its real counterpart, the forward retrospective predicted frame and its real counterpart, and the backward retrospective predicted frame and its real counterpart into the frame discriminator network, which outputs the corresponding real/fake probabilities.
Step 2: calculate the loss value of the frame discriminator network from the frame discriminator loss function constructed from the real/fake probabilities it outputs.
The frame discriminator loss function is:

L_{D_F} = Σ [ ( D_F(x) − 1 )² + ( D_F(x′) )² ]

where L_{D_F} denotes the frame discriminator loss function, x a real frame, x′ the corresponding predicted frame, and the sum runs over the four predicted frames and their real counterparts.
The frame discriminator loss value is back-propagated using gradient descent to compute all gradients of every convolution kernel in each convolution layer and all gradients of the normalization layers of the frame discriminator network; the corresponding weights are then iteratively updated with an Adam optimizer according to these gradients. The initial learning rate of the Adam optimizer is 0.00002.
Step 6: train the sequence discriminator network.
Step 1: sequentially input the video frame sequences formed by the forward predicted frame, the backward predicted frame and the retrospective predicted frames together with their corresponding input frames, as well as the corresponding real video frame sequences, into the sequence discriminator network, which outputs the corresponding real/fake probabilities.
Step 2: calculate the loss value of the sequence discriminator network from the sequence discriminator loss function constructed from the real/fake probabilities it outputs.
The sequence discriminator loss function is:

L_{D_S} = Σ [ ( D_S(X) − 1 )² + ( D_S(X ∪ x′) )² ]

where L_{D_S} denotes the sequence discriminator loss function, X a real video frame sequence, and X ∪ x′ the temporal concatenation of an input frame sequence with its corresponding predicted frame.
The sequence discriminator loss value is back-propagated using gradient descent to compute all gradients of every convolution kernel in each convolution layer and all gradients of the normalization layers of the sequence discriminator network; the corresponding weights are then iteratively updated with an Adam optimizer according to these gradients. The initial learning rate of the Adam optimizer is 0.00002.
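Under the same assumed least-squares reading, both discriminators can share one loss helper: real inputs are pushed toward probability 1 and generated inputs toward 0, while the generator (above) pushes its outputs toward 1.

```python
def discriminator_loss(real_scores, fake_scores):
    # real_scores / fake_scores: discriminator outputs on real and generated inputs
    loss_real = sum((s - 1).pow(2).mean() for s in real_scores)
    loss_fake = sum(s.pow(2).mean() for s in fake_scores)
    return 0.5 * (loss_real + loss_fake)
```

During training, the generator update (Step 4) and the two discriminator updates (Steps 5 and 6) alternate, each driven by its own Adam optimizer at the learning rates given above.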
Step 7: judge whether the generator network loss function has converged; if so, execute Step 8, otherwise execute Step 4.
Step 8: training of the bidirectional retrospective generative adversarial network is complete; the trained generator network weights are obtained, and all weights of the convolution kernels of each convolution layer and deconvolution layer of the generator network in the trained network are saved.
Step 9: detect the video.
The video to be detected is sequentially divided into video frame sequences of length 5 and size M×N; the first 4 frames of each sequence are input into the trained generator network, which outputs the predicted future frame; an anomaly score S is calculated from the error between the predicted future frame and the real 5th frame of the sequence; if the anomaly score S exceeds the set threshold, it is judged that an anomaly occurs in the future frame, otherwise that no anomaly occurs. M and N are equal to W and H respectively, the anomaly score satisfies 0 ≤ S ≤ 1, and the threshold is set to 0.5.
The anomaly score S is calculated by the following formulas:

PSNR(x, x′) = 10 · log₁₀ ( [max(x′)]² / ( (1/F) Σ_{l=1}^{F} (x_l − x′_l)² ) )

S(t) = 1 − ( PSNR(x_t, x′_t) − min PSNR ) / ( max PSNR − min PSNR )

where PSNR(x, x′) denotes the peak signal-to-noise ratio between the real frame x and the predicted future frame x′, log₁₀ denotes the base-10 logarithm, max denotes the maximum-value operation and min the minimum-value operation (taken over all frames of the video under test in the formula for S(t)), F denotes the total number of pixels in the real frame or the corresponding predicted future frame, l indexes those pixels, S(t) denotes the anomaly score of the future frame predicted at time t, and x_t and x′_t denote the real and predicted frames at time t.
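A sketch of the detection-stage scoring that mirrors these formulas: PSNR is computed per predicted frame, min-max normalized over the test video, and inverted so that a larger S(t) indicates a more likely anomaly.

```python
import numpy as np

def psnr(real, pred):
    # real, pred: float arrays of one frame each, in the same intensity range
    mse = np.mean((real - pred) ** 2)
    return 10.0 * np.log10(pred.max() ** 2 / mse)

def anomaly_scores(psnr_values):
    p = np.asarray(psnr_values, dtype=np.float64)
    normalized = (p - p.min()) / (p.max() - p.min())   # regularity in [0, 1]
    return 1.0 - normalized                            # anomaly score S(t)

# frames whose score exceeds the 0.5 threshold are flagged as abnormal
```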
The effects of the present invention are further described below in conjunction with simulation experiments:
1. Simulation experiment conditions:
The hardware platform of the simulation experiments is: an Intel(R) Core i5-9400F CPU with a clock frequency of 2.9 GHz, 32 GB of memory, and an NVIDIA GeForce RTX 2070 Super graphics card.
The software platform of the simulation experiments is: Ubuntu 16.04, Python 3.6 and PyTorch 1.2.0.
2. Simulation content and simulation result analysis:
The training and test sets of the simulation experiments are generated from the published standard dataset CUHK Avenue (Avenue). This video dataset is 20 minutes long and comprises 37 video clips containing 47 abnormal events. In the simulation experiments, the 16 normal video clips of the Avenue dataset form the training set and the 21 clips containing anomalies form the test set.
The simulation experiments detect the abnormal events in the 21 test-set video clips using the present invention and three existing methods: the future-frame-prediction-based anomaly detection method FFP, the deep-predictive-coding-network-based video anomaly detection method AnoPCN, and the anomaly detection method PRI, which integrates prediction and reconstruction.
The three prior-art methods employed in the simulation experiments are:
The anomaly detection method based on future frame prediction, i.e. the video anomaly detection method proposed by W. Liu et al. in "Future Frame Prediction for Anomaly Detection - A New Baseline," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2018, pp. 6536-6545, abbreviated FFP.
The video anomaly detection method based on a deep predictive coding network, i.e. the method proposed by M. Ye et al. in "AnoPCN: Video Anomaly Detection via Deep Predictive Coding Network," Proceedings of the 27th ACM International Conference on Multimedia, 2019, pp. 1805-1813, abbreviated AnoPCN.
The anomaly detection method combining prediction and reconstruction, i.e. the method proposed by Y. Tang et al. in "Integrating prediction and reconstruction for anomaly detection," Pattern Recognition Letters, vol. 129, pp. 123-130, 2020, abbreviated PRI.
To evaluate the simulation results, the present invention uses AUC (area under the ROC curve) as the performance evaluation index for comparison with the three existing methods; the comparison results are shown in Table 1.
As Table 1 shows, the AUC of the method of the present invention on the Avenue dataset is 88.6%, higher than that of the three prior-art methods, demonstrating that the method detects abnormal events in video more effectively.
Table 1. Comparison of AUC values between the present invention and the three prior-art methods
The simulation experiments show that the bidirectional retrospective generative adversarial network constructed by the invention, composed of a generator, a frame discriminator and a sequence discriminator and trained in the bidirectional retrospective mode, enables the generator to fully mine the bidirectional mapping relation between the predicted frame and the input frame sequence and to predict more accurate future frames of normal events, effectively improving the ability of the prediction network to distinguish the appearance patterns of normal and abnormal videos. In addition, the sequence discriminator built from 3D convolution layers imposes a motion constraint on the predicted frames from the perspective of long-term temporal consistency, effectively improving the ability of the prediction network to distinguish the motion patterns of normal and abnormal videos. The invention thus solves the problems of the prior art, which neither utilizes the reverse mapping relation between video frame sequences, making the prediction of future frames of normal events insufficiently accurate, nor imposes motion constraints from the perspective of the long-term temporal consistency of the video frame sequence, making the motion constraint on predicted future frames of normal events insufficient, and it constitutes a highly practical video abnormal-event detection method.

Claims (9)

1. A video abnormal event detection method based on a bidirectional retrospective generative adversarial network, characterized in that a generative adversarial network consisting of a generator, a frame discriminator and a sequence discriminator is constructed and, during training, is trained by alternately updating the generator, the frame discriminator and the sequence discriminator in a bidirectional retrospective mode combining forward and backward prediction with retrospective prediction, so as to obtain a generator that accurately predicts future frames of normal events in a video but not future frames of abnormal events; the method comprises the following specific steps:
(1) Construct the generative adversarial network:
(1a) Build a 15-layer generator network whose structure is: input layer → first convolution layer → first normalization layer → first activation function layer → second convolution layer → second normalization layer → second activation function layer → first downsampling combination → second downsampling combination → third downsampling combination → first upsampling combination → second upsampling combination → third upsampling combination → third convolution layer → output layer. Each downsampling combination is structured as: max pooling layer → first convolution layer → first normalization layer → first activation function layer → second convolution layer → second normalization layer → second activation function layer. Each upsampling combination is structured as: deconvolution layer → first convolution layer → first normalization layer → first activation function layer → second convolution layer → second normalization layer → second activation function layer. The feature maps output by the first, second and third downsampling combinations are concatenated with the feature maps output by the first, second and third upsampling combinations respectively;
The parameters of each layer in the generator network are set as follows: the convolution kernels of the first, second and third convolution layers are all 3×3 with convolution stride 2, and each layer has 64 kernels; the first and second activation function layers are implemented with the ReLU function;
In each downsampling combination the max pooling kernel is 2×2 with pooling stride 2; the convolution kernels are all 3×3 with convolution stride 2, and the numbers of kernels in the first, second and third combinations are 128, 256 and 512 respectively; the activation function layers are all implemented with the ReLU function;
In each upsampling combination the deconvolution kernel is 2×2 with stride 2; the convolution kernels are all 3×3 with convolution stride 2, and the numbers of kernels in the first, second and third combinations are 256, 128 and 64 respectively; the activation function layers are all implemented with the ReLU function;
(1b) Build a 14-layer frame discriminator network whose structure is: input layer → first convolution layer → first activation function layer → second convolution layer → first normalization layer → second activation function layer → third convolution layer → second normalization layer → third activation function layer → fourth convolution layer → fourth activation function layer → fifth convolution layer → fifth activation function layer → output layer;
The parameters of each layer in the frame discriminator network are set as follows: the convolution kernels of the first, second, third, fourth and fifth convolution layers are all 3×3 with convolution stride 2, and the numbers of kernels are set in sequence to 128, 256, 512 and 1; the first and second normalization layers are implemented with the BatchNorm2d function; the first, second, third and fourth activation function layers are implemented with the LeakyReLU function with slope 0.2; the fifth activation function layer is implemented with the Sigmoid function;
(1c) A 16-layer sequence discriminator network is built, and the structure of the network is as follows: input layer → first 3D convolution layer → first 3D max pooling layer → first normalization layer → first activation function layer → second 3D convolution layer → second 3D max pooling layer → second normalization layer → second activation function layer → third 3D convolution layer → third 3D max pooling layer → third normalization layer → third activation function layer → fourth 3D convolution layer → fourth activation function layer → output layer;
setting parameters of each layer in the sequence discriminator network as follows: the convolution kernel sizes of the first, second, third and fourth 3D convolution layers are all set to 2×3×3, the convolution strides are all set to 1×2×2, and the numbers of convolution kernels are set sequentially to 128, 256, 512 and 1; the pooling kernel sizes of the first, second and third 3D max pooling layers are all set to 2×3×3, with pooling strides of 1×2×2; the first, second and third normalization layers are all implemented with the BatchNorm3d function; the first, second and third activation function layers are implemented with the LeakyReLU function, each with a slope of 0.2; the fourth activation function layer is implemented with the Sigmoid function;
(1d) Cascading the generator network with the frame discriminator network and with the sequence discriminator network, respectively, to form the generative adversarial network;
(2) Initializing the generative adversarial network:
initializing the weights of all convolution layers and normalization layers in the generative adversarial network to random values drawn from a normal distribution with mean 0 and standard deviation 0.02;
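A minimal sketch of this initialization, assuming PyTorch modules as in the earlier sketches; the zero bias initialization is an assumption, since the text only prescribes the weights.

    import torch.nn as nn

    def weights_init(m):
        # Normal(mean 0, std 0.02) for all convolution and normalization layers, per step (2)
        if isinstance(m, (nn.Conv2d, nn.Conv3d, nn.ConvTranspose2d,
                          nn.BatchNorm2d, nn.BatchNorm3d)):
            nn.init.normal_(m.weight, mean=0.0, std=0.02)
            if m.bias is not None:
                nn.init.zeros_(m.bias)  # bias handling assumed; not specified in the text

    # usage: G.apply(weights_init); D_frame.apply(weights_init); D_seq.apply(weights_init)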
(3) Generating a training data set:
selecting a continuous surveillance video that contains no abnormal events and lasts T minutes, and sequentially segmenting it into groups of video frame sequences of length 5 and size W×H to form the training data set; where T > 10, W and H denote the width and height of each frame image in pixels, and 64 ≤ W ≤ 256, 64 ≤ H ≤ 256;
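A small sketch of this segmentation, assuming the frames have already been decoded and resized to W×H (e.g. with OpenCV's cv2.resize) by the caller; make_training_clips is an illustrative name.

    import numpy as np

    def make_training_clips(frames):
        """Slice a normal-only video into non-overlapping 5-frame clips, per step (3).

        frames: iterable of H x W x 3 uint8 arrays, already resized to W x H.
        returns: array of shape (num_clips, 5, H, W, 3).
        """
        clips, buf = [], []
        for f in frames:
            buf.append(f)
            if len(buf) == 5:  # sequence length 5, per step (3)
                clips.append(np.stack(buf))
                buf = []
        return np.asarray(clips)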
(4) Training the generator network in a two-way review mode:
(4a) The first 4 frames of each group of video frame sequences are arranged in forward temporal order into a forward video frame sequence, and the last 4 frames of each group of video frame sequences are arranged in reverse temporal order into a reverse video frame sequence;
(4b) Inputting the forward video frame sequence into the generator network for forward prediction and outputting a forward predicted frame image; inputting the reverse video frame sequence into the generator network for backward prediction and outputting a backward predicted frame image;
(4c) Adding the forward predicted frame image to the video frame sequence used for forward prediction, arranging the last 4 frames of the extended video frame sequence in reverse temporal order, inputting them into the generator network for backward retrospective prediction, and outputting a backward retrospective predicted frame image; adding the backward predicted frame image to the front of the video frame sequence previously used for backward prediction, arranging the first 4 frames of the extended video frame sequence in forward temporal order, inputting them into the generator network for forward retrospective prediction, and outputting a forward retrospective predicted frame image (a code sketch follows);
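A minimal PyTorch sketch of steps (4a)-(4c), reusing the Generator from the step (1a) sketch and assuming the 4 input frames are stacked along the channel axis; stack_frames and two_way_review are illustrative names.

    import torch

    def stack_frames(x):
        # (B, T, C, H, W) -> (B, T*C, H, W): frames concatenated along channels (assumed format)
        b, t, c, h, w = x.shape
        return x.reshape(b, t * c, h, w)

    def two_way_review(G, clip):
        """clip: (B, 5, C, H, W) consecutive frames; returns the four predictions of (4b)-(4c)."""
        pred_fwd = G(stack_frames(clip[:, :4]))          # frames 1..4 -> predict frame 5
        pred_bwd = G(stack_frames(clip[:, 1:].flip(1)))  # frames 5..2 -> predict frame 1
        # backward retrospection: extended [2,3,4,5'] reversed -> predict frame 1
        ext_f = torch.cat([clip[:, 1:4], pred_fwd.unsqueeze(1)], dim=1).flip(1)
        retro_bwd = G(stack_frames(ext_f))
        # forward retrospection: extended [1',2,3,4] in forward order -> predict frame 5
        ext_b = torch.cat([pred_bwd.unsqueeze(1), clip[:, 1:4]], dim=1)
        retro_fwd = G(stack_frames(ext_b))
        return pred_fwd, pred_bwd, retro_fwd, retro_bwd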
(4d) Calculating the loss value of the generator network from a generator loss function constructed from the errors between the several predicted frame images produced by bidirectional prediction and retrospective prediction and the corresponding real frame images; back-propagating the loss value of the generator network by gradient descent and calculating all gradients of every convolution kernel in each convolution layer and deconvolution layer of the generator network; iteratively updating all weights of every convolution kernel in each convolution layer and deconvolution layer of the generator network with an Adam optimizer according to those gradients; the initial learning rate of the Adam optimizer is 0.0002;
The generator network loss function is as follows:

L_G = 1·L + 1·L′ + 0.005·L″ + 0.005·L‴

where L_G denotes the generator network loss function, · denotes multiplication, L denotes the intensity error loss between the predicted frame images output by the generator and the real images, L′ denotes the gradient error loss between the predicted frame images output by the generator and the real images, L″ denotes the frame adversarial loss of the generator network, and L‴ denotes the sequence adversarial loss of the generator network;
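A hedged sketch of the generator update in step (4d), reusing two_way_review from above; intensity_loss, gradient_loss, frame_adv_loss and seq_adv_loss are hypothetical helper names standing in for L, L′, L″ and L‴, whose intended forms are addressed in claim 6 below.

    import torch

    opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)  # initial learning rate 0.0002

    def generator_step(G, clip):
        preds = two_way_review(G, clip)                  # from the earlier sketch
        # weights 1, 1, 0.005, 0.005 follow the stated L_G
        loss_G = (1.0   * intensity_loss(preds, clip)    # L    (hypothetical helper)
                + 1.0   * gradient_loss(preds, clip)     # L'   (hypothetical helper)
                + 0.005 * frame_adv_loss(preds)          # L''  (hypothetical helper)
                + 0.005 * seq_adv_loss(preds, clip))     # L''' (hypothetical helper)
        opt_G.zero_grad()
        loss_G.backward()                                # back-propagation by gradient descent
        opt_G.step()                                     # Adam weight update
        return loss_G.item()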
(5) Training a frame discriminator network:
(5a) Sequentially inputting the forward predicted frame image and its real image, the backward predicted frame image and its real image, the forward retrospective predicted frame image and its real image, and the backward retrospective predicted frame image and its real image into the frame discriminator network, which outputs the corresponding authenticity probabilities;
(5b) Calculating the loss value of the frame discriminator network from a frame discriminator loss function constructed from the authenticity probabilities output by the frame discriminator network; back-propagating the loss value of the frame discriminator network by gradient descent and calculating all gradients of every convolution kernel in each convolution layer and of the normalization layers of the frame discriminator network; iteratively updating all weights of every convolution kernel in each convolution layer and of the normalization layers of the frame discriminator network with an Adam optimizer according to those gradients; the initial learning rate of the Adam optimizer is 0.00002;
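A sketch of the frame discriminator update in steps (5a)-(5b); the extracted text does not reproduce the exact loss, so a standard real-vs-fake binary cross-entropy objective is assumed here.

    import torch

    bce = torch.nn.BCELoss()
    opt_DF = torch.optim.Adam(D_frame.parameters(), lr=2e-5)  # initial learning rate 0.00002

    def frame_discriminator_step(D_frame, real_frame, fake_frame):
        p_real = D_frame(real_frame)            # authenticity probability of the real image
        p_fake = D_frame(fake_frame.detach())   # detach: the generator is not updated here
        loss_DF = (bce(p_real, torch.ones_like(p_real))
                 + bce(p_fake, torch.zeros_like(p_fake)))
        opt_DF.zero_grad()
        loss_DF.backward()
        opt_DF.step()
        return loss_DF.item()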
(6) Training the sequence discriminator network:
(6a) Sequentially inputting the video frame sequences formed by the forward predicted, backward predicted and retrospective predicted frame images together with their corresponding input frames, as well as the corresponding real video frame sequences, into the sequence discriminator network, which outputs the corresponding authenticity probabilities;
(6b) Calculating the loss value of the sequence discriminator network from a sequence discriminator loss function constructed from the authenticity probabilities output by the sequence discriminator network; back-propagating the loss value of the sequence discriminator network by gradient descent and calculating all gradients of every convolution kernel in each convolution layer and of the normalization layers of the sequence discriminator network; iteratively updating all weights of every convolution kernel in each convolution layer and of the normalization layers of the sequence discriminator network with an Adam optimizer according to those gradients; the initial learning rate of the Adam optimizer is 0.00002;
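The sequence discriminator update of steps (6a)-(6b) follows the same pattern, sketched here with the bce objective reused from the previous block and 5-frame volumes as input; the exact claimed loss form is again assumed.

    opt_DS = torch.optim.Adam(D_seq.parameters(), lr=2e-5)  # initial learning rate 0.00002

    def seq_discriminator_step(D_seq, real_seq, fake_seq):
        # real_seq / fake_seq: (B, C, T, H, W) volumes with T = 5, per step (6a)
        p_real = D_seq(real_seq)
        p_fake = D_seq(fake_seq.detach())
        loss_DS = (bce(p_real, torch.ones_like(p_real))
                 + bce(p_fake, torch.zeros_like(p_fake)))
        opt_DS.zero_grad()
        loss_DS.backward()
        opt_DS.step()
        return loss_DS.item()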
(7) Judging whether the generator network loss function has converged: if so, executing step (8); otherwise, executing step (4);
(8) Completing the training of the generative adversarial network by two-way review, obtaining the trained generator network weights, and saving all weights of every convolution kernel in each convolution layer and deconvolution layer of the generator network of the trained two-way review generation countermeasure network;
(9) Detecting the video:
sequentially segmenting the video to be detected into video frame sequences of length 5 and size M×N; inputting the first 4 frames of each sequence into the trained generator network and outputting a predicted future frame image; calculating an anomaly score S from the error between the predicted future frame image and the real 5th frame of the sequence; if the anomaly score S exceeds a set threshold, judging that an anomaly occurs in the future frame, and otherwise judging that no anomaly occurs; where M and N are equal to W and H, respectively, and the anomaly score satisfies 0 ≤ S ≤ 1.
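A sketch of step (9) using the PSNR-based score of claim 9, reusing stack_frames from the step (4) sketch. Note that under the min-max normalization of claim 9 a larger score indicates a more normal frame, so the direction of the threshold comparison, like the example threshold of 0.5, is an interpretive assumption.

    import torch
    import torch.nn.functional as F

    def psnr(real, pred):
        # peak signal-to-noise ratio between real frame and its prediction (claim 9)
        mse = F.mse_loss(pred, real)
        return 10.0 * torch.log10(real.max() ** 2 / mse)

    @torch.no_grad()
    def anomaly_scores(G, clips):
        """clips: (N, 5, C, H, W); min-max-normalized PSNR scores, per claim 9."""
        psnrs = torch.stack([psnr(c[4], G(stack_frames(c[:4].unsqueeze(0)))[0])
                             for c in clips])
        return (psnrs - psnrs.min()) / (psnrs.max() - psnrs.min() + 1e-8)

    # one reading of step (9): flag frames whose normalized score crosses the threshold
    # anomalous = anomaly_scores(G, clips) < 0.5   # 0.5 is an assumed example threshold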
2. The method of claim 1, wherein the forward prediction in step (4b) is implemented by the following formula:

x′_{j+1} = G(X_{i:j})

where x′_{j+1} denotes the (j+1)-th video frame image output by the generator network in forward prediction, G(·) denotes the output of the generator network of the two-way review generation countermeasure network, X_{i:j} denotes the forward video frame sequence of step (4a) in which the first 4 frames of each group of video frame sequences are arranged in forward temporal order, and i and j denote the start-frame and end-frame indices of the video frame sequence, respectively, with j − i + 1 = 4.
3. The method of video anomaly detection for a two-way retrospective generation countermeasure network of claim 2, wherein the backward prediction in step (4b) is implemented by the following formula:

x′_i = G(X̃)

where x′_i denotes the i-th video frame image output by the generator network in backward prediction, and X̃ denotes the reverse video frame sequence of step (4a) in which the last 4 frames of each group of video frame sequences are arranged in reverse temporal order.
4. The method of video anomaly detection for a two-way retrospective generated countermeasure network of claim 3, wherein the backward retrospective prediction in step (4c) is implemented by the following formula:

x̂_i = G(X̂)

where x̂_i denotes the i-th video frame image output by the generator network in backward retrospective prediction, and X̂ denotes the last 4 frames of the extended video frame sequence of step (4c) arranged in reverse temporal order.
5. The method of video anomaly detection for a two-way retrospective generation countermeasure network of claim 4, wherein the forward retrospective prediction in step (4c) is implemented by the following formula:

x̂_{j+1} = G(X̂′)

where x̂_{j+1} denotes the (j+1)-th video frame image output by the generator network in forward retrospective prediction, and X̂′ denotes the first 4 frames of the extended video frame sequence of step (4c) arranged in forward temporal order.
6. The method of claim 5, wherein the L, L′, L″ and L‴ of step (4d) are derived, respectively, from formulas in which ‖·‖₂ denotes the 2-norm operation, x_i denotes the real image corresponding to x′_i and x̂_i, x_{j+1} denotes the real image corresponding to x′_{j+1} and x̂_{j+1}, K and L denote the size of each frame image and are equal to W and H, respectively, m and n denote the position coordinates of pixels in the image, Σ denotes the summation operation, ‖·‖₁ denotes the 1-norm operation, D_F(·) denotes the output of the frame discriminator network of the two-way review generation countermeasure network, D_S(·) denotes the output of the sequence discriminator network of the two-way review generation countermeasure network, and ∪ denotes the temporal concatenation operation.
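The four formulas themselves did not survive extraction. For orientation only, a plausible LaTeX reconstruction consistent with the symbols above and with standard intensity/gradient/adversarial future-frame-prediction losses is the following, writing \hat{x} for a predicted frame and x for its real counterpart and assuming each term is accumulated over the four predictions:

    % Plausible reconstruction under the stated symbols -- an assumption, not the literal claim.
    \begin{aligned}
    L    &= \lVert \hat{x} - x \rVert_2^2, \\
    L'   &= \sum_{m}^{K}\sum_{n}^{L}
            \lVert\, |\hat{x}_{m,n}-\hat{x}_{m-1,n}| - |x_{m,n}-x_{m-1,n}| \,\rVert_1
          + \lVert\, |\hat{x}_{m,n}-\hat{x}_{m,n-1}| - |x_{m,n}-x_{m,n-1}| \,\rVert_1, \\
    L''  &= \big(D_F(\hat{x}) - 1\big)^2, \\
    L''' &= \big(D_S(X \cup \hat{x}) - 1\big)^2 .
    \end{aligned}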
7. The method of claim 6, wherein the frame discriminator loss function of step (5b), denoted L_{D_F}, is constructed from the authenticity probabilities output by the frame discriminator network.
8. The method of claim 6, wherein the sequence discriminator loss function of step (6b), denoted L_{D_S}, is constructed from the authenticity probabilities output by the sequence discriminator network.
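The claimed forms of L_{D_F} and L_{D_S} were likewise lost in extraction; an assumed least-squares-style reconstruction matching the Sigmoid outputs of both discriminators (with \hat{X} denoting a sequence containing predicted frames) is:

    % Assumed forms, not the literal claim text.
    \begin{aligned}
    L_{D_F} &= \tfrac{1}{2}\big(D_F(x) - 1\big)^2 + \tfrac{1}{2}\,D_F(\hat{x})^2, \\
    L_{D_S} &= \tfrac{1}{2}\big(D_S(X) - 1\big)^2 + \tfrac{1}{2}\,D_S(\hat{X})^2 .
    \end{aligned}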
9. The method for video anomaly detection for generating a countermeasure network based on two-way review of claim 6, wherein the anomaly score S of step (9) is calculated by the following formulas:

PSNR(x, x′) = 10 · log₁₀( [max(x)]² / ( (1/F) Σ_{l=1}^{F} (x_l − x′_l)² ) )

S(t) = ( PSNR(x_t, x′_t) − min_t PSNR(x_t, x′_t) ) / ( max_t PSNR(x_t, x′_t) − min_t PSNR(x_t, x′_t) )

where PSNR(x, x′) denotes the peak signal-to-noise ratio between the real image and the predicted future frame image, x denotes the real image, x′ denotes the predicted future frame image, log₁₀ denotes the base-10 logarithm, max denotes the maximum value operation, F denotes the total number of pixels in the real image or the corresponding predicted future frame image, l indexes those pixels, S(t) denotes the anomaly score of the future frame predicted at time t, x_t denotes the real image at time t, x′_t denotes the predicted future frame image at time t, and min denotes the minimum value operation.
CN202010878108.6A 2020-08-27 2020-08-27 Video abnormal event detection method based on two-way review generation countermeasure network Active CN112052763B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010878108.6A CN112052763B (en) 2020-08-27 2020-08-27 Video abnormal event detection method based on two-way review generation countermeasure network

Publications (2)

Publication Number Publication Date
CN112052763A 2020-12-08
CN112052763B 2024-02-09

Family ID=73600525

Country Status (1)

Country Link
CN (1) CN112052763B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant