CN114581738A - Behavior prediction network training method and system and behavior anomaly detection method and system - Google Patents

Behavior prediction network training method and system and behavior anomaly detection method and system

Info

Publication number
CN114581738A
CN114581738A
Authority
CN
China
Prior art keywords
network
frame
key
heterogeneous
loss function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202210285382.1A
Other languages
Chinese (zh)
Inventor
Li Hongjun
Sun Xiaohu
Li Chaobo
Chen Junjie
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nantong University
Original Assignee
Nantong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nantong University filed Critical Nantong University
Priority to CN202210285382.1A
Publication of CN114581738A
Legal status: Withdrawn


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a behavior prediction network training method and system and a behavior anomaly detection method and system, and relates to the technical field of video anomaly detection. The training method comprises the following steps: constructing a heterogeneous twin network based on a convolutional network and a U-Net network; acquiring a training video, where the training video comprises a plurality of temporally consecutive RGB video frames and optical flow frames containing normal behavior; training the convolutional network and the U-Net network with the RGB video frames and the optical flow frames respectively; determining an apparent loss function and a motion loss function; determining a multi-constraint loss function according to the apparent loss function and the motion loss function; and adjusting the weights in the convolutional network and the U-Net network according to the multi-constraint loss function to train the heterogeneous twin network and obtain the trained heterogeneous twin network. The method can effectively predict the behavior of fast-moving objects with similar appearance in complex scenes, and can therefore effectively detect abnormal behavior.

Description

Behavior prediction network training method and system and behavior anomaly detection method and system
Technical Field
The invention relates to the technical field of video anomaly detection, in particular to a behavior prediction network training method and system and a behavior anomaly detection method and system.
Background
With the popularization of surveillance equipment and the wide public attention to social safety, video anomaly detection has gradually become a research hotspot in the field of computer vision. Video anomaly detection aims to automatically detect and locate events that deviate from expected behavior in surveillance videos, using computer vision techniques combined with machine learning methods. However, video anomaly detection is a very challenging task, mainly in the following respects: (1) Scarcity: normal samples in the real world vastly outnumber abnormal samples, and acquiring abnormal samples is extremely expensive. (2) Ambiguity: there is no clear boundary between normal and abnormal behavior. For example, a skateboarder looks similar in appearance to a pedestrian, yet is regarded as an abnormal object prohibited from appearing on a sidewalk.
Most existing methods assume that all regions of the scene (including the stationary background and the moving foreground objects) contribute equally. This assumption is not ideal, because it can be found empirically that the primary elements in anomaly detection are moving objects and people, not the stationary background. Most existing work uses a "twin network" to extract features from different information streams separately. For moving objects in non-complex scenes, such a network can largely balance real-time performance and accuracy. However, for moving objects with fast motion and similar appearance in complex scenes, the performance of the "twin network" may degrade. Feature extraction for fast-moving objects with similar appearance in complex scenes therefore cannot meet the requirements.
Disclosure of Invention
The invention aims to provide a behavior prediction network training method and system and a behavior anomaly detection method and system that can effectively predict the behavior of fast-moving objects with similar appearance in complex scenes, and can therefore effectively detect abnormal behavior.
To achieve the above object, the invention provides the following scheme:
A behavior prediction network training method, comprising the following steps:
constructing a heterogeneous twin network based on a convolutional network and a U-Net network;
acquiring a training video, where the training video comprises a plurality of temporally consecutive RGB video frames and optical flow frames containing normal behavior;
inputting any temporally consecutive RGB video frames into the convolutional network of the heterogeneous twin network, and inputting any temporally consecutive optical flow frames into the U-Net network of the heterogeneous twin network;
determining an apparent loss function according to the output of the convolutional network of the heterogeneous twin network and the RGB video frame at the moment following the temporally consecutive RGB video frames;
determining a motion loss function according to the output of the U-Net network of the heterogeneous twin network and the optical flow frame at the moment following the temporally consecutive optical flow frames;
determining a multi-constraint loss function according to the apparent loss function and the motion loss function;
and adjusting the weights in the convolutional network and the U-Net network according to the multi-constraint loss function so as to train the heterogeneous twin network and obtain the trained heterogeneous twin network.
The invention also provides a behavior anomaly detection method, comprising the following steps:
acquiring a target video real frame, where the target video real frame comprises a plurality of temporally consecutive RGB real video frames and optical flow real frames containing normal behavior;
inputting the target video real frame into a heterogeneous twin network to obtain a target video predicted frame, where the heterogeneous twin network is a network trained according to the above behavior prediction network training method;
calculating the peak signal-to-noise ratio between the target video predicted frame and the target video real frame;
calculating a regularity score according to the peak signal-to-noise ratio, where the regularity score is used to judge the degree of normality of the target video real frame;
judging whether the regularity score is lower than a preset threshold;
if so, abnormal behavior exists in the target video real frame;
if not, no abnormal behavior exists in the target video real frame.
The invention also provides a behavior prediction network training system, which comprises:
the heterogeneous twin network construction unit is used for constructing a heterogeneous twin network based on the convolution network and the U-Net network;
the training video acquisition unit is used for acquiring a training video, where the training video comprises a plurality of temporally consecutive RGB video frames and optical flow frames containing normal behavior;
the input unit is used for inputting any temporally consecutive RGB video frames into the convolutional network of the heterogeneous twin network and inputting any temporally consecutive optical flow frames into the U-Net network of the heterogeneous twin network;
an apparent loss function determining unit, configured to determine an apparent loss function according to the output of the convolutional network of the heterogeneous twin network and the RGB video frame at the moment following the temporally consecutive RGB video frames;
a motion loss function determining unit, configured to determine a motion loss function according to the output of the U-Net network of the heterogeneous twin network and the optical flow frame at the moment following the temporally consecutive optical flow frames;
a multi-constraint loss function determination unit for determining a multi-constraint loss function from the apparent loss function and the motion loss function;
and the heterogeneous twin network training unit is used for adjusting the weights in the convolution network and the U-Net network according to the multi-constraint loss function so as to train the heterogeneous twin network and obtain the trained heterogeneous twin network.
The present invention also provides a behavior anomaly detection system, which includes:
the target video real frame acquisition unit is used for acquiring a target video real frame, where the target video real frame comprises a plurality of temporally consecutive RGB real video frames and optical flow real frames containing normal behavior;
a target video prediction frame obtaining unit, configured to input the target video real frame into a heterogeneous twin network to obtain a target video prediction frame, where the heterogeneous twin network is a network trained according to the behavior prediction network training method;
the peak signal-to-noise ratio calculation unit is used for calculating the peak signal-to-noise ratio of the target video prediction frame and the target video real frame;
the regularity score calculating unit is used for calculating a regularity score according to the peak signal-to-noise ratio, and the regularity score is used for judging the normal degree of the real frame of the target video;
the judging unit is used for judging whether the regularity score is lower than a preset threshold value or not;
if so, abnormal behaviors exist in the real frame of the target video;
if not, the abnormal behavior does not exist in the real frame of the target video.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the invention provides a behavior prediction network training method and system and a behavior anomaly detection method and system, wherein the behavior prediction network training method comprises the following steps: constructing an isomeric twin network based on a convolution network and a U-Net network; acquiring a training video, wherein the training video comprises a plurality of time-continuous RGB video frames containing normal behaviors and optical flow frames; inputting RGB video frames continuous in any time into a convolution network of the heterogeneous twin network, and inputting optical flow frames continuous in any time into a U-Net network of the heterogeneous twin network; determining an apparent loss function according to the output of the convolution network of the heterogeneous twin network and the RGB video frame at the next moment of the RGB video frame continuous in any time; determining a motion loss function according to the output of the U-Net network of the heterogeneous twin network and the optical flow frame at the next moment of the optical flow frame at any time; determining a multi-constraint loss function according to the apparent loss function and the motion loss function; and adjusting weights in the convolution network and the U-Net network according to the multi-constraint loss function so as to train the heterogeneous twin network and obtain the trained heterogeneous twin network. Compared with a twin network in the prior art, the heterogeneous twin network is composed of a convolution network and a U-Net network, the convolution network can be suitable for feature extraction of moving objects with similar appearances, the U-Net network can be suitable for feature extraction of moving objects with rapid movement, behavior prediction of the moving objects with rapid movement and similar appearances in complex scenes can be effectively achieved through the method, and abnormal behaviors can be effectively detected according to behavior prediction results.
Drawings
To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed in the embodiments are briefly described below. Obviously, the drawings in the following description are only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a behavior prediction network training method according to embodiment 1 of the present invention;
FIG. 2 is a diagram of a heterogeneous twin network architecture;
FIG. 3 is an exemplary diagram of detection performance of a convolutional network and a U-Net network in a complex environment;
FIG. 4 is a schematic view of a "chain" manifold distribution;
FIG. 5 is a key-value module diagram;
fig. 6 is a flowchart of a behavior anomaly detection method according to embodiment 2 of the present invention;
FIG. 7 is a heterogeneous twin video anomaly detection framework based on key-value modules;
fig. 8 is a block diagram of a behavior prediction network training system according to embodiment 3 of the present invention;
fig. 9 is a block diagram of a behavior anomaly detection system according to embodiment 4 of the present invention;
FIG. 10 is a frame-level ROC plot of different methods on the UCSD Ped2, CUHK Avenue, and ShanghaiTech datasets;
FIG. 11 is a graph comparing EER for different methods;
FIG. 12 is a graph of the regularity scores of #002, #004 videos in the UCSD Ped2 dataset;
FIG. 13 is a graph of the regularity scores of the #004 and #015 videos in the CUHK Avenue dataset;
FIG. 14 is a graph of the regularity scores of #01_0029 and #03_0032 videos in the ShanghaiTech dataset;
FIG. 15 is a graph of AUC and EER for different abnormal behavior on the ShanghaiTech data set;
FIG. 16 is a graph of feature distribution in a key-value module visualized from different angles;
FIG. 17 is a visualization of t-SNE in MNIST (upper) and ShanghaiTech (lower) datasets;
FIG. 18 is a diagram of different anomalous behavior in a similar complex environment;
FIG. 19 is an analysis of the optimal placement of different loss coefficients on the Ped2 data set.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
The invention aims to provide a behavior prediction network training method and system and a behavior anomaly detection method and system that can effectively predict the behavior of fast-moving objects with similar appearance in complex scenes, and can therefore effectively detect abnormal behavior.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Example 1:
referring to fig. 1, the present invention provides a behavior prediction network training method, which includes the following steps:
A1: constructing a heterogeneous twin network based on a convolutional network and a U-Net network. In the field of video anomaly detection, it is very important to judge anomalies in surveillance video by making full use of the complementary appearance and motion information of the target object. In view of the insufficient detection capability of the twin network in complex scenes, the invention adopts a heterogeneous twin network to encode appearance information such as the shape and position of the target object together with motion information such as its speed. The network consists of two independent processing streams (appearance and motion); each stream uses a different encoder, and the inputs of the two streams differ, as shown in FIG. 2;
A2: acquiring a training video, where the training video comprises a plurality of temporally consecutive RGB video frames and optical flow frames containing normal behavior;
A3: inputting any temporally consecutive RGB video frames into the convolutional network of the heterogeneous twin network, and inputting any temporally consecutive optical flow frames into the U-Net network of the heterogeneous twin network;
A4: determining an apparent loss function according to the output of the convolutional network of the heterogeneous twin network and the RGB video frame at the moment following the temporally consecutive RGB video frames;
A5: determining a motion loss function according to the output of the U-Net network of the heterogeneous twin network and the optical flow frame at the moment following the temporally consecutive optical flow frames;
A6: determining a multi-constraint loss function according to the apparent loss function and the motion loss function;
A7: adjusting the weights in the convolutional network and the U-Net network according to the multi-constraint loss function so as to train the heterogeneous twin network and obtain the trained heterogeneous twin network.
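For concreteness, the following is a minimal PyTorch sketch of how such a heterogeneous twin network could be assembled for steps A1–A3. The class names, channel widths, clip length and input resolution are illustrative assumptions, not the configuration fixed by the patent, and the key-value retrieval between encoder and decoder (introduced below) is omitted here.

```python
# Minimal sketch of a heterogeneous twin network: a plain convolutional
# encoder-decoder for the appearance (RGB) stream and a U-Net-style
# encoder-decoder with a skip connection for the motion (optical flow) stream.
import torch
import torch.nn as nn

class AppearanceCAE(nn.Module):
    """Convolutional encoder-decoder for the RGB appearance stream."""
    def __init__(self, in_ch: int):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, 2, 1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, 2, 1), nn.ReLU(inplace=True))
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 3, 4, 2, 1))

    def forward(self, x):
        z_a = self.enc(x)            # first coding sequence z_a
        return self.dec(z_a), z_a    # predicted RGB frame, latent code

class MotionUNet(nn.Module):
    """U-Net-style stream: same-resolution skip connection fuses
    high-level semantic and low-level detail features."""
    def __init__(self, in_ch: int):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, 64, 3, 2, 1), nn.ReLU(inplace=True))
        self.enc2 = nn.Sequential(nn.Conv2d(64, 128, 3, 2, 1), nn.ReLU(inplace=True))
        self.up = nn.ConvTranspose2d(128, 64, 4, 2, 1)
        self.dec = nn.Sequential(
            nn.Conv2d(128, 64, 3, 1, 1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 2, 4, 2, 1))

    def forward(self, x):
        e1 = self.enc1(x)
        z_m = self.enc2(e1)                        # second coding sequence z_m
        d = torch.cat([self.up(z_m), e1], dim=1)   # skip connection (fusion)
        return self.dec(d), z_m                    # predicted flow frame, code

class HeterogeneousTwin(nn.Module):
    """Two independent streams with different encoders and different inputs."""
    def __init__(self, T: int):
        super().__init__()
        self.appearance = AppearanceCAE(in_ch=3 * T)  # T stacked RGB frames
        self.motion = MotionUNet(in_ch=2 * T)         # T stacked flow fields

    def forward(self, rgb_seq, flow_seq):
        pred_rgb, z_a = self.appearance(rgb_seq)
        pred_flow, z_m = self.motion(flow_seq)
        return pred_rgb, pred_flow, z_a, z_m

# Example usage: T = 4 past frames predict the next RGB and flow frames.
twin = HeterogeneousTwin(T=4)
rgb = torch.randn(1, 12, 64, 64)   # 4 RGB frames, 3 channels each
flow = torch.randn(1, 8, 64, 64)   # 4 flow fields, 2 channels each
pred_rgb, pred_flow, z_a, z_m = twin(rgb, flow)
```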
As shown in fig. 3, different networks differ in their performance at detecting abnormal motion. The heterogeneous twin network is composed of two sub-networks with different structures and is thus better suited to the two different kinds of input information. The invention effectively overcomes the limited feature extraction capability of the twin network in complex scenes, making the network more targeted and adaptive.
After step A3 and before step A4, the method further includes:
A8: converting and encoding a plurality of temporally consecutive RGB video frames with the encoder of the convolutional network to obtain a plurality of first coding sequences;
A9: determining the weight of each first coding sequence to obtain a plurality of first addressing probabilities;
A10: determining, according to the first addressing probability corresponding to each first coding sequence, the key-value pair corresponding to each first coding sequence;
A11: determining the similarity between the key-value pairs corresponding to the first coding sequences, and merging the first coding sequences in key-value pairs whose similarity is greater than a first preset threshold into the same key-value pair;
A12: decoding the first coding sequence in each key-value pair with the decoder of the convolutional network to obtain an RGB video predicted frame, where the RGB video predicted frame is the output of the convolutional network of the heterogeneous twin network.
When training the convolutional network, its encoder detects whether an anomaly exists mainly by learning the common apparent features of the static scene and of target objects of interest, such as trucks and bicycles. The input to the encoder of the convolutional network is a video sequence of length $T$, $I_{t-T},\dots,I_{t-1}$, and the decoder of the convolutional network outputs the first predicted frame. Specifically, the encoder of the convolutional network converts the consecutive video frames into a first coding sequence, the first coding sequence is stored in the key-value module to obtain a first latent vector, and the decoder of the convolutional network decodes the first latent vector to obtain the first predicted frame.
$$z_a = E_a\!\left(I_{t-T},\dots,I_{t-1};\,\theta_{E_a}\right) \tag{1}$$

$$\hat{I}_t = D_a\!\left(\hat{z}_a;\,\theta_{D_a}\right) \tag{2}$$

where $E_a$ is the encoder of the convolutional network, $D_a$ is the decoder of the convolutional network, $\theta_{E_a}$ and $\theta_{D_a}$ are the parameters of the encoder and decoder of the convolutional network respectively, $z_a$ is the first coding sequence used to retrieve the features already stored in the key-value module, and $\hat{z}_a$ is the first latent vector obtained from the retrieved features (for a standard AE model, $\hat{z}_a = z_a$). $\hat{I}_t$ is the first predicted frame.
To force the predicted frame $\hat{I}_t$ in image space to be closer to its real frame, the invention adds an apparent loss function as an appearance penalty, which guarantees the consistency of all pixels in RGB space. The apparent loss function is shown in equation (3):

$$l_a = \sum_{t=1}^{T} \left\| \hat{I}_t - I_t \right\|_2^2 \tag{3}$$

where $l_a$ is the apparent loss function, $I_t$ is the real RGB video frame, $\hat{I}_t$ is the predicted RGB video frame, $t$ indexes the video frame sequence, $T$ is the total length of the video frame sequence, and $\|\cdot\|_2$ is the Euclidean distance.
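As a minimal sketch of equation (3) in PyTorch, assuming the per-pixel squared error is summed per frame and averaged over the batch:

```python
# Apparent loss l_a of equation (3): squared L2 distance between the
# predicted RGB frame and the ground-truth next frame (batch-averaged).
import torch

def apparent_loss(pred_rgb: torch.Tensor, true_rgb: torch.Tensor) -> torch.Tensor:
    # sum of squared differences over all pixels, averaged over the batch
    return ((pred_rgb - true_rgb) ** 2).flatten(1).sum(dim=1).mean()
```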
After step A3 and before step A5, the method further includes:
A13: converting and encoding a plurality of temporally consecutive optical flow frames with the encoder of the U-Net network to obtain a plurality of second coding sequences;
A14: determining the weight of each second coding sequence to obtain a plurality of second addressing probabilities;
A15: determining, according to the second addressing probability corresponding to each second coding sequence, the key-value pair corresponding to each second coding sequence;
A16: determining the similarity between the key-value pairs corresponding to the second coding sequences, and merging the second coding sequences in key-value pairs whose similarity is greater than a second preset threshold into the same key-value pair;
A17: decoding the second coding sequence in each key-value pair with the decoder of the U-Net network to obtain an optical flow predicted frame, where the optical flow predicted frame is the output of the U-Net network of the heterogeneous twin network.
In addition to the apparent features of the target object, the motion state of a typical target is also very important for detecting anomalies in video. Unlike the convolutional network, the encoder of the U-Net network includes skip connections between high and low layers of the same resolution, realizing the combination of high-level semantic features and low-level detail features and learning the association between abnormal values and the corresponding motions. To better extract salient motion features, the invention uses optical flow as the motion-related feature, since it is sensitive to motion discontinuities.
Similarly, when training the U-Net network, the input to the encoder of the U-Net network is a video sequence of length $T$, $F_{t-T},\dots,F_{t-1}$, and the output of the decoder of the U-Net network is the predicted frame $\hat{F}_t$. Specifically, the encoder of the U-Net network converts the consecutive optical flow frames into a second coding sequence, the second coding sequence is stored in the key-value module to obtain a second latent vector, and the decoder of the U-Net network decodes the second latent vector to obtain the second predicted frame.
$$z_m = E_m\!\left(F_{t-T},\dots,F_{t-1};\,\theta_{E_m}\right) \tag{4}$$

$$\hat{F}_t = D_m\!\left(\hat{z}_m;\,\theta_{D_m}\right) \tag{5}$$

where $E_m$ is the encoder of the U-Net network, $D_m$ is the decoder of the U-Net network, $\theta_{E_m}$ and $\theta_{D_m}$ are the parameters of the encoder and decoder of the U-Net network, $z_m$ is the second coding sequence used to retrieve the features already stored in the key-value module, $\hat{z}_m$ is the second latent vector obtained from the retrieved features, and $\hat{F}_t$ is the second predicted frame.
Noisy regions may be amplified during the process of generating smooth optical flow. The invention therefore employs an $L_1$-distance loss, denoted the motion loss function, to reduce their influence when learning motion information. The motion loss function is shown in equation (6):

$$l_m = \sum_{t=1}^{T} \left\| \hat{F}_t - F_t \right\|_1 \tag{6}$$

where $l_m$ is the motion loss function, $F_t$ is the real optical flow frame, $\hat{F}_t$ is the predicted optical flow frame, and $\|\cdot\|_1$ is the $L_1$ distance.
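A matching sketch of the motion loss of equation (6), under the same batch-averaging assumption as the apparent loss above:

```python
# Motion loss l_m of equation (6): L1 distance between predicted and real
# optical flow frames, which damps the influence of amplified noisy regions.
import torch

def motion_loss(pred_flow: torch.Tensor, true_flow: torch.Tensor) -> torch.Tensor:
    return (pred_flow - true_flow).abs().flatten(1).sum(dim=1).mean()
```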
Because a clustering algorithm may cluster features into a chain in high-dimensional space, features of different classes may be "covered" together during the high- to low-dimensional mapping, affecting the determination of the distances between different features, as shown in fig. 4. Aiming at this limitation of clustering-based discrimination, the invention introduces a key-value module for the first time; its schematic diagram is shown in FIG. 5. Specifically, the appearance and motion features extracted by the heterogeneous twin network are stored in the key-value module, so that new samples can be reasoned over in the testing stage to judge whether an anomaly exists. Meanwhile, the features are updated through the update mechanism of the key-value module, so that features of different classes are stored in different key-value pairs, effectively avoiding chain-like clustering in the high-dimensional feature space. The continuous updating of the key-value module also alleviates the prediction differences of the same target object in normal samples under different contexts.
As shown in fig. 5, the key-value module consists of three main components: key addressing, value reading, and a write controller.
In the key-value module, the key-value pairs are defined as a vector $Z$ generated by key hashing, as shown in the following equation:

$$Z = \left\{ (k_{z_i}, v_{z_i}) \right\}_{i=1}^{N} \tag{7}$$

where $N$ is the maximum number of key-value pairs contained in the key-value module $Z$, $k_{z_i}$ is the key of the $i$-th key-value pair, and $v_{z_i}$ is the value of the $i$-th key-value pair.
During key addressing, each candidate $x$ is assigned a weight as its addressing probability to retrieve its associated items, where a candidate $x$ is a first or second coding sequence. Each addressing probability is defined as follows:

$$w_i = \frac{\exp\!\left(\Phi_K(x)^{\top}\Phi_K(z_i)\right)}{\sum_{j=1}^{N}\exp\!\left(\Phi_K(x)^{\top}\Phi_K(z_j)\right)} \tag{8}$$

where $w_i$ is the weight, $\Phi_K(x)$ is the key generated for the candidate $x$, $\Phi_K(z_i)$ is the generated key of the $i$-th key-value pair, and $\Phi_K(z_j)$ is the generated key of the $j$-th key-value pair.
In the value reading stage, the values of the key-value pairs are read by taking their weighted sum with the addressing probabilities and returning the output vector $\hat{v}$:

$$\hat{v} = \sum_{i=1}^{N} w_i\,\Phi_V(z_i) \tag{9}$$

where $\hat{v}$ is the output vector and $\Phi_V(z_i)$ is the extracted feature value.
To fully embody intra-class commonality and inter-class diversity beyond individual samples, the write controller is used to update the key-value pairs in the key-value module. The motivation is to store new similar features in the same key-value pair; specifically, the similarity between features is computed through residual similarity, which allows more relevant information to be collected for subsequent access. The rule is to update a key with a new candidate $x$ by integrating the output vector with $x$ and normalizing; the updated key is expressed as follows:

$$\hat{k}_{z_i} = \frac{\Phi_K(x) + \hat{v}_k}{\left\| \Phi_K(x) + \hat{v}_k \right\|_2} \tag{10}$$

where $\hat{k}_{z_i}$ is the updated key, $\Phi_K(x)$ is the key of the candidate $x$, and $\hat{v}_k$ is the output vector of the $k$ most similar features.
The key-value pairs are then repeatedly updated. Note that if no storage space is available, the key is updated by equation (10). The key addressing equation is transformed accordingly to update the query, as shown in equation (11):

$$\hat{w}_i = \frac{\exp\!\left(\Phi_K(\hat{x})^{\top}\Phi_K(z_i)\right)}{\sum_{j=1}^{N}\exp\!\left(\Phi_K(\hat{x})^{\top}\Phi_K(z_j)\right)} \tag{11}$$

where $\hat{w}_i$ is the updated addressing probability and $\hat{x}$ is the updated candidate.
As a possible implementation, the multi-constraint loss function of the heterogeneous twin network further comprises a feature compactness loss function and a feature separation loss function, where the feature compactness loss function represents the intra-class loss of the key-value pairs, and the feature separation loss function represents the inter-class loss of the key-value pairs and is penalized with the $L_2$-norm.
Specifically, the multi-constraint loss function is calculated as follows:

$$l_{total} = \eta_a l_a + \eta_m l_m + \eta_f l_f \tag{12}$$

where $\eta_a$ is the weight coefficient of the apparent loss, $\eta_m$ is the weight coefficient of the motion loss, $\eta_f$ is the weight coefficient of the feature loss, $l_{total}$ is the multi-constraint loss function, $l_a$ is the apparent loss function, and $l_f$ is the feature loss function.

$$l_f = l_c + l_s \tag{13}$$

where $l_c$ is the feature compactness loss function and $l_s$ is the feature separation loss function.

$$l_c = \sum_{i} \left\| \hat{q}_i - z_n \right\|_2^2 \tag{14}$$

$$n = \arg\max_{j \in N} w_j \tag{15}$$

where $\hat{q}_i$ is the updated key (query), $n$ is the key index of the item in the key-value module closest to the query, $z_n$ is the key in the key-value module closest to the query item, and $N$ is the total number of key-value pairs.

$$l_s = \sum_{i} \max\!\left(0,\ \left\| \hat{q}_i - z_n \right\|_2 - \left\| \hat{q}_i - z_m \right\|_2 + \alpha \right) \tag{16}$$

$$m = \arg\max_{j \in N,\, j \neq n} w_j \tag{17}$$

where $z_m$ is the key in the key-value module second-closest to the query item, $m$ is the key index of that second-closest item, $w_i$ is the weight, and $\alpha$ controls the confidence margin between key-value pairs.
Through the multi-constraint loss function, the learned normal-sample features are both compact and representative. The optimal configuration of the different loss parameters is analyzed on the video anomaly detection datasets.
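A sketch combining the pieces into the multi-constraint loss of equations (12)–(17); the reduction over the batch and the default weights ($\eta_a=2$, $\eta_m=1$, $\eta_f=0.1$, taken from the parameter analysis of FIG. 19) are assumptions:

```python
# Multi-constraint loss (12)-(17). The compactness term pulls each query
# toward its nearest key z_n (eq 14); the separation term is a hinge that
# pushes the second-nearest key z_m away by at least the margin alpha (eq 16).
import torch

def feature_losses(queries: torch.Tensor, keys: torch.Tensor, alpha: float = 1.0):
    d = torch.cdist(queries, keys)                       # (B, N) L2 distances
    two_nearest = d.topk(2, dim=1, largest=False).values
    d_n, d_m = two_nearest[:, 0], two_nearest[:, 1]      # nearest, 2nd-nearest
    l_c = (d_n ** 2).mean()                              # compactness, eq (14)
    l_s = torch.clamp(d_n - d_m + alpha, min=0).mean()   # separation, eq (16)
    return l_c, l_s

def multi_constraint_loss(l_a, l_m, l_c, l_s,
                          eta_a=2.0, eta_m=1.0, eta_f=0.1):
    # eta_a=2, eta_m=1, eta_f=0.1 follow the ablation reported in FIG. 19
    return eta_a * l_a + eta_m * l_m + eta_f * (l_c + l_s)  # eq (12)-(13)
```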
Example 2:
referring to fig. 6, the present invention provides a behavior anomaly detection method, including:
B1: acquiring a target video real frame, where the target video real frame comprises a plurality of temporally consecutive RGB real video frames and optical flow real frames containing normal behavior;
B2: inputting the target video real frame into a heterogeneous twin network to obtain a target video predicted frame, where the heterogeneous twin network is a network trained according to the behavior prediction network training method of embodiment 1;
B3: calculating the peak signal-to-noise ratio between the target video predicted frame and the target video real frame;
B4: calculating a regularity score according to the peak signal-to-noise ratio, where the regularity score is used to judge the degree of normality of the target video real frame;
B5: judging whether the regularity score is lower than a preset threshold;
B6: if so, abnormal behavior exists in the target video real frame;
B7: if not, no abnormal behavior exists in the target video real frame. A specific anomaly detection framework is shown in fig. 7.
In the testing phase, to evaluate the prediction quality on images from different datasets, the PSNR between the target video predicted frame and the target video real frame is calculated with equation (18):

$$\mathrm{PSNR}\!\left(L_t, \hat{L}_t\right) = 10 \log_{10} \frac{\left[\max\left(\hat{L}_t\right)\right]^2}{\frac{1}{M} \sum \left( L_t(r) - \hat{L}_t(l) \right)^2} \tag{18}$$

where PSNR is the peak signal-to-noise ratio, $L_t$ is the target video real frame, $\hat{L}_t$ is the target video predicted frame, $l$ is the spatial index over the target video predicted frame, $r$ is the corresponding spatial index over the target video real frame, and $M$ is the number of pixels.
Specifically, the regularity score is calculated as follows:

$$s(t) = \frac{\mathrm{PSNR}\!\left(L_t, \hat{L}_t\right) - \min_t \mathrm{PSNR}\!\left(L_t, \hat{L}_t\right)}{\max_t \mathrm{PSNR}\!\left(L_t, \hat{L}_t\right) - \min_t \mathrm{PSNR}\!\left(L_t, \hat{L}_t\right)} \tag{19}$$

where $s(t)$ is the regularity score, PSNR is the peak signal-to-noise ratio, $L_t$ is the target video real frame, and $\hat{L}_t$ is the target video predicted frame.
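A sketch of the test-time scoring of equations (18) and (19), assuming the min/max in (19) are taken over all frames of the test video:

```python
# PSNR (18), min-max normalized regularity score (19), and thresholding
# (steps B5-B7): frames whose score falls below the preset threshold are
# flagged as abnormal. The threshold value here is an assumption.
import numpy as np

def psnr(real: np.ndarray, pred: np.ndarray) -> float:
    mse = np.mean((real - pred) ** 2)            # (1/M) * sum of squared errors
    return 10.0 * np.log10(pred.max() ** 2 / mse)

def regularity_scores(psnr_per_frame: np.ndarray) -> np.ndarray:
    p = psnr_per_frame
    return (p - p.min()) / (p.max() - p.min())   # s(t) in [0, 1]

def detect(psnr_per_frame: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    return regularity_scores(psnr_per_frame) < threshold
```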
Example 3:
referring to fig. 8, the present invention provides a behavior prediction network training system, which includes:
the heterogeneous twin network construction unit 1 is used for constructing a heterogeneous twin network based on a convolution network and a U-Net network;
the training video acquisition unit 2 is used for acquiring a training video, where the training video comprises a plurality of temporally consecutive RGB video frames and optical flow frames containing normal behavior;
the input unit 3 is used for inputting any temporally consecutive RGB video frames into the convolutional network of the heterogeneous twin network and inputting any temporally consecutive optical flow frames into the U-Net network of the heterogeneous twin network;
an apparent loss function determining unit 4, configured to determine an apparent loss function according to the output of the convolutional network of the heterogeneous twin network and the RGB video frame at the moment following the temporally consecutive RGB video frames;
a motion loss function determining unit 5, configured to determine a motion loss function according to the output of the U-Net network of the heterogeneous twin network and the optical flow frame at the moment following the temporally consecutive optical flow frames;
a multi-constraint loss function determination unit 6 for determining a multi-constraint loss function from the apparent loss function and the motion loss function;
and the heterogeneous twin network training unit 7 is used for adjusting the weights in the convolution network and the U-Net network according to the multi-constraint loss function so as to train the heterogeneous twin network and obtain the trained heterogeneous twin network.
Example 4:
referring to fig. 9, the present invention provides a behavior anomaly detection system, which includes:
a target video real frame acquiring unit 8, configured to acquire a target video real frame, where the target video real frame comprises a plurality of temporally consecutive RGB real video frames and optical flow real frames containing normal behavior;
a target video prediction frame obtaining unit 9, configured to input the target video real frame into a heterogeneous twin network to obtain a target video prediction frame, where the heterogeneous twin network is a network trained according to the behavior prediction network training method;
a peak signal-to-noise ratio calculation unit 10, configured to calculate a peak signal-to-noise ratio between the target video predicted frame and the target video real frame;
a regularity score calculating unit 11, configured to calculate a regularity score according to the peak signal-to-noise ratio, where the regularity score is used to determine a normal degree of the real frame of the target video;
a judging unit 12, configured to judge whether the regularity score is lower than a preset threshold;
if so, abnormal behaviors exist in the real frame of the target video;
if not, the abnormal behavior does not exist in the real frame of the target video.
Example 5:
to verify the advantages of the present invention, the behavioral anomaly detection method of the present invention is now compared with advanced algorithms:
in this embodiment, the behavioral anomaly detection method of the present invention is compared to different advanced methods, including classification-based methods, reconstruction-based methods, and prediction-based methods. The AUC results of the different methods are shown in table 1.
TABLE 1. AUC results of different methods on the UCSD Ped2, CUHK Avenue and ShanghaiTech datasets
As can be seen from table 1, the behavior anomaly detection method of the present invention achieves better results on the baseline public datasets UCSD Ped2, CUHK Avenue and ShanghaiTech than the advanced methods. In the upper part of the table, the accuracy of the invention on the UCSD Ped2 dataset improves on the classification-based methods by at least 2.61% (94.10% vs 96.71%). This is mainly because most classification methods use traditional hand-crafted features, whose ability to mine representative features is limited compared with deep learning methods. In the middle part, the method of the invention also performs best on all three datasets compared with the reconstruction-based methods; in particular, its performance on the UCSD Ped2, CUHK Avenue and ShanghaiTech datasets improves on SNRR-AE by 4.51%, 3.20% and 4.28% respectively. In the lower part, compared with methods based on future frame prediction, the prediction task of the invention achieves the best results on the CUHK Avenue and ShanghaiTech datasets, with average AUC reaching 86.70% and 73.88%. This demonstrates the effectiveness of the anomaly detection method using a heterogeneous twin network based on key-value modules. The performance of the algorithm improves on Frame-Prediction by 1.31%, 1.60% and 1.08% on the three benchmark datasets respectively. Although Frame-Prediction also predicts future frames with the aid of optical flow, it uses the same U-Net network to extract both appearance information and optical flow motion information, which may to some extent introduce low-level information unrelated to the appearance features. The main reason the method of the invention is more effective is that the CAE model and the U-Net model are adopted to encode the appearance information and the motion information independently. The performance of AnoPCN on the UCSD Ped2 dataset is slightly better than that of the invention (96.80% vs 96.71%), mainly because the proportion of appearance information (pedestrians, etc.) in UCSD Ped2 is much higher than that of motion information (cycling, etc.). AnoPCN designs a deep neural network for frame prediction using a predictive coding mechanism and introduces an error refinement module to refine the coarse prediction, which favors the extraction of appearance information. The method of the invention weighs appearance and motion information jointly and is therefore only marginally behind on the UCSD Ped2 dataset. In contrast, on the relatively complex CUHK Avenue and ShanghaiTech datasets, where motion information and appearance information occur in comparable proportions, the algorithm of the invention outperforms AnoPCN, mainly because optical flow provides more discriminative motion cues.
FIG. 10 visualizes typical frame-level ROC curves of the method of the present invention and of different methods on the three datasets UCSD Ped2, CUHK Avenue and ShanghaiTech. The frame-level evaluation clearly shows that the method of the invention is superior to the other methods. In addition, the method achieves a lower EER on the three datasets, as shown in FIG. 11. The EER measures the error rate of an algorithm: the smaller the EER, the lower the error rate. The method of the invention obtains frame-level EERs of 0.104, 0.207 and 0.327 on the UCSD Ped2, CUHK Avenue and ShanghaiTech datasets respectively. It can be seen that the method of the invention again gives better experimental results than the others.
To qualitatively analyze the anomaly detection performance of the proposed model, the present invention visualizes the anomaly detection examples of the three reference data sets under the proposed framework, as shown in fig. 12, 13 and 14, respectively. The regularity score curve displays the abnormal scores of all the video frames in sequence, and can more intuitively reflect the performance of the proposed method. In each sub-graph, the regularity score represents the likelihood of normality, and the shaded portion in the video frame represents an anomaly in the real frame.
As can be seen from the left panel of fig. 12, the method of the present invention can detect anomalies well even in a crowded environment. The right panel of fig. 12 shows that the regularity curve drops immediately when a single anomaly (a car) appears, and gradually descends to its lowest point when multiple anomalies (cars and bicycles) appear. As shown in fig. 13, the regularity score drops significantly when an anomaly occurs and rises when the anomaly disappears, indicating that the method of the invention can detect the occurrence of anomalies. The regularity score curve of fig. 13 is rather rough owing to the noise carried by the CUHK Avenue dataset itself. Figure 14 shows the most challenging abnormal behaviors in the ShanghaiTech dataset, such as cycling and pushing, for which good regularity scores are still obtained, again indicating that the method of the invention can detect the occurrence of anomalies.
To assess how the different components affect the anomaly detection performance of the proposed method, ablation experiments were performed on the ShanghaiTech dataset; the AUC-based anomaly detection results are reported in table 2. As can be seen from table 2, prediction performance after adding the motion constraint is much higher than before, rising from 67.60% to 70.70%, because optical flow is more sensitive to fast-moving objects such as people running or cycling; this reflects the necessity of optical flow as additional information for improving anomaly detection performance. After adding the key-value module, performance on ShanghaiTech increases by a further 3.18% (70.70% vs 73.88%). On the one hand, the key-value module converts features of the high-dimensional space into a low-dimensional dynamic store, avoiding the influence of chain-like features. On the other hand, because the key-value module continuously updates contextual information, the strong similarity between the basic components of normal/abnormal samples and the generalization of the neural network to abnormal behaviors are alleviated, reducing the prediction error of the model.
Table 2 ablation test results on the ShanghaiTech data set, anomaly detection performance is reported in AUC (%) form
In addition, this example also performed a comparison between the heterogeneous twin network and the twin network. Since the skip connections in the U-Net network may fail to learn useful appearance information from video frames, two CAE networks with the same structure were used as the comparison. As can be seen from Table 2, the AUC of the method of the invention is about 1.4% higher than when a twin network is used. At the same time, the heterogeneous twin network contributes more than the twin network to the extraction of optical flow motion features (70.05% vs 68.75%), which fully verifies the superiority of adopting the heterogeneous twin network.
Case analysis on the ShanghaiTech dataset. Although the invention achieves advanced performance on the test datasets, the recognition ability of the heterogeneous twin network for some specific abnormal behaviors is still insufficient. This embodiment therefore conducts a case study on the ShanghaiTech dataset, mainly because it is the most challenging and realistic dataset, with the most scenes and abnormal behavior types. The test videos of the ShanghaiTech dataset are first classified into 15 classes; video anomaly detection is then performed on each video subset separately, and the AUC and EER are reported for each class, as shown in fig. 15.
To verify that the method of the present invention can alleviate the "chain-like" clustering phenomenon existing in high-dimensional space, this embodiment visually analyzes the feature distribution in the key-value module from different angles on the ShanghaiTech dataset. Fig. 16 (a) shows the feature distribution of a few samples of the Ped2 dataset in the key-value module, which looks rather cluttered. To reflect more intuitively that the key-value module can effectively resolve the "chain-like" manifold distribution phenomenon of the high-dimensional space, this embodiment visualizes it from different angles, as shown in (b) and (c) of fig. 16. As can be seen from (b) in fig. 16, the method of the invention stores the extracted features in the key-value module, thereby separating the features of different categories well, avoiding direct contact between different features, and effectively resolving the "chain-like" manifold distribution phenomenon existing in video anomaly detection. Fig. 16 (c) is a two-dimensional map of fig. 16 (b).
From fig. 17 it can be seen that, for a small dataset such as MNIST, the t-SNE method has a good classification effect, and the clustering effect improves as the number of iterations increases. But for datasets in which different abnormal behaviors may occur in the same complex environment and the abnormal target is small within the whole image (the ShanghaiTech dataset, etc.), a t-SNE-like method is not suitable. It can also be observed that a "chain" phenomenon appears as the number of iterations increases, because the environments of many different classes of abnormal behaviors are similar and the saliency of the abnormal object is low, as shown in fig. 18. Therefore, using a clustering method to classify abnormal behaviors may have drawbacks. The invention only needs to store the features of different classes and update similar features in time, thereby avoiding this phenomenon.
To explore the optimal configuration of the different loss coefficients $\eta_a$, $\eta_m$ and $\eta_f$, three sets of test experiments were performed on the proposed model. The relationship between two loss coefficients was analyzed while all other conditions were kept unchanged. FIG. 19 shows the AUC results on the UCSD Ped2 dataset, with parameter ranges of $[0, 10]$ or $[0, 1]$.
FIG. 19 (a) shows the influence of the parameters $\eta_a$ and $\eta_m$ on the average frame-level AUC. The method of the invention performs best when $\eta_a = 2$ and $\eta_m = 1$. However, when the value of $\eta_a$ exceeds 2, performance begins to decline, by up to about 2 percentage points. When the value of $\eta_m$ increases to 4 the performance degrades slightly, and when it increases to 7 the performance deteriorates markedly. FIG. 19 (b) shows the influence of $\eta_a$ and $\eta_f$ on the average frame-level AUC: performance is best at $\eta_f = 0.1$, and with increasing $\eta_a$ it first rises and then falls, reaching the optimum at $\eta_a = 2$. FIG. 19 (c) shows the relationship between $\eta_f$ and $\eta_m$: the AUC is best when $\eta_f = 0.1$, and then varies as the value of $\eta_m$ changes.
These observations show that the setting of the different hyper-parameters has a significant impact on the performance of the network.
This example was run on an NVIDIA Titan RTX GPU. The average running speed for video anomaly detection is about 20 fps. The running times of other methods are shown in table 3.
TABLE 3 average run time of different video anomaly detection methods
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principle and the embodiment of the present invention are explained by applying specific examples, and the above description of the embodiments is only used to help understanding the method and the core idea of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (10)

1. A behavior prediction network training method is characterized by comprising the following steps:
constructing a heterogeneous twin network based on a convolution network and a U-Net network;
acquiring a training video, wherein the training video comprises a plurality of temporally consecutive RGB video frames and optical flow frames containing normal behavior;
inputting any temporally consecutive RGB video frames into the convolutional network of the heterogeneous twin network, and inputting any temporally consecutive optical flow frames into the U-Net network of the heterogeneous twin network;
determining an apparent loss function according to the output of the convolutional network of the heterogeneous twin network and the RGB video frame at the moment following the temporally consecutive RGB video frames;
determining a motion loss function according to the output of the U-Net network of the heterogeneous twin network and the optical flow frame at the moment following the temporally consecutive optical flow frames;
determining a multi-constraint loss function according to the apparent loss function and the motion loss function;
and adjusting weights in the convolution network and the U-Net network according to the multi-constraint loss function so as to train the heterogeneous twin network and obtain the trained heterogeneous twin network.
2. The behavior prediction network training method according to claim 1, further comprising, after the inputting of the arbitrary time-continuous RGB video frames into the convolution network of the heterogeneous twin network, before the determining the apparent loss function:
converting and encoding a plurality of temporally consecutive RGB video frames with the encoder of the convolutional network to obtain a plurality of first coding sequences;
determining the weight of each first coding sequence to obtain a plurality of first addressing probabilities;
determining the key-value pairs corresponding to the first coding sequences according to the first addressing probability corresponding to the first coding sequences to obtain the key-value pairs corresponding to the first coding sequences;
determining the similarity between the key-value pairs corresponding to the first coding sequences, and merging the first coding sequences in the key-value pairs with the similarity larger than a first preset threshold value into the same key-value pair;
and decoding the first coding sequence in each key-value pair by adopting a decoder of a convolutional network to obtain the RGB video prediction frame.
3. The behavior prediction network training method according to claim 1, further comprising, after the inputting of the arbitrary time-continuous optical flow frames into the U-Net network of the heterogeneous twin network, before the determining the motion loss function:
converting and encoding a plurality of temporally consecutive optical flow frames with the encoder of the U-Net network to obtain a plurality of second coding sequences;
determining the weight of each second coding sequence to obtain a plurality of second addressing probabilities;
determining the key-value pairs corresponding to the second coding sequences according to the second addressing probabilities corresponding to the second coding sequences to obtain the key-value pairs corresponding to the second coding sequences;
determining the similarity between the key-value pairs corresponding to the second coding sequences, and merging the second coding sequences in the key-value pairs with the similarity larger than a second preset threshold value into the same key-value pair;
and decoding the second coding sequence in each key-value pair by adopting a decoder of the U-Net network to obtain the optical flow prediction frame.
4. A behavior prediction network training method according to claim 2 or 3, characterized in that the multi-constraint loss function of the heterogeneous twin network further comprises: a characteristic compactness loss function to represent intra-class losses for key-value pairs, and a characteristic separation loss function to represent inter-class losses for key-value pairs.
5. The behavior prediction network training method of claim 4, wherein the multi-constraint loss function is calculated as follows:
$$\mathcal{L} = \eta_a\,\mathcal{L}_a + \eta_m\,\mathcal{L}_m + \eta_f\,\mathcal{L}_f$$

wherein $\mathcal{L}$ is the multi-constraint loss function, $\eta_a$ is the weight coefficient of the apparent loss, $\eta_m$ is the weight coefficient of the motion loss, and $\eta_f$ is the weight coefficient of the characteristic loss;

the apparent loss function is

$$\mathcal{L}_a = \frac{1}{T}\sum_{t=1}^{T}\left\lVert I_t - \hat{I}_t \right\rVert_2$$

wherein $I_t$ is the RGB video real frame, $\hat{I}_t$ is the RGB video prediction frame, $t$ is the index within the video frame sequence, $T$ is the total length of the video frame sequence, and $\lVert\cdot\rVert_2$ is the Euclidean distance;

the motion loss function is

$$\mathcal{L}_m = \frac{1}{T}\sum_{t=1}^{T}\left\lVert F_t - \hat{F}_t \right\rVert_1$$

wherein $F_t$ is the optical flow real frame, $\hat{F}_t$ is the optical flow prediction frame, and $\lVert\cdot\rVert_1$ is the $L_1$ distance;

the characteristic loss function is

$$\mathcal{L}_f = \mathcal{L}_{com} + \mathcal{L}_{sep}$$

wherein the characteristic compactness loss function is

$$\mathcal{L}_{com} = \sum_{i=1}^{N}\left\lVert q_i - z_n \right\rVert_2$$

and the characteristic separation loss function is

$$\mathcal{L}_{sep} = \sum_{i=1}^{N}\Big[\lVert q_i - z_n \rVert_2 - \lVert q_i - z_m \rVert_2 + \alpha\Big]_+$$

wherein $q_i$ is the $i$-th query term, $z_n$ is the key in the key-value module that is closest to the query term and $n$ is the key index of that closest entry, $z_m$ is the key second-closest to the query term in the key-value module and $m$ is its key index, $N$ is the total number of key-value pairs, and $\alpha$ controls the confidence between key-value pairs; the keys are updated as

$$\hat{z}_n = f\!\Big(z_n + \sum_{i} w_i\, q_i\Big)$$

wherein $\hat{z}_n$ is the updated key, $w_i$ is the weight assigned to query $q_i$, and $f(\cdot)$ is a normalization function.
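To make the formulas above concrete, here is a hedged PyTorch sketch of the multi-constraint loss. The weight values, the mean reductions, and the cdist-based nearest/second-nearest key lookup are assumptions rather than the patented computation.

```python
# Sketch of the multi-constraint loss (assumed reductions and weights).
import torch
import torch.nn.functional as F

def multi_constraint_loss(I, I_hat, Fr, Fr_hat, queries, keys,
                          eta=(1.0, 1.0, 0.1), alpha=1.0):
    eta_a, eta_m, eta_f = eta
    loss_a = F.mse_loss(I_hat, I)            # apparent term (Euclidean)
    loss_m = F.l1_loss(Fr_hat, Fr)           # motion term (L1)
    d = torch.cdist(queries, keys)           # (Q, N) query-to-key distances
    near2 = d.topk(2, dim=1, largest=False).values
    loss_com = near2[:, 0].mean()            # compactness: pull to nearest key
    loss_sep = F.relu(near2[:, 0] - near2[:, 1] + alpha).mean()  # margin alpha
    return eta_a * loss_a + eta_m * loss_m + eta_f * (loss_com + loss_sep)
```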
6. A behavior anomaly detection method, comprising:
acquiring target video real frames, wherein the target video real frames comprise a plurality of temporally continuous RGB real video frames and optical flow real frames;
inputting the target video real frames into a heterogeneous twin network to obtain a target video prediction frame, wherein the heterogeneous twin network is trained according to the behavior prediction network training method of claim 1;
calculating the peak signal-to-noise ratio between the target video prediction frame and the target video real frame;
calculating a regularity score from the peak signal-to-noise ratio, wherein the regularity score measures the degree of normality of the target video real frame;
judging whether the regularity score is lower than a preset threshold;
if so, determining that abnormal behavior exists in the target video real frame;
if not, determining that no abnormal behavior exists in the target video real frame.
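Read end to end, claim 6 amounts to the following flow. The `psnr` and `regularity_scores` helpers are the hypothetical functions sketched after claims 7 and 8, and the threshold value `tau` is an assumption.

```python
# Hedged sketch of the detection flow in claim 6; psnr() and
# regularity_scores() are defined in the sketches after claims 7 and 8.
import numpy as np

def detect_abnormal_frames(model, rgb_stacks, flow_stacks, real_frames, tau=0.5):
    preds = [model(r, f) for r, f in zip(rgb_stacks, flow_stacks)]
    p = np.array([psnr(gt, pr) for gt, pr in zip(real_frames, preds)])
    s = regularity_scores(p)       # per-frame degree of normality
    return s < tau                 # True where abnormal behavior is flagged
```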
7. The behavior anomaly detection method according to claim 6, wherein the peak signal-to-noise ratio is calculated as follows:

$$\mathrm{PSNR}\big(L_t, \hat{L}_t\big) = 10\,\log_{10}\frac{\big[\max(\hat{L}_t)\big]^2}{\frac{1}{M}\sum_{l,r}\big(L_t(l,r) - \hat{L}_t(l,r)\big)^2}$$

wherein $\mathrm{PSNR}$ is the peak signal-to-noise ratio, $L_t$ is the target video real frame, $\hat{L}_t$ is the target video prediction frame, $l$ and $r$ are the spatial indices of the target video prediction frame and the target video real frame respectively, and $M$ is the number of pixels.
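A direct transcription of the formula above into NumPy, under the assumption that the frames are same-shape arrays and that the peak is taken from the prediction frame as written:

```python
# PSNR per the reconstructed formula (assumption: same-shape float arrays).
import numpy as np

def psnr(real, pred):
    real = np.asarray(real, dtype=np.float64)
    pred = np.asarray(pred, dtype=np.float64)
    mse = np.mean((real - pred) ** 2)            # (1/M) * sum of squared errors
    return 10.0 * np.log10(pred.max() ** 2 / (mse + 1e-12))
```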
8. The behavior anomaly detection method according to claim 6, wherein the regularity score is calculated as follows:

$$s(t) = \frac{\mathrm{PSNR}\big(L_t,\hat{L}_t\big) - \min_{t}\mathrm{PSNR}\big(L_t,\hat{L}_t\big)}{\max_{t}\mathrm{PSNR}\big(L_t,\hat{L}_t\big) - \min_{t}\mathrm{PSNR}\big(L_t,\hat{L}_t\big)}$$

wherein $s(t)$ is the regularity score, $\mathrm{PSNR}$ is the peak signal-to-noise ratio, $L_t$ is the target video real frame, and $\hat{L}_t$ is the target video prediction frame.
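The regularity score is thus a min-max normalization of the per-frame PSNR over the test sequence, for example:

```python
# Min-max normalization of per-frame PSNR into s(t) in [0, 1].
import numpy as np

def regularity_scores(psnr_values):
    p = np.asarray(psnr_values, dtype=np.float64)
    return (p - p.min()) / (p.max() - p.min() + 1e-12)
```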
9. A behavior prediction network training system, comprising:
a heterogeneous twin network construction unit, used for constructing a heterogeneous twin network based on a convolutional network and a U-Net network;
a training video acquisition unit, used for acquiring a training video, the training video comprising a plurality of temporally continuous RGB video frames containing normal behaviors and optical flow frames;
an input unit, used for inputting the temporally continuous RGB video frames into the convolutional network of the heterogeneous twin network and inputting the temporally continuous optical flow frames into the U-Net network of the heterogeneous twin network;
an apparent loss function determining unit, used for determining an apparent loss function according to the output of the convolutional network of the heterogeneous twin network and the RGB video frame at the time instant following the temporally continuous RGB video frames;
a motion loss function determining unit, used for determining a motion loss function according to the output of the U-Net network of the heterogeneous twin network and the optical flow frame at the time instant following the temporally continuous optical flow frames;
a multi-constraint loss function determining unit, used for determining a multi-constraint loss function according to the apparent loss function and the motion loss function; and
a heterogeneous twin network training unit, used for adjusting the weights in the convolutional network and the U-Net network according to the multi-constraint loss function, so as to train the heterogeneous twin network and obtain the trained heterogeneous twin network.
10. A behavioral anomaly detection system, comprising:
a target video real frame acquisition unit, used for acquiring target video real frames, the target video real frames comprising a plurality of temporally continuous RGB real video frames and optical flow real frames;
a target video prediction frame obtaining unit, used for inputting the target video real frames into a heterogeneous twin network to obtain a target video prediction frame, wherein the heterogeneous twin network is trained according to the behavior prediction network training method of claim 1;
a peak signal-to-noise ratio calculation unit, used for calculating the peak signal-to-noise ratio between the target video prediction frame and the target video real frame;
a regularity score calculation unit, used for calculating a regularity score from the peak signal-to-noise ratio, the regularity score measuring the degree of normality of the target video real frame; and
a judging unit, used for judging whether the regularity score is lower than a preset threshold; if so, determining that abnormal behavior exists in the target video real frame; if not, determining that no abnormal behavior exists in the target video real frame.
CN202210285382.1A 2022-03-22 2022-03-22 Behavior prediction network training method and system and behavior anomaly detection method and system Withdrawn CN114581738A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210285382.1A CN114581738A (en) 2022-03-22 2022-03-22 Behavior prediction network training method and system and behavior anomaly detection method and system


Publications (1)

Publication Number Publication Date
CN114581738A true CN114581738A (en) 2022-06-03

Family

ID=81777628

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210285382.1A Withdrawn CN114581738A (en) 2022-03-22 2022-03-22 Behavior prediction network training method and system and behavior anomaly detection method and system

Country Status (1)

Country Link
CN (1) CN114581738A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117409354A (en) * 2023-12-11 2024-01-16 山东建筑大学 Video anomaly detection method and system based on three paths of video streams and context awareness
CN117409354B (en) * 2023-12-11 2024-03-22 山东建筑大学 Video anomaly detection method and system based on three paths of video streams and context awareness

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20220603)