CN116385935A - Abnormal event detection algorithm based on unsupervised domain adaptation - Google Patents

Abnormal event detection algorithm based on unsupervised domain adaptation

Info

Publication number
CN116385935A
Authority
CN
China
Prior art keywords
domain
video frame
abnormal
normal
abnormal event
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310369508.8A
Other languages
Chinese (zh)
Inventor
李璐
路文
伍凌帆
Current Assignee
Suzhou Haiyuhong Intelligent Technology Co ltd
Original Assignee
Suzhou Haiyuhong Intelligent Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Suzhou Haiyuhong Intelligent Technology Co ltd filed Critical Suzhou Haiyuhong Intelligent Technology Co ltd
Priority to CN202310369508.8A
Publication of CN116385935A
Legal status: Pending

Classifications

    • G06V 20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06N 3/0475: Generative networks
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • G06N 3/088: Non-supervised learning, e.g. competitive learning
    • G06N 3/094: Adversarial learning
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V 20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V 20/52: Surveillance or monitoring of activities, e.g. for recognising suspicious objects

Abstract

The invention discloses an abnormal event detection algorithm based on unsupervised domain adaptation, comprising the following steps: step 1) performing experimental verification on the UCSD and ShanghaiTech Campus datasets; step 2) constructing the abnormal event detection algorithm based on unsupervised domain adaptation, where the model comprises a pre-training module and a domain adaptation module; step 3) iteratively training the abnormal event detection model; and step 4) obtaining the detection results of the model. The algorithm performs supervised pre-training on a source-domain dataset and proposes a perceptual contrast loss that promotes the reconstruction of normal samples by the video frame reconstruction network while suppressing its reconstruction of abnormal samples, thereby clarifying the discrimination boundary between normal and abnormal events in the source domain and transferring this prior knowledge of normal and abnormal events to the target domain.

Description

Abnormal event detection algorithm based on unsupervised domain adaptation
Technical Field
The invention relates to abnormal event detection in surveillance video, and in particular to an abnormal event detection algorithm based on unsupervised domain adaptation.
Background
The continuous development and wide application of abnormal event detection in surveillance video have played an important role in advancing intelligent monitoring. In recent years, deep-learning-based abnormal event detection algorithms for surveillance video have made good progress in improving detection performance. Supervised abnormal event detection algorithms generally depend on a large amount of labeled data: every video frame must be manually annotated before the network model is trained, and the collected data often suffer from class imbalance, which degrades detection. Unsupervised abnormal event detection algorithms need no labels; by learning feature representations of a large number of normal events, they judge data that do not conform to the normal-event feature distribution as abnormal events, saving substantial manual annotation cost. However, because unsupervised algorithms train only on normal samples, they lack prior information about abnormal events, so the discrimination boundary between normal and abnormal events is unclear and false detections arise easily. In addition, existing abnormal event detection algorithms for surveillance video lack scene applicability: an algorithm that performs well in one scene is often mediocre in others.
Computer-vision-based methods for detecting abnormal events in surveillance video can be divided into traditional methods and deep learning methods. Deep learning is widely used across fields because of its strong learning ability, and researchers commonly adopt it for the many challenging problems in surveillance-video abnormal event detection. Compared with traditional algorithms, deep-learning-based abnormal event detection algorithms perform markedly better and are the mainstream in this research direction.
Abnormal event detection algorithms for surveillance video can be divided into fully supervised, weakly supervised, and unsupervised learning algorithms according to whether and how the data are labeled. In real video surveillance scenes, abnormal events are relatively rare, so normal-event and abnormal-event data are imbalanced, which reduces model detection performance; moreover, abnormal events are diverse and cannot be exhaustively enumerated. Unsupervised learning algorithms need no labels; by learning feature representations of a large number of normal events, they judge data that do not conform to the normal-event feature distribution as abnormal. However, existing unsupervised abnormal event detection algorithms train only on normal-event data and lack prior information about abnormal events, so the discrimination boundary between normal and abnormal events is unclear and false detections arise easily.
To improve a network model's ability to discriminate between normal and abnormal events, Park H., Noh J., and Ham B. proposed "Learning Memory-Guided Normality for Anomaly Detection" at the IEEE Conference on Computer Vision and Pattern Recognition, introducing a memory module together with a feature compactness loss and a feature separateness loss to train it, so that the normal-event features stored in the memory module remain diverse and the model's discrimination ability improves.
However, existing abnormal event detection algorithms have two main disadvantages:
(1) Supervised abnormal event detection algorithms generally rely on a large amount of labeled data: every video frame must be manually annotated before the network model is trained, and the collected data often suffer from class imbalance.
(2) Unsupervised abnormal event detection algorithms train only on normal samples and lack prior information about abnormal events, leaving the discrimination boundary between normal and abnormal events unclear.
Disclosure of Invention
The invention aims to solve the technical problem that unsupervised abnormal event detection algorithms lack prior information about abnormal events, so that the discrimination boundary between normal and abnormal events is unclear and false detections arise easily; to this end, the invention introduces well-defined prior knowledge from a source domain into a target domain.
To solve the above technical problems, the invention is realized by the following technical scheme. An abnormal event detection algorithm based on unsupervised domain adaptation comprises the following steps:
step 1) performing experimental verification on the UCSD and ShanghaiTech Campus datasets, using the ShanghaiTech Campus dataset as source-domain data and the UCSD Ped1 and UCSD Ped2 datasets as target-domain data respectively;
step 2) constructing the abnormal event detection algorithm based on unsupervised domain adaptation: the model comprises a pre-training module and a domain adaptation module;
the pre-training module performs supervised training on the source-domain data, and only normal samples are used to train the reconstruction network when reconstructing an input video frame; in the pre-training stage, a supervised learning mode inputs the reconstructed video frame, a normal sample, and an abnormal sample into a feature extraction network to extract the corresponding features, and reduces the distance between the reconstructed-frame features and the normal-sample features while enlarging the distance between the reconstructed-frame features and the abnormal-sample features, thereby promoting the reconstruction of normal samples by the video frame reconstruction network, suppressing its reconstruction of abnormal samples, and clarifying the discrimination boundary between normal and abnormal events in the source domain;
step 3) iteratively training the abnormal event detection model based on unsupervised domain adaptation: the pre-training module performs back propagation with the loss function L = λ_res·L_res + λ_per·L_per and updates the weight parameters of the reconstruction network; in the target domain, the video frame reconstruction network performs back propagation with the loss function L_rec and updates the domain discriminator and video frame reconstruction network parameters;
step 4) obtaining the detection results of the abnormal event detection model based on unsupervised domain adaptation: the test sample set is fed as input to the trained model for forward inference, yielding the detection result for each test sample.
Further, in step 1), the source-domain data differ from the target-domain data in scene or environmental conditions; the training set of the source-domain dataset contains both normal-event and abnormal-event video frames and is labeled, while the training set of the target-domain dataset contains only normal-event video frames and is unlabeled.
Further, in step 2), the VGG19 pretrained model in PyTorch is adopted as the feature extraction network of the abnormal event detection algorithm based on unsupervised domain adaptation: the reconstructed video frame, the normal sample, and the abnormal sample are input into the network, and the features of different scales at layers 4, 9, 14, 23, and 32 are extracted for each; a perceptual contrast loss is proposed that reduces the distance between the reconstructed-frame features and the normal-sample features while enlarging the distance between the reconstructed-frame features and the abnormal-sample features, thereby promoting the reconstruction of normal samples by the video frame reconstruction network, suppressing its reconstruction of abnormal samples, and clarifying the discrimination boundary between normal and abnormal events in the source domain; the perceptual contrast loss function is as follows:
[The perceptual contrast loss L_per appears only as an equation image (BDA0004168122840000041) in the original document.]
where C_i, H_i, and W_i denote the number of channels, the height, and the width of the i-th feature map respectively, I_p denotes a normal sample, I_n an abnormal sample, and I_r the reconstructed video frame, and f_i(·) denotes the i-th feature map obtained by feeding a video frame into the feature extraction network; under the constraint of the perceptual contrast loss, the reconstructed video frame becomes more similar to the normal sample and more different from the abnormal sample in texture detail and semantic information;
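The perceptual contrast loss itself is printed only as an equation image in this text, so its exact form cannot be verified here. The sketch below is a hypothetical NumPy rendering consistent with the surrounding description: per-layer feature distances normalised by C_i·H_i·W_i, with the distance to the normal sample minimised and the distance to the abnormal sample subtracted (pushed apart). The function name, the choice of l1 distance, and the sign convention are assumptions, not taken from the patent.

```python
import numpy as np

def perceptual_contrast_loss(feats_r, feats_p, feats_n):
    """Hypothetical sketch of the perceptual contrast loss L_per.

    feats_r, feats_p, feats_n are lists of multi-scale feature maps
    (each of shape C_i x H_i x W_i) for the reconstructed frame I_r,
    a normal sample I_p, and an abnormal sample I_n. Each term is
    normalised by C_i * H_i * W_i; the normal-sample distance is
    minimised while the abnormal-sample distance is maximised.
    """
    loss = 0.0
    for fr, fp, fn in zip(feats_r, feats_p, feats_n):
        norm = fr.size  # C_i * H_i * W_i
        loss += np.abs(fr - fp).sum() / norm - np.abs(fr - fn).sum() / norm
    return loss
```

A reconstruction that matches the normal sample yields a negative (low) loss, while one that matches the abnormal sample yields a positive (high) loss, which is the behaviour the text describes.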
therefore, in the pre-training module, the total loss function formula is as follows:
L = λ_res·L_res + λ_per·L_per
where λ_res and λ_per denote the weight coefficients of the MSE loss function and the perceptual contrast loss function respectively; during pre-training the parameters of the feature extraction network are kept fixed while the reconstruction network continually updates its parameters, thereby promoting the reconstruction of normal samples by the video frame reconstruction network, suppressing its reconstruction of abnormal samples, and clarifying the discrimination boundary between normal and abnormal events in the source domain.
Further, with λ_res and λ_per denoting the weight coefficients of the MSE loss function and the perceptual contrast loss function respectively, the l1 loss function may be used in place of the MSE loss function to minimize the difference between the predicted video frame and the real video frame.
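As a small executable illustration of the total pre-training loss, the snippet below combines an MSE (or, per the alternative embodiment, l1) reconstruction term with a perceptual contrast term. The weight values for λ_res and λ_per are placeholders, since the patent does not disclose them, and the function names are illustrative.

```python
import numpy as np

def mse_loss(pred, target):
    # Mean squared error between predicted and real video frame.
    return float(np.mean((pred - target) ** 2))

def l1_loss(pred, target):
    # l1 alternative mentioned in the text.
    return float(np.mean(np.abs(pred - target)))

def total_pretrain_loss(pred, target, l_per, lam_res=1.0, lam_per=0.1, use_l1=False):
    """Total loss L = lam_res * L_res + lam_per * L_per (weights illustrative)."""
    l_res = l1_loss(pred, target) if use_l1 else mse_loss(pred, target)
    return lam_res * l_res + lam_per * l_per
```

Swapping `use_l1=True` switches the reconstruction term from MSE to l1 without changing the overall structure of L.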
Further, in step 2), through alternating training of the video frame reconstruction network and the domain discriminator, the parameters of both are continually optimized so that the video frame reconstruction network learns domain-invariant features that represent both domains well, aligning the data distributions of the two domains, while the domain discriminator tries as hard as possible to determine whether an input video frame comes from the source domain or the target domain.
Further, the target domain adopts an unsupervised learning mode, and normal samples from the target domain and normal samples from the source domain are input into the pre-trained reconstruction network and trained together; the whole training process is based on adversarial learning, alternately training the video frame reconstruction network and the domain discriminator to continually optimize both and align the data distributions of the source and target domains.
Further, in the specific training process, the video frame reconstruction network is fixed while training the domain discriminator, so that the discriminator distinguishes as well as possible whether the data come from the source domain or the target domain; the domain discriminator is fixed while training the video frame reconstruction network, so that the reconstruction network produces results the discriminator cannot resolve;
the loss function in training the domain arbiter is shown as the formula
L dis =f(d(g(x s )),t s )+f(d(g(x t )),t t )
The loss function when training the video frame reconstruction network is shown as the following formula, the optimized domain discriminator and the video frame reconstruction network parameters are continuously updated through back propagation, and finally, the target domain data distribution is aligned with the source domain data distribution;
L rec =f(d(g(x s )),t t )+f(d(g(x t )),t s )
wherein x is s Representing source domain data, x t Representing target domain data, g (·) representing a video frame reconstruction network, d (·) representing a domain arbiter, t s A tag representing source domain data, defined as 0, t t A label representing target domain data, defined as 1, f (·) represents a binary cross entropy loss function, the formula of which is shown below:
f(p,q)=-w×[p×log(q)+(1-p)×log(1-q)]
where p represents a label value, q represents an actual predicted value, and w represents a weight coefficient.
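The adversarial losses above can be sketched directly in NumPy. The helper names are illustrative, and the small epsilon guard inside the cross-entropy is an implementation convenience that is not part of the patent's formula.

```python
import numpy as np

def bce(p, q, w=1.0):
    """Weighted binary cross-entropy f(p, q) = -w*[p*log(q) + (1-p)*log(1-q)]."""
    eps = 1e-12  # numerical guard, not in the original formula
    return -w * (p * np.log(q + eps) + (1 - p) * np.log(1 - q + eps))

T_S, T_T = 0.0, 1.0  # source-domain label t_s = 0, target-domain label t_t = 1

def discriminator_loss(d_src, d_tgt):
    """L_dis: the discriminator should output t_s for source data and t_t
    for target data. d_src and d_tgt stand for d(g(x_s)) and d(g(x_t)),
    probabilities in (0, 1)."""
    return bce(T_S, d_src) + bce(T_T, d_tgt)

def reconstruction_adv_loss(d_src, d_tgt):
    """L_rec: the same terms with the labels swapped, so minimising it
    pushes the reconstruction network to fool the discriminator and
    align the two domain distributions."""
    return bce(T_T, d_src) + bce(T_S, d_tgt)
```

When the discriminator classifies both domains confidently and correctly, L_dis is small and L_rec is large; training alternates between the two objectives as the text describes.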
Furthermore, the target domain adopts an unsupervised learning mode requiring only the normal samples in the target domain for training; the prior knowledge that well defines normal and abnormal events in the pre-training module can be introduced into the target domain, clarifying the discrimination boundary between normal and abnormal events there and improving the model's abnormal event detection performance in the target domain; and adversarial training can align the source-domain and target-domain data distributions, reducing domain shift so that the model has better scene applicability.
Compared with the prior art, the invention has the following advantages. First, supervised pre-training is performed on the source-domain dataset and a perceptual contrast loss is proposed; shortening the distance between the reconstructed-frame features and the normal-sample features while enlarging the distance to the abnormal-sample features promotes the reconstruction of normal events by the video frame reconstruction network, suppresses its reconstruction of abnormal events, and clarifies the discrimination boundary between normal and abnormal events in the source domain. Second, an unsupervised learning mode is adopted in the target domain; through unsupervised domain adaptation based on adversarial learning, the prior knowledge of normal and abnormal events defined in the pre-training stage is introduced into the target domain, clarifying the discrimination boundary there and improving the model's abnormal event detection performance in the target domain. Finally, adversarial training aligns the source-domain and target-domain data distributions and reduces domain shift, so the model has better scene applicability and well-defined prior knowledge of normal and abnormal events can be introduced into the target domain.
Drawings
The invention is further described below with reference to the accompanying drawings.
FIG. 1 is a schematic flow chart of the abnormal event detection algorithm based on unsupervised domain adaptation.
Detailed Description
The invention is described in detail below with reference to the drawings and the detailed description.
An abnormal event detection algorithm based on unsupervised domain adaptation, as shown in FIG. 1, comprises the following steps:
step 1) performing experimental verification on the UCSD and ShanghaiTech Campus datasets, using the ShanghaiTech Campus dataset as source-domain data and the UCSD Ped1 and UCSD Ped2 datasets as target-domain data respectively; the source-domain data differ from the target-domain data in scene or environmental conditions; the training set of the source-domain dataset contains both normal-event and abnormal-event video frames and is labeled, while the training set of the target-domain dataset contains only normal-event video frames and is unlabeled;
step 2) constructing the abnormal event detection algorithm based on unsupervised domain adaptation: the model comprises a pre-training module and a domain adaptation module;
the pre-training module performs supervised training on the source-domain data, and only normal samples are used to train the reconstruction network when reconstructing an input video frame; in the pre-training stage, a supervised learning mode inputs the reconstructed video frame, a normal sample, and an abnormal sample into a feature extraction network to extract the corresponding features, and reduces the distance between the reconstructed-frame features and the normal-sample features while enlarging the distance between the reconstructed-frame features and the abnormal-sample features, thereby promoting the reconstruction of normal samples by the video frame reconstruction network, suppressing its reconstruction of abnormal samples, and clarifying the discrimination boundary between normal and abnormal events in the source domain;
the VGG19 pretrained model in PyTorch is adopted as the feature extraction network of the abnormal event detection algorithm based on unsupervised domain adaptation: the reconstructed video frame, the normal sample, and the abnormal sample are input into the network, and the features of different scales at layers 4, 9, 14, 23, and 32 are extracted for each; a perceptual contrast loss is proposed that reduces the distance between the reconstructed-frame features and the normal-sample features at the corresponding scales while enlarging the distance between the reconstructed-frame features and the abnormal-sample features, thereby promoting the reconstruction of normal samples by the video frame reconstruction network, suppressing its reconstruction of abnormal samples, and clarifying the discrimination boundary between normal and abnormal events in the source domain; the perceptual contrast loss function is as follows:
[The perceptual contrast loss L_per appears only as an equation image (BDA0004168122840000081) in the original document.]
where C_i, H_i, and W_i denote the number of channels, the height, and the width of the i-th feature map respectively, I_p denotes a normal sample, I_n an abnormal sample, and I_r the reconstructed video frame, and f_i(·) denotes the i-th feature map obtained by feeding a video frame into the feature extraction network; under the constraint of the perceptual contrast loss, the reconstructed video frame becomes more similar to the normal sample and more different from the abnormal sample in texture detail and semantic information;
therefore, in the pre-training module, the total loss function formula is as follows:
L = λ_res·L_res + λ_per·L_per
where λ_res and λ_per denote the weight coefficients of the MSE loss function and the perceptual contrast loss function respectively; alternatively, the l1 loss function may be used in place of the MSE loss function to minimize the difference between the predicted video frame and the real video frame. During pre-training the parameters of the feature extraction network are kept fixed while the reconstruction network continually updates its parameters, thereby promoting the reconstruction of normal samples by the video frame reconstruction network, suppressing its reconstruction of abnormal samples, and clarifying the discrimination boundary between normal and abnormal events in the source domain.
The abnormal event detection algorithm performs unsupervised domain adaptation based on generative adversarial learning; through alternating training of the video frame reconstruction network and the domain discriminator, the parameters of both are continually optimized so that the video frame reconstruction network learns domain-invariant features that represent both domains well, aligning the data distributions of the two domains, while the domain discriminator tries as hard as possible to determine whether an input video frame comes from the source domain or the target domain;
In an unsupervised learning mode, normal samples from the target domain and normal samples from the source domain are input into the pre-trained reconstruction network and trained together; the whole training process is based on adversarial learning, alternately training the video frame reconstruction network and the domain discriminator to continually optimize both and align the data distributions of the source and target domains. In the specific training process, the video frame reconstruction network is fixed while training the domain discriminator, so that the discriminator distinguishes as well as possible whether the data come from the source domain or the target domain; the domain discriminator is fixed while training the video frame reconstruction network, so that the reconstruction network produces results the discriminator cannot resolve;
the loss function in training the domain arbiter is shown as the formula
L dis =f(d(g(x s )),t s )+f(d(g(x t )),t t )
The loss function when training the video frame reconstruction network is shown as the following formula, the optimized domain discriminator and the video frame reconstruction network parameters are continuously updated through back propagation, and finally, the target domain data distribution is aligned with the source domain data distribution;
L rec =f(d(g(x s )),t t )+f(d(g(x t )),t s )
wherein x is s Representing source domain data, x t Representing target domain data, g (·) representing a video frame reconstruction network, d (·) representing a domain arbiter, t s A tag representing source domain data, defined as 0, t t A label representing target domain data, defined as 1, f (·) represents a binary cross entropy loss function, the formula of which is shown below:
f(p,q)=-w×[p×log(q)+(1-p)×log(1-q)]
wherein p represents a tag value, q represents an actual predicted value, and w represents a weight coefficient;
Through the unsupervised domain adaptation module, only the normal samples in the target domain are needed for training in an unsupervised learning mode; the prior knowledge that well defines normal and abnormal events in the pre-training module is introduced into the target domain, clarifying the discrimination boundary between normal and abnormal events there and improving the model's abnormal event detection performance in the target domain; adversarial training aligns the source-domain and target-domain data distributions, reducing domain shift so that the model has better scene applicability;
step 3) iteratively training the abnormal event detection model based on unsupervised domain adaptation: the pre-training module performs back propagation with the loss function L = λ_res·L_res + λ_per·L_per and updates the weight parameters of the reconstruction network; in the target domain, the video frame reconstruction network performs back propagation with the loss function L_rec and updates the domain discriminator and video frame reconstruction network parameters;
step 4) obtaining the detection results of the abnormal event detection model based on unsupervised domain adaptation: the test sample set is fed as input to the trained model for forward inference, yielding the detection result for each test sample.
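Step 4 describes forward inference on the test set but does not spell out the scoring rule. A common choice for reconstruction-based detectors, shown here only as an assumed sketch, is to take a frame's reconstruction error as its anomaly score and threshold it; the function names and the threshold value are illustrative, not from the patent.

```python
import numpy as np

def anomaly_score(frame, reconstructed):
    """Illustrative per-frame anomaly score: mean squared reconstruction
    error, assuming the trained network reconstructs normal frames well
    and abnormal frames poorly (the patent does not give the exact rule)."""
    return float(np.mean((frame - reconstructed) ** 2))

def detect(frames, reconstruct, threshold):
    """Label each test frame: 1 = abnormal if its score exceeds the threshold."""
    return [1 if anomaly_score(f, reconstruct(f)) > threshold else 0 for f in frames]
```

In practice the threshold would be calibrated on validation data, or the scores used directly to compute a frame-level AUC.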
First, supervised pre-training is performed on the source-domain dataset and a perceptual contrast loss is proposed; shortening the distance between the reconstructed-frame features and the normal-sample features while enlarging the distance to the abnormal-sample features promotes the reconstruction of normal events by the video frame reconstruction network, suppresses its reconstruction of abnormal events, and clarifies the discrimination boundary between normal and abnormal events in the source domain. Then, an unsupervised learning mode is adopted in the target domain; through unsupervised domain adaptation based on adversarial learning, the prior knowledge of normal and abnormal events defined in the pre-training stage is introduced into the target domain, clarifying the discrimination boundary there and improving the model's abnormal event detection performance in the target domain. Finally, adversarial training aligns the source-domain and target-domain data distributions and reduces domain shift, so the model has better scene applicability and well-defined prior knowledge of normal and abnormal events can be introduced into the target domain.
It should be emphasized that the above embodiments are merely preferred embodiments of the present invention and do not limit the present invention in any way; any simple modification, equivalent variation or alteration made to the above embodiments according to the technical substance of the present invention still falls within the scope of the technical solution of the present invention.

Claims (8)

1. An abnormal event detection algorithm based on unsupervised domain self-adaption is characterized by comprising the following steps:
step 1) performing experimental verification on the UCSD and ShanghaiTech Campus data sets: the ShanghaiTech Campus data set is used as source domain data, and the UCSD Ped1 and UCSD Ped2 data sets are used as target domain data respectively;
step 2) constructing an abnormal event detection algorithm based on unsupervised domain self-adaption: the model comprises a pre-training module and a domain self-adapting module;
the pre-training module performs supervised training on the source domain data, and only normal samples are used to train the reconstruction network when reconstructing an input video frame; in the pre-training stage, a supervised learning mode is adopted: the reconstructed video frame, the normal sample and the abnormal sample are input into a feature extraction network to extract corresponding features, and the distance between the reconstructed video frame features and the normal sample features is reduced while the distance between the reconstructed video frame features and the abnormal sample features is enlarged, which promotes the reconstruction of normal samples by the video frame reconstruction network, suppresses the reconstruction of abnormal samples, and defines the discrimination boundary between normal and abnormal events in the source domain;
step 3) carrying out iterative training on the abnormal event detection algorithm model based on unsupervised domain self-adaption: the pre-training module adopts the loss function L = λ_res·L_res + λ_per·L_per for back propagation and updates the weight parameters of the reconstruction network; in the target domain, the video frame reconstruction network adopts the loss function L_rec for back propagation, updating the domain discriminator and video frame reconstruction network parameters;
step 4) obtaining the detection result of the abnormal event detection model based on unsupervised domain self-adaption: the test sample set is taken as the input of the trained unsupervised-domain-adaptive abnormal event detection model for forward inference, yielding the detection result of each test sample.
2. The abnormal event detection algorithm based on unsupervised domain adaptation according to claim 1, wherein the source domain data in step 1) are data whose scenes or environmental conditions differ from those of the target domain data; the training set of the source domain data set includes both normal event video frames and abnormal event video frames, while the training set of the target domain data set includes only normal event video frames and is unlabeled.
3. The abnormal event detection algorithm based on unsupervised domain self-adaption according to claim 1, wherein the algorithm in step 2) adopts a VGG19 model pretrained in PyTorch as the feature extraction network; the reconstructed video frame, the normal sample and the abnormal sample are input into this network, and features of different scales are extracted from layers 4, 9, 14, 23 and 32 respectively; a perceptual contrast loss is proposed which, at the corresponding scales, reduces the distance between the reconstructed video frame features and the normal sample features while enlarging the distance between the reconstructed video frame features and the abnormal sample features, thereby promoting the reconstruction of normal samples by the video frame reconstruction network, suppressing the reconstruction of abnormal samples, and defining the discrimination boundary between normal and abnormal events in the source domain; the perceptual contrast loss function formula is as follows:
L_per = Σ_i [1/(C_i·H_i·W_i)] · ( ‖f_i(I_r) − f_i(I_p)‖_1 − ‖f_i(I_r) − f_i(I_n)‖_1 )
wherein C_i, H_i, W_i respectively denote the channel number, height and width of the i-th feature map; I_p denotes a normal sample, I_n an abnormal sample, and I_r a reconstructed video frame; f_i(·) denotes the i-th feature map obtained by inputting a video frame into the feature extraction network; the constraint of the perceptual contrast loss makes the reconstructed video frame more similar to the normal sample and more different from the abnormal sample in texture detail and semantic information;
therefore, in the pre-training module, the total loss function formula is as follows:
L = λ_res·L_res + λ_per·L_per
wherein λ_res and λ_per respectively denote the weight coefficients of the MSE loss function and the perceptual contrast loss function; the feature extraction network parameters are kept fixed during pre-training while the reconstruction network continuously updates its parameters, thereby promoting the reconstruction of normal samples by the video frame reconstruction network, suppressing the reconstruction of abnormal samples, and defining the discrimination boundary between normal and abnormal events in the source domain.
4. The abnormal event detection algorithm based on unsupervised domain self-adaption according to claim 3, wherein λ_res and λ_per respectively denote the weight coefficients of the MSE loss function and the perceptual contrast loss function, and the MSE loss function may be replaced by an l1 loss function to minimize the difference between the predicted video frame and the actual video frame.
5. The abnormal event detection algorithm based on unsupervised domain adaptation according to claim 1, wherein in step 2) the video frame reconstruction network and the domain discriminator are trained alternately to continuously optimize their parameters: the domain discriminator attempts to distinguish, as far as possible, whether an input video frame comes from the source domain or the target domain, while the video frame reconstruction network learns domain-invariant features that perform well in both domains, thereby aligning the data distributions of the two domains.
6. The abnormal event detection algorithm based on unsupervised domain adaptation according to claim 5, wherein the target domain adopts an unsupervised learning mode: the normal samples in the target domain and the normal samples in the source domain are input together into the pre-trained reconstruction network for training; the whole training process is based on adversarial learning, in which the video frame reconstruction network and the domain discriminator are trained alternately and continuously optimized, aligning the data distributions of the source and target domains.
7. The abnormal event detection algorithm based on unsupervised domain adaptation according to claim 6, wherein the specific training process is: when training the domain discriminator, the video frame reconstruction network is fixed, so that the domain discriminator can distinguish as far as possible whether the data come from the source domain or the target domain; when training the video frame reconstruction network, the domain discriminator is fixed, so that the video frame reconstruction network generates results the domain discriminator cannot resolve;
the loss function in training the domain arbiter is shown as the formula
L dis =f(d(g(x s )),t s )+f(d(g(x t )),t t )
The loss function when training the video frame reconstruction network is shown as the following formula, the optimized domain discriminator and the video frame reconstruction network parameters are continuously updated through back propagation, and finally, the target domain data distribution is aligned with the source domain data distribution;
L rec =f(d(g(x s )),t t )+f(d(g(x t )),t s )
wherein x is s Representing source domain data, x t Representing target domain data, g (·) representing a video frame reconstruction network, d (·) representing a domain arbiter, t s A tag representing source domain data, defined as 0, t t A label representing target domain data, defined as 1, f (·) represents a binary cross entropy loss function, the formula of which is shown below:
f(p,q)=-w×[p×log(q)+(1-p)×log(1-q)]
where p represents a label value, q represents an actual predicted value, and w represents a weight coefficient.
8. The abnormal event detection algorithm based on unsupervised domain self-adaption according to claim 6, wherein the unsupervised domain adaptation module adopted in the target domain only requires the normal samples in the target domain for training; the prior knowledge of normal and abnormal events established in the pre-training module can be introduced into the target domain, defining the discrimination boundary between normal and abnormal events in the target domain and improving the abnormal event detection performance of the algorithm model there; through adversarial training, the source-domain and target-domain data distributions can be aligned, reducing domain shift, so that the algorithm model has better scene applicability.
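The adversarial objectives of claim 7, with f(p, q) the weighted binary cross entropy defined above (p the label, q the prediction), can be checked numerically with toy discriminator outputs; the concrete probability values are illustrative:

```python
import math

def bce(p, q, w=1.0):
    """f(p, q) = -w * [p*log(q) + (1-p)*log(1-q)], as in claim 7."""
    return -w * (p * math.log(q) + (1 - p) * math.log(1 - q))

def discriminator_loss(d_src, d_tgt, t_s=0.0, t_t=1.0):
    """L_dis: the discriminator is rewarded for labelling source
    outputs with t_s = 0 and target outputs with t_t = 1."""
    return bce(t_s, d_src) + bce(t_t, d_tgt)

def reconstruction_adv_loss(d_src, d_tgt, t_s=0.0, t_t=1.0):
    """L_rec: the labels are swapped, so the reconstruction network is
    rewarded when the discriminator cannot tell the domains apart."""
    return bce(t_t, d_src) + bce(t_s, d_tgt)

# A discriminator that separates the domains well (source→0.1, target→0.9)
# has a low L_dis but hands the reconstruction network a high L_rec,
# which drives the alternating optimization described in claim 7.
good_d = discriminator_loss(d_src=0.1, d_tgt=0.9)
fooled = reconstruction_adv_loss(d_src=0.1, d_tgt=0.9)
print(good_d, fooled)
```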
CN202310369508.8A 2023-04-08 2023-04-08 Abnormal event detection algorithm based on unsupervised domain self-adaption Pending CN116385935A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310369508.8A CN116385935A (en) 2023-04-08 2023-04-08 Abnormal event detection algorithm based on unsupervised domain self-adaption


Publications (1)

Publication Number Publication Date
CN116385935A true CN116385935A (en) 2023-07-04

Family

ID=86978392



Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116743646A (en) * 2023-08-15 2023-09-12 云南省交通规划设计研究院有限公司 Tunnel network anomaly detection method based on domain self-adaptive depth self-encoder
CN116743646B (en) * 2023-08-15 2023-12-19 云南省交通规划设计研究院股份有限公司 Tunnel network anomaly detection method based on domain self-adaptive depth self-encoder


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination