CN115797830A - Multi-perception video abnormal event detection method and device based on homologous heterogeneous information - Google Patents

Multi-perception video abnormal event detection method and device based on homologous heterogeneous information

Info

Publication number: CN115797830A
Application number: CN202211484883.9A
Authority: CN (China)
Prior art keywords: abnormal, features, text, semantic, target
Legal status: Withdrawn (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Inventors: 李洪均 (Li Hongjun), 李超波 (Li Chaobo), 章国安 (Zhang Guoan)
Current assignee: Nantong University (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Original assignee: Nantong University
Application filed by Nantong University; priority to CN202211484883.9A

Landscapes

  • Image Analysis (AREA)
Abstract

The invention provides a multi-perception video abnormal event detection method and device based on homologous heterogeneous information. The method comprises a target detection network, an image-text semantic perception module, a heterogeneous feature fusion bidirectional prediction module and a temporal attention discrimination module. The image-text semantic perception module semantically associates image features with text features to ensure semantic consistency; the heterogeneous feature fusion bidirectional prediction module introduces the idea of heterogeneous feature fusion to combine temporal, spatial and content information, enhancing the extraction of normal features and suppressing the generation of abnormal features, thereby improving the capture of discriminative features in anomaly detection; the temporal attention discrimination module attends to the timing relation between frames and learns to distinguish the characteristics of pseudo-abnormal and normal timings, improving the network's ability to detect anomalies from temporal features.

Description

Multi-perception video abnormal event detection method and device based on homologous heterogeneous information
Technical Field
The invention belongs to the field of intelligent video processing, and particularly relates to a multi-perception video abnormal event detection method and device based on homologous heterogeneous information.
Background
Video surveillance is an important information aid in the field of public safety. As the coverage of surveillance equipment grows, video security prevention and control faces rising labor costs while efficiency is hard to guarantee.
The essence of anomaly detection is to understand and distinguish the inherent differences between normal and abnormal events. Because abnormal events occur at low frequency, samples are scarce and labeling is difficult, some researchers have explored anomaly detection methods based on frame prediction, which train only on normal samples and judge anomalies by the difference between an input frame and its predicted frame, thereby avoiding an explicit definition of anomaly: Liu et al. proposed an anomaly detection framework based on future frame prediction that uses optical flow to represent motion information; Li et al. proposed a two-branch prediction model that takes the consecutive frames before and after the target frame as input; Lee et al. proposed a multi-scale aggregation network to take the contextual information of anomalous events into account. These methods attend to context information from two directions, but they obtain the preceding and following information through two separate branches, which makes the two kinds of information independent of each other and in turn degrades detection performance.
Furthermore, most frame-prediction-based anomaly detection methods achieve excellent performance under unsupervised learning, yet they ignore the essential differences between future frame prediction and anomaly detection. On the one hand, frame prediction focuses on prediction quality via the context of the target frame and neglects the target frame itself, whereas anomaly detection must focus on the discriminative features between normal and abnormal carried by the target frame. On the other hand, during frame prediction the model extracts high-level features but lacks perception of the content in an image frame, whereas abnormal event detection usually needs to determine the content attributes of a target so as to avoid, as far as possible, interference of uncertain factors with the discriminative features.
To learn the distinguishing features more clearly, prior work has focused on anomaly detection for the local targets present in a scene: Ionescu et al. introduced an object-centric convolutional auto-encoder and a one-versus-rest classifier to separate normal targets from abnormal ones; Georgescu et al. proposed a background-agnostic network that depends on the objects that may cause anomalies. These methods break through the limitation that the foreground target occupies only a small proportion of the whole image, and capture the key information in the frame from the refined features of local regions. However, learning only local features at test time ignores the relevance of the target within the global scene and the influence of the local target on the whole.
To attend to this integrity, Huang et al. proposed a global attribute restoration network that deletes certain attributes and learns semantic feature embeddings for anomaly detection; Lv et al. proposed a high-order context encoding model to extract semantic representations and measure dynamic changes. These methods attend to global high-level semantic information and show better performance in anomaly detection.
However, visual semantic information, as a basic unit, is only part of the semantic expression and lacks the important correlations among semantic units; moreover, a single visual modality has insufficient semantic expressiveness, and matching the consistency of semantic content in the correlation is difficult, which impairs semantic perception. In addition, the refinement of fine-grained features is weak, so the expressed semantics are insufficient for anomaly detection, which affects detection accuracy.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides a multi-perception video abnormal event detection method and device based on homologous heterogeneous information, which improve the perception of discriminative features between normal and abnormal from multiple aspects, so as to solve the problems that content information in a video frame cannot be acquired during anomaly detection, that the semantic association between local targets and the whole is lost, and that inter-frame timing differences are ignored, thereby improving anomaly detection accuracy.
The purpose of the invention is realized by the following technical scheme:
a multi-perception video abnormal event detection method based on homologous heterogeneous information is characterized by comprising the following steps: the system comprises a target detection network, a graph-text semantic perception module, a bidirectional prediction module with heterogeneous feature fusion and a time sequence attention distinguishing module, wherein the target detection network adopts a YoloV3 network to extract a target in a video frame; the image-text meaning perception module comprises an image feature extractor T o Text feature extractor T d The image-text semantic perception module extracts image features and text features of the video and calculates semantic correlation between the image features and the text features to ensure the consistency of the semantic features; the heterogeneous feature fusion bidirectional prediction module comprises a forward encoder, a backward encoder and a decoder, and enhances the extraction of normal features and inhibits the generation of abnormal features; the time sequence attention distinguishing module comprises a 3D convolution neural network, a time sequence attention mechanism and a 2D convolution network, and learns and distinguishes the characteristics of the false abnormal time sequence and the normal time sequence.
And carrying out abnormity judgment on the video to be detected through the combination of the prediction error, the semantic correlation and the time sequence information.
Preferably, the process by which the image-text semantic perception module extracts fine-grained image features and text features of the video comprises:
A video sequence $I_1,\ldots,I_M$ is passed through the target detection network to obtain the $N$ targets in each frame and their categories, where $M$ is the length of the video sequence and the value of $N$ is not fixed across frames. The $i$-th target region of the $t$-th frame $I_t$ is denoted $O_t^i \in \mathbb{R}^{W \times H \times C}$, where $t = 1,2,\ldots,M$, $i = 1,2,\ldots,N$, and $W$, $H$ and $C$ are the width, height and number of channels of the target region. The target regions in the video frame are converted to a fixed size and uniformly divided into $P$ sub-blocks of size $p \times p$, where $P = \frac{W \times H}{p^2}$; $P$ is taken as the length of the input sequence of the image feature extractor $T_o$.
Each sub-block is refined and mapped to a space of fixed dimension, generating the patch embedding $e_x$, where $x = 1,2,\ldots,P$; a position embedding $e_x^{pos}$ is added to retain the relative position information of each sub-block, giving the embedded feature $z_0$ of formula (1):
$z_0 = \left[e_1 + e_1^{pos};\ e_2 + e_2^{pos};\ \ldots;\ e_P + e_P^{pos}\right]$ (1)
The image feature extractor $T_o$ consists of $l$ stacked identical Transformer blocks, each a serial concatenation of two residual connections. The first residual combines the embedded feature $z_0$ with the output of layer normalization and multi-head self-attention to obtain the intermediate feature $\tilde{z}_1$, as shown in formula (2):
$\tilde{z}_1 = \mathrm{MSA}\!\left(\mathrm{LN}(z_0)\right) + z_0$ (2)
where $\mathrm{LN}(\cdot)$ denotes layer normalization and $\mathrm{MSA}(\cdot)$ denotes the multi-head self-attention mechanism.
The second residual adds the intermediate feature $\tilde{z}_1$, after layer normalization and multi-layer perception, to the intermediate feature itself, giving the output feature $z_1$ of the first Transformer block, as shown in formula (3):
$z_1 = \mathrm{MLP}\!\left(\mathrm{LN}(\tilde{z}_1)\right) + \tilde{z}_1$ (3)
where $\mathrm{MLP}(\cdot)$ denotes multi-layer perception.
$z_1$ is then taken as the input of the second Transformer block, which outputs $z_2$; by analogy, the output feature $z_l$ is obtained after the $l$ stacked blocks. $z_l$ is fed into an independent multi-layer perceptron, which outputs the final target region feature $z'_o$.
For each target region $O_t^i$, its corresponding category is recorded as $c_t^i$. A mapping $V$ from category labels to texts is established according to the pre-training sample categories of the target detection network, so that each target category label $c_t^i$ obtains a corresponding text $w_t^i = V(c_t^i)$, where $w_t^i$ is a sequence of length 76 delimited by [SOS] and [EOS] tokens. The text is converted into a computer-understandable form by byte-pair encoding, and the semantic context is preserved by embedding the position information of the text characters, giving the text embedding feature $e_d$.
The text embedding feature $e_d$ is passed through the text feature extractor $T_d$, as shown in formula (4):
$z'_d = T_d(e_d)$ (4)
where $z'_d$ is the output feature of the text feature extractor $T_d$. Using layer normalization and multi-layer perception, $z'_o$ and $z'_d$ are mapped into a multimodal embedding space, producing the image feature $z_o = \mathrm{MLP}(\mathrm{LN}(z'_o))$ and the text feature $z_d = \mathrm{MLP}(\mathrm{LN}(z'_d))$.
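For concreteness, the following PyTorch sketch mirrors the extractor $T_o$ of formulas (1)-(3): a convolutional patch embedding plus positional embedding, $l$ stacked Transformer blocks, and an independent MLP head. The hyperparameters (embedding dimension 512, 8 heads, depth 6, 16×16 patches) and the mean-pooling before the head are illustrative assumptions, not values fixed by this disclosure.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One block of T_o: the two residual branches of formulas (2) and (3)."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, z):
        h = self.ln1(z)
        z_mid = self.msa(h, h, h)[0] + z              # formula (2)
        return self.mlp(self.ln2(z_mid)) + z_mid      # formula (3)

class ImageFeatureExtractor(nn.Module):
    """T_o: patch + position embedding (formula (1)), l blocks, MLP head."""
    def __init__(self, patch=16, chans=3, dim=512, depth=6, num_patches=196):
        super().__init__()
        self.embed = nn.Conv2d(chans, dim, kernel_size=patch, stride=patch)  # e_x
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))            # e_x^pos
        self.blocks = nn.Sequential(*[TransformerBlock(dim, 8) for _ in range(depth)])
        self.head = nn.Linear(dim, dim)   # independent MLP producing z'_o

    def forward(self, region):            # region: (B, C, H, W), fixed size
        z0 = self.embed(region).flatten(2).transpose(1, 2) + self.pos  # formula (1)
        zl = self.blocks(z0)
        return self.head(zl.mean(dim=1))  # pooled z_l -> target region feature z'_o
```

A fixed region size of 224 × 224 with 16 × 16 patches gives the assumed num_patches = 196 = P.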
Preferably, the process of calculating the semantic correlation between the image features and the text features to ensure the consistency of semantic features comprises:
When the image-text semantic perception module is trained, the semantic association objective function $L_{sem}(z_o, z_d)$ is shown in formula (5):
$L_{sem}(z_o, z_d) = \frac{1}{N}\sum_{i=1}^{N}\left(\left\|z_o^i - z_d^i\right\|_2 + 1 - \mathrm{sim}\!\left[z_o^i, z_d^i\right]\right)$ (5)
where $z_o^i$ and $z_d^i$ respectively denote the image and text features of target $O_t^i$, and $\mathrm{sim}[\cdot]$ denotes cosine similarity. Minimizing the semantic association objective $L_{sem}(z_o, z_d)$ draws the image feature and text feature vectors close in absolute distance while constraining them to the same direction.
When anomaly detection is performed on the video under test, the semantic correlation $\mathrm{sim}\!\left[z_o(I_t), z_d(w_t^i)\right]$ between the global image feature and the local text features is calculated. The image features and text features are expressed as posterior probability vectors relative to a set of semantic concepts; when an anomaly occurs in the video under test, the correlation between the global image feature and the local text features is weak, and their interpretations differ in the semantic space.
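As a minimal sketch of the training objective of formula (5), assuming the hedged reading above (a distance term plus a direction term, averaged over the N targets of a frame):

```python
import torch
import torch.nn.functional as F

def semantic_association_loss(z_o: torch.Tensor, z_d: torch.Tensor) -> torch.Tensor:
    """L_sem of formula (5): pull paired image/text features together in
    absolute distance while constraining them to the same direction."""
    dist = (z_o - z_d).norm(dim=1)                  # || z_o^i - z_d^i ||_2
    cos = F.cosine_similarity(z_o, z_d, dim=1)      # sim[z_o^i, z_d^i]
    return (dist + 1.0 - cos).mean()                # average over the N targets

# usage: (N, D) image and text features of the N targets in a frame
loss = semantic_association_loss(torch.randn(8, 512), torch.randn(8, 512))
```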
Preferably, the process by which the heterogeneous feature fusion bidirectional prediction module enhances the extraction of normal features and suppresses the generation of abnormal features comprises:
Forward coding features and backward coding features are extracted with a bidirectional 3D encoder composed of a forward encoder $E_f$ and a backward encoder $E_b$; $E_f$ and $E_b$ have the same structure and network parameters.
For each target region, the $L$ frames before and the $L$ frames after the corresponding position are taken to form a forward target sequence $S_f^i = \{O_{t-L}^i, \ldots, O_{t-1}^i\}$ and a backward target sequence $S_b^i = \{O_{t+1}^i, \ldots, O_{t+L}^i\}$, neither of which contains the target of the $t$-th frame. They are input to the forward encoder $E_f$ and the backward encoder $E_b$ respectively, giving the forward coding feature $z_f$ and the backward coding feature $z_b$, as shown in formulas (6) and (7):
$z_f = E_f(S_f^i)$ (6)
$z_b = E_b(S_b^i)$ (7)
The image feature $z_o$, text feature $z_d$, forward coding feature $z_f$ and backward coding feature $z_b$ are concatenated to obtain the heterogeneous fusion feature $z = \mathrm{concat}[z_f, z_b, z_o, z_d]$.
The decoder $D$ predicts the target region from the heterogeneous fusion feature, expressing the acquired features in 2D form to predict the intermediate target region $\hat{O}_t^i$:
$\hat{O}_t^i = D(z)$
The difference between the predicted target region and the real target region is expressed as the objective function $L_O(O_t^i, \hat{O}_t^i)$, as shown in formula (8):
$L_O(O_t^i, \hat{O}_t^i) = \frac{1}{W \times H}\left\|O_t^i - \hat{O}_t^i\right\|_2^2$ (8)
where $O_t^i$ and $\hat{O}_t^i$ are the real target region and the predicted target region respectively, and $W$ and $H$ denote the width and height of the target region.
During training of the heterogeneous feature fusion bidirectional prediction module, the encoders and decoder minimize the objective function $L_O$ using only normal samples, so the heterogeneous fusion feature contains the content information of normal samples: the decoder can predict normal samples, but for abnormal samples it cannot predict the abnormal target.
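A sketch of the fusion-and-predict step under stated assumptions: the encoder and decoder internals are injected placeholders (a concrete 3D-convolutional layout is given in a later preferred embodiment), the forward and backward encoders share one module instance because the disclosure gives them identical structure and parameters, and the projection of the concatenated feature into the decoder's 2D input is an assumed design choice.

```python
import torch
import torch.nn as nn

class BidirectionalPredictor(nn.Module):
    """Heterogeneous-feature-fusion bidirectional prediction, formulas (6)-(8)."""
    def __init__(self, encoder_3d: nn.Module, proj: nn.Module, decoder_2d: nn.Module):
        super().__init__()
        self.E_f = encoder_3d      # forward encoder E_f
        self.E_b = encoder_3d      # E_b shares structure and parameters with E_f
        self.proj = proj           # maps the concat feature to the decoder's 2D input
        self.decoder = decoder_2d

    def forward(self, S_f, S_b, z_o, z_d):
        z_f = self.E_f(S_f).flatten(1)              # formula (6)
        z_b = self.E_b(S_b).flatten(1)              # formula (7)
        z = torch.cat([z_f, z_b, z_o, z_d], dim=1)  # heterogeneous fusion feature
        return self.decoder(self.proj(z))           # predicted region \hat{O}_t^i

def prediction_loss(O: torch.Tensor, O_hat: torch.Tensor) -> torch.Tensor:
    """L_O of formula (8): prediction error averaged over the W x H region."""
    return ((O - O_hat) ** 2).mean()
```

Training minimizes prediction_loss on normal samples only, which is what lets a large prediction error flag an abnormal target at test time.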
Preferably, the pseudo-abnormal sequence is generated as follows: from the target region $O_t^i$ and the $L$ frames before and after it, a continuous normal sequence $S_n = \{O_{t-L}^i, \ldots, O_t^i, \ldots, O_{t+L}^i\}$ is generated and labeled $y_n(S_n) = 0$; a pseudo-abnormal sequence $S_a$ of the same length as $S_n$ is generated by random jumps and labeled $y_a(S_a) = 1$, where $a$ is a random number and $S_a$ is the pseudo-abnormal timing.
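A minimal sketch of this sampling (the jump distribution is an assumption; any sampling that breaks inter-frame continuity fits the description):

```python
import random

def normal_sequence(regions, t, L):
    """Continuous normal timing S_n around frame t; label y_n(S_n) = 0."""
    return [regions[j] for j in range(t - L, t + L + 1)], 0

def pseudo_abnormal_sequence(regions, L):
    """S_a of the same length 2L+1 built by random jumps; label y_a(S_a) = 1."""
    idx = sorted(random.sample(range(len(regions)), 2 * L + 1))  # non-contiguous frames
    return [regions[j] for j in idx], 1
```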
Preferably, the process by which the temporal attention discrimination module learns and distinguishes the characteristics of pseudo-abnormal and normal timings comprises: given a normal timing $S_n \in \mathbb{R}^{(2L+1) \times C \times W \times H}$ and a pseudo-abnormal timing $S_a \in \mathbb{R}^{(2L+1) \times C \times W \times H}$, where $2L+1$ is the sequence length, $C$ the number of channels, and $W \times H$ the size of the target region, $S_n$ and $S_a$ are input to the 3D convolutional neural network of the temporal attention discrimination module to extract the timing feature $z'_t$.
The attention score of each target region of $S_n$ and $S_a$ is calculated separately with the temporal attention mechanism. The mechanism uses 3D average pooling and 3D max pooling to complete the compression operation in the time dimension; after the 3D average pooling and 3D max pooling, two fully connected layers are used to obtain different scaling factors, which are combined into the final attention score; finally, each time step is rescaled by the attention score to complete the recalibration $z_t$, as shown in formula (9):
$z_t = z'_t \cdot \delta\!\left(fc\!\left(\mathrm{maxp}_{3D}(z'_t); \theta\right) + fc\!\left(\mathrm{avgp}_{3D}(z'_t); \theta\right)\right)$ (9)
where $\mathrm{maxp}_{3D}(\cdot)$ denotes 3D max pooling, $\mathrm{avgp}_{3D}(\cdot)$ denotes 3D average pooling, $fc(\cdot\,; \theta)$ denotes a fully connected layer with parameters $\theta$, and $\delta$ denotes the Sigmoid activation function.
The timing is then processed nonlinearly by the 2D convolutional network of the temporal attention discrimination module. The 2D convolutional network mainly consists of convolutional layers and full connections, with batch normalization, ReLU activation and dropout used after the convolutional layers; the fully connected layer is followed by a softmax function that outputs the anomaly discrimination probability $\hat{y}(S_k)$, where $S_k = \{S_n, S_a\}$. Cross entropy is used as the objective function $L(S_k)$, as shown in formula (10):
$L(S_k) = -\left[y_k(S_k)\log \hat{y}(S_k) + \left(1 - y_k(S_k)\right)\log\left(1 - \hat{y}(S_k)\right)\right]$ (10)
where the anomaly discrimination probability $\hat{y}(S_k) = \mathrm{softmax}\!\left(fc_1\!\left(\mathrm{conv}(z_t)\right)\right)$, $\mathrm{softmax}(\cdot)$ denotes the softmax function, $\mathrm{conv}(\cdot)$ denotes convolution, and $fc_1(\cdot)$ denotes a fully connected operation. When the timing under test is abnormal, $k = a$ and $y_k(S_k) = 1$; otherwise it is 0.
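Formula (9) is a squeeze-and-excitation-style gate over the time dimension; the sketch below assumes that the 3D poolings compress the channel and spatial dimensions so that one score remains per time step, and that the two fully connected branches share the parameters θ as the formula's notation suggests.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Recalibration z_t of formula (9): pooled descriptors, shared FC, Sigmoid gate."""
    def __init__(self, T: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(T, T // reduction), nn.ReLU(),
                                nn.Linear(T // reduction, T))  # fc(.; theta)

    def forward(self, z):                      # z: (B, C, T, H, W) timing feature z'_t
        s_max = z.amax(dim=(1, 3, 4))          # 3D max pooling     -> (B, T)
        s_avg = z.mean(dim=(1, 3, 4))          # 3D average pooling -> (B, T)
        score = torch.sigmoid(self.fc(s_max) + self.fc(s_avg))   # delta(...) of (9)
        return z * score.view(z.size(0), 1, -1, 1, 1)            # rescale each step
```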
Preferably, the process of judging anomalies in the video under test by combining the prediction error, the semantic correlation and the timing information is specifically as follows:
Spatially, the prediction error between the real target region and the predicted target region is used as the spatial anomaly score of each target's region. For an arbitrary frame $I_t$ in which $N$ targets exist, the maximum of all target anomaly scores in the frame is selected as the spatial anomaly score $S_{spa}(I_t)$ of the frame, as shown in formula (11):
$S_{spa}(I_t) = \max_{i=1,\ldots,N}\left\|O_t^i - \hat{O}_t^i\right\|_2^2$ (11)
where $O_t^i$ denotes the real target region, $\hat{O}_t^i$ denotes the predicted target region, and $\|\cdot\|_2$ denotes the $L_2$ norm.
In content, the $i$-th target region $O_t^i$ of frame $I_t$ corresponds to the text $w_t^i$. The semantic correlation between the global image feature $z_o(I_t)$ and the local text feature $z_d(w_t^i)$ is used as the local anomaly score, and the minimum semantic correlation is selected as the global semantic anomaly score $S_{sem}(I_t)$, as shown in formula (12):
$S_{sem}(I_t) = \min_{i=1,\ldots,N}\mathrm{sim}\!\left[z_o(I_t), z_d(w_t^i)\right]$ (12)
where $\mathrm{sim}[\cdot]$ denotes cosine similarity; the higher the probability that a target in the frame is abnormal, the smaller the semantic correlation.
In timing, the target $O_t^i$ corresponds to the timing under test $S_t^i$. The anomaly discrimination probability output by the timing discrimination module is taken as the timing anomaly score of target $O_t^i$; over all targets, the maximum anomaly discrimination probability is selected as the timing anomaly score $S_{tem}(I_t)$ of the frame, as shown in formula (13):
$S_{tem}(I_t) = \max_{i=1,\ldots,N}\hat{y}(S_t^i)$ (13)
where $\hat{y}(S_t^i)$ is the predicted probability for the target sequence $S_t^i$.
The spatial anomaly score, the semantic anomaly score and the timing anomaly score are added linearly to obtain the final anomaly score $S(I_t)$, as shown in formula (14):
$S(I_t) = S_{spa}(I_t) + \alpha \cdot \left(1 - S_{sem}(I_t)\right) + \beta \cdot S_{tem}(I_t)$ (14)
where $\alpha$ and $\beta$ are the semantic coefficient and the timing parameter respectively.
The larger the value of the anomaly score $S(I_t)$, the greater the probability that an anomaly occurs in the frame.
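The frame-level score fusion of formulas (11)-(14) reduces to a few lines once the per-target quantities are available; α and β are dataset-dependent weights whose values are not fixed here.

```python
import numpy as np

def frame_anomaly_score(pred_err, sem_corr, tem_prob, alpha=1.0, beta=1.0):
    """S(I_t) of formula (14) from the per-target quantities of one frame:
    pred_err - squared prediction errors -> S_spa = max  (formula (11))
    sem_corr - cosine similarities       -> S_sem = min  (formula (12))
    tem_prob - anomaly probabilities     -> S_tem = max  (formula (13))
    """
    S_spa = float(np.max(pred_err))
    S_sem = float(np.min(sem_corr))
    S_tem = float(np.max(tem_prob))
    return S_spa + alpha * (1.0 - S_sem) + beta * S_tem  # formula (14)
```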
Preferably, the bidirectional 3D encoder is composed of 6 3D convolutional layers with kernel size 3 × 3 × 3; each convolution is followed by batch normalization and ReLU activation; no pooling operation is used after the first and third convolutional layers, while the other layers use 3D max pooling with pooling size 1 × 2 × 2 and stride 1 × 2 × 2.
Preferably, the decoder is realized by 4 upsamplings; after each upsampling, a 2D convolutional layer with kernel size 3 × 3 performs the feature expression, and each convolution is followed by batch normalization and ReLU activation.
The invention also proposes an electronic device comprising a memory, a processor and program instructions stored in the memory for execution by the processor, the processor executing the program instructions to implement the steps of the method of the invention.
Compared with the prior art, the invention has the following beneficial effects:
1. Aiming at the deficiencies of local target and global semantic perception in anomaly detection, the invention establishes an image-text semantic perception module that correlates the image features and text features of a target region and ensures the consistency of perceived semantics;
2. The invention adopts a strategy of local training and global reasoning to perceive the semantic correlation between local targets and global information, improving the accuracy of anomaly detection;
3. The invention designs a heterogeneous feature fusion bidirectional prediction module which, by combining spatial, temporal and content information, enhances the extraction of normal features and suppresses the generation of abnormal features, thereby capturing more discriminative features;
4. The invention considers the timing differences between frames and adopts a temporal attention discrimination module to distinguish the generated pseudo-abnormal sequences from normal sequences in the time dimension, improving the sensitivity to anomalies.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on the drawings without creative efforts.
Fig. 1 is a general framework diagram of the multi-perception video abnormal event detection method based on homologous heterogeneous information provided by an embodiment of the present invention;
Fig. 2 is a schematic visualization of anomaly detection and localization provided by an embodiment of the present invention.
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1, the present invention provides a multi-perception video abnormal event detection method based on homologous heterogeneous information, which mainly comprises a target detection network, an image-text semantic perception module, a heterogeneous feature fusion bidirectional prediction module and a temporal attention discrimination module. The target detection network adopts a YOLOv3 network to extract the targets in a video frame. The image-text semantic perception module comprises an image feature extractor $T_o$ and a text feature extractor $T_d$; it extracts image features and text features of the video and calculates the semantic correlation between them to ensure the consistency of semantic features. The heterogeneous feature fusion bidirectional prediction module comprises a forward encoder, a backward encoder and a decoder, and enhances the extraction of normal features while suppressing the generation of abnormal features. The temporal attention discrimination module comprises a 3D convolutional neural network, a temporal attention mechanism and a 2D convolutional network, and learns to distinguish the characteristics of pseudo-abnormal and normal timings.
Anomaly judgment of the video under test is performed by combining the prediction error, the semantic correlation and the timing information.
The process by which the image-text semantic perception module extracts fine-grained image features and text features of the video comprises:
A video sequence $I_1,\ldots,I_M$ is passed through the target detection network to obtain the $N$ targets in each frame and their categories, where $M$ is the length of the video sequence and the value of $N$ is not fixed across frames. The $i$-th target region of the $t$-th frame $I_t$ is denoted $O_t^i \in \mathbb{R}^{W \times H \times C}$, where $t = 1,2,\ldots,M$, $i = 1,2,\ldots,N$, and $W$, $H$ and $C$ are the width, height and number of channels of the target region. The target regions in the video frame are converted to a fixed size and uniformly divided into $P$ sub-blocks of size $p \times p$, where $P = \frac{W \times H}{p^2}$; $P$ is taken as the length of the input sequence of the image feature extractor $T_o$.
Each sub-block is refined and mapped to a space of fixed dimension, generating the patch embedding $e_x$, where $x = 1,2,\ldots,P$; a position embedding $e_x^{pos}$ is added to retain the relative position information of each sub-block, giving the embedded feature $z_0$ of formula (1):
$z_0 = \left[e_1 + e_1^{pos};\ e_2 + e_2^{pos};\ \ldots;\ e_P + e_P^{pos}\right]$ (1)
The image feature extractor $T_o$ consists of $l$ stacked identical Transformer blocks, each a serial concatenation of two residual connections. The first residual combines the embedded feature $z_0$ with the output of layer normalization and multi-head self-attention to obtain the intermediate feature $\tilde{z}_1$, as shown in formula (2):
$\tilde{z}_1 = \mathrm{MSA}\!\left(\mathrm{LN}(z_0)\right) + z_0$ (2)
where $\mathrm{LN}(\cdot)$ denotes layer normalization and $\mathrm{MSA}(\cdot)$ denotes the multi-head self-attention mechanism.
The second residual adds the intermediate feature $\tilde{z}_1$, after layer normalization and multi-layer perception, to the intermediate feature itself, giving the output feature $z_1$ of the first Transformer block, as shown in formula (3):
$z_1 = \mathrm{MLP}\!\left(\mathrm{LN}(\tilde{z}_1)\right) + \tilde{z}_1$ (3)
where $\mathrm{MLP}(\cdot)$ denotes multi-layer perception.
$z_1$ is then taken as the input of the second Transformer block, which outputs $z_2$; by analogy, the output feature $z_l$ is obtained after the $l$ stacked blocks. $z_l$ is fed into an independent multi-layer perceptron, which outputs the final target region feature $z'_o$.
For each target region $O_t^i$, its corresponding category is recorded as $c_t^i$. A mapping $V$ from category labels to texts is established according to the pre-training sample categories of the target detection network, so that each target category label $c_t^i$ obtains a corresponding text $w_t^i = V(c_t^i)$, where $w_t^i$ is a sequence of length 76 delimited by [SOS] and [EOS] tokens. The text is converted into a computer-understandable form by byte-pair encoding, and the semantic context is preserved by embedding the position information of the text characters, giving the text embedding feature $e_d$.
The text embedding feature $e_d$ is passed through the text feature extractor $T_d$, as shown in formula (4):
$z'_d = T_d(e_d)$ (4)
where $z'_d$ is the output feature of the text feature extractor $T_d$.
Since there may be a gap between the text features and the representations produced by the fine-grained image feature extractor, the features of the two modalities need to be mapped into the same multimodal space. Therefore, using layer normalization and multi-layer perception, $z'_o$ and $z'_d$ are mapped into a multimodal embedding space, producing the image feature $z_o = \mathrm{MLP}(\mathrm{LN}(z'_o))$ and the text feature $z_d = \mathrm{MLP}(\mathrm{LN}(z'_d))$.
The process of calculating the semantic correlation between the image features and the text features to ensure the consistency of semantic features comprises:
When the image-text semantic perception module is trained, the semantic association objective function $L_{sem}(z_o, z_d)$ is shown in formula (5):
$L_{sem}(z_o, z_d) = \frac{1}{N}\sum_{i=1}^{N}\left(\left\|z_o^i - z_d^i\right\|_2 + 1 - \mathrm{sim}\!\left[z_o^i, z_d^i\right]\right)$ (5)
where $z_o^i$ and $z_d^i$ respectively denote the image and text features of target $O_t^i$, and $\mathrm{sim}[\cdot]$ denotes cosine similarity. Minimizing the semantic association objective $L_{sem}(z_o, z_d)$ draws the image feature and text feature vectors close in absolute distance while constraining them to the same direction.
When anomaly detection is performed on the video under test, the semantic correlation $\mathrm{sim}\!\left[z_o(I_t), z_d(w_t^i)\right]$ between the global image feature and the local text features is calculated. The image features and text features are expressed as posterior probability vectors relative to a set of semantic concepts; when an anomaly occurs in the video under test, the correlation between the global image feature and the local text features is weak, and their interpretations differ in the semantic space.
The process by which the heterogeneous feature fusion bidirectional prediction module enhances the extraction of normal features and suppresses the generation of abnormal features comprises:
Forward coding features and backward coding features are extracted with a bidirectional 3D encoder composed of a forward encoder $E_f$ and a backward encoder $E_b$; $E_f$ and $E_b$ have the same structure and network parameters. The bidirectional 3D encoder is composed of 6 3D convolutional layers with kernel size 3 × 3 × 3; each convolution is followed by batch normalization and ReLU activation; no pooling operation is used after the first and third convolutional layers, while the other layers use 3D max pooling with pooling size 1 × 2 × 2 and stride 1 × 2 × 2.
For each target region, the $L$ frames before and the $L$ frames after the corresponding position are taken to form a forward target sequence $S_f^i = \{O_{t-L}^i, \ldots, O_{t-1}^i\}$ and a backward target sequence $S_b^i = \{O_{t+1}^i, \ldots, O_{t+L}^i\}$, neither of which contains the target of the $t$-th frame. They are input to the forward encoder $E_f$ and the backward encoder $E_b$ respectively, giving the forward coding feature $z_f$ and the backward coding feature $z_b$, as shown in formulas (6) and (7):
$z_f = E_f(S_f^i)$ (6)
$z_b = E_b(S_b^i)$ (7)
The image feature $z_o$, text feature $z_d$, forward coding feature $z_f$ and backward coding feature $z_b$ are concatenated to obtain the heterogeneous fusion feature $z = \mathrm{concat}[z_f, z_b, z_o, z_d]$. The heterogeneous fusion feature combines temporal, spatial and content information, reduces the uncertainty of the prediction process, and strengthens the timing relation between the current frame and its context.
The decoder is realized by 4 upsamplings; after each upsampling, a 2D convolutional layer with kernel size 3 × 3 performs the feature expression, and each convolution is followed by batch normalization and ReLU activation. The decoder $D$ predicts the target region from the heterogeneous fusion feature, expressing the acquired features in 2D form to predict the intermediate target region $\hat{O}_t^i$:
$\hat{O}_t^i = D(z)$
The difference between the predicted target region and the real target region is expressed as the objective function $L_O(O_t^i, \hat{O}_t^i)$, as shown in formula (8):
$L_O(O_t^i, \hat{O}_t^i) = \frac{1}{W \times H}\left\|O_t^i - \hat{O}_t^i\right\|_2^2$ (8)
where $O_t^i$ and $\hat{O}_t^i$ are the real target region and the predicted target region respectively, and $W$ and $H$ denote the width and height of the target region.
During training of the heterogeneous feature fusion bidirectional prediction module, the encoders and decoder minimize the objective function $L_O$ using only normal samples, so the heterogeneous fusion feature contains the content information of normal samples: the decoder can predict normal samples, but for abnormal samples it cannot predict the abnormal target.
The pseudo-abnormal sequence is generated as follows: from the target region $O_t^i$ and the $L$ frames before and after it, a continuous normal sequence $S_n = \{O_{t-L}^i, \ldots, O_t^i, \ldots, O_{t+L}^i\}$ is generated and labeled $y_n(S_n) = 0$; a pseudo-abnormal sequence $S_a$ of the same length as $S_n$ is generated by random jumps and labeled $y_a(S_a) = 1$, where $a$ is a random number. $S_a$ has weak continuity between frames, which matches the irregular motion behavior of abnormal events, so $S_a$ is taken as the pseudo-abnormal timing.
The process by which the temporal attention discrimination module learns and distinguishes the characteristics of pseudo-abnormal and normal timings comprises: given a normal timing $S_n \in \mathbb{R}^{(2L+1) \times C \times W \times H}$ and a pseudo-abnormal timing $S_a \in \mathbb{R}^{(2L+1) \times C \times W \times H}$, where $2L+1$ is the sequence length, $C$ the number of channels, and $W \times H$ the size of the target region, $S_n$ and $S_a$ are input to the 3D convolutional neural network of the temporal attention discrimination module to extract the timing feature $z'_t$.
The attention score of each target region of $S_n$ and $S_a$ is calculated separately with the temporal attention mechanism. The mechanism uses 3D average pooling and 3D max pooling to complete the compression operation in the time dimension; after the 3D average pooling and 3D max pooling, two fully connected layers are used to obtain different scaling factors, which are combined into the final attention score; finally, each time step is rescaled by the attention score to complete the recalibration $z_t$, as shown in formula (9):
$z_t = z'_t \cdot \delta\!\left(fc\!\left(\mathrm{maxp}_{3D}(z'_t); \theta\right) + fc\!\left(\mathrm{avgp}_{3D}(z'_t); \theta\right)\right)$ (9)
where $\mathrm{maxp}_{3D}(\cdot)$ denotes 3D max pooling, $\mathrm{avgp}_{3D}(\cdot)$ denotes 3D average pooling, $fc(\cdot\,; \theta)$ denotes a fully connected layer with parameters $\theta$, and $\delta$ denotes the Sigmoid activation function.
The timing is then processed nonlinearly by the 2D convolutional network of the temporal attention discrimination module. The 2D convolutional network mainly consists of convolutions and full connections; after convolution, batch normalization and ReLU activation are used to improve generalization, while dropout is used to avoid overfitting. The fully connected layer is followed by a softmax function that outputs the anomaly discrimination probability $\hat{y}(S_k)$, where $S_k = \{S_n, S_a\}$. Cross entropy is used as the objective function $L(S_k)$, as shown in formula (10):
$L(S_k) = -\left[y_k(S_k)\log \hat{y}(S_k) + \left(1 - y_k(S_k)\right)\log\left(1 - \hat{y}(S_k)\right)\right]$ (10)
where the anomaly discrimination probability $\hat{y}(S_k) = \mathrm{softmax}\!\left(fc_1\!\left(\mathrm{conv}(z_t)\right)\right)$, $\mathrm{softmax}(\cdot)$ denotes the softmax function, $\mathrm{conv}(\cdot)$ denotes convolution, and $fc_1(\cdot)$ denotes a fully connected operation. When the timing under test is abnormal, $k = a$ and $y_k(S_k) = 1$; otherwise it is 0.
The process of judging anomalies in the video under test by combining the prediction error, the semantic correlation and the timing information is specifically as follows:
Spatially, the prediction error between the real target region and the predicted target region is used as the spatial anomaly score of each target's region. For an arbitrary frame $I_t$ in which $N$ targets exist, the maximum of all target anomaly scores in the frame is selected as the spatial anomaly score $S_{spa}(I_t)$ of the frame, as shown in formula (11):
$S_{spa}(I_t) = \max_{i=1,\ldots,N}\left\|O_t^i - \hat{O}_t^i\right\|_2^2$ (11)
where $O_t^i$ denotes the real target region, $\hat{O}_t^i$ denotes the predicted target region, and $\|\cdot\|_2$ denotes the $L_2$ norm.
In content, the $i$-th target region $O_t^i$ of frame $I_t$ corresponds to the text $w_t^i$. The semantic correlation between the global image feature $z_o(I_t)$ and the local text feature $z_d(w_t^i)$ is used as the local anomaly score, and the minimum semantic correlation is selected as the global semantic anomaly score $S_{sem}(I_t)$, as shown in formula (12):
$S_{sem}(I_t) = \min_{i=1,\ldots,N}\mathrm{sim}\!\left[z_o(I_t), z_d(w_t^i)\right]$ (12)
where $\mathrm{sim}[\cdot]$ denotes cosine similarity; the higher the probability that a target in the frame is abnormal, the smaller the semantic correlation.
In timing, the target $O_t^i$ corresponds to the timing under test $S_t^i$. The anomaly discrimination probability output by the timing discrimination module is taken as the timing anomaly score of target $O_t^i$; over all targets, the maximum anomaly discrimination probability is selected as the timing anomaly score $S_{tem}(I_t)$ of the frame, as shown in formula (13):
$S_{tem}(I_t) = \max_{i=1,\ldots,N}\hat{y}(S_t^i)$ (13)
where $\hat{y}(S_t^i)$ is the predicted probability for the target sequence $S_t^i$.
The spatial anomaly score, the semantic anomaly score and the timing anomaly score are added linearly to obtain the final anomaly score $S(I_t)$, as shown in formula (14):
$S(I_t) = S_{spa}(I_t) + \alpha \cdot \left(1 - S_{sem}(I_t)\right) + \beta \cdot S_{tem}(I_t)$ (14)
where $\alpha$ and $\beta$ are respectively the semantic coefficient and the timing parameter, which control the importance of the semantic anomaly score and the timing anomaly score relative to the spatial anomaly score.
The larger the value of the anomaly score $S(I_t)$, the greater the probability that an anomaly occurs in the frame.
The effectiveness of the method proposed by the present invention is verified by experiments below.
The method of the invention evaluates anomaly detection on three datasets: UCSD Ped2, CUHK Avenue and ShanghaiTech. All reported results were obtained on a device with an Intel Xeon(R) CPU and an NVIDIA GTX 1080Ti GPU. The method is implemented mainly in Anaconda3, Python 3.8, TensorFlow and PyTorch, and standard evaluation metrics are adopted to assess anomaly detection performance. Performance is evaluated by gradually varying the threshold of the anomaly score and computing the area under the receiver operating characteristic curve (AUC) as a scalar; higher AUC values indicate better anomaly detection performance.
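Frame-level AUC, as used here, can be computed directly from the per-frame scores of formula (14); a sketch using scikit-learn, where the per-video min-max normalization of the scores is a common convention assumed here rather than stated in the text:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def frame_level_auc(scores: np.ndarray, gt: np.ndarray) -> float:
    """scores: per-frame S(I_t) values; gt: 1 for abnormal frames, 0 for normal."""
    scores = (scores - scores.min()) / (scores.max() - scores.min() + 1e-8)
    return roc_auc_score(gt, scores)

# usage with dummy values
auc = frame_level_auc(np.random.rand(1000), np.random.randint(0, 2, 1000))
```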
Table 1 compares the method of the invention (MPFork) with some of the current best methods on the UCSD Ped2, CUHK Avenue and ShanghaiTech datasets, namely the future frame prediction method (FFP), memory-augmented deep auto-encoder (MemAE), memory-guided normality method (MNN), stacked recurrent neural network auto-encoder (sRNN-AE), generative cooperative discriminant network (GADNet), self-trained prediction model (SPM), self-supervised predictive convolutional attentive block (SSPCAB), implicit two-path auto-encoder (ITAE), multi-path prediction anomaly detection (ROADMAP), anomaly detection with graph convolutional noise cleaning (GCLNL), variational abnormal behavior detection method (VABD), temporal-aware contrastive network (TAC-Net), and self-supervised attentive generative adversarial network (SSAGAN).
Table 1. Comparison of AUC results of different methods
[table available only as an image in the source]
As can be seen from Table 1, compared with the other methods, the method of the present invention achieves the best performance on the UCSD Ped2, CUHK Avenue and ShanghaiTech datasets, with detection accuracies of 99.8%, 91.3% and 86.1% respectively. ITAE achieves the second-best accuracy of 99.2% on the Ped2 dataset, but the method of the invention exceeds it by 3.3% and 9.8% on the other two datasets, which proves that the method helps to distinguish normal from abnormal and improves anomaly detection performance.
The anomaly localization ability and timing sensitivity of the method are further verified with the RBDC and TBDC metrics: RBDC expresses localization ability according to the overlap between the real and the predicted abnormal regions, while TBDC depends on the tracking detection rate and the number of false-positive regions in consecutive frames. Table 2 compares the detection accuracy of self-supervised multi-task learning (SMTL), the object-centric auto-encoder (OCAE), FFP, SSPCAB, and the method of the invention (MPFork).
Table 2. Comparison of RBDC and TBDC performance of different methods
[table available only as an image in the source]
As can be seen from Table 2, the method of the invention gives better overall results on both the RBDC and TBDC metrics. Compared with FFP and SSPCAB, RBDC and TBDC on the UCSD Ped2 dataset improve by 35% and 40% respectively; gains of over 15% and 25% are obtained on CUHK Avenue; and improvements of 30% and 25% are obtained on ShanghaiTech. Considering the spatial prediction of anomalies in heterogeneous feature fusion together with timing features during anomaly detection helps to improve the sensitivity to, and the localization of, abnormal trajectories. In addition, the method outperforms SMTL on all three datasets because it considers not only spatio-temporal features but also the semantic correlation between local features and the global scene, reducing missed detections.
The effectiveness of the method of the invention is further verified by evaluating the contribution of each strategy in the model to anomaly detection. A unidirectional frame prediction network, mainly comprising a 3D encoder and a 2D decoder, is taken as the baseline for anomaly detection. The proposed MPFork mainly comprises three modules: the heterogeneous feature fusion bidirectional prediction module (BWH), the image-text semantic perception module (ISP) and the temporal attention discrimination module (TAD). Compared with the baseline model, the heterogeneous feature fusion bidirectional prediction module not only adds bidirectional prediction (BiP) but also performs anomaly detection with homologous heterogeneous feature fusion. Accordingly, the effect of the spatio-temporal features provided by bidirectional prediction on anomaly detection is verified, and the influence of heterogeneous feature fusion in frame prediction is verified separately for the fine-grained image features of the target region obtained by the image feature extractor (ViF) and the text features obtained by the text feature extractor (TeF). The performance changes after adding the different strategies relative to the baseline model are shown in Table 3.
Table 3. Influence of different strategies on anomaly detection
[table available only as an image in the source]
As can be seen from Table 3, bidirectional prediction improves performance over the baseline model, especially by 2.4% on UCSD Ped2. That is because bidirectional prediction employs forward and backward feature extraction, indirectly providing richer temporal information for inter-frame prediction. When the fine-grained image features or text features of the target region are added, performance on the three datasets improves to different degrees, showing their effectiveness for anomaly detection. In particular, when the heterogeneous information fusion bidirectional prediction module is used, there are improvements of 5.0%, 2.1% and 2.0% on the UCSD Ped2, CUHK Avenue and ShanghaiTech datasets respectively. The prediction process combines the timing features of the sequence with the spatial features and content information of the current target region, which reduces prediction uncertainty, enhances the capture of normal features, and suppresses the generation of abnormal features. After the image-text semantic perception module is added, the improvement on the Avenue dataset is more obvious: anomalies such as throwing a bag or dropping papers exist in Avenue and are easily missed when only local features are used, whereas the proposed image-text semantic perception attends to the correlation between local features and the global scene and more readily captures the relation between people and objects, thereby detecting such anomalies. After the temporal attention discrimination module is added, anomaly detection performance improves markedly on the CUHK Avenue and ShanghaiTech datasets: these two datasets contain fast running, chasing, alarms and similar events, and the temporal attention discrimination module focuses more on the timing features of the target, making it easier to distinguish abnormal timings from normal ones. Using all these strategies simultaneously, the proposed MPFork achieves the best performance, improving over the baseline model by 6.3%, 9.1% and 7.0% on the three datasets respectively. This result indicates that all the proposed strategies contribute to anomaly detection, as their combination optimizes the framework from different aspects and improves performance.
In addition, in real life an anomaly detection model must not only index the video frame in which an abnormal event occurs but also locate the specific position of the abnormal target within the frame. To illustrate the performance of the proposed method more intuitively, several representative cases from different datasets were selected to visualize the detection and localization results of the different models, as shown in fig. 2, where rectangular boxes mark missed or false detections. In each dataset, the first column shows the visualization of a normal frame, and the second and third columns show the detection and localization of different abnormal frames under the different models. The first row is the actually captured video frame, and the rows below correspond to the detection and localization results of BiP, BWH and MPFork respectively. It can be seen that BiP is more likely to judge some normal targets as abnormal, causing false detections, and is also prone to missed detections. BiP performs anomaly detection from the prediction error alone: because its learning of the discriminative features between normal and abnormal is insufficient, the prediction error of some normal targets is relatively large and produces false detections, while the strong representation ability of the deep network makes the prediction of some abnormal targets good and produces missed detections. Compared with BiP, the energy values of some normal targets in BWH decrease and those of abnormal targets increase, indicating an increased ability of BWH to learn the discriminative features of normal and abnormal targets. This is because BWH simultaneously fuses the temporal, spatial and content information of the target, which helps enhance the normal features in the solution space and suppress the generation of abnormal features during inference. The proposed MPFork shows better detection and localization on all three datasets because, on the basis of BWH, it combines local and global semantic perception with timing discrimination, improves the extraction of discriminative features, perceives anomalies deeply at the semantic understanding level, and detects anomalies from the timing perspective during inference, which benefits the detection and localization of anomalies, makes the method more suitable for real-life anomaly detection, and improves detection performance.
Finally, the invention provides an electronic device comprising a memory, a processor, and a program stored in the memory and executable by the processor; when the processor executes the program, the steps of the multi-perception video abnormal event detection method based on homologous heterogeneous information proposed by the invention are implemented. It should be noted that the description of the device according to the embodiment of the present invention is similar to the description of the method embodiment and has similar beneficial effects, and is therefore omitted.
Program code for implementing the methods of the present application may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
Although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that various changes, modifications, equivalents, and improvements can be made therein without departing from the spirit and scope of the invention.

Claims (9)

1. A multi-perception video abnormal event detection method based on homologous heterogeneous information, characterized by comprising: a target detection network, an image-text semantic perception module, a heterogeneous feature fusion bidirectional prediction module, and a temporal attention discrimination module;
the target detection network adopts a YOLOv3 network to extract the targets in a video frame;
the image-text semantic perception module comprises an image feature extractor $T_o$, a text feature extractor $T_d$ and a semantic relevance description part; the image-text semantic perception module extracts image features and text features of a video and calculates the semantic correlation between them to ensure the consistency of semantic features;
the heterogeneous feature fusion bidirectional prediction module comprises a forward encoder $E_f$, a backward encoder $E_b$ and a decoder; the heterogeneous feature fusion bidirectional prediction module enhances the extraction of normal features and suppresses the generation of abnormal features;
the temporal attention discrimination module comprises a 3D convolutional neural network, a temporal attention mechanism and a 2D convolutional network, and learns to distinguish the characteristics of pseudo-abnormal and normal timings;
anomaly judgment of the video under test is performed by combining the prediction error, the semantic correlation and the timing information.
2. The method for detecting the abnormal events of the multi-perception video based on the heterogeneous information according to claim 1, wherein the process of extracting the image fine granularity feature and the text feature of the video by the graph-text meaning perception module comprises:
obtaining, through the target detection network, the N targets and their categories in each frame of the video sequence I_1, …, I_M, where M is the length of the video sequence and the value of N is not fixed across frames; the i-th target region of the t-th frame I_t is O_t^i ∈ R^{W×H×C}, where t = 1, 2, …, M, i = 1, 2, …, N, and W, H, and C are the width, height, and number of channels of the target region, respectively; the target region in the video frame is converted to a fixed size and uniformly divided into P sub-blocks of size p×p, where P = (W×H)/(p×p); P is used as the length of the input sequence of the image feature extractor T_o;
each sub-block undergoes feature refinement and is mapped to a space of fixed dimensionality to generate the sub-block embeddings z_x^sub, where x = 1, 2, …, P; a position embedding E_pos is added to retain the relative position information of each sub-block, giving the embedded feature z_0, as in formula (1):
z_0 = [z_1^sub; z_2^sub; …; z_P^sub] + E_pos    (1)
the image feature extractor T_o consists of l stacked identical Transformer blocks, each a serial concatenation of two residual connections; the first residual adds the embedded feature z_{j-1} to the output of layer normalization followed by the multi-head self-attention mechanism, giving the intermediate feature z'_j, as in formula (2):
z'_j = MSA(LN(z_{j-1})) + z_{j-1}    (2)
where LN(·) denotes layer normalization and MSA(·) denotes the multi-head self-attention mechanism;
the second residual adds the intermediate feature z'_j, after layer normalization and multi-layer perception, to the intermediate feature z'_j itself, giving the output feature z_j of the j-th Transformer block, as in formula (3):
z_j = MLP(LN(z'_j)) + z'_j    (3)
where MLP(·) denotes multi-layer perception;
z_1 is then taken as the input of the second Transformer block to output z_2, and so on, until the output feature z_l is obtained after the l stacked blocks; z_l is fed into an independent multi-layer perceptron to output the final target-region feature z'_o;
for each target region O_t^i, its corresponding class is recorded as c_t^i; a mapping V from class labels to texts is established according to the pre-training sample classes of the target detection network, so that each target class label c_t^i obtains the corresponding text d_t^i = V(c_t^i), where d_t^i is a sequence of length 76 delimited by [SOS] and [EOS] tokens; the text is converted into a computer-understandable form by byte-pair encoding while preserving the semantic context, and the text embedding feature z_0^d is obtained by embedding the position information of the text characters; the text embedding feature z_0^d passes through the text feature extractor T_d as in formula (4):
z'_d = T_d(z_0^d)    (4)
where z'_d is the output feature of the text feature extractor T_d; using layer normalization and multi-layer perception, z'_o and z'_d are mapped into a multimodal embedding space, outputting the image feature z_o = MLP(LN(z'_o)) and the text feature z_d = MLP(LN(z'_d)).
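The extractor described above matches the standard pre-norm Vision-Transformer recipe, so formulas (1)–(3) can be realized directly; the sketch below assumes PyTorch, and the dimension choices (dim=768, 12 heads, 16×16 sub-blocks, 196 patches) are illustrative rather than taken from the filing.

```python
# One pre-norm Transformer block as in formulas (2)-(3): z' = MSA(LN(z)) + z; z_out = MLP(LN(z')) + z'.
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 12, mlp_ratio: int = 4):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                 nn.Linear(dim * mlp_ratio, dim))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        h = self.ln1(z)
        z_mid = self.msa(h, h, h, need_weights=False)[0] + z   # formula (2)
        return self.mlp(self.ln2(z_mid)) + z_mid               # formula (3)

# Patch embedding as in formula (1): P sub-blocks of size p x p plus a learnable position embedding.
class PatchEmbed(nn.Module):
    def __init__(self, p: int = 16, in_ch: int = 3, dim: int = 768, num_patches: int = 196):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=p, stride=p)  # per-sub-block feature refinement
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))   # position embedding E_pos

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) with H, W divisible by p and (H/p)*(W/p) == num_patches.
        z = self.proj(x).flatten(2).transpose(1, 2)  # (B, P, dim) sub-block embeddings
        return z + self.pos                          # formula (1)
```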
3. The method according to claim 2, wherein calculating the semantic correlation between the image features and the text features to ensure the consistency of the semantic features comprises:
when training the image-text semantic perception module, the semantic association objective function L_sem(z_o, z_d) is as shown in formula (5):
L_sem(z_o, z_d) = Σ_{t,i} ( ||z_o^{t,i} − z_d^{t,i}||_2 + 1 − sim[z_o^{t,i}, z_d^{t,i}] )    (5)
where z_o^{t,i} and z_d^{t,i} respectively denote the image and text features of target O_t^i, and sim[·] denotes cosine similarity; minimizing the semantic association objective function L_sem(z_o, z_d) makes the image-feature and text-feature vectors close to each other in absolute distance while constraining them to the same direction;
when performing anomaly detection on the video under test, the semantic correlation between the global image feature z_o(I_t) and the local text features z_d^{t,i} is calculated; the image features and text features are expressed as posterior probability vectors relative to a group of semantic concepts; when an anomaly occurs in the video under test, the correlation between the global image feature and the local text features is weak, and their interpretations of the semantic space diverge.
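Formula (5) is reproduced only as an image in the filing; the sketch below therefore assumes one reading consistent with the stated behavior — an L2 term for absolute distance plus a (1 − cosine) term for direction — and should be taken as illustrative.

```python
# Hedged reading of the semantic association objective: the claim only states that minimizing it
# pulls image and text features together in distance and direction, so an L2 + (1 - cosine)
# form is assumed here for illustration.
import torch
import torch.nn.functional as F

def semantic_association_loss(z_o: torch.Tensor, z_d: torch.Tensor) -> torch.Tensor:
    """z_o, z_d: (N, D) image and text features of the same N targets."""
    distance = (z_o - z_d).norm(dim=1)                       # absolute distance between the vectors
    direction = 1.0 - F.cosine_similarity(z_o, z_d, dim=1)   # zero when the vectors share a direction
    return (distance + direction).mean()

# At test time the semantic correlation itself is the score (its minimum is used in formula (12)):
def semantic_correlation(z_global: torch.Tensor, z_texts: torch.Tensor) -> torch.Tensor:
    """z_global: (D,) global image feature; z_texts: (N, D) local text features."""
    return F.cosine_similarity(z_global.unsqueeze(0), z_texts, dim=1)  # (N,) cosine scores
```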
4. The method according to claim 3, wherein the heterogeneous-feature-fusion bidirectional prediction module enhancing the extraction of normal features and suppressing the generation of abnormal features comprises:
extracting the forward coding features and backward coding features using a bidirectional 3D encoder, wherein the bidirectional 3D encoder consists of the forward encoder E_f and the backward encoder E_b, which have the same structure and network parameters;
according to each target region, the L frames before and the L frames after the corresponding position are taken to form a forward target sequence S_f^{t,i} and a backward target sequence S_b^{t,i} that exclude the target in frame t; these are input respectively into the forward encoder E_f and the backward encoder E_b to obtain the forward coding feature z_f and the backward coding feature z_b, as in formulas (6) and (7):
z_f = E_f(S_f^{t,i})    (6)
z_b = E_b(S_b^{t,i})    (7)
the image feature z_o, text feature z_d, forward coding feature z_f, and backward coding feature z_b are concatenated to obtain the heterogeneous fusion feature z = concat[z_f, z_b, z_o, z_d];
the decoder predicts the target region based on the heterogeneous fusion feature; expressing the acquired features in 2D format, it predicts the intermediate target region Ô_t^i;
the difference between the predicted target region and the real target region is expressed as the objective function L_pre, as in formula (8):
L_pre = (1/(W×H)) ||O_t^i − Ô_t^i||_2^2    (8)
where O_t^i and Ô_t^i are the real target region and the predicted target region, respectively, and W and H are the width and height of the target region;
during training of the heterogeneous-feature-fusion bidirectional prediction module, only normal samples are used, and the encoder and decoder minimize the objective function L_pre; the heterogeneous fusion feature thus contains the content information of normal samples, so the decoder can predict normal samples, whereas for abnormal samples the decoder cannot predict the abnormal target.
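A compact sketch of the bidirectional prediction step, assuming PyTorch; reusing a single encoder instance for both directions is one way to honor the shared structure and parameters of E_f and E_b, and the mean-squared-error form of the loss is an assumption where formula (8) is filed as an image.

```python
# Bidirectional prediction with heterogeneous feature fusion (illustrative shapes; one encoder
# instance serves as both E_f and E_b so structure and weights are shared by construction).
import torch
import torch.nn as nn

class BiPredictor(nn.Module):
    def __init__(self, encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder   # 3D encoder used as both E_f and E_b
        self.decoder = decoder

    def forward(self, seq_fwd, seq_bwd, z_o, z_d):
        z_f = self.encoder(seq_fwd)                 # formula (6)
        z_b = self.encoder(seq_bwd)                 # formula (7)
        z = torch.cat([z_f, z_b, z_o, z_d], dim=1)  # heterogeneous fusion feature
        return self.decoder(z)                      # predicted intermediate target region

def prediction_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # L2 prediction error between predicted and real target regions, averaged over the region
    # (the exact normalization of formula (8) is an assumption).
    return torch.mean((pred - target) ** 2)
```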
5. The method for detecting multi-perception video abnormal events based on homologous heterogeneous information according to claim 4, wherein the generation process of the pseudo-abnormal sequence is: according to the target region O_t^i and the L frames before and after it, a continuous normal sequence S_n = {O_{t−L}^i, …, O_t^i, …, O_{t+L}^i} is generated and marked as y_n(S_n) = 0; by random jumps, a pseudo-abnormal sequence S_a of the same length as S_n is generated and marked as y_a(S_a) = 1, where a is a random number and S_a is the pseudo-abnormal time sequence.
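The random-jump construction can be sketched in a few lines; the stride range (2 to 4) and the bounds handling below are assumptions, as the claim fixes only the sequence length 2L+1 and the labels.

```python
# Pseudo-abnormal sequence by random jump: sample a random stride a > 1 and take every a-th
# target region, yielding a sequence of the same length 2L+1 as the normal one (sketch only).
import random

def make_sequences(regions, t: int, L: int):
    """regions: list of per-frame target regions at the tracked position; t: center index.
    Assumes t - L*a >= 0 and t + L*a < len(regions) for the sampled stride a."""
    normal = regions[t - L : t + L + 1]                          # S_n, label y_n(S_n) = 0
    a = random.randint(2, 4)                                     # random jump size (assumed range)
    start = t - L * a
    pseudo = [regions[start + k * a] for k in range(2 * L + 1)]  # S_a, label y_a(S_a) = 1
    return (normal, 0), (pseudo, 1)
```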
6. The method for detecting multi-perception video abnormal events based on homologous heterogeneous information according to claim 5, wherein the process by which the time-sequence attention discrimination module learns and distinguishes the features of the pseudo-abnormal time sequence and the normal time sequence comprises: given the normal time sequence S_n ∈ R^{(2L+1)×C×W×H} and the pseudo-abnormal time sequence S_a ∈ R^{(2L+1)×C×W×H}, where 2L+1 is the sequence length, C is the number of channels, and W×H is the size of the target region, S_n and S_a are input into a 3D convolutional neural network to extract the time-sequence feature z'_t;
the attention scores of each target region of S_n and S_a are calculated separately using the time-sequence attention mechanism; the mechanism employs 3D average pooling and 3D max pooling to complete the compression operation in the time dimension; two fully connected layers after the 3D average pooling and the 3D max pooling then obtain the respective scaling factors and the final attention score, and finally each time step is scaled according to its attention score to complete the recalibration z_t, as in formula (9):
z_t = z'_t · δ( fc(maxp_3D(z'_t); θ) + fc(avgp_3D(z'_t); θ) )    (9)
where maxp_3D(·) denotes 3D max pooling, avgp_3D(·) denotes 3D average pooling, fc(·; θ) denotes a fully connected layer with parameters θ, and δ denotes the Sigmoid activation function;
the 2D convolutional network performs nonlinear processing on the time sequence; it mainly consists of convolution and full connection, with batch normalization, ReLU activation, and dropout applied after the convolution; the fully connected layer is followed by a softmax function that outputs the anomaly discrimination probability ŷ(S_k), where S_k ∈ {S_n, S_a}; cross entropy is used as the objective function L(S_k), as in formula (10):
L(S_k) = −[ y_k(S_k) log ŷ(S_k) + (1 − y_k(S_k)) log(1 − ŷ(S_k)) ]    (10)
where the anomaly discrimination probability ŷ(S_k) = softmax(fc_1(conv(z_t))), softmax(·) denotes the softmax function, conv(·) denotes convolution, and fc_1(·) denotes a fully connected operation; when the time sequence under test is abnormal, k = a and y_k(S_k) = 1, otherwise 0.
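Formula (9) resembles a squeeze-and-excitation gate over the time axis; the sketch below assumes PyTorch, pools channels and space down to one value per time step, and shares the fully connected layers (parameters θ) between the max- and average-pooled branches, which is one reading of the claim.

```python
# Time-sequence attention recalibration of formula (9):
# z_t = z'_t * sigmoid(fc(maxpool3d(z'_t); theta) + fc(avgpool3d(z'_t); theta)).
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    def __init__(self, T: int):
        super().__init__()
        # Two fully connected layers with parameters theta; the bottleneck T//2 is illustrative.
        self.fc = nn.Sequential(nn.Linear(T, T // 2), nn.ReLU(), nn.Linear(T // 2, T))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (B, C, T, H, W) time-sequence feature z'_t from the 3D CNN.
        squeeze_max = z.amax(dim=(1, 3, 4))   # 3D max pooling to one value per time step: (B, T)
        squeeze_avg = z.mean(dim=(1, 3, 4))   # 3D average pooling: (B, T)
        score = torch.sigmoid(self.fc(squeeze_max) + self.fc(squeeze_avg))  # per-step attention
        return z * score.view(z.size(0), 1, -1, 1, 1)  # rescale each time step: formula (9)
```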
7. The method for detecting multi-perception video abnormal events based on homologous heterogeneous information according to claim 6, wherein performing anomaly judgment on the video under test by combining the prediction error, the semantic correlation, and the time-sequence information specifically comprises:
spatially, the prediction error between the real target region and the predicted target region is used as the spatial anomaly score of each target's region; for any frame I_t containing N targets, the maximum over all target anomaly scores in the frame is selected as the spatial anomaly score S_spa(I_t) of the frame, as in formula (11):
S_spa(I_t) = max_{i=1,…,N} ||O_t^i − Ô_t^i||_2    (11)
where O_t^i denotes the real target region, Ô_t^i denotes the predicted target region, and ||·||_2 denotes the L_2 norm;
in content, the i-th target region O_t^i of frame I_t has the corresponding text d_t^i; the semantic correlation between the global image feature z_o(I_t) and the local text feature z_d^{t,i} is used as the local anomaly score, and the minimum semantic correlation is selected as the global semantic anomaly score S_sem(I_t), as in formula (12):
S_sem(I_t) = min_{i=1,…,N} sim[z_o(I_t), z_d^{t,i}]    (12)
where sim[·] denotes cosine similarity; the higher the probability that a target anomaly appears in the frame, the smaller the semantic correlation;
in time sequence, the target O_t^i corresponds to the time sequence under test S^{t,i}; the anomaly discrimination probability output by the time-sequence discrimination module is adopted as the time-sequence anomaly score of the target O_t^i; over all targets, the maximum anomaly discrimination probability is selected as the time-sequence anomaly score S_tem(I_t) of the frame, as in formula (13):
S_tem(I_t) = max_{i=1,…,N} ŷ(S^{t,i})    (13)
where ŷ(S^{t,i}) is the predicted probability for the target sequence S^{t,i};
the spatial anomaly score, the semantic anomaly score, and the time-sequence anomaly score are linearly combined to obtain the final anomaly score S(I_t), as in formula (14):
S(I_t) = S_spa(I_t) + α·(1 − S_sem(I_t)) + β·S_tem(I_t)    (14)
where α and β are the semantic coefficient and the time-sequence coefficient, respectively; the larger the value of the anomaly score S(I_t), the greater the probability that an anomaly occurs in the frame.
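The fusion of formula (14) itself reduces to one line; α and β below default to 1.0 purely for illustration, since the claim leaves them as free coefficients.

```python
# Frame-level anomaly score of formula (14): spatial error, inverted semantic correlation,
# and temporal discrimination probability, linearly combined.
def anomaly_score(s_spa: float, s_sem: float, s_tem: float,
                  alpha: float = 1.0, beta: float = 1.0) -> float:
    return s_spa + alpha * (1.0 - s_sem) + beta * s_tem

# Example: high prediction error, weak image-text correlation, and a confident temporal
# discriminator together yield a high score.
print(anomaly_score(s_spa=0.8, s_sem=0.2, s_tem=0.9))  # 0.8 + 0.8 + 0.9 = 2.5
```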
8. The method for detecting multi-perception video abnormal events based on homologous heterogeneous information according to claim 4, wherein:
the bidirectional 3D encoder consists of six 3D convolutional layers with convolution kernels of size 3×3×3; each convolution is followed by batch normalization and ReLU activation; no pooling is used after the first and third convolutional layers, while the other layers adopt 3D max pooling with a pooling size of 1×2×2 and a stride of 1×2×2;
the decoder is realized by four upsampling operations; after each upsampling, a 2D convolutional layer with a kernel size of 3×3 performs feature expression, and each convolution is followed by batch normalization and ReLU activation.
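Claim 8 pins down the layer pattern but not the channel widths; the sketch below assumes PyTorch and illustrative widths, and omits the final projection back to image channels.

```python
# 3D encoder / 2D decoder per claim 8: six 3x3x3 conv layers with BN + ReLU, 1x2x2 max pooling
# after every layer except the first and third; decoder with four 2x upsamplings, each followed
# by a 3x3 2D conv, BN, and ReLU (channel widths are illustrative, not from the filing).
import torch.nn as nn

def encoder3d(widths=(3, 32, 64, 64, 128, 128, 256)) -> nn.Sequential:
    layers = []
    for i in range(6):
        layers += [nn.Conv3d(widths[i], widths[i + 1], 3, padding=1),
                   nn.BatchNorm3d(widths[i + 1]), nn.ReLU(inplace=True)]
        if i not in (0, 2):  # no pooling after the first and third conv layers
            layers.append(nn.MaxPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2)))
    return nn.Sequential(*layers)

def decoder2d(widths=(256, 128, 64, 32, 16)) -> nn.Sequential:
    layers = []
    for i in range(4):
        layers += [nn.Upsample(scale_factor=2),
                   nn.Conv2d(widths[i], widths[i + 1], 3, padding=1),
                   nn.BatchNorm2d(widths[i + 1]), nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)
```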
9. An electronic device comprising a memory, a processor, and program instructions stored in the memory, wherein the program instructions, when executed by the processor, perform the steps of the method of any one of claims 1-8.
CN202211484883.9A 2022-11-24 2022-11-24 Multi-perception video abnormal event detection method and device based on homologous heterogeneous information Withdrawn CN115797830A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211484883.9A CN115797830A (en) 2022-11-24 2022-11-24 Multi-perception video abnormal event detection method and device based on homologous heterogeneous information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211484883.9A CN115797830A (en) 2022-11-24 2022-11-24 Multi-perception video abnormal event detection method and device based on homologous heterogeneous information

Publications (1)

Publication Number Publication Date
CN115797830A true CN115797830A (en) 2023-03-14

Family

ID=85441197

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211484883.9A Withdrawn CN115797830A (en) 2022-11-24 2022-11-24 Multi-perception video abnormal event detection method and device based on homologous heterogeneous information

Country Status (1)

Country Link
CN (1) CN115797830A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116257142A (en) * 2023-05-12 2023-06-13 福建省亿鑫海信息科技有限公司 Security monitoring method and terminal based on multi-mode data characterization
CN116257142B (en) * 2023-05-12 2023-07-21 福建省亿鑫海信息科技有限公司 Security monitoring method and terminal based on multi-mode data characterization
CN116506216A (en) * 2023-06-19 2023-07-28 国网上海能源互联网研究院有限公司 Lightweight malicious flow detection and evidence-storage method, device, equipment and medium
CN116506216B (en) * 2023-06-19 2023-09-12 国网上海能源互联网研究院有限公司 Lightweight malicious flow detection and evidence-storage method, device, equipment and medium
CN116886991A (en) * 2023-08-21 2023-10-13 珠海嘉立信发展有限公司 Method, apparatus, terminal device and readable storage medium for generating video data
CN116886991B (en) * 2023-08-21 2024-05-03 珠海嘉立信发展有限公司 Method, apparatus, terminal device and readable storage medium for generating video data

Similar Documents

Publication Publication Date Title
CN115797830A (en) Multi-perception video abnormal event detection method and device based on homologous heterogeneous information
Chen et al. Dcan: improving temporal action detection via dual context aggregation
Ge et al. Deepfake video detection via predictive representation learning
Antwi-Bekoe et al. A deep learning approach for insulator instance segmentation and defect detection
Wang et al. Afp-mask: Anchor-free polyp instance segmentation in colonoscopy
CN113705490B (en) Anomaly detection method based on reconstruction and prediction
CN111860248B (en) Visual target tracking method based on twin gradual attention-guided fusion network
CN113947702A (en) Multi-modal emotion recognition method and system based on context awareness
CN114170184A (en) Product image anomaly detection method and device based on embedded feature vector
CN114639042A (en) Video target detection algorithm based on improved CenterNet backbone network
CN114511502A (en) Gastrointestinal endoscope image polyp detection system based on artificial intelligence, terminal and storage medium
CN116542921A (en) Colon polyp segmentation method, device and storage medium
CN115412324A (en) Air-space-ground network intrusion detection method based on multi-mode conditional countermeasure field adaptation
Peng et al. An adaptive coarse-fine semantic segmentation method for the attachment recognition on marine current turbines
CN112967227B (en) Automatic diabetic retinopathy evaluation system based on focus perception modeling
Lee et al. Latent-ofer: Detect, mask, and reconstruct with latent vectors for occluded facial expression recognition
CN115690665B (en) Video anomaly detection method and device based on cross U-Net network
Singh et al. Attention-guided generator with dual discriminator GAN for real-time video anomaly detection
CN115953663A (en) Weak supervision shadow detection method using line marking
Wang et al. Scene uyghur recognition with embedded coordinate attention
CN115359511A (en) Pig abnormal behavior detection method
Shrestha et al. Magnet: Multi-region attention-assisted grounding of natural language queries at phrase level
CN112418205A (en) Interactive image segmentation method and system based on focusing on wrongly segmented areas
Zhai et al. Spike-based optical flow estimation via contrastive learning
Li et al. Cross-modality integration framework with prediction, perception and discrimination for video anomaly detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20230314