CN115797830A - Multi-perception video abnormal event detection method and device based on homologous heterogeneous information - Google Patents

Multi-perception video abnormal event detection method and device based on homologous heterogeneous information

Info

Publication number: CN115797830A
Application number: CN202211484883.9A
Authority: CN (China)
Prior art keywords: abnormal, features, text, semantic, target
Legal status: Withdrawn (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Inventors: 李洪均 (Li Hongjun), 李超波 (Li Chaobo), 章国安 (Zhang Guoan)
Current assignee: Nantong University (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Original assignee: Nantong University
Application filed by Nantong University; priority to CN202211484883.9A

Landscapes

  • Image Analysis (AREA)
Abstract

The invention provides a multi-perception video abnormal event detection method and device based on homologous heterogeneous information. The method comprises a target detection network, an image-text semantic perception module, a heterogeneous feature fusion bidirectional prediction module and a temporal attention discrimination module. The image-text semantic perception module semantically associates image features with text features to ensure semantic consistency; the heterogeneous feature fusion bidirectional prediction module introduces the idea of heterogeneous feature fusion to combine temporal, spatial and content information, enhancing the extraction of normal features and suppressing the generation of abnormal features, thereby improving the capture of discriminative features in anomaly detection; the temporal attention discrimination module attends to the timing relation between frames and learns to distinguish the characteristics of pseudo-abnormal and normal timings, improving the network's ability to detect anomalies from temporal features.

Description

Multi-perception video abnormal event detection method and device based on homologous heterogeneous information
Technical Field
The invention belongs to the field of intelligent video processing, and particularly relates to a multi-perception video abnormal event detection method and device based on homologous heterogeneous information.
Background
Video surveillance is an important information aid in the field of public safety. As the coverage of surveillance equipment grows, video security prevention and control faces rising labor costs while efficiency is hard to guarantee.
The essence of anomaly detection is to understand and distinguish the inherent differences between normal and abnormal events. Because abnormal events occur at low frequency, samples are scarce and labeling is difficult, some researchers have explored anomaly detection methods based on frame prediction, which train only on normal samples and judge anomalies by the difference between an input frame and its predicted frame, thereby avoiding an explicit definition of anomaly: Liu et al. proposed an anomaly detection framework based on future frame prediction that uses optical flow to represent motion information; Li et al. proposed a two-branch prediction model that takes the consecutive frames before and after the target frame as input; Lee et al. proposed a multi-scale aggregation network to take the contextual information of anomalous events into account. These methods attend to context information from two directions, but they obtain the preceding and following information through two separate branches, which makes the two kinds of information independent of each other and in turn degrades detection performance.
Furthermore, most frame-prediction-based anomaly detection methods achieve excellent performance under unsupervised learning, yet they ignore the essential differences between future frame prediction and anomaly detection. On the one hand, frame prediction focuses on prediction quality via the context of the target frame and neglects the target frame itself, whereas anomaly detection must focus on the discriminative features between normal and abnormal carried by the target frame. On the other hand, during frame prediction the model extracts high-level features but lacks perception of the content in an image frame, whereas abnormal event detection usually needs to determine the content attributes of a target so as to avoid, as far as possible, interference of uncertain factors with the discriminative features.
To learn the distinguishing features more clearly, prior work has focused on anomaly detection for the local targets present in a scene: Ionescu et al. introduced an object-centric convolutional auto-encoder and a one-versus-rest classifier to separate normal targets from abnormal ones; Georgescu et al. proposed a background-agnostic network that depends on the objects that may cause anomalies. These methods break through the limitation that the foreground target occupies only a small proportion of the whole image, and capture the key information in the frame from the refined features of local regions. However, learning only local features at test time ignores the relevance of the target within the global scene and the influence of the local target on the whole.
To attend to this integrity, Huang et al. proposed a global attribute restoration network that deletes certain attributes and learns semantic feature embeddings for anomaly detection; Lv et al. proposed a high-order context encoding model to extract semantic representations and measure dynamic changes. These methods attend to global high-level semantic information and show better performance in anomaly detection.
However, visual semantic information, as a basic unit, is only part of the semantic expression and lacks the important correlations among semantic units; moreover, a single visual modality has insufficient semantic expressiveness, and matching the consistency of semantic content in the correlation is difficult, which impairs semantic perception. In addition, the refinement of fine-grained features is weak, so the expressed semantics are insufficient for anomaly detection, which affects detection accuracy.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides a multi-perception video abnormal event detection method and device based on homologous heterogeneous information, which improve the perception of discriminative features between normal and abnormal from multiple aspects, so as to solve the problems that content information in a video frame cannot be acquired during anomaly detection, that the semantic association between local targets and the whole is lost, and that inter-frame timing differences are ignored, thereby improving anomaly detection accuracy.
The purpose of the invention is realized by the following technical scheme:
a multi-perception video abnormal event detection method based on homologous heterogeneous information is characterized by comprising the following steps: the system comprises a target detection network, a graph-text semantic perception module, a bidirectional prediction module with heterogeneous feature fusion and a time sequence attention distinguishing module, wherein the target detection network adopts a YoloV3 network to extract a target in a video frame; the image-text meaning perception module comprises an image feature extractor T o Text feature extractor T d The image-text semantic perception module extracts image features and text features of the video and calculates semantic correlation between the image features and the text features to ensure the consistency of the semantic features; the heterogeneous feature fusion bidirectional prediction module comprises a forward encoder, a backward encoder and a decoder, and enhances the extraction of normal features and inhibits the generation of abnormal features; the time sequence attention distinguishing module comprises a 3D convolution neural network, a time sequence attention mechanism and a 2D convolution network, and learns and distinguishes the characteristics of the false abnormal time sequence and the normal time sequence.
And carrying out abnormity judgment on the video to be detected through the combination of the prediction error, the semantic correlation and the time sequence information.
Preferably, the process by which the image-text semantic perception module extracts fine-grained image features and text features of the video comprises:
A video sequence $I_1,\ldots,I_M$ is passed through the target detection network to obtain the $N$ targets in each frame and their categories, where $M$ is the length of the video sequence and the value of $N$ is not fixed across frames. The $i$-th target region of the $t$-th frame $I_t$ is denoted $O_t^i \in \mathbb{R}^{W \times H \times C}$, where $t = 1,2,\ldots,M$, $i = 1,2,\ldots,N$, and $W$, $H$ and $C$ are the width, height and number of channels of the target region. The target regions in the video frame are converted to a fixed size and uniformly divided into $P$ sub-blocks of size $p \times p$, where $P = \frac{W \times H}{p^2}$; $P$ is taken as the length of the input sequence of the image feature extractor $T_o$.
Each sub-block is refined and mapped to a space of fixed dimension, generating the patch embedding $e_x$, where $x = 1,2,\ldots,P$; a position embedding $e_x^{pos}$ is added to retain the relative position information of each sub-block, giving the embedded feature $z_0$ of formula (1):
$z_0 = \left[e_1 + e_1^{pos};\ e_2 + e_2^{pos};\ \ldots;\ e_P + e_P^{pos}\right]$ (1)
The image feature extractor $T_o$ consists of $l$ stacked identical Transformer blocks, each a serial concatenation of two residual connections. The first residual combines the embedded feature $z_0$ with the output of layer normalization and multi-head self-attention to obtain the intermediate feature $\tilde{z}_1$, as shown in formula (2):
$\tilde{z}_1 = \mathrm{MSA}\!\left(\mathrm{LN}(z_0)\right) + z_0$ (2)
where $\mathrm{LN}(\cdot)$ denotes layer normalization and $\mathrm{MSA}(\cdot)$ denotes the multi-head self-attention mechanism.
The second residual adds the intermediate feature $\tilde{z}_1$, after layer normalization and multi-layer perception, to the intermediate feature itself, giving the output feature $z_1$ of the first Transformer block, as shown in formula (3):
$z_1 = \mathrm{MLP}\!\left(\mathrm{LN}(\tilde{z}_1)\right) + \tilde{z}_1$ (3)
where $\mathrm{MLP}(\cdot)$ denotes multi-layer perception.
$z_1$ is then taken as the input of the second Transformer block, which outputs $z_2$; by analogy, the output feature $z_l$ is obtained after the $l$ stacked blocks. $z_l$ is fed into an independent multi-layer perceptron, which outputs the final target region feature $z'_o$.
For each target region $O_t^i$, its corresponding category is recorded as $c_t^i$. A mapping $V$ from category labels to texts is established according to the pre-training sample categories of the target detection network, so that each target category label $c_t^i$ obtains a corresponding text $w_t^i = V(c_t^i)$, where $w_t^i$ is a sequence of length 76 delimited by [SOS] and [EOS] tokens. The text is converted into a computer-understandable form by byte-pair encoding, and the semantic context is preserved by embedding the position information of the text characters, giving the text embedding feature $e_d$.
The text embedding feature $e_d$ is passed through the text feature extractor $T_d$, as shown in formula (4):
$z'_d = T_d(e_d)$ (4)
where $z'_d$ is the output feature of the text feature extractor $T_d$. Using layer normalization and multi-layer perception, $z'_o$ and $z'_d$ are mapped into a multimodal embedding space, producing the image feature $z_o = \mathrm{MLP}(\mathrm{LN}(z'_o))$ and the text feature $z_d = \mathrm{MLP}(\mathrm{LN}(z'_d))$.
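For concreteness, the following PyTorch sketch mirrors the extractor $T_o$ of formulas (1)-(3): a convolutional patch embedding plus positional embedding, $l$ stacked Transformer blocks, and an independent MLP head. The hyperparameters (embedding dimension 512, 8 heads, depth 6, 16×16 patches) and the mean-pooling before the head are illustrative assumptions, not values fixed by this disclosure.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One block of T_o: the two residual branches of formulas (2) and (3)."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, z):
        h = self.ln1(z)
        z_mid = self.msa(h, h, h)[0] + z              # formula (2)
        return self.mlp(self.ln2(z_mid)) + z_mid      # formula (3)

class ImageFeatureExtractor(nn.Module):
    """T_o: patch + position embedding (formula (1)), l blocks, MLP head."""
    def __init__(self, patch=16, chans=3, dim=512, depth=6, num_patches=196):
        super().__init__()
        self.embed = nn.Conv2d(chans, dim, kernel_size=patch, stride=patch)  # e_x
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))            # e_x^pos
        self.blocks = nn.Sequential(*[TransformerBlock(dim, 8) for _ in range(depth)])
        self.head = nn.Linear(dim, dim)   # independent MLP producing z'_o

    def forward(self, region):            # region: (B, C, H, W), fixed size
        z0 = self.embed(region).flatten(2).transpose(1, 2) + self.pos  # formula (1)
        zl = self.blocks(z0)
        return self.head(zl.mean(dim=1))  # pooled z_l -> target region feature z'_o
```

A fixed region size of 224 × 224 with 16 × 16 patches gives the assumed num_patches = 196 = P.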
Preferably, the process of calculating the semantic correlation between the image features and the text features to ensure the consistency of semantic features comprises:
When the image-text semantic perception module is trained, the semantic association objective function $L_{sem}(z_o, z_d)$ is shown in formula (5):
$L_{sem}(z_o, z_d) = \frac{1}{N}\sum_{i=1}^{N}\left(\left\|z_o^i - z_d^i\right\|_2 + 1 - \mathrm{sim}\!\left[z_o^i, z_d^i\right]\right)$ (5)
where $z_o^i$ and $z_d^i$ respectively denote the image and text features of target $O_t^i$, and $\mathrm{sim}[\cdot]$ denotes cosine similarity. Minimizing the semantic association objective $L_{sem}(z_o, z_d)$ draws the image feature and text feature vectors close in absolute distance while constraining them to the same direction.
When anomaly detection is performed on the video under test, the semantic correlation $\mathrm{sim}\!\left[z_o(I_t), z_d(w_t^i)\right]$ between the global image feature and the local text features is calculated. The image features and text features are expressed as posterior probability vectors relative to a set of semantic concepts; when an anomaly occurs in the video under test, the correlation between the global image feature and the local text features is weak, and their interpretations differ in the semantic space.
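As a minimal sketch of the training objective of formula (5), assuming the hedged reading above (a distance term plus a direction term, averaged over the N targets of a frame):

```python
import torch
import torch.nn.functional as F

def semantic_association_loss(z_o: torch.Tensor, z_d: torch.Tensor) -> torch.Tensor:
    """L_sem of formula (5): pull paired image/text features together in
    absolute distance while constraining them to the same direction."""
    dist = (z_o - z_d).norm(dim=1)                  # || z_o^i - z_d^i ||_2
    cos = F.cosine_similarity(z_o, z_d, dim=1)      # sim[z_o^i, z_d^i]
    return (dist + 1.0 - cos).mean()                # average over the N targets

# usage: (N, D) image and text features of the N targets in a frame
loss = semantic_association_loss(torch.randn(8, 512), torch.randn(8, 512))
```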
Preferably, the process by which the heterogeneous feature fusion bidirectional prediction module enhances the extraction of normal features and suppresses the generation of abnormal features comprises:
Forward coding features and backward coding features are extracted with a bidirectional 3D encoder composed of a forward encoder $E_f$ and a backward encoder $E_b$; $E_f$ and $E_b$ have the same structure and network parameters.
For each target region, the $L$ frames before and the $L$ frames after the corresponding position are taken to form a forward target sequence $S_f^i = \{O_{t-L}^i, \ldots, O_{t-1}^i\}$ and a backward target sequence $S_b^i = \{O_{t+1}^i, \ldots, O_{t+L}^i\}$, neither of which contains the target of the $t$-th frame. They are input to the forward encoder $E_f$ and the backward encoder $E_b$ respectively, giving the forward coding feature $z_f$ and the backward coding feature $z_b$, as shown in formulas (6) and (7):
$z_f = E_f(S_f^i)$ (6)
$z_b = E_b(S_b^i)$ (7)
The image feature $z_o$, text feature $z_d$, forward coding feature $z_f$ and backward coding feature $z_b$ are concatenated to obtain the heterogeneous fusion feature $z = \mathrm{concat}[z_f, z_b, z_o, z_d]$.
The decoder $D$ predicts the target region from the heterogeneous fusion feature, expressing the acquired features in 2D form to predict the intermediate target region $\hat{O}_t^i$:
$\hat{O}_t^i = D(z)$
The difference between the predicted target region and the real target region is expressed as the objective function $L_O(O_t^i, \hat{O}_t^i)$, as shown in formula (8):
$L_O(O_t^i, \hat{O}_t^i) = \frac{1}{W \times H}\left\|O_t^i - \hat{O}_t^i\right\|_2^2$ (8)
where $O_t^i$ and $\hat{O}_t^i$ are the real target region and the predicted target region respectively, and $W$ and $H$ denote the width and height of the target region.
During training of the heterogeneous feature fusion bidirectional prediction module, the encoders and decoder minimize the objective function $L_O$ using only normal samples, so the heterogeneous fusion feature contains the content information of normal samples: the decoder can predict normal samples, but for abnormal samples it cannot predict the abnormal target.
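A sketch of the fusion-and-predict step under stated assumptions: the encoder and decoder internals are injected placeholders (a concrete 3D-convolutional layout is given in a later preferred embodiment), the forward and backward encoders share one module instance because the disclosure gives them identical structure and parameters, and the projection of the concatenated feature into the decoder's 2D input is an assumed design choice.

```python
import torch
import torch.nn as nn

class BidirectionalPredictor(nn.Module):
    """Heterogeneous-feature-fusion bidirectional prediction, formulas (6)-(8)."""
    def __init__(self, encoder_3d: nn.Module, proj: nn.Module, decoder_2d: nn.Module):
        super().__init__()
        self.E_f = encoder_3d      # forward encoder E_f
        self.E_b = encoder_3d      # E_b shares structure and parameters with E_f
        self.proj = proj           # maps the concat feature to the decoder's 2D input
        self.decoder = decoder_2d

    def forward(self, S_f, S_b, z_o, z_d):
        z_f = self.E_f(S_f).flatten(1)              # formula (6)
        z_b = self.E_b(S_b).flatten(1)              # formula (7)
        z = torch.cat([z_f, z_b, z_o, z_d], dim=1)  # heterogeneous fusion feature
        return self.decoder(self.proj(z))           # predicted region \hat{O}_t^i

def prediction_loss(O: torch.Tensor, O_hat: torch.Tensor) -> torch.Tensor:
    """L_O of formula (8): prediction error averaged over the W x H region."""
    return ((O - O_hat) ** 2).mean()
```

Training minimizes prediction_loss on normal samples only, which is what lets a large prediction error flag an abnormal target at test time.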
Preferably, the pseudo-abnormal sequence is generated as follows: from the target region $O_t^i$ and the $L$ frames before and after it, a continuous normal sequence $S_n = \{O_{t-L}^i, \ldots, O_t^i, \ldots, O_{t+L}^i\}$ is generated and labeled $y_n(S_n) = 0$; a pseudo-abnormal sequence $S_a$ of the same length as $S_n$ is generated by random jumps and labeled $y_a(S_a) = 1$, where $a$ is a random number and $S_a$ is the pseudo-abnormal timing.
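A minimal sketch of this sampling (the jump distribution is an assumption; any sampling that breaks inter-frame continuity fits the description):

```python
import random

def normal_sequence(regions, t, L):
    """Continuous normal timing S_n around frame t; label y_n(S_n) = 0."""
    return [regions[j] for j in range(t - L, t + L + 1)], 0

def pseudo_abnormal_sequence(regions, L):
    """S_a of the same length 2L+1 built by random jumps; label y_a(S_a) = 1."""
    idx = sorted(random.sample(range(len(regions)), 2 * L + 1))  # non-contiguous frames
    return [regions[j] for j in idx], 1
```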
Preferably, the process by which the temporal attention discrimination module learns and distinguishes the characteristics of pseudo-abnormal and normal timings comprises: given a normal timing $S_n \in \mathbb{R}^{(2L+1) \times C \times W \times H}$ and a pseudo-abnormal timing $S_a \in \mathbb{R}^{(2L+1) \times C \times W \times H}$, where $2L+1$ is the sequence length, $C$ the number of channels, and $W \times H$ the size of the target region, $S_n$ and $S_a$ are input to the 3D convolutional neural network of the temporal attention discrimination module to extract the timing feature $z'_t$.
The attention score of each target region of $S_n$ and $S_a$ is calculated separately with the temporal attention mechanism. The mechanism uses 3D average pooling and 3D max pooling to complete the compression operation in the time dimension; after the 3D average pooling and 3D max pooling, two fully connected layers are used to obtain different scaling factors, which are combined into the final attention score; finally, each time step is rescaled by the attention score to complete the recalibration $z_t$, as shown in formula (9):
$z_t = z'_t \cdot \delta\!\left(fc\!\left(\mathrm{maxp}_{3D}(z'_t); \theta\right) + fc\!\left(\mathrm{avgp}_{3D}(z'_t); \theta\right)\right)$ (9)
where $\mathrm{maxp}_{3D}(\cdot)$ denotes 3D max pooling, $\mathrm{avgp}_{3D}(\cdot)$ denotes 3D average pooling, $fc(\cdot\,; \theta)$ denotes a fully connected layer with parameters $\theta$, and $\delta$ denotes the Sigmoid activation function.
The timing is then processed nonlinearly by the 2D convolutional network of the temporal attention discrimination module. The 2D convolutional network mainly consists of convolutional layers and full connections, with batch normalization, ReLU activation and dropout used after the convolutional layers; the fully connected layer is followed by a softmax function that outputs the anomaly discrimination probability $\hat{y}(S_k)$, where $S_k = \{S_n, S_a\}$. Cross entropy is used as the objective function $L(S_k)$, as shown in formula (10):
$L(S_k) = -\left[y_k(S_k)\log \hat{y}(S_k) + \left(1 - y_k(S_k)\right)\log\left(1 - \hat{y}(S_k)\right)\right]$ (10)
where the anomaly discrimination probability $\hat{y}(S_k) = \mathrm{softmax}\!\left(fc_1\!\left(\mathrm{conv}(z_t)\right)\right)$, $\mathrm{softmax}(\cdot)$ denotes the softmax function, $\mathrm{conv}(\cdot)$ denotes convolution, and $fc_1(\cdot)$ denotes a fully connected operation. When the timing under test is abnormal, $k = a$ and $y_k(S_k) = 1$; otherwise it is 0.
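Formula (9) is a squeeze-and-excitation-style gate over the time dimension; the sketch below assumes that the 3D poolings compress the channel and spatial dimensions so that one score remains per time step, and that the two fully connected branches share the parameters θ as the formula's notation suggests.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Recalibration z_t of formula (9): pooled descriptors, shared FC, Sigmoid gate."""
    def __init__(self, T: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(T, T // reduction), nn.ReLU(),
                                nn.Linear(T // reduction, T))  # fc(.; theta)

    def forward(self, z):                      # z: (B, C, T, H, W) timing feature z'_t
        s_max = z.amax(dim=(1, 3, 4))          # 3D max pooling     -> (B, T)
        s_avg = z.mean(dim=(1, 3, 4))          # 3D average pooling -> (B, T)
        score = torch.sigmoid(self.fc(s_max) + self.fc(s_avg))   # delta(...) of (9)
        return z * score.view(z.size(0), 1, -1, 1, 1)            # rescale each step
```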
Preferably, the process of judging anomalies in the video under test by combining the prediction error, the semantic correlation and the timing information is specifically as follows:
Spatially, the prediction error between the real target region and the predicted target region is used as the spatial anomaly score of each target's region. For an arbitrary frame $I_t$ in which $N$ targets exist, the maximum of all target anomaly scores in the frame is selected as the spatial anomaly score $S_{spa}(I_t)$ of the frame, as shown in formula (11):
$S_{spa}(I_t) = \max_{i=1,\ldots,N}\left\|O_t^i - \hat{O}_t^i\right\|_2^2$ (11)
where $O_t^i$ denotes the real target region, $\hat{O}_t^i$ denotes the predicted target region, and $\|\cdot\|_2$ denotes the $L_2$ norm.
In content, the $i$-th target region $O_t^i$ of frame $I_t$ corresponds to the text $w_t^i$. The semantic correlation between the global image feature $z_o(I_t)$ and the local text feature $z_d(w_t^i)$ is used as the local anomaly score, and the minimum semantic correlation is selected as the global semantic anomaly score $S_{sem}(I_t)$, as shown in formula (12):
$S_{sem}(I_t) = \min_{i=1,\ldots,N}\mathrm{sim}\!\left[z_o(I_t), z_d(w_t^i)\right]$ (12)
where $\mathrm{sim}[\cdot]$ denotes cosine similarity; the higher the probability that a target in the frame is abnormal, the smaller the semantic correlation.
In timing, the target $O_t^i$ corresponds to the timing under test $S_t^i$. The anomaly discrimination probability output by the timing discrimination module is taken as the timing anomaly score of target $O_t^i$; over all targets, the maximum anomaly discrimination probability is selected as the timing anomaly score $S_{tem}(I_t)$ of the frame, as shown in formula (13):
$S_{tem}(I_t) = \max_{i=1,\ldots,N}\hat{y}(S_t^i)$ (13)
where $\hat{y}(S_t^i)$ is the predicted probability for the target sequence $S_t^i$.
The spatial anomaly score, the semantic anomaly score and the timing anomaly score are added linearly to obtain the final anomaly score $S(I_t)$, as shown in formula (14):
$S(I_t) = S_{spa}(I_t) + \alpha \cdot \left(1 - S_{sem}(I_t)\right) + \beta \cdot S_{tem}(I_t)$ (14)
where $\alpha$ and $\beta$ are the semantic coefficient and the timing parameter respectively.
The larger the value of the anomaly score $S(I_t)$, the greater the probability that an anomaly occurs in the frame.
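The frame-level score fusion of formulas (11)-(14) reduces to a few lines once the per-target quantities are available; α and β are dataset-dependent weights whose values are not fixed here.

```python
import numpy as np

def frame_anomaly_score(pred_err, sem_corr, tem_prob, alpha=1.0, beta=1.0):
    """S(I_t) of formula (14) from the per-target quantities of one frame:
    pred_err - squared prediction errors -> S_spa = max  (formula (11))
    sem_corr - cosine similarities       -> S_sem = min  (formula (12))
    tem_prob - anomaly probabilities     -> S_tem = max  (formula (13))
    """
    S_spa = float(np.max(pred_err))
    S_sem = float(np.min(sem_corr))
    S_tem = float(np.max(tem_prob))
    return S_spa + alpha * (1.0 - S_sem) + beta * S_tem  # formula (14)
```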
Preferably, the bidirectional 3D encoder is composed of 6 3D convolutional layers with kernel size 3 × 3 × 3; each convolution is followed by batch normalization and ReLU activation; no pooling operation is used after the first and third convolutional layers, while the other layers use 3D max pooling with pooling size 1 × 2 × 2 and stride 1 × 2 × 2.
Preferably, the decoder is realized by 4 upsamplings; after each upsampling, a 2D convolutional layer with kernel size 3 × 3 performs the feature expression, and each convolution is followed by batch normalization and ReLU activation.
The invention also proposes an electronic device comprising a memory, a processor and program instructions stored in the memory for execution by the processor, the processor executing the program instructions to implement the steps of the method of the invention.
Compared with the prior art, the invention has the following beneficial effects:
1. Aiming at the deficiencies of local target and global semantic perception in anomaly detection, the invention establishes an image-text semantic perception module that correlates the image features and text features of a target region and ensures the consistency of perceived semantics;
2. The invention adopts a strategy of local training and global reasoning to perceive the semantic correlation between local targets and global information, improving the accuracy of anomaly detection;
3. The invention designs a heterogeneous feature fusion bidirectional prediction module which, by combining spatial, temporal and content information, enhances the extraction of normal features and suppresses the generation of abnormal features, thereby capturing more discriminative features;
4. The invention considers the timing differences between frames and adopts a temporal attention discrimination module to distinguish the generated pseudo-abnormal sequences from normal sequences in the time dimension, improving the sensitivity to anomalies.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on the drawings without creative efforts.
Fig. 1 is a general framework diagram of the multi-perception video abnormal event detection method based on homologous heterogeneous information provided by an embodiment of the present invention;
Fig. 2 is a schematic visualization of anomaly detection and localization provided by an embodiment of the present invention.
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1, the present invention provides a multi-perception video abnormal event detection method based on homologous heterogeneous information, which mainly comprises a target detection network, an image-text semantic perception module, a heterogeneous feature fusion bidirectional prediction module and a temporal attention discrimination module. The target detection network adopts a YOLOv3 network to extract the targets in a video frame. The image-text semantic perception module comprises an image feature extractor $T_o$ and a text feature extractor $T_d$; it extracts image features and text features of the video and calculates the semantic correlation between them to ensure the consistency of semantic features. The heterogeneous feature fusion bidirectional prediction module comprises a forward encoder, a backward encoder and a decoder, and enhances the extraction of normal features while suppressing the generation of abnormal features. The temporal attention discrimination module comprises a 3D convolutional neural network, a temporal attention mechanism and a 2D convolutional network, and learns to distinguish the characteristics of pseudo-abnormal and normal timings.
Anomaly judgment of the video under test is performed by combining the prediction error, the semantic correlation and the timing information.
The process by which the image-text semantic perception module extracts fine-grained image features and text features of the video comprises:
A video sequence $I_1,\ldots,I_M$ is passed through the target detection network to obtain the $N$ targets in each frame and their categories, where $M$ is the length of the video sequence and the value of $N$ is not fixed across frames. The $i$-th target region of the $t$-th frame $I_t$ is denoted $O_t^i \in \mathbb{R}^{W \times H \times C}$, where $t = 1,2,\ldots,M$, $i = 1,2,\ldots,N$, and $W$, $H$ and $C$ are the width, height and number of channels of the target region. The target regions in the video frame are converted to a fixed size and uniformly divided into $P$ sub-blocks of size $p \times p$, where $P = \frac{W \times H}{p^2}$; $P$ is taken as the length of the input sequence of the image feature extractor $T_o$.
Each sub-block is refined and mapped to a space of fixed dimension, generating the patch embedding $e_x$, where $x = 1,2,\ldots,P$; a position embedding $e_x^{pos}$ is added to retain the relative position information of each sub-block, giving the embedded feature $z_0$ of formula (1):
$z_0 = \left[e_1 + e_1^{pos};\ e_2 + e_2^{pos};\ \ldots;\ e_P + e_P^{pos}\right]$ (1)
The image feature extractor $T_o$ consists of $l$ stacked identical Transformer blocks, each a serial concatenation of two residual connections. The first residual combines the embedded feature $z_0$ with the output of layer normalization and multi-head self-attention to obtain the intermediate feature $\tilde{z}_1$, as shown in formula (2):
$\tilde{z}_1 = \mathrm{MSA}\!\left(\mathrm{LN}(z_0)\right) + z_0$ (2)
where $\mathrm{LN}(\cdot)$ denotes layer normalization and $\mathrm{MSA}(\cdot)$ denotes the multi-head self-attention mechanism.
The second residual adds the intermediate feature $\tilde{z}_1$, after layer normalization and multi-layer perception, to the intermediate feature itself, giving the output feature $z_1$ of the first Transformer block, as shown in formula (3):
$z_1 = \mathrm{MLP}\!\left(\mathrm{LN}(\tilde{z}_1)\right) + \tilde{z}_1$ (3)
where $\mathrm{MLP}(\cdot)$ denotes multi-layer perception.
$z_1$ is then taken as the input of the second Transformer block, which outputs $z_2$; by analogy, the output feature $z_l$ is obtained after the $l$ stacked blocks. $z_l$ is fed into an independent multi-layer perceptron, which outputs the final target region feature $z'_o$.
For each target region $O_t^i$, its corresponding category is recorded as $c_t^i$. A mapping $V$ from category labels to texts is established according to the pre-training sample categories of the target detection network, so that each target category label $c_t^i$ obtains a corresponding text $w_t^i = V(c_t^i)$, where $w_t^i$ is a sequence of length 76 delimited by [SOS] and [EOS] tokens. The text is converted into a computer-understandable form by byte-pair encoding, and the semantic context is preserved by embedding the position information of the text characters, giving the text embedding feature $e_d$.
The text embedding feature $e_d$ is passed through the text feature extractor $T_d$, as shown in formula (4):
$z'_d = T_d(e_d)$ (4)
where $z'_d$ is the output feature of the text feature extractor $T_d$.
Since there may be a gap between the text features and the representations produced by the fine-grained image feature extractor, the features of the two modalities need to be mapped into the same multimodal space. Therefore, using layer normalization and multi-layer perception, $z'_o$ and $z'_d$ are mapped into a multimodal embedding space, producing the image feature $z_o = \mathrm{MLP}(\mathrm{LN}(z'_o))$ and the text feature $z_d = \mathrm{MLP}(\mathrm{LN}(z'_d))$.
The process of calculating the semantic correlation between the image features and the text features to ensure the consistency of semantic features comprises:
When the image-text semantic perception module is trained, the semantic association objective function $L_{sem}(z_o, z_d)$ is shown in formula (5):
$L_{sem}(z_o, z_d) = \frac{1}{N}\sum_{i=1}^{N}\left(\left\|z_o^i - z_d^i\right\|_2 + 1 - \mathrm{sim}\!\left[z_o^i, z_d^i\right]\right)$ (5)
where $z_o^i$ and $z_d^i$ respectively denote the image and text features of target $O_t^i$, and $\mathrm{sim}[\cdot]$ denotes cosine similarity. Minimizing the semantic association objective $L_{sem}(z_o, z_d)$ draws the image feature and text feature vectors close in absolute distance while constraining them to the same direction.
When anomaly detection is performed on the video under test, the semantic correlation $\mathrm{sim}\!\left[z_o(I_t), z_d(w_t^i)\right]$ between the global image feature and the local text features is calculated. The image features and text features are expressed as posterior probability vectors relative to a set of semantic concepts; when an anomaly occurs in the video under test, the correlation between the global image feature and the local text features is weak, and their interpretations differ in the semantic space.
The process by which the heterogeneous feature fusion bidirectional prediction module enhances the extraction of normal features and suppresses the generation of abnormal features comprises:
Forward coding features and backward coding features are extracted with a bidirectional 3D encoder composed of a forward encoder $E_f$ and a backward encoder $E_b$; $E_f$ and $E_b$ have the same structure and network parameters. The bidirectional 3D encoder is composed of 6 3D convolutional layers with kernel size 3 × 3 × 3; each convolution is followed by batch normalization and ReLU activation; no pooling operation is used after the first and third convolutional layers, while the other layers use 3D max pooling with pooling size 1 × 2 × 2 and stride 1 × 2 × 2.
For each target region, the $L$ frames before and the $L$ frames after the corresponding position are taken to form a forward target sequence $S_f^i = \{O_{t-L}^i, \ldots, O_{t-1}^i\}$ and a backward target sequence $S_b^i = \{O_{t+1}^i, \ldots, O_{t+L}^i\}$, neither of which contains the target of the $t$-th frame. They are input to the forward encoder $E_f$ and the backward encoder $E_b$ respectively, giving the forward coding feature $z_f$ and the backward coding feature $z_b$, as shown in formulas (6) and (7):
$z_f = E_f(S_f^i)$ (6)
$z_b = E_b(S_b^i)$ (7)
The image feature $z_o$, text feature $z_d$, forward coding feature $z_f$ and backward coding feature $z_b$ are concatenated to obtain the heterogeneous fusion feature $z = \mathrm{concat}[z_f, z_b, z_o, z_d]$. The heterogeneous fusion feature combines temporal, spatial and content information, reduces the uncertainty of the prediction process, and strengthens the timing relation between the current frame and its context.
The decoder is realized by 4 upsamplings; after each upsampling, a 2D convolutional layer with kernel size 3 × 3 performs the feature expression, and each convolution is followed by batch normalization and ReLU activation. The decoder $D$ predicts the target region from the heterogeneous fusion feature, expressing the acquired features in 2D form to predict the intermediate target region $\hat{O}_t^i$:
$\hat{O}_t^i = D(z)$
The difference between the predicted target region and the real target region is expressed as the objective function $L_O(O_t^i, \hat{O}_t^i)$, as shown in formula (8):
$L_O(O_t^i, \hat{O}_t^i) = \frac{1}{W \times H}\left\|O_t^i - \hat{O}_t^i\right\|_2^2$ (8)
where $O_t^i$ and $\hat{O}_t^i$ are the real target region and the predicted target region respectively, and $W$ and $H$ denote the width and height of the target region.
During training of the heterogeneous feature fusion bidirectional prediction module, the encoders and decoder minimize the objective function $L_O$ using only normal samples, so the heterogeneous fusion feature contains the content information of normal samples: the decoder can predict normal samples, but for abnormal samples it cannot predict the abnormal target.
The pseudo-abnormal sequence is generated as follows: from the target region $O_t^i$ and the $L$ frames before and after it, a continuous normal sequence $S_n = \{O_{t-L}^i, \ldots, O_t^i, \ldots, O_{t+L}^i\}$ is generated and labeled $y_n(S_n) = 0$; a pseudo-abnormal sequence $S_a$ of the same length as $S_n$ is generated by random jumps and labeled $y_a(S_a) = 1$, where $a$ is a random number. $S_a$ has weak continuity between frames, which matches the irregular motion behavior of abnormal events, so $S_a$ is taken as the pseudo-abnormal timing.
The process by which the temporal attention discrimination module learns and distinguishes the characteristics of pseudo-abnormal and normal timings comprises: given a normal timing $S_n \in \mathbb{R}^{(2L+1) \times C \times W \times H}$ and a pseudo-abnormal timing $S_a \in \mathbb{R}^{(2L+1) \times C \times W \times H}$, where $2L+1$ is the sequence length, $C$ the number of channels, and $W \times H$ the size of the target region, $S_n$ and $S_a$ are input to the 3D convolutional neural network of the temporal attention discrimination module to extract the timing feature $z'_t$.
The attention score of each target region of $S_n$ and $S_a$ is calculated separately with the temporal attention mechanism. The mechanism uses 3D average pooling and 3D max pooling to complete the compression operation in the time dimension; after the 3D average pooling and 3D max pooling, two fully connected layers are used to obtain different scaling factors, which are combined into the final attention score; finally, each time step is rescaled by the attention score to complete the recalibration $z_t$, as shown in formula (9):
$z_t = z'_t \cdot \delta\!\left(fc\!\left(\mathrm{maxp}_{3D}(z'_t); \theta\right) + fc\!\left(\mathrm{avgp}_{3D}(z'_t); \theta\right)\right)$ (9)
where $\mathrm{maxp}_{3D}(\cdot)$ denotes 3D max pooling, $\mathrm{avgp}_{3D}(\cdot)$ denotes 3D average pooling, $fc(\cdot\,; \theta)$ denotes a fully connected layer with parameters $\theta$, and $\delta$ denotes the Sigmoid activation function.
The timing is then processed nonlinearly by the 2D convolutional network of the temporal attention discrimination module. The 2D convolutional network mainly consists of convolutions and full connections; after convolution, batch normalization and ReLU activation are used to improve generalization, while dropout is used to avoid overfitting. The fully connected layer is followed by a softmax function that outputs the anomaly discrimination probability $\hat{y}(S_k)$, where $S_k = \{S_n, S_a\}$. Cross entropy is used as the objective function $L(S_k)$, as shown in formula (10):
$L(S_k) = -\left[y_k(S_k)\log \hat{y}(S_k) + \left(1 - y_k(S_k)\right)\log\left(1 - \hat{y}(S_k)\right)\right]$ (10)
where the anomaly discrimination probability $\hat{y}(S_k) = \mathrm{softmax}\!\left(fc_1\!\left(\mathrm{conv}(z_t)\right)\right)$, $\mathrm{softmax}(\cdot)$ denotes the softmax function, $\mathrm{conv}(\cdot)$ denotes convolution, and $fc_1(\cdot)$ denotes a fully connected operation. When the timing under test is abnormal, $k = a$ and $y_k(S_k) = 1$; otherwise it is 0.
The process of judging anomalies in the video under test by combining the prediction error, the semantic correlation and the timing information is specifically as follows:
Spatially, the prediction error between the real target region and the predicted target region is used as the spatial anomaly score of each target's region. For an arbitrary frame $I_t$ in which $N$ targets exist, the maximum of all target anomaly scores in the frame is selected as the spatial anomaly score $S_{spa}(I_t)$ of the frame, as shown in formula (11):
$S_{spa}(I_t) = \max_{i=1,\ldots,N}\left\|O_t^i - \hat{O}_t^i\right\|_2^2$ (11)
where $O_t^i$ denotes the real target region, $\hat{O}_t^i$ denotes the predicted target region, and $\|\cdot\|_2$ denotes the $L_2$ norm.
In content, the $i$-th target region $O_t^i$ of frame $I_t$ corresponds to the text $w_t^i$. The semantic correlation between the global image feature $z_o(I_t)$ and the local text feature $z_d(w_t^i)$ is used as the local anomaly score, and the minimum semantic correlation is selected as the global semantic anomaly score $S_{sem}(I_t)$, as shown in formula (12):
$S_{sem}(I_t) = \min_{i=1,\ldots,N}\mathrm{sim}\!\left[z_o(I_t), z_d(w_t^i)\right]$ (12)
where $\mathrm{sim}[\cdot]$ denotes cosine similarity; the higher the probability that a target in the frame is abnormal, the smaller the semantic correlation.
In timing, the target $O_t^i$ corresponds to the timing under test $S_t^i$. The anomaly discrimination probability output by the timing discrimination module is taken as the timing anomaly score of target $O_t^i$; over all targets, the maximum anomaly discrimination probability is selected as the timing anomaly score $S_{tem}(I_t)$ of the frame, as shown in formula (13):
$S_{tem}(I_t) = \max_{i=1,\ldots,N}\hat{y}(S_t^i)$ (13)
where $\hat{y}(S_t^i)$ is the predicted probability for the target sequence $S_t^i$.
The spatial anomaly score, the semantic anomaly score and the timing anomaly score are added linearly to obtain the final anomaly score $S(I_t)$, as shown in formula (14):
$S(I_t) = S_{spa}(I_t) + \alpha \cdot \left(1 - S_{sem}(I_t)\right) + \beta \cdot S_{tem}(I_t)$ (14)
where $\alpha$ and $\beta$ are respectively the semantic coefficient and the timing parameter, which control the importance of the semantic anomaly score and the timing anomaly score relative to the spatial anomaly score.
The larger the value of the anomaly score $S(I_t)$, the greater the probability that an anomaly occurs in the frame.
The effectiveness of the method proposed by the present invention is verified by experiments below.
The method of the invention evaluates anomaly detection on three datasets: UCSD Ped2, CUHK Avenue and ShanghaiTech. All reported results were obtained on a device with an Intel Xeon(R) CPU and an NVIDIA GTX 1080Ti GPU. The method is implemented mainly in Anaconda3, Python 3.8, TensorFlow and PyTorch, and standard evaluation metrics are adopted to assess anomaly detection performance. Performance is evaluated by gradually varying the threshold of the anomaly score and computing the area under the receiver operating characteristic curve (AUC) as a scalar; higher AUC values indicate better anomaly detection performance.
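Frame-level AUC, as used here, can be computed directly from the per-frame scores of formula (14); a sketch using scikit-learn, where the per-video min-max normalization of the scores is a common convention assumed here rather than stated in the text:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def frame_level_auc(scores: np.ndarray, gt: np.ndarray) -> float:
    """scores: per-frame S(I_t) values; gt: 1 for abnormal frames, 0 for normal."""
    scores = (scores - scores.min()) / (scores.max() - scores.min() + 1e-8)
    return roc_auc_score(gt, scores)

# usage with dummy values
auc = frame_level_auc(np.random.rand(1000), np.random.randint(0, 2, 1000))
```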
Table 1 compares the method of the invention (MPFork) with some of the current best methods on the UCSD Ped2, CUHK Avenue and ShanghaiTech datasets, namely the future frame prediction method (FFP), memory-augmented deep auto-encoder (MemAE), memory-guided normality method (MNN), stacked recurrent neural network auto-encoder (sRNN-AE), generative cooperative discriminant network (GADNet), self-trained prediction model (SPM), self-supervised predictive convolutional attentive block (SSPCAB), implicit two-path auto-encoder (ITAE), multi-path prediction anomaly detection (ROADMAP), anomaly detection with graph convolutional noise cleaning (GCLNL), variational abnormal behavior detection method (VABD), temporal-aware contrastive network (TAC-Net), and self-supervised attentive generative adversarial network (SSAGAN).
Table 1. Comparison of AUC results of different methods
[table available only as an image in the source]
As can be seen from Table 1, compared with the other methods, the method of the present invention achieves the best performance on the UCSD Ped2, CUHK Avenue and ShanghaiTech datasets, with detection accuracies of 99.8%, 91.3% and 86.1% respectively. ITAE achieves the second-best accuracy of 99.2% on the Ped2 dataset, but the method of the invention exceeds it by 3.3% and 9.8% on the other two datasets, which proves that the method helps to distinguish normal from abnormal and improves anomaly detection performance.
The anomaly localization ability and timing sensitivity of the method are further verified with the RBDC and TBDC metrics: RBDC expresses localization ability according to the overlap between the real and the predicted abnormal regions, while TBDC depends on the tracking detection rate and the number of false-positive regions in consecutive frames. Table 2 compares the detection accuracy of self-supervised multi-task learning (SMTL), the object-centric auto-encoder (OCAE), FFP, SSPCAB, and the method of the invention (MPFork).
Table 2. Comparison of RBDC and TBDC performance of different methods
[table available only as an image in the source]
As can be seen from Table 2, the method of the invention gives better overall results on both the RBDC and TBDC metrics. Compared with FFP and SSPCAB, RBDC and TBDC on the UCSD Ped2 dataset improve by 35% and 40% respectively; gains of over 15% and 25% are obtained on CUHK Avenue; and improvements of 30% and 25% are obtained on ShanghaiTech. Considering the spatial prediction of anomalies in heterogeneous feature fusion together with timing features during anomaly detection helps to improve the sensitivity to, and the localization of, abnormal trajectories. In addition, the method outperforms SMTL on all three datasets because it considers not only spatio-temporal features but also the semantic correlation between local features and the global scene, reducing missed detections.
The effectiveness of the method of the invention is further verified by evaluating the contribution of each strategy in the model to anomaly detection. A unidirectional frame prediction network, mainly comprising a 3D encoder and a 2D decoder, is taken as the baseline for anomaly detection. The proposed MPFork mainly comprises three modules: the heterogeneous feature fusion bidirectional prediction module (BWH), the image-text semantic perception module (ISP) and the temporal attention discrimination module (TAD). Compared with the baseline model, the heterogeneous feature fusion bidirectional prediction module not only adds bidirectional prediction (BiP) but also performs anomaly detection with homologous heterogeneous feature fusion. Accordingly, the effect of the spatio-temporal features provided by bidirectional prediction on anomaly detection is verified, and the influence of heterogeneous feature fusion in frame prediction is verified separately for the fine-grained image features of the target region obtained by the image feature extractor (ViF) and the text features obtained by the text feature extractor (TeF). The performance changes after adding the different strategies relative to the baseline model are shown in Table 3.
Table 3. Influence of different strategies on anomaly detection
[table available only as an image in the source]
As can be seen from Table 3, bidirectional prediction improves performance over the baseline model, especially by 2.4% on UCSD Ped2. That is because bidirectional prediction employs forward and backward feature extraction, indirectly providing richer temporal information for inter-frame prediction. When the fine-grained image features or text features of the target region are added, performance on the three datasets improves to different degrees, showing their effectiveness for anomaly detection. In particular, when the heterogeneous information fusion bidirectional prediction module is used, there are improvements of 5.0%, 2.1% and 2.0% on the UCSD Ped2, CUHK Avenue and ShanghaiTech datasets respectively. The prediction process combines the timing features of the sequence with the spatial features and content information of the current target region, which reduces prediction uncertainty, enhances the capture of normal features, and suppresses the generation of abnormal features. After the image-text semantic perception module is added, the improvement on the Avenue dataset is more obvious: anomalies such as throwing a bag or dropping papers exist in Avenue and are easily missed when only local features are used, whereas the proposed image-text semantic perception attends to the correlation between local features and the global scene and more readily captures the relation between people and objects, thereby detecting such anomalies. After the temporal attention discrimination module is added, anomaly detection performance improves markedly on the CUHK Avenue and ShanghaiTech datasets: these two datasets contain fast running, chasing, alarms and similar events, and the temporal attention discrimination module focuses more on the timing features of the target, making it easier to distinguish abnormal timings from normal ones. Using all these strategies simultaneously, the proposed MPFork achieves the best performance, improving over the baseline model by 6.3%, 9.1% and 7.0% on the three datasets respectively. This result indicates that all the proposed strategies contribute to anomaly detection, as their combination optimizes the framework from different aspects and improves performance.
In addition, in real life an anomaly detection model must not only index the video frame in which an abnormal event occurs but also locate the specific position of the abnormal target within the frame. To illustrate the performance of the proposed method more intuitively, several representative cases from different datasets were selected to visualize the detection and localization results of the different models, as shown in fig. 2, where rectangular boxes mark missed or false detections. In each dataset, the first column shows the visualization of a normal frame, and the second and third columns show the detection and localization of different abnormal frames under the different models. The first row is the actually captured video frame, and the rows below correspond to the detection and localization results of BiP, BWH and MPFork respectively. It can be seen that BiP is more likely to judge some normal targets as abnormal, causing false detections, and is also prone to missed detections. BiP performs anomaly detection from the prediction error alone: because its learning of the discriminative features between normal and abnormal is insufficient, the prediction error of some normal targets is relatively large and produces false detections, while the strong representation ability of the deep network makes the prediction of some abnormal targets good and produces missed detections. Compared with BiP, the energy values of some normal targets in BWH decrease and those of abnormal targets increase, indicating an increased ability of BWH to learn the discriminative features of normal and abnormal targets. This is because BWH simultaneously fuses the temporal, spatial and content information of the target, which helps enhance the normal features in the solution space and suppress the generation of abnormal features during inference. The proposed MPFork shows better detection and localization on all three datasets because, on the basis of BWH, it combines local and global semantic perception with timing discrimination, improves the extraction of discriminative features, perceives anomalies deeply at the semantic understanding level, and detects anomalies from the timing perspective during inference, which benefits the detection and localization of anomalies, makes the method more suitable for real-life anomaly detection, and improves detection performance.
Finally, the invention provides an electronic device comprising a memory, a processor, and a program stored in the memory and executable by the processor; when the processor executes the program, the steps of the multi-perception video abnormal event detection method based on homologous heterogeneous information proposed by the invention are implemented. It should be noted that the description of the device according to the embodiment of the present invention is similar to the description of the method embodiment and has similar beneficial effects, and is therefore omitted.
Program code for implementing the methods of the present application may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
Although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that various changes, modifications, equivalents, and improvements can be made therein without departing from the spirit and scope of the invention.

Claims (9)

1. A multi-perception video abnormal event detection method based on homologous heterogeneous information, characterized by comprising: a target detection network, an image-text semantic perception module, a heterogeneous feature fusion bidirectional prediction module, and a temporal attention discrimination module;
the target detection network adopts a YOLOv3 network to extract the targets in a video frame;
the image-text semantic perception module comprises an image feature extractor $T_o$, a text feature extractor $T_d$ and a semantic relevance description part; the image-text semantic perception module extracts image features and text features of a video and calculates the semantic correlation between them to ensure the consistency of semantic features;
the heterogeneous feature fusion bidirectional prediction module comprises a forward encoder $E_f$, a backward encoder $E_b$ and a decoder; the heterogeneous feature fusion bidirectional prediction module enhances the extraction of normal features and suppresses the generation of abnormal features;
the temporal attention discrimination module comprises a 3D convolutional neural network, a temporal attention mechanism and a 2D convolutional network, and learns to distinguish the characteristics of pseudo-abnormal and normal timings;
anomaly judgment of the video under test is performed by combining the prediction error, the semantic correlation and the timing information.
2. The method for detecting the abnormal events of the multi-perception video based on the heterogeneous information according to claim 1, wherein the process of extracting the image fine granularity feature and the text feature of the video by the graph-text meaning perception module comprises:
obtaining, through the target detection network, the N targets and their categories in each frame of the video sequence I_1, …, I_M, where M is the length of the video sequence and the value of N is not fixed across frames; the i-th target region of the t-th frame I_t is O_t^i ∈ R^{W×H×C}, where t = 1, 2, …, M, i = 1, 2, …, N, and W, H, and C are the width, height, and number of channels of the target region, respectively; the target region in the video frame is converted to a fixed size and uniformly divided into P sub-blocks of size p×p, where P = (W×H)/(p×p); P is used as the length of the input sequence of the image feature extractor T_o;
each sub-block undergoes feature refinement and is mapped to a space of fixed dimensionality to generate the sub-block embeddings z_x^sub, where x = 1, 2, …, P; a position embedding E_pos is added to retain the relative position information of each sub-block, giving the embedded feature z_0, as in formula (1):
z_0 = [z_1^sub; z_2^sub; …; z_P^sub] + E_pos    (1)
the image feature extractor T_o consists of l stacked identical Transformer blocks, each a serial concatenation of two residual connections; the first residual adds the embedded feature z_{j-1} to the output of layer normalization followed by the multi-head self-attention mechanism, giving the intermediate feature z'_j, as in formula (2):
z'_j = MSA(LN(z_{j-1})) + z_{j-1}    (2)
where LN(·) denotes layer normalization and MSA(·) denotes the multi-head self-attention mechanism;
the second residual adds the intermediate feature z'_j, after layer normalization and multi-layer perception, to the intermediate feature z'_j itself, giving the output feature z_j of the j-th Transformer block, as in formula (3):
z_j = MLP(LN(z'_j)) + z'_j    (3)
where MLP(·) denotes multi-layer perception;
z_1 is then taken as the input of the second Transformer block to output z_2, and so on, until the output feature z_l is obtained after the l stacked blocks; z_l is fed into an independent multi-layer perceptron to output the final target-region feature z'_o;
for each target region O_t^i, its corresponding class is recorded as c_t^i; a mapping V from class labels to texts is established according to the pre-training sample classes of the target detection network, so that each target class label c_t^i obtains the corresponding text d_t^i = V(c_t^i), where d_t^i is a sequence of length 76 delimited by [SOS] and [EOS] tokens; the text is converted into a computer-understandable form by byte-pair encoding while preserving the semantic context, and the text embedding feature z_0^d is obtained by embedding the position information of the text characters; the text embedding feature z_0^d passes through the text feature extractor T_d as in formula (4):
z'_d = T_d(z_0^d)    (4)
where z'_d is the output feature of the text feature extractor T_d; using layer normalization and multi-layer perception, z'_o and z'_d are mapped into a multimodal embedding space, outputting the image feature z_o = MLP(LN(z'_o)) and the text feature z_d = MLP(LN(z'_d)).
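The extractor described above matches the standard pre-norm Vision-Transformer recipe, so formulas (1)–(3) can be realized directly; the sketch below assumes PyTorch, and the dimension choices (dim=768, 12 heads, 16×16 sub-blocks, 196 patches) are illustrative rather than taken from the filing.

```python
# One pre-norm Transformer block as in formulas (2)-(3): z' = MSA(LN(z)) + z; z_out = MLP(LN(z')) + z'.
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 12, mlp_ratio: int = 4):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                 nn.Linear(dim * mlp_ratio, dim))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        h = self.ln1(z)
        z_mid = self.msa(h, h, h, need_weights=False)[0] + z   # formula (2)
        return self.mlp(self.ln2(z_mid)) + z_mid               # formula (3)

# Patch embedding as in formula (1): P sub-blocks of size p x p plus a learnable position embedding.
class PatchEmbed(nn.Module):
    def __init__(self, p: int = 16, in_ch: int = 3, dim: int = 768, num_patches: int = 196):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=p, stride=p)  # per-sub-block feature refinement
        self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))   # position embedding E_pos

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) with H, W divisible by p and (H/p)*(W/p) == num_patches.
        z = self.proj(x).flatten(2).transpose(1, 2)  # (B, P, dim) sub-block embeddings
        return z + self.pos                          # formula (1)
```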
3. The method according to claim 2, wherein calculating the semantic correlation between the image features and the text features to ensure the consistency of the semantic features comprises:
when training the image-text semantic perception module, the semantic association objective function L_sem(z_o, z_d) is as shown in formula (5):
L_sem(z_o, z_d) = Σ_{t,i} ( ||z_o^{t,i} − z_d^{t,i}||_2 + 1 − sim[z_o^{t,i}, z_d^{t,i}] )    (5)
where z_o^{t,i} and z_d^{t,i} respectively denote the image and text features of target O_t^i, and sim[·] denotes cosine similarity; minimizing the semantic association objective function L_sem(z_o, z_d) makes the image-feature and text-feature vectors close to each other in absolute distance while constraining them to the same direction;
when performing anomaly detection on the video under test, the semantic correlation between the global image feature z_o(I_t) and the local text features z_d^{t,i} is calculated; the image features and text features are expressed as posterior probability vectors relative to a group of semantic concepts; when an anomaly occurs in the video under test, the correlation between the global image feature and the local text features is weak, and their interpretations of the semantic space diverge.
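Formula (5) is reproduced only as an image in the filing; the sketch below therefore assumes one reading consistent with the stated behavior — an L2 term for absolute distance plus a (1 − cosine) term for direction — and should be taken as illustrative.

```python
# Hedged reading of the semantic association objective: the claim only states that minimizing it
# pulls image and text features together in distance and direction, so an L2 + (1 - cosine)
# form is assumed here for illustration.
import torch
import torch.nn.functional as F

def semantic_association_loss(z_o: torch.Tensor, z_d: torch.Tensor) -> torch.Tensor:
    """z_o, z_d: (N, D) image and text features of the same N targets."""
    distance = (z_o - z_d).norm(dim=1)                       # absolute distance between the vectors
    direction = 1.0 - F.cosine_similarity(z_o, z_d, dim=1)   # zero when the vectors share a direction
    return (distance + direction).mean()

# At test time the semantic correlation itself is the score (its minimum is used in formula (12)):
def semantic_correlation(z_global: torch.Tensor, z_texts: torch.Tensor) -> torch.Tensor:
    """z_global: (D,) global image feature; z_texts: (N, D) local text features."""
    return F.cosine_similarity(z_global.unsqueeze(0), z_texts, dim=1)  # (N,) cosine scores
```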
4. The method according to claim 3, wherein the heterogeneous-feature-fusion bidirectional prediction module enhancing the extraction of normal features and suppressing the generation of abnormal features comprises:
extracting the forward coding features and backward coding features using a bidirectional 3D encoder, wherein the bidirectional 3D encoder consists of the forward encoder E_f and the backward encoder E_b, which have the same structure and network parameters;
according to each target region, the L frames before and the L frames after the corresponding position are taken to form a forward target sequence S_f^{t,i} and a backward target sequence S_b^{t,i} that exclude the target in frame t; these are input respectively into the forward encoder E_f and the backward encoder E_b to obtain the forward coding feature z_f and the backward coding feature z_b, as in formulas (6) and (7):
z_f = E_f(S_f^{t,i})    (6)
z_b = E_b(S_b^{t,i})    (7)
the image feature z_o, text feature z_d, forward coding feature z_f, and backward coding feature z_b are concatenated to obtain the heterogeneous fusion feature z = concat[z_f, z_b, z_o, z_d];
the decoder predicts the target region based on the heterogeneous fusion feature; expressing the acquired features in 2D format, it predicts the intermediate target region Ô_t^i;
the difference between the predicted target region and the real target region is expressed as the objective function L_pre, as in formula (8):
L_pre = (1/(W×H)) ||O_t^i − Ô_t^i||_2^2    (8)
where O_t^i and Ô_t^i are the real target region and the predicted target region, respectively, and W and H are the width and height of the target region;
during training of the heterogeneous-feature-fusion bidirectional prediction module, only normal samples are used, and the encoder and decoder minimize the objective function L_pre; the heterogeneous fusion feature thus contains the content information of normal samples, so the decoder can predict normal samples, whereas for abnormal samples the decoder cannot predict the abnormal target.
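A compact sketch of the bidirectional prediction step, assuming PyTorch; reusing a single encoder instance for both directions is one way to honor the shared structure and parameters of E_f and E_b, and the mean-squared-error form of the loss is an assumption where formula (8) is filed as an image.

```python
# Bidirectional prediction with heterogeneous feature fusion (illustrative shapes; one encoder
# instance serves as both E_f and E_b so structure and weights are shared by construction).
import torch
import torch.nn as nn

class BiPredictor(nn.Module):
    def __init__(self, encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder   # 3D encoder used as both E_f and E_b
        self.decoder = decoder

    def forward(self, seq_fwd, seq_bwd, z_o, z_d):
        z_f = self.encoder(seq_fwd)                 # formula (6)
        z_b = self.encoder(seq_bwd)                 # formula (7)
        z = torch.cat([z_f, z_b, z_o, z_d], dim=1)  # heterogeneous fusion feature
        return self.decoder(z)                      # predicted intermediate target region

def prediction_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # L2 prediction error between predicted and real target regions, averaged over the region
    # (the exact normalization of formula (8) is an assumption).
    return torch.mean((pred - target) ** 2)
```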
5. The method for detecting multi-perception video abnormal events based on homologous heterogeneous information according to claim 4, wherein the generation process of the pseudo-abnormal sequence is: according to the target region O_t^i and the L frames before and after it, a continuous normal sequence S_n = {O_{t−L}^i, …, O_t^i, …, O_{t+L}^i} is generated and marked as y_n(S_n) = 0; by random jumps, a pseudo-abnormal sequence S_a of the same length as S_n is generated and marked as y_a(S_a) = 1, where a is a random number and S_a is the pseudo-abnormal time sequence.
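The random-jump construction can be sketched in a few lines; the stride range (2 to 4) and the bounds handling below are assumptions, as the claim fixes only the sequence length 2L+1 and the labels.

```python
# Pseudo-abnormal sequence by random jump: sample a random stride a > 1 and take every a-th
# target region, yielding a sequence of the same length 2L+1 as the normal one (sketch only).
import random

def make_sequences(regions, t: int, L: int):
    """regions: list of per-frame target regions at the tracked position; t: center index.
    Assumes t - L*a >= 0 and t + L*a < len(regions) for the sampled stride a."""
    normal = regions[t - L : t + L + 1]                          # S_n, label y_n(S_n) = 0
    a = random.randint(2, 4)                                     # random jump size (assumed range)
    start = t - L * a
    pseudo = [regions[start + k * a] for k in range(2 * L + 1)]  # S_a, label y_a(S_a) = 1
    return (normal, 0), (pseudo, 1)
```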
6. The method for detecting multi-perception video abnormal events based on homologous heterogeneous information according to claim 5, wherein the process by which the time-sequence attention discrimination module learns and distinguishes the features of the pseudo-abnormal time sequence and the normal time sequence comprises: given the normal time sequence S_n ∈ R^{(2L+1)×C×W×H} and the pseudo-abnormal time sequence S_a ∈ R^{(2L+1)×C×W×H}, where 2L+1 is the sequence length, C is the number of channels, and W×H is the size of the target region, S_n and S_a are input into a 3D convolutional neural network to extract the time-sequence feature z'_t;
the attention scores of each target region of S_n and S_a are calculated separately using the time-sequence attention mechanism; the mechanism employs 3D average pooling and 3D max pooling to complete the compression operation in the time dimension; two fully connected layers after the 3D average pooling and the 3D max pooling then obtain the respective scaling factors and the final attention score, and finally each time step is scaled according to its attention score to complete the recalibration z_t, as in formula (9):
z_t = z'_t · δ( fc(maxp_3D(z'_t); θ) + fc(avgp_3D(z'_t); θ) )    (9)
where maxp_3D(·) denotes 3D max pooling, avgp_3D(·) denotes 3D average pooling, fc(·; θ) denotes a fully connected layer with parameters θ, and δ denotes the Sigmoid activation function;
the 2D convolutional network performs nonlinear processing on the time sequence; it mainly consists of convolution and full connection, with batch normalization, ReLU activation, and dropout applied after the convolution; the fully connected layer is followed by a softmax function that outputs the anomaly discrimination probability ŷ(S_k), where S_k ∈ {S_n, S_a}; cross entropy is used as the objective function L(S_k), as in formula (10):
L(S_k) = −[ y_k(S_k) log ŷ(S_k) + (1 − y_k(S_k)) log(1 − ŷ(S_k)) ]    (10)
where the anomaly discrimination probability ŷ(S_k) = softmax(fc_1(conv(z_t))), softmax(·) denotes the softmax function, conv(·) denotes convolution, and fc_1(·) denotes a fully connected operation; when the time sequence under test is abnormal, k = a and y_k(S_k) = 1, otherwise 0.
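Formula (9) resembles a squeeze-and-excitation gate over the time axis; the sketch below assumes PyTorch, pools channels and space down to one value per time step, and shares the fully connected layers (parameters θ) between the max- and average-pooled branches, which is one reading of the claim.

```python
# Time-sequence attention recalibration of formula (9):
# z_t = z'_t * sigmoid(fc(maxpool3d(z'_t); theta) + fc(avgpool3d(z'_t); theta)).
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    def __init__(self, T: int):
        super().__init__()
        # Two fully connected layers with parameters theta; the bottleneck T//2 is illustrative.
        self.fc = nn.Sequential(nn.Linear(T, T // 2), nn.ReLU(), nn.Linear(T // 2, T))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (B, C, T, H, W) time-sequence feature z'_t from the 3D CNN.
        squeeze_max = z.amax(dim=(1, 3, 4))   # 3D max pooling to one value per time step: (B, T)
        squeeze_avg = z.mean(dim=(1, 3, 4))   # 3D average pooling: (B, T)
        score = torch.sigmoid(self.fc(squeeze_max) + self.fc(squeeze_avg))  # per-step attention
        return z * score.view(z.size(0), 1, -1, 1, 1)  # rescale each time step: formula (9)
```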
7. The method for detecting multi-perception video abnormal events based on homologous heterogeneous information according to claim 6, wherein performing anomaly judgment on the video under test by combining the prediction error, the semantic correlation, and the time-sequence information specifically comprises:
spatially, the prediction error between the real target region and the predicted target region is used as the spatial anomaly score of each target's region; for any frame I_t containing N targets, the maximum over all target anomaly scores in the frame is selected as the spatial anomaly score S_spa(I_t) of the frame, as in formula (11):
S_spa(I_t) = max_{i=1,…,N} ||O_t^i − Ô_t^i||_2    (11)
where O_t^i denotes the real target region, Ô_t^i denotes the predicted target region, and ||·||_2 denotes the L_2 norm;
in content, the i-th target region O_t^i of frame I_t has the corresponding text d_t^i; the semantic correlation between the global image feature z_o(I_t) and the local text feature z_d^{t,i} is used as the local anomaly score, and the minimum semantic correlation is selected as the global semantic anomaly score S_sem(I_t), as in formula (12):
S_sem(I_t) = min_{i=1,…,N} sim[z_o(I_t), z_d^{t,i}]    (12)
where sim[·] denotes cosine similarity; the higher the probability that a target anomaly appears in the frame, the smaller the semantic correlation;
in time sequence, the target O_t^i corresponds to the time sequence under test S^{t,i}; the anomaly discrimination probability output by the time-sequence discrimination module is adopted as the time-sequence anomaly score of the target O_t^i; over all targets, the maximum anomaly discrimination probability is selected as the time-sequence anomaly score S_tem(I_t) of the frame, as in formula (13):
S_tem(I_t) = max_{i=1,…,N} ŷ(S^{t,i})    (13)
where ŷ(S^{t,i}) is the predicted probability for the target sequence S^{t,i};
the spatial anomaly score, the semantic anomaly score, and the time-sequence anomaly score are linearly combined to obtain the final anomaly score S(I_t), as in formula (14):
S(I_t) = S_spa(I_t) + α·(1 − S_sem(I_t)) + β·S_tem(I_t)    (14)
where α and β are the semantic coefficient and the time-sequence coefficient, respectively; the larger the value of the anomaly score S(I_t), the greater the probability that an anomaly occurs in the frame.
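The fusion of formula (14) itself reduces to one line; α and β below default to 1.0 purely for illustration, since the claim leaves them as free coefficients.

```python
# Frame-level anomaly score of formula (14): spatial error, inverted semantic correlation,
# and temporal discrimination probability, linearly combined.
def anomaly_score(s_spa: float, s_sem: float, s_tem: float,
                  alpha: float = 1.0, beta: float = 1.0) -> float:
    return s_spa + alpha * (1.0 - s_sem) + beta * s_tem

# Example: high prediction error, weak image-text correlation, and a confident temporal
# discriminator together yield a high score.
print(anomaly_score(s_spa=0.8, s_sem=0.2, s_tem=0.9))  # 0.8 + 0.8 + 0.9 = 2.5
```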
8. The method for detecting multi-perception video abnormal events based on homologous heterogeneous information according to claim 4, wherein:
the bidirectional 3D encoder consists of six 3D convolutional layers with convolution kernels of size 3×3×3; each convolution is followed by batch normalization and ReLU activation; no pooling is used after the first and third convolutional layers, while the other layers adopt 3D max pooling with a pooling size of 1×2×2 and a stride of 1×2×2;
the decoder is realized by four upsampling operations; after each upsampling, a 2D convolutional layer with a kernel size of 3×3 performs feature expression, and each convolution is followed by batch normalization and ReLU activation.
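Claim 8 pins down the layer pattern but not the channel widths; the sketch below assumes PyTorch and illustrative widths, and omits the final projection back to image channels.

```python
# 3D encoder / 2D decoder per claim 8: six 3x3x3 conv layers with BN + ReLU, 1x2x2 max pooling
# after every layer except the first and third; decoder with four 2x upsamplings, each followed
# by a 3x3 2D conv, BN, and ReLU (channel widths are illustrative, not from the filing).
import torch.nn as nn

def encoder3d(widths=(3, 32, 64, 64, 128, 128, 256)) -> nn.Sequential:
    layers = []
    for i in range(6):
        layers += [nn.Conv3d(widths[i], widths[i + 1], 3, padding=1),
                   nn.BatchNorm3d(widths[i + 1]), nn.ReLU(inplace=True)]
        if i not in (0, 2):  # no pooling after the first and third conv layers
            layers.append(nn.MaxPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2)))
    return nn.Sequential(*layers)

def decoder2d(widths=(256, 128, 64, 32, 16)) -> nn.Sequential:
    layers = []
    for i in range(4):
        layers += [nn.Upsample(scale_factor=2),
                   nn.Conv2d(widths[i], widths[i + 1], 3, padding=1),
                   nn.BatchNorm2d(widths[i + 1]), nn.ReLU(inplace=True)]
    return nn.Sequential(*layers)
```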
9. An electronic device comprising a memory, a processor, and program instructions stored in the memory, wherein the program instructions, when executed by the processor, perform the steps of the method of any one of claims 1-8.
CN202211484883.9A 2022-11-24 2022-11-24 Multi-perception video abnormal event detection method and device based on homologous heterogeneous information Withdrawn CN115797830A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211484883.9A CN115797830A (en) 2022-11-24 2022-11-24 Multi-perception video abnormal event detection method and device based on homologous heterogeneous information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211484883.9A CN115797830A (en) 2022-11-24 2022-11-24 Multi-perception video abnormal event detection method and device based on homologous heterogeneous information

Publications (1)

Publication Number Publication Date
CN115797830A true CN115797830A (en) 2023-03-14

Family

ID=85441197

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211484883.9A Withdrawn CN115797830A (en) 2022-11-24 2022-11-24 Multi-perception video abnormal event detection method and device based on homologous heterogeneous information

Country Status (1)

Country Link
CN (1) CN115797830A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116257142A (en) * 2023-05-12 2023-06-13 福建省亿鑫海信息科技有限公司 Security monitoring method and terminal based on multi-mode data characterization
CN116257142B (en) * 2023-05-12 2023-07-21 福建省亿鑫海信息科技有限公司 Security monitoring method and terminal based on multi-mode data characterization
CN116506216A (en) * 2023-06-19 2023-07-28 国网上海能源互联网研究院有限公司 Lightweight malicious flow detection and evidence-storage method, device, equipment and medium
CN116506216B (en) * 2023-06-19 2023-09-12 国网上海能源互联网研究院有限公司 Lightweight malicious flow detection and evidence-storage method, device, equipment and medium
CN116886991A (en) * 2023-08-21 2023-10-13 珠海嘉立信发展有限公司 Method, apparatus, terminal device and readable storage medium for generating video data
CN116886991B (en) * 2023-08-21 2024-05-03 珠海嘉立信发展有限公司 Method, apparatus, terminal device and readable storage medium for generating video data

Similar Documents

Publication Publication Date Title
CN115797830A (en) Multi-perception video abnormal event detection method and device based on homologous heterogeneous information
Chen et al. Dcan: improving temporal action detection via dual context aggregation
Ge et al. Deepfake video detection via predictive representation learning
Antwi-Bekoe et al. A deep learning approach for insulator instance segmentation and defect detection
Wang et al. Afp-mask: Anchor-free polyp instance segmentation in colonoscopy
CN113705490B (en) Anomaly detection method based on reconstruction and prediction
CN111860248B (en) Visual target tracking method based on twin gradual attention-guided fusion network
CN113947702A (en) Multi-modal emotion recognition method and system based on context awareness
CN114170184A (en) Product image anomaly detection method and device based on embedded feature vector
CN114639042A (en) Video target detection algorithm based on improved CenterNet backbone network
CN114511502A (en) Gastrointestinal endoscope image polyp detection system based on artificial intelligence, terminal and storage medium
CN116542921A (en) Colon polyp segmentation method, device and storage medium
CN115412324A (en) Air-space-ground network intrusion detection method based on multi-mode conditional countermeasure field adaptation
Peng et al. An adaptive coarse-fine semantic segmentation method for the attachment recognition on marine current turbines
CN112967227B (en) Automatic diabetic retinopathy evaluation system based on focus perception modeling
Lee et al. Latent-ofer: Detect, mask, and reconstruct with latent vectors for occluded facial expression recognition
CN115690665B (en) Video anomaly detection method and device based on cross U-Net network
Singh et al. Attention-guided generator with dual discriminator GAN for real-time video anomaly detection
CN115953663A (en) Weak supervision shadow detection method using line marking
Wang et al. Scene uyghur recognition with embedded coordinate attention
CN115359511A (en) Pig abnormal behavior detection method
Shrestha et al. Magnet: Multi-region attention-assisted grounding of natural language queries at phrase level
CN112418205A (en) Interactive image segmentation method and system based on focusing on wrongly segmented areas
Zhai et al. Spike-based optical flow estimation via contrastive learning
Li et al. Cross-modality integration framework with prediction, perception and discrimination for video anomaly detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20230314