CN115797830A - Multi-perception video abnormal event detection method and device based on homologous heterogeneous information
- Publication number: CN115797830A (application CN202211484883.9A)
- Authority: CN (China)
- Prior art keywords: abnormal, features, text, semantic, target
- Legal status: Withdrawn (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Abstract
The invention provides a multi-perception video abnormal event detection method and device based on homologous heterogeneous information. The method comprises a target detection network, an image-text semantic perception module, a bidirectional prediction module with heterogeneous feature fusion, and a temporal attention discrimination module. The image-text semantic perception module semantically associates image features with text features to ensure semantic consistency. The bidirectional prediction module with heterogeneous feature fusion combines temporal, spatial, and content information, enhancing the extraction of normal features and suppressing the generation of abnormal features, thereby improving the capture of discriminative features for anomaly detection. The temporal attention discrimination module attends to the temporal relation between frames and learns to distinguish pseudo-abnormal sequences from normal sequences, improving the network's ability to detect anomalies from temporal features.
Description
Technical Field
The invention belongs to the field of intelligent video processing, and particularly relates to a multi-perception video abnormal event detection method and device based on homologous heterogeneous information.
Background
Video surveillance is an important information aid in the field of public safety. As the coverage of surveillance equipment grows, labor costs for video security rise while efficiency becomes difficult to guarantee.
The essence of anomaly detection is to understand and distinguish the inherent differences between normal and abnormal events. Because abnormal events occur rarely and samples are scarce and difficult to label, some researchers explore anomaly detection based on frame prediction: the model is trained on normal samples only, and anomalies are determined from the difference between an input frame and its predicted frame, avoiding an explicit definition of abnormality. Liu et al. proposed an anomaly detection framework based on future frame prediction that uses optical flow to represent motion information; Li et al. proposed a two-branch prediction model that takes the consecutive frames before and after the target frame as input; Lee et al. proposed a multi-scale aggregation network to take the contextual information of anomalous events into account. These methods gather context from both directions, but they obtain the preceding and following information through two separate branches, which makes the two kinds of information independent and in turn hurts detection performance.
Furthermore, most frame-prediction-based anomaly detection methods achieve excellent performance under unsupervised learning, yet they ignore the essential differences between future frame prediction and anomaly detection. On one hand, frame prediction focuses on prediction quality using the context of the target frame and ignores the target frame itself, whereas anomaly detection must focus on the discriminative features between normal and abnormal that the target frame provides. On the other hand, during frame prediction the model extracts high-level features but lacks perception of the content in the image frame, whereas abnormal event detection usually needs to determine the content attributes of targets so that interference from uncertain factors on the discriminative features is avoided as much as possible.
To learn the distinguishing features more clearly, prior work has focused on anomaly detection of local targets in a scene: Ionescu et al. introduced an object-centric convolutional auto-encoder together with one-versus-rest classifiers to separate normal targets from abnormal ones; Georgescu et al. proposed a background-elimination network that depends on the objects that may cause anomalies. These methods break through the limitation that foreground targets occupy only a small portion of the whole image, and capture key in-frame information through refined features of local regions. However, learning only local features ignores the relevance of a target to the global scene and the influence of a local target on the whole.
To attend to integrity, Huang et al. proposed a global attribute-restoration network that deletes certain attributes and learns semantic feature embeddings for anomaly detection; Lv et al. proposed a high-order context encoding model to extract semantic representations and measure dynamic changes. These methods attend to global high-level semantic information and show good performance in anomaly detection.
However, visual semantic information is only one basic unit of semantic expression, and the important correlations between semantic units are missing; moreover, a single visual modality expresses semantics insufficiently and makes it difficult to match semantic content consistently in the association, which limits semantic perception. In addition, the refinement of fine-grained features is weak, so the expressed semantics are insufficient for anomaly detection and the detection precision suffers.
Disclosure of Invention
To overcome the shortcomings of the prior art, the invention provides a multi-perception video abnormal event detection method and device based on homologous heterogeneous information. It improves the perception of discriminative features between normal and abnormal from multiple aspects, addressing three problems of existing anomaly detection: content information within a video frame is not acquired, the semantic association between local targets and the whole scene is lost, and inter-frame temporal differences are ignored. This improves anomaly detection precision.
The purpose of the invention is realized by the following technical scheme:
a multi-perception video abnormal event detection method based on homologous heterogeneous information is characterized by comprising the following steps: the system comprises a target detection network, a graph-text semantic perception module, a bidirectional prediction module with heterogeneous feature fusion and a time sequence attention distinguishing module, wherein the target detection network adopts a YoloV3 network to extract a target in a video frame; the image-text meaning perception module comprises an image feature extractor T o Text feature extractor T d The image-text semantic perception module extracts image features and text features of the video and calculates semantic correlation between the image features and the text features to ensure the consistency of the semantic features; the heterogeneous feature fusion bidirectional prediction module comprises a forward encoder, a backward encoder and a decoder, and enhances the extraction of normal features and inhibits the generation of abnormal features; the time sequence attention distinguishing module comprises a 3D convolution neural network, a time sequence attention mechanism and a 2D convolution network, and learns and distinguishes the characteristics of the false abnormal time sequence and the normal time sequence.
And carrying out abnormity judgment on the video to be detected through the combination of the prediction error, the semantic correlation and the time sequence information.
Preferably, the process by which the image-text semantic perception module extracts fine-grained image features and text features of the video comprises:
Through the target detection network, acquiring, for a video sequence $I_1, \ldots, I_M$, the N targets in each frame and their categories, where M is the length of the video sequence and the value of N is not fixed across frames. The i-th target region of the t-th frame $I_t$, where $t = 1, 2, \ldots, M$ and $i = 1, 2, \ldots, N$, has width W, height H, and C channels. Each target region in the video frame is converted to a fixed size and uniformly divided into P sub-blocks of size $p \times p$, where $P = WH/p^2$; P is taken as the length of the input sequence of the image feature extractor $T_o$.
Each sub-block is refined and mapped to a space of fixed dimension to generate a spatial embedding, where $x = 1, 2, \ldots, P$ indexes the sub-blocks; a position embedding is added to retain the relative position information of each sub-block, giving the embedded features of formula (1).
The image feature extractor $T_o$ consists of $l$ stacked identical Transformer frames, each a serial concatenation of two residual branches. The first residual branch adds the embedded features to the output of layer normalization followed by the multi-head self-attention mechanism, yielding the intermediate features, as shown in formula (2):
$z'_\ell = \mathrm{MSA}(\mathrm{LN}(z_{\ell-1})) + z_{\ell-1} \qquad (2)$
where $\mathrm{LN}(\cdot)$ denotes layer normalization and $\mathrm{MSA}(\cdot)$ denotes the multi-head self-attention mechanism.
The second residual branch adds the intermediate features after layer normalization and multi-layer perception to the intermediate features themselves, giving the output features of the first Transformer frame, as shown in formula (3):
$z_\ell = \mathrm{MLP}(\mathrm{LN}(z'_\ell)) + z'_\ell \qquad (3)$
where $\mathrm{MLP}(\cdot)$ denotes multi-layer perception.
The output of the first Transformer frame is then taken as the input of the second, and so on; after the $l$ stacked frames, the output features are fed into an independent multi-layer perception that outputs the final target region feature $z'_o$.
For each target region, its corresponding category is recorded. A mapping V from category labels to texts is established according to the pre-training sample categories of the target detection network, so each target category label obtains a corresponding text: a sequence of length 76 delimited by [SOS] and [EOS] marks. The text is converted into a computer-understandable form by byte-pair encoding, while the semantic context is preserved by embedding the position information of the text characters, producing the text embedding features.
Passing the text embedding features through the text feature extractor $T_d$ is shown in formula (4), where $z'_d$ is the output feature of $T_d$. Layer normalization and multi-layer perception map $z'_o$ and $z'_d$ into a multimodal embedding space, outputting the image features $z_o = \mathrm{MLP}(\mathrm{LN}(z'_o))$ and the text features $z_d = \mathrm{MLP}(\mathrm{LN}(z'_d))$.
Preferably, the process of computing the semantic relevance between image features and text features to ensure the consistency of semantic features comprises:
When training the image-text semantic perception module, the semantic association objective function $L_{sem}(z_o, z_d)$ is shown in formula (5), where sim$[\cdot,\cdot]$ denotes the cosine similarity between the image features and text features of a target. Minimizing $L_{sem}(z_o, z_d)$ makes the image feature vector and the text feature vector close in absolute distance and constrained to the same direction.
When the video under test is checked for anomalies, the semantic relevance between the global image features and the local text features is computed. The image features and text features are expressed as posterior probability vectors over a group of semantic concepts; when an anomaly occurs in the video under test, the relevance between the global image features and the local text features is weak, and their cognition of the semantic space differs.
Preferably, the bidirectional prediction module with heterogeneous feature fusion enhances the extraction of normal features and suppresses the generation of abnormal features as follows:
Forward and backward coding features are extracted with a bidirectional 3D encoder composed of a forward encoder $E_f$ and a backward encoder $E_b$ that share the same structure and network parameters.
For each target region, the L frames before and the L frames after the corresponding position, excluding the target in frame t itself, form a forward target sequence and a backward target sequence. These are fed into the forward encoder $E_f$ and the backward encoder $E_b$ respectively to obtain the forward coding feature $z_f$ and the backward coding feature $z_b$, as shown in formulas (6) and (7).
The image features $z_o$, text features $z_d$, forward coding features $z_f$, and backward coding features $z_b$ are spliced into the heterogeneous fusion feature $z = \mathrm{concat}[z_f, z_b, z_o, z_d]$.
The decoder predicts the target region from the heterogeneous fusion feature: the acquired features are expressed in 2D form and the middle target region is predicted, as shown in formula (7).
The difference between the predicted target region and the real target region is expressed as an objective function, as shown in formula (8), where W and H denote the width and height of the target region.
During the training of the bidirectional prediction module with heterogeneous feature fusion, the encoders and decoder minimize the objective function using only normal samples. Because the heterogeneous fusion feature contains the content information of normal samples, the decoder can predict normal samples, whereas for abnormal samples it cannot predict the abnormal target.
Preferably, the pseudo-abnormal sequence is generated as follows: from the target region and the L frames before and after it, a continuous normal sequence $S_n$ is generated and labeled $y_n(S_n) = 0$; by random jumps, a pseudo-abnormal sequence $S_a$ of the same length as $S_n$ is generated and labeled $y_a(S_a) = 1$, where a is a random number; $S_a$ serves as the pseudo-abnormal sequence.
Preferably, the process by which the temporal attention discrimination module learns to distinguish the characteristics of pseudo-abnormal and normal sequences comprises: given the normal sequence $S_n$ and the pseudo-abnormal sequence $S_a$, where $2L+1$ is the sequence length, C the number of channels, and $W \times H$ the size of the target region, $S_n$ and $S_a$ are input to the 3D convolutional neural network of the temporal attention discrimination module to extract the temporal feature $z'_t$.
The temporal attention mechanism computes an attention score for each target region of $S_n$ and $S_a$ separately. It uses 3D average pooling and 3D max pooling to complete the compression operation over the time dimension; after the pooling, two fully connected layers produce the respective scaling factors, which are combined into the final attention scores; finally each time step is scaled by its attention score to finish the recalibration $z_t$, as shown in formula (9):
$z_t = z'_t \cdot \delta\big(fc(\mathrm{maxp}_{3D}(z'_t); \theta) + fc(\mathrm{avgp}_{3D}(z'_t); \theta)\big) \qquad (9)$
where $\mathrm{maxp}_{3D}(\cdot)$ denotes 3D max pooling, $\mathrm{avgp}_{3D}(\cdot)$ denotes 3D average pooling, $fc(\cdot; \theta)$ denotes the fully connected layers with parameters $\theta$, and $\delta$ denotes the Sigmoid activation function.
The 2D convolutional network of the temporal attention discrimination module performs nonlinear processing on the sequence; it mainly comprises convolutional layers and fully connected layers, with batch normalization, ReLU activation, and dropout used after the convolutional layers. A softmax function after the fully connected layer outputs the abnormality discrimination probability, where $S_k = \{S_n, S_a\}$. Cross entropy is used as the objective function $L(S_k)$, as shown in formula (10):
where the abnormality discrimination probability is $\mathrm{softmax}(fc_1(\mathrm{conv}(z_t)))$, $\mathrm{softmax}(\cdot)$ denotes the softmax function, $\mathrm{conv}(\cdot)$ denotes convolution, and $fc_1(\cdot)$ denotes a fully connected operation; when the sequence under test is pseudo-abnormal, $k = a$ and $y_k(S_k) = 1$, otherwise 0.
Preferably, the process of judging abnormality of the video under test by jointly using the prediction error, the semantic relevance, and the temporal information is specifically:
Spatially, the prediction error between the real target region and the predicted target region serves as the spatial anomaly score of each target's region. For any frame $I_t$ containing N targets, the maximum of all target anomaly scores in the frame is selected as the frame's spatial anomaly score $S_{spa}(I_t)$, as shown in formula (11), where $\|\cdot\|_2$ denotes the $L_2$ norm between the real and predicted target regions.
On content, for the i-th target region in frame $I_t$ and its corresponding text, the semantic relevance between the global image feature $z_o(I_t)$ and the local text feature is used as the local anomaly score, and the minimum semantic relevance is selected as the global semantic anomaly score $S_{sem}(I_t)$, as shown in formula (12), where sim$[\cdot,\cdot]$ denotes cosine similarity. The higher the probability that a target in the frame is abnormal, the smaller the semantic relevance.
Temporally, for the sequence under test that corresponds to each target, the abnormality discrimination probability output by the temporal attention discrimination module is taken as the target's temporal anomaly score; over all targets, the maximum abnormality discrimination probability is selected as the frame's temporal anomaly score $S_{tem}(I_t)$, as shown in formula (13).
The spatial anomaly score, semantic anomaly score, and temporal anomaly score are added linearly to obtain the final anomaly score $S(I_t)$, as shown in formula (14):
$S(I_t) = S_{spa}(I_t) + \alpha \cdot (1 - S_{sem}(I_t)) + \beta \cdot S_{tem}(I_t) \qquad (14)$
where $\alpha$ and $\beta$ are the semantic coefficient and the temporal parameter, respectively.
The larger the value of the anomaly score $S(I_t)$, the higher the probability that an anomaly occurs in the frame.
Preferably, the bidirectional 3D encoder is composed of six 3D convolutional layers with kernel size $3 \times 3 \times 3$; each convolution is followed by batch normalization and ReLU activation. No pooling is used after the first and third layers; the other layers use 3D max pooling with pooling size $1 \times 2 \times 2$ and stride $1 \times 2 \times 2$.
Preferably, the decoder is implemented with four upsampling stages; after each upsampling, a 2D convolutional layer with kernel size $3 \times 3$ performs feature expression, and each convolution is followed by batch normalization and ReLU activation.
The invention also proposes an electronic device comprising a memory, a processor, and program instructions stored in the memory for execution by the processor; the processor executes the program instructions to implement the steps of the method of the invention.
Compared with the prior art, the invention has the following beneficial effects:
1. Addressing the lack of local-target and global semantic perception in anomaly detection, the invention establishes an image-text semantic perception module that associates the image features and text features of target regions and ensures the consistency of the perceived semantics.
2. The invention adopts a strategy of local training and global reasoning to perceive the semantic relevance between local targets and global information, improving the accuracy of anomaly detection.
3. The invention designs a bidirectional prediction module with heterogeneous feature fusion that combines spatial, temporal, and content information to enhance the extraction of normal features and suppress the generation of abnormal features, thereby capturing more discriminative features.
4. Considering the temporal differences between frames, the invention adopts a temporal attention discrimination module to distinguish generated pseudo-abnormal sequences from normal sequences along the time dimension, improving sensitivity to anomalies.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention; those skilled in the art can obtain other drawings based on them without creative effort.
Fig. 1 is the overall framework diagram of the multi-perception video abnormal event detection method based on homologous heterogeneous information provided by an embodiment of the present invention;
Fig. 2 is a visualization of anomaly detection and localization provided by an embodiment of the present invention.
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
As shown in Fig. 1, the invention provides a multi-perception video abnormal event detection method based on homologous heterogeneous information, which mainly comprises a target detection network, an image-text semantic perception module, a bidirectional prediction module with heterogeneous feature fusion, and a temporal attention discrimination module. The target detection network adopts a YOLOv3 network to extract the targets in a video frame. The image-text semantic perception module comprises an image feature extractor $T_o$ and a text feature extractor $T_d$; it extracts image features and text features of the video and computes the semantic relevance between them to ensure the consistency of semantic features. The bidirectional prediction module with heterogeneous feature fusion comprises a forward encoder, a backward encoder, and a decoder; it enhances the extraction of normal features and suppresses the generation of abnormal features. The temporal attention discrimination module comprises a 3D convolutional neural network, a temporal attention mechanism, and a 2D convolutional network; it learns to distinguish the characteristics of pseudo-abnormal sequences from normal sequences.
Abnormality of the video under test is judged by jointly using the prediction error, the semantic relevance, and the temporal information.
The process by which the image-text semantic perception module extracts fine-grained image features and text features of the video comprises:
Through the target detection network, acquire, for a video sequence $I_1, \ldots, I_M$, the N targets in each frame and their categories, where M is the length of the video sequence and the value of N is not fixed across frames. The i-th target region of the t-th frame $I_t$, where $t = 1, 2, \ldots, M$ and $i = 1, 2, \ldots, N$, has width W, height H, and C channels. Each target region in the video frame is converted to a fixed size and uniformly divided into P sub-blocks of size $p \times p$, where $P = WH/p^2$; P is taken as the length of the input sequence of the image feature extractor $T_o$.
Each sub-block is refined and mapped to a space of fixed dimension to generate a spatial embedding, where $x = 1, 2, \ldots, P$ indexes the sub-blocks; a position embedding is added to retain the relative position information of each sub-block, giving the embedded features of formula (1).
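The sub-block embedding of formula (1) maps each $p \times p$ sub-block to a fixed-dimension vector and adds a position embedding. Below is a minimal PyTorch sketch of this step; the region size, patch size, and embedding dimension are illustrative assumptions, since the patent does not fix these values.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Divide a fixed-size target region into p x p sub-blocks and embed them."""
    def __init__(self, region_size=64, patch_size=16, in_channels=3, dim=256):
        super().__init__()
        self.num_patches = (region_size // patch_size) ** 2   # P sub-blocks
        # A strided convolution refines each sub-block and maps it to a
        # space of fixed dimension in one shot.
        self.proj = nn.Conv2d(in_channels, dim,
                              kernel_size=patch_size, stride=patch_size)
        # Learnable position embedding retains each sub-block's relative position.
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, dim))

    def forward(self, x):                   # x: (B, C, H, W), H = W = region_size
        z = self.proj(x)                    # (B, dim, H/p, W/p)
        z = z.flatten(2).transpose(1, 2)    # (B, P, dim) spatial embeddings
        return z + self.pos_embed           # embedded features of formula (1)
```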
The image feature extractor $T_o$ consists of $l$ stacked identical Transformer frames, each a serial concatenation of two residual branches. The first residual branch adds the embedded features to the output of layer normalization followed by the multi-head self-attention mechanism, yielding the intermediate features, as shown in formula (2):
$z'_\ell = \mathrm{MSA}(\mathrm{LN}(z_{\ell-1})) + z_{\ell-1} \qquad (2)$
where $\mathrm{LN}(\cdot)$ denotes layer normalization and $\mathrm{MSA}(\cdot)$ denotes the multi-head self-attention mechanism.
The second residual branch adds the intermediate features after layer normalization and multi-layer perception to the intermediate features themselves, giving the output features of the first Transformer frame, as shown in formula (3):
$z_\ell = \mathrm{MLP}(\mathrm{LN}(z'_\ell)) + z'_\ell \qquad (3)$
where $\mathrm{MLP}(\cdot)$ denotes multi-layer perception.
The output of the first Transformer frame is then taken as the input of the second, and so on; after the $l$ stacked frames, the output features are fed into an independent multi-layer perception that outputs the final target region feature $z'_o$.
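Each Transformer frame is the standard pair of residual branches of formulas (2) and (3). A sketch under that reading; the width and head count are assumptions:

```python
import torch.nn as nn

class TransformerFrame(nn.Module):
    """One of the l stacked frames: two serial residual branches."""
    def __init__(self, dim=256, heads=8, mlp_ratio=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim))

    def forward(self, z):
        # First residual branch: layer normalization + multi-head self-attention.
        h = self.ln1(z)
        z = z + self.msa(h, h, h, need_weights=False)[0]   # formula (2)
        # Second residual branch: layer normalization + multi-layer perception.
        return z + self.mlp(self.ln2(z))                   # formula (3)
```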
For each target region, its corresponding category is recorded. A mapping V from category labels to texts is established according to the pre-training sample categories of the target detection network, so each target category label obtains a corresponding text: a sequence of length 76 delimited by [SOS] and [EOS] marks. The text is converted into a computer-understandable form by byte-pair encoding, while the semantic context is preserved by embedding the position information of the text characters, producing the text embedding features.
Passing the text embedding features through the text feature extractor $T_d$ is shown in formula (4), where $z'_d$ is the output feature of $T_d$.
Since there may be a gap between the text features and the representations of the fine-grained image feature extractor, the features of the text and image modalities need to be mapped into the same multimodal space. Layer normalization and multi-layer perception therefore map $z'_o$ and $z'_d$ into a multimodal embedding space, outputting the image features $z_o = \mathrm{MLP}(\mathrm{LN}(z'_o))$ and the text features $z_d = \mathrm{MLP}(\mathrm{LN}(z'_d))$.
The process of computing the semantic relevance between image features and text features to ensure the consistency of semantic features comprises:
When training the image-text semantic perception module, the semantic association objective function $L_{sem}(z_o, z_d)$ is shown in formula (5), where sim$[\cdot,\cdot]$ denotes the cosine similarity between the image features and text features of a target. Minimizing $L_{sem}(z_o, z_d)$ makes the image feature vector and the text feature vector close in absolute distance and constrained to the same direction.
When the video under test is checked for anomalies, the semantic relevance between the global image features and the local text features is computed. The image features and text features are expressed as posterior probability vectors over a group of semantic concepts; when an anomaly occurs in the video under test, the relevance between the global image features and the local text features is weak, and their cognition of the semantic space differs.
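Formula (5) is not reproduced in this text; only its goal is stated (image and text feature vectors close in absolute distance and constrained to the same direction). The sketch below is therefore one plausible combination of a cosine (direction) term and a squared-distance term, not the patent's exact loss:

```python
import torch
import torch.nn.functional as F

def semantic_association_loss(z_o, z_d):
    """Pull paired image/text features together in distance and direction.

    z_o, z_d: (B, D) image and text features of the same targets.
    The equal weighting of the two terms is an assumption.
    """
    cos = F.cosine_similarity(z_o, z_d, dim=-1)       # sim[.,.] term
    direction = (1.0 - cos).mean()                    # constrain to same direction
    distance = (z_o - z_d).pow(2).sum(-1).mean()      # close in absolute distance
    return direction + distance
```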
The bidirectional prediction module with heterogeneous feature fusion enhances the extraction of normal features and suppresses the generation of abnormal features as follows:
Forward and backward coding features are extracted with a bidirectional 3D encoder composed of a forward encoder $E_f$ and a backward encoder $E_b$ that share the same structure and network parameters. The bidirectional 3D encoder consists of six 3D convolutional layers with kernel size $3 \times 3 \times 3$; each convolution is followed by batch normalization and ReLU activation, no pooling is used after the first and third layers, and the other layers use 3D max pooling with pooling size $1 \times 2 \times 2$ and stride $1 \times 2 \times 2$.
For each target region, the L frames before and the L frames after the corresponding position, excluding the target in frame t itself, form a forward target sequence and a backward target sequence, which are fed into the forward encoder $E_f$ and the backward encoder $E_b$ respectively to obtain the forward coding feature $z_f$ and the backward coding feature $z_b$, as shown in formulas (6) and (7).
The image features $z_o$, text features $z_d$, forward coding features $z_f$, and backward coding features $z_b$ are spliced into the heterogeneous fusion feature $z = \mathrm{concat}[z_f, z_b, z_o, z_d]$. The heterogeneous fusion feature combines temporal, spatial, and content information, reducing uncertainty in the prediction process and strengthening the temporal relation between the current frame and its context.
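A sketch of the bidirectional 3D encoder and the fusion step under the architecture just described; the channel widths, the sequence length L, and the stand-in shapes used for $z_o$ and $z_d$ are assumptions for illustration only.

```python
import torch
import torch.nn as nn

def conv3d_block(c_in, c_out, pool):
    layers = [nn.Conv3d(c_in, c_out, kernel_size=3, padding=1),
              nn.BatchNorm3d(c_out), nn.ReLU(inplace=True)]
    if pool:  # 3D max pooling of size/stride 1x2x2, skipped after layers 1 and 3
        layers.append(nn.MaxPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2)))
    return nn.Sequential(*layers)

class Encoder3D(nn.Module):
    """Six 3x3x3 convolution layers; channel widths are illustrative."""
    def __init__(self, c_in=3, widths=(32, 64, 64, 128, 128, 256)):
        super().__init__()
        pools = (False, True, False, True, True, True)
        blocks, prev = [], c_in
        for w, p in zip(widths, pools):
            blocks.append(conv3d_block(prev, w, p))
            prev = w
        self.net = nn.Sequential(*blocks)

    def forward(self, x):              # x: (B, C, L, H, W) target sequence
        return self.net(x)

# The forward and backward encoders share structure and parameters, so a single
# instance can encode both sequences; fusion is channel-wise concatenation.
enc = Encoder3D()
fwd_seq = torch.randn(1, 3, 4, 64, 64)     # L leading frames (L = 4 assumed)
bwd_seq = torch.randn(1, 3, 4, 64, 64)     # L trailing frames
z_f, z_b = enc(fwd_seq), enc(bwd_seq)      # each (1, 256, 4, 4, 4)
# z_o and z_d would come from the image-text semantic perception module;
# stand-ins of matching shape are used here purely for illustration.
z_o, z_d = torch.randn_like(z_f), torch.randn_like(z_f)
z = torch.cat([z_f, z_b, z_o, z_d], dim=1) # heterogeneous fusion feature
```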
The decoder is implemented with four upsampling stages; after each upsampling, a 2D convolutional layer with kernel size $3 \times 3$ performs feature expression, and each convolution is followed by batch normalization and ReLU activation. The decoder predicts the target region from the heterogeneous fusion feature: the acquired features are expressed in 2D form and the middle target region is predicted, as shown in formula (7).
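A sketch of the decoder under that description; how the 3D fused feature is expressed in 2D form (time folded into channels) and the channel widths are assumptions:

```python
import torch
import torch.nn as nn

class Decoder2D(nn.Module):
    """Four upsampling stages, each followed by a 3x3 conv + BN + ReLU."""
    def __init__(self, c_in=1024 * 4, widths=(256, 128, 64, 32), c_out=3):
        super().__init__()
        stages, prev = [], c_in
        for w in widths:
            stages += [nn.Upsample(scale_factor=2, mode='nearest'),
                       nn.Conv2d(prev, w, kernel_size=3, padding=1),
                       nn.BatchNorm2d(w), nn.ReLU(inplace=True)]
            prev = w
        stages.append(nn.Conv2d(prev, c_out, kernel_size=3, padding=1))
        self.net = nn.Sequential(*stages)

    def forward(self, z):                  # z: (B, C, L, H, W) fused feature
        b, c, l, h, w = z.shape
        z2d = z.reshape(b, c * l, h, w)    # express the features in 2D form
        return self.net(z2d)               # predicted middle target region

dec = Decoder2D()
z = torch.randn(1, 1024, 4, 4, 4)          # fused feature from the encoders
pred = dec(z)                              # (1, 3, 64, 64) predicted region
```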
The difference between the predicted target region and the real target region is expressed as an objective function, as shown in formula (8), where W and H denote the width and height of the target region.
During the training of the bidirectional prediction module with heterogeneous feature fusion, the encoders and decoder minimize the objective function using only normal samples. Because the heterogeneous fusion feature contains the content information of normal samples, the decoder can predict normal samples, whereas for abnormal samples it cannot predict the abnormal target.
The pseudo-abnormal sequence is generated as follows: from the target region and the L frames before and after it, a continuous normal sequence $S_n$ is generated and labeled $y_n(S_n) = 0$; by random jumps, a pseudo-abnormal sequence $S_a$ of the same length as $S_n$ is generated and labeled $y_a(S_a) = 1$, where a is a random number. Since the continuity between the frames of $S_a$ is weak, matching the irregular motion behavior of abnormal events, $S_a$ is taken as the pseudo-abnormal sequence.
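A sketch of the sequence construction; the exact jump pattern behind "random jumps" is not specified, so the sorted random sampling below is one plausible reading:

```python
import numpy as np

def make_sequences(frames, t, L=4, rng=None):
    """Build a normal sequence and a pseudo-abnormal one of the same length.

    `frames` is a list of per-frame target-region crops at one position.
    """
    rng = rng or np.random.default_rng()
    normal = [frames[j] for j in range(t - L, t + L + 1)]     # label y_n = 0
    # Sample 2L + 1 frame indices at random and sort them: temporal order is
    # kept but inter-frame continuity is broken, mimicking irregular motion.
    idx = np.sort(rng.choice(len(frames), size=2 * L + 1, replace=False))
    pseudo = [frames[j] for j in idx]                         # label y_a = 1
    return normal, pseudo
```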
The process by which the temporal attention discrimination module learns to distinguish the characteristics of pseudo-abnormal and normal sequences comprises: given the normal sequence $S_n$ and the pseudo-abnormal sequence $S_a$, where $2L+1$ is the sequence length, C the number of channels, and $W \times H$ the size of the target region, $S_n$ and $S_a$ are input to the 3D convolutional neural network of the temporal attention discrimination module to extract the temporal feature $z'_t$.
The temporal attention mechanism computes an attention score for each target region of $S_n$ and $S_a$ separately. It uses 3D average pooling and 3D max pooling to complete the compression operation over the time dimension; after the pooling, two fully connected layers produce the respective scaling factors, which are combined into the final attention scores; finally each time step is scaled by its attention score to finish the recalibration $z_t$, as shown in formula (9):
$z_t = z'_t \cdot \delta\big(fc(\mathrm{maxp}_{3D}(z'_t); \theta) + fc(\mathrm{avgp}_{3D}(z'_t); \theta)\big) \qquad (9)$
where $\mathrm{maxp}_{3D}(\cdot)$ denotes 3D max pooling, $\mathrm{avgp}_{3D}(\cdot)$ denotes 3D average pooling, $fc(\cdot; \theta)$ denotes the fully connected layers with parameters $\theta$, and $\delta$ denotes the Sigmoid activation function.
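A sketch of the recalibration of formula (9), read as a squeeze-and-excitation block along the time axis; the pooling layout and the reduction ratio of the two fully connected layers are assumptions:

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Score each time step with shared FC layers and rescale the features."""
    def __init__(self, length, reduction=2):
        super().__init__()
        # Compress everything except the temporal dimension.
        self.maxp = nn.AdaptiveMaxPool3d((length, 1, 1))
        self.avgp = nn.AdaptiveAvgPool3d((length, 1, 1))
        self.fc = nn.Sequential(                 # two fully connected layers
            nn.Linear(length, length // reduction), nn.ReLU(inplace=True),
            nn.Linear(length // reduction, length))
        self.sigmoid = nn.Sigmoid()              # the delta of formula (9)

    def forward(self, z):                        # z: (B, C, T, H, W), T = length
        b, c, t, _, _ = z.shape
        m = self.fc(self.maxp(z).reshape(b, c, t))
        a = self.fc(self.avgp(z).reshape(b, c, t))
        score = self.sigmoid(m + a)              # attention score per time step
        return z * score.reshape(b, c, t, 1, 1)  # recalibrated z_t

ta = TemporalAttention(length=9)                 # 2L + 1 = 9 for L = 4
out = ta(torch.randn(2, 64, 9, 8, 8))
```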
The 2D convolutional network of the temporal attention discrimination module performs nonlinear processing on the sequence; it mainly comprises convolution and full connection, with batch normalization and ReLU activation used after convolution to improve generalization, and dropout used to avoid overfitting. A softmax function after the fully connected layer outputs the abnormality discrimination probability, where $S_k = \{S_n, S_a\}$. Cross entropy is used as the objective function $L(S_k)$, as shown in formula (10), where the abnormality discrimination probability is $\mathrm{softmax}(fc_1(\mathrm{conv}(z_t)))$, $\mathrm{softmax}(\cdot)$ denotes the softmax function, $\mathrm{conv}(\cdot)$ denotes convolution, and $fc_1(\cdot)$ denotes a fully connected operation; when the sequence under test is pseudo-abnormal, $k = a$ and $y_k(S_k) = 1$, otherwise 0.
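A sketch of the 2D discrimination head; the layer widths and dropout rate are assumptions, and `nn.CrossEntropyLoss` folds the softmax of formula (10) into the loss:

```python
import torch
import torch.nn as nn

class TimingDiscriminator(nn.Module):
    """Conv + BN + ReLU + dropout, then a fully connected classifier."""
    def __init__(self, c_in=64, hidden=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(c_in, hidden, kernel_size=3, padding=1),
            nn.BatchNorm2d(hidden), nn.ReLU(inplace=True), nn.Dropout(0.5),
            nn.AdaptiveAvgPool2d(1))
        self.fc = nn.Linear(hidden, 2)   # normal vs. pseudo-abnormal

    def forward(self, z2d):              # z2d: (B, C, H, W) recalibrated feature
        return self.fc(self.conv(z2d).flatten(1))   # logits

head = TimingDiscriminator()
logits = head(torch.randn(2, 64, 8, 8))
loss = nn.CrossEntropyLoss()(logits, torch.tensor([0, 1]))  # formula (10)
```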
The process of judging abnormality of the video under test by jointly using the prediction error, the semantic relevance, and the temporal information is specifically:
Spatially, the prediction error between the real target region and the predicted target region serves as the spatial anomaly score of each target's region. For any frame $I_t$ containing N targets, the maximum of all target anomaly scores in the frame is selected as the frame's spatial anomaly score $S_{spa}(I_t)$, as shown in formula (11), where $\|\cdot\|_2$ denotes the $L_2$ norm between the real and predicted target regions.
On content, for the i-th target region in frame $I_t$ and its corresponding text, the semantic relevance between the global image feature $z_o(I_t)$ and the local text feature is used as the local anomaly score, and the minimum semantic relevance is selected as the global semantic anomaly score $S_{sem}(I_t)$, as shown in formula (12), where sim$[\cdot,\cdot]$ denotes cosine similarity. The higher the probability that a target in the frame is abnormal, the smaller the semantic relevance.
Temporally, for the sequence under test that corresponds to each target, the abnormality discrimination probability output by the temporal attention discrimination module is taken as the target's temporal anomaly score; over all targets, the maximum abnormality discrimination probability is selected as the frame's temporal anomaly score $S_{tem}(I_t)$, as shown in formula (13).
The spatial anomaly score, semantic anomaly score, and temporal anomaly score are added linearly to obtain the final anomaly score $S(I_t)$, as shown in formula (14):
$S(I_t) = S_{spa}(I_t) + \alpha \cdot (1 - S_{sem}(I_t)) + \beta \cdot S_{tem}(I_t) \qquad (14)$
where $\alpha$ and $\beta$ are the semantic coefficient and the temporal parameter, controlling the importance of the semantic and temporal anomaly scores relative to the spatial anomaly score.
The larger the value of the anomaly score $S(I_t)$, the higher the probability that an anomaly occurs in the frame.
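Formula (14) is a one-line fusion; a sketch with placeholder values for $\alpha$ and $\beta$, which are not disclosed in this excerpt:

```python
def anomaly_score(s_spa, s_sem, s_tem, alpha=0.3, beta=0.3):
    """Fuse the spatial, semantic, and temporal per-frame scores (formula 14)."""
    return s_spa + alpha * (1.0 - s_sem) + beta * s_tem
```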
The effectiveness of the proposed method is verified experimentally below.
Anomaly detection is evaluated on three datasets: UCSD Ped2, CUHK Avenue, and ShanghaiTech. All reported results were obtained on a machine with an Intel Xeon(R) CPU and an NVIDIA GTX 1080Ti GPU. The method is implemented mainly under Anaconda3 and Python 3.8 with the TensorFlow and PyTorch frameworks, and standard evaluation metrics are used to assess anomaly detection performance. Performance is evaluated by gradually varying the threshold on the anomaly score and computing the area under the receiver operating characteristic curve (AUC) as a scalar; higher AUC values indicate better anomaly detection performance.
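Frame-level AUC can be computed directly from per-frame anomaly scores and ground-truth labels; a sketch with illustrative data:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

scores = np.array([0.1, 0.9, 0.2, 0.8, 0.3])   # S(I_t) per frame
labels = np.array([0, 1, 0, 1, 0])             # 1 = abnormal frame
print(f"frame-level AUC: {roc_auc_score(labels, scores):.3f}")
```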
Table 1 compares the proposed method (MPFork) with several current state-of-the-art methods on the UCSD Ped2, CUHK Avenue, and ShanghaiTech datasets, including the future frame prediction method (FFP), memory-augmented deep auto-encoder (MemAE), memory-guided normality method (MNN), stacked recurrent neural network auto-encoder (sRNN-AE), generative cooperative discriminant network (GADNet), self-trained prediction model (SPM), self-supervised predictive convolutional attention block (SSPCAB), implicit dual-path auto-encoder (ITAE), multi-path prediction anomaly detection (ROADMAP), anomaly detection with graph convolutional label noise cleaning (GCLNL), variational abnormal behavior detection method (VABD), temporal-aware contrastive network (TAC-Net), and self-supervised attention generative adversarial network (SSAGAN).
TABLE 1 Comparison of AUC results of different methods
As seen from Table 1, the proposed method achieves the best performance on the UCSD Ped2, CUHK Avenue, and ShanghaiTech datasets, with detection accuracies of 99.8%, 91.3%, and 86.1%, respectively. ITAE achieves the second-best accuracy of 99.2% on the Ped2 dataset, but the proposed method exceeds it by 3.3% and 9.8% on the other two datasets, demonstrating that the proposed method helps distinguish between normal and abnormal and improves anomaly detection performance.
The anomaly localization capability and temporal sensitivity of the method are further verified with the RBDC and TBDC metrics: RBDC expresses localization capability from the overlap between the real and predicted abnormal regions, while TBDC depends on the track detection rate and the number of false-positive regions in consecutive frames. Table 2 compares the detection accuracy of self-supervised multi-task learning (SMTL), the object-centric auto-encoder (OCAE), FFP, SSPCAB, and the proposed method (MPFork).
TABLE 2 Comparison of RBDC and TBDC performance of different methods
As seen from Table 2, the proposed method gives better overall results on both the RBDC and TBDC metrics. Compared with FFP and SSPCAB, it improves RBDC and TBDC on the UCSD Ped2 dataset by 35% and 40%, respectively; obtains gains of over 15% and 25% on CUHK Avenue; and achieves improvements of 30% and 25% on ShanghaiTech. Considering both the spatial prediction of anomalies in the heterogeneous feature fusion and the temporal features during anomaly detection helps improve the sensitivity and localization of abnormal tracks. In addition, the method outperforms SMTL on all three datasets because it not only considers spatio-temporal features but also attends to the semantic relevance between local features and the global scene, reducing missed detections.
The effectiveness of the proposed method is further verified by evaluating the contribution of each strategy in the model to anomaly detection. A unidirectional frame prediction network, consisting mainly of a 3D encoder and a 2D decoder, serves as the baseline for anomaly detection. The proposed MPFork mainly comprises three modules: the bidirectional prediction module with heterogeneous feature fusion (BWH), the image-text semantic perception module (ISP), and the temporal attention discrimination module (TAD). Compared with the baseline model, the BWH module not only adds bidirectional prediction (BiP) but also performs anomaly detection with homologous heterogeneous feature fusion; this verifies the effect of the spatio-temporal features provided by bidirectional prediction on anomaly detection and, separately, the influence of heterogeneous feature fusion in frame prediction, which mainly comprises the fine-grained image features of the target region from the image feature extractor (ViF) and the text features from the text feature extractor (TeF). Table 3 shows the performance changes after adding the different strategies to the baseline model.
TABLE 3 Effect of different strategies on anomaly detection
Table 3 shows that bidirectional prediction improves performance over the baseline model, by 2.4% on UCSD Ped2 in particular, because bidirectional prediction employs forward and backward feature extraction and thus indirectly provides richer temporal information for inter-frame prediction. Adding the fine-grained image features or text features of the target region improves performance on all three datasets to different degrees, showing their effectiveness for anomaly detection. In particular, using the bidirectional prediction module with heterogeneous information fusion yields improvements of 5.0%, 2.1%, and 2.0% on the UCSD Ped2, CUHK Avenue, and ShanghaiTech datasets, respectively. Combining the temporal features of the sequence in the prediction process while considering the spatial features and content information of the current target region reduces prediction uncertainty, enhances the capture of normal features, and suppresses the generation of abnormal features. After adding the image-text semantic perception module, the improvement on the Avenue dataset is the most obvious: the Avenue dataset contains anomalies such as throwing a bag or dropping papers, which are easily missed when only local features are used, whereas the proposed image-text semantic perception attends to the relevance between local features and the global scene and more readily notices the relation between people and objects, thereby detecting such anomalies. After adding the temporal attention discrimination module, anomaly detection performance improves markedly on the CUHK Avenue and ShanghaiTech datasets; because these two datasets contain fast running, chasing, and similar events, the temporal attention discrimination module focuses on the temporal features of the target and readily distinguishes abnormal sequences from normal ones. Using all these strategies simultaneously, the proposed MPFork achieves the optimal performance, improving on the baseline model by 6.3%, 9.1%, and 7.0% on the three datasets, respectively. This result indicates that all the proposed strategies contribute to anomaly detection, as their combination optimizes the framework from different aspects and improves performance.
In addition, in real life an anomaly detection model needs not only to index the video frame in which an abnormal event occurs, but also to locate the specific position of the abnormal target within the frame. To illustrate the performance of the proposed method more intuitively, several representative cases from different datasets were selected to visualize the detection and localization results of the different models, as shown in Fig. 2, where a rectangular box marks a missed or false detection. In each dataset, the first column shows the visualization of a normal frame, and the second and third columns show the detection and localization of different abnormal frames under the different models; the first row is the actually captured video frame, and the rows below correspond to the detection and localization of BiP, BWH, and MPFork, respectively. BiP tends to judge some normal targets as abnormal, causing false detections, and is also prone to missed detections: BiP detects anomalies by the prediction error, and because its learning of the discriminative features between normal and abnormal is insufficient, the prediction error of some normal targets is relatively large and produces false detections; meanwhile, the strong representation capability of the deep network lets some abnormal targets be predicted well, producing missed detections. Compared with BiP, the energy values of some normal targets in BWH decrease while those of abnormal targets increase, indicating that BWH learns the discriminative features of normal and abnormal targets better. This is because BWH fuses the temporal, spatial, and content information of the target, which helps enhance the normal features in the solution space and suppress the generation of abnormal features during inference. The proposed MPFork shows the best detection and localization on all three datasets: on top of BWH it combines local and global semantic perception with temporal discrimination, improves the extraction of discriminative features, and at inference perceives anomalies at the level of semantic understanding as well as from the temporal perspective, which benefits anomaly detection and localization, suits real-life anomaly detection better, and improves detection performance.
Finally, the invention provides an electronic device comprising a memory, a processor, and a program stored in the memory and executable by the processor; the processor executes the program to implement the steps of the multi-perception video abnormal event detection method based on homologous heterogeneous information proposed by the invention. Note that the description of the device according to the embodiment of the present invention is similar to the description of the method embodiment and has similar beneficial effects, so it is not repeated here.
Program code for implementing the methods of the present application may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
Although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that various changes, modifications, equivalents, and improvements can be made therein without departing from the spirit and scope of the invention.
Claims (9)
1. A multi-perception video abnormal event detection method based on homologous heterogeneous information, characterized by comprising: a target detection network, an image-text semantic perception module, a bidirectional prediction module with heterogeneous feature fusion, and a temporal attention discrimination module;
the target detection network adopts a YOLOv3 network to extract the targets in a video frame;
the image-text semantic perception module comprises an image feature extractor $T_o$, a text feature extractor $T_d$, and a semantic relevance description part; the image-text semantic perception module extracts image features and text features of a video and computes the semantic relevance between them to ensure the consistency of semantic features;
the bidirectional prediction module with heterogeneous feature fusion comprises a forward encoder $E_f$, a backward encoder $E_b$, and a decoder; the bidirectional prediction module with heterogeneous feature fusion enhances the extraction of normal features and suppresses the generation of abnormal features;
the temporal attention discrimination module comprises a 3D convolutional neural network, a temporal attention mechanism, and a 2D convolutional network, and learns to distinguish the characteristics of pseudo-abnormal sequences from normal sequences;
abnormality of the video under test is judged by jointly using the prediction error, the semantic relevance, and the temporal information.
2. The method for detecting the abnormal events of the multi-perception video based on the heterogeneous information according to claim 1, wherein the process of extracting the image fine granularity feature and the text feature of the video by the graph-text meaning perception module comprises:
obtaining a video sequence I through the target detection network 1 ,…,I M N targets and the categories thereof in each frame, wherein M is the length of the video sequence, and the value of N in different frames is not fixed; the tth frame I t The ith target area ofWherein t =1,2, \8230, M, i =1,2, \8230, N, W, H and C are the width, height and channel number of the target region, respectively, to convert the target region in the video frame to a fixed size while uniformly dividing the target region into P subblocks of size P × P, whereinUsing P as the image feature extractor T o The length of the input sequence of (a);
each sub-block is subjected to characteristic refining and is mapped to a space with fixed dimensionality, and then space embedding is generatedWherein x =1,2, \ 8230;, P, addition position inlayIntoTo retain the relative position information of each sub-block to obtain the embedded characteristicsIs represented by formula (1):
the image feature extractor T o Consisting of l stacked identical transform frames, each being a serial process of concatenation of two residuals, the first one to embed a featureThe output of the layer normalization and multi-head self-attention mechanism is combined with the output of the layer normalization and multi-head self-attention mechanism to obtain intermediate characteristicsAs shown in equation (2):
wherein LN (-) represents layer normalization, MSA (-) represents a multi-head attention mechanism;
the second residual is a pair of intermediate featuresFeatures and intermediate features after layer normalization and multi-layer perceptionAdding to obtain the output characteristics of the first Transformer frameAs shown in equation (3):
wherein MLP (·) represents multi-layer perception;
then willAs input and output to a second transform frameworkBy analogy, output characteristics are obtained after the frames are stackedWill be provided withOutput Final target region feature z 'into independent Multi-layer perception' o ;
for each target region r_t^i, its corresponding classification is recorded as c_t^i; a mapping V from class labels to texts is established according to the pre-training sample classes of the target detection network, so that each target class label c_t^i obtains a corresponding text d_t^i, where d_t^i is a sequence of length 76 delimited by [SOS] and [EOS] marks; the text is converted into a computer-understandable form by byte-pair encoding, and, while preserving the semantic context, the text embedding feature e_d is obtained by embedding the position information of the text characters;
the process of passing the text embedding feature e_d through the text feature extractor T_d is shown in formula (4):

z'_d = T_d(e_d)   (4)
where z'_d is the output feature of the text feature extractor T_d; layer normalization and multi-layer perception are used to map z'_o and z'_d into a multimodal embedding space, outputting the image feature z_o = MLP(LN(z'_o)) and the text feature z_d = MLP(LN(z'_d)).
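To make the extraction pipeline of claim 2 concrete, the following is a minimal PyTorch sketch of T_o together with the multimodal projection; the embedding width, head count, depth, and module names are illustrative assumptions, not values fixed by the claims.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One of the l stacked Transformer blocks of T_o (formulas (2)-(3))."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, z):
        h = self.ln1(z)
        z = z + self.attn(h, h, h, need_weights=False)[0]  # formula (2)
        return z + self.mlp(self.ln2(z))                   # formula (3)

class ImageExtractor(nn.Module):
    """T_o: a fixed-size target region is cut into P sub-blocks, embedded
    with positions (formula (1)), passed through the stacked blocks, then an
    independent MLP head maps the pooled feature into the multimodal space."""
    def __init__(self, size=64, p=8, c=3, dim=256, depth=6):
        super().__init__()
        P = (size // p) ** 2                              # P = WH / p^2
        self.embed = nn.Conv2d(c, dim, p, stride=p)       # sub-block refinement e_x
        self.pos = nn.Parameter(torch.zeros(1, P, dim))   # position embedding e_pos
        self.blocks = nn.Sequential(*[Block(dim) for _ in range(depth)])
        self.head = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim))  # z_o = MLP(LN(z'_o))

    def forward(self, x):                                 # x: (B, C, size, size)
        z = self.embed(x).flatten(2).transpose(1, 2) + self.pos  # formula (1)
        z = self.blocks(z)
        return self.head(z.mean(dim=1))                   # pooled z'_o -> z_o
```

The text branch T_d would follow the same stacked-block pattern over the byte-pair-encoded token embeddings e_d, sharing only the final projection convention z_d = MLP(LN(z'_d)).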
3. The method according to claim 2, characterized in that calculating the semantic correlation between the image features and the text features to ensure the consistency of the semantic features comprises:
when training the image-text semantic perception module, the semantic association objective function L_sem(z_o, z_d) is as shown in formula (5):

L_sem(z_o, z_d) = Σ_{i=1}^{N} ( ‖z_o^i − z_d^i‖_2 − sim[z_o^i, z_d^i] )   (5)
where z_o^i and z_d^i respectively denote the image and text features of target i, and sim[·] denotes cosine similarity; minimizing the semantic association objective function L_sem(z_o, z_d) brings the image feature vector and the text feature vector close in absolute distance and constrains them to the same direction;
when anomaly detection is performed on the video to be detected, the semantic correlation between the global image feature and the local text features is calculated; the image features and the text features are expressed as posterior probability vectors relative to a group of semantic concepts, and when an abnormality occurs in the video under test, the correlation between the global image feature and the local text features is weak, i.e., their interpretations of the semantic space differ.
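A sketch of this objective and its test-time use, assuming the formula-(5) reading above (an L2 distance term plus a cosine direction term — a reconstruction, since the published formula image is not legible in this text):

```python
import torch
import torch.nn.functional as F

def semantic_loss(z_o: torch.Tensor, z_d: torch.Tensor) -> torch.Tensor:
    """Claim 3 objective sketch: one term pulls each image/text pair
    together in absolute distance, one aligns their directions via
    cosine similarity. z_o, z_d: (N, dim) paired features."""
    dist = (z_o - z_d).norm(dim=-1)                 # absolute-distance term
    cos = F.cosine_similarity(z_o, z_d, dim=-1)     # direction term sim[., .]
    return (dist + (1.0 - cos)).mean()

def semantic_score(z_img_global: torch.Tensor, z_txt_local: torch.Tensor) -> torch.Tensor:
    """Test-time use per claim 7 / formula (12): the weakest global-image
    to local-text correlation in a frame is the semantic anomaly cue."""
    sims = F.cosine_similarity(z_img_global.unsqueeze(0), z_txt_local, dim=-1)
    return sims.min()
```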
4. The method according to claim 3, characterized in that the process by which the heterogeneous feature fusion bidirectional prediction module enhances the extraction of normal features and suppresses the generation of abnormal features comprises:
extracting forward coding features and backward coding features using a bidirectional 3D encoder, wherein the bidirectional 3D encoder is composed of the forward encoder E_f and the backward encoder E_b, and the forward encoder E_f and the backward encoder E_b have the same structure and network parameters;
for each target region, the L frames before and the L frames after the t-th frame are taken at the corresponding position to form a forward target sequence S_f^{t,i} = {r_{t-L}^i, …, r_{t-1}^i} and a backward target sequence S_b^{t,i} = {r_{t+L}^i, …, r_{t+1}^i}, neither of which contains the target in the t-th frame; these are respectively input into the forward encoder E_f and the backward encoder E_b to obtain the forward coding feature z_f and the backward coding feature z_b, as shown in formulas (6) and (7):

z_f = E_f(S_f^{t,i})   (6)
z_b = E_b(S_b^{t,i})   (7)
the image feature z_o, text feature z_d, forward coding feature z_f, and backward coding feature z_b are spliced to obtain the heterogeneous fusion feature z = concat[z_f, z_b, z_o, z_d];
the decoder predicts the target region based on the heterogeneous fusion feature: the acquired feature is expressed in 2D form and the intermediate target region is predicted as r̂_t^i = D(z);
the difference between the predicted target region and the real target region is expressed as the objective function L_pre(r_t^i, r̂_t^i), as shown in formula (8):

L_pre(r_t^i, r̂_t^i) = (1/(W·H)) ‖ r_t^i − r̂_t^i ‖_2²   (8)

where r_t^i and r̂_t^i are respectively the real target region and the predicted target region, and W and H are respectively the width and height of the target region;
in the training process of the heterogeneous feature fusion bidirectional prediction module, only normal samples are used, and the encoder and decoder minimize the objective function L_pre; the heterogeneous fusion feature therefore encodes the content information of normal samples, so the decoder can predict normal samples, while for abnormal samples the decoder cannot predict the abnormal target.
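A minimal sketch of claim 4's prediction path, assuming a shared-weight 3D encoder standing in for E_f and E_b (the claim requires identical structure and parameters) and a toy decoder; channel widths and layer counts here are assumptions:

```python
import torch
import torch.nn as nn

class BiPredictor(nn.Module):
    """One shared 3D encoder plays both E_f and E_b, the four streams are
    spliced into z = concat[z_f, z_b, z_o, z_d], and a toy decoder predicts
    the withheld middle target region."""
    def __init__(self, c=3, feat=128, sem=256, size=64):
        super().__init__()
        self.encoder = nn.Sequential(              # shared E_f / E_b
            nn.Conv3d(c, feat, 3, padding=1), nn.BatchNorm3d(feat), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten())
        self.decoder = nn.Sequential(              # D(z) -> (B, c, size, size)
            nn.Linear(2 * feat + 2 * sem, c * size * size),
            nn.Unflatten(1, (c, size, size)))

    def forward(self, s_f, s_b, z_o, z_d):         # s_f, s_b: (B, C, L, H, W)
        z = torch.cat([self.encoder(s_f), self.encoder(s_b), z_o, z_d], dim=1)
        return self.decoder(z)                     # predicted middle region

def prediction_loss(r, r_hat):
    """Formula (8): per-pixel squared error averaged over the W x H region."""
    return ((r - r_hat) ** 2).mean()
```

Trained only on normal samples, the decoder learns to complete normal motion; for an abnormal target the fused features carry no matching content, so the prediction error stays large.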
5. The method for detecting multi-perception video abnormal events based on homologous heterogeneous information according to claim 4, characterized in that the generation process of the pseudo-abnormal sequence is as follows: according to the target region r_t^i and the L frames before and after it, a continuous normal sequence S_n = {r_{t-L}^i, …, r_t^i, …, r_{t+L}^i} is generated and marked y_n(S_n) = 0; a pseudo-abnormal sequence S_a of the same length as S_n is generated by random jumps of stride a and marked y_a(S_a) = 1, where a is a random number and S_a is the pseudo-abnormal sequence.
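A small sketch of this sampling step; the jump-stride range and helper name are assumptions, since the claim only says a is random:

```python
import random

def make_training_pair(track, t, L):
    """Claim 5 sketch: 'track' is the list of crops of one target at a fixed
    position across frames (assumes 0 <= t-L and t+L < len(track)). Returns
    the continuous normal sequence S_n (label 0) and a same-length
    random-jump pseudo-abnormal sequence S_a (label 1)."""
    s_n = track[t - L:t + L + 1]                            # S_n, y_n(S_n) = 0
    a = random.randint(2, 5)                                # random jump stride a
    assert len(track) > a * 2 * L, "track too short for this stride"
    start = random.randrange(len(track) - a * 2 * L)
    s_a = [track[start + a * k] for k in range(2 * L + 1)]  # S_a, y_a(S_a) = 1
    return (s_n, 0), (s_a, 1)
```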
6. The method for detecting multi-perception video abnormal events based on homologous heterogeneous information according to claim 5, characterized in that the process by which the temporal attention discrimination module learns to discriminate the features of pseudo-abnormal sequences from normal sequences comprises: given a normal sequence S_n ∈ R^{(2L+1)×C×W×H} and a pseudo-abnormal sequence S_a ∈ R^{(2L+1)×C×W×H}, where 2L+1 is the sequence length, C is the number of channels, and W × H is the size of the target region, S_n and S_a are input into the 3D convolutional neural network to extract the temporal feature z'_t;
the attention score of each target region in S_n and S_a is calculated separately using the temporal attention mechanism; the temporal attention mechanism uses 3D average pooling and 3D maximum pooling to perform the compression operation over the time dimension; after the 3D average pooling and 3D maximum pooling, two fully connected layers are used respectively to obtain different scaling factors and the final attention scores; finally, each time step is scaled according to its attention score to complete the recalibration z_t, as shown in formula (9):
z_t = z'_t · δ( fc(maxp_3D(z'_t); θ) + fc(avgp_3D(z'_t); θ) )   (9)
where maxp_3D(·) denotes 3D maximum pooling, avgp_3D(·) denotes 3D average pooling, fc(·; θ) denotes a fully connected layer with parameters θ, and δ denotes the Sigmoid activation function;
the 2D convolutional network performs nonlinear processing on the sequence; it consists mainly of convolution and full connection, with batch normalization, ReLU activation, and dropout applied after convolution; the fully connected layer is followed by a softmax function that outputs the abnormality discrimination probability ŷ(S_k), where S_k ∈ {S_n, S_a}; cross entropy is used as the objective function L(S_k), as shown in formula (10):

L(S_k) = −[ y(S_k) log ŷ(S_k) + (1 − y(S_k)) log(1 − ŷ(S_k)) ]   (10)
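A sketch of the formula-(9) recalibration, assuming the features are laid out as (B, C, T, H, W) and that the two pooling paths share one two-layer fc(·; θ), as the shared θ in the formula suggests:

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Formula (9) sketch: squeeze everything but the time axis with max
    and average pooling, pass both descriptors through a shared two-layer
    fully connected fc(.; theta), and gate each time step with the
    resulting Sigmoid score."""
    def __init__(self, t_len, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(t_len, t_len // reduction), nn.ReLU(),
            nn.Linear(t_len // reduction, t_len))

    def forward(self, z):                       # z: (B, C, T, H, W)
        mx = z.amax(dim=(1, 3, 4))              # 3D max pooling  -> (B, T)
        av = z.mean(dim=(1, 3, 4))              # 3D average pooling -> (B, T)
        score = torch.sigmoid(self.fc(mx) + self.fc(av))      # delta(fc + fc)
        return z * score.view(z.size(0), 1, z.size(2), 1, 1)  # recalibration
```

For a sequence length 2L+1 = 9, `TemporalAttention(t_len=9)` rescales each of the nine time steps before the 2D discriminator head sees them.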
7. The method for detecting multi-perception video abnormal events based on homologous heterogeneous information according to claim 6, characterized in that the process of performing abnormality judgment on the video to be detected by combining the prediction error, the semantic correlation, and the temporal information specifically comprises:
spatially, the prediction error between the real target region and the predicted target region is used as the spatial anomaly score of each target's region; for an arbitrary frame I_t in which N targets exist, the maximum of all target anomaly scores in the frame is selected as the spatial anomaly score S_spa(I_t) of the frame, as shown in formula (11):

S_spa(I_t) = max_{i=1,…,N} ‖ r_t^i − r̂_t^i ‖_2   (11)

where r_t^i denotes the real target region, r̂_t^i denotes the predicted target region, and ‖·‖_2 denotes the L_2 norm;
in terms of content, the text corresponding to the i-th target region r_t^i of frame I_t is d_t^i; the semantic correlation between the global image feature z_o(I_t) and the local text feature z_d(d_t^i) is used as the local anomaly score, and the minimum semantic correlation is selected as the global semantic anomaly score S_sem(I_t), as shown in formula (12):

S_sem(I_t) = min_{i=1,…,N} sim[ z_o(I_t), z_d(d_t^i) ]   (12)

where sim[·] denotes cosine similarity; the higher the probability that an abnormal target appears in the frame, the smaller the semantic correlation;
in terms of time sequence, the sequence under test corresponding to target r_t^i is S_t^i, and the abnormality discrimination probability output by the temporal attention discrimination module is used as the temporal anomaly score of target r_t^i; over all targets, the maximum abnormality discrimination probability is selected as the temporal anomaly score S_tem(I_t) of the frame, as shown in formula (13):

S_tem(I_t) = max_{i=1,…,N} ŷ(S_t^i)   (13)
the spatial anomaly score, the semantic anomaly score, and the temporal anomaly score are linearly combined to obtain the final anomaly score S(I_t), as shown in formula (14):
S(I_t) = S_spa(I_t) + α·(1 − S_sem(I_t)) + β·S_tem(I_t)   (14)
where α and β are respectively the semantic coefficient and the temporal coefficient;
the larger the value of the anomaly score S(I_t), the greater the probability that an abnormality occurs in the frame.
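The per-frame fusion of formulas (11)-(14) reduces to a few lines; α and β below are placeholders, since the claims do not fix their values:

```python
def frame_anomaly_score(spa_errors, sem_sims, tem_probs, alpha=0.5, beta=0.5):
    """Formulas (11)-(14) for one frame: worst prediction error, weakest
    image-text correlation, strongest pseudo-anomaly probability, linearly
    combined. Each argument is the per-target list for that frame."""
    s_spa = max(spa_errors)          # formula (11): spatial anomaly score
    s_sem = min(sem_sims)            # formula (12): semantic anomaly score
    s_tem = max(tem_probs)           # formula (13): temporal anomaly score
    return s_spa + alpha * (1.0 - s_sem) + beta * s_tem   # formula (14)
```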
8. The method for detecting multi-perception video abnormal events based on homologous heterogeneous information according to claim 4, characterized in that:
the bidirectional 3D encoder is composed of six 3D convolutional layers with a convolution kernel size of 3 × 3 × 3; batch normalization and ReLU activation are performed after each convolutional layer; no pooling operation is used after the first and third convolutional layers, while the remaining layers use 3D maximum pooling with a pooling size of 1 × 2 × 2 and a stride of 1 × 2 × 2;
the decoder is implemented with four upsampling operations; after each upsampling, a 2D convolutional layer with a kernel size of 3 × 3 is used for feature expression, and batch normalization and ReLU activation are performed after each convolutional layer.
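Claim 8's layer stack translates directly into builders like the following; the channel widths are assumptions, as the claim specifies only kernel sizes, pooling, and layer counts:

```python
import torch.nn as nn

def build_encoder(c_in=3, widths=(32, 32, 64, 64, 128, 128)):
    """Claim 8 encoder sketch: six 3x3x3 Conv3d layers, each followed by
    BN + ReLU; 1x2x2 max pooling (stride 1x2x2) after every layer except
    the first and third."""
    layers, prev = [], c_in
    for i, w in enumerate(widths):
        layers += [nn.Conv3d(prev, w, 3, padding=1), nn.BatchNorm3d(w), nn.ReLU()]
        if i not in (0, 2):                    # no pooling after layers 1 and 3
            layers.append(nn.MaxPool3d((1, 2, 2), stride=(1, 2, 2)))
        prev = w
    return nn.Sequential(*layers)

def build_decoder(c_in=128, widths=(128, 64, 32, 3)):
    """Claim 8 decoder sketch: four 2x upsamplings, each followed by a
    3x3 Conv2d with BN + ReLU (widths again assumed)."""
    layers, prev = [], c_in
    for w in widths:
        layers += [nn.Upsample(scale_factor=2),
                   nn.Conv2d(prev, w, 3, padding=1), nn.BatchNorm2d(w), nn.ReLU()]
        prev = w
    return nn.Sequential(*layers)
```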
9. An electronic device comprising a memory, a processor, and program instructions stored in the memory for execution by the processor, wherein the program instructions, when executed by the processor, perform the steps of the method of any one of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---
CN202211484883.9A | 2022-11-24 | 2022-11-24 | Multi-perception video abnormal event detection method and device based on homologous heterogeneous information
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---
CN202211484883.9A | 2022-11-24 | 2022-11-24 | Multi-perception video abnormal event detection method and device based on homologous heterogeneous information
Publications (1)
Publication Number | Publication Date |
---|---
CN115797830A (en) | 2023-03-14
Family
ID=85441197
Family Applications (1)
Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN202211484883.9A | Multi-perception video abnormal event detection method and device based on homologous heterogeneous information | 2022-11-24 | 2022-11-24

Country Status (1)

Country | Link
---|---
CN (1) | CN115797830A (en)
Cited By (8)
Publication number | Priority date | Publication date | Title
---|---|---|---
CN116257142A (en) * | 2023-05-12 | 2023-06-13 | 福建省亿鑫海信息科技有限公司 | Security monitoring method and terminal based on multi-mode data characterization |
CN116257142B (en) * | 2023-05-12 | 2023-07-21 | 福建省亿鑫海信息科技有限公司 | Security monitoring method and terminal based on multi-mode data characterization |
CN116506216A (en) * | 2023-06-19 | 2023-07-28 | 国网上海能源互联网研究院有限公司 | Lightweight malicious flow detection and evidence-storage method, device, equipment and medium |
CN116506216B (en) * | 2023-06-19 | 2023-09-12 | 国网上海能源互联网研究院有限公司 | Lightweight malicious flow detection and evidence-storage method, device, equipment and medium |
CN116886991A (en) * | 2023-08-21 | 2023-10-13 | 珠海嘉立信发展有限公司 | Method, apparatus, terminal device and readable storage medium for generating video data |
CN116886991B (en) * | 2023-08-21 | 2024-05-03 | 珠海嘉立信发展有限公司 | Method, apparatus, terminal device and readable storage medium for generating video data |
CN118568650A (en) * | 2024-08-05 | 2024-08-30 | 山东省计算中心(国家超级计算济南中心) | Industrial anomaly detection method and system based on fine-grained text prompt feature engineering |
CN118568650B (en) * | 2024-08-05 | 2024-10-15 | 山东省计算中心(国家超级计算济南中心) | Industrial anomaly detection method and system based on fine-grained text prompt feature engineering |
Similar Documents
Publication | Publication Date | Title
---|---|---
CN115797830A (en) | Multi-perception video abnormal event detection method and device based on homologous heterogeneous information | |
CN114119638B (en) | Medical image segmentation method integrating multi-scale features and attention mechanisms | |
Cong et al. | Does thermal really always matter for RGB-T salient object detection? | |
CN113947702B (en) | Multi-mode emotion recognition method and system based on context awareness | |
Antwi-Bekoe et al. | A deep learning approach for insulator instance segmentation and defect detection | |
KR20190105180A (en) | Apparatus for Lesion Diagnosis Based on Convolutional Neural Network and Method thereof | |
CN115861616A (en) | Semantic segmentation system for medical image sequence | |
CN112967227B (en) | Automatic diabetic retinopathy evaluation system based on focus perception modeling | |
CN113011399A (en) | Video abnormal event detection method and system based on generation cooperative judgment network | |
CN113705490A (en) | Anomaly detection method based on reconstruction and prediction | |
CN111860248B (en) | Visual target tracking method based on twin gradual attention-guided fusion network | |
Lee et al. | Latent-OFER: detect, mask, and reconstruct with latent vectors for occluded facial expression recognition | |
CN114511502A (en) | Gastrointestinal endoscope image polyp detection system based on artificial intelligence, terminal and storage medium | |
Singh et al. | Attention-guided generator with dual discriminator GAN for real-time video anomaly detection | |
CN115412324A (en) | Air-space-ground network intrusion detection method based on multi-mode conditional countermeasure field adaptation | |
CN116542921A (en) | Colon polyp segmentation method, device and storage medium | |
Zhai et al. | Spike-based optical flow estimation via contrastive learning | |
CN117671349A (en) | Perimeter intrusion target detection and tracking method | |
CN115690665B (en) | Video anomaly detection method and device based on cross U-Net network | |
Wang et al. | Scene uyghur recognition with embedded coordinate attention | |
Li et al. | Fingertip blood collection point localization research based on infrared finger vein image segmentation | |
Li et al. | Cross-modality integration framework with prediction, perception and discrimination for video anomaly detection | |
CN115359511A (en) | Pig abnormal behavior detection method | |
CN115188068A (en) | Three-dimensional human behavior recognition method under small sample condition | |
Gallo et al. | Boosted wireless capsule endoscopy frames classification |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| WW01 | Invention patent application withdrawn after publication | Application publication date: 20230314