CN117690191B - Intelligent detection method for weak supervision abnormal behavior of intelligent monitoring system - Google Patents


Info

Publication number: CN117690191B
Application number: CN202410150884.2A
Authority: CN (China)
Prior art keywords: modality, video, audio, abnormal behavior, features
Legal status: Active (granted; the legal status is an assumption and is not a legal conclusion)
Other languages: Chinese (zh)
Other versions: CN117690191A
Inventors: 徐小龙, 王珺
Original and current assignee: Nanjing University of Posts and Telecommunications
Application filed by Nanjing University of Posts and Telecommunications
Publication of application: CN117690191A
Application granted; publication of grant: CN117690191B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses a weakly supervised abnormal behavior intelligent detection method for an intelligent monitoring system, in the technical field of abnormal behavior detection. The method combines weakly supervised pseudo-label generation with cross-modal interaction in a single network structure to address the absence of fine-grained segment labels under weak supervision and to improve the accuracy of weakly supervised abnormal behavior recognition. Its loss function meets the requirements of both frame-level and video-level recognition and is robust to noise, so the method can be applied to abnormal behavior detection tasks on intelligent monitoring devices.

Description

Intelligent detection method for weak supervision abnormal behavior of intelligent monitoring system
Technical Field
The invention relates to the technical field of abnormal behavior detection, and in particular to a weakly supervised abnormal behavior intelligent detection method for an intelligent monitoring system.
Background
In modern society, monitoring devices are deployed almost everywhere, especially in important public-security areas such as schools, hospitals and shopping malls. Today these devices mainly record what happens in their area so that the footage can later be reviewed as evidence, which leaves their utilization low. Making full use of the large amount of data they collect and extending their function to intelligent anomaly monitoring is therefore a very promising research direction. With the continuous development of deep learning, anomaly detection technology has matured, and improvements in monitoring hardware provide the necessary conditions for this functional extension, making intelligent monitoring devices capable of anomaly detection feasible.
With the development of monitoring hardware, most monitoring devices today can capture both visual and auditory data. Anomaly detection research has accordingly turned to multi-modal fusion techniques that combine video and audio information. Multi-modal learning aims to build models that can process correlated multi-source information; it has been shown to aggregate information across data sources so that a model learns a more complete representation, avoiding the limitations of a single modality in anomaly detection. Human judgment likewise relies on multi-modal rather than single-modal cues: in a car accident, for example, the loud sound of the collision together with visible flames and smoke forms the basis of our judgment. Multi-modal anomaly detection is therefore closer to human judgment and better suited to human-computer interaction.
The main technical difficulty in multi-modal deep learning comes from the heterogeneity of the data. For audio-video anomaly recognition in particular: first, video data is represented as sequences of frames while audio data is represented as audio signals; second, the correspondence between audio and video data at the instance level must be established; finally, a fusion model of the audio and video data must be built to obtain the complementary information that each single modality lacks.
Weakly supervised anomaly detection asks how to identify the abnormal segments within a complete video that is only labeled as abnormal, a research topic with far-reaching application prospects. Given today's massive data volumes, fine-grained annotation is time-consuming and expensive, especially at the frame level of video data. Video-level labels are a good compromise, which gives weakly supervised anomaly detection a clear advantage in the face of mass data.
Multi-modal weakly supervised anomaly detection therefore has broad development prospects but remains a challenging task: existing abnormal behavior detection methods still need improvement in accuracy, applicability and other respects. The prior art suffers from insufficient multi-modal fusion, neglect of segment-level recognition under weak supervision, and low model robustness.
Disclosure of Invention
The invention aims to overcome the defects of the prior art described above by providing a weakly supervised abnormal behavior intelligent detection method for an intelligent monitoring system.
The invention adopts the following technical scheme for solving the technical problems:
The invention provides a weakly supervised abnormal behavior intelligent detection method for an intelligent monitoring system, comprising the following steps:
Step 1: extract the features of the video modality and the features of the audio modality;
Step 2: apply a self-attention network to enhance the video-modality and audio-modality features separately;
Step 3: feed the self-attention-enhanced features from Step 2 into a multi-layer perceptron to extract high-level semantic features for the video and audio modalities;
Step 4: normalize the mean of the video and audio high-level semantic features with an activation function to obtain segment-level abnormal behavior pseudo labels;
Step 5: normalize the high-level semantic features from Step 3 to obtain gating information for background suppression; use the video-modality gating information to enhance the audio-modality features from Step 2, yielding background-enhanced audio features, and use the audio-modality gating information to enhance the video-modality features from Step 2, yielding background-enhanced video features;
Step 6: apply cross-modal attention enhancement to the background-enhanced audio and video features from Step 5 to obtain fused audio-video features, and use a multi-layer perceptron to obtain the final multi-modal abnormal behavior probability values;
Step 7: treat the segment-level abnormal behavior pseudo labels from Step 4 as noisy labels and compute a loss value against the multi-modal abnormal behavior probability values from Step 6;
Step 8: compute a loss value between the multi-modal abnormal behavior probability values from Step 6 and the video-level labels in a multiple-instance learning manner;
Step 9: take the weighted sum of the loss values from Steps 7 and 8 as the total loss; Steps 2 to 6 constitute the weakly supervised multi-modal abnormal behavior detection network model, which is trained with this loss.
As a further optimization of the weakly supervised abnormal behavior intelligent detection method for an intelligent monitoring system, Step 2 specifically comprises:
the features of the video modality and the features of the audio modality are separately passed through a self-attention network, which maps queries, keys and values into vectors, computes the dot-product matrix and applies softmax normalization, finally obtaining the weighted sum of the value vectors $Z$:

$$Z = \operatorname{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V, \qquad Q = XW_Q,\; K = XW_K,\; V = XW_V$$

where $Q$ denotes the query vector matrix, $X$ the features of the audio/video modality obtained in Step 1, $K$ the key vector matrix, $V$ the value vector matrix, and $W_Q$, $W_K$, $W_V$ the learnable parameter matrices of the query, key and value projections; $QK^{T}$ is the dot-product matrix between the query and key matrices, the superscript $T$ denotes matrix transposition, $d_k$ is the dimension of the key vectors, and $\sqrt{d_k}$ normalizes the dot product so that it does not become too large or too small; $\operatorname{softmax}$ is the activation function that normalizes the scores into probability values.
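The scaled dot-product self-attention described in this step can be sketched in plain Python. This is a minimal illustration with toy fixed projection matrices; in the patent's model the projections $W_Q$, $W_K$, $W_V$ are learned parameters:

```python
import math

def softmax(row):
    # numerically stable softmax over one row of attention scores
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def matmul(a, b):
    # naive matrix product: (n x k) @ (k x m) -> (n x m)
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def self_attention(x, w_q, w_k, w_v):
    # x: T x d segment features (one row per clip segment)
    # w_q / w_k / w_v: d x d_k projection matrices (learned in the real model)
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(w_k[0])
    scores = matmul(q, transpose(k))                      # T x T dot products
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]            # row-wise softmax
    return matmul(weights, v)                             # weighted sum of values
```

In the patent the same routine is applied independently to the video features and the audio features before the two modalities interact; the $\sqrt{d_k}$ scaling is exactly the normalization the description credits with keeping the dot products from becoming too large or too small.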
As a further optimization of the weakly supervised abnormal behavior intelligent detection method for an intelligent monitoring system, the multi-layer perceptron in Step 3 comprises three fully connected layers, computed as:

$$F = W_3\big(W_2(W_1 Z + b_1) + b_2\big) + b_3$$

where $W_1$, $W_2$, $W_3$ denote the learnable parameter matrices of the three fully connected layers, $b_1$, $b_2$, $b_3$ their bias terms, and $F$ the high-level semantic features of the video/audio modality.
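Step 3's three-fully-connected-layer perceptron reduces to repeated affine maps. The sketch below (plain Python, one feature vector at a time) mirrors that structure; it omits any inter-layer activation, which the patent text does not specify:

```python
def mlp3(x, layers):
    # x: one feature vector; layers: three (W, b) pairs with W[in][out]
    # mirrors F = W3(W2(W1*z + b1) + b2) + b3 from the description
    h = x
    for w, b in layers:
        h = [sum(hj * w[j][o] for j, hj in enumerate(h)) + b[o]
             for o in range(len(b))]
    return h
```

A real implementation would batch this over all segments of a clip and likely insert a nonlinearity between layers; the point here is only the three stacked fully connected transforms.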
As a further optimization of the weakly supervised abnormal behavior intelligent detection method for an intelligent monitoring system, in Step 4 the mean of the video and audio high-level semantic features is normalized by an activation function to obtain the segment-level abnormal behavior pseudo labels, specifically:
the video and audio high-level semantic features obtained in Step 3 are averaged and normalized by the activation function into the final anomaly score, which serves as the segment-level abnormal behavior pseudo label:

$$\hat{y} = \sigma\!\left(\frac{F_v + F_a}{2}\right)$$

where $\hat{y}$ denotes the segment-level abnormal behavior pseudo label, $\sigma$ the Sigmoid activation function, $F_v$ the high-level semantic features of the video modality, and $F_a$ the high-level semantic features of the audio modality.
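Assuming the high-level semantic features have already been collapsed to one scalar score per segment, Step 4's pseudo-label computation is just a sigmoid of the per-segment mean of the two modalities' scores:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def segment_pseudo_labels(video_scores, audio_scores):
    # sigmoid of the per-segment mean of the two modality scores
    return [sigmoid((v + a) / 2.0) for v, a in zip(video_scores, audio_scores)]
```

Each resulting value lies in (0, 1) and is later treated as a noisy segment-level label rather than ground truth, which is why the training loss must tolerate label noise.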
As a further optimization of the weakly supervised abnormal behavior intelligent detection method for an intelligent monitoring system, in Step 5 the video and audio high-level semantic features are normalized to obtain the gating information for background suppression:

$$g_v = \sigma(F_v), \qquad g_a = \sigma(F_a)$$

where $g_v$ denotes the gating information of the video modality and $g_a$ that of the audio modality, representing the importance of each segment in the video and audio modalities respectively; $g_v, g_a \in \mathbb{R}^{D \times 1}$, where $D$ denotes the number of segments and $D \times 1$ the matrix dimension of the gating information.
The background-suppression-enhanced features are computed as:

$$\tilde{X}_v = \alpha Z_v + (1-\alpha)\, g_a \odot Z_v, \qquad \tilde{X}_a = \alpha Z_a + (1-\alpha)\, g_v \odot Z_a$$

where $\alpha$ denotes the weighting coefficient, $\tilde{X}_v$ the background-enhanced video features, $\tilde{X}_a$ the background-enhanced audio features, $Z_v$ the self-attention-enhanced features of the video modality, and $Z_a$ the self-attention-enhanced features of the audio modality.
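A sketch of the background suppression: each segment's features in one modality are scaled by the sigmoid gate derived from the other modality's per-segment scores, then blended with the original features. The single blending coefficient `alpha` and the exact combination form are assumptions; the patent states only that the original and gate-enhanced features are combined with weighting coefficients:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def background_suppress(features, cross_gate_logits, alpha=0.5):
    # features: T x d segment features of one modality
    # cross_gate_logits: length-T per-segment scores from the OTHER modality
    # alpha: assumed blending coefficient between original and gated features
    out = []
    for feat, logit in zip(features, cross_gate_logits):
        g = sigmoid(logit)  # importance of this segment per the other modality
        out.append([alpha * f + (1.0 - alpha) * g * f for f in feat])
    return out
```

Segments the other modality scores near zero are attenuated toward `alpha * feature` (background suppressed), while segments it scores highly pass through almost unchanged.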
In Step 6, the fused audio-video features comprise the features of a video-modality stream and of an audio-modality stream, expressed as:

$$Z'_v = \operatorname{softmax}\!\left(\frac{Q_v K_a^{T}}{\sqrt{d_k}}\right)V_a, \qquad Z'_a = \operatorname{softmax}\!\left(\frac{Q_a K_v^{T}}{\sqrt{d_k}}\right)V_v$$

with $Q_v = \tilde{X}_v W_Q$, $K_a = Z_a W_K$, $V_a = Z_a W_V$ and $Q_a = \tilde{X}_a W_Q$, $K_v = Z_v W_K$, $V_v = Z_v W_V$, where $Q_v$ and $Q_a$ denote the query vector matrices of the video and audio modality streams, $K_v$ and $K_a$ their key vector matrices, $V_v$ and $V_a$ their value vector matrices, $W_Q$, $W_K$, $W_V$ the learnable parameter matrices of the query, key and value projections, $Z'_v$ the features of the video modality stream, and $Z'_a$ the features of the audio modality stream. The features of the video and audio modality streams are added to obtain the final fused features of the two modalities, which are then passed through the fully connected layer to obtain the final multi-modal abnormal behavior probability values.
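The two cross-modal streams and their additive fusion can be sketched as follows. Projection matrices are omitted for brevity; per the description, queries come from the background-suppressed features of one modality while keys and values come from the original features of the opposite modality:

```python
import math

def _softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def _matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def cross_modal_attention(q_feats, kv_feats):
    # queries from the background-suppressed features of one modality,
    # keys/values from the ORIGINAL features of the other modality
    d_k = len(kv_feats[0])
    k_t = [list(col) for col in zip(*kv_feats)]           # transpose for Q.K^T
    scores = _matmul(q_feats, k_t)
    weights = [_softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return _matmul(weights, kv_feats)

def fuse_streams(video_bs, audio_bs, video_raw, audio_raw):
    # two streams, fused by element-wise addition as in the description
    v_stream = cross_modal_attention(video_bs, audio_raw)  # video Q, audio K/V
    a_stream = cross_modal_attention(audio_bs, video_raw)  # audio Q, video K/V
    return [[x + y for x, y in zip(r1, r2)]
            for r1, r2 in zip(v_stream, a_stream)]
```

Swapping keys and values across modalities is what conditions each stream's attention on the other modality, as the cross-modal attention enhancement module requires.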
In Step 7, the segment-level abnormal behavior pseudo labels obtained in Step 4 are used as noisy labels, and the loss value against the multi-modal abnormal behavior probability values obtained in Step 6 is computed as follows:
the segment-level abnormal behavior pseudo labels from Step 4 serve as noisy labels, and the loss between the multi-modal abnormal behavior probability values and these noisy labels is computed with a noise-tolerant loss function consisting of a weighted sum of the mean absolute error (MAE) and the normalized cross entropy (NCE).
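For the binary normal/abnormal case this noise loss collapses to a scalar computation. The patent specifies only the MAE + NCE weighted-sum structure, so the particular NCE normalization below (dividing the cross entropy by its sum over both possible labels) and the weight `lam` are assumptions:

```python
import math

def noise_loss(p, y, lam=0.5, eps=1e-7):
    # p: predicted anomaly probability; y: noisy pseudo label in [0, 1]
    # lam: assumed weight between the MAE and NCE terms
    p = min(max(p, eps), 1.0 - eps)
    mae = abs(p - y)
    ce = -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    # normalize the cross entropy by its sum over both possible labels
    nce = ce / -(math.log(p) + math.log(1.0 - p))
    return lam * mae + (1.0 - lam) * nce
```

Both terms are bounded, which is what makes the combination robust to mislabeled segments compared with plain cross entropy, whose value explodes on confident wrong labels.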
As a further optimization of the weakly supervised abnormal behavior intelligent detection method for an intelligent monitoring system, in Step 9 the weighted sum of the loss values from Step 7 and Step 8 is taken as the total loss $L$:

$$L = \lambda_1 L_{mil} + \lambda_2 L_{noise}$$

where $L_{mil}$ denotes the loss between the multi-modal abnormal behavior probability values and the video-level labels obtained in the multiple-instance learning manner of Step 8, $L_{noise}$ the loss between the segment-level abnormal behavior pseudo labels and the multi-modal abnormal behavior probability values obtained in Step 7, and $\lambda_1$, $\lambda_2$ the weights of the weighted sum.
As a further optimization of the weakly supervised abnormal behavior intelligent detection method for an intelligent monitoring system, in Step 1 the features of the video modality are extracted with a pre-trained I3D network and the features of the audio modality with a pre-trained VGGish network.
Compared with the prior art, the technical scheme provided by the invention has the following technical effects:
(1) The invention is a multi-modal abnormal behavior detection algorithm: it constructs a weakly supervised multi-modal abnormal behavior detection model that performs anomaly recognition from the information of both the video and audio modalities, meeting the practical scene requirements of intelligent monitoring.
(2) The invention uses a self-attention mechanism and a multi-layer perceptron so that the weakly supervised multi-modal abnormal behavior detection model attends to segment-level prediction, strengthening the model's ability to distinguish abnormal from normal and effectively addressing the absence of frame-level labels under weak supervision.
(3) The invention designs a multi-modal fusion module that uses the single-modality segment-level pseudo labels as the basis for suppressing the background of the other modality's data, so that the weakly supervised model focuses on the key abnormal intervals; a cross-modal attention mechanism then fuses the data of the video and audio modalities so the model can extract key information from both, improving the detection performance of the weakly supervised multi-modal abnormal behavior detection model.
(4) The invention designs a loss function suited to training the weakly supervised multi-modal abnormal behavior detection model; it attends to both segment-level and video-level prediction, effectively addresses the absence of video segment labels, is robust to noise, and meets practical application requirements.
Drawings
Fig. 1 is a schematic diagram of the overall model of the weakly supervised abnormal behavior intelligent detection method for an intelligent monitoring system according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the background suppression method of the weakly supervised abnormal behavior intelligent detection method for an intelligent monitoring system according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of the cross-modal attention enhancement method of the weakly supervised abnormal behavior intelligent detection method for an intelligent monitoring system according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
Example 1
Referring to Figs. 1-3, in accordance with one embodiment of the present invention, there is provided a method for intelligently detecting weakly supervised abnormal behavior in an intelligent monitoring system, including:
S1: a deep neural network extracts the features of the video and audio modalities respectively; the features of the two modalities undergo self-attention enhancement and are fed into a multi-layer perceptron to obtain high-level semantics, whose normalized results give the segment-level abnormal behavior predictions used as pseudo labels. It should be noted that:
the invention adopts a pre-trained I3D network to extract the features of the video modality $X_v$ and a pre-trained VGGish network to extract the features of the audio modality $X_a$;
the features acquired from the video and audio modalities are separately passed into a self-attention module for self-attention enhancement, and then into the multi-layer perceptron for high-level semantic feature extraction;
further, self-attention enhancement is achieved by mapping the query, key, and value into vectors, calculating the dot product matrix and softmax normalization, resulting in a weighted sum of value vectors The calculation formula is as follows:
Wherein, Representing a matrix of query vectors,A matrix of key vectors is represented,A matrix of vector of values is represented,Respectively represent a parameter matrix which can be learned in a query vector matrix, a key vector matrix and a value vector matrix.Representing a dot product matrix between the query vector and the key vector,The dimensions of the key vector are represented,The values used to normalize the dot product avoid the dot product are too large or too small.The softmax activation function is represented for normalizing the score to a probability value.
Still further, to obtain segment-level labels for the features, the invention uses the multi-layer perceptron to predict abnormal behavior from the self-attention-enhanced features, averages the abnormal behavior scores of the video and audio modalities, normalizes the mean into the final anomaly score, and takes this prediction as the segment-level pseudo label.
The multi-layer perceptron consists of three fully connected layers, computed as:

$$F = W_3\big(W_2(W_1 Z + b_1) + b_2\big) + b_3$$

where $W_1$, $W_2$, $W_3$ denote the learnable parameter matrices of the three fully connected layers and $b_1$, $b_2$, $b_3$ their bias terms. The segment-level pseudo label is obtained by passing the mean of the two modalities' anomaly predictions through a sigmoid activation:

$$\hat{y} = \sigma\!\left(\frac{F_v + F_a}{2}\right)$$

where $\hat{y}$ denotes the segment-level abnormal behavior pseudo label, $\sigma$ the Sigmoid activation function, $F_v$ the high-level semantic features of the video modality, and $F_a$ the high-level semantic features of the audio modality.
S2: the segment-level abnormal behavior prediction of the video modality serves as the basis for background suppression of the audio modality, and the segment-level prediction of the audio modality serves as the basis for background suppression of the video modality, so that the modalities jointly attend to the key segment regions, the features of the key segments are enhanced, and conflicts between the multi-modal detection results are reduced. It should be noted that:
the cross-modal background-suppression gating signals $g_v$ and $g_a$ are the segment-level abnormal behavior predictions of the video and audio modalities, computed from the high-level semantic features of the multi-layer perceptron via the sigmoid activation:

$$g_v = \sigma(F_v), \qquad g_a = \sigma(F_a)$$

The gating values $g_v$ and $g_a$ represent the importance of each segment of the video and audio modalities; a higher value indicates a higher likelihood of abnormal behavior in that segment. After the gating information of the two modalities is obtained, each is multiplied with the features of the other modality to select the more important features of the opposite modality. This achieves targeted enhancement: segments of the video features corresponding to instance segments deemed important by the audio features are enhanced, and vice versa, reducing the probability of situations such as single-modality recognition difficulty or conflicting multi-modal detection results, as shown in Fig. 2. The original modality features and the gate-enhanced features are combined with a weighting coefficient; the background-suppressed features of the video and audio modalities, denoted $\tilde{X}_v$ and $\tilde{X}_a$, are computed as:

$$\tilde{X}_v = \alpha Z_v + (1-\alpha)\, g_a \odot Z_v, \qquad \tilde{X}_a = \alpha Z_a + (1-\alpha)\, g_v \odot Z_a$$
S3: the background-suppressed features undergo cross-modal attention enhancement, and the abnormal behavior prediction of the cross-modal interaction features is computed, as shown in Fig. 3. It should be noted that:
the cross-modal attention enhancement module consists of a video-modality stream and an audio-modality stream; the two modalities exchange each other's Key and Value, so that the attention features of one modality are conditioned on the other: the video stream attends based on audio, and the audio stream attends based on video, performing cross-modal interaction between video and audio.
Taking the video stream as an example, the audio features $Z_a$ are the input from which Key and Value are computed, and the background-suppressed video features $\tilde{X}_v$ are the input from which the Query is computed:

$$Q_v = \tilde{X}_v W_Q, \qquad K_a = Z_a W_K, \qquad V_a = Z_a W_V$$

where $W_Q$, $W_K$, $W_V$ denote learnable embedding matrices. Key and Value are computed from $Z_a$ rather than from $\tilde{X}_a$ because the background-suppressed features attend more to the instance segments deemed important and thus neglect the integrity of the event to some extent; computing Key and Value from the original feature data treats the whole event equally, yielding more appropriate attention features. Finally, the features of the video modality stream are expressed as:

$$Z'_v = \operatorname{softmax}\!\left(\frac{Q_v K_a^{T}}{\sqrt{d_k}}\right)V_a$$
S4: to address the lack of fine-grained segment labels in weakly supervised recognition, the invention designs a loss function suited to this model. The loss function consists of two parts: one treats the segment-level abnormal behavior pseudo labels obtained above as noisy labels and computes a loss against the multi-modal abnormal behavior probability values; the other computes the video-level abnormal behavior prediction loss after multi-modal fusion in a multiple-instance learning manner. It should be noted that:
because weakly supervised training data lacks fine segment-level labels, directly expanding the video-level label to segment-level labels is inaccurate and introduces noise. To let the network learn abnormal behavior at the segment level, i.e. to distinguish normal from abnormal per segment, the invention computes the segment-level prediction loss with a loss function that can cope with noisy data, improving the model's ability to resolve segment-level anomalies. The noise loss function consists of a weighted sum of the mean absolute error (MAE) and the normalized cross entropy (NCE):

$$L_{noise} = \beta\, L_{MAE} + (1-\beta)\, L_{NCE}, \qquad L_{MAE} = \sum_{k=1}^{K} \big| q(k \mid x) - p(k \mid x) \big|, \qquad L_{NCE} = \frac{-\sum_{k=1}^{K} q(k \mid x) \log p(k \mid x)}{-\sum_{j=1}^{K} \log p(j \mid x)}$$

where $K$ denotes the total number of label classes, $q(k \mid x)$ the distribution of sample $x$ over class $k$, $p(k \mid x)$ the probability output of the network for sample $x$ on class $k$, and $\beta$ the weight of the weighted sum.
The second part focuses on the accuracy of video-level anomaly detection. Following the idea of multiple-instance learning, each video segment is regarded as an instance, and the mean of the $k$ highest anomaly predictions within a video bag is taken as the video's anomaly prediction. In a video containing an abnormal event, the selected instances are the instance segments with the highest abnormal-event probability; in a video of normal events, the selected instances are those most likely to cause recognition errors. The video-level loss function is therefore:

$$L_{mil} = -\frac{1}{N} \sum_{i=1}^{N} \Big[ y_i \log \bar{p}_i + (1 - y_i) \log(1 - \bar{p}_i) \Big]$$

where $L_{mil}$ denotes the loss between the multi-modal abnormal behavior probability values computed in the multiple-instance learning manner and the video-level labels, $y_i$ the label of the $i$-th video, $\bar{p}_i$ the mean prediction of the $k$ instances with the largest predicted values in the video bag, and $N$ the number of videos.
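A sketch of this video-level multiple-instance loss: average the top-k segment scores per video bag and apply binary cross entropy against the video-level label. The value of `k` is a hyperparameter not fixed in the text:

```python
import math

def mil_video_loss(segment_scores, video_labels, k=3, eps=1e-7):
    # segment_scores: per-video lists of segment anomaly probabilities
    # video_labels: 0/1 video-level labels; k: assumed top-k hyperparameter
    total = 0.0
    for scores, y in zip(segment_scores, video_labels):
        topk = sorted(scores, reverse=True)[:k]           # hardest instances
        p = min(max(sum(topk) / len(topk), eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(video_labels)
```

Selecting only the top-k instances is what lets a video-level label supervise segment scores: abnormal bags push their most anomalous segments up, while normal bags push their most error-prone segments down.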
Finally, the loss function for training the weakly supervised multi-modal abnormal-behavior detection network model is the weighted sum L_total = a·L_MIL + b·L_noisy.
Example 2
This embodiment differs from the first embodiment in that it provides a verification test of the weakly supervised abnormal-behavior intelligent detection method for intelligent monitoring systems. To verify and explain the technical effects of the method, this embodiment runs a comparison test between a traditional technical scheme and the method of the invention, and compares the test results by scientific means to validate the method's true effect.
The invention is experimentally verified on the large-scale abnormal-behavior recognition dataset XD-Violence. The average precision (AP) of frame-level anomaly recognition is used as the evaluation index of the model's recognition performance. Quantitative ablation experiments examine the influence of each component (background suppression, cross-modal interaction, pseudo-label generation, noise loss function, etc.) on model performance; the results are shown in Table 1. As the table shows, the frame-level AP is 71.23% with the video modality alone and rises to 78.31% after cross-modal interaction between the audio and video modalities is introduced, demonstrating the effectiveness of reasonable interaction of cross-modal context associations for anomaly recognition under multiple modalities. On this basis, the invention applies cross-modal background suppression to highlight the information of key segments; experiments show this is effective in the weakly supervised setting, raising the frame-level AP to 81.69%. In addition, to address the lack of fine-grained labels under weak supervision, the invention adds pseudo labels and the noise loss to model training, further improving recognition: the frame-level AP reaches 83.80%, and the model is robust to noise.
Table 1. Detection effect of each module on the XD-Violence dataset
Method AP (%)
Single-modality vision 71.23
Cross-modal interaction of visual and auditory features 78.31
Background-suppressed cross-modal interaction fusion of the two modalities 81.69
Background-suppressed cross-modal interaction + pseudo labels + noise loss 83.80
On the other hand, compared with other anomaly-detection methods, the model of the invention has the advantage of robustness to noise. For XD-Violence we consider symmetric noise at different noise levels: each label has the same probability of flipping to another class, i.e., an abnormal label flips to normal and a normal label flips to abnormal. We randomly select a proportion of the training data and flip their labels, the proportion taking the values used in Table 2. The prediction performance of the method of the invention is compared with three state-of-the-art methods under different noise proportions; the results are shown in Table 2.
Table 2 identification effect of methods under different noise conditions
As the table shows, trained on the noise-free dataset, the method of the invention reaches a frame-level average precision of 83.80%, an improvement of 0.4% over the MACIL method. To verify the method's robustness to noise, we add different proportions of noise to the training set. With 10%, 20%, and 30% noise, frame-level average precisions of 82.85%, 82.15%, and 78.88% are obtained, improvements of 2.58%, 3.55%, and 0.74% over MACIL, respectively. These experiments demonstrate the robustness of the proposed method to noise.
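The symmetric-noise protocol described above can be reproduced with a short helper; `flip_symmetric` is an illustrative name, not part of the patent, and for binary labels flipping simply toggles normal/abnormal:

```python
import numpy as np

def flip_symmetric(labels, ratio, num_classes=2, seed=0):
    """Inject symmetric label noise: a fraction `ratio` of the labels is
    selected at random and each is flipped to a different class chosen
    uniformly among the remaining classes."""
    rng = np.random.default_rng(seed)
    labels = labels.copy()
    n_flip = int(round(ratio * len(labels)))
    idx = rng.choice(len(labels), size=n_flip, replace=False)
    for i in idx:
        choices = [c for c in range(num_classes) if c != labels[i]]
        labels[i] = rng.choice(choices)
    return labels

y = np.array([0, 1] * 50)            # 100 video-level labels
noisy = flip_symmetric(y, ratio=0.2)  # flip exactly 20% of them
```

Because the flip is symmetric, the class proportions are perturbed but not systematically biased toward either class, which is the standard setting for evaluating noise robustness.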
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the scope of the present invention should be included in the scope of the present invention.

Claims (9)

1. The weak supervision abnormal behavior intelligent detection method for the intelligent monitoring system is characterized by comprising the following steps of:
Step 1, extracting features of a video mode and extracting features of an audio mode;
step 2, adopting a self-attention network to respectively carry out self-attention enhancement on the characteristics of the video mode and the characteristics of the audio mode;
step 3, inputting the self-attention enhanced features obtained in the step 2 into a multi-layer perceptron to extract high-level semantic features, so as to obtain high-level semantic features of video and audio modes;
step 4, activating function normalized video and audio mode advanced semantic feature mean values to obtain abnormal behavior pseudo tags at the segment level;
Step 5, normalizing the advanced semantic features of the video and audio modes obtained in the step 3 to obtain gating information for background suppression, enhancing the features of the audio mode obtained in the step 2 by using the gating information of the video mode to obtain the features of the audio mode after background enhancement, and enhancing the features of the video mode obtained in the step 2 by using the gating information of the audio mode to obtain the features of the video mode after background enhancement;
step 6, performing cross-mode attention enhancement on the characteristics of the audio mode and the characteristics of the video mode after the background enhancement in the step 5 to obtain the characteristics of fusion of the audio mode and the video mode, and obtaining a final multi-mode abnormal behavior probability value by using a multi-layer perceptron;
In the cross-modal attention enhancement, the key vector matrix K v of the video modality stream, the key vector matrix K a of the audio modality stream, the value vector matrix V v of the video modality stream and the value vector matrix V a of the audio modality stream are calculated from the self-attention-enhanced video-modality features Attention v and the self-attention-enhanced audio-modality features Attention a obtained in step 2; the query vector matrices Q v, Q a of the video and audio modality streams are calculated from the background-enhanced features of the video and audio modalities; and cross-modal enhancement is performed by exchanging the key and value vector matrices of the audio modality stream with those of the video modality stream;
step 7, taking the abnormal behavior pseudo tag of the segment level obtained in the step 4 as a noise tag, and calculating a loss value with the multi-mode abnormal behavior probability value obtained in the step 6; the loss value is calculated by adopting a noise loss function, and the noise loss function consists of a weighted sum of an average absolute error MAE and a normalized cross entropy NCE;
step 8, calculating the multi-mode abnormal behavior probability value and the loss value of the video-level tag in the step 6 in a multi-instance learning mode;
And 9, calculating a weighted sum of the loss values in the step 7 and the step 8 to be used as the loss value, wherein the step 2 to the step 6 are the multi-mode abnormal behavior detection network model under weak supervision, and training the multi-mode abnormal behavior detection network model under weak supervision.
2. The method for intelligently detecting weak supervision abnormal behavior of an intelligent monitoring system according to claim 1, wherein the step 2 is specifically as follows:
The features of the video modality and the features of the audio modality are separately passed into a self-attention network for self-attention enhancement. The self-attention network maps queries, keys and values into vectors, computes a dot-product matrix, normalizes it with softmax, and finally obtains the weighted-sum value vector Attention v/a, calculated as follows:
Attention_{v/a} = softmax( Q K^T / √d_k ) · V
wherein Q = W_q F_{a/v}, Q denotes the query vector matrix, F_{a/v} denotes the features of the audio/video modality obtained in step 1; K = W_k F_{a/v}, K denotes the key vector matrix; V = W_v F_{a/v}, V denotes the value vector matrix; W_q, W_k, W_v denote the learnable parameter matrices of the query, key and value projections, respectively; QK^T denotes the dot-product matrix between the query and key matrices, the superscript T denotes matrix transposition, d_k denotes the dimension of the key vectors, and √d_k is used to normalize the dot product and avoid values that are too large or too small; softmax(·) denotes the softmax activation function used to normalize scores into probability values.
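The self-attention computation of this claim can be sketched in NumPy as below; matrix shapes, initialization and names are illustrative, not the patent's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(F, Wq, Wk, Wv):
    """Scaled dot-product self-attention over segment features F (T x d):
    Attention = softmax(Q K^T / sqrt(d_k)) V with Q = F Wq, K = F Wk, V = F Wv."""
    Q, K, V = F @ Wq, F @ Wk, F @ Wv
    dk = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(dk))   # (T, T) segment-to-segment weights
    return weights @ V                         # weighted-sum value vectors

rng = np.random.default_rng(0)
T, d = 8, 16                                   # 8 segments, 16-dim features
F = rng.standard_normal((T, d))
W = [rng.standard_normal((d, d)) * 0.1 for _ in range(3)]
out = self_attention(F, *W)
```

Each output row is a convex combination of all segments' value vectors, so every segment's representation is enriched with context from the whole clip.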
3. The method for intelligently detecting abnormal behavior under weak supervision of an intelligent monitoring system according to claim 2, wherein in step 3, the multi-layer perceptron comprises three fully connected layers, specifically as follows:
F̂_{v/a} = W_3·σ( W_2·σ( W_1·Attention_{v/a} + b_1 ) + b_2 ) + b_3
wherein W_1, W_2, W_3 denote the learnable parameter matrices of the three fully connected layers respectively, b_1, b_2, b_3 denote the bias terms of the three fully connected layers respectively, σ denotes the activation function between layers, and F̂_{v/a} denotes the high-level semantic features of the video/audio modality.
4. The method for intelligently detecting the weak supervision abnormal behavior of the intelligent monitoring system according to claim 2, wherein in step 4, the activation function normalizes the mean of the high-level semantic features of the video and audio modalities to obtain segment-level abnormal-behavior pseudo labels, calculated as follows:
the high-level semantic features of the video and audio modalities obtained in step 3 are averaged and normalized by an activation function into the final anomaly score, which serves as the segment-level abnormal-behavior pseudo label:
P_pseudo = σ( ( F̂_v + F̂_a ) / 2 )
where P_pseudo denotes the segment-level abnormal-behavior pseudo label, σ denotes the Sigmoid activation function, F̂_v denotes the high-level semantic features of the video modality, and F̂_a denotes the high-level semantic features of the audio modality.
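A minimal sketch of this pseudo-label step, assuming the high-level semantic features have been reduced to one score per segment (an assumption; the patent does not state the feature dimensionality at this point):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def segment_pseudo_labels(sem_video, sem_audio):
    """Segment-level abnormal-behavior pseudo labels: the video- and
    audio-modality high-level semantic scores are averaged per segment and
    squashed into (0, 1) with a Sigmoid: P_pseudo = sigmoid((F_v + F_a) / 2)."""
    return sigmoid((sem_video + sem_audio) / 2.0)

sem_v = np.array([-4.0, 0.0, 4.0])   # per-segment video-modality scores
sem_a = np.array([-4.0, 0.0, 4.0])   # per-segment audio-modality scores
p = segment_pseudo_labels(sem_v, sem_a)
```

Segments where both modalities agree on a high score receive pseudo labels near 1, giving the segment-level branch noisy but usable supervision.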
5. The method for intelligently detecting weak supervision abnormal behavior of an intelligent monitoring system according to claim 4, wherein in step 5, the gating information for background suppression is obtained by normalizing the high-level semantic features of the video and audio modalities, calculated as follows:
g_v = σ( F̂_v ), g_a = σ( F̂_a )
where g_v denotes the gating information of the video modality, g_a denotes the gating information of the audio modality, g_v and g_a represent the importance of each segment in the video and audio modalities respectively, g_v, g_a ∈ R^{d×1}, d denotes the number of segments, and R^{d×1} denotes the matrix dimension of the gating information;
the background-suppression-enhanced features are calculated as follows:
F̃_a = Attention_a + a·( g_v ⊙ Attention_a ), F̃_v = Attention_v + a·( g_a ⊙ Attention_v )
where a denotes a weighting proportion parameter, F̃_v denotes the background-enhanced video features, F̃_a denotes the background-enhanced audio features, Attention_v denotes the self-attention-enhanced features of the video modality, and Attention_a denotes the self-attention-enhanced features of the audio modality.
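A sketch of one plausible form of this gated background suppression: segments with high cross-modal gate values are amplified, so background segments are relatively suppressed. The residual combination rule `(1 + a*g) * x` is an assumption consistent with the claim's description, not a formula stated verbatim in the text:

```python
import numpy as np

def background_suppress(features, gate, a=0.5):
    """Scale each segment's features by (1 + a * gate), where `gate` holds
    per-segment importance from the *other* modality. High-gate (salient)
    segments are amplified; background segments keep their original scale."""
    return (1.0 + a * gate[:, None]) * features

T, d = 4, 8
x = np.ones((T, d))                   # 4 segments of 8-dim features
g = np.array([0.0, 0.0, 1.0, 1.0])   # last two segments flagged as salient
out = background_suppress(x, g)
```

Using the other modality's gate means, for example, that a scream in the audio track can boost the corresponding video segments even when they look visually unremarkable.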
6. The method for intelligent detection of weakly supervised abnormal behavior for intelligent monitoring systems as set forth in claim 5, wherein in step 6, the features of the audio-and-video-modality fusion comprise the features of the video modality stream and the features of the audio modality stream, expressed by the following formulas:
Q_v = W_q·F̃_v, Q_a = W_q·F̃_a
K_v = W_k·Attention_a, K_a = W_k·Attention_v
V_v = W_v·Attention_a, V_a = W_v·Attention_v
F_v = softmax( Q_v K_v^T / √d_k )·V_v, F_a = softmax( Q_a K_a^T / √d_k )·V_a
where F̃_v and F̃_a denote the background-enhanced video and audio features obtained in step 5, Q_v, Q_a denote the query vector matrices of the video and audio modality streams, K_v, K_a denote the key vector matrices of the video and audio modality streams, V_v, V_a denote the value vector matrices of the video and audio modality streams, W_q, W_k, W_v denote the learnable parameter matrices of the query, key and value projections, F_v denotes the features of the video modality stream, and F_a denotes the features of the audio modality stream; the features of the video and audio modality streams are added to obtain the final fused features of the two modalities, which are then passed through a fully connected layer to obtain the final multi-modal abnormal-behavior probability value.
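The key/value swap of this claim can be sketched in NumPy as follows; shapes, initialization and names are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(bg_v, bg_a, att_v, att_a, Wq, Wk, Wv):
    """Cross-modal attention with swapped keys/values: each stream queries
    with its own background-enhanced features (bg_*) but attends over keys
    and values built from the OTHER modality's self-attention features."""
    Qv, Qa = bg_v @ Wq, bg_a @ Wq
    Kv, Vv = att_a @ Wk, att_a @ Wv   # video stream attends over audio features
    Ka, Va = att_v @ Wk, att_v @ Wv   # audio stream attends over video features
    dk = Kv.shape[-1]
    Fv = softmax(Qv @ Kv.T / np.sqrt(dk)) @ Vv
    Fa = softmax(Qa @ Ka.T / np.sqrt(dk)) @ Va
    return Fv + Fa                    # added to form the fused bimodal features

rng = np.random.default_rng(1)
T, d = 6, 8                           # 6 segments, 8-dim features
mats = [rng.standard_normal((T, d)) for _ in range(4)]
Ws = [rng.standard_normal((d, d)) * 0.1 for _ in range(3)]
fused = cross_modal_attention(*mats, *Ws)
```

Swapping keys and values forces each modality's queries to retrieve evidence from the other modality, which is how the cross-modal context association is realized.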
7. The method for intelligently detecting the weak supervision abnormal behavior of the intelligent monitoring system according to claim 3, wherein in the step 7, the abnormal behavior pseudo tag at the segment level obtained in the step 4 is used as a noise tag, and the loss value calculation is performed with the multi-mode abnormal behavior probability value obtained in the step 6; the method comprises the following steps:
And (3) taking the segment-level abnormal behavior pseudo tag obtained in the step (4) as a noise tag, and calculating a loss value between the segment-level abnormal behavior pseudo tag and the multi-mode abnormal behavior probability value, wherein the loss value is calculated by adopting a noise loss function, and the noise loss function consists of a weighted sum of an average absolute error MAE and a normalized cross entropy NCE.
8. The intelligent detection method for weak supervision abnormal behavior of intelligent monitoring system according to claim 1, wherein in step 9, the calculation process of weighting and summing the loss values of step 7 and step 8 as the loss value L total is as follows:
L_total = a·L_MIL + b·L_noisy
where L_MIL denotes the loss between the multi-modal abnormal-behavior probability values obtained in the multiple-instance-learning manner of step 8 and the video-level labels, L_noisy denotes the loss between the segment-level abnormal-behavior pseudo labels and the multi-modal abnormal-behavior probability values obtained in step 7, and a and b denote the weights of the weighted sum.
9. The method for intelligently detecting weak supervision abnormal behavior of an intelligent monitoring system according to claim 1, wherein in step 1, the characteristics of a video mode are extracted by using a pre-training I3D network, and the characteristics of an audio mode are extracted by using a pre-training VGGish network.
CN202410150884.2A 2024-02-02 2024-02-02 Intelligent detection method for weak supervision abnormal behavior of intelligent monitoring system Active CN117690191B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410150884.2A CN117690191B (en) 2024-02-02 2024-02-02 Intelligent detection method for weak supervision abnormal behavior of intelligent monitoring system


Publications (2)

Publication Number Publication Date
CN117690191A CN117690191A (en) 2024-03-12
CN117690191B true CN117690191B (en) 2024-04-30

Family

ID=90128590


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112685597A (en) * 2021-03-12 2021-04-20 杭州一知智能科技有限公司 Weak supervision video clip retrieval method and system based on erasure mechanism
CN113971776A (en) * 2021-10-15 2022-01-25 浙江大学 Audio-visual event positioning method and system
CN116935303A (en) * 2022-10-27 2023-10-24 安徽大学 Weak supervision self-training video anomaly detection method
CN117173394A (en) * 2023-08-07 2023-12-05 山东大学 Weak supervision salient object detection method and system for unmanned aerial vehicle video data



Similar Documents

Publication Publication Date Title
Guanghui et al. Multi-modal emotion recognition by fusing correlation features of speech-visual
Wang et al. Split and connect: A universal tracklet booster for multi-object tracking
Fang et al. Traffic accident detection via self-supervised consistency learning in driving scenarios
CN111626199B (en) Abnormal behavior analysis method for large-scale multi-person carriage scene
He et al. DepNet: An automated industrial intelligent system using deep learning for video‐based depression analysis
Zhang et al. Weakly supervised anomaly detection in videos considering the openness of events
CN111539445A (en) Object classification method and system based on semi-supervised feature fusion
Gorodnichev et al. Research and Development of a System for Determining Abnormal Human Behavior by Video Image Based on Deepstream Technology
Feng et al. SSLNet: A network for cross-modal sound source localization in visual scenes
Yin et al. Msa-gcn: Multiscale adaptive graph convolution network for gait emotion recognition
CN117690191B (en) Intelligent detection method for weak supervision abnormal behavior of intelligent monitoring system
Kumar et al. Abnormal human activity detection by convolutional recurrent neural network using fuzzy logic
Yang et al. LCSED: A low complexity CNN based SED model for IoT devices
CN116704609A (en) Online hand hygiene assessment method and system based on time sequence attention
Hou et al. Joint prediction of audio event and annoyance rating in an urban soundscape by hierarchical graph representation learning
Hao et al. Human behavior analysis based on attention mechanism and LSTM neural network
CN114973202B (en) Traffic scene obstacle detection method based on semantic segmentation
Liu [Retracted] Sports Deep Learning Method Based on Cognitive Human Behavior Recognition
Shi et al. Research on Safe Driving Evaluation Method Based on Machine Vision and Long Short‐Term Memory Network
CN113312968B (en) Real abnormality detection method in monitoring video
CN115147921A (en) Key area target abnormal behavior detection and positioning method based on multi-domain information fusion
Umeki et al. Salient object detection with importance degree
Han et al. NSNP-DFER: a nonlinear spiking neural P network for dynamic facial expression recognition
Chandrakala Anomalous human activity detection in videos using Bag-of-Adapted-Models-based representation
Chen et al. ABOS: an attention-based one-stage framework for person search

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant