CN117690191B - Intelligent detection method for weak supervision abnormal behavior of intelligent monitoring system - Google Patents


Info

Publication number: CN117690191B
Application number: CN202410150884.2A
Authority: CN (China)
Prior art keywords: modality, video, audio, abnormal behavior, features
Legal status: Active (granted; the legal status is an assumption and is not a legal conclusion)
Other languages: Chinese (zh)
Other versions: CN117690191A
Inventors: 徐小龙, 王珺
Original and current assignee: Nanjing University of Posts and Telecommunications
Application filed by Nanjing University of Posts and Telecommunications
Publication of application: CN117690191A
Application granted; publication of grant: CN117690191B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses a weakly supervised abnormal behavior intelligent detection method for an intelligent monitoring system, in the technical field of abnormal behavior detection. The method combines weakly supervised pseudo-label generation with cross-modal interaction in a single network structure to address the absence of fine-grained segment labels under weak supervision and to improve the accuracy of weakly supervised abnormal behavior recognition. Its loss function meets the requirements of both frame-level and video-level recognition and is robust to noise, so the method can be applied to abnormal behavior detection tasks on intelligent monitoring devices.

Description

Intelligent detection method for weak supervision abnormal behavior of intelligent monitoring system
Technical Field
The invention relates to the technical field of abnormal behavior detection, and in particular to a weakly supervised abnormal behavior intelligent detection method for an intelligent monitoring system.
Background
In modern society, monitoring devices are deployed almost everywhere, especially in important public-security areas such as schools, hospitals and shopping malls. Today these devices mainly record what happens in their area so that the footage can later be reviewed as evidence, which leaves their utilization low. Making full use of the large amount of data they collect and extending their function to intelligent anomaly monitoring is therefore a very promising research direction. With the continuous development of deep learning, anomaly detection technology has matured, and improvements in monitoring hardware provide the necessary conditions for this functional extension, making intelligent monitoring devices capable of anomaly detection feasible.
With the development of monitoring hardware, most monitoring devices today can capture both visual and auditory data. Anomaly detection research has accordingly turned to multi-modal fusion techniques that combine video and audio information. Multi-modal learning aims to build models that can process correlated multi-source information; it has been shown to aggregate information across data sources so that a model learns a more complete representation, avoiding the limitations of a single modality in anomaly detection. Human judgment likewise relies on multi-modal rather than single-modal cues: in a car accident, for example, the loud sound of the collision together with visible flames and smoke forms the basis of our judgment. Multi-modal anomaly detection is therefore closer to human judgment and better suited to human-computer interaction.
The main technical difficulty in multi-modal deep learning comes from the heterogeneity of the data. For audio-video anomaly recognition in particular: first, video data is represented as sequences of frames while audio data is represented as audio signals; second, the correspondence between audio and video data at the instance level must be established; finally, a fusion model of the audio and video data must be built to obtain the complementary information that each single modality lacks.
Weakly supervised anomaly detection asks how to identify the abnormal segments within a complete video that is only labeled as abnormal, a research topic with far-reaching application prospects. Given today's massive data volumes, fine-grained annotation is time-consuming and expensive, especially at the frame level of video data. Video-level labels are a good compromise, which gives weakly supervised anomaly detection a clear advantage in the face of mass data.
Multi-modal weakly supervised anomaly detection therefore has broad development prospects but remains a challenging task: existing abnormal behavior detection methods still need improvement in accuracy, applicability and other respects. The prior art suffers from insufficient multi-modal fusion, neglect of segment-level recognition under weak supervision, and low model robustness.
Disclosure of Invention
The invention aims to overcome the defects of the prior art described above by providing a weakly supervised abnormal behavior intelligent detection method for an intelligent monitoring system.
The invention adopts the following technical scheme for solving the technical problems:
The invention provides a weakly supervised abnormal behavior intelligent detection method for an intelligent monitoring system, comprising the following steps:
Step 1: extract the features of the video modality and the features of the audio modality;
Step 2: apply a self-attention network to enhance the video-modality and audio-modality features separately;
Step 3: feed the self-attention-enhanced features from Step 2 into a multi-layer perceptron to extract high-level semantic features for the video and audio modalities;
Step 4: normalize the mean of the video and audio high-level semantic features with an activation function to obtain segment-level abnormal behavior pseudo labels;
Step 5: normalize the high-level semantic features from Step 3 to obtain gating information for background suppression; use the video-modality gating information to enhance the audio-modality features from Step 2, yielding background-enhanced audio features, and use the audio-modality gating information to enhance the video-modality features from Step 2, yielding background-enhanced video features;
Step 6: apply cross-modal attention enhancement to the background-enhanced audio and video features from Step 5 to obtain fused audio-video features, and use a multi-layer perceptron to obtain the final multi-modal abnormal behavior probability values;
Step 7: treat the segment-level abnormal behavior pseudo labels from Step 4 as noisy labels and compute a loss value against the multi-modal abnormal behavior probability values from Step 6;
Step 8: compute a loss value between the multi-modal abnormal behavior probability values from Step 6 and the video-level labels in a multiple-instance learning manner;
Step 9: take the weighted sum of the loss values from Steps 7 and 8 as the total loss; Steps 2 to 6 constitute the weakly supervised multi-modal abnormal behavior detection network model, which is trained with this loss.
As a further optimization of the weakly supervised abnormal behavior intelligent detection method for an intelligent monitoring system, Step 2 specifically comprises:
the features of the video modality and the features of the audio modality are separately passed through a self-attention network, which maps queries, keys and values into vectors, computes the dot-product matrix and applies softmax normalization, finally obtaining the weighted sum of the value vectors $Z$:

$$Z = \operatorname{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V, \qquad Q = XW_Q,\; K = XW_K,\; V = XW_V$$

where $Q$ denotes the query vector matrix, $X$ the features of the audio/video modality obtained in Step 1, $K$ the key vector matrix, $V$ the value vector matrix, and $W_Q$, $W_K$, $W_V$ the learnable parameter matrices of the query, key and value projections; $QK^{T}$ is the dot-product matrix between the query and key matrices, the superscript $T$ denotes matrix transposition, $d_k$ is the dimension of the key vectors, and $\sqrt{d_k}$ normalizes the dot product so that it does not become too large or too small; $\operatorname{softmax}$ is the activation function that normalizes the scores into probability values.
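The scaled dot-product self-attention described in this step can be sketched in plain Python. This is a minimal illustration with toy fixed projection matrices; in the patent's model the projections $W_Q$, $W_K$, $W_V$ are learned parameters:

```python
import math

def softmax(row):
    # numerically stable softmax over one row of attention scores
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def matmul(a, b):
    # naive matrix product: (n x k) @ (k x m) -> (n x m)
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def self_attention(x, w_q, w_k, w_v):
    # x: T x d segment features (one row per clip segment)
    # w_q / w_k / w_v: d x d_k projection matrices (learned in the real model)
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(w_k[0])
    scores = matmul(q, transpose(k))                      # T x T dot products
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]            # row-wise softmax
    return matmul(weights, v)                             # weighted sum of values
```

In the patent the same routine is applied independently to the video features and the audio features before the two modalities interact; the $\sqrt{d_k}$ scaling is exactly the normalization the description credits with keeping the dot products from becoming too large or too small.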
As a further optimization of the weakly supervised abnormal behavior intelligent detection method for an intelligent monitoring system, the multi-layer perceptron in Step 3 comprises three fully connected layers, computed as:

$$F = W_3\big(W_2(W_1 Z + b_1) + b_2\big) + b_3$$

where $W_1$, $W_2$, $W_3$ denote the learnable parameter matrices of the three fully connected layers, $b_1$, $b_2$, $b_3$ their bias terms, and $F$ the high-level semantic features of the video/audio modality.
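Step 3's three-fully-connected-layer perceptron reduces to repeated affine maps. The sketch below (plain Python, one feature vector at a time) mirrors that structure; it omits any inter-layer activation, which the patent text does not specify:

```python
def mlp3(x, layers):
    # x: one feature vector; layers: three (W, b) pairs with W[in][out]
    # mirrors F = W3(W2(W1*z + b1) + b2) + b3 from the description
    h = x
    for w, b in layers:
        h = [sum(hj * w[j][o] for j, hj in enumerate(h)) + b[o]
             for o in range(len(b))]
    return h
```

A real implementation would batch this over all segments of a clip and likely insert a nonlinearity between layers; the point here is only the three stacked fully connected transforms.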
As a further optimization of the weakly supervised abnormal behavior intelligent detection method for an intelligent monitoring system, in Step 4 the mean of the video and audio high-level semantic features is normalized by an activation function to obtain the segment-level abnormal behavior pseudo labels, specifically:
the video and audio high-level semantic features obtained in Step 3 are averaged and normalized by the activation function into the final anomaly score, which serves as the segment-level abnormal behavior pseudo label:

$$\hat{y} = \sigma\!\left(\frac{F_v + F_a}{2}\right)$$

where $\hat{y}$ denotes the segment-level abnormal behavior pseudo label, $\sigma$ the Sigmoid activation function, $F_v$ the high-level semantic features of the video modality, and $F_a$ the high-level semantic features of the audio modality.
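Assuming the high-level semantic features have already been collapsed to one scalar score per segment, Step 4's pseudo-label computation is just a sigmoid of the per-segment mean of the two modalities' scores:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def segment_pseudo_labels(video_scores, audio_scores):
    # sigmoid of the per-segment mean of the two modality scores
    return [sigmoid((v + a) / 2.0) for v, a in zip(video_scores, audio_scores)]
```

Each resulting value lies in (0, 1) and is later treated as a noisy segment-level label rather than ground truth, which is why the training loss must tolerate label noise.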
As a further optimization of the weakly supervised abnormal behavior intelligent detection method for an intelligent monitoring system, in Step 5 the video and audio high-level semantic features are normalized to obtain the gating information for background suppression:

$$g_v = \sigma(F_v), \qquad g_a = \sigma(F_a)$$

where $g_v$ denotes the gating information of the video modality and $g_a$ that of the audio modality, representing the importance of each segment in the video and audio modalities respectively; $g_v, g_a \in \mathbb{R}^{D \times 1}$, where $D$ denotes the number of segments and $D \times 1$ the matrix dimension of the gating information.
The background-suppression-enhanced features are computed as:

$$\tilde{X}_v = \alpha Z_v + (1-\alpha)\, g_a \odot Z_v, \qquad \tilde{X}_a = \alpha Z_a + (1-\alpha)\, g_v \odot Z_a$$

where $\alpha$ denotes the weighting coefficient, $\tilde{X}_v$ the background-enhanced video features, $\tilde{X}_a$ the background-enhanced audio features, $Z_v$ the self-attention-enhanced features of the video modality, and $Z_a$ the self-attention-enhanced features of the audio modality.
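A sketch of the background suppression: each segment's features in one modality are scaled by the sigmoid gate derived from the other modality's per-segment scores, then blended with the original features. The single blending coefficient `alpha` and the exact combination form are assumptions; the patent states only that the original and gate-enhanced features are combined with weighting coefficients:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def background_suppress(features, cross_gate_logits, alpha=0.5):
    # features: T x d segment features of one modality
    # cross_gate_logits: length-T per-segment scores from the OTHER modality
    # alpha: assumed blending coefficient between original and gated features
    out = []
    for feat, logit in zip(features, cross_gate_logits):
        g = sigmoid(logit)  # importance of this segment per the other modality
        out.append([alpha * f + (1.0 - alpha) * g * f for f in feat])
    return out
```

Segments the other modality scores near zero are attenuated toward `alpha * feature` (background suppressed), while segments it scores highly pass through almost unchanged.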
In Step 6, the fused audio-video features comprise the features of a video-modality stream and of an audio-modality stream, expressed as:

$$Z'_v = \operatorname{softmax}\!\left(\frac{Q_v K_a^{T}}{\sqrt{d_k}}\right)V_a, \qquad Z'_a = \operatorname{softmax}\!\left(\frac{Q_a K_v^{T}}{\sqrt{d_k}}\right)V_v$$

with $Q_v = \tilde{X}_v W_Q$, $K_a = Z_a W_K$, $V_a = Z_a W_V$ and $Q_a = \tilde{X}_a W_Q$, $K_v = Z_v W_K$, $V_v = Z_v W_V$, where $Q_v$ and $Q_a$ denote the query vector matrices of the video and audio modality streams, $K_v$ and $K_a$ their key vector matrices, $V_v$ and $V_a$ their value vector matrices, $W_Q$, $W_K$, $W_V$ the learnable parameter matrices of the query, key and value projections, $Z'_v$ the features of the video modality stream, and $Z'_a$ the features of the audio modality stream. The features of the video and audio modality streams are added to obtain the final fused features of the two modalities, which are then passed through the fully connected layer to obtain the final multi-modal abnormal behavior probability values.
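The two cross-modal streams and their additive fusion can be sketched as follows. Projection matrices are omitted for brevity; per the description, queries come from the background-suppressed features of one modality while keys and values come from the original features of the opposite modality:

```python
import math

def _softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def _matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def cross_modal_attention(q_feats, kv_feats):
    # queries from the background-suppressed features of one modality,
    # keys/values from the ORIGINAL features of the other modality
    d_k = len(kv_feats[0])
    k_t = [list(col) for col in zip(*kv_feats)]           # transpose for Q.K^T
    scores = _matmul(q_feats, k_t)
    weights = [_softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return _matmul(weights, kv_feats)

def fuse_streams(video_bs, audio_bs, video_raw, audio_raw):
    # two streams, fused by element-wise addition as in the description
    v_stream = cross_modal_attention(video_bs, audio_raw)  # video Q, audio K/V
    a_stream = cross_modal_attention(audio_bs, video_raw)  # audio Q, video K/V
    return [[x + y for x, y in zip(r1, r2)]
            for r1, r2 in zip(v_stream, a_stream)]
```

Swapping keys and values across modalities is what conditions each stream's attention on the other modality, as the cross-modal attention enhancement module requires.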
In Step 7, the segment-level abnormal behavior pseudo labels obtained in Step 4 are used as noisy labels, and the loss value against the multi-modal abnormal behavior probability values obtained in Step 6 is computed as follows:
the segment-level abnormal behavior pseudo labels from Step 4 serve as noisy labels, and the loss between the multi-modal abnormal behavior probability values and these noisy labels is computed with a noise-tolerant loss function consisting of a weighted sum of the mean absolute error (MAE) and the normalized cross entropy (NCE).
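For the binary normal/abnormal case this noise loss collapses to a scalar computation. The patent specifies only the MAE + NCE weighted-sum structure, so the particular NCE normalization below (dividing the cross entropy by its sum over both possible labels) and the weight `lam` are assumptions:

```python
import math

def noise_loss(p, y, lam=0.5, eps=1e-7):
    # p: predicted anomaly probability; y: noisy pseudo label in [0, 1]
    # lam: assumed weight between the MAE and NCE terms
    p = min(max(p, eps), 1.0 - eps)
    mae = abs(p - y)
    ce = -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    # normalize the cross entropy by its sum over both possible labels
    nce = ce / -(math.log(p) + math.log(1.0 - p))
    return lam * mae + (1.0 - lam) * nce
```

Both terms are bounded, which is what makes the combination robust to mislabeled segments compared with plain cross entropy, whose value explodes on confident wrong labels.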
As a further optimization of the weakly supervised abnormal behavior intelligent detection method for an intelligent monitoring system, in Step 9 the weighted sum of the loss values from Step 7 and Step 8 is taken as the total loss $L$:

$$L = \lambda_1 L_{mil} + \lambda_2 L_{noise}$$

where $L_{mil}$ denotes the loss between the multi-modal abnormal behavior probability values and the video-level labels obtained in the multiple-instance learning manner of Step 8, $L_{noise}$ the loss between the segment-level abnormal behavior pseudo labels and the multi-modal abnormal behavior probability values obtained in Step 7, and $\lambda_1$, $\lambda_2$ the weights of the weighted sum.
As a further optimization of the weakly supervised abnormal behavior intelligent detection method for an intelligent monitoring system, in Step 1 the features of the video modality are extracted with a pre-trained I3D network and the features of the audio modality with a pre-trained VGGish network.
Compared with the prior art, the technical scheme provided by the invention has the following technical effects:
(1) The invention is a multi-modal abnormal behavior detection algorithm: it constructs a weakly supervised multi-modal abnormal behavior detection model that performs anomaly recognition from the information of both the video and audio modalities, meeting the practical scene requirements of intelligent monitoring.
(2) The invention uses a self-attention mechanism and a multi-layer perceptron so that the weakly supervised multi-modal abnormal behavior detection model attends to segment-level prediction, strengthening the model's ability to distinguish abnormal from normal and effectively addressing the absence of frame-level labels under weak supervision.
(3) The invention designs a multi-modal fusion module that uses the single-modality segment-level pseudo labels as the basis for suppressing the background of the other modality's data, so that the weakly supervised model focuses on the key abnormal intervals; a cross-modal attention mechanism then fuses the data of the video and audio modalities so the model can extract key information from both, improving the detection performance of the weakly supervised multi-modal abnormal behavior detection model.
(4) The invention designs a loss function suited to training the weakly supervised multi-modal abnormal behavior detection model; it attends to both segment-level and video-level prediction, effectively addresses the absence of video segment labels, is robust to noise, and meets practical application requirements.
Drawings
Fig. 1 is a schematic diagram of the overall model of the weakly supervised abnormal behavior intelligent detection method for an intelligent monitoring system according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the background suppression method of the weakly supervised abnormal behavior intelligent detection method for an intelligent monitoring system according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of the cross-modal attention enhancement method of the weakly supervised abnormal behavior intelligent detection method for an intelligent monitoring system according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
Example 1
Referring to Figs. 1-3, in accordance with one embodiment of the present invention, there is provided a method for intelligently detecting weakly supervised abnormal behavior in an intelligent monitoring system, including:
S1: a deep neural network extracts the features of the video and audio modalities respectively; the features of the two modalities undergo self-attention enhancement and are fed into a multi-layer perceptron to obtain high-level semantics, whose normalized results give the segment-level abnormal behavior predictions used as pseudo labels. It should be noted that:
the invention adopts a pre-trained I3D network to extract the features of the video modality $X_v$ and a pre-trained VGGish network to extract the features of the audio modality $X_a$;
the features acquired from the video and audio modalities are separately passed into a self-attention module for self-attention enhancement, and then into the multi-layer perceptron for high-level semantic feature extraction;
further, self-attention enhancement is achieved by mapping the query, key, and value into vectors, calculating the dot product matrix and softmax normalization, resulting in a weighted sum of value vectors The calculation formula is as follows:
Wherein, Representing a matrix of query vectors,A matrix of key vectors is represented,A matrix of vector of values is represented,Respectively represent a parameter matrix which can be learned in a query vector matrix, a key vector matrix and a value vector matrix.Representing a dot product matrix between the query vector and the key vector,The dimensions of the key vector are represented,The values used to normalize the dot product avoid the dot product are too large or too small.The softmax activation function is represented for normalizing the score to a probability value.
Still further, to obtain segment-level labels for the features, the invention uses the multi-layer perceptron to predict abnormal behavior from the self-attention-enhanced features, averages the abnormal behavior scores of the video and audio modalities, normalizes the mean into the final anomaly score, and takes this prediction as the segment-level pseudo label.
The multi-layer perceptron consists of three fully connected layers, computed as:

$$F = W_3\big(W_2(W_1 Z + b_1) + b_2\big) + b_3$$

where $W_1$, $W_2$, $W_3$ denote the learnable parameter matrices of the three fully connected layers and $b_1$, $b_2$, $b_3$ their bias terms. The segment-level pseudo label is obtained by passing the mean of the two modalities' anomaly predictions through a sigmoid activation:

$$\hat{y} = \sigma\!\left(\frac{F_v + F_a}{2}\right)$$

where $\hat{y}$ denotes the segment-level abnormal behavior pseudo label, $\sigma$ the Sigmoid activation function, $F_v$ the high-level semantic features of the video modality, and $F_a$ the high-level semantic features of the audio modality.
S2: the segment-level abnormal behavior prediction of the video modality serves as the basis for background suppression of the audio modality, and the segment-level prediction of the audio modality serves as the basis for background suppression of the video modality, so that the modalities jointly attend to the key segment regions, the features of the key segments are enhanced, and conflicts between the multi-modal detection results are reduced. It should be noted that:
the cross-modal background-suppression gating signals $g_v$ and $g_a$ are the segment-level abnormal behavior predictions of the video and audio modalities, computed from the high-level semantic features of the multi-layer perceptron via the sigmoid activation:

$$g_v = \sigma(F_v), \qquad g_a = \sigma(F_a)$$

The gating values $g_v$ and $g_a$ represent the importance of each segment of the video and audio modalities; a higher value indicates a higher likelihood of abnormal behavior in that segment. After the gating information of the two modalities is obtained, each is multiplied with the features of the other modality to select the more important features of the opposite modality. This achieves targeted enhancement: segments of the video features corresponding to instance segments deemed important by the audio features are enhanced, and vice versa, reducing the probability of situations such as single-modality recognition difficulty or conflicting multi-modal detection results, as shown in Fig. 2. The original modality features and the gate-enhanced features are combined with a weighting coefficient; the background-suppressed features of the video and audio modalities, denoted $\tilde{X}_v$ and $\tilde{X}_a$, are computed as:

$$\tilde{X}_v = \alpha Z_v + (1-\alpha)\, g_a \odot Z_v, \qquad \tilde{X}_a = \alpha Z_a + (1-\alpha)\, g_v \odot Z_a$$
S3: the background-suppressed features undergo cross-modal attention enhancement, and the abnormal behavior prediction of the cross-modal interaction features is computed, as shown in Fig. 3. It should be noted that:
the cross-modal attention enhancement module consists of a video-modality stream and an audio-modality stream; the two modalities exchange each other's Key and Value, so that the attention features of one modality are conditioned on the other: the video stream attends based on audio, and the audio stream attends based on video, performing cross-modal interaction between video and audio.
Taking the video stream as an example, the audio features $Z_a$ are the input from which Key and Value are computed, and the background-suppressed video features $\tilde{X}_v$ are the input from which the Query is computed:

$$Q_v = \tilde{X}_v W_Q, \qquad K_a = Z_a W_K, \qquad V_a = Z_a W_V$$

where $W_Q$, $W_K$, $W_V$ denote learnable embedding matrices. Key and Value are computed from $Z_a$ rather than from $\tilde{X}_a$ because the background-suppressed features attend more to the instance segments deemed important and thus neglect the integrity of the event to some extent; computing Key and Value from the original feature data treats the whole event equally, yielding more appropriate attention features. Finally, the features of the video modality stream are expressed as:

$$Z'_v = \operatorname{softmax}\!\left(\frac{Q_v K_a^{T}}{\sqrt{d_k}}\right)V_a$$
S4: to address the lack of fine-grained segment labels in weakly supervised recognition, the invention designs a loss function suited to this model. The loss function consists of two parts: one treats the segment-level abnormal behavior pseudo labels obtained above as noisy labels and computes a loss against the multi-modal abnormal behavior probability values; the other computes the video-level abnormal behavior prediction loss after multi-modal fusion in a multiple-instance learning manner. It should be noted that:
because weakly supervised training data lacks fine segment-level labels, directly expanding the video-level label to segment-level labels is inaccurate and introduces noise. To let the network learn abnormal behavior at the segment level, i.e. to distinguish normal from abnormal per segment, the invention computes the segment-level prediction loss with a loss function that can cope with noisy data, improving the model's ability to resolve segment-level anomalies. The noise loss function consists of a weighted sum of the mean absolute error (MAE) and the normalized cross entropy (NCE):

$$L_{noise} = \beta\, L_{MAE} + (1-\beta)\, L_{NCE}, \qquad L_{MAE} = \sum_{k=1}^{K} \big| q(k \mid x) - p(k \mid x) \big|, \qquad L_{NCE} = \frac{-\sum_{k=1}^{K} q(k \mid x) \log p(k \mid x)}{-\sum_{j=1}^{K} \log p(j \mid x)}$$

where $K$ denotes the total number of label classes, $q(k \mid x)$ the distribution of sample $x$ over class $k$, $p(k \mid x)$ the probability output of the network for sample $x$ on class $k$, and $\beta$ the weight of the weighted sum.
The second part focuses on the accuracy of video-level anomaly detection. Following the idea of multiple-instance learning, each video segment is regarded as an instance, and the mean of the $k$ highest anomaly predictions within a video bag is taken as the video's anomaly prediction. In a video containing an abnormal event, the selected instances are the instance segments with the highest abnormal-event probability; in a video of normal events, the selected instances are those most likely to cause recognition errors. The video-level loss function is therefore:

$$L_{mil} = -\frac{1}{N} \sum_{i=1}^{N} \Big[ y_i \log \bar{p}_i + (1 - y_i) \log(1 - \bar{p}_i) \Big]$$

where $L_{mil}$ denotes the loss between the multi-modal abnormal behavior probability values computed in the multiple-instance learning manner and the video-level labels, $y_i$ the label of the $i$-th video, $\bar{p}_i$ the mean prediction of the $k$ instances with the largest predicted values in the video bag, and $N$ the number of videos.
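A sketch of this video-level multiple-instance loss: average the top-k segment scores per video bag and apply binary cross entropy against the video-level label. The value of `k` is a hyperparameter not fixed in the text:

```python
import math

def mil_video_loss(segment_scores, video_labels, k=3, eps=1e-7):
    # segment_scores: per-video lists of segment anomaly probabilities
    # video_labels: 0/1 video-level labels; k: assumed top-k hyperparameter
    total = 0.0
    for scores, y in zip(segment_scores, video_labels):
        topk = sorted(scores, reverse=True)[:k]           # hardest instances
        p = min(max(sum(topk) / len(topk), eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(video_labels)
```

Selecting only the top-k instances is what lets a video-level label supervise segment scores: abnormal bags push their most anomalous segments up, while normal bags push their most error-prone segments down.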
Finally, the loss function for training the weakly supervised multi-modal abnormal-behavior detection network model is the weighted sum L_total = a·L_MIL + b·L_noisy.
Example 2
This embodiment differs from the first embodiment in that it provides a verification test of the weakly supervised abnormal-behavior intelligent detection method for intelligent monitoring systems. To verify and explain the technical effects of the method, this embodiment runs a comparison test between a traditional technical scheme and the method of the invention, and compares the test results by scientific means to validate the method's true effect.
The invention is experimentally verified on the large-scale abnormal-behavior recognition dataset XD-Violence. The average precision (AP) of frame-level anomaly recognition is used as the evaluation index of the model's recognition performance. Quantitative ablation experiments examine the influence of each component (background suppression, cross-modal interaction, pseudo-label generation, noise loss function, etc.) on model performance; the results are shown in Table 1. As the table shows, the frame-level AP is 71.23% with the video modality alone and rises to 78.31% after cross-modal interaction between the audio and video modalities is introduced, demonstrating the effectiveness of reasonable interaction of cross-modal context associations for anomaly recognition under multiple modalities. On this basis, the invention applies cross-modal background suppression to highlight the information of key segments; experiments show this is effective in the weakly supervised setting, raising the frame-level AP to 81.69%. In addition, to address the lack of fine-grained labels under weak supervision, the invention adds pseudo labels and the noise loss to model training, further improving recognition: the frame-level AP reaches 83.80%, and the model is robust to noise.
Table 1. Detection effect of each module on the XD-Violence dataset
Method AP (%)
Single-modality vision 71.23
Cross-modal interaction of visual and auditory features 78.31
Background-suppressed cross-modal interaction fusion of the two modalities 81.69
Background-suppressed cross-modal interaction + pseudo labels + noise loss 83.80
On the other hand, compared with other anomaly-detection methods, the model of the invention has the advantage of robustness to noise. For XD-Violence we consider symmetric noise at different noise levels: each label has the same probability of flipping to another class, i.e., an abnormal label flips to normal and a normal label flips to abnormal. We randomly select a proportion of the training data and flip their labels, the proportion taking the values used in Table 2. The prediction performance of the method of the invention is compared with three state-of-the-art methods under different noise proportions; the results are shown in Table 2.
Table 2 identification effect of methods under different noise conditions
As the table shows, trained on the noise-free dataset, the method of the invention reaches a frame-level average precision of 83.80%, an improvement of 0.4% over the MACIL method. To verify the method's robustness to noise, we add different proportions of noise to the training set. With 10%, 20%, and 30% noise, frame-level average precisions of 82.85%, 82.15%, and 78.88% are obtained, improvements of 2.58%, 3.55%, and 0.74% over MACIL, respectively. These experiments demonstrate the robustness of the proposed method to noise.
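The symmetric-noise protocol described above can be reproduced with a short helper; `flip_symmetric` is an illustrative name, not part of the patent, and for binary labels flipping simply toggles normal/abnormal:

```python
import numpy as np

def flip_symmetric(labels, ratio, num_classes=2, seed=0):
    """Inject symmetric label noise: a fraction `ratio` of the labels is
    selected at random and each is flipped to a different class chosen
    uniformly among the remaining classes."""
    rng = np.random.default_rng(seed)
    labels = labels.copy()
    n_flip = int(round(ratio * len(labels)))
    idx = rng.choice(len(labels), size=n_flip, replace=False)
    for i in idx:
        choices = [c for c in range(num_classes) if c != labels[i]]
        labels[i] = rng.choice(choices)
    return labels

y = np.array([0, 1] * 50)            # 100 video-level labels
noisy = flip_symmetric(y, ratio=0.2)  # flip exactly 20% of them
```

Because the flip is symmetric, the class proportions are perturbed but not systematically biased toward either class, which is the standard setting for evaluating noise robustness.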
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the scope of the present invention should be included in the scope of the present invention.

Claims (9)

1. The weak supervision abnormal behavior intelligent detection method for the intelligent monitoring system is characterized by comprising the following steps of:
Step 1, extracting features of a video mode and extracting features of an audio mode;
step 2, adopting a self-attention network to respectively carry out self-attention enhancement on the characteristics of the video mode and the characteristics of the audio mode;
step 3, inputting the self-attention enhanced features obtained in the step 2 into a multi-layer perceptron to extract high-level semantic features, so as to obtain high-level semantic features of video and audio modes;
step 4, activating function normalized video and audio mode advanced semantic feature mean values to obtain abnormal behavior pseudo tags at the segment level;
Step 5, normalizing the advanced semantic features of the video and audio modes obtained in the step 3 to obtain gating information for background suppression, enhancing the features of the audio mode obtained in the step 2 by using the gating information of the video mode to obtain the features of the audio mode after background enhancement, and enhancing the features of the video mode obtained in the step 2 by using the gating information of the audio mode to obtain the features of the video mode after background enhancement;
step 6, performing cross-mode attention enhancement on the characteristics of the audio mode and the characteristics of the video mode after the background enhancement in the step 5 to obtain the characteristics of fusion of the audio mode and the video mode, and obtaining a final multi-mode abnormal behavior probability value by using a multi-layer perceptron;
In the cross-modal attention enhancement, the key vector matrix K v of the video modality stream, the key vector matrix K a of the audio modality stream, the value vector matrix V v of the video modality stream and the value vector matrix V a of the audio modality stream are calculated from the self-attention-enhanced video-modality features Attention v and the self-attention-enhanced audio-modality features Attention a obtained in step 2; the query vector matrices Q v, Q a of the video and audio modality streams are calculated from the background-enhanced features of the video and audio modalities; and cross-modal enhancement is performed by exchanging the key and value vector matrices of the audio modality stream with those of the video modality stream;
step 7, taking the abnormal behavior pseudo tag of the segment level obtained in the step 4 as a noise tag, and calculating a loss value with the multi-mode abnormal behavior probability value obtained in the step 6; the loss value is calculated by adopting a noise loss function, and the noise loss function consists of a weighted sum of an average absolute error MAE and a normalized cross entropy NCE;
step 8, calculating the multi-mode abnormal behavior probability value and the loss value of the video-level tag in the step 6 in a multi-instance learning mode;
And 9, calculating a weighted sum of the loss values in the step 7 and the step 8 to be used as the loss value, wherein the step 2 to the step 6 are the multi-mode abnormal behavior detection network model under weak supervision, and training the multi-mode abnormal behavior detection network model under weak supervision.
2. The method for intelligently detecting weak supervision abnormal behavior of an intelligent monitoring system according to claim 1, wherein the step 2 is specifically as follows:
The features of the video modality and the features of the audio modality are separately passed into a self-attention network for self-attention enhancement. The self-attention network maps queries, keys and values into vectors, computes a dot-product matrix, normalizes it with softmax, and finally obtains the weighted-sum value vector Attention v/a, calculated as follows:
Attention_{v/a} = softmax( Q K^T / √d_k ) · V
wherein Q = W_q F_{a/v}, Q denotes the query vector matrix, F_{a/v} denotes the features of the audio/video modality obtained in step 1; K = W_k F_{a/v}, K denotes the key vector matrix; V = W_v F_{a/v}, V denotes the value vector matrix; W_q, W_k, W_v denote the learnable parameter matrices of the query, key and value projections, respectively; QK^T denotes the dot-product matrix between the query and key matrices, the superscript T denotes matrix transposition, d_k denotes the dimension of the key vectors, and √d_k is used to normalize the dot product and avoid values that are too large or too small; softmax(·) denotes the softmax activation function used to normalize scores into probability values.
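The self-attention computation of this claim can be sketched in NumPy as below; matrix shapes, initialization and names are illustrative, not the patent's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(F, Wq, Wk, Wv):
    """Scaled dot-product self-attention over segment features F (T x d):
    Attention = softmax(Q K^T / sqrt(d_k)) V with Q = F Wq, K = F Wk, V = F Wv."""
    Q, K, V = F @ Wq, F @ Wk, F @ Wv
    dk = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(dk))   # (T, T) segment-to-segment weights
    return weights @ V                         # weighted-sum value vectors

rng = np.random.default_rng(0)
T, d = 8, 16                                   # 8 segments, 16-dim features
F = rng.standard_normal((T, d))
W = [rng.standard_normal((d, d)) * 0.1 for _ in range(3)]
out = self_attention(F, *W)
```

Each output row is a convex combination of all segments' value vectors, so every segment's representation is enriched with context from the whole clip.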
3. The method for intelligently detecting abnormal behavior under weak supervision of an intelligent monitoring system according to claim 2, wherein in step 3, the multi-layer perceptron comprises three fully connected layers, specifically as follows:
F̂_{v/a} = W_3·σ( W_2·σ( W_1·Attention_{v/a} + b_1 ) + b_2 ) + b_3
wherein W_1, W_2, W_3 denote the learnable parameter matrices of the three fully connected layers respectively, b_1, b_2, b_3 denote the bias terms of the three fully connected layers respectively, σ denotes the activation function between layers, and F̂_{v/a} denotes the high-level semantic features of the video/audio modality.
4. The method for intelligently detecting the weak supervision abnormal behavior of the intelligent monitoring system according to claim 2, wherein in step 4, the activation function normalizes the mean of the high-level semantic features of the video and audio modalities to obtain segment-level abnormal-behavior pseudo labels, calculated as follows:
the high-level semantic features of the video and audio modalities obtained in step 3 are averaged and normalized by an activation function into the final anomaly score, which serves as the segment-level abnormal-behavior pseudo label:
P_pseudo = σ( ( F̂_v + F̂_a ) / 2 )
where P_pseudo denotes the segment-level abnormal-behavior pseudo label, σ denotes the Sigmoid activation function, F̂_v denotes the high-level semantic features of the video modality, and F̂_a denotes the high-level semantic features of the audio modality.
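A minimal sketch of this pseudo-label step, assuming the high-level semantic features have been reduced to one score per segment (an assumption; the patent does not state the feature dimensionality at this point):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def segment_pseudo_labels(sem_video, sem_audio):
    """Segment-level abnormal-behavior pseudo labels: the video- and
    audio-modality high-level semantic scores are averaged per segment and
    squashed into (0, 1) with a Sigmoid: P_pseudo = sigmoid((F_v + F_a) / 2)."""
    return sigmoid((sem_video + sem_audio) / 2.0)

sem_v = np.array([-4.0, 0.0, 4.0])   # per-segment video-modality scores
sem_a = np.array([-4.0, 0.0, 4.0])   # per-segment audio-modality scores
p = segment_pseudo_labels(sem_v, sem_a)
```

Segments where both modalities agree on a high score receive pseudo labels near 1, giving the segment-level branch noisy but usable supervision.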
5. The method for intelligently detecting weak supervision abnormal behavior of an intelligent monitoring system according to claim 4, wherein in step 5, the gating information for background suppression is obtained by normalizing the high-level semantic features of the video and audio modalities, calculated as follows:
g_v = σ( F̂_v ), g_a = σ( F̂_a )
where g_v denotes the gating information of the video modality, g_a denotes the gating information of the audio modality, g_v and g_a represent the importance of each segment in the video and audio modalities respectively, g_v, g_a ∈ R^{d×1}, d denotes the number of segments, and R^{d×1} denotes the matrix dimension of the gating information;
the background-suppression-enhanced features are calculated as follows:
F̃_a = Attention_a + a·( g_v ⊙ Attention_a ), F̃_v = Attention_v + a·( g_a ⊙ Attention_v )
where a denotes a weighting proportion parameter, F̃_v denotes the background-enhanced video features, F̃_a denotes the background-enhanced audio features, Attention_v denotes the self-attention-enhanced features of the video modality, and Attention_a denotes the self-attention-enhanced features of the audio modality.
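A sketch of one plausible form of this gated background suppression: segments with high cross-modal gate values are amplified, so background segments are relatively suppressed. The residual combination rule `(1 + a*g) * x` is an assumption consistent with the claim's description, not a formula stated verbatim in the text:

```python
import numpy as np

def background_suppress(features, gate, a=0.5):
    """Scale each segment's features by (1 + a * gate), where `gate` holds
    per-segment importance from the *other* modality. High-gate (salient)
    segments are amplified; background segments keep their original scale."""
    return (1.0 + a * gate[:, None]) * features

T, d = 4, 8
x = np.ones((T, d))                   # 4 segments of 8-dim features
g = np.array([0.0, 0.0, 1.0, 1.0])   # last two segments flagged as salient
out = background_suppress(x, g)
```

Using the other modality's gate means, for example, that a scream in the audio track can boost the corresponding video segments even when they look visually unremarkable.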
6. The method for intelligent detection of weakly supervised abnormal behavior for intelligent monitoring systems as set forth in claim 5, wherein in step 6, the features of the audio-and-video-modality fusion comprise the features of the video modality stream and the features of the audio modality stream, expressed by the following formulas:
Q_v = W_q·F̃_v, Q_a = W_q·F̃_a
K_v = W_k·Attention_a, K_a = W_k·Attention_v
V_v = W_v·Attention_a, V_a = W_v·Attention_v
F_v = softmax( Q_v K_v^T / √d_k )·V_v, F_a = softmax( Q_a K_a^T / √d_k )·V_a
where F̃_v and F̃_a denote the background-enhanced video and audio features obtained in step 5, Q_v, Q_a denote the query vector matrices of the video and audio modality streams, K_v, K_a denote the key vector matrices of the video and audio modality streams, V_v, V_a denote the value vector matrices of the video and audio modality streams, W_q, W_k, W_v denote the learnable parameter matrices of the query, key and value projections, F_v denotes the features of the video modality stream, and F_a denotes the features of the audio modality stream; the features of the video and audio modality streams are added to obtain the final fused features of the two modalities, which are then passed through a fully connected layer to obtain the final multi-modal abnormal-behavior probability value.
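The key/value swap of this claim can be sketched in NumPy as follows; shapes, initialization and names are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(bg_v, bg_a, att_v, att_a, Wq, Wk, Wv):
    """Cross-modal attention with swapped keys/values: each stream queries
    with its own background-enhanced features (bg_*) but attends over keys
    and values built from the OTHER modality's self-attention features."""
    Qv, Qa = bg_v @ Wq, bg_a @ Wq
    Kv, Vv = att_a @ Wk, att_a @ Wv   # video stream attends over audio features
    Ka, Va = att_v @ Wk, att_v @ Wv   # audio stream attends over video features
    dk = Kv.shape[-1]
    Fv = softmax(Qv @ Kv.T / np.sqrt(dk)) @ Vv
    Fa = softmax(Qa @ Ka.T / np.sqrt(dk)) @ Va
    return Fv + Fa                    # added to form the fused bimodal features

rng = np.random.default_rng(1)
T, d = 6, 8                           # 6 segments, 8-dim features
mats = [rng.standard_normal((T, d)) for _ in range(4)]
Ws = [rng.standard_normal((d, d)) * 0.1 for _ in range(3)]
fused = cross_modal_attention(*mats, *Ws)
```

Swapping keys and values forces each modality's queries to retrieve evidence from the other modality, which is how the cross-modal context association is realized.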
7. The method for intelligently detecting the weak supervision abnormal behavior of the intelligent monitoring system according to claim 3, wherein in the step 7, the abnormal behavior pseudo tag at the segment level obtained in the step 4 is used as a noise tag, and the loss value calculation is performed with the multi-mode abnormal behavior probability value obtained in the step 6; the method comprises the following steps:
And (3) taking the segment-level abnormal behavior pseudo tag obtained in the step (4) as a noise tag, and calculating a loss value between the segment-level abnormal behavior pseudo tag and the multi-mode abnormal behavior probability value, wherein the loss value is calculated by adopting a noise loss function, and the noise loss function consists of a weighted sum of an average absolute error MAE and a normalized cross entropy NCE.
8. The intelligent detection method for weak supervision abnormal behavior of intelligent monitoring system according to claim 1, wherein in step 9, the calculation process of weighting and summing the loss values of step 7 and step 8 as the loss value L total is as follows:
L_total = a·L_MIL + b·L_noisy
where L_MIL denotes the loss between the multi-modal abnormal-behavior probability values obtained in the multiple-instance-learning manner of step 8 and the video-level labels, L_noisy denotes the loss between the segment-level abnormal-behavior pseudo labels and the multi-modal abnormal-behavior probability values obtained in step 7, and a and b denote the weights of the weighted sum.
9. The method for intelligently detecting weak supervision abnormal behavior of an intelligent monitoring system according to claim 1, wherein in step 1, the characteristics of a video mode are extracted by using a pre-training I3D network, and the characteristics of an audio mode are extracted by using a pre-training VGGish network.
CN202410150884.2A 2024-02-02 2024-02-02 Intelligent detection method for weak supervision abnormal behavior of intelligent monitoring system Active CN117690191B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410150884.2A CN117690191B (en) 2024-02-02 2024-02-02 Intelligent detection method for weak supervision abnormal behavior of intelligent monitoring system


Publications (2)

Publication Number Publication Date
CN117690191A CN117690191A (en) 2024-03-12
CN117690191B true CN117690191B (en) 2024-04-30

Family

ID=90128590


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112685597A (en) * 2021-03-12 2021-04-20 杭州一知智能科技有限公司 Weak supervision video clip retrieval method and system based on erasure mechanism
CN113971776A (en) * 2021-10-15 2022-01-25 浙江大学 Audio-visual event positioning method and system
CN116935303A (en) * 2022-10-27 2023-10-24 安徽大学 Weak supervision self-training video anomaly detection method
CN117173394A (en) * 2023-08-07 2023-12-05 山东大学 Weak supervision salient object detection method and system for unmanned aerial vehicle video data



Similar Documents

Publication Publication Date Title
Guanghui et al. Multi-modal emotion recognition by fusing correlation features of speech-visual
Wang et al. Split and connect: A universal tracklet booster for multi-object tracking
Fang et al. Traffic accident detection via self-supervised consistency learning in driving scenarios
CN111626199B (en) Abnormal behavior analysis method for large-scale multi-person carriage scene
He et al. DepNet: An automated industrial intelligent system using deep learning for video‐based depression analysis
Zhang et al. Weakly supervised anomaly detection in videos considering the openness of events
CN111539445A (en) Object classification method and system based on semi-supervised feature fusion
Gorodnichev et al. Research and Development of a System for Determining Abnormal Human Behavior by Video Image Based on Deepstream Technology
Feng et al. SSLNet: A network for cross-modal sound source localization in visual scenes
Yin et al. Msa-gcn: Multiscale adaptive graph convolution network for gait emotion recognition
CN117690191B (en) Intelligent detection method for weak supervision abnormal behavior of intelligent monitoring system
Kumar et al. Abnormal human activity detection by convolutional recurrent neural network using fuzzy logic
Yang et al. LCSED: A low complexity CNN based SED model for IoT devices
CN116704609A (en) Online hand hygiene assessment method and system based on time sequence attention
Hou et al. Joint prediction of audio event and annoyance rating in an urban soundscape by hierarchical graph representation learning
Hao et al. Human behavior analysis based on attention mechanism and LSTM neural network
CN114973202B (en) Traffic scene obstacle detection method based on semantic segmentation
Liu [Retracted] Sports Deep Learning Method Based on Cognitive Human Behavior Recognition
Shi et al. Research on Safe Driving Evaluation Method Based on Machine Vision and Long Short‐Term Memory Network
CN113312968B (en) Real abnormality detection method in monitoring video
CN115147921A (en) Key area target abnormal behavior detection and positioning method based on multi-domain information fusion
Umeki et al. Salient object detection with importance degree
Han et al. NSNP-DFER: a nonlinear spiking neural P network for dynamic facial expression recognition
Chandrakala Anomalous human activity detection in videos using Bag-of-Adapted-Models-based representation
Chen et al. ABOS: an attention-based one-stage framework for person search

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant