CN114220169A - Lightweight real-time monitoring abnormal behavior detection method based on Yolo-TSM - Google Patents


Info

Publication number
CN114220169A
CN114220169A
Authority
CN
China
Prior art keywords
layer, conv, tsm, model, yolo
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111543515.2A
Other languages
Chinese (zh)
Inventor
陈雷
秦野风
贲晛烨
李玉军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Shandong University
Priority to CN202111543515.2A
Publication of CN114220169A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/047 Probabilistic or stochastic networks
    • G06N3/08 Learning methods


Abstract

The invention relates to a lightweight real-time monitoring abnormal behavior detection method based on YOLO-TSM, which comprises the following steps. Step 1: a lightweight target detection network, the YOLO network, performs target detection on the pedestrians in a monitored video, obtaining the targets that need behavior detection and their spatial features. Step 2: a behavior recognition algorithm, the TSM network, detects the behaviors of the pedestrians inside the detection boxes and acquires their spatio-temporal features. Step 3: an attention mechanism module fuses the obtained spatial and spatio-temporal features, the model result is inferred, and the inference result is output in real time. The invention combines the strong real-time target detection model YOLO with the behavior recognition model TSM and uses an attention mechanism for feature fusion, thereby improving the real-time inference rate while ensuring a certain accuracy, refining behavior detection, and localizing each detected behaving individual in the scene.

Description

Lightweight real-time monitoring abnormal behavior detection method based on Yolo-TSM
Technical Field
The invention relates to a lightweight real-time monitoring abnormal behavior detection method based on a Yolo-TSM, and belongs to the technical field of space-time behavior detection.
Background
With the development of society and the progress of science and technology, real-time monitoring covers an ever wider range, and cameras have become indispensable for surveillance in both public and private places. As video surveillance expands, ever more monitoring personnel and effort are required, yet abnormal situations are still missed. To solve this problem, many researchers have proposed behavior detection methods for network cameras on the basis of video understanding, i.e., performing real-time behavior detection on human individuals in various scenes so that abnormal situations are not missed through human error. Real-time behavior detection is therefore a video detection technology that meets the requirements of security and related fields.
Behavior detection (Action Detection) algorithms based on conventional RGB deep learning fall mainly into 2D-CNN-based two-stream convolution algorithms and 3D-CNN algorithms. The two-stream convolution algorithm, because it uses a 2D convolutional network, has a great advantage in inference speed, i.e., real-time performance, but its extraction of temporal features is poor, so it has no advantage in accuracy. The 3D convolutional network algorithm is the opposite: it has a great advantage in accuracy, but because the network has too many parameters its real-time performance is poor.
Compared with traditional behavior recognition algorithms, spatio-temporal action detection (Spatio-temporal Action Detection) algorithms can, while maintaining timeliness and accuracy, attribute the behaviors in a scene to each detected person to be recognized, achieving more accurate object-level recognition; however, the relative computational load is larger, so spatio-temporal action detection algorithms are currently rarely applied to real-time video monitoring.
In summary, two major problems exist in the prior art. The main problem is that current real-time monitoring and analysis of abnormal behaviors has poor real-time performance once a certain accuracy is reached, i.e., abnormal behaviors cannot be analyzed and processed in real time. The secondary problem is that current common behavior recognition only characterizes the overall behavior occurring in a scene, and lacks behavior detection and analysis for each individual in the scene.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a lightweight real-time monitoring abnormal behavior detection method based on YOLO-TSM.
interpretation of terms:
1. Mosaic data enhancement: the four incoming key-frame pictures, each with its corresponding boxes, are spliced into one new picture, and the detection boxes corresponding to the new picture are obtained at the same time, enriching the backgrounds available for target detection.
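The splicing step above can be sketched as follows (a minimal numpy illustration, not the patent's implementation; the quadrant layout and naive cropping are assumptions, and the remapping of the detection boxes is omitted):

```python
import numpy as np

def mosaic(imgs, out_size=416):
    """Stitch four key-frame images into one mosaic image.

    Minimal sketch: each input is placed into one quadrant by naive
    cropping; a real Mosaic pipeline rescales and also remaps boxes."""
    h = w = out_size // 2
    canvas = np.zeros((out_size, out_size, 3), dtype=imgs[0].dtype)
    quads = [(0, 0), (0, w), (h, 0), (h, w)]  # TL, TR, BL, BR corners
    for img, (y, x) in zip(imgs, quads):
        patch = img[:h, :w]                   # naive crop stand-in
        canvas[y:y + patch.shape[0], x:x + patch.shape[1]] = patch
    return canvas

imgs = [np.full((300, 300, 3), i, dtype=np.uint8) for i in range(4)]
m = mosaic(imgs)
print(m.shape)  # (416, 416, 3)
```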
2. The CFAM module: an attention mechanism model based on the Gram matrix that can perform global feature fusion.
3. AVA data set: consists of URLs of public YouTube videos annotated with a set of 80 atomic actions (Atomic Actions) (such as [walk], [kick (something)], [handshake]); all actions are localized in space and time, yielding 57.6k video segments, 96k labeled human actions and 210k action labels. The invention selects abnormal behaviors (fighting, falling, running, etc.) as targets for training and testing.
4. k-means clustering algorithm: an iteratively solved clustering analysis algorithm. The data are divided into K groups in advance: K objects are randomly selected as initial cluster centers, the distance between each object and each seed cluster center is calculated, and each object is assigned to the nearest cluster center. A cluster center and the objects assigned to it represent one cluster. For each assignment, the cluster center is recalculated from the objects currently in the cluster. This process repeats until some termination condition is met.
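The iterative procedure above can be sketched as follows (an illustrative numpy implementation over (w, h) box sizes; plain Euclidean distance is assumed here, whereas anchor selection for detection often uses an IoU-based distance instead):

```python
import numpy as np

def kmeans_anchors(boxes, k=5, iters=20, seed=0):
    """boxes: array of (w, h) pairs; returns k cluster centers (prior boxes)."""
    rng = np.random.default_rng(seed)
    centers = boxes[rng.choice(len(boxes), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # distance of every box to every center, then nearest-center assignment
        d = np.linalg.norm(boxes[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):            # recompute center from members
                centers[j] = boxes[assign == j].mean(axis=0)
    return centers
```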
5. The TSM module: a module for extracting features along the time dimension that can be realized on top of a 2D CNN without adding a large number of operations, i.e., it enlarges the temporal receptive field and thereby obtains temporal information.
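The shift operation the TSM module performs can be sketched as follows (a numpy illustration of the standard temporal shift; the fold_div value and zero filling follow the common TSM formulation and are assumptions, not values quoted from this patent):

```python
import numpy as np

def temporal_shift(x, fold_div=8):
    """Temporal Shift Module on a (N, T, C, H, W) tensor.

    Shifts the first C/fold_div channels forward in time, the next
    C/fold_div backward, and leaves the rest in place; the slots
    vacated by the shift are zero-filled."""
    n, t, c, h, w = x.shape
    fold = c // fold_div
    out = np.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]              # shift forward: t -> t+1
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]  # shift backward
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]         # remaining channels stay
    return out
```

Mixing shifted and unshifted channels is what lets a 2D backbone see neighboring frames at near-zero extra computation.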
The technical scheme of the invention is as follows:
a detection method for lightweight real-time monitoring of abnormal behaviors based on a YOLO-TSM comprises the following steps:
step 1: acquiring a data set and preprocessing the data set;
step 2: constructing a TSM network model;
step 3: constructing a YOLO lightweight network model;
step 4: constructing a feature fusion model;
step 5: performing end-to-end training on the TSM network model, the YOLO lightweight network model and the feature fusion model constructed in steps 2, 3 and 4, using the data set preprocessed in step 1;
step 6: preprocessing the video to be detected and inputting it into the trained YOLO-TSM lightweight model, which comprises the trained TSM network model, YOLO lightweight network model and feature fusion model, with the TSM network model and the YOLO lightweight network model connected to the feature fusion model; using the weight file obtained from training, inference is performed on the input video information, target detection and behavior recognition are carried out on the behaving individuals in the video stream, and finally the abnormal behavior is localized to the individual in the scene.
Preferably according to the invention, the data sets comprise an AVA data set and an abnormal behaviour data set.
According to the present invention, in step 1, the acquired data set is preprocessed, and the data set is processed into two forms of an input TSM network model and a YOLO lightweight network model:
processing the data set into a form of an input TSM network model, namely tensor A, wherein tensor A belongs to R (N, C, T, H and W), N is the batch processing size, C is the channel number, T is the time dimension, and H and W are the spatial resolution;
processing the data set into the form input to the YOLO lightweight network model, namely tensor B, where tensor B belongs to R(N, C, H, W): the key frames undergo Mosaic data enhancement and are cropped and compressed to H x W, giving tensor B with dimensions (N, C, H, W);
according to a preferred embodiment of the present invention, the backbone network of the YOLO lightweight network model is Darknet53, and includes a first Conv convolutional layer, a second Conv convolutional layer, a first Residual Block module, a third Conv convolutional layer, a second Residual Block module, a fourth Conv convolutional layer, a third Residual Block module, a fifth Conv convolutional layer, a fourth Residual Block module, a sixth Conv convolutional layer, and a fifth Residual Block module, which are connected in sequence;
the first Residual Block module comprises 1 Residual Block, the second Residual Block module comprises 2 Residual blocks, the third Residual Block module comprises 8 Residual blocks, the fourth Residual Block module comprises 8 Residual blocks, and the fifth Residual Block module comprises 4 Residual blocks;
in the first Conv convolution layer, outputting the characteristics through a [3,3] convolution network with padding being 1, and continuously outputting through a normalization layer and an activation function;
in the second Conv convolution layer, outputting characteristics through a [3,3] convolution network with the step length of 2 for one time;
the first, second, third, fourth and fifth Residual Block modules each comprise two Conv layers, a normalization function and an activation function, i.e., a BatchNorm + Relu layer: the input first passes through a [1,1] Conv layer, which adjusts the channel number, then through the BatchNorm + Relu layer, and finally through a [3,3] Conv layer with padding of 1, which outputs the features;
the third, fourth, fifth and sixth Conv convolutional layers are [3,3] Conv layers with step size of 2 and padding of 1;
finally, the YOLO lightweight network model obtains a spatial feature tensor of [ C, H, W ].
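The spatial downsampling implied by the backbone above can be checked arithmetically (a sketch assuming a 416 x 416 input, the size used later in the embodiment; the Residual Block modules preserve spatial size, so only the one stride-1 and five stride-2 [3,3] Conv layers change H and W):

```python
def conv_out(size, kernel=3, stride=1, pad=1):
    # Standard convolution output-size formula: floor((size + 2p - k)/s) + 1
    return (size + 2 * pad - kernel) // stride + 1

size = 416                        # assumed input resolution
size = conv_out(size, stride=1)   # first Conv layer, padding 1
for _ in range(5):                # second..sixth Conv layers, stride 2
    size = conv_out(size, stride=2)
print(size)  # 13: the H = W of the final [C, H, W] spatial feature tensor
```

Five stride-2 stages give an overall stride of 32, hence 416 / 32 = 13.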
According to the invention, the backbone network of the TSM network model is Resnet50 added with the TSM module, and comprises a Conv convolution layer, a maximum pooling layer, a first residual module, a second residual module, a third residual module, a fourth residual module, and an activation function between each residual module, which are connected in sequence;
in the Conv convolution layer, outputting characteristics through a [7,7] convolution network, and continuously outputting through a normalization layer and an activation function;
in the maximum pooling layer, reducing the dimension and reducing the input quantity through a [3,3] maximum pooling layer;
the first residual error module comprises 1 first Model-1 and 2 second Model models-2; the second residual error module comprises 1 first Model-1 and 3 second Model models-2; the third residual error module comprises 1 first Model-1 and 5 second Model models-2; the fourth residual module comprises 1 first Model-1 and 3 second models Model-2;
the first Model-1 is divided into two branches: one branch first passes through a TSM module into a [1,1] Conv layer, which adjusts the number of channels, then through a normalization layer and an activation function into a [3,3] Conv layer, which obtains the features, then through a normalization layer and an activation function into a [1,1] Conv layer with a different channel number, which adjusts the number of channels, and finally through a normalization layer; the other branch is the input itself; the final output is the element-wise sum of the two branches;
the second Model-2 is divided into two branches: one branch first passes through a TSM module into a [1,1] Conv layer, which adjusts the number of channels, then through a normalization layer and an activation function into a [3,3] Conv layer, which obtains the features, then through a normalization layer and an activation function into a [1,1] Conv layer, which adjusts the number of channels, and finally through a normalization layer; the other branch first enters a [1,1] Conv layer, which adjusts the number of channels, then passes through a normalization layer and an activation function; the final output is the element-wise sum of the two branches;
finally, a spatio-temporal feature tensor of size [C1, H1, W1] is output.
According to the present invention, preferably, the feature fusion model includes a channel fusion module, a CFAM module, a first Conv layer, and a second Conv layer, which are connected in sequence;
in the channel fusion module, the [C1, H1, W1] output by the TSM network model is first input into a [1,1] convolution layer and adjusted to [C2, H, W]; then the [C, H, W] output by the YOLO lightweight network model and the feature tensor [C2, H, W] are linked in the channel dimension to obtain the tensor [C3, H, W], C3 = C2 + C, which is input into a [1,1] Conv layer to adjust the number of channels; then the feature map of each channel is unidimensionalized, i.e., changed to F[C3, N], N = H*W;
In the CFAM module, the input feature vector F[C3, N] is vector-multiplied with its own transpose to obtain the Gram matrix G[N, N]; after the Gram matrix G[N, N] is calculated, a softmax layer generates the channel attention M, which is matrix-multiplied with F; the obtained result is reshaped into a three-dimensional tensor (C*H*W) with the same shape as the input tensor B and combined with the original input feature map B;
inputting a Conv layer of [1,1] into the first Conv layer, adjusting the number of channels, and then passing through a normalization layer and an activation function;
in the second Conv layer, the input enters a [1,1] Conv layer, the number of channels is adjusted, and the output D[C4, H, W] of the feature fusion model is produced.
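The Gram-matrix attention step can be sketched as follows (an illustrative numpy version; note that for a channel attention M that is matrix-multiplied with F[C, N], the Gram matrix F F^T is C x C, which is what this sketch computes, and the residual combination with the input is an assumption about how "combined" is realized):

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cfam_attention(F):
    """Gram-matrix channel attention over a flattened feature map F[C, N].

    G = F @ F.T maps inter-channel dependencies; softmax turns it into
    an attention matrix M; M @ F re-weights the channels, and the input
    is added back (assumed residual combination)."""
    G = F @ F.T                 # Gram matrix, shape [C, C]
    M = softmax(G, axis=-1)     # channel attention, rows sum to 1
    return M @ F + F            # attend, then combine with the input
```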
Preferably, in step 6, in the feature fusion model, a convolution kernel of size 1 x 1 is used for the last convolution layer to adjust the output channels; 5 prior boxes are selected on the corresponding data set using the k-means algorithm, generating a feature tensor with a channel number of 5 x (Numclasses + 4 + 1), where Numclasses represents the confidence scores of the Numclasses behavior classes in the data set, 4 represents the 4 coordinates of the target detection box, and 1 represents the detection confidence; the regression of the bounding box is refined according to the anchor points, finally achieving accurate localization, detection and behavior recognition of the target, i.e., spatio-temporal behavior detection.
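The channel count of the final feature tensor follows directly from the formula above (the class counts used here are illustrative, not taken from the patent):

```python
def head_channels(num_classes, num_anchors=5):
    # Per anchor: num_classes class scores + 4 box coordinates + 1 confidence
    return num_anchors * (num_classes + 4 + 1)

print(head_channels(7))   # 60, e.g. for 7 behavior classes (illustrative)
print(head_channels(80))  # 425, the familiar YOLO head size for 80 classes
```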
A computer device comprising a memory storing a computer program and a processor implementing the steps of a YOLO-TSM based lightweight real-time monitoring abnormal behavior detection method when executing the computer program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the YOLO-TSM-based lightweight real-time monitoring abnormal-behavior detection method.
The invention has the beneficial effects that:
the invention tries a mode of combining an excellent real-time target detection model YOLO with a behavior recognition model TSM, and uses an attention mechanism to perform feature fusion, thereby improving the real-time reasoning rate while ensuring certain accuracy, refining behavior detection and positioning each behavior individual detected in a scene.
Drawings
FIG. 1 is a schematic flow chart of the detection method for lightweight real-time monitoring of abnormal behaviors based on a YOLO-TSM according to the present invention;
FIG. 2 is a schematic diagram of a YOLO network;
FIG. 3 is a schematic structural diagram of Residual Block;
FIG. 4 is a schematic diagram of the connection structure between Residual blocks;
FIG. 5 is a schematic diagram of a TSM network;
FIG. 6 is a schematic diagram of a residual module;
FIG. 7 is a schematic structural diagram of a first Model-1;
FIG. 8 is a schematic structural diagram of a second Model-2;
FIG. 9 is a schematic diagram of the principle of the TSM module;
FIG. 10 is a schematic structural diagram of a feature fusion model;
fig. 11 is a schematic structural diagram of a CFAM module;
fig. 12 is a diagram showing the effect of the abnormal behavior detection method.
Detailed Description
The invention is further described below with reference to, but not limited to, the figures and examples in the description.
Example 1
A detection method for lightweight real-time monitoring of abnormal behaviors based on a YOLO-TSM is shown in FIG. 1 and comprises the following steps:
step 1: acquiring a data set and preprocessing the data set;
step 2: constructing a TSM network model;
step 3: constructing a YOLO lightweight network model;
step 4: constructing a feature fusion model;
step 5: performing end-to-end training on the TSM network model, the YOLO lightweight network model and the feature fusion model constructed in steps 2, 3 and 4, using the data set preprocessed in step 1;
step 6: preprocessing the video to be detected and inputting it into the trained YOLO-TSM lightweight model, which comprises the trained TSM network model, YOLO lightweight network model and feature fusion model, with the TSM network model and the YOLO lightweight network model connected to the feature fusion model; using the weight file obtained from training, inference is performed on the input video information, target detection and behavior recognition are carried out on the behaving individuals in the video stream, and finally the abnormal behavior is localized to the individual in the scene.
Example 2
The detection method for the lightweight real-time monitoring of the abnormal behavior based on the YOLO-TSM in the embodiment 1 is characterized in that:
the data sets include an AVA data set (part) and an abnormal behavior data set. The abnormal behavior data set is a video data set collected from a Youtube website, and comprises abnormal behaviors such as fighting, falling, running and the like, normal behaviors such as standing, walking and the like, and a total of 1000 videos of 10 s.
In step 1, preprocessing the acquired data set, and processing the data set into two forms of an input TSM network model and a YOLO lightweight network model respectively:
processing the data set into the form input to the TSM network model, namely tensor A, where tensor A belongs to R(N, C, T, H, W), N is the batch size, C the channel number, T the time dimension, and H and W the spatial resolution. Specifically: 8 consecutive frames are taken, cropped and compressed to 224 x 224, i.e., the dimension becomes (1, 3, 8, 224, 224), and input into the TSM network model.
Processing the data set into the form input to the YOLO lightweight network model, namely tensor B, where tensor B belongs to R(N x C x H x W). Specifically: key frames are taken from the pictures, Mosaic data enhancement is applied, and they are cropped and compressed to H x W, giving tensor B with dimensions (N, C, H, W). One key frame is taken per 8 pictures, the 8th picture is selected, Mosaic data enhancement is applied, and it is cropped and compressed to 416 x 416, so the dimension of B is (1, 3, 416, 416); this is input into the YOLO lightweight network model.
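The two input forms with the concrete sizes above can be sketched as follows (numpy shape bookkeeping only; the naive cropping stands in for the real resize/augmentation pipeline):

```python
import numpy as np

# Dummy clip of 8 consecutive frames (e.g. 480x640 RGB)
frames = [np.zeros((480, 640, 3), dtype=np.uint8)] * 8

# TSM branch: 8 frames at 224x224, arranged as (N, C, T, H, W)
clip = np.stack([f[:224, :224] for f in frames])   # (T, H, W, C)
tensor_a = clip.transpose(3, 0, 1, 2)[None]        # (N, C, T, H, W)
print(tensor_a.shape)  # (1, 3, 8, 224, 224)

# YOLO branch: the 8th frame at 416x416, arranged as (N, C, H, W)
key = frames[-1][:416, :416]
tensor_b = key.transpose(2, 0, 1)[None]            # (N, C, H, W)
print(tensor_b.shape)  # (1, 3, 416, 416)
```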
A backbone network (backbone) of the YOLO lightweight network model is Darknet53, and as shown in fig. 2, includes a first Conv convolutional layer, a second Conv convolutional layer, a first Residual Block module, a third Conv convolutional layer, a second Residual Block module, a fourth Conv convolutional layer, a third Residual Block module, a fifth Conv convolutional layer, a fourth Residual Block module, a sixth Conv convolutional layer, and a fifth Residual Block module, which are connected in sequence;
the first Residual Block module comprises 1 Residual Block, the second Residual Block module comprises 2 Residual blocks, the third Residual Block module comprises 8 Residual blocks, the fourth Residual Block module comprises 8 Residual blocks, and the fifth Residual Block module comprises 4 Residual blocks;
in the first Conv convolution layer, outputting the characteristics through a [3,3] convolution network with padding being 1, and continuing outputting through a normalization layer (BatchNorm) and an activation function (Relu);
in the second Conv convolution layer, outputting characteristics through a [3,3] convolution network with the step length of 2 for one time;
the first, second, third, fourth and fifth Residual Block modules each include two Conv layers, a normalization function (BatchNorm) and an activation function (Relu), i.e., a BatchNorm + Relu layer, as shown in fig. 3: the channel number is adjusted by a [1,1] Conv layer, and the features are output through the BatchNorm + Relu layer and finally through a [3,3] Conv layer with padding = 1;
the third, fourth, fifth and sixth Conv convolutional layers are [3,3] Conv layers with step size of 2 and padding of 1;
as shown in fig. 4, between the first, second, third, fourth and fifth Residual Block modules the features are output through a [3,3] Conv layer with a step size of 2 and padding of 1;
finally, the YOLO lightweight network model obtains a spatial feature tensor of [ C, H, W ].
A backbone network (backbone) of the TSM network model is a Resnet50 added with a TSM module, and as shown in fig. 5, includes a Conv convolutional layer, a maximum pooling layer, a first residual module (Block1), a second residual module (Block2), a third residual module (Block3), a fourth residual module (Block4), and an activation function (Relu) between each residual module, which are connected in sequence;
in the Conv convolution layer, outputting characteristics through a [7,7] convolution network, and continuously outputting through a normalization layer (BatchNorm) and an activation function (Relu);
in the maximum pooling layer, reducing the dimension and reducing the input quantity through a [3,3] maximum pooling layer;
as shown in fig. 6, the residual Block (Block) is formed by connecting 1 first Model-1 and a plurality of second models Model-2;
the first residual error module comprises 1 first Model-1 and 2 second Model models-2; the second residual error module comprises 1 first Model-1 and 3 second Model models-2; the third residual error module comprises 1 first Model-1 and 5 second Model models-2; the fourth residual module comprises 1 first Model-1 and 3 second models Model-2;
as shown in fig. 7, the first Model-1 is divided into two branches: one branch first passes through the TSM module into a [1,1] Conv layer, which adjusts the number of channels, then through a normalization layer (BatchNorm) and an activation function (Relu) into a [3,3] Conv layer, which obtains the features, then through a normalization layer (BatchNorm) and an activation function (Relu) into a [1,1] Conv layer with a different channel number, which adjusts the number of channels, and finally through a normalization layer (BatchNorm); the other branch is the input itself; the final output is the element-wise sum of the two branches;
as shown in fig. 8, the second Model-2 is divided into two branches: one branch first passes through the TSM module into a [1,1] Conv layer, which adjusts the number of channels, then through a normalization layer (BatchNorm) and an activation function (Relu) into a [3,3] Conv layer, which obtains the features, then through a normalization layer (BatchNorm) and an activation function (Relu) into a [1,1] Conv layer, which adjusts the number of channels, and finally through a normalization layer (BatchNorm); the other branch first enters a [1,1] Conv layer, which adjusts the number of channels, then passes through a normalization layer (BatchNorm) and an activation function (Relu); the final output is the element-wise sum of the two branches;
residual module after TSM network improvement (Block): as shown in FIGS. 7 and 8, a TSM module is added to a branch input of all Model-1 and Model-1 of all residual modules (Block).
The TSM module: the TSM network adds a TSM module to each residual branch, removes the final fully connected layer, and inputs the result directly into the feature fusion model. The principle is shown in fig. 9: the left shows the temporal and channel dimensions of the tensor before shifting; the middle shows the matrix after displacement through the TSM module, where the feature maps of the first two channel groups are shifted by one step forward and backward along time; finally, the vacancies left by the shift are filled with zeros. For each inserted time-shift block, the temporal receptive field expands by 2, as if a kernel-size-3 convolution were run along the time dimension. The TSM module therefore has a large temporal receptive field and can capture highly complex spatio-temporal information.
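The receptive-field growth stated above is simple arithmetic (the count of 17 shifted residual units is derived from the Model-1/Model-2 counts given earlier and is an inference, not a figure quoted from the patent):

```python
def temporal_receptive_field(num_shift_blocks):
    # Each shift block acts like a kernel-3 convolution along time,
    # so the temporal receptive field grows by 2 per block.
    return 1 + 2 * num_shift_blocks

# Per the module counts above: (1+2) + (1+3) + (1+5) + (1+3) = 17 shifted
# residual units, one TSM module each (an inference from the text).
print(temporal_receptive_field(17))  # 35
```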
Activation function: an activation function (Relu) is applied between the connections of every pair of Models (including between the Models of adjacent residual modules).
Finally, a spatio-temporal feature tensor of size [C1, H1, W1] is output.
The feature fusion model adopted in step 4, shown in fig. 10, smoothly aggregates the features of the different branches according to the relationships between channels; it comprises a Channel fusion module (Channel fusion), a CFAM module, a first Conv layer and a second Conv layer connected in sequence;
in the Channel fusion module (Channel fusion), the [ C ] output by the TSM network model1,H1,W1]Firstly input into one [1,1]]The amount of the above-mentioned convolution layer is,is adjusted to [ C ]2,H,W]Then outputting the YOLO lightweight network model to [ C, H, W ]]And the characteristic tensor [ C ]2,H,W]Linking in channel dimension to obtain tensor [ C3,H,W],C3=C2+ C, inputting one [1,1]]Conv layer of (2), adjusting the number of channels Bc4*H*W]And then, the feature vector of each channel is unidimensionalized, i.e., the feature vector is changed to F [ C ]3,N],N=H*W;
In the CFAM module, as shown in FIG. 11, the input feature vector F[C3, N] is vector-multiplied with its own transpose to obtain the Gram matrix G[N, N]; the Gram matrix can map the dependency relationships among the channels. After the Gram matrix G[N, N] is calculated, a softmax layer generates the channel attention M, which is matrix-multiplied with F; the obtained result is reshaped into a three-dimensional tensor (C*H*W) with the same shape as the input tensor B and combined with the original input feature map B;
inputting a [1,1] Conv layer into the first Conv layer, adjusting the number of channels, and then passing through a normalization layer (BatchNorm) and an activation function (Relu);
in the second Conv layer, the input enters a [1,1] Conv layer, the number of channels is adjusted, and the output D[C4, H, W] of the feature fusion model is produced.
The specific implementation process of the step 5 is as follows:
the training configuration employed was as follows:
hardware environment:
CPU:AMD Ryzen 7 5800H
GPU:NVIDIA GeForce RTX 3060(6G)
memory: 16G
Software environment:
OS:Windows 10
Python:Anaconda3 python3.7
CUDA:11.1
Torch:1.8.0
Step 5.1: the TSM network model is initialized with a pre-trained model on Kinetics, and the YOLO lightweight network model is initialized with a pre-trained model on the COCO data set, reducing the amount of data required for training; although there are two networks, their parameters are updated jointly, and the complete architecture is implemented in PyTorch and trained end to end.
Step 5.2: the loss function is calculated in real time during training; the bounding box uses a Smooth L1 loss. Consecutive pictures are input into the TSM network model and the YOLO lightweight network model for prediction and inference; the ground truth, containing the position information and behavior class of each real box, is then encoded into the same form as the inference results of the two models, namely: each prediction box is compared with all real boxes, and the differences of the four box coordinates and of the behavior class are computed as losses;
after the loss is computed, back-propagation is performed for optimization;
during training, a mini-batch stochastic gradient descent algorithm with a weight decay strategy is selected to optimize the loss function;
data augmentation is applied to the YOLO lightweight network model: random mirroring, scale changes and the like are performed according to random seeds to strengthen generalization. The initial learning rate is set to 0.0001 and is decayed by a factor of 0.4 after 30k iterations; on the partial AVA dataset used, training for 15 epochs achieves good results and good generalization.
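The training procedure of step 5.2 and the optimizer settings above can be sketched as a single training step. The model here is a toy stand-in: the Smooth L1 box loss, the cross-entropy class loss, mini-batch SGD with weight decay, the 0.0001 initial learning rate, and the 0.4x decay after 30k iterations follow the text, while the momentum and weight-decay values are assumptions.

```python
import torch
import torch.nn as nn

box_loss_fn = nn.SmoothL1Loss()      # bounding-box regression loss (Smooth L1)
cls_loss_fn = nn.CrossEntropyLoss()  # behavior-class loss

model = nn.Linear(8, 4 + 3)          # toy stand-in: 4 box coords + 3 classes
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4,
                            momentum=0.9, weight_decay=5e-4)
# decay the learning rate by a factor of 0.4 after 30k iterations
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[30_000], gamma=0.4)

def train_step(x, gt_boxes, gt_labels):
    pred = model(x)
    loss = (box_loss_fn(pred[:, :4], gt_boxes)      # coordinate differences
            + cls_loss_fn(pred[:, 4:], gt_labels))  # class difference
    optimizer.zero_grad()
    loss.backward()                                  # back-propagate the loss
    optimizer.step()
    scheduler.step()                                 # counts iterations
    return loss.item()
```

In the actual method, `pred` would be the decoded predictions of the fused YOLO-TSM head compared against all ground-truth boxes, as described above.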
The dataset used is a subset of AVA covering abnormal behaviors. AVA consists of URLs of public YouTube videos annotated with a vocabulary of 80 atomic actions (e.g., [walk], [kick (something)], [shake hands]); all actions are localized in space and time, yielding 57.6k video segments, 96k labeled human actions, and 210k action labels.
In addition, a custom dataset is used for training, and several abnormal-behavior videos from the Internet are analyzed, achieving good results, as shown in Table 1:
TABLE 1
The feature fusion model smoothly aggregates the features of the different branches according to the relationships among the channels, which greatly enhances the feature discrimination capability. In the feature fusion model, behaviors are attributed to individuals: the behaviors are linked with the target detection results obtained from the YOLO network, so that the individual performing the behavior is localized in the image.
Step 6: in the feature fusion model, a convolution kernel of size 1×1 is applied to the final convolutional layer to adjust the output channels; 5 prior boxes are selected on the corresponding dataset with the k-means algorithm, generating a feature tensor with 5×(NumClasses+4+1) channels, where NumClasses represents the confidence scores of the NumClasses behavior classes in the dataset, 4 represents the four coordinates of the target detection box, and 1 represents the detection confidence. The bounding-box regression is refined according to the anchor points, finally achieving accurate localization and behavior recognition of the target, i.e., spatio-temporal behavior detection.
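The output head of step 6 can be sketched as a single 1×1 convolution. The fused channel count (64), the class count (3), and the 7×7 feature-map size below are illustrative assumptions; the five prior boxes referred to in the text would be obtained separately by running k-means on the dataset's ground-truth boxes.

```python
import torch
import torch.nn as nn

num_anchors, num_classes = 5, 3   # 5 prior boxes; class count is an assumption
# 1x1 convolution producing 5 * (NumClasses + 4 + 1) output channels:
# per prior box, NumClasses class scores + 4 box coordinates + 1 confidence
head = nn.Conv2d(in_channels=64,
                 out_channels=num_anchors * (num_classes + 4 + 1),
                 kernel_size=1)

feat = torch.randn(1, 64, 7, 7)   # stand-in for the fused feature map D[C4,H,W]
out = head(feat)                  # -> [1, 5*(3+4+1), 7, 7]
# regroup as [batch, prior boxes, per-box predictions, H, W] for decoding
out = out.view(1, num_anchors, num_classes + 4 + 1, 7, 7)
```

Each of the five per-anchor prediction groups is then decoded against its prior box to refine the bounding-box regression.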
FIG. 12 shows the result obtained by the detection method of the present invention on real-time detection of an online video: the behavior of the two persons in the video is detected as fighting (fight), and the persons performing the behavior are also localized, demonstrating a good detection effect.
Example 3
A computer device, comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the steps of the YOLO-TSM-based lightweight real-time monitoring abnormal behavior detection method.
Example 4
A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, carries out the steps of the YOLO-TSM-based lightweight real-time monitoring abnormal behavior detection method.

Claims (8)

1. A detection method for lightweight real-time monitoring of abnormal behaviors based on YOLO-TSM is characterized by comprising the following steps:
step 1: acquiring a data set and preprocessing the data set;
step 2: constructing a TSM network model;
step 3: constructing a YOLO lightweight network model;
step 4: constructing a feature fusion model;
step 5: performing end-to-end training of the TSM network model, the YOLO lightweight network model and the feature fusion model constructed in steps 2, 3 and 4, using the dataset obtained after the preprocessing in step 1;
step 6: the video to be detected is preprocessed and input into the trained YOLO-TSM lightweight model, wherein the YOLO-TSM lightweight model comprises the trained TSM network model, YOLO lightweight network model and feature fusion model, the TSM network model and the YOLO lightweight network model both being connected to the feature fusion model; inference is performed on the input video information using the weight file obtained from training, target detection is carried out on the individuals acting in the video stream, behavior recognition is performed, and the abnormal behavior is finally localized to the individual in the scene.
2. The YOLO-TSM-based lightweight real-time monitoring abnormal behavior detection method of claim 1, wherein in step 1 the acquired dataset is preprocessed into two forms, one for input to the TSM network model and one for input to the YOLO lightweight network model:
the dataset is processed into the input form of the TSM network model, namely tensor A, A ∈ R^(N,C,T,H,W), where N is the batch size, C the number of channels, T the time dimension, and H and W the spatial resolution;
the dataset is processed into the input form of the YOLO lightweight network model, namely tensor B, B ∈ R^(N,C,H,W); Mosaic data augmentation is applied, and the frames are cropped and compressed to H×W to obtain tensor B with dimensions (N, C, H, W).
3. The YOLO-TSM-based lightweight real-time monitoring abnormal behavior detection method according to claim 1, wherein the backbone network of the YOLO lightweight network model is Darknet53, and includes a first Conv convolutional layer, a second Conv convolutional layer, a first Residual Block module, a third Conv convolutional layer, a second Residual Block module, a fourth Conv convolutional layer, a third Residual Block module, a fifth Conv convolutional layer, a fourth Residual Block module, a sixth Conv convolutional layer, and a fifth Residual Block module, which are connected in sequence;
the first Residual Block module comprises 1 Residual Block, the second Residual Block module comprises 2 Residual blocks, the third Residual Block module comprises 8 Residual blocks, the fourth Residual Block module comprises 8 Residual blocks, and the fifth Residual Block module comprises 4 Residual blocks;
in the first Conv convolutional layer, features are output through a [3,3] convolution with padding of 1 and then pass through a normalization layer and an activation function;
in the second Conv convolutional layer, features are output through a single [3,3] convolution with a stride of 2;
the first, second, third, fourth and fifth Residual Block modules each comprise two Conv layers plus a normalization function and an activation function, i.e., a BatchNorm + ReLU layer: the input first passes through a [1,1] Conv layer that adjusts the number of channels, then through the BatchNorm + ReLU layer, and finally through a [3,3] Conv layer with padding of 1, which outputs the features;
the third, fourth, fifth and sixth Conv convolutional layers are [3,3] Conv layers with a stride of 2 and padding of 1;
finally, the YOLO lightweight network model yields a spatial feature tensor of size [C, H, W].
4. The YOLO-TSM-based lightweight real-time monitoring abnormal behavior detection method according to claim 1, wherein the backbone network of the TSM network model is ResNet50 with TSM modules added, comprising, connected in sequence, a Conv convolutional layer, a max-pooling layer, a first residual module, a second residual module, a third residual module and a fourth residual module, with an activation function between the residual modules;
in the Conv convolutional layer, features are output through a [7,7] convolution and then pass through a normalization layer and an activation function;
in the max-pooling layer, a [3,3] max pooling reduces the dimensionality and the amount of input;
the first residual module comprises 1 first model Model-1 and 2 second models Model-2; the second residual module comprises 1 Model-1 and 3 Model-2; the third residual module comprises 1 Model-1 and 5 Model-2; the fourth residual module comprises 1 Model-1 and 3 Model-2;
the first model Model-1 has two branches: one branch first passes through a TSM module into a [1,1] Conv layer that adjusts the number of channels, then through a normalization layer and an activation function into a [3,3] Conv layer that extracts features, then through a normalization layer and an activation function into a [1,1] Conv layer with a different number of channels that adjusts the channel count, and finally through a normalization layer; the other branch passes the input through directly; the final output is the sum of the two branches;
the second model Model-2 has two branches: one branch first passes through a TSM module into a [1,1] Conv layer that adjusts the number of channels, then through a normalization layer and an activation function into a [3,3] Conv layer that extracts features, then through a normalization layer and an activation function into a [1,1] Conv layer that adjusts the number of channels, and is output through a normalization layer; the other branch first enters a [1,1] Conv layer that adjusts the number of channels and passes through a normalization layer and an activation function; the final output is the sum of the two branches;
finally, a spatio-temporal feature tensor of size [C1,H1,W1] is output.
5. The YOLO-TSM-based lightweight real-time monitoring abnormal behavior detection method according to claim 1, wherein the feature fusion model comprises a channel fusion module, a CFAM module, a first Conv layer and a second Conv layer, connected in sequence;
in the channel fusion module, the [C1,H1,W1] tensor output by the TSM network model is first fed into a [1,1] convolutional layer and adjusted to [C2,H,W]; the [C,H,W] output of the YOLO lightweight network model and the feature tensor [C2,H,W] are then concatenated along the channel dimension to obtain the tensor [C3,H,W], C3 = C2 + C, which is fed into a [1,1] Conv layer that adjusts the number of channels; the feature map of each channel is then flattened to one dimension, i.e., reshaped to F[C3,N] with N = H*W;
in the CFAM module, the input feature vector F[C3,N] is multiplied with its own transpose to obtain a Gram matrix G[N,N]; after the Gram matrix G[N,N] is computed, a softmax layer generates the channel attention M; M is matrix-multiplied with F, the result is reshaped into a three-dimensional tensor (C*H*W) with the same shape as the input tensor B, and it is combined with the original input feature map B;
in the first Conv layer, the result is fed into a [1,1] Conv layer that adjusts the number of channels, then passes through a normalization layer and an activation function;
in the second Conv layer, the result is fed into a [1,1] Conv layer that adjusts the number of channels, producing the output D[C4,H,W] of the feature fusion model.
6. The YOLO-TSM-based lightweight real-time monitoring abnormal behavior detection method according to any one of claims 1 to 5, wherein in step 6, in the feature fusion model, a convolution kernel of size 1×1 is applied to the last convolutional layer to adjust the output channels; 5 prior boxes are selected on the corresponding dataset with the k-means algorithm, generating a feature tensor with 5×(NumClasses+4+1) channels, where NumClasses represents the confidence scores of the behavior classes in the dataset, 4 represents the four coordinates of the target detection box, and 1 represents the detection confidence; the bounding-box regression is refined according to the anchor points, finally achieving accurate localization and behavior recognition of the target, i.e., spatio-temporal behavior detection.
7. A computer device, comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the steps of the YOLO-TSM-based lightweight real-time monitoring abnormal behavior detection method of any one of claims 1-6.
8. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the YOLO-TSM-based lightweight real-time monitoring abnormal behavior detection method of any one of claims 1 to 6.
CN202111543515.2A 2021-12-16 2021-12-16 Lightweight real-time monitoring abnormal behavior detection method based on Yolo-TSM Pending CN114220169A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111543515.2A CN114220169A (en) 2021-12-16 2021-12-16 Lightweight real-time monitoring abnormal behavior detection method based on Yolo-TSM

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111543515.2A CN114220169A (en) 2021-12-16 2021-12-16 Lightweight real-time monitoring abnormal behavior detection method based on Yolo-TSM

Publications (1)

Publication Number Publication Date
CN114220169A true CN114220169A (en) 2022-03-22

Family

ID=80702956

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111543515.2A Pending CN114220169A (en) 2021-12-16 2021-12-16 Lightweight real-time monitoring abnormal behavior detection method based on Yolo-TSM

Country Status (1)

Country Link
CN (1) CN114220169A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117668669A (en) * 2024-02-01 2024-03-08 齐鲁工业大学(山东省科学院) Pipeline safety monitoring method and system based on improved YOLOv7

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109961019A (en) * 2019-02-28 2019-07-02 华中科技大学 A spatio-temporal behavior detection method
CN110942009A (en) * 2019-11-22 2020-03-31 南京甄视智能科技有限公司 Fall detection method and system based on space-time hybrid convolutional network
CN113239822A (en) * 2020-12-28 2021-08-10 武汉纺织大学 Dangerous behavior detection method and system based on space-time double-current convolutional neural network


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
BUBBLIIIING: "神经网络学习小记录20——ResNet50模型的复现详解", HTTPS://BLOG.CSDN.NET/WEIXIN_44791964/ARTICLE/DETAILS/102790260, 28 October 2019 (2019-10-28), pages 1 - 11 *
JI LIN 等: "TSM: Temporal Shift Module for Efficient Video Understanding", 《ARXIV》, 22 August 2019 (2019-08-22), pages 2 - 4 *
JOSEPH REDMON 等: "YOLOv3:An Incremental Improvement", 《ARXIV》, 8 April 2018 (2018-04-08), pages 2 *
OKAN KOPUKLYU 等: "You Only Watch Once: A Unified CNN Architecture for Real-Time Spatiotemporal Action Localization", 《ARXIV》, 5 March 2020 (2020-03-05), pages 2 - 4 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117668669A (en) * 2024-02-01 2024-03-08 齐鲁工业大学(山东省科学院) Pipeline safety monitoring method and system based on improved YOLOv7
CN117668669B (en) * 2024-02-01 2024-04-19 齐鲁工业大学(山东省科学院) Pipeline safety monitoring method and system based on improved YOLOv7

Similar Documents

Publication Publication Date Title
Fan et al. Point 4d transformer networks for spatio-temporal modeling in point cloud videos
CN109389055B (en) Video classification method based on mixed convolution and attention mechanism
CN111814719A (en) Skeleton behavior identification method based on 3D space-time diagram convolution
CN112699786B (en) Video behavior identification method and system based on space enhancement module
CN112232164A (en) Video classification method and device
CN114049381A (en) Twin cross target tracking method fusing multilayer semantic information
CN109598732A (en) A kind of medical image cutting method based on three-dimensional space weighting
CN114821640A (en) Skeleton action identification method based on multi-stream multi-scale expansion space-time diagram convolution network
CN113393474A (en) Feature fusion based three-dimensional point cloud classification and segmentation method
CN113920581A (en) Method for recognizing motion in video by using space-time convolution attention network
CN113610046A (en) Behavior identification method based on depth video linkage characteristics
Xu Fast modelling algorithm for realistic three-dimensional human face for film and television animation
CN111524140A (en) Medical image semantic segmentation method based on CNN and random forest method
CN114220169A (en) Lightweight real-time monitoring abnormal behavior detection method based on Yolo-TSM
Wang et al. Learning precise feature via self-attention and self-cooperation YOLOX for smoke detection
Ren et al. LightRay: Lightweight network for prohibited items detection in X-ray images during security inspection
Zhenhua et al. FTCF: Full temporal cross fusion network for violence detection in videos
CN114170657A (en) Facial emotion recognition method integrating attention mechanism and high-order feature representation
CN110782503B (en) Face image synthesis method and device based on two-branch depth correlation network
CN112613405B (en) Method for recognizing actions at any visual angle
CN114120076B (en) Cross-view video gait recognition method based on gait motion estimation
CN115661861A (en) Skeleton behavior identification method based on dynamic time sequence multidimensional adaptive graph convolution network
CN115830707A (en) Multi-view human behavior identification method based on hypergraph learning
Tang et al. A multi-task neural network for action recognition with 3d key-points
Zhao et al. Research on human behavior recognition in video based on 3DCCA

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination