CN114220169A - Lightweight real-time monitoring abnormal behavior detection method based on Yolo-TSM - Google Patents


Info

Publication number
CN114220169A
CN114220169A
Authority
CN
China
Prior art keywords
layer, conv, tsm, model, yolo
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111543515.2A
Other languages
Chinese (zh)
Inventor
陈雷
秦野风
贲晛烨
李玉军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Shandong University
Priority to CN202111543515.2A
Publication of CN114220169A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/047 Probabilistic or stochastic networks
    • G06N3/08 Learning methods


Abstract

The invention relates to a lightweight real-time monitoring abnormal behavior detection method based on YOLO-TSM, which comprises the following steps. Step 1: a lightweight target detection network, the YOLO network, performs target detection on the pedestrians in a monitored video, obtaining the targets that need behavior detection and their spatial features. Step 2: a behavior recognition algorithm, the TSM network, detects the behaviors of the pedestrians inside the detection boxes and acquires their spatio-temporal features. Step 3: an attention mechanism module fuses the obtained spatial and spatio-temporal features, the model result is inferred, and the inference result is output in real time. The invention combines the strong real-time target detection model YOLO with the behavior recognition model TSM and uses an attention mechanism for feature fusion, thereby improving the real-time inference rate while ensuring a certain accuracy, refining behavior detection, and localizing each detected behaving individual in the scene.

Description

Lightweight real-time monitoring abnormal behavior detection method based on Yolo-TSM
Technical Field
The invention relates to a lightweight real-time monitoring abnormal behavior detection method based on a Yolo-TSM, and belongs to the technical field of space-time behavior detection.
Background
With the development of society and the progress of science and technology, real-time monitoring covers an ever wider range, and cameras have become indispensable for surveillance in both public and private places. As video surveillance expands, ever more monitoring personnel and effort are required, yet abnormal situations are still missed. To solve this problem, many researchers have proposed behavior detection methods for network cameras on the basis of video understanding, i.e., performing real-time behavior detection on human individuals in various scenes so that abnormal situations are not missed through human error. Real-time behavior detection is therefore a video detection technology that meets the requirements of security and related fields.
Behavior detection (Action Detection) algorithms based on conventional RGB deep learning fall mainly into 2D-CNN-based two-stream convolution algorithms and 3D-CNN algorithms. The two-stream convolution algorithm, because it uses a 2D convolutional network, has a great advantage in inference speed, i.e., real-time performance, but its extraction of temporal features is poor, so it has no advantage in accuracy. The 3D convolutional network algorithm is the opposite: it has a great advantage in accuracy, but because the network has too many parameters its real-time performance is poor.
Compared with traditional behavior recognition algorithms, spatio-temporal action detection (Spatio-temporal Action Detection) algorithms can, while maintaining timeliness and accuracy, attribute the behaviors in a scene to each detected person to be recognized, achieving more accurate object-level recognition; however, the relative computational load is larger, so spatio-temporal action detection algorithms are currently rarely applied to real-time video monitoring.
In summary, two major problems exist in the prior art. The main problem is that current real-time monitoring and analysis of abnormal behaviors has poor real-time performance once a certain accuracy is reached, i.e., abnormal behaviors cannot be analyzed and processed in real time. The secondary problem is that current common behavior recognition only characterizes the overall behavior occurring in a scene, and lacks behavior detection and analysis for each individual in the scene.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a lightweight real-time monitoring abnormal behavior detection method based on YOLO-TSM.
interpretation of terms:
1. Mosaic data enhancement: the four incoming key-frame pictures, each with its corresponding boxes, are spliced into one new picture, and the detection boxes corresponding to the new picture are obtained at the same time, enriching the backgrounds available for target detection.
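The splicing step above can be sketched as follows (a minimal numpy illustration, not the patent's implementation; the quadrant layout and naive cropping are assumptions, and the remapping of the detection boxes is omitted):

```python
import numpy as np

def mosaic(imgs, out_size=416):
    """Stitch four key-frame images into one mosaic image.

    Minimal sketch: each input is placed into one quadrant by naive
    cropping; a real Mosaic pipeline rescales and also remaps boxes."""
    h = w = out_size // 2
    canvas = np.zeros((out_size, out_size, 3), dtype=imgs[0].dtype)
    quads = [(0, 0), (0, w), (h, 0), (h, w)]  # TL, TR, BL, BR corners
    for img, (y, x) in zip(imgs, quads):
        patch = img[:h, :w]                   # naive crop stand-in
        canvas[y:y + patch.shape[0], x:x + patch.shape[1]] = patch
    return canvas

imgs = [np.full((300, 300, 3), i, dtype=np.uint8) for i in range(4)]
m = mosaic(imgs)
print(m.shape)  # (416, 416, 3)
```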
2. The CFAM module: an attention mechanism model based on the Gram matrix that can perform global feature fusion.
3. AVA data set: consists of URLs of public YouTube videos annotated with a set of 80 atomic actions (Atomic Actions) (such as [walk], [kick (something)], [handshake]); all actions are localized in space and time, yielding 57.6k video segments, 96k labeled human actions and 210k action labels. The invention selects abnormal behaviors (fighting, falling, running, etc.) as targets for training and testing.
4. k-means clustering algorithm: an iteratively solved clustering analysis algorithm. The data are divided into K groups in advance: K objects are randomly selected as initial cluster centers, the distance between each object and each seed cluster center is calculated, and each object is assigned to the nearest cluster center. A cluster center and the objects assigned to it represent one cluster. For each assignment, the cluster center is recalculated from the objects currently in the cluster. This process repeats until some termination condition is met.
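The iterative procedure above can be sketched as follows (an illustrative numpy implementation over (w, h) box sizes; plain Euclidean distance is assumed here, whereas anchor selection for detection often uses an IoU-based distance instead):

```python
import numpy as np

def kmeans_anchors(boxes, k=5, iters=20, seed=0):
    """boxes: array of (w, h) pairs; returns k cluster centers (prior boxes)."""
    rng = np.random.default_rng(seed)
    centers = boxes[rng.choice(len(boxes), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # distance of every box to every center, then nearest-center assignment
        d = np.linalg.norm(boxes[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):            # recompute center from members
                centers[j] = boxes[assign == j].mean(axis=0)
    return centers
```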
5. The TSM module: a module for extracting features along the time dimension that can be realized on top of a 2D CNN without adding a large number of operations, i.e., it enlarges the temporal receptive field and thereby obtains temporal information.
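The shift operation the TSM module performs can be sketched as follows (a numpy illustration of the standard temporal shift; the fold_div value and zero filling follow the common TSM formulation and are assumptions, not values quoted from this patent):

```python
import numpy as np

def temporal_shift(x, fold_div=8):
    """Temporal Shift Module on a (N, T, C, H, W) tensor.

    Shifts the first C/fold_div channels forward in time, the next
    C/fold_div backward, and leaves the rest in place; the slots
    vacated by the shift are zero-filled."""
    n, t, c, h, w = x.shape
    fold = c // fold_div
    out = np.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]              # shift forward: t -> t+1
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]  # shift backward
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]         # remaining channels stay
    return out
```

Mixing shifted and unshifted channels is what lets a 2D backbone see neighboring frames at near-zero extra computation.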
The technical scheme of the invention is as follows:
a detection method for lightweight real-time monitoring of abnormal behaviors based on a YOLO-TSM comprises the following steps:
step 1: acquiring a data set and preprocessing the data set;
step 2: constructing a TSM network model;
step 3: constructing a YOLO lightweight network model;
step 4: constructing a feature fusion model;
step 5: performing end-to-end training on the TSM network model, the YOLO lightweight network model and the feature fusion model constructed in steps 2, 3 and 4, using the data set preprocessed in step 1;
step 6: preprocessing the video to be detected and inputting it into the trained YOLO-TSM lightweight model, which comprises the trained TSM network model, YOLO lightweight network model and feature fusion model, with the TSM network model and the YOLO lightweight network model connected to the feature fusion model; using the weight file obtained from training, inference is performed on the input video information, target detection and behavior recognition are carried out on the behaving individuals in the video stream, and finally the abnormal behavior is localized to the individual in the scene.
Preferably according to the invention, the data sets comprise an AVA data set and an abnormal behaviour data set.
According to the present invention, in step 1, the acquired data set is preprocessed, and the data set is processed into two forms of an input TSM network model and a YOLO lightweight network model:
processing the data set into a form of an input TSM network model, namely tensor A, wherein tensor A belongs to R (N, C, T, H and W), N is the batch processing size, C is the channel number, T is the time dimension, and H and W are the spatial resolution;
processing the data set into the form input to the YOLO lightweight network model, namely tensor B, where tensor B belongs to R(N, C, H, W): the key frames undergo Mosaic data enhancement and are cropped and compressed to H x W, giving tensor B with dimensions (N, C, H, W);
according to a preferred embodiment of the present invention, the backbone network of the YOLO lightweight network model is Darknet53, and includes a first Conv convolutional layer, a second Conv convolutional layer, a first Residual Block module, a third Conv convolutional layer, a second Residual Block module, a fourth Conv convolutional layer, a third Residual Block module, a fifth Conv convolutional layer, a fourth Residual Block module, a sixth Conv convolutional layer, and a fifth Residual Block module, which are connected in sequence;
the first Residual Block module comprises 1 Residual Block, the second Residual Block module comprises 2 Residual blocks, the third Residual Block module comprises 8 Residual blocks, the fourth Residual Block module comprises 8 Residual blocks, and the fifth Residual Block module comprises 4 Residual blocks;
in the first Conv convolution layer, outputting the characteristics through a [3,3] convolution network with padding being 1, and continuously outputting through a normalization layer and an activation function;
in the second Conv convolution layer, outputting characteristics through a [3,3] convolution network with the step length of 2 for one time;
the first, second, third, fourth and fifth Residual Block modules each comprise two Conv layers, a normalization function and an activation function, i.e., a BatchNorm + Relu layer: the input first passes through a [1,1] Conv layer, which adjusts the channel number, then through the BatchNorm + Relu layer, and finally through a [3,3] Conv layer with padding of 1, which outputs the features;
the third, fourth, fifth and sixth Conv convolutional layers are [3,3] Conv layers with step size of 2 and padding of 1;
finally, the YOLO lightweight network model obtains a spatial feature tensor of [ C, H, W ].
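The spatial downsampling implied by the backbone above can be checked arithmetically (a sketch assuming a 416 x 416 input, the size used later in the embodiment; the Residual Block modules preserve spatial size, so only the one stride-1 and five stride-2 [3,3] Conv layers change H and W):

```python
def conv_out(size, kernel=3, stride=1, pad=1):
    # Standard convolution output-size formula: floor((size + 2p - k)/s) + 1
    return (size + 2 * pad - kernel) // stride + 1

size = 416                        # assumed input resolution
size = conv_out(size, stride=1)   # first Conv layer, padding 1
for _ in range(5):                # second..sixth Conv layers, stride 2
    size = conv_out(size, stride=2)
print(size)  # 13: the H = W of the final [C, H, W] spatial feature tensor
```

Five stride-2 stages give an overall stride of 32, hence 416 / 32 = 13.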
According to the invention, the backbone network of the TSM network model is Resnet50 added with the TSM module, and comprises a Conv convolution layer, a maximum pooling layer, a first residual module, a second residual module, a third residual module, a fourth residual module, and an activation function between each residual module, which are connected in sequence;
in the Conv convolution layer, outputting characteristics through a [7,7] convolution network, and continuously outputting through a normalization layer and an activation function;
in the maximum pooling layer, reducing the dimension and reducing the input quantity through a [3,3] maximum pooling layer;
the first residual error module comprises 1 first Model-1 and 2 second Model models-2; the second residual error module comprises 1 first Model-1 and 3 second Model models-2; the third residual error module comprises 1 first Model-1 and 5 second Model models-2; the fourth residual module comprises 1 first Model-1 and 3 second models Model-2;
the first Model-1 is divided into two branches: one branch first passes through a TSM module into a [1,1] Conv layer, which adjusts the number of channels, then through a normalization layer and an activation function into a [3,3] Conv layer, which obtains the features, then through a normalization layer and an activation function into a [1,1] Conv layer with a different channel number, which adjusts the number of channels, and finally through a normalization layer; the other branch is the input itself; the final output is the element-wise sum of the two branches;
the second Model-2 is divided into two branches: one branch first passes through a TSM module into a [1,1] Conv layer, which adjusts the number of channels, then through a normalization layer and an activation function into a [3,3] Conv layer, which obtains the features, then through a normalization layer and an activation function into a [1,1] Conv layer, which adjusts the number of channels, and finally through a normalization layer; the other branch first enters a [1,1] Conv layer, which adjusts the number of channels, then passes through a normalization layer and an activation function; the final output is the element-wise sum of the two branches;
finally, a spatio-temporal feature tensor of size [C1, H1, W1] is output.
According to the present invention, preferably, the feature fusion model includes a channel fusion module, a CFAM module, a first Conv layer, and a second Conv layer, which are connected in sequence;
in the channel fusion module, the [C1, H1, W1] output by the TSM network model is first input into a [1,1] convolution layer and adjusted to [C2, H, W]; then the [C, H, W] output by the YOLO lightweight network model and the feature tensor [C2, H, W] are linked in the channel dimension to obtain the tensor [C3, H, W], C3 = C2 + C, which is input into a [1,1] Conv layer to adjust the number of channels; then the feature map of each channel is unidimensionalized, i.e., changed to F[C3, N], N = H*W;
In the CFAM module, the input feature vector F[C3, N] is vector-multiplied with its own transpose to obtain the Gram matrix G[N, N]; after the Gram matrix G[N, N] is calculated, a softmax layer generates the channel attention M, which is matrix-multiplied with F; the obtained result is reshaped into a three-dimensional tensor (C*H*W) with the same shape as the input tensor B and combined with the original input feature map B;
inputting a Conv layer of [1,1] into the first Conv layer, adjusting the number of channels, and then passing through a normalization layer and an activation function;
in the second Conv layer, the input enters a [1,1] Conv layer, the number of channels is adjusted, and the output D[C4, H, W] of the feature fusion model is produced.
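The Gram-matrix attention step can be sketched as follows (an illustrative numpy version; note that for a channel attention M that is matrix-multiplied with F[C, N], the Gram matrix F F^T is C x C, which is what this sketch computes, and the residual combination with the input is an assumption about how "combined" is realized):

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cfam_attention(F):
    """Gram-matrix channel attention over a flattened feature map F[C, N].

    G = F @ F.T maps inter-channel dependencies; softmax turns it into
    an attention matrix M; M @ F re-weights the channels, and the input
    is added back (assumed residual combination)."""
    G = F @ F.T                 # Gram matrix, shape [C, C]
    M = softmax(G, axis=-1)     # channel attention, rows sum to 1
    return M @ F + F            # attend, then combine with the input
```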
Preferably, in step 6, in the feature fusion model, a convolution kernel of size 1 x 1 is used for the last convolution layer to adjust the output channels; 5 prior boxes are selected on the corresponding data set using the k-means algorithm, generating a feature tensor with a channel number of 5 x (Numclasses + 4 + 1), where Numclasses represents the confidence scores of the Numclasses behavior classes in the data set, 4 represents the 4 coordinates of the target detection box, and 1 represents the detection confidence; the regression of the bounding box is refined according to the anchor points, finally achieving accurate localization, detection and behavior recognition of the target, i.e., spatio-temporal behavior detection.
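The channel count of the final feature tensor follows directly from the formula above (the class counts used here are illustrative, not taken from the patent):

```python
def head_channels(num_classes, num_anchors=5):
    # Per anchor: num_classes class scores + 4 box coordinates + 1 confidence
    return num_anchors * (num_classes + 4 + 1)

print(head_channels(7))   # 60, e.g. for 7 behavior classes (illustrative)
print(head_channels(80))  # 425, the familiar YOLO head size for 80 classes
```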
A computer device comprising a memory storing a computer program and a processor implementing the steps of a YOLO-TSM based lightweight real-time monitoring abnormal behavior detection method when executing the computer program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the YOLO-TSM-based lightweight real-time monitoring abnormal-behavior detection method.
The invention has the beneficial effects that:
the invention tries a mode of combining an excellent real-time target detection model YOLO with a behavior recognition model TSM, and uses an attention mechanism to perform feature fusion, thereby improving the real-time reasoning rate while ensuring certain accuracy, refining behavior detection and positioning each behavior individual detected in a scene.
Drawings
FIG. 1 is a schematic flow chart of the detection method for lightweight real-time monitoring of abnormal behaviors based on a YOLO-TSM according to the present invention;
FIG. 2 is a schematic diagram of a YOLO network;
FIG. 3 is a schematic structural diagram of Residual Block;
FIG. 4 is a schematic diagram of the connection structure between Residual blocks;
FIG. 5 is a schematic diagram of a TSM network;
FIG. 6 is a schematic diagram of a residual module;
FIG. 7 is a schematic structural diagram of a first Model-1;
FIG. 8 is a schematic structural diagram of a second Model-2;
FIG. 9 is a schematic diagram of the principle of the TSM module;
FIG. 10 is a schematic structural diagram of a feature fusion model;
fig. 11 is a schematic structural diagram of a CFAM module;
fig. 12 is a diagram showing the effect of the abnormal behavior detection method.
Detailed Description
The invention is further described below with reference to, but not limited to, the figures and examples in the description.
Example 1
A detection method for lightweight real-time monitoring of abnormal behaviors based on a YOLO-TSM is shown in FIG. 1 and comprises the following steps:
step 1: acquiring a data set and preprocessing the data set;
step 2: constructing a TSM network model;
step 3: constructing a YOLO lightweight network model;
step 4: constructing a feature fusion model;
step 5: performing end-to-end training on the TSM network model, the YOLO lightweight network model and the feature fusion model constructed in steps 2, 3 and 4, using the data set preprocessed in step 1;
step 6: preprocessing the video to be detected and inputting it into the trained YOLO-TSM lightweight model, which comprises the trained TSM network model, YOLO lightweight network model and feature fusion model, with the TSM network model and the YOLO lightweight network model connected to the feature fusion model; using the weight file obtained from training, inference is performed on the input video information, target detection and behavior recognition are carried out on the behaving individuals in the video stream, and finally the abnormal behavior is localized to the individual in the scene.
Example 2
The detection method for the lightweight real-time monitoring of the abnormal behavior based on the YOLO-TSM in the embodiment 1 is characterized in that:
the data sets include an AVA data set (part) and an abnormal behavior data set. The abnormal behavior data set is a video data set collected from a Youtube website, and comprises abnormal behaviors such as fighting, falling, running and the like, normal behaviors such as standing, walking and the like, and a total of 1000 videos of 10 s.
In step 1, preprocessing the acquired data set, and processing the data set into two forms of an input TSM network model and a YOLO lightweight network model respectively:
processing the data set into the form input to the TSM network model, namely tensor A, where tensor A belongs to R(N, C, T, H, W), N is the batch size, C the channel number, T the time dimension, and H and W the spatial resolution. Specifically: 8 consecutive frames are taken, cropped and compressed to 224 x 224, i.e., the dimension becomes (1, 3, 8, 224, 224), and input into the TSM network model.
Processing the data set into the form input to the YOLO lightweight network model, namely tensor B, where tensor B belongs to R(N x C x H x W). Specifically: key frames are taken from the pictures, Mosaic data enhancement is applied, and they are cropped and compressed to H x W, giving tensor B with dimensions (N, C, H, W). One key frame is taken per 8 pictures, the 8th picture is selected, Mosaic data enhancement is applied, and it is cropped and compressed to 416 x 416, so the dimension of B is (1, 3, 416, 416); this is input into the YOLO lightweight network model.
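The two input forms with the concrete sizes above can be sketched as follows (numpy shape bookkeeping only; the naive cropping stands in for the real resize/augmentation pipeline):

```python
import numpy as np

# Dummy clip of 8 consecutive frames (e.g. 480x640 RGB)
frames = [np.zeros((480, 640, 3), dtype=np.uint8)] * 8

# TSM branch: 8 frames at 224x224, arranged as (N, C, T, H, W)
clip = np.stack([f[:224, :224] for f in frames])   # (T, H, W, C)
tensor_a = clip.transpose(3, 0, 1, 2)[None]        # (N, C, T, H, W)
print(tensor_a.shape)  # (1, 3, 8, 224, 224)

# YOLO branch: the 8th frame at 416x416, arranged as (N, C, H, W)
key = frames[-1][:416, :416]
tensor_b = key.transpose(2, 0, 1)[None]            # (N, C, H, W)
print(tensor_b.shape)  # (1, 3, 416, 416)
```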
A backbone network (backbone) of the YOLO lightweight network model is Darknet53, and as shown in fig. 2, includes a first Conv convolutional layer, a second Conv convolutional layer, a first Residual Block module, a third Conv convolutional layer, a second Residual Block module, a fourth Conv convolutional layer, a third Residual Block module, a fifth Conv convolutional layer, a fourth Residual Block module, a sixth Conv convolutional layer, and a fifth Residual Block module, which are connected in sequence;
the first Residual Block module comprises 1 Residual Block, the second Residual Block module comprises 2 Residual blocks, the third Residual Block module comprises 8 Residual blocks, the fourth Residual Block module comprises 8 Residual blocks, and the fifth Residual Block module comprises 4 Residual blocks;
in the first Conv convolution layer, outputting the characteristics through a [3,3] convolution network with padding being 1, and continuing outputting through a normalization layer (BatchNorm) and an activation function (Relu);
in the second Conv convolution layer, outputting characteristics through a [3,3] convolution network with the step length of 2 for one time;
the first, second, third, fourth and fifth Residual Block modules each include two Conv layers, a normalization function (BatchNorm) and an activation function (Relu), i.e., a BatchNorm + Relu layer, as shown in fig. 3: the channel number is adjusted by a [1,1] Conv layer, and the features are output through the BatchNorm + Relu layer and finally through a [3,3] Conv layer with padding = 1;
the third, fourth, fifth and sixth Conv convolutional layers are [3,3] Conv layers with step size of 2 and padding of 1;
as shown in fig. 4, between the first, second, third, fourth and fifth Residual Block modules the features are output through a [3,3] Conv layer with a step size of 2 and padding of 1;
finally, the YOLO lightweight network model obtains a spatial feature tensor of [ C, H, W ].
A backbone network (backbone) of the TSM network model is a Resnet50 added with a TSM module, and as shown in fig. 5, includes a Conv convolutional layer, a maximum pooling layer, a first residual module (Block1), a second residual module (Block2), a third residual module (Block3), a fourth residual module (Block4), and an activation function (Relu) between each residual module, which are connected in sequence;
in the Conv convolution layer, outputting characteristics through a [7,7] convolution network, and continuously outputting through a normalization layer (BatchNorm) and an activation function (Relu);
in the maximum pooling layer, reducing the dimension and reducing the input quantity through a [3,3] maximum pooling layer;
as shown in fig. 6, the residual Block (Block) is formed by connecting 1 first Model-1 and a plurality of second models Model-2;
the first residual error module comprises 1 first Model-1 and 2 second Model models-2; the second residual error module comprises 1 first Model-1 and 3 second Model models-2; the third residual error module comprises 1 first Model-1 and 5 second Model models-2; the fourth residual module comprises 1 first Model-1 and 3 second models Model-2;
as shown in fig. 7, the first Model-1 is divided into two branches: one branch first passes through the TSM module into a [1,1] Conv layer, which adjusts the number of channels, then through a normalization layer (BatchNorm) and an activation function (Relu) into a [3,3] Conv layer, which obtains the features, then through a normalization layer (BatchNorm) and an activation function (Relu) into a [1,1] Conv layer with a different channel number, which adjusts the number of channels, and finally through a normalization layer (BatchNorm); the other branch is the input itself; the final output is the element-wise sum of the two branches;
as shown in fig. 8, the second Model-2 is divided into two branches: one branch first passes through the TSM module into a [1,1] Conv layer, which adjusts the number of channels, then through a normalization layer (BatchNorm) and an activation function (Relu) into a [3,3] Conv layer, which obtains the features, then through a normalization layer (BatchNorm) and an activation function (Relu) into a [1,1] Conv layer, which adjusts the number of channels, and finally through a normalization layer (BatchNorm); the other branch first enters a [1,1] Conv layer, which adjusts the number of channels, then passes through a normalization layer (BatchNorm) and an activation function (Relu); the final output is the element-wise sum of the two branches;
residual module after TSM network improvement (Block): as shown in FIGS. 7 and 8, a TSM module is added to a branch input of all Model-1 and Model-1 of all residual modules (Block).
The TSM module: the TSM network adds a TSM module to each residual branch, removes the final fully connected layer, and inputs the result directly into the feature fusion model. The principle is shown in fig. 9: the left shows the temporal and channel dimensions of the tensor before shifting; the middle shows the matrix after displacement through the TSM module, where the feature maps of the first two channel groups are shifted by one step forward and backward along time; finally, the vacancies left by the shift are filled with zeros. For each inserted time-shift block, the temporal receptive field expands by 2, as if a kernel-size-3 convolution were run along the time dimension. The TSM module therefore has a large temporal receptive field and can capture highly complex spatio-temporal information.
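The receptive-field growth stated above is simple arithmetic (the count of 17 shifted residual units is derived from the Model-1/Model-2 counts given earlier and is an inference, not a figure quoted from the patent):

```python
def temporal_receptive_field(num_shift_blocks):
    # Each shift block acts like a kernel-3 convolution along time,
    # so the temporal receptive field grows by 2 per block.
    return 1 + 2 * num_shift_blocks

# Per the module counts above: (1+2) + (1+3) + (1+5) + (1+3) = 17 shifted
# residual units, one TSM module each (an inference from the text).
print(temporal_receptive_field(17))  # 35
```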
Activation function: an activation function (Relu) is applied between the connections of every pair of Models (including between the Models of adjacent residual modules).
Finally, a spatio-temporal feature tensor of size [C1, H1, W1] is output.
The feature fusion model adopted in step 4, shown in fig. 10, smoothly aggregates the features of the different branches according to the relationships between channels; it comprises a Channel fusion module (Channel fusion), a CFAM module, a first Conv layer and a second Conv layer connected in sequence;
in the Channel fusion module (Channel fusion), the [ C ] output by the TSM network model1,H1,W1]Firstly input into one [1,1]]The amount of the above-mentioned convolution layer is,is adjusted to [ C ]2,H,W]Then outputting the YOLO lightweight network model to [ C, H, W ]]And the characteristic tensor [ C ]2,H,W]Linking in channel dimension to obtain tensor [ C3,H,W],C3=C2+ C, inputting one [1,1]]Conv layer of (2), adjusting the number of channels Bc4*H*W]And then, the feature vector of each channel is unidimensionalized, i.e., the feature vector is changed to F [ C ]3,N],N=H*W;
In the CFAM module, as shown in FIG. 11, the input feature vector F[C3, N] is vector-multiplied with its own transpose to obtain the Gram matrix G[N, N]; the Gram matrix can map the dependency relationships among the channels. After the Gram matrix G[N, N] is calculated, a softmax layer generates the channel attention M, which is matrix-multiplied with F; the obtained result is reshaped into a three-dimensional tensor (C*H*W) with the same shape as the input tensor B and combined with the original input feature map B;
inputting a [1,1] Conv layer into the first Conv layer, adjusting the number of channels, and then passing through a normalization layer (BatchNorm) and an activation function (Relu);
in the second Conv layer, the input enters a [1,1] Conv layer, the number of channels is adjusted, and the output D[C4, H, W] of the feature fusion model is produced.
The specific implementation process of the step 5 is as follows:
the training configuration employed was as follows:
hardware environment:
CPU:AMD Ryzen 7 5800H
GPU:NVIDIA GeForce RTX 3060(6G)
memory: 16G
Software environment:
OS:Windows 10
Python:Anaconda3 python3.7
CUDA:11.1
Torch:1.8.0
Step 5.1: the TSM network model is initialized with a pre-trained model on Kinetics, and the YOLO lightweight network model is initialized with a pre-trained model on the COCO data set, reducing the amount of data required for training; although there are two networks, their parameters are updated jointly, and the complete architecture is implemented in PyTorch and trained end to end.
Step 5.2: the loss function is calculated in real time during training; the bounding box uses a Smooth L1 loss. Consecutive pictures are input into the TSM network model and the YOLO lightweight network model for prediction and inference; the ground truth, containing the position information and behavior class of each real box, is then encoded into the same form as the inference results of the two models, namely: each prediction box is compared with all real boxes, and the differences of the four box coordinates and of the behavior class are computed as losses;
after the loss is computed, back-propagation is performed for optimization;
during training, a mini-batch stochastic gradient descent algorithm with a weight decay strategy is selected to optimize the loss function;
data augmentation is applied to the YOLO lightweight network model: random mirroring, scale changes and the like are performed according to random seeds to strengthen generalization. The initial learning rate is set to 0.0001 and is decayed by a factor of 0.4 after 30k iterations; on the partial AVA dataset used, training for 15 epochs achieves good results and good generalization.
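The training procedure of step 5.2 and the optimizer settings above can be sketched as a single training step. The model here is a toy stand-in: the Smooth L1 box loss, the cross-entropy class loss, mini-batch SGD with weight decay, the 0.0001 initial learning rate, and the 0.4x decay after 30k iterations follow the text, while the momentum and weight-decay values are assumptions.

```python
import torch
import torch.nn as nn

box_loss_fn = nn.SmoothL1Loss()      # bounding-box regression loss (Smooth L1)
cls_loss_fn = nn.CrossEntropyLoss()  # behavior-class loss

model = nn.Linear(8, 4 + 3)          # toy stand-in: 4 box coords + 3 classes
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4,
                            momentum=0.9, weight_decay=5e-4)
# decay the learning rate by a factor of 0.4 after 30k iterations
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[30_000], gamma=0.4)

def train_step(x, gt_boxes, gt_labels):
    pred = model(x)
    loss = (box_loss_fn(pred[:, :4], gt_boxes)      # coordinate differences
            + cls_loss_fn(pred[:, 4:], gt_labels))  # class difference
    optimizer.zero_grad()
    loss.backward()                                  # back-propagate the loss
    optimizer.step()
    scheduler.step()                                 # counts iterations
    return loss.item()
```

In the actual method, `pred` would be the decoded predictions of the fused YOLO-TSM head compared against all ground-truth boxes, as described above.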
The dataset used is a subset of AVA covering abnormal behaviors. AVA consists of URLs of public YouTube videos annotated with a vocabulary of 80 atomic actions (e.g., [walk], [kick (something)], [shake hands]); all actions are localized in space and time, yielding 57.6k video segments, 96k labeled human actions, and 210k action labels.
In addition, a custom dataset is used for training, and several abnormal-behavior videos from the Internet are analyzed, achieving good results, as shown in Table 1:
TABLE 1
The feature fusion model smoothly aggregates the features of the different branches according to the relationships among the channels, which greatly enhances the feature discrimination capability. In the feature fusion model, behaviors are attributed to individuals: the behaviors are linked with the target detection results obtained from the YOLO network, so that the individual performing the behavior is localized in the image.
Step 6: in the feature fusion model, a convolution kernel of size 1×1 is applied to the final convolutional layer to adjust the output channels; 5 prior boxes are selected on the corresponding dataset with the k-means algorithm, generating a feature tensor with 5×(NumClasses+4+1) channels, where NumClasses represents the confidence scores of the NumClasses behavior classes in the dataset, 4 represents the four coordinates of the target detection box, and 1 represents the detection confidence. The bounding-box regression is refined according to the anchor points, finally achieving accurate localization and behavior recognition of the target, i.e., spatio-temporal behavior detection.
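The output head of step 6 can be sketched as a single 1×1 convolution. The fused channel count (64), the class count (3), and the 7×7 feature-map size below are illustrative assumptions; the five prior boxes referred to in the text would be obtained separately by running k-means on the dataset's ground-truth boxes.

```python
import torch
import torch.nn as nn

num_anchors, num_classes = 5, 3   # 5 prior boxes; class count is an assumption
# 1x1 convolution producing 5 * (NumClasses + 4 + 1) output channels:
# per prior box, NumClasses class scores + 4 box coordinates + 1 confidence
head = nn.Conv2d(in_channels=64,
                 out_channels=num_anchors * (num_classes + 4 + 1),
                 kernel_size=1)

feat = torch.randn(1, 64, 7, 7)   # stand-in for the fused feature map D[C4,H,W]
out = head(feat)                  # -> [1, 5*(3+4+1), 7, 7]
# regroup as [batch, prior boxes, per-box predictions, H, W] for decoding
out = out.view(1, num_anchors, num_classes + 4 + 1, 7, 7)
```

Each of the five per-anchor prediction groups is then decoded against its prior box to refine the bounding-box regression.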
FIG. 12 shows the result obtained by the detection method of the present invention on real-time detection of an online video: the behavior of the two persons in the video is detected as fighting (fight), and the persons performing the behavior are also localized, demonstrating a good detection effect.
Example 3
A computer device, comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the steps of the YOLO-TSM-based lightweight real-time monitoring abnormal behavior detection method.
Example 4
A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, carries out the steps of the YOLO-TSM-based lightweight real-time monitoring abnormal behavior detection method.

Claims (8)

1. A detection method for lightweight real-time monitoring of abnormal behaviors based on YOLO-TSM is characterized by comprising the following steps:
step 1: acquiring a data set and preprocessing the data set;
step 2: constructing a TSM network model;
step 3: constructing a YOLO lightweight network model;
step 4: constructing a feature fusion model;
step 5: performing end-to-end training of the TSM network model, the YOLO lightweight network model and the feature fusion model constructed in steps 2, 3 and 4, using the dataset obtained after the preprocessing in step 1;
step 6: the video to be detected is preprocessed and input into the trained YOLO-TSM lightweight model, wherein the YOLO-TSM lightweight model comprises the trained TSM network model, YOLO lightweight network model and feature fusion model, the TSM network model and the YOLO lightweight network model both being connected to the feature fusion model; inference is performed on the input video information using the weight file obtained from training, target detection is carried out on the individuals acting in the video stream, behavior recognition is performed, and the abnormal behavior is finally localized to the individual in the scene.
2. The YOLO-TSM-based lightweight real-time monitoring abnormal behavior detection method of claim 1, wherein in step 1 the acquired dataset is preprocessed into two forms, one for input to the TSM network model and one for input to the YOLO lightweight network model:
the dataset is processed into the input form of the TSM network model, namely tensor A, A ∈ R^(N,C,T,H,W), where N is the batch size, C the number of channels, T the time dimension, and H and W the spatial resolution;
the dataset is processed into the input form of the YOLO lightweight network model, namely tensor B, B ∈ R^(N,C,H,W); Mosaic data augmentation is applied, and the frames are cropped and compressed to H×W to obtain tensor B with dimensions (N, C, H, W).
3. The YOLO-TSM-based lightweight real-time monitoring abnormal behavior detection method according to claim 1, wherein the backbone network of the YOLO lightweight network model is Darknet53, and includes a first Conv convolutional layer, a second Conv convolutional layer, a first Residual Block module, a third Conv convolutional layer, a second Residual Block module, a fourth Conv convolutional layer, a third Residual Block module, a fifth Conv convolutional layer, a fourth Residual Block module, a sixth Conv convolutional layer, and a fifth Residual Block module, which are connected in sequence;
the first Residual Block module comprises 1 Residual Block, the second Residual Block module comprises 2 Residual blocks, the third Residual Block module comprises 8 Residual blocks, the fourth Residual Block module comprises 8 Residual blocks, and the fifth Residual Block module comprises 4 Residual blocks;
in the first Conv convolutional layer, features are output through a [3,3] convolution with padding of 1 and then pass through a normalization layer and an activation function;
in the second Conv convolutional layer, features are output through a single [3,3] convolution with a stride of 2;
the first, second, third, fourth and fifth Residual Block modules each comprise two Conv layers plus a normalization function and an activation function, i.e., a BatchNorm + ReLU layer: the input first passes through a [1,1] Conv layer that adjusts the number of channels, then through the BatchNorm + ReLU layer, and finally through a [3,3] Conv layer with padding of 1, which outputs the features;
the third, fourth, fifth and sixth Conv convolutional layers are [3,3] Conv layers with a stride of 2 and padding of 1;
finally, the YOLO lightweight network model yields a spatial feature tensor of size [C, H, W].
4. The YOLO-TSM-based lightweight real-time monitoring abnormal behavior detection method according to claim 1, wherein the backbone network of the TSM network model is ResNet50 with TSM modules added, comprising, connected in sequence, a Conv convolutional layer, a max-pooling layer, a first residual module, a second residual module, a third residual module and a fourth residual module, with an activation function between the residual modules;
in the Conv convolutional layer, features are output through a [7,7] convolution and then pass through a normalization layer and an activation function;
in the max-pooling layer, a [3,3] max pooling reduces the dimensionality and the amount of input;
the first residual module comprises 1 first model Model-1 and 2 second models Model-2; the second residual module comprises 1 Model-1 and 3 Model-2; the third residual module comprises 1 Model-1 and 5 Model-2; the fourth residual module comprises 1 Model-1 and 3 Model-2;
the first model Model-1 has two branches: one branch first passes through a TSM module into a [1,1] Conv layer that adjusts the number of channels, then through a normalization layer and an activation function into a [3,3] Conv layer that extracts features, then through a normalization layer and an activation function into a [1,1] Conv layer with a different number of channels that adjusts the channel count, and finally through a normalization layer; the other branch passes the input through directly; the final output is the sum of the two branches;
the second model Model-2 has two branches: one branch first passes through a TSM module into a [1,1] Conv layer that adjusts the number of channels, then through a normalization layer and an activation function into a [3,3] Conv layer that extracts features, then through a normalization layer and an activation function into a [1,1] Conv layer that adjusts the number of channels, and is output through a normalization layer; the other branch first enters a [1,1] Conv layer that adjusts the number of channels and passes through a normalization layer and an activation function; the final output is the sum of the two branches;
finally, a spatio-temporal feature tensor of size [C1,H1,W1] is output.
5. The YOLO-TSM-based lightweight real-time monitoring abnormal behavior detection method according to claim 1, wherein the feature fusion model comprises a channel fusion module, a CFAM module, a first Conv layer and a second Conv layer, connected in sequence;
in the channel fusion module, the [C1,H1,W1] tensor output by the TSM network model is first fed into a [1,1] convolutional layer and adjusted to [C2,H,W]; the [C,H,W] output of the YOLO lightweight network model and the feature tensor [C2,H,W] are then concatenated along the channel dimension to obtain the tensor [C3,H,W], C3 = C2 + C, which is fed into a [1,1] Conv layer that adjusts the number of channels; the feature map of each channel is then flattened to one dimension, i.e., reshaped to F[C3,N] with N = H*W;
in the CFAM module, the input feature vector F[C3,N] is multiplied with its own transpose to obtain a Gram matrix G[N,N]; after the Gram matrix G[N,N] is computed, a softmax layer generates the channel attention M; M is matrix-multiplied with F, the result is reshaped into a three-dimensional tensor (C*H*W) with the same shape as the input tensor B, and it is combined with the original input feature map B;
in the first Conv layer, the result is fed into a [1,1] Conv layer that adjusts the number of channels, then passes through a normalization layer and an activation function;
in the second Conv layer, the result is fed into a [1,1] Conv layer that adjusts the number of channels, producing the output D[C4,H,W] of the feature fusion model.
6. The YOLO-TSM-based lightweight real-time monitoring abnormal behavior detection method according to any one of claims 1 to 5, wherein in step 6, in the feature fusion model, a convolution kernel of size 1×1 is applied to the last convolutional layer to adjust the output channels; 5 prior boxes are selected on the corresponding dataset with the k-means algorithm, generating a feature tensor with 5×(NumClasses+4+1) channels, where NumClasses represents the confidence scores of the behavior classes in the dataset, 4 represents the four coordinates of the target detection box, and 1 represents the detection confidence; the bounding-box regression is refined according to the anchor points, finally achieving accurate localization and behavior recognition of the target, i.e., spatio-temporal behavior detection.
7. A computer device, comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the steps of the YOLO-TSM-based lightweight real-time monitoring abnormal behavior detection method of any one of claims 1-6.
8. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the YOLO-TSM-based lightweight real-time monitoring abnormal behavior detection method of any one of claims 1 to 6.
CN202111543515.2A 2021-12-16 2021-12-16 Lightweight real-time monitoring abnormal behavior detection method based on Yolo-TSM Pending CN114220169A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111543515.2A CN114220169A (en) 2021-12-16 2021-12-16 Lightweight real-time monitoring abnormal behavior detection method based on Yolo-TSM

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111543515.2A CN114220169A (en) 2021-12-16 2021-12-16 Lightweight real-time monitoring abnormal behavior detection method based on Yolo-TSM

Publications (1)

Publication Number Publication Date
CN114220169A true CN114220169A (en) 2022-03-22

Family

ID=80702956

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111543515.2A Pending CN114220169A (en) 2021-12-16 2021-12-16 Lightweight real-time monitoring abnormal behavior detection method based on Yolo-TSM

Country Status (1)

Country Link
CN (1) CN114220169A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117668669A (en) * 2024-02-01 2024-03-08 齐鲁工业大学(山东省科学院) Pipeline safety monitoring method and system based on improved YOLOv7

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109961019A (en) * 2019-02-28 2019-07-02 华中科技大学 A spatio-temporal behavior detection method
CN110942009A (en) * 2019-11-22 2020-03-31 南京甄视智能科技有限公司 Fall detection method and system based on space-time hybrid convolutional network
CN113239822A (en) * 2020-12-28 2021-08-10 武汉纺织大学 Dangerous behavior detection method and system based on space-time double-current convolutional neural network


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
BUBBLIIIING: "神经网络学习小记录20——ResNet50模型的复现详解", HTTPS://BLOG.CSDN.NET/WEIXIN_44791964/ARTICLE/DETAILS/102790260, 28 October 2019 (2019-10-28), pages 1 - 11 *
JI LIN 等: "TSM: Temporal Shift Module for Efficient Video Understanding", 《ARXIV》, 22 August 2019 (2019-08-22), pages 2 - 4 *
JOSEPH REDMON 等: "YOLOv3:An Incremental Improvement", 《ARXIV》, 8 April 2018 (2018-04-08), pages 2 *
OKAN KOPUKLYU 等: "You Only Watch Once: A Unified CNN Architecture for Real-Time Spatiotemporal Action Localization", 《ARXIV》, 5 March 2020 (2020-03-05), pages 2 - 4 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117668669A (en) * 2024-02-01 2024-03-08 齐鲁工业大学(山东省科学院) Pipeline safety monitoring method and system based on improved YOLOv7
CN117668669B (en) * 2024-02-01 2024-04-19 齐鲁工业大学(山东省科学院) Pipeline safety monitoring method and system based on improved YOLOv7

Similar Documents

Publication Publication Date Title
Fan et al. Point 4d transformer networks for spatio-temporal modeling in point cloud videos
CN109389055B (en) Video classification method based on mixed convolution and attention mechanism
CN111814719A (en) Skeleton behavior identification method based on 3D space-time diagram convolution
CN112699786B (en) Video behavior identification method and system based on space enhancement module
CN112232164A (en) Video classification method and device
CN114049381A (en) Twin cross target tracking method fusing multilayer semantic information
CN109598732A (en) A kind of medical image cutting method based on three-dimensional space weighting
CN114821640A (en) Skeleton action identification method based on multi-stream multi-scale expansion space-time diagram convolution network
CN113393474A (en) Feature fusion based three-dimensional point cloud classification and segmentation method
CN113920581A (en) Method for recognizing motion in video by using space-time convolution attention network
CN113610046A (en) Behavior identification method based on depth video linkage characteristics
Xu Fast modelling algorithm for realistic three-dimensional human face for film and television animation
CN111524140A (en) Medical image semantic segmentation method based on CNN and random forest method
CN114220169A (en) Lightweight real-time monitoring abnormal behavior detection method based on Yolo-TSM
Wang et al. Learning precise feature via self-attention and self-cooperation YOLOX for smoke detection
Ren et al. LightRay: Lightweight network for prohibited items detection in X-ray images during security inspection
Zhenhua et al. FTCF: Full temporal cross fusion network for violence detection in videos
CN114170657A (en) Facial emotion recognition method integrating attention mechanism and high-order feature representation
CN110782503B (en) Face image synthesis method and device based on two-branch depth correlation network
CN112613405B (en) Method for recognizing actions at any visual angle
CN114120076B (en) Cross-view video gait recognition method based on gait motion estimation
CN115661861A (en) Skeleton behavior identification method based on dynamic time sequence multidimensional adaptive graph convolution network
CN115830707A (en) Multi-view human behavior identification method based on hypergraph learning
Tang et al. A multi-task neural network for action recognition with 3d key-points
Zhao et al. Research on human behavior recognition in video based on 3DCCA

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination