CN113065558B - Lightweight small target detection method combined with attention mechanism - Google Patents

Lightweight small target detection method combined with attention mechanism Download PDF

Info

Publication number
CN113065558B
CN113065558B (application CN202110432768.6A)
Authority
CN
China
Prior art keywords
network
module
feature
mse
attention mechanism
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110432768.6A
Other languages
Chinese (zh)
Other versions
CN113065558A (en)
Inventor
朱威
王立凯
靳作宝
何德峰
郑雅羽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202110432768.6A
Publication of CN113065558A
Application granted
Publication of CN113065558B
Legal status: Active

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 Target detection
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a lightweight small target detection method combining an attention mechanism, comprising the following steps: (1) building a small target detection network based on YOLOv4: constructing an MSE multi-scale attention module and inserting it into the feature extraction network, adding a shallow feature map as a prediction layer, and improving the SPP module to enhance feature extraction capability; (2) constructing a small target data set, enhancing the training set with a data enhancement strategy, and customizing the anchor boxes; (3) performing channel pruning on the model and recovering model accuracy through knowledge distillation; (4) inputting an unmanned aerial vehicle aerial image and obtaining target classification and localization results. By exploiting a channel attention mechanism and a model compression strategy, the invention effectively reduces false detections of small targets while preserving the real-time performance of the model.

Description

Lightweight small target detection method combined with attention mechanism
Technical Field
The invention belongs to the application of a deep learning technology in the field of machine vision, and particularly relates to a lightweight small target detection method combined with an attention mechanism.
Background
Target detection finds a specific target class and its accurate position in a given image. Small target detection is an important research topic in this field, with significant application value in remote sensing target recognition, infrared imaging target recognition, agricultural pest and disease recognition, and other scenarios. In object detection, an object occupying 0.12% or less of the total image pixels, or smaller than 32×32 pixels, is generally referred to as a small object. Because small objects have low resolution and high noise, the features extracted after multiple convolution layers are often weak, which makes detecting small objects in an image very difficult.
Early small target detection mainly obtained target feature information through hand-crafted methods. Wen Peizhi et al. applied the wavelet transform to small target detection (see Wen Peizhi, Shi Zelin, Yu Hai, Wu Xiaojun. Sea-surface background infrared small target detection method based on wavelet transform [J]. Opto-Electronic Engineering, 2004), using the multi-resolution analysis of orthogonal wavelet decomposition to select frequency bands and suppress noise and background interference, fusing edges in different directions to obtain candidate points, and finally eliminating interfering targets with a gray-level threshold. Chen et al. (see C. L. P. Chen, H. Li, Y. Wei, et al. A Local Contrast Method for Small Infrared Target Detection [J]. IEEE Transactions on Geoscience and Remote Sensing, 2014, 52(1): 574-581), inspired by biological vision mechanisms, compute a local contrast map of the input image with a proposed local contrast measure that represents the difference between the current location and its neighborhood, achieving target signal enhancement and background clutter suppression simultaneously, and finally segment the target with an adaptive threshold. These methods start from low-level image features and accomplish detection with basic image cues; they are simple to operate, but suffer from missed detections, false detections, and poor real-time performance when detecting small targets against complex backgrounds.
In recent years, with the growth of computing power and the rapid development of deep learning theory, deep learning techniques have been widely applied to target detection. Currently popular object detection models can be broadly divided into two categories: one-stage detection algorithms, which treat classification and localization as a regression task, with SSD and YOLO as typical examples; and two-stage detection algorithms, which separate candidate box generation from target classification, represented by R-CNN and Faster R-CNN. Because one-stage algorithms cast the whole detection task as a single regression operation, they hold a great advantage in real-time performance.
The main ways deep learning techniques improve small target detection are multi-scale representation, context information, super-resolution, and so on. Patent application CN202010537199.7 discloses a detection method for small targets in pictures: six feature maps of different sizes are obtained from the picture to be detected, and bilinear interpolation fuses the pyramid's bottom-level feature map with its high-level feature maps, yielding six new feature maps of different sizes that participate in prediction. This method uses multi-scale feature maps to enhance target feature information, but it is easily disturbed by complex backgrounds and has a high false detection rate. Patent application CN202010444356.X discloses a resolution-enhancement-based method for detecting small targets in remote sensing images, which applies super-resolution processing to remote sensing images containing small targets before detection. This addresses the scarcity of usable small-target features and the geometric deformation of small target regions in remote sensing images: super-resolution processing refines the small targets' detail features, and a region-based deformable convolution network makes full use of the limited feature information, improving small-target detection in remote sensing images. Although this method achieves good accuracy, the increased image resolution reduces the network's real-time performance and works against a lightweight network.
Disclosure of Invention
In order to solve the problems of high false detection rates, missed detections, and poor real-time performance that existing target detection methods exhibit on small targets, the invention provides a lightweight small target detection method combined with an attention mechanism, comprising the following steps:
(1) Construction of a YOLOv4-based improved small target detection network
The small target detection network is improved on the basis of the one-stage target detection network YOLOv4; the specific network structure improvements comprise the following three aspects:
(1-1) building MSE multiscale attention mechanism Module, inserting into feature extraction network
The MSE multi-scale attention module constructed by the invention is an improvement of the SE attention module. The SE attention module, proposed by Hu et al. in 2017, is a lightweight attention mechanism for the computer vision field that can be conveniently inserted between two layers of a feature extraction network; by learning global information it selects and emphasizes feature channels of interest and suppresses irrelevant interference.
An MSE multi-scale attention module is constructed and inserted between the Concat layer and the CBM module in each CSP module of the YOLOv4 feature extraction network CSPDarknet53, forming a new MSE-CSPUnit module and yielding the attention-equipped feature extraction network MSE-CSPDarknet53. The MSE multi-scale attention module is constructed in the following specific steps:
(1-1-1) First, the output of the CSP module's Concat layer is taken as the input feature map, feature maps of multiple scales are integrated through convolution kernels of different sizes, and subsequent feature extraction operates on the resulting multi-scale feature map. The convolution kernel sizes are 3×3, 5×5, and 7×7; to counter the parameter growth caused by large kernels, a stack of two 3×3 convolutions replaces the 5×5 kernel and a stack of three 3×3 convolutions replaces the 7×7 kernel. Let the input feature map be X ∈ R^(C×H×W), where C, H, and W are the number of input channels, the input height, and the input width. Feature extraction with the different kernel sizes then proceeds as:

X_c = V_3×3(X) + V_5×5(X) + V_7×7(X)

where X_c is the multi-scale feature map output and V denotes a convolution operation with the indicated kernel size.
(1-1-2) A squeeze operation is applied to X_c: global average pooling and global max pooling each squeeze the channels into channel-level feature information, where global average pooling focuses on the global features of the feature map and global max pooling focuses on its local features:

X_avg = (1/(H×W)) Σ_{i=1..H} Σ_{j=1..W} X_c(i, j)

X_max = max_{i,j} X_c(i, j)

where X_c is the input multi-scale feature map, X_avg is the feature acquired after global average pooling, X_max is the feature acquired after global max pooling, and i = 1, 2, …, H, j = 1, 2, …, W, with H and W the input height and width, respectively.
(1-1-3) Excitation operations are applied to X_avg and X_max separately, and the channel attention weights X_s are generated by addition and normalization. During excitation, the Mish activation function preserves more of the nonlinear relationships between channels. FC_1 and FC_2 are two different fully connected layers with FC_1 ∈ R^((C/r)×C) and FC_2 ∈ R^(C×(C/r)), where C is the number of input channels and r is the dimension-reduction ratio: FC_1 reduces the dimension to cut the fully connected layer's parameters, and FC_2 restores the original dimension. The excitation and normalization operations are:

X_a = FC_2(Mish(FC_1(X_avg)))

X_m = FC_2(Mish(FC_1(X_max)))

X_s = Softmax(X_a + X_m)

where Mish is a nonlinear activation function and Softmax is a normalization function.
(1-1-4) The channel attention weights generated in (1-1-3) are used to weight the multi-scale feature map generated in (1-1-1), giving the output X_weight of the MSE multi-scale attention module, which serves as the input of the CBM module in the MSE-CSPUnit module:

X_weight = Scale(X_c, X_s)

where Scale denotes channel-wise multiplication of X_c by the weights X_s.
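For concreteness, the following is a minimal PyTorch sketch of the MSE multi-scale attention module as described in steps (1-1-1) through (1-1-4). The class and variable names are illustrative, and two details the text leaves open are assumed here: the branch convolutions preserve the channel count, and the two excitation branches share the FC_1/FC_2 weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mish(t: torch.Tensor) -> torch.Tensor:
    # Mish activation: x * tanh(softplus(x)); written out explicitly because
    # PyTorch 1.6 (the embodiment's framework) has no built-in mish.
    return t * torch.tanh(F.softplus(t))

class MSEAttention(nn.Module):
    """Minimal sketch of the MSE multi-scale attention module (steps 1-1-1
    to 1-1-4). Assumptions not fixed by the text: branch convolutions keep
    the channel count, and both excitation branches share FC_1/FC_2."""

    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        def c3():  # one channel-preserving 3x3 convolution
            return nn.Conv2d(channels, channels, 3, padding=1)
        self.conv3 = c3()                                  # 3x3 branch
        self.conv5 = nn.Sequential(c3(), c3())             # two 3x3 stand in for 5x5
        self.conv7 = nn.Sequential(c3(), c3(), c3())       # three 3x3 stand in for 7x7
        self.fc1 = nn.Linear(channels, channels // r)      # FC_1: C -> C/r
        self.fc2 = nn.Linear(channels // r, channels)      # FC_2: C/r -> C

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (1-1-1) multi-scale fusion: X_c = V_3x3(X) + V_5x5(X) + V_7x7(X)
        xc = self.conv3(x) + self.conv5(x) + self.conv7(x)
        b, c, _, _ = xc.shape
        # (1-1-2) squeeze: global average and global max pooling
        x_avg = F.adaptive_avg_pool2d(xc, 1).view(b, c)
        x_max = F.adaptive_max_pool2d(xc, 1).view(b, c)
        # (1-1-3) excitation with Mish, then Softmax over the channel axis
        x_a = self.fc2(mish(self.fc1(x_avg)))
        x_m = self.fc2(mish(self.fc1(x_max)))
        xs = torch.softmax(x_a + x_m, dim=1)
        # (1-1-4) Scale(): channel-wise reweighting of the fused feature map
        return xc * xs.view(b, c, 1, 1)
```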
(1-2) adding shallow feature maps as prediction layers
Deep features carry stronger semantic information and are better suited to localization, while shallow features retain rich resolution information and are more helpful for detecting small targets. The 19×19 feature map output by the FPN and PAN structures is deleted, while their original 38×38 and 76×76 output feature maps are kept; the FPN and PAN structures then fuse the MSE-CSPUnit output with the upsampled deeper feature map to obtain a shallow feature map of size 152×152. In the end, three feature maps of different sizes, 38×38, 76×76, and 152×152, are obtained to predict targets at different scales.
Here MSE-CSPUnit×2 denotes two stacked MSE-CSPUnit modules.
(1-3) SPP Module improvements
The SPP module enriches the expressive power of the feature map and supplies important context information. To improve small target detection performance, SPP modules are placed in front of the 38×38, 76×76, and 152×152 feature maps respectively, realizing an effective fusion of local and global features. The SPP module applies 1×1, 5×5, 9×9, and 13×13 max-pooling operations to the input feature map and then tensor-concatenates the resulting feature maps of different scales.
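A minimal sketch of such an SPP block is shown below, assuming (as in the standard YOLOv4 SPP) stride-1 pooling with padding so that the pooled maps keep the input's spatial size and can be concatenated along the channel dimension; the 1×1 pooling case reduces to the identity.

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """SPP block sketch: parallel max pooling at several kernel sizes plus
    the identity (the 1x1 case), concatenated along the channel axis."""

    def __init__(self, kernel_sizes=(5, 9, 13)):
        super().__init__()
        # stride-1 pooling with padding k//2 preserves the spatial size
        self.pools = nn.ModuleList(
            nn.MaxPool2d(k, stride=1, padding=k // 2) for k in kernel_sizes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.cat([x] + [p(x) for p in self.pools], dim=1)
```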
(2) Training and optimizing small target detection networks
For the specific application scenario, a small target detection data set is constructed, and data enhancement applies multiple random adjustments to the picture data: the number of small targets in the data and the brightness, contrast, and saturation of the pictures are randomly adjusted to strengthen the model's generalization performance.
Finally, anchor boxes are set to fit the targets in the data set: the anchor boxes are re-clustered on the target data set with the K-means++ algorithm, yielding anchor parameters better suited to the current data set and accelerating network convergence.
(3) Model light-weight for small target detection network
(3-1) channel pruning
Channel pruning is applied to the small target detection network to address its parameter redundancy. The scale factor γ of each convolution module's BN layer in YOLOv4 is used: an L1 regularization term on the BN-layer γ values is added to the loss function, the network is sparsity-trained for a number of epochs, the γ values are sorted after the gradient updates, and a pruning threshold is set so that channels whose γ falls below it are removed, yielding the pruned lightweight YOLOv4 network. In the YOLOv4 network, channel pruning is applied to all convolution modules containing BN layers except the convolution layers before the upsampling layers and the SPP structures, producing a pruned model file and a model structure configuration file. The objective loss function established for YOLOv4 sparsity training is:

L = Σ_(x,y) l(f(x, W), y) + λ Σ_γ g(γ)

where x is the model input, y is the desired output, W are the trainable parameters of the network, l(·) is the detection loss, g(·) is the penalty term on the scaling factors (here g(γ) = |γ|), and λ is the balance factor.
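The sparsity penalty can be sketched as follows; the λ value and the helper name are illustrative, and selecting which BN layers participate (the text exempts the convolutions before upsampling layers and the SPP structures) is left to the caller.

```python
import torch

def bn_sparsity_penalty(model: torch.nn.Module, lam: float = 1e-4) -> torch.Tensor:
    """L1 penalty on BatchNorm scale factors: lambda * sum(|gamma|).
    Added to the detection loss during sparsity training; the lambda
    value here is illustrative."""
    penalty = torch.zeros((), device=next(model.parameters()).device)
    for m in model.modules():
        if isinstance(m, torch.nn.BatchNorm2d):
            penalty = penalty + m.weight.abs().sum()  # g(gamma) = |gamma|
    return lam * penalty

# Usage sketch: total_loss = detection_loss + bn_sparsity_penalty(model)
```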
(3-2) knowledge distillation recovery model accuracy
Although the removed channels contribute little to the model output, the model's accuracy still drops somewhat after channel pruning, so it must be restored.
The unpruned YOLOv4 network serves as the teacher network and the channel-pruned network as the student network for knowledge distillation. Knowledge distillation for YOLOv4 covers learning of both the classification task and the regression task. For distillation of the regression outputs, because the regression output is unbounded, the teacher network's predictions may deviate from the label values in the wrong direction, so the student does not learn from the teacher directly when computing the regression loss. First, the L2 losses between the teacher network and the label values and between the student network and the label values are computed, and a margin w is set; when the deviation between the student network's L2 loss to the labels and the teacher network's L2 loss to the labels exceeds the margin w, the student network's L2 loss is counted in the total loss. That is, when the student network outperforms the teacher network beyond a certain value, no student loss is calculated. The overall loss function is as follows:
L_reg = (1 − v)·L_sL1(R_s, y_reg) + v·L_b(R_s, R_t, y_reg)

where w is the preset margin, y_reg is the true label value, R_t and R_s are the regression outputs of the teacher and the student, L_b is the distillation part of the loss, L_sL1 is the loss between the student network and the true label, and v is the balance factor between L_b and L_sL1, set between 0.1 and 0.5 during the first 80% of network training and between 0.6 and 0.9 during the final 20%; L_reg is the total loss during distillation learning.
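A sketch of this teacher-bounded regression loss follows. The gating direction, that the student's L2 term is zeroed only once the student beats the teacher by more than the margin w, is one reading of the description above; the default w = 0.3 and the v value follow the embodiment, and the function and argument names are illustrative.

```python
import torch
import torch.nn.functional as F

def bounded_regression_loss(r_s: torch.Tensor, r_t: torch.Tensor,
                            y: torch.Tensor, w: float = 0.3,
                            v: float = 0.3) -> torch.Tensor:
    """Sketch of L_reg = (1 - v) * L_sL1(R_s, y) + v * L_b(R_s, R_t, y)."""
    l2_s = F.mse_loss(r_s, y)            # student vs. label (L2)
    l2_t = F.mse_loss(r_t, y)            # teacher vs. label (L2)
    # L_b: charge the student's L2 loss unless the student already beats
    # the teacher by more than the margin w (our reading of the text).
    l_b = l2_s if (l2_t - l2_s) <= w else r_s.new_zeros(())
    l_sl1 = F.smooth_l1_loss(r_s, y)     # student vs. true label
    return (1.0 - v) * l_sl1 + v * l_b
```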
(4) Detection of input images using trained small target detection network models
A frame of unmanned aerial vehicle aerial imagery is input and fed into the trained and optimized small target detection network for target localization and classification. The network first passes the image through the attention-equipped feature extraction network, and 3 feature maps of different resolutions are output via the SPP modules. Targets at three different scales are detected on these 3 feature maps with regression and classification, and target classification and localization results are obtained after confidence-threshold filtering; this is repeated until all pictures in the test set have been processed.
Compared with the prior art, the invention has the following beneficial effects:
The invention improves the end-to-end convolutional neural network YOLOv4 into a lightweight small target detection network. Compared with traditional small target detection methods, an MSE attention module is designed on the basis of SE and inserted into the YOLOv4 feature extraction network, strengthening the network's attention to regions of interest and reducing the interference of complex backgrounds during small target detection. A shallow feature map is then added as a prediction layer, so that three feature maps of sizes 38×38, 76×76, and 152×152 predict targets at different scales. The SPP modules are improved and placed in front of the 38×38, 76×76, and 152×152 feature maps respectively, realizing effective fusion of local and global features. Finally, the model is compressed and optimized with channel pruning and knowledge distillation, greatly reducing the number of model parameters at little cost in accuracy. In addition, data enhancement randomly adjusts the number of small targets in the data set and the brightness, contrast, and saturation of the pictures, strengthening the training of the model. On a small target data set the network shows good detection performance and robustness, and it meets the requirements of lightweight model deployment.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a MSE-CSPUnit module after adding an MSE multiscale attention mechanism module;
FIG. 3 is a MSE multi-scale attention module structure of the present invention;
FIG. 4 is a small target detection network architecture designed in accordance with the present invention;
FIG. 5 is a comparison of the number of channels after model compression, wherein dark bars are before pruning and light bars are after pruning;
fig. 6 is a diagram of the detection effect of the small target detection network on the target picture according to the present invention, wherein (a) and (c) are detection effects before improvement, and (b) and (d) are detection effects after improvement corresponding to (a) and (c).
Detailed Description
The present invention will be described in detail with reference to the examples and drawings, but it is not limited thereto. The detection targets in this embodiment are the various small objects in the data set; the processing platform combines an Intel i9-9900K CPU, an NVIDIA RTX 2080 Ti GPU, and 32 GB of RAM, with Ubuntu 18.04 (64-bit Linux) as the operating system. The method is implemented on the deep learning framework PyTorch 1.6.
As shown in fig. 1, the lightweight small target detection method combined with an attention mechanism comprises four parts:
(1) Constructing a small target detection network based on YOLOv4 improvement;
(2) Training and optimizing the small target detection network;
(3) Performing model weight reduction on the small target detection network;
(4) And detecting the input image by using the trained small target detection network model.
The first part, building a small target detection network based on the YOLOv4 improvement, specifically comprises the following steps:
(1-1) Designing the MSE multi-scale attention module and embedding it in the feature extraction network
An MSE multi-scale attention module is constructed and inserted between the Concat layer and the CBM module in each CSP module of the YOLOv4 feature extraction network CSPDarknet53, forming a new MSE-CSPUnit module and yielding the attention-equipped feature extraction network MSE-CSPDarknet53; as shown in fig. 2, all modules other than the MSE are conventional structural modules of CSPDarknet53. The MSE multi-scale attention module is constructed as follows:
First, the output of the CSP module's Concat layer is taken as the input feature map, feature maps of multiple scales are integrated through convolution kernels of different sizes, and subsequent feature extraction operates on the resulting multi-scale feature map; the convolution kernel sizes are 3×3, 5×5, and 7×7. To counter the parameter growth caused by large kernels, a stack of two 3×3 convolutions replaces the 5×5 kernel and a stack of three 3×3 convolutions replaces the 7×7 kernel. Let the input feature map be X ∈ R^(C×H×W), where C, H, and W are the number of input channels, the input height, and the input width. Feature extraction with the different kernel sizes then proceeds as:

X_c = V_3×3(X) + V_5×5(X) + V_7×7(X)

where X_c is the multi-scale fused feature output and V denotes a convolution operation with the indicated kernel size.
A squeeze operation is applied to X_c: given that small targets carry little feature information, a global max pooling operation focuses on the local information of the feature map while a global average pooling operation focuses on its global features. The pooling operations are:

X_avg = (1/(H×W)) Σ_{i=1..H} Σ_{j=1..W} X_c(i, j)

X_max = max_{i,j} X_c(i, j)

where X_avg is the feature acquired after global average pooling, X_max is the feature acquired after global max pooling, and i = 1, 2, …, H, j = 1, 2, …, W, with H and W the input height and width, respectively.
Excitation operations are applied to X_avg and X_max separately, and the results are added and normalized to generate the attention weights X_s. The Mish activation function preserves more of the nonlinear relationships between channels during excitation. FC_1 and FC_2 are two different fully connected layers, where C is the number of input channels and r is the dimension-reduction ratio: FC_1 reduces the dimension to cut the fully connected layer's parameters, and FC_2 restores the original dimension. The excitation and normalization operations are:

X_a = FC_2(Mish(FC_1(X_avg)))

X_m = FC_2(Mish(FC_1(X_max)))

X_s = Softmax(X_a + X_m)
where Mish is a nonlinear activation function and Softmax is a normalization function.
X_s is weighted against the multi-scale feature map X_c generated in the first step, giving the output X_weight of the MSE multi-scale attention module, which serves as the input of the CBM module in the MSE-CSPUnit module:

X_weight = Scale(X_c, X_s)
(1-2) Adding a shallow feature map as a prediction layer
Deep features carry stronger semantic information and are better suited to localization, while shallow features retain rich resolution information and are more helpful for detecting small targets. The 19×19 feature map output by the FPN and PAN structures is deleted, while their original 38×38 and 76×76 output feature maps are kept; the FPN and PAN structures then fuse the MSE-CSPUnit output with the upsampled deeper feature map to obtain a shallow feature map of size 152×152. In the end, three feature maps of different sizes, 38×38, 76×76, and 152×152, are obtained to predict targets at different scales.
(1-3) SPP Module improvements
The SPP module enriches the expressive power of the feature map and supplies important context information. To improve small target detection performance, SPP modules are placed in front of the 38×38, 76×76, and 152×152 feature maps respectively, realizing an effective fusion of local and global features. The SPP module applies 1×1, 5×5, 9×9, and 13×13 max-pooling operations to the input feature map and then tensor-concatenates the resulting feature maps of different scales.
The second part of training and optimizing the small target detection network specifically comprises the following steps:
(2-1) construction of data sets
First, a small target data set is constructed; the unmanned aerial vehicle aerial photography data set VisDrone2019 is selected for the experiments. Because VisDrone2019 is captured from drones, it contains a large number of small and densely packed objects, and illumination changes and object occlusion add to the data set's difficulty. Moreover, since the drone images are shot from directly overhead, the objects to be detected expose fewer features: for pedestrian detection, for example, a ground-level image contains features such as arms and legs, while a drone image may contain only overhead features.
(2-2) data enhancement and Multi-modal random adjustment of Picture data
During network training, online data enhancement is applied to the data set to improve training on small targets. Since the data set may contain few pictures of small objects, the model can become biased toward medium and large objects during training. Online enhancement therefore copies several small targets within a picture, manually increasing the number of small objects so that anchors are more likely to contain them and the model sees more small target training samples; a simplified sketch of this step follows. The pictures are also randomly rotated and scaled, and their brightness, contrast, and saturation adjusted, to improve the model's robustness.
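As an illustration, the copy-based augmentation might look like the following; the box format, patch-size threshold, and copy count are assumptions, and a production version would also check that pasted patches do not overlap existing objects.

```python
import random
import numpy as np

def copy_paste_small_objects(img: np.ndarray, boxes, max_copies: int = 3,
                             small_thresh: int = 32):
    """Duplicate small-object patches into the image (simplified sketch).

    `boxes` is assumed to hold integer (x1, y1, x2, y2) corners; overlap
    checks against existing objects are omitted for brevity.
    """
    h, w = img.shape[:2]
    new_boxes = list(boxes)
    small = [b for b in boxes
             if (b[2] - b[0]) < small_thresh and (b[3] - b[1]) < small_thresh]
    for x1, y1, x2, y2 in small:
        bw, bh = x2 - x1, y2 - y1
        for _ in range(random.randint(1, max_copies)):
            nx = random.randint(0, w - bw - 1)   # random paste location
            ny = random.randint(0, h - bh - 1)
            img[ny:ny + bh, nx:nx + bw] = img[y1:y2, x1:x2]
            new_boxes.append((nx, ny, nx + bw, ny + bh))
    return img, new_boxes
```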
(2-3) custom anchor boxes for fitting to targets in a dataset
For detecting targets across extreme scales, suitable anchor boxes fit the objects in the data set more accurately. For the drone aerial photography data set, the anchor boxes are re-clustered with the K-means++ algorithm to obtain anchor parameters better suited to the current data set. The anchor parameters obtained by K-means++ are (1,4), (2,8), (4,13), (4,5), (8,20), (9,9), (16,29), (16,15), (35,42).
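A sketch of this re-clustering using scikit-learn's k-means++ initialization is shown below; note that YOLO-style implementations often cluster under a 1−IoU distance rather than the Euclidean distance used here, so this is only an approximation of the procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_anchors(wh: np.ndarray, k: int = 9) -> np.ndarray:
    """Re-cluster anchor boxes with k-means++ initialization (sketch).

    `wh` is an (N, 2) array of ground-truth box widths and heights,
    e.g. extracted from the VisDrone2019 annotations.
    """
    km = KMeans(n_clusters=k, init="k-means++", n_init=10).fit(wh)
    centers = km.cluster_centers_
    anchors = centers[np.argsort(centers.prod(axis=1))]  # sort by box area
    return np.round(anchors).astype(int)
```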
The third part of small target detection network model light weight specifically comprises:
(3-1) channel pruning
Channel pruning is applied to the small target detection network to address its parameter redundancy. The scale factor γ of each convolution module's BN layer in YOLOv4 is used: an L1 regularization term on the BN-layer γ values is added to the loss function, the network is sparsity-trained for a preset number of epochs (e.g. 300), the γ values are sorted after the gradient updates, and a pruning threshold is set so that channels whose γ falls below it are removed, yielding the pruned lightweight YOLOv4 network. In the YOLOv4 network, channel pruning is applied to all convolution modules containing BN layers except the convolution layers before the upsampling layers and the SPP structures. The channel pruning ratio is chosen through repeated experiments to strike a good balance between speed and accuracy; a ratio of 0.7 is finally selected, producing the pruned model file and model structure configuration file.
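The channel-selection step after sparsity training can be sketched as follows; the helper name is illustrative, the 0.7 ratio follows the embodiment, and excluding the exempted layers (the convolutions before upsampling layers and the SPP structures) is again left to the caller.

```python
import torch

def prune_masks(model: torch.nn.Module, prune_ratio: float = 0.7):
    """Build per-layer channel keep-masks from BN gammas (sketch).

    Gathers all BN gammas, places the global threshold at `prune_ratio`,
    and keeps channels whose |gamma| exceeds it.
    """
    gammas = torch.cat([m.weight.detach().abs().flatten()
                        for m in model.modules()
                        if isinstance(m, torch.nn.BatchNorm2d)])
    threshold = torch.sort(gammas).values[int(len(gammas) * prune_ratio)]
    masks = {name: m.weight.detach().abs() > threshold
             for name, m in model.named_modules()
             if isinstance(m, torch.nn.BatchNorm2d)}
    return threshold, masks
```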
(3-2) knowledge distillation recovery model accuracy
Although the removed channels contribute little to the model output, the model's accuracy still drops somewhat after channel pruning, so it must be restored.
The unpruned YOLOv4 network serves as the teacher network and the channel-pruned network as the student network for knowledge distillation. Knowledge distillation for YOLOv4 covers learning of both the classification task and the regression task. For distillation of the regression outputs, because the regression output is unbounded, the teacher network's predictions may deviate from the true values in the wrong direction, so the student does not learn from the teacher directly when computing the regression loss. First, the L2 distances between the teacher network and the label values and between the student network and the label values are computed; repeated experimental comparison sets the margin to w = 0.3, and the student network's L2 loss is counted in the total loss only when the deviation between the student's and the teacher's L2 distances to the labels exceeds the margin w. That is, when the student network outperforms the teacher network beyond a certain value, no student loss is calculated. The overall loss function is as follows:
L_reg = (1 − v)·L_sL1(R_s, y_reg) + v·L_b(R_s, R_t, y_reg)

where w is the preset margin, y_reg is the true label value, R_t and R_s are the regression outputs of the teacher and the student, L_b is the distillation part of the loss, L_sL1 is the loss between the student network and the true label, and v is the balance factor between L_b and L_sL1, set between 0.1 and 0.5 during the first 80% of network training and between 0.6 and 0.9 during the final 20%; L_reg is the total loss during distillation learning.
The fourth part of detecting the small target of the picture specifically comprises:
(4-1) inputting an aerial image of the unmanned aerial vehicle
(4-2) After the unmanned aerial vehicle aerial image is read, it is fed into the trained and optimized small target detection network for target localization and classification. The network first passes the image through the attention-equipped feature extraction network, and 3 feature maps of different resolutions are output via the SPP modules. Targets at three different scales are detected with regression and classification; the confidence threshold lies between 0.2 and 0.6 and is generally set to 0.3, and target classification and localization results are obtained after threshold filtering.
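For illustration, the confidence filtering amounts to the following; the detection tuple layout is an assumption, and non-maximum suppression, which YOLOv4 pipelines also apply, is omitted.

```python
def filter_detections(dets, conf_thresh: float = 0.3):
    """Keep detections whose confidence reaches the threshold (sketch).
    Each detection is assumed to be (class_id, confidence, x1, y1, x2, y2)."""
    return [d for d in dets if d[1] >= conf_thresh]
```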
(4-3) repeating the steps (4-1) to (4-2) until the detection of the pictures in the test set is completed, wherein the detection effect of various small targets is shown in fig. 6.

Claims (7)

1. A lightweight small target detection method combined with an attention mechanism, characterized by comprising the following steps:
(1) Constructing a small target detection network based on YOLOv4 improvement, which comprises the following steps:
(1-1) Constructing an MSE multi-scale attention module: taking the output of the Concat layer of a CSP module as the input feature X; integrating feature maps of multiple scales through convolution kernels of different sizes to obtain the multi-scale fused feature output X_c; squeezing the channels of X_c with global average pooling and global max pooling to obtain the feature X_avg acquired after global average pooling and the feature X_max acquired after global max pooling; exciting X_avg and X_max separately, then adding and normalizing to generate the attention weights X_s; weighting the generated X_s with the generated X_c to obtain the output X_weight of the MSE multi-scale attention module, X_weight = Scale(X_c, X_s); and inserting the module into the feature extraction network;
(1-2) Adding a shallow feature map as a prediction layer: deleting the 19×19 feature map output by the FPN and PAN structures and retaining their original 38×38 and 76×76 output feature maps; using the FPN and PAN structures to fuse the MSE-CSPUnit output with the upsampled deeper feature map to obtain a 152×152 shallow feature map; and finally obtaining three feature maps of different sizes, 38×38, 76×76, and 152×152, to predict targets at different scales;
(1-3) Improving the SPP module: placing SPP modules between the FPN and PAN structures and the corresponding three prediction layers, the SPP module applying max-pooling operations to the input feature map and then tensor-concatenating the generated feature maps of different scales;
(2) Training and optimizing a small target detection network;
(3) Performing model weight reduction on the small target detection network;
(4) And detecting the input image by using the trained small target detection network model.
2. The method for detecting a lightweight small object in combination with an attention mechanism according to claim 1, wherein step (1-1) comprises: constructing an MSE multi-scale attention module and inserting it between the Concat layer and the CBM module in each CSP module of the YOLOv4 feature extraction network CSPDarknet53 to form a new MSE-CSPUnit module, thereby obtaining the attention-equipped feature extraction network MSE-CSPDarknet53.
3. A method for detecting a lightweight small object in combination with an attention mechanism according to claim 1 or 2, characterized in that: the step (1-1) constructs an MSE multi-scale attention mechanism module based on the SE attention mechanism module, and comprises the following steps:
(1-1-1) taking the output of the Concat layer of the CSP module as the input feature X and integrating feature maps of multiple scales through convolution kernels of different sizes to obtain the multi-scale fused feature output X_c; the convolution kernel sizes are 3×3, 5×5, and 7×7, and X_c = V_3×3(X) + V_5×5(X) + V_7×7(X), where V denotes a convolution operation with the indicated kernel size;
(1-1-2) squeezing the channels of X_c with global average pooling and global max pooling to obtain channel-level feature information, where global average pooling focuses on global features and global max pooling focuses on local features:

X_avg = (1/(H×W)) Σ_{i=1..H} Σ_{j=1..W} X_c(i, j)

X_max = max_{i,j} X_c(i, j)

where X_avg is the feature acquired after global average pooling, X_max is the feature acquired after global max pooling, and i = 1, 2, …, H, j = 1, 2, …, W, with H and W the input height and width, respectively;
(1-1-3) exciting X_avg and X_max separately, then adding and normalizing to generate the attention weights X_s; FC_1 and FC_2 are two different fully connected layers with FC_1 ∈ R^((C/r)×C) and FC_2 ∈ R^(C×(C/r)), where C is the number of input channels and r is the dimension-reduction ratio; FC_1 reduces the dimension to cut the fully connected layer's parameters and FC_2 restores the original dimension;
X_a = FC_2(Mish(FC_1(X_avg)))

X_m = FC_2(Mish(FC_1(X_max)))

X_s = Softmax(X_a + X_m)

wherein Mish is a nonlinear activation function and Softmax is a normalization function;
(1-1-4) weighting the X_s generated in (1-1-3) with the X_c generated in (1-1-1) to obtain the output X_weight of the MSE multi-scale attention module, X_weight = Scale(X_c, X_s), X_weight serving as the input of the CBM module in the MSE-CSPUnit module.
4. The method for detecting a lightweight small object in combination with an attention mechanism according to claim 1, wherein in step (1-3) the SPP module applies 1×1, 5×5, 9×9, and 13×13 max-pooling operations to the input feature map and then tensor-concatenates the generated feature maps of different scales.
5. The method for detecting a lightweight small object in combination with an attention mechanism according to claim 1, wherein: the step (2) comprises the following steps:
(2-1) constructing a small target dataset;
(2-2) data enhancement, and performing multi-mode random adjustment on the picture data;
(2-3) setting an anchor frame for fitting to a target in the dataset.
6. The method for detecting a lightweight small object in combination with an attention mechanism according to claim 1, wherein: the step (3) comprises the following steps:
(3-1) channel pruning
Selecting the γ of the BN layers as the scaling factor, adding an L1 regularization term on the BN-layer γ to the loss function, sparsity-training the network for a number of epochs, and, based on the γ values after the gradient updates, channel-pruning all layers except the convolution layers before the upsampling layers and the SPP modules, to obtain the pruned model file and model structure configuration file;
(3-2) knowledge distillation recovery network accuracy
Taking the unpruned YOLOv4 network as the teacher network and the channel-pruned network as the student network; computing the L2 losses between the teacher network and the label values and between the student network and the label values respectively, and setting a margin; when the deviation between the student network's L2 loss to the labels and the teacher network's L2 loss to the labels exceeds the range w, counting the student network's L2 loss in the total loss, the total loss function being

L_reg = (1 − v)·L_sL1(R_s, y_reg) + v·L_b(R_s, R_t, y_reg)

where L_reg is the total loss during distillation learning, L_b is the distillation part of the loss, L_sL1 is the loss between the student network's regression output and the label values, v is the balance factor between L_b and L_sL1, set between 0.1 and 0.5 during the first 80% of network training and between 0.6 and 0.9 during the final 20%, y_reg is the label value, R_t and R_s are the regression outputs of the teacher network and the student network respectively, and w is the preset margin.
7. The method for detecting a lightweight small object in combination with an attention mechanism according to claim 1, wherein: the step (4) comprises the following steps: (4-1) inputting a frame of image;
(4-2) after one image is read, sending it into the trained and optimized small target detection network for target localization and classification: passing the image through the attention-equipped feature extraction network to extract features, outputting 3 feature maps of different resolutions via the SPP modules, detecting targets at three different scales on the 3 feature maps, setting the confidence threshold between 0.2 and 0.6, and obtaining target classification and localization results after threshold filtering;
(4-3) repeating the steps (4-1) to (4-2) until the detection of the pictures in the test set is completed.
CN202110432768.6A 2021-04-21 2021-04-21 Lightweight small target detection method combined with attention mechanism Active CN113065558B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110432768.6A CN113065558B (en) 2021-04-21 2021-04-21 Lightweight small target detection method combined with attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110432768.6A CN113065558B (en) 2021-04-21 2021-04-21 Lightweight small target detection method combined with attention mechanism

Publications (2)

Publication Number Publication Date
CN113065558A CN113065558A (en) 2021-07-02
CN113065558B 2024-03-22

Family

ID=76567333

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110432768.6A Active CN113065558B (en) 2021-04-21 2021-04-21 Lightweight small target detection method combined with attention mechanism

Country Status (1)

Country Link
CN (1) CN113065558B (en)

Families Citing this family (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109002848B (en) * 2018-07-05 2021-11-05 西华大学 Weak and small target detection method based on feature mapping neural network
CN113642402A (en) * 2021-07-13 2021-11-12 重庆科技学院 Image target detection method based on deep learning
CN113408549B (en) * 2021-07-14 2023-01-24 西安电子科技大学 Few-sample weak and small target detection method based on template matching and attention mechanism
CN113486990B (en) * 2021-09-06 2021-12-21 北京字节跳动网络技术有限公司 Training method of endoscope image classification model, image classification method and device
CN113743514A (en) * 2021-09-08 2021-12-03 庆阳瑞华能源有限公司 Knowledge distillation-based target detection method and target detection terminal
CN113780406A (en) * 2021-09-08 2021-12-10 福州大学 YOLO-based bundled log end face detection method
CN113807311A (en) * 2021-09-29 2021-12-17 中国人民解放军国防科技大学 Multi-scale target identification method
CN113962882B (en) * 2021-09-29 2023-08-25 西安交通大学 JPEG image compression artifact eliminating method based on controllable pyramid wavelet network
CN113837144B (en) * 2021-10-25 2022-09-13 广州微林软件有限公司 Intelligent image data acquisition and processing method for refrigerator
CN114022705B (en) * 2021-10-29 2023-08-04 电子科技大学 Self-adaptive target detection method based on scene complexity pre-classification
CN114037888B (en) * 2021-11-05 2024-03-08 中国人民解放军国防科技大学 Target detection method and system based on joint attention and adaptive NMS
CN114067437B (en) * 2021-11-17 2024-04-16 山东大学 Method and system for detecting pipe removal based on positioning and video monitoring data
CN114120154B (en) * 2021-11-23 2022-10-28 宁波大学 Automatic detection method for breakage of glass curtain wall of high-rise building
CN114283402B (en) * 2021-11-24 2024-03-05 西北工业大学 License plate detection method based on knowledge distillation training and space-time combined attention
CN113902744B (en) * 2021-12-10 2022-03-08 湖南师范大学 Image detection method, system, equipment and storage medium based on lightweight network
CN114220032A (en) * 2021-12-21 2022-03-22 一拓通信集团股份有限公司 Unmanned aerial vehicle video small target detection method based on channel cutting
CN114092820B (en) * 2022-01-20 2022-04-22 城云科技(中国)有限公司 Target detection method and moving target tracking method applying same
CN114463686B (en) * 2022-04-11 2022-06-17 西南交通大学 Moving target detection method and system based on complex background
CN115618271B (en) * 2022-05-05 2023-11-17 腾讯科技(深圳)有限公司 Object category identification method, device, equipment and storage medium
CN114663654B (en) * 2022-05-26 2022-09-09 西安石油大学 Improved YOLOv4 network model and small target detection method
US11915474B2 (en) 2022-05-31 2024-02-27 International Business Machines Corporation Regional-to-local attention for vision transformers
CN115019169A (en) * 2022-05-31 2022-09-06 海南大学 Single-stage water surface small target detection method and device
CN114862844B (en) * 2022-06-13 2023-08-08 合肥工业大学 Infrared small target detection method based on feature fusion
CN115082869B (en) * 2022-07-07 2023-09-15 燕山大学 Vehicle-road cooperative multi-target detection method and system for serving special vehicle
CN115331384B (en) * 2022-08-22 2023-06-30 重庆科技学院 Fire accident early warning system of operation platform based on edge calculation
CN115424154A (en) * 2022-11-01 2022-12-02 速度时空信息科技股份有限公司 Data enhancement and training method for unmanned aerial vehicle image target detection
CN116205967A (en) * 2023-04-27 2023-06-02 中国科学院长春光学精密机械与物理研究所 Medical image semantic segmentation method, device, equipment and medium
CN116363138B (en) * 2023-06-01 2023-08-22 湖南大学 Lightweight integrated identification method for garbage sorting images
CN116883980A (en) * 2023-09-04 2023-10-13 国网湖北省电力有限公司超高压公司 Ultraviolet light insulator target detection method and system
CN116894983B (en) * 2023-09-05 2023-11-21 云南瀚哲科技有限公司 Knowledge distillation-based fine-grained agricultural pest image identification method and system
CN116912890B (en) * 2023-09-14 2023-11-24 国网江苏省电力有限公司常州供电分公司 Method and device for detecting birds in transformer substation
CN117496509B (en) * 2023-12-25 2024-03-19 江西农业大学 Yolov7 grapefruit counting method integrating multi-teacher knowledge distillation

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112257794A (en) * 2020-10-27 2021-01-22 东南大学 YOLO-based lightweight target detection method
CN112329721A (en) * 2020-11-26 2021-02-05 上海电力大学 Remote sensing small target detection method with lightweight model design

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110598731B (en) * 2019-07-31 2021-08-20 浙江大学 Efficient image classification method based on structured pruning

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112257794A (en) * 2020-10-27 2021-01-22 东南大学 YOLO-based lightweight target detection method
CN112329721A (en) * 2020-11-26 2021-02-05 上海电力大学 Remote sensing small target detection method with lightweight model design

Also Published As

Publication number Publication date
CN113065558A (en) 2021-07-02

Similar Documents

Publication Publication Date Title
CN113065558B (en) Lightweight small target detection method combined with attention mechanism
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
CN112150493B (en) Semantic guidance-based screen area detection method in natural scene
CN110879982B (en) Crowd counting system and method
CN110163041A (en) Video pedestrian recognition methods, device and storage medium again
CN113591968A (en) Infrared weak and small target detection method based on asymmetric attention feature fusion
Chen et al. Remote sensing image quality evaluation based on deep support value learning networks
CN110781736A (en) Pedestrian re-identification method combining posture and attention based on double-current network
CN114972208B (en) YOLOv 4-based lightweight wheat scab detection method
CN111696136B (en) Target tracking method based on coding and decoding structure
CN113326735B (en) YOLOv 5-based multi-mode small target detection method
CN110222718A (en) The method and device of image procossing
CN114882222A (en) Improved YOLOv5 target detection model construction method and tea tender shoot identification and picking point positioning method
CN113505634A (en) Double-flow decoding cross-task interaction network optical remote sensing image salient target detection method
CN116681636A (en) Light infrared and visible light image fusion method based on convolutional neural network
CN113610905A (en) Deep learning remote sensing image registration method based on subimage matching and application
CN116071676A (en) Infrared small target detection method based on attention-directed pyramid fusion
CN112508863B (en) Target detection method based on RGB image and MSR image double channels
CN107358625B (en) SAR image change detection method based on SPP Net and region-of-interest detection
CN117392496A (en) Target detection method and system based on infrared and visible light image fusion
CN111127355A (en) Method for finely complementing defective light flow graph and application thereof
CN116189160A (en) Infrared dim target detection method based on local contrast mechanism
CN116681742A (en) Visible light and infrared thermal imaging image registration method based on graph neural network
CN115861810A (en) Remote sensing image change detection method and system based on multi-head attention and self-supervision learning
Zhao et al. Deep learning-based laser and infrared composite imaging for armor target identification and segmentation in complex battlefield environments

Legal Events

Date Code Title Description
PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant