CN113065558B - Lightweight small target detection method combined with attention mechanism - Google Patents

Lightweight small target detection method combined with attention mechanism Download PDF

Info

Publication number
CN113065558B
CN113065558B (application CN202110432768.6A)
Authority
CN
China
Prior art keywords
network
module
feature
mse
attention mechanism
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110432768.6A
Other languages
Chinese (zh)
Other versions
CN113065558A (en)
Inventor
朱威
王立凯
靳作宝
何德峰
郑雅羽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202110432768.6A
Publication of CN113065558A
Application granted
Publication of CN113065558B
Legal status: Active

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 Target detection
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a lightweight small target detection method combining an attention mechanism, comprising the following steps: (1) building a small target detection network based on YOLOv4: constructing an MSE multi-scale attention module and inserting it into the feature extraction network, adding a shallow feature map as a prediction layer, and improving the SPP module to enhance feature extraction capability; (2) constructing a small target data set, enhancing the training set with a data enhancement strategy, and customizing the anchor boxes; (3) performing channel pruning on the model and recovering model accuracy through knowledge distillation; (4) inputting an unmanned aerial vehicle aerial image and obtaining target classification and localization results. By exploiting a channel attention mechanism and a model compression strategy, the invention effectively reduces false detections of small targets while preserving the real-time performance of the model.

Description

Lightweight small target detection method combined with attention mechanism
Technical Field
The invention belongs to the application of a deep learning technology in the field of machine vision, and particularly relates to a lightweight small target detection method combined with an attention mechanism.
Background
Target detection finds a specific target class and its accurate position in a given image. Small target detection is an important research topic in this field, with significant application value in remote sensing target recognition, infrared imaging target recognition, agricultural pest and disease recognition, and other scenarios. In object detection, an object occupying 0.12% or less of the total image pixels, or smaller than 32×32 pixels, is generally referred to as a small object. Because small objects have low resolution and high noise, the features extracted after multiple convolution layers are often weak, which makes detecting small objects in an image very difficult.
Early small target detection mainly obtained target feature information through hand-crafted methods. Wen Peizhi et al. applied the wavelet transform to small target detection (see Wen Peizhi, Shi Zelin, Yu Hai, Wu Xiaojun. Sea-surface background infrared small target detection method based on wavelet transform [J]. Opto-Electronic Engineering, 2004), using the multi-resolution analysis of orthogonal wavelet decomposition to select frequency bands and suppress noise and background interference, fusing edges in different directions to obtain candidate points, and finally eliminating interfering targets with a gray-level threshold. Chen et al. (see C. L. P. Chen, H. Li, Y. Wei, et al. A Local Contrast Method for Small Infrared Target Detection [J]. IEEE Transactions on Geoscience and Remote Sensing, 2014, 52(1): 574-581), inspired by biological vision mechanisms, compute a local contrast map of the input image with a proposed local contrast measure that represents the difference between the current location and its neighborhood, achieving target signal enhancement and background clutter suppression simultaneously, and finally segment the target with an adaptive threshold. These methods start from low-level image features and accomplish detection with basic image cues; they are simple to operate, but suffer from missed detections, false detections, and poor real-time performance when detecting small targets against complex backgrounds.
In recent years, with the growth of computing power and the rapid development of deep learning theory, deep learning techniques have been widely applied to target detection. Currently popular object detection models can be broadly divided into two categories: one-stage detection algorithms, which treat classification and localization as a regression task, with SSD and YOLO as typical examples; and two-stage detection algorithms, which separate candidate box generation from target classification, represented by R-CNN and Faster R-CNN. Because one-stage algorithms cast the whole detection task as a single regression operation, they hold a great advantage in real-time performance.
The main ways deep learning techniques improve small target detection are multi-scale representation, context information, super-resolution, and so on. Patent application CN202010537199.7 discloses a detection method for small targets in pictures: six feature maps of different sizes are obtained from the picture to be detected, and bilinear interpolation fuses the pyramid's bottom-level feature map with its high-level feature maps, yielding six new feature maps of different sizes that participate in prediction. This method uses multi-scale feature maps to enhance target feature information, but it is easily disturbed by complex backgrounds and has a high false detection rate. Patent application CN202010444356.X discloses a resolution-enhancement-based method for detecting small targets in remote sensing images, which applies super-resolution processing to remote sensing images containing small targets before detection. This addresses the scarcity of usable small-target features and the geometric deformation of small target regions in remote sensing images: super-resolution processing refines the small targets' detail features, and a region-based deformable convolution network makes full use of the limited feature information, improving small-target detection in remote sensing images. Although this method achieves good accuracy, the increased image resolution reduces the network's real-time performance and works against a lightweight network.
Disclosure of Invention
In order to solve the problems of high false detection rates, missed detections, and poor real-time performance that existing target detection methods exhibit on small targets, the invention provides a lightweight small target detection method combined with an attention mechanism, comprising the following steps:
(1) Construction of a YOLOv4-based improved small target detection network
The small target detection network is improved on the basis of the one-stage target detection network YOLOv4; the specific network structure improvements comprise the following three aspects:
(1-1) building MSE multiscale attention mechanism Module, inserting into feature extraction network
The MSE multi-scale attention module constructed by the invention is an improvement of the SE attention module. The SE attention module, proposed by Hu et al. in 2017, is a lightweight attention mechanism for the computer vision field that can be conveniently inserted between two layers of a feature extraction network; by learning global information it selects and emphasizes feature channels of interest and suppresses irrelevant interference.
An MSE multi-scale attention module is constructed and inserted between the Concat layer and the CBM module in each CSP module of the YOLOv4 feature extraction network CSPDarknet53, forming a new MSE-CSPUnit module and yielding the attention-equipped feature extraction network MSE-CSPDarknet53. The MSE multi-scale attention module is constructed in the following specific steps:
(1-1-1) First, the output of the CSP module's Concat layer is taken as the input feature map, feature maps of multiple scales are integrated through convolution kernels of different sizes, and subsequent feature extraction operates on the resulting multi-scale feature map. The convolution kernel sizes are 3×3, 5×5, and 7×7; to counter the parameter growth caused by large kernels, a stack of two 3×3 convolutions replaces the 5×5 kernel and a stack of three 3×3 convolutions replaces the 7×7 kernel. Let the input feature map be X ∈ R^(C×H×W), where C, H, and W are the number of input channels, the input height, and the input width. Feature extraction with the different kernel sizes then proceeds as:

X_c = V_3×3(X) + V_5×5(X) + V_7×7(X)

where X_c is the multi-scale feature map output and V denotes a convolution operation with the indicated kernel size.
(1-1-2) A squeeze operation is applied to X_c: global average pooling and global max pooling each squeeze the channels into channel-level feature information, where global average pooling focuses on the global features of the feature map and global max pooling focuses on its local features:

X_avg = (1/(H×W)) Σ_{i=1..H} Σ_{j=1..W} X_c(i, j)

X_max = max_{i,j} X_c(i, j)

where X_c is the input multi-scale feature map, X_avg is the feature acquired after global average pooling, X_max is the feature acquired after global max pooling, and i = 1, 2, …, H, j = 1, 2, …, W, with H and W the input height and width, respectively.
(1-1-3) Excitation operations are applied to X_avg and X_max separately, and the channel attention weights X_s are generated by addition and normalization. During excitation, the Mish activation function preserves more of the nonlinear relationships between channels. FC_1 and FC_2 are two different fully connected layers with FC_1 ∈ R^((C/r)×C) and FC_2 ∈ R^(C×(C/r)), where C is the number of input channels and r is the dimension-reduction ratio: FC_1 reduces the dimension to cut the fully connected layer's parameters, and FC_2 restores the original dimension. The excitation and normalization operations are:

X_a = FC_2(Mish(FC_1(X_avg)))

X_m = FC_2(Mish(FC_1(X_max)))

X_s = Softmax(X_a + X_m)

where Mish is a nonlinear activation function and Softmax is a normalization function.
(1-1-4) The channel attention weights generated in (1-1-3) are used to weight the multi-scale feature map generated in (1-1-1), giving the output X_weight of the MSE multi-scale attention module, which serves as the input of the CBM module in the MSE-CSPUnit module:

X_weight = Scale(X_c, X_s)

where Scale denotes channel-wise multiplication of X_c by the weights X_s.
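For concreteness, the following is a minimal PyTorch sketch of the MSE multi-scale attention module as described in steps (1-1-1) through (1-1-4). The class and variable names are illustrative, and two details the text leaves open are assumed here: the branch convolutions preserve the channel count, and the two excitation branches share the FC_1/FC_2 weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mish(t: torch.Tensor) -> torch.Tensor:
    # Mish activation: x * tanh(softplus(x)); written out explicitly because
    # PyTorch 1.6 (the embodiment's framework) has no built-in mish.
    return t * torch.tanh(F.softplus(t))

class MSEAttention(nn.Module):
    """Minimal sketch of the MSE multi-scale attention module (steps 1-1-1
    to 1-1-4). Assumptions not fixed by the text: branch convolutions keep
    the channel count, and both excitation branches share FC_1/FC_2."""

    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        def c3():  # one channel-preserving 3x3 convolution
            return nn.Conv2d(channels, channels, 3, padding=1)
        self.conv3 = c3()                                  # 3x3 branch
        self.conv5 = nn.Sequential(c3(), c3())             # two 3x3 stand in for 5x5
        self.conv7 = nn.Sequential(c3(), c3(), c3())       # three 3x3 stand in for 7x7
        self.fc1 = nn.Linear(channels, channels // r)      # FC_1: C -> C/r
        self.fc2 = nn.Linear(channels // r, channels)      # FC_2: C/r -> C

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (1-1-1) multi-scale fusion: X_c = V_3x3(X) + V_5x5(X) + V_7x7(X)
        xc = self.conv3(x) + self.conv5(x) + self.conv7(x)
        b, c, _, _ = xc.shape
        # (1-1-2) squeeze: global average and global max pooling
        x_avg = F.adaptive_avg_pool2d(xc, 1).view(b, c)
        x_max = F.adaptive_max_pool2d(xc, 1).view(b, c)
        # (1-1-3) excitation with Mish, then Softmax over the channel axis
        x_a = self.fc2(mish(self.fc1(x_avg)))
        x_m = self.fc2(mish(self.fc1(x_max)))
        xs = torch.softmax(x_a + x_m, dim=1)
        # (1-1-4) Scale(): channel-wise reweighting of the fused feature map
        return xc * xs.view(b, c, 1, 1)
```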
(1-2) adding shallow feature maps as prediction layers
Deep features carry stronger semantic information and are better suited to localization, while shallow features retain rich resolution information and are more helpful for detecting small targets. The 19×19 feature map output by the FPN and PAN structures is deleted, while their original 38×38 and 76×76 output feature maps are kept; the FPN and PAN structures then fuse the MSE-CSPUnit output with the upsampled deeper feature map to obtain a shallow feature map of size 152×152. In the end, three feature maps of different sizes, 38×38, 76×76, and 152×152, are obtained to predict targets at different scales.
Here MSE-CSPUnit×2 denotes two stacked MSE-CSPUnit modules.
(1-3) SPP Module improvements
The SPP module enriches the expressive power of the feature map and supplies important context information. To improve small target detection performance, SPP modules are placed in front of the 38×38, 76×76, and 152×152 feature maps respectively, realizing an effective fusion of local and global features. The SPP module applies 1×1, 5×5, 9×9, and 13×13 max-pooling operations to the input feature map and then tensor-concatenates the resulting feature maps of different scales.
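A minimal sketch of such an SPP block is shown below, assuming (as in the standard YOLOv4 SPP) stride-1 pooling with padding so that the pooled maps keep the input's spatial size and can be concatenated along the channel dimension; the 1×1 pooling case reduces to the identity.

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """SPP block sketch: parallel max pooling at several kernel sizes plus
    the identity (the 1x1 case), concatenated along the channel axis."""

    def __init__(self, kernel_sizes=(5, 9, 13)):
        super().__init__()
        # stride-1 pooling with padding k//2 preserves the spatial size
        self.pools = nn.ModuleList(
            nn.MaxPool2d(k, stride=1, padding=k // 2) for k in kernel_sizes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.cat([x] + [p(x) for p in self.pools], dim=1)
```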
(2) Training and optimizing small target detection networks
For the specific application scenario, a small target detection data set is constructed, and data enhancement applies multiple random adjustments to the picture data: the number of small targets in the data and the brightness, contrast, and saturation of the pictures are randomly adjusted to strengthen the model's generalization performance.
Finally, anchor boxes are set to fit the targets in the data set: the anchor boxes are re-clustered on the target data set with the K-means++ algorithm, yielding anchor parameters better suited to the current data set and accelerating network convergence.
(3) Model light-weight for small target detection network
(3-1) channel pruning
Channel pruning is applied to the small target detection network to address its parameter redundancy. The scale factor γ of each convolution module's BN layer in YOLOv4 is used: an L1 regularization term on the BN-layer γ values is added to the loss function, the network is sparsity-trained for a number of epochs, the γ values are sorted after the gradient updates, and a pruning threshold is set so that channels whose γ falls below it are removed, yielding the pruned lightweight YOLOv4 network. In the YOLOv4 network, channel pruning is applied to all convolution modules containing BN layers except the convolution layers before the upsampling layers and the SPP structures, producing a pruned model file and a model structure configuration file. The objective loss function established for YOLOv4 sparsity training is:

L = Σ_(x,y) l(f(x, W), y) + λ Σ_γ g(γ)

where x is the model input, y is the desired output, W are the trainable parameters of the network, l(·) is the detection loss, g(·) is the penalty term on the scaling factors (here g(γ) = |γ|), and λ is the balance factor.
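The sparsity penalty can be sketched as follows; the λ value and the helper name are illustrative, and selecting which BN layers participate (the text exempts the convolutions before upsampling layers and the SPP structures) is left to the caller.

```python
import torch

def bn_sparsity_penalty(model: torch.nn.Module, lam: float = 1e-4) -> torch.Tensor:
    """L1 penalty on BatchNorm scale factors: lambda * sum(|gamma|).
    Added to the detection loss during sparsity training; the lambda
    value here is illustrative."""
    penalty = torch.zeros((), device=next(model.parameters()).device)
    for m in model.modules():
        if isinstance(m, torch.nn.BatchNorm2d):
            penalty = penalty + m.weight.abs().sum()  # g(gamma) = |gamma|
    return lam * penalty

# Usage sketch: total_loss = detection_loss + bn_sparsity_penalty(model)
```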
(3-2) knowledge distillation recovery model accuracy
Although the removed channels contribute little to the model output, the model's accuracy still drops somewhat after channel pruning, so it must be restored.
The unpruned YOLOv4 network serves as the teacher network and the channel-pruned network as the student network for knowledge distillation. Knowledge distillation for YOLOv4 covers learning of both the classification task and the regression task. For distillation of the regression outputs, because the regression output is unbounded, the teacher network's predictions may deviate from the label values in the wrong direction, so the student does not learn from the teacher directly when computing the regression loss. First, the L2 losses between the teacher network and the label values and between the student network and the label values are computed, and a margin w is set; when the deviation between the student network's L2 loss to the labels and the teacher network's L2 loss to the labels exceeds the margin w, the student network's L2 loss is counted in the total loss. That is, when the student network outperforms the teacher network beyond a certain value, no student loss is calculated. The overall loss function is as follows:
L_reg = (1 − v)·L_sL1(R_s, y_reg) + v·L_b(R_s, R_t, y_reg)

where w is the preset margin, y_reg is the true label value, R_t and R_s are the regression outputs of the teacher and the student, L_b is the distillation part of the loss, L_sL1 is the loss between the student network and the true label, and v is the balance factor between L_b and L_sL1, set between 0.1 and 0.5 during the first 80% of network training and between 0.6 and 0.9 during the final 20%; L_reg is the total loss during distillation learning.
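A sketch of this teacher-bounded regression loss follows. The gating direction, that the student's L2 term is zeroed only once the student beats the teacher by more than the margin w, is one reading of the description above; the default w = 0.3 and the v value follow the embodiment, and the function and argument names are illustrative.

```python
import torch
import torch.nn.functional as F

def bounded_regression_loss(r_s: torch.Tensor, r_t: torch.Tensor,
                            y: torch.Tensor, w: float = 0.3,
                            v: float = 0.3) -> torch.Tensor:
    """Sketch of L_reg = (1 - v) * L_sL1(R_s, y) + v * L_b(R_s, R_t, y)."""
    l2_s = F.mse_loss(r_s, y)            # student vs. label (L2)
    l2_t = F.mse_loss(r_t, y)            # teacher vs. label (L2)
    # L_b: charge the student's L2 loss unless the student already beats
    # the teacher by more than the margin w (our reading of the text).
    l_b = l2_s if (l2_t - l2_s) <= w else r_s.new_zeros(())
    l_sl1 = F.smooth_l1_loss(r_s, y)     # student vs. true label
    return (1.0 - v) * l_sl1 + v * l_b
```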
(4) Detection of input images using trained small target detection network models
A frame of unmanned aerial vehicle aerial imagery is input and fed into the trained and optimized small target detection network for target localization and classification. The network first passes the image through the attention-equipped feature extraction network, and 3 feature maps of different resolutions are output via the SPP modules. Targets at three different scales are detected on these 3 feature maps with regression and classification, and target classification and localization results are obtained after confidence-threshold filtering; this is repeated until all pictures in the test set have been processed.
Compared with the prior art, the invention has the following beneficial effects:
The invention improves the end-to-end convolutional neural network YOLOv4 into a lightweight small target detection network. Compared with traditional small target detection methods, an MSE attention module is designed on the basis of SE and inserted into the YOLOv4 feature extraction network, strengthening the network's attention to regions of interest and reducing the interference of complex backgrounds during small target detection. A shallow feature map is then added as a prediction layer, so that three feature maps of sizes 38×38, 76×76, and 152×152 predict targets at different scales. The SPP modules are improved and placed in front of the 38×38, 76×76, and 152×152 feature maps respectively, realizing effective fusion of local and global features. Finally, the model is compressed and optimized with channel pruning and knowledge distillation, greatly reducing the number of model parameters at little cost in accuracy. In addition, data enhancement randomly adjusts the number of small targets in the data set and the brightness, contrast, and saturation of the pictures, strengthening the training of the model. On a small target data set the network shows good detection performance and robustness, and it meets the requirements of lightweight model deployment.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a MSE-CSPUnit module after adding an MSE multiscale attention mechanism module;
FIG. 3 is a MSE multi-scale attention module structure of the present invention;
FIG. 4 is a small target detection network architecture designed in accordance with the present invention;
FIG. 5 is a comparison of the number of channels after model compression, wherein dark bars are before pruning and light bars are after pruning;
fig. 6 is a diagram of the detection effect of the small target detection network on the target picture according to the present invention, wherein (a) and (c) are detection effects before improvement, and (b) and (d) are detection effects after improvement corresponding to (a) and (c).
Detailed Description
The present invention will be described in detail with reference to the examples and drawings, but it is not limited thereto. The detection targets in this embodiment are the various small objects in the data set; the processing platform combines an Intel i9-9900K CPU, an NVIDIA RTX 2080 Ti GPU, and 32 GB of RAM, with Ubuntu 18.04 (64-bit Linux) as the operating system. The method is implemented on the deep learning framework PyTorch 1.6.
As shown in fig. 1, the lightweight small target detection method combined with an attention mechanism comprises four parts:
(1) Constructing a small target detection network based on YOLOv4 improvement;
(2) Training and optimizing the small target detection network;
(3) Performing model weight reduction on the small target detection network;
(4) And detecting the input image by using the trained small target detection network model.
The first part, building a small target detection network based on the YOLOv4 improvement, specifically comprises the following steps:
(1-1) Designing the MSE multi-scale attention module and embedding it in the feature extraction network
An MSE multi-scale attention module is constructed and inserted between the Concat layer and the CBM module in each CSP module of the YOLOv4 feature extraction network CSPDarknet53, forming a new MSE-CSPUnit module and yielding the attention-equipped feature extraction network MSE-CSPDarknet53; as shown in fig. 2, all modules other than the MSE are conventional structural modules of CSPDarknet53. The MSE multi-scale attention module is constructed as follows:
First, the output of the CSP module's Concat layer is taken as the input feature map, feature maps of multiple scales are integrated through convolution kernels of different sizes, and subsequent feature extraction operates on the resulting multi-scale feature map; the convolution kernel sizes are 3×3, 5×5, and 7×7. To counter the parameter growth caused by large kernels, a stack of two 3×3 convolutions replaces the 5×5 kernel and a stack of three 3×3 convolutions replaces the 7×7 kernel. Let the input feature map be X ∈ R^(C×H×W), where C, H, and W are the number of input channels, the input height, and the input width. Feature extraction with the different kernel sizes then proceeds as:

X_c = V_3×3(X) + V_5×5(X) + V_7×7(X)

where X_c is the multi-scale fused feature output and V denotes a convolution operation with the indicated kernel size.
A squeeze operation is applied to X_c: given that small targets carry little feature information, a global max pooling operation focuses on the local information of the feature map while a global average pooling operation focuses on its global features. The pooling operations are:

X_avg = (1/(H×W)) Σ_{i=1..H} Σ_{j=1..W} X_c(i, j)

X_max = max_{i,j} X_c(i, j)

where X_avg is the feature acquired after global average pooling, X_max is the feature acquired after global max pooling, and i = 1, 2, …, H, j = 1, 2, …, W, with H and W the input height and width, respectively.
Excitation operations are applied to X_avg and X_max separately, and the results are added and normalized to generate the attention weights X_s. The Mish activation function preserves more of the nonlinear relationships between channels during excitation. FC_1 and FC_2 are two different fully connected layers, where C is the number of input channels and r is the dimension-reduction ratio: FC_1 reduces the dimension to cut the fully connected layer's parameters, and FC_2 restores the original dimension. The excitation and normalization operations are:

X_a = FC_2(Mish(FC_1(X_avg)))

X_m = FC_2(Mish(FC_1(X_max)))

X_s = Softmax(X_a + X_m)
where Mish is a nonlinear activation function and Softmax is a normalization function.
X_s is weighted against the multi-scale feature map X_c generated in the first step, giving the output X_weight of the MSE multi-scale attention module, which serves as the input of the CBM module in the MSE-CSPUnit module:

X_weight = Scale(X_c, X_s)
(1-2) Adding a shallow feature map as a prediction layer
Deep features carry stronger semantic information and are better suited to localization, while shallow features retain rich resolution information and are more helpful for detecting small targets. The 19×19 feature map output by the FPN and PAN structures is deleted, while their original 38×38 and 76×76 output feature maps are kept; the FPN and PAN structures then fuse the MSE-CSPUnit output with the upsampled deeper feature map to obtain a shallow feature map of size 152×152. In the end, three feature maps of different sizes, 38×38, 76×76, and 152×152, are obtained to predict targets at different scales.
(1-3) SPP Module improvements
The SPP module enriches the expressive power of the feature map and supplies important context information. To improve small target detection performance, SPP modules are placed in front of the 38×38, 76×76, and 152×152 feature maps respectively, realizing an effective fusion of local and global features. The SPP module applies 1×1, 5×5, 9×9, and 13×13 max-pooling operations to the input feature map and then tensor-concatenates the resulting feature maps of different scales.
The second part of training and optimizing the small target detection network specifically comprises the following steps:
(2-1) construction of data sets
First, a small target data set is constructed; the unmanned aerial vehicle aerial photography data set VisDrone2019 is selected for the experiments. Because VisDrone2019 is captured from drones, it contains a large number of small and densely packed objects, and illumination changes and object occlusion add to the data set's difficulty. Moreover, since the drone images are shot from directly overhead, the objects to be detected expose fewer features: for pedestrian detection, for example, a ground-level image contains features such as arms and legs, while a drone image may contain only overhead features.
(2-2) data enhancement and Multi-modal random adjustment of Picture data
During network training, online data enhancement is applied to the data set to improve training on small targets. Since the data set may contain few pictures of small objects, the model can become biased toward medium and large objects during training. Online enhancement therefore copies several small targets within a picture, manually increasing the number of small objects so that anchors are more likely to contain them and the model sees more small target training samples; a simplified sketch of this step follows. The pictures are also randomly rotated and scaled, and their brightness, contrast, and saturation adjusted, to improve the model's robustness.
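As an illustration, the copy-based augmentation might look like the following; the box format, patch-size threshold, and copy count are assumptions, and a production version would also check that pasted patches do not overlap existing objects.

```python
import random
import numpy as np

def copy_paste_small_objects(img: np.ndarray, boxes, max_copies: int = 3,
                             small_thresh: int = 32):
    """Duplicate small-object patches into the image (simplified sketch).

    `boxes` is assumed to hold integer (x1, y1, x2, y2) corners; overlap
    checks against existing objects are omitted for brevity.
    """
    h, w = img.shape[:2]
    new_boxes = list(boxes)
    small = [b for b in boxes
             if (b[2] - b[0]) < small_thresh and (b[3] - b[1]) < small_thresh]
    for x1, y1, x2, y2 in small:
        bw, bh = x2 - x1, y2 - y1
        for _ in range(random.randint(1, max_copies)):
            nx = random.randint(0, w - bw - 1)   # random paste location
            ny = random.randint(0, h - bh - 1)
            img[ny:ny + bh, nx:nx + bw] = img[y1:y2, x1:x2]
            new_boxes.append((nx, ny, nx + bw, ny + bh))
    return img, new_boxes
```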
(2-3) custom anchor boxes for fitting to targets in a dataset
For detecting targets across extreme scales, suitable anchor boxes fit the objects in the data set more accurately. For the drone aerial photography data set, the anchor boxes are re-clustered with the K-means++ algorithm to obtain anchor parameters better suited to the current data set. The anchor parameters obtained by K-means++ are (1,4), (2,8), (4,13), (4,5), (8,20), (9,9), (16,29), (16,15), (35,42).
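A sketch of this re-clustering using scikit-learn's k-means++ initialization is shown below; note that YOLO-style implementations often cluster under a 1−IoU distance rather than the Euclidean distance used here, so this is only an approximation of the procedure.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_anchors(wh: np.ndarray, k: int = 9) -> np.ndarray:
    """Re-cluster anchor boxes with k-means++ initialization (sketch).

    `wh` is an (N, 2) array of ground-truth box widths and heights,
    e.g. extracted from the VisDrone2019 annotations.
    """
    km = KMeans(n_clusters=k, init="k-means++", n_init=10).fit(wh)
    centers = km.cluster_centers_
    anchors = centers[np.argsort(centers.prod(axis=1))]  # sort by box area
    return np.round(anchors).astype(int)
```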
The third part of small target detection network model light weight specifically comprises:
(3-1) channel pruning
Channel pruning is applied to the small target detection network to address its parameter redundancy. The scale factor γ of each convolution module's BN layer in YOLOv4 is used: an L1 regularization term on the BN-layer γ values is added to the loss function, the network is sparsity-trained for a preset number of epochs (e.g. 300), the γ values are sorted after the gradient updates, and a pruning threshold is set so that channels whose γ falls below it are removed, yielding the pruned lightweight YOLOv4 network. In the YOLOv4 network, channel pruning is applied to all convolution modules containing BN layers except the convolution layers before the upsampling layers and the SPP structures. The channel pruning ratio is chosen through repeated experiments to strike a good balance between speed and accuracy; a ratio of 0.7 is finally selected, producing the pruned model file and model structure configuration file.
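The channel-selection step after sparsity training can be sketched as follows; the helper name is illustrative, the 0.7 ratio follows the embodiment, and excluding the exempted layers (the convolutions before upsampling layers and the SPP structures) is again left to the caller.

```python
import torch

def prune_masks(model: torch.nn.Module, prune_ratio: float = 0.7):
    """Build per-layer channel keep-masks from BN gammas (sketch).

    Gathers all BN gammas, places the global threshold at `prune_ratio`,
    and keeps channels whose |gamma| exceeds it.
    """
    gammas = torch.cat([m.weight.detach().abs().flatten()
                        for m in model.modules()
                        if isinstance(m, torch.nn.BatchNorm2d)])
    threshold = torch.sort(gammas).values[int(len(gammas) * prune_ratio)]
    masks = {name: m.weight.detach().abs() > threshold
             for name, m in model.named_modules()
             if isinstance(m, torch.nn.BatchNorm2d)}
    return threshold, masks
```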
(3-2) knowledge distillation recovery model accuracy
Although the removed channels contribute little to the model output, the model's accuracy still drops somewhat after channel pruning, so it must be restored.
The unpruned YOLOv4 network serves as the teacher network and the channel-pruned network as the student network for knowledge distillation. Knowledge distillation for YOLOv4 covers learning of both the classification task and the regression task. For distillation of the regression outputs, because the regression output is unbounded, the teacher network's predictions may deviate from the true values in the wrong direction, so the student does not learn from the teacher directly when computing the regression loss. First, the L2 distances between the teacher network and the label values and between the student network and the label values are computed; repeated experimental comparison sets the margin to w = 0.3, and the student network's L2 loss is counted in the total loss only when the deviation between the student's and the teacher's L2 distances to the labels exceeds the margin w. That is, when the student network outperforms the teacher network beyond a certain value, no student loss is calculated. The overall loss function is as follows:
L_reg = (1 − v)·L_sL1(R_s, y_reg) + v·L_b(R_s, R_t, y_reg)

where w is the preset margin, y_reg is the true label value, R_t and R_s are the regression outputs of the teacher and the student, L_b is the distillation part of the loss, L_sL1 is the loss between the student network and the true label, and v is the balance factor between L_b and L_sL1, set between 0.1 and 0.5 during the first 80% of network training and between 0.6 and 0.9 during the final 20%; L_reg is the total loss during distillation learning.
The fourth part of detecting the small target of the picture specifically comprises:
(4-1) inputting an aerial image of the unmanned aerial vehicle
(4-2) After the unmanned aerial vehicle aerial image is read, it is fed into the trained and optimized small target detection network for target localization and classification. The network first passes the image through the attention-equipped feature extraction network, and 3 feature maps of different resolutions are output via the SPP modules. Targets at three different scales are detected with regression and classification; the confidence threshold lies between 0.2 and 0.6 and is generally set to 0.3, and target classification and localization results are obtained after threshold filtering.
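For illustration, the confidence filtering amounts to the following; the detection tuple layout is an assumption, and non-maximum suppression, which YOLOv4 pipelines also apply, is omitted.

```python
def filter_detections(dets, conf_thresh: float = 0.3):
    """Keep detections whose confidence reaches the threshold (sketch).
    Each detection is assumed to be (class_id, confidence, x1, y1, x2, y2)."""
    return [d for d in dets if d[1] >= conf_thresh]
```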
(4-3) repeating the steps (4-1) to (4-2) until the detection of the pictures in the test set is completed, wherein the detection effect of various small targets is shown in fig. 6.

Claims (7)

1. A lightweight small target detection method combined with an attention mechanism, characterized by comprising the following steps:
(1) Constructing a small target detection network based on YOLOv4 improvement, which comprises the following steps:
(1-1) Constructing an MSE multi-scale attention module: taking the output of the Concat layer of a CSP module as the input feature X; integrating feature maps of multiple scales through convolution kernels of different sizes to obtain the multi-scale fused feature output X_c; squeezing the channels of X_c with global average pooling and global max pooling to obtain the feature X_avg acquired after global average pooling and the feature X_max acquired after global max pooling; exciting X_avg and X_max separately, then adding and normalizing to generate the attention weights X_s; weighting the generated X_s with the generated X_c to obtain the output X_weight of the MSE multi-scale attention module, X_weight = Scale(X_c, X_s); and inserting the module into the feature extraction network;
(1-2) Adding a shallow feature map as a prediction layer: deleting the 19×19 feature map output by the FPN and PAN structures and retaining their original 38×38 and 76×76 output feature maps; using the FPN and PAN structures to fuse the MSE-CSPUnit output with the upsampled deeper feature map to obtain a 152×152 shallow feature map; and finally obtaining three feature maps of different sizes, 38×38, 76×76, and 152×152, to predict targets at different scales;
(1-3) Improving the SPP module: placing SPP modules between the FPN and PAN structures and the corresponding three prediction layers, the SPP module applying max-pooling operations to the input feature map and then tensor-concatenating the generated feature maps of different scales;
(2) Training and optimizing a small target detection network;
(3) Performing model weight reduction on the small target detection network;
(4) And detecting the input image by using the trained small target detection network model.
2. The method for detecting a lightweight small object in combination with an attention mechanism according to claim 1, wherein step (1-1) comprises: constructing an MSE multi-scale attention module and inserting it between the Concat layer and the CBM module in each CSP module of the YOLOv4 feature extraction network CSPDarknet53 to form a new MSE-CSPUnit module, thereby obtaining the attention-equipped feature extraction network MSE-CSPDarknet53.
3. A method for detecting a lightweight small object in combination with an attention mechanism according to claim 1 or 2, characterized in that: the step (1-1) constructs an MSE multi-scale attention mechanism module based on the SE attention mechanism module, and comprises the following steps:
(1-1-1) taking the output of the Concat layer of the CSP module as the input feature X and integrating feature maps of multiple scales through convolution kernels of different sizes to obtain the multi-scale fused feature output X_c; the convolution kernel sizes are 3×3, 5×5, and 7×7, and X_c = V_3×3(X) + V_5×5(X) + V_7×7(X), where V denotes a convolution operation with the indicated kernel size;
(1-1-2) squeezing the channels of X_c with global average pooling and global max pooling to obtain channel-level feature information, where global average pooling focuses on global features and global max pooling focuses on local features:

X_avg = (1/(H×W)) Σ_{i=1..H} Σ_{j=1..W} X_c(i, j)

X_max = max_{i,j} X_c(i, j)

where X_avg is the feature acquired after global average pooling, X_max is the feature acquired after global max pooling, and i = 1, 2, …, H, j = 1, 2, …, W, with H and W the input height and width, respectively;
(1-1-3) exciting X_avg and X_max separately, then adding and normalizing to generate the attention weights X_s; FC_1 and FC_2 are two different fully connected layers with FC_1 ∈ R^((C/r)×C) and FC_2 ∈ R^(C×(C/r)), where C is the number of input channels and r is the dimension-reduction ratio; FC_1 reduces the dimension to cut the fully connected layer's parameters and FC_2 restores the original dimension;
X_a = FC_2(Mish(FC_1(X_avg)))

X_m = FC_2(Mish(FC_1(X_max)))

X_s = Softmax(X_a + X_m)

wherein Mish is a nonlinear activation function and Softmax is a normalization function;
(1-1-4) weighting the X_s generated in (1-1-3) with the X_c generated in (1-1-1) to obtain the output X_weight of the MSE multi-scale attention module, X_weight = Scale(X_c, X_s), X_weight serving as the input of the CBM module in the MSE-CSPUnit module.
4. The method for detecting a lightweight small object in combination with an attention mechanism according to claim 1, wherein in step (1-3) the SPP module applies 1×1, 5×5, 9×9, and 13×13 max-pooling operations to the input feature map and then tensor-concatenates the generated feature maps of different scales.
5. The method for detecting a lightweight small object in combination with an attention mechanism according to claim 1, wherein: the step (2) comprises the following steps:
(2-1) constructing a small target dataset;
(2-2) data enhancement, and performing multi-mode random adjustment on the picture data;
(2-3) setting an anchor frame for fitting to a target in the dataset.
6. The method for detecting a lightweight small object in combination with an attention mechanism according to claim 1, wherein: the step (3) comprises the following steps:
(3-1) channel pruning
Selecting the γ of the BN layers as the scaling factor, adding an L1 regularization term on the BN-layer γ to the loss function, sparsity-training the network for a number of epochs, and, based on the γ values after the gradient updates, channel-pruning all layers except the convolution layers before the upsampling layers and the SPP modules, to obtain the pruned model file and model structure configuration file;
(3-2) knowledge distillation recovery network accuracy
Taking the unpruned YOLOv4 network as the teacher network and the channel-pruned network as the student network; computing the L2 losses between the teacher network and the label values and between the student network and the label values respectively, and setting a margin; when the deviation between the student network's L2 loss to the labels and the teacher network's L2 loss to the labels exceeds the range w, counting the student network's L2 loss in the total loss, the total loss function being

L_reg = (1 − v)·L_sL1(R_s, y_reg) + v·L_b(R_s, R_t, y_reg)

where L_reg is the total loss during distillation learning, L_b is the distillation part of the loss, L_sL1 is the loss between the student network's regression output and the label values, v is the balance factor between L_b and L_sL1, set between 0.1 and 0.5 during the first 80% of network training and between 0.6 and 0.9 during the final 20%, y_reg is the label value, R_t and R_s are the regression outputs of the teacher network and the student network respectively, and w is the preset margin.
7. The method for detecting a lightweight small object in combination with an attention mechanism according to claim 1, wherein: the step (4) comprises the following steps: (4-1) inputting a frame of image;
(4-2) after one image is read, sending it into the trained and optimized small target detection network for target localization and classification: passing the image through the attention-equipped feature extraction network to extract features, outputting 3 feature maps of different resolutions via the SPP modules, detecting targets at three different scales on the 3 feature maps, setting the confidence threshold between 0.2 and 0.6, and obtaining target classification and localization results after threshold filtering;
(4-3) repeating the steps (4-1) to (4-2) until the detection of the pictures in the test set is completed.
CN202110432768.6A 2021-04-21 2021-04-21 Lightweight small target detection method combined with attention mechanism Active CN113065558B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110432768.6A CN113065558B (en) 2021-04-21 2021-04-21 Lightweight small target detection method combined with attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110432768.6A CN113065558B (en) 2021-04-21 2021-04-21 Lightweight small target detection method combined with attention mechanism

Publications (2)

Publication Number Publication Date
CN113065558A CN113065558A (en) 2021-07-02
CN113065558B 2024-03-22

Family

ID=76567333

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110432768.6A Active CN113065558B (en) 2021-04-21 2021-04-21 Lightweight small target detection method combined with attention mechanism

Country Status (1)

Country Link
CN (1) CN113065558B (en)

Families Citing this family (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109002848B (en) * 2018-07-05 2021-11-05 西华大学 Weak and small target detection method based on feature mapping neural network
CN113642402A (en) * 2021-07-13 2021-11-12 重庆科技学院 Image target detection method based on deep learning
CN113408549B (en) * 2021-07-14 2023-01-24 西安电子科技大学 Few-sample weak and small target detection method based on template matching and attention mechanism
CN113486990B (en) * 2021-09-06 2021-12-21 北京字节跳动网络技术有限公司 Training method of endoscope image classification model, image classification method and device
CN113743514A (en) * 2021-09-08 2021-12-03 庆阳瑞华能源有限公司 Knowledge distillation-based target detection method and target detection terminal
CN113780406A (en) * 2021-09-08 2021-12-10 福州大学 YOLO-based bundled log end face detection method
CN113807311A (en) * 2021-09-29 2021-12-17 中国人民解放军国防科技大学 Multi-scale target identification method
CN113962882B (en) * 2021-09-29 2023-08-25 西安交通大学 JPEG image compression artifact eliminating method based on controllable pyramid wavelet network
CN113837144B (en) * 2021-10-25 2022-09-13 广州微林软件有限公司 Intelligent image data acquisition and processing method for refrigerator
CN114022705B (en) * 2021-10-29 2023-08-04 电子科技大学 Self-adaptive target detection method based on scene complexity pre-classification
CN114037888B (en) * 2021-11-05 2024-03-08 中国人民解放军国防科技大学 Target detection method and system based on joint attention and adaptive NMS
CN114067437B (en) * 2021-11-17 2024-04-16 山东大学 Method and system for detecting pipe removal based on positioning and video monitoring data
CN114120154B (en) * 2021-11-23 2022-10-28 宁波大学 Automatic detection method for breakage of glass curtain wall of high-rise building
CN114283402B (en) * 2021-11-24 2024-03-05 西北工业大学 License plate detection method based on knowledge distillation training and space-time combined attention
CN113902744B (en) * 2021-12-10 2022-03-08 湖南师范大学 Image detection method, system, equipment and storage medium based on lightweight network
CN114220032A (en) * 2021-12-21 2022-03-22 一拓通信集团股份有限公司 Unmanned aerial vehicle video small target detection method based on channel cutting
CN114092820B (en) * 2022-01-20 2022-04-22 城云科技(中国)有限公司 Target detection method and moving target tracking method applying same
CN114463686B (en) * 2022-04-11 2022-06-17 西南交通大学 Moving target detection method and system based on complex background
CN115618271B (en) * 2022-05-05 2023-11-17 腾讯科技(深圳)有限公司 Object category identification method, device, equipment and storage medium
CN114663654B (en) * 2022-05-26 2022-09-09 西安石油大学 Improved YOLOv4 network model and small target detection method
US11915474B2 (en) 2022-05-31 2024-02-27 International Business Machines Corporation Regional-to-local attention for vision transformers
CN115019169A (en) * 2022-05-31 2022-09-06 海南大学 Single-stage water surface small target detection method and device
CN114862844B (en) * 2022-06-13 2023-08-08 合肥工业大学 Infrared small target detection method based on feature fusion
CN115082869B (en) * 2022-07-07 2023-09-15 燕山大学 Vehicle-road cooperative multi-target detection method and system for serving special vehicle
CN115331384B (en) * 2022-08-22 2023-06-30 重庆科技学院 Fire accident early warning system of operation platform based on edge calculation
CN115424154A (en) * 2022-11-01 2022-12-02 速度时空信息科技股份有限公司 Data enhancement and training method for unmanned aerial vehicle image target detection
CN116205967A (en) * 2023-04-27 2023-06-02 中国科学院长春光学精密机械与物理研究所 Medical image semantic segmentation method, device, equipment and medium
CN116363138B (en) * 2023-06-01 2023-08-22 湖南大学 Lightweight integrated identification method for garbage sorting images
CN116883980A (en) * 2023-09-04 2023-10-13 国网湖北省电力有限公司超高压公司 Ultraviolet light insulator target detection method and system
CN116894983B (en) * 2023-09-05 2023-11-21 云南瀚哲科技有限公司 Knowledge distillation-based fine-grained agricultural pest image identification method and system
CN116912890B (en) * 2023-09-14 2023-11-24 国网江苏省电力有限公司常州供电分公司 Method and device for detecting birds in transformer substation
CN117496509B (en) * 2023-12-25 2024-03-19 江西农业大学 Yolov7 grapefruit counting method integrating multi-teacher knowledge distillation

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112257794A (en) * 2020-10-27 2021-01-22 东南大学 YOLO-based lightweight target detection method
CN112329721A (en) * 2020-11-26 2021-02-05 上海电力大学 Remote sensing small target detection method with lightweight model design

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110598731B (en) * 2019-07-31 2021-08-20 浙江大学 Efficient image classification method based on structured pruning

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112257794A (en) * 2020-10-27 2021-01-22 东南大学 YOLO-based lightweight target detection method
CN112329721A (en) * 2020-11-26 2021-02-05 上海电力大学 Remote sensing small target detection method with lightweight model design

Also Published As

Publication number Publication date
CN113065558A (en) 2021-07-02

Similar Documents

Publication Publication Date Title
CN113065558B (en) Lightweight small target detection method combined with attention mechanism
CN109584248B (en) Infrared target instance segmentation method based on feature fusion and dense connection network
CN112150493B (en) Semantic guidance-based screen area detection method in natural scene
CN110879982B (en) Crowd counting system and method
CN110163041A (en) Video pedestrian recognition methods, device and storage medium again
CN113591968A (en) Infrared weak and small target detection method based on asymmetric attention feature fusion
Chen et al. Remote sensing image quality evaluation based on deep support value learning networks
CN110781736A (en) Pedestrian re-identification method combining posture and attention based on double-current network
CN114972208B (en) YOLOv 4-based lightweight wheat scab detection method
CN111696136B (en) Target tracking method based on coding and decoding structure
CN113326735B (en) YOLOv 5-based multi-mode small target detection method
CN110222718A (en) The method and device of image procossing
CN114882222A (en) Improved YOLOv5 target detection model construction method and tea tender shoot identification and picking point positioning method
CN113505634A (en) Double-flow decoding cross-task interaction network optical remote sensing image salient target detection method
CN116681636A (en) Light infrared and visible light image fusion method based on convolutional neural network
CN113610905A (en) Deep learning remote sensing image registration method based on subimage matching and application
CN116071676A (en) Infrared small target detection method based on attention-directed pyramid fusion
CN112508863B (en) Target detection method based on RGB image and MSR image double channels
CN107358625B (en) SAR image change detection method based on SPP Net and region-of-interest detection
CN117392496A (en) Target detection method and system based on infrared and visible light image fusion
CN111127355A (en) Method for finely complementing defective light flow graph and application thereof
CN116189160A (en) Infrared dim target detection method based on local contrast mechanism
CN116681742A (en) Visible light and infrared thermal imaging image registration method based on graph neural network
CN115861810A (en) Remote sensing image change detection method and system based on multi-head attention and self-supervision learning
Zhao et al. Deep learning-based laser and infrared composite imaging for armor target identification and segmentation in complex battlefield environments

Legal Events

Date Code Title Description
PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant