CN111401201B - Aerial image multi-scale target detection method based on spatial pyramid attention drive


Info

Publication number
CN111401201B
Authority
CN
China
Prior art keywords
attention
feature
spatial
unit
channel
Prior art date
Legal status (assumed; not a legal conclusion)
Active
Application number
CN202010164167.7A
Other languages
Chinese (zh)
Other versions
CN111401201A (en)
Inventor
孙玉宝
辛宇
徐宏伟
陈勋豪
周旺平
Current Assignee (as listed; may be inaccurate)
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date (assumed; not a legal conclusion)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology filed Critical Nanjing University of Information Science and Technology
Priority to CN202010164167.7A
Publication of CN111401201A
Application granted
Publication of CN111401201B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • G06V20/13Satellite images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Remote Sensing (AREA)
  • Astronomy & Astrophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-scale target detection method for aerial images driven by spatial pyramid attention, which comprises the following steps: first, for the large-size images in the data set, the training data are augmented by a blocking (tiling) method; a residual network whose feature representations are enhanced by convolutional attention is designed as the backbone network to extract image features efficiently; a spatial pyramid attention module is then constructed so that the network can more accurately focus on targets of different scales and extract the regions of interest where targets are located; a target category analysis and target frame regression module is established to classify and predict the regions of interest at different scales; and in the test stage, the trained detection network is used with a multi-scale test strategy, and the detection results at different scales are fused by a global integrated non-maximum suppression algorithm, further improving detection accuracy.

Description

Aerial image multi-scale target detection method based on spatial pyramid attention drive
Technical Field
The invention belongs to the technical field of image recognition and target detection, and particularly relates to an aerial image multi-scale target detection method based on spatial pyramid attention driving.
Background
Target detection, also called target extraction, is a form of image segmentation based on the geometric and statistical characteristics of the target; it combines segmentation and recognition into one step, and its accuracy and real-time performance are important capabilities of the whole system. In complex scenes in particular, where multiple targets must be processed in real time, automatic extraction and recognition of targets is especially important. With the development of computer technology and the wide application of computer vision principles, research on real-time target tracking with computer image processing techniques is increasingly popular, and dynamic real-time tracking and positioning of targets has wide application value in intelligent traffic systems, intelligent monitoring systems, military target detection, surgical instrument positioning in medically navigated surgery, and other areas.
On one hand, many target detection methods have emerged in recent years, such as YOLO, SSD, RetinaNet and the RCNN series, where YOLO, SSD and RetinaNet are single-stage methods while the original RCNN and its successors Fast-RCNN and Faster-RCNN are two-stage methods. The RCNN-series methods first generate candidate boxes and then perform coordinate regression prediction from those candidates, whereas YOLO, SSD and RetinaNet regress coordinates directly, without a candidate-box step.
On the other hand, the visual attention mechanism is a signal processing mechanism specific to human vision: by rapidly scanning the global image, human vision locates the target region that needs attention, i.e. the focus of attention, and then acquires more information about the key characteristics of that target. Models that incorporate an attention mechanism can therefore greatly help improve target detection accuracy.
When detection speed is not the primary concern, two-stage target detection algorithms tend to be more accurate than single-stage ones, which makes them preferable in many settings, such as detecting targets in unmanned aerial vehicle imagery. Therefore, based on deep learning theory and recent attention mechanisms, this patent proposes a multi-scale target detection network driven by feature pyramid dual attention.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides an aerial image multi-scale target detection method based on the spatial pyramid attention drive.
In order to achieve the technical purpose, the invention adopts the following technical scheme:
An aerial image multi-scale target detection method based on spatial pyramid attention drive comprises the following steps:
s101: collecting an unmanned aerial vehicle image set and performing block processing to obtain a large number of cropped patches of consistent size;
s102: inputting the patches into a residual network and extracting features through a convolutional attention module inside the residual network, wherein the convolutional attention module comprises a first channel attention unit and a first spatial attention unit; a channel attention map is computed by the first channel attention unit, a spatial attention map is computed by the first spatial attention unit, and the two maps are combined to generate a first feature map;
s103: extracting features from the first feature map with a feature-pyramid-based detector, adding a dual-attention module comprising a second spatial attention unit and a second channel attention unit to each layer of the top-down part of the feature pyramid, fusing the feature maps generated by the two attention units to obtain a second feature map, and performing a region-of-interest alignment operation on the second feature map produced by the region proposal network in the last layer to fix the feature size;
s104: for the obtained region-of-interest-aligned second feature map, establishing a target category analysis and target frame regression module, and performing classification and target frame prediction on the regions of interest at different scales;
s105: performing a multi-scale image test with the original image and a 1.5× enlarged image, feeding the two scales separately into the deep network for testing, and fusing the results of the different scales through the global integrated non-maximum suppression algorithm to improve detection accuracy.
In order to optimize the technical scheme, the specific measures adopted further comprise:
the step S101 specifically includes: and carrying out sliding window type blocking on the image according to the pixel size of 1000 x 1000, adopting the overlapping rate of 0.25, reserving the coordinate information of the manual marking frame of the vehicle with the IOU more than 0.7, and converting the manually marked boundary frame into the coordinates of the block small drawing for all the vehicles in the blocked image.
The step S102 specifically includes: inputting the picture into a residual network embedded with a convolutional attention module. The first channel attention unit compresses the feature map in the spatial dimension using max pooling and average pooling, obtaining two different spatial context descriptors $F^c_{avg}$ and $F^c_{max}$, from which the channel attention map is computed. The calculation formula of the first channel attention unit is:

$$M_c(F) = \sigma\big(W_1(W_0(F^c_{avg})) + W_1(W_0(F^c_{max}))\big)$$

wherein $W_1$ and $W_0$ are the weights of a multi-layer perceptron shared between the two descriptors, $W_0$ is followed by a ReLU activation function, $\sigma$ denotes the Sigmoid function, and $F$ denotes the feature map at this stage of the attention mechanism.
The first spatial attention unit obtains two different feature descriptors along the channel dimension by max pooling and average pooling, $F^s_{avg}$ and $F^s_{max}$, and generates the spatial attention map by convolution. The calculation formula of the first spatial attention unit is:

$$M_s(F) = \sigma\big(f^{7\times 7}([F^s_{avg}; F^s_{max}])\big)$$

wherein $\sigma$ denotes the Sigmoid function and $f^{7\times 7}$ denotes a convolution with kernel size 7×7.
a first feature map is then generated from the channel attention map and the spatial attention map.
The step S103 specifically includes: extracting features from the first feature map with a feature-pyramid-based detector and adding a dual-attention module containing a second position attention unit and a second spatial attention unit to each layer of the top-down part of the feature pyramid.

The second position attention unit computes an association strength matrix between any two point features: the original feature $A$ is reduced by convolution to features $B$, $C$ and $D$; the dimensions of $B$ and $C$ are then reshaped and their matrix product yields the association strength between any two positions. A softmax function produces the attention $S_{ji}$ of each position towards the other positions; $S_{ji}$ is multiplied with feature $D$ and fused, and the result is finally added to the original feature $A_j$ to obtain the position feature map output by the position attention unit. The calculation formula of the second position attention unit is:

$$S_{ji} = \frac{\exp(B_i \cdot C_j)}{\sum_{i=1}^{N}\exp(B_i \cdot C_j)}, \qquad E_{j1} = \sum_{i=1}^{N} S_{ji} D_i + A_j$$

wherein $A_j$ is the feature at a given position; $B_i$, $C_j$, $D_i$ are the three new features generated from $A$ by convolutional dimension reduction; $S_{ji}$ is the position attention map obtained by matrix multiplication of the reshaped $B$ and $C$ followed by a softmax layer; and $E_{j1}$ is the position feature map finally output by the second position attention unit.
the method comprises the steps of carrying out dimension transformation and matrix multiplication on characteristics of any two channels through a second spatial attention unit to obtain association strength of any two channels, then calculating to obtain a characteristic diagram among the channels, and finally carrying out fusion through weighting of the characteristic diagram among the channels, so that global association can be generated among the channels to obtain characteristics of stronger semantic response, wherein the calculation formula of the second spatial attention unit is as follows:
Figure GDA0004205015830000041
wherein A is j Representing the features corresponding to a given position, x ji Representation A j And A is a j Channel feature map obtained by multiplying transposed 4 of (2) and passing through softmax layer, E j2 A spatial feature map representing a final output of the second spatial attention unit;
Finally, the position feature map and the channel feature map are fused to obtain the final second feature map, which undergoes a region-of-interest alignment operation in the last-layer region proposal network to fix the feature size.
The step S104 specifically includes: after the region-of-interest alignment operation fixes the size of the second feature map, two 1024-dimensional fully connected layers are attached and then split into two branches, establishing respectively a target category analysis module and a target frame regression module that classify the regions of interest at the different scales of the feature pyramid and predict the target frames.
The step S105 specifically includes: in testing, a multi-scale image test is adopted: the test set contains the original image and a 1.5× enlargement of it; both scales are block-processed and then fed separately into the deep network for testing, yielding detection results at each scale; the global non-maximum suppression fusion algorithm then merges the detection results of the two scales, improving detection accuracy.
The global integrated non-maximum suppression algorithm proceeds as follows:
Step 1. Globally align the coordinates of the prediction boxes of the sub-blocks of each scale;
Step 2. Weight and sort the confidences of the detection boxes;
Step 3. Select the bounding box with the highest confidence, add it to the final output list, and delete it from the bounding box list;
Step 4. Calculate the areas of all bounding boxes;
Step 5. Calculate the IoU of the highest-confidence bounding box with the other candidate boxes;
Step 6. Delete the bounding boxes whose IoU is larger than the threshold;
Step 7. Repeat the above process until the bounding box list is empty.
The invention has the following beneficial effects:
The invention uses computer target detection and attention mechanism theory to establish a multi-scale target detection network driven by feature pyramid dual attention. For aerial images whose size is large, whose targets to be detected are small, and whose backgrounds are highly complex, the model first blocks the data set, then exploits the strong feature extraction capability of the dual-attention-driven feature pyramid, and at the same time adopts a multi-scale fusion detection method, merging the detection results of the two scales with the global non-maximum suppression fusion algorithm to obtain the most accurate final detections. The proposed detection network performs well on target detection in aerial pictures and can play a significant role in fields such as geographic environment monitoring, traffic flow control and military activity surveillance.
Drawings
FIG. 1 is a schematic flow chart of an algorithm of the present invention;
FIG. 2 is a flow diagram of a global non-maximum suppression fusion algorithm;
FIG. 3 is a schematic diagram of a portion of a dual attention mechanism driven feature pyramid constructed in accordance with the present invention;
FIG. 4 is a schematic diagram of a detection network of the present invention;
FIG. 5 is a comparison chart of the quantitative analysis on the unmanned aerial vehicle data set according to the present invention.
Detailed Description
Embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
As shown in fig. 1, the invention relates to a method for detecting a multiscale target of an aerial image based on spatial pyramid attention driving, wherein: the method comprises the following steps:
S101, before training, the unmanned aerial vehicle data set used to verify the validity of the designed network is block-processed;
The method comprises the following steps: before the data set is fed to network training, it is preprocessed. The data set used in our experiments comprises 4,355 aerial images with the corresponding coordinates of manually annotated vehicles. Because the unmanned aerial vehicle images are too large, each image is blocked in a sliding-window manner with a window of 1000×1000 pixels, producing a large number of small patches. To reduce, as much as possible, vehicles being split across patch borders, an overlap rate of 0.25 is adopted, and the annotation boxes of vehicles whose visible portion exceeds an IoU of 0.7 are retained; at the same time, for all vehicle instances in a blocked image, the manually annotated bounding boxes are converted into patch coordinates and stored, yielding 48,416 patches of 1000×1000 pixels in total.
S102, the patches are input into a residual network and features are extracted through the convolutional attention module inside the residual network; the convolutional attention module comprises a first channel attention unit and a first spatial attention unit, the first channel attention unit computes a channel attention map, the first spatial attention unit computes a spatial attention map, and the two maps are combined to generate the first feature map.
The method comprises the following steps: the picture first passes through a backbone network, chosen as a residual network with a convolutional attention module embedded in its residual blocks. The convolutional attention module combines spatial and channel attention; its attention maps are multiplied with the input feature map so that features are learned adaptively. After the picture passes through the backbone network, a feature map is generated and sent to the next stage.
The convolutional attention module comprises a first channel attention unit and a first spatial attention unit. The first channel attention unit focuses on what is significant in the input picture: to compute channel attention efficiently, it compresses the feature map in the spatial dimension using max pooling and average pooling, yielding two different spatial context descriptors $F^c_{avg}$ and $F^c_{max}$. A shared network consisting of a multi-layer perceptron computes the channel attention map from these two descriptors, so the calculation formula of the first channel attention unit is:

$$M_c(F) = \sigma\big(W_1(W_0(F^c_{avg})) + W_1(W_0(F^c_{max}))\big)$$

wherein $W_1$ and $W_0$ are the weights of the multi-layer perceptron, shared between the two descriptors, with $W_0$ followed by a ReLU activation function; $\sigma$ denotes the Sigmoid function and $F$ the feature map at this stage of the attention module.
Unlike the first channel attention unit, the first spatial attention unit focuses mainly on position information. Along the channel dimension it first obtains two different feature descriptors using max pooling and average pooling, $F^s_{avg}$ and $F^s_{max}$; the two descriptors are then concatenated and a convolution operation generates the spatial attention map. The calculation formula of the first spatial attention unit is:

$$M_s(F) = \sigma\big(f^{7\times 7}([F^s_{avg}; F^s_{max}])\big)$$

wherein $\sigma$ denotes the Sigmoid function and $f^{7\times 7}$ a convolution with kernel size 7×7. A first feature map is then generated from the channel attention map and the spatial attention map.
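Taken together, the two units form a CBAM-style convolutional attention block. A minimal PyTorch sketch consistent with the two formulas above (the reduction ratio of the shared perceptron is an assumption, as the text does not specify it):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """M_c(F) = sigmoid(W1(W0(F_avg^c)) + W1(W0(F_max^c))), shared W0/W1."""
    def __init__(self, channels, reduction=16):        # reduction is an assumption
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), # W0
            nn.ReLU(inplace=True),                      # ReLU after W0
            nn.Linear(channels // reduction, channels), # W1
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))              # from F_avg^c
        mx = self.mlp(x.amax(dim=(2, 3)))               # from F_max^c
        return torch.sigmoid(avg + mx).view(b, c, 1, 1)

class SpatialAttention(nn.Module):
    """M_s(F) = sigmoid(f^{7x7}([F_avg^s; F_max^s]))."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)               # F_avg^s
        mx, _ = x.max(dim=1, keepdim=True)              # F_max^s
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class ConvAttention(nn.Module):
    """Channel attention first, then spatial attention, both multiplicative."""
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        x = x * self.ca(x)
        return x * self.sa(x)
```

In a residual backbone this block would sit inside each residual unit, its attention maps multiplied back onto the unit's feature map as described above.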
S103, features are extracted from the first feature map by a feature-pyramid-based detector; a dual-attention module containing a second spatial attention unit and a second channel attention unit is added to each layer of the top-down part of the feature pyramid to compute the degree of association between different features and to model the associations between channels; the generated second feature map undergoes a region-of-interest alignment operation in the last-layer region proposal network to fix the feature size.
The method comprises the following steps: in the detector stage, a feature pyramid network is first fused into Faster-RCNN to increase the detector's awareness of whole-image information; a dual-attention module is added to improve the spatial feature pyramid structure; and the region-of-interest pooling operation that fixes the feature size in the original Faster-RCNN is replaced by the pixel-level, higher-precision region-of-interest alignment operation.
The loss function of the detection network comprises a classification loss and a regression loss:

$$L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}}\sum_i L_{cls}(p_i, p_i^*) + \lambda\,\frac{1}{N_{reg}}\sum_i p_i^*\, L_{reg}(t_i, t_i^*)$$

wherein $i$ indexes the $i$-th anchor box, $p_i$ is the predicted probability that the anchor contains a target, $p_i^*$ is 1 when the anchor is a positive sample and 0 otherwise, $t_i$ is the position coordinates of the prediction box, and $t_i^*$ is the coordinates of the ground-truth label.
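A compact sketch of this two-part loss, assuming binary cross-entropy for the classification term and smooth L1 for the regression term as in standard Faster-RCNN; the balance weight lambda and the normalizers are assumptions consistent with the formula:

```python
import torch
import torch.nn.functional as F

def detection_loss(p, p_star, t, t_star, lam=1.0):
    """L = (1/N_cls) sum_i L_cls(p_i, p_i*) + lam (1/N_reg) sum_i p_i* L_reg(t_i, t_i*).
    p: predicted objectness in [0, 1], shape (N,); p_star: 1.0 for positive
    anchors and 0.0 otherwise; t, t_star: predicted and ground-truth box
    coordinates, shape (N, 4)."""
    cls = F.binary_cross_entropy(p, p_star, reduction="mean")  # averaged over N_cls
    n_pos = p_star.sum().clamp(min=1)                          # N_reg: positive anchors
    reg = (p_star.unsqueeze(1)
           * F.smooth_l1_loss(t, t_star, reduction="none")).sum() / n_pos
    return cls + lam * reg
```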
The bottom-up part of the feature pyramid consists of the features obtained from the backbone network. The operation adopted is: a 1×1 dimension-reduction convolution is applied to layer 2 of the bottom-up part, and the result is added to the upsampled layer 3 (the next-higher level of the top-down path) to obtain layer 2 of the top-down part; the same applies to each subsequent top-down layer. A region proposal network then operates on the resulting top-down layers to obtain detection region proposals.
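The lateral merge just described can be sketched as follows; nearest-neighbor upsampling and the channel counts are assumptions in line with common feature pyramid implementations:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def top_down_merge(c_i, p_up, lateral_conv):
    """Build a top-down level: 1x1-reduce the bottom-up map c_i, then add the
    upsampled next-higher top-down map p_up."""
    lat = lateral_conv(c_i)                            # 1x1 dimension reduction
    up = F.interpolate(p_up, size=lat.shape[-2:], mode="nearest")
    return lat + up

# usage sketch: merging bottom-up layer 2 with top-down layer 3
lateral2 = nn.Conv2d(512, 256, kernel_size=1)          # channel counts assumed
c2 = torch.randn(1, 512, 200, 200)                     # bottom-up layer 2
p3 = torch.randn(1, 256, 100, 100)                     # top-down layer 3
p2 = top_down_merge(c2, p3, lateral2)                  # top-down layer 2
```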
The feature pyramid with the dual-attention module integrated into the residual network extracts the features of the targets to be detected on feature maps of different scales; adding the dual-attention mechanism to each layer of the top-down part of the feature pyramid yields feature maps with higher precision and richer information.
The second position attention unit uses the association between any two point features to mutually enhance the expression of each feature. Specifically, the association strength matrix between any two point features is computed first: the original feature $A$ is reduced by convolution to features $B$, $C$ and $D$; the dimensions of $B$ and $C$ are reshaped and their matrix product gives the association strength matrix between any two positions. Normalization by a softmax operation then yields the attention $S_{ji}$ of each position towards the other positions; the more similar two point features are, the larger the response value $S_{ji}$. The response values $S_{ji}$ are used as weights for a weighted fusion of feature $D$, so that each position fuses similar features across the global space through the attention map. The calculation formula of the second position attention unit is:

$$S_{ji} = \frac{\exp(B_i \cdot C_j)}{\sum_{i=1}^{N}\exp(B_i \cdot C_j)}, \qquad E_{j1} = \sum_{i=1}^{N} S_{ji} D_i + A_j$$

wherein $A_j$ is the feature at a given position; $B_i$, $C_j$, $D_i$ are the new features generated from $A$ by the convolution layers; $S_{ji}$ is the position attention map obtained by matrix multiplication of the reshaped $B$ and $C$ followed by a softmax layer; and $E_{j1}$ is the position feature map finally output by the second position attention unit.
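A minimal PyTorch sketch of such a position attention unit, following the formula for $E_{j1}$ above (the channel-reduction ratio of the 1×1 convolutions is an assumption):

```python
import torch
import torch.nn as nn

class PositionAttention(nn.Module):
    """Each position aggregates the features of all positions, weighted by the
    softmax-normalized similarity S_ji, then adds the original feature A_j."""
    def __init__(self, channels, reduction=8):          # reduction is an assumption
        super().__init__()
        self.to_b = nn.Conv2d(channels, channels // reduction, 1)  # feature B
        self.to_c = nn.Conv2d(channels, channels // reduction, 1)  # feature C
        self.to_d = nn.Conv2d(channels, channels, 1)               # feature D

    def forward(self, a):                               # a: (n, ch, h, w)
        n, ch, h, w = a.shape
        b = self.to_b(a).flatten(2)                     # (n, c', h*w)
        c = self.to_c(a).flatten(2)                     # (n, c', h*w)
        d = self.to_d(a).flatten(2)                     # (n, ch, h*w)
        s = torch.softmax(b.transpose(1, 2) @ c, dim=1) # (n, h*w, h*w): S_ji
        out = (d @ s).view(n, ch, h, w)                 # sum_i S_ji * D_i
        return out + a                                  # residual add of A_j
```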
The second spatial attention unit enhances specific semantic response capabilities across channels by modeling the associations between channels. The process is similar to the position attention module, except that to obtain the attention map $X$, dimension reshaping and matrix multiplication are applied to the features of any two channels, giving the association strength of any two channels; a softmax operation then yields the inter-channel attention map. Finally, fusion is performed by weighting with the inter-channel attention map, so that a global association is generated among the channels and features with stronger semantic responses are obtained. The calculation formula of this unit is:

$$x_{ji} = \frac{\exp(A_i \cdot A_j)}{\sum_{i=1}^{C}\exp(A_i \cdot A_j)}, \qquad E_{j2} = \sum_{i=1}^{C} x_{ji} A_i + A_j$$

wherein $A_j$ is the feature of a given channel, $x_{ji}$ is the channel attention map obtained by multiplying $A_j$ with the transpose $A_i$ and passing through a softmax layer, and $E_{j2}$ is the feature map finally output by the second spatial attention unit.
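Correspondingly, a minimal sketch of the channel-relation computation behind $E_{j2}$; no extra convolutions are assumed, since the formula operates on $A$ directly:

```python
import torch
import torch.nn as nn

class ChannelRelationAttention(nn.Module):
    """The attention map x_ji is the softmax of the channel Gram matrix of A;
    it re-weights the channels of A, which are then added back to A."""
    def forward(self, a):                                # a: (n, ch, h, w)
        n, ch, h, w = a.shape
        flat = a.flatten(2)                              # (n, ch, h*w)
        gram = flat @ flat.transpose(1, 2)               # (n, ch, ch): A_j . A_i
        x = torch.softmax(gram, dim=-1)                  # x_ji, normalized over i
        out = (x @ flat).view(n, ch, h, w)               # sum_i x_ji * A_i
        return out + a                                   # residual add of A_j
```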
In target detection algorithms, region proposal candidate boxes are usually obtained from the region proposal network, and a region-of-interest pooling operation then maps candidate regions of different sizes onto a feature map of fixed size. Region-of-interest pooling, however, has two notable drawbacks: errors arise when candidate box boundaries are quantized to integer coordinates, and floating-point rounding occurs again during pooling. The accumulated error shifts the coordinate position of the candidate box and degrades detection. Because our data set concerns vehicle detection in unmanned aerial vehicle images, where the targets to be detected occupy a very small proportion of the picture, we substitute the pixel-level, higher-precision region-of-interest alignment operation: quantization is cancelled, and bilinear interpolation is used to read image values at floating-point coordinates, turning the whole feature aggregation process into a continuous operation.
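torchvision ships a region-of-interest alignment operator that performs exactly this bilinear, quantization-free pooling; a usage sketch with illustrative sizes (the stride-16 scale and the box coordinates are assumptions):

```python
import torch
from torchvision.ops import roi_align

feat = torch.randn(1, 256, 50, 50)            # a pyramid feature map
# each row: (batch_index, x1, y1, x2, y2) in input-image coordinates
boxes = torch.tensor([[0.0, 100.0, 120.0, 180.0, 200.0]])
# spatial_scale maps image coordinates onto the feature map (stride 16 here);
# sampling points are read off with bilinear interpolation, no rounding
pooled = roi_align(feat, boxes, output_size=(7, 7),
                   spatial_scale=1 / 16, sampling_ratio=2)
print(pooled.shape)                           # torch.Size([1, 256, 7, 7])
```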
S104, after the region-of-interest alignment operation fixes the size of the second feature map, two 1024-dimensional fully connected layers are attached and then split into two branches, establishing respectively a target category analysis module and a target frame regression module that classify and predict the regions of interest at the different scales of the feature pyramid.
S105, a multi-scale image test is adopted: the test set contains the original image and a 1.5× enlargement of it; both scales are block-processed and then fed separately into the deep network for testing, giving detection results at each scale; the global non-maximum suppression fusion algorithm then merges the detection results of the two scales, improving detection accuracy.
The global integrated non-maximum suppression algorithm proceeds as follows:
Step 1. Globally align the coordinates of the prediction boxes of the sub-blocks of each scale;
Step 2. Weight and sort the confidences of the detection boxes;
Step 3. Select the bounding box with the highest confidence, add it to the final output list, and delete it from the bounding box list;
Step 4. Calculate the areas of all bounding boxes;
Step 5. Calculate the IoU of the highest-confidence bounding box with the other candidate boxes;
Step 6. Delete the bounding boxes whose IoU is larger than the threshold;
Step 7. Repeat the above process until the bounding box list is empty.
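A compact NumPy sketch of Steps 1-7, assuming each detection carries the offset of its source patch so that coordinates can be globalized, and with the confidence weighting of Step 2 reduced to a single per-scale weight:

```python
import numpy as np

def global_integrated_nms(dets, iou_thr=0.5):
    """dets: iterable of (x1, y1, x2, y2, score, ox, oy, w) where (ox, oy) is
    the patch offset in the full image and w a per-scale confidence weight.
    Returns fused global boxes with weighted scores after greedy suppression."""
    dets = list(dets)
    if not dets:
        return np.empty((0, 5))
    d = np.asarray(dets, dtype=np.float64)
    boxes = d[:, :4] + np.tile(d[:, 5:7], 2)       # Step 1: globalize coordinates
    scores = d[:, 4] * d[:, 7]                     # Step 2: weight confidences
    order = scores.argsort()[::-1]                 # Step 2: sort descending
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])  # Step 4
    keep = []
    while order.size > 0:                          # Steps 3-7: greedy loop
        i = order[0]
        keep.append(i)                             # Step 3: best box to output
        rest = order[1:]
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        iou = inter / (areas[i] + areas[rest] - inter)   # Step 5: IoU with best
        order = rest[iou <= iou_thr]               # Step 6: drop high overlaps
    return np.hstack([boxes[keep], scores[keep].reshape(-1, 1)])
```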
A comparison experiment was carried out. The data set used in the experiment is the unmanned aerial vehicle data set of the "bell-type computing cup" information fusion challenge, and the hyperparameters are set as follows: the maximum number of training epochs is 12, the batch size is 1, and the learning rate follows a warm-up strategy: starting from a warm-up factor of 0.3333, it is increased gradually to the base rate of 0.00025 over the first 500 iterations, and is then decayed in the 8th and 11th epochs.
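Read as a standard linear warm-up (taking 0.3333 as a warm-up factor on the 0.00025 base rate, with 10x step decays at epochs 8 and 11; this is an interpretation, since the original wording is ambiguous), the schedule would look like:

```python
def learning_rate(iteration, epoch, base_lr=0.00025,
                  warmup_factor=0.3333, warmup_iters=500):
    """Linear warm-up from warmup_factor * base_lr to base_lr over the first
    warmup_iters iterations, then 10x decays at epochs 8 and 11 (assumed)."""
    if iteration < warmup_iters:
        alpha = iteration / warmup_iters
        return base_lr * (warmup_factor * (1 - alpha) + alpha)
    decay_steps = sum(epoch >= e for e in (8, 11))   # 0, 1 or 2 decays
    return base_lr * (0.1 ** decay_steps)
```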
Two analysis methods, quantitative and visual, were used to evaluate the experiment:

For the quantitative comparison, precision, recall and the F1 score are used to judge detection accuracy, the F1 score serving as the overall measure of the algorithm's detection accuracy. They are calculated as follows:

$$Precision = \frac{TP}{TP + FP}, \qquad Recall = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$$

wherein true positives (TP) are targets to be detected that are correctly detected, false positives (FP) are detections that do not correspond to a target to be detected, and false negatives (FN) are targets to be detected that are missed.
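These three quantities follow directly from the TP/FP/FN counts; a small sketch with an illustrative example:

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall and F1 score from true-positive, false-positive and
    false-negative counts, per the three formulas above."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# illustrative counts: 90 vehicles found correctly, 10 spurious boxes, 20 missed
print(detection_metrics(90, 10, 20))   # (0.9, 0.818..., 0.857...)
```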
The visual comparison means that the same picture to be detected is run through the models trained by the different detection algorithms, the detection results are visualized with our own visualization code, and the detection effects of the different models on the same picture are then compared manually.
In conclusion, conventional target detection algorithms suffer from low detection precision and poor effect on unmanned aerial vehicle aerial images. The invention uses deep learning and attention mechanisms to establish a multi-scale aerial target detection network driven by feature pyramid dual attention; by fusing attention mechanisms into the spatial pyramid during feature extraction, richer and more effective information can be extracted and then sent to the region proposal network for classification and regression.
The above is only a preferred embodiment of the invention, and the protection scope of the invention is not limited to the above examples; all technical solutions within the concept of the invention belong to its protection scope. It should be noted that modifications and adaptations that do not depart from the principles of the invention are also regarded as within the protection scope of the invention.

Claims (4)

1. A multi-scale target detection method for aerial images based on spatial pyramid attention drive, characterized in that the method comprises the following steps:
s101: collecting an unmanned aerial vehicle image set and performing block processing to obtain a large number of cropped patches of consistent size;
s102: inputting the patches into a residual network and extracting features through a convolutional attention module inside the residual network, wherein the convolutional attention module comprises a first channel attention unit and a first spatial attention unit; a channel attention map is computed by the first channel attention unit, a spatial attention map is computed by the first spatial attention unit, and the two maps are combined to generate a first feature map;
s103: extracting features from the first feature map with a feature-pyramid-based detector, adding a dual-attention module comprising a second spatial attention unit and a second channel attention unit to each layer of the top-down part of the feature pyramid, fusing the feature maps generated by the two attention units to obtain a second feature map, and performing a region-of-interest alignment operation on the second feature map produced by the region proposal network in the last layer to fix the feature size;
s104: for the obtained region-of-interest-aligned second feature map, establishing a target category analysis and target frame regression module, and performing classification and target frame prediction on the regions of interest at different scales;
s105: performing a multi-scale image test with the original image and a 1.5× enlarged image, feeding the two scales separately into the deep network for testing, and fusing the results of the different scales through the global integrated non-maximum suppression algorithm to improve detection accuracy;
the step S102 specifically includes:
inputting the picture into a residual network embedded with a convolutional attention module, wherein the first channel attention unit compresses the feature map in the spatial dimension using max pooling and average pooling to obtain two different spatial context descriptors $F^c_{avg}$ and $F^c_{max}$, from which the channel attention map is computed; the calculation formula of the first channel attention unit is:

$$M_c(F) = \sigma\big(W_1(W_0(F^c_{avg})) + W_1(W_0(F^c_{max}))\big)$$

wherein $W_1$ and $W_0$ are the weights of a multi-layer perceptron shared between the two descriptors, $W_0$ is followed by a ReLU activation function, and $\sigma$ denotes the Sigmoid function;
wherein the first spatial attention unit obtains two different feature descriptors along the channel dimension by max pooling and average pooling, $F^s_{avg}$ and $F^s_{max}$, and generates the spatial attention map by convolution; the calculation formula of the first spatial attention unit is:

$$M_s(F) = \sigma\big(f^{7\times 7}([F^s_{avg}; F^s_{max}])\big)$$

wherein $\sigma$ denotes the Sigmoid function and $f^{7\times 7}$ denotes a convolution with kernel size 7×7;
then generating a first feature map from the channel attention map and the spatial attention map;
the step S103 specifically includes:
extracting features from the first feature map with a feature-pyramid-based detector, adding a dual-attention mechanism containing a second position attention unit and a second spatial attention unit to each layer of the top-down part of the feature pyramid;
computing, by the second position attention unit, the association strength matrix between any two point features: the original feature $A$ is reduced by convolution to features $B$, $C$ and $D$; the dimensions of $B$ and $C$ are reshaped and their matrix product gives the association strength matrix between any two positions; a softmax function yields the attention $S_{ji}$ of each position towards the other positions; $S_{ji}$ is multiplied with feature $D$ and fused, and the result is finally added to the original feature $A_j$ to obtain the position feature map output by the position attention unit; the calculation formula of the second position attention unit is:

$$S_{ji} = \frac{\exp(B_i \cdot C_j)}{\sum_{i=1}^{N}\exp(B_i \cdot C_j)}, \qquad E_{j1} = \sum_{i=1}^{N} S_{ji} D_i + A_j$$

wherein $A_j$ is the feature at a given position; $B_i$, $C_j$, $D_i$ are the three new features generated from $A$ by convolutional dimension reduction; $S_{ji}$ is the position attention map obtained by matrix multiplication of the reshaped $B$ and $C$ followed by a softmax layer; and $E_{j1}$ is the position feature map finally output by the second position attention unit;
the method comprises the steps of carrying out dimension transformation and matrix multiplication on any two channel characteristics through a second spatial attention unit to obtain association strength of any two channels, then calculating to obtain attention force diagram among the channels, and finally carrying out fusion through attention force diagram weighting among the channels to enable global association among the channels to be generated, so as to obtain characteristics of stronger semantic response, wherein the calculation formula of the second spatial attention unit is as follows:
Figure FDA0004205015820000023
wherein A is j Representing the features corresponding to a given position, x ji Representation A j And A is a j Is rotated by (a)Set A i Channel feature map obtained by multiplying and passing through softmax layer E j2 A spatial feature map representing a final output of the second spatial attention unit;
finally, fusing the position feature map and the spatial feature map to obtain the final second feature map, performing the region-of-interest alignment operation on the obtained second feature map in the last-layer region proposal network, and fixing the feature size;
the global integrated non-maximum suppression algorithm proceeds as follows:
Step 1. globally aligning the coordinates of the prediction boxes of the sub-blocks of each scale;
Step 2. weighting and sorting the confidences of the detection boxes;
Step 3. selecting the bounding box with the highest confidence, adding it to the final output list, and deleting it from the bounding box list;
Step 4. calculating the areas of all bounding boxes;
Step 5. calculating the IoU of the highest-confidence bounding box with the other candidate boxes;
Step 6. deleting the bounding boxes whose IoU is larger than the threshold;
Step 7. repeating the above process until the bounding box list is empty.
2. The aerial image multi-scale target detection method based on spatial pyramid attention driving according to claim 1, characterized in that the step S101 specifically includes:
performing sliding-window blocking of the image with a window of 1000×1000 pixels and an overlap rate of 0.25, retaining the coordinate information of the manually annotated vehicle boxes whose visible portion exceeds an IoU of 0.7, and, for all vehicles in the blocked image, converting the manually annotated bounding boxes into the coordinates of the cropped patch.
3. The aerial image multi-scale target detection method based on spatial pyramid attention driving according to claim 1, characterized in that the step S104 specifically includes:
after the region-of-interest alignment operation fixes the size of the second feature map, attaching two 1024-dimensional fully connected layers and then splitting them into two branches, establishing respectively a target category analysis module and a target frame regression module that classify the regions of interest at the different scales of the feature pyramid and predict the target frames.
4. The aerial image multi-scale target detection method based on spatial pyramid attention driving according to claim 3, characterized in that the step S105 specifically includes:
in testing, adopting a multi-scale image test: the test set contains the original image and a 1.5× enlargement of it; both scales are block-processed and then fed separately into the deep network for testing, yielding detection results at each scale; and merging the detection results of the two scales with the global non-maximum suppression fusion algorithm, thereby improving detection accuracy.
CN202010164167.7A 2020-03-10 2020-03-10 Aerial image multi-scale target detection method based on spatial pyramid attention drive Active CN111401201B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010164167.7A CN111401201B (en) 2020-03-10 2020-03-10 Aerial image multi-scale target detection method based on spatial pyramid attention drive

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010164167.7A CN111401201B (en) 2020-03-10 2020-03-10 Aerial image multi-scale target detection method based on spatial pyramid attention drive

Publications (2)

Publication Number Publication Date
CN111401201A CN111401201A (en) 2020-07-10
CN111401201B (en) 2023-06-20

Family

ID=71432330

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010164167.7A Active CN111401201B (en) 2020-03-10 2020-03-10 Aerial image multi-scale target detection method based on spatial pyramid attention drive

Country Status (1)

Country Link
CN (1) CN111401201B (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110110751A (en) * 2019-03-31 2019-08-09 华南理工大学 A kind of Chinese herbal medicine recognition methods of the pyramid network based on attention mechanism
CN110084210A (en) * 2019-04-30 2019-08-02 电子科技大学 The multiple dimensioned Ship Detection of SAR image based on attention pyramid network
CN110378242A (en) * 2019-06-26 2019-10-25 南京信息工程大学 A kind of remote sensing target detection method of dual attention mechanism
CN110533045A (en) * 2019-07-31 2019-12-03 中国民航大学 A kind of luggage X-ray contraband image, semantic dividing method of combination attention mechanism
CN110533084A (en) * 2019-08-12 2019-12-03 长安大学 A kind of multiscale target detection method based on from attention mechanism
CN110532955A (en) * 2019-08-30 2019-12-03 中国科学院宁波材料技术与工程研究所 Example dividing method and device based on feature attention and son up-sampling
CN110705457A (en) * 2019-09-29 2020-01-17 核工业北京地质研究院 Remote sensing image building change detection method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Indoor crowd detection network based on multi-level features and hybrid attention mechanisms; 沈文祥 et al.; Journal of Computer Applications; 2019-10-15 (No. 12); full text *
Research progress of image semantic segmentation based on deep learning; 李新叶 et al.; Science Technology and Engineering; 2019-11-28 (No. 33); full text *

Also Published As

Publication number Publication date
CN111401201A (en) 2020-07-10


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant