CN111401201A

CN111401201A - Aerial image multi-scale target detection method based on spatial pyramid attention drive

Info

Publication number: CN111401201A
Application number: CN202010164167.7A
Authority: CN
Inventors: 孙玉宝; 辛宇; 徐宏伟; 陈勋豪; 周旺平
Original assignee: Nanjing University of Information Science and Technology
Current assignee: Nanjing University of Information Science and Technology
Priority date: 2020-03-10
Filing date: 2020-03-10
Publication date: 2020-07-10
Anticipated expiration: 2040-03-10
Also published as: CN111401201B

Abstract

The invention discloses an aerial image multi-scale target detection method driven by spatial pyramid attention, comprising the following steps: first, for a data set of large images, a block processing method is applied to augment the training data set; a residual network whose feature representation is enhanced by convolutional attention is designed as the backbone network to extract image features efficiently; a spatial pyramid attention module is then constructed so that the network focuses more accurately on targets of different scales and extracts the regions of interest in which the targets lie; a target classification and bounding-box regression module is established to classify the regions of interest at different scales and predict their bounding boxes; in the testing stage, a multi-scale testing strategy is applied with the trained detection network, and the detection results at different scales are fused by a global integrated non-maximum suppression algorithm, further improving detection accuracy.

Description

Aerial image multi-scale target detection method based on spatial pyramid attention drive

Technical Field

The invention belongs to the technical field of image recognition and target detection, and particularly relates to an aerial image multi-scale target detection method based on spatial pyramid attention driving.

Background

Target detection, also called target extraction, is a form of image segmentation based on the geometric and statistical characteristics of targets. It combines target segmentation and recognition into one task, and its accuracy and real-time performance are key capabilities of the overall system. Automatic target extraction and recognition are especially important in complex scenes where multiple targets must be processed in real time. With the development of computer technology and the wide application of computer vision, real-time target tracking based on image processing has attracted growing research interest, and dynamic real-time tracking and localization of targets has broad application value in intelligent transportation systems, intelligent surveillance systems, military target detection, surgical instrument positioning in navigated medical procedures, and similar areas.

On the one hand, many target detection methods have appeared in recent years, such as YOLO, SSD, RetinaNet and the RCNN series. YOLO, SSD and RetinaNet are single-stage methods, whereas the original RCNN and its extensions Fast-RCNN and Faster-RCNN are two-stage methods.

On the other hand, the visual attention mechanism is a brain signal processing mechanism unique to human vision. Human vision obtains a target area needing important attention, namely a focus of attention in general, by rapidly scanning a global image so as to acquire more information which is critical to the characteristics of the target needing attention. Therefore, the model introducing the attention mechanism is of great help to improve the accuracy of target detection.

Under the condition of not considering the detection speed, the accuracy of the two-stage target detection algorithm is higher than that of the single-stage target detection algorithm, so that the two-stage target detection algorithm can achieve higher accuracy in many conditions such as detection of aerial pictures of the unmanned aerial vehicle. Therefore, the patent provides a feature pyramid dual-attention-driven multi-scale target detection network based on a deep learning theory and a latest attention mechanism method.

Disclosure of Invention

The technical problem to be solved by the invention is to address the shortcomings of the prior art by providing an aerial image multi-scale target detection method based on spatial pyramid attention drive.

In order to achieve the technical purpose, the technical scheme adopted by the invention is as follows:

An aerial image multi-scale target detection method based on spatial pyramid attention drive, comprising the following steps:

S101: collecting a set of unmanned aerial vehicle (UAV) aerial images and dividing them into blocks to obtain a large number of small image tiles of uniform size;

S102: inputting the tiles into a residual network and extracting features through a convolutional attention module in the residual network, wherein the convolutional attention module comprises a first channel attention unit and a first spatial attention unit; a channel attention map is computed by the first channel attention unit, a spatial attention map is computed by the first spatial attention unit, and a first feature map is generated by combining the channel attention map and the spatial attention map;

S103: extracting features from the first feature map with a feature-pyramid-based detector; adding a dual attention module containing a second spatial attention unit and a second channel attention unit to each layer of the top-down pathway of the feature pyramid, and fusing the feature maps generated by the two attention units to obtain a second feature map; and performing a region-of-interest alignment operation on the second feature map, using the region proposal network at the last layer, to fix the feature size;

S104: for the region-of-interest-aligned second feature map, establishing a target classification and bounding-box regression module, and classifying the regions of interest at different scales and predicting their bounding boxes;

S105: performing multi-scale image testing with the original image and the image enlarged to 1.5 times its size, inputting the images at the two scales separately into the deep network for testing, and fusing the results at the different scales with a global integrated non-maximum suppression algorithm to improve detection accuracy.

In order to optimize the technical scheme, the specific measures adopted further comprise:

Step S101 specifically includes: dividing each image into blocks with a sliding window of 1000 × 1000 pixels and an overlap rate of 0.25; retaining the coordinates of the manually labelled box of each vehicle whose IOU is greater than 0.7; and, for all vehicles in each tile, converting the manually labelled bounding boxes into the coordinate system of that tile.

Step S102 specifically includes: inputting the image into a residual network in which the convolutional attention module is embedded. The first channel attention unit compresses the feature map along the spatial dimensions using max pooling and average pooling to obtain two different spatial context descriptors, $F^c_{avg}$ and $F^c_{max}$. The two descriptors are passed through the shared multi-layer perceptron of the network to compute the channel attention map; the first channel attention unit is computed as:

$$M_c(F) = \sigma\big(W_1(W_0(F^c_{avg})) + W_1(W_0(F^c_{max}))\big)$$

where $W_1$ and $W_0$ are the weights of the multi-layer perceptron, shared between the two inputs, with a ReLU activation following $W_0$; $\sigma$ denotes the sigmoid function; and $F$ denotes the convolutional feature at the corresponding stage of the attention module.

The first spatial attention unit applies max pooling and average pooling along the channel dimension to obtain two different feature descriptors, $F^s_{avg}$ and $F^s_{max}$, and generates the spatial attention map by convolution; the first spatial attention unit is computed as:

$$M_s(F) = \sigma\big(f^{7\times 7}([F^s_{avg}; F^s_{max}])\big)$$

where $\sigma$ denotes the sigmoid function and $f^{7\times 7}$ denotes a convolution with a 7 × 7 kernel.

The first feature map is then generated from the channel attention map and the spatial attention map.

Step S103 specifically includes: extracting features from the first feature map with a feature-pyramid-based detector, and adding a dual attention module containing a second position attention unit and a second spatial attention unit to each layer of the top-down pathway of the feature pyramid.

The second position attention unit computes a correlation strength matrix between the features at any two positions: the original feature $A_j$ is reduced in dimension by convolution to obtain features $B_i$, $C_j$ and $D_i$; $B_i$ and $C_j$ are then reshaped and multiplied as matrices to obtain the correlation strength matrix. A softmax function gives the attention $S_{ji}$ of each position to every other position; $S_{ji}$ is multiplied with feature $D_i$ and the products are fused; finally the result is added to the original feature $A_j$ to give the position feature map output by the unit. The second position attention unit is computed as:

$$S_{ji} = \frac{\exp(B_i \cdot C_j)}{\sum_{i=1}^{N} \exp(B_i \cdot C_j)}, \qquad E_{j1} = \sum_{i=1}^{N} S_{ji} D_i + A_j$$

where $A_j$ denotes the feature at a given position; $B_i$, $C_j$ and $D_i$ denote the three new features generated from $A_j$ by convolutional dimensionality reduction; $S_{ji}$ denotes the position attention map obtained by reshaping $B_i$ and $C_j$, multiplying them as matrices and applying a softmax layer; $N$ is the number of positions; and $E_{j1}$ denotes the position feature map finally output by the second position attention unit.

The second spatial attention unit reshapes the features of any two channels and multiplies them as matrices to obtain the correlation strength between the two channels, then computes the attention map between channels, and finally fuses the channels by weighting with that map, so that global associations are formed among the channels and features with a stronger semantic response are obtained. The second spatial attention unit is computed as:

$$x_{ji} = \frac{\exp(A_i \cdot A_j)}{\sum_{i=1}^{C} \exp(A_i \cdot A_j)}, \qquad E_{j2} = \sum_{i=1}^{C} x_{ji} A_i + A_j$$

where $A_j$ denotes the feature of a given channel; $x_{ji}$ denotes the channel attention map obtained by multiplying $A_j$ with its transpose $A_i$ and applying a softmax layer; $C$ is the number of channels; and $E_{j2}$ denotes the spatial feature map finally output by the second spatial attention unit.

Finally, the position feature map and the spatial feature map are fused to obtain the final second feature map, and the region proposal network at the last layer performs a region-of-interest alignment operation on the second feature map to fix the feature size.

Step S104 specifically includes: after region-of-interest alignment has been performed on the second feature map and features of fixed size have been obtained, connecting two 1024-dimensional fully connected layers, which then split into two branches that form the target classification module and the bounding-box regression module respectively, and classifying the regions of interest at the different scales of the feature pyramid and predicting their bounding boxes.

Step S105 specifically includes: multi-scale image testing is used at test time. In addition to the original images of the test set, copies enlarged to 1.5 times the original size are used; the images at both scales are divided into blocks and input separately into the deep network to obtain detection results at each scale; and the detection results of the two scales are merged by the global integrated non-maximum suppression algorithm, improving detection accuracy.

The global integrated non-maximum suppression algorithm process is as follows:

Step 1: globally align the coordinates of the prediction boxes from the sub-blocks at each scale;

Step 2: compute the weighted confidence of each detection box and sort the boxes by it;

Step 3: select the bounding box with the highest confidence, add it to the final output list, and delete it from the bounding-box list;

Step 4: compute the areas of all bounding boxes;

Step 5: compute the IOU between the highest-confidence bounding box and each remaining candidate box;

Step 6: delete the bounding boxes whose IOU is greater than the threshold;

Step 7: repeat the above process until the bounding-box list is empty.
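The seven steps can be sketched in plain Python as follows. This is a minimal illustration rather than the patented implementation: the per-scale confidence weights (`scale_weights`), the IOU threshold of 0.5 and the helper names are assumptions, since the text does not specify them. Tile offsets are taken to be expressed in the coordinates of the scaled image.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes (Steps 4-5)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def to_global(box, tile_origin, scale):
    """Step 1: map a tile-local box at a test scale to original-image pixels."""
    x0, y0 = tile_origin
    shifted = (box[0] + x0, box[1] + y0, box[2] + x0, box[3] + y0)
    return tuple(v / scale for v in shifted)


def global_nms(detections, scale_weights, iou_threshold=0.5):
    """detections: iterable of (box, score, tile_origin, scale) tuples."""
    # Steps 1-2: align coordinates, weight the confidences, sort descending.
    candidates = sorted(
        ((to_global(box, origin, scale), score * scale_weights[scale])
         for box, score, origin, scale in detections),
        key=lambda d: d[1], reverse=True)
    kept = []
    while candidates:                      # Step 7: loop until the list is empty
        best = candidates.pop(0)           # Step 3: take the most confident box
        kept.append(best)
        candidates = [c for c in candidates          # Steps 4-6: drop overlaps
                      if iou(best[0], c[0]) <= iou_threshold]
    return kept


# Hypothetical detections: the second one is the same vehicle seen at scale 1.5.
dets = [((100, 100, 200, 200), 0.9, (0, 0), 1.0),
        ((150, 150, 300, 300), 0.8, (0, 0), 1.5),
        ((600, 600, 700, 700), 0.7, (750, 0), 1.0)]
result = global_nms(dets, {1.0: 1.0, 1.5: 1.0})
```

Because every box is mapped back to original-image coordinates before suppression, duplicates produced by overlapping tiles and by the two test scales compete in the same list.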

The invention has the beneficial effects that:

The invention applies the theory of computer target detection and attention mechanisms to establish a multi-scale target detection network driven by feature pyramid dual attention. For aerial images that are large, contain small targets and have highly complex backgrounds, the data set is first divided into blocks; the strong feature extraction capability of the dual-attention-driven feature pyramid is then exploited together with a multi-scale fusion detection method, in which the detection results at two scales are merged by the global non-maximum suppression fusion algorithm to obtain a more accurate final detection result. The detection network provided by the invention performs well on target detection in aerial images and is useful in fields such as geographic environment monitoring, traffic flow control and military activity monitoring.

Drawings

FIG. 1 is a schematic flow chart of the algorithm of the present invention;

FIG. 2 is a schematic flow diagram of a global non-maximum suppression fusion algorithm;

FIG. 3 is a schematic diagram of a feature pyramid portion of a dual attention mechanism drive constructed in accordance with the present invention;

FIG. 4 is a schematic diagram of a detection network of the present invention;

FIG. 5 is a comparison graph of the quantitative analysis on the unmanned aerial vehicle data set.

Detailed Description

Embodiments of the present invention are described in further detail below with reference to the accompanying drawings.

As shown in FIG. 1, the invention is a spatial pyramid attention-driven aerial image multi-scale target detection method comprising the following steps:

S101: before training, the UAV aerial vehicle data set used to verify the effectiveness of the designed network is divided into blocks.

Specifically: the data set is processed before being fed to the network for training. The data set used in the experiments contains 4355 aerial images with the corresponding coordinates of manually labelled vehicles. Because the images are captured from a UAV, each image is too large to process directly, so it is divided with a sliding window of 1000 × 1000 pixels into a large number of small image tiles. To avoid, as far as possible, vehicles being cut off by the tiling, an overlap rate of 0.25 is used, and the coordinates of the manually labelled box of a vehicle are retained only when its IOU is greater than 0.7. For all vehicle instances in each tile, the tile is saved and the manually labelled bounding boxes are converted into the coordinate system of that tile. In total, 48416 tiles of 1000 × 1000 pixels are obtained.
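The tiling step can be illustrated with a short sketch. It assumes that the 0.7 threshold is applied to the fraction of a labelled box that falls inside the tile, which is one plausible reading of "IOU" here, and that the last window in each row and column is shifted back so that it ends at the image edge; neither detail is stated in the text, and the function names are hypothetical.

```python
def tile_origins(length, tile=1000, overlap=0.25):
    """Top-left offsets of the sliding windows along one image axis."""
    stride = int(tile * (1 - overlap))      # 750 px for the stated settings
    if length <= tile:
        return [0]
    origins = list(range(0, length - tile, stride))
    origins.append(length - tile)           # last window flush with the edge
    return origins


def clip_box_to_tile(box, x0, y0, tile=1000, keep_ratio=0.7):
    """Return the box in tile coordinates, or None if too little lies inside.

    Assumption: the threshold compares (area inside tile) / (box area).
    """
    x1, y1, x2, y2 = box
    ix1, iy1 = max(x1, x0), max(y1, y0)
    ix2, iy2 = min(x2, x0 + tile), min(y2, y0 + tile)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = (x2 - x1) * (y2 - y1)
    if area <= 0 or inter / area <= keep_ratio:
        return None
    return (ix1 - x0, iy1 - y0, ix2 - x0, iy2 - y0)


def tile_annotations(width, height, boxes, tile=1000, overlap=0.25):
    """Yield (x0, y0, boxes_in_tile) for every sliding window."""
    for y0 in tile_origins(height, tile, overlap):
        for x0 in tile_origins(width, tile, overlap):
            kept = []
            for box in boxes:
                local = clip_box_to_tile(box, x0, y0, tile)
                if local is not None:
                    kept.append(local)
            yield x0, y0, kept


# A vehicle straddling x = 1000 is dropped from the first tile but kept whole
# in the overlapping second tile.
tiles = list(tile_annotations(2500, 1000, [(900, 100, 1100, 200)]))
```

With an overlap rate of 0.25 the stride is 750 pixels, so any vehicle narrower than 250 pixels that is cut by one window appears complete in a neighbouring window.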

S102: the small image tiles are input into a residual network, and features are extracted through a convolutional attention module in the residual network. The convolutional attention module comprises a first channel attention unit and a first spatial attention unit; a channel attention map is computed by the first channel attention unit, a spatial attention map is computed by the first spatial attention unit, and a first feature map is generated by combining the channel attention map and the spatial attention map.

Specifically: the image first passes through the backbone network, for which a residual network is chosen, with a convolutional attention module embedded in the residual blocks. The convolutional attention module combines spatial and channel attention; its attention maps are multiplied with the input feature map for adaptive feature refinement. After passing through the backbone network, the image yields a feature map that is passed to the next stage.

The convolutional attention module comprises a first channel attention unit and a first spatial attention unit. The first channel attention unit focuses on what is meaningful in the input image. To compute the channel attention efficiently, it compresses the feature map along the spatial dimensions using max pooling and average pooling to obtain two different spatial context descriptors, $F^c_{avg}$ and $F^c_{max}$. Both descriptors are passed through a shared network consisting of an MLP to compute the channel attention map, so the first channel attention unit is computed as:

$$M_c(F) = \sigma\big(W_1(W_0(F^c_{avg})) + W_1(W_0(F^c_{max}))\big)$$

where $W_1$ and $W_0$ are the weights of the multi-layer perceptron, shared between the two inputs, with a ReLU activation following $W_0$; $\sigma$ denotes the sigmoid function; and $F$ denotes the convolutional feature at the corresponding stage of the attention module.

Unlike the first channel attention unit, the first spatial attention unit focuses mainly on position information. Max pooling and average pooling are applied along the channel dimension to obtain two different feature descriptors, $F^s_{avg}$ and $F^s_{max}$. The two descriptors are concatenated and a spatial attention map is generated by a convolution operation; the first spatial attention unit is computed as:

$$M_s(F) = \sigma\big(f^{7\times 7}([F^s_{avg}; F^s_{max}])\big)$$

where $\sigma$ denotes the sigmoid function and $f^{7\times 7}$ denotes a convolution with a 7 × 7 kernel. The first feature map is then generated from the channel attention map and the spatial attention map.
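A minimal NumPy sketch of the two units may help make the data flow concrete. It assumes the usual convolutional block attention arrangement, in which the channel weights are applied first and the spatial weights second; the text only says that the first feature map is generated from both maps. The weight shapes and function names are illustrative.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(feat, w0, w1):
    """feat: (C, H, W); w0: (C//r, C); w1: (C, C//r). Returns (C, 1, 1)."""
    avg = feat.mean(axis=(1, 2))    # average-pooled descriptor, shape (C,)
    mx = feat.max(axis=(1, 2))      # max-pooled descriptor, shape (C,)

    def mlp(v):
        # Shared two-layer perceptron with ReLU after the first layer (W0).
        return w1 @ np.maximum(w0 @ v, 0.0)

    return sigmoid(mlp(avg) + mlp(mx))[:, None, None]


def spatial_attention(feat, kernel):
    """feat: (C, H, W); kernel: (2, 7, 7). Returns (1, H, W)."""
    pooled = np.stack([feat.mean(axis=0), feat.max(axis=0)])  # (2, H, W)
    kh, kw = kernel.shape[1:]
    padded = np.pad(pooled, ((0, 0), (kh // 2, kh // 2), (kw // 2, kw // 2)))
    h, w = feat.shape[1:]
    out = np.empty((h, w))
    for i in range(h):              # plain loops keep the 7x7 convolution explicit
        for j in range(w):
            out[i, j] = np.sum(padded[:, i:i + kh, j:j + kw] * kernel)
    return sigmoid(out)[None]


def conv_attention(feat, w0, w1, kernel):
    """Apply channel attention, then spatial attention (assumed order)."""
    refined = feat * channel_attention(feat, w0, w1)
    return refined * spatial_attention(refined, kernel)


rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 5, 6))
out = conv_attention(feat, rng.standard_normal((2, 8)),
                     rng.standard_normal((8, 2)), rng.standard_normal((2, 7, 7)))
```

Both attention maps lie in (0, 1) because of the sigmoid, so the module rescales the input features without changing their shape.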

S103: features are extracted from the first feature map by a feature-pyramid-based detector; a dual attention module containing a second spatial attention unit and a second channel attention unit is added to each layer of the top-down pathway of the feature pyramid to compute the degree of association between different features and to model the associations between channels; and the region proposal network at the last layer performs a region-of-interest alignment operation on the resulting second feature map to fix the feature size.

Specifically: in the detector stage, a feature pyramid network is first integrated into Faster-RCNN to strengthen the detector's awareness of whole-image information; the spatial feature pyramid structure is then improved by adding the dual attention module; and finally the original region-of-interest pooling operation used in Faster-RCNN to fix the feature size is replaced by the region-of-interest alignment operation, which works at pixel level with higher precision.

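The loss function of the detection network comprises a classification loss and a regression loss, and is given by:

$$L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)$$

where $i$ is the index of the $i$-th target box; $p_i$ is the predicted probability that the anchor box is a target; $p_i^*$ is 1 when the anchor box is a target and 0 otherwise; $t_i$ is the position coordinates of the predicted box; $t_i^*$ is the coordinates of the ground-truth label; and, as in the standard Faster-RCNN loss, $N_{cls}$ and $N_{reg}$ normalise the two terms and $\lambda$ balances them.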

The bottom-up pathway of the feature pyramid consists of the features produced by the backbone network. The top-down pathway is built as follows: a 1 × 1 convolution for dimensionality reduction is applied to the 2nd bottom-up layer, and the result is added to the upsampled 3rd layer to obtain the 2nd top-down layer; the remaining top-down layers are obtained in the same way. The region proposal network is then applied to the resulting top-down layers to obtain proposals for the regions to be detected.
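One top-down merge step can be sketched in NumPy as below. The 2× nearest-neighbour upsampling and the weight shapes are assumptions borrowed from the common feature pyramid design; the text specifies only the 1 × 1 reduction, the upsampling and the addition.

```python
import numpy as np


def top_down_merge(lower, upper, w_lateral):
    """Merge one pyramid level.

    lower:     (C_in, 2H, 2W) bottom-up feature map of the finer layer
    upper:     (C, H, W) feature map of the coarser layer above it
    w_lateral: (C, C_in) weights of the 1x1 dimensionality-reducing convolution
    """
    # A 1x1 convolution is a per-pixel matrix product over the channel axis.
    lateral = np.einsum('oc,chw->ohw', w_lateral, lower)
    # Nearest-neighbour 2x upsampling of the coarser map (assumed method).
    upsampled = upper.repeat(2, axis=1).repeat(2, axis=2)
    return lateral + upsampled


lower = np.ones((4, 4, 4))          # hypothetical layer-2 features, 4 channels
upper = 3.0 * np.ones((2, 2, 2))    # hypothetical layer-3 features, 2 channels
merged = top_down_merge(lower, upper, np.ones((2, 4)))
```

Applying the same function from the top of the pyramid downwards yields every top-down layer in turn.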

The feature pyramid with the integrated dual attention module extracts features of the objects to be detected from feature maps at different scales. Adding the dual attention mechanism to each top-down layer of the feature pyramid yields feature maps that are more precise and richer in information. The dual attention module introduces self-attention along the spatial dimension and the channel dimension of the features respectively, namely a second position attention unit and a second channel attention unit, and thereby captures the global dependencies of the features effectively.

The second position attention unit uses the association between any two point features to mutually enhance their respective representations. Specifically, a correlation strength matrix between any two point features is first computed: the original feature $A_j$ is reduced in dimension by convolution to obtain features $B_i$, $C_j$ and $D_i$; $B_i$ and $C_j$ are then reshaped and multiplied as matrices to obtain the correlation strength matrix. Normalisation by a softmax operation then gives the attention $S_{ji}$ of each position to every other position; the more similar two point features are, the larger the response value $S_{ji}$. The response values $S_{ji}$ are then used as weights to fuse the features $D_i$, so that each position is fused with similar features across the global space of the feature map. The second position attention unit is computed as:

$$S_{ji} = \frac{\exp(B_i \cdot C_j)}{\sum_{i=1}^{N} \exp(B_i \cdot C_j)}, \qquad E_{j1} = \sum_{i=1}^{N} S_{ji} D_i + A_j$$

where $A_j$ denotes the feature at a given position; $B_i$, $C_j$ and $D_i$ denote the three new feature maps generated by feeding $A_j$ through convolutional layers; $S_{ji}$ denotes the spatial attention map obtained by reshaping $B_i$ and $C_j$, multiplying them as matrices and applying a softmax layer; $N$ is the number of positions; and $E_{j1}$ denotes the position feature map finally output by the second position attention unit.

The second spatial attention unit strengthens the semantic response of specific channels by modelling the associations between channels. The procedure is similar to that of the position attention unit, except that, to obtain the feature attention map $X$, the features of any two channels are reshaped and multiplied as matrices to obtain the correlation strength between the two channels, and the attention map between channels is then obtained through a softmax operation. Finally, the channels are fused by weighting with the inter-channel attention map, so that global associations are formed among all channels and features with a stronger semantic response are obtained. This channel attention unit is computed as:

$$x_{ji} = \frac{\exp(A_i \cdot A_j)}{\sum_{i=1}^{C} \exp(A_i \cdot A_j)}, \qquad E_{j2} = \sum_{i=1}^{C} x_{ji} A_i + A_j$$

where $A_j$ denotes the feature of a given channel; $x_{ji}$ denotes the channel attention map obtained by multiplying $A_j$ with its transpose $A_i$ and applying a softmax layer; $C$ is the number of channels; and $E_{j2}$ denotes the spatial feature map finally output by the second spatial attention unit.
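The two self-attention units can be written compactly in NumPy once the feature map is flattened to a matrix of C channels by N positions. The sketch below follows the description literally and is not the authors' code: no learnable scale factor is applied before the residual addition because the text does not mention one, and fusing the two outputs by element-wise addition is an assumption.

```python
import numpy as np


def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def position_attention(a, wb, wc, wd):
    """a: (C, N) features at N = H*W positions; wb, wc: (C', C); wd: (C, C)."""
    b, c, d = wb @ a, wc @ a, wd @ a     # 1x1 convolutions as matrix products
    energy = b.T @ c                     # (N, N): energy[i, j] = B_i . C_j
    s = softmax(energy, axis=0)          # normalise over i for each position j
    return d @ s + a                     # sum_i S_ji * D_i, plus the input A_j


def channel_attention(a):
    """a: (C, N); each row is the flattened map of one channel."""
    energy = a @ a.T                     # (C, C): energy[i, j] = A_i . A_j
    x = softmax(energy, axis=0)          # normalise over i for each channel j
    return x.T @ a + a                   # sum_i x_ji * A_i, plus the input A_j


def dual_attention(a, wb, wc, wd):
    """Fuse both branches by element-wise addition (assumed fusion rule)."""
    return position_attention(a, wb, wc, wd) + channel_attention(a)


rng = np.random.default_rng(0)
a = rng.standard_normal((4, 6))          # 4 channels, 6 positions
fused = dual_attention(a, rng.standard_normal((2, 4)),
                       rng.standard_normal((2, 4)), rng.standard_normal((4, 4)))
```

The position branch builds an N × N attention map and the channel branch a C × C one, which is why the position branch dominates the memory cost on large feature maps.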

In the target detection algorithm, the region proposal network produces candidate boxes for the objects to be detected, and region-of-interest pooling is then used to map candidate regions of different sizes onto a feature map of fixed size. Region-of-interest pooling has two obvious disadvantages: an error arises when the candidate box boundaries are quantised to integer coordinates, and a further error arises when floating-point values are rounded during pooling. The accumulated error shifts the coordinates of the candidate box and degrades detection. Because the data set concerns vehicle detection in UAV aerial images, where each target occupies an extremely small proportion of the image, region-of-interest pooling is replaced with the region-of-interest alignment operation, which works at pixel level with higher precision. The quantisation is removed, and the image values at floating-point coordinates are obtained by bilinear interpolation, so that the whole feature aggregation process becomes a continuous operation.
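The idea of sampling at floating-point coordinates can be shown with a small pure-Python sketch for a single-channel feature map. The output size, the number of samples per bin and the use of averaging within each bin are illustrative assumptions; the text states only that quantisation is removed and bilinear interpolation is used.

```python
import math


def bilinear(feat, y, x):
    """Sample a 2-D list-of-lists feature map at a fractional (y, x)."""
    h, w = len(feat), len(feat[0])
    y = min(max(y, 0.0), h - 1.0)        # clamp to the valid range
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feat[y0][x0] * (1 - dx) + feat[y0][x1] * dx
    bottom = feat[y1][x0] * (1 - dx) + feat[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_align(feat, roi, out_size=2, samples=2):
    """Average samples x samples bilinear samples inside each output bin.

    roi = (x1, y1, x2, y2) in feature-map coordinates, kept as floats so that
    no boundary is rounded to an integer.
    """
    x1, y1, x2, y2 = roi
    bin_h = (y2 - y1) / out_size
    bin_w = (x2 - x1) / out_size
    out = []
    for by in range(out_size):
        row = []
        for bx in range(out_size):
            total = 0.0
            for sy in range(samples):
                for sx in range(samples):
                    yy = y1 + (by + (sy + 0.5) / samples) * bin_h
                    xx = x1 + (bx + (sx + 0.5) / samples) * bin_w
                    total += bilinear(feat, yy, xx)
            row.append(total / samples ** 2)
        out.append(row)
    return out


centre = bilinear([[0, 1], [2, 3]], 0.5, 0.5)      # midpoint of four pixels
flat = roi_align([[5.0] * 4 for _ in range(4)], (0.3, 0.7, 2.9, 3.1))
```

Because neither the region boundaries nor the sample points are rounded, a small shift of the candidate box produces a correspondingly small change in the pooled features, which matters for targets that cover only a few pixels.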

S104: after the region-of-interest alignment operation has been performed on the second feature map and features of fixed size have been obtained, two 1024-dimensional fully connected layers are connected and then split into two branches, which form the target classification module and the bounding-box regression module respectively; the regions of interest at the different scales of the feature pyramid are classified and their bounding boxes are predicted.

S105: multi-scale image testing is used at test time. In addition to the original images of the test set, copies enlarged to 1.5 times the original size are used; the images at both scales are divided into blocks and input separately into the deep network to obtain detection results at each scale; and the detection results of the two scales are merged by the global non-maximum suppression fusion algorithm to improve detection accuracy.

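The global integrated non-maximum suppression algorithm proceeds as follows:

Step 1: globally align the coordinates of the prediction boxes from the sub-blocks at each scale;

Step 2: compute the weighted confidence of each detection box and sort the boxes by it;

Step 3: select the bounding box with the highest confidence, add it to the final output list, and delete it from the bounding-box list;

Step 4: compute the areas of all bounding boxes;

Step 5: compute the IOU between the highest-confidence bounding box and each remaining candidate box;

Step 6: delete the bounding boxes whose IOU is greater than the threshold;

Step 7: repeat the above process until the bounding-box list is empty.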

Comparative experiments were carried out on the invention. The data set used is the UAV aerial vehicle data set of the "Behcet" information fusion challenge, and the hyper-parameters are set as follows: the maximum number of training epochs is 12 and the batch size is 1; the learning rate is set with a warm-up strategy, with an initial value of 0.3333, and is adjusted gradually to 0.00025 over the first 500 iterations; and the learning rate is reduced at the 8th and 11th epochs.
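One way to read these settings is as a linear warm-up followed by step decay, sketched below. Several points are assumptions and not stated in the text: 0.3333 is interpreted as a warm-up ratio applied to the base rate of 0.00025, the decay factor is taken to be 0.1, and `epoch` counts completed epochs starting from zero.

```python
def learning_rate(iteration, epoch, base_lr=0.00025, warmup_iters=500,
                  warmup_ratio=0.3333, decay_epochs=(8, 11), gamma=0.1):
    """Learning rate at a global iteration index and an epoch index."""
    # Step decay: multiply by gamma once for every decay epoch already reached.
    lr = base_lr * gamma ** sum(epoch >= e for e in decay_epochs)
    if iteration < warmup_iters:
        # Linear warm-up from warmup_ratio * lr up to lr.
        progress = iteration / warmup_iters
        lr *= warmup_ratio + (1 - warmup_ratio) * progress
    return lr


start = learning_rate(0, 0)          # first iteration, inside warm-up
after_warmup = learning_rate(500, 0)
late = learning_rate(400000, 11)     # after both decay steps
```

Under this reading the rate starts at about one third of 0.00025, reaches 0.00025 after 500 iterations, and is divided by ten at each of the two decay epochs.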

The experiments are evaluated by two methods, quantitative analysis and visual analysis.

For the quantitative comparison, precision, recall and the F1 score are used to judge detection accuracy; the F1 score is computed from precision and recall. They are calculated as follows:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

where true positives (TP) are targets that should be detected and are correctly detected, false positives (FP) are detections of objects that should not be detected, and false negatives (FN) are targets that should be detected but are not.
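As a worked example, the three scores can be computed from the raw counts as follows; the function name and the sample counts are illustrative only.

```python
def detection_scores(tp, fp, fn):
    """Return (precision, recall, F1) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 80 vehicles found, 20 spurious boxes, 20 vehicles missed.
precision, recall, f1 = detection_scores(80, 20, 20)
```

The F1 score is the harmonic mean of precision and recall, so it is high only when both are high.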

For the visual comparison, models trained with different detection algorithms are run on the same test image, the detections are rendered with visualization code, and the detection results of the different models on that image are then compared by visual inspection.

Conventional target detection algorithms suffer from low detection precision and poor results on UAV aerial images. The invention uses deep learning and attention mechanisms to establish a multi-scale UAV aerial target detection network driven by feature pyramid dual attention. During feature extraction the attention mechanism is integrated into the spatial pyramid, so that richer and more effective information is extracted and passed to the region proposal network for classification and regression.

The above is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above-mentioned embodiments, and all technical solutions belonging to the idea of the present invention belong to the protection scope of the present invention. It should be noted that modifications and embellishments within the scope of the invention may be made by those skilled in the art without departing from the principle of the invention.

Claims

1. A multi-scale target detection method of aerial images based on spatial pyramid attention drive is characterized by comprising the following steps: the method comprises the following steps:

2. The aerial image multi-scale target detection method based on spatial pyramid attention driving according to claim 1, characterized in that: the step S101 specifically includes:

dividing each image into blocks with a sliding window of 1000 × 1000 pixels and an overlap rate of 0.25; retaining the coordinates of the manually labelled box of each vehicle whose IOU is greater than 0.7; and, for all vehicles in each tile, converting the manually labelled bounding boxes into the coordinate system of that tile.

3. The aerial image multi-scale target detection method based on spatial pyramid attention driving according to claim 2, characterized in that: the step S102 specifically includes:

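inputting the image into a residual network in which the convolutional attention module is embedded, wherein the first channel attention unit compresses the feature map along the spatial dimensions using max pooling and average pooling to obtain two different spatial context descriptors, $F^c_{avg}$ and $F^c_{max}$; the two descriptors are passed through the shared multi-layer perceptron of the network to compute the channel attention map, and the first channel attention unit is computed as:

$$M_c(F) = \sigma\big(W_1(W_0(F^c_{avg})) + W_1(W_0(F^c_{max}))\big)$$

where $W_1$ and $W_0$ are the weights of the multi-layer perceptron, shared between the two inputs, with a ReLU activation following $W_0$; $\sigma$ denotes the sigmoid function; and $F$ denotes the convolutional feature at the corresponding stage of the attention module;

wherein the first spatial attention unit applies max pooling and average pooling along the channel dimension to obtain two different feature descriptors, $F^s_{avg}$ and $F^s_{max}$, and generates the spatial attention map by convolution, the first spatial attention unit being computed as:

$$M_s(F) = \sigma\big(f^{7\times 7}([F^s_{avg}; F^s_{max}])\big)$$

where $\sigma$ denotes the sigmoid function and $f^{7\times 7}$ denotes a convolution with a 7 × 7 kernel;

and the first feature map is then generated from the channel attention map and the spatial attention map.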

4. The aerial image multi-scale target detection method based on spatial pyramid attention driving according to claim 3, characterized in that: the step S103 specifically includes:

extracting features from the first feature map with a feature-pyramid-based detector, and adding a dual attention mechanism containing a second position attention unit and a second spatial attention unit to each layer of the top-down pathway of the feature pyramid;

computing, through the second position attention unit, a correlation strength matrix between the features at any two positions: the original feature $A_j$ is reduced in dimension by convolution to obtain features $B_i$, $C_j$ and $D_i$; $B_i$ and $C_j$ are then reshaped and multiplied as matrices to obtain the correlation strength matrix; a softmax function gives the attention $S_{ji}$ of each position to every other position; $S_{ji}$ is multiplied with feature $D_i$ and the products are fused; and finally the result is added to the original feature $A_j$ to give the position feature map output by the unit, the second position attention unit being computed as:

$$S_{ji} = \frac{\exp(B_i \cdot C_j)}{\sum_{i=1}^{N} \exp(B_i \cdot C_j)}, \qquad E_{j1} = \sum_{i=1}^{N} S_{ji} D_i + A_j$$

where $A_j$ denotes the feature at a given position; $B_i$, $C_j$ and $D_i$ denote the three new features generated from $A_j$ by convolutional dimensionality reduction; $S_{ji}$ denotes the position attention map obtained by reshaping $B_i$ and $C_j$, multiplying them as matrices and applying a softmax layer; $N$ is the number of positions; and $E_{j1}$ denotes the position feature map finally output by the second position attention unit;

reshaping the features of any two channels and multiplying them as matrices through the second spatial attention unit to obtain the correlation strength between the two channels, then computing the attention map between channels, and finally fusing the channels by weighting with that map, so that global associations are formed among the channels and features with a stronger semantic response are obtained, the second spatial attention unit being computed as:

$$x_{ji} = \frac{\exp(A_i \cdot A_j)}{\sum_{i=1}^{C} \exp(A_i \cdot A_j)}, \qquad E_{j2} = \sum_{i=1}^{C} x_{ji} A_i + A_j$$

where $A_j$ denotes the feature of a given channel; $x_{ji}$ denotes the channel attention map obtained by multiplying $A_j$ with its transpose $A_i$ and applying a softmax layer; $C$ is the number of channels; and $E_{j2}$ denotes the spatial feature map finally output by the second spatial attention unit;

and finally, fusing the position feature map and the spatial feature map to obtain the final second feature map, and performing a region-of-interest alignment operation on the second feature map with the region proposal network at the last layer to fix the feature size.

5. The aerial image multi-scale target detection method based on spatial pyramid attention driving according to claim 4, characterized in that: the step S104 specifically includes:

after region-of-interest alignment has been performed on the second feature map and features of fixed size have been obtained, connecting two 1024-dimensional fully connected layers, which then split into two branches that form the target classification module and the bounding-box regression module respectively, and classifying the regions of interest at the different scales of the feature pyramid and predicting their bounding boxes.

6. The aerial image multi-scale target detection method based on spatial pyramid attention driving according to claim 5, characterized in that: the step S105 specifically includes:

using multi-scale image testing at test time, wherein, in addition to the original images of the test set, copies enlarged to 1.5 times the original size are used; the images at both scales are divided into blocks and input separately into the deep network to obtain detection results at each scale; and the detection results of the two scales are merged by the global integrated non-maximum suppression algorithm, improving detection accuracy.

7. The aerial image multi-scale target detection method based on spatial pyramid attention driving of claim 6, characterized in that: the global integrated non-maximum suppression algorithm process is as follows:

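Step 1: globally align the coordinates of the prediction boxes from the sub-blocks at each scale;

Step 2: compute the weighted confidence of each detection box and sort the boxes by it;

Step 3: select the bounding box with the highest confidence, add it to the final output list, and delete it from the bounding-box list;

Step 4: compute the areas of all bounding boxes;

Step 5: compute the IOU between the highest-confidence bounding box and each remaining candidate box;

Step 6: delete the bounding boxes whose IOU is greater than the threshold;

Step 7: repeat the above process until the bounding-box list is empty.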