CN111401201B - Aerial image multi-scale target detection method based on spatial pyramid attention drive


Info

Publication number
CN111401201B
Authority
CN
China
Prior art keywords
attention
feature
spatial
unit
channel
Prior art date
Legal status (assumed; not a legal conclusion)
Active
Application number
CN202010164167.7A
Other languages
Chinese (zh)
Other versions
CN111401201A (en)
Inventor
孙玉宝
辛宇
徐宏伟
陈勋豪
周旺平
Current Assignee (as listed; may be inaccurate)
Nanjing University of Information Science and Technology
Original Assignee
Nanjing University of Information Science and Technology
Priority date (assumed; not a legal conclusion)
Filing date
Publication date
Application filed by Nanjing University of Information Science and Technology filed Critical Nanjing University of Information Science and Technology
Priority to CN202010164167.7A
Publication of CN111401201A
Application granted
Publication of CN111401201B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • G06V20/13Satellite images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Remote Sensing (AREA)
  • Astronomy & Astrophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-scale target detection method for aerial images driven by spatial pyramid attention, which comprises the following steps: first, for the large-size images in the data set, the training data are augmented by a blocking (tiling) method; a residual network whose feature representations are enhanced by convolutional attention is designed as the backbone network to extract image features efficiently; a spatial pyramid attention module is then constructed so that the network can more accurately focus on targets of different scales and extract the regions of interest where targets are located; a target category analysis and target frame regression module is established to classify and predict the regions of interest at different scales; and in the test stage, the trained detection network is used with a multi-scale test strategy, and the detection results at different scales are fused by a global integrated non-maximum suppression algorithm, further improving detection accuracy.

Description

Aerial image multi-scale target detection method based on spatial pyramid attention drive
Technical Field
The invention belongs to the technical field of image recognition and target detection, and particularly relates to an aerial image multi-scale target detection method based on spatial pyramid attention driving.
Background
Target detection, also called target extraction, is a form of image segmentation based on the geometric and statistical characteristics of the target; it combines segmentation and recognition into one step, and its accuracy and real-time performance are important capabilities of the whole system. In complex scenes in particular, where multiple targets must be processed in real time, automatic extraction and recognition of targets is especially important. With the development of computer technology and the wide application of computer vision principles, research on real-time target tracking with computer image processing techniques is increasingly popular, and dynamic real-time tracking and positioning of targets has wide application value in intelligent traffic systems, intelligent monitoring systems, military target detection, surgical instrument positioning in medically navigated surgery, and other areas.
On one hand, many target detection methods have emerged in recent years, such as YOLO, SSD, RetinaNet and the RCNN series, where YOLO, SSD and RetinaNet are single-stage methods while the original RCNN and its successors Fast-RCNN and Faster-RCNN are two-stage methods. The RCNN-series methods first generate candidate boxes and then perform coordinate regression prediction from those candidates, whereas YOLO, SSD and RetinaNet regress coordinates directly, without a candidate-box step.
On the other hand, the visual attention mechanism is a signal processing mechanism specific to human vision: by rapidly scanning the global image, human vision locates the target region that needs attention, i.e. the focus of attention, and then acquires more information about the key characteristics of that target. Models that incorporate an attention mechanism can therefore greatly help improve target detection accuracy.
When detection speed is not the primary concern, two-stage target detection algorithms tend to be more accurate than single-stage ones, which makes them preferable in many settings, such as detecting targets in unmanned aerial vehicle imagery. Therefore, based on deep learning theory and recent attention mechanisms, this patent proposes a multi-scale target detection network driven by feature pyramid dual attention.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides an aerial image multi-scale target detection method based on the spatial pyramid attention drive.
In order to achieve the technical purpose, the invention adopts the following technical scheme:
An aerial image multi-scale target detection method based on spatial pyramid attention drive comprises the following steps:
s101: collecting an unmanned aerial vehicle image set and performing block processing to obtain a large number of cropped patches of consistent size;
s102: inputting the patches into a residual network and extracting features through a convolutional attention module inside the residual network, wherein the convolutional attention module comprises a first channel attention unit and a first spatial attention unit; a channel attention map is computed by the first channel attention unit, a spatial attention map is computed by the first spatial attention unit, and the two maps are combined to generate a first feature map;
s103: extracting features from the first feature map with a feature-pyramid-based detector, adding a dual-attention module comprising a second spatial attention unit and a second channel attention unit to each layer of the top-down part of the feature pyramid, fusing the feature maps generated by the two attention units to obtain a second feature map, and performing a region-of-interest alignment operation on the second feature map produced by the region proposal network in the last layer to fix the feature size;
s104: for the obtained region-of-interest-aligned second feature map, establishing a target category analysis and target frame regression module, and performing classification and target frame prediction on the regions of interest at different scales;
s105: performing a multi-scale image test with the original image and a 1.5× enlarged image, feeding the two scales separately into the deep network for testing, and fusing the results of the different scales through the global integrated non-maximum suppression algorithm to improve detection accuracy.
In order to optimize the technical scheme, the specific measures adopted further comprise:
the step S101 specifically includes: and carrying out sliding window type blocking on the image according to the pixel size of 1000 x 1000, adopting the overlapping rate of 0.25, reserving the coordinate information of the manual marking frame of the vehicle with the IOU more than 0.7, and converting the manually marked boundary frame into the coordinates of the block small drawing for all the vehicles in the blocked image.
The step S102 specifically includes: inputting the picture into a residual network embedded with a convolutional attention module. The first channel attention unit compresses the feature map in the spatial dimension using max pooling and average pooling, obtaining two different spatial context descriptors $F^c_{avg}$ and $F^c_{max}$, from which the channel attention map is computed. The calculation formula of the first channel attention unit is:

$$M_c(F) = \sigma\big(W_1(W_0(F^c_{avg})) + W_1(W_0(F^c_{max}))\big)$$

wherein $W_1$ and $W_0$ are the weights of a multi-layer perceptron shared between the two descriptors, $W_0$ is followed by a ReLU activation function, $\sigma$ denotes the Sigmoid function, and $F$ denotes the feature map at this stage of the attention mechanism.
The first spatial attention unit obtains two different feature descriptors along the channel dimension by max pooling and average pooling, $F^s_{avg}$ and $F^s_{max}$, and generates the spatial attention map by convolution. The calculation formula of the first spatial attention unit is:

$$M_s(F) = \sigma\big(f^{7\times 7}([F^s_{avg}; F^s_{max}])\big)$$

wherein $\sigma$ denotes the Sigmoid function and $f^{7\times 7}$ denotes a convolution with kernel size 7×7.
a first feature map is then generated from the channel attention map and the spatial attention map.
The step S103 specifically includes: extracting features from the first feature map with a feature-pyramid-based detector and adding a dual-attention module containing a second position attention unit and a second spatial attention unit to each layer of the top-down part of the feature pyramid.

The second position attention unit computes an association strength matrix between any two point features: the original feature $A$ is reduced by convolution to features $B$, $C$ and $D$; the dimensions of $B$ and $C$ are then reshaped and their matrix product yields the association strength between any two positions. A softmax function produces the attention $S_{ji}$ of each position towards the other positions; $S_{ji}$ is multiplied with feature $D$ and fused, and the result is finally added to the original feature $A_j$ to obtain the position feature map output by the position attention unit. The calculation formula of the second position attention unit is:

$$S_{ji} = \frac{\exp(B_i \cdot C_j)}{\sum_{i=1}^{N}\exp(B_i \cdot C_j)}, \qquad E_{j1} = \sum_{i=1}^{N} S_{ji} D_i + A_j$$

wherein $A_j$ is the feature at a given position; $B_i$, $C_j$, $D_i$ are the three new features generated from $A$ by convolutional dimension reduction; $S_{ji}$ is the position attention map obtained by matrix multiplication of the reshaped $B$ and $C$ followed by a softmax layer; and $E_{j1}$ is the position feature map finally output by the second position attention unit.
the method comprises the steps of carrying out dimension transformation and matrix multiplication on characteristics of any two channels through a second spatial attention unit to obtain association strength of any two channels, then calculating to obtain a characteristic diagram among the channels, and finally carrying out fusion through weighting of the characteristic diagram among the channels, so that global association can be generated among the channels to obtain characteristics of stronger semantic response, wherein the calculation formula of the second spatial attention unit is as follows:
Figure GDA0004205015830000041
wherein A is j Representing the features corresponding to a given position, x ji Representation A j And A is a j Channel feature map obtained by multiplying transposed 4 of (2) and passing through softmax layer, E j2 A spatial feature map representing a final output of the second spatial attention unit;
Finally, the position feature map and the channel feature map are fused to obtain the final second feature map, which undergoes a region-of-interest alignment operation in the last-layer region proposal network to fix the feature size.
The step S104 specifically includes: after the region-of-interest alignment operation fixes the size of the second feature map, two 1024-dimensional fully connected layers are attached and then split into two branches, establishing respectively a target category analysis module and a target frame regression module that classify the regions of interest at the different scales of the feature pyramid and predict the target frames.
The step S105 specifically includes: in testing, a multi-scale image test is adopted: the test set contains the original image and a 1.5× enlargement of it; both scales are block-processed and then fed separately into the deep network for testing, yielding detection results at each scale; the global non-maximum suppression fusion algorithm then merges the detection results of the two scales, improving detection accuracy.
The global integrated non-maximum suppression algorithm proceeds as follows:
Step 1. Globally align the coordinates of the prediction boxes of the sub-blocks of each scale;
Step 2. Weight and sort the confidences of the detection boxes;
Step 3. Select the bounding box with the highest confidence, add it to the final output list, and delete it from the bounding box list;
Step 4. Calculate the areas of all bounding boxes;
Step 5. Calculate the IoU of the highest-confidence bounding box with the other candidate boxes;
Step 6. Delete the bounding boxes whose IoU is larger than the threshold;
Step 7. Repeat the above process until the bounding box list is empty.
The invention has the following beneficial effects:
The invention uses computer target detection and attention mechanism theory to establish a multi-scale target detection network driven by feature pyramid dual attention. For aerial images whose size is large, whose targets to be detected are small, and whose backgrounds are highly complex, the model first blocks the data set, then exploits the strong feature extraction capability of the dual-attention-driven feature pyramid, and at the same time adopts a multi-scale fusion detection method, merging the detection results of the two scales with the global non-maximum suppression fusion algorithm to obtain the most accurate final detections. The proposed detection network performs well on target detection in aerial pictures and can play a significant role in fields such as geographic environment monitoring, traffic flow control and military activity surveillance.
Drawings
FIG. 1 is a schematic flow chart of an algorithm of the present invention;
FIG. 2 is a flow diagram of a global non-maximum suppression fusion algorithm;
FIG. 3 is a schematic diagram of a portion of a dual attention mechanism driven feature pyramid constructed in accordance with the present invention;
FIG. 4 is a schematic diagram of a detection network of the present invention;
FIG. 5 is a comparison chart of the quantitative analysis on the unmanned aerial vehicle data set according to the present invention.
Detailed Description
Embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
As shown in fig. 1, the invention relates to a method for detecting a multiscale target of an aerial image based on spatial pyramid attention driving, wherein: the method comprises the following steps:
S101, before training, the unmanned aerial vehicle data set used to verify the validity of the designed network is block-processed;
The method comprises the following steps: before the data set is fed to network training, it is preprocessed. The data set used in our experiments comprises 4,355 aerial images with the corresponding coordinates of manually annotated vehicles. Because the unmanned aerial vehicle images are too large, each image is blocked in a sliding-window manner with a window of 1000×1000 pixels, producing a large number of small patches. To reduce, as much as possible, vehicles being split across patch borders, an overlap rate of 0.25 is adopted, and the annotation boxes of vehicles whose visible portion exceeds an IoU of 0.7 are retained; at the same time, for all vehicle instances in a blocked image, the manually annotated bounding boxes are converted into patch coordinates and stored, yielding 48,416 patches of 1000×1000 pixels in total.
S102, the patches are input into a residual network and features are extracted through the convolutional attention module inside the residual network; the convolutional attention module comprises a first channel attention unit and a first spatial attention unit, the first channel attention unit computes a channel attention map, the first spatial attention unit computes a spatial attention map, and the two maps are combined to generate the first feature map.
The method comprises the following steps: the picture first passes through a backbone network, chosen as a residual network with a convolutional attention module embedded in its residual blocks. The convolutional attention module combines spatial and channel attention; its attention maps are multiplied with the input feature map so that features are learned adaptively. After the picture passes through the backbone network, a feature map is generated and sent to the next stage.
The convolutional attention module comprises a first channel attention unit and a first spatial attention unit. The first channel attention unit focuses on what is significant in the input picture: to compute channel attention efficiently, it compresses the feature map in the spatial dimension using max pooling and average pooling, yielding two different spatial context descriptors $F^c_{avg}$ and $F^c_{max}$. A shared network consisting of a multi-layer perceptron computes the channel attention map from these two descriptors, so the calculation formula of the first channel attention unit is:

$$M_c(F) = \sigma\big(W_1(W_0(F^c_{avg})) + W_1(W_0(F^c_{max}))\big)$$

wherein $W_1$ and $W_0$ are the weights of the multi-layer perceptron, shared between the two descriptors, with $W_0$ followed by a ReLU activation function; $\sigma$ denotes the Sigmoid function and $F$ the feature map at this stage of the attention module.
Unlike the first channel attention unit, the first spatial attention unit focuses mainly on position information. Along the channel dimension it first obtains two different feature descriptors using max pooling and average pooling, $F^s_{avg}$ and $F^s_{max}$; the two descriptors are then concatenated and a convolution operation generates the spatial attention map. The calculation formula of the first spatial attention unit is:

$$M_s(F) = \sigma\big(f^{7\times 7}([F^s_{avg}; F^s_{max}])\big)$$

wherein $\sigma$ denotes the Sigmoid function and $f^{7\times 7}$ a convolution with kernel size 7×7. A first feature map is then generated from the channel attention map and the spatial attention map.
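Taken together, the two units form a CBAM-style convolutional attention block. A minimal PyTorch sketch consistent with the two formulas above (the reduction ratio of the shared perceptron is an assumption, as the text does not specify it):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """M_c(F) = sigmoid(W1(W0(F_avg^c)) + W1(W0(F_max^c))), shared W0/W1."""
    def __init__(self, channels, reduction=16):        # reduction is an assumption
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), # W0
            nn.ReLU(inplace=True),                      # ReLU after W0
            nn.Linear(channels // reduction, channels), # W1
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))              # from F_avg^c
        mx = self.mlp(x.amax(dim=(2, 3)))               # from F_max^c
        return torch.sigmoid(avg + mx).view(b, c, 1, 1)

class SpatialAttention(nn.Module):
    """M_s(F) = sigmoid(f^{7x7}([F_avg^s; F_max^s]))."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)               # F_avg^s
        mx, _ = x.max(dim=1, keepdim=True)              # F_max^s
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class ConvAttention(nn.Module):
    """Channel attention first, then spatial attention, both multiplicative."""
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        x = x * self.ca(x)
        return x * self.sa(x)
```

In a residual backbone this block would sit inside each residual unit, its attention maps multiplied back onto the unit's feature map as described above.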
S103, features are extracted from the first feature map by a feature-pyramid-based detector; a dual-attention module containing a second spatial attention unit and a second channel attention unit is added to each layer of the top-down part of the feature pyramid to compute the degree of association between different features and to model the associations between channels; the generated second feature map undergoes a region-of-interest alignment operation in the last-layer region proposal network to fix the feature size.
The method comprises the following steps: in the detector stage, a feature pyramid network is first fused into Faster-RCNN to increase the detector's awareness of whole-image information; a dual-attention module is added to improve the spatial feature pyramid structure; and the region-of-interest pooling operation that fixes the feature size in the original Faster-RCNN is replaced by the pixel-level, higher-precision region-of-interest alignment operation.
The loss function of the detection network comprises a classification loss and a regression loss:

$$L(\{p_i\},\{t_i\}) = \frac{1}{N_{cls}}\sum_i L_{cls}(p_i, p_i^*) + \lambda\,\frac{1}{N_{reg}}\sum_i p_i^*\, L_{reg}(t_i, t_i^*)$$

wherein $i$ indexes the $i$-th anchor box, $p_i$ is the predicted probability that the anchor contains a target, $p_i^*$ is 1 when the anchor is a positive sample and 0 otherwise, $t_i$ is the position coordinates of the prediction box, and $t_i^*$ is the coordinates of the ground-truth label.
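A compact sketch of this two-part loss, assuming binary cross-entropy for the classification term and smooth L1 for the regression term as in standard Faster-RCNN; the balance weight lambda and the normalizers are assumptions consistent with the formula:

```python
import torch
import torch.nn.functional as F

def detection_loss(p, p_star, t, t_star, lam=1.0):
    """L = (1/N_cls) sum_i L_cls(p_i, p_i*) + lam (1/N_reg) sum_i p_i* L_reg(t_i, t_i*).
    p: predicted objectness in [0, 1], shape (N,); p_star: 1.0 for positive
    anchors and 0.0 otherwise; t, t_star: predicted and ground-truth box
    coordinates, shape (N, 4)."""
    cls = F.binary_cross_entropy(p, p_star, reduction="mean")  # averaged over N_cls
    n_pos = p_star.sum().clamp(min=1)                          # N_reg: positive anchors
    reg = (p_star.unsqueeze(1)
           * F.smooth_l1_loss(t, t_star, reduction="none")).sum() / n_pos
    return cls + lam * reg
```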
The bottom-up part of the feature pyramid consists of the features obtained from the backbone network. The operation adopted is: a 1×1 dimension-reduction convolution is applied to layer 2 of the bottom-up part, and the result is added to the upsampled layer 3 (the next-higher level of the top-down path) to obtain layer 2 of the top-down part; the same applies to each subsequent top-down layer. A region proposal network then operates on the resulting top-down layers to obtain detection region proposals.
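The lateral merge just described can be sketched as follows; nearest-neighbor upsampling and the channel counts are assumptions in line with common feature pyramid implementations:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def top_down_merge(c_i, p_up, lateral_conv):
    """Build a top-down level: 1x1-reduce the bottom-up map c_i, then add the
    upsampled next-higher top-down map p_up."""
    lat = lateral_conv(c_i)                            # 1x1 dimension reduction
    up = F.interpolate(p_up, size=lat.shape[-2:], mode="nearest")
    return lat + up

# usage sketch: merging bottom-up layer 2 with top-down layer 3
lateral2 = nn.Conv2d(512, 256, kernel_size=1)          # channel counts assumed
c2 = torch.randn(1, 512, 200, 200)                     # bottom-up layer 2
p3 = torch.randn(1, 256, 100, 100)                     # top-down layer 3
p2 = top_down_merge(c2, p3, lateral2)                  # top-down layer 2
```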
The feature pyramid with the dual-attention module integrated into the residual network extracts the features of the targets to be detected on feature maps of different scales; adding the dual-attention mechanism to each layer of the top-down part of the feature pyramid yields feature maps with higher precision and richer information.
The second position attention unit uses the association between any two point features to mutually enhance the expression of each feature. Specifically, the association strength matrix between any two point features is computed first: the original feature $A$ is reduced by convolution to features $B$, $C$ and $D$; the dimensions of $B$ and $C$ are reshaped and their matrix product gives the association strength matrix between any two positions. Normalization by a softmax operation then yields the attention $S_{ji}$ of each position towards the other positions; the more similar two point features are, the larger the response value $S_{ji}$. The response values $S_{ji}$ are used as weights for a weighted fusion of feature $D$, so that each position fuses similar features across the global space through the attention map. The calculation formula of the second position attention unit is:

$$S_{ji} = \frac{\exp(B_i \cdot C_j)}{\sum_{i=1}^{N}\exp(B_i \cdot C_j)}, \qquad E_{j1} = \sum_{i=1}^{N} S_{ji} D_i + A_j$$

wherein $A_j$ is the feature at a given position; $B_i$, $C_j$, $D_i$ are the new features generated from $A$ by the convolution layers; $S_{ji}$ is the position attention map obtained by matrix multiplication of the reshaped $B$ and $C$ followed by a softmax layer; and $E_{j1}$ is the position feature map finally output by the second position attention unit.
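A minimal PyTorch sketch of such a position attention unit, following the formula for $E_{j1}$ above (the channel-reduction ratio of the 1×1 convolutions is an assumption):

```python
import torch
import torch.nn as nn

class PositionAttention(nn.Module):
    """Each position aggregates the features of all positions, weighted by the
    softmax-normalized similarity S_ji, then adds the original feature A_j."""
    def __init__(self, channels, reduction=8):          # reduction is an assumption
        super().__init__()
        self.to_b = nn.Conv2d(channels, channels // reduction, 1)  # feature B
        self.to_c = nn.Conv2d(channels, channels // reduction, 1)  # feature C
        self.to_d = nn.Conv2d(channels, channels, 1)               # feature D

    def forward(self, a):                               # a: (n, ch, h, w)
        n, ch, h, w = a.shape
        b = self.to_b(a).flatten(2)                     # (n, c', h*w)
        c = self.to_c(a).flatten(2)                     # (n, c', h*w)
        d = self.to_d(a).flatten(2)                     # (n, ch, h*w)
        s = torch.softmax(b.transpose(1, 2) @ c, dim=1) # (n, h*w, h*w): S_ji
        out = (d @ s).view(n, ch, h, w)                 # sum_i S_ji * D_i
        return out + a                                  # residual add of A_j
```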
The second spatial attention unit enhances specific semantic response capabilities across channels by modeling the associations between channels. The process is similar to the position attention module, except that to obtain the attention map $X$, dimension reshaping and matrix multiplication are applied to the features of any two channels, giving the association strength of any two channels; a softmax operation then yields the inter-channel attention map. Finally, fusion is performed by weighting with the inter-channel attention map, so that a global association is generated among the channels and features with stronger semantic responses are obtained. The calculation formula of this unit is:

$$x_{ji} = \frac{\exp(A_i \cdot A_j)}{\sum_{i=1}^{C}\exp(A_i \cdot A_j)}, \qquad E_{j2} = \sum_{i=1}^{C} x_{ji} A_i + A_j$$

wherein $A_j$ is the feature of a given channel, $x_{ji}$ is the channel attention map obtained by multiplying $A_j$ with the transpose $A_i$ and passing through a softmax layer, and $E_{j2}$ is the feature map finally output by the second spatial attention unit.
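Correspondingly, a minimal sketch of the channel-relation computation behind $E_{j2}$; no extra convolutions are assumed, since the formula operates on $A$ directly:

```python
import torch
import torch.nn as nn

class ChannelRelationAttention(nn.Module):
    """The attention map x_ji is the softmax of the channel Gram matrix of A;
    it re-weights the channels of A, which are then added back to A."""
    def forward(self, a):                                # a: (n, ch, h, w)
        n, ch, h, w = a.shape
        flat = a.flatten(2)                              # (n, ch, h*w)
        gram = flat @ flat.transpose(1, 2)               # (n, ch, ch): A_j . A_i
        x = torch.softmax(gram, dim=-1)                  # x_ji, normalized over i
        out = (x @ flat).view(n, ch, h, w)               # sum_i x_ji * A_i
        return out + a                                   # residual add of A_j
```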
In target detection algorithms, region proposal candidate boxes are usually obtained from the region proposal network, and a region-of-interest pooling operation then maps candidate regions of different sizes onto a feature map of fixed size. Region-of-interest pooling, however, has two notable drawbacks: errors arise when candidate box boundaries are quantized to integer coordinates, and floating-point rounding occurs again during pooling. The accumulated error shifts the coordinate position of the candidate box and degrades detection. Because our data set concerns vehicle detection in unmanned aerial vehicle images, where the targets to be detected occupy a very small proportion of the picture, we substitute the pixel-level, higher-precision region-of-interest alignment operation: quantization is cancelled, and bilinear interpolation is used to read image values at floating-point coordinates, turning the whole feature aggregation process into a continuous operation.
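torchvision ships a region-of-interest alignment operator that performs exactly this bilinear, quantization-free pooling; a usage sketch with illustrative sizes (the stride-16 scale and the box coordinates are assumptions):

```python
import torch
from torchvision.ops import roi_align

feat = torch.randn(1, 256, 50, 50)            # a pyramid feature map
# each row: (batch_index, x1, y1, x2, y2) in input-image coordinates
boxes = torch.tensor([[0.0, 100.0, 120.0, 180.0, 200.0]])
# spatial_scale maps image coordinates onto the feature map (stride 16 here);
# sampling points are read off with bilinear interpolation, no rounding
pooled = roi_align(feat, boxes, output_size=(7, 7),
                   spatial_scale=1 / 16, sampling_ratio=2)
print(pooled.shape)                           # torch.Size([1, 256, 7, 7])
```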
S104, after the region-of-interest alignment operation fixes the size of the second feature map, two 1024-dimensional fully connected layers are attached and then split into two branches, establishing respectively a target category analysis module and a target frame regression module that classify and predict the regions of interest at the different scales of the feature pyramid.
S105, a multi-scale image test is adopted: the test set contains the original image and a 1.5× enlargement of it; both scales are block-processed and then fed separately into the deep network for testing, giving detection results at each scale; the global non-maximum suppression fusion algorithm then merges the detection results of the two scales, improving detection accuracy.
The global integrated non-maximum suppression algorithm proceeds as follows:
Step 1. Globally align the coordinates of the prediction boxes of the sub-blocks of each scale;
Step 2. Weight and sort the confidences of the detection boxes;
Step 3. Select the bounding box with the highest confidence, add it to the final output list, and delete it from the bounding box list;
Step 4. Calculate the areas of all bounding boxes;
Step 5. Calculate the IoU of the highest-confidence bounding box with the other candidate boxes;
Step 6. Delete the bounding boxes whose IoU is larger than the threshold;
Step 7. Repeat the above process until the bounding box list is empty.
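A compact NumPy sketch of Steps 1-7, assuming each detection carries the offset of its source patch so that coordinates can be globalized, and with the confidence weighting of Step 2 reduced to a single per-scale weight:

```python
import numpy as np

def global_integrated_nms(dets, iou_thr=0.5):
    """dets: iterable of (x1, y1, x2, y2, score, ox, oy, w) where (ox, oy) is
    the patch offset in the full image and w a per-scale confidence weight.
    Returns fused global boxes with weighted scores after greedy suppression."""
    dets = list(dets)
    if not dets:
        return np.empty((0, 5))
    d = np.asarray(dets, dtype=np.float64)
    boxes = d[:, :4] + np.tile(d[:, 5:7], 2)       # Step 1: globalize coordinates
    scores = d[:, 4] * d[:, 7]                     # Step 2: weight confidences
    order = scores.argsort()[::-1]                 # Step 2: sort descending
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])  # Step 4
    keep = []
    while order.size > 0:                          # Steps 3-7: greedy loop
        i = order[0]
        keep.append(i)                             # Step 3: best box to output
        rest = order[1:]
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, x2 - x1) * np.maximum(0, y2 - y1)
        iou = inter / (areas[i] + areas[rest] - inter)   # Step 5: IoU with best
        order = rest[iou <= iou_thr]               # Step 6: drop high overlaps
    return np.hstack([boxes[keep], scores[keep].reshape(-1, 1)])
```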
A comparison experiment was carried out. The data set used in the experiment is the unmanned aerial vehicle data set of the "bell-type computing cup" information fusion challenge, and the hyperparameters are set as follows: the maximum number of training epochs is 12, the batch size is 1, and the learning rate follows a warm-up strategy: starting from a warm-up factor of 0.3333, it is increased gradually to the base rate of 0.00025 over the first 500 iterations, and is then decayed in the 8th and 11th epochs.
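Read as a standard linear warm-up (taking 0.3333 as a warm-up factor on the 0.00025 base rate, with 10x step decays at epochs 8 and 11; this is an interpretation, since the original wording is ambiguous), the schedule would look like:

```python
def learning_rate(iteration, epoch, base_lr=0.00025,
                  warmup_factor=0.3333, warmup_iters=500):
    """Linear warm-up from warmup_factor * base_lr to base_lr over the first
    warmup_iters iterations, then 10x decays at epochs 8 and 11 (assumed)."""
    if iteration < warmup_iters:
        alpha = iteration / warmup_iters
        return base_lr * (warmup_factor * (1 - alpha) + alpha)
    decay_steps = sum(epoch >= e for e in (8, 11))   # 0, 1 or 2 decays
    return base_lr * (0.1 ** decay_steps)
```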
Two analysis methods, quantitative and visual, were used to evaluate the experiment:

For the quantitative comparison, precision, recall and the F1 score are used to judge detection accuracy, the F1 score serving as the overall measure of the algorithm's detection accuracy. They are calculated as follows:

$$Precision = \frac{TP}{TP + FP}, \qquad Recall = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$$

wherein true positives (TP) are targets to be detected that are correctly detected, false positives (FP) are detections that do not correspond to a target to be detected, and false negatives (FN) are targets to be detected that are missed.
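These three quantities follow directly from the TP/FP/FN counts; a small sketch with an illustrative example:

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall and F1 score from true-positive, false-positive and
    false-negative counts, per the three formulas above."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# illustrative counts: 90 vehicles found correctly, 10 spurious boxes, 20 missed
print(detection_metrics(90, 10, 20))   # (0.9, 0.818..., 0.857...)
```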
The visual comparison means that the same picture to be detected is run through the models trained by the different detection algorithms, the detection results are visualized with our own visualization code, and the detection effects of the different models on the same picture are then compared manually.
In conclusion, conventional target detection algorithms suffer from low detection precision and poor effect on unmanned aerial vehicle aerial images. The invention uses deep learning and attention mechanisms to establish a multi-scale aerial target detection network driven by feature pyramid dual attention; by fusing attention mechanisms into the spatial pyramid during feature extraction, richer and more effective information can be extracted and then sent to the region proposal network for classification and regression.
The above is only a preferred embodiment of the invention, and the protection scope of the invention is not limited to the above examples; all technical solutions within the concept of the invention belong to its protection scope. It should be noted that modifications and adaptations that do not depart from the principles of the invention are also regarded as within the protection scope of the invention.

Claims (4)

1. A multi-scale target detection method for aerial images based on spatial pyramid attention drive, characterized in that the method comprises the following steps:
s101: collecting an unmanned aerial vehicle image set and performing block processing to obtain a large number of cropped patches of consistent size;
s102: inputting the patches into a residual network and extracting features through a convolutional attention module inside the residual network, wherein the convolutional attention module comprises a first channel attention unit and a first spatial attention unit; a channel attention map is computed by the first channel attention unit, a spatial attention map is computed by the first spatial attention unit, and the two maps are combined to generate a first feature map;
s103: extracting features from the first feature map with a feature-pyramid-based detector, adding a dual-attention module comprising a second spatial attention unit and a second channel attention unit to each layer of the top-down part of the feature pyramid, fusing the feature maps generated by the two attention units to obtain a second feature map, and performing a region-of-interest alignment operation on the second feature map produced by the region proposal network in the last layer to fix the feature size;
s104: for the obtained region-of-interest-aligned second feature map, establishing a target category analysis and target frame regression module, and performing classification and target frame prediction on the regions of interest at different scales;
s105: performing a multi-scale image test with the original image and a 1.5× enlarged image, feeding the two scales separately into the deep network for testing, and fusing the results of the different scales through the global integrated non-maximum suppression algorithm to improve detection accuracy;
the step S102 specifically includes:
inputting the picture into a residual network embedded with a convolutional attention module, wherein the first channel attention unit compresses the feature map in the spatial dimension using max pooling and average pooling to obtain two different spatial context descriptors $F^c_{avg}$ and $F^c_{max}$, from which the channel attention map is computed; the calculation formula of the first channel attention unit is:

$$M_c(F) = \sigma\big(W_1(W_0(F^c_{avg})) + W_1(W_0(F^c_{max}))\big)$$

wherein $W_1$ and $W_0$ are the weights of a multi-layer perceptron shared between the two descriptors, $W_0$ is followed by a ReLU activation function, and $\sigma$ denotes the Sigmoid function;
wherein the first spatial attention unit obtains two different feature descriptors along the channel dimension by max pooling and average pooling, $F^s_{avg}$ and $F^s_{max}$, and generates the spatial attention map by convolution; the calculation formula of the first spatial attention unit is:

$$M_s(F) = \sigma\big(f^{7\times 7}([F^s_{avg}; F^s_{max}])\big)$$

wherein $\sigma$ denotes the Sigmoid function and $f^{7\times 7}$ denotes a convolution with kernel size 7×7;
then generating a first feature map from the channel attention map and the spatial attention map;
the step S103 specifically includes:
extracting features from the first feature map with a feature-pyramid-based detector, adding a dual-attention mechanism containing a second position attention unit and a second spatial attention unit to each layer of the top-down part of the feature pyramid;
computing, by the second position attention unit, the association strength matrix between any two point features: the original feature $A$ is reduced by convolution to features $B$, $C$ and $D$; the dimensions of $B$ and $C$ are reshaped and their matrix product gives the association strength matrix between any two positions; a softmax function yields the attention $S_{ji}$ of each position towards the other positions; $S_{ji}$ is multiplied with feature $D$ and fused, and the result is finally added to the original feature $A_j$ to obtain the position feature map output by the position attention unit; the calculation formula of the second position attention unit is:

$$S_{ji} = \frac{\exp(B_i \cdot C_j)}{\sum_{i=1}^{N}\exp(B_i \cdot C_j)}, \qquad E_{j1} = \sum_{i=1}^{N} S_{ji} D_i + A_j$$

wherein $A_j$ is the feature at a given position; $B_i$, $C_j$, $D_i$ are the three new features generated from $A$ by convolutional dimension reduction; $S_{ji}$ is the position attention map obtained by matrix multiplication of the reshaped $B$ and $C$ followed by a softmax layer; and $E_{j1}$ is the position feature map finally output by the second position attention unit;
the method comprises the steps of carrying out dimension transformation and matrix multiplication on any two channel characteristics through a second spatial attention unit to obtain association strength of any two channels, then calculating to obtain attention force diagram among the channels, and finally carrying out fusion through attention force diagram weighting among the channels to enable global association among the channels to be generated, so as to obtain characteristics of stronger semantic response, wherein the calculation formula of the second spatial attention unit is as follows:
Figure FDA0004205015820000023
wherein A is j Representing the features corresponding to a given position, x ji Representation A j And A is a j Is rotated by (a)Set A i Channel feature map obtained by multiplying and passing through softmax layer E j2 A spatial feature map representing a final output of the second spatial attention unit;
finally, fusing the position feature map and the spatial feature map to obtain the final second feature map, performing the region-of-interest alignment operation on the obtained second feature map in the last-layer region proposal network, and fixing the feature size;
the global integrated non-maximum suppression algorithm proceeds as follows:
Step 1. globally aligning the coordinates of the prediction boxes of the sub-blocks of each scale;
Step 2. weighting and sorting the confidences of the detection boxes;
Step 3. selecting the bounding box with the highest confidence, adding it to the final output list, and deleting it from the bounding box list;
Step 4. calculating the areas of all bounding boxes;
Step 5. calculating the IoU of the highest-confidence bounding box with the other candidate boxes;
Step 6. deleting the bounding boxes whose IoU is larger than the threshold;
Step 7. repeating the above process until the bounding box list is empty.
2. The aerial image multi-scale target detection method based on spatial pyramid attention driving according to claim 1, characterized in that the step S101 specifically includes:
performing sliding-window blocking of the image with a window of 1000×1000 pixels and an overlap rate of 0.25, retaining the coordinate information of the manually annotated vehicle boxes whose visible portion exceeds an IoU of 0.7, and, for all vehicles in the blocked image, converting the manually annotated bounding boxes into the coordinates of the cropped patch.
3. The aerial image multi-scale target detection method based on spatial pyramid attention driving according to claim 1, characterized in that the step S104 specifically includes:
after the region-of-interest alignment operation fixes the size of the second feature map, attaching two 1024-dimensional fully connected layers and then splitting them into two branches, establishing respectively a target category analysis module and a target frame regression module that classify the regions of interest at the different scales of the feature pyramid and predict the target frames.
4. The aerial image multi-scale target detection method based on spatial pyramid attention driving according to claim 3, characterized in that the step S105 specifically includes:
in testing, adopting a multi-scale image test: the test set contains the original image and a 1.5× enlargement of it; both scales are block-processed and then fed separately into the deep network for testing, yielding detection results at each scale; and merging the detection results of the two scales with the global non-maximum suppression fusion algorithm, thereby improving detection accuracy.
CN202010164167.7A 2020-03-10 2020-03-10 Aerial image multi-scale target detection method based on spatial pyramid attention drive Active CN111401201B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010164167.7A CN111401201B (en) 2020-03-10 2020-03-10 Aerial image multi-scale target detection method based on spatial pyramid attention drive

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010164167.7A CN111401201B (en) 2020-03-10 2020-03-10 Aerial image multi-scale target detection method based on spatial pyramid attention drive

Publications (2)

Publication Number Publication Date
CN111401201A CN111401201A (en) 2020-07-10
CN111401201B (en) 2023-06-20

Family

ID=71432330

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010164167.7A Active CN111401201B (en) 2020-03-10 2020-03-10 Aerial image multi-scale target detection method based on spatial pyramid attention drive

Country Status (1)

Country Link
CN (1) CN111401201B (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110110751A (en) * 2019-03-31 2019-08-09 华南理工大学 A kind of Chinese herbal medicine recognition methods of the pyramid network based on attention mechanism
CN110084210A (en) * 2019-04-30 2019-08-02 电子科技大学 The multiple dimensioned Ship Detection of SAR image based on attention pyramid network
CN110378242A (en) * 2019-06-26 2019-10-25 南京信息工程大学 A kind of remote sensing target detection method of dual attention mechanism
CN110533045A (en) * 2019-07-31 2019-12-03 中国民航大学 A kind of luggage X-ray contraband image, semantic dividing method of combination attention mechanism
CN110533084A (en) * 2019-08-12 2019-12-03 长安大学 A kind of multiscale target detection method based on from attention mechanism
CN110532955A (en) * 2019-08-30 2019-12-03 中国科学院宁波材料技术与工程研究所 Example dividing method and device based on feature attention and son up-sampling
CN110705457A (en) * 2019-09-29 2020-01-17 核工业北京地质研究院 Remote sensing image building change detection method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Indoor crowd detection network based on multi-level features and hybrid attention mechanisms; 沈文祥 et al.; Journal of Computer Applications; 2019-10-15 (No. 12); full text *
Research progress of image semantic segmentation based on deep learning; 李新叶 et al.; Science Technology and Engineering; 2019-11-28 (No. 33); full text *

Also Published As

Publication number Publication date
CN111401201A (en) 2020-07-10


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant