CN115035429A - Aerial photography target detection method based on composite backbone network and multiple prediction heads - Google Patents

Aerial photography target detection method based on composite backbone network and multiple prediction heads

Info

Publication number
CN115035429A
CN115035429A
Authority
CN
China
Prior art keywords
image
target detection
aerial
target
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210748203.3A
Other languages
Chinese (zh)
Inventor
李馨蔚
何小其
杨根科
褚健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ningbo Institute Of Artificial Intelligence Shanghai Jiaotong University
Original Assignee
Ningbo Institute Of Artificial Intelligence Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ningbo Institute Of Artificial Intelligence, Shanghai Jiaotong University
Priority to CN202210748203.3A
Publication of CN115035429A
Legal status: Pending (current)


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/10 Terrestrial scenes
    • G06V20/17 Terrestrial scenes taken from planes or by drones
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/766 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using regression, e.g. by projecting features on hyperplanes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Abstract

The invention discloses an aerial photography target detection method based on a composite backbone network and multiple prediction heads, relating to the technical field of computer vision target detection. The method comprises the following steps: step 1, preparing a data set of aerial images; step 2, constructing a target detection model; step 3, applying the target detection model to carry out target prediction. Step 2 comprises: step 2.1, establishing the backbone network of the target detection model and extracting multi-scale features; step 2.2, fusing the multi-scale features through a Recursive-FPN network to obtain fusion-enhanced multi-scale features; step 2.3, based on the fusion-enhanced multi-scale features, constructing the target detection model with a TPH network as the detection head; and step 2.4, dividing the data set of aerial images into a training set and a test set in proportion, inputting the training set into the model for training, testing the model with the test set, and determining the final target detection model. The method achieves a balance between aerial image target detection speed and accuracy.

Description

Aerial photography target detection method based on composite backbone network and multiple prediction heads
Technical Field
The invention relates to the technical field of computer vision target detection, in particular to an aerial photography target detection method based on a composite backbone network and multiple prediction heads.
Background
In recent years, with the rapid development of imaging technology, the number of images has grown rapidly and image resolution has kept increasing. Massive high-resolution visible-light image data contain various types of targets, such as airplanes, automobiles and ships, and accurately detecting these targets plays an important role in aerial image interpretation.
Aerial images are usually acquired by a camera mounted on an unmanned aerial vehicle. Compared with images of natural scenes, aerial images suffer from high background complexity, small target size and blurred appearance, so small-target detection remains a challenging research direction. Most target detection algorithms achieve high accuracy and good generalization in general-purpose target detection, but their accuracy is still low on small targets in aerial images.
Remote sensing image target detection algorithms based on traditional machine learning treat detection as a classification-and-regression task. First, regions that may contain a target are generated by region search or sliding-window methods; then features are extracted from these regions by various methods; finally, the extracted features are used to train a target classifier, such as k-nearest neighbors, a support vector machine or a conditional random field. Although traditional machine learning improves detection accuracy to some extent over other image processing methods, the extracted features still require manual design, remain low-level, and lack high-level semantic information. In addition, the detection result lacks secondary refinement of the candidate regions and depends heavily on the candidate-region generation algorithm: too few candidate regions cause missed detections, too many cause false alarms, and both place high demands on storage. Traditional machine learning methods therefore have many limitations in practice and can hardly scale to massive aerial image data.
Deep learning is a branch of machine learning. In recent years, the emergence of massive labeled data and the development of GPU parallel computing have driven breakthroughs in deep learning. Its strong expressive power allows a network to learn not only geometric features but also semantic features, simplifying the image processing pipeline; the network then detects targets using the automatically learned features. Deep-learning-based natural scene detection algorithms can be roughly divided into two types: one-stage and two-stage detection algorithms. One-stage algorithms perform detection with a single network structure and direct regression, and have the advantages of a simple structure and high running speed, but their detection accuracy is usually slightly inferior to that of two-stage algorithms. Two-stage algorithms are region-based: they first propose candidate regions and then perform target discrimination and secondary adjustment of position and scale on them. They have a more complex structure and higher inference latency, but generally achieve higher detection accuracy.
Chen Tianming et al., in the Chinese patent application "A method for rapid detection and identification of small-sample small targets in a complex remote sensing land environment" (publication number CN113963265A), disclose a rapid detection and identification method that constructs a detection and identification network for vehicle targets in complex remote sensing land environments based on an improved Faster R-CNN convolutional neural network architecture. By applying transformations and perturbation-based expansion to the training data and repeatedly training on negative and hard samples, the amount of training data is increased and the network can fully learn target variations, alleviating the weak model generalization and poor accuracy caused by small sample sizes. By adding small-target features and mining hard-sample information, the poor detection performance, high false-alarm rate and low recognition accuracy of Faster R-CNN on small targets are addressed. The RPN and Fast R-CNN share the same 5-layer convolutional neural network, and the network parameters are optimized so that the whole detection process only requires a series of convolution operations, reducing runtime. However, the method works by generating a series of anchors at every position of the image in advance: the adopted convolutional neural network is an anchor-based model.
Zhang Cheng et al., in the Chinese patent application "A center-point-based dense-vehicle detection method for remote sensing images using a deep fully convolutional network" (publication number CN110659601A), propose a dense-vehicle detection method for remote sensing images, mainly addressing the low detection accuracy caused by small, densely arranged targets in existing remote sensing images. The scheme is as follows: acquire a training set and a test set from a remote sensing target detection data set; construct a center-point deep fully convolutional dense-vehicle detection model, and define an overall loss function combining a target center-point classification task and a target-size regression task; input the training set into the constructed network for training to obtain a trained vehicle detection model; input the test set into the trained vehicle target detection model, and predict and output the target center position and target size. The method reduces the influence of target size on localization in dense scenes, improves the recall rate of dense vehicle targets in remote sensing images, and improves vehicle detection accuracy. It can be used for urban planning, traffic flow control, traffic supervision and military reconnaissance. However, the method is only suitable for dense vehicle detection, and a certain degree of information loss remains, so the network cannot fully learn all the characteristics of each target.
The good performance of deep learning algorithms in visible-light remote sensing target detection proves their huge potential, but the following problems still exist in practical applications:
1. Most neural-network-based aerial image target detection frameworks first extract target features and then recognize the extracted features. If the feature extraction process is disturbed, subsequent recognition accuracy also suffers. In aerial images, however, the target occupies only a small part of the frame and most of the area is background; complex background information can drown the target information, making it hard for the detector to extract target features;
2. In the bird's-eye view of aerial images, some classes of targets appear densely arranged, and in this case it is difficult for the detector to accurately distinguish each target. Although existing detection algorithms provide some solutions, they often incur a certain degree of information loss, so the network cannot fully learn all the characteristics of each target;
3. At present, most convolutional neural network algorithms use an anchor-based model as the detection head. Although anchors provide a strong prior for the neural network, speed up training and ease convergence, anchor design depends heavily on human experience, and a poor design degrades the final detection result. Anchors must also be continually re-tuned as the data change, which greatly reduces the generalization of the detection algorithm;
4. To locate targets in an image, most existing methods generate a series of anchors at every image position in advance and, during training, decide whether an anchor is a positive or negative sample with a fixed threshold. This causes an imbalance between positive and negative samples for targets of different sizes during model training.
Therefore, those skilled in the art are dedicated to developing a new aerial image target detection method that solves the above problems of existing deep learning algorithms in the practical application of aerial image target detection.
Disclosure of Invention
In view of the above defects in the prior art, the technical problem to be solved by the present invention is how to effectively extract the multi-scale features of an aerial image and then perform multi-scale fusion, thereby improving the accuracy of target detection.
In order to achieve the above purpose, the present invention provides an aerial image target detection method based on the combination of a composite Swin Transformer (Shifted-windows Transformer) backbone network and TPH (Transformer Prediction Head). First, a composite Swin Transformer is used as the backbone network to form hierarchical feature-map representations and extract the multi-scale features of the aerial image; then a Recursive-FPN (Recursive Feature Pyramid Network) performs multi-scale fusion on the input features extracted by the backbone; finally, a TPH network is used as the detection head to output predicted target detection results at multiple scales of the aerial image. During model training, an OTA (Optimal Transport Assignment) strategy is used to assign training samples, improving model convergence efficiency and target detection accuracy. The composite backbone network is CBNet.
The invention provides an aerial photography target detection method based on a composite backbone network and multiple prediction heads, which comprises the following steps:
step 1, preparing a data set of aerial images;
step 2, constructing a target detection model;
step 3, applying the target detection model to carry out target prediction;
wherein the step 2 comprises the following substeps:
step 2.1, establishing the backbone network of the target detection model by compositely connecting Swin Transformers, and extracting multi-scale features;
2.2, fusing the multi-scale features through a Recursive-FPN network to obtain the fused and enhanced multi-scale features;
2.3, based on the multi-scale features after fusion enhancement, using a TPH network as a detection head to construct the target detection model;
step 2.4, dividing the data set of the aerial image into a training set and a test set according to a proportion, inputting the training set into the target detection model for training, testing the target detection model by using the test set, and determining the target detection model; wherein, the division of the positive and negative samples during training follows the OTA strategy.
Further, in the step 2.1, the backbone network comprises several backbones connected in sequence; each backbone has L stages, each stage comprises several convolutional layers, and the feature maps within a stage have the same size; the l-th stage of a backbone performs a nonlinear transformation F_l(·);
A plurality of identical backbones are combined by compositely connecting the stages of adjacent backbones at the same horizontal position;
The backbones are divided into two types: assistant backbones and the lead backbone. The assistant backbones are denoted B_1, B_2, ..., B_{k-1}, and the lead backbone is denoted B_k. The output of an assistant backbone flows through the composite connection to the next backbone as the input of the stage at the same horizontal position, and the output of the last backbone, i.e. the lead backbone, serves as the extracted multi-scale features.
Further, in the step 2.1, the backbone network performs adjacent higher-level composition of the backbones, that is, the output x_{l-1}^k of the (l-1)-th stage of B_k and the output x_l^{k-1} of the l-th stage of B_{k-1} are merged together as the input of the l-th stage of B_k:
x_l^k = F_l^k(x_{l-1}^k + g(x_l^{k-1}));
where g(·) denotes the composite connection: down-sampling and up-sampling operations are applied to x_l^{k-1} before it is used as input to the current l-th stage of the backbone.
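For illustration only, a minimal PyTorch sketch of such a composite connection follows; the 1×1 convolution plus batch normalization for g(·) and the nearest-neighbor resizing are assumptions made for the sketch, not the patent's prescribed implementation:

```python
import torch.nn as nn
import torch.nn.functional as F

class CompositeConnection(nn.Module):
    """Illustrative g(.): project the assistant feature to the receiving
    stage's channel count, then resize (down-/up-sampling) to its spatial
    size so it can be fused with x_{l-1}^k."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        self.norm = nn.BatchNorm2d(out_channels)

    def forward(self, x_assist, target_hw):
        x = self.norm(self.proj(x_assist))
        return F.interpolate(x, size=target_hw, mode="nearest")

def ahlc_input(x_prev, x_assist, g):
    # Fused input of stage l of the lead backbone:
    # x_l^k = F_l^k(x_{l-1}^k + g(x_l^{k-1})); F_l^k is the stage itself.
    return x_prev + g(x_assist, x_prev.shape[-2:])
```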
Further, in the adjacent higher-level composition of the backbones in the step 2.1, the two backbones used for the composite connection are Swin Transformers;
The Swin Transformer realizes image block division with a 7×7 convolution of stride 4, and the feature maps between different stages are down-sampled with a 3×3 convolution of stride 2; in each Swin Transformer block, self-attention is computed within non-overlapping local windows;
Assuming that each local window contains M×M image blocks and the entire image contains h×w image blocks, the computational complexities of the global MSA and the window-based W-MSA are:
Ω(MSA) = 4hwC^2 + 2(hw)^2C;
Ω(W-MSA) = 4hwC^2 + 2M^2hwC;
where h denotes the image height, w the image width (both in image-block units), and C the number of image channels;
The computational complexity of the Swin Transformer is thus linear in the image size.
Further, the Swin Transformer allows cross-window connections;
The window partition is shifted between consecutive Swin Transformer blocks, which use the W-MSA and SW-MSA mechanisms respectively, computed as follows:
ẑ^l = W-MSA(LN(z^{l-1})) + z^{l-1};
z^l = MLP(LN(ẑ^l)) + ẑ^l;
ẑ^{l+1} = SW-MSA(LN(z^l)) + z^l;
z^{l+1} = MLP(LN(ẑ^{l+1})) + ẑ^{l+1};
where LN denotes layer normalization, ẑ^l denotes the output of the W-MSA module of the l-th layer, ẑ^{l+1} denotes the output of the SW-MSA module of the (l+1)-th layer, and z^l and z^{l+1} denote the outputs of the corresponding MLP modules.
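A minimal PyTorch sketch of one such pair of consecutive blocks follows; the W-MSA and SW-MSA attention modules are assumed to be supplied from elsewhere, and the MLP expansion ratio of 4 is an illustrative choice:

```python
import torch.nn as nn

def mlp(dim, ratio=4):
    return nn.Sequential(nn.Linear(dim, ratio * dim), nn.GELU(),
                         nn.Linear(ratio * dim, dim))

class SwinBlockPair(nn.Module):
    """W-MSA block followed by an SW-MSA block, each with pre-LayerNorm,
    an MLP sub-layer and residual connections, mirroring the four
    equations above. `w_msa` and `sw_msa` are externally built
    (shifted-)window attention modules operating on (B, tokens, C)."""
    def __init__(self, dim, w_msa, sw_msa):
        super().__init__()
        self.n1, self.attn1, self.n2, self.mlp1 = (
            nn.LayerNorm(dim), w_msa, nn.LayerNorm(dim), mlp(dim))
        self.n3, self.attn2, self.n4, self.mlp2 = (
            nn.LayerNorm(dim), sw_msa, nn.LayerNorm(dim), mlp(dim))

    def forward(self, z):
        z_hat = self.attn1(self.n1(z)) + z          # ẑ^l   (W-MSA)
        z = self.mlp1(self.n2(z_hat)) + z_hat       # z^l
        z_hat = self.attn2(self.n3(z)) + z          # ẑ^{l+1} (SW-MSA)
        return self.mlp2(self.n4(z_hat)) + z_hat    # z^{l+1}
```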
Further, in the step 2.3, the detection head part applies Transformer encoder blocks to form the TPH network; each Transformer encoder block comprises two sub-layers: the first sub-layer is a multi-head attention layer, and the second sub-layer, an MLP, is a fully connected layer; each sub-layer uses a residual connection.
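A minimal PyTorch sketch of such an encoder block follows; the channel width, head count and dropout rate are illustrative assumptions:

```python
import torch.nn as nn

class TransformerEncoderBlock(nn.Module):
    """One TPH encoder block: a multi-head self-attention sub-layer and an
    MLP (fully connected) sub-layer, each wrapped in a residual
    connection, applied to a flattened (B, H*W, C) feature map."""
    def __init__(self, dim=256, num_heads=8, mlp_ratio=4, drop=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, dropout=drop,
                                          batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_ratio * dim),
                                 nn.GELU(), nn.Dropout(drop),
                                 nn.Linear(mlp_ratio * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]        # residual over attention
        return x + self.mlp(self.norm2(x))   # residual over MLP
```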
Further, the step 1 comprises the following substeps:
step 1.1, acquiring the aerial image;
step 1.2, carrying out target annotation on the acquired aerial image by using an image annotation tool to obtain an annotation file; the annotation content is the type of the target and the position of the target in the aerial image;
step 1.3, performing data enhancement on the obtained data set, where the enhancement modes include random cropping, random horizontal flipping, random vertical flipping, scale jittering, color jittering, Mosaic and Mixup; the obtained annotation files and the original aerial images form the data set required by the target detection model.
Further, in the step 2.4, multi-scale training is adopted when the training set is input into the target detection model: the input aerial image is resized so that its short edge is between 480 and 800 pixels and its long edge does not exceed 1333 pixels; an SGD optimizer with momentum 0.9 and weight decay 0.005 is used; the model is trained for 100 epochs in total with an initial learning rate of 0.0001; the learning rate is reduced to 1/10 at the 67th and 89th epochs; training uses 8 GPUs with two images per GPU, for a total batch size of 16.
Further, in step 2.4, the data set of the aerial image is divided into a training set and a test set at a ratio of 4:1.
Further, in step 2.4, the training set is input into the target detection model for training, and the total training loss is a weighted sum of the classification loss and the regression loss:
Loss = L_cls + λ·L_reg;
where L_cls is the classification loss, for which focal loss is selected as the loss between the predicted target class and the ground truth class; L_reg is the regression loss, for which GIoU loss is selected as the loss between the predicted bounding box coordinates and the ground truth bounding box coordinates; λ is a weighting factor, 0.5 by default.
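A sketch of this weighted loss, assuming torchvision's focal-loss and GIoU-loss helpers (available in recent torchvision versions) and classification/regression tensors already matched to ground truth:

```python
from torchvision.ops import sigmoid_focal_loss, generalized_box_iou_loss

def detection_loss(cls_logits, cls_targets, pred_boxes, gt_boxes, lam=0.5):
    """Loss = L_cls + lambda * L_reg, with focal loss for classification
    and GIoU loss for box regression, as defined above."""
    l_cls = sigmoid_focal_loss(cls_logits, cls_targets, reduction="mean")
    l_reg = generalized_box_iou_loss(pred_boxes, gt_boxes, reduction="mean")
    return l_cls + lam * l_reg
```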
The aerial photography target detection method based on the composite backbone network and multiple prediction heads provided by the invention has at least the following technical effects:
1. The CBNet method is used to combine Swin Transformers into the backbone network for extracting image features; by connecting the high- and low-level features of two backbones, CBNet can enhance target detection performance. Compared with an ordinary deep learning network, CBNet requires no extra pre-training: it only needs each constituent backbone to be initialized with that backbone's pre-trained model;
2. Cross-scale connections are made through a Recursive-FPN network: extra feedback connections from the FPN are integrated into the bottom-up backbone to form the Recursive-FPN. The feedback connections bring features that directly receive gradients in the detection head (features rich in semantic information) back to the lower feature layers of the backbone, which accelerates training and enhances performance;
3. A TPH network is used as the detection head. Compared with a traditional detection head, it adds Transformer encoder blocks, which increase the ability to capture different local information and can also exploit the self-attention mechanism to mine the feature-representation potential. Because the feature-map resolution at the end of the network is low, applying TPH to low-resolution feature maps reduces computation and storage costs;
4. The label assignment step in the detection method is formulated as a special linear program, i.e. an optimal transport problem, so that finding the best label assignment scheme is converted into finding the best transport plan; the Optimal Transport Assignment (OTA) is then solved by Sinkhorn-Knopp iteration (see the sketch after this list). Compared with traditional label assignment strategies, optimal transport assignment takes global information into account and performs one-to-many assignment dynamically. This assignment strategy provides the model with more high-quality supervision signals, so it converges quickly to the optimal result;
5. Through the above improvements, the aerial image target detection method provided by the invention has stronger robustness and generalization ability and improves aerial image target detection accuracy. Compared with prior-art detection methods, the technical solution of the invention can accurately detect occluded, scale-variant and complex-category targets in aerial images, and achieves a balance between aerial image target detection speed and accuracy.
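The following is a minimal sketch of the Sinkhorn-Knopp iteration referenced in point 4; the cost-matrix construction, regularization constant and iteration count are illustrative assumptions rather than the patent's exact settings:

```python
import torch

def sinkhorn_knopp(cost, supply, demand, eps=0.1, iters=50):
    """Entropy-regularized optimal transport. cost: (num_gt+1, num_anchors)
    matrix of unit transport costs (classification + regression loss, with
    an extra row for background); supply: labels each row may emit;
    demand: one label per anchor. Returns the transport plan pi."""
    K = torch.exp(-cost / eps)                  # Gibbs kernel
    u = torch.ones_like(supply)
    for _ in range(iters):                      # alternate row/col scaling
        v = demand / (K.t() @ u)
        u = supply / (K @ v)
    return u.unsqueeze(1) * K * v.unsqueeze(0)  # pi, same shape as cost
```

Each anchor is then assigned to the ground truth (or background) row that sends it the largest mass in the returned plan.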
The conception, specific structure and technical effects of the present invention are further described below with reference to the accompanying drawings, so that the objects, features and effects of the present invention can be fully understood.
Drawings
FIG. 1 is a schematic flow diagram of a preferred embodiment of the present invention;
FIG. 2 is a schematic diagram of CBNet in a preferred embodiment of the present invention;
FIG. 3 is a block diagram of a backbone network according to a preferred embodiment of the present invention;
FIG. 4 is a schematic diagram of the architecture of the Recursive-FPN network according to a preferred embodiment of the present invention;
fig. 5 is a schematic structural diagram of a TPH network according to a preferred embodiment of the present invention.
Detailed Description
The technical contents of the preferred embodiments of the present invention will be made clear and easily understood with reference to the drawings attached to the specification. The present invention may be embodied in many different forms of embodiments, and the scope of the invention is not limited to the embodiments set forth herein.
The invention provides an aerial photography target detection method based on a composite backbone network and multiple prediction heads, which comprises the following steps (as shown in FIG. 1):
and S1, image acquisition. Acquiring an aerial image shot by a camera carried by an unmanned aerial vehicle;
and S2, labeling the image. And carrying out target annotation on the acquired aerial image by using an image annotation tool. The marked content is the type of the target such as a vehicle, a ship and the position of the target in the image, and the obtained marked file and the original image are used for forming an aerial image data set required by the model;
and S3, enhancing data. And performing data enhancement on the obtained data set. The method mainly comprises the steps of random cutting, random horizontal turning, random vertical turning, scale dithering, color dithering, Mosaic or Mixup;
and S4, constructing an aerial image target detection model. Firstly, establishing a backbone network for extracting multi-scale features through a compound connection Swin Transformer; then, performing multi-scale fusion on the features extracted from the backbone network through a secure-FPN network, and enhancing the features with different resolutions; finally, based on the multi-scale features after fusion enhancement, using a TPH network as a detective head;
and S5, training an aerial image target detection model. Dividing a data set into a training set and a testing set according to a certain proportion, inputting the training set subjected to data enhancement into an aerial image target detection model for training, wherein the division of positive and negative samples during training is based on an OTA strategy;
and S6, predicting the aerial image target detection model. And reasoning by using the trained aerial image target detection model, inputting the aerial image with the concentrated test, and outputting and displaying the detected target type and position.
The invention provides an aerial image target detection method based on a compositely connected Swin Transformer backbone network and TPH, which specifically comprises the following steps:
and S1, image acquisition. Acquiring an aerial image shot by a camera carried by an unmanned aerial vehicle; the annotation content is the type of the identification target and the position of the identification target in the image, and the obtained annotation file and the original image are used for forming an aerial image target data set required by the model;
specifically, first, the unmanned aerial vehicle is aerial-photographed in a weather with good lighting conditions, and then the flight speed of the unmanned aerial vehicle is adjusted to shoot a high-quality aerial image. The aerial image is an RGB three-channel color image.
And S2, labeling the image. And carrying out target annotation on the acquired aerial image by using an image annotation tool.
Specifically, the open-source labeling tool labelImg is used to manually label the targets in the acquired aerial images, producing corresponding xml annotation files. Each aerial image corresponds to one xml annotation file containing target category information and target position information. The markup format of the xml annotation file is the same as that of the PASCAL VOC data set, i.e. it stores the (x, y) coordinates of the upper-left and lower-right corners of the minimum rectangular bounding box enclosing each target, together with the corresponding target class. The aerial images and the corresponding annotation files form the aerial image target detection data set required by the model.
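A minimal sketch of reading one such PASCAL-VOC-style xml file (tag names per the VOC convention) follows:

```python
import xml.etree.ElementTree as ET

def parse_voc_xml(path):
    """Return [(class_name, xmin, ymin, xmax, ymax), ...] for one
    labelImg/PASCAL-VOC annotation file."""
    root = ET.parse(path).getroot()
    objects = []
    for obj in root.iter("object"):
        name = obj.find("name").text
        box = obj.find("bndbox")
        xmin, ymin, xmax, ymax = (int(float(box.find(tag).text))
                                  for tag in ("xmin", "ymin",
                                              "xmax", "ymax"))
        objects.append((name, xmin, ymin, xmax, ymax))
    return objects
```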
And S3, data enhancement. The data set is the basis of aerial image target detection; considering that relatively few aerial targets are captured in actual UAV photography, data enhancement is performed on the obtained data set, mainly including random cropping, random horizontal flipping, random vertical flipping, scale jittering, color jittering, Mosaic and Mixup. During enhancement, if the aerial targets are shifted or transformed, the sizes and coordinates of the annotated bounding boxes are changed correspondingly by formula, ensuring that the targets in the transformed image still match their annotated bounding boxes.
Specifically, random cropping: the image is cropped randomly while ensuring that no aerial target is cut off, and the positions of the original annotated bounding boxes in the cropped image are computed; random horizontal flipping: the image and its annotated bounding boxes are flipped horizontally with probability 0.5; random vertical flipping: the image and its annotated bounding boxes are flipped vertically with probability 0.5; scale jittering: before cropping, the image is randomly resized to 0.5 to 1.5 times the original size, and the annotated bounding boxes are adjusted correspondingly; color jittering: the image is converted from RGB space to HSV space, its value, saturation and hue are changed randomly to simulate pictures under different illumination and colors, and the converted image is transferred back to RGB space; Mosaic: four images, each with its corresponding annotated bounding boxes, are stitched together to obtain a new image and its corresponding annotated bounding boxes; Mixup: two images are fused by weighting in a certain proportion, i.e. each pair of corresponding pixel values is added in that proportion.
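A minimal sketch of two of these operations follows, assuming NumPy HxWxC images and boxes stored as [xmin, ymin, xmax, ymax] rows:

```python
import random
import numpy as np

def random_hflip(image, boxes, p=0.5):
    """Flip the image horizontally with probability p and recompute the
    x-coordinates of the annotated bounding boxes accordingly."""
    if random.random() < p:
        w = image.shape[1]
        image = image[:, ::-1].copy()
        boxes = boxes.copy()
        boxes[:, [0, 2]] = w - boxes[:, [2, 0]]   # new xmin/xmax
    return image, boxes

def mixup(img_a, img_b, alpha=0.5):
    """Pixel-wise weighted fusion of two equally sized images; the two
    images' box annotations are simply concatenated by the caller."""
    blended = (alpha * img_a.astype(np.float32)
               + (1 - alpha) * img_b.astype(np.float32))
    return blended.astype(np.uint8)
```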
And S4, constructing the aerial image target detection model. First, a backbone network for extracting multi-scale features is established by compositely connecting Swin Transformer backbones; then the features extracted by the backbone are fused across scales through the Recursive-FPN network, enhancing features of different resolutions; finally, based on the fusion-enhanced multi-scale features, TPH is used as the detection head.
Specifically, the proposed aerial image target detection model can be divided into three parts; the architecture of each part is shown schematically in FIGS. 2 to 4.
First, the composite backbone network is used to extract the multi-scale features of the image. The composite connection is shown schematically in FIG. 2. The composite backbone network (CBNet) combines multiple identical backbones by compositely connecting the stages of adjacent backbones at the same horizontal position. As can be seen from FIG. 3, the overall information flow is from left to right: the output of an assistant backbone (also called higher-level features) flows through the composite connection to the next backbone as the input of the stage at the same horizontal position, and the output of the last backbone (the lead backbone) is used for target detection. As shown in FIG. 2, CBNet mainly contains two types of backbones: the lead backbone B_k and the assistant backbones B_1, B_2, ..., B_{k-1}. Each backbone has L stages, each stage comprises several convolutional layers, and the feature maps within a stage have the same size. The l-th stage of a backbone performs a nonlinear transformation F_l(·).
As shown in FIG. 2, in CBNet the output x_{l-1}^k of the (l-1)-th stage of B_k and the output x_l^{k-1} of the l-th stage of B_{k-1} are fused together as the input of the l-th stage of B_k:
x_l^k = F_l^k(x_{l-1}^k + g(x_l^{k-1}));
where g(·) denotes the composite connection, which applies down-sampling and up-sampling operations to x_l^{k-1} before it is used as input to the current l-th stage of the backbone. This type of composition is called Adjacent Higher-Level Composition (AHLC), because it feeds the output of a higher-level stage of the adjacent backbone into the next backbone. The two backbones used for the composite connection are Swin Transformers, whose architecture is shown in FIG. 2. The Swin Transformer is inspired by the Transformer in natural language processing, which is known for using a self-attention mechanism to capture long-range dependencies in data. The scale of elements in the visual domain varies widely, and the resolution of pixels in an image is much higher than that of words in a text passage; for this reason, unlike the original model, the Swin Transformer constructs a hierarchical representation of the feature maps. In the present invention, image block division is first implemented with a 7×7 convolution of stride 4, and the feature maps between different stages are then down-sampled with a 3×3 convolution of stride 2. In each Swin Transformer block, self-attention is computed within non-overlapping local windows. Assuming that each local window contains M×M image blocks and the entire image contains h×w image blocks, the computational complexities of the global MSA (Multi-head Self-Attention) and the window-based W-MSA (Window Multi-head Self-Attention) are respectively:
Ω(MSA) = 4hwC^2 + 2(hw)^2C;
Ω(W-MSA) = 4hwC^2 + 2M^2hwC;
where h denotes the image height, w the image width, and C the number of image channels;
From the above, the computational complexity of the Swin Transformer is linear in the image size.
To increase the receptive field of the network toward global self-attention while remaining efficient, cross-window connections are allowed. The window partition is shifted between consecutive Swin Transformer blocks, which use the W-MSA (Window Multi-head Self-Attention) and SW-MSA (Shifted-Window Multi-head Self-Attention) mechanisms respectively. The computation is as follows:
ẑ^l = W-MSA(LN(z^{l-1})) + z^{l-1};
z^l = MLP(LN(ẑ^l)) + ẑ^l;
ẑ^{l+1} = SW-MSA(LN(z^l)) + z^l;
z^{l+1} = MLP(LN(ẑ^{l+1})) + ẑ^{l+1};
where LN denotes layer normalization, ẑ^l denotes the output of the W-MSA module of the l-th layer, ẑ^{l+1} denotes the output of the SW-MSA module of the (l+1)-th layer, and z^l and z^{l+1} denote the outputs of the corresponding MLP modules.
Then, the Recursive-FPN network is used to fuse the features extracted by the compositely connected backbone network across scales and enhance the input features of different resolutions; its architecture is shown schematically in FIG. 4. The Recursive-FPN network is built on the basis of the FPN (Feature Pyramid Network): it is constructed by integrating extra feedback connections from the FPN into the bottom-up backbone, shown as the solid black arrows in the figure. Specifically, these feedback connections bring features in the detection head that directly receive gradients (features rich in semantic information) back into the lower feature layers of the backbone, thereby accelerating training and enhancing performance.
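For illustration, a two-pass sketch of this feedback scheme follows; the 1×1 feedback convolutions and the assumption that the backbone accepts optional per-stage residual inputs are choices made for the sketch, not the patent's prescribed implementation:

```python
import torch.nn as nn

class RecursiveFPN(nn.Module):
    """Run backbone+FPN once, feed the FPN outputs back into the
    bottom-up backbone through per-level 'feedback' convs (the solid
    arrows in FIG. 4), then run a second pass."""
    def __init__(self, backbone, fpn, channels):
        super().__init__()
        self.backbone, self.fpn = backbone, fpn
        self.feedback = nn.ModuleList(
            nn.Conv2d(c, c, kernel_size=1) for c in channels)

    def forward(self, x):
        feats = self.fpn(self.backbone(x))              # first pass
        fb = [conv(f) for conv, f in zip(self.feedback, feats)]
        return self.fpn(self.backbone(x, extra=fb))     # second pass
```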
Finally, a TPH network is adopted as the detection head; its architecture is shown in FIG. 5. TPH is formed by applying Transformer encoder blocks to the detection head part. A Transformer encoder block can capture global information and rich context information. Each Transformer encoder block comprises two sub-layers: the first sub-layer is a multi-head attention layer, and the second sub-layer, an MLP (Multi-Layer Perceptron), is a fully connected layer; residual connections are used between the sub-layers. The Transformer encoder block increases the ability to capture different local information and can also exploit the self-attention mechanism to mine the feature-representation potential. Because the feature-map resolution at the end of the network is low, applying TPH to low-resolution feature maps reduces computation and storage costs.
And S5, training the aerial image target detection model. The data set is divided into a training set and a test set in a certain proportion, the data-enhanced training set is input into the aerial image target detection model for training, and positive and negative samples during training are divided according to the OTA strategy.
Specifically, the aerial image data set obtained in S2 is divided into a training set and a test set at a ratio of 4:1. The divided training set is enhanced with the data enhancement strategy of S3 and input into the target detection model constructed in S4 for training. The total training loss of the detection model is a weighted sum of the classification loss and the regression loss: Loss = L_cls + λ·L_reg. L_cls is the classification loss, for which focal loss is chosen as the loss between the predicted target class and the ground truth class. L_reg is the regression loss, for which GIoU loss is chosen as the loss between the predicted bounding box coordinates and the ground truth bounding box coordinates. λ is a weighting factor, 0.5 by default.
During training, positive and negative samples are divided according to the optimal transport assignment (OTA) strategy. Specifically, the unit transport cost between each anchor point and each ground truth (or the background) is defined as the sum of the classification loss and the regression loss, so that finding the best label assignment scheme is converted into finding the best transport plan. Once the unit transport cost is defined, the optimal transport plan can be solved quickly and effectively by Sinkhorn-Knopp iteration. By considering context information from a global perspective, the optimal transport assignment strategy dynamically assigns labels of targets of various sizes, shapes and categories in a one-to-many manner.
The hyper-parameters for model training are set as follows: multi-scale training is adopted, resizing the input image so that its short edge is between 480 and 800 pixels and its long edge does not exceed 1333 pixels; an SGD optimizer with momentum 0.9 and weight decay 0.005 is used; the model is trained for 100 epochs in total with an initial learning rate of 0.0001; the learning rate is reduced to 1/10 at the 67th and 89th epochs; training uses 8 GPUs with two images per GPU, for a total batch size of 16.
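A configuration sketch matching these hyper-parameters follows; `model` and `train_loader` are assumed to exist (with the model returning its training loss), and the 8-GPU distributed setup is omitted for brevity:

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=1e-4,
                            momentum=0.9, weight_decay=0.005)
# Drop the learning rate to 1/10 at epochs 67 and 89 of 100.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[67, 89], gamma=0.1)

for epoch in range(100):
    for images, targets in train_loader:   # total batch size 16
        loss = model(images, targets)      # assumed to return the loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```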
And S6, prediction with the aerial image target detection model. Inference is performed with the trained network model: aerial images from the test set are input, and the detected aerial target types and positions are output and displayed.
Specifically, the aerial images in the divided test set are input into the aerial image target detection model trained in S5. First, the composite Swin Transformer backbone network extracts multi-scale features; the features extracted by the backbone are then fused across scales through the Recursive-FPN network; the TPH network is then used as the detection head to output, for each multi-scale feature map of the Recursive-FPN network, the predicted aerial target probabilities and predicted bounding boxes. For the aerial target predictions output by the TPH network, low-confidence results are filtered out with a confidence threshold of 0.05. The predictions of all layers are then post-processed with Soft-NMS at a threshold of 0.6 to generate the final aerial target predictions and display the types and positions of the targets in the aerial image.
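A minimal sketch of this post-processing (confidence filtering plus linear Soft-NMS) follows; it is an illustrative single-class implementation relying on torchvision's box_iou, not the patent's exact code:

```python
import torch
from torchvision.ops import box_iou

def soft_nms(boxes, scores, iou_thresh=0.6, score_thresh=0.05):
    """Linear Soft-NMS: decay the scores of boxes overlapping the current
    best box by (1 - IoU) instead of removing them, then drop boxes whose
    score falls below the confidence threshold."""
    boxes, scores = boxes.clone(), scores.clone()
    keep_boxes, keep_scores = [], []
    while scores.numel() > 0:
        i = int(torch.argmax(scores))
        keep_boxes.append(boxes[i])
        keep_scores.append(scores[i])
        mask = torch.ones_like(scores, dtype=torch.bool)
        mask[i] = False
        boxes, scores = boxes[mask], scores[mask]
        if scores.numel() == 0:
            break
        ious = box_iou(keep_boxes[-1].unsqueeze(0), boxes).squeeze(0)
        decay = torch.where(ious > iou_thresh, 1.0 - ious,
                            torch.ones_like(ious))
        scores = scores * decay
        alive = scores > score_thresh
        boxes, scores = boxes[alive], scores[alive]
    return torch.stack(keep_boxes), torch.stack(keep_scores)
```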
The technical solution provided by the invention significantly alleviates the low accuracy of existing aerial image target detection methods, especially their insufficient small-target detection, generalization and robustness, and achieves a better balance between aerial image target detection speed and accuracy.
The foregoing describes the preferred embodiments of the invention in detail. It should be understood that numerous modifications and variations can be devised by those of ordinary skill in the art in light of the present teachings without departing from the inventive concept. Therefore, technical solutions that can be obtained by those skilled in the art through logical analysis, reasoning or limited experimentation on the basis of the prior art and according to the concept of the present invention shall fall within the scope of protection defined by the claims.

Claims (10)

1. An aerial photography target detection method based on a composite backbone network and multiple prediction heads, characterized by comprising the following steps:
step 1, preparing a data set of aerial images;
step 2, constructing a target detection model;
step 3, applying the target detection model to carry out target prediction;
wherein the step 2 comprises the following substeps:
step 2.1, establishing the backbone network of the target detection model by compositely connecting Swin Transformers, and extracting multi-scale features;
2.2, fusing the multi-scale features through a Recursive-FPN network to obtain the fused and enhanced multi-scale features;
2.3, based on the multi-scale features after fusion enhancement, constructing the target detection model by using a TPH network as a detection head;
step 2.4, dividing the data set of the aerial image into a training set and a test set according to a proportion, inputting the training set into the target detection model for training, testing the target detection model by using the test set, and determining the target detection model; wherein, the division of the positive and negative samples during training follows the OTA strategy.
2. The aerial photography target detection method based on a composite backbone network and multiple prediction heads according to claim 1, wherein in the step 2.1, the backbone network comprises a plurality of backbones connected in sequence; each backbone has L stages, each stage comprises several convolutional layers, and the feature maps within a stage have the same size; the l-th stage of a backbone performs a nonlinear transformation F_l(·);
a plurality of identical backbones are combined by compositely connecting the stages of adjacent backbones at the same horizontal position;
the backbones are divided into two types: assistant backbones and the lead backbone; the assistant backbones are denoted B_1, B_2, ..., B_{k-1}, and the lead backbone is denoted B_k; the output of an assistant backbone flows through the composite connection to the next backbone as the input of the stage at the same horizontal position, and the output of the last backbone, i.e. the lead backbone, serves as the extracted multi-scale features.
3. The aerial photography target detection method based on a composite backbone network and multiple prediction heads according to claim 2, wherein in the step 2.1, the backbone network performs adjacent higher-level composition of the backbones, that is, the output x_{l-1}^k of the (l-1)-th stage of B_k and the output x_l^{k-1} of the l-th stage of B_{k-1} are merged together as the input of the l-th stage of B_k:
x_l^k = F_l^k(x_{l-1}^k + g(x_l^{k-1}));
where g(·) denotes the composite connection: down-sampling and up-sampling operations are applied to x_l^{k-1} before it is used as input to the current l-th stage of the backbone.
4. The aerial photography target detection method based on a composite backbone network and multiple prediction heads according to claim 3, wherein in the adjacent higher-level composition of the backbones in the step 2.1, the two backbones used for the composite connection are Swin Transformers;
the Swin Transformer realizes image block division with a 7×7 convolution of stride 4, and the feature maps between different stages are down-sampled with a 3×3 convolution of stride 2; in each Swin Transformer block, self-attention is computed within non-overlapping local windows;
assuming that each local window contains M×M image blocks and the entire image contains h×w image blocks, the computational complexities of the global MSA and the window-based W-MSA are:
Ω(MSA) = 4hwC^2 + 2(hw)^2C;
Ω(W-MSA) = 4hwC^2 + 2M^2hwC;
where h denotes the image height, w the image width, and C the number of image channels;
the computational complexity of the Swin Transformer is linear in the image size.
5. The aerial photography target detection method based on a composite backbone network and multiple prediction heads according to claim 4, wherein the Swin Transformer allows cross-window connections;
the window partition is shifted between consecutive Swin Transformer blocks, which use the W-MSA and SW-MSA mechanisms respectively, computed as follows:
ẑ^l = W-MSA(LN(z^{l-1})) + z^{l-1};
z^l = MLP(LN(ẑ^l)) + ẑ^l;
ẑ^{l+1} = SW-MSA(LN(z^l)) + z^l;
z^{l+1} = MLP(LN(ẑ^{l+1})) + ẑ^{l+1};
where LN denotes layer normalization, ẑ^l denotes the output of the W-MSA module of the l-th layer, ẑ^{l+1} denotes the output of the SW-MSA module of the (l+1)-th layer, and z^l and z^{l+1} denote the outputs of the corresponding MLP modules.
6. The aerial photography target detection method based on a composite backbone network and multiple prediction heads according to claim 1, wherein in the step 2.3, the detection head part applies Transformer encoder blocks to form the TPH network; each Transformer encoder block comprises two sub-layers: the first sub-layer is a multi-head attention layer, and the second sub-layer, an MLP, is a fully connected layer; each sub-layer uses a residual connection.
7. The aerial photography target detection method based on a composite backbone network and multiple prediction heads according to claim 1, wherein the step 1 comprises the following substeps:
step 1.1, acquiring the aerial image;
step 1.2, carrying out target annotation on the acquired aerial image with an image annotation tool to obtain an annotation file; the annotation content is the type of the target and the position of the target in the aerial image;
step 1.3, performing data enhancement on the obtained data set, where the enhancement modes include random cropping, random horizontal flipping, random vertical flipping, scale jittering, color jittering, Mosaic and Mixup; the obtained annotation files and the original aerial images form the data set required by the target detection model.
8. The aerial photography target detection method based on a composite backbone network and multiple prediction heads according to claim 1, wherein in the step 2.4, multi-scale training is adopted in the process of inputting the training set into the target detection model: the input aerial image is resized so that its short edge is between 480 and 800 pixels and its long edge does not exceed 1333 pixels; an SGD optimizer with momentum 0.9 and weight decay 0.005 is used; the model is trained for 100 epochs in total with an initial learning rate of 0.0001; the learning rate is reduced to 1/10 at the 67th and 89th epochs; training uses 8 GPUs with two images per GPU, for a total batch size of 16.
9. The aerial photography target detection method based on a composite backbone network and multiple prediction heads according to claim 1, wherein in the step 2.4, the data set of the aerial image is divided into a training set and a test set at a ratio of 4:1.
10. The aerial photography target detection method based on a composite backbone network and multiple prediction heads according to claim 1, wherein in the step 2.4, the training set is input into the target detection model for training, and the total training loss is a weighted sum of the classification loss and the regression loss:
Loss = L_cls + λ·L_reg;
where L_cls is the classification loss, for which focal loss is selected as the loss between the predicted target class and the ground truth class; L_reg is the regression loss, for which GIoU loss is selected as the loss between the predicted bounding box coordinates and the ground truth bounding box coordinates; λ is a weighting factor, 0.5 by default.
CN202210748203.3A 2022-06-29 2022-06-29 Aerial photography target detection method based on composite backbone network and multiple prediction heads Pending CN115035429A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210748203.3A CN115035429A (en) 2022-06-29 2022-06-29 Aerial photography target detection method based on composite backbone network and multiple measuring heads

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210748203.3A CN115035429A (en) 2022-06-29 2022-06-29 Aerial photography target detection method based on composite backbone network and multiple measuring heads

Publications (1)

Publication Number Publication Date
CN115035429A true CN115035429A (en) 2022-09-09

Family

ID=83127226

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210748203.3A Pending CN115035429A (en) Aerial photography target detection method based on composite backbone network and multiple prediction heads

Country Status (1)

Country Link
CN (1) CN115035429A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116895029A (en) * 2023-09-11 2023-10-17 山东开泰抛丸机械股份有限公司 Aerial image target detection method and aerial image target detection system based on improved YOLO V7
CN116895029B (en) * 2023-09-11 2023-12-19 山东开泰抛丸机械股份有限公司 Aerial image target detection method and aerial image target detection system based on improved YOLO V7

Similar Documents

Publication Publication Date Title
CN112308019B (en) SAR ship target detection method based on network pruning and knowledge distillation
CN111222396B (en) All-weather multispectral pedestrian detection method
CN114202672A (en) Small target detection method based on attention mechanism
CN112818903A (en) Small sample remote sensing image target detection method based on meta-learning and cooperative attention
CN110580699A (en) Pathological image cell nucleus detection method based on improved fast RCNN algorithm
US11308714B1 (en) Artificial intelligence system for identifying and assessing attributes of a property shown in aerial imagery
CN111079739B (en) Multi-scale attention feature detection method
CN111967480A (en) Multi-scale self-attention target detection method based on weight sharing
CN111985376A (en) Remote sensing image ship contour extraction method based on deep learning
CN112150493A (en) Semantic guidance-based screen area detection method in natural scene
CN112347895A (en) Ship remote sensing target detection method based on boundary optimization neural network
CN114565860B (en) Multi-dimensional reinforcement learning synthetic aperture radar image target detection method
CN112149547A (en) Remote sensing image water body identification based on image pyramid guidance and pixel pair matching
CN113313082B (en) Target detection method and system based on multitask loss function
CN114359245A (en) Method for detecting surface defects of products in industrial scene
CN110659601A (en) Depth full convolution network remote sensing image dense vehicle detection method based on central point
CN113052108A (en) Multi-scale cascade aerial photography target detection method and system based on deep neural network
CN111881984A (en) Target detection method and device based on deep learning
CN115861756A (en) Earth background small target identification method based on cascade combination network
Fan et al. A novel sonar target detection and classification algorithm
Sun et al. IRDCLNet: Instance segmentation of ship images based on interference reduction and dynamic contour learning in foggy scenes
CN115035429A (en) Aerial photography target detection method based on composite backbone network and multiple prediction heads
CN110633727A (en) Deep neural network ship target fine-grained identification method based on selective search
CN116342894B (en) GIS infrared feature recognition system and method based on improved YOLOv5
US20230298335A1 (en) Computer-implemented method, data processing apparatus and computer program for object detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination