CN115035429A - Aerial photography target detection method based on composite backbone network and multiple prediction heads - Google Patents

Aerial photography target detection method based on composite backbone network and multiple prediction heads

Info

Publication number
CN115035429A
CN115035429A
Authority
CN
China
Prior art keywords
image
target detection
aerial
target
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210748203.3A
Other languages
Chinese (zh)
Inventor
李馨蔚
何小其
杨根科
褚健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ningbo Institute Of Artificial Intelligence Shanghai Jiaotong University
Original Assignee
Ningbo Institute Of Artificial Intelligence Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ningbo Institute Of Artificial Intelligence, Shanghai Jiaotong University
Priority to CN202210748203.3A
Publication of CN115035429A
Legal status: Pending (current)


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/10 Terrestrial scenes
    • G06V20/17 Terrestrial scenes taken from planes or by drones
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/766 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using regression, e.g. by projecting features on hyperplanes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Abstract

The invention discloses an aerial photography target detection method based on a composite backbone network and multiple prediction heads, relating to the technical field of computer vision target detection. The method comprises the following steps: step 1, preparing a data set of aerial images; step 2, constructing a target detection model; step 3, applying the target detection model to carry out target prediction. Step 2 comprises: step 2.1, establishing the backbone network of the target detection model and extracting multi-scale features; step 2.2, fusing the multi-scale features through a Recursive-FPN network to obtain fusion-enhanced multi-scale features; step 2.3, based on the fusion-enhanced multi-scale features, constructing the target detection model with a TPH network as the detection head; and step 2.4, dividing the data set of aerial images into a training set and a test set in proportion, inputting the training set into the model for training, testing the model with the test set, and determining the final target detection model. The method achieves a balance between aerial image target detection speed and accuracy.

Description

Aerial photography target detection method based on composite backbone network and multiple prediction heads
Technical Field
The invention relates to the technical field of computer vision target detection, in particular to an aerial photography target detection method based on a composite backbone network and multiple prediction heads.
Background
In recent years, with the rapid development of imaging technology, the number of images has grown rapidly and image resolution has kept increasing. Massive high-resolution visible-light image data contain various types of targets, such as airplanes, automobiles and ships, and accurately detecting these targets plays an important role in aerial image interpretation.
Aerial images are usually acquired by a camera mounted on an unmanned aerial vehicle. Compared with images of natural scenes, aerial images suffer from high background complexity, small target size and blurred appearance, so small-target detection remains a challenging research direction. Most target detection algorithms achieve high accuracy and good generalization in general-purpose target detection, but their accuracy is still low on small targets in aerial images.
Remote sensing image target detection algorithms based on traditional machine learning treat detection as a classification-and-regression task. First, regions that may contain a target are generated by region search or sliding-window methods; then features are extracted from these regions by various methods; finally, the extracted features are used to train a target classifier, such as k-nearest neighbors, a support vector machine or a conditional random field. Although traditional machine learning improves detection accuracy to some extent over other image processing methods, the extracted features still require manual design, remain low-level, and lack high-level semantic information. In addition, the detection result lacks secondary refinement of the candidate regions and depends heavily on the candidate-region generation algorithm: too few candidate regions cause missed detections, too many cause false alarms, and both place high demands on storage. Traditional machine learning methods therefore have many limitations in practice and can hardly scale to massive aerial image data.
Deep learning is a branch of machine learning. In recent years, the emergence of massive labeled data and the development of GPU parallel computing have driven breakthroughs in deep learning. Its strong expressive power allows a network to learn not only geometric features but also semantic features, simplifying the image processing pipeline; the network then detects targets using the automatically learned features. Deep-learning-based natural scene detection algorithms can be roughly divided into two types: one-stage and two-stage detection algorithms. One-stage algorithms perform detection with a single network structure and direct regression, and have the advantages of a simple structure and high running speed, but their detection accuracy is usually slightly inferior to that of two-stage algorithms. Two-stage algorithms are region-based: they first propose candidate regions and then perform target discrimination and secondary adjustment of position and scale on them. They have a more complex structure and higher inference latency, but generally achieve higher detection accuracy.
Chen Tianming et al., in the Chinese patent application "A method for rapid detection and identification of small-sample small targets in a complex remote sensing land environment" (publication number CN113963265A), disclose a rapid detection and identification method that constructs a detection and identification network for vehicle targets in complex remote sensing land environments based on an improved Faster R-CNN convolutional neural network architecture. By applying transformations and perturbation-based expansion to the training data and repeatedly training on negative and hard samples, the amount of training data is increased and the network can fully learn target variations, alleviating the weak model generalization and poor accuracy caused by small sample sizes. By adding small-target features and mining hard-sample information, the poor detection performance, high false-alarm rate and low recognition accuracy of Faster R-CNN on small targets are addressed. The RPN and Fast R-CNN share the same 5-layer convolutional neural network, and the network parameters are optimized so that the whole detection process only requires a series of convolution operations, reducing runtime. However, the method works by generating a series of anchors at every position of the image in advance: the adopted convolutional neural network is an anchor-based model.
Zhang Cheng et al., in the Chinese patent application "A center-point-based dense-vehicle detection method for remote sensing images using a deep fully convolutional network" (publication number CN110659601A), propose a dense-vehicle detection method for remote sensing images, mainly addressing the low detection accuracy caused by small, densely arranged targets in existing remote sensing images. The scheme is as follows: acquire a training set and a test set from a remote sensing target detection data set; construct a center-point deep fully convolutional dense-vehicle detection model, and define an overall loss function combining a target center-point classification task and a target-size regression task; input the training set into the constructed network for training to obtain a trained vehicle detection model; input the test set into the trained vehicle target detection model, and predict and output the target center position and target size. The method reduces the influence of target size on localization in dense scenes, improves the recall rate of dense vehicle targets in remote sensing images, and improves vehicle detection accuracy. It can be used for urban planning, traffic flow control, traffic supervision and military reconnaissance. However, the method is only suitable for dense vehicle detection, and a certain degree of information loss remains, so the network cannot fully learn all the characteristics of each target.
The good performance of deep learning algorithms in visible-light remote sensing target detection proves their huge potential, but the following problems still exist in practical applications:
1. Most neural-network-based aerial image target detection frameworks first extract target features and then recognize the extracted features. If the feature extraction process is disturbed, subsequent recognition accuracy also suffers. In aerial images, however, the target occupies only a small part of the frame and most of the area is background; complex background information can drown the target information, making it hard for the detector to extract target features;
2. In the bird's-eye view of aerial images, some classes of targets appear densely arranged, and in this case it is difficult for the detector to accurately distinguish each target. Although existing detection algorithms provide some solutions, they often incur a certain degree of information loss, so the network cannot fully learn all the characteristics of each target;
3. At present, most convolutional neural network algorithms use an anchor-based model as the detection head. Although anchors provide a strong prior for the neural network, speed up training and ease convergence, anchor design depends heavily on human experience, and a poor design degrades the final detection result. Anchors must also be continually re-tuned as the data change, which greatly reduces the generalization of the detection algorithm;
4. To locate targets in an image, most existing methods generate a series of anchors at every image position in advance and, during training, decide whether an anchor is a positive or negative sample with a fixed threshold. This causes an imbalance between positive and negative samples for targets of different sizes during model training.
Therefore, those skilled in the art are dedicated to developing a new aerial image target detection method that solves the above problems of existing deep learning algorithms in the practical application of aerial image target detection.
Disclosure of Invention
In view of the above defects in the prior art, the technical problem to be solved by the present invention is how to effectively extract the multi-scale features of an aerial image and then perform multi-scale fusion, thereby improving the accuracy of target detection.
In order to achieve the above purpose, the present invention provides an aerial image target detection method based on the combination of a composite Swin Transformer (Shifted-windows Transformer) backbone network and TPH (Transformer Prediction Head). First, a composite Swin Transformer is used as the backbone network to form hierarchical feature-map representations and extract the multi-scale features of the aerial image; then a Recursive-FPN (Recursive Feature Pyramid Network) performs multi-scale fusion on the input features extracted by the backbone; finally, a TPH network is used as the detection head to output predicted target detection results at multiple scales of the aerial image. During model training, an OTA (Optimal Transport Assignment) strategy is used to assign training samples, improving model convergence efficiency and target detection accuracy. The composite backbone network is CBNet.
The invention provides an aerial photography target detection method based on a composite backbone network and multiple prediction heads, which comprises the following steps:
step 1, preparing a data set of aerial images;
step 2, constructing a target detection model;
step 3, applying the target detection model to carry out target prediction;
wherein the step 2 comprises the following substeps:
step 2.1, establishing the backbone network of the target detection model by compositely connecting Swin Transformers, and extracting multi-scale features;
2.2, fusing the multi-scale features through a Recursive-FPN network to obtain the fused and enhanced multi-scale features;
2.3, based on the multi-scale features after fusion enhancement, using a TPH network as a detection head to construct the target detection model;
step 2.4, dividing the data set of the aerial image into a training set and a test set according to a proportion, inputting the training set into the target detection model for training, testing the target detection model by using the test set, and determining the target detection model; wherein, the division of the positive and negative samples during training follows the OTA strategy.
Further, in the step 2.1, the backbone network comprises several backbones connected in sequence; each backbone has L stages, each stage comprises several convolutional layers, and the feature maps within a stage have the same size; the l-th stage of a backbone performs a nonlinear transformation F_l(·);
A plurality of identical backbones are combined by compositely connecting the stages of adjacent backbones at the same horizontal position;
The backbones are divided into two types: assistant backbones and the lead backbone. The assistant backbones are denoted B_1, B_2, ..., B_{k-1}, and the lead backbone is denoted B_k. The output of an assistant backbone flows through the composite connection to the next backbone as the input of the stage at the same horizontal position, and the output of the last backbone, i.e. the lead backbone, serves as the extracted multi-scale features.
Further, in the step 2.1, the backbone network performs adjacent higher-level composition of the backbones, that is, the output x_{l-1}^k of the (l-1)-th stage of B_k and the output x_l^{k-1} of the l-th stage of B_{k-1} are merged together as the input of the l-th stage of B_k:
x_l^k = F_l^k(x_{l-1}^k + g(x_l^{k-1}));
where g(·) denotes the composite connection: down-sampling and up-sampling operations are applied to x_l^{k-1} before it is used as input to the current l-th stage of the backbone.
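For illustration only, a minimal PyTorch sketch of such a composite connection follows; the 1×1 convolution plus batch normalization for g(·) and the nearest-neighbor resizing are assumptions made for the sketch, not the patent's prescribed implementation:

```python
import torch.nn as nn
import torch.nn.functional as F

class CompositeConnection(nn.Module):
    """Illustrative g(.): project the assistant feature to the receiving
    stage's channel count, then resize (down-/up-sampling) to its spatial
    size so it can be fused with x_{l-1}^k."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        self.norm = nn.BatchNorm2d(out_channels)

    def forward(self, x_assist, target_hw):
        x = self.norm(self.proj(x_assist))
        return F.interpolate(x, size=target_hw, mode="nearest")

def ahlc_input(x_prev, x_assist, g):
    # Fused input of stage l of the lead backbone:
    # x_l^k = F_l^k(x_{l-1}^k + g(x_l^{k-1})); F_l^k is the stage itself.
    return x_prev + g(x_assist, x_prev.shape[-2:])
```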
Further, in the adjacent higher-level composition of the backbones in the step 2.1, the two backbones used for the composite connection are Swin Transformers;
The Swin Transformer realizes image block division with a 7×7 convolution of stride 4, and the feature maps between different stages are down-sampled with a 3×3 convolution of stride 2; in each Swin Transformer block, self-attention is computed within non-overlapping local windows;
Assuming that each local window contains M×M image blocks and the entire image contains h×w image blocks, the computational complexities of the global MSA and the window-based W-MSA are:
Ω(MSA) = 4hwC^2 + 2(hw)^2C;
Ω(W-MSA) = 4hwC^2 + 2M^2hwC;
where h denotes the image height, w the image width (both in image-block units), and C the number of image channels;
The computational complexity of the Swin Transformer is thus linear in the image size.
Further, the Swin Transformer allows cross-window connections;
The window partition is shifted between consecutive Swin Transformer blocks, which use the W-MSA and SW-MSA mechanisms respectively, computed as follows:
ẑ^l = W-MSA(LN(z^{l-1})) + z^{l-1};
z^l = MLP(LN(ẑ^l)) + ẑ^l;
ẑ^{l+1} = SW-MSA(LN(z^l)) + z^l;
z^{l+1} = MLP(LN(ẑ^{l+1})) + ẑ^{l+1};
where LN denotes layer normalization, ẑ^l denotes the output of the W-MSA module of the l-th layer, ẑ^{l+1} denotes the output of the SW-MSA module of the (l+1)-th layer, and z^l and z^{l+1} denote the outputs of the corresponding MLP modules.
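A minimal PyTorch sketch of one such pair of consecutive blocks follows; the W-MSA and SW-MSA attention modules are assumed to be supplied from elsewhere, and the MLP expansion ratio of 4 is an illustrative choice:

```python
import torch.nn as nn

def mlp(dim, ratio=4):
    return nn.Sequential(nn.Linear(dim, ratio * dim), nn.GELU(),
                         nn.Linear(ratio * dim, dim))

class SwinBlockPair(nn.Module):
    """W-MSA block followed by an SW-MSA block, each with pre-LayerNorm,
    an MLP sub-layer and residual connections, mirroring the four
    equations above. `w_msa` and `sw_msa` are externally built
    (shifted-)window attention modules operating on (B, tokens, C)."""
    def __init__(self, dim, w_msa, sw_msa):
        super().__init__()
        self.n1, self.attn1, self.n2, self.mlp1 = (
            nn.LayerNorm(dim), w_msa, nn.LayerNorm(dim), mlp(dim))
        self.n3, self.attn2, self.n4, self.mlp2 = (
            nn.LayerNorm(dim), sw_msa, nn.LayerNorm(dim), mlp(dim))

    def forward(self, z):
        z_hat = self.attn1(self.n1(z)) + z          # ẑ^l   (W-MSA)
        z = self.mlp1(self.n2(z_hat)) + z_hat       # z^l
        z_hat = self.attn2(self.n3(z)) + z          # ẑ^{l+1} (SW-MSA)
        return self.mlp2(self.n4(z_hat)) + z_hat    # z^{l+1}
```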
Further, in the step 2.3, the detection head part applies Transformer encoder blocks to form the TPH network; each Transformer encoder block comprises two sub-layers: the first sub-layer is a multi-head attention layer, and the second sub-layer, an MLP, is a fully connected layer; each sub-layer uses a residual connection.
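A minimal PyTorch sketch of such an encoder block follows; the channel width, head count and dropout rate are illustrative assumptions:

```python
import torch.nn as nn

class TransformerEncoderBlock(nn.Module):
    """One TPH encoder block: a multi-head self-attention sub-layer and an
    MLP (fully connected) sub-layer, each wrapped in a residual
    connection, applied to a flattened (B, H*W, C) feature map."""
    def __init__(self, dim=256, num_heads=8, mlp_ratio=4, drop=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, dropout=drop,
                                          batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_ratio * dim),
                                 nn.GELU(), nn.Dropout(drop),
                                 nn.Linear(mlp_ratio * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]        # residual over attention
        return x + self.mlp(self.norm2(x))   # residual over MLP
```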
Further, the step 1 comprises the following substeps:
step 1.1, acquiring the aerial image;
step 1.2, carrying out target annotation on the acquired aerial image by using an image annotation tool to obtain an annotation file; the annotation content is the type of the target and the position of the target in the aerial image;
step 1.3, performing data enhancement on the obtained data set, where the enhancement modes include random cropping, random horizontal flipping, random vertical flipping, scale jittering, color jittering, Mosaic and Mixup; the obtained annotation files and the original aerial images form the data set required by the target detection model.
Further, in the step 2.4, multi-scale training is adopted when the training set is input into the target detection model: the input aerial image is resized so that its short edge is between 480 and 800 pixels and its long edge does not exceed 1333 pixels; an SGD optimizer with momentum 0.9 and weight decay 0.005 is used; the model is trained for 100 epochs in total with an initial learning rate of 0.0001; the learning rate is reduced to 1/10 at the 67th and 89th epochs; training uses 8 GPUs with two images per GPU, for a total batch size of 16.
Further, in step 2.4, the data set of the aerial image is divided into a training set and a test set at a ratio of 4:1.
Further, in step 2.4, the training set is input into the target detection model for training, and the total training loss is a weighted sum of the classification loss and the regression loss:
Loss = L_cls + λ·L_reg;
where L_cls is the classification loss, for which focal loss is selected as the loss between the predicted target class and the ground truth class; L_reg is the regression loss, for which GIoU loss is selected as the loss between the predicted bounding box coordinates and the ground truth bounding box coordinates; λ is a weighting factor, 0.5 by default.
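A sketch of this weighted loss, assuming torchvision's focal-loss and GIoU-loss helpers (available in recent torchvision versions) and classification/regression tensors already matched to ground truth:

```python
from torchvision.ops import sigmoid_focal_loss, generalized_box_iou_loss

def detection_loss(cls_logits, cls_targets, pred_boxes, gt_boxes, lam=0.5):
    """Loss = L_cls + lambda * L_reg, with focal loss for classification
    and GIoU loss for box regression, as defined above."""
    l_cls = sigmoid_focal_loss(cls_logits, cls_targets, reduction="mean")
    l_reg = generalized_box_iou_loss(pred_boxes, gt_boxes, reduction="mean")
    return l_cls + lam * l_reg
```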
The aerial photography target detection method based on the composite backbone network and multiple prediction heads provided by the invention has at least the following technical effects:
1. The CBNet method is used to combine Swin Transformers into the backbone network for extracting image features; by connecting the high- and low-level features of two backbones, CBNet can enhance target detection performance. Compared with an ordinary deep learning network, CBNet requires no extra pre-training: it only needs each constituent backbone to be initialized with that backbone's pre-trained model;
2. Cross-scale connections are made through a Recursive-FPN network: extra feedback connections from the FPN are integrated into the bottom-up backbone to form the Recursive-FPN. The feedback connections bring features that directly receive gradients in the detection head (features rich in semantic information) back to the lower feature layers of the backbone, which accelerates training and enhances performance;
3. A TPH network is used as the detection head. Compared with a traditional detection head, it adds Transformer encoder blocks, which increase the ability to capture different local information and can also exploit the self-attention mechanism to mine the feature-representation potential. Because the feature-map resolution at the end of the network is low, applying TPH to low-resolution feature maps reduces computation and storage costs;
4. The label assignment step in the detection method is formulated as a special linear program, i.e. an optimal transport problem, so that finding the best label assignment scheme is converted into finding the best transport plan; the Optimal Transport Assignment (OTA) is then solved by Sinkhorn-Knopp iteration (see the sketch after this list). Compared with traditional label assignment strategies, optimal transport assignment takes global information into account and performs one-to-many assignment dynamically. This assignment strategy provides the model with more high-quality supervision signals, so it converges quickly to the optimal result;
5. Through the above improvements, the aerial image target detection method provided by the invention has stronger robustness and generalization ability and improves aerial image target detection accuracy. Compared with prior-art detection methods, the technical solution of the invention can accurately detect occluded, scale-variant and complex-category targets in aerial images, and achieves a balance between aerial image target detection speed and accuracy.
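The following is a minimal sketch of the Sinkhorn-Knopp iteration referenced in point 4; the cost-matrix construction, regularization constant and iteration count are illustrative assumptions rather than the patent's exact settings:

```python
import torch

def sinkhorn_knopp(cost, supply, demand, eps=0.1, iters=50):
    """Entropy-regularized optimal transport. cost: (num_gt+1, num_anchors)
    matrix of unit transport costs (classification + regression loss, with
    an extra row for background); supply: labels each row may emit;
    demand: one label per anchor. Returns the transport plan pi."""
    K = torch.exp(-cost / eps)                  # Gibbs kernel
    u = torch.ones_like(supply)
    for _ in range(iters):                      # alternate row/col scaling
        v = demand / (K.t() @ u)
        u = supply / (K @ v)
    return u.unsqueeze(1) * K * v.unsqueeze(0)  # pi, same shape as cost
```

Each anchor is then assigned to the ground truth (or background) row that sends it the largest mass in the returned plan.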
The conception, specific structure and technical effects of the present invention are further described below with reference to the accompanying drawings, so that the objects, features and effects of the present invention can be fully understood.
Drawings
FIG. 1 is a schematic flow diagram of a preferred embodiment of the present invention;
FIG. 2 is a schematic diagram of CBNet in a preferred embodiment of the present invention;
FIG. 3 is a block diagram of a backbone network according to a preferred embodiment of the present invention;
FIG. 4 is a schematic diagram of the architecture of the Recursive-FPN network according to a preferred embodiment of the present invention;
fig. 5 is a schematic structural diagram of a TPH network according to a preferred embodiment of the present invention.
Detailed Description
The technical contents of the preferred embodiments of the present invention will be made clear and easily understood with reference to the drawings attached to the specification. The present invention may be embodied in many different forms of embodiments, and the scope of the invention is not limited to the embodiments set forth herein.
The invention provides an aerial photography target detection method based on a composite backbone network and multiple prediction heads, which comprises the following steps (as shown in FIG. 1):
and S1, image acquisition. Acquiring an aerial image shot by a camera carried by an unmanned aerial vehicle;
and S2, labeling the image. And carrying out target annotation on the acquired aerial image by using an image annotation tool. The marked content is the type of the target such as a vehicle, a ship and the position of the target in the image, and the obtained marked file and the original image are used for forming an aerial image data set required by the model;
and S3, enhancing data. And performing data enhancement on the obtained data set. The method mainly comprises the steps of random cutting, random horizontal turning, random vertical turning, scale dithering, color dithering, Mosaic or Mixup;
and S4, constructing an aerial image target detection model. Firstly, establishing a backbone network for extracting multi-scale features through a compound connection Swin Transformer; then, performing multi-scale fusion on the features extracted from the backbone network through a secure-FPN network, and enhancing the features with different resolutions; finally, based on the multi-scale features after fusion enhancement, using a TPH network as a detective head;
and S5, training an aerial image target detection model. Dividing a data set into a training set and a testing set according to a certain proportion, inputting the training set subjected to data enhancement into an aerial image target detection model for training, wherein the division of positive and negative samples during training is based on an OTA strategy;
and S6, predicting the aerial image target detection model. And reasoning by using the trained aerial image target detection model, inputting the aerial image with the concentrated test, and outputting and displaying the detected target type and position.
The invention provides an aerial image target detection method based on a compositely connected Swin Transformer backbone network and TPH, which specifically comprises the following steps:
and S1, image acquisition. Acquiring an aerial image shot by a camera carried by an unmanned aerial vehicle; the annotation content is the type of the identification target and the position of the identification target in the image, and the obtained annotation file and the original image are used for forming an aerial image target data set required by the model;
specifically, first, the unmanned aerial vehicle is aerial-photographed in a weather with good lighting conditions, and then the flight speed of the unmanned aerial vehicle is adjusted to shoot a high-quality aerial image. The aerial image is an RGB three-channel color image.
And S2, labeling the image. And carrying out target annotation on the acquired aerial image by using an image annotation tool.
Specifically, the open-source labeling tool labelImg is used to manually label the targets in the acquired aerial images, producing corresponding xml annotation files. Each aerial image corresponds to one xml annotation file containing target category information and target position information. The markup format of the xml annotation file is the same as that of the PASCAL VOC data set, i.e. it stores the (x, y) coordinates of the upper-left and lower-right corners of the minimum rectangular bounding box enclosing each target, together with the corresponding target class. The aerial images and the corresponding annotation files form the aerial image target detection data set required by the model.
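A minimal sketch of reading one such PASCAL-VOC-style xml file (tag names per the VOC convention) follows:

```python
import xml.etree.ElementTree as ET

def parse_voc_xml(path):
    """Return [(class_name, xmin, ymin, xmax, ymax), ...] for one
    labelImg/PASCAL-VOC annotation file."""
    root = ET.parse(path).getroot()
    objects = []
    for obj in root.iter("object"):
        name = obj.find("name").text
        box = obj.find("bndbox")
        xmin, ymin, xmax, ymax = (int(float(box.find(tag).text))
                                  for tag in ("xmin", "ymin",
                                              "xmax", "ymax"))
        objects.append((name, xmin, ymin, xmax, ymax))
    return objects
```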
And S3, data enhancement. The data set is the basis of aerial image target detection; considering that relatively few aerial targets are captured in actual UAV photography, data enhancement is performed on the obtained data set, mainly including random cropping, random horizontal flipping, random vertical flipping, scale jittering, color jittering, Mosaic and Mixup. During enhancement, if the aerial targets are shifted or transformed, the sizes and coordinates of the annotated bounding boxes are changed correspondingly by formula, ensuring that the targets in the transformed image still match their annotated bounding boxes.
Specifically, random cropping: the image is cropped randomly while ensuring that no aerial target is cut off, and the positions of the original annotated bounding boxes in the cropped image are computed; random horizontal flipping: the image and its annotated bounding boxes are flipped horizontally with probability 0.5; random vertical flipping: the image and its annotated bounding boxes are flipped vertically with probability 0.5; scale jittering: before cropping, the image is randomly resized to 0.5 to 1.5 times the original size, and the annotated bounding boxes are adjusted correspondingly; color jittering: the image is converted from RGB space to HSV space, its value, saturation and hue are changed randomly to simulate pictures under different illumination and colors, and the converted image is transferred back to RGB space; Mosaic: four images, each with its corresponding annotated bounding boxes, are stitched together to obtain a new image and its corresponding annotated bounding boxes; Mixup: two images are fused by weighting in a certain proportion, i.e. each pair of corresponding pixel values is added in that proportion.
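A minimal sketch of two of these operations follows, assuming NumPy HxWxC images and boxes stored as [xmin, ymin, xmax, ymax] rows:

```python
import random
import numpy as np

def random_hflip(image, boxes, p=0.5):
    """Flip the image horizontally with probability p and recompute the
    x-coordinates of the annotated bounding boxes accordingly."""
    if random.random() < p:
        w = image.shape[1]
        image = image[:, ::-1].copy()
        boxes = boxes.copy()
        boxes[:, [0, 2]] = w - boxes[:, [2, 0]]   # new xmin/xmax
    return image, boxes

def mixup(img_a, img_b, alpha=0.5):
    """Pixel-wise weighted fusion of two equally sized images; the two
    images' box annotations are simply concatenated by the caller."""
    blended = (alpha * img_a.astype(np.float32)
               + (1 - alpha) * img_b.astype(np.float32))
    return blended.astype(np.uint8)
```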
And S4, constructing the aerial image target detection model. First, a backbone network for extracting multi-scale features is established by compositely connecting Swin Transformer backbones; then the features extracted by the backbone are fused across scales through the Recursive-FPN network, enhancing features of different resolutions; finally, based on the fusion-enhanced multi-scale features, TPH is used as the detection head.
Specifically, the proposed aerial image target detection model can be divided into three parts; the architecture of each part is shown schematically in FIGS. 2 to 4.
First, the composite backbone network is used to extract the multi-scale features of the image. The composite connection is shown schematically in FIG. 2. The composite backbone network (CBNet) combines multiple identical backbones by compositely connecting the stages of adjacent backbones at the same horizontal position. As can be seen from FIG. 3, the overall information flow is from left to right: the output of an assistant backbone (also called higher-level features) flows through the composite connection to the next backbone as the input of the stage at the same horizontal position, and the output of the last backbone (the lead backbone) is used for target detection. As shown in FIG. 2, CBNet mainly contains two types of backbones: the lead backbone B_k and the assistant backbones B_1, B_2, ..., B_{k-1}. Each backbone has L stages, each stage comprises several convolutional layers, and the feature maps within a stage have the same size. The l-th stage of a backbone performs a nonlinear transformation F_l(·).
As shown in FIG. 2, in CBNet the output x_{l-1}^k of the (l-1)-th stage of B_k and the output x_l^{k-1} of the l-th stage of B_{k-1} are fused together as the input of the l-th stage of B_k:
x_l^k = F_l^k(x_{l-1}^k + g(x_l^{k-1}));
where g(·) denotes the composite connection, which applies down-sampling and up-sampling operations to x_l^{k-1} before it is used as input to the current l-th stage of the backbone. This type of composition is called Adjacent Higher-Level Composition (AHLC), because it feeds the output of a higher-level stage of the adjacent backbone into the next backbone. The two backbones used for the composite connection are Swin Transformers, whose architecture is shown in FIG. 2. The Swin Transformer is inspired by the Transformer in natural language processing, which is known for using a self-attention mechanism to capture long-range dependencies in data. The scale of elements in the visual domain varies widely, and the resolution of pixels in an image is much higher than that of words in a text passage; for this reason, unlike the original model, the Swin Transformer constructs a hierarchical representation of the feature maps. In the present invention, image block division is first implemented with a 7×7 convolution of stride 4, and the feature maps between different stages are then down-sampled with a 3×3 convolution of stride 2. In each Swin Transformer block, self-attention is computed within non-overlapping local windows. Assuming that each local window contains M×M image blocks and the entire image contains h×w image blocks, the computational complexities of the global MSA (Multi-head Self-Attention) and the window-based W-MSA (Window Multi-head Self-Attention) are respectively:
Ω(MSA) = 4hwC^2 + 2(hw)^2C;
Ω(W-MSA) = 4hwC^2 + 2M^2hwC;
where h denotes the image height, w the image width, and C the number of image channels;
From the above, the computational complexity of the Swin Transformer is linear in the image size.
To increase the receptive field of the network toward global self-attention while remaining efficient, cross-window connections are allowed. The window partition is shifted between consecutive Swin Transformer blocks, which use the W-MSA (Window Multi-head Self-Attention) and SW-MSA (Shifted-Window Multi-head Self-Attention) mechanisms respectively. The computation is as follows:
ẑ^l = W-MSA(LN(z^{l-1})) + z^{l-1};
z^l = MLP(LN(ẑ^l)) + ẑ^l;
ẑ^{l+1} = SW-MSA(LN(z^l)) + z^l;
z^{l+1} = MLP(LN(ẑ^{l+1})) + ẑ^{l+1};
where LN denotes layer normalization, ẑ^l denotes the output of the W-MSA module of the l-th layer, ẑ^{l+1} denotes the output of the SW-MSA module of the (l+1)-th layer, and z^l and z^{l+1} denote the outputs of the corresponding MLP modules.
Then, the Recursive-FPN network is used to fuse the features extracted by the compositely connected backbone network across scales and enhance the input features of different resolutions; its architecture is shown schematically in FIG. 4. The Recursive-FPN network is built on the basis of the FPN (Feature Pyramid Network): it is constructed by integrating extra feedback connections from the FPN into the bottom-up backbone, shown as the solid black arrows in the figure. Specifically, these feedback connections bring features in the detection head that directly receive gradients (features rich in semantic information) back into the lower feature layers of the backbone, thereby accelerating training and enhancing performance.
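For illustration, a two-pass sketch of this feedback scheme follows; the 1×1 feedback convolutions and the assumption that the backbone accepts optional per-stage residual inputs are choices made for the sketch, not the patent's prescribed implementation:

```python
import torch.nn as nn

class RecursiveFPN(nn.Module):
    """Run backbone+FPN once, feed the FPN outputs back into the
    bottom-up backbone through per-level 'feedback' convs (the solid
    arrows in FIG. 4), then run a second pass."""
    def __init__(self, backbone, fpn, channels):
        super().__init__()
        self.backbone, self.fpn = backbone, fpn
        self.feedback = nn.ModuleList(
            nn.Conv2d(c, c, kernel_size=1) for c in channels)

    def forward(self, x):
        feats = self.fpn(self.backbone(x))              # first pass
        fb = [conv(f) for conv, f in zip(self.feedback, feats)]
        return self.fpn(self.backbone(x, extra=fb))     # second pass
```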
Finally, a TPH network is adopted as the detection head; its architecture is shown in FIG. 5. TPH is formed by applying Transformer encoder blocks to the detection head part. A Transformer encoder block can capture global information and rich context information. Each Transformer encoder block comprises two sub-layers: the first sub-layer is a multi-head attention layer, and the second sub-layer, an MLP (Multi-Layer Perceptron), is a fully connected layer; residual connections are used between the sub-layers. The Transformer encoder block increases the ability to capture different local information and can also exploit the self-attention mechanism to mine the feature-representation potential. Because the feature-map resolution at the end of the network is low, applying TPH to low-resolution feature maps reduces computation and storage costs.
And S5, training the aerial image target detection model. The data set is divided into a training set and a test set in a certain proportion, the data-enhanced training set is input into the aerial image target detection model for training, and positive and negative samples during training are divided according to the OTA strategy.
Specifically, the aerial image data set obtained in S2 is divided into a training set and a test set at a ratio of 4:1. The divided training set is enhanced with the data enhancement strategy of S3 and input into the target detection model constructed in S4 for training. The total training loss of the detection model is a weighted sum of the classification loss and the regression loss: Loss = L_cls + λ·L_reg. L_cls is the classification loss, for which focal loss is chosen as the loss between the predicted target class and the ground truth class. L_reg is the regression loss, for which GIoU loss is chosen as the loss between the predicted bounding box coordinates and the ground truth bounding box coordinates. λ is a weighting factor, 0.5 by default.
During training, positive and negative samples are divided according to the optimal transport assignment (OTA) strategy. Specifically, the unit transport cost between each anchor point and each ground truth (or the background) is defined as the sum of the classification loss and the regression loss, so that finding the best label assignment scheme is converted into finding the best transport plan. Once the unit transport cost is defined, the optimal transport plan can be solved quickly and effectively by Sinkhorn-Knopp iteration. By considering context information from a global perspective, the optimal transport assignment strategy dynamically assigns labels of targets of various sizes, shapes and categories in a one-to-many manner.
The hyper-parameters for model training are set as follows: multi-scale training is adopted, resizing the input image so that its short edge is between 480 and 800 pixels and its long edge does not exceed 1333 pixels; an SGD optimizer with momentum 0.9 and weight decay 0.005 is used; the model is trained for 100 epochs in total with an initial learning rate of 0.0001; the learning rate is reduced to 1/10 at the 67th and 89th epochs; training uses 8 GPUs with two images per GPU, for a total batch size of 16.
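A configuration sketch matching these hyper-parameters follows; `model` and `train_loader` are assumed to exist (with the model returning its training loss), and the 8-GPU distributed setup is omitted for brevity:

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=1e-4,
                            momentum=0.9, weight_decay=0.005)
# Drop the learning rate to 1/10 at epochs 67 and 89 of 100.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[67, 89], gamma=0.1)

for epoch in range(100):
    for images, targets in train_loader:   # total batch size 16
        loss = model(images, targets)      # assumed to return the loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```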
And S6, prediction with the aerial image target detection model. Inference is performed with the trained network model: aerial images from the test set are input, and the detected aerial target types and positions are output and displayed.
Specifically, the aerial images in the divided test set are input into the aerial image target detection model trained in S5. First, the composite Swin Transformer backbone network extracts multi-scale features; the features extracted by the backbone are then fused across scales through the Recursive-FPN network; the TPH network is then used as the detection head to output, for each multi-scale feature map of the Recursive-FPN network, the predicted aerial target probabilities and predicted bounding boxes. For the aerial target predictions output by the TPH network, low-confidence results are filtered out with a confidence threshold of 0.05. The predictions of all layers are then post-processed with Soft-NMS at a threshold of 0.6 to generate the final aerial target predictions and display the types and positions of the targets in the aerial image.
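A minimal sketch of this post-processing (confidence filtering plus linear Soft-NMS) follows; it is an illustrative single-class implementation relying on torchvision's box_iou, not the patent's exact code:

```python
import torch
from torchvision.ops import box_iou

def soft_nms(boxes, scores, iou_thresh=0.6, score_thresh=0.05):
    """Linear Soft-NMS: decay the scores of boxes overlapping the current
    best box by (1 - IoU) instead of removing them, then drop boxes whose
    score falls below the confidence threshold."""
    boxes, scores = boxes.clone(), scores.clone()
    keep_boxes, keep_scores = [], []
    while scores.numel() > 0:
        i = int(torch.argmax(scores))
        keep_boxes.append(boxes[i])
        keep_scores.append(scores[i])
        mask = torch.ones_like(scores, dtype=torch.bool)
        mask[i] = False
        boxes, scores = boxes[mask], scores[mask]
        if scores.numel() == 0:
            break
        ious = box_iou(keep_boxes[-1].unsqueeze(0), boxes).squeeze(0)
        decay = torch.where(ious > iou_thresh, 1.0 - ious,
                            torch.ones_like(ious))
        scores = scores * decay
        alive = scores > score_thresh
        boxes, scores = boxes[alive], scores[alive]
    return torch.stack(keep_boxes), torch.stack(keep_scores)
```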
The technical solution provided by the invention significantly alleviates the low accuracy of existing aerial image target detection methods, especially their insufficient small-target detection, generalization and robustness, and achieves a better balance between aerial image target detection speed and accuracy.
The foregoing describes the preferred embodiments of the invention in detail. It should be understood that numerous modifications and variations can be devised by those of ordinary skill in the art in light of the present teachings without departing from the inventive concept. Therefore, technical solutions that can be obtained by those skilled in the art through logical analysis, reasoning or limited experimentation on the basis of the prior art and according to the concept of the present invention shall fall within the scope of protection defined by the claims.

Claims (10)

1. An aerial photography target detection method based on a composite backbone network and multiple prediction heads, characterized by comprising the following steps:
step 1, preparing a data set of aerial images;
step 2, constructing a target detection model;
step 3, applying the target detection model to carry out target prediction;
wherein the step 2 comprises the following substeps:
step 2.1, establishing the backbone network of the target detection model by compositely connecting Swin Transformers, and extracting multi-scale features;
2.2, fusing the multi-scale features through a Recursive-FPN network to obtain the fused and enhanced multi-scale features;
2.3, based on the multi-scale features after fusion enhancement, constructing the target detection model by using a TPH network as a detection head;
step 2.4, dividing the data set of the aerial image into a training set and a test set according to a proportion, inputting the training set into the target detection model for training, testing the target detection model by using the test set, and determining the target detection model; wherein, the division of the positive and negative samples during training follows the OTA strategy.
2. The aerial photography target detection method based on a composite backbone network and multiple prediction heads according to claim 1, wherein in the step 2.1, the backbone network comprises a plurality of backbones connected in sequence; each backbone has L stages, each stage comprises several convolutional layers, and the feature maps within a stage have the same size; the l-th stage of a backbone performs a nonlinear transformation F_l(·);
a plurality of identical backbones are combined by compositely connecting the stages of adjacent backbones at the same horizontal position;
the backbones are divided into two types: assistant backbones and the lead backbone; the assistant backbones are denoted B_1, B_2, ..., B_{k-1}, and the lead backbone is denoted B_k; the output of an assistant backbone flows through the composite connection to the next backbone as the input of the stage at the same horizontal position, and the output of the last backbone, i.e. the lead backbone, serves as the extracted multi-scale features.
3. The aerial photography target detection method based on a composite backbone network and multiple prediction heads according to claim 2, wherein in the step 2.1, the backbone network performs adjacent higher-level composition of the backbones, that is, the output x_{l-1}^k of the (l-1)-th stage of B_k and the output x_l^{k-1} of the l-th stage of B_{k-1} are merged together as the input of the l-th stage of B_k:
x_l^k = F_l^k(x_{l-1}^k + g(x_l^{k-1}));
where g(·) denotes the composite connection: down-sampling and up-sampling operations are applied to x_l^{k-1} before it is used as input to the current l-th stage of the backbone.
4. The aerial photography target detection method based on a composite backbone network and multiple prediction heads according to claim 3, wherein in the adjacent higher-level composition of the backbones in the step 2.1, the two backbones used for the composite connection are Swin Transformers;
the Swin Transformer realizes image block division with a 7×7 convolution of stride 4, and the feature maps between different stages are down-sampled with a 3×3 convolution of stride 2; in each Swin Transformer block, self-attention is computed within non-overlapping local windows;
assuming that each local window contains M×M image blocks and the entire image contains h×w image blocks, the computational complexities of the global MSA and the window-based W-MSA are:
Ω(MSA) = 4hwC^2 + 2(hw)^2C;
Ω(W-MSA) = 4hwC^2 + 2M^2hwC;
where h denotes the image height, w the image width, and C the number of image channels;
the computational complexity of the Swin Transformer is linear in the image size.
5. The aerial photography target detection method based on a composite backbone network and multiple prediction heads according to claim 4, wherein the Swin Transformer allows cross-window connections;
the window partition is shifted between consecutive Swin Transformer blocks, which use the W-MSA and SW-MSA mechanisms respectively, computed as follows:
ẑ^l = W-MSA(LN(z^{l-1})) + z^{l-1};
z^l = MLP(LN(ẑ^l)) + ẑ^l;
ẑ^{l+1} = SW-MSA(LN(z^l)) + z^l;
z^{l+1} = MLP(LN(ẑ^{l+1})) + ẑ^{l+1};
where LN denotes layer normalization, ẑ^l denotes the output of the W-MSA module of the l-th layer, ẑ^{l+1} denotes the output of the SW-MSA module of the (l+1)-th layer, and z^l and z^{l+1} denote the outputs of the corresponding MLP modules.
6. The aerial photography target detection method based on a composite backbone network and multiple prediction heads according to claim 1, wherein in the step 2.3, the detection head part applies Transformer encoder blocks to form the TPH network; each Transformer encoder block comprises two sub-layers: the first sub-layer is a multi-head attention layer, and the second sub-layer, an MLP, is a fully connected layer; each sub-layer uses a residual connection.
7. The aerial photography target detection method based on a composite backbone network and multiple prediction heads according to claim 1, wherein the step 1 comprises the following substeps:
step 1.1, acquiring the aerial image;
step 1.2, carrying out target annotation on the acquired aerial image with an image annotation tool to obtain an annotation file; the annotation content is the type of the target and the position of the target in the aerial image;
step 1.3, performing data enhancement on the obtained data set, where the enhancement modes include random cropping, random horizontal flipping, random vertical flipping, scale jittering, color jittering, Mosaic and Mixup; the obtained annotation files and the original aerial images form the data set required by the target detection model.
8. The aerial photography target detection method based on a composite backbone network and multiple prediction heads according to claim 1, wherein in the step 2.4, multi-scale training is adopted in the process of inputting the training set into the target detection model: the input aerial image is resized so that its short edge is between 480 and 800 pixels and its long edge does not exceed 1333 pixels; an SGD optimizer with momentum 0.9 and weight decay 0.005 is used; the model is trained for 100 epochs in total with an initial learning rate of 0.0001; the learning rate is reduced to 1/10 at the 67th and 89th epochs; training uses 8 GPUs with two images per GPU, for a total batch size of 16.
9. The aerial photography target detection method based on a composite backbone network and multiple prediction heads according to claim 1, wherein in the step 2.4, the data set of the aerial image is divided into a training set and a test set at a ratio of 4:1.
10. The aerial photography target detection method based on a composite backbone network and multiple prediction heads according to claim 1, wherein in the step 2.4, the training set is input into the target detection model for training, and the total training loss is a weighted sum of the classification loss and the regression loss:
Loss = L_cls + λ·L_reg;
where L_cls is the classification loss, for which focal loss is selected as the loss between the predicted target class and the ground truth class; L_reg is the regression loss, for which GIoU loss is selected as the loss between the predicted bounding box coordinates and the ground truth bounding box coordinates; λ is a weighting factor, 0.5 by default.
CN202210748203.3A 2022-06-29 2022-06-29 Aerial photography target detection method based on composite backbone network and multiple prediction heads Pending CN115035429A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210748203.3A CN115035429A (en) 2022-06-29 2022-06-29 Aerial photography target detection method based on composite backbone network and multiple measuring heads

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210748203.3A CN115035429A (en) 2022-06-29 2022-06-29 Aerial photography target detection method based on composite backbone network and multiple measuring heads

Publications (1)

Publication Number Publication Date
CN115035429A true CN115035429A (en) 2022-09-09

Family

ID=83127226

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210748203.3A Pending CN115035429A (en) Aerial photography target detection method based on composite backbone network and multiple prediction heads

Country Status (1)

Country Link
CN (1) CN115035429A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116895029A (en) * 2023-09-11 2023-10-17 山东开泰抛丸机械股份有限公司 Aerial image target detection method and aerial image target detection system based on improved YOLO V7
CN116895029B (en) * 2023-09-11 2023-12-19 山东开泰抛丸机械股份有限公司 Aerial image target detection method and aerial image target detection system based on improved YOLO V7

Similar Documents

Publication Publication Date Title
CN112308019B (en) SAR ship target detection method based on network pruning and knowledge distillation
CN111222396B (en) All-weather multispectral pedestrian detection method
CN114202672A (en) Small target detection method based on attention mechanism
CN112818903A (en) Small sample remote sensing image target detection method based on meta-learning and cooperative attention
CN110580699A (en) Pathological image cell nucleus detection method based on improved fast RCNN algorithm
US11308714B1 (en) Artificial intelligence system for identifying and assessing attributes of a property shown in aerial imagery
CN111079739B (en) Multi-scale attention feature detection method
CN111967480A (en) Multi-scale self-attention target detection method based on weight sharing
CN111985376A (en) Remote sensing image ship contour extraction method based on deep learning
CN112150493A (en) Semantic guidance-based screen area detection method in natural scene
CN112347895A (en) Ship remote sensing target detection method based on boundary optimization neural network
CN114565860B (en) Multi-dimensional reinforcement learning synthetic aperture radar image target detection method
CN112149547A (en) Remote sensing image water body identification based on image pyramid guidance and pixel pair matching
CN113313082B (en) Target detection method and system based on multitask loss function
CN114359245A (en) Method for detecting surface defects of products in industrial scene
CN110659601A (en) Depth full convolution network remote sensing image dense vehicle detection method based on central point
CN113052108A (en) Multi-scale cascade aerial photography target detection method and system based on deep neural network
CN111881984A (en) Target detection method and device based on deep learning
CN115861756A (en) Earth background small target identification method based on cascade combination network
Fan et al. A novel sonar target detection and classification algorithm
Sun et al. IRDCLNet: Instance segmentation of ship images based on interference reduction and dynamic contour learning in foggy scenes
CN115035429A (en) Aerial photography target detection method based on composite backbone network and multiple prediction heads
CN110633727A (en) Deep neural network ship target fine-grained identification method based on selective search
CN116342894B (en) GIS infrared feature recognition system and method based on improved YOLOv5
US20230298335A1 (en) Computer-implemented method, data processing apparatus and computer program for object detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination