CN111914795B - Method for detecting rotating target in aerial image - Google Patents

Method for detecting rotating target in aerial image

Info

Publication number
CN111914795B
Authority
CN
China
Prior art keywords
target
loss
mask
candidate region
image
Prior art date
Legal status
Active
Application number
CN202010823765.0A
Other languages
Chinese (zh)
Other versions
CN111914795A (en)
Inventor
刘怡光
唐天航
朱先震
Current Assignee
Sichuan University
Original Assignee
Sichuan University
Priority date
Filing date
Publication date
Application filed by Sichuan University filed Critical Sichuan University
Priority to CN202010823765.0A priority Critical patent/CN111914795B/en
Publication of CN111914795A publication Critical patent/CN111914795A/en
Application granted granted Critical
Publication of CN111914795B publication Critical patent/CN111914795B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • G06V20/13Satellite images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Abstract

The invention adopts a deep learning method to design a target detection model that detects targets such as vehicles, ships and airplanes in high-altitude aerial images while predicting the location of each target's rotated bounding box. First, an image feature extraction network is designed to acquire high-dimensional features of the input aerial image, and a feature pyramid is constructed using an FPN (Feature Pyramid Network) architecture to extract target features at different resolutions. Then, a clustering method generates the base anchor sizes of the candidate region extraction network, adjusting the anchors according to the size distribution of targets in the training images and thereby improving training efficiency. A feature denoising detector combined with an attention mechanism is designed to denoise the target features of the candidate regions. Finally, a rotation angle error optimization method assigns corresponding weight factors to target boxes of different aspect ratios, optimizing the localization of targets with large aspect ratios and realizing rotated-box prediction for various targets in aerial images.

Description

Method for detecting rotating target in aerial image
Technical Field
The invention relates to an aerial image target detection algorithm, and in particular to a rotated-box localization prediction method for rotating target detection.
Background
Target detection is a challenging computer vision task with application prospects in many fields, including face recognition, search and rescue, and intelligent transportation. Traditional target detection methods rely on manually designed features of the targets to be detected; this process is laborious, and because such features are hard to extract and unstable, the resulting detectors are inefficient and lack robustness. With the recent introduction and application of deep learning methods, the field of target detection has reached several milestones, and both detection accuracy and detection speed have improved greatly. Deep-learning-based target detection methods fall mainly into single-step and two-step detection. Single-step algorithms are fast but sacrifice some accuracy, making high-precision detection requirements difficult to meet; classical single-step models include the YOLO series and SSD, while two-step detection is represented by Fast RCNN. Although the two families differ clearly in model architecture, detector design and training optimization, as the main algorithms of target detection they share the same overall pipeline: for an input image, a basic feature extraction network first processes the low-dimensional pixel information to construct high-dimensional feature information, and a detector then predicts the target center point and bounding box size from these high-order features.
Small target detection and rotating target detection are further important computer vision tasks beyond the classical detection task. Small targets occupy few pixels and a small fraction of the image, and are easily ignored during the feature extraction of a convolutional neural network, so they are difficult to detect. In recent years, many algorithms designed for small targets combine low-dimensional with high-dimensional features for prediction, preventing small-target features from being lost as the convolution depth increases and degrading the final prediction result.
Aerial images contain many target aggregation areas such as parking lots, harbors and airports. In highly aggregated areas, using a traditional horizontal box causes non-maximum suppression to discard a large number of target boxes, so many targets are missing from the detection result. Detecting targets with rotated boxes effectively avoids this problem while also realizing more accurate localization prediction.
Disclosure of Invention
The invention provides a multi-scale-clustering-based target feature denoising and angle error optimization method combined with an attention mechanism for small target detection and rotating target localization prediction in aerial images, realizing accurate localization prediction of rotating targets.
The method adopts the residual network ResNet as the basic feature extraction framework to extract high-dimensional feature information from the input image, and designs a feature pyramid structure to fuse the high-dimensional and low-dimensional features. Multi-scale clustering is then used to set the anchor parameters of the candidate region proposal network (RPN), and anchors are allocated to each feature layer according to characteristics such as the receptive field of each pyramid resolution. Next, according to the candidate regions generated by the RPN, the corresponding feature maps are cropped from the matching feature layers, and the features of each candidate region are denoised by the proposed attention denoiser. The denoised target features are finally input to a fully connected layer for the final localization and classification prediction.
The aerial image rotating target detection method comprises the following steps:
Step one: data acquisition and labeling. Aerial images are collected with equipment or from network resources: high-resolution images are obtained by shooting from high altitude with an unmanned aerial vehicle or from sources such as Google Maps. After the images are acquired, the targets are labeled; unlike the traditional labeling with horizontal circumscribed rectangles, a rotated-box labeling mode is adopted. The specific implementation steps are as follows:
Step A: collect images with an unmanned aerial vehicle or from network resources to build a large amount of training image data;
Step B: label each target box with an annotation tool using the 4-point method, i.e. by the four vertices of a quadrilateral;
Step C: complete the rotated rectangular box label by taking the minimum circumscribed rectangle of the quadrilateral, producing the annotation file (a conversion sketch is given below).
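As an illustrative aid (not part of the original disclosure), the following minimal Python sketch shows how step C can be realized with OpenCV's minAreaRect, which returns the minimum circumscribed rotated rectangle of a point set; the example coordinates are hypothetical.

import cv2
import numpy as np

def quad_to_rotated_rect(quad):
    # quad: the four labeled vertices of one target, shape (4, 2).
    pts = np.asarray(quad, dtype=np.float32)
    (cx, cy), (w, h), angle = cv2.minAreaRect(pts)
    return cx, cy, w, h, angle  # center, box size, rotation angle

# Example: one annotated target given by its four labeled vertices
print(quad_to_rotated_rect([(120, 40), (300, 95), (280, 160), (100, 105)]))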
Step two: data preprocessing. Aerial images have very high resolution, and directly feeding the original image to the model, whether during training or actual testing, is impractical: it places a heavy burden on the equipment and greatly slows training. The original image is therefore cut into small sub-images, which are then input to the model for training and prediction. The specific implementation steps are as follows:
Step A: crop the image. According to the equipment capability and the deep learning model, the crop size is set to 800 x 800 pixels; since direct cropping may cut off targets at the crop edges, an overlap of 200 pixels is set between adjacent crops;
Step B: reconstruct the target label data, configuring corresponding labels for each cropped image; whether a label belongs to a cropped image is judged by whether the label center lies inside it;
Step C: construct the training data. Training data tensors are built uniformly from the cropped images and label data for convenient model input; during label processing, the labels expressed by the 4-point method are converted to the center point, box size and rotation angle representation (a tiling sketch is given below).
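A minimal tiling sketch for step two follows (illustrative, not from the original disclosure), assuming 800 x 800 crops with a 200-pixel overlap (stride 600) and labels already converted to (cx, cy, w, h, angle); the helper names are hypothetical.

import numpy as np

TILE, OVERLAP = 800, 200
STRIDE = TILE - OVERLAP  # 600

def tile_image(image, labels):
    # labels: (N, 5) array of (cx, cy, w, h, angle) in full-image coordinates.
    H, W = image.shape[:2]
    tiles = []
    for y0 in range(0, max(H - OVERLAP, 1), STRIDE):
        for x0 in range(0, max(W - OVERLAP, 1), STRIDE):
            patch = image[y0:y0 + TILE, x0:x0 + TILE]
            # Step B rule: a label belongs to the tile containing its center.
            kept = [(cx - x0, cy - y0, w, h, a)
                    for cx, cy, w, h, a in labels
                    if x0 <= cx < x0 + TILE and y0 <= cy < y0 + TILE]
            tiles.append((patch, np.asarray(kept)))
    return tiles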
Step three: model design. The deep learning detection model of the invention mainly comprises four core structures: a feature extraction network, a candidate region generation network, a feature denoising structure and a rotating target predictor. During detection and training, the input data passes through these four structures in turn to finally produce the prediction result. The specific implementation steps of the model are as follows:
Step A: adopt the residual network ResNet as the feature extractor to obtain high-dimensional information from the input image, then construct a feature pyramid in a top-down manner, fusing the high-dimensional features downward in sequence to generate multiple feature maps;
Step B: generate the anchor sizes of the candidate region generation network by clustering: first count the target sizes in the training data, set the number of cluster centers, and cluster with the K-means method to produce the corresponding number of cluster centers, whose coordinates serve as the anchor width and height parameters for configuring the candidate region generation network; from the feature maps, the network then produces several groups of candidate region classifications and anchor offset values;
Step C: according to the anchor offset values, crop the feature map from the corresponding level of the feature pyramid to generate region-of-interest features; from this result, construct a denoising map through several convolution layers, multiply the denoising map element-wise with each region-of-interest feature layer to obtain the denoised target features, and generate the corresponding attention loss function during training;
Step D: input the denoised target features to a fully connected layer to predict the target's classification information and localization information, where the classification information is the index of the target class and the localization information is the target center, size and rotation angle; during training, the rotation angle error weight is set according to the target aspect ratio to realize angle error optimization;
Step four: loss function design. The loss function mainly comprises three parts: the foreground/background classification error and anchor offset localization error of the candidate region generation network; the attention loss of the attention denoising; and the classification error and rotated-box localization error of the final prediction result.
Drawings
FIG. 1 is a diagram of the rotating target detection network architecture of the method of the present invention.
FIG. 2 shows sample aerial images from step one of the present invention.
FIG. 3 is a schematic diagram of the image cropping in step two of the present invention.
FIG. 4 is a structural diagram of the feature pyramid in step three of the present invention.
FIG. 5 is a diagram of the attention denoising detector in step three of the present invention.
FIG. 6 shows the model regression target design of step three of the present invention.
FIG. 7 shows the attention mechanism mask design of step three of the present invention.
Detailed Description
The details of the model design of the present invention are described with reference to fig. 1, and the steps of the embodiment are as follows:
Step one: image feature extraction. Process the low-dimensional image pixels and extract high-dimensional feature information (the feature pyramid structure is shown in FIG. 4). The specific implementation steps are as follows:
Step A: the residual network ResNet50d serves as the backbone network; the input image passes through 4 residual blocks, correspondingly generating 4 feature maps of different resolutions, {C2, C3, C4, C5};
Step B: fuse the generated feature maps from top to bottom: first generate the pyramid top-level feature P5 from C5 by convolution, then add P5 and C4 to obtain P4, and continue fusing downward in sequence to finally generate the feature pyramid {P2, P3, P4, P5} (a fusion sketch is given below);
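For illustration only (PyTorch, not part of the original disclosure), the following sketch implements the top-down fusion rule of equations (1)-(2) below: P5 = Conv(C5), PC_j = 0.5*P_{j+1} + 0.5*C_j, P_j = Conv(PC_j). The 1x1 lateral convolutions and the nearest-neighbor upsampling used to match channel counts and spatial sizes are assumptions; the patent only states the fusion rule.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownFusion(nn.Module):
    def __init__(self, channels=(256, 512, 1024, 2048), out_ch=256):
        super().__init__()
        # 1x1 lateral convs bring C2..C5 to a common channel width (assumed).
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_ch, 1) for c in channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_ch, out_ch, 3, padding=1)
                                    for _ in channels)

    def forward(self, c2, c3, c4, c5):
        feats = [lat(c) for lat, c in zip(self.lateral, (c2, c3, c4, c5))]
        p5 = self.smooth[3](feats[3])                      # eq. (1)
        pyramid = [p5]
        for j in (2, 1, 0):                                # C4, C3, C2
            up = F.interpolate(pyramid[0], size=feats[j].shape[-2:],
                               mode="nearest")
            pc = 0.5 * up + 0.5 * feats[j]                 # eq. (2)
            pyramid.insert(0, self.smooth[j](pc))
        return pyramid                                     # [P2, P3, P4, P5]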
Step two: generate the anchor sizes by a clustering method and allocate the anchors to each feature layer for prediction; the candidate region extraction network uses the generated feature pyramid to predict the offset of each target relative to its anchor. The specific implementation steps are as follows:
Step A: count the target size information in the training data and cluster it with the K-means method; the number of cluster centers is set to 35, finally generating 35 cluster centers corresponding to the sizes of 35 anchors, and each anchor is allocated to its corresponding feature layer;
Step B: perform convolution on the generated feature pyramid to produce the target's foreground/background score prediction and its offset relative to the anchor; the model outputs 2 values for the foreground/background scores, representing the foreground and background scores respectively, and 4 values for the anchor offset, representing the center offsets x, y and the size offsets w, h (a clustering sketch is given below);
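As a hedged sketch of step A (not from the original disclosure), the clustering can be realized with scikit-learn's KMeans over the (width, height) pairs collected from the training labels; the 35 centers follow the description, while the sort-by-area step for splitting anchors across pyramid levels is an assumption.

import numpy as np
from sklearn.cluster import KMeans

def cluster_anchor_sizes(target_sizes, n_anchors=35, seed=0):
    # target_sizes: (N, 2) array of (width, height) in pixels.
    km = KMeans(n_clusters=n_anchors, n_init=10, random_state=seed)
    km.fit(np.asarray(target_sizes, dtype=np.float64))
    # Each cluster center's coordinates become one anchor's (w, h).
    anchors = km.cluster_centers_
    # Sort by area so anchors can be split across pyramid levels.
    return anchors[np.argsort(anchors.prod(axis=1))]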
Step three: feature denoising combined with the attention mechanism. First crop the features according to the candidate regions, then denoise the target features and predict the final target classification and rotated-box localization. The specific implementation steps are as follows:
Step A: decode the anchor offsets of each candidate region against the corresponding anchor size to recover the real size;
Step B: compute the real size as a percentage of the input image size;
Step C: use this percentage to crop the feature map of the corresponding feature layer, obtaining the region-of-interest feature map;
Step D: input the feature map to the attention denoising generator to generate a feature denoising map of the same size as the feature map;
Step E: multiply the feature map element-wise with the feature denoising map to obtain the target feature map;
Step F: attach a fully connected layer to predict the target class and the rotated-box localization information (a forward-pass sketch is given below);
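A minimal PyTorch sketch of steps D-F follows (illustrative, not from the original disclosure): a small convolutional "attention denoising generator" produces a denoising map that is multiplied element-wise with the ROI features before the fully connected heads. The layer sizes, the sigmoid gating, and the head dimensions are assumptions.

import torch
import torch.nn as nn

class AttentionDenoiser(nn.Module):
    def __init__(self, ch=256, roi=7, n_classes=16):
        super().__init__()
        self.denoise = nn.Sequential(               # Step D
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, padding=1), nn.Sigmoid())
        self.fc = nn.Sequential(nn.Flatten(),
                                nn.Linear(ch * roi * roi, 1024), nn.ReLU())
        self.cls_head = nn.Linear(1024, n_classes)   # class scores
        self.box_head = nn.Linear(1024, 5)           # cx, cy, w, h, angle

    def forward(self, roi_feats):                    # (N, ch, roi, roi)
        denoise_map = self.denoise(roi_feats)
        clean = roi_feats * denoise_map              # Step E
        h = self.fc(clean)                           # Step F
        return self.cls_head(h), self.box_head(h), denoise_map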
Step four: loss function design. The loss function of the target detection model is designed for model training and convergence, with the following specific steps:
Step A: the candidate region generation network loss comprises 2 parts, namely the foreground/background classification loss and the anchor offset value loss;
Step B: the attention loss: a convergence mask target is constructed for each object in the manner shown in FIG. 7; meanwhile, in the attention denoising, an attention feature map is generated along with the feature denoising map, and the attention loss function is constructed from the attention feature map and the mask target;
Step C: the target classification and localization loss comprises the target class classification loss and the predicted rotated-box localization loss; when constructing the localization loss, a corresponding weight is set according to the target aspect ratio, so that the final prediction result is not excessively influenced by the supervision error of large-aspect-ratio targets (a weighting sketch is given below);
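A hedged sketch of the aspect-ratio-dependent angle weighting in step C follows. The patent does not give the weighting formula; the mapping below, which down-weights the angle error of very elongated targets so that it does not dominate the loss, is one plausible choice shown only to illustrate where the factor enters.

import torch
import torch.nn.functional as F

def angle_loss_weight(w, h, w_min=0.25):
    # Aspect ratio >= 1; weight decays with log aspect ratio (assumed form).
    aspect = torch.maximum(w, h) / torch.clamp(torch.minimum(w, h), min=1e-6)
    return torch.clamp(1.0 / (1.0 + torch.log(aspect)), min=w_min)

def weighted_angle_loss(pred_theta, gt_theta, w, h):
    # Smooth L1 on the angle term, as stated in the description,
    # scaled by the per-target aspect-ratio weight.
    l1 = F.smooth_l1_loss(pred_theta, gt_theta, reduction="none")
    return (angle_loss_weight(w, h) * l1).mean()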
The implementation details of step one are as follows: first, the feature maps {C2, C3, C4, C5} are generated by the residual network, and P5 is obtained by convolution:

P5 = Conv(C5) (1)

Top-down fusion is then carried out to obtain P2, P3, P4:

PC_j = 0.5*P_{j+1} + 0.5*C_j, j ∈ {2, 3, 4}
P_j = Conv(PC_j), j ∈ {2, 3, 4} (2)
The anchor point allocation strategy of the second step is shown by the following formula:
Figure 237138DEST_PATH_IMAGE003
(3)
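Since equation (3) survives only as an image, the sketch below uses the standard FPN level-assignment rule, k = floor(k0 + log2(sqrt(w*h)/224)), as a stand-in that maps larger anchors to coarser pyramid levels; the patent's actual formula may differ.

import math

def assign_level(w, h, k0=4, k_min=2, k_max=5):
    # Map an anchor of size (w, h) to a pyramid level P2..P5.
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / 224.0))
    return min(max(k, k_min), k_max)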
The mask generation details of step three are as follows. First, an independent mask is generated for each target according to its label:

t_mask_k = FillPoly(Zeros([H, W]), gt_boxes_k, labels_k) (4)

where FillPoly denotes pixel filling: a zero matrix of the same size as the input image is constructed first, and the pixels inside the target area are then set to the target's class index. After construction, all target masks are concatenated to build a high-dimensional matrix:

mask = Concate(t_mask_k), k ∈ [0, len(gt_boxes)] (5)

A one-hot vector of the target box regression is then constructed from the result of the candidate region generation network:

onehot = One_hot(rois_assignments, len(gt_boxes)) (6)

where rois_assignments denotes the target box index corresponding to each candidate region. At the same time, the masks are cropped using the generated candidate region results:

rois_cropped_mask = ROI_Align(rois, mask) (7)

where ROI_Align denotes a crop-and-scale operation that crops the mask according to each candidate region while scaling all crops to the same size. Finally, the one-hot vector activates the corresponding mask (a construction sketch follows):

rois_mask = Sum(rois_onehot * rois_cropped_mask) (8)
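A NumPy/OpenCV sketch of equations (4)-(8) follows (illustrative, not from the original disclosure), assuming H x W input images, gt_boxes given as 4-point polygons, and an external ROI-Align helper (e.g. torchvision.ops.roi_align) supplying the cropped mask stack of equation (7).

import numpy as np
import cv2

def build_mask_targets(gt_boxes, labels, H, W):
    # Eq. (4): one mask per target, pixels set to the class index.
    t_masks = []
    for quad, label in zip(gt_boxes, labels):
        m = np.zeros((H, W), dtype=np.float32)
        cv2.fillPoly(m, [np.asarray(quad, dtype=np.int32)], float(label))
        t_masks.append(m)
    return np.stack(t_masks)            # Eq. (5): shape (K, H, W)

def activate_rois_mask(rois_cropped_mask, rois_assignments, num_gt):
    # Eq. (6): one-hot over ground-truth boxes for each candidate region.
    onehot = np.eye(num_gt, dtype=np.float32)[rois_assignments]
    # Eq. (8): select each region's own mask from the cropped stack
    # (rois_cropped_mask: (R, K, S, S), from the eq. (7) ROI_Align step).
    return np.einsum("rk,rkst->rst", onehot, rois_cropped_mask)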
In step four, the construction of the loss function is divided into three parts; the overall loss function is:

Loss_total = Loss_RPN + Loss_FAST + Loss_MASK (9)
wherein the loss function of the candidate region generation network is defined as follows:

[Equation (10): RPN loss; rendered only as an image in the original] (10)

the target prediction loss is defined as follows:

[Equation (11): target prediction loss; rendered only as an image in the original] (11)

and the attention loss is defined as follows:

[Equation (12): attention loss; rendered only as an image in the original] (12)
wherein λ_i are the weights controlling each loss term; rp_n denotes the foreground/background prediction probability of the candidate region generation; rv_n denotes the anchor offset prediction of the candidate region generation; gp_n denotes the true foreground/background probability (1 for foreground, 0 for background); gv_n denotes the true anchor offset value; Fp_n denotes the classification result of the predicted target; Fv_n denotes the predicted rotated-box localization result; Gp_n denotes the true target class; Gv_n denotes the true localization of the target's rotated box; R_h^n and R_w^n denote the scaled attention feature size; u_ij denotes the value at the corresponding position of the attention feature map; gu_ij denotes the true mask value; L_cls denotes the softmax cross entropy; L_reg and L_reg_theta denote the smooth L1 loss; and L_AD denotes the pixel-level softmax cross entropy.
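As a closing illustration (not part of the original disclosure), the following PyTorch sketch assembles the overall loss of equation (9). The description states that L_cls is a softmax cross entropy, L_reg and L_reg_theta are smooth L1 losses, and L_AD is a pixel-level softmax cross entropy; equations (10)-(12) survive only as images, so the exact term grouping and the lambda weights are assumptions.

import torch
import torch.nn.functional as F

def total_loss(rp, gp, rv, gv,            # RPN: fg/bg scores, anchor offsets
               Fp, Gp, Fv, Gv,            # head: class scores, rotated boxes
               att_map, gt_mask,          # attention map, per-pixel class mask
               lambdas=(1.0, 1.0, 1.0, 1.0, 1.0)):
    l1, l2, l3, l4, l5 = lambdas
    # gp, Gp: long tensors of class indices; gt_mask: (N, H, W) long tensor.
    loss_rpn = (l1 * F.cross_entropy(rp, gp)
                + l2 * F.smooth_l1_loss(rv, gv))
    loss_fast = (l3 * F.cross_entropy(Fp, Gp)
                 + l4 * F.smooth_l1_loss(Fv, Gv))
    # Pixel-level cross entropy between attention map and mask target.
    loss_mask = l5 * F.cross_entropy(att_map, gt_mask)
    return loss_rpn + loss_fast + loss_mask   # eq. (9)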

Claims (1)

1. A method for detecting a rotating target of an aerial image, characterized by comprising the following steps: the method comprises data acquisition and labeling, data preprocessing, model design and loss function design, wherein aerial images are acquired with an unmanned aerial vehicle and a satellite map, targets are labeled in a rotated-box mode, the images are then preprocessed and divided into multiple sub-images of fixed size, corresponding label data are configured for each image generated by cropping and used to construct the training data tensor, model design is then carried out, the model comprising a feature extraction network, a candidate region generation network, a feature denoising structure and a rotating target predictor, image features are extracted with a residual network, anchor sizes are generated based on a clustering method, feature denoising is performed in combination with an attention mechanism, the candidate target features are input to a fully connected layer to predict the target's classification and localization information, the loss function of the rotating target detection model is designed, comprising the candidate region generation network loss, the attention loss, and the target classification and localization loss, model training is completed with the relevant aerial data, and finally the localization and classification prediction of rotating targets in aerial images is realized, wherein the specific implementation steps are as follows:
the implementation details of step one are as follows: first, the feature maps {C2, C3, C4, C5} are generated by the residual network, and P5 is obtained by convolution:
P5=Conv(C5) (1)
then top-down fusion is carried out to obtain P2, P3, P4:
PC_j = 0.5*P_{j+1} + 0.5*C_j, j ∈ {2, 3, 4}
P_j = Conv(PC_j), j ∈ {2, 3, 4} (2)
The anchor point allocation strategy of the second step is shown by the following formula:
[Equation (3): anchor-to-feature-layer assignment formula; rendered only as an image in the original]
the mask generation details of step three are as follows: first, an independent mask is generated for each target according to its label:
t_maskk=FillPoly(Zeros([H,W]),gt_boxesk,labelsk) (4)
wherein FillPoly denotes pixel filling: a zero matrix of the same size as the input image is constructed first, then the pixels in the target area are set to the target's class index, and after construction all target masks are concatenated to construct a high-dimensional matrix:
mask=Concate(t_maskk),k∈[0,len(gt_boxes)] (5)
and a one-hot vector of the target box regression is constructed according to the result of the candidate region generation network:
onehot=One_hot(rois_assignments,len(gt_boxes)) (6)
here, rois _ assignments represents the target box number corresponding to each candidate region, and at the same time, the mask is cropped using the generated candidate region result:
rois_cropped_mask=ROI_Align(rois,mask) (7)
here, ROI_Align denotes a crop-and-scale operation, where masks are cropped according to the candidate regions while being scaled to the same size, and finally, the corresponding mask is activated using the one-hot vector:
rois_mask=Sum(rois_onehot*rois_cropped_mask) (8)
step four, the construction of the loss function is divided into three parts, and the overall loss function is as follows:
Loss_total = Loss_RPN + Loss_FAST + Loss_MASK (9)
wherein the loss function of the candidate region generation network is defined as follows:

[Equation (10): RPN loss; rendered only as an image in the original]

the target prediction loss is defined as follows:

[Equation (11): target prediction loss; rendered only as an image in the original]

and the attention loss is defined as follows:

[Equation (12): attention loss; rendered only as an image in the original]
wherein λ_1, λ_2, λ_3, λ_4, λ_5 are the weights controlling each loss term; rp_n denotes the foreground prediction probability generated for the n-th candidate region; rv_ni denotes the anchor offset prediction of parameter i generated for the n-th candidate region; gp_n denotes the true foreground/background probability (1 for foreground, 0 for background); gv_ni denotes the true anchor offset value of parameter i for the n-th candidate region; Fp_n denotes the classification result of the n-th predicted target; Fv_ni denotes the rotated-box localization result of parameter i for the n-th predicted target; Gp_n denotes the true class of the n-th target; Gv_ni denotes the true rotated-box localization of parameter i for the n-th target; R_h^n and R_w^n denote the scaled attention feature size; u_re denotes the value at the corresponding position of the attention feature map; gu_re denotes the true mask value; L_cls denotes the softmax cross entropy; L_reg and L_reg_θ denote the smooth L1 loss; and L_AD denotes the pixel-level softmax cross entropy.
CN202010823765.0A 2020-08-17 2020-08-17 Method for detecting rotating target in aerial image Active CN111914795B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010823765.0A CN111914795B (en) 2020-08-17 2020-08-17 Method for detecting rotating target in aerial image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010823765.0A CN111914795B (en) 2020-08-17 2020-08-17 Method for detecting rotating target in aerial image

Publications (2)

Publication Number Publication Date
CN111914795A CN111914795A (en) 2020-11-10
CN111914795B true CN111914795B (en) 2022-05-27

Family

ID=73279011

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010823765.0A Active CN111914795B (en) 2020-08-17 2020-08-17 Method for detecting rotating target in aerial image

Country Status (1)

Country Link
CN (1) CN111914795B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112508029A (en) * 2020-12-03 2021-03-16 苏州科本信息技术有限公司 Instance segmentation method based on target box labeling
CN112799055A (en) * 2020-12-28 2021-05-14 深圳承泰科技有限公司 Method and device for detecting detected vehicle and electronic equipment
CN112926463A (en) * 2021-03-02 2021-06-08 普联国际有限公司 Target detection method and device
CN112926480B (en) * 2021-03-05 2023-01-31 山东大学 Multi-scale and multi-orientation-oriented aerial photography object detection method and system
CN112907972B (en) * 2021-04-06 2022-11-29 昭通亮风台信息科技有限公司 Road vehicle flow detection method and system based on unmanned aerial vehicle and computer readable storage medium
CN113298720B (en) * 2021-04-21 2022-08-19 重庆邮电大学 Self-adaptive overlapped image rotation method
CN113591748A (en) * 2021-08-06 2021-11-02 广东电网有限责任公司 Aerial photography insulator sub-target detection method and device
CN113723217A (en) * 2021-08-09 2021-11-30 南京邮电大学 Object intelligent detection method and system based on yolo improvement
CN113628208B (en) * 2021-08-30 2024-02-06 北京中星天视科技有限公司 Ship detection method, device, electronic equipment and computer readable medium
CN113673478B (en) * 2021-09-02 2023-08-11 福州视驰科技有限公司 Port large-scale equipment detection and identification method based on deep learning panoramic stitching
CN114360007B (en) * 2021-12-22 2023-02-07 浙江大华技术股份有限公司 Face recognition model training method, face recognition device, face recognition equipment and medium
CN114119610B (en) * 2022-01-25 2022-06-28 合肥中科类脑智能技术有限公司 Defect detection method based on rotating target detection
CN116306936A (en) * 2022-11-24 2023-06-23 北京建筑大学 Knowledge graph embedding method and model based on hierarchical relation rotation and entity rotation
CN116823838B (en) * 2023-08-31 2023-11-14 武汉理工大学三亚科教创新园 Ocean ship detection method and system with Gaussian prior label distribution and characteristic decoupling


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10699119B2 (en) * 2016-12-02 2020-06-30 GEOSAT Aerospace & Technology Methods and systems for automatic object detection from aerial imagery
US11295532B2 (en) * 2018-11-15 2022-04-05 Samsung Electronics Co., Ltd. Method and apparatus for aligning 3D model

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101807256A (en) * 2010-03-29 2010-08-18 天津大学 Object identification detection method based on multiresolution frame
JP2019049484A (en) * 2017-09-11 2019-03-28 コニカミノルタ株式会社 Object detection system and object detection program
CN110276269A (en) * 2019-05-29 2019-09-24 西安交通大学 A kind of Remote Sensing Target detection method based on attention mechanism
CN111079519A (en) * 2019-10-31 2020-04-28 高新兴科技集团股份有限公司 Multi-posture human body detection method, computer storage medium and electronic device
CN111178213A (en) * 2019-12-23 2020-05-19 大连理工大学 Aerial photography vehicle detection method based on deep learning
CN111223041A (en) * 2020-01-12 2020-06-02 大连理工大学 Full-automatic natural image matting method
CN111401201A (en) * 2020-03-10 2020-07-10 南京信息工程大学 Aerial image multi-scale target detection method based on spatial pyramid attention drive
CN111428765A (en) * 2020-03-17 2020-07-17 武汉大学 Target detection method based on global convolution and local depth convolution fusion

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Fuhao Zou et al. Arbitrary-oriented object detection via dense feature fusion and attention model for remote sensing super-resolution image. Neural Computing and Applications, 2020, vol. 32. *
Ronggang Huang et al. Multiple rotation symmetry group detection via saliency-based visual attention and Frieze expansion pattern. Signal Processing: Image Communication, 2018, vol. 60. *
Tsung-Yi Lin et al. Feature Pyramid Networks for Object Detection. Computer Vision and Pattern Recognition, 2017. *
Zhou Ying et al. Uncertain target detection based on a multi-scale feature clustering algorithm. Fire Control & Command Control, 2019, vol. 44, no. 4. *
Lei Jiahui. Remote sensing target detection algorithm based on deep learning. China Master's Theses Full-text Database (Engineering Science and Technology), 2020. *

Also Published As

Publication number Publication date
CN111914795A (en) 2020-11-10

Similar Documents

Publication Publication Date Title
CN111914795B (en) Method for detecting rotating target in aerial image
Yu et al. A real-time detection approach for bridge cracks based on YOLOv4-FPM
CN111145174B (en) 3D target detection method for point cloud screening based on image semantic features
CN110728200B (en) Real-time pedestrian detection method and system based on deep learning
Liu et al. Multiscale U-shaped CNN building instance extraction framework with edge constraint for high-spatial-resolution remote sensing imagery
CN111178213A (en) Aerial photography vehicle detection method based on deep learning
CN104134234A (en) Full-automatic three-dimensional scene construction method based on single image
CN112633277A (en) Channel ship board detection, positioning and identification method based on deep learning
CN111242041A (en) Laser radar three-dimensional target rapid detection method based on pseudo-image technology
CN111640116B (en) Aerial photography graph building segmentation method and device based on deep convolutional residual error network
CN111914720B (en) Method and device for identifying insulator burst of power transmission line
CN110991444A (en) Complex scene-oriented license plate recognition method and device
CN113343858B (en) Road network geographic position identification method and device, electronic equipment and storage medium
CN113177560A (en) Universal lightweight deep learning vehicle detection method
CN116343053B (en) Automatic solid waste extraction method based on fusion of optical remote sensing image and SAR remote sensing image
Lu et al. A CNN-transformer hybrid model based on CSWin transformer for UAV image object detection
CN115861619A (en) Airborne LiDAR (light detection and ranging) urban point cloud semantic segmentation method and system of recursive residual double-attention kernel point convolution network
CN111027538A (en) Container detection method based on instance segmentation model
CN114519819B (en) Remote sensing image target detection method based on global context awareness
CN110287798B (en) Vector network pedestrian detection method based on feature modularization and context fusion
Liao et al. Lr-cnn: Local-aware region cnn for vehicle detection in aerial imagery
CN113902792A (en) Building height detection method and system based on improved RetinaNet network and electronic equipment
CN113989296A (en) Unmanned aerial vehicle wheat field remote sensing image segmentation method based on improved U-net network
CN113326734A (en) Rotary target detection method based on YOLOv5
CN110348311B (en) Deep learning-based road intersection identification system and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant