CN113850761A - Remote sensing image target detection method based on multi-angle detection frame - Google Patents

Remote sensing image target detection method based on multi-angle detection frame

Info

Publication number
CN113850761A
CN113850761A
Authority
CN
China
Prior art keywords
angle
size
remote sensing
frame
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111007113.0A
Other languages
Chinese (zh)
Other versions
CN113850761B (en)
Inventor
王素玉
许凯焱
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202111007113.0A priority Critical patent/CN113850761B/en
Publication of CN113850761A publication Critical patent/CN113850761A/en
Application granted granted Critical
Publication of CN113850761B publication Critical patent/CN113850761B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G06T 7/0002 Inspection of images, e.g. flaw detection
    • G06F 18/2415 Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio
    • G06N 3/045 Neural networks; combinations of networks
    • G06N 3/08 Neural networks; learning methods
    • G06T 7/11 Region-based segmentation
    • G06T 2207/10032 Satellite or aerial image; remote sensing
    • G06T 2207/20016 Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; pyramid transform


Abstract

The invention discloses a remote sensing image target detection method based on a multi-angle detection frame. A tilt angle module is designed on the basis of the positive frame predicted by Faster R-CNN, and is mainly divided into two stages: in the first stage, a fully connected layer and a decoder perform a preliminary angle-offset rotation; in the second stage, rotated RoI Align extracts rotation-invariant features, and the angle offset is corrected again to obtain a detection frame with an accurate angle. In addition, to address the large size and slow training of remote sensing images, the regression loss function of the inclination detection module is redesigned so that it converges faster and reaches higher accuracy. Experimental results show that the accuracy of the invention is improved by 4.4% compared with the improved Faster R-CNN, proving that the invention has a good detection effect.

Description

Remote sensing image target detection method based on multi-angle detection frame
Technical Field
The invention belongs to the field of target detection in computer vision, and relates to a method for detecting the target class and marking the target position in a picture using a convolutional neural network. Compared with current remote sensing target detection methods such as ROI-Transformer, SCRDet and R3Det, it achieves higher accuracy, and it can detect the inclination direction according to the characteristics of remote sensing images.
Background
In recent years, the aerospace industry in China has developed rapidly and remote sensing satellite technology has continuously improved; satellites acquire a large number of images every day for various purposes. Satellites with visible-light cameras are the most common, and visible-light remote sensing images are the most intuitive, making the targets in them easy to distinguish. However, traditional detection algorithms require manual feature extraction, their recognition performance cannot meet daily needs, and they are highly sensitive to external factors. With the continuous development of deep learning, convolutional neural networks extract features automatically, greatly reducing the cost of manual work and significantly improving accuracy. Nevertheless, even the most advanced detectors still cannot fully meet current practical needs; insufficient accuracy remains an urgent problem in this field.
At present, remote sensing target detection methods based on convolutional neural networks have made great progress: single-stage methods such as R3Det, PIOU and DRN, and dual-stage methods such as R2CNN, RRPN, ROI-Transformer and SCRDet. Although these methods have clear advantages over traditional methods and achieve quite high accuracy on the mainstream data sets DOTA and HRSC2016, their accuracy is still insufficient and there remains considerable room for improvement.
Disclosure of Invention
Aiming at the problem of insufficient accuracy in existing algorithms, the invention designs a target detection algorithm based on multi-angle remote sensing images, which improves on those algorithms to different degrees.
The invention adopts the following technical scheme: a target detection algorithm based on multi-angle remote sensing images. The specific detection process is as follows: first, the pictures are preprocessed and data-enhanced, then fed into the convolutional neural network provided by the invention; features are extracted through a backbone network and sent to an RPN network to generate a specific number of proposals; the proposals then undergo RoI Align to output a feature map with a fixed size of 7 × 7, which enters a fully connected layer to output a positive frame; the positive frame then undergoes angle-offset regression through a fully connected layer and a decoder, and finally the angle is corrected through rotated RoI Align to obtain the final detection result.
(1) Data preprocessing: the invention uses the DOTA data set. To facilitate training and prediction, the width and height of images input to the network are limited to 1024 × 1024. If the original size is larger than 1024 × 1024, the picture is divided into several 1024 × 1024 pictures with a sliding window of step length 512; if the original size is smaller than 1024 × 1024, it is padded with a black background. In this way the data size can be preprocessed without losing boundary information.
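The tiling step described above (1024 × 1024 windows with stride 512, black-background padding for undersized images) can be sketched as follows. This is an illustrative reconstruction, not code from the patent; the function name, the edge-clamping of the last window, and the return format are our own assumptions.

```python
import numpy as np

def tile_image(img, tile=1024, stride=512):
    """Split an image into tile-by-tile patches with a sliding window.
    Images smaller than the tile in either dimension are padded with a
    black (zero) background, as described in the preprocessing step."""
    h, w = img.shape[:2]
    ph, pw = max(h, tile), max(w, tile)
    padded = np.zeros((ph, pw) + img.shape[2:], dtype=img.dtype)
    padded[:h, :w] = img

    def starts(total):
        s = list(range(0, total - tile + 1, stride))
        if s[-1] != total - tile:   # clamp a final window to the edge
            s.append(total - tile)
        return s

    return [((x0, y0), padded[y0:y0 + tile, x0:x0 + tile])
            for y0 in starts(ph) for x0 in starts(pw)]
```

For example, an 800 × 2000 image is padded to 1024 rows and yields three overlapping 1024 × 1024 patches along the width.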
(2) Data enhancement: aiming at the many small target objects in remote sensing images, a data enhancement strategy is designed. During training, each iteration computes the ratio of the regression loss of frames with area smaller than 32 × 32 to the total regression loss of the whole picture. If the ratio is smaller than 0.4, the small-target loss of this iteration is considered to contribute insufficiently to the total regression loss; the next iteration then randomly selects four pictures from the training set, shortens the length and width of each to 1/2, splices the four pictures into a new picture, modifies the coordinates of the corresponding ground truth, and sends the picture to network training.
(3) Model setup and training: the network model mainly comprises a backbone convolutional neural network, a feature pyramid, an RPN fully convolutional network, an ROI classifier and an inclination angle regression network. The backbone is a residual network with a depth of 152 layers, divided into 5 parts whose convolutions are grouped in parallel. Features extracted through 4 downsampling operations of the backbone enter the feature pyramid, where they are fused through 3 upsampling operations and 1 max pooling; the output feature layers have 256 channels. They are then sent to the RPN fully convolutional network to generate proposals, after which RoI Align maps the features to 7 × 7. Classification and regression through the fully connected layer produce a positive frame, which is finally sent to the inclination angle regression network: an initial angle-offset regression is performed through a fully connected layer and a decoder, and then rotated RoI Align extracts rotation-invariant features to obtain the final detection result.
During training, a pre-trained ResNet-152 model is used. The classification losses of the RPN classifier and the ROI classifier use cross entropy, and the regression losses use SmoothL1; in the tilt angle network, the classification loss is still cross entropy, while the regression loss function is redesigned. The optimizer is SGD with momentum, with an initial learning rate of 0.00125, for a total of 15 iterations.
(4) Model prediction: after training is complete, the model is stored and the trained parameters are loaded; test data of any size can then be input, and the class and position of the objects in the data are obtained end to end. This stage only loads the trained model parameters; data enhancement is not used in this step.
The evaluation index is mean average precision (mAP). Evaluated on the DOTA test set, the method obtains a competitive result: compared with currently common single-stage and dual-stage algorithms it achieves higher accuracy and recognition performance, and it can recognize data of any size end to end.
Drawings
Fig. 1 is a schematic overall flow chart of the method of the present invention.
Fig. 2 is a schematic diagram of a convolutional neural network structure according to the present invention.
Fig. 3 is a schematic diagram of a data enhancement result according to the present invention.
Detailed Description
The following detailed description of embodiments of the invention is provided in conjunction with the accompanying drawings:
a remote sensing image target detection method based on a multi-angle detection frame is disclosed. As shown in fig. 1, the detection process is: preprocess the image and apply data enhancement (training stage only); divide the image into 1024 × 1024 patches; send them to the backbone network for downsampling to extract features; send the features to the feature pyramid for upsampling, fusing features of different levels in the process; send them to the RPN fully convolutional network to generate proposals; perform RoI Align to map the feature map to 7 × 7 and classify and regress; finally, send the result to the inclination angle regression network for final classification and regression to obtain the detection result.
The specific algorithm is referred to as follows:
(1) data preprocessing: preprocessing is performed on the DOTA data set, the best-known data set in current remote sensing target detection. Considering the characteristics of remote sensing images, many pictures have very high resolution, so their length and width must be limited for convenient training and prediction and to save computing resources; input pictures are uniformly limited to 1024 × 1024. Inside the network, features pass through the 152-layer residual network and the feature pyramid structure, with 4 downsamplings, 3 upsamplings and 1 max-pooling fusion; the 1024 × 1024 size guarantees that the feature map size is evenly divisible by 32 at every stage.
(2) Data enhancement: since there are many small targets in the data set, poor detection performance on small targets is a very important issue. Research shows that one reason for poor small-target detection is that the regression loss of small target objects contributes insufficiently to the total regression loss, so increasing the loss contribution of small targets is one way to solve the problem. A data enhancement method is designed: first, the regression loss of the inclination detection module for objects smaller than 32 × 32 in the current iteration is calculated, denoted L_s, where s denotes an object of size smaller than 32 × 32; then the total regression loss of the inclination detection module in this iteration is calculated, denoted L_reg; finally, the loss contribution rate a of small target objects is calculated by the following formula:
a = L_s / L_reg   formula (1)
If the calculated a is less than 0.4, the contribution of small-target loss to the total regression loss in this iteration is considered insufficient and needs to be strengthened. The next iteration then randomly selects four pictures from the training set, reduces the w and h of each picture to w/2 and h/2 (changing the area to 1/4 of the original), and splices them in a 2 × 2 arrangement into a new picture whose size is still 1024 × 1024; the splicing diagram is shown in fig. 3. Training then continues with this picture, increasing the proportion of small targets, improving small-target detection performance and thereby improving the overall detection performance.
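A minimal sketch of the 2 × 2 splicing described above, assuming four 1024 × 1024 inputs and (cx, cy, w, h) ground-truth boxes. Nearest-neighbour subsampling (`img[::2, ::2]`) stands in for the unspecified resizing method, and all names are illustrative, not from the patent.

```python
import numpy as np

def mosaic4(imgs, boxes_list, out=1024):
    """Shrink four images to half size, tile them in a 2x2 grid, and
    remap the (cx, cy, w, h) ground-truth boxes onto the new canvas."""
    assert len(imgs) == len(boxes_list) == 4
    canvas = np.zeros((out, out) + imgs[0].shape[2:], dtype=imgs[0].dtype)
    half = out // 2
    offsets = [(0, 0), (0, half), (half, 0), (half, half)]  # (row, col)
    new_boxes = []
    for img, boxes, (oy, ox) in zip(imgs, boxes_list, offsets):
        small = img[::2, ::2]                 # halve width and height
        canvas[oy:oy + half, ox:ox + half] = small[:half, :half]
        for cx, cy, w, h in boxes:            # scale by 1/2, then shift
            new_boxes.append((cx / 2 + ox, cy / 2 + oy, w / 2, h / 2))
    return canvas, new_boxes
```

Each source box keeps its class label; only the coordinates are halved and shifted into the corresponding quadrant.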
(3) Model setup and training
As shown in fig. 2, the network model mainly includes a backbone convolutional neural network, a feature pyramid structure, an RPN fully convolutional network, an ROI classifier and an inclination angle regression network. After data preprocessing, a picture is input to the backbone convolutional neural network and the feature pyramid structure for feature extraction. The feature maps output by the feature pyramid pass through a 3 × 3 convolution to increase the receptive field of the network, then a 1 × 1 convolution for dimension reduction, finally generating proposal candidate frames. The ROI classifier then applies bilinear interpolation to obtain mapped feature maps, followed by a fully connected layer whose regression and classification outputs give a positive candidate frame. Owing to the characteristics of the data set, however, this frame must be sent to the inclination angle regression network for the final regression offset and classification prediction. Compared with a common two-stage network, the whole process does not increase the number of anchors, yet an accurate regression frame is obtained.
To obtain the best detection performance, a 152-layer residual network is used throughout the model, and the convolution kernels are grouped in parallel to reduce parameters. The backbone is divided into five layers C1, C2, C3, C4 and C5, through which the picture is downsampled 4 times. Layer C1 uses a 7 × 7 convolution layer and a ReLU activation, followed by one max pooling. To reduce the parameter count while improving detection performance, layers C2-C5 all use grouped convolution: C2 uses 1 × 1, 3 × 3 and 1 × 1 convolution groups, whose convolution dimension is divided into 32 groups; C3 uses 8 convolution groups, C4 uses 36 convolution groups, and C5 uses 3 convolution groups; the four output feature maps have 256, 512, 1024 and 2048 dimensions. The feature maps output by layers C2-C5 are then convolved with 1 × 1 kernels to compress the feature dimension to 256; the 2048-dimensional feature map output by C5 is upsampled 3 times, and each 256-dimensional output is added to the 256-dimensional features from C2-C5 to obtain layers P2'-P5'. A 256-dimensional 3 × 3 convolution then gives the final layers P2-P5, and a 1 × 1 max pooling with stride 2 on P5 gives layer P6 for subsequent calculation.
Then, the output P2-P6 layers extract features through a 256-dimensional 3 × 3 convolution layer, so that features are extracted at different levels and fused, better capturing diverse features and further improving the detection performance of the whole network. Then k anchors are obtained and split into two 1 × 1 convolution branches for classification and regression loss, and 128 positive and 128 negative proposals are computed for the subsequent ROI classifier. The classification loss L_cls follows formula (2), and the regression loss L_reg follows formula (3); the overall RPN-stage loss is calculated as shown in formula (4).
L_cls(p_i, p_i*) = -[p_i* log(p_i) + (1 - p_i*) log(1 - p_i)]   formula (2)
wherein p_i represents the probability that the i-th anchor predicts the true label, and p_i* is 1 when the current sample is positive and 0 when the current sample is negative;
L_reg(t_i, t_i*) = smooth_L1(t_i - t_i*)   formula (3)
L_RPN = (1/N_cls) Σ_i L_cls(p_i, p_i*) + γ (1/N_reg) Σ_i p_i* L_reg(t_i, t_i*)   formula (4)
wherein t_i* represents the offset of this anchor with respect to the ground truth, t_i represents the predicted offset, and
smooth_L1(x) = 0.5 x² if |x| < 1, and |x| - 0.5 otherwise,
wherein γ is set to 1 in the present invention.
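The SmoothL1 regression loss named above has the standard piecewise form (quadratic near zero, linear in the tails); a small sketch of the standard definition, applied element-wise to the offset residual:

```python
import numpy as np

def smooth_l1(x):
    """SmoothL1: 0.5*x^2 where |x| < 1, |x| - 0.5 elsewhere.  Applied
    element-wise to the regression residual t_i - t_i*."""
    ax = np.abs(x)
    return np.where(ax < 1, 0.5 * ax ** 2, ax - 0.5)
```

The quadratic region gives small gradients for small errors, while the linear tails keep large outliers from dominating the loss.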
The positive and negative samples enter the ROI classifier: RoI Align maps the proposals extracted by the RPN to a 7 × 7 feature map, a fully connected layer then produces a 1024-dimensional feature, and a classification loss L_clsr and a regression loss L_regr are calculated again, in the same way as the RPN losses. The total loss function is calculated as in formula (5). The regression frame obtained here is a positive frame, but the characteristics of remote sensing images require an angled, inclined frame, so the calculated positive frame is passed to the inclination detection module.
L(p, u, t^u, v) = L_clsr(p, u) + λ[u ≥ 1] L_regr(t^u, v)   formula (5)
wherein p is the softmax probability distribution predicted by the classifier, u is the true label value of the corresponding target, t^u = (t^u_x, t^u_y, t^u_w, t^u_h) are the regression parameters predicted for class u, and v = (v_x, v_y, v_w, v_h) are the regression parameters of the bounding box of the real target.
The inclination detection module is the focus here. Taking the positive frame output by the modified Faster R-CNN module as input, it first extracts angle-offset features through RoI Align and a fully connected layer of size 5, then sends them to the decoder, which outputs the preliminary feature RROI. RoI Align is applied again to extract deep features of the RROI, which are sent to a fully connected layer of size 2048 for classification and regression loss calculation; the classification loss L_clsx follows formula (2), while the regression loss L_regx adopts a new calculation method, shown in formula (6). The final classification and regression results are then obtained.
L_regx is calculated by formula (6):
[formula (6): redesigned regression loss; rendered only as equation images in the source]
In order to ensure that the loss is continuously differentiable, a·ln(b + β) = μ when x equals 1, with parameters a = 0.5, β = 1, μ = 1.5.
the inclination detection module is mainly divided into two parts, the first part is an angle rotation module, the module mainly rotates a horizontal anchor frame into an inclined anchor frame, the coordinates of the obtained positive frame are assumed to be (x, y, w, h), wherein x, y represent the coordinates of the central point of the positive anchor frame, w, h represent the width and height of the positive anchor frame, and under the most ideal condition, the positive anchor frame is an external rectangle of the inclined frame, the position and angle deviation is carried out by a middle full connection layer and a decoder in a network, and the calculation of the deviation amount compared with the ground route is shown in a formula (7).
t_x = (x* - x_r)/w_r,  t_y = (y* - y_r)/h_r
t_w = ln(w*/w_r),  t_h = ln(h*/h_r)
t_θ = θ* - θ_r   formula (7)
wherein (x_r, y_r, w_r, h_r, θ_r) represents the coordinates of the offset frame calculated by the angle rotation module, and (x*, y*, w*, h*, θ*) represents the coordinates of the ground-truth box.
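Formula (7) is rendered only as images in the source; a common choice for such offsets, and the assumption made in this sketch, is the standard box-delta encoding extended with an angle term. The function name and parameterization below are illustrative:

```python
import math

def rotation_offsets(pred, gt):
    """Offsets of a rotated ground-truth box gt = (x*, y*, w*, h*, th*)
    relative to the predicted frame pred = (xr, yr, wr, hr, thr),
    following the standard delta encoding with an extra angle term."""
    xr, yr, wr, hr, thr = pred
    xs, ys, ws, hs, ths = gt
    return ((xs - xr) / wr, (ys - yr) / hr,
            math.log(ws / wr), math.log(hs / hr), ths - thr)
```

Normalizing the centre shift by the frame size and taking log-ratios of width and height keeps the targets scale-invariant.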
The second part is the angle correction module, which extracts deep-level features from the features after the first part's offset, so that although the angle is rotated the features are unchanged; the extracted deep features allow the angle to be corrected again, making the regressed rotating frame more robust and better fitted to the angle of the target object. The specific flow is: the inclined frame parameters (x_r, y_r, w_r, h_r, θ_r) calculated by the first part and a feature map D of size (H × W × C) are input; rotated RoI Align divides features and parameters into K × K grids (bins) and computes a feature map y of size K × K × C. For a grid with index (0 ≤ i, j < K) and output dimension c (0 ≤ c < C), the feature map y is calculated as shown in formula (8).
y_c(i, j) = Σ_{(x,y)∈bin(i,j)} D_{i,j,c}(T_θ(x, y)) / n_ij   formula (8)
wherein D_{i,j,c} represents the feature map of size K × K × C, n_ij represents the number of samples in the grid, and bin(i, j) represents the set of grid coordinates, calculated as shown in formula (9); T_θ represents the transformation of each real grid coordinate (x, y) into the coordinate (x', y') on the feature map, shown in formula (10).
bin(i, j) = { (x, y) : i·w_r/K ≤ x < (i+1)·w_r/K, j·h_r/K ≤ y < (j+1)·h_r/K }   formula (9)
(x', y') = ( x_r + (x - w_r/2)·cos θ_r - (y - h_r/2)·sin θ_r,  y_r + (x - w_r/2)·sin θ_r + (y - h_r/2)·cos θ_r )   formula (10)
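Formula (10)'s transform T_θ maps bin coordinates onto the feature map; the source renders it only as an image, so the sketch below assumes the usual rotated-RoI form, a plain rotation of the bin coordinate about the frame centre:

```python
import math

def t_theta(x, y, box):
    """Map a bin coordinate (x, y), given in the frame's local
    coordinate system (0..wr, 0..hr), onto the feature map by rotating
    about the frame centre (xr, yr) through the angle thr."""
    xr, yr, wr, hr, thr = box
    dx, dy = x - wr / 2, y - hr / 2
    return (xr + dx * math.cos(thr) - dy * math.sin(thr),
            yr + dx * math.sin(thr) + dy * math.cos(thr))
```

Sampling the feature map at these rotated positions is what makes the pooled K × K grid rotation-invariant.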
The loss function of the whole network is shown in formula (11): the total loss, combining the RPN-stage loss, the fully-connected-layer-stage loss and the inclination-module loss, is trained jointly.
L_all = L(p, u, t^u, v) + L_clsx + L_regx   formula (11)
During training, a 1080 Ti graphics card is used for computation, with a pre-trained ResNet-152 model and an SGD optimizer with momentum; the initial learning rate is 0.00125, for a total of 15 iterations.
(4) Model prediction and evaluation
The trained model is stored and its parameters loaded; the class and position of objects in a remote sensing image of any size can then be predicted directly, end to end. During prediction, images larger than 1024 × 1024 are still segmented with a sliding window of step length 512 and sent into the model; if the same object appears in several patches, only the instance with the highest confidence is kept when drawing the anchor frame, and the others are discarded. Images smaller than 1024 × 1024 are padded with a black background to 1024 × 1024. Model evaluation loads the model parameters, predicts the local test-set pictures, generates the classes and anchor-frame position coordinates, and then performs online evaluation through the DOTA website. The evaluation index is mean average precision (mAP). The prediction performance of the algorithm is evaluated on the DOTA data set; compared with the improved Faster R-CNN algorithm, the improved algorithm is greatly improved. The experimental results are shown in Table 1.
TABLE 1 comparison of predicted Performance of the method proposed by the present invention
[Table 1 appears in the source only as an image: mAP on DOTA for the proposed method versus the improved Faster R-CNN, SCRDet and R3Det.]
As shown in Table 1, on the DOTA data set the improved algorithm is 4.42% better than the improved Faster R-CNN algorithm, a strong result, and it also outperforms the currently popular algorithms SCRDet and R3Det. The experimental results prove that the algorithm is effective and can more accurately recognize objects in remote sensing images.
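The prediction stage's rule of keeping only the highest-confidence instance when the same object appears in several overlapping tiles amounts to a greedy suppression over the stitched detections. A sketch using axis-aligned IoU for brevity (the method's frames are rotated, and the 0.5 threshold is our assumption, not stated in the patent):

```python
def merge_tile_duplicates(dets, iou_thr=0.5):
    """dets: list of ((x1, y1, x2, y2), score) in global image
    coordinates.  Greedily keeps the highest-scoring detection and
    drops any lower-scoring box overlapping it above iou_thr."""
    def iou(a, b):
        ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = ix * iy
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0
    kept = []
    for box, score in sorted(dets, key=lambda d: -d[1]):
        if all(iou(box, kb) < iou_thr for kb, _ in kept):
            kept.append((box, score))
    return kept
```

Tile coordinates must first be shifted back into the full-image frame so that duplicates from adjacent tiles actually overlap.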

Claims (6)

1. A remote sensing image target detection method based on a multi-angle detection frame, characterized in that the method comprises three parts: data preprocessing and data enhancement; positive frame generation; and angle rotation and angle correction:
firstly, the input data is preprocessed so that the size of images input to the network conforms to the preset size; then a positive frame without an angle is classified and output through the backbone network, the feature pyramid structure, the RPN structure and the ROI;
then an inclination angle module is entered, and a first angle rotation is performed using a 1 × 1 convolution, a fully connected layer and a decoder;
in order to obtain a more accurate angle, the angle must be corrected: a rotated RoI Align, a 1 × 1 convolution and a fully connected layer correct the angle of the first rotation; in the training stage, the loss function of position regression is redesigned to facilitate training, and whether the next iteration uses the data enhancement strategy on the input data is determined according to the loss contribution rate of small targets.
2. The method for detecting the target of the remote sensing image based on the multi-angle detection frame as claimed in claim 1, wherein:
the data preprocessing judges whether the picture size is smaller than 1024 × 1024: if so, the picture is padded with a black background to 1024 × 1024; if larger, it is divided into n pictures of size 1024 × 1024 using a sliding window with a step size of 512 pixels, so that targets on the boundaries of the divided images are still detected completely.
3. The method for detecting the target of the remote sensing image based on the multi-angle detection frame as claimed in claim 1, wherein:
the data enhancement strategy is based on the regression loss L of the size of the ground route box less than 32 multiplied by 32 in one iterationsAccounts for the total regression loss LregIf the ratio is less than 0.4, the next iteration of inputting the network image will randomly select four pictures from the training set, the length-width ratio of each picture is reduced to 1/2, and then the ratio is determined according to the length-width ratio of each picture
Figure FDA0003237606330000011
The modes are combined, if the ratio is more than or equal to 0.4, the original training set picture is normally input, and a loss ratio calculation formula is shown as follows;
a=Ls/Lreg
4. the method for detecting the target of the remote sensing image based on the multi-angle detection frame as claimed in claim 1, wherein:
the first angular rotation firstly uses a 10-channel 1 × 1 convolution to reduce dimension, and then uses a full-link layer and a decoder to perform the first angular rotation, and the offset calculation method compared with the ground route is as follows:
Figure FDA0003237606330000021
Figure FDA0003237606330000022
Figure FDA0003237606330000023
Figure FDA0003237606330000024
wherein (x)r,yr,wr,hr,θr) The coordinates of the frame after the offset calculated in the first stage are shown, (x)*,y*,w*,h*,θ*) Coordinates of the box representing the ground truth.
5. The method for detecting the target of the remote sensing image based on the multi-angle detection frame as claimed in claim 1, wherein:
the second angle correction uses a rotated roi align to extract the deep features of the features after the first part of the offset features, so that the features are not changed although the angle is rotated, the extracted deep features can correct the angle again, and a regressed rotating frame can be more robust and fit with a target object; the specific procedure of angle correction is to calculate the first part of the calculated tilt frame parameters (x)r,yr,wr,hr,θr) And inputting a feature diagram D with the size of (H multiplied by W multiplied by C), dividing features and parameters into a feature diagram y with the size of K multiplied by C through a rotated roi align, then using a 1 multiplied by 1 convolution with 10 channels to reduce dimensions, and finally using a full connection layer to carry out final classification and regression; for an index of (i ≧ 0, j)<K) The calculation mode of the feature graph y with the grid output dimension C (C is more than or equal to 0 and less than or equal to C) is as follows:
y_c(i, j) = Σ_{(x,y) ∈ bin(i,j)} D_{i,j,c}(T_θ(x, y)) / n_{ij}
wherein D_{i,j,c} denotes the feature map of size K × K × C, n_{ij} denotes the number of sample points in grid cell (i, j), and bin(i, j) denotes the set of real coordinates of the grid cell with index (i, j), computed as follows; T_θ denotes the transformation of each real grid coordinate (x, y) into a coordinate (x', y') on the feature map:
bin(i, j) = { (x, y) | i·w_r/K ≤ x < (i+1)·w_r/K, j·h_r/K ≤ y < (j+1)·h_r/K }
x' = (x − w_r/2)·cos θ_r − (y − h_r/2)·sin θ_r + x_r
y' = (x − w_r/2)·sin θ_r + (y − h_r/2)·cos θ_r + y_r
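For illustration, the coordinate mapping T_θ of the rotated RoI Align in claim 5 can be sketched as below; the helper name `rroi_bin_centre` and the exact bin parameterisation (bin centres taken in the box's local axis-aligned frame, rotated by θ_r, then translated to the box centre) are assumptions based on the usual rotated RoI Align formulation, since the patent's equation images are not reproduced here.

```python
import math

def rroi_bin_centre(box, K, i, j):
    """Map the centre of bin (i, j) of a K x K grid over the rotated box
    (x_r, y_r, w_r, h_r, theta_r) to feature-map coordinates."""
    xr, yr, wr, hr, t = box
    # bin centre in the box's local frame, origin at the box centre
    lx = (i + 0.5) * wr / K - wr / 2.0
    ly = (j + 0.5) * hr / K - hr / 2.0
    # rotate by theta_r, then translate to the box centre
    x = lx * math.cos(t) - ly * math.sin(t) + xr
    y = lx * math.sin(t) + ly * math.cos(t) + yr
    return x, y

# e.g. an axis-aligned box (theta = 0) centred at (10, 20), size 4 x 8,
# K = 2: bin (0, 0) has its centre at (9.0, 18.0)
```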
6. The method for detecting the target of the remote sensing image based on the multi-angle detection frame as claimed in claim 1, wherein:
in order to make the loss function converge faster in the training stage of the angle correction, the regression loss function of the tilt-angle module is redesigned: the gradient value in the range |x| < 1 is increased, which shortens the training time and improves the detection performance of the model; the loss function is calculated as follows:
L(x) = (a/b)·(b|x| + β)·ln(b|x| + β) − a|x|, if |x| < 1
L(x) = μ|x| + C, otherwise
with gradient
∂L/∂x = a·ln(b|x| + β), if |x| < 1
∂L/∂x = μ, otherwise
In order to keep the gradient continuous at |x| = 1, a·ln(b + β) = μ is required; the parameters are set to a = 0.5, β = 1, μ = 1.5, which gives
b = e^(μ/a) − β = e³ − 1 ≈ 19.085
and C is chosen so that the two pieces of L(x) agree at |x| = 1, i.e. C = μ/b − a.
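A runnable sketch of a redesigned regression loss of the kind claim 6 describes; because the patent's equation images are not reproduced here, the piecewise form below is an assumption modelled on the balanced-L1 family that satisfies the stated continuity condition a·ln(b + β) = μ with a = 0.5, β = 1, μ = 1.5.

```python
import math

A, BETA, MU = 0.5, 1.0, 1.5
B = math.exp(MU / A) - BETA   # from the continuity condition a*ln(b + beta) = mu
C = MU / B - A                # makes the two pieces meet at |x| = 1

def tilt_loss(x):
    """Piecewise regression loss with a boosted gradient for |x| < 1."""
    ax = abs(x)
    if ax < 1.0:
        return (A / B) * (B * ax + BETA) * math.log(B * ax + BETA) - A * ax
    return MU * ax + C

def tilt_grad(x):
    """Gradient of tilt_loss: a*ln(b|x| + beta) inside |x| < 1, mu outside."""
    ax = abs(x)
    g = A * math.log(B * ax + BETA) if ax < 1.0 else MU
    return math.copysign(g, x) if x else 0.0
```

Near x = 0 the log branch gives a much larger gradient than smooth L1's linear ramp, which is the claimed source of faster convergence; the constants keep both the loss and its gradient continuous at |x| = 1.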
CN202111007113.0A 2021-08-30 2021-08-30 Remote sensing image target detection method based on multi-angle detection frame Active CN113850761B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111007113.0A CN113850761B (en) 2021-08-30 2021-08-30 Remote sensing image target detection method based on multi-angle detection frame


Publications (2)

Publication Number Publication Date
CN113850761A true CN113850761A (en) 2021-12-28
CN113850761B CN113850761B (en) 2024-06-14

Family

ID=78976487

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111007113.0A Active CN113850761B (en) 2021-08-30 2021-08-30 Remote sensing image target detection method based on multi-angle detection frame

Country Status (1)

Country Link
CN (1) CN113850761B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111091105A (en) * 2019-12-23 2020-05-01 郑州轻工业大学 Remote sensing image target detection method based on new frame regression loss function
CN111723748A (en) * 2020-06-22 2020-09-29 电子科技大学 Infrared remote sensing image ship detection method
CN111950488A (en) * 2020-08-18 2020-11-17 山西大学 Improved fast-RCNN remote sensing image target detection method
CN112560614A (en) * 2020-12-04 2021-03-26 中国电子科技集团公司第十五研究所 Remote sensing image target detection method and system based on candidate frame feature correction

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114419520A (en) * 2022-03-28 2022-04-29 南京智谱科技有限公司 Training method, device, equipment and storage medium of video-level target detection model
CN116363435A (en) * 2023-04-03 2023-06-30 盐城工学院 Remote sensing image target detection system and method based on deep learning
CN116363435B (en) * 2023-04-03 2023-10-27 盐城工学院 Remote sensing image target detection system and method based on deep learning

Also Published As

Publication number Publication date
CN113850761B (en) 2024-06-14

Similar Documents

Publication Publication Date Title
CN111325203B (en) American license plate recognition method and system based on image correction
CN113065558B (en) Lightweight small target detection method combined with attention mechanism
CN110443143B (en) Multi-branch convolutional neural network fused remote sensing image scene classification method
TWI762860B (en) Method, device, and apparatus for target detection and training target detection network, storage medium
WO2022002150A1 (en) Method and device for constructing visual point cloud map
CN109902677B (en) Vehicle detection method based on deep learning
CN111931684B (en) Weak and small target detection method based on video satellite data identification features
CN113012212B (en) Depth information fusion-based indoor scene three-dimensional point cloud reconstruction method and system
CN109886066B (en) Rapid target detection method based on multi-scale and multi-layer feature fusion
CN111652321A (en) Offshore ship detection method based on improved YOLOV3 algorithm
CN113591795B (en) Lightweight face detection method and system based on mixed attention characteristic pyramid structure
CN112150493A (en) Semantic guidance-based screen area detection method in natural scene
CN113850761B (en) Remote sensing image target detection method based on multi-angle detection frame
CN113850324B (en) Multispectral target detection method based on Yolov4
CN111680705A (en) MB-SSD method and MB-SSD feature extraction network suitable for target detection
CN113159215A (en) Small target detection and identification method based on fast Rcnn
CN114998566A (en) Interpretable multi-scale infrared small and weak target detection network design method
CN112364881B (en) Advanced sampling consistency image matching method
CN116342536A (en) Aluminum strip surface defect detection method, system and equipment based on lightweight model
CN117496158A (en) Semi-supervised scene fusion improved MBI contrast learning and semantic segmentation method
CN116778187A (en) Salient target detection method based on light field refocusing data enhancement
CN116363610A (en) Improved YOLOv 5-based aerial vehicle rotating target detection method
CN116740587A (en) Unmanned aerial vehicle aerial photographing target credible identification method based on double uncertainty perception of data and model
CN110910497A (en) Method and system for realizing augmented reality map
CN113971764B (en) Remote sensing image small target detection method based on improvement YOLOv3

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant