CN113850761B - Remote sensing image target detection method based on multi-angle detection frame - Google Patents

Remote sensing image target detection method based on multi-angle detection frame

Info

Publication number
CN113850761B
CN113850761B (application CN202111007113.0A)
Authority
CN
China
Prior art keywords
angle
size
frame
loss
remote sensing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111007113.0A
Other languages
Chinese (zh)
Other versions
CN113850761A (en)
Inventor
王素玉
许凯焱
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202111007113.0A priority Critical patent/CN113850761B/en
Publication of CN113850761A publication Critical patent/CN113850761A/en
Application granted granted Critical
Publication of CN113850761B publication Critical patent/CN113850761B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/0002 Inspection of images, e.g. flaw detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/11 Region-based segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10032 Satellite or aerial image; Remote sensing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20016 Hierarchical, coarse-to-fine, multiscale or multiresolution image processing; Pyramid transform


Abstract

The invention discloses a remote sensing image target detection method based on a multi-angle detection frame. On the basis of the horizontal (positive) frames predicted by faster-rcnn, an inclination angle module is designed, comprising two main stages: the first stage performs a preliminary angle-offset rotation through a fully connected layer and a decoder; the second stage uses rotated roi align to extract rotation-invariant features and corrects the angle offset again to obtain a detection frame with an accurate angle. In addition, the regression loss function of the inclination detection module is redesigned to address the large size of remote sensing images and slow training, so that the loss function converges faster and accuracy is higher. Experimental results show that, compared with the improved faster-rcnn, accuracy is improved by 4.4%, and the invention achieves a good detection effect.

Description

Remote sensing image target detection method based on multi-angle detection frame
Technical Field
The invention belongs to the field of target detection in computer vision and discloses a method that uses a convolutional neural network to detect the category of targets and mark their positions in a picture. Compared with existing remote sensing target detection methods such as ROI-Transformer, SCRDet and R3Det, it achieves higher accuracy, and it can detect the inclined orientation of targets according to the characteristics of remote sensing images.
Background
In recent years, China's aerospace industry has developed rapidly and remote sensing satellite technology has advanced continuously; satellites acquire a large number of images every day for various purposes. Satellites carrying visible-light cameras are the most common, and visible-light remote sensing images are the most intuitive, making targets in them easy to distinguish. Traditional detection algorithms, however, require manually extracted features, their recognition performance cannot meet everyday requirements, and they are strongly affected by external factors. With the continuous development of deep learning, the automatic feature extraction of convolutional neural networks greatly reduces the cost of manual work and markedly improves accuracy. Nevertheless, even the most advanced current detectors still cannot fully meet practical needs; insufficient accuracy remains a problem in this field and needs to be solved.
Currently, remote sensing target detection methods based on convolutional neural networks have developed considerably, for example the single-stage methods R3Det, PIOU and DRN, and the two-stage methods R2CNN, RRPN, ROI-Transformer and SCRDet. Although these methods have clear advantages over traditional ones and achieve relatively high accuracy on the mainstream DOTA and HRSC2016 datasets, their accuracy is still insufficient and there remains considerable room for improvement.
Disclosure of Invention
Aiming at this problem of insufficient accuracy, the invention designs a target detection algorithm based on multi-angle remote sensing images that improves on the above algorithms to different degrees.
The invention adopts the following technical scheme: a target detection algorithm based on multi-angle remote sensing images. The detection flow is as follows: first, the picture is preprocessed and data enhancement is applied; the picture is then fed into the convolutional neural network proposed by the invention, where a backbone network extracts its features; the features are sent to an RPN network to generate a fixed number of proposals; roi align is applied to the proposals to output feature maps of fixed size 7x7; these feature maps pass through fully connected layers to output positive (horizontal) frames; the positive frames then undergo angle-offset regression through a fully connected layer and a decoder; finally, rotated roi align performs angle correction to obtain the final detection result.
(1) Data preprocessing: the invention uses the DOTA dataset. To facilitate training and prediction, the width and height of images input to the network are limited: the input size is 1024x1024 during both training and prediction. If the original size is larger than 1024x1024, the image is divided into several 1024x1024 images by a sliding window with step 512; if the original size is smaller than 1024x1024, a black background is used for padding. In this way the data size can be preprocessed without losing boundary information.
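The tiling-and-padding preprocessing above can be sketched as follows. This is a minimal illustration, not the patent's actual code; the helper name `tile_image` and the return layout are assumptions.

```python
import numpy as np

TILE, STRIDE = 1024, 512  # window size and sliding-window step from the text

def tile_image(img):
    """Split an HxWxC image into 1024x1024 tiles with stride 512,
    padding with a black background so boundary information is kept."""
    h, w = img.shape[:2]
    # Pad up so every window position yields a full 1024x1024 tile.
    ph = max(TILE, STRIDE * int(np.ceil(max(h - TILE, 0) / STRIDE)) + TILE)
    pw = max(TILE, STRIDE * int(np.ceil(max(w - TILE, 0) / STRIDE)) + TILE)
    padded = np.zeros((ph, pw) + img.shape[2:], dtype=img.dtype)
    padded[:h, :w] = img
    tiles = []
    for y in range(0, ph - TILE + 1, STRIDE):
        for x in range(0, pw - TILE + 1, STRIDE):
            tiles.append(((x, y), padded[y:y + TILE, x:x + TILE]))
    return tiles
```

A 500x600 picture becomes one padded 1024x1024 tile, while larger pictures yield overlapping tiles whose offsets `(x, y)` let predictions be mapped back to the original image.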
(2) Data enhancement: aiming at the large number of small target objects in remote sensing images, a data enhancement strategy is designed. During training, each iteration computes the ratio of the regression loss of frames with area smaller than 32x32 to the total regression loss of the whole image. If this ratio is smaller than 0.4, the contribution of small-target losses to the total regression loss in that iteration is considered insufficient; in the next iteration, four images are randomly selected from the training set, the length and width of each are reduced to 1/2 of the original, the four images are stitched into one new image, the corresponding ground-truth coordinates are modified, and the new image is fed into network training.
(3) Model setting and training: the network model mainly comprises a backbone convolutional neural network, a feature pyramid, an RPN fully convolutional network, an ROI classifier and an inclination angle regression network. The backbone is a 152-layer residual network divided into 5 parts, with the convolutions in each part grouped in parallel. Features extracted by 4 downsampling operations of the backbone enter the feature pyramid for 3 upsampling operations and 1 max pooling, followed by fusion; the output feature layers have 256 channels. They are then sent to the RPN fully convolutional network to generate proposals, after which roi align maps the features to 7x7; classification and regression through fully connected layers yield positive frames; finally, the positive frames are sent to the inclination angle regression network, where a preliminary angle-offset regression is performed through a fully connected layer and a decoder, and rotated roi align then extracts rotation-invariant features to obtain the final detection result.
In the training process a pre-trained ResNet model is used. The classification losses of the RPN and the ROI classifier use the cross-entropy loss function and the regression losses use the Smooth L1 loss function; in the inclination angle network the classification loss still uses cross entropy, while the regression loss function is redesigned. The optimizer is SGD with momentum, the initial learning rate is 0.00125, and training runs for a total of 15 iterations.
(4) Model prediction: after training, the model is saved; the trained parameters are loaded, test data of any size is input, and the categories and positions of objects in the data are obtained end to end. This stage only loads trained model parameters; data enhancement is not used in this step.
The evaluation index is mean average precision (mAP). Evaluated on the DOTA test set, the proposed method obtains competitive results: compared with commonly used single-stage and two-stage algorithms, it has higher accuracy and recognition performance and can recognize data of any size end to end.
Drawings
Fig. 1 is a schematic overall flow chart of the method according to the present invention.
Fig. 2 is a schematic diagram of a convolutional neural network according to the present invention.
Fig. 3 is a schematic diagram of a data enhancement result according to the present invention.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings, which illustrate in detail:
A remote sensing image target detection method based on a multi-angle detection frame. As shown in fig. 1, the detection process is: preprocess the image and apply data enhancement (training stage only); divide the image into 1024x1024 tiles; send them into the backbone network for downsampling and feature extraction; send the features into the feature pyramid for upsampling, merging features of different levels in the process; send them into the RPN fully convolutional network to generate proposals; apply roi align to obtain 7x7 feature maps and perform classification and regression; finally, send the results into the inclination angle regression network for final classification and regression to obtain the detection result.
Specific algorithms are referenced below:
(1) Data preprocessing: the DOTA dataset, currently the best-known dataset for remote sensing target detection, is preprocessed. Considering the characteristics of remote sensing images, many pictures have very high resolution; to facilitate training and prediction and to save computing resources, the length and width of input pictures are uniformly limited to 1024x1024. Inside the network, the 152-layer residual network and the feature pyramid structure perform 4 downsamplings, 3 upsamplings and 1 max-pooling fusion of features, and the 1024x1024 size guarantees divisibility by 32 at every stage.
(2) Data enhancement: the poor detection performance caused by the large number of small objects in the dataset is an important problem. Research shows that one reason for poor small-target detection is that the regression loss of small target objects does not contribute enough to the total regression loss, so increasing the loss contribution of small targets is one way to solve the problem. A data enhancement method is therefore designed: first, compute the regression loss of the inclination detection module for objects smaller than 32x32 in the current iteration, denoted L_s (s denotes objects smaller than 32x32); then compute the total regression loss of the inclination detection module in the current iteration, denoted L_reg; finally, compute the loss contribution rate a of small target objects by the following formula:
a = L_s / L_reg    formula (1)
If the computed a is smaller than 0.4, the contribution of small-target losses to the total regression loss in that iteration is considered insufficient and needs to be enhanced. In the next iteration, four pictures are randomly selected from the training set; the w and h of each picture are reduced to w/2 and h/2, so the area becomes 1/4 of the original; the pictures are stitched in a 2x2 arrangement into one picture whose size is still 1024x1024 (the stitching is shown in fig. 3), and this picture is used in the next iteration. This increases the proportion of small targets, improving small-target detection performance and thereby the overall detection performance.
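The trigger condition and the 2x2 stitching described above can be sketched as follows. This is an illustrative sketch only; the function names are assumptions, and the required rescaling of ground-truth coordinates is omitted.

```python
import numpy as np

def small_target_ratio(losses, areas, thresh_area=32 * 32):
    """a = (regression loss of boxes with area < 32x32) / total regression loss."""
    l_s = sum(l for l, ar in zip(losses, areas) if ar < thresh_area)
    l_reg = sum(losses)
    return l_s / l_reg if l_reg > 0 else 0.0

def mosaic(imgs):
    """Stitch four 1024x1024 images, each downscaled to half width and
    height, into one 1024x1024 picture arranged on a 2x2 grid."""
    assert len(imgs) == 4
    halves = [im[::2, ::2] for im in imgs]  # crude 1/2 downscale by striding
    top = np.concatenate(halves[:2], axis=1)
    bottom = np.concatenate(halves[2:], axis=1)
    return np.concatenate([top, bottom], axis=0)
```

In training, `if small_target_ratio(...) < 0.4:` the next iteration would feed `mosaic(...)` of four randomly chosen training pictures; ground-truth boxes must be halved and offset to match their quadrant.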
(3) Model setup and training
As shown in fig. 2, the network model mainly comprises a backbone convolutional neural network, a feature pyramid structure, an RPN fully convolutional network, an ROI classifier and an inclination angle regression network. After data preprocessing, a picture is input into the backbone network and feature pyramid structure to extract features. The feature maps output by the feature pyramid pass through a 3x3 convolution to enlarge the network's receptive field, then through a 1x1 convolution for dimensionality reduction, and finally proposal candidate frames are generated. The ROI classifier obtains mapped feature maps by bilinear interpolation and is followed by fully connected layers that output regression and classification results, yielding positive candidate frames. Because of the characteristics of the dataset, these frames must then be sent to the inclination angle regression network for final regression offset and classification prediction. Compared with an ordinary two-stage network, the whole process does not increase the number of anchors, yet accurate regression frames can be obtained.
To obtain the best detection performance, a 152-layer residual network is used throughout the model, with grouped convolution kernels connected in parallel to reduce the number of parameters. The backbone is divided into five stages C1 to C5, which downsample the picture 4 times. The C1 stage uses a 7x7 convolution layer and a ReLU activation function, followed by one max pool. To reduce parameters and improve detection performance, stages C2 to C5 all use grouped convolution: C2 uses 1x1, 3x3, 1x1 bottleneck convolutions in which the 3x3 convolution is divided into 32 groups; C3 uses 8 such bottleneck blocks, C4 uses 36 and C5 uses 3. The four stages output feature maps of 256, 512, 1024 and 2048 dimensions respectively. The C2-C5 outputs then all pass through a 1x1 convolution that compresses the feature dimension to 256; starting from the 2048-dimensional C5 output, 3 upsamplings are performed, and the resulting 256-dimensional maps are added to the corresponding 256-dimensional C2-C5 outputs to obtain the P2'-P5' layers, which pass through 256-dimensional 3x3 convolutions to give the final P2-P5 layers. The P6 layer is obtained by applying a 1x1, stride-2 max pool to P5. These feature layers are used for subsequent computation.
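The feature-pyramid wiring just described can be checked at the level of tensor shapes. The sketch below stubs out the convolutions with simple NumPy ops purely to track shapes; the function name and stride choices (4, 8, 16, 32 for C2-C5) are standard FPN assumptions, not the patent's code.

```python
import numpy as np

def fpn_shapes(hw=1024):
    """Return the (channels, height, width) of P2-P6 for an hw x hw input,
    following the top-down pathway: lateral 1x1 convs to 256 channels,
    2x upsampling additions, and a stride-2 pool of P5 for P6."""
    c_sizes = [hw // s for s in (4, 8, 16, 32)]  # spatial sizes of C2..C5
    p, prev = {}, None
    for level, size in zip((5, 4, 3, 2), reversed(c_sizes)):
        lateral = np.zeros((256, size, size))            # 1x1 conv stub
        if prev is not None:
            prev = prev.repeat(2, axis=1).repeat(2, axis=2)  # 2x upsample
            lateral = lateral + prev                     # top-down addition
        p[level] = lateral                               # 3x3 conv omitted
        prev = p[level]
    p[6] = p[5][:, ::2, ::2]                             # stride-2 pool stub
    return {k: v.shape for k, v in p.items()}
```

For a 1024x1024 input this yields P2 at 256x256, down to P5 at 32x32 and P6 at 16x16, all with 256 channels, matching the 4-downsampling / 3-upsampling description.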
The output P2-P6 layers extract features through a 256-dimensional 3x3 convolution layer, so that features are extracted from different levels and fused; different features are thus better extracted and the detection performance of the whole network is further improved. k anchors are obtained and split into two 1x1 convolution branches for the classification and regression losses, and 128 positive and 128 negative proposals are computed for the later roi classifier, where the classification loss L_cls follows formula (2) and the regression loss L_reg follows formula (3). The total RPN-stage loss is computed as in formula (4).
L_cls(p_i, p_i*) = -[p_i* log p_i + (1 - p_i*) log(1 - p_i)]    formula (2)
where p_i represents the probability that the i-th anchor predicts the true label, and p_i* is 1 when the anchor is positive and 0 when it is negative.
L_reg(t_i, t_i*) = smooth_L1(t_i - t_i*), with smooth_L1(x) = 0.5 x^2 if |x| < 1 and |x| - 0.5 otherwise    formula (3)
where t_i* represents the offset of this anchor relative to ground truth and t_i represents the predicted offset.
L_RPN = (1 / N_cls) Σ_i L_cls(p_i, p_i*) + γ (1 / N_reg) Σ_i p_i* L_reg(t_i, t_i*)    formula (4)
where in the present invention γ is 1.
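The two RPN losses named above, binary cross-entropy for classification and Smooth L1 for regression, are standard and can be written minimally in NumPy as a sketch (function names are illustrative):

```python
import numpy as np

def cross_entropy(p, p_star):
    """Binary cross-entropy for one anchor: p is the predicted probability,
    p_star is 1 for a positive anchor and 0 for a negative one."""
    return -(p_star * np.log(p) + (1 - p_star) * np.log(1 - p))

def smooth_l1(x):
    """Smooth L1 applied elementwise to the offset error t - t*:
    quadratic near zero, linear beyond |x| = 1."""
    x = np.abs(x)
    return np.where(x < 1, 0.5 * x ** 2, x - 0.5)
```

The total RPN loss then averages the classification term over all sampled anchors and the regression term over positive anchors only, weighted by the balance factor γ (γ = 1 here).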
The positive and negative samples enter the roi classifier: roi align generates 7x7 feature maps from the proposals extracted by the RPN, which then pass through a fully connected layer of size 1024, and classification loss L_clsr and regression loss L_regr are computed again in the same way as the RPN losses. The total loss function is then computed as in formula (5):
L(p, u, t^u, v) = L_clsr(p, u) + γ [u ≥ 1] L_regr(t^u, v)    formula (5)
where p is the softmax probability distribution predicted by the classifier, u is the true label of the corresponding target, t^u is the predicted regression parameter for class u, and v = (v_x, v_y, v_w, v_h) is the regression parameter of the bounding box of the real object. The regression frame obtained at this point is a positive (horizontal) frame; according to the characteristics of remote sensing images, however, an angled inclined frame is needed, so the computed positive frame is passed to the inclination detection module.
The inclination detection module is the key point here. The positive frame output by the improved faster-rcnn module is taken as input: first, angle-offset features are extracted through roi align and a fully connected layer of size 5; the result is sent to a decoder that outputs a preliminary rotated region of interest (RROI); then deep features of the RROI are extracted through roi align again and sent to a fully connected layer of size 2048 for classification and regression loss computation. The classification loss L_clsx is consistent with formula (2), while the regression loss L_regx adopts a new calculation, shown in formula (6), and the final classification and regression results are obtained.
where, to ensure that the function remains continuously differentiable, a ln(b + β) = μ when x = 1, with parameters a = 0.5, β = 1, μ = 1.5.
The inclination detection module is mainly divided into two parts. The first part is the angle rotation module, which rotates a horizontal anchor frame into an inclined anchor frame. Suppose the coordinates of the positive frame obtained from the previous module are (x, y, w, h), where x and y are the center-point coordinates of the positive anchor frame and w and h are its width and height; in the ideal case, the positive anchor frame is the circumscribed rectangle of the inclined frame. The position and angle offsets are produced by the intermediate fully connected layer and decoder in the network, and the offsets relative to ground truth are calculated as in formula (7),
where (x_r, y_r, w_r, h_r, θ_r) represents the coordinates of the frame after the offset calculated by the angle rotation module and (x*, y*, w*, h*, θ*) represents the coordinates of the ground-truth frame.
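Formula (7) itself is not reproduced in the extracted text. As a hedged sketch, the standard rotated-box offset encoding used by RoI-Transformer-style detectors matches the symbols defined above; the patent's exact formula may differ, and the function names here are assumptions:

```python
import numpy as np

def encode_offsets(positive, gt):
    """Offsets of a ground-truth rotated box (x*, y*, w*, h*, theta*)
    relative to a box (x, y, w, h, theta): center deltas normalized by
    the box size, log-scale size deltas, and the angle difference."""
    x, y, w, h, t = positive
    xs, ys, ws, hs, ts = gt
    return np.array([(xs - x) / w, (ys - y) / h,
                     np.log(ws / w), np.log(hs / h), ts - t])

def decode_offsets(positive, d):
    """Inverse of encode_offsets: apply predicted deltas to a box."""
    x, y, w, h, t = positive
    dx, dy, dw, dh, dt = d
    return np.array([x + dx * w, y + dy * h,
                     w * np.exp(dw), h * np.exp(dh), t + dt])
```

Encoding then decoding recovers the ground-truth box exactly, which is the property the angle rotation module's regression targets rely on.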
The second part is the angle correction module. Although extracting deep-level features of the features after the first part's offset does not itself change the angle, it can be regarded as rotating them, and the angle of the extracted deep-level features can be corrected again, making the regressed rotated frame more robust and better fitted to the angle of the target object. The specific flow is: given the inclined-frame parameters (x_r, y_r, w_r, h_r, θ_r) calculated in the first part and a feature map D of input size H x W x C, rotated roi align divides D into K x K grids (bins) and computes a feature map y of size K x K x C. The output of the grid with index (i, j) (0 ≤ i, j < K) in dimension c (0 ≤ c < C) is computed as in formula (8):
y_c(i, j) = Σ_{(x,y) ∈ bin(i,j)} D_{i,j,c}(T_θ(x, y)) / n_ij    formula (8)
where D_{i,j,c} denotes the feature map of size K x K x C, n_ij is the number of samples in the grid, bin(i, j) is the set of coordinates of the grid, whose calculation is shown in formula (9), and T_θ transforms the true coordinates (x, y) of each grid into coordinates (x', y') on the feature map, as shown in formula (10).
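The images for formulas (9) and (10) are not reproduced in the extracted text, but the role of T_θ, mapping a sample point expressed in the tilted box's local frame to feature-map coordinates, is the standard rotation used in rotated roi align. A hedged sketch (the function name is an assumption):

```python
import numpy as np

def t_theta(x, y, box):
    """Map a sample point (x, y), given relative to the tilted box's
    center and axes, to feature-map coordinates (x', y') by rotating
    through theta and translating to the box center."""
    xr, yr, wr, hr, theta = box
    c, s = np.cos(theta), np.sin(theta)
    xp = xr + x * c - y * s
    yp = yr + x * s + y * c
    return xp, yp
```

With θ = 0 the transform is a pure translation to the box center; with θ = π/2 a point one unit along the box's x-axis lands one unit "down" in image coordinates, which is exactly what lets the same K x K grid sample rotation-invariant features from an arbitrarily tilted region.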
The loss function of the whole network is shown in formula (11): the loss of the RPN stage, the loss of the fully connected stage and the loss of the inclination module are combined, and the resulting total loss is used for joint training.
L_all = L(p, u, t^u, v) + L_clsx + L_regx    formula (11)
In the training process, a 1080 Ti graphics card is used for computation, a pre-trained ResNet model is adopted, the optimizer is SGD with momentum, the initial learning rate is 0.00125, and training runs for a total of 15 iterations.
(4) Model prediction and assessment
The trained model is saved and its parameters loaded; the category and position of objects in a remote sensing image of any size can then be predicted directly end to end. During prediction, images larger than 1024x1024 are still divided into several 1024x1024 images by a sliding window with step 512 and sent into the model for prediction; if the same object appears in several tiles, only the detection with the highest confidence is kept when anchor frames are drawn, and the others are discarded. Images smaller than 1024x1024 are padded with a black background to obtain 1024x1024 pictures. Evaluation consists of loading the model parameters, predicting the local test-set pictures, generating the categories and anchor-frame position coordinates for the test set, and then evaluating online through the DOTA website. The evaluation index is mean average precision (mAP). The predictive performance of the algorithm was evaluated on the DOTA dataset; the algorithm achieves a larger gain than the improved faster-rcnn algorithm, with experimental results shown in table 1.
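The "keep only the highest-confidence duplicate" step across stitched tiles can be sketched as a greedy suppression pass. This is illustrative only: axis-aligned IoU stands in for the rotated-box overlap the method would actually use, and the names are assumptions.

```python
def merge_duplicates(dets, iou_thresh=0.5):
    """Keep only the highest-confidence detection among overlapping
    detections of the same object across tiles.
    Each det is (x1, y1, x2, y2, score) in full-image coordinates."""
    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter)
    kept = []
    for d in sorted(dets, key=lambda d: -d[4]):  # highest confidence first
        if all(iou(d, k) < iou_thresh for k in kept):
            kept.append(d)
    return kept
```

Because tiles overlap by 512 pixels, the same object can be detected twice; sorting by score and suppressing overlaps implements the rule that only the most confident frame is drawn.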
TABLE 1 comparison of the predicted Performance of the methods of the invention
As shown in Table 1, on the DOTA dataset the improved algorithm gains 4.42% over the improved faster-rcnn and obtains a better result, and it also achieves better prediction results than the currently popular SCRDet and R3Det algorithms. The experimental results prove that the algorithm is effective and can recognize objects in remote sensing images more accurately.

Claims (3)

1. A remote sensing image target detection method based on a multi-angle detection frame, characterized in that the method comprises the following steps:
Firstly, preprocessing input data to ensure that the image size of an input network accords with a preset size, and then outputting a positive frame without angles through a main network, a characteristic pyramid structure, an RPN structure and an ROI classification;
Then entering an inclination angle module, and performing a first angle rotation by using a 1x1 convolution, a fully connected layer and a decoder;
in order to obtain a more accurate angle, the angle needs to be corrected: rotated roi align, a 1x1 convolution and a fully connected layer are used to correct the angle of the first rotation; in the training stage, the loss function of position regression is redesigned to facilitate training; whether the next iteration uses the data enhancement strategy to process the input data is determined according to the loss contribution rate of small targets;
The first angle rotation first uses a 10-channel 1x1 convolution for dimensionality reduction, and then uses a fully connected layer and a decoder; compared with ground truth, the offset is calculated as follows,
where (x_r, y_r, w_r, h_r, θ_r) represents the coordinates of the frame after the offset calculated in the first stage and (x*, y*, w*, h*, θ*) represents the coordinates of the ground-truth frame;
The second angle correction extracts deep features of the features after the first part's offset by using rotated roi align. The specific process of the angle correction is: according to the inclined-frame parameters (x_r, y_r, w_r, h_r, θ_r) calculated in the first part and the feature map D of input size H x W x C, rotated roi align divides the features into a K x K x C feature map y; then a 10-channel 1x1 convolution is used for dimensionality reduction, and finally a fully connected layer performs the final classification and regression. The output of the grid with index (i, j) (0 ≤ i, j < K) in dimension c (0 ≤ c < C) of the feature map y is calculated as follows:
y_c(i, j) = Σ_{(x,y) ∈ bin(i,j)} D_{i,j,c}(T_θ(x, y)) / n_ij
where D_{i,j,c} denotes the feature map of size K x K x C, n_ij is the number of samples in the grid, bin(i, j) is the set of true coordinate values of the grid with coordinate index (i, j), and T_θ transforms the true coordinates (x, y) of each grid into coordinates (x', y') on the feature map, as follows:
In order to make the loss function converge faster in the angle-correction training stage, the regression loss function of the inclination angle module is redesigned: the gradient value over the range x < 1 of the gradient function is increased, which shortens training time while improving the detection performance of the model. The loss function is calculated as follows,
where, to ensure that the function remains continuously differentiable, a ln(b + β) = μ when x = 1, with parameters a = 0.5, β = 1, μ = 1.5.
2. The method for detecting the target of the remote sensing image based on the multi-angle detection frame according to claim 1, wherein the method comprises the following steps:
The data preprocessing first judges whether the picture size is smaller than 1024x1024: if smaller, a black background is used to pad it to 1024x1024; if larger, a sliding window with a step of 512 pixels divides it into n pictures of size 1024x1024, so that objects lying on image boundaries can still be completely detected.
3. The method for detecting the target of the remote sensing image based on the multi-angle detection frame according to claim 1, wherein the method comprises the following steps:
The data enhancement strategy is to randomly select four pictures from the training set according to the fact that the ground truth boxes in one iteration are smaller than the size of the ratio of the regression loss L s with the size of 32 multiplied by 32 to the total regression loss L reg, if the ratio is smaller than 0.4, the aspect ratio of each picture is reduced to 1/2, and then according to the following steps If the ratio is more than or equal to 0.4, the original training set picture is normally input, and the loss ratio calculation formula is shown as follows;
a = L_s / L_reg
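The gating rule of claim 3 can be sketched as follows; only the ratio a = L_s / L_reg, the 0.4 threshold, the choice of four pictures, and the 1/2 downscaling come from the claim, while the nearest-neighbour subsampling and all names are illustrative assumptions.

```python
import random
import numpy as np

THRESHOLD = 0.4  # stated in the claim

def small_object_ratio(loss_small: float, loss_total: float) -> float:
    """a = L_s / L_reg: share of regression loss contributed by
    ground-truth boxes smaller than 32x32 in the current iteration."""
    return loss_small / loss_total

def augment_step(a: float, dataset: list, rng=random) -> list:
    """If small objects contribute less than 40% of the regression loss,
    pick four random pictures and halve their width and height
    (nearest-neighbour subsampling, purely illustrative); otherwise
    return the original training pictures unchanged."""
    if a >= THRESHOLD:
        return dataset
    picks = rng.sample(dataset, 4)
    return [img[::2, ::2] for img in picks]
```

Shrinking pictures this way makes already-small objects even smaller, so the strategy only triggers when the small-object loss share is low, i.e. when the batch under-represents small targets.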
CN202111007113.0A 2021-08-30 2021-08-30 Remote sensing image target detection method based on multi-angle detection frame Active CN113850761B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111007113.0A CN113850761B (en) 2021-08-30 2021-08-30 Remote sensing image target detection method based on multi-angle detection frame


Publications (2)

Publication Number Publication Date
CN113850761A CN113850761A (en) 2021-12-28
CN113850761B true CN113850761B (en) 2024-06-14

Family

ID=78976487

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111007113.0A Active CN113850761B (en) 2021-08-30 2021-08-30 Remote sensing image target detection method based on multi-angle detection frame

Country Status (1)

Country Link
CN (1) CN113850761B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114419520B (en) * 2022-03-28 2022-07-05 南京智谱科技有限公司 Training method, device, equipment and storage medium of video-level target detection model
CN116363435B (en) * 2023-04-03 2023-10-27 盐城工学院 Remote sensing image target detection system and method based on deep learning

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111723748A (en) * 2020-06-22 2020-09-29 电子科技大学 Infrared remote sensing image ship detection method
CN111950488A (en) * 2020-08-18 2020-11-17 山西大学 Improved fast-RCNN remote sensing image target detection method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111091105B (en) * 2019-12-23 2020-10-20 郑州轻工业大学 Remote sensing image target detection method based on new frame regression loss function
CN112560614A (en) * 2020-12-04 2021-03-26 中国电子科技集团公司第十五研究所 Remote sensing image target detection method and system based on candidate frame feature correction


Also Published As

Publication number Publication date
CN113850761A (en) 2021-12-28

Similar Documents

Publication Publication Date Title
CN110443143B (en) Multi-branch convolutional neural network fused remote sensing image scene classification method
CN111325203B (en) American license plate recognition method and system based on image correction
WO2022002150A1 (en) Method and device for constructing visual point cloud map
Chen et al. Multi-scale spatial and channel-wise attention for improving object detection in remote sensing imagery
US20220067335A1 (en) Method for dim and small object detection based on discriminant feature of video satellite data
CN111652321B (en) Marine ship detection method based on improved YOLOV3 algorithm
CN113012212B (en) Depth information fusion-based indoor scene three-dimensional point cloud reconstruction method and system
CN113850761B (en) Remote sensing image target detection method based on multi-angle detection frame
US11176425B2 (en) Joint detection and description systems and methods
Zhu et al. Diverse sample generation with multi-branch conditional generative adversarial network for remote sensing objects detection
CN111738055A (en) Multi-class text detection system and bill form detection method based on same
CN112733942A (en) Variable-scale target detection method based on multi-stage feature adaptive fusion
CN114332942A (en) Night infrared pedestrian detection method and system based on improved YOLOv3
CN113850324A (en) Multispectral target detection method based on Yolov4
US20220335572A1 (en) Semantically accurate super-resolution generative adversarial networks
CN116740758A (en) Bird image recognition method and system for preventing misjudgment
CN113971764B (en) Remote sensing image small target detection method based on improvement YOLOv3
Liu et al. SLPR: A deep learning based Chinese ship license plate recognition framework
CN117496158A (en) Semi-supervised scene fusion improved MBI contrast learning and semantic segmentation method
Xu et al. Compressed YOLOv5 for oriented object detection with integrated network slimming and knowledge distillation
CN111046861B (en) Method for identifying infrared image, method for constructing identification model and application
CN115035429A (en) Aerial photography target detection method based on composite backbone network and multiple measuring heads
Peng et al. Deep learning-based autonomous real-time digital meter reading recognition method for natural scenes
Wang et al. Speed sign recognition in complex scenarios based on deep cascade networks
Dong et al. An Intelligent Detection Method for Optical Remote Sensing Images Based on Improved YOLOv7.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant