CN111950488B - Improved Faster-RCNN remote sensing image target detection method - Google Patents

Improved Faster-RCNN remote sensing image target detection method

Info

Publication number
CN111950488B
CN111950488B (application CN202010833754.0A)
Authority
CN
China
Prior art keywords
network
remote sensing
rcnn
sensing image
faster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010833754.0A
Other languages
Chinese (zh)
Other versions
CN111950488A (en)
Inventor
郭艳艳 (Guo Yanyan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanxi University
Original Assignee
Shanxi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanxi University
Priority to CN202010833754.0A
Publication of CN111950488A
Application granted
Publication of CN111950488B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/10 Terrestrial scenes
    • G06V 20/13 Satellite images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/32 Normalisation of the pattern dimensions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 Target detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/08 Detecting or categorising vehicles

Abstract

The invention relates to the field of remote sensing image target detection, and in particular to an improved Faster-RCNN remote sensing image target detection method. The method comprises the following steps: (1) dividing a remote sensing image data set into a training set and a test set; (2) sequentially applying size transformation, normalization and data enhancement to the remote sensing images in the training set; (3) building an improved Faster-RCNN remote sensing image target detection network; (4) training the improved Faster-RCNN remote sensing image target detection network; (5) testing the improved Faster-RCNN remote sensing image target detection network. The method improves the average precision of target detection in remote sensing images, in particular for small targets, and reduces the probability of false detections and missed detections of small targets.

Description

Improved Faster-RCNN remote sensing image target detection method
Technical Field
The invention relates to the field of remote sensing image target detection, in particular to an improved Faster-RCNN remote sensing image target detection method.
Background
Object detection is one of the basic problems in computer vision, with wide application in many fields. Target detection in remote sensing images has broad application prospects in military applications, urban planning, environmental management and other areas. Unlike target detection in natural images, targets in remote sensing images are much smaller, their sizes and orientations are diverse (e.g., playgrounds, cars, bridges), and the visual appearance of target instances varies with occlusion, shadows, lighting, resolution and viewpoint. Detecting targets in remote sensing images is therefore considerably more difficult than in natural images.
In recent years, some research has introduced deep convolutional neural networks into target detection. These networks automatically learn feature representations from data that are robust and highly expressive, and they have greatly improved detection speed and precision. The two most classic families of deep convolutional target detection algorithms are those based on candidate region extraction and those based on regression. Algorithms based on candidate region extraction first extract candidate regions from a given image and then classify and regress each extracted candidate region; they hold a certain advantage in detection accuracy. Regression-based algorithms use a single, end-to-end convolutional neural network that recasts target detection as a regression problem and directly predicts the category and position of the target; they hold a certain advantage in detection speed.
Although current target detection algorithms perform well on natural images, target detection in remote sensing images still needs improvement. In particular, the detection of small targets in remote sensing images remains unsatisfactory, and false detections and missed detections occur easily.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides an improved Faster-RCNN remote sensing image target detection method that improves the target detection accuracy on remote sensing images, reduces the probability of false detections and missed detections, and has better generalization capability.
In order to achieve the purpose, the invention adopts the following technical scheme:
An improved Faster-RCNN remote sensing image target detection method comprises the following steps:
(1.1) dividing a remote sensing image data set into a training set and a testing set;
(1.2) carrying out size transformation, normalization processing and data enhancement on the remote sensing images in the training set in sequence:
a. the size transformation sets each remote sensing image in the training set to 800 × 960 pixels;
b. the normalization maps each pixel value of the images in the training set into the range 0-1;
c. the data enhancement rotates each normalized remote sensing image in the training set by 90, 180 and 270 degrees and applies a mirror operation;
(1.3) constructing an improved Faster-RCNN remote sensing image target detection network: the network consists of a Faster-RCNN sub-network and a positioning refinement sub-network;
(1.4) training the improved Faster-RCNN remote sensing image target detection network: first randomly initialize the node parameters of the built improved Faster-RCNN remote sensing image target detection network, then input the remote sensing images of the training set into the built network and update the node parameters of the improved Faster-RCNN remote sensing image target detection network model by stochastic gradient descent until an optimal solution is found;
(1.5) testing the improved Faster-RCNN remote sensing image target detection network: detect the remote sensing images of the test set with the trained improved Faster-RCNN remote sensing image target detection network and analyze the detection effect.
Further, building the improved Faster-RCNN remote sensing image target detection network in step (1.3) comprises the following specific steps:
(2.1) The remote sensing image is input into the VGG16 network in the Faster-RCNN sub-network, which extracts the texture, color and scale features of the targets in the remote sensing image; after the VGG16 network, a feature map y1 of size 50 × 60 × 256 is obtained;
(2.2) The obtained feature map y1 is fed into three parallel branches: the RPN (Region Proposal Network) and the RoI pooling layer in the Faster-RCNN sub-network, and the RoI pooling layer in the positioning refinement sub-network;
(2.3) In the RPN network of the Faster-RCNN sub-network, a 3 × 3 sliding window performs a standard convolution operation with stride 1 on y1. At each sliding position, 12 anchor boxes of different scales are generated centered on the center point of the sliding window, with sizes 16 × 16, 16 × 32, 32 × 16, 32 × 32, 32 × 64, 64 × 32, 64 × 64, 64 × 128, 128 × 64, 128 × 128, 128 × 256 and 256 × 128. After the anchor boxes are generated, the output passes through a ReLU activation function in the RPN network and is divided into two branches. One branch is the classification loss branch: it first applies a point-by-point convolution with 18 output channels and then classifies the 12 anchor boxes of different scales with the Softmax classifier in the RPN network; each anchor box outputs two probability values that distinguish target from background, so 24 probability values are output per sliding step. The other branch is the boundary regression loss branch: after a point-by-point convolution with 36 output channels, the bounding box regression loss layer in the RPN network calculates the boundary regression offsets of the anchor boxes; each anchor box outputs 4 relative position coordinates, namely the center coordinates (xa, ya) and the width and height (wa, ha) of the anchor box, so the 12 anchor boxes output 48 relative position coordinates per sliding step. Finally, the proposal layer in the RPN network integrates the outputs of the classification loss branch and the boundary regression loss branch to obtain a feature map y2 of anchor boxes with relative position coordinates;
(2.4) After the RPN network in the Faster-RCNN sub-network, the feature map y2 and the feature map y1 obtained from the VGG16 network are input into the RoI pooling layer in the Faster-RCNN sub-network, which outputs the feature maps of non-uniform size as a feature map y3 of size 25 × 30 × 256; after a fully connected layer with ReLU activation in the Faster-RCNN sub-network, a regression result y4 is obtained through the regression loss layer in the Faster-RCNN sub-network;
(2.5) The regression result y4 obtained from the regression loss layer in the Faster-RCNN sub-network and the feature map y1 obtained from the VGG16 network are input into the RoI pooling layer in the positioning refinement sub-network, which outputs a feature map of size 6 × 7 × 256. After a fully connected layer with ReLU activation in the positioning refinement sub-network, the output is divided into two paths: one path outputs the position information y5 of the targets in the remote sensing image through the regression loss layer in the positioning refinement sub-network; the other path outputs the classification result y6 of the targets in the remote sensing image through the Softmax classifier in the positioning refinement sub-network.
All regression loss layers (Regressor) in the invention use a robust loss function to calculate the boundary regression offsets of the anchor boxes.
Compared with the prior art, the invention has the following advantages: on the basis of the existing Faster-RCNN network, the number of anchor boxes in the RPN (Region Proposal Network) of Faster-RCNN is increased to 12, and a positioning refinement sub-network with a RoI pooling layer is added to further detect the output of the Faster-RCNN network. This improves the average precision of target detection in remote sensing images, in particular the detection accuracy for small targets such as cars and airplanes.
Drawings
FIG. 1 is a schematic diagram of the improved fast-RCNN remote sensing image target detection network structure of the present invention;
FIG. 2 is a schematic diagram of the RPN network of the present invention;
FIG. 3 is a visual comparison of the improved Faster-RCNN remote sensing image object detection network of the present invention and the prior art method.
Detailed Description
The remote sensing image target detection data set used by the invention comes from the NWPU VHR-10 data set created by Dr. Gong Cheng et al. of Northwestern Polytechnical University. The data set has 10 classes: airplane, ship, storage tank, baseball diamond, tennis court, basketball court, ground track field, harbor, bridge and vehicle. It contains 800 high-resolution remote sensing images, of which the negative sample set comprises 150 images that do not belong to any category. The sizes of the targets to be detected vary widely: the largest target is about 418 × 418 and the smallest is 33 × 33.
Referring to fig. 1, 2 and 3, the improved method for detecting the target of the fast-RCNN remote sensing image disclosed by the invention comprises the following steps:
(1.1) The remote sensing image data set is divided into a training set and a test set: 80% of the images are used for network training and 20% for network testing, keeping the distribution of the different classes of samples in the training and test sets as consistent as possible;
(1.2) carrying out size transformation, normalization processing and data enhancement on the remote sensing images in the training set in sequence:
a. the size transformation sets each remote sensing image in the training set to 800 × 960 pixels;
b. the normalization maps each pixel value of the remote sensing images in the training set into the range 0-1;
c. the data enhancement rotates each normalized remote sensing image in the training set by 90, 180 and 270 degrees and applies a mirror operation, which ensures the robustness of the improved Faster-RCNN remote sensing image target detection network, as shown in the sketch below;
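A minimal preprocessing sketch of steps a-c, assuming OpenCV-style uint8 image arrays; the function names and the use of cv2/numpy are illustrative assumptions, not part of the patent:

```python
import numpy as np
import cv2

def preprocess(image: np.ndarray) -> np.ndarray:
    # a. size transformation: fix each training image to 800 x 960 pixels
    image = cv2.resize(image, (960, 800))       # cv2.resize takes (width, height)
    # b. normalization: map each pixel value into the range 0-1
    return image.astype(np.float32) / 255.0

def augment(image: np.ndarray) -> list:
    # c. data enhancement: rotations by 90/180/270 degrees plus a mirror image
    rotations = [np.rot90(image, k) for k in (1, 2, 3)]
    return [image, *rotations, np.fliplr(image)]
```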
(1.3) Constructing the improved Faster-RCNN remote sensing image target detection network: the network consists of a Faster-RCNN sub-network and a positioning refinement sub-network. The Faster-RCNN sub-network performs a preliminary target detection on the remote sensing image, and the positioning refinement sub-network further detects the output of the Faster-RCNN network, which alleviates inaccurate localization, missed detections and false detections of targets;
The Faster-RCNN sub-network consists of a VGG16 network, an RPN network, a RoI pooling layer, a fully connected layer (FC) with ReLU activation and a regression loss layer (Regressor); the positioning refinement sub-network consists of a RoI pooling layer (RoI Pooling), a fully connected layer with ReLU activation, a Softmax classifier and a regression loss layer (Regressor).
The RPN network in the Faster-RCNN sub-network comprises a 3 × 3 standard convolution layer (Conv2d), a ReLU activation function, two point-by-point convolution layers (Pwise), a Softmax classifier, a bounding box regression loss layer (Bbox Regressor) and a proposal layer (Proposal).
The method for building the improved fast-RCNN remote sensing image target detection network comprises the following specific steps:
(2.1) The remote sensing image is input into the VGG16 network in the Faster-RCNN sub-network. The VGG16 network comprises 13 convolutional layers (Conv2d), each followed by a ReLU activation function, and 4 pooling layers (Pooling); the input feature map is activated by a ReLU function after every convolutional layer, and a max pooling operation follows the 2nd, 4th, 7th and 10th convolutional layers. Every convolutional layer uses a 3 × 3 standard convolution with padding 1 and stride 1; every pooling layer uses max pooling with a 2 × 2 pooling kernel and stride 2. After the VGG16 network extracts the texture, color and scale features of the targets in the remote sensing image, a feature map y1 of size 50 × 60 × 256 is obtained. The network configuration of VGG16 is shown in Table 1:
Table 1 VGG16 network configuration table
(Table 1 appears in the original publication only as an image and is not reproduced here.)
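Since Table 1 survives only as an image, the following PyTorch sketch reconstructs the backbone from the prose of step (2.1): 13 3 × 3 convolutions (padding 1, stride 1), each followed by ReLU, with 2 × 2 stride-2 max pooling after convolutions 2, 4, 7 and 10. The per-layer channel widths are an assumption; only the final 256 channels and the 800 × 960 → 50 × 60 spatial reduction are given in the text:

```python
import torch
import torch.nn as nn

def make_backbone() -> nn.Sequential:
    # assumed channel plan; 'M' marks the max-pooling layers after conv 2, 4, 7, 10
    cfg = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
           256, 256, 256, 'M', 256, 256, 256]
    layers, in_ch = [], 3
    for v in cfg:
        if v == 'M':
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        else:
            layers += [nn.Conv2d(in_ch, v, kernel_size=3, stride=1, padding=1),
                       nn.ReLU(inplace=True)]
            in_ch = v
    return nn.Sequential(*layers)

y1 = make_backbone()(torch.zeros(1, 3, 800, 960))
print(y1.shape)  # torch.Size([1, 256, 50, 60])
```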
(2.2) The obtained feature map y1 is fed into three parallel branches: the RPN (Region Proposal Network) and the RoI pooling layer in the Faster-RCNN sub-network, and the RoI pooling layer in the positioning refinement sub-network;
(2.3) In the RPN network of the Faster-RCNN sub-network, a 3 × 3 sliding window performs a standard convolution operation with stride 1 on y1, and at each sliding position 12 anchor boxes of different scales are generated centered on the center point of the sliding window, with sizes 16 × 16, 16 × 32, 32 × 16, 32 × 32, 32 × 64, 64 × 32, 64 × 64, 64 × 128, 128 × 64, 128 × 128, 128 × 256 and 256 × 128. After the anchor boxes are generated, the output passes through a ReLU activation function in the RPN network and is divided into two branches. One branch is the classification loss branch: after a point-by-point convolution with 18 output channels, the 12 anchor boxes of different scales are classified by the Softmax classifier in the RPN network; each anchor box outputs two probability values distinguishing target from background, so 24 probability values are output per sliding step. The other branch is the boundary regression loss branch: after a point-by-point convolution with 36 output channels, the bounding box regression loss layer (Bbox_Regressor) in the RPN network calculates the boundary regression offsets of the anchor boxes; each anchor box outputs 4 relative position coordinates, namely the center coordinates (xa, ya) and the width and height (wa, ha) of the anchor box, so the 12 anchor boxes output 48 relative position coordinates per sliding step. Finally, the proposal layer in the RPN network integrates the outputs of the classification loss branch and the boundary regression loss branch to obtain a feature map y2 of anchor boxes with relative position coordinate values; the proposal layer (Proposal) applies a non-maximum suppression algorithm (NMS) to screen the anchor boxes preliminarily and removes anchor boxes that exceed the image boundary. A sketch of the anchor generation at one position is given below;
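A minimal sketch of the anchor generation described in step (2.3), assuming the 12 (width, height) pairs above and a feature-map stride of 16 (consistent with the four poolings that reduce 800 × 960 to 50 × 60); the names are illustrative:

```python
import numpy as np

# (width, height) pairs of the 12 anchor scales
ANCHOR_SIZES = [(16, 16), (16, 32), (32, 16), (32, 32), (32, 64), (64, 32),
                (64, 64), (64, 128), (128, 64), (128, 128), (128, 256), (256, 128)]

def anchors_at(cx: float, cy: float) -> np.ndarray:
    """Return the 12 anchors as (x1, y1, x2, y2) boxes centered on (cx, cy)."""
    return np.array([[cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2]
                     for w, h in ANCHOR_SIZES])

# feature-map cell (i, j) maps back to image center (16 * j + 8, 16 * i + 8)
print(anchors_at(8.0, 8.0).shape)  # (12, 4)
```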
(2.4) After the RPN network in the Faster-RCNN sub-network, the feature map y2 and the feature map y1 obtained from the VGG16 network are input into the RoI pooling layer in the Faster-RCNN sub-network, which outputs the feature maps of non-uniform size as a feature map y3 of size 25 × 30 × 256; after a fully connected layer (FC) with ReLU activation in the Faster-RCNN sub-network, a regression result y4 is obtained through the regression loss layer (Regressor) in the Faster-RCNN sub-network;
(2.5) The regression result y4 obtained from the regression loss layer in the Faster-RCNN sub-network and the feature map y1 obtained from the VGG16 network are input into the RoI pooling layer in the positioning refinement sub-network, which outputs a feature map of size 6 × 7 × 256. After a fully connected layer (FC) with ReLU activation in the positioning refinement sub-network, the output is divided into two paths: one path outputs the position information y5 of the targets in the remote sensing image through the regression loss layer (Regressor) in the positioning refinement sub-network; the other path outputs the classification result y6 of the targets in the remote sensing image through the Softmax classifier in the positioning refinement sub-network.
(1.4) Training the improved Faster-RCNN remote sensing image target detection network: first randomly initialize the node parameters of the built improved Faster-RCNN remote sensing image target detection network, then input the remote sensing images of the training set into the network and update the node parameters of the model by stochastic gradient descent, following the descent direction in each iteration, until an optimal solution is found and the iteration stops.
The hardware conditions and parameter configurations for training the network are given in S401 and S402:
S401. The method uses a computer with an Intel Core i7-9700 CPU, an Nvidia GeForce GTX 1060 6GB graphics card and 16 GB of memory, and the algorithm framework is built with PyTorch.
S402. The parameters of the network are updated with a stochastic gradient descent algorithm; the pre-trained model is a ResNet50 network. A dynamic learning rate makes the network converge quickly to the optimum: the initial learning rate is set to 0.001 and is multiplied by 0.1 every 4000 iterations, for 100,000 iterations in total; the threshold of non-maximum suppression (NMS) is set to 0.7. A sketch of this schedule follows.
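A minimal PyTorch sketch of the S402 schedule; the momentum value and the stand-in model and loss are assumptions not stated in the patent:

```python
import torch

model = torch.nn.Linear(10, 4)              # stand-in for the detection network
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=4000, gamma=0.1)

for iteration in range(100_000):
    optimizer.zero_grad()
    loss = model(torch.randn(2, 10)).sum()  # placeholder loss
    loss.backward()
    optimizer.step()
    scheduler.step()                        # lr x 0.1 every 4000 iterations
```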
(1.5) Testing the improved Faster-RCNN remote sensing image target detection network: detect the remote sensing images of the test set with the trained improved Faster-RCNN remote sensing image target detection network and analyze the detection effect; the mean average precision (mAP) is selected as the evaluation index for measuring the remote sensing image target detection effect.
The present embodiment is described in further detail below:
Standard convolution (Conv2d): the calculation formula of the standard convolution is shown in equation (1):
Conv2d(W, b, x) = W · x + b (1)
where W is the weight of the convolution kernel, x is the input feature map, b is the bias term, M is the number of input channels, V and U are respectively the width and height of the convolution kernel, and N is the number of output channels.
Point-by-point convolution (Pwise): the point-by-point convolution kernel Wp has size 1 × 1 × N, where N is the number of output channels; if the input image has size h × d × M, the output feature map has size h × d × N. The point-by-point convolution is calculated as shown in equation (2):
Pwise(Wp, x) = Wp · x (2)
ReLU activation function: the mathematical formula is ReLU(x) = max(0, x), where max() takes the larger of 0 and x;
pooling layer (Pooling): each pooling layer adopts maximum pooling, the size of a pooling core is 2 multiplied by 2, and the step length is 2;
RoI pooling layer (RoI Pooling): the operation of the RoI pooling layer is divided into three steps. First, the region of interest is mapped to the corresponding position of the feature map according to the input feature map; second, the mapped region is divided into parts of equal size (their number equals the output dimension); third, a max pooling operation is applied to each part. Through these three steps, feature maps of different sizes are output as feature maps of fixed size, and the size of the output feature map is independent of the size of the RoI and the size of the input feature map. A usage sketch follows.
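A usage sketch of the fixed-size RoI pooling step with torchvision's roi_pool; the 25 × 30 output matches the RoI pooling layer of the Faster-RCNN sub-network, and spatial_scale = 1/16 (an assumption consistent with the four poolings) maps image-space boxes onto the 50 × 60 feature map:

```python
import torch
from torchvision.ops import roi_pool

y1 = torch.randn(1, 256, 50, 60)                      # backbone feature map
rois = torch.tensor([[0, 64.0, 64.0, 320.0, 320.0]])  # (batch_idx, x1, y1, x2, y2)
y3 = roi_pool(y1, rois, output_size=(25, 30), spatial_scale=1.0 / 16)
print(y3.shape)  # torch.Size([1, 256, 25, 30])
```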
Full connection layer (FC): each neuron of the full connection layer is completely connected with the neuron of the previous layer;
Mean average precision (mAP): the calculation formula is shown in equation (3):
mAP = (1/(k+1)) ∑i=0..k ( pii / ∑j=0..k pij ) (3)
Assume a total of k + 1 classes (including a null or background class); pij is the number of samples that originally belong to class i but are predicted as class j, called false positives; pii is the number of correctly classified samples;
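A small sketch of equation (3) as reconstructed above (the original formula survives only as an image, so the form is inferred from the surrounding definitions), computing the mean per-class accuracy from a confusion matrix p where p[i][j] counts samples of true class i predicted as class j:

```python
import numpy as np

def mean_accuracy(p: np.ndarray) -> float:
    # average of p_ii / sum_j p_ij over all k+1 classes
    return float(np.mean(np.diag(p) / p.sum(axis=1)))

p = np.array([[8, 2], [1, 9]])   # toy 2-class confusion matrix
print(mean_accuracy(p))          # 0.85
```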
Softmax classifier: the Softmax classifier is generally used for multi-class problems; the network is trained by minimizing the Softmax loss function. Assume a data set of size J, {(x(1), y(1)), …, (x(m), y(m)), …, (x(J), y(J))}; every sample in the data set has a correct classification label, i.e. a label value in {y(1), …, y(k)}, where k is the number of classes. For the m-th sample, which corresponds to a category j, there is a probability, also called a score value; the score value of the m-th sample is shown in equation (4):
h(x(m)) = exp(θj · x(m)) / ∑l=0..k−1 exp(θl · x(m)) (4)
where θ = (θ0, θ1, …, θk−1) are the parameters to be optimized, y(m) denotes the label of the m-th sample, x(m) denotes the m-th sample, and h(x(m)) denotes the score value of the m-th sample; the denominator ∑l=0..k−1 exp(θl · x(m)) normalizes the probability distribution so that the probabilities sum to 1.
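A small numpy sketch of equation (4); theta is a (k, d) parameter matrix, and subtracting the maximum logit is a standard numerical-stability step not written in the equation:

```python
import numpy as np

def softmax_scores(theta: np.ndarray, x: np.ndarray) -> np.ndarray:
    # one score per class: exp(theta_j . x) normalized over all classes
    logits = theta @ x
    logits = logits - logits.max()   # stability shift; does not change the result
    exp = np.exp(logits)
    return exp / exp.sum()

theta = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])  # k = 3 classes, d = 2
x = np.array([0.5, 1.5])
print(softmax_scores(theta, x))      # three probabilities summing to 1
```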
NMS: the non-maximum suppression algorithm (Non-Maximum Suppression) removes non-maxima. Its steps are as follows:
Assume the object to be recognized is surrounded by F candidate boxes, and the classifier computes a score sn for the n-th candidate box, 1 ≤ n ≤ F. (1) Create a set H containing the F candidate boxes and create an empty set T. (2) Sort all candidate boxes in H by their classifier score values and move the box t with the highest score into the set T. (3) Traverse the candidate boxes in H and compute the intersection-over-union of each with box t; if it is higher than a certain threshold, the box is considered to overlap box t and is deleted from H. (4) Return to step (2) and continue iterating until H is empty. The boxes in the set T are the ones we need; a direct sketch follows.
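A direct sketch of the four steps above; scores holds the classifier score values sn, boxes holds (x1, y1, x2, y2) candidates, and the 0.7 default matches the NMS threshold of S402:

```python
import numpy as np

def iou(a, b):
    # intersection-over-union of two (x1, y1, x2, y2) boxes
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def nms(boxes, scores, threshold=0.7):
    H = list(np.argsort(-scores))   # steps 1-2: candidates sorted by score
    T = []                          # step 1: empty output set
    while H:
        t = H.pop(0)                # step 2: move the highest-scoring box into T
        T.append(t)
        # step 3: delete every remaining box overlapping box t above the threshold
        H = [n for n in H if iou(boxes[n], boxes[t]) <= threshold]
    return T                        # step 4: iterate until H is empty
```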
Regression loss layer (Regressor): the regression loss function Lreg is shown in equation (5):
Lreg(tn, vn) = ∑c∈{x,y,w,h} smoothL1(tc − vc) (5)
where smoothL1 is the robust loss function shown in equation (6):
smoothL1(x) = 0.5x² if |x| < 1; |x| − 0.5 otherwise (6)
where vn = (vx, vy, vw, vh) is the coordinate vector of the ground-truth box and tn = (tx, ty, tw, th) is the coordinate vector of the prediction box. The four coordinate calculation formulas are as follows:
tx = (x − xa)/wa, ty = (y − ya)/ha
tw = log(w/wa), th = log(h/ha)
vx = (x* − xa)/wa, vy = (y* − ya)/ha
vw = log(w*/wa), vh = log(h*/ha)
where x and y are the center coordinates of the prediction box, w and h are respectively its width and height, xa and ya are the center coordinates of the anchor box generated by the RPN network, wa and ha are respectively the width and height of the anchor box, x* and y* are the center coordinates of the ground-truth box, and w* and h* are respectively its width and height. A sketch of equations (5)-(6) follows.
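A numpy sketch of equations (5)-(6) and the coordinate parameterization above; the helper names and the toy boxes are illustrative:

```python
import numpy as np

def smooth_l1(x: np.ndarray) -> np.ndarray:
    # equation (6): 0.5 x^2 if |x| < 1, |x| - 0.5 otherwise
    return np.where(np.abs(x) < 1, 0.5 * x ** 2, np.abs(x) - 0.5)

def encode(box, anchor):
    # offsets of a center-size box (x, y, w, h) relative to an anchor
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return np.array([(x - xa) / wa, (y - ya) / ha,
                     np.log(w / wa), np.log(h / ha)])

anchor = (96.0, 96.0, 64.0, 64.0)
t = encode((100.0, 100.0, 64.0, 64.0), anchor)  # prediction offsets t_n
v = encode((104.0, 100.0, 64.0, 64.0), anchor)  # ground-truth offsets v_n
print(smooth_l1(t - v).sum())                   # equation (5)
```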
Bounding box regression loss layer (Bbox Regressor): the loss function of the bounding box regression loss layer in the RPN network is defined as:
L({pn}, {tn}) = (1/Ncls) ∑n Lcls(pn, pn*) + λ · (1/Nreg) ∑n pn* · Lreg(tn, vn)
The loss function is divided into two parts: Lcls is the classification loss function, whose output is denoted by pn, and Lreg is the regression loss function given by equation (5), whose output is denoted by tn. n is the anchor box index; pn is the probability that the n-th anchor box contains a target; pn* is 1 if the n-th anchor box contains a target and 0 otherwise; Nreg is the number of anchor boxes containing targets in the RPN network and Ncls is the total number of anchor boxes; λ is a weight.
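A numpy sketch of the two-part loss as reconstructed above; binary cross-entropy stands in for Lcls (the patent does not spell out its form), and all array names are illustrative:

```python
import numpy as np

def smooth_l1(x):
    return np.where(np.abs(x) < 1, 0.5 * x ** 2, np.abs(x) - 0.5)

def rpn_loss(p, p_star, t, v, N_cls, N_reg, lam=1.0):
    # classification part: binary cross-entropy over the anchor scores p
    eps = 1e-7
    L_cls = -(p_star * np.log(p + eps) + (1 - p_star) * np.log(1 - p + eps))
    # regression part: equation (5), applied only to anchors whose label p* is 1
    L_reg = smooth_l1(t - v).sum(axis=1)
    return L_cls.sum() / N_cls + lam * (p_star * L_reg).sum() / N_reg

p = np.array([0.9, 0.2])        # predicted objectness of two anchors
p_star = np.array([1.0, 0.0])   # ground-truth anchor labels
t = np.zeros((2, 4)); v = np.zeros((2, 4))
print(rpn_loss(p, p_star, t, v, N_cls=2, N_reg=1))
```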
The advantages and disadvantages of the present invention were further analyzed by comparing the detection results of the method of the present invention with those of Faster-RCNN (see Table 2).
TABLE 2 Average precision of the present invention and the existing method
(Table 2 appears in the original publication only as an image and is not reproduced here.)
As can be seen from Table 2, the mean average precision (mAP) of the method of the present invention is improved. For the vehicle, a small target whose background is complex and which is easily occluded by shadow, the accuracy of the existing method is low, but the network of the present invention improves the average precision of the vehicle class by about 7%, which indicates that the positioning refinement sub-network effectively improves the detection of small targets. Meanwhile, for a larger target such as the bridge, the average precision is not high for either the present method or the prior art: in this data set bridges and roads are connected in long strips with similar color and texture features, so bridges are easily identified as roads during detection, which keeps the average precision of the bridge class low. For the storage tank, a densely distributed target in the images, the detection accuracy of both the present invention and the existing network is low, so the network still needs to be improved to detect targets with high distribution density.
Meanwhile, on the same data set, the average precision of the improved network was tested with different numbers of anchor boxes, as shown in Table 3.
TABLE 3 Average precision for different numbers of anchor boxes
Number of anchor boxes    mAP (%)
3                         78.2
6                         80.6
9                         81.5
12                        83.1
15                        82.6
As can be seen from Table 3, when the NWPU VHR-10 data set is trained with different numbers of anchor boxes, the average precision increases steadily from 3 to 12 anchor boxes and decreases slightly beyond 12. This shows that increasing the number of anchor boxes improves the average precision within a certain range; beyond that range, simply adding anchor boxes cannot improve detection accuracy, but adds extra computation and increases the overfitting risk of the network, thereby increasing its complexity.
As shown in FIG. 3, visualizations of the improved Faster-RCNN remote sensing image target detection method are compared with those of the prior art. The visualization is arranged in two columns and three rows; from left to right, the first column shows the results of the present invention and the second column shows the results of the prior art. In the first row of images, the existing method displaces the prediction box of the bridge considerably and positions the ship with a small deviation. The second row contains 5 vehicle targets; for these small targets the existing method misses one vehicle, falsely detects another, and positions the remaining three with slight deviations. The third row contains 5 airplanes, and the figure shows that the existing method misses some of the airplanes.
The above embodiments are described in detail to further illustrate the present invention and should not be construed as limiting its scope; skilled engineers can make insubstantial modifications and adaptations of the present invention based on the above disclosure.

Claims (1)

1. An improved Faster-RCNN remote sensing image target detection method, characterized by comprising the following steps:
(1.1) dividing a remote sensing image data set into a training set and a testing set;
(1.2) carrying out size transformation, normalization processing and data enhancement on the remote sensing images in the training set in sequence:
a. the size transformation sets each remote sensing image in the training set to 800 × 960 pixels;
b. the normalization maps each pixel value of the images in the training set into the range 0-1;
c. the data enhancement rotates each normalized remote sensing image in the training set by 90, 180 and 270 degrees and applies a mirror operation;
(1.3) constructing an improved Faster-RCNN remote sensing image target detection network: the network consists of a Faster-RCNN sub-network and a positioning refinement sub-network;
(1.4) training the improved Faster-RCNN remote sensing image target detection network: first randomly initializing the node parameters of the built improved Faster-RCNN remote sensing image target detection network, then inputting the remote sensing images of the training set into the built network and updating the node parameters of the improved Faster-RCNN remote sensing image target detection network model by stochastic gradient descent until an optimal solution is found;
(1.5) testing the improved Faster-RCNN remote sensing image target detection network: detecting the remote sensing images of the test set with the trained improved Faster-RCNN remote sensing image target detection network and analyzing the detection effect;
the step (1.3) of establishing the improved Faster-RCNN remote sensing image target detection network comprises the following specific steps:
(2.1) inputting the remote sensing image into the VGG16 network in the Faster-RCNN sub-network, extracting the texture, color and scale features of the targets in the remote sensing image, and obtaining after the VGG16 network a feature map y1 of size 50 × 60 × 256;
(2.2) feeding the obtained feature map y1 into three parallel branches: the RPN (Region Proposal Network) and the RoI pooling layer in the Faster-RCNN sub-network, and the RoI pooling layer in the positioning refinement sub-network;
(2.3) in the RPN network of the Faster-RCNN sub-network, performing with a 3 × 3 sliding window a standard convolution operation with stride 1 on y1, and generating at each sliding position 12 anchor boxes of different scales centered on the center point of the sliding window, with sizes 16 × 16, 16 × 32, 32 × 16, 32 × 32, 32 × 64, 64 × 32, 64 × 64, 64 × 128, 128 × 64, 128 × 128, 128 × 256 and 256 × 128; after the anchor boxes are generated, the output passes through a ReLU activation function in the RPN network and is divided into two branches: one branch is the classification loss branch, which first applies a point-by-point convolution with 18 output channels and then classifies the 12 anchor boxes of different scales with the Softmax classifier in the RPN network, each anchor box outputting two probability values that distinguish target from background, so that 24 probability values are output per sliding step; the other branch is the boundary regression loss branch, which after a point-by-point convolution with 36 output channels calculates the boundary regression offsets of the anchor boxes through the bounding box regression loss layer in the RPN network, each anchor box outputting 4 relative position coordinates, namely the center coordinates (xa, ya) and the width and height (wa, ha) of the anchor box, so that the 12 anchor boxes output 48 relative position coordinates per sliding step; finally, integrating the outputs of the classification loss branch and the boundary regression loss branch through the proposal layer in the RPN network to obtain a feature map y2 of anchor boxes with relative position coordinates;
(2.4) after the RPN network in the Faster-RCNN sub-network, inputting the feature map y2 and the feature map y1 obtained from the VGG16 network into the RoI pooling layer in the Faster-RCNN sub-network, outputting the feature maps of non-uniform size as a feature map y3 of size 25 × 30 × 256, and obtaining, after a fully connected layer with ReLU activation in the Faster-RCNN sub-network, a regression result y4 through the regression loss layer in the Faster-RCNN sub-network;
(2.5) inputting the regression result y4 obtained from the regression loss layer in the Faster-RCNN sub-network and the feature map y1 obtained from the VGG16 network into the RoI pooling layer in the positioning refinement sub-network, outputting a feature map of size 6 × 7 × 256, and after a fully connected layer with ReLU activation in the positioning refinement sub-network, dividing the output into two paths: one path outputs the position information y5 of the targets in the remote sensing image through the regression loss layer in the positioning refinement sub-network; the other path outputs the classification result y6 of the targets in the remote sensing image through the Softmax classifier in the positioning refinement sub-network.
CN202010833754.0A 2020-08-18 2020-08-18 Improved Faster-RCNN remote sensing image target detection method Active CN111950488B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010833754.0A CN111950488B (en) 2020-08-18 2020-08-18 Improved Faster-RCNN remote sensing image target detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010833754.0A CN111950488B (en) 2020-08-18 2020-08-18 Improved Faster-RCNN remote sensing image target detection method

Publications (2)

Publication Number Publication Date
CN111950488A CN111950488A (en) 2020-11-17
CN111950488B true CN111950488B (en) 2022-07-19

Family

ID=73342137

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010833754.0A Active CN111950488B (en) 2020-08-18 2020-08-18 Improved Faster-RCNN remote sensing image target detection method

Country Status (1)

Country Link
CN (1) CN111950488B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112464769A (en) * 2020-11-18 2021-03-09 西北工业大学 High-resolution remote sensing image target detection method based on consistent multi-stage detection
CN112686139B (en) * 2020-12-29 2024-02-09 西安电子科技大学 Remote sensing image target detection method based on cross-stage local multiscale dense connection
CN113158789B (en) * 2021-03-15 2023-08-25 华南理工大学 Target detection method, system, device and medium for remote sensing image
CN113392803A (en) * 2021-06-30 2021-09-14 广东电网有限责任公司 Method and device for identifying suspended foreign matters of power transmission line, terminal and storage medium
DE102022108158A1 (en) 2022-04-05 2023-10-05 Ford Global Technologies, Llc Method and device for determining a component orientation

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10839220B2 (en) * 2018-10-15 2020-11-17 Kepler Vision Technologies B.V. Method for categorizing a scene comprising a sub-scene with machine learning

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107972662A (en) * 2017-10-16 2018-05-01 华南理工大学 To anti-collision warning method before a kind of vehicle based on deep learning
CN108830280A (en) * 2018-05-14 2018-11-16 华南理工大学 A kind of small target detecting method based on region nomination
WO2020020472A1 (en) * 2018-07-24 2020-01-30 Fundación Centro Tecnoloxico De Telecomunicacións De Galicia A computer-implemented method and system for detecting small objects on an image using convolutional neural networks
CN110084195A (en) * 2019-04-26 2019-08-02 西安电子科技大学 Remote Sensing Target detection method based on convolutional neural networks
CN110210463A (en) * 2019-07-03 2019-09-06 中国人民解放军海军航空大学 Radar target image detecting method based on Precise ROI-Faster R-CNN
CN110853015A (en) * 2019-11-12 2020-02-28 中国计量大学 Aluminum profile defect detection method based on improved Faster-RCNN
CN110929618A (en) * 2019-11-15 2020-03-27 国网江西省电力有限公司电力科学研究院 Potential safety hazard detection and evaluation method for power distribution network crossing type building
CN111259905A (en) * 2020-01-17 2020-06-09 山西大学 Feature fusion remote sensing image semantic segmentation method based on downsampling

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
An Optimized Faster R-CNN Method Based on DRNet and RoI Align for Building Detection in Remote Sensing Images; Tong Bai et al.; Remote Sensing; 2020-02-26; vol. 12, no. 5; pp. 1-16 *
Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks; Shaoqing Ren et al.; arXiv:1506.01497v3; 2016-01-06; pp. 1-14 *
Semantic segmentation of remote sensing images with down-sampling-based feature fusion (基于下采样的特征融合遥感图像语义分割); Li Shuai et al.; Journal of Test and Measurement Technology (《测试技术学报》); 2020-06-16; vol. 34, no. 4; pp. 331-337 *
Research on remote sensing image classification technology based on deep learning (基于深度学习的遥感图像分类技术研究); Li Shuai; China Masters' Theses Full-text Database, Engineering Science and Technology II (《中国优秀硕士学位论文全文数据库_工程科技Ⅱ辑》); 2021-01-15; C028-204 *

Also Published As

Publication number Publication date
CN111950488A (en) 2020-11-17

Similar Documents

Publication Publication Date Title
CN111950488B (en) Improved Faster-RCNN remote sensing image target detection method
CN111091105B (en) Remote sensing image target detection method based on new frame regression loss function
CN112418117B (en) Small target detection method based on unmanned aerial vehicle image
CN108830188B (en) Vehicle detection method based on deep learning
CN109614985B (en) Target detection method based on densely connected feature pyramid network
CN108830285B (en) Target detection method for reinforcement learning based on fast-RCNN
CN106096561B (en) Infrared pedestrian detection method based on image block deep learning features
CN109919241B (en) Hyperspectral unknown class target detection method based on probability model and deep learning
CN112101278A (en) Hotel point cloud classification method based on k nearest neighbor feature extraction and deep learning
CN112132042A (en) SAR image target detection method based on anti-domain adaptation
CN109934216B (en) Image processing method, device and computer readable storage medium
CN109165658B (en) Strong negative sample underwater target detection method based on fast-RCNN
CN111274926B (en) Image data screening method, device, computer equipment and storage medium
CN109801305B (en) SAR image change detection method based on deep capsule network
CN108171119B (en) SAR image change detection method based on residual error network
CN111833353B (en) Hyperspectral target detection method based on image segmentation
CN113177456A (en) Remote sensing target detection method based on single-stage full convolution network and multi-feature fusion
CN114694178A (en) Method and system for monitoring safety helmet in power operation based on fast-RCNN algorithm
CN113221956A (en) Target identification method and device based on improved multi-scale depth model
CN110097067B (en) Weak supervision fine-grained image classification method based on layer-feed feature transformation
CN113128518B (en) Sift mismatch detection method based on twin convolution network and feature mixing
CN108257148B (en) Target suggestion window generation method of specific object and application of target suggestion window generation method in target tracking
CN114219936A (en) Object detection method, electronic device, storage medium, and computer program product
CN113780145A (en) Sperm morphology detection method, sperm morphology detection device, computer equipment and storage medium
CN113609895A (en) Road traffic information acquisition method based on improved Yolov3

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant