CN110826485A - Target detection method and system for remote sensing image

Target detection method and system for remote sensing image

Info

Publication number
CN110826485A
CN110826485A (application CN201911071646.8A)
Authority
CN
China
Prior art keywords
target
detected
remote sensing
detection
target area
Prior art date
Legal status
Granted
Application number
CN201911071646.8A
Other languages
Chinese (zh)
Other versions
CN110826485B (en)
Inventor
王俊强
李建胜
周学文
吴峰
郑凯
Current Assignee
Information Engineering University of PLA Strategic Support Force
Original Assignee
Information Engineering University of PLA Strategic Support Force
Priority date
Filing date
Publication date
Application filed by Information Engineering University of PLA Strategic Support Force filed Critical Information Engineering University of PLA Strategic Support Force
Priority to CN201911071646.8A priority Critical patent/CN110826485B/en
Publication of CN110826485A publication Critical patent/CN110826485A/en
Application granted granted Critical
Publication of CN110826485B publication Critical patent/CN110826485B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • G06V20/13Satellite images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • G06V20/182Network patterns, e.g. roads or rivers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Abstract

The invention relates to a target detection method and system for remote sensing images. The target detection system comprises a memory, a processor, and a computer program stored in the memory and runnable on the processor; when running the computer program, the processor implements the following steps: (1) training to obtain a deep learning network; (2) acquiring the size of the target area to be detected and the actual size of the target; (3) dividing the target area to be detected according to the proportional relation between the size of the target area to be detected and the actual size of the target, obtaining a plurality of small grids; (4) sequentially inputting all the small grid images obtained in step (3) into the deep learning network obtained in step (1) for target detection, finally obtaining a detection result. The adaptively divided grids avoid cutting the target excessively, or cutting it at all, so that after detection and identification by a fast, high-precision deep learning network model, targets in the remote sensing image to be detected can be detected more accurately.

Description

Target detection method and system for remote sensing image
Technical Field
The invention relates to a target detection method and a target detection system for remote sensing images.
Background
Remote sensing image target detection determines whether a target of interest exists in a remote sensing image and detects and accurately locates it; it is one of the important research directions in image interpretation. The traditional target detection process uses a sliding window to select regions, then extracts shallow features, and finally judges the category. The recognition performance of this approach depends heavily on the hand-designed features, and it is difficult to fully mine the deep features in the image. Feature extraction is not robust, cannot adapt to illumination changes, inconsistent resolution, and other conditions of multi-source remote sensing images, and can hardly meet the requirements of large-scale automated application.
In recent years, with the continuous growth of computing power, deep learning has been widely applied, and remote sensing image target detection based on deep learning has developed rapidly. For target detection, a deep convolutional neural network requires no manually designed features: it automatically extracts features from remote sensing image data, and its performance exceeds that of traditional algorithms.
However, when target detection is performed on a remote sensing image shot at high altitude, the full image contains too many pixels for the processor to detect in one pass, so the image to be detected must be divided before detection; the division of the target area to be detected often cuts the target, so that targets in the remote sensing image cannot be accurately identified.
Disclosure of Invention
The invention aims to provide a target detection method and a target detection system for remote sensing images, to solve the problem that target detection in existing remote sensing images is inaccurate.
In order to achieve the above object, the present invention provides a target detection method for remote sensing images, comprising the steps of:
(1) training to obtain a deep learning network;
(2) acquiring the size of a target area to be detected and the actual size of a target;
(3) dividing the target area to be detected according to the proportional relation between the size of the target area to be detected and the actual size of the target to obtain a plurality of small grids;
(4) sequentially inputting all the small grid images obtained in step (3) into the deep learning network obtained in step (1) for target detection, finally obtaining a detection result.
Advantageous effects: according to the target detection method for remote sensing images, the target area to be detected is first divided into grids; the division is adaptive, based on the size of the target area to be detected and the actual size of the target, so the adaptively divided grids avoid cutting the target excessively, or cutting it at all. After target detection and identification by a fast, high-precision deep learning network model, targets in the remote sensing image to be detected can be detected more accurately.
Further, the target area to be detected in step (3) is divided into m rows and n columns of small grids:

$$m=\left\lceil \frac{Y_{max}-Y_{min}}{Y_{height}\,(1-overlap_y)} \right\rceil,\qquad n=\left\lceil \frac{X_{max}-X_{min}}{X_{width}\,(1-overlap_x)} \right\rceil$$

where $\lceil\cdot\rceil$ denotes rounding up to an integer; $X_{max}$ is the maximum X coordinate of the target area to be detected, $X_{min}$ the minimum X coordinate, $Y_{max}$ the maximum Y coordinate, and $Y_{min}$ the minimum Y coordinate; $X_{width}$ is the actual length of the map-container loading area at a given image level for the target area to be detected, and $Y_{height}$ its actual width; $overlap_x$ is the overlap rate in the X direction, determined by the ratio of the target size to the length of the target area to be detected in the X direction, and $overlap_y$ is the overlap rate in the Y direction, determined by the ratio of the target size to the length of the target area to be detected in the Y direction.
Further, the overlap rate in the X direction and the overlap rate in the Y direction are:

$$overlap_x=\frac{l_{object}}{X_{width}},\qquad overlap_y=\frac{w_{object}}{Y_{height}}$$

where $l_{object}$ is the actual length of the target and $w_{object}$ is the actual width of the target.
Further, the coordinates of the divided small grids are expressed as

$$\begin{aligned} x^{ij}_{min}&=X_{min}+(j-1)\,X_{width}\,(1-overlap_x), & x^{ij}_{max}&=x^{ij}_{min}+X_{width},\\ y^{ij}_{min}&=Y_{min}+(i-1)\,Y_{height}\,(1-overlap_y), & y^{ij}_{max}&=y^{ij}_{min}+Y_{height}, \end{aligned}$$

where $x^{ij}_{max}$ is the maximum X coordinate of the grid in row i, column j, $x^{ij}_{min}$ the minimum X coordinate, $y^{ij}_{max}$ the maximum Y coordinate, and $y^{ij}_{min}$ the minimum Y coordinate; i = 1, 2, …, m and j = 1, 2, …, n.
Further, the image level is 18. At image level 18, the detection effect is better.
Further, the deep learning network comprises a Faster R-CNN network and an RPN network.
Further, the Faster R-CNN network comprises a deep residual network, and the deep residual network has 50 layers.
Further, the RPN network outputs rectangular candidate boxes by convolution.
Further, the mechanism for generating rectangular candidate frames in the RPN network is as follows: the number of rectangular candidate frames is determined by the number of area scaling factors and the number of aspect ratios of the training sample image.
In order to achieve the above object, the present invention provides a target detection system for remote sensing images, including a memory, a processor, and a computer program stored in the memory and executable on the processor; the processor implements the steps of the above target detection method when executing the computer program.
Drawings
FIG. 1 is a flowchart of a target detection method of a remote sensing image according to the present invention;
FIG. 2 is a schematic structural diagram of a Faster R-CNN model constructed in an embodiment of the present invention;
FIG. 3-a is an original image before being processed by the data enhancement method of the present invention;
FIG. 3-b is an image of the present invention under random color data enhancement;
FIG. 3-c is an image of the noise-disturbed data enhancement according to the present invention;
FIG. 3-d is an image of the present invention under random scaling data enhancement;
FIG. 3-e is an image of the present invention under random rotation data enhancement;
FIG. 3-f is an image of the present invention under random flip data enhancement;
FIG. 4 is a schematic diagram illustrating the meshing of remote sensing images according to the present invention;
FIG. 5 is a flow chart of multi-level object detection according to the present invention;
FIG. 6 is a comparison graph of the total loss training results under four generation mechanisms according to the present invention;
FIG. 7-a is a comparison graph of the detection results of different target scales under the Anchor mechanism 1 of the present invention;
FIG. 7-b is a comparison graph of the detection results of different target scales under the Anchor mechanism 2 of the present invention;
FIG. 7-c is a comparison graph of the detection results of different target scales under the Anchor mechanism 3 of the present invention;
FIG. 7-d is a comparison graph of the detection results of different target scales under the Anchor mechanism 4 of the present invention;
FIG. 8-a is a diagram of grid division and airplane target size at image level 17 according to the present invention;
FIG. 8-b is a diagram of grid division and airplane target size at image level 18 according to the present invention;
FIG. 8-c is a diagram of grid division and airplane target size at image level 19 according to the present invention;
FIG. 9-a is a diagram of detection bounding boxes at image level 18 according to the present invention;
FIG. 9-b is a diagram of detection bounding boxes at image level 19 according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be further described in detail with reference to the accompanying drawings and examples, but the embodiments of the present invention are not limited thereto.
The technical idea of the invention is as follows: divide the image to be detected into grids according to the overlap rates in the X and Y directions, then sequentially input the divided small grids into a constructed and trained Faster R-CNN detection model for target detection, so that targets in the image to be detected can be detected more accurately and reliably.
The embodiment of the target detection method comprises the following steps:
In this embodiment, airplanes and track and field grounds are taken as examples. It should be noted that the invention also applies to targets of other structural forms, such as ships and oil drums. The implementation flow of the target detection method is shown in FIG. 1; the specific implementation steps are as follows:
Step 1, obtaining training sample images and test sample images.
Target sample images of different scales and categories are manually collected from digital-earth satellite imagery worldwide using a target detection service system. After the target sample images are annotated, training sample images containing the targets are formed. The target categories selected in this implementation are airplanes and track and field grounds: 1000 airplane samples and 900 track and field samples were obtained. 70% of the picture samples are used as the training set and the remaining 30% as the test set; a large number of small airplane targets are introduced into the test set to verify the model's accuracy on small targets. After the data enhancement operation, the final training set totals 7980 images and the test set 3420.
Step 2, constructing a deep learning network model based on Faster R-CNN
The structure of the deep learning network model in this embodiment is shown in FIG. 2; the model is obtained by combining a Faster R-CNN network and an RPN network.
A 50-layer deep residual network (ResNet-50) is adopted in the Faster R-CNN network to extract basic features and obtain a feature map. At the structural level of the neural network, the deep residual network solves the problem of vanishing gradients during back propagation: the gradient does not vanish even when the network is deep, which ensures detection precision.
The RPN network generates rectangular candidate regions from the feature map, outputs through convolution the position parameters of each rectangular candidate frame (i.e., anchor) and the probability that it is a target, and then outputs the proposal boxes to the Faster R-CNN for training and learning.
Since the Faster R-CNN network and the RPN network both belong to the prior art, they are not described further here.
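As a concrete illustration of this combined structure, the following minimal sketch builds a ResNet-50-based Faster R-CNN (with its built-in RPN) for the two target classes used here. The patent's experiments ran on a TensorFlow framework; the torchvision stand-in below (whose variant adds an FPN on top of ResNet-50) is only an assumed illustration of the moving parts, not the original implementation.

```python
import torch
import torchvision

# Faster R-CNN with a ResNet-50 backbone; torchvision wires the RPN
# (anchor generation + proposal scoring) and the detection head together.
# num_classes = 2 target categories (airplane, track and field) + background.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(
    weights=None, num_classes=3)
model.eval()

# A dummy 3-channel image stands in for one small grid tile.
tile = [torch.rand(3, 600, 600)]
with torch.no_grad():
    detections = model(tile)[0]   # dict with 'boxes', 'labels', 'scores'
print(detections['boxes'].shape)
```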
Step 3, learning and training of deep learning network
The training sample images are input into the deep learning network model for learning and training. Since the learning and training process of deep learning networks is much discussed in the prior art, only the distinctive design links are introduced here:
3.1 The generation mechanism of rectangular candidate frames (i.e., anchors) in the RPN network of the present invention is: for a training sample image, the number of rectangular candidate frames is determined by the number of area scaling factors and the number of aspect ratios. For example, for a 51 × 39 training sample image with 256 channels, a base area of 256 × 256 represents a rectangular candidate frame with a scaling factor of 1 and an aspect ratio of 1.
Based on this generation mechanism, the invention sets four anchor generation mechanisms, as shown in Table 1 below:
TABLE 1
[Table 1 lists the four anchor generation mechanisms (their area scaling factors and aspect ratios); it appears only as an image in the original document.]
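To make the mechanism concrete, the sketch below enumerates anchors for one feature-map cell from a list of area scaling factors and aspect ratios, so the anchor count is their product. The specific coefficients of the four mechanisms survive only as the Table 1 image, so the values below are purely hypothetical placeholders.

```python
import numpy as np

def generate_anchors(base_size, scales, aspect_ratios):
    """Return len(scales) * len(aspect_ratios) anchors centred on one
    feature-map cell, as (x1, y1, x2, y2) offsets from the cell centre."""
    anchors = []
    for s in scales:
        area = (base_size * s) ** 2          # area scaling factor s
        for r in aspect_ratios:              # r = height / width
            w = np.sqrt(area / r)
            h = w * r
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

# Hypothetical coefficients standing in for one row of Table 1.
anchors = generate_anchors(16, scales=[0.5, 1.0, 2.0],
                           aspect_ratios=[0.5, 1.0, 2.0])
print(len(anchors))   # 3 scales x 3 ratios = 9 candidate frames per cell
```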
3.2 Loss function in the RPN network
The RPN network calculates the total loss using a multi-task loss, expressed as:

$$L(\{p_i\},\{t_i\})=\frac{1}{N_{cls}}\sum_i L_{cls}(p_i,p_i^*)+\lambda\,\frac{1}{N_{reg}}\sum_i p_i^*\,L_{reg}(t_i,t_i^*)$$

where i indexes the rectangular candidate boxes; $p_i$ is the predicted probability that the i-th rectangular candidate box is a target; $p_i^*$ is the true probability, equal to 1 if the rectangular candidate box is a target and 0 otherwise; $t_i$ is a four-dimensional vector representing the predicted coordinate transformation parameters (translation and scaling) of rectangular candidate box i; $t_i^*$ represents the transformation parameters of the rectangular candidate box relative to the true marker box; $\{p_i\}$ is the set of predicted probabilities of all rectangular candidate boxes and $\{t_i\}$ the set of all predicted coordinate transformation parameters; $\lambda$ is a balance coefficient; $N_{cls}$ and $N_{reg}$ are normalization factors; the classification loss $L_{cls}$ can be calculated with cross-entropy, and the position regression loss $L_{reg}$ is calculated with the Smooth L1 loss function.
3.3 Loss function of the classification regression process
When performing the target classification regression, a multi-task loss is calculated synchronously; for each pooling (RoI) layer in the Faster R-CNN network, the loss function is expressed as:

$$L(p,u,t^u,v)=L_{cls}(p,u)+\lambda\,[u\ge 1]\,L_{loc}(t^u,v)$$

where $p=(p_0,\dots,p_K)$ denotes the probabilities over the K+1 classes (including background), u denotes the true class of the RoI, $v=(v_x,v_y,v_w,v_h)$ represents the real position of the target, $t^u$ represents the predicted position, $[u\ge 1]$ is 1 when $u\ge 1$ and 0 otherwise, and $\lambda$ is a balance coefficient; the classification loss $L_{cls}(p,u)$ can be calculated with cross-entropy, and the bounding-box regression loss $L_{loc}(t^u,v)$ is calculated with the Smooth L1 loss function.
3.4 Data enhancement improves the diversity of training samples, prevents overfitting caused by insufficient samples during training, and enhances the robustness of the model.
Five ways are used here: random rotation of 0° to 360°, noise perturbation, color dithering, random non-uniform scaling of 0.8 to 1.2 times, and flip transformation of 90°, 180°, or 270°, as shown in FIGS. 3-a through 3-f. In training, the initial learning rate is set to 0.0003 and the learning momentum to 0.9. During training, transfer learning is performed from a model pre-trained on a public data set, which initializes the network weights.
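A minimal sketch of the five enhancement modes, assuming Pillow/NumPy; the noise standard deviation and color range below are illustrative assumptions, and box annotations would need the matching geometric transform:

```python
import random
import numpy as np
from PIL import Image, ImageEnhance

def augment(img):
    """Apply one randomly chosen mode out of the five used above."""
    mode = random.randrange(5)
    if mode == 0:                                  # random rotation 0-360 deg
        return img.rotate(random.uniform(0, 360), expand=True)
    if mode == 1:                                  # noise perturbation
        arr = np.asarray(img, dtype=np.float32)
        arr += np.random.normal(0, 10, arr.shape)  # assumed sigma = 10
        return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))
    if mode == 2:                                  # color dithering
        return ImageEnhance.Color(img).enhance(random.uniform(0.5, 1.5))
    if mode == 3:                                  # non-uniform scaling
        fx, fy = random.uniform(0.8, 1.2), random.uniform(0.8, 1.2)
        return img.resize((int(img.width * fx), int(img.height * fy)))
    return img.rotate(random.choice([90, 180, 270]), expand=True)

augmented = augment(Image.new("RGB", (256, 256), "gray"))
```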
3.5 After learning and training the deep learning network model on the training set, training is complete and a deep learning network with the detection function is obtained.
Step 4, carrying out dynamic grid division on the remote sensing image to be detected to obtain a plurality of small grid images.
4.1 Dynamic grid division is performed according to the map container; as shown in FIG. 4, the coordinate system in the map container is the real coordinate system (generally a projected coordinate system).
The maxima and minima of the target area to be detected in the X and Y directions of the remote sensing image can be obtained from the coordinate system in the figure: $X_{max}$ is the maximum X coordinate of the target area to be detected, $X_{min}$ the minimum X coordinate, $Y_{max}$ the maximum Y coordinate, and $Y_{min}$ the minimum Y coordinate.
4.2 The actual size of the map-container loading area of the remote sensing image at a given level is determined from the target area to be detected.
The invention specifically determines the actual size of the map-container loading area of the remote sensing image at level 18: $X_{width}$ is the actual length of the container loading area at level 18 for the target area to be detected, and $Y_{height}$ is the actual width of the container loading area at level 18.
Target detection and positioning over a large area can be realized based on grid division, but if the tile level of the detected imagery is too high, precision improves while the number of grid divisions becomes excessive, greatly reducing efficiency and failing production requirements. It is therefore important to select a proper image level in the detection process, determined mainly by the proportion of the target's size within the image.
4.3 The target area to be detected of the remote sensing image is divided according to the overlap rates in the X and Y directions, forming m rows and n columns of small grids after division:

$$m=\left\lceil \frac{Y_{max}-Y_{min}}{Y_{height}\,(1-overlap_y)} \right\rceil,\qquad n=\left\lceil \frac{X_{max}-X_{min}}{X_{width}\,(1-overlap_x)} \right\rceil$$

where $\lceil\cdot\rceil$ denotes rounding up to an integer; $overlap_x$ is the overlap rate in the X direction, determined by the ratio of the target size to the length of the target area to be detected in the X direction, and $overlap_y$ is the overlap rate in the Y direction, determined by the ratio of the target size to the length of the target area to be detected in the Y direction.

The overlap rates $overlap_x$ and $overlap_y$ in the X and Y directions are expressed as:

$$overlap_x=\frac{l_{object}}{X_{width}},\qquad overlap_y=\frac{w_{object}}{Y_{height}}$$

where $l_{object}$ is the actual length of the target and $w_{object}$ is the actual width of the target.
4.4 After division, the coordinate maxima and minima of the small grids of the target area to be detected in the X and Y directions are:

$$\begin{aligned} x^{ij}_{min}&=X_{min}+(j-1)\,X_{width}\,(1-overlap_x), & x^{ij}_{max}&=x^{ij}_{min}+X_{width},\\ y^{ij}_{min}&=Y_{min}+(i-1)\,Y_{height}\,(1-overlap_y), & y^{ij}_{max}&=y^{ij}_{min}+Y_{height}, \end{aligned}$$

where $x^{ij}_{max}$ is the maximum X coordinate of the grid in row i, column j, $x^{ij}_{min}$ the minimum X coordinate, $y^{ij}_{max}$ the maximum Y coordinate, and $y^{ij}_{min}$ the minimum Y coordinate; i = 1, 2, …, m and j = 1, 2, …, n.
4.5 After each frame is calculated according to the formulas in step 4.4, the small grids are drawn according to the frame coordinates, as sketched below.
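A minimal sketch of steps 4.3-4.5 in Python; since the original formulas survive only as images, the overlap and stride expressions below are the reconstruction used above and should be read as an assumption:

```python
import math

def grid_divide(x_min, x_max, y_min, y_max,
                x_width, y_height, l_object, w_object):
    """Adaptive grid division: overlap rates from the target's actual size,
    then m x n overlapping grid frames covering the region."""
    overlap_x = l_object / x_width          # X-direction overlap rate
    overlap_y = w_object / y_height         # Y-direction overlap rate
    step_x = x_width * (1 - overlap_x)      # stride between adjacent grids
    step_y = y_height * (1 - overlap_y)
    m = math.ceil((y_max - y_min) / step_y) # rows ("upward integer")
    n = math.ceil((x_max - x_min) / step_x) # columns
    frames = []
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            gx = x_min + (j - 1) * step_x
            gy = y_min + (i - 1) * step_y
            frames.append((gx, gy, gx + x_width, gy + y_height))
    return m, n, frames

# Toy region: 10 km x 6 km, 1 km tiles, a 100 m x 80 m target.
m, n, frames = grid_divide(0, 10_000, 0, 6_000, 1_000, 1_000, 100, 80)
print(m, n, len(frames))
```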
Step 5, sequentially inputting the small grid images into the deep learning network model for target detection.
The grid of the target area to be detected is drawn according to step 4; under the action of the map control, the small grids automatically load the level-18 tile images of the remote sensing image to be detected, yielding the frame (i.e., position information) of each range to be detected; combining the tile images with the remote sensing image to be detected determines the detection and positioning of the target. That is, airplanes in the remote sensing image are detected and identified, and their specific positions are located. Finally, duplicate detections in the overlapping grid portions are filtered out by a spatial analysis method, for example as sketched below.
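The patent does not spell out the spatial analysis used for this filtering; a common choice, shown here as an assumed illustration, is IoU-based suppression of duplicate boxes coming from neighbouring grids:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def filter_duplicates(detections, thresh=0.5):
    """Keep the higher-scoring box when two detections from overlapping
    grids cover the same target. detections: [(box, score), ...]"""
    detections = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    for box, score in detections:
        if all(iou(box, k[0]) < thresh for k in kept):
            kept.append((box, score))
    return kept
```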
As noted above, grid division enables target detection and positioning over a large area, but an overly high tile level multiplies the number of grids and greatly reduces efficiency, while a low level sacrifices precision. To resolve this, a multi-level target detection process is designed: the target is first detected in low-level imagery, and then high-level judgment and accurate positioning are performed centered on the target to be detected; the whole process is shown in FIG. 5.
Experimental analysis:
a target detection service system under a B/S framework is developed and constructed based on an ArcGIS API for JavaScript component, a TensorFlow deep learning framework and a Django framework, and large-range target detection is realized through client operation. The server-side operating system is Ubuntu16.04, 16G memory and is configured with GTX1080Ti video card. Test comparative analysis was performed using 3420 test sets in step 1.
Firstly, comparing the training precision of the SSD-based deep learning network model with that of the Faster R-CNN-based deep learning network model
The SSD adopts an Inception v3 network for feature extraction, whose network parameters are slightly more complex than ResNet-50's. The invention trains each of the four anchor generation mechanisms and verifies precision on the test set.
For precision verification, the average precision (AP) of each category is calculated on the test set, and the mean average precision (mAP) is taken as the measure of the model's training result. Briefly, a recall-precision curve can be plotted for each category; AP is the area under that curve, and mAP is the average of the APs over multiple categories.
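A minimal sketch of that computation, assuming detections have already been matched to ground truth; the all-point interpolation shown is one common AP convention, and the patent does not state which variant it uses:

```python
import numpy as np

def average_precision(scores, is_true_positive, num_ground_truth):
    """Area under the precision-recall curve for one category."""
    order = np.argsort(-np.asarray(scores))
    tp = np.asarray(is_true_positive, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    recall = cum_tp / num_ground_truth
    precision = cum_tp / np.arange(1, len(tp) + 1)
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recall, precision):   # integrate precision over recall
        ap += (r - prev_r) * p
        prev_r = r
    return ap

ap_plane = average_precision([0.9, 0.8, 0.6], [1, 0, 1], num_ground_truth=2)
ap_field = average_precision([0.95, 0.7], [1, 1], num_ground_truth=2)
print((ap_plane + ap_field) / 2)   # mAP over the two categories
```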
The test results of the two models and their corresponding parameter configurations are compared in Table 2:
TABLE 2
[Table 2 compares the test precision of the SSD-based and Faster R-CNN-based models under their parameter configurations; it appears only as an image in the original document.]
The comparative analysis of Table 2 shows that the mean detection accuracy for airplanes is lower than for track and field grounds, because airplane targets are generally smaller: before detection starts, the model must resize the image, so the features of many small airplanes disappear and such targets are likely to be missed or falsely detected. The target detection precision of the Faster R-CNN-based deep learning network model is superior to the SSD framework's, with a particularly clear advantage on airplane targets, showing that the Faster R-CNN framework's small-target detection outperforms the SSD framework's.
Secondly, comparing the performance of the four different anchor generation mechanisms of Faster R-CNN
FIG. 6 shows the loss function training behavior under the four anchor mechanisms. The total loss converges for all four; after 400,000 training steps the total loss of every mechanism falls below 0.1, and the RPN loss in particular is kept small, indicating a good training effect. The loss functions of anchor mechanism 3 fluctuate the least, so its training effect is the best.
Table 2 shows that all four Faster R-CNN generation mechanisms reach an mAP of nearly 90%. Comparing anchor mechanism 4 with anchor mechanism 1: mechanism 4 adds a smaller area scaling factor and its mAP is better, indicating that smaller target candidate frames improve the detection precision of small targets. Comparing anchor mechanism 3 with anchor mechanism 4: mechanism 3 adds an aspect ratio coefficient of 0.25 relative to mechanism 4, and the mAP values are comparable, indicating that for this data set adding aspect ratio coefficients does not raise precision. Comparing anchor mechanism 3 with anchor mechanism 2: their area scaling factors are consistent and their aspect ratio coefficients are equal in number but different in value, and the final mAP values differ, indicating that suitable coefficients are needed to ensure precision.
From the perspective of the frame prediction accuracy of the target candidate frames, anchor mechanisms 2 and 3 are comparable and superior to mechanisms 1 and 4, mainly because adding aspect ratio coefficients gives a better effect when predicting targets with a large aspect ratio (such as a track and field ground); FIGS. 7-a through 7-d show detection results at several different target scales. Therefore, combining the mAP values and the frame prediction effect, anchor mechanism 3 is selected as the anchor mechanism of the Faster R-CNN target detection framework.
Thirdly, comparing detection precision and efficiency at three image levels (17, 18, 19)
All airplane targets at Xinzheng Airport were detected based on the dynamic grid division mode, over a detection range of 21.5 square kilometers, comparing detection accuracy and efficiency at the three image levels (17, 18, and 19). The grid divisions are shown in FIGS. 8-a, 8-b, and 8-c; the overlap between grids is generated dynamically from the target size and the detection level. The detection results are counted and recognition quality judged by accuracy and recall: recall is the number of correctly identified airplanes divided by the total number of airplanes in the test image, and accuracy is the number of correctly identified airplanes divided by the number of detections the model identified as airplanes; an identification is judged correct when IoU (the ratio of the intersection area of the detection frame and the real frame to their union area) exceeds 0.5. The statistical results are shown in Table 3, where detection time is the total of image tile loading, network transmission, detection execution, and result feedback.
TABLE 3
[Table 3 reports recall, accuracy, and detection time at image levels 17, 18, and 19; it appears only as an image in the original document.]
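As a concrete reading of the recall and accuracy definitions above, the sketch below scores a set of detections against ground-truth boxes; it reuses the iou helper from the duplicate-filtering sketch earlier, and the 0.5 threshold is the one stated in the text:

```python
def score_detections(detections, ground_truth, iou_thresh=0.5):
    """Return (recall, accuracy) for one test image.
    detections / ground_truth: lists of (x1, y1, x2, y2) boxes.
    Assumes the iou() helper defined in the filtering sketch above."""
    matched = set()
    correct = 0
    for det in detections:
        for k, gt in enumerate(ground_truth):
            if k not in matched and iou(det, gt) > iou_thresh:
                matched.add(k)      # each real airplane matched at most once
                correct += 1
                break
    recall = correct / len(ground_truth) if ground_truth else 0.0
    accuracy = correct / len(detections) if detections else 0.0
    return recall, accuracy

r, a = score_detections([(0, 0, 10, 10), (50, 50, 60, 60)],
                        [(1, 1, 11, 11), (100, 100, 120, 120)])
print(r, a)   # 0.5, 0.5
```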
As Table 3 shows, although the training sample set contains no high-resolution image data, good results are still achieved. Detection efficiency is highest at level 17, but because the airplane targets occupy too small a proportion of the image, the resize operation required before the image enters the feature extraction network makes small airplane target features disappear, which is the main reason for the low recognition rate. At level 18, detection efficiency is lower than at level 17 but precision improves greatly. Precision is highest at level 19, with a recall of 97.6% and an accuracy of 95%, but the number of divided grids is so large that efficiency drops sharply. Therefore it is very important to select an appropriate detection image level according to the size of the detection target. In the experiment, level 18 is selected as the grid detection image level, and the frame of the detected target is accurately positioned at level 19 using the accurate positioning method designed herein. FIGS. 9-a and 9-b show the detection bounding boxes at the two levels; the position accuracy of the level-19 image borders is clearly better than that of the level-18 borders, achieving an accurate positioning effect.
In general, the Faster R-CNN target detection framework has higher detection precision; a reasonable data enhancement scheme enriches the training sample set, and the anchor generation mechanism designed herein can realize small-target detection. However, if the resolution of the detected imagery is low, the recall rate still remains low, so selecting a suitable image resolution for detection is also important in balancing efficiency and precision. Experiments verify that the method designed in this invention, based on dynamic grid division and multi-level imagery, can realize target detection and accurate frame positioning over a large area, and offers a useful reference for fast remote sensing target retrieval based on deep learning and its application in industrial production.
Target detection system embodiment:
the target detection system of the remote sensing image comprises a memory, a processor and a computer program which is stored in the memory and can be run on the processor, wherein the processor realizes the target detection method when running the computer program, and the process of the target detection method is described in detail in the embodiment and is not described herein again.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit its protection scope. Although the present application is described in detail with reference to the above embodiments, those skilled in the art should understand that various changes, modifications, or equivalents of the embodiments can be made after reading the present application, and these fall within the protection scope of the claims of the present invention.

Claims (10)

1. A target detection method for remote sensing images, characterized by comprising the following steps:
(1) training to obtain a deep learning network;
(2) acquiring the size of a target area to be detected and the actual size of a target;
(3) dividing the target area to be detected according to the proportional relation between the size of the target area to be detected and the actual size of the target to obtain a plurality of small grids;
(4) sequentially inputting all the small grid images obtained in step (3) into the deep learning network obtained in step (1) for target detection, finally obtaining a detection result.
2. The target detection method for remote sensing images according to claim 1, wherein the target area to be detected in step (3) is divided into m rows and n columns of small grids:

$$m=\left\lceil \frac{Y_{max}-Y_{min}}{Y_{height}\,(1-overlap_y)} \right\rceil,\qquad n=\left\lceil \frac{X_{max}-X_{min}}{X_{width}\,(1-overlap_x)} \right\rceil$$

where $\lceil\cdot\rceil$ denotes rounding up to an integer; $X_{max}$ is the maximum X coordinate of the target area to be detected, $X_{min}$ the minimum X coordinate, $Y_{max}$ the maximum Y coordinate, and $Y_{min}$ the minimum Y coordinate; $X_{width}$ is the actual length of the map-container loading area at a given image level for the target area to be detected, and $Y_{height}$ its actual width; $overlap_x$ is the overlap rate in the X direction, determined by the ratio of the target size to the length of the target area to be detected in the X direction, and $overlap_y$ is the overlap rate in the Y direction, determined by the ratio of the target size to the length of the target area to be detected in the Y direction.
3. The target detection method for remote sensing images according to claim 2, wherein the overlap rate in the X direction and the overlap rate in the Y direction are:

$$overlap_x=\frac{l_{object}}{X_{width}},\qquad overlap_y=\frac{w_{object}}{Y_{height}}$$

where $l_{object}$ is the actual length of the target and $w_{object}$ is the actual width of the target.
4. The target detection method for remote sensing images according to claim 3, wherein the coordinates of the divided small grids are expressed as

$$\begin{aligned} x^{ij}_{min}&=X_{min}+(j-1)\,X_{width}\,(1-overlap_x), & x^{ij}_{max}&=x^{ij}_{min}+X_{width},\\ y^{ij}_{min}&=Y_{min}+(i-1)\,Y_{height}\,(1-overlap_y), & y^{ij}_{max}&=y^{ij}_{min}+Y_{height}, \end{aligned}$$

where $x^{ij}_{max}$ is the maximum X coordinate of the grid in row i, column j, $x^{ij}_{min}$ the minimum X coordinate, $y^{ij}_{max}$ the maximum Y coordinate, and $y^{ij}_{min}$ the minimum Y coordinate; i = 1, 2, …, m and j = 1, 2, …, n.
5. The target detection method for remote sensing images according to claim 2, wherein the image level is 18.
6. The target detection method for remote sensing images according to any one of claims 1-5, wherein the deep learning network comprises a Faster R-CNN network and an RPN network.
7. The target detection method for remote sensing images according to claim 6, wherein the Faster R-CNN network comprises a deep residual network, and the deep residual network has 50 layers.
8. The target detection method for remote sensing images according to claim 7, wherein the RPN network outputs rectangular candidate frames by convolution.
9. The target detection method for remote sensing images according to claim 8, wherein the generation mechanism of rectangular candidate frames in the RPN network is as follows: the number of rectangular candidate frames is determined by the number of area scaling factors and the number of aspect ratios of the training sample images.
10. A target detection system for remote sensing images, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the target detection method for remote sensing images according to any one of claims 1-9 when executing the computer program.
CN201911071646.8A 2019-11-05 2019-11-05 Target detection method and system for remote sensing image Active CN110826485B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911071646.8A CN110826485B (en) 2019-11-05 2019-11-05 Target detection method and system for remote sensing image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911071646.8A CN110826485B (en) 2019-11-05 2019-11-05 Target detection method and system for remote sensing image

Publications (2)

Publication Number Publication Date
CN110826485A true CN110826485A (en) 2020-02-21
CN110826485B CN110826485B (en) 2023-04-18

Family

ID=69552492

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911071646.8A Active CN110826485B (en) 2019-11-05 2019-11-05 Target detection method and system for remote sensing image

Country Status (1)

Country Link
CN (1) CN110826485B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112560933A (en) * 2020-12-10 2021-03-26 中邮信息科技(北京)有限公司 Model training method and device, electronic equipment and medium
CN115063428A (en) * 2022-08-18 2022-09-16 中国科学院国家空间科学中心 Spatial dim small target detection method based on deep reinforcement learning

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180089530A1 (en) * 2015-05-11 2018-03-29 Siemens Healthcare Gmbh Method and system for landmark detection in medical images using deep neural networks
WO2019144575A1 (en) * 2018-01-24 2019-08-01 中山大学 Fast pedestrian detection method and device
CN108304873A (en) * 2018-01-30 2018-07-20 深圳市国脉畅行科技股份有限公司 Object detection method based on high-resolution optical satellite remote-sensing image and its system
CN108427912A (en) * 2018-02-05 2018-08-21 西安电子科技大学 Remote sensing image object detection method based on the study of dense target signature
CN108460341A (en) * 2018-02-05 2018-08-28 西安电子科技大学 Remote sensing image object detection method based on integrated depth convolutional network
CN109344774A (en) * 2018-10-08 2019-02-15 国网经济技术研究院有限公司 Heat power station target identification method in remote sensing image
CN109800637A (en) * 2018-12-14 2019-05-24 中国科学院深圳先进技术研究院 A kind of remote sensing image small target detecting method
CN109919108A (en) * 2019-03-11 2019-06-21 西安电子科技大学 Remote sensing images fast target detection method based on depth Hash auxiliary network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
孙梓超; 谭喜成; 洪泽华; 董华萍; 沙宗尧; 周松涛; 杨宗亮: "Target detection in remote sensing imagery based on deep convolutional neural networks" (基于深度卷积神经网络的遥感影像目标检测) *


Also Published As

Publication number Publication date
CN110826485B (en) 2023-04-18

Similar Documents

Publication Publication Date Title
CN107871119B (en) Target detection method based on target space knowledge and two-stage prediction learning
CN109801293B (en) Remote sensing image segmentation method and device, storage medium and server
CN108460382B (en) Optical remote sensing image ship detection method based on deep learning single-step detector
CN111563473B (en) Remote sensing ship identification method based on dense feature fusion and pixel level attention
CN107665498B (en) Full convolution network aircraft detection method based on typical example mining
CN109492561B (en) Optical remote sensing image ship detection method based on improved YOLO V2 model
CN111179217A (en) Attention mechanism-based remote sensing image multi-scale target detection method
CN110796048B (en) Ship target real-time detection method based on deep neural network
CN106228125B (en) Method for detecting lane lines based on integrated study cascade classifier
CN108304761A (en) Method for text detection, device, storage medium and computer equipment
CN111914924B (en) Rapid ship target detection method, storage medium and computing equipment
CN109242019B (en) Rapid detection and tracking method for optical small target on water surface
CN110826485B (en) Target detection method and system for remote sensing image
CN112766184B (en) Remote sensing target detection method based on multi-level feature selection convolutional neural network
CN110674674A (en) Rotary target detection method based on YOLO V3
CN111144234A (en) Video SAR target detection method based on deep learning
CN116563726A (en) Remote sensing image ship target detection method based on convolutional neural network
CN114565824A (en) Single-stage rotating ship detection method based on full convolution network
CN110633727A (en) Deep neural network ship target fine-grained identification method based on selective search
CN114241314A (en) Remote sensing image building change detection model and algorithm based on CenterNet
CN113486819A (en) Ship target detection method based on YOLOv4 algorithm
CN113628180A (en) Semantic segmentation network-based remote sensing building detection method and system
CN110084203B (en) Full convolution network airplane level detection method based on context correlation
CN110363792A (en) A kind of method for detecting change of remote sensing image based on illumination invariant feature extraction
CN112001388B (en) Method for detecting circular target in PCB based on YOLOv3 improved model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant