CN113052188A - Method, system, equipment and storage medium for detecting remote sensing image target - Google Patents


Info

Publication number
CN113052188A
CN113052188A (application CN202110327967.0A)
Authority
CN
China
Prior art keywords
feature
feature map
target
pixel
information
Prior art date
Legal status
Pending
Application number
CN202110327967.0A
Other languages
Chinese (zh)
Inventor
颜蕾 (Yan Lei)
卢湖川 (Lu Huchuan)
Current Assignee
Dalian Institute Of Artificial Intelligence Dalian University Of Technology
Dalian University of Technology
Original Assignee
Dalian Institute Of Artificial Intelligence Dalian University Of Technology
Dalian University of Technology
Priority date
Filing date
Publication date
Application filed by Dalian Institute Of Artificial Intelligence Dalian University Of Technology, Dalian University of Technology filed Critical Dalian Institute Of Artificial Intelligence Dalian University Of Technology
Priority to CN202110327967.0A priority Critical patent/CN113052188A/en
Publication of CN113052188A publication Critical patent/CN113052188A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G06V 10/40 Extraction of image or video features
    • G06F 18/2415 Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus false rejection rate
    • G06F 18/253 Fusion techniques of extracted features
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods
    • G06V 2201/07 Target detection

Abstract

The invention belongs to the field of target detection in image processing, and in particular provides a remote sensing image target detection method. The method preprocesses remote sensing image data and performs conventional data augmentation; extracts multi-scale feature maps with a ResNet residual network, fuses the multi-scale features in a cross-channel information fusion mode suited to the target characteristics, and enhances the semantic information and feature richness of the features to obtain fused multi-scale feature maps; introduces an attention mechanism on the fused feature maps to generate a probability saliency map, weakening redundant background information in the remote sensing image and enhancing the saliency of the target; and introduces the position information of each key point of the detection frame after the first regression, reconstructs a feature map carrying the position information, and performs the final multi-class classification and positioning prediction. Starting from the target characteristics of remote sensing images, the method can handle the small target sizes, complex background information and inaccurate positioning found in remote sensing images.

Description

Method, system, equipment and storage medium for detecting remote sensing image target
Technical Field
The embodiment of the invention relates to the technical field of image processing, in particular to a method, a system, equipment and a storage medium for detecting a remote sensing image target.
Background
In image processing, with the continuous development of deep learning, convolutional neural network methods have been applied with good results. The target detection task is an important part of the deep learning field; applied to remote sensing images, it is of great significance for sea area monitoring, city planning, resource surveying, marine traffic monitoring, water area supervision, territorial security protection and the like.
The main task of target detection is to classify and localize the target objects appearing in an image. Detection methods are mainly divided into two categories: Anchor-based and Anchor-free.
Anchor-based methods fall into two classes: two-stage detection and single-stage detection. A two-stage method first feeds the image into a convolutional neural network to extract features and obtain a feature map; a small RPN network then presets prior frames (Anchors), performs target/background binary classification and coarse position adjustment on them, and the fine-tuned frames together with the feature map are sent to a subsequent network that extracts the feature vectors inside each frame, performs multi-class classification and position adjustment, and reduces the loss through continuous training to obtain the final detection result. With two position-adjustment passes, two-stage methods localize targets accurately but are time-consuming and not fast enough. Single-stage methods likewise preset prior frames but perform only one position-adjustment pass; although detection is fast, the precision is lower and the localization is less accurate.
Anchor-free methods, in contrast, need no preset prior frame (Anchor); they directly detect the key points of a target in the image, locate the target by matching the key points, and judge its category. For example, the CornerNet paper 'CornerNet: Detecting Objects as Paired Keypoints', published by Hei Law, Jia Deng et al. in 2018, detects the probability of each point being a top-left or bottom-right corner of a target and matches corner points through a matching mechanism to obtain the detection frame position; in 2019, Kaiwen Duan, Song Bai, Lingxi Xie et al. published 'CenterNet: Keypoint Triplets for Object Detection' at CVPR, which adds center-point detection besides the top-left and bottom-right corner points and further improves detection precision. Because key-point detection methods have no preset prior frame, they require fewer hyper-parameter settings and also alleviate the positive/negative sample imbalance introduced by prior frames (Anchors), but their detection performance still needs further improvement.
Although the target detection technology has achieved good results in the natural scene image, the direct application of the target detection technology to the remote sensing image has certain problems, and the problems are caused by the fact that the characteristics of the target in the natural scene image are inconsistent with those in the remote sensing image. Firstly, the shooting visual angle of the remote sensing image is a top view angle from top to bottom, the resolution of the image is generally large, and the proportion of some types of targets such as vehicles in the image is small, so that certain difficulty is caused to detection; secondly, part of targets such as ships may be densely arranged together, which greatly affects the detection precision; in addition, background information of the remote sensing image is complex, a plurality of unimportant redundant information exists in the image, and the network can generate unnecessary response to the redundant information in the detection process, so that important and interested area information is weakened; finally, the target in the remote sensing image is generally changeable in direction, cannot be detected only by depending on a common horizontal frame, and needs to be positioned by adopting a rotating frame detection method according to the direction changeability of the target.
Disclosure of Invention
An object of the embodiments of the present invention is to provide a method, a system, a device and a storage medium for detecting a remote sensing image target, so as to solve the problems in the background art.
In order to achieve the above purpose, the embodiments of the present invention provide the following technical solutions:
a method for detecting a remote sensing image target comprises the following steps:
inputting an image to be processed into a residual network and extracting feature maps at 4 different scales; upsampling the lower-scale (deeper) features to the size of the previous layer's feature map, dividing both that upsampled map and the previous layer's feature map into two halves along the channel dimension, fusing each corresponding half by pixel-by-pixel addition, concatenating the fused halves along the channel dimension, and sending the result to the subsequent network; 4 fused features at different scales are thus obtained, and classification and positioning are performed at each scale;
inputting the fused feature map into an attention mechanism module, obtaining a probability significant feature map through convolution operation, and multiplying the probability significant feature map by the feature of each channel of the original fused feature map pixel by pixel to obtain a target significant feature;
presetting prior frames with different sizes on each target salient feature map with different sizes; performing secondary classification and regression on the prior frame, and positioning the regression to obtain a fine-tuned detection frame; before the second classification and regression, adjusting the information in the fine-tuned detection frame, fusing the surrounding position feature information in a bilinear interpolation mode, and introducing the information into a new feature map;
and combining the new feature graph and the coordinates of the fine-tuned detection frame, and performing multi-class classification and secondary regression to obtain a final prediction frame.
As a further limitation of the technical solution of the embodiment of the present invention, before the step of inputting the image into the ResNet network to extract 4 feature maps with different sizes, the method further includes:
and preprocessing the image to be processed, wherein the preprocessing mode comprises at least one of turning, translation, color transformation and Mixup.
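As an illustration of the Mixup augmentation named above, the following is a minimal NumPy sketch under common assumptions (the mixing coefficient is drawn from a Beta distribution, as is usual for Mixup; the function name `mixup` and the parameter `alpha` are illustrative, not taken from the patent):

```python
import numpy as np

def mixup(img_a, img_b, label_a, label_b, alpha=0.5, rng=None):
    """Blend two images and their (one-hot) labels with a Beta-sampled weight."""
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)  # mixing coefficient in (0, 1)
    mixed_img = lam * img_a + (1.0 - lam) * img_b
    mixed_label = lam * label_a + (1.0 - lam) * label_b
    return mixed_img, mixed_label

# example: blend two 2x2 single-channel "images" and their one-hot labels
a, b = np.ones((2, 2)), np.zeros((2, 2))
img, lab = mixup(a, b, np.array([1.0, 0.0]), np.array([0.0, 1.0]),
                 rng=np.random.default_rng(0))
```

The other listed augmentations (flip, translation, color transform) are standard pixel-level operations and combine freely with this one.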
As a further limitation of the technical solution of the embodiment of the present invention, the above and the previous layer feature map are equally divided into two parts according to the channel, and the step of performing pixel-by-pixel addition and fusion on each part specifically includes:
dividing feature maps with different sizes into two parts according to channels;
fusing the first half channel information of the shallow features and the first half channel information of the deep features in a pixel-by-pixel addition mode;
fusing the latter half channel information of the shallow layer characteristic and the latter half channel information of the deep layer characteristic in a pixel-by-pixel addition mode;
and obtaining a fused feature map finally used for identification and positioning by splicing the two fused feature maps according to a channel.
As a further limitation of the technical solution of the embodiment of the present invention, the specific step of obtaining the probability saliency map through convolution operation includes:
adopting convolution kernels with two different sizes to respectively carry out further feature extraction on the feature maps to obtain feature maps with different receptive field sizes, and combining the feature maps together according to channels;
obtaining a probability significant feature map of two channels by convolution of 1x1, and adopting the following formula:
F_prob = Conv_1×1(concat[Conv_3×3(F), Conv_5×5(F)])
where F is the fused feature map, F_prob is the probability saliency map, Conv_·(·) denotes a convolution operation whose subscript is the convolution kernel size, and concat[·] denotes concatenation along the channel dimension.
As a further limitation of the technical solution of the embodiment of the present invention, the step of multiplying the feature of each channel of the original fusion feature map by pixels to obtain the target significant feature specifically includes:
the channel corresponding to the foreground probability is taken out and its probability values are multiplied pixel by pixel with the original feature map; after the product, pixel values at non-target points are reduced while pixel values at target points are enhanced, so that the boundary of the target becomes clearer than before, thereby weakening complex background information; the following formula is adopted:
F′_c(h, w) = P(h, w) × F_c(h, w), h = 1, …, H, w = 1, …, W
where P is the foreground-probability channel of F_prob, F′ is the result of multiplying the probability saliency map with the corresponding position pixels of the fused feature map, c indexes the channels, H is the height of the feature map, and W is its width.
As a further limitation of the technical solution of the embodiment of the present invention, the step of fusing the surrounding position feature information by a bilinear interpolation mode includes:
the five main feature points of the detection frame, namely its four corner points and center point, are recalculated by bilinear interpolation with the formula:
f(x, y) = f(x1, y1)×(x2−x)×(y2−y) + f(x2, y1)×(x−x1)×(y2−y) + f(x1, y2)×(x2−x)×(y−y1) + f(x2, y2)×(x−x1)×(y−y1)
where (x, y) are the coordinates of the interpolation center point; (x1, y1), (x2, y1), (x1, y2), (x2, y2) are the position coordinates of the four nearest points around the interpolation center, i.e. its top-left, top-right, bottom-left and bottom-right neighbors; and f(·, ·) is the feature pixel value of the fused feature map at the given position.
As a further limitation of the technical solution of the embodiment of the present invention, the step of introducing to the new feature map includes:
carrying out weight distribution on the new feature vectors of the five feature points, calculating the weighted sum of the new feature vectors, replacing the original feature vectors, introducing the position information into the new feature vectors for subsequent regression, and adopting the following formula:
fv = ρ×fA + ρ×fB + ρ×fC + ρ×fD + (1−4ρ)×fE
where fA, fB, fC, fD, fE are the new feature vectors of the five feature points, ρ is the weight of the four corner points, and fv is the feature vector after replacement.
A remote sensing image target detection system, the system comprising:
the cross-channel feature fusion module is used for inputting the image to be processed into a residual network and extracting feature maps at 4 different scales; upsampling the lower-scale (deeper) features to the size of the previous layer's feature map, dividing both that upsampled map and the previous layer's feature map into two halves along the channel dimension, fusing each corresponding half by pixel-by-pixel addition, concatenating the fused halves along the channel dimension, and sending the result to the subsequent network; 4 fused features at different scales are thus obtained, and classification and positioning are performed at each scale;
the redundant information removing module is used for inputting the fused feature map into the attention mechanism module, obtaining a probability saliency map through convolution operations, and multiplying it pixel by pixel with the features of each channel of the original fused feature map to obtain the target salient features;
the position information correction module is used for presetting prior frames with different sizes on each target salient feature map with different sizes; performing secondary classification and regression on the prior frame, and positioning the regression to obtain a fine-tuned detection frame; before the second classification and regression, adjusting the information in the fine-tuned detection frame, fusing the surrounding position feature information in a bilinear interpolation mode, and introducing the information into a new feature map;
and the positive and negative sample screening module is used for combining the new feature map and the finely adjusted detection frame coordinates to perform multi-class classification and secondary regression to obtain a final prediction frame.
An apparatus comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the method for object detection in remotely sensed images when executing the computer program.
A storage medium storing a computer program which, when executed by a processor, implements the steps of the method for object detection in remotely sensed images.
Compared with the prior art, the embodiment of the invention has the beneficial effects that:
the method has the advantages that redundant irrelevant background information is weakened according to the target characteristics of the remote sensing image, the detection performance of the small target is improved, the position information of the target is processed in the forward propagation process and is introduced into the corresponding characteristic diagram, the positioning performance of the target detection technology is improved, and the detection result can be well applied to the subsequent other image processing fields or engineering application.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention.
Fig. 1 is a network overall structure diagram of a remote sensing image target detection method.
FIG. 2 is a cross-channel information fusion structure diagram in the remote sensing image target detection method.
FIG. 3 is a structural diagram of an attention mechanism in a remote sensing image target detection method.
FIG. 4 is a flow chart of steps of a method for detecting a target in a remote sensing image.
Detailed Description
In order to make the technical problems, technical solutions and advantageous effects to be solved by the present invention more clearly apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention belongs to the technical field of target detection in image processing, relates to application of target detection on remote sensing images, and relates to relevant knowledge of image processing and deep learning. Firstly, preprocessing remote sensing image data and performing conventional data augmentation; then, extracting a multi-scale feature map by adopting a ResNet residual network, fusing the multi-scale features by adopting a cross-channel information fusion mode according to target characteristics, and enhancing semantic information and feature richness of the features to obtain the fused multi-scale feature map; secondly, an attention mechanism is introduced into the fused feature map to generate a probability significant feature map, redundant background information in the remote sensing image is weakened, and the significance of the target is enhanced; and finally, introducing the position information of each key point of the detection frame after the first regression, reconstructing a feature map with the position information, and performing final multi-class classification and positioning prediction. The method has the advantages that the method can process the conditions of small target size, complex background information and inaccurate positioning in the remote sensing image from the target characteristics of the remote sensing image, and can be applied to other technical researches and engineering applications in the field of remote sensing images.
Fig. 1 shows a network overall structure diagram of a remote sensing image target detection method. The network overall structure comprises: the system comprises a ResNet residual error network, a feature fusion module, an attention mechanism module, a class + box sub-module and a position information adjusting module.
The technical problem to be solved by the invention is as follows: for the situation that targets are small and arranged densely in the remote sensing image, the detection precision is improved; moreover, the invention can process complex background information and enhance the detection effect by weakening irrelevant redundant information; in addition, the position information is introduced into the original characteristic diagram, the positioning capability of the target is enhanced, and the division of positive and negative samples in the network training process is optimized.
As shown in fig. 4, in the step flow chart of the remote sensing image target detection method provided in the embodiment of the present invention, the implementation flow includes: inputting an image; preprocessing a picture; extracting a multi-scale characteristic diagram by using a ResNet residual error network; performing cross-channel feature fusion to obtain a fused feature map; reducing background signal interference through an attention mechanism; classifying and performing primary regression; correcting the position information to obtain a characteristic diagram containing position reconstruction; multi-class classification and second regression.
Specifically, the method comprises the following steps:
inputting an image to be processed into a residual network and extracting feature maps at 4 different scales; upsampling the lower-scale (deeper) features to the size of the previous layer's feature map, dividing both that upsampled map and the previous layer's feature map into two halves along the channel dimension, fusing each corresponding half by pixel-by-pixel addition, concatenating the fused halves along the channel dimension, and sending the result to the subsequent network; 4 fused features at different scales are thus obtained, and classification and positioning are performed at each scale;
inputting the fused feature map into an attention mechanism module, obtaining a probability significant feature map through convolution operation, and multiplying the probability significant feature map by the feature of each channel of the original fused feature map pixel by pixel to obtain a target significant feature;
presetting prior frames with different sizes on each target salient feature map with different sizes; performing secondary classification and regression on the prior frame, and positioning the regression to obtain a fine-tuned detection frame; before the second classification and regression, adjusting the information in the fine-tuned detection frame, fusing the surrounding position feature information in a bilinear interpolation mode, and introducing the information into a new feature map;
and combining the new feature graph and the coordinates of the fine-tuned detection frame, and performing multi-class classification and secondary regression to obtain a final prediction frame.
Cross-channel feature fusion. Feature fusion is a common method for improving the detection effect of small targets, and the method is widely applied to many fields of image processing, such as target detection, target segmentation, target classification and the like. In the process of feature extraction, down-sampling is usually performed through a multilayer convolutional neural network, and the feature size of a shallow layer is large and generally comprises edge and texture information of an image; deep features are relatively small in size and generally contain richer semantic information. Therefore, the two kinds of information are fused together, and compared with the single method which only adopts a certain layer of characteristics, the detection effect is better.
In addition, for the feature maps with the same size, the feature information contained in different channels is different, namely, a cross-channel feature fusion mode is adopted, the feature maps with different sizes are firstly divided into two parts according to the channels, the channel information of the first half part of the shallow feature and the channel information of the first half part of the deep feature are fused in a pixel-by-pixel addition mode, and the channel information of the second half part of the shallow feature and the channel information of the second half part of the deep feature are fused in the same mode. And then, obtaining a fused feature map finally used for identification and positioning by splicing the two fused feature maps according to a channel.
Compared with a common feature fusion method, the cross-channel fusion method can meet the combination of texture edge information and semantic information, and can integrate various information of different channels, thereby enriching the diversity of target feature information.
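A minimal NumPy sketch of the cross-channel fusion just described, under stated assumptions (nearest-neighbour upsampling stands in for whatever resizing the network actually uses, and the function name `cross_channel_fuse` is illustrative, not from the patent):

```python
import numpy as np

def cross_channel_fuse(shallow, deep):
    """Fuse a shallow feature map (C, H, W) with a deeper one (C, H/2, W/2).

    The deep map is upsampled to (H, W), both maps are split into two
    halves along the channel axis, corresponding halves are added
    pixel by pixel, and the two fused halves are concatenated again.
    """
    # nearest-neighbour upsampling of the deep map by a factor of 2
    up = deep.repeat(2, axis=1).repeat(2, axis=2)
    c = shallow.shape[0] // 2
    front = shallow[:c] + up[:c]   # first-half channels fused
    back = shallow[c:] + up[c:]    # second-half channels fused
    return np.concatenate([front, back], axis=0)

shallow = np.ones((4, 8, 8))       # toy shallow feature map
deep = np.full((4, 4, 4), 2.0)     # toy deep feature map, half the resolution
fused = cross_channel_fuse(shallow, deep)
```

In a real network the halves would pass through further per-half convolutions before concatenation; the sketch only shows the split, add and concatenate pattern of the text.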
The attention mechanism removes redundant information. Because remote sensing images are shot from a top-down view, the objects on the ground are very complex and interfere with the target to be detected; adopting an attention mechanism makes the target boundary clearer and weakens redundant background information.
As shown in fig. 3, firstly, two convolution kernels with different sizes are used to perform further feature extraction on the feature maps respectively, so as to obtain feature maps with different receptive field sizes, and the feature maps are combined together according to channels. Then, the probability significant feature map of the two channels is obtained through convolution of 1x1, and represents the probability that each pixel point on the original feature map is the target.
F_prob = Conv_1×1(concat[Conv_3×3(F), Conv_5×5(F)])
where F is the feature map fused in step (1) and F_prob is the probability saliency map; Conv_·(·) denotes a convolution operation whose subscript is the convolution kernel size, and concat[·] denotes concatenation along the channel dimension.
As shown in fig. 2, in this embodiment, a channel corresponding to the foreground probability is taken out, the probability value is multiplied by the original feature map by pixels, the pixel value of the non-target point is reduced after the probability product, and the pixel value of the target point is increased after the probability product, so that the boundary of the target is clearer than before, thereby weakening the complex background information.
F′_c(h, w) = P(h, w) × F_c(h, w), h = 1, …, H, w = 1, …, W
where P is the foreground-probability channel of F_prob, F′ is the result of multiplying the probability saliency map with the corresponding position pixels of the fused feature map, c indexes the channels, H is the height of the feature map, and W is its width.
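The background-suppression step can be sketched as follows. This is an illustrative NumPy sketch, not the patent's implementation: the two-channel map is taken as given (in the patent it comes from the 1x1 convolution above), a softmax over its two channels is assumed to turn them into background/foreground probabilities, and channel 1 is assumed to be the foreground channel:

```python
import numpy as np

def suppress_background(fused, logits):
    """Weight each channel of `fused` (C, H, W) by the foreground
    probability derived from a two-channel logit map (2, H, W)."""
    e = np.exp(logits - logits.max(axis=0, keepdims=True))
    probs = e / e.sum(axis=0, keepdims=True)   # softmax over the 2 channels
    foreground = probs[1]                       # channel 1 = foreground (assumed)
    return fused * foreground[None, :, :]       # pixel-wise product per channel

fused = np.ones((3, 2, 2))                      # toy fused feature map
logits = np.zeros((2, 2, 2))
logits[1, 0, 0] = 10.0                          # one strongly "foreground" pixel
out = suppress_background(fused, logits)
```

Pixels with high foreground probability keep (nearly) their original value, while background pixels are scaled down, which is exactly the suppression effect described in the text.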
And (5) correcting the position information. In the two-step detection method, there is usually a quadratic regression process for the detection frame to refine the predicted position of the target. In the second regression, the adopted feature vector is still the feature vector of the region of interest before the first regression, so that the problem of feature mismatching is caused, and the predicted position is not accurate enough.
The invention recalculates the five main feature points of the detection frame, namely its four corner points and center point, by bilinear interpolation:
f(x, y) = f(x1, y1)×(x2−x)×(y2−y) + f(x2, y1)×(x−x1)×(y2−y) + f(x1, y2)×(x2−x)×(y−y1) + f(x2, y2)×(x−x1)×(y−y1)
where (x, y) are the coordinates of the interpolation center point; (x1, y1), (x2, y1), (x1, y2), (x2, y2) are the position coordinates of the four nearest points around the interpolation center, i.e. its top-left, top-right, bottom-left and bottom-right neighbors; and f(·, ·) is the feature pixel value of the fused feature map at the given position. The closer a point is to the interpolation center, the higher its weight and the greater its pixel value's contribution; the farther away, the smaller its weight and the lower its contribution. Adjusting the feature information by bilinear interpolation integrates the information around each feature point well. Interpolating the five feature points of the detection frame in this way yields new feature vectors.
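The interpolation formula above can be sketched directly. Note that as written it omits the usual normalising factor 1/((x2−x1)×(y2−y1)), so the sketch assumes a unit-spaced grid (x2−x1 = y2−y1 = 1); the function name `bilinear_interp` is illustrative:

```python
def bilinear_interp(f, x, y, x1, y1, x2, y2):
    """Bilinear interpolation on a unit-spaced grid (x2-x1 = y2-y1 = 1);
    `f` maps integer grid coordinates (x, y) to a feature pixel value."""
    return (f(x1, y1) * (x2 - x) * (y2 - y)
            + f(x2, y1) * (x - x1) * (y2 - y)
            + f(x1, y2) * (x2 - x) * (y - y1)
            + f(x2, y2) * (x - x1) * (y - y1))

# corner values 0, 1, 2, 3 on the unit square; sample at the centre
grid = {(0, 0): 0.0, (1, 0): 1.0, (0, 1): 2.0, (1, 1): 3.0}
val = bilinear_interp(lambda px, py: grid[(px, py)], 0.5, 0.5, 0, 0, 1, 1)
```

At the centre of the square each corner contributes with weight 0.25, giving the mean of the four corner values.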
And carrying out weight distribution on the new feature vectors of the five feature points, calculating the weighted sum of the new feature vectors, replacing the original feature vectors, introducing the position information into the new feature vectors, and using the position information in subsequent regression:
fv = ρ×fA + ρ×fB + ρ×fC + ρ×fD + (1−4ρ)×fE
where fA, fB, fC, fD, fE are the new feature vectors of the five feature points, ρ is the weight of the four corner points, and fv is the feature vector after replacement. The position information contained in the four corner points is richer, so they are assigned a higher weight; the center point plays an auxiliary role and is assigned a lower weight.
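The weighted replacement above can be sketched as follows; the value ρ = 0.22 is an assumed example (the patent gives no concrete value, only that each corner's weight should exceed the center's, i.e. ρ > 1 − 4ρ):

```python
import numpy as np

def replace_feature(fa, fb, fc, fd, fe, rho=0.22):
    """Weighted sum of the five key-point feature vectors: the four corner
    vectors each weigh rho, the centre vector weighs 1 - 4*rho."""
    return rho * (fa + fb + fc + fd) + (1 - 4 * rho) * fe

corners = [np.array([1.0, 1.0])] * 4            # toy corner feature vectors
center = np.array([2.0, 2.0])                   # toy centre feature vector
fv = replace_feature(*corners, center)
```

With ρ = 0.22 each corner contributes 0.22 and the center 0.12, matching the stated intent that corners dominate.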
This method not only provides more accurate position information for the second regression, but also, through weight assignment, accounts for the differing contributions of the feature-point information at different positions, ensuring that the feature vector used by the second regression is richer and more accurate and thus improving localization ability.
Screening positive and negative samples. Network training begins once the network is constructed. During training, a subset of the prediction boxes must be selected for the back-propagated loss function. The usual principle for screening prediction boxes is to set a threshold on the intersection over union (IoU) between each prediction box and the ground-truth box, i.e. on their degree of overlap: prediction boxes above the threshold are positive samples and those below it are negative samples.
The screening method of the invention combines the intersection over union between the detection frame and the ground-truth box both before and after the second regression, computes a weighted value, and uses it to determine the screening threshold. This avoids losing some high-quality samples: for example, a detection box may have a high IoU before regression, indicating that it is close to the ground-truth box, yet a low IoU after regression; the network would otherwise judge it a negative sample, harming convergence.
θ=μ×Rb+(1-μ)×Ra
wherein μ is the weight, Ra is the intersection over union between the prediction box and the ground-truth box before the second regression, Rb is the intersection over union after the second regression, and θ is the threshold value for screening positive and negative samples.
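A minimal sketch of the weighted screening value, assuming axis-aligned boxes in (x1, y1, x2, y2) form and an illustrative μ = 0.5 (the patent does not fix μ):

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def screening_value(box_before, box_after, gt_box, mu=0.5):
    """theta = mu * Rb + (1 - mu) * Ra, combining the IoU with the
    ground-truth box before (Ra) and after (Rb) the second regression."""
    r_a = iou(box_before, gt_box)
    r_b = iou(box_after, gt_box)
    return mu * r_b + (1.0 - mu) * r_a
```

A box that overlapped the ground truth perfectly before regression (Ra = 1.0) but only half after (Rb = 0.5) scores 0.75 with μ = 0.5, so it is not immediately discarded as a negative sample.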
Specifically, in the embodiment of the present invention, the steps of the method are specifically implemented as follows:
Step 1: preprocess the image data with data augmentation techniques such as flipping, translation, color transformation and Mixup.
Step 2: input the preprocessed image into a ResNet network (a residual network) to extract 4 feature maps of different sizes.
Step 3: starting from the smallest feature map, upsample it so that the upsampled size equals the size of the feature map one level above.
Step 4: take the first-half channels of the upsampled feature map and the first-half channels of the upper-layer feature map and fuse them by pixel-wise addition to obtain feature map F1; process the second-half channels in the same way to obtain feature map F2.
Step 5: concatenate F1 and F2 along the channel dimension to obtain the fused feature map F.
Step 6: input the new feature map F into the attention mechanism module to reduce background interference, obtaining a feature map F' for the first classification and regression.
Step 7: preset prior boxes on the feature map F', and perform binary classification and the first regression correction of the detection frame.
Step 8: combine the feature map F' with the coordinates of the detection frame from the first regression, compute the feature vectors of the five feature points of the detection frame by bilinear interpolation, combine the five feature vectors by weighting into a new feature vector, and generate the feature map Fm used for the final prediction.
Step 9: perform multi-class classification and position prediction on the feature map Fm.
Step 10: apply the above operations to the 4 feature maps of different scales to realize multi-scale prediction.
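Steps 3 to 5 above can be sketched as follows; the (C, H, W) tensor layout, the even channel count, and nearest-neighbour upsampling are illustrative assumptions, since the text does not specify the upsampling operator:

```python
import numpy as np

def fuse_cross_channel(deep, shallow):
    """Sketch of steps 3-5: fuse an upsampled deep feature map with the
    upper-layer (shallow) map by channel halves.

    deep: (C, h, w) smaller feature map; shallow: (C, 2h, 2w) upper-layer map.
    Assumes C is even. Nearest-neighbour repetition stands in for the
    unspecified upsampling operator.
    """
    c = deep.shape[0]
    up = deep.repeat(2, axis=1).repeat(2, axis=2)      # 2x nearest upsample
    f1 = up[: c // 2] + shallow[: c // 2]              # fuse first-half channels (F1)
    f2 = up[c // 2 :] + shallow[c // 2 :]              # fuse second-half channels (F2)
    return np.concatenate([f1, f2], axis=0)            # splice F1, F2 by channel -> F
```

Fusing a 2-channel 1x1 map of ones with a 2-channel 2x2 map of zeros yields a 2x2 map of ones, since the upsampled values are added pixel by pixel.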
The embodiment of the invention also discloses a remote sensing image target detection system, which comprises:
the cross-channel feature fusion module is used for inputting the image to be processed into a residual network and extracting feature maps of 4 different scales; upsampling the low-scale feature map to the size of the upper-layer feature map, dividing both it and the upper-layer feature map into two halves along the channel dimension, fusing each half by pixel-wise addition, concatenating the fused feature maps along the channel dimension, and sending the result to the subsequent network; 4 fused features of different scales are obtained, and classification and localization are performed separately at each scale;
the redundant information removal module is used for inputting the fused feature map into the attention mechanism module, obtaining a probability saliency feature map through convolution operations, and multiplying it pixel by pixel with each channel of the original fused feature map to obtain target-salient features;
the position information correction module is used for presetting prior boxes of different sizes on each target-salient feature map of different sizes; performing binary classification and localization regression on the prior boxes to obtain fine-tuned detection frames; and, before the second classification and regression, adjusting the information in the fine-tuned detection frames by fusing surrounding position feature information via bilinear interpolation and introducing it into a new feature map;
and the positive and negative sample screening module is used for combining the new feature map and the finely adjusted detection frame coordinates to perform multi-class classification and secondary regression to obtain a final prediction frame.
Step 11: screen positive and negative samples for training according to the new positive and negative sample division principle, and perform the back-propagation calculation.
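The background-suppression operation of step 6 above amounts to a pixel-wise product of a foreground-probability map with every channel of the fused feature map. A minimal sketch follows; the two-channel (background, foreground) layout and the softmax normalization are assumptions, as the text does not specify how the probability map is normalized:

```python
import numpy as np

def apply_saliency(feat, prob_map):
    """Suppress background by multiplying every channel of the fused
    feature map by the per-pixel foreground probability.

    feat: (C, H, W) fused feature map F.
    prob_map: (2, H, W) two-channel saliency logits as produced by the
    1x1 convolution; channel order (background, foreground) is assumed.
    """
    # Numerically stable softmax over the two channels at each pixel.
    e = np.exp(prob_map - prob_map.max(axis=0, keepdims=True))
    p_fg = e[1] / e.sum(axis=0)        # per-pixel foreground probability
    return feat * p_fg[None, :, :]     # broadcast the product over channels
```

With zero logits the foreground probability is 0.5 everywhere, so every feature value is halved; higher foreground logits push target pixels back toward their original values while background pixels are suppressed.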
In addition, in another preferred embodiment provided by the present invention, there is also provided an apparatus, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the steps of the method for object detection in a remote sensing image provided in the above embodiments.
Further, in a further preferred embodiment provided by the present invention, a storage medium is further provided, where the storage medium stores a computer program, and the computer program, when executed by a processor, implements the steps of the remote sensing image target detection method provided in the above embodiments.
Those skilled in the art will appreciate that the above description of a computer apparatus is by way of example only and is not intended to be limiting; the apparatus may include more or fewer components than those described, may combine some components, or may include different components, such as input/output devices, network access devices, buses, etc.
The processor may be a central processing unit, but may also be other general purpose processors, digital signal processors, application specific integrated circuits, off-the-shelf programmable gate arrays or other programmable logic devices, discrete gate or transistor logic, discrete hardware components, or the like. The general purpose processor may be a microprocessor or the processor may be any conventional processor or the like which is the control center for the computer device and which connects the various parts of the overall computer device using various interfaces and lines.
The memory may be used to store the computer programs and/or modules, and the processor implements the various functions of the computer device by running or executing the computer programs and/or modules stored in the memory and invoking the data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function, and the like, and the data storage area may store data created according to the use of the device, and the like. Further, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, a plug-in hard disk, a smart memory card, a secure digital card, a flash memory card, at least one magnetic disk storage device, a flash memory device, or other non-volatile solid-state storage device.
The modules/units integrated in the computer device may be stored in a computer-readable storage medium if they are implemented in the form of software functional units and sold or used as separate products. Based on such understanding, all or part of the processes in the methods of the above embodiments may be implemented by a computer program, which may be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of the remote sensing image target detection method described above. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (10)

1. A remote sensing image target detection method is characterized by comprising the following steps:
inputting an image to be processed into a residual network, and extracting feature maps of 4 different scales; upsampling the low-scale feature map to the size of the upper-layer feature map, dividing both it and the upper-layer feature map into two halves along the channel dimension, fusing each half by pixel-wise addition, concatenating the fused feature maps along the channel dimension, and sending the result to the subsequent network; 4 fused features of different scales are obtained, and classification and localization are performed separately at each scale;
inputting the fused feature map into an attention mechanism module, obtaining a probability saliency feature map through a convolution operation, and multiplying it pixel by pixel with each channel of the original fused feature map to obtain target-salient features;
presetting prior boxes of different sizes on each target-salient feature map of different sizes; performing binary classification and localization regression on the prior boxes to obtain fine-tuned detection frames; before the second classification and regression, adjusting the information in the fine-tuned detection frames by fusing surrounding position feature information via bilinear interpolation and introducing it into a new feature map;
and combining the new feature graph and the coordinates of the fine-tuned detection frame, and performing multi-class classification and secondary regression to obtain a final prediction frame.
2. A remote sensing image target detection method as recited in claim 1, wherein before the step of inputting the image to be processed into a ResNet network to extract 4 feature maps with different sizes, the method further comprises:
and preprocessing the image to be processed, wherein the preprocessing mode comprises at least one of turning, translation, color transformation and Mixup.
3. The remote sensing image target detection method according to claim 2, wherein the above-mentioned feature map and the previous layer feature map are equally divided into two parts according to a channel, and the step of performing pixel-by-pixel addition fusion on each part specifically comprises:
dividing feature maps with different sizes into two parts according to channels;
fusing the first half channel information of the shallow features and the first half channel information of the deep features in a pixel-by-pixel addition mode;
fusing the latter half channel information of the shallow layer characteristic and the latter half channel information of the deep layer characteristic in a pixel-by-pixel addition mode;
and obtaining a fused feature map finally used for identification and positioning by splicing the two fused feature maps according to a channel.
4. The remote sensing image target detection method according to any one of claims 1 to 3, wherein the feature maps are further processed with convolution kernels of two different sizes to obtain feature maps with different receptive field sizes, which are concatenated along the channel dimension;
a two-channel probability saliency feature map is then obtained by a 1×1 convolution, using the following formula:
Fprob=Conv1×1(concat[Conv3×3(F),Conv5×5(F)])
wherein F is the fused feature map, Fprob is the probability saliency feature map, Conv(·) denotes a convolution operation whose subscript indicates the convolution kernel size, and concat[·] denotes channel-wise concatenation.
5. The remote sensing image target detection method according to claim 4, wherein the step of multiplying pixel by pixel with each channel of the original fused feature map to obtain the target-salient features specifically comprises:
the channel corresponding to the foreground probability is taken out and its probability values are multiplied pixel by pixel with the original feature map; after this product, the pixel values of non-target points are reduced and the pixel values of target points are enhanced, so that the boundary of the target becomes clearer and complex background information is weakened; the following formula is adopted:
F′(i,j)=Ffg(i,j)×F(i,j), i=1,…,H, j=1,…,W
wherein Ffg is the foreground-probability channel of Fprob, F′ is the result of multiplying the probability saliency feature map with the pixels at the corresponding positions of the fused feature map, H is the height of the feature map, and W is the width of the feature map.
6. The method for detecting the remote sensing image target as claimed in claim 5, wherein the step of fusing the surrounding position characteristic information in a bilinear interpolation mode comprises the following steps:
the five key feature points of the detection frame are recalculated by a bilinear interpolation method, the five feature points being the four corner points and the central point of the detection frame, with the formula:
f(x,y)=f(x1,y1)×(x2-x)×(y2-y)+f(x2,y1)×(x-x1)×(y2-y)+f(x1,y2)×(x2-x)×(y-y1)+f(x2,y2)×(x-x1)×(y-y1)
wherein (x, y) represents the coordinates of the interpolated point, (x1, y1), (x2, y1), (x1, y2) and (x2, y2) represent the position coordinates of the four nearest points around the interpolated point, namely the upper-left, upper-right, lower-left and lower-right points, respectively, and f(·,·) represents the feature pixel value at the corresponding position of the previously fused feature map.
7. The method for remote sensing image target detection according to claim 6, wherein said step of introducing a new feature map comprises:
carrying out weight distribution on the new feature vectors of the five feature points, calculating the weighted sum of the new feature vectors, replacing the original feature vectors, introducing the position information into the new feature vectors for subsequent regression, and adopting the following formula:
fv=ρ×fA+ρ×fB+ρ×fC+ρ×fD+(1-4ρ)×fE
wherein fA, fB, fC, fD and fE are respectively the new feature vectors of the five feature points, ρ represents the weight of each of the four corner points, and fv represents the replacement feature vector.
8. A remote sensing image target detection system, the system comprising:
the cross-channel feature fusion module is used for inputting the image to be processed into a residual network and extracting feature maps of 4 different scales; upsampling the low-scale feature map to the size of the upper-layer feature map, dividing both it and the upper-layer feature map into two halves along the channel dimension, fusing each half by pixel-wise addition, concatenating the fused feature maps along the channel dimension, and sending the result to the subsequent network; 4 fused features of different scales are obtained, and classification and localization are performed separately at each scale;
the redundant information removal module is used for inputting the fused feature map into the attention mechanism module, obtaining a probability saliency feature map through convolution operations, and multiplying it pixel by pixel with each channel of the original fused feature map to obtain target-salient features;
the position information correction module is used for presetting prior boxes of different sizes on each target-salient feature map of different sizes; performing binary classification and localization regression on the prior boxes to obtain fine-tuned detection frames; and, before the second classification and regression, adjusting the information in the fine-tuned detection frames by fusing surrounding position feature information via bilinear interpolation and introducing it into a new feature map;
and the positive and negative sample screening module is used for combining the new feature map and the finely adjusted detection frame coordinates to perform multi-class classification and secondary regression to obtain a final prediction frame.
9. An apparatus comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the remote sensing image target detection method according to any one of claims 2 to 7.
10. A storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, carries out the steps of the method for object detection of remote sensing images according to any one of claims 2 to 7.
CN202110327967.0A 2021-03-26 2021-03-26 Method, system, equipment and storage medium for detecting remote sensing image target Pending CN113052188A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110327967.0A CN113052188A (en) 2021-03-26 2021-03-26 Method, system, equipment and storage medium for detecting remote sensing image target

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110327967.0A CN113052188A (en) 2021-03-26 2021-03-26 Method, system, equipment and storage medium for detecting remote sensing image target

Publications (1)

Publication Number Publication Date
CN113052188A true CN113052188A (en) 2021-06-29

Family

ID=76515882

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110327967.0A Pending CN113052188A (en) 2021-03-26 2021-03-26 Method, system, equipment and storage medium for detecting remote sensing image target

Country Status (1)

Country Link
CN (1) CN113052188A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113343991A (en) * 2021-08-02 2021-09-03 四川新网银行股份有限公司 Feature-enhanced weak supervised learning method
CN113343953A (en) * 2021-08-05 2021-09-03 南京信息工程大学 FGR-AM method and system for remote sensing scene recognition
CN113469099A (en) * 2021-07-13 2021-10-01 北京航科威视光电信息技术有限公司 Training method, detection method, device, equipment and medium of target detection model
CN113989626A (en) * 2021-12-27 2022-01-28 北京文安智能技术股份有限公司 Multi-class garbage scene distinguishing method based on target detection model
CN114549958A (en) * 2022-02-24 2022-05-27 四川大学 Night and disguised target detection method based on context information perception mechanism
CN115171006A (en) * 2022-06-15 2022-10-11 武汉纺织大学 Detection method for automatically identifying personnel entering electric power dangerous area based on deep learning

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10007865B1 (en) * 2017-10-16 2018-06-26 StradVision, Inc. Learning method and learning device for adjusting parameters of CNN by using multi-scale feature maps and testing method and testing device using the same
CN110866907A (en) * 2019-11-12 2020-03-06 中原工学院 Full convolution network fabric defect detection method based on attention mechanism
CN111179217A (en) * 2019-12-04 2020-05-19 天津大学 Attention mechanism-based remote sensing image multi-scale target detection method
CN112132156A (en) * 2020-08-18 2020-12-25 山东大学 Multi-depth feature fusion image saliency target detection method and system
CN112287923A (en) * 2020-12-24 2021-01-29 德联易控科技(北京)有限公司 Card information identification method, device, equipment and storage medium
CN112348766A (en) * 2020-11-06 2021-02-09 天津大学 Progressive feature stream depth fusion network for surveillance video enhancement

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10007865B1 (en) * 2017-10-16 2018-06-26 StradVision, Inc. Learning method and learning device for adjusting parameters of CNN by using multi-scale feature maps and testing method and testing device using the same
CN110866907A (en) * 2019-11-12 2020-03-06 中原工学院 Full convolution network fabric defect detection method based on attention mechanism
CN111179217A (en) * 2019-12-04 2020-05-19 天津大学 Attention mechanism-based remote sensing image multi-scale target detection method
CN112132156A (en) * 2020-08-18 2020-12-25 山东大学 Multi-depth feature fusion image saliency target detection method and system
CN112348766A (en) * 2020-11-06 2021-02-09 天津大学 Progressive feature stream depth fusion network for surveillance video enhancement
CN112287923A (en) * 2020-12-24 2021-01-29 德联易控科技(北京)有限公司 Card information identification method, device, equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YAN Guangyu; LIU Zhengxi: "Real-time semantic segmentation algorithm based on hybrid attention", Modern Computer, no. 10 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113469099A (en) * 2021-07-13 2021-10-01 北京航科威视光电信息技术有限公司 Training method, detection method, device, equipment and medium of target detection model
CN113469099B (en) * 2021-07-13 2024-03-15 北京航科威视光电信息技术有限公司 Training method, detection method, device, equipment and medium of target detection model
CN113343991A (en) * 2021-08-02 2021-09-03 四川新网银行股份有限公司 Feature-enhanced weak supervised learning method
CN113343953A (en) * 2021-08-05 2021-09-03 南京信息工程大学 FGR-AM method and system for remote sensing scene recognition
CN113989626A (en) * 2021-12-27 2022-01-28 北京文安智能技术股份有限公司 Multi-class garbage scene distinguishing method based on target detection model
CN114549958A (en) * 2022-02-24 2022-05-27 四川大学 Night and disguised target detection method based on context information perception mechanism
CN114549958B (en) * 2022-02-24 2023-08-04 四川大学 Night and camouflage target detection method based on context information perception mechanism
CN115171006A (en) * 2022-06-15 2022-10-11 武汉纺织大学 Detection method for automatically identifying personnel entering electric power dangerous area based on deep learning

Similar Documents

Publication Publication Date Title
CN113052188A (en) Method, system, equipment and storage medium for detecting remote sensing image target
Liu et al. An enhanced CNN-enabled learning method for promoting ship detection in maritime surveillance system
CN112132156B (en) Image saliency target detection method and system based on multi-depth feature fusion
WO2018153322A1 (en) Key point detection method, neural network training method, apparatus and electronic device
Tian et al. A dual neural network for object detection in UAV images
CN114359851A (en) Unmanned target detection method, device, equipment and medium
Asokan et al. Machine learning based image processing techniques for satellite image analysis-a survey
CN116665176B (en) Multi-task network road target detection method for vehicle automatic driving
CN112434618B (en) Video target detection method, storage medium and device based on sparse foreground priori
CN115631344B (en) Target detection method based on feature self-adaptive aggregation
CN110852327A (en) Image processing method, image processing device, electronic equipment and storage medium
Wang et al. A feature-supervised generative adversarial network for environmental monitoring during hazy days
CN114332150A (en) Handwriting erasing method, device, equipment and readable storage medium
CN112633274A (en) Sonar image target detection method and device and electronic equipment
Xu et al. COCO-Net: A dual-supervised network with unified ROI-loss for low-resolution ship detection from optical satellite image sequences
Al-Shemarry et al. Developing learning-based preprocessing methods for detecting complicated vehicle licence plates
Idicula et al. A novel sarnede method for real-time ship detection from synthetic aperture radar image
CN112365451A (en) Method, device and equipment for determining image quality grade and computer readable medium
Raj J et al. Lightweight SAR ship detection and 16 class classification using novel deep learning algorithm with a hybrid preprocessing technique
CN115115535A (en) Depth map denoising method, device, medium and equipment
Zhang et al. Ship-Go: SAR Ship Images Inpainting via instance-to-image Generative Diffusion Models
Sivapriya et al. ViT-DexiNet: a vision transformer-based edge detection operator for small object detection in SAR images
CN111291767A (en) Fine granularity identification method, terminal equipment and computer readable storage medium
CN111178158A (en) Method and system for detecting cyclist
Huang et al. Ship detection based on YOLO algorithm for visible images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination