CN114708511B - Remote sensing image target detection method based on multi-scale feature fusion and feature enhancement - Google Patents
- Publication number: CN114708511B (application CN202210614648.2A)
- Authority
- CN
- China
- Prior art keywords
- feature
- fusion
- module
- features
- attention
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06F18/253 — Pattern recognition; fusion techniques of extracted features
- G06N3/045 — Neural networks; combinations of networks
- G06N3/08 — Neural networks; learning methods
Abstract
The invention relates to a remote sensing image target detection method based on multi-scale feature fusion and feature enhancement. An adaptive multi-scale feature fusion module performs feature fusion with additional lateral connections, increasing the exchange between adjacent features, making full use of the extracted multi-scale features and enriching the feature information; skip connections let the original features participate in the fusion process, improving the multi-scale representation capability of the network. Multi-branch dilated convolutions with different dilation rates in the attention feature enhancement module provide receptive fields of different sizes, so that when objects of different sizes appear in a remote sensing image, features of targets at different scales can be extracted simultaneously, improving the network's generalization over target scale; a hybrid attention mechanism module enhances the feature information of targets while suppressing background and noise.
Description
Technical Field
The invention relates to the field of remote sensing image processing, in particular to a remote sensing image target detection method based on multi-scale feature fusion and feature enhancement.
Background
As remote sensing technology matures, many satellite and airborne sensors can acquire remote sensing images of increasingly high resolution, providing intuitive and clear surface information of great value for earth observation. Remote sensing image target detection is an important task in remote sensing image interpretation: it locates and labels the position and category of targets of interest within a wide field of view, and is an important basis for applications such as urban planning, land use, traffic management and military surveillance. In recent years, with the rapid development of deep learning, neural networks obtain features with stronger semantic representation through multi-layer convolution operations on images, further improving target detection performance.
Deep-learning-based target detection methods can be classified along two axes. According to whether region-of-interest extraction is required, they divide into two-stage and single-stage detection; according to whether anchor boxes must be preset, they divide into anchor-based and keypoint-based detection, the latter also called anchor-free detection. Two-stage detection completes the detection process in two phases: regions of interest are first extracted, then each region is further detected and identified. Two-stage methods achieve higher accuracy, but because regions of interest must be extracted first and each region classified and regressed separately, they incur extra computation, are not fast enough, and are hard to apply in systems with strict real-time requirements. Single-stage detection completes the whole process in one pass: it is fast and basically meets real-time requirements, but its accuracy is slightly lower than that of two-stage methods. Most detection methods extract anchor boxes and refine them as initial detection boxes: the position, shape and size of an anchor box are adjusted by regressing the offset between the center points of the ground-truth box and the anchor box, together with the corresponding width and height scaling ratios, so that the anchor box gradually coincides with the ground-truth box. The advantage of anchor-based detection is that the network outputs are relative values with respect to the anchor box, so their range is small, training is easy, and convergence is fast.
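The anchor-box regression described above — regressing the center-point offset and the width/height scaling ratios between the anchor box and the ground-truth box — can be sketched as follows. This is a minimal illustration under the standard parameterization, not code from the patent; the (cx, cy, w, h) box format and function names are assumptions:

```python
import math

def encode(anchor, gt):
    """Compute regression targets (tx, ty, tw, th) that map an anchor
    box onto a ground-truth box. Boxes are (cx, cy, w, h)."""
    ax, ay, aw, ah = anchor
    gx, gy, gw, gh = gt
    tx = (gx - ax) / aw          # center offset, normalized by anchor size
    ty = (gy - ay) / ah
    tw = math.log(gw / aw)       # log width/height scaling ratio
    th = math.log(gh / ah)
    return tx, ty, tw, th

def decode(anchor, target):
    """Inverse transform: apply predicted targets to the anchor box."""
    ax, ay, aw, ah = anchor
    tx, ty, tw, th = target
    return (ax + tx * aw, ay + ty * ah, aw * math.exp(tw), ah * math.exp(th))

anchor = (50.0, 50.0, 20.0, 40.0)
gt = (54.0, 48.0, 30.0, 36.0)
t = encode(anchor, gt)
# decoding the encoded targets must recover the ground-truth box
assert all(abs(a - b) < 1e-9 for a, b in zip(decode(anchor, t), gt))
```

Because the targets are small relative offsets and log ratios, the value range the network must regress is narrow, which is exactly the training-stability advantage the text attributes to anchor-based detection.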
However, anchor-based detection requires a matching process, and anchor design demands a great deal of manual intervention for different tasks and certain prior knowledge for specific tasks, making parameter tuning tedious; objects with unusual aspect ratios are hard to match, leading to missed detections; and the large number of anchor boxes also causes problems such as high memory occupation and high time complexity. In response to these problems, keypoint-based detection has recently gained popularity. It classifies and regresses targets directly at the pixel level, avoiding the introduction of anchor boxes and alleviating the series of problems they bring: the manual, tedious anchor design process is eliminated, matching steps such as intersection-over-union (IoU) computation are avoided, the amount of computation is reduced, and higher accuracy is obtained, making it a hotspot of current research.
Nevertheless, existing methods still face the following problems on remote sensing images:
1. After multiple pooling operations, the information loss of small-scale targets is severe and their detection needs improvement. Remote sensing images contain many small-size targets, such as airplanes, automobiles and ships, which are difficult to detect.
2. When objects of very different sizes coexist in an image, detection degrades. Target scale in remote sensing images varies greatly: the sizes of targets of different categories, or of the same category captured at different resolutions, differ widely, while conventional convolution has a limited receptive field and performs poorly under large scale variation.
3. When targets are densely distributed in an image, detection accuracy drops. Targets of interest in remote sensing images, such as vehicles and ships, may be densely and chaotically arranged, so multiple target instances may fall within the same region of interest and background noise is easily introduced, reducing detection accuracy.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a remote sensing image target detection method based on multi-scale feature fusion and feature enhancement. An adaptive multi-scale feature fusion module performs top-down fusion of features at different resolutions while lateral connections increase the exchange between adjacent features; an attention feature enhancement module then combines multi-branch dilated convolution with an attention mechanism to improve the network's generalization over target scale and to enhance effective feature information, improving target detection capability. The method specifically comprises the following steps:
Step 1: extract features of the input remote sensing image through a backbone network: the image is fed into a ResNet backbone, and the outputs of its last four stages form a group of multi-scale feature maps at different resolutions.
Step 2: adjust the number of feature channels: apply one 1x1 convolution to each map in the multi-scale group so that its channel number matches that of the shallowest feature map, obtaining a feature map group.
Step 3: feed the feature map group obtained in step 2 into the adaptive multi-scale feature fusion module for feature fusion, comprising a first top-down fusion stage, a bottom-up fusion stage and a second top-down fusion stage, specifically:
Step 31: in the first top-down fusion stage, lateral connections are introduced and fusion proceeds gradually from the deepest feature map: each deeper map is upsampled and fused with the map one level shallower, from the deepest to the shallowest level, completing the first forward propagation; the same upsample-and-fuse procedure is then repeated over the newly fused maps, completing the second forward propagation; and once more over the shallowest pair, completing the third forward propagation, finally yielding a fused feature map group.
Step 32: in the bottom-up fusion stage, starting from the shallowest feature of the group obtained in step 31, each map is downsampled by a factor of two and fused with the next deeper fused map and, via a skip connection, with the corresponding original input map, finally yielding another feature map group.
Step 33: in the second top-down fusion stage, starting from the deepest feature of the group obtained in step 32, the maps are successively upsampled and added layer by layer to obtain a high-resolution first feature map of size P/4.
Step 4: input the first feature map obtained in step 33 into the attention feature enhancement module for feature enhancement. The module comprises a multi-branch dilated convolution module and a hybrid attention mechanism module; each branch of the multi-branch dilated convolution module has a different dilation rate, and the first feature map, convolved at the different dilation rates, is fused into a second feature map.
Step 5: input the second feature map into the hybrid attention mechanism module to suppress background and noise. The module comprises a channel-domain attention module and a spatial-domain attention module; the second feature map is processed by both to obtain a third feature map.
Step 6: obtain the final detection result through classification and regression: the third feature map output in step 5 passes through three 3x3 convolution branches to obtain a center-point prediction, a center-point offset prediction and a target width-height prediction, and the three predictions are fused into the final result.
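The three prediction branches of step 6 follow a CenterNet-style decoding scheme. The NumPy sketch below shows, for illustration only, how a center-point heatmap, an offset map and a width-height map could be combined into one detection; the function name, the output-stride value and the single-peak simplification are assumptions, not the patent's implementation:

```python
import numpy as np

def decode_centernet(heatmap, offset, wh, stride=4):
    """Fuse the three head outputs into one detection.
    heatmap: (C, H, W) center-point scores; offset: (2, H, W) sub-pixel
    center offsets; wh: (2, H, W) predicted width/height.
    Returns (class_id, score, cx, cy, w, h) for the strongest peak only,
    as a minimal illustration."""
    c, y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    score = heatmap[c, y, x]
    cx = (x + offset[0, y, x]) * stride   # map grid coords back to the image
    cy = (y + offset[1, y, x]) * stride
    w, h = wh[0, y, x], wh[1, y, x]
    return int(c), float(score), float(cx), float(cy), float(w), float(h)

# one synthetic peak at grid cell (1, 2) of a single-class heatmap
hm = np.zeros((1, 4, 4)); hm[0, 1, 2] = 0.8
off = np.zeros((2, 4, 4))
wh = np.full((2, 4, 4), 16.0)
cls_id, score, cx, cy, w, h = decode_centernet(hm, off, wh)  # cx=8.0, cy=4.0
```

A full implementation would extract all local maxima above a threshold rather than the single global peak, but the per-peak arithmetic is the same.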
According to a preferred embodiment, the hybrid attention mechanism module obtains the third feature map as follows:
Step 51: input the second feature map obtained in step 4 into the channel-domain attention module. Global average pooling (GAP) first sums and averages all feature values of each channel, converting each two-dimensional feature map into a real number and giving a C x 1 x 1 vector, where C is the number of channels. GAP and global max pooling (GMP) are applied simultaneously along the channel dimension, and the two pooled vectors are fed into two fully connected layers for training and learning, yielding two one-dimensional channel weight sequences; the two sequences are added and mapped to [0, 1] by a Sigmoid activation function, giving one final weight sequence, which weights the second feature map to obtain an intermediate feature map, completing the channel-domain attention operation.
Step 52: pass the intermediate feature map through GAP and GMP to obtain two single-channel feature maps, concatenate them along the channel dimension, apply a convolution operation to obtain a spatial-domain attention feature map, map it to [0, 1] with a Sigmoid activation function to obtain the spatial-domain attention weights, and multiply these weights with the intermediate feature map to obtain the final third feature map.
According to a preferred embodiment, the feature fusion module in step 3 further accounts for the different contributions of features at different resolutions to the fused feature by adding learnable weight coefficients, achieving adaptive fusion and improving the scale invariance of the features. The specific implementation is as follows:
First, the resolutions of the multi-scale features to be fused are adjusted to be consistent:
(1) in the first top-down stage, deep features are upsampled by a factor of two using nearest-neighbor interpolation;
(2) in the bottom-up stage, shallow features are downsampled by a factor of two using max pooling. The adjusted features are multiplied by their corresponding weight coefficients, added element by element, and finally fused through a Swish activation function, convolution and batch normalization.
Compared with the prior art, the invention has the beneficial effects that:
1. The adaptive multi-scale feature fusion module AMFF adopts more lateral connections during feature fusion, increasing the exchange between adjacent features, making full use of the extracted multi-scale features and enriching the feature information; skip connections let the original features participate in the fusion process and avoid the information loss caused by repeated up- and down-sampling, improving the multi-scale representation capability of the network.
2. Considering that features at different resolutions contribute differently to the fused feature, learnable weight coefficients are introduced to achieve adaptive fusion, improving the scale invariance of the features.
3. The attention feature enhancement module AFE adopts multi-branch dilated convolutions with different dilation rates to obtain receptive fields of different sizes; when objects of different sizes appear in a remote sensing image, features of targets at different scales can be extracted simultaneously, improving the network's generalization over target scale.
4. Aiming at the noise introduced by feature fusion and multi-branch dilated convolution, the channel-domain and spatial-domain attention modules enhance the feature information of targets while suppressing background and noise.
Drawings
FIG. 1 is a schematic diagram of the structure of a remote sensing image target detection network according to the present invention;
FIG. 2 is a schematic structural diagram of an adaptive multi-scale feature fusion module AMFF according to the present invention;
FIG. 3 is a flow diagram of the adaptive feature fusion process of the present invention;
FIG. 4 is a schematic diagram of the multi-branch dilated convolution structure with different dilation rates according to the present invention;
FIG. 5 is a schematic diagram of the attention feature enhancement module AFE of the present invention;
FIG. 6 is a schematic structural diagram of a CBAM of the hybrid attention mechanism of the present invention, FIG. 6a is a schematic structural diagram of a channel domain attention module, and FIG. 6b is a schematic structural diagram of a spatial domain attention module;
FIG. 7 is a graph comparing the results of experiments performed on DIOR data sets by the method of the present invention, FIG. 7a is the results of a CenterNet test example, and FIG. 7b is the results of a test example by the method of the present invention;
FIG. 8 is a graph comparing the results of experiments performed by the method of the present invention on the NWPU VHR-10 dataset; FIG. 8a is the result of a CenterNet test example, and FIG. 8b is the result of a test example of the method of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the accompanying drawings in combination with the embodiments. It is to be understood that these descriptions are only illustrative and are not intended to limit the scope of the present invention. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present invention.
The following detailed description is made with reference to the accompanying drawings.
AMFF in the invention represents an adaptive multi-scale feature fusion module.
The AFE in the present invention represents an attention feature enhancement module.
Some symbols in the drawings attached to the specification are described below: Conv denotes ordinary convolution, and 3x3 denotes a convolution kernel of size 3x3; DepthConv denotes depthwise separable convolution; a 3x3 convolution with r = 12 denotes a dilated convolution with kernel size 3x3 and dilation rate 12; BN is batch normalization; Swish, ReLU and Sigmoid denote activation functions.
Aiming at the problems in the prior art, the invention provides a remote sensing image target detection method based on multi-scale feature fusion and feature enhancement; the structure of the detection network is shown in FIG. 1. The adaptive multi-scale feature fusion module performs top-down fusion of features at different resolutions while additional lateral connections increase the exchange between adjacent features; the attention feature enhancement module then applies multi-branch dilated convolutions with different dilation rates to obtain different receptive fields, improving target detection capability. The details are as follows:
step 1: extracting the characteristics of the input remote sensing image, inputting the remote sensing image into the main network by adopting a ResNet network through the main network, and outputting the remote sensing image in the last four layers of the ResNet network to obtain multi-scale characteristic graph groups with different resolutions through operations such as multiple groups of convolution, pooling and the like(ii) a Wherein, the multi-scale feature map groupAre one quarter, one eighth, one sixteenth and one thirty half of the input remote sensing image, respectively. P in fig. 1 refers to the size of the input remote sensing image.
Step 2: because the feature maps extracted by the backbone have too many channels and contain much redundant information, the channel numbers are adjusted: one 1x1 convolution is applied to each map in the multi-scale group so that its channel number matches that of the shallowest feature map, obtaining a feature map group.
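A 1x1 convolution, as used here to align channel counts, is simply a per-pixel linear map over the channel dimension. A minimal NumPy sketch (shapes and channel counts are illustrative assumptions, not the patent's values):

```python
import numpy as np

def conv1x1(x, weight):
    """1x1 convolution: a per-pixel linear map over channels.
    x: (C_in, H, W), weight: (C_out, C_in) -> output (C_out, H, W)."""
    return np.einsum('oc,chw->ohw', weight, x)

# reduce a 256-channel map to 64 channels (illustrative counts)
x = np.random.rand(256, 8, 8)
w = np.random.rand(64, 256) * 0.01
y = conv1x1(x, w)
assert y.shape == (64, 8, 8)
```

Because the kernel covers a single pixel, spatial resolution is untouched and only the channel dimension changes, which is why this operation can cheaply discard redundant channels.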
Step 3: feed the feature map group obtained in step 2 into the adaptive multi-scale feature fusion module for feature fusion, comprising a first top-down fusion stage, a bottom-up fusion stage and a second top-down fusion stage; FIG. 2 is a schematic structural diagram of the adaptive multi-scale feature fusion module AMFF. Specifically:
Step 31: in the first top-down fusion stage, lateral connections are introduced and fusion proceeds gradually from the deepest feature map: each deeper map is upsampled and fused with the map one level shallower, from the deepest to the shallowest level, completing the first forward propagation; the same upsample-and-fuse procedure is then repeated over the newly fused maps, completing the second forward propagation; and once more over the shallowest pair, completing the third forward propagation, finally yielding a fused feature map group.
In the first top-down fusion stage, each deeper feature is upsampled by a factor of two and fused with the feature of matching resolution; this increases the exchange between adjacent features, retains feature information, and effectively introduces the semantic information of deep features into shallow features, enriching the semantics of the shallow high-resolution features.
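One top-down fusion step — two-times upsampling of the deeper map followed by element-wise addition with the shallower map — can be sketched as follows. This minimal NumPy version uses nearest-neighbour upsampling and omits the learnable weights, convolution and normalization for brevity, so it is an illustration rather than the module itself:

```python
import numpy as np

def upsample2x_nearest(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def topdown_fuse(deep, shallow):
    """One top-down fusion step: upsample the deeper (lower-resolution)
    map and add it element-wise to the shallower map, which has twice
    the spatial resolution."""
    return upsample2x_nearest(deep) + shallow
```

Each value of the deep map is replicated into a 2x2 block, so the semantic response of a deep location is spread over the corresponding shallow region before the addition.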
Step 32: in the bottom-up fusion stage, starting from the shallowest feature of the group obtained in step 31, each map is downsampled by a factor of two and fused with the next deeper fused map, finally yielding another feature map group.
To avoid the loss of target information caused by the repeated upsampling in the top-down path of step 31, the network introduces this bottom-up path and adds skip connections, so that the original input feature maps participate in the layer-by-layer fusion.
Step 33: in the second top-down fusion stage, starting from the deepest feature of the group obtained in step 32, the maps are successively upsampled and added layer by layer to obtain a high-resolution first feature map of size P/4.
In prediction, a higher-resolution feature map is usually used to improve the detection of small-size targets. A top-down path is therefore added: starting from the deepest feature of the result of step 32, the maps are successively upsampled and added layer by layer to obtain the high-resolution first feature map of size P/4, which retains the details and position information of small targets while carrying richer semantic information, improving small-target detection accuracy. P/4 denotes a quarter of the input image size.
The multi-scale feature fusion module also accounts for the different contributions of features at different resolutions to the fused feature by adding learnable weight coefficients, achieving adaptive fusion and improving the scale invariance of the features. The specific implementation is as follows:
The fusion strategy first adjusts the resolutions of the multi-scale features to be fused to be consistent: (1) in the first top-down fusion stage, deep features are upsampled by a factor of two using nearest-neighbor interpolation; (2) in the bottom-up fusion stage, shallow features are downsampled by a factor of two using max pooling. The adjusted features are then multiplied by their corresponding weight coefficients, added element by element, and finally fused through a Swish activation function, convolution and batch normalization.
Equation (1) is a brief representation of a node to be fused during the fusion process:

P_out = Conv( Σ_{i=1..n} w_i · P_i )    (1)

where P_out is the feature node currently being fused, P_i is one of the feature nodes pointing to the current node, w_i is the weight coefficient corresponding to that node, and n is the total number of nodes pointing to P_out.
The weight w_i is calculated as in equation (2) and is defined as the proportion of each node among all nodes pointing to P_out, so each weight lies between 0 and 1:

w_i = ReLU(a_i) / ( Σ_{j=1..n} ReLU(a_j) + ε )    (2)

where ε = 0.0001 is an extremely small value set to keep the learned weight coefficients numerically stable, and the ReLU activation applied to each learnable scalar a_i before computing w_i ensures that the weights are not negative.
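The fast-normalized weighting of equation (2) takes only a few lines; the raw scalar values below are illustrative, not learned values from the patent:

```python
def fusion_weights(raw, eps=1e-4):
    """Fast normalized fusion weights: each learnable scalar is passed
    through ReLU and divided by the sum over all branches plus a small
    eps, so every weight lands in [0, 1] and the weights sum to ~1."""
    relu = [max(0.0, a) for a in raw]
    denom = sum(relu) + eps
    return [r / denom for r in relu]

w = fusion_weights([1.0, 3.0, -2.0])   # the negative raw weight is clamped to 0
```

Compared with a softmax over the raw scalars, this normalization avoids the exponential and is the cheaper variant popularized for feature-pyramid fusion.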
Taking one fused node as an example, the specific form is shown in equation (3), and the fusion process is shown in FIG. 3:

P_out = BN( Conv( Swish( w_1 · P_in + w_2 · UpSample(P_deep) ) ) )    (3)

where UpSample denotes two-times upsampling by nearest-neighbor interpolation, keeping the resolutions of the features to be fused consistent, and w_1 and w_2 are the weight coefficients of the node. Depthwise separable convolution is employed in the AMFF module. Unlike ordinary convolution, a depthwise separable convolution first convolves each channel independently and then uses a 1x1 pointwise convolution to expand the depth; this effectively reduces the time and parameter count of the convolution computation and improves the detection efficiency of the network.
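The parameter saving of depthwise separable convolution mentioned above is easy to quantify. A small sketch (bias terms ignored; the channel counts are illustrative assumptions):

```python
def conv_params(c_in, c_out, k):
    """Parameters of a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k

def dwsep_params(c_in, c_out, k):
    """Depthwise separable convolution: a k x k depthwise filter per
    input channel, followed by a 1x1 pointwise convolution that
    expands the depth to c_out."""
    return c_in * k * k + c_in * c_out

standard = conv_params(64, 64, 3)    # 36864 parameters
separable = dwsep_params(64, 64, 3)  # 576 + 4096 = 4672 parameters
```

For a 3x3 kernel with 64 input and output channels, the separable form needs roughly an eighth of the parameters, which is the efficiency gain claimed for the AMFF module.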
Step 4: input the first feature map obtained in step 3 into the attention feature enhancement module for feature enhancement. The module comprises a multi-branch dilated convolution module and a hybrid attention mechanism module; each branch of the multi-branch dilated convolution module has a different dilation rate, and the first feature map, convolved at the different dilation rates, is fused into a second feature map.
The multi-branch dilated convolutions with different dilation rates obtain receptive fields of different sizes and capture multi-scale context information. Specifically, the dilation rates r used are 12, 24 and 36.
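The effective kernel size of a dilated convolution grows linearly with the dilation rate, which is why rates of 12, 24 and 36 yield receptive fields of very different sizes. A quick check using the standard formula k + (k - 1)(r - 1):

```python
def effective_kernel(k, r):
    """Effective kernel size of a k x k convolution with dilation r:
    the filter taps span k + (k - 1) * (r - 1) pixels."""
    return k + (k - 1) * (r - 1)

# 3x3 kernels at the three dilation rates used by the module
sizes = [effective_kernel(3, r) for r in (12, 24, 36)]  # [25, 49, 73]
```

So the three branches see 25x25, 49x49 and 73x73 windows respectively while each still computes only 3x3 = 9 multiplications per output pixel.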
In a convolutional neural network, a fixed-size receptive field is unfavorable for detecting objects of different sizes, and is particularly unfriendly to remote sensing images, where target scale varies drastically. In semantic segmentation, dilated convolution is commonly used to expand the receptive field and classify each pixel in the image more accurately, for example the atrous spatial pyramid pooling (ASPP) method in DeepLab; but since target scale in remote sensing images varies more severely than in natural images, ASPP is not directly applicable to remote sensing target detection. Inspired by ASPP, the invention adopts dilated convolution branches with a wider span of larger dilation rates to obtain receptive fields of different sizes, improving the network's detection of targets at different scales. FIG. 4 is a schematic structural diagram of the multi-branch dilated convolution with different dilation rates, showing the receptive fields of the dilated convolutions at each rate.
Meanwhile, remote sensing scenes are complex, targets may be densely arranged, and missed and false detections can occur; moreover, considerable noise is introduced by the feature fusion of step 3 and the multi-branch dilated convolution of step 4. To address this, the invention applies the hybrid spatial- and channel-domain attention mechanism module CBAM after the multi-branch dilated convolution to suppress background and noise: channel-domain attention makes the network focus on the feature maps of effective channels, spatial-domain attention focuses on the positions helpful to the task, and their serial combination highlights effective information and enhances the features.
Step 5: input the second feature map into the hybrid attention mechanism module to suppress background and noise. The module comprises a channel-domain attention module and a spatial-domain attention module; its structure is shown in FIG. 6, where FIG. 6a is a schematic diagram of the channel-domain attention module and FIG. 6b of the spatial-domain attention module. C denotes the number of channels, W and H denote the width and height of the feature map, and FC1 and FC2 are fully connected layers.
Step 51: input the second feature map obtained in step 4 into the channel-domain attention module. Global average pooling (GAP) first sums and averages all feature values of each channel, converting each two-dimensional feature map into a real number and giving a C x 1 x 1 vector.
Since global max pooling (GMP) also benefits the screening of effective feature information, GAP and GMP are used simultaneously along the channel dimension; the two pooled vectors are fed into two fully connected layers for training and learning, yielding two one-dimensional channel weight sequences; the two sequences are added and mapped to [0, 1] by a Sigmoid activation function, giving one final weight sequence; this sequence weights the second feature map to obtain an intermediate feature map, completing the channel-domain attention operation.
Specifically, the second feature map obtained in step 4 is input into the channel-domain attention module and passes through parallel max pooling and average pooling to obtain two vectors of size 64 x 1 x 1; the two vectors are then fed into a shared fully connected stack, where the first fully connected layer compresses the channel number to 4 and the second expands it back to 64, yielding two one-dimensional channel weight sequences; the two sequences are added and mapped to [0, 1] by a Sigmoid activation function, giving one final weight sequence, which weights the second feature map to obtain the intermediate feature map, completing the channel-domain attention operation.
Step 52: unlike channel domain attention, which attends to channel information, spatial domain attention mainly attends to position information. The intermediate feature map passes through GAP and GMP to obtain 2 single-channel feature maps, which are concatenated along the channel dimension and passed through a convolution operation to obtain a spatial domain attention map. After a Sigmoid activation function maps the values to [0,1], the spatial attention weights are obtained and multiplied with the intermediate feature map to produce the final third feature map.
Specifically: the intermediate feature map is reduced along the channel dimension by max pooling and average pooling into two tensors of size H×W×1, which are stacked by a channel-dimension splicing operation to give 2 channels. A convolution operation then reduces the channel number to 1 while keeping H and W unchanged, and a Sigmoid activation function maps the values to [0,1] to obtain the spatial attention weights. Finally the intermediate feature map is multiplied by these weights to obtain the third feature map of size 64×H×W, completing the spatial domain attention operation.
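A matching sketch of the spatial domain attention operation (the 7×7 kernel size is a common choice in CBAM implementations and is an assumption — the patent only specifies "a convolution operation"):

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial-domain attention sketch: per-pixel mean and max over the
    channel dim, concat (2 channels), conv down to 1 channel with H and W
    unchanged, Sigmoid weights in [0, 1], then element-wise weighting."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                          # x: (N, C, H, W)
        avg = x.mean(dim=1, keepdim=True)          # average pooling -> (N, 1, H, W)
        mx = x.amax(dim=1, keepdim=True)           # max pooling -> (N, 1, H, W)
        w = self.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # (N, 1, H, W)
        return x * w                               # weighted output feature map
```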
The structure of the attention feature enhancement module AFE is shown in fig. 5. The input features pass through three dilated (atrous) convolution branches with 3×3 kernels and dilation rates r=12, r=24 and r=36; the outputs of the three branches are concatenated along the channel dimension, fused by a 1×1 convolution, and added element-wise to the original input features. Finally a CBAM mixed attention module suppresses background and noise and enhances the feature information.
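The multi-branch dilated convolution front end of the AFE module can be sketched as follows (the 64-channel width is assumed, and the trailing CBAM mixed attention stage is omitted here). With padding equal to the dilation rate, each 3×3 branch preserves the spatial size, so the three outputs can be concatenated and the fused result added to the input element by element:

```python
import torch
import torch.nn as nn

class MultiBranchDilated(nn.Module):
    """AFE front-end sketch: three parallel 3x3 convs with dilation
    r = 12, 24, 36 (padding = r keeps H x W unchanged), channel concat,
    1x1 fusion conv, and an element-wise residual with the input."""
    def __init__(self, channels=64):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, 3, padding=r, dilation=r)
            for r in (12, 24, 36)
        ])
        self.fuse = nn.Conv2d(3 * channels, channels, 1)

    def forward(self, x):
        y = torch.cat([b(x) for b in self.branches], dim=1)  # concat on channels
        return self.fuse(y) + x      # 1x1 fusion + residual add with the input
```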
Step 6: the final detection result is obtained through classification and regression. The third feature map output in step 5 passes through three 3×3 convolution branches to obtain a center-point prediction result, a center-point offset prediction result and a target width-height prediction result, with feature sizes (C, P/4, P/4), (2, P/4, P/4) and (2, P/4, P/4) respectively, where C is the number of object categories in the detected images and P is the size of the input image. The final prediction result is obtained by combining the target center points, the center-point offsets and the target widths and heights.
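The three prediction branches of step 6 can be sketched as CenterNet-style heads (the 64-channel input width and the 3×3-then-1×1 structure of each branch are illustrative assumptions — the text specifies only three 3×3 convolution branches):

```python
import torch
import torch.nn as nn

class DetectionHead(nn.Module):
    """Sketch of the step-6 heads: a class heatmap of shape (C, P/4, P/4),
    a center-point offset of shape (2, P/4, P/4) and a width-height map
    of shape (2, P/4, P/4), each from its own convolution branch."""
    def __init__(self, in_ch=64, num_classes=10):
        super().__init__()
        def branch(out_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, in_ch, 3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(in_ch, out_ch, 1),
            )
        self.heatmap = branch(num_classes)  # center-point classification
        self.offset = branch(2)             # center-point offset regression
        self.wh = branch(2)                 # target width-height regression

    def forward(self, x):
        return self.heatmap(x), self.offset(x), self.wh(x)
```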
To further illustrate the effectiveness of the proposed method, the evaluation criterion adopts the mean average precision (mAP) widely used in target detection, which is the mean of the per-class average precision (AP) values. The AP of a category is the area under the curve drawn from its corresponding precision (Precision) and recall (Recall) over the range 0 to 1. Precision and recall are computed as in formulas (4) and (5):
Precision = TP / (TP + FP)    (4)
Recall = TP / (TP + FN)    (5)
Here TP denotes a true positive, i.e. the model predicts a target and a true target is actually present; FN denotes a false negative; and FP denotes a false positive.
The average precision AP of a single category is computed as in formula (6), and the mean average precision mAP as in formula (7), where C denotes the number of categories participating in the calculation, P is the precision and R is the recall:
AP = ∫₀¹ P(R) dR    (6)
mAP = (1/C) Σᵢ APᵢ, i = 1, …, C    (7)
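Formulas (4)-(7) amount to the following small computation (rectangle integration of the precision-recall curve is one simple way to approximate the area in formula (6); practical mAP implementations usually also apply monotone interpolation):

```python
def precision_recall(tp, fp, fn):
    """Formulas (4) and (5): Precision = TP/(TP+FP), Recall = TP/(TP+FN)."""
    return tp / (tp + fp), tp / (tp + fn)

def average_precision(precisions, recalls):
    """Formula (6) sketch: area under the precision-recall curve,
    approximated by rectangles over the sorted recall points."""
    ap, prev_r = 0.0, 0.0
    for p, r in sorted(zip(precisions, recalls), key=lambda pair: pair[1]):
        ap += p * (r - prev_r)
        prev_r = r
    return ap

def mean_average_precision(aps):
    """Formula (7): mAP is the mean of the per-class AP values."""
    return sum(aps) / len(aps)
```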
Platform and system setup for the experiments: the CPU is an AMD Ryzen 5 3600X 6-Core; the GPU is an NVIDIA GeForce RTX 3090; the operating system is Ubuntu 20.04, with the PyTorch 1.8 deep learning framework and Python 3.8. SGD is selected as the network optimizer, with momentum 0.9, initial learning rate 0.01 and weight decay coefficient 0.0001; a step strategy adjusts the learning rate, the batch size is 18, and 24 epochs are run in total, with the learning rate reduced at the 18th and 22nd epochs. The same data enhancement methods are used throughout the experiments, including color gamut enhancement, affine transformation, random cropping, random rotation, random scale transformation and random flipping.
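The optimizer configuration above corresponds to a PyTorch setup along these lines (the decay factor 0.1 at each milestone is an assumption — the text gives only the epochs at which the rate is reduced; the single `Conv2d` stands in for the full detection network):

```python
import torch

model = torch.nn.Conv2d(3, 64, 3)  # stand-in for the detection network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=0.0001)
# Step schedule from the experiment: 24 epochs, lr reduced at epochs 18 and 22.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[18, 22], gamma=0.1)
for epoch in range(24):
    # ... one training pass over the data (batch size 18) would go here ...
    scheduler.step()
```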
The method is compared with some existing classical target detection algorithms on the DIOR and NWPU VHR-10 public remote sensing image datasets. With the same configuration, it is compared on the NWPU VHR-10 dataset against the YOLOv3, Faster R-CNN with FPN, Cascade R-CNN with FPN, RetinaNet and CenterNet networks, with results shown in Table 1.
TABLE 1 Comparison of detection results of different algorithms on the NWPU VHR-10 dataset
Note: bold represents the optimal value. The current column goodness is underlined.
FIG. 7 compares example detections on the DIOR dataset with the baseline CenterNet network. FIG. 7a shows example detection results of CenterNet, and FIG. 7b shows example detection results of the method of the present invention.
FIG. 8 compares example detections on the NWPU VHR-10 dataset with the CenterNet method. FIG. 8a shows example detection results of CenterNet, and FIG. 8b shows example detection results of the method of the present invention. As can be seen from figs. 7 and 8, the detection precision and accuracy of the method of the present invention are significantly better than those of the comparison methods, especially for small targets.
It should be noted that the above-mentioned embodiments are exemplary; those skilled in the art, having the benefit of the present disclosure, may devise various arrangements which, although not explicitly described herein, embody the principles of the invention and fall within its scope. It should be understood that the specification and figures are illustrative only and do not limit the claims. The scope of the invention is defined by the claims and their equivalents.
Claims (3)
1. A remote sensing image target detection method based on multi-scale feature fusion and feature enhancement, characterized in that the method uses an adaptive multi-scale feature fusion module to perform top-down feature fusion on features of different resolutions while adopting lateral connections to increase communication between adjacent features, and then uses an attention feature enhancement module combining multi-branch dilated convolution with an attention mechanism to enhance effective feature information while improving the generalization capability of the network to target scales and improving the target detection capability, the method specifically comprising the following steps:
Step 1: extracting features of the input remote sensing image through a backbone network, wherein the backbone adopts a ResNet network; the remote sensing image is input to the backbone and, after multiple groups of convolution and pooling, the last four stages of the ResNet output a multi-scale feature map group {C1, C2, C3, C4} with different resolutions;
Step 2: adjusting the number of feature channels by applying one 1×1 convolution to each map in the multi-scale feature map group {C1, C2, C3, C4}, so that the channel numbers match that of the shallowest feature map C1, obtaining the feature map group {P1, P2, P3, P4};
Step 3: performing feature fusion on the feature map group {P1, P2, P3, P4} obtained in step 2 with the adaptive multi-scale feature fusion module, comprising a first top-down fusion stage, a bottom-up fusion stage and a second top-down fusion stage, specifically:
step 31: the first top-down fusion stage, which introduces transverse connection in the fusion process, and gradually fuses from the deepest characteristic diagramP 4 Is started by mixingP 4 After upsampling andP 3 are fused to obtain,P 3 After upsampling andare fused to obtain,P 2 After upsampling andare fused to obtainCompleting the first forward propagation; then will beAfter upsampling andare fused to obtainWill beAfter upsampling andare fused to obtainCompleting the second forward propagation; finally will beAfter upsampling andare fused to obtainCompleting the third forward propagation to finally obtain a feature map group;
Step 32: the bottom-up fusion stage starts from the shallowest feature of the feature map group obtained in step 31: the shallowest fused feature is downsampled by a factor of two and fused with the intermediate and original features of the next deeper level; the result is again downsampled by a factor of two and fused with the intermediate and original features of the following level; finally the result is downsampled by a factor of two and fused with P4, yielding the bottom-up feature map group;
Step 33: in the second top-down fusion stage, starting from the deepest feature of the feature map group obtained in step 32, the features are upsampled and added layer by layer to obtain a high-resolution first feature map Pout of size P/4;
Step 4: the first feature map Pout obtained in step 33 is input to the attention feature enhancement module for feature enhancement; the attention feature enhancement module comprises a multi-branch dilated convolution module and a mixed attention mechanism module, each branch of the multi-branch dilated convolution module having a different dilation rate; the features of the first feature map Pout after convolution with the different dilation rates are fused to obtain a second feature map F1;
Step 5: the second feature map F1 is input to the mixed attention mechanism module to suppress background and noise; the mixed attention mechanism module comprises a channel domain attention module and a spatial domain attention module, and the second feature map F1 is processed by the channel domain attention module and the spatial domain attention module to obtain a third feature map Fout;
Step 6: the final detection result is obtained through classification and regression; the third feature map Fout output in step 5 passes through three 3×3 convolution branches to obtain a center-point prediction result, a center-point offset prediction result and a target width-height prediction result, and the three prediction results are fused to obtain the final prediction result.
2. The remote sensing image target detection method based on multi-scale feature fusion and feature enhancement according to claim 1, wherein the specific process by which the mixed attention mechanism module obtains the third feature map Fout comprises the following steps:
Step 51: the second feature map F1 obtained in step 4 is input to the channel domain attention module; global average pooling GAP first sums and averages all feature values of each channel, converting each two-dimensional feature map into a real number and yielding a C×1×1 vector, where C denotes the number of channels; global average pooling GAP and global max pooling GMP are used simultaneously along the channel dimension, and the two pooled vectors are sent to 2 fully-connected layers for training and learning to obtain 2 one-dimensional channel weight sequences; the 2 channel weight sequences are added and mapped to [0,1] by a Sigmoid activation function, finally yielding 1 weight sequence, which is used to weight the second feature map F1 to obtain an intermediate feature map, completing the channel domain attention operation;
Step 52: the intermediate feature map passes through global average pooling GAP and global max pooling GMP to obtain 2 single-channel feature maps, which are connected along the channel dimension and passed through a convolution operation to obtain a spatial domain attention feature map; after a Sigmoid activation function maps the values to [0,1], the spatial domain attention weights are obtained and multiplied with the intermediate feature map to obtain the final third feature map Fout.
3. The remote sensing image target detection method based on multi-scale feature fusion and feature enhancement according to claim 2, wherein the feature fusion module in step 3 further considers that features of different resolutions contribute differently to the fused features, and adds learnable weight coefficients to achieve adaptive fusion, thereby improving the scale invariance of the features; the specific implementation process is as follows:
Firstly, the resolutions of the multi-scale features to be fused are adjusted to be consistent, by the following means:
(1) in the first top-down stage, the deep features are upsampled by a factor of two using nearest-neighbor interpolation;
(2) in the bottom-up stage, the shallow features are downsampled by a factor of two using max pooling; the adjusted features are multiplied by their corresponding weight coefficients, added element by element, and finally fused through a Swish activation function, convolution and batch normalization.
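Under the assumptions stated in the claim, the weighted adaptive fusion (after the features have been resized to a common resolution) can be sketched as follows (class and parameter names are illustrative; `silu` is PyTorch's implementation of the Swish activation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    """Adaptive fusion sketch from claim 3: each input feature is scaled by
    a learnable weight coefficient, the scaled features are added element by
    element, then Swish activation, 3x3 convolution and batch normalization."""
    def __init__(self, channels, num_inputs):
        super().__init__()
        self.w = nn.Parameter(torch.ones(num_inputs))  # learnable weights
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, feats):  # feats: list of maps already at the same H x W
        fused = sum(w * f for w, f in zip(self.w, feats))
        return self.bn(self.conv(F.silu(fused)))  # SiLU == Swish
```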
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210614648.2A CN114708511B (en) | 2022-06-01 | 2022-06-01 | Remote sensing image target detection method based on multi-scale feature fusion and feature enhancement |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114708511A CN114708511A (en) | 2022-07-05 |
CN114708511B true CN114708511B (en) | 2022-08-16 |
Legal Events
Date | Code | Title | Description
---|---|---|---
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |