CN117809198A - Remote sensing image saliency detection method based on multi-scale feature aggregation network - Google Patents

Remote sensing image saliency detection method based on multi-scale feature aggregation network

Info

Publication number
CN117809198A
CN117809198A (application CN202410025432.1A)
Authority
CN
China
Prior art keywords
feature
features
level
scale
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410025432.1A
Other languages
Chinese (zh)
Inventor
朱晨薇
颜成钢
张继勇
周晓飞
江劭玮
赵强
王鸿奎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202410025432.1A
Publication of CN117809198A
Legal status: Pending

Classifications

    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D — CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a remote sensing image saliency detection method based on a multi-scale feature aggregation network. First, a data set is collected and augmented; features are then extracted by two backbone networks. A multi-scale feature guidance module applies dilated convolutions with different dilation rates and feature-attention guidance operations, and simultaneously fuses feature information of different scales from the backbone branches. The resulting features of the two branches are aligned and aggregated by a feature alignment module using deformable convolution (DCN); finally, the aggregated features are decoded to generate prediction maps under loss supervision. The invention provides a dual-branch network that aggregates the features of two backbone networks: multi-scale features from the different branches are fused by the multi-scale feature guidance module, so that contextual relationships between features are modeled better, the model's perception of targets at different scales is enhanced, and the recognition and localization accuracy of targets is improved.

Description

Remote sensing image saliency detection method based on multi-scale feature aggregation network
Technical Field
The invention belongs to the field of image processing, and particularly relates to a remote sensing image saliency detection method based on multi-scale feature aggregation of Transformers and CNNs.
Background
Visual attention mechanisms aim to capture the most attention-grabbing regions of a scene and play an important role in the human visual system. Human attention is easily drawn to unique objects or regions in an image. By mimicking this visual attention system, salient object detection (SOD) aims to accurately locate the most attention-grabbing objects or regions.
With the development of the aerospace industry, the volume of remote sensing image data has increased greatly, and how to quickly and effectively extract useful information from remote sensing images has become the primary problem in exploiting such data. Because of the particular imaging mode of remote sensing images, an image usually contains many targets over a large spatial range, yet only a few of them attract human visual attention (i.e., salient targets). How to filter out redundant information and highlight the salient targets is therefore a primary task of remote sensing image processing.
Most existing salient object detection models for remote sensing images are built on convolutional neural networks, and their detection accuracy and generalization are markedly better than those of traditional methods. Convolutional neural networks (CNNs) are adept at extracting contextual features within limited receptive fields, while Transformers can model global long-range dependencies. However, although the Transformer performs well on many computer vision tasks, some specific tasks such as object detection or image segmentation may require stronger spatial information modeling, and a plain Transformer cannot fully characterize the local spatial details of each patch. The invention therefore proposes a new remote sensing image saliency detection model based on a dual-branch architecture that aggregates Transformer and CNN features under multi-scale feature guidance.
Disclosure of Invention
The technical problem the invention aims to solve is the following: the diversity in the number of targets in optical RSIs, the diversity of target scales, and the complexity of backgrounds make it difficult to accurately detect salient targets with complete positions and structures. Most existing salient object detection models for remote sensing images are built on convolutional neural networks, and although their detection accuracy and generalization clearly surpass traditional methods, they are not designed for the characteristics of remote sensing images; their detection results therefore cannot meet practical application requirements, and missing or incomplete targets are common problems.
The technical solution adopted by the invention is as follows: the invention provides a remote sensing image saliency detection method based on a multi-scale feature aggregation network. The method is designed for the characteristics of remote sensing image datasets. Parallel branch encoders are introduced, using a Transformer and a CNN to extract global context information and local detail features respectively, which improves the efficiency of global context modeling without losing low-level detail localization ability. However, because Transformer and CNN models are usually applied in different fields, the features they extract usually have different dimensions and representation spaces, and blindly concatenating or adding them directly can confuse the feature space and make it difficult for the model to learn effective representations. The invention addresses this by first extracting effective features of the same resolution with a multi-scale feature guidance module and then using feature alignment to better aggregate the advantages of the two kinds of features.
Structural overview of the remote sensing image saliency detection model: first, a VGG16 network and a Swin-Transformer network are used to extract backbone features from the input remote sensing image. Then, the features of corresponding levels of the two branches are used as input to a multi-scale feature guidance module. Each of the two features is convolved by dilated convolution layers with four different dilation rates (1, 3, 5, and 7), yielding eight new feature tensors that represent the features extracted at the different dilation rates. The Transformer feature tensors are processed by a channel attention mechanism to obtain importance weights in the channel dimension, which are multiplied element-wise with the VGG16 features of the corresponding dilation rate; the VGG16 feature tensors are processed by a spatial attention mechanism to obtain importance weights in the spatial dimension, which are multiplied element-wise with the Transformer features of the corresponding dilation rate. The results are then added element-wise to the corresponding un-attended features of each branch, and the four features of different dilation rates of the same branch are concatenated in the channel dimension. In this way, feature information of different scales from the Transformer and VGG16 is fused simultaneously, improving the perception ability and global expression ability of the model. A feature alignment module then performs feature alignment and aggregation on the two output features of the multi-scale feature guidance module using deformable convolution (DCN), and finally the aggregated features are decoded and the prediction map of every level is supervised. The specific structure of the model is shown in fig. 1.
The remote sensing image saliency detection method based on the multi-scale feature aggregation network comprises the following steps:
step 1: collection and expansion of data sets;
the data set must include remote sensing images of different categories and different environments, to avoid overly homogeneous images and weak network generalization; the data set is then augmented to compensate for the small number of images and reduce the risk of overfitting.
Step 2: and extracting the characteristics through a backbone network.
After preprocessing the input image, feature extraction is performed with two backbone networks, VGG16 and Swin-Transformer.
Step 3: a multi-scale feature guidance module is employed to simultaneously fuse feature information of different scales from the Swin-Transformer and VGG16 branches, using dilated convolutions with different dilation rates and feature-attention guidance operations.
Step 4: performing feature alignment and aggregation on the features of the two branches obtained in step 3 by using a deformable convolution (DCN) in a feature alignment module;
step 5: decoding the final features to generate prediction maps and performing loss supervision.
The features output by step 4 are decoded and the outputs are supervised by the loss.
Further, the specific method in the step 2 is as follows:
After preprocessing the input image, features are extracted with the two backbone networks VGG16 and Swin-Transformer. The features extracted by the VGG16 network are divided into 5 levels from low to high, while the Swin-Transformer yields 4 levels. The lowest-level VGG16 features, regarded as level 0, are not used for information aggregation; the remaining 4 levels, together with the corresponding 4 levels of Swin-Transformer features, are kept for subsequent operations.
Further, the specific method in the step 3 is as follows:
The multi-scale feature guidance module is applied only to the level-1, level-2, and level-3 features of the two branches. At each level, the Swin-Transformer input of the module is the initial feature obtained by the Swin-Transformer backbone in step 2. For the VGG16 branch, the level-1 input is the initial feature obtained by the VGG16 backbone; for the second and third levels, the initial feature is concatenated in the channel dimension with the output of the previous level's multi-scale feature guidance module to form the input of the corresponding level's module.
Then, four dilated convolution layers with dilation rates {1, 3, 5, 7} are applied to the inputs of the two branches, producing eight new feature tensors that represent the features extracted at the different dilation rates. The Transformer feature tensors are processed by a channel attention mechanism to obtain importance weights in the channel dimension, which are multiplied element-wise with the VGG16 feature tensors of the corresponding dilation rate; the VGG16 feature tensors are processed by a spatial attention mechanism to obtain importance weights in the spatial dimension, which are multiplied element-wise with the Transformer feature tensors of the corresponding dilation rate. The results are then added element-wise to the corresponding un-attended features of the same branch. In this way, feature information of different scales from the Transformer and VGG16 is fused simultaneously, making full use of the Transformer's ability to model long-range (sequence) information and VGG16's ability to extract local features.
Finally, the four attention-guided feature tensors of the same branch, corresponding to the four different dilation rates, are concatenated along the channel dimension to obtain a feature that captures context information of different scales, improving the perception and global expression ability of the model. This feature is used as the input of the next module for subsequent operations.
Further, the specific method in the step 4 is as follows:
A deformable convolution (DCN) is used to align and aggregate the output features of the two branches obtained from the multi-scale feature guidance module in step 3. By introducing deformable convolution, the shape and position of the convolution kernel can be adjusted adaptively during feature alignment, improving the representation and generalization capacity of the model.
The invention has the following beneficial effects:
With the above technical scheme, high-level global context features and low-level spatial details from Transformers and CNNs can be fused effectively. In the salient object detection task, global semantic information is used to locate salient objects, while local spatial cues indicate their boundary details, so global semantic cues and local spatial details are equally important. CNN-based features have translational invariance and spatial inductive bias, but their limited receptive fields suffer from long-range dependency problems, whereas a Transformer can establish long-range dependencies through its self-attention mechanism. Transformer and CNN models are typically applied in different fields, so the features they extract typically have different dimensions and representation spaces, and blindly concatenating or adding these features may confuse the feature space and make it difficult for the model to learn an effective representation.
Based on the above, we propose a dual-branch network to aggregate the features of both the Transformer and the CNN. Multi-scale features from the different branches are fused by the multi-scale feature guidance module using dilated convolutions with different dilation rates and attention mechanisms; the module further uses attention between the Transformer and VGG16 branches to guide the importance and correlation of features, better modeling the contextual relationships between features, enhancing the model's perception of targets at different scales, and improving target recognition and localization accuracy. A deformable convolution (DCN) is used to fuse the Transformer and CNN features and align them in semantic space, so that correlations between the different modalities are captured better and model performance is improved.
Drawings
FIG. 1 is a diagram of a remote sensing image significance detection model structure according to an embodiment of the present invention;
FIG. 2 is a block diagram of a multi-scale feature guidance module according to an embodiment of the invention;
FIG. 3 is a block diagram of a feature alignment module according to an embodiment of the present invention;
FIG. 4 shows PR curves for an embodiment of the present invention;
FIG. 5 shows F-measure curves for an embodiment of the present invention.
Detailed Description
The technical scheme of the invention is further described below with reference to the drawings and embodiments.
Fig. 1 is a schematic diagram of a remote sensing image saliency detection model according to an embodiment of the present invention, and a remote sensing image saliency detection method based on a multi-scale feature aggregation network includes the following steps:
step 1: collection and expansion of data sets
The data set must include remote sensing images of different categories and different environments, to avoid overly homogeneous images and weak network generalization; the data set is then augmented to compensate for the small number of images and reduce the risk of overfitting. The training data set adopted in the embodiment of the invention is EORSSD, which contains 1400 images and corresponding ground-truth maps. The EORSSD dataset is a challenging public dataset because of its diverse types of salient objects and their varying number and size; some scenes even contain no salient objects, e.g., scenes with only desert or forest backgrounds and no foreground. In addition, the background of some EORSSD images contains complex semantic information such as building shadows and shadows of the salient targets.
The EORSSD training set is enhanced by rotating each image by 90°, 180°, and 270° and mirroring all of these images. In this way, the EORSSD training set contains 11200 examples, an eight-fold augmentation.
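For illustration, a minimal sketch of this eight-fold augmentation (rotations of 0°, 90°, 180°, and 270°, each with and without mirroring) is given below, assuming PIL image/ground-truth pairs; the function name and structure are illustrative rather than the authors' implementation:

```python
from PIL import Image

def augment_pair(image: Image.Image, mask: Image.Image):
    """Return the 8 augmented (image, mask) variants for one training sample."""
    pairs = []
    for angle in (0, 90, 180, 270):
        rot_img = image.rotate(angle, expand=True)
        rot_msk = mask.rotate(angle, expand=True)
        pairs.append((rot_img, rot_msk))
        # mirrored (horizontally flipped) version of the rotated pair
        pairs.append((rot_img.transpose(Image.FLIP_LEFT_RIGHT),
                      rot_msk.transpose(Image.FLIP_LEFT_RIGHT)))
    return pairs  # 4 rotations x 2 (with/without mirror) = 8 variants
```

Applied to the 1400 EORSSD training samples, this yields the 11200 examples mentioned above.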
Step 2: and extracting the characteristics through a backbone network.
First, the input image is preprocessed: it is resized to 384×384 and converted to the Tensor type recognizable by the model, with pixel values mapped from [0, 255] to [-1, 1]. This preprocessing improves the execution efficiency of the model. The preprocessed input image is then fed to the VGG16 backbone and the Swin-Transformer backbone for feature extraction. Since VGG16 consists of five sequentially stacked convolution blocks, its extracted features are divided into 5 levels from low to high, denoted F_vi (i = 0, 1, 2, 3, 4). The features extracted by the Swin-Transformer are divided into 4 levels from low to high, denoted F_si (i = 1, 2, 3, 4). Each block contains convolution operations with different layers and output sizes, so the extracted feature sizes differ: F_v0 ∈ R^(64×192×192), F_v1 ∈ R^(128×96×96), F_v2 ∈ R^(256×48×48), F_v3 ∈ R^(512×24×24), F_v4 ∈ R^(512×12×12), F_s1 ∈ R^(128×96×96), F_s2 ∈ R^(256×48×48), F_s3 ∈ R^(512×24×24), F_s4 ∈ R^(1024×12×12). Features at different levels contain different spatial information.
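For concreteness, the following sketch shows one plausible way to realize the preprocessing and dual-backbone feature extraction in PyTorch. The use of torchvision's VGG16, the timm model 'swin_base_patch4_window12_384' (assuming a timm version whose Swin models support features_only), and the VGG16 block split points are assumptions chosen so that the feature shapes match those listed above, not the authors' exact code:

```python
import torch
import torchvision.transforms as T
from torchvision.models import vgg16
import timm  # assumed source of the Swin-Transformer backbone

# Preprocessing: resize to 384x384, convert to a tensor, map [0, 255] -> [-1, 1]
preprocess = T.Compose([
    T.Resize((384, 384)),
    T.ToTensor(),                                 # [0, 255] -> [0, 1]
    T.Normalize(mean=[0.5] * 3, std=[0.5] * 3),   # [0, 1]   -> [-1, 1]
])

# VGG16 backbone split into five convolution blocks, giving F_v0..F_v4
vgg = vgg16(weights="IMAGENET1K_V1").features
vgg_blocks = [vgg[:5], vgg[5:10], vgg[10:17], vgg[17:24], vgg[24:31]]

# Swin-Transformer backbone with four stages, giving F_s1..F_s4
swin = timm.create_model('swin_base_patch4_window12_384',
                         pretrained=True, features_only=True)

def extract_features(x: torch.Tensor):
    """x: (B, 3, 384, 384) preprocessed batch -> ([F_v0..F_v4], [F_s1..F_s4])."""
    f_v, feat = [], x
    for block in vgg_blocks:
        feat = block(feat)
        f_v.append(feat)          # 64x192x192, 128x96x96, ..., 512x12x12
    f_s = list(swin(x))           # 128x96x96, 256x48x48, ..., 1024x12x12
    # note: depending on the timm version, Swin stages may return NHWC tensors
    # and then need .permute(0, 3, 1, 2) to become NCHW like the VGG features
    return f_v, f_s
```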
Step 3: a multi-scale feature guidance module is employed to simultaneously fuse feature information of different scales from the Swin-Transformer and VGG16 branches, using dilated convolutions with different dilation rates and feature-attention guidance operations.
The process is shown in fig. 2. After the preliminary features are obtained, the preliminary features F_si (i = 1, 2, 3) of the Swin-Transformer branch are used as the input to the multi-scale feature guidance module of the corresponding level. In the VGG16 branch, the first-level input is the preliminary feature F_v1, while for the second and third levels the preliminary features F_vi (i = 2, 3) are concatenated in the channel dimension with the output features of the previous level's multi-scale feature guidance module, and the result is used as the input to the corresponding level's module.
the input of two branches is respectively subjected to feature extraction by using expansion convolution with expansion rates of 1,3,5 and 7, and four different features of corresponding levels on each branch are obtained through one layer of batch normalizationAndthe formula is as follows:
where BN (-) represents a batch normalization layer operation, conv m Indicating an expansion ratio ofm, convolution layer operation.
In the Swin-Transformer branch, a channel attention mechanism is applied to the dilated-convolution features to enhance information interaction between channels and improve feature expression. Meanwhile, in the VGG16 branch, a spatial attention mechanism is applied to the dilated-convolution features to enhance the importance of different spatial locations in the feature map and help capture more accurate spatial information. The features processed by channel or spatial attention are multiplied element-wise with the features of the other branch at the same dilation rate, added to the original features of that branch (without attention), and passed through a batch normalization layer and a ReLU layer, which effectively extracts and fuses the feature information. The attention operations are defined as:
SA(F) = σ(Conv_3×3(Concat(AvgPool(F), MaxPool(F))))  (5)
CA(F) = σ(Conv_1×1(ReLU(Conv_1×1(MaxPool(F)))))  (6)
where BR(·) denotes batch normalization and ReLU layer operations, SA(·) denotes the spatial attention operation, CA(·) denotes the channel attention operation, MaxPool(·) denotes adaptive max pooling, AvgPool(·) denotes average pooling, σ(·) denotes the sigmoid operation, Conv_1×1 denotes a 1×1 convolution for channel compression, Conv_3×3 denotes a 3×3 convolution for channel compression, the multiplication between features denotes matrix multiplication, and Concat(·) denotes concatenation of features along the channel dimension.
The four attention-guided features of the corresponding level of each of the two branches are then concatenated along the channel dimension, giving one aggregated feature per branch that fully exploits the semantic relationships between features of different dilation rates.
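To make these operations concrete, the following sketch implements one level of the multi-scale feature guidance module in PyTorch. The class names, channel bookkeeping (both branch inputs are assumed to have the same channel count at a given level), the attention reduction ratio, and the reading of the pooling in Eqs. (5) and (6) as channel-wise pooling are assumptions for illustration, not the authors' exact implementation:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """CA(F) = sigmoid(Conv1x1(ReLU(Conv1x1(MaxPool(F))))), cf. Eq. (6)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveMaxPool2d(1)
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())
    def forward(self, x):
        return self.fc(self.pool(x))            # (B, C, 1, 1) channel weights

class SpatialAttention(nn.Module):
    """SA(F) = sigmoid(Conv3x3(Concat(AvgPool(F), MaxPool(F)))), cf. Eq. (5)."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=3, padding=1)
    def forward(self, x):
        avg = torch.mean(x, dim=1, keepdim=True)      # channel-wise average pooling
        mx, _ = torch.max(x, dim=1, keepdim=True)     # channel-wise max pooling
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class MultiScaleFeatureGuidance(nn.Module):
    """One level of the multi-scale feature guidance module (illustrative)."""
    def __init__(self, channels, rates=(1, 3, 5, 7)):
        super().__init__()
        self.dilated_s = nn.ModuleList(
            [nn.Sequential(nn.Conv2d(channels, channels, 3, padding=r, dilation=r),
                           nn.BatchNorm2d(channels)) for r in rates])
        self.dilated_v = nn.ModuleList(
            [nn.Sequential(nn.Conv2d(channels, channels, 3, padding=r, dilation=r),
                           nn.BatchNorm2d(channels)) for r in rates])
        self.ca = nn.ModuleList([ChannelAttention(channels) for _ in rates])
        self.sa = nn.ModuleList([SpatialAttention() for _ in rates])
        self.br_s = nn.ModuleList(
            [nn.Sequential(nn.BatchNorm2d(channels), nn.ReLU(inplace=True)) for _ in rates])
        self.br_v = nn.ModuleList(
            [nn.Sequential(nn.BatchNorm2d(channels), nn.ReLU(inplace=True)) for _ in rates])

    def forward(self, f_s, f_v):
        out_s, out_v = [], []
        for i in range(len(self.dilated_s)):
            ds = self.dilated_s[i](f_s)          # Swin branch, dilation rate r_i
            dv = self.dilated_v[i](f_v)          # VGG branch, dilation rate r_i
            # channel attention from the Transformer branch guides the VGG features,
            # spatial attention from the VGG branch guides the Transformer features
            gv = self.br_v[i](dv + self.ca[i](ds) * dv)
            gs = self.br_s[i](ds + self.sa[i](dv) * ds)
            out_s.append(gs)
            out_v.append(gv)
        # concatenate the four dilation rates of each branch along the channel dim
        return torch.cat(out_s, dim=1), torch.cat(out_v, dim=1)
```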
step 4: for the two output features of the multi-scale feature guidance module obtained aboveAnd->Performing characteristic alignment polymerization;
To reduce the number of parameters, the channel number of the fourth-level feature F_s4 of the Swin-Transformer branch is compressed from 1024 to 512. This compressed feature and the features obtained in step 3 are all enhanced by a channel attention mechanism. The fourth-level feature of the VGG16 branch is concatenated in the channel dimension with the outputs of the preceding multi-scale feature guidance modules to obtain the corresponding feature.
Because the features extracted by VGG16 and the Swin-Transformer differ semantically and spatially, blind concatenation can destroy the correlation between the features, making it difficult for the model to learn effective feature representations, and pixel offsets can arise during the feature aggregation between the different feature maps. The deformable convolution (DCN) is used to learn the feature offsets during the aggregation of the two branch features and then uses these offsets to guide feature alignment. As shown in fig. 3, the deformable convolution operates on the features concatenated along the channel dimension to achieve feature alignment and produce the aligned feature.
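As an illustration of this alignment step, the sketch below uses torchvision's DeformConv2d: a plain convolution predicts per-pixel sampling offsets from the channel-concatenated branch features, and the deformable convolution aggregates them into one aligned feature. The module structure and channel arguments are assumptions, not the authors' exact implementation:

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class FeatureAlignment(nn.Module):
    """Illustrative feature-alignment module based on deformable convolution."""
    def __init__(self, in_channels_s, in_channels_v, out_channels, kernel_size=3):
        super().__init__()
        in_channels = in_channels_s + in_channels_v
        # 2 offsets (x, y) per kernel sampling location
        self.offset = nn.Conv2d(in_channels, 2 * kernel_size * kernel_size,
                                kernel_size, padding=kernel_size // 2)
        self.dcn = DeformConv2d(in_channels, out_channels,
                                kernel_size, padding=kernel_size // 2)
        self.post = nn.Sequential(nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True))

    def forward(self, f_s, f_v):
        x = torch.cat([f_s, f_v], dim=1)      # concatenate along the channel dim
        offsets = self.offset(x)               # learned per-pixel sampling offsets
        return self.post(self.dcn(x, offsets))
```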
step 5: feature decoding generates a prediction graph.
The feature processed by the feature alignment module contains rich saliency information and is then decoded. The decoding operation mainly consists of a convolution layer, a batch normalization layer, and a ReLU activation function, denoted CBR(·). The decoding input of the fourth level is the output feature of the corresponding level's feature alignment module; the decoding inputs of the first, second, and third levels are the decoded output of the previous level concatenated along the channel dimension with the output feature of the corresponding level's feature alignment module. To obtain side-output saliency maps, the decoded output features of each level are sent to a 1×1 convolution layer that reduces the channel dimension to 1, restored to the original size with a bilinear upsampling operation up(·), and finally passed through a sigmoid activation that maps the feature values to the range [0, 1], giving prediction maps S_i ∈ R^(1×384×384). BCE loss supervision is applied to the prediction output of each level.
where the upsampling factor of up(·) at level i is 2^(i+1).
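A sketch of this decoding and deep-supervision scheme is given below; the per-level channel counts, the intermediate decoder width, and the use of ×2 bilinear upsampling between decoder levels are assumptions chosen to be consistent with the feature sizes listed in step 2:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def cbr(in_ch, out_ch):
    """CBR(.): convolution + batch normalization + ReLU, as used for decoding."""
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                         nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

class Decoder(nn.Module):
    """Illustrative decoder: level 4 decodes its aligned feature directly; levels
    3..1 decode the previous decoder output (upsampled x2) concatenated with the
    aligned feature of that level; each level emits a supervised side output."""
    def __init__(self, channels=(128, 256, 512, 512), mid=64):
        super().__init__()
        c1, c2, c3, c4 = channels              # assumed per-level channel counts
        self.dec4 = cbr(c4, mid)
        self.dec3 = cbr(c3 + mid, mid)
        self.dec2 = cbr(c2 + mid, mid)
        self.dec1 = cbr(c1 + mid, mid)
        self.heads = nn.ModuleList([nn.Conv2d(mid, 1, 1) for _ in range(4)])

    def forward(self, aligned):                # aligned = [A1, A2, A3, A4]
        up2 = lambda t: F.interpolate(t, scale_factor=2, mode='bilinear',
                                      align_corners=False)
        d4 = self.dec4(aligned[3])
        d3 = self.dec3(torch.cat([aligned[2], up2(d4)], dim=1))
        d2 = self.dec2(torch.cat([aligned[1], up2(d3)], dim=1))
        d1 = self.dec1(torch.cat([aligned[0], up2(d2)], dim=1))
        preds = []
        for i, d in enumerate([d1, d2, d3, d4]):
            logit = self.heads[i](d)
            # upsample back to 384x384 (factors 4x, 8x, 16x, 32x, i.e. 2^(i+1))
            logit = F.interpolate(logit, size=(384, 384), mode='bilinear',
                                  align_corners=False)
            preds.append(torch.sigmoid(logit))
        return preds                           # S_1..S_4, each (B, 1, 384, 384)

def supervision_loss(preds, gt):
    """BCE loss on every side-output prediction map."""
    return sum(F.binary_cross_entropy(p, gt) for p in preds)
```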
Model training details:
the model is built based on pytorch. Training was performed using the training set of eossod dataset and convergence was monitored using the test set. The entire model was trained using Adam optimizer. The batch size was set to 4 and epoch was set to 100.
Model experiment results:
Compared with existing salient target detection methods for remote sensing, the proposed method performs excellently on the EORSSD dataset, which demonstrates its feasibility. For MAE, lower values indicate better performance; for the other metrics, higher values are better.
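The exact metric definitions are not given in the patent; for reference, the two most common saliency metrics (MAE, lower is better, and the F-measure with the conventional β² = 0.3, higher is better) can be computed as in the following sketch:

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between a prediction map and ground truth in [0, 1]."""
    return np.abs(pred - gt).mean()

def f_measure(pred, gt, beta2=0.3, threshold=0.5):
    """F-measure with the commonly used beta^2 = 0.3 (an assumption here)."""
    binary = pred >= threshold
    tp = np.logical_and(binary, gt > 0.5).sum()
    precision = tp / (binary.sum() + 1e-8)
    recall = tp / ((gt > 0.5).sum() + 1e-8)
    return (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
```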
we plotted PR curves (fig. 4) and F-measure curves (fig. 5) for our method and some of the most advanced methods on the eossd dataset. It can be seen that the PR curve of our method is closer to (1, 1) than the other models, and that the area under the F-measure curve of our method is also the largest on both datasets. This strongly demonstrates the effectiveness and superiority of our method.
The foregoing is a further detailed description of the invention in connection with specific/preferred embodiments, and the invention is not limited to this description. It will be apparent to those skilled in the art that several alternatives or modifications can be made to the described embodiments without departing from the spirit of the invention, and such alternatives or modifications should be considered to fall within the scope of the invention.
Parts of the invention not described in detail belong to the common knowledge of those skilled in the art.

Claims (4)

1. The remote sensing image saliency detection method based on the multi-scale feature aggregation network is characterized by comprising the following steps of:
step 1: collection and expansion of data sets;
the data set must include remote sensing images of different categories and different environments, to avoid overly homogeneous images and weak network generalization; secondly, the data set is augmented to compensate for the small number of images and reduce the risk of overfitting;
step 2: extracting features through a backbone network;
after preprocessing the input image, features are extracted with the two backbone networks VGG16 and Swin-Transformer;
step 3: adopting a multi-scale feature guidance module, using dilated convolutions and feature-attention guidance operations with different dilation rates, and simultaneously fusing feature information of different scales from the Swin-Transformer and VGG16 branches;
step 4: performing feature alignment and aggregation on the features of the two branches obtained in step 3 by using a deformable convolution (DCN) in a feature alignment module;
step 5: decoding the final features to generate prediction maps and performing loss supervision;
the features output by step 4 are decoded and the outputs are supervised by the loss.
2. The method for detecting the saliency of the remote sensing image based on the multi-scale feature aggregation network according to claim 1, wherein the specific method in the step 2 is as follows:
after preprocessing the input image, features are extracted with the two backbone networks VGG16 and Swin-Transformer, wherein the features extracted by the VGG16 network are divided into 5 levels from low to high and the Swin-Transformer yields 4 levels; the first level of VGG16 features is regarded as level 0 and is not used, and the remaining 4 levels, together with the corresponding 4 levels of Swin-Transformer features, are kept for subsequent operations.
3. The method for detecting the saliency of the remote sensing image based on the multi-scale feature aggregation network according to claim 2, wherein the specific method in the step 3 is as follows:
the multi-scale feature guidance module is applied only to the level-1, level-2, and level-3 features of the two branches, wherein at each level the Swin-Transformer input of the module is the initial feature obtained by the Swin-Transformer backbone in step 2; for the VGG16 branch, the level-1 input is the initial feature obtained by the VGG16 backbone, while for the second and third levels the initial features are concatenated in the channel dimension with the output of the previous level's multi-scale feature guidance module to form the input of the corresponding level's module;
then, four dilated convolution layers with dilation rates {1, 3, 5, 7} are applied to the inputs of the two branches to obtain eight new feature tensors representing the features extracted at the different dilation rates; the Transformer feature tensors are processed by a channel attention mechanism to obtain importance weights in the channel dimension, which are multiplied element-wise with the VGG16 feature tensors of the corresponding dilation rate; the VGG16 feature tensors are processed by a spatial attention mechanism to obtain importance weights in the spatial dimension, which are multiplied element-wise with the Transformer feature tensors of the corresponding dilation rate; the results are then added element-wise to the corresponding feature tensors of each branch;
finally, the attention-guided feature tensors of the same level and branch, corresponding to the different dilation rates, are concatenated along the channel dimension to obtain a feature that captures context information of different scales; this feature is used as the input of the next module for subsequent operations.
4. The method for detecting the saliency of a remote sensing image based on a multi-scale feature aggregation network according to claim 3, wherein the specific method in the step 4 is as follows: a deformable convolution (DCN) is used to align and aggregate the output features of the two branches obtained from the multi-scale feature guidance module in step 3; by introducing deformable convolution, the shape and position of the convolution kernel can be adjusted adaptively during feature alignment, improving the representation and generalization capacity of the model.
CN202410025432.1A 2024-01-08 2024-01-08 Remote sensing image saliency detection method based on multi-scale feature aggregation network Pending CN117809198A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410025432.1A CN117809198A (en) 2024-01-08 2024-01-08 Remote sensing image saliency detection method based on multi-scale feature aggregation network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410025432.1A CN117809198A (en) 2024-01-08 2024-01-08 Remote sensing image saliency detection method based on multi-scale feature aggregation network

Publications (1)

Publication Number Publication Date
CN117809198A true CN117809198A (en) 2024-04-02

Family

ID=90433689

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410025432.1A Pending CN117809198A (en) Remote sensing image saliency detection method based on multi-scale feature aggregation network

Country Status (1)

Country Link
CN (1) CN117809198A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117994506A (en) * 2024-04-07 2024-05-07 厦门大学 Remote sensing image saliency target detection method based on dynamic knowledge integration

Similar Documents

Publication Publication Date Title
CN113657450B (en) Attention mechanism-based land battlefield image-text cross-modal retrieval method and system
CN112801280B (en) One-dimensional convolution position coding method of visual depth self-adaptive neural network
CN115147598B (en) Target detection segmentation method and device, intelligent terminal and storage medium
CN111832440B (en) Face feature extraction model construction method, computer storage medium and equipment
CN117809198A (en) Remote sensing image saliency detection method based on multi-scale feature aggregation network
Zhang et al. Learning implicit class knowledge for RGB-D co-salient object detection with transformers
CN113159023A (en) Scene text recognition method based on explicit supervision mechanism
CN115222998B (en) Image classification method
CN116612288B (en) Multi-scale lightweight real-time semantic segmentation method and system
CN116485815A (en) Medical image segmentation method, device and medium based on double-scale encoder network
CN111241326B (en) Image visual relationship indication positioning method based on attention pyramid graph network
CN111899203A (en) Real image generation method based on label graph under unsupervised training and storage medium
CN116862949A (en) Transformer target tracking method and tracker based on symmetrical cross attention and position information enhancement
CN115239765A (en) Infrared image target tracking system and method based on multi-scale deformable attention
CN114550014A (en) Road segmentation method and computer device
CN114581789A (en) Hyperspectral image classification method and system
CN116563285A (en) Focus characteristic identifying and dividing method and system based on full neural network
CN116524258A (en) Landslide detection method and system based on multi-label classification
CN113222016B (en) Change detection method and device based on cross enhancement of high-level and low-level features
CN115984093A (en) Depth estimation method based on infrared image, electronic device and storage medium
CN115147727A (en) Method and system for extracting impervious surface of remote sensing image
CN112487927A (en) Indoor scene recognition implementation method and system based on object associated attention
CN116486101B (en) Image feature matching method based on window attention
Yang et al. Oaformer: Occlusion aware transformer for camouflaged object detection
CN117237858B (en) Loop detection method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination