CN115861749A - Remote sensing image fusion method based on window cross attention - Google Patents

Remote sensing image fusion method based on window cross attention

Info

Publication number
CN115861749A
CN115861749A CN202211491547.7A
Authority
CN
China
Prior art keywords
image
window
multispectral
attention
remote sensing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211491547.7A
Other languages
Chinese (zh)
Inventor
柯成杰
田昕
李松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202211491547.7A priority Critical patent/CN115861749A/en
Publication of CN115861749A publication Critical patent/CN115861749A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides a remote sensing image fusion method based on window cross attention, which uses a novel window cross-attention fusion network to fuse a panchromatic image and a multispectral image into a high-resolution multispectral image. High-pass filtering is combined with deep feature extraction to mine more texture information, overcoming the incomplete extraction of high-frequency information by shallow extraction, so that the relationship between the multispectral and panchromatic images obtained from feature similarity is more accurate. A cross-modal relationship between the panchromatic and multispectral images is then established through a pixel-level window cross-attention mechanism acting between local windows of the two images. Pixel-level attention is more conducive to preserving fine-grained features than patch-level attention, so more spatial detail from the panchromatic image is transferred into the multispectral image and the fused multispectral image is sharper.

Description

Remote sensing image fusion method based on window cross attention
Technical Field
The invention belongs to the field of remote sensing image fusion, relates to remote sensing image fusion based on window cross attention, and is suitable for various multispectral and panchromatic image fusion application scenes.
Background
With the rapid development of satellite sensor technology, multispectral images are widely used in fields such as military systems and environmental analysis. However, limited by current satellite sensor technology, a single sensor can capture only a panchromatic (PAN) image with high spatial resolution but low spectral resolution, or a multispectral (MS) image rich in spectral information but with low spatial resolution. Remote sensing image fusion techniques that fuse multispectral and panchromatic images have therefore been studied extensively in order to generate multispectral images with high spatial resolution.
Existing remote sensing image fusion techniques fall mainly into four categories: component substitution (CS), multi-resolution analysis (MRA), model-based methods and deep learning (DL) based methods. Component substitution decomposes the multispectral image into multiple components and replaces the spatial component with the panchromatic image; however, some spectral information in the multispectral image may be lost because the components are not completely separated. Multi-resolution analysis injects the high-frequency information of the panchromatic image into the multispectral image in a transform domain; it preserves spectral information better but sometimes produces spatial distortion. Model-based methods build optimization models by constructing prior constraints, but the large computational cost and the difficulty of selecting optimal manual parameters limit their practical application. Current mainstream deep learning networks are still based on convolutional neural networks and simply concatenate the multispectral and panchromatic images before feeding them into the network. This strategy does not fully exploit the cross-modal correlation between the two images. In addition, a convolution kernel operates identically on all pixels and cannot focus on effective features while suppressing redundant information, so blurring easily occurs in highly textured remote sensing images. How to design an end-to-end deep learning network that explores the cross-modal correlation between the panchromatic and multispectral images, transfers the spatial texture details of the panchromatic image to the multispectral image, and produces a fused multispectral image that is rich in texture information with as little spectral distortion as possible is therefore an important problem in the field of remote sensing image fusion.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a remote sensing image fusion method based on window cross attention.
The remote sensing image fusion method based on window cross attention provided by the invention comprises the following steps:
step 1, constructing a depth texture feature extraction module based on characteristics of a multispectral image and a panchromatic image, and converting an input image into a feature domain;
step 2, constructing a window cross attention module to acquire the cross-modal fine-grained relationship between the multispectral image and the panchromatic image, and outputting a feature image;
step 3, constructing an image decoding module and transmitting the generated characteristic image back to an image domain to obtain a final fusion image;
step 4, constructing an objective function to drive training of the image fusion model, wherein the image fusion model comprises the depth texture feature extraction module, the window cross attention module and the image decoding module;
and step 5, training the image fusion model with simulation data, and testing on the simulated test set and the real test set with the trained model.
Further, the specific implementation manner of step 1 is as follows;
step 1.1, constructing a high-pass filter to extract high-frequency information of the input images, wherein the input images comprise a multispectral image M, a blurred panchromatic image P̃ and a panchromatic image P, and M, P̃ and P are processed by the high-pass filter to obtain G(M), G(P̃) and G(P) respectively;
step 1.2, constructing a single-channel texture extraction module to extract the high-frequency features of G(P̃) and G(P), obtaining K and V, wherein the number of convolution kernels in the single-channel texture extraction module increases layer by layer and the receptive field of the convolution kernels decreases layer by layer to extract multi-scale detail information;
and step 1.3, constructing a multi-channel texture extraction module to extract the high-frequency features of G(M), obtaining Q, wherein the multi-channel texture extraction module also comprises three convolution layers, the number of convolution kernels increases layer by layer, and all convolution kernels are 1×1.
Further, the blurred panchromatic image in step 1.1 is obtained by down-sampling and up-sampling the original panchromatic image, the high-pass filter is realized by subtracting low-frequency content obtained by average filtering of the original image from the original image, and the average filtering is realized by a global pooling layer.
Further, in step 1.2, the number of convolution kernels in the three convolution layers increases as 32, 64, 128, and the receptive fields of the convolution kernels decrease as 7×7, 5×5, 3×3.
Further, the specific implementation manner of step 2 is as follows;
step 2.1, dividing the input high-frequency features Q/K/V ∈ R^{H×W×C} into n windows:
Q = [q_1, q_2, …, q_n]
K = [k_1, k_2, …, k_n]
V = [v_1, v_2, …, v_n]
where q_i/k_i/v_i ∈ R^{h×w×C}, n = (H×W)/(h×w), C is the number of feature channels, H and W are the image height and width, and h and w are the window height and width;
step 2.2, in order to extract fine-grained features, unfolding each window q_i, k_i, v_i into a pixel sequence through a dimension transformation, and for the m-th pixel q_i^m and the n-th pixel k_i^n in the sequences, calculating the feature similarity between them:
r_i^{m,n} = ⟨q_i^m, k_i^n⟩
where r_i^{m,n} represents the pixel-level cross-modal correlation within window i;
step 2.3, normalizing the correlations between pixels obtained in step 2.2 with the softmax function:
w_i^{m,n} = exp(r_i^{m,n}) / Σ_{j=1}^{hw} exp(r_i^{m,j})
where w_i^{m,n} represents the injection gain from the n-th pixel of the panchromatic image to the m-th pixel of the multispectral image within window i;
step 2.4, extracting the texture information of the panchromatic image according to the injection gains w_i^{m,n}, so that the m-th pixel of the i-th window of the output feature image is calculated as:
o_i^m = Σ_{n=1}^{hw} w_i^{m,n} · v_i^n
step 2.5, folding the unfolded pixel sequence back into a pixel window through the dimension transformation to obtain the i-th window of the output image:
O_i = RS([o_i^1, o_i^2, …, o_i^{hw}])
and step 2.6, obtaining the output feature image of each window through window cross attention, and finally concatenating the feature images of all windows to obtain the final output feature image:
O = [O_1, O_2, …, O_n].
Further, the specific implementation manner of step 3 is as follows;
step 3.1, in order to retain the high-frequency feature information of the multispectral image, adding the output feature image obtained by window cross attention to the multispectral feature image Q through a skip connection to obtain a high-frequency feature image;
step 3.2, passing the fused high-frequency feature image through a convolution layer to obtain a higher-dimensional multi-channel feature image;
and step 3.3, remapping the multi-channel feature image into a four-channel image with 4 convolution layers of kernel size 1×1 to obtain a reconstructed high-frequency image, and then adding the low-frequency multispectral image to the reconstructed high-frequency image to obtain the final fused image.
Further, the convolution layer in step 3.2 has 256 channels and a convolution kernel size of 3 × 3.
Further, in step 3.3, the number of convolution kernels for 4 convolutional layers is 128, 64, 32, 4, respectively.
Further, the loss function constructed in step 4 is as follows:
Loss = (1/b) Σ_{n=1}^{b} ||F_n − G_n||_2^2
where F_n and G_n represent the fused image and the reference image, respectively, and b is the batch size.
Further, step 5 includes comparing the test results with existing algorithms through objective evaluation indexes, wherein the objective evaluation indexes include the peak signal-to-noise ratio and the no-reference index.
Compared with the prior art, the invention has the advantages and beneficial effects that:
firstly, the up-sampled multispectral image, the blurred panchromatic image and the original panchromatic image are sent through a high-pass filter and are then fed into the multi-channel and single-channel depth feature extraction modules respectively, converting the images into the feature domain and extracting deep high-frequency features; next, the extracted high-frequency features are expressed as a query vector Q, a key vector K and a value vector V, and the cross-modal correlation between the multispectral and panchromatic images is obtained through window cross attention; finally, the image is reconstructed and the fused high-frequency feature image is converted back to the image domain. Because high-pass filtering is combined with deep feature extraction, more texture information can be mined, and the relationship between the multispectral and panchromatic images obtained from feature similarity is more accurate. Furthermore, a cross-modal relationship is established between local windows of the multispectral and panchromatic images through a pixel-level window cross-attention mechanism. Compared with patch-level attention, pixel-level attention helps preserve fine-grained features and transfers more spatial detail from the panchromatic image into the multispectral image. Therefore, the fused multispectral image is sharper and its spectral information is well preserved.
Drawings
Fig. 1 is an overall framework diagram of the remote sensing image fusion network based on window cross attention according to an embodiment.
FIG. 2 is a network architecture diagram of the window cross attention module of an embodiment, where SM is the softmax normalization function and RS is the dimension transformation module.
Fig. 3 shows the test results on simulation data for an embodiment, wherein (a) is the low-resolution multispectral image, (b) is the result of IHS, (c) is the result of PNN, (d) is the result of FusionNet, (e) is the result of the proposed method, and (f) is the reference image.
Fig. 4 shows the test results on real data for an embodiment, wherein (a) is the low-resolution multispectral image, (b) is the result of IHS, (c) is the result of PNN, (d) is the result of FusionNet, (e) is the result of the proposed method, and (f) is the panchromatic image.
Detailed Description
In order to facilitate the understanding and implementation of the present invention for those of ordinary skill in the art, the present invention will be described in further detail with reference to the accompanying drawings and examples, it is to be understood that the examples described herein are only for the purpose of illustrating the present invention and are not to be construed as limiting the present invention.
The invention is mainly directed to the application requirement of obtaining a high-resolution multispectral image. High-pass filtering is combined with deep feature extraction to mine more texture information, and a cross-modal relationship between the panchromatic and multispectral images is then established between their local windows through a pixel-level window cross attention mechanism. In this way more texture details of the panchromatic image are transferred to the multispectral image, yielding a fused multispectral image with rich spatial detail and small spectral distortion.
Fig. 1 is the overall framework diagram of the remote sensing image fusion network based on window cross attention of the embodiment, and fig. 2 is the network architecture diagram of its window cross attention module. The embodiment provides a remote sensing image fusion method based on window cross attention to realize fusion of a multispectral image and a panchromatic image, which specifically comprises the following steps:
step 1: and constructing a depth texture feature extraction module based on the characteristics of the multispectral image and the panchromatic image, and converting the input image into a feature domain. The specific implementation comprises the following substeps:
step 1.1: and constructing a high-pass filter to extract high-frequency information of the input image. The input image comprises a multispectral image M and a blurred panchromatic image
Figure BDA0003963433070000051
A panchromatic image P, wherein the blurred panchromatic image is obtained by down-sampling and up-sampling the original panchromatic image, the high-pass filter is designed by subtracting the low-frequency content obtained by averaging and filtering the original image from the original image, the averaging and filtering are realized by a global pooling layer, and the multispectral image M and the blurred panchromatic image->
Figure BDA0003963433070000052
And the full-color image P is processed by a high-pass filter to obtain G (M),. Or->
Figure BDA0003963433070000053
And G (P).
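A minimal PyTorch-style sketch of this step, for illustration only: the function names high_pass and blurred_pan, the bicubic resampling mode and the scale factor of 4 are assumptions of the sketch, while the global-pooling average filter and the down/up-sampling follow the description above.

import torch
import torch.nn.functional as F

def high_pass(x: torch.Tensor) -> torch.Tensor:
    # Subtract the low-frequency content, here the channel-wise global average
    # produced by a global pooling layer, from the original image.
    low = F.adaptive_avg_pool2d(x, 1)
    return x - low

def blurred_pan(pan: torch.Tensor, scale: int = 4) -> torch.Tensor:
    # Blur the panchromatic image by down-sampling and then up-sampling it.
    h, w = pan.shape[-2:]
    small = F.interpolate(pan, size=(h // scale, w // scale), mode="bicubic", align_corners=False)
    return F.interpolate(small, size=(h, w), mode="bicubic", align_corners=False)

# Example inputs: an up-sampled 4-band multispectral image and a panchromatic image.
M = torch.rand(1, 4, 256, 256)
P = torch.rand(1, 1, 256, 256)
G_M, G_Pt, G_P = high_pass(M), high_pass(blurred_pan(P)), high_pass(P)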
Step 1.2: constructing single-channel texture extraction module extraction
Figure BDA0003963433070000061
And the high frequency characteristics of G (P), yielding K and V. The number of convolution kernels of the three convolution layers in the single-channel texture extraction module is increased layer by layer, the outline features and the high-dimensionality features of the image are extracted step by step, the resolution of the image is kept unchanged in the process, and the number of the convolution kernels is changed from 32 to 64 to 128. The receptive field of the convolution kernel is gradually reduced from 7 multiplied by 7, 5 multiplied by 5 to 3 multiplied by 3 to extract multi-scale detail information, the convolution kernel firstly covers a larger area of the image to extract more area information, and then is gradually reduced to learn deeper detail information for a smaller area.
Step 1.3: and constructing a multi-channel texture extraction module to extract the high-frequency characteristics of G (M) to obtain Q. Similar to step 1.2, the number of convolution kernels of the multi-channel texture extraction module becomes larger layer by layer, and the number of the convolution kernels is changed from 32 and 64 to 128. The convolution kernel is 1 x 1 in size to maintain spatial fidelity and maximize the use of spatial information of the multispectral image.
Step 2: Construct a window cross attention module (WCA) to acquire the cross-modal fine-grained relationship between the multispectral and panchromatic images. The specific implementation comprises the following substeps:
step 2.1: inputting high-frequency characteristic Q/K/V epsilon R H,W,C Division into n windows:
Q=[q 1 ,q 2 ,…,q n ]
K=[k 1 ,k 2 ,…,k n ]
V=[v 1 ,v 2 ,…,v n ]
wherein q is i /k i /v i ∈R h,w,C
Figure BDA0003963433070000062
C isThe number of feature channels. In the present embodiment, H =256, w =256, H =2, w =2, c =128, n =16384. H. W is the image size of the image and h, W are the window sizes, representing the division of the image into 16384 image blocks of 2x2 size.
Step 2.2: to extract fine-grained features, each window q is transformed by a dimension transform (RS) i 、v i 、v i The sequence of pixels is unfolded. For the mth pixel in the sequence
Figure BDA0003963433070000068
And the nth pixel->
Figure BDA0003963433070000064
Feature similarity between them is calculated by inner product operation in similarity relation Calculation (CRM):
Figure BDA0003963433070000065
wherein the content of the first and second substances,
Figure BDA0003963433070000066
representing the pixel level cross modal correlation within window i.
Step 2.3: the correlation between the pixels obtained in step 2.2 is normalized by the softmax function (SM):
Figure BDA0003963433070000067
wherein the content of the first and second substances,
Figure BDA0003963433070000071
represents the th from a full-color image>
Figure BDA0003963433070000072
Pixel to multispectral image ^ th ^ based on>
Figure BDA0003963433070000073
The injection gain of each pixel.
Step 2.4: according to the injection gain
Figure BDA0003963433070000074
Texture information of the full-color image can be extracted. The m-th pixel of the i-th window of the output feature image is therefore calculated as follows:
Figure BDA0003963433070000075
step 2.5: folding the unfolded pixel sequence into a pixel window through dimension transformation (RS), and obtaining an ith window of an output image:
Figure BDA0003963433070000076
step 2.6: respectively obtaining the output characteristic image of each window through the cross attention of the windows, and finally splicing the characteristic images of all the windows to obtain the final output characteristic image:
O=[O 1 ,O 2 ,…,O n ]
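A minimal PyTorch-style sketch of the window cross attention of steps 2.1 to 2.6; the function name and the (B, C, H, W) tensor layout are assumptions of the sketch, while the 2×2 window, the pixel-level inner-product similarity, the softmax normalization and the window folding follow the description above.

import torch
import torch.nn.functional as F

def window_cross_attention(Q, K, V, win: int = 2):
    # Q, K, V: (B, C, H, W) high-frequency features from the MS and PAN branches.
    B, C, H, W = Q.shape

    def partition(x):
        # Step 2.1: split into non-overlapping win x win windows,
        # then unfold each window into a pixel sequence (step 2.2).
        x = x.reshape(B, C, H // win, win, W // win, win)
        x = x.permute(0, 2, 4, 3, 5, 1)            # B, H/win, W/win, win, win, C
        return x.reshape(-1, win * win, C)         # one row of pixels per window

    q, k, v = partition(Q), partition(K), partition(V)
    r = torch.bmm(q, k.transpose(1, 2))            # step 2.2: pixel-level similarities r_i^{m,n}
    w = F.softmax(r, dim=-1)                       # step 2.3: injection gains w_i^{m,n}
    o = torch.bmm(w, v)                            # step 2.4: weighted sum of PAN values
    # Steps 2.5-2.6: fold each pixel sequence back into a window and concatenate all windows.
    o = o.reshape(B, H // win, W // win, win, win, C)
    o = o.permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)
    return o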
and step 3: the constructed image decoding module transmits the generated feature image back to the image domain. The specific implementation comprises the following substeps:
step 3.1: in order to keep the high-frequency characteristic information in the multispectral image, the output characteristic image obtained by the cross attention of the window is added with the multispectral characteristic image Q through a jump connection to obtain the high-frequency characteristic image.
Step 3.2: the high-frequency characteristic image obtained by fusion firstly passes through a convolution kernel with 256 channels and 3 multiplied by 3 to obtain a multi-channel characteristic image with higher dimensionality.
Step 3.3: and 4 convolution kernels with the size of 1 multiplied by 1 are adopted, the number of the convolution kernels of 4 layers is 128, 64, 32 and 4 respectively, the multi-channel characteristic image is remapped into a four-channel image to obtain a reconstructed high-frequency image, and then the reconstructed high-frequency image and the low-frequency multispectral image are added to obtain a final fusion image.
Step 4: Construct the objective function of the image fusion model to drive model training. The specific implementation comprises the following substeps:
Step 4.1: Construct the loss function. An L2-based loss function is constructed:
Loss = (1/b) Σ_{n=1}^{b} ||F_n − G_n||_2^2
where F_n and G_n represent the fused image and the reference image, respectively, and b is the batch size. In the present embodiment, b = 8.
Step 4.2: and b data are randomly selected from the training set to be input into the network, one iteration is completed, and network parameters are adjusted.
And 5: and training the network by using simulation data, testing the simulation test set and the real test set by using the trained model, and comparing the simulation test set and the real test set with other algorithms. The specific implementation comprises the following substeps:
step 5.1: and training the network by using simulation data, and comparing the obtained test result with each comparison method on visual and objective evaluation indexes. In this example, we used mainly images of high-resolution second satellite in the experiment, 4000 image pairs were divided into 90% for training and the remaining 10% for verification. The reference image takes an original MS image with a resolution of 256 × 256, takes a downsampled multispectral and panchromatic image with a factor of 4 as input, and the input image size is 256 × 256. The simulated test image size is 512 x 512 images. To verify the effectiveness of the proposed method, we compared the proposed method with the traditional method and the deep learning based method. The traditional method is IHS, and deep learning methods comprise PNN and fusion Net. In the training phase, the initial learning rate was 0.001 and the batch size was set to 8, and the initial learning rate was attenuated by multiplying by 0.5 when the peak signal-to-noise ratio (PSNR) degradation of the validation set reached 20 rounds. We used 450 rounds to train the proposed network and optimized by Adam optimizer. The visual comparison results are shown in fig. 3. The objective evaluation index is peak signal to noise ratio (PSNR), and the average result on the simulation test set is shown in table 1.
And step 5.2: and testing the network performance by using the real data, and comparing the obtained test result with each comparison method on visual and objective evaluation indexes. 210 real images with a size of 512 × 512 are selected for testing to verify the performance of the proposed method in the real world. The comparison methods are IHS, PNN and FusionNet. The visual comparison results are shown in figure 4. The objective evaluation index was no reference index (QNR), and the average results on the real test set are shown in Table 2.
TABLE 1 average PSNR (dB) comparison of simulation data for different methods (ideal: +∞)
TABLE 2 Average QNR comparison of the different methods on the real data (ideal value: 1)
It can be seen that the method first performs high-frequency feature extraction, then obtains the pixel-level cross-modal correlation between the panchromatic and multispectral images through window cross attention, and finally transfers the texture details of the panchromatic image to the multispectral image, achieving the best remote sensing image fusion result among the compared methods.
It should be understood that parts of the specification not set forth in detail are well within the prior art.
It should be understood that the above-mentioned embodiments are described in some detail, and not intended to limit the scope of the invention, and those skilled in the art will be able to make alterations and modifications without departing from the scope of the invention as defined by the appended claims.

Claims (10)

1. A remote sensing image fusion method based on window cross attention is characterized by comprising the following steps:
step 1, constructing a depth texture feature extraction module based on multispectral image and panchromatic image characteristics, and converting an input image into a feature domain;
step 2, constructing a window cross attention module to acquire the cross-modal fine-grained relationship between the multispectral image and the panchromatic image, and outputting a feature image;
step 3, constructing an image decoding module and transmitting the generated characteristic image back to an image domain to obtain a final fusion image;
step 4, constructing an objective function to drive training of the image fusion model, wherein the image fusion model comprises the depth texture feature extraction module, the window cross attention module and the image decoding module;
and step 5, training the image fusion model with simulation data, and testing on the simulated test set and the real test set with the trained model.
2. The remote sensing image fusion method based on window cross attention as claimed in claim 1, characterized in that: the specific implementation manner of step 1 is as follows;
step 1.1, constructing a high-pass filter to extract high-frequency information of the input images, wherein the input images comprise a multispectral image M, a blurred panchromatic image P̃ and a panchromatic image P, and M, P̃ and P are processed by the high-pass filter to obtain G(M), G(P̃) and G(P) respectively;
step 1.2, constructing a single-channel texture extraction module to extract the high-frequency features of G(P̃) and G(P), obtaining K and V, wherein the number of convolution kernels in the single-channel texture extraction module increases layer by layer and the receptive field of the convolution kernels decreases layer by layer to extract multi-scale detail information;
and step 1.3, constructing a multi-channel texture extraction module to extract the high-frequency features of G(M), obtaining Q, wherein the multi-channel texture extraction module also comprises three convolution layers, the number of convolution kernels increases layer by layer, and all convolution kernels are 1×1.
3. The remote sensing image fusion method based on window cross attention as claimed in claim 2, characterized in that: the blurred panchromatic image in step 1.1 is obtained by down-sampling and then up-sampling the original panchromatic image, the high-pass filter is realized by subtracting from the original image the low-frequency content obtained by average filtering of the original image, and the average filtering is realized by a global pooling layer.
4. The remote sensing image fusion method based on window cross attention as claimed in claim 2, characterized in that: in step 1.2, the number of convolution kernels in the three convolution layers increases as 32, 64, 128, and the receptive fields of the convolution kernels decrease as 7×7, 5×5, 3×3.
5. The remote sensing image fusion method based on window cross attention as claimed in claim 2, characterized in that: the specific implementation manner of step 2 is as follows;
step 2.1, dividing the input high-frequency features Q/K/V ∈ R^{H×W×C} into n windows:
Q = [q_1, q_2, …, q_n]
K = [k_1, k_2, …, k_n]
V = [v_1, v_2, …, v_n]
where q_i/k_i/v_i ∈ R^{h×w×C}, n = (H×W)/(h×w), C is the number of feature channels, H and W are the image height and width, and h and w are the window height and width;
step 2.2, in order to extract fine-grained features, unfolding each window q_i, k_i, v_i into a pixel sequence through a dimension transformation, and for the m-th pixel q_i^m and the n-th pixel k_i^n in the sequences, calculating the feature similarity between them:
r_i^{m,n} = ⟨q_i^m, k_i^n⟩
where r_i^{m,n} represents the pixel-level cross-modal correlation within window i;
step 2.3, normalizing the correlations between pixels obtained in step 2.2 with the softmax function:
w_i^{m,n} = exp(r_i^{m,n}) / Σ_{j=1}^{hw} exp(r_i^{m,j})
where w_i^{m,n} represents the injection gain from the n-th pixel of the panchromatic image to the m-th pixel of the multispectral image within window i;
step 2.4, extracting the texture information of the panchromatic image according to the injection gains w_i^{m,n}, so that the m-th pixel of the i-th window of the output feature image is calculated as follows:
o_i^m = Σ_{n=1}^{hw} w_i^{m,n} · v_i^n
step 2.5, folding the unfolded pixel sequence back into a pixel window through the dimension transformation to obtain the i-th window of the output image:
O_i = RS([o_i^1, o_i^2, …, o_i^{hw}])
and step 2.6, obtaining the output feature image of each window through window cross attention, and finally concatenating the feature images of all windows to obtain the final output feature image:
O = [O_1, O_2, …, O_n].
6. The remote sensing image fusion method based on window cross attention as claimed in claim 3, characterized in that: the specific implementation manner of step 3 is as follows;
step 3.1, in order to retain the high-frequency feature information of the multispectral image, adding the output feature image obtained by window cross attention to the multispectral feature image Q through a skip connection to obtain a high-frequency feature image;
step 3.2, passing the fused high-frequency feature image through a convolution layer to obtain a higher-dimensional multi-channel feature image;
and step 3.3, remapping the multi-channel feature image into a four-channel image with 4 convolution layers of kernel size 1×1 to obtain a reconstructed high-frequency image, and then adding the low-frequency multispectral image to the reconstructed high-frequency image to obtain the final fused image.
7. The remote sensing image fusion method based on window cross attention as claimed in claim 6, characterized in that: the convolution layer in step 3.2 has 256 channels and a convolution kernel size of 3×3.
8. The remote sensing image fusion method based on window cross attention as claimed in claim 6, characterized in that: in step 3.3, the numbers of convolution kernels of the 4 convolution layers are 128, 64, 32 and 4, respectively.
9. The remote sensing image fusion method based on window cross attention as claimed in claim 1, characterized in that: the loss function constructed in step 4 is as follows:
Loss = (1/b) Σ_{n=1}^{b} ||F_n − G_n||_2^2
where F_n and G_n represent the fused image and the reference image, respectively, and b is the batch size.
10. The remote sensing image fusion method based on window cross attention as claimed in claim 1, characterized in that: step 5 further comprises comparing the test results with existing algorithms through objective evaluation indexes, wherein the objective evaluation indexes include the peak signal-to-noise ratio and the no-reference index.
CN202211491547.7A 2022-11-25 2022-11-25 Remote sensing image fusion method based on window cross attention Pending CN115861749A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211491547.7A CN115861749A (en) 2022-11-25 2022-11-25 Remote sensing image fusion method based on window cross attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211491547.7A CN115861749A (en) 2022-11-25 2022-11-25 Remote sensing image fusion method based on window cross attention

Publications (1)

Publication Number Publication Date
CN115861749A true CN115861749A (en) 2023-03-28

Family

ID=85666544

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211491547.7A Pending CN115861749A (en) 2022-11-25 2022-11-25 Remote sensing image fusion method based on window cross attention

Country Status (1)

Country Link
CN (1) CN115861749A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116030311A (en) * 2023-03-29 2023-04-28 山东省海洋资源与环境研究院(山东省海洋环境监测中心、山东省水产品质量检验中心) Wetland classification method based on multi-source remote sensing data and electronic equipment

Similar Documents

Publication Publication Date Title
CN112507997B (en) Face super-resolution system based on multi-scale convolution and receptive field feature fusion
CN111080567B (en) Remote sensing image fusion method and system based on multi-scale dynamic convolutional neural network
CN111127374B (en) Pan-sharing method based on multi-scale dense network
CN111275637A (en) Non-uniform motion blurred image self-adaptive restoration method based on attention model
CN109636769A (en) EO-1 hyperion and Multispectral Image Fusion Methods based on the intensive residual error network of two-way
Luo et al. Lattice network for lightweight image restoration
CN113643197B (en) Two-order lightweight network full-color sharpening method combining guided filtering and NSCT
CN116152120B (en) Low-light image enhancement method and device integrating high-low frequency characteristic information
CN114581347B (en) Optical remote sensing spatial spectrum fusion method, device, equipment and medium without reference image
CN113191325B (en) Image fusion method, system and application thereof
CN111951164A (en) Image super-resolution reconstruction network structure and image reconstruction effect analysis method
Yang et al. License plate image super-resolution based on convolutional neural network
CN112163998A (en) Single-image super-resolution analysis method matched with natural degradation conditions
CN116205830A (en) Remote sensing image fusion method based on combination of supervised learning and unsupervised learning
CN117474781A (en) High spectrum and multispectral image fusion method based on attention mechanism
CN116309062A (en) Remote sensing image super-resolution reconstruction method
CN117197008A (en) Remote sensing image fusion method and system based on fusion correction
CN115578262A (en) Polarization image super-resolution reconstruction method based on AFAN model
CN115861749A (en) Remote sensing image fusion method based on window cross attention
Wali et al. Recent progress in digital image restoration techniques: a review
Wen et al. The power of complementary regularizers: Image recovery via transform learning and low-rank modeling
CN113628143A (en) Weighted fusion image defogging method and device based on multi-scale convolution
CN117408924A (en) Low-light image enhancement method based on multiple semantic feature fusion network
CN117350923A (en) Panchromatic and multispectral remote sensing image fusion method based on GAN and transducer
CN114511470B (en) Attention mechanism-based double-branch panchromatic sharpening method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination