CN113887619A

CN113887619A - Knowledge-guided remote sensing image fusion method

Info

Publication number: CN113887619A
Application number: CN202111157255.5A
Authority: CN
Inventors: 张景涵; 李文轩; 张承明
Original assignee: Shandong Agricultural University
Current assignee: Shandong Agricultural University
Priority date: 2021-09-30
Filing date: 2021-09-30
Publication date: 2022-01-04
Anticipated expiration: 2041-09-30
Also published as: CN113887619B

Abstract

The application discloses a remote sensing image fusion method based on knowledge guidance, which comprises the following steps: extracting texture details based on high-pass filtering and constructing a high-pass filtering module to extract high-frequency detail information of the full-color image; extracting NDVI and NDWI information of the multispectral image; constructing a self-adaptive SE module, and extruding and exciting input characteristics; and combining the self-adaptive SE module with the convolution unit to perform fusion processing on the input features. And introducing NDVI and NDWI as prior knowledge to constrain the fusion process according to the characteristics of the full-color image and the multispectral image. The fidelity of the spectral information in the fusion process is improved, and the problem of spectral distortion easily generated in the image fusion process is solved.

Description

Knowledge-guided remote sensing image fusion method

Technical Field

The application relates to the technical field of remote sensing image fusion, in particular to a remote sensing image fusion method based on knowledge guidance.

Background

At present, due to the limitation of sensor technology, a single sensor cannot directly acquire an image with high spectral resolution and high spatial resolution at the same time, two sensors are generally simultaneously installed on a satellite remote sensing platform, one sensor is used for acquiring a full-color image with high spatial resolution, and the other sensor is used for acquiring a multispectral image. In application, it is usually necessary to first fuse the panchromatic image and the multispectral image by using an image fusion technique to obtain an image with both high spectral resolution and high spatial resolution.

Conventional image fusion methods mainly include Component Substitution (CS), Multiresolution Analysis (MRA), and Sparse Representation (SR). The CS fusion method first converts a Multispectral image (MS) into another space, separating the spatial structure and spectral information into different components; then, the components having a spatial structure in the converted MS image are replaced with a full color image (PAN). CS-based methods can gain rich details, but spectral distortions tend to be severe. The core of the MRA fusion method is multi-scale detail extraction and injection. The MRA method is typically used to extract spatial detail from the PAN image, which is then injected into the up-sampled multi-spectral image. Compared with the CS-based method, the MRA-based method can better maintain spectral characteristics by injecting the extracted detailed information of the panchromatic image into the multispectral image. The core idea of sparse representation is to represent the image as a linear combination of the least atoms in a overcomplete dictionary, but this approach is complex and time consuming.

In recent years, remote sensing image fusion methods based on deep learning are widely concerned, for example, a PNN network uses a convolutional neural network in an image fusion task for the first time, the interpolated multispectral image and panchromatic image are spliced and input into the network for end-to-end training, and the network directly learns the relationship between the input and high-resolution images; the PanNet network splices the high-frequency detail information of the panchromatic image and the multispectral image and inputs the spliced high-frequency detail information into a residual error network for feature extraction and fusion, and then the extracted high-frequency detail information is injected into the low-resolution multispectral image after up-sampling. Compared with the traditional CS-based and MRA-based algorithms, the CNN-based method significantly improves the performance of image fusion, but the methods have some problems, mainly: both Target-PNN and RSIFNN lack specific texture detail processing, so that the texture details of the fused image are not clear enough. The PNN simply inputs the panchromatic image and the up-sampled multispectral image into a convolutional neural network for training, and does not perform targeted feature extraction and fusion according to the respective characteristics of the panchromatic image and the multispectral image, which may cause distortion of some spectra and spatial structures. PanNet, although enhancing texture details, does not consider the relationship between MS image spectral channels, which may result in some spectral distortion.

Disclosure of Invention

In order to solve the technical problems, the following technical scheme is provided:

in a first aspect, an embodiment of the present application provides a knowledge-guided remote sensing image fusion method, where the method includes: extracting texture details based on high-pass filtering and constructing a high-pass filtering module to extract high-frequency detail information of the full-color image; extracting NDVI and NDWI information of the multispectral image as prior knowledge; constructing a self-adaptive SE module, and extruding and exciting input characteristics; and combining the self-adaptive SE module with the convolution unit to perform fusion processing on the input features.

By adopting the implementation mode, the normalized vegetation index and the normalized water body index are introduced as priori knowledge to constrain the fusion process according to the characteristics of the full-color image and the multispectral image. The fidelity of the spectral information in the fusion process is improved, and the problem of spectral distortion easily generated in the image fusion process is solved.

With reference to the first aspect, in a first possible implementation manner of the first aspect, the extracting texture details based on high-pass filtering and constructing a high-pass filtering module to extract full-color image high-frequency detail information includes: acquiring low-frequency information of the image through an average filter; and then subtracting the low-frequency information from the original image to obtain high-frequency information, wherein the high-frequency information is used for reducing the influence of noise information in the full-color image on spectral information in the multispectral image.

With reference to the first aspect, in a second possible implementation manner of the first aspect, the extracting NDVI and NDWI information of the multispectral image includes: the resolution of the multispectral image is improved by two times of up-sampling, and each time of sampling is improved by two times on the original basis; and respectively calculating the NDVI and NDWI values of each pixel point of the multispectral image after up-sampling through the NDVI and NDWI calculation formulas.

With reference to the second possible implementation manner of the first aspect, in a third possible implementation manner of the first aspect, the calculation formulas of the NDVI and the NDWI are:

in the formula, NIR, R, and G represent the reflectance of the near-infrared band, the red band, and the green band of the multispectral image, respectively.

With reference to the first aspect, in a fourth possible implementation manner of the first aspect, the adaptive SE module includes: the feature graph size input by the self-adaptive SE module is c multiplied by h multiplied by w, wherein c represents the number of channels, h represents the height, and w represents the width.

With reference to the fourth possible implementation manner of the first aspect, in a fifth possible implementation manner of the first aspect, the extruding and exciting the input feature includes: the global average pooling layer compresses the input feature map into a size of c multiplied by 1; the first full-connection layer reduces the dimension of the result obtained in the global average pooling layer to the size of c/r multiplied by 1, and r adopts the channel number c of the convolution of the previous layer; the ReLu layer carries out activation operation on the result of the first full connection layer; the second fully connected layer raises the features back to the original dimensions; selecting proper weight for obtaining each characteristic channel by using a Sigmod activation function; and finally multiplying the obtained weight to the characteristic of each channel.

With reference to the fourth or fifth possible implementation manner of the first aspect, in a sixth possible implementation manner of the first aspect, the adaptive SE module is combined with the convolution unit to obtain the application model, where the application model includes a feature extraction module and an image fusion module; the characteristic extraction module comprises a characteristic extraction unit of a multispectral image and a characteristic extraction unit of a full-color image; the image fusion module comprises two residual error units and two attention mechanism units.

With reference to the sixth possible implementation manner of the first aspect, in a seventh possible implementation manner of the first aspect, the feature extraction unit for the full-color image includes: the first and second feature extraction subunits have the same structure and respectively comprise two 3 multiplied by 3 convolutional layers, two ReLu activation function layers and a cavity convolutional layer; the third, fourth, fifth and sixth feature extraction subunits have the same structure and respectively comprise three 3 multiplied by 3 convolutional layers, three ReLu activation function layers and a cavity convolutional layer;

the feature extraction unit of the multispectral image comprises: the first and second feature extraction subunits have the same structure and respectively comprise two 3 multiplied by 3 convolutional layers, two ReLu activation function layers and a cavity convolutional layer; the third, fourth, fifth and sixth feature extraction subunits have the same structure and respectively comprise three 3 multiplied by 3 convolutional layers, three ReLu activation function layers and a cavity convolutional layer;

the residual error unit comprises two convolution subunits and a jump link subunit, wherein each convolution subunit consists of a convolution layer, a BN layer and a ReLu activation function layer;

the attention mechanism unit includes: the system comprises two SE subunits, wherein each SE subunit consists of a global average pooling layer, two full-connection layers, a ReLu activation function layer and a Sigmod activation function layer; the global average pooling layer compresses input features to obtain a global receptive field, and the first full-connection layer reduces the dimension of a result obtained by average pooling; the ReLu layer activates the result after dimensionality reduction, and the nonlinearity of the model is increased; and selecting proper weight for obtaining each characteristic channel by using a Sigmod activation function, and then endowing the weight to the original characteristic to realize extrusion and excitation of the characteristic.

With reference to the sixth or seventh possible implementation manner of the first aspect, in an eighth possible implementation manner of the first aspect, the training of the application model includes: 1) determining hyper-parameters in the training process, and initializing parameters of the model; 2) inputting the manufactured training images (including panchromatic images, multispectral images and reference images) serving as training data into the model; 3) performing forward calculation on the current training data by using a model; 4) calculating loss by using a loss function; 5) updating the parameters of the model by using a random gradient descent algorithm to complete a training process; 6) repeating steps 3) -5) until the loss function is less than the specified desired value or the loss value is no longer decreasing.

With reference to the eighth possible implementation manner of the first aspect, in a ninth possible implementation manner of the first aspect, the calculating the loss by using the loss function includes: calculating the texture loss according to the generated image and the original full-color image, wherein the calculation formula is as follows:

wherein h represents a high-resolution multispectral image obtained by the network, p represents an original panchromatic image, n represents the number of pixels in the image block, and mu_hRepresents the mean value of h,. mu._pRepresents the mean value of p, and represents,

is a variance of h and is,

is the variance of p, σ_hpIs the covariance of h and p, c₁＝(k₁L)²，c₂＝(k₂L)²Is two constants, avoiding dividing by 0, L is the range of pixel values, k₁＝0.01，k₂The default value is 0.03, and the texture loss of the image is calculated by 1-SSIM (h, p);

calculating the spectral loss according to the generated image and the original multispectral image, wherein the calculation formula is as follows:

wherein, lT is a multispectral image which is obtained by interpolation and up-sampling of an original multispectral image and has the same size as h, m represents the height of an image block, and w represents the width of the image;

the overall loss function is:

drawings

Fig. 1 is a schematic flow chart of a knowledge-based remote sensing image fusion method according to an embodiment of the present disclosure;

fig. 2 is a schematic structural diagram of an adaptive SE module according to an embodiment of the present disclosure;

FIG. 3 provides a schematic diagram of a model structure according to an embodiment of the present application;

fig. 4 is a schematic diagram of an experimental comparison result provided in the embodiment of the present application.

Detailed Description

The present invention will be described with reference to the accompanying drawings and embodiments.

Attention is drawn to the study of human vision, where humans selectively focus on a portion of all information while ignoring other visible information. The idea was first applied in the processing of natural language. Later, researchers have proposed a SENet (Squeeze-and-Excitation Networks) attention mechanism model that assigns different weights to each feature channel by squeezing, exciting optimized feature extraction on the input information. The network structure is widely applied to tasks such as target detection and image classification, and achieves good effect. In an image fusion task, fusion operation needs to be performed on all features of panchromatic and multispectral images through a neural network, but a large amount of redundant information is generated, and how to fuse feature information more accurately and efficiently becomes a problem to be solved urgently. Attention mechanisms may then solve this problem by giving weight to the features.

The priori knowledge is the description of the object characteristics obtained by people through methods such as experience or mathematical statistics and the like, and is widely applied to deblurring of natural images and denoising of the natural images. The common prior knowledge of the natural image has the characteristics of local smoothness, non-local self-similarity, sparsity and the like of the natural image. Inspired by the prior knowledge of natural images, considering that the ndvi (normalized Difference risk index) index can reflect the distribution information of land Vegetation, the ndwi (normalized Difference Water index) index reflects the distribution information of a Water body, and the Vegetation and the Water body are two main information which can influence the spectrum. Therefore, the two kinds of information are used as priori knowledge to constrain the fusion process of the remote sensing images, so that the fusion effect can be further improved, and the problem of spectral distortion can be solved.

Based on the analysis, the remote sensing image fusion method based on knowledge guidance is provided

Fig. 1 is a schematic flow chart of a knowledge-based remote sensing image fusion method provided in an embodiment of the present application, and referring to fig. 1, the knowledge-based remote sensing image fusion method in the embodiment includes:

and S101, extracting texture details based on high-pass filtering and constructing a high-pass filtering module to extract high-frequency detail information of the full-color image.

Before image fusion, the full-color image is subjected to a high-pass filter, high-frequency detail information is extracted, and redundant noise information is filtered. When the high-pass filter is constructed, firstly, the low-frequency information of the image is obtained by using the mean filter, and then the high-frequency information is obtained by subtracting the low-frequency information from the original image. The high-frequency information of the full-color image is acquired, so that the influence of noise information in the full-color image on spectral information in the multispectral image can be reduced, the advantages of high-frequency texture details in the full-color image are fully exerted, and the spectral information and the texture information of the fused image can achieve the optimal effect.

S102, extracting NDVI and NDWI information of the multispectral image as prior knowledge.

The multispectral image is input to the model and is first up-sampled to achieve the same resolution as a full-color image. Since the bicubic interpolation method can sufficiently maintain the spectral information of the multispectral image, the bicubic interpolation is adopted for up-sampling. Considering that the resolution of the full-color image is four times of that of the multispectral image, the resolution of the multispectral image is improved by two times through up-sampling twice, and each time is improved by two times on the original basis. The upsampling module consists of a bicubic interpolation layer and a 1 x 1 convolved adjustment layer. And introducing prior knowledge after upsampling, namely respectively calculating the NDVI and NDWI values of each pixel point of the multispectral image after upsampling through the NDVI and NDWI calculation formulas, wherein the values respectively form two matrixes, and the two matrixes and the multispectral image after upsampling are used as the input of the next layer of convolution. The formula for calculating NDVI and NDWI is as follows:

S103, constructing a self-adaptive SE module, and extruding and exciting the input features.

Referring to fig. 2, the adaptive SE module includes: the feature graph size input by the self-adaptive SE module is c multiplied by h multiplied by w, wherein c represents the number of channels, h represents the height, and w represents the width.

The global average pooling layer compresses the input feature map to a size of c × 1 × 1, that is, each channel gets a value with a global receptive field that characterizes the global distribution of responses over the feature channels. The first fully-connected layer reduces the dimension of the result obtained in the global average pooling layer to the size of c/r × 1 × 1, where r is the number of channels convolved by the previous layer, i.e. c, which is the reason why the module is called an "adaptive" SE module, i.e. the parameters used for dimension reduction can change with the change of the number of input characteristic channels, and the parameters and the calculated amount can be greatly reduced through the "adaptive" mechanism. The ReLu layer performs activation operation on the result of the first full connection layer. The second fully connected layer raises the features back to the original dimensions. Through the dimension reduction and the dimension lifting of the full connection layer and the activation operation of ReLu, the nonlinearity of the model can be increased, and the complex correlation between channels can be better fitted. The appropriate weights are selected for each feature channel using the Sigmod activation function. Finally, the obtained weights are multiplied to the characteristics of each channel, namely, the x operation in the figure.

And S104, combining the self-adaptive SE module with the convolution unit to perform fusion processing on the input features.

Referring to fig. 3, the adaptive SE module is combined with the convolution unit to obtain the application model, which includes a feature extraction module and an image fusion module. The characteristic extraction module comprises a characteristic extraction unit of a multispectral image and a characteristic extraction unit of a full-color image; the image fusion module comprises two residual error units and two attention mechanism units.

The feature extraction unit of the full-color image includes: the first and second feature extraction subunits have the same structure and respectively comprise two 3 multiplied by 3 convolutional layers, two ReLu activation function layers and a cavity convolutional layer; the third, fourth, fifth and sixth feature extraction subunits have the same structure and respectively comprise three 3 multiplied by 3 convolutional layers, three ReLu activation function layers and a cavity convolutional layer.

The input of the feature extraction unit of the full-color image is high-frequency information extracted by the full-color image through high-pass filtering. In consideration of the above, in the feature extraction process of the image, although the dimensionality reduction can be performed on the features by using the pooling layer to reduce the number of parameters, some important feature information may be lost in the process, and the final image fusion result is affected. Therefore, the maximum pooling layer in the VGG network is not used, and the hole convolution with convolution kernel of 3 × 3 and expansion rate of 3 × 3 is used instead, so that the receptive field of the network can be increased, and the size of the output image is not changed.

The feature extraction unit of the multispectral image comprises: the first and second feature extraction subunits have the same structure and respectively comprise two 3 multiplied by 3 convolutional layers, two ReLu activation function layers and a cavity convolutional layer; the third, fourth, fifth and sixth feature extraction subunits have the same structure and respectively comprise three 3 multiplied by 3 convolutional layers, three ReLu activation function layers and a cavity convolutional layer.

The feature extraction unit of the multispectral image is similar to the feature extraction unit of the full-color image in structure, but the parameters are different. The input of the unit is a multispectral image obtained by upsampling a low-resolution multispectral image by an interpolation method and upsampling by a deconvolution method, and the purpose of doing so is to avoid the problem of partial information loss caused by singly using deconvolution or singly using interpolation upsampling.

The residual error unit comprises two convolution subunits and a jump link subunit, wherein each convolution subunit consists of a convolution layer, a BN layer and a ReLu activation function layer. The purpose of using jump links is to reduce the loss of information during the fusion process.

The attention mechanism unit includes: the SE sub-unit comprises a global average pooling layer, two full-connection layers, a ReLu activation function layer and a Sigmod activation function layer. The global average pooling layer compresses the input features to obtain a global receptive field; the first full-connection layer reduces the dimension of the result obtained by the average pooling; the ReLu layer activates the result after dimensionality reduction, and the nonlinearity of the model is increased; the Sigmod function obtains the weight of each channel and then gives the weight to the original characteristics to realize the optimization of the characteristics. Therefore, through the characteristic optimization of the SE module, the model parameters can be reduced, and the running speed of the model can be increased.

In this embodiment, the model is trained in an end-to-end manner, loss is used as a loss function, an SGD (stochastic gradient descent) algorithm is used as an optimization algorithm, and the specific training steps are as follows:

1) determining hyper-parameters in the training process, and initializing parameters of the model;

2) inputting the manufactured training images (including panchromatic images, multispectral images and reference images) serving as training data into the model;

3) performing forward calculation on the current training data by using a model;

4) calculating loss by using a loss function;

5) updating the parameters of the model by using a random gradient descent algorithm to complete a training process;

6) repeating steps 3) -5) until the loss function is less than the specified desired value or the loss value is no longer decreasing.

The design of the network structure plays a crucial role in the accuracy of the image fusion result, and the design of the loss function is also a very critical part in the process of improving the accuracy of the image fusion. In conventional image fusion and panchromatic sharpening networks, the error between corresponding pixels is usually calculated by using a result image and a reference image after the network fusion is completed, and the error is further used for training a forward-propagation fusion network. In addition, the network extracts information in the full-color image and the multispectral image only during training, which undoubtedly reduces the utilization rate of the original image and influences the accuracy of the model to a certain extent.

The application provides a new loss calculation method without a reference image. That is, the loss of the reference image calculation model is not used in the training process of the image fusion network, and instead, the original full-color and multi-spectral images are used to calculate the loss. The method is mainly divided into two parts, wherein the first part calculates texture loss according to a generated image and an original full-color image, and the calculation formula is as follows:

is a variance of h and is,

is the variance of p, σ_hpIs the covariance of h and p, c₁＝(k₁L)²，c₂＝(k₂L)²Is two constants, avoiding dividing by 0, L is the range of pixel values, k₁＝0.01，k₂0.03 is the default value. The value range of the SSIM is-1 to 1, the closer the value of the SSIM of the two images is to 1, the more similar the images are, the smaller the value of the loss function in the network is, the better the value is, the optimization of the network is performed towards the direction of smaller loss value, and the texture loss of the images is calculated by adopting 1-SSIM (h, p).

The second part is to calculate the spectral loss according to the generated image and the original multispectral image, and the calculation formula is as follows:

l ≈ is a multispectral image with the same size as h obtained by interpolation and upsampling of the original multispectral image, m represents the height of an image block, and w represents the width of the image.

The overall loss function can be written as:

alpha is 0.3 and beta is 0.7.

Therefore, abundant texture information in the full-color image and abundant spectral information in the multispectral image are fully utilized in the image fusion process, and the result image obtained through fusion can have more abundant texture and spectral information.

The model is coded and realized by using a pytoren language and taking a pytoren frame as an experimental platform. A server is used for carrying out a comparison experiment, an operating system used in the experiment is Ubuntu16.04, in order to accelerate data operation, a GPU TITAN X is loaded, and CUDA and CuDNN acceleration are installed.

The application selects two high-scene first images of Taian and a corresponding high-score sixth image to make a data set. Each high scene-one image comprises a panchromatic wave band and a multispectral wave band, the resolution of the panchromatic wave band is 0.5m, and the resolution of the multispectral wave band is 2 m. Each GF-6 image contains multispectral (blue, green, red, near-infrared) and panchromatic bands with spatial resolutions of 8m and 2m, respectively, and table 1 gives the main parameters of the high-scoring six-numbered satellite used herein.

TABLE 1 Main parameters of high-resolution No. 6 satellite

The image preprocessing mainly comprises the steps of geometric correction, radiation correction and the like. The method for preprocessing the image with the high score of six by using the PIE (Pixel Information expert) software produced by the aerospace macro map Information technology GmbH in a batch processing mode comprises the following steps: atmospheric correction, geometric correction, and orthorectification. After preprocessing, the obtained unfused multispectral image comprises four wave bands including blue, green, red and near-infrared wave bands, and the spatial resolution is 8 m. A single-band full-color image with a spatial resolution of 2m is obtained. And performing radiometric calibration, atmospheric correction and orthorectification on the high-scene first multispectral image in ENVI software to obtain a multispectral image with the resolution of 2 m.

And cutting the high-resolution six-color image into a TIF format picture with the size of 512 multiplied by 512, wherein the picture is named by adding _ pan in front of a suffix. And cutting the corresponding high-resolution six-number multispectral image into a TIF format picture with the size of 128 x 128, and adding _ lrms to the front of a suffix when the picture is named. The multispectral image of the Gaojing I is cut into TIF format pictures with the size of 512 multiplied by 512, and when the pictures are named, a _ target is added in front of a suffix. And putting the data in the same folder to obtain a data set for the experiment.

In the experiment, 1850 groups of images are used as training samples, 422 groups of images are used as test samples, and the training samples and the test samples are not crossed with each other. Each set of images includes a panchromatic image, a multispectral image, and a reference image. Since the ratio of the spatial resolution of the panchromatic and multispectral images is 1:4, and the reference image is the same resolution as the panchromatic image, the panchromatic image and the reference image are 512 x 512 pixels in size and the multispectral image is 128 x 128 pixels in size in these samples. The data are obtained by a PMS sensor of a high-resolution six-number satellite and are obtained through preprocessing. In addition, the resulting image obtained by the experiment herein has the same spatial resolution as the full-color image with high resolution of six, both 2m and an image size of 512 × 512 pixels.

The model of the application is compared with 5 common image fusion methods, wherein the comparison methods are GS (Gram-Schmidt) transformation, PCA (principal Component analysis), NNDiffuse (near Neighbor diffusion), PNN and PanNet. The first 3 are traditional image fusion algorithms, and PNN and PanNet are deep learning image fusion algorithms.

Table 2 model used in comparative experiments

The objective evaluation index can quantitatively evaluate the quality condition of the fused image, particularly, the image is difficult to find by human eyes in some tiny details of the image, and the image quality can be quantitatively evaluated by the value by adopting the objective evaluation index. Therefore, 3 evaluation indexes, namely Peak Signal to Noise Ratio (PSNR), Structural Similarity Index (SSIM) and Spectral Angle Mapper (SAM), are selected to quantitatively evaluate the fusion result and evaluate the image quality from different aspects.

The PSNR is mainly used for evaluating the sensitivity error of image quality, is an important index for measuring the difference of two images, and has the calculation formula as follows:

in the formula: h and W are the height and width of the image, respectively; f and R are respectively a fusion result image and a reference image. The larger the PSNR value is, the more information quantity obtained by the fused image from the original image is, the higher the similarity between the image and the original image is, and the better the fusion effect is.

The SSIM compares the distortion degree of an image from three levels of brightness (mean), contrast (variance) and structure, the value range is [0, 1], the larger the value is, the better the value is, and the calculation formula is as follows:

in the formula: mu.s_FIs the mean value of F; mu.s_RIs the mean value of R;

is the variance of F;

is the variance of R; sigma_RFCovariance as F and R; c. C₁＝(k₁L)²，c₂＝(k₂L)²Two constants, avoid dividing by 0; l is the range of pixel values; k is a radical of₁＝0.01，k₂0.03 is the default value.

Spectral Angle Mapper (SAM) is an index that can calculate the spectral similarity of two images at each pixel. A lower value indicates lower spectral distortion and better image fusion quality. The calculation formula is as follows:

in the formula: f_{j}And R_{j}J-th pixel vector for F and R, respectively:<F_{j}，R_{j}>being the inner product of two vectors, | | | |, is the L2 norm of the vector.

FIG. 4 is a graph showing the results of fusion of test samples using the model of the present application, PanNet, PNN, GS transform, PCA, NNDiffuse. Where a is a graph of the results of fusion with NNDiffuse, b is a graph of the results of fusion with GS transform, c is a graph of the results of fusion with PCA, d is a graph of the results of fusion with the PanNet model, e is a graph of the fusion results with the PNN model, and f is a graph of the fusion results with the model of the present application. The images include ground features such as buildings, farmlands, greenhouses, roads, rivers and the like, and have certain representativeness.

It can be seen from fig. 4 that the colors of the fused image obtained by NNDiffuse are generally darker, and especially in the place with vegetation, different colors exhibited by different regions of vegetation cannot be distinguished in the fused image. GS and PCA still have partial spectrum distortion, and PanNet, PNN and the model of the application are improved in the aspect of distinguishing vegetation colors of different regions, so that different colors of different regions can be distinguished. The PanNet and PNN are slightly inferior to the model of the application in terms of the texture characteristics of the images. Careful observation of these images shows that the results obtained by the model are clearer at the edge parts of the images, such as the edges of houses, greenhouses and different crops, the texture of the results obtained by the model is more obvious, and PanNet and PNN have some blurring. This shows that the method of image fusion by using the convolutional neural network is superior to the traditional image fusion method, and different convolutional neural network algorithms have different effects on the fusion result. The channel attention layer plays a role in the fusion process of the images, and can give weights to the characteristic information of each channel according to the importance degree of the characteristic information, so that the noise information with low weight is correspondingly inhibited in the fusion process of the images, and the fusion result is more accurate.

The quantitative evaluation results of the respective models are shown in table 3, in which the upward arrow indicates that the higher the value of the index is, the better the fusion effect is, and the downward arrow indicates that the lower the value of the index is, the better the fusion effect is. It can be seen that the values of the three evaluation indexes of the model of the application are superior to those of the comparative model.

TABLE 3 comparative model Performance index Table

Through the analysis of fig. 3 and table 3, it can be found that the model of the present application constructed herein shows better effect than other methods, whether subjective visual perception or objective evaluation index.

The purpose of using the self-adaptive SE module is to optimize the features so as to achieve the purpose of feature fusion according to the importance degree of the features. In the comparative experiment developed by the application, the model, PanNet and PNN of the application achieve the effect of feature fusion by performing convolution operation on input features. The difference is that the self-adaptive SE module is adopted to optimize the characteristics in the fusion process of the model, and the optimization operation of the characteristics is not carried out in the fusion process of the PanNet and the PNN.

Comparing the fusion results of the model, the PanNet model and the PNN model, the chessboard phenomenon of the images generated by the PanNet model can be found; the resulting image of the PNN model is still spectrally distorted. The result image spectrum information of the model fusion is better, and the details are richer. Especially, the fusion result of the model has clearer texture than the fusion result of PanNet and PNN at the edge of buildings, roads, the edge of different crops in farmlands and the edge of greenhouses in images.

The reason for this is mainly due to the use of the adaptive SE module in the model of the present application. The module enhances useful characteristics through characteristic optimization and inhibits useless characteristics, so that image fusion can be performed in a targeted manner, and the influence of redundant information on a fusion result is reduced.

At present, in image classification and image segmentation tasks, a true value image needs to be made on a training image according to prior knowledge, so that a deep learning model is told which types of ground features are farmlands and which types of ground features are buildings, wheat, corns, water bodies, roads and the like. Each type of ground object cannot be marked in detail in image fusion, but the NDVI and the NDWI attributes which are convenient to calculate by remote sensing images can be used as prior information to perform image fusion. Therefore, the purpose of using the two kinds of prior knowledge, namely the NDVI and the NDWI, is to add some limiting conditions in the model to achieve the effect of image fusion constraint. In the comparative experiments developed in this study, only the model of the present application uses the prior knowledge NDVI and NDWI.

By comparing the model with GS, PCA, PanNet, PNN and NNDiffuse, the result obtained by NNDiffuse is relatively serious in spectral distortion and dark in image color, and different ground objects are difficult to distinguish. The resulting images from GS and PCA present little spectral distortion. The resulting plots for the PanNet and PNN models also show slight spectral distortion. The result image of the model of the application has almost no spectrum distortion phenomenon, the spectrum information is richer, and the image quality is higher.

Aiming at the problems of spectral distortion and spatial structure distortion which are easy to occur in the remote sensing image fusion process, the fusion model suitable for the high-resolution six-number remote sensing image is established by utilizing the advantages of an attention mechanism in the aspect of feature extraction and the characteristics of priori knowledge. The model gives importance to the characteristic channel of the image by using an attention mechanism, and performs subsequent processing on the characteristic according to the importance degree; meanwhile, the prior knowledge is utilized to carry out pixel-by-pixel constraint on the characteristics in the fusion process, so that the spectral characteristics and the texture characteristics of the fusion result have higher fidelity.

It is noted that, in this document, relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Claims

1. A remote sensing image fusion method based on knowledge guidance is characterized by comprising the following steps:

extracting texture details based on high-pass filtering and constructing a high-pass filtering module to extract high-frequency detail information of the full-color image;

extracting NDVI and NDWI information of the multispectral image as prior knowledge;

constructing a self-adaptive SE module, and extruding and exciting input characteristics;

and combining the self-adaptive SE module with the convolution unit to perform fusion processing on the input features.

2. The method of claim 1, wherein extracting texture details based on high-pass filtering and constructing a high-pass filtering module to extract panchromatic image high-frequency detail information comprises:

acquiring low-frequency information of the image through an average filter;

and then subtracting the low-frequency information from the original image to obtain high-frequency information, wherein the high-frequency information is used for reducing the influence of noise information in the full-color image on spectral information in the multispectral image.

3. The method of claim 1, wherein said extracting NDVI and NDWI information from the multispectral image comprises:

the resolution of the multispectral image is improved by two times of up-sampling, and each time of sampling is improved by two times on the original basis;

and respectively calculating the NDVI and NDWI values of each pixel point of the multispectral image after up-sampling through the NDVI and NDWI calculation formulas.

4. The method of claim 3, wherein the NDVI and NDWI are calculated as:

5. The method of claim 1, wherein the adaptive SE module comprises: the feature graph size input by the self-adaptive SE module is c multiplied by h multiplied by w, wherein c represents the number of channels, h represents the height, and w represents the width.

6. The method of claim 5, wherein squeezing, stimulating the input features comprises:

the global average pooling layer compresses the input feature map into a size of c multiplied by 1;

the first full-connection layer reduces the dimension of the result obtained in the global average pooling layer to the size of c/r multiplied by 1, and r adopts the channel number c of the convolution of the previous layer;

the ReLu layer carries out activation operation on the result of the first full connection layer;

the second fully connected layer raises the features back to the original dimensions;

selecting a proper weight for each characteristic channel by using a Sigmod activation function;

and finally multiplying the obtained weight to the characteristic of each channel.

7. The method according to claim 5 or 6, wherein the application model is obtained by combining an adaptive SE module with a convolution unit, and comprises a feature extraction module and an image fusion module;

the characteristic extraction module comprises a characteristic extraction unit of a multispectral image and a characteristic extraction unit of a full-color image; the image fusion module comprises two residual error units and two attention mechanism units.

8. The method according to claim 7, wherein the feature extraction unit of the full-color image comprises: the first and second feature extraction subunits have the same structure and respectively comprise two 3 multiplied by 3 convolutional layers, two ReLu activation function layers and a cavity convolutional layer; the third, fourth, fifth and sixth feature extraction subunits have the same structure and respectively comprise three 3 multiplied by 3 convolutional layers, three ReLu activation function layers and a cavity convolutional layer;

the attention mechanism unit includes: the system comprises two SE subunits, wherein each SE subunit consists of a global average pooling layer, two full-connection layers, a ReLu activation function layer and a Sigmod activation function layer; the global average pooling layer compresses input features to obtain a global receptive field, and the first full-connection layer reduces the dimension of a result obtained by average pooling; the ReLu layer activates the result after dimensionality reduction, and the nonlinearity of the model is increased; the weight of each channel is obtained by the Sigmod function, and then the weight is given to the original feature, so that the extrusion and excitation of the feature are realized.

9. The method of claim 7 or 8, wherein training the application model comprises:

4) calculating loss by using a loss function;

10. The method of claim 9, wherein said calculating the loss using a loss function comprises:

calculating the texture loss according to the generated image and the original full-color image, wherein the calculation formula is as follows:

is a variance of h and is,

is the variance of p, σ_hpIs the covariance of h and p, c₁＝(k₁L)²,c₂＝(k₂L)²Is two constants, avoiding dividing by 0, L is the range of pixel values, k₁＝0.01,k₂The default value is 0.03, and the texture loss of the image is calculated by 1-SSIM (h, p);

wherein l is a multispectral image with the same size as h obtained by interpolation and up-sampling of an original multispectral image, m represents the height of an image block, and w represents the width of the image;

the overall loss function is: