CN111861880A - Image super-fusion method based on regional information enhancement and block self-attention - Google Patents

Image super-fusion method based on regional information enhancement and block self-attention Download PDF

Info

Publication number
CN111861880A
CN111861880A CN202010506835.XA CN202010506835A CN111861880A CN 111861880 A CN111861880 A CN 111861880A CN 202010506835 A CN202010506835 A CN 202010506835A CN 111861880 A CN111861880 A CN 111861880A
Authority
CN
China
Prior art keywords
super
fusion
resolution
block
source image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010506835.XA
Other languages
Chinese (zh)
Other versions
CN111861880B (en)
Inventor
李华锋
岑悦亮
余正涛
张亚飞
原铭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN202010506835.XA priority Critical patent/CN111861880B/en
Publication of CN111861880A publication Critical patent/CN111861880A/en
Application granted granted Critical
Publication of CN111861880B publication Critical patent/CN111861880B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformation in the plane of the image
    • G06T3/40Scaling the whole image or part thereof
    • G06T3/4053Super resolution, i.e. output image resolution higher than sensor resolution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformation in the plane of the image
    • G06T3/40Scaling the whole image or part thereof
    • G06T3/4046Scaling the whole image or part thereof using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/50Image enhancement or restoration by the use of more than one image, e.g. averaging, subtraction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10004Still image; Photographic image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20092Interactive image processing based on input by user
    • G06T2207/20104Interactive definition of region of interest [ROI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20212Image combination
    • G06T2207/20221Image fusion; Image merging

Abstract

The invention relates to an image super-fusion method based on regional information enhancement and block self-attention, and belongs to the technical field of digital image processing. The method comprises a source image super-resolution branch and a fusion super-resolution branch. In the source image super-resolution branch, feature extraction blocks are used iteratively to extract source image feature maps, and dense connections are used to make full use of the feature map information before and after each block. The output of each feature extraction block also passes through a region information enhancement block to explore the region occupied by each object in the source image, and this information assists the fusion super-resolution branch in accurately predicting the fusion decision map. In the fusion super-resolution branch, the two source images are spliced together as input, and fusion blocks based on a block self-attention mechanism are used iteratively, combined with the region-enhanced source image information from the source image super-resolution branch, to better distinguish focused and unfocused regions. Finally, sub-pixel convolution is performed in each branch to generate the super-resolution source images and the fusion image.

Description

Image super-fusion method based on regional information enhancement and block self-attention
Technical Field
The invention relates to an image super-fusion method based on regional information enhancement and block self-attention, and belongs to the technical field of image information processing.
Background
The purpose of image fusion is to fuse the information of two or more source images captured in the same scene by different cameras into one image while ensuring that the information of each source image is preserved. Image fusion has very wide applications in fields such as security surveillance, medical imaging and satellite remote sensing. In recent years, much research has achieved good fusion effects, but existing methods are usually designed for fusion on high-resolution multi-focus source image datasets, whereas the images obtained by real-world imaging systems are not necessarily of high resolution. When low-resolution source images are fused, the fused image is also of low resolution, or even blurred and lacking in detail information, which reduces the utility of image fusion techniques. To feed low-resolution source images into traditional fusion methods, bicubic interpolation and nearest-neighbor interpolation are generally adopted as up-sampling operations to unify the resolution of the source images. However, these interpolation methods are too simple, are not adapted to different data, and introduce erroneous information that reduces the accuracy of image texture details, resulting in poor fusion effects; in addition, for the multi-focus image fusion task, the accuracy of the fusion decision map is also reduced. Therefore, to overcome these disadvantages and make low-resolution image fusion more effective, a method capable of accurately super-resolving and fusing images is urgently needed.
In recent years, many image fusion methods based on deep learning have been proposed; they have a greater ability to extract texture and detail than fusion methods based on the transform domain and the spatial domain. One class of methods uses an encoder-decoder network: an encoder extracts the features of the source images, and a decoding network fuses the features and gradually enlarges them to obtain the fused image. Another class of methods uses a pre-trained classification convolutional network and inputs image patches into the network to predict whether each patch is focused, thereby generating a fusion decision map. A further class of methods decomposes a source image into a base layer containing large-scale contours or intensity variations and a detail layer containing important textures, and fuses them separately. Still other methods are based on generative adversarial networks: the generator produces the fused image, while the discriminator is only used to distinguish the fused image from the visible image, so that more texture is extracted from the visible image. Although innovative and successful, these methods still suffer from two major drawbacks: 1) when the resolution of the source images is low, the resolution of the fused image is also low and texture details are lacking; 2) the regional extent of the salient features in the image cannot be accurately estimated, so the salient features of the source images contained in the fusion result are not complete enough.
To overcome these drawbacks, some work combines super-resolution with the image fusion task. Dictionary-learning-based methods learn a set of multi-scale dictionaries from high-resolution images and then fuse low-resolution image patches using sparse coefficients based on local information content, but these methods require storing dictionaries mapping from low-resolution to high-resolution images, which consumes memory. Some methods fuse images via compressed sensing; however, these methods require two steps, i.e. decomposing the task into super-resolution and fusion of the images, which is very time-consuming. Other methods use structure tensors, fractional differentiation and variational techniques to perform fusion and super-resolution in one step, but they can only super-resolve at integer scale factors, are not flexible and practical enough, and their fusion results are not good enough.
Disclosure of Invention
The technical problem to be solved by the invention is to provide an image super-resolution and fusion method based on region information enhancement and block self-attention, so as to solve the image fusion problem when the resolution of the source images is low and to improve the quality of the fusion result.
The technical scheme adopted by the invention is as follows: an image super-resolution and fusion method based on region information enhancement and block self-attention is disclosed. Taking low-resolution multi-focus image fusion as an example, with the flow chart shown in fig. 1, the method specifically comprises the following steps:
Step1, in the task of super-resolution and fusion of multi-focus images, as shown in FIG. 1, the two low-resolution source images are respectively input into the source image super-resolution branch; at the same time, the two source images are spliced together along the channel dimension and input into the fusion and super-resolution branch. A 3×3 convolutional layer is placed at the beginning of both the source image super-resolution branch and the fusion and super-resolution branch to preliminarily extract features. After that, the source image super-resolution branch contains 16 feature extraction blocks and 16 region information enhancement blocks, and the fusion and super-resolution branch contains 16 fusion blocks based on a block self-attention mechanism. The 16 feature extraction blocks, 16 region information enhancement blocks and 16 fusion blocks based on the block self-attention mechanism correspond one to one, and i (0 ≤ i ≤ 16) is defined as the index of the i-th feature extraction block / region information enhancement block / fusion block based on the block self-attention mechanism.
Step2, in the source image super-resolution branch, the initial feature map passes through the 16 feature extraction blocks, which are connected in a dense connection manner. The output of the (i-1)-th feature extraction block, besides being passed on to the i-th feature extraction block to construct the super-resolution source image, is also input into the i-th region information enhancement block to assist the fusion and super-resolution branch in obtaining the decision weight map. The region information enhancement block enhances the information of the salient feature regions, especially the feature information of the focused regions. The information output by the region information enhancement block is input into the i-th fusion block based on the block self-attention mechanism in the fusion and super-resolution branch;
step3, in the fusion and super-resolution branch, the initial feature map passes through 16 fusion blocks based on a block self-attention mechanism, the features are fully extracted, and the information is fused in a self-adaptive manner;
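The one-to-one correspondence between feature extraction blocks, region information enhancement blocks and fusion blocks described in Step1-Step3 can be summarized with the following minimal PyTorch-style sketch. It is an illustration under assumptions, not the patented implementation: the module names, the single-channel source images, the 64-channel width and the simplified stand-in blocks are all assumed for readability, and the dense connections of Step2 are omitted here (a detailed sketch appears later).

```python
import torch
import torch.nn as nn

# Stand-in modules; more detailed sketches of each block appear later in this description.
class FeatureExtractionBlock(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, 3, padding=1))
    def forward(self, x):
        return x + self.body(x)                                   # residual learning

class RegionInfoEnhanceBlock(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.conv = nn.Conv2d(c, c, 3, padding=1)                 # placeholder for the deformable-convolution block
    def forward(self, x):
        return self.conv(x)

class BlockSelfAttentionFusionBlock(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.mix = nn.Conv2d(3 * c, c, 1)                         # placeholder: splice + 1x1 conv (+ 3x3 convs + block attention)
    def forward(self, g, ea, eb):
        return self.mix(torch.cat([g, ea, eb], dim=1))

class TwoBranchNet(nn.Module):
    """Illustrative wiring: 16 blocks per branch, in one-to-one correspondence."""
    def __init__(self, c=64, n=16):
        super().__init__()
        self.head_sr = nn.Conv2d(1, c, 3, padding=1)              # head of the (shared) source image SR branch
        self.head_fu = nn.Conv2d(2, c, 3, padding=1)              # head of the fusion/SR branch (two spliced images)
        self.feat = nn.ModuleList(FeatureExtractionBlock(c) for _ in range(n))
        self.enh = nn.ModuleList(RegionInfoEnhanceBlock(c) for _ in range(n))
        self.fuse = nn.ModuleList(BlockSelfAttentionFusionBlock(c) for _ in range(n))

    def forward(self, img_a, img_b):
        fa, fb = self.head_sr(img_a), self.head_sr(img_b)         # both source images share one SR branch
        g = self.head_fu(torch.cat([img_a, img_b], dim=1))
        for feat, enh, fuse in zip(self.feat, self.enh, self.fuse):
            fa, fb = feat(fa), feat(fb)                           # dense connections omitted here for brevity
            g = fuse(g, enh(fa), enh(fb))                         # region-enhanced features assist the fusion branch
        return fa, fb, g                                          # fed to the 1x1 conv + sub-pixel conv of Step4

net = TwoBranchNet()
outs = net(torch.randn(1, 1, 60, 80), torch.randn(1, 1, 60, 80))
print([o.shape for o in outs])
```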
Step4, after the 16 feature extraction blocks in the source image super-resolution branch, and after the 16 fusion blocks based on the block self-attention mechanism in the fusion and super-resolution branch, there is a 1×1 convolutional layer followed by a sub-pixel convolutional layer. The 1×1 convolution reduces the number of channels of the outputs of the 16th feature extraction block of each source image super-resolution branch and of the 16th fusion block based on the block self-attention mechanism to the square of the magnification factor r. The sub-pixel convolution up-samples the outputs of the 1×1 convolutional layers to the target size H×W, where H and W respectively denote the height and width of the target size. After sub-pixel convolution, the source image super-resolution branch yields the super-resolution results of the two source images, denoted I_A^SR and I_B^SR. In the fusion and super-resolution branch, normalization is performed through a Sigmoid function and the decision weight map W_SR for multi-focus image fusion is obtained through threshold division; finally, the super-resolution fusion result image I_F^SR is obtained by combining the source images.
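The up-sampling tail of each branch described in Step4 can be illustrated with the short sketch below (PyTorch, assumed implementation): a 1×1 convolution reduces the channel count to r² and a sub-pixel convolution (pixel shuffle) rearranges those channels into an r-times larger plane. The 64-channel input and magnification r = 4 are example values, not taken from the patent.

```python
import torch
import torch.nn as nn

class SubPixelUpsample(nn.Module):
    """1x1 conv to r^2 channels, then sub-pixel convolution (pixel shuffle) to reach H x W."""
    def __init__(self, in_channels, r):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, r * r, kernel_size=1)   # channels -> square of the magnification factor r
        self.shuffle = nn.PixelShuffle(r)                            # rearranges r^2 channels into an r-times larger plane

    def forward(self, x):
        return self.shuffle(self.reduce(x))                          # (B, C, h, w) -> (B, 1, h*r, w*r)

# Example: a 64-channel feature map at 60x80 up-sampled by r = 4 to 240x320.
up = SubPixelUpsample(64, r=4)
print(up(torch.randn(1, 64, 60, 80)).shape)   # torch.Size([1, 1, 240, 320])
```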
Step5, obtaining the network parameter through Step4 in the network parameter training process
Figure BDA0002526821440000038
Super-resolution results of
Figure BDA0002526821440000039
Figure BDA00025268214400000310
And a decision weight graph WSRSuper-resolution fusion result image
Figure BDA00025268214400000311
And then, calculating the loss between the label and the label, and minimizing the loss by using an optimizer based on a gradient descent method, thereby optimizing the parameters of the network, finishing the network training when the loss gradually decreases to be flat, and obtaining a high-quality super-resolution and fusion result by testing.
Specifically, the dense connection mode proposed in Step2 means that the initial feature map f_0 output by the first convolutional layer of the source image super-resolution branch and the outputs of the previous i-1 feature extraction blocks are taken as the input of the i-th feature extraction block. Finally, f_0 and the outputs of all the blocks are spliced together, and dimension reduction and information integration are performed through a 1×1 convolution. The structure of the feature extraction block is shown in fig. 2(a); it consists of three 3×3 convolutional layers and uses residual learning to alleviate the degradation problem caused by deep networks;
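The dense connection and the residual feature extraction block can be sketched as follows (PyTorch). The channel count of 64, the spatial size and the 1×1 compression convolution inside each block (used here to absorb the growing dense input) are assumptions made for illustration; the patent only specifies three 3×3 convolutional layers with residual learning and a final 1×1 integration convolution.

```python
import torch
import torch.nn as nn

class FeatureExtractionBlock(nn.Module):
    """Three 3x3 conv layers with residual learning; a 1x1 conv first compresses the dense input."""
    def __init__(self, in_channels, channels=64):
        super().__init__()
        self.compress = nn.Conv2d(in_channels, channels, 1)              # integrate f_0 + previous block outputs
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, dense_input):
        x = self.compress(dense_input)
        return x + self.body(x)                                          # residual learning

channels, num_blocks = 64, 16
blocks = nn.ModuleList(FeatureExtractionBlock(channels * (i + 1), channels) for i in range(num_blocks))
final_1x1 = nn.Conv2d(channels * (num_blocks + 1), channels, 1)          # fuse f_0 and all block outputs

f0 = torch.randn(1, channels, 60, 80)                                    # initial feature map from the first conv layer
features = [f0]
for block in blocks:
    features.append(block(torch.cat(features, dim=1)))                   # dense connection: f_0 + all previous outputs
out = final_1x1(torch.cat(features, dim=1))                              # dimension reduction and information integration
```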
Specifically, the region information enhancement block proposed in Step2 is shown in fig. 2(c). First, a convolutional layer acts on the input feature map, and the dimension of the output feature map is 2 times that of the input feature map; the output feature map is sliced along the channel dimension into two feature maps of the same dimension, which are the offsets of the input feature map in the horizontal and vertical directions. That is, the convolutional layer learns the offset of each position of the input feature map in the horizontal and vertical directions, and the horizontal and vertical offsets together with the input feature map are fed into a deformable convolution to obtain a feature map that is closer to the shape and size of the objects. Denote the outputs of the i-th feature extraction block of the super-resolution branches of the two source images A and B as F_A^i and F_B^i, and their offsets in the horizontal and vertical directions as (Δh_A^i, Δv_A^i) and (Δh_B^i, Δv_B^i) respectively. The feature maps of salient object region information input to the fusion and super-resolution branch at the i-th stage, E_A^i and E_B^i, are calculated as follows:

(Δh_A^i, Δv_A^i) = split(Conv(F_A^i))

(Δh_B^i, Δv_B^i) = split(Conv(F_B^i))

E_A^i = LeakyRelu(DConv(F_A^i, Δh_A^i, Δv_A^i))

E_B^i = LeakyRelu(DConv(F_B^i, Δh_B^i, Δv_B^i))

where split(·) is the channel slicing operation, DConv(·) denotes the deformable convolution, Conv(·) denotes a convolutional layer with convolution kernel size k of 3, and LeakyRelu(·) is a commonly used nonlinear activation function with slope s set to 0.2.
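A possible realization of this block is sketched below using torchvision's DeformConv2d. It is an assumption-laden illustration rather than the patented design: the offset-predicting convolution here outputs one (vertical, horizontal) pair per position, which is broadcast to every kernel sampling location so that the standard deformable-convolution interface can be reused, whereas the text describes offset maps with the same dimensionality as the input feature map.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class RegionInfoEnhanceBlock(nn.Module):
    """Sketch of the region information enhancement block: learn offsets, then deformable convolution."""
    def __init__(self, channels=64, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        self.offset_conv = nn.Conv2d(channels, 2, kernel_size, padding=pad)        # predicts (dv, dh) per position
        self.deform_conv = DeformConv2d(channels, channels, kernel_size, padding=pad)
        self.act = nn.LeakyReLU(0.2, inplace=True)                                  # slope s = 0.2 as in the text
        self.num_kernel_points = kernel_size * kernel_size

    def forward(self, x):
        dv, dh = torch.chunk(self.offset_conv(x), 2, dim=1)                         # channel slicing: vertical / horizontal offsets
        offset = torch.cat([dv, dh], dim=1).repeat(1, self.num_kernel_points, 1, 1) # one pair per kernel sampling point
        return self.act(self.deform_conv(x, offset))

# Example: enhance a 64-channel feature map.
block = RegionInfoEnhanceBlock(64)
print(block(torch.randn(1, 64, 60, 80)).shape)   # torch.Size([1, 64, 60, 80])
```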
Specifically, the block self-attention mechanism proposed in Step3 means that, when the local feature of a pixel is considered, attention should be paid to the pixels that strongly influence it. In the present invention, the feature relationship of each position to its 7×7 neighborhood is explored. For a position p in the input feature map, define N_p as the 7×7 neighborhood range with p as the center point, and x_q as the feature value at a position q within N_p; the information within the neighborhood range is fused, and Sigmoid(·) is an intra-block normalization function used to calculate the weight of the features at other positions in the neighborhood with respect to the feature at the center point p. After the block self-attention mechanism, the feature value y_p at position p can be calculated as:

y_p = BatchNormalize( Σ_{q∈N_p} Sigmoid(x_p^T x_q) · x_q )

where x_p^T x_q calculates, by transposed multiplication, the correlation between the feature vector x_p at position p and the feature vector x_q at position q, and BatchNormalize(·) is a batch normalization operation.
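The following PyTorch sketch implements the 7×7 block self-attention as reconstructed above (sigmoid-weighted aggregation over the neighborhood followed by batch normalization); it should be read as an assumed, illustrative implementation, with the unfold-based gathering of neighborhoods being one possible way to realize it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlockSelfAttention(nn.Module):
    """For each position p, weight its 7x7 neighbours q by Sigmoid(x_p . x_q) and sum them."""
    def __init__(self, channels, neighborhood=7):
        super().__init__()
        self.k = neighborhood
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        b, c, h, w = x.shape
        pad = self.k // 2
        # Gather every k*k neighbourhood: (B, C*k*k, H*W) -> (B, C, k*k, H*W)
        patches = F.unfold(x, kernel_size=self.k, padding=pad).view(b, c, self.k * self.k, h * w)
        center = x.view(b, c, 1, h * w)                                         # feature vector x_p
        weights = torch.sigmoid((center * patches).sum(dim=1, keepdim=True))    # correlation x_p^T x_q, normalised in-block
        y = (weights * patches).sum(dim=2).view(b, c, h, w)                     # weighted aggregation over the neighbourhood
        return self.bn(y)                                                       # BatchNormalize(.)

attn = BlockSelfAttention(64)
print(attn(torch.randn(2, 64, 32, 32)).shape)   # torch.Size([2, 64, 32, 32])
```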
The fusion block based on the block self-attention mechanism proposed in Step3 means that the previously output fusion feature map is spliced with the feature maps of the salient focused regions input from the source image super-resolution branch; after information integration through a 1×1 convolution and several 3×3 convolutional layers, the self-attention mechanism based on the block range is used to more accurately highlight the extent of salient objects.
Specifically, the Sigmoid normalization in Step4 means the following. Let S denote the single-channel output of the sub-pixel convolution in the fusion and super-resolution branch, whose size is the target size, and let (m, n) denote a coordinate position; then

P(m,n) = Sigmoid(S(m,n)) = 1 / (1 + exp(−S(m,n)))

and the decision weight map for multi-focus image fusion is obtained by division with a threshold t. The invention sets t to 0.5, and the decision weight map W_SR is obtained by the following formula:

W_SR(m,n) = 1 if P(m,n) > 0.5, and W_SR(m,n) = 0 otherwise.

Then the fusion result I_F^SR can be obtained from the decision weight map W_SR and the super-resolution results I_A^SR and I_B^SR of the two source images:

I_F^SR(m,n) = W_SR(m,n) · I_A^SR(m,n) + (1 − W_SR(m,n)) · I_B^SR(m,n)
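This thresholding and blending step can be summarized with the short sketch below (assumed implementation; sr_a, sr_b and fusion_logits are hypothetical names standing for the two super-resolved source images and the single-channel output of the fusion branch's sub-pixel convolution).

```python
import torch

def fuse_with_decision_map(fusion_logits, sr_a, sr_b, t=0.5):
    """Sigmoid-normalise the fusion branch output, threshold at t, and blend the two SR images."""
    prob = torch.sigmoid(fusion_logits)                 # normalisation to (0, 1)
    w_sr = (prob > t).float()                           # binary decision weight map W_SR
    fused = w_sr * sr_a + (1.0 - w_sr) * sr_b           # pick the focused source at each pixel
    return fused, w_sr

# Example with random tensors of target size H x W = 240 x 320.
fused, w_sr = fuse_with_decision_map(torch.randn(1, 1, 240, 320),
                                     torch.rand(1, 1, 240, 320),
                                     torch.rand(1, 1, 240, 320))
print(fused.shape, w_sr.unique())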
Specifically, for the loss calculation proposed in Step5, the loss is calculated with the L1 norm, which has good convex optimization properties, and an Adam optimizer is used to minimize the loss value. Define I_A^HR, I_B^HR and I_F^HR as the label values, i.e. the high-resolution images corresponding to the two low-resolution source images and the high-resolution fused image, and let W_SR and W_HR be the super-resolution fusion decision map and the high-resolution label fusion decision map; then the loss is calculated as follows:

Loss = ||I_A^SR − I_A^HR||_1 + ||I_B^SR − I_B^HR||_1 + ||I_F^SR − I_F^HR||_1 + ||W_SR − W_HR||_1
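The training objective as reconstructed above can be written as the following sketch (PyTorch, assumed; the equal weighting of the four terms is an assumption not stated in the text). In practice the continuous Sigmoid output would presumably stand in for the binarized decision map during training so that gradients can propagate, though the text does not specify this.

```python
import torch
import torch.nn.functional as F

def total_loss(sr_a, sr_b, fused, w_sr, hr_a, hr_b, hr_fused, w_hr):
    """L1 losses between network outputs and their high-resolution labels (assumed equal weights)."""
    return (F.l1_loss(sr_a, hr_a) +          # super-resolved source A vs its HR label
            F.l1_loss(sr_b, hr_b) +          # super-resolved source B vs its HR label
            F.l1_loss(fused, hr_fused) +     # SR fusion result vs HR fused label
            F.l1_loss(w_sr, w_hr))           # predicted decision map vs label decision map

# Training step sketch with Adam, as mentioned in the text:
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# loss = total_loss(...); optimizer.zero_grad(); loss.backward(); optimizer.step()
```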
specifically, Relu is used as the nonlinear activation function after all convolutional layers, except where specifically noted; the convolutional layers are all SAME type convolution, namely the input and output of the convolutional layers are consistent in size, and all source images share one source image super-resolution branch.
The invention has the beneficial effects that: the method comprises a source image super-resolution branch and a fusion super-resolution branch, where the source image super-resolution branch assists the fusion super-resolution branch in obtaining an accurate fusion decision map. In the source image super-resolution branch, feature extraction blocks are used iteratively to extract source image feature maps, and dense connections are used to make full use of the feature map information before and after each block. The output of each feature extraction block also passes through a region information enhancement block to explore the extent and region of each object in the source image, and this information is passed to the fusion super-resolution branch to accurately predict the fusion decision weight map. In the fusion super-resolution branch, the two source images are spliced together as input, and fusion blocks based on a block self-attention mechanism are used iteratively, combined with the region-enhanced source image information from the source image super-resolution branch, so that focused and unfocused regions are better distinguished. Finally, sub-pixel convolution is used as the up-sampling layer to generate the super-resolution source images and the fusion image.
Drawings
FIG. 1 is an overall architecture diagram of the present invention incorporating an embodiment;
fig. 2 is a structure diagram of each sub-module: (a) the structure of the feature extraction block in the source image super-resolution branch; (b) the structure of the fusion block based on the self-attention mechanism in the super-resolution and fusion branch; (c) the structure of the region information enhancement block.
Detailed Description
The following detailed description of the embodiments refers to specific examples and the flow diagram shown in FIG. 1. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following examples do not represent all embodiments consistent with the present application; they are merely examples of systems and methods consistent with certain aspects of the application, as recited in the claims.
Example 1: referring to fig. 1, a schematic diagram of the steps of the image super-resolution and fusion method based on region information enhancement and block self-attention according to the present application is shown, in which the input source images and the output result image of a specific example are also shown. As shown in fig. 1, the present application is composed of a source image super-resolution branch and a super-resolution and fusion branch, and provides an image super-resolution and fusion method based on region information enhancement and block self-attention, including:
Step1, in the task of super-resolution and fusion of multi-focus images, as shown in FIG. 1, the two low-resolution source images are respectively input into the source image super-resolution branch; at the same time, the two source images are spliced together along the channel dimension and input into the fusion and super-resolution branch. A 3×3 convolutional layer is placed at the beginning of both the source image super-resolution branch and the fusion and super-resolution branch to preliminarily extract features. After that, the source image super-resolution branch contains 16 feature extraction blocks and 16 region information enhancement blocks, and the fusion and super-resolution branch contains 16 fusion blocks based on a block self-attention mechanism. The 16 feature extraction blocks, 16 region information enhancement blocks and 16 fusion blocks based on the block self-attention mechanism correspond one to one, and i (0 ≤ i ≤ 16) is defined as the index of the i-th feature extraction block / region information enhancement block / fusion block based on the block self-attention mechanism.
Step2, in the source image super-resolution branch, the initial feature map passes through the 16 feature extraction blocks, which are connected in a dense connection manner. The output of the (i-1)-th feature extraction block, besides being passed on to the i-th feature extraction block to construct the super-resolution source image, is also input into the i-th region information enhancement block to assist the fusion and super-resolution branch in obtaining the decision weight map. The region information enhancement block enhances the information of the salient feature regions, especially the feature information of the focused regions. The information output by the region information enhancement block is input into the i-th fusion block based on the block self-attention mechanism in the fusion and super-resolution branch;
step3, in the fusion and super-resolution branch, the initial feature map passes through 16 fusion blocks based on a block self-attention mechanism, the features are fully extracted, and the information is fused in a self-adaptive manner;
Step4, after the 16 feature extraction blocks in the source image super-resolution branch, and after the 16 fusion blocks based on the block self-attention mechanism in the fusion and super-resolution branch, there is a 1×1 convolutional layer followed by a sub-pixel convolutional layer. The 1×1 convolution reduces the number of channels of the outputs of the 16th feature extraction block of each source image super-resolution branch and of the 16th fusion block based on the block self-attention mechanism to the square of the magnification factor r. The sub-pixel convolution up-samples the outputs of the 1×1 convolutional layers to the target size H×W, where H and W respectively denote the height and width of the target size. After sub-pixel convolution, the source image super-resolution branch yields the super-resolution results of the two source images, denoted I_A^SR and I_B^SR. In the fusion and super-resolution branch, normalization is performed through a Sigmoid function and the decision weight map W_SR for multi-focus image fusion is obtained through threshold division; finally, the super-resolution fusion result image I_F^SR is obtained by combining the source images.
Step5, obtaining the network parameter through Step4 in the network parameter training process
Figure BDA0002526821440000078
Super-resolution results of
Figure BDA0002526821440000079
Figure BDA00025268214400000710
And a decision weight graph WSRSuper-resolution fusion result image
Figure BDA00025268214400000711
And then, calculating the loss between the label and the label, and minimizing the loss by using an optimizer based on a gradient descent method, thereby optimizing the parameters of the network, finishing the network training when the loss gradually decreases to be flat, and obtaining a high-quality super-resolution and fusion result by testing.
Furthermore, in Step2, the dense connection mode means that the initial feature map f_0 output by the first convolutional layer of the source image super-resolution branch and the outputs of the previous i-1 feature extraction blocks are taken as the input of the i-th feature extraction block. Finally, f_0 and the outputs of all the blocks are spliced together, and dimension reduction and information integration are performed through a 1×1 convolution. The structure of the feature extraction block is shown in fig. 2(a); it consists of three 3×3 convolutional layers and uses residual learning to alleviate the degradation problem caused by deep networks;
further, in Step2, the proposed region information enhancement block is shown in fig. 2 (c). Firstly, a layer of convolution layer acts on an input characteristic diagram, and the dimension of an output characteristic diagram is 2 times that of the input characteristic diagram; the output characteristic diagram is sliced according to the channel to obtain two characteristic diagrams with the same dimension, and the two characteristic diagrams are the offset of the input characteristic diagram in the horizontal direction and the vertical direction; namely, the convolutional layer learns the offset of each position of the input feature map in the horizontal and vertical directions, and the horizontal and vertical offsets and the input feature map are input into the deformable convolution, so that the feature map closer to the shape and the size of the object is obtained. Definition of
Figure BDA00025268214400000712
Are respectively as
Figure BDA00025268214400000713
The amount of offset in the horizontal and vertical directions of the,
Figure BDA00025268214400000714
are respectively as
Figure BDA00025268214400000715
Is offset in the horizontal and vertical directions, wherein
Figure BDA00025268214400000716
Are respectively
Figure BDA00025268214400000717
The output of the ith feature extraction block of the super-resolution branch of the source image,
Figure BDA00025268214400000718
And outputting the ith characteristic extraction block of the source image super-resolution branch. Therefore, the feature map of the salient object region information input to the super-resolution and fusion branch i-th time
Figure BDA0002526821440000081
The calculation method is as follows:
Figure BDA0002526821440000082
Figure BDA0002526821440000083
Figure BDA0002526821440000084
Figure BDA0002526821440000085
where split (-) is the channel slicing operation, DConv (-) represents the deformable convolution, Conv (-) represents the convolution layer with a convolution kernel size k of 3, and LeakyRelu (-) is a commonly used nonlinear activation function with a slope s set to 0.2.
Further, in Step3, the block self-attention mechanism means that, when the local feature of a pixel is considered, attention should be paid to the pixels that strongly influence it. In the present invention, the feature relationship of each position to its 7×7 neighborhood is explored. For a position p in the input feature map, define N_p as the 7×7 neighborhood range with p as the center point, and x_q as the feature value at a position q within N_p; the information within the neighborhood range is fused, and Sigmoid(·) is an intra-block normalization function used to calculate the weight of the features at other positions in the neighborhood with respect to the feature at the center point p. After the block self-attention mechanism, the feature value y_p at position p can be calculated as:

y_p = BatchNormalize( Σ_{q∈N_p} Sigmoid(x_p^T x_q) · x_q )

where x_p^T x_q calculates, by transposed multiplication, the correlation between the feature vector x_p at position p and the feature vector x_q at position q, and BatchNormalize(·) is a batch normalization operation.
Further, in Step3, the fusion block based on the block self-attention mechanism means that the previously output fusion feature map is spliced with the feature map of the salient focusing region input by the source image super-resolution branch, and after information integration is performed through convolution of 1 × 1 and convolution of several layers of 3 × 3, the self-attention mechanism based on the block range is used to more accurately highlight the range of the salient object.
In Step4, the Sigmoid normalization means the following. Let S denote the single-channel output of the sub-pixel convolution in the fusion and super-resolution branch, whose size is the target size, and let (m, n) denote a coordinate position; then

P(m,n) = Sigmoid(S(m,n)) = 1 / (1 + exp(−S(m,n)))

and the decision weight map for multi-focus image fusion is obtained by division with a threshold t. The invention sets t to 0.5, and the decision weight map W_SR is obtained by the following formula:

W_SR(m,n) = 1 if P(m,n) > 0.5, and W_SR(m,n) = 0 otherwise.

Then the fusion result I_F^SR can be obtained from the decision weight map W_SR and the super-resolution results I_A^SR and I_B^SR of the two source images:

I_F^SR(m,n) = W_SR(m,n) · I_A^SR(m,n) + (1 − W_SR(m,n)) · I_B^SR(m,n)
further, in Step5, regarding the loss calculation, the present invention calculates the loss using the L1 norm with better convex optimization properties and uses an Adam optimizer to minimize the loss value. Definition of
Figure BDA0002526821440000094
Are tag values, respectively
Figure BDA0002526821440000095
Corresponding high resolution image,
Figure BDA0002526821440000096
Corresponding high resolution image, high resolution fused image, WSR、WHRIf the fusion decision graph and the high-resolution label fusion decision graph are super-resolution, the loss is calculated as follows:
Figure BDA0002526821440000097
in Step5, the input test image, i.e. the two low-resolution source images on the left side in fig. 1, is the input low-resolution source image of the specific example, and the intermediate image on the right side in fig. 1 is the fusion result image of the specific example, it can be seen that the super-resolution fusion result contains abundant texture detail information of the two low-resolution source images, which indicates that the method can capture information in the low-resolution source images deeply and further generate natural high-quality details. The focused boundary and the non-focused boundary are accurately estimated, which shows that the region information enhancement block of the invention has the effect of accurately estimating the object contour, the fusion block based on the block attention mechanism has the effect of accurately estimating the focused region, and the combination of the two blocks ensures the information fusion of the focused regions of the two source images.
Further, unless otherwise specified, Relu is used as the nonlinear activation function after all convolutional layers; the convolutional layers are all SAME type convolution, namely the input and output of the convolutional layers are consistent in size, and all source images share one source image super-resolution branch.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit and scope of the present invention.

Claims (7)

1. An image super-fusion method based on region information enhancement and block self-attention is characterized in that: the method comprises the following specific steps:
Step1, in the task of super-resolution and fusion of multi-focus images, the two low-resolution source images are respectively input into the source image super-resolution branch; at the same time, the two source images are spliced together along the channel dimension and input into the fusion and super-resolution branch; a 3×3 convolutional layer at the beginning of each of the source image super-resolution branch and the fusion and super-resolution branch preliminarily extracts features; then, the source image super-resolution branch contains 16 feature extraction blocks and 16 region information enhancement blocks, and the fusion and super-resolution branch contains 16 fusion blocks based on a block self-attention mechanism; the 16 feature extraction blocks, the 16 region information enhancement blocks and the 16 fusion blocks based on the block self-attention mechanism are in one-to-one correspondence, and i, with 0 ≤ i ≤ 16, is defined as the index of the i-th feature extraction block or region information enhancement block or fusion block based on the block self-attention mechanism;
Step2, in the source image super-resolution branch, the initial feature map passes through the 16 feature extraction blocks, which are connected in a dense connection manner; the output of the (i-1)-th feature extraction block, in addition to being passed on to the i-th feature extraction block to construct the super-resolution source image, is also input into the i-th region information enhancement block to assist the fusion and super-resolution branch in obtaining the decision weight map; the region information enhancement block enhances the information of the salient feature regions, particularly the feature information of the focused regions, and the information output by the region information enhancement block is input into the i-th fusion block based on the block self-attention mechanism in the fusion and super-resolution branch;
step3, in the fusion and super-resolution branch, the initial feature map passes through 16 fusion blocks based on a block self-attention mechanism, the features are fully extracted, and the information is fused in a self-adaptive manner;
Step4, following the 16 feature extraction blocks in the source image super-resolution branch and the 16 fusion blocks based on the block self-attention mechanism in the fusion and super-resolution branch, there is a 1×1 convolutional layer and a sub-pixel convolutional layer; the 1×1 convolution reduces the number of channels of the outputs of the 16th feature extraction block of each source image super-resolution branch and of the 16th fusion block based on the block self-attention mechanism to the square of the magnification factor r; the sub-pixel convolution up-samples the outputs of the 1×1 convolutional layers to the target size H×W, where H and W respectively denote the height and width of the target size; after sub-pixel convolution, the source image super-resolution branch yields the super-resolution results of the two source images, denoted I_A^SR and I_B^SR; in the fusion and super-resolution branch, normalization is performed through a Sigmoid function and the decision weight map W_SR for multi-focus image fusion is obtained through threshold division; finally, the super-resolution fusion result image I_F^SR is obtained by combining the source images;
Step5, obtaining the network parameter through Step4 in the network parameter training process
Figure FDA0002526821430000021
Super-resolution results of
Figure FDA0002526821430000022
Figure FDA0002526821430000023
And a decision weight graph WSRSuper-resolution fusion result image
Figure FDA0002526821430000024
And then, calculating the loss between the label and the label, and minimizing the loss by using an optimizer based on a gradient descent method, thereby optimizing the parameters of the network, finishing the network training when the loss gradually decreases to be flat, and obtaining a high-quality super-resolution and fusion result by testing.
2. The image super-fusion method based on region information enhancement and block self-attention of claim 1, wherein:
the dense connection mode proposed in Step2 means that: the initial feature map f_0 output by the first convolutional layer of the source image super-resolution branch and the outputs of the previous i-1 feature extraction blocks are taken as the input of the i-th feature extraction block; finally, f_0 and the outputs of all the blocks are spliced together, and dimension reduction and information integration are performed through a 1×1 convolution; the feature extraction block consists of three 3×3 convolutional layers and uses residual learning to alleviate the degradation problem caused by deep networks.
3. The image super-fusion method based on region information enhancement and block self-attention of claim 1, wherein:
the region information enhancement block proposed in Step2 is as follows: first, a convolutional layer acts on the input feature map, and the dimension of the output feature map is 2 times that of the input feature map; the output feature map is sliced along the channel dimension into two feature maps of the same dimension, which are the offsets of the input feature map in the horizontal and vertical directions; that is, the convolutional layer learns the offset of each position of the input feature map in the horizontal and vertical directions, and the horizontal and vertical offsets together with the input feature map are input into a deformable convolution to obtain a feature map that is closer to the shape and size of the objects; denoting the outputs of the i-th feature extraction block of the super-resolution branches of the two source images A and B as F_A^i and F_B^i, and their offsets in the horizontal and vertical directions as (Δh_A^i, Δv_A^i) and (Δh_B^i, Δv_B^i) respectively, the feature maps of salient object region information input to the fusion and super-resolution branch at the i-th stage, E_A^i and E_B^i, are calculated as follows:

(Δh_A^i, Δv_A^i) = split(Conv(F_A^i))

(Δh_B^i, Δv_B^i) = split(Conv(F_B^i))

E_A^i = LeakyRelu(DConv(F_A^i, Δh_A^i, Δv_A^i))

E_B^i = LeakyRelu(DConv(F_B^i, Δh_B^i, Δv_B^i))

where split(·) is the channel slicing operation, DConv(·) denotes the deformable convolution, Conv(·) denotes a convolutional layer with convolution kernel size k of 3, and LeakyRelu(·) is a commonly used nonlinear activation function with slope s set to 0.2.
4. The image super-fusion method based on region information enhancement and block self-attention of claim 3, characterized in that:
the block self-attention mechanism proposed in Step3 means that, when the local feature of a pixel is considered, attention should be paid to the pixels that strongly influence it, and the feature relationship between each position and its 7×7 neighborhood is explored; for a position p in the input feature map, define N_p as the 7×7 neighborhood range with p as the center point, and x_q as the feature value at a position q within N_p; the information within the neighborhood range is fused, and Sigmoid(·) is an intra-block normalization function used to calculate the weight of the features at other positions in the neighborhood with respect to the feature at the center point p; after the block self-attention mechanism, the feature value y_p at position p can be calculated as:

y_p = BatchNormalize( Σ_{q∈N_p} Sigmoid(x_p^T x_q) · x_q )

where x_p^T x_q calculates, by transposed multiplication, the correlation between the feature vector x_p at position p and the feature vector x_q at position q, and BatchNormalize(·) is a batch normalization operation;
the fusion block based on the block self-attention mechanism proposed in Step3 means that the previously output fusion feature map is spliced with the feature maps of the salient focused regions input from the source image super-resolution branch; after information integration through a 1×1 convolution and several 3×3 convolutional layers, the self-attention mechanism based on the block range is used to more accurately highlight the extent of salient objects.
5. The image super-fusion method based on region information enhancement and block self-attention of claim 1, wherein: the Sigmoid normalization in Step4 means the following: let S denote the single-channel output of the sub-pixel convolution in the fusion and super-resolution branch, whose size is the target size, and let (m, n) denote a coordinate position; then

P(m,n) = Sigmoid(S(m,n)) = 1 / (1 + exp(−S(m,n)))

and the decision weight map for multi-focus image fusion is obtained by division with a threshold t; t is set to 0.5, and the decision weight map W_SR is obtained by the following formula:

W_SR(m,n) = 1 if P(m,n) > 0.5, and W_SR(m,n) = 0 otherwise;

then the fusion result I_F^SR can be obtained from the decision weight map W_SR and the super-resolution results I_A^SR and I_B^SR of the two source images:

I_F^SR(m,n) = W_SR(m,n) · I_A^SR(m,n) + (1 − W_SR(m,n)) · I_B^SR(m,n)
6. The image super-fusion method based on region information enhancement and block self-attention of claim 1, wherein: in the loss calculation proposed in Step5, the loss is calculated with the L1 norm, which has good convex optimization properties, and an Adam optimizer is used to minimize the loss value; define I_A^HR, I_B^HR and I_F^HR as the label values, i.e. the high-resolution images corresponding to the two low-resolution source images and the high-resolution fused image, and let W_SR and W_HR be the super-resolution fusion decision map and the high-resolution label fusion decision map; then the loss is calculated as follows:

Loss = ||I_A^SR − I_A^HR||_1 + ||I_B^SR − I_B^HR||_1 + ||I_F^SR − I_F^HR||_1 + ||W_SR − W_HR||_1
7. the image super-fusion method based on region information enhancement and block self-attention according to any one of claims 1-6, characterized in that: relu was used as the nonlinear activation function after all convolutional layers, except where specifically noted; the convolutional layers are all SAME type convolution, namely the input and output of the convolutional layers are consistent in size, and all source images share one source image super-resolution branch.
CN202010506835.XA 2020-06-05 2020-06-05 Image super-fusion method based on regional information enhancement and block self-attention Active CN111861880B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010506835.XA CN111861880B (en) 2020-06-05 2020-06-05 Image super-fusion method based on regional information enhancement and block self-attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010506835.XA CN111861880B (en) 2020-06-05 2020-06-05 Image super-fusion method based on regional information enhancement and block self-attention

Publications (2)

Publication Number Publication Date
CN111861880A true CN111861880A (en) 2020-10-30
CN111861880B CN111861880B (en) 2022-08-30

Family

ID=72986067

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010506835.XA Active CN111861880B (en) 2020-06-05 2020-06-05 Image super-fusion method based on regional information enhancement and block self-attention

Country Status (1)

Country Link
CN (1) CN111861880B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112418163A (en) * 2020-12-09 2021-02-26 北京深睿博联科技有限责任公司 Multispectral target detection blind guiding system
CN112784909A (en) * 2021-01-28 2021-05-11 哈尔滨工业大学 Image classification and identification method based on self-attention mechanism and self-adaptive sub-network
CN113094972A (en) * 2021-03-15 2021-07-09 西南大学 Basement depth prediction method and system based on generation of confrontation network and environmental element data
CN113537246A (en) * 2021-08-12 2021-10-22 浙江大学 Gray level image simultaneous coloring and hyper-parting method based on counterstudy
CN113705675A (en) * 2021-08-27 2021-11-26 合肥工业大学 Multi-focus image fusion method based on multi-scale feature interaction network
CN113837946A (en) * 2021-10-13 2021-12-24 中国电子技术标准化研究院 Lightweight image super-resolution reconstruction method based on progressive distillation network
CN113963009A (en) * 2021-12-22 2022-01-21 中科视语(北京)科技有限公司 Local self-attention image processing method and model based on deformable blocks

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180182109A1 (en) * 2016-12-22 2018-06-28 TCL Research America Inc. System and method for enhancing target tracking via detector and tracker fusion for unmanned aerial vehicles
US20190122103A1 (en) * 2017-10-24 2019-04-25 International Business Machines Corporation Attention based sequential image processing
CN109714592A (en) * 2019-01-31 2019-05-03 天津大学 Stereo image quality evaluation method based on binocular fusion network
US20190156220A1 (en) * 2017-11-22 2019-05-23 Microsoft Technology Licensing, Llc Using machine comprehension to answer a question
CN109859106A (en) * 2019-01-28 2019-06-07 桂林电子科技大学 A kind of image super-resolution rebuilding method based on the high-order converged network from attention
CN110033410A (en) * 2019-03-28 2019-07-19 华中科技大学 Image reconstruction model training method, image super-resolution rebuilding method and device
CN110322402A (en) * 2019-04-30 2019-10-11 武汉理工大学 Medical image super resolution ratio reconstruction method based on dense mixing attention network
CN110334765A (en) * 2019-07-05 2019-10-15 西安电子科技大学 Remote Image Classification based on the multiple dimensioned deep learning of attention mechanism
CN111179167A (en) * 2019-12-12 2020-05-19 天津大学 Image super-resolution method based on multi-stage attention enhancement network

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180182109A1 (en) * 2016-12-22 2018-06-28 TCL Research America Inc. System and method for enhancing target tracking via detector and tracker fusion for unmanned aerial vehicles
US20190122103A1 (en) * 2017-10-24 2019-04-25 International Business Machines Corporation Attention based sequential image processing
US20190156220A1 (en) * 2017-11-22 2019-05-23 Microsoft Technology Licensing, Llc Using machine comprehension to answer a question
CN109859106A (en) * 2019-01-28 2019-06-07 桂林电子科技大学 A kind of image super-resolution rebuilding method based on the high-order converged network from attention
CN109714592A (en) * 2019-01-31 2019-05-03 天津大学 Stereo image quality evaluation method based on binocular fusion network
CN110033410A (en) * 2019-03-28 2019-07-19 华中科技大学 Image reconstruction model training method, image super-resolution rebuilding method and device
CN110322402A (en) * 2019-04-30 2019-10-11 武汉理工大学 Medical image super resolution ratio reconstruction method based on dense mixing attention network
CN110334765A (en) * 2019-07-05 2019-10-15 西安电子科技大学 Remote Image Classification based on the multiple dimensioned deep learning of attention mechanism
CN111179167A (en) * 2019-12-12 2020-05-19 天津大学 Image super-resolution method based on multi-stage attention enhancement network

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
LIANG X C ET AL.: ""MCFNet: multi-layer concatenation fusion network for medical images fusion"", 《IEEE SENSORS JOURNAL》 *
Y QING-MING LIU: ""Face Super-Resolution Reconstruction Based on Self-Attention Residual Network"", 《IEEE ACCESS》 *
朱欣鑫: "Research on Image Description Algorithms Based on Deep Learning" (基于深度学习的图像描述算法研究), 《信息科技》 *
杨默远等: "Joint Implementation of Image Fusion and Super-Resolution via Convolutional Sparse Representation" (卷积稀疏表示图像融合与超分辨率联合实现), 《光学技术》 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112418163A (en) * 2020-12-09 2021-02-26 北京深睿博联科技有限责任公司 Multispectral target detection blind guiding system
CN112418163B (en) * 2020-12-09 2022-07-12 北京深睿博联科技有限责任公司 Multispectral target detection blind guiding system
CN112784909A (en) * 2021-01-28 2021-05-11 哈尔滨工业大学 Image classification and identification method based on self-attention mechanism and self-adaptive sub-network
CN113094972A (en) * 2021-03-15 2021-07-09 西南大学 Basement depth prediction method and system based on generation of confrontation network and environmental element data
CN113094972B (en) * 2021-03-15 2022-08-02 西南大学 Bedrock depth prediction method and system based on generation of confrontation network and environmental element data
CN113537246A (en) * 2021-08-12 2021-10-22 浙江大学 Gray level image simultaneous coloring and hyper-parting method based on counterstudy
CN113705675A (en) * 2021-08-27 2021-11-26 合肥工业大学 Multi-focus image fusion method based on multi-scale feature interaction network
CN113705675B (en) * 2021-08-27 2022-10-04 合肥工业大学 Multi-focus image fusion method based on multi-scale feature interaction network
CN113837946A (en) * 2021-10-13 2021-12-24 中国电子技术标准化研究院 Lightweight image super-resolution reconstruction method based on progressive distillation network
CN113963009A (en) * 2021-12-22 2022-01-21 中科视语(北京)科技有限公司 Local self-attention image processing method and model based on deformable blocks

Also Published As

Publication number Publication date
CN111861880B (en) 2022-08-30

Similar Documents

Publication Publication Date Title
CN111861880B (en) Image super-fusion method based on regional information enhancement and block self-attention
Islam et al. Simultaneous enhancement and super-resolution of underwater imagery for improved visual perception
CN110111366B (en) End-to-end optical flow estimation method based on multistage loss
CN111325794B (en) Visual simultaneous localization and map construction method based on depth convolution self-encoder
Engin et al. Cycle-dehaze: Enhanced cyclegan for single image dehazing
CN109791697B (en) Predicting depth from image data using statistical models
CN111950453B (en) Random shape text recognition method based on selective attention mechanism
CN113657388B (en) Image semantic segmentation method for super-resolution reconstruction of fused image
CN110717851A (en) Image processing method and device, neural network training method and storage medium
CN110378838B (en) Variable-view-angle image generation method and device, storage medium and electronic equipment
CN110381268B (en) Method, device, storage medium and electronic equipment for generating video
CN112733950A (en) Power equipment fault diagnosis method based on combination of image fusion and target detection
CN110910437A (en) Depth prediction method for complex indoor scene
CN117253154B (en) Container weak and small serial number target detection and identification method based on deep learning
CN115713679A (en) Target detection method based on multi-source information fusion, thermal infrared and three-dimensional depth map
CN114724155A (en) Scene text detection method, system and equipment based on deep convolutional neural network
CN116612468A (en) Three-dimensional target detection method based on multi-mode fusion and depth attention mechanism
Hu et al. Effective local-global transformer for natural image matting
CN116645598A (en) Remote sensing image semantic segmentation method based on channel attention feature fusion
Li et al. An improved method for underwater image super-resolution and enhancement
CN116563103A (en) Remote sensing image space-time fusion method based on self-adaptive neural network
CN112950653B (en) Attention image segmentation method, device and medium
CN113780305B (en) Significance target detection method based on interaction of two clues
Nie et al. Binocular image dehazing via a plain network without disparity estimation
CN114565764A (en) Port panorama sensing system based on ship instance segmentation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant