CN115619681A - Image reconstruction method based on multi-granularity Vit automatic encoder - Google Patents

Image reconstruction method based on multi-granularity Vit automatic encoder

Info

Publication number
CN115619681A
Authority
CN
China
Prior art keywords
vit
layer
granularity
decoding
image reconstruction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211374681.9A
Other languages
Chinese (zh)
Inventor
柯逍
许培荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University filed Critical Fuzhou University
Priority to CN202211374681.9A
Publication of CN115619681A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/70 Denoising; Smoothing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Processing (AREA)

Abstract

The invention relates to an image reconstruction method based on a multi-granularity Vit automatic encoder, which comprises the following steps: S1, constructing an image reconstruction training set and training a Vit-based image reconstruction refiner, wherein the Vit-based image reconstruction refiner comprises an encoder, a decoder and a skip connection module; S2, inputting the original image into the encoder to obtain intermediate features, and sampling the local information of the encoder at each encoding layer; and S3, inputting the obtained intermediate features into the decoder to restore the image information, and fusing the decoded information with the global information during the decoding of each layer. Through the multi-granularity Vit image refiner network, the invention achieves effective denoising and image super-resolution for the image reconstruction task.

Description

Image reconstruction method based on multi-granularity Vit automatic encoder
Technical Field
The invention relates to the field of pattern recognition and computer vision, in particular to an image reconstruction method based on a multi-granularity Vit automatic encoder.
Background
In recent years, deep convolutional neural networks have achieved great success in the field of computer vision. Image generation methods based on deep learning exploit the advantages of deep convolutional neural networks in feature extraction: GAN models for image generation keep emerging, image denoising has become more accurate, and the generated results are better. In addition, a growing number of researchers publish papers related to image reconstruction at computer vision conferences. With the recent breakthroughs of deep learning and attention mechanisms in the computer vision field, the image reconstruction task can make further progress by incorporating vision Transformer methods.
An autoencoder is a classical generative model; for example, a convolutional network with the Unet structure is commonly used as a refiner in the field of image generation. Although convolutional neural networks have worked well in the GAN domain, some problems remain: the model must trade off between image resolution and model size, and the shrinking of the feature map inherent to convolutional down-sampling causes information loss in the encoding process.
Disclosure of Invention
In view of the above, an object of the present invention is to provide an image reconstruction method based on a multi-granularity Vit automatic encoder, which achieves effective denoising and image super-resolution for the image reconstruction task through a multi-granularity Vit image refiner network.
To achieve this purpose, the invention adopts the following technical scheme:
An image reconstruction method based on a multi-granularity Vit automatic encoder comprises the following steps:
S1, constructing an image reconstruction training set and training a Vit-based image reconstruction refiner, wherein the Vit-based image reconstruction refiner comprises an encoder, a decoder and a skip connection module;
S2, inputting the original image into the encoder to obtain intermediate features, and sampling the local information of the encoder at each encoding layer;
S3, inputting the obtained intermediate features into the decoder to restore the image information, and fusing the decoded information with the global information during the decoding of each layer.
Further, the image reconstruction training set is constructed as follows: a public image set is obtained from the network as the image reconstruction training set, a similarity transformation is applied to the images in the training set, and the pixel values of all input images have the mean subtracted and are then normalized.
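As an illustration of this preprocessing, a minimal sketch is given below. The similarity-transformation ranges, the input resolution, the normalization statistics and the dataset path are assumptions made for the example only; they are not values specified by the invention.

```python
# Hypothetical preprocessing sketch for the image reconstruction training set.
import torchvision.transforms as T
from torchvision.datasets import ImageFolder

# A similarity transformation = rotation + isotropic scaling + translation (no shear).
similarity = T.RandomAffine(degrees=15, translate=(0.1, 0.1), scale=(0.9, 1.1))

preprocess = T.Compose([
    similarity,
    T.Resize((256, 256)),   # assumed input resolution
    T.ToTensor(),
    # subtract the mean and normalize; ImageNet statistics used as a stand-in
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Any public image set laid out in folders can serve as the reconstruction training set.
train_set = ImageFolder(root="path/to/public_images", transform=preprocess)
```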
Further, the encoder structure adopts 4 Vit encoding layers that differ from the conventional Vit: patch sizes of (64, 32, 16, 8) are used for the encoding layers that take features of different granularities as input, and the corresponding encoding layers use (2, 4, 8, 12) attention heads, respectively;
In the encoding process of each layer, a convolution whose kernel size equals the next layer's granularity is applied to the feature map in place of the down-sampling step of a conventional convolutional network, changing the feature granularity while sampling the local information of the encoder; the features of the intermediate encoding layers are sent to the skip connection module as the global information of the encoding.
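A minimal sketch of one such encoding layer is shown below: the feature map is patch-embedded at this layer's granularity, refined by multi-head self-attention with the corresponding head count, restored to the original spatial size, and then convolved with a kernel equal to the next layer's granularity at stride 1 with 'same' padding, so the feature map is never down-sampled. The class name, channel width, and the nearest-neighbour interpolation used to restore the spatial size are assumptions, not the invention's published implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class MultiGranularityEncoderLayer(nn.Module):
    """One Vit encoding layer of the refiner (illustrative sketch only)."""
    def __init__(self, channels: int, patch: int, next_patch: int, heads: int):
        super().__init__()
        self.to_tokens = nn.Conv2d(channels, channels, kernel_size=patch, stride=patch)
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        # kernel size = next layer's granularity; stride 1 + 'same' padding keeps the map size
        self.regranulate = nn.Conv2d(channels, channels, kernel_size=next_patch,
                                     stride=1, padding="same")

    def forward(self, x):                        # x: (B, C, H, W)
        B, C, H, W = x.shape
        tok = self.to_tokens(x)                  # patchify at this layer's granularity
        h, w = tok.shape[-2:]
        tok = tok.flatten(2).transpose(1, 2)     # (B, N, C) token sequence
        tok = tok + self.attn(self.norm(tok), self.norm(tok), self.norm(tok))[0]
        tok = tok.transpose(1, 2).reshape(B, C, h, w)
        # restore the original feature-map size (assumption), then sample local
        # information with a kernel of the next layer's granularity (no down-sampling)
        x = x + F.interpolate(tok, size=(H, W), mode="nearest")
        return self.regranulate(x)

# Encoder of the embodiment: patch sizes (64, 32, 16, 8) with (2, 4, 8, 12) attention heads.
```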
Further, the decoder structure adopts 4 Vit decoding layers: patch sizes of (8, 16, 32, 64) are used for the decoding layers that take features of different granularities as input, and the corresponding decoding layers use (12, 8, 4, 2) attention heads, respectively;
The features of the intermediate decoding layers are sent to the skip connection module as the decoded local information; during the decoding of each layer, the global information and the decoded local information are fused by the skip connection module, and the output fused features are sent to the next decoding layer.
Further, through a cross-attention mechanism, the skip connection module takes the decoded local information as the key and value vectors and the global information as the query vector for fusion; the resulting fused features are input into the next decoding layer for further processing, and the calculation formula is as follows:
F_{d,k+1} = F_{d,k} + coAttn(Norm(F_{e,k}), Norm(F_{d,k}))
where F_{d,k} denotes the decoding feature of the k-th layer, F_{e,k} denotes the encoding feature of the k-th layer, F_{d,k+1} denotes the decoded feature map sent to the next layer after skip-connection fusion, Norm(·) denotes the normalization layer, and coAttn(·,·) denotes cross-attention.
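A minimal PyTorch sketch of this skip connection module is given below. It follows the formula above: the normalized global (encoder) features form the query, the normalized decoded local features form the key and value, and the cross-attention output is added residually to the decoder features. The class and argument names are assumptions.

```python
import torch.nn as nn

class CrossAttentionSkip(nn.Module):
    """Skip connection: F_{d,k+1} = F_{d,k} + coAttn(Norm(F_{e,k}), Norm(F_{d,k}))."""
    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.norm_e = nn.LayerNorm(dim)                      # Norm applied to F_{e,k}
        self.norm_d = nn.LayerNorm(dim)                      # Norm applied to F_{d,k}
        self.co_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, f_e_k, f_d_k):
        # f_e_k: (B, N, C) encoder features of layer k (global information) -> query
        # f_d_k: (B, N, C) decoder features of layer k (decoded local information) -> key, value
        q = self.norm_e(f_e_k)
        kv = self.norm_d(f_d_k)
        fused, _ = self.co_attn(q, kv, kv)
        return f_d_k + fused                                 # F_{d,k+1}, sent to the next decoding layer
```

For example, skip = CrossAttentionSkip(dim=96, heads=8); f_next = skip(enc_feat, dec_feat) fuses one pair of layer-k features.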
A refiner for image reconstruction based on a multi-granularity Vit automatic encoder comprises an encoder, a decoder and a skip connection module; the encoder adopts a multi-granularity Vit structure and forms the network model for extracting deep image features, outputting the intermediate features; the decoder adopts a multi-granularity Vit structure and decodes and restores the intermediate features obtained by the encoder, outputting an image of the same scale as the original image; through a cross-attention mechanism, the skip connection module takes the decoded local information as the key and value vectors and the global information as the query vector for fusion, and the resulting fused features are input into the next decoding layer for further processing.
Compared with the prior art, the invention has the following beneficial effects:
1. The invention can effectively denoise images to complete the image reconstruction task, reduces the loss in the image encoding process, and improves the high-resolution quality of the reconstructed image.
2. The invention uses the Transformer method to build the encoder and decoder structures; the attention mechanism keeps the scale unchanged during encoding, replacing the feature-map shrinkage of the convolution process and reducing the information loss of the encoding process.
3. The invention extracts features with multi-granularity patching, so that each encoding layer extracts features of a different granularity; with the feature-map size unchanged, the image features are summarized by the multi-granularity method, making the extracted features more complete and richer.
4. The invention connects each encoding and decoding layer through the skip connection module, so that the features of each granularity are fused before and after encoding via the skip connection module; combining the advantages of the feature pyramid with an added cross-attention mechanism, the pre-encoding features guide the post-encoding features, which optimizes the fusion process and improves the feature matching degree.
Drawings
FIG. 1 is a schematic diagram of the process of the present invention.
Detailed Description
The invention is further explained below with reference to the drawings and the embodiments.
Referring to Fig. 1, the present invention provides an image reconstruction method based on a multi-granularity Vit automatic encoder, comprising the following steps:
S1, constructing an image reconstruction training set and training a Vit-based image reconstruction refiner, wherein the Vit-based image reconstruction refiner comprises a Vit encoder, a Vit decoder and a skip connection module;
S2, inputting the original image into the Vit encoder to obtain intermediate features, and sampling the local information of the encoder at each encoding layer;
S3, inputting the obtained intermediate features into the decoder to restore the image information, and fusing the decoded information with the global information during the decoding of each layer.
In this embodiment, the image reconstruction training set is constructed as follows: a public image set is obtained from the network as the image reconstruction training set and used for self-supervised training of the network model; a similarity transformation is applied to the images in the training set, and the pixel values of all input images have the mean subtracted and are then normalized.
In this embodiment, the refiner network structure formed by the Vit-based encoder, decoder and skip connection module replaces the conventional Unet network structure built from CNN convolutions; the Vit image encoding process loses less information than the conventional method, so the quality of the final image reconstruction is better.
Let D = {d_1, d_2, ..., d_N} be the images in the training set, where d_i is the i-th image; the reconstruction loss is computed with MSELoss as follows:
L_MSE = (1/N) Σ_{i=1}^{N} ||d_i - d̂_i||^2
where d̂_i is the output image predicted by the network;
The network parameters are then updated by gradient descent and the back-propagation algorithm. The trained model is used for image denoising: a sample image is preprocessed in the same way as the training images in S1, input into the trained network model, and the network finally outputs new image data, i.e. the denoised image, thereby realizing image reconstruction.
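A compact sketch of such a training loop is given below. Adam as the gradient-descent optimizer, the additive Gaussian degradation used to create noisy inputs, the batch size and epoch count, and the hypothetical MultiGranularityRefiner model (sketched later in this description) together with the train_set from the preprocessing sketch are all assumptions, not details fixed by the invention.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

model = MultiGranularityRefiner()                            # hypothetical refiner sketched below
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)    # gradient-descent variant
criterion = nn.MSELoss()                                     # MSELoss between d_i and its prediction
loader = DataLoader(train_set, batch_size=16, shuffle=True)

for epoch in range(100):
    for clean, _ in loader:
        noisy = clean + 0.1 * torch.randn_like(clean)        # assumed degradation for denoising
        pred = model(noisy)                                  # predicted output image d̂_i
        loss = criterion(pred, clean)
        optimizer.zero_grad()
        loss.backward()                                      # back-propagation
        optimizer.step()                                     # parameter update
```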
Preferably, in this embodiment, a multi-granularity Vit structure is adopted as the encoder and forms the network model for deep image feature extraction, outputting the intermediate features; the Vit encoder structure adopts 4 Vit encoding layers that differ from the conventional Vit: patch sizes of (64, 32, 16, 8) are used for the encoding layers that take features of different granularities as input, and the corresponding encoding layers use (2, 4, 8, 12) attention heads, respectively;
In the encoding process of each layer, a convolution whose kernel size equals the next layer's granularity is applied to the feature map in place of the down-sampling step of a conventional convolutional network, changing the feature granularity while sampling the local information of the encoder; the features of the intermediate encoding layers are sent to the skip connection module as the global information of the encoding.
Preferably, in this embodiment, a multi-granularity Vit structure is adopted as the decoder, which decodes and restores the intermediate features and outputs an image of the same scale as the original image; the decoder structure adopts 4 Vit decoding layers: patch sizes of (8, 16, 32, 64) are used for the decoding layers that take features of different granularities as input, and the corresponding decoding layers use (12, 8, 4, 2) attention heads, respectively;
The features of the intermediate decoding layers are sent to the skip connection module as the decoded local information; during the decoding of each layer, the global information and the decoded local information are fused by the skip connection module, and the output fused features are sent to the next decoding layer.
In this embodiment, the local information of the encoding layers is input into the skip connection module as the global information, and the intermediate output features of the decoding layers are sent to the skip connection module as the decoded local information.
Unlike the skip connection path of a conventional feature pyramid, the skip connection module uses a cross-attention mechanism in which the decoded local information serves as the key and value vectors and the global information serves as the query vector for fusion; the resulting fused features are input into the next decoding layer for further processing, and the calculation formula is as follows:
F_{d,k+1} = F_{d,k} + coAttn(Norm(F_{e,k}), Norm(F_{d,k}))
where F_{d,k} denotes the decoding feature of the k-th layer, F_{e,k} denotes the encoding feature of the k-th layer, F_{d,k+1} denotes the decoded feature map sent to the next layer after skip-connection fusion, Norm(·) denotes the normalization layer, and coAttn(·,·) denotes cross-attention.
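For orientation, a possible wiring of the complete refiner is sketched below, reusing the hypothetical MultiGranularityEncoderLayer and CrossAttentionSkip classes from the earlier sketches. The channel width, the stem and output convolutions, the reuse of the encoder-layer sketch for the decoding layers, the mirrored pairing of encoder and decoder layers, and the modest input size are assumptions.

```python
import torch.nn as nn

class MultiGranularityRefiner(nn.Module):
    """Sketch: 4 encoding layers (patch 64/32/16/8, heads 2/4/8/12), 4 decoding layers
    (patch 8/16/32/64, heads 12/8/4/2), cross-attention skip connections.
    Assumes a small input size (e.g. 64x64, divisible by the largest patch) so that
    the pixel-level cross-attention in this illustration stays tractable."""
    def __init__(self, channels: int = 96):
        super().__init__()
        self.stem = nn.Conv2d(3, channels, kernel_size=3, padding=1)
        enc_cfg = [(64, 32, 2), (32, 16, 4), (16, 8, 8), (8, 8, 12)]    # (patch, next_patch, heads)
        dec_cfg = [(8, 16, 12), (16, 32, 8), (32, 64, 4), (64, 64, 2)]
        self.encoder = nn.ModuleList([MultiGranularityEncoderLayer(channels, p, q, h)
                                      for p, q, h in enc_cfg])
        self.decoder = nn.ModuleList([MultiGranularityEncoderLayer(channels, p, q, h)
                                      for p, q, h in dec_cfg])
        self.skips = nn.ModuleList([CrossAttentionSkip(channels, h) for _, _, h in dec_cfg])
        self.head = nn.Conv2d(channels, 3, kernel_size=3, padding=1)

    def forward(self, x):
        x = self.stem(x)
        global_feats = []                                    # encoder features = global information
        for layer in self.encoder:
            x = layer(x)
            global_feats.append(x)
        for k, (layer, skip) in enumerate(zip(self.decoder, self.skips)):
            x = layer(x)
            g = global_feats[-(k + 1)]                       # mirrored encoder layer (assumption)
            b, c, h, w = x.shape
            fused = skip(g.flatten(2).transpose(1, 2),       # global information  -> query
                         x.flatten(2).transpose(1, 2))       # decoded local info -> key / value
            x = fused.transpose(1, 2).reshape(b, c, h, w)
        return self.head(x)                                  # image at the original scale
```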
The above description is only a preferred embodiment of the present invention, and all equivalent changes and modifications made in accordance with the claims of the present invention should be covered by the present invention.

Claims (6)

1. An image reconstruction method based on a multi-granularity Vit automatic encoder, characterized by comprising the following steps:
S1, constructing an image reconstruction training set and training a Vit-based image reconstruction refiner, wherein the Vit-based image reconstruction refiner comprises an encoder, a decoder and a skip connection module;
S2, inputting the original image into the encoder to obtain intermediate features, and sampling the local information of the encoder at each encoding layer;
S3, inputting the obtained intermediate features into the decoder to restore the image information, and fusing the decoded information with the global information during the decoding of each layer.
2. The image reconstruction method based on the multi-granularity Vit automatic encoder according to claim 1, wherein the image reconstruction training set is constructed as follows: a public image set is obtained from the network as the image reconstruction training set, a similarity transformation is applied to the images in the training set, and the pixel values of all input images have the mean subtracted and are then normalized.
3. The image reconstruction method based on the multi-granularity Vit automatic encoder according to claim 1, wherein the encoder structure adopts 4 Vit encoding layers that differ from the conventional Vit, patch sizes of (64, 32, 16, 8) are used for the encoding layers that take features of different granularities as input, and the corresponding encoding layers use (2, 4, 8, 12) attention heads, respectively;
in the encoding process of each layer, a convolution whose kernel size equals the next layer's granularity is applied to the feature map in place of the down-sampling step of a conventional convolutional network, changing the feature granularity while sampling the local information of the encoder, and the features of the intermediate encoding layers are sent to the skip connection module as the global information of the encoding.
4. The image reconstruction method based on the multi-granularity Vit automatic encoder according to claim 1, wherein the decoder structure adopts 4 Vit decoding layers, patch sizes of (8, 16, 32, 64) are used for the decoding layers that take features of different granularities as input, and the corresponding decoding layers use (12, 8, 4, 2) attention heads, respectively;
the features of the intermediate decoding layers are sent to the skip connection module as the decoded local information, the global information and the decoded local information are fused by the skip connection module during the decoding of each layer, and the output fused features are sent to the next decoding layer.
5. The image reconstruction method based on the multi-granularity Vit automatic encoder according to claim 4, wherein, through a cross-attention mechanism, the skip connection module takes the decoded local information as the key and value vectors and the global information as the query vector for fusion, the resulting fused features are input into the next decoding layer for further processing, and the calculation formula is as follows:
F_{d,k+1} = F_{d,k} + coAttn(Norm(F_{e,k}), Norm(F_{d,k}))
where F_{d,k} denotes the decoding feature of the k-th layer, F_{e,k} denotes the encoding feature of the k-th layer, F_{d,k+1} denotes the decoded feature map sent to the next layer after skip-connection fusion, Norm(·) denotes the normalization layer, and coAttn(·,·) denotes cross-attention.
6. A refiner for image reconstruction based on a multi-granularity Vit automatic encoder, characterized by comprising an encoder, a decoder and a skip connection module; the encoder adopts a multi-granularity Vit structure and forms the network model for extracting deep image features, outputting the intermediate features; the decoder adopts a multi-granularity Vit structure and decodes and restores the intermediate features obtained by the encoder, outputting an image of the same scale as the original image; through a cross-attention mechanism, the skip connection module takes the decoded local information as the key and value vectors and the global information as the query vector for fusion, and the resulting fused features are input into the next decoding layer for further processing.
CN202211374681.9A 2022-11-04 2022-11-04 Image reconstruction method based on multi-granularity Vit automatic encoder Pending CN115619681A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211374681.9A CN115619681A (en) 2022-11-04 2022-11-04 Image reconstruction method based on multi-granularity Vit automatic encoder

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211374681.9A CN115619681A (en) 2022-11-04 2022-11-04 Image reconstruction method based on multi-granularity Vit automatic encoder

Publications (1)

Publication Number Publication Date
CN115619681A true CN115619681A (en) 2023-01-17

Family

ID=84875839

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211374681.9A Pending CN115619681A (en) 2022-11-04 2022-11-04 Image reconstruction method based on multi-granularity Vit automatic encoder

Country Status (1)

Country Link
CN (1) CN115619681A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116912926A (en) * 2023-09-14 2023-10-20 成都武侯社区科技有限公司 Face recognition method based on self-masking face privacy
CN116912926B (en) * 2023-09-14 2023-12-19 成都武侯社区科技有限公司 Face recognition method based on self-masking face privacy

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination