CN115619681A - Image reconstruction method based on multi-granularity Vit automatic encoder - Google Patents

Image reconstruction method based on multi-granularity Vit automatic encoder

Info

Publication number
CN115619681A
Authority
CN
China
Prior art keywords
vit
layer
granularity
decoding
image reconstruction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211374681.9A
Other languages
Chinese (zh)
Inventor
柯逍
许培荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University filed Critical Fuzhou University
Priority to CN202211374681.9A
Publication of CN115619681A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/70 Denoising; Smoothing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Processing (AREA)

Abstract

The invention relates to an image reconstruction method based on a multi-granularity Vit automatic encoder, which comprises the following steps: S1, constructing an image reconstruction training set and training a Vit-based image reconstruction refiner, wherein the Vit-based image reconstruction refiner comprises an encoder, a decoder and a skip connection module; S2, inputting the original image into the encoder to obtain intermediate features, and sampling the local information of the encoder at each encoding layer; and S3, inputting the obtained intermediate features into the decoder to restore the image information, and fusing the decoded information with the global information during the decoding of each layer. Through the multi-granularity Vit image refiner network, the invention achieves effective denoising and image super-resolution for the image reconstruction task.

Description

Image reconstruction method based on multi-granularity Vit automatic encoder
Technical Field
The invention relates to the field of pattern recognition and computer vision, in particular to an image reconstruction method based on a multi-granularity Vit automatic encoder.
Background
In recent years, deep convolutional neural networks have achieved great success in the field of computer vision. Image generation methods based on deep learning exploit the advantages of deep convolutional neural networks in feature extraction: GAN models for image generation keep emerging, image denoising has become more accurate, and the generated results are better. In addition, a growing number of researchers publish papers related to image reconstruction at computer vision conferences. With the recent breakthroughs of deep learning and attention mechanisms in the computer vision field, the image reconstruction task can make further progress by incorporating vision Transformer methods.
An autoencoder is a classical generative model; for example, a convolutional network with the Unet structure is commonly used as a refiner in the field of image generation. Although convolutional neural networks have worked well in the GAN domain, some problems remain: the model must trade off between image resolution and model size, and the shrinking of the feature map inherent to convolutional down-sampling causes information loss in the encoding process.
Disclosure of Invention
In view of the above, an object of the present invention is to provide an image reconstruction method based on a multi-granularity Vit automatic encoder, which achieves effective denoising and image super-resolution for the image reconstruction task through a multi-granularity Vit image refiner network.
To achieve this purpose, the invention adopts the following technical scheme:
An image reconstruction method based on a multi-granularity Vit automatic encoder comprises the following steps:
S1, constructing an image reconstruction training set and training a Vit-based image reconstruction refiner, wherein the Vit-based image reconstruction refiner comprises an encoder, a decoder and a skip connection module;
S2, inputting the original image into the encoder to obtain intermediate features, and sampling the local information of the encoder at each encoding layer;
S3, inputting the obtained intermediate features into the decoder to restore the image information, and fusing the decoded information with the global information during the decoding of each layer.
Further, the image reconstruction training set is constructed as follows: a public image set is obtained from the network as the image reconstruction training set, a similarity transformation is applied to the images in the training set, and the pixel values of all input images have the mean subtracted and are then normalized.
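As an illustration of this preprocessing, a minimal sketch is given below. The similarity-transformation ranges, the input resolution, the normalization statistics and the dataset path are assumptions made for the example only; they are not values specified by the invention.

```python
# Hypothetical preprocessing sketch for the image reconstruction training set.
import torchvision.transforms as T
from torchvision.datasets import ImageFolder

# A similarity transformation = rotation + isotropic scaling + translation (no shear).
similarity = T.RandomAffine(degrees=15, translate=(0.1, 0.1), scale=(0.9, 1.1))

preprocess = T.Compose([
    similarity,
    T.Resize((256, 256)),   # assumed input resolution
    T.ToTensor(),
    # subtract the mean and normalize; ImageNet statistics used as a stand-in
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

# Any public image set laid out in folders can serve as the reconstruction training set.
train_set = ImageFolder(root="path/to/public_images", transform=preprocess)
```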
Further, the encoder structure adopts 4 Vit encoding layers that differ from the conventional Vit: patch sizes of (64, 32, 16, 8) are used for the encoding layers that take features of different granularities as input, and the corresponding encoding layers use (2, 4, 8, 12) attention heads, respectively;
In the encoding process of each layer, a convolution whose kernel size equals the next layer's granularity is applied to the feature map in place of the down-sampling step of a conventional convolutional network, changing the feature granularity while sampling the local information of the encoder; the features of the intermediate encoding layers are sent to the skip connection module as the global information of the encoding.
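A minimal sketch of one such encoding layer is shown below: the feature map is patch-embedded at this layer's granularity, refined by multi-head self-attention with the corresponding head count, restored to the original spatial size, and then convolved with a kernel equal to the next layer's granularity at stride 1 with 'same' padding, so the feature map is never down-sampled. The class name, channel width, and the nearest-neighbour interpolation used to restore the spatial size are assumptions, not the invention's published implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class MultiGranularityEncoderLayer(nn.Module):
    """One Vit encoding layer of the refiner (illustrative sketch only)."""
    def __init__(self, channels: int, patch: int, next_patch: int, heads: int):
        super().__init__()
        self.to_tokens = nn.Conv2d(channels, channels, kernel_size=patch, stride=patch)
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        # kernel size = next layer's granularity; stride 1 + 'same' padding keeps the map size
        self.regranulate = nn.Conv2d(channels, channels, kernel_size=next_patch,
                                     stride=1, padding="same")

    def forward(self, x):                        # x: (B, C, H, W)
        B, C, H, W = x.shape
        tok = self.to_tokens(x)                  # patchify at this layer's granularity
        h, w = tok.shape[-2:]
        tok = tok.flatten(2).transpose(1, 2)     # (B, N, C) token sequence
        tok = tok + self.attn(self.norm(tok), self.norm(tok), self.norm(tok))[0]
        tok = tok.transpose(1, 2).reshape(B, C, h, w)
        # restore the original feature-map size (assumption), then sample local
        # information with a kernel of the next layer's granularity (no down-sampling)
        x = x + F.interpolate(tok, size=(H, W), mode="nearest")
        return self.regranulate(x)

# Encoder of the embodiment: patch sizes (64, 32, 16, 8) with (2, 4, 8, 12) attention heads.
```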
Further, the decoder structure adopts 4 Vit decoding layers: patch sizes of (8, 16, 32, 64) are used for the decoding layers that take features of different granularities as input, and the corresponding decoding layers use (12, 8, 4, 2) attention heads, respectively;
The features of the intermediate decoding layers are sent to the skip connection module as the decoded local information; during the decoding of each layer, the global information and the decoded local information are fused by the skip connection module, and the output fused features are sent to the next decoding layer.
Further, through a cross-attention mechanism, the skip connection module takes the decoded local information as the key and value vectors and the global information as the query vector for fusion; the resulting fused features are input into the next decoding layer for further processing, and the calculation formula is as follows:
F_{d,k+1} = F_{d,k} + coAttn(Norm(F_{e,k}), Norm(F_{d,k}))
where F_{d,k} denotes the decoding feature of the k-th layer, F_{e,k} denotes the encoding feature of the k-th layer, F_{d,k+1} denotes the decoded feature map sent to the next layer after skip-connection fusion, Norm(·) denotes the normalization layer, and coAttn(·,·) denotes cross-attention.
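A minimal PyTorch sketch of this skip connection module is given below. It follows the formula above: the normalized global (encoder) features form the query, the normalized decoded local features form the key and value, and the cross-attention output is added residually to the decoder features. The class and argument names are assumptions.

```python
import torch.nn as nn

class CrossAttentionSkip(nn.Module):
    """Skip connection: F_{d,k+1} = F_{d,k} + coAttn(Norm(F_{e,k}), Norm(F_{d,k}))."""
    def __init__(self, dim: int, heads: int):
        super().__init__()
        self.norm_e = nn.LayerNorm(dim)                      # Norm applied to F_{e,k}
        self.norm_d = nn.LayerNorm(dim)                      # Norm applied to F_{d,k}
        self.co_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, f_e_k, f_d_k):
        # f_e_k: (B, N, C) encoder features of layer k (global information) -> query
        # f_d_k: (B, N, C) decoder features of layer k (decoded local information) -> key, value
        q = self.norm_e(f_e_k)
        kv = self.norm_d(f_d_k)
        fused, _ = self.co_attn(q, kv, kv)
        return f_d_k + fused                                 # F_{d,k+1}, sent to the next decoding layer
```

For example, skip = CrossAttentionSkip(dim=96, heads=8); f_next = skip(enc_feat, dec_feat) fuses one pair of layer-k features.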
A refiner for image reconstruction based on a multi-granularity Vit automatic encoder comprises an encoder, a decoder and a skip connection module; the encoder adopts a multi-granularity Vit structure and forms the network model for extracting deep image features, outputting the intermediate features; the decoder adopts a multi-granularity Vit structure and decodes and restores the intermediate features obtained by the encoder, outputting an image of the same scale as the original image; through a cross-attention mechanism, the skip connection module takes the decoded local information as the key and value vectors and the global information as the query vector for fusion, and the resulting fused features are input into the next decoding layer for further processing.
Compared with the prior art, the invention has the following beneficial effects:
1. The invention can effectively denoise images to complete the image reconstruction task, reduces the loss in the image encoding process, and improves the high-resolution quality of the reconstructed image.
2. The invention uses the Transformer method to build the encoder and decoder structures; the attention mechanism keeps the scale unchanged during encoding, replacing the feature-map shrinkage of the convolution process and reducing the information loss of the encoding process.
3. The invention extracts features with multi-granularity patching, so that each encoding layer extracts features of a different granularity; with the feature-map size unchanged, the image features are summarized by the multi-granularity method, making the extracted features more complete and richer.
4. The invention connects each encoding and decoding layer through the skip connection module, so that the features of each granularity are fused before and after encoding via the skip connection module; combining the advantages of the feature pyramid with an added cross-attention mechanism, the pre-encoding features guide the post-encoding features, which optimizes the fusion process and improves the feature matching degree.
Drawings
FIG. 1 is a schematic diagram of the process of the present invention.
Detailed Description
The invention is further explained below with reference to the drawings and the embodiments.
Referring to Fig. 1, the present invention provides an image reconstruction method based on a multi-granularity Vit automatic encoder, comprising the following steps:
S1, constructing an image reconstruction training set and training a Vit-based image reconstruction refiner, wherein the Vit-based image reconstruction refiner comprises a Vit encoder, a Vit decoder and a skip connection module;
S2, inputting the original image into the Vit encoder to obtain intermediate features, and sampling the local information of the encoder at each encoding layer;
S3, inputting the obtained intermediate features into the decoder to restore the image information, and fusing the decoded information with the global information during the decoding of each layer.
In this embodiment, the image reconstruction training set is constructed as follows: a public image set is obtained from the network as the image reconstruction training set and used for self-supervised training of the network model; a similarity transformation is applied to the images in the training set, and the pixel values of all input images have the mean subtracted and are then normalized.
In this embodiment, the refiner network structure formed by the Vit-based encoder, decoder and skip connection module replaces the conventional Unet network structure built from CNN convolutions; the Vit image encoding process loses less information than the conventional method, so the quality of the final image reconstruction is better.
Let D = {d_1, d_2, ..., d_N} be the images in the training set, where d_i is the i-th image; the reconstruction loss is computed with MSELoss as follows:
L_MSE = (1/N) Σ_{i=1}^{N} ||d_i - d̂_i||^2
where d̂_i is the output image predicted by the network;
The network parameters are then updated by gradient descent and the back-propagation algorithm. The trained model is used for image denoising: a sample image is preprocessed in the same way as the training images in S1, input into the trained network model, and the network finally outputs new image data, i.e. the denoised image, thereby realizing image reconstruction.
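A compact sketch of such a training loop is given below. Adam as the gradient-descent optimizer, the additive Gaussian degradation used to create noisy inputs, the batch size and epoch count, and the hypothetical MultiGranularityRefiner model (sketched later in this description) together with the train_set from the preprocessing sketch are all assumptions, not details fixed by the invention.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

model = MultiGranularityRefiner()                            # hypothetical refiner sketched below
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)    # gradient-descent variant
criterion = nn.MSELoss()                                     # MSELoss between d_i and its prediction
loader = DataLoader(train_set, batch_size=16, shuffle=True)

for epoch in range(100):
    for clean, _ in loader:
        noisy = clean + 0.1 * torch.randn_like(clean)        # assumed degradation for denoising
        pred = model(noisy)                                  # predicted output image d̂_i
        loss = criterion(pred, clean)
        optimizer.zero_grad()
        loss.backward()                                      # back-propagation
        optimizer.step()                                     # parameter update
```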
Preferably, in this embodiment, a multi-granularity Vit structure is adopted as the encoder and forms the network model for deep image feature extraction, outputting the intermediate features; the Vit encoder structure adopts 4 Vit encoding layers that differ from the conventional Vit: patch sizes of (64, 32, 16, 8) are used for the encoding layers that take features of different granularities as input, and the corresponding encoding layers use (2, 4, 8, 12) attention heads, respectively;
In the encoding process of each layer, a convolution whose kernel size equals the next layer's granularity is applied to the feature map in place of the down-sampling step of a conventional convolutional network, changing the feature granularity while sampling the local information of the encoder; the features of the intermediate encoding layers are sent to the skip connection module as the global information of the encoding.
Preferably, in this embodiment, a multi-granularity Vit structure is adopted as the decoder, which decodes and restores the intermediate features and outputs an image of the same scale as the original image; the decoder structure adopts 4 Vit decoding layers: patch sizes of (8, 16, 32, 64) are used for the decoding layers that take features of different granularities as input, and the corresponding decoding layers use (12, 8, 4, 2) attention heads, respectively;
The features of the intermediate decoding layers are sent to the skip connection module as the decoded local information; during the decoding of each layer, the global information and the decoded local information are fused by the skip connection module, and the output fused features are sent to the next decoding layer.
In this embodiment, the local information of the encoding layers is input into the skip connection module as the global information, and the intermediate output features of the decoding layers are sent to the skip connection module as the decoded local information.
Unlike the skip connection path of a conventional feature pyramid, the skip connection module uses a cross-attention mechanism in which the decoded local information serves as the key and value vectors and the global information serves as the query vector for fusion; the resulting fused features are input into the next decoding layer for further processing, and the calculation formula is as follows:
F_{d,k+1} = F_{d,k} + coAttn(Norm(F_{e,k}), Norm(F_{d,k}))
where F_{d,k} denotes the decoding feature of the k-th layer, F_{e,k} denotes the encoding feature of the k-th layer, F_{d,k+1} denotes the decoded feature map sent to the next layer after skip-connection fusion, Norm(·) denotes the normalization layer, and coAttn(·,·) denotes cross-attention.
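For orientation, a possible wiring of the complete refiner is sketched below, reusing the hypothetical MultiGranularityEncoderLayer and CrossAttentionSkip classes from the earlier sketches. The channel width, the stem and output convolutions, the reuse of the encoder-layer sketch for the decoding layers, the mirrored pairing of encoder and decoder layers, and the modest input size are assumptions.

```python
import torch.nn as nn

class MultiGranularityRefiner(nn.Module):
    """Sketch: 4 encoding layers (patch 64/32/16/8, heads 2/4/8/12), 4 decoding layers
    (patch 8/16/32/64, heads 12/8/4/2), cross-attention skip connections.
    Assumes a small input size (e.g. 64x64, divisible by the largest patch) so that
    the pixel-level cross-attention in this illustration stays tractable."""
    def __init__(self, channels: int = 96):
        super().__init__()
        self.stem = nn.Conv2d(3, channels, kernel_size=3, padding=1)
        enc_cfg = [(64, 32, 2), (32, 16, 4), (16, 8, 8), (8, 8, 12)]    # (patch, next_patch, heads)
        dec_cfg = [(8, 16, 12), (16, 32, 8), (32, 64, 4), (64, 64, 2)]
        self.encoder = nn.ModuleList([MultiGranularityEncoderLayer(channels, p, q, h)
                                      for p, q, h in enc_cfg])
        self.decoder = nn.ModuleList([MultiGranularityEncoderLayer(channels, p, q, h)
                                      for p, q, h in dec_cfg])
        self.skips = nn.ModuleList([CrossAttentionSkip(channels, h) for _, _, h in dec_cfg])
        self.head = nn.Conv2d(channels, 3, kernel_size=3, padding=1)

    def forward(self, x):
        x = self.stem(x)
        global_feats = []                                    # encoder features = global information
        for layer in self.encoder:
            x = layer(x)
            global_feats.append(x)
        for k, (layer, skip) in enumerate(zip(self.decoder, self.skips)):
            x = layer(x)
            g = global_feats[-(k + 1)]                       # mirrored encoder layer (assumption)
            b, c, h, w = x.shape
            fused = skip(g.flatten(2).transpose(1, 2),       # global information  -> query
                         x.flatten(2).transpose(1, 2))       # decoded local info -> key / value
            x = fused.transpose(1, 2).reshape(b, c, h, w)
        return self.head(x)                                  # image at the original scale
```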
The above description is only a preferred embodiment of the present invention, and all equivalent changes and modifications made in accordance with the claims of the present invention should be covered by the present invention.

Claims (6)

1. An image reconstruction method based on a multi-granularity Vit automatic encoder, characterized by comprising the following steps:
S1, constructing an image reconstruction training set and training a Vit-based image reconstruction refiner, wherein the Vit-based image reconstruction refiner comprises an encoder, a decoder and a skip connection module;
S2, inputting the original image into the encoder to obtain intermediate features, and sampling the local information of the encoder at each encoding layer;
S3, inputting the obtained intermediate features into the decoder to restore the image information, and fusing the decoded information with the global information during the decoding of each layer.
2. The image reconstruction method based on the multi-granularity Vit automatic encoder according to claim 1, wherein the image reconstruction training set is constructed as follows: a public image set is obtained from the network as the image reconstruction training set, a similarity transformation is applied to the images in the training set, and the pixel values of all input images have the mean subtracted and are then normalized.
3. The image reconstruction method based on the multi-granularity Vit automatic encoder according to claim 1, wherein the encoder structure adopts 4 Vit encoding layers that differ from the conventional Vit, patch sizes of (64, 32, 16, 8) are used for the encoding layers that take features of different granularities as input, and the corresponding encoding layers use (2, 4, 8, 12) attention heads, respectively;
in the encoding process of each layer, a convolution whose kernel size equals the next layer's granularity is applied to the feature map in place of the down-sampling step of a conventional convolutional network, changing the feature granularity while sampling the local information of the encoder, and the features of the intermediate encoding layers are sent to the skip connection module as the global information of the encoding.
4. The image reconstruction method based on the multi-granularity Vit automatic encoder according to claim 1, wherein the decoder structure adopts 4 Vit decoding layers, patch sizes of (8, 16, 32, 64) are used for the decoding layers that take features of different granularities as input, and the corresponding decoding layers use (12, 8, 4, 2) attention heads, respectively;
the features of the intermediate decoding layers are sent to the skip connection module as the decoded local information, the global information and the decoded local information are fused by the skip connection module during the decoding of each layer, and the output fused features are sent to the next decoding layer.
5. The image reconstruction method based on the multi-granularity Vit automatic encoder according to claim 4, wherein, through a cross-attention mechanism, the skip connection module takes the decoded local information as the key and value vectors and the global information as the query vector for fusion, the resulting fused features are input into the next decoding layer for further processing, and the calculation formula is as follows:
F_{d,k+1} = F_{d,k} + coAttn(Norm(F_{e,k}), Norm(F_{d,k}))
where F_{d,k} denotes the decoding feature of the k-th layer, F_{e,k} denotes the encoding feature of the k-th layer, F_{d,k+1} denotes the decoded feature map sent to the next layer after skip-connection fusion, Norm(·) denotes the normalization layer, and coAttn(·,·) denotes cross-attention.
6. A refiner for image reconstruction based on a multi-granularity Vit automatic encoder, characterized by comprising an encoder, a decoder and a skip connection module; the encoder adopts a multi-granularity Vit structure and forms the network model for extracting deep image features, outputting the intermediate features; the decoder adopts a multi-granularity Vit structure and decodes and restores the intermediate features obtained by the encoder, outputting an image of the same scale as the original image; through a cross-attention mechanism, the skip connection module takes the decoded local information as the key and value vectors and the global information as the query vector for fusion, and the resulting fused features are input into the next decoding layer for further processing.
CN202211374681.9A 2022-11-04 2022-11-04 Image reconstruction method based on multi-granularity Vit automatic encoder Pending CN115619681A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211374681.9A CN115619681A (en) 2022-11-04 2022-11-04 Image reconstruction method based on multi-granularity Vit automatic encoder

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211374681.9A CN115619681A (en) 2022-11-04 2022-11-04 Image reconstruction method based on multi-granularity Vit automatic encoder

Publications (1)

Publication Number Publication Date
CN115619681A true CN115619681A (en) 2023-01-17

Family

ID=84875839

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211374681.9A Pending CN115619681A (en) 2022-11-04 2022-11-04 Image reconstruction method based on multi-granularity Vit automatic encoder

Country Status (1)

Country Link
CN (1) CN115619681A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116912926A (en) * 2023-09-14 2023-10-20 成都武侯社区科技有限公司 Face recognition method based on self-masking face privacy
CN116912926B (en) * 2023-09-14 2023-12-19 成都武侯社区科技有限公司 Face recognition method based on self-masking face privacy

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination