CN117522694A - Diffusion model-based image super-resolution reconstruction method and system - Google Patents

Diffusion model-based image super-resolution reconstruction method and system

Info

Publication number: CN117522694A
Application number: CN202311584004.4A
Authority: CN (China)
Prior art keywords: image, resolution, super, potential, model
Other languages: Chinese (zh)
Inventors: 周宁宁, 王瑞, 张政
Current Assignee: Nanjing University of Posts and Telecommunications
Original Assignee: Nanjing University of Posts and Telecommunications
Filing date: 2023-11-24
Publication date: 2024-02-06
Priority: CN202311584004.4A (2023-11-24)
Legal status: Pending

Classifications

    • G06T3/4053 Super resolution, i.e. output image resolution higher than sensor resolution (under G06T3/40, scaling the whole image or part thereof)
    • G06T3/4046 Scaling the whole image or part thereof using neural networks
    • G06N3/0455 Auto-encoder networks; encoder-decoder networks
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G06N3/0475 Generative networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses an image super-resolution reconstruction method based on a diffusion model, relating to the technical field of computer vision. The method comprises: acquiring a paired dataset containing both high-resolution images and the corresponding low-resolution images; iteratively training a latent model on the paired dataset, then fixing the latent model and training a diffusion model in the latent space, with a kernel-based attention module added to the noise prediction network; and inputting a low-quality image into the trained latent diffusion model to obtain the corresponding super-resolution generated image. The invention constructs the dataset based on the degradation characteristics from high-quality to low-quality images, improves the noise prediction network of the diffusion model, and performs iterative training in the latent space, so that training is faster and the results are better.

Description

Diffusion model-based image super-resolution reconstruction method and system
Technical Field
The invention relates to the technical field of computer vision, in particular to an image super-resolution reconstruction method and system based on a diffusion model.
Background
Image super-resolution is an active research direction in the field of deep learning, which aims to reconstruct higher-quality super-resolution (SR) images by inferring the high-frequency information of low-resolution (LR) images. With the development of deep learning, super-resolution techniques based on convolutional neural networks have made great progress.
However, existing super-resolution networks still face the problem that the generated details are not rich enough and high-frequency information cannot be captured effectively. In recent years, super-resolution techniques based on generative adversarial networks (GANs) have been on the rise: a discriminator is introduced to judge the generated high-resolution image, forcing the generator to produce more realistic details and thereby improving the super-resolution results.
Meanwhile, diffusion models, as a class of generative models, also exhibit strong capabilities in image synthesis. They can implicitly learn the data distribution and can be used for conditional image generation. Introducing the diffusion model into the super-resolution task can overcome the dilemma of insufficient detail in existing methods and generate super-resolution images of higher quality.
Disclosure of Invention
The invention is proposed in view of the problems of insufficient detail and missing high-frequency information that existing super-resolution techniques exhibit when handling the image super-resolution task.
Therefore, the present invention aims to solve the problem of insufficient super-resolution quality.
In order to solve the technical problems, the invention provides the following technical scheme:
in a first aspect, an embodiment of the present invention provides a diffusion model-based image super-resolution reconstruction method, which includes: acquiring a paired dataset containing both high-resolution images and the corresponding low-resolution images; iteratively training a latent model on the paired dataset, then fixing the latent model and training a diffusion model in the latent space, with a kernel-based attention module added to the noise prediction network; and inputting a low-quality image into the trained latent diffusion model to obtain the corresponding super-resolution generated image.
As a preferable scheme of the diffusion model-based image super-resolution reconstruction method, iteratively training the latent model on the paired dataset comprises the following steps: constructing a paired dataset using the high-resolution HR images and the corresponding low-resolution LR images; training the latent model on the paired dataset, the encoder matching each input image with its corresponding latent representation during training; converting the latent representation into a high-resolution reconstructed image through the decoder; optimizing the parameters of the latent model according to the gradients of the loss function, and stopping training according to a preset convergence condition or a maximum number of iterations; generating latent representations of the HR and LR images with the trained encoder of the latent model; and introducing a kernel-based attention module into the noise prediction network of the diffusion model, and training according to the difference between the output of the noise prediction network and the real image.
As a preferable scheme of the diffusion model-based image super-resolution reconstruction method, adding a kernel-based attention module to the noise prediction network with the latent model fixed comprises the following steps: encoding spatial information by adaptively fusing learnable kernel basis functions, capturing spatial patterns in the image; predicting the fusion coefficients F at each position with a lightweight convolutional branch network; calculating the fused kernel weights at each spatial location from the predicted fusion coefficients F and the learned kernel basis functions; performing a convolution transformation on the input feature map X to obtain the feature map to be adaptively convolved with the fused convolution kernel weights, and calculating the output feature map X_0[i,j] at position (i,j) by grouped convolution; and calculating the total loss function of the latent model from the reconstruction loss, the representation loss, and the consistency loss.
As a preferable scheme of the diffusion model-based image super-resolution reconstruction method, the specific formula of the fused kernel weights is as follows:
M[i,j] = ∑_t F[i,j,t] · W[t]
where F[i,j,t] is the fusion coefficient of the t-th convolution kernel at position (i,j), and W[t] is the t-th learnable basis function.
The output feature map X_0[i,j] is given by:
X_0[i,j] = GroupConv(X_e[i,j], M[i,j])
where GroupConv(·) denotes a grouped convolution operation, X_0 the output feature map, X_e the feature map to be adaptively convolved, and M[i,j] the fused convolution kernel weights.
As a preferable scheme of the diffusion model-based image super-resolution reconstruction method, the specific formula of the total loss function is as follows:
L = L_rec(LQ, LQ_real) + L_rep(GT, GT_fake) + 0.001 · L_reg(LQ, LQ_l)
where L denotes the total loss, L_rec the reconstruction loss, L_rep the representation loss, and L_reg the consistency loss; LQ denotes the low-resolution image, LQ_real the decoded low-resolution image, GT the high-resolution image, GT_fake the image generated from the low-resolution hidden features and the high-resolution latent representation, and LQ_l the latent representation of the low-resolution image.
The reconstruction loss L_rec is given by:
L_rec(LQ, LQ_real) = ∑|LQ - LQ_real|
LQ_real = LQ_l + LQ_h
where L_rec(LQ, LQ_real) denotes the loss between the low-resolution image and the decoded low-resolution image, LQ_real is reconstructed from the low-resolution latent representation LQ_l and the low-resolution hidden features LQ_h.
The representation loss L_rep is given by:
L_rep(GT, GT_fake) = ∑|GT - GT_fake|
GT_fake = GT_l + LQ_h
where L_rep(GT, GT_fake) denotes the loss between the high-resolution image GT and the generated image GT_fake, which is formed from the high-resolution latent representation GT_l and the low-resolution hidden features LQ_h.
The consistency loss L_reg(LQ, LQ_l) is given by:
L_reg(LQ, LQ_l) = ∑(|LQ_μ - LQ_l_μ| + |LQ_σ - LQ_l_σ|)
where LQ_μ and LQ_l_μ denote the means of the low-resolution picture and of its latent representation, and LQ_σ and LQ_l_σ denote their variances.
As a preferable scheme of the diffusion model-based image super-resolution reconstruction method, the diffusion model comprises a forward process and a reverse process. The specific formula of the forward process is as follows:
dx = θ_t(μ - x)dt + σ(t)dω
where dω denotes Gaussian noise, θ_t is a hyperparameter, σ(t) characterizes the Gaussian noise magnitude over time, dt denotes an infinitesimal time step, μ denotes the low-resolution image, x the distribution of the generated image, and dx the change of the generated image within dt.
The specific formula of the reverse process is as follows:
dx = [θ_t(μ - x) - σ(t)²∇_x log p_t(x)]dt + σ(t)dω
where ∇_x log p_t(x) denotes the score function, dω denotes Gaussian noise (here in reverse time), and the remaining symbols are as in the forward process.
As a preferable scheme of the diffusion model-based image super-resolution reconstruction method, the corresponding super-resolution generated image is obtained and adaptive color normalization is applied to the generated result according to the following formula:
AdaIN(x, y) = σ(y)·(x - μ(x))/σ(x) + μ(y)
where x denotes the generated super-resolution image, y the input low-resolution image, AdaIN the adaptive color normalization, μ(y) and σ(y) the mean and variance of the low-resolution image, and μ(x) and σ(x) the mean and variance of the generated image.
In a second aspect, an embodiment of the present invention provides an image super-resolution reconstruction system based on a diffusion model, which includes a data reading module for reading the picture data in the dataset and performing preprocessing operations before network training starts, including resizing, random cropping, random horizontal flipping, and normalization; a training module for training a compressed-sensing latent model from the HR and LR pictures in the dataset, using the latent model to compress images into latent representations that are input to the diffusion model for training, and computing the loss between the predicted and actual noise during training to iteratively optimize the model; and an image generation module for inputting the low-quality image to be super-resolved into the latent diffusion model after network training is finished, obtaining a super-resolution image corresponding to the original image content.
In a third aspect, embodiments of the present invention provide a computer apparatus comprising a memory and a processor, the memory storing a computer program, wherein: the computer program instructions, when executed by a processor, implement the steps of the diffusion model-based image super-resolution reconstruction method according to the first aspect of the present invention.
In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium having a computer program stored thereon, wherein: the computer program instructions, when executed by a processor, implement the steps of the diffusion model-based image super-resolution reconstruction method according to the first aspect of the present invention.
The invention has the beneficial effects that: the invention constructs the dataset based on the degradation characteristics from high-quality to low-quality images and introduces a kernel-based attention module and an EAC module into the noise prediction network, improving the network's ability to accurately predict image noise; the invention compresses images with a U-Net-based latent model and performs iterative training in the latent space, greatly reducing the cost of training the diffusion model; compared with traditional networks, the generative network constructed by the invention performs better at generating high-frequency image information, is better suited to super-resolution image reconstruction, and has practical significance; the method can quickly, effectively and reliably synthesize super-resolution images of better perceptual quality, opening new possibilities for expanding super-resolution application scenarios and improving visual quality.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art. Wherein:
fig. 1 is a flowchart of an image super-resolution reconstruction method based on a diffusion model.
Fig. 2 shows the latent model, the diffusion model, and the overall architecture of the image super-resolution reconstruction method based on the diffusion model.
Fig. 3 shows the KBlock used in the diffusion model of the image super-resolution reconstruction method based on the diffusion model.
Detailed Description
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in ways other than those described herein, and persons skilled in the art will readily appreciate that the present invention is not limited to the specific embodiments disclosed below.
Further, reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic can be included in at least one implementation of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.
Example 1
Referring to fig. 1, a first embodiment of the present invention provides a diffusion model-based image super-resolution reconstruction method, which includes,
s1: pairs of data sets are acquired that contain both high resolution images and corresponding low resolution images.
Specifically, the dataset is selected according to the task requirements: for natural-scene super-resolution, DIV2K or Flickr2K may be selected; for face super-resolution, FFHQ or CelebA may be selected. The dataset is then downsampled using bicubic interpolation to obtain the corresponding high-resolution and low-resolution image pairs.
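For illustration, a minimal sketch of this pairing step is given below, using PIL's bicubic resampling; the directory layout and the ×4 scale factor are assumptions, not taken from the text.

```python
import os
from PIL import Image

def build_paired_dataset(hr_dir, lr_dir, scale=4):
    """Bicubic-downsample every HR image to create the paired LR image."""
    os.makedirs(lr_dir, exist_ok=True)
    for name in sorted(os.listdir(hr_dir)):
        hr = Image.open(os.path.join(hr_dir, name)).convert("RGB")
        # Crop so the HR size is an exact multiple of the scale factor.
        w, h = hr.size
        hr = hr.crop((0, 0, w - w % scale, h - h % scale))
        lr = hr.resize((hr.width // scale, hr.height // scale), Image.BICUBIC)
        lr.save(os.path.join(lr_dir, name))
```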
S2: the latent model is iteratively trained on the paired data sets and a kernel-based attention module is added to the noise prediction network to fix the latent model while the diffusion model is trained within the latent space.
Specifically, both the latent model and the diffusion model adopt the U-Net architecture. The encoder and decoder of the latent model each have four layers, every layer consisting of several residual blocks (ResBlocks) and a down-/up-sampling layer; the noise prediction network of the diffusion model comprises an encoder, a middle layer and a decoder, where the encoder and decoder likewise have four layers, every layer consisting of several KBlocks and a down-/up-sampling layer, and the middle layer consists of several KBlocks.
Further, training is performed with a diffusion model in the latent space, and the training process includes the following: training the latent model on the dataset of high-resolution (HR) and low-resolution (LR) images so that it can effectively combine the high-resolution latent representation with the low-resolution implicit features, thereby generating a high-quality high-resolution reconstructed image; generating latent representations of the HR and LR images (latent_HR and latent_LR) using the trained encoder of the latent model; introducing a kernel-based attention module into the noise prediction network of the diffusion model and training the network on latent_HR and latent_LR; the trained noise prediction network enables the model to generate an image similar to the HR image during the image diffusion process.
Preferably, the transforms.Resize function is used to resize the picture, with the following formula:
img = Resize(H, W)(img_in)
where Resize denotes the resizing function transforms.Resize, img_in the input image, img the resized output image, and H and W the height and width of the output image, respectively.
Preferably, the input image is randomly cropped using a slicing mechanism during training, with the following formula:
img = img_in[rnd_h : rnd_h + size, rnd_w : rnd_w + size, :]
where rnd_h and rnd_w denote the randomly chosen starting pixel coordinates, rnd_h : rnd_h + size selects along the height, rnd_w : rnd_w + size selects along the width, ':' selects all channels, size is the desired output size after random cropping, and img is the final output picture.
Preferably, the input picture is randomly horizontally flipped using the transforms.RandomHorizontalFlip function, with the following formula:
img = Flip(P)
where img denotes the flipped image, P the probability of performing a horizontal flip, and Flip the random horizontal flipping function transforms.RandomHorizontalFlip.
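Taken together, the preprocessing steps above map onto a standard torchvision pipeline; the following is a minimal sketch in which the resize target, crop size, flip probability and normalization statistics are assumed values.

```python
import torchvision.transforms as T

# Sketch of the preprocessing described above; sizes and statistics
# are illustrative assumptions, not values stated in the text.
preprocess = T.Compose([
    T.Resize((512, 512)),            # img = Resize(H, W)(img_in)
    T.RandomCrop(256),               # random slicing to a square patch
    T.RandomHorizontalFlip(p=0.5),   # img = Flip(P) with P = 0.5
    T.ToTensor(),
    T.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])
# Usage: tensor = preprocess(pil_image)
```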
Further, the skip connections of the latent model between the encoder and the decoder work as follows: in the encoder, the features produced by each residual block are extracted and stored; at the start of each decoding stage the corresponding encoder features are attached via a concat operation, and they are attached again at the end of the stage; through these multiple skip connections the decoder combines encoder features at different semantic levels and can effectively recover detail information.
Preferably, the latent model performs feature normalization processing on the input data in each basic block, and the specific formula is as follows:
out=x*(scale+1)+shift
wherein scale represents a scaling parameter, shift represents a displacement parameter, both parameters are obtained during the training process, x represents an input, and out represents an output.
Preferably, the latent model uses the Swish activation function, with the following formula:
f(x) = x · sigmoid(β · x)
where f(x) denotes the Swish activation function, x the input vector from the previous network layer, and β a parameter controlling the smoothness of the activation function.
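A sketch of the per-block feature modulation out = x·(scale+1)+shift together with Swish follows; deriving scale and shift from a conditioning embedding through a linear layer is an assumption (the text only states that both parameters are learned during training), as is β = 1.

```python
import torch
import torch.nn as nn

class FeatureModulation(nn.Module):
    """out = x * (scale + 1) + shift, with scale/shift learned per channel."""
    def __init__(self, channels, emb_dim):
        super().__init__()
        self.proj = nn.Linear(emb_dim, 2 * channels)  # assumed source of scale/shift

    def forward(self, x, emb):
        scale, shift = self.proj(emb).chunk(2, dim=1)
        scale = scale[:, :, None, None]               # broadcast over H, W
        shift = shift[:, :, None, None]
        return x * (scale + 1) + shift

def swish(x, beta=1.0):
    # f(x) = x * sigmoid(beta * x); beta controls smoothness.
    return x * torch.sigmoid(beta * x)

# Usage: FeatureModulation(64, 128)(torch.randn(2, 64, 16, 16), torch.randn(2, 128))
```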
Further, in the noise prediction network SimpleGate is used instead of a conventional activation function, with the following formula:
SimpleGate(x) = x_1 * x_2, with (x_1, x_2) = chunk(x)
where chunk denotes the function that splits the input feature map into two along the channel dimension, and x_1, x_2 are the two resulting feature maps.
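As a sketch, SimpleGate reduces to a few lines in PyTorch, assuming NCHW layout with channels on dim 1:

```python
import torch.nn as nn

class SimpleGate(nn.Module):
    """SimpleGate(x) = x1 * x2, where x1, x2 = chunk(x) along channels."""
    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)  # split the feature map into two halves
        return x1 * x2              # element-wise product halves the channels
```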
Further, a kernel-based attention block (KBlock) is used in the noise prediction network. The specific process includes the following steps:
Kernel Basis Attention (KBA module): spatial information is encoded by adaptively fusing a learnable kernel basis that is shared across all spatial locations and images to capture common spatial patterns. Given the input feature map X, the KBA module learns a set of learnable kernel bases W containing N grouped convolution kernels with channel count C and kernel size K; the number of groups is set to C/4 to balance performance and efficiency. KBA then adaptively fuses these bases at every location to encode spatial information.
Fusion Coefficients Prediction: a lightweight convolutional branch network is used to predict the fusion coefficients F at each location. The branch contains two layers: a 3×3 grouped convolutional layer that reduces the number of channels to N with group size N, followed by a SimpleGate activation and a 3×3 convolutional layer.
Kernel Basis Fusion: the inputs are the predicted fusion coefficients F and the learned kernel bases W. For each spatial position (i,j), the fused kernel weights M[i,j] are obtained by linearly combining the kernel bases, with the following formula:
M[i,j] = ∑_t F[i,j,t] · W[t]
where F[i,j,t] is the fusion coefficient of the t-th kernel at position (i,j), and W[t] is the t-th learnable kernel.
Further, the input feature map X undergoes a 1×1 convolution to obtain the feature map X_e to be adaptively convolved with the fused kernel weights M[i,j], and the output feature map X_0[i,j] at position (i,j) is computed by grouped convolution:
X_0[i,j] = GroupConv(X_e[i,j], M[i,j])
where GroupConv(·) denotes a grouped convolution operation, X_0 the output feature map, X_e the adaptively convolved feature map, and M[i,j] the fused kernel weights.
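The three steps above can be sketched as a single module; N bases, kernel size K and G = C/4 groups follow the text, while the coefficient-branch layout, the initialisation, and the unfold/einsum realisation of the per-pixel grouped convolution are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KernelBasisAttention(nn.Module):
    """Sketch of the KBA computation described above (c divisible by 4 and n)."""

    def __init__(self, c=64, n=8, k=3):
        super().__init__()
        self.c, self.n, self.k = c, n, k
        self.g = c // 4                          # number of conv groups G = C / 4
        per_g = c // self.g                      # channels per group
        # Learnable kernel basis W: N grouped kernels, shared across locations.
        self.W = nn.Parameter(0.02 * torch.randn(n, self.g, per_g, per_g * k * k))
        # Lightweight branch predicting fusion coefficients F[i, j, t]:
        # 3x3 grouped conv, SimpleGate, then a 3x3 conv (layout assumed).
        self.coef1 = nn.Conv2d(c, 2 * n, 3, padding=1, groups=n)
        self.coef2 = nn.Conv2d(n, n, 3, padding=1)
        self.pw = nn.Conv2d(c, c, 1)             # 1x1 transform producing X_e

    def forward(self, x):
        b, c, h, w = x.shape
        a1, a2 = self.coef1(x).chunk(2, dim=1)   # SimpleGate
        f = self.coef2(a1 * a2)                  # F: (B, N, H, W)
        xe = self.pw(x)                          # X_e: (B, C, H, W)
        # Unfold X_e into K*K patches per pixel, grouped along channels.
        per_g = c // self.g
        patches = F.unfold(xe, self.k, padding=self.k // 2)
        patches = patches.view(b, self.g, per_g * self.k * self.k, h, w)
        # M[i, j] = sum_t F[i, j, t] * W[t] at every pixel (i, j).
        m = torch.einsum('bnhw,ngok->bgokhw', f, self.W)
        # X_0[i, j] = GroupConv(X_e[i, j], M[i, j]) as a per-pixel product.
        out = torch.einsum('bgokhw,bgkhw->bgohw', m, patches)
        return out.reshape(b, c, h, w)

# Usage: y = KernelBasisAttention(c=64)(torch.randn(1, 64, 32, 32))
```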
Further, the total loss function of the latent model is computed from the reconstruction loss, the representation loss, and the consistency loss, with the following formula:
L = L_rec(LQ, LQ_real) + L_rep(GT, GT_fake) + 0.001 · L_reg(LQ, LQ_l)
where L denotes the total loss, L_rec the reconstruction loss, L_rep the representation loss, and L_reg the consistency loss; LQ denotes the low-resolution image, LQ_real the decoded low-resolution image, GT the high-resolution image, GT_fake the image generated from the low-resolution hidden features and the high-resolution latent representation, and LQ_l the latent representation of the low-resolution image.
Specifically, the reconstruction loss L_rec is:
L_rec(LQ, LQ_real) = ∑|LQ - LQ_real|
LQ_real = LQ_l + LQ_h
where L_rec(LQ, LQ_real) denotes the loss between the low-resolution image LQ and the decoded low-resolution image LQ_real, which is reconstructed from the low-resolution latent representation LQ_l and the low-resolution hidden features LQ_h.
The representation loss L_rep is:
L_rep(GT, GT_fake) = ∑|GT - GT_fake|
GT_fake = GT_l + LQ_h
where L_rep(GT, GT_fake) denotes the loss between the high-resolution image GT and the generated image GT_fake, which is formed from the high-resolution latent representation GT_l and the low-resolution hidden features LQ_h.
Specifically, the consistency loss L_reg(LQ, LQ_l) is:
L_reg(LQ, LQ_l) = ∑(|LQ_μ - LQ_l_μ| + |LQ_σ - LQ_l_σ|)
where LQ_μ and LQ_l_μ denote the means of the low-resolution picture and of its latent representation, and LQ_σ and LQ_l_σ denote their variances.
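Under the stated formulas, the latent-model objective can be sketched as follows; the sums LQ_real = LQ_l + LQ_h and GT_fake = GT_l + LQ_h are transcribed literally (in practice they may stand for the decoder combining latents with skip features), and whole-tensor statistics for the consistency term are an assumption.

```python
import torch

def latent_model_loss(LQ, GT, LQ_l, LQ_h, GT_l):
    """L = L_rec + L_rep + 0.001 * L_reg, per the formulas above."""
    LQ_real = LQ_l + LQ_h            # decoded LR image (literal transcription)
    GT_fake = GT_l + LQ_h            # HR latent combined with LR hidden features
    L_rec = (LQ - LQ_real).abs().sum()
    L_rep = (GT - GT_fake).abs().sum()
    # Consistency between LR image statistics and its latent statistics.
    L_reg = (LQ.mean() - LQ_l.mean()).abs() + (LQ.var() - LQ_l.var()).abs()
    return L_rec + L_rep + 0.001 * L_reg
```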
Further, the specific formula of the loss of the noise prediction network is as follows:
L = ∑|δ - δ_t|
where δ denotes the actually added noise and δ_t the predicted noise.
Preferably, the mean-reverting SDE used by the diffusion model includes a forward process and a reverse process. The specific formula of the forward process is as follows:
dx = θ_t(μ - x)dt + σ(t)dω
where dω denotes Gaussian noise, θ_t is a hyperparameter, σ(t) characterizes the Gaussian noise magnitude over time, dt denotes an infinitesimal time step, μ denotes the low-resolution image, x the distribution of the generated image, and dx the change of the generated image within dt.
Specifically, the specific formula of the reverse process is as follows:
dx = [θ_t(μ - x) - σ(t)²∇_x log p_t(x)]dt + σ(t)dω
where ∇_x log p_t(x) denotes the score function, which can be approximated and predicted by the noise prediction network; dω denotes Gaussian noise (here in reverse time), and the remaining symbols are as in the forward process.
It should be noted that the forward process is a process of continuously adding noise to the image, and the reverse process is a process of continuously reducing noise.
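A discretised Euler-Maruyama sketch of one step of each process follows; the step size, the noise scaling by √dt, and stepping the reverse process backwards with a positive dt are assumptions, and score stands for the network's estimate of ∇_x log p_t(x).

```python
import torch

def forward_step(x, mu, theta_t, sigma_t, dt):
    """One step of dx = theta_t * (mu - x) dt + sigma(t) dw: adds noise."""
    dw = torch.randn_like(x) * dt ** 0.5
    return x + theta_t * (mu - x) * dt + sigma_t * dw

def reverse_step(x, mu, score, theta_t, sigma_t, dt):
    """One backward step of the reverse SDE: removes noise; `score`
    approximates grad_x log p_t(x) via the noise prediction network."""
    dw = torch.randn_like(x) * dt ** 0.5
    drift = theta_t * (mu - x) - sigma_t ** 2 * score
    return x - drift * dt + sigma_t * dw
```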
S3: The low-quality image is input into the trained latent diffusion model to obtain the corresponding super-resolution generated image.
Further, adaptive color normalization is applied to the generated result according to the following formula:
AdaIN(x, y) = σ(y)·(x - μ(x))/σ(x) + μ(y)
where x denotes the generated super-resolution image, y the input low-resolution image, AdaIN the adaptive color normalization, μ(y) and σ(y) the mean and variance of the low-resolution image, and μ(x) and σ(x) the mean and variance of the generated image.
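A minimal sketch of this color correction, assuming NCHW tensors, per-channel spatial statistics, and standard deviation for σ (the text says variance):

```python
import torch

def adain(x, y, eps=1e-5):
    """AdaIN(x, y) = sigma(y) * (x - mu(x)) / sigma(x) + mu(y)."""
    mu_x = x.mean(dim=(-2, -1), keepdim=True)
    sd_x = x.std(dim=(-2, -1), keepdim=True) + eps   # eps avoids division by zero
    mu_y = y.mean(dim=(-2, -1), keepdim=True)
    sd_y = y.std(dim=(-2, -1), keepdim=True)
    return sd_y * (x - mu_x) / sd_x + mu_y
```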
Preferably, a progressive patch aggregation sampling algorithm is employed for larger images. The process comprises: dividing the image into multiple patches with overlapping regions, sampling each patch, generating a weighting map for each patch with a central Gaussian kernel, and blending the overlapping pixels according to the weighting maps, as sketched below.
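In the following sketch, sample_fn stands for one pass of the diffusion sampler on a patch; the patch size, overlap and Gaussian width are illustrative assumptions, and ragged borders are ignored for brevity.

```python
import torch

def gaussian_map(size, sigma_frac=0.25):
    """Central Gaussian weighting map for one patch."""
    ax = torch.arange(size).float() - (size - 1) / 2
    g = torch.exp(-ax ** 2 / (2 * (size * sigma_frac) ** 2))
    return (g[:, None] * g[None, :])[None, None]     # shape (1, 1, size, size)

def sample_by_patches(sample_fn, img, patch=64, overlap=16):
    """Sample overlapping patches and blend them with Gaussian weights."""
    b, c, h, w = img.shape
    out = torch.zeros_like(img)
    acc = torch.zeros(1, 1, h, w, device=img.device)
    g = gaussian_map(patch).to(img.device)
    stride = patch - overlap
    for top in range(0, h - patch + 1, stride):
        for left in range(0, w - patch + 1, stride):
            tile = img[:, :, top:top + patch, left:left + patch]
            out[:, :, top:top + patch, left:left + patch] += sample_fn(tile) * g
            acc[:, :, top:top + patch, left:left + patch] += g
    return out / acc.clamp_min(1e-8)   # weighted average over the overlaps
```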
Further, this embodiment also provides an image super-resolution reconstruction system based on a diffusion model, which comprises a data reading module for reading the picture data in the dataset and performing preprocessing operations before network training starts, including resizing, random cropping, random horizontal flipping, and normalization; a training module for training a compressed-sensing latent model from the HR and LR pictures in the dataset, using the latent model to compress images into latent representations that are input to the diffusion model for training, and computing the loss between the predicted and actual noise during training to iteratively optimize the model; and an image generation module for inputting the low-quality image to be super-resolved into the latent diffusion model after network training is finished, obtaining a super-resolution image corresponding to the original image content.
The embodiment also provides a computer device applicable to the above diffusion model-based image super-resolution reconstruction method, comprising a memory and a processor; the memory stores computer-executable instructions, and the processor executes them to implement the diffusion model-based image super-resolution reconstruction method proposed in this embodiment.
The computer device may be a terminal comprising a processor, a memory, a communication interface, a display screen and input means connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
The present embodiment also provides a storage medium having stored thereon a computer program which, when executed by a processor, performs the following steps: acquiring a paired dataset containing both high-resolution images and the corresponding low-resolution images; iteratively training a latent model on the paired dataset, then fixing the latent model and training a diffusion model in the latent space, with a kernel-based attention module added to the noise prediction network; and inputting a low-quality image into the trained latent diffusion model to obtain the corresponding super-resolution generated image.
In summary, the invention constructs the dataset based on the degradation characteristics from high-quality to low-quality images, improves the noise prediction network of the diffusion model, and performs iterative training in the latent space, so that training is faster and the results are better; compared with traditional networks, the generative network constructed by the invention is better suited to super-resolution image reconstruction and has practical significance; the method can quickly, effectively and reliably synthesize more natural super-resolution images, improve the realism and visual quality of the generated images, and broaden the range of application scenarios.
Example 2
Referring to fig. 1 to 3, in order to verify the advantageous effects of the second embodiment of the present invention based on the first embodiment, scientific demonstration is performed through economic benefit calculation and simulation experiments.
Specifically, the training dataset consists of DIV2K (800 pictures), Flickr2K (2650 pictures) and OST (10324 pictures), 13774 pictures in total. The training dataset is processed by bicubic-downsampling each picture according to the super-resolution factor to obtain the low-resolution picture, yielding a complete training set of paired high/low-resolution images. The evaluation dataset is DIV2K, containing 100 pairs of high/low-resolution images. The test datasets are Set5, Set14, BSD100, Urban100 and Manga109.
Further, the latent model is built on U-Net; the encoder and decoder each have 4 layers, every layer containing several residual blocks and an up/down-sampling layer, and the latent model ensures consistency of the image reconstruction through the U-Net skip-connection mechanism. The noise prediction network of the diffusion model uses the U-Net of Refusion as its backbone, with an improved network structure: image features are extracted using the KBlock shown in fig. 3 instead of the NAFBlock; the kernel-based attention module is added to increase the network's sensitivity to spatial information; and the EAC module replaces the SCA module to learn channel attention more effectively.
Further, the latent model is trained on the constructed training dataset with a learning rate of 3e-5, the Lion optimizer, patch_size 512, and 300000 iterations. The loss function is the L1 loss, and the network parameters are iteratively updated according to the loss, yielding a trained latent model. The trained encoder of the latent model then generates latent representations of the high/low-resolution images, which are input into the diffusion model for training with a learning rate of 3e-5, the Lion optimizer, patch_size 512, and 800000 iterations, again using the L1 loss, yielding a trained diffusion model.
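As a sketch, the latent-model training setup described above might look as follows; lion_pytorch is one third-party Lion implementation (an assumed choice), and the one-layer model and random tensors are placeholders for the real latent model and data loader.

```python
import torch
import torch.nn as nn
from lion_pytorch import Lion   # third-party Lion optimizer (assumed choice)

model = nn.Conv2d(3, 3, 3, padding=1)      # placeholder for the latent model
opt = Lion(model.parameters(), lr=3e-5)    # learning rate 3e-5 as stated
l1 = nn.L1Loss()                           # L1 loss as stated

for step in range(300_000):                # 300000 iterations as stated
    lr_img = torch.rand(1, 3, 512, 512)    # patch_size 512 (dummy data)
    hr_img = torch.rand(1, 3, 512, 512)
    opt.zero_grad()
    l1(model(lr_img), hr_img).backward()
    opt.step()
```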
Further, a low-quality image to be super-resolved is input into the latent model to obtain a compressed latent representation Z_t; the latent representation is then input into the diffusion model, and the diffusion process iterates to obtain the super-resolved latent representation Z_0, which is decoded by the latent model to obtain the final super-resolution image.
Preferably, the method improves the network structure and makes full use of the generative capability of the diffusion model; it can quickly generate high-quality super-resolution images, is simple to operate, trains quickly and produces good results, opening new possibilities for expanding super-resolution application scenarios and improving visual quality.
It should be noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that the technical solution of the present invention may be modified or substituted without departing from the spirit and scope of the technical solution of the present invention, which is intended to be covered in the scope of the claims of the present invention.

Claims (10)

1. An image super-resolution reconstruction method based on a diffusion model, characterized by comprising the following steps:
acquiring a paired dataset containing both high-resolution images and the corresponding low-resolution images;
iteratively training a latent model on the paired dataset, then fixing the latent model and training a diffusion model in the latent space, with a kernel-based attention module added to the noise prediction network;
and inputting a low-quality image into the trained latent diffusion model to obtain the corresponding super-resolution generated image.
2. The diffusion model-based image super-resolution reconstruction method as claimed in claim 1, wherein: the iterative training of the latent model on the paired data sets comprises the steps of:
constructing a paired dataset using the high-resolution HR images and the corresponding low-resolution LR images;
training the latent model on the paired dataset, the encoder matching each input image with its corresponding latent representation during training;
converting the latent representation into a high-resolution reconstructed image through the decoder;
optimizing the parameters of the latent model according to the gradients of the loss function, and stopping training according to a preset convergence condition or a maximum number of iterations;
generating latent representations of the HR and LR images with the trained encoder of the latent model;
and introducing a kernel-based attention module into the noise prediction network of the diffusion model, and training according to the difference between the output of the noise prediction network and the real image.
3. The diffusion model-based image super-resolution reconstruction method as claimed in claim 1, wherein: adding a kernel-based attention module to the noise prediction network with the latent model fixed comprises the following steps:
encoding spatial information by adaptively fusing learnable kernel basis functions, capturing spatial patterns in the image;
predicting the fusion coefficients F at each position with a lightweight convolutional branch network;
calculating the fused kernel weights at each spatial location from the predicted fusion coefficients F and the learned kernel basis functions;
performing a convolution transformation on the input feature map X to obtain the feature map to be adaptively convolved with the fused convolution kernel weights, and calculating the output feature map X_0[i,j] at position (i,j) by grouped convolution;
and calculating the total loss function of the latent model from the reconstruction loss, the representation loss, and the consistency loss.
4. The diffusion model-based image super-resolution reconstruction method as claimed in claim 3, wherein: the specific formula of the fused kernel weights is as follows:
M[i,j] = ∑_t F[i,j,t] · W[t]
where F[i,j,t] is the fusion coefficient of the t-th convolution kernel at position (i,j), and W[t] is the t-th learnable basis function;
the output feature map X_0[i,j] is given by:
X_0[i,j] = GroupConv(X_e[i,j], M[i,j])
where GroupConv(·) denotes a grouped convolution operation, X_0 the output feature map, X_e the adaptively convolved feature map, and M[i,j] the fused convolution kernel weights.
5. The diffusion model-based image super-resolution reconstruction method as claimed in claim 3, wherein: the specific formula of the total loss function is as follows:
L = L_rec(LQ, LQ_real) + L_rep(GT, GT_fake) + 0.001 · L_reg(LQ, LQ_l)
where L denotes the total loss, L_rec the reconstruction loss, L_rep the representation loss, and L_reg the consistency loss; LQ denotes the low-resolution image, LQ_real the decoded low-resolution image, GT the high-resolution image, GT_fake the image generated from the low-resolution hidden features and the high-resolution latent representation, and LQ_l the latent representation of the low-resolution image;
the reconstruction loss L_rec is given by:
L_rec(LQ, LQ_real) = ∑|LQ - LQ_real|
LQ_real = LQ_l + LQ_h
where L_rec(LQ, LQ_real) denotes the loss between the low-resolution image and the decoded low-resolution image, and LQ_h denotes the hidden features of the low-resolution image;
the representation loss L_rep is given by:
L_rep(GT, GT_fake) = ∑|GT - GT_fake|
GT_fake = GT_l + LQ_h
where L_rep(GT, GT_fake) denotes the loss between the high-resolution image GT and the generated image GT_fake, and GT_l denotes the latent representation of the high-resolution image;
the consistency loss L_reg(LQ, LQ_l) is given by:
L_reg(LQ, LQ_l) = ∑(|LQ_μ - LQ_l_μ| + |LQ_σ - LQ_l_σ|)
where LQ_μ and LQ_l_μ denote the means of the low-resolution picture and of its latent representation, and LQ_σ and LQ_l_σ denote their variances.
6. The diffusion model-based image super-resolution reconstruction method as claimed in claim 1, wherein: the diffusion model includes a forward process and a reverse process;
the specific formula of the forward process is as follows:
dx = θ_t(μ - x)dt + σ(t)dω
where dω denotes Gaussian noise, θ_t is a hyperparameter, σ(t) characterizes the Gaussian noise magnitude over time, dt denotes an infinitesimal time step, μ denotes the low-resolution image, x the distribution of the generated image, and dx the change of the generated image within dt;
the specific formula of the reverse process is as follows:
dx = [θ_t(μ - x) - σ(t)²∇_x log p_t(x)]dt + σ(t)dω
where ∇_x log p_t(x) denotes the score function, dω denotes Gaussian noise (here in reverse time), and the remaining symbols are as in the forward process.
7. The diffusion model-based image super-resolution reconstruction method as claimed in claim 1, wherein: obtaining the corresponding super-resolution generated image includes applying adaptive color normalization to the generated result according to the following formula:
AdaIN(x, y) = σ(y)·(x - μ(x))/σ(x) + μ(y)
where x denotes the generated super-resolution image, y the input low-resolution image, AdaIN the adaptive color normalization, μ(y) and σ(y) the mean and variance of the low-resolution image, and μ(x) and σ(x) the mean and variance of the generated image.
8. An image super-resolution reconstruction system based on a diffusion model, based on the diffusion model-based image super-resolution reconstruction method according to any one of claims 1 to 7, characterized by comprising:
a data reading module for reading the picture data in the dataset and performing preprocessing operations before network training starts, the preprocessing operations including resizing, random cropping, random horizontal flipping, and normalization;
a training module for training a compressed-sensing latent model from the HR and LR pictures in the dataset, using the latent model to compress images into latent representations that are input to the diffusion model for training, and computing the loss between the predicted and actual noise during training to iteratively optimize the model;
and an image generation module for inputting the low-quality image to be super-resolved into the latent diffusion model after network training is finished, obtaining a super-resolution image corresponding to the original image content.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that: the processor, when executing the computer program, implements the steps of the diffusion model-based image super-resolution reconstruction method according to any one of claims 1 to 7.
10. A computer-readable storage medium having stored thereon a computer program, characterized by: the computer program when executed by a processor implements the steps of the diffusion model-based image super-resolution reconstruction method according to any one of claims 1 to 7.
CN202311584004.4A (filed 2023-11-24, priority date 2023-11-24): Diffusion model-based image super-resolution reconstruction method and system. Status: Pending. Publication: CN117522694A (en)

Priority Applications (1)

CN202311584004.4A (priority date 2023-11-24, filing date 2023-11-24): Diffusion model-based image super-resolution reconstruction method and system

Applications Claiming Priority (1)

CN202311584004.4A (priority date 2023-11-24, filing date 2023-11-24): Diffusion model-based image super-resolution reconstruction method and system

Publications (1)

CN117522694A (publication date: 2024-02-06)

Family

ID: 89760545

Family Applications (1)

CN202311584004.4A (priority date 2023-11-24, filing date 2023-11-24, status: pending): Diffusion model-based image super-resolution reconstruction method and system

Country Status (1)

CN: CN117522694A (en)

Cited By (2)

* Cited by examiner, † Cited by third party

CN117743768A * (priority 2024-02-21, published 2024-03-22, Shandong University): Signal denoising method and system based on a denoising generative adversarial network and a diffusion model
CN117743768B * (priority 2024-02-21, published 2024-05-17, Shandong University): Signal denoising method and system based on a denoising generative adversarial network and a diffusion model



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination