CN116664450A - Diffusion model-based image enhancement method, device, equipment and storage medium

Info

Publication number: CN116664450A
Application number: CN202310922672.7A
Authority: CN (China)
Prior art keywords: image, noise, feature map, target, preset
Legal status: Pending
Original language: Chinese (zh)
Inventors: 王红凯, 徐昱, 毛冬, 戴波, 陈祖歌, 黄建平, 李钟煦, 郑怡, 饶涵宇, 李高磊
Assignees (current and original): State Grid Information and Telecommunication Co Ltd; Information and Telecommunication Branch of State Grid Zhejiang Electric Power Co Ltd; PanAn Power Supply Co of State Grid Zhejiang Electric Power Co Ltd
Application filed by the assignees listed above; priority to CN202310922672.7A.

Classifications

    • G06T 5/00 Image enhancement or restoration
    • G06T 5/70 Denoising; Smoothing
    • G06T 9/00 Image coding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Processing (AREA)
  • Compression Of Band Width Or Redundancy In Fax (AREA)

Abstract

The invention discloses an image enhancement method, device, equipment and storage medium based on a diffusion model. The method comprises the following steps: acquiring a target image to be enhanced and an image enhancement instruction, and encoding them to obtain a coding feature map and a text code; inputting the coding feature map and the text code into a pre-trained target image enhancement network; gradually adding Gaussian noise to the coding feature map according to a preset noise addition rule and a preset number of steps to obtain a target noise image obeying a Gaussian distribution, and determining the prediction noise in the result image after each step of adding Gaussian noise; based on a cross-attention mechanism, performing image enhancement on the region of the target noise image corresponding to the text code to obtain a noise-added enhanced image; gradually removing the prediction noise of each step from the noise-added enhanced image according to a preset noise removal rule and the preset number of steps to obtain a denoised image; and decoding the denoised image to obtain an enhanced image. The invention effectively improves the enhancement effect on images with many missing features.

Description

Diffusion model-based image enhancement method, device, equipment and storage medium
Technical Field
The present invention relates to the field of image enhancement technologies, and in particular, to an image enhancement method, apparatus, device, and storage medium based on a diffusion model.
Background
The image is one of the most common information carriers in electronic systems, and is widely applied in the fields of medical imaging, unmanned aerial vehicle photography, security monitoring, industrial detection and the like. However, many of the original pictures acquired have limitations in terms of quality, contrast, sharpness, and detail presentation due to environmental conditions, equipment limitations, noise during acquisition, and other factors. The image enhancement technique refers to a technique of processing features in an image to improve the visual effect of the image and to improve the quality of the image.
Conventional image enhancement methods generally employ techniques such as image filtering, histogram equalization, and image sharpening to improve the quality of the image. However, these methods have limited enhancement effects on images in the face of complex scenes and specific applications. For example: in medical images, the traditional image enhancement method cannot effectively extract pathological details or accurately restore the tissue structure of the image; in unmanned aerial vehicle photography, due to the change of illumination conditions and shooting distance, the problems of blurring, noise, low contrast and the like of a shot image may exist, and the enhancement effect of the shot image is limited by adopting a traditional image enhancement method; in security monitoring, a target object cannot be accurately identified and tracked by adopting a traditional image enhancement method.
With the rapid development of deep learning and computer vision, researchers have proposed image enhancement methods based on deep learning to overcome the above problems. In order to improve the image enhancement effect, existing image enhancement algorithms are realized based on neural network models, and specific implementations include, but are not limited to, the following two. First: a convolutional neural network (Convolutional Neural Network, CNN), which during training uses low-quality images (i.e., images that require image enhancement) as input and high-quality images (i.e., images that do not require image enhancement) as training targets, and iteratively trains the network with a loss function. During image enhancement, the target image to be enhanced is input into the trained CNN, which outputs the enhanced image. Second: a generative adversarial network (Generative Adversarial Network, GAN), which uses low-quality images as input and high-quality images as training targets, and is trained iteratively through the adversarial interplay of the generator and the discriminator. During image enhancement, the target image to be enhanced is input into the trained generator, which outputs the enhanced image.
However, existing neural network models for image enhancement have a poor enhancement effect on images with many missing features.
Disclosure of Invention
The invention provides an image enhancement method, device, equipment and storage medium based on a diffusion model, which solve the problem in the prior art of a poor image enhancement effect on images with many missing features.
In order to achieve the above purpose, the invention adopts the following technical scheme:
in a first aspect, the present invention provides a diffusion model-based image enhancement method, the method comprising:
acquiring a target image to be enhanced, and encoding the target image through an encoder to obtain an encoding feature map;
acquiring an image enhancement instruction, and encoding the image enhancement instruction through a text editor to obtain a text code; the image enhancement instruction comprises the characteristics and the positions of the image to be enhanced;
inputting the coding feature map and the text codes into a pre-trained target image enhancement network;
according to a preset noise adding rule and a preset step number, gradually adding Gaussian noise into the coding feature map to obtain a target noise image obeying Gaussian distribution, and determining the prediction noise in a result image after adding Gaussian noise in each step;
Based on a cross attention mechanism, performing image enhancement on a region corresponding to the text code in the target noise image to obtain a noise-added enhanced image;
according to a preset noise removal rule and the preset step number, the prediction noise of each step is gradually removed from the noise-added enhanced image, and a denoised image is obtained;
and decoding the denoised image through a decoder to obtain an enhanced image.
In one possible implementation, the preset noise addition rule is determined based on a diffusion process of a denoising diffusion probability model; gradually adding Gaussian noise into the coding feature map according to a preset noise adding rule and a preset step number to obtain a target noise image obeying Gaussian distribution, wherein the method specifically comprises the following steps of:
according to the diffusion process of the denoising diffusion probability model, gaussian noise is added to the coding feature map in each step of the diffusion process; the parameter value of the added Gaussian noise is determined based on a preset noise time table;
and calculating a result image after adding the Gaussian noise in each step of the diffusion process according to the coding feature map and the noise time table, and outputting the result image corresponding to the preset step number as a target noise image.
In one possible implementation manner, the calculating the result image after adding the gaussian noise at each step of the diffusion process according to the coding feature map and the noise schedule specifically includes:
calculating a result image of the diffusion process after adding the Gaussian noise at each step according to the following formula:

$x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$, with $\epsilon \sim \mathcal{N}(0,\mathbf{I})$,

wherein $x_0$ is the coding feature map before adding Gaussian noise, and $x_t$ is the noise-added result corresponding to noise having been added up to time $t$;

$\alpha_t = 1-\beta_t$, $\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$;

$\beta$ is the preset noise schedule, $\beta$ comprises $\{\beta_1,\dots,\beta_T\}$, where $\beta_t$ represents the parameter value of the Gaussian noise added at step $t$ of the diffusion process, and $\beta_t \in (0,1)$.
In one possible implementation, the target noise image comprises a plurality of image channels, and the cross-attention mechanism comprises a channel attention mechanism and a spatial attention mechanism; the image enhancement is performed on the region corresponding to the text code in the target noise image based on the cross attention mechanism to obtain a noise enhanced image, and the method specifically comprises the following steps:
through the channel attention mechanism, carrying out pertinence enhancement on different image channels on the feature map corresponding to each image channel of the region corresponding to the text code in the target noise image to obtain a channel attention feature map;
And carrying out targeted enhancement of different spatial positions on the channel attention feature map through the spatial attention mechanism to obtain a noise-added enhanced image.
In a possible implementation manner, the enhancing, by the channel attention mechanism, pertinence of different image channels on the feature map corresponding to each image channel of the region corresponding to the text code in the target noise image to obtain a channel attention feature map specifically includes:
for the feature map of each image channel of the region corresponding to the text code in the target noise image, performing dimension reduction processing on the feature map according to a maximum pooling and average pooling method to obtain global features of the feature map corresponding to the image channel;
processing the global features through a multi-layer sensor to obtain the weight coefficient of the image channel;
weighting the feature images corresponding to the image channels through the weight coefficients to obtain weighted feature images;
and multiplying the weighted feature map and the image channel of the target noise image to obtain a channel attention feature map.
In one possible implementation manner, through the spatial attention mechanism, the channel attention feature map is subjected to targeted enhancement of different spatial positions, so as to obtain a noise enhanced image, which specifically includes:
Processing the channel attention feature map according to the methods of maximum pooling and average pooling to obtain a processing result;
performing connection operation on the processing result based on the corresponding image channel to obtain a connected feature map;
the connected feature images are subjected to dimension reduction into a single channel by a convolution dimension reduction processing method, so that a space feature image is obtained;
and multiplying the space feature image and the target noise image to obtain a noise-added enhanced image.
In one possible implementation, the preset noise removal rule is determined based on a reverse process of a denoising diffusion probability model; the step of gradually removing the prediction noise of each step from the noise-added enhanced image according to a preset noise removal rule and the preset step number specifically comprises the following steps:
and removing the prediction noise determined in the diffusion process corresponding to the inverse process from the noise-added enhanced image at each step of the inverse process based on the inverse process of the denoising diffusion probability model.
In one possible implementation, before the inputting the coding feature map and the text code into a pre-trained target image enhancement network, the method further includes:
Training an original image enhancement network to obtain an image enhancement network with the error value of the predicted noise and the real noise smaller than a preset loss value as a target image enhancement network.
In one possible implementation manner, the training the original image enhancement network to obtain an image enhancement network with an error value of the predicted noise and the real noise smaller than a preset loss value as the target image enhancement network specifically includes:
acquiring a high-quality image meeting a preset quality requirement, and processing the high-quality image in a downsampling mode to obtain a corresponding low-quality image;
encoding the high-quality image and the low-quality image by an encoder to obtain a high-quality encoding diagram and a low-quality encoding diagram;
gradually adding Gaussian noise into the low-quality coding diagram, and determining the prediction noise in the result image after adding Gaussian noise in each step;
and determining error values of the prediction noise and the noise true value, and changing parameters of the original image enhancement network when the error values are larger than preset loss values until the error values are smaller than the preset loss values, so as to obtain the trained target image enhancement network.
In a second aspect, the present invention provides a diffusion model-based image enhancement apparatus comprising:
the coding module is used for acquiring a target image to be enhanced, and coding the target image through the coder to obtain a coding feature map;
the text coding module is used for acquiring the image enhancement instruction, and coding the image enhancement instruction through the text editor to obtain a text code; the image enhancement instruction comprises the characteristics and the positions of the image to be enhanced;
the input module is used for inputting the coding feature map and the text codes into a pre-trained target image enhancement network;
the noise prediction module is used for gradually adding Gaussian noise into the coding feature map according to a preset noise adding rule and a preset step number to obtain a target noise image obeying Gaussian distribution, and determining the prediction noise in a result image after Gaussian noise is added in each step;
the image enhancement module is used for carrying out image enhancement on the region corresponding to the text code in the target noise image based on a cross attention mechanism to obtain a noise enhanced image;
the denoising module is used for gradually removing the prediction noise of each step from the noise-added enhanced image according to a preset noise removal rule and the preset step number to obtain a denoised image;
And the decoding module is used for decoding the denoised image through a decoder to obtain an enhanced image.
Further, the preset noise adding rule is determined based on a diffusion process of a denoising diffusion probability model; when gaussian noise is gradually added to the coding feature map according to a preset noise adding rule and a preset step number to obtain a target noise image obeying gaussian distribution, the noise prediction module is configured to execute:
according to the diffusion process of the denoising diffusion probability model, gaussian noise is added to the coding feature map in each step of the diffusion process; the parameter value of the added Gaussian noise is determined based on a preset noise time table;
and calculating a result image after adding the Gaussian noise in each step of the diffusion process according to the coding feature map and the noise time table, and outputting the result image corresponding to the preset step number as a target noise image.
Further, in calculating a resultant image of the diffusion process after adding the gaussian noise at each step according to the coding feature map and the noise schedule, the noise prediction module is specifically configured to perform:
Calculating a result image of the diffusion process after adding the Gaussian noise at each step according to the following formula:

$x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$, with $\epsilon \sim \mathcal{N}(0,\mathbf{I})$,

wherein $x_0$ is the coding feature map before adding Gaussian noise, and $x_t$ is the noise-added result corresponding to noise having been added up to time $t$;

$\alpha_t = 1-\beta_t$, $\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$;

$\beta$ is the preset noise schedule, $\beta$ comprises $\{\beta_1,\dots,\beta_T\}$, where $\beta_t$ represents the parameter value of the Gaussian noise added at step $t$ of the diffusion process, and $\beta_t \in (0,1)$.
Further, the target noise image comprises a plurality of image channels, and the cross attention mechanism comprises a channel attention mechanism and a spatial attention mechanism; the image enhancement module comprises a first enhancement unit and a second enhancement unit;
the first enhancing unit is configured to enhance pertinence of different image channels on a feature map corresponding to each image channel of a region corresponding to the text code in the target noise image through the channel attention mechanism, so as to obtain a channel attention feature map;
the second enhancing unit is configured to perform targeted enhancement on different spatial positions on the channel attention feature map through the spatial attention mechanism, so as to obtain a noise enhanced image.
Further, the first enhancement unit is specifically configured to perform:
For the feature map of each image channel of the region corresponding to the text code in the target noise image, performing dimension reduction processing on the feature map according to a maximum pooling and average pooling method to obtain global features of the feature map corresponding to the image channel;
processing the global features through a multi-layer sensor to obtain the weight coefficient of the image channel;
weighting the feature images corresponding to the image channels through the weight coefficients to obtain weighted feature images;
and multiplying the weighted feature map and the image channel of the target noise image to obtain a channel attention feature map.
Further, the second enhancement unit is specifically configured to perform:
processing the channel attention feature map according to the methods of maximum pooling and average pooling to obtain a processing result;
performing connection operation on the processing result based on the corresponding image channel to obtain a connected feature map;
the connected feature images are subjected to dimension reduction into a single channel by a convolution dimension reduction processing method, so that a space feature image is obtained;
and multiplying the space feature image and the target noise image to obtain a noise-added enhanced image.
Further, the preset noise removal rule is determined based on the inverse process of the denoising diffusion probability model; the denoising module is specifically configured to perform:
and removing the prediction noise determined in the diffusion process corresponding to the inverse process from the noise-added enhanced image at each step of the inverse process based on the inverse process of the denoising diffusion probability model.
Further, the device further comprises a model training module, which is used for training the original image enhancement network before the coding feature map and the text codes are input into the pre-trained target image enhancement network, so as to obtain the image enhancement network with the error value of the prediction noise and the real noise smaller than the preset loss value as the target image enhancement network.
Further, the model training module is specifically configured to perform:
acquiring a high-quality image meeting a preset quality requirement, and processing the high-quality image in a downsampling mode to obtain a corresponding low-quality image;
encoding the high-quality image and the low-quality image by an encoder to obtain a high-quality encoding diagram and a low-quality encoding diagram;
Gradually adding Gaussian noise into the low-quality coding diagram, and determining the prediction noise in the result image after adding Gaussian noise in each step;
and determining error values of the prediction noise and the noise true value, and changing parameters of the original image enhancement network when the error values are larger than preset loss values until the error values are smaller than the preset loss values, so as to obtain the trained target image enhancement network.
In a third aspect, the present invention provides an electronic device comprising a processor and a memory, the memory storing at least one instruction, at least one program, a set of codes, or a set of instructions, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by the processor to implement the diffusion model-based image enhancement method of any of the above.
In a fourth aspect, the present invention provides a computer readable storage medium having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by a processor to implement a diffusion model-based image enhancement method as set forth in any one of the preceding claims.
According to the diffusion model-based image enhancement method provided by the invention, the acquired target image to be enhanced and the acquired image enhancement instruction are first encoded by an encoder and a text editor, respectively, to obtain a coding feature map and a text code; secondly, the coding feature map and the text code are input into a pre-trained target image enhancement network; then, Gaussian noise is gradually added to the coding feature map to obtain a target noise image obeying a Gaussian distribution, and the prediction noise in the result image after each step of adding Gaussian noise is determined; next, based on a cross-attention mechanism, image enhancement is performed on the region of the target noise image corresponding to the image enhancement instruction to obtain a noise-added enhanced image; then, mirroring the noise-adding process, the prediction noise of each step is gradually removed from the noise-added enhanced image to obtain a denoised image; finally, the denoised image is decoded by a decoder to obtain an enhanced image. The invention is aimed at target images with many missing features, for example images acquired by terminals in the field of power business visual analysis during power generation, transmission and distribution, in which features are missing or discontinuous. Gaussian noise is gradually introduced into the target image to attenuate the useful information in the image, so that the noise-added image tends toward Gaussian noise; the original image is then restored by removing the noise step by step. While eliminating the noise and interference in the target image, the detail signals and features of the whole target image are enhanced; at the same time, combined with the cross-attention mechanism, the image region corresponding to the image enhancement instruction in the target image is enhanced in a targeted manner, supporting the restoration of features such as image texture, saturation and color. This effectively improves the enhancement effect on the target image and provides higher-quality image data for subsequent image analysis.
Drawings
FIG. 1 is a flowchart of steps of a diffusion model-based image enhancement method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the implementation of a denoising diffusion probability model of an image enhancement method based on a diffusion model according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a prediction noise model of an image enhancement method based on a diffusion model according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an implementation of image enhancement based on a cross-attention mechanism of an image enhancement method based on a diffusion model according to an embodiment of the present invention;
FIG. 5 is a technical flowchart of an image enhancement method based on a diffusion model according to an embodiment of the present invention;
fig. 6 is a block diagram of an image enhancement device based on a diffusion model according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The terms "first" and "second" are used below for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the embodiments of the present disclosure, unless otherwise indicated, the meaning of "a plurality" is two or more. In addition, the use of "based on" or "according to" is intended to be open and inclusive in that a process, step, calculation, or other action "based on" or "according to" one or more of the stated conditions or values may in practice be based on additional conditions or beyond the stated values.
In order to solve the problem of poor image enhancement effect caused by more characteristic missing in the prior art, the embodiment of the invention provides an image enhancement method and device based on a diffusion model.
As shown in fig. 1, in a first aspect, an embodiment of the present invention provides an image enhancement method based on a diffusion model, where the method includes:
and 101, acquiring a target image to be enhanced, and encoding the target image by an encoder to obtain an encoding feature map.
The target image to be enhanced can be an image acquired by a terminal with characteristics lost and discontinuous in the processes of power generation, power transmission and power distribution in the field of visual analysis of power business.
An encoder (Encoder) is a module that compiles or converts signals or data into a signal form suitable for communication, transmission, or storage.
In this embodiment, the encoder can compress the input target image into a potential spatial representation, resulting in an encoded signature.
Step 102, obtaining an image enhancement instruction, and encoding the image enhancement instruction through a text editor to obtain a text code.
Wherein the image enhancement instructions include features and locations of the image that need to be enhanced.
Specifically, the features of the image to be enhanced may be face features, specific background features, and the like, and the position of the image to be enhanced may be the upper left corner, the upper right corner, and the like of the image.
In this embodiment, the text editor is a CLIP (Contrastive Language-Image Pre-training) text editor. The CLIP text editor maps the image and the text into the same vector space, yielding the text code.
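As one possible illustration of this step, the following minimal sketch encodes an enhancement instruction with an off-the-shelf CLIP text encoder; the Hugging Face transformers API, the checkpoint name and the example instruction are assumptions of the sketch rather than details of the embodiment.

```python
# Illustrative sketch only: map an image enhancement instruction to a text code
# with a CLIP text encoder. Checkpoint name and API usage are assumptions.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

def encode_instruction(instruction: str) -> torch.Tensor:
    """Return a text embedding of shape [1, K, E] for one instruction."""
    tokens = tokenizer(instruction, padding="max_length", truncation=True,
                       return_tensors="pt")
    with torch.no_grad():
        return text_encoder(**tokens).last_hidden_state

# Hypothetical instruction naming a feature and a location to enhance.
text_code = encode_instruction("sharpen the insulator in the upper-left corner")
```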
And step 103, inputting the coding feature map and the text codes into a pre-trained target image enhancement network.
Specifically, the coding feature map obtained in the step 101 and the text code obtained in the step 102 are both input into a pre-trained target image enhancement network, and target images are enhanced in a targeted manner through the target image enhancement network.
Step 104, gradually adding Gaussian noise into the coding feature map according to a preset noise adding rule and a preset step number to obtain a target noise image obeying Gaussian distribution, and determining the prediction noise in the result image after adding Gaussian noise in each step.
Specifically, adding a certain amount of Gaussian noise into the coding feature map to obtain a result image after the Gaussian noise is added for the first time; and then adding a certain amount of Gaussian noise into the result image after the Gaussian noise is added for the first time to obtain a result image after the Gaussian noise is added for the second time, and repeating the step of adding the Gaussian noise for a preset step for a plurality of times to obtain a target noise image approaching the Gaussian noise. The Gaussian noise is gradually added to the coding feature map, so that the original coding feature map is changed into a noise map conforming to standard Gaussian distribution.
In the process of gradually adding the gaussian noise, the prediction noise contained in the resultant image of each step is determined.
And 105, carrying out image enhancement on a region corresponding to the text code in the target noise image based on a cross attention mechanism to obtain a noise enhanced image.
An attention mechanism applies attention computation rules within a deep learning network. The cross-attention mechanism unifies the associated features within and across modalities to perform image-text matching calculations.
In this embodiment, the cross-attention mechanism is applied to determine the matching relationship between the image enhancement instruction and the coding feature map, and the region matching the image enhancement instruction is enhanced in a targeted manner, so as to obtain a noise-added enhanced image.
And 106, gradually removing the prediction noise of each step from the noise-added enhanced image according to a preset noise removal rule and a preset step number to obtain a denoised image.
Specifically, in contrast to the direction of gradually adding the gaussian noise, the prediction noise in the previous step is gradually removed from the noise-added enhanced image, and the noise-removing step is repeated for a preset number of times, so that a noise-removed image with the prediction noise gradually removed is obtained.
And 107, decoding the denoised image through a decoder to obtain an enhanced image.
Wherein the Decoder (Decoder) is capable of restoring data compressed into a potential spatial representation to an image, which is an enhanced image.
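As one possible implementation of the encoder/decoder pair, the following is a minimal convolutional autoencoder sketch that compresses an image into a latent feature map at 1/8 resolution and restores it; the channel widths and activation choices are illustrative assumptions.

```python
# Minimal encoder/decoder sketch: 8x spatial compression and reconstruction.
# Channel widths and activations are assumptions, not the embodiment's exact design.
import torch.nn as nn

class LatentAutoencoder(nn.Module):
    def __init__(self, in_ch: int = 3, latent_ch: int = 4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.SiLU(),   # H/2
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.SiLU(),     # H/4
            nn.Conv2d(128, latent_ch, 3, stride=2, padding=1),         # H/8
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_ch, 128, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose2d(64, in_ch, 4, stride=2, padding=1),
        )

    def encode(self, x):  # [B, in_ch, H, W] -> [B, latent_ch, H/8, W/8]
        return self.encoder(x)

    def decode(self, z):  # [B, latent_ch, H/8, W/8] -> [B, in_ch, H, W]
        return self.decoder(z)
```

In the workflow described later, such an encoder would be trained in advance and kept frozen while the image enhancement network itself is trained.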
Specifically, steps 101 and 102 obtain the coding feature map of the target image and the text code of the input image enhancement instruction. The execution order of steps 101 and 102 is not specifically limited; they may be executed asynchronously or synchronously, as long as both are completed before step 103.
According to the diffusion model-based image enhancement method provided by the invention, the acquired target image to be enhanced and the acquired image enhancement instruction are first encoded by an encoder and a text editor, respectively, to obtain a coding feature map and a text code; secondly, the coding feature map and the text code are input into a pre-trained target image enhancement network; then, Gaussian noise is gradually added to the coding feature map to obtain a target noise image obeying a Gaussian distribution, and the prediction noise in the result image after each step of adding Gaussian noise is determined; next, based on a cross-attention mechanism, image enhancement is performed on the region of the target noise image corresponding to the image enhancement instruction to obtain a noise-added enhanced image; then, mirroring the noise-adding process, the prediction noise of each step is gradually removed from the noise-added enhanced image to obtain a denoised image; finally, the denoised image is decoded by a decoder to obtain an enhanced image.
The invention is aimed at target images with many missing features, for example images acquired by terminals in the field of power business visual analysis during power generation, transmission and distribution, in which features are missing or discontinuous. Gaussian noise is gradually introduced into the target image to attenuate the useful information in the image, so that the noise-added image tends toward Gaussian noise; the original image is then restored by removing the noise step by step. While eliminating the noise and interference in the target image, the detail signals and features of the whole target image are enhanced; at the same time, combined with the cross-attention mechanism, the image region corresponding to the image enhancement instruction in the target image is enhanced in a targeted manner, supporting the restoration of features such as image texture, saturation and color. This effectively improves the enhancement effect on the target image and provides higher-quality image data for subsequent image analysis.
Further, the preset noise addition rule is determined based on a diffusion process of the denoising diffusion probability model.
The denoising diffusion probability model (Denoising Diffusion Probabilistic Models, DDPM) is a parameterized Markov chain and is trained by a variational reasoning method. The denoising diffusion probability model is one of depth generation models, and generally comprises two processes, a diffusion process and a reverse process. The diffusion process is also called a forward diffusion process, a forward diffusion process or a noise adding process, and the reverse process is also called a reverse diffusion process or a reverse denoising process.
As shown in FIG. 2, the process from $x_0$ to $x_T$ is the diffusion process of the denoising diffusion probability model, and the process from $x_T$ to $x_0$ is the reverse process of the denoising diffusion probability model.
The diffusion process is a step-by-step noise-adding process: diagonal Gaussian noise is added to the sample image at each step, and by continuously adding Gaussian noise, the data distribution of the original sample image is converted into a simple distribution conforming to the standard Gaussian distribution.
The reverse process is a denoising process: sampling starts from an image conforming to the standard Gaussian distribution, and a small amount of Gaussian noise is removed at each step, so that the denoised image gradually approaches the real data distribution, thereby obtaining a sample image from the real data distribution and realizing the recovery of the sample image.
According to a preset noise adding rule and a preset step number, gradually adding Gaussian noise into the coding feature map to obtain a target noise image obeying Gaussian distribution, wherein the method specifically comprises the following steps of:
and adding Gaussian noise to the coding feature map in each step of the diffusion process according to the diffusion process of the denoising diffusion probability model.
Wherein the parameter value of the added gaussian noise is determined based on a preset noise schedule.
Specifically, the diffusion process of the denoising diffusion probability model is a noise-adding process based on the Markov assumption. After the number of noise-adding steps is determined as the preset number of steps T, and the parameters of the Gaussian noise to be added at each step are determined based on the preset noise schedule, Gaussian noise is gradually added to the coding feature map, which is taken as $x_0$.
And calculating a result image after adding Gaussian noise in each step of the diffusion process according to the coding feature map and the noise time table, and outputting the result image corresponding to the preset step number as a target noise image.
Specifically, the coding feature map is taken as $x_0$ and the number of noise-adding steps is T. The parameters of the Gaussian noise added at each step are determined according to the noise schedule, so the result image after each noise-adding step of the diffusion process can be obtained unambiguously, and the result image corresponding to step T is used as the target noise image.
Further, according to the coding feature map and the noise schedule, calculating a result image after adding Gaussian noise in each step of the diffusion process, specifically:
assuming that the preset number of steps is T and that the initial distribution of the sample data of the coding feature map is $x_0 \sim q(x_0)$, Gaussian noise with a specified mean and variance is added to the coding feature map at each time $t$ of the diffusion process, which is expressed by the following formulas:

$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t\mathbf{I}\right)$ (1),

$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1})$ (2),

wherein $x_t$ is the result image after adding noise at time $t$; $\beta$ is the preset noise schedule, $\beta$ comprises $\{\beta_1,\dots,\beta_T\}$, where $\beta_t$ represents the parameter value of the Gaussian noise added at step $t$ of the diffusion process, and $\beta_t \in (0,1)$.

It is thus possible to obtain

$x_t = \sqrt{1-\beta_t}\,x_{t-1} + \sqrt{\beta_t}\,\epsilon_{t-1}$, with $\epsilon_{t-1} \sim \mathcal{N}(0,\mathbf{I})$ (3).

Defining the variables $\alpha_t = 1-\beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$, and based on the Markov assumption, after continuous iteration the result image after adding Gaussian noise at each step of the diffusion process can be calculated according to the following formula:

$x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$, with $\epsilon \sim \mathcal{N}(0,\mathbf{I})$ (4),

wherein $x_0$ is the coding feature map before adding Gaussian noise, and $x_t$ is the noise-added result corresponding to noise having been added up to time $t$; that is,

$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\right)$ (5).

From the above it can be seen that, throughout the diffusion process, for a determined coding feature map $x_0$ and noise schedule $\beta$, the noise-added result image $x_t$ of any step can be obtained. When the preset number of steps T is large enough, the final noise-added result image can be regarded as isotropic Gaussian noise, i.e. $x_T \sim \mathcal{N}(0,\mathbf{I})$.
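As an illustration of formula (4), the following sketch computes the noise-added result image of any step in closed form; the total number of steps and the linear schedule endpoints are assumptions of the sketch.

```python
# Closed-form forward diffusion per formula (4):
# x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps.
import torch

T = 1000                                    # preset number of steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)       # preset noise schedule (assumed linear)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)   # cumulative products: a_bar_t

def q_sample(x0: torch.Tensor, t: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    """Noise-added result at step t for a batch; t is a LongTensor of shape [B]."""
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
```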
Further, the prediction noise in the result image after the addition of the gaussian noise at each step is determined specifically as follows:
as shown in fig. 3, the prediction noise in the resultant image after the addition of the gaussian noise at each step is determined by a prediction noise model.
The prediction noise model is formed based on a U-Net network with the same input and output dimensions, and the U-Net network comprises a contracted path and an expanded path; the contraction path adopts a multi-layer downsampling structure, and the multi-layer downsampling structure is realized through a first convolution module; the expansion path adopts a multi-layer up-sampling structure, and the multi-layer up-sampling structure is realized through a second convolution module; the number of layers of the multi-layer downsampling structure is the same as the number of layers of the multi-layer upsampling structure.
In the present embodiment, the inputs of the prediction noise model are the noise-added image, a single-channel 128×128 tensor, and the time $t$, which is encoded using an encoding technique; residual structures are merged. The output of the prediction noise model is the prediction noise, whose channel number and size are the same as those of the input of the prediction noise model.
The multi-layer downsampling structure is a 4-layer downsampling structure, and downsampling of the prediction noise model adopts convolution operation with convolution kernel of 3×3, step size of 2 and filling of 1.
And taking the image after noise addition as a characteristic image input for the first time, and reducing the input characteristic image by half at each layer of the multi-layer downsampling structure by utilizing a first convolution module.
And (3) doubling the input characteristic images on each layer of the multi-layer up-sampling structure by utilizing a second convolution module through a nearest interpolation method, splicing the characteristic images with the characteristic images corresponding to the contracted paths, and finally outputting the prediction noise of the image after noise addition.
The first convolution module includes 5 convolution units, and the convolution channel numbers of the five convolution units are respectively set to 32, 64, 128, 256 and 512 from top to bottom. In order to prevent gradient extinction and gradient explosion, a residual structure is used for completing network transmission and expansion and reduction of the number of channels. The prediction noise model converts the number of channels to 1 at the output.
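The structure described above can be sketched as follows; this is a simplified stand-in rather than the exact network of the embodiment, and the time-embedding handling and exact layer arrangement are assumptions.

```python
# Simplified sketch of the U-Net-style noise prediction model: a 4-level
# contracting path (3x3 conv, stride 2, padding 1), a 4-level expanding path
# (nearest-neighbour upsampling plus skip concatenation), and a single-channel
# noise output. Time-embedding handling is an assumption of this sketch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Down(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, 3, stride=2, padding=1)  # halves H and W
    def forward(self, x):
        return F.silu(self.conv(x))

class Up(nn.Module):
    def __init__(self, c_in, c_skip, c_out):
        super().__init__()
        self.conv = nn.Conv2d(c_in + c_skip, c_out, 3, padding=1)
    def forward(self, x, skip):
        x = F.interpolate(x, scale_factor=2, mode="nearest")        # doubles H and W
        return F.silu(self.conv(torch.cat([x, skip], dim=1)))

class TinyNoiseUNet(nn.Module):
    def __init__(self, in_ch=1, base=32, t_dim=128):
        super().__init__()
        self.t_embed = nn.Sequential(nn.Linear(t_dim, base), nn.SiLU())
        self.stem = nn.Conv2d(in_ch, base, 3, padding=1)
        self.d1, self.d2 = Down(base, 64), Down(64, 128)
        self.d3, self.d4 = Down(128, 256), Down(256, 512)
        self.u1, self.u2 = Up(512, 256, 256), Up(256, 128, 128)
        self.u3, self.u4 = Up(128, 64, 64), Up(64, base, base)
        self.head = nn.Conv2d(base, 1, 1)                            # predicted noise

    def forward(self, x, t_emb):
        h0 = F.silu(self.stem(x)) + self.t_embed(t_emb)[:, :, None, None]
        h1 = self.d1(h0); h2 = self.d2(h1); h3 = self.d3(h2); h4 = self.d4(h3)
        u = self.u1(h4, h3); u = self.u2(u, h2)
        u = self.u3(u, h1); u = self.u4(u, h0)
        return self.head(u)
```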
Further, the target noise image includes a plurality of image channels.
Specifically, the image channel is an important concept of an image, and in the RGB color mode, a complete image is composed of three image channels of red, green and blue, and the three image channels cooperate to generate the complete image.
The cross-attention mechanism includes a channel attention mechanism and a spatial attention mechanism.
The essence of the attention mechanism is to locate the information of interest to the user in an image and to suppress the useless information in the image.
Based on a cross attention mechanism, carrying out image enhancement on an area corresponding to text coding in a target noise image to obtain a noise enhanced image, and specifically comprising the following steps:
and carrying out pertinence enhancement of different image channels on the feature map corresponding to each image channel of the region corresponding to the text coding in the target noise image through a channel attention mechanism to obtain a channel attention feature map.
The channel attention mechanism comprises a compression part and an excitation part, wherein the compression part mainly compresses global space information, then performs feature learning in the dimension of an image channel to form the importance of each channel, and the excitation part is used for distributing different weights to each channel.
And (3) carrying out targeted enhancement on different spatial positions on the channel attention feature map through a spatial attention mechanism to obtain a noise-added enhanced image.
The spatial attention mechanism is to find the position of the picture focused by the user and process the position.
Further, through a channel attention mechanism, the pertinence enhancement of different image channels is performed on the feature map corresponding to each image channel of the region corresponding to the text code in the target noise image, so as to obtain a channel attention feature map, which specifically includes:
and carrying out dimension reduction processing on the feature map of each image channel of the region corresponding to the text code in the target noise image according to the methods of maximum pooling and average pooling to obtain the global feature of the feature map corresponding to the image channel.
And processing the global features through the multi-layer perceptron to obtain the weight coefficient of the image channel.
And weighting the feature images corresponding to the image channels through the weight coefficients to obtain weighted feature images.
And multiplying the weighted feature map and an image channel of the target noise image to obtain a channel attention feature map.
In the present embodiment, as shown in fig. 4, the region image of the target noise image $x$ corresponding to the text code is processed by the encoder of the autoencoder to obtain the coding feature $f_c$; the coding feature $f_c$ is processed by the decoder of the autoencoder to obtain the feature map $F$; maximum pooling and average pooling are performed on the feature map $F$ to generate the channel attention map $M_c$; and $M_c$ is multiplied with $F$ to obtain the channel attention feature map $F'$.

The channel attention map $M_c$ is calculated by the following formula:

$M_c(F) = \sigma\!\left(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\right) = \sigma\!\left(W_1\!\left(W_0\!\left(F^c_{\mathrm{avg}}\right)\right) + W_1\!\left(W_0\!\left(F^c_{\mathrm{max}}\right)\right)\right)$ (6),

wherein AvgPool is global average pooling; MaxPool is global maximum pooling; MLP is a multi-layer perceptron; $r$ is the reduction rate; $\sigma$ is the sigmoid function; $C$ is the number of channels; $W_0$ and $W_1$ respectively represent the two weight coefficients of the MLP; $F^c_{\mathrm{avg}}$ is the feature vector obtained from $F$ after average pooling, where the superscript $c$ denotes the channel attention module; $F^c_{\mathrm{max}}$ is the vector obtained from $F$ after the maximum pooling operation; and $M_c$ is generated by adding the two features and activating with the sigmoid function.

The channel attention map $M_c$ is multiplied with $F$ to obtain the channel attention feature map $F'$.
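A minimal sketch of this channel attention computation (formula (6)) is given below; the value of the reduction ratio is an assumption.

```python
# Channel attention sketch following formula (6): shared MLP over global
# average- and max-pooled descriptors, sigmoid gate M_c, channel-wise reweighting.
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, r: int = 16):   # reduction ratio r is assumed
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = feat.shape
        avg = self.mlp(feat.mean(dim=(2, 3)))           # AvgPool branch
        mx = self.mlp(feat.amax(dim=(2, 3)))            # MaxPool branch
        m_c = torch.sigmoid(avg + mx).view(b, c, 1, 1)  # M_c
        return m_c * feat                               # channel attention feature map F'
```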
Further, through a spatial attention mechanism, pertinence enhancement of different spatial positions is performed on the channel attention feature map, and a noise enhanced image is obtained, which specifically comprises:
and processing the channel attention characteristic diagram according to the methods of maximum pooling and average pooling to obtain a processing result.
And performing connection operation on the processing result based on the corresponding image channel to obtain a connected feature map.
And reducing the dimension of the connected feature map into a single channel by a convolution dimension reduction processing method to obtain a space feature map.
And multiplying the space feature image and the target noise image to obtain a noise-added enhanced image.
In this embodiment, as shown in FIG. 4, the channel attention feature map $F'$ is used as the input feature map of the spatial attention mechanism. Maximum pooling and average pooling are performed on $F'$ along the channel dimension, a convolutional (Conv) connection operation is performed on the processing results of all image channels to obtain a connected feature map, and the connected feature map is then reduced to a single channel by a convolution-based dimension-reduction method and activated to generate the spatial feature map $M_s$. Finally, the spatial feature map $M_s$ and the channel attention feature map $F'$ are multiplied to obtain the noise-added enhanced image.

The spatial feature map $M_s$ is calculated by the following formulas:

$M_s(F') = \sigma\!\left(f\!\left(\left[\mathrm{AvgPool}(F');\ \mathrm{MaxPool}(F')\right]\right)\right)$ (7),

$M_s(F') = \sigma\!\left(f\!\left(\left[F'^{\,s}_{\mathrm{avg}};\ F'^{\,s}_{\mathrm{max}}\right]\right)\right)$ (8),

wherein $f$ denotes the convolution operation that reduces the concatenated map to a single channel; $[\cdot\,;\cdot]$ denotes concatenation along the channel dimension; $F'^{\,s}_{\mathrm{avg}}$ and $F'^{\,s}_{\mathrm{max}}$ denote the average-pooled and maximum-pooled maps of $F'$; and the superscript $s$ denotes the spatial attention module.
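A corresponding sketch of the spatial attention computation (formulas (7) and (8)) follows; the 7×7 kernel size is an assumption, since the embodiment only specifies a convolution that reduces the concatenated map to a single channel.

```python
# Spatial attention sketch following formulas (7)-(8): channel-wise average and
# max maps are concatenated, convolved down to one channel, and used as a gate M_s.
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):          # kernel size is assumed
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        avg = feat.mean(dim=1, keepdim=True)            # [B, 1, H, W]
        mx, _ = feat.max(dim=1, keepdim=True)           # [B, 1, H, W]
        m_s = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # M_s
        return m_s * feat                               # spatially gated output
```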
Further, the preset noise removal rule is determined based on the inverse process of the denoising diffusion probability model.
The inverse process of the denoising diffusion probability model is a process of reconstructing a target image from noise.
According to a preset noise removal rule and a preset step number, the prediction noise of each step is gradually removed from the noise-added enhanced image, and the method specifically comprises the following steps:
and removing the prediction noise determined in the diffusion process corresponding to the inverse process from the noise-added enhanced image at each step of the inverse process based on the inverse process of the denoising diffusion probability model.
Specifically, the reverse process of the denoising diffusion probability model can also be assumed to be a Markov chain. If the conditional probability distribution $q(x_{t-1} \mid x_t)$ of each step in the reverse process could be determined exactly, then $x_0$ could be obtained by iteratively sampling in the reverse direction, completing the generation task. However, since $q(x_{t-1} \mid x_t)$ depends on the data distribution of all samples, determining it directly is not realistic. Therefore, a parameterized neural network $p_\theta$ is constructed to approximate this distribution; it is assumed that $p_\theta(x_{t-1} \mid x_t)$ is the probability distribution of the reverse process and obeys a Gaussian distribution whose mean $\mu_\theta$ and variance $\Sigma_\theta$ both take $x_t$ and $t$ as input parameters, which is specifically expressed by the following formula:

$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$ (9).

In practical application, in order to facilitate subsequent calculation and reduce the training difficulty of the neural network, the variance $\Sigma_\theta(x_t, t)$ is set to a constant that does not participate in the training of the neural network and is related only to the time-dependent constant $\beta_t$.

Only the mean $\mu_\theta(x_t, t)$ is therefore learned by the neural network during the training phase. Although $q(x_{t-1} \mid x_t)$ cannot be calculated directly, the posterior conditional probability $q(x_{t-1} \mid x_t, x_0)$ can be calculated from the intermediate value $x_t$ and the initial value $x_0$.

Specifically, the Bayesian formula is applied:

$q(x_{t-1} \mid x_t, x_0) = q(x_t \mid x_{t-1}, x_0)\,\dfrac{q(x_{t-1} \mid x_0)}{q(x_t \mid x_0)}$ (10).

According to (10) and (4) it is possible to obtain:

$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\ \tilde{\mu}_t(x_t, x_0),\ \tilde{\beta}_t\mathbf{I}\right)$ (11),

wherein

$\tilde{\mu}_t(x_t, x_0) = \dfrac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,x_0 + \dfrac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,x_t$ (12), $\quad \tilde{\beta}_t = \dfrac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t$ (13).

Further, according to the relation between $x_0$ and $x_t$ given by (4), and combining (9) and (11), the loss function of the target image enhancement network can be determined.

When training the mean $\mu_\theta$ with the neural network, three choices of the quantity to be predicted can be adopted to obtain the training result.

First: directly predict the mean $\mu_\theta(x_t, t)$ of each step of the reverse process.

Second: predict the initial value $x_0$, and substitute $x_0$ into (12) to obtain the mean $\tilde{\mu}_t$.

Third: predict the noise $\epsilon$, and eliminate $x_0$ by means of (4), which gives the following formula:

$\mu_\theta(x_t, t) = \dfrac{1}{\sqrt{\alpha_t}}\left(x_t - \dfrac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right)$ (14).

The mean is calculated by (14), wherein $\epsilon_\theta(x_t, t)$ is the predicted value of the noise.

In this embodiment, the third way is used for prediction, and the loss function is:

$L_{\mathrm{simple}} = \mathbb{E}_{t,\,x_0,\,\epsilon}\!\left[\left\|\epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t\right)\right\|^2\right]$ (15).

The final objective of network optimization is to maximize the end result $p_\theta(x_0)$ of the reverse process, thereby generating the most suitable samples; hence the variational lower bound $L_{\mathrm{VLB}}$ is used to optimize its negative log-likelihood function:

$-\log p_\theta(x_0) \le \mathbb{E}_q\!\left[\log\dfrac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T})}\right] = L_{\mathrm{VLB}}$ (16).

Equation (15) can be regarded as a simplified form of the variational lower bound loss $L_{\mathrm{VLB}}$, and optimizing (15) yields better sample quality than directly optimizing $L_{\mathrm{VLB}}$.

In the present embodiment, a loss function replacing the MSE loss function is used and substituted into equation (4), resulting in the final loss function (17).
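As an illustration of the reverse process, a single denoising step built from formula (14) could look like the sketch below; the fixed variance choice $\sigma_t^2 = \beta_t$ and the model call signature are assumptions of the sketch.

```python
# One reverse (denoising) step per formula (14); variance fixed to beta_t (assumed).
import torch

@torch.no_grad()
def p_sample(model, x_t, t, betas, alphas, alpha_bars):
    eps = model(x_t, t)                                  # predicted noise eps_theta(x_t, t)
    beta_t = betas[t].view(-1, 1, 1, 1)
    alpha_t = alphas[t].view(-1, 1, 1, 1)
    a_bar_t = alpha_bars[t].view(-1, 1, 1, 1)
    mean = (x_t - beta_t / (1.0 - a_bar_t).sqrt() * eps) / alpha_t.sqrt()
    if (t == 0).all():                                   # no noise added at the final step
        return mean
    return mean + beta_t.sqrt() * torch.randn_like(x_t)
```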
as shown in fig. 4, further, before inputting the coding feature map and the text codes into the pre-trained target image enhancement network, the method further includes:
training an original image enhancement network to obtain an image enhancement network with the error value of the predicted noise and the real noise smaller than a preset loss value as a target image enhancement network.
Specifically, before the image enhancement network is applied in the invention, the original image enhancement network is required to be trained through a training sample, and the image enhancement network with the error value of the predicted noise and the real noise smaller than the preset loss value is used as a trained target image enhancement network.
Further, training the original image enhancement network to obtain an image enhancement network with an error value of the predicted noise and the real noise smaller than a preset loss value as a target image enhancement network, which specifically comprises:
And obtaining a high-quality image meeting the preset quality requirement, and processing the high-quality image in a downsampling mode to obtain a corresponding low-quality image.
Downsampling, also called subsampling, is a multi-rate digital signal processing technique, or the process of reducing the sampling rate of a signal, and is generally used to reduce the data transmission rate or the data size.
And reducing the data size in the high-quality image to obtain a corresponding low-quality image.
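One simple way to realize such a degradation is sketched below; the scale factor and the choice to resize back to the original resolution are assumptions of the sketch.

```python
# Illustrative degradation: downsample the high-quality image, then resize back,
# discarding fine detail. Scale factor and interpolation mode are assumptions.
import torch.nn.functional as F

def make_low_quality(high_q_img, factor: int = 4):
    h, w = high_q_img.shape[-2:]
    small = F.interpolate(high_q_img, scale_factor=1.0 / factor,
                          mode="bilinear", align_corners=False)
    return F.interpolate(small, size=(h, w), mode="bilinear", align_corners=False)
```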
The high quality image and the low quality image are encoded by an encoder to obtain a high quality encoded picture and a low quality encoded picture.
Specifically, each corresponding high-quality code map and low-quality code map form a training image pair, and the high-quality image and the low-quality image are mapped from the pixel space to the hidden layer space by the encoder of the autoencoder.
In this embodiment, suppose the size of the image is [B, C, H, W], where B represents the batch size, C represents the number of channels, H represents the height of the image, and W represents the width of the image. After the image is encoded by the autoencoder, the size of the resulting code map is [B, C, H/8, W/8].
The autoencoder needs to be trained before use, and its parameters are fixed during the subsequent training process. That is, the autoencoder can be trained independently; the training method is not limited here, and a pre-trained model can also be used directly.
Gaussian noise is gradually added to the low quality code map, and prediction noise in the resultant image after each step of adding gaussian noise is determined.
And determining error values of the prediction noise and the noise true value, and changing parameters of the original image enhancement network when the error values are larger than the preset loss values until the error values are smaller than the preset loss values, so as to obtain the trained target image enhancement network.
Specifically, in the training stage, the prediction noise can be obtained by calculation according to the input training sample image and the loss function of the model, and whether the image enhancement network is trained well can be determined according to the error value of the prediction noise and the noise true value and the preset loss value.
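A minimal sketch of one training iteration is shown below; the `encoder`, `clip_text` and `model` callables, the plain MSE between predicted and real noise, and the omission of the optimizer step are all simplifying assumptions of the sketch.

```python
# One illustrative training iteration: encode, noise the low-quality latent,
# predict the noise, and compare with the real noise. Conditioning details,
# the loss choice and the optimizer step are simplified assumptions.
import torch
import torch.nn.functional as F

def training_step(model, encoder, clip_text, low_q_img, instruction, alpha_bars, T=1000):
    with torch.no_grad():                     # encoder and CLIP stay frozen
        z0 = encoder(low_q_img)
        text_code = clip_text(instruction)
    t = torch.randint(0, T, (z0.size(0),), device=z0.device)
    noise = torch.randn_like(z0)
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    z_t = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * noise
    pred = model(z_t, t, text_code)           # noise prediction with text conditioning
    loss = F.mse_loss(pred, noise)            # error between predicted and real noise
    loss.backward()                           # parameter update step omitted
    return loss
```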
As shown in fig. 5, the workflow of the diffusion model-based image enhancement method of the present invention is divided into two parts: a training stage and a generation stage.
In the training stage, the input original images are an acquired high-quality image meeting the preset quality requirement and a low-quality image obtained by downsampling the high-quality image.
The original images are encoded by an encoder and mapped from the pixel space to the latent space to obtain coding feature maps, and Gaussian noise is then gradually added to the encoded images based on the denoising diffusion probability model to obtain noise images.
The custom image enhancement options are custom image enhancement instructions, and the trained CLIP text encoder encodes the image enhancement instructions to obtain text codes. In the figure, a custom image enhancement instruction is encoded by CLIP to generate an embedding of size [B, K, E], where K represents the maximum coding length of the text and E represents the embedding dimension.
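As an illustration of the [B, K, E] text embedding, the sketch below uses the publicly available CLIP text model from the Hugging Face transformers library; the checkpoint name, the maximum length K = 77, and the example instruction are assumptions and not values fixed by the patent.

    import torch
    from transformers import CLIPTokenizer, CLIPTextModel

    tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
    text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

    prompts = ["sharpen the license plate in the lower left corner"]   # hypothetical instruction
    tokens = tokenizer(prompts, padding="max_length", max_length=77,
                       truncation=True, return_tensors="pt")
    with torch.no_grad():
        emb = text_encoder(**tokens).last_hidden_state
    print(emb.shape)   # torch.Size([1, 77, 512]) -> [B, K, E]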
Gaussian noise is gradually added to the coding feature map in the diffusion process based on the denoising diffusion probability model, and the predicted noise in the result image after each step of adding Gaussian noise is then determined by a U-Net-based noise prediction model. Meanwhile, the U-Net-based noise prediction model receives the coding feature maps of the high-quality code map and the low-quality code map together with the text code of the image enhancement instruction, and learns the matching relationship between the image enhancement instruction and the image based on a cross attention mechanism.
Based on the error value between the predicted noise and the real noise and the preset loss value, when the error value is larger than the preset loss value, the parameters of the U-Net noise prediction model are updated through the back-propagation algorithm; the parameters of the encoder and the CLIP text encoder are not updated during this parameter update.
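A small sketch of restricting updates to the U-Net noise prediction model while the encoder and the CLIP text encoder stay frozen is shown below; the three modules are placeholders, not the networks described in the patent.

    import torch
    import torch.nn as nn

    # Placeholder modules standing in for the real encoder, CLIP text encoder and U-Net.
    encoder = nn.Conv2d(3, 4, 3, stride=2, padding=1)
    clip_text_encoder = nn.Embedding(1000, 512)
    unet = nn.Conv2d(4, 4, 3, padding=1)

    # The encoder and the CLIP text encoder are frozen: back-propagation never updates them.
    for module in (encoder, clip_text_encoder):
        module.eval()
        for p in module.parameters():
            p.requires_grad_(False)

    # Only the U-Net noise prediction model is optimized.
    optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-4)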
In the generation stage, the input low-quality image is encoded by the encoder to obtain a latent image.
Gaussian noise is gradually added to the coding feature map based on the diffusion process of the denoising diffusion probability model to obtain a target noise image obeying a Gaussian distribution.
Through T rounds of iteration of the U-Net-based denoising model and the reverse process of the denoising diffusion probability model, the noise in the noisy image is gradually removed to obtain a denoised image.
The denoised image is restored from the latent space to an enhanced high-quality image by a decoder.
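Putting the generation stage together, a schematic end-to-end sketch might look as follows. The encoder, U-Net and decoder are toy stand-ins (the real U-Net would also take the time step and text code as conditions), and the reverse update is the standard DDPM formulation rather than the patent's exact procedure.

    import torch
    import torch.nn as nn

    T = 1000
    betas = torch.linspace(1e-4, 0.02, T)               # assumed linear noise schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    encoder = nn.Conv2d(3, 4, 3, stride=8, padding=1)   # toy stand-ins, not the patented networks
    unet = nn.Conv2d(4, 4, 3, padding=1)
    decoder = nn.ConvTranspose2d(4, 3, 8, stride=8)

    @torch.no_grad()
    def enhance(low_quality_image: torch.Tensor) -> torch.Tensor:
        z = encoder(low_quality_image)                  # encode into the latent space
        x = torch.randn_like(z)                         # target noise image ~ N(0, I)
        for t in reversed(range(T)):                    # T rounds of reverse denoising
            pred_noise = unet(x)                        # noise predicted at step t
            coef = (1.0 - alphas[t]) / torch.sqrt(1.0 - alpha_bars[t])
            x = (x - coef * pred_noise) / torch.sqrt(alphas[t])
            if t > 0:
                x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
        return decoder(x)                               # restore from the latent space to an image

    enhanced = enhance(torch.randn(1, 3, 256, 256))     # -> torch.Size([1, 3, 256, 256])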
As shown in fig. 6, in a second aspect, the present invention provides an image enhancement apparatus based on a diffusion model, the apparatus comprising:
the encoding module 201 is configured to obtain a target image to be enhanced, and encode the target image by using an encoder to obtain a coding feature map;
the text encoding module 202 is configured to obtain an image enhancement instruction, and encode the image enhancement instruction through a text encoder to obtain a text code; the image enhancement instruction comprises the features and positions of the image to be enhanced;
the input module 203 is configured to input the coding feature map and the text code into a pre-trained target image enhancement network;
the noise prediction module 204 is configured to gradually add Gaussian noise to the coding feature map according to a preset noise adding rule and a preset number of steps to obtain a target noise image obeying a Gaussian distribution, and determine the prediction noise in the result image after each step of adding Gaussian noise;
the image enhancement module 205 is configured to perform image enhancement on an area corresponding to text encoding in the target noise image based on a cross attention mechanism, so as to obtain a noise enhanced image;
the denoising module 206 is configured to gradually remove the prediction noise of each step from the noise-added enhanced image according to a preset noise removal rule and a preset step number, so as to obtain a denoised image;
the decoding module 207 is configured to decode the denoised image by using a decoder, so as to obtain an enhanced image.
Further, the preset noise adding rule is determined based on the diffusion process of the denoising diffusion probability model; when gradually adding Gaussian noise to the coding feature map according to the preset noise adding rule and the preset number of steps to obtain a target noise image obeying a Gaussian distribution, the noise prediction module 204 is configured to perform:
according to the diffusion process of the denoising diffusion probability model, adding Gaussian noise to the coding feature map in each step of the diffusion process, wherein the parameter value of the added Gaussian noise is determined based on a preset noise schedule;
and calculating the result image after adding Gaussian noise in each step of the diffusion process according to the coding feature map and the noise schedule, and outputting the result image corresponding to the preset number of steps as the target noise image.
Further, when calculating the result image after adding Gaussian noise in each step of the diffusion process according to the coding feature map and the noise schedule, the noise prediction module 204 is specifically configured to perform:
calculating the result image after adding Gaussian noise in each step of the diffusion process according to the following formula:

$$x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I),$$

where $x_0$ is the coding feature map before adding Gaussian noise, $x_t$ is the noised result corresponding to time step $t$, $\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$ and $\alpha_s = 1-\beta_s$; $\beta = \{\beta_1, \beta_2, \ldots, \beta_T\}$ is the preset noise schedule, which contains $T$ parameter values representing the Gaussian noise added at each step of the diffusion process, with $0 < \beta_1 < \beta_2 < \ldots < \beta_T < 1$.
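A short numerical sketch of this closed-form noising step follows; the linear schedule values and T = 1000 are assumptions for illustration.

    import torch

    T = 1000
    betas = torch.linspace(1e-4, 0.02, T)         # preset noise schedule beta_1 .. beta_T
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)     # cumulative product, i.e. \bar{alpha}_t

    def add_noise(x0: torch.Tensor, t: int):
        # x0: coding feature map before adding Gaussian noise, shape [B, C, H, W].
        eps = torch.randn_like(x0)                # true Gaussian noise (later the prediction target)
        xt = torch.sqrt(alpha_bars[t]) * x0 + torch.sqrt(1.0 - alpha_bars[t]) * eps
        return xt, eps

    xt, eps = add_noise(torch.randn(1, 4, 32, 32), t=500)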
Further, the target noise image includes a plurality of image channels, and the cross attention mechanism includes a channel attention mechanism and a spatial attention mechanism; the image enhancement module 205 includes a first enhancement unit and a second enhancement unit;
the first enhancement unit is used for carrying out targeted enhancement on different image channels on the feature map corresponding to each image channel of the region corresponding to the text code in the target noise image through a channel attention mechanism to obtain a channel attention feature map;
The second enhancement unit is used for carrying out targeted enhancement on different spatial positions on the channel attention feature map through a spatial attention mechanism to obtain a noise enhanced image.
Further, the first enhancement unit is specifically configured to perform:
for the feature map of each image channel of the region corresponding to the text code in the target noise image, performing dimension reduction processing on the feature map according to the methods of maximum pooling and average pooling to obtain the global feature of the feature map corresponding to the image channel;
processing the global features through a multi-layer perceptron to obtain the weight coefficient of the image channel;
weighting the feature images corresponding to the image channels through the weight coefficients to obtain weighted feature images;
and multiplying the weighted feature map and an image channel of the target noise image to obtain a channel attention feature map.
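The channel attention described by the first enhancement unit can be sketched roughly as below, in the style of CBAM-type channel attention; the reduction ratio and layer sizes are assumptions, not values taken from the patent.

    import torch
    import torch.nn as nn

    class ChannelAttention(nn.Module):
        def __init__(self, channels: int, reduction: int = 16):
            super().__init__()
            # Shared multi-layer perceptron turning pooled global features into channel weights.
            self.mlp = nn.Sequential(
                nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
                nn.Linear(channels // reduction, channels),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            b, c, _, _ = x.shape
            max_feat = torch.amax(x, dim=(2, 3))      # max pooling -> global feature per channel
            avg_feat = torch.mean(x, dim=(2, 3))      # average pooling -> global feature per channel
            weights = torch.sigmoid(self.mlp(max_feat) + self.mlp(avg_feat))   # weight coefficients
            return x * weights.view(b, c, 1, 1)       # weighted (channel attention) feature map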
Further, the second enhancement unit is specifically configured to perform:
processing the channel attention feature map according to the methods of maximum pooling and average pooling to obtain a processing result;
performing connection operation on the processing result based on the corresponding image channel to obtain a connected feature map;
the connected feature images are reduced in dimension into a single channel by a convolution dimension reduction processing method, and a space feature image is obtained;
And multiplying the space feature image and the target noise image to obtain a noise-added enhanced image.
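Correspondingly, the spatial attention of the second enhancement unit can be sketched as follows, again in a CBAM-like form; the 7x7 convolution kernel is an assumption.

    import torch
    import torch.nn as nn

    class SpatialAttention(nn.Module):
        def __init__(self, kernel_size: int = 7):
            super().__init__()
            # Convolution reduces the two pooled maps to a single-channel spatial feature map.
            self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            max_map, _ = torch.max(x, dim=1, keepdim=True)    # max pooling over channels
            avg_map = torch.mean(x, dim=1, keepdim=True)      # average pooling over channels
            stacked = torch.cat([max_map, avg_map], dim=1)    # connection along the channel axis
            spatial = torch.sigmoid(self.conv(stacked))       # single-channel spatial feature map
            return x * spatial                                # spatially weighted (noise-added enhanced) image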
Further, the preset noise removal rule is determined based on the inverse process of the denoising diffusion probability model; the denoising module 206 is specifically configured to perform:
and removing the prediction noise determined in the diffusion process corresponding to the inverse process from the noise-added enhanced image at each step of the inverse process based on the inverse process of the denoising diffusion probability model.
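For reference, in the conventional denoising diffusion probability model this removal corresponds to the reverse-step update below; the patent itself only states that the noise predicted in the corresponding diffusion step is removed, so the exact formulation is given here as the standard DDPM form, not as the patent's own equation.

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\right) + \sigma_t z, \qquad z \sim \mathcal{N}(0, I),$$

where $\epsilon_\theta(x_t, t)$ is the noise predicted by the U-Net at step $t$ and $\sigma_t$ is determined by the noise schedule (with $z$ omitted at the final step).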
Further, the device further comprises a model training module, which is used for training the original image enhancement network before the coding feature map and the text codes are input into the pre-trained target image enhancement network, so as to obtain the image enhancement network with the error value of the prediction noise and the real noise smaller than the preset loss value as the target image enhancement network.
Further, the model training module is specifically configured to perform:
acquiring a high-quality image meeting the preset quality requirement, and processing the high-quality image in a downsampling mode to obtain a corresponding low-quality image;
encoding the high-quality image and the low-quality image by an encoder to obtain a high-quality code map and a low-quality code map;
gradually adding Gaussian noise into the low-quality code map, and determining the prediction noise in the result image after each step of adding Gaussian noise;
and determining error values of the prediction noise and the noise true value, and changing parameters of the original image enhancement network when the error values are larger than the preset loss values until the error values are smaller than the preset loss values, so as to obtain the trained target image enhancement network.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In a third aspect, the present invention provides an electronic device comprising a processor and a memory having stored therein at least one instruction, at least one program, code set or instruction set, the at least one instruction, at least one program, code set or instruction set being loaded and executed by the processor to implement the diffusion model based image enhancement method of any of the above.
In a fourth aspect, the present invention provides a computer readable storage medium having stored therein at least one instruction, at least one program, code set, or instruction set, the at least one instruction, at least one program, code set, or instruction set being loaded and executed by a processor to implement a diffusion model based image enhancement method as in any of the above.
In the above embodiments, the implementation may be realized in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, the embodiments may be realized in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server, or data center to another by wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)), etc.
The foregoing is merely illustrative of specific embodiments of the present invention, and the scope of the present invention is not limited thereto, but any changes or substitutions within the technical scope of the present invention should be covered by the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (12)

1. A diffusion model-based image enhancement method, the method comprising:
acquiring a target image to be enhanced, and encoding the target image through an encoder to obtain an encoding feature map;
acquiring an image enhancement instruction, and encoding the image enhancement instruction through a text encoder to obtain a text code; the image enhancement instruction comprises the features and positions of the image to be enhanced;
inputting the coding feature map and the text codes into a pre-trained target image enhancement network;
according to a preset noise adding rule and a preset step number, gradually adding Gaussian noise into the coding feature map to obtain a target noise image obeying Gaussian distribution, and determining the prediction noise in a result image after adding Gaussian noise in each step;
based on a cross attention mechanism, performing image enhancement on a region corresponding to the text code in the target noise image to obtain a noise-added enhanced image;
According to a preset noise removal rule and the preset step number, the prediction noise of each step is gradually removed from the noise-added enhanced image, and a denoised image is obtained;
and decoding the denoised image through a decoder to obtain an enhanced image.
2. The image enhancement method according to claim 1, wherein the preset noise addition rule is determined based on a diffusion process of a denoising diffusion probability model; gradually adding Gaussian noise into the coding feature map according to a preset noise adding rule and a preset step number to obtain a target noise image obeying Gaussian distribution, wherein the method specifically comprises the following steps of:
according to the diffusion process of the denoising diffusion probability model, Gaussian noise is added to the coding feature map in each step of the diffusion process; the parameter value of the added Gaussian noise is determined based on a preset noise schedule;
and calculating a result image after adding the Gaussian noise in each step of the diffusion process according to the coding feature map and the noise schedule, and outputting the result image corresponding to the preset step number as a target noise image.
3. The image enhancement method according to claim 2, wherein the calculating the result image after adding the Gaussian noise in each step of the diffusion process according to the coding feature map and the noise schedule is specifically:
calculating the result image after adding the Gaussian noise in each step of the diffusion process according to the following formula:

$$x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I),$$

wherein $x_0$ is the coding feature map before adding Gaussian noise, $x_t$ is the noised result corresponding to time step $t$, $\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$ and $\alpha_s = 1-\beta_s$; $\beta = \{\beta_1, \beta_2, \ldots, \beta_T\}$ is the preset noise schedule, which contains $T$ parameter values representing the Gaussian noise added at each step of the diffusion process, with $0 < \beta_1 < \beta_2 < \ldots < \beta_T < 1$.
4. The image enhancement method according to claim 1, wherein the target noise image comprises a plurality of image channels, and the cross-attention mechanism comprises a channel attention mechanism and a spatial attention mechanism; the image enhancement is performed on the region corresponding to the text code in the target noise image based on the cross attention mechanism to obtain a noise enhanced image, and the method specifically comprises the following steps:
through the channel attention mechanism, carrying out targeted enhancement on different image channels on the feature map corresponding to each image channel of the region corresponding to the text code in the target noise image to obtain a channel attention feature map;
and carrying out targeted enhancement of different spatial positions on the channel attention feature map through the spatial attention mechanism to obtain a noise-added enhanced image.
5. The method for enhancing an image according to claim 4, wherein the step of performing, by the channel attention mechanism, the targeted enhancement of different image channels on the feature map corresponding to each image channel of the region corresponding to the text code in the target noise image to obtain a channel attention feature map specifically includes:
for the feature map of each image channel of the region corresponding to the text code in the target noise image, performing dimension reduction processing on the feature map according to a maximum pooling and average pooling method to obtain global features of the feature map corresponding to the image channel;
processing the global features through a multi-layer perceptron to obtain the weight coefficient of the image channel;
weighting the feature images corresponding to the image channels through the weight coefficients to obtain weighted feature images;
and multiplying the weighted feature map and the image channel of the target noise image to obtain a channel attention feature map.
6. The image enhancement method according to claim 5, wherein the channel attention feature map is subjected to targeted enhancement of different spatial positions through the spatial attention mechanism to obtain a noisy enhanced image, and the method specifically comprises:
Processing the channel attention feature map according to the methods of maximum pooling and average pooling to obtain a processing result;
performing connection operation on the processing result based on the corresponding image channel to obtain a connected feature map;
the connected feature images are subjected to dimension reduction into a single channel by a convolution dimension reduction processing method, so that a space feature image is obtained;
and multiplying the space feature image and the target noise image to obtain a noise-added enhanced image.
7. The image enhancement method according to claim 2, wherein the preset noise removal rule is determined based on a reverse process of a denoising diffusion probability model; the step of gradually removing the prediction noise of each step from the noise-added enhanced image according to a preset noise removal rule and the preset step number specifically comprises the following steps:
and removing the prediction noise determined in the diffusion process corresponding to the inverse process from the noise-added enhanced image at each step of the inverse process based on the inverse process of the denoising diffusion probability model.
8. The image enhancement method according to claim 1, wherein prior to said inputting said encoding feature map and said text encoding into a pre-trained target image enhancement network, said method further comprises:
Training an original image enhancement network to obtain an image enhancement network with the error value of the predicted noise and the real noise smaller than a preset loss value as a target image enhancement network.
9. The image enhancement method according to claim 8, wherein training the original image enhancement network to obtain an image enhancement network with an error value of prediction noise and real noise smaller than a preset loss value as the target image enhancement network specifically comprises:
acquiring a high-quality image meeting a preset quality requirement, and processing the high-quality image in a downsampling mode to obtain a corresponding low-quality image;
encoding the high-quality image and the low-quality image by an encoder to obtain a high-quality code map and a low-quality code map;
gradually adding Gaussian noise into the low-quality code map, and determining the prediction noise in the result image after adding Gaussian noise in each step;
and determining error values of the prediction noise and the noise true value, and changing parameters of the original image enhancement network when the error values are larger than preset loss values until the error values are smaller than the preset loss values, so as to obtain the trained target image enhancement network.
10. An image enhancement device based on a diffusion model, the device comprising:
the coding module is used for acquiring a target image to be enhanced, and coding the target image through the coder to obtain a coding feature map;
the text coding module is used for acquiring the image enhancement instruction, and coding the image enhancement instruction through the text encoder to obtain a text code; the image enhancement instruction comprises the features and positions of the image to be enhanced;
the input module is used for inputting the coding feature map and the text codes into a pre-trained target image enhancement network;
the noise prediction module is used for gradually adding Gaussian noise into the coding feature map according to a preset noise adding rule and a preset step number to obtain a target noise image obeying Gaussian distribution, and determining the prediction noise in a result image after Gaussian noise is added in each step;
the image enhancement module is used for carrying out image enhancement on the region corresponding to the text code in the target noise image based on a cross attention mechanism to obtain a noise enhanced image;
the denoising module is used for gradually removing the prediction noise of each step from the noise-added enhanced image according to a preset noise removal rule and the preset step number to obtain a denoised image;
And the decoding module is used for decoding the denoised image through a decoder to obtain an enhanced image.
11. An electronic device comprising a processor and a memory having stored therein at least one instruction, at least one program, a set of codes, or a set of instructions, the at least one instruction, the at least one program, the set of codes, or the set of instructions being loaded and executed by the processor to implement the diffusion model-based image enhancement method of any one of claims 1-9.
12. A computer readable storage medium having stored therein at least one instruction, at least one program, code set, or instruction set, the at least one instruction, the at least one program, the code set, or instruction set being loaded and executed by a processor to implement the diffusion model-based image enhancement method of any of claims 1-9.