CN116777805A - Rain scene image synthesis method and device learned from rendering and storage medium - Google Patents

Rain scene image synthesis method and device learned from rendering and storage medium

Info

Publication number
CN116777805A
CN116777805A (application CN202310781233.9A)
Authority
CN
China
Prior art keywords
image
rain
resolution
model
hidden
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310781233.9A
Other languages
Chinese (zh)
Inventor
赵生捷 (Zhao Shengjie)
周楷彬 (Zhou Kaibin)
邓浩 (Deng Hao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tongji University
Original Assignee
Tongji University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tongji University filed Critical Tongji University
Priority to CN202310781233.9A
Publication of CN116777805A
Pending legal-status Critical Current

Landscapes

  • Processing Or Creating Images (AREA)

Abstract

The application relates to a method, a device and a storage medium for synthesizing rain scene images learned from rendering. The method comprises the following steps: in the rendering stage, a high-resolution paired rain scene-background image dataset is created, containing image pairs of multiple scenes under illumination conditions at different times of day; in the learning stage, a guiding diffusion model is introduced into an implicit diffusion model, an autoencoder model is used to learn a hidden space perceptually equivalent to the image space, the encoder perceptually compresses the image, and hidden variables equivalent to the pixels of the image space are obtained in the hidden space; the forward and reverse processes of the diffusion model are completed in the hidden space, and the reverse process is constrained by two conditional mechanisms, cross-attention and tandem (concatenation); the output hidden variable is transformed back into the image space by the decoder to generate a high-resolution rain scene image. Compared with the prior art, the method combines the realism of rendering-based methods with the efficiency of learning-based methods.

Description

Rain scene image synthesis method and device learned from rendering and storage medium
Technical Field
The application relates to the field of image processing, in particular to a high-resolution rain scene image synthesis method learned from rendering.
Background
Single Image Rain Removal (SIRR) is a widely studied task. Owing to the strong fitting ability of neural networks, deep-learning-based SIRR methods are the current mainstream, but the effectiveness of such data-driven methods is largely determined by the quality of the dataset. Existing SIRR datasets can be broadly divided into three types: real datasets, artificially generated datasets and synthetic datasets. Real datasets are obtained by photographing real-world scenes on rainy days; such collection is often limited by weather conditions, and paired data are difficult to obtain. Artificially generated datasets simulate real-world rain with sprinklers and capture images with a camera to obtain clean/rain scene image pairs, but this consumes a great deal of time and labor. Image synthesis methods can synthesize rain scene images from clear background images with little or no manual intervention, saving time and labor and making large-scale paired image datasets feasible.
Existing rain scene image synthesis methods mainly fall into two categories: rendering-based methods and learning-based methods. Rendering-based methods model raindrop dynamics and the appearance of rain streaks, render rain from an input scene depth map, light source attributes and several custom rain-related parameters, and blend the rain layer with the background image in a physically plausible way to obtain a synthetic rain scene image, so that the color appearance of rain under a specific illumination environment can be faithfully reproduced. Learning-based methods train a generative model on real rain scene image datasets so that the model captures the complex distribution of rain streaks in real images, and can automatically and efficiently generate diverse, non-repetitive rain streaks without subjective manual intervention or empirical parameter settings.
For example, Chinese patent application CN114332460A discloses such a scheme. Although these methods can be used to synthesize datasets, they have limitations. Rendering-based methods require complex input data and parameters set manually from experience, the types of rain they can generate are limited, and physical simulation and rendering add a great deal of time cost. Learning-based methods generally treat the rain layer as a single-channel gray-scale layer and blend the rain layer and the background image by linear superposition, ignoring the color appearance of rain and optical phenomena such as refraction and transmission of the environment. In addition, existing synthetic datasets lack diversity in illumination environments: they mainly contain images under daytime illumination, include few images under complex illumination environments such as night, and have low resolution.
As shown in fig. 1, which compares rain scene images randomly selected from five synthetic datasets (BDD350, COCO, Rain100H, Rain100L and RainCityscapes) with real rain scene images, there is a large gap between the rain-layer color appearance of the existing datasets and that of real rain scene images. SIRR models trained on these datasets are difficult to generalize to complex illumination environments such as night, and their performance suffers greatly. As shown in fig. 2, which presents the rain removal results of four deep-learning-based SIRR models on a real night rain scene image, these SIRR models have difficulty in completely removing colored rain streaks and restoring a clear background. Therefore, an effective method for synthesizing high-resolution rain scene images under complex illumination environments is currently lacking, and a high-quality paired rain scene dataset is needed to train deep-learning-based SIRR models so that they can generalize to complex illumination environments such as night.
Disclosure of Invention
The application aims to provide a high-resolution rain scene image synthesis method and device learned from rendering.
The aim of the application can be achieved by the following technical scheme:
as a first aspect of the present application, there is provided a rain scene image synthesizing method learned from rendering, the method steps comprising:
a rendering stage, wherein a high-resolution paired rain scene-background image dataset is created, the dataset comprising a plurality of image pairs, each comprising a rain layer mask image, a background image and a rain scene image, under illumination conditions at different times of day;
a learning stage, wherein a guiding diffusion model is introduced into an implicit diffusion model, an autoencoder model is used to learn a hidden space perceptually equivalent to the image space, the encoder performs perceptual compression on the image, and hidden variables equivalent to the pixels of the image space are obtained in the hidden space; the forward and reverse processes of the diffusion model are completed in the hidden space, and the reverse process is constrained using two conditional mechanisms, cross-attention and tandem (concatenation); the output hidden variable is transformed into the image space by the decoder, generating a high-resolution rain scene image.
Further, the rendering stage comprises the following specific steps:
creating a scene model using a modeling tool, and creating a raindrop model using a particle simulator; combining the scene model and the raindrop model to obtain a rain scene model; after setting the environment-related parameters, using a rendering engine to render a rain layer mask image, a background image and a rain scene image, forming an image pair.
Further, the implicit diffusion model converts between the original data distribution and a Gaussian distribution by gradually adding noise to the data over a Markov chain of T time steps; the specific steps comprise:
the implicit diffusion model first trains a perceptually compressing autoencoder model, comprising an encoder $\mathcal{E}$ and a decoder $\mathcal{D}$, and compresses the image from the high-dimensional pixel space to a low-dimensional hidden space with the encoder;
in the forward process, noise is gradually added to the real data $x_0 \sim q(x_0)$:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right),\quad t = 1,\dots,T$$
where $\beta_1,\dots,\beta_T$ are hyperparameters and $x_1,\dots,x_T$ are hidden variables with the same dimension as the original data $x_0$;
an arbitrary time step is sampled by the reparameterization technique, $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$, where $\bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)$ and $\epsilon \sim \mathcal{N}(0,I)$;
by training the implicit diffusion model, a reverse process is learned which applies a learnable Gaussian transformation $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t,t),\ \Sigma_\theta(x_t,t)\right)$ to the variable $x_t$ for progressive denoising, where a neural network is used to predict the statistical characteristics $\mu_\theta$ and $\Sigma_\theta$ of $p_\theta$, $\mu_\theta$ denoting the mean vector and $\Sigma_\theta$ the covariance matrix of the Gaussian distribution; $\mu_\theta$ is reparameterized into a denoising network $\epsilon_\theta(x_t, t)$;
given a high-resolution rainy-day image $x_{high}$, the corresponding hidden variable encoded by the encoder is $z_{high}$, and the objective function of the implicit diffusion model LDM is expressed as:
$$L_{LDM} = \mathbb{E}_{\mathcal{E}(x_{high}),\ \epsilon \sim \mathcal{N}(0,I),\ t}\!\left[\left\| \epsilon - \epsilon_\theta\!\left(z_{high,t},\, t\right) \right\|^2\right]$$
where $\epsilon$ represents a variable sampled from a standard Gaussian distribution and $t$ is uniformly sampled from $\{1,\dots,T\}$.
Further, the guiding diffusion model is trained on hidden variables of low-resolution images and roughly predicts the hidden variable of the low-resolution rain scene image; during high-resolution image generation, the hidden variable predicted by the guiding diffusion model serves as a condition of the reverse process of the diffusion model and is used to guide image generation; the specific steps comprise:
downsampling the input background image to obtain a low-resolution image $x^{low}_{bg}$, and combining the low-resolution image with the rain layer mask image to obtain a low-resolution masked image $x^{low}_{masked}$;
encoding the low-resolution image and the low-resolution masked image into the hidden space to obtain hidden variables $z^{low}_{bg}$ and $z^{low}_{masked}$, and inputting them into the guiding diffusion model to predict the hidden variable $\hat{z}^{low}$; the objective function of the guiding diffusion model GDM is expressed as:
$$L_{GDM} = \mathbb{E}_{z^{low},\ \epsilon \sim \mathcal{N}(0,I),\ t}\!\left[\left\| \epsilon - \epsilon_\theta\!\left(z^{low}_t,\, t,\, z^{low}_{bg},\, z^{low}_{masked}\right) \right\|^2\right]$$
where $z^{low}$ is the hidden variable of the real low-resolution rain scene image; the backbone of the guiding diffusion model is implemented with a UNet.
Further, the specific steps of constraining the reverse process using the conditional mechanisms include:
predicting hidden variables of the low-resolution rain scene image by using a guide model;
based on a cross attention mechanism, enhancing the bottom backbone of the diffusion model by using hidden variables of the low-resolution rain scene image;
and synthesizing the high-resolution background image and the high-resolution rain layer mask into a high-resolution mask image, compressing the high-resolution mask image by an encoder to obtain hidden variables, and taking the hidden variables as conditions of a reverse process by a tandem mechanism.
Further, the specific steps of constraining based on the cross-attention mechanism comprise:
modeling the reverse process of the implicit diffusion model as a conditional distribution $p(z \mid y)$, and employing a conditional denoising network $\epsilon_\theta(z_t, t, y)$ to constrain the reverse process;
taking the low-resolution rain scene hidden variable $\hat{z}^{low}$ predicted by the guiding diffusion model as the condition, and mapping $\hat{z}^{low}$ to intermediate layers of the bottom-level UNet of the diffusion model through cross-attention layers, denoted as:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d}}\right)\!\cdot V,\quad Q = W_Q^{(i)}\,\varphi_i(z_t),\ K = W_K^{(i)}\,\hat{z}^{low},\ V = W_V^{(i)}\,\hat{z}^{low}$$
where $\varphi_i(z_t)$ denotes an intermediate representation of the UNet implementing $\epsilon_\theta$, and $W_Q^{(i)}, W_K^{(i)}, W_V^{(i)}$ are learnable projection matrices.
Further, the specific steps of constraining based on the tandem mechanism comprise:
combining the high-resolution background image and the rain layer mask into a masked image $x^{high}_{masked}$, compressing $x^{high}_{masked}$ into the hidden variable $z^{high}_{masked}$, and concatenating it with the input hidden variable as a condition of the reverse process; specifically, the variable input to the reverse process is $z'_t = \mathrm{concat}\!\left(z_t,\ z^{high}_{masked}\right)$;
based on the tandem and cross-attention condition mechanisms, the conditional implicit diffusion model LDM is learned through the following objective function:
$$L_{LDM} = \mathbb{E}_{\mathcal{E}(x_{high}),\ \epsilon \sim \mathcal{N}(0,I),\ t}\!\left[\left\| \epsilon - \epsilon_\theta\!\left(z'_t,\, t,\, \hat{z}^{low}\right) \right\|^2\right]$$
where $z'_t$ represents the variable input to the reverse process and $\hat{z}^{low}$ represents the hidden variable of the low-resolution rain scene image predicted by the guiding model.
Further, after the rain scene image is generated by the model, post-processing is performed on the generated rain scene image:
the generated rain scene image and the background image are mixed in the lighten blend mode, and the rain layer mask is used to keep the colors of the pixels outside the rain region unchanged:
$$x_{output} = M \odot \max\!\left(\hat{x}_{rain},\ x^{high}_{bg}\right) + (1 - M) \odot x^{high}_{bg}$$
where $M$ represents the rain mask layer, $\odot$ denotes element-wise multiplication, $\hat{x}_{rain}$ represents the generated rain scene image, and $x^{high}_{bg}$ represents the high-resolution background image.
As a second aspect of the present application, there is provided a rain scene image synthesizing apparatus learned from rendering, comprising a memory, a processor, and a program stored in the memory, the processor implementing the rain scene image synthesizing method learned from rendering as described above when executing the program.
As a third aspect of the present application, there is provided a storage medium having stored thereon a program which, when executed, implements the rain scene image synthesizing method learned from rendering as described above.
Compared with the prior art, the application has the following beneficial effects:
the application provides a practical high-resolution rain scene image synthesis method which is learned from rendering, comprising the following steps of:
in the rendering stage, the application uses a rendering-based method to render a realistic high-resolution paired rain scene-background image and creates a paired rain scene-background image dataset containing realistic paired rain scene images of multiple scenes and at different times under lighting conditions.
Training a high-resolution rain scene image by using a rendered data set to generate a network HRIGNet in a learning stage, introducing a guiding diffusion model into an implicit diffusion model, and guiding the synthesis of the high-resolution image by using a low-resolution image; controlling the composition of the rain scene images by using a cross attention and registration condition mechanism; illumination information can be learned from the background image, generating a high resolution rain scene image under the same illumination conditions as the background image.
The method provided by the application has the advantages of reality based on the rendering method and high efficiency based on the learning method, and avoids the defects of low efficiency based on the rendering method and poor reality based on the learning method.
Drawings
FIG. 1 is a diagram of a comparison of rain scene images of a prior art synthetic dataset, (a) BDD350; (b) COCO; (c) Rain100H; (d) Rain100L; (e) RainCityscapes; (f) - (h) real rain scene images;
FIG. 2 is a comparison graph of rain removal results of prior-art deep-learning-based SIRR models on a real night rain scene image, (a) input image; (b) Restormer; (c) MAXIM; (d) DGNL-Net; (e) PReNet;
FIG. 3 is a flow diagram of a method of composition of a rain scene image learned from rendering in accordance with the present application;
FIG. 4 is a diagram of rain scene images of the HRI dataset of the present application, the first line being rain scene images of a road scene, the second line being rain scene images of a city street scene;
FIG. 5 is a schematic diagram of an HRIGNet architecture of the present application;
FIG. 6 is a comparison of a rain scene image and an output image generated in accordance with the present application, (a) a background image; (b) a reference image; (c) the generated rain scene image; (d) post-processing the output image;
FIG. 7 is a comparison of output images of the scheme of the present application and the baseline models, the first line being a clear background image and generated images, the second line being a real rain scene image and output images, (a) clear background image; (b) ASSET generated image; (c) LDM generated image; (d) DiT generated image; (e) HRIGNet generated image; (f) real rain scene image; (g) ASSET output image; (h) LDM output image; (i) DiT output image; (j) HRIGNet output image.
Detailed Description
The application will now be described in detail with reference to the drawings and specific examples. The present embodiment is implemented on the premise of the technical scheme of the present application, and a detailed implementation manner and a specific operation process are given, but the protection scope of the present application is not limited to the following examples.
Example 1
In the present application, in order to efficiently synthesize a large number of high-resolution rain scene images in complex lighting environments, a practical learning-from-rendering pipeline is proposed. The pipeline is divided into two phases: a high-resolution paired rain scene-background image dataset is created in the rendering phase, and a rain scene image generation network is trained with this dataset in the learning phase. The pipeline has the realism of rendering-based methods and the efficiency of learning-based methods, and avoids the low efficiency of rendering-based methods and the poor realism of learning-based methods.
To train a high-quality rain scene image generation network, a realistic paired rain scene-background image dataset is required. Considering that collecting real rain scene images shot by a camera is time-consuming and labor-intensive and that the corresponding background images are difficult to obtain, the application constructs a High-Resolution Rainy Image (HRI) dataset using a rendering-based method. The dataset contains realistic paired rain scene-background images of multiple scenes under illumination environments at different times of day.
In order to learn illumination information from a clear background image and generate a high-resolution rain scene image matching the illumination environment, a High-Resolution Rainy Image Generation Network (HRIGNet) based on diffusion models (DMs) is proposed in the learning stage.
Using a low-resolution image to guide the synthesis of the high-resolution image provides more guiding information and is expected to improve the quality of the synthesized image. The application therefore introduces a guiding diffusion model into the implicit diffusion model (LDM, Latent Diffusion Model). In order to pair the generated rain scene image with the input background image, effective constraints need to be applied to the image generation process: rain scene image synthesis is controlled by cross-attention and tandem (concatenation) condition mechanisms, taking the hidden variable (latent code) of the low-resolution rain scene image predicted by the guiding diffusion model and the high-resolution masked image respectively as conditions.
In order to effectively synthesize high-resolution rain scene images in complex lighting environments and create high-quality synthetic rain scene datasets, a practical learning-from-rendering pipeline is proposed. An overview of the pipeline is shown in fig. 3. Specifically, the pipeline combines a rendering-based method and a learning-based method, and is divided into two phases, a rendering phase and a learning phase. In the rendering phase, realistic high-resolution paired rain scene-background images are rendered using a rendering-based method and a paired rain scene-background image dataset is created. In the learning phase, a rain scene image generation network is trained with the rendered dataset so as to efficiently generate high-resolution rain scene images. The pipeline has the realism of rendering-based methods and the efficiency of learning-based methods, and avoids the low efficiency of rendering-based methods and the poor realism of learning-based methods.
1. Rendering stage
To train a high quality rain scene image generation network, a realistic rain scene paired image dataset is required, containing background images, rain layer mask images and rain scene image pairs in different lighting environments. In view of the time and effort involved in gathering the actual rain scene images captured by the camera and the difficulty in obtaining the corresponding background images and rain layer mask images, a rendering-based approach is employed to construct the dataset.
Offline rendering based on ray tracing can simulate most natural phenomena of the real physical world concerning light-surface interaction and render photorealistic images, and is widely used in film, animation, design and other fields. Blender is open-source 3D content creation software with which realistic three-dimensional scene models, including various common light sources, can be created conveniently and freely; its physics engine and particle system can be used to simulate the effect of falling rain, and it provides Cycles, a physically based GPU ray-tracing renderer. With these tools, realistic rain scene images can be rendered.
The rendering phase is implemented based on Blender, as shown in fig. 3. Specifically, a modeling tool is used to create the scene model and a particle simulator is used to create the raindrop model; the scene model and the raindrop model are combined to obtain a rain scene model; after other environment-related parameters such as illumination are set, a rendering engine renders the rain layer mask image, the background image and the rain scene image, forming an image pair.
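A minimal Blender scripting sketch of this rendering step is given below. The object name ("RainEmitter"), particle settings and output paths are assumptions for illustration; the actual HRI dataset was produced with scene-specific settings not shown here.

```python
# Minimal sketch of the rendering step, to be run inside Blender's Python
# environment (bpy).  Object names, particle settings and paths are assumptions.
import bpy

scene = bpy.context.scene
scene.render.engine = 'CYCLES'           # physically based ray-tracing renderer
scene.render.resolution_x = 960
scene.render.resolution_y = 720

# Attach a particle system to a hypothetical emitter object to simulate raindrops.
emitter = bpy.data.objects['RainEmitter']
emitter.modifiers.new(name='rain', type='PARTICLE_SYSTEM')
settings = emitter.particle_systems[0].settings
settings.count = 20000                    # raindrop count (controls rain intensity)
settings.lifetime = 50
settings.physics_type = 'NEWTON'          # gravity-driven fall

def render_to(path):
    scene.render.filepath = path
    bpy.ops.render.render(write_still=True)

# Background image: hide the rain particles, then render.
emitter.hide_render = True
render_to('/tmp/background.png')

# Rain scene image: show the rain particles again, then render.
emitter.hide_render = False
render_to('/tmp/rainy.png')

# The rain layer mask pass (rain rendered against black) would be produced
# analogously, e.g. with an override material on the raindrops (not shown).
```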
1.1 high resolution rain scene image dataset (HRI)
In the rendering phase, this embodiment constructs a realistic high-resolution paired rain scene image dataset HRI. As shown in Table 1, the HRI dataset contains 1300 image pairs covering two scenes, a road (lane) scene and a city street (citystreet) scene, with image resolutions of 720×960 and 512×512 respectively. The lane scene has 1000 image pairs, covering 4 camera views, each view containing images at 50 moments and each moment containing rain images of 5 intensities and directions; the citystreet scene has 300 image pairs, covering 6 camera views, each view containing images from day to night at 25 moments and each moment containing rain scene images of 2 intensities and directions. Some of these images are shown in fig. 4.
The dataset is divided into a training set and a test set according to camera view (a small split sketch is given after Table 1): for the lane scene, the training set contains images from 3 camera views and the test set contains images from 1 camera view; for the citystreet scene, the training set contains images from 5 camera views and the test set contains images from 1 camera view. In total, the training set contains 1000 image pairs and the test set contains 300 image pairs.
Table 1 HRI dataset
Scene Resolution Camera views Moments per view Rain variants per moment Image pairs
lane 720×960 4 50 5 1000
citystreet 512×512 6 25 2 300
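A small sketch of the per-camera-view split described above is shown below; the directory layout and file naming are assumptions, not the actual organisation of the HRI dataset.

```python
# Hypothetical split of image pairs into train/test sets by camera view.
from pathlib import Path

# Assumed layout: <root>/<scene>/<view_id>/<pair_id>/{background,mask,rainy}.png
TEST_VIEWS = {'lane': {'view3'}, 'citystreet': {'view5'}}   # one held-out view per scene

def split_dataset(root):
    train, test = [], []
    for scene_dir in Path(root).iterdir():
        held_out = TEST_VIEWS.get(scene_dir.name, set())
        for view_dir in scene_dir.iterdir():
            pairs = sorted(view_dir.iterdir())
            (test if view_dir.name in held_out else train).extend(pairs)
    return train, test

if __name__ == '__main__':
    train_pairs, test_pairs = split_dataset('HRI')
    print(len(train_pairs), 'training pairs,', len(test_pairs), 'test pairs')
```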
2. High resolution rain scene image generating network (HRIGNet)
The application provides a high-resolution rain scene image generation network (HRIGNet), which can synthesize a high-resolution rain scene image from a clear background image and a corresponding rain layer mask image. Specifically, given an RGB clear scene image and a mask image indicating the positions of rain streaks in the scene, the method generates rain streaks at the positions corresponding to the mask, and the resulting rain streaks have the illumination conditions and color appearance matching the background image. In addition, the method can generate high-resolution images with resolution up to 512×512.
The architecture of HRIGNet is shown in fig. 5. Following LDM, in order to reduce the overhead of training a diffusion model on high-resolution images, an autoencoder model is used to learn a hidden space perceptually equivalent to the image space: the encoder perceptually compresses the image, and hidden variables equivalent to the pixels of the image space are obtained in the hidden space. Thus, the forward and reverse processes of the diffusion model can be completed in the hidden space, and finally the output hidden variable is transformed back into the image space by the decoder.
To control the image synthesis process of the diffusion model, two conditional mechanisms, tandem and cross-attention, are used to constrain the reverse process. First, a guiding diffusion model is used to predict the hidden variable (latent code) of the low-resolution rain scene image, and based on a cross-attention mechanism this hidden variable is used to enhance the bottom-level UNet backbone of the diffusion model. The cost of training and sampling the diffusion model on low-resolution images is low, so the rain scene image is first predicted at low resolution, and the hidden variable of the predicted low-resolution rain scene image is used to guide high-resolution synthesis, providing more guiding information and improving the quality of the synthesized image. In order to impose stronger constraints on the image synthesis process, the background image and the rain layer mask are combined into a masked image, which is used as a condition of the reverse process through the tandem mechanism.
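The overall sampling flow can be summarised by the sketch below. The `encoder`, `decoder`, `gdm` and `ldm` objects and their `sample` interfaces are hypothetical stand-ins for the VQGAN autoencoder, the guiding diffusion model and the conditional latent diffusion model; the way the background and mask are combined into the masked image (blanking out the rain region) is also an assumption.

```python
# Sketch of HRIGNet inference: low-resolution guidance + high-resolution synthesis.
# All model interfaces (encoder/decoder/gdm/ldm) are hypothetical wrappers.
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate_rainy_image(background, rain_mask, encoder, decoder, gdm, ldm):
    # background: (B,3,H,W) clear image; rain_mask: (B,1,H,W) rain layer mask.
    # 1. Guiding branch: predict a coarse low-resolution rainy latent.
    bg_low = F.interpolate(background, scale_factor=0.5, mode='bilinear', align_corners=False)
    mask_low = F.interpolate(rain_mask, scale_factor=0.5, mode='nearest')
    masked_low = bg_low * (1.0 - mask_low)        # assumed masked image (rain region blanked out)
    z_guide = gdm.sample(cond=encoder(masked_low))            # \hat{z}^{low}

    # 2. High-resolution branch: reverse diffusion in latent space with
    #    concatenation (masked-image latent) and cross-attention (z_guide) conditions.
    masked_high = background * (1.0 - rain_mask)
    z_masked = encoder(masked_high)                            # z^{high}_{masked}
    z_rain = ldm.sample(concat_cond=z_masked, crossattn_cond=z_guide)

    # 3. Decode back to pixel space.
    return decoder(z_rain)
```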
2.1 implicit diffusion model
The diffusion model is a probabilistic model that converts between the original data distribution and a Gaussian distribution by gradually adding noise to the data over a Markov chain of T time steps. The forward process of the diffusion model gradually adds noise to the real data $x_0 \sim q(x_0)$:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right),\quad t = 1,\dots,T$$
where $\beta_1,\dots,\beta_T$ are hyperparameters and $x_1,\dots,x_T$ are hidden variables with the same dimension as the original data $x_0$. By the reparameterization trick, an arbitrary time step can be sampled directly as $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$, where $\bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)$ and $\epsilon \sim \mathcal{N}(0,I)$. By training the diffusion model, the reverse process, i.e. the reverse transition of the forward process, can be learned: a learnable Gaussian transformation $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t,t),\ \Sigma_\theta(x_t,t)\right)$ is applied to the variable $x_t$ for progressive denoising, in which a neural network is used to predict the statistical characteristics $\mu_\theta$ and $\Sigma_\theta$ of $p_\theta$. Reparameterizing $\mu_\theta$ into a denoising network $\epsilon_\theta(x_t, t)$, the corresponding objective function can be simplified to:
$$L_{DM} = \mathbb{E}_{x_0,\ \epsilon \sim \mathcal{N}(0,I),\ t}\!\left[\left\| \epsilon - \epsilon_\theta(x_t, t) \right\|^2\right] \tag{1}$$
where $\epsilon$ represents the variable sampled from a standard Gaussian distribution and $t$ is uniformly sampled from $\{1,\dots,T\}$.
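The following sketch illustrates this standard forward process and denoising objective; the noise schedule values and the network interface are generic assumptions rather than the exact settings used by HRIGNet.

```python
# Standard DDPM forward process q(x_t | x_0) and the simplified denoising loss (equation (1)).
import torch

def make_schedule(T=1000, beta_start=1e-4, beta_end=2e-2):
    betas = torch.linspace(beta_start, beta_end, T)
    alpha_bars = torch.cumprod(1.0 - betas, dim=0)   # \bar{alpha}_t
    return alpha_bars

def q_sample(x0, t, alpha_bars):
    # x_t = sqrt(abar_t)*x0 + sqrt(1-abar_t)*eps, closed form via reparameterization.
    a = alpha_bars[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x0)
    return a.sqrt() * x0 + (1.0 - a).sqrt() * eps, eps

def diffusion_loss(eps_model, x0, alpha_bars):
    # eps_model(x_t, t) predicts the added noise; the loss is equation (1).
    t = torch.randint(0, alpha_bars.numel(), (x0.shape[0],), device=x0.device)
    x_t, eps = q_sample(x0, t, alpha_bars)
    return torch.mean((eps - eps_model(x_t, t)) ** 2)
```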
For image synthesis, to reduce the computational cost of training diffusion models on high-resolution images, LDM first trains a perceptually compressing autoencoder model VQGAN, comprising an encoder $\mathcal{E}$ and a decoder $\mathcal{D}$. The image is compressed with the encoder from the high-dimensional pixel space to a low-dimensional hidden space, where high-frequency, hard-to-perceive details are abstracted away, making the training of the diffusion model more efficient. Given a high-resolution RGB rainy-day image $x_{high}$, the corresponding hidden variable encoded by the encoder is $z_{high}$. The objective function of LDM can be expressed as:
$$L_{LDM} = \mathbb{E}_{\mathcal{E}(x_{high}),\ \epsilon \sim \mathcal{N}(0,I),\ t}\!\left[\left\| \epsilon - \epsilon_\theta\!\left(z_{high,t},\, t\right) \right\|^2\right] \tag{2}$$
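A latent-space version of the previous sketch could look as follows; the `vqgan_encoder` interface and the frozen-encoder assumption are illustrative, not the exact HRIGNet implementation.

```python
# Latent diffusion objective (equation (2)): the same denoising loss as above,
# but computed on the hidden variable produced by the perceptual autoencoder.
import torch

def latent_diffusion_loss(eps_model, vqgan_encoder, x_high, alpha_bars):
    with torch.no_grad():
        z_high = vqgan_encoder(x_high)            # z_high = E(x_high); encoder assumed frozen
    t = torch.randint(0, alpha_bars.numel(), (z_high.shape[0],), device=z_high.device)
    a = alpha_bars[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(z_high)
    z_t = a.sqrt() * z_high + (1.0 - a).sqrt() * eps   # forward process in latent space
    return torch.mean((eps - eps_model(z_t, t)) ** 2)
```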
2.2 guiding diffusion model
The cost of training and sampling a diffusion model on low-resolution images is low. Using a low-resolution image to guide the synthesis of the high-resolution image can provide more guiding information and is expected to improve the quality of the synthesized image.
In this embodiment, a guiding diffusion model is trained on the hidden variables of low-resolution images. Because these hidden variables have a small dimension, the guiding diffusion model can perform training and sampling quickly and roughly predicts the hidden variable of the low-resolution rain scene image; in the subsequent high-resolution image generation process, the hidden variable predicted by this model is used as a condition of the reverse process of the diffusion model to guide image generation. Specifically, the input RGB background image is downsampled to obtain a low-resolution image $x^{low}_{bg}$, which is combined with the rain layer mask image to obtain a low-resolution masked image $x^{low}_{masked}$. The two images are encoded into the hidden space to obtain the hidden variables $z^{low}_{bg}$ and $z^{low}_{masked}$, which are input to the guiding diffusion model to predict the hidden variable of the low-resolution rain scene image $\hat{z}^{low}$. The objective function of the guiding diffusion model can be expressed as:
$$L_{GDM} = \mathbb{E}_{z^{low},\ \epsilon \sim \mathcal{N}(0,I),\ t}\!\left[\left\| \epsilon - \epsilon_\theta\!\left(z^{low}_t,\, t,\, z^{low}_{bg},\, z^{low}_{masked}\right) \right\|^2\right] \tag{3}$$
where $z^{low}$ is the hidden variable of the real low-resolution rain scene image. The backbone of the guiding diffusion model GDM is implemented with a UNet.
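A hedged sketch of one training step of the guiding diffusion model is given below; the downsampling factor, the way the masked image is built, and the channel-wise concatenation used to pass the conditioning latent to the denoiser are assumptions made to keep the sketch self-contained and consistent with equation (3).

```python
# One training step of the guiding diffusion model (equation (3)); interfaces are assumptions.
import torch
import torch.nn.functional as F

def gdm_training_step(eps_model, encoder, x_rain_high, x_bg_high, rain_mask, alpha_bars):
    # Build the low-resolution inputs (downsampling factor is an assumption).
    x_rain_low = F.interpolate(x_rain_high, scale_factor=0.5, mode='bilinear', align_corners=False)
    x_bg_low = F.interpolate(x_bg_high, scale_factor=0.5, mode='bilinear', align_corners=False)
    mask_low = F.interpolate(rain_mask, scale_factor=0.5, mode='nearest')
    masked_low = x_bg_low * (1.0 - mask_low)      # assumed masked image (rain region blanked out)

    with torch.no_grad():
        z_low = encoder(x_rain_low)       # target: latent of the real low-res rainy image
        z_cond = encoder(masked_low)      # condition: latent of the low-res masked image

    t = torch.randint(0, alpha_bars.numel(), (z_low.shape[0],), device=z_low.device)
    a = alpha_bars[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(z_low)
    z_t = a.sqrt() * z_low + (1.0 - a).sqrt() * eps           # forward process on z_low
    eps_pred = eps_model(torch.cat([z_t, z_cond], dim=1), t)  # conditioning by concatenation (assumption)
    return torch.mean((eps - eps_pred) ** 2)
```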
2.3 conditional mechanisms
Modeling the reverse process of the diffusion model as a conditional distribution $p(z \mid y)$, the reverse process can be constrained to control the image synthesis process by using a conditional denoising network $\epsilon_\theta(z_t, t, y)$. In the context of image synthesis, LDM allows inputs of different modalities as conditions of DMs through a cross-attention mechanism. In this approach, the low-resolution rain scene hidden variable $\hat{z}^{low}$ predicted by the guiding model is taken as the condition, and $\hat{z}^{low}$ is mapped to intermediate layers of the UNet through cross-attention layers, denoted as:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d}}\right)\!\cdot V,\quad Q = W_Q^{(i)}\,\varphi_i(z_t),\ K = W_K^{(i)}\,\hat{z}^{low},\ V = W_V^{(i)}\,\hat{z}^{low}$$
where $\varphi_i(z_t)$ denotes an intermediate representation of the UNet implementing $\epsilon_\theta$, and $W_Q^{(i)}, W_K^{(i)}, W_V^{(i)}$ are learnable projection matrices.
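A minimal cross-attention block of this form is sketched below; the dimensions and the flattening of the UNet feature map into a token sequence are assumptions.

```python
# Minimal cross-attention block: queries come from a UNet feature map,
# keys/values come from the guiding latent \hat{z}^{low}.
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    def __init__(self, query_dim, context_dim, inner_dim=224):
        super().__init__()
        self.scale = inner_dim ** -0.5
        self.to_q = nn.Linear(query_dim, inner_dim, bias=False)
        self.to_k = nn.Linear(context_dim, inner_dim, bias=False)
        self.to_v = nn.Linear(context_dim, inner_dim, bias=False)
        self.to_out = nn.Linear(inner_dim, query_dim)

    def forward(self, phi, z_guide):
        # phi:     (B, N, query_dim)   flattened intermediate UNet representation
        # z_guide: (B, M, context_dim) flattened guiding latent
        q, k, v = self.to_q(phi), self.to_k(z_guide), self.to_v(z_guide)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return self.to_out(attn @ v)      # injected back into the UNet feature map
```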
In addition to the cross-attention condition mechanism, a tandem (concatenation) condition mechanism is used in order to impose stronger constraints on the image generation process. The high-resolution background image and the rain layer mask are combined into a masked image $x^{high}_{masked}$, which is compressed into the hidden variable $z^{high}_{masked}$ and concatenated with the input hidden variable as a condition of the reverse process. Specifically, the variable input to the reverse process is $z'_t = \mathrm{concat}\!\left(z_t,\ z^{high}_{masked}\right)$.
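The tandem condition itself amounts to a channel-wise concatenation; a short sketch with assumed tensor shapes follows.

```python
# Tandem (concatenation) conditioning: the masked-image latent is stacked
# with the noisy latent along the channel axis before entering the UNet.
import torch

def concat_condition(z_t, z_masked_high):
    # z_t:           (B, C,  h, w) noisy latent at step t
    # z_masked_high: (B, C', h, w) latent of the high-resolution masked image
    return torch.cat([z_t, z_masked_high], dim=1)   # z'_t fed to eps_theta
```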
Based on the tandem and cross-attention condition mechanisms, the conditional implicit diffusion model LDM is learned through the following objective function:
$$L_{LDM} = \mathbb{E}_{\mathcal{E}(x_{high}),\ \epsilon \sim \mathcal{N}(0,I),\ t}\!\left[\left\| \epsilon - \epsilon_\theta\!\left(z'_t,\, t,\, \hat{z}^{low}\right) \right\|^2\right] \tag{4}$$
where $z'_t$ represents the variable input to the reverse process and $\hat{z}^{low}$ represents the hidden variable of the low-resolution rain scene image predicted by the guiding model. Thus, the overall objective function of HRIGNet is
$$L_{HRIG} = L_{GDM} + L_{LDM} \tag{5}$$
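A compact training sketch corresponding to equation (5) might look as follows. As described in section 3.1, the two models are in practice trained in two stages (guiding model first, then frozen), so the code below is a schematic reading rather than a literal joint optimisation; `gdm_step` and `ldm_step` are hypothetical per-batch loss closures.

```python
# Two-stage training corresponding to L_HRIG = L_GDM + L_LDM (equation (5)).
import torch

def train_hrignet(gdm, ldm, gdm_step, ldm_step, gdm_loader, ldm_loader, lr=2e-6):
    opt_g = torch.optim.AdamW(gdm.parameters(), lr=lr)
    for batch in gdm_loader:                      # stage 1: minimise L_GDM
        opt_g.zero_grad()
        gdm_step(gdm, batch).backward()
        opt_g.step()

    gdm.requires_grad_(False)                     # freeze the guiding model
    opt_l = torch.optim.AdamW(ldm.parameters(), lr=lr)
    for batch in ldm_loader:                      # stage 2: minimise L_LDM with the frozen guide
        opt_l.zero_grad()
        ldm_step(ldm, gdm, batch).backward()
        opt_l.step()
```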
2.4 post-processing
Using HRIGNet, given a clear background image and a rain layer mask, the corresponding rain scene image can be obtained. For simplicity, it is assumed that regions without rain have the same color as the clear background image, ignoring the effect of weather such as fog; under normal conditions brighter rain streaks brighten the color at a location while darker rain streaks are invisible, so the color at a location never becomes darker than the background color but otherwise remains unchanged.
Therefore, after obtaining the rain scene image generated by the model, it is further post-processed: the generated rain scene image and the background image are mixed in the lighten blend mode, and the rain layer mask is used to keep the colors of the pixels outside the rain region unchanged, i.e.
$$x_{output} = M \odot \max\!\left(\hat{x}_{rain},\ x^{high}_{bg}\right) + (1 - M) \odot x^{high}_{bg}$$
where $M$ represents the rain mask layer, $\odot$ denotes element-wise multiplication, $\hat{x}_{rain}$ represents the generated rain scene image, and $x^{high}_{bg}$ represents the high-resolution background image.
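A minimal implementation of this post-processing step could look as follows, assuming all images are float tensors in [0, 1] and interpreting the "lighten" blend as an element-wise maximum.

```python
# Post-process the generated rainy image: lighten-blend with the background
# inside the rain mask, keep the clean background colour outside it.
import torch

def postprocess(rainy_gen, background, rain_mask):
    # rainy_gen, background: (B,3,H,W); rain_mask: (B,1,H,W) in [0,1]
    lighten = torch.maximum(rainy_gen, background)          # "lighten" blend mode
    return rain_mask * lighten + (1.0 - rain_mask) * background
```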
A comparison of the generated rain scene image (generated image) and the post-processed output image (output image) is shown in fig. 6; it can be seen that the post-processed image is visually more natural.
3. Effect verification
In the experimental part, the HRIGNet model is trained on the HRI dataset and compared quantitatively and visually with several image generation baseline models to verify its capability in high-resolution rain scene image synthesis. Meanwhile, ablation experiments are also performed on the guiding model and the diffusion backbone of HRIGNet to verify their effect.
3.1 training details
First, the Guiding Diffusion model is pre-trained with low-resolution images of size 256×256 based on $L_{GDM}$ of equation (3); then, with the Guiding Diffusion weights fixed, HRIGNet is trained with high-resolution images of size 512×512 based on $L_{LDM}$ of equation (4).
The AdamW optimizer is used in the training of both Guiding Diffusion and HRIGNet. When training Guiding Diffusion, the First Stage Model and the Cond Stage Model use the same VQGAN model ($\mathcal{E}_2$ in fig. 5), whose weights are pre-trained with the LDM vq-f8-256 model. The initial learning rate of the diffusion model is 2×10⁻⁶, the batch size is 4, the feature-map size of the UNet backbone is 32×32, and the number of model channels is 224.
When training HRIGNet, the First Stage Model and the Cond Stage Model use the same VQGAN model ($\mathcal{E}_1$ in fig. 5), whose weights are pre-trained with the LDM vq-f4 model. The initial learning rate of the diffusion model is 2×10⁻⁶, the batch size is 1, the feature-map size of the UNet backbone is 128×128, and the number of model channels is 224.
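These settings can be collected into a small configuration sketch; only the values explicitly stated above are taken from the text, and the dictionary structure itself is an assumption.

```python
# Training configuration as reported in section 3.1 (values from the text;
# the structure of this configuration is an assumption).
TRAIN_CONFIGS = {
    'guiding_diffusion': {
        'image_size': 256, 'unet_feature_size': 32, 'model_channels': 224,
        'batch_size': 4, 'lr': 2e-6, 'first_stage': 'VQGAN (LDM vq-f8-256 weights)',
    },
    'hrignet': {
        'image_size': 512, 'unet_feature_size': 128, 'model_channels': 224,
        'batch_size': 1, 'lr': 2e-6, 'first_stage': 'VQGAN (LDM vq-f4 weights)',
    },
}
```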
3.2 comparison with Baselines model
To verify the capability of HRIGNet in high-resolution rain scene image synthesis, the method of the application is compared with several image generation baselines: ASSET, LDM and DiT. The evaluation metrics are FID, LPIPS, SSIM and PSNR, and the specific model settings are given in the supplementary material. As shown by the experimental results in Table 2, the model proposed by the application achieves the best results on all of these metrics. The rain scene image synthesis results of the several methods are shown in fig. 7, from which it can be seen that the model of the application captures the illumination and color of the background environment well and maps them onto the generated rain layer, giving it a realistic color appearance consistent with the background image.
TABLE 2 index contrast for the Baselines and HRIGNet models at 512×512 resolution
Method Resolution FID↓ LPIPS↓ SSIM↑ PSNR↑
ASSET 512×512 330.918 0.255 0.792 23.562
LDM 512×512 166.874 0.243 0.784 22.932
DiT 512×512 279.663 0.342 0.719 19.810
HRIGNet 512×512 130.186 0.203 0.819 24.030
3.3 ablation experiments
The effect of using a diffusion model as the guiding model was evaluated through an ablation experiment, comparing the performance of HRIGNet when using Transformer and Diffusion models as the guiding model. As shown in Table 3, the model using Guiding Diffusion outperforms the model using Guiding Transformer on the FID, LPIPS and SSIM metrics. The diffusion model fits a simple Gaussian distribution at each time step, which converges easily and therefore achieves better results. This reasonably explains why the low-resolution image synthesis results with the guiding diffusion model are better.
TABLE 3 index contrast at 512X512 resolution for HRIGNet using different guiding models
Guiding Model Image resolution FID↓ LPIPS↓ SSIM↑ PSNR↑
Transformer 512×512 133.738 0.204 0.818 24.056
Diffusion 512×512 130.186 0.203 0.819 24.030
To explore the effect of different backbones in the diffusion model of HRIGNet, ablation experiments were also performed on the backbone, using UNet and Transformer respectively. As shown in Table 4, the HRIGNet results with the UNet backbone are better than with the Transformer backbone. The Transformer backbone model follows the scalable design of DiT, but performs poorly when the amount of parameters is insufficient. Furthermore, the UNet backbone adopted by the model of the application converges faster than the Transformer-based backbone.
TABLE 4 index contrast at 512×512 resolution for HRIGNet using different backbones
Backbone Image resolution FID↓ LPIPS↓ SSIM↑ PSNR↑
Transformer 512×512 217.182 0.263 0.780 22.469
UNet 512×512 130.186 0.203 0.819 24.030
Example 2
As a second aspect of the present application, the application also provides an electronic apparatus comprising: one or more processors; a memory for storing one or more programs; when the one or more programs are executed by the one or more processors, the one or more processors implement the rain scene image synthesis method learned from rendering as described above. In addition to the processor, memory and interface described above, any device with data processing capability in the embodiments may generally include other hardware according to its actual function, which will not be described herein.
Example 3
As a third aspect of the present application, there is also provided a computer-readable storage medium having stored thereon computer instructions which, when executed by a processor, implement a rain scene image synthesizing method as learned from rendering as described above. The computer readable storage medium may be an internal storage unit, such as a hard disk or a memory, of any of the data processing enabled devices described in any of the previous embodiments. The computer readable storage medium may also be an external storage device, such as a plug-in hard disk, a Smart Media Card (SMC), an SD Card, a Flash memory Card (Flash Card), or the like, provided on the device. Further, the computer readable storage medium may include both internal storage units and external storage devices of any device having data processing capabilities. The computer readable storage medium is used for storing the computer program and other programs and data required by the arbitrary data processing apparatus, and may also be used for temporarily storing data that has been output or is to be output.
The foregoing describes in detail preferred embodiments of the present application. It should be understood that numerous modifications and variations can be made in accordance with the concepts of the application by one of ordinary skill in the art without undue burden. Therefore, all technical solutions which can be obtained by logic analysis, reasoning or limited experiments based on the prior art by the person skilled in the art according to the inventive concept shall be within the scope of protection defined by the claims.

Claims (10)

1. A method of composition of a rain scene image learned from rendering, the method steps comprising:
a rendering stage, wherein a high-resolution paired rain scene-background image dataset is created, the dataset comprising a plurality of image pairs, each comprising a rain layer mask image, a background image and a rain scene image, under illumination conditions at different times of day;
a learning stage, wherein a guiding diffusion model is introduced into an implicit diffusion model, an autoencoder model is used to learn a hidden space perceptually equivalent to the image space, the encoder performs perceptual compression on the image, and hidden variables equivalent to the pixels of the image space are obtained in the hidden space; the forward and reverse processes of the diffusion model are completed in the hidden space, and the reverse process is constrained using two conditional mechanisms, cross-attention and tandem (concatenation); the output hidden variable is transformed into the image space by the decoder, generating a high-resolution rain scene image.
2. A method of composition of a rain scene image learned from rendering according to claim 1, wherein the rendering stage comprises the following specific steps:
creating a scene model using a modeling tool, and creating a raindrop model using a particle simulator; combining the scene model and the raindrop model to obtain a rain scene model; after setting the environment-related parameters, using a rendering engine to render a rain layer mask image, a background image and a rain scene image, forming an image pair.
3. A method of composition of a rain scene image learned from rendering according to claim 1, wherein the implicit diffusion model converts between the original data distribution and a Gaussian distribution by gradually adding noise to the data over a Markov chain of T time steps, the specific steps comprising:
the implicit diffusion model first trains a perceptually compressing autoencoder model, comprising an encoder $\mathcal{E}$ and a decoder $\mathcal{D}$, and compresses the image from the high-dimensional pixel space to a low-dimensional hidden space with the encoder;
in the forward process, noise is gradually added to the real data $x_0 \sim q(x_0)$:
$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right),\quad t = 1,\dots,T$$
where $\beta_1,\dots,\beta_T$ are hyperparameters and $x_1,\dots,x_T$ are hidden variables with the same dimension as the original data $x_0$;
an arbitrary time step is sampled by the reparameterization technique, $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$, where $\bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s)$ and $\epsilon \sim \mathcal{N}(0,I)$;
by training the implicit diffusion model, a reverse process is learned which applies a learnable Gaussian transformation $p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t,t),\ \Sigma_\theta(x_t,t)\right)$ to the variable $x_t$ for progressive denoising, where a neural network is used to predict the statistical characteristics $\mu_\theta$ and $\Sigma_\theta$ of $p_\theta$, $\mu_\theta$ denoting the mean vector and $\Sigma_\theta$ the covariance matrix of the Gaussian distribution; $\mu_\theta$ is reparameterized into a denoising network $\epsilon_\theta(x_t, t)$;
given a high-resolution rainy-day image $x_{high}$, the corresponding hidden variable encoded by the encoder is $z_{high}$, and the objective function of the implicit diffusion model LDM is expressed as:
$$L_{LDM} = \mathbb{E}_{\mathcal{E}(x_{high}),\ \epsilon \sim \mathcal{N}(0,I),\ t}\!\left[\left\| \epsilon - \epsilon_\theta\!\left(z_{high,t},\, t\right) \right\|^2\right]$$
where $\epsilon$ represents a variable sampled from a standard Gaussian distribution and $t$ is uniformly sampled from $\{1,\dots,T\}$.
4. A method of composition of a rain scene image learned from rendering according to claim 3, wherein the guiding diffusion model is trained on hidden variables of low-resolution images and roughly predicts the hidden variable of the low-resolution rain scene image; during high-resolution image generation, the hidden variable predicted by the guiding diffusion model is taken as a condition of the reverse process of the diffusion model and used to guide image generation; the specific steps comprise:
downsampling the input background image to obtain a low-resolution image $x^{low}_{bg}$, and combining the low-resolution image with the rain layer mask image to obtain a low-resolution masked image $x^{low}_{masked}$;
encoding the low-resolution image and the low-resolution masked image into the hidden space to obtain hidden variables $z^{low}_{bg}$ and $z^{low}_{masked}$, and inputting them into the guiding diffusion model to predict the hidden variable $\hat{z}^{low}$; the objective function of the guiding diffusion model GDM is expressed as:
$$L_{GDM} = \mathbb{E}_{z^{low},\ \epsilon \sim \mathcal{N}(0,I),\ t}\!\left[\left\| \epsilon - \epsilon_\theta\!\left(z^{low}_t,\, t,\, z^{low}_{bg},\, z^{low}_{masked}\right) \right\|^2\right]$$
where $z^{low}$ is the hidden variable of the real low-resolution rain scene image; the backbone of the guiding diffusion model is implemented with a UNet.
5. The method for synthesizing a rain scene image learned from rendering according to claim 4, wherein the specific steps of constraining the reverse process using the conditional mechanisms comprise:
predicting hidden variables of the low-resolution rain scene image by using a guide model;
based on a cross attention mechanism, enhancing the bottom backbone of the diffusion model by using hidden variables of the low-resolution rain scene image;
and synthesizing the high-resolution background image and the high-resolution rain layer mask into a high-resolution mask image, compressing the high-resolution mask image by an encoder to obtain hidden variables, and taking the hidden variables as conditions of a reverse process by a tandem mechanism.
6. A method of composition of a rain scene image learned from rendering according to claim 5, wherein the specific step of constraining based on the cross-attention mechanism comprises:
modeling the reverse process of the implicit diffusion model as a conditional distribution $p(z \mid y)$, and employing a conditional denoising network $\epsilon_\theta(z_t, t, y)$ to constrain the reverse process;
taking the low-resolution rain scene hidden variable $\hat{z}^{low}$ predicted by the guiding diffusion model as the condition, and mapping $\hat{z}^{low}$ to intermediate layers of the bottom-level UNet of the diffusion model through cross-attention layers, denoted as:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d}}\right)\!\cdot V,\quad Q = W_Q^{(i)}\,\varphi_i(z_t),\ K = W_K^{(i)}\,\hat{z}^{low},\ V = W_V^{(i)}\,\hat{z}^{low}$$
where $\varphi_i(z_t)$ denotes an intermediate representation of the UNet implementing $\epsilon_\theta$, and $W_Q^{(i)}, W_K^{(i)}, W_V^{(i)}$ are learnable projection matrices.
7. A method of composition of a rain scene image learned from rendering according to claim 5, wherein the specific step of constraining based on the concatenation scheme comprises:
combining the high-resolution background image and the rain layer mask into a masked image $x^{high}_{masked}$, compressing $x^{high}_{masked}$ into the hidden variable $z^{high}_{masked}$, and concatenating it with the input hidden variable as a condition of the reverse process; specifically, the variable input to the reverse process is $z'_t = \mathrm{concat}\!\left(z_t,\ z^{high}_{masked}\right)$;
based on the tandem and cross-attention condition mechanisms, the conditional implicit diffusion model LDM is learned through the following objective function:
$$L_{LDM} = \mathbb{E}_{\mathcal{E}(x_{high}),\ \epsilon \sim \mathcal{N}(0,I),\ t}\!\left[\left\| \epsilon - \epsilon_\theta\!\left(z'_t,\, t,\, \hat{z}^{low}\right) \right\|^2\right]$$
where $z'_t$ represents the variable input to the reverse process and $\hat{z}^{low}$ represents the hidden variable of the low-resolution rain scene image predicted by the guiding model.
8. The method for synthesizing a rain scene image learned from rendering according to claim 1, wherein after generating the rain scene image by the model, post-processing is performed on the generated rain scene image:
the generated rain scene image and the background image are mixed in the lighten blend mode, and the rain layer mask is used to keep the colors of the pixels outside the rain region unchanged:
$$x_{output} = M \odot \max\!\left(\hat{x}_{rain},\ x^{high}_{bg}\right) + (1 - M) \odot x^{high}_{bg}$$
where $M$ represents the rain mask layer, $\odot$ denotes element-wise multiplication, $\hat{x}_{rain}$ represents the generated rain scene image, and $x^{high}_{bg}$ represents the high-resolution background image.
9. A rain scene image synthesizing apparatus learned from rendering, comprising a memory, a processor, and a program stored in the memory, wherein the processor implements the rain scene image synthesizing method learned from rendering according to any one of claims 1 to 8 when executing the program.
10. A storage medium having a program stored thereon, wherein the program, when executed, implements the method of composition of a rain scene image learned from rendering as set forth by any one of claims 1 to 8.
CN202310781233.9A 2023-06-29 2023-06-29 Rain scene image synthesis method and device learned from rendering and storage medium Pending CN116777805A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310781233.9A CN116777805A (en) 2023-06-29 2023-06-29 Rain scene image synthesis method and device learned from rendering and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310781233.9A CN116777805A (en) 2023-06-29 2023-06-29 Rain scene image synthesis method and device learned from rendering and storage medium

Publications (1)

Publication Number Publication Date
CN116777805A true CN116777805A (en) 2023-09-19

Family

ID=87994423

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310781233.9A Pending CN116777805A (en) 2023-06-29 2023-06-29 Rain scene image synthesis method and device learned from rendering and storage medium

Country Status (1)

Country Link
CN (1) CN116777805A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117593469A (en) * 2024-01-17 2024-02-23 厦门大学 3D content creation method


Similar Documents

Publication Publication Date Title
CN110363716B (en) High-quality reconstruction method for generating confrontation network composite degraded image based on conditions
JP2020187753A (en) Training system for training generator neural network
US11599980B2 (en) Image transformation using interpretable transformation parameters
Paulin et al. Review and analysis of synthetic dataset generation methods and techniques for application in computer vision
CN116777805A (en) Rain scene image synthesis method and device learned from rendering and storage medium
CN115914505B (en) Video generation method and system based on voice-driven digital human model
CN114926553A (en) Three-dimensional scene consistency stylization method and system based on nerve radiation field
Yang et al. Training with augmented data: Gan-based flame-burning image synthesis for fire segmentation in warehouse
Zheng et al. Neural relightable participating media rendering
CN115170388A (en) Character line draft generation method, device, equipment and medium
CN112634456B (en) Real-time high-realism drawing method of complex three-dimensional model based on deep learning
Chen et al. Multi-view Pixel2Mesh++: 3D reconstruction via Pixel2Mesh with more images
CN116385667B (en) Reconstruction method of three-dimensional model, training method and device of texture reconstruction model
CN113128517A (en) Tone mapping image mixed visual feature extraction model establishment and quality evaluation method
CN116863053A (en) Point cloud rendering enhancement method based on knowledge distillation
Yurtsever et al. Photorealism in driving simulations: Blending generative adversarial image synthesis with rendering
Jo et al. Generative artificial intelligence and building design: early photorealistic render visualization of façades using local identity-trained models
CN115953524A (en) Data processing method and device, computer equipment and storage medium
CN114332321B (en) Dynamic face reconstruction method and device based on nerve texture
CN114882173A (en) 3D monocular hair modeling method and device based on implicit expression
Wang et al. Multi-priors guided dehazing network based on knowledge distillation
Jiang et al. Mask‐guided image person removal with data synthesis
Weiher Domain adaptation of HDR training data for semantic road scene segmentation by deep learning
Zhan et al. High-Quality Graphic Rendering and Synthesis of Design Graphics Using Generating Adversarial Networks
Poirier-Ginter et al. A Diffusion Approach to Radiance Field Relighting using Multi-Illumination Synthesis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination