CN112184575A - Image rendering method and device - Google Patents

Image rendering method and device

Info

Publication number
CN112184575A
Authority
CN
China
Prior art keywords
image
map
pixel point
pixel
resolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010971444.5A
Other languages
Chinese (zh)
Inventor
李超
陈濛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN202010971444.5A
Publication of CN112184575A
Priority to PCT/CN2021/115203

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 Image enhancement or restoration
    • G06T 5/70 Denoising; Smoothing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 15/00 3D [Three Dimensional] image rendering
    • G06T 15/10 Geometric effects
    • G06T 15/20 Perspective computation
    • G06T 15/205 Image-based rendering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 15/00 3D [Three Dimensional] image rendering
    • G06T 15/50 Lighting effects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 Image enhancement or restoration
    • G06T 5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/50 Depth or shape recovery
    • G06T 7/55 Depth or shape recovery from multiple images
    • G06T 7/593 Depth or shape recovery from multiple images from stereo images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/90 Determination of colour characteristics
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20212 Image combination
    • G06T 2207/20221 Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Graphics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Geometry (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Processing (AREA)

Abstract

The application provides an image rendering method and device, which can obtain a high-resolution, high-frame-rate rendered image under a low sampling value. The method comprises the following steps: acquiring three consecutive frames of images, namely a first image, a second image and a third image; updating the illumination map of the second image according to the first image to obtain an updated illumination map of the second image; inputting the updated illumination map of the second image into a hyper-resolution denoising network to obtain a hyper-resolution denoised image of the second image; updating the illumination map of the third image according to the second image to obtain an updated illumination map of the third image; inputting the updated illumination map of the third image into the hyper-resolution denoising network to obtain a hyper-resolution denoised image of the third image; acquiring an initial frame interpolation image at a target moment according to the hyper-resolution denoised image of the second image and the hyper-resolution denoised image of the third image, wherein the target moment is a moment between the second image and the third image; and inputting the initial frame interpolation image into a bidirectional frame interpolation network to obtain the frame interpolation image at the target moment.

Description

Image rendering method and device
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a method and an apparatus for image rendering.
Background
Ray tracing is a technology for generating or enhancing special visual effects in fields such as modern movies and games. By tracing each ray emitted from the camera, global illumination effects such as ambient occlusion, indirect reflection and diffuse reflection are realized, so that rendered frames can connect seamlessly with reality.
Currently, mainstream ray tracing technology falls into three modes: an offline mode, an interactive mode, and a real-time mode. The offline mode gives the best rendering effect but takes a long time; the interactive mode balances rendering effect and time; the real-time mode sacrifices part of the rendering effect to meet real-time requirements. Because movies are presented non-interactively, a movie can be rendered offline on a large number of servers during production, whereas a game requires real-time human-computer interaction, so game manufacturers can only compute each frame in a real-time rendering mode, and this real-time computation entails a huge calculation amount. In the ray tracing field, the ray sampling value of each pixel point directly influences the rendering effect: a high sampling value means a huge calculation amount, while a low sampling value keeps rendering time down but introduces a lot of noise, reducing the quality of the rendered picture.
In the prior art, an OptiX-based path tracing algorithm takes 70 ms to render a frame at 1 sample per pixel (spp) in a SponzaGlossy scene and 260 ms at 1 spp in a SanMiguel scene, which cannot meet the game industry's requirement of at most 16 ms of rendering time per frame. Therefore, to realize real-time rendering under limited hardware conditions, a low sampling value is used in combination with a noise reduction algorithm. Table 1 shows the optimization effect of existing noise reduction algorithms under low sampling values of 1 to 2 spp:
TABLE 1

Noise reduction algorithm | Sampling value (spp) | Resolution | Time consumed (ms) | Hardware
SBF                       | 1                    | 720P       | 7402               | Titan XP
AAF                       | 1                    | 720P       | 211                | Titan XP
LBF                       | 1                    | 720P       | 1550               | Titan XP
NFOR                      | 1                    | 720P       | 107~121            | Intel i7-7700HQ
AE                        | 1                    | 720P       | 54.9               | Titan XP
KPCN                      | 32                   | 1080P      | 12000              | Nvidia Quadro M6000
SVGF                      | 1                    | 720P       | 4~5                | Titan XP
In the above table, the Titan XP and Nvidia Quadro M6000 are high-performance graphics cards and the Intel i7-7700HQ is a high-performance CPU. Noise reduction algorithms such as SBF (SURE-based filtering), AAF (axis-aligned filtering for soft shadows), LBF (learning-based filtering), NFOR (nonlinearly weighted first-order regression) and AE (interactive reconstruction of Monte Carlo image sequences using a recurrent denoising autoencoder) obtain rendered images at only 720P resolution under a low sampling value and take too long to meet the time requirement. The KPCN (kernel-predicting convolutional networks) denoising algorithm needs a higher sampling value to obtain a rendered image at the higher resolution of 1080P and is also far too slow. The spatiotemporal variance-guided filtering (SVGF) noise reduction algorithm obtains 720P rendered images under a low sampling value within the time requirement, but 720P resolution cannot guarantee the fluency of a game. Therefore, as can be seen from the above table, existing real-time ray tracing technology still suffers from a large calculation amount and high hardware requirements: achieving a good rendering effect under a low sampling value makes rendering time-consuming and cannot meet the time budget of game rendering. It is therefore important to obtain a real-time rendering effect with a high frame rate and high resolution without increasing hardware cost.
Disclosure of Invention
The application provides an image rendering method and device, which can obtain a high-resolution, high-frame-rate rendered image under a low sampling value.
In a first aspect, a method for image rendering is provided, the method comprising: acquiring a first image, a second image and a third image, wherein the first image, the second image and the third image are three consecutive frames of images; updating the illumination map of the second image according to the first image to obtain an updated illumination map of the second image; inputting the updated illumination map of the second image into a hyper-resolution denoising network to obtain a hyper-resolution denoised image of the second image; updating the illumination map of the third image according to the second image to obtain an updated illumination map of the third image; inputting the updated illumination map of the third image into the hyper-resolution denoising network to obtain a hyper-resolution denoised image of the third image; acquiring an initial frame interpolation image at a target moment according to the hyper-resolution denoised image of the second image and the hyper-resolution denoised image of the third image, wherein the target moment is a moment between the second image and the third image; and inputting the initial frame interpolation image into a bidirectional frame interpolation network to obtain the frame interpolation image at the target moment.
The image rendering method of the embodiment of the application can process the image with a low sampling value (such as 1spp), so that the requirement on hardware equipment is greatly reduced; according to the image rendering method, the color values of the pixel points are accumulated, the illumination map of the low sampling value is updated, and the noise problem caused by insufficient sampling information amount can be solved; the image rendering method of the embodiment of the application uses the super-resolution denoising network to process the image, so that the resolution of the image can be improved; the image rendering method provided by the embodiment of the application performs frame interpolation on two continuous frames of images, and uses a bidirectional frame interpolation network to process the images of the frame interpolation, so that the frame rate of the images is improved, and the fluency of image rendering is ensured.
With reference to the first aspect, in some implementations of the first aspect, updating the illumination map of the second image according to the first image to obtain an updated illumination map of the second image includes: acquiring the illumination map of the second image, wherein the illumination map of the second image comprises color values of a plurality of pixel points, and the illumination map of the second image is a direct illumination map or an indirect illumination map; acquiring a second pixel point corresponding to a first pixel point in the first image, wherein the first pixel point is any one of the plurality of pixel points; and updating the color value of the first pixel point according to the color value of the first pixel point and the color value of the second pixel point, so as to obtain the updated illumination map.
Aiming at the problem of serious noise caused by a low sampling value, the image rendering method provided by the embodiment of the application updates the color value of each pixel point by accumulating time domain information and combining historical color information of each pixel point, so that the illumination map of the image is updated, the shortage of the sampling value is made up, and the rendering noise is reduced.
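As an illustration of this temporal accumulation, the following sketch (not part of the patent) blends each pixel of the current low-sampling-value illumination map with its reprojected counterpart in the previous frame. The function and parameter names, the motion-vector layout and the blend coefficient are assumptions.

```python
import numpy as np

def update_illumination_map(curr_light, prev_light, motion_vectors, blend=0.2):
    """Temporal-accumulation sketch: blend each pixel of the current (noisy,
    low-spp) illumination map with its counterpart in the previous frame.
    motion_vectors[y, x] is assumed to hold the (dx, dy) offset from the
    current pixel to its position in the previous frame."""
    h, w = curr_light.shape[:2]
    updated = curr_light.copy()
    for y in range(h):
        for x in range(w):
            dx, dy = motion_vectors[y, x]
            # nearest sample here; subpixel positions would use the
            # bilinear interpolation sketched further below
            ix, iy = int(round(x + dx)), int(round(y + dy))
            if 0 <= ix < w and 0 <= iy < h:
                prev_color = prev_light[iy, ix]
                # first coefficient * current color + second coefficient * history color
                updated[y, x] = blend * curr_light[y, x] + (1.0 - blend) * prev_color
    return updated
```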
With reference to the first aspect, in some implementations of the first aspect, when the position of the second pixel point is not on a grid node of the first image, the method further includes: obtaining color values of four pixel points closest to the second pixel point, wherein the four pixel points are on grid nodes of the first image; and obtaining the color value of the second pixel point according to the color values of the four pixel points.
The image rendering method of the embodiment of the application takes into account that a pixel point of the current frame, when mapped back into the previous frame, may not fall on a grid node of the previous frame. In that case the color value at that position cannot be read directly, so it is obtained by a bilinear interpolation method.
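A minimal sketch of the bilinear interpolation referred to above (illustrative, not the patent's code): it recovers the value at a subpixel position from the four surrounding grid pixels.

```python
import numpy as np

def bilinear_sample(image, x, y):
    """Bilinear-interpolation sketch: recover the value at a subpixel position
    (x, y) from the four nearest grid pixels. Assumes 0 <= x <= W-1 and
    0 <= y <= H-1."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1 = min(x0 + 1, image.shape[1] - 1)
    y1 = min(y0 + 1, image.shape[0] - 1)
    fx, fy = x - x0, y - y0                      # fractional offsets in [0, 1)
    top    = (1 - fx) * image[y0, x0] + fx * image[y0, x1]
    bottom = (1 - fx) * image[y1, x0] + fx * image[y1, x1]
    return (1 - fy) * top + fy * bottom
```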
With reference to the first aspect, in some implementations of the first aspect, before updating the color value of the first pixel point according to the color value of the first pixel point and the color value of the second pixel point, the method further includes: judging the consistency of the first pixel point and the second pixel point, wherein judging the consistency of the first pixel point and the second pixel point includes: acquiring a depth value of the first pixel point, a normal magnitude of the first pixel point, a patch ID of the first pixel point, a depth value of the second pixel point, a normal magnitude of the second pixel point and a patch ID of the second pixel point; the first pixel point and the second pixel point are judged to be consistent when the square of the difference between the depth value of the first pixel point and the depth value of the second pixel point is smaller than a first threshold, the square of the difference between the normal magnitude of the first pixel point and the normal magnitude of the second pixel point is smaller than a second threshold, and the patch ID of the first pixel point is equal to the patch ID of the second pixel point.
In order to ensure that the second pixel point is indeed the pixel point corresponding to the first pixel point in the first image, before updating the color value of the first pixel point with the color value of the second pixel point, the method of the embodiment of the application further performs a consistency judgment on the first pixel point and the second pixel point. Similarly, if the position of the second pixel point corresponding to the first pixel point is not on a grid node of the first image, its depth value, normal magnitude and patch ID cannot be read directly; as with the color value above, they are obtained by the bilinear interpolation algorithm.
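The consistency judgment can be expressed as a simple predicate; the dictionary layout and the threshold values below are illustrative assumptions.

```python
def is_consistent(p_curr, p_prev, depth_eps=1e-2, normal_eps=1e-1):
    """Consistency-check sketch: accept the history sample only if the depth
    values, normal magnitudes and patch IDs agree, per the three conditions
    stated above. Thresholds and the dict keys are assumptions."""
    return ((p_curr["depth"] - p_prev["depth"]) ** 2 < depth_eps and
            (p_curr["normal_mag"] - p_prev["normal_mag"]) ** 2 < normal_eps and
            p_curr["patch_id"] == p_prev["patch_id"])
```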
In combination with the first aspect, in some implementations of the first aspect, updating the color value of the first pixel point according to the color value of the first pixel point and the color value of the second pixel point includes: the color value of the first pixel after updating is the sum of the color value of the first pixel multiplied by the first coefficient and the color value of the second pixel multiplied by the second coefficient.
The image rendering method of the embodiment of the application provides a color value of a first pixel point and a color value of a second pixel point, and the method for updating the color value of the first pixel point, wherein a first coefficient and a second coefficient are preset values.
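Written as a formula, with C(p1) and C(p2) the color values of the first and second pixel points and w1, w2 the preset first and second coefficients (the patent does not fix their values; temporal-accumulation schemes commonly choose w1 + w2 = 1, but that is an assumption here):

```latex
C_{\text{updated}}(p_1) = w_1 \, C(p_1) + w_2 \, C(p_2)
```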
With reference to the first aspect, in some implementations of the first aspect, inputting the illumination map after the second image update into a hyper-resolution denoising network, further includes: acquiring a depth map of the second image and a normal vector map of the second image; fusing the depth map of the second image, the normal vector map of the second image and the updated illumination map of the second image to obtain a first fusion result; and inputting the first fusion result into a hyper-resolution denoising network.
In order to compensate for the problem that the motion vector estimation of the non-rigid motion and the shadow motion is inaccurate, the image rendering method according to the embodiment of the application further includes performing feature fusion on the depth map, the normal vector map and the updated illumination map.
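A sketch of the feature fusion described above, assuming the fusion is a channel-wise concatenation of the three maps (the patent does not specify the fusion operator) and hypothetical channel counts:

```python
import numpy as np

def fuse_inputs(depth_map, normal_map, updated_light_map):
    """Feature-fusion sketch: stack the depth map (H x W x 1), normal-vector
    map (H x W x 3) and updated illumination map (H x W x 3) along the
    channel axis before feeding the result to the denoising network."""
    return np.concatenate([depth_map, normal_map, updated_light_map], axis=-1)

# fused = fuse_inputs(depth, normals, updated_light)   # shape (H, W, 7)
# sr_denoised = sr_denoise_net(fused)                  # hypothetical network call
```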
With reference to the first aspect, in some implementations of the first aspect, obtaining the initial interpolated frame image at the target time according to the super-resolution denoised image of the second image and the super-resolution denoised image of the third image includes: acquiring a motion vector from a third image to a second image; determining a first motion vector from the initial frame interpolation image to the second image at the target moment and a second motion vector from the initial frame interpolation image to the third image at the target moment according to the motion vectors from the third image to the second image; and obtaining an initial frame interpolation image at the target moment according to the super-resolution de-noised image of the second image, the super-resolution de-noised image of the third image, the first motion vector and the second motion vector.
The image rendering method provided by the embodiment of the application performs frame interpolation on two continuous frames of images, so that the frame rate of image rendering is improved.
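The following sketch illustrates one way to realize this step under a linear-motion assumption: the motion vectors from the target moment to the second and third images are scaled copies of the motion vector from the third image to the second image, and the two super-resolution de-noised images are warped toward the target moment and averaged. The function names, the warping scheme and t = 0.5 are assumptions, not the patent's implementation.

```python
import numpy as np

def backward_warp(image, flow):
    """Nearest-neighbour backward warp: out[y, x] = image[(y, x) + flow[y, x]]."""
    h, w = image.shape[:2]
    out = np.zeros_like(image)
    for y in range(h):
        for x in range(w):
            sx = int(round(x + flow[y, x, 0]))
            sy = int(round(y + flow[y, x, 1]))
            if 0 <= sx < w and 0 <= sy < h:
                out[y, x] = image[sy, sx]
    return out

def initial_interpolation(denoised_2, denoised_3, mv_3_to_2, t=0.5):
    """Initial frame-interpolation sketch: scale the third->second motion
    vectors to obtain the first motion vector (target -> second image) and
    the second motion vector (target -> third image), warp both denoised
    frames toward the target moment and average them (linear motion assumed;
    the vector field is indexed at the target pixel as an approximation)."""
    mv_t_to_2 = t * mv_3_to_2            # first motion vector
    mv_t_to_3 = -(1.0 - t) * mv_3_to_2   # second motion vector
    warped_2 = backward_warp(denoised_2, mv_t_to_2)
    warped_3 = backward_warp(denoised_3, mv_t_to_3)
    return 0.5 * (warped_2 + warped_3)
```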
With reference to the first aspect, in some implementations of the first aspect, inputting the initial frame interpolation image into a bidirectional frame interpolation network further includes: acquiring a depth map of the second image, a normal vector map of the second image, a depth map of the third image and a normal vector map of the third image; fusing the depth map of the second image, the normal vector map of the second image, the depth map of the third image, the normal vector map of the third image and the initial frame interpolation image to obtain a second fusion result; and inputting the second fusion result into the bidirectional frame interpolation network.
In order to compensate for the problem that the motion vector estimation of the non-rigid motion and the shadow motion is inaccurate, the image rendering method according to the embodiment of the application further includes performing feature fusion on the depth map, the normal vector map and the initial frame interpolation image.
With reference to the first aspect, in some implementations of the first aspect, the hyper-resolution denoising network is a pre-trained neural network model, and the training of the hyper-resolution denoising network includes: acquiring a plurality of groups of super-component denoising original training data, wherein each group of super-component denoising original training data in the plurality of groups of super-component denoising original training data comprises two continuous frames of images and a standard image corresponding to a next frame of image in the two continuous frames of images; judging whether pixel points of two continuous frames of images accord with consistency; acquiring a depth map of a next frame image, a normal vector map of the next frame image and an illumination map of the next frame image in two continuous frames of images, wherein the illumination map of the next frame image is a direct illumination map or an indirect illumination map; updating the color value of a pixel point of the next frame of image according to the two continuous frames of images to obtain an updated illumination map of the next frame of image; fusing the depth map of the next frame image, the normal vector map of the next frame image and the updated illumination map of the next frame image to obtain an updated image; and training the hyper-resolution denoising network according to the updated image and the standard image.
The embodiment of the application also provides a training method for the hyper-resolution denoising network, in which low-sampling-value images (for example, 1 spp) are obtained, the rendering result of the next frame under a high sampling value (for example, 4096 spp) is used as the standard image, and the neural network is trained against it.
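A minimal training-loop sketch under these assumptions (PyTorch; the L1 loss, Adam optimizer and the dataloader pairing a fused 1 spp input with its 4096 spp standard image are choices of this sketch, not stated in the patent):

```python
import torch
import torch.nn as nn

def train_sr_denoise_net(net, dataloader, epochs=10, lr=1e-4):
    """Training-loop sketch for the hyper-resolution denoising network: each
    batch pairs a fused low-spp input (depth + normals + accumulated light)
    with a high-spp reference rendering at the target resolution."""
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    loss_fn = nn.L1Loss()
    for _ in range(epochs):
        for fused_input, reference in dataloader:
            opt.zero_grad()
            loss = loss_fn(net(fused_input), reference)  # compare to standard image
            loss.backward()
            opt.step()
    return net
```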
With reference to the first aspect, in some implementations of the first aspect, the bidirectional frame interpolation network is a pre-trained neural network model, and the training of the bidirectional frame interpolation network includes: acquiring a plurality of groups of original training data of the bidirectional interpolation frame, wherein each group of original training data of the bidirectional interpolation frame in the plurality of groups of original training data of the bidirectional interpolation frame comprises a fourth image, a fifth image and a sixth image, and the fourth image, the fifth image and the sixth image are continuous three-frame images; acquiring an interpolation image at the intermediate moment of the fourth image and the sixth image according to the fourth image and the sixth image; and training the bidirectional frame interpolation network according to the frame interpolation image and the fifth image at the intermediate moment.
The embodiment of the application also provides a training method of the bidirectional frame interpolation network, which comprises the steps of obtaining a fourth image, a fifth image and a sixth image which are continuous, performing frame interpolation on the fourth image and the sixth image to obtain an initial frame interpolation result, and then training the neural network by taking the fifth image as a standard.
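A single training-step sketch for this procedure (PyTorch; the loss and the way the initial interpolation is produced, e.g. with the warping sketch above, are assumptions):

```python
import torch
import torch.nn as nn

def interp_training_step(net, optimizer, img4, img6, img5, initial_interp_fn):
    """Bidirectional frame-interpolation training sketch: build the initial
    mid-frame from the fourth and sixth images, refine it with the network,
    and supervise against the real fifth frame."""
    initial = initial_interp_fn(img4, img6)          # coarse mid-frame at t = 0.5
    refined = net(initial)
    loss = nn.functional.l1_loss(refined, img5)      # fifth image as the standard
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```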
In a second aspect, an apparatus for image rendering is provided, the apparatus comprising: the device comprises an acquisition module, a processing module and a display module, wherein the acquisition module is used for acquiring a first image, a second image and a third image, and the first image, the second image and the third image are continuous three-frame images; the processing module is used for updating the illumination map of the second image according to the first image so as to obtain the updated illumination map of the second image; the processing module is further used for inputting the illumination pattern after the second image is updated into the hyper-resolution denoising network to obtain a hyper-resolution denoising image of the second image; the processing module is further configured to update the illumination map of the third image according to the second image to obtain an updated illumination map of the third image; the processing module is further used for inputting the illumination pattern after the third image is updated into the hyper-resolution denoising network to obtain a hyper-resolution denoising image of the third image; the processing module is further used for acquiring an initial frame interpolation image at a target moment according to the super-resolution de-noised image of the second image and the super-resolution de-noised image of the third image, wherein the target moment is a moment between the second image and the third image; the processing module is also used for inputting the initial frame interpolation image into the bidirectional frame interpolation network so as to obtain the frame interpolation image at the target moment.
With reference to the second aspect, in some implementations of the second aspect, the updating, by the processing module, the illumination map of the second image according to the first image to obtain an updated illumination map of the second image includes: acquiring a light map of a second image, wherein the light map of the second image comprises color values of a plurality of pixel points, and the light map of the second image is a direct light map or an indirect light map; acquiring a second pixel point corresponding to a first pixel point in a first image, wherein the first pixel point is any one of a plurality of pixel points; and updating the color value of the first pixel point according to the color value of the first pixel point and the color value of the second pixel point so as to obtain an updated illumination map.
With reference to the second aspect, in some implementations of the second aspect, when the position of the second pixel point is not on a grid node of the first image, the processing module is further configured to: obtaining color values of four pixel points closest to the second pixel point; and obtaining the color value of the second pixel point according to the color values of the four pixel points.
In combination with the second aspect, in some implementations of the second aspect, before the processing module updates the color value of the first pixel point according to the color value of the first pixel point and the color value of the second pixel point, the processing module is further configured to: judging the consistency of the first pixel point and the second pixel point, wherein the judgment of the consistency of the first pixel point and the second pixel point comprises the following steps: acquiring a depth value of a first pixel point, a normal magnitude of the first pixel point, a patch ID of the first pixel point, a depth value of a second pixel point, a normal magnitude of the second pixel point and a patch ID of the second pixel point; the square of the difference between the depth value of the first pixel point and the depth value of the second pixel point is smaller than a first threshold value; the square of the difference between the normal magnitude of the first pixel point and the normal magnitude of the second pixel point is smaller than a second threshold value; the patch ID of the first pixel point is equal to the patch ID of the second pixel point.
In combination with the second aspect, in some implementations of the second aspect, updating the color value of the first pixel point according to the color value of the first pixel point and the color value of the second pixel point includes: the color value of the first pixel after updating is the sum of the color value of the first pixel multiplied by the first coefficient and the color value of the second pixel multiplied by the second coefficient.
With reference to the second aspect, in some implementations of the second aspect, the inputting, by the processing module, the illumination map after the second image update into the hyper-resolution denoising network further includes: acquiring a depth map of the second image and a normal vector map of the second image;
fusing the depth map of the second image, the normal vector map of the second image and the updated illumination map of the second image to obtain a first fusion result; and inputting the first fusion result into a hyper-resolution denoising network.
With reference to the second aspect, in some implementation manners of the second aspect, the acquiring, by a processing module, an initial frame interpolation image at a target time according to the super-resolution denoised image of the second image and the super-resolution denoised image of the third image includes: acquiring a motion vector from a third image to a second image; determining a first motion vector from the initial frame interpolation image to the second image at the target moment and a second motion vector from the initial frame interpolation image to the third image at the target moment according to the motion vectors from the third image to the second image; and determining an initial frame interpolation image at the target moment according to the super-resolution de-noised image of the second image, the super-resolution de-noised image of the third image, the first motion vector and the second motion vector.
With reference to the second aspect, in some implementations of the second aspect, the processing module inputs the initial frame interpolation image into a bidirectional frame interpolation network, further including: acquiring a depth map of the second image, a normal vector map of the second image, a depth map of the third image and a normal vector map of the third image; fusing the depth map of the second image, the normal vector map of the second image, the depth map of the third image, the normal vector map of the third image and the initial frame interpolation image to obtain a second fusion result; and inputting the second fusion result into the bidirectional frame interpolation network.
With reference to the second aspect, in some implementations of the second aspect, the hyper-resolution denoising network is a pre-trained neural network model, and the training of the hyper-resolution denoising network includes: acquiring a plurality of groups of super-component denoising original training data, wherein each group of super-component denoising original training data in the plurality of groups of super-component denoising original training data comprises two continuous frames of images and a standard image corresponding to a next frame of image in the two continuous frames of images; judging whether pixel points of two continuous frames of images accord with consistency; acquiring a depth map of a next frame image, a normal vector map of the next frame image and an illumination map of the next frame image in two continuous frames of images, wherein the illumination map of the next frame image is a direct illumination map or an indirect illumination map; updating the color value of a pixel point of the next frame of image according to the two continuous frames of images to obtain an updated illumination map of the next frame of image; fusing the depth map of the next frame image, the normal vector map of the next frame image and the updated illumination map of the next frame image to obtain an updated image; and training the hyper-resolution denoising network according to the updated image and the standard image.
With reference to the second aspect, in some implementations of the second aspect, the bidirectional frame interpolation network is a pre-trained neural network model, and the training of the bidirectional frame interpolation network includes: acquiring a plurality of groups of original training data of the bidirectional interpolation frame, wherein each group of original training data of the bidirectional interpolation frame in the plurality of groups of original training data of the bidirectional interpolation frame comprises a fourth image, a fifth image and a sixth image, and the fourth image, the fifth image and the sixth image are continuous three-frame images; acquiring an interpolation image at the intermediate moment of the fourth image and the sixth image according to the fourth image and the sixth image; and training the bidirectional frame interpolation network according to the frame interpolation image and the fifth image at the intermediate moment.
In a third aspect, an apparatus for image rendering is provided, the apparatus comprising: a memory for storing a program; a processor for executing the program stored in the memory, wherein when the program stored in the memory is executed by the processor, the processor performs part or all of the operations in any of the manners of the first aspect.
In a fourth aspect, an electronic device is provided, which includes the apparatus for image rendering in any one of the manners of the second aspect.
In a fifth aspect, a computer-readable storage medium is provided, which stores a computer program executable by a processor, and when the computer program is executed by the processor, the processor performs part or all of the operations in any one of the manners of the first aspect.
In a sixth aspect, a chip is provided, which includes a processor configured to perform some or all of the operations of the method described in the first aspect.
In a seventh aspect, there is provided a computer program or computer program product comprising computer readable instructions which, when executed by a processor, cause the processor to perform some or all of the operations of any one of the above-described first aspects.
Drawings
FIG. 1 is a schematic block diagram of ray tracing and rasterization in an embodiment of the present application;
FIG. 2 is a schematic block diagram of a U-Net neural network according to an embodiment of the present application;
FIG. 3 is a schematic block diagram of an electronic device of an embodiment of the present application;
FIG. 4 is a schematic block diagram of a system architecture of a conventional image rendering method based on a real-time ray tracing technology according to an embodiment of the present application;
FIG. 5 is a schematic block diagram of a system architecture of an image rendering method based on a real-time ray tracing technology according to an embodiment of the present application;
FIG. 6 is a schematic flow chart of an image rendering method of an embodiment of the present application;
FIG. 7 is a schematic flow chart diagram of training of a hyper-resolution denoising network according to an embodiment of the present application;
FIG. 8 is a schematic flow chart diagram of the training of a bidirectional frame insertion network of an embodiment of the present application;
FIG. 9 is a schematic block diagram of an image rendering method of an embodiment of the present application;
FIG. 10 is a schematic flow chart diagram of acquiring a data set according to an embodiment of the present application;
FIG. 11 is a schematic block diagram of a rasterization process of an embodiment of the present application;
FIG. 12 is a schematic block diagram illustrating an embodiment of the present application for obtaining parameters of a pixel point of a previous frame by bilinear interpolation;
FIG. 13 is a schematic block diagram illustrating an embodiment of the present application for performing hyper-resolution denoising on an image using a hyper-resolution denoising network;
FIG. 14 is a schematic block diagram of an embodiment of the present application for processing an image using a bidirectional framing network;
FIG. 15 is a schematic block diagram of an apparatus for image rendering according to an embodiment of the present application;
FIG. 16 is a schematic block diagram of an apparatus for training a hyper-resolution denoising network according to an embodiment of the present application;
FIG. 17 is a schematic block diagram of an apparatus for training of a bidirectional framing network of an embodiment of the present application;
fig. 18 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The terminology used in the following examples is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of this application and the appended claims, the singular forms "a", "an" and "the" are intended to include the plural forms as well, such as "one or more", unless the context clearly indicates otherwise. It should also be understood that in the following embodiments of the present application, "at least one" and "one or more" mean one, two or more. The term "and/or" describes an association relationship between associated objects and means that three relationships may exist; for example, A and/or B may represent: A alone, both A and B, and B alone, where A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
Reference throughout this specification to "one embodiment" or "some embodiments," or the like, means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," or the like, in various places throughout this specification are not necessarily all referring to the same embodiment, but rather "one or more but not all embodiments" unless specifically stated otherwise. The terms "comprising," "including," "having," and variations thereof mean "including, but not limited to," unless expressly specified otherwise.
In order to facilitate understanding of the technical solutions of the present application, first, concepts related to the present application are briefly introduced.
Ray tracing: a special rendering algorithm in three-dimensional computer graphics emits a ray from a viewpoint to pass through each pixel point on a viewing plane and continuously performs intersection judgment of the ray and an object, and simultaneously renders a three-dimensional scene by considering optical phenomena such as reflection and refraction.
Global Illumination (GI): the method is a rendering technology which considers direct illumination from a light source in a scene and indirect illumination reflected by other objects in the scene, and shows the comprehensive effect of the direct illumination and the indirect illumination.
Rasterization: the method is a process for converting vertex data into fragments, has the function of converting a graph into an image formed by grids, and is a process for converting a geometric primitive into a two-dimensional image.
Rigid body: a solid body of limited size with negligible deformation. The distance between the mass points in the rigid body is not changed whether the external force is applied or not.
And (3) image super-segmentation: i.e., super-resolution of images, and reconstructing an input low-resolution image into a high-resolution image.
Deep learning: the branch of machine learning is an algorithm which takes an artificial neural network as a framework and performs characterization learning on data.
The technical solution in the present application will be described below with reference to the accompanying drawings.
FIG. 1 shows a schematic block diagram of ray tracing and rasterization in an embodiment of the present application. Ray tracing and rasterization are rendering techniques that aim to project objects in three-dimensional space, via computational shading, onto a two-dimensional screen space for display. Ray tracing and rasterization are different in that ray tracing rendering is to calculate where the rays hit the graphics (such as the triangle shown in fig. 1) and then calculate the texel colors at these positions, assuming that each point on the screen is a forward ray; and the rasterization rendering is to perform coordinate transformation on the vertex of a graph (such as a triangle shown in fig. 1) and fill a texture in the interior of the triangle on a two-dimensional screen. Compared with rasterization, the calculation amount of ray tracing is larger, but the rendering effect is more real, and the image rendering method is based on ray tracing.
Next, U-Net, a convolutional neural network (CNN) used in the image rendering method of the embodiment of the present application, is introduced. U-Net was initially applied to medical image segmentation tasks and, owing to its good performance, was later widely applied to various segmentation tasks. U-Net supports training models on a small amount of data, obtains higher segmentation accuracy by classifying each pixel point, and segments images quickly using the trained model. FIG. 2 shows a schematic block diagram of U-Net, which is briefly described below in conjunction with FIG. 2. The left side of the network is the encoder part, which downsamples the input by max pooling; the right side is the decoder part, which upsamples the output of the encoder to restore the resolution; the middle consists of skip connections (skip-connect) for feature fusion. Since the entire network structure is shaped like a "U", it is called U-Net.
The upsampling and the downsampling can increase robustness to small disturbances of the input image, such as image translation, rotation and the like, reduce the risk of overfitting, reduce the amount of computation and increase the size of a receptive field. The effect of up-sampling is to restore and decode the abstract features to the size of the original image, and finally obtain a clear and noiseless image.
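For concreteness, a minimal PyTorch sketch of such a U-shaped network follows; the channel counts, depth and layer choices are illustrative only and are not the network actually used in the embodiments.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Minimal U-Net sketch matching the structure described above: an encoder
    with max-pooling downsampling, a decoder with upsampling, and a skip
    connection for feature fusion."""
    def __init__(self, in_ch=7, out_ch=3):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)                       # downsampling by max pooling
        self.mid = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.dec = nn.Sequential(nn.Conv2d(64 + 32, 32, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(32, out_ch, 3, padding=1))

    def forward(self, x):
        e = self.enc(x)                                   # encoder features
        m = self.mid(self.down(e))                        # bottleneck at half resolution
        u = self.up(m)                                    # restore resolution
        return self.dec(torch.cat([u, e], dim=1))         # skip connection (feature fusion)
```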
The method for rendering images in the embodiment of the application can be executed by an electronic device. The electronic device may be a mobile terminal (e.g., a smartphone), a computer, a personal digital assistant, a wearable device, an in-vehicle device, an internet-of-things device, or other device capable of image rendering. The electronic device may be a device running an android system, an IOS system, a windows system, and other systems.
The graphics rendering method according to the embodiment of the application may be executed by an electronic device, and a specific structure of the electronic device may be as shown in fig. 3, and the specific structure of the electronic device is described in detail below with reference to fig. 3.
In one embodiment, as shown in FIG. 3, an electronic device 300 may comprise: a Central Processing Unit (CPU)301, a Graphics Processing Unit (GPU)302, a display device 303, and a memory 304. Optionally, the electronic device 300 may further include at least one communication bus 310 (not shown in fig. 3) for implementing connection communication among the components.
It should be understood that the various components of electronic device 300 may also be coupled by other connectors, which may include various types of interfaces, transmission lines, or buses, etc. The various components in the electronic device 300 may also be in a processor 301 centric radial connection. In various embodiments of the present application, coupled means connected or communicated electrically with each other, including directly or indirectly through other devices.
There are various connection modes between the cpu 301 and the graphic processor 302, and the connection modes are not limited to the connection mode shown in fig. 3. The cpu 301 and the gpu 302 in the electronic device 300 may be located on the same chip or may be separate chips.
The roles of the central processor 301, graphics processor 302, display device 303 and memory 304 will be briefly described below.
The central processor 301: for running an operating system 305 and application programs 307. The application 307 may be a graphics-class application such as a game or a video player. The operating system 305 provides a system graphics library interface through which the application 307, via drivers provided by the operating system 305 such as a graphics library user-mode driver and/or a graphics library kernel-mode driver, generates the instruction stream for rendering graphics or image frames together with the associated rendering data required. The system graphics library includes but is not limited to: OpenGL for embedded systems (OpenGL ES), the Khronos platform graphics interface, or Vulkan (a cross-platform drawing application program interface). The instruction stream contains a series of instructions, typically call instructions to the system graphics library interface.
Alternatively, the central processor 301 may include at least one of the following types of processors: an application processor, one or more microprocessors, a Digital Signal Processor (DSP), a microcontroller unit (MCU), or an artificial intelligence processor, among others.
The cpu 301 may further include necessary hardware accelerators, such as an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), or an integrated circuit for implementing logic operations. The processor 301 may be coupled to one or more data buses for transferring data and instructions between the various components of the electronic device 300.
The graphics processor 302: for receiving the graphics instruction stream sent by the central processor 301, generating a render target through a rendering pipeline (pipeline), and displaying the render target on the display device 303 through the layer composition and display module of the operating system.
Alternatively, graphics processor 302 may comprise a general-purpose graphics processor, such as a GPU or other type of special-purpose graphics processing unit that executes software.
The display device 303: for displaying various images generated by the electronic device 300, which may be a Graphical User Interface (GUI) of an operating system or image data (including still images and video data) processed by the graphics processor 302.
Alternatively, display device 303 may include any suitable type of display screen. Such as a Liquid Crystal Display (LCD) or a plasma display, or an organic light-emitting diode (OLED) display.
The memory 304, which is a transmission channel between the cpu 301 and the gpu 302, may be a double data rate synchronous dynamic random access memory (DDR SDRAM) or other types of cache.
The rendering pipeline is a series of operations that graphics processor 302 sequentially performs in rendering a graphics or image frame, typical operations including: vertex Processing (Vertex Processing), Primitive Processing (primative Processing), Rasterization (Rasterization), Fragment Processing (Fragment Processing), and the like.
In the method for rendering graphics according to the embodiment of the present application, a three-dimensional coordinate is converted into a two-dimensional coordinate, and a brief description of related basic concepts is provided below.
The process of converting three-dimensional coordinates to two-dimensional coordinates may involve 5 different coordinate systems.
Local Space (alternatively referred to as Object Space);
world Space (World Space);
view Space (otherwise known as Eye Space);
clip Space (Clip Space);
screen Space (Screen Space).
In order to transform coordinates from one coordinate system to another coordinate system, several transformation matrices are generally required, and the most important transformation matrices are three matrices, Model (Model), View (View), and Projection (Projection). The coordinates of the vertex data generally start from a Local Space (Local Space), which is referred to herein as Local coordinates (Local Coordinate), and the Local coordinates become World coordinates (World Coordinate), View coordinates (View Coordinate), Clip coordinates (Clip Coordinate), and finally end in the form of Screen coordinates (Screen Coordinate) after being transformed.
In the above coordinate transformation process, the local coordinates are coordinates of the object with respect to the local origin, and are also coordinates of the start of the object. Next, the local coordinates are transformed into world space coordinates, which are in a larger spatial range. These coordinates are relative to the world's global origin, and they are placed with other objects relative to the world's origin. The world coordinates are then transformed into viewing space coordinates such that each coordinate is viewed from the perspective of the camera or viewer. After the coordinates arrive in the viewing space, we need to project them to the clipping coordinates. Crop coordinates are processed to a range of-1.0 to 1.0 and determine which vertices will appear on the screen. Finally, the clipping coordinates are transformed to screen coordinates, and a process called Viewport Transform (Viewport Transform) is used. The viewport transformation transforms coordinates in the range-1.0 to a coordinate range defined by the glViewport function. The finally transformed coordinates are sent to a rasterizer, which converts them into segments (after conversion into segments, video images can be displayed from the segments).
In the above process, the vertices are transformed into different spaces because some operations are meaningful and more convenient in a particular coordinate system. For example, when a modification of an object is required, it is more likely to operate in a local space; it is more useful to do this in a world coordinate system, if an operation is to be made on one object with respect to the position of the other object, etc. If we wish we can also define a transformation matrix that transforms directly from local space to clipping space, but that loses much of its flexibility.
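A small sketch of the transformation chain just described (numpy; the viewport mapping shown is one common convention and not mandated by the text):

```python
import numpy as np

def local_to_screen(v_local, model, view, projection, viewport_w, viewport_h):
    """Coordinate-pipeline sketch: transform a local-space vertex through the
    Model, View and Projection matrices, perform the perspective divide to
    normalized device coordinates, then apply the viewport transform to screen
    coordinates. Matrices are 4x4; the vertex is promoted to homogeneous form."""
    v = np.append(v_local, 1.0)                  # homogeneous coordinate
    clip = projection @ view @ model @ v         # local -> world -> view -> clip
    ndc = clip[:3] / clip[3]                     # perspective divide, range [-1, 1]
    x = (ndc[0] + 1.0) * 0.5 * viewport_w        # viewport transform
    y = (ndc[1] + 1.0) * 0.5 * viewport_h
    return x, y, ndc[2]                          # screen x, y and depth
```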
The respective coordinate systems will be described in detail below.
Local space:
the local space refers to the coordinate space where the object is located, i.e. where the object is located at the very beginning. Imagine you create a cube in a modeling software (say Blender). The origin of the cube you create is likely to be at (0,0,0), even though it may end up in a completely different location in the program. It is even possible that all models created have an initial position of (0,0,0) (however they will eventually appear in different locations of the world). Therefore, all vertices of the created model are in local space: they are local to your object.
World space:
if we import all our objects into the program, they may crowd all over the world's origin (0,0,0), which is not a result we want. We want to define a location for each object so that they can be placed in the larger world. Coordinates in world space are just as their name: refers to the coordinates of the vertices relative to the (game) world. If you want to spread objects around the world (especially very real), this is the space you want the objects to transform into. The coordinates of the object will be transformed from local to world space; the transformation is realized by a Model Matrix (Model Matrix).
The model matrix is a transformation matrix that can be used to place an object in its intended position or orientation by shifting, scaling, rotating it. You can imagine it as transforming a house, you need to first shrink it (it is too large in local space) and shift it to a town in the suburban area, and then rotate it a little to the left on the y-axis to match a nearby house. You can also roughly regard the matrix used for placing the box in the scene from the previous section as a model matrix; we transform the local coordinates of the box to different locations in the scene/world.
Observation space:
the viewing Space is often referred to as the Camera (sometimes also referred to as Camera Space or Eye Space) of an open graphics library (OPENGL). The viewing space is the result of translating world space coordinates to coordinates in front of the user's field of view. The observation space is thus the space that is observed from the point of view of the camera. This is usually done by a series of combinations of displacements and rotations, which translate/rotate the scene so that a particular object is transformed in front of the camera. These combined transformations are typically stored in a View Matrix (View Matrix), which is used to transform world coordinates to viewing space. In the next section we will discuss in depth how to create one such observation matrix to simulate a camera.
Cutting space:
at the end of a vertex shader run, OpenGL expects all coordinates to fall within a certain range, and any points outside this range should be cropped (Clipped). The coordinates that are cropped out are ignored and the remaining coordinates become segments visible on the screen. This is also the origin of the name of the Clip Space (Clip Space).
Since it is not very intuitive to specify all visible coordinates in the range -1.0 to 1.0, we specify our own coordinate set (Coordinate Set) and then transform it back to the normalized device coordinate system that OpenGL expects.
To transform the coordinates from view to crop space, we need to define a Projection Matrix (Projection Matrix) that specifies a range of coordinates, such as-1000 to 1000 in each dimension. The projection matrix will then transform the coordinates within this specified range to a range of normalized device coordinates (-1.0, 1.0). All out-of-range coordinates are not mapped between the-1.0 to 1.0 range and are therefore clipped. Within the range specified by the projection matrix above, the coordinates (1250,500,750) will not be visible because its x-coordinate is out of range, it is translated into a normalized device coordinate greater than 1.0, and therefore cropped out.
For example, if only a portion of a Primitive (primative), such as a triangle, exceeds the Clipping Volume (Clipping Volume), OpenGL reconstructs the triangle into one or more triangles that can fit the Clipping range.
Ray tracing obtains precise shadow, reflection and diffuse-reflection global illumination effects by tracing each ray emitted from the camera; rendering a highly realistic virtual scene therefore requires enormous computation and power consumption. At present, due to the limitation of GPU hardware, real-time ray tracing at 1920 × 1080 resolution and a 30 fps frame rate can only provide a sampling value of 1 to 2 spp for each pixel, and a low sampling value introduces a lot of noise, so that the quality of the rendered picture is reduced; if the resolution rises to 4K or 8K, the sampling value must be even lower. Therefore, it is necessary to optimize the rendering effect under a low sampling value, remove the noise caused by insufficient sampling, and output a stable globally illuminated image while maintaining real-time ray tracing and without increasing hardware cost.
Under the condition of 1spp, the existing SVGF algorithm combines information filtering and denoising of a space domain and a time domain, and calculates variance on the space domain and the time domain to distinguish high-frequency texture information and a noise point region and guide a filter to carry out multi-layer filtering. However, the method cannot accurately estimate the motion vectors of the non-rigid motion and the shadow part, so that the denoising effect of the shadow part is poor; meanwhile, the method adopts the traditional bilateral filtering, the filtering weight cannot be dynamically adjusted, and the multi-layer filtering consumes longer time, so the method has poor timeliness.
Another existing KPCN algorithm divides an image into a specular reflection part and a diffuse reflection part, then adaptively adjusts the filter kernel weights of the specular reflection part and the diffuse reflection part by utilizing a neural network, and finally combines the specular reflection part and the diffuse reflection part to obtain a combined denoising result. The method has the disadvantages of large sampling value (usually 32spp), huge model structure but single effect (only with denoising function), large calculation amount, long time consumption and insufficient supplement by utilizing time domain information, so the algorithm can hardly meet the real-time requirement.
In order to solve the problem of poor denoising caused by inaccurate estimation of the motion vectors of shadow parts and non-rigid motion, a third existing algorithm uses pixel surface information and shadow information to calculate a gradient change value: a larger gradient value indicates that the corresponding pixel point has moved more, so the historical information of that pixel point is discarded. This method proposes judging the severity of motion based on the gradient and is used to alleviate the ghosting caused by motion vectors that cannot be accurately estimated under large displacements; it cannot serve as an independent denoising module. The method can be combined with the SVGF algorithm and can also be used together with the image rendering method of the embodiment of the present application.
The existing GPU hardware has limited power consumption and computing capacity, and the calculated amount is huge under the condition of high sampling value, so that the real-time requirement of 30fps cannot be met. If only 1 to 2 rays are traced for each pixel point, although the amount of computation is greatly reduced, a lot of noise is introduced. And the noise point characteristics of the surfaces of different materials are different, and if the same denoising process is used, the denoising effect is poor, so that the difficulty of the noise reduction algorithm of the low sampling value is further increased. The motion vector of the rigid motion can be accurately obtained according to the related information in the rendering pipeline of the geometric buffer (G-buffer), wherein the G-buffer is a buffer containing color, normal direction and world space coordinates. However, the motion vector estimation of the non-rigid body and the shadow part is not accurate, which causes the rendering effect to be reduced. In addition, the real-time performance of the noise reduction algorithm is further influenced by the size of the image resolution, and the real-time performance of the ray tracing noise reduction algorithm at a high resolution and a high frame rate faces a greater challenge.
Therefore, the existing image rendering technology faces the following problems:
(1) the high sampling value has huge calculation amount, cannot meet the real-time requirement, and has serious noise point of the low sampling value;
(2) the noise point characteristics of the surfaces of different materials are different, and if the same denoising process is used, the denoising effect is poor;
(3) for game picture rendering, real-time ray tracing under the conditions of high resolution and high frame rate needs to be realized, and the real-time difficulty is high;
(4) the motion vector estimation of the non-rigid body and the shaded portion is inaccurate.
Therefore, the embodiment of the application provides an image rendering method, which adopts a low sampling value under a limited hardware condition, and adopts different optimization strategies for different noise points generated by different materials by combining time domain information, so as to realize real-time ray tracing image rendering with high frame rate and high resolution.
Fig. 4 is a schematic block diagram of a system architecture of an existing image rendering method based on a real-time ray tracing technology, and as shown in fig. 4, the system architecture includes six parts, namely a model material loading module 401, a ray generating module 402, a ray intersection module 403, a denoising module 404, a post-processing module 405, and a display module 406.
The first step of image rendering based on the real-time ray tracing technology is the loading of model materials, which mainly involves two parts: the first part is adding the model to be rendered into the scene, and the second part is adding the respective material information and texture information to the models in the scene. This step is implemented by the model material loading module 401.
Ray generation refers to the process of emitting rays toward the two-dimensional imaging plane. The number of rays emitted for each pixel point greatly influences the final rendering effect: a low sampling value makes the image blurry and produces more noise, while a high sampling value makes the image clear and the effect good. However, due to limited hardware conditions, in order to ensure the real-time performance of ray tracing, generally only 1 to 2 rays are emitted for one pixel. This step is implemented by the ray generation module 402.
Ray tracing splits the rendering task of a scene into the effects of a number of rays emitted from the camera on the scene; these rays are not aware of each other, but each is aware of the information of the entire scene model. Ray intersection means tracing the rays emitted by the camera, finding the intersection points with the scene model, obtaining information such as the material and texture of the scene model surface according to the intersection positions, and calculating the reflected rays in combination with the light source information. The reflected ray calculation is based on Monte Carlo importance sampling with a sampling value of 1 spp, i.e., only 1 ray is traced for each intersection point. Corresponding to the ray generation part, the sampling value of the reflected rays also affects the final rendering effect. Ray intersection is implemented by the ray intersection module 403.
The denoising module 404 is used for reducing noise generated by a low sampling value, and ensuring rendering effect while ensuring real-time performance of ray tracing.
The post-processing module 405 is configured to use tone mapping and temporal anti-aliasing (TAA) techniques to improve the rendering effect. The tone mapping technique maps and transforms the colors of the image and adjusts the gray levels of the image, so that the processed image better expresses the information and characteristics of the original image; the TAA technique blends samples across frames to mitigate "jaggies" at the edges of the image, making the edges smoother.
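A minimal sketch of one common tone mapping operator (Reinhard), given purely as an illustration; the specific operator used by the post-processing module 405 is not fixed by this embodiment, and the gamma value is an assumed example.

```python
import numpy as np

def reinhard_tone_map(hdr, gamma=2.2):
    # Compress high-dynamic-range radiance values into [0, 1),
    # then apply gamma correction for display.
    ldr = hdr / (1.0 + hdr)
    return np.power(ldr, 1.0 / gamma)

hdr_pixel = np.array([3.2, 0.8, 0.1])   # linear RGB radiance
print(reinhard_tone_map(hdr_pixel))
```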
The display module 406 is used to display the final rendered image.
Fig. 5 is a schematic block diagram of a system architecture of an image rendering method based on a real-time ray tracing technology according to an embodiment of the present application, and as shown in fig. 5, the system architecture includes seven parts, namely a model material loading module 501, a downsampling ray generating module 502, a ray intersection module 503, a denoising module 504, a frame interpolation module 505, a post-processing module 506, and a display module 507.
Unlike the conventional image rendering method based on the real-time ray tracing technology in fig. 4, in the image rendering method based on the real-time ray tracing technology according to the embodiment of the present application, assuming that the size of the final two-dimensional imaging plane is w × h and the downsampling ratio is 1/2, rays are emitted only for (1/2)w × (1/2)h pixel points, so that the amount of calculation for ray intersection can be greatly reduced.
Since downsampling is performed when rays are generated, in the image rendering method based on the real-time ray tracing technology according to the embodiment of the present application, the denoising module 504 integrates the super-resolution technique to restore the image resolution.
In addition, a frame interpolation module 505 is added after the denoising module 504 to solve the real-time problem in high-frame-rate scenarios.
Fig. 6 shows a schematic flowchart of an image rendering method according to an embodiment of the present application, and as shown in fig. 6, the method includes steps 601 to 607, which are described in detail below.
S601, acquiring an (n-1)th frame image, an nth frame image, and an (n+1)th frame image, where the (n-1)th frame image, the nth frame image, and the (n+1)th frame image are three consecutive frames of images; consecutive means that the (n-1)th frame image comes before the nth frame image and the nth frame image comes before the (n+1)th frame image. The (n-1)th frame image, the nth frame image, and the (n+1)th frame image are images with low sampling values (for example, 1 spp) generated by the model material loading module 501, the downsampling ray generation module 502, and the ray intersection module 503 in fig. 5.
S602, the illumination map of the nth frame image is updated according to the (n-1) th frame image, so as to obtain the updated illumination map of the nth frame image. This step may be performed by denoising module 504 in fig. 5.
The illumination map includes a direct illumination map and an indirect illumination map. The direct illumination map is the illumination map obtained when the light source directly irradiates the observed object and the light is reflected by the observed object into the eyes of the user; the indirect illumination map is the illumination map obtained when light from the light source irradiates other objects, is reflected once or multiple times onto the observed object, and is then reflected by the observed object into the eyes of the user.
It should be understood that an image is composed of a plurality of pixel points, each pixel point having its own color value, and the collection of the color values of all the pixel points of an image constitutes the illumination map of the image.
Specifically, the illumination pattern includes a direct illumination pattern and an indirect illumination pattern, the direct illumination is that light from the light source directly irradiates on the object, and the indirect illumination is that light from the light source irradiates on the object after being reflected once or multiple times, and the direct illumination pattern is taken as an example for description below.
And for any pixel point in the nth frame image, marking as a first pixel point, and acquiring a pixel point corresponding to the first pixel point in the (n-1) th frame image and marking as a second pixel point. That is to say, the second pixel point is the corresponding pixel point of the first pixel point in the (n-1) th frame image. Then according to the color value of the first pixel point and the color value of the second pixel point, the color value of the first pixel point is updated, specifically, the color value of the first pixel point can be multiplied by a first coefficient to obtain a first result, the color value of the second pixel point is multiplied by a second coefficient to obtain a second result, and then the first result and the second result are added to obtain the updated color value of the first pixel point. The first coefficient and the second coefficient may be artificially preset values, and the embodiment of the present application is not specifically limited herein.
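A minimal sketch of this per-pixel accumulation, assuming the two coefficients sum to one (as in the accumulation formula given later) and taking alpha as an assumed example value; the embodiment only requires the coefficients to be preset.

```python
import numpy as np

def accumulate_color(current_color, history_color, alpha=0.2):
    """Blend the current frame's color with the reprojected history color.

    current_color / history_color: arrays of shape (..., 3), linear RGB.
    alpha weights the current color; (1 - alpha) weights the history.
    """
    return alpha * current_color + (1.0 - alpha) * history_color

# Example: one pixel of the nth frame blended with its match in frame n-1.
c_first = np.array([0.9, 0.4, 0.2])    # first pixel point (frame n)
c_second = np.array([0.7, 0.5, 0.3])   # second pixel point (frame n-1)
print(accumulate_color(c_first, c_second))
```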
Optionally, if the position of the first pixel point in the n-1 th frame image corresponding to the second pixel point is not on the grid node of the n-1 th frame image, the color value of the second pixel point cannot be directly obtained at this time, and the color value of the second pixel point needs to be obtained by adopting a bilinear interpolation algorithm. Specifically, color values of four pixel points closest to the second pixel point are found firstly, and the four pixel points are required to be on grid nodes of the (n-1) th frame image; then obtaining color values of the four pixel points; and the color values of the four pixel points are combined with a bilinear interpolation algorithm, so that the color value of the second pixel point can be calculated.
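The bilinear interpolation described above can be sketched as follows; this is a simplified illustration that assumes the reprojected position (x, y) satisfies 0 <= x < W-1 and 0 <= y < H-1, so that the four surrounding grid nodes exist.

```python
import numpy as np

def bilinear_sample(image, x, y):
    """Interpolate image values at a non-integer position (x, y).

    image: array of shape (H, W, C); x, y: floating-point coordinates,
    where integer coordinates coincide with grid nodes of frame n-1.
    """
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = x0 + 1, y0 + 1
    fx, fy = x - x0, y - y0
    # Weighted sum of the four nearest grid nodes.
    return ((1 - fx) * (1 - fy) * image[y0, x0] +
            fx * (1 - fy) * image[y0, x1] +
            (1 - fx) * fy * image[y1, x0] +
            fx * fy * image[y1, x1])
```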
Optionally, to ensure that the second pixel point is indeed the pixel point corresponding to the first pixel point in the (n-1)th frame image, before the color value of the first pixel point is updated with the color value of the second pixel point, the method of the embodiment of the present application further includes performing a consistency judgment on the first pixel point and the second pixel point. Specifically, the depth value, normal value, and patch ID of the first pixel point and the depth value, normal value, and patch ID of the second pixel point are obtained. If the first pixel point and the second pixel point satisfy the following conditions: the square of the difference between the depth value of the first pixel point and the depth value of the second pixel point is smaller than a first threshold, the square of the difference between the normal value of the first pixel point and the normal value of the second pixel point is smaller than a second threshold, and the patch ID of the first pixel point is equal to the patch ID of the second pixel point, then the first pixel point and the second pixel point are considered to satisfy the consistency, and the color value of the second pixel point can be used to update the color value of the first pixel point. If the first pixel point and the second pixel point do not satisfy the consistency, the color value of the first pixel point is not updated with the color value of the second pixel point, and the updated color value of the first pixel point is the current color value of the first pixel point.
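A sketch of this consistency test; the threshold values are assumed examples (the embodiment only states that they are preset, empirically tuned constants), and the dictionary layout is a convenience for the illustration.

```python
import numpy as np

def is_consistent(cur, prev, depth_threshold=1e-2, normal_threshold=1e-1):
    """cur / prev: dicts holding 'depth' (float), 'normal' (3-vector), 'patch_id' (int)."""
    depth_ok = (cur["depth"] - prev["depth"]) ** 2 < depth_threshold
    normal_ok = float(np.sum((cur["normal"] - prev["normal"]) ** 2)) < normal_threshold
    id_ok = cur["patch_id"] == prev["patch_id"]
    return depth_ok and normal_ok and id_ok

cur = {"depth": 0.52, "normal": np.array([0.0, 1.0, 0.0]), "patch_id": 7}
prev = {"depth": 0.50, "normal": np.array([0.05, 0.99, 0.0]), "patch_id": 7}
print(is_consistent(cur, prev))   # True -> the history color may be accumulated
```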
Optionally, if the position of the first pixel point in the n-1 th frame image corresponding to the second pixel point is not on the mesh node of the n-1 th frame image, the depth value, the normal value, and the patch ID of the second pixel point cannot be directly obtained at this time, and similar to the above method for obtaining the color value, the depth value, the normal value, and the patch ID of the second pixel point need to be obtained by using a bilinear interpolation algorithm. Specifically, the depth values, normal direction values and surface patch IDs of four pixel points closest to the second pixel point are found, and the four pixel points are required to be on grid nodes of the (n-1) th frame of image; then obtaining the depth values, normal direction values and surface patch IDs of the four pixel points; the depth value, the normal value and the patch ID of the second pixel point can be calculated by combining the depth value, the normal value and the patch ID of the four pixel points with a bilinear interpolation algorithm.
After the color values of all pixel points in the nth frame image are updated in the above manner, the updated direct illumination map of the nth frame image can be obtained.
In the embodiment of the present application, the processing manner of the direct illumination map and the indirect illumination map is the same, and the processing manner of the indirect illumination map may refer to the processing manner of the direct illumination map.
S603, inputting the illumination pattern updated by the nth frame image into a hyper-resolution denoising network to obtain a hyper-resolution denoising image of the nth frame image. This step may be performed by denoising module 504 in fig. 5.
It should be understood that the updated illumination map includes an updated direct illumination map and an updated indirect illumination map. Specifically, the depth map and the normal vector map of the nth frame image are obtained first; it should be understood that the depth map of the nth frame image is the collection of the depth values of all pixel points in the nth frame image, and the normal vector map of the nth frame image is the collection of the normal values of all pixel points in the nth frame image. Then the updated direct illumination map, the updated indirect illumination map, the depth map, and the normal vector map of the nth frame image are fused to obtain a first fusion result, where the fusion mode may be an existing feature fusion mode, such as concat or add, and the embodiment of the present application is not specifically limited here. Finally, the first fusion result is input into the hyper-resolution denoising network to obtain the hyper-resolution denoised image of the nth frame image. The hyper-resolution denoising network is a pre-trained neural network model and may have the U-Net network structure shown in fig. 2; the training process of the hyper-resolution denoising network is described in detail below.
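A minimal sketch of the fusion step, assuming channel-wise concatenation (concat); the resolution and channel counts are illustrative assumptions, and the hyper-resolution denoising network itself is treated as an opaque pre-trained model.

```python
import numpy as np

def build_denoise_input(direct, indirect, depth, normal):
    """Concatenate per-pixel features along the channel axis.

    direct, indirect: (H, W, 3) updated illumination maps.
    depth: (H, W, 1); normal: (H, W, 3).
    Returns an (H, W, 10) tensor to feed to the hyper-resolution denoising network.
    """
    return np.concatenate([direct, indirect, depth, normal], axis=-1)

H, W = 540, 960   # e.g. half of 1080p after downsampled ray generation (illustrative)
fused = build_denoise_input(np.zeros((H, W, 3)), np.zeros((H, W, 3)),
                            np.zeros((H, W, 1)), np.zeros((H, W, 3)))
print(fused.shape)   # (540, 960, 10)
```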
S604, the illumination map of the n +1 frame image is updated according to the n frame image, so as to obtain the illumination map after the n +1 frame image is updated. This step may be performed by denoising module 504 in fig. 5.
The process of updating the illumination map of the n +1 th frame image according to the nth frame image is similar to the process of updating the illumination map of the nth frame image according to the n-1 th frame image, and the description of the process of updating the illumination map of the nth frame image according to the n-1 th frame image may be specifically referred to above, and is not repeated herein for brevity.
S605, inputting the illumination pattern updated by the n +1 frame image into a hyper-resolution denoising network to obtain a hyper-resolution denoising image of the n +1 frame image. This step may be performed by denoising module 504 in fig. 5.
The process of inputting the illumination pattern after the image of the (n + 1) th frame is updated into the hyper-resolution denoising network is similar to the process of inputting the illumination pattern after the image of the n th frame is updated into the hyper-resolution denoising network, and specifically, the description of the process of inputting the illumination pattern after the image of the n th frame into the hyper-resolution denoising network to obtain the hyper-resolution denoising image of the n th frame can be referred to above.
S606, obtaining an initial frame interpolation image at a target time according to the super-resolution denoised image of the nth frame image and the super-resolution denoised image of the (n+1)th frame image, where the target time is a time between the nth frame image and the (n+1)th frame image. This step may be performed by the frame interpolation module 505 in fig. 5.
After the super-resolution denoised image of the nth frame image and the super-resolution denoised image of the (n+1)th frame image are obtained according to the above method, the initial frame interpolation image at the target time is obtained from these two images, where the target time is a time between the nth frame image and the (n+1)th frame image; preferably, the target time is the middle time between the nth frame image and the (n+1)th frame image. Specifically, the motion vectors from the (n+1)th frame image to the nth frame image are obtained first. It should be understood that each pixel point in the (n+1)th frame image has a motion vector to its corresponding pixel point in the nth frame image, and the collection of the motion vectors of all pixel points constitutes the motion vector from the (n+1)th frame image to the nth frame image. Then, a first motion vector from the initial frame interpolation image at the target time to the nth frame image and a second motion vector from the initial frame interpolation image at the target time to the (n+1)th frame image are determined according to the motion vector from the (n+1)th frame image to the nth frame image. For example, assume that the motion vector from the (n+1)th frame image to the nth frame image is M_3→2, the target time is time t, and t is a value in (0,1). The first motion vector from the initial frame interpolation image at the target time to the nth frame image is:
M_t→2 = t × M_3→2
The second motion vector from the initial frame interpolation image at the target time to the (n+1)th frame image is:
M_t→3 = -(1-t) × M_3→2
Finally, the initial frame interpolation image at the target time is obtained according to the super-resolution denoised image of the nth frame image, the super-resolution denoised image of the (n+1)th frame image, the first motion vector, and the second motion vector. Suppose the super-resolution denoised image of the nth frame image is I_2 and the super-resolution denoised image of the (n+1)th frame image is I_3. The initial frame interpolation image at the target time is then calculated as:
I_t = (1-t) × g(I_2, M_t→2) + t × g(I_3, M_t→3)
where the function g () represents a mapping operation.
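A sketch of this interpolation, assuming that g() is a backward-warping operation driven by per-pixel motion vectors (the embodiment only states that g() is a mapping operation); nearest-neighbour fetching is used here for brevity.

```python
import numpy as np

def warp(image, motion):
    """Backward-warp `image` (H, W, 3) with per-pixel motion vectors (H, W, 2).

    Each output pixel fetches the source pixel displaced by its motion vector.
    """
    h, w, _ = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.round(xs + motion[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.round(ys + motion[..., 1]).astype(int), 0, h - 1)
    return image[src_y, src_x]

def initial_interpolation(i2, i3, m_3to2, t=0.5):
    """i2, i3: super-resolution denoised frames n and n+1; m_3to2: motion n+1 -> n."""
    m_t2 = t * m_3to2              # first motion vector (time t -> frame n)
    m_t3 = -(1.0 - t) * m_3to2     # second motion vector (time t -> frame n+1)
    return (1.0 - t) * warp(i2, m_t2) + t * warp(i3, m_t3)
```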
S607, inputting the initial frame interpolation image into the bidirectional frame interpolation network to obtain the frame interpolation image at the target moment. This step may be performed by the frame insertion module 505 in fig. 5.
The method of the embodiment of the present application may directly input the initial frame interpolation image into the bidirectional frame interpolation network. However, in order to compensate for the inaccurate motion vector estimation of non-rigid motion and shadow motion, the method of the embodiment of the present application further includes: first, acquiring the depth map of the nth frame image, the normal vector map of the nth frame image, the depth map of the (n+1)th frame image, and the normal vector map of the (n+1)th frame image; then, fusing the depth map of the nth frame image, the normal vector map of the nth frame image, the depth map of the (n+1)th frame image, the normal vector map of the (n+1)th frame image, and the initial frame interpolation image to obtain a second fusion result, where the fusion mode may be an existing feature fusion mode, such as concat or add, and the embodiment of the present application is not specifically limited here; and finally, inputting the second fusion result into the bidirectional frame interpolation network. The bidirectional frame interpolation network is a pre-trained neural network model and may have the U-Net network structure shown in fig. 2; the training process of the bidirectional frame interpolation network is described in detail below.
The image rendering method of the embodiment of the application can process the image with a low sampling value (such as 1spp), so that the requirement on hardware equipment is greatly reduced; according to the image rendering method, the color values of the pixel points are accumulated, the illumination map of the low sampling value is updated, and the noise problem caused by insufficient sampling information amount can be solved; the image rendering method of the embodiment of the application uses the super-resolution denoising network to process the image, so that the resolution of the image can be improved; the image rendering method provided by the embodiment of the application performs frame interpolation on two continuous frames of images, and uses a bidirectional frame interpolation network to process the images of the frame interpolation, so that the frame rate of the images is improved, and the fluency of image rendering is ensured.
Fig. 7 shows a schematic flowchart of training a hyper-resolution denoising network according to an embodiment of the present application, and as shown in fig. 7, the training includes steps 701 to 706, which are described below.
S701, acquiring multiple groups of super-component denoising original training data, wherein each group of super-component denoising original training data in the multiple groups of super-component denoising original training data comprises two continuous frames of images and a standard image corresponding to the next frame of image in the two continuous frames of images.
Specifically, two consecutive frames of images are images with a low sampling value (e.g., 1spp), and the standard image corresponding to the next frame of image is a rendering result of the next frame of image under a high sampling value (e.g., 4096 spp). The standard image is used as a training standard for the low-sampling-value image.
S702, judging whether pixel points of two continuous frames of images accord with consistency.
Specifically, for each pixel point in the next frame of image, whether the corresponding pixel point in the previous frame of image is consistent with the corresponding pixel point is judged. For brevity, the method for determining the consistency of the pixel points may refer to the description in S602, and is not described herein again.
And S703, acquiring a depth map of the next frame image, a normal vector map of the next frame image and an illumination map of the next frame image in the two continuous frames of images, wherein the illumination map of the next frame image is a direct illumination map or an indirect illumination map.
S704, updating the color values of the pixel points of the next frame image according to the two consecutive frames of images to obtain the updated illumination map of the next frame image.
Specifically, for the pixel points that conform to the consistency, the color value of the pixel point of the next frame of image is updated according to the two consecutive frames of images, and the method for updating the color value of the pixel point may refer to the description in S602 above. And regarding the pixel points which do not accord with the consistency, taking the current color value of the pixel point of the next frame of image as the updated color value of the pixel point. After color values of each pixel point in the next frame of image are updated, an illumination map of the next frame of image after updating can be obtained.
S705, the depth map of the next frame image, the normal vector map of the next frame image and the updated illumination map of the next frame image are fused to obtain an updated image.
For a specific fusion method, reference may be made to the above description of S603, and for brevity, no further description is given herein in this embodiment of the present application.
S706, training the super-resolution denoising network according to the updated image and the standard image.
It should be understood that, here, training of the hyper-resolution denoising network is the same as a training method of a general neural network, a high sampling value image is taken as a standard, so that a training result of an image with a low sampling value approaches to the high sampling value image, and when a difference between the training result and the standard image is smaller than a preset value, the training of the hyper-resolution denoising network is considered to be completed.
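A minimal training-loop sketch under stated assumptions: the loss is an L1 distance to the high-sampling-value standard image and the optimizer is Adam, neither of which is fixed by the embodiment; `network` stands for any module implementing the U-Net-style hyper-resolution denoising network, and `dataset` yields the (updated fused image, standard image) pairs described above.

```python
import torch
import torch.nn.functional as F

def train_denoise_network(network, dataset, epochs=10, lr=1e-4):
    """dataset yields (fused_input, standard_image) pairs as float tensors.

    fused_input:    (B, C, H, W)   updated illumination + depth + normal channels
    standard_image: (B, 3, 2H, 2W) high-sampling-value rendering at the target resolution
    """
    optimizer = torch.optim.Adam(network.parameters(), lr=lr)
    for _ in range(epochs):
        for fused_input, standard_image in dataset:
            prediction = network(fused_input)          # super-resolved, denoised output
            loss = F.l1_loss(prediction, standard_image)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return network
```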
Fig. 8 shows a schematic flowchart of training of a bidirectional frame-insertion network according to an embodiment of the present application, and as shown in fig. 8, the method includes steps 801 to 803, which are described below.
S801, acquiring multiple groups of original training data of the bidirectional frame interpolation, wherein each group of original training data of the bidirectional frame interpolation in the multiple groups of original training data of the bidirectional frame interpolation comprises a fourth image, a fifth image and a sixth image, and the fourth image, the fifth image and the sixth image are continuous three-frame images.
The fourth image, the fifth image, and the sixth image may be the super-resolution denoised images obtained through the above steps, or may be other images; the embodiment of the present application is not limited here.
And S802, acquiring an interpolation frame image at the middle moment of the fourth image and the sixth image according to the fourth image and the sixth image.
For obtaining the interpolated frame image at the intermediate time, reference may be made to the above description of S606, and for brevity, no further description is given here in this embodiment of the application.
And S803, training the bidirectional frame interpolation network according to the frame interpolation image at the middle moment and the fifth image.
And taking the fifth image as a standard image, enabling the training result of the frame interpolation image at the middle moment to approach the fifth image, and when the difference between the training result and the fifth image is smaller than a preset value, considering that the bidirectional frame interpolation network training at the moment is finished.
The image rendering method of the embodiment of the application mainly aims at removing noise under the condition of a low sampling value (for example, 1spp) and achieving real-time ray tracing with high resolution and high frame rate. Fig. 9 is a schematic block diagram illustrating an image rendering method according to an embodiment of the present application, and the image rendering method according to the embodiment of the present application is described in detail below with reference to fig. 9.
The image rendering method of the embodiment of the application relates to a super-resolution denoising network and a bidirectional frame interpolation network, wherein the super-resolution denoising network and the bidirectional frame interpolation network have the U-Net network model structure shown in the figure 2, and the super-resolution denoising network and the bidirectional frame interpolation network need to be obtained through pre-training.
The training of the hyper-resolution denoising network includes the following steps. The acquired illumination information is divided into direct illumination and indirect illumination by ray tracing, and the color information of the direct illumination and the indirect illumination is updated in combination with the motion vector in the G-buffer and the consistency judgment. The updated color information of the direct illumination and the indirect illumination is fused with the corresponding depth information and normal information in the G-buffer, where the fusion mode may be concat or add. The fused direct illumination and the fused indirect illumination are input into the hyper-resolution denoising network to obtain the hyper-resolution denoising result of the direct illumination and the hyper-resolution denoising result of the indirect illumination respectively. Finally, the hyper-resolution denoising result of the direct illumination and the hyper-resolution denoising result of the indirect illumination are fused to obtain the final hyper-resolution denoising result. The ground truth (Ground Truth) of the hyper-resolution denoising network is the ray tracing rendering result with a sampling value of 4096 spp.
The training of the bidirectional frame interpolation network includes the following steps. Three consecutive frames of images output by the super-resolution denoising network are recorded as I_AI+Denoise_0, I_AI+Denoise_1, and I_AI+Denoise_2. The bidirectional motion vectors between I_AI+Denoise_0 and I_AI+Denoise_1 and between I_AI+Denoise_1 and I_AI+Denoise_2 are estimated in combination with the G-buffer, and the initial intermediate frame interpolation result I_AI+Denoise_1_calculate is obtained using the frame interpolation calculation formula. I_AI+Denoise_1_calculate is used as the input of the bidirectional frame interpolation network to obtain the final frame interpolation result. The Ground Truth of the bidirectional frame interpolation network is I_AI+Denoise_1.
The rasterized rendering shown in fig. 9 is the mapping of a series of coordinate values of a three-dimensional object onto a two-dimensional plane. The process from three-dimensional coordinates to two-dimensional coordinates is usually performed step by step and requires multiple coordinate systems for the transition, including the local space, world space, observation space, clipping space, screen space, and so on. The transformation of coordinates from one coordinate system to another is achieved by a transformation matrix: the transformation matrix from local space coordinates to world space coordinates is the model matrix M_model, the transformation matrix from the world space coordinate system to the observation space coordinate system is the observation matrix M_view, and the transformation matrix from the observation space coordinate system to the clipping space coordinate system is the projection matrix M_projection. For moving images, the scenes of two adjacent frames are correlated, so the relative offset of the same pixel point between the two adjacent frames is a motion vector, and the process of solving the motion vector is motion estimation. According to the coordinate operations, the motion vector of a pixel point in two consecutive frames of images can be obtained. The depth information, normal information, patch ID (mesh ID), motion vector, and other information of the image can be obtained in the rasterization process, and this information is stored in the G-buffer.
As shown in fig. 9, the acquired illumination information is divided into direct illumination and indirect illumination by ray tracing. By accessing the history color buffer and combining it with the color value of the corresponding pixel point in the current frame, continuous accumulation of color values in the time domain can be achieved. This is because an image produces more noise when the sampling value is small, and accumulating the historical color value with the current color value is equivalent to increasing the sampling value. However, since the motion vector estimation of non-rigid bodies such as shadows is inaccurate, a consistency judgment needs to be performed according to the normal information, depth information, and mesh ID in the G-buffer, and the history information is accumulated if and only if the normal information, depth information, and mesh ID of the pixel satisfy the consistency at the same time. Specifically, the position of a pixel point of the current frame is projected into the previous frame according to the motion vector, the normal information, depth information, and mesh ID of the pixel point in the previous frame are obtained by bilinear interpolation, the consistency judgment is then performed, and the color buffers of the direct illumination and the indirect illumination are updated according to the judgment result. Finally, the color information of the direct illumination and the indirect illumination is fused with the corresponding depth information and normal information respectively, and the fused results are input into the hyper-resolution denoising network to obtain hyper-resolution denoised images, which are respectively recorded as the 0th frame and the 1st frame, where the 0th frame and the 1st frame are two consecutive frames of images.
According to the motion vector information in the G-buffer, a bidirectional motion vector corresponding to the intermediate time t of the 0 th frame and the 1 st frame can be obtained through linear operation, and an initial frame interpolation result can be obtained by performing mapping operation on the hyper-resolution denoised images of the 0 th frame and the 1 st frame and the bidirectional motion vector corresponding to the intermediate time t. And fusing the initial frame interpolation result and the corresponding depth information and normal information in the G-buffer, and inputting the fused result into a bidirectional frame interpolation network to obtain a final frame interpolation result, namely a final t frame image.
The image rendering method of the embodiment of the application combines the super-resolution technology and the frame interpolation technology, achieves the purpose of obtaining images with high frame rate and given resolution ratio under the condition of a low sampling value, and meanwhile reduces rendering time. The image rendering method according to the embodiment of the present application is described above with reference to fig. 9, and the image rendering method according to the embodiment of the present application is further described in detail with reference to specific examples below.
(I) Acquisition of a data set
Fig. 10 shows a schematic flowchart of acquiring a data set in the embodiment of the present application. As shown in fig. 10, the model scene data set used for training the neural networks in the embodiment of the present application may be a set of existing public rendering models or a set of model scenes developed and constructed independently. However, in order to ensure the balance of the neural network training and the reliability of the training effect, different kinds of rendered scenes should be screened out when acquiring the rendered scene data source, such as buildings, automobiles, home furnishings, games, animals, characters, statues, and other different scenes. In addition to the richness of the content of the data source, the richness of the characteristics of the data itself must also be ensured, including different detail textures, different materials, images under different illumination, and so on. The data set should contain as many smooth regions and complex regions as possible, where complex regions are harder to denoise than smooth regions because they contain more texture. After acquiring data with different contents and different characteristics, the method of the embodiment of the present application further includes a series of operations such as flipping, rotating, stretching, and shrinking on the acquired images, so as to expand the data set as much as possible.
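A small sketch of such data augmentation, assuming images are stored as numpy arrays; the exact set of transforms and their parameter ranges are not fixed by the embodiment and are chosen here only for illustration.

```python
import numpy as np

def augment(image, rng):
    """Randomly flip, rotate, and rescale one (H, W, C) image."""
    if rng.random() < 0.5:
        image = image[:, ::-1]                      # horizontal flip
    image = np.rot90(image, k=rng.integers(0, 4))   # rotate by 0/90/180/270 degrees
    scale = rng.uniform(0.75, 1.25)                 # stretch or shrink
    h, w = image.shape[:2]
    ys = np.clip((np.arange(int(h * scale)) / scale).astype(int), 0, h - 1)
    xs = np.clip((np.arange(int(w * scale)) / scale).astype(int), 0, w - 1)
    return image[np.ix_(ys, xs)]                    # nearest-neighbour resize

rng = np.random.default_rng(0)
print(augment(np.ones((64, 64, 3)), rng).shape)
```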
(II) Motion vector estimation and backprojection
Fig. 11 is a schematic block diagram illustrating the rasterization process according to an embodiment of the present application. As shown in fig. 11, the rasterization process converts three-dimensional coordinates into two-dimensional coordinates through the local space, world space, viewing space, clipping space, and screen space. In order to transform coordinates from one coordinate system to another, several transformation matrices are generally required, the most important of which are the model matrix M_model, the view matrix M_view, and the projection matrix M_projection. The coordinates of the vertex data generally start in the Local Space as Local Coordinates, which after transformation become World Coordinates, View Coordinates, and Clip Coordinates, and finally end in the form of Screen Coordinates.
The following describes the calculation of the motion vector of the same pixel point in two consecutive frames of images. Suppose two consecutive frames of images I and J, where pixel point u = (x1, y1) is a pixel point in image I and v = (x2, y2) is the pixel point in image J corresponding to the pixel point u. The motion vector is then formally expressed as:
M = (x2 - x1, y2 - y1)
the calculation of the motion vector from the rasterized G-buffer is as follows:
Calculate the transformation matrix of the previous frame: M_mvp_prev = M_projection_prev × M_view_prev × M_model_prev
Calculate the transformation matrix of the current frame: M_mvp_cur = M_projection_cur × M_view_cur × M_model_cur
Calculate the motion vector: M = M_mvp_cur × aPos - M_mvp_prev × aPos
Where, aPos represents three-dimensional coordinates in a local coordinate system.
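A sketch of this computation, assuming 4×4 column-vector matrices and that aPos is given in homogeneous form (w = 1); perspective division and viewport mapping are omitted for brevity.

```python
import numpy as np

def motion_vector(m_model_prev, m_view_prev, m_proj_prev,
                  m_model_cur, m_view_cur, m_proj_cur, a_pos):
    """a_pos: homogeneous local-space position of the vertex, shape (4,)."""
    mvp_prev = m_proj_prev @ m_view_prev @ m_model_prev
    mvp_cur = m_proj_cur @ m_view_cur @ m_model_cur
    # Difference of the projected positions of the same vertex in the two frames.
    return mvp_cur @ a_pos - mvp_prev @ a_pos
```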
The limitation of the above method is that it only targets rigid motion. Therefore, the method of the embodiment of the present application further needs to use the calculated motion vector and a bilinear interpolation method to find the position of the pixel point v in the previous frame J that corresponds to the pixel point u in the current frame image I, together with the corresponding parameters such as depth information, normal vector, and mesh ID. As shown in fig. 12, if the position in the previous frame J corresponding to the pixel point u is not at a vertex position, the depth information, normal vector, mesh ID, and other parameters of that position cannot be obtained directly. In this case, the parameters corresponding to the position P' can be obtained by bilinear interpolation from the grid points P1, P2, P3, and P4 around it, and used as the depth information, normal vector, mesh ID, and other parameters of the pixel point v for the subsequent consistency judgment.
(III) consistency judgment
In the embodiment of the present application, the pixel point u in the current frame is projected to the corresponding position in the previous frame to obtain the pixel point v at that position, and then the consistency judgment is performed on the pixel points u and v. The consistency conditions are as follows:
(W_z_cur - W_z_prev)^2 < threshold_z
(W_n_cur - W_n_prev)^2 < threshold_n
W_id_cur = W_id_prev
where W_z_cur represents the depth value of the pixel point u and W_z_prev represents the depth value of the pixel point v, and the square of their difference needs to be smaller than the depth threshold threshold_z; W_n_cur represents the normal value of the pixel point u and W_n_prev represents the normal value of the pixel point v, and the square of their difference needs to be smaller than the normal threshold threshold_n; W_id_cur represents the mesh ID of the pixel point u and W_id_prev represents the mesh ID of the pixel point v, and these two values need to be equal. The depth threshold threshold_z and the normal threshold threshold_n are empirical values that may be adjusted appropriately according to the rendering result.
If and only if the above three conditions are satisfied, the pixel point is considered to pass the consistency judgment, and the color value of the pixel point can then be accumulated. The accumulation formula is:
C_update = α × C_original + (1 - α) × C_history
where C_update represents the updated illumination map, C_original represents the original illumination map, C_history represents the illumination map in the history buffer, and α represents the proportion coefficient between the original illumination map and the illumination map in the history buffer, which may be a preset value.
Optionally, if the pixel u and the pixel v do not satisfy the consistency, it is considered that the pixel corresponding to the pixel u is not found in the previous frame, and the current color value of the pixel u is used as the updated color value.
(IV) super-resolution denoising network
The first part of the data set of the super-resolution denoising network is a direct illumination map obtained from a rendering pipeline, after consistency judgment is carried out, direct illumination color values in a history cache are accumulated to the current direct illumination map, and the formula is represented as follows:
C_direct_update = α_1 × C_direct_original + (1 - α_1) × C_direct_history
where C_direct_update represents the updated direct illumination map, C_direct_original represents the original direct illumination map, C_direct_history represents the direct illumination map in the history buffer, and α_1 represents the proportion coefficient between the original direct illumination map and the direct illumination map in the history buffer.
The second part of the data set of the super-resolution denoising network is an indirect light map obtained from a rendering pipeline, after consistency judgment is carried out, the indirect light color values in the historical cache are accumulated to the current indirect light map, and the formula is represented as follows:
C_indirect_update = α_2 × C_indirect_original + (1 - α_2) × C_indirect_history
where C_indirect_update represents the updated indirect illumination map, C_indirect_original represents the original indirect illumination map, C_indirect_history represents the indirect illumination map in the history buffer, and α_2 represents the proportion coefficient between the original indirect illumination map and the indirect illumination map in the history buffer.
The third part of the data set of the super-resolution denoising network is the depth map I_depth and the normal vector map I_normal_vector of the current frame obtained from the G-buffer.
In summary, the training data set Dataset of the super-resolution denoising network includes four parts in total:
Dataset = C_direct_update + C_indirect_update + I_depth + I_normal_vector
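A sketch of assembling the network inputs from these parts when direct and indirect illumination are denoised separately, as described for the Fig. 9 pipeline; the "+" in the formula above is read here as channel-wise feature fusion (concat) rather than arithmetic addition, which is an assumption of this illustration.

```python
import numpy as np

def build_branch_inputs(c_direct_update, c_indirect_update, depth, normal_vector):
    """Build the two branch inputs of the super-resolution denoising network.

    Each branch fuses its updated illumination map (H, W, 3) with the shared
    depth map (H, W, 1) and normal vector map (H, W, 3) of the current frame.
    """
    direct_branch = np.concatenate([c_direct_update, depth, normal_vector], axis=-1)
    indirect_branch = np.concatenate([c_indirect_update, depth, normal_vector], axis=-1)
    return direct_branch, indirect_branch
```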
fig. 13 shows a schematic block diagram of performing hyper-resolution denoising on an image by using a hyper-resolution denoising network in the embodiment of the present application, and as shown in fig. 13, the processing of a direct illumination map of a certain pixel is taken as an example for explanation, and the processing of an indirect illumination map is similar to the processing of a direct illumination map, which may specifically refer to the processing of a direct illumination map, and is not described in detail in the embodiment of the present application. As shown in fig. 13, a current frame buffer and a previous frame buffer are first obtained, where the current frame buffer includes parameters such as a motion vector of the pixel from the current frame to the previous frame, depth information, normal information, and median of the pixel in the current frame; the previous frame cache comprises parameters such as historical color values of the pixel points, depth information of the pixel points in the current frame, normal information, and median. And then, the motion vector is utilized to project the previous frame buffer to the space of the current frame, and consistency judgment is carried out according to parameters such as depth information, normal information, median and the like. If the judgment result is consistent, accumulating the historical color value and the current color value, and updating the color value of the current frame; and if the judgment result is inconsistent, retaining the current color value. And updating the historical color value according to the updated color value. And finally, fusing the updated color value with the depth information and the normal information of the current frame, and inputting a fused result into a hyper-resolution denoising network so as to obtain a hyper-resolution denoised image.
(V) bidirectional frame insertion network
The acquisition of the data set of the bidirectional frame interpolation network includes the following. Three consecutive frames of images output by the super-resolution denoising network are recorded as I_AI+Denoise_0, I_AI+Denoise_1, and I_AI+Denoise_2. The motion vector from I_AI+Denoise_1 to I_AI+Denoise_0, denoted M_1-0, and the motion vector from I_AI+Denoise_2 to I_AI+Denoise_1, denoted M_1-2, are obtained from the G-buffer. The frame interpolation result can therefore be obtained according to the frame interpolation formula:
I_AI+Denoise_1_calculate = (1 - t) × g(I_AI+Denoise_0, M_t-0) + t × g(I_AI+Denoise_2, M_t-2)
where I_AI+Denoise_1_calculate is the frame interpolation result, t represents the time position of the interpolation result between the 0th frame and the 2nd frame (for example, t may take 0.5), M_t-0 represents the motion vector from time t to time 0, which here equals M_1-0, and M_t-2 represents the motion vector from time t to time 2, which here equals M_1-2.
The frame interpolation result I_AI+Denoise_1_calculate is used as the input of the bidirectional frame interpolation network, and I_AI+Denoise_1 is used as the Ground Truth of the bidirectional frame interpolation network, so that the bidirectional frame interpolation network can be trained accordingly.
Fig. 14 shows a schematic block diagram of processing an image using the bidirectional frame interpolation network according to an embodiment of the present application. M_1-0 represents the motion vector from the next frame image to the previous frame image obtained from the G-buffer, and t is a coefficient in the range (0,1) representing the time position between the previous frame image I_0 and the next frame image I_1. The bidirectional motion vectors M_t-0 and M_t-1 at time t can be obtained by linear frame interpolation calculation, and the calculation formulas are:
M_t-0 = t × M_1-0
M_t-1 = -(1 - t) × M_1-0
After the bidirectional motion vector estimation result at time t is obtained, the mapping operation is performed on the previous frame image I_0 with M_t-0 and on the next frame image I_1 with M_t-1 respectively to obtain the initial frame interpolation result. The calculation formula is:
I_t_initial = (1 - t) × g(I_0, M_t-0) + t × g(I_1, M_t-1)
where I_t_initial represents the initial frame interpolation result and the function g() represents the mapping operation.
And acquiring depth information and normal information of the previous frame image and the next frame image from the G-buffer, and fusing the depth information and the normal information with the initial frame interpolation result to solve the problem of inaccurate estimation of the motion vector of the non-rigid motion or the shadow motion. And inputting the fused result into a bidirectional frame interpolation network so as to obtain a final frame interpolation result.
The image rendering method according to the embodiment of the present application is described in detail above with reference to fig. 7 to 14, and the image rendering device according to the embodiment of the present application is described in detail below with reference to fig. 15 to 18. It should be understood that the image rendering apparatus of fig. 15 to 18 is capable of performing the steps of the image rendering method of the embodiment of the present application, and the repetitive description will be appropriately omitted when describing the image rendering apparatus shown in fig. 15 to 18.
Fig. 15 is a schematic block diagram of an apparatus for image rendering according to an embodiment of the present application, as shown in fig. 15, including an acquisition module 1501 and a processing module 1502, which are briefly described below.
The acquiring module 1501 is configured to acquire a first image, a second image, and a third image, where the first image, the second image, and the third image are consecutive three frames of images.
The processing module 1502 is configured to update the illumination map of the second image according to the first image to obtain an updated illumination map of the second image.
The processing module 1502 is further configured to input the updated illumination map of the second image into the hyper-resolution denoising network to obtain a hyper-resolution denoised image of the second image.
The processing module 1502 is further configured to update the illumination map of the third image according to the second image to obtain an updated illumination map of the third image.
The processing module 1502 is further configured to input the illumination map of the updated third image into the hyper-resolution denoising network to obtain a hyper-resolution denoised image of the third image.
The processing module 1502 is further configured to obtain an initial frame interpolation image at a target time according to the super-resolution denoised image of the second image and the super-resolution denoised image of the third image, where the target time is a time between the second image and the third image.
The processing module 1502 is further configured to input the initial frame interpolation image into the bidirectional frame interpolation network to obtain a frame interpolation image at the target time.
Optionally, the processing module 1502 is further configured to execute each step of the method in S602 to S607 in fig. 6, which may specifically refer to the description of fig. 6, and for brevity, the embodiment of the present application is not described herein again.
Fig. 16 is a schematic block diagram of an apparatus for training a hyper-resolution denoising network according to an embodiment of the present application, and as shown in fig. 16, the apparatus includes an obtaining module 1601 and a processing module 1602, which are briefly introduced below.
The obtaining module 1601 is configured to obtain multiple sets of super-component denoising original training data, where each set of super-component denoising original training data in the multiple sets of super-component denoising original training data includes two consecutive frames of images and a standard image corresponding to a next frame of image in the two consecutive frames of images;
the obtaining module 1601 is further configured to obtain a depth map of a next frame image, a normal vector map of the next frame image, and an illumination map of the next frame image in two consecutive frames of images, where the illumination map of the next frame image is a direct illumination map or an indirect illumination map;
a processing module 1602, configured to determine consistency of pixel points of two consecutive frames of images;
the processing module 1602 is further configured to update color values of pixel points of a next frame of image according to two consecutive frames of images to obtain an updated illumination map of the next frame of image;
the processing module 1602 is further configured to fuse the depth map of the next frame of image, the normal vector map of the next frame of image, and the updated illumination map of the next frame of image to obtain an updated image;
the processing module 1602 is further configured to train the hyper-resolution denoising network according to the updated image and the standard image.
Fig. 17 shows a schematic block diagram of a device for training a bidirectional frame interpolation network according to an embodiment of the present application, and as shown in fig. 17, includes an obtaining module 1701 and a processing module 1702, which are briefly described below.
An obtaining module 1701, configured to obtain multiple sets of original training data of bidirectional interpolation frames, where each set of original training data of the multiple sets of original training data of bidirectional interpolation frames includes a fourth image, a fifth image, and a sixth image, and the fourth image, the fifth image, and the sixth image are consecutive three images;
a processing module 1702, configured to obtain, according to the fourth image and the sixth image, an interpolated image at an intermediate time between the fourth image and the sixth image;
the processing module 1702 is further configured to train the bidirectional frame interpolation network according to the interpolated frame image at the intermediate time and the fifth image.
Fig. 18 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
It is to be understood that the specific structure of the apparatus 1500 shown in fig. 15 above may be as shown in fig. 18.
The electronic device in fig. 18 includes a communication module 3010, a sensor 3020, a user input module 3030, an output module 3040, a processor 3050, a memory 3070, and a power supply 3080. The processor 3050 may include one or more CPUs, among others.
The electronic device shown in fig. 18 may perform the steps of the graphics rendering method according to the embodiment of the present application, and in particular, one or more CPUs in the processor 3050 may perform the steps of the graphics rendering method according to the embodiment of the present application.
The various modules of the electronic device of fig. 18 are described in detail below.
The communication module 3010 may include at least one module that enables communication between the electronic device and other electronic devices. For example, the communication module 3010 may include one or more of a wired network interface, a broadcast receiving module, a mobile communication module, a wireless internet module, a local area communication module, and a location (or position) information module.
For example, the communication module 3010 can acquire a game screen in real time from the game server side.
The sensor 3020 may sense some operations of the user, and the sensor 3020 may include a distance sensor, a touch sensor, and the like. The sensor 3020 may sense an operation of a user touching or approaching the screen. For example, the sensor 3020 may be capable of sensing some actions by the user at the game interface.
The user input module 3030 is used for receiving input digital information, character information or contact touch operation/non-contact gesture, receiving signal input related to user setting and function control of the system, and the like. The user input module 3030 includes a touch panel and/or other input devices. For example, the user may control the game through the user input module 3030.
The output module 3040 includes a display panel for displaying information input by a user, information provided to the user, various menu interfaces of the system, and the like.
Alternatively, the display panel may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, or the like. In other embodiments, the touch panel may be overlaid on the display panel to form a touch display screen.
In addition, the output module 3040 may further include a video output module, an alarm, a haptic module, and the like. The video output module can display the game picture after graphics rendering.
The power supply 3080 may receive external power and internal power under the control of the processor 3050 and provide power required for the operation of the respective modules of the entire electronic device.
The processor 3050 can include one or more CPUs, and the processor 3050 can also include one or more GPUs.
When the processor 3050 includes a plurality of CPUs, the plurality of CPUs may be integrated on the same chip or may be integrated on different chips.
When the processor 3050 includes a plurality of GPUs, the GPUs may be integrated on the same chip or may be integrated on different chips, respectively.
When the processor 3050 includes both a CPU and a GPU, the CPU and the GPU may be integrated on the same chip.
For example, when the electronic device shown in fig. 18 is a smartphone, the processor of the smartphone typically includes a CPU and a GPU that handle image processing, and both the CPU and the GPU may contain multiple cores.
Memory 3070 may store computer programs, including operating system programs 3072, application programs 3071, and the like. Typical operating systems include Windows from Microsoft Corporation and MacOS from Apple Inc. for desktop or notebook computers, and Android from Google Inc. for mobile terminals.
The memory 3070 may be one or more of the following types: flash memory, hard disk type memory, micro multimedia card type memory, card type memory (e.g., SD or XD memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, or an optical disk. In other embodiments, the memory 3070 may also be a network storage device on the internet, and the system may perform operations such as updating or reading the memory 3070 over the internet.
For example, the memory 3070 may store a computer program (which is a program corresponding to the graphics rendering method according to the embodiment of the present application), and when the processor 3050 executes the computer program, the processor 3050 may execute the graphics rendering method according to the embodiment of the present application.
In addition to computer programs, the memory 3070 also stores other data 3073; for example, the memory 3070 may store data generated during execution of the graphics rendering method of the present application.
The connection relationships among the modules in fig. 18 are only an example; the embodiments of the present application may also be applied to electronic devices with other connection modes, for example, with all the modules connected through a bus.
Embodiments of the present application also provide a computer-readable storage medium storing a computer program executable by a processor; when the computer program is executed by the processor, the processor performs the method according to any one of fig. 6 to 8.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, or the part thereof that contributes beyond the prior art, may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above description covers only specific embodiments of the present application, but the protection scope of the present application is not limited thereto; any change or substitution that a person skilled in the art could readily conceive of within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (23)

1. A method of image rendering, comprising:
acquiring a first image, a second image and a third image, wherein the first image, the second image and the third image are three consecutive frames of images;
updating the illumination map of the second image according to the first image to obtain the updated illumination map of the second image;
inputting the updated illumination map of the second image into a hyper-resolution denoising network to obtain a super-resolution denoised image of the second image;
updating the illumination map of the third image according to the second image to obtain the updated illumination map of the third image;
inputting the updated illumination map of the third image into the hyper-resolution denoising network to obtain a super-resolution denoised image of the third image;
acquiring an initial frame interpolation image at a target moment according to the super-resolution denoised image of the second image and the super-resolution denoised image of the third image, wherein the target moment is a moment between the second image and the third image;
and inputting the initial frame interpolation image into a bidirectional frame interpolation network to obtain the frame interpolation image at the target moment.
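To make the order of operations in this claim easier to follow, the following is a purely illustrative, non-limiting sketch of the pipeline; every function is a hypothetical stand-in chosen by the editor so that the sketch runs, and none of them implements the actual networks or warping of the embodiment.

import numpy as np

# Hypothetical stand-ins for the steps of the claimed pipeline.
def update_illumination(prev_frame, illum):       # temporal accumulation step
    return 0.2 * prev_frame + 0.8 * illum
def super_res_denoise(illum):                     # hyper-resolution denoising network
    return np.repeat(np.repeat(illum, 2, axis=0), 2, axis=1)
def initial_interpolation(img_a, img_b, t):       # motion-vector based initial interpolation
    return (1.0 - t) * img_a + t * img_b
def bidirectional_refine(interp):                 # bidirectional frame interpolation network
    return interp

h, w = 135, 240
frame1, frame2, frame3 = (np.random.rand(h, w, 3) for _ in range(3))

illum2 = update_illumination(frame1, frame2)      # update illumination map of the second image
sr2 = super_res_denoise(illum2)                   # super-resolution denoised second image
illum3 = update_illumination(frame2, frame3)      # update illumination map of the third image
sr3 = super_res_denoise(illum3)                   # super-resolution denoised third image
init = initial_interpolation(sr2, sr3, t=0.5)     # initial frame interpolation image at the target moment
result = bidirectional_refine(init)               # frame interpolation image at the target moment
print(result.shape)                               # (270, 480, 3)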
2. The method of claim 1, wherein the updating the illumination map of the second image according to the first image to obtain the updated illumination map of the second image comprises:
acquiring the illumination map of the second image, wherein the illumination map of the second image comprises color values of a plurality of pixel points, and the illumination map of the second image is a direct illumination map or an indirect illumination map;
acquiring a second pixel point corresponding to a first pixel point in the first image, wherein the first pixel point is any one of the plurality of pixel points;
and updating the color value of the first pixel point according to the color value of the first pixel point and the color value of the second pixel point so as to obtain an updated illumination map.
3. The method of claim 2, wherein when the location of the second pixel point is not on a grid node of the first image, the method further comprises:
obtaining color values of four pixel points closest to the second pixel point, wherein the four pixel points are on grid nodes of the first image;
and obtaining the color value of the second pixel point according to the color values of the four pixel points.
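The four-neighbor reconstruction described in this claim corresponds to standard bilinear interpolation; the NumPy sketch below (array layout and names are assumed for illustration) shows how the color value of the second pixel point can be recovered when its position does not fall on a grid node.

import numpy as np

def bilinear_sample(image, x, y):
    # Interpolate image[y, x] for a non-integer (x, y) from the 4 nearest grid nodes.
    h, w = image.shape[:2]
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0

    c00, c10 = image[y0, x0], image[y0, x1]   # top-left, top-right neighbors
    c01, c11 = image[y1, x0], image[y1, x1]   # bottom-left, bottom-right neighbors

    top = (1 - fx) * c00 + fx * c10
    bottom = (1 - fx) * c01 + fx * c11
    return (1 - fy) * top + fy * bottom

first_image = np.random.rand(270, 480, 3)
color = bilinear_sample(first_image, x=123.4, y=56.7)  # color value of the second pixel point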
4. The method according to claim 2 or 3, wherein before the updating the color value of the first pixel according to the color value of the first pixel and the color value of the second pixel, the method further comprises:
determining the consistency of the first pixel point and the second pixel point, wherein the determining of the consistency of the first pixel point and the second pixel point comprises:
acquiring the depth value of the first pixel point, the normal value of the first pixel point, the patch ID of the first pixel point, the depth value of the second pixel point, the normal value of the second pixel point and the patch ID of the second pixel point, wherein the first pixel point and the second pixel point are consistent when:
the square of the difference between the depth value of the first pixel point and the depth value of the second pixel point is less than a first threshold,
the square of the difference between the normal value of the first pixel point and the normal value of the second pixel point is less than a second threshold,
and the patch ID of the first pixel point is equal to the patch ID of the second pixel point.
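A minimal, non-limiting sketch of the stated consistency test follows; the threshold values and the per-pixel data layout are assumptions for illustration.

def pixels_consistent(p1, p2, depth_thresh=1e-2, normal_thresh=1e-2):
    # p1 / p2: dicts holding 'depth' (float), 'normal' (float) and 'patch_id' (int).
    depth_ok = (p1["depth"] - p2["depth"]) ** 2 < depth_thresh      # first threshold
    normal_ok = (p1["normal"] - p2["normal"]) ** 2 < normal_thresh  # second threshold
    patch_ok = p1["patch_id"] == p2["patch_id"]                     # same patch ID
    return depth_ok and normal_ok and patch_ok

first_pixel = {"depth": 0.42, "normal": 0.88, "patch_id": 7}
second_pixel = {"depth": 0.43, "normal": 0.87, "patch_id": 7}
print(pixels_consistent(first_pixel, second_pixel))  # True: the color update may proceed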
5. The method of any one of claims 2 to 4, wherein the updating the color value of the first pixel according to the color value of the first pixel and the color value of the second pixel comprises:
and the color value of the first pixel point after updating is the sum of the color value of the first pixel point multiplied by a first coefficient and the color value of the second pixel point multiplied by a second coefficient.
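The update in this claim is a linear blend of the two color values; a short sketch with example coefficients (the coefficient values are assumptions, not values disclosed by the application) is:

alpha, beta = 0.2, 0.8                      # first and second coefficients (example values)
color_first, color_second = 0.35, 0.40      # color values of the first and second pixel points
updated_color = alpha * color_first + beta * color_second
print(updated_color)                        # approximately 0.39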
6. The method of any one of claims 1 to 5, wherein the inputting the updated illumination map of the second image into a hyper-resolution denoising network further comprises:
acquiring a depth map of the second image and a normal vector map of the second image;
fusing the depth map of the second image, the normal vector map of the second image and the updated illumination map of the second image to obtain a first fusion result;
and inputting the first fusion result into a hyper-resolution denoising network.
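One plausible reading of the fusion step is channel-wise concatenation of the auxiliary buffers with the updated illumination map before it enters the network; the sketch below assumes that interpretation and uses placeholder tensor shapes.

import torch

depth_map = torch.rand(1, 1, 270, 480)             # depth map of the second image
normal_map = torch.rand(1, 3, 270, 480)            # normal vector map of the second image
updated_illumination = torch.rand(1, 3, 270, 480)  # updated illumination map of the second image

# "Fusion": stack the buffers along the channel dimension into a 7-channel input.
first_fusion_result = torch.cat([depth_map, normal_map, updated_illumination], dim=1)
print(first_fusion_result.shape)                   # torch.Size([1, 7, 270, 480])
# first_fusion_result would then be fed to the hyper-resolution denoising network.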
7. The method according to any one of claims 1 to 6, wherein the acquiring of the initial frame interpolation image at the target moment according to the super-resolution denoised image of the second image and the super-resolution denoised image of the third image comprises:
acquiring a motion vector from the third image to the second image;
determining a first motion vector from the initial frame interpolation image at the target moment to the second image and a second motion vector from the initial frame interpolation image at the target moment to the third image according to the motion vector from the third image to the second image;
and acquiring the initial frame interpolation image at the target moment according to the super-resolution denoised image of the second image, the super-resolution denoised image of the third image, the first motion vector and the second motion vector.
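A minimal, non-limiting sketch of how the single motion vector field from the third image to the second image might be split into the first and second motion vectors and used to build the initial frame interpolation image, assuming linear motion and a simple nearest-neighbor backward warp; the field layout and the final blend are illustrative assumptions.

import numpy as np

def backward_warp(image, flow):
    # Gather pixels displaced by a per-pixel flow (nearest-neighbor, for brevity).
    h, w = image.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    return image[src_y, src_x]

h, w, t = 270, 480, 0.5
sr_frame2 = np.random.rand(h, w, 3)     # super-resolution denoised second image
sr_frame3 = np.random.rand(h, w, 3)     # super-resolution denoised third image
mv_3_to_2 = np.random.randn(h, w, 2)    # motion vector from the third image to the second image

# Assume linear motion: split the vector field into the two per-frame motion vectors.
mv_t_to_2 = t * mv_3_to_2               # first motion vector (target moment -> second image)
mv_t_to_3 = -(1.0 - t) * mv_3_to_2      # second motion vector (target moment -> third image)

warped_from_2 = backward_warp(sr_frame2, mv_t_to_2)
warped_from_3 = backward_warp(sr_frame3, mv_t_to_3)
initial_interpolation = (1.0 - t) * warped_from_2 + t * warped_from_3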
8. The method according to any one of claims 1 to 7, wherein the inputting of the initial frame interpolation image into a bidirectional frame interpolation network further comprises:
acquiring a depth map of the second image, a normal vector map of the second image, a depth map of the third image and a normal vector map of the third image;
fusing the depth map of the second image, the normal vector map of the second image, the depth map of the third image, the normal vector map of the third image and the initial frame interpolation image to obtain a second fusion result;
and inputting the second fusion result into a bidirectional frame interpolation network.
9. The method of any one of claims 1 to 8, wherein the hyper-resolution denoising network is a pre-trained neural network model, and the training of the hyper-resolution denoising network comprises:
acquiring multiple sets of super-resolution denoising original training data, wherein each set of super-resolution denoising original training data in the multiple sets comprises two consecutive frames of images and a standard image corresponding to the next frame image in the two consecutive frames of images;
determining whether the pixel points of the two consecutive frames of images satisfy consistency;
acquiring a depth map of the next frame image, a normal vector map of the next frame image and an illumination map of the next frame image in the two consecutive frames of images, wherein the illumination map of the next frame image is a direct illumination map or an indirect illumination map;
updating the color values of the pixel points of the next frame image according to the two consecutive frames of images to obtain an updated illumination map of the next frame image;
fusing the depth map of the next frame image, the normal vector map of the next frame image and the updated illumination map of the next frame image to obtain an updated image;
and training the hyper-resolution denoising network according to the updated image and the standard image.
10. The method according to any one of claims 1 to 9, wherein the bidirectional frame interpolation network is a pre-trained neural network model, the training of the bidirectional frame interpolation network comprising:
acquiring multiple sets of original training data for bidirectional frame interpolation, wherein each set of original training data in the multiple sets comprises a fourth image, a fifth image and a sixth image, and the fourth image, the fifth image and the sixth image are three consecutive frames of images;
acquiring an interpolated frame image at the intermediate moment between the fourth image and the sixth image according to the fourth image and the sixth image;
and training the bidirectional frame interpolation network according to the interpolated frame image at the intermediate moment and the fifth image.
11. An apparatus for image rendering, comprising:
an acquisition module, a processing module and a display module, wherein the acquisition module is configured to acquire a first image, a second image and a third image, and the first image, the second image and the third image are three consecutive frames of images;
the processing module is used for updating the illumination map of the second image according to the first image so as to obtain the updated illumination map of the second image;
the processing module is further configured to input the updated illumination map of the second image into a hyper-resolution denoising network to obtain a super-resolution denoised image of the second image;
the processing module is further configured to update the illumination map of the third image according to the second image to obtain an updated illumination map of the third image;
the processing module is further configured to input the updated illumination map of the third image into the hyper-resolution denoising network to obtain a super-resolution denoised image of the third image;
the processing module is further configured to obtain an initial frame interpolation image at a target moment according to the super-resolution denoised image of the second image and the super-resolution denoised image of the third image, where the target moment is a moment between the second image and the third image;
the processing module is further configured to input the initial frame interpolation image into a bidirectional frame interpolation network to obtain the frame interpolation image at the target moment.
12. The apparatus of claim 11, wherein the processing module updates the illumination map of the second image according to the first image to obtain the updated illumination map of the second image, and comprises:
acquiring the illumination map of the second image, wherein the illumination map of the second image comprises color values of a plurality of pixel points, and the illumination map of the second image is a direct illumination map or an indirect illumination map;
acquiring a second pixel point corresponding to a first pixel point in the first image, wherein the first pixel point is any one of the plurality of pixel points;
and updating the color value of the first pixel point according to the color value of the first pixel point and the color value of the second pixel point so as to obtain an updated illumination map.
13. The apparatus of claim 12, wherein when the location of the second pixel point is not on a grid node of the first image, the processing module is further configured to:
obtaining color values of four pixel points closest to the second pixel point;
and obtaining the color value of the second pixel point according to the color values of the four pixel points.
14. The apparatus of claim 12 or 13, wherein before the processing module updates the color value of the first pixel according to the color value of the first pixel and the color value of the second pixel, the processing module is further configured to:
determining the consistency of the first pixel point and the second pixel point, wherein the determining of the consistency of the first pixel point and the second pixel point comprises:
acquiring the depth value of the first pixel point, the normal value of the first pixel point, the patch ID of the first pixel point, the depth value of the second pixel point, the normal value of the second pixel point and the patch ID of the second pixel point, wherein the first pixel point and the second pixel point are consistent when:
the square of the difference between the depth value of the first pixel point and the depth value of the second pixel point is less than a first threshold,
the square of the difference between the normal value of the first pixel point and the normal value of the second pixel point is less than a second threshold,
and the patch ID of the first pixel point is equal to the patch ID of the second pixel point.
15. The apparatus according to any one of claims 12 to 14, wherein the updating the color value of the first pixel according to the color value of the first pixel and the color value of the second pixel comprises:
and the color value of the first pixel point after updating is the sum of the color value of the first pixel point multiplied by a first coefficient and the color value of the second pixel point multiplied by a second coefficient.
16. The apparatus according to any one of claims 11 to 15, wherein the processing module inputs the updated illumination map of the second image into a hyper-resolution denoising network, further comprising:
acquiring a depth map of the second image and a normal vector map of the second image;
fusing the depth map of the second image, the normal vector map of the second image and the updated illumination map of the second image to obtain a first fusion result;
and inputting the first fusion result into a hyper-resolution denoising network.
17. The apparatus according to any one of claims 11 to 16, wherein the processing module obtains the initial frame interpolation image at the target moment according to the super-resolution denoised image of the second image and the super-resolution denoised image of the third image, and includes:
acquiring a motion vector from the third image to the second image;
determining a first motion vector from the initial frame interpolation image at the target moment to the second image and a second motion vector from the initial frame interpolation image at the target moment to the third image according to the motion vector from the third image to the second image;
and determining an initial frame interpolation image of the target moment according to the super-resolution denoised image of the second image, the super-resolution denoised image of the third image, the first motion vector and the second motion vector.
18. The apparatus of any one of claims 11 to 17, wherein the processing module inputs the initial frame interpolation image into a bidirectional frame interpolation network, further comprising:
acquiring a depth map of the second image, a normal vector map of the second image, a depth map of the third image and a normal vector map of the third image;
fusing the depth map of the second image, the normal vector map of the second image, the depth map of the third image, the normal vector map of the third image and the initial frame interpolation image to obtain a second fusion result;
and inputting the second fusion result into a bidirectional frame interpolation network.
19. The apparatus of any one of claims 11 to 18, wherein the hyper-resolution denoising network is a pre-trained neural network model, and the training of the hyper-resolution denoising network comprises:
acquiring multiple sets of super-resolution denoising original training data, wherein each set of super-resolution denoising original training data in the multiple sets comprises two consecutive frames of images and a standard image corresponding to the next frame image in the two consecutive frames of images;
determining whether the pixel points of the two consecutive frames of images satisfy consistency;
acquiring a depth map of the next frame image, a normal vector map of the next frame image and an illumination map of the next frame image in the two consecutive frames of images, wherein the illumination map of the next frame image is a direct illumination map or an indirect illumination map;
updating the color values of the pixel points of the next frame image according to the two consecutive frames of images to obtain an updated illumination map of the next frame image;
fusing the depth map of the next frame image, the normal vector map of the next frame image and the updated illumination map of the next frame image to obtain an updated image;
and training the hyper-resolution denoising network according to the updated image and the standard image.
20. The apparatus according to any one of claims 11 to 19, wherein the bidirectional frame interpolation network is a pre-trained neural network model, the training of the bidirectional frame interpolation network comprising:
acquiring multiple sets of original training data for bidirectional frame interpolation, wherein each set of original training data in the multiple sets comprises a fourth image, a fifth image and a sixth image, and the fourth image, the fifth image and the sixth image are three consecutive frames of images;
acquiring an interpolated frame image at the intermediate moment between the fourth image and the sixth image according to the fourth image and the sixth image;
and training the bidirectional frame interpolation network according to the interpolated frame image at the intermediate moment and the fifth image.
21. A computer device, comprising:
a memory for storing a program;
a processor for executing the memory-stored program, the computer device performing the method of any of claims 1 to 10 when the memory-stored program is executed by the processor.
22. An electronic device, characterized in that it comprises means for image rendering according to any one of claims 11 to 21.
23. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program executable by a processor, and when the computer program is executed by the processor, the processor performs the method according to any one of claims 1 to 10.
CN202010971444.5A 2020-09-16 2020-09-16 Image rendering method and device Pending CN112184575A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202010971444.5A CN112184575A (en) 2020-09-16 2020-09-16 Image rendering method and device
PCT/CN2021/115203 WO2022057598A1 (en) 2020-09-16 2021-08-30 Image rendering method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010971444.5A CN112184575A (en) 2020-09-16 2020-09-16 Image rendering method and device

Publications (1)

Publication Number Publication Date
CN112184575A true CN112184575A (en) 2021-01-05

Family

ID=73921318

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010971444.5A Pending CN112184575A (en) 2020-09-16 2020-09-16 Image rendering method and device

Country Status (2)

Country Link
CN (1) CN112184575A (en)
WO (1) WO2022057598A1 (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113067960A (en) * 2021-03-16 2021-07-02 合肥合芯微电子科技有限公司 Image interpolation method, device and storage medium
CN113592998A (en) * 2021-06-29 2021-11-02 北京百度网讯科技有限公司 Relighting image generation method and device and electronic equipment
CN113947547A (en) * 2021-10-19 2022-01-18 东北大学 Monte Carlo rendering graph noise reduction method based on multi-scale kernel prediction convolutional neural network
WO2022057598A1 (en) * 2020-09-16 2022-03-24 华为技术有限公司 Image rendering method and device
CN116453456A (en) * 2023-06-14 2023-07-18 北京七维视觉传媒科技有限公司 LED screen calibration method and device, electronic equipment and storage medium
CN116672707A (en) * 2023-08-04 2023-09-01 荣耀终端有限公司 Method and electronic device for generating game prediction frame
US20230281906A1 (en) * 2022-03-03 2023-09-07 Nvidia Corporation Motion vector optimization for multiple refractive and reflective interfaces

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116485678B (en) * 2023-04-28 2024-02-09 深圳联安通达科技有限公司 Image processing method based on embedded operating system


Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103778656B (en) * 2014-02-12 2018-09-07 腾讯科技(深圳)有限公司 A kind of image rendering method, device and electronic equipment
US9940858B2 (en) * 2016-05-16 2018-04-10 Unity IPR ApS System and method for assymetric rendering to eyes in augmented reality and virtual reality devices
CN109743626B (en) * 2019-01-02 2022-08-12 京东方科技集团股份有限公司 Image display method, image processing method and related equipment
CN112184575A (en) * 2020-09-16 2021-01-05 华为技术有限公司 Image rendering method and device

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010046989A1 (en) * 2008-10-23 2010-04-29 Pioneer Corporation Frame rate converting device, image processing device, display, frame rate converting method, its program, and recording medium where the program is recorded
CN105517671A (en) * 2015-05-25 2016-04-20 北京大学深圳研究生院 Video frame interpolation method and system based on optical flow method
US20180322691A1 (en) * 2017-05-05 2018-11-08 Disney Enterprises, Inc. Real-time rendering with compressed animated light fields
CN110136055A (en) * 2018-02-02 2019-08-16 腾讯科技(深圳)有限公司 Super-resolution method and device, storage medium, the electronic device of image
US20200202493A1 (en) * 2018-12-21 2020-06-25 Intel Corporation Apparatus and method for efficient distributed denoising of a graphics frame
CN111510691A (en) * 2020-04-17 2020-08-07 Oppo广东移动通信有限公司 Color interpolation method and device, equipment and storage medium

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
HUAIZU JIANG ET AL.: "Super SloMo: High Quality Estimation of Multiple Intermediate Frames for Video Interpolation", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 31 December 2018 (2018-12-31), pages 9000-9008 *
PENG GAO ET AL.: "High performance visual tracking with circular and structural operators", Knowledge-Based Systems, vol. 161, 1 December 2018 (2018-12-01), pages 240-253 *
WU YUNTAO: "Research and Application of Ray Tracing Technology on Mobile Platforms", China Master's Theses Full-text Database, Information Science and Technology (Monthly), no. 1, 15 January 2014 (2014-01-15), pages 138-2009 *
WANG PENG: "Color Enhancement of Low-Light Images Based on Illumination and Reflection Analysis", China Master's Theses Full-text Database, Information Science and Technology (Monthly), no. 3, 15 March 2017 (2017-03-15), pages 138-4836 *
CHEN PENG ET AL.: "Image Processing Methods for Static Targets under Different Illumination Conditions", Forensic Science and Technology, no. 2, 15 April 2013 (2013-04-15), pages 55-56 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022057598A1 (en) * 2020-09-16 2022-03-24 华为技术有限公司 Image rendering method and device
CN113067960A (en) * 2021-03-16 2021-07-02 合肥合芯微电子科技有限公司 Image interpolation method, device and storage medium
CN113067960B (en) * 2021-03-16 2022-08-12 合肥合芯微电子科技有限公司 Image interpolation method, device and storage medium
CN113592998A (en) * 2021-06-29 2021-11-02 北京百度网讯科技有限公司 Relighting image generation method and device and electronic equipment
CN113947547A (en) * 2021-10-19 2022-01-18 东北大学 Monte Carlo rendering graph noise reduction method based on multi-scale kernel prediction convolutional neural network
CN113947547B (en) * 2021-10-19 2024-04-09 东北大学 Monte Carlo rendering graph noise reduction method based on multi-scale kernel prediction convolutional neural network
US20230281906A1 (en) * 2022-03-03 2023-09-07 Nvidia Corporation Motion vector optimization for multiple refractive and reflective interfaces
US11836844B2 (en) * 2022-03-03 2023-12-05 Nvidia Corporation Motion vector optimization for multiple refractive and reflective interfaces
CN116453456A (en) * 2023-06-14 2023-07-18 北京七维视觉传媒科技有限公司 LED screen calibration method and device, electronic equipment and storage medium
CN116453456B (en) * 2023-06-14 2023-08-18 北京七维视觉传媒科技有限公司 LED screen calibration method and device, electronic equipment and storage medium
CN116672707A (en) * 2023-08-04 2023-09-01 荣耀终端有限公司 Method and electronic device for generating game prediction frame
CN116672707B (en) * 2023-08-04 2023-10-20 荣耀终端有限公司 Method and electronic device for generating game prediction frame

Also Published As

Publication number Publication date
WO2022057598A1 (en) 2022-03-24

Similar Documents

Publication Publication Date Title
WO2022057598A1 (en) Image rendering method and device
TWI764974B (en) Filtering image data using a neural network
US10930022B2 (en) Motion adaptive rendering using variable rate shading
US20230053462A1 (en) Image rendering method and apparatus, device, medium, and computer program product
Weier et al. Foveated real‐time ray tracing for head‐mounted displays
US9129443B2 (en) Cache-efficient processor and method of rendering indirect illumination using interleaving and sub-image blur
US9881391B2 (en) Procedurally defined texture maps
Navarro et al. Motion blur rendering: State of the art
US8970583B1 (en) Image space stylization of level of detail artifacts in a real-time rendering engine
US9275493B2 (en) Rendering vector maps in a geographic information system
US11373358B2 (en) Ray tracing hardware acceleration for supporting motion blur and moving/deforming geometry
US11887256B2 (en) Deferred neural rendering for view extrapolation
US11089320B2 (en) Adaptive pixel sampling order for temporally dense rendering
US11308658B2 (en) Motion adaptive rendering using variable rate shading
US20210012562A1 (en) Probe-based dynamic global illumination
WO2022143367A1 (en) Image rendering method and related device therefor
US11120609B2 (en) Reconstruction for temporally dense ray trace rendering
US20240177394A1 (en) Motion vector optimization for multiple refractive and reflective interfaces
Schmitz et al. High-fidelity point-based rendering of large-scale 3-D scan datasets
US7525551B1 (en) Anisotropic texture prefiltering
Schwandt et al. Environment estimation for glossy reflections in mixed reality applications using a neural network
Díaz-García et al. Fast illustrative visualization of fiber tracts
US10453247B1 (en) Vertex shift for rendering 360 stereoscopic content
Smit et al. A shared-scene-graph image-warping architecture for VR: Low latency versus image quality
CN114842127A (en) Terrain rendering method and device, electronic equipment, medium and product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination