CN112184575B - Image rendering method and device - Google Patents
Image rendering method and device
- Publication number
- CN112184575B (application CN202010971444.5A)
- Authority
- CN
- China
- Prior art keywords
- image
- pixel point
- frame
- map
- denoising
- Prior art date
- Legal status
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/70—Denoising; Smoothing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T15/00—3D [Three Dimensional] image rendering
- G06T15/10—Geometric effects
- G06T15/20—Perspective computation
- G06T15/205—Image-based rendering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T15/00—3D [Three Dimensional] image rendering
- G06T15/50—Lighting effects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/50—Image enhancement or restoration using two or more images, e.g. averaging or subtraction
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/50—Depth or shape recovery
- G06T7/55—Depth or shape recovery from multiple images
- G06T7/593—Depth or shape recovery from multiple images from stereo images
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/90—Determination of colour characteristics
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20212—Image combination
- G06T2207/20221—Image fusion; Image merging
Abstract
The application provides an image rendering method and device capable of producing high-resolution, high-frame-rate rendered images at low sampling values. The method comprises the following steps: acquiring three consecutive frame images, namely a first image, a second image and a third image; updating the illumination map of the second image according to the first image to obtain an updated illumination map of the second image; inputting the updated illumination map of the second image into a super-resolution denoising network to obtain a super-resolution denoised image of the second image; updating the illumination map of the third image according to the second image to obtain an updated illumination map of the third image; inputting the updated illumination map of the third image into the super-resolution denoising network to obtain a super-resolution denoised image of the third image; obtaining an initial interpolated frame image at a target time according to the super-resolution denoised images of the second and third images, the target time being a time between the second image and the third image; and inputting the initial interpolated frame image into a bidirectional frame interpolation network to obtain the interpolated frame image at the target time.
Description
Technical Field
The present application relates to the field of image processing technology, and more particularly, to a method and apparatus for image rendering.
Background
Ray tracing is a technique used in modern film, gaming and similar fields to generate or enhance special visual effects. By tracing each ray emitted from the camera, it produces global illumination effects such as ambient occlusion, indirect reflection and diffuse scattering, allowing rendered frames to blend seamlessly with reality.
Current mainstream ray tracing falls into three modes: offline, interactive and real-time. The offline mode gives the best rendering quality but is very time-consuming; the interactive mode balances quality against time; the real-time mode sacrifices some rendering quality to meet real-time requirements. Film production is non-interactive, so a large number of servers can render offline, whereas games require real-time human-computer interaction, so game developers can only compute each frame with real-time rendering, which imposes an enormous computational load. In ray tracing, the per-pixel ray sampling value directly determines rendering quality: a high sampling value means an enormous amount of computation, while a low sampling value, although it preserves real-time rendering, introduces a great deal of noise and degrades picture quality.
In the prior art, an Optix-based path tracing algorithm takes 70 ms to render a frame at 1 sample per pixel (spp) in the Sponza Glossy scene and 260 ms at 1 spp in the San Miguel scene, whereas the game industry allows at most about 16 ms per frame, so this algorithm cannot meet the requirement. Therefore, to achieve real-time rendering under limited hardware conditions, low sampling values must be combined with noise reduction algorithms. Table 1 shows the performance of existing noise reduction algorithms at low sampling values of 1 to 2 spp:
TABLE 1
| Noise reduction algorithm | Sampling value (spp) | Resolution | Time (ms) | Hardware |
| --- | --- | --- | --- | --- |
| SBF | 1 | 720P | 7402 | Titan XP |
| AAF | 1 | 720P | 211 | Titan XP |
| LBF | 1 | 720P | 1550 | Titan XP |
| NFOR | 1 | 720P | 107~121 | Intel i7-7700HQ |
| AE | 1 | 720P | 54.9 | Titan XP |
| KPCN | 32 | 1080P | 12000 | Nvidia Quadro M6000 |
| SVGF | 1 | 720P | 4~5 | Titan XP |
In the table above, Titan XP, Intel i7-7700HQ and Nvidia Quadro M6000 are all high-performance hardware. The SBF (sure based filter), AAF (axis aligned filter for both soft shadows), LBF (learning based filter), NFOR (nonlinearly weighted first order regression) and AE (interactive reconstruction of Monte Carlo image sequences using a recurrent denoising autoencoder) noise reduction algorithms take too long to produce a 720P rendered image at low sampling values and cannot meet the requirement; the KPCN (kernel predicting convolutional networks) denoising algorithm requires a higher sampling value to produce a 1080P rendered image and is also far too slow; the spatiotemporal variance guided filtering (SVGF) noise reduction algorithm produces a 720P rendered image at a low sampling value within the time budget, but 720P resolution cannot guarantee smooth gameplay. As the table shows, existing real-time ray tracing still suffers from a large computational load and high hardware requirements, and cannot meet the time budget of game rendering, because achieving a good rendering result at low sampling values takes too long. It is therefore particularly important to obtain a high-frame-rate, high-resolution real-time rendering result without increasing hardware cost.
Disclosure of Invention
The application provides an image rendering method and device capable of producing high-resolution, high-frame-rate rendered images at low sampling values.
In a first aspect, a method of image rendering is provided, the method comprising: acquiring a first image, a second image and a third image, the first image, the second image and the third image being three consecutive frame images; updating the illumination map of the second image according to the first image to obtain an updated illumination map of the second image; inputting the updated illumination map of the second image into a super-resolution denoising network to obtain a super-resolution denoised image of the second image; updating the illumination map of the third image according to the second image to obtain an updated illumination map of the third image; inputting the updated illumination map of the third image into the super-resolution denoising network to obtain a super-resolution denoised image of the third image; obtaining an initial interpolated frame image at a target time according to the super-resolution denoised image of the second image and the super-resolution denoised image of the third image, the target time being a time between the second image and the third image; and inputting the initial interpolated frame image into a bidirectional frame interpolation network to obtain an interpolated frame image at the target time.
The image rendering method of the embodiments of the application can process images with a low sampling value (for example, 1 spp), which greatly reduces the hardware requirements; it accumulates color values over pixel points and updates the low-sampling-value illumination map, which alleviates the noise caused by insufficient sampling information; it processes the image with a super-resolution denoising network, which raises the image resolution; and it interpolates a frame between two consecutive images and refines it with a bidirectional frame interpolation network, which raises the frame rate and ensures smooth rendering.
With reference to the first aspect, in some implementations of the first aspect, updating the illumination map of the second image according to the first image to obtain the updated illumination map of the second image includes: acquiring the illumination map of the second image, the illumination map of the second image comprising color values of a plurality of pixel points and being a direct illumination map or an indirect illumination map; acquiring a second pixel point in the first image corresponding to a first pixel point, the first pixel point being any one of the plurality of pixel points; and updating the color value of the first pixel point according to the color value of the first pixel point and the color value of the second pixel point to obtain the updated illumination map.
To address the severe noise caused by low sampling values, the image rendering method of the embodiments of the application accumulates temporal information: it updates the color value of each pixel point by combining it with the pixel point's historical color information, thereby updating the illumination map of the image, compensating for the insufficient sampling value and reducing rendering noise.
With reference to the first aspect, in certain implementations of the first aspect, when the position of the second pixel point does not lie on a grid node of the first image, the method further includes: acquiring the color values of the four pixel points closest to the second pixel point, the four pixel points lying on grid nodes of the first image; and obtaining the color value of the second pixel point according to the color values of the four pixel points.
The image rendering method of the embodiments of the application takes into account that the pixel point corresponding to a pixel of the next frame may not lie on a grid node of the previous frame, in which case the color value of that pixel point in the previous frame cannot be read directly; the method therefore obtains it by bilinear interpolation, as illustrated below.
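As an illustration only, the following Python sketch shows bilinear sampling of a previous-frame buffer at a non-integer reprojected position; the function name `bilinear_sample` and the buffer sizes are assumptions, not identifiers from the patent.

```python
import numpy as np

def bilinear_sample(image, x, y):
    """Bilinearly interpolate `image` (H x W x C) at the fractional position (x, y).

    Generic bilinear-interpolation sketch, not code from the patent: the four
    surrounding grid pixels are weighted by the fractional offsets.
    """
    h, w = image.shape[:2]
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * image[y0, x0] + fx * image[y0, x1]
    bottom = (1 - fx) * image[y1, x0] + fx * image[y1, x1]
    return (1 - fy) * top + fy * bottom

# Example: sample the previous frame's color at a reprojected, non-integer position.
prev_frame = np.random.rand(720, 1280, 3).astype(np.float32)
color = bilinear_sample(prev_frame, 100.3, 42.7)
```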
With reference to the first aspect, in some implementations of the first aspect, before updating the color value of the first pixel point according to the color value of the first pixel point and the color value of the second pixel point, the method further includes: checking the consistency of the first pixel point and the second pixel point, where checking the consistency of the first pixel point and the second pixel point includes: acquiring the depth value, normal vector and patch ID of the first pixel point and the depth value, normal vector and patch ID of the second pixel point; the square of the difference between the depth value of the first pixel point and the depth value of the second pixel point is smaller than a first threshold; the square of the difference between the normal vector of the first pixel point and the normal vector of the second pixel point is smaller than a second threshold; and the patch ID of the first pixel point is equal to the patch ID of the second pixel point.
To ensure that the second pixel point really is the pixel point in the first image corresponding to the first pixel point, before the color value of the second pixel point is used to update the color value of the first pixel point, the method of the embodiments of the application further checks the consistency of the first pixel point and the second pixel point. Likewise, if the position of the second pixel point does not lie on a grid node of the first image, its depth value, normal vector and patch ID cannot be read directly and, as with the color value, must be obtained by bilinear interpolation. A sketch of the consistency test follows.
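A minimal Python sketch of such a consistency test; the threshold values and the function name `is_consistent` are illustrative assumptions rather than values taken from the patent.

```python
import numpy as np

def is_consistent(depth_a, normal_a, patch_a, depth_b, normal_b, patch_b,
                  depth_thresh=1e-2, normal_thresh=1e-1):
    """Consistency test between a pixel and its reprojected counterpart.

    Sketch of the three conditions described above; the thresholds are
    illustrative assumptions, not values from the patent.
    """
    depth_ok = (depth_a - depth_b) ** 2 < depth_thresh
    # squared difference of the normal vectors, summed over x/y/z components
    normal_ok = np.sum((np.asarray(normal_a) - np.asarray(normal_b)) ** 2) < normal_thresh
    patch_ok = patch_a == patch_b
    return depth_ok and normal_ok and patch_ok

# Example: reuse the previous frame's color only if the pixel passes all three tests.
ok = is_consistent(0.52, (0.0, 0.0, 1.0), 7, 0.53, (0.0, 0.1, 0.99), 7)
```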
With reference to the first aspect, in some implementations of the first aspect, updating the color value of the first pixel point according to the color value of the first pixel point and the color value of the second pixel point includes: the updated color value of the first pixel point is the sum of the color value of the first pixel point multiplied by a first coefficient and the color value of the second pixel point multiplied by a second coefficient.
The image rendering method of the embodiments of the application thus specifies how the color value of the first pixel point is updated from the color values of the first and second pixel points, where the first coefficient and the second coefficient are preset values, for example as written out below.
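Written out, the update takes the form below, where C_cur is the color value of the first pixel point, C_prev the color value of the second pixel point at the reprojected position p' in the first image, and alpha, beta the preset first and second coefficients; the normalization alpha + beta = 1 is a common choice assumed here, not a requirement stated in the text.

```latex
C_{\text{updated}}(p) = \alpha \, C_{\text{cur}}(p) + \beta \, C_{\text{prev}}(p'),
\qquad \text{e.g. } \alpha + \beta = 1
```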
With reference to the first aspect, in certain implementations of the first aspect, inputting the updated illumination map of the second image into the super-resolution denoising network further includes: acquiring a depth map of the second image and a normal vector map of the second image; fusing the depth map of the second image, the normal vector map of the second image and the updated illumination map of the second image to obtain a first fusion result; and inputting the first fusion result into the super-resolution denoising network.
To mitigate inaccurate motion vector estimation for non-rigid motion and shadow motion, the image rendering method of the embodiments of the application further fuses the features of the depth map, the normal vector map and the updated illumination map, as sketched below.
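A minimal PyTorch sketch of this fusion step; the 7-channel layout (3 illumination, 1 depth, 3 normal channels) and the tensor sizes are assumptions for illustration.

```python
import torch

# Assumed layout: H x W feature maps as tensors. The depth map (1 channel),
# normal vector map (3 channels) and updated illumination map (3 channels)
# are concatenated along the channel dimension before being fed to the
# super-resolution denoising network.
depth = torch.rand(1, 1, 720, 1280)
normals = torch.rand(1, 3, 720, 1280)
illum = torch.rand(1, 3, 720, 1280)

fused = torch.cat([illum, depth, normals], dim=1)   # shape: (1, 7, 720, 1280)
# denoised_sr = sr_denoise_net(fused)               # hypothetical network call
```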
With reference to the first aspect, in certain implementations of the first aspect, obtaining the initial interpolated frame image at the target time according to the super-resolution denoised image of the second image and the super-resolution denoised image of the third image includes: acquiring the motion vectors from the third image to the second image; determining, from the motion vectors from the third image to the second image, a first motion vector from the initial interpolated frame image at the target time to the second image and a second motion vector from the initial interpolated frame image at the target time to the third image; and obtaining the initial interpolated frame image at the target time according to the super-resolution denoised image of the second image, the super-resolution denoised image of the third image, the first motion vector and the second motion vector.
The image rendering method of the embodiments of the application interpolates a frame between two consecutive images, thereby raising the frame rate of image rendering; a sketch of this initial interpolation is given below.
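The following PyTorch sketch illustrates one way to form such an initial interpolated frame by splitting the frame-3-to-frame-2 motion vectors at the midpoint and warping both denoised frames toward the target time; the sign conventions, the 0.5/0.5 blend and the helper `backward_warp` are assumptions, not the patent's exact procedure.

```python
import torch
import torch.nn.functional as F

def backward_warp(image, flow):
    """Warp `image` (N,C,H,W) with a per-pixel displacement `flow` (N,2,H,W) in pixels."""
    n, _, h, w = image.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid_x = (xs.float() + flow[:, 0]) / (w - 1) * 2 - 1   # normalize to [-1, 1]
    grid_y = (ys.float() + flow[:, 1]) / (h - 1) * 2 - 1
    grid = torch.stack([grid_x, grid_y], dim=-1)            # (N, H, W, 2)
    return F.grid_sample(image, grid, align_corners=True)

# Midpoint interpolation t = 0.5: split the frame-3-to-frame-2 motion vectors,
# warp both denoised frames toward the target time and average them.
mv_3_to_2 = torch.rand(1, 2, 720, 1280) * 4 - 2             # stand-in motion vectors (pixels)
frame2, frame3 = torch.rand(1, 3, 720, 1280), torch.rand(1, 3, 720, 1280)
initial = 0.5 * backward_warp(frame2, 0.5 * mv_3_to_2) + \
          0.5 * backward_warp(frame3, -0.5 * mv_3_to_2)
```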
With reference to the first aspect, in certain implementations of the first aspect, inputting the initial interpolated frame image into the bidirectional frame interpolation network further includes: acquiring the depth map of the second image, the normal vector map of the second image, the depth map of the third image and the normal vector map of the third image; fusing the depth map of the second image, the normal vector map of the second image, the depth map of the third image and the normal vector map of the third image with the initial interpolated frame image to obtain a second fusion result; and inputting the second fusion result into the bidirectional frame interpolation network.
To compensate for inaccurate motion vector estimation of non-rigid motion and shadow motion, the image rendering method of the embodiments of the application further fuses the features of the depth maps and normal vector maps with the initial interpolated frame image.
With reference to the first aspect, in certain implementations of the first aspect, the super-resolution denoising network is a pre-trained neural network model, and its training includes: acquiring multiple groups of super-resolution denoising original training data, each group comprising two consecutive frame images and a standard image corresponding to the later of the two frames; checking the pixel-point consistency of the two consecutive frames; acquiring the depth map, the normal vector map and the illumination map of the later frame, the illumination map of the later frame being a direct illumination map or an indirect illumination map; updating the color values of the pixel points of the later frame according to the two consecutive frames to obtain an updated illumination map of the later frame; fusing the depth map, the normal vector map and the updated illumination map of the later frame to obtain an updated image; and training the super-resolution denoising network with the updated image and the standard image.
The embodiments of the application further provide a training method for the super-resolution denoising network: low-sampling-value (for example, 1 spp) images are acquired, the rendering result of the later frame at a high sampling value (for example, 4096 spp) is taken as the standard image, and the neural network is trained against it; a sketch of such a training step is given below.
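A training-step sketch under stated assumptions: the real model is a U-Net-style super-resolution denoising network, replaced here by a tiny stand-in CNN with 2x upsampling, and random tensors stand in for the (fused low-spp input, high-spp reference) pairs; the L1 loss is likewise an assumption.

```python
import torch
import torch.nn as nn

# Stand-in network: 7 fused input channels -> 2x upsampled 3-channel output.
net = nn.Sequential(
    nn.Conv2d(7, 32, 3, padding=1), nn.ReLU(),
    nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
    nn.Conv2d(32, 3, 3, padding=1),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-4)
loss_fn = nn.L1Loss()

for step in range(10):
    fused_lowspp = torch.rand(2, 7, 90, 160)   # depth + normals + accumulated illumination (1 spp)
    reference = torch.rand(2, 3, 180, 320)     # high-spp "standard image" at the upscaled size
    pred = net(fused_lowspp)
    loss = loss_fn(pred, reference)
    opt.zero_grad()
    loss.backward()
    opt.step()
```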
With reference to the first aspect, in certain implementations of the first aspect, the bidirectional frame interpolation network is a pre-trained neural network model, and its training includes: acquiring multiple groups of bidirectional frame interpolation original training data, each group comprising a fourth image, a fifth image and a sixth image, the fourth image, the fifth image and the sixth image being three consecutive frame images; obtaining an interpolated frame image for the intermediate time between the fourth image and the sixth image according to the fourth image and the sixth image; and training the bidirectional frame interpolation network with the interpolated frame image at the intermediate time and the fifth image.
The embodiments of the application further provide a training method for the bidirectional frame interpolation network: three consecutive images, a fourth, fifth and sixth image, are acquired; the fourth and sixth images are interpolated to obtain an initial interpolation result; and the neural network is then trained with the fifth image as the reference, as sketched below.
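A corresponding sketch for the bidirectional frame interpolation network; the stand-in refinement network, random tensors and L1 loss are assumptions, not the patent's architecture.

```python
import torch
import torch.nn as nn

# The coarse interpolated frame computed from frames t-1 and t+1 (a random
# stand-in here) is refined by a small network supervised by the real middle frame t.
refine_net = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                           nn.Conv2d(32, 3, 3, padding=1))
opt = torch.optim.Adam(refine_net.parameters(), lr=1e-4)

for step in range(10):
    coarse_mid = torch.rand(2, 3, 180, 320)    # initial interpolation of the 4th and 6th images
    true_mid = torch.rand(2, 3, 180, 320)      # the 5th image: ground-truth middle frame
    loss = nn.functional.l1_loss(refine_net(coarse_mid), true_mid)
    opt.zero_grad()
    loss.backward()
    opt.step()
```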
In a second aspect, an apparatus for image rendering is provided, the apparatus comprising: an acquisition module for acquiring a first image, a second image and a third image, the first image, the second image and the third image being three consecutive frame images; and a processing module for updating the illumination map of the second image according to the first image to obtain an updated illumination map of the second image; inputting the updated illumination map of the second image into a super-resolution denoising network to obtain a super-resolution denoised image of the second image; updating the illumination map of the third image according to the second image to obtain an updated illumination map of the third image; inputting the updated illumination map of the third image into the super-resolution denoising network to obtain a super-resolution denoised image of the third image; obtaining an initial interpolated frame image at a target time according to the super-resolution denoised image of the second image and the super-resolution denoised image of the third image, the target time being a time between the second image and the third image; and inputting the initial interpolated frame image into a bidirectional frame interpolation network to obtain an interpolated frame image at the target time.
With reference to the second aspect, in some implementations of the second aspect, the processing module updating the illumination map of the second image according to the first image to obtain the updated illumination map of the second image includes: acquiring the illumination map of the second image, the illumination map of the second image comprising color values of a plurality of pixel points and being a direct illumination map or an indirect illumination map; acquiring a second pixel point in the first image corresponding to a first pixel point, the first pixel point being any one of the plurality of pixel points; and updating the color value of the first pixel point according to the color value of the first pixel point and the color value of the second pixel point to obtain the updated illumination map.
With reference to the second aspect, in some implementations of the second aspect, when the position of the second pixel point does not lie on a grid node of the first image, the processing module is further configured to: acquire the color values of the four pixel points closest to the second pixel point; and obtain the color value of the second pixel point according to the color values of the four pixel points.
With reference to the second aspect, in some implementations of the second aspect, before updating the color value of the first pixel point according to the color value of the first pixel point and the color value of the second pixel point, the processing module is further configured to check the consistency of the first pixel point and the second pixel point, which includes: acquiring the depth value, normal vector and patch ID of the first pixel point and the depth value, normal vector and patch ID of the second pixel point; the square of the difference between the depth value of the first pixel point and the depth value of the second pixel point is smaller than a first threshold; the square of the difference between the normal vector of the first pixel point and the normal vector of the second pixel point is smaller than a second threshold; and the patch ID of the first pixel point is equal to the patch ID of the second pixel point.
With reference to the second aspect, in some implementations of the second aspect, updating the color value of the first pixel point according to the color value of the first pixel point and the color value of the second pixel point includes: the updated color value of the first pixel point is the sum of the color value of the first pixel point multiplied by a first coefficient and the color value of the second pixel point multiplied by a second coefficient.
With reference to the second aspect, in some implementations of the second aspect, the processing module inputting the updated illumination map of the second image into the super-resolution denoising network further includes: acquiring a depth map of the second image and a normal vector map of the second image; fusing the depth map of the second image, the normal vector map of the second image and the updated illumination map of the second image to obtain a first fusion result; and inputting the first fusion result into the super-resolution denoising network.
With reference to the second aspect, in some implementations of the second aspect, the processing module obtaining the initial interpolated frame image at the target time according to the super-resolution denoised image of the second image and the super-resolution denoised image of the third image includes: acquiring the motion vectors from the third image to the second image; determining, from the motion vectors from the third image to the second image, a first motion vector from the initial interpolated frame image at the target time to the second image and a second motion vector from the initial interpolated frame image at the target time to the third image; and determining the initial interpolated frame image at the target time according to the super-resolution denoised image of the second image, the super-resolution denoised image of the third image, the first motion vector and the second motion vector.
With reference to the second aspect, in certain implementations of the second aspect, the processing module inputting the initial interpolated frame image into the bidirectional frame interpolation network further includes: acquiring the depth map of the second image, the normal vector map of the second image, the depth map of the third image and the normal vector map of the third image; fusing the depth map of the second image, the normal vector map of the second image, the depth map of the third image and the normal vector map of the third image with the initial interpolated frame image to obtain a second fusion result; and inputting the second fusion result into the bidirectional frame interpolation network.
With reference to the second aspect, in certain implementations of the second aspect, the super-resolution denoising network is a pre-trained neural network model, and its training includes: acquiring multiple groups of super-resolution denoising original training data, each group comprising two consecutive frame images and a standard image corresponding to the later of the two frames; checking the pixel-point consistency of the two consecutive frames; acquiring the depth map, the normal vector map and the illumination map of the later frame, the illumination map of the later frame being a direct illumination map or an indirect illumination map; updating the color values of the pixel points of the later frame according to the two consecutive frames to obtain an updated illumination map of the later frame; fusing the depth map, the normal vector map and the updated illumination map of the later frame to obtain an updated image; and training the super-resolution denoising network with the updated image and the standard image.
With reference to the second aspect, in certain implementations of the second aspect, the bidirectional frame interpolation network is a pre-trained neural network model, and its training includes: acquiring multiple groups of bidirectional frame interpolation original training data, each group comprising a fourth image, a fifth image and a sixth image, the fourth image, the fifth image and the sixth image being three consecutive frame images; obtaining an interpolated frame image for the intermediate time between the fourth image and the sixth image according to the fourth image and the sixth image; and training the bidirectional frame interpolation network with the interpolated frame image at the intermediate time and the fifth image.
In a third aspect, there is provided an apparatus for image rendering, the apparatus comprising: a memory for storing a program; a processor for executing a program stored in the memory, the processor performing part or all of the operations in any one of the modes described above when the program stored in the memory is executed by the processor.
In a fourth aspect, there is provided an electronic device comprising the apparatus for image rendering of any one of the modes of the second aspect.
In a fifth aspect, there is provided a computer readable storage medium storing a computer program executable by a processor, the processor performing part or all of the operations in any one of the modes of the first aspect described above when the computer program is executed by the processor.
In a sixth aspect, there is provided a chip comprising a processor for performing part or all of the operations of the method described in the first aspect above.
In a seventh aspect, there is provided a computer program or computer program product comprising computer readable instructions which, when executed by a processor, cause the processor to perform part or all of the operations in any of the ways of the first aspect described above.
Drawings
FIG. 1 is a schematic block diagram of ray tracing and rasterization according to an embodiment of the present application;
FIG. 2 is a schematic block diagram of a U-Net neural network according to an embodiment of the present application;
FIG. 3 is a schematic block diagram of an electronic device according to an embodiment of the present application;
FIG. 4 is a schematic block diagram of the system architecture of an existing image rendering method based on real-time ray tracing;
FIG. 5 is a schematic block diagram of the system architecture of an image rendering method based on real-time ray tracing according to an embodiment of the present application;
FIG. 6 is a schematic flow chart of an image rendering method according to an embodiment of the present application;
FIG. 7 is a schematic flow chart of training the super-resolution denoising network according to an embodiment of the present application;
FIG. 8 is a schematic flow chart of training the bidirectional frame interpolation network according to an embodiment of the present application;
FIG. 9 is a schematic block diagram of an image rendering method according to an embodiment of the present application;
FIG. 10 is a schematic flow chart of acquiring a data set according to an embodiment of the application;
FIG. 11 is a schematic block diagram of a rasterization process according to an embodiment of the present application;
FIG. 12 is a schematic block diagram of acquiring the parameters of a previous-frame pixel using bilinear interpolation according to an embodiment of the application;
FIG. 13 is a schematic block diagram of super-resolving and denoising an image using the super-resolution denoising network according to an embodiment of the present application;
FIG. 14 is a schematic block diagram of processing an image using the bidirectional frame interpolation network according to an embodiment of the present application;
FIG. 15 is a schematic block diagram of an apparatus for image rendering according to an embodiment of the present application;
FIG. 16 is a schematic block diagram of an apparatus for training the super-resolution denoising network according to an embodiment of the present application;
FIG. 17 is a schematic block diagram of an apparatus for training the bidirectional frame interpolation network according to an embodiment of the present application;
fig. 18 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The terminology used in the following examples is for the purpose of describing particular embodiments only and is not intended to limit the application. As used in the specification of the application and the appended claims, the singular forms "a", "an" and "the" are intended to include expressions such as "one or more", unless the context clearly indicates otherwise. It should also be understood that in the following embodiments of the present application, "at least one" and "one or more" mean one, two or more than two. The term "and/or" describes an association relationship between associated objects and covers three cases; for example, A and/or B may represent: A alone, A and B together, or B alone, where A and B may be singular or plural. The character "/" generally indicates an "or" relationship between the objects it connects.
Reference in the specification to "one embodiment" or "some embodiments" or the like means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in the specification are not necessarily all referring to the same embodiment, but mean "one or more but not all embodiments" unless expressly specified otherwise. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.
In order to facilitate understanding of the technical solution of the present application, a brief description will be first made of the concept related to the present application.
Ray tracing: a rendering algorithm in three-dimensional computer graphics that casts a ray from the viewpoint through each pixel point on the viewing plane, repeatedly tests the ray for intersections with objects, and takes optical phenomena such as reflection and refraction into account to render the three-dimensional scene.
Global illumination (GI): a rendering technique that considers both the direct illumination from light sources in a scene and the indirect illumination reflected by other objects in the scene, showing their combined effect.
Rasterization: the process of converting vertex data into primitives; it turns a graphic into an image composed of a grid of pixels, i.e., converts geometric primitives into a two-dimensional image.
Rigid body: a solid of limited size with negligible deformation. The distance between the particles inside the rigid body will not change, whether or not subjected to external forces.
Image super-resolution: reconstructing an input low-resolution image into a high-resolution image.
Deep learning: a branch of machine learning; algorithms that perform representation learning on data using artificial neural networks as the framework.
The technical scheme of the application will be described below with reference to the accompanying drawings.
FIG. 1 shows a schematic block diagram of ray tracing and rasterization in an embodiment of the present application. Ray tracing and rasterization are both rendering techniques that project objects in three-dimensional space onto a two-dimensional screen for display. They differ in that ray tracing assumes a ray cast forward through each point on the screen, computes where these rays hit the geometry (the triangle shown in FIG. 1), and then computes the texel color at those positions, whereas rasterization transforms the coordinates of the vertices of a graphic (such as the triangle shown in FIG. 1) and then fills the triangle on the two-dimensional screen with texture. Compared with rasterization, ray tracing requires more computation but produces a more realistic rendering; the image rendering method of the embodiments of the application is a rendering method based on ray tracing.
Next, the convolutional neural network (CNN) involved in the image rendering method of the embodiments of the present application, U-Net, is described. U-Net was initially used for medical image segmentation tasks and, owing to its excellent results, has since been widely used in a variety of segmentation tasks. U-Net can be trained on a small amount of data and achieves high segmentation accuracy by classifying every pixel point, so segmenting images with a trained U-Net model is fast. FIG. 2 shows a schematic block diagram of U-Net, which is briefly described below. The left part is the encoder, which downsamples the input by max pooling; the right part is the decoder, which upsamples the encoder output to recover the resolution; the middle consists of skip connections, which perform feature fusion. Because the overall network structure is shaped like a "U", it is called U-Net.
Downsampling and upsampling increase robustness to small perturbations of the input image, such as translation and rotation, reduce the risk of over-fitting, reduce the amount of computation and enlarge the receptive field; upsampling decodes the abstract features back to the original size, finally yielding a clear, noise-free image. A minimal sketch of such a network is given below.
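The following PyTorch sketch shows a minimal two-level U-Net of this shape (encoder with max pooling, decoder with upsampling, one skip connection); the channel sizes are illustrative and are not those of any network in the patent.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Minimal two-level U-Net: encoder with max pooling, decoder with upsampling,
    and a skip connection fusing encoder and decoder features."""
    def __init__(self, in_ch=3, out_ch=3):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.dec = nn.Sequential(nn.Conv2d(32 + 16, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, out_ch, 3, padding=1))

    def forward(self, x):
        e1 = self.enc1(x)                 # full-resolution features
        e2 = self.enc2(self.pool(e1))     # downsampled (encoder) features
        d = self.up(e2)                   # upsample back to full resolution
        d = torch.cat([d, e1], dim=1)     # skip connection: fuse encoder features
        return self.dec(d)

out = TinyUNet()(torch.rand(1, 3, 64, 64))   # -> shape (1, 3, 64, 64)
```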
The image rendering method in the embodiments of the application can be executed by an electronic device. The electronic device may be a mobile terminal (e.g., a smartphone), a computer, a personal digital assistant, a wearable device, a vehicle-mounted device, an Internet-of-things device, or another device capable of image rendering. The electronic device may run Android, iOS, Windows or another operating system.
The graphics rendering method of the embodiment of the present application may be performed by an electronic device, and a specific structure of the electronic device may be shown in fig. 3, and a detailed description of the specific structure of the electronic device is described below in connection with fig. 3.
In one embodiment, as shown in fig. 3, an electronic device 300 may include: a Central Processing Unit (CPU) 301, a Graphics Processor (GPU) 302, a display device 303, and a memory 304. Optionally, the electronic device 300 may also include at least one communication bus 310 (not shown in FIG. 3) for enabling connected communication between the various components.
It should be appreciated that the various components in the electronic device 300 may also be coupled by other connectors, which may include various interfaces, transmission lines, buses, or the like. The various components in the electronic device 300 may also be interconnected in other ways centered on the processor 301. In various embodiments of the application, "coupled" means electrically connected or in communication with each other, either directly or indirectly through other devices.
The central processing unit 301 and the graphics processor 302 may be connected in various ways, not limited to the one shown in fig. 3. They may be located on the same chip or may each be a separate chip.
The functions of the central processor 301, the graphic processor 302, the display device 303, and the memory 304 are briefly described below.
Central processing unit 301: used to run an operating system 305 and application programs 307. The application 307 may be a graphics-intensive application such as a game or a video player. The operating system 305 provides a system graphics library interface through which, together with drivers provided by the operating system 305 (such as a graphics library user-mode driver and/or a graphics library kernel-mode driver), the application 307 generates an instruction stream for rendering graphics or image frames and the associated rendering data as needed. The system graphics library includes, but is not limited to: the embedded open graphics library (OpenGL ES), the Khronos platform graphics interface, or Vulkan (a cross-platform drawing application program interface). The instruction stream contains a series of instructions, which are typically calls to the system graphics library interface.
Optionally, the central processor 301 may include at least one of the following types of processors: an application processor, one or more microprocessors, a digital signal processor (DSP), a microcontroller unit (MCU), an artificial intelligence processor, or the like.
The central processor 301 may further include necessary hardware accelerators, such as application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or integrated circuits for implementing logic operations. The processor 301 may be coupled to one or more data buses for transmitting data and instructions among the various components of the electronic device 300.
Graphics processor 302: for receiving a stream of graphics instructions sent by the processor 301, generating a render target through a render pipeline (pipeline), and displaying the render target to the display device 303 through a graphics layer composition display module of the operating system.
Alternatively, graphics processor 302 may comprise a general-purpose graphics processor executing software, such as a GPU or other type of special-purpose graphics processing unit, or the like.
Display device 303: used to display the various images generated by the electronic device 300, which may be a graphical user interface (GUI) of the operating system or image data (including still images and video data) processed by the graphics processor 302.
Alternatively, the display device 303 may include any suitable type of display screen, such as a liquid crystal display (LCD), a plasma display or an organic light-emitting diode (OLED) display.
Memory 304, which is the transmission channel between the CPU 301 and the graphics processor 302, may be double data rate synchronous dynamic random access memory (DDR SDRAM) or another type of cache.
A rendering pipeline is a series of operations that the graphics processor 302 performs in sequence when rendering graphics or image frames. Typical operations include vertex processing, primitive processing, rasterization, fragment processing, and so on.
In the graphics rendering method according to the embodiment of the present application, three-dimensional coordinates are converted into two-dimensional coordinates, and related basic concepts will be briefly described below.
The process of converting three-dimensional coordinates to two-dimensional coordinates may involve 5 different coordinate systems.
Local space (or object space);
World space;
Observation space (view space, also called eye space);
Clipping space (clip space);
Screen space.
In order to transform coordinates from one coordinate system to another, several transformation matrices are needed; the most important are the model, view and projection matrices. Vertex coordinates start in local space, as local coordinates, and after these transformations become world coordinates, view coordinates and clip coordinates in turn, ending finally as screen coordinates.
In this coordinate transformation process, local coordinates are the coordinates of the object relative to its local origin, the coordinates in which the object starts out. They are first transformed into world space coordinates, which cover a larger spatial range: these coordinates are relative to the global origin of the world and place the object together with the other objects relative to that origin. The world coordinates are then transformed into view space coordinates, so that each coordinate is seen from the camera or observer's point of view. Once the coordinates are in view space, they are projected to clip coordinates: clip coordinates are mapped to the range -1.0 to 1.0, and it is determined which vertices will appear on the screen. Finally, the clip coordinates are transformed into screen coordinates in a process called the viewport transform, which maps coordinates in the range -1.0 to 1.0 to the coordinate range defined by the glViewport function. The transformed coordinates are sent to the rasterizer, which converts them into fragments, from which the image can then be displayed. A numerical sketch of this chain of matrices is given below.
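As a purely illustrative numerical example of this chain of transformations (not code from the patent), the NumPy sketch below builds a model, view and projection matrix and carries a local-space vertex through to normalized device coordinates.

```python
import numpy as np

def perspective(fov_y, aspect, near, far):
    """A standard OpenGL-style perspective projection matrix (column-vector convention)."""
    f = 1.0 / np.tan(fov_y / 2.0)
    m = np.zeros((4, 4))
    m[0, 0] = f / aspect
    m[1, 1] = f
    m[2, 2] = (far + near) / (near - far)
    m[2, 3] = (2 * far * near) / (near - far)
    m[3, 2] = -1.0
    return m

# Local -> world -> view -> clip, as described above; the matrices are illustrative.
model = np.diag([2.0, 2.0, 2.0, 1.0])            # model matrix: scale the object
view = np.eye(4); view[2, 3] = -5.0              # view matrix: move the scene 5 units from the camera
proj = perspective(np.radians(60), 16 / 9, 0.1, 100.0)

local_vertex = np.array([0.5, 0.5, 0.0, 1.0])    # homogeneous local coordinates
clip = proj @ view @ model @ local_vertex
ndc = clip[:3] / clip[3]                          # perspective divide -> normalized device coordinates
```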
In this process, the vertices are transformed into different spaces because some operations make more sense, and are more convenient, in a particular coordinate system. For example, when an object needs to be modified, it makes more sense to do this in local space; an operation on one object relative to the position of another object is better performed in the world coordinate system; and so on. It would be possible to define a single transformation matrix that goes directly from local space to clip space, but that would lose much flexibility.
The respective coordinate systems will be described in detail.
Local space:
Local space refers to the coordinate space in which an object is defined, i.e., where it initially resides. Imagine creating a cube in a modeling package such as Blender. The origin of that cube will likely be at (0, 0, 0), even though the cube may end up at a completely different location in the program. It is even possible that every model you create starts at (0, 0, 0), although each will end up at a different place in the world. All the vertices of the created model are therefore in local space: they are all local to the object.
World space:
If we imported all our objects into the program as they are, they would probably all be crowded around the world origin (0, 0, 0), which is not what we want. We want to define a position for each object so that they can be placed within a larger world. Coordinates in world space are exactly what the name says: the coordinates of a vertex relative to the (game) world. If you want objects spread around the world (especially in a realistic way), this is the space you want them transformed into. The object's coordinates are transformed from local space to world space; the transformation is implemented by a model matrix.
The model matrix is a transformation matrix that displaces, scales and rotates an object to place it at the position and in the orientation it should have. Think of transforming a house: it must first be scaled down (it is too large in local space), displaced to a town in the suburbs, and then rotated a little to the left on the y-axis to line up with the neighbouring houses. The matrix used in the previous section to place boxes around the scene can roughly be regarded as a model matrix as well: it transforms the local coordinates of a box to a different location in the scene/world.
Observation space:
The viewing space is often referred to as the camera of the cross-platform graphics programming interface OpenGL (open graphics library) (it is sometimes also called camera space (Camera Space) or eye space (Eye Space)). The viewing space is the result of converting world space coordinates into coordinates that are in front of the user's field of view. The viewing space is thus the space observed from the camera's point of view. This is usually accomplished by a combination of translations and rotations that move/rotate the scene so that particular objects are transformed to be in front of the camera. These combined transformations are typically stored in a View Matrix (View Matrix), which is used to transform world coordinates into viewing space. In the next section we will discuss how to create such a view matrix to simulate a camera.
Clipping space:
At the end of the vertex shader run, OpenGL expects all coordinates to fall within a particular range, and any point outside this range should be clipped (Clipped). Coordinates that are clipped are discarded, and the remaining coordinates become the fragments visible on the screen. This is the origin of the name clipping space (Clip Space).
Since it is not intuitive to specify all visible coordinates directly in the range -1.0 to 1.0, we instead specify our own coordinate set (Coordinate Set) and transform it back to the normalized device coordinate system expected by OpenGL.
To transform coordinates from view space to clip space, we need to define a projection matrix (Projection Matrix), which specifies a range of coordinates, for example -1000 to 1000 in each dimension. The projection matrix then transforms the coordinates within this specified range into the range of normalized device coordinates (-1.0, 1.0). All coordinates outside the range will not be mapped into the range between -1.0 and 1.0 and will therefore be clipped. Within the range specified by this projection matrix, the coordinate (1250, 500, 750) would not be visible, because its x-coordinate is out of range and is therefore translated into a normalized device coordinate greater than 1.0, so it is clipped out.
For example, if only a portion of a Primitive (PRIMITIVE), such as a triangle, exceeds the Clipping Volume (Clipping Volume), openGL will reconstruct the triangle into one or more triangles that fit within the Clipping range.
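As an illustration of the coordinate transformation chain described above, the following is a minimal sketch assuming 4×4 matrices and an OpenGL-style perspective projection; the function and parameter names are illustrative and not part of the embodiment.

```python
import numpy as np

def perspective(fov_y, aspect, near, far):
    """OpenGL-style perspective projection matrix (illustrative)."""
    f = 1.0 / np.tan(fov_y / 2.0)
    m = np.zeros((4, 4))
    m[0, 0] = f / aspect
    m[1, 1] = f
    m[2, 2] = (far + near) / (near - far)
    m[2, 3] = 2.0 * far * near / (near - far)
    m[3, 2] = -1.0
    return m

def local_to_screen(local_pos, m_model, m_view, m_projection, viewport_w, viewport_h):
    """Transform a local-space vertex through world, view and clip space to screen coordinates."""
    p = np.append(local_pos, 1.0)                   # homogeneous local coordinates
    clip = m_projection @ m_view @ m_model @ p      # local -> world -> view -> clip
    ndc = clip[:3] / clip[3]                        # perspective divide -> [-1.0, 1.0]
    if np.any(np.abs(ndc) > 1.0):
        return None                                 # outside the clipping volume: clipped
    x = (ndc[0] + 1.0) * 0.5 * viewport_w           # viewport transform (what glViewport defines)
    y = (ndc[1] + 1.0) * 0.5 * viewport_h
    return x, y, ndc[2]
```

The same per-frame chain is what later provides the G-buffer information and motion vectors used by the denoising and frame interpolation steps below.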
Ray tracing obtains accurate shadows, reflections and diffuse global illumination effects by tracing every ray emitted from the camera, so simulating and rendering a highly realistic virtual scene requires a very large amount of computation and power. At present, for real-time ray tracing at a resolution of 1920×1080 and a frame rate of 30 fps, the limitations of GPU hardware mean that only 1 to 2 spp can be provided for each pixel, and such low sampling values introduce a large amount of noise, degrading the quality of the rendered picture; at resolutions of 4K or 8K the sampling values have to be even lower. Therefore, without increasing hardware cost and while maintaining real-time ray tracing, it is necessary to optimize the rendering effect at low sampling values, remove the noise caused by insufficient sampling, and output a stable global illumination image.
The existing SVGF algorithm combines spatial-domain and temporal-domain filtering for denoising at 1 spp, and computes the variance over the spatial and temporal domains to distinguish high-frequency texture information from noisy regions, which guides a multi-layer filter. However, the method cannot accurately estimate the motion vectors of non-rigid motion and of shadow regions, so the denoising effect in shadow regions is poor; moreover, it uses traditional bilateral filtering whose weights cannot be adjusted dynamically, and the multi-layer filtering is time-consuming, so the real-time performance of the method is poor.
Another existing algorithm, KPCN, divides an image into a specular reflection part and a diffuse reflection part, uses a neural network to adaptively adjust the filter-kernel weights of each part, and finally combines the two parts to obtain the denoising result. Its drawbacks are that it requires a large sampling value (usually 32 spp), has a huge model structure, serves a single purpose (denoising only), involves a large amount of computation and long processing time, and does not sufficiently exploit temporal information as a supplement, so the algorithm can hardly meet real-time requirements.
In order to solve the problem of poor denoising caused by inaccurate estimation of the motion vectors of shadow regions and non-rigid motion, an existing third algorithm computes a gradient change value from per-pixel surface information and shading information: a larger gradient value indicates that the pixel has moved more, so the historical information of that pixel is discarded. This method judges the intensity of motion from the gradient and is used to alleviate the ghosting caused by motion vectors that cannot be accurately estimated under large displacements, but it cannot serve as an independent denoising module. It can be combined with the SVGF algorithm, and it can also be combined with the image rendering method of the embodiments of the present application.
Existing GPU hardware has limited power consumption and computing capability; at high sampling values the amount of computation is huge and the real-time requirement of 30 fps cannot be met. If only 1 to 2 rays are traced per pixel, the amount of computation is greatly reduced, but a large amount of noise is introduced. In addition, surfaces of different materials have different noise characteristics, and using the same denoising procedure for all of them gives poor results, which further increases the difficulty of denoising at low sampling values. From the related information in the geometry buffer (G-buffer) of the rendering pipeline, the motion vectors of rigid motion can be obtained accurately; the G-buffer is a buffer containing color, normal and world space coordinates. However, inaccurate motion vector estimation for non-rigid bodies and shadow regions degrades the rendering result. Moreover, image resolution further affects the real-time performance of a noise reduction algorithm, and real-time ray-tracing denoising at high resolution and high frame rate faces an even greater challenge.
The existing image rendering technology faces the following problems:
(1) A high sampling value entails a huge amount of computation and cannot meet real-time requirements, while the low sampling value that real-time operation forces produces serious noise;
(2) The characteristics of noise points on the surfaces of different materials are different, and if the same denoising process is used, the denoising effect is poor;
(3) For game picture rendering, real-time ray tracing at high resolution and high frame rate needs to be achieved, which makes real-time operation difficult;
(4) The motion vector estimation of the non-rigid body and the shadow part is inaccurate.
Therefore, the embodiment of the application provides an image rendering method, which adopts a low sampling value under the condition of limited hardware and combines time domain information to adopt different optimization strategies for different noise points generated by different materials so as to realize real-time ray tracing image rendering with high frame rate and high resolution.
Fig. 4 is a schematic block diagram of a system architecture of an existing image rendering method based on a real-time ray tracing technology, and as shown in fig. 4, the system architecture includes six parts including a model material loading module 401, a generating ray module 402, a ray intersection module 403, a denoising module 404, a post-processing module 405, and a display module 406.
The first step of image rendering based on the real-time ray tracing technology is the loading of model materials, which mainly involves two parts: the first part adds the models to be rendered into the scene, and the second part adds the respective material information and texture information to the models in the scene. This step is implemented by the model material loading module 401.
Generating light refers to the process of emitting rays from the camera onto the two-dimensional imaging plane. The number of rays emitted for each pixel point greatly affects the final rendering result: with a low sampling value the image is blurred and noisy, while with a high sampling value the image is clear and the effect is good. However, due to limited hardware conditions, generally only 1 to 2 rays are emitted per pixel in order to guarantee the real-time performance of ray tracing. This step is implemented by the generate ray module 402.
Ray tracing splits the rendering task of a scene into evaluating the effect on the scene of a number of rays emitted from the camera; the rays are unaware of each other, but each has access to the information of the entire scene model. Ray intersection refers to tracing the rays emitted by the camera, computing their intersection points with the scene model, obtaining information such as the material and texture of the scene model surface at the intersection position, and computing the reflected rays in combination with the light source information. The computation of the reflected rays is based on Monte Carlo importance sampling; at a sampling value of 1 spp, only 1 ray is traced for each intersection point. Corresponding to the ray generation part, the sampling value of the reflected rays also affects the final rendering result. Ray intersection is implemented by the ray intersection module 403.
The denoising module 404 is configured to reduce noise generated due to low sampling values, and ensure rendering effects while ensuring the instantaneity of ray tracing.
The post-processing module 405 is configured to refine the rendering result using tone mapping and temporal antialiasing (TAA) techniques. Tone mapping maps and transforms the colors of the image and adjusts its gray levels so that the processed image better expresses the information and characteristics of the original image; TAA mitigates the "jaggies" at image edges, making the edges smoother.
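As an illustration of the tone mapping step only, a minimal sketch of a simple global Reinhard-style operator is shown below; the patent does not specify which operator module 405 uses, so the choice of operator and the gamma value are assumptions.

```python
import numpy as np

def reinhard_tone_map(hdr, gamma=2.2):
    """Map HDR radiance into [0, 1] and apply gamma correction (illustrative only)."""
    ldr = hdr / (1.0 + hdr)             # compress the dynamic range
    return np.power(ldr, 1.0 / gamma)   # adjust gray levels for display
```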
The display module 406 is used to display the final rendered image.
Fig. 5 is a schematic block diagram of a system architecture of an image rendering method based on a real-time ray tracing technology according to an embodiment of the present application, and as shown in fig. 5, the system architecture includes seven parts, namely a model material loading module 501, a downsampling generating ray module 502, a ray intersection module 503, a denoising module 504, a frame inserting module 505, a post-processing module 506, and a display module 507.
Unlike the conventional image rendering method based on the real-time ray tracing technology in fig. 4, in the image rendering method based on the real-time ray tracing technology according to the embodiment of the present application, assuming that the size of the final two-dimensional imaging plane is W×H, the downsampled image is 1/2 of the final imaging plane in each dimension, so rays are emitted for only (1/2)W × (1/2)H pixels, which greatly reduces the amount of ray-intersection computation.
Because downsampling is performed when generating the rays, in the image rendering method based on the real-time ray tracing technology according to the embodiment of the present application, the denoising module 504 incorporates super-resolution technology to restore the image to full resolution.
In addition, a frame inserting module 505 is added after the denoising module 504 to solve the real-time problem in the high frame rate scene.
Fig. 6 shows a schematic flow chart of an image rendering method according to an embodiment of the present application, as shown in fig. 6, including steps 601 to 607, which are respectively described in detail below.
S601, acquiring an n-1th frame image, an nth frame image and an n+1th frame image, wherein the n-1th frame image, the nth frame image and the n+1th frame image are three consecutive frames; consecutive means that the n-1th frame image precedes the nth frame image and the nth frame image precedes the n+1th frame image. The n-1th frame image, the nth frame image and the n+1th frame image are images of low sampling value (e.g., 1 spp) generated by the model material loading module 501, the downsampling generating light module 502, and the light intersection module 503 in fig. 5.
S602, updating the illumination map of the nth frame image according to the nth-1 frame image to obtain the illumination map after the nth frame image is updated. This step may be performed by the denoising module 504 in fig. 5.
The illumination map comprises a direct illumination map and an indirect illumination map, wherein the direct illumination map is obtained by directly irradiating a light source onto an observed object, and reflecting light rays into eyes of a user through the observed object; the indirect illumination is obtained by irradiating the light source onto other objects, reflecting the light once or more times, finally reaching the observed object, and reflecting the light into eyes of the user.
It will be appreciated that an image is made up of a plurality of pixel points, each having its own color value; the color values of all the pixel points of an image together constitute the illumination map of the image.
Specifically, the illumination map includes a direct illumination map and an indirect illumination map, wherein direct illumination means that the light of the light source strikes the object directly, and indirect illumination means that the light of the light source reaches the object after one or more reflections. The direct illumination map is taken as an example below.
Any pixel point in the nth frame image is denoted as a first pixel point, and the pixel point corresponding to the first pixel point in the n-1th frame image is acquired as a second pixel point. That is, the second pixel point is the pixel point in the n-1th frame image that corresponds to the first pixel point. The color value of the first pixel point is updated according to the color value of the first pixel point and the color value of the second pixel point: specifically, the color value of the first pixel point is multiplied by a first coefficient to obtain a first result, the color value of the second pixel point is multiplied by a second coefficient to obtain a second result, and the first result and the second result are added to obtain the updated color value of the first pixel point. The first coefficient and the second coefficient may be manually preset values, and the embodiments of the present application are not specifically limited here.
Optionally, if the position in the n-1th frame image to which the first pixel point corresponds (i.e., the position of the second pixel point) does not fall on a grid node of the n-1th frame image, the color value of the second pixel point cannot be obtained directly, and a bilinear interpolation algorithm is used to obtain it. Specifically, the four pixel points closest to the second pixel point are found first, these four pixel points being required to lie on grid nodes of the n-1th frame image; their color values are then obtained; and the color value of the second pixel point is calculated from the color values of these four pixel points using the bilinear interpolation algorithm, as sketched below.
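A minimal sketch of this bilinear fetch is given below, assuming an H×W×C array layout; the names are illustrative. The same routine can also be used to obtain the depth value, normal vector value and patch ID of the second pixel point mentioned in the following paragraphs.

```python
import numpy as np

def bilinear_fetch(image, x, y):
    """Bilinearly interpolate `image` (H x W x C) at a non-integer position (x, y).
    Used here to read the second pixel point's value from the (n-1)-th frame when
    the reprojected position does not fall on a grid node."""
    h, w = image.shape[:2]
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    # the four nearest grid-node pixel points
    c00, c10 = image[y0, x0], image[y0, x1]
    c01, c11 = image[y1, x0], image[y1, x1]
    top = (1 - fx) * c00 + fx * c10
    bottom = (1 - fx) * c01 + fx * c11
    return (1 - fy) * top + fy * bottom
```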
Optionally, in order to ensure that the second pixel point really is the pixel point corresponding to the first pixel point in the n-1th frame image, before updating the color value of the first pixel point with the color value of the second pixel point, the method of the embodiment of the application further includes performing a consistency judgment on the first pixel point and the second pixel point. Specifically, the depth value, normal vector value and patch ID of the first pixel point and the depth value, normal vector value and patch ID of the second pixel point are obtained; if the first pixel point and the second pixel point satisfy: the square of the difference between the depth value of the first pixel point and the depth value of the second pixel point is smaller than a first threshold, the square of the difference between the normal vector value of the first pixel point and the normal vector value of the second pixel point is smaller than a second threshold, and the patch ID of the first pixel point is equal to the patch ID of the second pixel point, then the first pixel point and the second pixel point are considered to satisfy consistency, and the color value of the second pixel point can be used to update the color value of the first pixel point. If the first pixel point and the second pixel point do not satisfy consistency, the color value of the second pixel point is not used to update the color value of the first pixel point, and the updated color value of the first pixel point is its current color value.
Optionally, if the position of the first pixel point in the n-1 frame image corresponding to the second pixel point is not on the grid node of the n-1 frame image, the depth value, the normal vector value and the patch ID of the second pixel point cannot be directly obtained, and similar to the method for obtaining the color value, the depth value, the normal vector value and the patch ID of the second pixel point need to be obtained by adopting a bilinear interpolation algorithm. Specifically, the depth value, the normal vector value and the patch ID of four pixel points closest to the second pixel point are found first, and the four pixel points are required to be on grid nodes of an n-1 frame image; then obtaining depth values, normal vector values and patch IDs of the four pixel points; and calculating the depth value, the normal vector value and the patch ID of the second pixel point by combining the depth value, the normal vector value and the patch ID of the four pixel points with a bilinear interpolation algorithm.
And after the color value of each pixel point in the nth frame image is updated in the above manner, obtaining the updated direct illumination map of the nth frame image.
In the embodiment of the present application, the processing manners of the direct illumination map and the indirect illumination map are the same, and the processing manner of the indirect illumination map can refer to the processing manner of the direct illumination map, so that for brevity, the embodiment of the present application is not described herein again.
S603, inputting the updated illumination map of the nth frame image into a super-division denoising network to obtain a super-division denoising image of the nth frame image. This step may be performed by the denoising module 504 in fig. 5.
It should be appreciated that the updated illumination map includes an updated direct illumination map and an updated indirect illumination map. Specifically, the depth map and the normal vector map of the nth frame image are first obtained; it will be understood that the depth map of the nth frame image consists of the depth values of all pixel points in the nth frame image, and the normal vector map of the nth frame image consists of the normal vector values of all pixel points in the nth frame image. The updated direct illumination map, the updated indirect illumination map, the depth map and the normal vector map of the nth frame image are then fused to obtain a first fusion result; the fusion may be an existing feature fusion method, such as concat or add, and the embodiments of the present application are not specifically limited here. Finally, the first fusion result is input into the super-division denoising network to obtain the super-division denoising image of the nth frame image, as sketched below. The super-division denoising network is a pre-trained neural network model, which may have the U-Net network structure shown in fig. 2; the training process of the super-division denoising network is described in detail below.
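The following sketch illustrates one way the first fusion result might be assembled (channel-wise concat) and passed to the network. The `model` here is a stand-in for any pre-trained U-Net-like network, and the channel counts and the 2× output scale are assumptions for illustration.

```python
import torch

def superres_denoise(direct, indirect, depth, normal, model):
    """Fuse the updated direct/indirect illumination maps with the depth and
    normal-vector maps of the n-th frame and run the super-division denoising network."""
    # direct, indirect: (1, 3, H, W); depth: (1, 1, H, W); normal: (1, 3, H, W)
    fused = torch.cat([direct, indirect, depth, normal], dim=1)  # "concat" feature fusion
    with torch.no_grad():
        return model(fused)  # assumed to output a (1, 3, 2H, 2W) super-resolved, denoised image
```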
S604, updating the illumination map of the n+1st frame image according to the n-th frame image to obtain the updated illumination map of the n+1st frame image. This step may be performed by the denoising module 504 in fig. 5.
The process of updating the illumination map of the n+1th frame image according to the nth frame image is similar to the process of updating the illumination map of the nth frame image according to the n-1th frame image; reference may be made to the description of obtaining the updated illumination map of the nth frame image according to the n-1th frame image, which is not repeated here for brevity.
S605, inputting the updated illumination map of the n+1th frame image into the super-division denoising network to obtain a super-division denoising image of the n+1th frame image. This step may be performed by the denoising module 504 in fig. 5.
The process of inputting the updated illumination map of the n+1th frame image into the super-division denoising network is similar to that of inputting the updated illumination map of the nth frame image into the super-division denoising network; reference may be made to the description of obtaining the super-division denoising image of the nth frame image, which is not repeated here for brevity.
S606, obtaining an initial interpolated frame image at a target moment according to the super-division denoising image of the nth frame image and the super-division denoising image of the n+1th frame image, wherein the target moment is a moment between the nth frame image and the n+1th frame image. This step may be performed by the frame inserting module 505 in fig. 5.
After the super-division denoising image of the nth frame image and the super-division denoising image of the n+1th frame image are obtained as described above, an initial interpolated frame image at a target moment is obtained from them, where the target moment is a moment between the nth frame image and the n+1th frame image, preferably the intermediate moment between the two. Specifically, the motion vectors from the n+1th frame image to the nth frame image are obtained first; it will be understood that each pixel point in the n+1th frame image has a motion vector to its corresponding pixel point in the nth frame image, and the motion vectors of all the pixel points together form the motion vector from the n+1th frame image to the nth frame image. Then, a first motion vector from the initial interpolated image at the target moment to the nth frame image and a second motion vector from the initial interpolated image at the target moment to the n+1th frame image are determined from the motion vector from the n+1th frame image to the nth frame image. For example, assuming that the motion vector from the n+1th frame image to the nth frame image is M_{3→2}, the target moment is t, and t is a value in (0, 1), the first motion vector from the initial interpolated image at the target moment to the nth frame image is:
M_{t→2} = t × M_{3→2}
the second motion vector from the initial frame inserting image to the n+1st frame image at the target moment is:
M_{t→3} = -(1-t) × M_{3→2}
Finally, the initial interpolated frame image at the target moment is obtained from the super-division denoising image of the nth frame image, the super-division denoising image of the n+1th frame image, the first motion vector and the second motion vector. Assuming that the super-division denoising image of the nth frame image is I_2 and the super-division denoising image of the n+1th frame image is I_3, the initial interpolated frame image at the target moment is calculated as:
I_t = (1-t) × g(I_2, M_{t→2}) + t × g(I_3, M_{t→3})
where function g () represents the mapping operation.
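A minimal per-pixel sketch of this step is given below. The mapping operation g() is implemented here as a simple backward warp with nearest-neighbour gather, which is an assumption made for brevity; the formulas for M_{t→2}, M_{t→3} and I_t follow the text above.

```python
import numpy as np

def warp(image, motion):
    """Backward-warp `image` (H x W x C) by per-pixel motion vectors (H x W x 2).
    Assumed implementation of the mapping operation g()."""
    h, w = image.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.round(xs + motion[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.round(ys + motion[..., 1]).astype(int), 0, h - 1)
    return image[src_y, src_x]

def initial_interpolation(i2, i3, m_3_to_2, t=0.5):
    """I_t = (1-t) * g(I_2, M_{t->2}) + t * g(I_3, M_{t->3})."""
    m_t_to_2 = t * m_3_to_2             # first motion vector
    m_t_to_3 = -(1.0 - t) * m_3_to_2    # second motion vector
    return (1.0 - t) * warp(i2, m_t_to_2) + t * warp(i3, m_t_to_3)
```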
S607, inputting the initial interpolated frame image into the bidirectional frame inserting network to obtain the interpolated frame image at the target moment. This step may be performed by the frame inserting module 505 in fig. 5.
The method of the embodiment of the application may input the initial interpolated frame image directly into the bidirectional frame inserting network; however, in order to compensate for inaccurate motion vector estimation of non-rigid motion and shadow motion, the method of the embodiment of the application further includes: first obtaining the depth map of the nth frame image, the normal vector map of the nth frame image, the depth map of the n+1th frame image and the normal vector map of the n+1th frame image; then fusing the depth map of the nth frame image, the normal vector map of the nth frame image, the depth map of the n+1th frame image and the normal vector map of the n+1th frame image with the initial interpolated frame image to obtain a second fusion result, where the fusion may be an existing feature fusion method, such as concat or add, and the embodiments of the present application are not specifically limited here; and finally inputting the second fusion result into the bidirectional frame inserting network. The bidirectional frame inserting network is a pre-trained neural network model, which may have the U-Net network structure shown in fig. 2; the training process of the bidirectional frame inserting network is described in detail below.
The image rendering method of the embodiment of the application can process the image with a low sampling value (for example, 1 spp), thereby greatly reducing the requirement on hardware equipment; according to the image rendering method, color value accumulation is carried out on the pixel points, and the illumination graph of the low sampling value is updated, so that the problem of noise caused by insufficient sampling information can be solved; according to the image rendering method, the super-division denoising network is used for processing the image, so that the resolution ratio of the image can be improved; according to the image rendering method, the continuous two-frame images are subjected to frame interpolation, and the two-way frame interpolation network is used for processing the frame interpolation images, so that the frame rate of the images is improved, and the smoothness of image rendering is ensured.
Fig. 7 shows a schematic flow chart of training of the super-division denoising network according to an embodiment of the present application, including steps 701 to 706, which are described below, respectively, as shown in fig. 7.
S701, acquiring a plurality of groups of super-division denoising original training data, wherein each group of super-division denoising original training data in the plurality of groups includes two consecutive frame images and a standard image corresponding to the later of the two consecutive frame images.
Specifically, two consecutive frames of images are images with low sampling value (for example, 1 spp), and the standard image corresponding to the next frame of image is a rendering result of the next frame of image under the condition of high sampling value (for example, 4096 spp). The standard image is used as a training standard for the low sample value image.
S702, judging the consistency of the pixel points of two continuous frames of images.
Specifically, for each pixel point in the image of the next frame, whether the corresponding pixel point in the image of the previous frame accords with the pixel point is judged. The method for determining the consistency of the pixel points may refer to the description in S602, and for brevity, the embodiments of the present application are not described herein again.
S703, obtaining a depth map of a next frame image, a normal vector map of the next frame image and an illumination map of the next frame image in the two continuous frame images, wherein the illumination map of the next frame image is a direct illumination map or an indirect illumination map.
And S704, updating the color value of the pixel point of the image of the next frame according to the two continuous frames of images to obtain the updated illumination map of the image of the next frame.
Specifically, for pixel points that satisfy consistency, the method of updating the color value of a pixel point of the next frame image according to the two consecutive frame images may refer to the description in S602, and is not repeated here for brevity. For pixel points that do not satisfy consistency, the current color value of the pixel point in the next frame image is taken as its updated color value. After the color value of every pixel point in the next frame image has been updated, the updated illumination map of the next frame image is obtained.
S705, fusing the depth map of the next frame image, the normal vector map of the next frame image and the updated illumination map of the next frame image to obtain an updated image.
For a specific fusion method, reference may be made to the description of S603, and for brevity, the embodiment of the present application is not described herein.
S706, training the super-resolution denoising network according to the updated image and the standard image.
It should be understood that the training of the super-division denoising network is the same as that of an ordinary neural network: the high-sampling-value image is used as the standard so that the training result for the low-sampling-value image approximates the high-sampling-value image, and when the difference between the training result and the standard image is smaller than a preset value, the training of the super-division denoising network is considered complete.
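A minimal training-loop sketch under these assumptions follows. The choice of L1 loss, the Adam optimizer and the loader yielding (updated image, standard image) pairs are all assumptions; the text only requires that the difference between the training result and the standard image fall below a preset value.

```python
import torch

def train_superres_denoise(model, loader, epochs=10, lr=1e-4, threshold=1e-3):
    """Train the super-division denoising network: the low-sampling-value updated image
    is the input, the high-sampling-value rendering is the standard (ground-truth) image."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.L1Loss()
    for _ in range(epochs):
        for updated_img, standard_img in loader:
            pred = model(updated_img)
            loss = loss_fn(pred, standard_img)
            opt.zero_grad()
            loss.backward()
            opt.step()
            if loss.item() < threshold:   # difference below the preset value: training done
                return model
    return model
```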
Fig. 8 shows a schematic flow chart of training of a bidirectional frame insertion network according to an embodiment of the present application, as shown in fig. 8, including steps 801 to 803, which are described below.
S801, multiple groups of bidirectional frame inserting original training data are obtained, wherein each group of bidirectional frame inserting original training data in the multiple groups of bidirectional frame inserting original training data comprises a fourth image, a fifth image and a sixth image, and the fourth image, the fifth image and the sixth image are continuous three-frame images.
The fourth image, the fifth image, and the sixth image may be the super-resolution denoising image obtained through the above steps, or may be other images, and embodiments of the present application are not limited herein.
S802, obtaining a frame inserting image of the intermediate time of the fourth image and the sixth image according to the fourth image and the sixth image.
The step of obtaining the intermediate-time frame image may refer to the description of S606, and for brevity, the embodiment of the present application is not described herein again.
S803, training the bidirectional frame inserting network according to the frame inserting image at the middle time and the fifth image.
The fifth image is used as the standard image, so that the training result for the interpolated frame image at the intermediate moment approaches the fifth image; when the difference between the training result and the fifth image is smaller than a preset value, the training of the bidirectional frame inserting network is considered complete.
The image rendering method of the embodiment of the application is mainly aimed at removing noise points under the condition of low sampling value (for example, 1 spp) and realizing high-resolution and high-frame-rate real-time ray tracing. Fig. 9 is a schematic block diagram of an image rendering method according to an embodiment of the present application, and the image rendering method according to the embodiment of the present application is described in detail below with reference to fig. 9.
The image rendering method of the embodiment of the application relates to a super-division denoising network and a bidirectional frame inserting network, wherein the super-division denoising network and the bidirectional frame inserting network have the U-Net network model structures in the figure 2, and the super-division denoising network and the bidirectional frame inserting network need to be trained in advance.
The training of the super-division denoising network comprises the following steps: the illumination information obtained through ray tracing is divided into direct illumination and indirect illumination, and the color information of the direct illumination and the indirect illumination is updated by combining the motion vectors in the G-buffer with the consistency judgment. The updated color information of the direct illumination and of the indirect illumination is fused with the corresponding depth information and normal information in the G-buffer; the fusion may be concat or add. The fused direct illumination and indirect illumination are input into the super-division denoising network to obtain the super-division denoising result of the direct illumination and the super-division denoising result of the indirect illumination, respectively. Finally, the two super-division denoising results are merged to obtain the final super-division denoising result. The ground truth (Ground Truth) of the super-division denoising network is the ray-tracing rendering result at a sampling value of 4096 spp.
Training of the bidirectional frame insertion network comprises: the three consecutive frame images output by the super-division denoising network are denoted as I_{AI+Denoise_0}, I_{AI+Denoise_1} and I_{AI+Denoise_2}; the bidirectional motion vectors between I_{AI+Denoise_0} and I_{AI+Denoise_1} and between I_{AI+Denoise_1} and I_{AI+Denoise_2} are obtained in combination with the G-buffer, and an initial intermediate interpolation result I_{AI+Denoise_1_calculate} is estimated using the frame interpolation formula. I_{AI+Denoise_1_calculate} is taken as the input of the bidirectional frame insertion network to obtain the final interpolation result. The Ground Truth of the bidirectional frame insertion network is I_{AI+Denoise_1}.
The rasterized rendering shown in fig. 9 maps a series of coordinate values of a three-dimensional object into a two-dimensional plane. The process from three-dimensional coordinates to two-dimensional coordinates is typically performed in steps, requiring multiple coordinate systems for transition, including: local space, world space, viewing space, clipping space, screen space, and the like. The transformation of coordinates from one coordinate system to another is accomplished by a transformation matrix, wherein the transformation matrix from local space coordinates to world space coordinates is the model matrix M_model, the transformation matrix from world space coordinates to viewing space coordinates is the view matrix M_view, and the transformation matrix from viewing space coordinates to clipping space coordinates is the projection matrix M_projection. For a moving image, the scenes of two adjacent frames are correlated, so the relative offset of the same pixel point across two adjacent frames is the motion vector, and the process of solving the motion vector is motion estimation. From the above coordinate operations, the motion vector of a pixel point across two consecutive frame images can be obtained. Depth information, normal information, patch ID (mesh ID), motion vectors, etc. of the image can be obtained in the rasterization process, and this information is stored in the G-buffer.
As shown in fig. 9, the acquired illumination information is divided into direct illumination and indirect illumination by ray tracing. By accessing the historical color buffer area and combining the color values of the corresponding pixel points in the current frame, continuous accumulation of the color values in the time domain can be realized, because the image can generate more noise points under the condition of fewer sampling values, and accumulating the historical color values and the current color values is equivalent to increasing the sampling values. However, since there is inaccuracy in the estimation of the motion vector of the non-rigid body such as the shadow, it is necessary to perform the consistency judgment based on the normal information, the depth information and meshid in the G-buffer, and if and only if the normal information, the depth information and meshid of the pixel satisfy the consistency at the same time, the history information is accumulated. Specifically, according to the motion vector, the position of the pixel point of the current frame is projected into the previous frame, the normal information, the depth information and meshid of the pixel point in the previous frame are obtained by using bilinear interpolation, then consistency judgment is carried out, and the color caches of direct illumination and indirect illumination are updated according to the judgment result. And finally, respectively fusing the color information of direct illumination and indirect illumination, the corresponding depth information and normal information, and inputting the color information and the corresponding depth information into a super-division denoising network to obtain super-division denoising images, wherein the super-division denoising images are respectively marked as a 0 th frame and a 1 st frame, and the 0 th frame and the 1 st frame are continuous two-frame images.
According to the motion vector information in the G-buffer, a bidirectional motion vector corresponding to the intermediate time t of the 0 th frame and the 1 st frame can be obtained through linear operation, and the mapping operation is carried out by combining the super-division denoising images of the 0 th frame and the 1 st frame and the bidirectional motion vector corresponding to the intermediate time t, so that an initial frame inserting result can be obtained. And fusing the initial frame inserting result with the corresponding depth information and normal information in the G-buffer, and then inputting the fused information into a bidirectional frame inserting network to obtain a final frame inserting result, namely a final t frame image.
The image rendering method combines the super-division technology and the frame inserting technology, achieves the purpose of obtaining the image with high frame rate and given resolution under the condition of low sampling value, and reduces the rendering time. The image rendering method according to the embodiment of the present application is described above with reference to fig. 9, and the image rendering method according to the embodiment of the present application is described in further detail below with reference to specific examples.
(I) Acquisition of data sets
FIG. 10 is a schematic flow chart of acquiring a data set according to an embodiment of the present application. As shown in FIG. 10, the model scene data set used for training the neural networks in the embodiment of the application may be a set of existing public rendering models or a set of model scenes built in-house. However, to ensure the balance of the neural network and the reliability of the training effect, different kinds of rendering scenes should be selected when acquiring the rendering scene data sources, such as buildings, automobiles, home interiors, games, animals, figures, statues, and other scenes. In addition to the richness of the data source content, the richness of the characteristics of the data itself must be ensured, including different detail textures, different materials, images under different illumination, and the like. The data set should contain both smooth regions and complex regions; noise removal in complex regions is more difficult than in smooth regions because they contain more texture. After acquiring data with different contents and different characteristics, the method of the embodiment of the application further includes a series of operations such as flipping, rotating, stretching and shrinking the acquired images, so as to expand the data set as much as possible; a sketch is given below.
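A minimal sketch of such flip/rotate/stretch/shrink augmentation follows, using only numpy and nearest-neighbour rescaling; the exact augmentation parameters are not specified in the text and are assumptions here.

```python
import numpy as np

def augment(image):
    """Expand the dataset with flipped, rotated and rescaled copies of `image` (H x W x C)."""
    copies = [
        np.flip(image, axis=1),   # horizontal flip
        np.flip(image, axis=0),   # vertical flip
        np.rot90(image, k=1),     # 90-degree rotation
        np.rot90(image, k=3),     # 270-degree rotation
    ]
    for scale in (0.5, 2.0):      # shrink and stretch (nearest-neighbour resampling)
        h, w = image.shape[:2]
        ys = np.clip((np.arange(int(h * scale)) / scale).astype(int), 0, h - 1)
        xs = np.clip((np.arange(int(w * scale)) / scale).astype(int), 0, w - 1)
        copies.append(image[np.ix_(ys, xs)])
    return copies
```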
(II) motion vector estimation and backprojection
FIG. 11 is a schematic block diagram of a rasterization process according to an embodiment of the present application, where the rasterization process is a process of converting three-dimensional coordinates into two-dimensional coordinates through a local space, world space, viewing space, cropping space, and screen space, as shown in FIG. 11. In order to transform coordinates from one coordinate system to another, several transformation matrices are generally required, and the most important transformation matrices are three matrices, namely a Model matrix M model, an observation (View) matrix M view, and a Projection (Projection) matrix M projection. The coordinates of the vertex data generally start in a Local Space (Local Space), which is referred to herein as Local coordinates (Local coordinates), which after transformation become World coordinates (World coordinates), view coordinates (View coordinates), clip coordinates (Clip coordinates), and finally end in the form of screen coordinates (Screen Coordinate).
The following describes the process of calculating the motion vector of the same pixel point in two consecutive frame images. Assume two consecutive frame images I and J; pixel point u = (x_1, y_1) is a pixel point in image I, and v = (x_2, y_2) is the pixel point in image J corresponding to pixel u. The motion vector is formally expressed as:
M = (x_2 - x_1, y_2 - y_1)
The calculation process for obtaining the motion vector from the rasterized G-buffer is as follows:
calculating a previous frame conversion matrix: M_mvp_prev = M_projection_prev × M_view_prev × M_model_prev;
calculating a current frame conversion matrix: M_mvp_cur = M_projection_cur × M_view_cur × M_model_cur;
calculating a motion vector: M = M_mvp_cur × aPos - M_mvp_prev × aPos.
Wherein aPos denotes three-dimensional coordinates in a local coordinate system.
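A minimal sketch of this rigid-motion computation follows. The final projection from clip space to pixel coordinates (perspective divide and viewport mapping) is added for illustration and is an assumption; the text itself only states the clip-space difference.

```python
import numpy as np

def motion_vector(a_pos, m_model_prev, m_view_prev, m_proj_prev,
                  m_model_cur, m_view_cur, m_proj_cur, width, height):
    """Screen-space motion vector of a vertex at local position aPos between the
    previous and current frames, following the three steps listed above."""
    p = np.append(a_pos, 1.0)
    m_mvp_prev = m_proj_prev @ m_view_prev @ m_model_prev
    m_mvp_cur = m_proj_cur @ m_view_cur @ m_model_cur

    def to_pixels(clip):
        ndc = clip[:3] / clip[3]                        # perspective divide
        return np.array([(ndc[0] + 1) * 0.5 * width,    # viewport mapping (assumed)
                         (ndc[1] + 1) * 0.5 * height])

    return to_pixels(m_mvp_cur @ p) - to_pixels(m_mvp_prev @ p)
```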
The limitation of the above method is that it only applies to rigid motion, so the method of the embodiment of the application also needs to use the calculated motion vector and bilinear interpolation to find, for a pixel point u in the current frame image I, the position of its corresponding pixel point v in the previous frame J, together with the corresponding parameters such as depth information, normal vector and meshid. As shown in fig. 12, if the position of the corresponding pixel point v in the previous frame J is not at a vertex position, the depth information, normal vector, meshid and other parameters of that position cannot be obtained directly; the parameters corresponding to the position P' can be obtained from the surrounding grid vertices P1, P2, P3 and P4 using the bilinear interpolation method, and used as the depth information, normal vector, meshid and other parameters of the pixel point v for the subsequent consistency judgment.
(III) consistency determination
In the embodiment of the application, the pixel point u in the current frame is projected to the corresponding position in the previous frame to obtain the pixel point v at the corresponding position, and then consistency judgment is carried out on the pixel points u and v. The formula for judging consistency is as follows:
(W_z_cur - W_z_prev)^2 < threshold_z
(W_n_cur - W_n_prev)^2 < threshold_n
W_id_cur = W_id_prev
Wherein W_z_cur represents the depth value of pixel point u and W_z_prev represents the depth value of pixel point v, and the square of their difference needs to be smaller than the depth threshold threshold_z; W_n_cur represents the normal value of pixel point u and W_n_prev represents the normal value of pixel point v, and the square of their difference needs to be smaller than the normal threshold threshold_n; W_id_cur represents the meshid of pixel point u and W_id_prev represents the meshid of pixel point v, and they need to be equal. The depth threshold threshold_z and the normal threshold threshold_n are empirical values, and can be adjusted appropriately according to the rendering result.
If and only if all three conditions are met, the pixel points are considered to pass the consistency check, and their color values can be accumulated. The accumulation formula is as follows:
C_update = α × C_original + (1-α) × C_history
Wherein C_update represents the updated illumination map, C_original represents the original illumination map, C_history represents the illumination map in the history buffer, and α represents the scaling factor between the original illumination map and the illumination map in the history buffer, which may be a manually preset value.
Optionally, if the pixel point u and the pixel point v do not meet the consistency, the pixel point corresponding to the pixel point u is not found in the previous frame, and the current color value of the pixel point u is regarded as the updated color value.
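A per-pixel sketch combining the consistency check and the accumulation formula above is given below. The threshold and α values are only illustrative stand-ins for the empirical settings mentioned in the text, and the squared norm of the normal difference is assumed for the normal test.

```python
import numpy as np

def temporal_accumulate(c_original, c_history, cur, prev,
                        threshold_z=1e-2, threshold_n=1e-1, alpha=0.2):
    """Consistency check between current-frame pixel u and its reprojected counterpart v,
    followed by color accumulation. `cur`/`prev` each hold (depth, normal, meshid)."""
    z_cur, n_cur, id_cur = cur
    z_prev, n_prev, id_prev = prev
    consistent = (
        (z_cur - z_prev) ** 2 < threshold_z and
        np.sum((np.asarray(n_cur) - np.asarray(n_prev)) ** 2) < threshold_n and
        id_cur == id_prev
    )
    if not consistent:
        return c_original                                  # history discarded, keep current value
    return alpha * c_original + (1 - alpha) * c_history    # C_update
```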
(IV) super-division denoising network
The first part of the data set of the super-division denoising network is a direct illumination graph obtained from a rendering pipeline, and after consistency judgment is carried out, the direct illumination color values in the history cache are accumulated into the current direct illumination graph, and the formula is expressed as follows:
C_direct_update = α_1 × C_direct_original + (1-α_1) × C_direct_history
Wherein C_direct_update represents the updated direct illumination map, C_direct_original represents the original direct illumination map, C_direct_history represents the direct illumination map in the history buffer, and α_1 represents the scaling coefficient between the original direct illumination map and the direct illumination map in the history buffer.
The second part of the data set of the super-division denoising network is an indirect illumination graph obtained from a rendering pipeline, and after consistency judgment is carried out, the indirect illumination color values in the history cache are accumulated into the current indirect illumination graph, and the formula is expressed as follows:
C_indirect_update = α_2 × C_indirect_original + (1-α_2) × C_indirect_history
Wherein C_indirect_update represents the updated indirect illumination map, C_indirect_original represents the original indirect illumination map, C_indirect_history represents the indirect illumination map in the history buffer, and α_2 represents the scaling coefficient between the original indirect illumination map and the indirect illumination map in the history buffer.
The third part of the super-division denoising network dataset is the depth map I_depth and the normal vector map I_normal_vector of the current frame acquired from the G-buffer.
In summary, the training dataset Dataset of the super-division denoising network includes four parts in total:
Dataset = C_direct_update + C_indirect_update + I_depth + I_normal_vector
Fig. 13 is a schematic block diagram of performing super-division denoising on an image with the super-division denoising network in the embodiment of the present application. As shown in fig. 13, the processing of the direct illumination map of a certain pixel point is taken as an example; the processing of the indirect illumination map is similar and may refer to that of the direct illumination map, so it is not repeated here. As shown in fig. 13, the current frame buffer and the previous frame buffer are first obtained, where the current frame buffer includes parameters such as the motion vector of the pixel point from the current frame to the previous frame, and the depth information, normal information and meshid of the pixel point in the current frame; the previous frame buffer includes parameters such as the historical color value of the pixel point, and the depth information, normal information and meshid of the pixel point in the previous frame. The previous frame buffer is then projected into the space of the current frame using the motion vector, and the consistency judgment is carried out according to the depth information, normal information, meshid and other parameters. If the judgment result is consistent, the historical color value and the current color value are accumulated, and the color value of the current frame is updated; if the judgment result is inconsistent, the current color value is retained. The historical color value is then updated according to the updated color value. Finally, the updated color value is fused with the depth information and normal information of the current frame, and the fused result is input into the super-division denoising network, thereby obtaining the super-division denoised image.
(V) bidirectional frame inserting network
The acquiring of the data set of the bidirectional frame inserting network comprises the following steps: the three consecutive frame images output by the super-division denoising network are denoted as I_{AI+Denoise_0}, I_{AI+Denoise_1} and I_{AI+Denoise_2}; the motion vector from I_{AI+Denoise_1} to I_{AI+Denoise_0} obtained from the G-buffer is M_{1-0}, and the motion vector from I_{AI+Denoise_2} to I_{AI+Denoise_1} obtained from the G-buffer is M_{1-2}, so that an interpolation result can be obtained according to the frame interpolation formula:
I_{AI+Denoise_1_calculate} = (1-t) × g(I_{AI+Denoise_0}, M_{t-0}) + t × g(I_{AI+Denoise_2}, M_{t-2})
Where I_{AI+Denoise_1_calculate} is the interpolation result, t represents the time position of the interpolation result between the 0th frame and the 2nd frame (for example, t may be 0.5), M_{t-0} represents the motion vector from time t to the 0th frame, where M_{t-0} is equal to M_{1-0}, and M_{t-2} represents the motion vector from time t to the 2nd frame, where M_{t-2} is equal to M_{1-2}.
The bidirectional frame inserting network can be trained by taking the interpolation result I_{AI+Denoise_1_calculate} as the input of the bidirectional frame inserting network and taking I_{AI+Denoise_1} as its Ground Truth.
Fig. 14 shows a schematic block diagram of processing an image using the bidirectional frame inserting network according to an embodiment of the present application. M_{1-0} represents the motion vector from the subsequent frame image to the previous frame image acquired from the G-buffer, and t is a coefficient in the interval (0, 1) representing a time between the previous frame image I_0 and the subsequent frame image I_1. The bidirectional motion vectors M_{t-0} and M_{t-1} at time t can be obtained by linear interpolation, with the following calculation formulas:
M_{t-0} = t × M_{1-0}
M_{t-1} = (1-t) × M_{1-0}
After obtaining the bidirectional motion vector estimation result at time t, the previous frame image I_0 and the next frame image I_1 are each mapped with the corresponding motion vector to obtain the initial frame interpolation result, with the following calculation formula:
I_t_initial = (1-t) × g(I_0, M_{t-0}) + t × g(I_1, M_{t-1})
where I_t_initial represents the initial interpolation result and the function g() represents the mapping operation.
Depth information and normal information of a previous frame image and a next frame image are obtained from the G-buffer and are fused with an initial frame interpolation result, so that the problem of inaccurate motion vector estimation of non-rigid motion or shadow motion is solved. And inputting the fused result into a bidirectional frame inserting network, thereby obtaining a final frame inserting result.
The image rendering method according to the embodiment of the present application is described in detail above with reference to fig. 7 to 14, and the image rendering apparatus according to the embodiment of the present application is described in detail below with reference to fig. 15 to 18. It should be understood that the image rendering apparatuses of fig. 15 to 18 are capable of performing the respective steps of the image rendering method of the embodiment of the present application, and duplicate descriptions are appropriately omitted when describing the apparatuses shown in fig. 15 to 18 below.
Fig. 15 is a schematic block diagram of an apparatus for image rendering according to an embodiment of the present application, as shown in fig. 15, including an acquisition module 1501 and a processing module 1502, which will be briefly described below.
An acquiring module 1501 is configured to acquire a first image, a second image, and a third image, where the first image, the second image, and the third image are consecutive three frames of images.
The processing module 1502 is configured to update the illumination map of the second image according to the first image, so as to obtain the illumination map after the second image is updated.
The processing module 1502 is further configured to input the updated illumination map of the second image into the super-division denoising network to obtain a super-division denoising image of the second image.
The processing module 1502 is further configured to update the illumination map of the third image according to the second image to obtain an updated illumination map of the third image.
The processing module 1502 is further configured to input the updated illumination map of the third image into the super-division denoising network to obtain a super-division denoising image of the third image.
The processing module 1502 is further configured to obtain an initial interpolated image at a target time according to the super-divided denoised image of the second image and the super-divided denoised image of the third image, where the target time is a time between the second image and the third image.
The processing module 1502 is further configured to input the initial frame-inserted image into the bidirectional frame-inserted network to obtain a frame-inserted image at the target time.
Optionally, the processing module 1502 is further configured to perform each step of the methods of S602 to S607 in fig. 6, and specific reference may be made to the description of fig. 6, which is omitted herein for brevity.
Fig. 16 shows a schematic block diagram of an apparatus for training a super-division denoising network according to an embodiment of the present application, as shown in fig. 16, including an acquisition module 1601 and a processing module 1602, which are briefly described below.
The acquisition module 1601 is configured to acquire a plurality of sets of super-division denoising original training data, where each set of super-division denoising original training data in the plurality of sets includes two consecutive frame images and a standard image corresponding to the later of the two consecutive frame images;
the obtaining module 1601 is further configured to obtain a depth map of a next frame image, a normal vector map of the next frame image, and an illumination map of the next frame image in the two continuous frame images, where the illumination map of the next frame image is a direct illumination map or an indirect illumination map;
A processing module 1602, configured to determine consistency of pixel points of two consecutive frames of images;
The processing module 1602 is further configured to update color values of pixels of the next frame of image according to two continuous frames of images, so as to obtain an updated illumination map of the next frame of image;
the processing module 1602 is further configured to fuse the depth map of the next frame image, the normal vector map of the next frame image, and the updated illumination map of the next frame image to obtain an updated image;
the processing module 1602 is further configured to train the super-resolution denoising network according to the updated image and the standard image.
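As a rough illustration of the data preparation these modules perform, the sketch below (NumPy, for illustration only) builds one fused training input from two consecutive frames. The previous-frame inputs are assumed to be already reprojected onto the current frame's pixel grid, and the threshold values, the blending coefficient, and the channel-concatenation fusion are assumptions made for the example rather than choices prescribed by this application.

```python
import numpy as np

def build_training_input(cur_illum, prev_illum,        # (H, W, 3) illumination maps
                         cur_depth, prev_depth,        # (H, W) depth maps
                         cur_normal, prev_normal,      # (H, W, 3) normal vector maps
                         cur_patch_id, prev_patch_id,  # (H, W) patch IDs
                         depth_thresh=1e-2, normal_thresh=1e-2, blend=0.2):
    """Sketch: consistency check, temporal blending, and fusion for one training sample.

    All prev_* arrays are assumed to be already reprojected to the current frame.
    """
    # A pixel point is treated as consistent when the squared depth difference and the
    # squared normal-vector difference are below thresholds and the patch IDs match.
    consistent = (
        ((cur_depth - prev_depth) ** 2 < depth_thresh)
        & (((cur_normal - prev_normal) ** 2).sum(axis=-1) < normal_thresh)
        & (cur_patch_id == prev_patch_id)
    )

    # Update the color values only at consistent pixel points:
    # blended value = blend * current + (1 - blend) * previous (assumed coefficients).
    updated_illum = np.where(
        consistent[..., None],
        blend * cur_illum + (1.0 - blend) * prev_illum,
        cur_illum,
    )

    # Fuse the depth map, the normal vector map, and the updated illumination map by
    # concatenating them along the channel axis (one possible fusion scheme).
    fused = np.concatenate([cur_depth[..., None], cur_normal, updated_illum], axis=-1)
    return fused  # paired with the standard image, this is one supervised training sample
```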
Fig. 17 shows a schematic block diagram of an apparatus for training a bidirectional frame interpolation network according to an embodiment of the present application. As shown in fig. 17, the apparatus includes an acquisition module 1701 and a processing module 1702, which are briefly described below.
The acquisition module 1701 is configured to acquire multiple sets of original training data for bidirectional frame interpolation, where each set of the original training data includes a fourth image, a fifth image, and a sixth image, and the fourth image, the fifth image, and the sixth image are three consecutive frames of images.
The processing module 1702 is configured to obtain an interpolated frame image at the intermediate time between the fourth image and the sixth image according to the fourth image and the sixth image.
The processing module 1702 is further configured to train the bidirectional frame interpolation network according to the interpolated frame image at the intermediate time and the fifth image.
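A minimal sketch of the corresponding supervised training step, written against PyTorch as an assumed framework, is given below. The `interp_net` model, the `warp_to_mid` helper, the plain averaging used to form the initial intermediate-time image, and the L1 loss are illustrative assumptions rather than requirements of this application.

```python
import torch
import torch.nn.functional as F

def train_interp_step(interp_net, optimizer, frame4, frame5, frame6, warp_to_mid):
    """One supervised step of fig. 17: predict the middle frame from its neighbours.

    `warp_to_mid` is a hypothetical callable that warps a frame to the intermediate time.
    """
    # Initial interpolated image at the intermediate time of the fourth and sixth images,
    # assumed here to be the average of the two frames warped toward the middle.
    initial_mid = 0.5 * (warp_to_mid(frame4) + warp_to_mid(frame6))

    # The bidirectional frame interpolation network refines the initial estimate.
    pred_mid = interp_net(initial_mid)

    # Supervise against the real middle frame (the fifth image) with an assumed L1 loss.
    loss = F.l1_loss(pred_mid, frame5)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```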
Fig. 18 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
It should be appreciated that the specific structure of the apparatus 1500 shown in fig. 15 above may be as shown in fig. 18.
The electronic device in fig. 18 includes a communication module 3010, a sensor 3020, a user input module 3030, an output module 3040, a processor 3050, a memory 3070, and a power source 3080. The processor 3050 may include one or more CPUs.
The electronic device shown in fig. 18 may perform the steps of the graphics rendering method of the embodiment of the present application, and in particular, one or more CPUs in the processor 3050 may perform the steps of the graphics rendering method of the embodiment of the present application.
The respective modules of the electronic device in fig. 18 are described in detail below.
The communication module 3010 may include at least one module that enables communication between the electronic device and other electronic devices. For example, the communication module 3010 may include one or more of a wired network interface, a broadcast receiving module, a mobile communication module, a wireless internet module, a local area communication module, and a location (or position) information module, etc.
For example, the communication module 3010 can acquire a game screen in real time from the game server side.
The sensor 3020 may sense some operation of the user, and the sensor 3020 may include a distance sensor, a touch sensor, and the like. The sensor 3020 may sense a user touching the screen or approaching the screen. For example, the sensor 3020 may be capable of sensing some manipulation of the game interface by the user.
The user input module 3030 is used for receiving input digital information, character information or touch operation/non-touch gestures, receiving signal input related to user setting and function control of the system, and the like. The user input module 3030 includes a touch panel and/or other input device. For example, the user may control the game through the user input module 3030.
The output module 3040 includes a display panel for displaying information input by a user, information provided to the user, various menu interfaces of a system, or the like.
Alternatively, the display panel may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, or the like. In other embodiments, the touch panel may be overlaid on the display panel to form a touch display.
In addition, the output module 3040 may further include a video output module, an alarm, a haptic module, and the like. The video output module may display the graphically rendered game visuals.
The power supply 3080 may receive external power and internal power under the control of the processor 3050 and provide power needed when the various modules of the overall electronic device are operating.
The processor 3050 may include one or more CPUs, and the processor 3050 may also include one or more GPUs.
When the processor 3050 includes a plurality of CPUs, the plurality of CPUs may be integrated on the same chip or may be integrated on different chips, respectively.
When the processor 3050 includes multiple GPUs, the multiple GPUs may be integrated on the same chip or may be integrated on different chips.
When the processor 3050 includes both a CPU and a GPU, the CPU and the GPU may be integrated on the same chip.
For example, when the electronic device shown in fig. 18 is a smart phone, the processor of the smart phone generally includes a CPU and a GPU that are involved in image processing. Both the CPU and the GPU here may contain multiple cores.
The memory 3070 may store computer programs, including an operating system program 3072, an application program 3071, and the like. Typical operating systems include Windows from Microsoft Corporation and macOS from Apple Inc., which are used in desktop or notebook systems, and the Android system developed by Google, which is used in mobile terminals.
The memory 3070 may be one or more of the following types: flash memory, hard disk type memory, micro multimedia card memory, card memory (e.g., SD or XD memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, or an optical disc. In other embodiments, the memory 3070 may also be a network storage device on the Internet, and the system may update or read the memory 3070 over the Internet.
For example, the memory 3070 may store a computer program (the computer program is a program corresponding to the graphics rendering method of the embodiment of the present application), and when the processor 3050 executes the computer program, the processor 3050 is capable of executing the graphics rendering method of the embodiment of the present application.
In addition to the computer programs, the memory 3070 also stores other data 3073; for example, the memory 3070 may store data generated during the processing of the graphics rendering method of the present application.
The connection relationship between the modules in fig. 18 is only an example; the embodiments of the present application may also be applied to electronic devices with other connection manners, for example, electronic devices in which all modules are connected through a bus.
Embodiments of the present application also provide a computer readable storage medium storing a computer program executable by a processor, the processor performing the method according to any one of fig. 6 to 8 when the computer program is executed by the processor.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided by the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a read-only memory (ROM), a random access memory (random access memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (23)
1. A method of image rendering, comprising:
acquiring a first image, a second image, and a third image, wherein the first image, the second image, and the third image are three consecutive frames of images;
updating the illumination map of the second image according to the first image to obtain an updated illumination map of the second image;
inputting the updated illumination map of the second image into a super-resolution denoising network to obtain a super-resolution denoised image of the second image;
updating the illumination map of the third image according to the second image to obtain an updated illumination map of the third image;
inputting the updated illumination map of the third image into the super-resolution denoising network to obtain a super-resolution denoised image of the third image;
acquiring an initial interpolated frame image at a target time according to the super-resolution denoised image of the second image and the super-resolution denoised image of the third image, wherein the target time is a time between the second image and the third image;
and inputting the initial interpolated frame image into a bidirectional frame interpolation network to obtain an interpolated frame image at the target time.
2. The method of claim 1, wherein updating the illumination map of the second image based on the first image to obtain the updated illumination map of the second image comprises:
acquiring an illumination map of the second image, wherein the illumination map of the second image comprises color values of a plurality of pixel points, and the illumination map of the second image is a direct illumination map or an indirect illumination map;
acquiring, in the first image, a second pixel point corresponding to a first pixel point, wherein the first pixel point is any one of the plurality of pixel points;
and updating the color value of the first pixel point according to the color value of the first pixel point and the color value of the second pixel point to obtain an updated illumination map.
3. The method of claim 2, wherein when the location of the second pixel point is not on a grid node of the first image, the method further comprises:
acquiring color values of four pixel points closest to the second pixel point, wherein the four pixel points are on grid nodes of the first image;
and acquiring the color value of the second pixel point according to the color values of the four pixel points.
4. The method of claim 2, wherein before the updating the color value of the first pixel point according to the color value of the first pixel point and the color value of the second pixel point, the method further comprises:
determining that the first pixel point and the second pixel point are consistent, wherein the determining that the first pixel point and the second pixel point are consistent comprises:
obtaining a depth value of the first pixel point, a normal vector value of the first pixel point, a patch ID of the first pixel point, a depth value of the second pixel point, a normal vector value of the second pixel point, and a patch ID of the second pixel point, and determining that:
the square of the difference between the depth value of the first pixel point and the depth value of the second pixel point is smaller than a first threshold,
the square of the difference between the normal vector value of the first pixel point and the normal vector value of the second pixel point is smaller than a second threshold,
and the patch ID of the first pixel point is equal to the patch ID of the second pixel point.
5. The method of claim 2, wherein updating the color value of the first pixel based on the color value of the first pixel and the color value of the second pixel comprises:
the updated color value of the first pixel point is the sum of the color value of the first pixel point multiplied by a first coefficient and the color value of the second pixel point multiplied by a second coefficient.
6. The method of claim 1, wherein the inputting the updated illumination map of the second image into the super-resolution denoising network further comprises:
acquiring a depth map of the second image and a normal vector map of the second image;
fusing the depth map of the second image, the normal vector map of the second image and the updated illumination map of the second image to obtain a first fusion result;
and inputting the first fusion result into the super-resolution denoising network.
7. The method of claim 1, wherein the acquiring the initial interpolated frame image at the target time according to the super-resolution denoised image of the second image and the super-resolution denoised image of the third image comprises:
acquiring a motion vector from the third image to the second image;
determining, according to the motion vector from the third image to the second image, a first motion vector from the initial interpolated frame image at the target time to the second image and a second motion vector from the initial interpolated frame image at the target time to the third image;
and acquiring the initial interpolated frame image at the target time according to the super-resolution denoised image of the second image, the super-resolution denoised image of the third image, the first motion vector, and the second motion vector.
8. The method of claim 1, wherein the inputting the initial interpolated frame image into the bidirectional frame interpolation network further comprises:
acquiring a depth map of the second image, a normal vector map of the second image, a depth map of the third image and a normal vector map of the third image;
fusing the depth map of the second image, the normal vector map of the second image, the depth map of the third image, and the normal vector map of the third image with the initial interpolated frame image to obtain a second fusion result;
and inputting the second fusion result into the bidirectional frame interpolation network.
9. The method of claim 1, wherein the super-resolution denoising network is a pre-trained neural network model, and wherein the training of the super-resolution denoising network comprises:
acquiring multiple sets of original training data for super-resolution denoising, wherein each set of the original training data comprises two consecutive frames of images and a standard image corresponding to the later frame image of the two consecutive frames of images;
determining the consistency of the pixel points of the two consecutive frames of images;
acquiring a depth map of the later frame image of the two consecutive frames of images, a normal vector map of the later frame image, and an illumination map of the later frame image, wherein the illumination map of the later frame image is a direct illumination map or an indirect illumination map;
updating the color values of the pixel points of the later frame image according to the two consecutive frames of images to obtain an updated illumination map of the later frame image;
fusing the depth map of the later frame image, the normal vector map of the later frame image, and the updated illumination map of the later frame image to obtain an updated image;
and training the super-resolution denoising network according to the updated image and the standard image.
10. The method according to any one of claims 1 to 9, wherein the bidirectional frame interpolation network is a pre-trained neural network model, and the training of the bidirectional frame interpolation network comprises:
acquiring multiple sets of original training data for bidirectional frame interpolation, wherein each set of the original training data comprises a fourth image, a fifth image, and a sixth image, and the fourth image, the fifth image, and the sixth image are three consecutive frames of images;
obtaining an interpolated frame image at an intermediate time between the fourth image and the sixth image according to the fourth image and the sixth image;
and training the bidirectional frame interpolation network according to the interpolated frame image at the intermediate time and the fifth image.
11. An apparatus for image rendering, comprising:
The acquisition module is configured to acquire a first image, a second image, and a third image, wherein the first image, the second image, and the third image are three consecutive frames of images;
the processing module is configured to update the illumination map of the second image according to the first image to obtain an updated illumination map of the second image;
the processing module is further configured to input the updated illumination map of the second image into a super-resolution denoising network to obtain a super-resolution denoised image of the second image;
the processing module is further configured to update the illumination map of the third image according to the second image to obtain an updated illumination map of the third image;
the processing module is further configured to input the updated illumination map of the third image into the super-resolution denoising network to obtain a super-resolution denoised image of the third image;
the processing module is further configured to acquire an initial interpolated frame image at a target time according to the super-resolution denoised image of the second image and the super-resolution denoised image of the third image, wherein the target time is a time between the second image and the third image;
and the processing module is further configured to input the initial interpolated frame image into a bidirectional frame interpolation network to obtain an interpolated frame image at the target time.
12. The apparatus of claim 11, wherein the processing module updates the illumination map of the second image based on the first image to obtain the updated illumination map of the second image, comprising:
acquiring an illumination map of the second image, wherein the illumination map of the second image comprises color values of a plurality of pixel points, and the illumination map of the second image is a direct illumination map or an indirect illumination map;
acquiring, in the first image, a second pixel point corresponding to a first pixel point, wherein the first pixel point is any one of the plurality of pixel points;
and updating the color value of the first pixel point according to the color value of the first pixel point and the color value of the second pixel point to obtain an updated illumination map.
13. The apparatus of claim 12, wherein when the location of the second pixel point is not on a grid node of the first image, the processing module is further to:
Acquiring color values of four pixel points closest to the second pixel point;
and acquiring the color value of the second pixel point according to the color values of the four pixel points.
14. The apparatus of claim 12, wherein the processing module is further configured to, before updating the color value of the first pixel point according to the color value of the first pixel point and the color value of the second pixel point:
determine that the first pixel point and the second pixel point are consistent, wherein the determining that the first pixel point and the second pixel point are consistent comprises:
obtaining a depth value of the first pixel point, a normal vector value of the first pixel point, a patch ID of the first pixel point, a depth value of the second pixel point, a normal vector value of the second pixel point, and a patch ID of the second pixel point, and determining that:
the square of the difference between the depth value of the first pixel point and the depth value of the second pixel point is smaller than a first threshold,
the square of the difference between the normal vector value of the first pixel point and the normal vector value of the second pixel point is smaller than a second threshold,
and the patch ID of the first pixel point is equal to the patch ID of the second pixel point.
15. The apparatus of claim 12, wherein updating the color value of the first pixel based on the color value of the first pixel and the color value of the second pixel comprises:
the updated color value of the first pixel point is the sum of the color value of the first pixel point multiplied by a first coefficient and the color value of the second pixel point multiplied by a second coefficient.
16. The apparatus of claim 11, wherein the processing module inputting the updated illumination map of the second image into the super-resolution denoising network further comprises:
acquiring a depth map of the second image and a normal vector map of the second image;
fusing the depth map of the second image, the normal vector map of the second image and the updated illumination map of the second image to obtain a first fusion result;
and inputting the first fusion result into the super-resolution denoising network.
17. The apparatus of claim 11, wherein the processing module acquiring the initial interpolated frame image at the target time according to the super-resolution denoised image of the second image and the super-resolution denoised image of the third image comprises:
acquiring a motion vector from the third image to the second image;
determining, according to the motion vector from the third image to the second image, a first motion vector from the initial interpolated frame image at the target time to the second image and a second motion vector from the initial interpolated frame image at the target time to the third image;
and determining the initial interpolated frame image at the target time according to the super-resolution denoised image of the second image, the super-resolution denoised image of the third image, the first motion vector, and the second motion vector.
18. The apparatus of claim 11, wherein the processing module inputting the initial interpolated frame image into the bidirectional frame interpolation network further comprises:
acquiring a depth map of the second image, a normal vector map of the second image, a depth map of the third image and a normal vector map of the third image;
fusing the depth map of the second image, the normal vector map of the second image, the depth map of the third image, and the normal vector map of the third image with the initial interpolated frame image to obtain a second fusion result;
and inputting the second fusion result into the bidirectional frame interpolation network.
19. The apparatus of claim 11, wherein the super-resolution denoising network is a pre-trained neural network model, and wherein the training of the super-resolution denoising network comprises:
acquiring multiple sets of original training data for super-resolution denoising, wherein each set of the original training data comprises two consecutive frames of images and a standard image corresponding to the later frame image of the two consecutive frames of images;
determining the consistency of the pixel points of the two consecutive frames of images;
acquiring a depth map of the later frame image of the two consecutive frames of images, a normal vector map of the later frame image, and an illumination map of the later frame image, wherein the illumination map of the later frame image is a direct illumination map or an indirect illumination map;
updating the color values of the pixel points of the later frame image according to the two consecutive frames of images to obtain an updated illumination map of the later frame image;
fusing the depth map of the later frame image, the normal vector map of the later frame image, and the updated illumination map of the later frame image to obtain an updated image;
and training the super-resolution denoising network according to the updated image and the standard image.
20. The apparatus according to any one of claims 11 to 19, wherein the bidirectional frame interpolation network is a pre-trained neural network model, and the training of the bidirectional frame interpolation network comprises:
acquiring multiple sets of original training data for bidirectional frame interpolation, wherein each set of the original training data comprises a fourth image, a fifth image, and a sixth image, and the fourth image, the fifth image, and the sixth image are three consecutive frames of images;
obtaining an interpolated frame image at an intermediate time between the fourth image and the sixth image according to the fourth image and the sixth image;
and training the bidirectional frame interpolation network according to the interpolated frame image at the intermediate time and the fifth image.
21. A computer device, comprising:
A memory for storing a program;
a processor, configured to execute the program stored in the memory, wherein when the program stored in the memory is executed by the processor, the processor performs the method of any one of claims 1 to 10.
22. An electronic device comprising the apparatus for image rendering according to any one of claims 11 to 20.
23. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program executable by a processor, which when executed by the processor performs the method according to any of claims 1 to 10.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010971444.5A CN112184575B (en) | 2020-09-16 | 2020-09-16 | Image rendering method and device |
PCT/CN2021/115203 WO2022057598A1 (en) | 2020-09-16 | 2021-08-30 | Image rendering method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112184575A CN112184575A (en) | 2021-01-05 |
CN112184575B true CN112184575B (en) | 2024-09-13 |
Also Published As
Publication number | Publication date |
---|---|
CN112184575A (en) | 2021-01-05 |
WO2022057598A1 (en) | 2022-03-24 |
Legal Events
Date | Code | Title | Description
---|---|---|---
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||