CN116257338A - Parallel optimization method and system for Blender rendering on Shenwei supercomputing - Google Patents


Info

Publication number
CN116257338A
Authority
CN
China
Prior art keywords
rendering
data
blender
shenwei
super
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211728123.8A
Other languages
Chinese (zh)
Inventor
陈彦言
徐希豪
张琳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jinan Supercomputing Center Co ltd
Original Assignee
Jinan Supercomputing Center Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jinan Supercomputing Center Co ltd filed Critical Jinan Supercomputing Center Co ltd
Priority to CN202211728123.8A priority Critical patent/CN116257338A/en
Publication of CN116257338A publication Critical patent/CN116257338A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 - Multiprogramming arrangements
    • G06F9/48 - Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 - Task transfer initiation or dispatching
    • G06F9/4843 - Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881 - Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 - Multiprogramming arrangements
    • G06F9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00 - 3D [Three Dimensional] image rendering
    • G06T15/06 - Ray-tracing
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Graphics (AREA)
  • Image Processing (AREA)

Abstract

The disclosure provides a parallel optimization method and system for Blender rendering on Shenwei supercomputing, comprising the following steps: obtaining the rendering scene data of a Blender task to be rendered and partitioning the data, where the number of partitions is determined by the number of core groups of a node in the Shenwei supercomputer; rendering each partition of the scene data with an independent process that uses the computing resources of a different core group, each core group corresponding to one independent process; and, after all independent processes have finished computing, merging the rendering results of every core group to obtain the final rendering result. When the denoising function is enabled during Blender rendering, the data-dependence problem is resolved by means of a data overlap region.

Description

Parallel optimization method and system for Blender rendering on Shenwei supercomputing
Technical Field
The disclosure belongs to the technical field of application-software porting and optimization, and in particular relates to a parallel optimization method and system for Blender rendering on Shenwei supercomputing.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Ordinary application software is usually not ready to run efficiently on the new-generation Shenwei supercomputer (a domestically developed system); a great deal of porting and optimization work is required, and during that work the implementation of an algorithm often has to be changed. In its original form, software such as Blender relies mainly on threading for each scene-rendering pass and places all of the data into an accelerator. Because of the characteristics of the Shenwei architecture, the slave cores that provide the high-performance computation have limited memory and cannot hold large-scale data, so they cannot accept the complete rendering scene data. In addition, a Shenwei core group is designed with one master core and 64 slave cores, and one node (comparable to a CPU) contains six core groups, so the existing threading approach does not scale well. The inventors therefore found that porting the software efficiently to Shenwei, while ensuring that the data fits in the slave cores, making full use of the resources of multiple core groups and achieving good scalability, faces the following problems:
(1) During rendering computation, unbalanced data distribution prevents the slave-core resources from being used reasonably; alternatively, because the data cannot all reside in a slave core at the same time, communication overhead grows and operating efficiency drops. Moreover, because Blender relies on threading, the data has to be handled in smaller local pieces on Shenwei, which reduces the amount of data entering the slave cores, increases communication overhead, lowers efficiency and raises running cost.
(2) Threading scales poorly: on CPU or GPU architectures it can only be used within a single node, and on the Shenwei architecture it can only be used within a single core group or in the special large-sharing mode, so the computing resources cannot be fully exploited.
(3) In Blender's rendering denoising, every noisy pixel in the image depends on the data of its surrounding pixels. The porting and optimization must therefore take this data dependence into account, and resolving it by exchanging data would inevitably increase communication cost again.
Disclosure of Invention
To solve the above problems, the present disclosure provides a parallel optimization method and system for Blender rendering on Shenwei supercomputing. The scheme partitions the rendering scene data of Blender on Shenwei, removing the obstacles that scene rendering and denoising pose to parallelization on Shenwei while guaranteeing program correctness. Although some data redundancy is introduced, the slave cores and master cores of the Shenwei chip are used effectively for parallel computation, and scalability is improved so that the computing resources are no longer limited to one core group, or even to one node.
According to a first aspect of the embodiments of the present disclosure, a parallel optimization method of Blender rendering on Shenwei supercomputing is provided, including:
obtaining the rendering scene data of a Blender task to be rendered and partitioning the data, where the number of partitions is determined by the number of core groups of a node in the Shenwei supercomputer;
rendering each partition of the scene data with an independent process that uses the computing resources of a different core group, each core group corresponding to one independent process;
after all independent processes have finished computing, merging the rendering results of every core group to obtain the final rendering result; when the denoising function is enabled during Blender rendering, the data-dependence problem is resolved by means of a data overlap region.
Further, merging the rendering results of every core group to obtain the final rendering result specifically means: each independent process renders its own part of the rendering scene data using the computing resources of the corresponding core group, and no data communication takes place between processes; once all independent processes have finished rendering, the RGB data of their rendered scenes are merged through MPI communication to obtain the final rendering result.
Further, partitioning the rendering scene data specifically means: the rendering scene data is split along a single dimension into a number of parts equal to the number of core groups.
Further, when the denoising function is enabled during Blender rendering, the data-dependence problem is solved by adding a data overlap region, specifically: when the rendering scene data is partitioned, each partition additionally takes one adjacent row of data at its first row and at its last row.
Further, each independent process over-fetches the redundant data only when the data is obtained; no data is exchanged during the rendering computation.
Further, each node of the Shenwei supercomputer comprises six core groups, and each core group comprises one master core and 64 slave cores.
According to a second aspect of the embodiments of the present disclosure, a parallel optimization system for Blender rendering on Shenwei supercomputing is provided, comprising:
a data partitioning unit, configured to obtain the rendering scene data of a Blender task to be rendered and partition the data, where the number of partitions is determined by the number of core groups of a node in the Shenwei supercomputer;
a rendering unit, configured to render each partition of the scene data with an independent process that uses the computing resources of a different core group, each core group corresponding to one independent process;
a merging unit, configured to merge the rendering results of every core group into the final rendering result after all independent processes have finished computing; when the denoising function is enabled during Blender rendering, the data-dependence problem is resolved by means of a data overlap region.
According to a third aspect of the embodiments of the present disclosure, an electronic device is provided, including a memory, a processor and a computer program stored on the memory and runnable on the processor; when executing the program, the processor implements the above parallel optimization method of Blender rendering on Shenwei supercomputing.
According to a fourth aspect of the embodiments of the present disclosure, a non-transitory computer-readable storage medium is provided, on which a computer program is stored; when executed by a processor, the program implements the above parallel optimization method of Blender rendering on Shenwei supercomputing.
Compared with the prior art, the beneficial effects of the present disclosure are:
(1) The scheme partitions the rendering scene data of Blender on the domestically developed Shenwei, removing the obstacles that scene rendering and denoising pose to parallelization on Shenwei while guaranteeing program correctness. Although some data redundancy is introduced, the slave cores and master cores of the Shenwei chip are used effectively for parallel computation, and scalability is improved so that the computing resources are no longer limited to one core group, or even to one node. Tests before and after the modification show a speedup over the source code: under the same data scale and degree of parallelism, the master-slave computing time is accelerated by 17 to 28.5 times.
(2) Because the scheme is implemented on the MPI (Message Passing Interface) standard, it is also applicable to CPU and GPU architectures. Blender currently has no MPI module; as the rendering scene data grows and scenes become more complex, a single node or a single GPU can no longer meet the computational demand. The method therefore increases the scalability of the software and makes cross-node computation possible.
Additional aspects of the disclosure will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the disclosure.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate and explain the exemplary embodiments of the disclosure and together with the description serve to explain the disclosure, and do not constitute an undue limitation on the disclosure.
FIG. 1 is a schematic diagram of a Blender thread hierarchy in an embodiment of the disclosure;
FIG. 2 is a schematic diagram of data partitioning according to an embodiment of the present disclosure;
FIG. 3 is a rendering effect after data segmentation according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of noise reduction effect when there is data dependency according to an embodiment of the disclosure;
FIG. 5 is a schematic diagram of a rendering result of a parallel optimization method of Blender rendering on the Shenwei supercomputer according to an embodiment of the disclosure;
FIG. 6 is a flow chart of a parallel optimization method for Blender rendering on the Shenwei supercomputer according to an embodiment of the disclosure.
Detailed Description
The disclosure is further described below with reference to the drawings and examples.
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the present disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments in accordance with the present disclosure. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
Embodiments of the present disclosure and features of embodiments may be combined with each other without conflict.
Term interpretation:
ray tracing algorithm: ray Tracing algorithm (Ray Tracing) is a calculation method for simulating image scene rays in calculation based on the basic principles of refraction and reflection of rays.
Slave core: the core in the domestically developed Shenwei processor that carries out the high-performance computation.
Blender: blender is a free open-source three-dimensional graphics image software that provides a range of animation shortcuts from modeling, animation, texture, rendering, to audio processing, video clipping, etc.
Image noise reduction: image Denoising (Image Denoising) is a term of art in Image processing. Refers to a process of reducing noise in a digital image, sometimes referred to as image denoising.
Large-sharing mode: under the Shenwei architecture, the six core groups in one node each have independent computing resources and LDM space, with one master core leading 64 slave cores, and the groups do not interfere with one another. When the 64 slave cores available at a time are not enough, the large-sharing mode offers a simple and quick way to use more slave cores within the node: one master core can lead up to 384 slave cores, and the computing resources of the core groups are shared, which is equivalent to merging the slave cores; combinations of 2, 3 or 6 core groups can be selected. Combining all six core groups, however, wastes the computing resources of the remaining five master cores.
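For reference only (standard optics, not part of the original disclosure; the symbols d, n and r are the editor's notation), the reflection rule that a ray tracer applies at a surface point is:

\[ \mathbf{r} = \mathbf{d} - 2\,(\mathbf{d}\cdot\mathbf{n})\,\mathbf{n} \]

where d is the incoming ray direction, n is the unit surface normal and r is the reflected direction; refraction is handled analogously via Snell's law. Because such rays are traced per pixel, each pixel of the rendered image is independent of the others during rendering, which is exactly the property the partitioning scheme below relies on.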
Embodiment one:
The embodiment aims to provide a parallel optimization method for Blender rendering on Shenwei supercomputing.
A parallel optimization method for Blender rendering on the Shenwei supercomputer comprises the following steps:
obtaining the rendering scene data of a Blender task to be rendered and partitioning the data, where the number of partitions is determined by the number of core groups of a node in the Shenwei supercomputer;
rendering each partition of the scene data with an independent process that uses the computing resources of a different core group, each core group corresponding to one independent process;
after all independent processes have finished computing, merging the rendering results of every core group to obtain the final rendering result; when the denoising function is enabled during Blender rendering, the data-dependence problem is resolved by means of a data overlap region.
Further, merging the rendering results of every core group to obtain the final rendering result specifically means: each independent process renders its own part of the rendering scene data using the computing resources of the corresponding core group, and no data communication takes place between processes; once all independent processes have finished rendering, the RGB data of their rendered scenes are merged through MPI communication to obtain the final rendering result.
Further, partitioning the rendering scene data specifically means: the rendering scene data is split along a single dimension into a number of parts equal to the number of core groups, so that the amount of data in each part is as similar as possible.
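By way of illustration only (this sketch is not part of the original disclosure, and the function and variable names are the editor's assumptions), the row partition described above can be expressed in C as follows: the image rows are split into as many nearly equal ranges as there are core groups, with the first height mod nparts ranges holding one extra row.

```c
#include <stdio.h>

/* Split `height` image rows into `nparts` nearly equal ranges, one per
 * core group / process. Ranks below `height % nparts` get one extra row,
 * so the part sizes differ by at most one. */
static void split_rows(int height, int nparts, int rank,
                       int *row_begin, int *row_end)
{
    int base  = height / nparts;          /* rows every part receives     */
    int extra = height % nparts;          /* leftover rows to spread out  */
    int begin = rank * base + (rank < extra ? rank : extra);
    int count = base + (rank < extra ? 1 : 0);

    *row_begin = begin;
    *row_end   = begin + count;           /* exclusive upper bound */
}

int main(void)
{
    /* Example: a 1080-row frame split over the six core groups of one node. */
    for (int rank = 0; rank < 6; ++rank) {
        int b, e;
        split_rows(1080, 6, rank, &b, &e);
        printf("process %d renders rows [%d, %d)\n", rank, b, e);
    }
    return 0;
}
```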
Further, when the denoising function is enabled during Blender rendering, the data-dependence problem is solved by adding a data overlap region, specifically: when the rendering scene data is partitioned, each partition additionally takes one adjacent row of data at its first row and at its last row.
Further, each independent process over-fetches the redundant data only when the data is obtained; no data is exchanged during the rendering computation.
Further, each node of the Shenwei supercomputer comprises six core groups, and each core group comprises one master core and 64 slave cores.
In particular, for easy understanding, the following detailed description of the embodiments will be given with reference to the accompanying drawings:
the technical problems to be solved by the solution in this embodiment are summarized as follows:
(1) For Blender's existing ray-tracing implementation, how to use the multi-core-group computing resources of the domestically developed Shenwei reasonably so that they are fully utilized. The large-sharing mode can only be used within a single node and leaves five master cores idle, so another solution has to be found.
(2) How to avoid the conflicts caused by the original threading, whose main drawback is that thread locks reduce computational efficiency and thus waste computing resources. In Blender's original thread hierarchy, the threads of the three layers are interwoven with one another, as shown in FIG. 1.
(3) How Blender handles the data dependence arising in the rendering process while reducing communication overhead.
To solve the above problems, this embodiment provides a parallel optimization method of Blender rendering on Shenwei supercomputing. The main technical idea of the scheme is as follows: the rendering scene data of Blender is partitioned, which removes the obstacles that scene rendering and denoising pose to parallelization on the domestically developed Shenwei while guaranteeing program correctness; although some data redundancy is introduced, the slave cores and master cores of the Shenwei chip are used effectively for parallel computation, and scalability is improved so that the computing resources are no longer limited to one core group, or even to one node.
Specifically, as shown in FIG. 2, the plane perpendicular to the viewing direction of the three-dimensional rendering scene is denoted in this embodiment as the plane of the x axis and the y axis. The rendered scene is partitioned along the x-y plane using task-level parallelism (implemented with MPI, the Message Passing Interface). Taking a single node with six core groups as an example, six processes are used and the rendered scene is split into six parts (FIG. 2). While the ray-tracing algorithm is being computed, the six parts neither communicate with one another nor depend on one another's data, and each part is rendered independently. FIG. 3 shows one frame, stored as an image, rendered by a single process: only the region of the first process has been rendered and the rest of the image is blank. Only after the computation is complete are the RGB data of the images merged to the main process through MPI communication, finally composing one picture. This guarantees that no communication takes place during computation, which improves operating efficiency; using MPI for task-level parallelism avoids conflicts with the original thread-level technology, makes full use of the multi-core-group resources without being limited to a single node, provides good scalability, and improves efficiency by a factor of four.
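A minimal sketch of this render-then-merge structure is given below. It is the editor's illustration, not Blender's real code: the frame size, the buffer layout and the stand-in fill_rows renderer are assumptions, and only the MPI calls are the real MPI C API. Each process renders only its own row slice, and the single communication step is an MPI_Gatherv that assembles the RGB slices on rank 0.

```c
#include <mpi.h>
#include <stdlib.h>

#define WIDTH  1920   /* assumed frame size for the sketch */
#define HEIGHT 1080

/* Same row split as in the earlier sketch. */
static void split_rows(int height, int nparts, int rank, int *b, int *e)
{
    int base = height / nparts, extra = height % nparts;
    *b = rank * base + (rank < extra ? rank : extra);
    *e = *b + base + (rank < extra ? 1 : 0);
}

/* Stand-in for the per-partition render: fills the slice with a grey
 * gradient so the sketch stays self-contained and runnable. */
static void fill_rows(float *rgb, int row_begin, int row_end)
{
    for (int y = row_begin; y < row_end; ++y)
        for (int x = 0; x < WIDTH; ++x)
            for (int c = 0; c < 3; ++c)
                rgb[(((y - row_begin) * WIDTH) + x) * 3 + c] = (float)y / HEIGHT;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);   /* e.g. 6, one rank per core group */

    int b, e;
    split_rows(HEIGHT, nprocs, rank, &b, &e);
    float *local_rgb = malloc((size_t)(e - b) * WIDTH * 3 * sizeof(float));

    fill_rows(local_rgb, b, e);               /* rendering: no communication here */

    int *counts = NULL, *displs = NULL;
    float *frame = NULL;
    if (rank == 0) {                          /* rank 0 receives the full frame */
        counts = malloc(nprocs * sizeof(int));
        displs = malloc(nprocs * sizeof(int));
        frame  = malloc((size_t)HEIGHT * WIDTH * 3 * sizeof(float));
        for (int r = 0; r < nprocs; ++r) {
            int rb, re;
            split_rows(HEIGHT, nprocs, r, &rb, &re);
            counts[r] = (re - rb) * WIDTH * 3;
            displs[r] = rb * WIDTH * 3;
        }
    }

    /* The only communication step: merge all RGB slices into one image. */
    MPI_Gatherv(local_rgb, (e - b) * WIDTH * 3, MPI_FLOAT,
                frame, counts, displs, MPI_FLOAT, 0, MPI_COMM_WORLD);

    /* ...rank 0 would now write `frame` out as the final picture... */
    MPI_Finalize();
    return 0;
}
```

Such a program would be launched with one rank per core group, for example mpirun -n 6 on a single node; how ranks are pinned to Shenwei core groups is scheduler-specific and not shown here.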
Meanwhile, the approach above cannot by itself handle the case where the denoising function is enabled during the rendering computation, because Blender's denoising is data-dependent and, with MPI, memory is not shared between processes. If the dependence is ignored, the denoising quality is very poor and noise is clearly visible along the partition boundaries (FIG. 4); normally such a situation would require communication, which raises the cost again and is therefore not ideal. Based on the characteristics of ray tracing and denoising, this scheme instead designs a data overlap region to resolve the data dependence: each process initializes two extra rows of data, that is, for the partitioned rendering scene data the first and last rows each additionally take one adjacent row (FIG. 5). Each process only over-fetches data at initialization and exchanges no data during computation; the denoising errors are corrected step by step during the computation, so that the data finally needed by the main process is error-free (FIG. 6).
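The overlap rows can be illustrated with the following sketch (again the editor's illustration under the same assumptions as above; the struct and function names are hypothetical). Each process initializes one extra row above its first row and one below its last row, clamped at the image borders; after rendering and denoising, only the interior rows are merged to the main process.

```c
#include <stdio.h>

/* Row range owned by a process, plus the overlap ("halo") rows it also
 * initialises so the denoiser sees valid neighbour data at the boundary. */
typedef struct {
    int own_begin, own_end;    /* rows merged into the final image        */
    int halo_begin, halo_end;  /* rows actually initialised and denoised  */
} row_range;

static row_range rows_with_overlap(int height, int nparts, int rank)
{
    int base = height / nparts, extra = height % nparts;
    int begin = rank * base + (rank < extra ? rank : extra);
    int end   = begin + base + (rank < extra ? 1 : 0);

    row_range r;
    r.own_begin  = begin;
    r.own_end    = end;
    r.halo_begin = (begin > 0)    ? begin - 1 : begin;   /* one extra row above */
    r.halo_end   = (end < height) ? end + 1   : end;     /* one extra row below */
    return r;
}

int main(void)
{
    /* Example: six partitions of a 1080-row frame. The overlap data is
     * fetched once at initialisation; no data is exchanged afterwards. */
    for (int rank = 0; rank < 6; ++rank) {
        row_range r = rows_with_overlap(1080, 6, rank);
        printf("process %d: initialises rows [%d, %d), keeps rows [%d, %d)\n",
               rank, r.halo_begin, r.halo_end, r.own_begin, r.own_end);
    }
    return 0;
}
```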
Embodiment two:
The embodiment aims to provide a parallel optimization system for Blender rendering on Shenwei supercomputing.
A parallel optimization system for Blender rendering on the Shenwei supercomputer comprises:
a data partitioning unit, configured to obtain the rendering scene data of a Blender task to be rendered and partition the data, where the number of partitions is determined by the number of core groups of a node in the Shenwei supercomputer;
a rendering unit, configured to render each partition of the scene data with an independent process that uses the computing resources of a different core group, each core group corresponding to one independent process;
a merging unit, configured to merge the rendering results of every core group into the final rendering result after all independent processes have finished computing; when the denoising function is enabled during Blender rendering, the data-dependence problem is resolved by means of a data overlap region.
Further, the system in this embodiment corresponds to the method in embodiment one; its technical details have been described in embodiment one and are not repeated here.
In further embodiments, there is also provided:
An electronic device comprising a memory, a processor and computer instructions stored on the memory and runnable on the processor; when the instructions are executed by the processor, the method of embodiment one is performed. For brevity, details are not repeated here.
It should be understood that in this embodiment the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory may include read only memory and random access memory and provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store information of the device type.
A computer readable storage medium storing computer instructions which, when executed by a processor, perform the method of embodiment one.
The method in embodiment one may be carried out directly by a hardware processor, or by a combination of hardware and software modules within the processor. The software modules may reside in random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, or any other storage medium well known in the art. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware. To avoid repetition, a detailed description is not provided here.
Those of ordinary skill in the art will appreciate that the elements of the various examples described in connection with the present embodiments, i.e., the algorithm steps, can be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
The parallel optimization method and system for Blender rendering on Shenwei supercomputing described above can be put into practice and have broad application prospects.
The foregoing description covers only the preferred embodiments of the present disclosure and is not intended to limit it; those skilled in the art may make various modifications and changes to the present disclosure. Any modification, equivalent replacement or improvement made within the spirit and principle of the present disclosure shall fall within its protection scope.

Claims (10)

1. A parallel optimization method of Blender rendering on Shenwei supercomputing, comprising:
obtaining the rendering scene data of a Blender task to be rendered and partitioning the data, wherein the number of partitions is determined by the number of core groups of a node in the Shenwei supercomputer;
rendering each partition of the scene data with an independent process that uses the computing resources of a different core group, each core group corresponding to one independent process;
after all independent processes have finished computing, merging the rendering results of every core group to obtain the final rendering result; wherein, when the denoising function is enabled during Blender rendering, the data-dependence problem is resolved by means of a data overlap region.
2. The parallel optimization method of Blender rendering on Shenwei supercomputing according to claim 1, wherein merging the rendering results of every core group to obtain the final rendering result specifically means: each independent process renders its own part of the rendering scene data using the computing resources of the corresponding core group, and no data communication takes place between processes; once all independent processes have finished rendering, the RGB data of their rendered scenes are merged through MPI communication to obtain the final rendering result.
3. The parallel optimization method of Blender rendering on Shenwei supercomputing according to claim 1, wherein partitioning the rendering scene data specifically means: the rendering scene data is split along a single dimension into a number of parts equal to the number of core groups.
4. The parallel optimization method of Blender rendering on Shenwei supercomputing according to claim 1, wherein, when the denoising function is enabled during Blender rendering, the data-dependence problem is solved by adding a data overlap region, specifically: when the rendering scene data is partitioned, each partition additionally takes one adjacent row of data at its first row and at its last row.
5. The parallel optimization method of Blender rendering on Shenwei supercomputing according to claim 1, wherein each independent process over-fetches the redundant data only when the data is obtained, and no data is exchanged during the rendering computation.
6. The parallel optimization method of Blender rendering on Shenwei supercomputing according to claim 1, wherein each node of the Shenwei supercomputer comprises six core groups, each core group comprising one master core and 64 slave cores.
7. A parallel optimization system for Blender rendering on Shenwei supercomputing, comprising:
a data partitioning unit, configured to obtain the rendering scene data of a Blender task to be rendered and partition the data, wherein the number of partitions is determined by the number of core groups of a node in the Shenwei supercomputer;
a rendering unit, configured to render each partition of the scene data with an independent process that uses the computing resources of a different core group, each core group corresponding to one independent process;
a merging unit, configured to merge the rendering results of every core group into the final rendering result after all independent processes have finished computing; wherein, when the denoising function is enabled during Blender rendering, the data-dependence problem is resolved by means of a data overlap region.
8. The parallel optimization system of Blender rendering on Shenwei supercomputing according to claim 7, wherein, when the denoising function is enabled during Blender rendering, the data-dependence problem is solved by adding a data overlap region, specifically: when the rendering scene data is partitioned, each partition additionally takes one adjacent row of data at its first row and at its last row.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and runnable on the processor, wherein the processor, when executing the program, implements the parallel optimization method of Blender rendering on Shenwei supercomputing according to any one of claims 1-6.
10. A non-transitory computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the parallel optimization method of Blender rendering on Shenwei supercomputing according to any one of claims 1-6.
CN202211728123.8A 2022-12-29 2022-12-29 Parallel optimization method and system for Blender rendering in Shenwei super calculation Pending CN116257338A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211728123.8A CN116257338A (en) 2022-12-29 2022-12-29 Parallel optimization method and system for Blender rendering in Shenwei super calculation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211728123.8A CN116257338A (en) 2022-12-29 2022-12-29 Parallel optimization method and system for Blender rendering in Shenwei super calculation

Publications (1)

Publication Number Publication Date
CN116257338A (en) 2023-06-13

Family

ID=86685521

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211728123.8A Pending CN116257338A (en) 2022-12-29 2022-12-29 Parallel optimization method and system for Blender rendering in Shenwei super calculation

Country Status (1)

Country Link
CN (1) CN116257338A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117149706A (en) * 2023-10-27 2023-12-01 山东大学 Large-scale parallel optimization method and system for seismic simulation data
CN117149706B (en) * 2023-10-27 2024-03-19 山东大学 Large-scale parallel optimization method and system for seismic simulation data

Similar Documents

Publication Publication Date Title
CN106991011B (en) CPU multithreading and GPU (graphics processing unit) multi-granularity parallel and cooperative optimization based method
CN102648449B (en) A kind of method for the treatment of interference incident and Graphics Processing Unit
US9274904B2 (en) Software only inter-compute unit redundant multithreading for GPUs
US11900113B2 (en) Data flow processing method and related device
US20090327669A1 (en) Information processing apparatus, program execution method, and storage medium
US20230334748A1 (en) Control stream stitching for multicore 3-d graphics rendering
EP3662376B1 (en) Reconfigurable cache architecture and methods for cache coherency
US20240045787A1 (en) Code inspection method under weak memory ordering architecture and corresponding device
US20230230319A1 (en) Method and graphics processing system for rendering one or more fragments having shader-dependent properties
CN116257338A (en) Parallel optimization method and system for Blender rendering in Shenwei super calculation
US11476852B2 (en) Glitch-free multiplexer
US20190220257A1 (en) Method and apparatus for detecting inter-instruction data dependency
US20220004864A1 (en) Preventing glitch propagation
CN113743573A (en) Techniques for accessing and utilizing compressed data and state information thereof
US20230023323A1 (en) Intersection testing in a ray tracing system
US20230111909A1 (en) Overlapped geometry processing in a multicore gpu
Amert et al. CUPiD RT: Detecting improper GPU usage in real-time applications
CN113326137B (en) Deep learning calculation method, device, chip and medium
US11900503B2 (en) Multi-core draw splitting
US20230334750A1 (en) Methods and hardware logic for loading ray tracing data into a shader processing unit of a graphics processing unit
US20240094944A1 (en) Implementing data flows of an application across a memory hierarchy of a data processing array
US20230144553A1 (en) Software-directed register file sharing
GB2614098A (en) Methods and hardware logic for writing ray tracing data from a shader processing unit of a graphics processing unit
GB2611599A (en) Methods and hardware logic for loading ray tracing data into a shader processing unit of a graphics processing unit
GB2601637A (en) Method and graphics processing system for rendering one or more fragments having shader-dependent properties

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination