CN116385624A - Stack buffer memory component for ray tracing hardware accelerator and application method thereof - Google Patents

Stack buffer memory component for ray tracing hardware accelerator and application method thereof

Info

Publication number
CN116385624A
Authority
CN
China
Prior art keywords
stack
ray
cache
request
lookup table
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310398440.6A
Other languages
Chinese (zh)
Inventor
黄立波
闫润
苏垠
郭辉
郑重
邓全
郭维
雷国庆
王俊辉
隋兵才
孙彩霞
王永文
倪晓强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202310398440.6A (CN116385624A)
Publication of CN116385624A
Legal status: Pending (Current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 15/00 3D [Three Dimensional] image rendering
    • G06T 15/06 Ray-tracing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 1/00 General purpose image data processing
    • G06T 1/20 Processor architectures; Processor configuration, e.g. pipelining
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 15/00 3D [Three Dimensional] image rendering
    • G06T 15/005 General purpose rendering architectures
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Graphics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention discloses a stack cache component for a ray tracing hardware accelerator and an application method thereof. The stack cache component of the invention comprises: a stack heap for storing node information during ray traversal; a multiplexer for matching data against the lookup table to determine whether a pop or push operation needs to be performed on the stack cache; a lookup table for storing stack numbers and the corresponding stack cache sequence numbers; and a stack cache composed of several stacks and their sequence numbers. The stack heap, the multiplexer, the lookup table and the stack cache are connected in sequence to form a pipeline structure. The invention provides a stack cache for the ray tracing hardware accelerator while preserving ray parallelism, and has the advantages of low memory resource overhead, simple design logic and little impact on the performance of the ray tracing hardware.

Description

Stack buffer memory component for ray tracing hardware accelerator and application method thereof
Technical Field
The invention relates to the field of computer image rendering hardware design, and in particular to a stack cache component for a ray tracing hardware accelerator and an application method thereof.
Background
With the growing demand of applications such as film, animation and games for realistic imagery, computer-rendered images are increasingly expected to approach photographs taken in a real environment. Rendering is the process of converting a description of a 3D scene into a two-dimensional image in some form. The two main rendering approaches today are rasterization and ray tracing. Rasterization uses a local illumination model: geometric primitives in the scene are mapped to image pixels according to the illumination that reaches the object directly from the light source. Ray tracing uses a global illumination model that describes the interaction between rays and objects on physical principles; it accounts not only for direct illumination but also for mutual illumination between objects, so the result appears more three-dimensional and has softer colors than traditional rasterized rendering.
To accelerate the traversal step of ray tracing, an acceleration structure (AS) is generally used. It partitions the primitives in a scene into a hierarchical spatial structure under which irrelevant regions of space can be quickly culled, so that the primitive closest to a ray is found efficiently. Tree structures are the most widely used today, chiefly the kd-tree, which subdivides space into smaller regions, and the bounding volume hierarchy (BVH), which decomposes the scene into progressively smaller groups of object geometry.
A typical ray tracing hardware accelerator pipeline is shown in fig. 1. It mainly comprises a ray generation renderer, a hit renderer and a miss renderer: the ray generation renderer produces rays that traverse the acceleration data structure and undergo triangle intersection tests; if the intersection test hits, the hit renderer performs shading, otherwise the miss renderer does, and the rendered image is finally obtained. Taking the bounding volume hierarchy (BVH) of fig. 2 as an example, the scene contains four triangles, and the spatial relationship of regions A to D is shown by the tree structure on the right. A ray traverses this data structure iteratively. When a ray (the arrow on the left of fig. 2) is tested against the two children of a node, a stack operation may be required depending on whether the ray intersects each child and on the intersection distances. If the ray intersects neither child, the next node is popped from the stack; if the ray intersects both children, the child with the shorter intersection distance proceeds to the triangle intersection test stage and the farther child is pushed onto the stack; if the ray intersects only one child, that child proceeds to the triangle intersection test stage and no stack operation is performed.
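The traversal rule above can be illustrated with a short behavioral sketch in C++. This is not the claimed hardware design; the types Ray and Node, the bounding-box test intersect() and the downstream stage visit() are hypothetical placeholders used only to show the three stack decisions.

```cpp
#include <optional>
#include <vector>

// Minimal behavioral sketch of the per-node stack decision during BVH traversal.
struct Node { const Node* left; const Node* right; };
struct Ray  { float org[3], dir[3]; };

std::optional<float> intersect(const Ray&, const Node&);   // AABB test, assumed to return the hit distance
void visit(const Ray&, const Node&);                       // hand-off to the triangle intersection test stage (assumed)

void traverse_step(const Ray& ray, const Node& node, std::vector<const Node*>& stack) {
    auto l = intersect(ray, *node.left);
    auto r = intersect(ray, *node.right);

    if (!l && !r) {                       // no child hit: pop the next node from the stack
        if (!stack.empty()) stack.pop_back();
    } else if (l && r) {                  // both hit: visit the near child, push the far child
        const Node* near_child = (*l <= *r) ? node.left  : node.right;
        const Node* far_child  = (*l <= *r) ? node.right : node.left;
        stack.push_back(far_child);
        visit(ray, *near_child);
    } else {                              // exactly one hit: no stack operation
        visit(ray, *(l ? node.left : node.right));
    }
}
```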
In modern CPUs and GPUs, because register space is limited and managing a traversal stack in software is complex, ray tracing is typically performed with one thread per ray, which limits parallelism. In addition, the SIMD architectures common in CPUs and GPUs suffer from long-tail effects when multiple threads process rays of very different cost, lowering overall efficiency. For these reasons, many ray tracing hardware accelerators use a heap of hardware stacks, one per in-flight ray, as an effective way to obtain ray parallelism. In practice, however, the required stack depth is large, so full-depth stacks occupy excessive hardware resources; whether for embedded devices or general-purpose GPUs, such memory overhead is unacceptable.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: in view of the above problems in the prior art, the invention provides a stack cache component for a ray tracing hardware accelerator and an application method thereof.
In order to solve the technical problems, the invention adopts the following technical scheme:
a stack cache component for a ray traced hardware accelerator, comprising:
a stack for storing node information in the light traversal process;
the multiplexer is used for matching the data in the lookup table to determine whether stack ejection or stack pressing operation is required to be carried out on the stack cache;
the lookup table is used for storing the stack numbers and the serial numbers corresponding to the stack caches;
stack cache, which consists of a plurality of stacks and serial numbers thereof;
the stack pile, the multiplexer, the lookup table and the stack cache are sequentially connected to form a pipeline structure.
Optionally, the stack heap contains n stacks, numbered 1 to n, for storing node information during ray traversal.
Optionally, each stack in the stack heap has a flag bit whose value is 0 or 1, indicating whether data of the corresponding stack is already stored in the stack cache.
In addition, the invention also provides a ray tracing hardware accelerator comprising a ray generation renderer, a ray traversal unit, a triangle intersection test unit, a hit renderer, a miss renderer and the above stack cache component for a ray tracing hardware accelerator. The ray generation renderer, the ray traversal unit and the triangle intersection test unit are connected in sequence; the triangle intersection test unit is connected to the hit renderer and the miss renderer respectively; and the stack cache component is connected to the ray traversal unit, the triangle intersection test unit, the hit renderer and the miss renderer respectively.
In addition, the invention also provides an application method of the above stack cache component for a ray tracing hardware accelerator, comprising performing a pop operation on one stack in the stack heap in response to a pop request, and simultaneously performing the following steps on the stack cache in response to the pop request:
S101, judging whether the stack that needs the pop operation is already stored in the stack cache; if not, ending and exiting; otherwise, jumping to step S102;
S102, looking up, according to the stack number of the pop operation, the corresponding stack cache sequence number in the lookup table;
S103, locating, according to the found sequence number, the corresponding cache stack in the stack cache and popping the data from that cache stack.
Optionally, judging in step S101 whether the stack requiring the pop operation is already stored in the stack cache means checking the value of the flag bit of that stack: if the flag bit is 0, it is judged that the stack is not stored in the stack cache; otherwise it is judged that the stack is already stored in the stack cache.
Optionally, the pop request comes from the ray traversal unit or the triangle intersection test unit of the ray tracing hardware accelerator: the ray traversal unit sends a pop request to the stack heap when it determines that neither child node intersects the ray, and the triangle intersection test unit sends a pop request to the stack heap after its triangle intersection test has finished.
Optionally, the following steps are executed in response to a push request:
S201, judging whether the stack of the push request is full in the stack heap; if it is not full, directly storing the data of the push request into the corresponding stack, then ending and exiting; otherwise, jumping to step S202;
S202, judging whether the stack of the push request is already stored in the stack cache; if not, jumping to step S203; otherwise, jumping to step S204;
S203, sending the push request through the multiplexer to the lookup table to find whether a free stack cache exists: if a free stack cache exists, storing the data of the push request into the free cache stack with the smallest sequence number and recording in the lookup table the sequence number used and the corresponding stack number; if no stack cache is free, temporarily blocking the processing of the current ray and withholding the data of the push request until a free stack cache appears in the lookup table, then resuming the pipeline; and setting the flag bit of the corresponding stack in the stack heap to 1;
S204, sending the push request through the multiplexer to the lookup table to find the corresponding stack number and sequence number, and then pushing the data into the stack cache according to that sequence number.
Optionally, judging in step S202 whether the stack of the push request is already stored in the stack cache means checking the value of the flag bit of that stack: if the flag bit is 0, it is judged that the stack is not stored in the stack cache; otherwise it is judged that the stack is already stored in the stack cache.
Optionally, the push request comes from the ray traversal unit of the ray tracing hardware accelerator: after traversal of a node finishes, if the ray traversal unit determines that the ray intersects both child nodes, it sends a push request to push the information of the farther of the two child nodes onto the stack.
Compared with the prior art, the invention has the following advantages:
1. Low storage resource overhead. Compared with a design that gives every ray a full-depth stack, the invention can cut the storage resources of the stack heap by nearly half.
2. Simple design logic. The stacks in the stack cache have the same design and operating principle as the stacks in the stack heap, so the additional logic is simple.
3. Little effect on ray tracing hardware performance. The invention is implemented as a pipeline, and the stack operations are completed by checking whether a stack is full and by matching data in the lookup table. The pipeline blocks only when all stacks in the stack cache are occupied and another stack still needs to be cached, which is very rare in ray tracing, so performance is barely affected.
Drawings
FIG. 1 is a schematic diagram of a prior art ray tracing hardware accelerator.
FIG. 2 is a schematic diagram of a prior art ray traversal and triangle intersection test.
Fig. 3 is a schematic diagram of a stack cache component according to an embodiment of the present invention.
Detailed Description
As shown in fig. 3, the stack cache component for a ray tracing hardware accelerator of this embodiment comprises:
a stack heap for storing node information during ray traversal;
a multiplexer for matching data against the lookup table to determine whether a pop or push operation needs to be performed on the stack cache;
a lookup table for storing stack numbers and the corresponding stack cache sequence numbers;
a stack cache composed of several stacks and their sequence numbers;
the stack heap, the multiplexer, the lookup table and the stack cache are connected in sequence to form a pipeline structure.
Referring to fig. 3, the stack heap in this embodiment contains n stacks, numbered 1 to n, for storing node information during ray traversal. Basic ray tracing hardware generally provides only a stack heap for storing node information during ray traversal; the stack cache component of this embodiment adds a group of stack caches to it, so data can be buffered effectively, and the pipelined design preserves the original performance.
Referring to fig. 3, each stack in the stack heap of this embodiment has a flag bit whose value is 0 or 1, indicating whether data of the corresponding stack is already stored in the stack cache. In this embodiment, stack 1 of the stack heap is full and its flag bit is 1, indicating that data of this stack is stored in the stack cache. Stack 2 is not full and its flag bit is 0, indicating that no data of this stack is stored in the stack cache. The stacks in the stack cache are designed in the same way as a single stack, but each has half the depth of a full stack.
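For illustration only, the storage structures of this embodiment can be modelled in software roughly as follows. This is a non-normative C++ sketch; the field names and the depth constants are assumptions made for the example, not values fixed by the embodiment.

```cpp
#include <cstdint>
#include <vector>

// Non-normative software model of the embodiment's storage structures.
constexpr int kFullDepth  = 32;                // assumed full per-ray stack depth
constexpr int kHeapDepth  = kFullDepth / 2;    // stacks in the heap keep half the depth
constexpr int kCacheDepth = kFullDepth / 2;    // cache stacks hold the remaining half

struct HeapStack {
    std::vector<uint64_t> entries;   // node records for one ray, at most kHeapDepth
    bool cached = false;             // flag bit: 1 means overflow data lives in the stack cache
};

struct LookupEntry {
    int stack_id;                    // which heap stack owns the cache slot
    int slot_id;                     // sequence number of the cache slot
    bool valid;
};

struct StackCacheSlot {
    std::vector<uint64_t> entries;   // at most kCacheDepth
    bool free = true;
};

struct StackCacheComponent {
    std::vector<HeapStack>      heap;    // one stack per in-flight ray
    std::vector<LookupEntry>    lut;     // stack number <-> cache sequence number mapping
    std::vector<StackCacheSlot> cache;   // fewer slots than heap stacks
};
```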
In addition, this embodiment further provides a ray tracing hardware accelerator comprising a ray generation renderer, a ray traversal unit, a triangle intersection test unit, a hit renderer, a miss renderer and the above stack cache component for a ray tracing hardware accelerator. The ray generation renderer, the ray traversal unit and the triangle intersection test unit are connected in sequence; the triangle intersection test unit is connected to the hit renderer and the miss renderer respectively; and the stack cache component is connected to the ray traversal unit, the triangle intersection test unit, the hit renderer and the miss renderer respectively.
The operation of the stack cache mainly consists of two parts, the push operation and the pop operation, which are described below.
In addition, this embodiment also provides an application method of the above stack cache component for a ray tracing hardware accelerator, comprising performing a pop operation on one stack in the stack heap in response to a pop request, and simultaneously performing the following steps on the stack cache in response to the pop request:
S101, judging whether the stack that needs the pop operation is already stored in the stack cache; if not, ending and exiting; otherwise, jumping to step S102;
S102, looking up, according to the stack number of the pop operation, the corresponding stack cache sequence number in the lookup table;
S103, locating, according to the found sequence number, the corresponding cache stack in the stack cache and popping the data from that cache stack.
In this embodiment, judging in step S101 whether the stack that needs the pop operation is already stored in the stack cache means checking the value of the flag bit of that stack: if the flag bit is 0, the stack is judged not to be stored in the stack cache, otherwise it is judged to be stored in the stack cache. In other words, when a pop operation is needed, the flag bit (0 or 1) of the corresponding stack in the stack heap on the left of fig. 3 is examined. If it is 1, the pop must be served from the stack cache: the stack number is matched against the lookup table to obtain the sequence number, the corresponding cache stack is located in the stack cache, and the data is popped from it. If it is 0, the pop operation is performed on the corresponding stack in the stack heap. The flag bit in the stack heap is cleared to 0 only when a pop operation empties the corresponding stack in the stack cache.
In this embodiment, the pop request comes from the ray traversal unit or the triangle intersection test unit of the ray tracing hardware accelerator: the ray traversal unit sends a pop request to the stack heap when it determines that neither child node intersects the ray, and the triangle intersection test unit sends a pop request after its triangle intersection test has finished. Usually only one pop request is processed per cycle. When a pop request arrives, it is handled according to the situation: 1) the stack for the ray is not full, or it is full but its flag bit is 0; the corresponding stack is found in the stack heap and popped. 2) the stack for the ray is full and its flag bit is 1, meaning this stack currently has data cached in the stack cache; the pop request and the stack number are sent through the multiplexer to the lookup table, the stack number is matched to a sequence number, and the corresponding cache stack in the stack cache is popped according to that sequence number.
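A minimal software model of these two pop cases, continuing the non-normative sketch above, is given below. Error handling and single-cycle timing are ignored; this illustrates only the control flow, not the hardware logic.

```cpp
// Pop one node record for the ray that owns heap stack `sid` (continuing the sketch above).
uint64_t pop(StackCacheComponent& c, int sid) {
    HeapStack& hs = c.heap[sid];
    if (!hs.cached) {                              // case 1: serve the pop from the heap stack
        uint64_t v = hs.entries.back();
        hs.entries.pop_back();
        return v;
    }
    for (LookupEntry& e : c.lut) {                 // case 2: flag bit is 1, serve it from the cache
        if (e.valid && e.stack_id == sid) {
            StackCacheSlot& slot = c.cache[e.slot_id];
            uint64_t v = slot.entries.back();
            slot.entries.pop_back();
            if (slot.entries.empty()) {            // cache stack drained:
                slot.free = true;                  //   release the slot,
                e.valid   = false;                 //   invalidate the lookup table entry,
                hs.cached = false;                 //   and clear the flag bit (per the text above)
            }
            return v;
        }
    }
    return 0;  // unreachable if the flag bit and lookup table are consistent
}
```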
This embodiment further comprises executing the following steps in response to a push request:
S201, judging whether the stack of the push request is full in the stack heap; if it is not full, directly storing the data of the push request into the corresponding stack, then ending and exiting; otherwise, jumping to step S202;
S202, judging whether the stack of the push request is already stored in the stack cache; if not, jumping to step S203; otherwise, jumping to step S204;
S203, sending the push request through the multiplexer to the lookup table to find whether a free stack cache exists: if a free stack cache exists, storing the data of the push request into the free cache stack with the smallest sequence number and recording in the lookup table the sequence number used and the corresponding stack number; if no stack cache is free, temporarily blocking the processing of the current ray and withholding the data of the push request until a free stack cache appears in the lookup table, then resuming the pipeline; and setting the flag bit of the corresponding stack in the stack heap to 1;
S204, sending the push request through the multiplexer to the lookup table to find the corresponding stack number and sequence number, and then pushing the data into the stack cache according to that sequence number.
In this embodiment, judging in step S202 whether the stack of the push request is already stored in the stack cache means checking the value of the flag bit of that stack: if the flag bit is 0, it is judged that the stack is not stored in the stack cache; otherwise it is judged that the stack is already stored in the stack cache.
In this embodiment, the push request comes from the ray traversal unit of the ray tracing hardware accelerator: after traversal of a node finishes, if the ray traversal unit determines that the ray intersects both child nodes, it sends a push request to push the information of the farther of the two child nodes onto the stack. A push involves the following situations, which must be distinguished: 1) the stack to be pushed is not full in the stack heap; the data is stored into the corresponding stack directly. 2) the stack to be pushed is full in the stack heap but its flag bit is 0, meaning this is the first time data of this stack spills into the cache; the push request is sent through the multiplexer to the lookup table to find whether a free stack cache exists. If a free stack cache exists, the data is stored into the free cache stack with the smallest sequence number, and the sequence number used and the corresponding stack number are recorded in the lookup table; if no stack cache is free, the processing of the current ray is temporarily blocked and the data is not stored until a free stack cache appears in the lookup table, after which the pipeline resumes; the flag bit of the stack in the stack heap is then set to 1. 3) the stack to be pushed is full in the stack heap and its flag bit is 1, meaning data of this stack has already been stored in the stack cache; the corresponding stack number and sequence number are simply looked up in the lookup table, and the data is pushed into the stack cache at that sequence number.
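The three push cases can likewise be sketched as follows, continuing the same non-normative model. A real lookup table would have a fixed number of entries, and the pipeline stall is represented here simply by a false return value; these are assumptions of the example, not details fixed by the embodiment.

```cpp
// Push one node record onto the stack of ray `sid` (continuing the sketch above).
// Returns false when the push must stall because no cache slot is free.
bool push(StackCacheComponent& c, int sid, uint64_t value) {
    HeapStack& hs = c.heap[sid];

    // Case 1: heap stack not full, store directly.
    if ((int)hs.entries.size() < kHeapDepth) {
        hs.entries.push_back(value);
        return true;
    }

    // Case 3: flag bit already 1, append via the lookup table mapping.
    if (hs.cached) {
        for (const LookupEntry& e : c.lut)
            if (e.valid && e.stack_id == sid) {
                c.cache[e.slot_id].entries.push_back(value);
                return true;
            }
        return false;  // inconsistent state, not expected
    }

    // Case 2: heap stack full, flag bit 0: claim the free slot with the
    // smallest sequence number, or stall if none is available.
    for (int slot = 0; slot < (int)c.cache.size(); ++slot) {
        if (c.cache[slot].free) {
            c.cache[slot].free = false;
            c.cache[slot].entries.push_back(value);
            c.lut.push_back({sid, slot, true});   // record stack number and sequence number
            hs.cached = true;                     // set the flag bit
            return true;
        }
    }
    return false;  // no free cache slot: block this ray until one is released
}
```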
In summary, ray tracing hardware accelerators have the problem that their stack heaps occupy too many hardware resources. This embodiment designs a set of stack caches for a ray tracing hardware accelerator that uses a stack heap: by adding a group of stack caches on top of the basic stack design, the depth of each stack is reduced and so are the hardware resources occupied by the stack heap, and a single flag bit per stack marks whether its data is stored in the stack cache, allowing the cache to be handled efficiently. Structurally, the stack cache component of this embodiment mainly consists of three parts: first, the multiplexer, which matches data against the lookup table to determine whether the stack cache needs a pop or a push; second, the lookup table, which stores the stack numbers and the corresponding stack cache sequence numbers; third, the stack cache, composed of several stacks and their sequence numbers. The component thus provides a stack cache for the ray tracing hardware accelerator while preserving ray parallelism, with low memory resource overhead, simple design logic and little impact on the performance of the ray tracing hardware. The stack cache component of this embodiment has a simple hardware structure, can effectively free the memory space occupied by the stack heap, significantly reduces hardware overhead, and has little impact on performance.
The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above example; all technical solutions within the concept of the present invention belong to the protection scope of the present invention. It should be noted that modifications and adaptations made by those skilled in the art without departing from the principles of the present invention are also within the protection scope of the present invention.

Claims (10)

1. A stack cache component for a ray tracing hardware accelerator, comprising:
a stack heap for storing node information during ray traversal;
a multiplexer for matching data against the lookup table to determine whether a pop or push operation needs to be performed on the stack cache;
a lookup table for storing stack numbers and the corresponding stack cache sequence numbers;
a stack cache composed of several stacks and their sequence numbers;
wherein the stack heap, the multiplexer, the lookup table and the stack cache are connected in sequence to form a pipeline structure.
2. The stack cache component for a ray tracing hardware accelerator according to claim 1, wherein the stack heap contains n stacks, numbered 1 to n, for storing node information during ray traversal.
3. The stack cache component for a ray tracing hardware accelerator according to claim 2, wherein each stack in the stack heap has a flag bit whose value is 0 or 1, indicating whether data of the corresponding stack in the stack heap is already stored in the stack cache.
4. A ray tracing hardware accelerator, characterized by comprising a ray generation renderer, a ray traversal unit, a triangle intersection test unit, a hit renderer, a miss renderer and a stack cache component, wherein the ray generation renderer, the ray traversal unit and the triangle intersection test unit are connected in sequence, the triangle intersection test unit is connected to the hit renderer and the miss renderer respectively, the stack cache component is the stack cache component for a ray tracing hardware accelerator according to any one of claims 1-3, and the stack cache component is connected to the ray traversal unit, the triangle intersection test unit, the hit renderer and the miss renderer respectively.
5. A method of using the stack cache component for a ray tracing hardware accelerator according to any one of claims 1-3, comprising performing a pop operation on a stack in the stack heap in response to a pop request, and simultaneously performing the following steps on the stack cache in response to the pop request:
S101, judging whether the stack that needs the pop operation is already stored in the stack cache; if not, ending and exiting; otherwise, jumping to step S102;
S102, looking up, according to the stack number of the pop operation, the corresponding stack cache sequence number in the lookup table;
S103, locating, according to the found sequence number, the corresponding cache stack in the stack cache and popping the data from that cache stack.
6. The method according to claim 5, wherein judging in step S101 whether the stack requiring the pop operation is already stored in the stack cache means checking the value of the flag bit of that stack: if the flag bit is 0, it is judged that the stack is not stored in the stack cache; otherwise it is judged that the stack is already stored in the stack cache.
7. The method according to claim 5, wherein the pop request comes from the ray traversal unit or the triangle intersection test unit of the ray tracing hardware accelerator, the ray traversal unit sending a pop request to the stack heap when it determines that neither child node intersects the ray, and the triangle intersection test unit sending a pop request to the stack heap after the triangle intersection test is completed.
8. The method according to claim 5, comprising performing the following steps in response to a push request:
S201, judging whether the stack of the push request is full in the stack heap; if it is not full, directly storing the data of the push request into the corresponding stack, then ending and exiting; otherwise, jumping to step S202;
S202, judging whether the stack of the push request is already stored in the stack cache; if not, jumping to step S203; otherwise, jumping to step S204;
S203, sending the push request through the multiplexer to the lookup table to find whether a free stack cache exists: if a free stack cache exists, storing the data of the push request into the free cache stack with the smallest sequence number and recording in the lookup table the sequence number used and the corresponding stack number; if no stack cache is free, temporarily blocking the processing of the current ray and withholding the data of the push request until a free stack cache appears in the lookup table, then resuming the pipeline; and setting the flag bit of the corresponding stack in the stack heap to 1;
S204, sending the push request through the multiplexer to the lookup table to find the corresponding stack number and sequence number, and then pushing the data into the stack cache according to that sequence number.
9. The method according to claim 8, wherein judging in step S202 whether the stack of the push request is already stored in the stack cache means checking the value of the flag bit of that stack: if the flag bit is 0, it is judged that the stack is not stored in the stack cache; otherwise it is judged that the stack is already stored in the stack cache.
10. The method according to claim 9, wherein the push request comes from the ray traversal unit of the ray tracing hardware accelerator, the ray traversal unit sending, after ray traversal of a node is completed and when it determines that the ray intersects both child nodes, a push request to push the information of the farther of the two child nodes onto the stack.
CN202310398440.6A 2023-04-13 2023-04-13 Stack buffer memory component for ray tracing hardware accelerator and application method thereof Pending CN116385624A (en)

Priority Applications (1)

Application Number: CN202310398440.6A; Priority date: 2023-04-13; Filing date: 2023-04-13; Title: Stack buffer memory component for ray tracing hardware accelerator and application method thereof (CN116385624A)

Applications Claiming Priority (1)

Application Number: CN202310398440.6A; Priority date: 2023-04-13; Filing date: 2023-04-13; Title: Stack buffer memory component for ray tracing hardware accelerator and application method thereof (CN116385624A)

Publications (1)

Publication Number: CN116385624A; Publication Date: 2023-07-04

Family

ID=86967350

Family Applications (1)

Application Number: CN202310398440.6A; Status: Pending; Publication: CN116385624A; Priority date: 2023-04-13; Filing date: 2023-04-13; Title: Stack buffer memory component for ray tracing hardware accelerator and application method thereof

Country Status (1)

Country Link
CN (1) CN116385624A (en)

Similar Documents

Publication Publication Date Title
CN109509138B (en) Reduced acceleration structure for ray tracing system
US10706608B2 (en) Tree traversal with backtracking in constant time
JP7421585B2 (en) Method for determining differential data for rays of a ray bundle and graphics processing unit
US9633468B2 (en) Compacting results vectors between stages of graphics processing
US9342919B2 (en) Image rendering apparatus and method for preventing pipeline stall using a buffer memory unit and a processor
US7499051B1 (en) GPU assisted 3D compositing
US11756256B2 (en) Dedicated ray memory for ray tracing in graphics systems
US11315303B2 (en) Graphics processing
US10019829B2 (en) Graphics library extensions
US8797322B2 (en) Systems and methods of defining rays for ray tracing rendering
US20220392145A1 (en) Graphics processing
Vasilakis et al. k+-buffer: Fragment synchronized k-buffer
US11798221B2 (en) Graphics processing
CN116529775A (en) Method and apparatus for ray tracing merge function call
CN116385624A (en) Stack buffer memory component for ray tracing hardware accelerator and application method thereof
KR20220164441A (en) Graphics processing
US20220366632A1 (en) Accelerated processing via a physically based rendering engine
US12014456B2 (en) Ray tracing graphics processing systems
US11830123B2 (en) Accelerated processing via a physically based rendering engine
US11704860B2 (en) Accelerated processing via a physically based rendering engine
US20240078741A1 (en) Graphics processing
US20220365786A1 (en) Accelerated processing via a physically based rendering engine
KR20170064977A (en) System and Method for constructing a Bounding Volume Hierarchy Tree

Legal Events

Code and description:
PB01: Publication
SE01: Entry into force of request for substantive examination