CN115775295A - Apparatus and method for tile-based deferred rendering - Google Patents

Apparatus and method for tile-based deferred rendering

Publication number
CN115775295A
Authority
CN
China
Prior art keywords
tile
rendering
visibility information
primitive
scheduling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310032126.6A
Other languages
Chinese (zh)
Inventor
Name withheld at the inventor's request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Moore Threads Technology Co Ltd
Original Assignee
Moore Threads Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Moore Threads Technology Co Ltd filed Critical Moore Threads Technology Co Ltd
Priority to CN202310032126.6A priority Critical patent/CN115775295A/en
Publication of CN115775295A publication Critical patent/CN115775295A/en
Pending legal-status Critical Current

Landscapes

  • Image Generation (AREA)

Abstract

The present disclosure provides an apparatus for tile-based deferred rendering, comprising: a visibility engine configured to generate primitive visibility information for each tile in a position-only stage; and a scheduler configured to perform scheduling for a plurality of processor cores based on the primitive visibility information prior to a rendering stage. The disclosed apparatus and method save processing cycles, memory space, and power in the GPU system, and load-balanced scheduling achieves better GPU utilization.

Description

Apparatus and method for tile-based deferred rendering

Technical Field

The present disclosure relates to apparatuses, methods, devices, and computer-readable media for tile-based deferred rendering (TBDR), and more particularly to a multi-stage rendering architecture with dead-primitive removal and load balancing for TBDR architectures.

Background

TBDR has become a popular modern graphics processing unit (GPU) architecture because of its advantages in power and efficiency. TBDR mode divides the screen into multiple tiles. When geometry is rendered, each primitive falls into one of the screen tiles, forming a per-tile primitive list. Later, in the pixel shading stage, the rasterization unit fetches the primitive list to generate pixels. After an optional hidden surface removal (HSR) unit, only visible pixels are delivered to the shading unit for texturing and shading. Texturing and shading are thus deferred until primitive visibility is known, which, compared with non-deferred tile-based rendering, ensures the lowest possible bandwidth usage and the fewest processing cycles per frame.
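The binning step described above can be sketched as follows. This is a minimal illustration, not the disclosed hardware: the tile size `TILE`, the function name `bin_primitives`, and the bounding-box test are all assumptions made for the example. In particular, the sketch assumes that a primitive straddling a tile boundary is listed in every tile its bounding box touches, a case the text does not spell out.

```python
from collections import defaultdict

TILE = 32  # hypothetical tile edge length in pixels (assumption)


def bin_primitives(triangles, tile=TILE):
    """Build per-tile primitive lists from screen-space triangles.

    Each triangle is three (x, y) vertices with non-negative coordinates.
    A conservative bounding-box test stands in for exact tile coverage.
    """
    lists = defaultdict(list)
    for prim_id, tri in enumerate(triangles):
        xs = [x for x, _ in tri]
        ys = [y for _, y in tri]
        for ty in range(int(min(ys)) // tile, int(max(ys)) // tile + 1):
            for tx in range(int(min(xs)) // tile, int(max(xs)) // tile + 1):
                lists[(tx, ty)].append(prim_id)
    return dict(lists)


triangles = [
    [(2, 2), (20, 4), (8, 25)],      # lies entirely inside tile (0, 0)
    [(28, 10), (40, 12), (34, 30)],  # straddles tiles (0, 0) and (1, 0)
]
tile_lists = bin_primitives(triangles)
print(tile_lists)
```

Every primitive in these lists has already paid the full geometry cost, whether or not any of its pixels survive HSR, which is the waste the following paragraphs describe.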

However, in TBDR mode, the geometry pipeline processes all original primitives and writes them to the corresponding primitive lists according to tile position. The entire geometry process must handle dead primitives that will eventually be rejected by the HSR unit. These dead primitives waste a large amount of computing resources and cycles on geometry operations such as vertex transformation, attribute interpolation, clipping, and primitive assembly. In addition, dead primitives occupy device memory when the primitive lists are formed, which may trigger an out-of-memory condition that stops or restarts the geometry process.

On the other hand, when multiple processors are enabled, load balancing is an important research topic for achieving optimal rendering latency. Typically, a central scheduling unit preprocesses the vertex buffer and dispatches primitives to each processor for load balancing. In most cases, this central scheduling unit ends up being the bottleneck of the entire pipeline. Furthermore, if geometry amplification such as tessellation is enabled, the scheduling unit cannot predict how many sub-primitives will be generated from the geometry data of the original primitives, which makes load balancing difficult to solve.

Summary

The present disclosure provides a new architecture for rendering in TBDR mode that avoids wasting geometry processing resources on dead primitives and saves the memory those primitives would otherwise occupy. In addition, load balancing can be achieved for multi-processor GPU systems.

According to a first aspect of the present disclosure, there is provided an apparatus for tile-based deferred rendering, the apparatus comprising: a visibility engine configured to generate primitive visibility information for each tile in a position-only stage; and a scheduler configured to perform scheduling for a plurality of processor cores based on the primitive visibility information prior to a rendering stage.

According to a second aspect of the present disclosure, there is provided a method for tile-based deferred rendering, the method comprising: generating primitive visibility information for each tile in a position-only stage; and performing scheduling for a plurality of processor cores based on the primitive visibility information prior to a rendering stage.

According to a third aspect of the present disclosure, there is provided a device for tile-based deferred rendering, the device comprising: a processor; and a memory communicatively connected to the processor and adapted to store instructions that, when executed by the processor, cause the device to perform the operations of the method according to the second aspect.

According to a fourth aspect of the present disclosure, there is provided a computer-readable medium having instructions stored thereon that, when executed, cause a processor of a device for tile-based deferred rendering to perform the method according to the second aspect.

The present disclosure eliminates the computing resources wasted on dead primitives during the geometry processing stage, thereby saving processing cycles, memory space, and power in the GPU system. In addition, load-balanced scheduling achieves better GPU utilization, and both the geometry workload and the pixel shading workload can be distributed evenly across a multi-processor system.

Brief Description of the Drawings

Exemplary embodiments of the present disclosure will now be described with reference to the accompanying drawings. However, the present disclosure may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art. The terminology used in the detailed description of the exemplary embodiments illustrated in the accompanying drawings is not intended to limit the present disclosure. In the drawings, like numerals refer to like parts.

FIG. 1 shows a schematic block diagram of dead-primitive removal in TBDR mode according to an embodiment of the present disclosure.

FIG. 2 shows a schematic block diagram of load balancing using live-primitive information in TBDR mode according to an embodiment of the present disclosure.

FIG. 3 shows a block diagram of an apparatus for TBDR according to an embodiment of the present disclosure.

FIG. 4 shows a flowchart of a method for TBDR according to an embodiment of the present disclosure.

FIG. 5 shows a block diagram of a device for TBDR according to an embodiment of the present disclosure.

Detailed Description

Apparatuses, methods, and devices for TBDR are described below. In the following detailed description, reference is made to the accompanying drawings, which form a part hereof and which show, by way of illustration, specific embodiments in which the disclosure may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the disclosure, and it is to be understood that other embodiments may be utilized and that structural, logical, and electrical changes may be made without departing from the scope of the various embodiments of the present disclosure. The following detailed description is therefore to be regarded as illustrative rather than limiting. The scope of the present disclosure is defined by the appended claims and their equivalents.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the present disclosure. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "comprising" indicates the presence of stated features, integers, steps, operations, elements, and/or components, but does not exclude the presence of one or more other features, integers, steps, operations, elements, or components.

Unless otherwise defined, terms used herein have the same meaning as commonly understood by those skilled in the art to which this disclosure belongs. Terms used herein should be interpreted as having meanings consistent with their meanings in the context of this specification and in the relevant art, unless specifically defined herein.

The present disclosure is described below with reference to block diagrams and/or flowcharts illustrating methods, devices, and/or computer program products according to embodiments of the disclosure. It should be understood that a block, and combinations of blocks, of the block diagrams and/or flowcharts can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computing device or a special-purpose computing device, and/or to other programmable data processing apparatus, such that the instructions executed via the processor and/or other programmable data processing apparatus create means for implementing the functions/acts specified in the block diagrams and/or flowcharts.

Accordingly, the present disclosure may also be implemented in hardware and/or software (including firmware, resident software, microcode, etc.). Furthermore, the present disclosure may take the form of a computer program product on a computer-usable or computer-readable storage medium having computer-usable or computer-readable program code embodied in the medium for use by or in connection with an instruction execution system. In the context of this disclosure, a computer-usable or computer-readable medium may be any medium that can contain, store, communicate, transmit, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Electronic devices use machine-readable media (also called computer-readable media) to store and transmit (internally and/or with other electronic devices over a network) code (which comprises software instructions and may be referred to as computer program code or a computer program) and/or data. Such media include machine-readable storage media (e.g., magnetic disks, optical disks, read-only memory (ROM), flash memory, phase-change memory) and machine-readable transmission media (also called carriers; e.g., electrical, optical, radio-frequency, acoustic, or other forms of propagated signals, such as carrier waves and infrared signals). Accordingly, an electronic device (e.g., a computer) includes hardware and software, such as one or more processors coupled to one or more machine-readable storage media that store code for execution by the one or more processors and/or store data. For example, an electronic device may include non-volatile memory that retains code/data while the electronic device is powered off; when the electronic device is powered on, the portion of the code to be executed by the processor is typically copied from the slower non-volatile memory into the volatile memory of the electronic device (e.g., dynamic random access memory (DRAM) or static random access memory (SRAM)). Typically, an electronic device also includes a set of physical network interfaces for establishing network connections with other electronic devices (to transmit and/or receive code and/or data using propagated signals). One or more portions of the present disclosure may be implemented using different combinations of software, firmware, and/or hardware.

FIG. 1 schematically shows a block diagram of invisible-primitive removal in TBDR mode according to an embodiment of the present disclosure.

As shown in FIG. 1, a preprocessing position-only stage is used to collect primitive visibility information for each TBDR tile while also accumulating the number of live primitives for each tile. Hereinafter, live primitives are referred to as visible primitives and, correspondingly, dead primitives are referred to as invisible primitives.

The position-only stage comprises geometry shading, clipping, projection, culling, and rasterization, where geometry shading includes vertex shader processing, tessellation, geometry shader processing, and the like. After tiling (that is, linking primitives to a screen tile), the rasterization and depth test units are activated, because the positions of the rasterized pixels can be used to determine whether a primitive falls into a tile and whether it is completely occluded by another primitive and can therefore be discarded in the later rendering stage. Finally, the visibility engine generates and outputs the primitive visibility information for each tile. The primitive visibility information may include the number of visible primitives for each tile.

Throughout this stage, only vertex positions are fetched and used, which saves bandwidth and computing resources. The per-tile primitive visibility information is the output of this stage; pixel shading is skipped entirely, and no other information is generated when the stage ends.
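To make the depth-test-based visibility decision concrete, here is a minimal software sketch of one tile of the position-only stage. It is an illustration under stated assumptions, not the disclosed visibility engine: primitives are simplified to opaque, axis-aligned rectangles with a single constant depth, a smaller depth means closer to the viewer, and the name `visible_primitives` is hypothetical. A primitive counts as visible if it still owns at least one pixel after all primitives in the tile have been depth tested.

```python
def visible_primitives(prims, tile_w, tile_h):
    """Return one visibility flag per primitive for a single tile.

    prims: list of (x0, y0, x1, y1, depth) half-open pixel rectangles in
    tile-local coordinates (a stand-in for rasterized triangles).
    Only positions and depth are touched; no shading is performed.
    """
    depth = [[float("inf")] * tile_w for _ in range(tile_h)]
    owner = [[None] * tile_w for _ in range(tile_h)]
    for pid, (x0, y0, x1, y1, z) in enumerate(prims):
        for y in range(max(y0, 0), min(y1, tile_h)):
            for x in range(max(x0, 0), min(x1, tile_w)):
                if z < depth[y][x]:  # depth test: keep the closest fragment
                    depth[y][x] = z
                    owner[y][x] = pid
    live = {pid for row in owner for pid in row if pid is not None}
    return [pid in live for pid in range(len(prims))]


prims = [
    (0, 0, 4, 4, 0.9),  # far quad, completely hidden behind primitive 1
    (0, 0, 8, 8, 0.2),  # near quad covering the whole 8x8 tile
    (6, 6, 8, 8, 0.1),  # small quad in front of everything
]
mask = visible_primitives(prims, 8, 8)
visible_count = sum(mask)  # the per-tile number accumulated for the scheduler
print(mask, visible_count)
```

The per-tile flags and the accumulated count correspond to the "primitive visibility information" the stage outputs; how that information is actually encoded is not specified in the text.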

In the subsequent rendering stage, the primitive visibility information generated in the position-only stage is fetched and aligned with the geometry data. As can be seen in FIG. 1, only visible primitives are passed down the pipeline for clipping and projection. The culling operation is skipped in this stage because, as described above, invisible primitives have already been marked and discarded.
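Aligning the visibility information with the geometry data can be pictured as a simple filter in front of the rendering stage. The sketch below assumes the visibility information is stored as one boolean flag per entry of each tile's primitive list; that layout, and the name `render_inputs`, are assumptions for illustration only.

```python
def render_inputs(tile_lists, tile_masks):
    """Keep, for every tile, only the primitives flagged as visible.

    tile_lists: {tile_id: [primitive, ...]} geometry submitted per tile.
    tile_masks: {tile_id: [bool, ...]} flags in the same order (assumed layout).
    """
    return {
        tile: [prim for prim, is_visible in zip(prims, tile_masks[tile]) if is_visible]
        for tile, prims in tile_lists.items()
    }


tile_lists = {0: ["triA", "triB", "triC"], 1: ["triB"]}
tile_masks = {0: [True, False, True], 1: [False]}
to_render = render_inputs(tile_lists, tile_masks)
print(to_render)
```

Only the surviving entries would go on to clipping and projection; a tile whose list becomes empty needs no geometry work at all.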

Together, the position-only stage and the rendering stage described above generate per-tile primitive visibility information and skip invisible primitives at modest cost. During the preprocessing stage, per-tile primitive information such as the number of visible primitives is accumulated. This process saves the resources that would otherwise be spent unnecessarily on invisible primitives, such as primitive assembly, projection, clipping, and memory space, so the geometry overhead and memory footprint of processing invisible primitives are avoided.

FIG. 2 schematically shows a block diagram of load balancing using primitive visibility information in TBDR mode according to an embodiment of the present disclosure.

Where the GPU system has multiple processors, the central scheduling unit can read the visible-primitive information generated by the position-only stage. This information may contain the number of visible primitives for each screen tile, which the GPU scheduler can process with a dynamic scheduling algorithm to output the best grouping of tiles for scheduling.

As shown in FIG. 2, the operations of GPU instance 0, GPU instance 1, ..., GPU instance N each correspond to the operations of the rendering stage shown in FIG. 1. The GPU scheduler receives the number of visible primitives in each tile and schedules according to those numbers. In one example, the system has 2 processor cores and the screen is divided into 4 tiles, tile 0 to tile 3, with 100, 200, 300, and 400 visible primitives respectively. The GPU scheduler can then dispatch tile 0 and tile 3 to processor core 0 and tile 1 and tile 2 to processor core 1, so that each processor core processes 500 primitives.
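The worked example above can be reproduced with a small scheduling sketch. The text only says that a "dynamic scheduling algorithm" is used, so the algorithm here is a swapped-in assumption: a greedy longest-processing-time heuristic that hands the heaviest remaining tile to the currently least-loaded core. The function name `schedule_tiles` is likewise hypothetical.

```python
import heapq


def schedule_tiles(visible_counts, num_cores):
    """Assign whole tiles to cores, balancing visible-primitive counts.

    visible_counts: {tile_id: number of visible primitives in that tile}.
    Greedy heuristic (an assumption, not the disclosed algorithm): take
    tiles from heaviest to lightest and give each to the least-loaded core.
    """
    heap = [(0, core) for core in range(num_cores)]  # (current load, core id)
    heapq.heapify(heap)
    assignment = {core: [] for core in range(num_cores)}
    for tile in sorted(visible_counts, key=visible_counts.get, reverse=True):
        load, core = heapq.heappop(heap)
        assignment[core].append(tile)
        heapq.heappush(heap, (load + visible_counts[tile], core))
    loads = {
        core: sum(visible_counts[t] for t in tiles)
        for core, tiles in assignment.items()
    }
    return assignment, loads


counts = {0: 100, 1: 200, 2: 300, 3: 400}
assignment, loads = schedule_tiles(counts, 2)
print(assignment)  # core 0 gets tiles 3 and 0, core 1 gets tiles 2 and 1
print(loads)       # 500 primitives on each core
```

With these inputs the heuristic arrives at the same split as the example in the text. In general a greedy assignment is not guaranteed to be optimal, but it needs only the per-tile counts, never the vertex buffer itself.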

Scheduling is performed at tile granularity: the GPU scheduler dispatches work to the GPU instances in units of tiles (e.g., one or more tiles), which is far more efficient than complex vertex buffer preprocessing.

The GPU scheduler can use the per-tile visible-primitive information to schedule the geometry workload across multiple processors for load balancing. Because of the position-only stage, the GPU scheduler knows the final sub-primitive visibility even when geometry amplification (tessellation or geometry shading) is enabled, which enables better load-balanced scheduling and avoids scanning the vertex buffer.

Therefore, in a multi-processor GPU system, the process shown in FIG. 2 makes better use of GPU computing resources and achieves higher geometry processing performance.

FIG. 3 schematically shows a block diagram of an apparatus 300 for TBDR according to an embodiment of the present disclosure.

Referring to FIG. 3, the apparatus 300 for TBDR may include at least a visibility engine 301 and a scheduler 302. In one example, the visibility engine 301 may be the visibility engine shown in FIG. 1, configured to generate primitive visibility information for each tile in a position-only stage. In one example, the scheduler 302 may be the GPU scheduler shown in FIG. 2, configured to perform scheduling for a plurality of processor cores, prior to a rendering stage, based on the primitive visibility information from the visibility engine 301.

As an example, the visibility engine 301 may be further configured to accumulate the number of visible primitives for each tile and to include that number in the primitive visibility information. The scheduling by the scheduler 302 may be performed based on the number included in the primitive visibility information.

As a further example, the scheduler 302 may be further configured to distribute the total number of visible primitives evenly across the plurality of processor cores.

As an example, the scheduling by the scheduler 302 may be performed at tile granularity.

As an example, the scheduler 302 may fetch the primitive visibility information of each tile and align it with the geometry data, so that only visible primitives enter the rendering stage.

As an example, the rendering stage may include clipping and projection operations but no culling operation, because invisible primitives have already been discarded.

Some components are illustrated as separate units in FIG. 3. However, this merely indicates that the functions are separate. These units may be provided as separate elements, but other arrangements are also possible; for example, some of them may be combined into one unit. Any combination of the units may be implemented in any suitable location using any combination of software, hardware, and/or firmware. For example, more controllers may be configured separately, or a single controller may serve all components.

The components shown in FIG. 3 may constitute machine-executable instructions embodied, for example, in a machine-readable medium, which, when executed by a machine, cause the machine to perform the operations described. Furthermore, any of these units may be implemented as hardware, such as an application-specific integrated circuit (ASIC), a digital signal processor (DSP), or a field-programmable gate array (FPGA).

FIG. 4 schematically shows a flowchart of a method 400 for TBDR according to an embodiment of the present disclosure.

In one example, at block 401, primitive visibility information for each tile is generated in a position-only stage. At block 402, scheduling for a plurality of processor cores is performed based on the primitive visibility information prior to a rendering stage.

As an example, generating the primitive visibility information may further include accumulating the number of visible primitives for each tile and including that number in the primitive visibility information. The scheduling may be performed based on that number.

As a further example, the scheduling may be performed by distributing the total number of visible primitives evenly across the plurality of processor cores.

As an example, the scheduling may be performed at tile granularity.

As an example, the primitive visibility information may be fetched and aligned with the geometry data, so that only visible primitives enter the rendering stage.

As an example, the rendering stage may include clipping and projection operations but no culling operation.

FIG. 5 schematically shows a block diagram of a device 500 for TBDR according to an embodiment of the present disclosure.

Referring to FIG. 5, the device 500 for TBDR may include at least a processor 501, a memory 502, an interface 503, and a communication medium 504. The processor 501, the memory 502, and the interface 503 may be communicatively coupled to one another via the communication medium 504.

The processor 501 may include one or more processing units. A processing unit may be a physical device or article of manufacture comprising one or more integrated circuits that read data and instructions from a computer-readable medium, such as the memory 502, and selectively execute the instructions. In various embodiments, the processor 501 may be implemented in various ways. As an example, the processor 501 may be implemented as one or more processing cores. As another example, the processor 501 may include one or more separate microprocessors. In yet another example, the processor 501 may include an application-specific integrated circuit (ASIC) that provides specific functions. In a further example, the processor 501 may provide specific functions by using an ASIC and/or by executing computer-executable instructions.

The memory 502 may include one or more computer-usable or computer-readable storage media capable of storing data and/or computer-executable instructions. It should be understood that the storage medium is preferably a non-transitory storage medium.

The interface 503 may be a device or article of manufacture that enables the device 500 for TBDR to send data to or receive data from an external device.

The communication medium 504 may facilitate communication among the processor 501, the memory 502, and the interface 503. The communication medium 504 may be implemented in various ways. For example, the communication medium 504 may include a Peripheral Component Interconnect (PCI) bus, a PCI Express bus, an Accelerated Graphics Port (AGP) bus, a Serial Advanced Technology Attachment (ATA) interconnect, a Parallel ATA interconnect, a Fibre Channel interconnect, a USB bus, a Small Computer System Interface (SCSI) interface, or another type of communication medium.

In the example of FIG. 5, the instructions stored in the memory 502 may include instructions that, when executed by the processor 501, cause the device 500 for TBDR to implement the method described with respect to FIG. 4.

An embodiment of the present disclosure may be an article of manufacture in which a non-transitory machine-readable medium (such as microelectronic memory) has stored thereon instructions (e.g., computer code) that program one or more signal processing components (generally referred to here as a "processor") to perform the operations described above. In other embodiments, some of these operations may be performed by specific hardware components that contain hardwired logic (e.g., dedicated digital filter blocks and state machines). Alternatively, these operations may be performed by any combination of programmed signal processing components and fixed hardwired circuit components.

It should be appreciated that certain features of the present application which are, for clarity, described in the context of separate embodiments may also be provided in combination in a single embodiment. Conversely, various features of the present application which are, for brevity, described in the context of a single embodiment may also be provided separately, in any suitable sub-combination, or as suitable in any other embodiment of the present application. Certain features described in the context of various embodiments are not to be considered essential features of those embodiments, unless the embodiment is inoperative without those elements.

In the foregoing detailed description, embodiments of the present disclosure have been described with reference to specific exemplary embodiments thereof. It will be evident that various modifications may be made to the embodiments of the present disclosure without departing from the spirit and scope of the present disclosure as set forth in the appended claims. Accordingly, the specification and drawings are to be regarded as illustrative rather than restrictive.

Throughout the specification, some embodiments of the present disclosure have been presented through flowcharts. It should be understood that the order of the operations described in these flowcharts is for illustration purposes only and is not intended to limit the present disclosure. Those skilled in the art will recognize that variations of the flowcharts may be made without departing from the spirit and scope of the present disclosure as set forth in the appended claims.

Claims (14)

1. An apparatus for tile-based deferred rendering, characterized in that the apparatus comprises:
a visibility engine configured to generate primitive visibility information for each tile in a stage related to position only; and
a scheduler configured to perform scheduling for a plurality of processor cores based on the primitive visibility information before a rendering stage.

2. The apparatus of claim 1, wherein the visibility engine is further configured to accumulate a number of visible primitives per tile and to include the number in the primitive visibility information, and wherein the scheduling is performed based on the number.

3. The apparatus of claim 2, wherein the scheduler is further configured to distribute the total number of visible primitives evenly across the plurality of processor cores.

4. The apparatus of any one of claims 1 to 3, wherein the scheduling for the plurality of processor cores is performed at a tile-based granularity.

5. The apparatus of any one of claims 1 to 3, wherein the primitive visibility information is fetched so as to be aligned with geometry data, such that only visible primitives enter the rendering stage.

6. The apparatus of any one of claims 1 to 3, wherein the rendering stage includes clipping and projection, but not culling.

7. A method for tile-based deferred rendering, characterized in that the method comprises:
generating primitive visibility information for each tile in a stage related to position only; and
performing scheduling for a plurality of processor cores based on the primitive visibility information before a rendering stage.

8. The method of claim 7, wherein generating the primitive visibility information for each tile further comprises accumulating a number of visible primitives per tile and including the number in the primitive visibility information, and wherein performing the scheduling for the plurality of processor cores further comprises performing the scheduling based on the number.

9. The method of claim 8, wherein performing the scheduling for the plurality of processor cores further comprises distributing the total number of visible primitives evenly across the plurality of processor cores.

10. The method of any one of claims 7 to 9, wherein performing the scheduling for the plurality of processor cores further comprises performing the scheduling at a tile-based granularity.

11. The method of any one of claims 7 to 9, wherein the primitive visibility information is fetched so as to be aligned with geometry data, such that only visible primitives enter the rendering stage.

12. The method of any one of claims 7 to 9, wherein the rendering stage includes clipping and projection, but not culling.

13. A device for tile-based deferred rendering, characterized in that the device comprises:
a processor; and
a memory communicatively connected to the processor and adapted to store instructions which, when executed by the processor, cause the device to perform the operations of the method of any one of claims 7 to 12.

14. A computer-readable medium having instructions stored thereon which, when executed, cause a processor of a device for tile-based deferred rendering to perform the method of any one of claims 7 to 12.
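By way of illustration only, the count-based scheduling recited in claims 2 to 4 and 8 to 10 might be sketched in software as follows. The claims do not prescribe a particular balancing algorithm; the greedy "largest tile first, least-loaded core next" heuristic used here is an assumption, as are the function names `count_visible` and `schedule_tiles` and the encoding of the visibility information as one boolean per primitive per tile.

```python
import heapq


def count_visible(visibility_masks):
    """Accumulate the number of visible primitives per tile.

    visibility_masks maps an integer tile id to a list of booleans, one per
    primitive (a hypothetical encoding of the primitive visibility information).
    """
    return {tile: sum(mask) for tile, mask in visibility_masks.items()}


def schedule_tiles(visible_counts, num_cores):
    """Assign whole tiles to cores so per-core visible-primitive totals stay close.

    Assumed heuristic (not taken from the claims): visit tiles in descending
    order of visible-primitive count and give each one to the core that
    currently has the smallest accumulated load.
    """
    if num_cores < 1:
        raise ValueError("num_cores must be at least 1")
    heap = [(0, core) for core in range(num_cores)]  # entries are (load, core id)
    heapq.heapify(heap)
    assignment = {core: [] for core in range(num_cores)}
    ordered = sorted(visible_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for tile, count in ordered:
        load, core = heapq.heappop(heap)  # least-loaded core so far
        assignment[core].append(tile)     # tile-granularity: a tile is never split
        heapq.heappush(heap, (load + count, core))
    loads = {core: sum(visible_counts[t] for t in tiles)
             for core, tiles in assignment.items()}
    return assignment, loads


if __name__ == "__main__":
    masks = {
        0: [True] * 8,
        1: [True] * 5 + [False] * 3,
        2: [True] * 4,
        3: [True, False, True, True],
        4: [False] * 6,
    }
    counts = count_visible(masks)
    assignment, loads = schedule_tiles(counts, num_cores=2)
    print(counts, assignment, loads)
```

Because tiles are assigned whole, the per-core totals are only approximately equal; how closely a real scheduler approaches an even distribution depends on the tile counts and on the policy it actually implements.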

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310032126.6A CN115775295A (en) 2023-01-10 2023-01-10 Apparatus and method for tile-based deferred rendering

Publications (1)

Publication Number Publication Date
CN115775295A (en) 2023-03-10

Family

ID=85393366

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310032126.6A Pending CN115775295A (en) 2023-01-10 2023-01-10 Apparatus and method for tile-based deferred rendering

Country Status (1)

Country Link
CN (1) CN115775295A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140198119A1 (en) * 2013-01-17 2014-07-17 Qualcomm Incorporated Rendering graphics data using visibility information
CN108305318A (en) * 2017-01-12 2018-07-20 想象技术有限公司 Graphics processing unit and the method for controlling rendering complexity using the instruction of the cost for the segment set for rendering space
CN108711133A (en) * 2017-04-01 2018-10-26 英特尔公司 The Immediate Mode based on segment of Z with early stage layering renders
US20190066354A1 (en) * 2017-08-31 2019-02-28 Hema C. Nalluri Apparatus and method for processing commands in tile-based renderers
CN110728616A (en) * 2018-06-29 2020-01-24 畅想科技有限公司 Tile allocation for processing cores within a graphics processing unit

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116820580A (en) * 2023-08-31 2023-09-29 摩尔线程智能科技(北京)有限责任公司 Instruction execution method, system and device, graphics processor and electronic equipment
CN116820580B (en) * 2023-08-31 2023-11-10 摩尔线程智能科技(北京)有限责任公司 Instruction execution method, system and device, graphics processor and electronic equipment

Similar Documents

Publication Publication Date Title
US12315067B2 (en) Geometry to tiling arbiter for tile-based rendering system
KR101134241B1 (en) Fragment shader bypass in a graphics processing unit, and apparatus and method thereof
CN110717989B (en) Scalable parallel tessellation
US10796483B2 (en) Identifying primitives in input index stream
TW201214326A (en) Tile rendering for image processing
US11481256B2 (en) Task graph scheduling for workload processing
TWI528178B (en) Method and computing system of analying performance of graphics processing pipeline
WO2022011841A1 (en) Implementation method, apparatus, terminal for cluster in gpgpu, and medium
CN115775295A (en) Apparatus and method for tile-based deferred rendering
US11275586B2 (en) Task graph generation for workload processing
US10417815B2 (en) Out of order pixel shader exports
CN116263982B (en) Graphics processors, systems, methods, electronic devices and equipment
US11061429B2 (en) Fine-grained speed binning in an accelerated processing device
US20230205608A1 (en) Hardware supported split barrier
TW202240528A (en) Scalable primitive rate architecture for geometry processing
US10832465B2 (en) Use of workgroups in pixel shader
CN108958921B (en) A hardware-accelerated implementation method of coloring segment scheduling management in GPU
US20230377086A1 (en) Pipeline delay elimination with parallel two level primitive batch binning
US20250208922A1 (en) Dynamic precision management in graphics processing
US20240070961A1 (en) Vertex index routing for two level primitive batch binning
WO2024250891A1 (en) Primitive distribution method and apparatus, device, storage medium, and computer program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Country or region after: China

Address after: B655, 4th Floor, Building 14, Cuiwei Zhongli, Haidian District, Beijing, 100036

Applicant after: Moore Threads Intelligent Technology (Beijing) Co., Ltd.

Address before: 209, 2nd Floor, No. 31 Haidian Street, Haidian District, Beijing

Applicant before: Moore Threads Technology Co., Ltd.

Country or region before: China