WO2023279246A1 - Thread group creation method, graphics processing unit, and electronic device - Google Patents

Thread group creation method, graphics processing unit, and electronic device

Info

Publication number
WO2023279246A1
Authority
WO
WIPO (PCT)
Prior art keywords
thread
thread group
pixel
mask
image processing
Prior art date
Application number
PCT/CN2021/104584
Other languages
English (en)
French (fr)
Inventor
朱韵鹏
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to PCT/CN2021/104584 (WO2023279246A1)
Priority to CN202180006799.3A (CN115803769A)
Publication of WO2023279246A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00: 3D [Three Dimensional] image rendering

Definitions

  • The present application relates to the field of graphics processing, and in particular to a method for creating a thread group, a graphics processing unit (GPU), and an electronic device.
  • In the field of graphics processing, the GPU can create at least one thread group. Each thread group includes at least one thread, and each thread can perform image processing on one or more pixels, such as ray intersection testing and rendering.
  • At present, in application scenarios where ray tracing is used to achieve reflection effects, often only some reflective surfaces need ray intersection testing and rendering, so only the threads in some thread groups perform image processing. However, the thread groups corresponding to the other, non-reflective surfaces still have resources reserved, take part in instruction scheduling, are destroyed, and have their resources released, which occupies the processing resources of the GPU and reduces its execution efficiency.
  • Embodiments of the present application provide a method for creating a thread group, a graphics processing unit, and an electronic device, so as to improve execution efficiency of the graphics processing unit.
  • A first aspect provides a method for creating a thread group, including: obtaining masks of multiple pixels in an image, where the mask of each pixel indicates whether that pixel is to undergo image processing; and creating a target thread group according to the masks of the multiple pixels, where the target thread group includes at least one thread that performs image processing.
  • The thread creation method, graphics processing unit, and electronic device provided in the embodiments of the present application use the mask of each pixel to indicate whether that pixel needs image processing (in other words, whether a corresponding thread needs to be created), and thereby determine whether to create the corresponding thread group.
  • A thread group in which no thread performs image processing need not be created, which saves processing resources of the GPU and improves its execution efficiency.
  • Obtaining the masks of multiple pixels in the image includes: receiving the masks of the multiple pixels from a central processing unit.
  • When the central processing unit sends an instruction to draw a geometric figure to the graphics processing unit, it may send the masks of the multiple pixels along with it.
  • Obtaining the masks of multiple pixels in the image includes: generating the masks of the multiple pixels according to attributes of the multiple pixels (such as geometric parameters and material information). Geometric parameters may include the position and depth of the pixel.
  • The attributes of the multiple pixels include material information, and the material information includes at least one of reflection coefficient, roughness, and material identifier, where the material identifier indicates the material of the pixel, such as water, metal, ceramic, or glass.
  • Creating the target thread group according to the masks includes: creating a first thread group according to the masks of a first group of pixels among the multiple pixels, where the first thread group includes at least one thread that does not perform image processing; creating a second thread group according to the masks of a second group of pixels among the multiple pixels, where the second thread group includes at least one thread that does not perform image processing; and merging the threads that perform image processing in the first thread group with the threads that perform image processing in the second thread group to obtain the target thread group. In this way, the processing resources of the graphics processing unit can be further saved and its execution efficiency improved.
  • one thread of the target thread group corresponds to masks of multiple pixel points.
  • One thread can correspond to one pixel, that is, one thread can perform image processing on one pixel, or one thread can correspond to multiple pixels, that is, one thread can perform image processing on multiple pixels.
  • The method further includes managing the storage space of the mask.
  • The storage space of the mask can be managed explicitly by the application program, for example allocated or destroyed by the application program. Alternatively, it can be managed by the GPU driver, in which case the application program is unaware of the storage space of the mask. The application program here may be one running on the graphics processing unit or one running on the central processing unit.
  • For example, the storage space of the mask is allocated or destroyed by the GPU driver, and the application program running on the GPU obtains the mask directly through the driver.
  • The mask is stored in a tile cache of the graphics processing unit, in a system cache, or in memory.
  • The storage location of the mask can be flexible.
  • A second aspect provides a graphics processing unit that includes a streaming multiprocessor, where the streaming multiprocessor is configured to execute the method for creating a thread group described in the first aspect and any implementation thereof.
  • A third aspect provides an electronic device including the graphics processing unit described in the second aspect.
  • A fourth aspect provides a computer-readable storage medium storing instructions that, when run on a graphics processing unit, cause the graphics processing unit to execute the method for creating a thread group described in the first aspect and any implementation thereof.
  • A fifth aspect provides a computer program product including instructions that, when run on a graphics processing unit, cause the graphics processing unit to execute the method for creating a thread group described in the second aspect and any implementation thereof.
  • FIG. 1 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.
  • FIG. 2 is a schematic diagram of performing thread group compression in a deferred rendering process provided by an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of a method for creating a thread group provided in an embodiment of the present application.
  • FIG. 4 is a schematic diagram of another deferred rendering process provided by an embodiment of the present application.
  • Ray tracing is used to simulate the propagation of light in the real world. By emitting light, it traces several bounces of light in the scene, calculates the intersection of light and objects in the scene, and calculates direct lighting and indirect lighting for intersection points.
  • The ray intersection test (ray-plane intersection test) tests whether a ray intersects a geometric figure and determines where the ray intersects it. The ray can be a direct ray from a light source, a ray reflected from another reflective surface, a refracted ray, and so on.
  • Rasterization refers to the process of converting geometric figures into two-dimensional images.
  • Rendering refers to calculating the lighting information of pixels.
  • Deferred rendering means that the GPU rasterizes the geometry to obtain a geometry buffer (G-buffer), which holds the attributes (such as geometric parameters and material information) that each pixel of the two-dimensional image has in the original geometry. The GPU can output the geometry buffer to memory (such as double data rate (DDR) memory), then read the geometry buffer back from memory and render the image, rather than rendering directly after rasterizing the geometry and then outputting to memory.
  • The geometric parameters may include the position and depth of a pixel, and the material information may include the reflection coefficient, roughness, material identifier, normal direction, and so on, where the material identifier indicates the material of the pixel.
  • a shader program refers to an editable program used to replace a fixed rendering pipeline to achieve graphics rendering.
  • Thread group (warp): the GPU creates threads in units of thread groups. Each thread group can include at least one thread, for example 4 or 16 threads, and each thread can perform image processing, such as ray intersection testing and rendering, on one or more pixels.
  • Each thread can be represented by a one-dimensional or a multi-dimensional identifier. For example, as shown in Table 1, the first row indicates that there are two thread groups, whose identifiers are 0 and 1. Each thread group includes 4 threads, 8 threads in total, which can be represented by the one-dimensional numbers 0-7.
  • Alternatively, a thread group may include 4 threads represented by the two-dimensional numbers (0,0), (1,0), (0,1), (1,1).
  • As shown in FIG. 1, an embodiment of the present application provides an electronic device 10 that includes a GPU 101, a central processing unit (CPU) 102, a system cache 103, and a memory 104, connected by a bus.
  • the electronic device may also include a display screen (not shown in the figure).
  • The GPU 101 includes a streaming multiprocessor 1011 and a tile cache 1012.
  • The electronic device may be a mobile phone, a computer, a tablet, or another device that displays images.
  • After the streaming multiprocessor 1011 in the GPU 101 obtains an instruction for drawing geometric figures from the CPU 102, it can process the image in the memory 104 and display the image on the display screen.
  • The system cache 103 and the tile cache 1012 can be used to cache intermediate processing results of images.
  • The memory 104, the system cache 103, or the tile cache 1012 may be used to store the mask (described later).
  • As described above, thread groups created by the GPU may waste GPU resources because they perform no image processing.
  • As shown in FIG. 2, during deferred rendering the GPU can perform thread group compression after rasterizing the geometry to obtain the image. For thread groups that perform no image processing, no resources are reserved and no thread group is created; the remaining thread groups then perform image processing (such as ray intersection testing and rendering). This saves GPU processing resources and improves GPU execution efficiency.
  • Performing thread group compression as a separate pass, however, still consumes additional GPU processing resources. In the embodiments of the present application, the mask of each pixel indicates whether that pixel needs image processing (in other words, whether a corresponding thread needs to be created), which in turn determines whether to create the corresponding thread group.
  • A thread group in which no thread performs image processing need not be created, which saves processing resources of the GPU and improves its execution efficiency. This can be applied to ray tracing scenes where only local areas need to be ray traced, and to global illumination scenes where only local areas need global illumination rendering.
  • The GPU provided in the embodiments of the present application (for example, its streaming multiprocessor) can execute the method for creating a thread group shown in FIG. 3:
  • S301: Obtain masks of multiple pixels in the image.
  • The mask of each pixel indicates whether that pixel is to undergo image processing (such as ray intersection testing and rendering); in other words, it indicates whether a corresponding thread is to be created for that pixel.
  • For example, as shown in Table 2, a mask value of 0 indicates that the pixel does not need image processing (no corresponding thread needs to be created), and a mask value of 1 indicates that the pixel needs image processing (a corresponding thread needs to be created).
  • One thread can correspond to one pixel, that is, one thread can perform image processing on one pixel, or one thread can correspond to multiple pixels, that is, one thread can perform image processing on multiple pixels.
  • The GPU can manage the storage space of the mask. The storage space can be managed explicitly by the application program, for example allocated or destroyed by the application program. Alternatively, it can be managed by the GPU driver, in which case the application program is unaware of the storage space of the mask. The application program here may be one running on the graphics processing unit or one running on the central processing unit.
  • For example, the storage space of the mask is allocated or destroyed by the GPU driver, and the application program running on the GPU obtains the mask directly through the driver.
  • The mask can be stored in the tile cache of the GPU, in the system cache, or in memory, so its storage location is flexible.
  • The masks of the multiple pixels in the image may be obtained after the geometry is rasterized to obtain the geometry buffer (G-buffer).
  • Assume the geometry buffer is 1280x720 pixels. If 8 bits of data represent the mask of one pixel, then, as shown in Table 2, the GPU can pre-allocate a buffer of 1280x720x8 bits for storing the masks. If 32 bits of data represent the masks of 8x4 pixels, that is, one bit per mask, then, as shown in Table 3, the GPU can pre-allocate a buffer of 160x180x32 bits for storing the masks.
  • the GPU may obtain masks of multiple pixels from the CPU. For example, when the CPU sends an instruction to draw a geometric figure to the GPU, the masks of multiple pixels may be sent together.
  • the GPU may update the mask of each pixel point, for example, generate a mask of multiple pixel points according to attributes of multiple pixel points (such as geometric parameters, material information, etc.).
  • The geometric parameters may include the position and depth of the pixel, and the material information may include at least one of the pixel's reflection coefficient, roughness, material identifier, and normal direction, where the material identifier indicates the material of the pixel, such as water, metal, ceramic, or glass.
  • Take as an example an application scenario in which the GPU implements the reflection effect of each pixel in ray tracing:
  • For example, if a pixel is located on the side facing the light, so that light reflection occurs, the mask of the pixel indicates that the pixel needs image processing (a corresponding thread needs to be created). If the pixel is located on the side facing away from the light, so that no light reflection occurs, the mask indicates that the pixel does not need image processing (no corresponding thread needs to be created).
  • As another example, depending on the scene design, if the depth of a pixel is less than a first threshold, the mask of the pixel indicates that the pixel needs image processing (a corresponding thread needs to be created); otherwise the mask indicates that the pixel does not need image processing (no corresponding thread needs to be created).
  • As another example, if the reflection coefficient of a pixel is greater than a second threshold, meaning that light reflection occurs more readily, the mask of the pixel indicates that the pixel needs image processing (a corresponding thread needs to be created); otherwise the mask indicates that the pixel does not need image processing (no corresponding thread needs to be created).
  • For example, as shown in Table 4, assume the second threshold is 0.2. Among the 8 pixels, the reflection coefficients of pixel 4 and pixel 5 are both greater than 0.2, so the masks of these two pixels can take the value 1 to indicate that they need image processing (corresponding threads need to be created).
  • As another example, if the roughness of a pixel is greater than a third threshold, so that additional rays need to be emitted, the mask of the pixel indicates that the pixel needs image processing (a corresponding thread needs to be created); otherwise the mask indicates that the pixel does not need image processing (no corresponding thread needs to be created).
  • As another example, if the material of a pixel is a specular reflector (such as water, glass, ceramic, or metal), meaning that light reflection occurs more readily, the mask of the pixel indicates that the pixel needs image processing (a corresponding thread needs to be created). If the material is a non-specular reflector (such as cotton, wool, soil, paper, trees, or grass), meaning that light reflection occurs less readily, the mask indicates that the pixel does not need image processing (no corresponding thread needs to be created).
  • S302: Create a target thread group according to the masks of the multiple pixels.
  • The target thread group includes at least one thread that performs image processing.
  • This thread can run a shader program, and the shader program performs image processing on each pixel according to the geometry buffer.
  • For example, as shown in Table 4, assume each thread group includes four threads and each thread corresponds to one pixel and its mask. Pixels 0-3 have mask 0, so they do not need image processing (no corresponding threads need to be created). Pixels 4-5 have mask 1, so they need image processing (corresponding threads need to be created). A thread group (thread group ID 0) is therefore created for pixels 4-7.
  • The thread group includes threads 0-3, where thread 0 performs image processing on pixel 4 and thread 1 performs image processing on pixel 5.
  • Whereas the prior art creates two thread groups, the present application creates only one, which saves GPU processing resources and improves GPU execution efficiency.
  • Further, the GPU may merge multiple thread groups to obtain the target thread group.
  • The GPU may create a first thread group according to the masks of a first group of pixels among the multiple pixels, where the first thread group includes at least one thread that does not perform image processing. It may create a second thread group according to the masks of a second group of pixels among the multiple pixels, where the second thread group includes at least one thread that does not perform image processing. It then merges the threads that perform image processing in the first thread group with those in the second thread group to obtain the target thread group.
  • For example, as shown in Table 5, assume the second threshold is 0.2. Among the 8 pixels, the reflection coefficients of pixels 2-5 are all greater than 0.2, so the masks of these four pixels can take the value 1 to indicate that they need image processing (corresponding threads need to be created).
  • The GPU can create the first thread group (thread group ID 0) according to the masks of the first group of pixels (pixels 0-3), where thread 2 performs image processing on pixel 2 and thread 3 performs image processing on pixel 3.
  • The GPU can create the second thread group (thread group ID 1) according to the masks of the second group of pixels (pixels 4-7), where thread 4 performs image processing on pixel 4 and thread 5 performs image processing on pixel 5. The GPU can then merge the first thread group (thread group ID 0) and the second thread group (thread group ID 1) to obtain the target thread group (thread group ID 0).
  • In the target thread group, thread 0 performs image processing on pixel 2, thread 1 on pixel 3, thread 2 on pixel 4, and thread 3 on pixel 5. In this way, the processing resources of the GPU can be further saved and its execution efficiency improved.
  • the above-mentioned thread creation method, graphics processing unit, and electronic device provided in the embodiments of the present application indicate whether the pixel needs image processing (or whether it needs to create a corresponding thread) through the mask of each pixel, and then determine whether to create the corresponding thread group.
  • A thread group in which no thread performs image processing need not be created, which saves processing resources of the GPU and improves its execution efficiency.
  • the embodiment of the present application also provides a computer-readable storage medium, where instructions are stored in the computer-readable storage medium, and the instructions are executed on a GPU, so that the GPU executes the method for creating a thread group shown in FIG. 3 .
  • the embodiment of the present application also provides a computer program product including instructions, the instructions run on the GPU, so that the GPU executes the thread group creation method shown in FIG. 3 .
  • The processor involved in the embodiments of the present application may be a chip.
  • For example, it can be a field programmable gate array (FPGA), an application specific integrated circuit (ASIC), a system on chip (SoC), a central processing unit (CPU), a network processor (NP), a digital signal processor (DSP), a microcontroller unit (MCU), a programmable logic device (PLD), or another integrated chip.
  • The sequence numbers of the above processes do not imply an order of execution. The execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
  • modules and algorithm steps of the examples described in conjunction with the embodiments disclosed herein can be implemented by electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are executed by hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art may use different methods to implement the described functions for each specific application, but such implementation should not be regarded as exceeding the scope of the present application.
  • the disclosed systems, devices and methods may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the modules is only a logical function division. In actual implementation, there may be other division methods.
  • For example, multiple modules or components can be combined or integrated into another device, or some features can be omitted or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or modules may be in electrical, mechanical or other forms.
  • the modules described as separate components may or may not be physically separated, and the components displayed as modules may or may not be physical modules, that is, they may be located in one device, or may be distributed to multiple devices. Part or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional module in each embodiment of the present application may be integrated into one device, or each module may physically exist separately, or two or more modules may be integrated into one device.
  • All or part of the above embodiments may be implemented by software, hardware, firmware, or any combination thereof.
  • When implemented using a software program, they may be implemented in whole or in part in the form of a computer program product.
  • The computer program product includes one or more computer instructions.
  • When the computer program instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present application are generated in whole or in part.
  • The computer can be a general purpose computer, a special purpose computer, a computer network, or another programmable device.
  • The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (such as infrared, radio, or microwave).
  • the computer-readable storage medium may be any available medium that can be accessed by a computer, or may be a data storage device including one or more servers, data centers, etc. that can be integrated with the medium.
  • the available medium may be a magnetic medium (such as a floppy disk, a hard disk, or a magnetic tape), an optical medium (such as a DVD), or a semiconductor medium (such as a solid state disk (Solid State Disk, SSD)), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Graphics (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Processing (AREA)
  • Image Generation (AREA)

Abstract

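A thread group creation method, a graphics processing unit, and an electronic device, used to improve the execution efficiency of the graphics processing unit. The thread group creation method includes: obtaining masks of multiple pixels in an image (S301), where the mask of each pixel indicates whether that pixel is to undergo image processing; and creating a target thread group according to the masks of the multiple pixels (S302), where the target thread group includes at least one thread that performs image processing.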

Description

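Thread group creation method, graphics processing unit, and electronic device
Technical Field
The present application relates to the field of graphics processing, and in particular to a thread group creation method, a graphics processing unit (GPU), and an electronic device.
Background
In the field of graphics processing, a GPU can create at least one thread group. Each thread group includes at least one thread, and each thread can perform image processing, such as ray intersection testing and rendering, on one or more pixels. Currently, in application scenarios where ray tracing is used to achieve reflection effects, often only some reflective surfaces need ray intersection testing and rendering, so only the threads in some thread groups perform image processing. However, the thread groups corresponding to the other, non-reflective surfaces still have resources reserved for them, take part in instruction scheduling, are destroyed, and have their resources released. This occupies GPU processing resources and reduces GPU execution efficiency.
Summary
Embodiments of the present application provide a thread group creation method, a graphics processing unit, and an electronic device, used to improve the execution efficiency of the graphics processing unit.
To achieve the above objective, the embodiments of the present application adopt the following technical solutions: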
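In a first aspect, a thread group creation method is provided, including: obtaining masks of multiple pixels in an image, where the mask of each pixel indicates whether that pixel is to undergo image processing; and creating a target thread group according to the masks of the multiple pixels, where the target thread group includes at least one thread that performs image processing. The thread creation method, graphics processing unit, and electronic device provided in the embodiments of the present application use the mask of each pixel to indicate whether that pixel needs image processing (in other words, whether a corresponding thread needs to be created), and thereby determine whether to create the corresponding thread group. A thread group in which no thread performs image processing need not be created, which saves GPU processing resources and improves GPU execution efficiency.
In a possible implementation, obtaining the masks of multiple pixels in the image includes: receiving the masks of the multiple pixels from a central processing unit. When the central processing unit sends an instruction to draw a geometric figure to the graphics processing unit, it may send the masks of the multiple pixels along with it.
In a possible implementation, obtaining the masks of multiple pixels in the image includes: generating the masks of the multiple pixels according to attributes of the multiple pixels (such as geometric parameters and material information). Geometric parameters may include the position and depth of the pixel.
In a possible implementation, the attributes of the multiple pixels include material information, and the material information includes at least one of reflection coefficient, roughness, and material identifier, where the material identifier indicates the material of the pixel, such as water, metal, ceramic, or glass.
In a possible implementation, creating the target thread group according to the masks includes: creating a first thread group according to the masks of a first group of pixels among the multiple pixels, where the first thread group includes at least one thread that does not perform image processing; creating a second thread group according to the masks of a second group of pixels among the multiple pixels, where the second thread group includes at least one thread that does not perform image processing; and merging the threads that perform image processing in the first thread group with the threads that perform image processing in the second thread group to obtain the target thread group. This can further save the processing resources of the graphics processing unit and improve its execution efficiency.
In a possible implementation, one thread of the target thread group corresponds to the masks of multiple pixels. One thread can correspond to one pixel, that is, one thread can perform image processing on one pixel; or one thread can correspond to multiple pixels, that is, one thread can perform image processing on multiple pixels.
In a possible implementation, the method further includes managing the storage space of the mask. The storage space of the mask can be managed explicitly by the application program, for example allocated or destroyed by the application program. Alternatively, it can be managed by the GPU driver, in which case the application program is unaware of the storage space of the mask. The application program here may be one running on the graphics processing unit or one running on the central processing unit. For example, the storage space of the mask is allocated or destroyed by the GPU driver, and the application program running on the GPU obtains the mask directly through the driver.
In a possible implementation, the mask is stored in a tile cache of the graphics processing unit, in a system cache, or in memory. The storage location of the mask can be flexible.
In a second aspect, a graphics processing unit is provided, which includes a streaming multiprocessor configured to execute the thread group creation method described in the first aspect and any implementation thereof.
In a third aspect, an electronic device is provided, including the graphics processing unit described in the second aspect.
In a fourth aspect, a computer-readable storage medium is provided. The computer-readable storage medium stores instructions that, when run on a graphics processing unit, cause the graphics processing unit to execute the thread group creation method described in the first aspect and any implementation thereof.
In a fifth aspect, a computer program product including instructions is provided. When the instructions run on a graphics processing unit, they cause the graphics processing unit to execute the thread group creation method described in the second aspect and any implementation thereof.
For the technical effects of the second to fifth aspects, refer to the technical effects of the first aspect and any implementation thereof.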
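Brief Description of the Drawings
FIG. 1 is a schematic structural diagram of an electronic device provided in an embodiment of the present application;
FIG. 2 is a schematic diagram of performing thread group compression in a deferred rendering process provided by an embodiment of the present application;
FIG. 3 is a schematic flowchart of a thread group creation method provided in an embodiment of the present application;
FIG. 4 is a schematic diagram of another deferred rendering process provided by an embodiment of the present application.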
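Detailed Description
Some concepts involved in the present application are described first:
Ray tracing simulates the propagation of light in the real world. By emitting rays, it traces several bounces of the rays in the scene, calculates the intersections of the rays with objects in the scene, and calculates direct and indirect lighting at the intersection points.
The ray intersection test (ray-plane intersection test) tests whether a ray intersects a geometric figure and determines where the ray intersects it. The ray can be a direct ray from a light source, a ray reflected from another reflective surface, a refracted ray, and so on.
Rasterization is the process of converting geometric figures into a two-dimensional image.
Rendering is the calculation of the lighting information of pixels.
Deferred rendering means that the GPU rasterizes the geometry to obtain a geometry buffer (G-buffer), which holds the attributes (such as geometric parameters and material information) that each pixel of the two-dimensional image has in the original geometry. The GPU can output the geometry buffer to memory (such as double data rate (DDR) memory), then read the geometry buffer back from memory and render the image, rather than rendering directly after rasterizing the geometry and then outputting to memory. The geometric parameters may include the position and depth of a pixel, and the material information may include the reflection coefficient, roughness, material identifier, normal direction, and so on, where the material identifier indicates the material of the pixel.
A shader program is an editable program used in place of a fixed rendering pipeline to implement graphics rendering.
Thread group (warp): the GPU creates threads in units of thread groups. Each thread group can include at least one thread, for example 4 or 16 threads, and each thread can perform image processing, such as ray intersection testing and rendering, on one or more pixels. Each thread can be represented by a one-dimensional or a multi-dimensional identifier. For example, as shown in Table 1, the first row indicates that there are two thread groups, whose identifiers are 0 and 1. Each thread group includes 4 threads, 8 threads in total, which can be represented by the one-dimensional numbers 0-7. Alternatively, a thread group may include 4 threads represented by the two-dimensional numbers (0,0), (1,0), (0,1), (1,1).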
Table 1
Thread group ID 0 0 0 0 1 1 1 1
Thread ID 0 1 2 3 4 5 6 7
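The two numbering schemes in Table 1 can be related with simple integer arithmetic. The following is a minimal sketch, not taken from the patent: the function names `group_of` and `to_2d`, the group size of 4, and the 2x2 layout are illustrative assumptions chosen to reproduce the numbers in the text.

```python
THREADS_PER_GROUP = 4  # assumed group size, matching Table 1


def group_of(thread_id: int, threads_per_group: int = THREADS_PER_GROUP) -> int:
    """Map a one-dimensional thread ID to its thread group ID."""
    return thread_id // threads_per_group


def to_2d(lane: int, width: int = 2) -> tuple:
    """Map a thread's index within its group to an assumed (x, y) identifier.

    A 2x2 layout with x varying fastest reproduces the order
    (0,0), (1,0), (0,1), (1,1) given in the text.
    """
    return (lane % width, lane // width)


if __name__ == "__main__":
    print([group_of(t) for t in range(8)])
    print([to_2d(lane) for lane in range(4)])
```

With 8 threads this yields group IDs 0,0,0,0,1,1,1,1, the same assignment as Table 1.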
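As shown in FIG. 1, an embodiment of the present application provides an electronic device 10 that includes a GPU 101, a central processing unit (CPU) 102, a system cache 103, and a memory 104, connected by a bus. The electronic device may also include a display screen (not shown in the figure). The GPU 101 includes a streaming multiprocessor 1011 and a tile cache 1012. The electronic device may be a mobile phone, a computer, a tablet, or another device that displays images.
After the streaming multiprocessor 1011 in the GPU 101 obtains an instruction for drawing geometric figures from the CPU 102, it can process the image in the memory 104 and display the image on the display screen. The system cache 103 and the tile cache 1012 can be used to cache intermediate processing results of images. In addition, in the embodiments of the present application, the memory 104, the system cache 103, or the tile cache 1012 may be used to store the mask (described in detail later).
As described above, thread groups created by the GPU may waste GPU resources because they perform no image processing. As shown in FIG. 2, during deferred rendering the GPU can first perform thread group compression after rasterizing the geometry to obtain the image. For thread groups that perform no image processing, no resources are reserved and no thread group is created; the remaining thread groups then perform image processing (such as ray intersection testing and rendering). This saves GPU processing resources and improves GPU execution efficiency.
Performing thread group compression as a separate pass, however, still consumes additional GPU processing resources. In the embodiments of the present application, the mask of each pixel indicates whether that pixel needs image processing (in other words, whether a corresponding thread needs to be created), which in turn determines whether to create the corresponding thread group. A thread group in which no thread performs image processing need not be created, which saves GPU processing resources and improves GPU execution efficiency. This can be applied to ray tracing scenes where only local areas need to be ray traced, and to global illumination scenes where only local areas need global illumination rendering.
The GPU provided in the embodiments of the present application (for example, its streaming multiprocessor) can execute the thread group creation method shown in FIG. 3: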
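S301: Obtain masks of multiple pixels in the image.
The mask of each pixel indicates whether that pixel is to undergo image processing (such as ray intersection testing and rendering); in other words, it indicates whether a corresponding thread is to be created for that pixel. For example, as shown in Table 2, a mask value of 0 indicates that the pixel does not need image processing (no corresponding thread needs to be created), and a mask value of 1 indicates that the pixel needs image processing (a corresponding thread needs to be created). One thread can correspond to one pixel, that is, one thread can perform image processing on one pixel; or one thread can correspond to multiple pixels, that is, one thread can perform image processing on multiple pixels.
The GPU can manage the storage space of the mask. The storage space can be managed explicitly by the application program, for example allocated or destroyed by the application program. Alternatively, it can be managed by the GPU driver, in which case the application program is unaware of the storage space of the mask. The application program here may be one running on the graphics processing unit or one running on the central processing unit. For example, the storage space of the mask is allocated or destroyed by the GPU driver, and the application program running on the GPU obtains the mask directly through the driver. The mask can be stored in the tile cache of the GPU, in the system cache, or in memory, so its storage location is flexible.
The masks of the multiple pixels in the image may be obtained after the geometry is rasterized to obtain the geometry buffer (G-buffer). Assume the geometry buffer is 1280x720 pixels. If 8 bits of data represent the mask of one pixel, then, as shown in Table 2, the GPU can pre-allocate a buffer of 1280x720x8 bits for storing the masks. If 32 bits of data represent the masks of 8x4 pixels, that is, one bit per mask, then, as shown in Table 3, the GPU can pre-allocate a buffer of 160x180x32 bits for storing the masks.
Table 2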
  0 1 ...... 718 719
0 00000000 00000001 ...... 00000000 00000000
1 00000001 00000001 ...... 00000001 00000000
...... ...... ...... ...... ...... ......
1278 00000000 00000001 ...... 00000001 00000000
1279 00000000 00000000 ...... 00000000 00000000
Table 3
Figure PCTCN2021104584-appb-000001
Figure PCTCN2021104584-appb-000002
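The two storage layouts described above differ by a factor of eight in size. The sketch below works through the arithmetic for a 1280x720 buffer and shows one way 8x4 masks could be packed into a 32-bit word. The function names and, in particular, the bit order inside the word (bit index = y * 8 + x) are assumptions for illustration; the patent's Table 3 is only available as an image and may use a different order.

```python
WIDTH, HEIGHT = 1280, 720   # geometry buffer size from the text
TILE_W, TILE_H = 8, 4       # pixels covered by one 32-bit word


def bytes_one_mask_per_pixel(width, height, bits_per_mask=8):
    """Storage when each pixel's mask occupies its own 8-bit value."""
    return width * height * bits_per_mask // 8


def bytes_packed(width, height):
    """Storage when one 32-bit word holds the masks of an 8x4 pixel tile."""
    words = (width // TILE_W) * (height // TILE_H)  # 160 x 180 words
    return words * 4


def pack_tile(tile):
    """Pack TILE_H rows of TILE_W 0/1 masks into one 32-bit word.

    Assumed bit order: bit index = y * TILE_W + x.
    """
    word = 0
    for y, row in enumerate(tile):
        for x, mask in enumerate(row):
            if mask:
                word |= 1 << (y * TILE_W + x)
    return word


def mask_at(word, x, y):
    """Read back the mask of pixel (x, y) inside a packed tile."""
    return (word >> (y * TILE_W + x)) & 1


if __name__ == "__main__":
    print(bytes_one_mask_per_pixel(WIDTH, HEIGHT))  # 1280x720x8 bits in bytes
    print(bytes_packed(WIDTH, HEIGHT))              # 160x180x32 bits in bytes
```

Under these assumptions the packed layout needs 115,200 bytes against 921,600 bytes for one byte per pixel.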
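In a possible implementation, the GPU can obtain the masks of the multiple pixels from the CPU. For example, when the CPU sends an instruction to draw a geometric figure to the GPU, it may send the masks of the multiple pixels along with it.
In another possible implementation, the GPU can update the mask of each pixel, for example by generating the masks of the multiple pixels according to attributes of the multiple pixels (such as geometric parameters and material information). The geometric parameters may include the position and depth of the pixel, and the material information may include at least one of the pixel's reflection coefficient, roughness, material identifier, and normal direction, where the material identifier indicates the material of the pixel, such as water, metal, ceramic, or glass.
Take as an example an application scenario in which the GPU implements the reflection effect of each pixel in ray tracing:
For example, if a pixel is located on the side facing the light, so that light reflection occurs, the mask of the pixel indicates that the pixel needs image processing (a corresponding thread needs to be created). If the pixel is located on the side facing away from the light, so that no light reflection occurs, the mask indicates that the pixel does not need image processing (no corresponding thread needs to be created).
As another example, depending on the scene design, if the depth of a pixel is less than a first threshold, the mask of the pixel indicates that the pixel needs image processing (a corresponding thread needs to be created); otherwise the mask indicates that the pixel does not need image processing (no corresponding thread needs to be created).
As another example, if the reflection coefficient of a pixel is greater than a second threshold, meaning that light reflection occurs more readily, the mask of the pixel indicates that the pixel needs image processing (a corresponding thread needs to be created); otherwise the mask indicates that the pixel does not need image processing (no corresponding thread needs to be created). For example, as shown in Table 4, assume the second threshold is 0.2. Among the 8 pixels, the reflection coefficients of pixel 4 and pixel 5 are both greater than 0.2, so the masks of these two pixels can take the value 1 to indicate that they need image processing (corresponding threads need to be created).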
Table 4
Pixel                   0    1    2    3    4    5    6    7
Reflection coefficient  0    0    0    0    0.3  0.4  0    0
Mask                    0    0    0    0    1    1    0    0
Thread group ID                             0    0    0    0
Thread ID                                   0    1    2    3
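The reflection coefficient rule behind Table 4 amounts to a per-pixel threshold comparison. A minimal Python sketch follows; the function name `masks_from_reflectance` is hypothetical, and the default threshold of 0.2 is the example value from the text.

```python
def masks_from_reflectance(reflectance, second_threshold=0.2):
    """Return 1 for pixels whose reflection coefficient exceeds the threshold."""
    return [1 if r > second_threshold else 0 for r in reflectance]


# Reflection coefficients of pixels 0-7 as listed in Table 4.
table4_reflectance = [0, 0, 0, 0, 0.3, 0.4, 0, 0]
table4_masks = masks_from_reflectance(table4_reflectance)
```

Only pixels 4 and 5 exceed the threshold, so only their masks are set.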
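As another example, if the roughness of a pixel is greater than a third threshold, so that additional rays need to be emitted, the mask of the pixel indicates that the pixel needs image processing (a corresponding thread needs to be created); otherwise the mask indicates that the pixel does not need image processing (no corresponding thread needs to be created).
As another example, if the material of a pixel is a specular reflector (such as water, glass, ceramic, or metal), meaning that light reflection is more likely, the mask of the pixel indicates that the pixel needs image processing (a corresponding thread needs to be created). Otherwise, if the material of the pixel is a non-specular reflector (such as cotton, wool, soil, paper, trees, or grass), meaning that light reflection is less likely, the mask indicates that the pixel does not need image processing (no corresponding thread needs to be created).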
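The depth, roughness, and material rules above can each be written as a small predicate over a pixel's attributes. The sketch below is illustrative only: the `PixelAttributes` type, the use of strings as material identifiers, and the function names are assumptions, and the text does not say how (or whether) the rules are combined, so each is kept separate.

```python
from dataclasses import dataclass


@dataclass
class PixelAttributes:
    depth: float      # geometric parameter
    roughness: float  # material information
    material: str     # stand-in for a material identifier (assumed encoding)


# Specular materials named as examples in the text.
SPECULAR_MATERIALS = frozenset({"water", "glass", "ceramic", "metal"})


def mask_by_depth(pixel, first_threshold):
    """1 if the pixel's depth is less than the first threshold."""
    return int(pixel.depth < first_threshold)


def mask_by_roughness(pixel, third_threshold):
    """1 if the pixel's roughness is greater than the third threshold."""
    return int(pixel.roughness > third_threshold)


def mask_by_material(pixel):
    """1 if the pixel's material is a specular reflector."""
    return int(pixel.material in SPECULAR_MATERIALS)
```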
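S302. Create a target thread group according to the masks of the multiple pixels.
The target thread group includes at least one thread that performs image processing. Such a thread can run a shader program, which performs image processing on each pixel according to the G-buffer.
For example, as shown in Table 4, suppose each thread group includes four threads and each thread corresponds to one pixel and its mask. Pixels 0-3 have a mask of 0, meaning that they do not need image processing (no corresponding threads need to be created). Pixels 4-5 have a mask of 1, meaning that they need image processing (corresponding threads need to be created). A thread group (thread group ID 0) is therefore created for pixels 4-7. This thread group includes threads 0-3, where thread 0 performs image processing on pixel 4 and thread 1 performs image processing on pixel 5. Compared with the prior art, which would create two thread groups, this application needs to create only one, which saves GPU processing resources and improves GPU execution efficiency.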
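The Table 4 walk-through can be sketched as a loop that inspects the masks of each group of four pixels and creates a thread group only when at least one mask is set. This is a simplified Python model under assumptions: the dictionary layout and the name `create_thread_groups` are hypothetical, and real hardware would create thread groups in a scheduler rather than build lists.

```python
def create_thread_groups(masks, group_size=4):
    """Create one thread group per pixel block that has at least one mask set."""
    groups = []
    for start in range(0, len(masks), group_size):
        block = masks[start:start + group_size]
        if not any(block):
            continue  # no thread would perform image processing: do not create
        threads = [
            {"thread_id": t, "pixel": start + t, "active": bool(m)}
            for t, m in enumerate(block)
        ]
        groups.append({"group_id": len(groups), "threads": threads})
    return groups


# Masks of pixels 0-7 from Table 4.
table4_groups = create_thread_groups([0, 0, 0, 0, 1, 1, 0, 0])
```

For the Table 4 masks this yields a single thread group with ID 0 covering pixels 4-7, in which only threads 0 and 1 are active.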
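Further, the GPU can merge multiple thread groups to obtain the target thread group. The GPU can create a first thread group according to the masks of a first group of pixels among the multiple pixels, where the first thread group includes at least one thread that does not perform image processing. It can create a second thread group according to the masks of a second group of pixels among the multiple pixels, where the second thread group also includes at least one thread that does not perform image processing. It then merges the threads that perform image processing in the first thread group with the threads that perform image processing in the second thread group to obtain the target thread group.
For example, as shown in Table 5, suppose the second threshold is 0.2. Of the 8 pixels, pixels 2-5 have reflection coefficients greater than 0.2, so the masks of these four pixels can take the value 1 to indicate that they need image processing (corresponding threads need to be created). The GPU can create a first thread group (thread group ID 0) according to the masks of the first group of pixels (pixels 0-3), in which thread 2 performs image processing on pixel 2 and thread 3 performs image processing on pixel 3. The GPU can create a second thread group (thread group ID 1) according to the masks of the second group of pixels (pixels 4-7), in which thread 4 performs image processing on pixel 4 and thread 5 performs image processing on pixel 5. The GPU can then merge the first thread group (thread group ID 0) and the second thread group (thread group ID 1) to obtain the target thread group (thread group ID 0). In the target thread group, thread 0 performs image processing on pixel 2, thread 1 on pixel 3, thread 2 on pixel 4, and thread 3 on pixel 5. This further saves GPU processing resources and improves GPU execution efficiency.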
Table 5
Figure PCTCN2021104584-appb-000003
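The merge described for Table 5 keeps only the threads that perform image processing in each partially active group and renumbers them in the target thread group. A minimal Python sketch, assuming each group is given as a list of (pixel, mask) pairs; that representation and the function names are hypothetical.

```python
def active_pixels(group):
    """Pixels handled by threads that actually perform image processing."""
    return [pixel for pixel, mask in group if mask]


def merge_thread_groups(first, second, group_size=4):
    """Merge the active threads of two groups into one target thread group.

    Returns a mapping from thread ID in the target group to pixel index.
    """
    pixels = active_pixels(first) + active_pixels(second)
    if len(pixels) > group_size:
        raise ValueError("active threads do not fit in one target thread group")
    return dict(enumerate(pixels))


# (pixel, mask) pairs for the two pixel groups in the Table 5 example.
first_group = [(0, 0), (1, 0), (2, 1), (3, 1)]
second_group = [(4, 1), (5, 1), (6, 0), (7, 0)]
target_group = merge_thread_groups(first_group, second_group)
```

The result maps threads 0-3 of the target thread group to pixels 2-5, matching the assignment described in the text.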
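In the flow of FIG. 2, the added thread group compression step is complex. As shown in FIG. 4, with the improvement described above there is, in contrast to FIG. 2, no need for a separate thread group compression step. Both generating the masks and creating the target thread group from the masks involve only simple logical checks, so the scheme is simple to implement and reduces the workload.
The thread group creation method, graphics processing unit, and electronic device provided in the embodiments of this application use the mask of each pixel to indicate whether the pixel needs image processing (in other words, whether a corresponding thread needs to be created), and thereby determine whether to create the corresponding thread group. A thread group in which no thread performs image processing need not be created, which saves GPU processing resources and improves GPU execution efficiency.
An embodiment of this application further provides a computer-readable storage medium storing instructions that, when run on a GPU, cause the GPU to perform the thread group creation method shown in FIG. 3.
An embodiment of this application further provides a computer program product containing instructions that, when run on a GPU, cause the GPU to perform the thread group creation method shown in FIG. 3.
The processor in the embodiments of this application may be a chip. For example, it may be a field programmable gate array (FPGA), an application-specific integrated circuit (ASIC), a system on chip (SoC), a central processing unit (CPU), a network processor (NP), a digital signal processor (DSP), a microcontroller unit (MCU), a programmable logic device (PLD), or another integrated chip.
It should be understood that, in the various embodiments of this application, the sequence numbers of the foregoing processes do not imply an order of execution. The order of execution should be determined by the functions and internal logic of the processes, and should not limit the implementation of the embodiments of this application in any way.
A person of ordinary skill in the art will appreciate that the modules and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or in software depends on the specific application and design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this application.
A person skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and modules described above can be found in the corresponding processes of the foregoing method embodiments, and are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other ways. For example, the device embodiments described above are merely illustrative. The division into modules is only a division by logical function, and other divisions are possible in an actual implementation: multiple modules or components may be combined or integrated into another device, or some features may be omitted or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through interfaces, devices, or modules, and may be electrical, mechanical, or of another form.
Modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical modules; that is, they may be located in one device or distributed across multiple devices. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional modules in the embodiments of this application may be integrated in one device, may each exist physically on their own, or two or more modules may be integrated in one device.
The foregoing embodiments may be implemented wholly or partly in software, hardware, firmware, or any combination of these. When implemented as a software program, they may take the form, wholly or partly, of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of this application are produced, wholly or partly. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (such as infrared, radio, or microwave). The computer-readable storage medium may be any usable medium that a computer can access, or a data storage device, such as a server or data center, that integrates one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)).
The foregoing is merely a specific implementation of this application, but the scope of protection of this application is not limited to it. Any variation or replacement that a person skilled in the art could readily conceive within the technical scope disclosed in this application shall fall within the scope of protection of this application. Therefore, the scope of protection of this application shall be subject to the scope of protection of the claims.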

Claims (10)

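  1. A thread group creation method, characterized in that the method comprises:
    obtaining masks of a plurality of pixels in an image, wherein the mask of each pixel indicates whether that pixel is to undergo image processing; and
    creating a target thread group according to the masks of the plurality of pixels, wherein the target thread group comprises at least one thread that performs image processing.
  2. The method according to claim 1, characterized in that the obtaining masks of a plurality of pixels in an image comprises:
    receiving the masks of the plurality of pixels from a central processing unit.
  3. The method according to claim 1, characterized in that the obtaining masks of a plurality of pixels in an image comprises:
    generating the masks of the plurality of pixels according to attributes of the plurality of pixels.
  4. The method according to claim 3, characterized in that the attributes of the plurality of pixels comprise material information, and the material information comprises at least one of a reflection coefficient, a roughness, and a material identifier, wherein the material identifier indicates the material of a pixel.
  5. The method according to any one of claims 1 to 4, characterized in that the creating a target thread group according to the masks comprises:
    creating a first thread group according to the masks of a first group of pixels among the plurality of pixels, wherein the first thread group comprises at least one thread that does not perform image processing;
    creating a second thread group according to the masks of a second group of pixels among the plurality of pixels, wherein the second thread group comprises at least one thread that does not perform image processing; and
    merging the threads that perform image processing in the first thread group and the threads that perform image processing in the second thread group to obtain the target thread group.
  6. The method according to any one of claims 1 to 5, characterized in that one thread of the target thread group corresponds to the masks of a plurality of pixels.
  7. The method according to any one of claims 1 to 6, characterized in that the method further comprises managing storage space for the masks.
  8. The method according to any one of claims 1 to 7, characterized in that the masks are stored in a tile cache of a graphics processing unit, in a system cache, or in memory.
  9. A graphics processing unit, characterized in that it comprises a streaming multiprocessor, wherein the streaming multiprocessor is configured to perform the thread group creation method according to any one of claims 1 to 8.
  10. An electronic device, characterized in that it comprises the graphics processing unit according to claim 9 and a display screen, wherein the graphics processing unit is configured to process an image and display the image on the display screen.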
PCT/CN2021/104584 2021-07-05 2021-07-05 Thread group creation method, graphics processing unit, and electronic device WO2023279246A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
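PCT/CN2021/104584 WO2023279246A1 (zh) 2021-07-05 2021-07-05 Thread group creation method, graphics processing unit, and electronic device
CN202180006799.3A CN115803769A (zh) 2021-07-05 2021-07-05 Thread group creation method, graphics processing unit, and electronic device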

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/104584 WO2023279246A1 (zh) 2021-07-05 2021-07-05 Thread group creation method, graphics processing unit, and electronic device

Publications (1)

Publication Number Publication Date
WO2023279246A1 (zh) 2023-01-12

Family

ID=84801162

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/104584 WO2023279246A1 (zh) 2021-07-05 2021-07-05 Thread group creation method, graphics processing unit, and electronic device

Country Status (2)

Country Link
CN (1) CN115803769A (zh)
WO (1) WO2023279246A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104050705A (zh) * 2013-03-13 2014-09-17 辉达公司 Handling post-z coverage data in raster operations
US20170116698A1 (en) * 2015-10-27 2017-04-27 Nvidia Corporation Techniques for maintaining atomicity and ordering for pixel shader operations
CN107408210A (zh) * 2015-03-25 2017-11-28 英特尔公司 Edge-based coverage mask compression
CN107918946A (zh) * 2016-10-05 2018-04-17 三星电子株式会社 Graphics processing apparatus and method of executing instructions
CN109241466A (zh) * 2018-07-26 2019-01-18 威创软件南京有限公司 Full-screen rendering method for heat maps with small areas and few hotspots

Also Published As

Publication number Publication date
CN115803769A (zh) 2023-03-14

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21948745

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21948745

Country of ref document: EP

Kind code of ref document: A1