WO2022058012A1 - Single-pass rendering and post-processing filtering - Google Patents

Single-pass rendering and post-processing filtering

Info

Publication number
WO2022058012A1
WO2022058012A1 PCT/EP2020/075929
Authority
WO
WIPO (PCT)
Prior art keywords
tiles
tile
memory
rendering
filtering
Prior art date
Application number
PCT/EP2020/075929
Other languages
English (en)
Inventor
Baoquan Liu
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. filed Critical Huawei Technologies Co., Ltd.
Priority to PCT/EP2020/075929 priority Critical patent/WO2022058012A1/fr
Priority to CN202080102321.6A priority patent/CN115943421A/zh
Publication of WO2022058012A1 publication Critical patent/WO2022058012A1/fr

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/60Memory management

Definitions

  • This invention relates to rendering and post-processing filtering of images, for example for game rendering in a mobile graphics processing unit (GPU).
  • GPU graphics processing unit
  • Post-processing filtering for game rendering involves applying a spatial filtering operation to the rendering result, with a filter footprint covering other surrounding pixels. For example, filtering may be applied to an image in the framebuffer.
  • FXAA ignores polygons and line edges and simply analyses the pixels on the screen, using a pixel shader program that runs for each pixel of the rendered frame. However, it also needs to access a local neighborhood of 3x3 surrounding pixels. If it finds pixels that create an artificial edge, it smooths them out. FXAA can smooth edges in all pixels on the screen, including those inside alpha-blended textures and those resulting from a previous pass' pixel shader effects, which are immune to Multi-Sample Anti-Aliasing (MSAA) but can now be handled by FXAA.
  • MSAA Multi-Sample Anti-Aliasing
  • FXAA may run very fast on a desktop GPU, usually costing only a millisecond or two. Because desktop GPUs have high-bandwidth dedicated video memory, a 3D rendering pass to generate the image followed by a separate 2D post-process pass to filter the rendered image is not a problem. However, this is generally not the case for mobile GPUs, which do not have such dedicated video memory.
  • Figure 1 shows a 3x3 filtering operation applied to an image, where nine pixels in a local neighborhood are read from DDR to filter the current pixel, shown at 101.
  • Modern mobile GPUs usually use tile-based rendering, where one tile of pixels at a time is rendered into the on-chip tile buffer memory, which is much faster than the off-chip system memory.
  • the surrounding pixels cannot be accessed within one single renderpass or its multiple sub-passes (in the terminology of the Vulkan API).
  • a second separate pass is launched for the post-processing purpose, as shown in Figure 2.
  • EXT_shader_pixel_local_storage cannot help either, because it only allows access to values stored at the current pixel location, but does not allow access to the surrounding pixels in a local neighborhood.
  • Another difficulty for a tile-based mobile GPU is that, to filter the current pixel with a filter footprint larger than 1x1, the required surrounding pixels could even be outside of the current tile being rendered, and may be located in other tiles which may not have been rendered yet, as shown in Figure 3.
  • the problem of performing a 3D rendering pass (to generate the image) followed by a separate 2D post-process pass (to filter the rendered image) on a tile-based mobile GPU is that the rendered intermediate frame buffer in system memory has to be accessed many times (written once and read nine times for a 3x3 filter) to perform the filtering.
  • This may result in heavy cost, in terms of both latency and power consumption, of reading and writing bandwidth when accessing the rendered intermediate framebuffer in system memory of a System-on-Chip.
  • the reading bandwidth is very heavy for any practical filter size, for example 3x3, 5x5, 7x7, etc. This may also result in low rendering speed in terms of frames per second.
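As a rough illustration of this read amplification, the two-pass cost is one full-frame write plus NxN full-frame reads for an NxN filter. The sketch below uses an assumed 1920x1080 frame at 4 bytes per pixel; these figures are illustrative and not taken from this document:

```python
# Rough DDR-traffic estimate for the traditional two-pass approach:
# the intermediate frame is written once, then each pixel's N x N
# neighborhood is read back during the separate filtering pass.
def two_pass_traffic_mb(width, height, bytes_per_pixel, filter_size):
    frame_bytes = width * height * bytes_per_pixel
    writes = frame_bytes                     # one full-frame write
    reads = frame_bytes * filter_size ** 2   # N*N reads per pixel
    return (writes + reads) / (1024 * 1024)

for n in (3, 5, 7):
    print(f"{n}x{n} filter: {two_pass_traffic_mb(1920, 1080, 4, n):.1f} MB of DDR traffic per frame")
```

For a 7x7 filter the read traffic alone is 49 full frames, which is why avoiding the intermediate framebuffer in DDR matters on a bandwidth-constrained SoC.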
  • Mobile devices require real-time rendering performance for games, with high frame rate and low latency of user interaction, while also requiring low power consumption to extend battery life and low heat dissipation so that the device remains comfortable to hold.
  • a graphics processing system configured to receive an input image comprising a plurality of pixels, the system being configured to divide the input image into a plurality of tiles, each tile comprising a subset of the pixels of the input image, the system comprising: a memory configured to store a set of tiles, the set of tiles being a subset of the plurality of tiles of the input image; and a processor, wherein the processor is configured to, for at least one tile in the set, perform a processing pass comprising: rendering the tile; filtering the tile in dependence on at least one other tile of the set; and storing, in the memory, a rendered and filtered tile.
  • the rendering pass and the filtering pass are therefore combined into a single processing pass by keeping a small fixed amount of data in the memory, i.e., a small subset of the whole frame image is stored in the memory.
  • the intermediate framebuffer does not need to go to external system memory. This may result in less memory bandwidth consumption compared with the traditional two-pass methods, with less external memory bandwidth impact.
  • the processing pass may be a processing pipeline comprising the rendering and filtering stages above.
  • the rendered and filtered tile that is output from the pipeline may be stored in the memory.
  • the memory may be an on-chip memory.
  • the memory may be a high-speed cache.
  • the high-speed cache may be a system level cache or a static on-chip memory.
  • the memory may be separate from the external system memory. Storing a subset of tiles in a high-speed cache may reduce the bandwidth consumption when rendering and filtering tiles.
  • the set of tiles may comprise a predetermined fixed number of tiles.
  • the memory may store a predetermined fixed number of tiles for each processing pass.
  • the system may be configured to implement a sliding window algorithm to keep the fixed number of tiles in the set for each processing pass performed by the processor.
  • the sliding window scheduling algorithm may maintain a small fixed number of pixel-tiles as a working set in the on-chip memory, so that only the strictly necessary surrounding tiles (of the tiles currently being, or about to be, filtered) are kept on-chip while the filtering operation is applied to the recently rendered active tiles in the set.
  • the set of tiles may comprise at least one tile that is un-filtered.
  • the set of tiles may comprise at least one tile that has been rendered and filtered.
  • the set may therefore comprise neighbouring tiles to the tile currently being filtered.
  • the system may further comprise a scheduler configured to schedule the processing of the tiles of the input image in a vertical or horizontal scanline order. This may be a convenient way of scheduling the tiles. A zigzagged order may also be used to schedule the small tiles inside a larger super-tile, with the scanline order then used to schedule the super-tiles. Any other combination of these scheduling orders may also be used.
  • the set of tiles may comprise a fixed number of columns of tiles of the input image.
  • the number of columns may be three. This may allow the image to be scanned three columns of tiles at a time using a sliding window algorithm.
  • the set of tiles may comprise K tiles, where K is at least 2*(NTC +1) and where NTC is the number of tiles along each column of the final image. This may allow a sufficient number of tiles to be stored in the on-chip memory so that filtering of a current tile can be performed without the need to access DDR. K may be greater than 2*(NTC +1).
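The bound K >= 2*(NTC + 1) can be computed directly once the tile size is chosen. A minimal sketch, assuming (for illustration only) a 1080-pixel-high frame and 32x32 tiles:

```python
# Minimum working-set size in tiles, following the rule K >= 2*(NTC + 1),
# where NTC is the number of tiles along each column of the final image.
def min_working_set_tiles(frame_height, tile_size):
    ntc = -(-frame_height // tile_size)  # tiles per column, rounded up
    return 2 * (ntc + 1)

print(min_working_set_tiles(1080, 32))  # 70 tiles, for NTC = 34
```

Seventy 32x32 tiles at 4 bytes per pixel is under 300 KB, small enough to plausibly fit in an on-chip cache while a whole 1080p frame would not.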
  • the rendering and filtering of two different tiles in the set may be performed concurrently. This may allow the image to be rendered more quickly.
  • the system may be configured to store a different set of tiles for each processing pass. Each different set of tiles may comprise a predetermined fixed number of tiles.
  • the processor may be configured to subsequently write a rendered and filtered image to a frame buffer in a system memory. This may allow the final image to be displayed.
  • the system may be implemented by a mobile graphics processing unit.
  • Implementing the system in a mobile device may help to achieve real-time rendering performance for games, with high frame rate and low latency of user interaction, while also achieving low power consumption to extend battery life and low heat dissipation so that the device remains comfortable to hold.
  • a method for processing an input image comprising a plurality of pixels at an image processing system, the system being configured to receive the input image and divide the input image into a plurality of tiles, each tile comprising a subset of the pixels of the input image, the system comprising a memory configured to store a set of tiles, the set of tiles being a subset of the plurality of tiles of the input image, the method comprising, in a processing pass, for at least one tile in the set: rendering the tile; filtering the tile in dependence on at least one other tile of the set; and storing, in the memory, a rendered and filtered tile.
  • rendering pass and the filtering pass are combined into a single pass by keeping a small fixed amount of data in the on-chip memory.
  • a small subset of the whole frame image is stored in the on-chip memory.
  • the intermediate framebuffer does not need to go to external system memory. This may result in less memory bandwidth consumption compared with the traditional two-pass methods, with less external memory bandwidth impact.
  • the processing pass may be a processing pipeline comprising the rendering and filtering stages above.
  • the rendered and filtered tile that is output from the pipeline may be stored in the memory.
  • the memory may be an on-chip memory.
  • the memory may be a high-speed cache.
  • the high-speed cache may be a system level cache or a static on-chip memory.
  • the memory may be separate from the external system memory. Storing a subset of tiles in a high-speed cache may reduce the bandwidth consumption when rendering and filtering tiles.
  • a computer program which, when executed by a computer, causes the computer to perform the method described above.
  • the computer program may be provided on a non-transitory computer readable storage medium.
  • Figure 1 schematically illustrates a 3x3 filtering operation.
  • Figure 2 schematically illustrates two passes involved in processing an image on a mobile GPU: a first pass for 3D rendering into the DDR, and a second pass for post-processing filtering which reads from the DDR.
  • Figure 3 schematically illustrates that, for a 2x2 filtering, surrounding pixels from neighbouring tiles are required for correct filtering.
  • Figure 4 shows a rendering pipeline where both the rendering pass and the filtering pass can be finished within only one single pass of the processor.
  • Figure 5 illustrates how a high speed cache usually does not have enough memory space to hold the whole framebuffer for a specific rendering application.
  • Figure 6 illustrates filtering of a tile of pixels of an image.
  • Figure 7 illustrates allocation of a fixed memory size of three columns of tiles as an active working set in the high speed cache, and how a sliding window algorithm is performed for these three columns of tiles (by sliding one column at a time from left to right).
  • Figure 8 illustrates the stages to render a 3D geometry scene to a framebuffer.
  • the tile-based mobile GPU generally has two stages: a binning stage for geometry processing, and a rendering stage for rasterization and pixel shading.
  • Figures 9(a) to 9(d) illustrate the first four iterations of the rendering and filtering process using a first sliding window algorithm.
  • Figure 10 illustrates exemplary pseudocode for the first sliding window rendering and filtering algorithm.
  • Figures 11(a) to 11(d) illustrate the first four iterations of the rendering and filtering process using a second sliding window algorithm.
  • Figure 12 illustrates exemplary pseudocode for the second sliding window rendering and filtering algorithm.
  • Figure 13 shows a flowchart for an example of a method for processing an input image.
  • Figure 14 shows an example of a graphics processing system.
  • rendering refers to any form of generating a visible image, for example for displaying the image on a computer screen, printing, or projecting.
  • a memory, for example a high-speed cache (HSC) in buffer mode.
  • HSC high speed cache
  • both the rendering pass and post-process filtering pass may be performed within only one rendering pass along the GPU pipeline.
  • the stages shown in the pipeline of Figure 4 are geometry data submission 401, vertex shader (VS) 402, pixel shader (PS) 403, render target 404 and post-process 405.
  • the rendering 404 and post-processing filtering 405 are performed in a single processing pass along the GPU pipeline.
  • the intermediate framebuffer does not need to go to system memory. This may allow a large amount of data bandwidth to be saved, with much less external memory bandwidth impact. This can be very useful for various rendering techniques used by game engines where it is desirable to apply a post-processing filter to the rendered framebuffer.
  • a tile’s rendering result of previous rendering operations can efficiently stay on-chip if subsequent rendering operations are within the same tile of pixels, and if only the pixels’ data in the current tile being rendered is accessed.
  • access to other pixel locations would require data outside of the current tile, which breaks the tile-based rendering mechanics.
  • a render pass can comprise multiple subpasses, and one subpass can access a previous subpass's rendering results for a tile while they are still in the on-chip memory of a mobile GPU, before being output to the external system memory.
  • These multiple sub-passes share the same tile arrangement, so that one subpass can access the result of a previous subpass, one tile at a time.
  • access to pixels of other surrounding tiles is not allowed, since other tiles may have been evicted out of the on-chip memory or may have not been rendered yet for the time being.
  • the high-speed caches (HSC) on a System-on-Chip (SoC) may be exploited to assist the filtering process to avoid the memory bandwidth involved in accessing the intermediate framebuffer in system memory (i.e., DDR).
  • the HSCs, which may be a system level cache or a static on-chip memory on a SoC, are expensive and usually very small in memory size. They are also shared by multiple applications. As a result, as shown in Figure 5, HSCs usually do not have enough memory space to hold the whole framebuffer for a specific rendering application.
  • the rendered image is shown at 501 and the filtered image at 502.
  • the DDR is shown at 504.
  • the HSC 503 is too small to hold one full frame.
  • the rendered intermediate framebuffer data is read by a filtering pixel-shader or a compute shader (executing one GPU thread for every single pixel location) which accesses only a small neighborhood of surrounding pixels when processing each individual pixel and when filtering a tile of pixels, for example tile 601 of image 600 in Figure 6. Therefore, only access to the eight additional tiles surrounding the current tile is required. These tiles are shown within box 602 along with the current tile 601.
  • the tiles at the image boundary may have some of their surrounding tiles lying outside of the image boundary. However, this is not a problem for filtering, because these outlying pixels can be clamped to the last pixel at the image edge, or simply clamped to zero.
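The clamp-to-edge behaviour described above can be sketched as follows; the plain 2D list stands in for the framebuffer and is purely illustrative:

```python
# Clamp out-of-bounds reads to the last pixel at the image edge, so tiles on
# the image boundary can be filtered without special-case neighbor tiles.
def sample_clamped(img, x, y):
    h, w = len(img), len(img[0])
    x = min(max(x, 0), w - 1)  # clamp the X coordinate to [0, w-1]
    y = min(max(y, 0), h - 1)  # clamp the Y coordinate to [0, h-1]
    return img[y][x]

img = [[1, 2],
       [3, 4]]
print(sample_clamped(img, -1, -1))  # 1 (clamped to the top-left pixel)
```

Clamping to zero instead would simply return a constant for out-of-bounds coordinates rather than the nearest edge pixel.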
  • two sliding window solutions for tile-based mobile GPU are implemented in order to combine the 3D rendering pass and the spatial filtering pass into a single pass by keeping a small fixed amount of tile-data in the HSC using buffer mode.
  • a fixed small number of tiles are maintained as a working set (WS) in the HSC at any one time.
  • the working set of tiles comprises a predetermined fixed number of tiles.
  • a sliding window algorithm is implemented to keep a fixed number of active tiles in the WS, in which there is at least one tile (there may be more than one such tile) that is ready to be filtered (i.e. that has already been rendered) with the surrounding tiles also available in the WS.
  • the WS of tiles therefore comprises at least one tile that is un-filtered and at least one tile that has been rendered and filtered.
  • the system is configured to store a different set of tiles for each processing pass.
  • the GPU hardware schedules a tile-based rendering order (one tile after the other) in a vertical or horizontal scanline order.
  • the system may use a scheduling algorithm to allow the two jobs (rendering and filtering) to be scheduled to run on a GPU in a cooperating and synchronized way so that a minimum number of tiles is stored in the WS.
  • the GPU hardware renders tiles in vertical scanline order.
  • tiles may also be rendered in a horizontal scanline order.
  • a fixed memory size of three columns of tiles is allocated as an active WS in the HSC.
  • a sliding window algorithm is applied to these three columns of tiles by sliding one column at a time from left to right.
  • the tiles in a column are rendered and filtered in a vertical scanline order.
  • the light grey tiles indicated at 701 are the tiles that have already been rendered and filtered, and have been evicted out of the WS.
  • the darker tiles shown at 702 are the three columns of tiles that are currently stored in the WS, in which the middle column can be filtered because all of its surrounding tiles are already available in the WS.
  • the white tiles indicated at 703 are the tiles that have not been rendered yet and are not in the WS at present.
  • a ring buffer is used to manage the memory of the WS in the HSC.
  • the ring buffer includes three slots (i.e., slot0, slot1, slot2). Each slot stores one column of tiles, which equals NTC tiles.
  • To render the 3D geometry scene to a framebuffer, a tile-based mobile GPU usually has two stages, as shown in Figure 8: a binning stage for geometry processing, shown generally at 801, and a rendering stage for rasterization and pixel shading, shown generally at 802. Following the rendering stage of a tile, a filtering job is inserted (for example, by launching a compute shader or a fragment shader for each tile) after the tile is rendered.
  • a scheduling algorithm can be implemented for the rendering stage to allow the two jobs (rendering and then filtering for each tile) to be scheduled on a GPU in a synchronized way. This allows rendering and filtering of each tile in a certain order using a sliding window algorithm, so that a minimum number of tiles is stored in the WS while the surrounding tiles of the tile being filtered are available in the WS.
  • the scheduling algorithm has two stages as follows. In the initialization stage, the first three columns of tiles of the frame are rendered and each of these rendered tiles is stored into the three slots of the WS. Then, filtering is performed only for the first two columns of the tiles.
  • the following algorithmic steps are performed by sliding forward one column of tiles at a time along the scanline direction, which will move the three columns in the WS forwards towards the right hand side of the image.
  • In a second step, filtering is performed only for the middle column (slot1) of tiles in the WS and the result is stored to DDR, because all of the surrounding tiles (of the middle column) are now available in the WS.
  • The first four iterations of the process are shown in Figures 9(a)-9(d).
  • Figure 9(a) shows the WS state after the initialization stage and
  • Figures 9(b)-(d) show the next three consecutive iterations.
  • the three columns of tiles of the WS are shown in dark grey at 901, 903, 906 and 909 for each iteration. These are the tiles being rendered or filtered currently in each iteration.
  • the light grey tiles 904, 907 and 910 are the finished tiles (not currently in the WS), and the white tiles 902, 905, 908 and 911 have not yet been processed (also not currently in the WS).
  • x_offset of the WS = (wavefront - 2) * TileSize, where wavefront is the tile-index of the right-most column in the WS.
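The column-wise schedule above can be sketched as an event list: after the initialization stage, each iteration renders the new wavefront column and filters the middle column of the three-column working set. The column count below is chosen purely for illustration:

```python
# Sketch of the first (column-wise) sliding-window schedule.  Three ring-buffer
# slots each hold one column of tiles; the window slides one column at a time
# from left to right across the frame.
def schedule_columns(num_columns):
    events = []
    # Initialization: render the first three columns, then filter the first
    # two (their missing neighbors are clamped at the image boundary).
    for col in range(3):
        events.append(("render", col))
    events.append(("filter", 0))
    events.append(("filter", 1))
    # Sliding stage: render the wavefront column, filter the middle column.
    for wavefront in range(3, num_columns):
        events.append(("render", wavefront))
        events.append(("filter", wavefront - 1))  # middle column (slot1)
    events.append(("filter", num_columns - 1))    # right-most column last
    return events

print([c for op, c in schedule_columns(6) if op == "filter"])  # [0, 1, 2, 3, 4, 5]
```

Note that every interior column is filtered only after its right-hand neighbor has been rendered, which is exactly the invariant the working set guarantees.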
  • the filtering job can be launched at the granularity of one column of tiles for the middle column stored in slot1 of the WS, for example, 1080x32 pixels which are launched together using one shader.
  • the filtering job could be defined by a fragment shader or compute shader, which is scheduled with a deferral after the finish of the rendering job of the wavefront column, i.e., slot2.
  • the rendering job (defined by a fragment shader) is still launched at tile granularity, i.e., one tile after the other along the vertical direction within the wavefront column, and the result is stored into slot2 of the WS.
  • a filtering job can be inserted for the middle column so that the rendering job and the filtering job are scheduled in an interleaved pattern, one column after the other.
  • the GPU hardware may require modification to achieve such interleaved scheduling of the rendering job and filtering job at each iteration of the sliding window algorithm.
  • a filtering job is launched for the whole column of tiles stored in slot1 of the WS.
  • This scheduling can be implemented by the GPU HW (more efficiently than by the driver) by inserting a filtering job whenever a new wavefront column of tiles has been rendered.
  • a circular ring buffer can be used to manage memory of the WS in the HSC.
  • the HSC is preferably used in buffer mode, which can guarantee that the preset ring buffer (with 3*NTC tiles of memory size) will never be evicted out of the HSC, i.e., every bit of the WS will stay in the HSC at all times during the rendering process.
  • slot0 = (slot0 + 1) % 3
  • slot1 = (slot1 + 1) % 3
  • slot2 = (slot2 + 1) % 3
  • Access to neighbouring pixels in neighbouring tiles may be performed as follows.
  • vec4 textureOffset(gsampler2D samplerWS_in_HSC, vec2 pos, ivec2 offset)
  • pos is the current pixel location to be filtered
  • offset is the integer offset to the current pixel (e.g., the offset along X and Y could be [-2, -1, 0, 1, 2] for a filter with a diameter of 5 pixels, i.e., a 5x5 filter)
  • samplerWS_in_HSC is our Working-Set in HSC (at buffer mode) with three columns of tiles.
  • the compiler calculates the address (in the WS) to load the neighbouring pixels from the HSC. It is possible to calculate the buffer’s offset address in HSC by using the following three levels of indexing:
  • slot-index, i.e., [0, 1, 2] of the ring buffer, with each slot pointing to a certain column (of tiles) in the HSC;
  • tile-row-index, each pointing to a tile (along the vertical direction) in a certain slot (each slot holds a column of tiles, equal to NTC tiles);
  • intra-tile offset, i.e., the XY offset within a tile.
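These three levels of indexing can be combined into a single byte offset. The sketch below assumes (hypothetically) 32x32 tiles, 4 bytes per pixel, and slots laid out contiguously in the HSC; the real layout is up to the compiler and driver:

```python
# Hypothetical byte-offset calculation into the WS buffer using the three
# indexing levels above: slot-index, tile-row-index, intra-tile XY offset.
TILE = 32    # tile edge length in pixels (assumed)
BPP = 4      # bytes per pixel (assumed)
NTC = 34     # tiles along each column (assumed)

def ws_address(slot, tile_row, x, y):
    slot_bytes = NTC * TILE * TILE * BPP   # one column of tiles per slot
    tile_bytes = TILE * TILE * BPP         # one 32x32 tile
    return slot * slot_bytes + tile_row * tile_bytes + (y * TILE + x) * BPP

print(ws_address(1, 0, 0, 0))  # 139264: slot 1 starts right after slot 0
```

Because every slot and tile has a fixed size, the address is a pure arithmetic function of the three indices, with no pointer chasing.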
  • GPU driver modifications may be required.
  • An API extension may be provided for the applications to use this GPU feature (i.e., rendering+filtering combined in a single pass).
  • the developers only need to provide a customized filtering shader.
  • the GPU driver may manage everything directly, including providing the filtering shader, for some specific post-processing, such as FXAA.
  • the developers then only need to enable/disable this feature to apply one of these very common post-processing filters to the rendering result.
  • the memory addressing of a tile in the WS is easy to calculate, because of the alignment of the three columns of tiles stored in the HSC.
  • the HW scheduler of the two jobs (for rendering and filtering) is very simple to implement via an interleaved mode.
  • the two jobs (rendering and filtering) are scheduled by the HW in a serial and interleaved mode. As a result, one job waits for the other before sliding forward to the next column. In some implementations, this kind of waiting may introduce pipeline bubbles for the rendering job.
  • An alternative second sliding window solution is proposed to solve the pipeline bubble problem that may be encountered in some implementations due to the interleaved scheduling of two jobs which have to wait for each other.
  • a more complex scheduling algorithm is used which can allow the two jobs (rendering and filtering) to be scheduled in a concurrent way on a GPU, instead of in a serial and interleaved way, by using binary semaphore mechanics to synchronize the two jobs so that the rendering job does not need to wait for the filtering job.
  • some HW modification may be used to achieve the described binary semaphore signal-sending mechanics, as will be described in more detail below.
  • a memory size of a fixed number of tiles is still allocated for the WS in the HSC.
  • K tiles are maintained in the HSC (in buffer mode) at all times, where K is a pre-set value that is preferably at least 2*(NTC +1), where NTC is the number of tiles along each column of the final image.
  • a ring buffer is used to manage memory of these K tiles in order to store the WS.
  • the ring buffer manages only K slots, where each slot stores only one tile of pixels.
  • the WS can grow larger as more tiles are being added to it during the sliding window iterations, but it should not be larger than K tiles, otherwise the ring buffer will overflow.
  • a sliding window algorithm sliding one tile at a time along the vertical scanline order, can be used to schedule both the rendering job and the filtering job.
  • only one tile is slid at each iteration step, instead of sliding one column of tiles at each iteration, as found in the first example.
  • Figures 11(a)-11(d) show four consecutive iteration steps, where the WS has been allocated memory in the HSC with K tiles of memory size.
  • the tiles 1101, 1102, 1103 and 1104 (hereinafter referred to as the "current tile" for each iteration) and the surrounding grey tiles within boxes 1105, 1106, 1107 and 1108 are currently in the WS and are being processed (rendered or filtered) at the current iteration step; the light grey tiles to the left of the WS are the finished tiles (not currently in the WS); and the white tiles to the right of the WS have not been touched yet and will be slid into the WS in future iteration steps (they are also not currently in the WS).
  • slot0: pointing to the wavefront tile, i.e., the first grey tile in the WS. It is also the most recently rendered tile, which moves forward in vertical scanline order.
  • slot1: pointing to the "current" tile (shown in dark grey), which is the only tile that is ready to be filtered; slot2: pointing to the tail tile, i.e., the last grey tile in the WS.
  • a check can be performed to determine whether the current tile (i.e., slot1) can start its filtering job. For example, it may progress to filtering when its lagging-behind distance to the wavefront tile is larger than (NTC+1). Otherwise, the current tile's filtering job waits before sliding forward to the next tile. This ensures that the current tile has all of its surrounding tiles available in the WS before performing its filtering job.
  • the second sliding window algorithm is therefore a more complex HW scheduling algorithm which can allow the two jobs (rendering and filtering) to be scheduled in a concurrent style, instead of a serial style, by using a binary semaphore signal-sending mechanism.
  • the rendering job can send a signal to the filtering job to indicate that it is safe (because all of the tiles required for filtering are present in the WS) for the filtering job to perform filtering for the current tile (i.e., slotl in WS).
  • the rendering job can keep moving forward in its own rendering rhythm (by sliding forward to the next tile, to render one tile after the other in the vertical scanline order) without needing to stop and wait for the filtering job.
  • the filtering job waits for a new semaphore signal from the rendering job which indicates that it is safe to filter the current tile. After that, it may perform the filtering operation and then slide forward to the next tile in slot1.
  • the rendering job (at the wavefront tile) does not need to wait, while the filtering job does need to wait for a certain lagging-behind distance (from it to the wavefront tile) before it can slide forward.
  • a check can be performed to determine whether the tail-tile's distance to the current tile is larger than (NTC+1). If so, the tail-tile can be evicted out of the WS and its slot-index can slide forward to point to the next tile.
  • the number of valid tiles in the WS (which may generally be equal to slot0 - slot2 + 1, i.e., the tile-index distance between the wavefront tile and the tail tile) is no larger than K.
  • K is the pre-set allocated memory size for the ring buffer. In this example, if the above condition is not true (i.e., if the number of tiles in the WS is greater than K), the ring buffer may be full and could overflow (i.e., when (slot0 + 1) is equal to slot2).
  • K should be larger than 2*(NTC+1).
  • slot0 - slot1 is usually equal to NTC+1;
  • slot1 - slot2 is usually also equal to NTC+1, as shown at 1107 and 1108 of Figure 11. Therefore, K should preferably be at least 2*(NTC+1)+1 to avoid the ring buffer becoming full and overflowing.
  • the rendering job sends a semaphore signal to the filtering job whenever the slot-index distance (in the ring buffer) between the wavefront tile (slot0) and the current tile (slot1) is larger than (NTC+1).
  • the filtering job waits for a new semaphore signal from the rendering job which indicates that it is safe to filter the current tile (because all the required neighboring tiles are present in the WS); after receiving the signal, the current tile can be filtered and the system then slides forward to the next tile for filtering.
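The signalling protocol above can be mimicked with a counting semaphore: each release by the renderer means one more tile is safe to filter, so the renderer never blocks. This is a host-side sketch with small illustrative tile counts, not GPU hardware code:

```python
import threading

NTC = 4        # tiles per column (assumed, kept small for the demo)
TOTAL = 20     # total tiles in the frame (assumed)
safe_to_filter = threading.Semaphore(0)
filtered = []

def render_job():
    for wavefront in range(TOTAL):
        # ... render tile `wavefront` into the working set ...
        if wavefront >= NTC + 1:       # the oldest unfiltered tile's
            safe_to_filter.release()   # neighbors are now all in the WS
    for _ in range(NTC + 1):           # drain: the trailing edge tiles
        safe_to_filter.release()       # only need clamped neighbors

def filter_job():
    for current in range(TOTAL):
        safe_to_filter.acquire()       # wait for the renderer's signal
        filtered.append(current)       # ... filter tile `current` ...

r = threading.Thread(target=render_job)
f = threading.Thread(target=filter_job)
r.start(); f.start(); r.join(); f.join()
print(filtered == list(range(TOTAL)))  # True
```

The renderer only ever releases the semaphore, so it proceeds at its own rhythm; only the filter job blocks, exactly as described for the second scheme.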
  • a fixed, preset number of tiles, K, where K should preferably be at least 2*(NTC+1), is maintained as the ring buffer in the HSC to store all the tiles in the WS.
  • Memory alignment in the HSC is still easy: since a pool of K tiles is maintained in the HSC, memory is aligned at the granularity of a tile's memory size.
  • the ring buffer’s memory size is small and fixed: at most K tiles are stored in the HSC at any time.
  • the WS’s memory is managed by using a ring buffer with K tiles, where each tile has a slot-index in the WS. Therefore, a required neighboring tile can first be found by using the slot-index distance (in the WS) between this neighboring tile and the current tile, and then the required pixels within this tile can be found using the intra-tile pixel-offsets.
  • the slot-indexes of the eight neighboring tiles of a current tile (i.e., slot1) in the WS can be calculated as below:
    o its top-left neighboring tile's slot-index is: slot1 - (NTC+1);
    o its left neighboring tile's slot-index is: slot1 - NTC;
    o its bottom-left neighboring tile's slot-index is: slot1 - (NTC-1);
    o its top neighboring tile's slot-index is: slot1 - 1;
    o its bottom neighboring tile's slot-index is: slot1 + 1;
    o its top-right neighboring tile's slot-index is: slot1 + (NTC-1);
    o its right neighboring tile's slot-index is: slot1 + NTC;
    o its bottom-right neighboring tile's slot-index is: slot1 + (NTC+1).
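The neighbor arithmetic above can be written out directly; with vertical scanline order, moving one column right adds NTC to the tile index, and moving one row down adds 1. The NTC value below is an assumption for the demo:

```python
# Slot-index offsets of the eight neighbors of the current tile (slot1),
# assuming tiles are indexed in vertical scanline order.
NTC = 34  # tiles per column (assumed)

def neighbour_slots(slot1):
    return {
        "top-left":     slot1 - (NTC + 1),
        "left":         slot1 - NTC,
        "bottom-left":  slot1 - (NTC - 1),
        "top":          slot1 - 1,
        "bottom":       slot1 + 1,
        "top-right":    slot1 + (NTC - 1),
        "right":        slot1 + NTC,
        "bottom-right": slot1 + (NTC + 1),
    }

print(neighbour_slots(100)["left"])  # 66
```

The resulting index is then reduced modulo the ring-buffer size to find the physical slot in the HSC.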
  • the HW scheduler of the two jobs allows the two jobs to be scheduled in a concurrent way.
  • the rendering job can keep moving forward at its own pace, sliding forward one tile at a time, without needing to wait for the filtering job.
  • this will not introduce rendering pipeline bubbles: the rendering job can keep running continuously, one tile after another, without stopping.
  • the filtering job may, for example, be a fragment shader or a compute shader. Both options are possible because the filtering job operates at the granularity of a tile.
  • the intermediate framebuffer (after rendering) does not need to go to system memory. This may save up to 49x read bandwidth for a 7x7 filter, as well as one full-frame write of bandwidth.
  • the filter kernel can be as large as (2*TileSize+1)x (2*TileSize+1).
  • post-processing filters can be applied to the rendered framebuffer by game engines.
  • the solutions described herein can support at least the following post-processing filters without the need to store the intermediate framebuffer to external system memory: Nvidia’s FXAA (3x3), Bloom effect filter (7x7), Gaussian blur filter (3x3, 5x5, 7x7), SuperResolution filter (7x7), AMD’s CAS filter (3x3 and 4x4), Chromatic Aberration, Depth Of Field (applying a blur effect based on distance to the focal point), Motion Blur (blurring objects based on their motion using a variable blur size), bicubic filtering (4x4), and many more spatial filters, or any other post-processes applied to a rendered framebuffer that use neighboring pixels’ values to calculate a new value for the current pixel.
  • a tile may also be used as the unit for scheduling the two GPU jobs, and the same granularity can be used for managing the ring-buffer memory slots.
  • different tile scheduling orders may be used. As described above, vertical and horizontal scanline orders may be used. A zigzag order may also be used to schedule the small tiles (for example, four tiles) inside a big super-tile, with the scanline order then used to schedule the big super-tiles. Any combination of these scheduling orders may also be used.
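The scheduling and synchronization bullets above can be sketched in software. This is a hedged illustration only: the patent describes a hardware scheduler, whereas here ordinary threads and counting semaphores stand in for it, and the ring size and tile count are arbitrary assumptions. A second semaphore plays the role of the tail-tile guard, so the renderer never overwrites a slot whose tile the filter still needs:

```python
import threading

NTC = 4                        # assumed number of tiles per column
K = 2 * (NTC + 1) + 2          # ring-buffer slots (the text's minimum is 2*(NTC+1))
NUM_TILES = 32                 # tiles in the frame, vertical scanline order

ring = [None] * K                           # working set kept in the HSC
ready = threading.Semaphore(0)              # renderer -> filter: tile safe to filter
space = threading.Semaphore(K - (NTC + 1))  # tail-tile guard: keep neighbours resident
results = []

def rendering_job():
    for wavefront in range(NUM_TILES):
        space.acquire()                     # never overwrite a still-needed slot
        ring[wavefront % K] = f"rendered-{wavefront}"
        if wavefront >= NTC + 2:            # wavefront-to-current distance > NTC+1:
            ready.release()                 # the current tile's neighbourhood is resident
    for _ in range(NTC + 2):                # flush: no wavefront beyond the last tiles
        ready.release()

def filtering_job():
    for current in range(NUM_TILES):
        ready.acquire()                     # wait for the renderer's semaphore signal
        # A real filter would also read the neighbouring slots of ring[].
        results.append(ring[current % K].replace("rendered", "filtered"))
        space.release()                     # slide the tail-tile forward

r = threading.Thread(target=rendering_job)
f = threading.Thread(target=filtering_job)
r.start(); f.start(); r.join(); f.join()
```

With this arrangement the renderer only ever blocks on the tail-tile guard, never on the filter's progress through a particular tile, which mirrors the "no pipeline bubbles" property claimed above.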
  • Figure 13 shows a flowchart detailing an example of a method for processing an input image comprising a plurality of pixels at an image processing system, the system being configured to receive the input image and divide the input image into a plurality of tiles, each tile comprising a subset of the pixels of the input image, the system comprising a memory configured to store a set of tiles, the set of tiles being a subset of the plurality of tiles of the input image.
  • the method comprises, in a processing pass, for at least one tile in the set performing steps 1301-1303 as follows.
  • the method comprises rendering the tile.
  • the method comprises filtering the tile in dependence on at least one other tile of the set.
  • the method comprises storing, in the memory, a rendered and filtered tile.
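A toy illustration of the filtering step (all names, the tile size, and the constant test image are assumptions): a 3x3 mean filter applied to one tile, whose border pixels must be fetched from neighboring tiles of the set. This is why filtering a tile depends on at least one other tile being resident in memory:

```python
T = 4  # assumed tile size in pixels

def get_pixel(tiles, tx, ty, x, y):
    """Fetch pixel (x, y) of tile (tx, ty), crossing into a neighbouring
    tile of the working set when the coordinate falls outside the tile."""
    tx, x = tx + x // T, x % T
    ty, y = ty + y // T, y % T
    return tiles[(tx, ty)][y][x]

def box3x3(tiles, tx, ty):
    """3x3 mean filter applied to tile (tx, ty); border pixels read
    neighbouring tiles via get_pixel."""
    out = [[0.0] * T for _ in range(T)]
    for y in range(T):
        for x in range(T):
            s = sum(get_pixel(tiles, tx, ty, x + dx, y + dy)
                    for dy in (-1, 0, 1) for dx in (-1, 0, 1))
            out[y][x] = s / 9.0
    return out

# Working set: the current tile (1, 1) plus its eight neighbours,
# filled with a constant value so the blurred result is easy to check.
tiles = {(tx, ty): [[5.0] * T for _ in range(T)]
         for tx in (0, 1, 2) for ty in (0, 1, 2)}
filtered = box3x3(tiles, 1, 1)
```

Blurring a constant image leaves it unchanged, so every output pixel of the filtered tile remains 5.0, confirming that the cross-tile fetches land on the correct neighbours.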
  • Figure 14 is a schematic representation of a system 1400 configured to perform the methods described herein.
  • the system 1400 may be implemented on a device, such as a laptop, tablet, smart phone, TV or any other device in which graphics data is to be processed.
  • the system is preferably implemented on a mobile GPU.
  • the system 1400 comprises a graphics processor 1401 configured to process data.
  • the processor 1401 may be a GPU.
  • the processor 1401 may be implemented as a computer program running on a programmable device such as a GPU or a Central Processing Unit (CPU).
  • the system 1400 comprises an on-chip memory 1402 which is arranged to communicate with the graphics processor 1401.
  • the system may comprise more than one processor and more than one memory.
  • the memory may store data that is executable by the processor.
  • the processor may be configured to operate in accordance with a computer program stored in non-transitory form on a machine readable storage medium.
  • the computer program may store instructions for causing the processor to perform its methods in the manner described herein.
  • the processor may also write data to an external system memory (not shown in Figure 14).
  • a rendering pipeline for a mobile GPU is described above which implements rendering and post-processing filtering within a single pass via the HSC for a tile-based mobile GPU.
  • the 3D rendering pass and the spatial filtering pass are combined into a single pass by keeping a small fixed amount of data on a high speed cache (for example, a system level cache or a static on-chip memory) in buffer mode.
  • the method utilizes the advantage of the HSC memory on a mobile GPU to store a small subset of the whole frame image. As a result, the intermediate framebuffer does not need to go to external system memory. This may result in less memory bandwidth consumption compared with the traditional two-pass methods, with less external memory bandwidth impact.
  • the described sliding window scheduling algorithms exploit a small fixed number of pixel tiles as a working set stored in the HSC, so that only the necessary surrounding tiles (tiles that are being filtered or are about to be filtered) are kept in the HSC at any given time when applying the filtering operation to the recently rendered current tiles in the WS.
  • three columns of tiles are maintained in the HSC as a ring buffer.
  • the rendering job and the filtering job run in serial mode, sliding forward one new column of tiles at a time.
  • the scheduling granularity is one column of tiles at each iteration step.
  • K tiles (where K is at least 2*(NTC+1)) are maintained in the HSC as a ring buffer.
  • Tiles are rendered one by one, for example in a vertical scanline order, with no need to wait.
  • Three slot indices are tracked to chase the wavefront tile, current tile, and the tail-tile, respectively.
  • Semaphore signaling mechanisms are used to synchronize the rendering job and the filtering job, so that the rendering job does not need to wait for the filtering job.
  • the scheduling granularity is one tile at each iteration step.
  • the proposed solutions require only a single rendering pass of the processor to perform both rendering and post-processing filtering of a framebuffer.
  • the described system may therefore achieve the objective of reducing the memory bandwidth (read and write) spent accessing the intermediate framebuffer in system memory, by combining the 3D rendering pass and the post-processing filtering pass into a single render-pass along the graphics pipeline on the mobile GPU, using sliding window algorithms that only need to store a small fixed amount of data (a subset of the whole frame) in a HSC in buffer mode.
  • the approach can reduce the amount of memory bandwidth required by avoiding the intermediate framebuffer going to system memory. This may result in a faster frame-rate and less power consumption. There is no need to perform the cumbersome and redundant copying, addressing, and storing of each tile’s edge-pixels and corner-pixels to its eight neighbor tiles.
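As a back-of-envelope check of the bandwidth claims above (the frame size and pixel format are assumptions for illustration), a two-pass pipeline writes the intermediate framebuffer once and, in the worst uncached case, reads each pixel once per filter tap:

```python
# Assumed frame format: 1080p, RGBA8 (4 bytes per pixel).
WIDTH, HEIGHT, BPP = 1920, 1080, 4
frame_bytes = WIDTH * HEIGHT * BPP      # one full intermediate framebuffer
taps = 7 * 7                            # a 7x7 post-processing filter

write_saved = frame_bytes               # 1x write of the whole frame avoided
read_saved_worst = taps * frame_bytes   # up to 49x read traffic avoided

print(f"intermediate frame: {frame_bytes / 2**20:.1f} MiB")
print(f"worst-case read traffic avoided: {read_saved_worst / 2**20:.1f} MiB/frame")
```

For a 1080p RGBA8 frame this works out to roughly 7.9 MiB of write traffic and, in the uncached worst case, up to roughly 388 MiB of read traffic avoided per frame; real savings depend on cache behavior and are bounded by the "up to 49x" figure stated above.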


Abstract

A graphics processing system (1400) is configured to receive an input image comprising a plurality of pixels and to divide the input image into a plurality of tiles, each tile comprising a subset of the pixels of the input image, the system comprising: a memory (503, 1402) configured to store a set of tiles (702, 901, 1105), the set of tiles being a subset of the plurality of tiles of the input image; and a processor (1401), the processor being configured, for at least one tile of the set (702, 901, 1105), to perform a processing pass comprising: rendering the tile; filtering the tile in dependence on at least one other tile of the set (702, 901, 1105); and storing, in the memory (503, 1402), a rendered and filtered tile. This may reduce memory bandwidth consumption compared with conventional two-pass methods, with less external memory bandwidth impact.
PCT/EP2020/075929 2020-09-17 2020-09-17 Rendering and post-processing filtering in a single pass WO2022058012A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/EP2020/075929 WO2022058012A1 (fr) 2020-09-17 2020-09-17 Rendering and post-processing filtering in a single pass
CN202080102321.6A CN115943421A (zh) 2020-09-17 2020-09-17 单通道的渲染和后处理滤波


Publications (1)

Publication Number Publication Date
WO2022058012A1 2022-03-24

Family

ID=72560594


Country Status (2)

Country Link
CN (1) CN115943421A (fr)
WO (1) WO2022058012A1 (fr)



Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030227462A1 (en) * 2002-06-07 2003-12-11 Tomas Akenine-Moller Graphics texture processing methods, apparatus and computer program products using texture compression, block overlapping and/or texture filtering
US20080094406A1 (en) * 2004-08-11 2008-04-24 Koninklijke Philips Electronics, N.V. Stripe-Based Image Data Storage

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CHIHOUB A ET AL: "A Band Processing Imaging Library for a TriCore-Based Digital Still Camera", REAL-TIME IMAGING, ACADEMIC PRESS LIMITED, GB, vol. 7, no. 4, 1 August 2001 (2001-08-01), pages 327 - 337, XP004419458 *
DONGJU LI ET AL: "Design Optimization of VLSI Array Processor Architecture for Window Image Processing", IEICE TRANSACTIONS ON FUNDAMENTALS OF ELECTRONICS, COMMUNICATIONS AND COMPUTER SCIENCE, vol. 82, no. 8, 1 August 1999 (1999-08-01), pages 1475 - 1484, XP055658095 *
HAIQIAN YU ET AL: "Optimizing data intensive window-based image processing on reconfigurable hardware boards", IEEE WORKSHOP ON SIGNAL PROCESSING SYSTEM DESIGN AND IMPLEMENTATION, 2 November 2005 (2005-11-02), pages 491 - 496, XP010882621 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115100022A (zh) * 2022-08-23 2022-09-23 芯动微电子科技(珠海)有限公司 图形处理方法及系统
CN115147579A (zh) * 2022-09-01 2022-10-04 芯动微电子科技(珠海)有限公司 一种扩展图块边界的分块渲染模式图形处理方法及系统
CN115147579B (zh) * 2022-09-01 2022-12-13 芯动微电子科技(珠海)有限公司 一种扩展图块边界的分块渲染模式图形处理方法及系统
CN115660935A (zh) * 2022-10-08 2023-01-31 芯动微电子科技(珠海)有限公司 一种分块渲染模式图形处理方法及系统
CN115660935B (zh) * 2022-10-08 2024-03-01 芯动微电子科技(珠海)有限公司 一种分块渲染模式图形处理方法及系统
CN115330986A (zh) * 2022-10-13 2022-11-11 芯动微电子科技(珠海)有限公司 一种分块渲染模式图形处理方法及系统

Also Published As

Publication number Publication date
CN115943421A (zh) 2023-04-07


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20774956; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 20774956; Country of ref document: EP; Kind code of ref document: A1)