CN107153617B - Cache architecture for efficient access to texture data using buffers


Info

Publication number
CN107153617B
CN107153617B
Authority
CN
China
Prior art keywords
buffer
texel data
cache
texture
texel
Legal status
Active
Application number
CN201710128572.1A
Other languages
Chinese (zh)
Other versions
CN107153617A (en)
Inventor
S.亚伯拉罕
K.拉马尼
徐雄
权劝宅
朴贞爱
Current Assignee
Samsung Electronics Co Ltd
Original Assignee
Samsung Electronics Co Ltd
Priority claimed from US 15/420,459 (US10055810B2)
Application filed by Samsung Electronics Co Ltd
Publication of CN107153617A
Application granted
Publication of CN107153617B

Classifications

    • G06F12/0842 — Multiuser, multiprocessor or multiprocessing cache systems for multiprocessing or multitasking (G06F12/08: addressing or allocation in hierarchically structured memory systems; G06F12/0802: associative addressing means, e.g. caches)
    • G06F12/0897 — Caches characterised by their organisation or structure, with two or more cache hierarchy levels
    • G06T1/60 — General purpose image data processing: memory management

Abstract

A texture cache architecture facilitates access to compressed texel data in a non-power-of-2 format, such as the Adaptive Scalable Texture Compression (ASTC) codec. In one implementation, the texture cache architecture includes a controller, a first buffer, a second buffer, and a texture decompressor. The first buffer stores one or more blocks of compressed texel data retrieved from a first texture cache in response to a first request, wherein the one or more blocks of compressed texel data include at least the requested texel data. The second buffer stores the decompressed texel data block or blocks and provides the decompressed requested texel data as output to a second texture cache. The one or more blocks of compressed texel data stored by the first buffer include second texel data in addition to the requested texel data.

Description

Cache architecture for efficient access to texture data using buffers
Cross Reference to Related Applications
This application claims the benefit of U.S. provisional application No. 62/303,889, filed March 4, 2016, the contents of which are hereby incorporated by reference.
Technical Field
Embodiments of the present invention generally relate to techniques for using a texture cache in a graphics processing unit.
Background
In graphics systems, textures are typically stored in a texture cache in a compressed format. For example, a block compression format may compress the color and alpha of a 4x4 pixel block into 64 bits (64b, i.e., 8 bytes (8B)). After decompression, each texel occupies 2B, with red, green and blue (RGB) components of 5 bits, 6 bits and 5 bits, respectively. This compression format thus achieves a compression factor of 4 for a 4x4 pixel block (e.g., (2B/pixel x 16 pixels)/8B = 4).
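To make the arithmetic concrete, the following C++ sketch unpacks the 5/6/5 bit fields of one decompressed 2B texel and checks the compression factor. This is illustrative only, not the patent's hardware; the RGB565 field layout is the common convention and is assumed here.

```cpp
#include <cstdint>

// Extract the raw 5b/6b/5b RGB fields of one 2B (RGB565) texel, as
// produced after block decompression. The field layout is an
// assumption following the common RGB565 convention.
struct RGB565 { uint8_t r5, g6, b5; };

RGB565 unpackRGB565(uint16_t texel) {
    return { static_cast<uint8_t>((texel >> 11) & 0x1F),   // 5b red
             static_cast<uint8_t>((texel >> 5)  & 0x3F),   // 6b green
             static_cast<uint8_t>( texel        & 0x1F) }; // 5b blue
}

// Compression factor: 16 texels x 2B uncompressed = 32B, stored as an
// 8B compressed block, giving 32B / 8B = 4.
static_assert((16 * 2) / 8 == 4, "compression factor of 4");
```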
The compression format achieves savings in memory requirements and bandwidth required to move textures between multiple levels of the memory hierarchy. However, there are a number of disadvantages and limitations associated with conventional texture caching schemes.
Disclosure of Invention
Drawings
FIG. 1A is a block diagram of a graphics processing system including a texture cache architecture according to an embodiment of the present invention.
FIG. 1B illustrates the texture cache architecture of FIG. 1A in more detail, according to an embodiment of the invention.
FIG. 1C illustrates an embodiment of the texture cache architecture of FIG. 1A supporting ASTC codecs.
FIG. 2 illustrates a method of operating a graphics processing unit, according to one embodiment.
FIG. 3 illustrates a method of operating a graphics processing unit, according to one embodiment.
FIG. 4 illustrates an example of cache data and tag mapping, according to an embodiment.
FIG. 5 illustrates an example of cache data and tag mapping with conflict free access, according to an embodiment.
FIG. 6 illustrates an example of cache access for a quad with a 3x3 footprint, according to an embodiment.
FIG. 7 illustrates an example of cache access for a quad with a 2x2 footprint, according to an embodiment.
FIG. 8 illustrates an example of sub-blocks of a texture cache architecture, according to one embodiment.
FIG. 9 illustrates an example of an ASTC texel footprint pattern, according to one embodiment.
FIG. 10 illustrates an example of address generation control to combine texel requests, according to one embodiment.
Fig. 11 illustrates an example of ASTC block sizes and texture cache line boundaries, according to an embodiment.
Fig. 12 illustrates an example of ASTC block sizes and texture cache boundaries in a series of accesses.
Detailed Description
FIG. 1A is a block diagram illustrating a graphics system 100, according to one embodiment. In one embodiment, the texture cache unit 110 is part of a Graphics Processing Unit (GPU) 106. In one embodiment, texture cache unit 110 includes a texture cache architecture, which is described in more detail below with respect to FIG. 1B.
In one embodiment, the GPU 106 may include graphics hardware and implement a graphics pipeline that includes, for example, one or more shader cores. An external graphics memory 112 may be provided to store additional texture data. In one embodiment, a Central Processing Unit (CPU) 101 and associated system memory 102 may include computer program instructions for driver software 104. A bus may be used to communicatively couple the CPU 101 to the GPU 106, the system memory 102 to the CPU 101, and the GPU 106 to the external graphics memory 112.
FIG. 1B illustrates an embodiment of a texture cache architecture 108 in more detail. A level 0 texture cache (TC0) is provided for uncompressed texture data (e.g., texel data). In one embodiment, the TC0 cache holds decompressed texels organized into 64B cache lines, where each 4B segment is stored in a separate data store and the entire cache line is spread across 16 data stores. However, it will be understood that other cache line sizes and segment sizes may be used. A level 1 texture cache (L1C) is provided for compressed texture data (e.g., texel data).
A Texture Decompressor (TD) is disposed between TC0 and L1C. First and second buffers are provided to buffer data. Although the buffers may be implemented in different ways, in one embodiment these buffers are implemented as first-in, first-out (FIFO) stream buffers, including one implementation in which the first buffer is a first FIFO (stream FIFO1) and the second buffer is a second FIFO (stream FIFO2). Stream FIFO1 buffers compressed data on its way from L1C to the TD. Stream FIFO2 buffers the decompressed data provided from the TD to TC0. In one embodiment, although a stream FIFO uses a FIFO replacement scheme that always replaces the oldest entry with the new entry, it allows read access to any entry, not just the oldest entry as in a conventional FIFO.
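A minimal C++ sketch of such a stream FIFO follows, under the behavior just described: writes always evict the oldest entry via a wrap-around write pointer, while reads may address any live entry. The entry type, depth, and method names are assumptions for illustration.

```cpp
#include <array>
#include <cstddef>

// Illustrative stream FIFO: FIFO replacement on write, but random
// read access to any entry within the window of written slots.
template <typename Entry, std::size_t Depth>
class StreamFifo {
public:
    // Writing always replaces the oldest entry; returns the slot
    // used, which callers may retain as a read pointer.
    std::size_t push(const Entry& e) {
        std::size_t slot = writePtr_;
        buf_[slot] = e;
        writePtr_ = (writePtr_ + 1) % Depth;
        return slot;
    }
    // Unlike a conventional FIFO, any live entry may be read (and
    // re-read), enabling reuse of recently written data.
    const Entry& read(std::size_t slot) const { return buf_[slot]; }

private:
    std::array<Entry, Depth> buf_{};
    std::size_t writePtr_ = 0;
};
```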
In one embodiment, a texture address unit (shown in dashed lines in FIG. 1B) generates and delivers a set of accesses for a quad (a group of four pixels) to the front end 180 of the texture cache architecture 108, starting with an Address Generation Controller (AGC). In one embodiment, the AGC combines these accesses into a minimal set of non-conflicting tag accesses and data store accesses. In one embodiment, the AGC then looks up the tags in the TAGS unit, which delivers misses to the address calculation unit. The address calculation unit, in turn, generates addresses to access the compressed texture data in the L1C cache.
In one embodiment, in the event of a cache miss in the TC0 cache, the AGC supports generating an address, using the tag (from the TAGS unit) and the address calculation block, to access compressed texture data from the L1C cache. The compressed texture data is then buffered in stream FIFO1, decompressed by the TD, buffered in stream FIFO2, and then provided to the TC0 cache.
In one embodiment, the TC0 cache supports reuse of decompressed texture data. In response to a cache hit, the output of the TC0 cache can be used, for example, by a texture filter unit (shown in FIG. 1B with dashed lines) to compute a filtered texture value for a pixel. Furthermore, as described in more detail below, in one embodiment the read pointers to FIFO1 and FIFO2 may be controlled to improve the reuse of texel data. In one embodiment, a control block 190 or other control feature may be provided, for example, to coordinate the operation of the TD with the read pointers of the first buffer (e.g., FIFO1) and/or the second buffer (e.g., FIFO2).
In one embodiment, the texture cache unit 110 accepts a request for texel data for one quad (a 2x2 set of pixels) and generates a filtered texel for each active pixel in the quad, which may involve accessing 4 texels for each pixel, for a total of 16 texels per cycle.
In one embodiment, graphics system 100 has the flexibility to reorganize data within a texture. In one embodiment, driver 104 reorganizes the texture data to best accommodate the expected request pattern. Shader cores can be latency tolerant because they are highly multithreaded to take advantage of the natural parallelism that exists in graphics applications. In addition, multiple requests arriving at the texture cache per cycle tend to be coherent, as they correspond to texel requests made on behalf of a single quad.
In one embodiment, the organization of the data in the TC0 cache is based on common data access patterns to allow a set of texel accesses to be processed with a minimum number of data store and tag lookups. In one embodiment, data is organized into the TC0 cache based on the locality pattern present in a set of accesses to improve the performance of the TC0 cache. For example, data may be stored in an interleaved pattern among the data stores that make up the TC0 texture cache. In addition, data that is likely to be accessed together may be grouped into the same cache line to reduce the number of different cache lines touched and thus the number of different tag lookups required. The example cache architecture disclosed herein supports operations that require at most 4 tag lookups per cycle and utilizes only 16 data stores. However, it will be appreciated that other numbers of tag lookups and data stores may be utilized in alternative embodiments.
Referring to FIG. 1C, additionally or alternatively, an embodiment facilitates a texture compression scheme that utilizes variable-size blocks, such as the Adaptive Scalable Texture Compression (ASTC) codec. In one embodiment, this may include a coalescer (CLS) module that coalesces decompressed data from different-size blocks, and a control block 192 with control features to control the CLS, the decompressor, and the read pointers to the buffers, supporting the use of variable-size blocks (such as those of ASTC), as described in more detail below.
In older texture compression schemes, each compressed block contains a fixed power-of-2 number of texels and is stored in a fixed block size. For example, the texture compression scheme described earlier compresses 4x4 blocks of 2B texels into 8B, giving a constant compression factor of 4. With a power-of-2 compressed size in texels in each dimension and a power-of-2 block size, the computation of the start address of the compressed block containing a texel (u, v) in a 2D texture involves only u, v and specific shift operations on the base address of the texture. Additionally, in one embodiment, a decompressed cache line in TC0 comprises a single compressed block or a small power-of-2 number of entire compressed blocks. In one embodiment, a compressed block is not split between multiple cache lines of the decompression cache.
The ASTC texture compression scheme may compress variable-size blocks, varying from 4x4 to 12x12 texels, into 16B, to capture the benefit of supporting a range of compression factors depending on the required quality. With such variable-size blocks, address calculation can become more complex. For example, a 7x5 block requires a division by 7 when calculating the memory address of the compressed block containing the desired texel. This division can consume a large amount of area and power.
In one embodiment, the TC0 cache operates in the uncompressed domain, where texel addresses are identified using the uncompressed (u, v) coordinates of texels. In response to a miss in the TC0 cache, the compressed block address of the missed uncompressed cache line is calculated in an address calculation unit.
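As a sketch of the division-based address calculation described above (illustrative C++, not the unit's actual datapath; the row-major block layout and parameter names are assumptions):

```cpp
#include <cstdint>

// Hypothetical address calculation for a non-power-of-2 block size.
// blockW/blockH are the ASTC block dimensions (e.g., 7x5),
// blocksPerRow is the number of compressed blocks per texture row,
// and 16 is the fixed ASTC compressed block size in bytes.
uint64_t compressedBlockAddr(uint64_t base, uint32_t u, uint32_t v,
                             uint32_t blockW, uint32_t blockH,
                             uint32_t blocksPerRow) {
    uint32_t bu = u / blockW;  // integer divisions (e.g., by 7) --
    uint32_t bv = v / blockH;  // the costly operations the text notes
    return base + (uint64_t(bv) * blocksPerRow + bu) * 16;
}
```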
In one embodiment, FIFO1 is sized to improve performance. When an ASTC or other compressed block is requested from the L1C, the L1C returns a cache line containing a set of two or more compressed blocks. For example, if the ASTC compressed blocks are 16B and the cache line is 64B, then the L1C returns four compressed blocks. One or more of these blocks are stored in FIFO1. Given the locality of access in texture requests, the TD is likely to need some of these blocks within a small time window, while they are still present in FIFO1. In that case, the TD can retrieve them directly from FIFO1 without making another request to the L1C (which would otherwise have to be made), thereby saving the power required to access the L1C and possibly improving performance.
In one embodiment, FIFO2 is sized to improve performance. When the TD decompresses a block, it generates decompressed texels, many of which may not be immediately needed to fill the current cache line. However, there may be other cache line miss requests from TC0 that require these texels. In one embodiment, the decompressed texels are stored in stream FIFO2. If some of these texels are later needed to satisfy a subsequent TC0 cache line fill, they are fetched from stream FIFO2, avoiding another decompression of the entire compressed block by the TD.
In one embodiment, the stream FIFOs FIFO1 and FIFO2 use a first-in-first-out replacement policy, eliminating the need for additional replacement policy management state. In one embodiment, a stream FIFO also has tags that represent the future state of the FIFO after all previous references have been processed. In one embodiment, one aspect of stream FIFOs is that they capture short-term spatial locality in the stream of texture addresses. In one embodiment, the control hardware detects the presence of a desired compressed block in FIFO1, or a desired group of texels in FIFO2, and computes a read pointer to access them from FIFO1 or FIFO2, respectively. That is, the read pointers are controlled to select individual entries within the first buffer using a first read pointer and to select individual entries within the second buffer using a second read pointer. The ability to control the read pointers allows potential savings with respect to accessing the L1C or decompressing blocks in the TD.
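The hit detection described here might be sketched as a tag search over the live FIFO window that yields a read pointer; the tag representation and linear scan below are assumptions for illustration, not the control hardware's actual structure.

```cpp
#include <array>
#include <cstdint>
#include <optional>

// Illustrative hit check: find a compressed block (or texel group)
// already resident in a stream FIFO and return a read pointer to it,
// avoiding a new L1C request (FIFO1) or re-decompression (FIFO2).
template <std::size_t Depth>
std::optional<std::size_t>
findInFifo(const std::array<uint64_t, Depth>& tags,
           const std::array<bool, Depth>& valid, uint64_t blockAddr) {
    for (std::size_t i = 0; i < Depth; ++i)
        if (valid[i] && tags[i] == blockAddr)
            return i;        // read pointer selecting this entry
    return std::nullopt;     // miss: must fetch or decompress anew
}
```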
FIG. 2 is a flow diagram illustrating a method according to an embodiment. The compressed texel data is stored in a first texture cache (e.g., an L1C cache). The decompressed texel data is stored 210 in a second texture cache (e.g., a TC0 cache). A request is received 215 for texel data for a set of pixels. Accesses to the first or second texture cache are scheduled 220 for the requested texel data.
FIG. 3 is a flow diagram illustrating a method that emphasizes aspects of buffering, according to an embodiment. A first request is received 305 for texel data for a first set of pixels. The requested compressed texel data is fetched 310 from a first texture cache (e.g., an L1C cache). The fetched compressed texel data is buffered 315 in a first buffer. For example, the first buffer may comprise FIFO1. The output of the first buffer is provided 320 to the texture decompressor to decompress one or more compressed texel data blocks. The resulting decompressed texel data is buffered 325 in a second buffer. For example, the second buffer may comprise FIFO2. The output of the second buffer is provided 330 to a second texture cache (e.g., TC0). In some embodiments, the one or more compressed texel data blocks stored by the first buffer include second texel data in addition to the requested texel data. In some embodiments, the one or more uncompressed texel groups stored into the second buffer include third uncompressed texel data in addition to the requested texel data. This third texel data is used to form portions of uncompressed cache lines of texel data that are requested by the TC0 cache in subsequent transactions.
FIG. 4 illustrates cache data and tag mapping in a TC0 cache according to an embodiment. In one embodiment, 16 requests are processed per cycle, corresponding to 16 requests into a 2D texture for the four pixels (P0, P1, P2, P3) of a quad, which map into texel space as illustrated. Each pixel is associated with four requests corresponding to the corners of a unit square in texel space, for example at coordinates (u, v), (u+1, v), (u, v+1) and (u+1, v+1). By arranging the data within a TC0 cache line to cover a square (or near-square) area in texel space, the four requests for a particular pixel are mostly located within one cache line, and multiple 1B/2B requests can in many cases even be co-located within a 4B doubleword.
FIG. 4 illustrates an example of how texel data is laid out among 16 banks, numbered in hexadecimal 0..9, A..F. A group of 16 texels (0..9, A..F) is contained in a cache line having a single tag in a tag store. A 4x4 square set of texels (illustrated as squares containing the hexadecimal numbers 0..9, A..F) is mapped to each cache line, numbered CL0 to CL15. The number within each texel represents the bank holding the data. In this example, the tag of each cache line is contained in the tag store indicated by TB<num>. The texel squares with bolded outlines illustrate the texels used to compute the filtered texture value for a pixel (e.g., texels 0, 1, 2, and 3 in CL0/TB0 for pixel P0; texels A and B in CL4/TB0 and texels 0 and 1 in CL5/TB1 for pixel P1; texels 6, 7, C, and D in CL2/TB2 for pixel P2; and texels C, D, E, and F in CL3/TB3 for pixel P3).
FIG. 4 illustrates an example of how cache lines are mapped to four tag stores (TB0, TB1, TB2, and TB3) when the texel data size is 4B. In this case, the texture is a two-dimensional (2D) array of texels, each of 4B size. In one embodiment, the driver and texture decompressor cooperate to lay out the data as shown. Each of the 16 squares of each cache line represents a texel, and the number within the texel represents the memory bank that holds the data. Note that in one embodiment, data is laid out in Z-order, or Morton order, to take advantage of locality in two dimensions, as opposed to conventional layouts that exploit locality in one dimension or the other. Z-order (also called Morton order) is a mapping of multidimensional data to one dimension that preserves the locality of data points.
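A standard Morton encoding for small 2D coordinates is sketched below, as an illustration of the Z-order layout just described (the 16b coordinate width is an assumption):

```cpp
#include <cstdint>

// Spread the low 16 bits of v so a zero bit separates each data bit;
// a standard bit-interleaving helper for Morton encoding.
uint32_t part1By1(uint32_t v) {
    v &= 0x0000FFFF;
    v = (v | (v << 8)) & 0x00FF00FF;
    v = (v | (v << 4)) & 0x0F0F0F0F;
    v = (v | (v << 2)) & 0x33333333;
    v = (v | (v << 1)) & 0x55555555;
    return v;
}

// Morton (Z-order) index: texels close in (u, v) map to nearby linear
// addresses, so a contiguous square of texels spans few cache lines.
uint32_t mortonIndex(uint32_t u, uint32_t v) {
    return part1By1(u) | (part1By1(v) << 1);
}
```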
The squares labeled (P0, P1, P2, P3) indicate where the four pixels from the four corners of the texture request map into the texel space. Note that while they tend to be mapped to squares in texel space, they can also be mapped to any region in texel space. The texel squares in the dashed box adjacent to each pixel indicate the texels used to perform the bilinear filtering or weighted averaging to calculate the filtered texture value.
In this example of FIG. 4, each pixel uses four texels that are non-overlapping, so a total of 16 texels need to be fetched from the cache. This may represent a rare extreme case, but it is chosen to illustrate the general operation of the TC0 cache.
In one embodiment, depending on the cache implementation, the operation of the TC0 cache may take into account one or more constraints on texel access. In one embodiment, the TC0 cache is configured to access at most one unique texel from each bank during a data access. However, in the example of FIG. 4, different pixels require texel data from the same memory bank. For example, the P2 and P3 accesses both map to texels in banks C and D, so that the accesses for pixels P2 and P3 occur over at least two cycles due to the constraint that at most one unique texel be accessed from each bank during a data access. Another example of a possible constraint on the TC0 cache is that no more than one tag access may be made to a given tag store.
Any constraints on texel accesses can be taken into account by the AGC to organize the sequence of texel accesses. In one embodiment, the AGC shown in FIG. 1B functions as follows: a quad access is split into multiple sets so that each set can be executed without data store conflicts or tag store conflicts (if that constraint applies). In one embodiment, the AGC may schedule the accesses to P0 and P2 of FIG. 4 in one cycle, as they relate to non-conflicting tag stores TB0 and TB2 for cache lines CL0 and CL2, and non-conflicting data stores (0, 1, 2, 3) for P0 and (6, 7, C, D) for P2. Similarly, the accesses to P1 and P3 involve only non-conflicting tag stores TB0, TB1 and TB3 and non-conflicting data stores (A, B, 0, 1) and (C, D, E, F) for pixels P1 and P3, respectively.
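A sketch of the constraint check implied here, under the two stated constraints (at most one unique word per bank, at most four distinct cache lines per cycle); the encodings and types are assumptions:

```cpp
#include <cstdint>
#include <set>
#include <vector>

struct TexelAccess { uint32_t cacheLine; uint8_t bank; uint32_t word; };

// Returns true if a set of texel accesses can issue in one cycle: no
// two accesses need different words from the same bank, and no more
// than four distinct cache lines (tag lookups) are touched.
bool conflictFree(const std::vector<TexelAccess>& accesses) {
    std::set<uint32_t> lines;
    uint32_t bankWord[16];
    bool bankUsed[16] = {false};
    for (const auto& a : accesses) {
        lines.insert(a.cacheLine);
        if (bankUsed[a.bank] && bankWord[a.bank] != a.word)
            return false;            // data store conflict
        bankUsed[a.bank] = true;
        bankWord[a.bank] = a.word;
    }
    return lines.size() <= 4;        // tag store constraint
}
```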
Although the four pixels of a quad can map to any positions in texture space, they tend to be close together in texture space. In particular, with suitable mip mapping, the distance between the pixels of a quad in texture space tends to be less than 1.5 texels for bilinear sampling, and less than 1.0/2.0 for the higher/lower mip levels, respectively, for trilinear sampling.
FIG. 5 illustrates conflict-free access to 16 texels according to one embodiment. FIG. 5 represents a scenario in which 16 texels representing a single quad are accessed, and all 16 texels can be accessed in a single cycle. When the pixels of a quad map to locations in texel space that are sufficiently separated horizontally and vertically, the footprint (or layout) is a 4x4 set of texels, as shown by the bold outline. The squares P0, P1, P2 and P3 represent the positions of the pixels in texel space, and it can be seen that the union of the four adjacent texels for all four pixels is a 4x4 set of texels (with bolded outline), subsequently referred to as the texel footprint.
In the case of FIG. 5, the texels in the texel footprint are distributed over four cache lines, and these cache lines in turn map to four different tag stores. Thus, the tags can be accessed in parallel without a store conflict. Specifically, the numbering of the bold texel squares is disjoint; no two bold squares have the same number, which would indicate that they map to the same data store. Thus, all 16 texels may be accessed in a conflict-free manner. In general, all 16 texels map to different banks regardless of the location of the 4x4 texel footprint, and these texels map to at most 4 cache lines, which in turn map to different tag stores. Thus, access is conflict-free regardless of the position of the 4x4 texel footprint. For this case, the AGC examines the texel footprint and schedules all tag accesses and data store accesses into a single cycle.
FIG. 6 illustrates an embodiment of the texel footprint for a properly mip-mapped texture when performing bilinear filtering. When the pitch in texel space is about 1.5 texels, the pixels tend to share texels, and the texel footprint is therefore often 3x3, as shown in FIG. 6. In this case, the nine texels with bolded outlines have different numbers, indicating that no two texels in the footprint map to the same bank. In addition, all nine texels belong to one of two cache lines that map to different tag stores. As before, these observations apply regardless of the location of the 3x3 footprint.
In some cases, the pixels may be warped in the texel space. For example, the texel footprints may be skewed or otherwise misaligned in the horizontal/vertical direction. Even in this case, all texels can be accessed in a collision-free manner as long as the inter-pixel spacing is less than 1.5 texels.
FIG. 7 illustrates the minimum 2x2 footprint of a quad, according to an embodiment. FIG. 7 shows the texel footprint (i.e., four texels with bold outline) when the pixels map to such a small area in texel space that the texel footprint is reduced to the minimum 2x2 footprint. This may occur, for example, at a higher (less detailed) mip level when trilinear filtering is performed. This footprint may be handled in a conflict-free manner: the four texels map to different memory banks, and all four texels may belong to a single cache line.
Thus, in an example embodiment, the TC0 cache supports four tag lookups per cycle. Each 64B TC0 cache line is mapped across 16 banks, each 32b wide. Each bank has a single read port. If a quad request requires more than one access to a bank or more than four tag lookups, the request is split over multiple cycles so as to meet these constraints in each cycle.
Furthermore, to provide good performance in many situations, in one embodiment the driver organizes the texture data in memory and the hardware decompressor further arranges the data in the TC0 cache lines to minimize data store conflicts and tag lookups. In one embodiment, the texture data is organized into mtiles, where the cache lines are arranged by the driver in Morton (Z) order such that a contiguous square block of texels requires a minimum number (i.e., less than a predefined number) of different cache lines and thus a minimum number of tag lookups. Thus, in one embodiment, as long as a texture request representing the texels of a quad maps to a 2x2 block of cache lines within an mtile, no more than four tag lookups are required.
In one embodiment, with a TC0 cache line size of 64B, each cache line in a common 2B/texel texture holds an 8x4 texel block. Thus, a 2x2 cache line block holds a 16x8 texel block. A square texel footprint may be a 3x3 texel block. With proper mip mapping, the maximum expected texel footprint occurs for a quad tilted at 45 degrees with an inter-pixel distance of 2 texels in texel space. That texel footprint is (2√2 + 1 ≈) 3.8x3.8 texels, well below the 16x8 texels contained in the 2x2 cache line block. Thus, bank conflicts are avoided in many situations.
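One way to see the 3.8x3.8 figure (a sketch of the worst-case geometry, assuming the quad's pixel centers form a square of side 2 rotated 45 degrees): the axis-aligned bounding box of that rotated square has side 2√2, and the 2x2 bilinear neighborhood of each pixel extends the box by one texel in total per axis:

\[ \text{footprint side} \;=\; d\sqrt{2} + 1 \;=\; 2\sqrt{2} + 1 \;\approx\; 3.83 \ \text{texels}, \qquad d = 2 \]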
In one embodiment, when a request is not properly mip-mapped, the original texture request is split into multiple requests. In the common case, the texture cache handles 16 requests in a very efficient manner, taking advantage of the expected properties of these requests to achieve both high bandwidth and power efficiency.
FIG. 8 illustrates sub-blocks of a portion of a texture cache architecture, according to an embodiment. The L1C cache is omitted from FIG. 8. In one embodiment, an L0 data store is provided for the TC0 cache. In one embodiment, the L0 data store corresponds to 16 banks, each holding 32 words of 32b. The L0 data read control block and the L0 data write control block control reading and writing of data in the L0 data store. In one embodiment, an L0 crossbar is used to output texel data. The L0 read latency FIFO receives the bank addresses from the AGC. The L0 line and write control latency FIFO receives the line addresses from the L0 tag store.
In one embodiment, the first input 801 (from a texture address (TA) subunit, not shown in FIG. 8) corresponds to up to 16 addresses. Each request is for up to 16 texels, corresponding to a texture base address and 16 (u, v) coordinates in the 2D texture. The texture base address of each request is shared among all 16 texels. Each quad is made up of four pixels, and each pixel accesses 4 texels arranged in a unit square described by pairs of coordinates coord_u and coord_v. In each pair, coord_u[i][1] may be coord_u[i][0]+1, except in the case of wrapping. For 3D textures, each pixel accesses 8 texels arranged in a unit cube, requiring another coordinate coord_w to specify the additional dimension.
In one embodiment, the remaining fields of the first incoming packet from the TA unit are derived from state and from the input. The width, height and depth of the mip map are, for this request, the dimensions of the texture image at the requested mip level, and are required to calculate the offset from the provided base address. In one embodiment, the texture format describes the format of the texture image, including the texel size. In one embodiment, some aspects of the format are used by the downstream TD subunit. In one embodiment, two fields, nr_samples and sample_idx, are used for multisampled texture access.
In one embodiment, texel data output 802 consists of two sets of 16 texels, where each texel is 32b wide. For texel sizes larger than 32b, a power-of-2 set of outputs is clustered together to deliver a single texel, and a set of 16 texels is delivered over multiple cycles.
Arrows 803 and 804 illustrate the interaction with the L1C cache. In the event of a TC0 cache miss, a request providing the virtual address of the cache line is made to the L1C cache. Since the virtual address is 48b and log2 of the cache line size is 6, this address is 42b. In response, the L1C delivers 64B of data.
In one embodiment, the AGC receives two coordinate positions in each of the u, v, w dimensions for the four pixels of a quad, for a total of 16 coordinates, to specify 16 texels. The AGC output consists of up to four tag requests and the data store and crossbar control bits required to access the texels from the data array.
In one embodiment, the AGC accepts 16 requests from the texture address unit and generates tag lookups in the TC0 TAG store. In addition, the AGC generates control bits for selecting one of four line addresses for each of the 16 banks and for routing 32b of data from each data bank to the output port. The tag is updated immediately on a miss, and the miss is sent to the L1C cache. Data access is delayed until the data arrives from the L1C. The delayed access requests are stored in the latency FIFO and processed in order. The 16 banks allow 16 texels to be read simultaneously. The data is routed to the correct texel output at the output crossbar.
In one embodiment, the AGC organizes these 16 requests into a minimum number of sets (e.g., one) such that the texel requests within one set access no more than four cache lines and fetch no more than one 4B word from each of the 16 data stores. In one embodiment, the AGC provides up to four tags per cycle to the L0 tag store. The L0 tag store writes to the L0 line and write control latency FIFO. In one embodiment, a coalescer (CLS) and CLS controller are provided to support coalescing of decompressed blocks into standard-size cache lines.
In one embodiment, the data write control block accepts incoming data from the coalescer and fills the TC0 data array. The L0 data read control block pops the RD L0 FIFO written by the AGC and coordinates reading out up to four cache lines and selecting the data of up to 16 texels from those lines. TC0 delivers up to 16 texels to the texture filter.
In one embodiment, the TC0 cache parameters are: 2KB size, 64B line size, 32 lines, 4 sets, 8-way set associative. In one embodiment, the TC0 cache is addressed using a concatenation of a 40b base address and the u, v coordinates, each of which is 14b for a 2D texture, for a total of 40+28 = 68b. A 3D texture, however, has three coordinates of 11b each, requiring an address width of 40+33 = 73b in texel space. Given that the smallest texel block in a ctile is 2x2x1 and the number of texels in a ctile on each axis is a power of 2, the u, v coordinates are always even. The LSB of each of the u, v coordinates therefore need not be stored in the tag. This leaves 71b of tag bits. There are a total of four incoming tags per cycle, all of which may be directed to a particular tag store. Each tag store has sufficient comparators and other resources to support tag matching on up to four incoming tags. Each incoming 71b tag address is compared in parallel with all eight 71b tags of the set. On a match, the 5b line address is sent down to the read tag latency FIFO.
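The tag-width arithmetic for the 2D case can be sketched as follows (the bit packing and the hi/lo split are assumptions for illustration; per the text, the tag store itself is sized for the 71b 3D worst case):

```cpp
#include <cstdint>
#include <utility>

// Illustrative 2D tag construction: a 40b base address concatenated
// with 14b u and v coordinates gives 68b; dropping the always-zero
// LSB of u and v leaves 40 + 13 + 13 = 66 significant bits.
std::pair<uint64_t, uint64_t> makeTag2D(uint64_t base40, uint32_t u14,
                                        uint32_t v14) {
    uint64_t up = u14 >> 1;                          // drop even LSB
    uint64_t vp = v14 >> 1;
    uint64_t lo = (base40 << 26) | (up << 13) | vp;  // low 64 tag bits
    uint64_t hi = base40 >> 38;                      // top 2 tag bits
    return {hi, lo};
}
```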
On a miss, the missed address is sent to the L1C cache. In one embodiment, each of the four cache line requests may miss the cache, resulting in a maximum of four misses generated in one cycle. On a miss, the corresponding data_ram_line_miss bit for that bank is set. One of the eight lines of the set is selected for replacement and its tag is overwritten by the new tag. In some cases there may be pending requests on the replaced tag, but since a lookup has already been performed on the line address for these requests, the cache line is only overwritten just before its first use, and thus after any pending requests. In the case of a serial cache organization, the tag may even be overwritten before the corresponding data has been read out of the data RAM.
In one embodiment, a locality-based replacement policy is employed to exploit the spatial locality in texture accesses. When comparing an incoming 71b tag with the tags in the cache set, it is also determined whether the difference is only in the lower bits of the coordinate components. The victim is first selected from among high-order-miss tags. When there is no high-order-miss tag, the victim is selected from among low-order-miss tags. In one embodiment, random selection is used within the same priority group. A low-order miss is detected by the following criteria. If there is a difference in the base address, it is a high-order miss. Otherwise, for 2D textures and slice-organized 3D textures: if the difference is only in the LSB 6 bits of each u, v coordinate component, it is a low-order miss. For 3D-block-organized 3D textures: if the difference is only in the LSB 4 bits of each u, v, w coordinate component, it is a low-order miss. Otherwise, it is a high-order miss.
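A sketch of this classification for the 2D case (the 6b mask follows the criteria above; everything else is an assumption for illustration):

```cpp
#include <cstdint>

// A miss against a stored tag is "low-order" if only the LSB 6 bits
// of each coordinate component differ (2D / slice-organized 3D case);
// such victims are deprioritized to preserve spatial locality.
bool isLowOrderMiss2D(uint64_t baseA, uint32_t uA, uint32_t vA,
                      uint64_t baseB, uint32_t uB, uint32_t vB) {
    if (baseA != baseB) return false;      // base differs: high-order
    return ((uA ^ uB) & ~0x3Fu) == 0       // u differs only in LSB 6b
        && ((vA ^ vB) & ~0x3Fu) == 0;      // v differs only in LSB 6b
}
```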
As shown in FIG. 8, in one embodiment, stream FIFO1 (labeled LSF because it receives data from the L1 cache) holds potentially compressed cache lines delivered from the L1C. The TD decompresses the blocks of compressed texels into decompressed cache lines. Stream FIFO2 (labeled DSF because it receives data from the decompressor) holds these decompressed cache lines. TC0 holds decompressed texels organized into 64B cache lines, where each 4B segment is stored in a separate data store and the entire cache line is spread across 16 data stores.
In one embodiment, each decoded RGBA ASTC texel occupies 8 bytes (16b floating point for each component), allowing a TC0 cache line (64B) to hold 8 uncompressed texels organized as a 4x2 block with 4 columns and 2 rows. Each compressed 16B ASTC block contains 5x5 compressed texels. On a miss, the TC requests a grid of 4Cx2R uncompressed texels (4 columns by 2 rows). Depending on how the uncompressed grid maps onto the compressed ASTC grid, the 4Cx2R grid may map to multiple (1-4) compressed ASTC blocks.
In one embodiment, the CLS and associated control features are used to generate aligned blocks of uncompressed texel data that may be loaded into the L0 data store. This is useful for the non-power-of-2 block dimensions present in ASTC. For other compression schemes, the decompression factor is a small power of 2, and each compressed block expands readily into a 64B cache line. That is, decompressing a small power-of-2 set of compressed blocks produces aligned 64B uncompressed texel data that can be loaded directly into the L0 data store. In one embodiment, a decompressor and LSF controller (DLC) coordinates the decompression of multiple (variable-size) ASTC blocks to produce decompressed 4x4 texel blocks in a 64B line. Additional coordination is provided via control of the read pointers into FIFO1 and FIFO2.
As an example, consider how power and bandwidth may be wasted if ASTC blocks are decompressed and utilized without proper coordination and reuse. Assume a nominal texel size of 4B, which for a 64B line in the L0 data store means a 4x4 texel block. Since ASTC non-power-of-2 blocks are not aligned to the 4x4 uncompressed blocks of L0 data store cache lines, filling one such block may require decompressing up to 4 compressed blocks (e.g., 6x6), for a total of 6x6x4 = 144 texels. Only 16 of these texels are needed for the 4x4 block. Therefore, up to 144-16 = 128 texels may be discarded, wasting decompressor power and bandwidth. In addition, these 4 blocks can, in the worst case, lie on 4 separate 64B lines, wasting L1C access power and bandwidth.
However, there is typically a large amount of spatial locality in the texture access pattern. Thus, it is likely that the unused decompressed texels left over from filling one 4x4 block in the L0 data store will soon be used to fill nearby 4x4 blocks for other requests. Similarly, the 4 ASTC blocks comprising an L1 cache line are likely to be reused for nearby 4x4 blocks. Thus, two small buffers (FIFO1 and FIFO2) that cache compressed L1 cache lines and decompressed ASTC blocks are effective in reducing the number of cache lines fetched from the L1C and the number of unused decompressed texels.
In a stream FIFO, the oldest written line is always selected for replacement. Thus, the write pointer is incremented in a wrap-around manner on each write. However, reads may occur from any line within the written window. A line can be read multiple times, allowing reuse to be exploited. The returned L1C cache lines are deposited into stream FIFO1. The decompressor reads 16B chunks (possibly larger for non-ASTC formats) from stream FIFO1, decompresses them, and sends them to the CLS. The CLS collects the TD output data to construct 64B cache lines and writes them into the L0 data store. The stream FIFO is a simple cache structure aimed at eliminating excessive request traffic to the L1C.
The TC uses a small buffer at the input of the decompressor because the same compressed block may be needed to generate multiple decompressed 64B blocks that are temporally adjacent.
An additional aspect of FIG. 8 is a tag miss FIFO (serialized by the tag miss serializer) that receives the tags of tag misses. A Select Missing Quad Request (SMQR) block selects one of the missing requests, pairs it with a base address and associated information from the texture image descriptor, and delivers the entire packet to Compressed Block Address Generation (CBAG). Specifically, for each dimension, CBAG computes the minimum and maximum values of the texel coordinates. For 2D textures, the outputs are thus the base address, (u_min, u_max), and (v_min, v_max). In one embodiment, the CBAG calculates up to 4 ASTC block addresses in the compressed (memory) address space. Generally, this address calculation involves dividing each dimension range by the ASTC block size in that dimension. For example, for a 5x6 block, (u_min, u_max) is divided by 5 and (v_min, v_max) is divided by 6 to identify the required ASTC blocks. Next, the address of each of these blocks is calculated. The output is a set of up to 4 ASTC block addresses, the lower 4b of which are zeros (since the ASTC block size is 2^4 = 16B).
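The CBAG computation just described might be sketched as follows (illustrative C++; the row-major block layout and parameter names are assumptions, and the loops yield at most four addresses when the texel range spans at most two blocks per dimension):

```cpp
#include <cstdint>
#include <vector>

// From the texel coordinate range of a missed uncompressed line,
// derive the (up to four) 16B-aligned ASTC compressed block addresses.
std::vector<uint64_t> astcBlockAddrs(uint64_t base,
                                     uint32_t uMin, uint32_t uMax,
                                     uint32_t vMin, uint32_t vMax,
                                     uint32_t blockW, uint32_t blockH,
                                     uint32_t blocksPerRow) {
    std::vector<uint64_t> addrs;
    for (uint32_t bv = vMin / blockH; bv <= vMax / blockH; ++bv)
        for (uint32_t bu = uMin / blockW; bu <= uMax / blockW; ++bu)
            // Each compressed ASTC block is 16B, so every address has
            // its lower 4 bits zero.
            addrs.push_back(base + (uint64_t(bv) * blocksPerRow + bu) * 16);
    return addrs;
}
```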
In one embodiment, the texture decompressor (DC) can process up to 4 output texels per cycle, laid out in one of several predefined organizations. In one embodiment, the DSF tag lookup and LSF tag lookup partition the memory access traffic into a plurality of predefined texel footprint patterns and send them out one by one. FIG. 9 illustrates examples of ASTC texel footprint patterns that the DC may process, according to an embodiment. Six different example cases are illustrated, showing different options for processing 1, 2, 3 or 4 texels in one or two cycles.
In one embodiment, the CLS is controlled by a DSF entry signal, which in turn receives control bits via the DSF tag lookup. These control bits specify a set of up to 9 texel squares from 9 banks (for the 4B texel size case), from which 4x4-sized blocks may be generated out of 4 texel quads. Additional control bits specify which portions of these quads are routed to which portions of the 64B cache line in the L0 data store. The CLS reads the specified quads, routes the data, and writes the 64B cache line to the L0 data store upon receiving a ready signal (e.g., from the CC).
In one embodiment, an incoming address from the DSF tag lookup is hit-tested in the fully associative LSF tag lookup. A miss is allocated the entry at the write pointer, which is then advanced, and the miss is sent to the L1C. In one embodiment, the LSF control makes the FIFO function as both a stream FIFO and a buffer between the L1C and the decompressor control (DLC).
FIG. 10 illustrates an AGC process flow according to an embodiment. The process of organizing texel requests is spread over multiple steps (S0, S1, S2, S3, S4), where each step attempts to combine a set of requests satisfying a larger or different set of constraints. In one embodiment, the set of constraints is: no more than four different cache lines, and no more than one doubleword from each bank. However, it will be understood that other constraints may be utilized. In a first step S1, the requests originating from each pixel are examined for their cache line addresses. Each cache line address is then linked to the required bank offset for each of the four requests. This process is referred to as bucketing in FIG. 10. The first step S1 thus generates four groups, each group having four buckets, each bucket containing up to four texel requests. In subsequent steps, each bucket may receive additional texel requests as long as they do not have a bank conflict with other requests in the group. In one embodiment, the driver software organizes the texture data such that requests associated with a pixel are highly unlikely to have a bank conflict. However, in the rare case where there is a bank conflict, the requests of the affected pixels are processed separately.
In a second step S2, two combinations of bucket pairs are considered. For example, the bucketing of p0 & p1 checks whether all requests associated with pixels p0 and p1, which were in two different bucket sets, can be placed into a single bucket set while still satisfying the constraint of no more than four different cache lines and no more than one doubleword from each bank. At the end of the second step there are two sets of buckets, in which the pixels are paired differently.
The third step S3 checks whether both pairings fail, in which case the third pairing, of p0 and p3, is bucketed, and a request for p0 & p3 is sent if the bucket conforms to the constraints. This is followed by a check of p1 & p2 (not shown). However, the most common case is that both pairings of step S2 comply with all constraints, in which case the process considers bucketing all four pixels, as shown by "bucket p0 & p1 & p2 & p3". Again, the common case is that this bucketing succeeds and all requests from the four pixels can be processed in the same cycle.
FIG. 10 also illustrates other cases, for example, when the requests for pixel p0 have to be sent separately, as shown in step S4. The process is hierarchical, starting with the requests for a single pixel and building up to a quad of pixels that are compatible in terms of their tag and data store access requirements. The process terminates quickly in the common case where all four pixels are bucketed together, but is also useful for quickly determining compatible subsets in the other cases.
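A hypothetical sketch of this hierarchical merge, reusing the TexelAccess type and conflictFree() check sketched earlier (the fallback order shown is simplified relative to the figure, which also tries the p0 & p3 / p1 & p2 pairing):

```cpp
#include <initializer_list>
#include <vector>

// Try to issue a whole quad as one conflict-free group, fall back to
// pairs, then to one pixel per cycle. px[i] holds pixel i's requests.
std::vector<std::vector<TexelAccess>>
scheduleQuad(const std::vector<TexelAccess> px[4]) {
    auto merge = [&](std::initializer_list<int> ids) {
        std::vector<TexelAccess> all;
        for (int i : ids) all.insert(all.end(), px[i].begin(), px[i].end());
        return all;
    };
    auto all4 = merge({0, 1, 2, 3});
    if (conflictFree(all4)) return {all4};   // common case: one cycle
    auto p01 = merge({0, 1}), p23 = merge({2, 3});
    if (conflictFree(p01) && conflictFree(p23))
        return {p01, p23};                   // two cycles, paired
    return {px[0], px[1], px[2], px[3]};     // one pixel per cycle
}
```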
FIG. 11 illustrates example texture cache boundaries and ASTC block mappings for three possible block mappings, according to an embodiment, for a 15x15 ASTC texture (also shown in FIG. 12). The thick black lines 1105 show the cache line boundaries in TC0 and the requests made to the DC on a miss. The thin black lines 1110 show the ASTC 5x5 block boundaries of the texture. On a miss, TC0 requests a 4x2 texel grid from the ASTC decoder in the TD. Three classes of requests are possible, depending on the cache line that misses. For a class 0 block, the miss maps into a single ASTC 5x5 block. The TD will deliver the decoded texels in 2 cycles (a measure of throughput rather than latency). For a class 1 block, the miss maps to two ASTC 5x5 blocks. The TD will decode the block in 2 (block B) or 3 (block A) cycles. Block A requires 2 cycles on the second ASTC block (since 6 of its texels must be decoded) and 1 cycle (1Cx2R) on the first ASTC block. Block B requires 1 cycle on each of the two ASTC blocks. A class 2 block miss maps onto 4 ASTC 5x5 blocks. Both blocks A and B require four cycles for decoding. In one embodiment, the TD is required to decode 2Cx2R, 4Cx1R (or subsets) and 1Cx2R blocks to support this throughput.
In addition to supporting ASTC, the stream FIFO2 in an example embodiment can efficiently support the ETC2 compression format. As a result, in one embodiment, stream FIFO2 comprises 4 banks, each 128b wide, sufficient to store 8 ASTC decoded texels or 16 ETC2 decoded texels. In an example embodiment, each bank supports channel masking and the ability to write either the high 64b or the low 64b. For ASTC decoded texel numbering within a 4x2 block of texels, bank 0 holds texels 0 and 1, bank 1 holds texels 2 and 3, and so on. In an example embodiment, no memory bank conflict occurs for any of the three classes of blocks.
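The stated texel-to-bank mapping can be expressed directly; the half-selection convention below (even texel in the low 64b) is an assumption:

```cpp
#include <cstdint>

// Within a 4x2 block of 64b decoded ASTC texels: bank 0 holds texels
// 0 and 1, bank 1 holds texels 2 and 3, and so on; each 128b bank
// holds two texels, selected by a low-half or high-half write.
uint32_t dsfBank(uint32_t texelIndex)     { return (texelIndex / 2) % 4; }
bool     dsfHighHalf(uint32_t texelIndex) { return (texelIndex & 1) != 0; }
```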
In an example embodiment, the decoder decodes a 4Cx1R or 1Cx2R block when there is a flexible choice. The TD decodes 2Cx2R blocks only for the class 1 block B case.
An example of texture cache to texture decompressor ordering for ASTC textures will now be described. For class 0 blocks, to fill a cache line, a request for 8 texels may be made to the TD. There are two options for requesting decoding from the TD unit: a request for up to two 4Cx1R blocks, or a request for up to two 2Cx2R blocks.
In one embodiment, for a class 1 block, a request is made for uncompressed data from two ASTC blocks, requesting 2-4 texels from each block. In one embodiment, the following sequence is followed:
request 1Cx2R or 2Cx2R or 3Cx2R from the upper left ASTC block.
For a 1Cx2R request, in one embodiment, the stream FIFO2 decompressor output supports channel masking of individual texels to different banks (e.g., texel 0 is written to bank 0, texel 4 is written to bank 2, texel 3 is written to bank 1, and texel 6 is written to bank 3).
For the 2Cx2R case, the request is written to memory bank 0 and memory bank 2, respectively, or vice versa.
Request 1Cx2R or 2Cx2R or 3Cx2R from the top right ASTC block.
Always follow the Z order of the request.
For class 2 blocks, the texture cache requests data from four ASTC blocks. In these cases, the Z-order is maintained.
Request 1Cx1R or 2Cx1R or 3Cx1R from the top left ASTC block.
Request 1Cx1R or 2Cx1R or 3Cx1R from the top right ASTC block.
Request 1Cx1R or 2Cx1R or 3Cx1R from the bottom left ASTC block.
Request 1Cx1R or 2Cx1R or 3Cx1R from the bottom right ASTC block.
In these cases, the ordering is identical, and the support for channel masking in stream FIFO2 allows the data to be written efficiently. Supporting 3Cx2R would require additional buffering in the TD, so such a request can be further split into two 3Cx1R requests.
The relationship between the uncompressed domain address and the address of the corresponding compressed block in memory may be complex for non-power-of-2 block sizes used in ASTC. The texel data required for the aligned 64B block may come from multiple compressed blocks.
FIG. 12 illustrates an example showing ASTC 5x5 texel blocks, whose boundaries are illustrated by the thin black lines 1110. These blocks are numbered 00..02 on the first row, with the last row numbered 20..22. A cache line contains 4 ASTC blocks (00, 01, 10, 11).
The texture cache blocks are 4x2 arrangements of 64b texels. Their boundaries are illustrated by the thick black lines 1105. These blocks are numbered 00..03 on the first row and 00 to 50 on the first column.
The first access has the texel footprint shown as the shaded blocks labeled 0, and the second access has the slashed block footprint labeled 1.
Starting from an empty cache/buffer, the first access brings the cache line with ASTC blocks (00, 01, 10, 11) into the LSF, decompresses ASTC (thin-line) block 00 and stores it in the DSF, and fills TC0 with the uncompressed (thick-line) blocks 10 and 20.
The second access hits in the DSF for ASTC block 00 and hits in the LSF for ASTC block (01, 10, 11). This saves repeated decompression of the ASTC block 00 and re-accessing the L1C for the cache line containing (01, 10, 11).
Decompression is then performed on ASTC blocks (01, 10, 11). The coalescer combines all three, plus the previously decompressed block 00, to generate the uncompressed (thick-line) block 21, which fills TC0.
One exemplary but non-limiting application of embodiments of the present invention is in a mobile environment. In a mobile environment, there are constraints on the memory bandwidth and power required to transfer data from main memory to the texture cache of the GPU via the L2 cache. The energy cost of moving a doubleword (4B) from a low power double data rate random access memory (LPDDR) to the L1 cache is estimated to be about 50 times that of performing a floating point operation. Thus, example embodiments disclosed herein may facilitate a compression format that achieves a high compression factor in texture units of a mobile GPU.
While compression formats may be energy efficient in terms of data movement costs, the energy costs associated with decompression may be substantial. For example, in the example block compression format, the decompressor linearly interpolates between two colors to generate a total of, say, four colors. The decompressor then selects an index based on the texel address and uses the 2b index to select one of the four colors. The energy cost of interpolation can be substantial. The indexing mechanism introduces two levels of lookups. With the trend of supporting a variety of more sophisticated compression schemes, decompression and data routing energy costs can account for a significant portion of the overall texture unit power.
To amortize some of these costs over multiple texel accesses, example embodiments of the texture cache architecture insert a level 0 (TC0) cache between the decompressor and the addressing logic. The TC0 cache holds decompressed texels, unlike the level 1 cache (L1C), which holds texel data in a compressed format. The energy cost of decompression is amortized over multiple texel accesses spread over multiple cycles. For example, if four texels are accessed from a 4x4 compressed block in four consecutive cycles, the TC0 cache holds the uncompressed texels across those four cycles and the decompression cost is incurred only once, as opposed to four times without the TC0 cache.
Another factor that traditionally contributes to the power and area needed to support non-power-of-2 block sizes is that while a cache line contains uncompressed texel blocks with power-of-2 dimensions, such as 8x4, the compressed blocks in memory may have non-power-of-2 dimensions, such as 7x5. In this case, the boundaries of the compressed blocks may not align with the boundaries of the power-of-2 blocks in the cache lines. In this particular example, filling an 8x4 block may require two or four 7x5 blocks. As a result, the texture decompressor must decompress many compressed blocks to fill all the texels in a cache line. Example embodiments may be utilized to support improved performance for non-power-of-2 block sizes. Many of the same compressed blocks (or other blocks in the same L1C cache line) may be needed to fill texels in the next few missed cache lines, and would otherwise have to be fetched from the L1C repeatedly, wasting bandwidth and power. Stream FIFO1, holding the most recently accessed compressed blocks, serves to reduce accesses to the L1C. If the next several cache line fill requests require the same compressed blocks, stream FIFO1 delivers them to the TD without requiring an L1C access.
One aspect of an embodiment of the texture cache architecture is that the texture cache client is relatively insensitive to latency. In a CPU level 1 cache, tag accesses and data accesses are performed in parallel (or using some manner of predictive hardware) to reduce the cache hit latency to approximately 1-4 cycles. Due to complex addressing logic involving, for example, Level of Detail (LOD) computations and texture filtering operations, the latency of a texture unit may exceed 50 cycles even without any level 1 misses. In the event of a cache miss followed by a cache hit to a different address, a CPU cache delivers the hitting data immediately rather than waiting for the unrelated missed data to arrive from the next level of the memory hierarchy. This out-of-order, or hit-under-miss, return of data may reduce latency for a single thread in a CPU, but does not provide significant benefits in a GPU due to the vector nature of shader core accesses and the overall ordering of the graphics pipeline. Given the relative insensitivity of shader performance to texture latency, the large fixed latency component due to texture addressing and filtering, and the ordering of the overall graphics pipeline, alternatives to the CPU level 1 cache organization are attractive.
In one embodiment, all addresses sent to the texture cache architecture 108 are processed in order. In the case where a cache miss is followed by a cache hit, the delivery of the data of the cache hit is delayed until after the data of the cache miss. In addition, a hit in the tag array does not necessarily mean that the corresponding data is present in the cache, but only that it will be present in the cache once all previous references have been processed. The streaming behavior of this texture cache, where all references stream through the cache in order in their entirety, brings important benefits and design simplicity. In a graphics pipeline, state and work are ordered, i.e., any state received applies only to subsequent work requests. Out-of-order handling of hits ahead of misses complicates the application of state to data. For example, the texture filtering logic would have to recognize that the newer state is to be applied to the hit while the older state is applied to the miss. In other caches, if the tag comparison fails on the main tag array, the control logic further initiates a check of whether there is an earlier outstanding miss on the same cache line. In an example embodiment, this check is not necessary in the stream cache.
In one embodiment, an example of a graphics processing unit includes: a controller configured to receive a first request for texel data for a first group of pixels; a first buffer to store one or more blocks of compressed texel data fetched from a first texture cache in response to the first request, the one or more blocks of compressed texel data including at least the requested texel data; a texture decompressor that decompresses the one or more blocks of compressed texel data stored in the first buffer; and a second buffer that stores the one or more decompressed blocks of texel data and provides the decompressed requested texel data as output to a second texture cache; wherein the one or more blocks of compressed texel data stored by the first buffer include second texel data in addition to the requested texel data. In one embodiment, the first buffer may be a first FIFO buffer and the second buffer may be a second FIFO buffer. In one embodiment, the controller may be configured to receive a second request for texel data for a second group of pixels, at least a portion of the one or more blocks of the first request corresponding to at least a portion of the second group of pixels; and the first buffer is configured to provide the portion of the one or more blocks to the texture decompressor in response to the second request without a second fetch from the first texture cache. In one embodiment, the controller may be configured to receive a second request for texel data for a second group of pixels, at least one texel of the second request corresponding to decompressed texel data stored in the second buffer as a result of processing the first request; and the second buffer is configured to provide the at least one texel of the second request to the second texture cache in response to the second request without a second decompression by the texture decompressor. In one embodiment, the first texture cache may be configured to store non-power-of-2 block sizes. In one embodiment, the second texture cache may be configured to store power-of-2 block sizes. In one embodiment, a merger unit may be included to merge the decompressed texture data prior to storage in the second texture cache. In one embodiment, the first texture cache stores block sizes according to an Adaptive Scalable Texture Compression (ASTC) codec. In one embodiment, the controller may control a first read pointer of the first buffer to select an individual entry within the first buffer and a second read pointer of the second buffer to select an individual entry within the second buffer.
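For illustration, a minimal sketch of the two-buffer datapath in this embodiment; the data types and the stub stand-ins for the L1C fetch, the texture decompressor (TD), and the L0 fill are all assumptions for the sketch, not the claimed implementation.

```cpp
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical records for one compressed (e.g., 7x5 ASTC) block and one
// decompressed power-of-2 texel block; layouts are illustrative only.
struct CompressedBlock { uint64_t addr; std::vector<uint8_t> bits; };
struct TexelBlock      { uint64_t addr; std::vector<uint32_t> texels; };

// Stub stand-ins for the units named in the text.
CompressedBlock fetchFromL1C(uint64_t addr) {        // first texture cache
    return {addr, std::vector<uint8_t>(16)};
}
TexelBlock decompress(const CompressedBlock& c) {    // texture decompressor
    return {c.addr, std::vector<uint32_t>(32)};
}
void fillL0(const TexelBlock&) {}                    // second texture cache

// In-order service of one fill request: the compressed block is buffered
// in FIFO1 (first buffer) so later requests can reuse it, decompressed by
// the TD, then buffered in FIFO2 (second buffer) on its way to the L0.
void serviceFill(uint64_t addr,
                 std::deque<CompressedBlock>& fifo1,
                 std::deque<TexelBlock>& fifo2) {
    CompressedBlock c = fetchFromL1C(addr);   // fetch from first texture cache
    fifo1.push_back(c);                       // first buffer
    TexelBlock t = decompress(c);             // TD
    fifo2.push_back(t);                       // second buffer
    fillL0(t);                                // output to second texture cache
}
```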
In one embodiment, an example of a method of operating a graphics processing unit includes: receiving a first request for texel data for a first group of pixels; fetching the requested compressed texel data from a first texture cache; buffering the fetched compressed texel data in a first buffer; providing the output of the first buffer to a texture decompressor and decompressing one or more blocks of compressed texel data; buffering the decompressed texel data in a second buffer; and providing an output of the second buffer to a second texture cache; wherein the one or more blocks of compressed texel data stored by the first buffer include second texel data in addition to the requested texel data. In one embodiment of the method, the first buffer is a first FIFO buffer and the second buffer is a second FIFO buffer. In a particular embodiment, the read pointer to the first buffer is selected to reuse texel data in the first buffer to service more than one request for texel data. In one embodiment, the read pointer to the second buffer is selected to reuse texel data in the second buffer to service more than one request for texel data. One embodiment includes reusing texel data in the first buffer, fetched for the first request, to at least partially service a second request for texel data for a second group of pixels without a second fetch from the first texture cache. In one embodiment, the first texture cache is configured to store non-power-of-2 block sizes. In one embodiment, the second texture cache is configured to store power-of-2 block sizes. One embodiment further comprises merging decompressed texture data received from the second buffer prior to storage in the second texture cache. In one particular embodiment, texel data from multiple non-power-of-2 blocks is merged.
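A brief sketch of read-pointer selection for reuse; the entry layout and the assumed 8-entry buffer depth are hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <optional>

// Hypothetical first-buffer entry: a buffered compressed block tagged by
// its block address, with a valid flag.
struct Entry { uint64_t blockAddr; bool valid; };

// Returns the index (read pointer) of a matching buffer entry, if any,
// so a block fetched for an earlier request can service a later one
// without a second fetch from the first texture cache.
std::optional<size_t> selectReadPointer(
        const std::array<Entry, 8>& fifo, uint64_t wantedAddr) {
    for (size_t i = 0; i < fifo.size(); ++i) {
        if (fifo[i].valid && fifo[i].blockAddr == wantedAddr)
            return i;        // reuse this entry for the new request
    }
    return std::nullopt;     // miss: fetch from the first texture cache
}
```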
In one embodiment, an example of a graphics processing unit includes: a first texture cache configured to store compressed texel data; a second texture cache configured to store texel data that has been decompressed from the first texture cache; and a controller configured to receive a request for texel data for a group of pixels and to schedule an access to the first texture cache or the second texture cache for the texel data. In one embodiment, the controller is further configured to: determine whether there is a cache hit or a cache miss for the requested texel data in the second texture cache; access the first texture cache for the requested texel data in response to determining a cache miss; and access the second texture cache for the requested texel data in response to determining a cache hit. In one embodiment, data is organized into the second texture cache based on locality patterns present in the access stream. In one embodiment, the second texture cache has texel data grouped into cache lines corresponding to contiguous two-dimensional texel blocks organized in Morton order. In one embodiment, the controller is further configured to divide the set of requested texel addresses into at least one sequence of non-conflicting memory accesses. In one embodiment, the at least one non-conflicting memory access has no tag conflicts or data store conflicts. In one embodiment, the controller is further configured to combine texel requests that satisfy a set of constraints based on at least one of a number of distinct cache lines or a number of doublewords per bank. In one embodiment, the controller is further configured to: look up the cache line addresses required by the texel requests originating from each pixel of the group of pixels; and combine texel requests that satisfy the constraint of no more than four distinct cache lines and no more than one doubleword from each bank. In one embodiment, the second texture cache has a 4-way banked tag lookup and a 16-way banked data store. In one embodiment, the layout of texels in the second texture cache is selected to ensure that the four texels in a texel footprint are in different banks.
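As one concrete illustration of the Morton-order layout mentioned above (the 16-bit coordinate width is an assumption for the sketch, not taken from the embodiment):

```cpp
#include <cstdint>

// Morton (Z-order) index for a texel at (x, y): interleave the bits of x
// (even positions) and y (odd positions) so that contiguous 2D texel
// blocks map to contiguous addresses within a cache line.
uint32_t mortonIndex(uint16_t x, uint16_t y) {
    uint32_t idx = 0;
    for (int bit = 0; bit < 16; ++bit) {
        idx |= (uint32_t)((x >> bit) & 1) << (2 * bit);      // even bits: x
        idx |= (uint32_t)((y >> bit) & 1) << (2 * bit + 1);  // odd bits: y
    }
    return idx;
}
// Example: texels (0,0),(1,0),(0,1),(1,1) map to indices 0,1,2,3, so a
// 2x2 bilinear footprint occupies consecutive addresses in one line.
```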
In one embodiment, an example of a method of operating a graphics processing unit includes: storing compressed texel data in a first texture cache; storing texel data decompressed from the first texture cache in a second texture cache; and receiving a request for texel data for a group of pixels and scheduling an access to either the first texture cache or the second texture cache for the texel data. In one embodiment, the scheduling comprises: determining whether there is a cache hit or a cache miss for the requested texel data in the second texture cache; accessing the first texture cache for the requested texel data in response to determining a cache miss; and accessing the second texture cache for the requested texel data in response to determining a cache hit. One embodiment further includes organizing the texel data in the second texture cache into tiles within which the cache lines are organized in Morton order, such that a contiguous two-dimensional block of texels requires fewer than a predefined number of distinct cache lines and tag lookups. One embodiment further includes dividing the set of requested texel addresses into a set of non-conflicting access sets. In one embodiment, the non-conflicting access sets have no tag conflicts or data store conflicts. One embodiment further includes combining texel requests that satisfy a set of constraints based on at least one of a number of distinct cache lines or a number of doublewords per bank. In one embodiment, combining texel requests includes combining texel requests that satisfy the constraint of no more than four distinct cache lines and no more than one doubleword from each bank. In one embodiment, the second texture cache has a 4-way banked tag lookup and a 16-way banked data store. In one embodiment, the layout of texels in the second texture cache is selected to ensure that the four texels in a texel footprint are in different banks. In one embodiment, data is organized into the second texture cache based on locality patterns present in the access stream.
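A minimal sketch of the combining check, assuming the constraints stated above (at most four distinct cache lines; at most one distinct doubleword per bank) and hypothetical request fields:

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <vector>

// Hypothetical per-texel request: the cache line it needs and the
// doubleword address within the data store.
struct TexelReq { uint64_t lineAddr; uint32_t dwordAddr; };

// Returns true if the requests form one non-conflicting access set: no
// more than four distinct cache lines (tag-lookup constraint), and no
// bank is asked for more than one distinct doubleword (bank constraint).
bool canCombine(const std::vector<TexelReq>& reqs, unsigned numBanks = 16) {
    std::set<uint64_t> lines;
    std::map<unsigned, uint32_t> bankToDword;
    for (const auto& r : reqs) {
        lines.insert(r.lineAddr);
        unsigned bank = r.dwordAddr % numBanks;  // assumed bank mapping
        auto it = bankToDword.find(bank);
        if (it == bankToDword.end()) bankToDword[bank] = r.dwordAddr;
        else if (it->second != r.dwordAddr) return false;  // bank conflict
    }
    return lines.size() <= 4;
}
```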
While the invention has been described in connection with specific embodiments, it will be understood that it is not intended to limit the invention to the described embodiments. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the invention as defined by the appended claims. Embodiments may be practiced without some or all of the specific details set forth herein. Additionally, well-known features may not have been described in detail to avoid unnecessarily obscuring the invention. In accordance with the present invention, the components, process steps, and/or data structures may be implemented using various types of operating systems, programming languages, computing platforms, computer programs, and/or computing devices. In addition, those of ordinary skill in the art will recognize that devices such as hardwired devices, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), or the like may also be used without departing from the scope and spirit of the inventive concepts disclosed herein. The invention may also be tangibly embodied as a set of computer instructions stored on a computer-readable medium, such as a memory device.

Claims (18)

1. A graphics processing unit, comprising:
a controller configured to receive a first request for texel data for a first group of pixels;
a first buffer to store one or more blocks of compressed texel data fetched from a first texture cache in response to the first request, the one or more blocks of compressed texel data including at least the requested texel data;
a texture decompressor for decompressing one or more blocks of compressed texel data stored in said first buffer; and
a second buffer storing one or more decompressed blocks of texel data and providing the decompressed requested texel data as output to a second texture cache;
wherein the size of the block of compressed texel data in the first texture cache is set such that a cache line in the first texture cache returns one or more blocks of the compressed texel data,
wherein the one or more blocks of compressed texel data stored by the first buffer include second texel data in addition to the requested texel data,
wherein the controller is configured to receive a second request for texel data for a second group of pixels, at least a portion of the one or more blocks of the first request corresponding to at least a portion of the second group of pixels; and the first buffer is configured to provide the portion of the one or more blocks to the texture decompressor without a second fetch from the first texture cache in response to the second request.
2. The graphics processing unit of claim 1, wherein the first buffer is a first FIFO buffer and the second buffer is a second FIFO buffer.
3. A graphics processing unit as defined in claim 1, wherein:
the one or more blocks of decompressed texel data stored by the second buffer include third texel data in addition to the requested texel data.
4. The graphics processing unit of claim 1, wherein the controller is configured to receive a second request for texel data for a second group of pixels, at least one texel of the second request corresponding to decompressed texel data stored in the second buffer as a result of processing the first request; and the second buffer is configured to provide the at least one texel of the second request to the second texture cache in response to the second request without a second decompression from the texture decompressor.
5. The graphics processing unit of claim 1, wherein the first texture cache is configured to store a non-power-of-2 block size.
6. The graphics processing unit of claim 5, wherein the second texture cache is configured to store a power of 2 block size.
7. The graphics processing unit of claim 6, further comprising a merger unit to merge decompressed texture data prior to storage in the second texture cache.
8. The graphics processing unit of claim 5, wherein the first texture cache stores block sizes according to an Adaptive Scalable Texture Compression (ASTC) codec.
9. The graphics processing unit of claim 1, further comprising at least one controller to control a first read pointer of the first buffer to select an individual entry within the first buffer and to control a second read pointer of the second buffer to select an individual entry within the second buffer.
10. A method of operating a graphics processing unit, comprising:
receiving a first request for texel data for a first group of pixels;
retrieving the requested compressed texel data from a first texture cache;
buffering the obtained compressed texel data in a first buffer;
providing an output of the first buffer to a texture decompressor and decompressing one or more blocks of the compressed texel data;
buffering the decompressed texel data in a second buffer;
providing an output of the second buffer to a second texture cache; and
reusing texel data in the first buffer fetched for the first request to at least partially service a second request for texel data for a second group of pixels without a second fetch from the first texture cache;
wherein the block of compressed texel data in the first texture cache is sized such that a cache line in the first texture cache returns one or more blocks of compressed texel data, and wherein the one or more blocks of compressed texel data stored by the first buffer include second texel data in addition to the requested texel data.
11. The method of claim 10, wherein the first buffer is a first FIFO buffer and the second buffer is a second FIFO buffer.
12. The method of claim 10, wherein:
the one or more blocks of decompressed texel data stored by the second buffer include third texel data in addition to the requested texel data.
13. The method of claim 10, wherein the read pointer to the first buffer is selected to reuse texel data in the first buffer to service more than one request for texel data.
14. The method of claim 10, wherein the read pointer to the second buffer is selected to reuse texel data in the second buffer to service more than one request for texel data.
15. The method of claim 10, wherein the first texture cache is configured to store a non-power-of-2 block size.
16. The method of claim 15, wherein the second texture cache is configured to store power of 2 block sizes.
17. The method of claim 16, further comprising merging decompressed texture data received from the second buffer prior to storing in the second texture cache.
18. The method of claim 17, wherein texel data from a plurality of non-power-of-2 blocks is merged.
CN201710128572.1A 2016-03-04 2017-03-06 Cache architecture for efficient access to texture data using buffers Active CN107153617B (en)

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
US201662303889P 2016-03-04 2016-03-04
US62/303,889 2016-03-04
US15/420,459 2017-01-31
US15/420,459 US10055810B2 (en) 2016-03-04 2017-01-31 Cache architecture for efficiently accessing texture data using buffers
KR10-2017-0023192 2017-02-21
KR1020170023192A KR20170103649A (en) 2016-03-04 2017-02-21 Method and apparatus for accessing texture data using buffers

Publications (2)

Publication Number Publication Date
CN107153617A CN107153617A (en) 2017-09-12
CN107153617B true CN107153617B (en) 2023-04-07

Family

ID=59791543

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710128572.1A Active CN107153617B (en) 2016-03-04 2017-03-06 Cache architecture for efficient access to texture data using buffers

Country Status (1)

Country Link
CN (1) CN107153617B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109614086B (en) * 2018-11-14 2022-04-05 西安翔腾微电子科技有限公司 GPU texture buffer area data storage hardware and storage device based on SystemC and TLM models
CN110992240A (en) * 2019-11-18 2020-04-10 中国航空工业集团公司西安航空计算技术研究所 Programmable texture processor system
CN115794666B (en) * 2023-01-31 2023-05-05 摩尔线程智能科技(北京)有限责任公司 Method, device, equipment and storage medium for determining memory address of texel
CN117217977A (en) * 2023-05-26 2023-12-12 摩尔线程智能科技(北京)有限责任公司 GPU data access processing method, device and storage medium
CN116467227B (en) * 2023-06-19 2023-08-25 深流微智能科技(深圳)有限公司 TMU system and operation optimization method thereof
CN116862749B (en) * 2023-06-20 2023-11-21 北京麟卓信息科技有限公司 Compressed texture rendering optimization method based on adaptive decoding cache

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1128381A (en) * 1994-04-08 1996-08-07 索尼株式会社 Image generation apparatus
CN1961295A (en) * 2004-07-14 2007-05-09 奥普提克斯晶硅有限公司 Cache memory management system and method
CN101236661A (en) * 2007-11-20 2008-08-06 威盛电子股份有限公司 System and method for managing grain data in computer
CN105074779A (en) * 2013-01-15 2015-11-18 微软技术许可有限责任公司 Engine for streaming virtual textures
CN105374005A (en) * 2014-08-11 2016-03-02 Arm有限公司 Data processing systems

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020135585A1 (en) * 2000-02-01 2002-09-26 Dye Thomas A. Video controller system with screen caching

Also Published As

Publication number Publication date
CN107153617A (en) 2017-09-12

Similar Documents

Publication Publication Date Title
CN107154012B (en) Graphics processor and method of operating the same
CN107153617B (en) Cache architecture for efficient access to texture data using buffers
KR101386767B1 (en) Apparatus and method for displaying a warped version of a source image
US8681168B2 (en) Methods of and apparatus for processing graphics
US6002412A (en) Increased performance of graphics memory using page sorting fifos
US8760460B1 (en) Hardware-managed virtual buffers using a shared memory for load distribution
US20140152652A1 (en) Order-preserving distributed rasterizer
EP2854037B1 (en) Cache memory system and operating method for the same
CN103380417B (en) The method and system of the data for being stored from memory requests
US7483035B2 (en) Texture cache control using a data dependent slot selection scheme
US10585803B2 (en) Systems and methods for addressing a cache with split-indexes
US20040222997A1 (en) Method of distributed caching
US7401177B2 (en) Data storage device, data storage control apparatus, data storage control method, and data storage control program
CN105550979A (en) High-data-throughput texture cache hierarchy structure
US10019349B2 (en) Cache memory and method of managing the same
CN103077130B (en) Information processing method and device
CN116010299B (en) Data processing method, device, equipment and readable storage medium
JP2008507028A (en) System and method for managing cache memory
US11379944B2 (en) Techniques for performing accelerated point sampling in a texture processing pipeline
CN115456861A (en) Method and apparatus for storing data in memory of graphics processing system
CN101729903A (en) Method, system and multimedia processor for reading reference frame data
JP6801001B2 (en) Image processing equipment, image processing methods and programs
JP2018060471A (en) Image processing apparatus and control method
CN116563444A (en) Method and device for accelerating texture mapping hardware
JP2011170594A (en) Memory access system and memory access control method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant