GB2516682A - Hierarchical memory for mip map - Google Patents

Hierarchical memory for mip map

Info

Publication number
GB2516682A
Authority
GB
United Kingdom
Prior art keywords
data
address
memory block
memory
memory device
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB1313570.2A
Other versions
GB201313570D0 (en)
Inventor
Matthew Charles Porth
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Samsung Electronics Co Ltd
Original Assignee
Samsung Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Samsung Electronics Co Ltd filed Critical Samsung Electronics Co Ltd
Priority to GB1313570.2A
Publication of GB201313570D0
Publication of GB2516682A
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00 3D [Three Dimensional] image rendering
    • G06T15/04 Texture mapping
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/60 Memory management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0893 Caches characterised by their organisation or structure
    • G06F12/0897 Caches characterised by their organisation or structure with two or more cache hierarchy levels

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Graphics (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Input (AREA)

Abstract

A hierarchical memory device 100 comprising an Embedded Logic Unit and a plurality of memory blocks L0, L1, L2, wherein, when writing data into the memory device: the plurality of memory blocks L0, L1, L2 are arranged to be accessible in parallel; a first memory block is arranged to be written with first data at a first address of the first memory block; a second memory block is arranged to be written with second data at a second address of the second memory block (see figure 5A); the Embedded Logic Unit may comprise an arithmetic logic unit or address generator arranged to generate the second address of the second memory block from the first address of the first memory block, involving for example a binary division by two (a bit-wise shift right) and rounding; and to generate the second data by down-sampling or compressing the first data using dedicated data filters 150; a second memory block access operation comprises at least one of the generation of the second address, generation of the second data or writing of the second data to the second memory block; when an instruction for the writing of the first data at the first address is communicated to the hierarchical memory device, this triggers the second memory block access operation; and the second memory block access operation may be performed in parallel with a further access operation to the first memory block. The hierarchical memory device may be used in a 3D texel rendering graphics processing unit for storage of sampled pixel or texel data using a mipmapping process. A method of mipmap generation is also disclosed.

Description

HIERARCHICAL MEMORY FOR MIP MAP
The present invention relates to hierarchical memory devices, to methods for mipmap generation, to methods for manufacturing hierarchical memory devices, and to computer readable media.
Embodiments of the invention find particular, but not exclusive, use when an application implements a Mipmapping technique, wherein an exemplary implementation of the Mipmapping technique might be used to render a 3-dimensional (3D) graphics image and/or texture map a 2-dimensional (2D) surface to a 3D object.
Mipmapping is a useful optimisation technique which can be used to improve texture sampling quality in 3D graphics.
When an object is rotated in 3D space, the ratio of a screen portion to the texture pixels to be displayed thereon changes, and if this ratio falls below 0.5, bilinear texture filtering cannot use data for a set of pixels at the same pixel resolution. This means that, as the bilinear texture filtering takes place, there will be "gaps" in the source data relating to the present set of pixels. So, different data for the set of pixels at a different pixel resolution, for example half the pixel resolution, needs to be computed by reading in and averaging the data from the set of pixels at the first pixel resolution. This requires an unpredictable amount of processing time and storage capacity, since both depend on the size of the set of pixels requiring these computations. This can lead to a number of undesirable visual artefacts such as "glittering" or "flickering" of pixels as an object moves around, or an image "graininess".
Mipmapping is used to perform these computations and store the results so that, when bilinear texture filtering takes place, both sets of data with different pixel resolutions are available for the filtering. A typical example of such Mipmapping would be computing a down-sampled image with a quarter of the pixel resolution (half the number of pixels along a first axis X and half the number of pixels along a second axis Y), storing the computed down-sampled image, and iterating this process until the width (X) or height (Y) reaches a value lower than 1. This means the down-sampled images are available for bilinear texture filtering and therefore the processing time required for the bilinear texture filtering can be estimated with an upper bound, which can be useful. Further, storage of such down-sampled images only takes up around a further third of the memory storage capacity required for storing the initial image before the down-sampling process.
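As a rough illustration of this iteration and the storage overhead (a minimal sketch, not taken from the patent; the 256 x 256 base image and all names are illustrative), the following enumerates the levels and sums the extra texels, which tend towards one third of the base image:

    #include <stdio.h>

    /* Sketch: enumerate mipmap levels for a 256 x 256 base image, halving
     * each axis per level until the width or height reaches 1. */
    int main(void)
    {
        unsigned w = 256, h = 256;              /* level 0 (L0) dimensions */
        unsigned long base = (unsigned long)w * h;
        unsigned long extra = 0;                /* texels in levels 1..x   */

        for (int level = 0; ; ++level) {
            printf("L%d: %u x %u\n", level, w, h);
            if (level > 0)
                extra += (unsigned long)w * h;
            if (w == 1 || h == 1)
                break;                          /* final level reached */
            w /= 2;
            h /= 2;
        }
        /* prints 21845 of 65536 texels, i.e. roughly a further third */
        printf("extra: %lu of %lu texels\n", extra, base);
        return 0;
    }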
However, a typical computer system, or architecture, of a Central Processing Unit (CPU) or ALU connected to an array of orthogonal memory blocks is not optimised for implementing mipmapping, which comprises computation of these down-sampled images and/or storage of the resultant data thereof. The number of clock cycles required for accessing (reading or writing) the memory blocks to perform these computations adds a large latency to the overall rendering time of a 3D graphics image, since each access operation to a memory block can span a number of clock cycles. This is because a single access operation to any one of the memory blocks requires one clock cycle and most of the computations for mipmapping also have to be performed in series in such a typical computer system/architecture.
It is an aim of embodiments of the present invention to provide for a method, an apparatus or a system for implementing mipmapping wherein a latency added by the mipmapping process to the overall rendering time of an image is reduced.
According to the present invention there is provided a method and apparatus as set forth in the appended claims. Other features of the invention will be apparent from the dependent claims, and the description which follows.
For a better understanding of the invention, and to show how embodiments of the same may be carried into effect, reference will now be made, by way of example, to the accompanying diagrammatic drawings in which: Figure 1 shows a hierarchical memory device layout according to an embodiment of the present invention; Figure 2 shows a diagram representing a write operation being performed on a hierarchical memory device according to an embodiment of the present invention; Figure 3 shows a filter function for performing a re-sampling operation between different levels of hierarchy in the hierarchical memory device shown in Figures 1 and 2; Figure 4 shows a diagram representing a read/fetch operation being performed on a hierarchical memory device according to an embodiment of the present invention; Figure 5A shows an address mapping scheme for implementing the hierarchical memory device shown in Figures 1-4; Figure 5B shows an address space requirement for an address mapping scheme for implementing the hierarchical memory device shown in Figures 1-4; Figure 6 shows a video card or a computer system comprising the hierarchical memory device shown in Figures 1-4 according to an embodiment of the present invention; Figure 7 shows a diagram representing a method of pipelining a mipmap generation process according to an embodiment of the present invention; Figure 8 shows a diagram representing a method of manufacturing the hierarchical memory device shown in Figures 1-4 according to an embodiment of the present invention; and Figure 9 shows an illustrative environment according to an embodiment of the present invention.
Referring to Figure 1, a hierarchical memory device 100 layout is shown, comprising a command bus 190, a data bus 180, memory blocks/levels (L0, L1, L2, ... Lx), and filters 150 between each pair of adjacent memory blocks/levels.
The command bus 190 is used to communicate a command and/or an instruction for an access operation to be performed at each memory block. The command and/or instruction comprises data on a type of operation to be performed on the relevant memory block and/or an address for identifying a portion of the memory block to be accessed to perform the operation thereof.
According to an embodiment of the present invention, the type of operation comprises a read/fetch, a write or an interrupt operation. It is understood that various other types of operation may also be used depending on the actual implementation of the present invention.
According to an embodiment of the present invention, the address for identifying a portion of the memory block to be accessed comprises a block address for identifying the memory block, for example L = L0, L1, L2 or L3, and a portion address for identifying a portion of the identified memory block wherein the portion is uniquely identifiable using two domains, for example X and Y. It is understood that depending on the addressing scheme implemented in the hierarchical memory device, other types of addresses, for example another type of identifier with an index table for mapping the identifier to a portion of the memory block, may also be used. An exemplary address mapping scheme is shown in Figure 5A wherein the addresses for the portions of the memory block are mapped so that they are addressable using a contiguous block of physical memory addresses.
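By way of illustration only (the type and field names below are hypothetical, not taken from the patent), such a block-plus-portion address might be modelled as:

    /* Sketch of the address described above: a block (level) identifier
     * plus a portion identifier in two domains, X and Y. */
    struct block_address {
        unsigned level; /* memory block / mipmap level: 0, 1, 2, ... x */
        unsigned x;     /* portion coordinate in the first domain  (X) */
        unsigned y;     /* portion coordinate in the second domain (Y) */
    };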
The data bus 180 is used to communicate data relevant to the operation communicated via the command bus 190 in the same clock cycle, so that the communicated data is used in the operation specified by the command and/or instruction at the portion of the memory block identified by the command and/or instruction.
The filter 150 performs a filter function 155 comprising a re-sampling operation on data from a portion of a memory block. An exemplary filter function 155 is shown in Figure 3.
According to an embodiment of the present invention, the re-sample operation down-samples first data from a lower level, say L0, so that second data comprising the down-sampled first data can be stored at a higher level, say L1, wherein the second data takes up less storage capacity than the first data, i.e. the first data is down-sampled to the second data.
Since the embodiment concerned relates to a hierarchical memory device 100 for mipmapping, according to this embodiment the first data is for an image with a higher pixel/texture resolution than the second data.
An exemplary embodiment might be a hierarchical memory device 100 wherein: the first data comprises image data stored in the L0 memory block defined by X = A number of pixels and Y = B number of pixels; and the second data comprises image data stored in the L1 memory block defined by X = A/2 number of pixels and Y = B/2 number of pixels.
According to this exemplary embodiment, the hierarchical memory device 100 comprises a third memory block L2 defined by X = A/4 number of pixels and Y = B/4 number of pixels, wherein third data comprises image data with a pixel resolution that is a quarter of the pixel resolution of the second data. Depending on the mipmapping technique used, the hierarchical memory device 100 may comprise a plurality of memory blocks up to a final level (level x) wherein the memory block at the final level (Lx) has at least one of the X or Y domains equal to or less than one pixel, for example X = A/(2^x) and Y = B/(2^x) number of pixels for Lx.
It is understood that a number of variations of the mipmapping technique could also be implemented as long as the hierarchical memory device 100 is capable of providing a plurality of hierarchically addressable memory blocks, wherein all the memory blocks are accessible in a parallel manner, i.e. an operation can be performed on each memory block independently within the same single clock cycle, and a plurality of filters 150 are provided between each pair of memory blocks, whereby any data written to a memory block can be down-sampled by the plurality of filters 150 and stored in a memory block at an adjacent level in the hierarchy. The hierarchical memory device 100 may also be arranged so that the down-sampling operations by the plurality of filters 150 can be performed in a parallel manner.
According to an embodiment of the present invention, the hierarchical memory device comprises an embedded logic unit, such as an Arithmetic Logic Unit (ALU) or an Address Generation Unit (AGU), and a plurality of memory blocks L0, L1, L2, ... Lx, arranged to be accessible in parallel so that a read (fetch), a write or any other access operation on each memory block can be performed in parallel and/or within a single clock cycle. The ALU or AGU is arranged to map a first address for identifying a portion of a first memory block to a second address for identifying a portion of a second memory block wherein the first memory block and the second memory block are at different levels in the memory device hierarchy.
The incorporation of the embedded logic unit enables an external CPU to communicate, via the command bus 190, with both the first memory block and the second memory block using a command or an instruction comprising only the first address. That is, the communicated command or instruction also triggers an access operation to the second address of the second memory block as well as to the first address in accordance with the issued command or instruction. This provides for the hierarchical memory device 100 to pipeline the implementation of the mipmapping technique at a computer architecture level whilst remaining compatible with conventional computer architecture which comprises the external CPU communicating commands in accordance with clock cycles.
For example, when first data for an image with X = A and Y = B pixels is written to memory block L0 at the first address, the same command communicated for performing this write operation to L0 will also trigger the embedded logic unit to map the first address in the command to the second address and prompt a filter 150 to start a down-sampling operation on the first data. The down-sampling operation then outputs second data for an image with X = A/2 and Y = B/2 pixels and writes the second data to memory block L1 at the mapped second address. According to the present embodiment, the mapping from the first address to the second address comprises shifting of a bit in address data, which is equivalent to a division by 2 and rounding down to zero.
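A minimal sketch of this mapping, reusing the illustrative struct block_address above (the function name is hypothetical):

    /* Map a level-z address to the corresponding level-(z+1) address.
     * Shifting each coordinate right by one bit is equivalent to
     * dividing by 2 and rounding down to zero, as described above. */
    struct block_address map_to_next_level(struct block_address a)
    {
        struct block_address b;
        b.level = a.level + 1;
        b.x = a.x >> 1;
        b.y = a.y >> 1;
        return b;
    }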
According to an embodiment of the present invention, a write command or a write instruction comprising the first address (henceforth referred to as a first write command) from the external CPU is communicated on the command bus 190 and the first data is communicated on the data bus 180. The first memory block receives the first write command from the command bus 190 and the first data from the data bus 180 and writes the first data at the first address. When or as the first data is written to the first memory block, the embedded logic maps the first address in the first write command to the second address and the filter 150 down-samples the first data to generate the second data. Then the generated second data from the filter 150 is written to the second memory block at the second address.
When or as the second data is written to the second memory block, the embedded logic maps the second address to a third address and the filter 150 down-samples the second data to generate third data, which is then written to the third memory block at the third address.
This process of mapping the previous address to an address at a subsequent memory block and generating data to be written thereto is repeated until final data is written to the final level of the memory block hierarchy (Lx).
According to an embodiment, the first write command comprises a first commanding address based on a contiguous block of physical memory addresses from the exemplary address mapping scheme shown in Figure 5A. The embedded logic then maps the first commanding address to the first address of the first memory block, for use by the hierarchical memory device 100, before the write operation on the first memory block can be performed using the first address.
According to an embodiment, when the first write command from the external CPU is communicated on the command bus 190 and the first data is communicated on the data bus 180, the embedded logic maps and/or down-samples the first address or the first commanding address in the first write command to generate all the relevant addresses at all the subsequent memory blocks (e.g. the second, third etc. addresses). The filter 150 down-samples the first data to generate the second data and uses the generated second address to store the second data at the second address of the second memory block and this process is repeated until the final data is written to the final level of the memory block hierarchy (Lx).
It is understood that the hierarchical memory device 100 of the present invention could be implemented using other architectures. For example, the memory addressing/mapping scheme implemented may differ from the exemplary embodiment depending on the external CPU, cache or memory architectures with which the device communicates.
It is also understood that the access operation to the second address and/or the down-sampling operation to generate the second data may only be triggered after a predetermined quantity of first data has been written in the first memory block. Where the quantity of data is in units of data for displaying a single pixel on a 2D image at a particular coordinate (X, Y), the predetermined quantity of first data may be data for a block of pixels comprising four pixels adjacent to each other, for example a block of pixels at (X, Y), (X+1, Y), (X, Y+1), and (X+1, Y+1). Other predetermined quantities and units of data can be implemented for triggering the access operation to the second address depending on the type of data, the 2D image/surface to be used for the mipmapping and other factors relating to the displaying of the final 2D and/or 3D graphic image/surface which uses the generated mipmaps.
For example, when the predetermined quantity comprises four bytes of the first data, the down-sampling operation comprises an averaging process wherein a block or a group of four 8 bit values, i.e. four 1 byte values, written in the first memory block is averaged. The averaged value is then written to the second memory block at the second address, which is mapped to the four first addresses wherein the block of four 8 bit values is stored.
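As a sketch of this trigger-and-average step (assuming one byte per pixel, row-major storage, and an even (x, y) block corner; all names are illustrative, not from the patent):

    #include <stdint.h>

    /* Average the 2 x 2 block of 8-bit values at (x, y), (x+1, y),
     * (x, y+1), (x+1, y+1) in level z once all four have been written,
     * and write the result to (x/2, y/2) in level z+1. */
    void on_block_complete(uint8_t *lz, uint8_t *lz1,
                           unsigned pitch_z, unsigned pitch_z1,
                           unsigned x, unsigned y)
    {
        unsigned sum = lz[y * pitch_z + x]
                     + lz[y * pitch_z + x + 1]
                     + lz[(y + 1) * pitch_z + x]
                     + lz[(y + 1) * pitch_z + x + 1];
        /* >>2 divides by four, rounding down to zero */
        lz1[(y / 2) * pitch_z1 + (x / 2)] = (uint8_t)(sum >> 2);
    }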
The hierarchical architecture of the memory device 100 with the embedded logic enables pipelined mipmap generation and writing thereof to all the memory blocks in the hierarchy during the same clock cycles. So when the down-sampling operation and storage of the second data take place at the filter 150 and L1, L0 is available for other access operations such as further write and/or read/fetch operations. This enables parallel processing of the mipmapping to take place at the memory device architecture level, whereby the number of overall clock cycles and/or commands from the external CPU required to complete a set of computations for mipmapping and/or rendering of a 3D graphics image is significantly reduced.
Since the first and second addresses are still referenced by a physical address of the relevant memory block, the first and the second data are also cache-able and cache-coherence can still be implemented as it would be in conventional computer systems/architectures. For example, the cache coherence mechanism can continue to operate using the contiguous block of physical memory addresses shown in Figure 5A. When the hierarchical memory device 100 is performing write operations to subsequent levels, e.g. L1, L2 etc., the hierarchical memory device 100 can communicate a "cache update" value to any side-bus or system bus used for cache coherency so that cached addresses and/or data stored in other components of a system comprising the hierarchical memory device 100 can be updated accordingly. The update of the address and/or data comprises an invalidation of, or re-request for, data. Therefore, the cache coherency mechanism implemented with the computer system/architecture comprising the hierarchical memory device 100 of the present invention will continue to work as it would generally do with conventional cache/memory devices.
Referring to Figure 2, a diagram representing a write operation being performed on a hierarchical memory device 100 comprising a data bus 180, an address bus 195, a L0 flag 210, a L1 flag 211, a L2 flag 212, a plurality of memory blocks L0-L2, and filters 150 is shown.
The address bus 195 represents a portion of the command bus 190 shown in Figure 1 whereon address data is communicated. The data bus 180 is used to communicate any relevant data for the communicated address data. The L0-L2 flags 210-212 may be stored in registers accessible by the CPU.
During the write operation, only first data for L0 is written directly by an external CPU.
The L0 flag 210 enables the CPU to access L0. The write operation also triggers the filters (Re-sampler) 150 to start a first down-sampling operation on the first data to compute second data for an image with a quarter the resolution. The first address (L = 0, X = A and Y = B) for a portion of L0 whereto the first data is to be written is also down-sampled to a second address (L = 1, X = A/2 and Y = B/2) so that the computed second data can be written to L1 at the second address. In the meantime, L0 is available for a further access operation by the external CPU while the first down-sampling operation and the write operation of the second data to L1 are being performed.
The second data write operation to L1 triggers the filters (Re-sampler) 150 to start a second down-sampling operation on the second data to compute third data for an image with a quarter the resolution of the second data. The second address (L = 1, X = A/2 and Y = B/2) for a portion of L1 whereto the second data is to be written is also down-sampled to a third address (L = 2, X = A/4 and Y = B/4) so that the computed third data can be written to L2 at the third address. In the meantime, L0 and L1 are available for further access operations while the second down-sampling operation and the write operation of the third data to L2 are being performed.
Therefore, the hierarchical memory device 100 is pipelined to compute operations required for implementing the mipmapping technique wherein down-sampled bitmaps are computed in parallel with any access operations being performed on L0 by the external CPU.
According to an embodiment of the present invention, the addresses of a block of data comprising data for four pixels, wherein the block is required for generating data to be stored at an address of the subsequent memory block, differ only by their bottom bit. Assume data for each pixel in level z+1 of the mipmap hierarchy is computed by averaging a block of data for four pixels in level z. That is:

L(z+1, A, B) = Average( L(z, Ax2+0, Bx2+0), L(z, Ax2+0, Bx2+1), L(z, Ax2+1, Bx2+0), L(z, Ax2+1, Bx2+1) )

where L(z, X, Y) represents data stored at location (X, Y) in the memory block of level z in the mipmap hierarchy.
As A and B (and indeed z) are integer values, Ax2 and Bx2 will always be even numbers, so that the lowest bit of their binary representations is always 0. Therefore, rather than computing "+1" on "Ax2" or "Bx2" to generate the addresses L(z, Ax2+0, Bx2+1), L(z, Ax2+1, Bx2+0) and L(z, Ax2+1, Bx2+1), the lowest bit of "Ax2" or "Bx2" can simply be set to 1, since there will be no binary addition carry chain to propagate. This also means that when generating the second address L(z+1, A, B) from the first addresses L(z, Ax2+0, Bx2+0), L(z, Ax2+0, Bx2+1), L(z, Ax2+1, Bx2+0), and L(z, Ax2+1, Bx2+1), A and B can be computed by simply shifting the binary representations of Ax2 and Bx2 by one bit to arrive at the binary representations of A and B. The last bit of the binary representations of Ax2 and Bx2, which has been removed by the shifting operation, can be ignored since all possible permutations of the remaining bits present in the binary representations of Ax2 and Bx2 down-sample to the same second address comprising A and B.
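A brief sketch of this address manipulation (illustrative names only): the three sibling addresses are formed by setting the low bit rather than by addition, and the level-(z+1) coordinates are recovered by a one-bit right shift:

    /* Given level-(z+1) coordinates (a, b), the four level-z contributors
     * differ only in the bottom bit of each coordinate, so the "+1"
     * addresses can be formed by setting that bit: no carry chain needed. */
    void contributor_addresses(unsigned a, unsigned b,
                               unsigned x[4], unsigned y[4])
    {
        unsigned x0 = a << 1, y0 = b << 1;  /* bottom bits always 0 */
        x[0] = x0;     y[0] = y0;
        x[1] = x0;     y[1] = y0 | 1;       /* OR, not addition */
        x[2] = x0 | 1; y[2] = y0;
        x[3] = x0 | 1; y[3] = y0 | 1;
        /* conversely, x[i] >> 1 == a and y[i] >> 1 == b for all i */
    }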
Figure 3 shows a filter function 155 for performing a re-sampling operation between different levels of hierarchy in the hierarchical memory device 100 shown in Figures 1 and 2.
In a conventional colour data encoding scheme, a pixel is encoded as Red Green Blue Alpha (RGBA) colour space information, wherein the intensity of each colour channel is defined by 8 bits and the channels are arranged in a memory block so that a single 32-bit unsigned integer comprises an Alpha sample in the highest 8 bits, followed by a Red sample, a Green sample and a Blue sample in the lowest 8 bits. This 32 bit unsigned integer format is henceforth referred to as the Alpha Red Green Blue (ARGB) format.
According to an embodiment of the present invention, data corresponding to four pixels from a lower level memory block, say L0, at the first address is read as a 2 x 2 block and the read values are averaged and written to a higher level memory block, say L1, at the second address. This process is repeated for 4 x 8 bit channels when colour data in the conventional 32 bit ARGB format is used. It is understood that data formats with other channel sizes, such as 16 or 32 bit, or other colour data formats can also be used, but this will require an extra address bit or two for indicating the data format used.
In Figure 3, O represents an output which is the averaged value to be written to the subsequent memory block. For each of A, R, G, and B, 8 bit data from four first addresses, X00, X01, X10, and X11, are added, with the numerical values shown on the arrows representing the bus width of the relevant buses. Then, when all four 8 bit data are added, the sum is divided by 4, represented by >>2 in Figure 3. According to an embodiment, this division is performed by simply shifting the 10 bit binary representation of the sum of the four 8 bit data by two, ignoring the last two bits in the 10 bit binary representation. This removal or ignoring of the last two bits means the shifting according to this embodiment is equivalent to dividing by 4 and rounding down to zero.
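A sketch of such a per-channel filter for the 32-bit ARGB format described above (a software model with illustrative names; the device would process the four channels in parallel):

    #include <stdint.h>

    /* Average four ARGB pixels channel-by-channel: each 8-bit channel is
     * summed into a 10-bit value and shifted right by two (>>2), i.e.
     * divided by four and rounded down to zero. */
    uint32_t argb_average(uint32_t x00, uint32_t x01,
                          uint32_t x10, uint32_t x11)
    {
        uint32_t out = 0;
        for (int shift = 0; shift < 32; shift += 8) {   /* B, G, R, A */
            uint32_t sum = ((x00 >> shift) & 0xFF)
                         + ((x01 >> shift) & 0xFF)
                         + ((x10 >> shift) & 0xFF)
                         + ((x11 >> shift) & 0xFF);     /* 10-bit sum */
            out |= ((sum >> 2) & 0xFF) << shift;
        }
        return out;
    }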
It is understood that other ways of computing an average and/or a convolution thereof can also be used depending on the actual mipmapping technique used.
Referring to Figure 4, a diagram representing a read/fetch operation being performed on a hierarchical memory device 100 according to an embodiment of the present invention is shown, wherein the hierarchical memory device 100 comprises a data bus 180, an address bus 195, a L0 flag 210, a L1 flag 211, a L2 flag 212, a L3 flag 213, a plurality of memory blocks L0-L3, filters 150, an Address Generation Unit (AGU) 490, and an embedded logic 450.
When a read command is communicated from an external CPU and/or GPU or device, address data is communicated via the address bus 195. According to an embodiment of the present invention, the address data comprises flag data and address information, wherein the flag data sets the flags 210-213 if an access operation to the corresponding memory block is required and the address information identifies a portion of the corresponding memory block to be accessed.
Alternatively, the address data comprises address information which can be computed by the AGU 490 or any embedded logic 450 to set the appropriate flags 210-213 if an access operation to the corresponding memory block is required whilst also identifying a portion of the memory block to be accessed. An exemplary embodiment of such computation of the address information may be represented by the Boolean logic shown in relation to the address mapping scheme described in Figure 5A.
According to an embodiment, the embedded logic 450 is an "AND" or "&" gate which ANDs the data output from each of the memory blocks L0-L3 with the corresponding flag 210-213 so that only when the corresponding flag is set to non-zero (or any value equivalent to "True") is the data output placed on the data bus 180, whereby the external CPU and/or GPU or device obtains the outputted data corresponding to the initially communicated read command.
The filters 150 perform down-sampling operations on data written to the memory blocks in parallel with any commands communicated by the external CPU until all the data written to L0, which is available for down-sampling operation, is down-sampled and written to subsequent memory blocks in the mipmap and/or memory hierarchy.
Figure 5A shows an address mapping scheme for implementing the hierarchical memory device 100 shown in Figures 1-4, wherein a standard linear memory address space is mapped to the memory block hierarchy of the hierarchical memory device 100.
The address mapping scheme establishes a one-to-one relationship between a coordinate address (z, X, Y) of the hierarchical memory device 100 and the standard linear memory address space defined by a contiguous block of physical memory addresses. The address mapping scheme may be used by the embedded logic 450, the AGU 490 or even an external CPU and/or GPU to communicate a relevant command to, or access, the relevant portion of the hierarchical memory device 100.
The address mapping scheme of Figure 5A requires only a small range of standard linear memory addresses to establish a one-to-one relationship with the coordinate addresses (z, X, Y) of the hierarchical memory device 100. This reduces the address range required for identifying all the portions of all of the memory blocks of the hierarchical memory device 100.
The address mapping scheme of Figure 5A is an exemplary embodiment of the present invention for writing 8 x 8 pixels (also known as texels when the data relates to texture) of data to the highest level memory block L0. So, there are 8 x 8 portions of L0 that require individual identification when an access operation is performed on L0. As can be seen from Figure 5A, these 8 x 8 portions of L0 can be individually identified using log2(8) + log2(8) = 3 + 3 = 6 bits of address space (columns 5, 4, 3, 2, 1, and 0 shown in Figure 5A).
The subsequent mipmap level L1 will store 8/2 x 8/2 pixels of data so will require log2(8/2) + log2(8/2) = 2 + 2 = 4 bits of address space (columns 3, 2, 1, and 0 shown in Figure 5A) to individually identify each portion of L1. The similarly calculated bits of address space required for individually identifying each portion of the subsequent mipmap levels L2 and L3 are 2 bits (columns 1 and 0) and 1 bit (column 0) respectively.
Since each mipmap level needs to be individually identified, a further log2(log2(8)+1) = 2 bits is required for identifying each mipmap level from among the log2(8) + 1 = 4 levels of mipmaps.
Therefore, according to an embodiment, for a completely individual identification of each portion of each memory block of the hierarchical memory device 100, which stores 8 x 8 pixels of data at its highest level, log2(8) + log2(8) + log2(log2(8)+1) bits = 8 bits of address space is required, which will use all the bits available from columns 7, 6, 5, 4, 3, 2, 1, and 0 shown in Figure 5A.
The address mapping scheme of Figure 5A, however, requires less address space than 8 bits since each portion of each memory block level can be individually identified using only 7 bits of address space (columns 6, 5, 4, 3, 2, 1, and 0 in Figure 5A). This is achieved by using a Compressed Memory Addressing technique, which can be used in conjunction with Boolean logic expressions to identify each memory block level. The Compressed Memory Addressing embodiment shown in Figure 5A only requires the 7 bits of address space, which when used with appropriate Boolean logic expressions (described below), can be used to set a flag for identifying which memory block will be operated on or accessed.
According to an embodiment of the present invention, when only a single data bus 180 is provided between the external CPU and/or GPU or devices and the hierarchical memory device 100, and when the address mapping scheme of Figure 5A is used, the plurality of flags, a L0 flag (L[0].enable) 210, a L1 flag (L[1].enable) 211, a L2 flag (L[2].enable) 212, and a L3 flag (L[3].enable) 213, can be set according to the following Boolean logic:

L[0].enable = NOT(A[6]);
L[1].enable = AND(A[6], NOT(A[5]), NOT(A[4]));
L[2].enable = AND(A[6], NOT(A[5]), A[4], NOT(A[3]), NOT(A[2])); and
L[3].enable = AND(A[6], NOT(A[5]), A[4], NOT(A[3]), A[2], NOT(A[1]), NOT(A[0]))

This enables at least one of the plurality of flags to be set using the standard linear memory address value so that data from only one memory block can be read via the data bus 180 at any given point in time (e.g. within a single clock cycle) when the external CPU and/or GPU or device communicates a read command on the command bus 190. Further, this also enables an individual identification of each memory block of the hierarchical memory device using only a 7 bit address space, as illustrated by columns 6, 5, 4, 3, 2, 1, and 0 in Figure 5A.
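The same decode can be expressed as a sketch in C (bit numbering follows the columns of Figure 5A; the function and variable names are illustrative only):

    #include <stdbool.h>

    /* Decode the level-enable flags from a 7-bit compressed address A,
     * following the Boolean expressions above. A[n] is bit n of a. */
    void decode_flags(unsigned a, bool enable[4])
    {
        bool a6 = a & 0x40, a5 = a & 0x20, a4 = a & 0x10;
        bool a3 = a & 0x08, a2 = a & 0x04, a1 = a & 0x02, a0 = a & 0x01;

        enable[0] = !a6;                                   /* L0 */
        enable[1] =  a6 && !a5 && !a4;                     /* L1 */
        enable[2] =  a6 && !a5 &&  a4 && !a3 && !a2;       /* L2 */
        enable[3] =  a6 && !a5 &&  a4 && !a3 &&  a2
                         && !a1 && !a0;                    /* L3 */
    }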
By utilising such Boolean logic expressions, it is possible to distinguish between the highest level memory block (L0) and the rest of the subsequent memory blocks (L1-L3) using a first single bit (column 6 in Figure 5A), to distinguish between L1 and the rest of the subsequent memory blocks (L2 and L3) using the first single bit plus another second single bit (column 4 in Figure 5A), and to distinguish between L2 and L3 using the first single bit and the second single bit plus yet another third single bit (column 2 in Figure 5A).
This is made possible by recognising and exploiting the fact that at each subsequent level the range of address space required for individual identification of each portion of a memory block is reduced by 2 bits, and these 2 bits can be used to individually distinguish a memory block of the previous level from the memory block(s) of present and/or subsequent levels. Therefore, the address mapping scheme of Figure 5A exploits the fact that: L1-L3 do not require the first two highest bits of address space (columns 5 and 4 in Figure 5A) used for storing pixels of data in L0; L2-L3 do not require the second two highest bits of address space (columns 3 and 2 in Figure 5A) used for storing pixels of data in L1; and L3 does not require the bit of address space (column 1 in Figure 5A) used for storing pixels of data in L2, so that these bits of address space can be used to distinguish between each consecutive pair of memory block levels.
The highest bit (column 6 in Figure 5A) is added to the 6 bits (columns 5, 4, 3, 2, 1 and 0 in Figure 5A) of address space for storing pixels of data at the highest level (L0) so that the highest memory block (L0) can be individually identified from the rest of the subsequent memory blocks, whilst each of the rest of the subsequent memory blocks can then be individually identified using unused address space at each subsequent memory block level.
According to this embodiment, some of the address space remains unused since this enables a simpler set of Boolean logic expressions to be used for individually identifying each memory block level; for example, the set of Boolean logic expressions described earlier for setting a plurality of flags for enabling an access operation on each memory block is relatively simple. However, it is understood that according to an alternative embodiment even this remaining unused address space may be exploited to further reduce the range of address space required for individual identification of each portion of each memory block level of the hierarchical memory device 100.
Such a reduction in the range of address space required by an address mapping scheme is advantageous since, as the amount of pixel data to be written at the highest level (L0) increases, the range of address space required increases rapidly, which means computing each address for an access operation can become computationally expensive. Figure 5B shows the address space requirement for an address mapping scheme for implementing the hierarchical memory device shown in Figures 1-4, wherein different sizes of pixel data are written to the highest mipmap level (L0 dimension).
As shown in Figure 5B, the Compressed Memory Addressing consistently requires a smaller number of bits for individually identifying each portion of each memory block level of the hierarchical memory device than Uncompressed Memory Addressing.
For example, when 256 x 256 pixels (also known as texels when the data relates to texture) of data are written to the highest level memory block L0 (L0 dimension = 256), there are 256 x 256 portions of L0 that require individual identification when an access operation is performed on L0. This requires log2(256) + log2(256) = 8 + 8 bits of address space to individually identify each portion of L0.
The subsequent mipmap level L1 will store 256/2 x 256/2 pixels of data so will require log2(256/2) + log2(256/2) = 7 + 7 bits of address space to individually identify each portion of L1. Similarly calculated bits of address space will be required for individually identifying each portion of the subsequent mipmap levels L2-L7.
If the Uncompressed Memory Addressing is used, individually identifying each mipmap level requires a further log2(log2(256)) = 3 bits. However, if the Compressed Memory Addressing according to an embodiment of the present invention is used, only a further 1 bit is required to distinguish the highest mipmap level (L0) from the rest of the mipmap levels (L1-L7), and the subsequent mipmap levels can be individually identified using unused address space. Therefore, for a completely individual identification of each portion of each memory block of the hierarchical memory device 100, which stores 256 x 256 pixels of data at its highest level, only log2(256) + log2(256) + 1 bits = 17 bits of address space is required if the Compressed Memory Addressing is used, which is 2 bits less than with Uncompressed Memory Addressing.
According to an embodiment, as shown in Figure 5B, if the L0 dimension is 65536, there are 65536 x 65536 pixels of data to be written to L0 and log2(65536) = 16 mipmap levels.
If the Uncompressed Memory Addressing is used, this will require a further log2(log2(65536)) = log2(16) = 4 bits of address space for individually identifying each mipmap level so that the total range of address space required is log2(65536) + log2(65536) + log2(log2(65536)) = 36 bits.
If the Compressed Memory Addressing is used, only a single further bit is required to individually identify L0, so the total range of address space required is log2(65536) + log2(65536) + 1 = 33 bits. Therefore, there is a saving of 3 bits when the Compressed Memory Addressing is used, and this saving will increase as the number of pixels of data to be written to L0 increases.
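The bit counts quoted above can be reproduced with a short calculation (a sketch under the assumption that the L0 dimension is a power of two; log2n is a hypothetical helper, not from the patent):

    /* Compare address-space requirements for an L0 dimension of n = 2^k:
     * uncompressed needs 2k bits for the portion plus log2(k) bits for
     * the level; compressed needs 2k bits plus a single extra bit. */
    static unsigned log2n(unsigned long n)   /* n must be a power of two */
    {
        unsigned k = 0;
        while (n > 1) { n >>= 1; ++k; }
        return k;
    }

    unsigned uncompressed_bits(unsigned long dim)
    {
        unsigned k = log2n(dim);
        return 2 * k + log2n(k);    /* e.g. dim = 65536: 16+16+4 = 36 */
    }

    unsigned compressed_bits(unsigned long dim)
    {
        return 2 * log2n(dim) + 1;  /* e.g. dim = 65536: 16+16+1 = 33 */
    }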
According to an embodiment, a first commanding address is communicated via the address bus 195 of the command bus 190, wherein the first commanding address is in the standard linear memory address format so that the first commanding address sets at least one of the plurality of flags. The embedded logic or an AGU also maps the first commanding address to a first address (z, X, Y) according to the address mapping scheme so that the read operation can be performed at the first address.
It is understood that depending on the actual architecture of a system comprising the hierarchical memory device, a plurality of data, command or address buses may be provided between an external CPU and/or GPU or devices and the hierarchical memory device. Where such architecture is implemented, different Boolean logic is used to set at least one of the flags so that a parallel reading of data from more than one memory block at any given point in time (e.g. within a single clock cycle) can be performed. This enables the external CPU and/or GPU or devices to obtain data from more than one level in the mipmap and/or memory hierarchy simultaneously (within a single clock cycle).
According to an embodiment, the address mapping of a standard linear memory address space with a coordinate address (z, X, Y) of the hierarchical memory device 100 is performed by the external CPU and/or GPU or devices so that any command communicated therefrom comprises data relating to the coordinate address (z, X, Y) of the hierarchical memory device 100 rather than the linear memory address so that the hierarchical memory device 100 does not need to perform the mapping.
Referring to Figure 6, a video card or a computer system 600 comprising the hierarchical memory device 100 is shown. It is understood that either the video card or the computer system 600 comprising the hierarchical memory device 100 will have a similar configuration as that shown in Figure 6. A difference may be that in the video card 600 a Graphics Processing Unit (GPU) 610 is in communication with the hierarchical memory device 100 whilst in the computer system 600 a Central Processing Unit (CPU) 610 of the system is in communication with the hierarchical memory device. It is understood that a number of variations on this configuration, and combinations thereof, are possible according to embodiments of the present invention.
According to an embodiment of the present invention, the video card 600 comprises a Graphics Processing Unit (GPU) 610, an address bus 195, a data bus 180, and the hierarchical memory device 100. Although not shown, it is understood that the video card 600 will also comprise an input/output channel for communicating video and/or audio data.
According to an embodiment of the present invention, a single Integrated Circuit (IC) and/or chip construction comprising the hierarchical memory device 100 and the AGU 490 is provided. It is understood that any ALU may perform the functions of the AGU 490 and/or an ALU of an external CPU may perform certain functions of the AGU 490. It is understood that a number of variations on this configuration of the hierarchical memory device 100 are also possible.
When in use, the video card 600 receives video data, processes the video data and outputs the processed video data to an external device such as an external CPU, a cache or a memory so that the processed video data can be used by the external device. The external device may use the video data to display an image or to communicate the video data for further processing.
During an exemplary processing of the video data by the video card 600, the received video data is communicated to the GPU 610 with a first request to process the received video data using the mipmapping process. The GPU 610 then communicates to the hierarchical memory device 100 a first instruction to write the video data to L0. The first instruction comprises a first address, which is communicated via the address bus 195, and the video data, which is communicated via the data bus 180. The first instruction may also comprise an instruction to perform a mipmapping process. The first instruction also sets a L0 flag 210 so that the video data can be written to L0.
The writing of the video data to L0 then triggers the AGU 490 to generate a second address for L1 from the first address and a filter 150 to down-sample the video data written to L0. The down-sampled video data is then written to L1 and the process is repeated for L2.
Whilst this mipmapping process is going on, the video card 600 may also receive a second request to write second video data to L0 and a third request to obtain the down-sampled video data from L1. The GPU 610 issues a second instruction to set the L0 flag 210 and to write the second video data to L0, triggering the same mipmapping process to be performed on the second video data. In parallel, the GPU 610 also issues a third instruction to set the L1 flag 211 and to fetch/read the down-sampled video data from L1. When the embedded logic 450 detects that the L1 flag 211 is set and the down-sampled video data from L1 is available, by ANDing the L1 flag 211 data and the down-sampled video data, the down-sampled video data from L1 is communicated via the data bus 180 to the GPU 610. The GPU 610 then communicates the down-sampled video data to whichever device made the third request.
It is understood that requests such as the third request may be made automatically so that, once any down-sampled video data becomes available, it is saved to less temporary storage, wherefrom all layers of the mipmap will become available once the mipmapping process is completed.
It is also understood that further hierarchy levels can be implemented using a further number of memory blocks depending on the number of layers of mipmap required or the size of the pixel resolution.
It is possible to implement the present invention using a number of memory blocks of a uniform capacity. However, it would be beneficial for each memory block's capacity to reflect the size and/or pixel resolution of the mipmap to be written thereto, since this minimises any redundant/spare capacity on the memory blocks whilst the mipmapping process is in progress.
Referring to Figure 7, a diagram representing a method of pipelining a mipmap generation process according to an embodiment of the present invention is shown.
The method of pipelining a mipmap generation process comprises the steps of: writing first data to a first address of a first memory block (710); generating a second address of a second memory block by performing an arithmetic operation on the first address (720); generating second data by down-sampling the first data (730); and writing the second data to the second address of the second memory block (740), wherein: a second memory block operation comprises at least one of the generation of the second address, the generation of the second data or the writing of the second data to the second memory block; the second memory block operation is triggered by the writing operation of the first data to the first memory block; and the first memory block is available for a further access operation whilst the second memory block operation is in progress.
The writing of the first data (710) automatically triggers the mipmap generation process to start at least one of the generation of the second address (720), the generation of the second data (730) or the writing of the second data (740). As the writing of the second data (740) requires both the generated second address (720) and the generated second data (730), according to an embodiment of the present invention, the generation of the second address (720) or the generation of the second data (730) triggers the writing of the second data (740).
However, it is understood that according to an alternative embodiment these processes 720-740 can occur in an incremental manner, with appropriate portions of each process occurring in parallel wherever possible.
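As a behavioural sketch of steps 710-740 cascading through the hierarchy (a software model only, with hypothetical names and one byte per pixel assumed; in the device itself these stages overlap in hardware):

    #include <stdint.h>

    /* Behavioural model: a write to a level (710) triggers generation of
     * the next-level address (720), generation of the down-sampled data
     * (730) and the write to the next memory block (740), repeating
     * until the final level of the hierarchy. */
    void write_pixel(uint8_t *levels[], unsigned pitches[],
                     unsigned num_levels,
                     unsigned x, unsigned y, uint8_t value)
    {
        levels[0][y * pitches[0] + x] = value;          /* step 710 */
        for (unsigned z = 0; z + 1 < num_levels; ++z) {
            unsigned nx = x >> 1, ny = y >> 1;          /* step 720 */
            const uint8_t *src = levels[z];
            unsigned p = pitches[z];
            unsigned sum = src[(ny * 2) * p + nx * 2]   /* step 730 */
                         + src[(ny * 2) * p + nx * 2 + 1]
                         + src[(ny * 2 + 1) * p + nx * 2]
                         + src[(ny * 2 + 1) * p + nx * 2 + 1];
            levels[z + 1][ny * pitches[z + 1] + nx]
                = (uint8_t)(sum >> 2);                  /* step 740 */
            x = nx;
            y = ny;
        }
    }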
According to an embodiment of the present invention, the first address comprises a positional coordinate (A, B); and the second address is generated by dividing each component of the coordinate of the first address by two and rounding off to the nearest integer so that the second address becomes ([A/2], [B/2]), where [C] denotes the nearest integer to C. This enables an embedded logic or an AGU to easily generate the second address from the first address. According to an embodiment, an address register might be used wherein the generation of the second address merely involves shifting of the first address by one bit without the use of an embedded logic or an AGU.
Each memory block stores each layer of the mipmap generated by this process. The down-sampling of the first data comprises averaging, and rounding off to the nearest integer, all data from a first set of first addresses which maps to the same one second address so that the first memory block can store an earlier layer of the mipmap and the second memory block can store the next layer of the mipmap, wherein the earlier layer comprises the first data and the next layer comprises the second data. According to an embodiment, the first set of first addresses which maps to the same one second address (E, F) comprises four first addresses (2E, 2F), (2E+1, 2F), (2E, 2F+1), and (2E+1, 2F+1). It is understood that a number of variations of the first set of first addresses can be used depending on the specific implementation of the address mapping scheme.
According to an embodiment of the present invention, the further access operation to the first memory block is a write operation of third data. The writing of the third data then also triggers further generation process and writing operations based on the third data whereby a mipmap is generated for the third data.
It is understood that the further access operation and the second memory block operation may be performed during a single clock cycle wherein a local or external clock provides a clock signal for controlling the mipmapping process.
According to an embodiment of the present invention, the mipmapping process steps are repeated for further memory blocks, with the last writing of the data to the last memory block triggering the further writing and generating steps in relation to the next memory block. This cascading mipmapping process continues until the last layer of the mipmap is generated and/or stored so that all the layers of the mipmap are generated and/or available for access from the memory blocks.
Referring to Figure 8, a diagram representing a method of manufacturing the hierarchical memory device shown in Figures 1-4 according to an embodiment of the present invention is shown.
The method of manufacturing the hierarchical memory device on a single integrated circuit comprises the steps of: installing a first contact, a second contact and a plurality of memory blocks (810); installing a down-sampler arranged to down-sample data from a first memory block so that the down-sampled data can be written to a second memory block (820); installing an Address Generation Unit, an embedded logic or an Arithmetic Logic Unit (830); providing an address bus for communication among the first contact, the Address Generation Unit and the plurality of memory blocks (840); and providing a data bus for communication among the plurality of memory blocks and the second contact (850).
According to an embodiment of the present invention, the method of manufacturing the hierarchical memory device further comprises a step of installing a plurality of registers for setting an access enable flag for each memory block, wherein the plurality of registers are arranged to be communicable via the address bus or the data bus.
It is understood that the manufacturing process of the integrated circuit may involve at least one of etching, deposition and/or imaging processes. Semiconductors, such as silicon, may go through the manufacturing process in a wafer form with metallic and/or carbon based electric contacts deposited thereon. It is understood that variations of these or other fabrication techniques can also be used to manufacture the integrated circuit comprising the hierarchical memory device according to an embodiment of the present invention.
According to an embodiment of the present invention, a computer readable medium storing a computer program to operate the mipmap process pipelining method of Figure 7 and/or the manufacturing method of Figure 8 is provided.
Embodiments of the present invention perform automatic mipmapping which results in high quality anti-aliasing and texturing, wherein a hierarchical memory system with integrated logic for pipelining the mipmapping process is implemented to perform various stages of the mipmapping process in parallel. This makes the mipmaps available for access by an external GPU or CPU more quickly and efficiently. Further, cube mapping with fully anti-aliased texture cubes, better shadow mapping due to anti-aliasing, better and faster texture upload performance, and mipmapped software textures (EGLImage) can be achieved with less CPU and/or GPU overhead when compared to conventional computer systems comprising conventional memories.
Embodiments of the present invention generate mip-maps at real-time speeds, in parallel with writing operations being performed at the lowest level of the memory hierarchy, e.g. L0, whereas the conventional mip-mapping processes use a batch mode technique which requires each of the access operations (e.g. write and read operations), the down-sampling operation and the address generation to be performed in a sequential manner, which causes unnecessary latency in generating mipmaps. Any reduction of this latency is useful in a number of applications including CPU/GPU or software generated imagery, immediate anti-aliasing and other applications implementing a down-sampling technique. Extended filtering techniques, such as multi-tap FIR/IIR filtering and trilinear filtering among others, require a considerable amount of computation and memory bandwidth and/or intermediate pre-filtered results, which are data down-sampled only to an intermediate level in the mipmapping memory hierarchy. Supplying these intermediate pre-filtered results is what the generated mipmaps represent, and the embodiments of the present invention generate these mipmaps in real-time and/or in parallel, as opposed to batching the overall process up and performing each step in sequence.
Any device writing to the hierarchical memory device will automatically gain access to a mipmapping process and, due to the pipelined parallel nature, the mipmap generation process can be performed in the order of O(N) regardless of the bitmap size of the image to be rendered, where N is the number of required mipmap levels. This is a significant improvement over the conventional software based mipmapping approaches, which require operations in the order of O((5/3)N^2), where N is the bitmap dimension in pixels.
For example, for a 256 x 256 bitmap, a conventional computer system will require 109225 memory read/write instructions as well as the averaging calculation instructions from the CPU and/or GPU, whereas the embodiments of the present invention, in the best case, will only require 8 clock cycles/instructions to generate the whole mipmap, provided that the hierarchical memory device is arranged to run at a faster local clock speed so that no latency is introduced by the automated mipmapping process steps performed between each clock cycle of the CPU and/or GPU.
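The 109225 figure can be reproduced by assuming four reads and one write per generated texel (a back-of-the-envelope sketch, not taken from the patent text):

    #include <stdio.h>

    /* Conventional batch mipmapping of a 256 x 256 bitmap: every texel
     * in levels 1..8 costs four reads and one write from the CPU/GPU. */
    int main(void)
    {
        unsigned long ops = 0;
        for (unsigned long t = (256 / 2) * (256 / 2); t >= 1; t /= 4)
            ops += 5 * t;       /* 4 reads + 1 write per generated texel */
        printf("%lu\n", ops);   /* prints 109225 */
        return 0;
    }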
Even under the worst-case scenario, the maximum latency introduced by the mipmapping process will be 8 clock cycles/instructions. As the 256 × 256 bitmap is being written to the highest-level memory block L0, the computations required for generating the subsequent mipmap levels are performed immediately thereafter, so that any latency comes only from the propagation and/or chain of down-sampling or filtering required to generate the subsequent mipmaps.
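The figures above can be checked with a short calculation. The sketch below assumes a square M x M bitmap and a conventional batch down-sampler that performs four reads and one write per output texel; both the 109225 operation count and the 8 pipelined steps fall out directly.

    import math

    def conventional_ops(m):
        """Batch-mode read/write operations over the whole mipmap chain
        of an m x m bitmap: 4 reads + 1 write per output texel."""
        ops = 0
        while m > 1:
            ops += 5 * (m // 2) ** 2
            m //= 2
        return ops

    m = 256
    levels = int(math.log2(m)) + 1     # 9 levels: 256x256 down to 1x1
    print(conventional_ops(m))         # 109225, as stated above
    print(levels - 1)                  # 8 down-sampling steps in the pipeline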
Since the down-sampling can be performed at all levels of the mipmap hierarchy for all write operations, any write to L0 will trigger a down-sampling and a subsequent write operation to L1, which will trigger a down-sampling and a subsequent write operation to L2, and so on.
Therefore, the maximum latency or delay from writing to L0 to reading from L7 will be 8 clock cycles/instructions.
According to an embodiment, at every clock cycle a predetermined number of bytes, which can be one byte or four bytes or any number of bytes depending on the actual implementation, is written to L0, which in turn triggers a down-sampling and a subsequent write operation to L1, which then triggers the same down-sampling and write operation to L2, and so on. This propagation of down-sampling and write operations through the hierarchy levels is serial in nature and, therefore, in the worst-case scenario a maximum latency of 8 clock cycles/instructions may be experienced as the mipmapping process propagates through the levels of the hierarchical memory device. Even so, the worst-case scenario offers a significant improvement in latency over conventional software-based mipmapping approaches.
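A minimal software model of this write-triggered propagation is sketched below. The class interface, the one-cycle delay per level and the nine-level hierarchy are assumptions made for illustration; the down-sampling itself is elided, since a real model would average all four contributing texels as shown earlier.

    # Write-triggered cascade: a write to L0 schedules a derived write to
    # L1 on the next cycle, which schedules one to L2, and so on. With
    # nine levels (L0..L8) the last derived write lands 8 cycles after
    # the original write to L0.

    class HierarchicalMemoryModel:
        def __init__(self, num_levels):
            self.levels = [{} for _ in range(num_levels)]  # address -> value
            self.pending = []                              # (cycle, level, addr, value)
            self.cycle = 0

        def write_l0(self, addr, value):
            self._write(0, addr, value)

        def _write(self, level, addr, value):
            self.levels[level][addr] = value
            if level + 1 < len(self.levels):
                x, y = addr
                # Halve each coordinate to form the next-level address
                # (floor division here; the claims specify rounding to
                # the nearest integer).
                parent = (x // 2, y // 2)
                self.pending.append((self.cycle + 1, level + 1, parent, value))

        def tick(self):
            self.cycle += 1
            due = [p for p in self.pending if p[0] == self.cycle]
            self.pending = [p for p in self.pending if p[0] != self.cycle]
            for _, level, addr, value in due:
                self._write(level, addr, value)

    mem = HierarchicalMemoryModel(num_levels=9)
    mem.write_l0((0, 0), 42)
    for _ in range(8):
        mem.tick()
    print(mem.levels[8])   # {(0, 0): 42} -- reaches L8 eight cycles after L0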
Figure 9 shows an illustrative environment 1010 according to an embodiment of the invention. The skilled person will realise and understand that embodiments of the present invention may be implemented using any suitable computer system; the example system shown in Figure 9 is exemplary and provided for completeness only. To this extent, environment 1010 includes a computer system 1020 that can perform a process described herein in order to perform an embodiment of the invention. In particular, computer system 1020 is shown including a program 1030, which makes computer system 1020 operable to implement an embodiment of the invention by performing a process described herein.
Computer system 1020 is shown including a processing component 1022 (e.g., one or more processors), a storage component 1024 (e.g., a storage hierarchy), an input/output (I/O) component 1026 (e.g., one or more I/O interfaces and/or devices), and a communications pathway 1028. In general, processing component 1022 executes program code, such as program 1030, which is at least partially fixed in storage component 1024. While executing program code, processing component 1022 can process data, which can result in reading and/or writing transformed data from/to storage component 1024 and/or I/O component 1026 for further processing. Pathway 1028 provides a communications link between each of the components in computer system 1020. I/O component 1026 can comprise one or more human I/O devices, which enable a human user 1012 to interact with computer system 1020 and/or one or more communications devices to enable a system user 1012 to communicate with computer system 1020 using any type of communications link. To this extent, program 1030 can manage a set of interfaces (e.g., graphical user interface(s), application program interface, and/or the like) that enable human and/or system users 1012 to interact with program 1030.
Further, program 1030 can manage (e.g., store, retrieve, create, manipulate, organize, present, etc.) the data, such as a plurality of data files 1040, using any solution.
In any event, computer system 1020 can comprise one or more general purpose computing articles of manufacture (e.g., computing devices) capable of executing program code, such as program 1030, installed thereon. As used herein, it is understood that "program code" means any collection of instructions, in any language, code or notation, that cause a computing device having an information processing capability to perform a particular action either directly or after any combination of the following: (a) conversion to another language, code or notation; (b) reproduction in a different material form; and/or (c) decompression. To this extent, program 1030 can be embodied as any combination of system software and/or application software.
Further, program 1030 can be implemented using a set of modules. In this case, a module can enable computer system 1020 to perform a set of tasks used by program 1030, and can be separately developed and/or implemented apart from other portions of program 1030. As used herein, the term "component" means any configuration of hardware, with or without software, which implements the functionality described in conjunction therewith using any solution, while the term "module" means program code that enables a computer system 1020 to implement the actions described in conjunction therewith using any solution. When fixed in a storage component 1024 of a computer system 1020 that includes a processing component 1022, a module is a substantial portion of a component that implements the actions. Regardless, it is understood that two or more components, modules, and/or systems may share some/all of their respective hardware and/or software. Further, it is understood that some of the functionality discussed herein may not be implemented or additional functionality may be included as part of computer system 1020.
When computer system 1020 comprises multiple computing devices, each computing device can have only a portion of program 1030 fixed thereon (e.g., one or more modules).
However, it is understood that computer system 1020 and program 1030 are only representative of various possible equivalent computer systems that may perform a process described herein. To this extent, in other embodiments, the functionality provided by computer system 1020 and program 1030 can be at least partially implemented by one or more computing devices that include any combination of general and/or specific purpose hardware with or without program code. In each embodiment, the hardware and program code, if included, can be created using standard engineering and programming techniques, respectively.
Regardless, when computer system 1020 includes multiple computing devices, the computing devices can communicate over any type of communications link. Further, while performing a process described herein, computer system 1020 can communicate with one or more other computer systems using any type of communications link. In either case, the communications link can comprise any combination of various types of optical fibre, wired, and/or wireless links; comprise any combination of one or more types of networks; and/or utilize any combination of various types of transmission techniques and protocols.
In any event, computer system 1020 can obtain data from files 1040 using any solution.
For example, computer system 1020 can generate and/or be used to generate data files 1040, retrieve data from files 1040, which may be stored in one or more data stores, receive data from files 1040 from another system, and/or the like.
Attention is directed to all papers and documents which are filed concurrently with or previous to this specification in connection with this application and which are open to public inspection with this specification, and the contents of all such papers and documents are incorporated herein by reference.
All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and/or all of the steps of any method or process so disclosed, may be combined in any combination, except combinations where at least some of such features and/or steps are mutually exclusive.
Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise. Thus, unless expressly stated otherwise, each feature disclosed is one example only of a generic series of equivalent or similar features.
The invention is not restricted to the details of the foregoing embodiment(s). The invention extends to any novel one, or any novel combination, of the features disclosed in this specification (including any accompanying claims, abstract and drawings), or to any novel one, or any novel combination, of the steps of any method or process so disclosed.

Claims (21)

1. A hierarchical memory device comprising an Embedded Logic Unit and a plurality of memory blocks, wherein, when writing data into the memory device: the plurality of memory blocks are arranged to be accessible in parallel; a first memory block is arranged to be written with first data at a first address of the first memory block; a second memory block is arranged to be written with second data at a second address of the second memory block; the Embedded Logic Unit is arranged to generate the second address of the second memory block using the first address of the first memory block and to generate the second data by down-sampling the first data; a second memory block access operation comprises at least one of the generation of the second address, generation of the second data or writing of the second data to the second memory block; when an instruction for the writing of the first data at the first address is communicated to the hierarchical memory device, this triggers the second memory block access operation; and the second memory block access operation is performed in parallel with another access operation to the first memory block.
2. The hierarchical memory device of claim 1, wherein the other access operation to the first memory block is a write operation of third data.
3. The hierarchical memory device of claim 1 or 2, wherein: the first address comprises a positional coordinate (A, B); and the Embedded Logic Unit is arranged to generate the second address by dividing each component of the coordinate of the first address by two and rounding to the nearest integer, so that the second address becomes ([A/2], [B/2]), where [C] denotes the nearest integer to C.
4. The hierarchical memory device of any preceding claim, wherein the Embedded Logic Unit comprises an Arithmetic Logic Unit.
5. The hierarchical memory device of any preceding claim, wherein the Embedded Logic Unit comprises a separate Address Generation Unit arranged to generate the second address.
6. The hierarchical memory device of any preceding claim, wherein: each memory block is arrangeable to store a layer of a mipmap from a mipmapping process; the down-sampling of the first data comprises averaging, and rounding to the nearest integer, all data from a first set of first addresses which maps to the same second address; the first memory block stores an earlier layer and the second memory block stores the next layer of the mipmap; and the earlier layer comprises the first data and the next layer comprises the second data.
7. The hierarchical memory device of any preceding claim, wherein: the hierarchical memory device is arranged to be communicable with an external Central Processing Unit; and the external Central Processing Unit communicates the instruction for the writing of the first data at the first address to the memory device.
8. The hierarchical memory device of any preceding claim, wherein the hierarchical memory device comprises a plurality of memory blocks with each pair of memory blocks arranged in a hierarchy as defined by the first memory block and the second memory block.
9. The hierarchical memory device of any preceding claim, further comprising a plurality of enable flags, wherein: each flag corresponds to one memory block; and a memory block is accessed only when the corresponding flag is set.
10. A video card comprising a Graphics Processing Unit and the hierarchical memory device of any preceding claim, wherein the hierarchical memory device is arranged to be accessed by the Graphics Processing Unit to generate image data for displaying an image on a display apparatus.
11. A computer system comprising a Central Processing Unit and the hierarchical memory device of any one of claims 1 to 9, wherein the hierarchical memory device is arranged to be accessed by the Central Processing Unit to generate image data for displaying an image on a display apparatus.
12. A method for a mipmap generation process comprising the steps of: writing first data to a first address of a first memory block; generating a second address of a second memory block by performing an arithmetic operation on the first address; generating second data by down-sampling the first data; and writing the second data to the second address of the second memory block, wherein: a second memory block operation comprises at least one of the generation of the second address, the generation of the second data or the writing of the second data to the second memory block; the second memory block operation is triggered by the writing operation of the first data to the first memory block; and the first memory block is available for a further access operation whilst the second memory block operation is in progress.
13. The method of claim 12, wherein the further access operation to the first memory block is a write operation of third data.
14. The method of claim 12 or 13, wherein: the first address comprises a positional coordinate (A, B); and the second address is generated by dividing each component of the coordinate of the first address by two and rounding to the nearest integer, so that the second address becomes ([A/2], [B/2]), where [C] denotes the nearest integer to C.
15. The method of any one of claims 12 to 14, wherein: each memory block stores a layer of the mipmap; the down-sampling of the first data comprises averaging, and rounding to the nearest integer, all data from a first set of first addresses which maps to the same second address; the first memory block stores an earlier layer and the second memory block stores the next layer of the mipmap; and the earlier layer comprises the first data and the next layer comprises the second data.
16. The method of any one of claims 12 to 15, wherein the steps are repeated for further memory blocks, with the writing of the data to the last memory block triggering the further generating and writing steps in relation to the next memory block.
17. The method of any one of claims 12 to 16, wherein the further access operation and the second memory block operation are performed during the same clock cycle.
18. A method of manufacturing the hierarchical memory device of any one of claims 1 to 9 on a single integrated circuit, comprising the steps of: installing a first contact, a second contact and a plurality of memory blocks; installing a down-sampler arranged to down-sample data from a first memory block so that the down-sampled data can be written to a second memory block; installing an Address Generation Unit, an embedded logic or an Arithmetic Logic Unit; providing an address bus for communication among the first contact, the Address Generation Unit and the plurality of memory blocks; and providing a data bus for communication among the plurality of memory blocks and the second contact.
19. A method of manufacturing the hierarchical memory device of any one of claims 1 to 9, the video card of claim 10 or the computer system of claim 11.
20. A computer readable medium storing a computer program to operate a method according to any one of claims 12 to 19.
21. The hierarchical memory device, the video card, the computer system, the method or the computer readable medium as substantially described herein with reference to the accompanying drawings.
GB1313570.2A 2013-07-30 2013-07-30 Hierarchical memory for mip map Withdrawn GB2516682A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1313570.2A GB2516682A (en) 2013-07-30 2013-07-30 Hierarchical memory for mip map

Publications (2)

Publication Number Publication Date
GB201313570D0 GB201313570D0 (en) 2013-09-11
GB2516682A true GB2516682A (en) 2015-02-04

Family

ID=49167156

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1313570.2A Withdrawn GB2516682A (en) 2013-07-30 2013-07-30 Hierarchical memory for mip map

Country Status (1)

Country Link
GB (1) GB2516682A (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7126604B1 (en) * 2001-08-08 2006-10-24 Stephen Clark Purcell Efficiently determining storage locations for the levels of detail of a MIP map of an image
US20090102851A1 (en) * 2007-10-19 2009-04-23 Kabushiki Kaisha Toshiba Computer graphics rendering apparatus and method
US20120147028A1 (en) * 2010-12-13 2012-06-14 Advanced Micro Devices, Inc. Partially Resident Textures

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Ramchan Woo, 'A 210-mW graphics LSI implementing full 3-D pipeline with 264-Mtexels/s texturing for mobile multimedia applications', IEEE Journal of Solid-State Circuits, Feb. 2004, Vol. 39, pp. 358-367, ISSN 0018-9200. *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3156974A1 (en) * 2015-10-12 2017-04-19 Samsung Electronics Co., Ltd. Texture processing method and unit
CN106570924A (en) * 2015-10-12 2017-04-19 三星电子株式会社 Texture processing method and unit
JP2017076388A (en) * 2015-10-12 2017-04-20 三星電子株式会社Samsung Electronics Co.,Ltd. Texture processing method and texture processing unit
KR20170042969A (en) * 2015-10-12 2017-04-20 삼성전자주식회사 Method and apparatus for processing texture
US10134173B2 (en) 2015-10-12 2018-11-20 Samsung Electronics Co., Ltd. Texture processing method and unit
CN106570924B (en) * 2015-10-12 2021-06-29 三星电子株式会社 Texture processing method and unit
KR102512521B1 (en) * 2015-10-12 2023-03-21 삼성전자주식회사 Method and apparatus for processing texture

Also Published As

Publication number Publication date
GB201313570D0 (en) 2013-09-11

Similar Documents

Publication Publication Date Title
CN107250996B (en) Method and apparatus for compaction of memory hierarchies
US6954204B2 (en) Programmable graphics system and method using flexible, high-precision data formats
US20110243469A1 (en) Selecting and representing multiple compression methods
US6339428B1 (en) Method and apparatus for compressed texture caching in a video graphics system
US8670613B2 (en) Lossless frame buffer color compression
US8669999B2 (en) Alpha-to-coverage value determination using virtual samples
US9478002B2 (en) Vertex parameter data compression
US7880745B2 (en) Systems and methods for border color handling in a graphics processing unit
US20050024378A1 (en) Texturing systems for use in three-dimensional imaging systems
US10824357B2 (en) Updating data stored in a memory
KR20060116916A (en) Texture cache and 3-dimensional graphics system including the same, and control method thereof
US20230409221A1 (en) Methods and systems for storing variable length data blocks in memory
US6812928B2 (en) Performance texture mapping by combining requests for image data
CN113256478A (en) Method and primitive block generator for storing primitives in a graphics processing system
CN113256477A (en) Method and tiling engine for storing tiling information in a graphics processing system
CN107209926B (en) Graphics processing unit with bayer mapping
CN114830082B (en) SIMD operand arrangement selected from a plurality of registers
US11232622B2 (en) Data flow in a distributed graphics processing unit architecture
US10706607B1 (en) Graphics texture mapping
GB2516682A (en) Hierarchical memory for mip map
CN116263982B (en) Graphics processor, system, method, electronic device and apparatus
US20220027281A1 (en) Data processing systems
US20240212257A1 (en) Workload packing in graphics texture pipeline
US20240221279A1 (en) Sliced graphics processing unit (gpu) architecture in processor-based devices
US8488890B1 (en) Partial coverage layers for color compression

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)