CN112348733A - Method and device for initializing, filling, reading and writing storage objects in GPU off-chip memory - Google Patents

Info

Publication number
CN112348733A
Authority
CN
China
Prior art keywords
element block
target element
descriptor corresponding
block
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910731295.2A
Other languages
Chinese (zh)
Inventor
殷诚信
王磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huaxiaxin Beijing General Processor Technology Co ltd
Original Assignee
Huaxiaxin Beijing General Processor Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huaxiaxin Beijing General Processor Technology Co ltd filed Critical Huaxiaxin Beijing General Processor Technology Co ltd
Priority to CN201910731295.2A priority Critical patent/CN112348733A/en
Publication of CN112348733A publication Critical patent/CN112348733A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/60 Memory management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/00 2D [Two Dimensional] image generation
    • G06T11/001 Texturing; Colouring; Generation of texture or colour

Abstract

The invention provides an initialization filling and reading and writing method and device for a storage object in a GPU off-chip memory. The method comprises the following steps: when an initialization padding operation request is received, allocating storage space for each element block of a storage object and a corresponding head descriptor thereof, wherein the head descriptor corresponding to the element block comprises a write mask and a single element value of the element block, the write mask is multi-bit data, each bit is used for indicating whether one element in the element block is written or not, and the single element value is used for indicating a background color or a padding mode of the element block; after the storage space is allocated, initially setting all bits of the write mask in the head descriptor corresponding to the element block, and writing a background color or a filling mode specified by a user into a single element value in the head descriptor corresponding to the element block. The invention can reduce the data writing amount of initialization filling, reduce time delay and reduce power consumption.

Description

Method and device for initializing, filling, reading and writing storage objects in GPU off-chip memory
Technical Field
The invention relates to the technical field of image data storage, in particular to an initialization filling and reading and writing method and device for a storage object in a GPU off-chip memory.
Background
A GPU (Graphics Processing Unit) is the processor of a graphics card. It is similar to a CPU, except that it is designed specifically to perform the complex mathematical and geometric calculations required by image rendering applications. In a rendering application, the frame is first cleared to the background color, and the desired scene is then rendered according to the rendering effect. Clearing the depth buffer is likewise a necessary initialization step so that occlusion testing works correctly. For some shading algorithms, the GPU also needs to initialize the stencil buffer.
In order to match the GPU computing power as much as possible, most storage objects are allocated in the GPU off-chip memory, following the principle of keeping data close to computation, which reduces the bottleneck caused by the "memory wall". In practical applications, initialization filling of storage objects is performed very frequently.
At present, initialization filling of a storage object in a GPU off-chip memory is typically implemented with DMA. The space allocated to the storage object to be initialized is divided into units of the DMA burst transfer size, a write operation is performed on the designated area in burst transfer mode, and the operation is repeated until the initial data has been written to the entire space of the storage object.
In the process of implementing the invention, the inventor finds that at least the following technical problems exist in the prior art:
The existing initialization filling method performs a write operation on every element of the storage object, so the initialization latency is long and the power consumption is high.
Disclosure of Invention
In order to solve the above problems, the present invention provides a method and an apparatus for initializing, filling, reading, and writing a storage object in a GPU off-chip memory, which can reduce the data writing amount of the initializing filling, reduce the time delay, and reduce the power consumption.
In a first aspect, the present invention provides a method for initializing, filling, reading, and writing a storage object in a GPU off-chip memory, comprising:
when an initialization padding operation request is received, allocating storage space for each element block of a storage object and a head descriptor corresponding to each element block, wherein the head descriptor corresponding to the element block comprises a write mask and a single element value of the element block, the write mask is multi-bit data, each bit is used for indicating whether one element in the element block is written or not, and the single element value is used for indicating a background color or a padding mode of the element block;
after the storage space is allocated, initially setting all bits of a write mask in a head descriptor corresponding to the element block to indicate that all elements of the element block are not written with data, and writing a background color or a filling mode specified by a user into a single element value in the head descriptor corresponding to the element block.
Optionally, the allocating a storage space for each element block of the storage object and the header descriptor corresponding to each element block includes:
and allocating the element blocks in a GPU off-chip memory, and allocating the head descriptors corresponding to the element blocks in a GPU on-chip cache.
Optionally, the method further comprises:
when a read operation request is received, searching a target element block needing to read data according to the read operation request;
testing whether a head descriptor corresponding to the target element block is hit or not according to the target element block;
if the head descriptor corresponding to the target element block is hit, judging whether the target element block is an element block covered by a single element, and if the head descriptor is not hit, reading element data of the target element block;
and when the target element block is an element block covered by a single element, reading a single element value in a head descriptor corresponding to the target element block, otherwise, reading element data of the target element block.
Optionally, the method further comprises:
after reading the element data of the target element block, judging whether the target element block is reloaded into a GPU chip for caching;
when the target element block is reloaded into a cache in a GPU chip, calculating the minimum value and the maximum value in the target element block;
comparing whether the minimum value and the maximum value are equal;
if the minimum value and the maximum value are equal, redistributing the head descriptor corresponding to the target element block into a GPU on-chip cache, initially setting a write mask in the head descriptor corresponding to the target element block, and simultaneously writing the minimum value or the maximum value into a single element value in the head descriptor corresponding to the target element block.
Optionally, the method further comprises:
when a write operation request is received, searching a target element block needing to be written with data according to the write operation request;
testing whether a head descriptor corresponding to the target element block is hit or not according to the target element block;
if the head descriptor corresponding to the target element block is hit, writing data into the target element block, and negating a write mask in the head descriptor corresponding to the target element block and a bit corresponding to an element of the written data;
after all elements of the target element block are written with data, calculating the minimum value and the maximum value in the target element block;
comparing whether the minimum value and the maximum value are equal;
if the minimum value and the maximum value are equal, resetting a write mask in a head descriptor corresponding to the target element block, and writing the minimum value or the maximum value into a single element value in the head descriptor corresponding to the target element block, otherwise, removing the head descriptor corresponding to the target element block from a GPU on-chip cache.
In a second aspect, the present invention provides an apparatus for initializing, filling, reading, and writing memory objects in a GPU off-chip memory, comprising:
the storage allocation module is used for allocating storage space for each element block of a storage object and a head descriptor corresponding to each element block when an initial filling operation request is received, wherein the head descriptor corresponding to the element block comprises a write mask and a single element value of the element block, the write mask is multi-bit data, each bit is used for indicating whether one element in the element block is written or not, and the single element value is used for indicating a background color or a filling mode of the element block;
and the initialization filling module is used for initially setting all bits of a write mask in the head descriptor corresponding to the element block after the storage space is allocated, so as to indicate that all elements of the element block are not written with data, and simultaneously writing a background color or a filling mode specified by a user into a single element value in the head descriptor corresponding to the element block.
Optionally, the storage allocation module is configured to allocate the element block in a GPU off-chip memory, and allocate a header descriptor corresponding to the element block in a GPU on-chip cache.
Optionally, the apparatus further comprises a second read operation module, the second read operation module comprising:
the second reading and searching unit is used for searching a target element block needing to read data according to a reading operation request when the reading operation request is received;
a read hit test unit, configured to test whether a header descriptor corresponding to the target element block is hit according to the target element block;
a second judging unit, configured to judge whether the target element block is an element block covered by a single element when a header descriptor corresponding to the target element block is hit;
a second reading unit, configured to read a single element value in the head descriptor corresponding to the target element block when the target element block is an element block covered by a single element, and further configured to read the element data of the target element block if the head descriptor corresponding to the target element block is not hit or the target element block is not an element block covered by a single element.
Optionally, the second read operation module further comprises:
a third judging unit, configured to judge whether the target element block is reloaded into the GPU chip for caching after reading the element data of the target element block;
a second calculation unit, configured to calculate a minimum value and a maximum value in the target element block when the target element block is reloaded into a cache in a GPU chip;
a second comparing unit for comparing whether the minimum value and the maximum value are equal;
and a second resetting unit, configured to, if the minimum value and the maximum value are equal, reallocate the head descriptor corresponding to the target element block to a GPU on-chip cache, initially set a write mask in the head descriptor corresponding to the target element block, and write the minimum value or the maximum value into a single element value in the head descriptor corresponding to the target element block.
Optionally, the apparatus further includes a second write operation module, where the second write operation module includes:
the second writing searching unit is used for searching a target element block needing to be written with data according to the writing operation request when the writing operation request is received;
a write hit test unit, configured to test whether a header descriptor corresponding to the target element block is hit according to the target element block;
a second writing unit, configured to write data to the target element block when a head descriptor corresponding to the target element block is hit, and invert a write mask in the head descriptor corresponding to the target element block and a bit corresponding to an element in which the data is written;
a third calculation unit configured to calculate a minimum value and a maximum value in the target element block after data is written to all elements of the target element block;
a third comparing unit for comparing whether the minimum value and the maximum value are equal;
and a third resetting unit, configured to reset a write mask in the header descriptor corresponding to the target element block when the minimum value and the maximum value are equal, write the minimum value or the maximum value into a single element value in the header descriptor corresponding to the target element block, and otherwise remove the header descriptor corresponding to the target element block from the GPU on-chip cache.
According to the method and the device for initializing, filling and reading and writing the storage object in the GPU off-chip memory, when the storage object is initialized and filled, only the head descriptors corresponding to all element blocks divided by the storage object need to be filled with data, and data are not written into each element of the storage object, so that the bus bandwidth required by initializing and filling the storage object can be reduced. For some applications with good write locality, such as the UI, the fact that a block of elements is fully covered by a single value may further reduce write transactions on the bus of the memory object. The method can shorten time delay and reduce power consumption particularly when processing large-space storage objects, and has obvious advantages.
Drawings
FIG. 1 is a schematic diagram of a structure of a storage object partition element block;
FIG. 2 is a flow chart illustrating an implementation of initialization padding according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a memory allocation of element blocks and header descriptors;
FIG. 4 is a schematic diagram of another memory allocation of element blocks and header descriptors;
FIG. 5 is a flow chart illustrating a read operation according to an embodiment of the present invention;
FIG. 6 is a flow chart illustrating a write operation according to an embodiment of the present invention;
FIG. 7 is a flowchart illustrating a read operation according to another embodiment of the present invention;
FIG. 8 is a flow chart illustrating a write operation according to another embodiment of the present invention;
FIG. 9 is a schematic structural diagram of an initialization fill and read/write apparatus for a storage object in a GPU off-chip memory according to an embodiment of the present invention;
FIG. 10 is a block diagram of an apparatus for initializing, filling, reading, and writing memory objects in a GPU off-chip memory according to an embodiment of the present invention;
FIG. 11 is a block diagram of an initialization fill and read/write apparatus for memory objects in a GPU off-chip memory according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
An embodiment of the present invention provides a method for initialization filling and reading/writing of a storage object in a GPU off-chip memory. In this embodiment, storage objects include general cache objects and image objects. A storage object has a spatial dimension, which may be one-dimensional, two-dimensional or three-dimensional, and is divided into a plurality of element blocks according to its dimension information. Each element block contains a plurality of elements, and all element blocks have the same size. As shown in FIG. 1, taking a two-dimensional storage object as an example, assume that the storage object has 64 elements in total, denoted M1-M64. The storage object is divided into 16 element blocks arranged in 4 rows and 4 columns, denoted blk1-blk16 in sequence, and each element block contains 4 elements. Dividing the storage object into blocks helps improve locality for image-type storage objects and two-dimensional cache objects. These types of storage objects are very widely used in graphics applications and GPGPU applications.
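The block partition described above can be illustrated with a minimal sketch. The 8x8 object shape and the 2x2 block shape are assumptions chosen only to match the counts in the FIG. 1 example (64 elements, 16 blocks of 4 elements); the function name `locate` and the row-major numbering are likewise hypothetical and are not specified by the source.

```python
# Assumed shapes: an 8x8 storage object split into 2x2 element blocks,
# which gives 4 rows x 4 columns of blocks, 4 elements per block.
MEM_H, MEM_W = 8, 8
BLK_H, BLK_W = 2, 2


def locate(row, col):
    """Map a 0-based element coordinate to (block index, offset in block).

    Blocks are numbered row-major from 0, so block 0 here corresponds to
    blk1 in the text. The row-major order is an assumption.
    """
    blocks_per_row = MEM_W // BLK_W
    blk = (row // BLK_H) * blocks_per_row + (col // BLK_W)
    off = (row % BLK_H) * BLK_W + (col % BLK_W)
    return blk, off


first = locate(0, 0)
last = locate(7, 7)
middle = locate(2, 3)
```

Neighbouring elements in both directions fall into the same block, which is the locality benefit the paragraph refers to.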
When the initial filling operation is performed on the storage object, as shown in fig. 2, the method includes:
S201, when an initialization filling operation request is received, allocating a storage space for each element block of a storage object and a head descriptor corresponding to each element block, wherein the head descriptor corresponding to the element block comprises a write mask and a single element value of the element block, the write mask is multi-bit data, each bit is used for indicating whether one element in the element block is written or not, and the single element value is used for indicating a background color or a filling mode of the element block;
S202, after the storage space is allocated, initially setting all bits of the write mask in the head descriptor corresponding to the element block to indicate that no element of the element block has been written with data, and writing a background color or a filling mode specified by a user into the single element value in the head descriptor corresponding to the element block.
Specifically, the required storage space is allocated according to the size of the storage object and the configuration information of the element blocks. Each element block has a corresponding head descriptor (denoted Head Descriptor), which records some typical characteristics of that element block. In this embodiment, the head descriptor includes a write mask and a single element value of the element block. The write mask is multi-bit data in which each bit indicates whether one element of the corresponding element block has been written, so the number of bits in the write mask depends on the size of the element block. Assuming the element block size is BLK_SIZE = BLK_H × BLK_W, the write mask in the head descriptor may be denoted hd_wMask[BLK_SIZE-1:0]. The single element value indicates the background color or fill pattern of the corresponding element block and may be denoted kVal0.
After the storage space is allocated, all bits of the write mask in every head descriptor are set to 1, indicating that no element of the corresponding element block has been written; a bit value of 0 indicates that the corresponding element has been written. The opposite polarity may also be used. At the same time, the user-specified background color or fill pattern is written to the single element value kVal0 in every head descriptor. Once the head descriptors of all element blocks have been filled, the initialization filling of the storage object is complete.
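As a minimal sketch of the initialization filling just described, the following models a head descriptor as a write mask plus a single element value. The field names `hd_wMask` and `kVal0` follow the text; the class name, the function `init_fill`, the block size of 4 and the example color are assumptions made for illustration.

```python
from dataclasses import dataclass

BLK_SIZE = 4  # elements per block (assumed for the example)


@dataclass
class HeadDescriptor:
    hd_wMask: int  # one bit per element; 1 means "not yet written"
    kVal0: int     # background color or fill pattern of the block


def init_fill(blk_num, background):
    """Fill only the head descriptors; no element data is written."""
    all_ones = (1 << BLK_SIZE) - 1
    return [HeadDescriptor(all_ones, background) for _ in range(blk_num)]


# Hypothetical usage: 16 blocks cleared to an example background color.
descriptors = init_fill(16, 0x00FF00)
```

The sketch touches one descriptor per block and never iterates over individual elements, which is where the reduction in write traffic comes from.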
Assuming the size of the storage object is MEM_SIZE = MEM_H × MEM_W, the number of element blocks is BLK_NUM = MEM_SIZE / BLK_SIZE = (MEM_H / BLK_H) × (MEM_W / BLK_W), so the total write transfer load is Wtotal = size_of(kVal0) × BLK_NUM. The additional storage introduced is Htotal = size_of(Head Descriptor) × BLK_NUM.
With the conventional initialization filling method, if the size of each element is also size_of(kVal0), the total write transfer load is size_of(kVal0) × MEM_SIZE. For the initialization filling of storage objects in the GPU, the method of the invention therefore reduces the write transfer load, which in turn reduces latency and power consumption and saves memory bandwidth. Ordinary read and write operations on the storage object can also be optimized through the head descriptor.
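The comparison of write transfer loads can be worked through numerically. The sketch below evaluates the formulas from the text for a hypothetical 1920x1080 object with 4x4 blocks, 4-byte elements and a 6-byte head descriptor (a 16-bit write mask plus a 4-byte value); all of these figures are assumptions for illustration and do not come from the source.

```python
def write_loads(mem_h, mem_w, blk_h, blk_w, elem_bytes, hd_bytes):
    """Return (Wtotal, Htotal, conventional load) in bytes."""
    mem_size = mem_h * mem_w
    blk_num = (mem_h // blk_h) * (mem_w // blk_w)
    w_total = elem_bytes * blk_num         # one kVal0 written per block
    h_total = hd_bytes * blk_num           # extra storage for descriptors
    w_conventional = elem_bytes * mem_size  # every element written
    return w_total, h_total, w_conventional


# Assumed example figures: 1080 rows, 1920 columns, 4x4 blocks.
w_total, h_total, w_conventional = write_loads(1080, 1920, 4, 4, 4, 6)
reduction = w_conventional // w_total
```

When the object dimensions are exact multiples of the block dimensions, the ratio of the conventional load to Wtotal equals BLK_SIZE, so larger blocks give a larger saving at the cost of coarser single-value coverage.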
In particular, when the storage space is allocated for each element block of the storage object and the head descriptor corresponding to each element block, there are two alternative allocation manners.
The first distribution mode:
As shown in FIG. 3, the element blocks and their corresponding head descriptors are both allocated in the GPU off-chip memory, i.e., all head descriptors are allocated in the GPU off-chip memory as special storage objects along with the allocation of the element blocks.
The second distribution mode:
As shown in FIG. 4, the element blocks are allocated in the GPU off-chip memory, and the head descriptors corresponding to the element blocks are allocated in the GPU on-chip cache. For example, the head descriptors are stored in the L2 cache, and all the head descriptors are stored as a macro tag.
In the first allocation mode, the read and write operations to the memory objects in the GPU off-chip memory go through the Write Queue (WQ) and the Read Queue (RQ).
When a read operation is performed on the storage object, a flowchart of the read operation is shown in fig. 5, where the method includes:
S501, when a read operation request is received, searching a target element block needing to read data and a head descriptor corresponding to the target element block according to the read operation request;
S502, judging whether the target element block is an element block covered by a single element, and if so, executing step S503; otherwise, executing step S504;
S503, reading the single element value in the head descriptor corresponding to the target element block;
S504, reading the element data of the target element block.
Specifically, for a read operation request in the read queue, the target element block is found according to the request address, and the head descriptor corresponding to the target element block is looked up through the address mapping between element blocks and head descriptors. If the element block containing the request address is covered by a single element, the single element value in the corresponding head descriptor is returned directly. Otherwise, the element block is not covered by a single element, and the element data at the request address is read from the element block. Here, an element block covered by a single element is one in which all elements have the same value; an element block not covered by a single element is one that contains at least two different element values.
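The read flow of steps S501-S504 can be sketched as follows. This is an illustrative model only: the linear address mapping via `divmod`, the tuple layout of a descriptor and the block size of 4 are assumptions, and single-element coverage is detected by an all-1 write mask as the description of the write flow states.

```python
BLK_SIZE = 4                    # elements per block (assumed)
ALL_ONES = (1 << BLK_SIZE) - 1  # write mask of a single-element block


def read_element(addr, descriptors, blocks):
    """Return the value at a linear element address.

    descriptors[i] is a (write_mask, kVal0) pair for block i and
    blocks[i] holds the element data of block i.
    """
    blk, off = divmod(addr, BLK_SIZE)  # S501: find block and descriptor
    wmask, kval0 = descriptors[blk]
    if wmask == ALL_ONES:              # S502: covered by a single element
        return kval0                   # S503: answer from the descriptor
    return blocks[blk][off]            # S504: read the element data


# Hypothetical data: block 0 is still covered by the value 7,
# block 1 has been fully written with distinct values.
descriptors = [(0b1111, 7), (0b0000, 7)]
blocks = [[None] * BLK_SIZE, [1, 2, 3, 4]]
covered = read_element(2, descriptors, blocks)
written = read_element(6, descriptors, blocks)
```

For the covered block the element data is never touched, so a read of a freshly cleared object needs no access to the element storage at all.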
When a write operation is performed on the storage object, a flow chart of the write operation is shown in fig. 6, and the method includes:
S601, when a write operation request is received, searching a target element block needing to be written with data and a head descriptor corresponding to the target element block according to the write operation request;
S602, writing data into the target element block, and inverting the bits of the write mask in the head descriptor corresponding to the target element block that correspond to the elements written with data;
S603, after all elements of the target element block are written with data, calculating the minimum value and the maximum value in the target element block;
S604, comparing whether the minimum value and the maximum value are equal, and if so, executing step S605;
S605, resetting the write mask in the head descriptor corresponding to the target element block, and writing the minimum value or the maximum value into the single element value in the head descriptor corresponding to the target element block.
Specifically, for a write operation request in the write queue, the target element block is found according to the request address, and the head descriptor corresponding to the target element block is looked up through the address mapping between element blocks and head descriptors. The bits of the write mask in the head descriptor that are selected by the write mask of the request are cleared to 0, and the requested data is written into the target element block. After all elements of the target element block have been written, a reduction operation is performed on the target element block, that is, the minimum value and the maximum value of the element block are calculated. If the minimum value and the maximum value are equal, the element block is covered by a single element; the write mask in the corresponding head descriptor is reset, and the reduced minimum or maximum value is written into the single element value in the head descriptor. If the minimum value and the maximum value are not equal, the current write mask in the corresponding head descriptor is kept, and the minimum value is written into the single element value in the head descriptor. Thus, if the minimum and maximum values of an element block are not equal, the write mask in its head descriptor is not all 1s, and if they are equal, the write mask is restored to all 1s. Whether an element block is covered by a single element can therefore be determined by checking whether the write mask in its head descriptor is all 1s.
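A minimal sketch of the write flow of steps S601-S605 is given below, assuming one element is written per request and the "1 means not written" polarity. The dictionary layout of the descriptor, the function name and the block size are hypothetical.

```python
BLK_SIZE = 4                    # elements per block (assumed)
ALL_ONES = (1 << BLK_SIZE) - 1


def write_element(desc, block, off, value):
    """Write one element and maintain the head descriptor.

    desc is a dict with keys "wmask" and "kval0"; block is the list
    holding the element data of the same block.
    """
    block[off] = value                    # S602: write the data
    desc["wmask"] &= ~(1 << off)          # S602: mark element as written
    if desc["wmask"] == 0:                # S603: every element written
        lo, hi = min(block), max(block)   # reduction over the block
        if lo == hi:                      # S604/S605: single value again
            desc["wmask"] = ALL_ONES
        desc["kval0"] = lo                # min is stored in both cases


# Hypothetical usage: one block overwritten uniformly, one not.
uniform_desc = {"wmask": ALL_ONES, "kval0": 0}
uniform_block = [0] * BLK_SIZE
for i in range(BLK_SIZE):
    write_element(uniform_desc, uniform_block, i, 5)

mixed_desc = {"wmask": ALL_ONES, "kval0": 0}
mixed_block = [0] * BLK_SIZE
for i, v in enumerate([1, 2, 3, 4]):
    write_element(mixed_desc, mixed_block, i, v)
```

After the uniform overwrite the mask returns to all 1s, so later reads of that block can again be answered from the descriptor alone.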
In the second allocation mode, a read or write operation on a storage object in the GPU off-chip memory first performs a hit test on the request address, that is, a lookup of the head descriptor of the element block. If the head descriptor corresponding to the element block is hit, the element block is covered by a single element, and the read or write operation is processed according to the head descriptor. For read operations, the hit test allows requests to be served quickly and reduces the bandwidth demanded from external memory.
When a read operation is performed on a memory object in the GPU off-chip memory, a flowchart of the read operation is shown in fig. 7, and the method includes:
S701, when a read operation request is received, searching a target element block needing to read data according to the read operation request;
S702, testing whether the head descriptor corresponding to the target element block is hit according to the target element block, and if the head descriptor corresponding to the target element block is hit, continuing to execute S703; otherwise, executing S705;
S703, judging whether the target element block is an element block covered by a single element, and if so, continuing to execute S704; otherwise, executing S705;
S704, reading the single element value in the head descriptor corresponding to the target element block;
S705, reading the element data of the target element block;
here, the data read is served through the caching mechanism of L2;
S706, after reading the element data of the target element block, judging whether the target element block is reloaded into the GPU on-chip cache, and if the target element block is reloaded, continuing to execute S707-S709;
S707, calculating the minimum value and the maximum value of the target element block;
S708, comparing whether the minimum value and the maximum value are equal;
S709, when the minimum value and the maximum value are equal, reallocating the head descriptor corresponding to the target element block to the GPU on-chip cache, initially setting the write mask in the head descriptor corresponding to the target element block, and simultaneously writing the minimum value or the maximum value into the single element value in the head descriptor corresponding to the target element block.
Specifically, for a read operation request on an element block, the head descriptor of the element block is first indexed in the on-chip cache L2 according to the request address. If the indexed head descriptor is hit in L2, the single element value in that head descriptor is returned. Otherwise, the request address is processed through L2 as follows. If the element block is present in L2, the requested value is returned by L2. Otherwise, L2 issues a request to external storage; if a replacement occurs for the element block within L2 at this point, the requested element block is considered to be reloaded. For a reloaded element block, the minimum value and the maximum value of the element block are calculated. If they are equal, the element block is covered by a single element, and the corresponding head descriptor is reallocated in the GPU on-chip cache; reallocation includes allocating storage space for the head descriptor, resetting the write mask and rewriting the single element value. Subsequent read and write operations on the element block can then perform the hit test first and update the head descriptor, so reads of the element block are accelerated through the head descriptor. The head descriptors of element blocks covered by a single element value are thus always kept on chip, which reduces off-chip access bandwidth.
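The read flow of steps S701-S709 can be sketched with dictionaries standing in for the on-chip head descriptor store, the L2 cache and the off-chip memory. This is a simplified model under stated assumptions: every L2 miss is treated as a reload, L2 never evicts, addresses are linear, and all names are hypothetical.

```python
BLK_SIZE = 4                    # elements per block (assumed)
ALL_ONES = (1 << BLK_SIZE) - 1


def read_element(addr, hd_cache, l2, offchip):
    """Read one element in the second allocation mode.

    hd_cache maps block index -> {"wmask", "kval0"} for descriptors held
    on chip; l2 and offchip map block index -> list of element values.
    """
    blk, off = divmod(addr, BLK_SIZE)              # S701
    hd = hd_cache.get(blk)                         # S702: hit test
    if hd is not None and hd["wmask"] == ALL_ONES:  # S703
        return hd["kval0"]                         # S704
    if blk not in l2:                              # S705: miss in L2
        l2[blk] = list(offchip[blk])               # S706: block reloaded
        lo, hi = min(l2[blk]), max(l2[blk])        # S707
        if lo == hi:                               # S708/S709
            hd_cache[blk] = {"wmask": ALL_ONES, "kval0": lo}
    return l2[blk][off]


# Hypothetical state: block 0 has an on-chip descriptor, blocks 1 and 2
# exist only off chip.
hd_cache = {0: {"wmask": ALL_ONES, "kval0": 9}}
l2 = {}
offchip = {0: [0, 0, 0, 0], 1: [3, 3, 3, 3], 2: [1, 2, 3, 4]}
from_descriptor = read_element(1, hd_cache, l2, offchip)
from_uniform_block = read_element(5, hd_cache, l2, offchip)
from_mixed_block = read_element(10, hd_cache, l2, offchip)
```

After the read of block 1, its descriptor is back on chip, so a later read of the same block would be served without touching L2 data or off-chip memory.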
When a write operation is performed on a storage object in the GPU off-chip memory, the flow of the write operation is as shown in fig. 8, and the method includes:
S801, when a write operation request is received, searching for the target element block into which data needs to be written according to the write operation request;
S802, testing, according to the target element block, whether the head descriptor corresponding to the target element block is hit; if the head descriptor is hit, continuing with S803; otherwise, proceeding to S808;
S803, writing data into the target element block, and inverting the bits of the write mask in the head descriptor corresponding to the target element block that correspond to the elements into which the data is written;
S804, after data has been written into all elements of the target element block, calculating the minimum value and the maximum value in the target element block;
S805, comparing whether the minimum value and the maximum value are equal; if they are equal, executing S806; otherwise, executing S807;
S806, resetting the write mask in the head descriptor corresponding to the target element block, and writing the minimum value or the maximum value into the single element value in the head descriptor corresponding to the target element block;
S807, removing the head descriptor corresponding to the target element block from the GPU on-chip cache;
S808, writing data into the target element block by executing the write operation in L2 according to the request address: if the target element block is in L2, updating the corresponding data in L2; otherwise, issuing a request to off-chip storage.
Specifically, for a write operation request on an element block, the head descriptor of the element block is looked up in L2 according to the request address. If the descriptor is found in L2, the head descriptor is hit. After the hit, the write mask of the element block is consumed according to the write mask of the request. When the write mask of the element block has been completely consumed, that is, after data has been written into all elements of the element block, a reduction operation is performed on the element block to calculate its minimum value and maximum value. If the minimum value and the maximum value are equal, the element block is covered by a single element; the write mask in the corresponding head descriptor is reset, and the reduced minimum value or maximum value is written into the single element value in the head descriptor. Otherwise, that is, when the minimum value and the maximum value are not equal, the element block is not covered by a single element, and the corresponding head descriptor is removed from the GPU on-chip cache. Subsequent read and write operations on that element block then skip the hit test on the head descriptor, and the element block is treated as an ordinary element block and processed by the normal cache mechanism.
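By way of non-limiting illustration, the write flow of S801 to S808 might be sketched in Python as follows. The function name `write`, the dictionary-based descriptor store, the four-element block size and the mask polarity (a set bit meaning "not yet written") are assumptions of the sketch. The sketch clears the bit of a written element rather than literally inverting it, so that writing the same element twice does not mark it as unwritten again; this is a simplification and not a statement about the hardware.

```python
from dataclasses import dataclass

BLOCK_ELEMS = 4                       # hypothetical block size
FULL_MASK = (1 << BLOCK_ELEMS) - 1    # assumed polarity: 1 = element not yet written


@dataclass
class HeadDescriptor:
    write_mask: int
    single_value: int


def write(descriptors, blocks, addr, i, value):
    """Write one element; `descriptors` models the on-chip descriptor store."""
    blocks[addr][i] = value                    # S803 / S808: the data is always written
    desc = descriptors.get(addr)
    if desc is None:                           # descriptor miss: ordinary cache path only
        return
    desc.write_mask &= ~(1 << i)               # consume the bit of the written element
    if desc.write_mask != 0:                   # some elements are still unwritten
        return
    lo, hi = min(blocks[addr]), max(blocks[addr])   # S804: reduction over the block
    if lo == hi:                               # S806: block is covered by a single element
        desc.write_mask = FULL_MASK
        desc.single_value = lo
    else:                                      # S807: ordinary block, release the descriptor
        del descriptors[addr]


blocks = {0: [None] * BLOCK_ELEMS, 1: [None] * BLOCK_ELEMS}
descriptors = {a: HeadDescriptor(FULL_MASK, 0) for a in blocks}
for i in range(BLOCK_ELEMS):
    write(descriptors, blocks, 0, i, 8)        # every element receives the same value
    write(descriptors, blocks, 1, i, i)        # elements receive different values
```

After the loop, block 0 keeps its descriptor with the single element value 8, while the descriptor of block 1 has been removed and later accesses to it follow the ordinary cache path.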
As can be seen from the above description, the on-chip macro tag stores the head descriptors of the element blocks covered by a single element, and keeping these head descriptors on chip accelerates the reading of elements. In the worst case, the head descriptors are moved out of on-chip storage, so that no on-chip storage is wasted. The element block is then processed entirely by the normal cache mechanism, and its read-write performance is not reduced. At the same time, head descriptors can be rebuilt and stored on chip again, thereby accelerating the reading of elements.
In addition, when the compression option is not enabled for the storage object, writing back to the external memory incurs no additional space overhead. When compression is enabled, only the head descriptor is written back for an element block covered by a single element; for an element block not covered by a single element, the minimum value and the maximum value of the element block are calculated, and both the head descriptor and the element block are written back. The minimum value and the maximum value of the block then do not need to be recalculated during subsequent compression, which facilitates the subsequent data compression of the element block and shortens its compression process.
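By way of non-limiting illustration, the write-back decision for the case in which compression is enabled might be sketched as follows. The function name `write_back_compressed` and the dictionary layout of the written-back record are hypothetical. In particular, carrying the minimum and maximum values inside the written-back descriptor is an assumption made here to show how a later compression pass could reuse them; the description does not state where these values are held.

```python
def write_back_compressed(elements, covered_by_single, single_value):
    """Return the record written to external memory when compression is enabled."""
    if covered_by_single:
        # Only the head descriptor leaves the chip; no element data is transferred.
        return {"descriptor": {"single_value": single_value}, "elements": None}
    lo, hi = min(elements), max(elements)
    # Assumption: min and max travel with the descriptor so that a later
    # compression pass does not need to recompute them.
    return {"descriptor": {"min": lo, "max": hi}, "elements": list(elements)}


uniform = write_back_compressed([4, 4, 4, 4], True, 4)
mixed = write_back_compressed([1, 5, 3, 2], False, 0)
```

In this model the uniform block transfers no element data at all, whereas the mixed block transfers its elements together with the reduction results.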
An embodiment of the present invention further provides an initialization filling and read-write apparatus for a storage object in a GPU off-chip memory, as shown in fig. 9, the apparatus includes:
a storage allocation module 901, configured to allocate storage space for each element block of a storage object and for the head descriptor corresponding to each element block when an initialization filling operation request is received, where the head descriptor corresponding to an element block includes a write mask of the element block and a single element value, the write mask is multi-bit data in which each bit indicates whether one of the elements in the element block has been written, and the single element value indicates a background color or a fill pattern of the element block;
an initialization filling module 902, configured to set, after the allocation of storage space is completed, all bits of the write mask in the head descriptor corresponding to the element block to their initial state, to indicate that no element of the element block has been written with data, and to write a background color or a fill pattern specified by a user into the single element value in the head descriptor corresponding to the element block.
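By way of non-limiting illustration, the two modules above might be modeled as in the following Python sketch, which shows that an initialization fill touches only the head descriptors and leaves the element storage unwritten. The names `HeadDescriptor`, `init_fill`, `BLOCK_ELEMS` and `FULL_MASK`, the block size of 16 elements, and the mask polarity (a set bit meaning "not yet written") are assumptions of the sketch.

```python
from dataclasses import dataclass

BLOCK_ELEMS = 16                      # hypothetical block size
FULL_MASK = (1 << BLOCK_ELEMS) - 1    # assumed polarity: 1 = element not yet written


@dataclass
class HeadDescriptor:
    write_mask: int       # one bit per element of the block
    single_value: int     # background color or fill pattern


def init_fill(num_blocks, fill_value):
    """Allocate element blocks and head descriptors; only descriptors are written."""
    # Element storage is reserved but deliberately left unwritten (None here).
    blocks = [[None] * BLOCK_ELEMS for _ in range(num_blocks)]
    # One descriptor write per block replaces BLOCK_ELEMS element writes.
    descriptors = [HeadDescriptor(FULL_MASK, fill_value) for _ in range(num_blocks)]
    return blocks, descriptors


blocks, descriptors = init_fill(num_blocks=4, fill_value=0x00FF00)
element_writes = sum(e is not None for b in blocks for e in b)
```

With four blocks of 16 elements each, the fill in this model performs four descriptor writes and no element writes.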
Optionally, the storage allocation module 901 is configured to allocate the element block and its corresponding header descriptor in the GPU off-chip memory at the same time.
In this storage allocation manner, as shown in fig. 10, the apparatus further includes a first read operation module 903, where the first read operation module 903 includes:
a first read lookup unit 9031, configured to, when a read operation request is received, lookup a target element block that needs to read data and a header descriptor corresponding to the target element block according to the read operation request;
a first judging unit 9032, configured to judge whether the target element block is an element block covered by a single element;
a first reading unit 9033, configured to read a single-element value in a header descriptor corresponding to the target element block when the target element block is an element block covered by a single element, and otherwise, read element data of the target element block.
Further, as shown in fig. 10, the apparatus further includes a first write operation module 904, where the first write operation module 904 includes:
a first write lookup unit 9041, configured to, when a write operation request is received, lookup a target element block to which data needs to be written and a header descriptor corresponding to the target element block according to the write operation request;
a first writing unit 9042, configured to write data into the target element block, and to invert the bits of the write mask in the head descriptor corresponding to the target element block that correspond to the elements into which the data is written;
a first calculation unit 9043, configured to calculate a minimum value and a maximum value in the target element block after data is written in all elements of the target element block;
a first comparing unit 9044, configured to compare whether the minimum value and the maximum value are equal;
a first resetting unit 9045, configured to reset a write mask in a header descriptor corresponding to the target element block when the minimum value and the maximum value are equal, and write the minimum value or the maximum value into a single element value in the header descriptor corresponding to the target element block.
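By way of non-limiting illustration, the first read operation module and the first write operation module, in which the head descriptor is allocated together with its element block, might be modeled as follows. The class name `ElementBlock` and the helper functions `read` and `write` are hypothetical, as are the four-element block size and the mask polarity (a set bit meaning "not yet written"). Two points are assumptions of the sketch rather than statements of the embodiment: the bit of a written element is cleared instead of inverted, and for a partially written block an element whose bit is still set is served from the single element value.

```python
BLOCK_ELEMS = 4                       # hypothetical block size
FULL_MASK = (1 << BLOCK_ELEMS) - 1    # assumed polarity: 1 = element not yet written


class ElementBlock:
    """An element block together with the fields of its head descriptor."""

    def __init__(self, fill_value):
        self.mask = FULL_MASK
        self.single = fill_value
        self.data = [None] * BLOCK_ELEMS


def read(block, i):
    if block.mask == FULL_MASK:       # block covered by a single element
        return block.single
    if (block.mask >> i) & 1:         # assumption: unwritten element of a partial block
        return block.single
    return block.data[i]              # element data


def write(block, i, value):
    block.data[i] = value
    block.mask &= ~(1 << i)           # mark element i as written
    if block.mask == 0:               # all elements written: reduce the block
        lo, hi = min(block.data), max(block.data)
        if lo == hi:                  # uniform again: reset mask, store the value
            block.mask = FULL_MASK
            block.single = lo
```

In this model, filling a block with one value returns it to the state in which reads are answered from the descriptor alone.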
Optionally, the storage allocation module 901 is configured to allocate the element block in a GPU off-chip memory, and allocate a header descriptor corresponding to the element block in a GPU on-chip cache.
In this storage allocation manner, as shown in fig. 11, the apparatus further includes a second read operation module 905, where the second read operation module 905 includes:
a second read lookup unit 9051, configured to, when a read operation request is received, lookup a target element block whose data needs to be read according to the read operation request;
a read hit test unit 9052, configured to test whether a header descriptor corresponding to the target element block is hit according to the target element block;
a second judging unit 9053, configured to, when a header descriptor corresponding to the target element block is hit, judge whether the target element block is an element block covered by a single element;
a second reading unit 9054, configured to read the single element value in the head descriptor corresponding to the target element block when the target element block is an element block covered by a single element, and further configured to read the element data of the target element block when the head descriptor corresponding to the target element block is not hit or the target element block is not covered by a single element.
Further, the second read operation module 905 further includes:
a third determining unit 9055, configured to determine, after the element data of the target element block is read, whether the target element block has been reloaded into the GPU on-chip cache;
a second calculating unit 9056, configured to calculate a minimum value and a maximum value in the target element block when the target element block has been reloaded into the GPU on-chip cache;
a second comparing unit 9057, configured to compare whether the minimum value and the maximum value are equal to each other;
a second resetting unit 9058, configured to, if the minimum value is equal to the maximum value, reallocate the head descriptor corresponding to the target element block to a GPU on-chip cache, initially set a write mask in the head descriptor corresponding to the target element block, and write the minimum value or the maximum value into a single element value in the head descriptor corresponding to the target element block.
Further, as shown in fig. 11, the apparatus further includes a second write operation module 906, where the second write operation module 906 includes:
a second write lookup unit 9061, configured to, when a write operation request is received, look up a target element block to which data needs to be written according to the write operation request;
a write hit test unit 9062, configured to test whether a header descriptor corresponding to the target element block is hit according to the target element block;
a second writing unit 9063, configured to write data to the target element block when the head descriptor corresponding to the target element block is hit, and to invert the bits of the write mask in the head descriptor corresponding to the target element block that correspond to the elements into which the data is written;
a third calculation unit 9064, configured to calculate a minimum value and a maximum value in the target element block after data is written in all elements of the target element block;
a third comparing unit 9065, configured to compare whether the minimum value and the maximum value are equal;
a third resetting unit 9066, configured to reset a write mask in the header descriptor corresponding to the target element block when the minimum value and the maximum value are equal, write the minimum value or the maximum value into a single element value in the header descriptor corresponding to the target element block, and otherwise remove the header descriptor corresponding to the target element block from the GPU on-chip cache.
The apparatus for initialization filling, reading and writing of a storage object in a GPU off-chip memory provided by the embodiment of the present invention can reduce the write transfer load when an initialization filling operation is performed, thereby reducing latency and power consumption and saving storage bandwidth. Ordinary read and write operations on the storage object can likewise still be optimized by means of the head descriptor.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method for initializing, filling, reading and writing a storage object in a GPU off-chip memory is characterized by comprising the following steps:
when an initialization padding operation request is received, allocating storage space for each element block of a storage object and a head descriptor corresponding to each element block, wherein the head descriptor corresponding to the element block comprises a write mask and a single element value of the element block, the write mask is multi-bit data, each bit is used for indicating whether one element in the element block is written or not, and the single element value is used for indicating a background color or a padding mode of the element block;
after the storage space is allocated, initially setting all bits of a write mask in a head descriptor corresponding to the element block to indicate that all elements of the element block are not written with data, and writing a background color or a filling mode specified by a user into a single element value in the head descriptor corresponding to the element block.
2. The method of claim 1, wherein allocating storage space for each element block of a storage object and a corresponding header descriptor of each element block comprises:
allocating the element blocks in a GPU off-chip memory, and allocating the head descriptors corresponding to the element blocks in a GPU on-chip cache.
3. The method of claim 2, further comprising:
when a read operation request is received, searching a target element block needing to read data according to the read operation request;
testing whether a head descriptor corresponding to the target element block is hit or not according to the target element block;
if the head descriptor corresponding to the target element block is hit, judging whether the target element block is an element block covered by a single element, and if the head descriptor is not hit, reading element data of the target element block;
and when the target element block is an element block covered by a single element, reading a single element value in a head descriptor corresponding to the target element block, otherwise, reading element data of the target element block.
4. The method of claim 3, further comprising:
after reading the element data of the target element block, judging whether the target element block has been reloaded into the GPU on-chip cache;
when the target element block has been reloaded into the GPU on-chip cache, calculating the minimum value and the maximum value in the target element block;
comparing whether the minimum value and the maximum value are equal;
if the minimum value and the maximum value are equal, redistributing the head descriptor corresponding to the target element block into a GPU on-chip cache, initially setting a write mask in the head descriptor corresponding to the target element block, and simultaneously writing the minimum value or the maximum value into a single element value in the head descriptor corresponding to the target element block.
5. The method of claim 2, further comprising:
when a write operation request is received, searching a target element block needing to be written with data according to the write operation request;
testing whether a head descriptor corresponding to the target element block is hit or not according to the target element block;
if the head descriptor corresponding to the target element block is hit, writing data into the target element block, and inverting the bits of the write mask in the head descriptor corresponding to the target element block that correspond to the elements into which the data is written;
after all elements of the target element block are written with data, calculating the minimum value and the maximum value in the target element block;
comparing whether the minimum value and the maximum value are equal;
if the minimum value and the maximum value are equal, resetting a write mask in a head descriptor corresponding to the target element block, and writing the minimum value or the maximum value into a single element value in the head descriptor corresponding to the target element block, otherwise, removing the head descriptor corresponding to the target element block from a GPU on-chip cache.
6. An apparatus for initializing, filling, reading, and writing memory objects in a GPU off-chip memory, the apparatus comprising:
the storage allocation module is used for allocating storage space for each element block of a storage object and a head descriptor corresponding to each element block when an initial filling operation request is received, wherein the head descriptor corresponding to the element block comprises a write mask and a single element value of the element block, the write mask is multi-bit data, each bit is used for indicating whether one element in the element block is written or not, and the single element value is used for indicating a background color or a filling mode of the element block;
and the initialization filling module is used for initially setting all bits of a write mask in the head descriptor corresponding to the element block after the storage space is allocated, so as to indicate that all elements of the element block are not written with data, and simultaneously writing a background color or a filling mode specified by a user into a single element value in the head descriptor corresponding to the element block.
7. The apparatus of claim 6, wherein the memory allocation module is configured to allocate the element block in an off-chip memory of the GPU and allocate a header descriptor corresponding to the element block in an on-chip cache of the GPU.
8. The apparatus of claim 7, further comprising a second read operation module, the second read operation module comprising:
the second reading and searching unit is used for searching a target element block needing to read data according to a reading operation request when the reading operation request is received;
a read hit test unit, configured to test whether a header descriptor corresponding to the target element block is hit according to the target element block;
a second judging unit, configured to judge whether the target element block is an element block covered by a single element when a header descriptor corresponding to the target element block is hit;
a second reading unit, configured to read the single element value in the head descriptor corresponding to the target element block when the target element block is an element block covered by a single element, and further configured to read the element data of the target element block when the head descriptor corresponding to the target element block is not hit or the target element block is not covered by a single element.
9. The apparatus of claim 8, wherein the second read operation module further comprises:
a third judging unit, configured to judge, after the element data of the target element block is read, whether the target element block has been reloaded into the GPU on-chip cache;
a second calculation unit, configured to calculate a minimum value and a maximum value in the target element block when the target element block has been reloaded into the GPU on-chip cache;
a second comparing unit for comparing whether the minimum value and the maximum value are equal;
and a second resetting unit, configured to, if the minimum value and the maximum value are equal, reallocate the head descriptor corresponding to the target element block to a GPU on-chip cache, initially set a write mask in the head descriptor corresponding to the target element block, and write the minimum value or the maximum value into a single element value in the head descriptor corresponding to the target element block.
10. The apparatus of claim 7, further comprising a second write module, the second write module comprising:
the second writing searching unit is used for searching a target element block needing to be written with data according to the writing operation request when the writing operation request is received;
a write hit test unit, configured to test whether a header descriptor corresponding to the target element block is hit according to the target element block;
a second writing unit, configured to write data to the target element block when the head descriptor corresponding to the target element block is hit, and to invert the bits of the write mask in the head descriptor corresponding to the target element block that correspond to the elements into which the data is written;
a third calculation unit configured to calculate a minimum value and a maximum value in the target element block after data is written to all elements of the target element block;
a third comparing unit for comparing whether the minimum value and the maximum value are equal;
and a third resetting unit, configured to reset a write mask in the header descriptor corresponding to the target element block when the minimum value and the maximum value are equal, write the minimum value or the maximum value into a single element value in the header descriptor corresponding to the target element block, and otherwise remove the header descriptor corresponding to the target element block from the GPU on-chip cache.
CN201910731295.2A 2019-08-08 2019-08-08 Method and device for initializing, filling, reading and writing storage objects in GPU off-chip memory Pending CN112348733A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910731295.2A CN112348733A (en) 2019-08-08 2019-08-08 Method and device for initializing, filling, reading and writing storage objects in GPU off-chip memory

Publications (1)

Publication Number Publication Date
CN112348733A true CN112348733A (en) 2021-02-09

Family

ID=74366786

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910731295.2A Pending CN112348733A (en) 2019-08-08 2019-08-08 Method and device for initializing, filling, reading and writing storage objects in GPU off-chip memory

Country Status (1)

Country Link
CN (1) CN112348733A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination