CN110457238B - Method for slowing down GPU (graphics processing Unit) access request and pause when instructions access cache

Method for slowing down GPU (graphics processing Unit) access request and pause when instructions access cache

Info

Publication number
CN110457238B
Authority
CN
China
Prior art keywords
access
fifo queue
access request
cache
memory
Prior art date
Legal status
Active
Application number
CN201910601175.0A
Other languages
Chinese (zh)
Other versions
CN110457238A (en)
Inventor
李炳超 (Li Bingchao)
Current Assignee
Civil Aviation University of China
Original Assignee
Civil Aviation University of China
Priority date
Filing date
Publication date
Application filed by Civil Aviation University of China
Priority to CN201910601175.0A
Publication of CN110457238A
Application granted
Publication of CN110457238B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806 Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/0811 Multiuser, multiprocessor or multiprocessing cache systems with multilevel cache hierarchies
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802 Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806 Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/0842 Multiuser, multiprocessor or multiprocessing cache systems for multiprocessing or multitasking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02 Addressing or allocation; Relocation
    • G06F12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/12 Replacement control
    • G06F12/121 Replacement control using replacement algorithms
    • G06F12/128 Replacement control using replacement algorithms adapted to multidimensional cache systems, e.g. set-associative, multicache, multiset or multilevel
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2212/00 Indexing scheme relating to accessing, addressing or allocation within memory systems or architectures
    • G06F2212/10 Providing a specific technical effect
    • G06F2212/1016 Performance improvement
    • G06F2212/1024 Latency reduction
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention discloses a method for mitigating the stalls of GPU (graphics processing unit) memory access requests and memory access instructions when they access the cache, which comprises the following steps: the memory access request at the head of the FIFO queue accesses the L1 cache and its tag is compared with the tags in the L1 cache; if a memory access request suffers a reservation stall, it is popped from the head of the FIFO queue and placed at the tail of the FIFO queue; a first control logic controls the routing of the memory access request after it is popped from the head of the FIFO queue; and a second control logic together with a first control signal and a third control signal is constructed to pipeline the memory access instructions between the thread bundle scheduler and the load/store unit, so that the next memory access instruction can be processed once the address merging unit in the load/store unit has merged all the memory access requests, and memory access requests can be generated and stored whenever the FIFO queue has a free entry. Compared with the prior art, the method reduces the stall time of memory access requests and increases their processing speed, and likewise reduces the waiting time of memory access instructions and increases their processing speed.

Description

Method for slowing down memory access request of GPU (graphics processing Unit) and pause when instructions access cache
Technical Field
The invention relates to the field of GPU (graphics processor) cache architecture, and in particular to a processing method for mitigating the stalls of GPU memory access requests and memory access instructions when they access the L1 cache (first-level cache).
Background
In recent years, the GPU has evolved into a multi-threaded, high-performance, parallel general-purpose computing platform. Its computing power is still growing rapidly, attracting more and more applications to be accelerated on GPUs.
At the software level, when an application runs on a GPU, its tasks are subdivided into many threads that can run independently, and these threads are organized into thread blocks. At the hardware level, a GPU consists of multiple streaming multiprocessors, an on-chip interconnection network and memory. Each streaming multiprocessor contains a register file, scalar processors, a load/store unit, shared memory, caches and other resources that support multithreaded parallel execution. Threads are dispatched to the streaming multiprocessors in units of thread blocks, and inside a streaming multiprocessor the hardware subdivides each thread block into thread bundles (warps), which are the most basic execution units of the GPU [1]. In an NVIDIA GPU, a thread bundle consists of 32 threads that execute in parallel.
When a thread bundle executes a memory access instruction, each thread generates a memory access request. To reduce the number of requests, the memory access requests generated by the same thread bundle are merged inside the streaming multiprocessor by an address merging unit. If the addresses accessed by the requests of a thread bundle fall within the same data block (e.g. 128 bytes), they can be combined into a single memory access request [2]. However, because some programs have irregular memory access patterns, a memory access instruction of a thread bundle may still produce many memory access requests even after address merging; these requests are placed into a FIFO (first-in first-out) queue and cause bursts of accesses to the cache. On the other hand, the cache capacity inside a streaming multiprocessor is small (16 KB to 96 KB) while the number of threads can reach several thousand, so the average cache capacity per thread is only a few dozen bytes and the cache miss rate is very high. When a memory access request misses in the cache, a cache-line is selected according to the replacement policy, its data is replaced, and the request then accesses the next-level memory (the L2 cache (second-level cache) or DRAM (dynamic random access memory)). From the moment the old data is replaced until the new data returns from the next-level memory and is stored into the cache-line, the cache-line is said to be in a reserved state. A cache-line in the reserved state cannot be replaced by other missing memory access requests. If too many memory access requests place the cache-lines in the reserved state, a subsequent request that misses finds nothing it can replace and stalls [3] until the data of some cache-line returns and its reserved state ends; this phenomenon is called a reservation stall. The GPU processes the memory access requests of a thread bundle in first-in first-out order, and since a request typically needs hundreds of cycles to access the next-level memory, the other requests in the load/store unit must wait those hundreds of cycles until the stalled request ahead of them is resolved before they can be processed, which lowers the processing efficiency of memory access requests.
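To make the reservation-stall behaviour described above concrete, the following minimal C++ sketch models one cache set whose lines can be in a reserved state, and the baseline (prior-art) policy in which a miss that finds every line reserved stalls the request at the head of the FIFO queue. The types and names (Line, CacheSet, allocate_on_miss) are illustrative assumptions, not taken from the patent.

```cpp
// Minimal sketch (assumption): baseline behaviour in which a miss that finds
// every cache-line of its set reserved stalls the FIFO head.
#include <array>
#include <cstdint>
#include <deque>
#include <iostream>

enum class LineState { Invalid, Valid, Reserved };

struct Line {
    uint64_t tag = 0;
    LineState state = LineState::Invalid;
};

struct CacheSet {
    std::array<Line, 4> ways{};  // 4-way set-associative, as in the embodiment

    // Returns true if the miss could allocate a line (which becomes reserved);
    // false means every way is already reserved -> reservation stall.
    bool allocate_on_miss(uint64_t tag) {
        for (Line& l : ways) {
            if (l.state != LineState::Reserved) {
                l.tag = tag;
                l.state = LineState::Reserved;  // waiting for data from L2/DRAM
                return true;
            }
        }
        return false;                           // reservation stall
    }
};

int main() {
    CacheSet set0;
    std::deque<uint64_t> fifo = {0xA0, 0xA1, 0xA2, 0xA3, 0xA4};  // 5 misses to one set

    while (!fifo.empty()) {
        uint64_t req = fifo.front();
        if (set0.allocate_on_miss(req)) {
            fifo.pop_front();  // request proceeds to the next-level memory
            std::cout << std::hex << req << " allocated a reserved line\n";
        } else {
            std::cout << std::hex << req << " reservation stall: FIFO head blocked\n";
            break;             // baseline: every request behind it must wait
        }
    }
}
```

In this baseline model the fifth request blocks the head of the queue, and everything behind it waits for a reserved line to be released, which is exactly the inefficiency the invention targets.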
On the other hand, the load/store unit can currently hold only one thread bundle memory access instruction at a time. That is, until all the memory access requests of the instruction currently in the load/store unit have been processed, the thread bundle scheduler cannot send another memory access instruction to the load/store unit, even if the FIFO queue has free entries. If a memory access request of the current instruction suffers a reservation stall, the next memory access instruction must also wait hundreds of cycles, which lowers the processing efficiency of thread bundle memory access instructions.
References
[1] E. Lindholm, J. Nickolls, S. Oberman, J. Montrym, "NVIDIA Tesla: A Unified Graphics and Computing Architecture", IEEE Micro, vol. 28, no. 2, pp. 39-55, 2008.
[2] NVIDIA Corporation, NVIDIA CUDA C Programming Guide, 2019.
[3] W. Jia, K. A. Shaw, M. Martonosi, "MRPB: Memory Request Prioritization for Massively Parallel Processors", International Symposium on High Performance Computer Architecture, pp. 272-283, 2014.
Disclosure of Invention
The invention provides a method for mitigating the stalls of GPU memory access requests and of instructions when they access the cache. The method reorders memory access requests that suffer a reservation stall, which reduces the stall time of memory access requests in the load/store unit and improves their processing efficiency; in addition, by pipelining memory access instructions, it reduces the waiting time of memory access instructions outside the load/store unit and improves their processing efficiency. The method is described in detail below:
a method for slowing down memory access requests of a GPU and stopping when instructions access cache comprises the following steps:
accessing the L1 cache by the access request positioned at the head of the FIFO queue, comparing the tag of the access request with the tag in the L1 cache, if the access request with reservation pause exists, popping the access request from the head of the FIFO queue, and placing the access request into the tail of the FIFO queue;
the first control logic controls the trend of the memory access request after being popped from the FIFO queue head;
and constructing a second control logic, a first control signal and a third control signal for performing pipeline processing on the access instructions between the thread bundle scheduler and the reading unit, so that the next access instruction can be processed when the address merging unit in the reading unit merges all the access requests, and the access requests can be generated and stored when free items exist in the FIFO queue.
The first control logic controls the routing of the memory access request after it is popped from the head of the FIFO queue as follows:
when the access result of the memory access request in the L1 cache is a reservation stall, the second control signal is false and becomes true after passing through the inverter;
the first tri-state gate is then in the conducting state and the second tri-state gate is in the high-impedance state, so the memory access request is transmitted to the tail of the FIFO queue after being popped from the head of the FIFO queue.
Further, constructing the second control logic and the first and third control signals for pipelining the memory access instructions between the thread bundle scheduler and the load/store unit specifically comprises:
1) if the memory access requests have not all been merged by the address merging unit, the third control signal is false and the thread bundle scheduler is informed that it cannot send another memory access instruction to the load/store unit; otherwise, the third control signal is true and the thread bundle scheduler is informed that it may send another memory access instruction to the load/store unit;
2) the state of whether the FIFO queue is full is sent to the address merging unit through the first control signal so as to control the generation of memory access requests.
Sending the state of whether the FIFO queue is full to the address merging unit through the first control signal to control the generation of memory access requests specifically comprises:
if the FIFO queue is full, the first control signal is false and the address merging unit is told to pause merging memory access requests until the FIFO queue has a free entry;
otherwise, the address merging unit is told to continue merging memory access requests and to place the merged requests at the tail of the FIFO queue.
Preferably, the method further comprises: performing conflict handling when a memory access request generated by the address merging unit and a memory access request popped from the head of the FIFO queue are to be placed at the tail of the FIFO queue in the same cycle.
The conflict handling specifically comprises:
the memory access request popped from the head of the FIFO queue is given the higher priority and is placed at the tail of the FIFO queue through the second control logic;
the address merging unit pauses the generation of new memory access requests until no request popped from the head of the FIFO queue needs to be placed at the tail of the FIFO queue.
Further, placing the memory access request at the tail of the FIFO queue through the second control logic works as follows:
when the access result of the memory access request in the L1 cache is a hit or a miss, the second control signal is true, input path2 of the multiplexer is selected, and the memory access request generated by the address merging unit is placed at the tail of the FIFO queue;
when the access result of the memory access request in the L1 cache is a reservation stall, the second control signal is false, input path1 of the multiplexer is selected, and the memory access request popped from the head of the FIFO queue is placed at the tail of the FIFO queue.
The technical scheme provided by the invention has the following beneficial effects:
1. the invention reorders memory access requests that suffer a reservation stall, so that subsequent memory access requests can keep accessing the L1 cache, which reduces the stall time of memory access requests;
2. other memory access instructions no longer have to wait until all the memory access requests of the current instruction in the load/store unit have been processed; as soon as the address merging unit in the load/store unit has merged the memory access requests of all threads, the thread bundle scheduler can send another memory access instruction to the load/store unit for processing, which reduces the waiting time of memory access instructions and improves their processing efficiency.
Drawings
FIG. 1 is a schematic structural diagram of the scheme for mitigating the stalls of GPU memory access requests and memory access instructions in the L1 cache according to the present invention;
FIG. 2 is a schematic diagram of a reservation stall occurring in a memory access request;
FIG. 3 is a graph comparing the running results after the present invention is applied.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.
Example 1
Referring to FIG. 1, an embodiment of the present invention provides a method for mitigating the stalls of GPU memory access requests and of instructions when they access the cache, the method comprising the following steps:
101: comparing tag (label) of the access request with tag in L1 cache, and if the access request with reservation pause exists, reordering FIFO queue;
wherein, the access request at the head of FIFO (first in first out) accesses L1 cache, firstly comparing the tag of the access request with the tag in L1 cache, including the following three conditions:
if cache hit occurs, popping the access request from the FIFO queue head, and further accessing the hit cache-line; or the like, or a combination thereof,
if cache miss occurs, popping the access request from the FIFO queue head, and sending the access request to a next-level memory; or the like, or, alternatively,
if the reservation pause occurs, the memory access request is popped out from the head of the FIFO queue and is placed at the tail of the FIFO queue, so that other memory access requests in the FIFO queue can continuously access the L1 cache in the next period, the pause is avoided, and the processing speed of the memory access request is accelerated.
To this end, the embodiment of the present invention designs a data path1 connecting the head of the FIFO queue to its tail, and a first control logic 1 that controls the routing of a memory access request after it is popped from the head of the FIFO queue.
The data path1 is a data line that carries the memory access request information, which generally includes: the address, the thread bundle index and the read/write information.
The first control logic 1 controls the routing of the memory access request popped from the head of the FIFO queue as follows: when the access result r of the memory access request in the L1 cache is a hit or a miss, the control signal c2 is true and becomes false after the inverter; tri-state gate 1 is therefore in the high-impedance state and tri-state gate 2 is in the conducting state, meaning the memory access request simply leaves the queue after being popped from the head. When the access result r is a reservation stall, the control signal c2 is false and becomes true after the inverter; tri-state gate 1 is therefore in the conducting state and tri-state gate 2 is in the high-impedance state, meaning the memory access request is sent to the tail of the FIFO queue after being popped from the head.
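The following behavioural sketch in C++ (an assumption for illustration, not the patent's circuit) models control logic 1: the tag-compare result r drives control signal c2, and the inverter plus the two tri-state gates are represented by a single branch that either removes the popped request or re-queues it over data path1. The names Request, AccessResult and process_head are invented for the sketch.

```cpp
// Behavioural sketch (assumption) of control logic 1: c2 selects between
// removing the popped request and re-queuing it at the FIFO tail.
#include <deque>
#include <iostream>
#include <string>

enum class AccessResult { Hit, Miss, ReservationStall };

struct Request { std::string name; };

// One cycle of FIFO-head processing; returns c2 for observation.
bool process_head(std::deque<Request>& fifo, AccessResult r) {
    Request req = fifo.front();
    fifo.pop_front();                                  // always popped from the head
    bool c2 = (r != AccessResult::ReservationStall);   // true on hit or miss
    if (c2) {
        // tri-state gate 2 conducting: the request leaves the queue
        // (hit -> access the cache-line, miss -> go to next-level memory)
        std::cout << req.name << " popped and removed (hit/miss)\n";
    } else {
        // !c2 after the inverter: tri-state gate 1 conducting,
        // the request travels over data path1 back to the FIFO tail
        fifo.push_back(req);
        std::cout << req.name << " reservation stall -> re-queued at the tail\n";
    }
    return c2;
}

int main() {
    std::deque<Request> fifo = {{"req-a4"}, {"req-a5"}};
    process_head(fifo, AccessResult::ReservationStall);  // req-a4 goes to the tail
    process_head(fifo, AccessResult::Miss);              // req-a5 proceeds next cycle
}
```

The re-queued request simply retries the L1 cache when it reaches the head again, by which time a reserved cache-line may have been released.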
102: carrying out pipeline processing on the access instruction;
the method specifically comprises the following steps:
1) If the access requests generated by the current access instruction are not completely synthesized by the address merging unit, the control signal c3 is false, and the thread bundle scheduler is informed that other access instructions cannot be sent to the reading unit;
2) If the access requests generated by the current access instruction are completely synthesized by the address merging unit, the control signal c3 is true and informs the thread bundle scheduler that other access instructions can be sent to the reading unit;
3) Continuously detecting the state of the FIFO queue;
if the FIFO queue is full, the control signal c1 is false, and the address merging unit is informed to suspend merging the access requests until the FIFO queue has an idle entry;
if the FIFO is not full, the control signal c1 is true, and the address merging unit is informed that the access requests can be merged continuously and placed at the tail of the FIFO queue.
For this purpose, the FIFO controller sends the status of whether the FIFO queue is full to the address merge unit via control signal c1 to control the generation of the access request.
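A minimal sketch of the two hand-shake signals of step 102 is given below, under assumed names (LoadStoreUnitState, c1, c3): c3 tells the thread bundle scheduler whether the load/store unit can accept the next memory access instruction, and c1 tells the address merging unit whether the FIFO queue still has a free entry.

```cpp
// Illustrative sketch (assumed names) of control signals c1 and c3.
#include <cstddef>
#include <deque>
#include <iostream>

struct LoadStoreUnitState {
    std::size_t fifo_capacity = 32;          // 32 FIFO entries, as in the embodiment
    std::deque<int> fifo;                    // pending memory access requests
    std::size_t requests_left_to_merge = 0;  // of the instruction currently held
};

// c3: true once the address merging unit has produced every request of the
// current instruction, so the scheduler may issue the next instruction.
bool c3(const LoadStoreUnitState& u) { return u.requests_left_to_merge == 0; }

// c1: true while the FIFO queue is not full, so merging may continue.
bool c1(const LoadStoreUnitState& u) { return u.fifo.size() < u.fifo_capacity; }

int main() {
    LoadStoreUnitState u;
    u.requests_left_to_merge = 3;            // inst-a is still being merged

    for (int req = 0; u.requests_left_to_merge > 0; ++req) {
        if (!c1(u)) { std::cout << "FIFO full: merging paused\n"; break; }
        u.fifo.push_back(req);               // merged request enters the tail
        --u.requests_left_to_merge;
    }
    std::cout << "c3 = " << std::boolalpha << c3(u)
              << " -> scheduler may send the next instruction\n";
}
```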
103: conflict handling is performed when a memory access request generated by the address merging unit and a memory access request popped from the head of the FIFO queue are to be placed at the tail of the FIFO queue in the same cycle.
If a memory access request generated by the address merging unit and a memory access request popped from the head of the FIFO queue need to be placed at the tail of the FIFO queue in the same cycle, the request popped from the head of the FIFO queue is given the higher priority and is placed at the tail first. During that cycle the address merging unit pauses the generation of new memory access requests, until no request popped from the head of the FIFO queue needs to be placed at the tail of the FIFO queue.
To this end, a control logic 2 is designed at the tail of the FIFO queue, and the tag comparison result r of the memory access request is used as the control signal that selects the input of the FIFO queue tail. When the access result r of the memory access request in the L1 cache is a hit or a miss, the control signal c2 is true and input path2 of the multiplexer in control logic 2 is selected, meaning the memory access request generated by the address merging unit is placed at the tail of the FIFO queue; when the access result r is a reservation stall, the control signal c2 is false and input path1 of the multiplexer in control logic 2 is selected, meaning the memory access request popped from the head of the FIFO queue is placed at the tail of the FIFO queue.
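A sketch of control logic 2 follows (assumed behaviour, not the patent's circuit): the multiplexer at the FIFO tail is modelled as a function that, in one cycle, writes either the re-queued request arriving over path1 or the newly merged request arriving over path2, with path1 winning the conflict. The names Request and fifo_tail_mux are invented for the sketch, and std::optional simply models "a request waiting to be written this cycle".

```cpp
// Sketch (assumption) of control logic 2 at the FIFO tail: the re-queued
// request (path1) has priority; the address merging unit holds its request.
#include <deque>
#include <iostream>
#include <optional>
#include <string>

struct Request { std::string name; };

// One tail-write cycle. c2 is the tag-compare result of the head request:
// false means a reservation stall, i.e. a request is arriving over path1.
void fifo_tail_mux(std::deque<Request>& fifo,
                   std::optional<Request>& from_path1,   // popped, stalled request
                   std::optional<Request>& from_merger,  // path2
                   bool c2) {
    if (!c2 && from_path1) {                 // mux selects path1
        fifo.push_back(*from_path1);
        from_path1.reset();
        std::cout << "path1 wins the tail; merger pauses this cycle\n";
    } else if (from_merger) {                // mux selects path2
        fifo.push_back(*from_merger);
        from_merger.reset();
        std::cout << "merged request written to the tail\n";
    }
}

int main() {
    std::deque<Request> fifo;
    std::optional<Request> stalled = Request{"req-a4"};
    std::optional<Request> merged  = Request{"req-b0"};

    fifo_tail_mux(fifo, stalled, merged, /*c2=*/false);  // conflict: req-a4 first
    fifo_tail_mux(fifo, stalled, merged, /*c2=*/true);   // next cycle: req-b0
}
```

This matches the arbitration described above: the stalled request re-enters the queue immediately, and the merged request is delayed by only one cycle.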
Example 2
Embodiment 1 of the present invention is further introduced and verified below against the way the prior art handles reservation stalls of memory access requests, described in detail as follows:
the FIFO queue of the memory access request has 32 entries, and the GPU flows the number of the multiple processors: 15; number of DRAM channels: 6; maximum number of threads of streaming multiprocessor: 1536; streaming multiprocessor register file capacity: 128KB; shared memory capacity: 48KB; l1 cache: 4-way set association, wherein the cache-line size is 128 bytes, 32 sets and the total capacity is 16KB; l2 cache (second level cache): 8-way set association, cache-line size 128 bytes, total capacity 128KB. L1 cache access latency: 1 period; l2 cache access latency: 120 periods; DRAM access latency: 220 period.
As shown in FIG. 2, assume that in the initial state all cache-lines in the L1 cache can be allocated (cache cold start), and that the memory access requests stored in the FIFO queue are req-a0, req-a1, req-a2, ..., req-a20, all generated by memory access instruction inst-a. Based on the address mapping, req-a0 to req-a4 will access set-0 of the L1 cache and req-a5 to req-a9 will access set-1.
Following the first-in first-out order, req-a0 accesses the L1 cache first and causes a cache miss; one cache-line in set-0 is allocated to req-a0 and enters the reserved state (R), req-a0 is sent to the next-level memory, and the FIFO controller pops req-a0 from the head of the FIFO queue. In the next three cycles the remaining three cache-lines of set-0 are allocated to req-a1, req-a2 and req-a3 respectively, so that all cache-lines of set-0 are now in the reserved state (R). When req-a4 then accesses set-0 and misses, there is no allocatable cache-line left in set-0, so req-a4 suffers a reservation stall; the FIFO controller cannot pop req-a4 from the head of the FIFO queue but must wait for req-a0, req-a1, req-a2 or req-a3 to return from the next-level memory and cancel the reserved state of the corresponding cache-line. Although requests such as req-a5 do not need to access set-0 and therefore do not need to wait for req-a0 to req-a3 to return, they still have to wait because req-a4 is blocked ahead of them, which greatly lowers the processing efficiency of memory access requests.
On the other hand, at this moment all the memory access requests stored in the FIFO queue belong to inst-a. Although the FIFO queue still has free entries, the FIFO controller informs the thread bundle scheduler through control signal c1 that the current load/store unit still cannot process other memory access instructions such as inst-b, so those instructions are also affected by the reservation stall of inst-a, which greatly lowers the processing efficiency of memory access instructions.
As shown in FIG. 1, after the embodiment of the present invention is adopted, when req-a4 suffers a reservation stall the FIFO controller pops req-a4 from the head of the FIFO queue while control signal c2 opens data path1, so req-a4 is placed at the tail of the FIFO queue and req-a5 becomes the head of the queue; the reservation stall is thus avoided.
In the next cycle req-a5 accesses set-1, which increases the processing speed of memory access requests. In addition, at this moment the address merging unit has already merged all the memory access requests of inst-a, so it informs the thread bundle scheduler that the load/store unit can now accept inst-b. Assume that inst-b generates 24 memory access requests in total (req-b0, ..., req-b23); then req-b0 and req-a4 need to be placed at the tail of the FIFO queue at the same time and an access conflict occurs.
In the embodiment of the present invention the stalled req-a4 is given the higher priority, so req-a4 is placed into the FIFO queue first, and during this cycle control signal c2 makes the address merging unit in the load/store unit pause merging the memory access requests of inst-b. When the access conflict at the tail of the FIFO queue is over, control signal c2 lets the address merging unit continue merging the memory access requests of inst-b. The embodiment of the invention therefore also increases the processing speed of memory access instructions.
As shown in FIG. 3, the average (geometric mean, GM) performance of the GPU improves by 23% after the embodiment of the present invention is adopted.
In the embodiments of the present invention, the models of the devices are not limited unless otherwise specified, as long as the devices can perform the functions described above.
Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and that the numbering of the embodiments of the present invention is for description only and does not imply any ranking of the embodiments.
The above description covers only preferred embodiments of the present invention and is not intended to limit the invention; any modifications, equivalent substitutions, improvements and the like made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (7)

1. A method for mitigating the stalls of GPU memory access requests and of instructions when they access a cache, characterized by comprising the following steps:
the memory access request at the head of the FIFO queue accesses the L1 cache and its tag is compared with the tags in the L1 cache; if a memory access request suffers a reservation stall, it is popped from the head of the FIFO queue and placed at the tail of the FIFO queue;
a first control logic controls the routing of the memory access request after it is popped from the head of the FIFO queue;
and a second control logic together with a first control signal and a third control signal is constructed to pipeline the memory access instructions between the thread bundle scheduler and the load/store unit, so that the next memory access instruction can be processed once the address merging unit in the load/store unit has merged all the memory access requests, and memory access requests can be generated and stored whenever the FIFO queue has a free entry.
2. The method according to claim 1, characterized in that the first control logic controls the routing of the memory access request after it is popped from the head of the FIFO queue as follows:
when the access result of the memory access request in the L1 cache is a reservation stall, the second control signal is false and becomes true after passing through the inverter;
the first tri-state gate is then in the conducting state and the second tri-state gate is in the high-impedance state, indicating that the memory access request is transmitted to the tail of the FIFO queue after being popped from the head of the FIFO queue.
3. The method according to claim 1, characterized in that constructing the second control logic and the first and third control signals for pipelining the memory access instructions between the thread bundle scheduler and the load/store unit specifically comprises:
1) if the memory access requests have not all been merged by the address merging unit, the third control signal is false and the thread bundle scheduler is informed that it cannot send another memory access instruction to the load/store unit; otherwise, the third control signal is true and the thread bundle scheduler is informed that it may send another memory access instruction to the load/store unit;
2) the state of whether the FIFO queue is full is sent to the address merging unit through the first control signal so as to control the generation of memory access requests.
4. The method according to claim 3, characterized in that sending the state of whether the FIFO queue is full to the address merging unit through the first control signal to control the generation of memory access requests specifically comprises:
if the FIFO queue is full, the first control signal is false and the address merging unit is informed to pause merging memory access requests until the FIFO queue has a free entry;
otherwise, the address merging unit is informed to continue merging memory access requests and to place the merged requests at the tail of the FIFO queue.
5. The method according to any one of claims 1-4, characterized in that the method further comprises:
performing conflict handling when a memory access request generated by the address merging unit and a memory access request popped from the head of the FIFO queue are to be placed at the tail of the FIFO queue in the same cycle.
6. The method according to claim 5, characterized in that the conflict handling specifically comprises:
the memory access request popped from the head of the FIFO queue is given the higher priority and is placed at the tail of the FIFO queue through a second control logic;
the address merging unit pauses the generation of new memory access requests until no request popped from the head of the FIFO queue needs to be placed at the tail of the FIFO queue.
7. The method according to claim 6, characterized in that placing the memory access request at the tail of the FIFO queue through the second control logic comprises:
when the access result of the memory access request in the L1 cache is a hit or a miss, the second control signal is true, input path2 of the multiplexer is selected, and the memory access request generated by the address merging unit is placed at the tail of the FIFO queue;
when the access result of the memory access request in the L1 cache is a reservation stall, the second control signal is false, input path1 of the multiplexer is selected, and the memory access request popped from the head of the FIFO queue is placed at the tail of the FIFO queue.
CN201910601175.0A 2019-07-04 2019-07-04 Method for slowing down GPU (graphics processing Unit) access request and pause when instructions access cache Active CN110457238B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910601175.0A CN110457238B (en) 2019-07-04 2019-07-04 Method for slowing down GPU (graphics processing Unit) access request and pause when instructions access cache

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910601175.0A CN110457238B (en) 2019-07-04 2019-07-04 Method for slowing down GPU (graphics processing Unit) access request and pause when instructions access cache

Publications (2)

Publication Number Publication Date
CN110457238A CN110457238A (en) 2019-11-15
CN110457238B true CN110457238B (en) 2023-01-03

Family

ID=68482257

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910601175.0A Active CN110457238B (en) 2019-07-04 2019-07-04 Method for slowing down GPU (graphics processing Unit) access request and pause when instructions access cache

Country Status (1)

Country Link
CN (1) CN110457238B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111736900B (en) * 2020-08-17 2020-11-27 广东省新一代通信与网络创新研究院 Parallel double-channel cache design method and device
CN112817639B (en) * 2021-01-13 2022-04-08 中国民航大学 Method for accessing register file by GPU read-write unit through operand collector
CN113722111A (en) * 2021-11-03 2021-11-30 北京壁仞科技开发有限公司 Memory allocation method, system, device and computer readable medium
CN114595070B (en) * 2022-05-10 2022-08-12 上海登临科技有限公司 Processor, multithreading combination method and electronic equipment
CN114637609B (en) * 2022-05-20 2022-08-12 沐曦集成电路(上海)有限公司 Data acquisition system of GPU (graphic processing Unit) based on conflict detection
CN114647516B (en) * 2022-05-20 2022-08-23 沐曦集成电路(上海)有限公司 GPU data processing system based on FIFO structure with multiple inputs and single output

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102981807A (en) * 2012-11-08 2013-03-20 北京大学 Graphics processing unit (GPU) program optimization method based on compute unified device architecture (CUDA) parallel environment
CN103927277A (en) * 2014-04-14 2014-07-16 中国人民解放军国防科学技术大学 CPU (central processing unit) and GPU (graphic processing unit) on-chip cache sharing method and device
CN104461758A (en) * 2014-11-10 2015-03-25 中国航天科技集团公司第九研究院第七七一研究所 Exception handling method and structure tolerant of missing cache and capable of emptying assembly line quickly
CN106407063A (en) * 2016-10-11 2017-02-15 东南大学 Method for simulative generation and sorting of access sequences at GPU L1 Cache

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102981807A (en) * 2012-11-08 2013-03-20 北京大学 Graphics processing unit (GPU) program optimization method based on compute unified device architecture (CUDA) parallel environment
CN103927277A (en) * 2014-04-14 2014-07-16 中国人民解放军国防科学技术大学 CPU (central processing unit) and GPU (graphic processing unit) on-chip cache sharing method and device
CN104461758A (en) * 2014-11-10 2015-03-25 中国航天科技集团公司第九研究院第七七一研究所 Exception handling method and structure tolerant of missing cache and capable of emptying assembly line quickly
CN106407063A (en) * 2016-10-11 2017-02-15 东南大学 Method for simulative generation and sorting of access sequences at GPU L1 Cache

Non-Patent Citations (10)

* Cited by examiner, † Cited by third party
Title
A Load Balancing Technique for Memory Channels; Byoungchan Oh et al.; MEMSYS; 2018-10-04; pp. 1-12 *
A modified post-TnL vertex cache for the multi-shader embedded GPUs; Jizeng Wei et al.; IEICE Electronics Express; 2015-04-27; vol. 12, no. 10; pp. 1-12 *
An Efficient GPU Cache Architecture for Applications with Irregular Memory Access Patterns; Bingchao Li et al.; ACM Transactions on Architecture and Code Optimization; 2019-06; vol. 16, no. 3; pp. 1-24 *
Elastic-Cache: GPU Cache Architecture for Efficient Fine- and Coarse-Grained Cache-Line Management; Bingchao Li et al.; 2017 IEEE International Parallel and Distributed Processing Symposium; 2017; pp. 82-91 *
Exploring new features of high-bandwidth memory for GPUs; Bingchao Li et al.; IEICE Electronics Express; 2016-06-28; vol. 13, no. 14; pp. 1-12 *
Improving SIMD utilization with thread-lane shuffled compaction in GPGPU; Li Bingchao et al.; Chinese Journal of Electronics; 2015-10; vol. 24, no. 4; pp. 684-688 *
MRPB: Memory Request Prioritization for Massively Parallel Processors; Wenhao Jia et al.; International Symposium on High Performance Computer Architecture; 2014; pp. 272-283 *
NVIDIA Tesla: A Unified Graphics and Computing Architecture; Erik Lindholm et al.; IEEE Micro; 2008; pp. 39-55 *
Research on performance optimization of cache replacement algorithms for heterogeneous multi-core processors (in Chinese); Fan Qingwen; China Master's Theses Full-text Database, Information Science and Technology; 2018-07-15; I137-66 *
Research on GPU-parallel image resampling and tile cache strategy optimization (in Chinese); Zhang Tingting; China Master's Theses Full-text Database, Information Science and Technology; 2018-08-15; I140-440 *

Also Published As

Publication number Publication date
CN110457238A (en) 2019-11-15

Similar Documents

Publication Publication Date Title
CN110457238B (en) Method for slowing down GPU (graphics processing Unit) access request and pause when instructions access cache
US11334262B2 (en) On-chip atomic transaction engine
US9898409B2 (en) Issue control for multithreaded processing
US6732242B2 (en) External bus transaction scheduling system
KR100936601B1 (en) Multi-processor system
US20060206635A1 (en) DMA engine for protocol processing
JP4322259B2 (en) Method and apparatus for synchronizing data access to local memory in a multiprocessor system
WO2016101664A1 (en) Instruction scheduling method and device
GB2421328A (en) Scheduling threads for execution in a multi-threaded processor.
US20140129784A1 (en) Methods and systems for polling memory outside a processor thread
US20160371082A1 (en) Instruction context switching
US8180998B1 (en) System of lanes of processing units receiving instructions via shared memory units for data-parallel or task-parallel operations
WO2003038602A2 (en) Method and apparatus for the data-driven synchronous parallel processing of digital data
US9870315B2 (en) Memory and processor hierarchy to improve power efficiency
CN108549574A (en) Threading scheduling management method, device, computer equipment and storage medium
US11868306B2 (en) Processing-in-memory concurrent processing system and method
CN112817639B (en) Method for accessing register file by GPU read-write unit through operand collector
US6016531A (en) Apparatus for performing real time caching utilizing an execution quantization timer and an interrupt controller
US11899970B2 (en) Storage system and method to perform workload associated with a host
CN110647357B (en) Synchronous multithread processor
Fang et al. Core-aware memory access scheduling schemes
EP4160423B1 (en) Memory device, memory device operating method, and electronic device including memory device
Gu et al. Cart: Cache access reordering tree for efficient cache and memory accesses in gpus
Sahoo et al. CAMO: A novel cache management organization for GPGPUs
CN114661352A (en) Method and device for accessing fine-grained cache through GPU multi-grained access memory request

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant