CN113377688B - L1 cache sharing method for GPU - Google Patents

L1 cache sharing method for GPU

Info

Publication number
CN113377688B
CN113377688B (application CN202110519990.XA)
Authority
CN
China
Prior art keywords
cache
access request
request
multiprocessor
current
Prior art date
Legal status
Active
Application number
CN202110519990.XA
Other languages
Chinese (zh)
Other versions
CN113377688A (en)
Inventor
赵夏
何益百
张拥军
张光达
陈任之
隋京高
王承智
王璐
王君展
Current Assignee
National Defense Technology Innovation Institute PLA Academy of Military Science
Original Assignee
National Defense Technology Innovation Institute PLA Academy of Military Science
Priority date
Filing date
Publication date
Application filed by National Defense Technology Innovation Institute PLA Academy of Military Science filed Critical National Defense Technology Innovation Institute PLA Academy of Military Science
Priority to CN202110519990.XA
Publication of CN113377688A
Application granted
Publication of CN113377688B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/20Processor architectures; Processor configuration, e.g. pipelining
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0806Multiuser, multiprocessor or multiprocessing cache systems
    • G06F12/0811Multiuser, multiprocessor or multiprocessing cache systems with multilevel cache hierarchies
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention discloses an L1 cache sharing method for a GPU, which comprises the following steps: S11, judging whether the local memory access request is empty; if so, executing S21, and if not, executing S12; S12, taking out the request to access the L1 cache; S13, judging whether the request hits; if so, returning the data, and if not, executing S14; S14, judging whether the program is a storage-intensive program; if so, sending the request to another SM and executing S15, and if not, sending the request to the L2 cache; S15, judging whether a cache data block needs to be replaced; if so, sending a data block replacement request to the other SM; S21, judging whether the remote memory access request is empty; if not, executing S22; S22, taking out the request to access the L1 cache; S23, judging whether the request hits; if so, returning the data and executing S24, and if not, sending the request to the L2 cache and executing S24; S24, judging whether the remote data request is empty; if not, storing the data block that needs to be replaced into the L1 cache. The invention enables an SM running a storage-intensive program to use the L1 cache on an SM running a compute-intensive program.

Description

L1 cache sharing method for GPU
Technical Field
The invention relates to the technical field of GPUs (Graphics Processing Units), and in particular to an L1 cache sharing method for a GPU.
Background
A Graphics Processing Unit (GPU) is a microprocessor designed for image- and graphics-related computation. Owing to its powerful computing capability, the GPU is widely used in cloud computing platforms and data centers to provide users with the computation they require. Compared with a single-task GPU, which runs only one task at a time, a multitask GPU can run several tasks simultaneously and thereby effectively improve resource utilization. Specifically, a multitask GPU can run a compute-intensive program and a storage-intensive program on one GPU at the same time, so that both the compute resources and the storage resources of the GPU are fully utilized.
At present, a spatial multitasking mode is mainly adopted to let a GPU run multiple tasks simultaneously. Specifically, in the spatial multitasking mode, all SMs (Streaming Multiprocessors) on the GPU are evenly divided into two groups, and each group of SMs runs one application program. Through such spatial sharing, a spatial multitask GPU can run a compute-intensive program and a storage-intensive program at the same time and improve the utilization of both compute resources and storage resources.
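As a minimal illustration of this even split, the following C++ sketch assigns the SM indices to two groups; mapping the lower-indexed half to one application and the upper-indexed half to the other is an assumption made only for this example, since the spatial multitasking mode described above merely requires that the SMs be divided evenly and that each group run one application program.

```cpp
#include <vector>

// Illustrative even split of the SMs of a GPU into two groups for spatial
// multitasking. Which concrete SM indices go to which group is an assumption
// made for this sketch.
struct SmGroups {
    std::vector<int> group_a;  // SMs running the first application program
    std::vector<int> group_b;  // SMs running the second application program
};

SmGroups split_sms(int num_sms) {
    SmGroups groups;
    for (int sm = 0; sm < num_sms; ++sm)
        (sm < num_sms / 2 ? groups.group_a : groups.group_b).push_back(sm);
    return groups;
}
```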
However, although a spatial multitask GPU that simultaneously runs a compute-intensive program and a storage-intensive program can effectively improve the overall resource utilization of the system, running different programs on different SMs leads to unbalanced use of resources on the SMs, especially the L1 Cache (first-level cache) resources, which limits further improvement of the multitask GPU's performance. Specifically, an SM running a storage-intensive program generates a large number of memory access requests, so its L1 Cache resources are over-used, its L1 Cache miss rate is high, and the missed requests are sent through the on-chip interconnection network to the L2 Cache (second-level cache) and the storage system, incurring a large access overhead. For an SM running a compute-intensive program, by contrast, there are few memory access requests, so its L1 Cache resources are left under-utilized.
Disclosure of Invention
To solve some or all of the above technical problems in the prior art, the present invention provides an L1 cache sharing method for a GPU.
The invention discloses an L1 cache sharing method for a GPU, which comprises the following steps:
S11, judging whether the local memory access request of the current streaming multiprocessor is empty; if so, executing step S21, and if not, executing step S12;
S12, taking out the local memory access request to access the L1 cache;
S13, judging whether the local memory access request hits the L1 cache; if so, returning the corresponding data to the memory access unit of the current streaming multiprocessor, and if not, executing step S14;
S14, judging whether the current task of the current streaming multiprocessor is a storage-intensive program; if so, sending the local memory access request to the other streaming multiprocessor on which the compute-intensive program corresponding to the current streaming multiprocessor runs and executing step S15, and if not, sending the local memory access request to the L2 cache and/or the storage system;
S15, judging whether a cache data block in the L1 cache of the current streaming multiprocessor needs to be replaced; if so, sending the data block that needs to be replaced, together with a data block replacement request, to the other streaming multiprocessor on which the compute-intensive program corresponding to the current streaming multiprocessor runs;
S21, judging whether the remote memory access request is empty; if not, executing step S22, wherein the remote memory access request denotes a memory access request sent to the current streaming multiprocessor by another streaming multiprocessor;
S22, taking out the remote memory access request to access the L1 cache;
S23, judging whether the remote memory access request hits the L1 cache; if so, returning the corresponding data to the streaming multiprocessor that sent the remote memory access request and executing step S24, and if not, sending the remote memory access request to the L2 cache and/or the storage system and executing step S24;
S24, judging whether the remote data request is empty; if not, storing the data block that needs to be replaced and corresponds to the remote data request into the L1 cache of the current streaming multiprocessor, wherein the remote data request denotes a data block replacement request sent to the current streaming multiprocessor by another streaming multiprocessor.
In some optional embodiments, a local memory access request queue unit is created in the streaming multiprocessor and is used to store, in a queue structure, the memory access requests generated by the current streaming multiprocessor.
In some optional embodiments, taking out the local memory access request to access the L1 cache comprises:
taking out the memory access request at the head of the local memory access request queue in the local memory access request queue unit to access the L1 cache of the current streaming multiprocessor.
In some optional embodiments, a remote memory access request queue unit is created in the streaming multiprocessor and is used to store, in a queue structure, the memory access requests sent by other streaming multiprocessors.
In some optional embodiments, taking out the remote memory access request to access the L1 cache comprises:
taking out the memory access request at the head of the remote memory access request queue in the remote memory access request queue unit to access the L1 cache of the current streaming multiprocessor.
In some optional embodiments, a remote data request queue unit is created in the streaming multiprocessor and is used to store, in a queue structure, the data block replacement requests sent by other streaming multiprocessors.
In some optional embodiments, storing the data block that needs to be replaced and corresponds to the remote data request into the L1 cache of the current streaming multiprocessor comprises:
storing the data block that needs to be replaced and corresponds to the data block replacement request at the head of the data block replacement request queue in the remote data request queue unit into the L1 cache of the current streaming multiprocessor.
In some optional embodiments, a selection logic unit is created in the streaming multiprocessor and is used to arbitrate among and select from the local memory access requests, the remote memory access requests and the remote data requests.
In some optional embodiments, whether the current task of the current streaming multiprocessor is a storage-intensive program is determined from the frequency with which the current streaming multiprocessor accesses the L1 cache while the current program is running.
In some optional embodiments, if the number of accesses per thousand instructions to the L1 cache of the current streaming multiprocessor is greater than a preset threshold, the current task of the current streaming multiprocessor is determined to be a storage-intensive program.
The technical solution of the invention has the following main advantages:
The L1 cache sharing method for a GPU provided by the invention enables a streaming multiprocessor (SM) running a storage-intensive program to use the L1 cache on an SM running a compute-intensive program. It thus makes full use of the L1 cache resources in the GPU, improves the resource utilization of the system, and solves the problem of unbalanced L1 cache utilization across SMs when a spatial multitask GPU runs different tasks.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a method for sharing an L1 cache of a GPU according to one embodiment of the present invention;
FIG. 2 is a schematic diagram of a streaming multiprocessor microarchitecture according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the specific embodiments of the present invention and the accompanying drawings. It is to be understood that the described embodiments are merely a few embodiments of the invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
The technical solution provided by an embodiment of the present invention is described in detail below with reference to the accompanying drawings.
Referring to FIG. 1, an embodiment of the present invention provides an L1 cache sharing method for a GPU. The method is intended for a spatial multitask GPU that runs a compute-intensive program and a storage-intensive program at the same time. The spatial multitask GPU divides the streaming multiprocessors (SMs) on the GPU evenly into two groups: one group of SMs runs the compute-intensive program and the other group runs the storage-intensive program. Each SM running the storage-intensive program has a corresponding SM running the compute-intensive program, and the SMs are connected to one another through an on-chip interconnection network to enable data communication between them. The L1 cache sharing method for the GPU comprises the following steps:
S11, judging whether the local memory access request of the current streaming multiprocessor is empty; if so, executing step S21, and if not, executing step S12;
S12, taking out the local memory access request to access the L1 cache;
S13, judging whether the local memory access request hits the L1 cache; if so, returning the corresponding data to the memory access unit of the current streaming multiprocessor, and if not, executing step S14;
S14, judging whether the current task of the current streaming multiprocessor is a storage-intensive program; if so, sending the local memory access request to the other streaming multiprocessor on which the compute-intensive program corresponding to the current streaming multiprocessor runs and executing step S15, and if not, sending the local memory access request to the L2 cache and/or the storage system;
S15, judging whether a cache data block in the L1 cache of the current streaming multiprocessor needs to be replaced; if so, sending the data block that needs to be replaced, together with a data block replacement request, to the other streaming multiprocessor on which the compute-intensive program corresponding to the current streaming multiprocessor runs;
S21, judging whether the remote memory access request is empty; if not, executing step S22, wherein the remote memory access request denotes a memory access request sent to the current streaming multiprocessor by another streaming multiprocessor;
S22, taking out the remote memory access request to access the L1 cache;
S23, judging whether the remote memory access request hits the L1 cache; if so, returning the corresponding data to the streaming multiprocessor that sent the remote memory access request and executing step S24, and if not, sending the remote memory access request to the L2 cache and/or the storage system and executing step S24;
S24, judging whether the remote data request is empty; if not, storing the data block that needs to be replaced and corresponds to the remote data request into the L1 cache of the current streaming multiprocessor, wherein the remote data request denotes a data block replacement request sent to the current streaming multiprocessor by another streaming multiprocessor.
If both the local memory access request and the remote memory access request are empty, no memory access request needs to access the L1 cache in the current clock cycle.
In an embodiment of the present invention, the above L1 cache sharing method is applied to every streaming multiprocessor (SM) of the GPU, so that an SM running the storage-intensive program can use the L1 cache on the SM running the compute-intensive program. The L1 cache resources in the GPU are thus fully utilized, the resource utilization of the system is improved, and the problem of unbalanced L1 cache utilization across SMs when the spatial multitask GPU runs different tasks is solved.
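To make the per-cycle behaviour of steps S11 to S24 concrete, the following C++ sketch models the decision logic of a single streaming multiprocessor. It is only an illustrative reading of the flow shown in FIG. 1: the type and member names (Sm, Request, l1_lookup, l1_allocate and so on) are assumptions introduced for this example, as is the interpretation that step S15 applies when allocating space for the missing data would evict a valid block; none of these details are prescribed by the method itself.

```cpp
#include <deque>
#include <optional>

struct Request { /* address, source SM id, ... */ };
struct Block   { /* cache data block to be replaced */ };

struct Sm {
    std::deque<Request> local_q;        // local memory access request queue
    std::deque<Request> remote_q;       // remote memory access request queue
    std::deque<Block>   remote_data_q;  // remote data request queue (forwarded blocks)

    bool storage_intensive = false;     // set from the APKI statistic described later
    Sm*  paired_sm = nullptr;           // SM running the corresponding program

    // The following helpers are assumed to exist; their behaviour is only sketched.
    bool l1_lookup(const Request&);                   // true on an L1 hit
    std::optional<Block> l1_allocate(const Request&); // reserves a line, returns the evicted victim if any
    void l1_install(const Block&);                    // places a forwarded block into the L1 cache
    void return_data(const Request&);                 // to the local LD/ST unit or to the source SM
    void send_to_l2(const Request&);                  // over the on-chip network towards L2 / memory

    // One clock cycle of the method (steps S11-S24).
    void cycle() {
        if (!local_q.empty()) {                          // S11 -> S12
            Request r = local_q.front(); local_q.pop_front();
            if (l1_lookup(r)) {                          // S13: hit
                return_data(r);
            } else if (storage_intensive && paired_sm) { // S14: miss on a storage-intensive SM
                paired_sm->remote_q.push_back(r);        //   forward the request to the paired SM
                if (auto victim = l1_allocate(r))        // S15: a local block must be replaced
                    paired_sm->remote_data_q.push_back(*victim);
            } else {                                     // S14: miss on a compute-intensive SM
                send_to_l2(r);
            }
        } else if (!remote_q.empty()) {                  // S21 -> S22
            Request r = remote_q.front(); remote_q.pop_front();
            if (l1_lookup(r)) return_data(r);            // S23: hit, answer the requesting SM
            else              send_to_l2(r);             // S23: miss
            if (!remote_data_q.empty()) {                // S24: install a forwarded block
                l1_install(remote_data_q.front());
                remote_data_q.pop_front();
            }
        }
        // If both request queues are empty, no request accesses the L1 cache this cycle.
    }
};
```

In this sketch every streaming multiprocessor would call cycle() once per clock cycle; an SM running the compute-intensive program simply has storage_intensive set to false, so its own misses go straight to the L2 cache while its L1 cache additionally serves the requests and evicted blocks forwarded by its paired SM.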
Referring to FIG. 2, in an embodiment of the present invention, a local memory access request queue unit is created in the streaming multiprocessor and is used to store, in a queue structure, the memory access requests generated by the current streaming multiprocessor.
Further, when the local memory access requests are stored in a queue structure, taking out the local memory access request to access the L1 cache in the L1 cache sharing method specifically comprises:
taking out the memory access request at the head of the local memory access request queue in the local memory access request queue unit to access the L1 cache of the current streaming multiprocessor.
Referring to FIG. 2, in an embodiment of the present invention, a remote memory access request queue unit is created in the streaming multiprocessor and is used to store, in a queue structure, the memory access requests sent by other streaming multiprocessors.
Further, when the remote memory access requests are stored in a queue structure, taking out the remote memory access request to access the L1 cache in the L1 cache sharing method specifically comprises:
taking out the memory access request at the head of the remote memory access request queue in the remote memory access request queue unit to access the L1 cache of the current streaming multiprocessor.
Referring to FIG. 2, in an embodiment of the present invention, a remote data request queue unit is created in the streaming multiprocessor and is used to store, in a queue structure, the data block replacement requests sent by other streaming multiprocessors.
Further, when the data block replacement requests are stored in a queue structure, storing the data block that needs to be replaced and corresponds to the remote data request into the L1 cache of the current streaming multiprocessor in the L1 cache sharing method specifically comprises:
storing the data block that needs to be replaced and corresponds to the data block replacement request at the head of the data block replacement request queue in the remote data request queue unit into the L1 cache of the current streaming multiprocessor.
Referring to FIG. 2, in an embodiment of the present invention, a selection logic unit is created in the streaming multiprocessor and is used to arbitrate among and select from the local memory access requests, the remote memory access requests and the remote data requests.
Specifically, with the local memory access request queue unit, the remote memory access request queue unit and the remote data request queue unit established in the streaming multiprocessor, the selection logic unit is connected to the three queue units and to the L1 cache; it examines the requests in the three queue units, selects one of them, and sends the selected request to the L1 cache.
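A minimal sketch of such a selection logic unit is given below. The priority order (local memory access requests first, then remote memory access requests, then remote data requests) is one plausible reading of the ordering of steps S11, S21 and S24, and the enum and function names are assumptions made for this illustration.

```cpp
// Illustrative arbiter for the selection logic unit: each cycle it decides
// which of the three queues is allowed to access the L1 cache.
enum class Grant { kLocal, kRemoteAccess, kRemoteData, kNone };

Grant select(bool local_empty, bool remote_access_empty, bool remote_data_empty) {
    if (!local_empty)         return Grant::kLocal;        // S11 -> S12
    if (!remote_access_empty) return Grant::kRemoteAccess; // S21 -> S22
    if (!remote_data_empty)   return Grant::kRemoteData;   // S24
    return Grant::kNone;  // no request accesses the L1 cache this cycle
}
```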
The performance parameters of an application program reflect its characteristics and the type of operations it performs. Compared with a compute-intensive program, a storage-intensive program accesses the L1 cache more frequently while it runs. Therefore, in an embodiment of the present invention, whether the current task of the current streaming multiprocessor is a storage-intensive program can be determined from the frequency with which the current streaming multiprocessor accesses the L1 cache while the current program is running.
Specifically, if the number of accesses per kilo-instruction (APKI) to the L1 cache of the current streaming multiprocessor is greater than a preset threshold, the current task of the current streaming multiprocessor is determined to be a storage-intensive program.
APKI (Accesses Per Kilo-Instruction) is a parameter that reflects how frequently an application accesses memory; an application with a high APKI value issues more memory accesses.
The preset threshold may be, for example, 10.
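As a simple illustration of this classification, the sketch below accumulates the two counters needed to compute APKI and compares the result against the threshold. The counter names and the idea of maintaining them per streaming multiprocessor are assumptions made for this example; the example threshold of 10 is the one mentioned above.

```cpp
// Sketch of APKI-based classification: an SM whose L1 accesses per
// kilo-instruction exceed the preset threshold is treated as running a
// storage-intensive program.
struct ApkiCounter {
    unsigned long long l1_accesses  = 0;  // incremented on every L1 cache access
    unsigned long long instructions = 0;  // incremented on every retired instruction

    double apki() const {
        return instructions ? 1000.0 * static_cast<double>(l1_accesses) / instructions : 0.0;
    }
    bool storage_intensive(double threshold = 10.0) const {
        return apki() > threshold;
    }
};
```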
It is noted that, in this document, relational terms such as "first" and "second" may be used solely to distinguish one entity or action from another and do not necessarily require or imply any actual such relationship or order between those entities or actions. Likewise, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. In addition, "front", "rear", "left", "right", "upper" and "lower" in this document refer to the placement states shown in the drawings.
Finally, it should be noted that: the above examples are only for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. An L1 cache sharing method for a GPU, comprising the following steps:
S11, judging whether the local memory access request of the current streaming multiprocessor is empty; if so, executing step S21, and if not, executing step S12;
S12, taking out the local memory access request to access the L1 cache;
S13, judging whether the local memory access request hits the L1 cache; if so, returning the corresponding data to the memory access unit of the current streaming multiprocessor, and if not, executing step S14;
S14, judging whether the current task of the current streaming multiprocessor is a storage-intensive program; if so, sending the local memory access request to the other streaming multiprocessor on which the compute-intensive program corresponding to the current streaming multiprocessor runs and executing step S15, and if not, sending the local memory access request to the L2 cache and/or the storage system;
S15, judging whether a cache data block in the L1 cache of the current streaming multiprocessor needs to be replaced; if so, sending the data block that needs to be replaced, together with a data block replacement request, to the other streaming multiprocessor on which the compute-intensive program corresponding to the current streaming multiprocessor runs;
S21, judging whether the remote memory access request is empty; if not, executing step S22, wherein the remote memory access request denotes a memory access request sent to the current streaming multiprocessor by another streaming multiprocessor;
S22, taking out the remote memory access request to access the L1 cache;
S23, judging whether the remote memory access request hits the L1 cache; if so, returning the corresponding data to the streaming multiprocessor that sent the remote memory access request and executing step S24, and if not, sending the remote memory access request to the L2 cache and/or the storage system and executing step S24;
S24, judging whether the remote data request is empty; if not, storing the data block that needs to be replaced and corresponds to the remote data request into the L1 cache of the current streaming multiprocessor, wherein the remote data request denotes a data block replacement request sent to the current streaming multiprocessor by another streaming multiprocessor.
2. The L1 cache sharing method for a GPU according to claim 1, characterized in that a local memory access request queue unit is created in the streaming multiprocessor and is used to store, in a queue structure, the memory access requests generated by the current streaming multiprocessor.
3. The L1 cache sharing method for a GPU according to claim 2, wherein taking out the local memory access request to access the L1 cache comprises:
taking out the memory access request at the head of the local memory access request queue in the local memory access request queue unit to access the L1 cache of the current streaming multiprocessor.
4. The L1 cache sharing method for a GPU according to claim 1 or 2, characterized in that a remote memory access request queue unit is created in the streaming multiprocessor and is used to store, in a queue structure, the memory access requests sent by other streaming multiprocessors.
5. The L1 cache sharing method for a GPU according to claim 4, wherein taking out the remote memory access request to access the L1 cache comprises:
taking out the memory access request at the head of the remote memory access request queue in the remote memory access request queue unit to access the L1 cache of the current streaming multiprocessor.
6. The L1 cache sharing method for a GPU according to claim 1, 2 or 4, characterized in that a remote data request queue unit is created in the streaming multiprocessor and is used to store, in a queue structure, the data block replacement requests sent by other streaming multiprocessors.
7. The L1 cache sharing method for a GPU according to claim 6, wherein storing the data block that needs to be replaced and corresponds to the remote data request into the L1 cache of the current streaming multiprocessor comprises:
storing the data block that needs to be replaced and corresponds to the data block replacement request at the head of the data block replacement request queue in the remote data request queue unit into the L1 cache of the current streaming multiprocessor.
8. The L1 cache sharing method for a GPU according to any one of claims 1 to 7, characterized in that a selection logic unit is created in the streaming multiprocessor and is used to arbitrate among and select from the local memory access requests, the remote memory access requests and the remote data requests.
9. The L1 cache sharing method for a GPU according to claim 1, characterized in that whether the current task of the current streaming multiprocessor is a storage-intensive program is determined from the frequency with which the current streaming multiprocessor accesses the L1 cache while the current program is running.
10. The L1 cache sharing method for a GPU according to claim 9, wherein if the number of accesses per thousand instructions to the L1 cache of the current streaming multiprocessor is greater than a preset threshold, the current task of the current streaming multiprocessor is determined to be a storage-intensive program.
CN202110519990.XA 2021-05-13 2021-05-13 L1 cache sharing method for GPU Active CN113377688B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110519990.XA CN113377688B (en) 2021-05-13 2021-05-13 L1 cache sharing method for GPU

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110519990.XA CN113377688B (en) 2021-05-13 2021-05-13 L1 cache sharing method for GPU

Publications (2)

Publication Number Publication Date
CN113377688A CN113377688A (en) 2021-09-10
CN113377688B 2022-10-11

Family

ID=77572605

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110519990.XA Active CN113377688B (en) 2021-05-13 2021-05-13 L1 cache sharing method for GPU

Country Status (1)

Country Link
CN (1) CN113377688B (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103927277A (en) * 2014-04-14 2014-07-16 中国人民解放军国防科学技术大学 CPU (central processing unit) and GPU (graphic processing unit) on-chip cache sharing method and device
CN104461957A (en) * 2014-08-28 2015-03-25 浪潮(北京)电子信息产业有限公司 Method and device for heterogeneous multi-core CPU share on-chip caching

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Adaptive stack cache with fast address calculation; Huan Dandan et al.; Journal of Computer Research and Development; 2007-01-28 (No. 01); full text *

Also Published As

Publication number Publication date
CN113377688A (en) 2021-09-10

Similar Documents

Publication Publication Date Title
US10748237B2 (en) Adaptive scheduling for task assignment among heterogeneous processor cores
CN109375872B (en) Data access request processing method, device and equipment and storage medium
US11876731B2 (en) System and methods for sharing memory subsystem resources among datacenter applications
CN108182105B (en) Local dynamic migration method and control system based on Docker container technology
Ma et al. Real-time virtual machine scheduling in industry IoT network: A reinforcement learning method
CN104268018A (en) Job scheduling method in Hadoop cluster and job scheduler
Issawi et al. An efficient adaptive load balancing algorithm for cloud computing under bursty workloads
CN109960575A (en) A kind of computing capability sharing method, system and relevant device
Monil et al. QoS-aware virtual machine consolidation in cloud datacenter
JP2013186770A (en) Data processing device
CN110990154A (en) Big data application optimization method and device and storage medium
CN106681830B (en) A kind of task buffer space monitoring method and apparatus
Singh et al. Comparative analysis of VM consolidation algorithms for cloud computing
CN113377688B (en) L1 cache sharing method for GPU
CN116089477B (en) Distributed training method and system
Zhang et al. PRMRAP: A proactive virtual resource management framework in cloud
CN103955397A (en) Virtual machine scheduling multi-strategy selection method based on micro-architecture perception
CN114201306B (en) Multi-dimensional geographic space entity distribution method and system based on load balancing technology
CN106775942B (en) Cloud application-oriented solid-state disk cache management system and method
CN113377866A (en) Load balancing method and device for virtualized database proxy service
EP3096227A1 (en) Resource allocation method in distributed clouds
CN112114967A (en) GPU resource reservation method based on service priority
Dai et al. A resource occupancy ratio-oriented load balancing task scheduling mechanism for flink
WO2020019315A1 (en) Computational operation scheduling method employing graphic data, system, computer readable medium, and apparatus
Çağlar et al. An energy efficient VM allocation approach for data centers

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant