CN113377538A - GPU data reuse-oriented storage and calculation cooperative scheduling method and system - Google Patents

GPU data reuse-oriented storage and calculation cooperative scheduling method and system Download PDF

Info

Publication number
CN113377538A
CN113377538A CN202110649358.7A CN202110649358A
Authority
CN
China
Prior art keywords
gpu
data
data page
kernel program
thread block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110649358.7A
Other languages
Chinese (zh)
Other versions
CN113377538B (en)
Inventor
李晨
李宣佚
郭阳
鲁建壮
陈小文
刘胜
张洋
刘畅
曹壮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202110649358.7A priority Critical patent/CN113377538B/en
Publication of CN113377538A publication Critical patent/CN113377538A/en
Application granted granted Critical
Publication of CN113377538B publication Critical patent/CN113377538B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention discloses a GPU data reuse-oriented storage and calculation cooperative scheduling method and system. The method comprises: flipping an inversion flag of a kernel program whenever the kernel program is launched; in the thread block scheduler of the GPU, for the thread block scheduling of the kernel program, selecting in turn, according to the inversion flag of the kernel program, one of a forward thread block dispatch policy and a reverse thread block dispatch policy to pick thread blocks from the queue of thread blocks waiting to be issued; and, in the GPU driver, for the data page replacement of the kernel program, selecting in turn, according to the inversion flag of the kernel program, one of a forward data page replacement policy and a reverse data page replacement policy to pick GPU-side data pages from the GPU-side data page queue for eviction. The invention realizes cooperative scheduling of thread blocks and data pages, reduces the impact of memory over-provisioning on system performance by reusing shared data, and can effectively improve system performance.

Description

GPU data reuse-oriented storage and calculation cooperative scheduling method and system
Technical Field
The invention relates to computation scheduling techniques for computers, and in particular to a GPU data reuse-oriented storage and calculation cooperative scheduling method and system.
Background
Owing to their high computational throughput and good programmability, GPUs have been widely adopted in high-performance domains such as machine learning, object detection, and image denoising. However, the limited memory capacity on the GPU can no longer accommodate the ever-growing working sets (the data accessed by the GPU per unit time) of applications. The introduction of unified virtual memory and on-demand paging provides good support for memory over-provisioning, but the extra data page transfers between CPU memory and GPU memory cost system performance. Reducing these superfluous data migrations is therefore crucial for performance. After studying a large set of benchmark programs, we found that many applications exhibit data sharing between kernel programs (kernels). Moreover, in most of these programs, every kernel program accesses the same data region in a similar access order. When GPU memory cannot hold the working set of an entire kernel program, old data pages are swapped out to CPU memory and the required data pages are fetched into GPU memory. When a kernel program finishes, only the most recently accessed data pages remain in GPU memory, so the subsequent kernel program, once launched, must access data pages that have already been replaced back into CPU memory. We found that although there is abundant data sharing between kernel programs in these applications, this sharing characteristic disappears once memory over-provisioning occurs, which in turn causes a drastic drop in system performance.
Effectively using the data already resident in GPU memory is the key to avoiding the long-latency overhead caused by page faults, especially under memory over-provisioning. FIG. 1 illustrates the performance degradation that memory over-provisioning causes for applications with data sharing between kernel programs. We found that such applications are insensitive to the degree of over-provisioning: even slight over-provisioning causes a dramatic performance drop. FIG. 2 shows how the page access pattern and page fault rate of the GPU vary when memory can hold only 75% of the data accessed by the FFT program (multiple kernel programs access the same data in the same order; the dotted lines mark the end boundary of each kernel program). We found that every kernel program in the FFT has a similar data access pattern and order, and that the page fault rate is very high at the beginning of each kernel program (circled in FIG. 2), because data pages accessed earlier are replaced by newly accessed ones, so subsequent kernel programs fault when they re-access these evicted pages. To fundamentally reduce the number of page migrations under memory over-provisioning, many techniques have been proposed, including prefetching, overlapping transfer time with computation time, and batching page faults. However, these techniques do little to improve the performance of this class of applications: prefetching causes system thrashing by evicting useful data pages, and the long latency caused by massive page faults cannot be hidden by pre-eviction of data pages or by batching page faults. Based on the above analysis, we draw three observations for applications with data sharing between kernel programs. First, once memory over-provisioning occurs, program performance drops dramatically. Second, previous methods for optimizing performance under memory over-provisioning do not apply to such programs. Finally, the page fault rate at kernel program boundaries is very high. Our goal is therefore to reduce the page fault rate at kernel program boundaries by reusing the data shared between kernel programs.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: in view of the problems in the prior art, the invention provides a GPU data reuse-oriented storage and calculation cooperative scheduling method and system.
In order to solve the technical problems, the invention adopts the technical scheme that:
a GPU data reuse-oriented storage and calculation cooperative scheduling method comprises the following steps:
1) under the condition that the current program experiences GPU memory over-provisioning and there is data sharing between kernel programs, detecting whether a kernel program is launched, and if so, flipping the inversion flag of the kernel program;
2) in the GPU driver, for the data page replacement of the kernel program, selecting in turn, according to the inversion flag of the kernel program, one of two preset data page replacement policies, namely a forward data page replacement policy and a reverse data page replacement policy, to pick GPU-side data pages from the GPU-side data page queue for eviction, the forward and reverse data page replacement policies picking GPU-side data pages from opposite directions; and, in the thread block scheduler of the GPU, for the thread block scheduling of the kernel program, selecting in turn, according to the inversion flag of the kernel program, one of two preset thread block dispatch policies, namely a forward thread block dispatch policy and a reverse thread block dispatch policy, to pick thread blocks from the queue of thread blocks waiting to be issued, the forward and reverse thread block dispatch policies picking thread blocks from opposite directions.
Optionally, flipping the inversion flag of the kernel program in step 1) comprises: first detecting whether an inversion flag exists for the kernel program; if no inversion flag exists, initializing the inversion flag for the kernel program; and if an inversion flag already exists, flipping the inversion flag of the kernel program.
Optionally, when the inversion flag is initialized, its initial value is 0 or 1.
Optionally, flipping the inversion flag of the kernel program means: if the current value of the inversion flag of the kernel program is 0, changing it from 0 to 1; and if the current value is 1, changing it from 1 to 0.
Optionally, in step 2), when one of the two preset data page replacement policies, namely the forward data page replacement policy and the reverse data page replacement policy, is selected in turn according to the inversion flag of the kernel program to pick GPU-side data pages from the GPU-side data page queue for eviction, the forward data page replacement policy is selected if the inversion flag is 0, and the reverse data page replacement policy is selected if the inversion flag is 1.
Optionally, in step 2), when one of the two preset thread block dispatch policies, namely the forward thread block dispatch policy and the reverse thread block dispatch policy, is selected in turn according to the inversion flag, the forward thread block dispatch policy is selected if the inversion flag is 0, and the reverse thread block dispatch policy is selected if the inversion flag is 1.
Optionally, the current program experiencing GPU memory over-provisioning in step 1) means: when the GPU driver on the host monitors a data page request generated by the GPU, if the length of the data page queue maintained by the GPU driver has reached the GPU memory capacity, data pages in GPU memory must be evicted to CPU memory according to the data page replacement policy, and a memory over-provisioning flag is set to indicate that memory over-provisioning has occurred in the current program.
Optionally, data sharing between kernel programs in step 1) means: judging from compile-time information whether kernel programs share the same pointer, using the pointer as the indicator of data sharing between kernel programs, and assigning a data sharing flag to each kernel program to be launched to indicate whether it shares data with the previous kernel program; when a kernel program is launched, checking whether the memory over-provisioning flag and the data sharing flag corresponding to the kernel program are both 1, and if so, judging that the cooperative scheduling method is needed.
In addition, the invention also provides a GPU data reuse-oriented storage and calculation cooperative scheduling system, comprising a processing unit and a memory connected to each other, wherein the processing unit is programmed or configured to execute the steps of the aforementioned GPU data reuse-oriented storage and calculation cooperative scheduling method.
Furthermore, the invention also provides a computer-readable storage medium storing a computer program programmed or configured to execute the aforementioned GPU data reuse-oriented storage and calculation cooperative scheduling method.
Compared with the prior art, the invention has the following advantages. For applications with data sharing between kernel programs, the shared data in GPU memory is no longer reused once memory over-provisioning occurs, which causes a drastic drop in system performance, and existing techniques such as prefetching, overlapping transfers with computation, and batching page faults do little to help. Based on this observation, the invention provides a cooperative scheduling of thread blocks and data pages that improves system performance by effectively exploiting the data shared between kernel programs: by coordinating the switching of the thread block allocation order (i.e., changing the thread block dispatch order) with the switching of the data page replacement policy, the shared data between kernel programs is fully reused. The method was evaluated on a large set of GPU benchmarks, and the results show a 65% performance improvement over the latest published approach.
Drawings
FIG. 1 illustrates the impact of GPU memory over-provisioning on system performance.
Fig. 2 shows how the page access pattern and page fault rate of the GPU vary when memory can hold only 75% of the data accessed by the FFT program.
Fig. 3 is a schematic diagram of the basic principle of the method according to the embodiment of the present invention.
FIG. 4 is a graph showing a comparison of the performance of the method of the present invention and a prior art method.
Fig. 5 shows how the data page access pattern and data page fault rate vary during FFT execution when the GPU memory capacity is set to 75% of the data accessed by the FFT and the inverse allocation method of this embodiment is used.
Fig. 6 shows how the data page access pattern and data page fault rate vary during FFT execution when the GPU memory capacity is set to 75% of the data accessed by the FFT and the cooperative inversion method of this embodiment is used.
Fig. 7 shows the execution process under the baseline configuration in the embodiment of the present invention.
Fig. 8 shows the execution process when inverse allocation is used in the embodiment of the present invention.
Fig. 9 shows the execution process when the cooperative inversion method is used in the embodiment of the present invention.
FIG. 10 is a diagram illustrating the load-time-based least-recently-used data page replacement policy according to an embodiment of the present invention.
FIG. 11 is a diagram illustrating the reverse load-time-based least-recently-used data page replacement policy according to an embodiment of the present invention.
FIG. 12 is a comparison of IPC under different optimization mechanisms, using the baseline configuration as the performance reference, in an embodiment of the present invention.
Detailed Description
As shown in fig. 3, the GPU data reuse-oriented storage and calculation cooperative scheduling method of this embodiment comprises:
1) under the condition that the current program experiences GPU memory over-provisioning and there is data sharing between kernel programs, detecting whether a kernel program is launched, and if so, flipping the inversion flag of the kernel program;
2) in the GPU driver, for the data page replacement of the kernel program, selecting in turn, according to the inversion flag of the kernel program, one of two preset data page replacement policies, namely a forward data page replacement policy and a reverse data page replacement policy, to pick GPU-side data pages from the GPU-side data page queue for eviction, the forward and reverse data page replacement policies picking GPU-side data pages from opposite directions; and, in the thread block scheduler of the GPU, for the thread block scheduling of the kernel program, selecting in turn, according to the inversion flag of the kernel program, one of two preset thread block dispatch policies, namely a forward thread block dispatch policy and a reverse thread block dispatch policy, to pick thread blocks from the queue of thread blocks waiting to be issued, the forward and reverse thread block dispatch policies picking thread blocks from opposite directions.
As indicated above, matching thread blocks to the data already in memory is the key to reducing the page fault rate at kernel program boundaries; how to determine which thread blocks are associated with that data, however, is the biggest design challenge. Based on observation and analysis of such applications, this embodiment finds that in most of these programs every kernel program has a similar data access pattern and order (as shown in fig. 2), and that within each kernel program the access behavior to data pages is tied to the thread block scheduling mechanism, because each thread generally uses its thread index and thread block index to locate the data it operates on. In view of this, the method of this embodiment proposes a thread block allocation mechanism called inverse allocation, shown at label b in fig. 3, which dynamically changes the thread block dispatch policy by adjusting the thread block scheduling mechanism. The default thread block dispatch policy is shown in the left part of label b in fig. 3: thread blocks are dispatched in order of increasing block index. Whenever a kernel program is about to launch, the inversion flag is flipped, and the scheduler selects one of the two thread block dispatch policies shown at label b in fig. 3 according to the inversion flag in the command processor. When the inversion flag is 0, the forward thread block dispatch policy is selected; otherwise, the reverse thread block dispatch policy is selected, so that the dispatched thread blocks coincide with the data retained in memory and the data is reused as much as possible.
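The dispatch direction selection can be pictured with the following minimal sketch. It is an illustration only: in the invention this logic resides in the hardware thread block scheduler, and the deque model and the function name below are assumptions made for the example, not part of the patent.

```cpp
#include <cstdio>
#include <deque>

// Pick the next thread block to issue from the queue of blocks waiting to be issued.
// inversion_flag == 0 -> forward policy: lowest block index first
// inversion_flag == 1 -> reverse policy: highest block index first
int next_thread_block(std::deque<int>& pending, int inversion_flag) {
    int block;
    if (inversion_flag == 0) { block = pending.front(); pending.pop_front(); }
    else                     { block = pending.back();  pending.pop_back();  }
    return block;
}

int main() {
    for (int inversion_flag = 0; inversion_flag <= 1; ++inversion_flag) {
        std::deque<int> pending = {1, 2, 3, 4, 5};   // thread blocks C1..C5
        std::printf("inversion_flag=%d dispatch order:", inversion_flag);
        while (!pending.empty())
            std::printf(" C%d", next_thread_block(pending, inversion_flag));
        std::printf("\n");
    }
    return 0;
}
```

Running the sketch prints the dispatch order C1..C5 for inversion flag 0 and C5..C1 for inversion flag 1, which is exactly the forward/reverse alternation described above.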
In this embodiment, the current program experiencing GPU memory over-provisioning in step 1) means: when the GPU driver on the host monitors a data page request generated by the GPU, if the length of the data page queue maintained by the GPU driver has reached the GPU memory capacity, data pages in GPU memory must be evicted to CPU memory according to the data page replacement policy, and a memory over-provisioning flag is set to indicate that memory over-provisioning has occurred in the current program.
In this embodiment, data sharing between kernel programs in step 1) means: judging from compile-time information whether kernel programs share the same pointer, using the pointer as the indicator of data sharing between kernel programs, and assigning a data sharing flag to each kernel program to be launched to indicate whether it shares data with the previous kernel program; when a kernel program is launched, checking whether the memory over-provisioning flag and the data sharing flag corresponding to the kernel program are both 1, and if so, judging that the cooperative scheduling method is needed.
The application scenario of the cooperative scheduling method provided by this embodiment is that the application program experiences memory over-provisioning and there is data sharing between multiple kernel programs, so detecting memory over-provisioning and inter-kernel data sharing is the key to triggering the mechanism. When the GPU driver on the host monitors a data page request generated by the GPU, if the length of the data page queue maintained by the driver has reached the GPU memory capacity, data pages in GPU memory must be evicted to CPU memory according to the data page replacement policy; at the same time, a memory over-provisioning flag is set to indicate that memory over-provisioning has occurred in the current program, and this flag serves as one of the indicators for enabling the cooperative scheduling mechanism of this embodiment. Whether kernel programs share the same pointer is judged from compile-time information, and the pointer is used as the indicator of data sharing between kernel programs. Each kernel program to be launched is assigned a data sharing flag to indicate whether it shares data with the previous kernel program. When a kernel program is launched, the cooperative scheduling mechanism of this embodiment checks whether the memory over-provisioning flag and the data sharing flag corresponding to the kernel program are both 1; if so, the cooperative scheduling mechanism is triggered, otherwise execution proceeds with the default configuration.
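The trigger and flag handling can be summarized in the following hedged sketch; the structure and function names (SchedulerState, on_kernel_launch) are invented for illustration and are not taken from the patent, which implements this logic in the GPU driver and command processor.

```cpp
#include <cstdio>

struct SchedulerState {
    bool over_provisioned = false;   // set by the GPU driver once the GPU-side page
                                     // queue length reaches the GPU memory capacity
    bool flag_initialized = false;   // whether the inversion flag has been initialized
    int  inversion_flag   = 0;       // inversion flag held in the command processor
};

// Called at every kernel program launch. Returns true when the cooperative
// scheduling mechanism is triggered for this launch.
bool on_kernel_launch(SchedulerState& s, bool shares_data_with_previous_kernel) {
    if (!(s.over_provisioned && shares_data_with_previous_kernel))
        return false;                // fall back to the default configuration
    if (!s.flag_initialized) {       // first qualifying launch: initialize (0 or 1)
        s.inversion_flag = 0;
        s.flag_initialized = true;
    } else {
        s.inversion_flag ^= 1;       // subsequent launches: flip 0 <-> 1
    }
    return true;
}

int main() {
    SchedulerState s;
    s.over_provisioned = true;       // assume the driver has already detected over-provisioning
    for (int launch = 0; launch < 4; ++launch) {
        bool coop = on_kernel_launch(s, /*shares_data_with_previous_kernel=*/true);
        std::printf("launch %d: cooperative=%d inversion_flag=%d\n",
                    launch, coop ? 1 : 0, s.inversion_flag);
    }
    return 0;
}
```

With over-provisioning detected and data sharing present, each successive launch alternates the inversion flag between 0 and 1, which is what drives the alternation of the dispatch and replacement policies in step 2).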
In this embodiment, flipping the inversion flag of the kernel program in step 1) comprises: first detecting whether an inversion flag exists for the kernel program; if no inversion flag exists, initializing the inversion flag for the kernel program; and if an inversion flag already exists, flipping the inversion flag of the kernel program.
In this embodiment, when the inversion flag is initialized, its initial value is 0 or 1.
In this embodiment, flipping the inversion flag of the kernel program means: if the current value of the inversion flag of the kernel program is 0, changing it from 0 to 1; and if the current value is 1, changing it from 1 to 0.
In this embodiment, in step 2), when one of the two preset data page replacement policies, namely the forward data page replacement policy and the reverse data page replacement policy, is selected in turn according to the inversion flag of the kernel program to pick GPU-side data pages from the GPU-side data page queue for eviction, the forward data page replacement policy is selected if the inversion flag is 0, and the reverse data page replacement policy is selected if the inversion flag is 1.
In this embodiment, in step 2), when one of the two preset thread block dispatch policies, namely the forward thread block dispatch policy and the reverse thread block dispatch policy, is selected in turn according to the inversion flag, the forward thread block dispatch policy is selected if the inversion flag is 0, and the reverse thread block dispatch policy is selected if the inversion flag is 1.
To demonstrate the effectiveness of the method of this embodiment, two prior techniques, referred to as Oracle and ETC, are introduced for comparison with the method of this embodiment (inverse allocation), where Oracle fully exploits the shared data between kernel programs that is retained in memory. FIG. 4 shows the average performance of the method of this embodiment (inverse allocation) compared with Oracle and ETC, using the default configuration as the baseline reference, where Baseline is the reference performance for comparison. Fig. 5 shows the data page access pattern and page fault behavior of the FFT program when the method of this embodiment (inverse allocation) is applied. From the analysis of fig. 4: first, compared with the latest method ETC, the method of this embodiment (inverse allocation) improves performance by 20.2% on average, which is consistent with the reduction of the page fault rate at kernel program boundaries shown in fig. 5; second, the performance of inverse allocation is 58.5% worse than Oracle, indicating that there is still much room for improvement. From the analysis of fig. 5, the execution time of the even-numbered kernel programs is shorter than that of the odd-numbered ones; fig. 5 plots, for a GPU memory capacity set to 75% of the data accessed by the FFT, the data page access pattern and the data page fault rate during FFT execution under the method of this embodiment (inverse allocation) (multiple kernel programs access the same data in the same order, and the dotted lines mark the end boundary of each kernel program).
To further explore the reason for the large performance gap between the method of this embodiment (inverse allocation) and Oracle, and for the execution time difference between kernel programs, this embodiment analyzes the execution of a simple test program in which the application contains multiple kernel programs that all access the same data; the execution processes are shown in figs. 7 to 9. To simplify the analysis, we make some assumptions: 1) the GPU can execute only one thread block at a time; 2) five kernel programs (A, B, C, D, E) access the same data, and each kernel program consists of 5 thread blocks (C1-C5), each of which accesses exactly one data page (P1-P5, respectively); 3) the GPU memory capacity is 3 data pages. We first analyze the program execution under the baseline configuration, shown in fig. 7: the kernel programs are launched one after another and their thread blocks are dispatched to the execution units in order. Initially, the requested data pages are loaded into the not-yet-full GPU memory. However, because of the limited memory space and the load-time-based least-recently-used data page replacement policy, the data pages loaded earliest are replaced by newly accessed data pages, so page faults are generated throughout the entire execution, causing a large performance loss, which is consistent with the result shown in fig. 1.
The execution process under the method of this embodiment (inverse allocation) is shown in fig. 8; unlike fig. 7, the thread block allocation order of each kernel program is switched alternately. We make two observations. First, compared with the execution under the baseline configuration, the page fault rate at kernel program boundaries is very low, consistent with the results in fig. 5. Second, for the odd-numbered kernel program executions, the data retained in GPU memory is not fully reused, which explains the performance gap between the method of this embodiment (inverse allocation) and Oracle. In summary, although the method of this embodiment (inverse allocation) already improves performance well, it cannot by itself approach the performance of Oracle; the data page replacement policy is the key to closing the remaining gap of the method of this embodiment (inverse allocation).
Kernel programs in odd-numbered executions do not fully reuse the shared data in memory. Fig. 10 illustrates the basic principle of the default data page replacement policy: the GPU driver on the host maintains a linear list recording the order in which data pages migrate from CPU memory to GPU memory. Although this list does not reflect the access order of the data pages, it has lower overhead than an ideal least-recently-used policy. Because inverse allocation switches the thread block allocation order, this basic data page replacement policy no longer works well together with inverse allocation. As shown in fig. 8, under the load-time-based least-recently-used policy, data page 3 (P3) is evicted to CPU memory while thread block C2 of kernel program B executes, instead of data page 5 (P5). However, it would be preferable for kernel program B to evict data page 5 (P5), so that subsequent kernel programs can fully exploit the shared data.
To solve this problem, we propose a replacement policy called reverse data page replacement (shown at label a in fig. 3). The cooperative inversion scheme introduces, alongside the inverse thread block allocation, a least-recently-used data page replacement policy based on reverse load time, as shown in fig. 11. Compared with the default policy, this new replacement policy switches the insertion and eviction directions of the data pages. As shown at label a in fig. 3, whenever a new kernel program is launched, the GPU driver on the host selects the corresponding data page replacement policy according to the inversion flag bit in the command processor: when the inversion flag is 0, the forward data page replacement policy shown in fig. 10 is used; otherwise, the reverse data page replacement policy shown in fig. 11 is used, so that the replacement policy matches the inverse allocation mechanism and the shared data is fully exploited.
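A hedged sketch of the two replacement directions follows, modeled on a load-time-ordered list maintained in software (the list and function names are illustrative assumptions, not the patented driver code).

```cpp
#include <cstddef>
#include <cstdio>
#include <deque>

struct PageList {
    std::deque<int> pages;   // GPU-resident data pages kept in load order (front = loaded earliest)
    std::size_t capacity;
};

// Handle a page fault on `page`. Returns the evicted page, or -1 if none.
// inversion_flag == 0 -> forward policy: evict the earliest-loaded page, append the new page
// inversion_flag == 1 -> reverse policy: evict the latest-loaded page, prepend the new page
int load_page(PageList& l, int page, int inversion_flag) {
    int evicted = -1;
    if (l.pages.size() == l.capacity) {
        if (inversion_flag == 0) { evicted = l.pages.front(); l.pages.pop_front(); }
        else                     { evicted = l.pages.back();  l.pages.pop_back();  }
    }
    if (inversion_flag == 0) l.pages.push_back(page);
    else                     l.pages.push_front(page);
    return evicted;
}

int main() {
    // Replay the situation discussed above: after kernel program A, GPU memory holds
    // P3, P4, P5 in load order; kernel program B runs with the reverse policy and
    // faults on P2 and then P1.
    PageList l{{3, 4, 5}, 3};
    std::printf("fault on P2 -> evict P%d\n", load_page(l, 2, /*inversion_flag=*/1));
    std::printf("fault on P1 -> evict P%d\n", load_page(l, 1, /*inversion_flag=*/1));
    std::printf("resident after kernel B:");
    for (int p : l.pages) std::printf(" P%d", p);
    std::printf("\n");               // expected: P1 P2 P3, ready for kernel program C
    return 0;
}
```

With the reverse policy, kernel program B evicts P5 and then P4, leaving P1-P3 resident for kernel program C, which is the behavior the cooperative inversion scheme relies on.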
Fig. 9 illustrates the execution of the simple test program introduced above when the cooperative inversion mechanism, i.e., the inverse allocation mechanism and the reverse replacement mechanism applied together, is used. Now data page 5 (P5) is evicted to CPU memory when thread block C2 of kernel program B executes. Therefore, when kernel program C executes, all the data previously retained in GPU memory is fully used, so no page faults are generated at the kernel program boundary. FIG. 6 shows the data page access pattern and page fault rate of the FFT program when the cooperative inversion scheme is used. We make two observations: first, the performance of the odd-numbered kernel programs is greatly improved, yielding better performance than inverse allocation alone; second, more shared data is reused, resulting in fewer data page faults at kernel program boundaries. We conclude that page faults at kernel program boundaries can be effectively reduced by reusing shared data.
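The following self-contained toy simulation replays the five-kernel example under the stated assumptions (one thread block at a time, kernel programs A-E each running blocks C1-C5 that touch pages P1-P5, GPU memory of 3 pages) and counts page faults for the baseline and for cooperative inversion. It is an illustration of the scheduling idea only, not the patented hardware/driver implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <deque>
#include <vector>

// Run the five-kernel example and return the total number of page faults.
// cooperative == false: baseline (forward dispatch + forward replacement everywhere)
// cooperative == true : cooperative inversion (both directions alternate per kernel launch)
static int run(bool cooperative) {
    std::deque<int> resident;                 // GPU-resident pages in load order
    const std::size_t capacity = 3;           // GPU memory holds 3 data pages
    int faults = 0;
    for (int kernel = 0; kernel < 5; ++kernel) {               // kernel programs A..E
        bool reversed = cooperative && (kernel % 2 == 1);      // inversion flag flips per launch
        std::vector<int> blocks = {1, 2, 3, 4, 5};             // block Ci accesses page Pi
        if (reversed) std::reverse(blocks.begin(), blocks.end());
        for (int page : blocks) {
            if (std::find(resident.begin(), resident.end(), page) != resident.end())
                continue;                                      // page hit, no migration
            ++faults;                                          // page fault: migrate the page in
            if (resident.size() == capacity) {
                if (reversed) resident.pop_back();             // reverse replacement direction
                else          resident.pop_front();            // forward replacement direction
            }
            if (reversed) resident.push_front(page);
            else          resident.push_back(page);
        }
    }
    return faults;
}

int main() {
    std::printf("baseline page faults:              %d\n", run(false));
    std::printf("cooperative inversion page faults: %d\n", run(true));
    return 0;
}
```

In this toy run the baseline faults on every thread block of every kernel program, while under cooperative inversion only the first kernel program and the genuinely evicted pages of the later kernel programs fault, mirroring the behavior described for figs. 7 and 9.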
In this embodiment, GPGPU-Sim v4.0.0 is extended for the evaluation. Table 1 lists the relevant configuration of the simulated GPU system, including the cores and the memory. We carefully model the on-demand migration of data between CPU memory and GPU memory, setting the page fault handling latency to an optimistic 20 μs. When GPU memory is full, the GPU driver replaces data pages according to the corresponding data page replacement policy. To experiment with different degrees of memory over-provisioning, we set the GPU memory capacity to a given fraction (75%-95%) of the space required by each application. We selected 12 applications from standard benchmark suites such as the CUDA SDK, Rodinia, ISPASS, and Polybench for the experiments; their memory footprints range from 1 MB to 96 MB, with an average of 18.5 MB. The limited simulation speed prevents us from simulating applications with larger footprints.
Table 1 simulator basic configuration information:
(Table 1 is reproduced as an image in the original publication and is not available as text.)
To evaluate the effectiveness of our method, we also implemented the latest technique, ETC. FIG. 12 compares the performance of multiple applications under inverse allocation, cooperative inversion, ETC, and Oracle, relative to the baseline configuration. We make three observations. First, ETC shows poor optimization capability, with an average improvement of 15%; inverse allocation and cooperative inversion improve performance by 20.2% and 65%, respectively, compared with ETC. Second, for almost all applications, cooperative inversion achieves performance close to Oracle, mainly because most of the shared data between kernel programs is effectively reused. Finally, for the BFS and FWT programs, cooperative inversion does not improve performance much, because the different kernel programs of these two programs have irregular data access patterns that cooperative inversion cannot capture and exploit well. In summary, applications with substantial data sharing between kernel programs suffer a severe performance penalty under memory over-provisioning; the method of this embodiment schedules thread blocks and data pages cooperatively in order to reduce the impact of memory over-provisioning on system performance by reusing the shared data.
In summary, although the latest GPUs are equipped with ever larger memories, they still cannot hold the entire working set of large applications. With the support of unified virtual memory and on-demand data page migration, such large programs can still run correctly without programmer intervention, but this convenience comes at a performance cost. We found that, despite much recent research on reducing the number of page migrations between the CPU and the GPU, applications with data sharing between kernel programs still do not obtain good performance improvements. The method of this embodiment therefore provides a cooperative scheduling of thread blocks and data pages that reduces the impact of memory over-provisioning in a manner transparent to the programmer. The basic principle of the GPU data reuse-oriented storage and calculation cooperative scheduling method of this embodiment is to effectively use the data shared between kernel programs by coordinating the thread block allocation order with the data page replacement order; experiments show that the system performance of this method is 65% higher than that of the latest research method.
In addition, this embodiment further provides a GPU data reuse-oriented storage and calculation cooperative scheduling system, comprising:
an inversion flag management program unit, configured to detect whether a kernel program is launched and, if so, flip the inversion flag of the kernel program;
a data page replacement program unit, configured to, in the GPU driver, for the data page replacement of the kernel program, select in turn, according to the inversion flag of the kernel program, one of two preset data page replacement policies, namely a forward data page replacement policy and a reverse data page replacement policy, to pick GPU-side data pages from the GPU-side data page queue for eviction, the forward and reverse data page replacement policies picking GPU-side data pages from opposite directions;
and a thread block selection program unit, configured to, in the thread block scheduler of the GPU, for the thread block scheduling of the kernel program, select in turn, according to the inversion flag of the kernel program, one of two preset thread block dispatch policies, namely a forward thread block dispatch policy and a reverse thread block dispatch policy, to pick thread blocks from the queue of thread blocks waiting to be issued and issue them, the forward and reverse thread block dispatch policies picking thread blocks from opposite directions.
In addition, this embodiment further provides a GPU data reuse-oriented storage and calculation cooperative scheduling system, comprising:
a CPU, configured to detect whether a kernel program is launched and, if so, flip the inversion flag of the kernel program, and further configured to, in the GPU driver, for the data page replacement of the kernel program, select in turn, according to the inversion flag of the kernel program, one of two preset data page replacement policies, namely a forward data page replacement policy and a reverse data page replacement policy, to pick GPU-side data pages from the GPU-side data page queue for eviction, the forward and reverse data page replacement policies picking GPU-side data pages from opposite directions;
and a GPU, configured to, in its thread block scheduler, for the thread block scheduling of the kernel program, select in turn, according to the inversion flag of the kernel program, one of two preset thread block dispatch policies, namely a forward thread block dispatch policy and a reverse thread block dispatch policy, to pick thread blocks from the queue of thread blocks waiting to be issued and issue them, the forward and reverse thread block dispatch policies picking thread blocks from opposite directions;
the CPU and the GPU being connected to each other.
In addition, the embodiment also provides a storage and computation cooperative scheduling system for GPU data reuse, which includes a processing unit and a memory connected to each other, where the processing unit is programmed or configured to execute the steps of the storage and computation cooperative scheduling method for GPU data reuse.
In addition, the present embodiment also provides a computer-readable storage medium, in which a computer program programmed or configured to execute the storage and computation cooperative scheduling method for GPU data reuse is stored.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above embodiments, and all technical solutions belonging to the idea of the present invention belong to the protection scope of the present invention. It should be noted that modifications and embellishments within the scope of the invention may occur to those skilled in the art without departing from the principle of the invention, and are considered to be within the scope of the invention.

Claims (10)

1. A GPU data reuse-oriented storage and calculation cooperative scheduling method is characterized by comprising the following steps:
1) under the condition that the current program experiences GPU memory over-provisioning and there is data sharing between kernel programs, detecting whether a kernel program is launched, and if so, flipping the inversion flag of the kernel program;
2) in the GPU driver, for the data page replacement of the kernel program, selecting in turn, according to the inversion flag of the kernel program, one of two preset data page replacement policies, namely a forward data page replacement policy and a reverse data page replacement policy, to pick GPU-side data pages from the GPU-side data page queue for eviction, the forward and reverse data page replacement policies picking GPU-side data pages from opposite directions; and, in the thread block scheduler of the GPU, for the thread block scheduling of the kernel program, selecting in turn, according to the inversion flag of the kernel program, one of two preset thread block dispatch policies, namely a forward thread block dispatch policy and a reverse thread block dispatch policy, to pick thread blocks from the queue of thread blocks waiting to be issued, the forward and reverse thread block dispatch policies picking thread blocks from opposite directions.
2. The GPU data reuse-oriented storage and calculation cooperative scheduling method according to claim 1, wherein flipping the inversion flag of the kernel program in step 1) comprises: first detecting whether an inversion flag exists for the kernel program; if no inversion flag exists, initializing the inversion flag for the kernel program; and if an inversion flag already exists, flipping the inversion flag of the kernel program.
3. The GPU data reuse-oriented storage and calculation cooperative scheduling method according to claim 2, wherein when the inversion flag is initialized, its initial value is 0 or 1.
4. The GPU data reuse-oriented storage and calculation cooperative scheduling method according to claim 2, wherein flipping the inversion flag of the kernel program means: if the current value of the inversion flag of the kernel program is 0, changing it from 0 to 1; and if the current value is 1, changing it from 1 to 0.
5. The GPU data reuse-oriented storage and calculation cooperative scheduling method according to claim 1, wherein in step 2), when one of the two preset data page replacement policies, namely the forward data page replacement policy and the reverse data page replacement policy, is selected in turn according to the inversion flag of the kernel program to pick GPU-side data pages from the GPU-side data page queue for eviction, the forward data page replacement policy is selected if the inversion flag is 0, and the reverse data page replacement policy is selected if the inversion flag is 1.
6. The GPU data reuse-oriented storage and calculation cooperative scheduling method according to claim 1, wherein in step 2), when one of the two preset thread block dispatch policies, namely the forward thread block dispatch policy and the reverse thread block dispatch policy, is selected in turn according to the inversion flag, the forward thread block dispatch policy is selected if the inversion flag is 0, and the reverse thread block dispatch policy is selected if the inversion flag is 1.
7. The GPU data reuse-oriented storage and calculation cooperative scheduling method according to claim 1, wherein the current program experiencing GPU memory over-provisioning in step 1) means: when the GPU driver on the host monitors a data page request generated by the GPU, if the length of the data page queue maintained by the GPU driver has reached the GPU memory capacity, data pages in GPU memory must be evicted to CPU memory according to the data page replacement policy, and a memory over-provisioning flag is set to indicate that memory over-provisioning has occurred in the current program.
8. The GPU data reuse-oriented storage and calculation cooperative scheduling method according to claim 1, wherein data sharing between kernel programs in step 1) means: judging from compile-time information whether kernel programs share the same pointer, using the pointer as the indicator of data sharing between kernel programs, and assigning a data sharing flag to each kernel program to be launched to indicate whether it shares data with the previous kernel program; and when a kernel program is launched, checking whether the memory over-provisioning flag and the data sharing flag corresponding to the kernel program are both 1, and if so, judging that the cooperative scheduling method is needed.
9. A GPU data reuse-oriented storage and calculation cooperative scheduling system, comprising a processing unit and a memory connected to each other, wherein the processing unit is programmed or configured to perform the steps of the GPU data reuse-oriented storage and calculation cooperative scheduling method according to any one of claims 1 to 8.
10. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program programmed or configured to perform the GPU data reuse-oriented storage and calculation cooperative scheduling method according to any one of claims 1 to 8.
CN202110649358.7A 2021-06-10 2021-06-10 Storage computing collaborative scheduling method and system for GPU data reuse Active CN113377538B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110649358.7A CN113377538B (en) 2021-06-10 2021-06-10 Storage computing collaborative scheduling method and system for GPU data reuse

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110649358.7A CN113377538B (en) 2021-06-10 2021-06-10 Storage computing collaborative scheduling method and system for GPU data reuse

Publications (2)

Publication Number Publication Date
CN113377538A true CN113377538A (en) 2021-09-10
CN113377538B CN113377538B (en) 2023-06-20

Family

ID=77573750

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110649358.7A Active CN113377538B (en) 2021-06-10 2021-06-10 Storage computing collaborative scheduling method and system for GPU data reuse

Country Status (1)

Country Link
CN (1) CN113377538B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103365796A (en) * 2012-04-05 2013-10-23 西门子公司 Volume rendering on shared memory systems with multiple processors by optimizing cache reuse
US20200327019A1 (en) * 2019-04-11 2020-10-15 International Business Machines Corporation Crash recoverability for graphics processing units (gpu) in a computing environment
CN112801849A (en) * 2019-11-14 2021-05-14 英特尔公司 Method and apparatus for scheduling thread order to improve cache efficiency
CN112181689A (en) * 2020-09-30 2021-01-05 华东师范大学 Runtime system for efficiently scheduling GPU kernel under cloud

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115361451A (en) * 2022-10-24 2022-11-18 中国人民解放军国防科技大学 Network communication parallel processing method and system

Also Published As

Publication number Publication date
CN113377538B (en) 2023-06-20

Similar Documents

Publication Publication Date Title
Usui et al. DASH: Deadline-aware high-performance memory scheduler for heterogeneous systems with hardware accelerators
DE19983793B4 (en) A system comprising a processor on which a plurality of concurrent execution entities are executed, and a cache memory having multiple cache sections associated with execution entities
US7234040B2 (en) Program-directed cache prefetching for media processors
US6493800B1 (en) Method and system for dynamically partitioning a shared cache
EP3912027B1 (en) Data structure processing
CN103019962B (en) Data buffer storage disposal route, device and system
DE102012221504B4 (en) Multilevel-Instruction-Cache-Pre-Fetch
CN103383672B (en) High-speed cache control is to reduce transaction rollback
US20030154349A1 (en) Program-directed cache prefetching for media processors
US20030229761A1 (en) Memory compression for computer systems
CN101067781A (en) Technique to perform memory disambiguation
JP2003131946A (en) Method and device for controlling cache memory
US20110213925A1 (en) Methods for reducing cache memory pollution during parity calculations of raid data
JPH07271674A (en) Method for optimization of cache
CN111813710B (en) Method and device for avoiding Linux kernel memory fragmentation and computer storage medium
CN113377538B (en) Storage computing collaborative scheduling method and system for GPU data reuse
CN108733585B (en) Cache system and related method
US7822920B2 (en) Mass prefetching method for disk array
Snir et al. On the theory of spatial and temporal locality
KR20240023642A (en) Dynamic merging of atomic memory operations for memory-local computing.
CN104461928A (en) Method and device for dividing caches
JP4792065B2 (en) Data storage method
US11354127B2 (en) Method of managing multi-tier memory displacement using software controlled thresholds
Qin et al. Adaptive Cache Allocation with Prefetching Policy over End-to-End Data Processing
CN117032594B (en) Read command scheduling method, processing method, device and storage equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant