CN113377538B - Storage computing collaborative scheduling method and system for GPU data reuse - Google Patents


Info

Publication number
CN113377538B
CN113377538B (application number CN202110649358.7A)
Authority
CN
China
Prior art keywords
gpu
data
kernel
data page
reverse
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110649358.7A
Other languages
Chinese (zh)
Other versions
CN113377538A (en)
Inventor
李晨
李宣佚
郭阳
鲁建壮
陈小文
刘胜
张洋
刘畅
曹壮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202110649358.7A priority Critical patent/CN113377538B/en
Publication of CN113377538A publication Critical patent/CN113377538A/en
Application granted granted Critical
Publication of CN113377538B publication Critical patent/CN113377538B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The invention discloses a storage computing collaborative scheduling method and system for GPU data reuse. In the method, each time a kernel program is launched, its reverse flag is flipped. In the thread block scheduler of the GPU, thread block scheduling for the kernel program alternates, according to the kernel program's reverse flag, between a forward thread block dispatch policy and a reverse thread block dispatch policy when selecting the next thread block to issue from the pending thread block queue. In the GPU driver, data page replacement for the kernel program likewise alternates, according to the reverse flag, between a forward data page replacement policy and a reverse data page replacement policy when selecting a GPU-side data page from the GPU-side data page queue for eviction. The invention realizes cooperative scheduling of thread blocks and data pages, reduces the impact of memory over-subscription on system performance by reusing shared data, and can effectively improve system performance.

Description

Storage computing collaborative scheduling method and system for GPU data reuse
Technical Field
The invention relates to computer scheduling technology, and in particular to a storage computing collaborative scheduling method and system for GPU data reuse.
Background
Because of their high computational throughput and good programmability, GPUs have been widely used in high-performance areas including machine learning, object detection, and image denoising. However, the ever-growing working sets of applications (the data the GPU accesses per unit time) can no longer be accommodated by the limited memory space on GPUs. The introduction of unified virtual memory and on-demand page fetching provides good support for memory over-subscription, but the additional data page transfers between CPU memory and GPU memory cause a loss of system performance. Reducing these unnecessary data migrations is therefore crucial for performance. After studying a large set of benchmark programs, we found that many applications exhibit inter-kernel (Kernel) data sharing, and for most such programs each kernel accesses the same data region in a similar access order. When GPU memory cannot hold the entire working set of a kernel, old data pages are swapped out to CPU memory and the required data pages are fetched into GPU memory. When a kernel program finishes, only the most recently accessed data pages remain in GPU memory, and the data pages already evicted to CPU memory are accessed again after the subsequent kernel program starts. We found that, although there is a large amount of shared data between the kernel programs of these applications, this data sharing characteristic is lost once memory over-subscription occurs, causing a drastic drop in system performance.
Efficient use of the data already in GPU memory is critical to avoiding the long-latency overhead of page faults, especially under memory over-subscription. FIG. 1 illustrates the performance degradation caused by memory over-subscription for applications with data sharing between kernel programs. We found that such applications are insensitive to the degree of over-subscription: even a small amount causes a dramatic performance drop. FIG. 2 illustrates the variation in data page access characteristics and page fault rate when GPU memory can hold only 75% of the data accessed by the FFT program (the multiple kernel programs access the same data in the same order, and the dashed lines mark the end boundary of each kernel program). We found that each kernel in the FFT has similar data access characteristics and ordering, and that the page fault rate is very high at the boundary where each kernel begins (the circled area in FIG. 2), because earlier accessed data pages have been replaced by newly accessed ones, forcing the subsequent kernel to re-access evicted pages and thereby trigger page faults. To fundamentally reduce the number of page migrations under memory over-subscription, many techniques have been proposed, including prefetching, hiding transfer time behind computation, and batching page faults. However, these techniques do little to improve the performance of this class of applications: prefetching evicts useful data pages and causes thrashing, and the long latency produced by a large number of page faults cannot be hidden by pre-eviction or by batching. Based on the above analysis, we draw three conclusions for applications with data sharing between kernel programs. First, once memory over-subscription occurs, program performance drops dramatically. Second, previous optimizations for performance under memory over-subscription do not apply to such programs. Finally, the page fault rate at kernel boundaries is very high. Our research goal is therefore to reduce the page fault rate at kernel boundaries by reusing the data shared between kernels.
Disclosure of Invention
The technical problem to be solved by the invention: in view of the above problems in the prior art, the invention provides a storage computing collaborative scheduling method and system for GPU data reuse.
In order to solve the technical problems, the invention adopts the following technical scheme:
a storage computing cooperative scheduling method for GPU data reuse comprises the following steps:
1) When the current program over-subscribes the GPU memory capacity and data sharing exists among its kernel programs, detecting whether a kernel program is launched, and if so, flipping the reverse flag of that kernel program;
2) In the GPU driver, for data page replacement of the kernel program, selecting in turn, according to the kernel program's reverse flag, one of two preset data page replacement policies (a forward data page replacement policy and a reverse data page replacement policy) to select a GPU-side data page from the GPU-side data page queue for replacement, the forward and reverse data page replacement policies selecting data pages in opposite directions; in the thread block scheduler of the GPU, for thread block scheduling of the kernel program, selecting in turn, according to the kernel program's reverse flag, one of two preset thread block dispatch policies (a forward thread block dispatch policy and a reverse thread block dispatch policy) to select a thread block to issue from the pending thread block queue, the forward and reverse thread block dispatch policies selecting thread blocks in opposite directions.
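For illustration only, the following C++ sketch shows one possible way to realize steps 1) and 2): a per-kernel reverse flag is flipped at each launch and is read by both the page replacement logic and the thread block scheduler, so the two stay coordinated. All type and function names are hypothetical and do not reflect any actual driver or hardware interface.

```cpp
#include <cstdint>

enum class Policy { Forward, Reverse };

struct KernelContext {
    bool    has_reverse_flag = false;  // whether a reverse flag was ever initialized
    uint8_t reverse_flag     = 0;      // 0 -> forward policies, 1 -> reverse policies
};

// Called whenever a kernel of the monitored program is launched (step 1).
void on_kernel_launch(KernelContext& ctx) {
    if (!ctx.has_reverse_flag) {
        ctx.reverse_flag = 0;          // the initialization value may be 0 or 1
        ctx.has_reverse_flag = true;
    } else {
        ctx.reverse_flag ^= 1;         // flip the reverse flag
    }
}

// Step 2: the GPU driver and the thread block scheduler read the same flag,
// so page replacement and thread block dispatch switch direction together.
Policy page_replacement_policy(const KernelContext& ctx) {
    return ctx.reverse_flag == 0 ? Policy::Forward : Policy::Reverse;
}
Policy thread_block_dispatch_policy(const KernelContext& ctx) {
    return ctx.reverse_flag == 0 ? Policy::Forward : Policy::Reverse;
}
```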
Optionally, flipping the reverse flag of the kernel program in step 1) includes: first detecting whether a reverse flag exists for the kernel program; if not, initializing a reverse flag for the kernel program; if so, flipping the reverse flag of the kernel program.
Optionally, when the reverse flag is initialized, its initial value is 0 or 1.
Optionally, flipping the reverse flag of the kernel program means: the reverse flag is changed from 0 to 1 if its original value is 0, and from 1 to 0 if its original value is 1.
Optionally, when in step 2) one of the two preset data page replacement policies is selected in turn according to the kernel program's reverse flag to select a GPU-side data page from the GPU-side data page queue for replacement, the forward data page replacement policy is selected if the reverse flag is 0, and the reverse data page replacement policy is selected if the reverse flag is 1.
Optionally, when in step 2) one of the two preset thread block dispatch policies is selected in turn according to the reverse flag, the forward thread block dispatch policy is selected if the reverse flag is 0, and the reverse thread block dispatch policy is selected if the reverse flag is 1.
Optionally, the GPU memory capacity over-subscription of the current program in step 1) means: when the GPU driver on the host monitors a data page request generated by the GPU, if the length of the data page queue maintained by the GPU driver has already reached the GPU memory capacity, a data page in GPU memory must be evicted to CPU memory according to the data page replacement policy, and a memory over-subscription flag is set at the same time to indicate that memory over-subscription has occurred for the current program.
Optionally, the existence of data sharing between kernel programs in step 1) means: whether the same pointer is shared between kernel programs is determined from compile-time information and used as an indicator of inter-kernel data sharing, and a data sharing flag is assigned to each kernel program to be launched to indicate whether it shares data with the preceding kernel program; when a kernel program is launched, it is checked whether both the memory over-subscription flag and the data sharing flag of that kernel program are 1, and if so, the collaborative scheduling method is judged to be required.
In addition, the invention also provides a storage computing collaborative scheduling system for GPU data reuse, comprising a processing unit and a memory connected to each other, wherein the processing unit is programmed or configured to execute the steps of the above storage computing collaborative scheduling method for GPU data reuse.
In addition, the invention further provides a computer-readable storage medium storing a computer program programmed or configured to execute the above storage computing collaborative scheduling method for GPU data reuse.
Compared with the prior art, the invention has the following advantages: as analyzed in the background, many GPU applications share a large amount of data between kernel programs, yet this sharing is lost once memory over-subscription occurs, causing a drastic drop in system performance. Based on this observation, the invention cooperatively schedules thread blocks and data pages so as to effectively exploit the data shared between kernel programs: by coordinating the switching of the thread block allocation order (i.e., the order in which thread blocks are dispatched) with the switching of the data page replacement policy, the shared data between kernel programs is fully reused. The method has been evaluated on a large number of GPU benchmarks, and the results show that it improves performance by 65% over the latest research.
Drawings
FIG. 1 is a graph illustrating the impact of GPU memory over-subscription on system performance.
FIG. 2 is a graph showing the variation of the data page access characteristics and page fault rate when GPU memory can accommodate only 75% of the data accessed by the FFT program.
FIG. 3 is a schematic diagram of the basic principle of the method according to an embodiment of the invention.
FIG. 4 is a graph comparing the performance of the method of the invention with prior art methods.
FIG. 5 shows the variation of the data page access characteristics and page fault rate during FFT execution when the GPU memory capacity is set to 75% of the amount of data accessed by the FFT and the reverse allocation method of an embodiment of the invention is used.
FIG. 6 shows the variation of the data page access characteristics and page fault rate during FFT execution when the GPU memory capacity is set to 75% of the amount of data accessed by the FFT and the cooperative inversion strategy of an embodiment of the invention is used.
FIG. 7 is a diagram illustrating the execution process under the baseline configuration in an embodiment of the invention.
FIG. 8 is a diagram illustrating the execution process when reverse allocation is used in an embodiment of the invention.
FIG. 9 is a diagram illustrating the execution process when the cooperative inversion method is used in an embodiment of the invention.
FIG. 10 is a diagram illustrating the load-time-based least recently used data page replacement policy in an embodiment of the invention.
FIG. 11 is a diagram illustrating the reverse-load-time-based least recently used data page replacement policy in an embodiment of the invention.
FIG. 12 is a graph comparing IPC under different optimization mechanisms, using the baseline configuration as the performance reference, in an embodiment of the invention.
Detailed Description
As shown in FIG. 3, the storage computing collaborative scheduling method for GPU data reuse of this embodiment includes:
1) When the current program over-subscribes the GPU memory capacity and data sharing exists among its kernel programs, detecting whether a kernel program is launched, and if so, flipping the reverse flag of that kernel program;
2) In the GPU driver, for data page replacement of the kernel program, selecting in turn, according to the kernel program's reverse flag, one of two preset data page replacement policies (a forward data page replacement policy and a reverse data page replacement policy) to select a GPU-side data page from the GPU-side data page queue for replacement, the forward and reverse data page replacement policies selecting data pages in opposite directions; in the thread block scheduler of the GPU, for thread block scheduling of the kernel program, selecting in turn, according to the kernel program's reverse flag, one of two preset thread block dispatch policies (a forward thread block dispatch policy and a reverse thread block dispatch policy) to select a thread block to issue from the pending thread block queue, the forward and reverse thread block dispatch policies selecting thread blocks in opposite directions.
As indicated above, coordinating the thread blocks with the data already in memory is critical to reducing the page fault rate at kernel boundaries; the greatest design challenge, however, is determining which thread blocks are associated with that data. Based on observation and analysis of such applications, the method of this embodiment finds that in most of these programs each kernel program exhibits similar data access characteristics and ordering (as shown in FIG. 2), and that within each kernel program the access behavior to data pages is tied to the thread block scheduling mechanism, because each thread generally uses its thread index and thread block index to determine the data location it operates on. In view of this, the method of this embodiment proposes a thread block allocation mechanism called reverse allocation, indicated by reference b in FIG. 3, which dynamically changes the dispatch policy of thread blocks by adjusting the thread block scheduling mechanism. The default thread block dispatch policy is shown on the left side of reference b in FIG. 3, where thread blocks are dispatched in order of increasing block number. The reverse flag is toggled each time a kernel program is launched; the scheduler selects one of the two thread block dispatch policies shown at reference b in FIG. 3 according to the reverse flag in the command processor. When the reverse flag is 0 the forward thread block dispatch policy is selected, and otherwise the reverse thread block dispatch policy is selected, so that the dispatched thread blocks match the data retained in memory and that data is reused to the greatest extent.
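For illustration only, the following minimal C++ sketch captures the two dispatch directions described above, assuming the pending thread blocks of the current kernel are held in a queue ordered by ascending block ID; the data structures and names are hypothetical and not part of the patented hardware.

```cpp
#include <deque>
#include <optional>

struct ThreadBlock { int id; };

// Pending thread blocks of the current kernel, ordered by ascending block ID.
using PendingQueue = std::deque<ThreadBlock>;

// Forward dispatch: issue thread blocks from the lowest ID upward (default policy).
// Reverse dispatch: issue from the highest ID downward, so the first blocks of this
// kernel touch the data pages the previous kernel left resident in GPU memory.
std::optional<ThreadBlock> select_next_block(PendingQueue& pending, bool reverse_flag) {
    if (pending.empty()) return std::nullopt;
    ThreadBlock tb;
    if (!reverse_flag) {
        tb = pending.front();
        pending.pop_front();
    } else {
        tb = pending.back();
        pending.pop_back();
    }
    return tb;
}
```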
In this embodiment, the GPU memory capacity over-subscription of the current program in step 1) means: when the GPU driver on the host monitors a data page request generated by the GPU, if the length of the data page queue maintained by the GPU driver has already reached the GPU memory capacity, a data page in GPU memory must be evicted to CPU memory according to the data page replacement policy, and a memory over-subscription flag is set at the same time to indicate that memory over-subscription has occurred for the current program.
In this embodiment, the existence of data sharing between kernel programs in step 1) means: whether the same pointer is shared between kernel programs is determined from compile-time information and used as an indicator of inter-kernel data sharing, and a data sharing flag is assigned to each kernel program to be launched to indicate whether it shares data with the preceding kernel program; when a kernel program is launched, it is checked whether both the memory over-subscription flag and the data sharing flag of that kernel program are 1, and if so, the collaborative scheduling method is judged to be required.
The application scenario of the collaborative scheduling method provided in this embodiment is an application program that over-subscribes memory and whose multiple kernel programs share data; detecting memory over-subscription and inter-kernel data sharing is therefore the key to triggering the mechanism. When the GPU driver on the host monitors a data page request generated by the GPU, if the length of the data page queue maintained by the GPU driver has already reached the GPU memory capacity, a data page in GPU memory must be evicted to CPU memory according to the data page replacement policy, and a memory over-subscription flag is set to indicate that memory over-subscription has occurred for the current program; this flag serves as one of the indicators for starting the collaborative scheduling mechanism of this embodiment. Whether the same pointer is shared between kernel programs is determined from compile-time information and used as the indicator of inter-kernel data sharing, and a data sharing flag is assigned to each kernel program to be launched to indicate whether it shares data with the preceding kernel program. When a kernel program is launched, the collaborative scheduling mechanism of this embodiment checks whether both the memory over-subscription flag and the data sharing flag of that kernel program are 1; if so, the collaborative scheduling mechanism is triggered, otherwise execution proceeds with the default configuration.
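For illustration only, the following C++ sketch pictures the two trigger conditions, a host-side over-subscription check and a compile-time style shared-pointer check; all names are hypothetical, and the real driver and compiler interfaces are not shown.

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>

struct GpuDriverState {
    size_t gpu_page_capacity = 0;     // GPU memory capacity, in pages
    size_t resident_pages    = 0;     // length of the driver-maintained page queue
    bool   oversubscribed    = false; // memory over-subscription flag
};

// Called when the driver services a page request from the GPU; if the resident
// page queue has already reached capacity, a page must be evicted and the
// over-subscription flag is set.
void on_page_request(GpuDriverState& drv) {
    if (drv.resident_pages >= drv.gpu_page_capacity) {
        drv.oversubscribed = true;
    } else {
        ++drv.resident_pages;
    }
}

// Compile-time style check: a kernel shares data with its predecessor if the two
// kernels reference at least one common pointer argument.
bool shares_data(const std::vector<const void*>& prev_ptrs,
                 const std::vector<const void*>& cur_ptrs) {
    std::unordered_set<const void*> prev(prev_ptrs.begin(), prev_ptrs.end());
    for (const void* p : cur_ptrs)
        if (prev.count(p)) return true;
    return false;
}

// Co-scheduling is enabled only when both conditions hold for the launched kernel.
bool co_scheduling_enabled(const GpuDriverState& drv, bool data_sharing_flag) {
    return drv.oversubscribed && data_sharing_flag;
}
```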
In this embodiment, flipping the reverse flag of the kernel program in step 1) includes: first detecting whether a reverse flag exists for the kernel program; if not, initializing a reverse flag for the kernel program; if so, flipping the reverse flag of the kernel program.
In this embodiment, when the reverse flag is initialized, its initial value is 0 or 1.
In this embodiment, flipping the reverse flag of the kernel program means: the reverse flag is changed from 0 to 1 if its original value is 0, and from 1 to 0 if its original value is 1.
In this embodiment, when in step 2) one of the two preset data page replacement policies is selected in turn according to the kernel program's reverse flag to select a GPU-side data page from the GPU-side data page queue for replacement, the forward data page replacement policy is selected if the reverse flag is 0, and the reverse data page replacement policy is selected if the reverse flag is 1.
In this embodiment, when in step 2) one of the two preset thread block dispatch policies is selected in turn according to the reverse flag, the forward thread block dispatch policy is selected if the reverse flag is 0, and the reverse thread block dispatch policy is selected if the reverse flag is 1.
To demonstrate the effectiveness of the method of this embodiment (reverse allocation), this embodiment compares it against two existing techniques, Oracle and ETC, where Oracle fully utilizes the shared data between kernel programs that is retained in memory. FIG. 4 shows the average performance of the method of this embodiment (reverse allocation), Oracle and ETC relative to the default configuration, where Baseline is the reference performance. FIG. 5 shows the data page access characteristics and page fault variation of the FFT program when the method of this embodiment (reverse allocation) is applied. From the analysis of FIG. 4: first, the method of this embodiment (reverse allocation) improves performance by 20.2% on average over the latest technique ETC, consistent with the drop in page fault rate at kernel program boundaries illustrated in FIG. 5; second, reverse allocation still falls 58.5% short of Oracle, showing that there is considerable room for improvement. From the analysis of FIG. 5, the execution time of the even-numbered kernel programs is shorter than that of the odd-numbered kernel programs; FIG. 5 shows, for the case where the GPU memory capacity is set to 75% of the amount of data accessed by the FFT, the data page access characteristics and the variation of the page fault rate during FFT execution under the method of this embodiment (reverse allocation) (the multiple kernel programs access the same data in the same order, and the dashed lines indicate the end boundary of each kernel program).
To get to the bottom of the large performance gap between the method of this embodiment (reverse allocation) and Oracle, and of the execution time difference between different kernel programs, this embodiment analyzes the execution of a simple test program: the application contains multiple kernel programs that all access the same data, and the detailed execution is shown in FIGS. 7-9. To simplify the analysis we make some assumptions: 1) the GPU can execute only one thread block at a time; 2) five kernel programs (A, B, C, D, E) access the same data, and each kernel program consists of 5 thread blocks (C1-C5), each of which accesses only one data page (P1-P5, respectively); 3) the GPU memory capacity is 3 data pages. We first analyze the program execution under the baseline configuration shown in FIG. 7, where the kernel programs are launched in order and the associated thread blocks are likewise dispatched in order to the corresponding execution units. Initially, the requested data pages are loaded into the not-yet-full GPU memory. However, because of the limited storage space and the load-time-based least recently used data page replacement policy, the data pages loaded earliest are replaced by newly accessed data pages, so page faults occur throughout the entire execution, causing a large performance loss, which is consistent with the result shown in FIG. 1.
The execution process with the method of this embodiment (reverse allocation) applied is shown in FIG. 8; unlike FIG. 7, the thread block allocation order is switched for each successive kernel program. We find, first, that the page fault rate at kernel boundaries is low compared to the baseline execution, consistent with the results of FIG. 5. Second, for the odd-numbered kernel launches, the data retained in GPU memory is not fully utilized, which also explains the performance gap between the method of this embodiment (reverse allocation) and Oracle. In summary, while the method of this embodiment (reverse allocation) improves performance considerably, it cannot reach a performance improvement similar to Oracle's. Matching the data page replacement policy with the method of this embodiment (reverse allocation) is the key to closing this gap.
The odd-numbered executions of the kernel programs do not take full advantage of the shared data in memory. FIG. 10 illustrates the basic principle of the default data page replacement policy: the GPU driver on the host maintains a linear list recording the order in which data pages are migrated from CPU memory to GPU memory. Although this list does not reflect the access order of the data pages, it has lower overhead than an ideal least recently used policy. Because reverse allocation switches the allocation order of thread blocks, the basic data page replacement policy no longer matches it well. As shown in FIG. 8, during execution of thread block C2 in kernel B, data page 3 (P3) rather than data page 5 (P5) is evicted to CPU memory according to the load-time-based least recently used policy. However, it would be preferable for kernel B to evict data page 5 (P5), so that the shared data can be fully utilized by subsequent kernel programs.
To solve this problem we propose a data page replacement policy called reverse replacement, as shown in FIG. 11 (corresponding to reference a in FIG. 3). The cooperative inversion strategy introduces a least recently used data page replacement policy based on reverse load time that cooperates with the reverse allocation of thread blocks, as shown in FIG. 11. Compared with the default policy, this new replacement policy switches the allocation and eviction direction of data pages. As indicated by reference a in FIG. 3, whenever a new kernel program is launched, the GPU driver on the host selects the corresponding data page replacement policy according to the reverse flag in the command processor: when the reverse flag is 0, the forward data page replacement policy shown in FIG. 10 is used, and otherwise the reverse data page replacement policy shown in FIG. 11 is used, so that the shared data can be better utilized in cooperation with the reverse allocation mechanism.
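For illustration only, the following C++ sketch models the driver-maintained load-time list and the two replacement directions; interpreting "switching the allocation and eviction direction" as inserting at and evicting from opposite ends of the list is an assumption of this sketch, and the names are hypothetical.

```cpp
#include <cstdint>
#include <deque>

using PageId = uint64_t;

// Linear list maintained by the GPU driver, ordered by load time:
// front = loaded earliest, back = loaded most recently (under the forward policy).
using PageList = std::deque<PageId>;

// Forward policy (load-time-based LRU): evict the page loaded earliest (front).
// Reverse policy: both directions are switched, so the victim is taken from the
// opposite end, matching the reversed thread block allocation.
PageId evict_page(PageList& resident, bool reverse_flag) {
    PageId victim;
    if (!reverse_flag) { victim = resident.front(); resident.pop_front(); }
    else               { victim = resident.back();  resident.pop_back();  }
    return victim;
}

// Forward policy appends newly migrated pages at the back (latest load time);
// the reverse policy inserts them at the front instead.
void load_page(PageList& resident, PageId page, bool reverse_flag) {
    if (!reverse_flag) resident.push_back(page);
    else               resident.push_front(page);
}
```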
FIG. 9 illustrates the execution of the simple test program introduced above when the cooperative inversion mechanism, combining the reverse allocation mechanism and the reverse replacement mechanism, is employed. As desired, data page 5 (P5) is the page replaced into CPU memory when thread block C2 of kernel B executes. Therefore, when kernel C executes, all the data previously retained in GPU memory is fully used, so no page faults occur at the kernel boundaries. FIG. 6 illustrates the data page access characteristics and page fault rate of the FFT program when the cooperative inversion strategy is used. We find, first, that the performance of the odd-numbered kernel programs is greatly improved, giving better performance than reverse allocation alone; second, more shared data is reused, resulting in fewer page faults at the kernel boundaries. We conclude that page faults at kernel boundaries can be effectively reduced by reusing the data shared between kernel programs.
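For illustration only, the following self-contained C++ program replays the toy example of FIGS. 7-9 (five kernels A-E, thread blocks C1-C5 each touching one page P1-P5, GPU memory of 3 pages) and counts page faults under the baseline, reverse allocation alone, and cooperative inversion; it is a hypothetical reconstruction of the example, not the patented simulator.

```cpp
#include <algorithm>
#include <deque>
#include <iostream>
#include <vector>

// Counts page faults for the toy example: 5 kernels, 5 thread blocks per kernel,
// thread block Ci touches only page Pi, and GPU memory holds 3 pages.
int count_page_faults(bool flip_blocks, bool flip_replacement) {
    const int kernels = 5, blocks = 5, capacity = 3;
    std::deque<int> resident;            // ordered by load time (front = earliest)
    int faults = 0;
    bool reverse = false;                // reverse flag, flipped per kernel launch
    for (int k = 0; k < kernels; ++k) {
        std::vector<int> order(blocks);
        for (int b = 0; b < blocks; ++b) order[b] = b + 1;   // pages P1..P5
        if (flip_blocks && reverse) std::reverse(order.begin(), order.end());
        for (int page : order) {
            bool hit = std::find(resident.begin(), resident.end(), page) != resident.end();
            if (hit) continue;
            ++faults;
            bool rev_repl = flip_replacement && reverse;
            if ((int)resident.size() >= capacity) {
                if (rev_repl) resident.pop_back();           // evict latest-loaded page
                else          resident.pop_front();          // evict earliest-loaded page
            }
            if (rev_repl) resident.push_front(page);
            else          resident.push_back(page);
        }
        reverse = !reverse;              // the flag flips at the next kernel launch
    }
    return faults;
}

int main() {
    std::cout << "baseline faults:              " << count_page_faults(false, false) << '\n';
    std::cout << "reverse allocation faults:    " << count_page_faults(true,  false) << '\n';
    std::cout << "cooperative inversion faults: " << count_page_faults(true,  true)  << '\n';
}
```

Under these assumptions only the cooperative configuration avoids faulting at kernel boundaries; its remaining faults occur when new pages must stream in near the end of each kernel, whereas reverse allocation alone still faults mid-kernel during the odd-numbered launches.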
For evaluation, this embodiment extends GPGPU-Sim v4.0.0. The relevant configuration of the simulated GPU system, including the cores and the memory hierarchy, is shown in Table 1. We carefully model the on-demand migration of data between CPU memory and GPU memory, setting the page fault handling latency to the most optimistic value of 20 μs. If GPU memory is full, the GPU driver replaces data pages according to the corresponding data page replacement policy. To experiment with varying degrees of memory over-subscription, we set the GPU memory capacity to a certain fraction (75%-95%) of the space each application needs. We selected 12 applications from standard benchmark suites such as the CUDA SDK, Rodinia, ISPASS and Polybench; their footprints range from 1 MB to 96 MB, with an average of 18.5 MB. The limited simulation speed prevents us from simulating applications with larger footprints.
Table 1: simulator basic configuration information (provided as an image in the original patent publication and not reproduced here).
To evaluate the effectiveness of our method, we also implemented the latest technique, ETC. FIG. 12 compares the performance of multiple applications under the reverse allocation, cooperative inversion, ETC and Oracle configurations, with the baseline configuration as the reference. We make three observations. First, ETC shows relatively weak optimizing capability, improving performance by 15% on average, while reverse allocation and cooperative inversion improve performance by 20.2% and 65%, respectively, over ETC. Second, for almost all applications, cooperative inversion achieves performance close to Oracle, mainly because most of the shared data between kernel programs is effectively utilized. Finally, neither the BFS nor the FWT program is improved much by cooperative inversion, because the kernel programs of these two applications have irregular data access patterns that cooperative inversion cannot capture and exploit well. In summary, applications with a large amount of data sharing between kernel programs suffer a significant performance loss under memory over-subscription; the method of this embodiment cooperatively schedules thread blocks and data pages in order to reduce the impact of memory over-subscription on system performance by reusing shared data.
In summary, although the latest GPUs are equipped with ever more memory, they still cannot accommodate the entire working set of large applications. With the support of unified virtual memory and on-demand data page migration, such large programs can still run correctly without active intervention by programmers, but this convenience comes with a performance penalty. It has been found that, despite the existing research on reducing the number of page migrations between the CPU and the GPU, applications with data sharing between kernel programs do not obtain good performance improvements. The method of this embodiment therefore proposes collaborative scheduling of thread blocks and data pages to reduce the impact of memory over-subscription in a manner transparent to the programmer. The basic principle of the storage computing collaborative scheduling method for GPU data reuse of this embodiment is to make effective use of the data shared between kernel programs by coordinating the thread block allocation order with the data page replacement order; experiments show that the system performance of the method is improved by 65% compared with the latest research method.
In addition, this embodiment also provides a storage computing collaborative scheduling system for GPU data reuse, comprising:
a reverse flag management program unit for detecting whether a kernel program is launched and, if so, flipping the reverse flag of that kernel program;
a data page replacement program unit for, with respect to data page replacement of the kernel program, selecting in turn, according to the kernel program's reverse flag, one of two preset data page replacement policies (a forward data page replacement policy and a reverse data page replacement policy) to select a GPU-side data page from the GPU-side data page queue for replacement, the forward and reverse data page replacement policies selecting data pages in opposite directions;
and a thread block selection program unit for, in the thread block scheduler of the GPU and with respect to thread block scheduling of the kernel program, selecting in turn, according to the kernel program's reverse flag, one of two preset thread block dispatch policies (a forward thread block dispatch policy and a reverse thread block dispatch policy) to select a thread block to issue from the pending thread block queue, the forward and reverse thread block dispatch policies selecting thread blocks in opposite directions.
In addition, this embodiment also provides a storage computing collaborative scheduling system for GPU data reuse, comprising:
a CPU for detecting whether a kernel program is launched and, if so, flipping the reverse flag of that kernel program, and for, in the GPU driver, selecting in turn, according to the kernel program's reverse flag, one of two preset data page replacement policies (a forward data page replacement policy and a reverse data page replacement policy) to select a GPU-side data page from the GPU-side data page queue for replacement, the forward and reverse data page replacement policies selecting data pages in opposite directions;
and a GPU for, in its thread block scheduler and with respect to thread block scheduling of the kernel program, selecting in turn, according to the kernel program's reverse flag, one of two preset thread block dispatch policies (a forward thread block dispatch policy and a reverse thread block dispatch policy) to select a thread block to issue from the pending thread block queue, the forward and reverse thread block dispatch policies selecting thread blocks in opposite directions;
the CPU and the GPU being connected to each other.
In addition, this embodiment also provides a storage computing collaborative scheduling system for GPU data reuse, comprising a processing unit and a memory connected to each other, wherein the processing unit is programmed or configured to execute the steps of the above storage computing collaborative scheduling method for GPU data reuse.
In addition, this embodiment also provides a computer-readable storage medium storing a computer program programmed or configured to execute the above storage computing collaborative scheduling method for GPU data reuse.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above examples, and all technical solutions belonging to the concept of the present invention belong to the protection scope of the present invention. It should be noted that modifications and adaptations to the present invention may occur to one skilled in the art without departing from the principles of the present invention and are intended to be within the scope of the present invention.

Claims (10)

1. A storage computing collaborative scheduling method for GPU data reuse, characterized by comprising the following steps:
1) When the current program over-subscribes the GPU memory capacity and data sharing exists among its kernel programs, detecting whether a kernel program is launched, and if so, flipping the reverse flag of that kernel program;
2) In the GPU driver, for data page replacement of the kernel program, selecting in turn, according to the kernel program's reverse flag, one of two preset data page replacement policies (a forward data page replacement policy and a reverse data page replacement policy) to select a GPU-side data page from the GPU-side data page queue for replacement, the forward and reverse data page replacement policies selecting data pages in opposite directions; in the thread block scheduler of the GPU, for thread block scheduling of the kernel program, selecting in turn, according to the kernel program's reverse flag, one of two preset thread block dispatch policies (a forward thread block dispatch policy and a reverse thread block dispatch policy) to select a thread block to issue from the pending thread block queue, the forward and reverse thread block dispatch policies selecting thread blocks in opposite directions; the data page replacement policies introduce a reverse data page replacement policy based on the reverse replacement of the GPU's thread blocks, the reverse data page replacement policy being a least recently used data page replacement policy based on reverse load time, so that the allocation and eviction direction of data pages is switched relative to the forward data page replacement policy.
2. The storage computing collaborative scheduling method for GPU data reuse according to claim 1, wherein flipping the reverse flag of the kernel program in step 1) comprises: first detecting whether a reverse flag exists for the kernel program; if not, initializing a reverse flag for the kernel program; if so, flipping the reverse flag of the kernel program.
3. The storage computing collaborative scheduling method for GPU data reuse according to claim 2, wherein when the reverse flag is initialized, its initial value is 0 or 1.
4. The storage computing collaborative scheduling method for GPU data reuse according to claim 2, wherein flipping the reverse flag of the kernel program means: the reverse flag is changed from 0 to 1 if its original value is 0, and from 1 to 0 if its original value is 1.
5. The storage computing collaborative scheduling method for GPU data reuse according to claim 1, wherein when in step 2) one of the two preset data page replacement policies is selected in turn according to the kernel program's reverse flag to select a GPU-side data page from the GPU-side data page queue for replacement, the forward data page replacement policy is selected if the reverse flag is 0, and the reverse data page replacement policy is selected if the reverse flag is 1.
6. The storage computing collaborative scheduling method for GPU data reuse according to claim 1, wherein when in step 2) one of the two preset thread block dispatch policies is selected in turn according to the reverse flag, the forward thread block dispatch policy is selected if the reverse flag is 0, and the reverse thread block dispatch policy is selected if the reverse flag is 1.
7. The storage computing collaborative scheduling method for GPU data reuse according to claim 1, wherein the GPU memory capacity over-subscription of the current program in step 1) means: when the GPU driver on the host monitors a data page request generated by the GPU, if the length of the data page queue maintained by the GPU driver has already reached the GPU memory capacity, a data page in GPU memory must be evicted to CPU memory according to the data page replacement policy, and a memory over-subscription flag is set at the same time to indicate that memory over-subscription has occurred for the current program.
8. The storage computing collaborative scheduling method for GPU data reuse according to claim 1, wherein the existence of data sharing between kernel programs in step 1) means: whether the same pointer is shared between kernel programs is determined from compile-time information and used as an indicator of inter-kernel data sharing, and a data sharing flag is assigned to each kernel program to be launched to indicate whether it shares data with the preceding kernel program; when a kernel program is launched, it is checked whether both the memory over-subscription flag and the data sharing flag of that kernel program are 1, and if so, the collaborative scheduling method is judged to be required.
9. A storage computing collaborative scheduling system for GPU data reuse, comprising a processing unit and a memory connected to each other, wherein the processing unit is programmed or configured to perform the steps of the storage computing collaborative scheduling method for GPU data reuse of any of claims 1 to 8.
10. A computer-readable storage medium having stored therein a computer program programmed or configured to perform the storage computing collaborative scheduling method for GPU data reuse of any of claims 1 to 8.
CN202110649358.7A 2021-06-10 2021-06-10 Storage computing collaborative scheduling method and system for GPU data reuse Active CN113377538B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110649358.7A CN113377538B (en) 2021-06-10 2021-06-10 Storage computing collaborative scheduling method and system for GPU data reuse

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110649358.7A CN113377538B (en) 2021-06-10 2021-06-10 Storage computing collaborative scheduling method and system for GPU data reuse

Publications (2)

Publication Number Publication Date
CN113377538A CN113377538A (en) 2021-09-10
CN113377538B true CN113377538B (en) 2023-06-20

Family

ID=77573750

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110649358.7A Active CN113377538B (en) 2021-06-10 2021-06-10 Storage computing collaborative scheduling method and system for GPU data reuse

Country Status (1)

Country Link
CN (1) CN113377538B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115361451B (en) * 2022-10-24 2023-03-24 中国人民解放军国防科技大学 Network communication parallel processing method and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103365796A (en) * 2012-04-05 2013-10-23 西门子公司 Volume rendering on shared memory systems with multiple processors by optimizing cache reuse
CN112181689A (en) * 2020-09-30 2021-01-05 华东师范大学 Runtime system for efficiently scheduling GPU kernel under cloud
CN112801849A (en) * 2019-11-14 2021-05-14 英特尔公司 Method and apparatus for scheduling thread order to improve cache efficiency

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11016861B2 (en) * 2019-04-11 2021-05-25 International Business Machines Corporation Crash recoverability for graphics processing units (GPU) in a computing environment

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103365796A (en) * 2012-04-05 2013-10-23 西门子公司 Volume rendering on shared memory systems with multiple processors by optimizing cache reuse
CN112801849A (en) * 2019-11-14 2021-05-14 英特尔公司 Method and apparatus for scheduling thread order to improve cache efficiency
CN112181689A (en) * 2020-09-30 2021-01-05 华东师范大学 Runtime system for efficiently scheduling GPU kernel under cloud

Also Published As

Publication number Publication date
CN113377538A (en) 2021-09-10


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant