CN107357661B - Fine-grained GPU resource management method for mixed load

Info

Publication number: CN107357661B (application number CN201710563834.7A)
Authority: CN (China)
Prior art keywords: task, GPU, resource, thread
Original language: Chinese (zh)
Other versions: CN107357661A
Inventors: 杨海龙, 禹超, 白跃彬, 栾钟治, 顾育豪
Original assignee: Beihang University
Current assignee: Kaixi (Beijing) Information Technology Co., Ltd.
Priority and filing date: 2017-07-12
Publication date (CN107357661A): 2017-11-17
Grant date (CN107357661B): 2020-07-10
Legal status: Expired - Fee Related

Classifications

    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU] (G06F: electric digital data processing; G06F9/46: multiprogramming arrangements)
    • G06F9/5005 Allocation of resources to service a request
    • G06F9/5011 The resources being hardware resources other than CPUs, servers and terminals
    • G06F9/5022 Mechanisms to release resources
    • G06F9/5027 The resource being a machine, e.g. CPUs, servers, terminals
    • G06F9/505 Considering the load


Abstract

The invention discloses a fine-grained GPU resource management method for mixed loads and proposes a capacity-based streaming multiprocessor abstraction model, CapSM, which serves as the basic unit of resource management. When mixed loads (comprising online tasks and offline tasks) share GPU resources, the method manages the use of GPU resources by different types of tasks in a fine-grained manner, supports task resource quotas and online adjustment of resources, and guarantees the quality of service of online tasks while sharing GPU resources. The method determines the resources finally allocated to a task according to the task type, the resource request, and the current GPU resource state of the system. When resources are sufficient, it satisfies the offline tasks' demand for GPU resources; when GPU resources are insufficient, it dynamically adjusts the resource usage of offline tasks and preferentially satisfies the resource demands of online tasks, so that the performance of online tasks is guaranteed and GPU resources are fully utilized when mixed loads run simultaneously.

Description

Fine-grained GPU resource management method for mixed load
Technical Field
The invention relates to the field of resource management and task scheduling in heterogeneous computing, in particular to a fine-grained GPU resource management method for mixed load.
Background
A Graphics Processing Unit (GPU) has become an indispensable component of high-performance computing, cloud computing, and data centers due to its enormous peak computing capability, and more and more organizations are adopting GPUs to accelerate key services. To increase GPU utilization, infrastructure providers often let multiple tasks of different types (online tasks and offline tasks) share GPU resources, i.e., they adopt a mixed-load operation mode. However, when mixed loads share a GPU, the performance of online tasks is severely disturbed because multiple tasks compete for GPU resources. The fundamental reason is that once a task is submitted to the GPU for execution, it releases its resources only after execution finishes; if an offline task occupies too much of the GPU or runs too long, online tasks cannot obtain enough GPU resources in time, and their quality-of-service targets cannot be met.
In recent years, to solve the problem of performance interference when mixed loads run on a GPU, researchers have approached the problem from multiple directions. The existing results mainly fall into the following categories:
(1) Hardware-based methods
This approach requires modifying the existing GPU hardware structure and adding corresponding control components. Because GPU vendors protect their designs, a complete and detailed understanding of the GPU hardware architecture is difficult to obtain, and modifying GPU hardware in a real system is impossible. Hardware-based methods are therefore implemented only in simulators; they have research value in academia but no practical significance.
(2) Software-based methods
This approach requires no modification of existing GPU hardware; different applications only need to be controlled at the software level, so it is operable in practice and therefore of practical significance. Specifically, software-based methods can be further classified into the following categories:
a) Methods based on priority scheduling
With this method, different types of GPU tasks are given different priorities: online tasks have higher priority and offline tasks lower priority, and when online and offline tasks need to be scheduled at the same time, the higher-priority online tasks run first. However, only one task can run on the GPU at any moment, so GPU utilization is low.
b) Methods based on kernel reordering
This method is similar to priority-based scheduling, except that the priority of each task is dynamic: when a kernel task arrives, its priority is calculated dynamically according to its quality-of-service requirement, and the submission order of kernel tasks is then adjusted according to the calculated dynamic priorities.
c) Methods based on GPU preemption
This method is also similar to priority-based scheduling in that each task has a fixed priority, but it supports priority-based preemption: when a task is running on the GPU and a higher-priority task arrives, the latter can preempt the running task instead of waiting for it to finish. Although GPU preemption can reduce task waiting time to some extent, the time overhead of preemption is related to the execution time of the running GPU kernel task.
In summary, hardware-based methods solve the performance problem of mixed loads by modifying the GPU hardware structure and thus have low operability and poor practicality on existing GPU devices; software-based methods can let online tasks run first as much as possible, but they cannot guarantee that an online task obtains the corresponding resources in time when it needs additional resources. A fine-grained GPU resource management method is therefore needed to effectively control how different types of tasks use GPU resources under mixed loads, and in particular to support task resource quotas and online resource adjustment so as to meet quality-of-service requirements; no related technique has been reported so far.
Disclosure of Invention
The invention solves these problems: it overcomes the defects of the prior art and provides a fine-grained GPU resource management system and method for mixed loads. When an online task needs additional resources, the resources used by offline tasks are adjusted online, so that the online task does not have to wait a long time for offline tasks to release resources, which would otherwise cause its quality-of-service target to be missed.
The present invention is based on the Multi-Process Service (MPS) technology of the Compute Unified Device Architecture (CUDA). MPS is a Hyper-Q-based GPU resource management technique proposed by NVIDIA; with MPS, kernel tasks from multiple applications can execute concurrently when the GPU is not fully utilized, improving GPU utilization. Furthermore, MPS is transparent to applications and automatically merges kernels from different CUDA contexts into the same CUDA context so that they can run on the GPU simultaneously. Because MPS treats all kernels equally, when a kernel starts running, all the resources needed by its threads are allocated. Therefore, when online and offline tasks run on a GPU in a mixed manner, a mechanism is needed to limit the resources used by offline tasks and reduce the performance interference on online tasks caused by resource contention.
The technical scheme of the invention is as follows: a fine-grained GPU resource management method for mixed loads, in which the mixed load divides tasks into online tasks and offline tasks. When online tasks and offline tasks share GPU resources, a capacity-based SM abstraction model is used as the basic unit of resource management to manage the use of GPU resources by different types of tasks in a fine-grained manner; task resource quotas and online resource adjustment are supported, and the quality of service of online tasks is guaranteed while GPU resources are shared. The method comprises the following steps:
(1) when a user submits a task (tasks comprise online tasks and offline tasks) to the GPU through the resource management API, the resource request information of the task is set: if the task is an offline task, the upper resource limit of the task, namely the quota, is set; if the task is an online task, the minimum resource amount of the task, namely the reservation, is set;
(2) analyzing the submission information of the tasks through a resource management API, wherein the submission information comprises a kernel function, the number of task blocks, the size of the task block and the resource request of the tasks;
(3) calculating the number of active thread blocks which can be accommodated on a GPU SM according to the kernel function of the task and the size of the task block;
(4) calculating the residual available resource amount on the GPU according to the running condition of the application on the current GPU;
(5) if the remaining resources of the current GPU are not less than the resource request of the task acquired in step (2), executing step (6), otherwise executing step (8);
(6) setting the resource configuration of the task as a resource request of the task;
(7) according to the resource configuration of the task and the number of the active thread blocks determined in the step (3), calculating the number of thread blocks which are to be created when the task is submitted to the GPU for operation and the number of task blocks allocated to each thread block, and then executing a step (11);
(8) if the current task is an offline task, executing step (9), and if the current task is an online task, executing step (10);
(9) setting the resource residual amount of the current GPU as the resource configuration of the task, and then turning to the step (7) to execute;
(10) calculating a resource difference according to the resource surplus of the current GPU and the resource request of the task, then sending a resource release command to the offline task running on the current GPU to enable the offline task to release the resource amount specified by the resource difference, and then turning to the step (6);
(11) submitting the tasks to a GPU according to the calculated number of thread blocks of the GPU tasks, creating threads and starting to run;
(12) if the task receives a resource release command in the process of running on the GPU, executing the step (13), otherwise, executing the step (14);
(13) if the task running on the GPU receives a command of releasing the resources, the resources in the specified range are released, and if task blocks on the released resources are not executed, the task blocks which are not executed are remapped to the rest resources to be continuously executed;
(14) after the task is executed, the task exits the GPU.
The resource management basic unit used is the capacity-based SM abstraction model, hereinafter referred to as CapSM, which is implemented as follows:
(1-1) given a GPU, the capacity of each SM is set to 1 capacity unit; given a kernel task K, assume that one SM on the GPU can hold M thread blocks of task K in the active state;
(1-2) according to M, each SM is abstracted into M small fragments, each with a capacity of 1/M capacity units, and each small fragment can hold exactly one thread block of task K;
(1-3) after all SMs of the GPU are divided into small fragments in this way, any N small fragments whose total capacity equals that of one physical SM are considered to form one CapSM;
(1-4) for task K, any M small fragments therefore make up one CapSM.
In the resource management unit CapSM:
(1-1) the M small fragments that make up a CapSM may come from the same SM or from multiple different SMs;
(1-2) each small fragment corresponds to one thread block, so a CapSM can be regarded as a set of thread blocks; in implementation, the management of CapSMs can thus be converted into the management of the number of thread blocks;
(1-3) the capacity-based SM abstraction model CapSM does not depend on a particular GPU architecture or GPU parallel programming language, and the concept of CapSM can readily be applied to other GPU architectures and GPU parallel programming languages.
The original kernel function of a task needs to be converted so that the threads of the task running on the GPU are persistent threads. The specific conversion process is as follows (a code sketch follows this list):
(1-1) a loop control structure is inserted into the original kernel function, and the original kernel function body serves as the loop body of this structure;
(1-2) the loop body traverses the task blocks assigned to each persistent thread and executes them in turn, setting a variable taskIdx to the number of the task block currently being executed;
(1-3) the variable blockIdx, which in the original kernel function body denotes the index of the thread block a thread belongs to, is replaced by the variable taskIdx denoting the task block the thread currently serves.
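For illustration, a minimal sketch of this conversion for a simple vector-add kernel follows; the kernel names and the taskBlockNumber and taskBlocksPerPBlock parameters are hypothetical, and the computation of the per-block task range follows step (11) described later.

    // Original kernel: each thread block processes exactly one task block.
    __global__ void vecAdd(const float* a, const float* b, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    // Persistent-thread form: each persistent thread block loops over the
    // task blocks assigned to it; blockIdx is replaced by taskIdx in the body.
    __global__ void vecAddPersistent(const float* a, const float* b, float* c,
                                     int n, int taskBlockNumber,
                                     int taskBlocksPerPBlock) {
        int pBlockId = blockIdx.y * gridDim.x + blockIdx.x;  // persistent block id
        int startTaskId = pBlockId * taskBlocksPerPBlock;
        for (int taskIdx = startTaskId;
             taskIdx < startTaskId + taskBlocksPerPBlock && taskIdx < taskBlockNumber;
             ++taskIdx) {
            int i = taskIdx * blockDim.x + threadIdx.x;      // blockIdx.x -> taskIdx
            if (i < n) c[i] = a[i] + b[i];
        }
    }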
In step (1), when the user submits a task to the GPU through the resource management API, the API provides the following two task submission modes:
(1-1) running a task in resource quota mode, which is aimed mainly at offline tasks and limits the amount of resources an offline task may use; when a task is submitted in this mode, the resource quota amount quota, the kernel function to run, the task block number TaskBlockNumber, and the task block size TaskBlockSize must be provided;
(1-2) running a task in resource reservation mode, which is aimed mainly at online tasks; when a task is submitted in this mode, the resource amount reservation reserved for the task, the kernel function to run, the task block number TaskBlockNumber, and the task block size TaskBlockSize, i.e., the number of threads in each task block, must be provided.
In step (3), the number of active thread blocks that one GPU SM can accommodate is calculated according to the kernel function of the task and the task block size TaskBlockSize_i; the maximum number of active thread blocks MaxActivePBlock_i that each SM or CapSM can hold can be calculated through the cudaOccupancyMaxActiveBlocksPerMultiprocessor API provided by CUDA.
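A minimal host-side sketch of this query is given below; myKernel and taskBlockSize are placeholders for the task's converted kernel and its block size, and dynamic shared memory is assumed to be zero.

    int maxActivePBlock = 0;
    // Ask the CUDA runtime how many thread blocks of this kernel one SM can hold.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &maxActivePBlock,  // out: active thread blocks per SM
        myKernel,          // the task's (persistent-thread) kernel function
        taskBlockSize,     // threads per task block (block_size)
        0);                // dynamic shared memory per block, assumed 0 here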
In step (7), according to the resource configuration CapSMQuota_i of the task and the number of active thread blocks MaxActivePBlock_i determined in step (3), the number of thread blocks PBlockNumber_i to be created when the task is submitted to the GPU and the number of task blocks TaskBlocksPerPBlock_i allocated to each thread block are calculated as follows (a code sketch follows):
(7-1) combining the resource configuration CapSMQuota_i of the task with MaxActivePBlock_i, compute the number of thread blocks of the task: PBlockNumber_i = CapSMQuota_i * MaxActivePBlock_i;
(7-2) from the task block number TaskBlockNumber_i and the thread block number PBlockNumber_i, compute the number of task blocks allocated to each thread block: TaskBlocksPerPBlock_i = ⌈TaskBlockNumber_i / PBlockNumber_i⌉.
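The corresponding host-side computation, using integer ceiling division, can be sketched as follows; capSMQuota, taskBlockNumber, and maxActivePBlock stand in for the quantities above.

    int pBlockNumber = capSMQuota * maxActivePBlock;              // (7-1)
    int taskBlocksPerPBlock =
        (taskBlockNumber + pBlockNumber - 1) / pBlockNumber;      // (7-2), ceiling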
In step (11), the task is submitted to the GPU, threads are created and start running; the specific execution steps of each thread are as follows (a device-side sketch follows this list):
(11-1) calculate the number CapSMId of the CapSM to which the current thread belongs, which comprises the following sub-processes:
(11-1-1) calculate the thread block number PBlockId to which the current thread belongs: PBlockId = blockIdx.y * gridDim.x + blockIdx.x, where gridDim.x, blockIdx.y, and blockIdx.x are built-in variables that CUDA provides to each thread and can be used directly at run time;
(11-1-2) from the maximum number of active thread blocks MaxActivePBlock_i held by each SM or CapSM and the thread block number PBlockId, calculate the number of the CapSM it belongs to: CapSMId = ⌊PBlockId / MaxActivePBlock_i⌋;
(11-2) calculate the task block range processed by each persistent thread block, which comprises the following sub-processes:
(11-2-1) calculate the assigned task block start value StartTaskId: StartTaskId = PBlockId * TaskBlocksPerPBlock_i;
(11-2-2) calculate the assigned task block end value StopTaskId: StopTaskId = StartTaskId + TaskBlocksPerPBlock_i;
(11-3) calculate the number PBIdInCapSM of the current thread's thread block within its CapSM: PBIdInCapSM = PBlockId % MaxActivePBlock_i;
(11-4) according to the task block range obtained above, enter the loop control structure and execute in turn the corresponding tasks in all the assigned task blocks.
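The per-thread index computation at kernel entry can be sketched in device code as follows, assuming maxActivePBlock and taskBlocksPerPBlock are passed as kernel arguments.

    int pBlockId = blockIdx.y * gridDim.x + blockIdx.x;   // (11-1-1)
    int capSMId = pBlockId / maxActivePBlock;             // (11-1-2)
    int startTaskId = pBlockId * taskBlocksPerPBlock;     // (11-2-1)
    int stopTaskId = startTaskId + taskBlocksPerPBlock;   // (11-2-2)
    int pbIdInCapSM = pBlockId % maxActivePBlock;         // (11-3)
    for (int taskIdx = startTaskId; taskIdx < stopTaskId; ++taskIdx) {
        // (11-4) original kernel body, indexed by taskIdx instead of blockIdx
    }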
In step (10), a resource release command is sent to the offline tasks running on the current GPU so that they release the amount of resources specified by the resource difference; this comprises the following two stages (a sketch of the CPU-GPU signaling follows this list):
(10-1) on the CPU side, the value of the resource release flag evictCapSMNum_i is changed; evictCapSMNum_i can be synchronized between the CPU and the GPU, and its value indicates the number of CapSMs that need to be released;
(10-2) in the loop control structure of each persistent thread of a task running on the GPU, evictCapSMNum_i is checked before each execution of the loop body; all threads whose CapSMId is less than evictCapSMNum_i exit, and thereby a number of CapSM resources equal to evictCapSMNum_i are released.
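One way to realize such a CPU-to-GPU flag is zero-copy mapped host memory, which the CPU can update while the kernel is running; the sketch below makes that assumption (the device must support mapped host memory, and error checking is omitted).

    // Host side: allocate the release flag in mapped (zero-copy) host memory.
    int *hostEvict, *devEvict;
    cudaHostAlloc(&hostEvict, sizeof(int), cudaHostAllocMapped);
    *hostEvict = 0;
    cudaHostGetDevicePointer(&devEvict, hostEvict, 0);
    // ... launch the offline task's kernel with devEvict as its evictCapSMNum ...
    *hostEvict = k;  // later: ask the offline task to release k CapSMs

    // Device side, checked before each execution of the loop body (10-2):
    //     if (capSMId < *(volatile int *)evictCapSMNum) return;  // CapSM released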
In step (13), if there are unexecuted task blocks on the released resources, these task blocks are remapped onto the remaining resources for continued execution. The specific processing of task block remapping in each persistent thread block is as follows (a device-side sketch follows step (14) below):
(13-1) calculate the number NumberPerCapSM_i of released CapSMs that each remaining CapSM is responsible for mapping: NumberPerCapSM_i = ⌈evictCapSMNum_i / (CapSMQuota_i - evictCapSMNum_i)⌉, where CapSMQuota_i is the resource allocation of the GPU task, namely the number of CapSMs allocated when the task was submitted;
(13-2) calculate the range of released CapSMs each remaining CapSM is responsible for mapping, comprising the following two sub-processes:
(13-2-1) calculate the remapping start value: CapSMRemapStart = (CapSMId - evictCapSMNum_i) * NumberPerCapSM_i;
(13-2-2) calculate the remapping end value: CapSMRemapEnd = CapSMRemapStart + NumberPerCapSM_i;
(13-3) select each CapSM in the remapping range in turn, denoting the currently selected CapSM as CapSMRemapCur; if the task blocks in all CapSMs in the mapped range have finished executing, jump to step (14);
(13-4) modify the variable PBlockId to the number of the corresponding thread block in the CapSMRemapCur currently being executed: PBlockId = CapSMRemapCur * MaxActivePBlock_i + PBIdInCapSM;
(13-5) according to the current thread block number PBlockId, execute all unexecuted task blocks under this PBlockId; if all unexecuted tasks in the currently mapped CapSMRemapCur have finished, jump to step (13-3);
(13-6) otherwise go to step (13-5) and continue execution.
(14) After the task is executed, it exits the GPU.
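A device-side sketch of this remapping, consistent with the formulas above (all names are placeholders, and taskDone is an assumed per-task-block completion flag used to skip finished task blocks):

    int remaining = capSMQuota - evictNum;                        // surviving CapSMs
    int numberPerCapSM = (evictNum + remaining - 1) / remaining;  // (13-1), ceiling
    int remapStart = (capSMId - evictNum) * numberPerCapSM;       // (13-2-1)
    int remapEnd = remapStart + numberPerCapSM;                   // (13-2-2)
    for (int cur = remapStart; cur < remapEnd && cur < evictNum; ++cur) {  // (13-3)
        int newPBlockId = cur * maxActivePBlock + pbIdInCapSM;    // (13-4)
        int s = newPBlockId * taskBlocksPerPBlock;
        for (int taskIdx = s; taskIdx < s + taskBlocksPerPBlock; ++taskIdx)  // (13-5)
            if (!taskDone[taskIdx]) { /* execute task block taskIdx */ }
    }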
Compared with the prior art, the innovation of the invention is as follows: a capacity-based SM abstraction model is proposed, and the limit on the number of SMs is converted into a limit on SM capacity, so that a GPU resource reservation and online adjustment mechanism can be realized flexibly without depending on specific hardware or a programming model, and the performance interference of offline tasks on online tasks is effectively controlled when loads are mixed. Concretely:
(1) The present invention abstracts the notion of capacity from the physical SM. Each SM on the same GPU has the same capacity, and a certain number of SMs corresponds to a certain amount of capacity, so the use of SM resources can be converted into the use of SM capacity; this notion of capacity makes the invention very easy to apply to GPUs of other vendors, such as AMD.
(2) The invention eliminates performance interference through resource reservation at the software level and can flexibly limit GPU resource usage by limiting SM capacity. By limiting the GPU resources that offline tasks may use, enough GPU resources are reserved for online tasks, eliminating resource competition as much as possible and guaranteeing the quality-of-service targets of online tasks.
Drawings
FIG. 1 is a diagram of a fine-grained GPU resource management method of the present invention;
FIG. 2 is a flowchart of a fine-grained GPU resource management method for mixed loads according to the present invention;
FIG. 3 is a schematic diagram of the capacity-based SM resource abstraction model CapSM;
FIG. 4 is a diagram of a generic thread block versus a persistent thread block;
FIG. 5 is a diagram illustrating the relationship between the CapSM, persistent thread blocks, and task block maps;
FIG. 6 is a diagram of dynamic resource reclamation and task remapping.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present invention more apparent, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein merely illustrate the invention and are not intended to limit it. In addition, the technical features involved in the embodiments described below may be combined with each other as long as they do not conflict.
The basic idea of the present invention is to design, at the software level and on top of the existing MPS mechanism, a mixed-load resource management method based on resource quotas and reservations. When online and offline tasks run together on a GPU, the offline tasks' use of GPU resources is limited through a resource quota mechanism so that enough resources are reserved for the online tasks and they can be processed in time; meanwhile, when the resource demand of an online task is hard to meet, the method supports online adjustment of the offline tasks' GPU resource usage, reclaims resources used by offline tasks, and preferentially satisfies the online task's resource demand, thereby guaranteeing its quality of service.
An application example of the invention is shown in FIG. 1. After an offline or online application submits a GPU task through the resource management API, the resource management module parses the resource request information provided by the API and then determines the resources finally allocated to the task according to the task type, the resource request, and the current GPU resource state of the system. If the current GPU resources are sufficient, the task is allocated the resources it requests. If they cannot satisfy the task's resource request, further processing depends on the task type: an offline task is allocated the currently available GPU resources, while for an online task part of the resources of the offline tasks running on the current GPU are reclaimed to satisfy its request, so that the performance of online tasks is guaranteed and GPU resources are fully utilized. This process requires coordinated control between the CPU and the GPU; a resource management module on the GPU side cooperates with the CPU to complete the fine-grained management of GPU resources.
As shown in FIG. 2, the fine-grained GPU resource management method for mixed loads of the present invention includes the following steps:
(1) When the user submits a task to the GPU through the resource management API, the resource request information CapSMRequest_i of the task is set; if the task is an offline task, the upper resource limit of the task, namely the quota, is set, and if the task is an online task, the minimum resource amount of the task, namely the reservation, is set.
In order to manage GPU resources flexibly and effectively, the invention uses a capacity-based SM resource abstraction model, CapSM, which is the basis for realizing the fine-grained resource management method at the software level and serves as the basic unit of resource management.
As shown in FIG. 3, the resource management unit CapSM is defined as follows:
(1-1) given a GPU, the capacity of each SM is set to 1 capacity unit; given a kernel task K, assume that one SM on the GPU can hold M thread blocks of task K in the active state;
(1-2) according to M, each SM is abstracted into M small fragments, each with a capacity of 1/M capacity units, and each small fragment can hold exactly one thread block of task K;
(1-3) after all SMs of the GPU are divided into small fragments in this way, any N small fragments whose total capacity equals that of one physical SM are considered to form one CapSM;
(1-4) for task K, any M small fragments therefore make up one CapSM;
the resource management unit caps has the following characteristics:
(1-1) the M small fragments that make up each CapSM may be from the same SM or from multiple different SMs;
(1-2) each capacity slice corresponds to one thread block, and one cap sm can be regarded as a set of thread blocks, so that in implementation, the management of the cap sm can be converted into the management of the number of thread blocks;
(1-3) abstraction of GPU resources by the caps is not dependent on a particular GPU architecture and GPU parallel programming language, the concept of caps can be easily applied to other GPU architectures and GPU parallel programming languages;
the invention provides two resource management APIs for a user to submit tasks to a GPU, wherein the two APIs comprise the following two APIs:
(1-1)Launch_kernel_with_quota(quota,kernel,grid_size,block_size,kernel_arg_list):
running a kernel task in a resource quota mode, mainly aiming at an offline task, wherein quota is the amount of requested resource quota, kernel is a kernel function to be run, grid _ size is the number of task blocks, block _ size is the size of the task blocks, namely the number of threads in each task block, and kernel _ arg _ list is a parameter transferred to the kernel function;
(1-2)Launch_kernel_with_reservation(reservation,kernel,grid_size,block_size,kernel_arg_list):
running tasks in a resource reservation mode, mainly aiming at online tasks, wherein reservation is the amount of resources expected to be reserved for the tasks, kernel is a kernel function to be run, grid _ size is the number of task blocks, block _ size is the size of the task blocks, namely the number of threads in each task block, and kernel _ arg _ list is a parameter transferred to the kernel function;
in the two APIs, quota and reservation are the resource requests of the tasks, grid _ size is the number of task blocks of the tasks, and the tasks executed in a common thread block are called a task block in the invention.
The fine-grained GPU resource management method for mixed loads is a process in which the CPU and the GPU cooperate. Besides providing the resource management API on the CPU side, the kernel function executed on the GPU side must be modified so that the threads of the original GPU kernel task running on the GPU become persistent threads; each persistent thread block can then execute the work of several original thread blocks, i.e., execute several task blocks in turn, as shown in FIG. 4. The specific conversion process is as follows:
(1-1) a loop control structure is inserted into the original kernel function, and the original kernel function body serves as the loop body of this structure;
(1-2) the loop body traverses the task blocks assigned to each persistent thread and executes them in turn, setting a variable taskIdx to the number of the task block currently being executed;
(1-3) the variable blockIdx, which in the original kernel function body denotes the index of the thread block a thread belongs to, is replaced by the variable taskIdx denoting the task block the thread currently serves;
Unless otherwise specified, the tasks managed by the invention refer to tasks after the above conversion.
(2) The submission information of the task is parsed through the resource management API, including the kernel function, the task block number TaskBlockNumber_i, the task block size TaskBlockSize_i, and the task's resource request CapSMRequest_i, where TaskBlockNumber_i is obtained from the grid_size parameter of the resource management API, TaskBlockSize_i is the block_size parameter, and CapSMRequest_i is the quota or reservation parameter.
(3) According to the kernel function of the task and the task block size TaskBlockSize_i, the number of active thread blocks MaxActivePBlock_i that one GPU SM can accommodate is calculated through the cudaOccupancyMaxActiveBlocksPerMultiprocessor API provided by CUDA.
(4) The remaining available resource amount Remain_GPU on the GPU is calculated according to the running condition of the applications on the current GPU.
(5) If the current remaining GPU resources Remain_GPU are not less than the task's resource request CapSMRequest_i, step (6) is executed; otherwise step (8) is executed.
(6) The resource configuration CapSMQuota_i of the task is set to the task's resource request CapSMRequest_i.
(7) From the task's resource configuration CapSMQuota_i and the number of active thread blocks MaxActivePBlock_i determined in step (3), the number of thread blocks PBlockNumber_i to create when the task is submitted to the GPU and the number of task blocks TaskBlocksPerPBlock_i allocated to each thread block are calculated, after which step (11) is executed; this comprises the following sub-processes:
(7-1) combining the task's resource configuration CapSMQuota_i and MaxActivePBlock_i, compute the task's thread block number: PBlockNumber_i = CapSMQuota_i * MaxActivePBlock_i;
(7-2) from the task block number TaskBlockNumber_i and the thread block number PBlockNumber_i, compute the number of task blocks allocated to each thread block: TaskBlocksPerPBlock_i = ⌈TaskBlockNumber_i / PBlockNumber_i⌉.
(8) If the current task is an offline task, step (9) is executed; if it is an online task, step (10) is executed.
(9) The current remaining GPU resources Remain_GPU are set as the task's resource configuration CapSMQuota_i, and then step (7) is executed.
(10) From the current remaining GPU resources Remain_GPU and the task's resource request CapSMRequest_i, the resource difference Gap_i is calculated; a resource release command is then sent to the offline tasks running on the current GPU so that they release the amount of resources specified by the resource difference, after which step (6) is executed. Executing the resource release command requires cooperation between the CPU and the GPU and comprises the following two stages:
(10-1) on the CPU side, the value of the resource release flag evictCapSMNum_i is changed; evictCapSMNum_i can be synchronized between the CPU and the GPU, and its value indicates the number of CapSMs that need to be released;
(10-2) in the loop control structure of each persistent thread of a task running on the GPU, evictCapSMNum_i is checked before each execution of the loop body; all threads whose CapSMId is less than evictCapSMNum_i exit, and thereby a number of CapSM resources equal to evictCapSMNum_i are released;
(11) According to the calculated thread block number PBlockNumber_i of the GPU task, the task is submitted to the GPU, threads are created on the GPU, and the task starts running; the specific execution steps of each thread are:
(11-1) calculate the number CapSMId of the CapSM to which the current thread belongs, comprising the following sub-processes:
(11-1-1) calculate the thread block number PBlockId to which the current thread belongs: PBlockId = blockIdx.y * gridDim.x + blockIdx.x, where gridDim.x, blockIdx.y, and blockIdx.x are built-in variables that CUDA provides to each thread and can be used directly at run time;
(11-1-2) from the maximum number of active thread blocks MaxActivePBlock_i held by each SM or CapSM and the thread block number PBlockId, calculate the number of the CapSM it belongs to: CapSMId = ⌊PBlockId / MaxActivePBlock_i⌋;
(11-2) calculate the task block range processed by each persistent thread block, comprising the following sub-processes:
(11-2-1) calculate the assigned task block start value StartTaskId: StartTaskId = PBlockId * TaskBlocksPerPBlock_i;
(11-2-2) calculate the assigned task block end value StopTaskId: StopTaskId = StartTaskId + TaskBlocksPerPBlock_i;
FIG. 5 shows the relationship among CapSMs, persistent thread blocks, and task blocks: every 3 task blocks are allocated to one persistent thread block, and every 3 persistent thread blocks are allocated to one CapSM.
(11-3) calculate the number PBIdInCapSM of the current thread's thread block within its CapSM: PBIdInCapSM = PBlockId % MaxActivePBlock_i;
(11-4) according to the task block range obtained above, enter the loop control structure and execute in turn the corresponding tasks in all the assigned task blocks;
(12) if the task receives a resource release command in the process of running on the GPU, executing the step (13), otherwise, executing the step (14);
(13) If the task running on the GPU receives a resource release command, it releases the resources within the specified range. As shown in FIG. 6, if there are unexecuted task blocks on the released resources, these task blocks are remapped onto the remaining resources for continued execution; the specific execution steps of this task block remapping in each persistent thread block are:
(13-1) calculate the number NumberPerCapSM_i of released CapSMs that each remaining CapSM is responsible for mapping: NumberPerCapSM_i = ⌈evictCapSMNum_i / (CapSMQuota_i - evictCapSMNum_i)⌉, where CapSMQuota_i is the resource allocation of the GPU task, namely the number of CapSMs allocated when the task was submitted;
(13-2) calculate the range of released CapSMs each remaining CapSM is responsible for mapping, comprising the following two sub-processes:
(13-2-1) calculate the remapping start value: CapSMRemapStart = (CapSMId - evictCapSMNum_i) * NumberPerCapSM_i;
(13-2-2) calculate the remapping end value: CapSMRemapEnd = CapSMRemapStart + NumberPerCapSM_i;
(13-3) select each CapSM in the remapping range in turn, denoting the currently selected CapSM as CapSMRemapCur; if the task blocks in all CapSMs in the mapped range have finished executing, jump to step (14);
(13-4) modify the variable PBlockId to the number of the corresponding thread block in the CapSMRemapCur currently being executed: PBlockId = CapSMRemapCur * MaxActivePBlock_i + PBIdInCapSM;
(13-5) according to the current thread block number PBlockId, execute all unexecuted task blocks under this PBlockId; if all unexecuted tasks in the currently mapped CapSMRemapCur have finished, jump to step (13-3);
(13-6) otherwise go to step (13-5) and continue execution;
(14) After the task is executed, it exits the GPU.
In a word, when mixed loads (comprising online tasks and offline tasks) share GPU resources, the method manages the use of GPU resources by different types of tasks in a fine-grained manner, supports task resource quotas and online resource adjustment, and guarantees the quality of service of online tasks while sharing GPU resources. A capacity-based Streaming Multiprocessor (SM) abstraction model, CapSM, is proposed; CapSM serves as the basic unit of resource management, and one CapSM is equivalent in capacity to one SM, i.e., the maximum number of active thread blocks on one CapSM equals that of an original SM. When offline and online applications submit GPU tasks through the resource management API, the resource request information provided is first parsed from the API, and the resources finally allocated to a task are then determined according to the task type, the resource request, and the current GPU resource state of the system. If the current GPU resources are sufficient, the task is allocated the resources it requests; if not, further processing depends on the task type: an offline task is allocated the currently available GPU resources, while for an online task part of the resources of the offline tasks running on the current GPU are released through the resource reclamation mechanism to satisfy its request, so that the performance of online tasks is guaranteed and GPU resources are fully utilized.
Details not described in the present invention belong to the common knowledge of those skilled in the art.
The above description covers only some embodiments of the present invention, but the scope of the invention is not limited thereto; any changes or substitutions that can easily be conceived by those skilled in the art within the technical scope of the invention fall within the scope of the invention.

Claims (9)

1. A fine-grained GPU resource management method for mixed loads, characterized in that the mixed load divides tasks into online tasks and offline tasks; when the online tasks and the offline tasks share GPU resources, a capacity-based SM abstraction model is used as the basic unit of resource management to manage GPU resources in a fine-grained manner, task resource quotas and online resource adjustment are supported, and the quality of service of the online tasks is guaranteed while GPU resources are shared; the method comprises the following steps:
(1) when a user submits a task to a GPU through an Application Programming Interface (API), resource request information of the task is set, wherein the task comprises an online task and an offline task, if the task is the offline task, the upper limit of the resource of the task, namely quota, is set, and if the task is the online task, the lowest resource amount of the task, namely the reserved amount, is set;
(2) analyzing the submission information of the tasks through a resource management API, wherein the submission information comprises a kernel function, the number of task blocks, the size of the task block and the resource request of the tasks;
(3) calculating the number of active thread blocks which can be accommodated on a GPU SM according to the kernel function of the task and the size of the task block;
(4) calculating the residual available resource amount on the GPU according to the running condition of the application on the current GPU;
(5) if the remaining resources of the current GPU are not less than the resource request of the task acquired in step (2), executing step (6), otherwise executing step (8);
(6) setting the resource configuration of the task as a resource request of the task;
(7) according to the resource configuration of the task and the number of the active thread blocks determined in the step (3), calculating the number of thread blocks which are to be created when the task is submitted to the GPU for operation and the number of task blocks allocated to each thread block, and then executing a step (11);
(8) if the current task is an offline task, executing step (9), and if the current task is an online task, executing step (10);
(9) setting the resource residual amount of the current GPU as the resource configuration of the task, and then turning to the step (7) to execute;
(10) calculating a resource difference according to the resource surplus of the current GPU and the resource request of the task, then sending a resource release command to the offline task running on the current GPU to enable the offline task to release the resource amount specified by the resource difference, and then turning to the step (6);
(11) submitting the tasks to a GPU according to the calculated number of thread blocks of the GPU tasks, creating threads and starting to run;
(12) if the task receives a resource release command in the process of running on the GPU, executing the step (13), otherwise, executing the step (14);
(13) if the task running on the GPU receives a command of releasing the resources, the resources in the specified range are released, and if task blocks on the released resources are not executed, the task blocks which are not executed are remapped to the rest resources to be continuously executed;
(14) after the task is executed, quitting the GPU;
the resource management basic unit used is the capacity-based SM abstraction model, hereinafter referred to as CapSM, which is implemented as follows:
(1-1) given a GPU, the capacity of each SM is set to 1 capacity unit; given a kernel task K, assume that one SM on the GPU can hold M thread blocks of task K in the active state;
(1-2) according to M, each SM is abstracted into M small fragments, each with a capacity of 1/M capacity units, and each small fragment can hold exactly one thread block of task K;
(1-3) after all SMs of the GPU are divided into small fragments in this way, any N small fragments whose total capacity equals that of one physical SM are considered to form one CapSM;
(1-4) for task K, any M small fragments make up one CapSM.
2. The fine-grained GPU resource management method for mixed loads according to claim 1, wherein, for the resource management basic unit CapSM:
(1-1) the M small fragments that make up a CapSM may come from the same SM or from multiple different SMs;
(1-2) each small fragment corresponds to one thread block, so a CapSM can be regarded as a set of thread blocks; in implementation, the management of CapSMs can thus be converted into the management of the number of thread blocks;
(1-3) the capacity-based SM abstraction model CapSM does not depend on a particular GPU architecture or GPU parallel programming language, and the concept of CapSM can be applied to other GPU architectures and GPU parallel programming languages.
3. The fine-grained GPU resource management method for mixed loads according to claim 1, wherein the original kernel function of the task needs to be converted so that the threads of the task running on the GPU are persistent threads, the specific conversion process being:
(1-1) a loop control structure is inserted into the original kernel function, with the original kernel function body serving as the loop body of the structure;
(1-2) the loop body traverses the task blocks assigned to each persistent thread and executes them in turn, setting a variable taskIdx to the number of the task block currently being executed;
(1-3) the variable blockIdx, which in the original kernel function body denotes the index of the thread block a thread belongs to, is replaced by the variable taskIdx denoting the task block the thread currently serves.
4. The fine-grained GPU resource management method for mixed loads according to claim 1, wherein in step (1), when the user submits a task to the GPU through the resource management API, the API provides the following two task submission modes:
(1-1) running a task in resource quota mode, aimed mainly at offline tasks and limiting the amount of resources used by offline tasks; when a task is submitted in this mode, the resource quota amount quota, the kernel function of the task to run, the task block number TaskBlockNumber of the task, and the task block size TaskBlockSize, i.e., the number of threads in each task block, must be provided;
(1-2) running a task in resource reservation mode, aimed mainly at online tasks; when a task is submitted in this mode, the resource amount reservation reserved for the task, the kernel function to run, the task block number TaskBlockNumber of the task, and the task block size TaskBlockSize, i.e., the number of threads in each task block, must be provided.
5. The fine-grained GPU resource management method for mixed loads according to claim 1, wherein in step (3), the number of active thread blocks that one GPU SM can accommodate is calculated according to the kernel function of the task and the size of the task block; in this process, the maximum number of active thread blocks MaxActivePBlock_i that each SM or CapSM can hold can be calculated through the API provided by the Compute Unified Device Architecture (CUDA) for obtaining the maximum number of active thread blocks in each SM.
6. The fine-grained GPU resource management method for mixed loads according to claim 1, wherein in step (7), according to the task's resource configuration CapSMQuota_i and the number of active thread blocks MaxActivePBlock_i determined in step (3), the number of thread blocks PBlockNumber_i to be created when the task is submitted to the GPU and the number of task blocks TaskBlocksPerPBlock_i allocated to each thread block are calculated, the specific process comprising:
(7-1) combining the task's resource configuration CapSMQuota_i and MaxActivePBlock_i, computing the task's thread block number: PBlockNumber_i = CapSMQuota_i * MaxActivePBlock_i;
(7-2) from the task block number TaskBlockNumber_i and the thread block number PBlockNumber_i, computing the number of task blocks allocated to each thread block: TaskBlocksPerPBlock_i = ⌈TaskBlockNumber_i / PBlockNumber_i⌉.
7. The fine-grained GPU resource management method for mixed loads according to claim 6, wherein in step (11) the task is submitted to the GPU, threads are created and start running, the specific execution steps of each thread comprising:
(11-1) calculating the number CapSMId of the CapSM to which the current thread belongs, comprising the following sub-processes:
(11-1-1) calculating the thread block number PBlockId to which the current thread belongs: PBlockId = blockIdx.y * gridDim.x + blockIdx.x, where gridDim.x, blockIdx.y, and blockIdx.x are built-in variables that CUDA provides to each thread and can be used directly at run time;
(11-1-2) from the maximum number of active thread blocks MaxActivePBlock_i held by each SM or CapSM and the thread block number PBlockId, calculating the number of the CapSM it belongs to: CapSMId = ⌊PBlockId / MaxActivePBlock_i⌋;
(11-2) calculating the task block range processed by each persistent thread block, comprising the following sub-processes:
(11-2-1) calculating the assigned task block start value StartTaskId: StartTaskId = PBlockId * TaskBlocksPerPBlock_i;
(11-2-2) calculating the assigned task block end value StopTaskId: StopTaskId = StartTaskId + TaskBlocksPerPBlock_i;
(11-3) calculating the number PBIdInCapSM of the current thread's thread block within its CapSM: PBIdInCapSM = PBlockId % MaxActivePBlock_i;
(11-4) according to the task block range obtained above, entering the loop control structure and executing in turn the corresponding tasks in all the assigned task blocks.
8. The fine-grained GPU resource management method for mixed loads according to claim 7, wherein in step (10), sending a resource release command to the offline tasks running on the current GPU so that they release the amount of resources specified by the resource difference comprises the following two stages:
(10-1) on the CPU side, changing the value of the resource release flag evictCapSMNum_i; evictCapSMNum_i can be synchronized between the CPU and the GPU, and its value indicates the number of CapSMs that need to be released;
(10-2) in the loop control structure of each persistent thread of a task running on the GPU, checking evictCapSMNum_i before each execution of the loop body; all threads whose CapSMId is less than evictCapSMNum_i exit, and thereby a number of CapSM resources equal to evictCapSMNum_i are released.
9. The fine-grained GPU resource management method for mixed loads according to claim 8, wherein in step (13), if there are unexecuted task blocks on the released resources, the unexecuted task blocks are remapped onto the remaining resources for continued execution, the specific processing of task block remapping in a persistent thread block being:
(13-1) calculating the number NumberPerCapSM_i of released CapSMs that each remaining CapSM is responsible for mapping: NumberPerCapSM_i = ⌈evictCapSMNum_i / (CapSMQuota_i - evictCapSMNum_i)⌉, where CapSMQuota_i is the resource allocation of the GPU task, namely the number of CapSMs allocated when the task was submitted;
(13-2) calculating the range of released CapSMs each remaining CapSM is responsible for mapping, comprising the following two sub-processes:
(13-2-1) calculating the remapping start value:
CapSMRemapStart = (CapSMId - evictCapSMNum_i) * NumberPerCapSM_i;
(13-2-2) calculating the remapping end value:
CapSMRemapEnd = CapSMRemapStart + NumberPerCapSM_i;
(13-3) selecting each CapSM in the remapping range in turn, denoting the currently selected CapSM as CapSMRemapCur; if the task blocks in all CapSMs in the mapped range have finished executing, jumping to step (14);
(13-4) modifying the variable PBlockId to the number of the corresponding thread block in the CapSMRemapCur currently being executed: PBlockId = CapSMRemapCur * MaxActivePBlock_i + PBIdInCapSM;
(13-5) according to the current thread block number PBlockId, executing all unexecuted task blocks under this PBlockId; if all unexecuted tasks in the currently mapped CapSMRemapCur have finished, jumping to step (13-3);
(13-6) otherwise going to step (13-5) to continue execution.




Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant
TR01: Transfer of patent right
    Effective date of registration: 2021-04-20
    Address after: 100160, No. 4, Building 12, No. 128, South Fourth Ring Road, Fengtai District, Beijing (rooms 1515-1516)
    Patentee after: Kaixi (Beijing) Information Technology Co., Ltd.
    Address before: 100191, No. 37, Xueyuan Road, Haidian District, Beijing
    Patentee before: Beihang University
CF01: Termination of patent right due to non-payment of annual fee
    Granted publication date: 2020-07-10
    Termination date: 2021-07-12