CN107357661A - Fine-grained GPU resource management method for mixed loads - Google Patents

Fine-grained GPU resource management method for mixed loads

Info

Publication number
CN107357661A
CN107357661A
Authority
CN
China
Prior art keywords
task
resource
gpu
capsm
thread
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710563834.7A
Other languages
Chinese (zh)
Other versions
CN107357661B (en)
Inventor
杨海龙
禹超
白跃彬
栾钟治
顾育豪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kaixi Beijing Information Technology Co., Ltd.
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN201710563834.7A
Publication of CN107357661A
Application granted
Publication of CN107357661B
Expired - Fee Related
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5022Mechanisms to release resources
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/505Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a fine-grained GPU resource management method for mixed workloads. It proposes CapSM, a capacity-based abstraction of the streaming multiprocessor (SM), and uses the CapSM as the basic unit of resource management. When a mixed workload (including online tasks and offline tasks) shares GPU resources, the method manages each task type's use of the GPU at fine granularity, supporting per-task resource quotas and online resource adjustment, so that the quality of service of online tasks is guaranteed while GPU resources are shared. The resources finally allocated to a task are determined by the task's type, its resource request, and the current GPU resource state: when resources are plentiful, offline tasks receive the GPU resources they request; when GPU resources run short, the resources used by offline tasks are adjusted dynamically so that the resource demands of online tasks are satisfied first. As a result, when a mixed workload runs, the performance of online tasks is guaranteed and the GPU is still fully utilized.

Description

Fine-grained GPU resource management method for mixed loads
Technical field
The present invention relates to resource management and task scheduling in heterogeneous computing, and more particularly to a fine-grained GPU resource management method for mixed workloads.
Background art
Thanks to its enormous peak compute capability, the graphics processing unit (Graphics Processing Unit, hereinafter GPU) has become an indispensable part of high-performance computing, cloud computing, and data centers, and more and more institutions and organizations use GPUs to accelerate their key business workloads. To improve GPU utilization, infrastructure providers usually also let several different types of tasks (online tasks and offline tasks) share GPU resources, i.e., they adopt a mixed-workload mode of operation. However, when a mixed workload shares a GPU, the tasks compete for GPU resources and severely interfere with the performance of the online tasks. The root cause is that once a task has been submitted to the GPU for execution, the resources it occupies are released only after it finishes; if an offline task occupies too much of the GPU, or occupies it for too long, online tasks cannot obtain enough GPU resources in time, and their quality-of-service targets are missed.
In recent years, researchers have attacked the performance-interference problem of mixed workloads running on GPUs from several directions. Existing results fall mainly into the following categories:
(1) Hardware-based methods
These methods require modifying the existing GPU hardware architecture and adding corresponding control units. Because GPU vendors protect their designs, a completely detailed understanding of the GPU hardware architecture is hard to obtain, and modifying the GPU hardware of a production system is practically impossible. Hardware-based methods are therefore realized only in simulators; they have academic research value but little practical significance.
(2) Software-based methods
These methods do not modify the existing GPU hardware; they only control the different applications at the software level, so they are operable in practice and have real-world significance. Software-based methods can be further divided into the following classes:
a) Methods based on priority scheduling
Different types of GPU tasks are assigned different priorities: online tasks get higher priority and offline tasks get lower priority. When an online task and an offline task need to be scheduled at the same time, the higher-priority online task runs first. With this approach only one task can run on the GPU at any moment, so GPU utilization stays low.
b) Methods based on kernel reordering
These are similar to priority scheduling, except that each task's priority is dynamic: when a kernel task arrives, its priority is computed from its quality-of-service requirement, and the submission order of kernel tasks is then adjusted according to the computed dynamic priorities.
c) Methods based on GPU preemption
These are also similar to priority scheduling in that every kind of task has a fixed priority, but preemption based on priority is supported: when a task is running on the GPU and a higher-priority task arrives, the newcomer preempts the task currently running instead of having to wait for it to finish before it can use the GPU. Preemption can reduce waiting time to some degree, but the time overhead of a preemption is correlated with the execution time of the preempted GPU kernel task.
In summary, hardware-based methods try to solve the performance problem of mixed workloads by changing the GPU hardware architecture; on existing GPU devices they have low operability and poor practicality. Software-based methods can let online tasks run first as far as possible, but they cannot guarantee that an online task obtains the corresponding resources promptly when it needs extra resources. A fine-grained GPU resource management method is therefore needed that effectively controls how different task types use GPU resources under a mixed workload, and in particular supports per-task resource quotas and online resource adjustment, so that quality-of-service requirements can be met. No related technique has been reported so far.
Summary of the invention
The technical problem solved by the present invention: overcoming the deficiencies of the prior art by providing a fine-grained GPU resource management method for mixed workloads. While a mixed workload runs on a GPU, a resource quota limits the resources an offline task may use, preventing offline tasks from occupying excessive resources. Meanwhile, when an online task needs extra resources, the resources used by offline tasks can be adjusted online, so the online task never has to wait a long time for an offline task to release resources, a situation in which its quality-of-service target could not be met.
The present invention builds on the Multi-Process Service (Multi-Process Service, hereinafter MPS) technology of the Compute Unified Device Architecture (Compute Unified Device Architecture, hereinafter CUDA). MPS is a Hyper-Q-based GPU resource management technology proposed by NVIDIA; when the GPU is not fully utilized, MPS lets kernel tasks from multiple applications execute concurrently, improving GPU utilization. Moreover, MPS is transparent to use: it automatically merges kernels from different CUDA contexts into the same CUDA context so that they can run on the GPU simultaneously. Because MPS treats all kernels equally, each kernel is given all the resources its threads need when it starts running. Therefore, when online and offline tasks run mixed on a GPU, a mechanism is needed to limit the resources used by offline tasks and reduce the performance interference that resource contention inflicts on online tasks.
The technical solution of the present invention: a fine-grained GPU resource management method for mixed workloads. The mixed workload divides tasks into online tasks and offline tasks; when online and offline tasks share GPU resources, a capacity-based SM abstraction serves as the basic unit of resource management for managing, at fine granularity, how each task type uses GPU resources, supporting per-task resource quotas and online resource adjustment, and guaranteeing the quality of service of online tasks while GPU resources are shared. The method comprises the following steps:
(1) when a user submits a task to the GPU through the resource management API (unless stated otherwise, "task" covers both online and offline tasks), the task's resource request information is set: for an offline task, the resource upper bound, i.e., the quota, is given; for an online task, the minimum resource amount, i.e., the reservation, is given;
(2) the task's submission information is parsed from the resource management API, including the kernel function, the number of task blocks, the task block size, and the task's resource request;
(3) from the task's kernel function and its task block size, the number of active thread blocks that one SM of the GPU can accommodate is computed;
(4) from the running state of the applications currently on the GPU, the remaining available resources on the GPU are computed;
(5) if the current GPU resource surplus is no less than the resource request obtained in step (2), step (6) is executed; otherwise step (8) is executed;
(6) the task's resource configuration is set to the task's resource request;
(7) from the task's resource configuration and the number of active thread blocks determined in step (3), the number of thread blocks to create when the task is submitted to the GPU and the number of task blocks assigned to each thread block are computed; then step (11) is executed;
(8) if the current task is an offline task, step (9) is executed; if it is an online task, step (10) is executed;
(9) the current GPU resource surplus is set as the task's resource configuration; then go to step (7);
(10) from the current GPU resource surplus and the task's resource request, the resource deficit is computed, and a resource release command is sent to the offline tasks running on the GPU, making them release resources equal to the deficit; then go to step (6);
(11) with the computed number of thread blocks, the task is submitted to the GPU, its threads are created, and it begins to run;
(12) if the task receives a resource release command while running on the GPU, step (13) is executed; otherwise step (14) is executed;
(13) upon receiving the release command while running on the GPU, the task releases the resources of the specified range; if task blocks in the released resources have not executed, the unexecuted task blocks are remapped onto the remaining resources and continue to execute;
(14) the task finishes execution and exits the GPU.
The basic unit of resource management is a capacity-based SM abstraction, hereinafter CapSM, realized as follows:
(1-1) given a GPU, the capacity of each SM is set to 1 capacity unit; given a kernel task K, assume that one SM of the GPU can accommodate M active thread blocks of task K;
(1-2) according to M, each SM is abstracted as M small slices; each slice has capacity 1/M capacity unit and can accommodate exactly one thread block of task K;
(1-3) after all SMs of the GPU have been divided into slices in this way, any N slices whose total capacity equals that of a physical SM are considered to form one CapSM;
(1-4) for task K, any M slices form one CapSM.
The resource management unit CapSM has the following properties:
(1-1) the M slices that form a CapSM may come from the same SM or from several different SMs;
(1-2) each slice corresponds to one thread block, so a CapSM can be regarded as a set of thread blocks; in the implementation, managing CapSMs therefore reduces to managing the number of thread blocks;
(1-3) the capacity-based SM abstraction CapSM does not depend on a specific GPU architecture or GPU parallel programming language; the CapSM concept can readily be applied to other GPU architectures and GPU parallel programming languages.
The task's original kernel function must be transformed so that the threads the task runs on the GPU become persistent threads. The transformation is as follows:
(1-1) a loop control structure is inserted into the original kernel function, with the original kernel function body as the loop body;
(1-2) the loop body iterates over the task blocks assigned to the persistent thread, executing them in turn, and sets the variable taskIdx to the task block number of the task block currently being executed;
(1-3) the variable blockIdx, which denotes the thread block index in the original kernel function body, is replaced by the variable taskIdx, which denotes the current task block.
In step (1), when the user submits a task to the GPU through the resource management API, the API offers the following two submission modes:
(1-1) running a task under a resource quota, mainly for offline tasks, limits the amount of resources the offline task may use; submitting a task this way requires the resource quota amount quota, the kernel function to run, the task's number of task blocks TaskBlockNumber, and the task block size TaskBlockSize, i.e., the number of threads in each task block;
(1-2) running a task with a resource reservation, mainly for online tasks; submitting a task this way requires the amount of resources reservation to be reserved for the task, the kernel function to run, the task's number of task blocks TaskBlockNumber, and the task block size TaskBlockSize, i.e., the number of threads in each task block.
In step (3), from the task's kernel function and its task block size TaskBlockSize_i, the number of active thread blocks that one SM of the GPU can accommodate is computed; the cudaOccupancyMaxActiveBlocksPerMultiprocessor API provided by CUDA can be used to compute the maximum number of active thread blocks MaxActivePBlock_i that each SM or CapSM can accommodate.
In step (7), from the task's resource configuration CapSMQuota_i and the number of active thread blocks MaxActivePBlock_i determined in step (3), the number of thread blocks PBlockNumber_i to create when the task is submitted to the GPU and the number of task blocks TaskBlocksPerPBlock_i assigned to each thread block are computed as follows:
(7-1) combining the task's resource configuration CapSMQuota_i with MaxActivePBlock_i, the task's thread block count is PBlockNumber_i = CapSMQuota_i * MaxActivePBlock_i;
(7-2) from the task's task block count TaskBlockNumber_i and thread block count PBlockNumber_i, the number of task blocks assigned to each thread block is TaskBlocksPerPBlock_i = ceil(TaskBlockNumber_i / PBlockNumber_i).
In step (11), the task is submitted to the GPU, its threads are created, and it begins to run; each thread executes the following steps:
(11-1) compute the number CapSMId of the CapSM to which the current thread belongs, via the following sub-steps:
(11-1-1) compute the thread block number PBlockId of the current thread: PBlockId = gridDim.x * blockIdx.y + blockIdx.x, where gridDim.x, blockIdx.y, and blockIdx.x are built-in variables that CUDA provides to every thread and that a thread can use directly while running;
(11-1-2) from the maximum number of active thread blocks MaxActivePBlock_i accommodated in each SM or CapSM and the thread block number PBlockId, compute the owning CapSM number: CapSMId = floor(PBlockId / MaxActivePBlock_i);
(11-2) compute the range of task blocks processed by each persistent thread block, via the following sub-steps:
(11-2-1) compute the first task block assigned, StartTaskId: StartTaskId = PBlockId * TaskBlocksPerPBlock_i;
(11-2-2) compute the last task block assigned, StopTaskId: StopTaskId = StartTaskId + TaskBlocksPerPBlock_i;
(11-3) compute the number PBIdInCapSM of the current thread block within its CapSM: PBIdInCapSM = PBlockId % MaxActivePBlock_i;
(11-4) with the task block range obtained above, enter the loop control structure and execute the tasks in all assigned task blocks in turn.
In step (10), a resource release command is sent to the offline tasks running on the GPU, making them release the amount of resources specified by the resource deficit, in the following two stages:
(10-1) on the CPU side, the value of the resource release flag evictCapSMNum_i is modified; evictCapSMNum_i can be synchronized between the CPU and the GPU, and its value denotes the number of CapSMs to release;
(10-2) the loop control structure of each persistent thread of the tasks running on the GPU checks the value of evictCapSMNum_i before each execution of the loop body; every thread whose CapSMId is less than evictCapSMNum_i exits, so that evictCapSMNum_i CapSMs' worth of resources are released.
In step (13), if task blocks in the released resources have not executed, the unexecuted task blocks are remapped onto the remaining resources and continue to execute; a persistent thread block performs the remapping as follows:
(13-1) compute the number NumberPerCapSM_i of released CapSMs that each surviving CapSM is responsible for mapping: NumberPerCapSM_i = ceil(evictCapSMNum_i / (CapSMQuota_i - evictCapSMNum_i)),
where CapSMQuota_i is the GPU task's resource configuration, i.e., the number of CapSMs configured when the task was submitted;
(13-2) compute the range of released CapSMs that each CapSM is responsible for mapping, via the following two sub-steps:
(13-2-1) compute the first CapSM of the mapping:
CapSMRemapStart = (CapSMId - evictCapSMNum_i) * NumberPerCapSM_i;
(13-2-2) compute the last CapSM of the mapping:
CapSMRemapEnd = CapSMRemapStart + NumberPerCapSM_i;
(13-3) select each CapSM within the mapped range in turn, letting CapSMRemapCur denote the currently selected CapSM; if the task blocks in all CapSMs within the mapped range have finished executing, jump to step (14);
(13-4) set the variable PBlockId to the corresponding thread block number within the CapSMRemapCur currently being executed: PBlockId = CapSMRemapCur * MaxActivePBlock_i + PBIdInCapSM;
(13-5) using the current thread block number PBlockId, execute all unexecuted task blocks of PBlockId; if all unexecuted task blocks in the currently mapped CapSMRemapCur have finished, jump to step (13-3);
(13-6) go to step (13-5) and continue.
(14) The task finishes execution and exits the GPU.
Compared with the prior art, the innovation of the invention is this: a capacity-based SM abstraction is proposed that converts limits on the number of SMs into limits on SM capacity, so that GPU resource reservation and online adjustment mechanisms can be realized flexibly, without depending on specific hardware or a programming model, and the performance interference of offline tasks on online tasks under a mixed workload is effectively controlled. Concretely:
(1) The invention abstracts the concept of capacity out of the physical SM. Every SM on a given GPU has the same capacity, and a given number of SMs has a corresponding amount of capacity; limits on SM resources are thereby converted into limits on SM capacity usage. The capacity concept lets the invention be applied easily to GPUs of other vendors, such as AMD.
(2) The invention eliminates performance interference at the software level through resource reservation: limits on GPU resource usage are realized flexibly by limiting SM capacity. By limiting the GPU resources that offline tasks can use, enough GPU resources are reserved for online tasks, eliminating resource contention as far as possible and guaranteeing the quality-of-service targets of online tasks.
Brief description of the drawings
Fig. 1 is a scenario diagram of the fine-grained GPU resource management method of the invention;
Fig. 2 is a flow chart of the fine-grained GPU resource management method for mixed workloads of the invention;
Fig. 3 is a schematic diagram of the capacity-based SM resource abstraction CapSM;
Fig. 4 shows the relationship between ordinary thread blocks and persistent thread blocks;
Fig. 5 is a schematic diagram of the relationship among CapSMs, persistent thread blocks, and task blocks;
Fig. 6 is a schematic diagram of dynamic resource reclamation and task remapping.
Detailed description of the embodiments
To make the purpose, technical solution, and advantages of the invention clearer, the invention is further elaborated below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it. Moreover, the technical features involved in the embodiments described below may be combined with each other as long as they do not conflict.
The basic idea of the invention is to design, at the software level and on top of the existing MPS mechanism, a mixed-workload resource management method based on resource quotas and reservations. When online and offline tasks run mixed on a GPU, the resource quota mechanism limits the offline tasks' use of GPU resources so that enough resources are reserved for online tasks and they are processed promptly. Meanwhile, when an online task's resource demand is hard to satisfy, the offline tasks' use of GPU resources can be adjusted online: the resources used by offline tasks are reclaimed and the resource demands of online tasks are met first, guaranteeing the quality of service of online tasks.
An application scenario of the invention is shown in Fig. 1. After an offline application or an online application submits a GPU task to the resource management module through the resource management API, the module parses the supplied resource request information from the API, then determines the resources finally allocated to the task from the task's type, its resource request, and the current GPU resource state. If the current GPU resources are sufficient, the task is allocated the resources it requests. If the current GPU resources cannot satisfy the task's request, further processing depends on the task type: if the task is an offline task, the currently available GPU resources are allocated to it; if the task is an online task, part of the resources of the offline tasks running on the GPU is reclaimed to satisfy the online task's request. This both guarantees the performance of online tasks and fully utilizes GPU resources. The process requires coordinated control between the CPU and GPU sides; a resource management module is also needed on the GPU side to cooperate with the CPU in completing the fine-grained management of GPU resources.
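For illustration only, the allocation decision just described can be sketched in a few lines of host-side C++; every name here (TaskRequest, remainGPU, sendEvictCommand) is hypothetical and not part of the patented interface:

// Illustrative host-side allocation policy, mirroring steps (5)-(10):
// grant the request if it fits; otherwise an offline task gets whatever
// is left, while an online task triggers eviction of offline CapSMs.
enum class TaskType { Online, Offline };

struct TaskRequest {
    TaskType type;
    int capSMRequest;  // quota (offline) or reservation (online), in CapSMs
};

static int remainGPU = 0;  // CapSMs currently unallocated (manager-tracked)

static void sendEvictCommand(int capSMs) {
    // Signal running offline tasks to shed this many CapSMs;
    // see the eviction-flag sketch under step (10) below.
}

static int decideAllocation(const TaskRequest& t) {
    if (remainGPU >= t.capSMRequest)        // step (5) -> step (6)
        return t.capSMRequest;
    if (t.type == TaskType::Offline)        // step (8) -> step (9)
        return remainGPU;                   // offline tasks take what is left
    int gap = t.capSMRequest - remainGPU;   // step (10): resource deficit
    sendEvictCommand(gap);                  // reclaim CapSMs from offline tasks
    return t.capSMRequest;                  // step (6): grant the full reservation
}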
As shown in Fig. 2, the fine-grained GPU resource management method for mixed workloads of the invention comprises the following steps:
(1) when a user submits a task task_i to the GPU through the resource management API, the task's resource request information CapSMRequest_i is set: for an offline task, the resource upper bound, i.e., the quota, is given; for an online task, the minimum resource amount, i.e., the reservation, is given;
To manage GPU resources flexibly and effectively, the invention uses a capacity-based SM resource abstraction, CapSM. CapSM is the basis on which this method realizes fine-grained resource management at the software level, and it is the basic unit of resource management.
As shown in Fig. 3, the resource management unit CapSM is defined as follows:
(1-1) given a GPU, the capacity of each SM is set to 1 capacity unit; given a kernel task K, assume that one SM of the GPU can accommodate M active thread blocks of task K;
(1-2) according to M, each SM is abstracted as M small slices; each slice has capacity 1/M capacity unit and can accommodate exactly one thread block of task K;
(1-3) after all SMs of the GPU have been divided into slices in this way, any N slices whose total capacity equals that of a physical SM are considered to form one CapSM;
(1-4) for task K, any M slices form one CapSM;
The resource management unit CapSM has the following features:
(1-1) the M slices that form a CapSM may come from the same SM or from several different SMs;
(1-2) each capacity slice corresponds to one thread block, so a CapSM can be regarded as a set of thread blocks; in the implementation, managing CapSMs therefore reduces to managing the number of thread blocks;
(1-3) CapSM abstracts GPU resources without depending on a specific GPU architecture or GPU parallel programming language; the CapSM concept can readily be applied to other GPU architectures and GPU parallel programming languages;
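As a purely numeric illustration of the CapSM bookkeeping (the GPU size and occupancy figure below are assumptions, not measurements):

#include <cstdio>

// With M active thread blocks of task K per SM, each SM splits into M
// slices of capacity 1/M; any M slices form one CapSM, so managing CapSMs
// reduces to counting thread blocks.
int main() {
    const int numSM = 20;  // hypothetical GPU with 20 SMs
    const int M     = 8;   // active thread blocks of K per SM (occupancy)

    const int slices          = numSM * M;   // 160 slices, 1/M capacity each
    const int capSMsAvailable = slices / M;  // 20 CapSMs == 20 SM-equivalents

    // A quota of 5 CapSMs for an offline task therefore allows at most
    // 5 * M = 40 concurrently active thread blocks of that task.
    const int quota = 5;
    printf("CapSMs: %d, thread blocks allowed under quota: %d\n",
           capSMsAvailable, quota * M);
    return 0;
}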
The invention provides users with two resource management APIs for submitting tasks to the GPU:
(1-1) Launch_kernel_with_quota(quota, kernel, grid_size, block_size, kernel_arg_list):
runs a kernel task under a resource quota, mainly for offline tasks, where quota is the requested resource quota amount, kernel is the kernel function to run, grid_size is the number of task blocks, block_size is the task block size, i.e., the number of threads in each task block, and kernel_arg_list is the argument list passed to the kernel function;
(1-2) Launch_kernel_with_reservation(reservation, kernel, grid_size, block_size, kernel_arg_list):
runs a task with a resource reservation, mainly for online tasks, where reservation is the amount of resources to reserve for the task, kernel is the kernel function to run, grid_size is the number of task blocks, block_size is the task block size, i.e., the number of threads in each task block, and kernel_arg_list is the argument list passed to the kernel function;
In the two APIs above, quota and reservation are the task's resource request, and grid_size is the task's number of task blocks; in the invention, the work executed by one ordinary thread block of a task is called a task block.
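The patent names these two entry points without giving source-level signatures; one plausible C++ rendering, with the kernel argument list flattened to a void* array for brevity (an assumption of this sketch), is:

#include <cuda_runtime.h>

// Assumed declarations for the two submission APIs. quota/reservation are
// in CapSMs; kernel is the persistent-thread kernel; grid_size is the task
// block count; block_size is threads per task block.
using KernelFunc = void (*)();

cudaError_t Launch_kernel_with_quota(int quota, KernelFunc kernel,
                                     int grid_size, int block_size,
                                     void** kernel_arg_list);

cudaError_t Launch_kernel_with_reservation(int reservation, KernelFunc kernel,
                                           int grid_size, int block_size,
                                           void** kernel_arg_list);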
The fine-grained GPU resource management method for mixed workloads is a cooperative CPU-GPU process. Besides providing the resource management API on the CPU side, the kernel functions executed on the GPU side must also be modified so that the threads the original GPU kernel task runs on the GPU become persistent threads; each persistent thread block can then execute the work of several original thread blocks, i.e., it can execute several task blocks in turn, as shown in Fig. 4. The transformation is as follows:
(1-1) a loop control structure is inserted into the original kernel function, with the original kernel function body as the loop body;
(1-2) the loop body iterates over the task blocks assigned to the persistent thread, executing them in turn, and sets the variable taskIdx to the task block number of the task block currently being executed;
(1-3) the variable blockIdx, which denotes the thread block index in the original kernel function body, is replaced by the variable taskIdx, which denotes the current task block.
Unless stated otherwise, all tasks managed by the invention refer to tasks transformed as above.
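As a hedged illustration of transformation steps (1-1)-(1-3), suppose the original kernel scales a vector; the persistent-thread version wraps the body in a loop over the task blocks assigned to the thread block. Here the range is passed as parameters for simplicity; in the method it is computed on the device as in step (11-2), and the loop also checks the eviction flag of step (10):

#include <cuda_runtime.h>

// Original kernel: one thread block per task block.
__global__ void scale_orig(float* a, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= s;
}

// Persistent-thread version (illustrative): the loop body is the original
// kernel body with blockIdx.x replaced by taskIdx.
__global__ void scale_persistent(float* a, float s, int n,
                                 int startTaskId, int stopTaskId) {
    for (int taskIdx = startTaskId; taskIdx < stopTaskId; ++taskIdx) {
        int i = taskIdx * blockDim.x + threadIdx.x;  // blockIdx.x -> taskIdx
        if (i < n) a[i] *= s;
    }
}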
(2) the task's submission information is parsed from the resource management API, including the kernel function, the task block count TaskBlockNumber_i, the task block size TaskBlockSize_i, and the task's resource request CapSMRequest_i, where the value of TaskBlockNumber_i comes from the API parameter grid_size, TaskBlockSize_i is the parameter block_size, and CapSMRequest_i is the parameter quota or reservation.
(3) from the task's kernel function and its task block size TaskBlockSize_i, the number of active thread blocks MaxActivePBlock_i that one SM of the GPU can accommodate is computed with the cudaOccupancyMaxActiveBlocksPerMultiprocessor API provided by CUDA.
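A minimal sketch of this occupancy query follows; cudaOccupancyMaxActiveBlocksPerMultiprocessor is the standard CUDA runtime API, while the kernel and block size are placeholders:

#include <cuda_runtime.h>
#include <cstdio>

__global__ void someKernel(float* a) { /* task body */ }

int main() {
    int maxActivePBlock = 0;
    int taskBlockSize   = 256;  // TaskBlockSize_i: threads per task block
    // Active thread blocks of this kernel that fit on one SM (step (3)).
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &maxActivePBlock, someKernel, taskBlockSize, /*dynamicSMemSize=*/0);
    printf("MaxActivePBlock = %d\n", maxActivePBlock);
    return 0;
}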
(4) from the running state of the applications currently on the GPU, the remaining available resources Remain_GPU on the GPU are computed;
(5) if the current GPU resource surplus Remain_GPU is no less than the task's CapSMRequest_i, step (6) is executed; otherwise step (8) is executed;
(6) the task's resource configuration CapSMQuota_i is set to the task's resource request CapSMRequest_i;
(7) from the task's CapSMQuota_i and the number of active thread blocks MaxActivePBlock_i determined in step (3), the number of thread blocks PBlockNumber_i to create when the task is submitted to the GPU and the number of task blocks TaskBlocksPerPBlock_i assigned to each thread block are computed, after which step (11) is executed; this comprises the following sub-steps:
(7-1) combining the task's resource configuration CapSMQuota_i with MaxActivePBlock_i, the task's thread block count is PBlockNumber_i = CapSMQuota_i * MaxActivePBlock_i;
(7-2) from the task's task block count TaskBlockNumber_i and thread block count PBlockNumber_i, the number of task blocks assigned to each thread block is TaskBlocksPerPBlock_i = ceil(TaskBlockNumber_i / PBlockNumber_i).
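Steps (7-1) and (7-2) as straight-line code (a sketch with assumed inputs; the ceiling division in (7-2) follows the reconstruction above):

// Thread-block and task-block sizing for one task (illustrative values).
int capSMQuota      = 4;     // CapSMQuota_i, in CapSMs
int maxActivePBlock = 8;     // MaxActivePBlock_i, from the occupancy query
int taskBlockNumber = 1000;  // TaskBlockNumber_i (grid_size)

// (7-1) persistent thread blocks to launch
int pBlockNumber = capSMQuota * maxActivePBlock;  // 32

// (7-2) task blocks per persistent thread block, rounded up so that all
// task blocks are covered
int taskBlocksPerPBlock =
    (taskBlockNumber + pBlockNumber - 1) / pBlockNumber;  // 32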
(8) if the current task is an offline task, step (9) is executed; if it is an online task, step (10) is executed;
(9) the current GPU resource surplus Remain_GPU is set as the task's resource configuration CapSMQuota_i; then go to step (7);
(10) from the current GPU resource surplus Remain_GPU and the task's resource request CapSMRequest_i, compute the resource deficit Gap_i, then send a resource release command to the offline tasks running on the GPU, making them release the amount of resources specified by the deficit, and then go to step (6); executing the resource release command requires CPU-GPU cooperation and comprises the following two stages:
(10-1) on the CPU side, the value of the resource release flag evictCapSMNum_i is modified; evictCapSMNum_i can be synchronized between the CPU and the GPU, and its value denotes the number of CapSMs to release;
(10-2) the loop control structure of each persistent thread of the tasks running on the GPU checks the value of evictCapSMNum_i before each execution of the loop body; every thread whose CapSMId is less than evictCapSMNum_i exits, so that evictCapSMNum_i CapSMs' worth of resources are released;
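One way to realize the CPU-GPU-synchronized flag of (10-1)-(10-2) is managed (or zero-copy mapped) memory; the sketch below assumes cudaMallocManaged and a volatile read before each loop-body execution. The variable names follow the text, but the choice of memory mechanism is an assumption:

#include <cuda_runtime.h>

// Host side (10-1): allocate the flag once, then raise it to evict.
//   int* evictCapSMNum;
//   cudaMallocManaged(&evictCapSMNum, sizeof(int));
//   *evictCapSMNum = 0;               // nothing to release initially
//   ... later, to reclaim 2 CapSMs:  *evictCapSMNum = 2;

// Device side (10-2): checked before each execution of the loop body.
// Persistent thread blocks whose CapSM id is below the flag value exit,
// releasing that many CapSMs' worth of resources.
__device__ bool shouldExit(const volatile int* evictCapSMNum, int capSMId) {
    return capSMId < *evictCapSMNum;
}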
(11) with the computed thread block count PBlockNumber_i, the task is submitted to the GPU; after the task's threads have been created on the GPU and started running, each thread executes the following steps:
(11-1) compute the number CapSMId of the CapSM to which the current thread belongs, via the following sub-steps:
(11-1-1) compute the thread block number PBlockId of the current thread: PBlockId = gridDim.x * blockIdx.y + blockIdx.x, where gridDim.x, blockIdx.y, and blockIdx.x are built-in variables that CUDA provides to every thread and that a thread can use directly while running;
(11-1-2) from the maximum number of active thread blocks MaxActivePBlock_i accommodated in each SM or CapSM and the thread block number PBlockId, compute the owning CapSM number: CapSMId = floor(PBlockId / MaxActivePBlock_i);
(11-2) compute the range of task blocks processed by each persistent thread block, via the following sub-steps:
(11-2-1) compute the first task block assigned, StartTaskId: StartTaskId = PBlockId * TaskBlocksPerPBlock_i;
(11-2-2) compute the last task block assigned, StopTaskId: StopTaskId = StartTaskId + TaskBlocksPerPBlock_i;
Fig. 5 shows the relationship among CapSMs, persistent thread blocks, and task blocks: every 3 task blocks are assigned to one persistent thread block, and every 3 persistent thread blocks are in turn assigned to one CapSM.
(11-3) compute the number PBIdInCapSM of the current thread block within its CapSM: PBIdInCapSM = PBlockId % MaxActivePBlock_i;
(11-4) with the task block range obtained above, enter the loop control structure and execute the tasks in all assigned task blocks in turn;
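The index arithmetic of (11-1)-(11-3) gathered into one device function (a sketch; the linear block id assumes the 2D grid of (11-1-1), and the modulo in (11-3) uses MaxActivePBlock_i, consistent with the inverse mapping of step (13-4)):

struct PTBlockInfo {
    int pBlockId;     // linear persistent-thread-block id
    int capSMId;      // CapSM this thread block belongs to
    int pbIdInCapSM;  // index of the thread block within its CapSM
    int startTaskId;  // first task block assigned
    int stopTaskId;   // one past the last task block assigned
};

__device__ PTBlockInfo computeIndices(int maxActivePBlock,
                                      int taskBlocksPerPBlock) {
    PTBlockInfo info;
    info.pBlockId    = gridDim.x * blockIdx.y + blockIdx.x;    // (11-1-1)
    info.capSMId     = info.pBlockId / maxActivePBlock;        // (11-1-2)
    info.pbIdInCapSM = info.pBlockId % maxActivePBlock;        // (11-3)
    info.startTaskId = info.pBlockId * taskBlocksPerPBlock;    // (11-2-1)
    info.stopTaskId  = info.startTaskId + taskBlocksPerPBlock; // (11-2-2)
    return info;
}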
(12) if the task receives a resource release command while running on the GPU, step (13) is executed; otherwise step (14) is executed;
(13) upon receiving the release command while running on the GPU, the task releases the resources of the specified range. As shown in Fig. 6, if task blocks in the released resources have not yet executed, the unexecuted task blocks are remapped onto the remaining resources and continue to execute; the remapping performed inside each persistent thread block comprises the following steps:
(13-1) compute the number NumberPerCapSM_i of released CapSMs that each surviving CapSM is responsible for mapping: NumberPerCapSM_i = ceil(evictCapSMNum_i / (CapSMQuota_i - evictCapSMNum_i)),
where CapSMQuota_i is the GPU task's resource configuration, i.e., the number of CapSMs configured when the task was submitted;
(13-2) compute the range of released CapSMs that each CapSM is responsible for mapping, via the following two sub-steps:
(13-2-1) compute the first CapSM of the mapping: CapSMRemapStart = (CapSMId - evictCapSMNum_i) * NumberPerCapSM_i;
(13-2-2) compute the last CapSM of the mapping: CapSMRemapEnd = CapSMRemapStart + NumberPerCapSM_i;
(13-3) select each CapSM within the mapped range in turn, letting CapSMRemapCur denote the currently selected CapSM; if the task blocks in all CapSMs within the mapped range have finished executing, jump to step (14);
(13-4) set the variable PBlockId to the corresponding thread block number within the CapSMRemapCur currently being executed: PBlockId = CapSMRemapCur * MaxActivePBlock_i + PBIdInCapSM;
(13-5) using the current thread block number PBlockId, execute all unexecuted task blocks of PBlockId; if all unexecuted task blocks in the currently mapped CapSMRemapCur have finished, jump to step (13-3);
(13-6) go to step (13-5) and continue;
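Steps (13-1)-(13-6) assembled into one device-side sketch; executeTaskBlock stands in for the original loop body, the ceiling division in (13-1) follows the reconstruction above, and a real implementation would additionally skip task blocks that had already completed before the eviction:

// Remapping of evicted CapSMs onto survivors (illustrative). Only thread
// blocks with capSMId >= evictCapSMNum, i.e., survivors, call this.
__device__ void executeTaskBlock(int taskIdx) {
    // original kernel body, indexed by taskIdx
}

__device__ void remapEvicted(int capSMId, int pbIdInCapSM,
                             int capSMQuota, int evictCapSMNum,
                             int maxActivePBlock, int taskBlocksPerPBlock) {
    int survivors = capSMQuota - evictCapSMNum;
    // (13-1) evicted CapSMs each survivor takes over (rounded up)
    int numberPerCapSM = (evictCapSMNum + survivors - 1) / survivors;
    // (13-2) range of evicted CapSMs mapped to this survivor
    int remapStart = (capSMId - evictCapSMNum) * numberPerCapSM;
    int remapEnd   = remapStart + numberPerCapSM;

    // (13-3)..(13-6) walk the mapped CapSMs and replay their task blocks
    for (int cur = remapStart; cur < remapEnd && cur < evictCapSMNum; ++cur) {
        // (13-4) thread block this CapSM/slot pair corresponds to
        int pBlockId    = cur * maxActivePBlock + pbIdInCapSM;
        int startTaskId = pBlockId * taskBlocksPerPBlock;    // as in (11-2-1)
        int stopTaskId  = startTaskId + taskBlocksPerPBlock;
        // (13-5) execute the task blocks that never ran
        for (int taskIdx = startTaskId; taskIdx < stopTaskId; ++taskIdx)
            executeTaskBlock(taskIdx);
    }
}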
(14) the task finishes execution and exits the GPU.
In short, when a mixed workload (including online tasks and offline tasks) shares GPU resources, the invention manages each task type's use of GPU resources at fine granularity, supporting per-task resource quotas and online resource adjustment, and guaranteeing the quality of service of online tasks while GPU resources are shared. It proposes CapSM, a capacity-based abstraction of the streaming multiprocessor (Streaming Multiprocessor, hereinafter SM), and uses the CapSM as the basic unit of resource management; a CapSM is equivalent to an SM in capacity, i.e., the maximum number of active thread blocks on one CapSM equals that of an original SM. When an offline or online application submits a GPU task through the resource management API, the supplied resource request information is first parsed from the API, and the resources finally allocated to the task are determined from the task's type, its resource request, and the current GPU resource state. If the current GPU resources are sufficient, the task is allocated the resources it requests; otherwise further processing depends on the task type: if the task is an offline task, the currently available GPU resources are allocated to it; if the task is an online task, part of the resources of the offline tasks running on the GPU is released through the resource reclamation mechanism to satisfy the online task's request. This guarantees the performance of online tasks while fully utilizing GPU resources.
Parts of the invention not elaborated here belong to techniques well known in the art.
The above describes only some embodiments of the invention, and the protection scope of the invention is not limited thereto. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall be covered by the protection scope of the invention.

Claims (10)

1. A fine-grained GPU resource management method for mixed workloads, characterized in that the mixed workload divides tasks into online tasks and offline tasks; when online and offline tasks share GPU resources, a capacity-based SM abstraction is used as the basic unit of resource management to manage, at fine granularity, how each task type uses GPU resources, supporting per-task resource quotas and online resource adjustment, and guaranteeing the quality of service of online tasks while GPU resources are shared, comprising the following steps:
(1) when a user submits a task to the GPU through the resource management application programming interface (Application Programming Interface, hereinafter API) (unless stated otherwise, "task" covers both online and offline tasks), the task's resource request information is set: for an offline task, the resource upper bound, i.e., the quota, is given; for an online task, the minimum resource amount, i.e., the reservation, is given;
(2) the task's submission information is parsed from the resource management API, including the kernel function, the number of task blocks, the task block size, and the task's resource request;
(3) from the task's kernel function and its task block size, the number of active thread blocks that one SM of the GPU can accommodate is computed;
(4) from the running state of the applications currently on the GPU, the remaining available resources on the GPU are computed;
(5) if the current GPU resource surplus is no less than the resource request obtained in step (2), step (6) is executed; otherwise step (8) is executed;
(6) the task's resource configuration is set to the task's resource request;
(7) from the task's resource configuration and the number of active thread blocks determined in step (3), the number of thread blocks to create when the task is submitted to the GPU and the number of task blocks assigned to each thread block are computed; then step (11) is executed;
(8) if the current task is an offline task, step (9) is executed; if it is an online task, step (10) is executed;
(9) the current GPU resource surplus is set as the task's resource configuration; then go to step (7);
(10) from the current GPU resource surplus and the task's resource request, the resource deficit is computed, and a resource release command is sent to the offline tasks running on the GPU, making them release resources equal to the deficit; then go to step (6);
(11) with the computed number of thread blocks, the task is submitted to the GPU, its threads are created, and it begins to run;
(12) if the task receives a resource release command while running on the GPU, step (13) is executed; otherwise step (14) is executed;
(13) upon receiving the release command while running on the GPU, the task releases the resources of the specified range; if task blocks in the released resources have not executed, the unexecuted task blocks are remapped onto the remaining resources and continue to execute;
(14) the task finishes execution and exits the GPU.
2. The fine-grained GPU resource management method for mixed workloads according to claim 1, characterized in that: the basic unit of resource management is a capacity-based SM abstraction, hereinafter CapSM, realized as follows:
(1-1) given a GPU, the capacity of each SM is set to 1 capacity unit; given a kernel task K, assume that one SM of the GPU can accommodate M active thread blocks of task K;
(1-2) according to M, each SM is abstracted as M small slices; each slice has capacity 1/M capacity unit and can accommodate exactly one thread block of task K;
(1-3) after all SMs of the GPU have been divided into slices in this way, any N slices whose total capacity equals that of a physical SM are considered to form one CapSM;
(1-4) for task K, any M slices form one CapSM.
3. The fine-grained GPU resource management method for mixed workloads according to claim 2, characterized in that the resource management unit CapSM has the following properties:
(1-1) the M slices that form a CapSM may come from the same SM or from several different SMs;
(1-2) each slice corresponds to one thread block, and a CapSM can be regarded as a set of thread blocks; in the implementation, managing CapSMs therefore reduces to managing the number of thread blocks;
(1-3) the capacity-based SM abstraction CapSM does not depend on a specific GPU architecture or GPU parallel programming language; the CapSM concept can readily be applied to other GPU architectures and GPU parallel programming languages.
4. The fine-grained GPU resource management method for mixed workloads according to claim 1, characterized in that: the task's original kernel function must be transformed so that the threads the task runs on the GPU become persistent threads; the transformation is as follows:
(1-1) a loop control structure is inserted into the original kernel function, with the original kernel function body as the loop body;
(1-2) the loop body iterates over the task blocks assigned to the persistent thread, executing them in turn, and sets the variable taskIdx to the task block number of the task block currently being executed;
(1-3) the variable blockIdx, which denotes the thread block index in the original kernel function body, is replaced by the variable taskIdx, which denotes the current task block.
5. The fine-grained GPU resource management method for mixed workloads according to claim 1, characterized in that: in step (1), when the user submits a task to the GPU through the resource management API, the API offers the following two submission modes:
(1-1) running a task under a resource quota, mainly for offline tasks, limits the amount of resources the offline task may use; submitting a task this way requires the resource quota amount quota, the kernel function to run, the task's number of task blocks TaskBlockNumber, and the task block size TaskBlockSize, i.e., the number of threads in each task block;
(1-2) running a task with a resource reservation, mainly for online tasks; submitting a task this way requires the amount of resources reservation to be reserved for the task, the kernel function to run, the task's number of task blocks TaskBlockNumber, and the task block size TaskBlockSize, i.e., the number of threads in each task block.
6. The fine-grained GPU resource management method for mixed workloads according to claim 1, characterized in that: in step (3), from the task's kernel function and its task block size, the number of active thread blocks that one SM of the GPU can accommodate is computed; this can be done with the API provided by the Compute Unified Device Architecture (Compute Unified Device Architecture, hereinafter CUDA) for obtaining the maximum number of active thread blocks per SM, which computes the maximum number of active thread blocks MaxActivePBlock_i that each SM or CapSM can accommodate.
7. The fine-grained GPU resource management method for mixed workloads according to claim 1, characterized in that: in step (7), from the task's resource configuration CapSMQuota_i and the number of active thread blocks MaxActivePBlock_i determined in step (3), the number of thread blocks PBlockNumber_i to create when the task is submitted to the GPU and the number of task blocks TaskBlocksPerPBlock_i assigned to each thread block are computed as follows:
(7-1) combining the task's resource configuration CapSMQuota_i with MaxActivePBlock_i, the task's thread block count is PBlockNumber_i = CapSMQuota_i * MaxActivePBlock_i;
(7-2) from the task's task block count TaskBlockNumber_i and thread block count PBlockNumber_i, the number of task blocks assigned to each thread block is TaskBlocksPerPBlock_i = ceil(TaskBlockNumber_i / PBlockNumber_i).
8. The fine-grained GPU resource management method for mixed workloads according to claim 1, characterized in that: in step (11), the task is submitted to the GPU, its threads are created, and it begins to run; each thread executes the following steps:
(11-1) compute the number CapSMId of the CapSM to which the current thread belongs, via the following sub-steps:
(11-1-1) compute the thread block number PBlockId of the current thread: PBlockId = gridDim.x * blockIdx.y + blockIdx.x, where gridDim.x, blockIdx.y, and blockIdx.x are built-in variables that CUDA provides to every thread and that a thread can use directly while running;
(11-1-2) from the maximum number of active thread blocks MaxActivePBlock_i accommodated in each SM or CapSM and the thread block number PBlockId, compute the owning CapSM number: CapSMId = floor(PBlockId / MaxActivePBlock_i);
(11-2) compute the range of task blocks processed by each persistent thread block, via the following sub-steps:
(11-2-1) compute the first task block assigned, StartTaskId: StartTaskId = PBlockId * TaskBlocksPerPBlock_i;
(11-2-2) compute the last task block assigned, StopTaskId: StopTaskId = StartTaskId + TaskBlocksPerPBlock_i;
(11-3) compute the number PBIdInCapSM of the current thread block within its CapSM: PBIdInCapSM = PBlockId % MaxActivePBlock_i;
(11-4) with the task block range obtained above, enter the loop control structure and execute the tasks in all assigned task blocks in turn.
9. The fine-grained GPU resource management method for mixed workloads according to claim 1, characterized in that: in step (10), a resource release command is sent to the offline tasks running on the GPU, making them release the amount of resources specified by the resource deficit, in the following two stages:
(10-1) on the CPU side, the value of the resource release flag evictCapSMNum_i is modified; evictCapSMNum_i can be synchronized between the CPU and the GPU, and its value denotes the number of CapSMs to release;
(10-2) the loop control structure of each persistent thread of the tasks running on the GPU checks the value of evictCapSMNum_i before each execution of the loop body; every thread whose CapSMId is less than evictCapSMNum_i exits, so that evictCapSMNum_i CapSMs' worth of resources are released.
10. The fine-grained GPU resource management method for mixed workloads according to claim 1, characterized in that: in step (13), if task blocks in the released resources have not executed, the unexecuted task blocks are remapped onto the remaining resources and continue to execute; a persistent thread block performs the remapping as follows:
(13-1) compute the number NumberPerCapSM_i of released CapSMs that each surviving CapSM is responsible for mapping: NumberPerCapSM_i = ceil(evictCapSMNum_i / (CapSMQuota_i - evictCapSMNum_i)),
where CapSMQuota_i is the GPU task's resource configuration, i.e., the number of CapSMs configured when the task was submitted;
(13-2) compute the range of released CapSMs that each CapSM is responsible for mapping, via the following two sub-steps:
(13-2-1) compute the first CapSM of the mapping:
CapSMRemapStart = (CapSMId - evictCapSMNum_i) * NumberPerCapSM_i;
(13-2-2) compute the last CapSM of the mapping:
CapSMRemapEnd = CapSMRemapStart + NumberPerCapSM_i;
(13-3) select each CapSM within the mapped range in turn, letting CapSMRemapCur denote the currently selected CapSM; if the task blocks in all CapSMs within the mapped range have finished executing, jump to step (14);
(13-4) set the variable PBlockId to the corresponding thread block number within the CapSMRemapCur currently being executed: PBlockId = CapSMRemapCur * MaxActivePBlock_i + PBIdInCapSM;
(13-5) using the current thread block number PBlockId, execute all unexecuted task blocks of PBlockId; if all unexecuted task blocks in the currently mapped CapSMRemapCur have finished, jump to step (13-3);
(13-6) go to step (13-5) and continue.
CN201710563834.7A 2017-07-12 2017-07-12 Fine-grained GPU resource management method for mixed load Expired - Fee Related CN107357661B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710563834.7A CN107357661B (en) 2017-07-12 2017-07-12 Fine-grained GPU resource management method for mixed load

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710563834.7A CN107357661B (en) 2017-07-12 2017-07-12 Fine-grained GPU resource management method for mixed load

Publications (2)

Publication Number Publication Date
CN107357661A 2017-11-17
CN107357661B CN107357661B (en) 2020-07-10

Family

ID=60292105

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710563834.7A Expired - Fee Related CN107357661B (en) 2017-07-12 2017-07-12 Fine-grained GPU resource management method for mixed load

Country Status (1)

Country Link
CN (1) CN107357661B (en)

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108710536A * 2018-04-02 2018-10-26 Shanghai Jiao Tong University Multi-level fine-grained virtualized GPU scheduling optimization method
CN109298936A * 2018-09-11 2019-02-01 Huawei Technologies Co., Ltd. Resource scheduling method and device
CN109412874A * 2018-12-21 2019-03-01 Tencent Technology (Shenzhen) Co., Ltd. Device resource configuration method, device, server and storage medium
CN109445565A * 2018-11-08 2019-03-08 Beihang University GPU quality-of-service guarantee method based on exclusive and reserved streaming multiprocessor cores
CN109840877A * 2017-11-24 2019-06-04 Huawei Technologies Co., Ltd. Graphics processor and resource scheduling method and device thereof
CN109936604A * 2017-12-18 2019-06-25 Beijing Tusen Weilai Technology Co., Ltd. Resource scheduling method, device and system
CN110289990A * 2019-05-29 2019-09-27 Tsinghua University GPU-based network function virtualization system, method and storage medium
CN110415162A * 2019-07-22 2019-11-05 Renmin University of China Adaptive graph partitioning method for heterogeneous fused processors in big data
CN110781007A * 2019-10-31 2020-02-11 Guangzhou Wangxing Information Technology Co., Ltd. Task processing method, device, server, client, system and storage medium
CN111597045A * 2020-05-15 2020-08-28 Shanghai Jiao Tong University Shared resource management method, system and server system for managing mixed deployment
CN111597034A * 2019-02-21 2020-08-28 Alibaba Group Holding Ltd. Processor resource scheduling method and device, terminal device and computer storage medium
CN111712793A * 2018-02-14 2020-09-25 Huawei Technologies Co., Ltd. Thread processing method and graphics processor
CN111736987A * 2020-05-29 2020-10-02 Shandong University Task scheduling method based on GPU spatial resource sharing
WO2021104083A1 * 2019-11-28 2021-06-03 ZTE Corporation GPU operating method, apparatus, device, and storage medium
WO2021128079A1 * 2019-12-25 2021-07-01 Alibaba Group Holding Ltd. Data processing method, image recognition method, processing server, system, and electronic device
CN113296921A * 2020-04-07 2021-08-24 Alibaba Group Holding Ltd. Cloud resource scheduling method, node, system and storage medium
CN113407333A * 2020-12-18 2021-09-17 Shanghai Jiao Tong University Task scheduling method, system, GPU and device for warp-level scheduling
CN113411230A * 2021-06-09 2021-09-17 Guangzhou Huya Technology Co., Ltd. Container-based bandwidth control method and device, distributed system and storage medium
CN113590317A * 2021-07-27 2021-11-02 Hangzhou Langhe Technology Co., Ltd. Scheduling method, device, medium and computing device for offline services
CN114035935A * 2021-10-13 2022-02-11 Shanghai Jiao Tong University High-throughput heterogeneous resource management method and device for multi-stage AI cloud services
CN114579284A * 2022-03-30 2022-06-03 Alibaba (China) Co., Ltd. Task scheduling method and device
CN116893854A * 2023-09-11 2023-10-17 Tencent Technology (Shenzhen) Co., Ltd. Method, device, equipment and storage medium for detecting instruction resource conflicts

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110169840A1 * 2006-12-31 2011-07-14 Lucid Information Technology, Ltd. Computing system employing a multi-GPU graphics processing and display subsystem supporting single-GPU non-parallel (multi-threading) and multi-GPU application-division parallel modes of graphics processing operation
CN102958166A * 2011-08-29 2013-03-06 Huawei Technologies Co., Ltd. Resource allocation method and resource management platform
CN103365726B * 2013-07-08 2016-05-25 Huazhong University of Science and Technology GPU-cluster-oriented resource management method and system
CN104243617A * 2014-10-14 2014-12-24 Institute of Information Engineering, Chinese Academy of Sciences Task scheduling method and system for mixed loads in heterogeneous clusters

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Alexey Tumanov et al.: "JamaisVu: Robust Scheduling with Auto-Estimated Job Runtimes", Parallel Data Laboratory *
Chung-Sheng Li et al.: "Disaggregated Architecture for at Scale Computing", in Proceedings of the 2nd International Workshop on Emerging Software as a Service and Analytics (ESaaSA 2015) *
Lyu Xiangwen et al.: "Research on Multi-GPU Resource Scheduling Mechanisms in Cloud Computing Environments", Journal of Chinese Computer Systems *
Chen Wenbin et al.: "Research on Multi-Granularity Partitioning and Scheduling of Stream Programs on Hybrid GPU/CPU Architectures", Computer Engineering & Science *

Cited By (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109840877B * 2017-11-24 2023-08-22 Huawei Technologies Co., Ltd. Graphics processor and resource scheduling method and device thereof
CN109840877A * 2017-11-24 2019-06-04 Huawei Technologies Co., Ltd. Graphics processor and resource scheduling method and device therefor
CN109936604A * 2017-12-18 2019-06-25 Beijing Tusen Weilai Technology Co., Ltd. Resource scheduling method, apparatus, and system
CN111712793B * 2018-02-14 2023-10-20 Huawei Technologies Co., Ltd. Thread processing method and graphics processor
CN111712793A * 2018-02-14 2020-09-25 Huawei Technologies Co., Ltd. Thread processing method and graphics processor
CN108710536B * 2018-04-02 2021-08-06 Shanghai Jiao Tong University Multilevel fine-grained virtualized GPU scheduling optimization method
CN108710536A * 2018-04-02 2018-10-26 Shanghai Jiao Tong University Multilevel fine-grained virtualized GPU scheduling optimization method
CN109298936A * 2018-09-11 2019-02-01 Huawei Technologies Co., Ltd. Resource scheduling method and device
CN109298936B * 2018-09-11 2021-05-18 Huawei Technologies Co., Ltd. Resource scheduling method and device
CN109445565A * 2018-11-08 2019-03-08 Beihang University GPU quality-of-service guarantee method based on exclusive use and reservation of streaming multiprocessor cores
CN109412874A * 2018-12-21 2019-03-01 Tencent Technology (Shenzhen) Co., Ltd. Device resource configuration method, apparatus, server, and storage medium
CN109412874B * 2018-12-21 2021-11-02 Tencent Technology (Shenzhen) Co., Ltd. Device resource configuration method, apparatus, server, and storage medium
CN111597034A * 2019-02-21 2020-08-28 Alibaba Group Holding Ltd. Processor resource scheduling method and device, terminal device, and computer storage medium
CN111597034B * 2019-02-21 2023-04-28 Alibaba Group Holding Ltd. Processor resource scheduling method and device, terminal device, and computer storage medium
CN110289990A * 2019-05-29 2019-09-27 Tsinghua University GPU-based network function virtualization system, method, and storage medium
CN110415162A * 2019-07-22 2019-11-05 Renmin University of China Adaptive graph partitioning method for heterogeneous fused processors in big data
CN110781007A * 2019-10-31 2020-02-11 Guangzhou Wangxing Information Technology Co., Ltd. Task processing method, device, server, client, system, and storage medium
CN110781007B * 2019-10-31 2023-12-26 Guangzhou Wangxing Information Technology Co., Ltd. Task processing method, device, server, client, system, and storage medium
WO2021104083A1 * 2019-11-28 2021-06-03 ZTE Corporation GPU operating method, apparatus, device, and storage medium
WO2021128079A1 * 2019-12-25 2021-07-01 Alibaba Group Holding Ltd. Data processing method, image recognition method, processing server, system, and electronic device
CN113296921A * 2020-04-07 2021-08-24 Alibaba Group Holding Ltd. Cloud resource scheduling method, node, system, and storage medium
CN111597045A * 2020-05-15 2020-08-28 Shanghai Jiao Tong University Shared resource management method and system for mixed deployments, and server system
CN111597045B * 2020-05-15 2023-04-07 Shanghai Jiao Tong University Shared resource management method and system for mixed deployments, and server system
CN111736987B * 2020-05-29 2023-08-04 Shandong University Task scheduling method based on GPU space resource sharing
CN111736987A * 2020-05-29 2020-10-02 Shandong University Task scheduling method based on GPU space resource sharing
CN113407333B * 2020-12-18 2023-05-26 Shanghai Jiao Tong University Warp-level task scheduling method, system, GPU, and device
CN113407333A * 2020-12-18 2021-09-17 Shanghai Jiao Tong University Warp-level task scheduling method, system, GPU, and device
CN113411230A * 2021-06-09 2021-09-17 Guangzhou Huya Technology Co., Ltd. Container-based bandwidth control method and device, distributed system, and storage medium
CN113590317A * 2021-07-27 2021-11-02 Hangzhou Langhe Technology Co., Ltd. Offline service scheduling method, apparatus, medium, and computing device
CN114035935A * 2021-10-13 2022-02-11 Shanghai Jiao Tong University High-throughput heterogeneous resource management method and device for multi-stage AI cloud services
CN114579284A * 2022-03-30 2022-06-03 Alibaba (China) Co., Ltd. Task scheduling method and device
CN116893854B * 2023-09-11 2023-11-14 Tencent Technology (Shenzhen) Co., Ltd. Instruction resource conflict detection method, apparatus, device, and storage medium
CN116893854A * 2023-09-11 2023-10-17 Tencent Technology (Shenzhen) Co., Ltd. Instruction resource conflict detection method, apparatus, device, and storage medium

Also Published As

Publication number Publication date
CN107357661B (en) 2020-07-10

Similar Documents

Publication Publication Date Title
CN107357661A Fine-grained GPU resource management method for mixed loads
CN104021040B Cloud computing correlated-task scheduling method and device under time constraints
CN108876702A Training method and device for accelerating a distributed deep neural network
CN104461467B Method for increasing the computation speed of SMP cluster systems through hybrid MPI/OpenMP parallelism
CN104331321B Cloud computing task scheduling method based on tabu search and load balancing
CN104731657B Resource scheduling method and system
CN105389206B Rapid virtual machine resource configuration method for cloud computing data centers
CN104619029B Baseband pool resource allocation method and device under a centralized cellular network architecture
WO2024021489A1 Task scheduling method and apparatus, and Kubernetes scheduler
CN105094751B Memory management method for parallel stream-data processing
CN105373426B Hadoop-based memory-aware real-time job scheduling method for the Internet of Vehicles
CN103401939A Load balancing method using a hybrid scheduling strategy
CN103731372A Resource provisioning method for service providers in a hybrid cloud environment
CN102968344A Migration scheduling method for multiple virtual machines
CN105872114A Video surveillance cloud platform resource scheduling method and device
CN108170517A Container allocation method, apparatus, server, and medium
CN113672391B Parallel computing task scheduling method and system based on Kubernetes
CN109522090A Resource scheduling method and device
TW202220469A Resource management system and resource management method
CN106998340B Load balancing method and device for board resources
CN116401055A Resource-efficiency-oriented serverless computing workflow orchestration method
CN102325054A Adaptive adjustment method for hierarchical management of distributed computing management platform clusters
CN111404818A Routing protocol optimization method for a general-purpose multi-core network processor
CN106775975A Process scheduling method and device
US20230161620A1 Pull mode and push mode combined resource management and job scheduling method and system, and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20210420

Address after: Floor 12 (1515-1516), Building 4, No. 128 South Fourth Ring Road, Fengtai District, Beijing 100160, China

Patentee after: Kaixi (Beijing) Information Technology Co., Ltd.

Address before: No. 37 Xueyuan Road, Haidian District, Beijing 100191

Patentee before: Beihang University

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20200710

Termination date: 20210712