GPU Quality-of-Service Guarantee Method Based on Exclusive Occupation and Reservation of Streaming Multiprocessors
Technical Field
The present invention relates to the fields of concurrent kernel execution, on-chip resource management, and thread block scheduling, and more particularly to a GPU quality-of-service (QoS) guarantee method based on exclusive occupation and reservation of streaming multiprocessors.
Background Art
In the field of high-performance computing, graphics processing units (GPUs) are increasingly used to accelerate general-purpose computation. A GPU achieves high computing capability by exploiting massive thread-level parallelism (TLP). Streaming multiprocessors (SMs) are responsible for executing the GPU's computing tasks; each SM contains many compute cores and resources such as registers, shared memory, and an L1 cache. The SMs share device memory through an interconnect network. When multiple applications share one GPU, the GPU supports concurrent kernels: the Kernel Management Unit (KMU) distributes kernels to the Kernel Distributor Unit (KDU), where they are executed in first-come-first-serve (FCFS) order. However, when multiple applications execute concurrently, the thread block scheduler waits until all thread blocks of the previous kernel have been scheduled before dispatching the next kernel, and it dispatches thread blocks uniformly across all SMs according to a round-robin policy.
As the number of applications grows explosively and multiple concurrent applications share a GPU, improving throughput and energy efficiency becomes particularly important. For multitasking on GPUs, academia and industry have proposed two major techniques: spatial multitasking and simultaneous multitasking. In spatial multitasking, the SMs of a GPU are partitioned into several disjoint subsets, and each subset is allocated to a different application running concurrently; simultaneous multitasking performs fine-grained sharing of SM resources, executing multiple applications simultaneously on a single SM. Current general-purpose GPUs can manage SM resources at the chip level to support spatial multitasking. In addition, when an application requires a quality-of-service (QoS) guarantee, enough resources must be allocated to meet its QoS requirement, which poses a greater challenge to the GPU architecture. To guarantee application QoS requirements while maximizing GPU throughput and energy efficiency, existing solutions fall broadly into the following two categories:
(1) Adapting the application execution model to the GPU architecture
Research in this direction changes the default execution pattern of applications to suit the characteristics of the GPU accelerator. For example, priority-based kernel scheduling always executes the high-priority kernel first when multiple kernels are distributed to the GPU. Alternatively, all tasks submitted to the GPU can be abstracted as a task queue managed on the CPU side, where the execution duration of each task is predicted and tasks are reordered to meet application QoS requirements. Another approach exploits a fine-grained SM resource model and uses techniques similar to persistent threads to realize resource reservation on SMs, limiting the resource occupation of non-latency-sensitive applications; if the currently reserved resources still cannot meet the QoS target, a dynamic resource adaptation module can be invoked to preempt resources occupied by currently executing tasks. Such methods generally optimize at kernel granularity and cannot handle long-running kernels well; moreover, the latency and energy cost of kernel preemption and redistribution can be large.
(2) QoS execution models at the GPU architecture level
Research in this direction is based on multitasking on the GPU and is broadly divided into spatial multitasking and simultaneous multitasking. The spatial multitasking strategy estimates the runtime performance of latency-sensitive applications and then predicts the number of SMs required by each application through a linear performance model. Simultaneous multitasking uses fine-grained QoS management: all applications are spread across all SMs, and different quotas are allocated to latency-sensitive and non-latency-sensitive applications, so that resources on a single SM are distributed at fine granularity. The disadvantages of simultaneous multitasking are that it does not support power gating, since all SMs on the GPU are always occupied, which raises energy consumption; and that when multiple kernels occupy the same SM, L1 cache conflicts degrade performance.
In summary, application execution models generally start from the software side, at the granularity of kernels or other GPU tasks, adapting them to the GPU architecture; QoS execution models generally start from the GPU architecture, adapting it to various applications. These two kinds of models are compatible. Two trends in GPU evolution are noteworthy: 1) the number of SMs is growing rapidly; the latest Pascal and Volta architectures have 56 SMs (Tesla P100) and 80 SMs (Tesla V100), respectively; 2) the sizes of per-SM resources such as the register file, shared memory, and L1 cache have remained largely unchanged. Therefore, considering the trend of continuously increasing SM counts in future GPU architectures, the present invention preempts and reserves at the granularity of whole SMs, which handles long-running kernels well and avoids frequent kernel preemption and redistribution; and since each SM is exclusively occupied by one kernel, L1 cache conflicts are avoided and power gating is supported to reduce energy consumption.
Summary of the Invention
The technical problem solved by the present invention is to overcome the deficiencies and defects of the prior art by providing a GPU QoS guarantee method based on exclusive occupation and reservation of streaming multiprocessors, which satisfies the QoS targets of multiple applications executing on a GPU, fully exploits the potential of concurrent GPU kernels, and adaptively maximizes the energy efficiency or throughput of the GPU.
The technical solution of the present invention, a GPU quality-of-service guarantee method based on exclusive occupation and reservation of streaming multiprocessors, comprises the following steps:
Step 1: Multiple GPU applications are launched on the CPU side; the user specifies which applications require a quality-of-service (QoS) guarantee and their QoS targets. The parallel sections of an application, i.e., GPU kernels, are loaded onto the GPU through runtime APIs (the relationship between a GPU application and GPU kernels is one-to-many: one GPU application corresponds to one or more GPU kernels). Each GPU kernel is classified, according to the identity of its owning application, as a latency-sensitive kernel or a non-latency-sensitive kernel;
Step 2: Each GPU kernel is identified by a software work queue (SWQ) ID and is pushed into the pending kernel pool located in the Grid Management Unit (GMU);
Step 3: GPU kernels with the same SWQ ID are mapped into the same hardware work queue (HWQ). For the GPU kernel at the head of each HWQ, an initial streaming multiprocessor (SM) allocation plan is determined according to the QoS target of its owning application. Under the allocation plan an SM is in one of three states: occupied by a latency-sensitive kernel, occupied by a non-latency-sensitive kernel, or powered off and reserved;
Step 4: After the SM allocation plan is determined, the thread blocks of the GPU kernel at the head of each HWQ are distributed to SMs by the thread block scheduler;
Step 5: GPU kernels execute in units of a time span T_epoch. After each T_epoch, the instructions per cycle (IPC) of each GPU kernel is collected; the IPC includes the total IPC of the GPU kernel from the start of execution to the end of the current T_epoch, denoted IPC_total, and the IPC of the current T_epoch alone, denoted IPC_epoch;
Step 6: After the IPC information is obtained, the SM allocation plan of each GPU kernel for the next T_epoch is determined by a decision algorithm. The SM allocation plan is related to the GPU kernel type, which is either compute-intensive (CI) or memory-intensive (MI);
Step 7: The SM allocation plan is applied until the current optimal SM allocation count for the non-latency-sensitive kernel, SM_optimal, is determined. When SM_optimal SMs are allocated, the performance and energy consumption of the non-latency-sensitive kernel reach a desirable balance, thereby maximizing throughput or energy efficiency;
Step 8: After the SM allocation plan is obtained, the decision algorithm determines the numbers of the SMs to be swapped in (Swap-in) and swapped out (Swap-out); that is, a swapped-out SM is no longer occupied by its former GPU kernel during the next T_epoch, while a swapped-in SM is occupied by the target GPU kernel during the next T_epoch;
Step 9: When an SM is marked for swap-in or swap-out, there are two cases: first, the SM is in the reserved state, in which case it is swapped in directly, i.e., occupied by the target GPU kernel; second, the SM is occupied by another GPU kernel, in which case all thread blocks on the SM must finish executing before the swap-out and swap-in operations are performed;
Step 10: After all SM swap-ins and swap-outs are completed, timing of the next T_epoch is started, to ensure the accuracy of the data collected next time;
Step 11: If each GPU application executes exactly one GPU kernel, steps 5 to 10 are repeated until all GPU kernels finish executing. If an application executes multiple GPU kernels, the next GPU kernel's initial SM allocation numbers are the same as the previous kernel's; steps 5 to 10 are repeated until the currently distributed GPU kernel finishes executing;
Step 12: Repeat step 11 until all GPU applications finish executing (the overall control loop is sketched below).
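As a hedged illustration of the control loop formed by steps 5 to 12, the following C++ sketch ties the per-epoch actions together. All names (KernelStats, collectIPC, decideAllocations, performSwaps) are hypothetical placeholders for the hardware behavior described above; the stub bodies only make the sketch self-contained.

```cpp
#include <vector>

// Hypothetical per-kernel bookkeeping for the epoch loop of steps 5-12.
struct KernelStats { double ipcTotal = 0, ipcEpoch = 0; bool finished = false; };

std::vector<KernelStats> collectIPC() { return {}; }          // step 5 (stub)
void decideAllocations(const std::vector<KernelStats>&) {}    // steps 6-7 (stub)
void performSwaps() {}                                        // steps 8-9 (stub)

bool allFinished(const std::vector<KernelStats>& ks) {
    for (const auto& k : ks) if (!k.finished) return false;
    return true;
}

int main() {
    while (true) {
        auto stats = collectIPC();      // end of one T_epoch: gather IPC_total, IPC_epoch
        if (allFinished(stats)) break;  // steps 11-12: repeat until all kernels complete
        decideAllocations(stats);       // decide the next epoch's SM allocation plan
        performSwaps();                 // drain and swap the marked SMs, then restart timing
    }
    return 0;
}
```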
In step 1, the user-specified applications requiring QoS guarantees and their QoS targets are realized as follows (a minimal sketch of these parameters follows):
(1) whether an application requires a service-quality guarantee is expressed by a flag isQoS, specifically:
(1-1) for a latency-sensitive application, isQoS = true for all of its GPU kernels;
(1-2) for a non-latency-sensitive application, isQoS = false for all of its GPU kernels;
(2) the QoS target is expressed as IPC_goal. Let IPC_isolated denote the IPC obtained when all SMs are allocated to a latency-sensitive application, and define α_k as the ratio of IPC_goal to IPC_isolated; the calculation is:
IPC_goal = IPC_isolated × α_k, where α_k ∈ (0,1).
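As a minimal illustration of how these per-application parameters might be represented, consider the following C++ sketch. The struct and helper are hypothetical, not part of the patent's defined interface.

```cpp
// Hypothetical representation of the step-1 QoS parameters (isQoS, alpha_k).
struct QoSParams {
    bool  isQoS;   // true for a latency-sensitive application
    float alpha;   // alpha_k in (0,1): guaranteed fraction of IPC_isolated
};

// IPC_goal = IPC_isolated * alpha_k (only meaningful for QoS applications).
float ipcGoal(float ipcIsolated, const QoSParams& p) {
    return p.isQoS ? ipcIsolated * p.alpha : 0.0f;
}
```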
In step 3, the initial streaming multiprocessor (SM) allocation is determined. Assuming that one latency-sensitive application and one non-latency-sensitive application execute concurrently, the calculation is as follows (a sketch follows this list):
(1) for the latency-sensitive kernel, SM_QoS = ⌈α_k × SM_total⌉;
(2) for the non-latency-sensitive kernel, SM_k = SM_total − SM_QoS;
where SM_total is the total number of SMs in the GPU, SM_k is the number of SMs occupied by GPU kernel k, SM_QoS is the number of SMs occupied by the latency-sensitive kernel, and α_k is the ratio of IPC_goal to IPC_isolated.
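The formula in case (1) did not survive translation and is reconstructed above; rounding up is an assumption, consistent with α_k ∈ (0,1) requiring at least one whole SM. The following C++ sketch implements that reconstruction under the same assumption.

```cpp
#include <cmath>

// Sketch of the initial SM split of step 3, assuming one latency-sensitive
// and one non-latency-sensitive kernel run concurrently. The ceiling is an
// assumption: a fractional alpha_k must map to a whole number of SMs.
struct InitialAlloc { int smQoS; int smOther; };

InitialAlloc initialAlloc(int smTotal, float alpha) {
    int smQoS = static_cast<int>(std::ceil(alpha * smTotal)); // SM_QoS
    return { smQoS, smTotal - smQoS };                        // SM_k = SM_total - SM_QoS
}
```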
In step 6, the decision algorithm determines the SM allocation plan of each GPU kernel for the next T_epoch as follows.
Define I_total as the total number of instructions executed by a GPU kernel from the start of execution to the end of the current T_epoch, and N_epoch as the total number of epochs from the start of execution to the end of the current T_epoch (so IPC_total = I_total / (N_epoch × T_epoch)). Swap-in means swapping in one SM; Swap-out means swapping out one SM. The calculation is as follows:
(1) for the latency-sensitive kernel:
(1-1) if I_total / [(N_epoch + 1) × T_epoch] > IPC_goal, mark it as Swap-out; that is, even if the kernel executed no instructions for one additional epoch, its average IPC would still exceed the goal, so one SM can be released;
(1-2) if IPC_total < IPC_goal, mark it as Swap-in;
(2) for the non-latency-sensitive kernel, the decision is made based on the latency-sensitive kernel's decision. There are three cases (a sketch of this logic follows the list):
(2-1) the latency-sensitive kernel requests to swap in one SM. A reserved SM is selected preferentially, avoiding the overhead of a swap-out operation; if no SM is reserved, the non-latency-sensitive kernel is marked as Swap-out, i.e., it swaps out one SM, which is then occupied by the latency-sensitive kernel;
(2-2) the latency-sensitive kernel requests to swap out one SM. The decision algorithm then determines whether the non-latency-sensitive kernel needs to swap in one SM; if so, the non-latency-sensitive kernel is marked as Swap-in, and the SM swapped out by the latency-sensitive kernel is occupied by the non-latency-sensitive kernel; otherwise, that SM is powered off and reserved;
(2-3) the latency-sensitive kernel requests no swap operation. The decision algorithm then determines whether the non-latency-sensitive kernel needs to swap out one SM; if so, the non-latency-sensitive kernel is marked as Swap-out, and the SM it swaps out is powered off and reserved.
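The following C++ sketch restates the above decision rules under the stated definitions. The enum and function names are illustrative; the inputs wantsIn and wantsOut stand for the non-QoS kernel's own need as derived from the SM_optimal search of step 7.

```cpp
// Sketch of the per-epoch decision of step 6; names are illustrative.
enum class Decision { None, SwapIn, SwapOut };

// Latency-sensitive kernel: compare achieved IPC against IPC_goal.
Decision decideQoS(long iTotal, int nEpoch, long tEpoch, double ipcGoal) {
    double ipcTotal = static_cast<double>(iTotal) / (static_cast<double>(nEpoch) * tEpoch);
    // Even with one extra idle epoch the average IPC would still exceed the goal:
    if (static_cast<double>(iTotal) / ((nEpoch + 1.0) * tEpoch) > ipcGoal)
        return Decision::SwapOut;                  // rule (1-1)
    if (ipcTotal < ipcGoal)
        return Decision::SwapIn;                   // rule (1-2)
    return Decision::None;
}

// Non-latency-sensitive kernel: react to the QoS kernel's decision.
Decision decideNonQoS(Decision qos, bool hasReservedSM, bool wantsIn, bool wantsOut) {
    switch (qos) {
        case Decision::SwapIn:                     // case (2-1)
            return hasReservedSM ? Decision::None : Decision::SwapOut;
        case Decision::SwapOut:                    // case (2-2); else the SM is reserved
            return wantsIn ? Decision::SwapIn : Decision::None;
        default:                                   // case (2-3)
            return wantsOut ? Decision::SwapOut : Decision::None;
    }
}
```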
In step 7, the allocation plan is applied until the current optimal SM allocation count of the non-latency-sensitive kernel, SM_optimal, is determined. A threshold Threshold, defined in hardware or software, is used to judge whether a swapped-out SM should enter the reserved state; Threshold applies only to the non-latency-sensitive kernel. Define IPC_last as the IPC_epoch of the previous T_epoch. To obtain the value of SM_optimal in as few epochs as possible, two flags Upper_k and Lower_k are also needed, where Upper_k indicates whether SM_optimal has reached its upper bound and Lower_k indicates whether SM_optimal has reached its lower bound. The specific rules are as follows (see the sketch after this list):
(1) whenever the non-latency-sensitive kernel performs a Swap-in or Swap-out operation, IPC_last is saved;
(2) if the non-latency-sensitive kernel swapped in one SM at the end of the previous T_epoch, the calculation is:
(2-1) if IPC_last ≥ IPC_epoch × (1 − Threshold), then Upper_k = true: SM_optimal has reached its upper bound;
(2-2) if IPC_last < IPC_epoch × (1 − Threshold), then Lower_k = true: SM_optimal has reached its lower bound, and SM_optimal = SM_k;
(3) if the non-latency-sensitive kernel swapped out one SM at the end of the previous T_epoch, the calculation is:
(3-1) if IPC_last > IPC_epoch × (1 + Threshold), then Lower_k = true: SM_optimal has reached its lower bound;
(3-2) if IPC_last ≤ IPC_epoch × (1 + Threshold), then Upper_k = true: SM_optimal has reached its upper bound, and SM_optimal = SM_k;
(4) if Lower_k = true and Upper_k = true, the value of SM_optimal is determined and will not change during the kernel's subsequent execution epochs.
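The following C++ sketch restates rules (1) through (4), assuming IPC_last holds the IPC_epoch measured just before the most recent swap; the struct and function names are illustrative.

```cpp
// Sketch of the SM_optimal search of step 7. Threshold is the
// hardware/software-defined constant from the rules above.
struct OptimalSearch {
    double ipcLast = 0.0;                 // IPC_epoch saved at the last swap
    bool   upper = false, lower = false;  // Upper_k, Lower_k
    int    smOptimal = 0;                 // valid once upper && lower
};

// Called at the end of an epoch in which the non-QoS kernel swapped one SM.
// swappedIn = true for Swap-in, false for Swap-out; smK = current SM count.
void updateOptimal(OptimalSearch& s, double ipcEpoch, bool swappedIn,
                   int smK, double threshold) {
    if (swappedIn) {
        if (s.ipcLast >= ipcEpoch * (1.0 - threshold)) s.upper = true;  // (2-1): extra SM did not help
        else { s.lower = true; s.smOptimal = smK; }                     // (2-2): extra SM helped
    } else {
        if (s.ipcLast > ipcEpoch * (1.0 + threshold)) s.lower = true;   // (3-1): losing the SM hurt
        else { s.upper = true; s.smOptimal = smK; }                     // (3-2): losing the SM was free
    }
    s.ipcLast = ipcEpoch;  // rule (1): save IPC_last for the next swap
}
```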
In step 9, while waiting for all thread blocks on an SM to finish before performing the swap-out and swap-in operations, the thread block scheduler no longer assigns thread blocks to that SM and simply waits for the existing thread blocks to finish. The significance of this is to complete the swap-out operation as quickly as possible without interrupting the kernel's computation.
In step 11, the initial SM allocation numbers of an application's next GPU kernel are the same as those of the previous one. The significance of this is: 1) different kernels of the same application are generally similar, so SM resources such as registers, shared memory, and the L1 cache can be fully utilized; 2) SM swap-out operations are eliminated, reducing the delay of kernel distribution.
In step 4, the number of HWQs is 32, so the GPU can concurrently execute at most 32 kernels.
In step 8, among the SMs occupied by a given GPU kernel, at most one needs to be swapped in or out; the others keep their original state. For a reserved SM there is no swap-out decision: it is either swapped in or left unchanged.
Compared with the prior art, the advantage of the present invention is that it fully exploits the performance potential of concurrent GPU kernels and, while effectively meeting QoS targets, balances the energy efficiency and throughput of the GPU to the greatest extent: for compute-intensive applications it fully raises the GPU's thread-level parallelism, and for memory-intensive applications it substantially reduces the GPU's energy consumption.
Brief Description of the Drawings
Fig. 1 is a diagram of the hardware architecture implementing the method proposed by the present invention;
Fig. 2 is a schematic diagram of the kernel distribution and thread block scheduling strategy proposed by the present invention;
Fig. 3 is a flow diagram of the dynamic SM allocation proposed by the present invention;
Fig. 4 is a schematic diagram of the strategy by which a non-latency-sensitive kernel determines its optimal SM allocation count, as proposed by the present invention.
Specific Embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is further elaborated below with reference to the accompanying drawings and examples. It should be understood that the specific examples described here are used only to explain the present invention, not to limit it. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict.
The hardware architecture of the present invention is shown in Fig. 1, in which SMQoS and the SMQoS interface are the modules newly added by the present invention on top of the original GPU. The specific implementation steps of the present invention are as follows:
Step 1: Multiple GPU applications are launched on the CPU side; the user specifies which applications require a quality-of-service (QoS) guarantee and their QoS targets. The parallel sections of an application, i.e., GPU kernels, are loaded onto the GPU through runtime APIs (the relationship between a GPU application and GPU kernels is one-to-many: one GPU application corresponds to one or more GPU kernels). Each GPU kernel is classified, according to the identity of its owning application, as latency-sensitive or non-latency-sensitive. The present invention requires the newly added API shown in Table 1: cudaSetQoS transmits the QoS parameters set on the CPU side to the GPU, so that the GPU supports the QoS requirements of applications (an illustrative usage sketch follows Table 1).
Table 1 lists the new API required by the method proposed by the present invention.
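Since only the name cudaSetQoS is reproduced here, the following host-side C++ sketch assumes a plausible signature (an isQoS flag plus the ratio α_k); the actual parameters are those defined in Table 1, and the stub body merely makes the sketch self-contained.

```cpp
// Hypothetical usage of the new cudaSetQoS API of Table 1.
// The signature and the stub implementation below are assumptions;
// a real implementation would forward the parameters to the GPU (step 1).
#include <cstdio>

enum cudaError_t { cudaSuccess = 0 };

cudaError_t cudaSetQoS(bool isQoS, float alpha) {   // assumed signature
    std::printf("QoS=%d alpha=%.2f\n", isQoS, alpha);
    return cudaSuccess;
}

int main() {
    // Latency-sensitive application: guarantee alpha_k = 0.5 of IPC_isolated.
    cudaSetQoS(true, 0.5f);
    // ... launch this application's kernels as usual ...
    return 0;
}
```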
Step 2: Each GPU kernel is identified by a software work queue (SWQ) ID and is pushed into the pending kernel pool located in the Grid Management Unit (GMU).
Step 3: GPU kernels with the same SWQ ID are mapped into the same hardware work queue (HWQ). For the GPU kernel at the head of each HWQ, an initial streaming multiprocessor (SM) allocation plan is determined according to the QoS target of its owning application. The kernel distribution and thread block scheduling strategy proposed by the present invention are shown in Fig. 2. Each SM is exclusively occupied by one kernel, and SM swap-in (Swap-in) and swap-out (Swap-out) operations are supported, so that the GPU supports the QoS demands of applications: for compute-intensive applications the GPU's throughput is maximized, while for memory-intensive applications SMs can be powered off to reduce energy consumption. In all the following steps, an SM is in one of three states under the allocation plan: occupied by a latency-sensitive kernel, occupied by a non-latency-sensitive kernel, or powered off and reserved.
Step 4: After the SM allocation plan is determined, the thread blocks of the GPU kernel at the head of each HWQ are distributed to SMs by the thread block scheduler.
Step 5: GPU kernels execute in units of a time span T_epoch. After each T_epoch, the instructions per cycle (IPC) of each GPU kernel is collected; the IPC includes the total IPC of the GPU kernel from the start of execution to the end of the current T_epoch, denoted IPC_total, and the IPC of the current T_epoch alone, denoted IPC_epoch.
Step 6: After the IPC information is obtained, the SM allocation plan of each GPU kernel for the next T_epoch is determined by the decision algorithm. The SM allocation plan is related to the GPU kernel type, compute-intensive (CI) or memory-intensive (MI). The dynamic SM allocation flow proposed by the present invention is shown in Fig. 3. The specific steps are as follows:
(1) The previous T_epoch finishes executing.
(2) The data information is fed back to SMQoS.
(3) SMQoS updates the kernel information according to its scheduling strategy and determines the SM allocation plan for the next T_epoch.
(4) SMQoS judges whether the latency-sensitive kernel needs to swap in one SM. If so, go to step (5); otherwise go to step (6).
(5) SMQoS judges whether any SM is reserved. If an SM is reserved, go to step (14); otherwise go to step (13).
(6) SMQoS judges whether the latency-sensitive kernel needs to swap out one SM. If so, go to step (7); otherwise go to step (12).
(7) The latency-sensitive kernel swaps out one SM.
(8) SMQoS judges whether the number of SMs occupied by the non-latency-sensitive kernel (SM_k) is less than the optimal allocation count (SM_optimal). If SM_k < SM_optimal, go to step (10); otherwise go to step (9).
(9) SMQoS judges whether SM_optimal has reached its upper bound (Upper_k). If Upper_k = false, go to step (10); otherwise go to step (11).
(10) The non-latency-sensitive kernel swaps in one SM.
(11) The SM swapped out by the latency-sensitive kernel is powered off, and its state becomes reserved.
(12) SMQoS judges whether SM_optimal has reached its lower bound (Lower_k). If Lower_k = false, go to step (13); otherwise go to step (15).
(13) The non-latency-sensitive kernel swaps out one SM.
(14) The latency-sensitive kernel swaps in one SM.
(15) SMQoS synchronizes the SM operations, waiting for all SM replacements to complete.
(16) The next T_epoch begins execution.
Note that when the QoS kernel requests to swap in one SM and no SM is reserved, the QoS kernel waits for the non-QoS kernel to swap out one SM. Step (15) avoids possible deadlock during the SM replacement process (a sketch of this epoch-boundary flow follows).
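The following C++ sketch condenses the Fig. 3 flow, steps (4) through (15), into one epoch-boundary routine; the function name and parameter set are illustrative, and the inputs come from the decision algorithm of step 6 and the SM_optimal search of step 7.

```cpp
// Sketch of the Fig. 3 flow at one epoch boundary (steps (4)-(15)).
// smQoS / smK / reservedSMs count the SMs in each of the three states.
void smqosEpochBoundary(bool qosSwapIn, bool qosSwapOut,
                        int& smQoS, int& smK, int& reservedSMs,
                        int smOptimal, bool upperK, bool lowerK) {
    if (qosSwapIn) {                                   // step (4)
        if (reservedSMs > 0) --reservedSMs;            // (5)->(14): take a reserved SM
        else --smK;                                    // (5)->(13)->(14): non-QoS yields one
        ++smQoS;                                       // (14): QoS kernel swaps in
    } else if (qosSwapOut) {                           // step (6)
        --smQoS;                                       // (7): QoS kernel swaps out
        if (smK < smOptimal || !upperK) ++smK;         // (8)-(10): non-QoS takes the SM
        else ++reservedSMs;                            // (11): power off and reserve
    } else if (!lowerK) {                              // step (12)
        --smK; ++reservedSMs;                          // (13): shrink toward SM_optimal
    }
    // (15): synchronize; all swaps complete before the next T_epoch starts.
}
```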
Step 7: The SM allocation plan is applied until the current optimal SM allocation count for the non-latency-sensitive kernel, SM_optimal, is determined. When SM_optimal SMs are allocated, the performance and energy consumption of the non-latency-sensitive kernel reach a desirable balance, thereby maximizing throughput or energy efficiency. The strategy proposed by the present invention for a non-latency-sensitive application to determine its optimal SM allocation count (SM_optimal) is shown in Fig. 4. The specific steps are as follows:
(1) whenever the non-latency-sensitive kernel performs a Swap-in or Swap-out operation, IPC_last is saved.
(2) if the non-latency-sensitive kernel swapped in one SM at the end of the previous T_epoch, the calculation is:
(2-1) if IPC_last ≥ IPC_epoch × (1 − Threshold), then Upper_k = true: SM_optimal has reached its upper bound;
(2-2) if IPC_last < IPC_epoch × (1 − Threshold), then Lower_k = true: SM_optimal has reached its lower bound, and SM_optimal = SM_k.
(3) if the non-latency-sensitive kernel swapped out one SM at the end of the previous T_epoch, the calculation is:
(3-1) if IPC_last > IPC_epoch × (1 + Threshold), then Lower_k = true: SM_optimal has reached its lower bound;
(3-2) if IPC_last ≤ IPC_epoch × (1 + Threshold), then Upper_k = true: SM_optimal has reached its upper bound, and SM_optimal = SM_k.
(4) if Lower_k = true and Upper_k = true, the value of SM_optimal is determined and will not change during the kernel's subsequent execution epochs.
Step 8: After the SM allocation plan is obtained, the decision algorithm determines the numbers of the SMs to be swapped in (Swap-in) and swapped out (Swap-out); that is, a swapped-out SM is no longer occupied by its former GPU kernel during the next T_epoch, while a swapped-in SM is occupied by the target GPU kernel during the next T_epoch.
Step 9: When an SM is marked for swap-in or swap-out, there are two cases: first, the SM is in the reserved state, in which case it is swapped in directly, i.e., occupied by the target GPU kernel; second, the SM is occupied by another GPU kernel, in which case all thread blocks on the SM must finish executing before the swap-out and swap-in operations are performed.
Step 10: After all SM swap-ins and swap-outs are completed, timing of the next T_epoch is started, to ensure the accuracy of the data collected next time.
Step 11: If each GPU application executes exactly one GPU kernel, steps 5 to 10 are repeated until all GPU kernels finish executing. If an application executes multiple GPU kernels, the next GPU kernel's initial SM allocation numbers are the same as the previous kernel's; steps 5 to 10 are repeated until the currently distributed GPU kernel finishes executing.
Step 12: Repeat step 11 until all GPU applications finish executing.