GPU Quality-of-Service Guarantee Method Based on Exclusive Occupation and Reservation of Streaming Multiprocessors
Technical Field
The present invention relates to the fields of concurrent kernel execution, on-chip resource management, and thread block scheduling, and more particularly to a GPU quality-of-service (QoS) guarantee method based on exclusive occupation and reservation of streaming multiprocessors.
Background Art
In the field of high-performance computing, graphics processing units (GPUs) are increasingly used to accelerate general-purpose computation. A GPU achieves high computing capability by exploiting massive thread-level parallelism (TLP). Streaming multiprocessors (SMs) are responsible for executing the GPU's computing tasks; each SM contains many compute cores and resources such as registers, shared memory, and an L1 cache. The SMs share device memory through an interconnect network. When multiple applications share one GPU, the GPU supports concurrent kernels: the Kernel Management Unit (KMU) distributes kernels to the Kernel Distributor Unit (KDU), where they are executed in first-come-first-serve (FCFS) order. However, when multiple applications execute concurrently, the thread block scheduler waits until all thread blocks of the previous kernel have been scheduled before dispatching the next kernel, and it dispatches thread blocks uniformly across all SMs according to a round-robin policy.
As the number of applications grows explosively and multiple concurrent applications share a GPU, improving throughput and energy efficiency becomes particularly important. For multitasking on GPUs, academia and industry have proposed two major techniques: spatial multitasking and simultaneous multitasking. In spatial multitasking, the SMs of a GPU are partitioned into several disjoint subsets, and each subset is allocated to a different application running concurrently; simultaneous multitasking performs fine-grained sharing of SM resources, executing multiple applications simultaneously on a single SM. Current general-purpose GPUs can manage SM resources at the chip level to support spatial multitasking. In addition, when an application requires a quality-of-service (QoS) guarantee, enough resources must be allocated to meet its QoS requirement, which poses a greater challenge to the GPU architecture. To guarantee application QoS requirements while maximizing GPU throughput and energy efficiency, existing solutions fall broadly into the following two categories:
(1) Adapting the application execution model to the GPU architecture
Research in this direction changes the default execution pattern of applications to suit the characteristics of the GPU accelerator. For example, priority-based kernel scheduling always executes the high-priority kernel first when multiple kernels are distributed to the GPU. Alternatively, all tasks submitted to the GPU can be abstracted as a task queue managed on the CPU side, where the execution duration of each task is predicted and tasks are reordered to meet application QoS requirements. Another approach exploits a fine-grained SM resource model and uses techniques similar to persistent threads to realize resource reservation on SMs, limiting the resource occupation of non-latency-sensitive applications; if the currently reserved resources still cannot meet the QoS target, a dynamic resource adaptation module can be invoked to preempt resources occupied by currently executing tasks. Such methods generally optimize at kernel granularity and cannot handle long-running kernels well; moreover, the latency and energy cost of kernel preemption and redistribution can be large.
(2) QoS execution models at the GPU architecture level
Research in this direction is based on multitasking on the GPU and is broadly divided into spatial multitasking and simultaneous multitasking. The spatial multitasking strategy estimates the runtime performance of latency-sensitive applications and then predicts the number of SMs required by each application through a linear performance model. Simultaneous multitasking uses fine-grained QoS management: all applications are spread across all SMs, and different quotas are allocated to latency-sensitive and non-latency-sensitive applications, so that resources on a single SM are distributed at fine granularity. The disadvantages of simultaneous multitasking are that it does not support power gating, since all SMs on the GPU are always occupied, which raises energy consumption; and that when multiple kernels occupy the same SM, L1 cache conflicts degrade performance.
In summary, application execution models generally start from the software side, at the granularity of kernels or other GPU tasks, adapting them to the GPU architecture; QoS execution models generally start from the GPU architecture, adapting it to various applications. These two kinds of models are compatible. Two trends in GPU evolution are noteworthy: 1) the number of SMs is growing rapidly; the latest Pascal and Volta architectures have 56 SMs (Tesla P100) and 80 SMs (Tesla V100), respectively; 2) the sizes of per-SM resources such as the register file, shared memory, and L1 cache have remained largely unchanged. Therefore, considering the trend of continuously increasing SM counts in future GPU architectures, the present invention preempts and reserves at the granularity of whole SMs, which handles long-running kernels well and avoids frequent kernel preemption and redistribution; and since each SM is exclusively occupied by one kernel, L1 cache conflicts are avoided and power gating is supported to reduce energy consumption.
Summary of the Invention
The technical problem solved by the present invention is to overcome the deficiencies and defects of the prior art by providing a GPU QoS guarantee method based on exclusive occupation and reservation of streaming multiprocessors, which satisfies the QoS targets of multiple applications executing on a GPU, fully exploits the potential of concurrent GPU kernels, and adaptively maximizes the energy efficiency or throughput of the GPU.
The technical solution of the present invention, a GPU quality-of-service guarantee method based on exclusive occupation and reservation of streaming multiprocessors, comprises the following steps:
Step 1: Multiple GPU applications are launched on the CPU side; the user specifies which applications require a quality-of-service (QoS) guarantee and their QoS targets. The parallel sections of an application, i.e., GPU kernels, are loaded onto the GPU through runtime APIs (the relationship between a GPU application and GPU kernels is one-to-many: one GPU application corresponds to one or more GPU kernels). Each GPU kernel is classified, according to the identity of its owning application, as a latency-sensitive kernel or a non-latency-sensitive kernel;
Step 2: Each GPU kernel is identified by a software work queue (SWQ) ID and is pushed into the pending kernel pool located in the Grid Management Unit (GMU);
Step 3: GPU kernels with the same SWQ ID are mapped into the same hardware work queue (HWQ). For the GPU kernel at the head of each HWQ, an initial streaming multiprocessor (SM) allocation plan is determined according to the QoS target of its owning application. Under the allocation plan an SM is in one of three states: occupied by a latency-sensitive kernel, occupied by a non-latency-sensitive kernel, or powered off and reserved;
Step 4: After the SM allocation plan is determined, the thread blocks of the GPU kernel at the head of each HWQ are distributed to SMs by the thread block scheduler;
Step 5: GPU kernels execute in units of a time span T_epoch. After each T_epoch, the instructions per cycle (IPC) of each GPU kernel is collected; the IPC includes the total IPC of the GPU kernel from the start of execution to the end of the current T_epoch, denoted IPC_total, and the IPC of the current T_epoch alone, denoted IPC_epoch;
Step 6: After the IPC information is obtained, the SM allocation plan of each GPU kernel for the next T_epoch is determined by a decision algorithm. The SM allocation plan is related to the GPU kernel type, which is either compute-intensive (CI) or memory-intensive (MI);
Step 7: The SM allocation plan is applied until the current optimal SM allocation count for the non-latency-sensitive kernel, SM_optimal, is determined. When SM_optimal SMs are allocated, the performance and energy consumption of the non-latency-sensitive kernel reach a desirable balance, thereby maximizing throughput or energy efficiency;
Step 8: After the SM allocation plan is obtained, the decision algorithm determines the numbers of the SMs to be swapped in (Swap-in) and swapped out (Swap-out); that is, a swapped-out SM is no longer occupied by its former GPU kernel during the next T_epoch, while a swapped-in SM is occupied by the target GPU kernel during the next T_epoch;
Step 9: When an SM is marked for swap-in or swap-out, there are two cases: first, the SM is in the reserved state, in which case it is swapped in directly, i.e., occupied by the target GPU kernel; second, the SM is occupied by another GPU kernel, in which case all thread blocks on the SM must finish executing before the swap-out and swap-in operations are performed;
Step 10: After all SM swap-ins and swap-outs are completed, timing of the next T_epoch is started, to ensure the accuracy of the data collected next time;
Step 11: If each GPU application executes exactly one GPU kernel, steps 5 to 10 are repeated until all GPU kernels finish executing. If an application executes multiple GPU kernels, the next GPU kernel's initial SM allocation numbers are the same as the previous kernel's; steps 5 to 10 are repeated until the currently distributed GPU kernel finishes executing;
Step 12: Repeat step 11 until all GPU applications finish executing (the overall control loop is sketched below).
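As a hedged illustration of the control loop formed by steps 5 to 12, the following C++ sketch ties the per-epoch actions together. All names (KernelStats, collectIPC, decideAllocations, performSwaps) are hypothetical placeholders for the hardware behavior described above; the stub bodies only make the sketch self-contained.

```cpp
#include <vector>

// Hypothetical per-kernel bookkeeping for the epoch loop of steps 5-12.
struct KernelStats { double ipcTotal = 0, ipcEpoch = 0; bool finished = false; };

std::vector<KernelStats> collectIPC() { return {}; }          // step 5 (stub)
void decideAllocations(const std::vector<KernelStats>&) {}    // steps 6-7 (stub)
void performSwaps() {}                                        // steps 8-9 (stub)

bool allFinished(const std::vector<KernelStats>& ks) {
    for (const auto& k : ks) if (!k.finished) return false;
    return true;
}

int main() {
    while (true) {
        auto stats = collectIPC();      // end of one T_epoch: gather IPC_total, IPC_epoch
        if (allFinished(stats)) break;  // steps 11-12: repeat until all kernels complete
        decideAllocations(stats);       // decide the next epoch's SM allocation plan
        performSwaps();                 // drain and swap the marked SMs, then restart timing
    }
    return 0;
}
```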
In step 1, the user-specified applications requiring QoS guarantees and their QoS targets are realized as follows (a minimal sketch of these parameters follows):
(1) whether an application requires a service-quality guarantee is expressed by a flag isQoS, specifically:
(1-1) for a latency-sensitive application, isQoS = true for all of its GPU kernels;
(1-2) for a non-latency-sensitive application, isQoS = false for all of its GPU kernels;
(2) the QoS target is expressed as IPC_goal. Let IPC_isolated denote the IPC obtained when all SMs are allocated to a latency-sensitive application, and define α_k as the ratio of IPC_goal to IPC_isolated; the calculation is:
IPC_goal = IPC_isolated × α_k, where α_k ∈ (0,1).
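As a minimal illustration of how these per-application parameters might be represented, consider the following C++ sketch. The struct and helper are hypothetical, not part of the patent's defined interface.

```cpp
// Hypothetical representation of the step-1 QoS parameters (isQoS, alpha_k).
struct QoSParams {
    bool  isQoS;   // true for a latency-sensitive application
    float alpha;   // alpha_k in (0,1): guaranteed fraction of IPC_isolated
};

// IPC_goal = IPC_isolated * alpha_k (only meaningful for QoS applications).
float ipcGoal(float ipcIsolated, const QoSParams& p) {
    return p.isQoS ? ipcIsolated * p.alpha : 0.0f;
}
```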
In step 3, the initial streaming multiprocessor (SM) allocation is determined. Assuming that one latency-sensitive application and one non-latency-sensitive application execute concurrently, the calculation is as follows (a sketch follows this list):
(1) for the latency-sensitive kernel, SM_QoS = ⌈α_k × SM_total⌉;
(2) for the non-latency-sensitive kernel, SM_k = SM_total − SM_QoS;
where SM_total is the total number of SMs in the GPU, SM_k is the number of SMs occupied by GPU kernel k, SM_QoS is the number of SMs occupied by the latency-sensitive kernel, and α_k is the ratio of IPC_goal to IPC_isolated.
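The formula in case (1) did not survive translation and is reconstructed above; rounding up is an assumption, consistent with α_k ∈ (0,1) requiring at least one whole SM. The following C++ sketch implements that reconstruction under the same assumption.

```cpp
#include <cmath>

// Sketch of the initial SM split of step 3, assuming one latency-sensitive
// and one non-latency-sensitive kernel run concurrently. The ceiling is an
// assumption: a fractional alpha_k must map to a whole number of SMs.
struct InitialAlloc { int smQoS; int smOther; };

InitialAlloc initialAlloc(int smTotal, float alpha) {
    int smQoS = static_cast<int>(std::ceil(alpha * smTotal)); // SM_QoS
    return { smQoS, smTotal - smQoS };                        // SM_k = SM_total - SM_QoS
}
```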
In step 6, the decision algorithm determines the SM allocation plan of each GPU kernel for the next T_epoch as follows.
Define I_total as the total number of instructions executed by a GPU kernel from the start of execution to the end of the current T_epoch, and N_epoch as the total number of epochs from the start of execution to the end of the current T_epoch (so IPC_total = I_total / (N_epoch × T_epoch)). Swap-in means swapping in one SM; Swap-out means swapping out one SM. The calculation is as follows:
(1) for the latency-sensitive kernel:
(1-1) if I_total / [(N_epoch + 1) × T_epoch] > IPC_goal, mark it as Swap-out; that is, even if the kernel executed no instructions for one additional epoch, its average IPC would still exceed the goal, so one SM can be released;
(1-2) if IPC_total < IPC_goal, mark it as Swap-in;
(2) for the non-latency-sensitive kernel, the decision is made based on the latency-sensitive kernel's decision. There are three cases (a sketch of this logic follows the list):
(2-1) the latency-sensitive kernel requests to swap in one SM. A reserved SM is selected preferentially, avoiding the overhead of a swap-out operation; if no SM is reserved, the non-latency-sensitive kernel is marked as Swap-out, i.e., it swaps out one SM, which is then occupied by the latency-sensitive kernel;
(2-2) the latency-sensitive kernel requests to swap out one SM. The decision algorithm then determines whether the non-latency-sensitive kernel needs to swap in one SM; if so, the non-latency-sensitive kernel is marked as Swap-in, and the SM swapped out by the latency-sensitive kernel is occupied by the non-latency-sensitive kernel; otherwise, that SM is powered off and reserved;
(2-3) the latency-sensitive kernel requests no swap operation. The decision algorithm then determines whether the non-latency-sensitive kernel needs to swap out one SM; if so, the non-latency-sensitive kernel is marked as Swap-out, and the SM it swaps out is powered off and reserved.
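The following C++ sketch restates the above decision rules under the stated definitions. The enum and function names are illustrative; the inputs wantsIn and wantsOut stand for the non-QoS kernel's own need as derived from the SM_optimal search of step 7.

```cpp
// Sketch of the per-epoch decision of step 6; names are illustrative.
enum class Decision { None, SwapIn, SwapOut };

// Latency-sensitive kernel: compare achieved IPC against IPC_goal.
Decision decideQoS(long iTotal, int nEpoch, long tEpoch, double ipcGoal) {
    double ipcTotal = static_cast<double>(iTotal) / (static_cast<double>(nEpoch) * tEpoch);
    // Even with one extra idle epoch the average IPC would still exceed the goal:
    if (static_cast<double>(iTotal) / ((nEpoch + 1.0) * tEpoch) > ipcGoal)
        return Decision::SwapOut;                  // rule (1-1)
    if (ipcTotal < ipcGoal)
        return Decision::SwapIn;                   // rule (1-2)
    return Decision::None;
}

// Non-latency-sensitive kernel: react to the QoS kernel's decision.
Decision decideNonQoS(Decision qos, bool hasReservedSM, bool wantsIn, bool wantsOut) {
    switch (qos) {
        case Decision::SwapIn:                     // case (2-1)
            return hasReservedSM ? Decision::None : Decision::SwapOut;
        case Decision::SwapOut:                    // case (2-2); else the SM is reserved
            return wantsIn ? Decision::SwapIn : Decision::None;
        default:                                   // case (2-3)
            return wantsOut ? Decision::SwapOut : Decision::None;
    }
}
```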
In step 7, the allocation plan is applied until the current optimal SM allocation count of the non-latency-sensitive kernel, SM_optimal, is determined. A threshold Threshold, defined in hardware or software, is used to judge whether a swapped-out SM should enter the reserved state; Threshold applies only to the non-latency-sensitive kernel. Define IPC_last as the IPC_epoch of the previous T_epoch. To obtain the value of SM_optimal in as few epochs as possible, two flags Upper_k and Lower_k are also needed, where Upper_k indicates whether SM_optimal has reached its upper bound and Lower_k indicates whether SM_optimal has reached its lower bound. The specific rules are as follows (see the sketch after this list):
(1) whenever the non-latency-sensitive kernel performs a Swap-in or Swap-out operation, IPC_last is saved;
(2) if the non-latency-sensitive kernel swapped in one SM at the end of the previous T_epoch, the calculation is:
(2-1) if IPC_last ≥ IPC_epoch × (1 − Threshold), then Upper_k = true: SM_optimal has reached its upper bound;
(2-2) if IPC_last < IPC_epoch × (1 − Threshold), then Lower_k = true: SM_optimal has reached its lower bound, and SM_optimal = SM_k;
(3) if the non-latency-sensitive kernel swapped out one SM at the end of the previous T_epoch, the calculation is:
(3-1) if IPC_last > IPC_epoch × (1 + Threshold), then Lower_k = true: SM_optimal has reached its lower bound;
(3-2) if IPC_last ≤ IPC_epoch × (1 + Threshold), then Upper_k = true: SM_optimal has reached its upper bound, and SM_optimal = SM_k;
(4) if Lower_k = true and Upper_k = true, the value of SM_optimal is determined and will not change during the kernel's subsequent execution epochs.
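The following C++ sketch restates rules (1) through (4), assuming IPC_last holds the IPC_epoch measured just before the most recent swap; the struct and function names are illustrative.

```cpp
// Sketch of the SM_optimal search of step 7. Threshold is the
// hardware/software-defined constant from the rules above.
struct OptimalSearch {
    double ipcLast = 0.0;                 // IPC_epoch saved at the last swap
    bool   upper = false, lower = false;  // Upper_k, Lower_k
    int    smOptimal = 0;                 // valid once upper && lower
};

// Called at the end of an epoch in which the non-QoS kernel swapped one SM.
// swappedIn = true for Swap-in, false for Swap-out; smK = current SM count.
void updateOptimal(OptimalSearch& s, double ipcEpoch, bool swappedIn,
                   int smK, double threshold) {
    if (swappedIn) {
        if (s.ipcLast >= ipcEpoch * (1.0 - threshold)) s.upper = true;  // (2-1): extra SM did not help
        else { s.lower = true; s.smOptimal = smK; }                     // (2-2): extra SM helped
    } else {
        if (s.ipcLast > ipcEpoch * (1.0 + threshold)) s.lower = true;   // (3-1): losing the SM hurt
        else { s.upper = true; s.smOptimal = smK; }                     // (3-2): losing the SM was free
    }
    s.ipcLast = ipcEpoch;  // rule (1): save IPC_last for the next swap
}
```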
In step 9, while waiting for all thread blocks on an SM to finish before performing the swap-out and swap-in operations, the thread block scheduler no longer assigns thread blocks to that SM and simply waits for the existing thread blocks to finish. The significance of this is to complete the swap-out operation as quickly as possible without interrupting the kernel's computation.
In step 11, the initial SM allocation numbers of an application's next GPU kernel are the same as those of the previous one. The significance of this is: 1) different kernels of the same application are generally similar, so SM resources such as registers, shared memory, and the L1 cache can be fully utilized; 2) SM swap-out operations are eliminated, reducing the delay of kernel distribution.
In step 4, the number of HWQs is 32, so the GPU can concurrently execute at most 32 kernels.
In step 8, among the SMs occupied by a given GPU kernel, at most one needs to be swapped in or out; the others keep their original state. For a reserved SM there is no swap-out decision: it is either swapped in or left unchanged.
Compared with the prior art, the advantage of the present invention is that it fully exploits the performance potential of concurrent GPU kernels and, while effectively meeting QoS targets, balances the energy efficiency and throughput of the GPU to the greatest extent: for compute-intensive applications it fully raises the GPU's thread-level parallelism, and for memory-intensive applications it substantially reduces the GPU's energy consumption.
Brief Description of the Drawings
Fig. 1 is a diagram of the hardware architecture implementing the method proposed by the present invention;
Fig. 2 is a schematic diagram of the kernel distribution and thread block scheduling strategy proposed by the present invention;
Fig. 3 is a flow diagram of the dynamic SM allocation proposed by the present invention;
Fig. 4 is a schematic diagram of the strategy by which a non-latency-sensitive kernel determines its optimal SM allocation count, as proposed by the present invention.
Specific Embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is further elaborated below with reference to the accompanying drawings and examples. It should be understood that the specific examples described here are used only to explain the present invention, not to limit it. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict.
The hardware architecture of the present invention is shown in Fig. 1, in which SMQoS and the SMQoS interface are the modules newly added by the present invention on top of the original GPU. The specific implementation steps of the present invention are as follows:
Step 1: Multiple GPU applications are launched on the CPU side; the user specifies which applications require a quality-of-service (QoS) guarantee and their QoS targets. The parallel sections of an application, i.e., GPU kernels, are loaded onto the GPU through runtime APIs (the relationship between a GPU application and GPU kernels is one-to-many: one GPU application corresponds to one or more GPU kernels). Each GPU kernel is classified, according to the identity of its owning application, as latency-sensitive or non-latency-sensitive. The present invention requires the newly added API shown in Table 1: cudaSetQoS transmits the QoS parameters set on the CPU side to the GPU, so that the GPU supports the QoS requirements of applications (an illustrative usage sketch follows Table 1).
Table 1 lists the new API required by the method proposed by the present invention.
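Since only the name cudaSetQoS is reproduced here, the following host-side C++ sketch assumes a plausible signature (an isQoS flag plus the ratio α_k); the actual parameters are those defined in Table 1, and the stub body merely makes the sketch self-contained.

```cpp
// Hypothetical usage of the new cudaSetQoS API of Table 1.
// The signature and the stub implementation below are assumptions;
// a real implementation would forward the parameters to the GPU (step 1).
#include <cstdio>

enum cudaError_t { cudaSuccess = 0 };

cudaError_t cudaSetQoS(bool isQoS, float alpha) {   // assumed signature
    std::printf("QoS=%d alpha=%.2f\n", isQoS, alpha);
    return cudaSuccess;
}

int main() {
    // Latency-sensitive application: guarantee alpha_k = 0.5 of IPC_isolated.
    cudaSetQoS(true, 0.5f);
    // ... launch this application's kernels as usual ...
    return 0;
}
```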
Step 2: Each GPU kernel is identified by a software work queue (SWQ) ID and is pushed into the pending kernel pool located in the Grid Management Unit (GMU).
Step 3: GPU kernels with the same SWQ ID are mapped into the same hardware work queue (HWQ). For the GPU kernel at the head of each HWQ, an initial streaming multiprocessor (SM) allocation plan is determined according to the QoS target of its owning application. The kernel distribution and thread block scheduling strategy proposed by the present invention are shown in Fig. 2. Each SM is exclusively occupied by one kernel, and SM swap-in (Swap-in) and swap-out (Swap-out) operations are supported, so that the GPU supports the QoS demands of applications: for compute-intensive applications the GPU's throughput is maximized, while for memory-intensive applications SMs can be powered off to reduce energy consumption. In all the following steps, an SM is in one of three states under the allocation plan: occupied by a latency-sensitive kernel, occupied by a non-latency-sensitive kernel, or powered off and reserved.
Step 4: After the SM allocation plan is determined, the thread blocks of the GPU kernel at the head of each HWQ are distributed to SMs by the thread block scheduler.
Step 5: GPU kernels execute in units of a time span T_epoch. After each T_epoch, the instructions per cycle (IPC) of each GPU kernel is collected; the IPC includes the total IPC of the GPU kernel from the start of execution to the end of the current T_epoch, denoted IPC_total, and the IPC of the current T_epoch alone, denoted IPC_epoch.
Step 6: After the IPC information is obtained, the SM allocation plan of each GPU kernel for the next T_epoch is determined by the decision algorithm. The SM allocation plan is related to the GPU kernel type, compute-intensive (CI) or memory-intensive (MI). The dynamic SM allocation flow proposed by the present invention is shown in Fig. 3. The specific steps are as follows:
(1) The previous T_epoch finishes executing.
(2) The data information is fed back to SMQoS.
(3) SMQoS updates the kernel information according to its scheduling strategy and determines the SM allocation plan for the next T_epoch.
(4) SMQoS judges whether the latency-sensitive kernel needs to swap in one SM. If so, go to step (5); otherwise go to step (6).
(5) SMQoS judges whether any SM is reserved. If an SM is reserved, go to step (14); otherwise go to step (13).
(6) SMQoS judges whether the latency-sensitive kernel needs to swap out one SM. If so, go to step (7); otherwise go to step (12).
(7) The latency-sensitive kernel swaps out one SM.
(8) SMQoS judges whether the number of SMs occupied by the non-latency-sensitive kernel (SM_k) is less than the optimal allocation count (SM_optimal). If SM_k < SM_optimal, go to step (10); otherwise go to step (9).
(9) SMQoS judges whether SM_optimal has reached its upper bound (Upper_k). If Upper_k = false, go to step (10); otherwise go to step (11).
(10) The non-latency-sensitive kernel swaps in one SM.
(11) The SM swapped out by the latency-sensitive kernel is powered off, and its state becomes reserved.
(12) SMQoS judges whether SM_optimal has reached its lower bound (Lower_k). If Lower_k = false, go to step (13); otherwise go to step (15).
(13) The non-latency-sensitive kernel swaps out one SM.
(14) The latency-sensitive kernel swaps in one SM.
(15) SMQoS synchronizes the SM operations, waiting for all SM replacements to complete.
(16) The next T_epoch begins execution.
Note that when the QoS kernel requests to swap in one SM and no SM is reserved, the QoS kernel waits for the non-QoS kernel to swap out one SM. Step (15) avoids possible deadlock during the SM replacement process (a sketch of this epoch-boundary flow follows).
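The following C++ sketch condenses the Fig. 3 flow, steps (4) through (15), into one epoch-boundary routine; the function name and parameter set are illustrative, and the inputs come from the decision algorithm of step 6 and the SM_optimal search of step 7.

```cpp
// Sketch of the Fig. 3 flow at one epoch boundary (steps (4)-(15)).
// smQoS / smK / reservedSMs count the SMs in each of the three states.
void smqosEpochBoundary(bool qosSwapIn, bool qosSwapOut,
                        int& smQoS, int& smK, int& reservedSMs,
                        int smOptimal, bool upperK, bool lowerK) {
    if (qosSwapIn) {                                   // step (4)
        if (reservedSMs > 0) --reservedSMs;            // (5)->(14): take a reserved SM
        else --smK;                                    // (5)->(13)->(14): non-QoS yields one
        ++smQoS;                                       // (14): QoS kernel swaps in
    } else if (qosSwapOut) {                           // step (6)
        --smQoS;                                       // (7): QoS kernel swaps out
        if (smK < smOptimal || !upperK) ++smK;         // (8)-(10): non-QoS takes the SM
        else ++reservedSMs;                            // (11): power off and reserve
    } else if (!lowerK) {                              // step (12)
        --smK; ++reservedSMs;                          // (13): shrink toward SM_optimal
    }
    // (15): synchronize; all swaps complete before the next T_epoch starts.
}
```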
Step 7: The SM allocation plan is applied until the current optimal SM allocation count for the non-latency-sensitive kernel, SM_optimal, is determined. When SM_optimal SMs are allocated, the performance and energy consumption of the non-latency-sensitive kernel reach a desirable balance, thereby maximizing throughput or energy efficiency. The strategy proposed by the present invention for a non-latency-sensitive application to determine its optimal SM allocation count (SM_optimal) is shown in Fig. 4. The specific steps are as follows:
(1) whenever the non-latency-sensitive kernel performs a Swap-in or Swap-out operation, IPC_last is saved.
(2) if the non-latency-sensitive kernel swapped in one SM at the end of the previous T_epoch, the calculation is:
(2-1) if IPC_last ≥ IPC_epoch × (1 − Threshold), then Upper_k = true: SM_optimal has reached its upper bound;
(2-2) if IPC_last < IPC_epoch × (1 − Threshold), then Lower_k = true: SM_optimal has reached its lower bound, and SM_optimal = SM_k.
(3) if the non-latency-sensitive kernel swapped out one SM at the end of the previous T_epoch, the calculation is:
(3-1) if IPC_last > IPC_epoch × (1 + Threshold), then Lower_k = true: SM_optimal has reached its lower bound;
(3-2) if IPC_last ≤ IPC_epoch × (1 + Threshold), then Upper_k = true: SM_optimal has reached its upper bound, and SM_optimal = SM_k.
(4) if Lower_k = true and Upper_k = true, the value of SM_optimal is determined and will not change during the kernel's subsequent execution epochs.
Step 8: After the SM allocation plan is obtained, the decision algorithm determines the numbers of the SMs to be swapped in (Swap-in) and swapped out (Swap-out); that is, a swapped-out SM is no longer occupied by its former GPU kernel during the next T_epoch, while a swapped-in SM is occupied by the target GPU kernel during the next T_epoch.
Step 9: When an SM is marked for swap-in or swap-out, there are two cases: first, the SM is in the reserved state, in which case it is swapped in directly, i.e., occupied by the target GPU kernel; second, the SM is occupied by another GPU kernel, in which case all thread blocks on the SM must finish executing before the swap-out and swap-in operations are performed.
Step 10: After all SM swap-ins and swap-outs are completed, timing of the next T_epoch is started, to ensure the accuracy of the data collected next time.
Step 11: If each GPU application executes exactly one GPU kernel, steps 5 to 10 are repeated until all GPU kernels finish executing. If an application executes multiple GPU kernels, the next GPU kernel's initial SM allocation numbers are the same as the previous kernel's; steps 5 to 10 are repeated until the currently distributed GPU kernel finishes executing.
Step 12: Repeat step 11 until all GPU applications finish executing.