CN109445565A - GPU QoS guarantee method based on streaming multiprocessor exclusivity and reservation - Google Patents

GPU QoS guarantee method based on streaming multiprocessor exclusivity and reservation

Info

Publication number
CN109445565A
CN109445565A (application CN201811325650.8A)
Authority
CN
China
Prior art keywords
sm
gpu
kernel
ipc
delay
Prior art date
Application number
CN201811325650.8A
Other languages
Chinese (zh)
Inventor
杨海龙 (Yang Hailong)
孙庆骁 (Sun Qingxiao)
张静怡 (Zhang Jingyi)
Original Assignee
北京航空航天大学 (Beihang University)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University (北京航空航天大学)
Priority to CN201811325650.8A
Publication of CN109445565A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F1/00Details not covered by groups G06F3/00 – G06F13/00 and G06F21/00
    • G06F1/26Power supply means, e.g. regulation thereof
    • G06F1/32Means for saving power
    • G06F1/3203Power management, i.e. event-based initiation of power-saving mode
    • G06F1/3234Power saving characterised by the action undertaken
    • G06F1/3243Power saving in microcontroller unit
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5038Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration

Abstract

The invention discloses a GPU QoS guarantee method based on streaming multiprocessor exclusivity and reservation, comprising: 1) at startup, partitioning the streaming multiprocessors (Streaming Multiprocessors, SM) among applications according to their QoS targets; 2) at run time, periodically and dynamically adjusting the number of SMs assigned to each application through effective decision-making, so as to guarantee each application's QoS target; 3) during dynamic adjustment, identifying the type of each application (memory-intensive or compute-intensive) from the data collected in each period; 4) using this type information to improve the energy efficiency or throughput of the GPU. The invention fully exploits the performance potential of concurrent GPU kernels and, while effectively meeting QoS targets, balances the GPU's energy efficiency and throughput to the greatest extent: for compute-intensive applications it raises the GPU's thread-level parallelism, and for memory-intensive applications it reduces the GPU's energy consumption.

Description

GPU quality-of-service guarantee method based on streaming multiprocessor exclusivity and reservation

Technical field

The present invention relates to concurrent kernel execution, on-chip resource management and thread block scheduling, and in particular to a GPU QoS guarantee method based on streaming multiprocessor exclusivity and reservation.

Background technique

In high-performance computing, graphics processors (Graphics Processing Units, hereinafter GPU) are increasingly used to accelerate general-purpose computation. A GPU achieves its high compute capability by exploiting massive thread-level parallelism (Thread-Level Parallelism, hereinafter TLP). Streaming multiprocessors (Streaming Multiprocessors, hereinafter SM) execute the GPU's compute tasks; each SM contains many compute cores (Compute Cores) and resources such as registers (Registers), shared memory (Shared Memory) and the L1 cache (L1 Cache). The SMs share device memory (Device Memory) through an interconnect network (Interconnect Network). When multiple applications share a GPU, the GPU supports concurrent kernels: the kernel management unit (Kernel Management Unit, hereinafter KMU) dispatches kernels to the kernel distributor unit (Kernel Distributor Unit, hereinafter KDU), and the KDU executes kernels in first-come-first-serve (FCFS) order. However, when multiple applications execute concurrently, the thread block scheduler only dispatches the next kernel after all thread blocks of the previous kernel have been scheduled, and it distributes thread blocks evenly across all SMs according to a round-robin (Round-Robin) policy.

As the number of applications grows explosively, how to improve throughput and energy efficiency when multiple concurrent applications share a GPU becomes particularly important. For multitasking on GPUs, academia and industry have proposed two major techniques: spatial multitasking (Spatial Multitasking) and simultaneous multitasking (Simultaneous Multitasking). Spatial multitasking partitions the SMs of a GPU into disjoint subsets, each of which is allocated to a different application running at the same time; simultaneous multitasking shares SM resources at fine granularity (Fine-grained Sharing), executing multiple applications simultaneously on a single SM. Current general-purpose GPUs can manage SM resources at the chip level and thus support spatial multitasking. Moreover, when an application requires quality-of-service guarantees (Quality of Service, hereinafter QoS), enough resources must be allocated to meet its QoS requirement, which poses a greater challenge to the GPU architecture. Existing solutions that guarantee application QoS while maximizing GPU throughput and energy efficiency fall broadly into the following two categories:

(1) Adapting the application execution model to the GPU architecture

Research in this direction changes an application's default execution pattern to suit the GPU accelerator. For example, priority-based kernel scheduling always executes the highest-priority kernel first when multiple kernels are dispatched to the GPU. Alternatively, all tasks submitted to the GPU can be abstracted as a task queue managed on the CPU side, which predicts the execution time of each task and reorders tasks to meet applications' QoS requirements. Another approach uses a fine-grained SM resource model and a technique similar to persistent threads to reserve resources on each SM, limiting the resource occupancy of non-delay-sensitive applications; if the current reservation still cannot meet the QoS target, a dynamic resource adaptation module can be invoked to preempt resources held by currently executing tasks. Such methods generally optimize at kernel granularity and therefore handle long-running kernels poorly; in addition, the redistribution latency and energy cost of kernel preemption can be large.

(2) QoS execution models at the GPU architecture level

Research in this direction builds on GPU multitasking and is broadly divided into spatial multitasking and simultaneous multitasking. The spatial multitasking strategy estimates the performance of delay-sensitive applications at run time, then predicts the number of SMs each application needs with a linear performance model (Linear Performance Model). Simultaneous multitasking applies fine-grained QoS management: it spreads all applications across all SMs and assigns different quotas (Quota) to delay-sensitive and non-delay-sensitive applications, allocating resources on a single SM at fine granularity. Its disadvantages are that it does not support power gating (Power Gating), since all SMs on the GPU are always occupied, so energy consumption is high; and that when multiple kernels occupy the same SM, L1 cache (L1 Cache) conflicts degrade performance.

In conclusion application execution model is generally from software respective, granularity is kernel or other GPU tasks, makes it Adapt to GPU architecture;And QoS executes model generally from GPU architecture, adapts it to various applications.These two aspects model can be with It is compatible.It is noticeable to have two o'clock with the update of GPU: 1) SM quantity rapid development, newest Pascal and Volta framework has 56 SM (Tesla P100) and 80 SM (Tesla V100) respectively;2) resource such as register on single SM File, shared drive and L1Cache size do not change substantially.It can be seen that, it is contemplated that SM quantity is continuous on the following GPU architecture Increased trend, the present invention carry out kernel as granularity using SM and seize and reserve, can handle long-term running kernel well, keep away Exempt from frequent kernel to seize and distribute;And single SM is monopolized by kernel, avoids the conflict of L1 Cache, and support power supply Gate is to reduce energy consumption.

Summary of the invention

The technical problem solved by the present invention: overcoming the deficiencies and defects of the prior art by providing a GPU QoS guarantee method based on streaming multiprocessor exclusivity and reservation that satisfies the QoS targets of multiple applications executing on a GPU, fully exploits the potential of concurrent GPU kernels, and adaptively maximizes the GPU's energy efficiency or throughput.

The technical solution of the invention, a GPU QoS guarantee method based on streaming multiprocessor exclusivity and reservation, comprises the following steps:

Step 1: multiple GPU applications are launched on the CPU side, and the user specifies which applications require quality-of-service QoS (Quality of Service) guarantees and their QoS targets; the parallel sections of an application, i.e. its GPU kernels, are loaded onto the GPU through runtime APIs (a GPU application and its GPU kernels have a one-to-many relationship: one GPU application corresponds to one or more GPU kernels); each GPU kernel is classified as delay-sensitive or non-delay-sensitive according to the mark of the application it belongs to;

Step 2: each GPU kernel is identified by a software work queue SWQ (Software Work Queue) ID and pushed into the pending kernel pool (Pending Kernel Pool) of the grid management unit GMU (Grid Management Unit);

Step 3: GPU kernels with the same SWQ ID are mapped to the same hardware work queue HWQ (Hardware Work Queue); the initial streaming multiprocessor SM (Streaming Multiprocessor) allocation scheme of the GPU kernel at the head of each HWQ is determined by the QoS target of its application; under the allocation scheme an SM is in one of three states: occupied by a delay-sensitive kernel, occupied by a non-delay-sensitive kernel, or closed and reserved;

Step 4: once the SM allocation scheme is determined, the thread blocks (Thread Block) of the GPU kernel at the head of each HWQ are dispatched to SMs by the thread block scheduler;

Step 5: GPU kernels execute in units of a time span T_epoch; after each T_epoch, the instructions per cycle IPC (Instructions Per Cycle) of each GPU kernel is collected; the IPC comprises IPC_total, the overall IPC from the start of execution to the end of the current T_epoch, and IPC_epoch, the IPC of the current T_epoch alone;

Step 6: with the IPC information, a decision algorithm determines the SM allocation scheme of each GPU kernel for the next T_epoch; the SM allocation scheme is related to the kernel type, compute-intensive CI (Compute-Intensive) or memory-intensive MI (Memory-Intensive);

Step 7: the SM allocation scheme is applied until the current optimal SM count SM_optimal of the non-delay-sensitive kernel is determined; when allocated SM_optimal SMs, the non-delay-sensitive kernel reaches a desirable balance between performance and energy consumption, maximizing throughput or energy efficiency;

Step 8: after the SM allocation scheme is obtained, the decision algorithm determines which SMs need to be swapped in (Swap-in) or swapped out (Swap-out): a swapped-out SM is no longer occupied by its former GPU kernel during the next T_epoch, while a swapped-in SM is occupied by its target GPU kernel during the next T_epoch;

Step 9: when an SM is marked for swap-in or swap-out, there are two situations: first, the SM is in the reserved state, in which case it is swapped in directly, i.e. occupied by the target GPU kernel; second, the SM is occupied by another GPU kernel, in which case all thread blocks on the SM must finish executing before the swap-out and swap-in operations are performed;
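The two cases of step 9 can be sketched as a small state machine (a sketch only; the class layout and names are ours, not the patent's):

```python
class SM:
    """One streaming multiprocessor with the three states of step 3:
    'reserved' (closed), 'qos' (delay-sensitive kernel) or 'non_qos'."""

    def __init__(self):
        self.state = "reserved"
        self.active_blocks = 0     # thread blocks still running on this SM

    def try_swap_in(self, new_owner):
        """Attempt the swap of step 9; returns True once the SM is handed over."""
        if self.state == "reserved":
            self.state = new_owner            # case 1: reserved, hand over directly
            return True
        if self.active_blocks == 0:           # case 2: occupied, wait for draining
            self.state = new_owner
            return True
        return False   # still draining; the scheduler stops feeding it thread blocks

# A reserved SM is claimed immediately; an occupied one only after draining.
sm = SM()
print(sm.try_swap_in("qos"))                  # -> True
busy = SM(); busy.state = "non_qos"; busy.active_blocks = 2
print(busy.try_swap_in("qos"))                # -> False
```

While `try_swap_in` returns False, the thread block scheduler simply assigns no new blocks to the SM, which matches the draining behaviour described for step 9.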

Step 10: after all SM swap-ins and swap-outs complete, the timer for the next T_epoch is started, ensuring the accuracy of the next round of data collection;

Step 11: if every GPU application executes exactly one GPU kernel, repeat steps 5 to 10 until all GPU kernels finish executing; if an application executes multiple GPU kernels, the initial SM allocation of its next GPU kernel is the same as the previous one's, and steps 5 to 10 are repeated until the currently dispatched GPU kernel finishes executing;

Step 12: repeat step 11 until all GPU applications finish executing.

In step 1, the user-specified applications requiring quality-of-service guarantees and their QoS targets are realized as follows:

(1) whether an application requires a quality-of-service guarantee is expressed as IsQoS, specifically:

(1-1) for a delay-sensitive application, IsQoS = true for all of its GPU kernels;

(1-2) for a non-delay-sensitive application, IsQoS = false for all of its GPU kernels;

(2) the QoS target is expressed as IPC_goal; IPC_isolated is the IPC obtained when all SMs are allocated to a given delay-sensitive application; α_k is defined as the ratio of IPC_goal to IPC_isolated, calculated as follows:

IPC_goal = IPC_isolated × α_k, where α_k ∈ (0, 1).

In step 3, the initial streaming multiprocessor SM allocation is determined assuming that one delay-sensitive application and one non-delay-sensitive application execute concurrently, calculated as follows:

(1) for the delay-sensitive kernel, SM_QoS = ⌈α_k × SM_total⌉;

(2) for the non-delay-sensitive kernel, SM_k = SM_total − SM_QoS,

where SM_total is the total number of SMs in the GPU, SM_k is the number of SMs occupied by a given GPU kernel, SM_QoS is the number of SMs occupied by the delay-sensitive kernel, and α_k is the ratio of IPC_goal to IPC_isolated.
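The initial split can be sketched as follows (a sketch only: the rounding direction for the QoS share is our assumption, since the patent's formula is not legible in this copy):

```python
import math

def initial_sm_allocation(sm_total, alpha_k):
    """Initial SM split between one delay-sensitive (QoS) kernel and one
    non-delay-sensitive kernel.  alpha_k = IPC_goal / IPC_isolated, in (0, 1).
    Rounding the QoS share up with ceil() is our assumption."""
    assert 0.0 < alpha_k < 1.0
    sm_qos = math.ceil(alpha_k * sm_total)   # SMs for the delay-sensitive kernel
    sm_k = sm_total - sm_qos                 # SMs for the non-delay-sensitive kernel
    return sm_qos, sm_k

# e.g. on an 80-SM GPU (Tesla V100) with a QoS target of 40% of IPC_isolated:
print(initial_sm_allocation(80, 0.4))   # -> (32, 48)
```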

In step 6, the decision algorithm determines the SM allocation scheme of each GPU kernel for the next T_epoch as follows:

Define I_total as the total number of instructions a GPU kernel has executed from the start of execution to the end of the current T_epoch, and N_epoch as the total number of T_epoch periods from the start of execution to the end of the current T_epoch; Swap-in means swapping in one SM and Swap-out means swapping out one SM. The calculation is as follows:

(1) for the delay-sensitive kernel:

(1-1) if I_total / [(N_epoch + 1) × T_epoch] > IPC_goal, mark it as Swap-out;

(1-2) if IPC_total < IPC_goal, mark it as Swap-in;

(2) for the non-delay-sensitive kernel, the decision is made on the basis of the delay-sensitive kernel's decision; there are three cases:

(2-1) the delay-sensitive kernel requests swapping in an SM. A reserved SM is preferred, avoiding the overhead of a swap-out operation; if no SM is reserved, the non-delay-sensitive kernel is marked as Swap-out, i.e. it swaps out one SM, which is then occupied by the delay-sensitive kernel;

(2-2) the delay-sensitive kernel requests swapping out an SM. The decision algorithm then determines whether the non-delay-sensitive kernel needs to swap in an SM; if so, the non-delay-sensitive kernel is marked as Swap-in, and the SM swapped out by the delay-sensitive kernel is occupied by the non-delay-sensitive kernel; otherwise the SM is closed and reserved;

(2-3) the delay-sensitive kernel requests no Swap operation. The decision algorithm then determines whether the non-delay-sensitive kernel needs to swap out an SM; if so, the non-delay-sensitive kernel is marked as Swap-out, and the SM it swaps out is closed and reserved.
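The per-epoch decision rules above can be sketched as follows (function and variable names are ours; as in step 8, at most one SM moves per period):

```python
def qos_kernel_decision(i_total, n_epoch, t_epoch, ipc_total, ipc_goal):
    """Rule (1): decision for the delay-sensitive kernel at the end of an epoch."""
    if i_total / ((n_epoch + 1) * t_epoch) > ipc_goal:
        return "swap_out"                    # (1-1): projected IPC still above goal
    if ipc_total < ipc_goal:
        return "swap_in"                     # (1-2): cumulative IPC below goal
    return "none"

def non_qos_kernel_decision(qos_action, reserved_sms, wants_swap_in, wants_swap_out):
    """Rule (2): the non-delay-sensitive kernel reacts to the QoS decision.
    wants_swap_in / wants_swap_out stand in for the SM_optimal test of step 7."""
    if qos_action == "swap_in":              # case (2-1)
        return "use_reserved_sm" if reserved_sms > 0 else "swap_out"
    if qos_action == "swap_out":             # case (2-2)
        return "swap_in" if wants_swap_in else "close_and_reserve"
    return "swap_out_and_reserve" if wants_swap_out else "none"   # case (2-3)

# e.g. a QoS kernel projected at 1000 / (5 * 50) = 4 IPC against a goal of 3
# is ahead of its target and releases an SM:
print(qos_kernel_decision(1000, 4, 50, 3.5, 3))   # -> swap_out
```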

In step 7, the SM allocation scheme is applied until the current optimal SM count SM_optimal of the non-delay-sensitive kernel is determined. A threshold Threshold, defined by hardware or software, judges whether an SM should be swapped out into the reserved state; Threshold applies only to the non-delay-sensitive kernel. Define IPC_last as the IPC_epoch of the previous T_epoch. To obtain SM_optimal in as few periods as possible, two flags Upper_k and Lower_k are also needed, where Upper_k indicates whether SM_optimal has reached its upper bound and Lower_k indicates whether it has reached its lower bound. The specific rules are as follows:

(1) whenever the non-delay-sensitive kernel performs a Swap-in or Swap-out operation, save IPC_last;

(2) if the non-delay-sensitive kernel swapped in an SM at the end of the previous T_epoch, the calculation is as follows:

(2-1) if IPC_last ≥ IPC_epoch × (1 − Threshold), then Upper_k = true and SM_optimal has reached its upper bound;

(2-2) if IPC_last < IPC_epoch × (1 − Threshold), then Lower_k = true, SM_optimal has reached its lower bound, and SM_optimal = SM_k;

(3) if the non-delay-sensitive kernel swapped out an SM at the end of the previous T_epoch, the calculation is as follows:

(3-1) if IPC_last > IPC_epoch × (1 + Threshold), then Lower_k = true and SM_optimal has reached its lower bound;

(3-2) if IPC_last ≤ IPC_epoch × (1 + Threshold), then Upper_k = true, SM_optimal has reached its upper bound, and SM_optimal = SM_k;

(4) if Lower_k = true and Upper_k = true, the value of SM_optimal is determined and will not change during the kernel's subsequent execution periods.
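Rules (1)-(4) amount to a small hill-climbing test around the current allocation; a sketch (state layout and names are ours):

```python
def update_sm_optimal(last_op, ipc_last, ipc_epoch, threshold, sm_k, state):
    """One application of rules (2)-(4).  state is a dict holding the
    Upper_k / Lower_k flags and the SM_optimal guess for a kernel."""
    if last_op == "swap_in":                              # rule (2)
        if ipc_last >= ipc_epoch * (1 - threshold):
            state["upper"] = True                         # (2-1): extra SM barely helped
        else:
            state["lower"] = True                         # (2-2): still scaling up
            state["sm_optimal"] = sm_k
    elif last_op == "swap_out":                           # rule (3)
        if ipc_last > ipc_epoch * (1 + threshold):
            state["lower"] = True                         # (3-1): losing an SM hurt
        else:
            state["upper"] = True                         # (3-2): SM was not needed
            state["sm_optimal"] = sm_k
    state["converged"] = state["upper"] and state["lower"]   # rule (4)
    return state

state = {"upper": False, "lower": False, "sm_optimal": None, "converged": False}
update_sm_optimal("swap_in", 8.0, 10.0, 0.05, 8, state)    # 8.0 < 9.5: lower bound, SM_optimal = 8
update_sm_optimal("swap_in", 10.0, 10.2, 0.05, 9, state)   # 10.0 >= 9.69: upper bound
print(state["converged"], state["sm_optimal"])             # -> True 8
```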

In step 9, waiting for all thread blocks on an SM to finish before performing the swap-out and swap-in means that the thread block scheduler stops assigning new thread blocks to that SM and simply waits for the existing ones to finish. The point is to avoid interrupting the kernel's computation while completing the swap-out as quickly as possible.

In step 11, the initial SM allocation of an application's next GPU kernel is the same as the previous one's. The point is: 1) different kernels of the same application are generally similar, so SM resources such as registers (Registers), shared memory (Shared Memory) and the L1 cache (L1 Cache) can be fully utilized; 2) SM swap-out operations are eliminated, reducing the waiting delay of kernel dispatch.

In step 4, the number of HWQs is 32, so the GPU can execute at most 32 kernels concurrently.
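The SWQ-to-HWQ mapping of steps 2-4 can be sketched as follows (the modulo assignment is our assumption; the patent only requires that kernels sharing an SWQ ID share an HWQ):

```python
from collections import defaultdict

NUM_HWQ = 32   # step 4: at most 32 concurrently executing kernels

def build_hwqs(kernels):
    """Group (swq_id, kernel_name) pairs into hardware work queues.
    Kernels in the same HWQ serialize; only the heads of the HWQs run
    concurrently and receive initial SM allocations (step 3)."""
    hwqs = defaultdict(list)
    for swq_id, name in kernels:
        hwqs[swq_id % NUM_HWQ].append(name)
    return dict(hwqs)

print(build_hwqs([(0, "k0a"), (0, "k0b"), (1, "k1")]))
# -> {0: ['k0a', 'k0b'], 1: ['k1']}
```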

In step 8, among the SMs occupied by a given GPU kernel, at most one needs to be swapped in or out per period; the others keep their original state. A reserved SM has no swap-out decision; it is either swapped in or left unchanged.

The advantage of the present invention over the prior art is: the invention fully exploits the performance potential of concurrent GPU kernels and, while effectively meeting QoS targets, balances the GPU's energy efficiency and throughput to the greatest extent: for compute-intensive applications it raises the GPU's thread-level parallelism, and for memory-intensive applications it reduces the GPU's energy consumption.

Brief description of the drawings

Fig. 1 is the hardware architecture diagram realizing the proposed method;

Fig. 2 is a schematic diagram of the proposed kernel dispatch and thread block scheduling strategy;

Fig. 3 is a flow diagram of the proposed dynamic SM allocation;

Fig. 4 is a schematic diagram of the strategy by which a non-delay-sensitive kernel confirms its optimal SM count.

Specific embodiment

To make the objectives, technical solutions and advantages of the present invention clearer, the invention is further elaborated below with reference to the drawings and examples. It should be understood that the specific examples described here only explain the present invention and do not limit it. Moreover, the technical features involved in the embodiments described below can be combined with each other as long as they do not conflict.

The hardware architecture of the invention is shown in Fig. 1, where SMQoS and the SMQoS interface are the modules the invention adds to the original GPU.

The specific implementation steps of the invention, shown in Fig. 1, are as follows:

Step 1: multiple GPU applications are launched on the CPU side, and the user specifies which applications require quality-of-service QoS (Quality of Service) guarantees and their QoS targets; the parallel sections of an application, i.e. its GPU kernels, are loaded onto the GPU through runtime APIs (a GPU application and its GPU kernels have a one-to-many relationship: one GPU application corresponds to one or more GPU kernels); each GPU kernel is classified as delay-sensitive or non-delay-sensitive according to the mark of the application it belongs to. The invention requires the new API shown in Table 1: cudaSetQoS passes the QoS parameters set on the CPU side to the GPU, so that the GPU supports the application's QoS requirements;

Table 1 lists the new API required by the proposed method;

Step 2: each GPU kernel is identified by a software work queue SWQ (Software Work Queue) ID and pushed into the pending kernel pool (Pending Kernel Pool) of the grid management unit GMU (Grid Management Unit);

Step 3: GPU kernels with the same SWQ ID are mapped to the same hardware work queue HWQ (Hardware Work Queue); the initial streaming multiprocessor SM (Streaming Multiprocessor) allocation scheme of the GPU kernel at the head of each HWQ is determined by the QoS target of its application. The proposed kernel dispatch and thread block scheduling strategy is shown in Fig. 2: each SM is monopolized by one kernel, and SM swap-in (Swap-in) and swap-out (Swap-out) operations are supported, so that the GPU meets applications' QoS demands; for compute-intensive applications the GPU's throughput is maximized, and for memory-intensive applications SMs can be closed to reduce energy consumption. In all following steps, an SM under the allocation scheme is in one of three states: occupied by a delay-sensitive kernel, occupied by a non-delay-sensitive kernel, or closed and reserved;

Step 4: once the SM allocation scheme is determined, the thread blocks (Thread Block) of the GPU kernel at the head of each HWQ are dispatched to SMs by the thread block scheduler;

Step 5: GPU kernels execute in units of a time span T_epoch; after each T_epoch, the instructions per cycle IPC (Instructions Per Cycle) of each GPU kernel is collected; the IPC comprises IPC_total, the overall IPC from the start of execution to the end of the current T_epoch, and IPC_epoch, the IPC of the current T_epoch alone;

Step 6: with the IPC information, a decision algorithm determines the SM allocation scheme of each GPU kernel for the next T_epoch; the SM allocation scheme is related to the kernel type, compute-intensive CI (Compute-Intensive) or memory-intensive MI (Memory-Intensive). The proposed dynamic SM allocation flow is shown in Fig. 3. The specific steps are as follows:

(1) The previous T_epoch finishes executing.

(2) The data is fed back to SMQoS.

(3) SMQoS updates the kernel information according to its scheduling strategy and determines the SM allocation scheme for the next T_epoch.

(4) SMQoS judges whether the delay-sensitive kernel needs to swap in an SM. If so, go to step (5); otherwise go to step (6).

(5) SMQoS judges whether any SM is reserved. If an SM is reserved, go to step (14); otherwise go to step (13).

(6) SMQoS judges whether the delay-sensitive kernel needs to swap out an SM. If so, go to step (7); otherwise go to step (12).

(7) The delay-sensitive kernel swaps out an SM.

(8) SMQoS judges whether the number of SMs occupied by the non-delay-sensitive kernel (SM_k) is less than the optimal count (SM_optimal). If SM_k < SM_optimal, go to step (10); otherwise go to step (9).

(9) SMQoS judges whether SM_optimal has reached its upper bound (Upper_k). If Upper_k = false, go to step (10); otherwise go to step (11).

(10) The non-delay-sensitive kernel swaps in an SM.

(11) The SM swapped out by the delay-sensitive kernel is closed, and its state becomes reserved.

(12) SMQoS judges whether SM_optimal has reached its lower bound (Lower_k). If Lower_k = false, go to step (13).

(13) The non-delay-sensitive kernel swaps out an SM.

(14) The delay-sensitive kernel swaps in an SM.

(15) SMQoS synchronizes the SM operations, waiting for all SM replacements to complete.

(16) Execution of the next T_epoch begins.

Note that when a QoS kernel requests swapping in an SM and no SM is reserved, the QoS kernel waits for a non-QoS kernel to swap out an SM. Step (15) avoids a possible deadlock during the SM replacement process;
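The flow of Fig. 3 can be condensed into a toy per-epoch driver for the QoS side (a sketch under our naming; the swap synchronization of steps (5)-(15) is abbreviated to a comment):

```python
def epoch_loop(t_epoch, ipc_goal, instr_per_epoch):
    """Run the QoS-kernel decisions over a trace of instruction counts,
    one entry per T_epoch, and log each per-epoch decision."""
    i_total, log = 0, []
    for n, instr in enumerate(instr_per_epoch):
        i_total += instr
        ipc_total = i_total / ((n + 1) * t_epoch)
        if i_total / ((n + 2) * t_epoch) > ipc_goal:
            log.append("qos_swap_out")       # projected IPC still above the goal
        elif ipc_total < ipc_goal:
            log.append("qos_swap_in")        # cumulative IPC below the goal
        else:
            log.append("none")
        # ... perform and synchronize the SM swaps here (steps (5)-(15)),
        # then restart the timer for the next T_epoch (step (16))
    return log

print(epoch_loop(10, 5, [60, 20]))   # -> ['none', 'qos_swap_in']
```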

Step 7: the SM allocation scheme is applied until the current optimal SM count SM_optimal of the non-delay-sensitive kernel is determined; when allocated SM_optimal SMs, the non-delay-sensitive kernel reaches a desirable balance between performance and energy consumption, maximizing throughput or energy efficiency. The proposed strategy by which a non-delay-sensitive application confirms its optimal SM count (SM_optimal) is shown in Fig. 4. The specific steps are as follows:

(1) whenever the non-delay-sensitive kernel performs a Swap-in or Swap-out operation, save IPC_last;

(2) if the non-delay-sensitive kernel swapped in an SM at the end of the previous T_epoch, the calculation is as follows:

(2-1) if IPC_last ≥ IPC_epoch × (1 − Threshold), then Upper_k = true and SM_optimal has reached its upper bound;

(2-2) if IPC_last < IPC_epoch × (1 − Threshold), then Lower_k = true, SM_optimal has reached its lower bound, and SM_optimal = SM_k;

(3) if the non-delay-sensitive kernel swapped out an SM at the end of the previous T_epoch, the calculation is as follows:

(3-1) if IPC_last > IPC_epoch × (1 + Threshold), then Lower_k = true and SM_optimal has reached its lower bound;

(3-2) if IPC_last ≤ IPC_epoch × (1 + Threshold), then Upper_k = true, SM_optimal has reached its upper bound, and SM_optimal = SM_k;

(4) if Lower_k = true and Upper_k = true, the value of SM_optimal is determined and will not change during the kernel's subsequent execution periods.

Step 8: after the SM allocation scheme is obtained, the decision algorithm determines which SMs need to be swapped in (Swap-in) or swapped out (Swap-out): a swapped-out SM is no longer occupied by its former GPU kernel during the next T_epoch, while a swapped-in SM is occupied by its target GPU kernel during the next T_epoch;

Step 9: when an SM is marked for swap-in or swap-out, there are two situations: first, the SM is in the reserved state, in which case it is swapped in directly, i.e. occupied by the target GPU kernel; second, the SM is occupied by another GPU kernel, in which case all thread blocks on the SM must finish executing before the swap-out and swap-in operations are performed;

Step 10: after all SM swap-ins and swap-outs complete, the timer for the next T_epoch is started, ensuring the accuracy of the next round of data collection;

Step 11: if every GPU application executes exactly one GPU kernel, repeat steps 5 to 10 until all GPU kernels finish executing; if an application executes multiple GPU kernels, the initial SM allocation of its next GPU kernel is the same as the previous one's, and steps 5 to 10 are repeated until the currently dispatched GPU kernel finishes executing;

Step 12: repeat step 11 until all GPU applications finish executing.

Claims (9)

1. A GPU QoS guarantee method based on exclusive and reserved streaming multiprocessor cores, characterized by comprising the following steps:
Step 1: multiple GPU applications are started on the CPU side, and the user specifies which applications require quality-of-service (QoS, Quality of Service) guarantees together with their QoS targets; the parallel portion of each application, i.e. its GPU kernel, is loaded onto the GPU through runtime APIs (GPU applications map one-to-many to GPU kernels: one GPU application corresponds to one or more GPU kernels); according to the application it belongs to, each GPU kernel is classified as either a delay-sensitive kernel or a non-delay-sensitive kernel;
Step 2: each GPU kernel is tagged with a software work queue (SWQ, Software Work Queue) ID and pushed into the Pending Kernel Pool of the grid management unit (GMU, Grid Management Unit);
Step 3: GPU kernels with the same SWQ ID are mapped into the same hardware work queue (HWQ, Hardware Work Queue); for the GPU kernel at the head of each HWQ, an initial streaming multiprocessor (SM, Streaming Multiprocessor) allocation plan is determined from the QoS target of the application it belongs to; under an allocation plan each SM is in one of three states: occupied by a delay-sensitive kernel, occupied by a non-delay-sensitive kernel, or powered off and reserved;
Step 4: once the SM allocation plan is determined, the thread blocks (Thread Block) of the GPU kernel at the head of each HWQ are dispatched to the SMs by the thread block scheduler;
Step 5: GPU kernels execute in units of a time span T_epoch; after each T_epoch, the instructions-per-cycle (IPC, Instructions Per Cycle) figures of each GPU kernel are collected, including the overall IPC from the start of execution to the end of the current T_epoch, i.e. IPC_total, and the IPC of the current T_epoch alone, i.e. IPC_epoch;
Step 6: given the IPC information, a decision algorithm determines the SM allocation plan of each GPU kernel for the next T_epoch; the SM allocation plan depends on the GPU kernel type, namely compute-intensive (CI, Compute-intensive) or memory-intensive (MI, Memory-intensive);
Step 7: the SM allocation plan is refined until the current optimal SM allocation count SM_optimal for the non-delay-sensitive kernel has been determined; when exactly SM_optimal SMs are allocated, the performance and energy consumption of the non-delay-sensitive kernel reach a desirable balance, thereby maximizing throughput or energy efficiency;
Step 8: after the SM allocation plan is obtained, the decision algorithm determines the numbers of the SMs to be swapped in (Swap-in) and swapped out (Swap-out); a swapped-out SM is no longer occupied by its former GPU kernel during the next T_epoch, while a swapped-in SM is occupied by the target GPU kernel during the next T_epoch;
Step 9: when an SM is marked for swap-in or swap-out, there are two cases: first, the SM is in the reserved state, in which case it is swapped in directly and occupied by the target GPU kernel; second, the SM is occupied by another GPU kernel, in which case all thread blocks on that SM must finish executing before the swap-out and swap-in operations are performed;
Step 10: after all SM swap-in and swap-out operations have completed, timing of the next T_epoch is started, so as to ensure the accuracy of the next round of data collection;
Step 11: if each GPU application executes its GPU kernel exactly once, steps 5 to 10 are repeated until all GPU kernels finish executing; if some application executes multiple GPU kernels, the initial SM allocation numbers of the next GPU kernel are exactly the same as those of the previous one, and steps 5 to 10 are repeated until the currently dispatched GPU kernel finishes;
Step 12: step 11 is repeated until all GPU applications finish executing.
2. The GPU quality-of-service guarantee method based on exclusive and reserved streaming multiprocessor cores according to claim 1, characterized in that: in step 1, the user specifies the applications that require QoS guarantees and their QoS targets as follows:
(1) whether an application requires a QoS guarantee is expressed as IsQoS, specifically:
(1-1) for a delay-sensitive application, IsQoS = true for all of its GPU kernels;
(1-2) for a non-delay-sensitive application, IsQoS = false for all of its GPU kernels;
(2) the QoS target is expressed as IPC_goal; the IPC obtained when all SMs are allocated to a given delay-sensitive application is IPC_isolated; α_k is defined as the ratio of IPC_goal to IPC_isolated, computed as follows:
IPC_goal = IPC_isolated × α_k, where α_k ∈ (0, 1).
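As a concrete illustration of the formula above (the function name is hypothetical, not part of the claim):

```python
def qos_target(ipc_isolated, alpha_k):
    """QoS target from claim 2: IPC_goal = IPC_isolated * alpha_k,
    with alpha_k required to lie in the open interval (0, 1)."""
    if not 0.0 < alpha_k < 1.0:
        raise ValueError("alpha_k must lie in (0, 1)")
    return ipc_isolated * alpha_k
```

For instance, an application that reaches an IPC of 120 when running alone on all SMs, with α_k = 0.75, gets a QoS target of IPC_goal = 90.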
3. The GPU quality-of-service guarantee method based on exclusive and reserved streaming multiprocessor cores according to claim 1, characterized in that: in step 3, the initial streaming multiprocessor (SM) allocation is determined assuming that one delay-sensitive application and one non-delay-sensitive application execute concurrently, computed as follows:
(1) for a delay-sensitive kernel, SM_QoS = ⌈α_k × SM_total⌉;
(2) for a non-delay-sensitive kernel, SM_k = SM_total − SM_QoS;
where SM_total is the total number of SMs in the GPU, SM_k is the number of SMs occupied by a given GPU kernel, SM_QoS is the number of SMs occupied by the delay-sensitive kernel, and α_k is the ratio of IPC_goal to IPC_isolated.
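A minimal sketch of this initial split, assuming the delay-sensitive share is rounded up with a ceiling (the formula for the delay-sensitive kernel is lost from the source text, so the ceiling form and all names here are assumptions; the remainder follows the claim, SM_k = SM_total − SM_QoS):

```python
import math

def initial_sm_split(sm_total, alpha_k):
    """Initial SM split between one delay-sensitive and one
    non-delay-sensitive kernel. The delay-sensitive share is an assumed
    ceil(alpha_k * SM_total); the rest goes to the best-effort kernel."""
    sm_qos = math.ceil(alpha_k * sm_total)  # delay-sensitive kernel (assumed form)
    sm_k = sm_total - sm_qos                # non-delay-sensitive kernel
    return sm_qos, sm_k
```

On a 16-SM GPU with α_k = 0.7, the delay-sensitive kernel would initially receive 12 SMs and the best-effort kernel 4.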
4. The GPU quality-of-service guarantee method based on exclusive and reserved streaming multiprocessor cores according to claim 1, characterized in that: in step 6, the decision algorithm determines the SM allocation plan of each GPU kernel for the next T_epoch as follows:
Define I_total as the total number of instructions a GPU kernel has executed from the start of execution to the end of the current T_epoch, and N_epoch as the total number of T_epoch intervals from the start of execution to the end of the current T_epoch; Swap-in means swapping in one SM and Swap-out means swapping out one SM; the computation is as follows:
(1) for the delay-sensitive kernel:
(1-1) if I_total / [(N_epoch + 1) × T_epoch] > IPC_goal, mark it as Swap-out;
(1-2) if IPC_total < IPC_goal, mark it as Swap-in;
(2) for the non-delay-sensitive kernel, the decision follows from the decision for the delay-sensitive kernel; there are three cases:
(2-1) the delay-sensitive kernel requests to swap in an SM; in this case a reserved SM is selected preferentially, avoiding the overhead of a swap-out operation on an SM; if no SM is reserved, the non-delay-sensitive kernel is marked as Swap-out, i.e. it gives up one SM, which can then be occupied by the delay-sensitive kernel;
(2-2) the delay-sensitive kernel requests to swap out an SM; the decision algorithm then determines whether the non-delay-sensitive kernel needs to swap in an SM; if so, the non-delay-sensitive kernel is marked as Swap-in and the SM released by the delay-sensitive kernel can be occupied by the non-delay-sensitive kernel; otherwise that SM is powered off and reserved;
(2-3) the delay-sensitive kernel requires no Swap operation; the decision algorithm then determines whether the non-delay-sensitive kernel needs to swap out an SM; if so, the non-delay-sensitive kernel is marked as Swap-out and the SM it releases is powered off and reserved.
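The two halves of this decision algorithm can be sketched as follows. This is an illustrative reading of claim 4, not the claimed implementation; the string labels, parameter names, and the exact trigger signals for the best-effort kernel's own swap requests are assumptions.

```python
def qos_swap_decision(i_total, n_epoch, t_epoch, ipc_total, ipc_goal):
    """Part (1): swap decision for the delay-sensitive kernel. Swap out an
    SM if the average IPC would still exceed the goal even when projected
    over one additional epoch; swap in an SM if the goal is being missed."""
    if i_total / ((n_epoch + 1) * t_epoch) > ipc_goal:
        return "swap_out"
    if ipc_total < ipc_goal:
        return "swap_in"
    return None  # no change requested

def best_effort_reaction(qos_decision, reserved_sms, wants_swap_in, wants_swap_out):
    """Part (2): the non-delay-sensitive kernel reacts to the QoS decision."""
    if qos_decision == "swap_in":                       # case (2-1)
        return "use_reserved" if reserved_sms > 0 else "be_swap_out"
    if qos_decision == "swap_out":                      # case (2-2)
        return "be_swap_in" if wants_swap_in else "close_and_reserve"
    # case (2-3): no QoS swap; the best-effort kernel may still shed an SM
    return "be_swap_out_reserve" if wants_swap_out else "no_change"
```

For example, with I_total = 12000, N_epoch = 2, T_epoch = 1000 and IPC_goal = 3.5, the projection 12000 / 3000 = 4.0 exceeds the goal, so the QoS kernel releases an SM.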
5. The GPU quality-of-service guarantee method based on exclusive and reserved streaming multiprocessor cores according to claim 1, characterized in that: in step 7, the allocation plan proceeds until the current optimal SM allocation count SM_optimal for the non-delay-sensitive kernel has been determined; a hardware- or software-defined threshold value Threshold is used to judge whether an SM should be swapped out into the reserved state, where the scope of Threshold is defined only for the non-delay-sensitive kernel; IPC_last denotes the IPC_epoch of the previous T_epoch; in order to obtain the value of SM_optimal within as short a period as possible, two flag bits Upper_k and Lower_k are also required, where Upper_k indicates whether SM_optimal has reached its upper bound and Lower_k indicates whether SM_optimal has reached its lower bound; the specific requirements are as follows:
(1) if the non-delay-sensitive kernel has performed a Swap-in or Swap-out operation, save IPC_last;
(2) if the non-delay-sensitive kernel swapped in an SM at the end of the previous T_epoch, the computation is as follows:
(2-1) if IPC_last ≥ IPC_epoch × (1 − Threshold), then Upper_k = true and SM_optimal has reached its upper bound;
(2-2) if IPC_last < IPC_epoch × (1 − Threshold), then Lower_k = true, SM_optimal has reached its lower bound, and SM_optimal = SM_k;
(3) if the non-delay-sensitive kernel swapped out an SM at the end of the previous T_epoch, the computation is as follows:
(3-1) if IPC_last > IPC_epoch × (1 + Threshold), then Lower_k = true and SM_optimal has reached its lower bound;
(3-2) if IPC_last ≤ IPC_epoch × (1 + Threshold), then Upper_k = true, SM_optimal has reached its upper bound, and SM_optimal = SM_k;
(4) if Lower_k = true and Upper_k = true, the value of SM_optimal is determined, i.e. it will not change during subsequent execution periods of the GPU kernel.
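The bound-tightening rules of claim 5 can be sketched in one function. This is an illustrative reconstruction with hypothetical names; it returns the Upper_k and Lower_k flags plus SM_optimal (or None when the bound reached does not pin the value):

```python
def update_optimal_bounds(last_op, ipc_last, ipc_epoch, threshold, sm_k):
    """After a swap by the best-effort kernel, compare the IPC before the
    swap (ipc_last) with the IPC after it (ipc_epoch) to bound SM_optimal.
    Returns (upper_k, lower_k, sm_optimal_or_None)."""
    upper = lower = False
    sm_optimal = None
    if last_op == "swap_in":
        if ipc_last >= ipc_epoch * (1 - threshold):   # (2-1): extra SM did not help
            upper = True
        else:                                         # (2-2): extra SM helped
            lower = True
            sm_optimal = sm_k
    elif last_op == "swap_out":
        if ipc_last > ipc_epoch * (1 + threshold):    # (3-1): losing the SM hurt
            lower = True
        else:                                         # (3-2): losing the SM was fine
            upper = True
            sm_optimal = sm_k
    return upper, lower, sm_optimal
```

Once both flags have become true across successive epochs, the search stops and SM_optimal is fixed, as stated in condition (4).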
6. The GPU quality-of-service guarantee method based on exclusive and reserved streaming multiprocessor cores according to claim 1, characterized in that: in step 9, when all thread blocks on an SM must finish before the swap-in or swap-out operation is performed, the thread block scheduler stops dispatching new thread blocks to that SM and simply waits for the existing thread blocks to finish; the significance of this is that the computation of the GPU kernel is never interrupted, while the swap-out operation completes as quickly as possible.
7. The GPU quality-of-service guarantee method based on exclusive and reserved streaming multiprocessor cores according to claim 1, characterized in that: in step 11, the initial SM allocation numbers of the next GPU kernel are identical to those of the previous one.
8. The GPU quality-of-service guarantee method based on exclusive and reserved streaming multiprocessor cores according to claim 1, characterized in that: in step 4, the number of HWQs is 32, so the GPU can execute at most 32 kernels concurrently.
9. The GPU quality-of-service guarantee method based on exclusive and reserved streaming multiprocessor cores according to claim 1, characterized in that: in step 8, among the SMs occupied by a given GPU kernel, at most one needs to be swapped in or swapped out while the others keep their original state; for a reserved SM there is no swap-out decision, only the two cases of swap-in and no change.
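The invariants of claim 9 lend themselves to a simple plan check. This is an illustrative sketch with assumed data shapes (a per-kernel swap count and a list of operations applied to reserved SMs), not part of the claimed method:

```python
def check_epoch_plan(swaps_per_kernel, reserved_sm_ops):
    """Claim 9 invariants: each GPU kernel swaps at most one of its SMs per
    epoch, and a reserved SM is only ever swapped in or left unchanged."""
    for kernel, n_swaps in swaps_per_kernel.items():
        if n_swaps > 1:       # at most one swap per kernel per epoch
            return False
    for op in reserved_sm_ops:
        if op not in ("swap_in", "keep"):  # reserved SMs never swap out
            return False
    return True
```

A plan that swaps two SMs of one kernel in a single epoch, or swaps out a reserved SM, would be rejected by this check.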
CN201811325650.8A 2018-11-08 2018-11-08 A kind of GPU QoS guarantee method exclusive and reserved based on stream multiple processor cores CN109445565A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811325650.8A CN109445565A (en) 2018-11-08 2018-11-08 A kind of GPU QoS guarantee method exclusive and reserved based on stream multiple processor cores


Publications (1)

Publication Number Publication Date
CN109445565A true CN109445565A (en) 2019-03-08

Family

ID=65551962

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811325650.8A CN109445565A (en) 2018-11-08 2018-11-08 A kind of GPU QoS guarantee method exclusive and reserved based on stream multiple processor cores

Country Status (1)

Country Link
CN (1) CN109445565A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102591418A (en) * 2010-12-16 2012-07-18 微软公司 Scalable multimedia computer system architecture with qos guarantees
CN103336718A (en) * 2013-07-04 2013-10-02 北京航空航天大学 GPU thread scheduling optimization method
WO2016202153A1 (en) * 2015-06-19 2016-12-22 华为技术有限公司 Gpu resource allocation method and system
CN107357661A (en) * 2017-07-12 2017-11-17 北京航空航天大学 A kind of fine granularity GPU resource management method for mixed load
CN108694080A (en) * 2017-04-09 2018-10-23 英特尔公司 Efficient thread group scheduling
CN108733490A (en) * 2018-05-14 2018-11-02 上海交通大学 A kind of GPU vitualization QoS control system and method based on resource-sharing adaptive configuration


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Paula Aguilera et al., "QoS-Aware Dynamic Resource Allocation for Spatial-Multitasking GPUs", 2014 19th Asia and South Pacific Design Automation Conference (ASP-DAC) *
Zhang Weiwei, "Research and Application of High-Performance Computing Based on GPU", China Master's Theses Full-text Database, Information Science and Technology Series *

Similar Documents

Publication Publication Date Title
Zhang et al. Dynamic heterogeneity-aware resource provisioning in the cloud
US10089140B2 (en) Dynamically adaptive, resource aware system and method for scheduling
US20180212842A1 (en) Managing data center resources to achieve a quality of service
US9779042B2 (en) Resource management in a multicore architecture
JP5651214B2 (en) Scheduling in multi-core architecture
Zhong et al. Kernelet: High-throughput GPU kernel executions with dynamic slicing and scheduling
CN103605567B (en) Cloud computing task scheduling method facing real-time demand change
Elliott et al. GPUSync: A framework for real-time GPU management
US9003215B2 (en) Power-aware thread scheduling and dynamic use of processors
Mutlu et al. Stall-time fair memory access scheduling for chip multiprocessors
Kato et al. Semi-partitioned fixed-priority scheduling on multiprocessors
US9824003B2 (en) Dynamically resizable circular buffers
US8203562B1 (en) Apparatus, system, and method for distributing work to integrated heterogeneous processors
US8839259B2 (en) Thread scheduling on multiprocessor systems
JP5175335B2 (en) Priority-based throttling for power / performance quality of service
TWI410866B (en) Scheduling method,scheduling apparatus,multiprocessor system and scheduling program
US7962679B2 (en) Interrupt balancing for multi-core and power
US9542229B2 (en) Multiple core real-time task execution
Zhuravlev et al. Survey of scheduling techniques for addressing shared resources in multicore processors
KR101786768B1 (en) Graphics compute process scheduling
Feng et al. A model of hierarchical real-time virtual resources
US8707314B2 (en) Scheduling compute kernel workgroups to heterogeneous processors based on historical processor execution times and utilizations
US10579388B2 (en) Policies for shader resource allocation in a shader core
Kayıran et al. Neither more nor less: optimizing thread-level parallelism for GPGPUs
Chen et al. Accelerating MapReduce on a coupled CPU-GPU architecture

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination