CN109445565A - GPU QoS guarantee method based on streaming multiprocessor exclusivity and reservation - Google Patents
- Publication number
- CN109445565A (application CN201811325650.8A)
- Authority
- CN
- China
- Prior art keywords
- gpu
- kernel
- ipc
- delay
- epoch
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F1/00—Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
- G06F1/26—Power supply means, e.g. regulation thereof
- G06F1/32—Means for saving power
- G06F1/3203—Power management, i.e. event-based initiation of a power-saving mode
- G06F1/3234—Power saving characterised by the action undertaken
- G06F1/3243—Power saving in microcontroller unit
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
- G06F9/4881—Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
- G06F9/5038—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Memory System Of A Hierarchy Structure (AREA)
- Multi Processors (AREA)
Abstract
The invention discloses a GPU QoS guarantee method based on streaming multiprocessor (Streaming Multiprocessor, SM) exclusivity and reservation, comprising: 1) initially, partitioning the SMs among applications according to each application's QoS index; 2) at run time, periodically and dynamically adjusting the number of SMs assigned to each application through an effective decision-making process, so as to guarantee the applications' QoS indices; 3) during dynamic adjustment, identifying the type of each application (memory-intensive or compute-intensive) from the data collected in each period; 4) improving the energy efficiency or throughput of the GPU according to the application type. The invention fully exploits the performance potential of concurrent GPU kernels and, while effectively meeting the QoS indices, balances GPU energy efficiency and throughput to the greatest extent: for compute-intensive applications it fully raises the GPU's thread-level parallelism, and for memory-intensive applications it substantially reduces the GPU's energy consumption.
Description
Technical field
The present invention relates to concurrent kernel execution, on-chip resource management and thread block scheduling, and more particularly to a GPU QoS guarantee method based on streaming multiprocessor exclusivity and reservation.
Background technique
In high-performance computing, graphics processors (Graphics Processing Units, hereinafter GPU) are increasingly used to accelerate general-purpose computation. A GPU achieves high computing capability by exploiting massive thread-level parallelism (Thread Level Parallelism, hereinafter TLP). Streaming multiprocessors (Streaming Multiprocessors, hereinafter SM) carry out the GPU's computation; each SM contains many compute cores (Compute Cores) and resources such as registers (Registers), shared memory (Shared Memory) and the L1 cache (L1 Cache). The SMs share device memory (Device Memory) through an interconnect network (Interconnect Network). When multiple applications share a GPU, the GPU supports concurrent kernels: the kernel management unit (Kernel Management Unit, hereinafter KMU) distributes kernels to the kernel distributor unit (Kernel Distributor Unit, hereinafter KDU), and the KDU executes kernels in first-come-first-serve (FCFS) order. When multiple applications execute concurrently, however, the thread block scheduler only dispatches the next kernel after all thread blocks of the previous kernel have been scheduled, and it spreads thread blocks evenly across all SMs according to a round-robin (Round-Robin) policy.
With the explosive growth in the number of applications, improving throughput and energy efficiency when multiple concurrent applications share a GPU becomes particularly important. For multitasking on GPUs, academia and industry have proposed two major techniques: spatial multitasking (Spatial Multitasking) and simultaneous multitasking (Simultaneous Multitasking). In spatial multitasking, the GPU's SMs are divided into several disjoint subsets, and each subset is allocated to a different application running at the same time; simultaneous multitasking shares SM resources at fine granularity (Fine-grained Sharing), executing multiple applications simultaneously on a single SM. Current general-purpose GPUs can manage SM resources at chip level and therefore support spatial multitasking. In addition, when an application requires a guaranteed quality of service (Quality of Service, hereinafter QoS), enough resources must be allocated to meet its QoS requirement, which poses a greater challenge to the GPU architecture. Existing solutions that guarantee application QoS requirements while maximizing GPU throughput and energy efficiency fall broadly into the following two categories:
(1) Application execution models adapted to the GPU architecture
Work in this direction changes an application's default execution pattern to suit the GPU accelerator. One example is priority-based kernel scheduling: when multiple kernels are distributed to the GPU, kernels of high priority always execute first. Another abstracts all tasks submitted to the GPU into a task queue, manages and predicts the execution duration of each task on the CPU side, and reorders tasks to meet the applications' QoS requirements. A third uses a fine-grained SM resource model and a technique similar to persistent threads to reserve resources on the SMs, thereby limiting the resource occupancy of non-delay-sensitive applications; if the current reservation still cannot meet the QoS index, a dynamic resource adaptation module can be invoked to preempt resources held by currently executing tasks. Such methods generally optimize at kernel granularity and therefore handle long-running kernels poorly; moreover, the re-distribution latency and energy cost of kernel preemption can be large.
(2) QoS execution models at the GPU architecture level
Work in this direction builds on GPU multitasking and again divides into spatial multitasking and simultaneous multitasking. The spatial multitasking strategy estimates the performance of a delay-sensitive application at run time and then predicts the number of SMs each application needs through a linear performance model (Linear Performance Model). Simultaneous multitasking uses fine-grained QoS management: all applications are spread across all SMs, and delay-sensitive and non-delay-sensitive applications are given different quotas (Quota), so that resources on a single SM are allocated at fine granularity. The disadvantages of simultaneous multitasking are that it does not support power gating (Power Gating), so all SMs on the GPU are always occupied and energy consumption is high, and that when multiple kernels occupy the same SM, conflicts in the L1 cache (L1 Cache) degrade performance.
In summary, application execution models generally start from the software side, at the granularity of kernels or other GPU tasks, and adapt the application to the GPU architecture, while QoS execution models generally start from the GPU architecture and adapt it to various applications; the two kinds of model are compatible. As GPUs evolve, two trends are noteworthy: 1) the number of SMs is growing rapidly, with the recent Pascal and Volta architectures providing 56 SMs (Tesla P100) and 80 SMs (Tesla V100) respectively; 2) the per-SM resources, such as the register file, shared memory and L1 cache, have remained essentially unchanged. Given the continuing growth of SM counts in future GPU architectures, the present invention preempts and reserves at the granularity of whole SMs: it handles long-running kernels well and avoids frequent kernel preemption and re-distribution; and since each SM is monopolized by a single kernel, L1 cache conflicts are avoided and power gating can be used to reduce energy consumption.
Summary of the invention
The technical problem solved by the present invention: overcoming the deficiencies and defects of the prior art by providing a GPU QoS guarantee method based on streaming multiprocessor exclusivity and reservation, which satisfies the QoS indices of multiple applications executing on a GPU, fully exploits the potential of concurrent GPU kernels, and at the same time adaptively maximizes the GPU's energy efficiency or throughput.
The technical solution of the invention, a GPU quality-of-service guarantee method based on streaming multiprocessor exclusivity and reservation, comprises the following steps:
Step 1: multiple GPU applications are started on the CPU side; the user specifies which applications need guaranteed quality of service QoS (Quality of Service) and their QoS indices. The parallel part of an application, i.e. its GPU kernels, is loaded onto the GPU through runtime APIs (a GPU application maps one-to-many onto GPU kernels: one GPU application corresponds to one or more GPU kernels). According to the mark of its owning application, each GPU kernel is classified as a delay-sensitive kernel or a non-delay-sensitive kernel;
Step 2: each GPU kernel is identified by a software work queue SWQ (Software Work Queue) ID and pushed into the pending kernel pool (Pending Kernel Pool) of the grid management unit GMU (Grid Management Unit);
Step 3: GPU kernels with the same SWQ ID are mapped to the same hardware work queue HWQ (Hardware Work Queue). For the GPU kernel at the head of each HWQ, the QoS index of its owning application determines the initial streaming multiprocessor SM (Streaming Multiprocessor) allocation plan. Under the allocation plan an SM is in one of three states: occupied by a delay-sensitive kernel, occupied by a non-delay-sensitive kernel, or powered off and reserved;
Step 4: once the SM allocation plan is determined, the thread blocks (Thread Block) of the GPU kernel at the head of each HWQ are distributed to the SMs by the thread block scheduler;
Step 5: GPU kernels execute in units of a time span T_epoch. After each T_epoch, the instructions-per-cycle IPC (Instructions Per Cycle) of each GPU kernel is collected; the collected IPC comprises the total IPC from the start of the kernel's execution to the end of the current T_epoch, i.e. IPC_total, and the IPC of the current T_epoch alone, i.e. IPC_epoch;
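For illustration only, the following C++ sketch shows how the per-epoch statistics of step 5 could be maintained; the structure and function names are hypothetical and are not part of the patented hardware.

```cpp
#include <cstdint>

// Per-kernel IPC bookkeeping over fixed epochs of T_epoch cycles.
struct KernelStats {
    uint64_t instsTotal = 0;   // I_total: instructions executed since the kernel started
    uint64_t epochs     = 0;   // N_epoch: number of completed epochs
    double   ipcTotal   = 0.0; // IPC_total
    double   ipcEpoch   = 0.0; // IPC_epoch
};

// Called once per kernel at the end of each epoch of tEpoch cycles.
void endOfEpoch(KernelStats& s, uint64_t instsThisEpoch, uint64_t tEpoch) {
    s.instsTotal += instsThisEpoch;
    s.epochs     += 1;
    s.ipcEpoch    = double(instsThisEpoch) / double(tEpoch);          // IPC of this epoch alone
    s.ipcTotal    = double(s.instsTotal) / double(s.epochs * tEpoch); // IPC since the start
}
```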
Step 6: after the IPC information is obtained, a decision algorithm determines the SM allocation plan of each GPU kernel for the next T_epoch. The SM allocation plan depends on the GPU kernel type: compute-intensive CI (Compute-intensive) or memory-intensive MI (Memory-intensive);
Step 7: the SM allocation plan is iterated until the current optimal SM allocation number SM_optimal for the non-delay-sensitive kernel has been determined; when the kernel is assigned SM_optimal SMs, the performance and energy consumption of the non-delay-sensitive kernel reach a desirable balance, so that throughput or energy efficiency is improved to the greatest extent;
Step 8: after the SM allocation plan is obtained, the decision algorithm further determines the numbers of the SMs that need to be swapped in (Swap-in) and swapped out (Swap-out): a swapped-out SM is no longer occupied by its former GPU kernel during the next T_epoch, while a swapped-in SM is occupied by its target GPU kernel during the next T_epoch;
Step 9: when an SM is marked to be swapped in or out, there are two cases: first, the SM is in the reserved state, in which case it is swapped in directly, i.e. occupied by the target GPU kernel; second, the SM is occupied by another GPU kernel, in which case the swap-out and swap-in operations are performed only after all thread blocks on that SM have finished executing;
Step 10: only after all SM swap-ins and swap-outs have completed does the timing of the next T_epoch begin, so as to ensure the accuracy of the next round of data collection;
Step 11: if each GPU application executes exactly one GPU kernel, repeat steps 5 to 10 until all GPU kernels finish executing; if some application executes multiple GPU kernels, the initial SM allocation numbers of its next GPU kernel are the same as last time, and steps 5 to 10 are repeated until the currently distributed GPU kernel finishes executing;
Step 12: repeat step 11 until all GPU applications finish executing.
In step 1, the user-specified applications requiring guaranteed quality of service and their QoS indices are realized as follows:
(1) whether an application needs guaranteed quality of service is expressed as IsQoS, specifically:
(1-1) for a delay-sensitive application, IsQoS=true for all GPU kernels of that application;
(1-2) for a non-delay-sensitive application, IsQoS=false for all GPU kernels of that application;
(2) the QoS index is expressed as IPC_goal; IPC_isolated is the IPC obtained when all SMs are allocated to a given delay-sensitive application. Define α_k as the ratio of IPC_goal to IPC_isolated, calculated as follows:
IPC_goal = IPC_isolated × α_k, where α_k ∈ (0,1).
In step 3, the initial streaming multiprocessor SM allocation is determined assuming that the numbers of concurrently executing delay-sensitive and non-delay-sensitive applications are both 1; the calculation, illustrated by the sketch below, is as follows:
(1) for the delay-sensitive kernel, SM_QoS = ⌈α_k × SM_total⌉;
(2) for the non-delay-sensitive kernel, SM_k = SM_total − SM_QoS,
where SM_total is the total number of SMs in the GPU, SM_k is the number of SMs occupied by a given GPU kernel, SM_QoS is the number of SMs occupied by the delay-sensitive kernel, and α_k is the ratio of IPC_goal to IPC_isolated.
In step 6, the decision algorithm determines each GPU kernel's SM allocation plan for the next T_epoch as follows (a sketch of this decision logic follows the rules below):
Define I_total as the total number of instructions a GPU kernel has executed from the start of execution to the end of the current T_epoch, and N_epoch as the total number of T_epoch periods from the start of execution to the end of the current T_epoch; Swap-in denotes swapping in one SM and Swap-out denotes swapping out one SM. The calculation is as follows:
(1) for the delay-sensitive kernel:
(1-1) if I_total/[(N_epoch+1) × T_epoch] > IPC_goal, it is marked Swap-out;
(1-2) if IPC_total < IPC_goal, it is marked Swap-in;
(2) for the non-delay-sensitive kernel, the decision is made on the basis of the delay-sensitive kernel's decision; there are three cases:
(2-1) the delay-sensitive kernel requests to swap in an SM. A reserved SM is preferred, which avoids the overhead of a swap-out operation; if no SM is reserved, the non-delay-sensitive kernel is marked Swap-out, i.e. it swaps out one SM, and that SM is then occupied by the delay-sensitive kernel;
(2-2) the delay-sensitive kernel requests to swap out an SM. The decision algorithm then determines whether the non-delay-sensitive kernel needs to swap in an SM; if so, the non-delay-sensitive kernel is marked Swap-in and the SM swapped out by the delay-sensitive kernel is occupied by the non-delay-sensitive kernel; otherwise that SM is powered off and reserved;
(2-3) the delay-sensitive kernel requests no Swap operation. The decision algorithm then determines whether the non-delay-sensitive kernel needs to swap out an SM; if so, the non-delay-sensitive kernel is marked Swap-out and the SM it swaps out is powered off and reserved.
In step 7, the allocation plan is iterated until the current optimal SM allocation number SM_optimal for the non-delay-sensitive kernel has been determined. A threshold value Threshold, defined in hardware or software, judges whether an SM should be swapped out into the reserved state; the scope of Threshold covers only the non-delay-sensitive kernel. Define IPC_last as the IPC_epoch of the last T_epoch. To obtain SM_optimal within as short a period as possible, two flag bits Upper_k and Lower_k are also needed, where Upper_k indicates whether SM_optimal has reached its upper bound and Lower_k indicates whether SM_optimal has reached its lower bound. The specific rules, sketched in code after this list, are as follows:
(1) whenever the non-delay-sensitive kernel performs a Swap-in or Swap-out operation, IPC_last is saved;
(2) if the non-delay-sensitive kernel swapped in an SM at the end of the last T_epoch, the calculation is as follows:
(2-1) if IPC_last ≥ IPC_epoch × (1 − Threshold), then Upper_k = true and SM_optimal has reached its upper bound;
(2-2) if IPC_last < IPC_epoch × (1 − Threshold), then Lower_k = true, SM_optimal has reached its lower bound, and SM_optimal = SM_k;
(3) if the non-delay-sensitive kernel swapped out an SM at the end of the last T_epoch, the calculation is as follows:
(3-1) if IPC_last > IPC_epoch × (1 + Threshold), then Lower_k = true and SM_optimal has reached its lower bound;
(3-2) if IPC_last ≤ IPC_epoch × (1 + Threshold), then Upper_k = true, SM_optimal has reached its upper bound, and SM_optimal = SM_k;
(4) if Lower_k = true and Upper_k = true, the value of SM_optimal is determined and will not change during the GPU kernel's subsequent execution periods.
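A sketch of rules (1)-(4), assuming the update runs at the end of the epoch that immediately follows a swap by the non-delay-sensitive kernel; all field names are hypothetical.

```cpp
struct OptimalSearch {
    bool   upperK    = false; // Upper_k
    bool   lowerK    = false; // Lower_k
    double ipcLast   = 0.0;   // IPC_last, saved whenever a swap is performed (rule (1))
    int    smOptimal = -1;    // SM_optimal; -1 while still undetermined
};

// ipcEpoch is IPC_epoch of the epoch that just ended; smK is the current SM_k.
void updateBounds(OptimalSearch& st, bool swappedIn, double ipcEpoch,
                  int smK, double threshold) {
    if (swappedIn) {  // rule (2): an SM was swapped in last epoch
        if (st.ipcLast >= ipcEpoch * (1.0 - threshold)) st.upperK = true;  // (2-1): no gain
        else { st.lowerK = true; st.smOptimal = smK; }                     // (2-2): clear gain
    } else {          // rule (3): an SM was swapped out last epoch
        if (st.ipcLast > ipcEpoch * (1.0 + threshold)) st.lowerK = true;   // (3-1): clear loss
        else { st.upperK = true; st.smOptimal = smK; }                     // (3-2): no loss
    }
    // Rule (4): once upperK && lowerK, smOptimal is final for this kernel.
}
```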
In step 9, when all thread blocks on an SM must finish before the swap-out and swap-in operations execute, the thread block scheduler stops allocating thread blocks to that SM and merely waits for the resident thread blocks to finish. The point of this is that the kernel's computation is not interrupted, while the swap-out operation completes as quickly as possible, as sketched below.
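A minimal sketch of this drain-then-swap rule: a victim SM receives no new thread blocks and changes owner only after its resident blocks retire. The state fields are hypothetical.

```cpp
struct SmState {
    int  residentBlocks = 0;     // thread blocks currently executing on this SM
    bool markedForSwap  = false; // set when the SM is chosen as a swap victim
    int  ownerKernel    = -1;    // -1 means powered off / reserved
};

// The thread block scheduler consults this before dispatching a new block.
bool canDispatchTo(const SmState& sm) {
    return !sm.markedForSwap;    // no new thread blocks for a victim SM
}

// Called whenever a thread block on this SM finishes.
void onBlockRetired(SmState& sm, int targetKernel) {
    if (--sm.residentBlocks == 0 && sm.markedForSwap) {
        sm.ownerKernel   = targetKernel; // swap-out completes; swap-in happens here
        sm.markedForSwap = false;
    }
}
```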
In step 11, the initial SM allocation numbers of an application's next GPU kernel are the same as last time. The point of this is: 1) different kernels of the same application are generally similar, so SM resources such as registers (Registers), shared memory (Shared Memory) and the L1 cache (L1 Cache) can be fully reused; 2) SM swap-out operations are eliminated, reducing kernel distribution latency.
In step 4, the number of HWQs is 32, so the GPU can concurrently execute at most 32 kernels.
In step 8, among the SMs occupied by a given GPU kernel, at most one needs to be swapped in or out; the others keep their original state. A reserved SM has no swap-out decision: it is either swapped in or left unchanged.
The advantage of the present invention over the prior art is that it fully exploits the performance potential of concurrent GPU kernels and, while effectively meeting the QoS indices, balances GPU energy efficiency and throughput to the greatest extent: for compute-intensive applications it fully raises the GPU's thread-level parallelism, and for memory-intensive applications it substantially reduces the GPU's energy consumption.
Detailed description of the invention
Fig. 1 is the hardware architecture diagram for realizing the proposed method;
Fig. 2 is a schematic diagram of the kernel distribution and thread block scheduling strategy proposed by the present invention;
Fig. 3 is a flow diagram of the dynamic SM allocation proposed by the present invention;
Fig. 4 is a schematic diagram of the strategy by which a non-delay-sensitive kernel confirms its optimal SM allocation number.
Specific embodiment
To make the objectives, technical solutions and advantages of the present invention clearer, the invention is further elaborated below in conjunction with the accompanying drawings and examples. It should be appreciated that the specific examples described here only explain the present invention and do not limit it. In addition, the technical features involved in the various embodiments described below can be combined with each other as long as they do not conflict.
The hardware architecture of the invention is shown in Fig. 1, where SMQoS and the SMQoS interface are the modules newly added by the present invention on top of the original GPU.
The specific implementation steps of the present invention, shown in Fig. 1, are as follows:
Step 1: multiple GPU applications are started on the CPU side; the user specifies which applications need guaranteed quality of service QoS (Quality of Service) and their QoS indices. The parallel part of an application, i.e. its GPU kernels, is loaded onto the GPU through runtime APIs (a GPU application maps one-to-many onto GPU kernels: one GPU application corresponds to one or more GPU kernels). According to the mark of its owning application, each GPU kernel is classified as a delay-sensitive kernel or a non-delay-sensitive kernel. The present invention requires the newly added API shown in Table 1: cudaSetQoS transmits the QoS parameters set on the CPU side to the GPU, so that the GPU supports the application's QoS requirement.
Table 1 lists the APIs newly added by the proposed method;
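Since Table 1 itself is not reproduced in this text, the prototype below is an assumption made for illustration only; it sketches how host code might tag a CUDA stream with a QoS requirement through the new API before launching the kernels of a delay-sensitive application.

```cpp
#include <cuda_runtime.h>

// Assumed prototype of the newly added API; Table 1 defines the actual
// signature, so this declaration is purely illustrative.
extern cudaError_t cudaSetQoS(cudaStream_t stream, bool isQoS, float alphaK);

void configureQoSStream(cudaStream_t stream) {
    // Mark every kernel issued to this stream as delay-sensitive with
    // QoS index IPC_goal = alpha_k * IPC_isolated, here alpha_k = 0.5.
    cudaSetQoS(stream, /*isQoS=*/true, /*alphaK=*/0.5f);
    // Subsequent launches on `stream` would then inherit the QoS tag, e.g.:
    // myKernel<<<grid, block, 0, stream>>>(args...);
}
```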
Step 2: each GPU kernel is identified by a software work queue SWQ (Software Work Queue) ID and pushed into the pending kernel pool (Pending Kernel Pool) of the grid management unit GMU (Grid Management Unit);
Step 3: GPU kernels with the same SWQ ID are mapped to the same hardware work queue HWQ (Hardware Work Queue). For the GPU kernel at the head of each HWQ, the QoS index of its owning application determines the initial streaming multiprocessor SM (Streaming Multiprocessor) allocation plan. The kernel distribution and thread block scheduling strategy proposed by the present invention is shown in Fig. 2. Each SM is monopolized by one kernel, and SM swap-in (Swap-in) and swap-out (Swap-out) operations are supported, so that the GPU supports the applications' QoS demands: for compute-intensive applications the GPU's throughput is maximized, and for memory-intensive applications SMs can be powered off to reduce energy consumption. In all following steps, an SM under the allocation plan is in one of three states: occupied by a delay-sensitive kernel, occupied by a non-delay-sensitive kernel, or powered off and reserved;
Step 4: once the SM allocation plan is determined, the thread blocks (Thread Block) of the GPU kernel at the head of each HWQ are distributed to the SMs by the thread block scheduler;
Step 5: GPU kernels execute in units of a time span T_epoch. After each T_epoch, the instructions-per-cycle IPC (Instructions Per Cycle) of each GPU kernel is collected; the collected IPC comprises the total IPC from the start of the kernel's execution to the end of the current T_epoch, i.e. IPC_total, and the IPC of the current T_epoch alone, i.e. IPC_epoch;
Step 6: after the IPC information is obtained, a decision algorithm determines the SM allocation plan of each GPU kernel for the next T_epoch. The SM allocation plan depends on the GPU kernel type: compute-intensive CI (Compute-intensive) or memory-intensive MI (Memory-intensive). The dynamic SM allocation flow proposed by the present invention is shown in Fig. 3; a worked example follows the steps below. The specific steps are as follows:
(1) The previous T_epoch finishes execution.
(2) The collected data is fed back to SMQoS.
(3) SMQoS updates the kernel information according to its scheduling strategy and determines the SM allocation plan for the next T_epoch.
(4) SMQoS judges whether the delay-sensitive kernel needs to swap in an SM. If so, go to step (5); otherwise go to step (6).
(5) SMQoS judges whether any SM is reserved. If an SM is reserved, go to step (14); otherwise go to step (13).
(6) SMQoS judges whether the delay-sensitive kernel needs to swap out an SM. If so, go to step (7); otherwise go to step (12).
(7) The delay-sensitive kernel swaps out an SM.
(8) SMQoS judges whether the number of SMs occupied by the non-delay-sensitive kernel (SM_k) is less than the optimal allocation number (SM_optimal). If SM_k < SM_optimal, go to step (10); otherwise go to step (9).
(9) SMQoS judges whether SM_optimal has reached its upper bound (Upper_k). If Upper_k = false, go to step (10); otherwise go to step (11).
(10) The non-delay-sensitive kernel swaps in an SM.
(11) The SM swapped out by the delay-sensitive kernel is powered off and its state becomes reserved.
(12) SMQoS judges whether SM_optimal has reached its lower bound (Lower_k). If Lower_k = false, go to step (13).
(13) The non-delay-sensitive kernel swaps out an SM.
(14) The delay-sensitive kernel swaps in an SM.
(15) SMQoS synchronizes the SM operations and waits for all SM replacements to complete.
(16) The next T_epoch begins execution.
Note that when the QoS kernel requests to swap in an SM and no SM is reserved, the QoS kernel waits for the non-QoS kernel to swap out an SM. Step (15) avoids a possible deadlock in the SM replacement process;
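A worked example of one decision round, assuming the SwapDecision helpers sketched after step 6 are in scope; the numbers are invented to exercise the flow (4)→(13)→(14): the QoS kernel is behind its target and no SM is reserved, so the non-QoS kernel yields one SM.

```cpp
#include <cassert>

void exampleRound() {
    double tEpoch = 10000.0, nEpoch = 4.0;
    double iTotal = 3.2e5;                         // instructions executed so far
    double ipcTotal = iTotal / (nEpoch * tEpoch);  // IPC_total = 8.0
    double ipcGoal = 9.0;                          // IPC_goal derived from alpha_k

    SwapDecision qos = decideQoS(iTotal, nEpoch, tEpoch, ipcTotal, ipcGoal);
    assert(qos == SwapDecision::SwapIn);           // 8.0 < 9.0 -> needs another SM

    SwapDecision other = decideNonQoS(qos, /*reservedSmAvailable=*/false,
                                      /*wantsSwapIn=*/false, /*wantsSwapOut=*/false);
    assert(other == SwapDecision::SwapOut);        // no reserve -> non-QoS kernel yields an SM
}
```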
Step 7: the SM allocation plan is iterated until the current optimal SM allocation number SM_optimal for the non-delay-sensitive kernel has been determined; when the kernel is assigned SM_optimal SMs, the performance and energy consumption of the non-delay-sensitive kernel reach a desirable balance, so that throughput or energy efficiency is improved to the greatest extent. The strategy proposed by the present invention for a non-delay-sensitive application to confirm the optimal SM allocation number (SM_optimal) is shown in Fig. 4. The specific steps are as follows:
(1) Whenever the non-delay-sensitive kernel performs a Swap-in or Swap-out operation, IPC_last is saved.
(2) If the non-delay-sensitive kernel swapped in an SM at the end of the last T_epoch, the calculation is as follows:
(2-1) if IPC_last ≥ IPC_epoch × (1 − Threshold), then Upper_k = true and SM_optimal has reached its upper bound;
(2-2) if IPC_last < IPC_epoch × (1 − Threshold), then Lower_k = true, SM_optimal has reached its lower bound, and SM_optimal = SM_k.
(3) If the non-delay-sensitive kernel swapped out an SM at the end of the last T_epoch, the calculation is as follows:
(3-1) if IPC_last > IPC_epoch × (1 + Threshold), then Lower_k = true and SM_optimal has reached its lower bound;
(3-2) if IPC_last ≤ IPC_epoch × (1 + Threshold), then Upper_k = true, SM_optimal has reached its upper bound, and SM_optimal = SM_k.
(4) If Lower_k = true and Upper_k = true, the value of SM_optimal is determined and will not change during the kernel's subsequent execution periods.
Step 8: after the SM allocation plan is obtained, the decision algorithm further determines the numbers of the SMs that need to be swapped in (Swap-in) and swapped out (Swap-out): a swapped-out SM is no longer occupied by its former GPU kernel during the next T_epoch, while a swapped-in SM is occupied by its target GPU kernel during the next T_epoch;
Step 9: when an SM is marked to be swapped in or out, there are two cases: first, the SM is in the reserved state, in which case it is swapped in directly, i.e. occupied by the target GPU kernel; second, the SM is occupied by another GPU kernel, in which case the swap-out and swap-in operations are performed only after all thread blocks on that SM have finished executing;
Step 10: only after all SM swap-ins and swap-outs have completed does the timing of the next T_epoch begin, so as to ensure the accuracy of the next round of data collection;
Step 11: if each GPU application executes exactly one GPU kernel, repeat steps 5 to 10 until all GPU kernels finish executing; if some application executes multiple GPU kernels, the initial SM allocation numbers of its next GPU kernel are the same as last time, and steps 5 to 10 are repeated until the currently distributed GPU kernel finishes executing;
Step 12: repeat step 11 until all GPU applications finish executing.
Claims (9)
1. A GPU QoS guarantee method based on streaming multiprocessor exclusivity and reservation, characterized by comprising the following steps:
Step 1: multiple GPU applications are started on the CPU side; the user specifies which applications need guaranteed quality of service QoS (Quality of Service) and their QoS indices; the parallel part of an application, i.e. its GPU kernels, is loaded onto the GPU through runtime APIs (a GPU application maps one-to-many onto GPU kernels: one GPU application corresponds to one or more GPU kernels); according to the mark of its owning application, each GPU kernel is classified as a delay-sensitive kernel or a non-delay-sensitive kernel;
Step 2: each GPU kernel is identified by a software work queue SWQ (Software Work Queue) ID and pushed into the pending kernel pool (Pending Kernel Pool) of the grid management unit GMU (Grid Management Unit);
Step 3: GPU kernels with the same SWQ ID are mapped to the same hardware work queue HWQ (Hardware Work Queue); for the GPU kernel at the head of each HWQ, the QoS index of its owning application determines the initial streaming multiprocessor SM (Streaming Multiprocessor) allocation plan; under the allocation plan an SM is in one of three states: occupied by a delay-sensitive kernel, occupied by a non-delay-sensitive kernel, or powered off and reserved;
Step 4: once the SM allocation plan is determined, the thread blocks (Thread Block) of the GPU kernel at the head of each HWQ are distributed to the SMs by the thread block scheduler;
Step 5: GPU kernels execute in units of a time span T_epoch; after each T_epoch, the instructions-per-cycle IPC (Instructions Per Cycle) of each GPU kernel is collected; the collected IPC comprises the total IPC from the start of the kernel's execution to the end of the current T_epoch, i.e. IPC_total, and the IPC of the current T_epoch alone, i.e. IPC_epoch;
Step 6: after the IPC information is obtained, a decision algorithm determines the SM allocation plan of each GPU kernel for the next T_epoch; the SM allocation plan depends on the GPU kernel type: compute-intensive CI (Compute-intensive) or memory-intensive MI (Memory-intensive);
Step 7: the SM allocation plan is iterated until the current optimal SM allocation number SM_optimal for the non-delay-sensitive kernel has been determined; when the kernel is assigned SM_optimal SMs, the performance and energy consumption of the non-delay-sensitive kernel reach a desirable balance, so that throughput or energy efficiency is improved to the greatest extent;
Step 8: after the SM allocation plan is obtained, the decision algorithm further determines the numbers of the SMs that need to be swapped in (Swap-in) and swapped out (Swap-out): a swapped-out SM is no longer occupied by its former GPU kernel during the next T_epoch, while a swapped-in SM is occupied by its target GPU kernel during the next T_epoch;
Step 9: when an SM is marked to be swapped in or out, there are two cases: first, the SM is in the reserved state, in which case it is swapped in directly, i.e. occupied by the target GPU kernel; second, the SM is occupied by another GPU kernel, in which case the swap-out and swap-in operations are performed only after all thread blocks on that SM have finished executing;
Step 10: only after all SM swap-ins and swap-outs have completed does the timing of the next T_epoch begin, so as to ensure the accuracy of the next round of data collection;
Step 11: if each GPU application executes exactly one GPU kernel, repeat steps 5 to 10 until all GPU kernels finish executing; if some application executes multiple GPU kernels, the initial SM allocation numbers of its next GPU kernel are exactly the same as last time, and steps 5 to 10 are repeated until the currently distributed GPU kernel finishes executing;
Step 12: repeat step 11 until all GPU applications finish executing.
2. The GPU quality-of-service guarantee method based on streaming multiprocessor exclusivity and reservation according to claim 1, characterized in that in step 1 the user-specified applications requiring guaranteed quality of service and their QoS indices are realized as follows:
(1) whether an application needs guaranteed quality of service is expressed as IsQoS, specifically:
(1-1) for a delay-sensitive application, IsQoS=true for all GPU kernels of that application;
(1-2) for a non-delay-sensitive application, IsQoS=false for all GPU kernels of that application;
(2) the QoS index is expressed as IPC_goal; IPC_isolated is the IPC obtained when all SMs are allocated to a given delay-sensitive application; define α_k as the ratio of IPC_goal to IPC_isolated, calculated as follows:
IPC_goal = IPC_isolated × α_k, where α_k ∈ (0,1).
3. The GPU quality-of-service guarantee method based on streaming multiprocessor exclusivity and reservation according to claim 1, characterized in that in step 3 the initial streaming multiprocessor SM allocation is determined assuming that the numbers of concurrently executing delay-sensitive and non-delay-sensitive applications are both 1, calculated as follows:
(1) for the delay-sensitive kernel, SM_QoS = ⌈α_k × SM_total⌉;
(2) for the non-delay-sensitive kernel, SM_k = SM_total − SM_QoS,
where SM_total is the total number of SMs in the GPU, SM_k is the number of SMs occupied by a given GPU kernel, SM_QoS is the number of SMs occupied by the delay-sensitive kernel, and α_k is the ratio of IPC_goal to IPC_isolated.
4. The GPU quality-of-service guarantee method based on streaming multiprocessor exclusivity and reservation according to claim 1, characterized in that in step 6 the decision algorithm determines each GPU kernel's SM allocation plan for the next T_epoch as follows:
define I_total as the total number of instructions a GPU kernel has executed from the start of execution to the end of the current T_epoch, and N_epoch as the total number of T_epoch periods from the start of execution to the end of the current T_epoch; Swap-in denotes swapping in one SM and Swap-out denotes swapping out one SM; the calculation is as follows:
(1) for the delay-sensitive kernel:
(1-1) if I_total/[(N_epoch+1) × T_epoch] > IPC_goal, it is marked Swap-out;
(1-2) if IPC_total < IPC_goal, it is marked Swap-in;
(2) for the non-delay-sensitive kernel, the decision is made on the basis of the delay-sensitive kernel's decision; there are three cases:
(2-1) the delay-sensitive kernel requests to swap in an SM; a reserved SM is preferred, which avoids the overhead of a swap-out operation; if no SM is reserved, the non-delay-sensitive kernel is marked Swap-out, i.e. it swaps out one SM, and that SM is then occupied by the delay-sensitive kernel;
(2-2) the delay-sensitive kernel requests to swap out an SM; the decision algorithm then determines whether the non-delay-sensitive kernel needs to swap in an SM; if so, the non-delay-sensitive kernel is marked Swap-in and the SM swapped out by the delay-sensitive kernel is occupied by the non-delay-sensitive kernel; otherwise that SM is powered off and reserved;
(2-3) the delay-sensitive kernel requests no Swap operation; the decision algorithm then determines whether the non-delay-sensitive kernel needs to swap out an SM; if so, the non-delay-sensitive kernel is marked Swap-out and the SM it swaps out is powered off and reserved.
5. The GPU quality-of-service guarantee method based on streaming multiprocessor exclusivity and reservation according to claim 1, characterized in that in step 7 the allocation plan is iterated until the current optimal SM allocation number SM_optimal for the non-delay-sensitive kernel has been determined; a threshold value Threshold, defined in hardware or software, judges whether an SM should be swapped out into the reserved state, where the scope of Threshold covers only the non-delay-sensitive kernel; define IPC_last as the IPC_epoch of the last T_epoch; to obtain SM_optimal within as short a period as possible, two flag bits Upper_k and Lower_k are also needed, where Upper_k indicates whether SM_optimal has reached its upper bound and Lower_k indicates whether SM_optimal has reached its lower bound; the specific rules are as follows:
(1) whenever the non-delay-sensitive kernel performs a Swap-in or Swap-out operation, IPC_last is saved;
(2) if the non-delay-sensitive kernel swapped in an SM at the end of the last T_epoch, the calculation is as follows:
(2-1) if IPC_last ≥ IPC_epoch × (1 − Threshold), then Upper_k = true and SM_optimal has reached its upper bound;
(2-2) if IPC_last < IPC_epoch × (1 − Threshold), then Lower_k = true, SM_optimal has reached its lower bound, and SM_optimal = SM_k;
(3) if the non-delay-sensitive kernel swapped out an SM at the end of the last T_epoch, the calculation is as follows:
(3-1) if IPC_last > IPC_epoch × (1 + Threshold), then Lower_k = true and SM_optimal has reached its lower bound;
(3-2) if IPC_last ≤ IPC_epoch × (1 + Threshold), then Upper_k = true, SM_optimal has reached its upper bound, and SM_optimal = SM_k;
(4) if Lower_k = true and Upper_k = true, the value of SM_optimal is determined and will not change during the GPU kernel's subsequent execution periods.
6. The GPU quality-of-service guarantee method based on streaming multiprocessor exclusivity and reservation according to claim 1, characterized in that in step 9, when all thread blocks on an SM must finish before the swap-in or swap-out operation executes, the thread block scheduler no longer allocates thread blocks to that SM and merely waits for the resident thread blocks to finish; the point of this is that the GPU kernel's computation is not interrupted, while the swap-out operation completes as quickly as possible.
7. The GPU quality-of-service guarantee method based on streaming multiprocessor exclusivity and reservation according to claim 1, characterized in that in step 11 the initial SM allocation numbers of an application's next GPU kernel are the same as last time.
8. The GPU quality-of-service guarantee method based on streaming multiprocessor exclusivity and reservation according to claim 1, characterized in that in step 4 the number of HWQs is 32, so the GPU can concurrently execute at most 32 kernels.
9. The GPU quality-of-service guarantee method based on streaming multiprocessor exclusivity and reservation according to claim 1, characterized in that in step 8, among the SMs occupied by a given GPU kernel, at most one needs to be swapped in or out, while the others keep their original state; a reserved SM has no swap-out decision: it is either swapped in or left unchanged.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811325650.8A CN109445565B (en) | 2018-11-08 | 2018-11-08 | GPU service quality guarantee method based on monopolization and reservation of kernel of stream multiprocessor |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811325650.8A CN109445565B (en) | 2018-11-08 | 2018-11-08 | GPU service quality guarantee method based on monopolization and reservation of kernel of stream multiprocessor |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109445565A true CN109445565A (en) | 2019-03-08 |
CN109445565B CN109445565B (en) | 2020-09-15 |
Family
ID=65551962
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811325650.8A Expired - Fee Related CN109445565B (en) | 2018-11-08 | 2018-11-08 | GPU service quality guarantee method based on monopolization and reservation of kernel of stream multiprocessor |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109445565B (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109992385A (en) * | 2019-03-19 | 2019-07-09 | 四川大学 | A kind of inside GPU energy consumption optimization method of task based access control balance dispatching |
CN113282536A (en) * | 2021-07-26 | 2021-08-20 | 浙江毫微米科技有限公司 | Data processing system and computer equipment based on memory intensive algorithm |
CN115617499A (en) * | 2022-12-20 | 2023-01-17 | 深流微智能科技(深圳)有限公司 | System and method for GPU multi-core hyper-threading technology |
CN116820784A (en) * | 2023-08-30 | 2023-09-29 | 杭州谐云科技有限公司 | GPU real-time scheduling method and system for reasoning task QoS |
CN117215802A (en) * | 2023-11-07 | 2023-12-12 | 四川并济科技有限公司 | GPU management and calling method for virtualized network function |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102591418A (en) * | 2010-12-16 | 2012-07-18 | 微软公司 | Scalable multimedia computer system architecture with qos guarantees |
CN103197917A (en) * | 2011-11-10 | 2013-07-10 | 辉达公司 | Compute thread array granularity execution preemption |
CN103336718A (en) * | 2013-07-04 | 2013-10-02 | 北京航空航天大学 | GPU thread scheduling optimization method |
WO2016202153A1 (en) * | 2015-06-19 | 2016-12-22 | 华为技术有限公司 | Gpu resource allocation method and system |
CN106569895A (en) * | 2016-10-24 | 2017-04-19 | 华南理工大学 | Construction method of multi-tenant big data platform based on container |
CN107357661A (en) * | 2017-07-12 | 2017-11-17 | 北京航空航天大学 | A kind of fine granularity GPU resource management method for mixed load |
CN108595258A (en) * | 2018-05-02 | 2018-09-28 | 北京航空航天大学 | A kind of GPGPU register files dynamic expansion method |
CN108694080A (en) * | 2017-04-09 | 2018-10-23 | 英特尔公司 | Efficient thread group scheduling |
CN108733490A (en) * | 2018-05-14 | 2018-11-02 | 上海交通大学 | A kind of GPU vitualization QoS control system and method based on resource-sharing adaptive configuration |
- 2018-11-08: CN CN201811325650.8A patent/CN109445565B/en not_active Expired - Fee Related
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102591418A (en) * | 2010-12-16 | 2012-07-18 | 微软公司 | Scalable multimedia computer system architecture with qos guarantees |
CN103197917A (en) * | 2011-11-10 | 2013-07-10 | 辉达公司 | Compute thread array granularity execution preemption |
CN103336718A (en) * | 2013-07-04 | 2013-10-02 | 北京航空航天大学 | GPU thread scheduling optimization method |
WO2016202153A1 (en) * | 2015-06-19 | 2016-12-22 | 华为技术有限公司 | Gpu resource allocation method and system |
CN106569895A (en) * | 2016-10-24 | 2017-04-19 | 华南理工大学 | Construction method of multi-tenant big data platform based on container |
CN108694080A (en) * | 2017-04-09 | 2018-10-23 | 英特尔公司 | Efficient thread group scheduling |
CN107357661A (en) * | 2017-07-12 | 2017-11-17 | 北京航空航天大学 | A kind of fine granularity GPU resource management method for mixed load |
CN108595258A (en) * | 2018-05-02 | 2018-09-28 | 北京航空航天大学 | A kind of GPGPU register files dynamic expansion method |
CN108733490A (en) * | 2018-05-14 | 2018-11-02 | 上海交通大学 | A kind of GPU vitualization QoS control system and method based on resource-sharing adaptive configuration |
Non-Patent Citations (2)
Title |
---|
PAULA AGUILERA ET AL.: "QoS-Aware Dynamic Resource Allocation for Spatial-Multitasking GPUs", 2014 19th Asia and South Pacific Design Automation Conference *
ZHANG WEIWEI: "Research and Application of High Performance Computing Based on GPU", China Master's Theses Full-text Database, Information Science and Technology *
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109992385A (en) * | 2019-03-19 | 2019-07-09 | 四川大学 | A kind of inside GPU energy consumption optimization method of task based access control balance dispatching |
CN109992385B (en) * | 2019-03-19 | 2021-05-14 | 四川大学 | GPU internal energy consumption optimization method based on task balance scheduling |
CN113282536A (en) * | 2021-07-26 | 2021-08-20 | 浙江毫微米科技有限公司 | Data processing system and computer equipment based on memory intensive algorithm |
CN113282536B (en) * | 2021-07-26 | 2021-11-30 | 浙江毫微米科技有限公司 | Data processing system and computer equipment based on memory intensive algorithm |
CN115617499A (en) * | 2022-12-20 | 2023-01-17 | 深流微智能科技(深圳)有限公司 | System and method for GPU multi-core hyper-threading technology |
CN115617499B (en) * | 2022-12-20 | 2023-03-31 | 深流微智能科技(深圳)有限公司 | System and method for GPU multi-core hyper-threading technology |
CN116820784A (en) * | 2023-08-30 | 2023-09-29 | 杭州谐云科技有限公司 | GPU real-time scheduling method and system for reasoning task QoS |
CN116820784B (en) * | 2023-08-30 | 2023-11-07 | 杭州谐云科技有限公司 | GPU real-time scheduling method and system for reasoning task QoS |
CN117215802A (en) * | 2023-11-07 | 2023-12-12 | 四川并济科技有限公司 | GPU management and calling method for virtualized network function |
CN117215802B (en) * | 2023-11-07 | 2024-02-09 | 四川并济科技有限公司 | GPU management and calling method for virtualized network function |
Also Published As
Publication number | Publication date |
---|---|
CN109445565B (en) | 2020-09-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109445565A (en) | A kind of GPU QoS guarantee method exclusive and reserved based on stream multiple processor cores | |
Wang et al. | Simultaneous multikernel GPU: Multi-tasking throughput processors via fine-grained sharing | |
Wang et al. | Quality of service support for fine-grained sharing on GPUs | |
CN107122245A (en) | GPU task dispatching method and system | |
CN108595258A (en) | A kind of GPGPU register files dynamic expansion method | |
CN103927225A (en) | Multi-core framework Internet information processing and optimizing method | |
CN105117285B (en) | A kind of nonvolatile memory method for optimizing scheduling based on mobile virtual system | |
KR20110075297A (en) | Apparatus and method for parallel processing in consideration of degree of parallelism | |
US10768684B2 (en) | Reducing power by vacating subsets of CPUs and memory | |
CN103257900B (en) | Real-time task collection method for obligating resource on the multiprocessor that minimizing CPU takies | |
CN109871268A (en) | A kind of energy-saving scheduling method based on air current composition at data-oriented center | |
CN104090826B (en) | Task optimization deployment method based on correlation | |
CN106155794B (en) | A kind of event dispatcher method and device applied in multi-threaded system | |
KR20100074920A (en) | Apparatus and method for load balancing in multi-core system | |
CN118069379B (en) | Scheduling realization method based on GPU resources | |
CN111045800A (en) | Method and system for optimizing GPU (graphics processing Unit) performance based on short job priority | |
CN116820784A (en) | GPU real-time scheduling method and system for reasoning task QoS | |
Kuo et al. | Task assignment with energy efficiency considerations for non-DVS heterogeneous multiprocessor systems | |
CN102193828A (en) | Decoupling the number of logical threads from the number of simultaneous physical threads in a processor | |
CN112445619A (en) | Management system and method for dynamically sharing ordered resources in a multi-threaded system | |
CN104731662B (en) | A kind of resource allocation methods of variable concurrent job | |
CN112114967B (en) | GPU resource reservation method based on service priority | |
CN107577524A (en) | The GPGPU thread scheduling methods of non-memory access priority of task | |
CN118245013B (en) | Computing unit, method and corresponding processor supporting dynamic allocation of scalar registers | |
Shieh et al. | Enabling fast preemption via dual-kernel support on GPUs |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right |
Effective date of registration: 20210420 Address after: 100160, No. 4, building 12, No. 128, South Fourth Ring Road, Fengtai District, Beijing, China (1515-1516) Patentee after: Kaixi (Beijing) Information Technology Co.,Ltd. Address before: 100191 Haidian District, Xueyuan Road, No. 37, Patentee before: BEIHANG University |
|
TR01 | Transfer of patent right | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20200915 Termination date: 20211108 |
|
CF01 | Termination of patent right due to non-payment of annual fee |