CN109857564A - Resource management method based on fine-grained GPU, and GPU applying the same - Google Patents
Resource management method based on fine-grained GPU, and GPU applying the same
- Publication number
- CN109857564A CN109857564A CN201910164573.0A CN201910164573A CN109857564A CN 109857564 A CN109857564 A CN 109857564A CN 201910164573 A CN201910164573 A CN 201910164573A CN 109857564 A CN109857564 A CN 109857564A
- Authority
- CN
- China
- Prior art keywords
- kernel
- gpu
- quota
- qos
- fine
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The present invention provides a resource management method based on a fine-grained GPU, and a GPU applying it. The resource management method based on a fine-grained GPU comprises: obtaining the target average instructions per cycle (IPC) of a kernel from its QoS target; dynamically adjusting the quota assigned to the kernel in each epoch according to that target average IPC; and distributing thread blocks to each streaming multiprocessor according to the kernel's runtime behavior. The present invention provides a fine-grained GPU sharing mechanism that performs fine-grained QoS management in both dynamic and static resource management, so that the QoS demands of specific applications are guaranteed while the overall throughput of the GPU system is maximized on that basis.
Description
Technical field
The present invention relates to the technical field of graphics processors (GPUs), and in particular to a resource management method based on a fine-grained GPU and a GPU applying it.
Background technique
A graphics processor, i.e. GPU (Graphics Processing Unit), is a special-purpose processor for accelerating image processing. Because GPUs offer a high degree of parallelism together with powerful matrix and floating-point computing capability, they are also widely used in compute-intensive applications beyond image processing, such as cloud computing and deep learning; this usage is referred to as general-purpose computing on graphics processing units (General-purpose computing on graphics processing units, GPGPU).
For a GPU, running multiple applications simultaneously, rather than only one at a time, effectively improves resource utilization. Among the tasks running concurrently on a GPU, some must obtain a fast response within a certain time to guarantee a good user experience. In some scenarios, an overly long response time for a particular task is very likely to degrade the user experience, for example stuttering or dropped frames in games. Even non-graphics tasks may have performance demands: a user of a data center may require that a submitted application complete at a guaranteed rate. Therefore, a method that effectively guarantees quality of service (Quality of Service, QoS) among multiple applications running simultaneously on a GPU is highly necessary.
According to whether preemption is supported, current GPU sharing modes can be divided into non-preemptive sharing and preemptive sharing. Non-preemptive sharing realizes QoS management over multiple applications on a GPU device mainly by modifying the device driver. Under this strategy, QoS is usually guaranteed by scheduling GPU-related instructions through system calls or by controlling the launch order of multiple kernels; in fact, the GPU still runs the instructions of only one application at a time, and parallelism is not improved. With the progress of the industry, preemptive GPU sharing has gradually become the mainstream research direction.
According to the granularity of resource allocation, preemptive sharing can be further divided into coarse-grained preemption and fine-grained preemption. In coarse-grained preemption, resources are reclaimed by context-switching an entire SM at a time, which causes under-utilization of the resources within the SM. Fine-grained preemption context-switches at the granularity of thread blocks, allowing finer resource allocation and effectively improving utilization within an SM, and thereby the overall resource utilization of the GPU.
Clearly, a fine-grained preemptive GPU sharing method manages the on-chip resources of the GPU more effectively than a coarse-grained one. How to perform QoS management on a shared GPU based on a fine-grained GPU sharing method has thus become a technical problem urgently awaiting solution by those skilled in the art.
Summary of the invention
In view of the foregoing deficiencies of the prior art, the purpose of the present invention is to provide a resource management method based on a fine-grained GPU and a GPU applying it, for performing fine-grained QoS management so as to precisely regulate the resources within an SM (streaming multiprocessor) and effectively improve the overall resource utilization of the GPU.
To achieve the above and other related objects, the present invention provides a resource management method based on a fine-grained GPU, comprising: obtaining the target average instructions per cycle (IPC) of a kernel from its QoS target; dynamically adjusting the quota assigned to the kernel in each epoch according to that target average IPC; and distributing thread blocks to each streaming multiprocessor according to the kernel's runtime behavior.
In one embodiment of the invention, the target average IPC of the kernel is:
IPC_goal = p × IPC_alone;
wherein IPC_goal is the target average IPC of the kernel, IPC_alone is the kernel's average IPC when run alone on the GPU, and p is the QoS percentage target.
In one embodiment of the invention, the quota assigned to the kernel in each epoch is:
Quota_k = α_k × IPC_goal × T_epoch;
wherein: IPC_goal = p × IPC_alone;
Quota_k is the quota assigned to kernel k in each epoch, IPC_alone is the average IPC of kernel k when run alone on the GPU, IPC_goal is the target average IPC of the kernel, T_epoch is the preset epoch length, α_k is the history factor, and IPC_history is the historical average IPC.
In one embodiment of the invention, one implementation of dynamically adjusting the per-epoch quota according to the kernel's target average IPC comprises: at the start of each epoch, distributing the quota assigned to kernel k over the streaming multiprocessors; obtaining each streaming multiprocessor's remaining quota from the instruction count actually completed; and determining from the remaining quota whether the quotas of the kernels with QoS requirements are exhausted, and if so, renewing the quotas of the kernels without QoS requirements.
In one embodiment of the invention, renewing the quota of a kernel without QoS requirements comprises: increasing the quota of a kernel without QoS requirements when the slow-phase condition is met; wherein the slow-phase condition is:
for two consecutive epochs, the IPC_epoch of a kernel with QoS requirements satisfies IPC_epoch < x × IPC_goal;
wherein IPC_epoch is the average IPC over the epoch, and x is a preset threshold, 0.5 < x < 0.8.
In one embodiment of the invention, one implementation of renewing the quota of the kernels without QoS requirements comprises: obtaining the current target gap of the kernels with QoS requirements; deriving from that target gap a target average IPC for the kernels without QoS requirements; and renewing the quota of the kernels without QoS requirements according to that target average IPC.
In one embodiment of the invention, one way of distributing thread blocks to each streaming multiprocessor according to the kernel's runtime behavior is: the thread blocks of the kernels with QoS requirements are distributed evenly over all streaming multiprocessors; the streaming multiprocessors are partitioned for the kernels without QoS requirements, and the thread blocks of each kernel without QoS requirements are distributed evenly over the streaming multiprocessors in its own partition.
In one embodiment of the invention, distributing thread blocks to each streaming multiprocessor according to the kernel's runtime behavior further comprises: collecting the number of idle warps of every kernel; and comparing the number of idle warps with the number of warps per thread block, and if the number of idle warps is greater than or equal to the number of warps of some thread block, removing that thread block via the TB scheduler.
In one embodiment of the invention, if the number of idle TBs of a kernel with QoS requirements is no more than 1, and its IPC_history has not reached the target, then more thread blocks are allocated to that kernel with QoS requirements.
An embodiment of the present invention also provides a GPU, the GPU being extended with a TB scheduler and a warp scheduler; the TB scheduler applies the resource management method based on a fine-grained GPU as described above.
As described above, the resource management method based on a fine-grained GPU and the GPU applying it according to the present invention have the following beneficial effects:
1. The present invention provides a fine-grained GPU sharing mechanism that performs fine-grained QoS management in both dynamic resource management and static resource management, guaranteeing the QoS demands of specific applications while maximizing the overall throughput of the GPU system on that basis.
2. The results of the invention can provide a technical reference for QoS management schemes in the design and deployment of industrial GPU hardware.
Brief description of the drawings
Fig. 1 is an architectural schematic of the extensions made to a GPU system shared via fine-grained preemption, in one embodiment of the invention.
Fig. 2 is an overall flow diagram of the resource management method based on a fine-grained GPU, in one embodiment of the invention.
Fig. 3 is an example of the implementation of the quota rollback strategy in the resource management method based on a fine-grained GPU, in one embodiment of the invention.
Fig. 4 is a schematic of the adjustment of quota allocation for kernels without QoS requirements (non-QoS kernels) under the quota rollback strategy with slow-phase judgment, in the resource management method based on a fine-grained GPU, in one embodiment of the invention.
Specific embodiments
The embodiments of the present invention are illustrated below by way of specific examples; those skilled in the art can easily understand other advantages and effects of the present invention from the content disclosed in this specification. The present invention can also be implemented or practiced through other different specific embodiments, and the details in this specification can be modified or altered in various ways from different viewpoints and for different applications, without departing from the spirit of the invention.
The embodiments of the present invention aim to provide a resource management method based on a fine-grained GPU and a server, for performing fine-grained QoS management so as to precisely regulate the resources within an SM (streaming multiprocessor) and effectively improve the overall resource utilization of the GPU.
The embodiments of the present invention are intended to design and realize, based on GPGPU-Sim, a fine-grained shared GPU scheduling strategy that maximizes the overall throughput of the remaining applications under the premise of guaranteeing the QoS of the target application. Through a scheme combining a static resource allocation policy with a quota-based dynamic processing-capacity allocation policy, this embodiment performs fine-grained QoS management, precisely regulating the resources within an SM and effectively improving the overall resource utilization of the GPU.
Fig. 1 shows the overall system structure of the fine-grained shared GPU architecture and the QoS management extension modules of the embodiment. The embodiment draws on prior fine-grained GPU sharing architectures and focuses on how to design, on that basis, a QoS management method for a fine-grained shared GPU.
As shown in Fig. 1, the technical solution of the embodiment extends a GPU system shared via fine-grained preemption. The extension mainly comprises four parts: an enhanced TB scheduler, static resource management, dynamic resource management, and an enhanced warp scheduler. The core of the invention is the resource allocation policy designed in the QoS management module, in particular the quota-based dynamic resource management strategy.
Enhanced TB scheduler: on the basis of a fine-grained shared GPU, an enhanced TB scheduler is introduced to support the parallel execution of multiple kernels within a single SM (streaming multiprocessor). The enhanced TB scheduler is responsible for communicating with each SM and performing allocation according to the formulated static and dynamic resource strategies.
Static resource management: static resource management decides the number of executable TBs allocated to each kernel on each SM. Since the scheme here is realized on a preemptive GPU, resource allocation can be adjusted flexibly at runtime according to how well the QoS targets are being met.
Dynamic resource management: the management of dynamic resources is the core. The embodiment of the present invention uses a quota-based algorithm to regulate the progress of each kernel in each epoch at runtime. The QoS management module is responsible for collecting real-time data fed back during execution and, according to the formulated QoS policy, computing the quota allocated to each kernel.
Enhanced warp scheduler: the enhanced warp scheduler distributes the core's compute cycles to each kernel according to the quota allocation sent from the QoS manager. The scheme of the embodiment divides execution into units of a fixed number of cycles, each unit called an epoch. At the start of each epoch, every kernel on each SM is assigned a quota. Within an epoch, the quota a kernel receives determines the maximum number of instructions, and hence the maximum progress, that the kernel can complete in that epoch.
The principles and embodiments of the resource management method based on a fine-grained GPU and the server of this embodiment are described in detail below, so that those skilled in the art can understand them without creative work.
As shown in Fig. 2, this embodiment provides a resource management method based on a fine-grained GPU, comprising:
Step S110, obtaining the target average IPC of a kernel from its QoS target;
Step S120, dynamically adjusting the quota assigned to the kernel in each epoch according to that target average IPC;
Step S130, distributing thread blocks to each streaming multiprocessor according to the kernel's runtime behavior.
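The three steps above can be sketched as follows. This is a minimal illustrative sketch, not the claimed implementation: the `Kernel` class, its field names, and the default history factor are hypothetical stand-ins for the quantities developed later in the description (IPC_goal = p × IPC_alone and Quota_k = α_k × IPC_goal × T_epoch).

```python
from dataclasses import dataclass

@dataclass
class Kernel:
    name: str
    ipc_alone: float    # average IPC when the kernel runs alone on the GPU
    p: float            # QoS percentage target, 0 <= p <= 1 (0 for non-QoS kernels)
    alpha: float = 1.0  # history factor alpha_k

def plan_epoch(kernels, t_epoch):
    """Steps S110/S120: derive each kernel's IPC target and its epoch quota."""
    plan = {}
    for k in kernels:
        ipc_goal = k.p * k.ipc_alone                 # S110: QoS target -> IPC target
        plan[k.name] = k.alpha * ipc_goal * t_epoch  # S120: quota for this epoch
    return plan
```

Step S130, the thread-block distribution, is then driven by runtime feedback and is not captured in this static sketch.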
Steps S110 to S130 of this embodiment are described in detail below.
Step S110, obtaining the target average IPC of a kernel from its QoS target.
The QoS target specified by the user may take many forms, such as frame rate or message transmission rate. To let the architecture-level QoS management system use this target directly, the QoS target must first be unified into a specific hardware metric. Empirically, any form of QoS target can be converted into the average instructions per cycle (Instructions Per Cycle, IPC) executed by the kernel. This embodiment manages QoS by guaranteeing the kernel's final execution time, or in other words its average IPC.
Specifically, in this embodiment, the target average IPC of the kernel is:
IPC_goal = p × IPC_alone;
wherein IPC_goal is the target average IPC of the kernel, and IPC_alone is the kernel's total instruction count divided by its execution cycles when run in isolation. Assuming that the kernel's execution time in isolation and the total number of instructions it needs to complete are known, the IPC target can be calculated from the formula above.
Denote the kernel's IPC when run alone on the GPU as IPC_alone, and assume that the IPC target when the kernel shares the GPU with other kernels (denoted IPC_goal, converted from QoS_goal) is achievable when it runs in isolation, i.e.:
IPC_goal ≤ IPC_alone.
Since different applications have different QoS requirements, this embodiment uniformly uses the ratio of IPC_goal to IPC_alone to express the IPC targets of different applications, denoted:
p = IPC_goal / IPC_alone;
where 0 ≤ p ≤ 1; p reflects the difficulty of reaching the QoS target.
Step S120, dynamically adjusting the quota assigned to the kernel in each epoch according to the target average IPC.
According to whether they have QoS demands, the kernels on a GPU can be divided into kernels with QoS requirements (QoS kernels) and kernels without QoS requirements (non-QoS kernels). The final goal of QoS management is to make the final average IPC of the QoS kernels reach the QoS target (IPC target) while maximizing the throughput of the non-QoS kernels. A key problem is therefore the design of the quota allocation algorithm for QoS kernels. This embodiment expresses the IPC target with a quota. Through the quota variable, the warp scheduler can be told the number of instructions each QoS kernel should complete in each epoch, and the warp scheduler tracks quota consumption in each epoch, feeding back information used to update the quota.
Specifically, in this embodiment, the quota assigned to the kernel in each epoch is:
Quota_k = α_k × IPC_goal × T_epoch;
wherein: IPC_goal = p × IPC_alone;
Quota_k is the quota assigned to kernel k in each epoch, IPC_alone is the average IPC of kernel k when run alone on the GPU, IPC_goal is the target average IPC of the kernel, T_epoch is the preset epoch length, α_k is the history factor, and IPC_history is the historical average IPC.
Assume the configured epoch length is T_epoch, denote the QoS kernel that needs to reach its target as k, and denote k's IPC when run alone on the GPU as IPC_alone. With a target QoS percentage p, the target average IPC that k is expected to reach when running in parallel can be calculated as: IPC_goal = p × IPC_alone.
Quota_k is the total instruction count for k that all SMs together need to complete. It is computed for each kernel k by the QoS manager and then further allocated to each SM according to the number of k's thread blocks held on that SM. For example, suppose the total number of thread blocks of kernel k is T, and SM_i is responsible for scheduling T_i of them; then the quota for k assigned to SM_i should be Quota_k × T_i / T. In this way, Quota_k is assigned to the SMs in relative balance.
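The proportional split of Quota_k over the SMs can be sketched as follows (function and variable names are illustrative):

```python
def split_quota_over_sms(quota_k, blocks_per_sm):
    """Distribute kernel k's epoch quota over the SMs in proportion to the
    number of k's thread blocks each SM holds: SM_i receives Quota_k * T_i / T."""
    total_blocks = sum(blocks_per_sm)
    return [quota_k * t_i / total_blocks for t_i in blocks_per_sm]
```

For example, a quota of 100 split over three SMs holding 1, 1, and 2 thread blocks yields per-SM quotas of 25, 25, and 50.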
In this embodiment, one implementation of dynamically adjusting the per-epoch quota according to the kernel's target average IPC comprises:
at the start of each epoch, distributing the quota assigned to kernel k over the streaming multiprocessors;
obtaining each streaming multiprocessor's remaining quota from the instruction count actually completed;
determining from the remaining quota whether the quotas of the kernels with QoS requirements are exhausted, and if so, renewing the quotas of the kernels without QoS requirements.
Specifically, at the start of each epoch, Quota_k is distributed to the SMs. Each SM keeps a counter C_k to store its share of k's quota. When a warp instruction of k completes, C_k is decremented by the number of instructions actually completed (this count is usually equal to the SIMD width, but may be any positive integer not exceeding it). If C_k is positive at the end of the epoch, i.e. k has unfinished quota, the remainder is retained and added onto Quota_k for the next epoch when resetting C_k.
To prevent non-QoS kernels from occupying too much of the computing resources of QoS kernels, quotas are likewise set for non-QoS kernels to limit them. An SM can check whether C_k is non-positive for all QoS kernels, to ensure they have all exhausted their quota and completed the phase QoS target. Once the quotas of all QoS kernels are used up, the C_k of all non-QoS kernels are renewed with their quota Quota_k, ensuring there are always thread blocks running on the SM. For QoS kernels, as long as they reach their QoS target, high throughput is not required; so once a QoS kernel's IPC target for the epoch has been reached, no new quota need be allocated to it. At the start of each epoch, the C_k of every kernel, QoS or non-QoS, is reset to Quota_k. We call this strategy the quota rollback strategy.
The quota rollback strategy is illustrated in Fig. 3. K_1 is a QoS kernel with quota Quota_1 = 100; K_2 is a non-QoS kernel with quota Quota_2 = 50. At time t_0, the first epoch, Quota_1 and Quota_2 are assigned to K_1 and K_2 respectively, C_k1 is set to 100 and C_k2 to 50. At time t_1, the second epoch, K_1's 5 units of remaining quota are accumulated onto its base quota, so C_k1 is set to 105, while K_2's remaining quota is discarded and C_k2 is set to 50. At time t_2, before the second epoch has ended, the quotas of both K_1 and K_2 are used up. Since K_1 is a QoS kernel, its quota is not renewed; K_2 is a non-QoS kernel, and since all QoS kernels have completed their phase targets, K_2's quota can be renewed to make full use of the computing resources remaining in the epoch.
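The per-SM counter behavior just described (decrement on warp completion, QoS carry-over at epoch boundaries) can be sketched as follows; the class and method names are hypothetical:

```python
class QuotaCounter:
    """Per-SM counter C_k tracking one kernel's remaining quota in an epoch."""

    def __init__(self, quota_k, is_qos):
        self.quota_k = quota_k  # base quota assigned each epoch
        self.is_qos = is_qos
        self.c_k = quota_k

    def on_warp_complete(self, insts):
        # insts is usually the SIMD width, but may be any smaller positive count
        self.c_k -= insts

    def start_new_epoch(self):
        # QoS kernels carry unfinished quota into the next epoch;
        # leftover quota of non-QoS kernels is discarded.
        carry = self.c_k if (self.is_qos and self.c_k > 0) else 0
        self.c_k = self.quota_k + carry
```

With a base quota of 100, a QoS kernel that completes 95 instructions in an epoch carries the 5 remaining units forward and starts the next epoch with C_k = 105, matching the Fig. 3 example; a non-QoS kernel's leftover is simply dropped.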
In fact, some kernels have an IPC that is slow at first and fast later (hereinafter "cumulative-type" kernels), such as the sparse matrix multiplication (spmv) in the Parboil benchmark suite. In the early phase, such a kernel runs slowly due to its intrinsic characteristics; even if its quota is increased, it may remain far from the target IPC. Meanwhile, because the QoS kernel keeps failing to reach the IPC target, continuously accumulating remaining quota instead suppresses the performance of the non-QoS kernels, eventually leading to poor overall GPU performance during the slow phase.
To solve this problem, this embodiment further proposes a slow-phase judgment method.
In this embodiment, renewing the quota of the kernels without QoS requirements comprises: increasing the quota of a kernel without QoS requirements when the slow-phase condition is met; wherein the slow-phase condition is:
for two consecutive epochs, the IPC_epoch of a kernel with QoS requirements satisfies IPC_epoch < x × IPC_goal;
wherein IPC_epoch is the average IPC over the epoch, and x is a preset threshold, 0.5 < x < 0.8.
On the basis of the quota rollback strategy, some flexible adjustments are made to the quota allocation of the non-QoS kernels, as shown in Fig. 4.
The slow-phase condition is: for a configured threshold x (0 < x < 1), the IPC_epoch of the QoS kernel satisfies IPC_epoch < x × IPC_goal for two consecutive epochs.
The reason for requiring "two consecutive epochs" is to avoid misjudging a brief performance fluctuation as a phase-level slow period. The setting of x must satisfy two conditions: first, it must be small enough to distinguish slight performance fluctuations from a phase-level slow period caused by the kernel's intrinsic performance, so as not to affect the performance of the QoS kernel; second, it must be large enough that when a slow phase does occur, the idle computing resources during that time can be fully used to raise the throughput of the non-QoS kernels.
Considering these two points, the range of x is tentatively set between 0.5 and 0.8. To determine the value, sensitivity tests can be run with x equal to 0.5, 0.6, 0.7 and 0.8 respectively, and the best value chosen according to the actual situation.
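The slow-phase test can be sketched as follows (names are illustrative; the default x is a mid-range value pending the sensitivity test described above):

```python
def in_slow_phase(qos_epoch_ipcs, ipc_goal, x=0.6):
    """True when the QoS kernel's per-epoch IPC stayed below x * IPC_goal
    for the last two consecutive epochs (with threshold 0.5 < x < 0.8)."""
    if len(qos_epoch_ipcs) < 2:
        return False
    return all(ipc < x * ipc_goal for ipc in qos_epoch_ipcs[-2:])
```

A single below-threshold epoch does not trigger the slow phase; only two consecutive ones do, filtering out brief fluctuations.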
Since non-QoS kernels have no QoS target, their quota cannot be allocated directly in the way used for QoS kernels.
In this embodiment, one implementation of renewing the quota of the kernels without QoS requirements comprises:
obtaining the current target gap of the kernels with QoS requirements;
deriving from that target gap a target average IPC for the kernels without QoS requirements;
renewing the quota of the kernels without QoS requirements according to that target average IPC.
In the QoS management strategy of this embodiment, the quota allocated to a non-QoS kernel depends on how far the QoS kernels currently are from their QoS targets. If the IPC_epoch of a QoS kernel in the last epoch was higher than IPC_goal, the non-QoS kernels can receive a higher quota, and vice versa. So, before computing the quota of a non-QoS kernel, an interim IPC target is first computed for it from historical information:
IPC_goal = α_k × IPC_history;
where α_k is the history adjustment factor in the simple history-based strategy. At the start of each epoch, once the IPC target of a non-QoS kernel has been computed, its quota is computed by the same method as the quota of a QoS kernel. To let the QoS kernels reach their QoS targets as early as possible, the IPC_epoch of a non-QoS kernel can be initialized to a very small value (for example, 1). This means the quota of the non-QoS kernels is very small at the beginning and will not affect the performance of the QoS kernels. To avoid under-utilization of resources, the IPC_goal of a non-QoS kernel can grow cumulatively, but never so high as to affect the performance of the QoS kernels. In fact, the initial value of IPC_epoch has very little influence on the final result. Thus the non-QoS kernel quota design of this embodiment can dynamically limit the resources of the non-QoS kernels when the QoS kernels need more resources, and release resources to the non-QoS kernels to raise their throughput when the QoS kernels do not need as many. This is consistent with the final goal of QoS management.
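A sketch of this history-based renewal, under the assumption that the history factor α_k is the ratio of the QoS kernel's last-epoch IPC to its own target (the exact form of α_k is not fixed by the text; this ratio is one plausible choice consistent with "higher last-epoch IPC permits a higher non-QoS quota"):

```python
def non_qos_ipc_target(qos_epoch_ipc, qos_ipc_goal, ipc_history):
    """Interim IPC target for a non-QoS kernel: scale its historical IPC by
    how much the QoS kernel over- or under-shot its own target last epoch."""
    alpha_k = qos_epoch_ipc / qos_ipc_goal  # assumed form of the history factor
    return alpha_k * ipc_history
```

When the QoS kernel overshot its target (ratio above 1), the non-QoS target, and hence its quota, rises; when it undershot, the non-QoS target shrinks, returning resources to the QoS kernel.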
Step S130, distributing thread blocks to each streaming multiprocessor according to the kernel's runtime behavior.
The static resources mainly include registers, shared memory, and the number of thread blocks of each kernel. Providing enough thread blocks for a QoS kernel improves thread-level parallelism (TLP), thereby increasing the likelihood of reaching the QoS target.
In this embodiment, one way of distributing thread blocks to each streaming multiprocessor according to the kernel's runtime behavior is:
the thread blocks of the kernels with QoS requirements are distributed evenly over all streaming multiprocessors;
the streaming multiprocessors are partitioned for the kernels without QoS requirements, and the thread blocks of each kernel without QoS requirements are distributed evenly over the streaming multiprocessors in its own partition.
The static resource allocation of this embodiment adopts a symmetric distribution scheme. On the one hand, the thread blocks of a QoS kernel are assigned evenly to every SM, so that each SM holds the same number of its thread blocks and the TLP stays balanced. On the other hand, previous studies on fine-grained sharing have shown that running too many kernels on one SM simultaneously may perform poorly. Therefore, for the non-QoS kernels, the SMs are divided into partitions, and the thread blocks of each non-QoS kernel are distributed evenly over the SMs in its own partition.
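The symmetric allocation above can be sketched as follows. This is a simplified illustration under assumed inputs (the SM count is divisible by the number of non-QoS kernels); the function name and the round-robin placement details are hypothetical:

```python
# Sketch of the symmetric static allocation: QoS kernel thread blocks are
# spread over ALL SMs; each non-QoS kernel gets its own SM partition.
# (Hypothetical interface; assumes num_sms is divisible by the kernel count.)

def assign_thread_blocks(num_sms, qos_tbs, non_qos_tbs):
    """qos_tbs / non_qos_tbs: {kernel_name: thread_block_count}.
    Returns {sm_id: [kernel_name, ...]}, one entry per thread block."""
    sms = {sm: [] for sm in range(num_sms)}
    # QoS kernels: round-robin over every SM so each SM holds the same
    # number of their thread blocks and TLP stays balanced.
    for kernel, count in qos_tbs.items():
        for i in range(count):
            sms[i % num_sms].append(kernel)
    # Non-QoS kernels: one SM partition per kernel, so a single SM never
    # runs too many kernels simultaneously.
    kernels = list(non_qos_tbs)
    part = num_sms // max(1, len(kernels))
    for k, kernel in enumerate(kernels):
        for i in range(non_qos_tbs[kernel]):
            sms[k * part + i % part].append(kernel)
    return sms
```

For example, with 4 SMs, one QoS kernel of 8 thread blocks, and two non-QoS kernels of 4 thread blocks each, every SM holds two QoS blocks, while the first non-QoS kernel occupies SMs 0-1 and the second SMs 2-3.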
In this embodiment, allocating thread blocks to each stream processor according to the running state of the kernels further includes:
collecting the number of idle warps of all kernels;
comparing the number of idle warps with the number of warps in each thread block, and if the number of idle warps is greater than or equal to the number of warps of some thread block, removing that thread block through the TB scheduler.
At runtime, the allocation of thread blocks is adjusted according to the running state of the kernels. In each epoch, the number of "idle warps" is collected for every kernel, i.e., warps that have instructions ready to run but are not scheduled because the pipeline is saturated. The number of idle warps reflects the degree of excess TLP. Since idle warps occupy static resources without contributing anything, their thread blocks should be swapped out so that thread blocks of kernels with low TLP can be brought in. At the beginning of each epoch, the number of idle warps of each kernel is collected from every SM. If this number equals the number of warps per thread block, swapping out an entire thread block will not affect TLP at all. We call such thread blocks "idle TBs".
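The idle-TB arithmetic above can be sketched in a few lines. The sampling interface is hypothetical; only the whole-thread-block computation follows the description:

```python
# Sketch of idle-TB detection (hypothetical interface). An "idle warp" has
# runnable instructions but is never issued because the pipeline is saturated;
# when a kernel's idle warps cover whole thread blocks, those blocks can be
# swapped out without lowering the achieved TLP.

def count_idle_tbs(idle_warps_per_sm, warps_per_tb):
    """idle_warps_per_sm: idle-warp samples collected from each SM for one
    kernel at the start of an epoch. Returns the number of 'idle TBs'."""
    total_idle = sum(idle_warps_per_sm)
    return total_idle // warps_per_tb  # whole blocks whose removal keeps TLP
```

For example, 12 idle warps across the SMs with 4 warps per thread block mean three whole thread blocks can be displaced without affecting TLP.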
In addition, in this embodiment, if the number of idle TBs of a kernel with QoS requirements is no more than 1 and its IPC_history has not reached the target, more thread blocks are allocated to that kernel.
That is, a QoS kernel with at most one idle TB whose IPC_history falls short of the target is given more thread blocks, so that it reaches its QoS target faster.
The kernel being displaced should meet one of the following three conditions:
1. The kernel is a non-QoS kernel.
2. If n of the kernel's idle TBs are needed to free enough resources, it must have n+1 idle TBs.
3. The IPC_history of the kernel should satisfy the corresponding threshold condition, where N is its total number of thread blocks.
According to the above conditions, a displaced kernel is either a non-QoS kernel, or a QoS kernel that has excess TLP or can afford to lose enough IPC. In addition, to avoid excessive preemption overhead, a swap takes place only when no kernel has pending preemption instructions.
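The three displacement conditions can be sketched as a single eligibility check. Conditions 1 and 2 follow the text directly; the exact formula for condition 3 is not reproduced in the published text, so the proportional-IPC margin used below is an assumption:

```python
# Sketch of the swap-out eligibility check (hypothetical fields). A thread
# block may be taken from a kernel only if (1) it is non-QoS, (2) it has
# n+1 idle TBs when n are needed (excess TLP), or (3) it is a QoS kernel
# far enough above target to absorb the IPC loss. The margin test for (3)
# is an ASSUMED form, not the patent's formula.

def can_evict(kernel, n_needed):
    if not kernel["qos"]:
        return True                        # condition 1: non-QoS kernel
    if kernel["idle_tbs"] >= n_needed + 1:
        return True                        # condition 2: excess TLP
    total = kernel["total_tbs"]            # N in the description
    # condition 3 (assumed form): even after losing n_needed of N thread
    # blocks, the kernel's historical IPC would still meet its goal.
    projected = kernel["ipc_history"] * (total - n_needed) / total
    return projected >= kernel["ipc_goal"]
```

A scheduler would additionally verify, per the text, that no kernel has pending preemption instructions before actually performing the swap.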
In conclusion the present embodiment also provides more from the thread block quantity control in static resource for QoS kernel
The good condition for reaching QoS target.
An embodiment of the present invention also provides a GPU, the GPU being extended with a TB scheduler and a warp scheduler; the TB scheduler applies the resource management method based on a fine-grained GPU as described above. Since that method has been described in detail above, the details are not repeated here.
In conclusion the present invention provides one kind to be based on fine granularity GPU shared mechanism, it can be from dynamic resource management, quiet
Two aspects of state resource management carry out fine-grained QoS management, not only can guarantee the QoS demand of specific application, but also can be basic herein
The upper entire throughput for maximizing GPU system;Achievement of the invention can provide for the design and landing of industry GPU hardware
The Technical Reference of QoS Managed Solution.So the present invention effectively overcomes various shortcoming in the prior art and has high industrial benefit
With value.
The above embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Anyone familiar with this technology may modify or alter the above embodiments without departing from the spirit and scope of the invention. Therefore, all equivalent modifications or alterations completed by persons of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the present invention shall still be covered by the claims of the present invention.
Claims (10)
1. A resource management method based on a fine-grained GPU, characterized in that the resource management method based on a fine-grained GPU comprises:
obtaining, according to a QoS target, a target average number of instructions executed per cycle (IPC) for a kernel;
dynamically adjusting, according to the target average IPC of the kernel, the quota quantity allocated to the kernel in each epoch;
allocating thread blocks to each stream processor according to the running state of the kernel.
2. The resource management method based on a fine-grained GPU according to claim 1, characterized in that the target average IPC of the kernel is:
IPC_goal = p × IPC_alone;
wherein IPC_goal is the target average IPC of the kernel, p is the QoS target ratio, and IPC_alone is the average IPC of the kernel when running alone on the GPU.
3. The resource management method based on a fine-grained GPU according to claim 1 or 2, characterized in that the quota quantity allocated to the kernel in each epoch is:
Quota_k = α_k × IPC_goal × T_epoch;
wherein: α_k = IPC_goal / IPC_history; IPC_goal = p × IPC_alone;
Quota_k is the quota quantity allocated to kernel k in each epoch, IPC_alone is the average IPC of kernel k when running alone on the GPU, IPC_goal is the target average IPC of the kernel, T_epoch is the preset epoch length, α_k is the historical factor, and IPC_history is the historical average IPC.
4. The resource management method based on a fine-grained GPU according to claim 1, characterized in that one implementation of dynamically adjusting, according to the target average IPC of the kernel, the quota quantity allocated to the kernel in each epoch comprises:
at the initial moment of each epoch, distributing the quota quantity allocated to kernel k over the stream processors;
obtaining the remaining quota of each stream processor according to the number of instructions actually completed;
determining, according to the remaining quota, whether the quota of the kernels with QoS requirements has been used up, and if so, updating the quota of the kernels without QoS requirements.
5. The resource management method based on a fine-grained GPU according to claim 4, characterized in that updating the quota of the kernels without QoS requirements comprises:
increasing the quota of the kernels without QoS requirements when the slow-stage condition is met; wherein the slow-stage condition is:
in two consecutive epochs, the IPC_epoch of a kernel with QoS requirements satisfies IPC_epoch < x·IPC_goal;
wherein IPC_epoch is the average IPC within an epoch, and x is a set threshold with 0.5 < x < 0.8.
6. The resource management method based on a fine-grained GPU according to claim 5, characterized in that one implementation of updating the quota of the kernels without QoS requirements comprises:
obtaining the current target gap of the kernels with QoS requirements;
obtaining, according to the current target gap of the kernels with QoS requirements, the target average IPC of the kernels without QoS requirements;
updating the quota of the kernels without QoS requirements according to their target average IPC.
7. The resource management method based on a fine-grained GPU according to claim 1, characterized in that one way of allocating thread blocks to each stream processor according to the running state of the kernel is:
distributing the thread blocks of kernels with QoS requirements evenly over all stream processors;
partitioning the stream processors for the kernels without QoS requirements, the thread blocks of each kernel without QoS requirements being distributed evenly over the stream processors in its own partition.
8. The resource management method based on a fine-grained GPU according to claim 7, characterized in that allocating thread blocks to each stream processor according to the running state of the kernel further comprises:
collecting the number of idle warps of all kernels;
comparing the number of idle warps with the number of warps in each thread block, and if the number of idle warps is greater than or equal to the number of warps of some thread block, removing that thread block through the TB scheduler.
9. The resource management method based on a fine-grained GPU according to claim 8, characterized in that if the number of idle TBs of a kernel with QoS requirements is no more than 1 and its IPC_history has not reached the target, more thread blocks are allocated to that kernel with QoS requirements.
10. A GPU, characterized in that the GPU is extended with a TB scheduler and a warp scheduler; the TB scheduler applies the resource management method based on a fine-grained GPU according to any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910164573.0A CN109857564A (en) | 2019-03-05 | 2019-03-05 | The GPU of method for managing resource and its application based on fine-grained GPU |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109857564A true CN109857564A (en) | 2019-06-07 |
Family
ID=66899834
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109857564A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111124691A (en) * | 2020-01-02 | 2020-05-08 | 上海交通大学 | Multi-process shared GPU (graphics processing Unit) scheduling method and system and electronic equipment |
CN114463159A (en) * | 2022-01-06 | 2022-05-10 | 江苏电力信息技术有限公司 | GPU resource sharing method |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101685391A (en) * | 2002-01-04 | 2010-03-31 | 微软公司 | Methods and system for managing computational resources of a coprocessor in a computing system |
CN104838360A (en) * | 2012-09-04 | 2015-08-12 | 微软技术许可有限责任公司 | Quota-based resource management |
CN106576114A (en) * | 2014-08-08 | 2017-04-19 | 甲骨文国际公司 | Policy based resource management and allocation system |
CN106874113A (en) * | 2017-01-19 | 2017-06-20 | 国电南瑞科技股份有限公司 | A kind of many GPU heterogeneous schemas static security analysis computational methods of CPU+ |
CN107943592A (en) * | 2017-12-13 | 2018-04-20 | 江苏省邮电规划设计院有限责任公司 | A kind of method for avoiding GPU resource contention towards GPU cluster environment |
Non-Patent Citations (1)
Title |
---|
Minyi Guo et al.: "Quality of service support for fine-grained sharing on GPUs", 2017 ACM/IEEE 44th International Symposium on Computer Architecture (ISCA) *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20190607 |