CN109857564A - Resource management method based on fine-grained GPU, and GPU applying the same - Google Patents
Resource management method based on fine-grained GPU, and GPU applying the same
- Publication number
- CN109857564A CN109857564A CN201910164573.0A CN201910164573A CN109857564A CN 109857564 A CN109857564 A CN 109857564A CN 201910164573 A CN201910164573 A CN 201910164573A CN 109857564 A CN109857564 A CN 109857564A
- Authority
- CN
- China
- Prior art keywords
- kernel
- gpu
- quota
- qos
- fine
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The present invention provides a resource management method based on a fine-grained GPU, and a GPU applying it. The resource management method based on a fine-grained GPU comprises: obtaining the target average instructions per cycle (IPC) of a kernel from its QoS target; dynamically adjusting the quota assigned to the kernel in each epoch according to that target average IPC; and distributing thread blocks to each streaming multiprocessor according to the kernel's runtime behavior. The present invention provides a fine-grained GPU sharing mechanism that performs fine-grained QoS management in both dynamic and static resource management, so that the QoS demands of specific applications are guaranteed while the overall throughput of the GPU system is maximized on that basis.
Description
Technical field
The present invention relates to the technical field of graphics processors (GPUs), and in particular to a resource management method based on a fine-grained GPU and a GPU applying it.
Background technique
A graphics processor, i.e. GPU (Graphics Processing Unit), is a special-purpose processor for accelerating image processing. Because GPUs offer a high degree of parallelism together with powerful matrix and floating-point computing capability, they are also widely used in compute-intensive applications beyond image processing, such as cloud computing and deep learning; this usage is referred to as general-purpose computing on graphics processing units (General-purpose computing on graphics processing units, GPGPU).
For a GPU, running multiple applications simultaneously, rather than only one at a time, effectively improves resource utilization. Among the tasks running concurrently on a GPU, some must obtain a fast response within a certain time to guarantee a good user experience. In some scenarios, an overly long response time for a particular task is very likely to degrade the user experience, for example stuttering or dropped frames in games. Even non-graphics tasks may have performance demands: a user of a data center may require that a submitted application complete at a guaranteed rate. Therefore, a method that effectively guarantees quality of service (Quality of Service, QoS) among multiple applications running simultaneously on a GPU is highly necessary.
According to whether preemption is supported, current GPU sharing modes can be divided into non-preemptive sharing and preemptive sharing. Non-preemptive sharing realizes QoS management over multiple applications on a GPU device mainly by modifying the device driver. Under this strategy, QoS is usually guaranteed by scheduling GPU-related instructions through system calls or by controlling the launch order of multiple kernels; in fact, the GPU still runs the instructions of only one application at a time, and parallelism is not improved. With the progress of the industry, preemptive GPU sharing has gradually become the mainstream research direction.
According to the granularity of resource allocation, preemptive sharing can be further divided into coarse-grained preemption and fine-grained preemption. In coarse-grained preemption, resources are reclaimed by context-switching an entire SM at a time, which causes under-utilization of the resources within the SM. Fine-grained preemption context-switches at the granularity of thread blocks, allowing finer resource allocation and effectively improving utilization within an SM, and thereby the overall resource utilization of the GPU.
Clearly, a fine-grained preemptive GPU sharing method manages the on-chip resources of the GPU more effectively than a coarse-grained one. How to perform QoS management on a shared GPU based on a fine-grained GPU sharing method has thus become a technical problem urgently awaiting solution by those skilled in the art.
Summary of the invention
In view of the foregoing deficiencies of the prior art, the purpose of the present invention is to provide a resource management method based on a fine-grained GPU and a GPU applying it, for performing fine-grained QoS management so as to precisely regulate the resources within an SM (streaming multiprocessor) and effectively improve the overall resource utilization of the GPU.
To achieve the above and other related objects, the present invention provides a resource management method based on a fine-grained GPU, comprising: obtaining the target average instructions per cycle (IPC) of a kernel from its QoS target; dynamically adjusting the quota assigned to the kernel in each epoch according to that target average IPC; and distributing thread blocks to each streaming multiprocessor according to the kernel's runtime behavior.
In one embodiment of the invention, the target average IPC of the kernel is:
IPC_goal = p × IPC_alone;
wherein IPC_goal is the target average IPC of the kernel, IPC_alone is the kernel's average IPC when run alone on the GPU, and p is the QoS percentage target.
In one embodiment of the invention, the quota assigned to the kernel in each epoch is:
Quota_k = α_k × IPC_goal × T_epoch;
wherein: IPC_goal = p × IPC_alone;
Quota_k is the quota assigned to kernel k in each epoch, IPC_alone is the average IPC of kernel k when run alone on the GPU, IPC_goal is the target average IPC of the kernel, T_epoch is the preset epoch length, α_k is the history factor, and IPC_history is the historical average IPC.
In one embodiment of the invention, one implementation of dynamically adjusting the per-epoch quota according to the kernel's target average IPC comprises: at the start of each epoch, distributing the quota assigned to kernel k over the streaming multiprocessors; obtaining each streaming multiprocessor's remaining quota from the instruction count actually completed; and determining from the remaining quota whether the quotas of the kernels with QoS requirements are exhausted, and if so, renewing the quotas of the kernels without QoS requirements.
In one embodiment of the invention, renewing the quota of a kernel without QoS requirements comprises: increasing the quota of a kernel without QoS requirements when the slow-phase condition is met; wherein the slow-phase condition is:
for two consecutive epochs, the IPC_epoch of a kernel with QoS requirements satisfies IPC_epoch < x × IPC_goal;
wherein IPC_epoch is the average IPC over the epoch, and x is a preset threshold, 0.5 < x < 0.8.
In one embodiment of the invention, one implementation of renewing the quota of the kernels without QoS requirements comprises: obtaining the current target gap of the kernels with QoS requirements; deriving from that target gap a target average IPC for the kernels without QoS requirements; and renewing the quota of the kernels without QoS requirements according to that target average IPC.
In one embodiment of the invention, one way of distributing thread blocks to each streaming multiprocessor according to the kernel's runtime behavior is: the thread blocks of the kernels with QoS requirements are distributed evenly over all streaming multiprocessors; the streaming multiprocessors are partitioned for the kernels without QoS requirements, and the thread blocks of each kernel without QoS requirements are distributed evenly over the streaming multiprocessors in its own partition.
In one embodiment of the invention, distributing thread blocks to each streaming multiprocessor according to the kernel's runtime behavior further comprises: collecting the number of idle warps of every kernel; and comparing the number of idle warps with the number of warps per thread block, and if the number of idle warps is greater than or equal to the number of warps of some thread block, removing that thread block via the TB scheduler.
In one embodiment of the invention, if the number of idle TBs of a kernel with QoS requirements is no more than 1, and its IPC_history has not reached the target, then more thread blocks are allocated to that kernel with QoS requirements.
An embodiment of the present invention also provides a GPU, the GPU being extended with a TB scheduler and a warp scheduler; the TB scheduler applies the resource management method based on a fine-grained GPU as described above.
As described above, the resource management method based on a fine-grained GPU and the GPU applying it according to the present invention have the following beneficial effects:
1. The present invention provides a fine-grained GPU sharing mechanism that performs fine-grained QoS management in both dynamic resource management and static resource management, guaranteeing the QoS demands of specific applications while maximizing the overall throughput of the GPU system on that basis.
2. The results of the invention can provide a technical reference for QoS management schemes in the design and deployment of industrial GPU hardware.
Brief description of the drawings
Fig. 1 is an architectural schematic of the extensions made to a GPU system shared via fine-grained preemption, in one embodiment of the invention.
Fig. 2 is an overall flow diagram of the resource management method based on a fine-grained GPU, in one embodiment of the invention.
Fig. 3 is an example of the implementation of the quota rollback strategy in the resource management method based on a fine-grained GPU, in one embodiment of the invention.
Fig. 4 is a schematic of the adjustment of quota allocation for kernels without QoS requirements (non-QoS kernels) under the quota rollback strategy with slow-phase judgment, in the resource management method based on a fine-grained GPU, in one embodiment of the invention.
Specific embodiments
The embodiments of the present invention are illustrated below by way of specific examples; those skilled in the art can easily understand other advantages and effects of the present invention from the content disclosed in this specification. The present invention can also be implemented or practiced through other different specific embodiments, and the details in this specification can be modified or altered in various ways from different viewpoints and for different applications, without departing from the spirit of the invention.
The embodiments of the present invention aim to provide a resource management method based on a fine-grained GPU and a server, for performing fine-grained QoS management so as to precisely regulate the resources within an SM (streaming multiprocessor) and effectively improve the overall resource utilization of the GPU.
The embodiments of the present invention are intended to design and realize, based on GPGPU-Sim, a fine-grained shared GPU scheduling strategy that maximizes the overall throughput of the remaining applications under the premise of guaranteeing the QoS of the target application. Through a scheme combining a static resource allocation policy with a quota-based dynamic processing-capacity allocation policy, this embodiment performs fine-grained QoS management, precisely regulating the resources within an SM and effectively improving the overall resource utilization of the GPU.
Fig. 1 shows the overall system structure of the fine-grained shared GPU architecture and the QoS management extension modules of the embodiment. The embodiment draws on prior fine-grained GPU sharing architectures and focuses on how to design, on that basis, a QoS management method for a fine-grained shared GPU.
As shown in Fig. 1, the technical solution of the embodiment extends a GPU system shared via fine-grained preemption. The extension mainly comprises four parts: an enhanced TB scheduler, static resource management, dynamic resource management, and an enhanced warp scheduler. The core of the invention is the resource allocation policy designed in the QoS management module, in particular the quota-based dynamic resource management strategy.
Enhanced TB scheduler: on the basis of a fine-grained shared GPU, an enhanced TB scheduler is introduced to support the parallel execution of multiple kernels within a single SM (streaming multiprocessor). The enhanced TB scheduler is responsible for communicating with each SM and performing allocation according to the formulated static and dynamic resource strategies.
Static resource management: static resource management decides the number of executable TBs allocated to each kernel on each SM. Since the scheme here is realized on a preemptive GPU, resource allocation can be adjusted flexibly at runtime according to how well the QoS targets are being met.
Dynamic resource management: the management of dynamic resources is the core. The embodiment of the present invention uses a quota-based algorithm to regulate the progress of each kernel in each epoch at runtime. The QoS management module is responsible for collecting real-time data fed back during execution and, according to the formulated QoS policy, computing the quota allocated to each kernel.
Enhanced warp scheduler: the enhanced warp scheduler distributes the core's compute cycles to each kernel according to the quota allocation sent from the QoS manager. The scheme of the embodiment divides execution into units of a fixed number of cycles, each unit called an epoch. At the start of each epoch, every kernel on each SM is assigned a quota. Within an epoch, the quota a kernel receives determines the maximum number of instructions, and hence the maximum progress, that the kernel can complete in that epoch.
The principles and embodiments of the resource management method based on a fine-grained GPU and the server of this embodiment are described in detail below, so that those skilled in the art can understand them without creative work.
As shown in Fig. 2, this embodiment provides a resource management method based on a fine-grained GPU, comprising:
Step S110, obtaining the target average IPC of a kernel from its QoS target;
Step S120, dynamically adjusting the quota assigned to the kernel in each epoch according to that target average IPC;
Step S130, distributing thread blocks to each streaming multiprocessor according to the kernel's runtime behavior.
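The three steps above can be sketched as follows. This is a minimal illustrative sketch, not the claimed implementation: the `Kernel` class, its field names, and the default history factor are hypothetical stand-ins for the quantities developed later in the description (IPC_goal = p × IPC_alone and Quota_k = α_k × IPC_goal × T_epoch).

```python
from dataclasses import dataclass

@dataclass
class Kernel:
    name: str
    ipc_alone: float    # average IPC when the kernel runs alone on the GPU
    p: float            # QoS percentage target, 0 <= p <= 1 (0 for non-QoS kernels)
    alpha: float = 1.0  # history factor alpha_k

def plan_epoch(kernels, t_epoch):
    """Steps S110/S120: derive each kernel's IPC target and its epoch quota."""
    plan = {}
    for k in kernels:
        ipc_goal = k.p * k.ipc_alone                 # S110: QoS target -> IPC target
        plan[k.name] = k.alpha * ipc_goal * t_epoch  # S120: quota for this epoch
    return plan
```

Step S130, the thread-block distribution, is then driven by runtime feedback and is not captured in this static sketch.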
Steps S110 to S130 of this embodiment are described in detail below.
Step S110, obtaining the target average IPC of a kernel from its QoS target.
The QoS target specified by the user may take many forms, such as frame rate or message transmission rate. To let the architecture-level QoS management system use this target directly, the QoS target must first be unified into a specific hardware metric. Empirically, any form of QoS target can be converted into the average instructions per cycle (Instructions Per Cycle, IPC) executed by the kernel. This embodiment manages QoS by guaranteeing the kernel's final execution time, or in other words its average IPC.
Specifically, in this embodiment, the target average IPC of the kernel is:
IPC_goal = p × IPC_alone;
wherein IPC_goal is the target average IPC of the kernel, and IPC_alone is the kernel's total instruction count divided by its execution cycles when run in isolation. Assuming that the kernel's execution time in isolation and the total number of instructions it needs to complete are known, the IPC target can be calculated from the formula above.
Denote the kernel's IPC when run alone on the GPU as IPC_alone, and assume that the IPC target when the kernel shares the GPU with other kernels (denoted IPC_goal, converted from QoS_goal) is achievable when it runs in isolation, i.e.:
IPC_goal ≤ IPC_alone.
Since different applications have different QoS requirements, this embodiment uniformly uses the ratio of IPC_goal to IPC_alone to express the IPC targets of different applications, denoted:
p = IPC_goal / IPC_alone;
where 0 ≤ p ≤ 1; p reflects the difficulty of reaching the QoS target.
Step S120, dynamically adjusting the quota assigned to the kernel in each epoch according to the target average IPC.
According to whether they have QoS demands, the kernels on a GPU can be divided into kernels with QoS requirements (QoS kernels) and kernels without QoS requirements (non-QoS kernels). The final goal of QoS management is to make the final average IPC of the QoS kernels reach the QoS target (IPC target) while maximizing the throughput of the non-QoS kernels. A key problem is therefore the design of the quota allocation algorithm for QoS kernels. This embodiment expresses the IPC target with a quota. Through the quota variable, the warp scheduler can be told the number of instructions each QoS kernel should complete in each epoch, and the warp scheduler tracks quota consumption in each epoch, feeding back information used to update the quota.
Specifically, in this embodiment, the quota assigned to the kernel in each epoch is:
Quota_k = α_k × IPC_goal × T_epoch;
wherein: IPC_goal = p × IPC_alone;
Quota_k is the quota assigned to kernel k in each epoch, IPC_alone is the average IPC of kernel k when run alone on the GPU, IPC_goal is the target average IPC of the kernel, T_epoch is the preset epoch length, α_k is the history factor, and IPC_history is the historical average IPC.
Assume the configured epoch length is T_epoch, denote the QoS kernel that needs to reach its target as k, and denote k's IPC when run alone on the GPU as IPC_alone. With a target QoS percentage p, the target average IPC that k is expected to reach when running in parallel can be calculated as: IPC_goal = p × IPC_alone.
Quota_k is the total instruction count for k that all SMs together need to complete. It is computed for each kernel k by the QoS manager and then further allocated to each SM according to the number of k's thread blocks held on that SM. For example, suppose the total number of thread blocks of kernel k is T, and SM_i is responsible for scheduling T_i of them; then the quota for k assigned to SM_i should be Quota_k × T_i / T. In this way, Quota_k is assigned to the SMs in relative balance.
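The proportional split of Quota_k over the SMs can be sketched as follows (function and variable names are illustrative):

```python
def split_quota_over_sms(quota_k, blocks_per_sm):
    """Distribute kernel k's epoch quota over the SMs in proportion to the
    number of k's thread blocks each SM holds: SM_i receives Quota_k * T_i / T."""
    total_blocks = sum(blocks_per_sm)
    return [quota_k * t_i / total_blocks for t_i in blocks_per_sm]
```

For example, a quota of 100 split over three SMs holding 1, 1, and 2 thread blocks yields per-SM quotas of 25, 25, and 50.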
In this embodiment, one implementation of dynamically adjusting the per-epoch quota according to the kernel's target average IPC comprises:
at the start of each epoch, distributing the quota assigned to kernel k over the streaming multiprocessors;
obtaining each streaming multiprocessor's remaining quota from the instruction count actually completed;
determining from the remaining quota whether the quotas of the kernels with QoS requirements are exhausted, and if so, renewing the quotas of the kernels without QoS requirements.
Specifically, at the start of each epoch, Quota_k is distributed to the SMs. Each SM keeps a counter C_k to store its share of k's quota. When a warp instruction of k completes, C_k is decremented by the number of instructions actually completed (this count is usually equal to the SIMD width, but may be any positive integer not exceeding it). If C_k is positive at the end of the epoch, i.e. k has unfinished quota, the remainder is retained and added onto Quota_k for the next epoch when resetting C_k.
To prevent non-QoS kernels from occupying too much of the computing resources of QoS kernels, quotas are likewise set for non-QoS kernels to limit them. An SM can check whether C_k is non-positive for all QoS kernels, to ensure they have all exhausted their quota and completed the phase QoS target. Once the quotas of all QoS kernels are used up, the C_k of all non-QoS kernels are renewed with their quota Quota_k, ensuring there are always thread blocks running on the SM. For QoS kernels, as long as they reach their QoS target, high throughput is not required; so once a QoS kernel's IPC target for the epoch has been reached, no new quota need be allocated to it. At the start of each epoch, the C_k of every kernel, QoS or non-QoS, is reset to Quota_k. We call this strategy the quota rollback strategy.
The quota rollback strategy is illustrated in Fig. 3. K_1 is a QoS kernel with quota Quota_1 = 100; K_2 is a non-QoS kernel with quota Quota_2 = 50. At time t_0, the first epoch, Quota_1 and Quota_2 are assigned to K_1 and K_2 respectively, C_k1 is set to 100 and C_k2 to 50. At time t_1, the second epoch, K_1's 5 units of remaining quota are accumulated onto its base quota, so C_k1 is set to 105, while K_2's remaining quota is discarded and C_k2 is set to 50. At time t_2, before the second epoch has ended, the quotas of both K_1 and K_2 are used up. Since K_1 is a QoS kernel, its quota is not renewed; K_2 is a non-QoS kernel, and since all QoS kernels have completed their phase targets, K_2's quota can be renewed to make full use of the computing resources remaining in the epoch.
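The per-SM counter behavior just described (decrement on warp completion, QoS carry-over at epoch boundaries) can be sketched as follows; the class and method names are hypothetical:

```python
class QuotaCounter:
    """Per-SM counter C_k tracking one kernel's remaining quota in an epoch."""

    def __init__(self, quota_k, is_qos):
        self.quota_k = quota_k  # base quota assigned each epoch
        self.is_qos = is_qos
        self.c_k = quota_k

    def on_warp_complete(self, insts):
        # insts is usually the SIMD width, but may be any smaller positive count
        self.c_k -= insts

    def start_new_epoch(self):
        # QoS kernels carry unfinished quota into the next epoch;
        # leftover quota of non-QoS kernels is discarded.
        carry = self.c_k if (self.is_qos and self.c_k > 0) else 0
        self.c_k = self.quota_k + carry
```

With a base quota of 100, a QoS kernel that completes 95 instructions in an epoch carries the 5 remaining units forward and starts the next epoch with C_k = 105, matching the Fig. 3 example; a non-QoS kernel's leftover is simply dropped.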
In fact, some kernels have an IPC that is slow at first and fast later (hereinafter "cumulative-type" kernels), such as the sparse matrix multiplication (spmv) in the Parboil benchmark suite. In the early phase, such a kernel runs slowly due to its intrinsic characteristics; even if its quota is increased, it may remain far from the target IPC. Meanwhile, because the QoS kernel keeps failing to reach the IPC target, continuously accumulating remaining quota instead suppresses the performance of the non-QoS kernels, eventually leading to poor overall GPU performance during the slow phase.
To solve this problem, this embodiment further proposes a slow-phase judgment method.
In this embodiment, renewing the quota of the kernels without QoS requirements comprises: increasing the quota of a kernel without QoS requirements when the slow-phase condition is met; wherein the slow-phase condition is:
for two consecutive epochs, the IPC_epoch of a kernel with QoS requirements satisfies IPC_epoch < x × IPC_goal;
wherein IPC_epoch is the average IPC over the epoch, and x is a preset threshold, 0.5 < x < 0.8.
On the basis of the quota rollback strategy, some flexible adjustments are made to the quota allocation of the non-QoS kernels, as shown in Fig. 4.
The slow-phase condition is: for a configured threshold x (0 < x < 1), the IPC_epoch of the QoS kernel satisfies IPC_epoch < x × IPC_goal for two consecutive epochs.
The reason for requiring "two consecutive epochs" is to avoid misjudging a brief performance fluctuation as a phase-level slow period. The setting of x must satisfy two conditions: first, it must be small enough to distinguish slight performance fluctuations from a phase-level slow period caused by the kernel's intrinsic performance, so as not to affect the performance of the QoS kernel; second, it must be large enough that when a slow phase does occur, the idle computing resources during that time can be fully used to raise the throughput of the non-QoS kernels.
Considering these two points, the range of x is tentatively set between 0.5 and 0.8. To determine the value, sensitivity tests can be run with x equal to 0.5, 0.6, 0.7 and 0.8 respectively, and the best value chosen according to the actual situation.
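The slow-phase test can be sketched as follows (names are illustrative; the default x is a mid-range value pending the sensitivity test described above):

```python
def in_slow_phase(qos_epoch_ipcs, ipc_goal, x=0.6):
    """True when the QoS kernel's per-epoch IPC stayed below x * IPC_goal
    for the last two consecutive epochs (with threshold 0.5 < x < 0.8)."""
    if len(qos_epoch_ipcs) < 2:
        return False
    return all(ipc < x * ipc_goal for ipc in qos_epoch_ipcs[-2:])
```

A single below-threshold epoch does not trigger the slow phase; only two consecutive ones do, filtering out brief fluctuations.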
Since non-QoS kernels have no QoS target, their quota cannot be allocated directly in the way used for QoS kernels.
In this embodiment, one implementation of renewing the quota of the kernels without QoS requirements comprises:
obtaining the current target gap of the kernels with QoS requirements;
deriving from that target gap a target average IPC for the kernels without QoS requirements;
renewing the quota of the kernels without QoS requirements according to that target average IPC.
In the QoS management strategy of this embodiment, the quota allocated to a non-QoS kernel depends on how far the QoS kernels currently are from their QoS targets. If the IPC_epoch of a QoS kernel in the last epoch was higher than IPC_goal, the non-QoS kernels can receive a higher quota, and vice versa. So, before computing the quota of a non-QoS kernel, an interim IPC target is first computed for it from historical information:
IPC_goal = α_k × IPC_history;
where α_k is the history adjustment factor in the simple history-based strategy. At the start of each epoch, once the IPC target of a non-QoS kernel has been computed, its quota is computed by the same method as the quota of a QoS kernel. To let the QoS kernels reach their QoS targets as early as possible, the IPC_epoch of a non-QoS kernel can be initialized to a very small value (for example, 1). This means the quota of the non-QoS kernels is very small at the beginning and will not affect the performance of the QoS kernels. To avoid under-utilization of resources, the IPC_goal of a non-QoS kernel can grow cumulatively, but never so high as to affect the performance of the QoS kernels. In fact, the initial value of IPC_epoch has very little influence on the final result. Thus the non-QoS kernel quota design of this embodiment can dynamically limit the resources of the non-QoS kernels when the QoS kernels need more resources, and release resources to the non-QoS kernels to raise their throughput when the QoS kernels do not need as many. This is consistent with the final goal of QoS management.
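A sketch of this history-based renewal, under the assumption that the history factor α_k is the ratio of the QoS kernel's last-epoch IPC to its own target (the exact form of α_k is not fixed by the text; this ratio is one plausible choice consistent with "higher last-epoch IPC permits a higher non-QoS quota"):

```python
def non_qos_ipc_target(qos_epoch_ipc, qos_ipc_goal, ipc_history):
    """Interim IPC target for a non-QoS kernel: scale its historical IPC by
    how much the QoS kernel over- or under-shot its own target last epoch."""
    alpha_k = qos_epoch_ipc / qos_ipc_goal  # assumed form of the history factor
    return alpha_k * ipc_history
```

When the QoS kernel overshot its target (ratio above 1), the non-QoS target, and hence its quota, rises; when it undershot, the non-QoS target shrinks, returning resources to the QoS kernel.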
Step S130, distributing thread blocks to each streaming multiprocessor according to the kernel's runtime behavior.
The static resources mainly include registers, shared memory, and the number of thread blocks of each kernel. Providing enough thread blocks for a QoS kernel improves thread-level parallelism (TLP), thereby increasing the likelihood of reaching the QoS target.
In this embodiment, one way of distributing thread blocks to each streaming multiprocessor according to the kernel's runtime behavior is:
the thread blocks of the kernels with QoS requirements are distributed evenly over all streaming multiprocessors;
the streaming multiprocessors are partitioned for the kernels without QoS requirements, and the thread blocks of each kernel without QoS requirements are distributed evenly over the streaming multiprocessors in its own partition.
The static resource allocation of this embodiment adopts a symmetric distribution scheme. On the one hand, the thread blocks of a QoS kernel are assigned evenly to every SM, so that each SM holds the same number of its thread blocks and the TLP stays balanced. On the other hand, previous studies on fine-grained sharing have shown that running too many kernels on one SM simultaneously may perform poorly. Therefore, for the non-QoS kernels, the SMs are divided into partitions, and the thread blocks of each non-QoS kernel are distributed evenly over the SMs in its own partition.
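The symmetric allocation above can be sketched as follows. This is a simplified illustration under assumed inputs (the SM count is divisible by the number of non-QoS kernels); the function name and the round-robin placement details are hypothetical:

```python
# Sketch of the symmetric static allocation: QoS kernel thread blocks are
# spread over ALL SMs; each non-QoS kernel gets its own SM partition.
# (Hypothetical interface; assumes num_sms is divisible by the kernel count.)

def assign_thread_blocks(num_sms, qos_tbs, non_qos_tbs):
    """qos_tbs / non_qos_tbs: {kernel_name: thread_block_count}.
    Returns {sm_id: [kernel_name, ...]}, one entry per thread block."""
    sms = {sm: [] for sm in range(num_sms)}
    # QoS kernels: round-robin over every SM so each SM holds the same
    # number of their thread blocks and TLP stays balanced.
    for kernel, count in qos_tbs.items():
        for i in range(count):
            sms[i % num_sms].append(kernel)
    # Non-QoS kernels: one SM partition per kernel, so a single SM never
    # runs too many kernels simultaneously.
    kernels = list(non_qos_tbs)
    part = num_sms // max(1, len(kernels))
    for k, kernel in enumerate(kernels):
        for i in range(non_qos_tbs[kernel]):
            sms[k * part + i % part].append(kernel)
    return sms
```

For example, with 4 SMs, one QoS kernel of 8 thread blocks, and two non-QoS kernels of 4 thread blocks each, every SM holds two QoS blocks, while the first non-QoS kernel occupies SMs 0-1 and the second SMs 2-3.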
In this embodiment, allocating thread blocks to each stream processor according to the running state of the kernels further includes:
collecting the number of idle warps of all kernels;
comparing the number of idle warps with the number of warps in each thread block, and if the number of idle warps is greater than or equal to the number of warps of some thread block, removing that thread block through the TB scheduler.
At runtime, the allocation of thread blocks is adjusted according to the running state of the kernels. In each epoch, the number of "idle warps" is collected for every kernel, i.e., warps that have instructions ready to run but are not scheduled because the pipeline is saturated. The number of idle warps reflects the degree of excess TLP. Since idle warps occupy static resources without contributing anything, their thread blocks should be swapped out so that thread blocks of kernels with low TLP can be brought in. At the beginning of each epoch, the number of idle warps of each kernel is collected from every SM. If this number equals the number of warps per thread block, swapping out an entire thread block will not affect TLP at all. We call such thread blocks "idle TBs".
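The idle-TB arithmetic above can be sketched in a few lines. The sampling interface is hypothetical; only the whole-thread-block computation follows the description:

```python
# Sketch of idle-TB detection (hypothetical interface). An "idle warp" has
# runnable instructions but is never issued because the pipeline is saturated;
# when a kernel's idle warps cover whole thread blocks, those blocks can be
# swapped out without lowering the achieved TLP.

def count_idle_tbs(idle_warps_per_sm, warps_per_tb):
    """idle_warps_per_sm: idle-warp samples collected from each SM for one
    kernel at the start of an epoch. Returns the number of 'idle TBs'."""
    total_idle = sum(idle_warps_per_sm)
    return total_idle // warps_per_tb  # whole blocks whose removal keeps TLP
```

For example, 12 idle warps across the SMs with 4 warps per thread block mean three whole thread blocks can be displaced without affecting TLP.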
In addition, in this embodiment, if the number of idle TBs of a kernel with QoS requirements is no more than 1 and its IPC_history has not reached the target, more thread blocks are allocated to that kernel.
That is, a QoS kernel with at most one idle TB whose IPC_history falls short of the target is given more thread blocks, so that it reaches its QoS target faster.
The kernel being displaced should meet one of the following three conditions:
1. The kernel is a non-QoS kernel.
2. If n of the kernel's idle TBs are needed to free enough resources, it must have n+1 idle TBs.
3. The IPC_history of the kernel should satisfy the corresponding threshold condition, where N is its total number of thread blocks.
According to the above conditions, a displaced kernel is either a non-QoS kernel, or a QoS kernel that has excess TLP or can afford to lose enough IPC. In addition, to avoid excessive preemption overhead, a swap takes place only when no kernel has pending preemption instructions.
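The three displacement conditions can be sketched as a single eligibility check. Conditions 1 and 2 follow the text directly; the exact formula for condition 3 is not reproduced in the published text, so the proportional-IPC margin used below is an assumption:

```python
# Sketch of the swap-out eligibility check (hypothetical fields). A thread
# block may be taken from a kernel only if (1) it is non-QoS, (2) it has
# n+1 idle TBs when n are needed (excess TLP), or (3) it is a QoS kernel
# far enough above target to absorb the IPC loss. The margin test for (3)
# is an ASSUMED form, not the patent's formula.

def can_evict(kernel, n_needed):
    if not kernel["qos"]:
        return True                        # condition 1: non-QoS kernel
    if kernel["idle_tbs"] >= n_needed + 1:
        return True                        # condition 2: excess TLP
    total = kernel["total_tbs"]            # N in the description
    # condition 3 (assumed form): even after losing n_needed of N thread
    # blocks, the kernel's historical IPC would still meet its goal.
    projected = kernel["ipc_history"] * (total - n_needed) / total
    return projected >= kernel["ipc_goal"]
```

A scheduler would additionally verify, per the text, that no kernel has pending preemption instructions before actually performing the swap.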
In conclusion the present embodiment also provides more from the thread block quantity control in static resource for QoS kernel
The good condition for reaching QoS target.
An embodiment of the present invention also provides a GPU, the GPU being extended with a TB scheduler and a warp scheduler; the TB scheduler applies the resource management method based on a fine-grained GPU as described above. Since that method has been described in detail above, the details are not repeated here.
In conclusion the present invention provides one kind to be based on fine granularity GPU shared mechanism, it can be from dynamic resource management, quiet
Two aspects of state resource management carry out fine-grained QoS management, not only can guarantee the QoS demand of specific application, but also can be basic herein
The upper entire throughput for maximizing GPU system;Achievement of the invention can provide for the design and landing of industry GPU hardware
The Technical Reference of QoS Managed Solution.So the present invention effectively overcomes various shortcoming in the prior art and has high industrial benefit
With value.
The above embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Anyone familiar with this technology may modify or alter the above embodiments without departing from the spirit and scope of the invention. Therefore, all equivalent modifications or alterations completed by persons of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the present invention shall still be covered by the claims of the present invention.
Claims (10)
1. A resource management method based on a fine-grained GPU, characterized in that the resource management method based on a fine-grained GPU comprises:
obtaining, according to a QoS target, a target average number of instructions executed per cycle (IPC) for a kernel;
dynamically adjusting, according to the target average IPC of the kernel, the quota quantity allocated to the kernel in each epoch;
allocating thread blocks to each stream processor according to the running state of the kernel.
2. The resource management method based on a fine-grained GPU according to claim 1, characterized in that the target average IPC of the kernel is:
IPC_goal = p × IPC_alone;
wherein IPC_goal is the target average IPC of the kernel, p is the QoS target ratio, and IPC_alone is the average IPC of the kernel when running alone on the GPU.
3. The resource management method based on a fine-grained GPU according to claim 1 or 2, characterized in that the quota quantity allocated to the kernel in each epoch is:
Quota_k = α_k × IPC_goal × T_epoch;
wherein: α_k = IPC_goal / IPC_history; IPC_goal = p × IPC_alone;
Quota_k is the quota quantity allocated to kernel k in each epoch, IPC_alone is the average IPC of kernel k when running alone on the GPU, IPC_goal is the target average IPC of the kernel, T_epoch is the preset epoch length, α_k is the historical factor, and IPC_history is the historical average IPC.
4. The resource management method based on a fine-grained GPU according to claim 1, characterized in that one implementation of dynamically adjusting, according to the target average IPC of the kernel, the quota quantity allocated to the kernel in each epoch comprises:
at the initial moment of each epoch, distributing the quota quantity allocated to kernel k over the stream processors;
obtaining the remaining quota of each stream processor according to the number of instructions actually completed;
determining, according to the remaining quota, whether the quota of the kernels with QoS requirements has been used up, and if so, updating the quota of the kernels without QoS requirements.
5. The resource management method based on a fine-grained GPU according to claim 4, characterized in that updating the quota of the kernels without QoS requirements comprises:
increasing the quota of the kernels without QoS requirements when the slow-stage condition is met; wherein the slow-stage condition is:
in two consecutive epochs, the IPC_epoch of a kernel with QoS requirements satisfies IPC_epoch < x·IPC_goal;
wherein IPC_epoch is the average IPC within an epoch, and x is a set threshold with 0.5 < x < 0.8.
6. The resource management method based on a fine-grained GPU according to claim 5, characterized in that one implementation of updating the quota of the kernels without QoS requirements comprises:
obtaining the current target gap of the kernels with QoS requirements;
obtaining, according to the current target gap of the kernels with QoS requirements, the target average IPC of the kernels without QoS requirements;
updating the quota of the kernels without QoS requirements according to their target average IPC.
7. The resource management method based on a fine-grained GPU according to claim 1, characterized in that one way of allocating thread blocks to each stream processor according to the running state of the kernel is:
distributing the thread blocks of kernels with QoS requirements evenly over all stream processors;
partitioning the stream processors for the kernels without QoS requirements, the thread blocks of each kernel without QoS requirements being distributed evenly over the stream processors in its own partition.
8. The resource management method based on a fine-grained GPU according to claim 7, characterized in that allocating thread blocks to each stream processor according to the running state of the kernel further comprises:
collecting the number of idle warps of all kernels;
comparing the number of idle warps with the number of warps in each thread block, and if the number of idle warps is greater than or equal to the number of warps of some thread block, removing that thread block through the TB scheduler.
9. The resource management method based on a fine-grained GPU according to claim 8, characterized in that if the number of idle TBs of a kernel with QoS requirements is no more than 1 and its IPC_history has not reached the target, more thread blocks are allocated to that kernel with QoS requirements.
10. A GPU, characterized in that the GPU is extended with a TB scheduler and a warp scheduler; the TB scheduler applies the resource management method based on a fine-grained GPU according to any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910164573.0A CN109857564A (en) | 2019-03-05 | 2019-03-05 | The GPU of method for managing resource and its application based on fine-grained GPU |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109857564A true CN109857564A (en) | 2019-06-07 |
Family
ID=66899834
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109857564A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111124691A (en) * | 2020-01-02 | 2020-05-08 | 上海交通大学 | Multi-process shared GPU (graphics processing Unit) scheduling method and system and electronic equipment |
CN114463159A (en) * | 2022-01-06 | 2022-05-10 | 江苏电力信息技术有限公司 | GPU resource sharing method |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101685391A (en) * | 2002-01-04 | 2010-03-31 | 微软公司 | Methods and system for managing computational resources of a coprocessor in a computing system |
CN104838360A (en) * | 2012-09-04 | 2015-08-12 | 微软技术许可有限责任公司 | Quota-based resource management |
CN106576114A (en) * | 2014-08-08 | 2017-04-19 | 甲骨文国际公司 | Policy based resource management and allocation system |
CN106874113A (en) * | 2017-01-19 | 2017-06-20 | 国电南瑞科技股份有限公司 | A kind of many GPU heterogeneous schemas static security analysis computational methods of CPU+ |
CN107943592A (en) * | 2017-12-13 | 2018-04-20 | 江苏省邮电规划设计院有限责任公司 | A kind of method for avoiding GPU resource contention towards GPU cluster environment |
Non-Patent Citations (1)
Title |
---|
Minyi Guo et al.: "Quality of service support for fine-grained sharing on GPUs", 2017 ACM/IEEE 44th International Symposium on Computer Architecture (ISCA) *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20190607 |