CN108710536A - Multi-level fine-grained virtualized GPU scheduling optimization method - Google Patents

Multi-level fine-grained virtualized GPU scheduling optimization method

Info

Publication number
CN108710536A
CN108710536A (application number CN201810285080.8A)
Authority
CN
China
Prior art keywords
scheduling
gpu
ring
event
virtual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810285080.8A
Other languages
Chinese (zh)
Other versions
CN108710536B (en)
Inventor
姚建国 (Yao Jianguo)
赵晓辉 (Zhao Xiaohui)
高平 (Gao Ping)
管海兵 (Guan Haibing)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN201810285080.8A priority Critical patent/CN108710536B/en
Publication of CN108710536A publication Critical patent/CN108710536A/en
Application granted granted Critical
Publication of CN108710536B publication Critical patent/CN108710536B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5038Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration

Abstract

The invention discloses a multi-level fine-grained virtualized GPU scheduling optimization method that optimizes the scheduling policy in three ways: scheduling based on time and events, seamless pipeline-based scheduling, and mixed scheduling based on both rings and virtual machines. These three strategies respectively exploit the overhead incurred when two virtual machines switch, the division of a virtual machine's work into stages that can run concurrently, and the ability of multiple virtual machines to work simultaneously on different rings. By modifying the scheduler and the scheduling policy, the invention greatly reduces switching overhead and supports parallel execution among multiple virtual GPUs, so the performance of multiple virtual GPUs sharing one physical GPU is markedly improved, raising overall performance. The invention increases the utilization of the physical GPU and, in turn, the performance of each virtual GPU. In addition, the method ensures that the virtual GPUs still meet their quality-of-service requirements.

Description

Multi-level fine-grained virtualized GPU scheduling optimization method
Technical field
The present invention relates to the field of GPU virtualization and its task scheduling, and in particular to a multi-level fine-grained virtualized GPU scheduling optimization method. Specifically, the performance of GPU virtualization is improved mainly through an optimized scheduling policy. By optimizing the GPU scheduling policy, the original coarse-grained scheduling becomes fine-grained scheduling, making full use of previously unusable time and resources, so that the performance of the virtual GPUs and the overall utilization of the GPU improve without any change to the physical hardware.
Background technology
Nowadays, GPU technology is increasingly important in high-performance computing: fields such as AI, deep learning, data analysis, and cloud gaming all require GPUs. GPU cloud services have emerged accordingly; Tencent and Alibaba both offer GPU cloud servers to users as a new computing paradigm.
High-performance computing therefore places high demands on GPU virtualization technology. The current solution is full GPU virtualization. Its advantages are good isolation and security, and it requires no special hardware support. However, in current full GPU virtualization, the scheduling granularity is coarse, leaving considerable room for performance improvement. The experimental part of the present invention is based on GVT-g (Intel Graphics Virtualization Technology for shared vGPU, Intel's GPU virtualization technology), which allows an Intel GPU to be virtualized into multiple virtual GPUs for use by multiple virtual machines. Although the experiments target Intel GPUs, the method remains general.
The existing GVT-g scheduler runs once every 1 millisecond. Using round-robin scheduling, each invocation selects the virtual machine that may execute during the next period; within that period, the selected virtual machine's workloads are allowed to execute on the physical GPU. Each virtual machine has quality-of-service requirements of three kinds: cap, weight (proportion), and priority; the original scheduling policy guarantees that these requirements are met. Whenever the timer fires, the scheduler determines whether the current task has completed. If it has, the scheduler dispatches a new task; otherwise it chooses the next task according to the quality-of-service requirements. Every task executes only when the scheduler dispatches it, and execution is serialized overall: no two tasks are ever executed simultaneously.
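The baseline just described can be sketched as a small simulation. This is an illustrative model only, not GVT-g source code: the class names, the 100-slice accounting period, and the budget arithmetic are all hypothetical, and only the "weight" class of quality-of-service requirement is modeled.

```python
# Toy model of the baseline: every 1 ms a round-robin pass picks the next
# vGPU whose workload may run on the physical GPU, with a per-vGPU weight
# steering the share of time slices each one receives.

from dataclasses import dataclass, field
from collections import deque

TIME_SLICE_MS = 1  # the fixed 1 ms scheduling quantum described above

@dataclass
class VGpu:
    name: str
    weight: int            # QoS proportion ("weight")
    budget: int = 0        # remaining slices in the current accounting period
    queue: deque = field(default_factory=deque)  # pending workloads

class RoundRobinScheduler:
    def __init__(self, vgpus, period_slices=100):
        self.vgpus = vgpus
        self.period = period_slices
        self._refill()

    def _refill(self):
        # Hand out slice budgets in proportion to each vGPU's weight.
        total = sum(v.weight for v in self.vgpus)
        for v in self.vgpus:
            v.budget = self.period * v.weight // total

    def tick(self):
        """Called once per 1 ms timer tick: pick the next runnable vGPU."""
        for _ in range(len(self.vgpus)):
            v = self.vgpus[0]
            self.vgpus.append(self.vgpus.pop(0))  # rotate round-robin order
            if v.budget > 0 and v.queue:
                v.budget -= 1
                return v  # this vGPU owns the GPU for the next slice
        self._refill()  # nothing runnable: start a new accounting period
        return None     # the GPU idles this slice
```

With two vGPUs of weights 3 and 1 and always-full queues, 100 ticks split 75/25 — the coarse, strictly serialized behavior that the following sections set out to refine.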
This scheduling scheme is simple and blunt, and it makes quality-of-service accounting easy. At the same time, however, its granularity is coarse, and much idle time is wasted.
Summary of the invention
In view of the defects in the prior art, the object of the present invention is to provide a multi-level fine-grained virtualized GPU scheduling optimization method. It consists of three compatible scheduling strategies that exploit, from three angles, capacity that was previously left unused, thereby improving performance. The three strategies are: scheduling based on time and events, seamless pipeline-based scheduling, and mixed scheduling based on both rings and virtual machines.
The present invention is realized by the following technical scheme:
A multi-level fine-grained virtualized GPU scheduling optimization method, characterized by comprising the following steps:
Step S1: add scheduling based on time and events, reducing the overhead of switching between two virtual GPUs;
Step S2: add seamless pipeline-based scheduling, allowing some of the virtual GPUs to run in parallel and improving efficiency when virtual GPUs work together;
Step S3: add mixed scheduling based on both rings and virtual machines, allowing different virtual machines to use the physical GPU fully concurrently and improving overall utilization.
In the above technical scheme, step S1 comprises the following steps:
Step S101: decouple the scheduling policy framework from the workload scheduler: the scheduling policy framework implements the modified scheduling policy, while the workload scheduler carries out the actual scheduling;
Step S102: add a context-complete event, which is triggered when a context finishes and passed to the scheduling policy framework to trigger the corresponding task scheduling;
Step S103: add a context-submit event; when the workload scheduler receives a context-submit event while the virtual GPU is idle, it handles the event immediately and executes the task;
Step S104: modify the scheduling policy framework to support the added events; upon receiving a timer tick or an event, the framework handles it and submits work to the workload scheduler for execution;
Step S105: modify the quality-of-service accounting in the scheduling policy framework.
In the above technical scheme, step S2 comprises the following steps:
Step S201: decompose the scheduler workflow into an audit-and-shadow stage and a scheduling-and-execution stage, the former being the preparation stage and the latter the execution stage;
Step S202: split the work submission path so that multiple workloads can be submitted simultaneously, making full use of the pipelining advantage;
Step S203: move the original shadow-related code out of task dispatch, separating the code of the different stages. The whole workflow code is thereby divided into two relatively independent parts, audit-and-shadow and scheduling-and-execution, so that different stages can run simultaneously without interfering, improving efficiency.
Step S204: when a virtual GPU has only one shadow context, and that virtual GPU is not the current GPU, shadow its first workload immediately.
In the above technical scheme, step S3 comprises the following steps:
Step S301: introduce scheduling by ring (or engine) and add support for rings, changing the minimum scheduling unit from the whole GPU to each individual ring, so that workloads running on different rings execute simultaneously without interfering;
Step S302: modify the workload scheduler, changing all relevant code from single variables to arrays and from per-virtual-machine units to per-ring units; this involves rescheduling and updating the current GPU and ring and the next GPU and ring;
Step S303: modify the workload scheduler, restructuring the relevant code logic: logic that supported only per-virtual-machine units is replaced by new per-ring logic;
Step S304: modify the scheduling policy framework, changing the scheduling data structure from a single instance to an array so as to support multiple rings running simultaneously;
Step S305: modify the scheduling policy framework, changing timer and event triggering to per-ring support;
Step S306: per-ring scheduling requires consistent global state; use the CRC32 checksum of the state referenced by the pointer to judge whether the global states of the current virtual machine and the next virtual machine are consistent;
Step S307: each time MMIO state is saved or restored, compute the CRC32 value; if the values are consistent, use the per-ring scheduling of steps S301-S305; if inconsistent, use the original per-virtual-machine scheduling;
Step S308: for quality-of-service maintenance, account for each ring separately and redefine the quality-of-service parameters, guaranteeing that per-ring scheduling also maintains correct quality of service;
Step S309: implement the switching between per-virtual-machine and per-ring scheduling, guaranteeing correct program execution.
Compared with the prior art, the present invention has the following advantageous effects:
The present invention makes full use of idle and wasted time, improving overall performance without changing the hardware and while maintaining quality of service. For example, a virtual GPU service provider can, with the same funds and the same hardware configuration, obtain more overall performance after applying the present invention, sell to more users, and earn more revenue. A user with a specific target, for example making a program reach 60 frames per second, can buy less equipment to reach that target, reducing expense.
Description of the drawings
Other features, objects, and advantages of the present invention will become more apparent upon reading the detailed description of non-limiting embodiments with reference to the following drawings:
Fig. 1 shows the overall framework of scheduling based on time and events;
Fig. 2 compares time-based scheduling with scheduling based on time and events;
Fig. 3 illustrates seamless pipeline-based scheduling;
Fig. 4 illustrates mixed scheduling based on rings and virtual machines;
Fig. 5-1 shows 3dmark06 benchmark scores for scheduling based on time and events;
Fig. 5-2 shows heaven benchmark scores for scheduling based on time and events;
Fig. 6-1 shows 3dmark06 benchmark scores for seamless pipeline-based scheduling;
Fig. 6-2 shows heaven benchmark scores for seamless pipeline-based scheduling;
Fig. 7-1 shows benchmark scores for per-ring scheduling;
Fig. 7-2 shows benchmark scores for mixed scheduling based on rings and virtual machines.
Detailed description of the embodiments
The present invention is described in detail below with reference to specific embodiments. The following embodiments will help those skilled in the art further understand the present invention, but do not limit it in any way. It should be pointed out that a person of ordinary skill in the art can make several changes and improvements without departing from the inventive concept; these all belong to the protection scope of the present invention.
First, in the existing scheduling method, scheduling is triggered entirely by the timer, yet after each task completes there may be idle time; especially when small tasks dominate, this idle time is wasted. The present invention therefore adds a task-completion event that triggers the scheduler early, making full use of this time. Fig. 5-1 shows 3dmark06 benchmark scores and Fig. 5-2 shows heaven benchmark scores for scheduling based on time and events; the figures show a considerable improvement after applying the invention. Experiments show that with scheduling based on time and events, overall GPU performance improves by 3.2%-21.5%.
Second, in the existing scheduling method, tasks execute strictly in order: only after the former has completed can the latter execute. In fact each task can be divided into two stages: the first is a preparation stage, and only the second is real execution. The two stages invoke different parts of the physical GPU, so they can be scheduled in the manner of a two-stage pipeline. The present invention uses this pipeline-like scheduling method, allowing the two stages of different tasks to be scheduled simultaneously. Fig. 6-1 shows 3dmark06 benchmark scores and Fig. 6-2 shows heaven benchmark scores for seamless pipeline-based scheduling; some benchmarks improve considerably after applying this method. Experiments show that with seamless pipeline-based scheduling, overall GPU performance improves by 0%-19.7%.
Finally, in the existing scheduling method, tasks are all scheduled with the entire GPU as the unit: even if multiple tasks invoke different rings/engines, for example image rendering and streaming-media encoding/decoding, the latter must wait for the former to complete. The present invention therefore introduces per-ring scheduling: if tasks need different rings, they are allowed to execute concurrently. Because GPU hardware support is insufficient, invoking this method requires that the virtual machines' operating systems be the same and their global states be consistent; when they are inconsistent, the original method is used instead. Fig. 7-1 shows benchmark scores for per-ring scheduling, and Fig. 7-2 shows benchmark scores for mixed scheduling based on rings and virtual machines. Experiments show that with this mixed scheduling based on rings and virtual machines, when two tasks invoke different rings their performance improves by 34.0% and 70.6% respectively, with little impact on the remaining parts.
The multi-level fine-grained virtualized GPU scheduling optimization method of the present invention is characterized by comprising the following steps:
Step S1: add scheduling based on time and events, reducing the overhead of switching between two virtual GPUs;
Step S2: add seamless pipeline-based scheduling, allowing some of the virtual GPUs to run in parallel and improving efficiency when virtual GPUs work together;
Step S3: add mixed scheduling based on both rings and virtual machines, allowing different virtual machines to use the physical GPU fully concurrently and improving overall utilization.
Fig. 1 shows the overall framework of scheduling based on time and events; the essence is adding time and event handling and modifying the framework. Fig. 2 compares time-based scheduling (top, which produces idle time) with scheduling based on time and events (bottom, which makes full use of that idle time). Adding scheduling based on time and events comprises the following steps:
Step S101: decouple the scheduling policy framework from the workload scheduler: the scheduling policy framework implements the modified scheduling policy, while the workload scheduler carries out the actual scheduling. In the original framework the two are intermingled and cannot be modified into a custom policy-driven scheduling method, so they must be separated. After separation, the scheduling policy framework is only responsible for implementing the scheduling policy, and the workload scheduler only for carrying out scheduling. This separation of concerns makes the subsequent modifications possible.
Step S102: add a context-complete event, which is triggered when a context finishes and passed to the scheduling policy framework to trigger the corresponding task scheduling. The original design has only the timer-triggered event, firing once per 1 ms. The present invention adds, on top of the original scheduler, a context-complete event that fires when a context finishes and is passed to the scheduling policy framework to trigger the corresponding task scheduling.
Step S103: add a context-submit event, which allows a task to be executed while the current virtual GPU is idle. As soon as the workload scheduler receives a context-submit event while the virtual GPU is idle, it handles the event immediately and executes the task. In the original method, lacking this event, the workload scheduler would idle-wait and waste working time.
Step S104: modify the scheduling policy framework to support the added events. The original design supports only timer triggering; the added event triggering requires framework support. Upon receiving a timer tick or an event, the scheduling policy framework handles it and submits work to the workload scheduler for execution.
Step S105: modify the quality-of-service accounting in the scheduling policy framework. In the original scheme, maintaining the required quality of service only requires counting the corresponding scheduling time; since scheduling is purely time-based, the accounting is simple. In the present invention, because event-based scheduling is added, the quality-of-service accounting must be redone: not only the time but also the influence of each event must be considered. The present invention redesigns the accounting algorithm for this part to guarantee that the overall quality of service still meets the given requirements.
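A rough illustration of why the added events of steps S102-S103 help: a purely timer-driven scheduler can only act on tick boundaries, so any task shorter than a tick strands the remainder of its slice, while a completion event lets the next dispatch start immediately. The numbers and function names below are made up for illustration and are not measurements from the patent's experiments.

```python
# Compare GPU busy time under timer-only scheduling vs. scheduling that also
# reacts to a context-complete event, for a batch of small workloads.

TICK_US = 1000  # 1 ms timer period, in microseconds

def makespan_timer_only(durations_us):
    """Timer-only: each task occupies whole ticks; leftover tick time is wasted."""
    busy = 0
    for d in durations_us:
        ticks = -(-d // TICK_US)  # ceiling division: dispatch only on tick edges
        busy += ticks * TICK_US
    return busy

def makespan_with_events(durations_us):
    """A completion event triggers the next dispatch immediately: no stranded time."""
    return sum(durations_us)

tasks = [300, 1200, 150, 800]  # small workloads, mostly shorter than one tick
saved = makespan_timer_only(tasks) - makespan_with_events(tasks)  # 2550 us reclaimed
```

In this toy batch the timer-only scheduler spends 5000 µs where the event-driven one needs 2450 µs, mirroring the idle-time argument above; the real gain depends on the workload mix, as the 3.2%-21.5% range in the experiments reflects.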
Fig. 3 illustrates seamless pipeline-based scheduling: the top shows the original design, and the bottom the modified design, which decomposes tasks and lets scheduling complete in pipelined fashion. Seamless pipeline-based scheduling comprises the following steps:
Step S201: decompose the scheduler workflow into an audit-and-shadow (audit & shadow) stage and a scheduling-and-execution (scheduling & execution) stage, the former being the preparation stage and the latter the execution stage. In the original design the workflow executes strictly in order without careful decomposition: each task runs from beginning to end before the next task starts.
Step S202: split the work submission path so that multiple workloads can be submitted simultaneously, making full use of the pipelining advantage. The original design supports only one task submitted at a time; the present invention allows several, which is what lets pipeline scheduling pay off.
Step S203: move the original shadow-related code out of task dispatch, separating the code of the different stages. At this point the whole workflow code is divided into two relatively independent parts, audit-and-shadow and scheduling-and-execution. The purpose of the separation is that different stages can run simultaneously without interfering: when multiple events arrive, the execution stage of the previous event can run at the same time as the preparation stage of the next event, improving efficiency.
Step S204: when a virtual GPU has only one shadow context, and that virtual GPU is not the current GPU, shadow its first workload immediately. This step guarantees that the scheduling method executes correctly in all cases, matching the original design in correctness.
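The overlap of steps S201-S204 follows the standard two-stage pipeline recurrence: stage two of task i must wait for stage two of task i-1 and for its own stage one. A toy timing model with made-up durations:

```python
# Compare the original serial workflow with the two-stage pipeline:
# (audit & shadow) = prepare, (scheduling & execution) = execute.

def serial_makespan(tasks):
    """Original design: each task's prepare and execute run back to back."""
    return sum(prep + execute for prep, execute in tasks)

def pipelined_makespan(tasks):
    """Pipeline: task i+1 is prepared while task i executes.
    Standard recurrence: exec(i) starts after exec(i-1) and prep(i)."""
    prep_done = exec_done = 0
    for prep, execute in tasks:
        prep_done = prep_done + prep               # prepare stages run serially
        exec_done = max(exec_done, prep_done) + execute
    return exec_done

tasks = [(2, 5), (3, 4), (1, 6)]  # hypothetical (prepare, execute) durations
```

For these three tasks the serial makespan is 21 time units versus 17 pipelined; when preparation is short relative to execution, the saving approaches the total preparation time, consistent with the 0%-19.7% range reported in the experiments.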
Fig. 4 illustrates mixed scheduling based on rings and virtual machines, comparing the original design, per-ring scheduling, and the mixed scheduling mode; the advantage of mixed scheduling is also visible in the figure. Mixed scheduling based on rings and virtual machines comprises the following steps:
Step S301: introduce scheduling by ring (or engine) and add support for rings, changing the minimum scheduling unit from the whole GPU to each individual ring, so that workloads running on different rings execute simultaneously without interfering. In the original design, virtual GPUs are scheduled with the entire virtual machine as the unit; each virtual machine has its own workloads running on different rings. Typically a GPU has three or more rings, each handling a different type of task (such as image rendering or streaming-media encoding/decoding).
Step S302: modify the workload scheduler, changing all relevant code from single variables to arrays and from per-virtual-machine units to per-ring units; this involves rescheduling and updating the current GPU and ring and the next GPU and ring;
Step S303: modify the workload scheduler, restructuring the relevant code logic: logic that supported only per-virtual-machine units is replaced by new per-ring logic;
Step S304: modify the scheduling policy framework, changing the scheduling data structure from a single instance to an array so as to support multiple rings running simultaneously. This mainly concerns the scheduling data structures, the scheduling policy, and related parts. In the original design, a single structure sufficed to complete scheduling.
Step S305: modify the scheduling policy framework, changing timer and event triggering to per-ring support. The original design has only timer triggering, and it triggers only the GPU as a whole; the present invention must support both timer and event triggering, and moreover respond for each ring individually.
Step S306: per-ring scheduling requires consistent global state; use the CRC32 checksum of the state referenced by the pointer to judge whether the global states of the current virtual machine and the next virtual machine are consistent. The pointer refers to the content of the virtual GPU's global state information; because the system design can retain only part of that content, the checksum over it serves as the consistency judgment.
Step S307: each time MMIO state is saved or restored, compute the CRC32 value; if the values are consistent, use the per-ring scheduling of steps S301-S305; if inconsistent, use the original per-virtual-machine scheduling. The CRC32 value identifies the corresponding global state: when the operating systems currently executing are consistent, that state can be multiplexed; if not, per-ring scheduling cannot be used and the scheduler must switch back to the original per-virtual-machine scheduling.
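The consistency test of steps S306-S307 can be sketched with Python's `zlib.crc32` standing in for whatever checksum the implementation actually computes; the idea of representing each vGPU's saved MMIO state as a byte snapshot is an assumption made for illustration, not the patent's data layout.

```python
# Decide whether per-ring scheduling is safe by comparing CRC32 checksums
# of the global state saved for the current and the next virtual machine.

import zlib

def state_checksum(mmio_snapshot: bytes) -> int:
    """CRC32 over the (hypothetical) serialized MMIO/global-state snapshot."""
    return zlib.crc32(mmio_snapshot)

def can_schedule_per_ring(current_state: bytes, next_state: bytes) -> bool:
    """Per-ring scheduling is used only when both VMs share one global state;
    otherwise the scheduler falls back to per-virtual-machine scheduling."""
    return state_checksum(current_state) == state_checksum(next_state)
```

The checksum is recomputed at each MMIO save/restore, so the decision tracks state changes over time; equal checksums mean the saved state can be multiplexed across the two virtual machines, as step S307 describes.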
Step S308: for quality-of-service maintenance, account for each ring separately and redefine the quality-of-service parameters, guaranteeing that per-ring scheduling also maintains correct quality of service;
Step S309: implement the switching between per-virtual-machine and per-ring scheduling, guaranteeing correct program execution. The present invention completes the switching between the two scheduling strategies.
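The mixed policy of step S3 as a whole — dispatch per ring when global states match, fall back to whole-VM scheduling otherwise — can be sketched as follows. This shows only the dispatch structure; the workload representation and the one-slot-per-ring rule are simplifying assumptions, not GVT-g internals.

```python
# Mixed per-ring / per-VM dispatch: pick the batch of workloads allowed to
# run in parallel this round. A workload is a (vm, ring) pair, where rings
# are engines such as render, blitter, or video.

def dispatch(pending, states_consistent):
    """pending: ordered list of (vm, ring) workloads.
    Returns the batch that may run concurrently this round."""
    if not states_consistent:
        # Fall back to per-VM scheduling: only the first VM's workloads run.
        vm = pending[0][0]
        return [w for w in pending if w[0] == vm]
    # Per-ring scheduling: at most one workload per ring, across all VMs.
    batch, used_rings = [], set()
    for vm, ring in pending:
        if ring not in used_rings:
            used_rings.add(ring)
            batch.append((vm, ring))
    return batch
```

With consistent states, a render workload from one VM and a video workload from another run in the same round — the case where the experiments report 34.0% and 70.6% gains; with inconsistent states the batch degenerates to a single VM, reproducing the original behavior.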
The present invention combines these three scheduling methods; they do not conflict and can be used on the same GPU at the same time, so the overall performance of the GPU can be greatly increased. In the specific experiments, the present invention uses Intel GPUs as the test objects, but the method is a general one and can also be applied to GPUs made by other manufacturers.
Specific embodiments of the present invention have been described above. It is to be understood that the invention is not limited to the above particular embodiments; those skilled in the art can make various changes or modifications within the scope of the claims without affecting the substantive content of the present invention. In the absence of conflict, the features in the embodiments of this application can be combined with one another arbitrarily.

Claims (4)

1. A multi-level fine-grained virtualized GPU scheduling optimization method, characterized by comprising the following steps:
Step S1: add scheduling based on time and events, reducing the overhead of switching between two virtual GPUs;
Step S2: add seamless pipeline-based scheduling, allowing some of the virtual GPUs to run in parallel and improving efficiency when virtual GPUs work together;
Step S3: add mixed scheduling based on both rings and virtual machines, allowing different virtual machines to use the physical GPU fully concurrently and improving overall utilization.
2. The multi-level fine-grained virtualized GPU scheduling optimization method according to claim 1, characterized in that step S1 comprises the following steps:
Step S101: decouple the scheduling policy framework from the workload scheduler: the scheduling policy framework implements the modified scheduling policy, while the workload scheduler carries out the actual scheduling;
Step S102: add a context-complete event, which is triggered when a context finishes and passed to the scheduling policy framework to trigger the corresponding task scheduling;
Step S103: add a context-submit event; when the workload scheduler receives a context-submit event while the virtual GPU is idle, it handles the event immediately and executes the task;
Step S104: modify the scheduling policy framework to support the added events; upon receiving a timer tick or an event, the framework handles it and submits work to the workload scheduler for execution;
Step S105: modify the quality-of-service accounting in the scheduling policy framework.
3. The multi-level fine-grained virtualized GPU scheduling optimization method according to claim 1, characterized in that step S2 comprises the following steps:
Step S201: decompose the scheduler workflow into an audit-and-shadow stage and a scheduling-and-execution stage, the former being the preparation stage and the latter the execution stage;
Step S202: split the work submission path so that multiple workloads can be submitted simultaneously, making full use of the pipelining advantage;
Step S203: move the original shadow-related code out of task dispatch, separating the code of the different stages; the whole workflow code is thereby divided into two relatively independent parts, audit-and-shadow and scheduling-and-execution, so that different stages can run simultaneously without interfering, improving efficiency;
Step S204: when a virtual GPU has only one shadow context, and that virtual GPU is not the current GPU, shadow its first workload immediately.
4. The multi-level fine-grained virtualized GPU scheduling optimization method according to claim 1, characterized in that step S3 comprises the following steps:
Step S301: Introduce per-ring (per-engine) scheduling by adding ring support, changing the minimum scheduling unit from the whole GPU to a single ring, so that jobs running on different rings proceed concurrently without interfering with each other;
Step S302: Modify the task scheduler, changing all related code from a single variable to an array, and changing the scheduling unit from the virtual machine to the ring; this covers rescheduling and the switching of the current GPU and ring and the next GPU and ring;
Step S303: Modify the task scheduler, restructuring the related code logic: replace the original logic, which supports one virtual machine at a time, with new logic organized per ring;
Step S304: Modify the scheduling policy framework, changing the scheduling data structure from a single instance to an array so that multiple rings can run simultaneously;
Step S305: Modify the scheduling policy framework, changing timer and event triggering to support each ring separately;
Step S306: Per-ring scheduling requires the global state to be consistent; use a CRC32 checksum to judge whether the global state of the current virtual machine and the next virtual machine is consistent;
Step S307: Compute the CRC32 value each time MMIO state is saved or restored; if the values are consistent, use the per-ring scheduling of steps S301-S305, otherwise fall back to the original per-virtual-machine scheduling;
Step S308: For quality-of-service maintenance, compute QoS for each ring separately and redefine the QoS parameters, guaranteeing that correct quality of service is also maintained under per-ring scheduling;
Step S309: Modify the switching between per-virtual-machine scheduling and per-ring scheduling to guarantee correct program execution.
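The consistency check of steps S306–S307 can be sketched with CRC32 over snapshots of MMIO state: per-ring scheduling is only chosen when the checksums of the current and next virtual machine match, otherwise the scheduler falls back to per-VM switching. The dictionary layout and the byte-string snapshots below are hypothetical; only the CRC32-based decision mirrors the claim.

```python
import zlib

def choose_granularity(mmio_snapshots):
    """Sketch of steps S306-S307: compare CRC32 checksums of each virtual
    machine's MMIO state; per-ring scheduling is only safe when the global
    state is consistent, otherwise fall back to per-VM scheduling."""
    crcs = {vm: zlib.crc32(state) for vm, state in mmio_snapshots.items()}
    # One distinct checksum means every VM sees the same global state.
    return "per-ring" if len(set(crcs.values())) == 1 else "per-vm"

consistent = {"vm0": b"\x01\x02\x03", "vm1": b"\x01\x02\x03"}
divergent  = {"vm0": b"\x01\x02\x03", "vm1": b"\xff\xfe\xfd"}
print(choose_granularity(consistent))  # per-ring
print(choose_granularity(divergent))   # per-vm
```

The checksum is recomputed on every MMIO save/restore, so the decision tracks the actual state rather than being fixed at configuration time.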
CN201810285080.8A 2018-04-02 2018-04-02 Multilevel fine-grained virtualized GPU (graphics processing Unit) scheduling optimization method Active CN108710536B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810285080.8A CN108710536B (en) 2018-04-02 2018-04-02 Multilevel fine-grained virtualized GPU (graphics processing Unit) scheduling optimization method

Publications (2)

Publication Number Publication Date
CN108710536A true CN108710536A (en) 2018-10-26
CN108710536B CN108710536B (en) 2021-08-06

Family

ID=63867079

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810285080.8A Active CN108710536B (en) 2018-04-02 2018-04-02 Multilevel fine-grained virtualized GPU (graphics processing Unit) scheduling optimization method

Country Status (1)

Country Link
CN (1) CN108710536B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109656714A (en) * 2018-12-04 2019-04-19 成都雨云科技有限公司 A kind of GPU resource dispatching method virtualizing video card
CN109753134A (en) * 2018-12-24 2019-05-14 四川大学 A kind of GPU inside energy consumption control system and method based on overall situation decoupling
CN109766189A (en) * 2019-01-15 2019-05-17 北京地平线机器人技术研发有限公司 Colony dispatching method and apparatus
CN110442389A (en) * 2019-08-07 2019-11-12 北京技德系统技术有限公司 A kind of shared method using GPU of more desktop environments
CN113274736A (en) * 2021-07-22 2021-08-20 北京蔚领时代科技有限公司 Cloud game resource scheduling method, device, equipment and storage medium
CN113742085A (en) * 2021-09-16 2021-12-03 中国科学院上海高等研究院 Execution port time channel safety protection system and method based on branch filtering
US11321126B2 (en) 2020-08-27 2022-05-03 Ricardo Luis Cayssials Multiprocessor system for facilitating real-time multitasking processing

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102662725A (en) * 2012-03-15 2012-09-12 中国科学院软件研究所 Event-driven high concurrent process virtual machine realization method
CN103336718A (en) * 2013-07-04 2013-10-02 北京航空航天大学 GPU thread scheduling optimization method
CN104714850A (en) * 2015-03-02 2015-06-17 心医国际数字医疗系统(大连)有限公司 Heterogeneous joint account balance method based on OPENCL
CN106663021A (en) * 2014-06-26 2017-05-10 英特尔公司 Intelligent gpu scheduling in a virtualization environment
US20170221173A1 (en) * 2016-01-28 2017-08-03 Qualcomm Incorporated Adaptive context switching
CN107357661A (en) * 2017-07-12 2017-11-17 北京航空航天大学 A kind of fine granularity GPU resource management method for mixed load

Also Published As

Publication number Publication date
CN108710536B (en) 2021-08-06

Similar Documents

Publication Publication Date Title
CN108710536A (en) A kind of multi-level fine-grained virtualization GPU method for optimizing scheduling
EP3425502B1 (en) Task scheduling method and device
CN106802826A (en) A kind of method for processing business and device based on thread pool
CN103069389B (en) High-throughput computing method and system in a hybrid computing environment
CN101894047B (en) Kernel virtual machine scheduling policy-based implementation method
CN101751289B (en) Mixed scheduling method of embedded real-time operating system
CN103069390B (en) Method and system for re-scheduling workload in a hybrid computing environment
US20040199927A1 (en) Enhanced runtime hosting
US20200012507A1 (en) Control system for microkernel architecture of industrial server and industrial server comprising the same
CN105183698B (en) A kind of control processing system and method based on multi-core DSP
CN104866374A (en) Multi-task-based discrete event parallel simulation and time synchronization method
CN112463709A (en) Configurable heterogeneous artificial intelligence processor
CN103365718A (en) Thread scheduling method, thread scheduling device and multi-core processor system
CN101694633A (en) Equipment, method and system for dispatching of computer operation
Hirales-Carbajal et al. A grid simulation framework to study advance scheduling strategies for complex workflow applications
CN110795254A (en) Method for processing high-concurrency IO based on PHP
CN105550040A (en) KVM platform based virtual machine CPU resource reservation algorithm
CN114637536A (en) Task processing method, computing coprocessor, chip and computer equipment
CN100440153C (en) Processor
CN111258655A (en) Fusion calculation method and readable storage medium
CN103810041A (en) Parallel computing method capable of supporting dynamic compand
CN115794355B (en) Task processing method, device, terminal equipment and storage medium
CN110928659A (en) Numerical value pool system remote multi-platform access method with self-adaptive function
Aggarwal et al. On the optimality of scheduling dependent mapreduce tasks on heterogeneous machines
Lam et al. Performance guarantee for online deadline scheduling in the presence of overload

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant