CN108710536B - Multilevel fine-grained virtualized GPU (graphics processing Unit) scheduling optimization method - Google Patents

Multilevel fine-grained virtualized GPU (graphics processing Unit) scheduling optimization method

Info

Publication number
CN108710536B
Authority
CN
China
Prior art keywords
scheduling
gpu
ring
virtual
work
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810285080.8A
Other languages
Chinese (zh)
Other versions
CN108710536A (en)
Inventor
姚建国
赵晓辉
高平
管海兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN201810285080.8A
Publication of CN108710536A
Application granted
Publication of CN108710536B
Legal status: Active

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 — Arrangements for program control, e.g. control units
    • G06F 9/06 — Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 — Multiprogramming arrangements
    • G06F 9/50 — Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 — Allocation of resources to service a request
    • G06F 9/5027 — Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G06F 9/5038 — Allocation of resources to service a request, considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration

Abstract

The invention discloses a multilevel fine-grained virtualized GPU scheduling optimization method that optimizes the scheduling policy in three complementary ways: time-and-event-based scheduling, pipeline-based seamless scheduling, and hybrid ring-based and virtual-machine-based scheduling. The three strategies respectively exploit the time otherwise lost to switching between two virtual machines, let virtual machines run simultaneously in different pipeline stages, and let virtual machines work concurrently on different rings. By modifying the scheduler and the scheduling policy, the overhead of the switching process is greatly reduced and parallel execution among multiple virtual GPUs is supported, so the performance of multiple virtual GPUs sharing one physical GPU is markedly improved, raising overall performance. The invention improves the utilization of the physical GPU and thereby further improves virtual GPU performance, while still ensuring that each virtual GPU meets its quality-of-service requirements.

Description

Multilevel fine-grained virtualized GPU (graphics processing Unit) scheduling optimization method
Technical Field
The invention relates to the fields of GPU virtualization and task scheduling, in particular to a multilevel fine-grained virtualized GPU scheduling optimization method. Specifically, optimized scheduling strategies are used to improve the performance of GPU virtualization. By optimizing the GPU scheduling policy, the original coarse-grained scheduling is refined into fine-grained scheduling, and previously unusable time and resources are fully exploited, improving overall GPU utilization and thus virtual GPU performance with no change to the physical hardware.
Background
Today, GPU technology is increasingly important in high-performance computing fields such as AI, deep learning, data analytics, and cloud gaming, all of which require GPU participation. GPU cloud services have emerged accordingly: Tencent and Alibaba each offer GPU cloud servers, providing users with a new computing mode.
GPU virtualization technology therefore faces higher requirements from high-performance computing. The current solution is full GPU virtualization, which offers good isolation and security and requires no special hardware support. However, in current full GPU virtualization the scheduling granularity is coarse, leaving considerable room for performance improvement. Our experiments are based in part on GVT-g (Intel Graphics Virtualization Technology for shared vGPU), which virtualizes one Intel GPU into multiple virtual GPUs used by multiple virtual machines. Although the experiments use an Intel GPU, the approach remains general.
GVT-g schedules every 1 ms, selecting by round-robin the virtual machine that may execute in the next time slice; during that slice the virtual machine's work tasks (workloads) are allowed to run on the physical GPU. Each virtual machine has quality-of-service requirements of three kinds: a cap, a weight, and a priority, and the original scheduling policy guarantees that these requirements are met. At each tick the scheduler checks whether the current work task has completed. If it has, the scheduler schedules a new work task; otherwise, the next task is decided according to the quality-of-service requirements. A work task runs only when the scheduler actually dispatches it, and execution is fully serialized: no two work tasks ever execute simultaneously.
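As a rough illustration of the baseline policy described above (a simplified Python model, not the actual GVT-g kernel code; the class and field names are invented for exposition), a 1 ms round-robin tick with a per-VM cap might look like this:

```python
from collections import deque

class VGpu:
    """Illustrative model of a virtual GPU with QoS parameters."""
    def __init__(self, name, cap_ms, weight, priority):
        self.name = name
        self.cap_ms = cap_ms    # upper bound on GPU time per window
        self.weight = weight    # relative share of GPU time
        self.priority = priority
        self.used_ms = 0        # time consumed in the current window

class RoundRobinScheduler:
    """Baseline: a 1 ms tick picks the next runnable vGPU round-robin."""
    TICK_MS = 1

    def __init__(self, vgpus):
        self.queue = deque(vgpus)

    def next_vgpu(self):
        # Skip vGPUs that have exhausted their cap in this window.
        for _ in range(len(self.queue)):
            vgpu = self.queue[0]
            self.queue.rotate(-1)
            if vgpu.used_ms < vgpu.cap_ms:
                vgpu.used_ms += self.TICK_MS
                return vgpu
        return None  # everyone is capped; the GPU idles this tick
```

Because dispatch happens only at tick boundaries and one vGPU runs at a time, any gap between a workload finishing and the next tick is wasted, which is exactly what the following strategies target.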
This scheduling mode is simple and direct, and it makes quality-of-service accounting easy. At the same time, however, its granularity is relatively coarse, leaving much idle and wasted time.
Disclosure of Invention
In view of the defects of the prior art, the invention aims to provide a multilevel fine-grained virtualized GPU scheduling optimization method comprising three mutually compatible scheduling strategies, which exploit previously unused capacity from three angles to improve performance. The three strategies are: time-and-event-based scheduling, pipeline-based seamless scheduling, and hybrid ring-based and virtual-machine-based scheduling.
The invention is realized according to the following technical scheme:
a multilevel fine-grained virtualized GPU scheduling optimization method is characterized by comprising the following steps:
step S1: scheduling based on time and events is added, and the switching overhead of the two virtual GPUs is reduced;
step S2: seamless scheduling based on a production line is added, so that a part of the virtual GPUs can be in parallel, and the efficiency of the virtual GPUs in joint work is improved;
step S3: and mixed ring-based and virtual machine-based scheduling is added, so that different virtual machines can completely and concurrently utilize the physical GPU, and the overall utilization rate is improved.
In the above technical solution, step S1 comprises the following steps:
step S101: decoupling the scheduling policy framework from the task scheduler: the scheduling policy framework implements the modified scheduling policy, and the task scheduler carries out the scheduling;
step S102: adding a context-completion event, which is triggered after a context completes and is delivered to the scheduling policy framework, further triggering the corresponding task scheduling;
step S103: adding a context-submission event; when the work scheduler receives a context-submission event and the virtual GPU is idle at that moment, the event is processed immediately and the task is executed;
step S104: modifying the scheduling policy framework to support the added events; on receiving a timer tick or an event, the framework responds and submits work to the work scheduler for execution;
step S105: modifying the quality-of-service calculation in the scheduling policy framework.
In the above technical solution, step S2 comprises the following steps:
step S201: decomposing the work scheduler flow, dividing the workflow into an audit-and-shadow stage and a scheduling-and-execution stage, the former being a preparation stage and the latter an execution stage;
step S202: splitting the work submission path, allowing a plurality of works to be submitted simultaneously and fully exploiting pipelined scheduling;
step S203: moving the original shadow-related code out of work task assignment and separating the code of the different stages; the whole workflow code is thereby divided into two relatively independent parts, audit-and-shadow and scheduling-and-execution, which can run at the same time in different stages without mutual influence, improving efficiency;
step S204: since each virtual GPU has only one shadow context, shadowing the first work task only if the virtual GPU is not the current GPU at that moment.
In the above technical solution, step S3 comprises the following steps:
step S301: introducing ring-based or engine-based scheduling, adding ring support and changing the smallest scheduling unit from the whole GPU to each ring, so that works running on different rings execute simultaneously without mutual influence;
step S302: modifying the task scheduler, changing all related code from single variables to arrays and the original virtual-machine unit to a ring unit, which involves rescheduling and updating the current GPU and ring and the next GPU and ring;
step S303: modifying the task scheduler, refactoring the related code logic from supporting only the virtual machine as the scheduling unit into new logic that takes the ring as the unit;
step S304: modifying the scheduling policy framework, changing the scheduling data structure from a single structure into an array so as to support the simultaneous operation of a plurality of rings;
step S305: modifying the scheduling policy framework, changing timer or event triggering into per-ring support;
step S306: using a CRC32 checksum to determine whether the global states of the current virtual machine and the next virtual machine are consistent;
step S307: computing the CRC32 value each time MMIO state is saved or restored; if the values are consistent, using the per-ring scheduling of steps S301-S305; if not, using the original per-virtual-machine scheduling;
step S308: for quality-of-service maintenance, accounting for each ring separately and redefining the quality-of-service parameters, ensuring that per-ring scheduling maintains the correct quality of service;
step S309: modifying the switching between per-virtual-machine scheduling and per-ring scheduling so that the program runs correctly.
Compared with the prior art, the invention has the following beneficial effects:
The invention makes full use of idle and wasted time, improving overall performance without changing hardware and while maintaining quality of service. For example, a virtual GPU cloud provider using the invention obtains more aggregate performance from the same budget and hardware configuration, can sell that performance to more users, and earns more revenue. A user with a specific goal, for example running a target program at 60 frames per second, can buy fewer devices to achieve it, reducing cost.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is an overall framework for time and event based scheduling;
FIG. 2 is a schematic diagram of a comparison of time-based scheduling with time-and-event-based scheduling;
FIG. 3 is a schematic diagram of pipeline-based seamless scheduling;
FIG. 4 is a schematic diagram of hybrid ring-based and virtual machine-based scheduling;
FIG. 5-1 shows the 3dmark06 experiment scores for time-and-event-based scheduling;
FIG. 5-2 shows the heaven experiment scores for time-and-event-based scheduling;
FIG. 6-1 shows the 3dmark06 experiment scores for pipeline-based seamless scheduling;
FIG. 6-2 shows the heaven experiment scores for pipeline-based seamless scheduling;
FIG. 7-1 shows the ring-based scheduling experiment scores;
FIG. 7-2 shows the hybrid ring-based and virtual-machine-based scheduling experiment scores.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will help those skilled in the art further understand the invention, but are not intended to limit it in any way. It should be noted that various changes and modifications, obvious to those skilled in the art, can be made without departing from the spirit of the invention; all of these fall within the scope of the present invention.
Firstly, in the existing scheduling method, scheduling is triggered entirely by time, yet after each work finishes there is likely to be an idle gap, especially when there are many small tasks, and this idle time is wasted. The invention therefore adds a task-completion event that triggers the scheduler early, so these gaps are fully used. Fig. 5-1 shows the 3dmark06 experiment scores and Fig. 5-2 the heaven experiment scores for time-and-event-based scheduling; performance improves markedly with this method. Experiments show that time-and-event-based scheduling improves overall GPU performance by 3.2%-21.5%.
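The time-and-event-based idea can be sketched as follows. This is a simplified Python model under our own naming assumptions (`on_tick`, `on_context_complete`, and `on_context_submit` are illustrative callbacks, not the driver's real interfaces): the decoupled policy framework decides what runs, and the work scheduler carries it out.

```python
import heapq

class WorkScheduler:
    """Carries out scheduling decisions; knows nothing about policy."""
    def __init__(self):
        self.idle = True
        self.log = []

    def mark_idle(self):
        self.idle = True

    def execute(self, workload):
        self.idle = False
        self.log.append(workload)

class PolicyFramework:
    """Decoupled policy: reacts to the periodic tick AND the two new
    events, so idle gaps between tick boundaries are reclaimed."""
    def __init__(self, work_scheduler):
        self.work_scheduler = work_scheduler
        self.pending = []            # min-heap of (priority, workload)

    def on_tick(self):
        """Original time-based trigger, fired every 1 ms."""
        self._dispatch()

    def on_context_complete(self, workload):
        """New completion event: schedule immediately instead of
        waiting for the next tick."""
        self.work_scheduler.mark_idle()
        self._dispatch()

    def on_context_submit(self, priority, workload):
        """New submission event: if the GPU is idle, run the work now."""
        heapq.heappush(self.pending, (priority, workload))
        if self.work_scheduler.idle:
            self._dispatch()

    def _dispatch(self):
        if self.pending and self.work_scheduler.idle:
            _, workload = heapq.heappop(self.pending)
            self.work_scheduler.execute(workload)
```

In this sketch a completion event dispatches the next workload at once, which is the gap the purely time-triggered baseline would have left idle until the following tick.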
Then, in the existing scheduling method, work tasks are executed strictly in sequence: the latter can run only after the former completes. In practice a work task can be divided into two phases, a preparation phase and the actual execution. The two phases call different parts of the physical GPU and can therefore be scheduled as a 2-stage pipeline. The invention uses pipeline-style scheduling so that the two phases of different tasks can be scheduled simultaneously. Fig. 6-1 shows the 3dmark06 scores and Fig. 6-2 the heaven scores for pipeline-based seamless scheduling; the relevant parts of the benchmarks improve greatly after using this method. Experiments show that pipeline-based seamless scheduling improves overall GPU performance by 0%-19.7%.
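The 2-stage overlap can be sketched as follows (a toy Python model; the `prepare` and `execute` callables stand in for the audit-and-shadow and scheduling-and-execution phases, and the returned timeline records dispatch order only — in the real system the adjacent prepare/execute pairs run concurrently):

```python
def pipeline_schedule(tasks, prepare, execute):
    """2-stage seamless scheduling: preparation of task i+1 is issued
    alongside execution of task i, instead of strictly after it.

    Returns the sequence of (stage, task) actions in dispatch order.
    """
    timeline = []
    prepared = None
    for task in tasks:
        # Stage 1: audit & shadow the incoming task.
        timeline.append(("prepare", prepare(task)))
        # Stage 2: execute the previously prepared task; conceptually
        # this overlaps with the preparation just issued above.
        if prepared is not None:
            timeline.append(("execute", execute(prepared)))
        prepared = task
    if prepared is not None:
        timeline.append(("execute", execute(prepared)))  # drain the pipe
    return timeline
```

With two tasks A and B, B's preparation is issued before A's execution completes, which is precisely the overlap the sequential baseline forbids.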
Finally, in the existing scheduling method, tasks are scheduled with the whole GPU as the unit; even when tasks call different rings/engines, such as image rendering versus streaming-media encoding/decoding, the latter must wait for the former to complete. The invention therefore introduces ring-based scheduling: if work tasks need different rings, they are allowed to be dispatched and executed at the same time. Because GPU hardware support is insufficient, this mode requires the virtual machines' operating systems, and hence their global state, to be consistent; when they are inconsistent, the original method is used instead. Fig. 7-1 shows the ring-based scheduling scores and Fig. 7-2 the hybrid ring-based and virtual-machine-based scheduling scores. Experiments show that with hybrid scheduling, two tasks calling different rings improve by 34.0% and 70.6% respectively, while other workloads are hardly affected.
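The mechanism just described — per-ring dispatch plus a CRC32 consistency check that decides between per-ring and per-VM modes — might look roughly like this in Python (the ring names, state layout, and function names are our own illustration, not the driver's):

```python
import zlib

RINGS = ["render", "blitter", "video"]   # illustrative; real GPUs differ

def states_consistent(current_mmio: bytes, next_mmio: bytes) -> bool:
    """Compare CRC32 checksums of the saved global (MMIO) state of the
    current and the next virtual machine."""
    return zlib.crc32(current_mmio) == zlib.crc32(next_mmio)

class PerRingScheduler:
    """The single scheduling state becomes an array indexed by ring,
    so each ring dispatches its queue independently."""
    def __init__(self, rings=RINGS):
        self.queues = {name: [] for name in rings}

    def submit(self, ring, vm, workload):
        self.queues[ring].append((vm, workload))

    def dispatch(self, current_mmio, next_mmio):
        # Fall back to whole-VM scheduling when global states differ.
        if not states_consistent(current_mmio, next_mmio):
            return {"mode": "per-vm"}
        running = {"mode": "per-ring"}
        for name, queue in self.queues.items():
            if queue:
                # Different VMs' work on different rings runs together.
                running[name] = queue.pop(0)
        return running
```

The checksum comparison is what makes the hybrid mode safe: identical state hashes mean the single retained copy of global state can be reused across the concurrently running virtual machines.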
The invention discloses a multilevel fine-grained virtualized GPU scheduling optimization method comprising the following steps:
step S1: adding time-and-event-based scheduling, reducing the overhead of switching between two virtual GPUs;
step S2: adding pipeline-based seamless scheduling, so that parts of different virtual GPUs' work can proceed in parallel, improving the efficiency of virtual GPUs working together;
step S3: adding hybrid ring-based and virtual-machine-based scheduling, so that different virtual machines can use the physical GPU fully concurrently, improving overall utilization.
Among them, FIG. 1 shows the overall framework of time-and-event-based scheduling; the main changes are the added events and the modified framework. FIG. 2 compares time-based scheduling (top), which leaves idle gaps, with time-and-event-based scheduling (bottom), which fully uses them. Adding time-and-event-based scheduling includes the following steps:
step S101: decoupling the scheduling policy framework (scheduling policy frame) from the task scheduler (workload scheduler): the scheduling policy framework implements the modified scheduling policy, and the task scheduler carries out the scheduling. The original architecture mixed the two, so the policy could not be modified into a customizable scheduling method; the two must therefore be separated. After separation, the scheduling policy framework is responsible only for realizing the scheduling policy, and the task scheduler only for concretely carrying out the scheduling. This division of labor makes the later modifications possible.
step S102: adding a context completion event, which is triggered after a context completes and is delivered to the scheduling policy framework, further triggering the corresponding task scheduling. The original approach has only a time trigger, firing every 1 ms. The invention adds the context-completion event on top of the original scheduler.
step S103: adding a context submission event; when the work scheduler receives a context-submission event and the virtual GPU is idle at that moment, the event is processed immediately and the task is executed. In the original approach, lacking this event, the work scheduler would wait idly and working time was wasted.
step S104: modifying the scheduling policy framework to support the added events. The original design supports only time triggering; event triggering now also requires the framework's support. On receiving a timer tick or an event, the framework responds and submits work to the work scheduler for execution.
step S105: modifying the quality-of-service calculation in the scheduling policy framework. In the original, purely time-based scheme, maintaining the required quality of service only required counting scheduling time, which was simple. With event-based scheduling added, the accounting must be redone, considering not only time but also the effect of each event. The invention applies a new algorithm to this accounting so that the overall quality of service still meets the given requirements.
FIG. 3 is a schematic diagram of pipeline-based seamless scheduling, with the original design on top and the modified design below; the modification decomposes tasks and completes scheduling in a pipeline. Pipeline-based seamless scheduling includes the following steps:
step S201: decomposing the work scheduler flow, dividing the workflow into an audit-and-shadow (audit & shadow) stage and a scheduling-and-execution (scheduling & execution) stage, the former being preparation and the latter execution. In the original design the workflow was not subdivided; tasks ran strictly in sequence, each from start to finish before the next began.
step S202: splitting the work submission path, allowing a plurality of works to be submitted simultaneously and fully exploiting pipelined scheduling. The original design supported only one submitted work task at a time; the invention allows several, so that simultaneously submitted jobs enjoy the benefits of pipelined scheduling.
step S203: moving the original shadow-related code out of work task assignment and separating the code of the different stages. The overall workflow code is thereby divided into two relatively independent parts, audit-and-shadow and scheduling-and-execution. The purpose of this split is that different stages can run simultaneously without mutual influence: when several events arrive, the execution stage of one can overlap with the preparation stage of the next, improving efficiency.
step S204: since each virtual GPU has only one shadow context, shadowing the first work task only if the virtual GPU is not the current GPU at that moment. This step ensures the scheduling method executes correctly in all circumstances, matching the correctness of the original design.
FIG. 4 is a schematic diagram of hybrid ring-based and virtual-machine-based scheduling; it compares the original design, ring-based scheduling, and hybrid scheduling, showing the advantages of the hybrid mode. Hybrid ring-based and virtual-machine-based scheduling comprises the following steps:
step S301: introducing ring-based or engine-based scheduling, adding ring support and changing the smallest scheduling unit from the whole GPU to each ring, so that works running on different rings execute simultaneously without mutual influence. In the original design the virtual GPU was scheduled with the whole virtual machine as the unit, although the virtual machines have their own workloads running on different rings. Typically a GPU has more than 3 rings, each handling a different type of task (e.g., image rendering, streaming-media encoding/decoding).
step S302: modifying the task scheduler, changing all related code from single variables to arrays and the original virtual-machine unit to a ring unit; this involves rescheduling and updating the current GPU and ring and the next GPU and ring.
step S303: modifying the task scheduler, refactoring the related code logic from supporting only the virtual machine as the scheduling unit into new logic that takes the ring as the unit.
step S304: modifying the scheduling policy framework, changing the scheduling data structure from a single structure into an array so as to support the simultaneous operation of a plurality of rings. This mainly involves the scheduling data structure, the scheduling policy, and related parts; in the original design a single structure sufficed to complete the scheduling.
step S305: modifying the scheduling policy framework, changing timer or event triggering into per-ring support; the original design had only a time trigger, and only for the GPU as a whole.
step S306: using a CRC32 checksum to determine whether the global states of the current virtual machine and the next virtual machine are consistent. The CRC32 is computed over the virtual GPU's global state information; this check is needed because the system design can keep only one copy of that state.
step S307: computing the CRC32 value each time MMIO state is saved or restored; if the values are consistent, using the per-ring scheduling of steps S301-S305, since a matching global state can be found and reused when the currently executing operating systems are consistent; if not, per-ring scheduling cannot be used and the scheduler must switch back to the original per-virtual-machine scheduling.
step S308: for quality-of-service maintenance, accounting for each ring separately and redefining the quality-of-service parameters, ensuring that per-ring scheduling maintains the correct quality of service.
step S309: modifying the switching between per-virtual-machine scheduling and per-ring scheduling so that the program runs correctly; the invention completes the handover between the two scheduling strategies.
The invention combines the three scheduling methods; they do not conflict and can be applied to the GPU simultaneously, so overall GPU performance improves greatly. The specific experiments use an Intel GPU as the test object, but the method is general and can also be used for GPUs from other manufacturers.
The foregoing describes specific embodiments of the present invention. It should be understood that the invention is not limited to these embodiments; those skilled in the art may make various changes or modifications within the scope of the appended claims without departing from the spirit of the invention. In the absence of conflict, the embodiments of the present application and the features of the embodiments may be combined with one another arbitrarily.

Claims (1)

1. A multilevel fine-grained virtualized GPU scheduling optimization method is characterized by comprising the following steps:
step S1: scheduling based on time and events is added, and the switching overhead of the two virtual GPUs is reduced;
step S2: seamless scheduling based on a production line is added, so that a part of the virtual GPUs can be in parallel, and the efficiency of the virtual GPUs in joint work is improved;
step S3: mixed ring-based and virtual machine-based scheduling is added, so that different virtual machines can completely and concurrently utilize the physical GPU, and the overall utilization rate is improved;
step S1 includes the following steps:
step S101: decoupling the scheduling policy framework and the task scheduler: the scheduling strategy framework is used for realizing the modified scheduling strategy, and the task scheduler is used for realizing the scheduling;
step S102: adding a context completion event, wherein the event can be triggered after the context is completed and is transmitted to a scheduling strategy framework, so that corresponding task scheduling is further triggered;
step S103: adding a context submission event, and when the work scheduler receives the context submission event and the virtual GPU is idle at the moment, immediately processing the event and executing a task;
step S104: modifying the scheduling strategy frame to support the added event, and making response processing after receiving the time or the event by the scheduling strategy frame and submitting the response processing to a work scheduler for execution;
step S105: modifying the service quality calculation in the scheduling strategy framework;
step S2 includes the following steps:
step S201: decomposing the work scheduler flow, dividing the workflow into four stages: audit, shadow, schedule, and execute, where the first two form the preparation phase and the last two the execution phase;
step S202: splitting the work submission path so that multiple work items can be submitted simultaneously, fully exploiting the advantages of pipelined scheduling;
step S203: moving the original shadow-related code out of work task dispatch and separating the code of the different stages; the whole workflow code is then divided into two relatively independent parts, audit-and-shadow and schedule-and-execute, which can run simultaneously in different stages without affecting each other, improving efficiency;
step S204: since each virtual GPU has only one shadow context, when a virtual GPU is not the current GPU, only its first work task is shadowed;
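The four-stage pipeline of steps S201-S203 can be illustrated with a small Python simulation in which the preparation half (audit + shadow) and the execution half (schedule + execute) each advance one work item per simulated tick, independently of each other. The stage functions are placeholders assumed for illustration, not the patent's implementation.

```python
from collections import deque

# Placeholder stage functions (step S201): each just tags the work item.
def audit(item):    return item + ":audited"
def shadow(item):   return item + ":shadowed"
def schedule(item): return item + ":scheduled"
def execute(item):  return item + ":executed"

def pipelined_run(items):
    pending = deque(items)   # not yet prepared
    prepared = deque()       # audited and shadowed, awaiting execution
    done = []
    # Simulated clock: per tick the execution half runs one prepared item
    # while the preparation half readies the next one (step S203), so the
    # two halves overlap without touching each other's state.
    while pending or prepared:
        if prepared:  # execution half: schedule + execute
            done.append(execute(schedule(prepared.popleft())))
        if pending:   # preparation half: audit + shadow
            prepared.append(shadow(audit(pending.popleft())))
    return done
```

Because the two halves share only the `prepared` queue, preparing work item n+1 overlaps with executing work item n, which is the source of the efficiency gain claimed in step S203.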
step S3 includes the following steps:
step S301: introducing ring-based (per-engine) scheduling, adding support for rings and changing the smallest scheduling unit from the whole GPU to an individual ring, so that work items running on different rings execute simultaneously without affecting each other;
step S302: modifying the work scheduler: changing all related fields from single variables into per-ring arrays and changing the original per-virtual-machine unit into a per-ring unit; this involves rescheduling and updating the current GPU and ring as well as the next GPU and ring;
step S303: modifying the work scheduler, reconstructing the relevant code logic and replacing the original logic, which supported only the virtual machine as the scheduling unit, with new logic that takes the ring as the unit;
step S304: modifying the scheduling policy framework, changing the scheduling data structure from a single structure into an array so that multiple rings can operate simultaneously;
step S305: modifying the scheduling policy framework, changing timer and event triggering so that each ring is supported separately;
step S306: checking with a CRC32 checksum whether the global states of the current virtual machine and the next virtual machine are consistent;
step S307: computing the CRC32 value each time MMIO state is saved or restored; if the values are consistent, the per-ring scheduling of steps S301-S305 is used; if not, the original per-virtual-machine scheduling is used;
step S308: for QoS maintenance, computing QoS for each ring separately and redefining the QoS parameters, ensuring that correct quality of service is maintained under per-ring scheduling;
step S309: modifying the switching between per-virtual-machine and per-ring scheduling so that the program runs correctly.
CN201810285080.8A 2018-04-02 2018-04-02 Multilevel fine-grained virtualized GPU (graphics processing Unit) scheduling optimization method Active CN108710536B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810285080.8A CN108710536B (en) 2018-04-02 2018-04-02 Multilevel fine-grained virtualized GPU (graphics processing Unit) scheduling optimization method

Publications (2)

Publication Number Publication Date
CN108710536A CN108710536A (en) 2018-10-26
CN108710536B true CN108710536B (en) 2021-08-06

Family

ID=63867079

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810285080.8A Active CN108710536B (en) 2018-04-02 2018-04-02 Multilevel fine-grained virtualized GPU (graphics processing Unit) scheduling optimization method

Country Status (1)

Country Link
CN (1) CN108710536B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109656714B (en) * 2018-12-04 2022-10-28 成都雨云科技有限公司 GPU resource scheduling method of virtualized graphics card
CN109753134B (en) * 2018-12-24 2022-04-15 四川大学 Global decoupling-based GPU internal energy consumption control system and method
CN109766189B (en) * 2019-01-15 2022-01-18 北京地平线机器人技术研发有限公司 Cluster scheduling method and device
CN110442389B (en) * 2019-08-07 2024-01-09 北京技德系统技术有限公司 Method for sharing GPU (graphics processing Unit) in multi-desktop environment
US11321126B2 (en) 2020-08-27 2022-05-03 Ricardo Luis Cayssials Multiprocessor system for facilitating real-time multitasking processing
CN113274736B (en) * 2021-07-22 2021-10-26 北京蔚领时代科技有限公司 Cloud game resource scheduling method, device, equipment and storage medium
CN113742085B (en) * 2021-09-16 2023-09-08 中国科学院上海高等研究院 Execution port time channel safety protection system and method based on branch filtering

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102662725A (en) * 2012-03-15 2012-09-12 中国科学院软件研究所 Event-driven high concurrent process virtual machine realization method
CN103336718A (en) * 2013-07-04 2013-10-02 北京航空航天大学 GPU thread scheduling optimization method
CN104714850A (en) * 2015-03-02 2015-06-17 心医国际数字医疗系统(大连)有限公司 Heterogeneous joint account balance method based on OPENCL
CN106663021A (en) * 2014-06-26 2017-05-10 英特尔公司 Intelligent gpu scheduling in a virtualization environment
CN107357661A (en) * 2017-07-12 2017-11-17 北京航空航天大学 A kind of fine granularity GPU resource management method for mixed load

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10210593B2 (en) * 2016-01-28 2019-02-19 Qualcomm Incorporated Adaptive context switching


Also Published As

Publication number Publication date
CN108710536A (en) 2018-10-26

Similar Documents

Publication Publication Date Title
CN108710536B (en) Multilevel fine-grained virtualized GPU (graphics processing Unit) scheduling optimization method
EP3425502B1 (en) Task scheduling method and device
CN102890643B (en) Resource scheduling system based on immediate feedback of application effect under display card virtualization
CN106802826A (en) A kind of method for processing business and device based on thread pool
CN112463709A (en) Configurable heterogeneous artificial intelligence processor
US20150301854A1 (en) Apparatus and method for hardware-based task scheduling
CN113504985B (en) Task processing method and network equipment
CN105183698A (en) Control processing system and method based on multi-kernel DSP
US20080301408A1 (en) System comprising a plurality of processors and method of operating the same
CN104536827A (en) Data dispatching method and device
CN110851246A (en) Batch task processing method, device and system and storage medium
CN102279766A (en) Method and system for concurrently simulating processors and scheduler
CN114637536A (en) Task processing method, computing coprocessor, chip and computer equipment
CN110347450B (en) Multi-stream parallel control system and method thereof
CN100440153C (en) Processor
CN111258655A (en) Fusion calculation method and readable storage medium
CN103049305B (en) Multithreading method for the dynamic code conversion of Godson multi-core CPU simulation
CN104598304A (en) Dispatch method and device used in operation execution
CN112839109B (en) Cloud resource arranging method based on cloud function and BPMN (Business Process management) specification
Liu et al. BSPCloud: A hybrid distributed-memory and shared-memory programming model
CN114610485A (en) Resource processing system and method
CN112965798A (en) Big data processing method and system based on distributed multithreading
WO2015155894A1 (en) Processor and method
CN111984328B (en) Streaming processor with OODA circular partitioning mechanism
WO2023236479A1 (en) Method for executing task scheduling and related products thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant