KR101435772B1 - Gpu virtualizing system - Google Patents
- Publication number
- KR101435772B1 (application number KR1020130071605A)
- Authority
- KR
- South Korea
- Prior art keywords
- gpu
- code
- gpus
- actual
- virtual
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/455—Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
- G06F9/45533—Hypervisors; Virtual machine monitors
- G06F9/45558—Hypervisor-specific management and integration aspects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/76—Architectures of general purpose stored program computers
- G06F15/80—Architectures of general purpose stored program computers comprising an array of processing units with common control, e.g. single instruction multiple data processors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/455—Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T1/00—General purpose image data processing
- G06T1/20—Processor architectures; Processor configuration, e.g. pipelining
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/455—Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
- G06F9/45533—Hypervisors; Virtual machine monitors
- G06F9/45558—Hypervisor-specific management and integration aspects
- G06F2009/45579—I/O management, e.g. providing access to device drivers or storage
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computer Hardware Design (AREA)
- Computing Systems (AREA)
- Memory System Of A Hierarchy Structure (AREA)
Abstract
The present invention relates to a variable-number-convertible GPU virtualization system. According to a first aspect of the present invention, a plurality of cluster nodes are divided into a master node, which includes a GPU virtual device driver that receives GPU code and transfers it to a slave node, and a slave node, which includes a GPU virtual server that receives the GPU code and executes it through actual GPUs. The GPU virtual device driver comprises: an emulator that converts GPU code written through OpenCL (Open Computing Language), a universal parallel computing framework, for M (M is a natural number) virtual GPUs into a form for execution on N (N is a natural number equal to or different from M) actual GPUs; and a dispatcher that delivers the GPU code converted by the emulator to the actual GPUs.
According to a second aspect of the present invention, there is provided a GPU virtualization system for performing GPU virtualization for GPU parallel processing, the system comprising: a master node, which is a node through which a user executes GPU code for implementation as M virtual GPUs; and a slave node communicating with the master node to receive the GPU code and having N actual GPUs to perform GPU operations; wherein the master node includes an analyzer for computing a distribution of the data to be used through the GPU code based on a work item size and a work group size.
By combining or redistributing actual GPUs through virtualization, the system provides virtual GPUs with memory and computing power that no single physical GPU possesses, so that larger programs can be run more quickly and easily.
Description
The present invention relates to a variable number of convertible GPU virtualization system, and more particularly, to a GPU virtualization system that suggests a way to combine or redistribute actual GPUs through virtualization.
GPU is an abbreviation of Graphic Processing Unit. It was developed as a dedicated graphics processor that takes over the burden of complicated graphics processing, relieving the CPU (Central Processing Unit) of graphics computation and processing load.
However, in recent years, technology development has been actively pursued to utilize the parallel processing function of the GPU for use in complicated calculations.
As a result, many supercomputers use GPUs to speed up processing. In this trend, methods of adding virtualization to the GPU have been studied in order to use the GPU more efficiently.
As one example, technology development has been actively pursued in recent years to utilize the parallel processing capability of the GPU for complicated calculations, and many recent supercomputers have sharply increased their processing speed by using GPUs. High performance can therefore be achieved when a well-parallelizable algorithm is executed on a GPU.
Accordingly, in the related technical field, there is a need to further divide the GPU through GPU virtualization to form a virtual desktop environment, or to develop a technique for presenting the GPU as if there were one per virtual machine.
[Related Technical Literature]
1. METHOD AND APPARATUS FOR COMPILING AND EXECUTING AN APPLICATION USING VIRTUALIZATION IN A HETEROGENEOUS SYSTEM USING CPU AND GPU (Patent Application No. 10-2010-0093327)
2. VIRTUAL GPU (Patent Application No. 10-2012-0078202)
On the other hand, the background art described above is technical information acquired by the inventor for, or in the process of, deriving the present invention, and cannot necessarily be regarded as known art disclosed to the general public before the filing of the present application.
In order to solve the above problems, the present invention provides a GPU virtualization system capable of virtualizing N actual GPUs into M virtual GPUs so that a user can perform parallel programming intuitively and easily.
In addition, another embodiment of the present invention provides a variable-number-convertible GPU virtualization system that divides a GPU according to a user's request to increase the GPU utilization rate.
However, the objects of the present invention are not limited to the above-mentioned objects, and other objects not mentioned can be clearly understood by those skilled in the art from the following description.
According to an aspect of the present invention, there is provided a variable-number-convertible GPU virtualization system comprising a plurality of cluster nodes divided into a master node, which includes a GPU virtual device driver for receiving GPU code and transferring it to a slave node, and a slave node, which includes a GPU virtual server for receiving the GPU code and executing it through actual GPUs, wherein the GPU virtual device driver comprises: an emulator for converting GPU code written for M (M is a natural number) virtual GPUs through a universal parallel computing framework into a form for execution on N (N is a natural number equal to or different from M) actual GPUs; and a dispatcher for delivering the GPU code converted by the emulator to the N actual GPUs.
The dispatcher may also store the location of the N actual GPUs and deliver the converted GPU code based on the location of the stored actual GPUs.
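As an illustrative sketch only (the class name, the round-robin mapping, and the Python modeling are assumptions for exposition, not the patented implementation), the emulator's virtual-to-actual GPU mapping and the dispatcher's stored location table might look like this:

```python
# Hypothetical sketch: a dispatcher that stores the locations of N actual
# GPUs and delivers converted GPU code based on those stored locations.

class Dispatcher:
    def __init__(self, gpu_locations):
        # gpu_locations: list of (slave_node, device_index) pairs,
        # one entry per actual GPU.
        self.gpu_locations = list(gpu_locations)

    def deliver(self, converted_code_parts):
        """Pair each converted code part with the stored location of the
        actual GPU that will execute it."""
        if len(converted_code_parts) != len(self.gpu_locations):
            raise ValueError("need one code part per actual GPU")
        return list(zip(self.gpu_locations, converted_code_parts))


def emulate(m_virtual, n_actual):
    """Minimal emulator step: map each of M virtual GPU ids onto one of
    N actual GPU ids (round robin, an assumed policy)."""
    return {v: v % n_actual for v in range(m_virtual)}
```

For example, four virtual GPUs mapped onto two actual GPUs yields `{0: 0, 1: 1, 2: 0, 3: 1}`, after which the dispatcher routes each converted code part to the stored `(slave_node, device_index)` location.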
The GPU virtual device driver may further include an analyzer for analyzing, based on the work item size and the work group size, whether the data used through the GPU code can be divided among the actual GPU memories.
In addition, when the data to be used through the GPU code can be divided, the analyzer causes the emulator and the dispatcher to divide the GPU code among the N actual GPUs and transmit it; when the data cannot be divided, the non-divisible GPU code can be executed by a shared virtual memory method or a CPU execution method.
If the data to be used through the GPU code is divided among the N actual GPUs, the analyzer may divide the work group size into N pieces and distribute the N pieces to the N actual GPUs to be executed.
In addition, when the data to be used through the GPU code cannot be divided among the N actual GPUs, the analyzer may have the N actual GPUs process overlapping buffer areas for reading or writing by using the shared virtual memory provided by the MMU of each actual GPU.
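The division the analyzer performs over the work group size can be sketched as follows; the function name and the even-divisibility assumption are illustrative, not taken from the patent:

```python
def partition_work_group(work_group_size, n_actual_gpus):
    """Split a work group into n_actual_gpus contiguous pieces.

    Returns a list of (offset, size) pairs, one per actual GPU, or None
    when the work group does not divide evenly (in which case the
    description falls back to shared virtual memory or CPU execution).
    """
    if work_group_size % n_actual_gpus != 0:
        return None
    piece = work_group_size // n_actual_gpus
    return [(i * piece, piece) for i in range(n_actual_gpus)]
```

With the numbers used later in the description (a work group size of 1024 and 16 actual GPUs), this yields 16 pieces of 64 work items each.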
Also, the master node may be a node where a program is executed by a user, and the slave node may be a node having the N actual GPUs and performing a GPU operation.
The GPU virtual device driver may support CUDA and OpenCL.
Preferably, communication between the GPU virtual server and the GPU virtual device driver uses an interconnection network connected between the master node and the slave node.
According to another aspect of the present invention, there is provided a GPU virtualization system including: a master node, which is a node for executing GPU code for implementing M virtual GPUs; and one or more slave nodes communicating with the master node to receive the GPU code and having N actual GPUs to perform GPU operations; wherein the master node may include an analyzer for computing a distribution of the data to be used through the GPU code based on a work item size and a work group size.
The master node may further include an emulator configured to convert the GPU code for the N actual GPUs according to the operation of the analyzer, and a dispatcher configured to provide the converted GPU code to each of the N actual GPUs.
The variable-number-convertible GPU virtualization system according to an embodiment of the present invention combines or redistributes actual GPUs through virtualization, providing a virtual GPU with memory and computing capability that no single physical GPU possesses, so that a larger program can be run more quickly and easily.
Also, according to another embodiment of the present invention, the variable number-convertible GPU virtualization system provides an effect of increasing the GPU usage rate by dividing the GPU according to the demand of the user.
In addition, according to another embodiment of the present invention, the variable-number-convertible GPU virtualization system can virtualize N actual GPUs into M virtual GPUs, so that a user can intuitively and easily perform parallel programming.
The effects obtainable from the present invention are not limited to the above-mentioned effects, and other effects not mentioned will be clearly understood by those skilled in the art from the following description.
FIG. 1 is a diagram for explaining a basic structure of an actual GPU used in a variable number-convertible GPU virtualization system according to an embodiment of the present invention.
FIG. 2 is a diagram for explaining a virtual device driver layer in a variable number-convertible GPU virtualization system according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating a GPU virtual device driver in a variable number-convertible GPU virtualization system according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating an example of a GPU virtualization technique in a variable number-convertible GPU virtualization system according to an embodiment of the present invention.
FIG. 5 is a conceptual diagram for explaining an implementation method when data accessed by a virtualized GPU in a GPU virtualization system that can be converted to a variable number according to an embodiment of the present invention can not be divided.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS Hereinafter, a detailed description of preferred embodiments of the present invention will be given with reference to the accompanying drawings. In the following description of the present invention, detailed description of known functions and configurations incorporated herein will be omitted when it may make the subject matter of the present invention rather unclear.
In the present specification, when one element 'transmits' data or a signal to another element, it may transmit the data or signal directly to the other element, or may transmit the data or signal to the other element through at least one intermediate element.
FIG. 1 is a diagram for explaining the basic structure of an actual GPU 10 used in a variable number-convertible GPU virtualization system according to an embodiment of the present invention. Referring to FIG. 1, the actual GPU 10 includes processing elements (PE) 11, a GPU memory 12, and an MMU 13.
An MMU (Memory Management Unit) 13 is installed to implement virtual memory, like the MMU of a CPU. The MMU 13 is implemented in recent GPUs and supports virtualization of the GPU memory 12.
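As a rough illustration of the virtual-memory translation an MMU provides (a generic page-table sketch for exposition only; this is not the patent's or any real GPU's MMU design):

```python
PAGE_SIZE = 4096  # assumed page size for this toy model

class MMU:
    """Toy page-table MMU: translates virtual addresses to physical
    addresses, raising a page fault when a page is unmapped."""

    def __init__(self):
        self.page_table = {}  # virtual page number -> physical page number

    def map_page(self, vpn, ppn):
        self.page_table[vpn] = ppn

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn not in self.page_table:
            # In a real MMU this would trigger a page-fault handler;
            # the shared virtual memory method described later relies
            # on exactly this kind of fault to fetch remote pages.
            raise KeyError(f"page fault at virtual page {vpn}")
        return self.page_table[vpn] * PAGE_SIZE + offset
```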
FIG. 2 is a diagram illustrating the virtual device driver layer in a variable number-convertible GPU virtualization system according to an embodiment of the present invention. Referring to FIG. 2, the cluster nodes of the variable-number-convertible GPU virtualization system are divided into a master node 100a and a slave node 100b.
The master node 100a is the node where a program is executed by a user.
The slave node 100b is the node that has the actual GPUs 10 and performs GPU operations.
In a cluster node environment separated into the master node 100a and the slave node 100b, the GPU virtual device driver 100a-3 of the master node and the GPU virtual server 100b-6 of the slave node communicate through an interconnection network.
Accordingly, the
FIG. 3 is a diagram showing the GPU virtual device driver 100a-3 in a variable number-convertible GPU virtualization system according to an embodiment of the present invention.
Referring to FIG. 3, the GPU virtual device driver 100a-3 includes an emulator 110a-3, a dispatcher 120a-3, and an analyzer 130a-3.
First, the emulator 110a-3 converts GPU code written for M virtual GPUs into a form that can be executed on the N actual GPUs 10.
Also, the GPU code is transferred from the OpenCL layer 100a-5 to the GPU virtual device driver 100a-3.
For example, in a case where N (N is a natural number) actual GPUs are virtualized into M virtual GPUs, the emulator 110a-3 performs the corresponding code conversion.
Next, the dispatcher 120a-3 delivers the GPU code converted by the emulator 110a-3 to the N actual GPUs 10.
Finally, the analyzer 130a-3, an additional module constituting the GPU virtual device driver 100a-3, analyzes whether the data used by the GPU code can be divided among the actual GPUs.
In this specification, a module may mean a functional and structural combination of hardware for carrying out the technical idea of the present invention and software for driving the hardware. For example, a module may mean a logical unit of predetermined code together with the hardware resources for executing that code; it does not necessarily mean physically connected code or a single kind of hardware, as can be easily deduced by an average expert in the art.
When the data used by the GPU code can be divided, the analyzer 130a-3 allows the data to be divided and transferred to the N actual GPUs 10 for execution.
Conversely, if the analyzer 130a-3 determines that the data used by the GPU code cannot be divided, the GPU code can be executed through one of the following two methods: the shared virtual memory method or the CPU execution method.
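The three-way choice described above (divide and dispatch, shared virtual memory, or CPU execution) might be summarized as a decision function; the names and the exact precedence are illustrative assumptions, not taken from the patent:

```python
def choose_execution_path(divisible, buffers_overlap):
    """Pick how GPU code is executed, following the fallbacks the
    description gives for non-divisible data.

    - divisible data, no overlap  -> split across the N actual GPUs
    - overlapping buffer areas    -> shared virtual memory method
    - otherwise                   -> CPU execution method
    """
    if divisible and not buffers_overlap:
        return "split-across-actual-gpus"
    if buffers_overlap:
        return "shared-virtual-memory"
    return "cpu-execution"
```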
FIG. 4 is a diagram illustrating an example of a GPU virtualization technique in a variable number-convertible GPU virtualization system according to an embodiment of the present invention. Referring to (1) of FIG. 4, in an environment having three cluster nodes in which GPUs 10-0, 10-1, and 10-2 are installed, a plurality of actual GPUs can be virtualized into a single virtual GPU.
In this specification, the CPUs 2-0, 2-1, and 2-2 on the cluster nodes can be divided into a master node 100a and slave nodes 100b.
2, the
The GPU
The GPU
FIG. 4 (1) is an example of GPU virtualization in which a plurality of GPUs 10 (10-0, 10-1, 10-2) are virtualized into one virtual GPU.
When the programmer executes a program implemented with the OpenCL programming model, the GPU virtual device driver 100a-3 receives the GPU code.
At this time, the analyzer 130a-3 checks whether the data used by the GPU code can be divided among the GPUs 10 (10-0, 10-1, 10-2).
For example, the analyzer 130a-3 analyzes the GPU code using the work item size and the work group size, which are input values given when the GPU code is executed on the master node 100a.
If the work group size is 1024 and there are 16 actual GPUs 10 (for example, 10-0 to 10-15), the analyzer 130a-3 divides the work group into 16 pieces of 64 work items each.
When the work group is divided in this way, the analyzer 130a-3 checks the buffer areas accessed by the divided pieces; if the buffer areas do not overlap, it distributes the pieces to the 16 actual GPUs 10 (10-0 to 10-15).
In this case, the divided buffer areas may be respectively copied to the GPU memory 12 of each actual GPU 10.
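Copying the divided buffer areas to each GPU's memory can be modeled minimally as follows (an illustrative sketch assuming a flat buffer that divides evenly among the GPUs):

```python
def scatter_buffer(buffer, n_gpus):
    """Split a flat buffer into n_gpus contiguous slices and 'copy' each
    slice to the memory of one actual GPU (modeled here simply as a list
    of per-GPU buffers)."""
    piece = len(buffer) // n_gpus
    return [buffer[i * piece:(i + 1) * piece] for i in range(n_gpus)]
```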
FIG. 5 is a conceptual diagram for explaining an implementation method used when the data accessed by the virtualized GPU in a variable number-convertible GPU virtualization system according to an embodiment of the present invention cannot be divided. If the data accessed by the virtualized GPU cannot be partitioned, execution proceeds in one of two ways.
First, the shared virtual memory method using the MMU 13 of the actual GPU 10 is applied.
When GPU0 {10(0)} attempts to access page r, the MMU {13(0)} of GPU0 {10(0)} recognizes that the page is not in its local GPU memory.
In this case, the corresponding page is fetched from the node that currently holds it.
If the page is to be modified, a twin page is generated, as shown for GPU1 {10(1)} in FIG. 5, so that the modified contents can later be found. After the code execution is completed, the modified data (diff) computed against the twin page is reflected back into the original page.
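The twin-page/diff mechanism in FIG. 5 resembles classic software distributed shared memory; a toy sketch (all function names illustrative, pages modeled as lists of words):

```python
def make_twin(page):
    """Before modifying a fetched page, keep an unmodified twin copy."""
    return list(page)

def compute_diff(twin, page):
    """After execution, find which words changed relative to the twin."""
    return {i: v for i, (t, v) in enumerate(zip(twin, page)) if t != v}

def apply_diff(master_page, diff):
    """Reflect the modified data (diff) back into the original copy."""
    for i, v in diff.items():
        master_page[i] = v
    return master_page
```

A GPU that fetches a page makes a twin before writing; once its kernel finishes, only the diff (here `{2: 9}` for a single changed word) travels back, rather than the whole page.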
Since the
Second, the CPU execution method is applied, in which the GPU code whose data cannot be divided is executed on the CPU instead.
As described above, preferred embodiments of the present invention have been disclosed in the present specification and drawings. Although specific terms have been used, they are used only in a general sense to easily describe the technical contents of the present invention and to facilitate understanding of the invention, and are not intended to limit the scope of the present invention. It will be apparent to those skilled in the art that other modifications based on the technical idea of the present invention may be practiced in addition to the embodiments disclosed herein.
10: Actual GPU
11: PE (Processing Element)
12: GPU memory
13: Memory Management Unit (MMU)
100a: master node
100a-1: Hardware
100a-2: Operating system
100a-3: GPU virtual device driver
110a-3: Emulator
120a-3: Dispatcher
130a-3: Analyzer
100a-4: CUDA
100a-5: OpenCL
100a-6: CUDA App.
100a-7: OpenCL App.
100b: Slave node
100b-3: GPU device driver
100b-6: GPU virtual server
Claims (11)
An emulator for converting GPU code written for M (M is a natural number) virtual GPUs through a parallel computing framework into a form for execution on N (N is a natural number equal to or different from M) actual GPUs; and
A dispatcher for delivering the GPU code converted by the emulator to the N actual GPUs.
Wherein the dispatcher stores the locations of the N actual GPUs and delivers the converted GPU code based on the stored locations.
Wherein the GPU virtual device driver further comprises an analyzer for analyzing, based on the work item size and the work group size, whether the data used through the GPU code can be divided among the actual GPU memories.
Wherein the analyzer causes the GPU code to be divided among the N actual GPUs through the emulator and the dispatcher when the data to be used through the GPU code can be divided, and causes the non-divisible GPU code to be executed by a shared virtual memory method or a CPU execution method when the data cannot be divided.
Wherein the analyzer divides the work group size into N pieces for the data to be used through the GPU code and distributes the divided N pieces to the N actual GPUs.
Wherein, if the data to be used through the GPU code cannot be divided among the N actual GPUs, the N actual GPUs read or write overlapping buffer areas using the shared virtual memory of the MMU of each actual GPU.
Wherein the master node is a node where a program is executed by a user, and the slave node is a node that has the N actual GPUs and performs GPU operations.
Wherein the GPU virtual device driver supports a parallel computing framework.
Wherein the communication between the GPU virtual server and the GPU virtual device driver uses an interconnection network connected between the master node and the slave node.
A master node that executes a GPU code for implementing M virtual GPUs; And
One or more slave nodes communicating with the master node to receive the GPU code and having N actual GPUs to perform GPU operations;
Wherein the master node comprises an analyzer for computing a distribution of data to be used over the GPU code based on a work item size and a work group size.
The master node,
An emulator configured to convert the GPU code for the N slave nodes according to the operation of the analyzer; and
A dispatcher configured to provide the converted GPU code to each of the N slave nodes.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020130071605A KR101435772B1 (en) | 2013-06-21 | 2013-06-21 | Gpu virtualizing system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020130071605A KR101435772B1 (en) | 2013-06-21 | 2013-06-21 | Gpu virtualizing system |
Publications (1)
Publication Number | Publication Date |
---|---|
KR101435772B1 true KR101435772B1 (en) | 2014-08-29 |
Family
ID=51751616
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
KR1020130071605A KR101435772B1 (en) | 2013-06-21 | 2013-06-21 | Gpu virtualizing system |
Country Status (1)
Country | Link |
---|---|
KR (1) | KR101435772B1 (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3385850A1 (en) * | 2017-04-07 | 2018-10-10 | INTEL Corporation | Apparatus and method for memory management in a graphics processing environment |
KR20220035626A (en) * | 2020-09-14 | 2022-03-22 | 한화시스템 주식회사 | Method for managing gpu resources and computing device for executing the method |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20060079088A (en) * | 2004-12-30 | 2006-07-05 | 마이크로소프트 코포레이션 | Systems and methods for virtualizing graphics subsystems |
WO2012083012A1 (en) * | 2010-12-15 | 2012-06-21 | Advanced Micro Devices, Inc. | Device discovery and topology reporting in a combined cpu/gpu architecture system |
WO2012082423A1 (en) * | 2010-12-13 | 2012-06-21 | Advanced Micro Devices, Inc. | Graphics compute process scheduling |
- 2013-06-21: KR application KR1020130071605A filed; patent KR101435772B1 active (IP Right Grant)
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20060079088A (en) * | 2004-12-30 | 2006-07-05 | 마이크로소프트 코포레이션 | Systems and methods for virtualizing graphics subsystems |
WO2012082423A1 (en) * | 2010-12-13 | 2012-06-21 | Advanced Micro Devices, Inc. | Graphics compute process scheduling |
WO2012083012A1 (en) * | 2010-12-15 | 2012-06-21 | Advanced Micro Devices, Inc. | Device discovery and topology reporting in a combined cpu/gpu architecture system |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3385850A1 (en) * | 2017-04-07 | 2018-10-10 | INTEL Corporation | Apparatus and method for memory management in a graphics processing environment |
US10380039B2 (en) | 2017-04-07 | 2019-08-13 | Intel Corporation | Apparatus and method for memory management in a graphics processing environment |
EP3614270A1 (en) * | 2017-04-07 | 2020-02-26 | Intel Corporation | Apparatus and method for memory management in a graphics processing environment |
US10769078B2 (en) | 2017-04-07 | 2020-09-08 | Intel Corporation | Apparatus and method for memory management in a graphics processing environment |
US11360914B2 (en) | 2017-04-07 | 2022-06-14 | Intel Corporation | Apparatus and method for memory management in a graphics processing environment |
EP4109283A1 (en) * | 2017-04-07 | 2022-12-28 | INTEL Corporation | Apparatus and method for memory management in a graphics processing environment |
US11768781B2 (en) | 2017-04-07 | 2023-09-26 | Intel Corporation | Apparatus and method for memory management in a graphics processing environment |
EP4325371A3 (en) * | 2017-04-07 | 2024-03-20 | Intel Corporation | Apparatus and method for memory management in a graphics processing environment |
KR20220035626A (en) * | 2020-09-14 | 2022-03-22 | 한화시스템 주식회사 | Method for managing gpu resources and computing device for executing the method |
KR102417882B1 (en) | 2020-09-14 | 2022-07-05 | 한화시스템 주식회사 | Method for managing gpu resources and computing device for executing the method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105830026B (en) | Apparatus and method for scheduling graphics processing unit workload from virtual machines | |
US8943584B2 (en) | Centralized device virtualization layer for heterogeneous processing units | |
Burgio et al. | A software stack for next-generation automotive systems on many-core heterogeneous platforms | |
CN103034524A (en) | Paravirtualized virtual GPU | |
CN113495857A (en) | Memory error isolation techniques | |
CN112817738A (en) | Techniques for modifying executable graphs to implement workloads associated with new task graphs | |
CN116783578A (en) | Execution matrix value indication | |
CN118339538A (en) | Application programming interface for indicating graph node execution | |
KR101435772B1 (en) | Gpu virtualizing system | |
CN112817739A (en) | Techniques for modifying executable graphs to implement different workloads | |
Cabezas et al. | Runtime and architecture support for efficient data exchange in multi-accelerator applications | |
CN113568734A (en) | Virtualization method and system based on multi-core processor, multi-core processor and electronic equipment | |
CN118119924A (en) | Application programming interface for performing operations with reusable threads | |
CN118103817A (en) | Application programming interface for performing selective loading | |
CN116401039A (en) | Asynchronous memory deallocation | |
CN117222984A (en) | Application programming interface for disassociating virtual addresses | |
Miliadis et al. | VenOS: A virtualization framework for multiple tenant accommodation on reconfigurable platforms | |
CN116225676A (en) | Application programming interface for limiting memory | |
CN116724292A (en) | Parallel processing of thread groups | |
CN116243921A (en) | Techniques for modifying graph code | |
CN117178261A (en) | Application programming interface for updating graph code signal quantity | |
CN116802613A (en) | Synchronizing graphics execution | |
CN115509736A (en) | Memory allocation or de-allocation using graphs | |
CN116097224A (en) | Simultaneous boot code | |
Moore et al. | VForce: An environment for portable applications on high performance systems with accelerators |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
E701 | Decision to grant or registration of patent right | ||
GRNT | Written decision to grant | ||
FPAY | Annual fee payment |
Payment date: 20170724 Year of fee payment: 4 |