CN117931425A - Framework and method for improving utilization rate of GPU (graphics processing unit) of intelligent computation center

Info

Publication number: CN117931425A
Authority: CN (China)
Application number: CN202311724685.XA
Other languages: Chinese (zh)
Inventor: 刘超
Assignee: Inesa R&D Center
Filing date: 2023-12-14
Publication date: 2024-04-26
Legal status: Pending
Prior art keywords: gpu, computing, intelligent, api, framework
Abstract

The invention relates to a framework and a method for improving the GPU utilization rate of an intelligent computing center. The framework comprises: a basic computing power zone, in which an intelligent computing application receives a user's API call request and calls a wrapper library, the wrapper library intercepts and forwards the API call request and calls a scheduling engine, and the scheduling engine senses the idle state of the GPUs and selects a target GPU according to a scheduling algorithm; an intelligent computing power zone, comprising GPU servers, an agent program and an intelligent computing API, in which the agent reports the idle state of the GPUs to the scheduling engine, receives API call requests from the wrapper library and calls the intelligent computing API, which executes the intelligent computing task on the selected target GPU; and a data storage zone, which provides file storage, block storage and object storage services for the basic and intelligent computing power zones. Compared with the prior art, the invention improves the utilization rate of the GPUs.

Description

Framework and method for improving utilization rate of GPU (graphics processing unit) of intelligent computation center
Technical Field
The invention relates to the technical field of computers, and in particular to a framework and a method for improving the utilization rate of the graphics processing units (GPUs) of an intelligent computing center.
Background
To support large-model training, intelligent computing centers often need to deploy thousands to tens of thousands of GPUs; together with the matching high-performance storage and network equipment, the investment is very high. At the same time, the utilization of these GPUs is relatively low, mainly for two reasons:
1) The intelligent computing center provides services externally by renting out GPU server clusters, and there is a mismatch between customer demand and the center's supply. For example, suppose an intelligent computing center has 10,000 GPU cards: customer A rents 5,000 cards and customer B rents 3,000 cards for one quarter in the first half of the year, and customer C rents 8,000 cards for one quarter starting in the third quarter. Then, measured in card-months, (1 − ((5000 + 3000) × 3 + 8000 × 3) / (10000 × 12)) × 100% = 60% of the GPU capacity goes unused over the year.
2) The utilization of the leased GPU cards themselves is not high. Of the GPUs used for large-model training, 80% serve the training of a single model, while the other 20% serve auxiliary functions such as data cleaning; the GPUs assigned to these auxiliary functions in particular suffer from low utilization.
Therefore, improving the GPU utilization of intelligent computing centers is a problem to be solved urgently. The industry keeps exploring how to use GPU resources more efficiently, for example through GPU virtualization. However, that technique is confined to scenarios that use only local GPU resources and is commonly employed to let multiple tenants share one local GPU. The leased GPU computing power of an intelligent computing center is usually single-tenant oriented: the hardware and software architecture is designed for exclusive use by one customer and cannot be shared among multiple tenants, so existing GPU virtualization cannot solve the intelligent computing center's problem of low GPU utilization.
Disclosure of Invention
The invention aims to provide a framework and a method for improving the GPU utilization rate of an intelligent computing center, so as to solve the problem of low GPU utilization.
The aim of the invention can be achieved by the following technical scheme:
a framework for improving the utilization rate of a GPU of an intelligent computing center comprises a basic computing area, an intelligent computing area and a data storage area,
The basic computing area comprises a general computing node, network equipment deployed with a cloud operating system, an intelligent computing application program, a packaging library and a scheduling engine, wherein the basic computing area is used for the intelligent computing application program to receive an API call request of a user intelligent computing task and call the packaging library, the packaging library intercepts and forwards the API call request of the user and calls the scheduling engine, and the scheduling engine perceives the idle state of a GPU of the intelligent computing area and selects a target GPU according to a scheduling algorithm;
The intelligent computing area comprises a GPU server node, an agent program and an intelligent computing API, wherein the intelligent computing area is used for reporting the idle state of the GPU to a dispatching engine by the agent program, receiving an API call request of a packaging library and calling the intelligent computing API, and the intelligent computing API executes an intelligent computing task according to a target GPU and returns an execution result to an intelligent computing application program through the agent program and the packaging library;
the data storage area is used for providing file storage, block storage and object storage services for the basic power calculation area and the intelligent power calculation area.
Further, the intelligent computing application runs in a virtual machine.
Further, the wrapper library is installed in an intelligent computing image.
Further, the wrapper library is a set of wrapper functions formed by wrapping the API functions of the intelligent computing library.
Further, the GPU server nodes are interconnected through a leaf-spine network.
Further, the specific steps of selecting the target GPU according to the scheduling algorithm include:
checking whether an idle GPU server node exists; if so, randomly selecting an idle GPU as the target GPU; if not, selecting from the computing-assist servers the GPU with the lowest usage over the most recent day as the target GPU.
Further, the scheduling algorithm is expressed by the formula:

S =
\begin{cases}
\operatorname{rand}\{\, G(i,j,l) \mid U(i,j,l) = 0 \,\}, & \text{if an idle GPU exists} \\
\arg\min_{1 \le i \le n,\; 1 \le j \le m,\; 1 \le l \le k,\; R(i,j)=1} U(i,j,l), & \text{otherwise}
\end{cases}

where S is the finally selected target GPU, G(i, j, l) denotes the l-th GPU of the j-th server of the i-th cluster, U(i, j, l) denotes the usage of G(i, j, l) (taken over the most recent day in the second branch), n denotes the number of GPU clusters, m the number of GPU servers in each cluster, k the number of GPUs in each server, and R(i, j) indicates whether the j-th server of the i-th cluster is a computing-assist server.
Further, the agent program and the intelligent computing API run in a virtual machine of a GPU server node or directly on the GPU server node.
Further, the scheduling engine is implemented based on open-source Kubernetes.
The invention also provides a method for processing intelligent computing tasks based on the above framework for improving the GPU utilization rate of an intelligent computing center, comprising the following steps:
the intelligent computing application acquires an API call request of the user's intelligent computing task and sends it to the wrapper library;
the wrapper library intercepts the API call request and calls the scheduling engine to query the GPU idle state of the intelligent computing power zone;
the scheduling engine receives the GPU idle state returned by the agent program, selects a target GPU using the scheduling algorithm, and returns to the wrapper library the IP address of the agent corresponding to the target GPU;
the wrapper library forwards the API call request to the corresponding agent according to the IP address;
the corresponding agent calls the intelligent computing API to execute the intelligent computing task, and the execution result is returned to the intelligent computing application through the agent and the wrapper library.
Compared with the prior art, the invention has the following beneficial effects:
(1) According to the invention, the scheduling engine dispatches intelligent computing tasks from the basic computing power zone to run on target GPUs in the intelligent computing power zone according to a scheduling policy, and selects target GPUs reasonably, thereby improving the GPU utilization of the intelligent computing center.
(2) The framework provided by the invention is designed for multiple users and can be shared among multiple tenants, further improving GPU utilization.
(3) The invention does not change the existing network topology of the computing center; by means of the wrapper library, network scheduling, task superposition and hardware multiplexing, the idle part of a GPU cluster can be scheduled and allocated to the multi-tenant basic computing power zone, helping it achieve algorithm acceleration and parallel computation and improving the overall GPU utilization of the intelligent computing center.
Drawings
FIG. 1 is a structural diagram of the framework of the present invention;
FIG. 2 is a schematic flow chart of the method of the present invention.
Detailed Description
The invention will now be described in detail with reference to the drawings and specific examples. The present embodiment is implemented on the premise of the technical scheme of the present invention, and a detailed implementation manner and a specific operation process are given, but the protection scope of the present invention is not limited to the following examples.
The present embodiment provides a framework for improving the GPU utilization of an intelligent computing center; as shown in fig. 1, the framework comprises: a basic computing power zone, an intelligent computing power zone and a data storage zone.
In the network topology, the basic computing power zone is called the "east zone" and the intelligent computing power zone the "west zone". The "east zone" is a CPU cluster on which cloud operating system software with multi-tenant functionality is deployed, providing multi-tenant cloud services externally. Some GPU servers in the "west zone" are leased for exclusive use by a particular tenant for a period of time (e.g., 5,000 GPU cards and supporting infrastructure leased to tenant A for 6 months) to train a "large model".
According to the invention, the wrapper library software is pre-installed in the images of the basic computing power zone, GPU-capable cloud host services are offered externally, and the scheduling engine 3 dispatches the GPU workload of the basic computing power zone to idle GPUs in the "west zone" for execution, thereby improving the resource utilization of the GPUs.
As illustrated in fig. 1, the entire framework is divided into three parts:
1. Basic computing power zone ("east zone"): the east zone comprises general computing nodes, i.e. servers with only CPUs and no GPUs, together with matching network equipment; a cloud operating system is deployed on them and multi-tenant services are provided externally. This zone contains the following three key components:
Intelligent computing application 1: runs in a virtual machine and directly calls the intelligent computing wrapper-library API to accelerate intelligent computing algorithms.
Wrapper library 2: a set of wrapper functions for the API functions in the intelligent computing library. The wrapper functions have interfaces identical to the original functions, so when the application calls the intelligent computing API it actually calls the wrapper functions underneath. A wrapper function calls the scheduling engine 3, finds the target node agent 4 in the "west zone" and forwards the intelligent computing API call to that agent 4.
Concretely, the wrapper functions can stand in for parallel computing APIs such as the CUDA libraries, OpenCL and OpenACC.
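As an illustrative sketch (not prescribed by the patent), a wrapper function could mirror the original API's signature, ask the scheduling engine for a target agent, and forward the call over the network. The scheduler URL, agent port and JSON message format below are assumptions:

```python
# Hypothetical sketch of a wrapper function that intercepts an
# intelligent-computing API call and forwards it to a remote agent.
import json
import urllib.request

SCHEDULER_URL = "http://scheduler.east.internal:8080"  # assumed endpoint
AGENT_PORT = 9000                                      # assumed agent port

def _post_json(url: str, payload: dict) -> dict:
    req = urllib.request.Request(
        url, data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

def wrapped_matmul(a, b):
    """Same interface as the original library's matmul, so the caller
    cannot tell the work is executed on a remote GPU."""
    # 1. Ask the scheduling engine for the agent of a target GPU.
    target = _post_json(f"{SCHEDULER_URL}/select_gpu", {"op": "matmul"})
    # 2. Forward the API name and arguments to that agent.
    result = _post_json(
        f"http://{target['agent_ip']}:{AGENT_PORT}/invoke",
        {"api": "matmul", "args": [a, b]})
    # 3. Return the result as if it had been computed locally.
    return result["value"]
```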
Scheduling engine 3: the scheduling engine 3 is a piece of software running in a physical or virtual machine; it senses the GPU idle state of the "west zone" and informs the wrapper function of the IP address of the target agent 4.
The scheduling engine 3 can be implemented based on open-source Kubernetes; a GPU controller can be built with the Device Plugins mechanism for topology awareness and GPU scheduling.
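As a hedged sketch of one way such an engine could read per-node GPU capacity, assuming the NVIDIA device plugin is installed so that nodes advertise the extended resource `nvidia.com/gpu`:

```python
# Sketch: query the allocatable GPU count of each cluster node via the
# official Kubernetes Python client.
from kubernetes import client, config

def gpu_capacity_by_node() -> dict[str, int]:
    config.load_kube_config()  # use load_incluster_config() inside a pod
    v1 = client.CoreV1Api()
    capacity = {}
    for node in v1.list_node().items:
        gpus = node.status.allocatable.get("nvidia.com/gpu", "0")
        capacity[node.metadata.name] = int(gpus)
    return capacity

if __name__ == "__main__":
    for name, gpus in gpu_capacity_by_node().items():
        print(f"{name}: {gpus} allocatable GPU(s)")
```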
2. Data storage zone: the data storage zone provides file storage, block storage and object storage services for the "east zone" and the "west zone".
3. Intelligent computing power zone ("west zone"): the "west zone" is mainly used for large-model training. In this framework each GPU server is a node, and the nodes are interconnected via a leaf-spine network. The zone contains two key components:
Agent 4: the agent 4 has two roles: first, it accepts API requests from the wrapper functions and calls the real intelligent computing API 5; second, it reports the idle state of the local GPUs to the scheduling engine 3.
Intelligent computing API 5: the real intelligent computing API called by the agent 4, through which the parallel computing units of the GPU hardware can be invoked, e.g. NVIDIA's CUDA API.
The agent and the intelligent computing API may run in a virtual machine on the GPU node or directly on the GPU node. If running in a virtual machine, this can be implemented based on the OpenStack open-source cloud operating system; if running directly on the GPU node, based on the Kubernetes container scheduling platform.
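A minimal sketch of such an agent under stated assumptions (the HTTP routes, report interval and usage probe are illustrative, not from the patent):

```python
# Hypothetical agent sketch: periodically reports local GPU usage to the
# scheduling engine and serves API calls forwarded by wrapper functions.
import json
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

SCHEDULER_URL = "http://scheduler.east.internal:8080"  # assumed endpoint

def local_gpu_usage() -> list[float]:
    # Placeholder probe; a real agent might parse `nvidia-smi` output
    # or use NVML bindings to read per-GPU utilization.
    return [0.0]

def report_loop(interval_s: float = 5.0) -> None:
    while True:
        payload = json.dumps({"usage": local_gpu_usage()}).encode()
        req = urllib.request.Request(
            f"{SCHEDULER_URL}/report", data=payload,
            headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req)
        time.sleep(interval_s)

class InvokeHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        body = json.loads(self.rfile.read(length))
        # A real agent would dispatch to the local intelligent computing
        # API (e.g. a CUDA-backed kernel); this stub only echoes the call.
        data = json.dumps({"value": f"executed {body['api']}"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(data)

if __name__ == "__main__":
    threading.Thread(target=report_loop, daemon=True).start()
    HTTPServer(("0.0.0.0", 9000), InvokeHandler).serve_forever()
```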
The specific steps for improving the GPU utilization of an intelligent computing center with this framework are shown in fig. 2.
In implementation, a network of 25G or above is used between the wrapper functions in the "east zone" and the agent programs in the "west zone", and the RDMA protocol can be adopted to keep network latency small.
S1, a series of intelligent computing images (Windows, Ubuntu Linux, etc.) are produced. They differ from ordinary images in that the wrapper library 2 is pre-installed, so that a cloud host booted from such an image can call remote GPUs through the scheduler.
A cloud administrator uploads the intelligent computing images to the "east zone" cloud operating system and shares them with all cloud tenants, who develop programs on top of these images.
A user of a tenant starts a cloud host in the "east zone" from an intelligent computing image, and writes an application requiring GPU acceleration based on an artificial intelligence framework such as PyTorch or TensorFlow, which calls a parallel computing library at the bottom layer.
S2, the intelligent computing application 1 receives the API calls of the cloud tenant's GPU-accelerated program and forwards them to the wrapper library 2; the wrapper library 2 intercepts the API calls and calls the scheduling engine 3 to query the GPU idle state of the intelligent computing power zone.
These API calls are intercepted by the wrapper library 2 without the user noticing. The wrapper library 2 calls the scheduling engine 3, asking it to dispatch the API calls to run on a GPU in the "west zone".
S3, the scheduling engine 3 receives the GPU idle state returned by the agent programs 4, selects a target GPU, denoted "S", with the scheduling algorithm, and returns to the wrapper library 2 the IP address of the agent 4 corresponding to the target GPU.
On the "west zone" nodes, where the intelligent computing driver and intelligent computing API 5 are already deployed, the agent 4 is additionally installed; it continuously reports the usage state of the local GPUs to the scheduling engine 3.
After receiving the given parameters, the scheduling engine 3 selects a GPU node according to the following scheduling policy. Assume:
there are n GPU clusters;
each cluster has m GPU servers;
each server has k GPUs;
C(i, j) denotes the j-th server of the i-th cluster;
G(i, j, l) denotes the l-th GPU of the j-th server of the i-th cluster;
U(i, j, l) denotes the usage of G(i, j, l);
R(i, j) indicates whether C(i, j) is a computing-assist server;
the finally selected GPU is denoted S.
The scheduling algorithm can then be expressed by the following formula:

S =
\begin{cases}
\operatorname{rand}\{\, G(i,j,l) \mid U(i,j,l) = 0 \,\}, & \text{if an idle GPU exists} \\
\arg\min_{1 \le i \le n,\; 1 \le j \le m,\; 1 \le l \le k,\; R(i,j)=1} U(i,j,l), & \text{otherwise}
\end{cases}

That is, the algorithm first checks whether an idle GPU exists; if one does, it randomly selects an idle GPU; if none does, it selects from the computing-assist servers the GPU with the lowest usage over the most recent day.
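A minimal sketch of this selection policy (the data layout and usage statistics are assumptions; the patent does not prescribe an implementation):

```python
# Sketch of the target-GPU selection policy: pick a random idle GPU if
# one exists; otherwise pick, among computing-assist servers, the GPU
# with the lowest average usage over the most recent day.
import random
from dataclasses import dataclass

@dataclass
class Gpu:
    cluster: int           # i
    server: int            # j
    index: int             # l
    usage_now: float       # current utilization, 0.0 means idle
    usage_last_day: float  # average utilization over the last day
    assist_server: bool    # R(i, j): on a computing-assist server?

def select_target_gpu(gpus: list[Gpu]) -> Gpu:
    idle = [g for g in gpus if g.usage_now == 0.0]
    if idle:
        return random.choice(idle)  # random idle GPU
    candidates = [g for g in gpus if g.assist_server]
    return min(candidates, key=lambda g: g.usage_last_day)

# Example: one busy regular GPU, two GPUs on a computing-assist server.
gpus = [Gpu(1, 1, 1, 0.9, 0.80, False),
        Gpu(1, 2, 1, 0.5, 0.10, True),
        Gpu(1, 2, 2, 0.6, 0.30, True)]
print(select_target_gpu(gpus))  # -> G(1, 2, 1), lowest last-day usage
```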
S4, the wrapper library 2 forwards the API call request to the corresponding agent 4 according to the IP address.
S5, after the agent 4 of the corresponding GPU node receives the request and parameters, it runs the real intelligent computing API 5.
The return value of the intelligent computing API 5 is returned to the intelligent computing application 1 through the agent 4 and the wrapper library 2.
The above functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein. The schemes in the embodiments of the invention can be implemented in various computer languages, such as the object-oriented programming language Java and the scripting language JavaScript.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (10)

1. A framework for improving the GPU utilization rate of an intelligent computing center, characterized by comprising a basic computing power zone, an intelligent computing power zone and a data storage zone, wherein:
the basic computing power zone comprises general computing nodes and matching network equipment on which a cloud operating system is deployed, an intelligent computing application (1), a wrapper library (2) and a scheduling engine (3); the intelligent computing application (1) receives an API call request of a user's intelligent computing task and calls the wrapper library (2), the wrapper library (2) intercepts and forwards the user's API call request and calls the scheduling engine (3), and the scheduling engine (3) senses the idle state of the GPUs of the intelligent computing power zone and selects a target GPU according to a scheduling algorithm;
the intelligent computing power zone comprises GPU server nodes, an agent program (4) and an intelligent computing API (5); the agent program (4) reports the idle state of the GPUs to the scheduling engine (3), receives API call requests from the wrapper library (2) and calls the intelligent computing API (5); the intelligent computing API (5) executes the intelligent computing task on the target GPU and returns the execution result to the intelligent computing application (1) through the agent program (4) and the wrapper library (2);
the data storage zone is used for providing file storage, block storage and object storage services for the basic computing power zone and the intelligent computing power zone.
2. The framework for improving the GPU utilization rate of an intelligent computing center according to claim 1, characterized in that the intelligent computing application (1) runs in a virtual machine.
3. The framework for improving the GPU utilization rate of an intelligent computing center according to claim 1, characterized in that the wrapper library (2) is installed in an intelligent computing image.
4. The framework for improving the GPU utilization rate of an intelligent computing center according to claim 1, characterized in that the wrapper library (2) is a set of wrapper functions formed by wrapping the API functions of the intelligent computing library.
5. The framework for improving the GPU utilization rate of an intelligent computing center according to claim 1, characterized in that the GPU server nodes are interconnected through a leaf-spine network.
6. The framework for improving the GPU utilization rate of an intelligent computing center according to claim 1, characterized in that the specific steps of selecting a target GPU according to the scheduling algorithm comprise:
checking whether an idle GPU server node exists; if so, randomly selecting an idle GPU as the target GPU; if not, selecting from the computing-assist servers the GPU with the lowest usage over the most recent day as the target GPU.
7. The framework for improving the GPU utilization rate of an intelligent computing center according to claim 6, characterized in that the scheduling algorithm is expressed as:

S =
\begin{cases}
\operatorname{rand}\{\, G(i,j,l) \mid U(i,j,l) = 0 \,\}, & \text{if an idle GPU exists} \\
\arg\min_{1 \le i \le n,\; 1 \le j \le m,\; 1 \le l \le k,\; R(i,j)=1} U(i,j,l), & \text{otherwise}
\end{cases}

where S is the finally selected target GPU, G(i, j, l) denotes the l-th GPU of the j-th server of the i-th cluster, U(i, j, l) denotes the usage of G(i, j, l) (taken over the most recent day in the second branch), n denotes the number of GPU clusters, m denotes the number of GPU servers in each cluster, k denotes the number of GPUs in each server, and R(i, j) indicates whether the j-th server of the i-th cluster is a computing-assist server.
8. The framework for improving the GPU utilization rate of an intelligent computing center according to claim 1, characterized in that the agent program (4) and the intelligent computing API (5) run in a virtual machine of a GPU server node or directly on the GPU server node.
9. The framework for improving the GPU utilization rate of an intelligent computing center according to claim 1, characterized in that the scheduling engine (3) is implemented based on open-source Kubernetes.
10. A method for processing intelligent computing tasks based on the framework for improving the GPU utilization rate of an intelligent computing center according to any one of claims 1-9, characterized by comprising the following steps:
the intelligent computing application (1) acquires an API call request of a user's intelligent computing task and sends it to the wrapper library (2);
the wrapper library (2) intercepts the API call request and calls the scheduling engine (3) to query the GPU idle state of the intelligent computing power zone;
the scheduling engine (3) receives the GPU idle state returned by the agent program (4), selects a target GPU using the scheduling algorithm, and returns to the wrapper library (2) the IP address of the agent program (4) corresponding to the target GPU;
the wrapper library (2) forwards the API call request to the corresponding agent program (4) according to the IP address;
the corresponding agent program (4) calls the intelligent computing API (5) to execute the intelligent computing task, and the execution result is returned to the intelligent computing application (1) through the agent program (4) and the wrapper library (2).
CN202311724685.XA 2023-12-14 2023-12-14 Framework and method for improving utilization rate of GPU (graphics processing unit) of intelligent computation center Pending CN117931425A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311724685.XA CN117931425A (en) 2023-12-14 2023-12-14 Framework and method for improving utilization rate of GPU (graphics processing unit) of intelligent computation center


Publications (1)

Publication Number Publication Date
CN117931425A true CN117931425A (en) 2024-04-26

Family

ID=90768998

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311724685.XA Pending CN117931425A (en) 2023-12-14 2023-12-14 Framework and method for improving utilization rate of GPU (graphics processing unit) of intelligent computation center

Country Status (1)

Country Link
CN (1) CN117931425A (en)


Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination