CN116680035A - GPU (graphics processing unit) method and device for realizing remote scheduling of kubernetes container - Google Patents
- Publication number
- CN116680035A (application CN202310443295.9A)
- Authority
- CN
- China
- Prior art keywords
- gpu
- kubernetes
- node
- remote
- cuda
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/133—Protocols for remote procedure calls [RPC]
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/455—Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
- G06F9/45533—Hypervisors; Virtual machine monitors
- G06F9/45558—Hypervisor-specific management and integration aspects
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5061—Partitioning or combining of resources
- G06F9/5077—Logical partitioning of resources; Management or configuration of virtualized resources
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/455—Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
- G06F9/45533—Hypervisors; Virtual machine monitors
- G06F9/45558—Hypervisor-specific management and integration aspects
- G06F2009/45575—Starting, stopping, suspending or resuming virtual machine instances
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/455—Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
- G06F9/45533—Hypervisors; Virtual machine monitors
- G06F9/45558—Hypervisor-specific management and integration aspects
- G06F2009/45595—Network integration; Enabling network access in virtual machine instances
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
Abstract
The invention relates to the technical field of RPC communication, and in particular to a method and a device for realizing remote scheduling of Kubernetes containers that use a GPU, comprising the following steps: deploying a Kubernetes cluster and labeling GPU nodes and non-GPU nodes using "kubectl label node"; deploying the official CUDA runtime API on the non-GPU nodes; deploying the official GPU driver API, i.e., the GPU driver, on the GPU nodes; and deploying a remote-CUDA-call client service component on the non-GPU nodes. The beneficial effects are as follows: the method and device create virtual GPU devices on nodes without physical GPUs through virtualization technology and register those devices with the Kubernetes nodes, so that Kubernetes can schedule containers that need GPU resources onto nodes without physical GPUs and bind the containers to the virtual GPU devices, thereby sharing GPU resources, improving GPU utilization, and reducing hardware and software costs.
Description
Technical Field
The invention relates to the technical field of RPC (remote procedure call) communication, and in particular to a method and a device for realizing remote scheduling of Kubernetes containers that use a GPU (graphics processing unit).
Background
Kubernetes, abbreviated K8s, is an open-source system for managing containerized applications across multiple hosts in a cloud platform. The goal of Kubernetes is to make deploying containerized applications simple and efficient; it provides mechanisms for application deployment, planning, updating, and maintenance.
In the prior art, on one hand, users of Kubernetes can only use the GPU device resources on the local node, but the number of GPU nodes is limited, and deploying and managing GPU nodes also requires a certain cost and expertise, which limits the flexibility and portability of containers; on the other hand, the utilization rate of GPU resources is low, and GPU resources on some nodes may sit idle, wasting much unused GPU capacity.
Disclosure of Invention
The invention aims to provide a method and a device for realizing remote scheduling of Kubernetes containers that use a GPU (graphics processing unit), so as to solve the problems in the prior art described above.
In order to achieve the above purpose, the present invention provides the following technical solution: a method for implementing remote scheduling of Kubernetes containers using a GPU, the method comprising the following steps:
deploying a Kubernetes cluster, and labeling GPU nodes and non-GPU nodes by using 'kubectl label node';
deploying the official CUDA runtime API on the non-GPU nodes;
deploying the official GPU driver API, i.e., the GPU driver, on the GPU nodes;
deploying a remote-CUDA-call client service component on the non-GPU nodes;
deploying a remote-CUDA-call server-side service component on the GPU nodes;
creating a deployment that uses GPU resources in Kubernetes, configuring the scheduling parameters, and creating the Pod on nodes without GPU resources;
and observing the running state of the Pod on the Kubernetes platform, entering the Pod to observe that the service runs normally, and confirming that the GPU resources are called normally.
Preferably, when labeling GPU nodes and non-GPU nodes, the GPU nodes and the non-GPU nodes are labeled as "GPU" and "non-GPU" respectively in the Kubernetes cluster.
Preferably, the client service component is implemented with RPC technology; by hijacking the CUDA API, it intercepts the access of the CUDA application in the Pod to the GPU and forwards that access over a TCP/IP or RDMA network to a node with GPU resources on which the remote CUDA call server is deployed.
Preferably, the server-side service component receives the CUDA call request sent by the client, forwards the request to the GPU device for execution, and returns the result to the client.
Preferably, when the running state of the Pod is observed on the Kubernetes platform: after the GPU container starts, the invoked CUDA program and its API calls are hijacked and redirected to run on the remote CUDA client, executed on a node with a GPU through communication between the remote CUDA client and server, and the result is returned to the GPU container.
A device for remotely scheduling and using GPUs with Kubernetes containers comprises a Kubernetes management module, a scheduling module, and a GPU node management module;
the Kubernetes management module is responsible for the creation, deletion, and scheduling of Pods; when it detects that no GPU node is available, it sends a scheduling request to the scheduling module. The GPU node management module is responsible for managing the state of the GPU nodes, including GPU resource usage and GPU node health, and provides GPU node state information to the Kubernetes management module.
Preferably, the scheduling module uses the TCP/IP or RDMA network interconnection technology used by the remote CUDA call server and client, including RDMA-aware sockets (RDS) or InfiniBand networks.
Preferably, the communication between the remote CUDA call server and the client is managed through the Kubernetes service discovery mechanism, the Kubernetes load balancing mechanism, Kubernetes network policies, and Kubernetes security mechanisms.
Preferably, the container Pod of the Kubernetes management module uses a shared GPU.
Compared with the prior art, the invention has the beneficial effects that:
according to the method and the device for realizing remote dispatching of Kubernetes containers, virtual GPU equipment is created on the nodes without physical GPUs through a virtualization technology, and the virtual GPU equipment is registered on the Kubernetes nodes, so that the Kubernetes can dispatch the containers needing GPU resources on the nodes without the physical GPUs, and bind the containers to the virtual GPU equipment, thereby realizing sharing of GPU resources, improving the utilization rate of the GPUs and reducing the hardware and software cost.
Drawings
FIG. 1 is a schematic diagram of the method of the present invention;
FIG. 2 is a flow chart of the method of the present invention.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present invention more apparent, the embodiments of the present invention will be further described in detail with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are some, but not all, of the embodiments of the present invention; they are intended to be illustrative only and not limiting, and all other embodiments obtained by persons of ordinary skill in the art without inventive effort fall within the scope of the present invention.
Example 1
Referring to fig. 1 to 2, the present invention provides a technical solution: a method for implementing remote scheduling of kubernetes containers using a GPU, the method comprising the steps of:
deploying a Kubernetes cluster, and labeling GPU nodes and non-GPU nodes using "kubectl label node"; in the Kubernetes cluster, the GPU nodes and non-GPU nodes are labeled "GPU" and "non-GPU", respectively;
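As a hedged illustration of this labeling step, the node objects could end up carrying labels along the following lines; the label key `node-type`, its values, and the node names are assumptions for this sketch, not values mandated by the invention:

```yaml
# Illustrative result of "kubectl label node worker-gpu-1 node-type=gpu" and
# "kubectl label node worker-cpu-1 node-type=non-gpu" (key/values assumed).
apiVersion: v1
kind: Node
metadata:
  name: worker-gpu-1
  labels:
    node-type: gpu       # node with a physical GPU
---
apiVersion: v1
kind: Node
metadata:
  name: worker-cpu-1
  labels:
    node-type: non-gpu   # node without a physical GPU
```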
deploying the official CUDA runtime API on the non-GPU nodes;
deploying the official GPU driver API, i.e., the GPU driver, on the GPU nodes;
deploying a remote-CUDA-call client service component on the non-GPU nodes; the client service component is implemented with RPC technology; by hijacking the CUDA API, it intercepts the access of the CUDA application in the Pod to the GPU and forwards that access over a TCP/IP or RDMA network to a node with GPU resources on which the remote CUDA call server is deployed;
deploying a remote-CUDA-call server-side service component on the GPU nodes; the server-side service component receives the CUDA call request sent by the client, forwards the request to the GPU device for execution, and returns the result to the client;
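The client–server round trip described in these two steps can be sketched as a minimal RPC loop. This sketch uses Python's standard `xmlrpc` machinery over plain TCP/IP and simulates the GPU-side execution with an ordinary CPU function; the function name `cuda_vector_add` and the overall shape are illustrative assumptions, not the actual protocol of the invention.

```python
import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

def cuda_vector_add(a, b):
    # Stand-in for a kernel that, on a real GPU node, would be launched
    # on the GPU device; here it runs on the CPU for illustration.
    return [x + y for x, y in zip(a, b)]

# Server side (GPU node): receives forwarded CUDA calls and executes them.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(cuda_vector_add, "cuda_vector_add")
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side (non-GPU node): the hijacked CUDA API call is forwarded
# over TCP/IP, and the result is returned to the calling container.
gpu_node = ServerProxy(f"http://127.0.0.1:{port}")
result = gpu_node.cuda_vector_add([1, 2, 3], [4, 5, 6])
print(result)  # [5, 7, 9]
server.shutdown()
```

In a real deployment the interception would happen below the CUDA runtime API (e.g., via a shim library) and the transport could be RDMA rather than TCP/IP, as the text notes.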
creating a deployment that uses GPU resources in Kubernetes, configuring the scheduling parameters, and creating the Pod on nodes without GPU resources;
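A deployment of this kind might look roughly as follows; the extended resource name `example.com/vgpu`, the label key/values, and the image name are assumptions for this sketch (the invention does not specify them), with the node selector steering the Pod onto a node without a physical GPU:

```yaml
# Hedged sketch: a Deployment that requests a virtualized GPU resource while
# being scheduled onto a non-GPU node. Resource and label names are assumed.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cuda-workload
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cuda-workload
  template:
    metadata:
      labels:
        app: cuda-workload
    spec:
      nodeSelector:
        node-type: non-gpu        # schedule onto a node without a physical GPU
      containers:
      - name: app
        image: example/cuda-app:latest
        resources:
          limits:
            example.com/vgpu: 1   # virtual GPU device registered on the node
```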
observing the running state of the Pod on the Kubernetes platform, entering the Pod to observe that the service runs normally, and confirming that GPU resources are called normally: after the GPU container starts, the invoked CUDA program and its API calls are hijacked and redirected to run on the remote CUDA client, executed on a node with a GPU through communication between the remote CUDA client and server, and the result is returned to the GPU container.
Example 2
On the basis of Example 1, a device for remotely scheduling and using GPUs with Kubernetes containers comprises a Kubernetes management module, a scheduling module, and a GPU node management module;
the Kubernetes management module is responsible for the creation, deletion, and scheduling of Pods; when it detects that no GPU node is available, it sends a scheduling request to the scheduling module. The GPU node management module is responsible for managing the state of the GPU nodes, including GPU resource usage and GPU node health, and provides GPU node state information to the Kubernetes management module;
the scheduling module uses the TCP/IP or RDMA network interconnection technology used by the remote CUDA call server and client, including RDMA-aware sockets (RDS) or InfiniBand networks; the communication between the remote CUDA call server and the client is managed through the Kubernetes service discovery mechanism, the Kubernetes load balancing mechanism, Kubernetes network policies, and Kubernetes security mechanisms; and the container Pod of the Kubernetes management module uses a shared GPU.
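The Kubernetes service discovery mentioned here could, as a hedged sketch, expose the remote CUDA call server behind a stable DNS name so that client shims on non-GPU nodes can reach it; the service name, selector label, and port below are illustrative assumptions:

```yaml
# Sketch: exposing the remote CUDA call server via Kubernetes service
# discovery and load balancing. All names and the port are assumed.
apiVersion: v1
kind: Service
metadata:
  name: remote-cuda-server
spec:
  selector:
    app: remote-cuda-server   # matches Pods running the server component
  ports:
  - port: 9999                # RPC port the client shim connects to (assumed)
    targetPort: 9999
```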
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
Claims (9)
1. A GPU method for realizing remote scheduling of Kubernetes containers, characterized in that the GPU remote scheduling method comprises the following steps:
deploying a Kubernetes cluster, and labeling GPU nodes and non-GPU nodes by using 'kubectl label node';
deploying the official CUDA runtime API on the non-GPU nodes;
deploying the official GPU driver API, i.e., the GPU driver, on the GPU nodes;
deploying a remote-CUDA-call client service component on the non-GPU nodes;
deploying a remote-CUDA-call server-side service component on the GPU nodes;
creating a deployment that uses GPU resources in Kubernetes, configuring the scheduling parameters, and creating the Pod on nodes without GPU resources;
and observing the running state of the Pod on the Kubernetes platform, entering the Pod to observe that the service runs normally, and confirming that the GPU resources are called normally.
2. The GPU method for realizing remote scheduling of Kubernetes containers according to claim 1, characterized in that: when labeling the GPU nodes and non-GPU nodes, the GPU nodes and the non-GPU nodes are labeled "GPU" and "non-GPU", respectively, in the Kubernetes cluster.
3. The GPU method for realizing remote scheduling of Kubernetes containers according to claim 1, characterized in that: the client service component is implemented with RPC technology; by hijacking the CUDA API, it intercepts the access of the CUDA application in the Pod to the GPU and forwards that access over a TCP/IP or RDMA network to a node with GPU resources on which the remote CUDA call server is deployed.
4. The GPU method for realizing remote scheduling of Kubernetes containers according to claim 1, characterized in that: the server-side service component receives the CUDA call request sent by the client, forwards the request to the GPU device for execution, and returns the result to the client.
5. The GPU method for realizing remote scheduling of Kubernetes containers according to claim 1, characterized in that: when the running state of the Pod is observed on the Kubernetes platform: after the GPU container starts, the invoked CUDA program and its API calls are hijacked and redirected to run on the remote CUDA client, executed on a node with a GPU through communication between the remote CUDA client and server, and the result is returned to the GPU container.
6. A Kubernetes container remote scheduling GPU apparatus for implementing the GPU method for realizing remote scheduling of Kubernetes containers according to any one of claims 1-5, characterized in that: the remote scheduling GPU apparatus comprises a Kubernetes management module, a scheduling module, and a GPU node management module;
the Kubernetes management module is responsible for the creation, deletion, and scheduling of Pods; when it detects that no GPU node is available, it sends a scheduling request to the scheduling module. The GPU node management module is responsible for managing the state of the GPU nodes, including GPU resource usage and GPU node health, and provides GPU node state information to the Kubernetes management module.
7. The Kubernetes container remote scheduling GPU apparatus according to claim 6, characterized in that: the scheduling module uses the TCP/IP or RDMA network interconnection technology used by the remote CUDA call server and client, including RDMA-aware sockets (RDS) or InfiniBand networks.
8. The Kubernetes container remote scheduling GPU apparatus according to claim 7, characterized in that: the communication between the remote CUDA call server and the client is managed through the Kubernetes service discovery mechanism, the Kubernetes load balancing mechanism, Kubernetes network policies, and Kubernetes security mechanisms.
9. The Kubernetes container remote scheduling GPU apparatus according to claim 6, characterized in that: the container Pod of the Kubernetes management module uses a shared GPU.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310443295.9A CN116680035A (en) | 2023-04-24 | 2023-04-24 | GPU (graphics processing unit) method and device for realizing remote scheduling of kubernetes container |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310443295.9A CN116680035A (en) | 2023-04-24 | 2023-04-24 | GPU (graphics processing unit) method and device for realizing remote scheduling of kubernetes container |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116680035A (en) | 2023-09-01 |
Family
ID=87782603
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310443295.9A Pending CN116680035A (en) | 2023-04-24 | 2023-04-24 | GPU (graphics processing unit) method and device for realizing remote scheduling of kubernetes container |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116680035A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118041704A (en) * | 2024-04-12 | 2024-05-14 | 清华大学 | Kubernetes container access method, device, computing equipment and storage medium |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |