CN116680035A - GPU (graphics processing unit) method and device for realizing remote scheduling of kubernetes container - Google Patents

GPU (graphics processing unit) method and device for realizing remote scheduling of kubernetes container

Info

Publication number
CN116680035A
CN116680035A (application number CN202310443295.9A)
Authority
CN
China
Prior art keywords
gpu
kubernetes
node
remote
cuda
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310443295.9A
Other languages
Chinese (zh)
Inventor
薛少宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Communication Technology Co Ltd
Original Assignee
Inspur Communication Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Communication Technology Co Ltd filed Critical Inspur Communication Technology Co Ltd
Priority to CN202310443295.9A
Publication of CN116680035A
Legal status: Pending

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/133Protocols for remote procedure calls [RPC]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061Partitioning or combining of resources
    • G06F9/5077Logical partitioning of resources; Management or configuration of virtualized resources
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • G06F2009/45575Starting, stopping, suspending or resuming virtual machine instances
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • G06F2009/45595Network integration; Enabling network access in virtual machine instances
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention relates to the technical field of RPC communication, in particular to a method and a device for realizing remote scheduling of Kubernetes containers that use a GPU, comprising the following steps: deploying a Kubernetes cluster, and labeling GPU nodes and non-GPU nodes by using 'kubectl label node'; deploying the official CUDA Runtime API on the non-GPU node; deploying the official GPU driver API, namely the GPU driver, on the GPU node; and deploying a client service component for remote CUDA calls on the non-GPU node. The beneficial effects are as follows: according to the method and the device for realizing remote scheduling of Kubernetes containers, virtual GPU devices are created on nodes without physical GPUs through virtualization technology and are registered with the Kubernetes nodes, so that Kubernetes can schedule containers that need GPU resources onto nodes without physical GPUs and bind those containers to the virtual GPU devices, thereby sharing GPU resources, improving GPU utilization and reducing hardware and software costs.

Description

GPU (graphics processing unit) method and device for realizing remote scheduling of kubernetes container
Technical Field
The invention relates to the technical field of RPC (remote procedure call) communication, in particular to a method and a device for realizing remote scheduling of kubernetes containers that use a GPU (graphics processing unit).
Background
Kubernetes, abbreviated as K8s, is an open-source system for managing containerized applications across multiple hosts in a cloud platform. The goal of Kubernetes is to make deploying containerized applications simple and efficient; to this end, Kubernetes provides mechanisms for application deployment, planning, updating and maintenance.
In the prior art, on one hand, a Kubernetes user can only use the GPU device resources on the local node, yet the number of GPU nodes is limited and deploying and managing GPU nodes requires a certain cost and expertise, which limits the flexibility and portability of containers; on the other hand, the utilization rate of GPU resources is not high, and some GPU resources on a node may sit idle, so many GPU resources go to waste.
Disclosure of Invention
The invention aims to provide a method and a device for realizing remote scheduling of kubernetes containers that use a GPU (graphics processing unit), so as to solve the problems in the prior art.
In order to achieve the above purpose, the present invention provides the following technical solutions: a method for implementing remote scheduling of kubernetes containers using a GPU, the method comprising the steps of:
deploying a Kubernetes cluster, and labeling GPU nodes and non-GPU nodes by using 'kubectl label node';
deploying the official CUDA Runtime API on the non-GPU node;
deploying the official GPU driver API, namely the GPU driver, on the GPU node;
deploying a client service component for remote CUDA calls on the non-GPU node;
deploying a server-side service component for remote CUDA calls on the GPU node;
creating a Deployment that uses GPU resources in Kubernetes, configuring scheduling parameters, and creating the Pod on a node without GPU resources;
and observing the running state of the Pod on the Kubernetes platform, entering the Pod to observe that the service runs normally, and confirming that GPU resources are called normally.
Preferably, when labeling GPU nodes and non-GPU nodes, the GPU nodes and the non-GPU nodes are labeled as "GPU" and "non-GPU" respectively in the Kubernetes cluster.
Preferably, the client service component is implemented with RPC technology: it intercepts the access of the CUDA application in the Pod to the GPU by hijacking the CUDA API, and forwards the access, through a TCP/IP network or an RDMA network, to a node with GPU resources on which the remote CUDA call server is deployed.
Preferably, the server service component receives the CUDA call request sent by the client, forwards the request to the GPU device for execution, and returns the result to the client.
Preferably, when the running state of the Pod is observed on the Kubernetes platform, after the GPU container is started the invoked CUDA program and API calls are hijacked and redirected to the remote CUDA client, executed on the node with a GPU through communication between the remote CUDA client and server, and the result is returned to the GPU container.
A device for realizing remote scheduling of a GPU by a kubernetes container comprises a kubernetes management module, a scheduling module and a GPU node management module;
the kubernetes management module is responsible for managing the creation, deletion and scheduling operation of the pod, and when the kubernetes management module detects that no GPU node is available, the kubernetes management module sends a scheduling request to the scheduling module; the GPU node management module is responsible for managing states of the GPU nodes, including GPU resource use conditions and GPU node health states, and provides GPU node state information for the kubernetes management module.
Preferably, the scheduling module uses the TCP/IP or RDMA network interconnection technology used by the remote CUDA call server and the client, including RDMA-aware sockets (RDS) or InfiniBand networks.
Preferably, the communication between the remote CUDA call server and the client is managed through the service discovery mechanism of Kubernetes, the load balancing mechanism of Kubernetes, the network policies of Kubernetes and the security mechanisms of Kubernetes.
Preferably, the container POD of the kubernetes management module uses a shared GPU.
Compared with the prior art, the invention has the beneficial effects that:
according to the method and the device for realizing remote dispatching of Kubernetes containers, virtual GPU equipment is created on the nodes without physical GPUs through a virtualization technology, and the virtual GPU equipment is registered on the Kubernetes nodes, so that the Kubernetes can dispatch the containers needing GPU resources on the nodes without the physical GPUs, and bind the containers to the virtual GPU equipment, thereby realizing sharing of GPU resources, improving the utilization rate of the GPUs and reducing the hardware and software cost.
Drawings
FIG. 1 is a schematic diagram of the method of the present invention;
FIG. 2 is a flow chart of the method of the present invention.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present invention more apparent, the embodiments of the present invention will be further described in detail with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are some, but not all, embodiments of the present invention, are intended to be illustrative only and not limiting of the embodiments of the present invention, and that all other embodiments obtained by persons of ordinary skill in the art without making any inventive effort are within the scope of the present invention.
Example 1
Referring to fig. 1 to 2, the present invention provides a technical solution: a method for implementing remote scheduling of kubernetes containers using a GPU, the method comprising the steps of:
deploying a Kubernetes cluster, and labeling GPU nodes and non-GPU nodes by using 'kubectl label node'. In the Kubernetes cluster, GPU nodes and non-GPU nodes are labeled "GPU" and "non-GPU", respectively.
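The patent only names the 'kubectl label node' command for this step. Purely as an illustrative sketch, the same labels can also be applied with the official Kubernetes Python client; the node names "node-gpu-01" and "node-cpu-01" and the label key "gpu" are assumed examples, not values taken from the patent:

    from kubernetes import client, config

    # Load the kubeconfig of the freshly deployed cluster (assumes kubectl access).
    config.load_kube_config()
    core_v1 = client.CoreV1Api()

    # Mark an assumed GPU node and an assumed non-GPU node, mirroring
    # "kubectl label node node-gpu-01 gpu=GPU" and "kubectl label node node-cpu-01 gpu=non-GPU".
    core_v1.patch_node("node-gpu-01", {"metadata": {"labels": {"gpu": "GPU"}}})
    core_v1.patch_node("node-cpu-01", {"metadata": {"labels": {"gpu": "non-GPU"}}})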
deploying the official CUDA Runtime API on the non-GPU node.
deploying the official GPU driver API, namely the GPU driver, on the GPU node.
deploying the client service component for remote CUDA calls on the non-GPU node. The client service component is implemented with RPC technology: it intercepts the access of the CUDA application in the Pod to the GPU by hijacking the CUDA API, and forwards the access, through a TCP/IP network or an RDMA network, to a node with GPU resources on which the remote CUDA call server is deployed.
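The patent does not give source code for this client component. The sketch below, using only Python's standard xmlrpc library, is a rough illustration of the forwarding half of the idea under assumed names: the endpoint "remote-cuda-server" and the remote procedure "execute_cuda_call" are assumptions, real interception of the CUDA Runtime API would normally be done by a preloaded native library, and the transport could equally be RDMA instead of TCP/IP:

    import xmlrpc.client

    # Assumed address of the remote CUDA call server running on a GPU node.
    REMOTE_CUDA_ENDPOINT = "http://remote-cuda-server:8000/"

    class RemoteCudaClient:
        """Forwards intercepted CUDA API calls from the Pod to a GPU node."""

        def __init__(self, endpoint=REMOTE_CUDA_ENDPOINT):
            self._proxy = xmlrpc.client.ServerProxy(endpoint, allow_none=True)

        def forward(self, api_name, args):
            # Ship the call name and serialized arguments over the network and
            # hand the server's result back to the CUDA application in the Pod.
            return self._proxy.execute_cuda_call(api_name, args)

    # Example: forwarding a hijacked cudaMalloc request for 1 MiB of device memory.
    cuda_client = RemoteCudaClient()
    result = cuda_client.forward("cudaMalloc", {"size": 1024 * 1024})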
deploying the server-side service component for remote CUDA calls on the GPU node. The server-side service component receives the CUDA call request sent by the client, forwards the request to the GPU device for execution, and returns the result to the client.
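A matching sketch of the server side, again only an assumed illustration: it listens for forwarded requests and would hand them to the GPU driver and CUDA libraries installed on the GPU node; the handler body below is a placeholder, not a real CUDA invocation:

    from xmlrpc.server import SimpleXMLRPCServer

    def execute_cuda_call(api_name, args):
        """Receive a forwarded CUDA call, run it on the local GPU, return the result.

        Placeholder dispatch: a real server component would call into the GPU
        driver / CUDA Runtime on this node for each supported API.
        """
        if api_name == "cudaMalloc":
            return {"status": 0, "device_ptr": "0x0"}  # assumed response shape
        return {"status": -1, "error": "unsupported API: " + api_name}

    # Listen on the GPU node for requests coming from remote CUDA clients.
    server = SimpleXMLRPCServer(("0.0.0.0", 8000), allow_none=True)
    server.register_function(execute_cuda_call)
    server.serve_forever()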
creating a Deployment that uses GPU resources in Kubernetes, configuring scheduling parameters, and creating the Pod on a node without GPU resources.
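As an illustrative sketch only (the image name, namespace and label values are assumptions), such a Deployment can be created with the Kubernetes Python client; the nodeSelector pins the Pod to the assumed "gpu=non-GPU" label, i.e. to a node without a physical GPU where the remote CUDA client is installed:

    from kubernetes import client, config

    config.load_kube_config()
    apps_v1 = client.AppsV1Api()

    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "cuda-workload"},
        "spec": {
            "replicas": 1,
            "selector": {"matchLabels": {"app": "cuda-workload"}},
            "template": {
                "metadata": {"labels": {"app": "cuda-workload"}},
                "spec": {
                    # Scheduling parameter: place the Pod on a non-GPU node.
                    "nodeSelector": {"gpu": "non-GPU"},
                    "containers": [{
                        "name": "cuda-app",
                        "image": "example/cuda-app:latest",  # assumed image
                    }],
                },
            },
        },
    }
    apps_v1.create_namespaced_deployment(namespace="default", body=deployment)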
observing the running state of the Pod on the Kubernetes platform, entering the Pod to observe that the service runs normally, and confirming that GPU resources are called normally. After the GPU container is started, the invoked CUDA program and API calls are hijacked and redirected to the remote CUDA client, executed on the node with a GPU through communication between the remote CUDA client and server, and the result is returned to the GPU container.
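A small, assumed sketch of this observation step with the Kubernetes Python client, roughly equivalent to running 'kubectl get pods' and then entering the Pod with 'kubectl exec' to check the service:

    from kubernetes import client, config

    config.load_kube_config()
    core_v1 = client.CoreV1Api()

    # List the Pods of the assumed Deployment and report where they run; a phase
    # of "Running" on a non-GPU node indicates the remote CUDA path is in use.
    pods = core_v1.list_namespaced_pod("default", label_selector="app=cuda-workload")
    for pod in pods.items:
        print(pod.metadata.name, pod.status.phase, pod.spec.node_name)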
Example two
On the basis of the first embodiment, a device for realizing remote scheduling of a GPU by a kubernetes container comprises a kubernetes management module, a scheduling module and a GPU node management module;
the kubernetes management module is responsible for managing the creation, deletion and scheduling of Pods, and when it detects that no GPU node is available, it sends a scheduling request to the scheduling module; the GPU node management module is responsible for managing the states of the GPU nodes, including GPU resource usage and GPU node health, and provides GPU node state information to the kubernetes management module;
the scheduling module uses the TCP/IP or RDMA network interconnection technology used by the remote CUDA call server and the client, including RDMA-aware sockets (RDS) or InfiniBand networks; the communication between the remote CUDA call server and the client is managed through the service discovery mechanism, load balancing mechanism, network policies and security mechanisms of Kubernetes; and the container Pods of the kubernetes management module use a shared GPU.
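As a hedged illustration of the Kubernetes service discovery and load balancing mentioned above, the remote CUDA call server could be exposed through a ClusterIP Service (the name, label selector and port are assumptions); clients on non-GPU nodes then reach the server by its DNS name while Kubernetes spreads requests across the GPU nodes backing the Service:

    from kubernetes import client, config

    config.load_kube_config()
    core_v1 = client.CoreV1Api()

    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": "remote-cuda-server"},
        "spec": {
            # Assumed label carried by the server-side Pods on GPU nodes.
            "selector": {"app": "remote-cuda-server"},
            "ports": [{"port": 8000, "targetPort": 8000}],
        },
    }
    core_v1.create_namespaced_service(namespace="default", body=service)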
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (9)

1. A GPU method for realizing remote scheduling of kubernetes containers is characterized in that: the GPU remote scheduling method comprises the following steps of:
deploying a Kubernetes cluster, and labeling GPU nodes and non-GPU nodes by using 'kubectl label node';
deploying the official CUDA Runtime API on the non-GPU node;
deploying the official GPU driver API, namely the GPU driver, on the GPU node;
deploying a client service component for remote CUDA calls on the non-GPU node;
deploying a server-side service component for remote CUDA calls on the GPU node;
creating a Deployment that uses GPU resources in Kubernetes, configuring scheduling parameters, and creating the Pod on a node without GPU resources;
and observing the running state of the Pod on the Kubernetes platform, entering the Pod to observe that the service runs normally, and confirming that GPU resources are called normally.
2. The method for implementing kubernetes container remote scheduling according to claim 1, wherein the GPU method is characterized by: when labeling GPU nodes and non-GPU nodes, the GPU nodes and the non-GPU nodes are labeled as "GPU" and "non-GPU" respectively in the Kubernetes cluster.
3. The method for implementing kubernetes container remote scheduling according to claim 1, wherein the GPU method is characterized by: the client service component is implemented with RPC technology: it intercepts the access of the CUDA application in the Pod to the GPU by hijacking the CUDA API, and forwards the access, through a TCP/IP network or an RDMA network, to a node with GPU resources on which the remote CUDA call server is deployed.
4. The method for implementing kubernetes container remote scheduling according to claim 1, wherein the GPU method is characterized by: the server side service component receives the CUDA call request sent by the client side, forwards the request to the GPU equipment for execution, and returns the result to the client side.
5. The method for implementing kubernetes container remote scheduling according to claim 1, wherein the GPU method is characterized by: when the running state of the Pod is observed on the Kubernetes platform, after the GPU container is started the invoked CUDA program and API calls are hijacked and redirected to the remote CUDA client, executed on the node with a GPU through communication between the remote CUDA client and server, and the result is returned to the GPU container.
6. A kubernetes container remote scheduling GPU apparatus for implementing a kubernetes container remote scheduling GPU method according to any of claims 1-5, wherein: the remote dispatching GPU device comprises a kubernetes management module, a dispatching module and a GPU node management module;
the kubernetes management module is responsible for managing the creation, deletion and scheduling operation of the pod, and when the kubernetes management module detects that no GPU node is available, the kubernetes management module sends a scheduling request to the scheduling module; the GPU node management module is responsible for managing states of the GPU nodes, including GPU resource use conditions and GPU node health states, and provides GPU node state information for the kubernetes management module.
7. The GPU apparatus of claim 6, wherein the GPU apparatus is configured to implement kubernetes container remote scheduling: the scheduling module utilizes the TCP/IP or RDMA network interconnection technology used by the remote CUDA call server and the client, including RDMA-aware sockets (RDS) or InfiniBand networks.
8. The GPU apparatus of claim 7, wherein the GPU apparatus is configured to implement kubernetes container remote scheduling: the communication between the remote CUDA call server and the client is managed through the service discovery mechanism of Kubernetes, the load balancing mechanism of Kubernetes, the network policies of Kubernetes and the security mechanisms of Kubernetes.
9. The GPU apparatus of claim 6, wherein the GPU apparatus is configured to implement kubernetes container remote scheduling: the container POD of the kubernetes management module uses a shared GPU.
CN202310443295.9A 2023-04-24 2023-04-24 GPU (graphics processing unit) method and device for realizing remote scheduling of kubernetes container Pending CN116680035A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310443295.9A CN116680035A (en) 2023-04-24 2023-04-24 GPU (graphics processing unit) method and device for realizing remote scheduling of kubernetes container

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310443295.9A CN116680035A (en) 2023-04-24 2023-04-24 GPU (graphics processing unit) method and device for realizing remote scheduling of kubernetes container

Publications (1)

Publication Number Publication Date
CN116680035A true CN116680035A (en) 2023-09-01

Family

ID=87782603

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310443295.9A Pending CN116680035A (en) 2023-04-24 2023-04-24 GPU (graphics processing unit) method and device for realizing remote scheduling of kubernetes container

Country Status (1)

Country Link
CN (1) CN116680035A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118041704A (en) * 2024-04-12 2024-05-14 清华大学 Kubernetes container access method, device, computing equipment and storage medium



Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination