WO2024119823A1 - Method and apparatus for managing GPU computing resources, electronic device, and readable storage medium - Google Patents

Method and apparatus for managing GPU computing resources, electronic device, and readable storage medium

Info

Publication number
WO2024119823A1
WO2024119823A1 (PCT/CN2023/106827, CN2023106827W)
Authority
WO
WIPO (PCT)
Prior art keywords: gpu, pod, vgpu, information, node
Prior art date
Application number
PCT/CN2023/106827
Other languages
English (en)
French (fr)
Inventor
王超
Original Assignee
苏州元脑智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州元脑智能科技有限公司
Publication of WO2024119823A1 publication Critical patent/WO2024119823A1/zh

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5061 Partitioning or combining of resources
    • G06F 9/5066 Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • Embodiments of the present application relate to the field of Internet technology, and in particular, to a method for managing GPU computing resources, a device for managing GPU computing resources, an electronic device, and a non-volatile readable storage medium.
  • GPU: Graphics Processing Unit.
  • For AI developers, GPU-based AI system research institutions, and new or traditional enterprises undergoing digital transformation, the following problems inevitably arise when using GPU computing resources:
  • GPU resource management is difficult. GPUs are more expensive than CPUs (central processing units), and as high-value hardware resources it is difficult to manage them with the integrated operation and maintenance model used for networks and storage. In actual application environments, multiple processes, multiple personnel, and multiple tasks often reuse the same GPU resource; long waits for resources seriously reduce the efficiency of business processes and slow product iteration.
  • AI services need to apply for and release GPU resources based on the usage cycle of the task load and the usage of GPU resources by different tasks during peak/trough periods.
  • the capacity must be automatically scaled up or down according to the number of online requests (Queries Per Second, abbreviated as QPS) to meet the real-time high-concurrency and low-latency requirements of online AI services.
  • the embodiments of the present application provide a method, device, electronic device and non-volatile readable storage medium for managing GPU computing resources, to solve the problems of difficult GPU resource management, low GPU resource utilization, and difficulty in rapidly allocating and reclaiming GPU resources.
  • the embodiment of the present application discloses a method for managing GPU computing resources, which is applied to a GPU sharing system.
  • the GPU sharing system is deployed with a k8s cluster, and the k8s cluster includes a Node node and a Pod service, wherein the Node node includes a GPU, and the GPU computing resources corresponding to the GPU include at least a GPU video memory and a GPU computing core.
  • the method includes:
  • Each vGPU contains part of the GPU memory and part of the GPU computing core.
  • One vGPU corresponds to one Pod service.
  • the GPU in the Node node is divided to obtain multiple vGPUs, including:
  • the GPU memory and GPU computing core of the GPU are allocated to each vGPU according to the preset resource quota, so as to obtain multiple vGPUs including part of the GPU memory and part of the GPU computing core of the GPU.
  • the vGPU information includes at least the number of vGPUs and the vGPU memory size.
  • the k8s cluster also includes a Master node, which includes a hijacking scheduler, collects vGPU information of each vGPU in the Node node, registers each vGPU information, and obtains Pod information of each Pod service corresponding to each vGPU, including:
  • each Pod information is received and saved as multiple files, including:
  • the Pod information includes at least the usage of the GPU video memory and the usage of the GPU computing core in the vGPU.
  • part of the GPU memory and part of the GPU computing core in each vGPU are managed according to each file, including:
  • the process of the Pod service is controlled, including:
  • the Pod service process runs normally.
  • the number of Pod services is scaled up or down based on the usage of GPU memory and GPU computing cores in each vGPU.
  • the GPU is located on the host, the host includes at least a CPU and a memory, the Pod service is bound to the CPU and the memory, and the number of Pod services is expanded or reduced according to the usage of the GPU video memory and the usage of the GPU computing core in each vGPU, including:
  • the number of Pod services is automatically expanded or reduced according to the CPU utilization and the average memory utilization, including:
  • the number of Pod services is automatically reduced to reduce the number of vGPUs corresponding to the Pod service;
  • the number of Pod services is automatically increased to increase the number of vGPUs corresponding to the Pod service.
  • the number of Pod services is scaled up or down according to the usage of GPU memory and GPU computing core in each vGPU, including:
  • the number of Pod services is automatically expanded or reduced according to the real-time service request traffic of the Pod service, including:
  • the number of Pod services is automatically increased to increase the number of vGPUs corresponding to the Pod service;
  • the number of Pod services is automatically reduced to reduce the number of vGPUs corresponding to the Pod services.
  • the Pod service is scheduled to the target GPU.
  • the k8s cluster also includes a Master node, the Master node includes a controller, and the controller is used to create resources corresponding to different types of Pod services.
  • resources include at least Deployment, Service, and Statefulset.
  • Deployment is used to deploy stateless Pod services
  • Service is used to deploy Pod services that can be scaled to zero
  • Statefulset is used to deploy stateful Pod services.
  • the embodiment of the present application further discloses a management device for GPU computing resources, which is used in a GPU sharing system.
  • the GPU sharing system is deployed with a k8s cluster, and the k8s cluster includes a Node node and a Pod service, wherein the Node node includes a GPU, and the GPU computing resources corresponding to the GPU include at least a GPU video memory and a GPU computing core.
  • the device includes:
  • the GPU partitioning module is used to partition the GPU in the Node node to obtain multiple vGPUs.
  • Each vGPU contains part of the GPU memory and part of the GPU computing core.
  • One vGPU corresponds to one Pod service.
  • the Pod information acquisition module is used to collect the vGPU information of each vGPU in the Node node, register each vGPU information, and obtain the Pod information of each Pod service corresponding to each vGPU;
  • a Pod information file generation module is used to receive information of each Pod and save the information of each Pod into multiple files;
  • the resource management module is used to manage part of the GPU memory and part of the GPU computing core in each vGPU according to each file.
  • the GPU partition module is specifically used to:
  • the GPU memory and GPU computing core of the GPU are allocated to each vGPU according to the preset resource quota, so as to obtain multiple vGPUs including part of the GPU memory and part of the GPU computing core of the GPU.
  • the k8s cluster also includes a Master node, the Master node includes a hijacking scheduler, and the Pod information acquisition module is specifically used for:
  • the Pod information file generation module is specifically used to:
  • the resource management module is specifically used to:
  • the embodiment of the present application also discloses an electronic device, including a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other via the communication bus;
  • Memory used to store computer programs
  • the processor is used to implement the method of the embodiment of the present application when executing the program stored in the memory.
  • the embodiment of the present application also discloses a non-volatile readable storage medium having instructions stored thereon, which, when executed by one or more processors, enables the processors to execute the method of the embodiment of the present application.
  • the GPU sharing system is deployed with a k8s cluster, and the k8s cluster includes a Node node and a Pod service, wherein the Node node includes a GPU, and the GPU computing resources corresponding to the GPU include at least a GPU video memory and a GPU computing core.
  • In this way, multiple Pod services can run on the same physical GPU while the GPU computing resources are strictly isolated. The vGPU information of each vGPU in the Node node is then collected, and each vGPU information is registered to obtain the Pod information of each Pod service corresponding to each vGPU; each Pod information is received and saved as multiple files; and part of the GPU video memory and part of the GPU computing core in each vGPU are then managed according to each file. Managing each vGPU's share of GPU video memory and computing cores through the Pod information of each Pod service effectively prevents GPU computing resources from being exceeded.
  • FIG1 is a flowchart of a method for managing GPU computing resources provided in an embodiment of the present application
  • FIG2 is a schematic diagram of the architecture of a GPU sharing system provided in an embodiment of the present application.
  • FIG3 is a schematic diagram of code execution of a configuration file provided in an embodiment of the present application.
  • FIG4 is a schematic diagram of a scaling mode architecture provided in an embodiment of the present application.
  • FIG5 is a schematic diagram of resource allocation of a multi-service shared GPU provided in an embodiment of the present application.
  • FIG6 is a schematic diagram of one of the scheduling modes of multi-service shared resources provided in an embodiment of the present application.
  • FIG7 is a second schematic diagram of a scheduling mode for multi-service shared resources provided in an embodiment of the present application.
  • FIG8 is a structural block diagram of a GPU computing resource management device provided in an embodiment of the present application.
  • FIG9 is a schematic diagram of the structure of a non-volatile readable storage medium provided in an embodiment of the present application.
  • FIG. 10 is a schematic diagram of the hardware structure of an electronic device implementing various embodiments of the present application.
  • Kubernetes (k8s for short) is a portable, extensible open source platform for managing containerized workloads and services that facilitates declarative configuration and automation.
  • Container technology: uses Docker as an open source application container engine to provide flexible application deployment; Kubernetes is an open source project that automates the deployment, scaling, and management of containerized applications and can be used on edge computing platforms to provide reliable and scalable container orchestration.
  • Pod is the smallest unit of Kubernetes scheduling.
  • GPU: Graphics Processing Unit.
  • Model inference service: converts the model produced by AI training into a service that can perform model inference operations.
  • Node: a Kubernetes node. Kubernetes nodes are divided into Master and Node, where Master is the management node and Node is the computing node.
  • CRD: Custom Resource Definition, a mechanism for extending the Kubernetes API.
  • API: Application Programming Interface.
  • Elastic scaling: automatically controls the number of running instances according to the configured scaling rules.
  • CUDA: Compute Unified Device Architecture, NVIDIA's general-purpose parallel computing architecture, comprising the CUDA instruction set architecture (ISA) and the parallel computing engine inside the GPU.
  • For AI developers, GPU-based AI system research institutions, and new or traditional enterprises undergoing digital transformation, the use of GPU computing resources inevitably raises the following problems: difficult GPU resource management, low GPU resource utilization, and difficulty in rapidly allocating and reclaiming GPU resources.
  • the industry has proposed a variety of GPU sharing solutions. Driven by the cloud-native trend, containerized deployment using cloud-native technology and standard Docker has become a common method for cloud services in the industry to use heterogeneous computing resources.
  • the existing GPU sharing solutions are shown in Table 1:
  • one of the core invention points of the present application is to apply it to a GPU sharing system.
  • the GPU sharing system is deployed with a k8s cluster.
  • the k8s cluster includes a Node node and a Pod service, wherein the Node node includes a GPU, and the GPU computing resources corresponding to the GPU include at least a GPU video memory and a GPU computing core.
  • By dividing the GPU in the Node node, multiple vGPUs can be obtained, wherein each vGPU includes part of the GPU video memory and part of the GPU computing core of the GPU, and one vGPU corresponds to one Pod service.
  • In this way, the GPU computing resources can be strictly isolated. The vGPU information of each vGPU in the Node node is then collected and registered to obtain the Pod information of each Pod service corresponding to each vGPU; each Pod information is received and saved as multiple files; and part of the GPU video memory and part of the GPU computing core in each vGPU are then managed according to each file. Managing each vGPU's share of GPU video memory and computing cores through the Pod information of each Pod service effectively prevents GPU computing resources from being exceeded.
  • a flowchart of a method for managing GPU computing resources provided in an embodiment of the present application is shown, which is applied to a GPU sharing system, wherein a k8s cluster is deployed in the GPU sharing system, wherein the k8s cluster includes a Node node and a Pod service, wherein the Node node includes a GPU, and the GPU computing resources corresponding to the GPU include at least a GPU video memory and a GPU computing core, and specifically may include the following steps:
  • Step 101 divide the GPU in the Node node to obtain multiple vGPUs; each vGPU includes part of the GPU memory and part of the GPU computing core of the GPU, and one vGPU corresponds to one Pod service;
  • FIG. 2 there is shown a schematic diagram of the architecture of a GPU sharing system provided in an embodiment of the present application.
  • the method for managing GPU computing resources provided in an embodiment of the present application can be applied to the GPU sharing system shown in Figure 2.
  • a k8s cluster is deployed in the GPU sharing system, and the k8s cluster may include one or more Node nodes and Pod services, wherein each Node node may include one or more GPUs, and the GPU computing resources corresponding to each GPU include at least GPU memory and GPU computing cores.
  • k8s is a portable, scalable open source platform for managing containerized workloads and services, which can promote declarative configuration and automation, and can include multiple physical devices/virtual machines in the k8s cluster.
  • the k8s cluster can include one or more Node nodes and Pod services, wherein each Node node can include one or more GPUs; wherein the Node node is a computing node in k8s, which can be responsible for running related containers in the cluster and managing the data transmitted by the containers.
  • Pod is the smallest unit of Kubernetes scheduling, which can represent a single process instance running in the Kubernetes cluster.
  • a Pod can have multiple containers, and a container can contain an AI service; a Pod can therefore combine the AI services in multiple containers into one larger AI service.
  • If a Pod has one container, the container mounts one vGPU, and the Pod uses that vGPU, then one vGPU corresponds to one Pod service.
  • For ease of illustration this example is kept simple; in actual applications the use of Pods may be more complicated and may vary with the application scenario.
  • vGPU: virtual graphics processing unit.
  • A vGPU is obtained by dividing the GPU in the Node node: a whole GPU card is virtualized into multiple vGPUs through fine-grained partitioning of the whole card.
  • As shown in FIG. 2, the GPU sharing system contains Node nodes, each Node node contains multiple GPUs, each GPU is divided into multiple vGPUs, and the vGPUs together form a vGPU pool.
  • The GPU is located on the Node node.
  • the GPU is a microprocessor that is specially used to perform image and graphics related operations on personal computers, workstations, game consoles and some mobile devices.
  • the GPU includes GPU computing resources, and the GPU computing resources can include GPU video memory and GPU computing core.
  • the GPU video memory can be understood as a space, similar to memory.
  • the GPU video memory is used to store models, data, etc.
  • The larger the GPU video memory, the larger the network that can be run; in large-scale training, GPU video memory becomes even more important.
  • The GPU computing core is used to perform the GPU's graphics operations, general-purpose computations, and so on.
  • multiple vGPUs can be obtained by dividing the GPU in the Node node. Specifically, in the process of division, according to the preset resource quota, part of the GPU memory and part of the GPU computing core of the GPU are respectively allocated to multiple vGPUs, thereby obtaining multiple vGPUs including part of the GPU memory and part of the GPU computing core of the GPU, wherein one vGPU can correspond to one Pod service, and the GPU computing resources occupied by multiple Pod services running on the same GPU card are independently divided.
  • the GPU computing resources can be strictly isolated.
  • For the preset resource quota, the user can set the GPU memory size and the GPU computing cores required by a vGPU when creating a Pod service or application, so that part of the GPU memory and part of the GPU computing cores are allocated from the GPU to multiple vGPUs according to the preset resource quota.
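  • As a minimal sketch of such a quota, the hypothetical Pod manifest below requests a fixed vGPU share; the extended-resource names (example.com/vgpu-memory, example.com/vgpu-cores) are placeholders for illustration, not names defined by this application.

```yaml
# Hypothetical Pod spec requesting a fixed vGPU quota; the extended-resource
# names below are placeholders, not names defined by this application.
apiVersion: v1
kind: Pod
metadata:
  name: ai-inference-demo
spec:
  containers:
    - name: inference
      image: inference-service:latest   # assumed image name
      resources:
        limits:
          example.com/vgpu-memory: "10"   # e.g. 10 GiB of the card's GPU video memory
          example.com/vgpu-cores: "20"    # e.g. 20% of the card's computing cores
```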
  • Step 102 collect vGPU information of each vGPU in the Node node, register each vGPU information, and obtain Pod information of each Pod service corresponding to each vGPU;
  • the vGPU information may include the number of vGPUs and the size of vGPU video memory of the vGPU; for the Pod information, it may include the usage of part of the GPU video memory of the GPU included in the vGPU and the usage of part of the GPU computing core; wherein the usage may be the usage of the GPU video memory or computing core by the Pod service.
  • the usage may be that the GPU video memory required for the Pod service exceeds the preset resource quota, or that the GPU video memory required for the Pod service is within the range of the preset resource quota; for the preset resource quota, it may be the resource quota of the GPU video memory and the resource quota of the GPU computing core set according to the preset configuration file.
  • the vGPU quantity and vGPU memory size of each vGPU in the Node node are collected, and the vGPU quantity and vGPU memory size of each vGPU are registered to obtain the Pod information of each Pod service corresponding to each vGPU, that is, the usage of part of the GPU memory of the GPU contained in each vGPU and the usage of part of the GPU computing core are obtained.
  • Step 103 receiving each Pod information, and saving each Pod information into multiple files;
  • the file may be a file containing the usage of part of the GPU video memory and the usage of part of the GPU computing core of the GPU included in each vGPU.
  • the vGPU quantity and vGPU memory size of each vGPU in the Node node are collected, and the vGPU quantity and vGPU memory size of each vGPU are registered, and the usage of part of the GPU memory and part of the GPU computing core of each Pod service corresponding to each vGPU is obtained, the usage of part of the GPU memory and part of the GPU computing core of the GPU by each Pod service is received, and the data is saved as a file.
  • Step 104 manage part of the GPU memory and part of the GPU computing core in each vGPU according to each file.
  • the GPU sharing system is deployed with a k8s cluster, and the k8s cluster includes a Node node and a Pod service, wherein the Node node includes a GPU, and the GPU computing resources corresponding to the GPU include at least a GPU video memory and a GPU computing core.
  • Multiple Pod services are thereby enabled to run on the same physical GPU while the GPU computing resources are strictly isolated. The vGPU information of each vGPU in the Node node is then collected and registered to obtain the Pod information of each Pod service corresponding to each vGPU; each Pod information is received and saved as multiple files; and part of the GPU video memory and part of the GPU computing core in each vGPU are then managed according to each file, which effectively prevents GPU computing resources from being exceeded.
  • In some embodiments, the k8s cluster also includes a Master node, and the Master node includes a hijacking scheduler. Step 102, collecting the vGPU information of each vGPU in the Node node and registering each vGPU information to obtain the Pod information of each Pod service corresponding to each vGPU, includes:
  • the Master node is the management node in the k8s cluster. It can be a node deployed on the central server of the cluster and is responsible for associating other nodes, such as managing the Node nodes.
  • the hijacking scheduler may be a GPUSharing Scheduler, which may be used to count, manage and schedule multiple Pod services that share the GPU computing resources of the same GPU card. It may restrict the use of GPU computing resources at the software layer by hijacking the usage of GPU video memory and GPU computing core in real time. Specifically, the real-time resource usage and status of the Pod service may be collected by the hijacking scheduler, and the service may be monitored strictly according to the pre-allocated resource size. If the resource quota is exceeded, the process of the Pod service that exceeds the maximum preset value of the resource may be controlled, and the process may be in an interrupted state at this time.
  • vGPU information of each vGPU in the Node node is collected, and each vGPU information is sent to the hijacking scheduler in the Master node, and each vGPU information is registered to obtain the Pod information of each Pod service corresponding to each vGPU.
  • the k8s cluster also includes a Master node, which includes a GPUSharing Scheduler.
  • Each Node node is responsible for collecting all vGPU information of each Node node, and sending all vGPU information to the GPUSharing Scheduler for information registration, so that the Pod information of each Pod service corresponding to each vGPU can be obtained.
  • step 103 receiving each Pod information, and saving each Pod information into multiple files, includes:
  • the file may be a file containing the usage of part of the GPU video memory and the usage of part of the GPU computing core of the GPU included in each vGPU.
  • the vGPU information of each vGPU in the Node node is collected, and the information of each vGPU is sent to the hijacking scheduler in the Master node, and the information of each vGPU is registered to obtain the Pod information of each Pod service corresponding to each vGPU, that is, the usage of part of the GPU video memory and part of the GPU computing core of the GPU by each Pod service corresponding to each vGPU is obtained through registration with the hijacking scheduler, the usage of part of the GPU video memory and part of the GPU computing core of the GPU by each Pod service returned by the hijacking scheduler is received, and the data is saved as a file. By saving the data as a file, convenience is provided for further resource management.
  • step 104 managing part of the GPU memory and part of the GPU computing core in each vGPU according to each file, includes:
  • the Pod information may include the usage of part of the GPU memory and part of the GPU computing cores of the GPU included in the vGPU; the usage may be the Pod service's consumption of GPU memory or computing cores. For example, the usage may be that the GPU memory consumed by the Pod service exceeds the preset resource quota, or that the GPU memory the Pod service needs to consume is within the preset resource quota.
  • If the usage of the GPU video memory and the GPU computing core corresponding to the vGPU in the file exceeds the preset resource quota, the GPU video memory and the GPU computing core in the vGPU are controlled so as to terminate the process of the Pod service; if the usage satisfies the preset resource quota, the process of the Pod service runs normally.
  • the usage of the GPU video memory and the GPU computing core corresponding to the vGPU in the Pod information are saved as a file, and the process of the Pod service is controlled according to the usage of the GPU video memory and the GPU computing core corresponding to the vGPU in the file.
  • the usage of the GPU video memory and the GPU computing core of the vGPU corresponding to the Pod service can be collected by hijacking the scheduler, and the service can be monitored strictly in accordance with the preset resource quota to control the process of the Pod service.
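  • A minimal sketch of one such per-Pod usage file follows. The application saves Pod information as files without fixing a schema, so every field name here is an assumption for illustration.

```yaml
# Assumed layout of a per-Pod usage file written by the hijacking scheduler;
# the application does not specify a schema, so all field names are illustrative.
pod: ai-inference-demo
vgpu: vgpu-a-01
gpuMemory:
  quotaGiB: 10       # preset resource quota for video memory
  usedGiB: 8.2       # current usage reported for this Pod service
gpuCores:
  quotaPercent: 20   # preset resource quota for computing cores
  usedPercent: 17
status: within-quota # the Pod's process is terminated if either usage exceeds its quota
```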
  • The space-division scheduling mode can also be used in combination with MPS (Multi-Process Service) technology.
  • MPS: Multi-Process Service, NVIDIA's mechanism for sharing one GPU among multiple processes.
  • the preset resource quota can be the resource quota of GPU video memory and the resource quota of GPU computing core set according to the preset configuration file.
  • the resource quota of GPU video memory and the resource quota of GPU computing core required for the Pod service can be set through the configuration file.
  • FIG 3 a code execution diagram of a configuration file provided in an embodiment of the present application is shown.
  • the GPU sharing system does not need to modify the design of Extended Resource and the implementation of Scheduler of the Kubernetes (k8s) core.
  • NVIDIA Device Plugin and native Kubernetes can be used, which has no impact on the underlying driver (CUDA Driver, NVIDIA Driver) and runtime (CUDA Runtime). Fine-grained deployment of services can be performed only by using Kubernetes yaml files.
  • it also includes:
  • the number of Pod services is scaled up or down based on the usage of GPU memory and GPU computing cores in each vGPU.
  • Scaling the Pod services up or down means increasing or reducing the number of Pod services. Since one Pod service corresponds to one vGPU, increasing the number of Pod services in fact increases the number of vGPUs, and reducing the number of Pod services in fact reduces the number of vGPUs.
  • the number of Pod services is scaled up or down to scale the number of vGPUs according to the usage of GPU video memory and GPU computing cores in each vGPU.
  • the GPU sharing system can schedule services with the maximum integration rate to the same GPU card, thereby more efficiently improving the utilization rate of GPU resources in the existing cluster.
  • FIG. 4 a schematic diagram of a scaling mode architecture provided in an embodiment of the present application is shown. It can be seen from the figure that there are two scaling methods in the embodiment of the present application, one is a scaling method based on HPA (Horizontal Pod Autoscaler), and the other is a scaling method based on TPA (Traffic Pod Autoscaler).
  • HPA scaling method can enable user applications or services to achieve horizontal scaling of Pod services based on the utilization of resources such as CPU and memory
  • TPA can enable user applications or services to achieve horizontal scaling of Pods based on the busyness of the business, wherein the busyness of the business can be real-time service request traffic.
  • In some embodiments, the GPU is located on a host, the host includes at least a CPU and memory, and the Pod service is bound to the CPU and the memory; scaling the number of Pod services according to the usage of the GPU video memory and the usage of the GPU computing core in each vGPU includes:
  • The CPU is the final execution unit for information processing and program running. The memory is an important component of the computer, also known as internal memory or main memory, and is used to temporarily store the CPU's computation data and the data exchanged with external storage such as hard disks.
  • the CPU utilization corresponding to the CPU and the average memory utilization corresponding to the memory in the host are obtained, and the number of Pod services is automatically expanded or reduced according to the CPU utilization and the average memory utilization. Specifically, if the CPU utilization and/or the average memory utilization corresponding to the Pod service is lower than the preset utilization, the number of Pod services is automatically reduced to reduce the number of vGPUs corresponding to the Pod service; if the CPU utilization and/or the average memory utilization corresponding to the Pod service is higher than the preset utilization, the number of Pod services is automatically expanded to expand the number of vGPUs corresponding to the Pod service; when the number of Pod services after automatic expansion meets the preset resource quota of the Pod service, the Pod service is scheduled to the target GPU.
  • the scaling method in the above example is a scaling method based on HPA, which can automatically scale the number of Pod services according to CPU utilization and average memory utilization.
  • The examples listed above are illustrative only, and the data are deliberately kept simple.
  • The resource utilization indicators usable by the HPA-based scaling method go far beyond CPU utilization and average memory utilization; those skilled in the art can perform automatic scaling based on custom metrics provided by other applications according to actual conditions, and the embodiments of the present application are not limited thereto.
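  • For reference, the standard Kubernetes autoscaling/v2 API already expresses this kind of CPU- and memory-driven horizontal scaling. The sketch below is a minimal stock HorizontalPodAutoscaler manifest; the target name and the 70% thresholds are illustrative assumptions, not values taken from this application.

```yaml
# Minimal HorizontalPodAutoscaler using the standard Kubernetes autoscaling/v2
# API: scales the target Deployment (one vGPU per Pod replica) on CPU and
# average memory utilization. Names and thresholds are illustrative.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-inference-demo
  minReplicas: 1
  maxReplicas: 8
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out above, scale in below
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 70
```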
  • the number of Pod services is scaled up or down according to the usage of GPU memory and GPU computing core in each vGPU, including:
  • The real-time service request traffic of a Pod service can be the number of service requests per second (Queries Per Second, abbreviated as QPS) while the Pod service is running.
  • the TPA-based scaling method increases the number of Pod services by one and the corresponding number of vGPUs by one to cope with burst traffic.
  • a Pod service occupies 10GB of video memory and 10% of the computing cores on GPU card A.
  • the resources on GPU card A can process 100 requests simultaneously.
  • the GPU sharing system can apply for the same size of resources on GPU card A, GPU card B, or GPU card N with N times the resource quota according to the resources allocated on GPU card A to cope with burst traffic.
  • the real-time service request traffic of the Pod service is obtained, and the number of Pod services is automatically scaled according to the real-time service request traffic of the Pod service. Specifically, if the real-time service request traffic of the Pod service is greater than the preset real-time service request traffic, the number of Pod services is automatically expanded to expand the number of vGPUs corresponding to the Pod service. If the real-time service request traffic of the Pod service is less than the preset real-time service request traffic, the number of Pod services is automatically reduced to reduce the number of vGPUs corresponding to the Pod service. When the number of Pod services after automatic expansion meets the preset resource quota of the Pod service, the Pod service is scheduled to the target GPU.
  • the scaling method in the above example is a TPA-based scaling method, which can automatically scale the number of Pod services according to the real-time service request traffic of the Pod services.
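  • Unlike HPA, a Traffic Pod Autoscaler is not a stock Kubernetes API. The sketch below is a hypothetical custom resource that merely illustrates scaling the Pod (and hence vGPU) count on real-time QPS; the API group, kind, and every field name are assumptions.

```yaml
# Hypothetical TrafficPodAutoscaler custom resource; TPA is not a stock
# Kubernetes API, so the group, kind, and all field names are assumptions.
apiVersion: autoscaling.example.com/v1
kind: TrafficPodAutoscaler
metadata:
  name: inference-tpa
spec:
  scaleTargetRef:
    kind: Deployment
    name: ai-inference-demo
  minReplicas: 1
  maxReplicas: 10
  trafficMetric:
    type: QPS
    targetQPS: 100   # add a replica (and a vGPU) while sustained QPS exceeds this
```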
  • the k8s cluster also includes a Master node, the Master node includes a controller, and the controller is used to create resources corresponding to different types of Pod services.
  • The Master node is the management node in the k8s cluster; it can be a node deployed on the central server of the k8s cluster, responsible for coordinating other nodes, such as managing the Node nodes. The resources can include three different types: Deployment, Service, and Statefulset, where Deployment is used to deploy stateless Pod services, Service is used to deploy Pod services that can be scaled to zero, and Statefulset is used to deploy stateful Pod services.
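  • As one concrete case, a stateless Pod service maps naturally onto a standard apps/v1 Deployment. The sketch below uses the stock Kubernetes API; the metadata and image names are assumptions.

```yaml
# Sketch: a stateless Pod service deployed as a standard apps/v1 Deployment;
# metadata and image names are assumptions for illustration.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stateless-inference
spec:
  replicas: 2              # one vGPU per Pod replica
  selector:
    matchLabels:
      app: stateless-inference
  template:
    metadata:
      labels:
        app: stateless-inference
    spec:
      containers:
        - name: inference
          image: inference-service:latest
```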
  • it also includes:
  • the Pod service is scheduled to the target GPU.
  • the target GPU may be a GPU that can meet the resource quota requirements of the Pod service.
  • the hijacking scheduler can schedule the Pod service to the target GPU.
  • the hijacking scheduler can ensure that the computing resources of the vGPU can meet the needs of the Pod during scheduling.
  • FIG. 5 a schematic diagram of resource allocation of a multi-service shared GPU provided in an embodiment of the present application is shown.
  • the "instance” in Figure 5 can be expressed as Case
  • the "container” can be expressed as Container
  • the “solution” can be expressed as Case Scenario
  • the "APP” can be expressed as an application or service.
  • multiple services sharing GPU resources can include GPU memory (Memory) and GPU computing core (Kernel).
  • Pod service a occupies 25% of the GPU memory in GPU card A (Memory-Container A) and 20% of the GPU computing core (Kernel-ContainerA).
  • users can deploy multiple services of different types on the same GPU card.
  • the utilization rate of GPU resources can reach 100%.
  • One Container can correspond to one Pod service; in the example, Container1's quota is 50%, Container2's is 25%, Container3's is 50%, and Container4's is 75%.
  • the GPU sharing system can be used to schedule the Pod service with the maximum integration rate to the same GPU card.
  • the combination of Container1 (50%) and Container3 (50%), Container2 (25%) and Container4 (75%) can fully meet the video memory quota of the existing GPU resources. It can be understood that the GPU sharing system can schedule services with the maximum integration rate to the same GPU card, thereby more efficiently improving the utilization rate of GPU resources in the existing cluster.
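  • The pairing can be read as a small bin-packing plan; the sketch below merely restates the example combinations as an illustrative data file with assumed names.

```yaml
# Illustrative packing plan matching the example above: containers whose
# video-memory quotas sum to 100% are scheduled onto the same GPU card.
gpuCardA:
  - { container: Container1, memoryQuota: "50%" }
  - { container: Container3, memoryQuota: "50%" }
gpuCardB:
  - { container: Container2, memoryQuota: "25%" }
  - { container: Container4, memoryQuota: "75%" }
```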
  • a schematic diagram of a scheduling mode for multi-service shared resources provided in an embodiment of the present application is shown, where GPU sharing
  • the system can calculate the optimal scheduling strategy through the background algorithm to provide the minimum remaining resources and service security solution for the pre-deployed services, that is, to make the Pod service occupy the resources on a GPU card as much as possible, reduce the number of GPU cards used, and reduce the fragmentation of GPU computing resources.
  • idle GPU resources can be provided for other services.
  • APP4 and APP5 corresponding to Container4 (20%) and Container5 (80%) in Figure 6 can also be integrated into one card to meet the GPU video memory usage quota of a GPU (less than or equal to 100%). It can be understood that those skilled in the art can calculate the optimal scheduling strategy based on actual conditions to provide the minimum remaining resources and service security solution for the pre-deployed service, and the embodiments of the present application are not limited to this.
  • the above resource scheduling method is also applicable to the cross-node resource allocation scheme.
  • For example, with APP6 (85%) already deployed, newly added APP7 (35%) cannot share the same card, since APP6 (85%) and APP7 (35%) together would exceed a single card's quota; APP7 is therefore scheduled to a GPU on Node 2.
  • the GPU sharing system calculates the optimal scheduling strategy through the background algorithm to provide the minimum remaining resources and service security solution for the pre-deployed service. After being able to reasonably schedule services to different GPU cards, it can provide idle GPU resources for other services, while ensuring resource isolation between services.
  • the GPU sharing system is deployed with a k8s cluster, and the k8s cluster includes a Node node and a Pod service, wherein the Node node includes a GPU, and the GPU computing resources corresponding to the GPU include at least a GPU video memory and a GPU computing core.
  • Multiple Pod services can thus run on the same physical GPU while the GPU computing resources are strictly isolated. The vGPU information of each vGPU in the Node node is collected and registered to obtain the Pod information of each Pod service corresponding to each vGPU; each Pod information is received and saved as multiple files; and part of the GPU video memory and part of the GPU computing core in each vGPU are then managed according to each file. Managing each vGPU's share of GPU video memory and computing cores through the Pod information of each Pod service effectively prevents GPU computing resources from being exceeded.
  • Optimal GPU resources can be requested for a Pod service through this fine-grained resource scheduling method.
  • the GPU sharing system can schedule services with the maximum integration rate to the same GPU card, thereby more efficiently improving the utilization rate of GPU resources in the existing cluster.
  • the optimal scheduling strategy is calculated through the background algorithm to provide the minimum remaining resources and service security solution for the pre-deployed service. It can reasonably schedule services to different GPU cards, provide idle GPU resources for other services, and at the same time ensure resource isolation between services.
  • FIG. 8 a block diagram of a GPU computing resource management device provided in an embodiment of the present application is shown, which is applied to a GPU sharing system.
  • the GPU sharing system is deployed with a k8s cluster, and the k8s cluster includes a Node node and a Pod service, wherein the Node node includes a GPU, and the GPU computing resources corresponding to the GPU include at least a GPU video memory and a GPU computing core, which may specifically include the following modules:
  • the GPU partition module 801 is used to partition the GPU in the Node node to obtain multiple vGPUs; each vGPU includes part of the GPU memory and part of the GPU computing core of the GPU, and one vGPU corresponds to one Pod service;
  • the Pod information acquisition module 802 is used to collect the vGPU information of each vGPU in the Node node, register each vGPU information, and obtain the Pod information of each Pod service corresponding to each vGPU;
  • the Pod information file generation module 803 is used to receive each Pod information and save each Pod information into multiple files;
  • the resource management module 804 is used to manage part of the GPU memory and part of the GPU computing core in each vGPU according to various files.
  • the GPU partition module 801 is specifically used for:
  • the GPU memory and GPU computing core of the GPU are allocated to each vGPU according to the preset resource quota, so as to obtain multiple vGPUs including part of the GPU memory and part of the GPU computing core of the GPU.
  • the k8s cluster further includes a Master node, the Master node includes a hijacking scheduler, and the Pod information acquisition module 802 is specifically used for:
  • the Pod information file generation module 803 is specifically used to:
  • the resource management module 804 is specifically used for:
  • Since the device embodiment is substantially similar to the method embodiment, its description is relatively brief; for relevant details, refer to the corresponding parts of the method embodiment.
  • An embodiment of the present application also provides an electronic device, including a processor, a memory, and a computer program stored in the memory and executable on the processor.
  • FIG. 9 is a schematic diagram of the structure of a non-volatile readable storage medium provided in an embodiment of the present application.
  • The embodiment of the present application also provides a non-volatile readable storage medium 901, on which a computer program is stored.
  • The medium 901 is, for example, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
  • FIG. 10 is a schematic diagram of the hardware structure of an electronic device implementing various embodiments of the present application.
  • the electronic device 1000 includes but is not limited to: a radio frequency unit 1001, a network module 1002, an audio output unit 1003, an input unit 1004, a sensor 1005, a display unit 1006, a user input unit 1007, an interface unit 1008, a memory 1009, a processor 1010, and a power supply 1011.
  • the electronic device structure shown in FIG. 10 does not constitute a limitation on the electronic device, and the electronic device may include more or fewer components than shown, or combine certain components, or arrange components differently.
  • the electronic device includes but is not limited to a mobile phone, a tablet computer, a laptop computer, a PDA, a vehicle-mounted terminal, a wearable device, a pedometer, and the like.
  • the RF unit 1001 can be used for receiving and sending signals during information transmission or calls. Specifically, after receiving downlink data from the base station, it is sent to the processor 1010 for processing; in addition, uplink data is sent to the base station.
  • the RF unit 1001 includes but is not limited to an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier, a duplexer, etc.
  • the RF unit 1001 can also communicate with the network and other devices through a wireless communication system.
  • the electronic device provides users with wireless broadband Internet access through the network module 1002, such as helping users to send and receive emails, browse web pages, and access streaming media.
  • the audio output unit 1003 can convert the audio data received by the RF unit 1001 or the network module 1002 or stored in the memory 1009 into an audio signal and output it as sound. Moreover, the audio output unit 1003 can also provide audio output related to a specific function performed by the electronic device 1000 (for example, a call signal reception sound, a message reception sound, etc.).
  • the audio output unit 1003 includes a speaker, a buzzer, a receiver, etc.
  • the input unit 1004 is used to receive audio or video signals.
  • the input unit 1004 may include a graphics processor (GPU) 10041 and a microphone 10042.
  • the graphics processor 10041 processes the image data of a static picture or video obtained by an image capture device (such as a camera) in a video capture mode or an image capture mode.
  • the processed image frame can be displayed on the display unit 1006.
  • the image frame processed by the graphics processor 10041 can be stored in the memory 1009 (or other storage medium) or sent via the radio frequency unit 1001 or the network module 1002.
  • the microphone 10042 can receive sound and can process such sound into audio data.
  • In the case of a telephone call mode, the processed audio data can be converted into a format that can be transmitted to a mobile communication base station via the radio frequency unit 1001 and output.
  • the electronic device 1000 also includes at least one sensor 1005, such as a light sensor, a motion sensor, and other sensors.
  • the light sensor includes an ambient light sensor and a proximity sensor, wherein the ambient light sensor can adjust the brightness of the display panel 10061 according to the brightness of the ambient light, and the proximity sensor can turn off the display panel 10061 and/or the backlight when the electronic device 1000 is moved to the ear.
  • the accelerometer sensor can detect the magnitude of acceleration in each direction (generally three axes), and can detect the magnitude and direction of gravity when stationary, which can be used to identify the posture of the electronic device (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), vibration recognition related functions (such as pedometer, tapping), etc.; the sensor 1005 can also include a fingerprint sensor, a pressure sensor, an iris sensor, a molecular sensor, a gyroscope, a barometer, a hygrometer, a thermometer, an infrared sensor, etc., which will not be repeated here.
  • the display unit 1006 is used to display information input by the user or information provided to the user.
  • the display unit 1006 may include a display panel 10061, which may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED), or the like.
  • the user input unit 1007 can be used to receive input digital or character information and to generate key signal input related to user settings and function control of the electronic device.
  • the user input unit 1007 includes a touch panel 10071 and other input devices 10072.
  • the touch panel 10071 also known as a touch screen, can collect the user's touch operation on or near it (such as the user's operation on the touch panel 10071 or near the touch panel 10071 using any suitable object or accessory such as a finger, stylus, etc.).
  • the touch panel 10071 may include two parts: a touch detection device and a touch controller.
  • the touch detection device detects the user's touch orientation, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into the contact point coordinates, and then sends it to the processor 1010, receives the command sent by the processor 1010 and executes it.
  • the touch panel 10071 can be implemented in various types such as resistive, capacitive, infrared and surface acoustic waves.
  • the user input unit 1007 may also include other input devices 10072.
  • other input devices 10072 may include but are not limited to a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, and a joystick, which are not described in detail here.
  • the touch panel 10071 may be covered on the display panel 10061.
  • the touch panel 10071 detects a touch operation on or near it, it is transmitted to the processor 1010 to determine the type of the touch event, and then the processor 1010 provides a corresponding visual output on the display panel 10061 according to the type of the touch event.
  • the touch panel 10071 and the display panel 10061 are used as two independent components to implement the input and output functions of the electronic device, in some embodiments, the touch panel 10071 and the display panel 10061 may be integrated to implement the input and output functions of the electronic device, which is not limited here.
  • the interface unit 1008 is an interface for connecting an external device to the electronic device 1000.
  • the external device may include a wired or wireless headset port, an external power supply (or battery charger) port, a wired or wireless data port, a memory card port, a port for connecting a device with an identification module, an audio input/output (I/O) port, a video I/O port, a headphone port, etc.
  • the interface unit 1008 may be used to receive input (e.g., data information, power, etc.) from an external device and transmit the received input to one or more elements within the electronic device 1000 or may be used to transmit data between the electronic device 1000 and an external device.
  • the memory 1009 can be used to store software programs and various data.
  • the memory 1009 can mainly include a program storage area and a data storage area, wherein the program storage area can store an operating system, an application required for at least one function (such as a sound playback function, an image playback function, etc.), etc.; the data storage area can store data created according to the use of the mobile phone (such as audio data, a phone book, etc.), etc.
  • the memory 1009 can include a high-speed random access memory, and can also include a non-volatile memory, such as at least one disk storage device, a flash memory device, or other volatile solid-state storage devices.
  • the processor 1010 is the control center of the electronic device. It uses various interfaces and lines to connect various parts of the entire electronic device. By running or executing software programs and/or modules stored in the memory 1009 and calling data stored in the memory 1009, it performs various functions of the electronic device and processes data, thereby monitoring the electronic device as a whole.
  • the processor 1010 may include one or more processing units; in some embodiments of the present invention, the processor 1010 may integrate an application processor and a modem processor, wherein the application processor mainly processes the operating system, user interface, and application programs, etc., and the modem processor mainly processes wireless communications. It is understandable that the above-mentioned modem processor may not be integrated into the processor 1010.
  • the electronic device 1000 may also include a power supply 1011 (such as a battery) for supplying power to each component.
  • the power supply 1011 may be logically connected to the processor 1010 through a power management system, thereby implementing functions such as managing charging, discharging, and power consumption management through the power management system.
  • the electronic device 1000 includes some functional modules not shown, which will not be described in detail here.
  • the technical solution of the present application can be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, a disk, or an optical disk), and includes a number of instructions for a terminal (which can be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to execute the methods of each embodiment of the present application.
  • the disclosed devices and methods can be implemented in other ways.
  • the device embodiments described above are only schematic.
  • the division of units is only a logical function division. There may be other division methods in actual implementation, such as multiple units or components can be combined or integrated into another system, or some features can be ignored or not executed.
  • Another point is that the mutual coupling or direct coupling or communication connection shown or discussed can be through some interfaces, indirect coupling or communication connection of devices or units, which can be electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the function is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the technical solution of the present application or the part that contributes to the prior art or the part of the technical solution, can be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for a computer device (which can be a personal computer, server, or network device, etc.) to perform all or part of the steps of the various embodiments of the present application.
  • the aforementioned storage medium includes: various media that can store program codes, such as USB flash drives, mobile hard drives, ROM, RAM, magnetic disks, or optical disks.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present application provide a method and apparatus for managing GPU computing resources, an electronic device, and a readable storage medium, including: partitioning a GPU in a Node into multiple vGPUs, where each vGPU contains part of the GPU's memory and part of its compute cores, and one vGPU corresponds to one Pod service; collecting the vGPU information of each vGPU in the Node and registering it to obtain the Pod information of the Pod service corresponding to each vGPU; receiving the Pod information and saving it as multiple files; and managing the partial GPU memory and partial GPU compute cores in each vGPU according to the files. Through this method, multiple Pod services can run on the same physical GPU while the GPU computing resources are strictly isolated.

Description

Management method and apparatus for GPU computing resources, electronic device, and readable storage medium
Cross-reference to related applications
This application claims priority to Chinese patent application No. 202211553120.5, filed with the Chinese Patent Office on December 6, 2022 and entitled "Management method and apparatus for GPU computing resources, electronic device, and readable storage medium", the entire contents of which are incorporated herein by reference.
Technical field
Embodiments of the present application relate to the field of Internet technology, and in particular to a method for managing GPU computing resources, an apparatus for managing GPU computing resources, an electronic device, and a non-volatile readable storage medium.
Background
A GPU (Graphics Processing Unit) is a massively parallel computing architecture composed of a large number of cores and designed to handle many tasks at once. As the leading compute engine of the artificial-intelligence revolution, the GPU has enormous advantages in large-scale parallel computation and delivers significant computing performance and acceleration for big data, AI training and inference, image rendering, and similar workloads.
AI developers, GPU-based AI systems (research institutions), and new or traditional enterprises undergoing digital transformation inevitably face the following problems when using GPU computing resources:
(1) GPU resources are hard to manage. GPUs are more expensive than CPUs (central processing units); as high-value hardware, they are difficult to bring under the kind of unified operation, maintenance, and management model used for networks and storage. In practice, multiple processes, people, and tasks often reuse the same GPU, and long waits for resources seriously slow the advancement of business processes and the speed of product iteration.
(2) GPU resources are used inefficiently. AI (Artificial Intelligence) services with small compute demands (e.g., on-premise or cloud services) usually cannot fully load one GPU card, and users must explicitly distinguish GPU models so as to match the compute cores, drivers, and other version components of different architectures and models, which inevitably raises the barrier to use.
(3) GPU resources are hard to request and reclaim quickly. In production, AI services must request and release GPU resources according to the usage cycle of the task load and the peak/trough usage of different tasks, scaling automatically with the number of online requests (Query Per Second, QPS) to meet the real-time high-concurrency, low-latency demands of online AI services.
To solve these problems, industry has proposed a variety of GPU sharing schemes, and, driven by the cloud-native trend, containerized deployment using cloud-native technology and standard Docker (an application container engine) has become the common way cloud services handle heterogeneous computing resources. Existing methods, however, typically require continual adaptation, fail to cover all scenarios, provide no or weak security isolation, and are impossible or difficult to extend. How to run multiple tasks on the same GPU card at the same time while strictly isolating the shared resources is therefore an important research direction in industry.
Summary
Embodiments of the present application provide a method and apparatus for managing GPU computing resources, an electronic device, and a non-volatile readable storage medium, to solve the problems that GPU resources are hard to manage, used inefficiently, and hard to request and reclaim quickly.
An embodiment of the present application discloses a method for managing GPU computing resources, applied to a GPU sharing system deployed with a k8s cluster, the k8s cluster including Node nodes and Pod services, where a Node includes a GPU whose corresponding GPU computing resources include at least GPU memory and GPU compute cores, the method including:
partitioning the GPU in the Node into multiple vGPUs, where each vGPU contains part of the GPU memory and part of the GPU compute cores of the GPU, and one vGPU corresponds to one Pod service;
collecting vGPU information of each vGPU in the Node and registering the vGPU information to obtain Pod information of the Pod service corresponding to each vGPU;
receiving the Pod information and saving it as multiple files;
managing the partial GPU memory and partial GPU compute cores in each vGPU according to the files.
In some embodiments of the present application, partitioning the GPU in the Node into multiple vGPUs includes:
when the GPU in the Node is partitioned, allocating the GPU memory and GPU compute cores of the GPU to the vGPUs according to a preset resource quota, obtaining multiple vGPUs each containing part of the GPU memory and part of the GPU compute cores.
In some embodiments of the present application, the vGPU information includes at least the vGPU count and vGPU memory size of the vGPU.
In some embodiments of the present application, the k8s cluster further includes a Master node containing a hijack scheduler, and collecting the vGPU information of each vGPU in the Node and registering it to obtain the Pod information of the Pod service corresponding to each vGPU includes:
collecting the vGPU information of each vGPU in the Node;
sending the vGPU information to the hijack scheduler in the Master node for registration, obtaining the Pod information of the Pod service corresponding to each vGPU.
In some embodiments of the present application, receiving the Pod information and saving it as multiple files includes:
receiving the Pod information, returned by the hijack scheduler, of the Pod service corresponding to each vGPU and saving it as multiple files.
In some embodiments of the present application, the Pod information includes at least the usage of GPU memory and GPU compute cores in the vGPU.
In some embodiments of the present application, managing the partial GPU memory and partial GPU compute cores in each vGPU according to the files includes:
saving the usage of GPU memory and GPU compute cores corresponding to the vGPU in the Pod information as a file;
controlling the process of the Pod service according to the usage of GPU memory and GPU compute cores corresponding to the vGPU in the file.
In some embodiments of the present application, controlling the process of the Pod service according to the usage of GPU memory and GPU compute cores corresponding to the vGPU in the file includes:
if the usage of GPU memory and GPU compute cores corresponding to the vGPU in the file exceeds the preset resource quota, controlling the GPU memory and GPU compute cores in the vGPU so as to terminate the Pod service's process;
if the usage of GPU memory and GPU compute cores corresponding to the vGPU in the file satisfies the preset resource quota, letting the Pod service's process run normally.
In some embodiments of the present application, the method further includes:
scaling the number of Pod services according to the usage of GPU memory and GPU compute cores in each vGPU.
In some embodiments of the present application, the GPU is located on a host that includes at least a CPU and memory, the Pod service is bound to the CPU and the memory, and scaling the number of Pod services according to the usage of GPU memory and GPU compute cores in each vGPU includes:
obtaining the CPU utilization of the host's CPU and the average memory utilization of its memory;
automatically scaling the number of Pod services according to the CPU utilization and average memory utilization.
In some embodiments of the present application, automatically scaling the number of Pod services according to the CPU utilization and average memory utilization includes:
if the CPU utilization and/or average memory utilization corresponding to a Pod service is lower than a preset usage rate, automatically reducing the number of Pod services to reduce the number of vGPUs corresponding to the Pod services;
if the CPU utilization and/or average memory utilization corresponding to a Pod service is higher than the preset usage rate, automatically expanding the number of Pod services to expand the number of vGPUs corresponding to the Pod services.
In some embodiments of the present application, scaling the number of Pod services according to the usage of GPU memory and GPU compute cores in each vGPU includes:
obtaining the real-time service request traffic of the Pod service;
automatically scaling the number of Pod services according to the real-time service request traffic of the Pod service.
In some embodiments of the present application, automatically scaling the number of Pod services according to the real-time service request traffic of the Pod service includes:
if the real-time service request traffic of the Pod service is greater than a preset real-time service request traffic, automatically expanding the number of Pod services to expand the number of vGPUs corresponding to the Pod services;
if the real-time service request traffic of the Pod service is less than the preset real-time service request traffic, automatically reducing the number of Pod services to reduce the number of vGPUs corresponding to the Pod services.
In some embodiments of the present application, the method further includes:
scheduling the Pod services to a target GPU when the number of Pod services after automatic scaling satisfies the Pod services' preset resource quota.
In some embodiments of the present application, the k8s cluster further includes a Master node containing a controller, and the controller is used to create the resources corresponding to different types of Pod services.
In some embodiments of the present application, the resources include at least Deployment, Service, and Statefulset.
In some embodiments of the present application, Deployment is used to deploy stateless Pod services, Service is used to deploy Pod services that can scale to zero, and Statefulset is used to deploy stateful Pod services.
An embodiment of the present application further discloses an apparatus for managing GPU computing resources, used in a GPU sharing system deployed with a k8s cluster, the k8s cluster including Node nodes and Pod services, where a Node includes a GPU whose corresponding GPU computing resources include at least GPU memory and GPU compute cores, the apparatus including:
a GPU partitioning module configured to partition the GPU in the Node into multiple vGPUs, where each vGPU contains part of the GPU memory and part of the GPU compute cores, and one vGPU corresponds to one Pod service;
a Pod information acquisition module configured to collect the vGPU information of each vGPU in the Node and register it to obtain the Pod information of the Pod service corresponding to each vGPU;
a Pod information file generation module configured to receive the Pod information and save it as multiple files;
a resource management module configured to manage the partial GPU memory and partial GPU compute cores in each vGPU according to the files.
In some embodiments of the present application, the GPU partitioning module is specifically configured to:
when the GPU in the Node is partitioned, allocate the GPU memory and GPU compute cores of the GPU to the vGPUs according to a preset resource quota, obtaining multiple vGPUs each containing part of the GPU memory and part of the GPU compute cores.
In some embodiments of the present application, the k8s cluster further includes a Master node containing a hijack scheduler, and the Pod information acquisition module is specifically configured to:
collect the vGPU information of each vGPU in the Node;
send the vGPU information to the hijack scheduler in the Master node for registration, obtaining the Pod information of the Pod service corresponding to each vGPU.
In some embodiments of the present application, the Pod information file generation module is specifically configured to:
receive the Pod information, returned by the hijack scheduler, of the Pod service corresponding to each vGPU and save it as multiple files.
In some embodiments of the present application, the resource management module is specifically configured to:
save the usage of GPU memory and GPU compute cores corresponding to the vGPU in the Pod information as a file;
control the process of the Pod service according to the usage of GPU memory and GPU compute cores corresponding to the vGPU in the file.
An embodiment of the present application further discloses an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory communicate with one another via the communication bus;
the memory is configured to store a computer program;
the processor is configured to implement the method of the embodiments of the present application when executing the program stored on the memory.
An embodiment of the present application further discloses a non-volatile readable storage medium having instructions stored thereon which, when executed by one or more processors, cause the processors to perform the method of the embodiments of the present application.
The embodiments of the present application include the following advantages:
In the embodiments of the present application, applied to a GPU sharing system deployed with a k8s cluster including Node nodes and Pod services, where a Node includes a GPU whose computing resources include at least GPU memory and GPU compute cores, partitioning the GPU in the Node yields multiple vGPUs, each containing part of the GPU memory and part of the GPU compute cores, with one vGPU corresponding to one Pod service; partitioning the GPU in the Node into multiple vGPUs enables multiple Pod services to run on the same physical GPU while strictly isolating the GPU computing resources. The vGPU information of each vGPU in the Node is then collected and registered to obtain the Pod information of the Pod services corresponding to the vGPUs, the Pod information is received and saved as multiple files, and the partial GPU memory and partial GPU compute cores in each vGPU are managed according to the files; managing each vGPU's partial GPU memory and partial GPU compute cores through the Pod information of each Pod service effectively solves the problem of GPU computing resources exceeding their limits.
Brief description of the drawings
Figure 1 is a flowchart of the steps of a GPU computing resource management method provided in an embodiment of the present application;
Figure 2 is a schematic architecture diagram of a GPU sharing system provided in an embodiment of the present application;
Figure 3 is a schematic diagram of code execution of a configuration file provided in an embodiment of the present application;
Figure 4 is a schematic architecture diagram of a scaling mode provided in an embodiment of the present application;
Figure 5 is a schematic diagram of resource allocation for multiple services sharing a GPU provided in an embodiment of the present application;
Figure 6 is a first schematic diagram of a scheduling mode for multi-service shared resources provided in an embodiment of the present application;
Figure 7 is a second schematic diagram of a scheduling mode for multi-service shared resources provided in an embodiment of the present application;
Figure 8 is a structural block diagram of a GPU computing resource management apparatus provided in an embodiment of the present application;
Figure 9 is a schematic structural diagram of a non-volatile readable storage medium provided in an embodiment of the present application;
Figure 10 is a schematic diagram of the hardware structure of an electronic device implementing various embodiments of the present application.
Detailed description
To make the above objectives, features, and advantages of the present application more comprehensible, the present application is described in further detail below with reference to the accompanying drawings and specific embodiments.
To help those skilled in the art better understand the technical solutions of the embodiments of the present application, some of the technical terms involved are first explained:
Kubernetes (k8s for short) is a portable, extensible open-source platform for managing containerized workloads and services that facilitates declarative configuration and automation.
Container technology: Docker is used as an open-source application container engine to provide a flexible way of deploying applications; Kubernetes is an open-source project for automatically deploying, scaling, and managing containerized applications and can provide reliable, scalable container orchestration for edge computing platforms.
Pod: the smallest unit scheduled by Kubernetes.
GPU (Graphics Processing Unit): a microprocessor dedicated to image- and graphics-related computation on personal computers, workstations, game consoles, and some mobile devices.
Model inference service: converts the result model produced by AI training into a service capable of performing model inference operations.
Node: a Kubernetes node. Kubernetes nodes are divided into Master and Node, where Master is the management node and Node is the compute node.
CRD (Custom Resource Definition): a mechanism for extending the Kubernetes API (Application Programming Interface) without changing code, used to manage custom objects.
Elastic scaling: automatically controls the number of instances at runtime according to configured scaling rules.
CUDA (Compute Unified Device Architecture): a general-purpose parallel computing architecture introduced by NVIDIA that enables GPUs to solve complex computational problems; it includes the CUDA instruction set architecture (ISA) and the parallel compute engine inside the GPU.
As an example, AI developers, GPU-based AI systems (research institutions), and new or traditional enterprises undergoing digital transformation inevitably face difficulties in managing GPU resources, low GPU utilization, and difficulty in quickly requesting and reclaiming GPU resources. To address these problems, industry has proposed a variety of GPU sharing schemes, and, driven by the cloud-native trend, containerized deployment with cloud-native technology and standard Docker has become the common way cloud services handle heterogeneous computing resources. Existing GPU sharing schemes are shown in Table 1:
Table 1
As Table 1 shows, existing GPU sharing schemes still typically require continual adaptation, fail to cover all scenarios, provide no or weak security isolation, and are impossible or difficult to extend. How to run multiple tasks on the same GPU card at the same time while strictly isolating the shared resources is therefore an important research direction in industry.
In view of this, one of the core inventive points of the present application is as follows: applied to a GPU sharing system deployed with a k8s cluster including Node nodes and Pod services, where a Node includes a GPU whose computing resources include at least GPU memory and GPU compute cores, the GPU in the Node is partitioned to obtain multiple vGPUs, each containing part of the GPU memory and part of the GPU compute cores, with one vGPU corresponding to one Pod service; partitioning the GPU in the Node into multiple vGPUs enables multiple Pod services to run on the same physical GPU while strictly isolating the GPU computing resources. The vGPU information of each vGPU in the Node is then collected and registered to obtain the Pod information of the corresponding Pod services; the Pod information is received and saved as multiple files; and the partial GPU memory and partial GPU compute cores in each vGPU are managed according to the files, effectively solving the problem of GPU computing resources exceeding their limits.
Referring to Figure 1, which shows a flowchart of the steps of a GPU computing resource management method provided in an embodiment of the present application, applied to a GPU sharing system deployed with a k8s cluster, the k8s cluster including Node nodes and Pod services, where a Node includes a GPU whose computing resources include at least GPU memory and GPU compute cores, the method may include the following steps:
Step 101: partition the GPU in the Node into multiple vGPUs, where each vGPU contains part of the GPU memory and part of the GPU compute cores, and one vGPU corresponds to one Pod service.
Referring to Figure 2, which shows a schematic architecture of a GPU sharing system provided in an embodiment of the present application, the GPU computing resource management method provided by the embodiments can be applied to the GPU sharing system shown in Figure 2. Specifically, a k8s cluster is deployed in the GPU sharing system and may include one or more Node nodes and Pod services, where each Node may include one or more GPUs whose computing resources include at least GPU memory and GPU compute cores. Based on this GPU sharing system, multiple Pod services can be deployed on the same physical GPU, the user can specify the GPU memory size and compute-core ratio occupied by each Pod service, and secure resource isolation can be achieved, solving the problem of resources exceeding their limits.
k8s is a portable, extensible open-source platform for managing containerized workloads and services that facilitates declarative configuration and automation; a k8s cluster may consist of multiple physical devices or virtual machines. Specifically, the k8s cluster may include one or more Node nodes and Pod services, and each Node may include one or more GPUs. A Node is a compute node in k8s responsible for running the relevant containers in the cluster and managing the data they transfer.
A Pod is the smallest unit scheduled by Kubernetes and represents a single running process instance in a Kubernetes cluster. One Pod may hold multiple containers (Containers), and one container may hold one AI service, so a Pod can combine the AI services of multiple containers into one larger AI service. It can be understood that when one Pod holds one container, the container mounts one vGPU and the Pod uses one vGPU, so one vGPU corresponds to one Pod service. It should be noted that, for ease of explanation, the usage of Pods is deliberately kept simple here; in practice it may be more complex and may vary with the application scenario.
A vGPU (virtual graphics processing unit) is obtained by partitioning a GPU in the Node; a whole GPU card may be virtualized into multiple vGPUs through fine-grained slicing of the card. As shown in Figure 2, a Node in the GPU sharing system contains multiple GPUs as well as multiple vGPUs partitioned from them, and the vGPUs form a vGPU pool.
The GPU may be located on the Node. A GPU is a microprocessor dedicated to image- and graphics-related computation on personal computers, workstations, game consoles, and some mobile devices. A GPU's computing resources may include GPU memory and GPU compute cores. GPU memory can be understood as a space, similar to main memory, used to hold models, data, and so on; the larger the GPU memory, the larger the networks it can run, which matters particularly in large-scale training. The GPU compute cores execute all of the GPU's graphics computation, general-purpose computation, and so on.
In the embodiments of the present application, the GPU in a Node of the GPU sharing system is partitioned to obtain multiple vGPUs. Specifically, during partitioning, part of the GPU memory and part of the GPU compute cores are allocated to the vGPUs according to a preset resource quota, yielding multiple vGPUs each containing part of the GPU memory and part of the GPU compute cores, where one vGPU corresponds to one Pod service and the GPU computing resources occupied by the multiple Pod services running on the same GPU card are independently partitioned. Partitioning the GPU in the Node into multiple vGPUs allows multiple Pod services to run on the same physical GPU while strictly isolating the GPU computing resources.
The preset resource quota may be the GPU memory size and GPU compute cores the user sets for the vGPU when creating a Pod service or application, so that part of the GPU memory and part of the GPU compute cores can be allocated to the vGPUs according to the preset resource quota.
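As an illustration only (not part of the claimed embodiment), the quota-based partitioning can be pictured with a short Python sketch; the names VGPU and carve_vgpus and the quota format are hypothetical:

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class VGPU:
        gpu_id: str          # physical GPU this slice belongs to
        memory_mib: int      # share of GPU memory assigned to the slice
        core_percent: int    # share of GPU compute cores assigned to the slice

    def carve_vgpus(gpu_id: str, total_memory_mib: int,
                    quotas: List[dict]) -> List[VGPU]:
        """Split one physical GPU into vGPUs according to preset quotas.

        Each quota entry asks for a memory size and a core percentage;
        the sum of the requests must not exceed the physical card.
        """
        used_mem, used_core = 0, 0
        slices = []
        for q in quotas:
            used_mem += q["memory_mib"]
            used_core += q["core_percent"]
            if used_mem > total_memory_mib or used_core > 100:
                raise ValueError("preset quotas exceed the physical GPU")
            slices.append(VGPU(gpu_id, q["memory_mib"], q["core_percent"]))
        return slices

    # Example: a 16 GiB card split into three vGPUs, one per Pod service.
    vgpus = carve_vgpus("GPU-0", 16384,
                        [{"memory_mib": 8192, "core_percent": 50},
                         {"memory_mib": 4096, "core_percent": 25},
                         {"memory_mib": 4096, "core_percent": 25}])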
Step 102: collect vGPU information of each vGPU in the Node and register the vGPU information to obtain Pod information of the Pod service corresponding to each vGPU.
The vGPU information may include the vGPU count and vGPU memory size. The Pod information may include the usage of the partial GPU memory and partial GPU compute cores contained in the vGPU, where usage refers to the Pod service's use of the GPU memory or compute cores; illustratively, the GPU memory the Pod service consumes may exceed the preset resource quota, or it may fall within the quota. The preset resource quota may be the GPU-memory quota and GPU-compute-core quota set according to a preset configuration file.
In the embodiments of the present application, after the GPU in the Node is partitioned into multiple vGPUs, the vGPU count and vGPU memory size of each vGPU in the Node are collected and registered to obtain the Pod information of the Pod service corresponding to each vGPU, i.e., the usage of the partial GPU memory and partial GPU compute cores contained in each vGPU.
Step 103: receive the Pod information and save it as multiple files.
A file here may be a file containing the usage of the partial GPU memory and partial GPU compute cores contained in each vGPU.
In the embodiments of the present application, after the GPU in the Node is partitioned into multiple vGPUs, the vGPU count and vGPU memory size of each vGPU are collected and registered, yielding the usage, by each Pod service corresponding to each vGPU, of the GPU's partial GPU memory and partial GPU compute cores; this usage is received and saved as files.
Step 104: manage the partial GPU memory and partial GPU compute cores in each vGPU according to the files.
In a specific implementation, based on the usage of the GPU's partial GPU memory and partial GPU compute cores by each Pod service contained in the files, it is judged whether the Pod service's use of the partial GPU memory and partial GPU compute cores exceeds the preset resource quota, so as to control the Pod service's process and thereby manage the partial GPU memory and partial GPU compute cores in each vGPU.
In the embodiments of the present application, applied to a GPU sharing system deployed with a k8s cluster including Node nodes and Pod services, where a Node includes a GPU whose computing resources include at least GPU memory and GPU compute cores, partitioning the GPU in the Node yields multiple vGPUs, each containing part of the GPU memory and part of the GPU compute cores, with one vGPU corresponding to one Pod service; this realizes multiple Pod services running on the same physical GPU while strictly isolating the GPU computing resources. The vGPU information of each vGPU in the Node is then collected and registered to obtain the Pod information of the corresponding Pod services, the Pod information is received and saved as multiple files, and the partial GPU memory and partial GPU compute cores in each vGPU are managed according to the files; managing each vGPU's resources through the Pod information of each Pod service effectively solves the problem of GPU computing resources exceeding their limits.
In an optional embodiment, the k8s cluster further includes a Master node containing a hijack scheduler, and step 102 of collecting the vGPU information of each vGPU in the Node and registering it to obtain the Pod information of the corresponding Pod services includes:
collecting the vGPU information of each vGPU in the Node;
sending the vGPU information to the hijack scheduler in the Master node for registration, obtaining the Pod information of the Pod service corresponding to each vGPU.
The Master node is the management node of the k8s cluster; it may be a node deployed on the cluster's central server, responsible for coordinating the other nodes, e.g., managing the Node nodes.
The hijack scheduler may be a GPUSharing Scheduler used to account for, manage, and schedule the multiple Pod services sharing the GPU computing resources of the same GPU card. By hijacking the usage of GPU memory and GPU compute cores in real time, it imposes software-level usage limits on the GPU computing resources. Specifically, the hijack scheduler can collect the real-time resource usage and state of Pod services and monitor the services strictly against the pre-allocated resource sizes; if the resource quota is exceeded, the process of the Pod service exceeding the preset resource maximum is controlled, and the process may then be in an interrupted state.
In the embodiments of the present application, the vGPU information of each vGPU in the Node is collected and sent to the hijack scheduler in the Master node for registration, obtaining the Pod information of the Pod service corresponding to each vGPU.
As shown in Figure 2, the k8s cluster further includes a Master node containing the hijack scheduler (GPUSharing Scheduler); each Node collects all of its vGPU information and sends it to the hijack scheduler (GPUSharing Scheduler) for registration, whereby the Pod information of the Pod service corresponding to each vGPU can be obtained.
In an optional embodiment, step 103 of receiving the Pod information and saving it as multiple files includes:
receiving the Pod information, returned by the hijack scheduler, of the Pod service corresponding to each vGPU and saving it as multiple files.
A file here may be a file containing the usage of the partial GPU memory and partial GPU compute cores contained in each vGPU.
In the embodiments of the present application, after the GPU in the Node is partitioned into multiple vGPUs, the vGPU information of each vGPU is collected and sent to the hijack scheduler in the Master node for registration, obtaining the Pod information of the corresponding Pod services, i.e., registering through the hijack scheduler yields the usage of the partial GPU memory and partial GPU compute cores by each Pod service corresponding to each vGPU; the usage returned by the hijack scheduler is received and saved as files, which facilitates further resource management.
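For illustration, a minimal Python sketch of this collect-register-persist flow follows; the in-process register_vgpus stand-in, the file layout, and all field names are assumptions invented here, since the embodiment does not specify the interface between the Node and the hijack scheduler at this level of detail:

    import json
    from pathlib import Path

    def register_vgpus(vgpu_infos):
        """Stand-in for the hijack scheduler's registration step.

        A real embodiment would send the vGPU information to the
        Master-node scheduler; here the returned Pod info simply echoes
        the registered slices with empty usage counters.
        """
        return [{"pod": f"pod-{i}",
                 "vgpu": info,
                 "memory_used_mib": 0,
                 "core_used_percent": 0}
                for i, info in enumerate(vgpu_infos)]

    def save_pod_info(pod_infos, directory="./vgpu-state"):
        """Persist the Pod information as files, one file per Pod/vGPU pair."""
        out = Path(directory)
        out.mkdir(parents=True, exist_ok=True)
        for info in pod_infos:
            (out / f"{info['pod']}.json").write_text(json.dumps(info))

    vgpu_infos = [{"count": 1, "memory_mib": 4096}]
    save_pod_info(register_vgpus(vgpu_infos))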
In an optional embodiment, step 104 of managing the partial GPU memory and partial GPU compute cores in each vGPU according to the files includes:
saving the usage of GPU memory and GPU compute cores corresponding to the vGPU in the Pod information as a file;
controlling the process of the Pod service according to the usage of GPU memory and GPU compute cores corresponding to the vGPU in the file.
The Pod information may include the usage of the partial GPU memory and partial GPU compute cores contained in the vGPU, where usage refers to the Pod service's use of the GPU memory or compute cores; illustratively, the GPU memory the Pod service consumes may exceed the preset resource quota, or it may fall within the quota.
In one example, if the usage of GPU memory and GPU compute cores corresponding to the vGPU in the file exceeds the preset resource quota, the GPU memory and GPU compute cores in the vGPU are controlled so as to terminate the Pod service's process; if the usage satisfies the preset resource quota, the Pod service's process runs normally.
In the embodiments of the present application, the usage of GPU memory and GPU compute cores corresponding to the vGPU in the Pod information is saved as a file, and the Pod service's process is controlled according to that usage in the file; specifically, the hijack scheduler can collect the usage of GPU memory and GPU compute cores of the vGPU corresponding to the Pod service, monitor the service strictly against the preset resource quota, and control the Pod service's process.
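A minimal sketch of the per-file quota check, assuming a JSON file layout and field names invented here for illustration (the embodiment does not prescribe a format, and "terminate" stands in for whatever control the hijack layer actually applies to the process):

    import json
    from pathlib import Path

    QUOTA = {"memory_mib": 4096, "core_percent": 25}  # preset resource quota

    def check_pod(file: Path) -> str:
        """Return 'terminate' if recorded usage exceeds the preset quota,
        else 'running', mirroring the if/else rule described above."""
        usage = json.loads(file.read_text())
        over = (usage["memory_used_mib"] > QUOTA["memory_mib"]
                or usage["core_used_percent"] > QUOTA["core_percent"])
        return "terminate" if over else "running"

    # Demo: write one usage file and check it.
    f = Path("pod-0.json")
    f.write_text(json.dumps({"memory_used_mib": 5000, "core_used_percent": 20}))
    print(check_pod(f))  # terminate: memory exceeds the 4096 MiB quota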
It should be noted that, for GPU memory limiting and GPU time-slice control, besides the method of this embodiment in which the GPU node hijacks the CUDA dynamic library and starts the scheduler to enforce GPU memory limits and GPU time-slice control, a space-division scheduling mode combined with MPS (Multi-Process Service) technology may also be used; those skilled in the art may choose according to the actual situation, and the embodiments of the present application impose no limitation on this.
It is worth mentioning that the preset resource quota may be the GPU-memory quota and GPU-compute-core quota set through a preset configuration file; the GPU-memory and GPU-compute-core quotas a Pod service requires can be set through the configuration file. Referring to Figure 3, which shows a schematic diagram of code execution of a configuration file provided in an embodiment of the present application: in the embodiments of the present application, the GPU sharing system need not modify the Extended Resource design or Scheduler implementation at the core of Kubernetes (k8s); it can use the NVIDIA Device Plugin with native Kubernetes, has no impact on the underlying drivers (CUDA Driver, NVIDIA Driver) or runtime (CUDA Runtime), and fine-grained deployment of services can be performed simply through a Kubernetes yaml file, for example as sketched below.
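A Pod manifest of the kind Figure 3 alludes to might request vGPU quotas through extended-resource limits. The Python sketch below builds such a manifest (PyYAML assumed installed); the resource names under example.com/ are placeholders, not the names any particular device plugin registers:

    import yaml  # PyYAML; assumed available

    # Hypothetical extended-resource names and units, for illustration only.
    pod_manifest = {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "ai-inference"},
        "spec": {
            "containers": [{
                "name": "worker",
                "image": "inference:latest",
                "resources": {"limits": {
                    "example.com/vgpu": 1,           # one vGPU per Pod service
                    "example.com/vgpu-memory": 4096, # MiB of GPU memory
                    "example.com/vgpu-cores": 25,    # percent of compute cores
                }},
            }],
        },
    }
    print(yaml.safe_dump(pod_manifest, sort_keys=False))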
In an optional embodiment, the method further includes:
scaling the number of Pod services according to the usage of GPU memory and GPU compute cores in each vGPU.
Scaling may mean expanding or reducing the number of Pod services; since one Pod service corresponds to one vGPU, expanding the number of Pod services in effect expands the number of vGPUs, and reducing it in effect reduces the number of vGPUs.
In a specific implementation, the number of Pod services is scaled according to the usage of GPU memory and GPU compute cores in each vGPU so as to scale the number of vGPUs; by scaling the number of Pod services, the GPU sharing system can schedule the services with the highest consolidation rate onto the same GPU card, thereby raising the utilization of GPU resources in the existing cluster more efficiently.
Referring to Figure 4, which shows a schematic architecture of a scaling mode provided in an embodiment of the present application: there are two scaling approaches in the embodiments, one based on HPA (Horizontal Pod Autoscaler) and one based on TPA (Traffic Pod Autoscaler). HPA lets a user application or service scale Pod services horizontally according to the utilization of resources such as CPU and memory, while TPA lets a user application or service scale Pods horizontally according to how busy the business is, where business busyness may be the real-time service request traffic.
In an optional embodiment, the GPU is located on a host that includes at least a CPU and memory, the Pod service is bound to the CPU and the memory, and scaling the number of Pod services according to the usage of GPU memory and GPU compute cores in each vGPU includes:
obtaining the CPU utilization of the host's CPU and the average memory utilization of its memory;
automatically scaling the number of Pod services according to the CPU utilization and average memory utilization.
The CPU may be the final execution unit for information processing and program running; the memory, also called internal or main storage, is an important computer component used to temporarily hold the CPU's operational data and the data exchanged with external storage such as hard disks.
In the embodiments of the present application, the CPU utilization of the host's CPU and the average memory utilization of its memory are obtained, and the number of Pod services is scaled automatically accordingly. Specifically, if the CPU utilization and/or average memory utilization corresponding to a Pod service falls below a preset usage rate, the number of Pod services is automatically reduced to reduce the number of corresponding vGPUs; if it exceeds the preset usage rate, the number of Pod services is automatically expanded to expand the number of corresponding vGPUs; and when the scaled number of Pod services satisfies the Pod services' preset resource quota, the Pod services are scheduled to the target GPU.
It is worth mentioning that the scaling approach illustrated above is an HPA-based approach that can automatically scale the number of Pod services according to CPU utilization and average memory utilization, as sketched below.
It should be noted that the examples listed above are illustrative only and deliberately kept simple for ease of explanation; in practice, HPA-based scaling may rely on far more utilization metrics than CPU utilization and average memory utilization, and those skilled in the art may auto-scale based on custom metrics provided by other applications according to the actual situation, which the embodiments of the present application do not limit.
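As a toy illustration of such an HPA-style rule — the thresholds and the step-by-one policy are assumptions, and a production HPA derives a desired replica count from a target utilization rather than stepping:

    def hpa_decision(replicas: int, cpu_util: float, mem_util: float,
                     low: float = 0.3, high: float = 0.7) -> int:
        """Scale the Pod count up when CPU or memory runs hot, down when
        both run cold; one Pod service always corresponds to one vGPU."""
        if cpu_util > high or mem_util > high:
            return replicas + 1           # expand Pods, hence vGPUs
        if cpu_util < low and mem_util < low:
            return max(1, replicas - 1)   # reduce Pods, hence vGPUs
        return replicas

    assert hpa_decision(2, 0.85, 0.40) == 3
    assert hpa_decision(2, 0.10, 0.15) == 1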
In an optional embodiment, scaling the number of Pod services according to the usage of GPU memory and GPU compute cores in each vGPU includes:
obtaining the real-time service request traffic of the Pod service;
automatically scaling the number of Pod services according to the real-time service request traffic of the Pod service.
The real-time service request traffic of a Pod service may be the number of server requests while the Pod service is running (queries per second, QPS).
In one example, suppose one Pod service is set to handle 10 real-time service requests per second; when the request count exceeds 10 per second, the TPA-based scaling approach adds one Pod service, and correspondingly one more vGPU, to cope with the burst traffic.
In another example, suppose a Pod service occupies 10 GB of GPU memory and 10% of the compute cores on GPU card A, and the resources allocated on card A can handle 100 concurrent requests; when the request count changes substantially (e.g., rises to 150 or more requests), the GPU sharing system can, following the resources allocated on card A, request resources of the same size at N times the resource quota on card A, card B, ..., or card N, to cope with the burst traffic.
In the embodiments of the present application, the real-time service request traffic of the Pod service is obtained and the number of Pod services is scaled automatically accordingly. Specifically, if the real-time service request traffic of the Pod service exceeds the preset real-time service request traffic, the number of Pod services is automatically expanded to expand the number of corresponding vGPUs; if it falls below the preset traffic, the number of Pod services is automatically reduced to reduce the number of corresponding vGPUs; and when the scaled number of Pod services satisfies the preset resource quota, the Pod services are scheduled to the target GPU.
It should be noted that the scaling approach illustrated above is a TPA-based approach that can automatically scale the number of Pod services according to their real-time service request traffic, for example as sketched below.
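A toy version of such a TPA-style rule, reusing the 10-requests-per-second figure from the example above; the ceiling division and the floor of one replica are assumptions made for the sketch:

    import math

    def tpa_decision(observed_qps: float, qps_per_pod: float = 10.0) -> int:
        """Size the Pod count (and thus the vGPU count) to the observed
        traffic; ceil keeps headroom for fractional overflow."""
        return max(1, math.ceil(observed_qps / qps_per_pod))

    assert tpa_decision(25) == 3   # burst traffic expands Pods and vGPUs
    assert tpa_decision(8) == 1    # quiet traffic shrinks them back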
In an optional embodiment, the k8s cluster further includes a Master node containing a controller, and the controller is used to create the resources corresponding to different types of Pod services.
The Master node is the management node of the k8s cluster; it may be a node deployed on the central server of the k8s cluster, responsible for coordinating the other nodes, e.g., managing the Node nodes. The resources may include three types: Deployment, Service, and Statefulset, where Deployment is used to deploy stateless Pod services, Service is used to deploy Pod services that can scale to zero, and Statefulset is used to deploy stateful Pod services.
In an optional embodiment, the method further includes:
scheduling the Pod services to a target GPU when the scaled number of Pod services satisfies the Pod services' preset resource quota.
The target GPU may be a GPU able to satisfy the Pod service's resource-quota demand.
In a specific implementation, when the scaled number of Pod services satisfies the Pod services' preset resource quota, the hijack scheduler can schedule the Pod services to the target GPU; when the Pod services of multiple user applications run on the same physical GPU, the hijack scheduler ensures during scheduling that the vGPU's computing resources can satisfy the Pods' demands.
Referring to Figure 5, which shows a schematic diagram of resource allocation for multiple services sharing a GPU provided in an embodiment of the present application: for ease of description, "instance" in Figure 5 may be denoted Case, "container" Container, "solution" Case Scenario, and "APP" an application or service; those skilled in the art may adjust this naming according to the actual situation, and the embodiments of the present application impose no limitation on it.
As the figure shows, the GPU resources shared by multiple services may include GPU memory (Memory) and GPU compute cores (Kernel); for example, Pod service a occupies 25% of the GPU memory on card A (Memory-Container A) and 20% of the compute cores (Kernel-Container A). Furthermore, a user may deploy multiple services of different types on the same GPU card; when the GPU memory required by those services fits the actual memory of a single card, GPU utilization can reach 100%. As Case 1 in Figure 5 shows, with one Container corresponding to one Pod service, Container1 is 50%, Container2 25%, Container3 50%, and Container4 75%; as a result, none of the GPUs' memory utilization is at or near full load. The GPU sharing system can then schedule the Pod services with the highest consolidation rate onto the same GPU card: as the Case Scenario in Figure 5 shows, the combinations of Container1 (50%) with Container3 (50%) and of Container2 (25%) with Container4 (75%) each exactly fill the memory allowance of the existing GPUs. It can be understood that the GPU sharing system can schedule the services that consolidate best onto the same card, raising the utilization of cluster GPU resources more efficiently. In addition, as Case 2 in the figure shows, Container2 needs 75% of GPU memory, but the existing GPU would then exceed 100% (Container1 + Container2 = 50% + 75% = 125%); as drawn above GPU0 in Case 2, part of the demand (125% - 100% = 25%) exceeds the card's memory allowance, so the hijack scheduler in the GPU sharing system must place the resources on suitable GPU cards. As the Case Scenario shows, for Case 2 the combinations Container1 + Container4 (50% + 50% = 100%), Container2 + Container3 (75% + 25% = 100%), and Container5 + Container6 (50% + 50% = 100%) are formed; scheduling resources onto suitable GPU cards maximizes the utilization of GPU computing resources, and the GPU sharing system can schedule the services that consolidate best onto the same card, raising the utilization of cluster GPU resources more efficiently.
Referring to Figures 6-7, which show schematic diagrams of the scheduling mode for multi-service shared resources provided in embodiments of the present application: the GPU sharing system can compute the optimal scheduling strategy via a background algorithm to provide pre-deployed services with a minimum-remaining-resource and service-safety scheme, i.e., let Pod services fill a GPU card's resources as fully as possible, reducing the number of GPU cards used and the fragmentation of GPU computing resources; after services are reasonably scheduled to different GPU cards, idle GPU resources can be offered to other services.
As Figure 6 shows, on GPU0 the Pod services' occupied resources are already near 100% (Container1 + Container2 + Container4 = 95%); GPU0 has 5% remaining, but Container3 (45%) needs more than that remaining 5%, so it must be reassigned to a new GPU. Likewise, GPU1 has 55% remaining, but Container5 (80%) needs more than that remaining 55%, so it too must be reassigned to a new GPU. Since the combination of Container3 (45%) on GPU1 and Container5 (80%) on GPU2 would exceed 100% of a GPU's compute-resource allowance, they cannot be consolidated onto one GPU and must be assigned to different GPUs.
It should be noted that APP4 and APP5, corresponding to Container4 (20%) and Container5 (80%) in Figure 6, could also be consolidated onto one card, satisfying one GPU's memory usage quota (less than or equal to 100%); it can be understood that those skilled in the art may compute the optimal scheduling strategy according to the actual situation to provide pre-deployed services with a minimum-remaining-resource and service-safety scheme, which the embodiments of the present application do not limit.
The above resource-scheduling approach applies equally to cross-node allocation. As Figure 7 shows, building on the scenario of Figure 6, when Node 1 adds APP6 (85%) and Node 2 adds APP7 (35%), APP6 (85%) can be scheduled onto the idle GPU3 in Node 1, and APP7 (35%) on Node 2 can be scheduled into the remaining resource space of GPU2 in Node 1. It can be understood that the GPU sharing system computes the optimal scheduling strategy via a background algorithm to provide pre-deployed services with a minimum-remaining-resource and service-safety scheme; after reasonably scheduling services to different GPU cards it can offer idle GPU resources to other services while guaranteeing resource isolation between services. A simplified sketch of this consolidation follows.
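The consolidation of Figures 5-7 behaves like bin packing. The following first-fit-decreasing sketch is a simplification for illustration only; the embodiment's background algorithm also weighs compute cores, service safety, and cross-node placement:

    def pack_services(demands, capacity=100):
        """First-fit-decreasing placement of GPU-memory demands
        (percent of one card) onto as few cards as possible."""
        gpus = []        # remaining capacity per card
        placement = []   # (service name, card index)
        for name, need in sorted(demands, key=lambda d: -d[1]):
            for i, free in enumerate(gpus):
                if need <= free:
                    gpus[i] -= need
                    placement.append((name, i))
                    break
            else:
                gpus.append(capacity - need)
                placement.append((name, len(gpus) - 1))
        return placement, gpus

    # Figure 5, Case 1: 50% + 25% + 50% + 75% fits on two fully used cards.
    placement, free = pack_services(
        [("Container1", 50), ("Container2", 25),
         ("Container3", 50), ("Container4", 75)])
    print(placement, free)  # two cards, 0% left on each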
In the embodiments of the present application, applied to a GPU sharing system deployed with a k8s cluster including Node nodes and Pod services, where a Node includes a GPU whose computing resources include at least GPU memory and GPU compute cores, partitioning the GPU in the Node yields multiple vGPUs, each containing part of the GPU memory and part of the GPU compute cores, with one vGPU corresponding to one Pod service, realizing multiple Pod services running on the same physical GPU while strictly isolating the GPU computing resources; the vGPU information of each vGPU in the Node is collected and registered to obtain the Pod information of the corresponding Pod services, the Pod information is received and saved as multiple files, and the partial GPU memory and partial GPU compute cores in each vGPU are managed according to the files, effectively solving the problem of GPU computing resources exceeding their limits.
Furthermore, through the HPA and TPA scaling approaches, under the condition that the Pod services' preset resource quota is satisfied, optimal GPU resources can be requested for Pod services in a fine-grained resource-scheduling manner; the GPU sharing system can schedule the services that consolidate best onto the same GPU card, raising the utilization of cluster GPU resources more efficiently, and the background algorithm computes the optimal scheduling strategy to provide pre-deployed services with a minimum-remaining-resource and service-safety scheme, reasonably scheduling services to different GPU cards, offering idle GPU resources to other services, and guaranteeing resource isolation between services.
It should be noted that the method embodiments are expressed as series of action combinations for simplicity of description, but those skilled in the art should know that the embodiments of the present application are not limited by the described order of actions, since according to the embodiments of the present application some steps may be performed in other orders or simultaneously. Those skilled in the art should also know that the embodiments described in the specification are some embodiments of the present application and the actions involved are not necessarily required by the embodiments of the present application.
Referring to Figure 8, which shows a structural block diagram of a GPU computing resource management apparatus provided in an embodiment of the present application, applied to a GPU sharing system deployed with a k8s cluster including Node nodes and Pod services, where a Node includes a GPU whose computing resources include at least GPU memory and GPU compute cores, the apparatus may include the following modules:
a GPU partitioning module 801 configured to partition the GPU in the Node into multiple vGPUs, where each vGPU contains part of the GPU memory and part of the GPU compute cores, and one vGPU corresponds to one Pod service;
a Pod information acquisition module 802 configured to collect the vGPU information of each vGPU in the Node and register it to obtain the Pod information of the Pod service corresponding to each vGPU;
a Pod information file generation module 803 configured to receive the Pod information and save it as multiple files;
a resource management module 804 configured to manage the partial GPU memory and partial GPU compute cores in each vGPU according to the files.
In an optional embodiment, the GPU partitioning module 801 is specifically configured to:
when the GPU in the Node is partitioned, allocate the GPU memory and GPU compute cores of the GPU to the vGPUs according to a preset resource quota, obtaining multiple vGPUs each containing part of the GPU memory and part of the GPU compute cores.
In an optional embodiment, the k8s cluster further includes a Master node containing a hijack scheduler, and the Pod information acquisition module 802 is specifically configured to:
collect the vGPU information of each vGPU in the Node;
send the vGPU information to the hijack scheduler in the Master node for registration, obtaining the Pod information of the Pod service corresponding to each vGPU.
In an optional embodiment, the Pod information file generation module 803 is specifically configured to:
receive the Pod information, returned by the hijack scheduler, of the Pod service corresponding to each vGPU and save it as multiple files.
In an optional embodiment, the resource management module 804 is specifically configured to:
save the usage of GPU memory and GPU compute cores corresponding to the vGPU in the Pod information as a file;
control the process of the Pod service according to the usage of GPU memory and GPU compute cores corresponding to the vGPU in the file.
Since the apparatus embodiments are essentially similar to the method embodiments, their description is relatively brief; for relevant points, see the corresponding description of the method embodiments.
In addition, an embodiment of the present application further provides an electronic device, including a processor, a memory, and a computer program stored on the memory and runnable on the processor, where the computer program, when executed by the processor, implements the processes of the above GPU computing resource management method embodiments and achieves the same technical effects; to avoid repetition, details are not repeated here.
Figure 9 is a schematic structural diagram of a non-volatile readable storage medium provided in an embodiment of the present application.
An embodiment of the present application further provides a non-volatile readable storage medium 901 on which a computer program is stored; when executed by a processor, the computer program implements the processes of the above GPU computing resource management method embodiments and achieves the same technical effects, which are not repeated here to avoid repetition. The non-volatile readable storage medium 901 is, for example, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
Figure 10 is a schematic diagram of the hardware structure of an electronic device implementing various embodiments of the present application.
The electronic device 1000 includes, but is not limited to, a radio frequency unit 1001, a network module 1002, an audio output unit 1003, an input unit 1004, a sensor 1005, a display unit 1006, a user input unit 1007, an interface unit 1008, a memory 1009, a processor 1010, a power supply 1011, and other components. Those skilled in the art will understand that the electronic device structure shown in Figure 10 does not limit the electronic device, which may include more or fewer components than shown, combine certain components, or arrange components differently. In the embodiments of the present application, electronic devices include, but are not limited to, mobile phones, tablets, laptops, palmtop computers, in-vehicle terminals, wearables, pedometers, and the like.
It should be understood that, in the embodiments of the present application, the radio frequency unit 1001 may be used to receive and send signals during information transmission or a call; specifically, downlink data from a base station is received and passed to the processor 1010 for processing, and uplink data is sent to the base station. Typically, the radio frequency unit 1001 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low-noise amplifier, a duplexer, and the like. In addition, the radio frequency unit 1001 can also communicate with networks and other devices through a wireless communication system.
The electronic device provides the user with wireless broadband Internet access through the network module 1002, for example helping the user send and receive email, browse web pages, and access streaming media.
The audio output unit 1003 can convert audio data received by the radio frequency unit 1001 or the network module 1002, or stored in the memory 1009, into an audio signal and output it as sound. Moreover, the audio output unit 1003 can also provide audio output related to specific functions performed by the electronic device 1000 (e.g., call-signal reception sound, message reception sound). The audio output unit 1003 includes a speaker, a buzzer, a receiver, and the like.
The input unit 1004 is used to receive audio or video signals. The input unit 1004 may include a graphics processing unit (GPU) 10041 and a microphone 10042; the graphics processor 10041 processes image data of static pictures or video obtained by an image capture device (such as a camera) in video capture mode or image capture mode. The processed image frames may be displayed on the display unit 1006, stored in the memory 1009 (or other storage medium), or sent via the radio frequency unit 1001 or the network module 1002. The microphone 10042 can receive sound and process it into audio data; in phone-call mode, the processed audio data can be converted into a format transmittable to a mobile communication base station via the radio frequency unit 1001 and output.
The electronic device 1000 further includes at least one sensor 1005, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor includes an ambient light sensor and a proximity sensor; the ambient light sensor can adjust the brightness of the display panel 10061 according to the brightness of ambient light, and the proximity sensor can turn off the display panel 10061 and/or the backlight when the electronic device 1000 is moved to the ear. As a kind of motion sensor, an accelerometer can detect the magnitude of acceleration in all directions (generally three axes) and, when static, the magnitude and direction of gravity, and can be used to recognize the electronic device's posture (e.g., portrait/landscape switching, related games, magnetometer calibration) and vibration-recognition functions (e.g., pedometer, tapping); the sensor 1005 may also include a fingerprint sensor, a pressure sensor, an iris sensor, a molecular sensor, a gyroscope, a barometer, a hygrometer, a thermometer, an infrared sensor, etc., which are not detailed here.
The display unit 1006 is used to display information input by the user or provided to the user. The display unit 1006 may include a display panel 10061, which may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED), or the like.
The user input unit 1007 may be used to receive input numeric or character information and to generate key-signal input related to the user settings and function control of the electronic device. Specifically, the user input unit 1007 includes a touch panel 10071 and other input devices 10072. The touch panel 10071, also called a touch screen, can collect the user's touch operations on or near it (such as operations by the user with a finger, stylus, or any suitable object or accessory on or near the touch panel 10071). The touch panel 10071 may include two parts, a touch detection device and a touch controller: the touch detection device detects the user's touch orientation and the signal brought by the touch operation and passes the signal to the touch controller; the touch controller receives touch information from the touch detection device, converts it into contact coordinates, sends them to the processor 1010, and receives and executes the commands sent by the processor 1010. In addition, the touch panel 10071 may be implemented in resistive, capacitive, infrared, surface-acoustic-wave, and other types. Besides the touch panel 10071, the user input unit 1007 may also include other input devices 10072, which may include, but are not limited to, a physical keyboard, function keys (such as volume keys and switch keys), a trackball, a mouse, and a joystick, not detailed here.
Further, the touch panel 10071 may cover the display panel 10061; when the touch panel 10071 detects a touch operation on or near it, it passes the operation to the processor 1010 to determine the type of touch event, after which the processor 1010 provides the corresponding visual output on the display panel 10061 according to the type of touch event. Although in Figure 10 the touch panel 10071 and the display panel 10061 are two independent components implementing the electronic device's input and output functions, in some embodiments the touch panel 10071 and the display panel 10061 may be integrated to implement the input and output functions, which is not limited here.
The interface unit 1008 is an interface for connecting an external device to the electronic device 1000. For example, external devices may include wired or wireless headset ports, external power (or battery charger) ports, wired or wireless data ports, memory card ports, ports for connecting devices with identification modules, audio input/output (I/O) ports, video I/O ports, headphone ports, and the like. The interface unit 1008 may be used to receive input (e.g., data information, power) from an external device and transmit the received input to one or more elements within the electronic device 1000, or to transfer data between the electronic device 1000 and the external device.
The memory 1009 may be used to store software programs and various data. The memory 1009 may mainly include a program storage area and a data storage area: the program storage area may store the operating system and at least one application required by a function (such as a sound playback function or an image playback function), and the data storage area may store data created according to the use of the phone (such as audio data and a phone book). In addition, the memory 1009 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash device, or other volatile solid-state storage device.
The processor 1010 is the control center of the electronic device; it connects all parts of the entire electronic device using various interfaces and lines and performs the electronic device's functions and processes data by running or executing the software programs and/or modules stored in the memory 1009 and calling the data stored in the memory 1009, thereby monitoring the electronic device as a whole. The processor 1010 may include one or more processing units; in some embodiments of the present invention, the processor 1010 may integrate an application processor, which mainly handles the operating system, user interface, applications, and so on, and a modem processor, which mainly handles wireless communication. It can be understood that the modem processor may also not be integrated into the processor 1010.
The electronic device 1000 may further include a power supply 1011 (such as a battery) supplying power to the components; in some embodiments of the present invention, the power supply 1011 may be logically connected to the processor 1010 through a power management system, thereby implementing charging, discharging, and power-consumption management through the power management system.
In addition, the electronic device 1000 includes some functional modules not shown, which are not detailed here.
It should be noted that, herein, the terms "comprise", "include", or any other variant are intended to cover non-exclusive inclusion, so that a process, method, article, or apparatus including a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to the process, method, article, or apparatus. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of additional identical elements in the process, method, article, or apparatus that includes the element.
Through the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus the necessary general-purpose hardware platform, or of course by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product; the computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions for causing a terminal (which may be a mobile phone, a computer, a server, an air conditioner, a network device, etc.) to execute the methods of the various embodiments of the present application.
The embodiments of the present application have been described above with reference to the drawings, but the present application is not limited to the specific implementations described; the above specific implementations are merely illustrative, not restrictive. Inspired by the present application, those of ordinary skill in the art may make many further forms without departing from the purpose of the present application and the scope protected by the claims, all of which fall within the protection of the present application.
Those of ordinary skill in the art may realize that the units and algorithm steps of each example described in connection with the embodiments disclosed herein can be implemented in electronic hardware or a combination of computer software and electronic hardware. Whether these functions are executed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled professionals may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present application.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here.
In the embodiments provided by the present application, it should be understood that the disclosed apparatuses and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division of units is only a division by logical function, and in actual implementation there may be other divisions, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. Additionally, the mutual coupling, direct coupling, or communication connections shown or discussed may be indirect coupling or communication connections through interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, i.e., they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, or a part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods of the various embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, ROM, RAM, a magnetic disk, or an optical disc.
The above are only specific implementations of the present application, but the protection scope of the present application is not limited thereto; any person skilled in the art can easily think of changes or substitutions within the technical scope disclosed by the present application, all of which shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (20)

  1. A method for managing GPU computing resources, applied to a GPU sharing system deployed with a k8s cluster, the k8s cluster comprising Node nodes and Pod services, wherein a Node comprises a GPU whose corresponding GPU computing resources comprise at least GPU memory and GPU compute cores, the method comprising:
    partitioning the GPU in the Node into multiple vGPUs, wherein each vGPU contains part of the GPU memory and part of the GPU compute cores of the GPU, and one vGPU corresponds to one Pod service;
    collecting vGPU information of each vGPU in the Node and registering the vGPU information to obtain Pod information of the Pod service corresponding to each vGPU;
    receiving the Pod information and saving the Pod information as multiple files;
    managing the partial GPU memory and partial GPU compute cores in each vGPU according to the files.
  2. The method according to claim 1, wherein partitioning the GPU in the Node into multiple vGPUs comprises:
    when the GPU in the Node is partitioned, allocating the GPU memory and GPU compute cores of the GPU to the vGPUs according to a preset resource quota, obtaining multiple vGPUs each containing part of the GPU memory and part of the GPU compute cores of the GPU.
  3. The method according to claim 1, wherein the vGPU information comprises at least a vGPU count and a vGPU memory size of the vGPU.
  4. The method according to claim 1, wherein the k8s cluster further comprises a Master node, the Master node comprises a hijack scheduler, and collecting the vGPU information of each vGPU in the Node and registering the vGPU information to obtain the Pod information of the Pod service corresponding to each vGPU comprises:
    collecting the vGPU information of each vGPU in the Node;
    sending the vGPU information to the hijack scheduler in the Master node and registering the vGPU information to obtain the Pod information of the Pod service corresponding to each vGPU.
  5. The method according to claim 4, wherein receiving the Pod information and saving the Pod information as multiple files comprises:
    receiving the Pod information, returned by the hijack scheduler, of the Pod service corresponding to each vGPU, and saving the Pod information as multiple files.
  6. The method according to claim 4, wherein the Pod information comprises at least usage of the GPU memory and usage of the GPU compute cores in the vGPU.
  7. The method according to claim 6, wherein managing the partial GPU memory and partial GPU compute cores in each vGPU according to the files comprises:
    saving the usage of the GPU memory and the usage of the GPU compute cores corresponding to the vGPU in the Pod information as a file;
    controlling a process of the Pod service according to the usage of the GPU memory and the usage of the GPU compute cores corresponding to the vGPU in the file.
  8. The method according to claim 7, wherein controlling the process of the Pod service according to the usage of the GPU memory and the usage of the GPU compute cores corresponding to the vGPU in the file comprises:
    if the usage of the GPU memory and the usage of the GPU compute cores corresponding to the vGPU in the file exceed the preset resource quota, controlling the GPU memory and GPU compute cores in the vGPU so as to terminate the process of the Pod service;
    if the usage of the GPU memory and the usage of the GPU compute cores corresponding to the vGPU in the file satisfy the preset resource quota, letting the process of the Pod service run normally.
  9. The method according to claim 1, further comprising:
    scaling a number of the Pod services according to the usage of the GPU memory and the usage of the GPU compute cores in each vGPU.
  10. The method according to claim 9, wherein the GPU is located on a host, the host comprises at least a CPU and memory, the Pod service is bound to the CPU and the memory, and scaling the number of the Pod services according to the usage of the GPU memory and the usage of the GPU compute cores in each vGPU comprises:
    obtaining a CPU utilization corresponding to the CPU of the host and an average memory utilization corresponding to the memory;
    automatically scaling the number of the Pod services according to the CPU utilization and the average memory utilization.
  11. The method according to claim 10, wherein automatically scaling the number of the Pod services according to the CPU utilization and the average memory utilization comprises:
    if the CPU utilization and/or the average memory utilization corresponding to the Pod service is lower than a preset usage rate, automatically reducing the number of the Pod services to reduce the number of vGPUs corresponding to the Pod services;
    if the CPU utilization and/or the average memory utilization corresponding to the Pod service is higher than the preset usage rate, automatically expanding the number of the Pod services to expand the number of vGPUs corresponding to the Pod services.
  12. The method according to claim 9, wherein scaling the number of the Pod services according to the usage of the GPU memory and the usage of the GPU compute cores in each vGPU comprises:
    obtaining real-time service request traffic of the Pod service;
    automatically scaling the number of the Pod services according to the real-time service request traffic of the Pod service.
  13. The method according to claim 12, wherein automatically scaling the number of the Pod services according to the real-time service request traffic of the Pod service comprises:
    if the real-time service request traffic of the Pod service is greater than preset real-time service request traffic, automatically expanding the number of the Pod services to expand the number of vGPUs corresponding to the Pod services;
    if the real-time service request traffic of the Pod service is less than the preset real-time service request traffic, automatically reducing the number of the Pod services to reduce the number of vGPUs corresponding to the Pod services.
  14. The method according to any one of claims 9-13, further comprising:
    scheduling the Pod services to a target GPU when the number of the Pod services after automatic scaling satisfies the preset resource quota of the Pod services.
  15. The method according to claim 1, wherein the k8s cluster further comprises a Master node, the Master node comprises a controller, and the controller is configured to create resources corresponding to different types of Pod services.
  16. The method according to claim 15, wherein the resources comprise at least a Deployment, a Service, and a Statefulset.
  17. The method according to claim 16, wherein the Deployment is used to deploy stateless Pod services, the Service is used to deploy Pod services scalable to zero, and the Statefulset is used to deploy stateful Pod services.
  18. An apparatus for managing GPU computing resources, applied to a GPU sharing system deployed with a k8s cluster, the k8s cluster comprising Node nodes and Pod services, wherein a Node comprises a GPU whose corresponding GPU computing resources comprise at least GPU memory and GPU compute cores, the apparatus comprising:
    a GPU partitioning module configured to partition the GPU in the Node into multiple vGPUs, wherein each vGPU contains part of the GPU memory and part of the GPU compute cores of the GPU, and one vGPU corresponds to one Pod service;
    a Pod information acquisition module configured to collect vGPU information of each vGPU in the Node and register the vGPU information to obtain Pod information of the Pod service corresponding to each vGPU;
    a Pod information file generation module configured to receive the Pod information and save the Pod information as multiple files;
    a resource management module configured to manage the partial GPU memory and partial GPU compute cores in each vGPU according to the files.
  19. An electronic device, comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another via the communication bus;
    the memory is configured to store a computer program;
    the processor is configured to implement the method according to any one of claims 1-17 when executing the program stored on the memory.
  20. A non-volatile readable storage medium having instructions stored thereon which, when executed by one or more processors, cause the processors to perform the method according to any one of claims 1-17.
PCT/CN2023/106827 2022-12-06 2023-07-11 Management method and apparatus for GPU computing resources, electronic device, and readable storage medium WO2024119823A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211553120.5 2022-12-06
CN202211553120.5A CN115562878B (zh) 2022-12-06 2022-12-06 Management method and apparatus for GPU computing resources, electronic device, and readable storage medium

Publications (1)

Publication Number Publication Date
WO2024119823A1

Family

ID=84770770

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/106827 WO2024119823A1 (zh) 2022-12-06 2023-07-11 Gpu计算资源的管理方法、装置、电子设备及可读存储介质

Country Status (2)

Country Link
CN (1) CN115562878B (zh)
WO (1) WO2024119823A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118656228A (zh) * 2024-08-22 2024-09-17 山东浪潮科学研究院有限公司 一种图形处理器调度方法、装置、设备及存储介质

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115601221B (zh) * 2022-11-28 2023-05-23 苏州浪潮智能科技有限公司 一种资源的分配方法、装置和一种人工智能训练系统
CN115562878B (zh) * 2022-12-06 2023-06-02 苏州浪潮智能科技有限公司 Gpu计算资源的管理方法、装置、电子设备及可读存储介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111506404A (zh) * 2020-04-07 2020-08-07 上海德拓信息技术股份有限公司 一种基于Kubernetes的共享GPU调度方法
CN111638953A (zh) * 2020-05-21 2020-09-08 贝壳技术有限公司 一种实现gpu虚拟化的方法、装置和存储介质
CN111966456A (zh) * 2020-08-07 2020-11-20 苏州浪潮智能科技有限公司 一种容器显存动态分配方法、装置、设备
US20210110506A1 (en) * 2019-10-15 2021-04-15 Vmware, Inc. Dynamic kernel slicing for vgpu sharing in serverless computing systems
CN115309556A (zh) * 2022-08-10 2022-11-08 中国联合网络通信集团有限公司 微服务扩展方法、装置、服务器及存储介质
CN115562878A (zh) * 2022-12-06 2023-01-03 苏州浪潮智能科技有限公司 Gpu计算资源的管理方法、装置、电子设备及可读存储介质

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110795249A (zh) * 2019-10-30 2020-02-14 亚信科技(中国)有限公司 基于mesos容器化平台的gpu资源调度方法及装置
CN113157428B (zh) * 2020-01-07 2022-04-08 阿里巴巴集团控股有限公司 基于容器的资源调度方法、装置及容器集群管理装置
CN111538586A (zh) * 2020-01-23 2020-08-14 中国银联股份有限公司 集群gpu资源管理调度系统、方法以及计算机可读存储介质
CN113127192B (zh) * 2021-03-12 2023-02-28 山东英信计算机技术有限公司 一种多个服务共享同一个gpu的方法、系统、设备及介质
CN114565502A (zh) * 2022-03-08 2022-05-31 重庆紫光华山智安科技有限公司 Gpu资源管理方法、调度方法、装置、电子设备及存储介质

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210110506A1 (en) * 2019-10-15 2021-04-15 Vmware, Inc. Dynamic kernel slicing for vgpu sharing in serverless computing systems
CN111506404A (zh) * 2020-04-07 2020-08-07 上海德拓信息技术股份有限公司 一种基于Kubernetes的共享GPU调度方法
CN111638953A (zh) * 2020-05-21 2020-09-08 贝壳技术有限公司 一种实现gpu虚拟化的方法、装置和存储介质
CN111966456A (zh) * 2020-08-07 2020-11-20 苏州浪潮智能科技有限公司 一种容器显存动态分配方法、装置、设备
CN115309556A (zh) * 2022-08-10 2022-11-08 中国联合网络通信集团有限公司 微服务扩展方法、装置、服务器及存储介质
CN115562878A (zh) * 2022-12-06 2023-01-03 苏州浪潮智能科技有限公司 Gpu计算资源的管理方法、装置、电子设备及可读存储介质

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
CSDN CLOUD COMPUTING: "Fine-Grained High-Performance GPU Resource Sharing Realized by means of Inspur AIStation", CSDN BLOG, 12 April 2022 (2022-04-12), XP093178013, Retrieved from the Internet <URL:https://blog.csdn.net/FL63Zv9Zou86950w/article/details/124131300> *

Also Published As

Publication number Publication date
CN115562878B (zh) 2023-06-02
CN115562878A (zh) 2023-01-03

Similar Documents

Publication Publication Date Title
WO2024119823A1 (zh) Gpu计算资源的管理方法、装置、电子设备及可读存储介质
US10437631B2 (en) Operating system hot-switching method and apparatus and mobile terminal
EP3531290B1 (en) Data backup method, apparatus, electronic device, storage medium, and system
CN111338745B (zh) 一种虚拟机的部署方法、装置及智能设备
CN107590057B (zh) 冻屏监测与解决方法、移动终端及计算机可读存储介质
WO2015035870A1 (zh) 多cpu调度方法及装置
WO2024037068A1 (zh) 任务调度方法、电子设备及计算机可读存储介质
CN114817120A (zh) 一种跨域数据共享方法、系统级芯片、电子设备及介质
CN106708554A (zh) 程序运行方法及装置
WO2019128537A1 (zh) 应用冻结方法、计算机设备和计算机可读存储介质
CN116578422B (zh) 资源分配方法和电子设备
WO2021135574A1 (zh) 数据存储方法、装置及终端设备
WO2021109703A1 (zh) 数据处理方法、芯片、设备及存储介质
WO2019128574A1 (zh) 信息处理方法、装置、计算机设备和计算机可读存储介质
WO2019128573A1 (zh) 信息处理方法、装置、计算机设备和计算机可读存储介质
US20140237017A1 (en) Extending distributed computing systems to legacy programs
CN116208613A (zh) 云主机的迁移方法、装置、电子设备及存储介质
US9436505B2 (en) Power management for host with devices assigned to virtual machines
CN115237618A (zh) 请求处理方法、装置、计算机设备及可读存储介质
WO2015176422A1 (zh) 一种基于安卓系统的应用管理方法及其装置
US11221875B2 (en) Cooperative scheduling of virtual machines
CN111813541A (zh) 一种任务调度方法、装置、介质和设备
CN110045811B (zh) 应用程序处理方法和装置、电子设备、计算机可读存储介质
CN115373865A (zh) 一种并发线程管理方法、装置、电子设备和存储介质
CN108549573B (zh) 一种内存模型的计算方法、装置及计算机可读存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23899408

Country of ref document: EP

Kind code of ref document: A1