WO2022188498A1 - Distributed container scheduling method and system based on a shared GPU - Google Patents

Distributed container scheduling method and system based on a shared GPU

Info

Publication number
WO2022188498A1
WO2022188498A1 PCT/CN2021/138799 CN2021138799W WO2022188498A1 WO 2022188498 A1 WO2022188498 A1 WO 2022188498A1 CN 2021138799 W CN2021138799 W CN 2021138799W WO 2022188498 A1 WO2022188498 A1 WO 2022188498A1
Authority
WO
WIPO (PCT)
Prior art keywords
container
gpu
node
scheduled
scheduling
Prior art date
Application number
PCT/CN2021/138799
Other languages
English (en)
French (fr)
Inventor
张登银
李俊江
刘子捷
程义
寇英杰
朱虹
严伟丹
Original Assignee
南京邮电大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 南京邮电大学 filed Critical 南京邮电大学
Priority to US17/701,637 priority Critical patent/US20220291956A1/en
Publication of WO2022188498A1 publication Critical patent/WO2022188498A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • G06F2009/4557Distribution of virtual machine instances; Migration and load balancing
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the invention relates to a distributed container scheduling method and system based on a shared GPU, belonging to the technical field of cloud computing.
  • the present invention proposes a distributed container scheduling method and system based on a shared GPU, which realizes monitoring of container creation events, container scheduling queue generation, and container scheduling. The present invention can select the most suitable node for container scheduling according to the needs of the container to be scheduled, ensure the load balance of the nodes in the cluster, and improve the resource utilization rate of the cluster.
  • the present invention adopts the following technical means:
  • the present invention proposes a distributed container scheduling method based on a shared GPU, including the following steps:
  • when the container scheduling queue is empty, no operation is performed and the system waits for a successfully verified container to be added to the scheduling queue; when the container scheduling queue is not empty, the containers to be scheduled are read from the container scheduling queue in order, the optimal node corresponding to the container to be scheduled is selected from the Kubernetes cluster, and a container scheduling binary group is generated;
  • the container to be scheduled is scheduled to the best node to complete the distributed container scheduling.
  • the method for verifying the created container is:
  • the method for updating the container scheduling queue by using the successfully verified container is as follows:
  • the method for selecting the best node corresponding to the container to be scheduled from the Kubernetes cluster is as follows:
  • node selection and filtering are performed to obtain the container schedulable node
  • the container schedulable node is regarded as the best node
  • the score of each container schedulable node is calculated based on the GPU data of the container schedulable nodes, and the container schedulable node with the highest score is selected as the best node.
  • when the container to be scheduled carries the GPU quantity label, all nodes in the Kubernetes cluster are traversed; when the number of GPUs held by a node is greater than or equal to the GPU quantity label value, the node is marked as a once-schedulable node. When the to-be-scheduled container does not carry the GPU quantity label, all nodes in the Kubernetes cluster are marked as once-schedulable nodes, and the GPU quantity label value of the container to be scheduled is set to 1;
  • when the container to be scheduled carries the GPU memory tag, all once-schedulable nodes are traversed; if the free memory of a GPU in a once-schedulable node is greater than the GPU memory tag value of the to-be-scheduled container, that GPU is regarded as a GPU meeting the first-level requirement; if the number of GPUs meeting the first-level requirement is greater than or equal to the GPU quantity label value of the container to be scheduled, the once-schedulable node is marked as a secondary schedulable node. When the to-be-scheduled container does not carry the GPU memory tag, all once-schedulable nodes are marked as secondary schedulable nodes;
  • when the container to be scheduled carries the GPU clock frequency label, all secondary schedulable nodes are traversed; if the clock frequency of a GPU in a secondary schedulable node is greater than the GPU clock frequency label value, that GPU is regarded as a GPU meeting the second-level requirement; if the number of GPUs meeting the second-level requirement is greater than or equal to the GPU quantity label value of the container to be scheduled, the secondary schedulable node is marked as a container schedulable node. When the container to be scheduled does not carry the GPU clock frequency label, all secondary schedulable nodes are marked as container schedulable nodes;
  • the calculation formula for calculating the score of each container schedulable node based on the GPU data of the container schedulable node is as follows:
  • Score represents the score of the container schedulable node
  • FilteredGPUScore represents the GPU score of all GPUs in the container schedulable node that meet the requirements of the to-be-scheduled container
  • the to-be-scheduled container requirement is the GPU memory label and the GPU clock frequency label of the to-be-scheduled container
  • FilteredGPUWeight is the weight of the GPU score
  • RealScore represents the video memory score of all GPUs in the container schedulable node
  • RealWeight is the weight of the video memory score
  • AllocateScore represents the quota score of the container schedulable node
  • AllocateWeight is the weight of the quota score
  • FilteredGPUScorePerCard represents the GPU score of the GPU in the container schedulable node that meets the requirements of the container to be scheduled
  • Bandwith represents the GPU video memory bit width
  • MaxBandwith represents the maximum GPU video memory bit width of all GPUs in the container schedulable node that meet the requirements of the to-be-scheduled container
  • Clock represents the GPU clock frequency
  • MaxClock represents the maximum GPU clock frequency of all GPUs in the container schedulable node that meets the requirements of the container to be scheduled
  • Power represents the GPU power
  • MaxPower represents the maximum GPU power of all GPUs in the container schedulable node that meet the requirements of the container to be scheduled
  • Core indicates the number of GPU cores
  • MaxCore indicates the maximum number of GPU cores of all GPUs in the container schedulable node that meet the requirements of the container to be scheduled
  • FreeMemory indicates the GPU free video memory
  • MaxFreeMemory indicates the maximum GPU free video memory of all GPUs in the container schedulable node that meet the requirements of the container to be scheduled
  • TotalMemory indicates the total amount of GPU video memory
  • MaxTotalMemory indicates the maximum total amount of GPU video memory of all GPUs in the container schedulable node that meets the requirements of the container to be scheduled;
  • FreeMemorySum represents the sum of GPU free video memory of all GPUs in the container schedulable node
  • TotalMemorySum represents the sum of the total GPU video memory of all GPUs in the container schedulable node
  • AllocateMemorySum represents the total amount of video memory requested by the container to be scheduled, that is, the product of the GPU memory tag value of the container to be scheduled and the GPU quantity tag value.
  • the container scheduling binary group is composed of the container to be scheduled and the node name of the optimal node.
  • the specific operation of scheduling the container to be scheduled to the optimal node according to the container scheduling binary group is as follows:
  • according to the container scheduling two-tuple, the node name field of the container to be scheduled is set to the node name of the best node in the two-tuple, and the node name field of the container in the Kubernetes API-Server is asynchronously updated.
  • the present invention provides a distributed container scheduling system based on a shared GPU, including:
  • Container creation event listener which is used to monitor container creation events in Kubernetes API-Server, and perform container verification after monitoring new container creation events
  • the container scheduling queue is used to store the containers to be scheduled according to the priority
  • the container scheduler is used to read the container to be scheduled from the queue head of the container scheduling queue, and select the best node corresponding to the container to be scheduled from the Kubernetes cluster, and generate a container scheduling two-tuple;
  • the container scheduling executor is used to update the node name field of the container to be scheduled in the Kubernetes API-Server according to the container scheduling tuple to complete the container scheduling operation;
  • the communication module is used to construct the communication between the container creation event listener, the container scheduling queue, the container scheduler, the container scheduling executor and the Kubernetes API-Server according to the system configuration file.
  • system configuration file includes the IP address, port number, TLS public key and TLS private key of the Kubernetes API-Server;
  • the communication link is authenticated according to the TLS public key and TLS private key. After the authentication is successful, the communication construction is completed.
  • the present invention proposes a shared GPU-based distributed container scheduling method and system.
  • the present invention selects nodes based on the requirements of the container's GPU quantity, video memory, clock frequency, etc.
  • the fine-grained index status of the node can be reasonably scheduled, so that multi-container tasks can share the GPU, and the to-be-scheduled container can be scheduled to the most suitable node by considering the graphics card index status, free video memory and quota in the node at the same time, so as to improve the cluster.
  • GPU resource utilization to adapt to the computing needs of complex scenarios.
  • the present invention can ensure the load balance of the nodes in the cluster, enhance the utilization rate of GPU resources in the distributed container cluster, better meet the scheduling requirements, and allow the container to have a faster task completion time.
  • FIG. 1 is a flow chart of steps of a shared GPU-based distributed container scheduling method according to the present invention
  • FIG. 2 is a flowchart of an operation of updating a container scheduling queue in an embodiment of the present invention
  • FIG. 3 is an operation flowchart of node selection and filtering in an embodiment of the present invention.
  • FIG. 4 is a schematic structural diagram of a shared GPU-based distributed container scheduling system according to the present invention.
  • FIG. 5 is a working principle diagram of a distributed container scheduling system in an embodiment of the present invention.
  • FIG. 6 is a schematic diagram of changes in load balance entropy when different schedulers perform container scheduling in an embodiment of the present invention
  • FIG. 7 is a schematic diagram of changes in scheduling time when different schedulers perform container scheduling in an embodiment of the present invention.
  • 1 is the container creation event listener
  • 2 is the container scheduling queue
  • 3 is the container scheduler
  • 4 is the container scheduling executor
  • 5 is the communication module.
  • the present invention proposes a distributed container scheduling method based on a shared GPU, as shown in Figure 1, which specifically includes the following steps:
  • Step A Monitor the container creation event in the Kubernetes API-Server in real time, and verify the created container after monitoring the new container creation event;
  • Step B updating the container scheduling queue with the container that has been successfully verified
  • Step C When the container scheduling queue is empty, do not operate, and wait for the successful container to be added to the scheduling queue; when the container scheduling queue is not empty, read the to-be-scheduled containers from the container scheduling queue in order, and from the Kubernetes cluster Select the best node corresponding to the container to be scheduled, and generate a container scheduling binary group;
  • step D the container to be scheduled is scheduled to the optimal node according to the container scheduling binary group to complete the distributed container scheduling.
  • step A use the network to communicate with the Kubernetes API-Server, and monitor container creation events in the Kubernetes API-Server in real time.
  • system users can send requests to the Kubernetes API-Server through kubectl to create GPU containers and generate container creation events; before creation, the container image name, container scheduling priority label, container startup command, container startup parameters, the GPU labels used by the container, and so on can be set manually, wherein the GPU tags include GPU quantity tags, GPU video memory tags, and GPU clock frequency tags.
  • the Kubernetes API-Server can instantiate (create) the container object according to the container creation event and perform container storage. When a new container creation event is detected, each field information of the container object created by the container creation event needs to be acquired, and the container is verified according to the field information.
  • Step A01 perform GPU label verification according to the field information of the created container: determine whether the container contains a GPU label; when the container does not contain any GPU label, the GPU label verification fails, and the verification failure time and the corresponding error information (does not contain GPU labels) are written to the Kubernetes event log so that the error information can be found later; when the container contains one or more GPU labels, the GPU label verification succeeds and subsequent operations can be performed.
  • Step A02 when the GPU tag verification is successful, perform scheduler name verification according to the field information of the created container: determine whether the scheduler field of the container is the name of the system scheduler, and when the scheduler field is not the name of the system scheduler, the scheduler If the name verification fails, write the verification failure time and the corresponding error information (the container's scheduler field) to the Kubernetes event log; otherwise, the scheduler name verification succeeds, the container verification is completed, and the container verification succeeds.
  • step B the successfully verified container will be sent to the container scheduling queue, and the container scheduling queue will be updated, as shown in Figure 2.
  • the specific operations are as follows:
  • Step B01 sending the successfully verified container into the container scheduling queue from the end of the queue to generate the container scheduling queue at the current moment.
  • Step B02 Obtain the preset priority label of each container in the container scheduling queue at the current moment, sort all the containers in the container scheduling queue from high to low according to the priority label, and place the container with a higher priority on the bottom of the container scheduling queue. At the head of the queue, containers with lower priority are placed at the tail of the queue to complete the update of the container scheduling queue.
  • step C is:
  • Step C01 monitor in real time whether the container scheduling queue is empty, when the container scheduling queue is empty, do not operate, and wait for the successful container to be added to the scheduling queue; when the container scheduling queue is not empty, read from the queue head of the container scheduling queue Take a container to be scheduled and obtain the GPU tag of the container to be scheduled.
  • the present invention initiates a request to the Kubernetes API-Server to obtain the GPU data of all nodes in the current Kubernetes cluster, including: the number of GPUs held by the nodes, the video memory bit width of each GPU held by the nodes, the GPU clock frequency, the GPU Number of cores, total GPU memory, total available GPU memory, GPU power, etc.
  • Step C02 Select and filter nodes according to the GPU data of each node in the Kubernetes cluster and the GPU label of the container to be scheduled, to obtain a node that can be scheduled for the container.
  • Step C03 when the number of container schedulable nodes is 1, the container schedulable node is regarded as the best node.
  • Step C04 when the number of container schedulable nodes is greater than 1, calculate the score of each container schedulable node based on the GPU data of the container schedulable node, and select the container schedulable node with the highest score as the best node.
  • Step C05 using the node name of the container to be scheduled and the node name of the best node to form a container scheduling two-tuple.
  • a container schedulable node is a node in the Kubernetes cluster that meets the requirements of the container to be scheduled. As shown in Figure 3, the present invention mainly filters the container schedulable node from three dimensions:
  • Step C021 filter nodes according to the GPU quantity label: when the container to be scheduled carries the GPU quantity label, traverse all nodes in the Kubernetes cluster, and when the GPU quantity held by the node is greater than or equal to the GPU quantity label value, then mark the node as One-time schedulable node; when the container to be scheduled does not carry the GPU quantity label, mark all nodes in the Kubernetes cluster as once-schedulable nodes, and set the GPU quantity label value of the to-be-scheduled container to 1.
  • Step C022 on the basis of C021, perform node screening according to the GPU video memory label: when the container to be scheduled carries the GPU video memory label, traverse all the once-schedulable nodes; if the free video memory of the GPU in the once-schedulable node is greater than the GPU video memory of the to-be-scheduled container
  • the label value, the GPU is regarded as the GPU that meets the first-level requirements; if the number of GPUs that meet the first-level requirements is greater than or equal to the GPU number label value of the container to be scheduled (the default value in C021 that does not carry the GPU number label is 1) , mark the primary schedulable node as a secondary schedulable node; when the container to be scheduled does not carry a GPU memory tag, mark all primary schedulable nodes as secondary schedulable nodes.
  • the GPU clock frequency tag value of the container, the GPU is regarded as the GPU that meets the second-level requirements; if the number of GPUs that meet the second-level requirements is greater than or equal to the GPU number tag value of the container to be scheduled, the secondary schedulable node Mark as container schedulable nodes; when the container to be scheduled does not carry the GPU clock frequency label, mark all secondary schedulable nodes as container schedulable nodes.
  • Step C024 If the container schedulable node is empty after filtering in three dimensions, write the current time and scheduling error information (the container schedulable node is empty) into the Kubernetes event log.
  • the score of the container schedulable node in step C04 is mainly divided into three parts: 1. the GPU score of the GPUs that meet the requirements of the to-be-scheduled container, where the to-be-scheduled container requirement is the GPU memory label and the GPU clock frequency label of the to-be-scheduled container; 2. the video memory score of all GPUs on the node; 3. the quota score of the node.
  • FilteredGPUScore represents the GPU score of all GPUs in the container schedulable node that meet the requirements of the container to be scheduled
  • FilteredGPUScorePerCard represents the GPU score of one GPU in the container schedulable node that meets the requirements of the to-be-scheduled container.
  • Bandwith represents the GPU video memory bit width
  • MaxBandwith represents the maximum GPU video memory bit width of all GPUs in the container schedulable node that meets the requirements of the container to be scheduled
  • Clock represents the GPU clock frequency
  • MaxClock represents the maximum GPU clock frequency of all GPUs in the container schedulable node that meet the requirements of the to-be-scheduled container
  • Power represents the GPU power
  • MaxPower represents the maximum GPU power of all GPUs in the container schedulable node that meets the requirements of the container to be scheduled
  • Core represents the number of GPU cores
  • MaxCore represents the maximum number of GPU cores of all GPUs in the container schedulable node that meet the requirements of the container to be scheduled.
  • FreeMemory indicates the GPU free video memory
  • MaxFreeMemory indicates the maximum GPU free video memory of all GPUs in the container schedulable node that meets the requirements of the container to be scheduled
  • TotalMemory indicates the total amount of GPU video memory
  • MaxTotalMemory Indicates the maximum total GPU memory of all GPUs in the container schedulable node that meets the requirements of the container to be scheduled.
  • RealScore represents the video memory score of all GPUs in the schedulable node
  • FreeMemorySum represents the sum of GPU free video memory of all GPUs in the container schedulable node
  • TotalMemorySum represents the sum of the total GPU video memory of all GPUs in the container schedulable node.
  • AllocateScore represents the quota score of the container's schedulable nodes
  • AllocateMemorySum represents the total amount of video memory requested by the container to be scheduled, that is, the product of the GPU memory tag value of the container to be scheduled and the GPU quantity tag value.
  • FilteredGPUWeight is the weight of GPU score
  • the default value of FilteredGPUWeight is 2
  • RealWeight is the weight of video memory score
  • the default value of RealWeight is 1
  • AllocateWeight is the weight of quota score
  • the default value of AllocateWeight is 2.
  • step D is: according to the container scheduling two-tuple, set the node name field of the container to be scheduled to the node name of the best node in the two-tuple, and asynchronously update the Kubernetes API-Server The node name field for this container.
  • the present invention also proposes a distributed container scheduling system based on a shared GPU.
  • the system mainly includes a container creation event listener 1, a container scheduling queue 2, a container scheduler 3, a container scheduling executor 4 and Communication module 5.
  • the working principle of the system of the present invention is shown in FIG. 5 .
  • the container creation event listener is mainly used to monitor the container creation event in the Kubernetes API-Server, and perform container verification after monitoring the new container creation event.
  • the container creation event listener will also send the successfully verified container to the container scheduling queue; its working process is consistent with step A of the method of the present invention.
  • the container scheduling queue is mainly used to store the containers to be scheduled according to the priority, and its working process is consistent with step B of the method of the present invention.
  • the container scheduler is mainly used to read the container to be scheduled from the queue head of the container scheduling queue, and select the best node corresponding to the to-be-scheduled container from the Kubernetes cluster, and generate a container scheduling binary group.
  • its working process is consistent with step C of the method of the present invention.
  • the container scheduling executor is mainly used to update the node name field of the container to be scheduled in the Kubernetes API-Server according to the container scheduling two-tuple, complete the container scheduling operation, and realize node binding, and its working process is consistent with step D of the method of the present invention.
  • the communication module is used to help containers create event listeners, container scheduling queues, container schedulers, and container scheduling executors to establish communication links with the Kubernetes API-Server.
  • the communication module obtains the system configuration file, which includes the IP address, port number, TLS public key and TLS private key of the Kubernetes API-Server.
  • the communication module first checks whether the IP address and port number exist in the system configuration file. If so, the communication module reads the IP address and port number and tries to communicate with the Kubernetes cluster at that IP address and port; if the communication succeeds, the communication links between the container creation event listener, the container scheduling queue, the container scheduler, the container scheduling executor and the Kubernetes API-Server are established.
  • the communication module then checks whether the TLS public key and TLS private key exist in the system configuration file; if so, it uses them to authenticate the communication links with the Kubernetes API-Server. After successful authentication, the communication construction is completed, and the container creation event listener, container scheduling queue, container scheduler, and container scheduling executor can exchange information with the Kubernetes API-Server.
  • if the system configuration file does not exist, the IP address is unreachable, the port is closed, or the authentication is unsuccessful, the communication failure time and failure reason are recorded, failure information is generated and recorded locally, and the failure information is sent to the operation and maintenance engineer by email for checking and repair.
  • a scheduling simulator named Node Simulator is used to simulate node resources and the state of containers in Kubernetes.
  • the Node Simulator is deployed on the physical server where the Kubernetes control plane is located, and its configuration is shown in Table 1:
  • the containers are all set as machine learning tasks, each task requires mainstream frameworks such as Tensorflow and Pytorch, and all containers are set to consume GPU resources 10 seconds after starting to run.
  • Kubernetes scheduler and Kubeshare were selected as the benchmarks for comparison in the experiments, and all experiments were repeated 20 times to calculate the average value to ensure the validity of the results.
  • load balance entropy is selected to measure the degree of load balance.
  • the definition of load balance entropy is:
  • E(U) represents the load balancing entropy
  • N represents the number of nodes in the cluster
  • n_i represents the number of containers that consume GPU resources on node i
  • pod_j.gpu_memory represents the GPU memory occupied by container j
  • ∑pod.gpu_memory represents the total GPU memory consumed by the containers to be scheduled.
  • the number of containers scheduled in Experiment 1 is 225. These containers each request 2000M GPU memory and arrive according to a Poisson distribution to form a scheduling queue.
  • the Kubernetes scheduler, Kubeshare and the present invention are used for container scheduling, and the corresponding load balance entropy is calculated.
  • the result is shown in Figure 6; the abscissa in Figure 6 represents the number of scheduled containers, and the ordinate represents the average load balance entropy of the cluster. It can be seen from the figure that the entropy value of the present invention is closest to 1, so the scheduling performance of the present invention is better than that of the Kubernetes scheduler.
  • although the scheduling policy of the Kubernetes scheduler includes the LeastRequestedPriority and BalancedResourceAllocation policies to avoid consuming too many resources on a single node, it is still only weakly balanced in terms of resource utilization, because the default scheduling policy of Kubernetes fails to take into account the actual GPU resource consumption of the container.
  • Kubeshare uses a Most-fit scheduling strategy and affinity marking mechanism to ensure the degree of cluster load balancing, however, when containers start consuming GPU resources, scheduling decisions are skewed. The results show that the present invention can ensure the resource utilization of the cluster in a more balanced manner.
  • task scheduling time is regarded as an essential metric to measure the performance of the scheduler.
  • the number of scheduled containers is 100, and all the containers to be scheduled each request 500M GPU memory and arrive according to a Poisson distribution to form a scheduling queue.
  • the Kubernetes scheduler, Kubeshare and the present invention are used for container scheduling, respectively, and the corresponding scheduling time is calculated.
  • the result is shown in Figure 7; in the figure, the abscissa represents the number of containers to be scheduled, and the ordinate represents the scheduling time of the container, measured from the creation of the scheduling event to the completion of node binding.
  • the container scheduling time of the present invention is better than the other benchmark methods, and the consumption of cluster GPU resources can be ensured in a more balanced manner.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

A distributed container scheduling method and system based on a shared GPU, intended to solve the technical problems of unreasonable container scheduling and low GPU resource utilization in diversified cloud computing services. The method comprises: monitoring container creation events in real time and verifying the newly created containers; updating a container scheduling queue with the successfully verified containers; reading the containers to be scheduled from the container scheduling queue in order and, according to each container's GPU labels, selecting from the cluster the best node corresponding to the container to be scheduled; and scheduling the container to be scheduled onto the best node to complete distributed container scheduling. The method can select the most suitable node for container scheduling according to the requirements of the container to be scheduled, ensure load balance among the nodes in the cluster, and improve the resource utilization of the cluster.

Description

Distributed container scheduling method and system based on a shared GPU
Technical Field
The present invention relates to a distributed container scheduling method and system based on a shared GPU, and belongs to the technical field of cloud computing.
Background Art
With the development of cloud computing, using Kubernetes (which manages containerized applications on multiple hosts in a cloud platform) can improve resource utilization in server clusters. However, as cloud computing services become more diverse and complex, pairing containers with graphics processing units (GPUs) to improve the performance and efficiency of services and workflows has become the mainstream computing combination for edge computing and large-scale distributed machine learning, while most existing distributed container schedulers can only schedule container tasks based on central processing unit (CPU) and memory metrics, or can only detect the number of GPUs without examining fine-grained graphics card performance metrics to achieve GPU sharing. Existing distributed container schedulers cannot adapt to the computing needs of various complex scenarios, so containers with specific GPU requirements are scheduled to run on unsuitable nodes, which lowers the GPU resource utilization of the whole distributed cluster and degrades its performance.
In the field of cloud computing, GPU-based services and workflows, such as cloud gaming and machine learning training, are becoming increasingly diverse, which brings more challenges to the scheduling of GPU resources. Container scheduling in a distributed cluster needs to place containers reasonably based on the current GPU metric states within the cluster; otherwise tasks will be distributed unevenly inside the distributed cluster, affecting GPU resource scheduling results and indirectly lowering the computing efficiency of the distributed cluster.
Summary of the Invention
In order to solve the problems of unreasonable container scheduling and low GPU resource utilization in diversified cloud computing services, the present invention proposes a distributed container scheduling method and system based on a shared GPU, which implements monitoring of container creation events, generation of a container scheduling queue, and container scheduling. The present invention can select the most suitable node for container scheduling according to the requirements of the container to be scheduled, ensure load balance among the nodes in the cluster, and improve the resource utilization of the cluster.
To solve the above technical problems, the present invention adopts the following technical means:
In a first aspect, the present invention proposes a distributed container scheduling method based on a shared GPU, comprising the following steps:
monitoring container creation events in the Kubernetes API-Server in real time, and verifying the created container after a new container creation event is detected;
updating a container scheduling queue with the successfully verified container;
when the container scheduling queue is empty, performing no operation and waiting for a successfully verified container to join the scheduling queue; when the container scheduling queue is not empty, reading the containers to be scheduled from the container scheduling queue in order, selecting from the Kubernetes cluster the best node corresponding to the container to be scheduled, and generating a container scheduling two-tuple;
scheduling the container to be scheduled onto the best node according to the container scheduling two-tuple, thereby completing distributed container scheduling.
With reference to the first aspect, further, the method for verifying the created container is:
performing GPU label verification according to the field information of the created container: determining whether the container contains a GPU label; when the container contains no GPU label, the GPU label verification fails, and the verification failure time and the corresponding error information are written to the Kubernetes event log; otherwise the GPU label verification succeeds, wherein the GPU labels include a GPU quantity label, a GPU video memory label, and a GPU clock frequency label;
when the GPU label verification succeeds, performing scheduler name verification according to the field information of the created container: determining whether the scheduler field of the container is the system scheduler name; when the scheduler field is not the system scheduler name, the scheduler name verification fails, and the verification failure time and the corresponding error information are written to the Kubernetes event log; otherwise the scheduler name verification succeeds and the container verification is completed.
With reference to the first aspect, further, the method for updating the container scheduling queue with the successfully verified container is:
sending the successfully verified container into the container scheduling queue from the tail of the queue;
obtaining the preset priority label of each container in the container scheduling queue, and sorting all containers in the container scheduling queue from high to low according to the priority labels, thereby completing the update of the container scheduling queue.
With reference to the first aspect, further, the method for selecting from the Kubernetes cluster the best node corresponding to the container to be scheduled is:
performing node selection and filtering according to the GPU data of each node in the Kubernetes cluster and the GPU labels of the container to be scheduled, to obtain container-schedulable nodes;
when the number of container-schedulable nodes is 1, taking that container-schedulable node as the best node;
when the number of container-schedulable nodes is greater than 1, calculating the score of each container-schedulable node based on the GPU data of the container-schedulable nodes, and selecting the container-schedulable node with the highest score as the best node.
With reference to the first aspect, further, the specific operations for obtaining the container-schedulable nodes are:
when the container to be scheduled carries a GPU quantity label, traversing all nodes in the Kubernetes cluster, and when the number of GPUs held by a node is greater than or equal to the GPU quantity label value, marking the node as a primary schedulable node; when the container to be scheduled does not carry a GPU quantity label, marking all nodes in the Kubernetes cluster as primary schedulable nodes and setting the GPU quantity label value of the container to be scheduled to 1;
when the container to be scheduled carries a GPU video memory label, traversing all primary schedulable nodes; if the free video memory of a GPU in a primary schedulable node is greater than the GPU video memory label value of the container to be scheduled, taking that GPU as a GPU satisfying the first-level requirement; if the number of GPUs satisfying the first-level requirement is greater than or equal to the GPU quantity label value of the container to be scheduled, marking the primary schedulable node as a secondary schedulable node; when the container to be scheduled does not carry a GPU video memory label, marking all primary schedulable nodes as secondary schedulable nodes;
when the container to be scheduled carries a GPU clock frequency label, traversing all secondary schedulable nodes; if the clock frequency of a GPU in a secondary schedulable node is greater than the GPU clock frequency label value, taking that GPU as a GPU satisfying the second-level requirement; if the number of GPUs satisfying the second-level requirement is greater than or equal to the GPU quantity label value of the container to be scheduled, marking the secondary schedulable node as a container-schedulable node; when the container to be scheduled does not carry a GPU clock frequency label, marking all secondary schedulable nodes as container-schedulable nodes;
when the set of container-schedulable nodes is empty, writing the current time and the scheduling error information to the Kubernetes event log.
With reference to the first aspect, further, the formula for calculating the score of each container-schedulable node based on its GPU data is as follows:
Score = FilteredGPUScore × FilteredGPUWeight + RealScore × RealWeight + AllocateScore × AllocateWeight        (1)
where Score denotes the score of the container-schedulable node; FilteredGPUScore denotes the GPU score of all GPUs in the container-schedulable node that satisfy the requirements of the container to be scheduled, the requirements of the container to be scheduled being its GPU video memory label and GPU clock frequency label; FilteredGPUWeight is the weight of the GPU score; RealScore denotes the video memory score of all GPUs in the container-schedulable node; RealWeight is the weight of the video memory score; AllocateScore denotes the quota score of the container-schedulable node; AllocateWeight is the weight of the quota score;
The formula for FilteredGPUScore is:
FilteredGPUScore=∑FilteredGPUScorePerCard      (2)
[Formula (3) image: FilteredGPUScorePerCard; not reproduced in the text]
where FilteredGPUScorePerCard denotes the GPU score of a single GPU in the container-schedulable node that satisfies the requirements of the container to be scheduled; Bandwith denotes the GPU video memory bit width; MaxBandwith denotes the maximum GPU video memory bit width among all GPUs in the container-schedulable node that satisfy the requirements of the container to be scheduled; Clock denotes the GPU clock frequency; MaxClock denotes the maximum GPU clock frequency among all such GPUs; Power denotes the GPU power; MaxPower denotes the maximum GPU power among all such GPUs; Core denotes the number of GPU cores; MaxCore denotes the maximum number of GPU cores among all such GPUs; FreeMemory denotes the free GPU video memory; MaxFreeMemory denotes the maximum free GPU video memory among all such GPUs; TotalMemory denotes the total GPU video memory; MaxTotalMemory denotes the maximum total GPU video memory among all GPUs in the container-schedulable node that satisfy the requirements of the container to be scheduled;
The formula for RealScore is:
[Formula (4) image: RealScore; not reproduced in the text]
where FreeMemorySum denotes the sum of the free GPU video memory of all GPUs in the container-schedulable node, and TotalMemorySum denotes the sum of the total GPU video memory of all GPUs in the container-schedulable node;
The formula for AllocateScore is:
[Formula (5) image: AllocateScore; not reproduced in the text]
where AllocateMemorySum denotes the total amount of video memory requested by the container to be scheduled, i.e., the product of the GPU video memory label value and the GPU quantity label value of the container to be scheduled.
With reference to the first aspect, further, the container scheduling two-tuple consists of the container to be scheduled and the node name of the best node.
With reference to the first aspect, further, the specific operation of scheduling the container to be scheduled onto the best node according to the container scheduling two-tuple is:
according to the container scheduling two-tuple, setting the node name field of the container to be scheduled to the node name of the best node in the two-tuple, and asynchronously updating the node name field of that container in the Kubernetes API-Server.
In a second aspect, the present invention proposes a distributed container scheduling system based on a shared GPU, comprising:
a container creation event listener, configured to monitor container creation events in the Kubernetes API-Server and to perform container verification after a new container creation event is detected;
a container scheduling queue, configured to store containers to be scheduled according to priority;
a container scheduler, configured to read a container to be scheduled from the head of the container scheduling queue, select from the Kubernetes cluster the best node corresponding to the container to be scheduled, and generate a container scheduling two-tuple;
a container scheduling executor, configured to update the node name field of the container to be scheduled in the Kubernetes API-Server according to the container scheduling two-tuple, thereby completing the container scheduling operation;
a communication module, configured to establish, according to a system configuration file, the communication between each of the container creation event listener, the container scheduling queue, the container scheduler, the container scheduling executor and the Kubernetes API-Server.
With reference to the second aspect, further, the system configuration file includes the IP address, port number, TLS public key and TLS private key of the Kubernetes API-Server;
the operation of establishing communication according to the system configuration file is:
establishing communication links between the container creation event listener, the container scheduling queue, the container scheduler, the container scheduling executor and the Kubernetes API-Server according to the IP address and port number;
authenticating the communication links according to the TLS public key and TLS private key, and completing the communication construction after successful authentication.
The following advantages can be obtained by adopting the above technical means:
The present invention proposes a distributed container scheduling method and system based on a shared GPU. During container scheduling, the present invention selects nodes based on the container's requirements for GPU quantity, video memory, clock frequency and so on, and schedules containers reasonably according to the fine-grained metric states of the GPU graphics cards in the cluster, so that multi-container tasks can share GPUs. By simultaneously considering the graphics card metric states, free video memory and quota situation within a node, the container to be scheduled is placed on the most suitable node, which improves the GPU resource utilization of the cluster and adapts to the computing needs of complex scenarios. Compared with the prior art, the present invention can ensure load balance among the nodes in the cluster, enhance GPU resource utilization in the distributed container cluster, better satisfy scheduling requirements, and give containers a faster task completion time.
Brief Description of the Drawings
FIG. 1 is a flowchart of the steps of a distributed container scheduling method based on a shared GPU according to the present invention;
FIG. 2 is a flowchart of the operation of updating the container scheduling queue in an embodiment of the present invention;
FIG. 3 is a flowchart of the node selection and filtering operation in an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a distributed container scheduling system based on a shared GPU according to the present invention;
FIG. 5 is a working principle diagram of the distributed container scheduling system in an embodiment of the present invention;
FIG. 6 is a schematic diagram of the change in load balance entropy when different schedulers perform container scheduling in an embodiment of the present invention;
FIG. 7 is a schematic diagram of the change in scheduling time when different schedulers perform container scheduling in an embodiment of the present invention.
In the figures, 1 is the container creation event listener, 2 is the container scheduling queue, 3 is the container scheduler, 4 is the container scheduling executor, and 5 is the communication module.
Detailed Description of the Embodiments
The technical solution of the present invention is further described below with reference to the accompanying drawings:
The present invention proposes a distributed container scheduling method based on a shared GPU, as shown in FIG. 1, which specifically includes the following steps:
Step A: monitoring container creation events in the Kubernetes API-Server in real time, and verifying the created container after a new container creation event is detected;
Step B: updating the container scheduling queue with the successfully verified container;
Step C: when the container scheduling queue is empty, performing no operation and waiting for a successfully verified container to join the scheduling queue; when the container scheduling queue is not empty, reading the containers to be scheduled from the container scheduling queue in order, selecting from the Kubernetes cluster the best node corresponding to the container to be scheduled, and generating a container scheduling two-tuple;
Step D: scheduling the container to be scheduled onto the best node according to the container scheduling two-tuple, thereby completing distributed container scheduling.
In step A, communication with the Kubernetes API-Server is carried out over the network, and container creation events in the Kubernetes API-Server are monitored in real time. A system user can send a request to the Kubernetes API-Server through kubectl to create a GPU container and generate a container creation event; before creation, the user can manually set the container image name, the container scheduling priority label, the container startup command, the container startup parameters, the GPU labels used by the container, and so on, where the GPU labels include a GPU quantity label, a GPU video memory label and a GPU clock frequency label. The Kubernetes API-Server can instantiate (create) the container object according to the container creation event and store the container. When a new container creation event is detected, the field information of the container object created by the event needs to be obtained, and the container is verified according to this field information.
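The real-time monitoring of step A corresponds to a watch on pod objects in the API-Server. A sketch using the official Kubernetes Python client is given below; it is one possible way of receiving creation events, not necessarily the mechanism used by the patented listener.

```python
from kubernetes import client, config, watch

def watch_container_creation(on_created):
    """Stream pod events from the Kubernetes API-Server and hand newly created
    pods to the verification step."""
    config.load_kube_config()          # or config.load_incluster_config() inside the cluster
    v1 = client.CoreV1Api()
    for event in watch.Watch().stream(v1.list_pod_for_all_namespaces):
        if event["type"] == "ADDED":   # a new container (pod) creation event
            on_created(event["object"])
```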
The specific operations for verifying the created container are as follows:
Step A01: performing GPU label verification according to the field information of the created container: determining whether the container contains a GPU label; when the container contains no GPU label at all, the GPU label verification fails, and the verification failure time and the corresponding error information (no GPU label) are written to the Kubernetes event log so that the error can be looked up later; when the container contains one or more GPU labels, the GPU label verification succeeds and subsequent operations can proceed.
Step A02: when the GPU label verification succeeds, performing scheduler name verification according to the field information of the created container: determining whether the scheduler field of the container is the system scheduler name; when the scheduler field is not the system scheduler name, the scheduler name verification fails, and the verification failure time and the corresponding error information (the container's scheduler field) are written to the Kubernetes event log; otherwise the scheduler name verification succeeds, the container verification is completed, and the container is verified successfully.
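The two verification stages of steps A01 and A02 can be captured in a few lines. The sketch below is illustrative only: the label keys, the scheduler name, and the shape of the event-log entries are assumptions, since the text does not fix concrete identifiers.

```python
import datetime

# Hypothetical label keys and scheduler name; the patent does not specify the exact strings.
GPU_LABEL_KEYS = ("gpu-count", "gpu-memory", "gpu-clock")
SYSTEM_SCHEDULER_NAME = "shared-gpu-scheduler"

def verify_container(labels, scheduler_field, event_log):
    """Return True when the container passes GPU-label (A01) and scheduler-name (A02) checks."""
    now = datetime.datetime.now().isoformat()
    # Step A01: the container must carry at least one GPU label.
    if not any(key in labels for key in GPU_LABEL_KEYS):
        event_log.append((now, "GPU label verification failed: no GPU label"))
        return False
    # Step A02: the scheduler field must name this system's scheduler.
    if scheduler_field != SYSTEM_SCHEDULER_NAME:
        event_log.append((now, f"scheduler name verification failed: {scheduler_field}"))
        return False
    return True
```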
In step B, the successfully verified container is sent into the container scheduling queue and the container scheduling queue is updated, as shown in FIG. 2. The specific operations are as follows:
Step B01: sending the successfully verified container into the container scheduling queue from the tail of the queue, generating the container scheduling queue at the current moment.
Step B02: obtaining the preset priority label of each container in the container scheduling queue at the current moment, and sorting all containers in the container scheduling queue from high to low according to the priority labels; containers with higher priority are placed at the head of the container scheduling queue and containers with lower priority at the tail, completing the update of the container scheduling queue.
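Steps B01 and B02 reduce to appending at the tail and re-sorting by the priority label. A minimal sketch, assuming the priority label is an integer where a larger value means higher priority:

```python
def update_scheduling_queue(queue, verified_container, priority_of):
    """Step B01: enqueue at the tail; step B02: reorder so higher priority sits at the head."""
    queue.append(verified_container)
    # list.sort is stable, so containers with equal priority keep their arrival order.
    queue.sort(key=priority_of, reverse=True)
    return queue

# Example with dictionaries standing in for container objects.
queue = [{"name": "infer-job", "priority": 1}]
update_scheduling_queue(queue, {"name": "train-job", "priority": 5},
                        priority_of=lambda c: c.get("priority", 0))
# queue is now [train-job, infer-job]
```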
In this embodiment of the present invention, the specific operations of step C are:
Step C01: monitoring in real time whether the container scheduling queue is empty; when the container scheduling queue is empty, performing no operation and waiting for a successfully verified container to join the scheduling queue; when the container scheduling queue is not empty, reading a container to be scheduled from the head of the container scheduling queue and obtaining its GPU labels. In addition, the present invention sends a request to the Kubernetes API-Server to obtain the GPU data of all nodes in the current Kubernetes cluster, including: the number of GPUs held by each node and, for each GPU held by a node, its video memory bit width, GPU clock frequency, number of GPU cores, total GPU video memory, total available GPU video memory, GPU power, and so on.
Step C02: performing node selection and filtering according to the GPU data of each node in the Kubernetes cluster and the GPU labels of the container to be scheduled, to obtain the container-schedulable nodes.
Step C03: when the number of container-schedulable nodes is 1, taking that container-schedulable node as the best node.
Step C04: when the number of container-schedulable nodes is greater than 1, calculating the score of each container-schedulable node based on its GPU data, and selecting the container-schedulable node with the highest score as the best node.
Step C05: forming a container scheduling two-tuple from the container to be scheduled and the node name of the best node.
A container-schedulable node is a node in the Kubernetes cluster that satisfies the requirements of the container to be scheduled. As shown in FIG. 3, the present invention filters the container-schedulable nodes mainly along three dimensions:
Step C021: filtering nodes according to the GPU quantity label: when the container to be scheduled carries a GPU quantity label, traversing all nodes in the Kubernetes cluster, and when the number of GPUs held by a node is greater than or equal to the GPU quantity label value, marking the node as a primary schedulable node; when the container to be scheduled does not carry a GPU quantity label, marking all nodes in the Kubernetes cluster as primary schedulable nodes and setting the GPU quantity label value of the container to be scheduled to 1.
Step C022: on the basis of C021, filtering nodes according to the GPU video memory label: when the container to be scheduled carries a GPU video memory label, traversing all primary schedulable nodes; if the free video memory of a GPU in a primary schedulable node is greater than the GPU video memory label value of the container to be scheduled, taking that GPU as a GPU satisfying the first-level requirement; if the number of GPUs satisfying the first-level requirement is greater than or equal to the GPU quantity label value of the container to be scheduled (which defaults to 1 in C021 when no GPU quantity label is carried), marking the primary schedulable node as a secondary schedulable node; when the container to be scheduled does not carry a GPU video memory label, marking all primary schedulable nodes as secondary schedulable nodes.
Step C023: on the basis of C022, filtering nodes according to the GPU clock frequency label: when the container to be scheduled carries a GPU clock frequency label, traversing all secondary schedulable nodes; if the clock frequency of a GPU in a secondary schedulable node is greater than the GPU clock frequency label value of the container to be scheduled, taking that GPU as a GPU satisfying the second-level requirement; if the number of GPUs satisfying the second-level requirement is greater than or equal to the GPU quantity label value of the container to be scheduled, marking the secondary schedulable node as a container-schedulable node; when the container to be scheduled does not carry a GPU clock frequency label, marking all secondary schedulable nodes as container-schedulable nodes.
Step C024: if the set of container-schedulable nodes is empty after the filtering along these three dimensions, writing the current time and the scheduling error information (no container-schedulable node) to the Kubernetes event log.
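The three filtering dimensions of steps C021 to C023 can be expressed as successive passes over the node list. The sketch below assumes each node is a dictionary holding a list of per-GPU records; the field names are illustrative, not the ones used by the patented implementation.

```python
def filter_schedulable_nodes(nodes, gpu_count=None, gpu_memory=None, gpu_clock=None):
    """Return the container-schedulable nodes after the C021/C022/C023 filters.
    nodes: [{"name": str, "gpus": [{"free_memory": int, "clock": int}, ...]}, ...]"""
    required = gpu_count if gpu_count is not None else 1  # default set in C021

    # C021: keep nodes holding at least the requested number of GPUs.
    primary = [n for n in nodes if gpu_count is None or len(n["gpus"]) >= gpu_count]

    # C022: keep nodes with enough GPUs whose free memory exceeds the memory label.
    def enough_memory(node):
        if gpu_memory is None:
            return True
        return sum(1 for g in node["gpus"] if g["free_memory"] > gpu_memory) >= required
    secondary = [n for n in primary if enough_memory(n)]

    # C023: keep nodes with enough GPUs whose clock frequency exceeds the clock label.
    def enough_clock(node):
        if gpu_clock is None:
            return True
        return sum(1 for g in node["gpus"] if g["clock"] > gpu_clock) >= required
    schedulable = [n for n in secondary if enough_clock(n)]

    return schedulable  # an empty result triggers the error log of step C024
```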
In this embodiment of the present invention, the score of a container-schedulable node in step C04 consists mainly of three parts: 1. the GPU score of the GPUs satisfying the requirements of the container to be scheduled, where the requirements of the container to be scheduled are its GPU video memory label and GPU clock frequency label; 2. the video memory score of all GPUs on the node; 3. the quota score of the node.
The formula for the GPU score of the GPUs satisfying the requirements of the container to be scheduled is as follows:
FilteredGPUScore=∑FilteredGPUScorePerCard        (6)
where FilteredGPUScore denotes the GPU score of all GPUs in the container-schedulable node that satisfy the requirements of the container to be scheduled, and FilteredGPUScorePerCard denotes the GPU score of a single GPU in the container-schedulable node that satisfies the requirements of the container to be scheduled.
The formula for FilteredGPUScorePerCard is as follows:
[Formula (7) image: FilteredGPUScorePerCard; not reproduced in the text]
where Bandwith denotes the GPU video memory bit width; MaxBandwith denotes the maximum GPU video memory bit width among all GPUs in the container-schedulable node that satisfy the requirements of the container to be scheduled; Clock denotes the GPU clock frequency; MaxClock denotes the maximum GPU clock frequency among all such GPUs; Power denotes the GPU power; MaxPower denotes the maximum GPU power among all such GPUs; Core denotes the number of GPU cores; MaxCore denotes the maximum number of GPU cores among all such GPUs; FreeMemory denotes the free GPU video memory; MaxFreeMemory denotes the maximum free GPU video memory among all such GPUs; TotalMemory denotes the total GPU video memory; MaxTotalMemory denotes the maximum total GPU video memory among all GPUs in the container-schedulable node that satisfy the requirements of the container to be scheduled.
The formula for the video memory score of all GPUs on the node is as follows:
[Formula (8) image: RealScore; not reproduced in the text]
where RealScore denotes the video memory score of all GPUs in the schedulable node, FreeMemorySum denotes the sum of the free GPU video memory of all GPUs in the container-schedulable node, and TotalMemorySum denotes the sum of the total GPU video memory of all GPUs in the container-schedulable node.
The formula for the quota score of the node is as follows:
[Formula (9) image: AllocateScore; not reproduced in the text]
where AllocateScore denotes the quota score of the container-schedulable node, and AllocateMemorySum denotes the total amount of video memory requested by the container to be scheduled, i.e., the product of the GPU video memory label value and the GPU quantity label value of the container to be scheduled.
According to formulas (6) to (9), the score Score of a container-schedulable node is calculated as follows:
Score = FilteredGPUScore × FilteredGPUWeight + RealScore × RealWeight + AllocateScore × AllocateWeight        (10)
where FilteredGPUWeight is the weight of the GPU score, with a default value of 2; RealWeight is the weight of the video memory score, with a default value of 1; AllocateWeight is the weight of the quota score, with a default value of 2.
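Because formulas (7) to (9) are published only as images, the sketch below keeps the per-card score as a pluggable function and uses simple ratio forms for the video memory score and the quota score; those two forms are assumptions, and only the weighted sum of formula (10) and the default weights are taken from the text.

```python
DEFAULT_WEIGHTS = {"FilteredGPUWeight": 2, "RealWeight": 1, "AllocateWeight": 2}

def score_node(node, mem_label, count_label, clock_label,
               per_card_score, weights=DEFAULT_WEIGHTS):
    """Score = FilteredGPUScore*FilteredGPUWeight + RealScore*RealWeight + AllocateScore*AllocateWeight."""
    # GPUs that satisfy the container's memory and clock labels (the stated requirements).
    qualified = [g for g in node["gpus"]
                 if g["free_memory"] > mem_label and g["clock"] > clock_label]

    # Formula (6): sum of per-card scores; per_card_score stands in for formula (7).
    filtered_gpu_score = sum(per_card_score(g, qualified) for g in qualified)

    free_sum = sum(g["free_memory"] for g in node["gpus"])
    total_sum = sum(g["total_memory"] for g in node["gpus"])
    real_score = free_sum / total_sum                 # assumed ratio form of formula (8)

    allocate_sum = mem_label * count_label            # AllocateMemorySum
    allocate_score = 1.0 - allocate_sum / free_sum    # assumed ratio form of formula (9)

    return (filtered_gpu_score * weights["FilteredGPUWeight"]
            + real_score * weights["RealWeight"]
            + allocate_score * weights["AllocateWeight"])
```

The node with the highest score is then taken as the best node in step C04.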
In this embodiment of the present invention, the specific operation of step D is: according to the container scheduling two-tuple, setting the node name field of the container to be scheduled to the node name of the best node in the two-tuple, and asynchronously updating the node name field of that container in the Kubernetes API-Server.
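One common way for a custom scheduler to set a pod's node name through the API-Server is to POST a Binding object to the pod's binding subresource; the sketch below uses that documented endpoint, although the text only states that the node name field is updated asynchronously.

```python
import requests

def bind_container_to_node(api_server, token, namespace, pod_name, node_name):
    """Ask the Kubernetes API-Server to bind a scheduled pod to the chosen node."""
    binding = {
        "apiVersion": "v1",
        "kind": "Binding",
        "metadata": {"name": pod_name},
        "target": {"apiVersion": "v1", "kind": "Node", "name": node_name},
    }
    url = f"{api_server}/api/v1/namespaces/{namespace}/pods/{pod_name}/binding"
    resp = requests.post(url, json=binding, timeout=10,
                         headers={"Authorization": f"Bearer {token}"},
                         verify="/etc/kubernetes/pki/ca.crt")  # CA path is an assumption
    resp.raise_for_status()
```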
The present invention also proposes a distributed container scheduling system based on a shared GPU. As shown in FIG. 4, the system mainly includes a container creation event listener 1, a container scheduling queue 2, a container scheduler 3, a container scheduling executor 4 and a communication module 5. The working principle of the system of the present invention is shown in FIG. 5.
The container creation event listener is mainly used to monitor container creation events in the Kubernetes API-Server and to perform container verification after a new container creation event is detected; it also sends successfully verified containers into the container scheduling queue. Its working process is consistent with step A of the method of the present invention. The container scheduling queue is mainly used to store containers to be scheduled according to priority, and its working process is consistent with step B of the method of the present invention. The container scheduler is mainly used to read a container to be scheduled from the head of the container scheduling queue, select from the Kubernetes cluster the best node corresponding to the container to be scheduled, and generate a container scheduling two-tuple; its working process is consistent with step C of the method of the present invention. The container scheduling executor is mainly used to update the node name field of the container to be scheduled in the Kubernetes API-Server according to the container scheduling two-tuple, completing the container scheduling operation and realizing node binding; its working process is consistent with step D of the method of the present invention.
The communication module is used to help the container creation event listener, the container scheduling queue, the container scheduler and the container scheduling executor establish communication links with the Kubernetes API-Server. The communication module obtains the system configuration file, which includes the IP address, port number, TLS public key and TLS private key of the Kubernetes API-Server. The communication module first checks whether the IP address and port number exist in the system configuration file; if so, it reads them and attempts to communicate with the Kubernetes cluster at that IP address and port. If the communication succeeds, the communication links between the container creation event listener, the container scheduling queue, the container scheduler, the container scheduling executor and the Kubernetes API-Server are established. The communication module then checks whether the TLS public key and TLS private key exist in the system configuration file; if so, it attempts to communicate with the Kubernetes API-Server using the TLS public key and TLS private key and authenticates the communication links. If the authentication succeeds, the communication construction is completed, and the container creation event listener, the container scheduling queue, the container scheduler and the container scheduling executor can exchange information with the Kubernetes API-Server. If the system configuration file does not exist, the IP address is unreachable, the port is closed or the authentication fails, the communication failure time and failure reason are recorded, fault information is generated and recorded locally, and the fault information is sent to the operation and maintenance engineer by email for inspection and repair.
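A minimal sketch of the communication module's start-up checks, assuming the system configuration file is JSON and that the "TLS public key" and "TLS private key" are stored as a client certificate and key file; all field names and paths are assumptions.

```python
import json
import socket
import ssl

def build_api_server_link(config_path):
    """Read the system configuration file, test reachability of the API-Server
    address, then authenticate the link with the configured TLS key pair."""
    with open(config_path) as f:
        cfg = json.load(f)
    host, port = cfg["api_server_ip"], int(cfg["api_server_port"])

    # Reachability check for the IP address and port.
    raw_sock = socket.create_connection((host, port), timeout=5)

    # TLS authentication with the configured certificate/key pair.
    ctx = ssl.create_default_context(ssl.Purpose.SERVER_AUTH)
    ctx.load_cert_chain(certfile=cfg["tls_public_key"], keyfile=cfg["tls_private_key"])
    return ctx.wrap_socket(raw_sock, server_hostname=host)
```

A failure at any of these steps would be logged locally and reported to the operations engineer, as described above.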
To verify the container scheduling effect of the present invention, this embodiment provides the following experiments:
This embodiment uses a scheduling simulator named Node Simulator to simulate node resources and the state of containers in Kubernetes. The Node Simulator is deployed on the physical server where the Kubernetes control plane is located, and its configuration is shown in Table 1:
Table 1
[Table 1 image: configuration of the server hosting the Node Simulator; not reproduced in the text]
In this embodiment, the containers are all set up as machine learning tasks, each task requires mainstream frameworks such as Tensorflow and Pytorch, and all containers are set to start consuming GPU resources 10 seconds after they begin running. The Kubernetes scheduler and Kubeshare are selected as the comparison baselines in the experiments, and every experiment is repeated 20 times and averaged to ensure the validity of the results. Ten Kubernetes nodes are generated with the Node Simulator, each equipped with 4 NVIDIA TITAN-Xp GPUs; the specific configuration parameters are shown in Table 2:
Table 2
[Table 2 image: node GPU configuration parameters; not reproduced in the text]
Experiment 1:
Experiment 1 uses load balance entropy to measure the degree of load balance. The load balance entropy is defined as:
[Formula (11) image: the load balance entropy E(U); not reproduced in the text]
where E(U) denotes the load balance entropy, N denotes the number of nodes in the cluster, and u_i denotes the GPU memory utilization of node i, i = 0, 1, ..., N-1.
u_i = (∑_{j=1}^{n_i} pod_j.gpu_memory) / ∑pod.gpu_memory        (12)
where n_i denotes the number of containers consuming GPU resources on node i, pod_j.gpu_memory denotes the GPU memory occupied by container j, and ∑pod.gpu_memory denotes the total GPU memory consumed by the containers to be scheduled.
As can be seen from formulas (11) and (12), a cluster with fully balanced resource utilization has an entropy of 1.
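Formula (11) is published only as an image; the sketch below uses one common normalized-entropy formulation that is consistent with the stated property that a fully balanced cluster scores 1, and it should be read as an assumption rather than the exact definition used in the experiments.

```python
import math

def load_balance_entropy(node_gpu_memory_usage):
    """E(U) = -sum(p_i * ln p_i) / ln N, where p_i is node i's share of the
    GPU memory consumed by the scheduled containers (assumed formulation)."""
    n = len(node_gpu_memory_usage)
    total = sum(node_gpu_memory_usage)
    if n <= 1 or total == 0:
        return 1.0
    entropy = 0.0
    for usage in node_gpu_memory_usage:
        share = usage / total
        if share > 0:
            entropy -= share * math.log(share)
    return entropy / math.log(n)

# A perfectly balanced 4-node cluster scores 1.0.
print(load_balance_entropy([2000, 2000, 2000, 2000]))  # -> 1.0
```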
The number of containers scheduled in Experiment 1 is 225. Each of these containers requests 2000M of GPU memory, and the containers arrive according to a Poisson distribution to form a scheduling queue. The Kubernetes scheduler, Kubeshare and the present invention are used in turn for container scheduling, and the corresponding load balance entropy is calculated. The results are shown in FIG. 6, where the abscissa represents the number of scheduled containers and the ordinate represents the average load balance entropy of the cluster. It can be seen from the figure that the entropy of the present invention is closest to 1, so the scheduling performance of the present invention is better than that of the Kubernetes scheduler. Although the scheduling policy of the Kubernetes scheduler includes the LeastRequestedPriority and BalancedResourceAllocation policies to avoid consuming too many resources on a single node, its resource utilization remains only weakly balanced, because the default scheduling policy of Kubernetes fails to take the actual GPU resource consumption of the containers into account. Similarly, Kubeshare uses a Most-fit scheduling strategy and an affinity marking mechanism to maintain the degree of cluster load balance, but once containers start consuming GPU resources, its scheduling decisions drift. The results show that the present invention can keep the resource utilization of the cluster balanced to a greater degree.
Experiment 2:
Considering that current clusters need to handle large concurrent tasks, task scheduling time is regarded as an essential metric for measuring scheduler performance. In Experiment 2, the number of scheduled containers is 100, each container to be scheduled requests 500M of GPU memory, and the containers arrive according to a Poisson distribution to form a scheduling queue. The Kubernetes scheduler, Kubeshare and the present invention are used in turn for container scheduling, and the corresponding scheduling times are calculated. The results are shown in FIG. 7, where the abscissa represents the number of containers to be scheduled and the ordinate represents the container scheduling time, measured from the creation of the scheduling event to the completion of node binding. As can be seen from FIG. 7, Kubeshare performs relatively poorly compared with Kubernetes and the present invention, because it considers GPU-level affinity and the affinity operations are very time-consuming. Meanwhile, although Kubernetes performs well compared with the present invention, its scheduling policy lacks deep consideration of cluster resource utilization and its resource balance is relatively weak; the Kubernetes default scheduler can therefore make fast scheduling decisions but neglects scheduling quality.
In summary, the container scheduling time of the present invention is better than that of the other baseline methods, and the consumption of cluster GPU resources can be ensured in a more balanced manner.
The above are only preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art can make several improvements and variations without departing from the technical principles of the present invention, and such improvements and variations should also be regarded as falling within the protection scope of the present invention.

Claims (10)

  1. A distributed container scheduling method based on a shared GPU, characterized by comprising the following steps:
    monitoring container creation events in the Kubernetes API-Server in real time, and verifying the created container after a new container creation event is detected;
    updating a container scheduling queue with the successfully verified container;
    when the container scheduling queue is empty, performing no operation and waiting for a successfully verified container to join the scheduling queue; when the container scheduling queue is not empty, reading the containers to be scheduled from the container scheduling queue in order, selecting from the Kubernetes cluster the best node corresponding to the container to be scheduled, and generating a container scheduling two-tuple;
    scheduling the container to be scheduled onto the best node according to the container scheduling two-tuple, thereby completing distributed container scheduling.
  2. The distributed container scheduling method based on a shared GPU according to claim 1, characterized in that the method for verifying the created container is:
    performing GPU label verification according to the field information of the created container: determining whether the container contains a GPU label; when the container contains no GPU label, the GPU label verification fails, and the verification failure time and the corresponding error information are written to the Kubernetes event log; otherwise the GPU label verification succeeds, wherein the GPU labels include a GPU quantity label, a GPU video memory label, and a GPU clock frequency label;
    when the GPU label verification succeeds, performing scheduler name verification according to the field information of the created container: determining whether the scheduler field of the container is the system scheduler name; when the scheduler field is not the system scheduler name, the scheduler name verification fails, and the verification failure time and the corresponding error information are written to the Kubernetes event log; otherwise the scheduler name verification succeeds and the container verification is completed.
  3. The distributed container scheduling method based on a shared GPU according to claim 1, characterized in that the method for updating the container scheduling queue with the successfully verified container is:
    sending the successfully verified container into the container scheduling queue from the tail of the queue;
    obtaining the preset priority label of each container in the container scheduling queue, and sorting all containers in the container scheduling queue from high to low according to the priority labels, thereby completing the update of the container scheduling queue.
  4. The distributed container scheduling method based on a shared GPU according to claim 1 or 2, characterized in that the method for selecting from the Kubernetes cluster the best node corresponding to the container to be scheduled is:
    performing node selection and filtering according to the GPU data of each node in the Kubernetes cluster and the GPU labels of the container to be scheduled, to obtain container-schedulable nodes;
    when the number of container-schedulable nodes is 1, taking that container-schedulable node as the best node;
    when the number of container-schedulable nodes is greater than 1, calculating the score of each container-schedulable node based on the GPU data of the container-schedulable nodes, and selecting the container-schedulable node with the highest score as the best node.
  5. The distributed container scheduling method based on a shared GPU according to claim 4, characterized in that the specific operations for obtaining the container-schedulable nodes are:
    when the container to be scheduled carries a GPU quantity label, traversing all nodes in the Kubernetes cluster, and when the number of GPUs held by a node is greater than or equal to the GPU quantity label value, marking the node as a primary schedulable node; when the container to be scheduled does not carry a GPU quantity label, marking all nodes in the Kubernetes cluster as primary schedulable nodes and setting the GPU quantity label value of the container to be scheduled to 1;
    when the container to be scheduled carries a GPU video memory label, traversing all primary schedulable nodes; if the free video memory of a GPU in a primary schedulable node is greater than the GPU video memory label value of the container to be scheduled, taking that GPU as a GPU satisfying the first-level requirement; if the number of GPUs satisfying the first-level requirement is greater than or equal to the GPU quantity label value of the container to be scheduled, marking the primary schedulable node as a secondary schedulable node; when the container to be scheduled does not carry a GPU video memory label, marking all primary schedulable nodes as secondary schedulable nodes;
    when the container to be scheduled carries a GPU clock frequency label, traversing all secondary schedulable nodes; if the clock frequency of a GPU in a secondary schedulable node is greater than the GPU clock frequency label value of the container to be scheduled, taking that GPU as a GPU satisfying the second-level requirement; if the number of GPUs satisfying the second-level requirement is greater than or equal to the GPU quantity label value of the container to be scheduled, marking the secondary schedulable node as a container-schedulable node; when the container to be scheduled does not carry a GPU clock frequency label, marking all secondary schedulable nodes as container-schedulable nodes;
    when the set of container-schedulable nodes is empty, writing the current time and the scheduling error information to the Kubernetes event log.
  6. The distributed container scheduling method based on a shared GPU according to claim 4, characterized in that the formula for calculating the score of each container-schedulable node based on its GPU data is as follows:
    Score=
    FilteredGPUScore×FilteredGPUWeight
    +RealScore×RealWeight
    +AllocateScore×AllocateWeight
    where Score denotes the score of the container-schedulable node; FilteredGPUScore denotes the GPU score of all GPUs in the container-schedulable node that satisfy the requirements of the container to be scheduled, the requirements of the container to be scheduled being its GPU video memory label and GPU clock frequency label; FilteredGPUWeight is the weight of the GPU score; RealScore denotes the video memory score of all GPUs in the schedulable node; RealWeight is the weight of the video memory score; AllocateScore denotes the quota score of the container-schedulable node; AllocateWeight is the weight of the quota score;
    the formula for FilteredGPUScore is:
    [formula image: FilteredGPUScore and FilteredGPUScorePerCard; not reproduced in the text]
    where FilteredGPUScorePerCard denotes the GPU score of a GPU in the container-schedulable node that satisfies the requirements of the container to be scheduled; Bandwith denotes the GPU video memory bit width; MaxBandwith denotes the maximum GPU video memory bit width among all GPUs in the container-schedulable node that satisfy the requirements of the container to be scheduled; Clock denotes the GPU clock frequency; MaxClock denotes the maximum GPU clock frequency among all such GPUs; Power denotes the GPU power; MaxPower denotes the maximum GPU power among all such GPUs; Core denotes the number of GPU cores; MaxCore denotes the maximum number of GPU cores among all such GPUs; FreeMemory denotes the free GPU video memory; MaxFreeMemory denotes the maximum free GPU video memory among all such GPUs; TotalMemory denotes the total GPU video memory; MaxTotalMemory denotes the maximum total GPU video memory among all GPUs in the container-schedulable node that satisfy the requirements of the container to be scheduled;
    the formula for RealScore is:
    [formula image: RealScore; not reproduced in the text]
    where FreeMemorySum denotes the sum of the free GPU video memory of all GPUs in the container-schedulable node, and TotalMemorySum denotes the sum of the total GPU video memory of all GPUs in the container-schedulable node;
    the formula for AllocateScore is:
    [formula image: AllocateScore; not reproduced in the text]
    where AllocateMemorySum denotes the total amount of video memory requested by the container to be scheduled, i.e., the product of the GPU video memory label value and the GPU quantity label value of the container to be scheduled.
  7. The distributed container scheduling method based on a shared GPU according to claim 1, characterized in that the container scheduling two-tuple consists of the container to be scheduled and the node name of the best node.
  8. The distributed container scheduling method based on a shared GPU according to claim 1 or 7, characterized in that the specific operation of scheduling the container to be scheduled onto the best node according to the container scheduling two-tuple is:
    according to the container scheduling two-tuple, setting the node name field of the container to be scheduled to the node name of the best node in the two-tuple, and asynchronously updating the node name field of that container in the Kubernetes API-Server.
  9. A distributed container scheduling system based on a shared GPU, characterized by comprising:
    a container creation event listener, configured to monitor container creation events in the Kubernetes API-Server and to perform container verification after a new container creation event is detected;
    a container scheduling queue, configured to store containers to be scheduled according to priority;
    a container scheduler, configured to read a container to be scheduled from the head of the container scheduling queue, select from the Kubernetes cluster the best node corresponding to the container to be scheduled, and generate a container scheduling two-tuple;
    a container scheduling executor, configured to update the node name field of the container to be scheduled in the Kubernetes API-Server according to the container scheduling two-tuple, thereby completing the container scheduling operation;
    a communication module, configured to establish, according to a system configuration file, the communication between each of the container creation event listener, the container scheduling queue, the container scheduler, the container scheduling executor and the Kubernetes API-Server.
  10. The distributed container scheduling system based on a shared GPU according to claim 9, characterized in that the system configuration file includes the IP address, port number, TLS public key and TLS private key of the Kubernetes API-Server;
    the operation of establishing communication according to the system configuration file is:
    establishing communication links between the container creation event listener, the container scheduling queue, the container scheduler, the container scheduling executor and the Kubernetes API-Server according to the IP address and port number;
    authenticating the communication links according to the TLS public key and TLS private key, and completing the communication construction after successful authentication.
PCT/CN2021/138799 2021-03-11 2021-12-16 Distributed container scheduling method and system based on a shared GPU WO2022188498A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/701,637 US20220291956A1 (en) 2021-03-11 2022-03-22 Distributed container scheduling method and system based on shared gpus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110264399.4 2021-03-11
CN202110264399.4A CN112925611A (zh) 2021-03-11 2021-06-08 Distributed container scheduling method and system based on a shared GPU

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/701,637 Continuation-In-Part US20220291956A1 (en) 2021-03-11 2022-03-22 Distributed container scheduling method and system based on shared gpus

Publications (1)

Publication Number Publication Date
WO2022188498A1 true WO2022188498A1 (zh) 2022-09-15

Family

ID=76172574

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/138799 WO2022188498A1 (zh) 2021-03-11 2021-12-16 一种基于共享式gpu的分布式容器调度方法及其系统

Country Status (2)

Country Link
CN (1) CN112925611A (zh)
WO (1) WO2022188498A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112925611A (zh) * 2021-03-11 2021-06-08 南京邮电大学 一种基于共享式gpu的分布式容器调度方法及其系统
CN116339927B (zh) * 2023-05-29 2023-08-15 苏州浪潮智能科技有限公司 设备确定方法、装置、存储介质及电子装置
CN117971505B (zh) * 2024-03-29 2024-06-07 苏州元脑智能科技有限公司 部署容器应用的方法及装置

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109634748A (zh) * 2018-12-12 2019-04-16 深圳前海微众银行股份有限公司 集群资源调度方法、装置、设备及计算机可读存储介质
CN111522639A (zh) * 2020-04-16 2020-08-11 南京邮电大学 Kubernetes集群架构系统下多维资源调度方法
CN111538586A (zh) * 2020-01-23 2020-08-14 中国银联股份有限公司 集群gpu资源管理调度系统、方法以及计算机可读存储介质
CN111858025A (zh) * 2020-06-10 2020-10-30 苏州浪潮智能科技有限公司 一种基于gpu卡显存的混合调度方法、装置、设备和介质
CN112925611A (zh) * 2021-03-11 2021-06-08 南京邮电大学 一种基于共享式gpu的分布式容器调度方法及其系统


Also Published As

Publication number Publication date
CN112925611A (zh) 2021-06-08

Similar Documents

Publication Publication Date Title
WO2022188498A1 (zh) 一种基于共享式gpu的分布式容器调度方法及其系统
US11656911B2 (en) Systems, methods, and apparatuses for implementing a scheduler with preemptive termination of existing workloads to free resources for high priority items
US20210124616A1 (en) Workload management using blockchain-based transaction deferrals
US10514951B2 (en) Systems, methods, and apparatuses for implementing a stateless, deterministic scheduler and work discovery system with interruption recovery
US11294726B2 (en) Systems, methods, and apparatuses for implementing a scalable scheduler with heterogeneous resource allocation of large competing workloads types using QoS
Chowdhury et al. Vineyard: Virtual network embedding algorithms with coordinated node and link mapping
US10623481B2 (en) Balancing resources in distributed computing environments
Sathiyamoorthi et al. Adaptive fault tolerant resource allocation scheme for cloud computing environments
EP2899947A1 (en) Component oriented hybrid cloud operating system architecture and communication method thereof
CN110008024A (zh) 一种多维约束下基于延迟决策的容器调度方法以及装置
JP2004005550A (ja) 分散ワークフロー管理方法およびシステム
Zhang et al. IoV scenario: Implementation of a bandwidth aware algorithm in wireless network communication mode
WO2021143590A1 (zh) 一种分布式容器镜像构建调度系统及方法
CN113886162B (zh) 一种计算设备性能测试方法、计算设备及存储介质
CN107545376B (zh) 一种下发任务的方法、装置及系统
CN111190691A (zh) 适用于虚拟机的自动迁移方法、系统、装置及存储介质
Cao et al. Collaborative attributes and resources for single-stage virtual network mapping in network virtualization
Li et al. Endpoint-flexible coflow scheduling across geo-distributed datacenters
US20220291956A1 (en) Distributed container scheduling method and system based on shared gpus
CN107679766B (zh) 一种群智任务动态冗余调度方法及装置
CN112379966B (zh) 一种云数据中心虚拟机实时整合方法及系统
Zhang et al. A virtual network embedding algorithm based on RBF neural network
CN113630451A (zh) 一种基于区块链、spark的计算服务系统
Al-Masri et al. Enhancing Resource Provisioning Across Edge-based Environments
Cui et al. Decentralized thermal-aware task scheduling for large-scale many-core systems

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21929960

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21929960

Country of ref document: EP

Kind code of ref document: A1