CN116578383A - NUMA-aware Pod scheduling method and apparatus - Google Patents

NUMA-aware Pod scheduling method and apparatus

Info

Publication number
CN116578383A
CN116578383A
Authority
CN
China
Prior art keywords
pod
numa
intensive
network
resources
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310522681.7A
Other languages
Chinese (zh)
Inventor
荣涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CLP Cloud Digital Intelligence Technology Co Ltd
Original Assignee
CLP Cloud Digital Intelligence Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CLP Cloud Digital Intelligence Technology Co Ltd filed Critical CLP Cloud Digital Intelligence Technology Co Ltd
Priority to CN202310522681.7A priority Critical patent/CN116578383A/en
Publication of CN116578383A publication Critical patent/CN116578383A/en
Pending legal-status Critical Current

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/44 Arrangements for executing specific programs
    • G06F 9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F 9/45533 Hypervisors; Virtual machine monitors
    • G06F 9/45558 Hypervisor-specific management and integration aspects
    • G06F 2009/45595 Network integration; Enabling network access in virtual machine instances
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5061 Partitioning or combining of resources
    • G06F 9/5077 Logical partitioning of resources; Management or configuration of virtualized resources
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The application discloses a NUMA-aware Pod scheduling method and apparatus, relating to the technical field of Pod scheduling. The method comprises: in the scoring stage of Pod scheduling, adjusting the scoring policy so that Pods with different roles are scheduled onto non-uniform memory access (NUMA) nodes close to or far from the resources they need, whereby each Pod is scheduled to the optimal node. The application improves the accuracy with which the container orchestration scheduling algorithm selects cluster Nodes, makes fuller use of system resources, and eliminates the system overhead caused by NUMA-unaware scheduling.

Description

NUMA-aware Pod scheduling method and apparatus
Technical Field
The application relates to the technical field of system resource quota scheduling, and in particular to a NUMA-aware Pod scheduling method and apparatus.
Background
With its continued development, cloud computing has become an important foundation of the modern IT industry. Container orchestration refers to the deployment, management, and monitoring of multiple containers. Containers use system resources more efficiently, provide a consistent runtime environment, and are easier to migrate and scale, so containerized deployment has become increasingly mainstream. This has driven the development of container orchestration systems, which automatically deploy, scale, and manage containerized applications; an example is Kubernetes, an open-source Linux container automation operations platform.
To make fuller use of hardware resources, a container orchestration system needs to run containers on appropriate hardware, i.e., perform Pod scheduling. A Pod is the smallest unit that can be created and deployed in Kubernetes and represents an application instance in a Kubernetes cluster.
Pod scheduling is generally divided into two phases: filtering and scoring. The filtering phase screens out Nodes that cannot satisfy the Pod's scheduling requirements; for example, a filter function may check whether a candidate Node's available resources can satisfy the Pod's resource request. Filtering yields a list of feasible Nodes containing all schedulable Nodes; the list usually contains more than one Node, and if it is empty the Pod is unschedulable. The scheduler then selects the most suitable Node for the Pod from all schedulable Nodes by scoring each one according to the currently enabled scoring rules. Finally, the scheduler schedules the Pod to the Node with the highest score; if several Nodes tie for the highest score, the scheduler chooses one of them at random.
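The two-phase flow described above can be sketched as follows. This is a minimal illustration only: the function names, parameters, and dict-based Node/Pod representations are assumptions and do not correspond to the actual Kubernetes scheduler API.

```python
import random

def schedule(pod, nodes, filters, scorers):
    """Two-phase Pod scheduling sketch (illustrative, not the real scheduler).
    `filters` are predicates (node, pod) -> bool;
    `scorers` are functions (node, pod) -> numeric score."""
    # Filtering stage: keep only Nodes that satisfy every predicate.
    feasible = [n for n in nodes if all(f(n, pod) for f in filters)]
    if not feasible:
        return None  # empty list means the Pod is unschedulable
    # Scoring stage: sum the scores from every enabled scoring rule.
    scored = [(sum(s(n, pod) for s in scorers), n) for n in feasible]
    best = max(score for score, _ in scored)
    # Ties for the highest score are broken randomly.
    return random.choice([n for score, n in scored if score == best])
```

A toy usage: with a filter that checks free CPU against the Pod's request and a scorer that prefers more free CPU, the Pod lands on the least-loaded feasible Node.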
While filtering and scoring Nodes for a Pod, the scheduler checks, among other things, whether the ports needed by the Pod being scheduled are already occupied on a Node and whether the Node has enough idle resources (such as CPU and memory) to meet the Pod's requests. The more Pods already placed on a Node and the more resources those Pods use, the lower the Node's score, so Pods are preferentially scheduled to Nodes with fewer Pods and lower resource usage.
Existing container orchestration systems aggregate the resource utilization of all systems in the cluster, filter and score Nodes accordingly, and finally select a suitable Node for Pod scheduling. However, this approach ignores an important factor: Pods differ in their resource demands. In a real production environment, Pods with different roles often need different resources.
In the filtering stage, the orchestration system makes a Node a candidate if its resources meet the requirements. In the scoring stage, Nodes with identical totals of storage, network, and computing resources receive identical scores, and the system then picks one of the equally scored Nodes at random for Pod scheduling. This scoring policy is unfair to Pods with different roles, especially when computing and network resources tie simultaneously: a Pod may be scheduled to a non-optimal Node, so cluster resources cannot be utilized to the fullest, or resources are wasted.
Disclosure of Invention
To address the above shortcomings of the prior art, a first aspect of the application provides a NUMA-aware Pod scheduling method that improves the accuracy with which the container orchestration scheduling algorithm selects cluster Nodes, makes fuller use of system resources, and eliminates the system overhead caused by NUMA-unaware scheduling.
In order to achieve the above purpose, the application adopts the following technical scheme:
a NUMA-aware Pod scheduling method, the method comprising the steps of:
In the scoring stage of Pod scheduling, the scoring policy is adjusted so that Pods with different roles are scheduled onto non-uniform memory access (NUMA) nodes close to or far from the corresponding resources, whereby each Pod is scheduled to the optimal node.
In some embodiments, adjusting the scoring policy to schedule Pods with different roles onto NUMA nodes close to or far from the corresponding resources includes:
adjusting the scoring policy so that I/O-intensive Pods are scheduled onto NUMA nodes close to network and/or disk resources and compute-intensive Pods are scheduled onto NUMA nodes far from network and/or disk resources.
In some embodiments, adjusting the scoring policy to schedule I/O-intensive Pods onto NUMA nodes close to network and/or disk resources and compute-intensive Pods onto NUMA nodes far from network and/or disk resources includes:
for an I/O-intensive Pod, giving a higher score to a Node that has more remaining resources on the NUMA node close to the network and/or disk resources, so that the I/O-intensive Pod is scheduled onto that NUMA node;
for a compute-intensive Pod, giving a higher score to a Node that has more remaining resources on the NUMA node far from the network and/or disk resources, so that the compute-intensive Pod is scheduled onto that NUMA node.
In some embodiments, adjusting the scoring policy to schedule I/O-intensive Pods onto NUMA nodes close to network and/or disk resources and compute-intensive Pods onto NUMA nodes far from network and/or disk resources includes:
for an I/O-intensive Pod, weighting Nodes that have more remaining resources on the NUMA node close to the network and/or disk resources, so that the I/O-intensive Pod is scheduled onto that NUMA node;
for a compute-intensive Pod, weighting Nodes that have more remaining resources on the NUMA node far from the network and/or disk resources, so that the compute-intensive Pod is scheduled onto that NUMA node.
In some embodiments, the method further comprises:
in the filtering stage of Pod scheduling, eliminating Nodes that do not meet the resource requirements.
A second aspect of the present application provides a NUMA-aware Pod scheduling apparatus that improves the accuracy with which the container orchestration scheduling algorithm selects cluster Nodes, makes fuller use of system resources, and eliminates the system overhead caused by NUMA-unaware scheduling.
In order to achieve the above purpose, the application adopts the following technical scheme:
a NUMA-aware Pod scheduling apparatus, comprising:
and the scheduler is used for adjusting the scoring strategy in the scoring stage of the Pod scheduling, and scheduling the pods with different functions onto non-uniform memory access NUMA which is close to or far away from the corresponding resources so as to schedule the pods to the optimal nodes.
In some embodiments, the scheduler is configured to:
adjust the scoring policy so that I/O-intensive Pods are scheduled onto NUMA nodes close to network and/or disk resources and compute-intensive Pods are scheduled onto NUMA nodes far from network and/or disk resources.
In some embodiments, the scheduler adjusting the scoring policy to schedule I/O-intensive Pods onto NUMA nodes close to network and/or disk resources and compute-intensive Pods onto NUMA nodes far from network and/or disk resources includes:
for an I/O-intensive Pod, giving a higher score to a Node that has more remaining resources on the NUMA node close to the network and/or disk resources, so that the I/O-intensive Pod is scheduled onto that NUMA node;
for a compute-intensive Pod, giving a higher score to a Node that has more remaining resources on the NUMA node far from the network and/or disk resources, so that the compute-intensive Pod is scheduled onto that NUMA node.
In some embodiments, the scheduler adjusting the scoring policy to schedule I/O-intensive Pods onto NUMA nodes close to network and/or disk resources and compute-intensive Pods onto NUMA nodes far from network and/or disk resources includes:
for an I/O-intensive Pod, weighting Nodes that have more remaining resources on the NUMA node close to the network and/or disk resources, so that the I/O-intensive Pod is scheduled onto that NUMA node;
for a compute-intensive Pod, weighting Nodes that have more remaining resources on the NUMA node far from the network and/or disk resources, so that the compute-intensive Pod is scheduled onto that NUMA node.
In some embodiments, the scheduler is further configured to:
in the filtering stage of Pod scheduling, eliminate Nodes that do not meet the resource requirements.
Compared with the prior art, the application has the following advantages:
the NUMA-aware Pod scheduling method of the application adjusts the scoring policy in the scoring stage of Pod scheduling and schedules Pods with different roles onto NUMA nodes close to or far from the corresponding resources, so that each Pod is scheduled to the optimal node; for example, I/O-intensive Pods are scheduled onto NUMA nodes close to network and/or disk resources, and compute-intensive Pods onto NUMA nodes far from them. This improves the accuracy with which the container orchestration scheduling algorithm selects cluster Nodes, makes fuller use of system resources, and eliminates the system overhead caused by NUMA-unaware scheduling.
Drawings
FIG. 1 is a detailed view of a cluster and its Nodes in an embodiment of the application.
Detailed Description
For the purpose of making the objects, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions of the embodiments are described below clearly and completely with reference to the accompanying drawings. The described embodiments are some, but not all, embodiments of the application. All other embodiments obtained by those skilled in the art based on these embodiments without creative effort fall within the scope of the application.
The embodiment of the application provides a NUMA-aware Pod scheduling method comprising the following steps:
in the scoring stage of Pod scheduling, adjusting the scoring policy so that Pods with different roles are scheduled onto NUMA (non-uniform memory access) nodes close to or far from the corresponding resources, whereby each Pod is scheduled to the optimal node. In addition, in the filtering stage of Pod scheduling, Nodes that do not meet the resource requirements are eliminated, so unqualified Nodes are screened out first.
Non-uniform memory access (NUMA) is a computer memory design for multiprocessing in which memory access time depends on the location of the memory relative to the processor. Under NUMA, a processor accesses its own local memory faster than non-local memory (another processor's local memory, or memory shared between processors). The I/O bus is often tied to a particular NUMA node. FIG. 1 shows the cluster and the detail of its Nodes: the Node1 Node consists of two NUMA nodes, each with one CPU and two DDR memories, and the I/O bus where the network card resides belongs to NUMA0.
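The FIG. 1 topology just described can be modeled as a small data structure. This is a sketch under stated assumptions: the class and field names, CPU counts, and memory sizes are illustrative and not taken from the patent text.

```python
from dataclasses import dataclass

@dataclass
class NUMANode:
    cpus: int              # logical CPUs attached to this NUMA node
    memory_gb: int         # local DDR memory (two DDR channels per NUMA node)
    has_nic: bool = False  # True if the I/O bus with the network card attaches here

@dataclass
class Node:
    name: str
    numas: list

# Node1 as in FIG. 1: two NUMA nodes, one CPU each; the NIC belongs to NUMA0.
# CPU and memory sizes below are made up for illustration.
node1 = Node("Node1", [
    NUMANode(cpus=16, memory_gb=64, has_nic=True),   # NUMA0
    NUMANode(cpus=16, memory_gb=64, has_nic=False),  # NUMA1
])
```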
It should be noted that in a real production environment, Pods with different roles often need different resources: an I/O-intensive Pod demands a lot of network and disk bandwidth, while a compute-intensive Pod has no great demand for network or disk bandwidth but instead demands more memory and CPU resources. Judging whether a Node is suitable for a Pod solely by the Node's overall resource utilization is clearly insufficient.
Taking FIG. 1 as an example, suppose there are two Pods to schedule: Pod1 is I/O-intensive and Pod2 is compute-intensive. I/O-intensive work mainly comprises network I/O and disk I/O; this embodiment uses network I/O as the example, and disk I/O, which must also be considered in practice, is analogous.
Suppose that, after the container orchestration system's filtering and scoring, Pod1 and Pod2 are both finally dispatched to Node1. Since the orchestration system does not participate in placing Pod1 and Pod2 within the Node1 Node, their operation falls into two cases:
Case 1: Pod1 (I/O-intensive) runs on NUMA0 and Pod2 (compute-intensive) runs on NUMA1;
Case 2: Pod1 (I/O-intensive) runs on NUMA1 and Pod2 (compute-intensive) runs on NUMA0.
in case 1, pod1 may frequently send and receive data through the network card during operation, and may occupy a lot of network bandwidth, however, pod1 operates on NUMA0, and during operation, access to resources across the NUMA bus may not be involved. In the running process of Pod2, intensive operation is carried out, data can be hardly transmitted and received through a network card in the period, and little network bandwidth is occupied, so that node overload caused by accessing resources across NUMA buses is avoided.
In case 2, Pod1 must send and receive data across NUMA through NUMA0's network card: the network card's bandwidth cannot be fully utilized, and cross-NUMA access overhead is introduced. Meanwhile, Pod2 performs intensive mathematical computation that loads the CPU heavily, even fully. This CPU load degrades the network card's packet-processing performance, and the computation's large memory demand may additionally incur remote-memory access overhead.
Therefore, when network resources are unevenly distributed across NUMA nodes, scheduling an I/O-intensive Pod onto a NUMA node far from the network resources, or a compute-intensive Pod onto a NUMA node close to them, wastes system resources and adds unnecessary overhead.
In summary, when network resources are unevenly distributed across NUMA nodes, I/O-intensive Pods should be scheduled onto the NUMA node close to the network card, and compute-intensive Pods onto the NUMA node far from it.
Based on the above analysis, the embodiment of the application schedules Pods with different roles reasonably according to the resources they need.
For example, by adjusting the scoring policy, an I/O-intensive Pod can be scheduled onto a NUMA node close to network and/or disk resources, and a compute-intensive Pod onto a NUMA node far from them.
When implementing the scoring-policy adjustment, one way may be:
for an I/O-intensive Pod, giving a higher score to a Node that has more remaining resources on the NUMA node close to the network and/or disk resources, so that the I/O-intensive Pod is scheduled onto that NUMA node;
for a compute-intensive Pod, giving a higher score to a Node that has more remaining resources on the NUMA node far from the network and/or disk resources, so that the compute-intensive Pod is scheduled onto that NUMA node.
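This first adjustment can be sketched as an additive bonus on top of a Node's base score. The field names (`free_near_io`, `free_far_io`) and the one-to-one bonus rule are assumptions made for illustration; the patent does not specify a formula.

```python
def numa_aware_score(node, pod_type, base_score):
    """Illustrative additive scoring adjustment (assumed formula, not from the
    patent). `node` carries the free capacity on the NUMA node near the
    NIC/disk (`free_near_io`) and on the NUMA node far from it (`free_far_io`)."""
    if pod_type == "io-intensive":
        # More free capacity near the network/disk resources -> higher score.
        return base_score + node["free_near_io"]
    if pod_type == "compute-intensive":
        # More free capacity far from the network/disk resources -> higher score.
        return base_score + node["free_far_io"]
    return base_score  # Pods of other roles keep the unadjusted score
```

With this rule, among equally scored Nodes, the one whose matching NUMA node has more headroom wins the tie.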
Another way may be:
for an I/O-intensive Pod, weighting Nodes that have more remaining resources on the NUMA node close to the network and/or disk resources, so that the I/O-intensive Pod is scheduled onto that NUMA node;
for a compute-intensive Pod, weighting Nodes that have more remaining resources on the NUMA node far from the network and/or disk resources, so that the compute-intensive Pod is scheduled onto that NUMA node.
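The second adjustment can instead multiply the base score by a weight when the matching NUMA node holds more free resources than the other one. The weight value and the comparison rule below are illustrative assumptions, not details from the patent.

```python
def numa_weighted_score(node, pod_type, base_score, weight=1.5):
    """Illustrative multiplicative weighting (assumed rule, not from the patent).
    Boost the score only when the NUMA node matching the Pod's role has more
    free capacity than the other NUMA node."""
    near, far = node["free_near_io"], node["free_far_io"]
    if pod_type == "io-intensive" and near > far:
        return base_score * weight
    if pod_type == "compute-intensive" and far > near:
        return base_score * weight
    return base_score
```

Compared with the additive bonus, a multiplicative weight preserves the relative ordering produced by the existing scoring rules and only amplifies Nodes whose NUMA layout suits the Pod.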
Accordingly, the embodiment of the application modifies the Pod scheduling program of the container orchestration system so that scheduled Pods are classified as I/O-intensive or compute-intensive. In the Node filtering stage, Nodes that do not meet the resource requirements for the Pod type are eliminated. In the scoring stage, for I/O-intensive Pods, Nodes with more remaining resources on the NUMA node close to the network and/or disk resources are scored or weighted higher; for compute-intensive Pods, Nodes with more remaining resources on the NUMA node far from the network resources are scored or weighted higher. In this way the target Node obtains a higher score, ensuring that each Pod is scheduled to its optimal node.
In summary, the NUMA-aware Pod scheduling method of the application adjusts the scoring policy in the scoring stage of Pod scheduling so that Pods with different roles are scheduled onto NUMA nodes close to or far from the corresponding resources and each Pod is scheduled to the optimal node, for example by scheduling I/O-intensive Pods onto NUMA nodes close to network and/or disk resources and compute-intensive Pods onto NUMA nodes far from them. This improves the accuracy with which the container orchestration scheduling algorithm selects cluster Nodes, makes fuller use of system resources, and eliminates the system overhead caused by NUMA-unaware scheduling.
The embodiment of the application also provides a NUMA-aware Pod scheduling apparatus comprising a scheduler.
The scheduler adjusts the scoring policy in the scoring stage of Pod scheduling and schedules Pods with different roles onto NUMA (non-uniform memory access) nodes close to or far from the corresponding resources, so that each Pod is scheduled to the optimal node.
In some embodiments, the scheduler is configured to:
adjust the scoring policy so that I/O-intensive Pods are scheduled onto NUMA nodes close to network and/or disk resources and compute-intensive Pods are scheduled onto NUMA nodes far from network and/or disk resources.
In some embodiments, the scheduler adjusting the scoring policy to schedule I/O-intensive Pods onto NUMA nodes close to network and/or disk resources and compute-intensive Pods onto NUMA nodes far from network and/or disk resources includes:
for an I/O-intensive Pod, giving a higher score to a Node that has more remaining resources on the NUMA node close to the network and/or disk resources, so that the I/O-intensive Pod is scheduled onto that NUMA node;
for a compute-intensive Pod, giving a higher score to a Node that has more remaining resources on the NUMA node far from the network and/or disk resources, so that the compute-intensive Pod is scheduled onto that NUMA node.
In some embodiments, the scheduler adjusting the scoring policy to schedule I/O-intensive Pods onto NUMA nodes close to network and/or disk resources and compute-intensive Pods onto NUMA nodes far from network and/or disk resources includes:
for an I/O-intensive Pod, weighting Nodes that have more remaining resources on the NUMA node close to the network and/or disk resources, so that the I/O-intensive Pod is scheduled onto that NUMA node;
for a compute-intensive Pod, weighting Nodes that have more remaining resources on the NUMA node far from the network and/or disk resources, so that the compute-intensive Pod is scheduled onto that NUMA node.
In some embodiments, the scheduler is further configured to:
in the filtering stage of Pod scheduling, eliminate Nodes that do not meet the resource requirements.
In summary, in the NUMA-aware Pod scheduling apparatus of the application, the scheduler adjusts the scoring policy in the scoring stage of Pod scheduling and schedules Pods with different roles onto NUMA nodes close to or far from the corresponding resources, so that each Pod is scheduled to the optimal node, for example by scheduling I/O-intensive Pods onto NUMA nodes close to network and/or disk resources and compute-intensive Pods onto NUMA nodes far from them. This improves the accuracy with which the container orchestration scheduling algorithm selects cluster Nodes, makes fuller use of system resources, and eliminates the system overhead caused by NUMA-unaware scheduling.
The foregoing is merely a specific implementation of the embodiments of the present application, but the protection scope of the embodiments is not limited thereto. Any equivalent modification or substitution that a person skilled in the art can readily conceive within the technical scope of the embodiments shall also fall within their protection scope. Therefore, the protection scope of the embodiments of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A NUMA-aware Pod scheduling method, the method comprising the steps of:
in the scoring stage of Pod scheduling, adjusting the scoring policy and scheduling Pods with different roles onto non-uniform memory access NUMA nodes close to or far from the corresponding resources, so that each Pod is scheduled to the optimal node.
2. The NUMA-aware Pod scheduling method of claim 1, wherein the adjusting of the scoring policy to schedule Pods with different roles onto NUMA nodes close to or far from the corresponding resources comprises:
adjusting the scoring policy so that I/O-intensive Pods are scheduled onto NUMA nodes close to network and/or disk resources and compute-intensive Pods are scheduled onto NUMA nodes far from network and/or disk resources.
3. The NUMA-aware Pod scheduling method of claim 2, wherein the adjusting of the scoring policy to schedule I/O-intensive Pods onto NUMA nodes close to network and/or disk resources and compute-intensive Pods onto NUMA nodes far from network and/or disk resources comprises:
for an I/O-intensive Pod, giving a higher score to a Node that has more remaining resources on the NUMA node close to the network and/or disk resources, so that the I/O-intensive Pod is scheduled onto that NUMA node;
for a compute-intensive Pod, giving a higher score to a Node that has more remaining resources on the NUMA node far from the network and/or disk resources, so that the compute-intensive Pod is scheduled onto that NUMA node.
4. The NUMA-aware Pod scheduling method of claim 2, wherein the adjusting of the scoring policy to schedule I/O-intensive Pods onto NUMA nodes close to network and/or disk resources and compute-intensive Pods onto NUMA nodes far from network and/or disk resources comprises:
for an I/O-intensive Pod, weighting Nodes that have more remaining resources on the NUMA node close to the network and/or disk resources, so that the I/O-intensive Pod is scheduled onto that NUMA node;
for a compute-intensive Pod, weighting Nodes that have more remaining resources on the NUMA node far from the network and/or disk resources, so that the compute-intensive Pod is scheduled onto that NUMA node.
5. The NUMA-aware Pod scheduling method of claim 1, further comprising:
in the filtering stage of Pod scheduling, eliminating Nodes that do not meet the resource requirements.
6. A NUMA-aware Pod scheduling apparatus, comprising:
a scheduler configured to adjust the scoring policy in the scoring stage of Pod scheduling and to schedule Pods with different roles onto non-uniform memory access NUMA nodes close to or far from the corresponding resources, so that each Pod is scheduled to the optimal node.
7. The NUMA-aware Pod scheduling apparatus of claim 6, wherein the scheduler is configured to:
adjust the scoring policy so that I/O-intensive Pods are scheduled onto NUMA nodes close to network and/or disk resources and compute-intensive Pods are scheduled onto NUMA nodes far from network and/or disk resources.
8. The NUMA-aware Pod scheduling apparatus of claim 7, wherein the scheduler adjusting the scoring policy to schedule I/O-intensive Pods onto NUMA nodes close to network and/or disk resources and compute-intensive Pods onto NUMA nodes far from network and/or disk resources comprises:
for an I/O-intensive Pod, giving a higher score to a Node that has more remaining resources on the NUMA node close to the network and/or disk resources, so that the I/O-intensive Pod is scheduled onto that NUMA node;
for a compute-intensive Pod, giving a higher score to a Node that has more remaining resources on the NUMA node far from the network and/or disk resources, so that the compute-intensive Pod is scheduled onto that NUMA node.
9. The NUMA-aware Pod scheduling apparatus of claim 7, wherein the scheduler adjusting the scoring policy to schedule I/O-intensive Pods onto NUMA nodes close to network and/or disk resources and compute-intensive Pods onto NUMA nodes far from network and/or disk resources comprises:
for an I/O-intensive Pod, weighting Nodes that have more remaining resources on the NUMA node close to the network and/or disk resources, so that the I/O-intensive Pod is scheduled onto that NUMA node;
for a compute-intensive Pod, weighting Nodes that have more remaining resources on the NUMA node far from the network and/or disk resources, so that the compute-intensive Pod is scheduled onto that NUMA node.
10. The NUMA-aware Pod scheduling apparatus of claim 6, wherein the scheduler is further configured to:
in the filtering stage of Pod scheduling, eliminate Nodes that do not meet the resource requirements.
CN202310522681.7A 2023-05-10 2023-05-10 NUMA-aware Pod scheduling method and apparatus Pending CN116578383A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310522681.7A CN116578383A (en) 2023-05-10 2023-05-10 NUMA-aware Pod scheduling method and apparatus


Publications (1)

Publication Number Publication Date
CN116578383A (en) 2023-08-11

Family

ID=87540641


Country Status (1)

Country Link
CN (1) CN116578383A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination