CN113535332A - Cluster resource scheduling method and device, computer equipment and storage medium

Info

Publication number
CN113535332A
Authority
CN
China
Prior art keywords
node
resource
task
target
computing unit
Prior art date
Legal status
Granted
Application number
CN202110921010.9A
Other languages
Chinese (zh)
Other versions
CN113535332B (en)
Inventor
李亚坤
张云尧
Current Assignee
Beijing ByteDance Network Technology Co Ltd
Original Assignee
Beijing ByteDance Network Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing ByteDance Network Technology Co Ltd filed Critical Beijing ByteDance Network Technology Co Ltd
Priority to CN202110921010.9A
Publication of CN113535332A
Application granted
Publication of CN113535332B
Legal status: Active
Anticipated expiration

Classifications

    • G06F 9/45558: Hypervisor-specific management and integration aspects
    • G06F 9/4881: Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G06F 9/5038: Allocation of resources to service a request, the resource being a machine, considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
    • G06F 9/5077: Logical partitioning of resources; Management or configuration of virtualized resources

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)
  • Multi Processors (AREA)

Abstract

The present disclosure relates to a method, an apparatus, a computer device and a storage medium for scheduling cluster resources, where each node in a cluster includes a plurality of computing units and adopts a non-uniform memory access architecture, the method includes: receiving the residual resource quantity and unit identification of each computing unit reported by each node; receiving a resource scheduling request of a target job, wherein the resource scheduling request comprises the number of resources required by a task of the target job; determining a target computing unit with the residual resource quantity not less than the resource quantity required by the task in the plurality of computing units, acquiring a first node identification of a first target node where the target computing unit is located based on the unit identification of the target computing unit, and determining a resource allocation result of the task; and sending the resource allocation result to the first target node based on the first node identification so as to bind the task to the resource of the target computing unit and execute the task on the target computing unit. Therefore, the condition of accessing the memory across the computing units is avoided, and the overall performance of task execution is improved.

Description

Cluster resource scheduling method and device, computer equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a resource scheduling method, a resource scheduling apparatus, a computer device implementing the resource scheduling method, and a computer-readable storage medium.
Background
The cluster scheduling system is a basic platform for resource management and scheduling of a cluster, and supports scheduling of various tasks in a large-scale cluster; a typical example is Hadoop YARN (Yet Another Resource Negotiator). The YARN cluster scheduling system generally includes a resource manager (RM), which is responsible for resource management and allocation for the whole cluster, and a plurality of node managers (NMs), where the NM is the resource and task manager on each node and acts as an agent that manages the node. The NM reports resource usage (such as CPU, memory, etc.) and the running state of each resource container (Container) to the RM at regular intervals. After a user submits a job through a client (Client), the RM creates a corresponding application master (AM) for the job. The AM is responsible for managing the job, which generally comprises a plurality of subtasks; the AM submits resource applications to the RM to request resources for the job's subtasks, and after the RM allocates the resources, the AM communicates with the NM of the node where the allocated resources are located so that the subtasks are executed on the corresponding node.
In the related art, nodes in the cluster scheduling system may adopt a Non-Uniform Memory Access (NUMA) architecture. Under the NUMA architecture, each node is divided into a plurality of computing units (also referred to as sockets); each computing unit has its own independent CPU and memory, i.e., local memory, and the computing units are interconnected. The CPU in one computing unit may access its local memory or the memory of any other computing unit; that is, memory may be accessed across computing units.
When a job has a task to execute, the AM applies to the RM for resources and informs the RM of the quantity of resources required to execute the task. The RM receives the remaining resource quantity of each node reported by that node's NM, and if the remaining resource quantity of a node meets the quantity of resources required to execute the task, the RM schedules the task to be executed on that node. However, since a node under the NUMA architecture has multiple computing units, each with its own independent CPU and memory, a large amount of cross-computing-unit memory access easily occurs inside the node while a job's task runs, which can seriously degrade the overall performance of task execution.
Disclosure of Invention
An object of the embodiments of the present disclosure is to provide a cluster resource scheduling method, a cluster resource scheduling apparatus, and a computer device and a computer-readable storage medium for implementing the cluster resource scheduling method, so as to reduce or even avoid cross-computing-unit memory access inside a node when the cluster schedules resources for a job's task, thereby improving the overall performance of task execution.
In a first aspect, the present disclosure provides a method for scheduling cluster resources, where each node in a cluster includes multiple computing units and adopts a non-uniform memory access architecture, the method includes:
receiving the residual resource quantity and unit identification of each computing unit in the nodes reported by each node in the cluster;
receiving a resource scheduling request of a target job, wherein the resource scheduling request comprises the quantity of resources required by a task of the target job;
determining, from among the plurality of computing units, a target computing unit whose remaining resource quantity is not less than the quantity of resources required by the task, based on the remaining resource quantity of each computing unit of each node and the quantity of resources required by the task;
based on the unit identifier of the target computing unit, acquiring a first node identifier of a first target node where the target computing unit is located, and determining a resource allocation result of the task;
based on the first node identification, sending the resource allocation result to the first target node to bind the task to the resource of the target computing unit, and then executing the task on the target computing unit.
Optionally, in some embodiments of the present disclosure, before determining the resource allocation result of the task, the method further includes:
determining a preset performance parameter of each node, wherein the preset performance parameter is used for representing the capacity of the node for accessing the memory across the computing units;
and when each preset performance parameter is smaller than or equal to a preset performance parameter threshold value, returning to the step of determining a target computing unit of which the remaining resource quantity is not smaller than the resource quantity required by the task in the plurality of computing units based on the remaining resource quantity of each computing unit of each node and the resource quantity required by the task.
Optionally, in some embodiments of the present disclosure, the step of determining the preset performance parameter of each node includes:
receiving unit parameters of each computing unit on each node reported by each node, wherein the unit parameters comprise any one or more of the total number of the computing units on the node, the resource type in each computing unit and the total number of resources;
and determining a preset performance parameter corresponding to each node based on the unit parameter corresponding to each node, wherein the unit parameter and the preset performance parameter are in positive correlation.
Optionally, in some embodiments of the present disclosure, the method further includes:
when each preset performance parameter is larger than the preset performance parameter threshold, determining a second target node of which the remaining resource quantity in each node is not smaller than the resource quantity required by the task based on the resource quantity required by the task and the remaining resource quantity of each node;
acquiring a second node identifier of the second target node;
based on the second node identification, notifying a node manager of the second target node to launch a corresponding resource container to perform the task.
Optionally, in some embodiments of the present disclosure, the method further includes:
starting timing when no target computing unit whose remaining resource quantity is not less than the resource quantity required by the task has been determined among the plurality of computing units;
and when the timing duration is longer than the preset duration, determining a target computing unit of which the remaining resource quantity is not less than the resource quantity required by the task in the plurality of computing units based on the remaining resource quantity of each computing unit of each node at the current moment and the resource quantity required by the task.
Optionally, in some embodiments of the present disclosure, the method further includes:
selecting any one of the target computing units when it is determined that there are a plurality of the target computing units.
In a second aspect, an embodiment of the present disclosure provides a cluster resource scheduling apparatus, where each node in a cluster includes a plurality of computing units and adopts a non-uniform memory access architecture, the apparatus includes:
a data receiving module, configured to receive the number of remaining resources and the unit identifier of each computing unit in the node, where the number of remaining resources and the unit identifier are reported by each node in the cluster;
the system comprises a request receiving module, a resource scheduling module and a resource scheduling module, wherein the request receiving module is used for receiving a resource scheduling request of a target job, and the resource scheduling request comprises the quantity of resources required by a task of the target job;
a unit determining module, configured to determine, based on a remaining resource amount of each computing unit of each node and a resource amount required by the task, a target computing unit in which the remaining resource amount in the plurality of computing units is not less than the resource amount required by the task;
the resource allocation module is used for acquiring a first node identifier of a first target node where the target computing unit is located based on the unit identifier of the target computing unit and determining a resource allocation result of the task;
and the task execution module is used for sending the resource allocation result to the first target node based on the first node identifier so as to bind the task to the resource of the target computing unit, and then executing the task on the target computing unit.
Optionally, in some embodiments of the present disclosure, the apparatus further includes:
the performance parameter determining module is used for determining preset performance parameters of each node, and the preset performance parameters are used for representing the capacity of the nodes for accessing the memory across the computing units;
the unit determining module is further configured to determine, when each of the preset performance parameters is less than or equal to a preset performance parameter threshold, a target computing unit in which the remaining resource amount in the plurality of computing units is not less than the resource amount required by the task, based on the remaining resource amount of each of the computing units of each of the nodes and the resource amount required by the task.
In a third aspect, embodiments of the present disclosure provide a computer device comprising a processor and a memory;
wherein the memory stores a computer program executable by the processor;
the processor, when executing the computer program, implements the cluster resource scheduling method as described in any embodiment of the first aspect.
In a fourth aspect, this disclosure provides a computer-readable storage medium, where a computer program is stored, and the computer program is called and executed by a processor to implement the cluster resource scheduling method described in any embodiment of the first aspect.
Compared with the prior art, the technical scheme provided by the embodiment of the disclosure has the following advantages:
the method, the device, the computer equipment and the storage medium for scheduling the cluster resources provided by the embodiment of the disclosure receive the residual resource quantity and the unit identification of each computing unit in the nodes reported by each node in the cluster; receiving a resource scheduling request of a target job, wherein the resource scheduling request comprises the quantity of resources required by a task of the target job; determining a target computing unit, which is not less than the quantity of resources required by the task, in the plurality of computing units based on the quantity of remaining resources of each computing unit of each node and the quantity of resources required by the task; based on the unit identifier of the target computing unit, acquiring a first node identifier of a first target node where the target computing unit is located, and determining a resource allocation result of the task; based on the first node identification, sending the resource allocation result to the first target node to bind the task to the resource of the target computing unit, and then executing the task on the target computing unit. In this way, after the first target node and the target computing unit on the first target node which meet the resource quantity required by task execution are determined, the task can be bound to the resource in the target computing unit on the first target node, so that the task is executed inside the target computing unit, and therefore when the resource is scheduled for the task of the job, the situation that the cross-computing unit inside the node accesses the memory is avoided, and the overall performance of task execution can be improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
In order to more clearly illustrate the embodiments or technical solutions in the prior art of the present disclosure, the drawings used in the description of the embodiments or prior art will be briefly described below, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.
Fig. 1 is a schematic flowchart of a resource scheduling method according to an embodiment of the present disclosure;
fig. 2 is a flowchart illustrating a resource scheduling method according to another embodiment of the disclosure;
fig. 3 is a flowchart illustrating a resource scheduling method according to another embodiment of the disclosure;
fig. 4 is a flowchart illustrating a resource scheduling method according to yet another embodiment of the present disclosure;
FIG. 5 is a schematic diagram of data interaction for resource scheduling in the system structure of the Hadoop YARN according to the embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of a resource scheduling apparatus according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of a computer device according to an embodiment of the present disclosure.
Detailed Description
In order that the above objects, features and advantages of the present disclosure may be more clearly understood, aspects of the present disclosure will be further described below. It should be noted that the embodiments and features of the embodiments of the present disclosure may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, but the present disclosure may be practiced in other ways than those described herein; it is to be understood that the embodiments disclosed in the specification are only a few embodiments of the present disclosure, and not all embodiments.
In order to avoid the problem that a cluster scheduling system such as YARN accesses a memory across computing units inside a node where a resource is located when scheduling the resource for a job task and improve the overall performance of task execution, the embodiments of the present disclosure provide a resource scheduling method, a resource scheduling apparatus, a computer device, and a computer-readable storage medium. Next, a resource scheduling method provided in the embodiment of the present disclosure is first described.
The resource scheduling method provided by the embodiment of the disclosure can be applied to a cluster scheduling system, which is a basic platform with cluster resource management and scheduling functions, and can be Hadoop YARN, K8s, Docker Swarm, and the like. Taking YARN as an example, the cluster scheduling system mainly includes RM and multiple NM, where NM is the resource and task manager on each node. In some scenarios, nodes in the cluster adopt a NUMA architecture, each node is divided into a plurality of computing units, and each computing unit has its own independent resources such as CPU, memory, and other types of resources. As shown in fig. 1, a resource scheduling method provided in the embodiment of the present disclosure may include the following steps:
step S101: and receiving the residual resource quantity and the unit identification of each computing unit in the nodes reported by each node in the cluster.
Specifically, the resource manager RM receives the remaining resource quantity and unit identifier of each computing unit of each node, which are reported by the node manager NM of each node.
For example, a cluster includes N nodes, i.e., node 1 to node N; each node has M (M ≥ 2) computing units, and each computing unit on each node has its own unit identifier and independently available resources, such as CPU, memory, network card, GPU, and the like.
For example, assume that node 1 has 3 computing units (A, B, C), and the independently available resources of each computing unit (A, B, C) include, but are not limited to, a 10-core CPU, 100 GB of memory, 2 GPU cards, 1 network card, etc. Each computing unit (A, B, C) has its own unit ID; for example, the unit ID of computing unit A is 11, the unit ID of computing unit B is 12, and the unit ID of computing unit C is 13. This is only an example; the specific value of the ID is not limited in this embodiment, as long as it can distinguish and identify different computing units.
The remaining resource quantity S of any computing unit is obtained by subtracting the allocated resource quantity Y from the total available resource quantity X of that computing unit, i.e., S = X - Y. For example, if the total independently available resources X of computing unit A are a 10-core CPU and 100 GB of memory, while the currently allocated resources Y of computing unit A are 4 CPU cores and 60 GB of memory, then the remaining resources S of computing unit A are 6 CPU cores and 40 GB of memory.
Specifically, the NM of each node will typically report available resource information of each node, such as the remaining resource amount of the node, the operating status information of the resource Container, and the like, to the RM through heartbeat. In this embodiment, the NM of each node also reports the remaining resource quantity and unit identifier, such as ID, of each computing unit of each node to the RM through heartbeat.
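As an illustration of this bookkeeping, the following is a minimal sketch of how an NM might compute the per-unit remaining quantity (S = X - Y) and package it for a heartbeat. The class and field names here are illustrative assumptions, not part of Hadoop YARN's actual API:

```java
import java.util.List;

// Hypothetical types for illustration; not Hadoop YARN's API.
public class ComputeUnitReport {
    // A resource vector; only CPU cores and memory are modeled here.
    public record Resources(int cpuCores, long memoryMb) {
        public Resources minus(Resources other) {
            return new Resources(cpuCores - other.cpuCores, memoryMb - other.memoryMb);
        }
    }

    // One computing unit's status: unit ID, total available X, allocated Y.
    public record UnitStatus(int unitId, Resources total, Resources allocated) {
        // Remaining quantity S = X - Y, as described above.
        public Resources remaining() {
            return total.minus(allocated);
        }
    }

    public static void main(String[] args) {
        // Computing unit A (ID 11): 10 cores / 100 GB total, 4 cores / 60 GB allocated.
        UnitStatus unitA = new UnitStatus(11,
                new Resources(10, 100 * 1024L),
                new Resources(4, 60 * 1024L));
        // The NM would attach a list like this to each heartbeat sent to the RM.
        List<UnitStatus> heartbeatPayload = List.of(unitA);
        System.out.println("heartbeat: " + heartbeatPayload);
        // Prints remaining Resources[cpuCores=6, memoryMb=40960] for unit 11.
        System.out.println("unit " + unitA.unitId() + " remaining " + unitA.remaining());
    }
}
```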
Step S102: receiving a resource scheduling request of a target job, wherein the resource scheduling request comprises the number of resources required by a task of the target job.
Specifically, when receiving and responding to the resource scheduling request to allocate resources to the task of the target job, the RM acquires the number of resources required by the task in the resource scheduling request.
Step S103: and determining a target computing unit of which the residual resource quantity is not less than the resource quantity required by the task in the plurality of computing units based on the residual resource quantity of each computing unit of each node and the resource quantity required by the task.
Specifically, the RM determines, based on the remaining resource amount of each computing unit of each node and the resource amount required by the task, a target computing unit whose remaining resource amount of the computing unit is not less than the resource amount required by the task.
When the user submits the target job to the RM through the Client, the resource quantities required for executing the job's tasks, such as 5 CPU cores and 10 GB of memory, are configured. After the job is submitted, the RM creates a corresponding AM for the job; the AM is responsible for managing the job, which generally comprises a plurality of subtasks, and the AM submits a resource scheduling request to the RM to apply for resources for the job's subtasks. When the RM allocates resources for a subtask of the job, it obtains the resource quantities required by the subtask, such as 5 CPU cores and 10 GB of memory, and then determines the target computing unit whose remaining resource quantity is not less than the resource quantity required by the subtask, based on the remaining resource quantity of each computing unit of each node reported by that node's NM.
For example, if the remaining resources of computing unit A on node 1 are 6 CPU cores and 40 GB of memory, which is clearly more than the quantity required by the subtask, such as 5 CPU cores and 10 GB of memory, computing unit A may be determined to be the target computing unit. Conversely, suppose the remaining resources of a certain node consist of two computing units and the remaining resources of each computing unit are 3 CPU cores and 6 GB of memory; then no computing unit's remaining resource quantity satisfies the subtask's requirement of 5 CPU cores and 10 GB of memory, and the resources may simply not be allocated, allocation may be retried after waiting for a certain duration, or the request may be abandoned if it is still unsatisfied after waiting for a certain duration.
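A hedged sketch of this matching step (S103), reusing the illustrative types from the sketch above; this is not YARN's actual scheduler code. The scheduler returns any unit whose remaining resources cover the request, and an empty result corresponds to the "no allocation / wait and retry" cases just described:

```java
import java.util.List;
import java.util.Optional;

// Illustrative scheduler-side matching; types are the hypothetical ones above.
public class UnitMatcher {
    // A candidate pairs a node identifier with one of its computing units.
    public record Candidate(String nodeId, ComputeUnitReport.UnitStatus unit) {}

    // Return a unit whose remaining resources are not less than the task's
    // requirement; when several qualify, any one of them may be chosen.
    public static Optional<Candidate> findTargetUnit(List<Candidate> candidates,
                                                     ComputeUnitReport.Resources required) {
        return candidates.stream()
                .filter(c -> c.unit().remaining().cpuCores() >= required.cpuCores()
                        && c.unit().remaining().memoryMb() >= required.memoryMb())
                .findFirst();
    }
}
```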
Step S104: and acquiring a first node identifier of a first target node where the target computing unit is located based on the unit identifier of the target computing unit, and determining a resource allocation result of the task.
Specifically, after the RM determines a target computing unit, such as computing unit A on node 1, the RM may obtain a unit identification of computing unit A, such as ID "11", that is, the target unit identification. A first node identification, such as the name, IP address, etc., of the node 1 may then be obtained for a first target node, such as node 1, where the target computing unit, such as computing unit a, is located. Thereafter, the RM may return the acquired unit identification, such as ID "11", and the first node identification, such as the name of node 1, IP address, etc., as resource allocation results to the AM corresponding to the job.
Step S105: based on the first node identification, sending the resource allocation result to the first target node to bind the task to the resource of the target computing unit, and then executing the task on the target computing unit.
Specifically, the AM sends a resource allocation result to the NM of the first target node based on the first node identifier, so as to bind the task to the resource in the target computing unit on the first target node, and then execute the task on the target computing unit.
Specifically, the AM generates the task scheduling information based on the first node identification, such as the name of node 1, the IP address, and the unit ID "11" of the target computing unit, such as computing unit a. Under native scheduling, node information such as node identification needs to be attached to the scheduling information, and in this embodiment, in addition to the node information such as node identification, an additional unit identification of the target computing unit needs to be attached to indicate on which specific computing unit on the node 1 the allocated resource is. The AM sends task scheduling information to the NM of node 1 based on the first node identity, e.g. name, IP address of node 1, the task scheduling information instructing the NM of node 1 to bind the sub-tasks to resources in a target computational unit on node 1, e.g. computational unit a, and then execute the sub-tasks on computational unit a.
In one embodiment, the subtask is bound to the resources of the corresponding target computing unit, such as computing unit A; specifically, the subtask may be bound to the corresponding resources such as the CPU, memory, GPU and network card of computing unit A. For example, if unit ID "11" of computing unit A corresponds to CPU IDs 0 to 9 and memory ID 0, the subtask may be bound to the 10 CPUs with IDs 0 to 9 and the corresponding memory with ID 0 by using the CPUSET controller of CGroups (Control Groups) or another binding mechanism. GPUs, network cards, etc. may be bound in a similar manner. The subtask is thus unaware of the binding: it runs normally, uses only the bound resources such as CPU and memory, and never accesses memory across computing units.
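The binding itself can be expressed, for example, against the Linux cgroup v1 cpuset interface. The sketch below is not the NM's actual implementation; it assumes the cpuset controller is mounted at the conventional /sys/fs/cgroup/cpuset path, uses an illustrative group name, and needs sufficient privileges to run:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative cpuset binding under cgroup v1; group name is an assumption.
public class CpusetBinder {
    public static void bindTask(String groupName, String cpuIds, String memNodes, long pid)
            throws IOException {
        Path group = Path.of("/sys/fs/cgroup/cpuset", groupName);
        Files.createDirectories(group);
        // e.g. cpuIds = "0-9" and memNodes = "0" for computing unit A above.
        Files.writeString(group.resolve("cpuset.cpus"), cpuIds);
        Files.writeString(group.resolve("cpuset.mems"), memNodes);
        // Moving the task's PID into the group confines it to those CPUs and
        // that memory node, so it never touches another unit's memory.
        Files.writeString(group.resolve("tasks"), Long.toString(pid));
    }

    public static void main(String[] args) throws IOException {
        bindTask("yarn-task-example", "0-9", "0", ProcessHandle.current().pid());
    }
}
```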
According to the cluster resource scheduling method provided by the embodiments of the present disclosure, after the first target node, and the target computing unit on it, that satisfy the resource quantity required for task execution are determined, the task can be bound to the resources in that target computing unit so that the task is executed inside the target computing unit. Therefore, when resources are scheduled for a job's task, cross-computing-unit memory access inside the node is avoided, and the overall performance of task execution can be improved.
Optionally, on the basis of the above embodiments, in some embodiments of the present disclosure, with reference to fig. 2, before determining the result of allocating resources of the task in step S103, the method may further include the following steps:
step S201: and determining a preset performance parameter of each node, wherein the preset performance parameter is used for representing the capacity of the node for accessing the memory across the computing units.
Specifically, the number of computing units may differ between nodes in the cluster, and node performance may differ as well, so the strength of each node's cross-computing-unit memory access capability also differs. If a node's capability of accessing memory across computing units is poor, the performance loss is large and jobs can hardly tolerate it, which greatly affects the overall performance of task execution for the job.
In this regard, in one embodiment, the RM may determine the preset performance parameter of each node, i.e., a parameter characterizing how strong the node's capability of accessing memory across computing units is, before allocating resources for the job's task. The preset performance parameters may be pre-configured and stored on each node, then obtained by the NM of each node and reported to the RM, but this is not limiting.
Step S202: and when each preset performance parameter is smaller than or equal to a preset performance parameter threshold value, returning to the step of determining a target computing unit of which the remaining resource quantity is not smaller than the resource quantity required by the task in the plurality of computing units based on the remaining resource quantity of each computing unit of each node and the resource quantity required by the task.
It can be understood that when each preset performance parameter is less than or equal to the preset performance parameter threshold, that is, when the capability of each node to access the memory across the computing units is weak, the performance loss of the cluster system is large, and the task operation of the job is seriously affected. At this time, the process returns to the step of determining a target computing unit in which the remaining resource amount of the plurality of computing units is not less than the resource amount required by the task based on the remaining resource amount of each computing unit of each node and the resource amount required by the task in step S103, and then the steps S104 to S105 are continuously executed.
Therefore, by first checking the performance parameter of the cluster nodes that reflects the strength of their cross-computing-unit memory access capability, resource scheduling based on the scheme provided by the embodiments of the present disclosure is started when the parameter is small; after the first target node, and the target computing unit on it, that satisfy the resource quantity required for task execution are determined, the task can be bound to the resources in that target computing unit so that the task is executed inside the target computing unit. Therefore, when resources are scheduled for a job's task, cross-computing-unit memory access inside the node is avoided, and the overall performance of task execution can be improved. In addition, the flexibility of cluster resource scheduling is increased.
Optionally, in some embodiments of the present disclosure, the step of determining the preset performance parameter of each node in step S201 may specifically include the following sub-steps:
step i): receiving unit parameters of each computing unit on each node reported by each node, wherein the unit parameters comprise any one or more of the total number of the computing units on the node, the resource type in each computing unit and the total number of resources.
For example, in some embodiments of the present disclosure, the resource type in each computing unit may include, but is not limited to, any one or more of a CPU, memory, GPU, and network card.
Specifically, the NM of each node may report, via heartbeat, the total number of computing units on the node, for example, 3 computing units on node 1; the resource types in each computing unit, for example, CPU and memory; and the total resource quantity of each computing unit, for example, a total available quantity of 10 CPU cores and 100 GB of memory.
Step ii): and determining a preset performance parameter corresponding to each node based on the unit parameter corresponding to each node, wherein the unit parameter and the preset performance parameter are in positive correlation.
Specifically, the RM may then determine the preset performance parameter corresponding to each node based on the received total number of computing units on the node, the resource types in each computing unit, such as CPU and memory, and the total resource quantity of each computing unit, such as a total available quantity of 10 CPU cores, 100 GB of memory, and the like.
In one embodiment, for example, the greater the total number of computing units on a node, the greater the preset performance parameter; for instance, a correspondence table between different totals of computing units and preset performance parameters is established in advance, and the preset performance parameter corresponding to the node is determined by looking up the table. Alternatively, two preset performance parameters are respectively calculated based on the total number of computing units and on the resource types and total resource quantity in each computing unit, and their average is taken as the final preset performance parameter corresponding to the node.
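The sketch below illustrates the second strategy, averaging two sub-scores derived from the unit parameters. The scoring formulas are assumptions made up for illustration; only the stated positive correlation between the unit parameters and the performance parameter is taken from the text:

```java
// Illustrative only: the scoring formulas are assumptions, not the patent's.
public class NodePerformance {
    public static double presetPerformanceParameter(int totalUnits,
                                                    int resourceTypeCount,
                                                    int totalCpuCores) {
        // Sub-score 1: more computing units -> larger parameter (the text
        // suggests a lookup table; a direct proportion here for simplicity).
        double unitScore = totalUnits;
        // Sub-score 2: richer units (more resource types, more cores) -> larger.
        double resourceScore = resourceTypeCount * (double) totalCpuCores;
        // Average the two sub-scores into the node's final parameter.
        return (unitScore + resourceScore) / 2.0;
    }
}
```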
Optionally, on the basis of the above embodiments, in some embodiments of the present disclosure, in combination with the step shown in fig. 3, the method may further include the following steps:
step S301: and when each preset performance parameter is larger than the preset performance parameter threshold, determining a second target node of which the residual resource quantity in each node is not smaller than the resource quantity required by the task based on the resource quantity required by the task and the residual resource quantity of each node.
It can be understood that, when each preset performance parameter is greater than a preset performance parameter threshold, that is, when the capability of each node for accessing the memory across the computing units is strong, the overall performance loss of the cluster system is greatly reduced, and the influence on the task operation of the job is light or even negligible. At this time, resource allocation may be performed based on the existing native resource scheduling allocation manner, and specifically, the RM determines, based on the number of resources required by the sub-task and the remaining resource number of each node, a second target node, such as node 2, where the remaining resource number of the node is not less than the number of resources required by the sub-task.
Step S302: and acquiring a second node identifier of the second target node.
Specifically, the RM acquires a second node identifier, such as an ID, of the node 2, and sends the second node identifier, such as the ID, to the AM.
Step S303: based on the second node identification, informing the NM of the second target node to start a corresponding resource container to perform the task.
In particular, the AM sends a notification message to the NM of the second target node, e.g. node 2, based on the second node identity, e.g. ID, to instruct the NM to initiate the corresponding resource Container on node 2 to perform the sub-task.
Therefore, by first checking the performance parameter that reflects the strength of the cluster nodes' cross-computing-unit memory access capability, the native resource scheduling and allocation manner alone is adopted when the parameter is large, which improves the flexibility of cluster resource scheduling.
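Putting the two branches together, the following is a hedged sketch of the threshold check that selects between unit-level binding (steps S101-S105) and native node-level allocation (steps S301-S303). The threshold value and printed actions are illustrative, with the native path reduced to a comment:

```java
import java.util.List;
import java.util.Optional;

// Illustrative dispatch between unit-level and native node-level scheduling.
public class SchedulingDispatcher {
    public static void schedule(double presetPerformanceParameter,
                                double threshold,
                                List<UnitMatcher.Candidate> candidates,
                                ComputeUnitReport.Resources required) {
        if (presetPerformanceParameter <= threshold) {
            // Weak cross-unit memory access: bind the task to a single unit.
            Optional<UnitMatcher.Candidate> target =
                    UnitMatcher.findTargetUnit(candidates, required);
            target.ifPresent(t -> System.out.println(
                    "allocate unit " + t.unit().unitId() + " on node " + t.nodeId()));
        } else {
            // Strong cross-unit memory access: fall back to native node-level
            // scheduling and have the NM start an ordinary resource container.
            System.out.println("use native node-level allocation");
        }
    }
}
```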
Optionally, in some embodiments of the present disclosure, in combination with the illustration in fig. 4, the method may further include the following steps:
step S401: and starting timing when the target computing unit with the residual resource quantity of the plurality of computing units not less than the resource quantity required by the task is not determined.
Step S402: and when the timing duration is longer than the preset duration, determining a target computing unit of which the remaining resource quantity is not less than the resource quantity required by the task in the plurality of computing units based on the remaining resource quantity of each computing unit of each node at the current time and the resource quantity required by the task.
For example, the preset duration may be set as needed and is not limited here. Specifically, when no target computing unit satisfying the resource quantity required by the subtask has been determined, the RM may wait for a certain period; as the cluster system changes dynamically, resources of computing units on one or more nodes may be released, and the RM may then determine a target computing unit whose remaining resource quantity is not less than the resource quantity required by the subtask, based on the remaining resource quantity of each computing unit of each node at the current moment. In this way, cross-computing-unit memory access inside the nodes is still avoided when scheduling resources for the job's tasks, which improves the overall performance of task execution, while resources can still be scheduled for the job's tasks in time so that they are executed quickly.
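A sketch of this wait-and-retry flow (steps S401-S402); the polling interval, the preset duration, and the supplier of fresh cluster state are illustrative assumptions:

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Supplier;

// Illustrative retry logic for the timing-based re-determination above.
public class RetryingMatcher {
    public static Optional<UnitMatcher.Candidate> waitAndRetry(
            Supplier<List<UnitMatcher.Candidate>> currentClusterState,
            ComputeUnitReport.Resources required,
            long presetDurationMs) throws InterruptedException {
        // Start timing once no target computing unit has been found.
        long start = System.currentTimeMillis();
        while (System.currentTimeMillis() - start <= presetDurationMs) {
            Thread.sleep(200); // resources may be released as the cluster changes
        }
        // Timing exceeded the preset duration: re-determine the target unit
        // from each unit's remaining resources at the current moment.
        return UnitMatcher.findTargetUnit(currentClusterState.get(), required);
    }
}
```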
Optionally, in some embodiments of the present disclosure, the method further includes: when it is determined that there are a plurality of target calculation units, an arbitrary one of the target calculation units is selected.
Specifically, in an embodiment, when the RM determines, in step S103, a target computing unit whose remaining resource amount of the computing unit is not less than the resource amount required by the task based on the remaining resource amount of each computing unit of each node, if it is determined that the remaining resource amount of the multiple computing units is not less than the resource amount required by the task, that is, there are multiple target computing units, then any computing unit of the multiple target computing units is selected as a final target computing unit.
For example, based on the remaining resource quantity of each computing unit of each node (e.g., node 1 to node N), the RM determines that the remaining resources, e.g., the remaining CPU and memory, of the 3 computing units (A, B, C) on node 1 are each not less than the resources, e.g., CPU and memory, required by the task, and may select any one of the 3 computing units (A, B, C), e.g., computing unit B, as the target computing unit. In some embodiments, the qualifying computing units may also be located on different nodes, such as node 2 and node 3; this is only an example and is not limited in this embodiment.
For convenience of understanding, the resource scheduling method provided by the embodiments of the present disclosure is described below with reference to a specific example of a Hadoop YARN.
As shown in fig. 5, a schematic diagram of data interaction for resource scheduling under the system structure of the Hadoop YARN is shown. The Hadoop YARN includes an RM and a plurality of NMs. The RM comprises a scheduler used for resource scheduling management, each NM node adopts NUMA architecture and is divided into a plurality of computing unit sockets, and each Socket has independent CPU, memory and other resources such as GPU cards, network cards and the like.
First, the NM of each node detects the node's Socket-related information (such as the total number of Sockets, the Socket IDs, and the remaining resource quantity of each Socket) and reports it to the RM.
For example, a node has 4 sockets, and the total resource corresponding to each Socket includes a CPU 10 core, a memory 100GB, 2 GPU cards, 1 network card, and the like. And after receiving the Socket related information reported by the NM of each node, the RM records the Socket related information for subsequent scheduling.
Second, when the AM applies for resources for a task of the job, the RM schedules and allocates the resources, finding an idle node and an idle Socket for the task.
Specifically, the task is allowed to be allocated to a Socket only when that Socket's remaining resources can completely satisfy the Container resources required by the task. For example, suppose the Container required by a task is defined as (5 cores, 10 GB) and a node's remaining resources consist of two Sockets, each with only (3 cores, 6 GB) remaining; then no Socket satisfies the task's resource requirement, and allocation may be disallowed.
When the residual resource size of one Socket can completely meet the Container resource size required by the task, the RM returns the Socket ID of the corresponding Socket and the node ID of the node where the Socket is located to the AM.
Third, the AM generates a task scheduling message according to the Socket ID and the node ID, and sends it to the NM of the node indicated by the node ID, for example, the NM of node N.
In the embodiment of the present disclosure, in addition to node information such as a node ID, the task scheduling information needs to be accompanied by an additional Socket ID (which indicates a specific Socket).
Fourth, when the NM receives the task scheduling information, it binds the task to the CPU, memory, GPU card and network card in the Socket corresponding to the Socket ID, as indicated by the task scheduling information.
For example, Socket ID 0 indicates that the task needs to be bound to Socket 0 on node N; if the CPU IDs corresponding to Socket 0 are 0-9 and the memory ID is 0, the task is bound to those 10 cores and the corresponding memory through the CGroup CPUSET controller or another core-binding mechanism. The GPU card and the network card are bound in similar ways.
Because the task is allocated and bound to one Socket, it runs without being aware of the binding and uses only that Socket's resources, such as its CPU, memory, GPU card and network card; cross-Socket memory access is thereby avoided, and the overall performance of the job's task execution is improved.
The scheme of this embodiment greatly improves performance on machines with weak cross-Socket capability, benefiting almost all jobs; on machines with stronger cross-Socket capability, performance also improves to some extent, which is a considerable benefit for performance-sensitive jobs.
An embodiment of the present disclosure further provides a cluster resource scheduling apparatus, where each node in a cluster includes a plurality of computing units and adopts a non-uniform memory access architecture, as shown in fig. 6, the apparatus may include:
a data receiving module 601, configured to receive the number of remaining resources and the unit identifier of each computing unit in the node, which are reported by each node in the cluster;
a request receiving module 602, configured to receive a resource scheduling request of a target job, where the resource scheduling request includes a number of resources required by a task of the target job;
a unit determining module 603, configured to determine, based on a remaining resource amount of each computing unit of each node and a resource amount required by the task, a target computing unit in which the remaining resource amount in the plurality of computing units is not less than the resource amount required by the task;
a resource allocation module 604, configured to obtain, based on the unit identifier of the target computing unit, a first node identifier of a first target node where the target computing unit is located, and determine a resource allocation result of the task;
a task executing module 605, configured to send the resource allocation result to the first target node based on the first node identifier, so as to bind the task to the resource of the target computing unit, and then execute the task on the target computing unit.
According to the cluster resource scheduling apparatus provided by the embodiments of the present disclosure, after the first target node, and the target computing unit on it, that satisfy the resource quantity required for task execution are determined, the task can be bound to the resources in that target computing unit so that the task is executed inside the target computing unit. Therefore, when resources are scheduled for a job's task, cross-computing-unit memory access inside the node is avoided, and the overall performance of task execution can be improved.
Optionally, in some embodiments of the present disclosure, the apparatus further includes:
the performance parameter determining module is used for determining preset performance parameters of each node, and the preset performance parameters are used for representing the capacity of the nodes for accessing the memory across the computing units;
the unit determining module 603 is further configured to determine, when each preset performance parameter is less than or equal to a preset performance parameter threshold, a target computing unit, where the remaining resource amount in the multiple computing units is not less than the resource amount required by the task, based on the remaining resource amount of each computing unit of each node and the resource amount required by the task.
Optionally, in some embodiments of the present disclosure, the performance parameter determining module is specifically configured to: receiving unit parameters of each computing unit on each node reported by each node, wherein the unit parameters comprise any one or more of the total number of the computing units on the node, the resource type in each computing unit and the total number of resources; and determining a preset performance parameter corresponding to each node based on the unit parameter corresponding to each node, wherein the unit parameter and the preset performance parameter are in positive correlation.
Optionally, in some embodiments of the present disclosure, the resource types in the computing unit may include, but are not limited to, any one or more of a CPU, a memory, a GPU, and a network card.
Optionally, in some embodiments of the present disclosure, the apparatus may further include a scheduling control module, configured to determine, when each of the preset performance parameters is greater than the preset performance parameter threshold, a second target node whose remaining resource amount in each of the nodes is not less than the resource amount required by the task, based on the resource amount required by the task and the remaining resource amount of each of the nodes; acquiring a second node identifier of the second target node; based on the second node identification, notifying a node manager of the second target node to launch a corresponding resource container to perform the task.
Optionally, in some embodiments of the present disclosure, the apparatus may further include a timing module, configured to start timing when a target computing unit that has a remaining resource amount of the plurality of computing units that is not less than the resource amount required by the task is not determined. The unit determining module 603 is further configured to determine, when the timing duration of the timing module is greater than a preset duration, a target computing unit, where the remaining resource quantity in the plurality of computing units is not less than the resource quantity required by the task, based on the remaining resource quantity of each computing unit of each node at the current time and the resource quantity required by the task.
Optionally, in some embodiments of the present disclosure, the unit determining module 603 is further configured to: selecting any one of the target computing units when it is determined that there are a plurality of the target computing units.
The embodiment of the present disclosure provides a computer device, as shown in fig. 7, including a processor 701 and a memory 702, where the memory 702 stores a computer-executable program that can be executed by the processor 701, and when the processor 701 executes the computer-executable program, the cluster resource scheduling method provided in the embodiments of the present disclosure is implemented.
By applying the above scheme provided by the embodiments of the present disclosure, after the first target node, and the target computing unit on it, that satisfy the resource quantity required for task execution are determined, the task can be bound to the resources in that target computing unit so that the task is executed inside the target computing unit, thereby avoiding cross-computing-unit memory access inside the node when resources are scheduled for the job's task and improving the overall performance of task execution.
The memory may include a RAM (Random Access Memory) or an NVM (Non-Volatile Memory), such as at least one disk storage. Optionally, the memory may also be at least one storage device located remotely from the processor.
The processor may be a general-purpose processor, including a CPU, an NP (Network Processor), and the like; it may also be a DSP (Digital Signal Processor), an ASIC (Application-Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
The memory 702 and the processor 701 may be connected by a wired or wireless connection for data transmission, and the computer device may communicate with other devices through a wired or wireless communication interface. Fig. 7 shows an example of data transmission via a bus; the connection is not limited to any specific method.
When the computer device serves as a cluster node, the processor 701 in the embodiment shown in fig. 7 may be the CPU mentioned in the above method embodiment, or may be another independent processor. When the computer device is used as a cluster node, the computer device may further include other functional components such as a GPU card and a network card.
In addition, the embodiment of the present disclosure also provides a computer-readable storage medium, where the computer-readable storage medium may store a computer program, and the computer program is called and executed by a processor to implement the above cluster resource scheduling method provided by the embodiments of the present disclosure.
By applying the above scheme provided by the embodiments of the present disclosure, after the first target node, and the target computing unit on it, that satisfy the resource quantity required for task execution are determined, the task can be bound to the resources in that target computing unit so that the task is executed inside the target computing unit, thereby avoiding cross-computing-unit memory access inside the node when resources are scheduled for the job's task and improving the overall performance of task execution.
In another embodiment provided by the present disclosure, a computer program product is further provided, which when running on a computer, causes the computer to execute the above cluster resource scheduling method provided by the embodiment of the present disclosure.
For the above cluster resource scheduling apparatus, computer device, and computer-readable storage medium embodiments, since the related contents are substantially similar to the foregoing method embodiments, the description is relatively simple, and for the relevant points, reference may be made to part of the description of the method embodiments.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present disclosure are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored on a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, or DSL (Digital Subscriber Line)) or wirelessly (e.g., by infrared, radio, or microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or a magnetic tape), an optical medium (e.g., a DVD (Digital Versatile Disc)), or a semiconductor medium (e.g., an SSD (Solid State Drive)).
It is noted that, in this document, relational terms such as "first" and "second" may be used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between those entities or actions. Moreover, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The foregoing are merely exemplary embodiments of the present disclosure, which enable those skilled in the art to understand or practice the present disclosure. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A cluster resource scheduling method is characterized in that each node in a cluster comprises a plurality of computing units and adopts a non-uniform memory access architecture, and the method comprises the following steps:
receiving the quantity of remaining resources and the unit identifier of each computing unit in a node, as reported by each node in the cluster;
receiving a resource scheduling request of a target job, wherein the resource scheduling request comprises the quantity of resources required by a task of the target job;
determining, based on the quantity of remaining resources of each computing unit of each node and the quantity of resources required by the task, a target computing unit among the plurality of computing units of which the quantity of remaining resources is not less than the quantity of resources required by the task;
based on the unit identifier of the target computing unit, acquiring a first node identifier of a first target node where the target computing unit is located, and determining a resource allocation result of the task;
based on the first node identification, sending the resource allocation result to the first target node to bind the task to the resource of the target computing unit, and then executing the task on the target computing unit.
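For illustration, a minimal sketch of the scheduling flow of claim 1, assuming each resource is counted as a single integer quantity (for example, vcores). The names UnitReport and schedule_task and the dictionary form of the allocation result are assumptions for the sketch, not data structures from the disclosure.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class UnitReport:
    node_id: str    # node identifier of the reporting node
    unit_id: int    # unit identifier of the computing unit
    remaining: int  # quantity of remaining resources in the unit

def schedule_task(reports: List[UnitReport], required: int) -> Optional[dict]:
    """Determine a target computing unit whose remaining resources are not
    less than the quantity required by the task, and build the resource
    allocation result to send to that unit's node."""
    candidates = [r for r in reports if r.remaining >= required]
    if not candidates:
        return None                     # no unit can currently hold the task
    target = candidates[0]              # any candidate may be selected (claim 6)
    return {"node_id": target.node_id,  # first node identifier
            "unit_id": target.unit_id,
            "amount": required}

reports = [UnitReport("node-a", 0, 2), UnitReport("node-a", 1, 8)]
print(schedule_task(reports, required=4))  # -> unit 1 on node-a
```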
2. The method of claim 1, wherein prior to determining the resource allocation result for the task, the method further comprises:
determining a preset performance parameter of each node, wherein the preset performance parameter is used for representing the capacity of the node for accessing the memory across the computing units;
and when each preset performance parameter is smaller than or equal to a preset performance parameter threshold value, returning to the step of determining a target computing unit of which the remaining resource quantity is not smaller than the resource quantity required by the task in the plurality of computing units based on the remaining resource quantity of each computing unit of each node and the resource quantity required by the task.
3. The method of claim 2, wherein the step of determining the preset performance parameters of each node comprises:
receiving unit parameters of each computing unit on each node reported by each node, wherein the unit parameters comprise any one or more of the total number of the computing units on the node, the resource type in each computing unit and the total number of resources;
and determining a preset performance parameter corresponding to each node based on the unit parameter corresponding to each node, wherein the unit parameter and the preset performance parameter are in positive correlation.
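Claims 2 and 3 leave the exact formula open; the sketch below only illustrates the stated constraints: the parameter is derived from a node's unit parameters, is positively correlated with them, and gates unit-level scheduling against a threshold. The weights and the threshold value are arbitrary assumptions.

```python
def preset_performance_parameter(total_units, resource_types, total_resources):
    """Toy positive-correlation score (claim 3): the larger a node's unit
    parameters, the larger its assumed capacity for accessing memory
    across computing units."""
    return 1.0 * total_units + 0.5 * resource_types + 0.01 * total_resources

THRESHOLD = 10.0  # hypothetical preset performance parameter threshold

def unit_level_scheduling_applies(node_unit_params):
    """Claim 2: unit-level scheduling is used only when every node's
    parameter is less than or equal to the threshold."""
    return all(preset_performance_parameter(*p) <= THRESHOLD
               for p in node_unit_params)

# Two nodes described as (total_units, resource_types, total_resources).
print(unit_level_scheduling_applies([(2, 1, 64), (4, 2, 128)]))  # True
```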
4. A method according to claim 2 or 3, characterized in that the method further comprises:
when each preset performance parameter is larger than the preset performance parameter threshold, determining a second target node of which the remaining resource quantity in each node is not smaller than the resource quantity required by the task based on the resource quantity required by the task and the remaining resource quantity of each node;
acquiring a second node identifier of the second target node;
based on the second node identification, notifying a node manager of the second target node to launch a corresponding resource container to perform the task.
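For illustration, the claim 4 fallback under the same integer-resource assumption: when every node's parameter exceeds the threshold, unit boundaries are ignored and a second target node is chosen by its total remaining resources. The function name and dictionary input are assumptions for the sketch.

```python
from typing import Dict, Optional

def schedule_node_level(node_remaining: Dict[str, int],
                        required: int) -> Optional[str]:
    """Pick a second target node whose remaining resources (summed over
    all of its computing units) are not less than the task's requirement;
    that node's manager is then notified to launch a resource container."""
    for node_id, remaining in node_remaining.items():
        if remaining >= required:
            return node_id  # second node identifier to notify
    return None

print(schedule_node_level({"node-a": 3, "node-b": 9}, required=6))  # node-b
```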
5. The method according to any one of claims 1 to 3, further comprising:
starting timing when, among the plurality of computing units, no target computing unit whose quantity of remaining resources is not less than the quantity of resources required by the task is determined;
and when the timing duration is longer than the preset duration, determining a target computing unit of which the remaining resource quantity is not less than the resource quantity required by the task in the plurality of computing units based on the remaining resource quantity of each computing unit of each node at the current moment and the resource quantity required by the task.
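The waiting behaviour of claim 5 could look like the sketch below. The scheduling function is passed in (for example, schedule_task from the claim 1 sketch), and the preset duration and the way fresh unit reports are fetched are assumptions for the sketch.

```python
import time

def schedule_with_wait(get_reports, schedule_fn, required, preset_duration_s=5.0):
    """Claim 5: if no target computing unit currently fits the task, start
    timing; once the timed duration exceeds the preset duration, determine
    the target unit again from the remaining resources at that moment."""
    result = schedule_fn(get_reports(), required)
    if result is not None:
        return result
    start = time.monotonic()  # start timing when no target unit is found
    while time.monotonic() - start <= preset_duration_s:
        time.sleep(0.1)       # wait out the preset duration
    return schedule_fn(get_reports(), required)  # retry with current state
```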
6. The method according to any one of claims 1 to 3, further comprising:
selecting any one of the target computing units when it is determined that there are a plurality of the target computing units.
7. A device for scheduling cluster resources, wherein each node in the cluster includes a plurality of computing units and employs a non-uniform memory access architecture, the device comprising:
a data receiving module, configured to receive the number of remaining resources and the unit identifier of each computing unit in the node, where the number of remaining resources and the unit identifier are reported by each node in the cluster;
the system comprises a request receiving module, a resource scheduling module and a resource scheduling module, wherein the request receiving module is used for receiving a resource scheduling request of a target job, and the resource scheduling request comprises the quantity of resources required by a task of the target job;
a unit determining module, configured to determine, based on a remaining resource amount of each computing unit of each node and a resource amount required by the task, a target computing unit in which the remaining resource amount in the plurality of computing units is not less than the resource amount required by the task;
a resource allocation module, configured to acquire, based on the unit identifier of the target computing unit, a first node identifier of a first target node where the target computing unit is located, and to determine a resource allocation result of the task;
and a task execution module, configured to send, based on the first node identifier, the resource allocation result to the first target node, so as to bind the task to the resources of the target computing unit and then execute the task on the target computing unit.
8. The apparatus of claim 7, further comprising:
a performance parameter determining module, configured to determine a preset performance parameter of each node, where the preset performance parameter is used to represent the node's capacity for accessing memory across computing units;
the unit determining module is further configured to determine, when each of the preset performance parameters is less than or equal to a preset performance parameter threshold, a target computing unit in which the remaining resource amount in the plurality of computing units is not less than the resource amount required by the task, based on the remaining resource amount of each of the computing units of each of the nodes and the resource amount required by the task.
9. A computer device comprising a processor and a memory;
wherein the memory stores a computer program executable by the processor;
the processor, when executing the computer program, implements the cluster resource scheduling method of any of claims 1-6.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when invoked and executed by a processor, implements the cluster resource scheduling method of any one of claims 1-6.
CN202110921010.9A 2021-08-11 2021-08-11 Cluster resource scheduling method and device, computer equipment and storage medium Active CN113535332B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110921010.9A CN113535332B (en) 2021-08-11 2021-08-11 Cluster resource scheduling method and device, computer equipment and storage medium


Publications (2)

Publication Number Publication Date
CN113535332A true CN113535332A (en) 2021-10-22
CN113535332B CN113535332B (en) 2024-06-18

Family

ID=78090887


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109471705A (en) * 2017-09-08 2019-03-15 杭州海康威视数字技术股份有限公司 Method, equipment and system, the computer equipment of task schedule
CN109992407A (en) * 2018-01-02 2019-07-09 中国移动通信有限公司研究院 A kind of YARN cluster GPU resource dispatching method, device and medium
CN110941481A (en) * 2019-10-22 2020-03-31 华为技术有限公司 Resource scheduling method, device and system
WO2020199487A1 (en) * 2019-04-03 2020-10-08 平安科技(深圳)有限公司 Method, apparatus and device for responding to access request, and storage medium
CN112272203A (en) * 2020-09-18 2021-01-26 苏州浪潮智能科技有限公司 Cluster service node selection method, system, terminal and storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant