CN113419827A - High-performance computing resource scheduling fair sharing method - Google Patents


Info

Publication number: CN113419827A
Authority: CN (China)
Prior art keywords: leaf, task, quota, tasks, scheduling
Legal status: Withdrawn (the legal status is an assumption and is not a legal conclusion)
Application number: CN202110509596.8A
Other languages: Chinese (zh)
Inventor: 陆伟钊
Current assignee: Beijing Skycloud Rongchuang Software Technology Co ltd
Original assignee: Beijing Skycloud Rongchuang Software Technology Co ltd
Application filed by Beijing Skycloud Rongchuang Software Technology Co ltd
Priority: CN202110509596.8A
Publication: CN113419827A (withdrawn after publication)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/48: Program initiating; Program switching, e.g. by interrupt
    • G06F 9/4806: Task transfer initiation or dispatching
    • G06F 9/4843: Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F 9/4881: Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multi Processors (AREA)

Abstract

The invention provides a fair-sharing method for scheduling high-performance computing resources. It achieves fair resource sharing by weighting historical resource usage into each unit's dynamic quota, improving the quality of service the high-performance computing system delivers to users. The leaf a task belongs to is located through a hash table, and after each successfully scheduled task only a local reordering of the share-tree nodes is performed, which speeds up scheduling and gives the system high throughput.

Description

High-performance computing resource scheduling fair sharing method
Technical Field
The invention relates to high-performance computing resource and task scheduling, and in particular to a fair-sharing method, based on distributed computing, for scheduling high-performance computing resources.
Background
The following introduces the related art.
First, high-performance computing and big data task scheduling systems
High-performance computing and big data systems are distributed computing systems: the whole system is a cluster formed from many servers, and computing and data tasks are distributed across those servers to run.
The resource and task scheduling system is a key technology of a distributed computing system. Users' computing tasks are run through the resource and task scheduler rather than by logging into a server directly.
A task refers to a complete unit of computation. Users submit multiple tasks into a queue. The scheduler reads task definitions from the queue and allocates resources to tasks based on resource availability (i.e., which hosts are working properly), what has already been allocated, and the defined scheduling policy.
In a high-performance computing and big data environment, the sum of the resources required by all tasks is typically greater than the resources the system has available. Automatically and reasonably allocating these resources is the main role of the resource and task scheduler.
Second, the fair sharing principle
Fair sharing is a scheduling policy commonly used in resource and task scheduling: when limited resources are allocated to multiple users, the amount of resources each user gets is allocated automatically and reasonably according to a predefined scheduling policy. A typical example: tasks from two different users wait in a queue to be scheduled onto 100 CPU cores (distributed across the hosts of a cluster), and the scheduling policy requires:
1. When the total number of CPU cores needed by all tasks in the queue exceeds the number of CPU cores in the cluster, the cores occupied by the two users' tasks are allocated in a 3:2 ratio (i.e., 60:40).
2. When the first user has too few tasks to use its full share, the remaining resources of the cluster may all be allocated to the second user. If the first user currently needs only 20 cores and the second user has enough tasks waiting in the queue, the second user can use up to 80 cores.
3. When the first user adds more tasks to the queue, the resources the second user occupies beyond 40 cores are released for the first user to use, until the ratio of resources used by the two users returns to 3:2.
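The three rules above amount to weighted water-filling: split the free cores by the share ratio, cap each user at what it can actually use, and recycle the surplus to the users that still have demand. A minimal illustrative sketch (not from the patent; the function and its names are hypothetical):

```python
def fair_share(total, weights, demands):
    """Water-filling fair share: split `total` cores by weight ratio,
    cap each user at its demand, and recycle the surplus to the rest."""
    alloc = dict.fromkeys(weights, 0)
    remaining = total
    active = {u for u in weights if demands[u] > 0}
    while remaining > 0 and active:
        wsum = sum(weights[u] for u in active)
        # tentative proportional grant for each user that still wants cores
        grants = {u: min(demands[u] - alloc[u],
                         remaining * weights[u] // wsum) for u in active}
        if all(g == 0 for g in grants.values()):
            # remainder smaller than one weighted slice: give it to the
            # highest-weight active user to guarantee progress
            u = max(active, key=lambda x: weights[x])
            grants = {u: min(demands[u] - alloc[u], remaining)}
        for u, g in grants.items():
            alloc[u] += g
            remaining -= g
            if alloc[u] >= demands[u]:
                active.discard(u)   # user saturated, drop from next round
    return alloc

# Rule 1: both users over-demand -> 3:2 split of 100 cores
print(fair_share(100, {"A": 3, "B": 2}, {"A": 200, "B": 200}))  # {'A': 60, 'B': 40}
# Rule 2: first user needs only 20 cores -> second user may take 80
print(fair_share(100, {"A": 3, "B": 2}, {"A": 20, "B": 200}))   # {'A': 20, 'B': 80}
```

Running the two cases reproduces the 60:40 and 20:80 splits from the example; rule 3 is the same computation re-run when the first user's demand grows again.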
In practice, a fair-sharing scheduling policy may define multiple layers, as shown in fig. 1.
A cluster is used by multiple users, organized by department and project. The value in parentheses is a unit's resource share; the value itself has no meaning (it may be any number), and only the ratios between sibling unit values determine each unit's allocation of the whole cluster (resource pool). The 3:2 split of the first example could equally be written as 30 and 20, or as 60 and 40.
The cluster is shared by department 1 and department 2 in a ratio of 20:10, i.e., 2:1.
The two users in department 1 use the resources assigned to that department in a ratio of 1:2.
The resources allocated to department 2 are subdivided into two projects, with project 1 weighted 2 and project 2 weighted 3.
The resources of project 1 are distributed in equal proportion to all users belonging to that project.
The resources of project 2 are allocated to user 1 and user 2 in a ratio of 1:2.
Third, existing fair sharing algorithms
Commercial and open-source resource and task scheduling software generally provides a fair-sharing scheduling policy. Although each basically determines the priority of every unit (user, department, project, etc.) by sorting the leaves of a tree data structure, each implementation uses its own code logic; no public library providing the fair-sharing function is available.
Description of fair sharing in the open-source software SLURM: Slurm Workload Manager, Fair Tree Fairshare Algorithm (schedmd.com)
Description of fair sharing in the commercial software PBS Pro: Fairshare Management for Altair PBS Professional (PDF)
The commercial software IBM Spectrum LSF also has a fair-sharing function: Fairshare scheduling (ibm.com)
Fourth, the scheduling performance of fair sharing
Fair-sharing scheduling determines task priority by comparing the dynamic quota of each unit on the share tree with the quotas of the other units. A unit's dynamic quota changes after each of its tasks is scheduled, so each time a task is scheduled the priority of all units must be re-evaluated to determine which unit the next task comes from; the waiting task belonging to that unit is then fetched from the queue.
Therefore, throughout the scheduling process, the share-tree units must be re-ranked after every scheduled task and the waiting task of the highest-priority unit looked up, and the scheduling of multiple tasks cannot be parallelized, so scheduling speed is inherently limited.
How to quickly schedule a large number of tasks in a large environment (over a hundred thousand CPU cores, hundreds of thousands of waiting tasks), or how to use free resources in time in a high-throughput environment (task run times under 30 seconds, more than ten thousand tasks in total), is a critical issue.
Fifth, fair sharing that considers historical usage
Many fair-sharing scheduling algorithms (e.g., SLURM's) consider only the current resource usage at the nodes of the share tree, and such algorithms are not truly "fair". For example, users A and B share one resource pool with a quota ratio of 1:1, i.e., 50% each. At the start, user A has 1000 tasks and user B has none, so user A can use the entire resource pool. When 900 of user A's tasks have completed, user B submits 1000 tasks; from then on the scheduler schedules A's and B's tasks at a 1:1 ratio, so when user A's 1000 tasks are done, user B still has 900 tasks left. Seen over the whole run, the scheduler has effectively followed a first-in first-out policy.
A more reasonable fair-sharing schedule would have the two users' tasks complete at the same time.
This requires the algorithm to consider historical resource usage: when user B starts submitting tasks, user A gets a very low priority, since it has used excess resources for a long time, and resources go preferentially to user B. Once user B's total resource usage has grown to equal user A's, the two users' quotas become the same, and in the end all tasks of both users finish simultaneously.
Disclosure of Invention
The invention provides a fair-sharing method for scheduling high-performance computing resources that solves the two problems of the fair-sharing scheduling algorithms mentioned above: (1) scheduling speed, guaranteeing high throughput; and (2) fair sharing that accounts for historical resource usage. The technical scheme is as follows:
A fair-sharing method for scheduling high-performance computing resources comprises the following steps:
S1: initialize the data structure: convert the configured fair-sharing structure into a tree data structure, calculate the static quota of each leaf, set each dynamic quota equal to the static quota, and set up a sub-queue for each leaf;
S2: place tasks into the leaf sub-queues according to the user each task belongs to;
S3: sort the leaves in descending order of dynamic quota to generate a sorted leaf list;
S4: take the first task from the sub-queue of the leaf with the highest dynamic quota, schedule it, and adjust that leaf's dynamic quota;
S5: compare against the dynamic quota of the next leaf in the sorted list; if the leaf whose task was just scheduled now has a lower dynamic quota than the next leaf, move it behind that leaf, repeating until its dynamic quota is higher than the next leaf's;
S6: loop steps S4-S5 until the scheduling cycle ends or the tasks in all sub-queues have been scheduled;
S7: end the scheduling cycle;
S8: adjust the dynamic quotas of the share-tree leaves according to the running and completion states of tasks;
S9: return to step S2 and proceed to the next scheduling cycle.
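As a rough illustration, the cycle of steps S3-S7 can be sketched as follows. This is not the patented implementation: the quota charge per task is simplified to a fixed `cost_per_task` (the patent's quota formula is given only as an image), and leaf selection is shown with a plain `max()` rather than the sorted-list optimization of S5; the `Leaf` class and all names are hypothetical.

```python
from collections import deque

class Leaf:
    def __init__(self, name, quota):
        self.name = name
        self.static_quota = quota
        self.dynamic_quota = quota   # S1: dynamic quota starts equal to static
        self.queue = deque()         # S1: per-leaf sub-queue of waiting tasks

def scheduling_cycle(leaves, free_cores, cost_per_task=0.1):
    """One cycle of S3-S7: repeatedly schedule the head task of the leaf
    with the highest dynamic quota, charging that leaf's quota each time."""
    scheduled = []
    while free_cores > 0:
        candidates = [l for l in leaves if l.queue]
        if not candidates:
            break                                              # S6: queues drained
        leaf = max(candidates, key=lambda l: l.dynamic_quota)  # S4: pick leaf
        scheduled.append(leaf.queue.popleft())                 # S4: schedule task
        leaf.dynamic_quota -= cost_per_task                    # S4: charge quota
        free_cores -= 1
    return scheduled                                           # S7: cycle ends

a, b = Leaf("A", 0.6), Leaf("B", 0.4)
a.queue.extend(["a1", "a2"])
b.queue.extend(["b1", "b2"])
print(scheduling_cycle([a, b], 4, cost_per_task=0.3))  # → ['a1', 'b1', 'a2', 'b2']
```

Note how charging the quota after each scheduled task makes the two leaves alternate instead of draining the higher-share leaf first.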
Further, in step S1, the global share of each leaf of the fair-share tree is calculated: the cluster-level share is defined as 1, and each leaf's share is calculated from the top down.
Further, in step S1, each leaf unit has a dynamic quota whose initial value equals the static quota.
Further, in step S1, setting up the leaf sub-queues means building a hash table keyed by leaf unit name, so that when a task is later placed into a leaf sub-queue, the leaf unit can be found quickly from the task's user name.
Further, in step S2, tasks numbering 2-3 times the total amount of resources are taken from the task queue in order of task submission.
Further, in step S4, after the first task is scheduled, if the task starts successfully, the unit's quota is decremented according to the resources allocated to the task.
Further, in step S4, the dynamic quota is calculated by a formula that, together with the definitions of its terms, appears only as images in the original publication and is not reproduced here.
the invention has the following advantages:
(1) and the traditional method for sequencing all the nodes by using the shared tree is not used, and the sequencing of the shared tree after no task is scheduled is optimized.
(2) When the tasks in the waiting queue are evaluated, certain optimization is also carried out, and all the tasks in the waiting queue are not traversed.
(3) And after the task is scheduled, the computation node quota considers the use of historical resources.
Drawings
FIG. 1 is a schematic diagram of a multi-layer fair-sharing scheduling policy definition;
FIG. 2 is a schematic flow diagram of the method provided by the present invention;
FIG. 3 is a schematic diagram of the sorted leaf list;
FIG. 4 is a schematic diagram of the fair-sharing configuration used in testing the method;
FIG. 5 is a diagram of the per-user statistical results during testing;
FIG. 6 is a diagram of the per-project statistical results during testing.
Detailed Description
As shown in fig. 2, the fair-sharing method for scheduling high-performance computing resources provided by the present invention includes the following steps:
S1: initialize the data structure: convert the configured fair-sharing structure into a tree data structure, calculate the static quota of each leaf, set each dynamic quota equal to the static quota, and set up a sub-queue for each leaf.
The fair-sharing structure is shown in fig. 1; all of the following description uses the structure in this figure as an example.
S11: calculate the global share of each leaf of the fair-share tree: define the cluster-level share as 1 and calculate each leaf's share from the top down. In the example of fig. 1, department 1 is 0.6667, department 2 is 0.3333, user 3 is 0.2222, user 4 is 0.4445, project 1 is 0.1333, project 2 is 0.2, user 1 is 0.0667, and user 2 is 0.0333. The final leaf unit values are shown in Table 1. These are the static quotas, i.e., each leaf's configured share of a total of 1.000.
Table 1: Configured static quotas of the leaf units

Leaf unit   Static quota
User 1      0.0667
User 2      0.0333
Project 1   0.1333
User 3      0.2222
User 4      0.4445
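The top-down share calculation of S11 can be sketched as a recursive walk over the weight tree. This is an illustrative reconstruction, not the patent's code; the tree is only expanded to the leaf level shown in Table 1, and the last digit of user 4 comes out 0.4444 rather than the 0.4445 printed in the table, depending on rounding.

```python
def static_quotas(node, share=1.0):
    """Top-down walk of the share tree: a child's global quota is the
    parent's quota times its weight divided by the sum of sibling weights."""
    name, children = node              # children: list of (weight, subtree)
    if not children:                   # leaf unit: record its global share
        return {name: round(share, 4)}
    total = sum(w for w, _ in children)
    quotas = {}
    for w, child in children:
        quotas.update(static_quotas(child, share * w / total))
    return quotas

# Share tree of fig. 1 down to the level of Table 1: cluster share is 1,
# departments weighted 20:10, dept-1 users 1:2, dept-2 projects 2:3.
tree = ("cluster", [
    (20, ("dept 1", [(1, ("user 3", [])), (2, ("user 4", []))])),
    (10, ("dept 2", [(2, ("project 1", [])), (3, ("project 2", []))])),
])
print(static_quotas(tree))
```

The printed quotas match the department, user 3/4, and project 1/2 values derived in S11 (up to rounding of the last digit).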
S12: each leaf unit is also given a dynamic quota whose initial value equals the static quota. The leaves are sorted in descending order of quota; the order after sorting Table 1 is: user 4, user 3, project 1, user 1, user 2.
S13: then establish a sub-queue for each leaf unit:
Build a hash table keyed by leaf unit name, so that when a task is later placed into a leaf sub-queue, the leaf unit can be found quickly from the task's user name.
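A sketch of the S13 hash table, assuming a direct mapping from a task's user name to its leaf unit (the dictionary layout, task fields, and `enqueue` helper are illustrative, not from the patent):

```python
from collections import deque

# Hash table keyed by leaf-unit name: each entry holds the leaf's dynamic
# quota and its sub-queue, so a task is routed to its leaf in O(1).
leaf_index = {name: {"dynamic_quota": quota, "queue": deque()}
              for name, quota in [("user 4", 0.4445), ("user 3", 0.2222),
                                  ("project 1", 0.1333), ("user 1", 0.0667),
                                  ("user 2", 0.0333)]}

def enqueue(task):
    """Look up the leaf by the task's user name and append the task."""
    leaf_index[task["user"]]["queue"].append(task)

enqueue({"id": 1, "user": "user 3", "cores": 4})
enqueue({"id": 2, "user": "user 3", "cores": 2})
print(len(leaf_index["user 3"]["queue"]))  # → 2
```

The point of the hash table is that distributing the prefetched tasks in S2 costs one lookup per task instead of a search over the share tree.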
S2: place tasks into the leaf sub-queues according to the user each task belongs to. The number of tasks fetched is 2-3 times the size of the resource pool.
A scheduling cycle then starts, as follows:
Take tasks numbering 2-3 times the total amount of resources (the resource pool) from the task queue, in order of task submission (first submitted, first fetched).
This is done to (a) avoid traversing all waiting tasks (perhaps hundreds of thousands to a million), and (b) ensure that free resources can be fully used.
Each task is then placed into the sub-queue of the leaf it belongs to.
S3: sort the leaves in descending order of dynamic quota to generate the sorted leaf list:
The sorted list is shown in fig. 3; 12 tasks have been sorted and wait to be placed into the leaf sub-queues.
The leaf units are sorted from high to low by dynamic quota; the sorted result is shown in Table 2.
Table 2: Ordering of the leaf units

Leaf unit   Dynamic quota   Sub-queue tasks
User 4      0.4445          4
User 3      0.2222          2
Project 1   0.1333          2
User 1      0.0667          2
User 2      0.0333          2
S4: take the first task from the sub-queue of the leaf with the highest dynamic quota, schedule it, and adjust that leaf's dynamic quota:
Take a task from the sub-queue of the unit with the highest quota, allocate resources to it, and start it. If the start succeeds, reduce the unit's quota according to the resources allocated to the task (see the description after S9 for how the dynamic quota is calculated).
S5: compare against the dynamic quota of the next leaf in the sorted list; if the leaf whose task was just scheduled now has a lower dynamic quota than the next leaf, move it back until its dynamic quota is higher than the next leaf's:
Re-sorting all the leaves in descending order of the updated quotas would be time-consuming if the share tree has many leaf units (e.g., many distinct users).
To increase speed, only the dynamic quota of the leaf whose task was just scheduled is compared with the next unit; if it is smaller, the two are swapped, and this repeats until it is larger than the next unit.
In this way a leaf very probably does not need to move at all, or moves only 1-2 positions, which is faster than re-sorting the whole leaf list. Since this step runs once per scheduled task, the speed-up is substantial when scheduling a large number of tasks (hundreds of thousands).
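The local reordering described here can be sketched as a single bubbling pass. Because scheduling a task lowers only one leaf's quota, restoring descending order needs at most a few adjacent swaps instead of a full sort (the `reposition` helper and its leaf objects are illustrative):

```python
from types import SimpleNamespace

def reposition(order, i):
    """After charging the quota of order[i], swap it toward the tail until
    the descending dynamic-quota order is restored; returns its new index.
    Usually 0-2 swaps, versus re-sorting the whole list after every task."""
    while (i + 1 < len(order)
           and order[i].dynamic_quota < order[i + 1].dynamic_quota):
        order[i], order[i + 1] = order[i + 1], order[i]
        i += 1
    return i

# The leaf at the head just dropped to 0.1 after its task was scheduled
order = [SimpleNamespace(dynamic_quota=q) for q in (0.1, 0.4, 0.3)]
reposition(order, 0)
print([l.dynamic_quota for l in order])  # → [0.4, 0.3, 0.1]
```

This is the step that keeps per-task overhead near O(1) and makes the high-throughput numbers in the embodiment plausible.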
S6: loop until the scheduling cycle ends or the tasks in all sub-queues have been scheduled; otherwise jump back to step S4:
Repeat step S4 until (a) no CPU core is available in the cluster or (b) all tasks in the leaf sub-queues have been scheduled.
S7: the scheduling cycle is ended;
S8: adjust the dynamic quotas of the share-tree leaves according to task state (waiting, running, or finished):
Check the running state and run time of all running tasks, then adjust the dynamic quota of each leaf unit.
S9: enter the next scheduling cycle;
The process returns to step S2 and the next scheduling cycle proceeds.
Taking into account the influence of the resource occupation of historical tasks on the dynamic quota, the unit's dynamic quota is calculated using a formula whose expression and term definitions appear only as images in the original publication and are not reproduced here.
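Since the published formula survives only as images, the following is merely a plausible reconstruction of a history-aware dynamic quota: exponential half-life decay of past usage is a common choice in fair-share schedulers. The function, its parameters (`half_life`, `weight`), and the core-seconds charge model are assumptions, not the patent's actual formula.

```python
import math

def dynamic_quota(static_quota, usage_history, now,
                  half_life=3600.0, weight=1e-5):
    """Hypothetical reconstruction: subtract an exponentially decayed sum
    of past usage (core-seconds) from the static quota, so units that used
    excess resources for a long time sink in priority until others catch up.
    usage_history is a list of (timestamp, core_seconds) pairs."""
    decay = math.log(2) / half_life
    charged = sum(core_seconds * math.exp(-decay * (now - t))
                  for t, core_seconds in usage_history)
    return static_quota - weight * charged

# A user who just consumed 10,000 core-seconds loses about 0.1 of quota;
# the same usage one half-life (1 h) earlier would cost only about 0.05.
print(dynamic_quota(0.5, [(0.0, 10_000)], now=0.0))     # about 0.4
print(dynamic_quota(0.5, [(0.0, 10_000)], now=3600.0))  # about 0.45
```

Under this kind of charge, the user A / user B scenario from the background plays out as described: A's accumulated usage suppresses its quota until B's total usage catches up, after which both quotas equalize.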
In an embodiment, the method was tested in the SkyForm AIP task scheduler; the test results are as follows:
1. Test environment:
(1) Resources: AWS (Amazon public cloud). The main scheduling server is a c4.xlarge (16 CPU cores, 30 GB memory); the 1000 compute servers are t2.micro instances (1 CPU core, 1 GB memory), each simulating 50 CPU cores, giving the scheduler a cluster of 50,000 cores in total.
(2) Tasks: 1 million tasks with run times of around 3 minutes.
(3) Configuration:
5 projects and 8 users, with each user belonging to all projects simultaneously, as shown in fig. 4.
2. Task submission:
Users u001-u008 each submitted 125,000 tasks. These tasks were divided equally among five project groups (idle, owner, priority, normal, short), i.e., 25,000 tasks per project group per user. The submission order was random.
3. Final results:
(1) The per-user statistics are shown in fig. 5. Task scheduling achieved fair sharing among users.
(2) The per-project statistics are shown in fig. 6. Task scheduling achieved fair sharing among projects (one queue per project).
The measured scheduling throughput was 792,952 tasks/hour. In the same test, the open-source software SLURM achieved a scheduling throughput of 126,849 tasks/hour.
The advantages of the technical scheme are:
(1) By weighting historical resource usage into the dynamic quota, fair resource sharing is achieved, improving the quality of service the high-performance computing system delivers to users.
(2) The leaf a task belongs to is found through a hash table, and after each successfully scheduled task only a local reordering of the share-tree nodes is performed, which speeds up scheduling and gives the system high throughput.

Claims (7)

1. A fair-sharing method for scheduling high-performance computing resources, comprising the following steps:
S1: initialize the data structure: convert the configured fair-sharing structure into a tree data structure, calculate the static quota of each leaf, set each dynamic quota equal to the static quota, and set up a sub-queue for each leaf;
S2: place tasks into the leaf sub-queues according to the user each task belongs to;
S3: sort the leaves in descending order of dynamic quota to generate a sorted leaf list;
S4: take the first task from the sub-queue of the leaf with the highest dynamic quota, schedule it, and adjust that leaf's dynamic quota;
S5: compare against the dynamic quota of the next leaf in the sorted list; if the leaf whose task was just scheduled now has a lower dynamic quota than the next leaf, move it behind that leaf, repeating until its dynamic quota is higher than the next leaf's;
S6: loop steps S4-S5 until the scheduling cycle ends or the tasks in all sub-queues have been scheduled;
S7: end the scheduling cycle;
S8: adjust the dynamic quotas of the share-tree leaves according to the running and completion states of tasks;
S9: return to step S2 and proceed to the next scheduling cycle.
2. The method according to claim 1, wherein in step S1, the global share of each leaf of the fair-share tree is calculated: the cluster-level share is defined as 1, and each leaf's share is calculated from the top down.
3. The method according to claim 1, wherein in step S1, each leaf unit has a dynamic quota whose initial value equals the static quota.
4. The method according to claim 1, wherein in step S1, setting up the leaf sub-queues means building a hash table keyed by leaf unit name, so that when a task is later placed into a leaf sub-queue, the leaf unit can be found quickly from the task's user name.
5. The method according to claim 1, wherein in step S2, tasks numbering 2-3 times the total amount of resources are taken from the task queue in order of task submission.
6. The method according to claim 1, wherein in step S4, after the first task is scheduled, if the task starts successfully, the unit's quota is decremented according to the resources allocated to the task.
7. The method according to claim 1, wherein in step S4, the dynamic quota is calculated by a formula that, together with the definitions of its terms, appears only as images in the original publication and is not reproduced here.
CN202110509596.8A 2021-05-11 2021-05-11 High-performance computing resource scheduling fair sharing method Withdrawn CN113419827A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110509596.8A CN113419827A (en) 2021-05-11 2021-05-11 High-performance computing resource scheduling fair sharing method


Publications (1)

Publication Number Publication Date
CN113419827A true CN113419827A (en) 2021-09-21

Family

ID=77712194

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110509596.8A Withdrawn CN113419827A (en) 2021-05-11 2021-05-11 High-performance computing resource scheduling fair sharing method

Country Status (1)

Country Link
CN (1) CN113419827A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070256077A1 (en) * 2006-04-27 2007-11-01 International Business Machines Corporation Fair share scheduling based on an individual user's resource usage and the tracking of that usage
CN103380608A (en) * 2011-03-09 2013-10-30 中国科学院计算机网络信息中心 Method for gathering queue information and job information in computation environment
CN105302650A (en) * 2015-12-10 2016-02-03 云南大学 Dynamic multi-resource equitable distribution method oriented to cloud computing environment
CN107291545A (en) * 2017-08-07 2017-10-24 星环信息科技(上海)有限公司 The method for scheduling task and equipment of multi-user in computing cluster


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DENG JINGWEN: "User-oriented fair job scheduling algorithm for cluster systems", China Masters' Theses Full-text Database, Information Science and Technology series, no. 11, pages 1-64 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20210921)