CN114780213A

CN114780213A - Resource scheduling method, system and storage medium for high-performance computing cloud platform

Info

Publication number: CN114780213A
Application number: CN202210297756.1A
Authority: CN
Inventors: 冯建新; 李青松
Original assignee: Shenzhen Beikun Cloud Computing Co ltd
Current assignee: Shenzhen Beikun Cloud Computing Co ltd
Priority date: 2022-03-24
Filing date: 2022-03-24
Publication date: 2022-07-22

Abstract

The invention discloses a resource scheduling method, system and storage medium for a high-performance computing cloud platform. The method comprises: acquiring running characteristic data of the application designated by a user for computing when a computing task is submitted on the high-performance computing cloud platform, and the hardware-related parameters of the resource specification designated when the job is submitted; screening out eligible instance types from a database containing the instance types of each region of each cloud vendor according to the acquired running characteristic data and hardware-related parameters; sorting the instance types to obtain a sorted instance type list; and scheduling resources from cloud vendors according to the screened and sorted instance type list. Instance types are selected automatically by the resource scheduling method, the computing power of cloud vendors worldwide is used as a resource pool, and the optimal computing resources are selected according to the running characteristics. This improves the flexibility of resource selection, effectively avoids insufficient instance type inventory, and ensures the success rate of resource scheduling.

Description

Resource scheduling method, system and storage medium for high-performance computing cloud platform

Technical Field

The invention relates to the field of high-performance computing, in particular to a resource scheduling method, a resource scheduling system and a storage medium for a high-performance computing cloud platform.

Background

Conventional high-performance computing (HPC) resource scheduling methods use the concept of a partition or a queue, hereinafter referred to collectively as a queue. A queue essentially describes a set of nodes with the same or similar performance characteristics and is used to specify the computing resource specification when a job is submitted. Cloud HPC follows this naming convention and uses a queue to represent the set of instance types with the same hardware specification in a specific region of a specific cloud vendor. A queue must be specified when a job is submitted, which in effect specifies the computing resources required by the job, including the cloud vendor, the region and the instance type used.

A queue usually contains the following information: resource type (CPU or GPU), number of cores or cards, memory size, cloud vendor, region and instance type.

Currently, an HPC job scheduling system is mainly composed of a job management submodule and a resource management submodule. The job management submodule is responsible for submitting and managing jobs, and the resource management submodule is responsible for allocating computing resources to jobs. The configuration information of each queue, including the cloud vendor, region and instance type described above, is maintained in the resource management submodule. When a user submits a job through the job management submodule, the user must specify the queue used by the job. The job management submodule sends the queue to the resource management submodule, which, according to the configuration information of the queue, creates an elastic computing cluster of the specified instance type in the specified cloud vendor and region. Such a system has the following disadvantages:

1. Cannot be extended to other instance types. The correspondence between a queue and an instance type is specified in advance during cluster configuration and cannot be adjusted dynamically at run time. When the instance type is out of stock, creation of the computing cluster fails, and a different queue must be used to run the job on another instance type.

2. Cannot be extended to other regions. Because queues and regions are tightly coupled in the existing scheme, the regional limitation of this scheduling mode is exposed when the queue and the computing resources required by the job are not in the same region, or when the region where the queue is located has insufficient resources. The job then cannot be scheduled enough resources, and computing efficiency suffers.

3. Cannot be extended to other cloud vendors. Similar to the regional limitation, binding a queue to a cloud vendor restricts submitted jobs to the resources of that vendor. If scheduling were not restricted to one cloud vendor, the larger pool of resources would to some extent relieve job queuing or failure caused by resource shortage.

4. The optimal cost performance cannot be obtained. Instance types offered by different cloud vendors and regions differ considerably in hardware specification, performance and price. If job scheduling were decoupled from the cloud vendor, multiple factors such as hardware configuration and price could be weighed together, and computing efficiency and cost performance could be improved by selecting the most appropriate computing resources.

Disclosure of Invention

The main object of the invention is to provide a resource scheduling method, system and storage medium for a high-performance computing cloud platform that do not use the concept of queues and do not require instance types to be statically configured in advance. Instead, based on the resource specification and task quantity information specified when a job is submitted, the method comprehensively considers factors such as resource richness, resource cost performance and application running characteristics, and selects instance types automatically, thereby taking the computing power of cloud vendors worldwide as a resource pool and selecting the optimal computing resources according to the characteristics of the job.

In order to achieve the above object, the present invention provides a resource scheduling method for a high performance computing cloud platform, the method including the following steps:

acquiring running characteristic data of the application designated by a user for computing when a computing task is submitted on a high-performance computing cloud platform, and the hardware-related parameters of the resource specification designated when the job is submitted;

screening out the instance types meeting the conditions from a database containing the region instance types of various cloud manufacturers according to the acquired running characteristic data of the application and the related parameters of the hardware;

sorting the instance types to obtain a sorted instance type list;

and scheduling resources from the cloud vendor according to the screened and sorted instance type list.

The step of acquiring the running characteristic data of the application designated by the user for computing when the high-performance computing cloud platform submits the computing task comprises the following steps:

acquiring an application ID appointed by a user for computing when a computing task is submitted on a high-performance computing cloud platform;

querying a database containing application running characteristics according to the application ID to obtain JSON data containing application information and the application running characteristics, wherein the application running characteristics comprise at least one of the following: single or double precision, computing coupling, high clock frequency, large memory, network I/O, disk I/O, CPU instruction set.

The step of screening out the qualified instance types from the database containing the cloud manufacturer region instance types according to the acquired operation characteristic data and hardware related parameters of the application comprises the following steps:

querying a database containing instance type information according to the resource specification given by the acquired hardware-related parameters, to obtain a list of eligible instance types;

and further screening the queried instance type list according to the acquired running characteristic data of the application to obtain an available instance type list.

The step of sorting the instance types to obtain a sorted list of instance types includes:

calculating each index weight of the instance type in the available instance type list according to the running characteristics of the application;

calculating a comprehensive score of the instance type according to the index weight and the ranking of the index of the instance type in the instance type list;

and sorting according to the comprehensive scores of the instance types to obtain a sorted instance type list.

Wherein, the step of calculating the index weights of the instance types in the available instance type list according to the running characteristics of the application comprises:

the rule for calculating the weights adopts a strategy that prioritizes computing efficiency rather than cost performance, or adjusts the values used in the weight calculation.

Wherein the hardware-related parameters include: resource type (-type) and node minimum core number (-core).

Wherein the list of eligible instance types includes at least one of: cloud vendor and region, instance type, CPU core number, GPU card number, memory size, network I/O performance, disk I/O performance, payment type and price, instruction set, CPU clock frequency, single-precision floating-point computing performance and double-precision floating-point computing performance.

The invention also provides a high-performance computing cloud platform resource scheduling system, which comprises: a memory and a processor, the memory having stored thereon a computer program which, when executed by the processor, carries out the steps of the method as described above.

The invention also proposes a computer storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the method as described above.

Compared with the prior art, the resource scheduling method of the high-performance computing cloud platform does not use the concept of queues and does not require instance types to be statically configured in advance. Instead, based on the resource specification and task quantity information specified when a job is submitted, it comprehensively considers factors such as resource richness, resource cost performance and application running characteristics, and selects instance types automatically, thereby taking the computing power of cloud vendors worldwide as a resource pool and selecting the optimal computing resources according to the characteristics of the job.

Specifically, compared with the traditional HPC job scheduling method, the resource scheduling method of the present invention differs mainly in the following respects. First, the concept of application running characteristics is introduced: the application is specified before the job is submitted so that its running characteristics can be obtained, and screening and sorting by these characteristics yields instance types with better cost performance. Second, the concept of a queue is no longer used and instance types need not be statically configured in advance. The cloud vendor, region and instance type are selected dynamically through a series of screening and sorting steps based on the resource specification and task quantity information specified when the job is submitted and on the application running characteristics. This greatly improves the flexibility of resource selection, allows resources of multiple cloud vendors in multiple regions to be used simultaneously, effectively avoids insufficient instance type inventory, and ensures the success rate of resource scheduling. Finally, dedicated index weight and score calculation rules are used to select instance types, and the weight of each index can be specified during sorting so as to find the instance type with the optimal cost performance.

Drawings

FIG. 1 is a schematic flow chart of a resource scheduling method for a high-performance computing cloud platform according to the present invention;

FIG. 2 is a schematic diagram of the overall process of resource scheduling for a high performance computing cloud platform according to the present invention;

FIG. 3 is a flow chart of screening available instance types according to the present invention;

FIG. 4 is a flow chart of sorting instance types according to the present invention.

Detailed Description

It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention.

Referring to fig. 1 to 4, the present invention provides a resource scheduling method for a high performance computing cloud platform, including the following steps:

S10, acquiring running characteristic data of the application designated by a user for computing when a computing task is submitted on the high-performance computing cloud platform, and the hardware-related parameters of the resource specification designated when the job is submitted;

S20, screening out eligible instance types from a database containing the instance types of each region of each cloud vendor according to the acquired running characteristic data of the application and the hardware-related parameters;

S30, sorting the instance types to obtain a sorted instance type list;

S40, scheduling resources from the cloud vendor according to the screened and sorted instance type list.

querying a database containing application running characteristics according to the application ID to obtain JSON data containing application information and the application running characteristics, wherein the application running characteristics comprise at least one of the following: single or double precision, computing coupling, high clock frequency, large memory, network I/O, disk I/O, CPU instruction set.

and sorting according to the composite scores of the instance types to obtain a sorted instance type list.

The scheme of the invention is explained in detail as follows:

The invention provides a resource scheduling method for a high-performance computing cloud platform that does not use the concept of a queue and does not require instance types to be statically configured in advance. Instead, based on the resource specification and task quantity information specified when a job is submitted, it comprehensively considers factors such as resource richness, resource cost performance and application running characteristics, and selects instance types automatically, thereby taking the computing power of cloud vendors worldwide as a resource pool and selecting the optimal computing resources according to the characteristics of the job.

Using integrated metadata on the instance types of the cloud vendors, the method compares and selects resources from two aspects. On one hand, screening and sorting are performed according to the running characteristic data of the application, which includes information such as hardware preference and the degree of coupling among cluster nodes. On the other hand, screening is performed according to the hardware requirements specified when the user submits the job. After screening and sorting across these dimensions, a suitable instance type with high cost performance is obtained. The scheduling method overcomes the inability of existing high-performance computing cloud platform resource scheduling to extend to other cloud vendors, regions and instance types, and at the same time obtains an optimal solution in terms of cost performance.

Specifically, the resource scheduling method of the present invention completes resource scheduling in 5 steps, as shown in FIG. 2. The 5 steps are described in detail below:

S1: Specify the application and obtain its running characteristics. When a user submits a computing task on the high-performance computing cloud platform, the user first needs to specify the application used for computing. The application may be commercial or open-source software such as the Ansys software series, AlphaFold or TensorFlow, or may be the user's own program. The hardware resources used have a significant impact on the computing efficiency of these applications. For example, when choosing between a single RTX 3090 graphics card and a single Tesla V100 graphics card, the 3090 performs far better than the V100 for FP32 single-precision computation but far worse for FP64 double-precision computation. Whether the application depends on single-precision or double-precision floating-point computation is therefore very important; it is a running characteristic of the application, and computing efficiency can be greatly increased if the scheduled resources match it.

After the user specifies the application used for computing, a database containing application running characteristics is queried by application ID. If no record is found, resources are not screened by running characteristics for that application. If the application is found, the returned result is JSON data containing application information and application running characteristics, and the running characteristics comprise at least one of the following:

single or double precision; computing coupling (the degree of interdependence among compute nodes); high clock frequency; large memory; network I/O; disk I/O; CPU instruction set; ......
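The lookup in step S1 can be sketched as follows. This is only an illustration: the patent does not give the JSON schema, so the field names (`app_id`, `features`, `precision` and so on) and the in-memory `APP_DB` dictionary standing in for the database are assumptions.

```python
import json

# Hypothetical JSON record; the field names are illustrative and are not
# taken from the patent text.
SAMPLE_RECORD = """
{
  "app_id": "app-001",
  "name": "example-solver",
  "features": {
    "precision": "double",
    "coupling": "tight",
    "high_clock_frequency": true,
    "large_memory": false,
    "cpu_instruction_set": "intel"
  }
}
"""

# Assumed in-memory stand-in for the application running characteristics database.
APP_DB = {"app-001": SAMPLE_RECORD}


def get_running_features(app_id: str) -> dict:
    """Return the running characteristics for app_id, or {} if it is unknown."""
    raw = APP_DB.get(app_id)
    if raw is None:
        # No record found: no screening by running characteristics is applied.
        return {}
    return json.loads(raw)["features"]


features = get_running_features("app-001")
unknown = get_running_features("not-registered")
```

Returning an empty dictionary for an unknown application mirrors the rule above that such an application is not screened by running characteristics.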

S2: Specify a resource specification and submit the job. When a user submits a job on the high-performance computing cloud platform, the user needs to specify hardware-related parameters in addition to parameters such as the number of tasks and the number of cores per task. For example, the following parameters need to be specified when submitting a job from the command line:

resource type (-type): specifies whether CPU or GPU resources are requested;

node minimum core number (-core): the number of CPU cores or GPU cards needed on a single machine on which the task runs.

In one possible implementation, an existing HPC job management system may be extended to support this information, for example by extending SLURM's sbatch so that a job is submitted as: sbatch -n 1 -type CPU -core 16 job.
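As a rough sketch of how a submission front end might parse these parameters, the following uses Python's standard `argparse` module. The options `-type` and `-core` are the extensions described in this document and are not part of stock sbatch; the program name `submit` and the destination names are assumptions, and the job script argument is left out.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the hypothetical extended submission command."""
    parser = argparse.ArgumentParser(prog="submit")
    # Number of tasks, as with the usual -n option.
    parser.add_argument("-n", dest="ntasks", type=int, default=1)
    # Resource type: CPU or GPU (extension described in step S2).
    parser.add_argument("-type", dest="resource_type",
                        choices=["CPU", "GPU"], required=True)
    # Minimum CPU cores or GPU cards per node (extension described in step S2).
    parser.add_argument("-core", dest="min_cores", type=int, required=True)
    return parser


args = build_parser().parse_args(["-n", "1", "-type", "CPU", "-core", "16"])
```

The parsed values (`args.resource_type`, `args.min_cores`) are the hardware-related parameters that step S3 uses as screening conditions.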

S3: Screen the available instance types. Using the application running characteristics and hardware-related parameters obtained in steps S1 and S2 as constraints, eligible instance types are screened from a database containing the instance types of each region of each cloud vendor. FIG. 3 shows the screening process of step S3, which specifically includes:

S3-1: Query the database containing instance type information according to the resource type and node minimum core number specified by the user in step S2 to obtain a list of eligible instance types. All instance types in this list meet the basic requirements of the job. Each screened instance type result contains the following information: cloud vendor and region, instance type, CPU core number, GPU card number, memory size, network I/O performance, disk I/O performance, payment type and price, instruction set, CPU clock frequency, single-precision floating-point computing performance, double-precision floating-point computing performance.

S3-2: Further screen the instance type list obtained in step S3-1 according to the running characteristics of the application from step S1. The running characteristics fall into two classes. The first class is mandatory requirements: for example, if the application depends on an Intel CPU instruction set, the screened instance types must use the Intel instruction set. The second class is optional requirements: for example, the application computes more efficiently on a CPU with a high clock frequency. Screening in this step is performed only on the first class of running characteristics, which avoids selecting an instance type on which the application ultimately cannot run.

S3-3: Return the instance type list filtered in S3-2.
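The two-stage screening of steps S3-1 to S3-3 can be sketched as below. The `InstanceType` fields, the sample catalog and the choice of the CPU instruction set as the only mandatory characteristic are assumptions made for illustration; a real catalog would carry all the fields listed in S3-1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    vendor: str
    region: str
    name: str
    resource_type: str    # "CPU" or "GPU"
    cores: int            # CPU cores, or GPU cards for GPU instance types
    instruction_set: str
    price: float


def screen(catalog, resource_type, min_cores, mandatory):
    """Return instance types meeting the hardware and mandatory requirements."""
    # S3-1: basic hardware requirements from the job submission.
    basic = [i for i in catalog
             if i.resource_type == resource_type and i.cores >= min_cores]
    # S3-2: mandatory running characteristics only (here: instruction set).
    required_isa = mandatory.get("cpu_instruction_set")
    if required_isa is None:
        return basic
    # S3-3: return the filtered list.
    return [i for i in basic if i.instruction_set == required_isa]


# Hypothetical catalog entries; vendors, regions and prices are made up.
CATALOG = [
    InstanceType("vendor-a", "region-1", "a.cpu16", "CPU", 16, "intel", 10.0),
    InstanceType("vendor-b", "region-2", "b.cpu32", "CPU", 32, "amd", 12.0),
    InstanceType("vendor-a", "region-1", "a.cpu8", "CPU", 8, "intel", 5.0),
    InstanceType("vendor-c", "region-3", "c.gpu4", "GPU", 4, "intel", 40.0),
]

available = screen(CATALOG, "CPU", 16, {"cpu_instruction_set": "intel"})
no_mandatory = screen(CATALOG, "CPU", 16, {})
```

Optional characteristics such as a high clock frequency are deliberately not used here; they only influence the ordering in step S4.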

S4: Sort the instance types. If no instance type meets the conditions after the screening of step S3, resource scheduling fails directly. If a set of available instance types is obtained, the instance types need to be sorted by priority before they are used to purchase resources from cloud vendors.

The specific steps of the sorting are shown in fig. 4, and specifically include:

S4-1: The available instance type list returned in step S3 already contains information such as the price and the hardware indexes of each instance type. Weights need to be assigned to the effective indexes of the instance types according to the application running characteristics involved in the current scheduling, so as to determine the priority of each instance type. The weight percentage of each index, such as price, core number and clock frequency, is calculated according to a specific rule, for example:

1. The weight of the price is not less than 70%, so that cost performance takes priority;

2. The application running characteristics other than price share the remaining weight equally; when no running characteristics are specified for the application, the price weight is 100%.

S4-2: Calculate the composite score of each instance type according to the index weights calculated in S4-1 and the rank of each index of the instance type within the instance type list. In the example shown in the table below, the effective application running characteristic is high clock frequency, the price weight is 0.7 and the clock frequency weight is 0.3.
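The example weighting rule above can be sketched as a small function. The function name, the index names and the default price weight of 0.7 are assumptions; the text only requires the price weight to be at least 70%.

```python
def index_weights(features, price_weight=0.7):
    """Split weights between price and the active running characteristics.

    features: names of the running characteristics that take part in sorting.
    price_weight: share given to price; the example rule requires >= 0.7.
    """
    if not features:
        # No running characteristics specified: price carries 100% of the weight.
        return {"price": 1.0}
    if not 0.7 <= price_weight <= 1.0:
        raise ValueError("price weight must be between 0.7 and 1.0")
    # The remaining weight is shared equally among the other indexes.
    share = (1.0 - price_weight) / len(features)
    weights = {"price": price_weight}
    for name in features:
        weights[name] = share
    return weights


only_price = index_weights([])
one_feature = index_weights(["clock_frequency"])
two_features = index_weights(["clock_frequency", "memory"])
```

With a single active characteristic this gives the 0.7 and 0.3 split used in the worked example of step S4-2.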

Instance type	Price/rank	Clock frequency/rank	Composite score (rank 1 = 100 points, rank 2 = 50 points, rank 3 = 0 points)
A	¥10/1	2.8 GHz/3	100 × 0.7 + 0 × 0.3 = 70
B	¥15/2	3.2 GHz/2	50 × 0.7 + 50 × 0.3 = 50
C	¥25/3	3.5 GHz/1	0 × 0.7 + 100 × 0.3 = 30

S4-3: Return the instance type list sorted by composite score.
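The scoring of step S4-2 can be reproduced with a short sketch that uses the figures from the table. The linear mapping from rank to points (100 for the best rank, 0 for the worst, evenly spaced in between) is an assumption that matches the three-row example; the text does not state the rule for other list lengths, and ties are not handled.

```python
def rank_points(values, higher_is_better):
    """Map each instance name to points: best rank 100, worst rank 0."""
    ordered = sorted(values, key=values.get, reverse=higher_is_better)
    n = len(ordered)
    if n == 1:
        return {ordered[0]: 100.0}
    return {name: 100.0 * (n - 1 - pos) / (n - 1)
            for pos, name in enumerate(ordered)}


def composite_scores(prices, clocks, weights):
    """Weighted sum of the rank points for price and clock frequency."""
    price_pts = rank_points(prices, higher_is_better=False)  # cheaper is better
    clock_pts = rank_points(clocks, higher_is_better=True)   # faster is better
    return {name: price_pts[name] * weights["price"]
                  + clock_pts[name] * weights["clock_frequency"]
            for name in prices}


# Figures from the table: price and clock frequency (GHz) of A, B and C.
prices = {"A": 10, "B": 15, "C": 25}
clocks = {"A": 2.8, "B": 3.2, "C": 3.5}
weights = {"price": 0.7, "clock_frequency": 0.3}

scores = composite_scores(prices, clocks, weights)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Under these assumptions the ordering is A, B, C, with composite scores of about 70, 50 and 30, in agreement with the table.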

S5: Purchase resources from cloud vendors according to the screened and sorted instance type list. If the instance type with the highest composite score has insufficient inventory, purchase the next-ranked instance type, and so on until enough resources have been purchased. If the resources finally purchased do not reach the quantity required by the job, resource scheduling fails and all purchased nodes are returned to the cloud vendor.
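The fallback purchasing of step S5 can be sketched as a loop over the sorted list. The callbacks `try_buy` and `release` are hypothetical stand-ins for cloud vendor API calls, and the `stock` dictionary simulates inventory; none of these names come from the patent.

```python
def purchase(sorted_types, needed, try_buy, release):
    """Buy `needed` nodes, walking down the sorted instance type list.

    try_buy(name, count) returns how many nodes were actually obtained;
    release(name, count) gives nodes back to the vendor.
    Returns a list of (name, count) pairs, or [] if scheduling fails.
    """
    bought = []
    remaining = needed
    for name in sorted_types:
        if remaining == 0:
            break
        got = try_buy(name, remaining)
        if got > 0:
            bought.append((name, got))
            remaining -= got
    if remaining > 0:
        # Not enough resources overall: return everything already purchased.
        for name, count in bought:
            release(name, count)
        return []
    return bought


# Simulated inventory for two hypothetical instance types.
stock = {"A": 2, "B": 5}


def try_buy(name, count):
    got = min(stock.get(name, 0), count)
    stock[name] = stock.get(name, 0) - got
    return got


def release(name, count):
    stock[name] = stock.get(name, 0) + count


ok = purchase(["A", "B"], 4, try_buy, release)
failed = purchase(["A", "B"], 10, try_buy, release)
```

In the first call, type A supplies 2 nodes and type B the remaining 2. In the second call only 3 nodes are left, so they are bought, found insufficient and released again.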

It should be noted that the application running characteristics database and the cloud vendor instance type metadata database used in the resource scheduling algorithm of the present invention may also be provided in other ways, for example as a service, or may be replaced by querying cloud vendor data in real time.

In addition, the weight of each index must be determined before the instance types are sorted. The rule for calculating the weights may instead adopt a strategy that prioritizes computing efficiency rather than cost performance, or the values used in the weight calculation may be adjusted to some extent.
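One way to keep the scheduling logic independent of where the metadata comes from is to hide the source behind a common interface. The sketch below is an assumption about how that could be structured; the names `InstanceCatalog`, `DatabaseCatalog` and `LiveQueryCatalog`, and the `fetch` callable that stands in for a real-time vendor query, are all hypothetical.

```python
from typing import Callable, Protocol


class InstanceCatalog(Protocol):
    """Anything that can list instance types for a hardware request."""

    def list_instance_types(self, resource_type: str, min_cores: int) -> list:
        ...


def _matches(row: dict, resource_type: str, min_cores: int) -> bool:
    return row["resource_type"] == resource_type and row["cores"] >= min_cores


class DatabaseCatalog:
    """Serves instance types from rows loaded in advance (the database case)."""

    def __init__(self, rows):
        self._rows = list(rows)

    def list_instance_types(self, resource_type, min_cores):
        return [r for r in self._rows if _matches(r, resource_type, min_cores)]


class LiveQueryCatalog:
    """Fetches rows on every call (the real-time query case)."""

    def __init__(self, fetch: Callable[[], list]):
        self._fetch = fetch  # hypothetical wrapper around vendor queries

    def list_instance_types(self, resource_type, min_cores):
        return [r for r in self._fetch() if _matches(r, resource_type, min_cores)]


def candidates(catalog: InstanceCatalog, resource_type: str, min_cores: int):
    """Scheduling code depends only on the InstanceCatalog interface."""
    return catalog.list_instance_types(resource_type, min_cores)


ROWS = [
    {"name": "a.cpu16", "resource_type": "CPU", "cores": 16},
    {"name": "a.cpu8", "resource_type": "CPU", "cores": 8},
]

from_db = candidates(DatabaseCatalog(ROWS), "CPU", 16)
from_live = candidates(LiveQueryCatalog(lambda: ROWS), "CPU", 16)
```

Both sources return the same candidates for the same request, so the screening and sorting steps do not need to know which one is in use.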

The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention, and all equivalent structures or equivalent processes performed by the present invention or directly or indirectly applied to other related technical fields are also included in the scope of the present invention.

Claims

1. A resource scheduling method for a high-performance computing cloud platform is characterized by comprising the following steps:

sorting the instance types to obtain a sorted instance type list;

2. The method of claim 1, wherein the step of obtaining running characteristic data of an application specified for computing by a user when a computing task is submitted by the high-performance computing cloud platform comprises:

3. The method according to claim 1, wherein the step of screening out eligible instance types from a database containing the instance types of each region of each cloud vendor according to the acquired running characteristic data and hardware-related parameters of the application comprises:

4. The method of claim 3, wherein the step of sorting the instance types to obtain a sorted list of instance types comprises:

5. The method according to claim 4, wherein the step of calculating index weights of instance types in the list of available instance types according to the running characteristics of the application comprises:

6. The method of claim 1, wherein the hardware-related parameters comprise: resource type (-type) and node minimum core number (-core).

7. The method of claim 3, wherein the list of eligible instance types comprises at least one of: cloud vendor and region, instance type, CPU core number, GPU card number, memory size, network I/O performance, disk I/O performance, payment type and price, instruction set, CPU clock frequency, single-precision floating-point computing performance and double-precision floating-point computing performance.

8. A high performance computing cloud platform resource scheduling system, the system comprising: memory and a processor, the memory having stored thereon a computer program which, when executed by the processor, carries out the steps of the method according to any one of claims 1-7.

9. A computer storage medium, characterized in that a computer program is stored on the computer storage medium, which computer program, when being executed by a processor, carries out the steps of the method according to any one of claims 1-7.