WO2020133245A1 - Cloud resource elastic scaling system for high performance computing and scheduling method

Info

Publication number: WO2020133245A1
Application number: PCT/CN2018/124970
Authority: WO
Inventors: 林帅康; 刘阳; 温书豪; 马健; 赖力鹏
Original assignee: 深圳晶泰科技有限公司
Priority date: 2018-12-28
Filing date: 2018-12-28
Publication date: 2020-07-02

Abstract

A cloud resource elastic scaling system for high performance computing, belonging to the technical field of high performance computing. The system comprises a resource expansion subsystem responsible for adding nodes to a computing cluster and a resource contraction subsystem responsible for deleting nodes from the computing cluster. A scheduling system accepts tasks submitted by an external user or system and distributes them to a waiting queue; the resource elastic scaling system scans the task waiting queue, combines several expansion decision algorithms, and applies for bidding resources in a suitable region, so that the tasks finally run on the newly added computing nodes. Once all tasks have been distributed and the computing nodes in the cluster gradually become idle, the contraction strategy of the resource elastic scaling system is triggered to recover and release the nodes. The system integrates the elastic scaling APIs of large public cloud providers to manage and control global resources, and predicts an optimal resource usage mode through statistical learning over the running times of a large and continuously growing set of tasks of different types.

Description

Resource elastic scaling system for high-performance computing on cloud and its scheduling method

Technical field

The invention belongs to the technical field of high-performance computing and can be used as a cluster resource elastic scaling management system for computing clusters on a cloud computing platform.

Background technique

Elastic scaling of high-performance computing resources means that the resource scheduler dynamically adjusts the size of the resource pool according to the resource requirements of the current computing tasks, so that each task obtains the computing resources it needs to run.

In the public cloud, high-performance computing uses large-scale computation-intensive tasks as the computing unit and distributes the tasks to the cluster through an efficient job scheduling system. The resource elastic scaling system periodically scans the task queue to count the resources required by the tasks and triggers resource expansion, so that the tasks can be computed on the corresponding nodes. When a task has finished and its node has been idle for several consecutive cycles, resource contraction is triggered, and the node is recovered and released to save costs. In addition, when a computing node repeatedly fails its health check, it is forcibly recycled and replaced with a new node. Through these mechanisms the resource elastic scaling system dynamically adjusts the resource pool so that as many tasks as possible can be scheduled to run.
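The idle-cycle rule described above can be sketched as follows. This is a minimal illustration, not the implementation of any particular scaling system: the class `IdleTracker`, its method names and the default of two consecutive idle cycles are assumptions chosen for the example.

```python
from typing import Dict, List


class IdleTracker:
    """Counts consecutive idle scan cycles per node (hypothetical helper)."""

    def __init__(self, idle_cycles_to_recycle: int = 2) -> None:
        # Assumed threshold: recycle after this many idle scans in a row.
        self.limit = idle_cycles_to_recycle
        self.idle_counts: Dict[str, int] = {}

    def scan(self, running_tasks: Dict[str, int]) -> List[str]:
        """Run one scan cycle.

        running_tasks maps a node id to the number of tasks running on it.
        Returns the nodes that have now been idle for `limit` cycles in a row.
        """
        to_recycle = []
        for node, running in running_tasks.items():
            if running > 0:
                # Any running task resets the idle counter.
                self.idle_counts[node] = 0
                continue
            self.idle_counts[node] = self.idle_counts.get(node, 0) + 1
            if self.idle_counts[node] >= self.limit:
                to_recycle.append(node)
        return to_recycle


tracker = IdleTracker(idle_cycles_to_recycle=2)
first = tracker.scan({"n1": 0, "n2": 1})   # n1 idle for one cycle
second = tracker.scan({"n1": 0, "n2": 0})  # n1 idle for two cycles
print(first, second)
```

A node that picks up a new task between scans starts counting from zero again, which is why a short-lived gap between tasks does not trigger recovery.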

The current problems of the resource elastic scaling system mainly include the following aspects:

1. The resource elastic scaling system supports only a single computing node configuration, which forces the task scheduling system to deal with complex resource packing. A scaling group is composed of homogeneous computing nodes, whereas different computing tasks require different numbers of CPU cores. For example, suppose the queue holds 8-core, 16-core and 32-core tasks while every computing node has 32 cores, and the number of tasks of each size differs. Eventually a single 8-core or 16-core task will occupy an entire 32-core computing node, which wastes a large amount of resources.

2. The health detection mechanism of the resource elastic scaling system is not suited to high-performance computing tasks with high CPU load. The mechanism usually runs a background detection service on each computing node that periodically sends heartbeat information to the node master to indicate that the node is healthy. However, high-performance computing tasks perform a large number of floating-point calculations, so CPU utilization easily reaches 100% and the node becomes too busy to send heartbeats to the node master in time. The node master then mistakenly concludes that the computing node is unresponsive and triggers the node recycling mechanism. The running task is killed by mistake and returned to the scheduling queue; when it is rescheduled, the health check fails again and the task is killed again, which wastes resources.

3. The computing nodes managed by the resource elastic scaling system are usually billed on demand. Compared with on-demand billing, the bidding billing model that has emerged in recent years allows companies to obtain a large amount of elastic computing resources while greatly reducing computing costs. Bidding resources are the spare computing capacity of public cloud vendors, and their prices can be as low as 10% of those of on-demand resources. The only difference between the two is that bidding resources may be interrupted and reclaimed at some point when demand for on-demand resources rises. Bidding resources are therefore well suited to interruptible high-performance computing tasks. However, the price fluctuation and interruption rate of bidding resources depend on the current supply and demand in each region, and an elastic scaling system that manages bidding resources cannot dynamically select a suitable region based on this supply and demand, so it cannot find bidding resources with lower prices and lower interruption rates.

4. The number of computing nodes added in a single expansion is usually decided from the total number of cores required by the task queue. If there are 1,000 32-core tasks in the queue and the current resource pool has no free resources, the resource elastic scaling system will directly add 1,000 32-core computing nodes. However, the computing time of a 32-core task varies greatly with its computational complexity: high-complexity tasks may take several hours to several days to complete, while low-complexity tasks may need only a few tens of minutes. After a task completes, the resource elastic scaling system must observe the computing node as idle for several consecutive scan cycles before recovering it. For example, if each cycle is 5 minutes and recovery is triggered after 2 consecutive idle cycles, 1,000 computing nodes will end up running empty for 10 minutes, which wastes a lot of resources. It is also possible that the price of bidding resources in the currently selected region is relatively high, so that this batch of tasks runs on high-priced computing resources. Such large-scale 32-core computing scenarios are usually not sensitive to result latency: as long as the tasks finish within the agreed time, business progress is not affected. The reasons for this one-time over-expansion are: (1) the decision conditions of the resource scaling system are too simple; (2) it is not aware of differences between task types and cannot predict task running time; (3) it is not aware of the priority and urgency of the current tasks; (4) it cannot perceive bid price trends across different resource regions and time periods.
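The waste described in problems 1 and 4 can be quantified with a short calculation. The sketch below only restates the figures given in the text (32-core nodes, 1,000 nodes, 5-minute cycles, 2 idle cycles); the function names are hypothetical and introduced for illustration.

```python
NODE_CORES = 32  # homogeneous node size from the example in problem 1


def wasted_cores(task_cores: int, node_cores: int = NODE_CORES) -> int:
    """Cores left unused when a single task occupies a whole node."""
    if task_cores > node_cores:
        raise ValueError("task does not fit on a single node")
    return node_cores - task_cores


def idle_node_minutes(nodes: int, cycle_minutes: int, idle_cycles: int) -> int:
    """Total node-minutes spent idle before recovery is triggered."""
    return nodes * cycle_minutes * idle_cycles


def idle_core_hours(nodes: int, cores_per_node: int,
                    cycle_minutes: int, idle_cycles: int) -> float:
    """The same idle time expressed in core-hours."""
    minutes = idle_node_minutes(nodes, cycle_minutes, idle_cycles)
    return minutes * cores_per_node / 60


# Problem 1: one 8-, 16- or 32-core task alone on a 32-core node.
for cores in (8, 16, 32):
    print(cores, "core task wastes", wasted_cores(cores), "cores")

# Problem 4: 1,000 nodes, 5-minute cycles, recovery after 2 idle cycles.
print(idle_node_minutes(1000, 5, 2), "idle node-minutes")
print(round(idle_core_hours(1000, NODE_CORES, 5, 2), 1), "idle core-hours")
```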

When the scheduling system stops distributing new tasks to the task queue, tasks with different CPU core counts, such as 4-core, 8-core and 16-core tasks, are still running in the cluster. At the start, the scheduling system uses an algorithm to optimize the cluster's bin packing problem so that different tasks fill each 32-core or 16-core computing node. However, because tasks run for different lengths of time, if no new task is scheduled to a node, a single task may end up occupying an entire 32-core computing node. Since the periodic scan of the scaling system finds that tasks are still running on such nodes, the node recycling mechanism is not triggered, and the utilization of the cluster keeps declining.
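The text does not say which packing algorithm the scheduling system uses. As an illustration only, the sketch below uses first-fit decreasing, a common bin packing heuristic, to show how tasks of mixed core counts can initially fill 32-core nodes. The function name and the sample task list are assumptions.

```python
from typing import List


def first_fit_decreasing(task_cores: List[int],
                         node_cores: int = 32) -> List[List[int]]:
    """Pack tasks onto as few nodes as the heuristic finds.

    Returns one list of task core counts per node.
    """
    nodes: List[List[int]] = []  # tasks placed on each node
    free: List[int] = []         # remaining cores on each node
    for cores in sorted(task_cores, reverse=True):
        if cores > node_cores:
            raise ValueError("task does not fit on a single node")
        for i, remaining in enumerate(free):
            if cores <= remaining:
                # Place the task on the first node with enough room.
                nodes[i].append(cores)
                free[i] -= cores
                break
        else:
            # No existing node has room, so open a new one.
            nodes.append([cores])
            free.append(node_cores - cores)
    return nodes


packing = first_fit_decreasing([16, 8, 4, 16, 8, 4, 8])
print(packing)
```

Such a packing is only tight at the moment of placement. Once the shorter tasks on a node finish and nothing new is scheduled, the remaining task holds the whole node, which is the utilization decline described above.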

Technical problem

Technical solution

In response to the above technical problems, the present invention provides a resource elastic scaling system for high-performance computing on the cloud and its scheduling method. The system supports multiple public cloud regions and multiple computing resource configurations; adapts health detection to high-performance computing nodes; adapts to the usage pattern of bidding instance resources; predicts task running time to avoid wasting resources by adding too many computing nodes; and dynamically adjusts the contraction mechanism to avoid wasting resources because of the bin packing problem.

The specific technical solution is:

A resource elastic scaling system for high-performance computing on the cloud includes two subsystems: a resource expansion subsystem and a resource contraction subsystem. The resource expansion subsystem is responsible for adding nodes to the computing cluster, and the resource contraction subsystem is responsible for removing nodes from the computing cluster.

The resource expansion subsystem includes three data collection modules:

The task running time statistics module collects statistics on different task types from the task database;

The bidding resource price monitoring and forecasting module collects and monitors price trend data from the bidding resource pools of public cloud vendors;

The bidding instance interruption processing module collects and monitors bidding instance interruption data from the computing cluster in real time.

The resource contraction subsystem includes two data collection modules:

The computing node load monitoring module collects time series data on node CPU utilization in real time;

The cluster node scanning module periodically scans the cluster to collect node idle and health data.

The scheduling method of the resource elastic scaling system for high-performance computing on the cloud includes the following steps: the scheduling system accepts tasks submitted by external users or systems and distributes them to the waiting queue; the resource elastic scaling system scans the task waiting queue, combines several expansion decision algorithms, and applies for bidding resources in a suitable region, so that the tasks finally run on the newly added computing nodes; once all tasks have been distributed and the computing nodes in the cluster gradually become idle, the contraction strategy of the resource elastic scaling system is triggered to recover and release the nodes.

Specifically, the resource expansion subsystem decides to add nodes to the cluster based on its three data collection modules, in the following steps:

S11, the task running time statistics module collects statistics on different task types from the task database; based on statistics over existing task data, it predicts the running time required by the tasks in the task queue and, combined with the number of CPU cores each task requires, calculates the total number of resource cores required by all tasks in the waiting queue;

S12, the bidding resource price monitoring and forecasting module collects and monitors price trend data from the bidding resource pools of public cloud vendors; based on historical fluctuation data of bidding resource prices, it predicts the price fluctuation range of resources in each region at different points in time;

S13, the bidding instance interruption processing module collects and monitors bidding instance interruption data from the computing cluster in real time; combined with the node interruption rate calculated from this real-time feedback, the bidding resources in the most suitable region can be selected;

Finally, when the resource expansion subsystem detects tasks waiting in the task queue, it combines the resource data tables obtained by the above three modules to determine the region in which to apply for bidding computing node resources that meet the computing needs of the tasks, are cost-effective and have a low interruption rate, and thereby adds nodes to the computing cluster.
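The text does not specify which statistic step S11 uses to predict running time. As one possible sketch, the example below predicts the running time of a task category as the median of its historical running times. The function name, the choice of the median and the sample history are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, List, Tuple


def predict_runtime_hours(
    history: Iterable[Tuple[str, float]]
) -> Dict[str, float]:
    """Predict running time per task category from past runs.

    history holds (task_category, observed_hours) pairs taken from a task
    database. The median is an assumed choice of statistic.
    """
    by_category: Dict[str, List[float]] = defaultdict(list)
    for category, hours in history:
        by_category[category].append(hours)
    return {cat: median(values) for cat, values in by_category.items()}


# Hypothetical historical records, not data from the source.
history = [("X", 0.4), ("X", 0.5), ("X", 0.6), ("Y", 2.5), ("Y", 3.5)]
predicted = predict_runtime_hours(history)
print(predicted)
```

Because new task records keep arriving, the prediction can simply be recomputed over the growing history on each scan.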

The resource contraction subsystem decides to remove nodes from the cluster based on its two data collection modules, in the following steps:

S14, the computing node load monitoring module collects CPU time series data of the nodes in real time;

The computing node load monitoring module obtains the real-time CPU usage of each computing node through the public cloud vendor's interface and writes this data to the time series database InfluxDB, so that external systems can obtain the monitoring data of all computing nodes in the cluster directly through the InfluxDB interface.

S15, the cluster node scanning module periodically scans the cluster to collect node idle and health data;

The cluster node scanning module periodically scans the entire cluster to find idle nodes in the current computing cluster that are not running tasks. It also finds unhealthy nodes through the health detection mechanism and finally stores the relevant data in the cluster node detection table.

Further, the health detection of computing nodes in high-performance computing is assisted by monitoring the CPU load of each computing node. When the CPU load reaches the 80% threshold, the detection program adds the computing node to the scale-down protection queue; when the task computing load drops below 80%, the health check returns to normal and the computing node is removed from the scale-down protection queue, which avoids recovering a node by mistake because of a failed health check;

The resource contraction subsystem combines the data collected by its two data collection modules to make a recycling decision for each node, thereby deleting idle computing nodes from the cluster.

Beneficial effect

The resource elastic scaling system and scheduling method for high-performance computing on the cloud provided by the present invention have the following technical effects:

(1) Realize the management and control of global resources by integrating the elastic scaling APIs of major public cloud vendors;

(2) Implement a more flexible computing node health detection mechanism for high-performance computing tasks;

(3) Dynamically sense the price and interruption rate of bidding resources among major public cloud vendors;

(4) Through statistical learning of the running time of a large number of existing and continuously added different types of tasks, the resource scaling system can predict the best resource usage.

Brief description of the drawings

FIG. 1 is a system structure diagram of the resource elastic scaling system of the present invention;

FIG. 2 is a data collection diagram of the resource expansion subsystem of the resource elastic scaling system of the present invention;

FIG. 3 is a data collection diagram of the resource contraction subsystem of the resource elastic scaling system of the present invention;

FIG. 4 is a flow chart of the scheduling method of the resource elastic scaling system of the present invention;

FIG. 5 is a schematic diagram of the implementation of the present invention.

Best Mode of the Invention

Embodiments of the invention

The specific technical solution of the present invention will be described in conjunction with the embodiments.

As shown in FIG. 1, the resource elastic scaling system provided by the embodiment of the present invention includes two subsystems: a resource expansion subsystem and a resource contraction subsystem. The resource expansion subsystem is responsible for adding nodes to the computing cluster, and the resource contraction subsystem is responsible for deleting nodes from the computing cluster.

The resource expansion subsystem decides to add nodes to the cluster based on its three data collection modules. As shown in FIG. 2, the three data collection modules are:

S11, the task running time statistics module collects statistics on different task types from the task database;

S12, the bidding resource price monitoring and forecasting module collects and monitors price trend data from the bidding resource pools of public cloud vendors;

S13, the bidding instance interruption processing module collects and monitors bidding instance interruption data from the computing cluster in real time.

First, in the task running time statistics module of step S11, a task has the following attributes:

Task name

Task category

CPU requirement (cores)

Estimated duration (hours)

Total number of tasks

Based on statistics over existing task data, the running time required by the tasks in the task queue is predicted; combined with the number of CPU cores each task requires, the total number of resource cores required by all tasks in the waiting queue can then be calculated.

Task name	Task category	CPU requirement (cores)	Estimated duration (hours)	Total tasks (count)
A	X	8	0.5	1000
B	Y	16	3.0	500
C	Z	32	12.0	300
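The core demand of the waiting queue follows directly from the three rows of this task table (8, 16 and 32 cores). The sketch below is illustrative: the `TaskGroup` class and the function names are hypothetical, and the core-hour total is an extra derived figure that the text does not itself compute.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TaskGroup:
    name: str
    category: str
    cores: int        # CPU requirement per task
    est_hours: float  # estimated duration per task
    count: int        # number of tasks of this kind in the queue


# Rows of the task table above.
QUEUE = [
    TaskGroup("A", "X", 8, 0.5, 1000),
    TaskGroup("B", "Y", 16, 3.0, 500),
    TaskGroup("C", "Z", 32, 12.0, 300),
]


def total_cores(queue: List[TaskGroup]) -> int:
    """Cores needed to run every queued task at the same time."""
    return sum(g.cores * g.count for g in queue)


def total_core_hours(queue: List[TaskGroup]) -> float:
    """Total work in the queue, weighting cores by predicted duration."""
    return sum(g.cores * g.est_hours * g.count for g in queue)


print(total_cores(QUEUE))       # 8*1000 + 16*500 + 32*300
print(total_core_hours(QUEUE))  # 4000 + 24000 + 115200
```

Comparing the two totals shows why running time matters: the 8-core tasks account for almost a third of the simultaneous core demand but under 3% of the core-hours.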

Secondly, in the bidding resource price monitoring and forecasting module of step S12, a bidding resource has the following attributes:

Bidding region

Bidding instance category

Unit price of bidding instance

Bidding instance interruption rate

Based on historical fluctuation data of bidding resource prices, the price fluctuation range of resources in each region at different points in time can be predicted. Combined with the node interruption rate fed back in real time by the bidding instance interruption processing module of step S13, the bidding resources in the most suitable region can be selected.

Bidding region	Bidding instance category	Unit price of bidding instance (yuan)	Bidding instance interruption rate
AWS-A	A1	1.6	10%
Tencent Cloud-B	B1	2.4	15%
Huawei Cloud-C	C1	1.8	20%
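The text does not give the formula used to weigh price against interruption rate. The sketch below assumes one simple scoring rule, unit price divided by the fraction of time that is not interrupted, and applies it to the three rows of this table. The `BidOffer` class, the function names and the scoring rule itself are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BidOffer:
    region: str
    instance_type: str
    unit_price: float         # yuan, as in the table
    interruption_rate: float  # 0.10 means 10%


def effective_price(offer: BidOffer) -> float:
    """Assumed score: price per unit of uninterrupted capacity."""
    return offer.unit_price / (1.0 - offer.interruption_rate)


def pick_region(offers: List[BidOffer]) -> BidOffer:
    """Return the offer with the lowest effective price."""
    usable = [o for o in offers if o.interruption_rate < 1.0]
    if not usable:
        raise ValueError("no usable bidding offer")
    return min(usable, key=effective_price)


# Rows of the bidding resource table above.
OFFERS = [
    BidOffer("AWS-A", "A1", 1.6, 0.10),
    BidOffer("Tencent Cloud-B", "B1", 2.4, 0.15),
    BidOffer("Huawei Cloud-C", "C1", 1.8, 0.20),
]

best = pick_region(OFFERS)
print(best.region, round(effective_price(best), 3))
```

Under this rule the first row wins on both price and interruption rate. A real decision would also use the predicted price range from step S12 rather than a single current price.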

Finally, when the resource expansion subsystem detects tasks waiting in the task queue, it combines the resource data tables obtained from the above three modules to determine the region in which to apply for bidding computing node resources that meet the computing needs of the tasks, are cost-effective and have a low interruption rate, and thereby adds nodes to the computing cluster.

The resource contraction subsystem decides to remove nodes from the cluster based on its two data collection modules. As shown in FIG. 3, the two data collection modules work as follows:

First, in S14, the computing node load monitoring module obtains the real-time CPU usage of each computing node through the public cloud vendor's interface and writes the data to the time series database InfluxDB, so that external systems can obtain the monitoring data of all computing nodes in the cluster directly through the InfluxDB interface.

Secondly, in S15, the cluster node scanning module periodically scans the entire cluster to promptly find idle nodes in the current computing cluster that are not running tasks, and also finds unhealthy nodes through the health detection mechanism. The data is stored in the cluster node detection table.

Bidding region	Bidding instance category	Idle	Healthy
AWS-A	A1	TRUE	TRUE
Tencent Cloud-B	B1	FALSE	FALSE
Huawei Cloud-C	C1	FALSE	TRUE

At the same time, for the health detection of computing nodes in high-performance computing, this method is assisted by monitoring the CPU load of each computing node. When the CPU load reaches the 80% threshold, the detection program adds the computing node to the scale-down protection queue. When the CPU load reaches 100%, the health detection program may be unable to keep sending heartbeat information, which would normally trigger contraction; but because scale-down protection was set in advance, the computing node is not killed at this point. When the task computing load drops below 80%, the health check returns to normal and the computing node is removed from the scale-down protection queue, which avoids recovering a node by mistake because of a failed health check.
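The protection mechanism in this paragraph can be sketched as a small state holder. The 80% threshold comes from the text; the class `ScaleDownProtection`, its method names and the use of a set for the protection queue are assumptions made for illustration.

```python
from typing import Set

PROTECT_THRESHOLD = 80.0  # CPU load threshold stated in the text


class ScaleDownProtection:
    """Tracks nodes that must not be recycled while under heavy load."""

    def __init__(self, threshold: float = PROTECT_THRESHOLD) -> None:
        self.threshold = threshold
        self.protected: Set[str] = set()

    def observe(self, node_id: str, cpu_percent: float) -> None:
        """Update protection from the latest CPU load sample."""
        if cpu_percent >= self.threshold:
            self.protected.add(node_id)
        else:
            self.protected.discard(node_id)

    def may_recycle(self, node_id: str, heartbeat_ok: bool) -> bool:
        """A missed heartbeat recycles a node only if it is unprotected."""
        return (not heartbeat_ok) and node_id not in self.protected


guard = ScaleDownProtection()
guard.observe("n1", 85.0)  # load crosses 80%, node becomes protected
busy = guard.may_recycle("n1", heartbeat_ok=False)
guard.observe("n1", 40.0)  # load drops, protection is lifted
calm = guard.may_recycle("n1", heartbeat_ok=False)
print(busy, calm)
```

Protection has to be set before the load reaches 100%, because at that point the node may no longer be able to report anything.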

Finally, the resource contraction subsystem combines the data collected by the above two modules to make a recycling decision for each node, thereby deleting idle computing nodes from the cluster.
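A recycling decision that combines the idle and health flags of the cluster node detection table with scale-down protection might look like the sketch below. The decision order and the labels `release`, `replace` and `keep` are assumptions; the text only states that idle nodes are deleted and that nodes failing health checks are recycled unless protected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeStatus:
    region: str
    instance_type: str
    idle: bool
    healthy: bool


def recycle_decision(node: NodeStatus, protected: bool = False) -> str:
    """Return 'release', 'replace' or 'keep' for one node."""
    if protected:
        # Node is under heavy CPU load, so failed health checks are ignored.
        return "keep"
    if not node.healthy:
        return "replace"  # recycle and substitute a new node
    if node.idle:
        return "release"  # recover the idle node to save cost
    return "keep"


# Rows of the cluster node detection table in the text.
TABLE = [
    NodeStatus("AWS-A", "A1", idle=True, healthy=True),
    NodeStatus("Tencent Cloud-B", "B1", idle=False, healthy=False),
    NodeStatus("Huawei Cloud-C", "C1", idle=False, healthy=True),
]

decisions = [recycle_decision(n) for n in TABLE]
print(decisions)
```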

The resource elastic scaling system uses these modules to collect the relevant statistical data and to provide well-founded decisions for resource expansion and resource contraction. The entire system flow is shown in FIG. 4. The scheduling system accepts tasks submitted by external users or systems and distributes them to the waiting queue. The resource elastic scaling system scans the task waiting queue, combines several expansion decision algorithms, and applies for bidding resources in a suitable region, so that the tasks finally run on the newly added computing nodes. Once all tasks have been distributed and the computing nodes in the cluster gradually become idle, the contraction strategy of the resource elastic scaling system is triggered to recover and release the nodes.

This method can be used to build an efficient elastic scaling system on major public clouds such as AWS, Tencent Cloud, Huawei Cloud and Google Cloud. It can be run by applying for a host on the cloud, granting it the corresponding resource operation permissions, and providing the related interfaces for querying the tasks of the scheduling system, as shown in FIG. 5. When an operation node submits tasks to the scheduling system, the elastic scaling system automatically adds suitable bidding nodes and carries out the node recovery strategy after the tasks are completed.

Claims

1. A resource elastic scaling system for high-performance computing on the cloud, characterized in that it includes two subsystems: a resource expansion subsystem and a resource contraction subsystem; the resource expansion subsystem is responsible for adding nodes to the computing cluster, and the resource contraction subsystem is responsible for removing nodes from the computing cluster.
2. The resource elastic scaling system for high-performance computing on the cloud according to claim 1, wherein the resource expansion subsystem includes three data collection modules:

The task running time statistics module collects statistics on different task types from the task database;

The bidding resource price monitoring and forecasting module collects and monitors price trend data from the bidding resource pools of public cloud vendors;

The bidding instance interruption processing module collects and monitors bidding instance interruption data from the computing cluster in real time.
3. The resource elastic scaling system for high-performance computing on the cloud according to claim 1, wherein the resource contraction subsystem includes two data collection modules:

The computing node load monitoring module collects time series data on node CPU utilization in real time;

The cluster node scanning module periodically scans the cluster to collect node idle and node health data.
4. A scheduling method for the resource elastic scaling system for high-performance computing on the cloud according to any one of claims 1 to 3, characterized in that it includes the following steps: the scheduling system accepts tasks submitted by external users or systems and distributes them to the waiting queue; the resource elastic scaling system scans the task waiting queue, combines several expansion decision algorithms, and applies for bidding resources in a suitable region, so that the tasks finally run on the newly added computing nodes; once all tasks have been distributed and the computing nodes in the cluster gradually become idle, the contraction strategy of the resource elastic scaling system is triggered to recover and release the nodes.
5. The scheduling method for the resource elastic scaling system for high-performance computing on the cloud according to claim 4, wherein the resource expansion subsystem decides to add nodes to the cluster based on its three data collection modules, including the following steps:

S11, the task running time statistics module collects statistics on different task types from the task database; based on statistics over existing task data, it predicts the running time required by the tasks in the task queue and, combined with the number of CPU cores each task requires, calculates the total number of resource cores required by all tasks in the waiting queue;

S12, the bidding resource price monitoring and forecasting module collects and monitors price trend data from the bidding resource pools of public cloud vendors; based on historical fluctuation data of bidding resource prices, it predicts the price fluctuation range of resources in each region at different points in time;

S13, the bidding instance interruption processing module collects and monitors bidding instance interruption data from the computing cluster in real time; combined with the node interruption rate calculated from this real-time feedback, the bidding resources in the most suitable region can be selected;

Finally, when the resource expansion subsystem detects tasks waiting in the task queue, it combines the resource data tables obtained by the above three modules to determine the region in which to apply for bidding computing node resources that meet the computing needs of the tasks, are cost-effective and have a low interruption rate, and thereby adds nodes to the computing cluster.
6. The scheduling method for the resource elastic scaling system for high-performance computing on the cloud according to claim 4, characterized in that the resource contraction subsystem decides to remove nodes from the cluster based on its two data collection modules, including the following steps:

S14, the computing node load monitoring module collects CPU time series data of the nodes in real time;

The computing node load monitoring module obtains the real-time CPU usage of each computing node through the public cloud vendor's interface and writes this data to the time series database InfluxDB, so that external systems can obtain the monitoring data of all computing nodes in the cluster directly through the InfluxDB interface;

S15, the cluster node scanning module periodically scans the cluster to collect node idle and node health data;

The cluster node scanning module periodically scans the entire cluster to find idle nodes in the current computing cluster that are not running tasks. It also finds unhealthy nodes through the health detection mechanism and finally stores the relevant data in the cluster node detection table.
7. The scheduling method for the resource elastic scaling system for high-performance computing on the cloud according to claim 4, further comprising: the health detection of computing nodes in high-performance computing is assisted by monitoring the CPU load of each computing node; when the CPU load reaches the 80% threshold, the detection program adds the computing node to the scale-down protection queue; when the task computing load drops below 80%, the health check returns to normal and the computing node is removed from the scale-down protection queue, which avoids recovering a node by mistake because of a failed health check;

The resource contraction subsystem combines the data collected by its two data collection modules to make a recycling decision for each node, thereby deleting idle computing nodes from the cluster.