WO2023015787A1 - High throughput cloud computing resource recovery system - Google Patents

High throughput cloud computing resource recovery system

Info

Publication number
WO2023015787A1
Authority
WO
WIPO (PCT)
Prior art keywords
service
preemption
resource
recovery
loss
Prior art date
Application number
PCT/CN2021/135609
Other languages
French (fr)
Chinese (zh)
Inventor
赵来平
崔育帅
邱铁
Original Assignee
天津大学
Application filed by 天津大学
Publication of WO2023015787A1 publication Critical patent/WO2023015787A1/en

Classifications

    • G06F9/505: Allocation of resources, e.g. of the central processing unit [CPU], to service a request, the resource being a machine (e.g. CPUs, servers, terminals), considering the load
    • G06F9/5022: Mechanisms to release resources
    • G06N3/04: Neural networks; architecture, e.g. interconnection topology
    • G06N3/08: Neural networks; learning methods

Abstract

Disclosed is a high-throughput cloud computing resource recovery system comprising a quality-of-service monitoring module, a preemption loss analysis module, and an offline load recovery queue module. The quality-of-service monitoring module monitors and records, in real time, the request-processing delays of the latency-critical (LC) service components of a cloud data center, and triggers resource recovery when it detects that quality of service can no longer be guaranteed. The preemption loss analysis module computes the preemption loss of each offline load. The offline load recovery queue module builds the recovery queue for batch processing (BE) applications and assigns preemption priorities; each server then performs resource recovery according to a locally maintained preemption-loss priority queue and the contribution of the LC service components deployed on it. Compared with the prior art, the present invention reduces the useless computation that scheduling imposes on the system, thereby improving the throughput and resource utilization of the cluster.

Description

A high-throughput cloud computing resource recovery system

Technical field
The invention relates to the technical field of cloud computing, and in particular to a scheduling and optimization method for the mixed deployment of microservices and various offline workloads in a cloud data center.
Background art
Co-locating multiple applications in a data center has proven to be an effective means of improving the resource utilization of a computing system. A reasonable resource allocation scheme can reduce the interference that co-located workloads cause one another by competing for shared resources, thereby guaranteeing the quality of service of the applications in the system. As the cloud computing market grows and application features multiply, more and more online applications are shifting from monolithic designs to complex services composed of multiple components, while the variety of batch workloads is also growing explosively. This increasingly complex componentized service scenario imposes much stricter control requirements on co-location systems.
In production environments, data center operators pursue higher server resource utilization by allocating transient resources to offline loads. Such resources are reclaimed by the cloud provider whenever they are needed to guarantee the service quality (SLA) of online applications, so offline loads deployed on transient resources face the risk of being rescheduled at any time. Although many advanced fault-tolerance mechanisms and strategies have been proposed to mitigate the computational loss that rescheduling inflicts on different classes of applications (big data analytics jobs, machine learning training tasks, scientific computing applications, and so on), these schemes often require modifying the application's code, which places a heavy burden on the program itself. In enterprise data centers, minimizing the performance impact of rescheduling on offline loads while guaranteeing the quality of service of online applications therefore remains an important problem. Moreover, as offline loads grow richer in functionality, more and more tasks demand strict execution times, which makes deployment strategies in the data center increasingly complicated. In many cases, rescheduling offline loads is unavoidable if the quality of service of online applications is to be protected in time, but different offline loads tolerate rescheduling very differently. Offline loads with fault-tolerance mechanisms can preserve part of their computation through checkpoints and similar mechanisms, whereas offline loads without such mechanisms lose all of their computation on every rescheduling. Furthermore, different offline applications are at different stages of progress; exposing a nearly finished task to the risk of preemption lowers system throughput and wastes resources.
The technical problem that the present invention urgently needs to solve is the inefficient server utilization caused by coarse-grained resource recovery schemes.
Summary of the invention
To solve the inefficient server utilization caused by coarse-grained resource recovery schemes, the present invention proposes a high-throughput cloud computing resource recovery system. By distinguishing the computational losses incurred when different batch processing (BE) applications are preempted in a mixed deployment of the latency-critical (LC) services and BE applications of a cloud data center, it designs an optimized resource recovery strategy that targets BE applications when the quality of service of the LC services can no longer be guaranteed, thereby improving the throughput of the mixed deployment.
The present invention is realized through the following technical solution:
A high-throughput cloud computing resource recovery system, comprising a quality-of-service monitoring module 100, a preemption loss analysis module 200, and an offline load recovery queue module 300, wherein:
the quality-of-service monitoring module 100 monitors and records, in real time, the request-processing delays of the LC service components of the cloud data center in order to determine whether quality of service is currently guaranteed; when it detects that quality of service cannot be guaranteed, it performs resource recovery, which is evaluated by the formula resource × time, where resource denotes the resources occupied by a BE and time denotes its completion time;
the preemption loss analysis module 200 computes the preemption loss of each offline load;
the preemption loss L that resource recovery causes each application is computed as follows:
L = S_pmtn − S_ognl = t_pmtn × r_pmtn − t_ognl × r_ognl
where t_pmtn denotes the completion time of the BE when it is preempted, t_ognl denotes its completion time when it is not preempted, r_pmtn denotes the number of CPU cores it occupies when preempted, and r_ognl denotes the resources it occupies when not preempted;
the offline load recovery queue module 300 builds the BE recovery queue and assigns preemption priorities; the BE recovery queue comprises two separate recovery queues, one of predictable BEs and one of unpredictable BEs; when the quality of service of the LC components of the cloud data center cannot be guaranteed, each server performs resource recovery according to its locally maintained preemption-loss priority queue and the contribution of the LC components deployed on it.
The BE applications fall into three categories: big data applications, artificial intelligence training, and scientific computing.
Because each LC component of the cloud data center contributes a different amount of delay, each co-located server maintains a local MLRQ with sub-queues at each MLRQ level; the number of BEs at MLRQ level q_MLRQ is determined by the contribution of the corresponding local LC components, according to the following formula:

[Formula: see original document, image PCTCN2021135609-appb-000001]

where n_BE denotes the number of BEs in the system and C_i denotes the contribution of LC service component i.
Compared with the BE-agnostic co-location systems of existing data centers, the high-throughput cloud computing resource recovery system of the present invention reduces the useless computation the system performs because of scheduling, thereby improving cluster throughput and resource utilization. Specifically, compared with a traditional BE-agnostic co-location system, the designed system improves throughput by 13.1%, CPU utilization by 10.2%, and memory bandwidth utilization by 11.4%.
Brief description of the drawings
Figure 1 compares the offline-service preemption losses of different BE applications;

Figure 2 is the first architecture diagram of the high-throughput cloud computing resource recovery system of the present invention;

Figure 3 is the second architecture diagram of the high-throughput cloud computing resource recovery system of the present invention;

Figure 4 shows how the BE offline-load recovery queues are integrated.
Detailed description of the embodiments
The technical solution of the present invention is described in detail below with reference to the accompanying drawings.
The basic idea of the present invention is as follows: when the quality of service of an LC service in the cloud data center cannot be guaranteed because of a load burst, the offline-service resource preemption loss of each BE is computed from the collected BE runtime data, and a suitable offline load is selected for preemption so as to release resources to the latency-critical (LC) service. In the present invention, the common search engines Solr and ElasticSearch and the distributed non-relational database Redis serve as LC services, while representative distributed offline loads in current data centers are selected as BE loads: Spark big data analytics tasks, distributed deep learning training tasks, and single-executable-binary scientific computing applications.
Figure 1 compares the offline-service preemption losses of different BE applications: (1a) DDL-ASP, an image classification deep learning model under asynchronous training; (1b) DDL-BSP, an image classification deep learning model under synchronous training; (1c) the big data application Spark; and (1d) SCIMARK, a Java benchmark for scientific and numerical computing. The differences are significant. (1a) Terminating a worker in asynchronous mode neither fails the BE application nor requires the terminated worker to be rescheduled; under preemption its makespan barely changes while it occupies fewer resources, so in this configuration task preemption actually improves the service efficiency of DDL-ASP. (1b) Workers must be synchronized, and any failed worker restarts from the most recent checkpoint, so terminating one of its workers causes a service loss if the termination occurs after 30% progress; in general, tasks preempted later inflict higher losses on the BE application. (1c) Later preemption incurs less loss, for two reasons: first, because RDDs give Spark applications high fault tolerance, the Spark scheduler can quickly recover a failed task whenever the failure occurs; second, applications typically execute as a series of stages, and preemption at 70% progress was found to produce less contention in the Spark executors, so recovery at that stage has little effect on the makespan. (1d) The offline-service preemption loss of SCIMARK grows linearly with progress: since no fault-tolerance mechanism is provided for it, every preemption forces it to be resubmitted and rerun from scratch.
Figure 2 shows the architecture of the high-throughput cloud computing resource recovery system of the present invention. The system comprises a quality-of-service monitoring module 100, a preemption loss analysis module 200, and an offline load recovery queue module 300.
The quality-of-service monitoring module 100 monitors and records the request-processing delays of the LC services in real time to determine whether quality of service is currently guaranteed. When it detects that quality of service cannot be guaranteed, it issues a resource recovery instruction that triggers the system to reclaim resources, ensuring that LC quality of service recovers quickly. At that moment a resource recovery signal is sent to the preemption loss analysis module 200 to select BEs suitable for recovery.
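As a minimal illustrative sketch (not part of the patent: the sliding-window size and the 99th-percentile tail-latency check are assumptions; the text only states that processing delays are monitored in real time and recovery is triggered when QoS cannot be guaranteed), the monitoring logic of module 100 might look like:

```python
from collections import deque

class QoSMonitor:
    """Tracks recent request latencies of an LC component and flags QoS violations.

    Assumptions (not specified in the patent): a fixed-size sliding window
    of recent latency samples and a p99 tail-latency check against the SLA
    target.
    """

    def __init__(self, sla_ms, window=1000):
        self.sla_ms = sla_ms                   # SLA target for tail latency (ms)
        self.latencies = deque(maxlen=window)  # sliding window of samples

    def record(self, latency_ms):
        self.latencies.append(latency_ms)

    def qos_violated(self):
        # Recovery should be triggered when the observed tail latency
        # exceeds the SLA target.
        if not self.latencies:
            return False
        ordered = sorted(self.latencies)
        p99 = ordered[min(len(ordered) - 1, int(0.99 * len(ordered)))]
        return p99 > self.sla_ms
```

In a deployment, `qos_violated()` returning true would correspond to the module sending the resource recovery signal to module 200.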
The preemption loss analysis module 200 computes the preemption loss of each offline load and passes this information to the offline load recovery queue module 300 for queue construction and preemption-priority assignment.
Representative BEs currently running in data centers fall mainly into three categories: big data applications, artificial intelligence training, and scientific computing. Big data applications process a data set through frameworks such as MapReduce and Spark, and their running time can be estimated from the measured data-processing progress. Artificial intelligence training aims to find a neural network model of good quality that meets a desired accuracy. Scientific computing mainly comprises short-running compute applications that do not process large volumes of data. BEs have different structures: they may be monolithic or consist of multiple components, and reclaiming resources from different BE components may affect BE throughput differently; for example, it may slow down processing or even prevent the BE from running. To reduce such negative effects, the quantity resource × time is computed to evaluate how resource recovery changes the service occupied by each BE, where resource denotes the CPU resources occupied by the BE and time denotes its completion time.
The preemption loss L that resource recovery causes each application is computed as follows:

L = S_pmtn − S_ognl = t_pmtn × r_pmtn − t_ognl × r_ognl

where t_pmtn denotes the completion time of the BE when it is preempted, t_ognl its completion time when it is not preempted, r_pmtn the number of CPU cores it occupies when preempted, and r_ognl the number of CPU cores it occupies when not preempted. If the service occupied by the BE grows after recovery, the preemption loss is greater than 0. Computing L requires BE runtime information, namely t_pmtn and t_ognl. If a prediction model exists for a particular BE that can accurately estimate its running time, the BE is classified as a predictable offline load; otherwise, a BE without an accurate prediction model is classified as an unpredictable offline load.
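The loss formula translates directly into code; this sketch simply evaluates L = t_pmtn × r_pmtn − t_ognl × r_ognl for given runtime measurements:

```python
def preemption_loss(t_pmtn, r_pmtn, t_ognl, r_ognl):
    """Preemption loss L = S_pmtn - S_ognl = t_pmtn*r_pmtn - t_ognl*r_ognl.

    t_pmtn / t_ognl: BE completion time with / without preemption;
    r_pmtn / r_ognl: CPU cores occupied with / without preemption.
    L > 0 means the occupied service (resource x time) grew after
    recovery, i.e. the preemption was costly.
    """
    return t_pmtn * r_pmtn - t_ognl * r_ognl
```

A BE whose completion time stretches from 100 s to 150 s on the same 4 cores, for example, has L = 150 × 4 − 100 × 4 = 200 core-seconds.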
1. For predictable offline loads, the expected completion times of the two BE types are obtained as follows:
(1) For Spark-based big data BEs, the expected completion time of the BE application is derived from the BE's completion progress c, elapsed time t, and preempted resource fraction p, according to the following formula:

[Formula: see original document, image PCTCN2021135609-appb-000002]

where c is obtained through the HTTP API exposed by Spark.
(2) For deep-learning-training BEs, an existing white-box model is used to predict the completion time of the BE application under different resource configurations. Taking the number of remaining training steps s, the elapsed time t, and the step processing speed q as input, the expected completion time is t_pmtn = (s / q) + t. Both s and q must be estimated by the model, and the number of remaining steps is updated from the live loss value of the training job.
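The white-box estimate for deep-learning BEs is simple enough to state as code; the formula t_pmtn = (s / q) + t is from the text, while the parameter names are illustrative:

```python
def ddl_expected_completion(remaining_steps, step_rate, elapsed_time):
    """Expected completion time of a distributed training BE: t_pmtn = s/q + t.

    remaining_steps (s) and step_rate (q) are assumed to come from a
    white-box model and the live training loss, as the text describes;
    elapsed_time (t) is the time the job has already run.
    """
    return remaining_steps / step_rate + elapsed_time
```

For instance, a job with 1000 steps left at 20 steps/s that has already run 50 s is expected to finish at t_pmtn = 1000/20 + 50 = 100 s.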
2. For unpredictable offline loads, the expected completion time of a BE is derived as follows:
The amount of useless computation U is used as the resource recovery priority: BEs that generate less useless computation U are reclaimed first. U is the amount of repeated computation that resource recovery causes. If, after resource recovery, the task merely becomes slower and needs no recomputation, then U = 0. If more than one BE has U = 0, the occupied service of such BEs is computed as resource × elapsedtime, where elapsedtime denotes the execution time and resource the CPU resources occupied by the BE. If a task fails and part of its computation becomes useless, then U > 0.
Useless computation depends on the BE's fault-tolerance mechanism. Based on the fault-tolerance mechanisms of existing BEs, the derivation of U falls into two main categories: (1) mechanisms based on time redundancy delay task execution by rescheduling failed tasks on backup servers; to reduce the repeated computation that rescheduling causes, a failed task is restarted on the backup server from the latest checkpoint, which yields the formula U_temp = t_ckpt × r_ognl, where t_ckpt denotes the computation time since the most recent checkpoint; (2) mechanisms based on space redundancy trade space for efficiency by issuing multiple replicas of the same task; the replicas run concurrently, and the task succeeds if at least one replica completes successfully, so if a task has more than one replica, recovery causes no repeated computation, i.e. U_space = 0, whereas if all replicas of a task fail it must be rescheduled and U_space = U_temp.
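The two derivations of U above can be sketched as a single helper; the function signature and replica-count parameter are illustrative, but the two formulas (U_temp = t_ckpt × r_ognl, and U_space = 0 when another replica survives, else U_temp) follow the text:

```python
def useless_computation(mechanism, t_ckpt, r_ognl, replicas=1):
    """Useless (repeated) computation U caused by reclaiming a BE task.

    Time redundancy: the task restarts from its latest checkpoint, so the
    work done since that checkpoint is repeated: U_temp = t_ckpt * r_ognl.
    Space redundancy: with more than one replica the task still succeeds
    and U_space = 0; if all replicas are lost it must be rescheduled and
    U_space = U_temp.
    """
    u_temp = t_ckpt * r_ognl
    if mechanism == "time":
        return u_temp
    if mechanism == "space":
        return 0 if replicas > 1 else u_temp
    raise ValueError("mechanism must be 'time' or 'space'")
```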
The offline load recovery queue module 300 builds the BE recovery queue and assigns preemption priorities, maintaining the running predictable and unpredictable BEs in a unified way. When the quality of service of an LC service in the cloud data center cannot be guaranteed, each server reclaims resources according to its locally maintained preemption-loss priority queue and the contribution of the LC components deployed on it.
Two separate recovery queues are built, one of predictable BEs and one of unpredictable BEs. Selecting the best BE to reclaim from two separate recovery queues is a challenge. To solve this, the Borda count voting method is used to unify the queue of predictable BEs and the queue of unpredictable BEs into a single BE recovery queue: each voter orders the candidates by preference, and the orderings are then combined to select the winner. BE_i denotes the i-th score obtained by Borda counting in the different orderings, and the BE with the smallest sum of scores across the orderings is preempted first.
Three queues are maintained for the running BE loads: a predictable preemption-loss queue, a predictable useless-computation queue, and an unpredictable useless-computation queue. Unpredictable BEs appear only in the unpredictable useless-computation queue, while predictable BEs appear in both the predictable preemption-loss queue and the predictable useless-computation queue; when each BE's scores in the three queues are obtained by Borda counting and summed, an unpredictable BE would score lower simply because it lacks a preemption-loss-queue score. For a fair comparison, the score an unpredictable BE obtains in the unpredictable useless-computation queue is therefore doubled. The scores of the predictable and unpredictable BEs are then merged and sorted in descending order of score, yielding a global BE recovery queue. Because the contributions of different LC components differ, the unified queue is partitioned by contribution into a multi-level recovery queue (MLRQ); when a resource recovery request arrives, the system reclaims all loads in the high-priority queue. As a result, BE loads co-located with high-contribution LC components face a coarser recovery granularity, enabling rapid recovery of LC quality of service.
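A sketch of the queue unification step, assuming rank positions as Borda scores (0 = reclaim first) and doubling the single score of unpredictable BEs as the text specifies; the exact score convention and tie-breaking are assumptions:

```python
def merge_recovery_queues(loss_q, useless_pred_q, useless_unpred_q):
    """Unify the three per-metric queues into one global BE recovery queue.

    Each input is a list of BE names, best-to-reclaim first. A BE's score
    in a queue is its rank position. Predictable BEs appear in loss_q and
    useless_pred_q; unpredictable BEs appear only in useless_unpred_q, so
    their single score is doubled for a fair comparison. The BE with the
    smallest total score is reclaimed first.
    """
    scores = {}
    for q in (loss_q, useless_pred_q):
        for rank, be in enumerate(q):
            scores[be] = scores.get(be, 0) + rank
    for rank, be in enumerate(useless_unpred_q):
        scores[be] = scores.get(be, 0) + 2 * rank
    # Ascending total score: lowest combined rank is preempted first.
    return sorted(scores, key=lambda be: scores[be])
```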
When the quality of service of an LC cannot be guaranteed, the recovery operation always executes from the top of the global BE recovery queue. If the first BE in the list does not exist on the local server, BEs are substituted in turn until a matching BE is found. To speed up SLA recovery, the global BE recovery queue is further organized into a multi-level recovery queue (MLRQ), and resource recovery always selects the BE at the top level of the MLRQ. Because each LC component's delay contribution differs, each co-located server maintains a local MLRQ with a longer sub-queue at each MLRQ level; in this way, more resources are reclaimed from the BEs deployed together with that LC component. The number of BEs at MLRQ level q_MLRQ is determined by the contribution of its local components, according to the following formula:

[Formula: see original document, image PCTCN2021135609-appb-000003]

where n_BE denotes the number of BEs in the system and C_i denotes the contribution of LC service component i.
When the latency-critical LC services of the cloud data center run alone, their residence time on each LC service component is recorded, and the contribution of each service component to the tail latency is then derived from the collected information. This characterization depends only on the LC service itself, and its cost grows linearly with the number of service components. The invention therefore reduces profiling cost compared with a profiling-based approach that must measure the combined interference of M LC services with N BE jobs.
Predictable BEs are those whose job completion time (JCT) can be estimated easily and accurately without relying on offline profiling; for example, the task completion time of a MapReduce or Spark application can be estimated from the proportion of data already processed, and for distributed deep learning training tasks white-box prediction models such as Optimus can serve as predictors of completion time. All other BEs are considered unpredictable. While predictable BEs are prioritized according to their progress, unpredictable BEs can be prioritized under the least attained service (LAS) policy, which reclaims the BE that has attained the least service.

Claims (3)

  1. A high-throughput cloud computing resource recovery system, characterized in that the system comprises a quality-of-service monitoring module (100), a preemption loss analysis module (200), and an offline-load recovery queue module (300), wherein:
    the quality-of-service monitoring module (100) monitors and records, in real time, the request-processing latency of the latency-sensitive service (LC) components of the cloud data center, in order to determine whether the quality of service is currently guaranteed; when it detects that the quality of service cannot be guaranteed, resource recovery is performed, the recovery being evaluated by the formula resource × time, where resource denotes the resources occupied by a BE and time denotes its completion time;
    the preemption loss analysis module (200) computes the preemption loss of offline loads;
    the preemption loss L incurred by each application due to resource recovery is computed as follows:
    L = S_pmtn - S_ognl = t_pmtn·r_pmtn - t_ognl·r_ognl
    where t_pmtn denotes the completion time of a BE when it is preempted, t_ognl its completion time when it is not preempted, r_pmtn the number of CPU cores it occupies when preempted, and r_ognl the resources it occupies when not preempted;
    the offline-load recovery queue module (300) builds the batch application (BE) recovery queues and assigns preemption priorities; the BE recovery queues comprise two separate queues, one consisting of predictable BEs and one of unpredictable BEs; when the quality of service of the cloud data center's latency-sensitive service LC components cannot be guaranteed, each server performs resource recovery according to its locally maintained preemption-loss priority queue and the contribution of the LC components deployed on it.
  2. The high-throughput cloud computing resource recovery system of claim 1, characterized in that the batch applications (BEs) fall into three categories: big-data applications, artificial-intelligence training, and scientific computing.
  3. The high-throughput cloud computing resource recovery system of claim 1, characterized in that, because each latency-sensitive service LC component of the cloud data center contributes differently to latency, each co-located server maintains a local MLRQ with sub-queues at every MLRQ level; the number of BEs in MLRQ level q is determined by the contribution of the corresponding local LC components, according to the following formula:
    Figure PCTCN2021135609-appb-100001
    where n_BE denotes the number of BEs in the system and C_i denotes the contribution of LC service component i.
PCT/CN2021/135609 2021-08-10 2021-12-06 High throughput cloud computing resource recovery system WO2023015787A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110912342.0A CN113608875B (en) 2021-08-10 2021-08-10 High-throughput cloud computing resource recovery system
CN202110912342.0 2021-08-10

Publications (1)

Publication Number Publication Date
WO2023015787A1 true WO2023015787A1 (en) 2023-02-16

Family

ID=78340084




Also Published As

Publication number Publication date
CN113608875B (en) 2023-09-12
CN113608875A (en) 2021-11-05


Legal Events

121: The EPO has been informed by WIPO that EP was designated in this application (ref document number: 21953392; country of ref document: EP; kind code of ref document: A1)
NENP: Non-entry into the national phase (ref country code: DE)