WO2023015787A1 - High throughput cloud computing resource recovery system - Google Patents

High throughput cloud computing resource recovery system

Info

Publication number
WO2023015787A1
Authority
WO
WIPO (PCT)
Prior art keywords
service
preemption
resource
recovery
loss
Prior art date
Application number
PCT/CN2021/135609
Other languages
French (fr)
Chinese (zh)
Inventor
赵来平
崔育帅
邱铁
Original Assignee
天津大学
Application filed by 天津大学
Publication of WO2023015787A1 publication Critical patent/WO2023015787A1/en

Classifications

    • G06F9/505: Allocation of resources, e.g. of the central processing unit [CPU], to service a request, the resource being a machine (e.g. CPUs, servers, terminals), considering the load
    • G06F9/5022: Mechanisms to release resources
    • G06N3/04: Neural networks; architecture, e.g. interconnection topology
    • G06N3/08: Neural networks; learning methods

Abstract

Disclosed is a high-throughput cloud computing resource recovery system comprising a quality-of-service monitoring module, a preemption loss analysis module, and an offline load recovery queue module. The quality-of-service monitoring module monitors and records, in real time, the request-processing delays of the latency-critical (LC) service components of a cloud data center, and triggers resource recovery when it detects that quality of service can no longer be guaranteed. The preemption loss analysis module computes the preemption loss of each offline load. The offline load recovery queue module builds the recovery queue for batch processing (BE) applications and assigns preemption priorities; each server then performs resource recovery according to a locally maintained preemption-loss priority queue and the contribution of the LC service components deployed on it. Compared with the prior art, the present invention reduces the useless computation that scheduling imposes on the system, thereby improving the throughput and resource utilization of the cluster.

Description

A high-throughput cloud computing resource recovery system

Technical field
The invention relates to the technical field of cloud computing, and in particular to a scheduling and optimization method for the mixed deployment of microservices and various offline workloads in a cloud data center.
Background art
Co-locating multiple applications in a data center has proven to be an effective means of improving the resource utilization of a computing system. A reasonable resource allocation scheme can reduce the interference that co-located workloads cause one another by competing for shared resources, thereby guaranteeing the quality of service of the applications in the system. As the cloud computing market grows and application features multiply, more and more online applications are shifting from monolithic designs to complex services composed of multiple components, while the variety of batch workloads is also growing explosively. This increasingly complex componentized service scenario imposes much stricter control requirements on co-location systems.
In production environments, data center operators pursue higher server resource utilization by allocating transient resources to offline loads. Such resources are reclaimed by the cloud provider whenever they are needed to guarantee the service quality (SLA) of online applications, so offline loads deployed on transient resources face the risk of being rescheduled at any time. Although many advanced fault-tolerance mechanisms and strategies have been proposed to mitigate the computational loss that rescheduling inflicts on different classes of applications (big data analytics jobs, machine learning training tasks, scientific computing applications, and so on), these schemes often require modifying the application's code, which places a heavy burden on the program itself. In enterprise data centers, minimizing the performance impact of rescheduling on offline loads while guaranteeing the quality of service of online applications therefore remains an important problem. Moreover, as offline loads grow richer in functionality, more and more tasks demand strict execution times, which makes deployment strategies in the data center increasingly complicated. In many cases, rescheduling offline loads is unavoidable if the quality of service of online applications is to be protected in time, but different offline loads tolerate rescheduling very differently. Offline loads with fault-tolerance mechanisms can preserve part of their computation through checkpoints and similar mechanisms, whereas offline loads without such mechanisms lose all of their computation on every rescheduling. Furthermore, different offline applications are at different stages of progress; exposing a nearly finished task to the risk of preemption lowers system throughput and wastes resources.
The technical problem that the present invention urgently needs to solve is the inefficient server utilization caused by coarse-grained resource recovery schemes.
Summary of the invention
To solve the inefficient server utilization caused by coarse-grained resource recovery schemes, the present invention proposes a high-throughput cloud computing resource recovery system. By distinguishing the computational losses incurred when different batch processing (BE) applications are preempted in a mixed deployment of the latency-critical (LC) services and BE applications of a cloud data center, it designs an optimized resource recovery strategy that targets BE applications when the quality of service of the LC services can no longer be guaranteed, thereby improving the throughput of the mixed deployment.
The present invention is realized through the following technical solution:
A high-throughput cloud computing resource recovery system, comprising a quality-of-service monitoring module 100, a preemption loss analysis module 200, and an offline load recovery queue module 300, wherein:
the quality-of-service monitoring module 100 monitors and records, in real time, the request-processing delays of the LC service components of the cloud data center in order to determine whether quality of service is currently guaranteed; when it detects that quality of service cannot be guaranteed, it performs resource recovery, which is evaluated by the formula resource × time, where resource denotes the resources occupied by a BE and time denotes its completion time;
the preemption loss analysis module 200 computes the preemption loss of each offline load;
the preemption loss L that resource recovery causes each application is computed as follows:
L = S_pmtn − S_ognl = t_pmtn × r_pmtn − t_ognl × r_ognl
where t_pmtn denotes the completion time of the BE when it is preempted, t_ognl denotes its completion time when it is not preempted, r_pmtn denotes the number of CPU cores it occupies when preempted, and r_ognl denotes the resources it occupies when not preempted;
the offline load recovery queue module 300 builds the BE recovery queue and assigns preemption priorities; the BE recovery queue comprises two separate recovery queues, one of predictable BEs and one of unpredictable BEs; when the quality of service of the LC components of the cloud data center cannot be guaranteed, each server performs resource recovery according to its locally maintained preemption-loss priority queue and the contribution of the LC components deployed on it.
The BE applications fall into three categories: big data applications, artificial intelligence training, and scientific computing.
Because each LC component of the cloud data center contributes a different amount of delay, each co-located server maintains a local MLRQ with sub-queues at each MLRQ level; the number of BEs at MLRQ level q_MLRQ is determined by the contribution of the corresponding local LC components, according to the following formula:

[Formula: see original document, image PCTCN2021135609-appb-000001]

where n_BE denotes the number of BEs in the system and C_i denotes the contribution of LC service component i.
Compared with the BE-agnostic co-location systems of existing data centers, the high-throughput cloud computing resource recovery system of the present invention reduces the useless computation the system performs because of scheduling, thereby improving cluster throughput and resource utilization. Specifically, compared with a traditional BE-agnostic co-location system, the designed system improves throughput by 13.1%, CPU utilization by 10.2%, and memory bandwidth utilization by 11.4%.
Brief description of the drawings
Figure 1 compares the offline-service preemption losses of different BE applications;

Figure 2 is the first architecture diagram of the high-throughput cloud computing resource recovery system of the present invention;

Figure 3 is the second architecture diagram of the high-throughput cloud computing resource recovery system of the present invention;

Figure 4 shows how the BE offline-load recovery queues are integrated.
Detailed description of the embodiments
The technical solution of the present invention is described in detail below with reference to the accompanying drawings.
The basic idea of the present invention is as follows: when the quality of service of an LC service in the cloud data center cannot be guaranteed because of a load burst, the offline-service resource preemption loss of each BE is computed from the collected BE runtime data, and a suitable offline load is selected for preemption so as to release resources to the latency-critical (LC) service. In the present invention, the common search engines Solr and ElasticSearch and the distributed non-relational database Redis serve as LC services, while representative distributed offline loads in current data centers are selected as BE loads: Spark big data analytics tasks, distributed deep learning training tasks, and single-executable-binary scientific computing applications.
Figure 1 compares the offline-service preemption losses of different BE applications: (1a) DDL-ASP, an image classification deep learning model under asynchronous training; (1b) DDL-BSP, an image classification deep learning model under synchronous training; (1c) the big data application Spark; and (1d) SCIMARK, a Java benchmark for scientific and numerical computing. The differences are significant. (1a) Terminating a worker in asynchronous mode neither fails the BE application nor requires the terminated worker to be rescheduled; under preemption its makespan barely changes while it occupies fewer resources, so in this configuration task preemption actually improves the service efficiency of DDL-ASP. (1b) Workers must be synchronized, and any failed worker restarts from the most recent checkpoint, so terminating one of its workers causes a service loss if the termination occurs after 30% progress; in general, tasks preempted later inflict higher losses on the BE application. (1c) Later preemption incurs less loss, for two reasons: first, because RDDs give Spark applications high fault tolerance, the Spark scheduler can quickly recover a failed task whenever the failure occurs; second, applications typically execute as a series of stages, and preemption at 70% progress was found to produce less contention in the Spark executors, so recovery at that stage has little effect on the makespan. (1d) The offline-service preemption loss of SCIMARK grows linearly with progress: since no fault-tolerance mechanism is provided for it, every preemption forces it to be resubmitted and rerun from scratch.
Figure 2 shows the architecture of the high-throughput cloud computing resource recovery system of the present invention. The system comprises a quality-of-service monitoring module 100, a preemption loss analysis module 200, and an offline load recovery queue module 300.
The quality-of-service monitoring module 100 monitors and records the request-processing delays of the LC services in real time to determine whether quality of service is currently guaranteed. When it detects that quality of service cannot be guaranteed, it issues a resource recovery instruction that triggers the system to reclaim resources, ensuring that LC quality of service recovers quickly. At that moment a resource recovery signal is sent to the preemption loss analysis module 200 to select BEs suitable for recovery.
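As a minimal illustrative sketch (not part of the patent: the sliding-window size and the 99th-percentile tail-latency check are assumptions; the text only states that processing delays are monitored in real time and recovery is triggered when QoS cannot be guaranteed), the monitoring logic of module 100 might look like:

```python
from collections import deque

class QoSMonitor:
    """Tracks recent request latencies of an LC component and flags QoS violations.

    Assumptions (not specified in the patent): a fixed-size sliding window
    of recent latency samples and a p99 tail-latency check against the SLA
    target.
    """

    def __init__(self, sla_ms, window=1000):
        self.sla_ms = sla_ms                   # SLA target for tail latency (ms)
        self.latencies = deque(maxlen=window)  # sliding window of samples

    def record(self, latency_ms):
        self.latencies.append(latency_ms)

    def qos_violated(self):
        # Recovery should be triggered when the observed tail latency
        # exceeds the SLA target.
        if not self.latencies:
            return False
        ordered = sorted(self.latencies)
        p99 = ordered[min(len(ordered) - 1, int(0.99 * len(ordered)))]
        return p99 > self.sla_ms
```

In a deployment, `qos_violated()` returning true would correspond to the module sending the resource recovery signal to module 200.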
The preemption loss analysis module 200 computes the preemption loss of each offline load and passes this information to the offline load recovery queue module 300 for queue construction and preemption-priority assignment.
Representative BEs currently running in data centers fall mainly into three categories: big data applications, artificial intelligence training, and scientific computing. Big data applications process a data set through frameworks such as MapReduce and Spark, and their running time can be estimated from the measured data-processing progress. Artificial intelligence training aims to find a neural network model of good quality that meets a desired accuracy. Scientific computing mainly comprises short-running compute applications that do not process large volumes of data. BEs have different structures: they may be monolithic or consist of multiple components, and reclaiming resources from different BE components may affect BE throughput differently; for example, it may slow down processing or even prevent the BE from running. To reduce such negative effects, the quantity resource × time is computed to evaluate how resource recovery changes the service occupied by each BE, where resource denotes the CPU resources occupied by the BE and time denotes its completion time.
The preemption loss L that resource recovery causes each application is computed as follows:

L = S_pmtn − S_ognl = t_pmtn × r_pmtn − t_ognl × r_ognl

where t_pmtn denotes the completion time of the BE when it is preempted, t_ognl its completion time when it is not preempted, r_pmtn the number of CPU cores it occupies when preempted, and r_ognl the number of CPU cores it occupies when not preempted. If the service occupied by the BE grows after recovery, the preemption loss is greater than 0. Computing L requires BE runtime information, namely t_pmtn and t_ognl. If a prediction model exists for a particular BE that can accurately estimate its running time, the BE is classified as a predictable offline load; otherwise, a BE without an accurate prediction model is classified as an unpredictable offline load.
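The loss formula translates directly into code; this sketch simply evaluates L = t_pmtn × r_pmtn − t_ognl × r_ognl for given runtime measurements:

```python
def preemption_loss(t_pmtn, r_pmtn, t_ognl, r_ognl):
    """Preemption loss L = S_pmtn - S_ognl = t_pmtn*r_pmtn - t_ognl*r_ognl.

    t_pmtn / t_ognl: BE completion time with / without preemption;
    r_pmtn / r_ognl: CPU cores occupied with / without preemption.
    L > 0 means the occupied service (resource x time) grew after
    recovery, i.e. the preemption was costly.
    """
    return t_pmtn * r_pmtn - t_ognl * r_ognl
```

A BE whose completion time stretches from 100 s to 150 s on the same 4 cores, for example, has L = 150 × 4 − 100 × 4 = 200 core-seconds.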
1. For predictable offline loads, the expected completion times of the two BE types are obtained as follows:
(1) For Spark-based big data BEs, the expected completion time of the BE application is derived from the BE's completion progress c, elapsed time t, and preempted resource fraction p, according to the following formula:

[Formula: see original document, image PCTCN2021135609-appb-000002]

where c is obtained through the HTTP API exposed by Spark.
(2) For deep-learning-training BEs, an existing white-box model is used to predict the completion time of the BE application under different resource configurations. Taking the number of remaining training steps s, the elapsed time t, and the step processing speed q as input, the expected completion time is t_pmtn = (s / q) + t. Both s and q must be estimated by the model, and the number of remaining steps is updated from the live loss value of the training job.
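The white-box estimate for deep-learning BEs is simple enough to state as code; the formula t_pmtn = (s / q) + t is from the text, while the parameter names are illustrative:

```python
def ddl_expected_completion(remaining_steps, step_rate, elapsed_time):
    """Expected completion time of a distributed training BE: t_pmtn = s/q + t.

    remaining_steps (s) and step_rate (q) are assumed to come from a
    white-box model and the live training loss, as the text describes;
    elapsed_time (t) is the time the job has already run.
    """
    return remaining_steps / step_rate + elapsed_time
```

For instance, a job with 1000 steps left at 20 steps/s that has already run 50 s is expected to finish at t_pmtn = 1000/20 + 50 = 100 s.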
2. For unpredictable offline loads, the expected completion time of a BE is derived as follows:
The amount of useless computation U is used as the resource recovery priority: BEs that generate less useless computation U are reclaimed first. U is the amount of repeated computation that resource recovery causes. If, after resource recovery, the task merely becomes slower and needs no recomputation, then U = 0. If more than one BE has U = 0, the occupied service of such BEs is computed as resource × elapsedtime, where elapsedtime denotes the execution time and resource the CPU resources occupied by the BE. If a task fails and part of its computation becomes useless, then U > 0.
Useless computation depends on the BE's fault-tolerance mechanism. Based on the fault-tolerance mechanisms of existing BEs, the derivation of U falls into two main categories: (1) mechanisms based on time redundancy delay task execution by rescheduling failed tasks on backup servers; to reduce the repeated computation that rescheduling causes, a failed task is restarted on the backup server from the latest checkpoint, which yields the formula U_temp = t_ckpt × r_ognl, where t_ckpt denotes the computation time since the most recent checkpoint; (2) mechanisms based on space redundancy trade space for efficiency by issuing multiple replicas of the same task; the replicas run concurrently, and the task succeeds if at least one replica completes successfully, so if a task has more than one replica, recovery causes no repeated computation, i.e. U_space = 0, whereas if all replicas of a task fail it must be rescheduled and U_space = U_temp.
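The two derivations of U above can be sketched as a single helper; the function signature and replica-count parameter are illustrative, but the two formulas (U_temp = t_ckpt × r_ognl, and U_space = 0 when another replica survives, else U_temp) follow the text:

```python
def useless_computation(mechanism, t_ckpt, r_ognl, replicas=1):
    """Useless (repeated) computation U caused by reclaiming a BE task.

    Time redundancy: the task restarts from its latest checkpoint, so the
    work done since that checkpoint is repeated: U_temp = t_ckpt * r_ognl.
    Space redundancy: with more than one replica the task still succeeds
    and U_space = 0; if all replicas are lost it must be rescheduled and
    U_space = U_temp.
    """
    u_temp = t_ckpt * r_ognl
    if mechanism == "time":
        return u_temp
    if mechanism == "space":
        return 0 if replicas > 1 else u_temp
    raise ValueError("mechanism must be 'time' or 'space'")
```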
The offline load recovery queue module 300 builds the BE recovery queue and assigns preemption priorities, maintaining the running predictable and unpredictable BEs in a unified way. When the quality of service of an LC service in the cloud data center cannot be guaranteed, each server reclaims resources according to its locally maintained preemption-loss priority queue and the contribution of the LC components deployed on it.
Two separate recovery queues are built, one of predictable BEs and one of unpredictable BEs. Selecting the best BE to reclaim from two separate recovery queues is a challenge. To solve this, the Borda count voting method is used to unify the queue of predictable BEs and the queue of unpredictable BEs into a single BE recovery queue: each voter orders the candidates by preference, and the orderings are then combined to select the winner. BE_i denotes the i-th score obtained by Borda counting in the different orderings, and the BE with the smallest sum of scores across the orderings is preempted first.
Three queues are maintained for the running BE loads: a predictable preemption-loss queue, a predictable useless-computation queue, and an unpredictable useless-computation queue. Unpredictable BEs appear only in the unpredictable useless-computation queue, while predictable BEs appear in both the predictable preemption-loss queue and the predictable useless-computation queue; when each BE's scores in the three queues are obtained by Borda counting and summed, an unpredictable BE would score lower simply because it lacks a preemption-loss-queue score. For a fair comparison, the score an unpredictable BE obtains in the unpredictable useless-computation queue is therefore doubled. The scores of the predictable and unpredictable BEs are then merged and sorted in descending order of score, yielding a global BE recovery queue. Because the contributions of different LC components differ, the unified queue is partitioned by contribution into a multi-level recovery queue (MLRQ); when a resource recovery request arrives, the system reclaims all loads in the high-priority queue. As a result, BE loads co-located with high-contribution LC components face a coarser recovery granularity, enabling rapid recovery of LC quality of service.
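A sketch of the queue unification step, assuming rank positions as Borda scores (0 = reclaim first) and doubling the single score of unpredictable BEs as the text specifies; the exact score convention and tie-breaking are assumptions:

```python
def merge_recovery_queues(loss_q, useless_pred_q, useless_unpred_q):
    """Unify the three per-metric queues into one global BE recovery queue.

    Each input is a list of BE names, best-to-reclaim first. A BE's score
    in a queue is its rank position. Predictable BEs appear in loss_q and
    useless_pred_q; unpredictable BEs appear only in useless_unpred_q, so
    their single score is doubled for a fair comparison. The BE with the
    smallest total score is reclaimed first.
    """
    scores = {}
    for q in (loss_q, useless_pred_q):
        for rank, be in enumerate(q):
            scores[be] = scores.get(be, 0) + rank
    for rank, be in enumerate(useless_unpred_q):
        scores[be] = scores.get(be, 0) + 2 * rank
    # Ascending total score: lowest combined rank is preempted first.
    return sorted(scores, key=lambda be: scores[be])
```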
When the quality of service of an LC cannot be guaranteed, the recovery operation always executes from the top of the global BE recovery queue. If the first BE in the list does not exist on the local server, BEs are substituted in turn until a matching BE is found. To speed up SLA recovery, the global BE recovery queue is further organized into a multi-level recovery queue (MLRQ), and resource recovery always selects the BE at the top level of the MLRQ. Because each LC component's delay contribution differs, each co-located server maintains a local MLRQ with a longer sub-queue at each MLRQ level; in this way, more resources are reclaimed from the BEs deployed together with that LC component. The number of BEs at MLRQ level q_MLRQ is determined by the contribution of its local components, according to the following formula:

[Formula: see original document, image PCTCN2021135609-appb-000003]

where n_BE denotes the number of BEs in the system and C_i denotes the contribution of LC service component i.
When the latency-critical LC services of the cloud data center run alone, their residence time on each LC service component is recorded, and the contribution of each service component to the tail latency is then derived from the collected information. This characterization depends only on the LC service itself, and its cost grows linearly with the number of service components. The invention therefore reduces profiling cost compared with a profiling-based approach that must measure the combined interference of M LC services with N BE jobs.
Predictable BEs are those whose job completion time (JCT) can be estimated easily and accurately without relying on offline profiling; for example, the task completion time of a MapReduce or Spark application can be estimated from the proportion of data already processed, and for distributed deep learning training tasks white-box prediction models such as Optimus can serve as predictors of completion time. All other BEs are considered unpredictable. While predictable BEs are prioritized according to their progress, unpredictable BEs can be prioritized under the least attained service (LAS) policy, which reclaims the BE that has attained the least service.

Claims (3)

  1. A high-throughput cloud computing resource recovery system, characterized in that the system comprises a quality-of-service monitoring module (100), a preemption loss analysis module (200), and an offline-load recovery queue module (300), wherein:
    the quality-of-service monitoring module (100) monitors and records, in real time, the request-processing latency of the latency-sensitive service (LC) components of the cloud data center, in order to determine whether the quality of service is currently guaranteed; when it detects that the quality of service cannot be guaranteed, resource recovery is performed, the recovery being evaluated by the formula resource × time, where resource denotes the resources occupied by a BE and time denotes its completion time;
    the preemption loss analysis module (200) computes the preemption loss of offline loads;
    the preemption loss L incurred by each application due to resource recovery is computed as follows:
    L = S_pmtn - S_ognl = t_pmtn·r_pmtn - t_ognl·r_ognl
    where t_pmtn denotes the completion time of a BE when it is preempted, t_ognl its completion time when it is not preempted, r_pmtn the number of CPU cores it occupies when preempted, and r_ognl the resources it occupies when not preempted;
    the offline-load recovery queue module (300) builds the batch application (BE) recovery queues and assigns preemption priorities; the BE recovery queues comprise two separate queues, one consisting of predictable BEs and one of unpredictable BEs; when the quality of service of the cloud data center's latency-sensitive service LC components cannot be guaranteed, each server performs resource recovery according to its locally maintained preemption-loss priority queue and the contribution of the LC components deployed on it.
  2. The high-throughput cloud computing resource recovery system of claim 1, characterized in that the batch applications (BEs) fall into three categories: big-data applications, artificial-intelligence training, and scientific computing.
  3. The high-throughput cloud computing resource recovery system of claim 1, characterized in that, because each latency-sensitive service LC component of the cloud data center contributes differently to latency, each co-located server maintains a local MLRQ with sub-queues at every MLRQ level; the number of BEs in MLRQ level q is determined by the contribution of the corresponding local LC components, according to the following formula:
    Figure PCTCN2021135609-appb-100001
    where n_BE denotes the number of BEs in the system and C_i denotes the contribution of LC service component i.
PCT/CN2021/135609 2021-08-10 2021-12-06 High throughput cloud computing resource recovery system WO2023015787A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110912342.0A CN113608875B (en) 2021-08-10 2021-08-10 High-throughput cloud computing resource recovery system
CN202110912342.0 2021-08-10

Publications (1)

Publication Number Publication Date
WO2023015787A1 true WO2023015787A1 (en) 2023-02-16

Family

ID=78340084




Also Published As

Publication number Publication date
CN113608875B (en) 2023-09-12
CN113608875A (en) 2021-11-05


Legal Events

121: The EPO has been informed by WIPO that EP was designated in this application (ref document number: 21953392; country of ref document: EP; kind code of ref document: A1)
NENP: Non-entry into the national phase (ref country code: DE)