CN114035935A - High-throughput heterogeneous resource management method and device for multi-stage AI cloud service - Google Patents

High-throughput heterogeneous resource management method and device for multi-stage AI cloud service

Info

Publication number
CN114035935A
Authority
CN
China
Prior art keywords
cpu
phase
resource
service
service quality
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111193853.8A
Other languages
Chinese (zh)
Other versions
CN114035935B (en)
Inventor
陈全
过敏意
张蔚
符凯华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University
Priority to CN202111193853.8A
Publication of CN114035935A
Application granted
Publication of CN114035935B
Active legal status
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F 9/5055 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering software capabilities, i.e. software resources associated or available to the machine
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 Computing arrangements using knowledge-based models
    • G06N 5/01 Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Multi Processors (AREA)

Abstract

The invention provides a high-throughput heterogeneous resource management method and device for multi-phase AI cloud services. The method comprises the following steps: splitting a service quality target into a CPU-side service quality target and a GPU-side service quality target by using a service quality target distributor, based on the received LC service request; searching for the optimal resource allocation by using a heterogeneous resource manager, with the CPU-side and GPU-side service quality targets as initial samples; and monitoring the progress of the CPU phase in real time with a service quality compensator, accelerating the accelerator-side (GPU) execution of a user request when the time it spends in the CPU phase exceeds the CPU-side service quality target. The invention not only guarantees the service quality of the LC service but also greatly improves the overall performance of all BE applications on heterogeneous devices.

Description

High-throughput heterogeneous resource management method and device for multi-stage AI cloud service
Technical Field
The invention relates to the technical field of GPUs (graphics processing units), in particular to a high-throughput heterogeneous resource management method and device for multi-stage AI (Artificial Intelligence) cloud services.
Background
Modern data centers often host user-facing application services such as web search, social networking, and face recognition. Such applications attract users through low response times and high accuracy, and therefore all have stringent latency requirements. This type of application is known as an LC (latency-critical) application. Ensuring quality of service (QoS) for these LC applications is the focus of current data-center research.
With the rapid development of cloud computing platforms and deep learning, Deep Neural Networks (DNNs) have recently achieved human-level accuracy in various application scenarios, such as image recognition and speech recognition. Accordingly, deep neural networks are used to support LC applications in many data centers. New hardware accelerators such as GPUs have also been adopted by cloud service providers and widely deployed in computer clusters to meet the high computing-power requirements of emerging deep learning tasks. Compared with traditional online services, deep-learning-based online services not only have strict quality-of-service requirements but are also computationally demanding and use various heterogeneous resources.
A DNN-based LC application mainly has two phases: data preprocessing and online inference. A heterogeneous accelerator (e.g., a GPU) is typically used for the inference phase, while the host CPU is used for the data preprocessing phase (including decoding, data resizing, and so on). The interaction between the host and the accelerator (memcpy) is carried over the PCI-e bus. Data centers have some unavoidable problems, an important one being the over-allocation of resources. According to previous studies, these LC services often see a day-night pattern of user access: at certain times of day user requests are high and concentrated, while at night the request load drops rapidly. This pattern leaves CPU/GPU resources underutilized for large stretches outside peak hours, wasting resources.
To improve the utilization of heterogeneous resources in a data center, a common practice is to run an LC service with a QoS requirement alongside best-effort (BE) applications that have no QoS requirement. Accelerator manufacturers now also produce multi-tasking accelerators that support space-division sharing in order to achieve higher throughput on the same accelerator. For example, MPS (Multi-Process Service) is a binary-compatible implementation of the CUDA Application Programming Interface (API). With the NVIDIA Volta and Turing architectures, Volta MPS allows different applications to execute simultaneously on the same GPU, each with a certain percentage of resources, thereby increasing overall GPU utilization.
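For illustration, a minimal Python sketch of this space-division sharing follows. It relies on the documented Volta MPS environment variable CUDA_MPS_ACTIVE_THREAD_PERCENTAGE; the client binaries (./lc_service, ./be_job) are hypothetical placeholders, and an MPS control daemon is assumed to be running on the machine.

import os
import subprocess

def launch_with_sm_share(cmd, percent):
    # Restrict this CUDA client to roughly `percent` of the GPU's SMs
    # via the Volta MPS active-thread-percentage knob. The variable must
    # be set before the client creates its CUDA context.
    env = dict(os.environ, CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=str(percent))
    return subprocess.Popen(cmd, env=env)

# e.g., give the LC service 60% of the SMs and a co-located BE job 40%
lc = launch_with_sm_share(["./lc_service"], 60)
be = launch_with_sm_share(["./be_job"], 40)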
However, a mixed deployment of multiple applications may introduce performance penalties for the LC service and increase the end-to-end latency of user requests, risking a quality-of-service violation. This is primarily because the co-located applications compete for shared heterogeneous resources. In addition, these new multi-phase AI cloud services have both a host phase (CPU) and an accelerator phase (GPU). Designing a suitable scheduling strategy for such emerging heterogeneous hybrid deployment scenarios is a new challenge. The key problem addressed by the invention is to guarantee the quality of service of LC applications while making overall resource use more efficient and economical.
Prior QoS-guarantee techniques for CPU hybrid deployment fall into two categories: methods based on performance analysis and methods based on feedback regulation. Performance-analysis methods, such as Bubble-Up, profile user-facing services and batch applications to predict the performance degradation caused by shared-cache and memory-bandwidth contention, and determine a "safe" scheduling mode that will not cause QoS violations. Feedback-regulation methods, such as Heracles, build decision trees to determine the resource allocation of the next time period based on the QoS feedback of the user-facing service in the current period, thereby periodically adjusting the allocation of shared resources.
Existing multi-task scheduling algorithms on the CPU are not suitable for the new multi-phase AI cloud services, because they cannot perceive the space-division sharing characteristics of accelerators such as GPUs; the high parallelism of the GPU therefore cannot be effectively utilized and hardware resources cannot be fully exploited. At the same time, they ignore the complex interference between different shared resources on the GPU.
Prior QoS-guarantee techniques for GPU hybrid deployment also fall into two categories: time-division sharing and space-division sharing of the accelerator. On time-sharing accelerators, queue-based approaches (e.g., GrandSLAm and Baymax) predict the duration of each kernel and reorder the GPU kernels accordingly. On spatial multitasking accelerators, analysis-based approaches (e.g., Laius) divide the computational resources between LC services and BE applications.
Previous studies were limited to scheduling mixed deployments on the GPU accelerator and are not versatile enough to manage these new application scenarios. An LC service has both host and accelerator phases, and managing the GPU and CPU phases separately can cause severe QoS violations. If the host-phase latency prediction is inaccurate or the feedback mechanism is not timely enough, the LC service will suffer a QoS violation whose delay cannot be made up.
Disclosure of Invention
In view of the above drawbacks of the prior art, an object of the present invention is to provide a high-throughput heterogeneous resource management method and device for multi-stage AI cloud services, which meet the QoS targets of LC services while maximizing the performance of all BE applications on the CPU and GPU.
To achieve the above and other related objects, the present invention provides a high-throughput heterogeneous resource management method for multi-phase AI cloud services, including: splitting a service quality target into a CPU-side service quality target and a GPU-side service quality target by using a service quality target distributor, based on the received LC service request; searching for the optimal resource allocation by using a heterogeneous resource manager, with the CPU-side and GPU-side service quality targets as initial samples; and monitoring the progress of the CPU phase in real time with a service quality compensator, accelerating the accelerator-side (GPU) execution of a user request when the time it spends in the CPU phase exceeds the CPU-side service quality target.
In an embodiment of the present invention, the splitting of the service quality target into a CPU-side service quality target and a GPU-side service quality target based on the received LC service request includes: setting each resource quota of the LC task to its minimum resource unit while allocating the remaining resources to the BE task; adjusting the CPU-GPU-stage service quality split according to the performance surfaces of the shared resources; recording the service quality increase of the LC task and the performance degradation of the BE task; and selecting the optimal resource, adjusting it from the BE task to the LC task, and executing the next cycle, thereby splitting the service quality target into a CPU-side service quality target and a GPU-side service quality target.
In an embodiment of the present invention, the heterogeneous resource manager searches for an optimal resource allocation based on a bayesian optimization algorithm of a random forest.
In an embodiment of the present invention, the initial sample is selected using any one of the following strategies: an equal-priority strategy in which all CPU-phase tasks are allocated the same computing resources; an initial resource-allocation strategy in which the initial point of resource allocation is obtained from the service quality target distributor; and a service quality guarantee strategy in which the minimum resource quota is allocated to BE jobs and the remainder is reserved for the LC task.
In an embodiment of the present invention, a scoring function is configured for the heterogeneous resource manager to guide it to search in the correct direction within the configuration space.
In one embodiment of the invention, a piecewise objective function based on the service quality of the LC task and the comprehensive throughput of the BE tasks is constructed; the first goal of the piecewise objective function is to meet the quality-of-service targets on the CPU and GPU, and the second goal is to maximize the overall system throughput of the BE tasks.
In an embodiment of the present invention, the method further includes configuring an optimization constraint condition for searching for optimal resource allocation; the optimization constraints include: the maximum quota of each task does not exceed the total amount of resources; for each resource, the sum of the quotas for all tasks cannot be greater than the total.
In an embodiment of the present invention, the monitoring of the progress of the CPU phase in real time includes: calculating the extra time spent executing the CPU phase, and if this extra time is no greater than the reduction in GPU-phase execution time under the new resource quota, determining that the user request can still meet its service quality target.
In an embodiment of the present invention, the service quality compensator determines a new resource quota for the request of the LC service; when the new resource quota is determined, the computing-resource quota allocated to the BE task is updated at the same time. If the CPU-phase progress of the request subsequently satisfies the service quality under the new quota, the resource quota allocated to the request is rolled back to the original quota.
Embodiments of the present invention also provide an electronic device, including a CPU and a GPU, that applies the high-throughput heterogeneous resource management method for multi-phase AI cloud services as described above.
As described above, the high-throughput heterogeneous resource management method and device for multi-phase AI cloud services according to the present invention have the following beneficial effects:
1. The invention guarantees the service quality of the LC service and greatly improves the overall performance of all BE applications on heterogeneous devices, on real data-center machines and without modifying any hardware.
2. The results of the invention can support the practical deployment of scheduling technology for the emerging heterogeneous hybrid deployment problem in data centers. They also have commercial significance: they can provide high-throughput dynamic task scheduling for multi-stage AI cloud services and guarantee cloud service quality while maximizing the utilization of heterogeneous data-center equipment.
Drawings
Fig. 1 is a flowchart illustrating a high throughput heterogeneous resource management method for multi-phase AI cloud service according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention.
The embodiment of the invention aims to provide a high-throughput heterogeneous resource management method and device for multi-stage AI cloud services, which meet the QoS (quality of service) targets of LC (latency-critical) services and maximize the performance of all BE (best-effort) applications on the CPU (central processing unit) and GPU (graphics processing unit).
The embodiment comprises a BE-aware QoS target distributor, a unified heterogeneous resource manager and an accelerator-side QoS compensator. The main contributions and innovation points of the present embodiment are: a unified heterogeneous resource management method for the multi-stage AI cloud service of the data center is designed, so that the service quality of the LC service is guaranteed, and the resource utilization of the whole CPU and GPU is more efficient and economical.
In this embodiment, a high-throughput heterogeneous resource management method for multi-phase AI cloud service is designed and implemented, and the method has two goals in heterogeneous operation: the QoS target of LC service is satisfied, and the performance of all BE applications on CPU and GPU is improved to the maximum extent.
The embodiment provides a BE-aware QoS target distributor, a unified heterogeneous resource manager and an accelerator-side QoS compensator, which are used for realizing a high-throughput heterogeneous resource management method for multi-stage AI cloud service. The embodiment solves a plurality of challenges of resource management aiming at the heterogeneous mixed deployment, ensures the service quality guarantee during the mixed operation of the LC service, and simultaneously improves the utilization efficiency of the whole resources on the CPU and the GPU.
The principle and implementation of the high-throughput heterogeneous resource management method and device for multi-phase AI cloud services according to the present embodiment are described in detail below, so that those skilled in the art can understand them without creative work.
As shown in fig. 1, the present embodiment provides a high-throughput heterogeneous resource management method for a multi-phase AI cloud service, where the high-throughput heterogeneous resource management method for the multi-phase AI cloud service includes:
step S100, a service quality target distributor is utilized to divide the service quality target into a CPU side service quality target and a GPU side service quality target based on the received LC service request;
s200, searching optimal resource allocation by using a heterogeneous resource manager and taking a CPU side service quality target and a GPU side service quality target as initial samples;
step S300, a service quality compensator is used for monitoring the progress of the CPU stage in real time, and the time spent in the CPU stage by a user request exceeds the service quality target of the CPU, so that the execution of the CPU at the accelerator end is accelerated.
The embodiment provides a BE-aware QoS target distributor, a unified heterogeneous resource manager, and an accelerator-side QoS compensator to realize the high-throughput heterogeneous resource management method for multi-stage AI cloud services. The method treats the LC service and the BE applications differently. When a request for the LC service is submitted, the QoS target allocator heuristically splits its QoS target into a CPU-side QoS target and a GPU-side QoS target. These partitions are then passed as initial sample points to the unified resource manager to search for the best resource allocation. The resource manager determines an optimal resource allocation that maximizes the throughput of the BE jobs (considering economic efficiency) while guaranteeing the QoS of the LC service. In addition, the QoS compensator monitors the progress of the CPU phase in real time, speeding up a request's accelerator-side execution if the time it spends in the CPU phase exceeds its CPU-side QoS target.
The following describes the steps S100 to S300 of the present embodiment in detail.
Step S100, a service quality target distributor is used for dividing the service quality target into a CPU side service quality target and a GPU side service quality target based on the received LC service request.
In this embodiment, the splitting of the service quality target into a CPU-side service quality target and a GPU-side service quality target based on the received LC service request includes: setting each resource quota of the LC task to its minimum resource unit while allocating the remaining resources to the BE task; adjusting the CPU-GPU-stage service quality split according to the performance surfaces of the shared resources; recording the service quality increase of the LC task and the performance degradation of the BE task; and selecting the optimal resource, adjusting it from the BE task to the LC task, and executing the next cycle, thereby splitting the service quality target into a CPU-side service quality target and a GPU-side service quality target.
The goal of step S100 is to solve the QoS-partitioning problem accurately within a limited sampling budget. In the preliminary evaluation of this embodiment, increasing the resource quota of a BE task effectively improves its performance. However, different BE tasks have different sensitivities to the shared resources (CPU cores, LLC, and memory bandwidth on the CPU, and SMs on the GPU). To make maximum use of the computational resources, the shared-resource usage of the LC task should impact the performance of the BE tasks as little as possible.
Therefore, this embodiment designs a heuristic search algorithm to perform the QoS partitioning for the LC service. Specifically, the quota of each resource of the LC task (CPU cores, LLC, and memory bandwidth on the CPU, and SMs on the GPU) is initially set to its minimum resource unit, while the remaining resources are allocated to the BE task. To guarantee the QoS of the LC task, the shared-resource quota allocated to it must then be increased. In each iteration, the QoS target distributor adjusts the QoS split of the CPU-GPU stages according to the performance surfaces of the shared resources. For each candidate resource d in a search step, this embodiment records the QoS increment ΔQoS_d of the LC task and the performance degradation Δperf_d of the BE task. The following formula selects the optimal resource d*, which is then adjusted from the BE task to the LC task before the next loop executes.
d* = argmax_d (ΔQoS_d / Δperf_d)    (1)
In this way, a near-optimal result is obtained that not only satisfies the QoS of the LC task but also keeps the performance degradation of the BE tasks to a minimum. The final CPU-phase and GPU-phase times are scaled proportionally to the QoS target and used as the QoS split result, which is passed as an initial sample to the unified heterogeneous resource manager.
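A minimal Python sketch of this heuristic loop follows, under stated assumptions: lc_meets_qos and probe are hypothetical callbacks, where lc_meets_qos checks whether the LC task meets its end-to-end QoS under a given quota, and probe measures the ΔQoS_d and Δperf_d of moving one unit of resource d from the BE task to the LC task.

def split_qos_target(resources, lc_meets_qos, probe):
    # Start the LC task at the minimum unit of every shared resource
    # (CPU cores, LLC, memory bandwidth, SMs); the BE task gets the rest.
    # Assumes the QoS target is attainable with the available resources.
    lc_quota = {d: 1 for d in resources}
    while not lc_meets_qos(lc_quota):
        ratios = {}
        for d in resources:
            d_qos, d_perf = probe(lc_quota, d)       # ΔQoS_d and Δperf_d
            ratios[d] = d_qos / max(d_perf, 1e-9)    # QoS gain per unit BE loss
        best = max(ratios, key=ratios.get)           # d* per equation (1)
        lc_quota[best] += 1                          # move d* from BE to LC
    return lc_quota                                  # basis of the CPU/GPU QoS split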
Step S200, a heterogeneous resource manager is utilized to search for optimal resource allocation by taking a CPU side service quality target and a GPU side service quality target as initial samples.
Once the QoS target allocator obtains the QoS target partition, the present embodiment needs to manage the shared resources on each device (CPU/GPU). Given the large sample space of resource configurations in heterogeneous hybrid deployment scenarios, the resource manager must quickly identify the optimal resource allocation with minimal sampling time.
In this embodiment, the heterogeneous resource manager searches for the optimal resource allocation based on a bayesian optimization algorithm of a random forest.
The invention implements this resource-allocation management with a random-forest-based Bayesian optimization algorithm (SMAC); however, the SMAC algorithm is not directly applicable to heterogeneous hybrid deployment. First, SMAC selects its initial sample points randomly. Although this works well for some simple services, in the heterogeneous case it easily produces poor configurations, leading to frequent QoS violations during sampling. Second, the objective function in conventional SMAC returns only a single value to maximize (e.g., system throughput or execution time), whereas the present embodiment has multiple optimization objectives (the QoS of the LC service and the performance of the BE tasks).
In this embodiment, the initial sample is selected using any one of the following strategies: an equal-priority strategy in which all CPU-phase tasks are allocated the same computing resources; an initial resource-allocation strategy in which the initial point of resource allocation is obtained from the service quality target distributor; and a service quality guarantee strategy in which the minimum resource quota is allocated to BE jobs and the remainder is reserved for the LC task.
This embodiment makes two adaptive corrections to the SMAC algorithm. The initial sampling points are carefully selected according to different strategies: 1) an equal-priority strategy (all CPU-phase tasks are allocated equal computational resources); 2) the initial resource-allocation point obtained from the QoS target allocator; 3) a QoS-guarantee strategy (the minimum resource quota is allocated to BE jobs and the remainder is left to the LC task). Using these three resource configurations as initial points better exposes promising resource quotas and speeds up the sampling process; a sketch follows below. As for determining the next sample point, this embodiment carefully designs the objective function so that SMAC optimization can be applied to heterogeneous co-location.
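The following sketch illustrates the three initial points under stated assumptions: totals maps each resource to its total amount, the LC task is index 0 among n co-located tasks, and allocator_point is the configuration produced by the QoS target allocator (all names are illustrative, not from the original disclosure).

def initial_points(totals, n, allocator_point, min_unit=1):
    # 1) equal priority: every task gets the same share of each resource
    equal = {r: [t // n] * n for r, t in totals.items()}
    # 3) QoS guarantee: BE jobs get the minimum quota, the LC task the rest
    qos_guard = {r: [t - min_unit * (n - 1)] + [min_unit] * (n - 1)
                 for r, t in totals.items()}
    # 2) the starting point supplied by the QoS target allocator
    return [equal, allocator_point, qos_guard]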
In addition, in this embodiment, a scoring function is configured for the heterogeneous resource manager to guide it to search in the correct direction in the configuration space.
That is, the present embodiment designs a scoring function for the resource manager:

Score(R) = (1/2) · min_{dev∈{CPU,GPU}} (QoS_target / QoS_eval),  if the LC task violates QoS on any device
Score(R) = 1/2 + (1/2) · (α·(perf_r/perf_s)_CPU + β·(perf_r/perf_s)_GPU) / (α + β),  otherwise    (2)

The score of the scoring function is passed to the objective function (i.e., the score is assigned at the end of each cycle in which the system runs under a given resource configuration). This scoring function directs the resource manager to search in the right direction within the large configuration space.
Furthermore, in this embodiment, a piecewise objective function based on the quality of service of the LC task and the integrated throughput of the BE task is constructed; the first objective of the piecewise objective function is to meet quality of service objectives on the CPU and GPU, and the second objective is to maximize the overall system throughput of the BE task.
That is, the embodiment constructs a piecewise objective function that considers the QoS of the LC task and the comprehensive throughput of the BE tasks (for economic benefit). The function value lies between 0 (worst case: no LC task satisfies its QoS) and 1 (ideal case: all LC tasks satisfy their QoS and BE-task performance matches their solo runs). The first goal is to meet the QoS targets on the CPU and GPU. In the above equation, QoS_target is the QoS target of the LC task and QoS_eval is the latency of the LC task under the current resource configuration. As long as the LC task has a QoS violation on any device, the score stays below 0.5 regardless of BE-task performance; only when the score is greater than 0.5 is the second objective considered.
The second objective in the above equation is to maximize the overall system throughput of the BE tasks, where perf_r is the throughput of a BE job during a sample and perf_s is the throughput of the same BE task running alone. Considering the large price difference between renting CPUs and renting GPUs, this embodiment performs a weighted summation of the CPU and GPU throughputs, where α and β are weights related to the CPU and GPU lease prices, respectively.
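A minimal sketch of the piecewise score as reconstructed in equation (2); qos_target/qos_eval and perf_r/perf_s are per-device measurements assumed to be supplied by a monitoring layer, and alpha/beta are the CPU/GPU lease-price weights.

def score(qos_target, qos_eval, perf_r, perf_s, alpha, beta):
    # First goal: meet QoS on both CPU and GPU. A violation on either
    # device caps the score below 0.5 regardless of BE performance.
    worst = min(qos_target[d] / qos_eval[d] for d in ("cpu", "gpu"))
    if worst < 1.0:
        return 0.5 * worst
    # Second goal: maximize price-weighted BE throughput, normalized
    # against each BE task running alone (1.0 means no degradation).
    weights = {"cpu": alpha, "gpu": beta}
    be = sum(weights[d] * perf_r[d] / perf_s[d] for d in ("cpu", "gpu"))
    return 0.5 + 0.5 * min(be / (alpha + beta), 1.0)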
In this embodiment, the method further includes configuring an optimization constraint condition for searching for optimal resource allocation; the optimization constraints include: the maximum quota of each task does not exceed the total amount of resources; for each resource, the sum of the quotas for all tasks cannot be greater than the total.
To speed up the search process of the SMAC optimization, the invention constrains the search space with a pruning strategy based on the following optimization problem:

Object = MAX(a(Score(R)))
s.t.  R_ij ≤ R_j,              for all i ∈ {1,…,n}, j ∈ {1,…,m}
      Σ_{i=1}^{n} R_ij ≤ R_j,  for all j ∈ {1,…,m}    (3)

This removes most "unneeded" resource allocations. Assuming n tasks share m resources, the search process is as shown in equation (3). R is a matrix with n rows and m columns, where R_ij represents the share of the j-th resource owned by the i-th task, and R_j represents the total amount of resource j. The optimization problem contains two constraints: first, the maximum quota of each task does not exceed the total amount of the resource; second, for each resource, the sum of the quotas of all tasks cannot exceed the total. Meanwhile, to find the globally optimal resource partition, this embodiment computes the final score of each resource partition using the acquisition function a(·).
Step S300, a service quality compensator is used to monitor the progress of the CPU phase in real time; when the time a user request spends in the CPU phase exceeds the CPU-side service quality target, the execution of the request at the accelerator end (its GPU phase) is accelerated.
In this embodiment, the monitoring of the progress of the CPU phase in real time includes: calculating the extra time spent executing the CPU phase, and if this extra time is no greater than the reduction in GPU-phase execution time under the new resource quota, determining that the request of the LC service can still meet its service quality target.
Specifically, in this embodiment, the service quality compensator determines a new resource quota for the request of the LC service; when the new quota is determined, the computing-resource quota allocated to the BE task is updated at the same time. If the CPU-phase progress of the request subsequently satisfies the service quality under the new quota, the resource quota allocated to the request is rolled back to the original quota.
During the execution of an LC request (denoted by Q), the invention provides an accelerator-side QoS compensator to monitor the CPU-phase execution progress of Q in real time. If the CPU phase of Q runs slower than expected (e.g., the workload suddenly spikes, or there is contention that cannot be managed explicitly), the compensator accelerates the GPU phase of Q by allocating more accelerator-side computing resources to it. The difficulty of this step is to quickly determine the new GPU-side computing-resource quota without severely reducing the throughput of the BE applications on the GPU.
Specifically, the compensator periodically checks whether the CPU phase is running slower than expected. T_cpu and T'_gpu denote, respectively, the actual CPU-phase execution time of the LC request under the current resource quota and the GPU-phase execution time of the LC request under the newly determined GPU computing-resource quota.
T_save = T_cpu − QoS_cpu    (4)

Equation (4) computes the extra time spent executing the CPU phase, where QoS_cpu is the CPU-side QoS target. If T_save is no greater than the reduction in GPU-phase execution time under the new resource quota, the LC request can still meet its QoS target.

T'_gpu ≤ QoS_gpu − T_save    (5)

According to equation (5), the compensator confirms a new "just enough" GPU-side computing-resource quota for the LC request, where QoS_gpu is the GPU-side QoS target. In the formula, T'_gpu can be obtained from a performance model, QoS_cpu and QoS_gpu are obtained from the BE-aware QoS target allocator, and T_cpu is measured directly at run time. Once the new resource quota is determined, the computing-resource quota allocated to the BE job is updated at the same time. If the CPU-phase progress of the LC request satisfies the QoS under the new quota, the resource quota allocated to the request is rolled back to its original quota. In this way, the present embodiment ensures that the LC request satisfies its QoS while minimizing the resources it uses.
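A minimal sketch of this compensation step under stated assumptions: gpu_time_model is a hypothetical callback standing in for the performance model that predicts T'_gpu for a candidate GPU quota; qos_cpu and qos_gpu are the split targets from the BE-aware allocator, and t_cpu is the measured CPU-phase time.

def compensate(t_cpu, qos_cpu, qos_gpu, gpu_time_model, quotas):
    t_save = t_cpu - qos_cpu          # equation (4): CPU-phase overrun
    if t_save <= 0:
        return None                   # CPU phase on schedule; keep current quota
    # equation (5): smallest "just enough" quota with T'_gpu <= QoS_gpu - T_save
    for q in sorted(quotas):
        if gpu_time_model(q) <= qos_gpu - t_save:
            return q
    return max(quotas)                # no quota suffices; give all available SMs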
To enable those skilled in the art to further understand the method for managing high-throughput heterogeneous resources for multi-phase AI cloud service according to the present embodiment, an implementation process of the method for managing high-throughput heterogeneous resources for multi-phase AI cloud service according to the present embodiment is described below:
specifically, the invention manages and allocates resources in a heterogeneous hybrid deployment scenario by the following steps, Q representing an LC request:
1. the QoS target allocator builds a performance model for Q based on the characteristics of BE jobs currently running on the CPU and GPU. Based on the model, the invention divides the QoS target of Q into a CPU stage and a GPU stage by a heuristic method. And then, the QoS target division result is used as an initial sample point to be transmitted to a uniform heterogeneous resource manager. This step can significantly affect the number of subsequent attempts to determine the optimal resource allocation.
2. Once the QoS target of Q is split, the resource manager allocates various resource configurations (CPU core, memory bandwidth, LLC, SM) for Q and co-located BE applications on the CPU and GPU sides according to the optimized SMAC algorithm. In performing the allocation, the present embodiment maximizes the overall throughput of the BE job (considering economic benefits) while mitigating QoS violations due to resource contention. A challenging part is to minimize the time required to determine the optimal allocation to accommodate the dynamic load.
3. The accelerator-side QoS compensator will monitor the CPU stage progress of Q in real time. If the CPU phase of Q is running slower than expected, the compensator will speed up the accelerator phase of Q by allocating more GPU computing resources. The difficulty here is to quickly determine the new computation resource quota of Q on the GPU without severely reducing the throughput of BE applications on the GPU.
The embodiment of the invention also provides an electronic device including a CPU and a GPU, which applies the high-throughput heterogeneous resource management method for multi-phase AI cloud services described above. The method has been described in detail above and is not repeated here.
In conclusion, the invention can ensure the service quality of LC service and greatly improve the comprehensive performance of all BE applications on heterogeneous equipment on the premise of a real data center machine and no need of modifying hardware equipment; the achievement of the invention can provide support for landing of scheduling technology for the emerging heterogeneous mixed deployment problem of the data center. Meanwhile, the achievement of the invention has commercial significance, can provide high-throughput dynamic task scheduling service for multi-stage AI cloud service, and ensures the cloud service quality under the condition of maximizing the utilization rate of data center heterogeneous equipment. Therefore, the invention effectively overcomes various defects in the prior art and has high industrial utilization value.
The foregoing embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Any person skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes made by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed herein shall still be covered by the claims of the present invention.

Claims (10)

1. A high-throughput heterogeneous resource management method for multi-phase AI cloud service is characterized in that:
splitting a service quality target into a CPU side service quality target and a GPU side service quality target by using a service quality target distributor based on the received LC service request;
searching optimal resource allocation by using a heterogeneous resource manager and taking a CPU side service quality target and a GPU side service quality target as initial samples;
the progress of the CPU phase is monitored in real time by a service quality compensator, and the execution of the GPU phase at the accelerator side is accelerated when the time spent in the CPU phase by a user request exceeds the CPU-side service quality target.
2. The method of high throughput heterogeneous resource management for multi-phase AI cloud services according to claim 1, wherein: the splitting of the quality of service objective into a CPU side quality of service objective and a GPU side quality of service objective based on the received request for LC service includes:
setting each resource quota of the LC task to its minimum resource unit while allocating the remaining resources to the BE task;
adjusting the CPU-GPU-stage service quality split according to the performance surfaces of the shared resources;
recording the service quality increase of the LC task and the performance degradation of the BE task;
and selecting the optimal resource, adjusting it from the BE task to the LC task, and executing the next cycle, thereby splitting the service quality target into a CPU-side service quality target and a GPU-side service quality target.
3. The method of high throughput heterogeneous resource management for multi-phase AI cloud services according to claim 1, wherein: the heterogeneous resource manager searches for optimal resource allocation based on a Bayesian optimization algorithm of a random forest.
4. The method of high-throughput heterogeneous resource management for multi-phase AI cloud services according to claim 3, wherein: the initial sample is selected using any one of the following strategies: an equal-priority strategy in which all CPU-phase tasks are allocated the same computing resources; an initial resource-allocation strategy in which the initial point of resource allocation is obtained from the service quality target distributor; and a service quality guarantee strategy in which the minimum resource quota is allocated to BE jobs and the remainder is reserved for the LC task.
5. The method of high-throughput heterogeneous resource management for multi-phase AI cloud services according to claim 1, characterized in that: the heterogeneous resource manager is configured with a scoring function to guide it to search in the correct direction in the configuration space.
6. The method of high-throughput heterogeneous resource management for multi-phase AI cloud services according to claim 1, characterized in that: a piecewise objective function based on the service quality of the LC task and the comprehensive throughput of the BE tasks is constructed; the first goal of the piecewise objective function is to meet the quality-of-service targets on the CPU and GPU, and the second goal is to maximize the overall system throughput of the BE tasks.
7. The method for high-throughput heterogeneous resource management for multi-phase AI cloud services according to claim 1 or 6, characterized in that: configuring an optimization constraint condition for searching optimal resource allocation; the optimization constraints include: the maximum quota of each task does not exceed the total amount of resources; for each resource, the sum of the quotas for all tasks cannot be greater than the total.
8. The method for high-throughput heterogeneous resource management for multi-phase AI cloud services according to claim 6 or 7, characterized in that: the real-time monitoring of the progress of the CPU phase comprises: calculating the extra time spent executing the CPU phase, and if this extra time is no greater than the reduction in GPU-phase execution time under the new resource quota, determining that the user request can still meet its service quality target.
9. The method of high-throughput heterogeneous resource management for multi-phase AI cloud services according to claim 1, characterized in that: the service quality compensator determines a new resource quota for the request of the LC service; when the new resource quota is determined, the computing-resource quota allocated to the BE task is updated at the same time; if the CPU-phase progress of the request subsequently satisfies the service quality under the new quota, the resource quota allocated to the request is rolled back to the original quota.
10. An electronic device, characterized by: comprising a CPU and a GPU, said electronic device applying the high throughput heterogeneous resource management method for multi-phase AI cloud services according to any one of claims 1 to 9.
CN202111193853.8A 2021-10-13 2021-10-13 High-throughput heterogeneous resource management method and device for multi-stage AI cloud service Active CN114035935B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111193853.8A CN114035935B (en) 2021-10-13 2021-10-13 High-throughput heterogeneous resource management method and device for multi-stage AI cloud service

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111193853.8A CN114035935B (en) 2021-10-13 2021-10-13 High-throughput heterogeneous resource management method and device for multi-stage AI cloud service

Publications (2)

Publication Number Publication Date
CN114035935A (en) 2022-02-11
CN114035935B CN114035935B (en) 2024-07-19

Family

ID=80141258

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111193853.8A Active CN114035935B (en) 2021-10-13 2021-10-13 High-throughput heterogeneous resource management method and device for multi-stage AI cloud service

Country Status (1)

Country Link
CN (1) CN114035935B (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200225655A1 (en) * 2016-05-09 2020-07-16 Strong Force Iot Portfolio 2016, Llc Methods, systems, kits and apparatuses for monitoring and managing industrial settings in an industrial internet of things data collection environment
CN107357661A (en) * 2017-07-12 2017-11-17 北京航空航天大学 A kind of fine granularity GPU resource management method for mixed load
US20190188276A1 (en) * 2017-12-20 2019-06-20 International Business Machines Corporation Facilitation of domain and client-specific application program interface recommendations
CN108900432A (en) * 2018-07-05 2018-11-27 中山大学 A kind of perception of content method based on network Flow Behavior
CN112445605A (en) * 2019-08-30 2021-03-05 中兴通讯股份有限公司 Media data processing method and device and media server
CN111580934A (en) * 2020-05-13 2020-08-25 杭州电子科技大学 Resource allocation method for consistent performance of multi-tenant virtual machines in cloud computing environment
CN111597045A (en) * 2020-05-15 2020-08-28 上海交通大学 Shared resource management method, system and server system for managing mixed deployment
US20210011765A1 (en) * 2020-09-22 2021-01-14 Kshitij Arun Doshi Adaptive limited-duration edge resource management

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
WEI ZHANG, "CHARM: Collaborative Host and Accelerator Resource Management for GPU Datacenters", 2021 IEEE 39th International Conference on Computer Design (ICCD), 20 December 2021, pages 307-315 *
WENBO BAO, "High-quality and real-time frame interpolation on heterogeneous computing system", 2017 IEEE International Symposium on Broadband Multimedia Systems and Broadcasting (BMSB), 20 July 2017, pages 1-4 *
LI Peng, "Research on global active memory-access optimization for multi-core system-on-chip", High Technology Letters, vol. 29, no. 03, 15 March 2019, pages 203-212 *

Also Published As

Publication number Publication date
CN114035935B (en) 2024-07-19

Similar Documents

Publication Publication Date Title
US11989647B2 (en) Self-learning scheduler for application orchestration on shared compute cluster
Zhang et al. Model-Switching: Dealing with Fluctuating Workloads in Machine-Learning-as-a-Service Systems
CN111491006A (en) Load-aware cloud computing resource elastic distribution system and method
Urgaonkar et al. Dynamic resource allocation and power management in virtualized data centers
Yu et al. Gillis: Serving large neural networks in serverless functions with automatic model partitioning
CN110737529A (en) cluster scheduling adaptive configuration method for short-time multiple variable-size data jobs
US20080104605A1 (en) Methods and apparatus for dynamic placement of heterogeneous workloads
CN109947619B (en) Multi-resource management system and server for improving throughput based on service quality perception
Iyapparaja et al. Efficient Resource Allocation in Fog Computing Using QTCS Model.
Dublish et al. Poise: Balancing thread-level parallelism and memory system performance in GPUs using machine learning
CN116820784B (en) GPU real-time scheduling method and system for reasoning task QoS
CN114217974A (en) Resource management method and system in cloud computing environment
Zhang et al. CHARM: Collaborative host and accelerator resource management for gpu datacenters
CN115714820A (en) Distributed micro-service scheduling optimization method
CN115858110A (en) Multi-objective optimization strategy-based multi-level task scheduling method
CN115237586A (en) GPU resource configuration method for deep learning inference performance interference perception
Desprez et al. A bi-criteria algorithm for scheduling parallel task graphs on clusters
Ferikoglou et al. Iris: interference and resource aware predictive orchestration for ml inference serving
EP4300305A1 (en) Methods and systems for energy-efficient scheduling of periodic tasks on a group of processing devices
CN112306642A (en) Workflow scheduling method based on stable matching game theory
CN114035935B (en) High-throughput heterogeneous resource management method and device for multi-stage AI cloud service
Wang et al. On mapreduce scheduling in hadoop yarn on heterogeneous clusters
CN114466014B (en) Service scheduling method and device, electronic equipment and storage medium
Омельченко et al. Automation of resource management in information systems based on reactive vertical scaling
Nemirovsky et al. A deep learning mapper (DLM) for scheduling on heterogeneous systems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant