CN114035935A - High-throughput heterogeneous resource management method and device for multi-stage AI cloud service - Google Patents

High-throughput heterogeneous resource management method and device for multi-stage AI cloud service

Info

Publication number
CN114035935A
Authority
CN
China
Prior art keywords
cpu
phase
resource
service
service quality
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111193853.8A
Other languages
Chinese (zh)
Other versions
CN114035935B (en)
Inventor
陈全
过敏意
张蔚
符凯华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University
Priority to CN202111193853.8A
Publication of CN114035935A
Application granted
Publication of CN114035935B
Active legal status
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F 9/5055 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering software capabilities, i.e. software resources associated or available to the machine
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 Computing arrangements using knowledge-based models
    • G06N 5/01 Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Multi Processors (AREA)

Abstract

The invention provides a high-throughput heterogeneous resource management method and device for multi-phase AI cloud services. The method comprises the following steps: splitting a service quality target into a CPU-side service quality target and a GPU-side service quality target by using a service quality target distributor, based on the received LC service request; searching for the optimal resource allocation by using a heterogeneous resource manager, with the CPU-side and GPU-side service quality targets as initial samples; and monitoring the progress of the CPU phase in real time with a service quality compensator, accelerating the accelerator-side (GPU) execution of a user request when the time it spends in the CPU phase exceeds the CPU-side service quality target. The invention not only guarantees the service quality of the LC service but also greatly improves the overall performance of all BE applications on heterogeneous devices.

Description

High-throughput heterogeneous resource management method and device for multi-stage AI cloud service
Technical Field
The invention relates to the technical field of GPUs (graphics processing units), in particular to a high-throughput heterogeneous resource management method and device for multi-stage AI (Artificial Intelligence) cloud services.
Background
Modern data centers often host user-facing application services such as web search, social networking, and face recognition. Such applications attract users through low response times and high accuracy, and therefore all have stringent latency requirements. This type of application is known as an LC (latency-critical) application. Ensuring quality of service (QoS) for these LC applications is the focus of current data-center research.
With the rapid development of cloud computing platforms and deep learning, Deep Neural Networks (DNNs) have recently achieved human-level accuracy in various application scenarios, such as image recognition and speech recognition. Accordingly, deep neural networks are used to support LC applications in many data centers. New hardware accelerators such as GPUs have also been adopted by cloud service providers and widely deployed in computer clusters to meet the high computing-power requirements of emerging deep learning tasks. Compared with traditional online services, deep-learning-based online services not only have strict quality-of-service requirements but are also computationally demanding and use various heterogeneous resources.
A DNN-based LC application mainly has two phases: data preprocessing and online inference. A heterogeneous accelerator (e.g., a GPU) is typically used for the inference phase, while the host CPU is used for the data preprocessing phase (including decoding, data resizing, and so on). The interaction between the host and the accelerator (memcpy) is carried over the PCI-e bus. Data centers have some unavoidable problems, an important one being the over-allocation of resources. According to previous studies, these LC services often see a day-night pattern of user access: at certain times of day user requests are high and concentrated, while at night the request load drops rapidly. This pattern leaves CPU/GPU resources underutilized for large stretches outside peak hours, wasting resources.
To improve the utilization of heterogeneous resources in a data center, a common practice is to run an LC service with a QoS requirement alongside best-effort (BE) applications that have no QoS requirement. Accelerator manufacturers now also produce multi-tasking accelerators that support space-division sharing in order to achieve higher throughput on the same accelerator. For example, MPS (Multi-Process Service) is a binary-compatible implementation of the CUDA Application Programming Interface (API). With the NVIDIA Volta and Turing architectures, Volta MPS allows different applications to execute simultaneously on the same GPU, each with a certain percentage of resources, thereby increasing overall GPU utilization.
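For illustration, a minimal Python sketch of this space-division sharing follows. It relies on the documented Volta MPS environment variable CUDA_MPS_ACTIVE_THREAD_PERCENTAGE; the client binaries (./lc_service, ./be_job) are hypothetical placeholders, and an MPS control daemon is assumed to be running on the machine.

import os
import subprocess

def launch_with_sm_share(cmd, percent):
    # Restrict this CUDA client to roughly `percent` of the GPU's SMs
    # via the Volta MPS active-thread-percentage knob. The variable must
    # be set before the client creates its CUDA context.
    env = dict(os.environ, CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=str(percent))
    return subprocess.Popen(cmd, env=env)

# e.g., give the LC service 60% of the SMs and a co-located BE job 40%
lc = launch_with_sm_share(["./lc_service"], 60)
be = launch_with_sm_share(["./be_job"], 40)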
However, a mixed deployment of multiple applications may introduce performance penalties for the LC service and increase the end-to-end latency of user requests, risking a quality-of-service violation. This is primarily because the co-located applications compete for shared heterogeneous resources. In addition, these new multi-phase AI cloud services have both a host phase (CPU) and an accelerator phase (GPU). Designing a suitable scheduling strategy for such emerging heterogeneous hybrid deployment scenarios is a new challenge. The key problem addressed by the invention is to guarantee the quality of service of LC applications while making overall resource use more efficient and economical.
Prior QoS-guarantee techniques for CPU hybrid deployment fall into two categories: methods based on performance analysis and methods based on feedback regulation. Performance-analysis methods, such as Bubble-Up, profile user-facing services and batch applications to predict the performance degradation caused by shared-cache and memory-bandwidth contention, and determine a "safe" scheduling mode that will not cause QoS violations. Feedback-regulation methods, such as Heracles, build decision trees to determine the resource allocation of the next time period based on the QoS feedback of the user-facing service in the current period, thereby periodically adjusting the allocation of shared resources.
Existing multi-task scheduling algorithms on the CPU are not suitable for the new multi-phase AI cloud services, because they cannot perceive the space-division sharing characteristics of accelerators such as GPUs; the high parallelism of the GPU therefore cannot be effectively utilized and hardware resources cannot be fully exploited. At the same time, they ignore the complex interference between different shared resources on the GPU.
Prior QoS-guarantee techniques for GPU hybrid deployment also fall into two categories: time-division sharing and space-division sharing of the accelerator. On time-sharing accelerators, queue-based approaches (e.g., GrandSLAm and Baymax) predict the duration of each kernel and reorder the GPU kernels accordingly. On spatial multitasking accelerators, analysis-based approaches (e.g., Laius) divide the computational resources between LC services and BE applications.
Previous studies were limited to scheduling mixed deployments on the GPU accelerator and are not versatile enough to manage these new application scenarios. An LC service has both host and accelerator phases, and managing the GPU and CPU phases separately can cause severe QoS violations. If the host-phase latency prediction is inaccurate or the feedback mechanism is not timely enough, the LC service will suffer a QoS violation whose delay cannot be made up.
Disclosure of Invention
In view of the above drawbacks of the prior art, an object of the present invention is to provide a high-throughput heterogeneous resource management method and device for multi-stage AI cloud services, which meet the QoS targets of LC services while maximizing the performance of all BE applications on the CPU and GPU.
To achieve the above and other related objects, the present invention provides a high-throughput heterogeneous resource management method for multi-phase AI cloud services, including: splitting a service quality target into a CPU-side service quality target and a GPU-side service quality target by using a service quality target distributor, based on the received LC service request; searching for the optimal resource allocation by using a heterogeneous resource manager, with the CPU-side and GPU-side service quality targets as initial samples; and monitoring the progress of the CPU phase in real time with a service quality compensator, accelerating the accelerator-side (GPU) execution of a user request when the time it spends in the CPU phase exceeds the CPU-side service quality target.
In an embodiment of the present invention, the splitting of the service quality target into a CPU-side service quality target and a GPU-side service quality target based on the received LC service request includes: setting each resource quota of the LC task to its minimum resource unit while allocating the remaining resources to the BE task; adjusting the CPU-GPU-stage service quality split according to the performance surfaces of the shared resources; recording the service quality increase of the LC task and the performance degradation of the BE task; and selecting the optimal resource, adjusting it from the BE task to the LC task, and executing the next cycle, thereby splitting the service quality target into a CPU-side service quality target and a GPU-side service quality target.
In an embodiment of the present invention, the heterogeneous resource manager searches for an optimal resource allocation based on a bayesian optimization algorithm of a random forest.
In an embodiment of the present invention, the initial sample is selected using any one of the following strategies: an equal-priority strategy in which all CPU-phase tasks are allocated the same computing resources; an initial resource-allocation strategy in which the initial point of resource allocation is obtained from the service quality target distributor; and a service quality guarantee strategy in which the minimum resource quota is allocated to BE jobs and the remainder is reserved for the LC task.
In an embodiment of the present invention, a scoring function is configured for the heterogeneous resource manager to guide it to search in the correct direction within the configuration space.
In one embodiment of the invention, a piecewise objective function based on the service quality of the LC task and the comprehensive throughput of the BE tasks is constructed; the first goal of the piecewise objective function is to meet the quality-of-service targets on the CPU and GPU, and the second goal is to maximize the overall system throughput of the BE tasks.
In an embodiment of the present invention, the method further includes configuring an optimization constraint condition for searching for optimal resource allocation; the optimization constraints include: the maximum quota of each task does not exceed the total amount of resources; for each resource, the sum of the quotas for all tasks cannot be greater than the total.
In an embodiment of the present invention, the monitoring of the progress of the CPU phase in real time includes: calculating the extra time spent executing the CPU phase, and if this extra time is no greater than the reduction in GPU-phase execution time under the new resource quota, determining that the user request can still meet its service quality target.
In an embodiment of the present invention, the service quality compensator determines a new resource quota for the request of the LC service; when the new resource quota is determined, the computing-resource quota allocated to the BE task is updated at the same time. If the CPU-phase progress of the request subsequently satisfies the service quality under the new quota, the resource quota allocated to the request is rolled back to the original quota.
Embodiments of the present invention also provide an electronic device, including a CPU and a GPU, that applies the high-throughput heterogeneous resource management method for multi-phase AI cloud services as described above.
As described above, the high-throughput heterogeneous resource management method and device for multi-phase AI cloud services according to the present invention have the following beneficial effects:
1. The invention guarantees the service quality of the LC service and greatly improves the overall performance of all BE applications on heterogeneous devices, on real data-center machines and without modifying any hardware.
2. The results of the invention can support the practical deployment of scheduling technology for the emerging heterogeneous hybrid deployment problem in data centers. They also have commercial significance: they can provide high-throughput dynamic task scheduling for multi-stage AI cloud services and guarantee cloud service quality while maximizing the utilization of heterogeneous data-center equipment.
Drawings
Fig. 1 is a flowchart illustrating a high throughput heterogeneous resource management method for multi-phase AI cloud service according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention.
The embodiment of the invention aims to provide a high-throughput heterogeneous resource management method and device for multi-stage AI cloud services, which meet the QoS (quality of service) targets of LC (latency-critical) services and maximize the performance of all BE (best-effort) applications on the CPU (central processing unit) and GPU (graphics processing unit).
The embodiment comprises a BE-aware QoS target distributor, a unified heterogeneous resource manager and an accelerator-side QoS compensator. The main contributions and innovation points of the present embodiment are: a unified heterogeneous resource management method for the multi-stage AI cloud service of the data center is designed, so that the service quality of the LC service is guaranteed, and the resource utilization of the whole CPU and GPU is more efficient and economical.
In this embodiment, a high-throughput heterogeneous resource management method for multi-phase AI cloud service is designed and implemented, and the method has two goals in heterogeneous operation: the QoS target of LC service is satisfied, and the performance of all BE applications on CPU and GPU is improved to the maximum extent.
The embodiment provides a BE-aware QoS target distributor, a unified heterogeneous resource manager and an accelerator-side QoS compensator, which are used for realizing a high-throughput heterogeneous resource management method for multi-stage AI cloud service. The embodiment solves a plurality of challenges of resource management aiming at the heterogeneous mixed deployment, ensures the service quality guarantee during the mixed operation of the LC service, and simultaneously improves the utilization efficiency of the whole resources on the CPU and the GPU.
The principle and implementation of the high-throughput heterogeneous resource management method and device for multi-phase AI cloud services according to the present embodiment are described in detail below, so that those skilled in the art can understand them without creative work.
As shown in fig. 1, the present embodiment provides a high-throughput heterogeneous resource management method for a multi-phase AI cloud service, where the high-throughput heterogeneous resource management method for the multi-phase AI cloud service includes:
step S100, a service quality target distributor is utilized to divide the service quality target into a CPU side service quality target and a GPU side service quality target based on the received LC service request;
s200, searching optimal resource allocation by using a heterogeneous resource manager and taking a CPU side service quality target and a GPU side service quality target as initial samples;
step S300, a service quality compensator is used for monitoring the progress of the CPU stage in real time, and the time spent in the CPU stage by a user request exceeds the service quality target of the CPU, so that the execution of the CPU at the accelerator end is accelerated.
The embodiment provides a BE-aware QoS target distributor, a unified heterogeneous resource manager, and an accelerator-side QoS compensator to realize the high-throughput heterogeneous resource management method for multi-stage AI cloud services. The method treats the LC service and the BE applications differently. When a request for the LC service is submitted, the QoS target allocator heuristically splits its QoS target into a CPU-side QoS target and a GPU-side QoS target. These partitions are then passed as initial sample points to the unified resource manager to search for the best resource allocation. The resource manager determines an optimal resource allocation that maximizes the throughput of the BE jobs (considering economic efficiency) while guaranteeing the QoS of the LC service. In addition, the QoS compensator monitors the progress of the CPU phase in real time, speeding up a request's accelerator-side execution if the time it spends in the CPU phase exceeds its CPU-side QoS target.
The following describes the steps S100 to S300 of the present embodiment in detail.
Step S100, a service quality target distributor is used for dividing the service quality target into a CPU side service quality target and a GPU side service quality target based on the received LC service request.
In this embodiment, the splitting of the service quality target into a CPU-side service quality target and a GPU-side service quality target based on the received LC service request includes: setting each resource quota of the LC task to its minimum resource unit while allocating the remaining resources to the BE task; adjusting the CPU-GPU-stage service quality split according to the performance surfaces of the shared resources; recording the service quality increase of the LC task and the performance degradation of the BE task; and selecting the optimal resource, adjusting it from the BE task to the LC task, and executing the next cycle, thereby splitting the service quality target into a CPU-side service quality target and a GPU-side service quality target.
The goal of step S100 is to solve the QoS-partitioning problem accurately within a limited sampling budget. In the preliminary evaluation of this embodiment, increasing the resource quota of a BE task effectively improves its performance. However, different BE tasks have different sensitivities to the shared resources (CPU cores, LLC, and memory bandwidth on the CPU, and SMs on the GPU). To make maximum use of the computational resources, the shared-resource usage of the LC task should impact the performance of the BE tasks as little as possible.
Therefore, this embodiment designs a heuristic search algorithm to perform the QoS partitioning for the LC service. Specifically, the quota of each resource of the LC task (CPU cores, LLC, and memory bandwidth on the CPU, and SMs on the GPU) is initially set to its minimum resource unit, while the remaining resources are allocated to the BE task. To guarantee the QoS of the LC task, the shared-resource quota allocated to it must then be increased. In each iteration, the QoS target distributor adjusts the QoS split of the CPU-GPU stages according to the performance surfaces of the shared resources. For each candidate resource d in a search step, this embodiment records the QoS increment ΔQoS_d of the LC task and the performance degradation Δperf_d of the BE task. The following formula selects the optimal resource d*, which is then adjusted from the BE task to the LC task before the next loop executes.
d* = argmax_d (ΔQoS_d / Δperf_d)    (1)
In this way, a near-optimal result is obtained that not only satisfies the QoS of the LC task but also keeps the performance degradation of the BE tasks to a minimum. The final CPU-phase and GPU-phase times are scaled proportionally to the QoS target and used as the QoS split result, which is passed as an initial sample to the unified heterogeneous resource manager.
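A minimal Python sketch of this heuristic loop follows, under stated assumptions: lc_meets_qos and probe are hypothetical callbacks, where lc_meets_qos checks whether the LC task meets its end-to-end QoS under a given quota, and probe measures the ΔQoS_d and Δperf_d of moving one unit of resource d from the BE task to the LC task.

def split_qos_target(resources, lc_meets_qos, probe):
    # Start the LC task at the minimum unit of every shared resource
    # (CPU cores, LLC, memory bandwidth, SMs); the BE task gets the rest.
    # Assumes the QoS target is attainable with the available resources.
    lc_quota = {d: 1 for d in resources}
    while not lc_meets_qos(lc_quota):
        ratios = {}
        for d in resources:
            d_qos, d_perf = probe(lc_quota, d)       # ΔQoS_d and Δperf_d
            ratios[d] = d_qos / max(d_perf, 1e-9)    # QoS gain per unit BE loss
        best = max(ratios, key=ratios.get)           # d* per equation (1)
        lc_quota[best] += 1                          # move d* from BE to LC
    return lc_quota                                  # basis of the CPU/GPU QoS split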
Step S200, a heterogeneous resource manager is utilized to search for optimal resource allocation by taking a CPU side service quality target and a GPU side service quality target as initial samples.
Once the QoS target allocator obtains the QoS target partition, the present embodiment needs to manage the shared resources on each device (CPU/GPU). Given the large sample space of resource configurations in heterogeneous hybrid deployment scenarios, the resource manager must quickly identify the optimal resource allocation with minimal sampling time.
In this embodiment, the heterogeneous resource manager searches for the optimal resource allocation based on a bayesian optimization algorithm of a random forest.
The invention implements this resource-allocation management with a random-forest-based Bayesian optimization algorithm (SMAC); however, the SMAC algorithm is not directly applicable to heterogeneous hybrid deployment. First, SMAC selects its initial sample points randomly. Although this works well for some simple services, in the heterogeneous case it easily produces poor configurations, leading to frequent QoS violations during sampling. Second, the objective function in conventional SMAC returns only a single value to maximize (e.g., system throughput or execution time), whereas the present embodiment has multiple optimization objectives (the QoS of the LC service and the performance of the BE tasks).
In this embodiment, the initial sample is selected using any one of the following strategies: an equal-priority strategy in which all CPU-phase tasks are allocated the same computing resources; an initial resource-allocation strategy in which the initial point of resource allocation is obtained from the service quality target distributor; and a service quality guarantee strategy in which the minimum resource quota is allocated to BE jobs and the remainder is reserved for the LC task.
This embodiment makes two adaptive corrections to the SMAC algorithm. The initial sampling points are carefully selected according to different strategies: 1) an equal-priority strategy (all CPU-phase tasks are allocated equal computational resources); 2) the initial resource-allocation point obtained from the QoS target allocator; 3) a QoS-guarantee strategy (the minimum resource quota is allocated to BE jobs and the remainder is left to the LC task). Using these three resource configurations as initial points better exposes promising resource quotas and speeds up the sampling process; a sketch follows below. As for determining the next sample point, this embodiment carefully designs the objective function so that SMAC optimization can be applied to heterogeneous co-location.
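The following sketch illustrates the three initial points under stated assumptions: totals maps each resource to its total amount, the LC task is index 0 among n co-located tasks, and allocator_point is the configuration produced by the QoS target allocator (all names are illustrative, not from the original disclosure).

def initial_points(totals, n, allocator_point, min_unit=1):
    # 1) equal priority: every task gets the same share of each resource
    equal = {r: [t // n] * n for r, t in totals.items()}
    # 3) QoS guarantee: BE jobs get the minimum quota, the LC task the rest
    qos_guard = {r: [t - min_unit * (n - 1)] + [min_unit] * (n - 1)
                 for r, t in totals.items()}
    # 2) the starting point supplied by the QoS target allocator
    return [equal, allocator_point, qos_guard]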
In addition, in this embodiment, a scoring function is configured for the heterogeneous resource manager to guide it to search in the correct direction in the configuration space.
That is, the present embodiment designs a scoring function for the resource manager:

Score(R) = (1/2) · min_{dev∈{CPU,GPU}} (QoS_target / QoS_eval),  if the LC task violates QoS on any device
Score(R) = 1/2 + (1/2) · (α·(perf_r/perf_s)_CPU + β·(perf_r/perf_s)_GPU) / (α + β),  otherwise    (2)

The score of the scoring function is passed to the objective function (i.e., the score is assigned at the end of each cycle in which the system runs under a given resource configuration). This scoring function directs the resource manager to search in the right direction within the large configuration space.
Furthermore, in this embodiment, a piecewise objective function based on the quality of service of the LC task and the integrated throughput of the BE task is constructed; the first objective of the piecewise objective function is to meet quality of service objectives on the CPU and GPU, and the second objective is to maximize the overall system throughput of the BE task.
That is, the embodiment constructs a piecewise objective function that considers the QoS of the LC task and the comprehensive throughput of the BE tasks (for economic benefit). The function value lies between 0 (worst case: no LC task satisfies its QoS) and 1 (ideal case: all LC tasks satisfy their QoS and BE-task performance matches their solo runs). The first goal is to meet the QoS targets on the CPU and GPU. In the above equation, QoS_target is the QoS target of the LC task and QoS_eval is the latency of the LC task under the current resource configuration. As long as the LC task has a QoS violation on any device, the score stays below 0.5 regardless of BE-task performance; only when the score is greater than 0.5 is the second objective considered.
The second objective in the above equation is to maximize the overall system throughput of the BE tasks, where perf_r is the throughput of a BE job during a sample and perf_s is the throughput of the same BE task running alone. Considering the large price difference between renting CPUs and renting GPUs, this embodiment performs a weighted summation of the CPU and GPU throughputs, where α and β are weights related to the CPU and GPU lease prices, respectively.
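A minimal sketch of the piecewise score as reconstructed in equation (2); qos_target/qos_eval and perf_r/perf_s are per-device measurements assumed to be supplied by a monitoring layer, and alpha/beta are the CPU/GPU lease-price weights.

def score(qos_target, qos_eval, perf_r, perf_s, alpha, beta):
    # First goal: meet QoS on both CPU and GPU. A violation on either
    # device caps the score below 0.5 regardless of BE performance.
    worst = min(qos_target[d] / qos_eval[d] for d in ("cpu", "gpu"))
    if worst < 1.0:
        return 0.5 * worst
    # Second goal: maximize price-weighted BE throughput, normalized
    # against each BE task running alone (1.0 means no degradation).
    weights = {"cpu": alpha, "gpu": beta}
    be = sum(weights[d] * perf_r[d] / perf_s[d] for d in ("cpu", "gpu"))
    return 0.5 + 0.5 * min(be / (alpha + beta), 1.0)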
In this embodiment, the method further includes configuring an optimization constraint condition for searching for optimal resource allocation; the optimization constraints include: the maximum quota of each task does not exceed the total amount of resources; for each resource, the sum of the quotas for all tasks cannot be greater than the total.
To speed up the search process of the SMAC optimization, the invention constrains the search space with a pruning strategy based on the following optimization problem:

Object = MAX(a(Score(R)))
s.t.  R_ij ≤ R_j,              for all i ∈ {1,…,n}, j ∈ {1,…,m}
      Σ_{i=1}^{n} R_ij ≤ R_j,  for all j ∈ {1,…,m}    (3)

This removes most "unneeded" resource allocations. Assuming n tasks share m resources, the search process is as shown in equation (3). R is a matrix with n rows and m columns, where R_ij represents the share of the j-th resource owned by the i-th task, and R_j represents the total amount of resource j. The optimization problem contains two constraints: first, the maximum quota of each task does not exceed the total amount of the resource; second, for each resource, the sum of the quotas of all tasks cannot exceed the total. Meanwhile, to find the globally optimal resource partition, this embodiment computes the final score of each resource partition using the acquisition function a(·).
Step S300, a service quality compensator is used to monitor the progress of the CPU phase in real time; when the time a user request spends in the CPU phase exceeds the CPU-side service quality target, the execution of the request at the accelerator end (its GPU phase) is accelerated.
In this embodiment, the monitoring of the progress of the CPU phase in real time includes: calculating the extra time spent executing the CPU phase, and if this extra time is no greater than the reduction in GPU-phase execution time under the new resource quota, determining that the request of the LC service can still meet its service quality target.
Specifically, in this embodiment, the service quality compensator determines a new resource quota for the request of the LC service; when the new quota is determined, the computing-resource quota allocated to the BE task is updated at the same time. If the CPU-phase progress of the request subsequently satisfies the service quality under the new quota, the resource quota allocated to the request is rolled back to the original quota.
During the execution of an LC request (denoted by Q), the invention provides an accelerator-side QoS compensator to monitor the CPU-phase execution progress of Q in real time. If the CPU phase of Q runs slower than expected (e.g., the workload suddenly spikes, or there is contention that cannot be managed explicitly), the compensator accelerates the GPU phase of Q by allocating more accelerator-side computing resources to it. The difficulty of this step is to quickly determine the new GPU-side computing-resource quota without severely reducing the throughput of the BE applications on the GPU.
Specifically, the compensator periodically checks whether the CPU phase is running slower than expected. T_cpu and T'_gpu denote, respectively, the actual CPU-phase execution time of the LC request under the current resource quota and the GPU-phase execution time of the LC request under the newly determined GPU computing-resource quota.
T_save = T_cpu − QoS_cpu    (4)

Equation (4) computes the extra time spent executing the CPU phase, where QoS_cpu is the CPU-side QoS target. If T_save is no greater than the reduction in GPU-phase execution time under the new resource quota, the LC request can still meet its QoS target.

T'_gpu ≤ QoS_gpu − T_save    (5)

According to equation (5), the compensator confirms a new "just enough" GPU-side computing-resource quota for the LC request, where QoS_gpu is the GPU-side QoS target. In the formula, T'_gpu can be obtained from a performance model, QoS_cpu and QoS_gpu are obtained from the BE-aware QoS target allocator, and T_cpu is measured directly at run time. Once the new resource quota is determined, the computing-resource quota allocated to the BE job is updated at the same time. If the CPU-phase progress of the LC request satisfies the QoS under the new quota, the resource quota allocated to the request is rolled back to its original quota. In this way, the present embodiment ensures that the LC request satisfies its QoS while minimizing the resources it uses.
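A minimal sketch of this compensation step under stated assumptions: gpu_time_model is a hypothetical callback standing in for the performance model that predicts T'_gpu for a candidate GPU quota; qos_cpu and qos_gpu are the split targets from the BE-aware allocator, and t_cpu is the measured CPU-phase time.

def compensate(t_cpu, qos_cpu, qos_gpu, gpu_time_model, quotas):
    t_save = t_cpu - qos_cpu          # equation (4): CPU-phase overrun
    if t_save <= 0:
        return None                   # CPU phase on schedule; keep current quota
    # equation (5): smallest "just enough" quota with T'_gpu <= QoS_gpu - T_save
    for q in sorted(quotas):
        if gpu_time_model(q) <= qos_gpu - t_save:
            return q
    return max(quotas)                # no quota suffices; give all available SMs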
To enable those skilled in the art to further understand the method for managing high-throughput heterogeneous resources for multi-phase AI cloud service according to the present embodiment, an implementation process of the method for managing high-throughput heterogeneous resources for multi-phase AI cloud service according to the present embodiment is described below:
specifically, the invention manages and allocates resources in a heterogeneous hybrid deployment scenario by the following steps, Q representing an LC request:
1. the QoS target allocator builds a performance model for Q based on the characteristics of BE jobs currently running on the CPU and GPU. Based on the model, the invention divides the QoS target of Q into a CPU stage and a GPU stage by a heuristic method. And then, the QoS target division result is used as an initial sample point to be transmitted to a uniform heterogeneous resource manager. This step can significantly affect the number of subsequent attempts to determine the optimal resource allocation.
2. Once the QoS target of Q is split, the resource manager allocates various resource configurations (CPU core, memory bandwidth, LLC, SM) for Q and co-located BE applications on the CPU and GPU sides according to the optimized SMAC algorithm. In performing the allocation, the present embodiment maximizes the overall throughput of the BE job (considering economic benefits) while mitigating QoS violations due to resource contention. A challenging part is to minimize the time required to determine the optimal allocation to accommodate the dynamic load.
3. The accelerator-side QoS compensator will monitor the CPU stage progress of Q in real time. If the CPU phase of Q is running slower than expected, the compensator will speed up the accelerator phase of Q by allocating more GPU computing resources. The difficulty here is to quickly determine the new computation resource quota of Q on the GPU without severely reducing the throughput of BE applications on the GPU.
The embodiment of the invention also provides an electronic device including a CPU and a GPU, which applies the high-throughput heterogeneous resource management method for multi-phase AI cloud services described above. The method has been described in detail above and is not repeated here.
In conclusion, the invention can ensure the service quality of LC service and greatly improve the comprehensive performance of all BE applications on heterogeneous equipment on the premise of a real data center machine and no need of modifying hardware equipment; the achievement of the invention can provide support for landing of scheduling technology for the emerging heterogeneous mixed deployment problem of the data center. Meanwhile, the achievement of the invention has commercial significance, can provide high-throughput dynamic task scheduling service for multi-stage AI cloud service, and ensures the cloud service quality under the condition of maximizing the utilization rate of data center heterogeneous equipment. Therefore, the invention effectively overcomes various defects in the prior art and has high industrial utilization value.
The foregoing embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Any person skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes made by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed herein shall still be covered by the claims of the present invention.

Claims (10)

1. A high-throughput heterogeneous resource management method for multi-phase AI cloud service is characterized in that:
splitting a service quality target into a CPU side service quality target and a GPU side service quality target by using a service quality target distributor based on the received LC service request;
searching optimal resource allocation by using a heterogeneous resource manager and taking a CPU side service quality target and a GPU side service quality target as initial samples;
the progress of the CPU phase is monitored in real time by a service quality compensator, and the execution of the GPU phase at the accelerator side is accelerated when the time spent in the CPU phase by a user request exceeds the CPU-side service quality target.
2. The method of high throughput heterogeneous resource management for multi-phase AI cloud services according to claim 1, wherein: the splitting of the quality of service objective into a CPU side quality of service objective and a GPU side quality of service objective based on the received request for LC service includes:
setting each resource quota of the LC task to its minimum resource unit while allocating the remaining resources to the BE task;
adjusting the CPU-GPU-stage service quality split according to the performance surfaces of the shared resources;
recording the service quality increase of the LC task and the performance degradation of the BE task;
and selecting the optimal resource, adjusting it from the BE task to the LC task, and executing the next cycle, thereby splitting the service quality target into a CPU-side service quality target and a GPU-side service quality target.
3. The method of high throughput heterogeneous resource management for multi-phase AI cloud services according to claim 1, wherein: the heterogeneous resource manager searches for optimal resource allocation based on a Bayesian optimization algorithm of a random forest.
4. The method of high-throughput heterogeneous resource management for multi-phase AI cloud services according to claim 3, wherein: the initial sample is selected using any one of the following strategies: an equal-priority strategy in which all CPU-phase tasks are allocated the same computing resources; an initial resource-allocation strategy in which the initial point of resource allocation is obtained from the service quality target distributor; and a service quality guarantee strategy in which the minimum resource quota is allocated to BE jobs and the remainder is reserved for the LC task.
5. The method of high-throughput heterogeneous resource management for multi-phase AI cloud services according to claim 1, characterized in that: the heterogeneous resource manager is configured with a scoring function to guide it to search in the correct direction in the configuration space.
6. The method of high-throughput heterogeneous resource management for multi-phase AI cloud services according to claim 1, characterized in that: a piecewise objective function based on the service quality of the LC task and the comprehensive throughput of the BE tasks is constructed; the first goal of the piecewise objective function is to meet the quality-of-service targets on the CPU and GPU, and the second goal is to maximize the overall system throughput of the BE tasks.
7. The method for high-throughput heterogeneous resource management for multi-phase AI cloud services according to claim 1 or 6, characterized in that: configuring an optimization constraint condition for searching optimal resource allocation; the optimization constraints include: the maximum quota of each task does not exceed the total amount of resources; for each resource, the sum of the quotas for all tasks cannot be greater than the total.
8. The method for high-throughput heterogeneous resource management for multi-phase AI cloud services according to claim 6 or 7, characterized in that: the real-time monitoring of the progress of the CPU phase comprises: calculating the extra time spent executing the CPU phase, and if this extra time is no greater than the reduction in GPU-phase execution time under the new resource quota, determining that the user request can still meet its service quality target.
9. The method of high-throughput heterogeneous resource management for multi-phase AI cloud services according to claim 1, characterized in that: the service quality compensator determines a new resource quota for the request of the LC service; when the new resource quota is determined, the computing-resource quota allocated to the BE task is updated at the same time; if the CPU-phase progress of the request subsequently satisfies the service quality under the new quota, the resource quota allocated to the request is rolled back to the original quota.
10. An electronic device, characterized by: comprising a CPU and a GPU, said electronic device applying the high throughput heterogeneous resource management method for multi-phase AI cloud services according to any one of claims 1 to 9.
CN202111193853.8A 2021-10-13 2021-10-13 High-throughput heterogeneous resource management method and device for multi-stage AI cloud service Active CN114035935B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111193853.8A CN114035935B (en) 2021-10-13 2021-10-13 High-throughput heterogeneous resource management method and device for multi-stage AI cloud service

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111193853.8A CN114035935B (en) 2021-10-13 2021-10-13 High-throughput heterogeneous resource management method and device for multi-stage AI cloud service

Publications (2)

Publication Number Publication Date
CN114035935A (en) 2022-02-11
CN114035935B CN114035935B (en) 2024-07-19

Family

ID=80141258

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111193853.8A Active CN114035935B (en) 2021-10-13 2021-10-13 High-throughput heterogeneous resource management method and device for multi-stage AI cloud service

Country Status (1)

Country Link
CN (1) CN114035935B (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200225655A1 (en) * 2016-05-09 2020-07-16 Strong Force Iot Portfolio 2016, Llc Methods, systems, kits and apparatuses for monitoring and managing industrial settings in an industrial internet of things data collection environment
CN107357661A (en) * 2017-07-12 2017-11-17 北京航空航天大学 A kind of fine granularity GPU resource management method for mixed load
US20190188276A1 (en) * 2017-12-20 2019-06-20 International Business Machines Corporation Facilitation of domain and client-specific application program interface recommendations
CN108900432A (en) * 2018-07-05 2018-11-27 中山大学 A kind of perception of content method based on network Flow Behavior
CN112445605A (en) * 2019-08-30 2021-03-05 中兴通讯股份有限公司 Media data processing method and device and media server
CN111580934A (en) * 2020-05-13 2020-08-25 杭州电子科技大学 Resource allocation method for consistent performance of multi-tenant virtual machines in cloud computing environment
CN111597045A (en) * 2020-05-15 2020-08-28 上海交通大学 Shared resource management method, system and server system for managing mixed deployment
US20210011765A1 (en) * 2020-09-22 2021-01-14 Kshitij Arun Doshi Adaptive limited-duration edge resource management

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
WEI ZHANG, "CHARM: Collaborative Host and Accelerator Resource Management for GPU Datacenters", 2021 IEEE 39th International Conference on Computer Design (ICCD), 20 December 2021, pages 307-315 *
WENBO BAO, "High-quality and real-time frame interpolation on heterogeneous computing system", 2017 IEEE International Symposium on Broadband Multimedia Systems and Broadcasting (BMSB), 20 July 2017, pages 1-4 *
LI Peng, "Research on global active memory-access optimization for multi-core system-on-chip", High Technology Letters, vol. 29, no. 03, 15 March 2019, pages 203-212 *

Also Published As

Publication number Publication date
CN114035935B (en) 2024-07-19

Similar Documents

Publication Publication Date Title
US11989647B2 (en) Self-learning scheduler for application orchestration on shared compute cluster
Zhang et al. Model-Switching: Dealing with Fluctuating Workloads in Machine-Learning-as-a-Service Systems
CN111491006A (en) Load-aware cloud computing resource elastic distribution system and method
Urgaonkar et al. Dynamic resource allocation and power management in virtualized data centers
Yu et al. Gillis: Serving large neural networks in serverless functions with automatic model partitioning
CN110737529A (en) cluster scheduling adaptive configuration method for short-time multiple variable-size data jobs
US20080104605A1 (en) Methods and apparatus for dynamic placement of heterogeneous workloads
CN109947619B (en) Multi-resource management system and server for improving throughput based on service quality perception
Iyapparaja et al. Efficient Resource Allocation in Fog Computing Using QTCS Model.
Dublish et al. Poise: Balancing thread-level parallelism and memory system performance in GPUs using machine learning
CN116820784B (en) GPU real-time scheduling method and system for reasoning task QoS
CN114217974A (en) Resource management method and system in cloud computing environment
Zhang et al. CHARM: Collaborative host and accelerator resource management for gpu datacenters
CN115714820A (en) Distributed micro-service scheduling optimization method
CN115858110A (en) Multi-objective optimization strategy-based multi-level task scheduling method
CN115237586A (en) GPU resource configuration method for deep learning inference performance interference perception
Desprez et al. A bi-criteria algorithm for scheduling parallel task graphs on clusters
Ferikoglou et al. Iris: interference and resource aware predictive orchestration for ml inference serving
EP4300305A1 (en) Methods and systems for energy-efficient scheduling of periodic tasks on a group of processing devices
CN112306642A (en) Workflow scheduling method based on stable matching game theory
CN114035935B (en) High-throughput heterogeneous resource management method and device for multi-stage AI cloud service
Wang et al. On mapreduce scheduling in hadoop yarn on heterogeneous clusters
CN114466014B (en) Service scheduling method and device, electronic equipment and storage medium
Омельченко et al. Automation of resource management in information systems based on reactive vertical scaling
Nemirovsky et al. A deep learning mapper (DLM) for scheduling on heterogeneous systems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant