CN115237586A - GPU resource configuration method for deep learning inference performance interference perception
- Publication number: CN115237586A
- Application number: CN202210295359.0A
- Authority: CN (China)
- Prior art keywords: gpu, inference, load, dnn, delay
- Prior art date: 2022-03-24
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
- G06F9/505—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a GPU resource configuration method with deep learning inference performance interference awareness, comprising a deep neural network (DNN) inference performance prediction model and an interference-aware GPU resource configuration scheme built on that model. A DNN load is submitted for a pre-run to acquire its load parameters and the GPU hardware parameters; a DNN inference delay and throughput prediction model is designed from the acquired parameters; a mathematical optimization problem is established that minimizes the DNN inference cost while guaranteeing the DNN inference delay and throughput. The invention designs and implements iGniter, a simple and effective interference-aware GPU resource allocation strategy for deep learning inference, which solves the problem of predicting DNN inference performance on the GPU and minimizes the DNN inference cost while guaranteeing the DNN inference performance.
Description
Technical Field
The invention belongs to the technical field of GPU resource configuration, and particularly relates to an interference-aware GPU resource configuration method for deep learning inference performance, which can provide predictable inference performance and minimize the inference cost on GPU devices.
Background
As DNN models become increasingly complex, GPUs are typically used as accelerators to reduce inference latency and meet service level objectives (SLOs). To improve GPU resource utilization, NVIDIA recently developed the Multi-Process Service (MPS) technique, which can spatially limit the GPU resources occupied by multiple inference loads so that they share the GPU in space.

While MPS can limit the GPU resources of each inference load, there is significant performance interference between multiple inference loads running on the same GPU device. If GPU resources are configured for DNN inference loads without explicitly considering this performance interference, the user's SLO may become unsatisfiable and high inference costs may be incurred.
Disclosure of Invention
In order to solve the above problems, an object of the present invention is to provide an interference-aware GPU resource configuration method for deep learning inference performance, that is, a method that predicts DNN inference performance on GPU devices and minimizes the inference cost, and which can reduce the user's cost while guaranteeing the performance SLO of DNN inference loads.

The specific technical solution for realizing the object of the invention is as follows:

An interference-aware GPU resource configuration method for deep learning inference performance comprises the following steps:
Step 1: submit the DNN inference model and its delay and throughput targets to the GPU inference system; pre-run the DNN inference model without historical data and obtain from the pre-run results the following 8 load parameters: the input data size $d_i^{in}$ of the DNN inference model, the result data size $d_i^{out}$, the number of kernel functions $n_i$, the GPU L2 cache contention parameter $k_i^{l2}$, the scheduling delay $t_i^{sch}$ when running alone, the GPU active time $t_i^{act}$ when running alone, the GPU power $p_i$, and the GPU L2 cache utilization $c_i$; together with the following 7 hardware parameters: the GPU maximum power $P$, maximum frequency $F$, idle power $p_{idle}$, available PCIe bandwidth $B_{pcie}$, the relation parameter $\alpha_f$ between GPU power and frequency, and the increased-scheduling-delay relation parameters $\alpha_{sch}$ and $\beta_{sch}$;
Step 2: establish a performance prediction model for the DNN inference load from the 8 load parameters and 7 hardware parameters acquired in step 1, and predict the inference delay and throughput of the DNN load. The predicted inference delay model is as follows:

$$t_{ij}^{inf} = t_{ij}^{load} + t_{ij}^{gpu} + t_{ij}^{fd}$$

where $t_{ij}^{inf}$ is the predicted inference delay of inference load $i$ on GPU device $j$, $t_{ij}^{load}$ is the data loading delay, $t_{ij}^{gpu}$ is the GPU execution delay, and $t_{ij}^{fd}$ is the result feedback delay. Since the mainstream DNN inference server Triton overlaps the data loading phase with the GPU execution and result feedback phases, the predicted DNN throughput $h_{ij}$ is expressed as:

$$h_{ij} = \frac{b_i}{\max\left(t_{ij}^{load},\ t_{ij}^{gpu} + t_{ij}^{fd}\right)}$$

where $b_i$ is the batch size. The data loading and result feedback delays are

$$t_{ij}^{load} = \frac{b_i \cdot d_i^{in}}{B_{pcie}} \quad \text{and} \quad t_{ij}^{fd} = \frac{b_i \cdot d_i^{out}}{B_{pcie}}$$

where $d_i^{in}$ and $d_i^{out}$ respectively denote the inference input data size and the output result data size when $b_i$ is 1, and $B_{pcie}$ denotes the available bandwidth of the PCIe bus. The GPU execution delay $t_{ij}^{gpu}$ is expressed as:

$$t_{ij}^{gpu} = \left( n_i \cdot \left( t_i^{sch} + \Delta t_j^{sch} \right) + t_i^{act} \cdot \left( 1 + k_i^{l2} \cdot \sum_{i' \in I,\, i' \neq i} v_{i'j}\, c_{i'} \right) \right) \cdot \frac{F}{f_j}$$

where $t_i^{sch}$ is the GPU scheduling delay when inference load $i$ runs alone, $\Delta t_j^{sch}$ is the scheduling delay added at the GPU resource scheduler by performance interference, $n_i$ is the number of kernel functions of inference load $i$, $k_i^{l2}$ is the parameter describing how GPU L2 cache contention prolongs the GPU active time of inference load $i$, $t_i^{act}$ and $c_i$ respectively denote the GPU active time and the GPU L2 cache utilization when inference load $i$ runs alone, $v_{ij}$ indicates whether inference load $i$ runs on GPU device $j$, and $f_j$ and $F$ are respectively the actual GPU frequency and the maximum frequency on GPU device $j$. $\Delta t_j^{sch}$ is expressed as:

$$\Delta t_j^{sch} = \alpha_{sch} \cdot \sum_{i \in I} v_{ij} + \beta_{sch}$$

where $\alpha_{sch}$ and $\beta_{sch}$ respectively denote the relation parameters between the increased GPU scheduling delay and the number of co-located inference loads; $f_j$ is expressed as:

$$f_j = \min\left\{ F,\ F + \alpha_f \cdot \left( p_j^{gpu} - P \right) \right\}$$

where $\alpha_f$ is the parameter describing the relationship between the GPU power demand and the GPU operating frequency, and $p_j^{gpu}$ and $P$ are respectively the total GPU power demand and the maximum GPU power; $p_j^{gpu}$ is expressed as:

$$p_j^{gpu} = p_{idle} + \sum_{i \in I} v_{ij} \cdot p_i$$

where $p_{idle}$ and $p_i$ are respectively the GPU idle power and the GPU power demand of inference load $i$ when running alone.
Step 3: according to the inference model provided by the user in step 1, the throughput target $R_i$, and the delay service level objective $T_i^{SLO}$ (Service Level Objective, SLO), establish a mathematical optimization problem that minimizes the DNN inference cost; specifically:

$$\min_{r_{ij},\, b_i} \ C = \sum_{j \in \theta} u_j$$

$$\text{s.t.} \quad h_{ij} \geq R_i \ \ \forall i \in I; \qquad t_{ij}^{inf} \leq \frac{T_i^{SLO}}{2} \ \ \forall i \in I; \qquad \sum_{i \in I} v_{ij} \cdot r_{ij} \leq r_{max} \ \ \forall j \in \theta; \qquad \sum_{j \in \theta} v_{ij} = 1 \ \ \forall i \in I$$

where $C$ represents the DNN inference cost, $u_j$ represents the unit price of GPU device $j$, $r_{ij}$ is the GPU resources allocated to inference load $i$ on GPU device $j$, and $b_i$ is the batch size of inference load $i$; $I$ is the set of inference loads and $\theta$ is the set of allocated GPU devices. The variables of the model are $r_{ij}$ and $b_i$, the unknowns of the minimization problem. The first constraint ensures that the throughput of each inference load can meet its request arrival rate; the second constraint ensures that the inference delay of each inference load stays below its target delay $T_i^{SLO}$; the third constraint ensures that the GPU resources allocated on each GPU device do not exceed the maximum GPU resources $r_{max}$; the fourth constraint indicates that each inference load can be placed on only one GPU device.
Step 4: using the delay and throughput targets and the model obtained in step 2, compute the batch size and the GPU resource lower bound of each inference load, and output the GPU resource configuration scheme that minimizes the inference cost while guaranteeing the inference performance. Specifically: according to the target throughput and the delay constraint, solve for the lower bound $r_i^{lower}$ of the GPU resources required for running alone and the appropriate batch size $b_i^{lower}$; sort the inference loads in descending order of $r_i^{lower}$; each time, select among the already started candidate GPUs the GPU device with the least performance interference to host the inference load, and place the inference load on a newly started GPU device when the resources of the candidate GPU devices are insufficient. When allocating GPU resources, increase the GPU resources of the inference loads violating their delay, one GPU resource unit $r_{unit}$ at a time, until no inference load violates its delay. The method can thus output the GPU resource configuration scheme that minimizes the inference cost while guaranteeing the inference performance, namely the batch size $b_i$ and the GPU resources $r_{ij}$.
The invention solves the problems that, when multiple inference loads run on the same GPU, the inference performance is unpredictable and the GPU resource configuration lacks guidance, and it minimizes the DNN inference cost. Through mathematical modeling, the invention provides predictable performance for GPU-based DNN inference loads, yields a more scientific and reasonable GPU resource configuration, and reduces the user's cost while guaranteeing the performance SLO of DNN inference loads.
Drawings
FIG. 1 is a schematic diagram of the running process on the GPU of different inference requests (i.e., $i_1$, $i_2$, $i_3$) of the same DNN inference load;
FIG. 2 is a diagram of the interference-aware GPU resource configuration framework for deep learning inference performance (based on AWS EC2);
FIG. 3 is a flow chart of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings. The invention designs and implements iGniter, a cost-effective interference-aware GPU resource configuration framework for deep learning inference performance, to guarantee the performance of GPU-based DNN inference loads while minimizing the inference cost.
First, the scenario to which the present invention applies is described: it involves multiple continuously arriving DNN inference loads $I = \{i_1, i_2, \ldots, i_m\}$ and a series of allocated GPU devices $\theta = \{j_1, j_2, \ldots, j_g\}$.
As shown in FIG. 1, the execution of DNN inference on the GPU can be divided into three sequential steps: data loading, GPU execution, and result feedback. Thus, the DNN inference delay $t_{ij}^{inf}$ of one inference load $i$ running on one GPU device $j$ is the sum of the data loading delay $t_{ij}^{load}$, the GPU execution delay $t_{ij}^{gpu}$, and the result feedback delay $t_{ij}^{fd}$, which can be expressed as:

$$t_{ij}^{inf} = t_{ij}^{load} + t_{ij}^{gpu} + t_{ij}^{fd}$$
as shown in fig. 1, to improve GPU resource utilization, mainstream DNN inference servers (e.g., triton) overlap the data loading phase with the GPU execution and result feedback phase. Thus, the throughput h of the DNN inference ij Can be expressed as:
wherein, b i Representing the batch size selected at the inference load runtime.
Data loading and result feedback phases: since the inference input and output result data are transmitted between the CPU and the GPU device through PCIe, and these data sizes are linearly related to the batch size, the data loading delay $t_{ij}^{load}$ and the result feedback delay $t_{ij}^{fd}$ can be expressed as:

$$t_{ij}^{load} = \frac{b_i \cdot d_i^{in}}{B_{pcie}} \qquad t_{ij}^{fd} = \frac{b_i \cdot d_i^{out}}{B_{pcie}}$$

where $d_i^{in}$ and $d_i^{out}$ respectively denote the inference input and output result data sizes when $b_i$ is 1, and $B_{pcie}$ denotes the available PCIe bandwidth.
GPU execution phase: each inference load is allocated a certain amount of GPU resources $r_{ij} \in [0, r_{max}]$, and these resources are mapped onto a set of streaming multiprocessors (SMs), where $r_{max}$ is set to 1. As shown in FIG. 1, the GPU execution phase comprises GPU scheduling and running the kernel functions on the allocated SMs, and is directly affected by the allocated GPU resources $r_{ij}$. Furthermore, performance interference can reduce the GPU frequency when loads are co-located, which inevitably lengthens the GPU execution phase. Thus, the GPU execution delay $t_{ij}^{gpu}$ can be expressed as:

$$t_{ij}^{gpu} = \left( t_{ij}^{sch} + t_{ij}^{act} \right) \cdot \frac{F}{f_j}$$

where $t_{ij}^{sch}$ and $t_{ij}^{act}$ respectively denote the total scheduling delay and the GPU active time of inference load $i$ running on GPU device $j$ without frequency down-scaling, and $f_j$ and $F$ respectively denote the actual and the maximum GPU frequency on device $j$.
Next, the scheduling delay $t_{ij}^{sch}$ of the inference load is modeled. $t_{ij}^{sch}$ is roughly proportional to the number of kernel functions $n_i$, so it can be expressed as:

$$t_{ij}^{sch} = n_i \cdot \left( t_i^{sch} + \Delta t_j^{sch} \right)$$

where $n_i$ is the number of kernel functions of inference load $i$, $t_i^{sch}$ is the per-kernel scheduling delay when inference load $i$ runs alone, and $\Delta t_j^{sch}$ is the scheduling delay added by performance interference at the GPU resource scheduler, which is highly correlated with the number of inference loads co-located on GPU device $j$. It is therefore expressed as:

$$\Delta t_j^{sch} = \alpha_{sch} \cdot \sum_{i \in I} v_{ij} + \beta_{sch}$$

where $\alpha_{sch}$ and $\beta_{sch}$ respectively denote the relation parameters of the increased GPU scheduling delay, and $\sum_{i \in I} v_{ij}$ denotes the number of inference loads co-located on GPU device $j$. $v_{ij}$, indicating whether inference load $i$ runs on GPU device $j$, is given by:

$$v_{ij} = \begin{cases} 1, & \text{if inference load } i \text{ runs on GPU device } j \\ 0, & \text{otherwise} \end{cases}$$
next, the GPU active time of the inference load i running on the GPU device j is calculatedAnd (6) modeling. Because of contention by the speculative load on the shared GPU level two cache space on the GPU, this system level indicator of GPU level two cache utilization may be used to characterize the demand on the GPU level two cache space by the speculative load. For a fixed supply of level two cache space on a given GPU device, a higher GPU level two cache utilization (i.e., demand) indicates a more competitive on GPU level two cache space, resulting in a longer GPU active time. Therefore, the temperature of the molten metal is controlled,can be expressed as:
whereinTo reason about the load i parameters that cause the GPU active time to be extended due to second level cache contention.And c i Respectively representing the GPU active time and the secondary cache utilization rate when the inference load i operates independently.
Finally, the GPU frequency $f_j$ on GPU device $j$ is modeled. When the total GPU power demand of the co-located loads exceeds the GPU power upper limit of a given GPU type, the GPU frequency drops sharply. Since the GPU frequency is highly correlated with the GPU power, it can be expressed as:

$$f_j = \min\left\{ F,\ F + \alpha_f \cdot \left( p_j^{gpu} - P \right) \right\}$$

where $\alpha_f$ is the relation parameter describing GPU power and frequency for a given GPU type. Furthermore, the total power demand of GPU device $j$ is estimated by adding the power of the inference loads placed on it to the idle power $p_{idle}$, which is given by:

$$p_j^{gpu} = p_{idle} + \sum_{i \in I} v_{ij} \cdot p_i$$

where $p_i$ is the power of inference load $i$ running alone.
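To make the model concrete, the following minimal Python sketch mirrors the prediction equations above. It is an illustration only: the `LoadParams`/`HwParams` containers and the units are assumptions of this sketch, and the alone-run active time `t_act` passed in is taken to already correspond to the chosen batch size and GPU resources (its dependence on both is modeled further below).

```python
from dataclasses import dataclass

@dataclass
class LoadParams:       # the 8 load parameters of inference load i
    d_in: float         # input data size per request, MB (d_i^in)
    d_out: float        # output data size per request, MB (d_i^out)
    n_kernels: int      # number of kernel functions (n_i)
    k_l2: float         # L2-cache contention coefficient (k_i^l2)
    t_sch: float        # per-kernel scheduling delay alone, ms (t_i^sch)
    t_act: float        # GPU active time alone, ms (t_i^act)
    power: float        # GPU power demand alone, W (p_i)
    l2_util: float      # L2 cache utilization alone (c_i)

@dataclass
class HwParams:         # the 7 hardware parameters of the GPU type
    P: float            # maximum GPU power, W
    F: float            # maximum GPU frequency, MHz
    p_idle: float       # idle power, W
    B_pcie: float       # available PCIe bandwidth, MB/ms
    alpha_f: float      # power-frequency relation parameter
    alpha_sch: float    # increased-scheduling-delay slope
    beta_sch: float     # increased-scheduling-delay intercept

def predict(load: LoadParams, others: list, b: int, hw: HwParams):
    """Predicted inference delay (ms) and throughput (requests/ms) of `load`
    co-located with the loads in `others` on one GPU device."""
    t_load = b * load.d_in / hw.B_pcie                   # data loading delay
    t_fd = b * load.d_out / hw.B_pcie                    # result feedback delay
    n_coloc = 1 + len(others)                            # loads on this device
    dt_sch = hw.alpha_sch * n_coloc + hw.beta_sch        # scheduler interference
    t_sch = load.n_kernels * (load.t_sch + dt_sch)       # total scheduling delay
    t_act = load.t_act * (1 + load.k_l2 * sum(o.l2_util for o in others))
    p_total = hw.p_idle + load.power + sum(o.power for o in others)
    f = min(hw.F, hw.F + hw.alpha_f * (p_total - hw.P))  # frequency down-scaling
    t_gpu = (t_sch + t_act) * hw.F / f                   # GPU execution delay
    t_inf = t_load + t_gpu + t_fd                        # end-to-end delay
    h = b / max(t_load, t_gpu + t_fd)                    # overlapped pipeline
    return t_inf, h
```

Note that the throughput divides the batch size by the slower of the two overlapped pipeline stages, which is what the Triton-style overlap described above implies.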
Load parameter acquisition: based on the above, there are 8 load parameters (i.e., $d_i^{in}$, $d_i^{out}$, $n_i$, $k_i^{l2}$, $t_i^{sch}$, $t_i^{act}$, $p_i$, $c_i$) and 7 hardware parameters (i.e., $P$, $F$, $p_{idle}$, $B_{pcie}$, $\alpha_f$, $\alpha_{sch}$, $\beta_{sch}$) in the performance model. Specifically, four load parameters (i.e., $d_i^{in}$, $d_i^{out}$, $n_i$, $t_i^{sch}$) can be obtained from a single pre-run with Nsight Systems. The cache contention parameter $k_i^{l2}$ can be obtained by activating several (i.e., 2) inference loads simultaneously. For a given GPU type, 3 hardware parameters (i.e., $P$, $F$, $p_{idle}$) can be obtained using nvidia-smi. The available PCIe bandwidth $B_{pcie}$ can be obtained by transferring data from CPU memory to GPU memory. The GPU frequency parameter $\alpha_f$ and the scheduling parameters $\alpha_{sch}$ and $\beta_{sch}$ can be obtained by simultaneously starting 1 to 5 inference loads on the given GPU type. Furthermore, the GPU active time $t_i^{act}$, power $p_i$, and L2 cache utilization $c_i$ are obtained by running inference load $i$ alone on the given GPU type, where $t_i^{act}$, $p_i$, and $c_i$ are affected by the batch size $b_i$ and the allocated GPU resources $r_{ij}$.
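As a sketch of this acquisition step (assuming a host with PyTorch and the NVIDIA driver installed; the nvidia-smi query fields used are standard, but the idle-power reading is only meaningful when sampled while the GPU is idle):

```python
import subprocess
import time
import torch

def query_gpu_hw(device: int = 0) -> dict:
    """Read P (power limit, W), F (max SM clock, MHz) and an idle-power sample
    via nvidia-smi, matching the P, F, p_idle hardware parameters."""
    out = subprocess.check_output(
        ["nvidia-smi", f"--id={device}",
         "--query-gpu=power.limit,clocks.max.sm,power.draw",
         "--format=csv,noheader,nounits"], text=True)
    p_max, f_max, p_draw = (float(x) for x in out.strip().split(", "))
    return {"P": p_max, "F": f_max, "p_idle": p_draw}

def measure_pcie_bandwidth(size_mb: int = 256, reps: int = 10) -> float:
    """Estimate B_pcie (MB/ms) by timing pinned host-to-device copies."""
    src = torch.empty(size_mb * 2**20, dtype=torch.uint8).pin_memory()
    dst = torch.empty_like(src, device="cuda")
    dst.copy_(src)                         # warm-up transfer
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(reps):
        dst.copy_(src)
    torch.cuda.synchronize()
    return size_mb * reps / ((time.perf_counter() - t0) * 1e3)
```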
First, the GPU active time $t_i^{act}$ is modeled as a function of the allocated GPU resources $r_{ij}$ and the batch size $b_i$. Since the GPU active time is roughly inversely proportional to the allocated GPU resources and increases with the batch size, it can be represented by a quadratic function of the batch size. Therefore, the GPU active time $t_i^{act}$ can be expressed as:

$$t_i^{act} = \frac{\alpha_i^{act} \cdot b_i^2 + \beta_i^{act} \cdot b_i + \gamma_i^{act}}{r_{ij}}$$

where $\alpha_i^{act}$, $\beta_i^{act}$, and $\gamma_i^{act}$ are the active time parameters of inference load $i$. These parameters are obtained by fitting the data of several (i.e., 9) combinations of different batch sizes and allocated GPU resources using the least squares method.
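A least-squares fit of these active time parameters follows, assuming the quadratic-over-resources form written above; the sample triples are placeholders, not measured values.

```python
import numpy as np

def fit_active_time(samples):
    """Fit (alpha, beta, gamma) in t_act = (alpha*b^2 + beta*b + gamma) / r
    from profiled (batch b, GPU resources r, active time t) triples."""
    b, r, t = (np.array(col, dtype=float) for col in zip(*samples))
    # Multiplying through by r makes the model linear in the parameters:
    #   t * r = alpha * b^2 + beta * b + gamma
    A = np.stack([b ** 2, b, np.ones_like(b)], axis=1)
    coeffs, *_ = np.linalg.lstsq(A, t * r, rcond=None)
    return coeffs

# 9 pre-run samples of (batch size, allocated GPU resources, active time in ms)
samples = [(1, 0.25, 4.1), (4, 0.25, 9.8), (16, 0.25, 35.2),
           (1, 0.50, 2.0), (4, 0.50, 5.1), (16, 0.50, 17.9),
           (1, 1.00, 1.1), (4, 1.00, 2.6), (16, 1.00, 9.0)]
alpha_act, beta_act, gamma_act = fit_active_time(samples)
```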
The power $p_i$, the L2 cache utilization $c_i$, and the GPU processing capability (i.e., $b_i / t_i^{act}$) are then modeled. Since both the power and the L2 cache utilization increase linearly with the GPU processing capability, the relationships among them can be expressed as:

$$p_i = \alpha_i^{pow} \cdot \frac{b_i}{t_i^{act}} + \beta_i^{pow} \qquad c_i = \alpha_i^{cache} \cdot \frac{b_i}{t_i^{act}} + \beta_i^{cache}$$

where $\alpha_i^{pow}$, $\beta_i^{pow}$, $\alpha_i^{cache}$, and $\beta_i^{cache}$ are the parameters characterizing the relationships between the power, the L2 cache utilization, and the GPU processing capability. These load parameters can likewise be obtained by fitting the pre-run data using the least squares method, where the L2 cache utilization can be obtained using Nsight Compute.
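The two linear relations can be fitted the same way; in this sketch the processing capability is taken as g = b / t_act, per the reconstruction above, and the profile rows are placeholders.

```python
import numpy as np

def fit_power_and_cache(profiles):
    """Fit p_i = a_pow*g + b_pow and c_i = a_cache*g + b_cache against the
    processing capability g = b / t_act from the alone-run profiles."""
    g = np.array([b / t_act for b, t_act, _, _ in profiles])
    p = np.array([row[2] for row in profiles])
    c = np.array([row[3] for row in profiles])
    a_pow, b_pow = np.polyfit(g, p, deg=1)       # linear least squares
    a_cache, b_cache = np.polyfit(g, c, deg=1)
    return (a_pow, b_pow), (a_cache, b_cache)

# (batch, active time ms, power W, L2 utilization) from alone runs
profiles = [(1, 1.1, 68.0, 0.08), (4, 2.6, 112.0, 0.17), (16, 9.0, 151.0, 0.24)]
power_fit, cache_fit = fit_power_and_cache(profiles)
```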
Based on the above DNN inference performance model, the optimization problem of GPU resource configuration for DNN inference loads is further defined as: given the request arrival rate $R_i$ and the delay SLO $T_i^{SLO}$, how to provision the GPU resources $r_{ij}$ and configure the batch size $b_i$ for each inference load $i$ so as to achieve predictable DNN inference performance while minimizing the cost $C$ of the GPU resource configuration. The optimization problem can therefore be expressed as:

$$\min_{r_{ij},\, b_i} \ C = \sum_{j \in \theta} u_j \quad (12)$$

$$\text{s.t.} \quad h_{ij} \geq R_i, \ \forall i \in I \quad (13)$$

$$t_{ij}^{inf} \leq \frac{T_i^{SLO}}{2}, \ \forall i \in I \quad (14)$$

$$\sum_{i \in I} v_{ij} \cdot r_{ij} \leq r_{max}, \ \forall j \in \theta \quad (15)$$

$$\sum_{j \in \theta} v_{ij} = 1, \ \forall i \in I \quad (16)$$

where $u_j$ is the unit price of each GPU device $j$. Equation (12) defines the goal of minimizing the DNN inference cost, subject to the following four constraints. Specifically, constraint (13) ensures that the throughput of each inference load can meet its request arrival rate. Constraint (14) ensures that the batch inference delay of each inference load stays below half of its target delay $T_i^{SLO}$; this is because, taking the effects of inference request batching and queuing delays into account, the batch inference delay cannot exceed half of the SLO. Constraint (15) indicates that the GPU resources allocated on each GPU device must not exceed the maximum GPU resources $r_{max}$. Constraint (16) indicates that each inference load can be placed on only one GPU device.
For a given GPU type, according to equation (12), the cost $C$ is determined by the GPU unit price $u_j$ and the set of allocated GPU devices $\theta$. Since the unit price $u_j$ is a fixed value, the original optimization problem can be simplified to minimizing the number of allocated GPU devices $|\theta|$. To achieve this goal, each inference load needs to be allocated just enough GPU resources to meet its request arrival rate and delay SLO.
For an inference load with a given arrival rate and delay SLO, the minimum GPU resources $r_i^{lower}$ required for it to run alone (17) and the corresponding batch size $b_i^{lower}$ (18) can first be found: $b_i^{lower}$ is the smallest batch size with which the load can sustain its arrival rate $R_i$ within the delay constraint, and $r_i^{lower}$ is the smallest multiple of the GPU resource allocation unit $r_{unit}$ with which the load, running alone with batch size $b_i^{lower}$, satisfies constraints (13) and (14); $r_{unit}$ may be set to 2.5% (i.e., 2 SMs) for V100 GPUs. Substituting (17) and (18) into (12) to (16), the problem can be simplified to:

$$\min \ \sum_{j \in \theta} \left( r_j^{frag} + \sum_{i \in I} v_{ij} \cdot \Delta r_{ij} \right)$$

where $\Delta r_{ij} = r_{ij} - r_i^{lower}$ denotes the GPU resources that must be additionally provisioned because of the performance interference among co-located inference loads, and $r_j^{frag}$ denotes the fragment of unallocated GPU resources on GPU device $j$. Thus, given the GPU resource lower bounds $r_i^{lower}$, the optimization problem can be translated into minimizing the GPU resource fragments and the GPU resources added by performance interference.
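Since the closed-form expressions (17) and (18) are not reproduced here, the following brute-force scan illustrates the same lower bounds using the `predict` sketch above; `configure(load, b, r)`, which re-derives the alone-run parameters for a batch size and resource share from the fitted models, is a hypothetical helper of this sketch.

```python
R_UNIT = 0.025   # GPU resource allocation unit, e.g. 2.5% (2 SMs) on a V100

def lower_bounds(load, hw, rate, slo_ms, max_batch=64):
    """Smallest single-GPU (batch, resources) pair with which `load`, running
    alone, meets throughput >= rate (req/ms) and delay <= slo/2."""
    best = None
    for b in range(1, max_batch + 1):
        for k in range(1, round(1.0 / R_UNIT) + 1):
            r = k * R_UNIT
            cfg = configure(load, b, r)        # hypothetical helper (see above)
            t_inf, h = predict(cfg, [], b, hw)
            if h >= rate and t_inf <= slo_ms / 2:
                if best is None or r < best[1]:
                    best = (b, r)              # keep the cheapest feasible pair
                break                          # smallest r for this batch found
    if best is None:
        raise ValueError("no feasible single-GPU configuration under this SLO")
    return best                                # (b_i^lower, r_i^lower)
```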
To configure the GPU resources, a decision is first made on how to place the inference loads, and then on how to allocate GPU resources to them. To reduce the fragments of unallocated GPU resources, the inference loads are first sorted in descending order of $r_i^{lower}$, and an inference load is placed on a new GPU device only when the GPU resources of the started devices are insufficient. To greedily reduce the GPU resources added by performance interference, the GPU device with the least performance interference is selected to host each inference load. When an inference load $i$ is placed on a GPU device $j$, resources need to be allocated to all loads on that GPU: starting from the GPU resources $r_{ij}$ already assigned and the lower bound $r_i^{lower}$, the GPU resources of any load violating its delay SLO are increased in units of $r_{unit}$ until no inference delay is violated, as sketched below.
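A compact sketch of this interference-aware greedy placement, reusing `predict`, `lower_bounds`, and the hypothetical `configure` helper from the sketches above. The load objects are assumed to carry `name`, `rate`, and `slo_ms` attributes, and the interference score used to rank candidate devices (here: the predicted delay of the incoming load on each device) is an assumption of this sketch, not the patent's exact criterion.

```python
def place_loads(loads, hw):
    """Sort by descending resource lower bound; host each load on the started
    device with the least predicted interference, else start a new device."""
    bounds = {ld.name: lower_bounds(ld, hw, ld.rate, ld.slo_ms) for ld in loads}
    devices = []   # one device = list of [load, batch, resources] entries
    for ld in sorted(loads, key=lambda l: bounds[l.name][1], reverse=True):
        b, r = bounds[ld.name]
        fits = [d for d in devices if sum(e[2] for e in d) + r <= 1.0]
        if fits:
            def inflation(d):   # predicted delay of ld if placed on device d
                others = [configure(e[0], e[1], e[2]) for e in d]
                return predict(configure(ld, b, r), others, b, hw)[0]
            dev = min(fits, key=inflation)
        else:
            dev = []
            devices.append(dev)                # start a new GPU device
        dev.append([ld, b, r])
        grow_until_slos_met(dev, hw)
    return devices

def grow_until_slos_met(dev, hw):
    """Bump delay-violating loads by one R_UNIT at a time until none violates."""
    progress = True
    while progress:
        progress = False
        for entry in dev:
            ld, b, r = entry
            others = [configure(e[0], e[1], e[2]) for e in dev if e is not entry]
            t_inf, _ = predict(configure(ld, b, r), others, b, hw)
            if t_inf > ld.slo_ms / 2 and sum(e[2] for e in dev) + R_UNIT <= 1.0:
                entry[2] += R_UNIT
                progress = True
```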
FIG. 2 shows the interference-aware GPU resource configuration system iGniter for deep learning inference performance in detail. First, the DNN inference model together with its request arrival rate and delay SLO is submitted at the iGniter entry. A model without historical data is pre-run to obtain its load parameters and the hardware parameters. Using these parameters, the inference performance predictor estimates the inference delay with the performance model, which guides the GPU resource allocator and the inference load placer to determine, for each inference load, the GPU device with the least performance interference among the candidate GPUs while guaranteeing the SLOs. Finally, the GPU device controller builds a GPU cluster according to the generated GPU resource configuration scheme and starts a Triton inference service process for each DNN inference load on the configured GPU devices.
Examples
To verify the feasibility and accuracy of the invention, a Triton-based GPU cluster with 10 V100 GPU cards was built on AWS EC2 using p3.2xlarge instances. On each instance, a Triton inference service process and its corresponding client are started, and each DNN inference load has a constant request arrival rate. The seven hardware parameters (i.e., $P$, $F$, $p_{idle}$, $B_{pcie}$, $\alpha_f$, $\alpha_{sch}$, $\beta_{sch}$) were measured using Nsight Systems and nvidia-smi. For the V100, the GPU maximum power, maximum frequency, idle power $p_{idle}$, and available PCIe bandwidth $B_{pcie}$ are 300 W, 1530 MHz, 53.5 W, and 10 GBps, respectively. The power parameter $\alpha_f$ and the scheduling parameters $\alpha_{sch}$ and $\beta_{sch}$ are -1.025, 0.00475, and -0.00902, respectively.
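Plugged into the `HwParams` sketch from the detailed description, these measured V100 values read as follows (10 GBps is taken as 10 MB per millisecond):

```python
v100 = HwParams(P=300.0, F=1530.0, p_idle=53.5,
                B_pcie=10.0,           # 10 GB/s = 10 MB per millisecond
                alpha_f=-1.025, alpha_sch=0.00475, beta_sch=-0.00902)
```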
Configuration of the DNN inference loads: four representative DNN models listed in Table 1 were selected; the AlexNet, ResNet-50, and VGG-19 models run an image classification task on the ImageNet dataset, and SSD runs a target detection task on the VOC2012 dataset. From these models, 12 DNN inference loads were generated, consisting of three applications (i.e., application 1, application 2, and application 3) with different delay SLOs and different throughput SLOs (i.e., request arrival rates).
TABLE 1 configuration description of two SLOs for three applications
Evaluation baselines and metrics: iGniter is compared with the following three strategies: (1) FFD$^+$: it allocates resources based on the GPU resource lower bound $r_i^{lower}$ and places inference loads using a first-fit-decreasing (FFD) algorithm; (2) GSLICE$^+$: it first places loads according to the iGniter inference load placement plan, and then adjusts the allocated GPU resources and batch sizes on each GPU according to the online average delay and throughput of each inference load; (3) gpu-lets$^+$: it allocates GPU resources with the goal of maximizing request throughput and places each inference load on the best-fitting GPU. GSLICE$^+$ and gpu-lets$^+$ increase the batch size to just meet the request arrival rate. The experiments focus on two key metrics: the inference load cost and the number of SLO-violating inference loads, where an SLO violation is defined as the 99th-percentile delay of an inference load exceeding its delay SLO.
As shown in Table 2, iGniter ensures that the 99th-percentile delay of all inference loads satisfies the delay SLO, while achieving cost savings of up to 25% compared with gpu-lets$^+$. This verifies the effectiveness and cost efficiency of the interference-aware GPU resource configuration method for deep learning inference performance.
TABLE 2 Comparison of the cost and the number of violations for the gpu-lets$^+$, FFD$^+$, GSLICE$^+$, and iGniter strategies

 | gpu-lets$^+$ | FFD$^+$ | GSLICE$^+$ | iGniter
---|---|---|---|---
Cost ($) | 24.48 | 15.3 | 18.36 | 18.36
Number of violations | 3 | 10 | 3 | 0
The embodiment of the present invention can also provide a GPU resource configuration system for guaranteeing DNN inference performance, the system comprising:

a DNN inference load pre-run module, which submits the DNN model to a GPU device, pre-runs 11 different configurations, and obtains the parameters of the DNN inference performance prediction model;

a DNN inference performance prediction module, which establishes a DNN inference performance prediction model that explicitly considers performance interference, so as to predict the DNN inference performance;

a GPU resource configuration module iGniter, which, based on the DNN inference performance prediction model, determines for each inference load the GPU device with the least performance interference among the candidate GPUs while guaranteeing the SLO; the generated GPU resource configuration scheme minimizes the DNN inference cost while satisfying the user's DNN inference performance SLO;

and a GPU device control module, in which the GPU device controller builds a GPU cluster according to the generated GPU resource configuration scheme and starts a Triton inference service process for each DNN inference load on the configured GPU devices.

The user only needs to submit the deep learning inference load, the request arrival rate, and the delay SLO to the resource configuration system to complete the GPU resource configuration automatically, and the resulting resource configuration scheme minimizes the DNN inference cost while satisfying the user's DNN inference performance SLO.
Claims (4)
1. An interference-aware GPU resource configuration method for deep learning inference performance, characterized by comprising the following specific steps:

Step 1: submitting the DNN inference model and its delay and throughput targets to the GPU inference system; pre-running the DNN inference model without historical data and obtaining from the pre-run results the following 8 load parameters: the input data size $d_i^{in}$ of the DNN model, the result data size $d_i^{out}$, the number of kernel functions $n_i$, the GPU L2 cache contention parameter $k_i^{l2}$, the scheduling delay $t_i^{sch}$ when running alone, the GPU active time $t_i^{act}$ when running alone, the GPU power $p_i$, and the GPU L2 cache utilization $c_i$; together with the following 7 hardware parameters: the GPU maximum power $P$, maximum frequency $F$, idle power $p_{idle}$, available PCIe bandwidth $B_{pcie}$, the relation parameter $\alpha_f$ between GPU power and frequency, and the increased-scheduling-delay relation parameters $\alpha_{sch}$ and $\beta_{sch}$;

Step 2: establishing a performance prediction model for the DNN inference load from the 8 load parameters and 7 hardware parameters acquired in step 1, and predicting the inference delay and throughput of the DNN load;

Step 3: according to the inference model provided by the user in step 1, the throughput target $R_i$, and the delay service level objective $T_i^{SLO}$, establishing a mathematical optimization problem that minimizes the DNN inference cost;

Step 4: using the delay and throughput targets and the model obtained in step 2, computing the batch size and the GPU resource lower bound of each inference load, and outputting the GPU resource configuration scheme that minimizes the inference cost while guaranteeing the inference performance.
2. The interference-aware GPU resource configuration method for deep learning inference performance according to claim 1, characterized in that the step 2 specifically comprises: the predicted inference delay is

$$t_{ij}^{inf} = t_{ij}^{load} + t_{ij}^{gpu} + t_{ij}^{fd}$$

where $t_{ij}^{inf}$ is the predicted inference delay of inference load $i$ on GPU device $j$, $t_{ij}^{load}$ is the data loading delay, $t_{ij}^{gpu}$ is the GPU execution delay, and $t_{ij}^{fd}$ is the result feedback delay; since the mainstream DNN inference server Triton overlaps the data loading phase with the GPU execution and result feedback phases, the predicted DNN throughput $h_{ij}$ is expressed as:

$$h_{ij} = \frac{b_i}{\max\left(t_{ij}^{load},\ t_{ij}^{gpu} + t_{ij}^{fd}\right)}$$

where $b_i$ is the batch size; the data loading and result feedback delays are

$$t_{ij}^{load} = \frac{b_i \cdot d_i^{in}}{B_{pcie}} \quad \text{and} \quad t_{ij}^{fd} = \frac{b_i \cdot d_i^{out}}{B_{pcie}}$$

where $d_i^{in}$ and $d_i^{out}$ respectively denote the inference input data size and the output result data size when $b_i$ is 1, and $B_{pcie}$ denotes the available bandwidth of the PCIe bus; the GPU execution delay $t_{ij}^{gpu}$ is expressed as:

$$t_{ij}^{gpu} = \left( n_i \cdot \left( t_i^{sch} + \Delta t_j^{sch} \right) + t_i^{act} \cdot \left( 1 + k_i^{l2} \cdot \sum_{i' \in I,\, i' \neq i} v_{i'j}\, c_{i'} \right) \right) \cdot \frac{F}{f_j}$$

where $t_i^{sch}$ is the GPU scheduling delay when inference load $i$ runs alone, $\Delta t_j^{sch}$ is the scheduling delay added at the GPU resource scheduler by performance interference, $n_i$ is the number of kernel functions of inference load $i$, $k_i^{l2}$ is the parameter describing how GPU L2 cache contention prolongs the GPU active time of inference load $i$, $t_i^{act}$ and $c_i$ respectively denote the GPU active time and the GPU L2 cache utilization when inference load $i$ runs alone, $v_{ij}$ indicates whether inference load $i$ runs on GPU device $j$, and $f_j$ and $F$ are respectively the actual GPU frequency and the maximum frequency on GPU device $j$; $\Delta t_j^{sch}$ is expressed as:

$$\Delta t_j^{sch} = \alpha_{sch} \cdot \sum_{i \in I} v_{ij} + \beta_{sch}$$

where $\alpha_{sch}$ and $\beta_{sch}$ respectively denote the relation parameters between the increased GPU scheduling delay and the number of co-located inference loads; $f_j$ is expressed as:

$$f_j = \min\left\{ F,\ F + \alpha_f \cdot \left( p_j^{gpu} - P \right) \right\}$$

where $\alpha_f$ is the parameter describing the relationship between the GPU power demand and the GPU operating frequency, and $p_j^{gpu}$ and $P$ are respectively the total GPU power demand and the maximum GPU power; $p_j^{gpu}$ is expressed as:

$$p_j^{gpu} = p_{idle} + \sum_{i \in I} v_{ij} \cdot p_i$$

where $p_{idle}$ and $p_i$ are respectively the GPU idle power and the GPU power demand of inference load $i$ when running alone.
3. The interference-aware GPU resource configuration method for deep learning inference performance according to claim 1, characterized in that the step 3 specifically comprises:

$$\min_{r_{ij},\, b_i} \ C = \sum_{j \in \theta} u_j$$

$$\text{s.t.} \quad h_{ij} \geq R_i \ \ \forall i \in I; \qquad t_{ij}^{inf} \leq \frac{T_i^{SLO}}{2} \ \ \forall i \in I; \qquad \sum_{i \in I} v_{ij} \cdot r_{ij} \leq r_{max} \ \ \forall j \in \theta; \qquad \sum_{j \in \theta} v_{ij} = 1 \ \ \forall i \in I$$

where $C$ represents the DNN inference cost, $u_j$ represents the unit price of GPU device $j$, $r_{ij}$ is the GPU resources allocated to inference load $i$ on GPU device $j$, and $b_i$ is the batch size of inference load $i$, $I$ being the set of inference loads and $\theta$ the set of allocated GPU devices; the variables of the model are $r_{ij}$ and $b_i$, the unknowns of the minimization problem; the first constraint ensures that the throughput of each inference load can meet its request arrival rate; the second constraint ensures that the inference delay of each inference load stays below its target delay $T_i^{SLO}$; the third constraint ensures that the GPU resources allocated on each GPU device do not exceed the maximum GPU resources $r_{max}$; the fourth constraint indicates that each inference load can be placed on only one GPU device.
4. The interference-aware GPU resource configuration method for deep learning inference performance according to claim 1, characterized in that the step 4 specifically comprises: according to the target throughput and the delay constraint, solving for the lower bound $r_i^{lower}$ of the GPU resources required for running alone and the batch size $b_i^{lower}$; sorting the inference loads in descending order of $r_i^{lower}$; each time, selecting among the already started candidate GPUs the GPU device with the least performance interference to host the inference load, and placing the inference load on a newly started GPU device when the resources of the candidate GPU devices are insufficient; when allocating GPU resources, increasing the GPU resources of the inference loads violating their delay, one GPU resource unit $r_{unit}$ at a time, until no inference load violates its delay; thereby outputting the GPU resource configuration scheme that minimizes the inference cost while guaranteeing the inference performance, namely the batch size $b_i$ and the GPU resources $r_{ij}$.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202210295359.0A | 2022-03-24 | 2022-03-24 | GPU resource configuration method for deep learning inference performance interference perception
Publications (1)

Publication Number | Publication Date
---|---
CN115237586A | 2022-10-25
Family ID: 83668264
Family Applications (1)

Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN202210295359.0A | GPU resource configuration method for deep learning inference performance interference perception (pending) | 2022-03-24 | 2022-03-24

Country Status (1)

Country | Link
---|---
CN | CN115237586A (en)
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116401062A (en) * | 2023-04-13 | 2023-07-07 | 北京大学 | Method and device for processing server non-perception resources and electronic equipment |
CN116401062B (en) * | 2023-04-13 | 2023-09-12 | 北京大学 | Method and device for processing server non-perception resources and electronic equipment |
CN116842994A (en) * | 2023-07-03 | 2023-10-03 | 上海交通大学 | Dynamic optimization method and system for execution efficiency of multiple neural networks |
CN116842994B (en) * | 2023-07-03 | 2024-03-01 | 上海交通大学 | Dynamic optimization method and system for execution efficiency of multiple neural networks |
CN116991590A (en) * | 2023-09-25 | 2023-11-03 | 北京大学 | Deep learning application-oriented resource decoupling system, execution method and equipment |
CN116991590B (en) * | 2023-09-25 | 2024-01-12 | 北京大学 | Deep learning application-oriented resource decoupling system, execution method and equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN115237586A (en) | GPU resource configuration method for deep learning inference performance interference perception | |
Hu et al. | Concurrent container scheduling on heterogeneous clusters with multi-resource constraints | |
Hashem et al. | MapReduce scheduling algorithms: a review | |
US11496413B2 (en) | Allocating cloud computing resources in a cloud computing environment based on user predictability | |
CN110597639B (en) | CPU distribution control method, device, server and storage medium | |
CN109947619B (en) | Multi-resource management system and server for improving throughput based on service quality perception | |
CN105446816B (en) | A kind of energy optimization dispatching method towards heterogeneous platform | |
US10936377B2 (en) | Distributed database system and resource management method for distributed database system | |
Kaushik et al. | An energy-efficient reliable grid scheduling model using NSGA-II | |
Nanda et al. | Racc: resource-aware container consolidation using a deep learning approach | |
Tang et al. | Container-based task scheduling in cloud-edge collaborative environment using priority-aware greedy strategy | |
Al-Masri et al. | Energy-efficient cooperative resource allocation and task scheduling for Internet of Things environments | |
CN116820784B (en) | GPU real-time scheduling method and system for reasoning task QoS | |
Tychalas et al. | SaMW: a probabilistic meta-heuristic algorithm for job scheduling in heterogeneous distributed systems powered by microservices | |
Zhao et al. | Performance and cost-aware task scheduling via deep reinforcement learning in cloud environment | |
CN102184124A (en) | Task scheduling method and system | |
Huang et al. | Optimal power allocation and load balancing for non-dedicated heterogeneous distributed embedded computing systems | |
Runsewe et al. | CRAM: a container resource allocation mechanism for big data streaming applications | |
Ray et al. | Is high performance computing (HPC) ready to handle big data? | |
CN106445661A (en) | Dynamic optimization method and system | |
Nzanywayingoma et al. | Task scheduling and virtual resource optimising in Hadoop YARN-based cloud computing environment | |
CN112306642A (en) | Workflow scheduling method based on stable matching game theory | |
Hamad | An overview of Hadoop scheduler algorithms | |
CN112181498A (en) | Concurrency control method, device and equipment | |
CN111930485A (en) | Job scheduling method based on performance expression |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | 
 | SE01 | Entry into force of request for substantive examination | 