CN115237586A - GPU resource configuration method for deep learning inference performance interference perception - Google Patents

GPU resource configuration method for deep learning inference performance interference perception Download PDF

Info

Publication number
CN115237586A
CN115237586A CN202210295359.0A CN202210295359A CN115237586A CN 115237586 A CN115237586 A CN 115237586A CN 202210295359 A CN202210295359 A CN 202210295359A CN 115237586 A CN115237586 A CN 115237586A
Authority
CN
China
Prior art keywords
gpu
inference
load
dnn
delay
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210295359.0A
Other languages
Chinese (zh)
Inventor
徐飞
徐家年
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China Normal University
Original Assignee
East China Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China Normal University filed Critical East China Normal University
Priority to CN202210295359.0A priority Critical patent/CN115237586A/en
Publication of CN115237586A publication Critical patent/CN115237586A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/505Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a GPU resource configuration method for Deep learning inference performance interference perception, which comprises a Deep Neural Network (DNN) inference performance prediction model and GPU resource configuration for Deep learning inference performance interference perception by utilizing the model. Submitting DNN load for pre-operation and acquiring DNN load parameters and GPU hardware parameters; designing a DNN reasoning delay and throughput prediction model based on the acquired parameters; establishing a mathematical optimization problem of minimizing DNN reasoning cost expense under the condition of ensuring DNN reasoning delay and throughput; the invention designs and realizes a simple and effective GPU resource allocation strategy for interference perception of deep learning inference performanceiGniterThe method solves the performance prediction problem of DNN inference on the GPU, and minimizes DNN inference cost expense on the premise of ensuring DNN inference performance.

Description

GPU resource configuration method for interference perception of deep learning inference performance
Technical Field
The invention belongs to the technical field of GPU resource configuration, and particularly relates to a GPU resource configuration method for interference perception of deep learning inference performance, which can provide predictable inference performance and minimize inference cost expense on GPU equipment.
Background
As the DNN model becomes more complex, the GPU is typically used as an accelerator to reduce latency to meet SLO. To fully utilize the resource utilization of the GPU, great britain recently developed a Multi-Process Service (MPS) technique that can limit the GPU resources occupied by multiple speculative loads in space to share the GPU in space.
While MPS may limit GPU resources for each speculative load, there is significant performance interference between multiple speculative loads running on the same GPU device. When configuring GPU resources for DNN inference loads, without explicitly considering performance interference, the SLO of the user may be made unsatisfiable and incur high inference cost expenses.
Disclosure of Invention
In order to solve the above problems, an object of the present invention is to provide a GPU resource configuration method for deep learning inference performance interference perception, that is, a GPU resource configuration method for DNN inference performance prediction based on GPU equipment and minimizing inference cost overhead, which can reduce cost overhead of a user on the premise of ensuring DNN inference load performance SLO.
The specific technical scheme for realizing the purpose of the invention is as follows:
a GPU resource configuration method for interference perception of deep learning inference performance comprises the following steps:
step 1: submitting the DNN reasoning model and the delay and throughput target to a GPU reasoning system, pre-running the DNN reasoning model without historical data, and obtaining the size of the incoming data of the DNN reasoning model according to a pre-running result
Figure BDA0003563087190000011
Result data size
Figure BDA0003563087190000012
Number of kernel functions
Figure BDA0003563087190000013
GPU level two cache parameters
Figure BDA0003563087190000014
Scheduling delay at individual runtime
Figure BDA0003563087190000015
Active time
Figure BDA0003563087190000016
GPU Power p i GPU second level cache c i 8 load parameters and maximum power P, maximum frequency F, idle power P of the GPU idle Available PCIe bandwidth B pcie Relation parameter alpha between GPU power and frequency f GPU growth scheduling delay relation parameter alpha sch 、β sch 7 hardware parameters;
and 2, step: establishing a performance prediction model about DNN inference load according to the 8 load parameters and the 7 hardware parameters acquired in the step 1, and predicting inference delay and throughput of the DNN load; wherein the predicted inference delay model
Figure BDA0003563087190000017
As follows:
Figure BDA0003563087190000018
wherein, the first and the second end of the pipe are connected with each other,
Figure BDA0003563087190000021
in order for the inference delay to be predicted,
Figure BDA0003563087190000022
delay for data loading,
Figure BDA0003563087190000023
Performing a sum for a GPU
Figure BDA0003563087190000024
A delay is fed back for the result; server Triton inference due to mainstream DNNOverlap the data loading phase with the GPU execution and result feedback phase, so DNN predicted throughput h ij Expressed as:
Figure BDA0003563087190000025
wherein b is i Is the batch size;
data load latency
Figure BDA0003563087190000026
And result feedback delay
Figure BDA0003563087190000027
Expressed as:
Figure BDA0003563087190000028
and
Figure BDA0003563087190000029
wherein the content of the first and second substances,
Figure BDA00035630871900000210
and
Figure BDA00035630871900000211
respectively represent when b i For 1 time, the size of input data and output result data are inferred, and B pcie Representing available bandwidth of a PCIe bus; GPU execution delay
Figure BDA00035630871900000212
Is represented as: GPU execution delay
Figure BDA00035630871900000213
Is represented as:
Figure BDA00035630871900000214
wherein the content of the first and second substances,
Figure BDA00035630871900000215
to reason about the GPU scheduling delay when load i is running alone,
Figure BDA00035630871900000216
for increased scheduling delay due to performance interference at the GPU resource scheduler,
Figure BDA00035630871900000217
the number of kernel functions of the load i is inferred,
Figure BDA00035630871900000218
to describe the inference load i the relevant parameters that cause GPU active time to be extended due to GPU level two cache contention,
Figure BDA00035630871900000219
and c i Respectively expressing GPU active time and GPU secondary cache utilization rate v when inference load i operates independently ij To infer whether load i is running on GPU device j, f j And F is the actual GPU frequency and the maximum frequency on the GPU device j respectively;
Figure BDA00035630871900000220
expressed as:
Figure BDA00035630871900000221
wherein alpha is sch And beta sch Respectively representing the relation parameters between the GPU growth scheduling delay and the inference load common quantity; f. of j Expressed as:
Figure BDA00035630871900000222
wherein alpha is f Is a parameter for describing the relationship between the GPU power requirements and the GPU operating frequency
Figure BDA00035630871900000223
And P is the total GPU power requirement and the maximum GPU power,
Figure BDA00035630871900000224
expressed as:
Figure BDA00035630871900000225
wherein p is idle And p i GPU idle power and the GPU power requirements of the inference load i when running alone, respectively.
And step 3: according to the inference model provided by the step 1 user and the throughput target R i Delaying service level targets
Figure BDA00035630871900000226
(Service Level Objective, SLO), establishing a mathematical optimization problem of minimizing cost and expense of DNN inference costs; the method specifically comprises the following steps:
Figure BDA0003563087190000031
Figure BDA0003563087190000032
Figure BDA0003563087190000033
Figure BDA0003563087190000034
Figure BDA0003563087190000035
wherein C represents the cost expense of DNN inference, u j Representing GPU device Unit price, r ij GPU resources allocated on GPU device j for inference load i, b i In order to infer the batch size of a load I, wherein I is a set of inference loads, and theta is a set of distributed GPU equipment; the variable in the model is r ij And b i The variables that need to be solved for the minimized mathematical problem; the first constraint condition ensures that the throughput of each inference load can meet the request arrival rate; the second constraint ensures that the inference delay of each inference load is below its target delay
Figure BDA0003563087190000036
The third constraint condition ensures that the GPU resource allocated by each GPU device should not exceed the maximum GPU resource r max (ii) a The fourth constraint indicates that each inference load can only be placed on one GPU device.
And 4, step 4: calculating each inference load batch and a GPU resource lower bound by using the delay and throughput target and the model obtained in the step 2, and outputting a GPU resource configuration scheme which minimizes inference cost expense under the condition of ensuring inference performance; the method specifically comprises the following steps: according to the target throughput and the delay constraint, the lower bound of the GPU resource required by the independent operation is solved
Figure BDA0003563087190000037
And appropriate batch size
Figure BDA0003563087190000038
According to
Figure BDA0003563087190000039
And sorting the inference loads in a descending order, selecting GPU equipment with the minimum performance interference to host the inference loads in the started candidate GPUs each time, and putting the inference loads on the newly started GPU equipment when the resources of the candidate GPU equipment are insufficient. When GPU resources are allocated, r is the GPU resource unit unit And increasing GPU resources violating the delay inference load one by one until the delay of all inference loads is not violated. Therefore, the GPU resource configuration scheme, namely the batch size b, which can minimize inference cost expense under the condition of ensuring the inference performance can be output i And GPU resource r ij
The invention solves the problems that a plurality of inference loads operate on the same GPU, the inference performance is unpredictable, the GPU resource configuration is not predictable, and the DNN inference cost is minimized. The invention provides predictable performance for DNN inference load based on the GPU in a mathematical modeling mode, further provides more scientific and reasonable GPU resource configuration, and can reduce cost and expense of users on the premise of ensuring DNN inference load performance SLO.
Drawings
FIG. 1 shows different speculative requests (i.e., i) for the same DNN speculative load 1 、i 2 、i 3 ) The operation process on the GPU is schematically shown;
FIG. 2 is a diagram of a GPU resource configuration framework (based on AWS EC 2) for deep learning inference performance interference awareness;
FIG. 3 is a flow chart of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings. The invention designs and realizes a GPU resource configuration framework iGniter with high-benefit deep learning inference performance interference perception to ensure the performance of DNN inference load based on GPU equipment and minimize inference cost and expense.
First, a scenario in which the present invention is applied is described, it involves multiple continuously arriving DNN inference loads I = { I = } 1 ,i 2 ,…,i m And a series of allocated GPU devices θ = { j = 1 ,j 2 ,…,j g }。
As shown in fig. 1, the execution of DNN inference on the GPU may be divided into three sequential steps: data loading, GPU execution and result feedback. Thus, one speculative load i runs on one GPU device j for DNN speculative delay
Figure BDA0003563087190000041
By delaying the loading of data
Figure BDA0003563087190000042
GPU execution delay
Figure BDA0003563087190000043
And result feedback delay
Figure BDA0003563087190000044
Added up, which can be expressed as:
Figure BDA0003563087190000045
as shown in fig. 1, to improve GPU resource utilization, mainstream DNN inference servers (e.g., triton) overlap the data loading phase with the GPU execution and result feedback phase. Thus, the throughput h of the DNN inference ij Can be expressed as:
Figure BDA0003563087190000046
wherein, b i Representing the batch size selected at the inference load runtime.
Data loading and result feedback stage: since the inference input and output result data are transmitted between the CPU and GPU device through PCIe and these data are linearly related to the batch size, data loading delays
Figure BDA0003563087190000047
And result feedback delay
Figure BDA0003563087190000048
Can be expressed as:
Figure BDA0003563087190000049
and
Figure BDA00035630871900000410
wherein
Figure BDA00035630871900000411
And
Figure BDA00035630871900000412
respectively represent when b i Reasoning input and output result data size for 1 hour, and B pcie Representing PCIe available bandwidth.
GPU execution phase: each inference load will be assigned a certain amount of GPU resources r ij ∈[0,r max ],
Figure BDA00035630871900000413
These resources are mapped onto a set of Streaming Multi-processors (SM), where r max Is set to 1. As shown in FIG. 1, the GPU execution phase includes GPU scheduling and running kernel functions on the allocated SM, which directly receives the allocated GPU resources r ij Influence. Furthermore, performance disturbances may be due to reduced GPU frequency due to load co-location, which inevitably lengthens the GPU execution phase. Thus, GPU execution latency
Figure BDA00035630871900000414
Can be expressed as:
Figure BDA00035630871900000417
wherein
Figure BDA00035630871900000415
And
Figure BDA00035630871900000416
respectively representing the total scheduling delay and GPU active time of a GPU with an inferred load i running on a GPU device j without downconversion. f. of j And F represent the actual and maximum GPU frequency on device j, respectively, for GPU time.
Next, scheduling delays for the inferred load are preceded
Figure BDA0003563087190000051
And (6) modeling.
Figure BDA0003563087190000052
And number of kernel functions
Figure BDA0003563087190000053
Roughly proportional, it can be expressed as:
Figure BDA0003563087190000054
wherein
Figure BDA0003563087190000055
To infer the number of kernel functions for load i,
Figure BDA0003563087190000056
to reason about the scheduling delay when load i alone is running,
Figure BDA0003563087190000057
the increased scheduling delay caused by the performance interference for the GPU resource scheduler is highly correlated to the amount of speculative load that co-operates on GPU device j. It is therefore expressed as:
Figure BDA0003563087190000058
wherein alpha is sch And beta sch Respectively represents the GPU growth scheduling delay relation parameter, sigma i∈I v ij Representing the number of inference loads co-operating on one GPU device j. v. of ij Representing whether inference load i is running on GPU device j may be given by:
Figure BDA0003563087190000059
next, the GPU active time of the inference load i running on the GPU device j is calculated
Figure BDA00035630871900000510
And (6) modeling. Because of contention by the speculative load on the shared GPU level two cache space on the GPU, this system level indicator of GPU level two cache utilization may be used to characterize the demand on the GPU level two cache space by the speculative load. For a fixed supply of level two cache space on a given GPU device, a higher GPU level two cache utilization (i.e., demand) indicates a more competitive on GPU level two cache space, resulting in a longer GPU active time. Therefore, the temperature of the molten metal is controlled,
Figure BDA00035630871900000511
can be expressed as:
Figure BDA00035630871900000512
wherein
Figure BDA00035630871900000513
To reason about the load i parameters that cause the GPU active time to be extended due to second level cache contention.
Figure BDA00035630871900000514
And c i Respectively representing the GPU active time and the secondary cache utilization rate when the inference load i operates independently.
Finally, the GPU frequency f on the GPU device j is adjusted j And modeling. Total GPU power demand when loaded
Figure BDA00035630871900000515
Above the upper GPU power limit for a given GPU type, the GPU frequency drops dramatically. Since GPU frequency is highly correlated with GPU power, GPU frequency can be expressed as:
Figure BDA00035630871900000516
wherein alpha is f Are relational parameters that describe GPU power and frequency for a given GPU type. In addition, by connecting GPU device jThe power of the inferential load and the idle power p idle Add to estimate the total power consumption requirement of GPU device j, which is given by:
Figure BDA0003563087190000061
wherein p is i To infer the power at which load i operates alone.
Acquiring a load parameter: based on the above, there are 8 load parameters (i.e.
Figure BDA0003563087190000062
p i 、c i
Figure BDA0003563087190000063
) And 7 hardware parameters (i.e., P, F, P) idle 、B pcie 、α f 、α sch 、β sch ) In the performance model. Specifically, four load parameters can be obtained just by pre-running once through the Nsight Systems (i.e.,
Figure BDA0003563087190000064
Figure BDA0003563087190000065
). Caching parameters
Figure BDA0003563087190000066
This can be achieved by activating multiple (i.e. 2) inferential loads simultaneously. For a given GPU type, 3 hardware parameters (i.e., P, F, P) can be obtained using nvidia-smi idle ). Available PCIe Bandwidth B pcie May be obtained by transferring data from CPU memory to GPU memory. GPU frequency parameter alpha f And a scheduling parameter alpha sch 、β sch This can be achieved by starting 1 to 5 speculative loads on a given GPU type simultaneously. Furthermore, GPU active time is obtained by running the inference load i alone on a given GPU type
Figure BDA0003563087190000067
Power p i And level two cache c i . Wherein
Figure BDA0003563087190000068
p i And c i Subject batch size b i And allocated GPU resource r ij The influence of (c).
First according to the allocated GPU resource r ij And batch size b i For GPU active time
Figure BDA0003563087190000069
And modeling. Since GPU active time is roughly inversely proportional to the allocated GPU resources, and GPU active time increases with batch size, it can be represented by a quadratic function. Therefore, GPU active time
Figure BDA00035630871900000610
Can be expressed as:
Figure BDA00035630871900000611
wherein
Figure BDA00035630871900000612
To infer the active time parameter of load i. These parameters were obtained by fitting data of several (i.e. 9) different batch sizes and allocated GPU resources using a least squares method.
Modeling Power p i Second level cache utilization c i And GPU processing capability (i.e.
Figure BDA00035630871900000613
) The relationship between them. Since both power and level two cache utilization increase linearly with GPU processing power. Therefore, power p i Second level cache utilization c i And GPU processing capability
Figure BDA00035630871900000614
BetweenCan be expressed as:
Figure BDA00035630871900000615
Figure BDA00035630871900000616
wherein, the first and the second end of the pipe are connected with each other,
Figure BDA00035630871900000617
and
Figure BDA00035630871900000618
are parameters that characterize the relationship between power, level two cache utilization, and GPU processing power. These load parameters can also be obtained by fitting the pre-run data using a least squares method. Where second level cache utilization may be obtained using Nsight computer.
Based on the DNN inference performance model above, the optimization problem of the GPU resource configuration of the DNN inference load is further defined as: at a given request arrival rate R i And delay SLO
Figure BDA0003563087190000071
How to provide GPU resources r ij And configures a batch size b for each inference load i i To achieve predictable DNN inference performance while minimizing the cost overhead C of GPU resource configuration, therefore, the optimization problem can be expressed as:
Figure BDA0003563087190000072
Figure BDA0003563087190000073
Figure BDA0003563087190000074
Figure BDA0003563087190000075
Figure BDA0003563087190000076
wherein u is j For a unit price per GPU device j, equation (12) defines that the goal is to minimize the cost overhead of DNN inference, subject to the following four constraints. In particular, constraints (13) ensure that the throughput of each inferred load can meet its request arrival rate. Constraints (14) ensure that the inference delay of each inference load is below its target delay
Figure BDA0003563087190000077
This is because the batch process inference delay cannot exceed half of SLO, taking into account the impact of inferential request batching and queuing delays. The constraint (15) indicates that the allocated GPU resources per GPU device should not exceed the maximum GPU resource r max . Constraints (16) indicate that each inference load can only be placed on one GPU device.
For a given GPU according to equation (12), the cost C is governed by the GPU unit price u j And the impact of the allocated GPU device θ. Due to unit price u j Is a fixed value, so the original optimization problem can be simplified to minimize the number of allocated GPU devices | theta |. To achieve this goal, each speculative load needs to allocate GPU resources that just meet the request arrival rate and delay SLO.
For a given SLO inference load with arrival rate and delay, the minimum GPU resources required for its individual operation can be first found
Figure BDA0003563087190000078
And batch size
Figure BDA0003563087190000079
Figure BDA00035630871900000710
Figure BDA00035630871900000711
Wherein
Figure BDA00035630871900000712
r unit It may be set to 2.5% (i.e., 2 SMs) for V100 GPUs for the allocation unit of GPU resources. Substituting equations (17), (18) into (12) to (16), it can be simplified as:
Figure BDA0003563087190000081
Figure BDA0003563087190000082
Figure BDA0003563087190000083
Figure BDA0003563087190000084
wherein
Figure BDA0003563087190000085
Multi-configured GPU resources are required due to performance interference between co-located inference loads.
Figure BDA0003563087190000086
A fragment of unallocated GPU resources for GPU device j. Thus, at a given lower bound of GPU resources
Figure BDA0003563087190000087
In the case of (2), the optimization problem may beTranslating into minimizing fragmentation of GPU resources and GPU resources increased by performance interference.
To configure the GPU resources, a decision is first made as to how to place the inference load, and then a determination is made as to how to assign the GPU resources to the inference load. To reduce unallocated GPU resource fragmentation, first, the method is based on
Figure BDA0003563087190000088
And sorting the reasoning load in a descending order, and only placing the reasoning load on the new GPU equipment when the GPU resources are insufficient. To greedily reduce GPU resources that are increased by performance interference, a GPU device with minimal interference capability is selected to host each inference load. When an inferred load i is placed on a GPU device j, resources need to be allocated to all loads on that GPU. According to delay
Figure BDA0003563087190000089
GPU resource r that has been assigned to an inference load ij GPU resource lower bound of inference load i
Figure BDA00035630871900000810
In units of r unit And gradually increasing GPU resources violating the delay SLO inference load until all inference delays are not violated.
Thus, fig. 2 shows in detail the GPU resource system iggiter for interference perception of deep learning inference performance, first submitting the DNN inference model and request arrival rate and delay SLO to the iggiter entry. For models without historical data, the models are pre-run to obtain load parameters and hardware parameters. Using these parameters, the inferential performance predictor first uses a performance model to estimate the inferential delay, which directs the GPU resource allocator and the inferential load placer to determine the GPU device with the least amount of performance interference among the candidate GPUs for each inferential load and guarantee SLOs. And finally constructing a GPU cluster by the GPU equipment controller according to the generated GPU resource configuration scheme, and starting a Triton inference service process for each DNN inference load on the configured GPU equipment.
Examples
In order to verify the feasibility and accuracy of the inventionIn the invention, a Triton-based GPU cluster with 10V 100 GPU cards is established on AWS EC2 by using a p3.2xlage example. On each instance, a Triton inference service process and its corresponding client are started, with each DNN inference load having a constant inter-request arrival rate. Seven hardware parameters (i.e., P, F, P) were measured using Nsight Systems and nvidia-smi idle 、B pcie 、α f 、α sc 、β sch ). GPU maximum power, frequency, idle power p of V100 idle And available PCIe bandwidth B pcie 300W, 1530MHz, 53.5W and 10GBps, respectively. Power parameter alpha f Scheduling parameter alpha sch 、β sch Respectively-1.025, 0.00475 and-0.00902.
Configuration of DNN inference load: four representative DNN models listed in table 1 were selected, with the AlexNet, resNet-50, and VGG-19 models used to run the image classification task on the ImageNet dataset and SSD used to run the target detection task on the VOC2012 dataset. From these models, 12 DNN inference loads consisting of three applications (i.e., application 1, application 2, and application 3) of different delay SLOs and different throughput SLOs (i.e., request arrival rates) are generated.
TABLE 1 configuration description of two SLOs for three applications
Figure BDA0003563087190000091
Evaluation criteria and evaluation indexes: iGniter will compare with the following three strategies: (1) FFD + : it is based on GPU resource lower bound
Figure BDA0003563087190000092
To allocate resources and place inference loads using a descending first-time-adaptation algorithm (FFD); (2) GSLICE + : firstly, placing according to an iGniter reasoning load placing plan, and then adjusting distributed GPU resources and batch sizes on each GPU according to the online average delay and throughput of each reasoning load; (3) gpu-lets + : it allocates GPU resources with the goal of maximizing request throughput and puts the inference load onOn the best adapted GPU. GSLICE + And gpu-lets + The batch size is increased to just meet the request arrival rate. The experiment focused on two key indicators: including inferential load cost spending and SLO violation inferential load quantity, where SLO violation is defined as the 99 th percentile delay of the inferred load exceeding its delay SLO.
As shown in Table 2, the iGniter can ensure that the 99 th percentile delay of all the inference loads can satisfy the delay SLO, and simultaneously, compared with GPU-lets + Cost savings of up to 25% are achieved. Therefore, the effectiveness and the high cost performance of the GPU resource configuration method for the interference perception of the deep learning inference performance are verified.
TABLE 2 comparison of gpu-lets + 、FFD + 、GSLICE + Cost, expense and number of violations for the iGniter strategy
gpu-lets + FFD + GSLICE + iGniter
Cost expense ($) 24.48 15.3 18.36 18.36
Number of violations 3 10 3 0
The embodiment of the present invention can also provide a GPU resource configuration system for guaranteeing DNN speculative performance, where the system includes:
the DNN inference load pre-operation module submits a DNN model to GPU equipment to pre-operate 11 different configurations and obtain DNN inference performance prediction model parameters;
the DNN reasoning performance predicting module is used for establishing a DNN reasoning performance predicting model which explicitly considers performance interference so as to predict DNN reasoning performance;
the GPU resource configuration module iGniter determines GPU equipment with minimum interference performance in the candidate GPUs and ensures SLO (service level optimization) for each inference load based on a DNN inference performance prediction model, and the generated GPU resource configuration scheme can minimize DNN inference cost expense under the condition that DNN inference performance SLO of a user is met;
and the GPU equipment control module is used for finally constructing a GPU cluster by the GPU equipment controller according to the generated GPU resource configuration scheme and starting a Triton inference service process for each DNN inference load on the configured GPU equipment.
The user can automatically complete GPU resource configuration only by submitting deep learning inference load, request arrival rate and delay SLO to the resource configuration system, and the resource configuration scheme can minimize DNN inference cost expense under the condition of meeting DNN inference performance SLO of the user.

Claims (4)

1. A GPU resource allocation method for sensing deep learning inference performance interference is characterized by comprising the following specific steps:
step 1: submitting the DNN inference model and the delay and throughput target to a GPU inference system, and pre-running the DNN inference model without historical dataThe model obtains the size of the incoming data of the DNN model according to the pre-operation result
Figure FDA0003563087180000011
Result data size
Figure FDA0003563087180000012
Number of kernel functions
Figure FDA0003563087180000013
GPU second level cache parameters
Figure FDA0003563087180000014
Scheduling delay at individual runtime
Figure FDA0003563087180000015
Active time
Figure FDA0003563087180000016
GPU Power p i GPU second level cache c i 8 load parameters and maximum power P, maximum frequency F, idle power P of the GPU idle Available PCIe bandwidth B pcie Relation parameter alpha between GPU power and frequency f GPU growth scheduling delay relation parameter alpha sch 、β sch 7 hardware parameters;
step 2: establishing a performance prediction model about DNN inference load according to the 8 load parameters and the 7 hardware parameters acquired in the step 1, and predicting inference delay and throughput of the DNN load;
and step 3: according to the inference model provided by the user in the step 1 and the throughput target R i Delaying service level targets
Figure FDA0003563087180000017
Establishing a mathematical optimization problem of minimizing DNN reasoning cost expense;
and 4, step 4: and (3) calculating each inference load batch and a GPU resource lower bound by using the delay and throughput target and the model obtained in the step (2), and outputting a GPU resource configuration scheme which minimizes inference cost expense under the condition of ensuring inference performance.
2. The GPU resource configuration method for interference perception of deep learning inference performance according to claim 1, wherein the step 2 specifically comprises:
Figure FDA0003563087180000018
wherein the content of the first and second substances,
Figure FDA0003563087180000019
in order for the inference delay to be predicted,
Figure FDA00035630871800000110
delays for data loading,
Figure FDA00035630871800000111
Performing a sum for a GPU
Figure FDA00035630871800000112
A delay is fed back for the result; since the mainstream DNN inference server Triton overlaps the data loading phase with the GPU execution and result feedback phase, the DNN predicted throughput h ij Expressed as:
Figure FDA00035630871800000113
wherein, b i Is the batch size;
data load latency
Figure FDA00035630871800000114
And result feedback delay
Figure FDA00035630871800000115
Expressed as:
Figure FDA00035630871800000116
and
Figure FDA00035630871800000117
wherein, the first and the second end of the pipe are connected with each other,
Figure FDA00035630871800000118
and
Figure FDA00035630871800000119
respectively represent when b i The size of input data and output result data is inferred for 1, and B pcie Representing available bandwidth of the PCIe bus; GPU execution latency
Figure FDA00035630871800000120
Is represented as:
Figure FDA0003563087180000021
wherein the content of the first and second substances,
Figure FDA0003563087180000022
to reason about the GPU scheduling delay when load i is running alone,
Figure FDA0003563087180000023
for increased scheduling delay due to performance interference at the GPU resource scheduler,
Figure FDA0003563087180000024
the number of kernel functions of the load i is inferred,
Figure FDA0003563087180000025
reason for descriptionLoad i is a relevant parameter that causes the GPU active time to be extended due to GPU level two cache contention,
Figure FDA0003563087180000026
and c i Respectively expressing GPU active time and GPU secondary cache utilization rate v when inference load i operates independently ij To infer whether load i is running on GPU device j, f j And F is the actual GPU frequency and the maximum frequency on GPU device j, respectively;
Figure FDA0003563087180000027
expressed as:
Figure FDA0003563087180000028
wherein alpha is sch And beta sc Respectively representing the relation parameters between the GPU growth scheduling delay and the inference load common quantity; f. of j Expressed as:
Figure FDA0003563087180000029
wherein alpha is f Is a parameter for describing the relationship between the GPU power requirements and the GPU operating frequency
Figure FDA00035630871800000210
And P is the total GPU power requirement and the maximum GPU power,
Figure FDA00035630871800000211
expressed as:
Figure FDA00035630871800000212
wherein p is idle And p i GPU idle power and the GPU power requirements of the inference load i when running alone, respectively.
3. The deep learning inference performance interference aware GPU resource configuration method of claim 1, wherein the step 3 specifically comprises:
Figure FDA00035630871800000213
Figure FDA00035630871800000214
Figure FDA00035630871800000215
Figure FDA00035630871800000216
Figure FDA00035630871800000217
wherein C represents the cost expense of DNN inference, u j Representing GPU device Unit price, r ij GPU resources allocated on GPU device j for reasoning load i, b i In order to infer the batch size of the load I, wherein I is a set of inference loads, and theta is a set of distributed GPU equipment; the variable in the model is r ij And b i Variables that need to be solved for the minimized mathematical problem; the first constraint condition ensures that the throughput of each inference load can meet the request arrival rate; the second constraint ensures that the inference delay of each inference load is below its target delay
Figure FDA00035630871800000218
The third constraint ensures that the allocated GPU resources of each GPU device should not exceed the maximumGPU resource r max (ii) a The fourth constraint indicates that each inference load can only be placed on one GPU device.
4. The deep learning inference performance interference-aware GPU resource configuration method according to claim 1, wherein the step 4 specifically comprises: according to the target throughput and the delay constraint, the lower bound of the GPU resources required by the independent operation is solved
Figure FDA0003563087180000031
And batch size
Figure FDA0003563087180000032
According to
Figure FDA0003563087180000033
Sorting the inference loads in a descending order, selecting GPU equipment with the minimum performance interference from the started candidate GPUs to host the inference loads each time, and placing the inference loads on newly started GPU equipment when the resources of the candidate GPU equipment are insufficient; when GPU resources are allocated, r is the GPU resource unit unit Gradually increasing GPU resources violating the delay inference load until the delay of all inference loads is not violated; therefore, the GPU resource configuration scheme, namely the batch size b, which minimizes the inference cost expense under the condition of ensuring the inference performance can be output i And GPU resource r ij
CN202210295359.0A 2022-03-24 2022-03-24 GPU resource configuration method for deep learning inference performance interference perception Pending CN115237586A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210295359.0A CN115237586A (en) 2022-03-24 2022-03-24 GPU resource configuration method for deep learning inference performance interference perception

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210295359.0A CN115237586A (en) 2022-03-24 2022-03-24 GPU resource configuration method for deep learning inference performance interference perception

Publications (1)

Publication Number Publication Date
CN115237586A true CN115237586A (en) 2022-10-25

Family

ID=83668264

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210295359.0A Pending CN115237586A (en) 2022-03-24 2022-03-24 GPU resource configuration method for deep learning inference performance interference perception

Country Status (1)

Country Link
CN (1) CN115237586A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116401062A (en) * 2023-04-13 2023-07-07 北京大学 Method and device for processing server non-perception resources and electronic equipment
CN116842994A (en) * 2023-07-03 2023-10-03 上海交通大学 Dynamic optimization method and system for execution efficiency of multiple neural networks
CN116991590A (en) * 2023-09-25 2023-11-03 北京大学 Deep learning application-oriented resource decoupling system, execution method and equipment

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116401062A (en) * 2023-04-13 2023-07-07 北京大学 Method and device for processing server non-perception resources and electronic equipment
CN116401062B (en) * 2023-04-13 2023-09-12 北京大学 Method and device for processing server non-perception resources and electronic equipment
CN116842994A (en) * 2023-07-03 2023-10-03 上海交通大学 Dynamic optimization method and system for execution efficiency of multiple neural networks
CN116842994B (en) * 2023-07-03 2024-03-01 上海交通大学 Dynamic optimization method and system for execution efficiency of multiple neural networks
CN116991590A (en) * 2023-09-25 2023-11-03 北京大学 Deep learning application-oriented resource decoupling system, execution method and equipment
CN116991590B (en) * 2023-09-25 2024-01-12 北京大学 Deep learning application-oriented resource decoupling system, execution method and equipment

Similar Documents

Publication Publication Date Title
CN115237586A (en) GPU resource configuration method for deep learning inference performance interference perception
Hu et al. Concurrent container scheduling on heterogeneous clusters with multi-resource constraints
Hashem et al. MapReduce scheduling algorithms: a review
US11496413B2 (en) Allocating cloud computing resources in a cloud computing environment based on user predictability
CN110597639B (en) CPU distribution control method, device, server and storage medium
CN109947619B (en) Multi-resource management system and server for improving throughput based on service quality perception
CN105446816B (en) A kind of energy optimization dispatching method towards heterogeneous platform
US10936377B2 (en) Distributed database system and resource management method for distributed database system
Kaushik et al. An energy-efficient reliable grid scheduling model using NSGA-II
Nanda et al. Racc: resource-aware container consolidation using a deep learning approach
Tang et al. Container-based task scheduling in cloud-edge collaborative environment using priority-aware greedy strategy
Al-Masri et al. Energy-efficient cooperative resource allocation and task scheduling for Internet of Things environments
CN116820784B (en) GPU real-time scheduling method and system for reasoning task QoS
Tychalas et al. SaMW: a probabilistic meta-heuristic algorithm for job scheduling in heterogeneous distributed systems powered by microservices
Zhao et al. Performance and cost-aware task scheduling via deep reinforcement learning in cloud environment
CN102184124A (en) Task scheduling method and system
Huang et al. Optimal power allocation and load balancing for non-dedicated heterogeneous distributed embedded computing systems
Runsewe et al. CRAM: a container resource allocation mechanism for big data streaming applications
Ray et al. Is high performance computing (HPC) ready to handle big data?
CN106445661A (en) Dynamic optimization method and system
Nzanywayingoma et al. Task scheduling and virtual resource optimising in Hadoop YARN-based cloud computing environment
CN112306642A (en) Workflow scheduling method based on stable matching game theory
Hamad An overview of Hadoop scheduler algorithms
CN112181498A (en) Concurrency control method, device and equipment
CN111930485A (en) Job scheduling method based on performance expression

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination