CN115237586A

CN115237586A - GPU resource configuration method for deep learning inference performance interference perception

Info

Publication number: CN115237586A
Application number: CN202210295359.0A
Authority: CN
Inventors: 徐飞; 徐家年
Original assignee: East China Normal University
Current assignee: East China Normal University
Priority date: 2022-03-24
Filing date: 2022-03-24
Publication date: 2022-10-25

Abstract

The invention discloses a GPU resource configuration method for Deep learning inference performance interference perception, which comprises a Deep Neural Network (DNN) inference performance prediction model and GPU resource configuration for Deep learning inference performance interference perception by utilizing the model. Submitting DNN load for pre-operation and acquiring DNN load parameters and GPU hardware parameters; designing a DNN reasoning delay and throughput prediction model based on the acquired parameters; establishing a mathematical optimization problem of minimizing DNN reasoning cost expense under the condition of ensuring DNN reasoning delay and throughput; the invention designs and realizes a simple and effective GPU resource allocation strategy for interference perception of deep learning inference performanceiGniterThe method solves the performance prediction problem of DNN inference on the GPU, and minimizes DNN inference cost expense on the premise of ensuring DNN inference performance.

Description

GPU resource configuration method for interference perception of deep learning inference performance

Technical Field

The invention belongs to the technical field of GPU resource configuration, and particularly relates to a GPU resource configuration method for interference perception of deep learning inference performance, which can provide predictable inference performance and minimize inference cost expense on GPU equipment.

Background

As the DNN model becomes more complex, the GPU is typically used as an accelerator to reduce latency to meet SLO. To fully utilize the resource utilization of the GPU, great britain recently developed a Multi-Process Service (MPS) technique that can limit the GPU resources occupied by multiple speculative loads in space to share the GPU in space.

While MPS may limit GPU resources for each speculative load, there is significant performance interference between multiple speculative loads running on the same GPU device. When configuring GPU resources for DNN inference loads, without explicitly considering performance interference, the SLO of the user may be made unsatisfiable and incur high inference cost expenses.

Disclosure of Invention

In order to solve the above problems, an object of the present invention is to provide a GPU resource configuration method for deep learning inference performance interference perception, that is, a GPU resource configuration method for DNN inference performance prediction based on GPU equipment and minimizing inference cost overhead, which can reduce cost overhead of a user on the premise of ensuring DNN inference load performance SLO.

The specific technical scheme for realizing the purpose of the invention is as follows:

a GPU resource configuration method for interference perception of deep learning inference performance comprises the following steps:

step 1: submitting the DNN reasoning model and the delay and throughput target to a GPU reasoning system, pre-running the DNN reasoning model without historical data, and obtaining the size of the incoming data of the DNN reasoning model according to a pre-running result

Result data size

Number of kernel functions

GPU level two cache parameters

Scheduling delay at individual runtime

Active time

GPU Power p ⁱ GPU second level cache c ⁱ 8 load parameters and maximum power P, maximum frequency F, idle power P of the GPU _idle Available PCIe bandwidth B _pcie Relation parameter alpha between GPU power and frequency _f GPU growth scheduling delay relation parameter alpha _sch 、β _sch 7 hardware parameters;

and 2, step: establishing a performance prediction model about DNN inference load according to the 8 load parameters and the 7 hardware parameters acquired in the step 1, and predicting inference delay and throughput of the DNN load; wherein the predicted inference delay model

As follows:

wherein, the first and the second end of the pipe are connected with each other,

in order for the inference delay to be predicted,

delay for data loading,

Performing a sum for a GPU

A delay is fed back for the result; server Triton inference due to mainstream DNNOverlap the data loading phase with the GPU execution and result feedback phase, so DNN predicted throughput h ^ij Expressed as:

wherein b is ⁱ Is the batch size;

data load latency

And result feedback delay

Expressed as:

and

wherein the content of the first and second substances,

and

respectively represent when b ⁱ For 1 time, the size of input data and output result data are inferred, and B _pcie Representing available bandwidth of a PCIe bus; GPU execution delay

Is represented as: GPU execution delay

Is represented as:

wherein the content of the first and second substances,

to reason about the GPU scheduling delay when load i is running alone,

for increased scheduling delay due to performance interference at the GPU resource scheduler,

the number of kernel functions of the load i is inferred,

to describe the inference load i the relevant parameters that cause GPU active time to be extended due to GPU level two cache contention,

and c ⁱ Respectively expressing GPU active time and GPU secondary cache utilization rate v when inference load i operates independently ^ij To infer whether load i is running on GPU device j, f ^j And F is the actual GPU frequency and the maximum frequency on the GPU device j respectively;

expressed as:

wherein alpha is _sch And beta _sch Respectively representing the relation parameters between the GPU growth scheduling delay and the inference load common quantity; f. of ^j Expressed as:

wherein alpha is _f Is a parameter for describing the relationship between the GPU power requirements and the GPU operating frequency

And P is the total GPU power requirement and the maximum GPU power,

expressed as:

wherein p is _idle And p ⁱ GPU idle power and the GPU power requirements of the inference load i when running alone, respectively.

And step 3: according to the inference model provided by the step 1 user and the throughput target R ⁱ Delaying service level targets

(Service Level Objective, SLO), establishing a mathematical optimization problem of minimizing cost and expense of DNN inference costs; the method specifically comprises the following steps:

wherein C represents the cost expense of DNN inference, u ^j Representing GPU device Unit price, r ^ij GPU resources allocated on GPU device j for inference load i, b ⁱ In order to infer the batch size of a load I, wherein I is a set of inference loads, and theta is a set of distributed GPU equipment; the variable in the model is r ^ij And b ⁱ The variables that need to be solved for the minimized mathematical problem; the first constraint condition ensures that the throughput of each inference load can meet the request arrival rate; the second constraint ensures that the inference delay of each inference load is below its target delay

The third constraint condition ensures that the GPU resource allocated by each GPU device should not exceed the maximum GPU resource r _max (ii) a The fourth constraint indicates that each inference load can only be placed on one GPU device.

And 4, step 4: calculating each inference load batch and a GPU resource lower bound by using the delay and throughput target and the model obtained in the step 2, and outputting a GPU resource configuration scheme which minimizes inference cost expense under the condition of ensuring inference performance; the method specifically comprises the following steps: according to the target throughput and the delay constraint, the lower bound of the GPU resource required by the independent operation is solved

And appropriate batch size

According to

And sorting the inference loads in a descending order, selecting GPU equipment with the minimum performance interference to host the inference loads in the started candidate GPUs each time, and putting the inference loads on the newly started GPU equipment when the resources of the candidate GPU equipment are insufficient. When GPU resources are allocated, r is the GPU resource unit _unit And increasing GPU resources violating the delay inference load one by one until the delay of all inference loads is not violated. Therefore, the GPU resource configuration scheme, namely the batch size b, which can minimize inference cost expense under the condition of ensuring the inference performance can be output ⁱ And GPU resource r ^ij 。

The invention solves the problems that a plurality of inference loads operate on the same GPU, the inference performance is unpredictable, the GPU resource configuration is not predictable, and the DNN inference cost is minimized. The invention provides predictable performance for DNN inference load based on the GPU in a mathematical modeling mode, further provides more scientific and reasonable GPU resource configuration, and can reduce cost and expense of users on the premise of ensuring DNN inference load performance SLO.

Drawings

FIG. 1 shows different speculative requests (i.e., i) for the same DNN speculative load ₁ 、i ₂ 、i ₃ ) The operation process on the GPU is schematically shown;

FIG. 2 is a diagram of a GPU resource configuration framework (based on AWS EC 2) for deep learning inference performance interference awareness;

FIG. 3 is a flow chart of the present invention.

Detailed Description

To make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings. The invention designs and realizes a GPU resource configuration framework iGniter with high-benefit deep learning inference performance interference perception to ensure the performance of DNN inference load based on GPU equipment and minimize inference cost and expense.

First, a scenario in which the present invention is applied is described, it involves multiple continuously arriving DNN inference loads I = { I = } ₁ ,i ₂ ,…,i _m And a series of allocated GPU devices θ = { j = ₁ ,j ₂ ,…,j _g }。

As shown in fig. 1, the execution of DNN inference on the GPU may be divided into three sequential steps: data loading, GPU execution and result feedback. Thus, one speculative load i runs on one GPU device j for DNN speculative delay

By delaying the loading of data

GPU execution delay

And result feedback delay

Added up, which can be expressed as:

as shown in fig. 1, to improve GPU resource utilization, mainstream DNN inference servers (e.g., triton) overlap the data loading phase with the GPU execution and result feedback phase. Thus, the throughput h of the DNN inference ^ij Can be expressed as:

wherein, b ⁱ Representing the batch size selected at the inference load runtime.

Data loading and result feedback stage: since the inference input and output result data are transmitted between the CPU and GPU device through PCIe and these data are linearly related to the batch size, data loading delays

And result feedback delay

Can be expressed as:

and

wherein

And

respectively represent when b ⁱ Reasoning input and output result data size for 1 hour, and B _pcie Representing PCIe available bandwidth.

GPU execution phase: each inference load will be assigned a certain amount of GPU resources r ^ij ∈[0,r _max ],

These resources are mapped onto a set of Streaming Multi-processors (SM), where r _max Is set to 1. As shown in FIG. 1, the GPU execution phase includes GPU scheduling and running kernel functions on the allocated SM, which directly receives the allocated GPU resources r ^ij Influence. Furthermore, performance disturbances may be due to reduced GPU frequency due to load co-location, which inevitably lengthens the GPU execution phase. Thus, GPU execution latency

Can be expressed as:

wherein

And

respectively representing the total scheduling delay and GPU active time of a GPU with an inferred load i running on a GPU device j without downconversion. f. of ^j And F represent the actual and maximum GPU frequency on device j, respectively, for GPU time.

Next, scheduling delays for the inferred load are preceded

And (6) modeling.

And number of kernel functions

Roughly proportional, it can be expressed as:

wherein

To infer the number of kernel functions for load i,

to reason about the scheduling delay when load i alone is running,

the increased scheduling delay caused by the performance interference for the GPU resource scheduler is highly correlated to the amount of speculative load that co-operates on GPU device j. It is therefore expressed as:

wherein alpha is _sch And beta _sch Respectively represents the GPU growth scheduling delay relation parameter, sigma _i∈I v ^ij Representing the number of inference loads co-operating on one GPU device j. v. of ^ij Representing whether inference load i is running on GPU device j may be given by:

next, the GPU active time of the inference load i running on the GPU device j is calculated

And (6) modeling. Because of contention by the speculative load on the shared GPU level two cache space on the GPU, this system level indicator of GPU level two cache utilization may be used to characterize the demand on the GPU level two cache space by the speculative load. For a fixed supply of level two cache space on a given GPU device, a higher GPU level two cache utilization (i.e., demand) indicates a more competitive on GPU level two cache space, resulting in a longer GPU active time. Therefore, the temperature of the molten metal is controlled,

can be expressed as:

wherein

To reason about the load i parameters that cause the GPU active time to be extended due to second level cache contention.

And c ⁱ Respectively representing the GPU active time and the secondary cache utilization rate when the inference load i operates independently.

Finally, the GPU frequency f on the GPU device j is adjusted ^j And modeling. Total GPU power demand when loaded

Above the upper GPU power limit for a given GPU type, the GPU frequency drops dramatically. Since GPU frequency is highly correlated with GPU power, GPU frequency can be expressed as:

wherein alpha is _f Are relational parameters that describe GPU power and frequency for a given GPU type. In addition, by connecting GPU device jThe power of the inferential load and the idle power p _idle Add to estimate the total power consumption requirement of GPU device j, which is given by:

wherein p is ⁱ To infer the power at which load i operates alone.

Acquiring a load parameter: based on the above, there are 8 load parameters (i.e.

p ⁱ 、c ⁱ 、

) And 7 hardware parameters (i.e., P, F, P) _idle 、B _pcie 、α _f 、α _sch 、β _sch ) In the performance model. Specifically, four load parameters can be obtained just by pre-running once through the Nsight Systems (i.e.,

). Caching parameters

This can be achieved by activating multiple (i.e. 2) inferential loads simultaneously. For a given GPU type, 3 hardware parameters (i.e., P, F, P) can be obtained using nvidia-smi _idle ). Available PCIe Bandwidth B _pcie May be obtained by transferring data from CPU memory to GPU memory. GPU frequency parameter alpha _f And a scheduling parameter alpha _sch 、β _sch This can be achieved by starting 1 to 5 speculative loads on a given GPU type simultaneously. Furthermore, GPU active time is obtained by running the inference load i alone on a given GPU type

Power p ⁱ And level two cache c ⁱ . Wherein

p ⁱ And c ⁱ Subject batch size b ⁱ And allocated GPU resource r ^ij The influence of (c).

First according to the allocated GPU resource r ^ij And batch size b ⁱ For GPU active time

And modeling. Since GPU active time is roughly inversely proportional to the allocated GPU resources, and GPU active time increases with batch size, it can be represented by a quadratic function. Therefore, GPU active time

Can be expressed as:

wherein

To infer the active time parameter of load i. These parameters were obtained by fitting data of several (i.e. 9) different batch sizes and allocated GPU resources using a least squares method.

Modeling Power p ⁱ Second level cache utilization c ⁱ And GPU processing capability (i.e.

) The relationship between them. Since both power and level two cache utilization increase linearly with GPU processing power. Therefore, power p ⁱ Second level cache utilization c ⁱ And GPU processing capability

BetweenCan be expressed as:

and

are parameters that characterize the relationship between power, level two cache utilization, and GPU processing power. These load parameters can also be obtained by fitting the pre-run data using a least squares method. Where second level cache utilization may be obtained using Nsight computer.

Based on the DNN inference performance model above, the optimization problem of the GPU resource configuration of the DNN inference load is further defined as: at a given request arrival rate R ⁱ And delay SLO

How to provide GPU resources r ^ij And configures a batch size b for each inference load i ⁱ To achieve predictable DNN inference performance while minimizing the cost overhead C of GPU resource configuration, therefore, the optimization problem can be expressed as:

wherein u is ^j For a unit price per GPU device j, equation (12) defines that the goal is to minimize the cost overhead of DNN inference, subject to the following four constraints. In particular, constraints (13) ensure that the throughput of each inferred load can meet its request arrival rate. Constraints (14) ensure that the inference delay of each inference load is below its target delay

This is because the batch process inference delay cannot exceed half of SLO, taking into account the impact of inferential request batching and queuing delays. The constraint (15) indicates that the allocated GPU resources per GPU device should not exceed the maximum GPU resource r _max . Constraints (16) indicate that each inference load can only be placed on one GPU device.

For a given GPU according to equation (12), the cost C is governed by the GPU unit price u ^j And the impact of the allocated GPU device θ. Due to unit price u ^j Is a fixed value, so the original optimization problem can be simplified to minimize the number of allocated GPU devices | theta |. To achieve this goal, each speculative load needs to allocate GPU resources that just meet the request arrival rate and delay SLO.

For a given SLO inference load with arrival rate and delay, the minimum GPU resources required for its individual operation can be first found

And batch size

Wherein

r _unit It may be set to 2.5% (i.e., 2 SMs) for V100 GPUs for the allocation unit of GPU resources. Substituting equations (17), (18) into (12) to (16), it can be simplified as:

wherein

Multi-configured GPU resources are required due to performance interference between co-located inference loads.

A fragment of unallocated GPU resources for GPU device j. Thus, at a given lower bound of GPU resources

In the case of (2), the optimization problem may beTranslating into minimizing fragmentation of GPU resources and GPU resources increased by performance interference.

To configure the GPU resources, a decision is first made as to how to place the inference load, and then a determination is made as to how to assign the GPU resources to the inference load. To reduce unallocated GPU resource fragmentation, first, the method is based on

And sorting the reasoning load in a descending order, and only placing the reasoning load on the new GPU equipment when the GPU resources are insufficient. To greedily reduce GPU resources that are increased by performance interference, a GPU device with minimal interference capability is selected to host each inference load. When an inferred load i is placed on a GPU device j, resources need to be allocated to all loads on that GPU. According to delay

GPU resource r that has been assigned to an inference load ^ij GPU resource lower bound of inference load i

In units of r _unit And gradually increasing GPU resources violating the delay SLO inference load until all inference delays are not violated.

Thus, fig. 2 shows in detail the GPU resource system iggiter for interference perception of deep learning inference performance, first submitting the DNN inference model and request arrival rate and delay SLO to the iggiter entry. For models without historical data, the models are pre-run to obtain load parameters and hardware parameters. Using these parameters, the inferential performance predictor first uses a performance model to estimate the inferential delay, which directs the GPU resource allocator and the inferential load placer to determine the GPU device with the least amount of performance interference among the candidate GPUs for each inferential load and guarantee SLOs. And finally constructing a GPU cluster by the GPU equipment controller according to the generated GPU resource configuration scheme, and starting a Triton inference service process for each DNN inference load on the configured GPU equipment.

Examples

In order to verify the feasibility and accuracy of the inventionIn the invention, a Triton-based GPU cluster with 10V 100 GPU cards is established on AWS EC2 by using a p3.2xlage example. On each instance, a Triton inference service process and its corresponding client are started, with each DNN inference load having a constant inter-request arrival rate. Seven hardware parameters (i.e., P, F, P) were measured using Nsight Systems and nvidia-smi _idle 、B _pcie 、α _f 、α _sc 、β _sch ). GPU maximum power, frequency, idle power p of V100 _idle And available PCIe bandwidth B _pcie 300W, 1530MHz, 53.5W and 10GBps, respectively. Power parameter alpha _f Scheduling parameter alpha _sch 、β _sch Respectively-1.025, 0.00475 and-0.00902.

Configuration of DNN inference load: four representative DNN models listed in table 1 were selected, with the AlexNet, resNet-50, and VGG-19 models used to run the image classification task on the ImageNet dataset and SSD used to run the target detection task on the VOC2012 dataset. From these models, 12 DNN inference loads consisting of three applications (i.e., application 1, application 2, and application 3) of different delay SLOs and different throughput SLOs (i.e., request arrival rates) are generated.

TABLE 1 configuration description of two SLOs for three applications

Evaluation criteria and evaluation indexes: iGniter will compare with the following three strategies: (1) FFD ⁺ : it is based on GPU resource lower bound

To allocate resources and place inference loads using a descending first-time-adaptation algorithm (FFD); (2) GSLICE ⁺ : firstly, placing according to an iGniter reasoning load placing plan, and then adjusting distributed GPU resources and batch sizes on each GPU according to the online average delay and throughput of each reasoning load; (3) gpu-lets ⁺ : it allocates GPU resources with the goal of maximizing request throughput and puts the inference load onOn the best adapted GPU. GSLICE ⁺ And gpu-lets ⁺ The batch size is increased to just meet the request arrival rate. The experiment focused on two key indicators: including inferential load cost spending and SLO violation inferential load quantity, where SLO violation is defined as the 99 th percentile delay of the inferred load exceeding its delay SLO.

As shown in Table 2, the iGniter can ensure that the 99 th percentile delay of all the inference loads can satisfy the delay SLO, and simultaneously, compared with GPU-lets ⁺ Cost savings of up to 25% are achieved. Therefore, the effectiveness and the high cost performance of the GPU resource configuration method for the interference perception of the deep learning inference performance are verified.

TABLE 2 comparison of gpu-lets ⁺ 、FFD ⁺ 、GSLICE ⁺ Cost, expense and number of violations for the iGniter strategy

	gpu-lets ⁺	FFD ⁺	GSLICE ⁺	iGniter
					Cost expense ($)	24.48	15.3	18.36	18.36
Number of violations	3	10	3	0

The embodiment of the present invention can also provide a GPU resource configuration system for guaranteeing DNN speculative performance, where the system includes:

the DNN inference load pre-operation module submits a DNN model to GPU equipment to pre-operate 11 different configurations and obtain DNN inference performance prediction model parameters;

the DNN reasoning performance predicting module is used for establishing a DNN reasoning performance predicting model which explicitly considers performance interference so as to predict DNN reasoning performance;

the GPU resource configuration module iGniter determines GPU equipment with minimum interference performance in the candidate GPUs and ensures SLO (service level optimization) for each inference load based on a DNN inference performance prediction model, and the generated GPU resource configuration scheme can minimize DNN inference cost expense under the condition that DNN inference performance SLO of a user is met;

and the GPU equipment control module is used for finally constructing a GPU cluster by the GPU equipment controller according to the generated GPU resource configuration scheme and starting a Triton inference service process for each DNN inference load on the configured GPU equipment.

The user can automatically complete GPU resource configuration only by submitting deep learning inference load, request arrival rate and delay SLO to the resource configuration system, and the resource configuration scheme can minimize DNN inference cost expense under the condition of meeting DNN inference performance SLO of the user.

Claims

1. A GPU resource allocation method for sensing deep learning inference performance interference is characterized by comprising the following specific steps:

step 1: submitting the DNN inference model and the delay and throughput target to a GPU inference system, and pre-running the DNN inference model without historical dataThe model obtains the size of the incoming data of the DNN model according to the pre-operation result

Result data size

Number of kernel functions

GPU second level cache parameters

Scheduling delay at individual runtime

Active time

step 2: establishing a performance prediction model about DNN inference load according to the 8 load parameters and the 7 hardware parameters acquired in the step 1, and predicting inference delay and throughput of the DNN load;

and step 3: according to the inference model provided by the user in the step 1 and the throughput target R ⁱ Delaying service level targets

Establishing a mathematical optimization problem of minimizing DNN reasoning cost expense;

and 4, step 4: and (3) calculating each inference load batch and a GPU resource lower bound by using the delay and throughput target and the model obtained in the step (2), and outputting a GPU resource configuration scheme which minimizes inference cost expense under the condition of ensuring inference performance.

2. The GPU resource configuration method for interference perception of deep learning inference performance according to claim 1, wherein the step 2 specifically comprises:

wherein the content of the first and second substances,

in order for the inference delay to be predicted,

delays for data loading,

Performing a sum for a GPU

A delay is fed back for the result; since the mainstream DNN inference server Triton overlaps the data loading phase with the GPU execution and result feedback phase, the DNN predicted throughput h ^ij Expressed as:

wherein, b ⁱ Is the batch size;

data load latency

And result feedback delay

Expressed as:

and

and

respectively represent when b ⁱ The size of input data and output result data is inferred for 1, and B _pcie Representing available bandwidth of the PCIe bus; GPU execution latency

Is represented as:

wherein the content of the first and second substances,

to reason about the GPU scheduling delay when load i is running alone,

the number of kernel functions of the load i is inferred,

reason for descriptionLoad i is a relevant parameter that causes the GPU active time to be extended due to GPU level two cache contention,

and c ⁱ Respectively expressing GPU active time and GPU secondary cache utilization rate v when inference load i operates independently ^ij To infer whether load i is running on GPU device j, f ^j And F is the actual GPU frequency and the maximum frequency on GPU device j, respectively;

expressed as:

wherein alpha is _sch And beta _sc Respectively representing the relation parameters between the GPU growth scheduling delay and the inference load common quantity; f. of ^j Expressed as:

And P is the total GPU power requirement and the maximum GPU power,

expressed as:

3. The deep learning inference performance interference aware GPU resource configuration method of claim 1, wherein the step 3 specifically comprises:

wherein C represents the cost expense of DNN inference, u ^j Representing GPU device Unit price, r ^ij GPU resources allocated on GPU device j for reasoning load i, b ⁱ In order to infer the batch size of the load I, wherein I is a set of inference loads, and theta is a set of distributed GPU equipment; the variable in the model is r ^ij And b ⁱ Variables that need to be solved for the minimized mathematical problem; the first constraint condition ensures that the throughput of each inference load can meet the request arrival rate; the second constraint ensures that the inference delay of each inference load is below its target delay

The third constraint ensures that the allocated GPU resources of each GPU device should not exceed the maximumGPU resource r _max (ii) a The fourth constraint indicates that each inference load can only be placed on one GPU device.

4. The deep learning inference performance interference-aware GPU resource configuration method according to claim 1, wherein the step 4 specifically comprises: according to the target throughput and the delay constraint, the lower bound of the GPU resources required by the independent operation is solved

And batch size

According to

Sorting the inference loads in a descending order, selecting GPU equipment with the minimum performance interference from the started candidate GPUs to host the inference loads each time, and placing the inference loads on newly started GPU equipment when the resources of the candidate GPU equipment are insufficient; when GPU resources are allocated, r is the GPU resource unit _unit Gradually increasing GPU resources violating the delay inference load until the delay of all inference loads is not violated; therefore, the GPU resource configuration scheme, namely the batch size b, which minimizes the inference cost expense under the condition of ensuring the inference performance can be output ⁱ And GPU resource r ^ij 。