CN115756789A - GPU scheduling optimization method for deep learning inference service system - Google Patents

GPU scheduling optimization method for deep learning inference service system

Info

Publication number
CN115756789A
CN115756789A (application CN202211456890.8A)
Authority
CN
China
Prior art keywords
model
time
throughput
models
gpu
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211456890.8A
Other languages
Chinese (zh)
Inventor
彭亚琼 (Peng Yaqiong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University filed Critical Hunan University
Priority to CN202211456890.8A priority Critical patent/CN115756789A/en
Publication of CN115756789A publication Critical patent/CN115756789A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a GPU scheduling optimization method for a deep learning inference service system. The method initializes the deep learning inference service system; distributes all models contained in the initialized system across prediction threads, starts the prediction threads, and periodically executes a throughput demand prediction flow that predicts, for the models assigned to each prediction thread, their throughput demand in the new cycle; and starts a scheduling thread that, at each scheduling time, uses the predicted throughput demands to execute a throughput adjustment flow based on a feedback control strategy, optimizing the actual throughput distribution across the models in the system. The method dynamically predicts the throughput each model service would achieve if it monopolized the GPU and thus adapts effectively to complex and variable workloads; it satisfies the differing latency and throughput requirements of requests to models deployed on the same server, remedying the shortcomings of task scheduling strategies in existing model serving systems.

Description

GPU scheduling optimization method for deep learning inference service system
Technical Field
The invention belongs to the technical fields of computer architecture and artificial intelligence, and in particular relates to a GPU scheduling optimization method for a deep learning inference service system.
Background
With the explosive growth of data, improvements in core algorithms, and advances in hardware computing power, deep learning has brought immeasurable value to the development and application of artificial intelligence and has been widely applied in fields such as computer vision, speech recognition, and natural language processing. Deep learning mainly comprises two stages: training and inference. The training stage uses an optimization algorithm to continuously adjust the weights of a deep neural network (DNN) according to the input data; it is the process of constructing a model. Because of the large volume of input data and the large number of DNN weight parameters, training these DNN models typically requires substantial computing power and takes hours to days to complete.
Hardware accelerators such as GPUs (Graphics Processing Units) and TPUs (Tensor Processing Units), with their powerful capability for performing simple repetitive operations on data, are the mainstream hardware for training and accelerating DNN models. Once a DNN model has been trained, it can make predictions on input data, which begins the inference stage. DNN models are deployed in cloud data centers to provide inference services to tenants, so that mobile devices with limited computing resources can support deep learning applications through such services. Inference tasks were previously deployed mostly on CPUs (Central Processing Units). As DNN models have grown, inference tasks running on the CPU struggle to meet end-to-end real-time requirements (latency below 100 ms). Consequently, a large body of current work favors using GPUs to accelerate DNN inference tasks.
To improve GPU resource utilization, a common approach is to deploy multiple deep learning models on the same GPU server. In this scenario, requirements such as low inference-task latency and performance isolation among tenants pose new challenges for task scheduling mechanisms in a shared-GPU environment. Task scheduling in a shared-GPU environment currently falls mainly into space-sharing and time-sharing approaches. Space-sharing techniques, represented by NVIDIA Multi-Process Service (MPS), allow the GPU to run multiple tasks at any time, effectively improving the resource utilization of the inference service system. However, under a space-sharing policy, the performance of tasks running simultaneously is highly uncertain, and performance isolation and real-time requirements among tenants cannot be guaranteed. Under a time-sharing policy, the GPU runs only a single inference task in any scheduling unit, so the parallel execution capability of the GPU kernels is not fully exploited. Compared with space sharing, GPU resource utilization under time sharing is lower, but the execution time of inference tasks is more stable and the real-time requirements of inference service requests are better guaranteed. However, when heterogeneous models are deployed on the same server, the service performance of relatively small models is more easily interfered with, which seriously degrades the service experience of the corresponding tenants.
In summary, for accelerating DNN inference tasks, currently used shared-GPU environments cannot schedule multiple tasks well, and because of the resulting uncertainty and interference, the real-time requirements of inference service requests cannot be well guaranteed.
Disclosure of the invention
The invention aims to provide a GPU scheduling optimization method for a deep learning inference service system that supports multi-tenant isolation in a shared-GPU environment. Under complex and variable workloads, the method dynamically predicts, in real time, the throughput demand of each model service in the shared-GPU environment and performs dynamic real-time GPU scheduling based on the prediction results, thereby solving the problem of performance isolation among multiple tenants.
The GPU scheduling optimization method for the deep learning inference service system comprises the following steps:
S1, initializing scheduling optimization parameters;
S2, acquiring the tasks to be processed by the deep learning inference service system at the current moment, and predicting the throughput demand of each model among these tasks in a new cycle based on the system parameters of the deep learning inference service system; here, the throughput of a model refers to the number of inference requests the model responds to successfully; each inference request has a deadline, and a request is successfully responded to, and counted in the model's throughput, if and only if the client obtains the corresponding inference result before the request's deadline;
S3, starting a new cycle, and adjusting the throughput allocated to each model over the cycle duration based on the throughput demand of each model in the new cycle obtained in step S2;
and S4, after the current cycle ends, repeating steps S2-S3 until the deep learning inference service system stops running, completing the GPU scheduling optimization for the deep learning inference service system.
Initializing the scheduling optimization parameters in step S1 specifically comprises: the system defines global shared data comprising a model state array models[2][m], a model state index si, and a corresponding model state lock si_lock, and sets up a simulation queue and a request queue for each model. The models[2][m] array stores state information, which comprises each model's estimated throughput demand and its actual throughput in the current cycle; m is the number of models, and each column of the models[2][m] array corresponds to the state information of one model. Every element of models[2][m], as well as si, is initialized to 0, and si is restricted to the values 0 or 1. To avoid read-write conflicts on the model state between the prediction process and the scheduling process, the scheduling process reads and writes the model state stored in each column of models[2][m] only from the row indicated by si, while the prediction process reads and writes the model state stored in each column from the row indicated by si', where si' = (si + 1) % 2 and % is the remainder operator. The number of prediction processes executed in parallel is n and the number of scheduling processes executed in parallel is k; m, n, and k are natural numbers, m must be at least 1, and the specific values of n and k are set according to the hardware configuration of the system. Each model's request queue stores that model's inference requests awaiting response, and its simulation queue stores the inference requests used in the prediction flow of step S2.
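To make the double-buffered layout of this shared data concrete, the following is a minimal Python sketch; the class and attribute names (ModelState, SharedState, publish, and so on) are illustrative assumptions, not identifiers from the patent.

```python
import threading
from collections import deque
from dataclasses import dataclass

@dataclass
class ModelState:
    sim_solo: int = 0   # estimated throughput demand when the model runs alone
    goodput: int = 0    # actual throughput observed in the current cycle

class SharedState:
    """Double-buffered model state shared by prediction and scheduling processes."""

    def __init__(self, m: int):
        # models[2][m]: the scheduling processes read/write the row indexed by si,
        # the prediction processes prepare the other row, indexed by (si + 1) % 2
        self.models = [[ModelState() for _ in range(m)] for _ in range(2)]
        self.si = 0                      # model state index, restricted to 0 or 1
        self.si_lock = threading.Lock()  # protects si while the rows are swapped
        # one request queue (pending requests) and one simulation queue per model
        self.request_queues = [deque() for _ in range(m)]
        self.sim_queues = [deque() for _ in range(m)]

    def publish(self):
        # step D of the prediction flow: swap rows so the scheduler sees new data
        with self.si_lock:
            self.si = (self.si + 1) % 2
```

A deployment with m models would create SharedState(m) once at startup and hand the same object to every prediction and scheduling thread.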
In step S2, the throughput demand of each model among the tasks to be processed in the new cycle is predicted from the system parameters of the deep learning inference service system, specifically using the following steps:
A. All models are assigned evenly to the prediction processes: the number of models in the system is m and the number of prediction processes executed in parallel is n, where m and n are natural numbers, m is at least 1, and the specific value of n is set according to the system's hardware configuration and is at least 1; each of the first n-1 prediction processes is assigned ⌈m/n⌉ models, where ⌈·⌉ denotes rounding up (the ceiling operator), and the last prediction process is assigned m % n models, where % is the remainder operation (for example, with m = 7 and n = 3, the first two prediction processes each receive ⌈7/3⌉ = 3 models and the last receives 7 % 3 = 1);
B. Starting the prediction processes to estimate the throughput demand of all models in the new cycle: for the current model, the prediction process assumes that the model monopolizes the GPU computing resources and simulates scheduling of the model's uncompleted requests; it computes the number of inference requests that could be successfully responded to within one cycle under this simulated schedule and takes the result as the model's throughput demand in the new cycle;
C. Waiting until all prediction processes have completed their estimation tasks;
D. Acquiring the model state lock si_lock, updating the model state index si to si = si' so as to publish the model state data of the new cycle to the scheduling processes, and releasing the model state lock after the update;
E. Ending the current round of prediction;
The detailed flow by which a prediction process estimates the throughput demand of a single model specifically comprises the following steps:
(1) Copying all requests in the request queue of the current model into the model's simulation queue;
(2) Clearing the state data to be published for the current model: assuming the index of the current model is i, the element models[(si+1)%2][i] is the state data to be published for this model; each model's state data contains two member variables, sim_solo and goodput, where models[(si+1)%2][i].sim_solo is the sim_solo member variable of model i, recording the throughput demand estimated for the current model, and models[(si+1)%2][i].goodput is the goodput member variable of model i, recording the current model's actual throughput; both member variables of the model's state data are set to 0, clearing the state data to be published for the current model;
(3) Deleting from the current model's simulation queue the requests that cannot be completed by their own deadlines;
(4) Setting start and end variables that indicate the start time and end time of the new cycle, respectively: reading the current system time and assigning it to start, then assigning start plus the cycle duration to end;
(5) Assuming the system has N GPUs, setting a simulated scheduling time scheduled_time for each GPU and initializing each scheduled_time to start;
(6) Judging whether the simulation queue is empty: if it is empty, executing step (13); otherwise, executing step (7);
(7) Obtaining the minimum scheduled_time value among all the GPUs and assigning it to the variable min_scheduled_time, a temporary variable that records the minimum scheduled_time among all GPUs; denoting the GPU with the minimum scheduled_time as GPU_i;
(8) Judging whether min_scheduled_time ≥ end: if so, executing step (13); otherwise, executing step (9);
(9) Finding from the simulation queue the longest possible block of consecutive requests that satisfies:
min_scheduled_time + inferTime(batch_size) < deadline
where deadline is the deadline of the first request in the request block, batch_size is the number of requests in the block, and inferTime(batch_size) is the completion time for batch-executing batch_size requests of the current model; the formula judges whether the request block could be executed on the GPU as a batch if the current model monopolized the GPU;
(10) Updating the throughput of the current model under the simulated schedule to sim_solo + batch_size, where sim_solo is the member variable of the model's state data;
(11) Updating the simulated scheduling time scheduled_time of GPU_i to min_scheduled_time + inferTime(batch_size);
(12) Deleting from the simulation queue all requests in the consecutive request block of step (9), and jumping to step (6);
(13) Ending the prediction flow for the current model.
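A compact Python sketch of this single-model estimation is given below, assuming each pending request is represented simply by its absolute deadline and inferTime is available as a callable infer_time(batch_size) returning seconds; all names, and the handling of a batch of size zero, are illustrative assumptions rather than the patent's exact procedure.

```python
import time

def estimate_throughput_demand(pending_deadlines, infer_time, period, num_gpus):
    """Simulate scheduling as if the current model monopolized every GPU and
    return the number of requests that could be answered within one cycle."""
    start = time.time()
    end = start + period
    # steps (1)-(3): copy the pending requests (here, their absolute deadlines)
    # and drop any request that cannot finish by its deadline even if run now
    sim_queue = [d for d in sorted(pending_deadlines) if start + infer_time(1) < d]
    sched_time = [start] * num_gpus            # step (5): per-GPU simulated clock
    sim_solo = 0

    while sim_queue:                                            # step (6)
        g = min(range(num_gpus), key=lambda i: sched_time[i])   # step (7)
        if sched_time[g] >= end:                                # step (8)
            break
        # step (9): grow the batch while the first request's deadline still holds
        first_deadline = sim_queue[0]
        batch_size = 0
        while (batch_size < len(sim_queue)
               and sched_time[g] + infer_time(batch_size + 1) < first_deadline):
            batch_size += 1
        if batch_size == 0:
            # head request is no longer feasible at this simulated time; skip it
            # (an assumption; the patent removes infeasible requests up front)
            sim_queue.pop(0)
            continue
        sim_solo += batch_size                                  # step (10)
        sched_time[g] += infer_time(batch_size)                 # step (11)
        del sim_queue[:batch_size]                              # step (12)
    return sim_solo                                             # step (13)
```

A prediction process would call this once per assigned model, for example demand = estimate_throughput_demand(list(q), lambda b: 0.005 + 0.01 * b, period=1.0, num_gpus=2), and write the result into that model's sim_solo field in the row being prepared.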
In step S3, the throughput allocated to each model in the new cycle is adjusted based on the throughput demand of each model in the new cycle obtained in step S2, specifically comprising the following steps:
1) Acquiring the model state lock si_lock;
2) Copying all elements of the row indicated by the model state index si in the array models[2][m] into a local one-dimensional array ms, and then releasing the model state lock si_lock;
3) Traversing each element of the array ms and judging whether the actual throughput obtained by each model is below the standard value: if m_i satisfies the criterion [given only as an image formula in the original], the actual throughput obtained by that model is below the standard value, where m_i is the (i+1)-th element of the array ms;
4) Adding each model whose actual throughput is below the standard value to a list M whose initial value is empty; if no model is below the standard value, adding all models to the list M;
5) Searching the request queue of each model in the list M for the longest possible block of consecutive requests that satisfies:
curTime + inferTime(batch_size) < deadline
where arrival_time and deadline are the arrival time and the deadline of the first request in the request block, batch_size is the number of requests in the block, curTime is the current system time, and inferTime(batch_size) is the time to batch-execute batch_size requests of the current model; deadline - inferTime(batch_size) is the latest scheduling time at which all requests in the request block can still be successfully responded to; the time at which the scheduling process is next executed for the current GPU is then set to curTime + inferTime(batch_size);
6) Exiting the scheduling process.
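Continuing the sketch, one scheduling pass might look as follows in Python, reusing the SharedState and ModelState classes above; because the 'standard value' test appears only as an image in the source, the ratio criterion used here is an assumption, as are the order in which candidate models are served and the per-model latency callables infer_times[i].

```python
import time

def scheduling_pass(shared, infer_times, gpu_id):
    """One feedback-control adjustment step for a single GPU (illustrative)."""
    # steps 1)-2): snapshot the model states published for the scheduler
    with shared.si_lock:
        si = shared.si
        ms = [ModelState(s.sim_solo, s.goodput) for s in shared.models[si]]

    # steps 3)-4): collect models whose actual throughput is below the standard
    # value; the patent gives this test only as an image, so the ratio criterion
    # below (goodput/sim_solo share under the average share) is an assumption
    shares = [m.goodput / m.sim_solo for m in ms if m.sim_solo > 0]
    standard = sum(shares) / len(shares) if shares else 0.0
    M = [i for i, m in enumerate(ms)
         if m.sim_solo > 0 and m.goodput / m.sim_solo < standard]
    if not M:
        M = list(range(len(ms)))

    # step 5): batch as many queued requests of a candidate model as the first
    # request's deadline allows, dispatch them, and set the next scheduling time
    cur_time = time.time()
    for i in M:
        queue = shared.request_queues[i]
        if not queue:
            continue
        first_deadline = queue[0]
        batch_size = 0
        while (batch_size < len(queue)
               and cur_time + infer_times[i](batch_size + 1) < first_deadline):
            batch_size += 1
        if batch_size:
            batch = [queue.popleft() for _ in range(batch_size)]
            # launch_batch(gpu_id, i, batch)   # actual GPU execution is omitted
            shared.models[si][i].goodput += batch_size   # record actual throughput
            return cur_time + infer_times[i](batch_size)  # next scheduling time
    return cur_time   # step 6): nothing schedulable this pass
```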
The GPU scheduling optimization method for a deep learning inference service system provided by the invention initializes the deep learning inference service system and, based on the real-time load assigned to each model in the shared-GPU environment, dynamically predicts the throughput each model service would achieve when monopolizing the GPU; the prediction adapts effectively to complex and variable workloads without introducing an additional offline prediction process. At the same time, using each model's stand-alone throughput as the yardstick, the method adjusts the weights of GPU resource allocation among heterogeneous models, satisfies the differing latency and throughput requirements of requests to models deployed on the same server, and remedies the shortcomings of task scheduling strategies in existing model serving systems with respect to performance isolation for heterogeneous model requests.
Drawings
Fig. 1 is a diagram of a GPU scheduling framework for a deep learning inference service system.
FIG. 2 is a schematic flow chart of the method of the present invention.
Fig. 3 is a general flow diagram of throughput demand prediction.
FIG. 4 is a detailed flow diagram of a prediction thread estimating throughput requirements for a single model in a new cycle.
Fig. 5 is a schematic diagram of a throughput adjustment process based on a feedback control strategy.
FIG. 6 is a comparison of the performance of example 1 of the present invention with that of the prior art.
Detailed Description
FIG. 1 shows the GPU scheduling framework for the deep learning inference service system. The system mainly comprises a throughput demand prediction module and a throughput adjustment module based on a feedback control strategy; the two parts interact with each other to carry out the overall operation jointly. The scheduling system maintains an inference request queue for each model, continuously receives inference requests from clients, and inserts them into the request queue of the target model. In a scenario where multiple models share the GPU, the throughput demand prediction module periodically estimates the throughput each model would achieve if it ran alone under the current workload, then determines the ratio of throughput demands among the models based on the estimates, providing guidance to the throughput adjustment module based on the feedback control strategy. The throughput adjustment module based on the feedback control strategy dynamically monitors the actual throughput of each model in the current cycle and optimizes the actual throughput distribution at fine granularity based on the ratio of throughput demands among the models, minimizing the disparity in performance loss that GPU sharing imposes on heterogeneous models and guaranteeing performance isolation among them.
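As a rough illustration of how these two modules cooperate around the per-model request queues, the following Python glue ties together the sketches introduced earlier (SharedState, estimate_throughput_demand, scheduling_pass); using a single prediction thread, one scheduling thread per GPU, and a fixed sleep per cycle are simplifying assumptions, not the patent's configuration.

```python
import threading
import time

def run_service(shared, infer_times, period=1.0, num_gpus=1, stop=None):
    """Wire the two modules of FIG. 1 together (illustrative glue code only)."""
    stop = stop or threading.Event()

    def prediction_loop():
        # throughput demand prediction module: refresh the hidden row each cycle
        while not stop.is_set():
            hidden = shared.models[(shared.si + 1) % 2]
            for i, q in enumerate(shared.request_queues):
                hidden[i].goodput = 0
                hidden[i].sim_solo = estimate_throughput_demand(
                    list(q), infer_times[i], period, num_gpus)
            shared.publish()           # make the new cycle visible to schedulers
            time.sleep(period)

    def scheduling_loop(gpu_id):
        # feedback-control throughput adjustment module, one loop per GPU
        while not stop.is_set():
            next_time = scheduling_pass(shared, infer_times, gpu_id)
            time.sleep(max(0.001, next_time - time.time()))  # avoid busy looping

    threads = [threading.Thread(target=prediction_loop, daemon=True)]
    threads += [threading.Thread(target=scheduling_loop, args=(g,), daemon=True)
                for g in range(num_gpus)]
    for t in threads:
        t.start()
    return stop, threads
```

Incoming client requests would simply be appended to shared.request_queues[target_model] by a separate receiver thread, mirroring the queue-insertion step described above.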
FIG. 2 is a schematic flow chart of the method of the present invention: the GPU scheduling optimization method for the deep learning inference service system comprises the following steps:
S1, initializing the scheduling optimization parameters, which specifically comprises:
The system defines global shared data comprising a model state array models[2][m], a model state index si, and a corresponding model state lock si_lock, and sets up a simulation queue and a request queue for each model. The models[2][m] array stores state information, which comprises each model's estimated throughput demand and its actual throughput in the current cycle; m is the number of models, and each column of the models[2][m] array corresponds to the state information of one model. Every element of models[2][m], as well as si, is initialized to 0, and si is restricted to the values 0 or 1. To avoid read-write conflicts on the model state between the prediction process and the scheduling process, the scheduling process reads and writes the model state stored in each column of models[2][m] only from the row indicated by si, while the prediction process reads and writes the model state stored in each column from the row indicated by si', where si' = (si + 1) % 2 and % is the remainder operator. The number of prediction processes executed in parallel is n and the number of scheduling processes executed in parallel is k; m, n, and k are natural numbers, m must be at least 1, and the specific values of n and k are set according to the hardware configuration of the system. Each prediction process or scheduling process is executed by one thread. Each model's request queue stores that model's inference requests awaiting response, and its simulation queue stores the inference requests used in the prediction flow of step S2;
S2, acquiring the tasks to be processed by the deep learning inference service system at the current moment, and predicting the throughput demand of each model among these tasks in the new cycle based on the system parameters of the deep learning inference service system, which specifically comprises the following steps:
FIG. 3 is a general flow diagram of throughput demand prediction. The throughput of a model refers to the number of inference requests the model responds to successfully; each inference request has a deadline, and a request is successfully responded to, and counted in the model's throughput, if and only if the client obtains the corresponding inference result before the request's deadline;
The throughput demand of each model in the new cycle is predicted using the following steps:
A. All models are assigned evenly to the prediction processes: the number of models in the system is m and the number of prediction processes executed in parallel is n, where m and n are natural numbers, m is at least 1, and the specific value of n is set according to the system's hardware configuration and is at least 1; each of the first n-1 prediction processes is assigned ⌈m/n⌉ models, where ⌈·⌉ denotes rounding up (the ceiling operator), and the last prediction process is assigned m % n models, where % is the remainder operation;
B. The prediction threads estimate the throughput demand of all models in the new cycle: for the current model, the prediction thread assumes that the model monopolizes the GPU computing resources and simulates scheduling of the model's uncompleted requests; it computes the number of inference requests that could be successfully responded to within one cycle under this simulated schedule and takes the result as the model's throughput demand in the new cycle;
C. Waiting until all prediction processes have completed their estimation tasks;
D. Acquiring the model state lock si_lock, updating the model state index si to si = si' so as to publish the model state data of the new cycle to the scheduling processes, and releasing the model state lock after the update;
E. Ending the current round of prediction;
FIG. 4 is a detailed flow chart of a prediction process estimating the throughput demand of a single model in the new cycle.
The detailed flow by which the prediction thread estimates the throughput demand of a single model specifically comprises the following steps:
(1) Copying all requests in the request queue of the current model into the model's simulation queue;
(2) Clearing the state data to be published for the current model: assuming the index of the current model is i, the element models[(si+1)%2][i] is the state data to be published for this model; each model's state data contains two member variables, sim_solo and goodput, where models[(si+1)%2][i].sim_solo is the sim_solo member variable of model i, recording the throughput demand estimated for the current model, and models[(si+1)%2][i].goodput is the goodput member variable of model i, recording the current model's actual throughput; both member variables of each model's state data are set to 0, clearing the state data to be published for the current model;
(3) Deleting from the current model's simulation queue the requests that cannot be completed by their own deadlines;
(4) Setting start and end variables that indicate the start time and end time of the new cycle, respectively: reading the current system time and assigning it to start, then assigning start plus the cycle duration to end;
(5) Assuming the system has N GPUs, setting a simulated scheduling time scheduled_time for each GPU and initializing each scheduled_time to start;
(6) Judging whether the simulation queue is empty: if it is empty, executing step (13); otherwise, executing step (7);
(7) Obtaining the minimum scheduled_time value among all the GPUs and assigning it to the variable min_scheduled_time, a temporary variable that records the minimum scheduled_time among all GPUs; denoting the GPU with the minimum scheduled_time as GPU_i;
(8) Judging whether min_scheduled_time ≥ end: if so, executing step (13); otherwise, executing step (9);
(9) Finding from the simulation queue the longest possible block of consecutive requests that satisfies:
min_scheduled_time + inferTime(batch_size) < deadline
where deadline is the deadline of the first request in the request block, batch_size is the number of requests in the block, and inferTime(batch_size) is the completion time for batch-executing batch_size requests of the current model; the formula judges whether the request block could be executed on the GPU as a batch if the current model monopolized the GPU;
(10) Updating the throughput of the current model under the simulated schedule to sim_solo + batch_size, where sim_solo is the member variable of the model's state data;
(11) Updating the simulated scheduling time scheduled_time of GPU_i to min_scheduled_time + inferTime(batch_size);
(12) Deleting from the simulation queue all requests in the consecutive request block of step (9), and jumping to step (6);
(13) Ending the prediction flow for the current model;
S3, starting a new cycle, and adjusting the throughput allocated to each model over the cycle duration based on the throughput demand of each model in the new cycle obtained in step S2, which specifically comprises the following steps:
As shown in FIG. 5, which is a schematic diagram of the throughput adjustment flow based on the feedback control strategy, the flow by which each scheduling thread performs a single throughput adjustment specifically comprises:
1) Acquiring the model state lock si_lock;
2) Copying all elements of the row indicated by the model state index si in the array models[2][m] into a local one-dimensional array ms, and then releasing the model state lock si_lock;
3) Traversing each element of the array ms and judging whether the actual throughput obtained by each model is below the standard value: if m_i satisfies the criterion [given only as an image formula in the original], the actual throughput obtained by that model is below the standard value, where m_i is the (i+1)-th element of the array ms;
4) Adding each model whose actual throughput is below the standard value to a list M whose initial value is empty; if no model is below the standard value, adding all models to the list M;
5) Searching the request queue of each model in the list M for the longest possible block of consecutive requests that satisfies:
curTime + inferTime(batch_size) < deadline
where arrival_time and deadline are the arrival time and the deadline of the first request in the request block, batch_size is the number of requests in the block, curTime is the current system time, and inferTime(batch_size) is the time to batch-execute batch_size requests of the current model; deadline - inferTime(batch_size) is the latest scheduling time at which all requests in the request block can still be successfully responded to; the time at which the scheduling process is next executed for the current GPU is then set to curTime + inferTime(batch_size);
6) Exiting the scheduling process.
One GoogLeNet model and one ResNet152 model were placed on a single NVIDIA Tesla-series GPU; each model was then loaded with 325 inference requests per second, and the per-second goodput (Goodput) of each model was measured under Clockwork and under the present method.
FIG. 6 compares the performance of Embodiment 1 of the invention with that of the prior art. The experimental results in FIG. 6 show that the Goodput of the GoogLeNet model is greatly improved while the relatively large ResNet152 model is barely affected, so the real-time requirements of more inference service requests are satisfied.
And S4, after the current cycle ends, repeating steps S2 to S3 until the deep learning inference service system stops running, completing the GPU scheduling optimization for the deep learning inference service system.

Claims (4)

1. A GPU scheduling optimization method for a deep learning inference service system, comprising the following steps:
S1, initializing scheduling optimization parameters;
S2, acquiring the tasks to be processed by the deep learning inference service system at the current time, and predicting the throughput demand of each model among these tasks in a new cycle based on system parameters of the deep learning inference service system;
S3, starting a new cycle, and adjusting the throughput allocated to each model over the cycle duration based on the throughput demand of each model in the new cycle obtained in step S2;
and S4, after the current cycle ends, repeating steps S2 to S3 until the deep learning inference service system stops running, completing the GPU scheduling optimization for the deep learning inference service system.
2. The GPU scheduling optimization method for a deep learning inference service system according to claim 1, wherein initializing the scheduling optimization parameters in step S1 specifically comprises: the system defines global shared data comprising a model state array models[2][m], a model state index si, and a corresponding model state lock si_lock, and sets up a simulation queue and a request queue for each model; the models[2][m] array stores state information, which comprises each model's estimated throughput demand and its actual throughput in the current cycle; m is the number of models, and each column of the models[2][m] array corresponds to the state information of one model; every element of models[2][m], as well as si, is initialized to 0, and si is restricted to the values 0 or 1; to avoid read-write conflicts on the model state between the prediction process and the scheduling process, the scheduling process reads and writes the model state stored in each column of models[2][m] only from the row indicated by si, while the prediction process reads and writes the model state stored in each column from the row indicated by si', where si' = (si + 1) % 2 and % is the remainder operator; the number of prediction processes executed in parallel is n and the number of scheduling processes executed in parallel is k; m, n, and k are natural numbers, m must be at least 1, and the specific values of n and k are set according to the hardware configuration of the system; each model's request queue stores that model's inference requests awaiting response, and its simulation queue stores the inference requests used in the prediction flow of step S2.
3. The GPU scheduling optimization method for a deep learning inference service system according to claim 2, wherein in step S2 the throughput demand of each model among the tasks to be processed in the new cycle is predicted based on the system parameters of the deep learning inference service system, specifically using the following steps:
A. All models are assigned evenly to the prediction processes: the number of models in the system is m and the number of prediction processes executed in parallel is n, where m and n are natural numbers, m is at least 1, and the specific value of n is set according to the system's hardware configuration and is at least 1; each of the first n-1 prediction processes is assigned ⌈m/n⌉ models, where ⌈·⌉ denotes rounding up (the ceiling operator), and the last prediction process is assigned m % n models, where % is the remainder operation;
B. Starting the prediction processes to estimate the throughput demand of all models in the new cycle: for the current model, the prediction process assumes that the model monopolizes the GPU computing resources and simulates scheduling of the model's uncompleted requests; it computes the number of inference requests that could be successfully responded to within one cycle under this simulated schedule and takes the result as the model's throughput demand in the new cycle;
C. Waiting until all prediction processes have completed their estimation tasks;
D. Acquiring the model state lock si_lock, updating the model state index si to si = si' so as to publish the model state data of the new cycle to the scheduling processes, and releasing the model state lock after the update;
E. Ending the current round of prediction;
The detailed flow by which a prediction process estimates the throughput demand of a single model specifically comprises the following steps:
(1) Copying all requests in the request queue of the current model into the model's simulation queue;
(2) Clearing the state data to be published for the current model: assuming the index of the current model is i, the element models[(si+1)%2][i] is the state data to be published for this model; each model's state data contains two member variables, sim_solo and goodput, where models[(si+1)%2][i].sim_solo is the sim_solo member variable of model i, recording the throughput demand estimated for the current model, and models[(si+1)%2][i].goodput is the goodput member variable of model i, recording the current model's actual throughput; both member variables of the model's state data are set to 0, clearing the state data to be published for the current model;
(3) Deleting from the current model's simulation queue the requests that cannot be completed by their own deadlines;
(4) Setting start and end variables that indicate the start time and end time of the new cycle, respectively: reading the current system time and assigning it to start, then assigning start plus the cycle duration to end;
(5) Assuming the system has N GPUs, setting a simulated scheduling time scheduled_time for each GPU and initializing each scheduled_time to start;
(6) Judging whether the simulation queue is empty: if it is empty, executing step (13); otherwise, executing step (7);
(7) Obtaining the minimum scheduled_time value among all the GPUs and assigning it to the variable min_scheduled_time, a temporary variable that records the minimum scheduled_time among all GPUs; denoting the GPU with the minimum scheduled_time as GPU_i;
(8) Judging whether min_scheduled_time ≥ end: if so, executing step (13); otherwise, executing step (9);
(9) Finding from the simulation queue the longest possible block of consecutive requests that satisfies:
min_scheduled_time + inferTime(batch_size) < deadline
where deadline is the deadline of the first request in the request block, batch_size is the number of requests in the block, and inferTime(batch_size) is the completion time for batch-executing batch_size requests of the current model; the formula judges whether the request block could be executed on the GPU as a batch if the current model monopolized the GPU;
(10) Updating the throughput of the current model under the simulated schedule to sim_solo + batch_size, where sim_solo is the member variable of the model's state data;
(11) Updating the simulated scheduling time scheduled_time of GPU_i to min_scheduled_time + inferTime(batch_size);
(12) Deleting from the simulation queue all requests in the consecutive request block of step (9), and jumping to step (6);
(13) Ending the prediction flow for the current model.
4. The GPU scheduling optimization method for a deep learning inference service system according to claim 3, wherein step S3 adjusts the throughput allocated to each model in the new cycle based on the throughput demand of each model in the new cycle obtained in step S2, specifically comprising the following steps:
1) Acquiring the model state lock si_lock;
2) Copying all elements of the row indicated by the model state index si in the array models[2][m] into a local one-dimensional array ms, and then releasing the model state lock si_lock;
3) Traversing each element of the array ms and judging whether the actual throughput obtained by each model is below the standard value: if m_i satisfies the criterion [given only as an image formula in the original], the actual throughput obtained by that model is below the standard value, where m_i is the (i+1)-th element of the array ms;
4) Adding each model whose actual throughput is below the standard value to a list M whose initial value is empty; if no model is below the standard value, adding all models to the list M;
5) Searching the request queue of each model in the list M for the longest possible block of consecutive requests that satisfies:
curTime + inferTime(batch_size) < deadline
where arrival_time and deadline are the arrival time and the deadline of the first request in the request block, batch_size is the number of requests in the block, curTime is the current system time, and inferTime(batch_size) is the time to batch-execute batch_size requests of the current model; deadline - inferTime(batch_size) is the latest scheduling time at which all requests in the request block can still be successfully responded to; the time at which the scheduling process is next executed for the current GPU is then set to curTime + inferTime(batch_size);
6) Exiting the scheduling process.
CN202211456890.8A 2022-11-21 2022-11-21 GPU scheduling optimization method for deep learning inference service system Pending CN115756789A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211456890.8A CN115756789A (en) 2022-11-21 2022-11-21 GPU scheduling optimization method for deep learning inference service system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211456890.8A CN115756789A (en) 2022-11-21 2022-11-21 GPU scheduling optimization method for deep learning inference service system

Publications (1)

Publication Number Publication Date
CN115756789A true CN115756789A (en) 2023-03-07

Family

ID=85333689

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211456890.8A Pending CN115756789A (en) 2022-11-21 2022-11-21 GPU scheduling optimization method for deep learning inference service system

Country Status (1)

Country Link
CN (1) CN115756789A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117349032A (en) * 2023-12-05 2024-01-05 城云科技(中国)有限公司 Method and device for improving throughput of large language model
CN117349032B (en) * 2023-12-05 2024-02-20 城云科技(中国)有限公司 Method and device for improving throughput of large language model

Similar Documents

Publication Publication Date Title
US11989647B2 (en) Self-learning scheduler for application orchestration on shared compute cluster
CN110737529B (en) Short-time multi-variable-size data job cluster scheduling adaptive configuration method
CN105956021B (en) A kind of automation task suitable for distributed machines study parallel method and its system
CN107888669B (en) Deep learning neural network-based large-scale resource scheduling system and method
WO2024060789A1 (en) Intelligent computing-oriented method, system and apparatus for scheduling distributed training tasks
CN111274036B (en) Scheduling method of deep learning task based on speed prediction
CN110135584B (en) Large-scale symbolic regression method and system based on adaptive parallel genetic algorithm
CN112416585A (en) GPU resource management and intelligent scheduling method for deep learning
CN116069512B (en) Serverless efficient resource allocation method and system based on reinforcement learning
CN114647515A (en) GPU cluster-oriented dynamic resource scheduling method
CN115756789A (en) GPU scheduling optimization method for deep learning inference service system
CN112732444A (en) Distributed machine learning-oriented data partitioning method
CN115437760A (en) Computing resource allocation method, electronic device, storage medium, and program product
CN115033357A (en) Micro-service workflow scheduling method and device based on dynamic resource selection strategy
CN115309521A (en) Marine unmanned equipment-oriented deep reinforcement learning task scheduling method and device
CN115564242A (en) Ship power equipment-oriented scheduling method and system for preemptible task maintenance personnel
CN111061565A (en) Two-stage pipeline task scheduling method and system in Spark environment
JP2000259605A (en) Simulation device
CN112862083A (en) Deep neural network inference method and device under edge environment
CN117539612A (en) AI training platform task scheduling method and system based on chaotic sparrow algorithm
CN115858112B (en) Constraint planning-based comprehensive avionics system task allocation and scheduling method
CN112070370A (en) Relay satellite task planning method, system and storage medium
US11513866B1 (en) Method and system for managing resource utilization based on reinforcement learning
CN114466014B (en) Service scheduling method and device, electronic equipment and storage medium
CN116010051A (en) Federal learning multitasking scheduling method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination