CN115756789A - GPU scheduling optimization method for deep learning inference service system - Google Patents

GPU scheduling optimization method for deep learning inference service system

Info

Publication number
CN115756789A
CN115756789A (application CN202211456890.8A)
Authority
CN
China
Prior art keywords
model
time
throughput
models
gpu
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211456890.8A
Other languages
Chinese (zh)
Inventor
彭亚琼 (Peng Yaqiong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University filed Critical Hunan University
Priority to CN202211456890.8A priority Critical patent/CN115756789A/en
Publication of CN115756789A publication Critical patent/CN115756789A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a GPU scheduling optimization method for a deep learning inference service system. The method initializes the deep learning inference service system; distributes all models contained in the initialized system across prediction threads, starts the prediction threads, and periodically executes a throughput demand prediction flow that predicts, for the models assigned to each prediction thread, their throughput demand in the new cycle; and starts a scheduling thread that, at each scheduling time, uses the predicted throughput demands to execute a throughput adjustment flow based on a feedback control strategy, optimizing the actual throughput distribution across the models in the system. The method dynamically predicts the throughput each model service would achieve if it monopolized the GPU and thus adapts effectively to complex and variable workloads; it satisfies the differing latency and throughput requirements of requests to models deployed on the same server, remedying the shortcomings of task scheduling strategies in existing model serving systems.

Description

GPU scheduling optimization method for deep learning inference service system
Technical Field
The invention belongs to the technical fields of computer architecture and artificial intelligence, and in particular relates to a GPU scheduling optimization method for a deep learning inference service system.
Background
With the explosive growth of data, improvements in core algorithms, and advances in hardware computing power, deep learning has brought immeasurable value to the development and application of artificial intelligence and has been widely applied in fields such as computer vision, speech recognition, and natural language processing. Deep learning mainly comprises two stages: training and inference. The training stage uses an optimization algorithm to continuously adjust the weights of a deep neural network (DNN) according to the input data; it is the process of constructing a model. Because of the large volume of input data and the large number of DNN weight parameters, training these DNN models typically requires substantial computing power and takes hours to days to complete.
Hardware accelerators such as GPUs (Graphics Processing Units) and TPUs (Tensor Processing Units), with their powerful capability for performing simple repetitive operations on data, are the mainstream hardware for training and accelerating DNN models. Once a DNN model has been trained, it can make predictions on input data, which begins the inference stage. DNN models are deployed in cloud data centers to provide inference services to tenants, so that mobile devices with limited computing resources can support deep learning applications through such services. Inference tasks were previously deployed mostly on CPUs (Central Processing Units). As DNN models have grown, inference tasks running on the CPU struggle to meet end-to-end real-time requirements (latency below 100 ms). Consequently, a large body of current work favors using GPUs to accelerate DNN inference tasks.
To improve GPU resource utilization, a common approach is to deploy multiple deep learning models on the same GPU server. In this scenario, requirements such as low inference-task latency and performance isolation among tenants pose new challenges for task scheduling mechanisms in a shared-GPU environment. Task scheduling in a shared-GPU environment currently falls mainly into space-sharing and time-sharing approaches. Space-sharing techniques, represented by NVIDIA Multi-Process Service (MPS), allow the GPU to run multiple tasks at any time, effectively improving the resource utilization of the inference service system. However, under a space-sharing policy, the performance of tasks running simultaneously is highly uncertain, and performance isolation and real-time requirements among tenants cannot be guaranteed. Under a time-sharing policy, the GPU runs only a single inference task in any scheduling unit, so the parallel execution capability of the GPU kernels is not fully exploited. Compared with space sharing, GPU resource utilization under time sharing is lower, but the execution time of inference tasks is more stable and the real-time requirements of inference service requests are better guaranteed. However, when heterogeneous models are deployed on the same server, the service performance of relatively small models is more easily interfered with, which seriously degrades the service experience of the corresponding tenants.
In summary, for accelerating DNN inference tasks, currently used shared-GPU environments cannot schedule multiple tasks well, and because of the resulting uncertainty and interference, the real-time requirements of inference service requests cannot be well guaranteed.
Disclosure of the invention
The invention aims to provide a GPU scheduling optimization method for a deep learning inference service system that supports multi-tenant isolation in a shared-GPU environment. Under complex and variable workloads, the method dynamically predicts, in real time, the throughput demand of each model service in the shared-GPU environment and performs dynamic real-time GPU scheduling based on the prediction results, thereby solving the problem of performance isolation among multiple tenants.
The GPU scheduling optimization method for the deep learning inference service system comprises the following steps:
S1, initializing scheduling optimization parameters;
S2, acquiring the tasks to be processed by the deep learning inference service system at the current moment, and predicting the throughput demand of each model among these tasks in a new cycle based on the system parameters of the deep learning inference service system; here, the throughput of a model refers to the number of inference requests the model responds to successfully; each inference request has a deadline, and a request is successfully responded to, and counted in the model's throughput, if and only if the client obtains the corresponding inference result before the request's deadline;
S3, starting a new cycle, and adjusting the throughput allocated to each model over the cycle duration based on the throughput demand of each model in the new cycle obtained in step S2;
and S4, after the current cycle ends, repeating steps S2-S3 until the deep learning inference service system stops running, completing the GPU scheduling optimization for the deep learning inference service system.
Initializing the scheduling optimization parameters in step S1 specifically comprises: the system defines global shared data comprising a model state array models[2][m], a model state index si, and a corresponding model state lock si_lock, and sets up a simulation queue and a request queue for each model. The models[2][m] array stores state information, which comprises each model's estimated throughput demand and its actual throughput in the current cycle; m is the number of models, and each column of the models[2][m] array corresponds to the state information of one model. Every element of models[2][m], as well as si, is initialized to 0, and si is restricted to the values 0 or 1. To avoid read-write conflicts on the model state between the prediction process and the scheduling process, the scheduling process reads and writes the model state stored in each column of models[2][m] only from the row indicated by si, while the prediction process reads and writes the model state stored in each column from the row indicated by si', where si' = (si + 1) % 2 and % is the remainder operator. The number of prediction processes executed in parallel is n and the number of scheduling processes executed in parallel is k; m, n, and k are natural numbers, m must be at least 1, and the specific values of n and k are set according to the hardware configuration of the system. Each model's request queue stores that model's inference requests awaiting response, and its simulation queue stores the inference requests used in the prediction flow of step S2.
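To make the double-buffered layout of this shared data concrete, the following is a minimal Python sketch; the class and attribute names (ModelState, SharedState, publish, and so on) are illustrative assumptions, not identifiers from the patent.

```python
import threading
from collections import deque
from dataclasses import dataclass

@dataclass
class ModelState:
    sim_solo: int = 0   # estimated throughput demand when the model runs alone
    goodput: int = 0    # actual throughput observed in the current cycle

class SharedState:
    """Double-buffered model state shared by prediction and scheduling processes."""

    def __init__(self, m: int):
        # models[2][m]: the scheduling processes read/write the row indexed by si,
        # the prediction processes prepare the other row, indexed by (si + 1) % 2
        self.models = [[ModelState() for _ in range(m)] for _ in range(2)]
        self.si = 0                      # model state index, restricted to 0 or 1
        self.si_lock = threading.Lock()  # protects si while the rows are swapped
        # one request queue (pending requests) and one simulation queue per model
        self.request_queues = [deque() for _ in range(m)]
        self.sim_queues = [deque() for _ in range(m)]

    def publish(self):
        # step D of the prediction flow: swap rows so the scheduler sees new data
        with self.si_lock:
            self.si = (self.si + 1) % 2
```

A deployment with m models would create SharedState(m) once at startup and hand the same object to every prediction and scheduling thread.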
In step S2, the throughput demand of each model among the tasks to be processed in the new cycle is predicted from the system parameters of the deep learning inference service system, specifically using the following steps:
A. All models are assigned evenly to the prediction processes: the number of models in the system is m and the number of prediction processes executed in parallel is n, where m and n are natural numbers, m is at least 1, and the specific value of n is set according to the system's hardware configuration and is at least 1; each of the first n-1 prediction processes is assigned ⌈m/n⌉ models, where ⌈·⌉ denotes rounding up (the ceiling operator), and the last prediction process is assigned m % n models, where % is the remainder operation (for example, with m = 7 and n = 3, the first two prediction processes each receive ⌈7/3⌉ = 3 models and the last receives 7 % 3 = 1);
B. Starting the prediction processes to estimate the throughput demand of all models in the new cycle: for the current model, the prediction process assumes that the model monopolizes the GPU computing resources and simulates scheduling of the model's uncompleted requests; it computes the number of inference requests that could be successfully responded to within one cycle under this simulated schedule and takes the result as the model's throughput demand in the new cycle;
C. Waiting until all prediction processes have completed their estimation tasks;
D. Acquiring the model state lock si_lock, updating the model state index si to si = si' so as to publish the model state data of the new cycle to the scheduling processes, and releasing the model state lock after the update;
E. Ending the current round of prediction;
The detailed flow by which a prediction process estimates the throughput demand of a single model specifically comprises the following steps:
(1) Copying all requests in the request queue of the current model into the model's simulation queue;
(2) Clearing the state data to be published for the current model: assuming the index of the current model is i, the element models[(si+1)%2][i] is the state data to be published for this model; each model's state data contains two member variables, sim_solo and goodput, where models[(si+1)%2][i].sim_solo is the sim_solo member variable of model i, recording the throughput demand estimated for the current model, and models[(si+1)%2][i].goodput is the goodput member variable of model i, recording the current model's actual throughput; both member variables of the model's state data are set to 0, clearing the state data to be published for the current model;
(3) Deleting from the current model's simulation queue the requests that cannot be completed by their own deadlines;
(4) Setting start and end variables that indicate the start time and end time of the new cycle, respectively: reading the current system time and assigning it to start, then assigning start plus the cycle duration to end;
(5) Assuming the system has N GPUs, setting a simulated scheduling time scheduled_time for each GPU and initializing each scheduled_time to start;
(6) Judging whether the simulation queue is empty: if it is empty, executing step (13); otherwise, executing step (7);
(7) Obtaining the minimum scheduled_time value among all the GPUs and assigning it to the variable min_scheduled_time, a temporary variable that records the minimum scheduled_time among all GPUs; denoting the GPU with the minimum scheduled_time as GPU_i;
(8) Judging whether min_scheduled_time ≥ end: if so, executing step (13); otherwise, executing step (9);
(9) Finding from the simulation queue the longest possible block of consecutive requests that satisfies:
min_scheduled_time + inferTime(batch_size) < deadline
where deadline is the deadline of the first request in the request block, batch_size is the number of requests in the block, and inferTime(batch_size) is the completion time for batch-executing batch_size requests of the current model; the formula judges whether the request block could be executed on the GPU as a batch if the current model monopolized the GPU;
(10) Updating the throughput of the current model under the simulated schedule to sim_solo + batch_size, where sim_solo is the member variable of the model's state data;
(11) Updating the simulated scheduling time scheduled_time of GPU_i to min_scheduled_time + inferTime(batch_size);
(12) Deleting from the simulation queue all requests in the consecutive request block of step (9), and jumping to step (6);
(13) Ending the prediction flow for the current model.
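A compact Python sketch of this single-model estimation is given below, assuming each pending request is represented simply by its absolute deadline and inferTime is available as a callable infer_time(batch_size) returning seconds; all names, and the handling of a batch of size zero, are illustrative assumptions rather than the patent's exact procedure.

```python
import time

def estimate_throughput_demand(pending_deadlines, infer_time, period, num_gpus):
    """Simulate scheduling as if the current model monopolized every GPU and
    return the number of requests that could be answered within one cycle."""
    start = time.time()
    end = start + period
    # steps (1)-(3): copy the pending requests (here, their absolute deadlines)
    # and drop any request that cannot finish by its deadline even if run now
    sim_queue = [d for d in sorted(pending_deadlines) if start + infer_time(1) < d]
    sched_time = [start] * num_gpus            # step (5): per-GPU simulated clock
    sim_solo = 0

    while sim_queue:                                            # step (6)
        g = min(range(num_gpus), key=lambda i: sched_time[i])   # step (7)
        if sched_time[g] >= end:                                # step (8)
            break
        # step (9): grow the batch while the first request's deadline still holds
        first_deadline = sim_queue[0]
        batch_size = 0
        while (batch_size < len(sim_queue)
               and sched_time[g] + infer_time(batch_size + 1) < first_deadline):
            batch_size += 1
        if batch_size == 0:
            # head request is no longer feasible at this simulated time; skip it
            # (an assumption; the patent removes infeasible requests up front)
            sim_queue.pop(0)
            continue
        sim_solo += batch_size                                  # step (10)
        sched_time[g] += infer_time(batch_size)                 # step (11)
        del sim_queue[:batch_size]                              # step (12)
    return sim_solo                                             # step (13)
```

A prediction process would call this once per assigned model, for example demand = estimate_throughput_demand(list(q), lambda b: 0.005 + 0.01 * b, period=1.0, num_gpus=2), and write the result into that model's sim_solo field in the row being prepared.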
In step S3, the throughput allocated to each model in the new cycle is adjusted based on the throughput demand of each model in the new cycle obtained in step S2, specifically comprising the following steps:
1) Acquiring the model state lock si_lock;
2) Copying all elements of the row indicated by the model state index si in the array models[2][m] into a local one-dimensional array ms, and then releasing the model state lock si_lock;
3) Traversing each element of the array ms and judging whether the actual throughput obtained by each model is below the standard value: if m_i satisfies the criterion [given only as an image formula in the original], the actual throughput obtained by that model is below the standard value, where m_i is the (i+1)-th element of the array ms;
4) Adding each model whose actual throughput is below the standard value to a list M whose initial value is empty; if no model is below the standard value, adding all models to the list M;
5) Searching the request queue of each model in the list M for the longest possible block of consecutive requests that satisfies:
curTime + inferTime(batch_size) < deadline
where arrival_time and deadline are the arrival time and the deadline of the first request in the request block, batch_size is the number of requests in the block, curTime is the current system time, and inferTime(batch_size) is the time to batch-execute batch_size requests of the current model; deadline - inferTime(batch_size) is the latest scheduling time at which all requests in the request block can still be successfully responded to; the time at which the scheduling process is next executed for the current GPU is then set to curTime + inferTime(batch_size);
6) Exiting the scheduling process.
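Continuing the sketch, one scheduling pass might look as follows in Python, reusing the SharedState and ModelState classes above; because the 'standard value' test appears only as an image in the source, the ratio criterion used here is an assumption, as are the order in which candidate models are served and the per-model latency callables infer_times[i].

```python
import time

def scheduling_pass(shared, infer_times, gpu_id):
    """One feedback-control adjustment step for a single GPU (illustrative)."""
    # steps 1)-2): snapshot the model states published for the scheduler
    with shared.si_lock:
        si = shared.si
        ms = [ModelState(s.sim_solo, s.goodput) for s in shared.models[si]]

    # steps 3)-4): collect models whose actual throughput is below the standard
    # value; the patent gives this test only as an image, so the ratio criterion
    # below (goodput/sim_solo share under the average share) is an assumption
    shares = [m.goodput / m.sim_solo for m in ms if m.sim_solo > 0]
    standard = sum(shares) / len(shares) if shares else 0.0
    M = [i for i, m in enumerate(ms)
         if m.sim_solo > 0 and m.goodput / m.sim_solo < standard]
    if not M:
        M = list(range(len(ms)))

    # step 5): batch as many queued requests of a candidate model as the first
    # request's deadline allows, dispatch them, and set the next scheduling time
    cur_time = time.time()
    for i in M:
        queue = shared.request_queues[i]
        if not queue:
            continue
        first_deadline = queue[0]
        batch_size = 0
        while (batch_size < len(queue)
               and cur_time + infer_times[i](batch_size + 1) < first_deadline):
            batch_size += 1
        if batch_size:
            batch = [queue.popleft() for _ in range(batch_size)]
            # launch_batch(gpu_id, i, batch)   # actual GPU execution is omitted
            shared.models[si][i].goodput += batch_size   # record actual throughput
            return cur_time + infer_times[i](batch_size)  # next scheduling time
    return cur_time   # step 6): nothing schedulable this pass
```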
The GPU scheduling optimization method for a deep learning inference service system provided by the invention initializes the deep learning inference service system and, based on the real-time load assigned to each model in the shared-GPU environment, dynamically predicts the throughput each model service would achieve when monopolizing the GPU; the prediction adapts effectively to complex and variable workloads without introducing an additional offline prediction process. At the same time, using each model's stand-alone throughput as the yardstick, the method adjusts the weights of GPU resource allocation among heterogeneous models, satisfies the differing latency and throughput requirements of requests to models deployed on the same server, and remedies the shortcomings of task scheduling strategies in existing model serving systems with respect to performance isolation for heterogeneous model requests.
Drawings
Fig. 1 is a diagram of a GPU scheduling framework for a deep learning inference service system.
FIG. 2 is a schematic flow chart of the method of the present invention.
Fig. 3 is a general flow diagram of throughput demand prediction.
FIG. 4 is a detailed flow diagram of a prediction thread estimating throughput requirements for a single model in a new cycle.
Fig. 5 is a schematic diagram of a throughput adjustment process based on a feedback control strategy.
FIG. 6 is a comparison of the performance of example 1 of the present invention with that of the prior art.
Detailed Description
FIG. 1 shows the GPU scheduling framework for the deep learning inference service system. The system mainly comprises a throughput demand prediction module and a throughput adjustment module based on a feedback control strategy; the two parts interact with each other to carry out the overall operation jointly. The scheduling system maintains an inference request queue for each model, continuously receives inference requests from clients, and inserts them into the request queue of the target model. In a scenario where multiple models share the GPU, the throughput demand prediction module periodically estimates the throughput each model would achieve if it ran alone under the current workload, then determines the ratio of throughput demands among the models based on the estimates, providing guidance to the throughput adjustment module based on the feedback control strategy. The throughput adjustment module based on the feedback control strategy dynamically monitors the actual throughput of each model in the current cycle and optimizes the actual throughput distribution at fine granularity based on the ratio of throughput demands among the models, minimizing the disparity in performance loss that GPU sharing imposes on heterogeneous models and guaranteeing performance isolation among them.
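As a rough illustration of how these two modules cooperate around the per-model request queues, the following Python glue ties together the sketches introduced earlier (SharedState, estimate_throughput_demand, scheduling_pass); using a single prediction thread, one scheduling thread per GPU, and a fixed sleep per cycle are simplifying assumptions, not the patent's configuration.

```python
import threading
import time

def run_service(shared, infer_times, period=1.0, num_gpus=1, stop=None):
    """Wire the two modules of FIG. 1 together (illustrative glue code only)."""
    stop = stop or threading.Event()

    def prediction_loop():
        # throughput demand prediction module: refresh the hidden row each cycle
        while not stop.is_set():
            hidden = shared.models[(shared.si + 1) % 2]
            for i, q in enumerate(shared.request_queues):
                hidden[i].goodput = 0
                hidden[i].sim_solo = estimate_throughput_demand(
                    list(q), infer_times[i], period, num_gpus)
            shared.publish()           # make the new cycle visible to schedulers
            time.sleep(period)

    def scheduling_loop(gpu_id):
        # feedback-control throughput adjustment module, one loop per GPU
        while not stop.is_set():
            next_time = scheduling_pass(shared, infer_times, gpu_id)
            time.sleep(max(0.001, next_time - time.time()))  # avoid busy looping

    threads = [threading.Thread(target=prediction_loop, daemon=True)]
    threads += [threading.Thread(target=scheduling_loop, args=(g,), daemon=True)
                for g in range(num_gpus)]
    for t in threads:
        t.start()
    return stop, threads
```

Incoming client requests would simply be appended to shared.request_queues[target_model] by a separate receiver thread, mirroring the queue-insertion step described above.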
FIG. 2 is a schematic flow chart of the method of the present invention: the GPU scheduling optimization method for the deep learning inference service system comprises the following steps:
S1, initializing the scheduling optimization parameters, which specifically comprises:
The system defines global shared data comprising a model state array models[2][m], a model state index si, and a corresponding model state lock si_lock, and sets up a simulation queue and a request queue for each model. The models[2][m] array stores state information, which comprises each model's estimated throughput demand and its actual throughput in the current cycle; m is the number of models, and each column of the models[2][m] array corresponds to the state information of one model. Every element of models[2][m], as well as si, is initialized to 0, and si is restricted to the values 0 or 1. To avoid read-write conflicts on the model state between the prediction process and the scheduling process, the scheduling process reads and writes the model state stored in each column of models[2][m] only from the row indicated by si, while the prediction process reads and writes the model state stored in each column from the row indicated by si', where si' = (si + 1) % 2 and % is the remainder operator. The number of prediction processes executed in parallel is n and the number of scheduling processes executed in parallel is k; m, n, and k are natural numbers, m must be at least 1, and the specific values of n and k are set according to the hardware configuration of the system. Each prediction process or scheduling process is executed by one thread. Each model's request queue stores that model's inference requests awaiting response, and its simulation queue stores the inference requests used in the prediction flow of step S2;
S2, acquiring the tasks to be processed by the deep learning inference service system at the current moment, and predicting the throughput demand of each model among these tasks in the new cycle based on the system parameters of the deep learning inference service system, which specifically comprises the following steps:
FIG. 3 is a general flow diagram of throughput demand prediction. The throughput of a model refers to the number of inference requests the model responds to successfully; each inference request has a deadline, and a request is successfully responded to, and counted in the model's throughput, if and only if the client obtains the corresponding inference result before the request's deadline;
The throughput demand of each model in the new cycle is predicted using the following steps:
A. All models are assigned evenly to the prediction processes: the number of models in the system is m and the number of prediction processes executed in parallel is n, where m and n are natural numbers, m is at least 1, and the specific value of n is set according to the system's hardware configuration and is at least 1; each of the first n-1 prediction processes is assigned ⌈m/n⌉ models, where ⌈·⌉ denotes rounding up (the ceiling operator), and the last prediction process is assigned m % n models, where % is the remainder operation;
B. The prediction threads estimate the throughput demand of all models in the new cycle: for the current model, the prediction thread assumes that the model monopolizes the GPU computing resources and simulates scheduling of the model's uncompleted requests; it computes the number of inference requests that could be successfully responded to within one cycle under this simulated schedule and takes the result as the model's throughput demand in the new cycle;
C. Waiting until all prediction processes have completed their estimation tasks;
D. Acquiring the model state lock si_lock, updating the model state index si to si = si' so as to publish the model state data of the new cycle to the scheduling processes, and releasing the model state lock after the update;
E. Ending the current round of prediction;
FIG. 4 is a detailed flow chart of a prediction process estimating the throughput demand of a single model in the new cycle.
The detailed flow by which the prediction thread estimates the throughput demand of a single model specifically comprises the following steps:
(1) Copying all requests in the request queue of the current model into the model's simulation queue;
(2) Clearing the state data to be published for the current model: assuming the index of the current model is i, the element models[(si+1)%2][i] is the state data to be published for this model; each model's state data contains two member variables, sim_solo and goodput, where models[(si+1)%2][i].sim_solo is the sim_solo member variable of model i, recording the throughput demand estimated for the current model, and models[(si+1)%2][i].goodput is the goodput member variable of model i, recording the current model's actual throughput; both member variables of each model's state data are set to 0, clearing the state data to be published for the current model;
(3) Deleting from the current model's simulation queue the requests that cannot be completed by their own deadlines;
(4) Setting start and end variables that indicate the start time and end time of the new cycle, respectively: reading the current system time and assigning it to start, then assigning start plus the cycle duration to end;
(5) Assuming the system has N GPUs, setting a simulated scheduling time scheduled_time for each GPU and initializing each scheduled_time to start;
(6) Judging whether the simulation queue is empty: if it is empty, executing step (13); otherwise, executing step (7);
(7) Obtaining the minimum scheduled_time value among all the GPUs and assigning it to the variable min_scheduled_time, a temporary variable that records the minimum scheduled_time among all GPUs; denoting the GPU with the minimum scheduled_time as GPU_i;
(8) Judging whether min_scheduled_time ≥ end: if so, executing step (13); otherwise, executing step (9);
(9) Finding from the simulation queue the longest possible block of consecutive requests that satisfies:
min_scheduled_time + inferTime(batch_size) < deadline
where deadline is the deadline of the first request in the request block, batch_size is the number of requests in the block, and inferTime(batch_size) is the completion time for batch-executing batch_size requests of the current model; the formula judges whether the request block could be executed on the GPU as a batch if the current model monopolized the GPU;
(10) Updating the throughput of the current model under the simulated schedule to sim_solo + batch_size, where sim_solo is the member variable of the model's state data;
(11) Updating the simulated scheduling time scheduled_time of GPU_i to min_scheduled_time + inferTime(batch_size);
(12) Deleting from the simulation queue all requests in the consecutive request block of step (9), and jumping to step (6);
(13) Ending the prediction flow for the current model;
S3, starting a new cycle, and adjusting the throughput allocated to each model over the cycle duration based on the throughput demand of each model in the new cycle obtained in step S2, which specifically comprises the following steps:
As shown in FIG. 5, which is a schematic diagram of the throughput adjustment flow based on the feedback control strategy, the flow by which each scheduling thread performs a single throughput adjustment specifically comprises:
1) Acquiring the model state lock si_lock;
2) Copying all elements of the row indicated by the model state index si in the array models[2][m] into a local one-dimensional array ms, and then releasing the model state lock si_lock;
3) Traversing each element of the array ms and judging whether the actual throughput obtained by each model is below the standard value: if m_i satisfies the criterion [given only as an image formula in the original], the actual throughput obtained by that model is below the standard value, where m_i is the (i+1)-th element of the array ms;
4) Adding each model whose actual throughput is below the standard value to a list M whose initial value is empty; if no model is below the standard value, adding all models to the list M;
5) Searching the request queue of each model in the list M for the longest possible block of consecutive requests that satisfies:
curTime + inferTime(batch_size) < deadline
where arrival_time and deadline are the arrival time and the deadline of the first request in the request block, batch_size is the number of requests in the block, curTime is the current system time, and inferTime(batch_size) is the time to batch-execute batch_size requests of the current model; deadline - inferTime(batch_size) is the latest scheduling time at which all requests in the request block can still be successfully responded to; the time at which the scheduling process is next executed for the current GPU is then set to curTime + inferTime(batch_size);
6) Exiting the scheduling process.
One GoogLeNet model and one ResNet152 model were placed on a single NVIDIA Tesla-series GPU; each model was then loaded with 325 inference requests per second, and the per-second goodput (Goodput) of each model was measured under Clockwork and under the present method.
FIG. 6 compares the performance of Embodiment 1 of the invention with that of the prior art. The experimental results in FIG. 6 show that the Goodput of the GoogLeNet model is greatly improved while the relatively large ResNet152 model is barely affected, so the real-time requirements of more inference service requests are satisfied.
And S4, after the current cycle ends, repeating steps S2 to S3 until the deep learning inference service system stops running, completing the GPU scheduling optimization for the deep learning inference service system.

Claims (4)

1. A GPU scheduling optimization method for a deep learning inference service system, comprising the following steps:
S1, initializing scheduling optimization parameters;
S2, acquiring the tasks to be processed by the deep learning inference service system at the current time, and predicting the throughput demand of each model among these tasks in a new cycle based on system parameters of the deep learning inference service system;
S3, starting a new cycle, and adjusting the throughput allocated to each model over the cycle duration based on the throughput demand of each model in the new cycle obtained in step S2;
and S4, after the current cycle ends, repeating steps S2 to S3 until the deep learning inference service system stops running, completing the GPU scheduling optimization for the deep learning inference service system.
2. The GPU scheduling optimization method for a deep learning inference service system according to claim 1, wherein initializing the scheduling optimization parameters in step S1 specifically comprises: the system defines global shared data comprising a model state array models[2][m], a model state index si, and a corresponding model state lock si_lock, and sets up a simulation queue and a request queue for each model; the models[2][m] array stores state information, which comprises each model's estimated throughput demand and its actual throughput in the current cycle; m is the number of models, and each column of the models[2][m] array corresponds to the state information of one model; every element of models[2][m], as well as si, is initialized to 0, and si is restricted to the values 0 or 1; to avoid read-write conflicts on the model state between the prediction process and the scheduling process, the scheduling process reads and writes the model state stored in each column of models[2][m] only from the row indicated by si, while the prediction process reads and writes the model state stored in each column from the row indicated by si', where si' = (si + 1) % 2 and % is the remainder operator; the number of prediction processes executed in parallel is n and the number of scheduling processes executed in parallel is k; m, n, and k are natural numbers, m must be at least 1, and the specific values of n and k are set according to the hardware configuration of the system; each model's request queue stores that model's inference requests awaiting response, and its simulation queue stores the inference requests used in the prediction flow of step S2.
3. The GPU scheduling optimization method for a deep learning inference service system according to claim 2, wherein in step S2 the throughput demand of each model among the tasks to be processed in the new cycle is predicted based on the system parameters of the deep learning inference service system, specifically using the following steps:
A. All models are assigned evenly to the prediction processes: the number of models in the system is m and the number of prediction processes executed in parallel is n, where m and n are natural numbers, m is at least 1, and the specific value of n is set according to the system's hardware configuration and is at least 1; each of the first n-1 prediction processes is assigned ⌈m/n⌉ models, where ⌈·⌉ denotes rounding up (the ceiling operator), and the last prediction process is assigned m % n models, where % is the remainder operation;
B. Starting the prediction processes to estimate the throughput demand of all models in the new cycle: for the current model, the prediction process assumes that the model monopolizes the GPU computing resources and simulates scheduling of the model's uncompleted requests; it computes the number of inference requests that could be successfully responded to within one cycle under this simulated schedule and takes the result as the model's throughput demand in the new cycle;
C. Waiting until all prediction processes have completed their estimation tasks;
D. Acquiring the model state lock si_lock, updating the model state index si to si = si' so as to publish the model state data of the new cycle to the scheduling processes, and releasing the model state lock after the update;
E. Ending the current round of prediction;
The detailed flow by which a prediction process estimates the throughput demand of a single model specifically comprises the following steps:
(1) Copying all requests in the request queue of the current model into the model's simulation queue;
(2) Clearing the state data to be published for the current model: assuming the index of the current model is i, the element models[(si+1)%2][i] is the state data to be published for this model; each model's state data contains two member variables, sim_solo and goodput, where models[(si+1)%2][i].sim_solo is the sim_solo member variable of model i, recording the throughput demand estimated for the current model, and models[(si+1)%2][i].goodput is the goodput member variable of model i, recording the current model's actual throughput; both member variables of the model's state data are set to 0, clearing the state data to be published for the current model;
(3) Deleting from the current model's simulation queue the requests that cannot be completed by their own deadlines;
(4) Setting start and end variables that indicate the start time and end time of the new cycle, respectively: reading the current system time and assigning it to start, then assigning start plus the cycle duration to end;
(5) Assuming the system has N GPUs, setting a simulated scheduling time scheduled_time for each GPU and initializing each scheduled_time to start;
(6) Judging whether the simulation queue is empty: if it is empty, executing step (13); otherwise, executing step (7);
(7) Obtaining the minimum scheduled_time value among all the GPUs and assigning it to the variable min_scheduled_time, a temporary variable that records the minimum scheduled_time among all GPUs; denoting the GPU with the minimum scheduled_time as GPU_i;
(8) Judging whether min_scheduled_time ≥ end: if so, executing step (13); otherwise, executing step (9);
(9) Finding from the simulation queue the longest possible block of consecutive requests that satisfies:
min_scheduled_time + inferTime(batch_size) < deadline
where deadline is the deadline of the first request in the request block, batch_size is the number of requests in the block, and inferTime(batch_size) is the completion time for batch-executing batch_size requests of the current model; the formula judges whether the request block could be executed on the GPU as a batch if the current model monopolized the GPU;
(10) Updating the throughput of the current model under the simulated schedule to sim_solo + batch_size, where sim_solo is the member variable of the model's state data;
(11) Updating the simulated scheduling time scheduled_time of GPU_i to min_scheduled_time + inferTime(batch_size);
(12) Deleting from the simulation queue all requests in the consecutive request block of step (9), and jumping to step (6);
(13) Ending the prediction flow for the current model.
4. The GPU scheduling optimization method for a deep learning inference service system according to claim 3, wherein step S3 adjusts the throughput allocated to each model in the new cycle based on the throughput demand of each model in the new cycle obtained in step S2, specifically comprising the following steps:
1) Acquiring the model state lock si_lock;
2) Copying all elements of the row indicated by the model state index si in the array models[2][m] into a local one-dimensional array ms, and then releasing the model state lock si_lock;
3) Traversing each element of the array ms and judging whether the actual throughput obtained by each model is below the standard value: if m_i satisfies the criterion [given only as an image formula in the original], the actual throughput obtained by that model is below the standard value, where m_i is the (i+1)-th element of the array ms;
4) Adding each model whose actual throughput is below the standard value to a list M whose initial value is empty; if no model is below the standard value, adding all models to the list M;
5) Searching the request queue of each model in the list M for the longest possible block of consecutive requests that satisfies:
curTime + inferTime(batch_size) < deadline
where arrival_time and deadline are the arrival time and the deadline of the first request in the request block, batch_size is the number of requests in the block, curTime is the current system time, and inferTime(batch_size) is the time to batch-execute batch_size requests of the current model; deadline - inferTime(batch_size) is the latest scheduling time at which all requests in the request block can still be successfully responded to; the time at which the scheduling process is next executed for the current GPU is then set to curTime + inferTime(batch_size);
6) Exiting the scheduling process.
CN202211456890.8A 2022-11-21 2022-11-21 GPU scheduling optimization method for deep learning inference service system Pending CN115756789A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211456890.8A CN115756789A (en) 2022-11-21 2022-11-21 GPU scheduling optimization method for deep learning inference service system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211456890.8A CN115756789A (en) 2022-11-21 2022-11-21 GPU scheduling optimization method for deep learning inference service system

Publications (1)

Publication Number Publication Date
CN115756789A true CN115756789A (en) 2023-03-07

Family

ID=85333689

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211456890.8A Pending CN115756789A (en) 2022-11-21 2022-11-21 GPU scheduling optimization method for deep learning inference service system

Country Status (1)

Country Link
CN (1) CN115756789A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117349032A (en) * 2023-12-05 2024-01-05 城云科技(中国)有限公司 Method and device for improving throughput of large language model
CN117349032B (en) * 2023-12-05 2024-02-20 城云科技(中国)有限公司 Method and device for improving throughput of large language model

Similar Documents

Publication Publication Date Title
US11989647B2 (en) Self-learning scheduler for application orchestration on shared compute cluster
CN110737529B (en) Short-time multi-variable-size data job cluster scheduling adaptive configuration method
CN105956021B (en) A kind of automation task suitable for distributed machines study parallel method and its system
CN107888669B (en) Deep learning neural network-based large-scale resource scheduling system and method
WO2024060789A1 (en) Intelligent computing-oriented method, system and apparatus for scheduling distributed training tasks
CN111274036B (en) Scheduling method of deep learning task based on speed prediction
CN110135584B (en) Large-scale symbolic regression method and system based on adaptive parallel genetic algorithm
CN112416585A (en) GPU resource management and intelligent scheduling method for deep learning
CN116069512B (en) Serverless efficient resource allocation method and system based on reinforcement learning
CN114647515A (en) GPU cluster-oriented dynamic resource scheduling method
CN115756789A (en) GPU scheduling optimization method for deep learning inference service system
CN112732444A (en) Distributed machine learning-oriented data partitioning method
CN115437760A (en) Computing resource allocation method, electronic device, storage medium, and program product
CN115033357A (en) Micro-service workflow scheduling method and device based on dynamic resource selection strategy
CN115309521A (en) Marine unmanned equipment-oriented deep reinforcement learning task scheduling method and device
CN115564242A (en) Ship power equipment-oriented scheduling method and system for preemptible task maintenance personnel
CN111061565A (en) Two-stage pipeline task scheduling method and system in Spark environment
JP2000259605A (en) Simulation device
CN112862083A (en) Deep neural network inference method and device under edge environment
CN117539612A (en) AI training platform task scheduling method and system based on chaotic sparrow algorithm
CN115858112B (en) Constraint planning-based comprehensive avionics system task allocation and scheduling method
CN112070370A (en) Relay satellite task planning method, system and storage medium
US11513866B1 (en) Method and system for managing resource utilization based on reinforcement learning
CN114466014B (en) Service scheduling method and device, electronic equipment and storage medium
CN116010051A (en) Federal learning multitasking scheduling method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination