CN114217966A - Deep learning model dynamic batch processing scheduling method and system based on resource adjustment - Google Patents

Deep learning model dynamic batch processing scheduling method and system based on resource adjustment

Info

Publication number
CN114217966A
CN114217966A (Application CN202111543693.5A)
Authority
CN
China
Prior art keywords
scheduling
queue
task
gpu
batch
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111543693.5A
Other languages
Chinese (zh)
Inventor
陈伟睿
蒋昌龙
冯奕乐
王子龙
张政
丁晓伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Tisu Information Technology Co ltd
Original Assignee
Shanghai Tisu Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Tisu Information Technology Co ltd filed Critical Shanghai Tisu Information Technology Co ltd
Priority to CN202111543693.5A
Publication of CN114217966A
Legal status: Pending (Current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F 9/5038 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F 9/4806 Task transfer initiation or dispatching
    • G06F 9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F 9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/54 Interprogram communication
    • G06F 9/546 Message passing systems or structures, e.g. queues
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 Computing arrangements using knowledge-based models
    • G06N 5/04 Inference or reasoning models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2209/00 Indexing scheme relating to G06F9/00
    • G06F 2209/48 Indexing scheme relating to G06F9/48
    • G06F 2209/484 Precedence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2209/00 Indexing scheme relating to G06F9/00
    • G06F 2209/50 Indexing scheme relating to G06F9/50
    • G06F 2209/5021 Priority
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2209/00 Indexing scheme relating to G06F9/00
    • G06F 2209/54 Indexing scheme relating to G06F9/54
    • G06F 2209/548 Queue

Abstract

The invention provides a deep learning model dynamic batch processing scheduling method and system based on resource adjustment, comprising the following steps: step 1: split the deep learning inference job into tasks according to the type of resource whose consumption is estimated at runtime; step 2: dynamically batch the tasks according to the resource type of each deep learning inference task and the current resource situation of the deployment environment, stop scheduling when a preset stop condition is met, and trigger a new round of scheduling when a new inference job is received or the available resources change. At each scheduling point during operation, the method derives the batch size for a given inference task from the deployment resources available at that moment and schedules it for execution, so the batch size can be adjusted dynamically to the real-time deployment resources. Batching improves operating efficiency and keeps the deployment resources fully utilized in real time, thereby increasing the throughput of inference job processing in scenarios where resources change dynamically.

Description

Deep learning model dynamic batch processing scheduling method and system based on resource adjustment
Technical Field
The invention relates to the technical field of scheduling and inference deployment of deep learning models, and in particular to a deep learning model dynamic batch processing scheduling method and system based on resource adjustment.
Background
With advances in computer hardware and the rapid development of deep learning, neural network models are being applied in fields such as healthcare, retail, and industry. Successfully applying a deep learning model in a business setting depends on several links in a chain: besides model training, the trained model usually has to be optimized and deployed for its usage scenario. The user sends data to the deployed model, the input passes through the model's inference computation, and the user receives the corresponding output.
To optimize the model inference process, model inference engines aimed at industrial deployment are continually being developed. For example, NVIDIA provides the Triton inference server optimized for NVIDIA GPUs, and Intel provides OpenVINO, an inference toolkit for Intel hardware. Deep learning frameworks such as TensorFlow, PyTorch, and MXNet also provide inference deployment tools for their own or general model formats. After receiving model inference requests, such engines merge requests according to a specified batch size and queue waiting time and run batched inference. Thanks to GPU architectural features and parallel instruction execution, running model inference on a GPU in batch mode typically feeds request data into the model as a batch, at the cost of video memory consumption multiplied by the batch size during execution, and thereby increases inference throughput.
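As background context only, request-level batching in such engines can be pictured with the following minimal sketch; the function and parameter names are illustrative and not taken from Triton, OpenVINO, or any other engine:

```python
import queue
import time

def collect_batch(requests: "queue.Queue", max_batch: int, max_wait_s: float) -> list:
    """Merge incoming inference requests into one batch, bounded by a maximum
    batch size and a maximum queue waiting time."""
    batch = []
    deadline = time.time() + max_wait_s
    while len(batch) < max_batch:
        remaining = deadline - time.time()
        if remaining <= 0:
            break                       # waited long enough; run whatever we have
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break                       # no more requests arrived in time
    return batch
```

The key point for what follows is that the batch size cap in such schemes is fixed in advance rather than derived from the resources available at run time.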
In existing medical image inference scenarios, however, a single deep learning job is completed by several large deep model inference tasks with complex dependencies, and it includes multiple steps such as pre-processing, model inference, and post-processing, with complicated video memory footprints and dependency relationships. In medical deployment settings such as hospitals, the hardware and network resources of the inference environment are generally fixed, while the types of deep learning jobs keep being upgraded and expanded. GPU video memory is limited and hard to expand, so batched inference is constrained by video memory: guaranteeing that all models can run simultaneously at a preset batch size keeps inference throughput low, while turning batching off leaves the loaded models and the remaining video memory underused when certain deep learning jobs run. In addition, the deep learning inference jobs received at any given time in this scenario are uncertain in both number and kind. Only a reasonable dynamic batching strategy can improve the efficiency of batched deep learning job execution in such an environment.
The batching scheme adopted in CN111523670 mainly targets multiple inference requests for a single inference task and proposes a method for merging requests into batches. However, the batch size used when multiple inference task models are deployed does not change with the resource situation of the environment. When multiple inference models are deployed, the operating environment usually has to be large enough to run all models at their preset batch sizes for inference to proceed effectively, so its requirements on environment resources are stricter than those of the present patent.
CN112860402 provides a method for dynamically adjusting the batch size, which iteratively adjusts it according to historical throughput and a probability distribution. In contrast, the present invention's dynamic batching scheme makes full use of the current resource limits of the environment and of real-time information such as the state of tasks in the queues, which is more flexible and of practical value in scenarios where resources are limited or change dynamically and where the task volume is large or changes abruptly.
Patent document CN112860402A (application number: CN202110192645.X) discloses a dynamic batch task scheduling method and system for deep learning inference services. The method describes the number of queued tasks and the departing batch size at each batch departure time with a two-dimensional Markov process, determines its steady-state probabilities, and from them determines the average service delay of the deep learning inference service system; it then builds an optimization model over the upper bound of the batch task size, the average service delay, and the memory usage, and solves it to determine the upper bound of the batch size. Although that patent can adjust the batch size dynamically, the adjustment is based on historical queue information rather than on current environment resource information.
Disclosure of Invention
In view of the defects in the prior art, the invention aims to provide a deep learning model dynamic batch processing scheduling method and system based on resource adjustment.
The deep learning model dynamic batch processing scheduling method based on resource adjustment provided by the invention comprises the following steps:
Step 1: split the deep learning inference job into tasks according to the type of resource whose consumption is estimated at runtime.
Step 2: dynamically batch the tasks according to the resource type of each deep learning inference task and the current resource situation of the deployment environment; stop scheduling when a preset stop condition is met, and trigger a new round of scheduling when a new inference job is received or the available resources change.
Preferably, once all of a task's predecessor tasks in the DAG have completed, the task is marked as ready and added to the corresponding scheduling ready queue.
To enable dynamic batch scheduling of GPU-type tasks, a queue structure different from the CPU-type ready task queue is built for GPU-type ready tasks.
Preferably, the CPU-type ready queue is a multi-level queue: the first level performs first-in first-out (FIFO) scheduling in the order jobs are received, adds a corresponding sub-queue each time a new job arrives, and deletes the sub-queue once all tasks decomposed from that job have completed.
The second level performs FIFO scheduling over the ready tasks within a job, ordered by the time at which each task became ready; during scheduling, the first queue of the first level is scheduled preferentially, and within it the first task is scheduled preferentially.
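A minimal sketch of such a two-level CPU-type ready queue follows; the class and method names are hypothetical and only illustrate the structure described above:

```python
from collections import OrderedDict, deque

class CpuReadyQueue:
    """Two-level FIFO ready queue for CPU-type tasks (illustrative sketch).

    Level 1: one sub-queue per job, kept in job arrival order.
    Level 2: within each job, tasks kept in the order they became ready.
    """

    def __init__(self):
        # OrderedDict preserves job arrival order (level 1).
        self.jobs = OrderedDict()

    def add_job(self, job_id):
        # A new sub-queue is created when a new job arrives.
        self.jobs[job_id] = deque()

    def add_ready_task(self, job_id, task):
        # Tasks are appended in the order they become ready (level 2).
        self.jobs[job_id].append(task)

    def pop_next(self):
        # Schedule the head task of the earliest-arrived job that has a ready task.
        for job_id, tasks in self.jobs.items():
            if tasks:
                return job_id, tasks.popleft()
        return None

    def remove_finished_job(self, job_id):
        # The sub-queue is deleted once all tasks of the job have completed.
        self.jobs.pop(job_id, None)
```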
Preferably, the GPU-type ready queue consists of multiple independent queues; when a split GPU-type task appears whose deep learning model must be loaded before it can run, a corresponding queue is built for that model.
During scheduling, queue priority is ordered dynamically, from high to low, by the number of elements remaining in each queue or by how long it has been since the queue was last successfully scheduled.
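The per-model GPU queues and their dynamic priority ordering could look like the following sketch. Combining the two ordering criteria (remaining element count first, time since the last successful scheduling as tie-break) is one possible reading of the text above, and all names are illustrative:

```python
import time

class GpuModelQueue:
    """Per-model ready queue for GPU-type tasks (illustrative sketch)."""

    def __init__(self, model_name):
        self.model_name = model_name    # model that must be loaded before the tasks run
        self.tasks = []                 # ready tasks waiting for this model
        self.last_scheduled = 0.0       # time of the last successful scheduling

def pick_queue(queues, now=None):
    """Return the non-empty queue with the most remaining elements, breaking ties
    by how long it has been since the queue was last successfully scheduled."""
    now = now if now is not None else time.time()
    candidates = [q for q in queues if q.tasks]
    if not candidates:
        return None
    return max(candidates, key=lambda q: (len(q.tasks), now - q.last_scheduled))
```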
Preferably, after GPU-type task scheduling is triggered, the following steps are performed:
Step 3: sort the inference task queues and select queue QueueA; obtain its ready tasks TaskA and their count tryBatchSize.
Step 4: among the GPUs that hold a loaded, idle StageA model, select the one with the largest currently available video memory and compute the maximum batch size maxBatchSize supported for scheduling StageA on it; if min(tryBatchSize, maxBatchSize) > 0, go to step 5; otherwise compute maxBatchSize again after trying to unload other idle models loaded on that GPU, and if min(tryBatchSize, maxBatchSize) > 0, go to step 5.
Otherwise, among the GPUs that do not hold a loaded, idle StageA model, select the one with the largest available video memory and compute the maximum batch size maxBatchSize supported after loading the StageA model; if min(tryBatchSize, maxBatchSize) > 0, go to step 5; otherwise compute maxBatchSize again after trying to unload other idle models loaded on that GPU; if min(tryBatchSize, maxBatchSize) > 0, go to step 5, and otherwise end GPU-type task scheduling.
Step 5: using min(tryBatchSize, maxBatchSize) as the batch size, take that many tasks out of QueueA, execute them, and return to step 3 to continue.
The deep learning model dynamic batch processing scheduling system based on resource adjustment provided by the invention comprises:
Module M1: split the deep learning inference job into tasks according to the type of resource whose consumption is estimated at runtime.
Module M2: dynamically batch the tasks according to the resource type of each deep learning inference task and the current resource situation of the deployment environment; stop scheduling when a preset stop condition is met, and trigger a new round of scheduling when a new inference job is received or the available resources change.
Preferably, once all of a task's predecessor tasks in the DAG have completed, the task is marked as ready and added to the corresponding scheduling ready queue.
To enable dynamic batch scheduling of GPU-type tasks, a queue structure different from the CPU-type ready task queue is built for GPU-type ready tasks.
Preferably, the CPU-type ready queue is a multi-level queue: the first level performs first-in first-out (FIFO) scheduling in the order jobs are received, adds a corresponding sub-queue each time a new job arrives, and deletes the sub-queue once all tasks decomposed from that job have completed.
The second level performs FIFO scheduling over the ready tasks within a job, ordered by the time at which each task became ready; during scheduling, the first queue of the first level is scheduled preferentially, and within it the first task is scheduled preferentially.
Preferably, the GPU-type ready queue consists of multiple independent queues; when a split GPU-type task appears whose deep learning model must be loaded before it can run, a corresponding queue is built for that model.
During scheduling, queue priority is ordered dynamically, from high to low, by the number of elements remaining in each queue or by how long it has been since the queue was last successfully scheduled.
Preferably, after GPU-type task scheduling is triggered, the following modules are invoked:
Module M3: sort the inference task queues and select queue QueueA; obtain its ready tasks TaskA and their count tryBatchSize.
Module M4: among the GPUs that hold a loaded, idle StageA model, select the one with the largest currently available video memory and compute the maximum batch size maxBatchSize supported for scheduling StageA on it; if min(tryBatchSize, maxBatchSize) > 0, invoke module M5; otherwise compute maxBatchSize again after trying to unload other idle models loaded on that GPU, and if min(tryBatchSize, maxBatchSize) > 0, invoke module M5.
Otherwise, among the GPUs that do not hold a loaded, idle StageA model, select the one with the largest currently available video memory and compute the maximum batch size maxBatchSize supported after loading the StageA model; if min(tryBatchSize, maxBatchSize) > 0, invoke module M5; otherwise compute maxBatchSize again after trying to unload other idle models loaded on that GPU; if min(tryBatchSize, maxBatchSize) > 0, invoke module M5, and otherwise end GPU-type task scheduling.
Module M5: using min(tryBatchSize, maxBatchSize) as the batch size, take that many tasks out of QueueA, execute them, and then invoke module M3 to continue.
Compared with the prior art, the invention has the following beneficial effects:
according to the method, the size of the batch processing reasoning batch of a certain reasoning task at the moment is obtained and scheduled to be executed according to the situation of the deployment resources at the moment of operation scheduling in different operation periods, the batch processing batch can be dynamically adjusted according to the real-time deployment resources, the operation efficiency is improved by using batch processing, the real-time full utilization of the deployment resources is achieved, and therefore the throughput of reasoning operation processing in the scene of dynamic change of the resources is improved.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a flowchart of a dynamic batch scheduling process for GPU-type tasks;
FIG. 2 is a block diagram of the system architecture.
Detailed Description
The present invention will be described in detail below with reference to specific embodiments. The following embodiments will assist those skilled in the art in further understanding the invention, but do not limit it in any way. It should be noted that various changes and modifications can be made by those skilled in the art without departing from the spirit of the invention; all of these fall within the scope of the present invention.
Embodiment:
the invention provides a method for classifying and splitting deep learning inference operation and carrying out dynamic batch scheduling operation according to the current deployment environment resource condition. The method can flexibly batch process the batch size of the tasks according to the deployment environment resource condition when a plurality of deep learning inference operations are operated, thereby reasonably utilizing the limited computing resources and effectively finishing the whole inference operation.
Splitting a deep learning inference job specifically means splitting it according to the type of resource whose consumption is estimated at runtime. In general, the pre- and post-processing tasks around deep learning are classified as CPU-type tasks, for which CPU consumption, memory consumption, and runtime are estimated; the deep learning inference tasks are classified as GPU-type tasks, for which the GPU video memory consumed by loading the model, the GPU video memory consumed by a single-batch run, and the runtime are estimated. When a GPU-type task runs, the inference deep learning model corresponding to it must already be loaded into GPU video memory and be in a serviceable state.
The split tasks have input-output dependencies, which can be represented as a DAG (directed acyclic graph): tasks are nodes and the input-output dependencies between tasks are edges. Executing the DAG from its entry to its exit completes a single deep learning inference job.
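One way to represent the split tasks and their DAG dependencies is sketched below; the field names and the resource estimates they hold are assumptions made for illustration, not terminology fixed by the patent:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Task:
    """One task split from a deep learning inference job (illustrative sketch).

    resource_type is "CPU" for pre-/post-processing tasks (with estimated CPU,
    memory and runtime) or "GPU" for model inference tasks (with estimated
    model-load video memory, per-batch video memory and runtime).
    """
    name: str
    resource_type: str                  # "CPU" or "GPU"
    model_name: str = ""                # model to load before running (GPU tasks only)
    load_mem_mb: int = 0                # video memory needed to load the model
    batch_mem_mb: int = 0               # video memory needed per single-batch run
    runtime_s: float = 0.0              # estimated runtime of one (batch) run
    predecessors: List["Task"] = field(default_factory=list)   # DAG edges (inputs)

def is_ready(task: Task, finished: set) -> bool:
    # A task becomes ready once all of its predecessor tasks in the DAG have completed.
    return all(p.name in finished for p in task.predecessors)
```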
Dynamic batch scheduling according to the current resource situation of the deployment environment works as follows. The split tasks fall into two types; once all of a task's predecessor tasks in the DAG have completed, the task is marked as ready and added to the corresponding scheduling ready queue.
The CPU-type ready queue is a multi-level queue: the first level performs FIFO scheduling in the order jobs are received, and the second level performs FIFO scheduling in the order tasks within a job become ready.
The GPU-type ready queue maintains one queue per inference model; when a GPU-type task enters the ready state, it is added to the queue of its model.
For example, when a Pipeline-type job Job1 is received, splitting it produces two GPU-type tasks, Task1A and Task1B, which require models StageA and StageB, respectively, to be loaded before running. When Task1A and Task1B of Job1 become ready, they are placed into two independent queues, QueueA and QueueB, respectively.
GPU-type task dynamic batch scheduling proceeds as follows. Find the queue QueueA that currently has the most remaining elements, or whose last successful scheduling lies furthest from the current time; record the number of its task elements as tryBatchSize and the model StageA required to execute those tasks. Among the currently available GPUs on which the StageA model is loaded and idle, take the GPU with the largest available video memory and, from that available video memory, calculate the maximum number maxBatchSize of such model tasks that can be executed. If maxBatchSize is 0, try to unload other idle models on that card and retry. If min(tryBatchSize, maxBatchSize) is not 0, take min(tryBatchSize, maxBatchSize) as the batch size, remove that many tasks from the head of QueueA, schedule them as one batch, and continue scheduling.
If min(tryBatchSize, maxBatchSize) is 0, try the GPU with the largest available video memory among the GPUs that have not loaded the StageA model, and calculate whether maxBatchSize after loading the model is 0; if it is 0, try to unload other idle models on that card and retry. If min(tryBatchSize, maxBatchSize) is not 0, schedule min(tryBatchSize, maxBatchSize) tasks in the same way as a batch to execute on that GPU, and continue scheduling.
If min(tryBatchSize, maxBatchSize) is still 0, stop scheduling; scheduling restarts when task completion changes the resource environment state or when a new task is added to a corresponding ready queue.
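The maxBatchSize used throughout this procedure follows from the per-model video-memory estimates made during task splitting. A minimal sketch of that calculation, with assumed names and a simple linear memory model, is:

```python
def max_batch_supported(free_mem_mb: int, model_loaded: bool,
                        load_mem_mb: int, batch_mem_mb: int) -> int:
    """Largest batch that fits into a GPU's currently available video memory,
    assuming runtime video memory grows linearly with the batch size."""
    if not model_loaded:
        free_mem_mb -= load_mem_mb      # loading the model consumes memory first
    if free_mem_mb <= 0 or batch_mem_mb <= 0:
        return 0
    return free_mem_mb // batch_mem_mb
```

For instance, with 8 GB free and a model consuming 1 GB of runtime memory per batch element, this yields maxBatchSize = 8; if nothing fits, the scheduler tries to unload other idle models or moves to another GPU, and finally stops as described above.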
As shown in FIG. 1 and FIG. 2, a specific embodiment operates as follows:
The dynamic batch scheduling method is applied to a medical image inference job. The job can be split into CPU pre-processing and post-processing tasks that each take 1 s, plus 3 inference models that can run in parallel: two models, StageA and StageB, each occupy 300 MB of video memory when loaded, consume 1 GB of runtime video memory for a single-batch inference, and take roughly 1 s per inference; one model, StageC, occupies 400 MB when loaded, consumes 2 GB of runtime video memory for a single-batch inference, and takes roughly 2 s. The current deployment scenario is limited to a 4-core CPU, and the GPU resource is a single card with 9 GB of video memory. When 4 such jobs arrive at the same time, all 3 models are already loaded on the GPU in the ready state, consuming 1 GB of video memory. The split pre-processing CPU tasks are scheduled and run in sequence for 1 s. If batch scheduling fixes the batch size of every model at 2, then under the video memory limit all model inference ideally takes 4 s in total, followed by 1 s to complete the 4 post-processing tasks. If dynamic batch scheduling is enabled, then within the video memory limit StageA and StageB can each complete inference in 1 s at batch size 4, after which StageC completes inference in 2 s at batch size 4, for an overall inference time of 3 s, followed by 1 s to complete the 4 post-processing tasks. With the support of the dynamic batching method, resources are fully and dynamically utilized, shortening the completion time of multiple inference jobs.
The deep learning model dynamic batch processing scheduling system based on resource adjustment provided by the invention comprises: Module M1: split the deep learning inference job into tasks according to the type of resource whose consumption is estimated at runtime. Module M2: dynamically batch the tasks according to the resource type of each deep learning inference task and the current resource situation of the deployment environment; stop scheduling when a preset stop condition is met, and trigger a new round of scheduling when a new inference job is received or the available resources change.
Once all of a task's predecessor tasks in the DAG have completed, the task is marked as ready and added to the corresponding scheduling ready queue. To enable dynamic batch scheduling of GPU-type tasks, a queue structure different from the CPU-type ready task queue is built for GPU-type ready tasks. The CPU-type ready queue is a multi-level queue: the first level performs first-in first-out (FIFO) scheduling in the order jobs are received, adds a corresponding sub-queue each time a new job arrives, and deletes the sub-queue once all tasks decomposed from that job have completed; the second level performs FIFO scheduling over the ready tasks within a job, ordered by the time at which each task became ready, and during scheduling the first queue of the first level, and the first task within it, are scheduled preferentially. The GPU-type ready queue consists of multiple independent queues; when a split GPU-type task appears whose deep learning model must be loaded before it can run, a corresponding queue is built for that model. During scheduling, queue priority is ordered dynamically, from high to low, by the number of elements remaining in each queue or by how long it has been since the queue was last successfully scheduled; when the corresponding task of a job enters the ready state, it is added to that queue.
After GPU-type task scheduling is triggered, the following modules are invoked: Module M3: sort the inference task queues and select queue QueueA; obtain its ready tasks TaskA and their count tryBatchSize. Module M4: among the GPUs that hold a loaded, idle StageA model, select the one with the largest currently available video memory and compute the maximum batch size maxBatchSize supported for scheduling StageA on it; if min(tryBatchSize, maxBatchSize) > 0, invoke module M5; otherwise compute maxBatchSize again after trying to unload other idle models loaded on that GPU, and if min(tryBatchSize, maxBatchSize) > 0, invoke module M5. Otherwise, among the GPUs that do not hold a loaded, idle StageA model, select the one with the largest currently available video memory and compute the maximum batch size maxBatchSize supported after loading the StageA model; if min(tryBatchSize, maxBatchSize) > 0, invoke module M5; otherwise compute maxBatchSize again after trying to unload other idle models loaded on that GPU; if min(tryBatchSize, maxBatchSize) > 0, invoke module M5, and otherwise end GPU-type task scheduling. Module M5: using min(tryBatchSize, maxBatchSize) as the batch size, take that many tasks out of QueueA, execute them, and then invoke module M3 to continue.
Those skilled in the art will appreciate that, in addition to implementing the systems, apparatus, and various modules thereof provided by the present invention in purely computer readable program code, the same procedures can be implemented entirely by logically programming method steps such that the systems, apparatus, and various modules thereof are provided in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system, the device and the modules thereof provided by the present invention can be considered as a hardware component, and the modules included in the system, the device and the modules thereof for implementing various programs can also be considered as structures in the hardware component; modules for performing various functions may also be considered to be both software programs for performing the methods and structures within hardware components.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims (10)

1. A deep learning model dynamic batch processing scheduling method based on resource adjustment, characterized by comprising the following steps:
Step 1: splitting the deep learning inference job into tasks according to the type of resource whose consumption is estimated at runtime;
Step 2: dynamically batching the tasks according to the resource type of each deep learning inference task and the current resource situation of the deployment environment, stopping scheduling when a preset stop condition is met, and triggering a new round of scheduling when a new inference job is received or the available resources change.
2. The deep learning model dynamic batch scheduling method based on resource adjustment according to claim 1, wherein once all of a task's predecessor tasks in the DAG have completed, the task is marked as ready and added to the corresponding scheduling ready queue;
to enable dynamic batch scheduling of GPU-type tasks, a queue structure different from the CPU-type ready task queue is built for GPU-type ready tasks.
3. The deep learning model dynamic batch scheduling method based on resource adjustment according to claim 1, wherein the CPU-type ready queue is a multi-level queue: the first level performs first-in first-out (FIFO) scheduling in the order jobs are received, adds a corresponding sub-queue each time a new job arrives, and deletes the sub-queue once all tasks decomposed from that job have completed;
the second level performs FIFO scheduling over the ready tasks within a job, ordered by the time at which each task became ready; during scheduling, the first queue of the first level is scheduled preferentially, and within it the first task is scheduled preferentially.
4. The deep learning model dynamic batch scheduling method based on resource adjustment according to claim 1, wherein the GPU-type ready queue consists of multiple independent queues, and when a split GPU-type task appears whose deep learning model must be loaded before it can run, a corresponding queue is built for that model;
during scheduling, queue priority is ordered dynamically, from high to low, by the number of elements remaining in each queue or by how long it has been since the queue was last successfully scheduled.
5. The deep learning model dynamic batch scheduling method based on resource adjustment according to claim 1, wherein after GPU-type task scheduling is triggered, the following steps are performed:
Step 3: sorting the inference task queues and selecting queue QueueA, and obtaining its ready tasks TaskA and their count tryBatchSize;
Step 4: among the GPUs that hold a loaded, idle StageA model, selecting the one with the largest currently available video memory and computing the maximum batch size maxBatchSize supported for scheduling StageA on it; if min(tryBatchSize, maxBatchSize) > 0, executing step 5; otherwise computing maxBatchSize again after trying to unload other idle models loaded on that GPU, and if min(tryBatchSize, maxBatchSize) > 0, executing step 5;
otherwise, among the GPUs that do not hold a loaded, idle StageA model, selecting the one with the largest available video memory and computing the maximum batch size maxBatchSize supported after loading the StageA model; if min(tryBatchSize, maxBatchSize) > 0, executing step 5; otherwise computing maxBatchSize again after trying to unload other idle models loaded on that GPU; if min(tryBatchSize, maxBatchSize) > 0, executing step 5, and otherwise ending GPU-type task scheduling;
Step 5: using min(tryBatchSize, maxBatchSize) as the batch size, taking that many tasks out of QueueA and executing them, and then returning to step 3 to continue.
6. A deep learning model dynamic batch scheduling system based on resource adjustment, characterized by comprising:
Module M1: splitting the deep learning inference job into tasks according to the type of resource whose consumption is estimated at runtime;
Module M2: dynamically batching the tasks according to the resource type of each deep learning inference task and the current resource situation of the deployment environment, stopping scheduling when a preset stop condition is met, and triggering a new round of scheduling when a new inference job is received or the available resources change.
7. The deep learning model dynamic batch scheduling system based on resource adjustment according to claim 6, wherein once all of a task's predecessor tasks in the DAG have completed, the task is marked as ready and added to the corresponding scheduling ready queue;
to enable dynamic batch scheduling of GPU-type tasks, a queue structure different from the CPU-type ready task queue is built for GPU-type ready tasks.
8. The deep learning model dynamic batch scheduling system based on resource adjustment according to claim 6, wherein the CPU-type ready queue is a multi-level queue: the first level performs FIFO scheduling in the order jobs are received, adds a corresponding sub-queue each time a new job arrives, and deletes the sub-queue once all tasks decomposed from that job have completed;
the second level performs FIFO scheduling over the ready tasks within a job, ordered by the time at which each task became ready; during scheduling, the first queue of the first level is scheduled preferentially, and within it the first task is scheduled preferentially.
9. The deep learning model dynamic batch scheduling system based on resource adjustment according to claim 6, wherein the GPU-type ready queue consists of multiple independent queues, and when a split GPU-type task appears whose deep learning model must be loaded before it can run, a corresponding queue is built for that model;
during scheduling, queue priority is ordered dynamically, from high to low, by the number of elements remaining in each queue or by how long it has been since the queue was last successfully scheduled.
10. The deep learning model dynamic batch scheduling system based on resource adjustment according to claim 6, wherein after GPU-type task scheduling is triggered, the following modules are invoked:
Module M3: sorting the inference task queues and selecting queue QueueA, and obtaining its ready tasks TaskA and their count tryBatchSize;
Module M4: among the GPUs that hold a loaded, idle StageA model, selecting the one with the largest currently available video memory and computing the maximum batch size maxBatchSize supported for scheduling StageA on it; if min(tryBatchSize, maxBatchSize) > 0, invoking module M5; otherwise computing maxBatchSize again after trying to unload other idle models loaded on that GPU, and if min(tryBatchSize, maxBatchSize) > 0, invoking module M5;
otherwise, among the GPUs that do not hold a loaded, idle StageA model, selecting the one with the largest currently available video memory and computing the maximum batch size maxBatchSize supported after loading the StageA model; if min(tryBatchSize, maxBatchSize) > 0, invoking module M5; otherwise computing maxBatchSize again after trying to unload other idle models loaded on that GPU; if min(tryBatchSize, maxBatchSize) > 0, invoking module M5, and otherwise ending GPU-type task scheduling;
Module M5: using min(tryBatchSize, maxBatchSize) as the batch size, taking that many tasks out of QueueA and executing them, and then invoking module M3 to continue.
CN202111543693.5A 2021-12-16 2021-12-16 Deep learning model dynamic batch processing scheduling method and system based on resource adjustment Pending CN114217966A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111543693.5A CN114217966A (en) 2021-12-16 2021-12-16 Deep learning model dynamic batch processing scheduling method and system based on resource adjustment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111543693.5A CN114217966A (en) 2021-12-16 2021-12-16 Deep learning model dynamic batch processing scheduling method and system based on resource adjustment

Publications (1)

Publication Number Publication Date
CN114217966A (en) 2022-03-22

Family

ID=80702937

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111543693.5A Pending CN114217966A (en) 2021-12-16 2021-12-16 Deep learning model dynamic batch processing scheduling method and system based on resource adjustment

Country Status (1)

Country Link
CN (1) CN114217966A (en)


Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230205509A1 (en) * 2021-12-29 2023-06-29 Microsoft Technology Licensing, Llc Smart deployment using graph optimization
US11861352B2 (en) * 2021-12-29 2024-01-02 Microsoft Technology Licensing, Llc Smart deployment using graph optimization
CN115080263A (en) * 2022-05-12 2022-09-20 吉林省吉林祥云信息技术有限公司 Batch processing scale method in real-time GPU service
CN115080263B (en) * 2022-05-12 2023-10-27 吉林省吉林祥云信息技术有限公司 Batch processing scale selection method in real-time GPU service
CN116739090A (en) * 2023-05-12 2023-09-12 北京大学 Deep neural network reasoning measurement method and device based on Web browser
CN116739090B (en) * 2023-05-12 2023-11-28 北京大学 Deep neural network reasoning measurement method and device based on Web browser
CN116702907A (en) * 2023-08-02 2023-09-05 北京大学 Server-unaware large language model reasoning system, method and equipment
CN116702907B (en) * 2023-08-02 2023-11-14 北京大学 Server-unaware large language model reasoning system, method and equipment

Similar Documents

Publication Publication Date Title
CN114217966A (en) Deep learning model dynamic batch processing scheduling method and system based on resource adjustment
US8914805B2 (en) Rescheduling workload in a hybrid computing environment
US8990827B2 (en) Optimizing data warehousing applications for GPUs using dynamic stream scheduling and dispatch of fused and split kernels
US8739171B2 (en) High-throughput-computing in a hybrid computing environment
CN113535367B (en) Task scheduling method and related device
CN112084002B (en) Elastic expansion method, system, medium and equipment of micro-service system in cloud environment
US20030135621A1 (en) Scheduling system method and apparatus for a cluster
US11609792B2 (en) Maximizing resource utilization of neural network computing system
Kang et al. Lalarand: Flexible layer-by-layer cpu/gpu scheduling for real-time dnn tasks
CN114327829A (en) Multi-core real-time task scheduling analysis and simulation system and method
CN110717574A (en) Neural network operation method and device and heterogeneous intelligent chip
CN108509280A (en) A kind of Distributed Calculation cluster locality dispatching method based on push model
CN112540854B (en) Deep learning model scheduling deployment method and system under condition of limited hardware resources
Li et al. Efficient algorithms for task mapping on heterogeneous CPU/GPU platforms for fast completion time
Kodase et al. Transforming structural model to runtime model of embedded software with real-time constraints
CN114217930A (en) Accelerator system resource optimization management method based on mixed task scheduling
CN111176637B (en) Schedulability analysis method of AADL model based on cache preemption delay constraint
CN112905317A (en) Task scheduling method and system under rapid reconfigurable signal processing heterogeneous platform
CN116795503A (en) Task scheduling method, task scheduling device, graphic processor and electronic equipment
Tang et al. A network load perception based task scheduler for parallel distributed data processing systems
CN110825502A (en) Neural network processor and task scheduling method for neural network processor
JP2009048358A (en) Information processor and scheduling method
Lai et al. A dominant predecessor duplication scheduling algorithm for heterogeneous systems
CN114090219A (en) Scheduling system, method, device, chip, computer device and storage medium
Suba Hierarchical pipelining of nested loops in high-level synthesis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination