CN112860402B - Dynamic batch task scheduling method and system for deep learning reasoning service - Google Patents

Dynamic batch task scheduling method and system for deep learning reasoning service

Info

Publication number
CN112860402B
CN112860402B (application CN202110192645.XA)
Authority
CN
China
Prior art keywords
batch
task
size
upper limit
service
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110192645.XA
Other languages
Chinese (zh)
Other versions
CN112860402A (en)
Inventor
张德宇
罗云臻
张尧学
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University filed Critical Central South University
Priority to CN202110192645.XA priority Critical patent/CN112860402B/en
Publication of CN112860402A publication Critical patent/CN112860402A/en
Application granted granted Critical
Publication of CN112860402B publication Critical patent/CN112860402B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a dynamic batch processing task scheduling method and system for a deep learning reasoning service. The method comprises the following steps: describing the number of tasks waiting in the queue at each batch departure moment and the size of the departing batch with a two-dimensional Markov process, determining the steady-state probability of the two-dimensional Markov process, and determining the average service delay in the deep learning reasoning service system according to the steady-state probability; constructing an optimization model that trades off the upper limit of the batch size against the average service delay and the memory usage, and solving the optimization model to determine the upper limit of the batch size. The invention fits a dynamic environment and has advantages such as better average service delay and memory occupation.

Description

Dynamic batch task scheduling method and system for deep learning reasoning service
Technical Field
The invention relates to the technical fields of edge computing and cloud computing, in particular to a dynamic batch task scheduling method and system for a deep learning reasoning service.
Background
Due to the excellent performance of deep learning in fields such as image processing and natural language processing, and the increasing popularity of mobile devices running Android and iOS, mobile devices can provide many intelligent applications to end users. More than 16,500 mobile applications on Google Play use deep learning as a core component, providing intelligent services ranging from computer vision to text and audio processing. One specific application is Seeing AI, a mobile phone app developed by Microsoft to help visually impaired people recognize their surroundings through the device's camera. Another, Adobe Scan, converts images to text using deep-learning-based text recognition.
One common way for mobile devices to provide intelligent services using deep learning is to run deep learning reasoning on pre-trained models. However, deep learning model reasoning has high requirements in terms of energy, memory and compute cycles. Although some mobile neural network accelerators, such as NPUs and TPUs, have been released to speed up on-device reasoning, their computing power is still very limited and it is difficult to guarantee high-quality services.
To provide efficient mobile intelligent services, a more effective solution is to offload model reasoning to powerful edge or cloud servers. As the application range of deep learning models keeps expanding, information released by leading high-tech companies shows that the demand for deep learning reasoning has grown rapidly in recent years. For example, Microsoft's dedicated platform DLIS (Deep Learning Inference Service) can receive hundreds of thousands of deep learning reasoning requests per second, and the deep learning reasoning demand of Facebook's data centers tripled within two years.
In mobile applications such as AR and VR, a critical issue is the stringent low latency requirements, typically in the millisecond range. With the significant increase in the amount of deep learning reasoning task requests, even for powerful GPU servers, this stringent low latency requirement becomes a challenge.
Because of the highly parallel computing architecture of GPUs, batching inputs together can significantly improve computing efficiency. By measuring the throughput of representative deep learning models on 2 GPU servers under different batch sizes, the throughputs shown in fig. 1 are obtained, which show that batching inputs can greatly improve throughput. Meanwhile, by studying the relation between the batch size and the video memory occupation, the results shown in fig. 2 are obtained; in the worst case the video memory occupation reaches 2558 MB.
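As a minimal illustration of the batching idea above (illustrative only, not code from the patent), the following PyTorch sketch stacks several queued requests into one tensor and serves them with a single forward pass; the GoogLeNet model, the request count and the input size are assumptions chosen to match the experiments described later.

```python
import torch
import torchvision.models as models

# Hypothetical setup: GoogLeNet on the GPU, matching the models used in the experiments.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = models.googlenet(weights=None).eval().to(device)

requests = [torch.randn(3, 224, 224) for _ in range(8)]   # 8 queued requests (assumed)

with torch.no_grad():
    batch = torch.stack(requests).to(device)   # shape (8, 3, 224, 224)
    outputs = model(batch)                     # one forward pass serves all 8 requests

print(outputs.shape)                           # torch.Size([8, 1000]): one result per request
```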
Existing research on improving the throughput of deep learning reasoning services and reducing their delay by batching is conducted in a static environment, i.e., it assumes that the tasks of the deep learning reasoning service are already waiting at the server. In an actual network service, however, tasks arrive randomly, so how to optimize the deep learning reasoning service by making reasonable use of batching under a random arrival process has not been studied in depth in the prior art and is of practical significance.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: aiming at the technical problems existing in the prior art, the invention provides a dynamic batch processing task scheduling method and system for deep learning reasoning services that fit a dynamic environment and achieve better average service delay and memory occupation.
In order to solve the above technical problems, the technical scheme provided by the invention is as follows: a dynamic batch task scheduling method for a deep learning reasoning service, which describes the number of tasks waiting in the queue at each batch departure moment and the size of the departing batch with a two-dimensional Markov process, determines the steady-state probability of the two-dimensional Markov process, and determines the average service delay in the deep learning reasoning service system according to the steady-state probability;
the upper limit of the batch size is optimized jointly with the average service delay and the memory usage through the optimization model shown in formula (1),
min_b E(W(b)) + γ·m_b,  subject to  λ/(B·μ_B) < 1 and 1 ≤ b ≤ B ≤ N   (1)
in formula (1), E(W(b)) is the average service delay corresponding to the batch size upper limit b, b is the upper limit of the batch size of the batch processing tasks, W(b) is the service delay, γ is the weight of the memory usage relative to the average service delay, m_b is the memory usage corresponding to a batch size upper limit of b, B is the maximum value of the batch size upper limit, N is the maximum number of tasks waiting in the batch processing task queue, λ is the task arrival rate, and μ_B is the service rate when the batch size is B; the optimization model of formula (1) is solved to determine the upper limit of the batch size of the batch processing tasks.
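The search space of b is small (1 ≤ b ≤ B), so the optimization model can in principle be solved by traversing it, as the embodiment later notes. The following Python sketch illustrates such a brute-force evaluation of the objective in formula (1); the callables avg_delay and mem_usage are hypothetical placeholders for E(W(b)) from the queueing analysis and m_b from the memory model.

```python
import math

def optimal_batch_upper_limit(avg_delay, mem_usage, gamma, B):
    """Brute-force search over b in [1, B]; the search space of batch-size
    upper limits is small enough to traverse exhaustively.

    avg_delay(b) -> E(W(b)) and mem_usage(b) -> m_b are assumed callables
    supplied by the queueing analysis and the linear memory model.
    """
    best_b, best_cost = None, math.inf
    for b in range(1, B + 1):
        cost = avg_delay(b) + gamma * mem_usage(b)   # objective of formula (1)
        if cost < best_cost:
            best_b, best_cost = b, cost
    return best_b, best_cost
```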
Further, the average service delay is determined by formula (2),
E(W(b)) = E(L) / (λ·(1 − P_block))   (2)
in formula (2), E(W(b)) is the average service delay corresponding to the batch size upper limit b, E(L) is the average number of tasks, λ is the task arrival rate, and P_block is the blocking probability of a task.
Further, the average number of tasks is determined by formula (3),
E(L) = Σ_{n=1}^{N} n·(π_{n,0} + Σ_{r=a}^{b} π_{n,r})   (3)
and the blocking probability is determined by formula (4),
P_block = π_{N,0} + Σ_{r=a}^{b} π_{N,r}   (4)
in formulas (3) and (4), E(L) is the average number of tasks, n is the number of waiting tasks in the batch task queue, r is the batch size, a is the lower limit of the batch size of the batch processing tasks, b is the upper limit of the batch size of the batch processing tasks, π_{n,r} is the steady-state probability that n tasks are waiting and the batch size is r, π_{n,0} is the steady-state probability that n tasks are waiting and the batch size is 0, and π_{N,r} is the steady-state probability that N tasks are waiting and the batch size is r.
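The following numpy sketch shows how the quantities in formulas (2)–(4) could be evaluated from a steady-state probability matrix; the reconstruction of the formulas above and the indexing convention pi[n, r] (with r = 0 meaning no batch in service) are assumptions made for illustration, not the patent's own code.

```python
import numpy as np

def service_metrics(pi, a, b, lam):
    """pi is assumed to be an (N+1) x (b+1) array where pi[n, r] is the
    steady-state probability of n waiting tasks and batch size r
    (r = 0 meaning no batch in service); lam is the task arrival rate."""
    N = pi.shape[0] - 1
    # formula (3): average number of tasks in the system
    E_L = sum(n * (pi[n, 0] + pi[n, a:b + 1].sum()) for n in range(1, N + 1))
    # formula (4): probability that an arriving task finds the queue full
    P_block = pi[N, 0] + pi[N, a:b + 1].sum()
    # formula (2): Little's law with the effective arrival rate lam * (1 - P_block)
    E_W = E_L / (lam * (1.0 - P_block))
    return E_L, P_block, E_W
```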
Further, the solving process of the optimization model comprises the following steps:
initializing the upper limit of the batch size of the batch processing tasks and the step length by which the upper limit of the batch size is adjusted in each iteration; taking the sum of the average service delay and the memory usage corresponding to the upper limit of the batch size as the convergence parameter; and in each iteration, adjusting the upper limit of the batch size by the step length, and when the convergence parameter obtained in the current iteration is larger than that of the previous iteration, taking the upper limit of the batch size obtained in the current iteration as the optimal solution output by the optimization model.
Further, in the first iteration, the method also includes a process of correcting the adjustment direction of the step length: when the difference between the average service delay obtained in the first iteration and the average service delay corresponding to the initialized batch size upper limit is larger than a preset threshold value, the direction in which the batch size upper limit is adjusted is reversed.
A dynamic batch task scheduling system of a deep learning reasoning service, which performs task scheduling according to the dynamic batch task scheduling method of the deep learning reasoning service as set forth in any one of the above.
Compared with the prior art, the invention has the following advantages: compared with the traditional single-task processing method, the processing speed of the invention is greatly improved; compared with the traditional batch processing method with the optimal fixed batch size, the speed is greatly improved and the video memory occupation is significantly reduced; compared with a greedy dynamic batch processing method, the video memory occupation is greatly reduced while the service delay remains basically the same.
Drawings
Fig. 1 shows throughput rate cases of different deep learning models in the prior art.
FIG. 2 shows the memory occupancy of different deep learning models in the prior art.
FIG. 3 is a flow chart of an embodiment of the present invention.
FIG. 4 is a schematic diagram of the relationship between the throughput (a) and the GPU utilization (b) of the deep learning models GoogLeNet and DenseNet-169 and the batch size on an NVIDIA RTX 2080 GPU according to an embodiment of the present invention.
FIG. 5 is a schematic diagram of the relationship between the throughput (a) and the GPU utilization (b) of the deep learning models GoogLeNet and DenseNet-169 and the batch size on an NVIDIA Titan Xp GPU according to an embodiment of the present invention.
FIG. 6 is a schematic diagram of the relationship between the GPU video memory occupation of the deep learning models GoogLeNet and DenseNet-169 and the batch size on the NVIDIA RTX 2080 GPU (a) and the NVIDIA Titan Xp GPU (b) in an embodiment of the present invention.
FIG. 7 is a schematic diagram of the service delay and blocking probability of the deep learning model GoogLeNet reasoning service on an NVIDIA RTX 2080 GPU under dynamic batching (a=1) and static batching (a=b), where the task arrival rate during the service is 990 tasks/second.
FIG. 8 is a schematic diagram of the relationship between the dynamic batch lower limit and the service delay for the deep learning model GoogLeNet reasoning service on the NVIDIA RTX 2080 GPU in an embodiment of the present invention.
FIG. 9 is a schematic diagram comparing the memory occupation of the deep learning model GoogLeNet reasoning service on the NVIDIA RTX 2080 GPU under dynamic and static batching in an embodiment of the present invention.
Fig. 10 is a schematic diagram of a queuing model corresponding to a deep learning reasoning service system model in an embodiment of the present invention.
FIG. 11 is a schematic diagram comparing the real measurements and the model predictions for the deep learning model GoogLeNet and DenseNet-169 reasoning services on the NVIDIA RTX 2080 GPU, where the task arrival rate during the GoogLeNet service is 990 tasks/second and that of DenseNet-169 is 330 tasks/second.
FIG. 12 is a schematic diagram showing the influence of the batch size upper limit b and the arrival rate λ on the GoogLeNet model reasoning service in an embodiment of the present invention.
FIG. 13 is a schematic diagram comparing the service delays of the method of the present invention and of different static batch processes under varying task arrival rates for the deep learning model GoogLeNet reasoning service on an NVIDIA RTX 2080 GPU in an embodiment of the present invention.
FIG. 14 is a graph showing the memory footprint of the method of the present invention and of different static and greedy dynamic batch processes under varying task arrival rates for the deep learning model GoogLeNet reasoning service on an NVIDIA RTX 2080 GPU in an embodiment of the present invention.
Detailed Description
The invention is further described below in connection with the drawings and the specific preferred embodiments, but the scope of protection of the invention is not limited thereby.
In the dynamic batch task scheduling method of the deep learning reasoning service according to this embodiment, a two-dimensional Markov process is used to describe the number of queue waiting tasks at each batch departure moment and the size of the departing batch, the steady-state probability of the two-dimensional Markov process is determined, and the average service delay in the deep learning reasoning service system is determined according to the steady-state probability;
the upper limit of the batch size is optimized jointly with the average service delay and the memory usage through the optimization model shown in formula (1),
min_b E(W(b)) + γ·m_b,  subject to  λ/(B·μ_B) < 1 and 1 ≤ b ≤ B ≤ N   (1)
in formula (1), E(W(b)) is the average service delay corresponding to the batch size upper limit b, b is the upper limit of the batch size of the batch processing tasks, W(b) is the service delay, γ is the weight of the memory usage relative to the average service delay, m_b is the memory usage corresponding to a batch size upper limit of b, B is the maximum value of the batch size upper limit, N is the maximum number of tasks waiting in the batch processing task queue, λ is the task arrival rate, and μ_B is the service rate when the batch size is B; the optimization model of formula (1) is solved to determine the upper limit of the batch size of the batch processing tasks.
In this embodiment, the average service delay is determined by formula (2),
E(W(b)) = E(L) / (λ·(1 − P_block))   (2)
in formula (2), E(W(b)) is the average service delay corresponding to the batch size upper limit b, E(L) is the average number of tasks, λ is the task arrival rate, and P_block is the blocking probability of a task; the definitions of the remaining parameters are the same as above.
In this embodiment, the average number of tasks is determined by formula (3),
E(L) = Σ_{n=1}^{N} n·(π_{n,0} + Σ_{r=a}^{b} π_{n,r})   (3)
and the blocking probability is determined by formula (4),
P_block = π_{N,0} + Σ_{r=a}^{b} π_{N,r}   (4)
in formulas (3) and (4), E(L) is the average number of tasks, n is the number of waiting tasks in the batch task queue, r is the batch size, a is the lower limit of the batch size of the batch processing tasks, b is the upper limit of the batch size of the batch processing tasks, π_{n,r} is the steady-state probability that n tasks are waiting and the batch size is r, π_{n,0} is the steady-state probability that n tasks are waiting and the batch size is 0, and π_{N,r} is the steady-state probability that N tasks are waiting and the batch size is r; the definitions of the remaining parameters are the same as above.
In this embodiment, the solving process of the optimization model includes: initializing the upper limit of the batch size of the batch processing tasks and the step length by which the upper limit of the batch size is adjusted in each iteration; taking the sum of the average service delay and the memory usage corresponding to the upper limit of the batch size as the convergence parameter; and in each iteration, adjusting the upper limit of the batch size by the step length, and when the convergence parameter obtained in the current iteration is larger than that of the previous iteration, taking the upper limit of the batch size obtained in the current iteration as the optimal solution output by the optimization model.
In this embodiment, the first iteration further includes a process of correcting the adjustment direction of the step length: when the difference between the average service delay obtained in the first iteration and the average service delay corresponding to the initialized batch size upper limit is larger than a preset threshold value, the direction in which the batch size upper limit is adjusted is reversed.
A dynamic batch task scheduling system of a deep learning reasoning service, which performs task scheduling according to the dynamic batch task scheduling method of the deep learning reasoning service as set forth in any one of the above.
The method of this embodiment is verified and analyzed through a specific simulation experiment. In the experiments, a batch-based deep learning reasoning service system is described as follows by analyzing the patterns of actual network deep learning reasoning services.
The GPU server receives deep learning model reasoning tasks from a large number of mobile devices or other terminals; the task arrival process follows a Poisson distribution with rate λ, the reasoning process follows a general distribution, and the reasoning delay depends on the current batch size. When the GPU server runs only one deep learning reasoning model, the reasoning process follows a deterministic distribution since there are no other tasks competing with it. In the more realistic case where multiple model reasoning services run on one server, the reasoning delays are generally distributed, which describes the varying delays caused by competition for computing resources between services.
According to this pattern of the deep learning reasoning service system, the batch-based deep learning reasoning service system is modeled in this experiment as an M/G(a,b)/1/N queuing model, following the notation of D. G. Kendall's queuing theory, where M indicates that the task arrival process follows a Poisson distribution, G indicates that the reasoning process follows a general distribution, a is the lower limit of the reasoning batch size, b is the upper limit of the reasoning batch size, 1 is the number of servers, and N is the maximum number of tasks waiting in the batch task queue of the queuing system. After the queuing model is established, it is analyzed to obtain the average service delay. Specifically, a closed-form expression for the average service delay of the deep learning reasoning service system is derived theoretically, and its result includes the queuing delay and the reasoning delay. Extensive experimental analysis shows that, as the number of pictures to be loaded onto the graphics card increases with the batch size and the amount of intermediate data generated during reasoning grows, the video memory occupation increases linearly with the batch size, and this relationship can be described by a linear function. Finally, the closed-form expression and the linear function are combined into the objective function of the optimization problem, whose optimization variable is the batch size upper limit b. Since the values of the batch size are discrete and the search space is small, the problem is solved by traversing the search space.
In the currently popular deep learning frameworks with GPU acceleration, such as TensorFlow, PyTorch, MXNet and DyNet, the batched images are laid out into a matrix and passed into the computation of the deep learning model already loaded on the GPU. On the one hand, batching lets all images share the convolution kernel weights in the convolution operation, reducing the number of model parameter accesses and the delay; on the other hand, batching exploits the parallelism of convolution operations and fully-connected-layer neuron operations, further exploiting the parallel computing capability of the GPU architecture. In the experiment, the deep learning model reasoning experiments are carried out on an RTX 2080 GPU with 8 GB of video memory and a Titan X Pascal (Titan Xp) GPU with 12 GB of memory, using the CUDA v10.0.130 interface and the cuDNN v7.6.2 acceleration library, with PyTorch as the deep learning framework.
In this embodiment, for comparison, the throughput and memory occupation of 5 typical deep learning models under different batch sizes are tested first, with an input image size of 224×224 pixels. Considering that the number of tasks in the queue during service is an arbitrary integer rather than a power of 2, two models, DenseNet-169 and GoogLeNet, whose FLOPs (floating point operations) are 3.2×10⁹ and 1.41×10⁹ respectively, are chosen, as shown in figs. 4 and 5. When the batch size changes from 1 to 64, the throughput first increases rapidly with the batch size and then fluctuates around a certain point, which indicates that batching can speed up model reasoning to a certain extent; the other tested models show the same characteristic. For GoogLeNet running on the RTX 2080 and Titan Xp, batching improves throughput by up to 7.67 times and 25.27 times respectively, and the throughput and the GPU utilization follow almost the same trend with the batch size. For statistical reliability, the throughput and memory footprint results shown in figs. 4 and 5 are averages over 100 runs.
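A sketch of how such a throughput and memory measurement could be reproduced with PyTorch is given below (illustrative only, not the patent's benchmark code); the warm-up count and the use of randomly initialized weights are assumptions.

```python
import time
import torch
import torchvision.models as models

device = torch.device("cuda")
model = models.googlenet(weights=None).eval().to(device)

for r in (1, 2, 4, 8, 16, 32, 64):                        # candidate batch sizes
    torch.cuda.reset_peak_memory_stats()
    x = torch.randn(r, 3, 224, 224, device=device)        # 224x224 inputs, as in the experiment
    with torch.no_grad():
        for _ in range(10):                               # warm-up runs (assumed)
            model(x)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(100):                              # 100 timed runs, as for figs. 4-6
            model(x)
        torch.cuda.synchronize()
    tau_r = (time.time() - start) / 100                   # average per-batch reasoning delay
    mem_mb = torch.cuda.max_memory_allocated() / 2**20    # peak video memory for this batch size
    print(f"r={r:3d}  throughput={r / tau_r:8.1f} images/s  memory={mem_mb:7.1f} MB")
```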
In the experiment, a curve is fitted by least squares to analyze the relationship between the batch size and the throughput. The experiments show that, at batch size r, the batch reasoning delay is τ_r = v·r + τ_0, where v > 0 is the slope describing the growth of the reasoning delay as the I/O operations increase with the batch size and τ_0 > 0 is the intercept of the reasoning delay. From this, the service rate at batch size r (in batches/s) is μ_r = 1/τ_r = 1/(v·r + τ_0), and the throughput at batch size r (in images/s) is r × μ_r.
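The least-squares fit and the derived service rate can be sketched as follows; the measured delay values in the sketch are placeholders standing in for the data behind figs. 4 and 5.

```python
import numpy as np

# Placeholder measurements (batch size, per-batch reasoning delay in seconds)
# standing in for the data behind figs. 4 and 5.
r = np.array([1, 2, 4, 8, 16, 32, 64], dtype=float)
tau = np.array([0.004, 0.005, 0.007, 0.011, 0.019, 0.035, 0.067])

v, tau_0 = np.polyfit(r, tau, 1)        # least-squares fit of tau_r = v*r + tau_0
mu = 1.0 / (v * r + tau_0)              # service rate in batches/s at each batch size
throughput = r * mu                     # throughput in images/s at each batch size
print(f"v = {v:.5f} s/image, tau_0 = {tau_0:.5f} s")
```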
In the scenario where a server provides a deep learning model reasoning service, a large number of clients submit tasks to the server, and the server organizes the tasks into a queue. In the experiment, N denotes the maximum number of tasks waiting in the batch task queue. According to the Service Level Agreement (SLA), the response time is an important indicator of cloud or network services, so N should not be too large, since a too large N would cause tasks at the tail of the queue to time out due to high delay; N = 128 is set in this experiment. The service delay W comprises two parts, queuing delay and reasoning delay. To imitate the random arrival of reasoning tasks in the experiment, the task arrival process is assumed to follow a Poisson distribution.
In a practical system, since tasks arrive randomly, a newly arrived task does not necessarily receive service immediately, so the deep learning service delay includes not only the reasoning delay but also the queuing delay. In the experiment, the service delay W and the blocking probability P_block of GoogLeNet on the RTX 2080 GPU are tested under different system states. The batch size of a processing round is determined by the number of tasks waiting at the server: if that number is smaller than the upper limit b of the batch size, the server processes all waiting tasks as one batch; otherwise, the batch size is the upper limit b. Considering that the batch size is limited by the GPU memory, a maximum value B is set for b in the experiment, i.e., b ≤ B and B ≤ N; in this experiment B is set to 64. B × μ_B corresponds to the maximum throughput of a given service. The flow intensity is defined as ρ = λ/(B·μ_B), where μ_B is the service rate at batch size B and λ is the task arrival rate; ρ < 1 is required, because when λ is greater than or equal to the maximum throughput B × μ_B, increasing the arrival rate λ only increases P_block.
The case a = 1 (a is the lower limit of the batch size; a = 1 gives dynamic batch sizes) is compared with the case a = b (i.e., a fixed batch size), with ρ set to 0.75, giving a task arrival rate λ = 990, i.e., on average 990 tasks arrive per second. As shown in fig. 7, in the case a = 1, when the value of b is small the service delay decreases as b increases, and when b becomes large the service delay only fluctuates slightly. The service delay in the case a = b is larger than in the case a = 1, because the tasks that arrive first must wait until there are at least b tasks in the queue before the batch can be processed. Fig. 8 shows that when the upper batch limit b is fixed, increasing the lower limit a increases the service delay; the corresponding memory footprint is shown in fig. 9. It can thus be concluded that dynamic batch sizes perform better than fixed batch sizes. In addition, when a = b = 1, i.e., no batching is performed, the average service delay is 781 ms and P_block is 83%; P_block follows almost the same trend as the service delay W, because the lower the service delay, the lower the queuing delay of a task and thus the lower the probability that the queue is full when a task arrives.
In deep learning reasoning computation, GPU video memory is an important resource. Unlike CPU computation, a physical GPU cannot precisely limit the video memory usage of a process, but the GPU can be virtualized through remote API and PCI pass-through technologies, or assigned to different containers for different services through nvidia-docker. The GPU server provides all the GPU video memory to one process and is prepared to allocate more when the process requests it. Since memory overflow (Out of Memory) and page faults (Page Fault) are fatal errors for the GPU that stop the process in which they occur, and different processes compete for GPU memory, the GPU memory consumption of a process should be reduced as much as possible without affecting its running speed. Because each image needs to be loaded into the video memory during reasoning and an output tensor is generated at each layer of the neural network, the memory usage m_r is linear in the batch size r. As shown in fig. 6, the relationship between memory usage and batch size can be expressed as m_r = k·r + m_0, where k > 0 is the slope describing the increase of memory usage with the batch size r and m_0 > 0 is the memory usage for loading the deep learning model.
In the deep learning reasoning service system based on dynamic batching, the size of each batch is constrained by a ≤ r ≤ b, i.e., the server starts batch reasoning when the number of waiting tasks is at least a, and the number of tasks in one batch cannot exceed b. The batch size is also limited by the GPU video memory, i.e., b is at most B, so b ≤ B. To obtain the average service delay, the number of waiting tasks in the system at any moment needs to be analyzed. Since the service process follows a general distribution, the evolution of the number of waiting tasks at an arbitrary moment is a non-Markov process, and in addition the reasoning delay depends on the batch size, which lies between a and b. To simplify the analysis, in this experiment the embedded Markov chain (eMC) technique is first used to obtain the transition probabilities of a Markov process with two dimensions, namely the number of queue waiting tasks n and the batch size r. The embedded Markov process records the number of waiting tasks at each instant a batch completes, referred to in this embodiment as the batch departure time, as shown in fig. 10. The probability matrix of the system state at an arbitrary moment is then obtained from the relation between the system state at the batch departure times and the state probabilities at arbitrary moments. In this embodiment, X(t) = (n(t), r(t)) denotes the two-dimensional Markov process formed by the number of queue waiting tasks n and the departing batch size r at each batch departure time, where t is the index of the batch departure time, i.e., which departing batch it is, n(t) is the number of queue waiting tasks at the t-th batch departure time, and r(t) is the size of the t-th batch.
The proof that the two-dimensional Markov process can represent the relationship between the number of waiting tasks and the departing batch size for the deep learning reasoning service is as follows:
Let V_{t,t+1}(r) denote the number of tasks arriving between batch departure times t and t+1 when the batch size is r. The transition relations of n(t) and r(t) are derived case by case as follows:
· n(t) < a, i.e., the number of tasks in the queue at batch departure time t is less than a: the server needs to wait until a − n(t) further tasks arrive before reasoning on a batch of size a. Therefore, n(t+1) = V_{t,t+1}(a) and r(t+1) = a.
· a ≤ n(t) ≤ b, i.e., the number of tasks at batch departure time t is between a and b: all n(t) tasks will be inferred as one batch. Thus n(t+1) = V_{t,t+1}(n(t)) and r(t+1) = n(t).
· n(t) > b, i.e., the number of tasks at batch departure time t is greater than b: the first b tasks will be inferred as one batch. Thus n(t+1) = n(t) − b + V_{t,t+1}(b) and r(t+1) = b.
It can thus be seen that the values of n(t+1) and r(t+1) are determined by n(t), r(t) and V_{t,t+1}(r(t+1)), and that V_{t,t+1}(r(t+1)) follows a Poisson distribution that is memoryless across batch departure intervals, so a two-dimensional Markov process can be used to represent the relationship between the number of waiting tasks of the deep learning reasoning service and the departing batch size.
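The departure-time dynamics derived above can be illustrated with a small Monte Carlo sketch (an interpretation for illustration, not part of the patent); the arrival rate, batch limits and delay parameters are placeholder values, and the extra waiting time before a batch of size a fills when n(t) < a is ignored for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_departures(lam, a, b, N, v, tau_0, steps=10000):
    """Simulate (n(t), r(t)) at successive batch departure times of the
    dynamic-batching server (illustrative sketch of the three cases above)."""
    n = 0
    history = []
    for _ in range(steps):
        if n < a:
            r = a          # wait until a tasks have accumulated, then serve a batch of size a
            n = 0          # (the waiting time before the batch fills is ignored in this sketch)
        elif n <= b:
            r = n          # serve all waiting tasks as one batch
            n = 0
        else:
            r = b          # serve the first b tasks, the rest keep waiting
            n = n - b
        arrivals = rng.poisson(lam * (v * r + tau_0))  # V_{t,t+1}(r): Poisson arrivals during service
        n = min(N, n + arrivals)                       # the queue capacity N caps the backlog
        history.append((n, r))
    return history

# Placeholder parameters: 990 tasks/s, a=1, b=32, N=128, fitted delay model tau_r = v*r + tau_0.
trace = simulate_departures(lam=990, a=1, b=32, N=128, v=0.0005, tau_0=0.004)
```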
From the Markov property of the process X(t), the probability of each system state can be analyzed. The state space of X(t) is the product of the number of remaining tasks, ranging from 0 to N, and the departing batch size, ranging from a to b. The probability transition matrix has dimension (N+1)(b−a+1) × (N+1)(b−a+1). To simplify the analysis, the original matrix is partitioned into sub-matrices of size (b−a+1) × (b−a+1), where each element θ(·) of a sub-matrix is a transition probability and the value in brackets denotes a batch size between a and b.
For ease of explanation of the derivation, the following quantities are defined first: let p_j(r) denote the probability that j tasks arrive at the server while a batch of size r is being processed. The size of the next batch equals a if n(t) ≤ a, equals n(t) if a < n(t) ≤ b, and equals b if n(t) > b. The value of θ(·) is then determined in the following cases:
· n(t) ≤ b and n(t+1) ≤ N−1. In this case the size of the next batch is max(a, n(t)): if n(t) < a the server waits for further tasks to arrive before reasoning, otherwise it performs batch reasoning on all waiting tasks. The probability that n(t+1) tasks are present in the server when the next batch completes is p_{n(t+1)}(max(a, n(t))); the corresponding entry of θ(·) equals this probability, and the other entries are 0.
· n(t) ≤ b and n(t+1) = N. As in the previous case, the size of the next batch is max(a, n(t)). Since n(t+1) = N, at least N tasks arrive during the service of this batch. The probability that n(t+1) tasks are present in the server when the next batch completes is Σ_{j ≥ N} p_j(max(a, n(t))); the corresponding entry of θ(·) equals this probability, and the other entries are 0.
· b < n(t) ≤ N and n(t+1) ≤ N−1. In this case the size of the next batch is b. For n(t+1) ≤ N−1 to hold, the number of arriving tasks equals n(t+1) − (n(t) − b). The probability that n(t+1) tasks are present in the server when the next batch completes is p_{n(t+1)−(n(t)−b)}(b); the corresponding entry of θ(·) equals this probability, and the other entries are 0.
· b < n(t) ≤ N and n(t+1) = N. In this case the size of the next batch is b. Since n(t+1) = N, at least N − (n(t) − b) tasks arrive during the batch reasoning. The probability that n(t+1) tasks are present in the server when the next batch completes is Σ_{j ≥ N−(n(t)−b)} p_j(b); the corresponding entry of θ(·) equals this probability, and the other entries are 0.
Based on the transition probability matrix, the steady-state probability matrix of X(t) at batch departure times can be derived by solving the balance equations together with the normalization condition (the stationary distribution of the embedded chain). Its entries give the steady-state probability that, at a batch departure time with batch size r, there are n tasks in the queue. Since all states of the system are to be described, this departure-time steady-state probability matrix must be related to the steady-state probability matrix π of the system at an arbitrary moment, whose entries are denoted π_{n,r}. By calculation, the steady-state probabilities π_{n,r} of the system states with 0 ≤ n ≤ N−1 are obtained as a function of the departure-time probabilities and the average reasoning delay s_r at batch size r. The steady-state probability of the state n = N is expressed in terms of p_{n,r}(0), the probability that there are n tasks in the queue and the remaining service time of the batch of size r is 0.
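Numerically, the stationary distribution of the embedded chain can be obtained as in the following sketch, assuming the transition matrix over the flattened (n, r) state space has already been assembled from the θ(·) probabilities above; this is an illustrative computation, not the patent's derivation.

```python
import numpy as np

def stationary_distribution(P):
    """Solve pi = pi @ P with sum(pi) = 1 for a finite Markov chain whose
    row-stochastic transition matrix P is given (here, the embedded chain
    over the flattened (n, r) states at batch departure times)."""
    S = P.shape[0]
    # Stack pi (P - I) = 0 with the normalisation constraint sum(pi) = 1.
    A = np.vstack([P.T - np.eye(S), np.ones(S)])
    rhs = np.zeros(S + 1)
    rhs[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, rhs, rcond=None)
    return pi
```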
after the analysis, the obtained important index can be calculated, the average task number E (L) in the system is shown as a formula (3), the system comprises the tasks in the queuing and the tasks in the service, and then the average service delay E (W) can be obtained according to the Liteur rule and is shown as a formula (2), and the blocking probability P block As shown in formula (4), according to lambda (1-P block ) The effective arrival rate can be calculated.
Through the experiments and the above analysis, it can be determined that in the deep learning reasoning service based on dynamic batching, the service delay initially decreases with the batch size upper bound b and then fluctuates as b increases further, while the video memory occupation increases monotonically with the batch size. Therefore, a good balance between the delay and the memory usage can be achieved by adjusting the value of the upper bound b under different system states, which is exactly the optimization model shown in formula (1). Among the constraints in formula (1), the flow intensity ρ = λ/(B·μ_B) < 1 ensures stable operation of the system, and 1 ≤ b ≤ B gives the range of the batch size upper limit. Experiments show that for small values of b, E(W(b)) decreases as b increases, and that E(W(b)) only fluctuates slightly as b grows further, so γ can be set to a small value when the system is not very sensitive to memory occupation. The solution variable of the optimization problem is the value of the batch size upper bound b, and the goal is to minimize the sum of the service delay E(W(b)) and γ·m_b.
Since the queuing model can accurately predict service delays in different system states, the influence of the batch size upper bound b and the arrival rate λ on the average service delay E(W(b)) can be analyzed through the queuing model, as shown in fig. 12:
whenWhen, i.e. bμ b Above the arrival rate λ, the service delay E (W (b)) drops slightly as b increases. As can be seen from FIG. 4 and FIG. 5, bμ b Monotonically increasing with b, where μ b The service rate when the batch size is b is shown. b is the upper limit of the batch size during service, therefore bμ b Is the upper limit of throughput during service. When b mu b When the arrival rate is greater than the arrival rate, the server can timely infer the arrival task, and service delay reduction caused by increasing the value of b is not obvious.
WhenI.e. bμ b When equal to or initially less than λ, the value of the service delay E (W (b)) may increase sharply with decreasing b. In this case, in queuing systems, the task queues must be backlogged, resulting in each newly arrived task being faced with a full task queue.
WhenI.e. bμ b At less than the arrival rate λ, the value of the service delay E (W (b)) continues to increase as b decreases. In queuing systems where the queue capacity is limited, the throughput rate has already reached saturation when the arrival rate is greater than the throughput rate, in which case decreasing the throughput rate, i.e. decreasing the value of b in the system, will result in an increase in the queuing delay of the task.
Fig. 12 shows how E(W(b)) varies with the batch size upper limit b and the arrival rate λ in the GoogLeNet reasoning service; the other deep learning models listed in fig. 1 show the same characteristics due to the similarity of their reasoning processes.
In this embodiment, after the optimization model shown in formula (1) is determined, it is solved through an iterative process, which is more efficient than a brute-force search. The iterative algorithm is implemented as follows:
The algorithm comprises the following steps. First, when b·μ_b = λ, a relatively low average service delay and memory usage can already be achieved. Solving the equation λ = b·μ_b for b gives b = λτ_0/(1 − λv), and rounding λτ_0/(1 − λv) up ensures b·μ_b ≥ λ (lines 1–2). Then, the values of E(W(b)) and E(W(b−1)) corresponding to the current b are calculated to capture the surge behaviour of the service delay, since E(W(b)) and E(W(b−1)) are the average service delays on the two sides of the point b·μ_b = λ (line 3). Finally, the value of b is adjusted according to the trade-off parameters γ and k: 1) when E(W(b−1)) − E(W(b)) < γk, the weight of the memory usage in (OP) is greater than the surge speed of the service delay, and the value of b must be decreased to reduce the memory usage; 2) when E(W(b−1)) − E(W(b)) ≥ γk, b can continue to be increased to obtain a lower service delay (lines 7–8, 11); 3) when changing the value of b makes E(W(b)) + γm_b larger, the current value of b is the optimal solution b* (lines 15–16). Since E(W(b)) and m_b are monotonically decreasing and increasing in b respectively, E(W(b)) + γm_b must attain a minimum on the domain [1, B], and the worst case for the algorithm search is that b* lies at an end point of the domain.
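The following Python sketch is one possible reading of the iterative algorithm described above (the patent's own listing is not reproduced here); E_W is assumed to be a callable returning E(W(b)) from the queueing model, and the constant m_0 is dropped from the objective since it does not affect the optimum.

```python
import math

def search_batch_upper_limit(E_W, lam, v, tau_0, gamma, k, B):
    """Iterative search for the batch-size upper limit b* (one possible reading
    of the algorithm described in the text; not the patent's own listing).

    E_W(b)   -- average service delay E(W(b)) from the queueing model
    lam      -- task arrival rate (lam * v < 1 is assumed)
    v, tau_0 -- parameters of the fitted delay model tau_r = v*r + tau_0
    gamma, k -- memory/delay trade-off weight and memory slope of m_r = k*r + m_0
    B        -- maximum admissible batch-size upper limit
    """
    # Lines 1-2: start from the smallest b with b*mu_b >= lambda.
    b = min(B, max(1, math.ceil(lam * tau_0 / (1.0 - lam * v))))

    cost = lambda x: E_W(x) + gamma * k * x          # objective of formula (1), dropping m_0

    # Line 3: compare the delay surge on the two sides of the starting point
    # to decide in which direction b should move.
    step = 1
    if b > 1 and E_W(b - 1) - E_W(b) < gamma * k:
        step = -1                                    # memory weight dominates: decrease b

    best = cost(b)
    while 1 <= b + step <= B:
        if cost(b + step) >= best:                   # lines 15-16: objective stops improving
            break
        b += step
        best = cost(b)
    return b                                         # optimal upper limit b*
```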
The performance of the optimization method is evaluated by implementing the method of the invention on the deep learning framework PyTorch, using an NVIDIA RTX 2080 GPU and the deep learning model GoogLeNet. FIG. 13 shows the service delay comparison between the method of the invention and different static batch processes for the GoogLeNet reasoning service on the NVIDIA RTX 2080 GPU under a varying task arrival rate; the arrival rate changes through 330, 800, 730, 930, 1120, 990, 330, 530, 670 and 400 tasks/second, with 50000 tasks arriving at each arrival rate. FIG. 14 shows the comparison of memory occupation between the method of the invention and different static and greedy dynamic batch processes under the same varying task arrival rates. The comparisons in the figures confirm that the optimization method of the invention is 31 times faster than single-task processing; among batch processing methods, as shown in the figures, the method of the invention is 2.2 times faster than batching with the optimal fixed batch size while using 0.8 times its GPU video memory; and compared with the greedy dynamic batch processing method, the GPU video memory occupation is only 0.3 times as much while the service delay is basically the same.
The foregoing is merely a preferred embodiment of the present invention and is not intended to limit the present invention in any way. While the invention has been described with reference to preferred embodiments, it is not intended to be limiting. Therefore, any simple modification, equivalent variation and modification of the above embodiments according to the technical substance of the present invention shall fall within the scope of the technical solution of the present invention.

Claims (4)

1. A dynamic batch processing task scheduling method of a deep learning reasoning service is characterized in that:
describing the number of tasks waiting in the queue at each batch departure moment and the size of the departing batch with a two-dimensional Markov process, determining the steady-state probability of the two-dimensional Markov process, and determining the average service delay in the deep learning reasoning service system according to the steady-state probability;
optimizing the upper limit of the batch size jointly with the average service delay and the memory usage through the optimization model shown in formula (1),
min_b E(W(b)) + γ·m_b,  subject to  λ/(B·μ_B) < 1 and 1 ≤ b ≤ B ≤ N   (1)
in formula (1), E(W(b)) is the average service delay corresponding to the batch size upper limit b, b is the upper limit of the batch size of the batch processing tasks, W(b) is the service delay, γ is the weight of the memory usage relative to the average service delay, m_b is the memory usage corresponding to a batch size upper limit of b, B is the maximum value of the batch size upper limit, N is the maximum number of tasks waiting in the batch processing task queue, λ is the task arrival rate, and μ_B is the service rate when the batch size is B; solving the optimization model of formula (1) to determine the upper limit of the batch size of the batch processing tasks;
the average service delay is determined by formula (2),
E(W(b)) = E(L) / (λ·(1 − P_block))   (2)
in formula (2), E(W(b)) is the average service delay corresponding to the batch size upper limit b, E(L) is the average number of tasks, λ is the task arrival rate, and P_block is the blocking probability of a task;
the average number of tasks is determined by formula (3),
E(L) = Σ_{n=1}^{N} n·(π_{n,0} + Σ_{r=a}^{b} π_{n,r})   (3)
and the blocking probability is determined by formula (4),
P_block = π_{N,0} + Σ_{r=a}^{b} π_{N,r}   (4)
in formulas (3) and (4), E(L) is the average number of tasks, n is the number of waiting tasks in the batch task queue, r is the batch size, a is the lower limit of the batch size of the batch processing tasks, b is the upper limit of the batch size of the batch processing tasks, π_{n,r} is the steady-state probability that n tasks are waiting and the batch size is r, π_{n,0} is the steady-state probability that n tasks are waiting and the batch size is 0, and π_{N,r} is the steady-state probability that N tasks are waiting and the batch size is r.
2. The method for dynamic batch task scheduling for deep learning reasoning services of claim 1, wherein the solving process of the optimization model comprises:
initializing the upper limit of the batch size of the batch processing tasks and the step length by which the upper limit of the batch size is adjusted in each iteration; taking the sum of the average service delay and the memory usage corresponding to the upper limit of the batch size as the convergence parameter; and in each iteration, adjusting the upper limit of the batch size by the step length, and when the convergence parameter obtained in the current iteration is larger than that of the previous iteration, taking the upper limit of the batch size obtained in the current iteration as the optimal solution output by the optimization model.
3. The method for dynamic batch task scheduling for deep learning reasoning service of claim 2, further comprising, during the first iteration, a process of correcting the adjustment direction of the step length: when the difference between the average service delay obtained in the first iteration and the average service delay corresponding to the initialized batch size upper limit is larger than a preset threshold value, the direction in which the batch size upper limit is adjusted is reversed.
4. A dynamic batch task scheduling system of a deep learning reasoning service, characterized in that a task scheduling is performed according to the dynamic batch task scheduling method of a deep learning reasoning service as claimed in any one of claims 1 to 3.
CN202110192645.XA 2021-02-20 2021-02-20 Dynamic batch task scheduling method and system for deep learning reasoning service Active CN112860402B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110192645.XA CN112860402B (en) 2021-02-20 2021-02-20 Dynamic batch task scheduling method and system for deep learning reasoning service

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110192645.XA CN112860402B (en) 2021-02-20 2021-02-20 Dynamic batch task scheduling method and system for deep learning reasoning service

Publications (2)

Publication Number Publication Date
CN112860402A CN112860402A (en) 2021-05-28
CN112860402B true CN112860402B (en) 2023-12-05

Family

ID=75988278

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110192645.XA Active CN112860402B (en) 2021-02-20 2021-02-20 Dynamic batch task scheduling method and system for deep learning reasoning service

Country Status (1)

Country Link
CN (1) CN112860402B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113656150A (en) * 2021-08-20 2021-11-16 上海熠知电子科技有限公司 Deep learning computing power virtualization system
CN114691314B (en) * 2021-10-14 2024-07-19 上海交通大学 Service scheduling method based on deterministic operator coexistence and GPU (graphics processing Unit) applied by same
CN113961328B (en) * 2021-10-26 2022-07-19 深圳大学 Task processing method and device, storage medium and electronic equipment
CN117376423B (en) * 2023-12-08 2024-03-12 西南民族大学 Deep learning reasoning service scheduling method, system, equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106792779A (en) * 2016-12-30 2017-05-31 浙江大学 It is a kind of to permit and exempting from the cellular network connection control method of licensed band work
CN110312272A (en) * 2019-07-23 2019-10-08 中南大学 A kind of network services block resource allocation methods and storage medium
CN112346866A (en) * 2020-11-05 2021-02-09 中国科学院计算技术研究所 GPU (graphics processing Unit) scheduling method and system based on asynchronous data transmission

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
SG10201913208QA (en) * 2016-10-28 2020-02-27 Illumina Inc Bioinformatics systems, apparatuses, and methods for performing secondary and/or tertiary processing

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106792779A (en) * 2016-12-30 2017-05-31 浙江大学 It is a kind of to permit and exempting from the cellular network connection control method of licensed band work
CN110312272A (en) * 2019-07-23 2019-10-08 中南大学 A kind of network services block resource allocation methods and storage medium
CN112346866A (en) * 2020-11-05 2021-02-09 中国科学院计算技术研究所 GPU (graphics processing Unit) scheduling method and system based on asynchronous data transmission

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
"An Energy Efficient Task Scheduling Strategy in a Cloud Computing System and its Performance Evaluation using a Two-Dimensional Continuous Time Markov Chain Model";Zhao, Wenjuan, Xiushuang Wang, Shunfu Jin, Wuyi Yue, and Yutaka Takahashi.;《Electronics 8》;全文 *
"Border crossing delay prediction using transient multi-server queueing models";Lin, Lei, Qian Wang, and Adel W. Sadek.;《Transportation Research Part A: Policy and Practice64 (2014)》;全文 *
"Delay-Aware IoT Task Scheduling in Space-Air-Ground Integrated Network";C.Zhou et al.;《2019 IEEE Global Communications Conference (GLOBECOM)》;全文 *
"Delay-optimal proactive service framework for block-stream as a service";Zhang, Deyu, et al.;《IEEE Wireless Communications Letters 7.4 (2018)》;第598-601页 *
"Stationary analysis and optimal control under multiple working vacation policy in a GI/M (a, b)/1 queue";Panda, Gopinath, Abhijit Datta Banik, and Dibyajyoti Guha.;《Journal of Systems Science and Complexity 31 (2018)》;全文 *
"基于系统调度与随机算法的云服务优化技术研究";王斐;《中国博士学位论文全文数据库 基础科学辑》;全文 *
何华 ; 林闯 ; 赵增华 ; 庞善臣 ; ."使用确定随机Petri网对Hadoop公平调度的建模和性能分析".《计算机应用》.2015,全文. *
赵海军 ; 崔梦天 ; 李明东 ; 何先波 ; ."基于CTMC和状态空间模型的宽带无线接入网的QoS性能研究".《电子学报》.2018,全文. *

Also Published As

Publication number Publication date
CN112860402A (en) 2021-05-28

Similar Documents

Publication Publication Date Title
CN112860402B (en) Dynamic batch task scheduling method and system for deep learning reasoning service
JP6539236B2 (en) System and method for use in effective neural network deployment
CN110546654B (en) Enhancing processing performance of DNN modules by constructing bandwidth control of interfaces
CN108885571B (en) Input of batch processing machine learning model
CN113950066A (en) Single server part calculation unloading method, system and equipment under mobile edge environment
US10140572B2 (en) Memory bandwidth management for deep learning applications
US20210295168A1 (en) Gradient compression for distributed training
US10846096B1 (en) Batch processing of requests for trained machine learning model
CN108012156A (en) A kind of method for processing video frequency and control platform
CN113469355B (en) Multi-model training pipeline in distributed system
CN110531996B (en) Particle swarm optimization-based computing task unloading method in multi-micro cloud environment
US20240232630A1 (en) Neural network training in a distributed system
US11494326B1 (en) Programmable computations in direct memory access engine
WO2019001323A1 (en) Signal processing system and method
CN110689045A (en) Distributed training method and device for deep learning model
CN110489955B (en) Image processing, device, computing device and medium applied to electronic equipment
CN114008589A (en) Dynamic code loading for multiple executions on a sequential processor
CN114240506A (en) Modeling method of multi-task model, promotion content processing method and related device
CN109840597B (en) Model prediction method and device, electronic equipment and storage medium
CN111860557B (en) Image processing method and device, electronic equipment and computer storage medium
CN113361621B (en) Method and device for training model
WO2022213073A1 (en) Sparse machine learning acceleration
US20220148298A1 (en) Neural network, computation method, and recording medium
CN114819088A (en) Network structure searching method and device, readable storage medium and electronic equipment
CN115599533A (en) Task processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant