CN115454585A - Adaptive batch processing and parallel scheduling system for deep learning model inference of edge equipment - Google Patents


Info

Publication number
CN115454585A
Authority
CN
China
Prior art keywords
scheduling
model
inference
batch processing
parallel
Prior art date
Legal status
Pending
Application number
CN202210662359.XA
Other languages
Chinese (zh)
Inventor
张子阳
刘劼
李峰
李欢
林昌垚
Current Assignee
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN202210662359.XA
Publication of CN115454585A
Legal status: Pending

Classifications

    • G06F 9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G06F 16/3334 Selection or weighting of terms from queries, including natural language queries
    • G06F 16/3346 Query execution using probabilistic model
    • G06F 9/5038 Allocation of resources to service a request, the resource being a machine, considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
    • G06F 9/5072 Grid computing (partitioning or combining of resources)
    • G06N 5/04 Inference or reasoning models
    • G06F 2209/502 Indexing scheme relating to resource allocation: proximity
    • G06F 2209/5021 Indexing scheme relating to resource allocation: priority
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides an adaptive batch processing and parallel scheduling system for deep learning model inference on edge devices, comprising a decision module, a dynamic batch scheduling module, a model parallel module and a performance analyzer. The decision module models the scheduling of batched and parallel deep learning inference and selects an appropriate batch size and model parallelism for each model; the dynamic batch scheduling module performs batched inference; the model parallel module processes multiple inference requests simultaneously; the performance analyzer collects the system state of the edge device online in real time. Compared with traditional heuristics and other reinforcement learning methods, the scheduling decision algorithm based on maximum entropy reinforcement learning improves the balance between system throughput and inference latency by 3.2-58%, converges 1.8-6.5 times faster than the other algorithms, and incurs an average scheduling overhead of only 49% of that of the other algorithms.

Description

Adaptive batch processing and parallel scheduling system for deep learning model inference of edge equipment
Technical Field
The invention belongs to the technical field of edge computing, and particularly relates to an adaptive batch processing and parallel scheduling system for deep learning model inference of edge equipment.
Background
Edge computing makes it possible to sink computing power from the cloud to edge devices and to run deep learning inference workloads in real time at the edge. Owing to the parallelism of hardware accelerators, applying batching and parallel execution to deep learning inference can effectively improve throughput and reduce latency. However, because of power-consumption and cost constraints, edge devices cannot provide abundant computing and memory resources; when those resources are shared by multiple tenants, system throughput and inference latency are greatly affected, and the real-time performance of applications cannot be guaranteed.
Disclosure of Invention
The invention provides an adaptive batch processing and parallel scheduling system for deep learning model inference on edge devices, aiming at improving system throughput while keeping inference latency low when the limited resources of an edge device are shared by multiple tenants.
The invention is realized by the following technical scheme:
An adaptive batch processing and parallel scheduling system for deep learning model inference on edge devices comprises:
a decision module, a dynamic batch scheduling module, a model parallel module and a performance analyzer;
the decision module models the batching and parallel scheduling of arriving inference requests as a Markov decision process, makes batching and parallel-inference scheduling decisions through a scheduling decision algorithm, and automatically selects an appropriate batch size and model parallelism for each model;
the dynamic batch scheduling module appends inference requests to a request sequence in order of arrival and schedules them to batch slots $s_i$ on multiple instances of the model for batched inference;
the model parallel module allows multiple instances of different models or of the same model to execute in parallel, processing multiple inference requests of a model simultaneously;
the performance analyzer collects the system state of the edge device online in real time, including the current CPU, GPU, memory and energy-consumption utilization, the system throughput and the inference latency.
A control method of the adaptive batch processing and parallel scheduling system for deep learning model inference on edge devices comprises the following steps:
Step 1: the terminal device sends an inference request to the decision module of the inference system;
Step 2: the decision module models the batching and parallel scheduling of the arriving inference requests as a Markov decision process, makes batching and parallel-inference scheduling decisions through a scheduling decision algorithm, and automatically selects an appropriate batch size and model parallelism for each model;
Step 3: the dynamic batch scheduling module appends the inference requests to a request sequence in order of arrival and schedules them to batch slots $s_i$ on multiple instances of the model for batched inference;
Step 4: the model parallel module executes multiple instances of different models or of the same model in parallel, processing multiple inference requests of the models simultaneously;
Step 5: the performance analyzer collects the system state of the edge device online in real time, including the current CPU, GPU, memory and energy-consumption utilization, the system throughput and the inference latency, and feeds it back to the scheduling decision module, which makes the scheduling decision for the inference requests at the next moment on the basis of an analysis of the currently available system resources.
Further,
the Markov decision process is described by a quintuple $(\mathcal{S}, \mathcal{A}, \pi, p, r)$, whose elements are defined as follows:
The state: $\mathcal{S}$ is a discrete state space; at each scheduling time step, the reinforcement learning agent constructs a state $s_t$ ($s_t \in \mathcal{S}$) by periodically collecting inference request information and system state information on the edge device;
the system state information includes the following parts:
(I) the model type $m_t$ of the current inference request;
(II) the data type $d_t$ and data size $d_s$ of the current request;
(III) the absolute deadline $ddl_a$ and relative deadline $ddl_r$ of the current request;
(IV) the currently available CPU, GPU, memory and energy-consumption utilization of the edge device, denoted $C_u$, $G_u$, $M_u$ and $E_u$ respectively;
(V) the information $seq_b$ of the request sequences waiting to be scheduled;
The action: $\mathcal{A}$ is a discrete action space used to select an appropriate batch size $b$ and model parallelism $m_c$; the action taken by the agent at scheduling time $t$ is therefore denoted $a_t = (b, m_c)$;
The policy: the policy $\pi(a_t \mid s_t)$ is the function by which the agent determines the next action $a_t \in \mathcal{A}$ from the current environment state $s_t$ at time $t$;
the entropy of the visited states is maximized while maximizing the cumulative expected reward obtained by the agent; the optimal policy $\pi^*$ is given by equation (2):
$$\pi^* = \arg\max_{\pi} \sum_t \mathbb{E}_{(s_t, a_t) \sim p_\pi}\big[\gamma^t r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot \mid s_t))\big] \tag{2}$$
where $\gamma \in [0, 1]$ is a discount factor, $p_\pi$ is the trajectory distribution generated by policy $\pi$, $\alpha$ is a temperature parameter controlling whether the optimization objective focuses more on the reward or on the entropy, and $\mathcal{H}(\pi(\cdot \mid s_t))$ denotes the entropy of policy $\pi$ in state $s_t$;
The state transition probability: $p(s'_t \mid s_t, a_t)$ is the probability of transitioning to the next state $s'_t$ after taking action $a_t$ in state $s_t$ at time $t$, satisfying $\sum_{s' \in \mathcal{S}} p(s'_t \mid s_t, a_t) = 1$;
The reward: $r: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ is the reward function; the goal of the agent is to maximize the expected cumulative reward $\mathbb{E}\big[\sum_t \gamma^t r_t\big]$, where $r_t$ is the immediate reward obtained when, at each scheduling time $t$, the agent selects an appropriate batch size and model parallelism and inference is performed;
so that the reward reflects the objective function, equation (3) defines $r_t$ as a weighted combination, with weights $\omega$ and $\xi$, of the utility achieved by the batch size $b$ and model parallelism $m_c$ selected by the agent and the average system resource utilization $u = (C_u + G_u + M_u + E_u)/4$.
Further,
the scheduling decision algorithm BP-SACD is based on an Actor-Critic framework; the Critic uses an action-state value function (Q-function) to judge the quality of the action the Actor takes according to the policy, i.e. soft policy iteration is used to maximize the reward while also maximizing the entropy;
soft policy iteration comprises two steps, policy evaluation and policy improvement, which alternate during training;
the policy evaluation step is as follows:
first the soft Q-function is computed, defined via the modified Bellman backup operator $\mathcal{T}^\pi$ as:
$$\mathcal{T}^\pi Q(s_t, a_t) := r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim p}\big[V(s_{t+1})\big] \tag{4}$$
where
$$V(s_t) := \pi(s_t)^T \big[Q(s_t) - \alpha \log(\pi(s_t))\big] \tag{5}$$
is the soft state value function in the discrete case;
the soft Q-function is trained by minimizing the soft Bellman residual in equation (6):
$$J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}}\Big[\tfrac{1}{2}\big(Q_\theta(s_t, a_t) - (r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim p}[V_{\bar{\theta}}(s_{t+1})])\big)^2\Big] \tag{6}$$
where $\mathcal{D}$ is the replay buffer and $\bar{\theta}$ denotes target network parameters.
Further,
the policy improvement step updates the policy according to:
$$\pi_{new} = \arg\min_{\pi' \in \Pi} D_{KL}\left(\pi'(\cdot \mid s_t) \,\Big\|\, \frac{\exp\big(Q^{\pi_{old}}(s_t, \cdot)/\alpha\big)}{Z^{\pi_{old}}(s_t)}\right) \tag{7}$$
where $D_{KL}$ denotes the KL divergence and $Z^{\pi_{old}}(s_t)$ denotes the partition function;
the parameters of the policy network are updated by minimizing the KL divergence in equation (8):
$$J_\pi(\phi) = \mathbb{E}_{s_t \sim \mathcal{D}}\Big[\pi_\phi(s_t)^T \big[\alpha \log(\pi_\phi(s_t)) - Q_\theta(s_t)\big]\Big] \tag{8}$$
Temperature parameter: the temperature parameter can be adjusted automatically using equation (9):
$$J(\alpha) = \mathbb{E}_{s_t \sim \mathcal{D}}\Big[\pi_t(s_t)^T \big[-\alpha\big(\log(\pi_t(s_t)) + \bar{\mathcal{H}}\big)\big]\Big] \tag{9}$$
where $\bar{\mathcal{H}}$ is a constant vector equal to the hyperparameter representing the target entropy.
Further,
when the dynamic batch scheduling module faces multiple request sequences of the same model, it adds them to batch slots $s_i$ in order of their arrival;
the dynamic batcher maintains a separate request sequence for each model, containing all inference requests belonging to that model; according to the decision module these are aggregated into request sequences of different batch sizes.
Further,
in the model parallel module,
if the current GPU is idle and multiple requests arrive at the same time, the model parallel module dispatches them to the GPU immediately and the GPU's hardware scheduler begins processing them in parallel;
if multiple requests for the same model arrive at the same time, model inference is performed by scheduling only one request at a time on the GPU.
Further,
the performance analyzer records, for each execution on the edge device, the system state information of a series of inference models under input data of different sizes;
the system state information is fed back to the scheduling decision module in time, and on the basis of an analysis of the currently available system resources a scheduling decision is made for the inference requests at the next moment, selecting an appropriate batch size and model parallelism to maximize the utility function while avoiding system overload and improving resource utilization.
An electronic device comprising a memory storing a computer program and a processor implementing the steps of any of the above methods when the processor executes the computer program.
A computer readable storage medium storing computer instructions which, when executed by a processor, implement the steps of any of the above methods.
Beneficial effects of the invention
Compared with traditional heuristics and other reinforcement learning methods, the scheduling strategy based on maximum entropy reinforcement learning designed by the invention improves the balance between system throughput and inference latency by 3.2-58%, and converges 1.8-6.5 times faster than the other algorithms. Furthermore, its average scheduling overhead is only 49% of that of the other algorithms.
Drawings
FIG. 1 is a framework diagram of BPEdge, the batched and parallel inference system of the invention;
FIG. 2 shows the dynamic batch scheduling module of the invention;
FIG. 3 shows the model parallel module of the invention;
FIG. 4 shows the utility values of different algorithms under the YOLO-v5 and MobileNet-v3 inference models, where (a) is the utility value for YOLO-v5 and (b) is the utility value for MobileNet-v3;
FIG. 5 shows the batch sizes selected by the different algorithms, where (a) is the batch size for YOLO-v5 and (b) is the batch size for MobileNet-v3;
FIG. 6 shows the model parallelism selected by the different algorithms, where (a) is the model parallelism for YOLO-v5 and (b) is the model parallelism for MobileNet-v3;
FIG. 7 shows the average system throughput of the different algorithms, where (a) is the average system throughput for YOLO-v5 and (b) is the average system throughput for MobileNet-v3;
FIG. 8 shows the average inference latency of the different algorithms, where (a) is the average inference latency for YOLO-v5 and (b) is the average inference latency for MobileNet-v3;
FIG. 9 shows the average scheduling overhead of the different algorithms.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
With reference to fig. 1 to 9.
The invention first proposes a utility function to evaluate the degree of balance between system throughput and inference latency, which can be expressed in the form of equation (1):
$$utility(b, m_c) = \frac{T_{s_i}(b, m_c)}{L(b, m_c)}, \qquad \text{s.t. } L(b, m_c) \le \frac{\sum ddl_a}{m_c} \tag{1}$$
where $T_{s_i}(b, m_c)$ denotes the system throughput when inference is executed in scheduling slot $s_i$ with batch size $b$ and model parallelism $m_c$, $L(b, m_c)$ denotes the actual batched and parallel inference latency of the model, and $\frac{\sum ddl_a}{m_c}$ denotes the ratio of the sum of the absolute deadlines of the inference requests within a batch to the parallelism of the current model. In particular, the constraint $L(b, m_c) \le \frac{\sum ddl_a}{m_c}$ ensures that no request fails to be scheduled, achieving real-time inference.
Table 1 lists the main symbol definitions and descriptions (rendered as an image in the original).
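By way of illustration, the following minimal sketch evaluates the utility of a candidate action $(b, m_c)$ under the definitions above. The function and parameter names are illustrative assumptions, not the patent's implementation; the throughput/latency ratio follows the reconstructed equation (1).

```python
from typing import Sequence

def utility(throughput: float, latency: float,
            abs_deadlines: Sequence[float], m_c: int) -> float:
    """Utility of executing a batch with parallelism m_c (cf. eq. (1)).

    throughput:    T_{s_i}(b, m_c), requests per second in slot s_i
    latency:       L(b, m_c), actual batched + parallel inference latency
    abs_deadlines: absolute deadlines ddl_a of the requests in the batch
    m_c:           model parallelism
    """
    deadline_budget = sum(abs_deadlines) / m_c
    if latency > deadline_budget:
        # the constraint of eq. (1) is violated: the batch would miss deadlines
        return float("-inf")
    # higher throughput and lower latency both increase utility
    return throughput / latency
```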
An adaptive batch processing and parallel scheduling system for deep learning model inference on edge devices comprises:
a decision module, a dynamic batch scheduling module, a model parallel module and a performance analyzer.
The decision module is the core component of the system.
Through a Markov decision process (MDP), the batching and parallel scheduling of arriving inference requests is modeled; scheduling decisions for batched and parallel inference are then made by the scheduling decision algorithm (BP-SACD), which automatically selects an appropriate batch size and model parallelism for each model, improving system throughput while keeping inference latency low.
The dynamic batch scheduling module appends batches of inference requests of the same model to a request sequence in order of arrival and schedules them to batch slots $s_i$ on multiple instances of the model for batched inference.
The model parallel module allows multiple instances of different models or of the same model to execute in parallel, processing multiple inference requests of a model simultaneously.
The performance analyzer collects the system state of the edge device online in real time, including the current CPU, GPU, memory and energy-consumption utilization, the system throughput and the inference latency.
A control method of the adaptive batch processing and parallel scheduling system for deep learning model inference on edge devices comprises the following steps:
Step 1: the terminal device sends an inference request to the scheduling decision module.
Step 2: the scheduling decision module models the batching and parallel scheduling of the arriving inference requests as a Markov decision process, makes batching and parallel-inference scheduling decisions through the scheduling decision algorithm, and automatically selects an appropriate batch size and model parallelism for each model.
Step 3: the dynamic batch scheduling module appends batches of inference requests of the same model to a request sequence in order of arrival and schedules them to batch slots $s_i$ on multiple instances of the model for batched inference.
Step 4: the model parallel module executes multiple instances of different models or of the same model in parallel, processing multiple inference requests of the models simultaneously.
Step 5: the performance analyzer collects the system state of the edge device online in real time, including the current CPU, GPU, memory and energy-consumption utilization, the system throughput and the inference latency, and feeds it back to the scheduling decision module, which makes the scheduling decision for the inference requests at the next moment on the basis of an analysis of the currently available system resources.
Markov decision process modeling
The Markov decision process (MDP) is described by a quintuple $(\mathcal{S}, \mathcal{A}, \pi, p, r)$, whose elements are defined as follows:
The state: $\mathcal{S}$ is a discrete state space. At each scheduling time step, the reinforcement learning agent constructs a state $s_t \in \mathcal{S}$ by periodically collecting inference request information and system state information on the edge device.
The system state information includes the following parts:
(I) the model type $m_t$ of the current inference request;
(II) the data type $d_t$ and data size $d_s$ of the current request;
(III) the absolute deadline $ddl_a$ and relative deadline $ddl_r$ of the current request;
(IV) the currently available CPU, GPU, memory and energy-consumption utilization of the edge device, denoted $C_u$, $G_u$, $M_u$ and $E_u$ respectively;
(V) the information $seq_b$ of the request sequences waiting to be scheduled.
The action: $\mathcal{A}$ is a discrete action space used to select an appropriate batch size $b$ and model parallelism $m_c$; the action taken by the agent at scheduling time $t$ is therefore denoted $a_t = (b, m_c)$. For example, if an inference model has $n_1$ selectable batch sizes and $n_2$ selectable model parallelisms, the size of the discrete action space $\mathcal{A}$ is $n_1 \times n_2$.
The policy: the policy $\pi(a_t \mid s_t)$ is the function by which the agent determines the next action $a_t \in \mathcal{A}$ from the current environment state $s_t$ at time $t$.
Based on a maximum-entropy stochastic policy algorithm, the invention maximizes the entropy of the visited states while maximizing the cumulative expected reward obtained by the agent; the optimal policy $\pi^*$ is given by equation (2):
$$\pi^* = \arg\max_{\pi} \sum_t \mathbb{E}_{(s_t, a_t) \sim p_\pi}\big[\gamma^t r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot \mid s_t))\big] \tag{2}$$
where $\gamma \in [0, 1]$ is a discount factor, $p_\pi$ is the trajectory distribution generated by policy $\pi$, $\alpha$ is a temperature parameter controlling whether the optimization objective focuses more on the reward or on the entropy, and $\mathcal{H}(\pi(\cdot \mid s_t))$ denotes the entropy of policy $\pi$ in state $s_t$.
The state transition probability: $p(s'_t \mid s_t, a_t)$ is the probability of transitioning to the next state $s'_t$ after taking action $a_t$ in state $s_t$ at time $t$, satisfying $\sum_{s' \in \mathcal{S}} p(s'_t \mid s_t, a_t) = 1$.
The reward: $r: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ is the reward function. The goal of the agent is to maximize the expected cumulative reward $\mathbb{E}\big[\sum_t \gamma^t r_t\big]$, where $r_t$ is the immediate reward obtained when, at each scheduling time $t$, the agent selects an appropriate batch size and model parallelism and inference is performed.
So that the reward reflects the objective function, equation (3) defines $r_t$ as a weighted combination, with weights $\omega$ and $\xi$, of the utility achieved by the batch size $b$ and model parallelism $m_c$ selected by the agent and the average system resource utilization $u = (C_u + G_u + M_u + E_u)/4$.
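By way of illustration, a minimal sketch of the scheduling state $s_t$ and the immediate reward $r_t$ follows. The field encodings, the weight values and the linear reward form are assumptions made for illustration, since the exact expression of equation (3) is not reproduced in the text.

```python
from dataclasses import dataclass

@dataclass
class SchedulingState:
    """State s_t collected by the agent at each scheduling time step."""
    model_type: int      # m_t, index of the requested model
    data_type: int       # d_t
    data_size: float     # d_s
    ddl_abs: float       # ddl_a, absolute deadline
    ddl_rel: float       # ddl_r, relative deadline
    cpu_util: float      # C_u
    gpu_util: float      # G_u
    mem_util: float      # M_u
    energy_util: float   # E_u
    queue_len: int       # summary of seq_b, the waiting request sequences

def reward(utility_value: float, cpu: float, gpu: float, mem: float,
           energy: float, omega: float = 0.7, xi: float = 0.3) -> float:
    """Immediate reward r_t: weighted mix of utility and the average
    resource utilization u = (C_u + G_u + M_u + E_u) / 4.
    The linear form and the weights omega, xi are assumptions."""
    u = (cpu + gpu + mem + energy) / 4.0
    # reward high utility, penalize high utilization to avoid overload
    return omega * utility_value - xi * u
```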
A decision algorithm based on maximum entropy reinforcement learning:
the scheduling decision algorithm BP-SACD is based on an Actor-Critic framework; the Critic uses an action-state value function (Q-function) to judge the quality of the action the Actor takes according to the policy, i.e. soft policy iteration is used to maximize the reward while also maximizing the entropy.
Soft policy iteration comprises two steps, policy evaluation and policy improvement, which alternate during training.
The policy evaluation step is as follows:
first the soft Q-function is computed, defined via the modified Bellman backup operator $\mathcal{T}^\pi$ as:
$$\mathcal{T}^\pi Q(s_t, a_t) := r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim p}\big[V(s_{t+1})\big] \tag{4}$$
where
$$V(s_t) := \pi(s_t)^T \big[Q(s_t) - \alpha \log(\pi(s_t))\big] \tag{5}$$
is the soft state value function in the discrete case;
the soft Q-function is trained by minimizing the soft Bellman residual in equation (6):
$$J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}}\Big[\tfrac{1}{2}\big(Q_\theta(s_t, a_t) - (r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim p}[V_{\bar{\theta}}(s_{t+1})])\big)^2\Big] \tag{6}$$
where $\mathcal{D}$ is the replay buffer and $\bar{\theta}$ denotes target network parameters.
The policy improvement step updates the policy according to:
$$\pi_{new} = \arg\min_{\pi' \in \Pi} D_{KL}\left(\pi'(\cdot \mid s_t) \,\Big\|\, \frac{\exp\big(Q^{\pi_{old}}(s_t, \cdot)/\alpha\big)}{Z^{\pi_{old}}(s_t)}\right) \tag{7}$$
where $D_{KL}$ denotes the KL divergence and $Z^{\pi_{old}}(s_t)$ denotes the partition function;
the parameters of the policy network are updated by minimizing the KL divergence in equation (8):
$$J_\pi(\phi) = \mathbb{E}_{s_t \sim \mathcal{D}}\Big[\pi_\phi(s_t)^T \big[\alpha \log(\pi_\phi(s_t)) - Q_\theta(s_t)\big]\Big] \tag{8}$$
Temperature parameter: the temperature parameter can be adjusted automatically using equation (9):
$$J(\alpha) = \mathbb{E}_{s_t \sim \mathcal{D}}\Big[\pi_t(s_t)^T \big[-\alpha\big(\log(\pi_t(s_t)) + \bar{\mathcal{H}}\big)\big]\Big] \tag{9}$$
where $\bar{\mathcal{H}}$ is a constant vector equal to the hyperparameter representing the target entropy.
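By way of illustration, the following minimal PyTorch sketch performs one BP-SACD-style update implementing equations (4) to (9) for a discrete action space. The network shapes, the replay-batch format and all hyperparameter values are assumptions; linear layers stand in for the actual Critic and Actor networks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_states, n_actions, gamma = 10, 16, 0.99          # assumed dimensions

q_net = nn.Linear(n_states, n_actions)             # Q_theta(s): one value per action
q_target = nn.Linear(n_states, n_actions)          # Q with target parameters, eq. (6)
q_target.load_state_dict(q_net.state_dict())
pi_net = nn.Linear(n_states, n_actions)            # policy logits
log_alpha = torch.zeros(1, requires_grad=True)     # temperature alpha, eq. (9)
target_entropy = 0.98 * torch.log(torch.tensor(float(n_actions)))  # assumed H-bar

def update(s, a, r, s_next, optim_q, optim_pi, optim_alpha):
    alpha = log_alpha.exp().detach()
    # soft state value of the next state, eqs. (4)-(5)
    with torch.no_grad():
        pi_next = F.softmax(pi_net(s_next), dim=-1)
        v_next = (pi_next * (q_target(s_next) - alpha * torch.log(pi_next + 1e-8))).sum(-1)
        q_backup = r + gamma * v_next
    # soft Bellman residual, eq. (6)
    q_pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss_q = 0.5 * F.mse_loss(q_pred, q_backup)
    optim_q.zero_grad(); loss_q.backward(); optim_q.step()
    # policy improvement by minimizing the KL divergence, eqs. (7)-(8)
    pi = F.softmax(pi_net(s), dim=-1)
    log_pi = torch.log(pi + 1e-8)
    loss_pi = (pi * (alpha * log_pi - q_net(s).detach())).sum(-1).mean()
    optim_pi.zero_grad(); loss_pi.backward(); optim_pi.step()
    # automatic temperature adjustment, eq. (9)
    loss_alpha = (pi.detach() * (-log_alpha.exp() * (log_pi.detach() + target_entropy))).sum(-1).mean()
    optim_alpha.zero_grad(); loss_alpha.backward(); optim_alpha.step()
```

The optimizers would be created in the usual way, e.g. torch.optim.Adam(q_net.parameters(), lr=3e-4), with q_target periodically synchronized to q_net.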
When the dynamic batch scheduling module faces multiple request sequences of the same model, it adds them to batch slots $s_i$ in order of their arrival.
The dynamic batcher maintains a separate request sequence for each model, containing all inference requests belonging to that model; according to the decision module these are aggregated into request sequences of different batch sizes.
FIG. 2 illustrates how the dynamic batch scheduling module schedules multiple request sequences of the same model onto multiple model instances. First, the module allocates a batch slot to a newly arrived request sequence. Priority is based on the relative deadline of each request sequence: the shorter the relative deadline, the higher the priority. The request sequences are then dispatched to the corresponding batch slots according to priority; sequences of equal priority are scheduled in order of arrival time. A sketch of this policy appears after the figure description below.
The left part of the figure shows several request sequences arriving at random. The number of inference requests in each sequence is determined by the batch size chosen by the deep-reinforcement-learning-based decision module, and the individual inference requests may arrive in any order relative to those in other sequences. The right part shows how the request sequences are scheduled onto the corresponding model instances over time.
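By way of illustration, the following minimal sketch dispatches request sequences to batch slots by earliest relative deadline, breaking ties by arrival order; the class and method names are assumptions for illustration.

```python
import heapq
import itertools
from typing import Any, List, Tuple

class BatchSlotScheduler:
    """Earliest-relative-deadline-first dispatch of request sequences
    to batch slots s_i, as described for FIG. 2."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, int, Any]] = []
        self._arrival = itertools.count()   # tie-breaker: order of arrival

    def submit(self, request_sequence: Any, ddl_rel: float) -> None:
        # a shorter relative deadline ddl_r means a higher priority
        heapq.heappush(self._heap, (ddl_rel, next(self._arrival), request_sequence))

    def dispatch(self) -> Any:
        """Pop the highest-priority sequence for the next free batch slot."""
        _, _, seq = heapq.heappop(self._heap)
        return seq
```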
In the model parallel module,
if the current GPU is idle and multiple requests arrive at the same time, the model parallel module dispatches them to the GPU immediately and the GPU's hardware scheduler begins processing them in parallel;
if multiple requests for the same model arrive at the same time, model inference is performed by scheduling only one request at a time on the GPU.
FIG. 3 shows an example in which three models (model 1, model 2 and model 3) execute in parallel; specifically, each of the three models is configured to allow two instances to execute in parallel. The first three inference requests are executed immediately in parallel; the latter three inference requests must wait until the first three requests of the same models have finished executing. A sketch of this admission policy follows.
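By way of illustration, the admission behaviour described above can be sketched with one semaphore per model, sized to the model's parallelism $m_c$; the class name and the threading-based mechanism are assumptions for illustration.

```python
import threading
from typing import Callable, Dict

class ModelParallelExecutor:
    """Admits up to m_c concurrent inference requests per model; further
    requests for the same model wait, while requests for other models
    proceed in parallel (cf. FIG. 3)."""

    def __init__(self, parallelism: Dict[str, int]) -> None:
        # parallelism maps model name -> model parallelism m_c
        self._slots = {m: threading.Semaphore(mc) for m, mc in parallelism.items()}

    def infer(self, model: str, run: Callable[[], object]) -> object:
        with self._slots[model]:   # blocks while all m_c instances are busy
            return run()           # the actual GPU inference call

# Usage matching FIG. 3: three models, two parallel instances each.
executor = ModelParallelExecutor({"model1": 2, "model2": 2, "model3": 2})
```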
The performance analyzer records, for each execution on the edge device, the system state information of a series of inference models under input data of different sizes.
This system state information is fed back in time to the deep-reinforcement-learning-based decision module, which, on the basis of an analysis of the currently available system resources, makes the scheduling decision for the inference requests at the next moment, selecting an appropriate batch size and model parallelism to maximize the utility function while avoiding system overload and improving resource utilization. This also reflects the potential advantage of the scheduler of the invention in dynamic resource management and allocation. A minimal sketch of such online state collection follows.
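By way of illustration, the following sketch uses the psutil package for CPU and memory utilization; the GPU and energy probes are left as platform-specific stubs (on Jetson-class devices they might be read from tegrastats), which is an assumption rather than the patent's implementation.

```python
import time
import psutil  # assumed available on the edge device

def collect_system_state() -> dict:
    """Online snapshot of the state fed back to the decision module."""
    return {
        "timestamp": time.time(),
        "cpu_util": psutil.cpu_percent(interval=None) / 100.0,  # C_u
        "mem_util": psutil.virtual_memory().percent / 100.0,    # M_u
        "gpu_util": None,     # G_u: fill in with a platform-specific probe
        "energy_util": None,  # E_u: fill in with a platform-specific probe
    }
```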
an electronic device comprising a memory storing a computer program and a processor implementing the steps of any of the above methods when the processor executes the computer program.
A computer readable storage medium storing computer instructions which, when executed by a processor, implement the steps of any of the above methods.
Embodiment:
1. Baseline methods: traditional heuristic methods and other reinforcement learning methods are used as baselines, each substituted for the proposed maximum-entropy decision algorithm BP-SACD within the scheduling system BPEdge for comparison. The specific baselines are:
Genetic Algorithm (GA): GA is one of the most classical heuristic algorithms; its main characteristic is that, for nonlinear extremum problems, it can escape a local optimum with probability 1 and go on to find the global optimum. The invention uses the utility function as the fitness function of GA.
Proximal Policy Optimization (PPO): PPO serves as a reinforcement learning baseline; it is an on-policy algorithm with good stability.
Double Deep Q-Network (DDQN): DDQN is an off-policy algorithm that performs well in discrete settings.
Three performance indicators are considered:
(1) Utility: reflects the degree of balance between system throughput and inference latency; the higher the utility value, the higher the system throughput and the lower the inference latency;
(2) Convergence speed: reflects the relative performance of the algorithms;
(3) Scheduling overhead: the time required to perform a scheduling operation.
2. Scheduling system utility: for real-time performance, the request rate is set to 30 fps, i.e. 30 video frames of different resolutions arrive per second as inference requests, with the absolute deadline of each inference request drawn from [10 ms, 15 ms]; in total 200,000 inference requests are generated over 1.85 hours. In addition, the number of training runs for each algorithm is set to 100.
FIG. 4 shows the degree to which the four algorithms balance system throughput and inference latency when performing batched and parallel inference on two state-of-the-art models for object detection and image classification.
For the YOLO-v5-based object detection inference in FIG. 4(a), it can be seen that the performance of the BP-SACD algorithm proposed by the invention is superior to the other three methods. Specifically, the utility value of BP-SACD is 11.2%, 41.8% and 21.2% higher than that of GA, PPO and DDQN respectively, and its stability is better than that of the other three methods.
In the inference test of the MobileNet-v3-based image classification model (FIG. 4(b)), the utility value of BP-SACD likewise shows performance improvements of 3.2%, 58% and 30.2% over the other three methods respectively.
FIGS. 5 and 6 show how the batch size and the model parallelism selected by the four algorithms vary over the course of scheduling. BP-SACD finds the optimal batch size and model parallelism for the different models faster than the other three algorithms.
FIGS. 7 and 8 show how the average system throughput and the average inference latency of the four algorithms vary over the course of scheduling. Compared with the other three algorithms, BP-SACD achieves a better balance between system throughput and inference latency, tending to accept a moderate inference throughput in exchange for lower inference latency.
The adaptive batch processing and parallel scheduling system for deep learning model inference on edge devices proposed by the invention has been introduced in detail above, and the principle and implementation of the invention have been explained. The description of the above embodiments is intended only to help in understanding the method of the invention and its core idea; meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the scope of application according to the idea of the invention. In summary, the contents of this specification should not be construed as limiting the invention.

Claims (10)

1. An adaptive batch processing and parallel scheduling system for deep learning model inference on edge devices, characterized in that:
the system comprises a decision module, a dynamic batch scheduling module, a model parallel module and a performance analyzer;
the decision module models the batching and parallel scheduling of arriving inference requests as a Markov decision process, makes batching and parallel-inference scheduling decisions through a scheduling decision algorithm, and automatically selects an appropriate batch size and model parallelism for each model;
the dynamic batch scheduling module appends inference requests to a request sequence in order of arrival and schedules them to batch slots $s_i$ on multiple instances of the model for batched inference;
the model parallel module allows multiple instances of different models or of the same model to execute in parallel, processing multiple inference requests of a model simultaneously;
the performance analyzer collects the system state of the edge device online in real time, including the current CPU, GPU, memory and energy-consumption utilization, the system throughput and the inference latency.
2. A control method of the adaptive batch processing and parallel scheduling system for deep learning model inference on edge devices according to claim 1, characterized in that:
the control method specifically comprises the following steps:
Step 1: the terminal device sends an inference request to the scheduling decision module;
Step 2: the scheduling decision module models the batching and parallel scheduling of the arriving inference requests as a Markov decision process, then makes batching and parallel-inference scheduling decisions through the scheduling decision algorithm, and automatically selects an appropriate batch size and model parallelism for each model;
Step 3: the dynamic batch scheduling module appends the inference requests to a request sequence in order of arrival and schedules them to batch slots $s_i$ on multiple instances of the model for batched inference;
Step 4: the model parallel module executes multiple instances of different models or of the same model in parallel, processing multiple inference requests of the models simultaneously;
Step 5: the performance analyzer collects the system state of the edge device online in real time, including the current CPU, GPU, memory and energy-consumption utilization, the system throughput and the inference latency, and feeds it back to the scheduling decision module, which makes the scheduling decision for the inference requests at the next moment on the basis of an analysis of the currently available system resources.
3. The control method according to claim 2, characterized in that:
the Markov decision process is described by a quintuple $(\mathcal{S}, \mathcal{A}, \pi, p, r)$, whose elements are defined as follows:
State: $\mathcal{S}$ is a discrete state space; at each scheduling time step, the reinforcement learning agent constructs a state $s_t$ ($s_t \in \mathcal{S}$) by periodically collecting inference request information and system state information on the edge device;
the system state information includes the following parts:
(I) the model type $m_t$ of the current inference request;
(II) the data type $d_t$ and data size $d_s$ of the current request;
(III) the absolute deadline $ddl_a$ and relative deadline $ddl_r$ of the current request;
(IV) the currently available CPU, GPU, memory and energy-consumption utilization of the edge device, denoted $C_u$, $G_u$, $M_u$ and $E_u$ respectively;
(V) the information $seq_b$ of the request sequences waiting to be scheduled;
Action: $\mathcal{A}$ is a discrete action space used to select an appropriate batch size $b$ and model parallelism $m_c$; the action taken by the agent at scheduling time $t$ is therefore denoted $a_t = (b, m_c)$;
Policy: the policy $\pi(a_t \mid s_t)$ is the function by which the agent determines the next action $a_t \in \mathcal{A}$ from the current environment state $s_t$ at time $t$;
the entropy of the visited states is maximized while maximizing the cumulative expected reward obtained by the agent; the optimal policy $\pi^*$ is given by equation (2):
$$\pi^* = \arg\max_{\pi} \sum_t \mathbb{E}_{(s_t, a_t) \sim p_\pi}\big[\gamma^t r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot \mid s_t))\big] \tag{2}$$
where $\gamma \in [0, 1]$ is a discount factor, $p_\pi$ is the trajectory distribution generated by policy $\pi$, $\alpha$ is a temperature parameter controlling whether the optimization objective focuses more on the reward or on the entropy, and $\mathcal{H}(\pi(\cdot \mid s_t))$ denotes the entropy of policy $\pi$ in state $s_t$;
State transition probability: $p(s'_t \mid s_t, a_t)$ is the probability of transitioning to the next state $s'_t$ after taking action $a_t$ in state $s_t$ at time $t$, satisfying $\sum_{s' \in \mathcal{S}} p(s'_t \mid s_t, a_t) = 1$;
Reward: $r: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ is the reward function; the goal of the agent is to maximize the expected cumulative reward $\mathbb{E}\big[\sum_t \gamma^t r_t\big]$, where $r_t$ is the immediate reward obtained when, at each scheduling time $t$, the agent selects an appropriate batch size and model parallelism and inference is performed;
so that the reward reflects the objective function, equation (3) defines $r_t$ as a weighted combination, with weights $\omega$ and $\xi$, of the utility achieved by the batch size $b$ and model parallelism $m_c$ selected by the agent and the average system resource utilization $u = (C_u + G_u + M_u + E_u)/4$.
4. The control method according to claim 3, characterized in that:
the scheduling decision algorithm is based on an Actor-Critic framework; the Critic uses an action-state value function (Q-function) to judge the quality of the action the Actor takes according to the policy, i.e. soft policy iteration is used to maximize the reward while also maximizing the entropy;
soft policy iteration comprises two steps, policy evaluation and policy improvement, which alternate during training;
the policy evaluation step is as follows:
first the soft Q-function is computed, defined via the modified Bellman backup operator $\mathcal{T}^\pi$ as:
$$\mathcal{T}^\pi Q(s_t, a_t) := r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim p}\big[V(s_{t+1})\big] \tag{4}$$
where
$$V(s_t) := \pi(s_t)^T \big[Q(s_t) - \alpha \log(\pi(s_t))\big] \tag{5}$$
is the soft state value function in the discrete case;
the soft Q-function is trained by minimizing the soft Bellman residual in equation (6):
$$J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}}\Big[\tfrac{1}{2}\big(Q_\theta(s_t, a_t) - (r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim p}[V_{\bar{\theta}}(s_{t+1})])\big)^2\Big] \tag{6}$$
5. The control method according to claim 4, characterized in that:
the policy improvement step updates the policy according to:
$$\pi_{new} = \arg\min_{\pi' \in \Pi} D_{KL}\left(\pi'(\cdot \mid s_t) \,\Big\|\, \frac{\exp\big(Q^{\pi_{old}}(s_t, \cdot)/\alpha\big)}{Z^{\pi_{old}}(s_t)}\right) \tag{7}$$
where $D_{KL}$ denotes the KL divergence and $Z^{\pi_{old}}(s_t)$ denotes the partition function;
the parameters of the policy network are updated by minimizing the KL divergence in equation (8):
$$J_\pi(\phi) = \mathbb{E}_{s_t \sim \mathcal{D}}\Big[\pi_\phi(s_t)^T \big[\alpha \log(\pi_\phi(s_t)) - Q_\theta(s_t)\big]\Big] \tag{8}$$
Temperature parameter: the temperature parameter can be adjusted automatically using equation (9):
$$J(\alpha) = \mathbb{E}_{s_t \sim \mathcal{D}}\Big[\pi_t(s_t)^T \big[-\alpha\big(\log(\pi_t(s_t)) + \bar{\mathcal{H}}\big)\big]\Big] \tag{9}$$
where $\bar{\mathcal{H}}$ is a constant vector equal to the hyperparameter representing the target entropy.
6. The control method according to claim 5, characterized in that:
when the dynamic batch scheduling module faces multiple request sequences of the same model, it adds them to batch slots $s_i$ in order of their arrival;
the dynamic batcher maintains a separate request sequence for each model, containing all inference requests belonging to that model; according to the decision module these are aggregated into request sequences of different batch sizes.
7. The control method according to claim 6, characterized in that:
in the model parallel module,
if the current GPU is idle and multiple requests arrive at the same time, the model parallel module dispatches them to the GPU immediately and the GPU's hardware scheduler begins processing them in parallel;
if multiple requests for the same model arrive at the same time, model inference is performed by scheduling only one request at a time on the GPU.
8. The control method according to claim 7, characterized in that:
the performance analyzer records, for each execution on the edge device, the system state information of a series of inference models under input data of different sizes;
the system state information is fed back to the scheduling decision module in time, and on the basis of an analysis of the currently available system resources a scheduling decision is made for the inference requests at the next moment, selecting an appropriate batch size and model parallelism to maximize the utility function while avoiding system overload and improving resource utilization.
9. An electronic device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method according to any one of claims 2 to 8 when executing the computer program.
10. A computer readable storage medium storing computer instructions, which when executed by a processor implement the steps of the method of any one of claims 2 to 8.
CN202210662359.XA 2022-06-13 2022-06-13 Adaptive batch processing and parallel scheduling system for deep learning model inference of edge equipment Pending CN115454585A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210662359.XA CN115454585A (en) 2022-06-13 2022-06-13 Adaptive batch processing and parallel scheduling system for deep learning model inference of edge equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210662359.XA CN115454585A (en) 2022-06-13 2022-06-13 Adaptive batch processing and parallel scheduling system for deep learning model inference of edge equipment

Publications (1)

Publication Number Publication Date
CN115454585A 2022-12-09

Family

ID=84296543

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210662359.XA Pending CN115454585A (en) 2022-06-13 2022-06-13 Adaptive batch processing and parallel scheduling system for deep learning model inference of edge equipment

Country Status (1)

Country Link
CN (1) CN115454585A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116090383A (en) * 2022-12-27 2023-05-09 广东高云半导体科技股份有限公司 Method, device, computer storage medium and terminal for realizing static time sequence analysis



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination