CN115454585A - Adaptive batch processing and parallel scheduling system for deep learning model inference of edge equipment
- Publication number
- CN115454585A (application CN202210662359.XA)
- Authority
- CN
- China
- Prior art keywords
- scheduling
- model
- inference
- batch processing
- parallel
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
- G06F9/4881—Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3334—Selection or weighting of terms from queries, including natural language queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3346—Query execution using probabilistic model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
- G06F9/5038—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5061—Partitioning or combining of resources
- G06F9/5072—Grid computing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2209/00—Indexing scheme relating to G06F9/00
- G06F2209/50—Indexing scheme relating to G06F9/50
- G06F2209/502—Proximity
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2209/00—Indexing scheme relating to G06F9/00
- G06F2209/50—Indexing scheme relating to G06F9/50
- G06F2209/5021—Priority
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention provides an adaptive batch processing and parallel scheduling system for deep learning model inference on edge devices, which comprises a decision module, a dynamic batch scheduling module, a model parallel module and a performance analyzer. The decision module models the batch and parallel scheduling of deep learning model inference and selects an appropriate batch size and model parallelism for different models; the dynamic batch scheduling module performs batched inference; the model parallel module processes multiple inference requests simultaneously; and the performance analyzer collects the system state of the edge device online in real time. Compared with traditional heuristics and other reinforcement learning methods, the scheduling decision algorithm based on maximum entropy reinforcement learning achieves a 3.2%-58% performance improvement in the trade-off between system throughput and inference delay, converges 1.8-6.5 times faster than the other algorithms, and incurs an average scheduling overhead that is only 49% of that of the other algorithms.
Description
Technical Field
The invention belongs to the technical field of edge computing, and particularly relates to an adaptive batch processing and parallel scheduling system for deep learning model inference on edge devices.
Background
Edge computing makes it possible to sink computing power from the cloud to edge devices and to run deep learning inference workloads in real time at the edge. Owing to the parallelism of hardware accelerators, applying batching and parallel execution to deep learning inference can effectively improve throughput and reduce delay. However, because of power and cost constraints, edge devices cannot provide abundant computing and memory resources; when these resources are shared by multiple tenants, system throughput and inference delay are severely affected, and the real-time requirements of applications cannot be guaranteed.
Disclosure of Invention
The invention provides an adaptive batch processing and parallel scheduling system for deep learning model inference on edge devices, aiming at improving the balance between system throughput and inference delay when multiple deep learning inference workloads share the limited computing and memory resources of an edge device, so that the real-time requirements of applications can be guaranteed.
The invention is realized by the following technical scheme:
An adaptive batch processing and parallel scheduling system for deep learning model inference on edge devices comprises:
a decision module, a dynamic batch scheduling module, a model parallel module and a performance analyzer;
the decision module models the batch and parallel scheduling process for arriving inference requests as a Markov decision process, makes batch and parallel inference scheduling decisions with a scheduling decision algorithm, and automatically selects an appropriate batch size and model parallelism for different models;
the dynamic batch scheduling module adds inference requests to a request sequence in their order of arrival and schedules them to batch slots s_i on multiple instances of the model for batched inference;
the model parallel module allows multiple instances of the same or different models to execute in parallel, processing multiple inference requests of a model simultaneously;
the performance analyzer collects the system state of the edge device online in real time, including the current CPU, GPU, memory and energy utilization, the system throughput and the inference delay.
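To make the division of labour between the four modules concrete, the following is a minimal interface sketch. It is an illustration only: the class and method names (decide, dispatch, execute, collect), the InferenceRequest fields and the return types are assumptions, not part of the patent.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class InferenceRequest:
    model: str        # model type m_t of the request
    data_type: str    # data type d_t
    data_size: int    # data size d_s
    ddl_abs: float    # absolute deadline ddl_a (seconds)
    ddl_rel: float    # relative deadline ddl_r (seconds)
    arrival: float    # arrival timestamp (seconds)

class DecisionModule:
    """Maps the profiled system state to an action (batch size b, model parallelism m_c)."""
    def decide(self, state: Dict[str, float]) -> Tuple[int, int]:
        raise NotImplementedError

class DynamicBatchScheduler:
    """Groups pending requests per model and assigns them to batch slots s_i."""
    def dispatch(self, requests: List[InferenceRequest], batch_size: int) -> List[List[InferenceRequest]]:
        raise NotImplementedError

class ModelParallelModule:
    """Runs up to m_c instances of a model concurrently, returning measured throughput/latency."""
    def execute(self, batches: List[List[InferenceRequest]], parallelism: int) -> Tuple[float, float]:
        raise NotImplementedError

class PerformanceAnalyzer:
    """Samples CPU/GPU/memory/energy utilization plus throughput and latency online."""
    def collect(self, throughput: float = 0.0, latency: float = 0.0) -> Dict[str, float]:
        raise NotImplementedError
```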
A control method of the adaptive batch processing and parallel scheduling system for deep learning model inference on edge devices comprises the following steps:
step 1, the terminal device sends inference requests to the scheduling decision module;
step 2, the scheduling decision module models the batch and parallel scheduling process for the arriving inference requests as a Markov decision process, makes batch and parallel inference scheduling decisions with the scheduling decision algorithm, and automatically selects an appropriate batch size and model parallelism for different models;
step 3, the dynamic batch scheduling module adds the inference requests to a request sequence in their order of arrival and schedules them to batch slots s_i on multiple instances of the model for batched inference;
step 4, the model parallel module executes multiple instances of the same or different models in parallel, processing multiple inference requests of the models simultaneously;
step 5, the performance analyzer collects the system state of the edge device online in real time, including the current CPU, GPU, memory and energy utilization, the system throughput and the inference delay, and feeds it back to the scheduling decision module, which makes the scheduling decision for the inference requests at the next moment on the basis of the currently available system resources.
Further,
the state: S is a discrete state space; at each scheduling time step, the reinforcement learning agent constructs a state s_t (s_t ∈ S) by periodically collecting the inference request information and system state information on the edge device;
the system state information includes the following parts:
(I) the model type m_t of the current inference request;
(II) the data type d_t and data size d_s of the current request;
(III) the absolute deadline ddl_a and relative deadline ddl_r of the current request;
(IV) the currently available CPU, GPU, memory and energy utilization of the edge device, denoted C_u, G_u, M_u and E_u respectively;
(V) the information seq_b of the request sequence waiting to be scheduled;
The action: A is a discrete action space used to select an appropriate batch size b and model parallelism m_c; thus, the action taken by the agent at scheduling time t can be denoted a_t = (b, m_c);
the policy: the policy π(a_t|s_t) is the function by which the agent determines the next action a_t from the current environment state s_t at time t;
The optimal policy π* maximizes the entropy of the visited states while maximizing the cumulative expected reward earned by the agent, as shown in equation (2):

π* = argmax_π Σ_t E_{(s_t, a_t) ∼ p_π} [ γ^t ( r(s_t, a_t) + α·H(π(·|s_t)) ) ]   (2)

where γ ∈ [0,1] is a discount factor, p_π is the trajectory distribution generated by policy π, α is a temperature parameter that controls whether the optimization objective focuses more on the reward or on the entropy, and H(π(·|s_t)) denotes the entropy of policy π in state s_t;
the state transition probability: p(s'_t|s_t, a_t) is the probability of transitioning to the next state s'_t after taking action a_t in the current state s_t at time t, satisfying Σ_{s'∈S} p(s'_t|s_t, a_t) = 1;
The reward: R is the reward function; the goal of the agent is to maximize the accumulated expected reward, where r_t denotes the instant reward obtained when, at each scheduling time t, the agent selects an appropriate batch size and model parallelism and inference is performed;
to make the reward reflect the objective function, r_t is defined in the form of equation (3),
where ξ and the other weight coefficient in equation (3) weight the reward terms, b and m_c denote the batch size and model parallelism selected by the agent, and u = (C_u + G_u + M_u + E_u)/4 denotes the average system resource utilization.
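Equation (3) itself is not reproduced in the text, so the following is only a hedged stand-in that follows the prose: the instant reward should grow with the throughput/latency utility obtained by the chosen (b, m_c) and shrink as the average resource utilization u approaches saturation. The weights w and xi, the overload threshold and the exact functional form are assumptions.

```python
def instant_reward(utility_value: float, u: float, w: float = 0.6, xi: float = 0.4) -> float:
    """Illustrative reward r_t for scheduling step t (not the patent's exact equation (3)).

    utility_value -- utility of the chosen batch size b and parallelism m_c (see eq. (1))
    u             -- average resource utilization (C_u + G_u + M_u + E_u) / 4, in [0, 1]
    """
    overload_penalty = max(0.0, u - 0.8)   # 0.8 is an assumed overload threshold
    return w * utility_value - xi * overload_penalty
```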
Further,
the scheduling decision algorithm BP-SACD is based on the Actor-Critic framework; the Critic uses an action-state value function (Q-function) to evaluate how good the action taken by the Actor under the current policy is, i.e., soft policy iteration is used to maximize the reward while maximizing the entropy;
soft policy iteration comprises two steps, policy evaluation and policy improvement, which alternate during training;
The policy evaluation step is as follows:
first the soft Q-function is computed, defined as

Q(s_t, a_t) := r(s_t, a_t) + γ E_{s_{t+1}} [ V(s_{t+1}) ]   (4)

where

V(s_t) := π(s_t)^T [ Q(s_t) − α log(π(s_t)) ]   (5)

is the soft state-value function in the discrete case;
the soft Q-function is trained by minimizing the soft Bellman residual in equation (6);
Further,
the policy improvement step updates the policy according to equation (7);
the parameters of the policy network are updated by minimizing the KL divergence in equation (8);
the temperature parameter: the temperature parameter α can be adjusted automatically using equation (9).
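Equations (4)-(9) outline a discrete soft actor-critic update. The sketch below shows how such an update is commonly written; the twin critics, the network and variable names, and the hyper-parameter values are assumptions added for illustration and are not taken from the patent.

```python
import torch
import torch.nn.functional as F

def bp_sacd_losses(policy, q1, q2, q1_targ, q2_targ, log_alpha, batch,
                   gamma=0.99, target_entropy=1.0):
    """One soft-policy-iteration step in the spirit of eqs. (4)-(9).

    policy(s) returns action probabilities over the discrete action space (b, m_c);
    q*(s) return per-action Q-values. All hyper-parameter values are assumed.
    """
    s, a, r, s2, done = batch
    alpha = log_alpha.exp()

    # Soft state value, eq. (5): V(s) = pi(s)^T [ Q(s) - alpha * log pi(s) ]
    with torch.no_grad():
        p2 = policy(s2)
        logp2 = torch.log(p2 + 1e-8)
        q_next = torch.min(q1_targ(s2), q2_targ(s2))
        v_next = (p2 * (q_next - alpha * logp2)).sum(dim=1)
        target_q = r + gamma * (1.0 - done) * v_next       # soft Q target, eq. (4)

    # Policy evaluation: minimise the soft Bellman residual, eq. (6)
    q1_a = q1(s).gather(1, a.unsqueeze(1)).squeeze(1)
    q2_a = q2(s).gather(1, a.unsqueeze(1)).squeeze(1)
    critic_loss = F.mse_loss(q1_a, target_q) + F.mse_loss(q2_a, target_q)

    # Policy improvement: minimise the KL divergence of eqs. (7)-(8)
    p = policy(s)
    logp = torch.log(p + 1e-8)
    q_min = torch.min(q1(s), q2(s)).detach()
    actor_loss = (p * (alpha.detach() * logp - q_min)).sum(dim=1).mean()

    # Automatic temperature adjustment, eq. (9): move the entropy toward target_entropy
    entropy = -(p * logp).sum(dim=1).detach()
    alpha_loss = (log_alpha * (entropy - target_entropy)).mean()

    return critic_loss, actor_loss, alpha_loss
```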
Further,
when the dynamic batch scheduling module faces multiple request sequences of the same model, the request sequences are added to the batch slots s_i sequentially in the order in which they arrive;
the dynamic batch processor maintains a separate request sequence for each model, containing all inference requests belonging to that model, which are aggregated into request sequences of different batch sizes according to the decision module.
Further,
in the model parallel module,
if the GPU is currently idle and multiple requests arrive at the same time, the model parallel module dispatches them to the GPU immediately and the hardware scheduler of the GPU starts to process them in parallel;
if multiple requests for the same model arrive at the same time, only one request of that model is scheduled onto the GPU at a time for inference.
Further,
the performance analyzer records, for a series of inference models, the system state information observed each time they execute on the edge device with input data of different sizes;
the system state information is fed back to the scheduling decision module in time, and on the basis of the currently available system resources a scheduling decision is made for the inference requests at the next moment, selecting an appropriate batch size and model parallelism to maximize the utility function while avoiding system overload and improving resource utilization.
An electronic device comprising a memory storing a computer program and a processor implementing the steps of any of the above methods when the processor executes the computer program.
A computer readable storage medium storing computer instructions which, when executed by a processor, implement the steps of any of the above methods.
The invention has the following beneficial effects:
Compared with traditional heuristics and other reinforcement learning methods, the scheduling policy based on maximum entropy reinforcement learning designed by the invention achieves a 3.2%-58% performance improvement in the trade-off between system throughput and inference delay, while converging 1.8-6.5 times faster than the other algorithms. Furthermore, its average scheduling overhead is only 49% of that of the other algorithms.
Drawings
FIG. 1 is a BPEdge framework diagram of a batch and parallel reasoning system of the present invention;
FIG. 2 is a dynamic batch scheduling module of the present invention;
FIG. 3 is a model parallel module of the present invention;
FIG. 4 is the utility values of different algorithms under the YOLO-v5 and MobileNet-v3 inference models, where (a) is the utility value of YOLO-v5 and (b) is the utility value of MobileNet-v 3;
FIG. 5 is a plot of batch sizes selected for different algorithms, where (a) is the batch size of YOLO-v5 and (b) is the batch size of MobileNet-v 3;
FIG. 6 is a graph of the number of model parallels selected for different algorithms, where (a) is the number of model parallels for YOLO-v5 and (b) is the number of model parallels for MobileNet-v 3;
FIG. 7 is the average system throughput for different algorithms, where (a) is the average system throughput for YOLO-v5 and (b) is the average system throughput for MobileNet-v 3;
FIG. 8 is the mean inference delay for different algorithms, where (a) is the mean inference delay for YOLO-v5 and (b) is the mean inference delay for MobileNet-v 3;
fig. 9 is the average scheduling overhead for the different algorithms.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
With reference to fig. 1 to 9.
The invention first provides a utility function, defined in equation (1), to evaluate the balance between system throughput and inference delay,
where the first term denotes the system throughput achieved in scheduling slot s_i when the batch size is b and the model parallelism is m_c, the second term denotes the actual batched and parallel inference delay of the model, and the third term denotes the ratio of the sum of the absolute deadlines of the inference requests within a batch to the parallelism of the current model. In particular, the last term ensures that requests do not fail to be scheduled in time, so that real-time inference is achieved.
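Equation (1) is not reproduced in the text, so the following is only a hedged reading of the prose above: reward throughput, penalize inference delay, and treat the averaged deadline budget as a hard feasibility constraint. The exact functional form is an assumption.

```python
def utility(throughput: float, latency: float, deadlines_abs: list, m_c: int) -> float:
    """Illustrative utility for one scheduling slot s_i (not the patent's exact eq. (1)).

    throughput    -- requests completed per second with batch size b and parallelism m_c
    latency       -- actual batched + parallel inference delay (seconds)
    deadlines_abs -- absolute deadlines ddl_a of the requests in the batch (seconds)
    m_c           -- model parallelism
    """
    budget = sum(deadlines_abs) / m_c          # deadline budget term described in the prose
    if latency > budget:                       # a request could not be scheduled in time
        return float("-inf")                   # infeasible: real-time inference is violated
    return throughput / latency                # higher throughput, lower delay -> higher utility
```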
Table 1 is a variety of symbol definitions and descriptions.
TABLE 1 Main symbol table
An adaptive batch processing and parallel scheduling system facing deep learning model inference of edge equipment comprises:
the system comprises a decision module, a dynamic batch processing scheduling module, a model parallel module and a performance analyzer;
The decision module is the core component of the system.
It models the batch and parallel scheduling process for arriving inference requests as a Markov decision process (MDP), then makes batch and parallel inference scheduling decisions with the scheduling decision algorithm (BP-SACD), automatically selecting an appropriate batch size and model parallelism for different models, so that system throughput is improved while low inference delay is guaranteed;
the dynamic batch scheduling module adds inference requests of the same model to a request sequence in their order of arrival and schedules them to batch slots s_i on multiple instances of the model for batched inference;
the model parallel module allows multiple instances of the same or different models to execute in parallel, processing multiple inference requests of a model simultaneously;
the performance analyzer collects the system state of the edge device online in real time, including the current CPU, GPU, memory and energy utilization, the system throughput and the inference delay.
A control method of the adaptive batch processing and parallel scheduling system for deep learning model inference on edge devices comprises the following steps:
step 1, the terminal device sends inference requests to the scheduling decision module;
step 2, the scheduling decision module models the batch and parallel scheduling process for the arriving inference requests as a Markov decision process, makes batch and parallel inference scheduling decisions with the scheduling decision algorithm, and automatically selects an appropriate batch size and model parallelism for different models;
step 3, the dynamic batch scheduling module adds the inference requests to a request sequence in their order of arrival and schedules them to batch slots s_i on multiple instances of the model for batched inference;
step 4, the model parallel module executes multiple instances of the same or different models in parallel, processing multiple inference requests of the models simultaneously;
step 5, the performance analyzer collects the system state of the edge device online in real time, including the current CPU, GPU, memory and energy utilization, the system throughput and the inference delay, and feeds it back to the scheduling decision module, which makes the scheduling decision for the inference requests at the next moment on the basis of the currently available system resources.
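Putting steps 1-5 together, one scheduling slot can be sketched as the loop below, using the module interfaces sketched earlier; the function and method names are assumptions.

```python
def scheduling_loop(decision, batcher, parallel, profiler, request_stream):
    """Run steps 1-5 once per scheduling slot.

    request_stream yields the inference requests that arrived during the slot (step 1).
    """
    throughput, latency = 0.0, 0.0
    for requests in request_stream:
        # Step 5 (feedback): profile the system state observed after the previous slot.
        state = profiler.collect(throughput=throughput, latency=latency)
        # Step 2: the decision module picks batch size b and model parallelism m_c.
        b, m_c = decision.decide(state)
        # Step 3: fill batch slots s_i with the pending requests.
        batches = batcher.dispatch(requests, batch_size=b)
        # Step 4: run the batches on up to m_c parallel model instances.
        throughput, latency = parallel.execute(batches, parallelism=m_c)
```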
Markov decision process modeling
The state: S is a discrete state space; at each scheduling time step, the reinforcement learning agent constructs a state s_t (s_t ∈ S) by periodically collecting the inference request information and system state information on the edge device;
the system state information includes the following parts:
(I) the model type m_t of the current inference request;
(II) the data type d_t and data size d_s of the current request;
(III) the absolute deadline ddl_a and relative deadline ddl_r of the current request;
(IV) the currently available CPU, GPU, memory and energy utilization of the edge device, denoted C_u, G_u, M_u and E_u respectively;
(V) the information seq_b of the request sequence waiting to be scheduled;
The action: A is a discrete action space used to select an appropriate batch size b and model parallelism m_c; the action taken by the agent at scheduling time t can thus be denoted a_t = (b, m_c). For example, if a given inference model has |B| selectable batch sizes and |M| selectable model parallelism values, the size of the discrete action space A is |B| × |M|.
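As a concrete illustration of the state features (I)-(V) and of the discrete action space, the sketch below enumerates candidate (b, m_c) pairs and flattens a request/system snapshot into a feature vector. The candidate batch sizes, parallelism values and dictionary keys are assumptions, not values given in the patent.

```python
import itertools
import numpy as np

BATCH_SIZES = [1, 2, 4, 8, 16, 32]   # assumed candidate batch sizes b
PARALLELISM = [1, 2, 3, 4]           # assumed candidate model parallelism values m_c

# Discrete action space A: every (b, m_c) pair, so |A| = |B| x |M|.
ACTIONS = list(itertools.product(BATCH_SIZES, PARALLELISM))

def build_state(req: dict, sys: dict) -> np.ndarray:
    """Flatten items (I)-(V) into the feature vector s_t observed by the agent."""
    return np.array([
        req["model_type"],                                   # (I)  m_t, encoded as an integer id
        req["data_type"], req["data_size"],                  # (II) d_t and d_s
        req["ddl_abs"], req["ddl_rel"],                      # (III) ddl_a and ddl_r
        sys["cpu"], sys["gpu"], sys["mem"], sys["energy"],   # (IV) C_u, G_u, M_u, E_u
        req["queue_len"],                                    # (V)  summary of seq_b
    ], dtype=np.float32)
```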
The policy: the policy π(a_t|s_t) is the function by which the agent determines the next action a_t from the current environment state s_t at time t;
Based on the maximum entropy stochastic policy algorithm, the invention maximizes the entropy of the visited states while maximizing the cumulative expected reward obtained by the agent; the optimal policy π* is shown in equation (2):

π* = argmax_π Σ_t E_{(s_t, a_t) ∼ p_π} [ γ^t ( r(s_t, a_t) + α·H(π(·|s_t)) ) ]   (2)

where γ ∈ [0,1] is a discount factor, p_π is the trajectory distribution generated by policy π, α is a temperature parameter that controls whether the optimization objective focuses more on the reward or on the entropy, and H(π(·|s_t)) denotes the entropy of policy π in state s_t;
the state transition probability: p(s'_t|s_t, a_t) is the probability of transitioning to the next state s'_t after taking action a_t in the current state s_t at time t, satisfying Σ_{s'∈S} p(s'_t|s_t, a_t) = 1;
The reward: R is the reward function; the goal of the agent is to maximize the accumulated expected reward, where r_t denotes the instant reward obtained when, at each scheduling time t, the agent selects an appropriate batch size and model parallelism and inference is performed;
to make the reward reflect the objective function, r_t is defined in the form of equation (3),
where ξ and the other weight coefficient in equation (3) weight the reward terms, b and m_c denote the batch size and model parallelism selected by the agent, and u = (C_u + G_u + M_u + E_u)/4 denotes the average system resource utilization.
A decision algorithm based on maximum entropy reinforcement learning:
The scheduling decision algorithm BP-SACD is based on the Actor-Critic framework; the Critic uses an action-state value function (Q-function) to evaluate how good the action taken by the Actor under the current policy is, i.e., soft policy iteration is used to maximize the reward while maximizing the entropy;
soft policy iteration comprises two steps, policy evaluation and policy improvement, which alternate during training;
The policy evaluation step is as follows:
first the soft Q-function is computed, defined as

Q(s_t, a_t) := r(s_t, a_t) + γ E_{s_{t+1}} [ V(s_{t+1}) ]   (4)

where

V(s_t) := π(s_t)^T [ Q(s_t) − α log(π(s_t)) ]   (5)

is the soft state-value function in the discrete case;
the soft Q-function is trained by minimizing the soft Bellman residual in equation (6);
The policy improvement step updates the policy according to equation (7);
the parameters of the policy network are updated by minimizing the KL divergence in equation (8);
the temperature parameter: the temperature parameter α can be adjusted automatically using equation (9).
When the dynamic batch scheduling module faces multiple request sequences of the same model, the request sequences are added to the batch slots s_i sequentially in the order in which they arrive;
the dynamic batch processor maintains a separate request sequence for each model, containing all inference requests belonging to that model, which are aggregated into request sequences of different batch sizes according to the decision module.
FIG. 2 illustrates how multiple request sequences of the same model are scheduled onto multiple model instances by the dynamic batch scheduling module. First, the module allocates a batch slot for each newly arrived request sequence. Priority is based on the relative deadline of each request sequence: the shorter the relative deadline, the higher the priority. The request sequences are then dispatched to the corresponding batch slots according to their priority; sequences with the same priority are scheduled in order of arrival time.
The left diagram shows several request sequences that arrive randomly. The number of inference requests per sequence is determined by the batch size chosen by the deep reinforcement learning based decision module, and these individual inference requests may arrive in any order relative to the inference requests in other sequences. The right diagram shows how the request sequence is scheduled over time onto the corresponding model instance.
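A minimal sketch of this per-model batching and deadline-based priority is given below. The class name, field names and the round-robin assignment of batches to slots are assumptions used only to make the mechanism of FIG. 2 concrete.

```python
from collections import defaultdict

class DynamicBatchScheduler:
    """Per-model request queues, grouped into batches and assigned to batch slots s_i."""

    def __init__(self):
        self.pending = defaultdict(list)          # model id -> pending inference requests

    def enqueue(self, request: dict) -> None:
        self.pending[request["model"]].append(request)

    def dispatch(self, model: str, batch_size: int, num_slots: int):
        """Group pending requests of `model` into batches of the chosen size, order them
        by relative deadline (shorter deadline first, ties broken by arrival time), and
        assign them round-robin to the num_slots batch slots."""
        reqs = self.pending.pop(model, [])
        batches = [reqs[i:i + batch_size] for i in range(0, len(reqs), batch_size)]
        batches.sort(key=lambda seq: (min(r["ddl_rel"] for r in seq),
                                      min(r["arrival"] for r in seq)))
        return [batches[i::num_slots] for i in range(num_slots)]
```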
In the model parallel module,
if the current GPU is in an idle state, when a plurality of requests arrive at the same time, the model parallel module immediately dispatches the requests to the GPU, and a hardware dispatcher of the GPU starts to process the requests in parallel;
if multiple requests for the same model arrive at the same time, model inference will be performed by scheduling only one request at a time on the GPU.
FIG. 3 shows an example in which 3 models (model 1, model 2 and model 3) are executed in parallel; specifically, each of the 3 models is configured to allow 2 instances to execute in parallel. The first three inference requests are executed immediately in parallel, while the latter three inference requests must wait until the first three requests for the same models have finished executing.
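The per-model instance limit of FIG. 3 can be sketched with a semaphore per model, as below. This is an illustration only: the threading-based gating, the default of 2 instances and the method names are assumptions, and on a real device the GPU hardware scheduler ultimately interleaves the concurrent kernels.

```python
import threading
from collections import defaultdict

class ModelParallelModule:
    """Allows up to max_instances copies of each model to run concurrently."""

    def __init__(self, max_instances: int = 2):
        # One semaphore per model caps how many of its instances run at the same time.
        self._limits = defaultdict(lambda: threading.Semaphore(max_instances))

    def run(self, model_name: str, model_fn, batch):
        """Execute one batched inference; blocks while all instances of the model are busy."""
        with self._limits[model_name]:
            return model_fn(batch)     # actual kernel-level parallelism is handled by the GPU
```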
The performance analyzer records, for a series of inference models, the system state information observed each time they execute on the edge device with input data of different sizes;
the system state information is fed back in time to the decision module based on deep reinforcement learning, and on the basis of the currently available system resources a scheduling decision is made for the inference requests at the next moment, selecting an appropriate batch size and model parallelism to maximize the utility function while avoiding system overload and improving resource utilization. This also reflects the potential advantage of the scheduler of the present invention in dynamic resource management and allocation.
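A minimal online profiler along these lines is sketched below. The use of psutil for CPU and memory sampling is an assumption (GPU and energy readings would come from platform-specific tools such as tegrastats or nvidia-smi, which are not modelled here).

```python
import time
import psutil  # assumed dependency for CPU / memory sampling

class PerformanceAnalyzer:
    """Samples the system state after each scheduling slot and keeps a history."""

    def __init__(self):
        self.history = []

    def collect(self, gpu: float = 0.0, energy: float = 0.0,
                throughput: float = 0.0, latency: float = 0.0) -> dict:
        snapshot = {
            "time": time.time(),
            "cpu": psutil.cpu_percent() / 100.0,            # C_u
            "mem": psutil.virtual_memory().percent / 100.0, # M_u
            "gpu": gpu,                                     # G_u, supplied by a platform tool
            "energy": energy,                               # E_u, supplied by a platform tool
            "throughput": throughput,                       # measured in the previous slot
            "latency": latency,
        }
        self.history.append(snapshot)
        return snapshot
```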
an electronic device comprising a memory storing a computer program and a processor implementing the steps of any of the above methods when the processor executes the computer program.
A computer readable storage medium storing computer instructions which, when executed by a processor, implement the steps of any of the above methods.
Embodiment:
1. Baseline methods: traditional heuristic methods and other reinforcement learning methods are used as baselines; each one replaces the proposed maximum entropy reinforcement learning decision algorithm BP-SACD within the scheduling system BPEdge for comparison. The baselines are as follows:
Genetic Algorithm (GA): GA is one of the most classical heuristic algorithms; its main characteristic is that, for nonlinear extremum problems, it can escape local optima with probability 1 and then find the global optimum. The invention uses the utility as the fitness function of the GA.
Proximal Policy Optimization (PPO): PPO is a common reinforcement learning baseline; it is an on-policy algorithm with good stability.
Double Deep Q-Network (DDQN): DDQN is an off-policy algorithm that performs well in discrete settings.
Three performance indicators are mainly considered:
(1) Utility function: reflects the balance between system throughput and inference delay; the higher the utility value, the higher the system throughput and the lower the inference delay;
(2) Convergence speed: reflects how quickly each algorithm converges to a good scheduling policy;
(3) Scheduling overhead: the time required to perform a scheduling operation.
2. Scheduling system utility: for real-time performance, the request rate is set to 30 fps, i.e., 30 video frames of different resolutions arrive every second as inference requests, the absolute deadline of each inference request lies in [10 ms, 15 ms], and requests are generated for a total of 1.85 hours, i.e., 200,000 inference requests in total. In addition, the number of training runs for each algorithm is set to 100.
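For reference, a synthetic workload matching this setup can be produced as sketched below; the set of frame resolutions and the uniform deadline distribution are assumptions.

```python
import random

def generate_trace(fps: int = 30, duration_s: float = 1.85 * 3600):
    """Synthetic trace: 30 requests/s for 1.85 h (about 200,000 inference requests)."""
    resolutions = [(640, 480), (1280, 720), (1920, 1080)]    # assumed resolution set
    trace = []
    for i in range(int(fps * duration_s)):
        trace.append({
            "arrival": i / fps,                              # seconds since start
            "resolution": random.choice(resolutions),
            "ddl_abs": random.uniform(0.010, 0.015),         # absolute deadline in seconds
        })
    return trace
```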
FIG. 4 shows the degree to which the four algorithms balance system throughput and inference delay (the utility) when performing batch and parallel inference on two state-of-the-art models for object detection and image classification.
For the object detection model inference based on YOLO-v5 in FIG. 4(a), the proposed BP-SACD algorithm outperforms the other three methods. Specifically, the utility value of BP-SACD is 11.2%, 41.8% and 21.2% higher than that of GA, PPO and DDQN respectively, and its stability is better than that of the other three methods.
In the inference test for the image classification model based on MobileNet-v3 (FIG. 4(b)), the utility value of BP-SACD likewise shows performance improvements of 3.2%, 58% and 30.2% relative to the other three methods.
FIGS. 5 and 6 show how the batch size and the model parallelism selected by the four algorithms vary over the scheduling process. BP-SACD finds the optimal batch size and model parallelism for different models faster than the other three algorithms.
FIGS. 7 and 8 show how the average system throughput and the average inference delay of the four algorithms vary over the scheduling process. Compared with the other three algorithms, BP-SACD achieves a better balance between system throughput and inference delay, tending to accept a moderate inference throughput in exchange for lower inference delay, and thus reaches a better degree of balance.
The adaptive batch processing and parallel scheduling system for deep learning model inference of edge devices proposed by the present invention is introduced in detail above, and the principle and implementation of the present invention are explained, and the description of the above embodiments is only used to help understanding the method and the core idea of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.
Claims (10)
1. An adaptive batch processing and parallel scheduling system for deep learning model inference of edge equipment, characterized in that:
the system comprises a decision module, a dynamic batch processing scheduling module, a model parallel module and a performance analyzer;
the decision module carries out batch processing and parallel scheduling process modeling on the arrived inference requests through a Markov decision process, carries out batch processing and parallel inference scheduling decision through a scheduling decision algorithm, and automatically selects proper batch processing size and model parallel quantity aiming at different models;
the dynamic batch scheduling module adds inference requests to a request sequence in their order of arrival and schedules them to batch slots s_i on multiple instances of the model for batched inference;
the model parallel module allows multiple instances of the same or different models to execute in parallel, processing multiple inference requests of a model simultaneously;
the performance analyzer collects the system state of the edge device in real time in an online mode, wherein the system state comprises the utilization rate of a CPU (central processing unit), a GPU (graphics processing unit), a memory and energy consumption at the current moment, the system throughput and the reasoning delay.
2. A control method of the adaptive batch processing and parallel scheduling system for deep learning model inference of edge equipment according to claim 1, characterized in that:
the control method specifically comprises the following steps:
step 1, the terminal equipment sends an inference request to a scheduling decision module;
step 2, the scheduling decision module carries out batch processing and parallel scheduling process modeling on the arrived inference requests through a Markov decision process, then carries out scheduling decision of batch processing and parallel inference through a scheduling decision algorithm, and automatically selects proper batch processing size and model parallel quantity according to different models;
step 3, the dynamic batch scheduling module adds the inference requests to a request sequence in their order of arrival and schedules them to batch slots s_i on multiple instances of the model for batched inference;
step 4, the model parallel module executes a plurality of instances of different models/the same model in parallel and processes a plurality of inference requests of the models at the same time;
and step 5, the performance analyzer collects the system state of the edge device in real time in an online mode, wherein the system state comprises the utilization rate of a CPU (Central processing Unit), a GPU (graphics processing Unit), a memory and energy consumption at the current moment, the system throughput and inference delay, and feeds the system state back to the scheduling decision module, and the scheduling decision is made for the inference request at the next moment on the basis of analyzing available resources of the current system.
3. The control method according to claim 2, characterized in that:
the state: S is a discrete state space; at each scheduling time step, the reinforcement learning agent constructs a state s_t (s_t ∈ S) by periodically collecting the inference request information and system state information on the edge device;
the system state information includes the following parts:
(I) the model type m_t of the current inference request;
(II) the data type d_t and data size d_s of the current request;
(III) the absolute deadline ddl_a and relative deadline ddl_r of the current request;
(IV) the currently available CPU, GPU, memory and energy utilization of the edge device, denoted C_u, G_u, M_u and E_u respectively;
(V) the information seq_b of the request sequence waiting to be scheduled;
the action: A is a discrete action space used to select an appropriate batch size b and model parallelism m_c; thus, the action taken by the agent at scheduling time t can be denoted a_t = (b, m_c);
the policy: the policy π(a_t|s_t) is the function by which the agent determines the next action a_t from the current environment state s_t at time t;
the optimal policy π* maximizes the entropy of the visited states while maximizing the cumulative expected reward earned by the agent, as shown in equation (2),
where γ ∈ [0,1] is a discount factor, p_π is the trajectory distribution generated by policy π, α is a temperature parameter that controls whether the optimization objective focuses more on the reward or on the entropy, and H(π(·|s_t)) denotes the entropy of policy π in state s_t;
the state transition probability: p(s'_t|s_t, a_t) is the probability of transitioning to the next state s'_t after taking action a_t in the current state s_t at time t, satisfying Σ_{s'∈S} p(s'_t|s_t, a_t) = 1;
the reward: R is the reward function; the goal of the agent is to maximize the accumulated expected reward, where r_t denotes the instant reward obtained when, at each scheduling time t, the agent selects an appropriate batch size and model parallelism and inference is performed;
to make the reward reflect the objective function, r_t is defined in the form of equation (3).
4. The control method according to claim 3, characterized in that:
the scheduling decision algorithm is based on the Actor-Critic framework; the Critic uses an action-state value function (Q-function) to evaluate how good the action taken by the Actor under the current policy is, i.e., soft policy iteration is used to maximize the reward while maximizing the entropy;
soft policy iteration comprises two steps, policy evaluation and policy improvement, which alternate during training;
the policy evaluation step is as follows:
first the soft Q-function is computed, defined in equation (4),
where
V(s_t) := π(s_t)^T [ Q(s_t) − α log(π(s_t)) ]   (5)
is the soft state-value function in the discrete case;
the soft Q-function is trained by minimizing the soft Bellman residual in equation (6);
5. the control method according to claim 4, characterized in that:
the policy improvement step updates the policy according to equation (7);
the parameters of the policy network are updated by minimizing the KL divergence in equation (8);
the temperature parameter: the temperature parameter α can be adjusted automatically using equation (9).
6. The control method according to claim 5, characterized in that:
when the dynamic batch scheduling module faces multiple request sequences of the same model, the request sequences are added to the batch slots s_i sequentially in the order in which they arrive;
the dynamic batch processor maintains a separate request sequence for each model, containing all inference requests belonging to that model, which are aggregated into request sequences of different batch sizes according to the decision module.
7. The control method according to claim 6, characterized in that:
in the case of the model-parallel module,
if the current GPU is in an idle state, when a plurality of requests arrive at the same time, the model parallel module immediately dispatches the requests to the GPU, and a hardware dispatcher of the GPU starts to process the requests in parallel;
if multiple requests for the same model arrive at the same time, model inference will be performed by scheduling only one request at a time on the GPU.
8. The control method according to claim 7, characterized in that:
the performance analyzer records, for a series of inference models, the system state information observed each time they execute on the edge device with input data of different sizes;
the system state information is fed back to the scheduling decision module in time, and on the basis of the currently available system resources a scheduling decision is made for the inference requests at the next moment, selecting an appropriate batch size and model parallelism to maximize the utility function while avoiding system overload and improving resource utilization.
9. An electronic device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method according to any one of claims 2 to 8 when executing the computer program.
10. A computer readable storage medium storing computer instructions, which when executed by a processor implement the steps of the method of any one of claims 2 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210662359.XA CN115454585A (en) | 2022-06-13 | 2022-06-13 | Adaptive batch processing and parallel scheduling system for deep learning model inference of edge equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210662359.XA CN115454585A (en) | 2022-06-13 | 2022-06-13 | Adaptive batch processing and parallel scheduling system for deep learning model inference of edge equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115454585A true CN115454585A (en) | 2022-12-09 |
Family
ID=84296543
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210662359.XA Pending CN115454585A (en) | 2022-06-13 | 2022-06-13 | Adaptive batch processing and parallel scheduling system for deep learning model inference of edge equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115454585A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116090383A (en) * | 2022-12-27 | 2023-05-09 | 广东高云半导体科技股份有限公司 | Method, device, computer storage medium and terminal for realizing static time sequence analysis |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |