CN115454585A - Adaptive batch processing and parallel scheduling system for deep learning model inference of edge equipment - Google Patents


Info

Publication number
CN115454585A
Authority
CN
China
Prior art keywords
scheduling
model
inference
batch processing
parallel
Prior art date
Legal status
Pending
Application number
CN202210662359.XA
Other languages
Chinese (zh)
Inventor
张子阳
刘劼
李峰
李欢
林昌垚
Current Assignee
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN202210662359.XA
Publication of CN115454585A
Legal status: Pending

Classifications

    • G06F 9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G06F 16/3334 Selection or weighting of terms from queries, including natural language queries
    • G06F 16/3346 Query execution using probabilistic model
    • G06F 9/5038 Allocation of resources to service a request, the resource being a machine, considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
    • G06F 9/5072 Grid computing (partitioning or combining of resources)
    • G06N 5/04 Inference or reasoning models
    • G06F 2209/502 Indexing scheme relating to resource allocation: proximity
    • G06F 2209/5021 Indexing scheme relating to resource allocation: priority
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides an adaptive batch processing and parallel scheduling system for deep learning model inference on edge devices, comprising a decision module, a dynamic batch scheduling module, a model parallel module and a performance analyzer. The decision module models the scheduling of batched and parallel deep learning inference and selects an appropriate batch size and model parallelism for each model; the dynamic batch scheduling module performs batched inference; the model parallel module processes multiple inference requests simultaneously; the performance analyzer collects the system state of the edge device online in real time. Compared with traditional heuristics and other reinforcement learning methods, the scheduling decision algorithm based on maximum entropy reinforcement learning improves the balance between system throughput and inference latency by 3.2-58%, converges 1.8-6.5 times faster than the other algorithms, and incurs an average scheduling overhead of only 49% of that of the other algorithms.

Description

Adaptive batch processing and parallel scheduling system for deep learning model inference of edge equipment
Technical Field
The invention belongs to the technical field of edge computing, and particularly relates to an adaptive batch processing and parallel scheduling system for deep learning model inference of edge equipment.
Background
Edge computing makes it possible to sink computing power from the cloud to edge devices and to run deep learning inference workloads in real time at the edge. Owing to the parallelism of hardware accelerators, applying batching and parallel execution to deep learning inference can effectively improve throughput and reduce latency. However, because of power-consumption and cost constraints, edge devices cannot provide abundant computing and memory resources; when those resources are shared by multiple tenants, system throughput and inference latency are greatly affected, and the real-time performance of applications cannot be guaranteed.
Disclosure of Invention
The invention provides an adaptive batch processing and parallel scheduling system for deep learning model inference on edge devices, aiming at improving system throughput while keeping inference latency low when the limited resources of an edge device are shared by multiple tenants.
The invention is realized by the following technical scheme:
An adaptive batch processing and parallel scheduling system for deep learning model inference on edge devices comprises:
a decision module, a dynamic batch scheduling module, a model parallel module and a performance analyzer;
the decision module models the batching and parallel scheduling of arriving inference requests as a Markov decision process, makes batching and parallel-inference scheduling decisions through a scheduling decision algorithm, and automatically selects an appropriate batch size and model parallelism for each model;
the dynamic batch scheduling module appends inference requests to a request sequence in order of arrival and schedules them to batch slots $s_i$ on multiple instances of the model for batched inference;
the model parallel module allows multiple instances of different models or of the same model to execute in parallel, processing multiple inference requests of a model simultaneously;
the performance analyzer collects the system state of the edge device online in real time, including the current CPU, GPU, memory and energy-consumption utilization, the system throughput and the inference latency.
A control method of the adaptive batch processing and parallel scheduling system for deep learning model inference on edge devices comprises the following steps:
Step 1: the terminal device sends an inference request to the decision module of the inference system;
Step 2: the decision module models the batching and parallel scheduling of the arriving inference requests as a Markov decision process, makes batching and parallel-inference scheduling decisions through a scheduling decision algorithm, and automatically selects an appropriate batch size and model parallelism for each model;
Step 3: the dynamic batch scheduling module appends the inference requests to a request sequence in order of arrival and schedules them to batch slots $s_i$ on multiple instances of the model for batched inference;
Step 4: the model parallel module executes multiple instances of different models or of the same model in parallel, processing multiple inference requests of the models simultaneously;
Step 5: the performance analyzer collects the system state of the edge device online in real time, including the current CPU, GPU, memory and energy-consumption utilization, the system throughput and the inference latency, and feeds it back to the scheduling decision module, which makes the scheduling decision for the inference requests at the next moment on the basis of an analysis of the currently available system resources.
Further,
the Markov decision process is described by a quintuple $(\mathcal{S}, \mathcal{A}, \pi, p, r)$, whose elements are defined as follows:
The state: $\mathcal{S}$ is a discrete state space; at each scheduling time step, the reinforcement learning agent constructs a state $s_t$ ($s_t \in \mathcal{S}$) by periodically collecting inference request information and system state information on the edge device;
the system state information includes the following parts:
(I) the model type $m_t$ of the current inference request;
(II) the data type $d_t$ and data size $d_s$ of the current request;
(III) the absolute deadline $ddl_a$ and relative deadline $ddl_r$ of the current request;
(IV) the currently available CPU, GPU, memory and energy-consumption utilization of the edge device, denoted $C_u$, $G_u$, $M_u$ and $E_u$ respectively;
(V) the information $seq_b$ of the request sequences waiting to be scheduled;
The action: $\mathcal{A}$ is a discrete action space used to select an appropriate batch size $b$ and model parallelism $m_c$; the action taken by the agent at scheduling time $t$ is therefore denoted $a_t = (b, m_c)$;
The policy: the policy $\pi(a_t \mid s_t)$ is the function by which the agent determines the next action $a_t \in \mathcal{A}$ from the current environment state $s_t$ at time $t$;
the entropy of the visited states is maximized while maximizing the cumulative expected reward obtained by the agent; the optimal policy $\pi^*$ is given by equation (2):
$$\pi^* = \arg\max_{\pi} \sum_t \mathbb{E}_{(s_t, a_t) \sim p_\pi}\big[\gamma^t r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot \mid s_t))\big] \tag{2}$$
where $\gamma \in [0, 1]$ is a discount factor, $p_\pi$ is the trajectory distribution generated by policy $\pi$, $\alpha$ is a temperature parameter controlling whether the optimization objective focuses more on the reward or on the entropy, and $\mathcal{H}(\pi(\cdot \mid s_t))$ denotes the entropy of policy $\pi$ in state $s_t$;
The state transition probability: $p(s'_t \mid s_t, a_t)$ is the probability of transitioning to the next state $s'_t$ after taking action $a_t$ in state $s_t$ at time $t$, satisfying $\sum_{s' \in \mathcal{S}} p(s'_t \mid s_t, a_t) = 1$;
The reward: $r: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ is the reward function; the goal of the agent is to maximize the expected cumulative reward $\mathbb{E}\big[\sum_t \gamma^t r_t\big]$, where $r_t$ is the immediate reward obtained when, at each scheduling time $t$, the agent selects an appropriate batch size and model parallelism and inference is performed;
so that the reward reflects the objective function, equation (3) defines $r_t$ as a weighted combination, with weights $\omega$ and $\xi$, of the utility achieved by the batch size $b$ and model parallelism $m_c$ selected by the agent and the average system resource utilization $u = (C_u + G_u + M_u + E_u)/4$.
Further,
the scheduling decision algorithm BP-SACD is based on an Actor-Critic framework; the Critic uses an action-state value function (Q-function) to judge the quality of the action the Actor takes according to the policy, i.e. soft policy iteration is used to maximize the reward while also maximizing the entropy;
soft policy iteration comprises two steps, policy evaluation and policy improvement, which alternate during training;
the policy evaluation step is as follows:
first the soft Q-function is computed, defined via the modified Bellman backup operator $\mathcal{T}^\pi$ as:
$$\mathcal{T}^\pi Q(s_t, a_t) := r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim p}\big[V(s_{t+1})\big] \tag{4}$$
where
$$V(s_t) := \pi(s_t)^T \big[Q(s_t) - \alpha \log(\pi(s_t))\big] \tag{5}$$
is the soft state value function in the discrete case;
the soft Q-function is trained by minimizing the soft Bellman residual in equation (6):
$$J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}}\Big[\tfrac{1}{2}\big(Q_\theta(s_t, a_t) - (r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim p}[V_{\bar{\theta}}(s_{t+1})])\big)^2\Big] \tag{6}$$
where $\mathcal{D}$ is the replay buffer and $\bar{\theta}$ denotes target network parameters.
Further,
the policy improvement step updates the policy according to:
$$\pi_{new} = \arg\min_{\pi' \in \Pi} D_{KL}\left(\pi'(\cdot \mid s_t) \,\Big\|\, \frac{\exp\big(Q^{\pi_{old}}(s_t, \cdot)/\alpha\big)}{Z^{\pi_{old}}(s_t)}\right) \tag{7}$$
where $D_{KL}$ denotes the KL divergence and $Z^{\pi_{old}}(s_t)$ denotes the partition function;
the parameters of the policy network are updated by minimizing the KL divergence in equation (8):
$$J_\pi(\phi) = \mathbb{E}_{s_t \sim \mathcal{D}}\Big[\pi_\phi(s_t)^T \big[\alpha \log(\pi_\phi(s_t)) - Q_\theta(s_t)\big]\Big] \tag{8}$$
Temperature parameter: the temperature parameter can be adjusted automatically using equation (9):
$$J(\alpha) = \mathbb{E}_{s_t \sim \mathcal{D}}\Big[\pi_t(s_t)^T \big[-\alpha\big(\log(\pi_t(s_t)) + \bar{\mathcal{H}}\big)\big]\Big] \tag{9}$$
where $\bar{\mathcal{H}}$ is a constant vector equal to the hyperparameter representing the target entropy.
Further,
when the dynamic batch scheduling module faces multiple request sequences of the same model, it adds them to batch slots $s_i$ in order of their arrival;
the dynamic batcher maintains a separate request sequence for each model, containing all inference requests belonging to that model; according to the decision module these are aggregated into request sequences of different batch sizes.
Further,
in the model parallel module,
if the current GPU is idle and multiple requests arrive at the same time, the model parallel module dispatches them to the GPU immediately and the GPU's hardware scheduler begins processing them in parallel;
if multiple requests for the same model arrive at the same time, model inference is performed by scheduling only one request at a time on the GPU.
Further,
the performance analyzer records, for each execution on the edge device, the system state information of a series of inference models under input data of different sizes;
the system state information is fed back to the scheduling decision module in time, and on the basis of an analysis of the currently available system resources a scheduling decision is made for the inference requests at the next moment, selecting an appropriate batch size and model parallelism to maximize the utility function while avoiding system overload and improving resource utilization.
An electronic device comprising a memory storing a computer program and a processor implementing the steps of any of the above methods when the processor executes the computer program.
A computer readable storage medium storing computer instructions which, when executed by a processor, implement the steps of any of the above methods.
Beneficial effects of the invention
Compared with traditional heuristics and other reinforcement learning methods, the scheduling strategy based on maximum entropy reinforcement learning designed by the invention improves the balance between system throughput and inference latency by 3.2-58%, and converges 1.8-6.5 times faster than the other algorithms. Furthermore, its average scheduling overhead is only 49% of that of the other algorithms.
Drawings
FIG. 1 is a framework diagram of BPEdge, the batched and parallel inference system of the invention;
FIG. 2 shows the dynamic batch scheduling module of the invention;
FIG. 3 shows the model parallel module of the invention;
FIG. 4 shows the utility values of different algorithms under the YOLO-v5 and MobileNet-v3 inference models, where (a) is the utility value for YOLO-v5 and (b) is the utility value for MobileNet-v3;
FIG. 5 shows the batch sizes selected by the different algorithms, where (a) is the batch size for YOLO-v5 and (b) is the batch size for MobileNet-v3;
FIG. 6 shows the model parallelism selected by the different algorithms, where (a) is the model parallelism for YOLO-v5 and (b) is the model parallelism for MobileNet-v3;
FIG. 7 shows the average system throughput of the different algorithms, where (a) is the average system throughput for YOLO-v5 and (b) is the average system throughput for MobileNet-v3;
FIG. 8 shows the average inference latency of the different algorithms, where (a) is the average inference latency for YOLO-v5 and (b) is the average inference latency for MobileNet-v3;
FIG. 9 shows the average scheduling overhead of the different algorithms.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
With reference to fig. 1 to 9.
The invention first proposes a utility function to evaluate the degree of balance between system throughput and inference latency, which can be expressed in the form of equation (1):
$$utility(b, m_c) = \frac{T_{s_i}(b, m_c)}{L(b, m_c)}, \qquad \text{s.t. } L(b, m_c) \le \frac{\sum ddl_a}{m_c} \tag{1}$$
where $T_{s_i}(b, m_c)$ denotes the system throughput when inference is executed in scheduling slot $s_i$ with batch size $b$ and model parallelism $m_c$, $L(b, m_c)$ denotes the actual batched and parallel inference latency of the model, and $\frac{\sum ddl_a}{m_c}$ denotes the ratio of the sum of the absolute deadlines of the inference requests within a batch to the parallelism of the current model. In particular, the constraint $L(b, m_c) \le \frac{\sum ddl_a}{m_c}$ ensures that no request fails to be scheduled, achieving real-time inference.
Table 1 lists the main symbol definitions and descriptions (rendered as an image in the original).
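By way of illustration, the following minimal sketch evaluates the utility of a candidate action $(b, m_c)$ under the definitions above. The function and parameter names are illustrative assumptions, not the patent's implementation; the throughput/latency ratio follows the reconstructed equation (1).

```python
from typing import Sequence

def utility(throughput: float, latency: float,
            abs_deadlines: Sequence[float], m_c: int) -> float:
    """Utility of executing a batch with parallelism m_c (cf. eq. (1)).

    throughput:    T_{s_i}(b, m_c), requests per second in slot s_i
    latency:       L(b, m_c), actual batched + parallel inference latency
    abs_deadlines: absolute deadlines ddl_a of the requests in the batch
    m_c:           model parallelism
    """
    deadline_budget = sum(abs_deadlines) / m_c
    if latency > deadline_budget:
        # the constraint of eq. (1) is violated: the batch would miss deadlines
        return float("-inf")
    # higher throughput and lower latency both increase utility
    return throughput / latency
```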
An adaptive batch processing and parallel scheduling system for deep learning model inference on edge devices comprises:
a decision module, a dynamic batch scheduling module, a model parallel module and a performance analyzer.
The decision module is the core component of the system.
Through a Markov decision process (MDP), the batching and parallel scheduling of arriving inference requests is modeled; scheduling decisions for batched and parallel inference are then made by the scheduling decision algorithm (BP-SACD), which automatically selects an appropriate batch size and model parallelism for each model, improving system throughput while keeping inference latency low.
The dynamic batch scheduling module appends batches of inference requests of the same model to a request sequence in order of arrival and schedules them to batch slots $s_i$ on multiple instances of the model for batched inference.
The model parallel module allows multiple instances of different models or of the same model to execute in parallel, processing multiple inference requests of a model simultaneously.
The performance analyzer collects the system state of the edge device online in real time, including the current CPU, GPU, memory and energy-consumption utilization, the system throughput and the inference latency.
A control method of the adaptive batch processing and parallel scheduling system for deep learning model inference on edge devices comprises the following steps:
Step 1: the terminal device sends an inference request to the scheduling decision module.
Step 2: the scheduling decision module models the batching and parallel scheduling of the arriving inference requests as a Markov decision process, makes batching and parallel-inference scheduling decisions through the scheduling decision algorithm, and automatically selects an appropriate batch size and model parallelism for each model.
Step 3: the dynamic batch scheduling module appends batches of inference requests of the same model to a request sequence in order of arrival and schedules them to batch slots $s_i$ on multiple instances of the model for batched inference.
Step 4: the model parallel module executes multiple instances of different models or of the same model in parallel, processing multiple inference requests of the models simultaneously.
Step 5: the performance analyzer collects the system state of the edge device online in real time, including the current CPU, GPU, memory and energy-consumption utilization, the system throughput and the inference latency, and feeds it back to the scheduling decision module, which makes the scheduling decision for the inference requests at the next moment on the basis of an analysis of the currently available system resources.
Markov decision process modeling
The Markov decision process (MDP) is described by a quintuple $(\mathcal{S}, \mathcal{A}, \pi, p, r)$, whose elements are defined as follows:
The state: $\mathcal{S}$ is a discrete state space. At each scheduling time step, the reinforcement learning agent constructs a state $s_t \in \mathcal{S}$ by periodically collecting inference request information and system state information on the edge device.
The system state information includes the following parts:
(I) the model type $m_t$ of the current inference request;
(II) the data type $d_t$ and data size $d_s$ of the current request;
(III) the absolute deadline $ddl_a$ and relative deadline $ddl_r$ of the current request;
(IV) the currently available CPU, GPU, memory and energy-consumption utilization of the edge device, denoted $C_u$, $G_u$, $M_u$ and $E_u$ respectively;
(V) the information $seq_b$ of the request sequences waiting to be scheduled.
The action: $\mathcal{A}$ is a discrete action space used to select an appropriate batch size $b$ and model parallelism $m_c$; the action taken by the agent at scheduling time $t$ is therefore denoted $a_t = (b, m_c)$. For example, if an inference model has $n_1$ selectable batch sizes and $n_2$ selectable model parallelisms, the size of the discrete action space $\mathcal{A}$ is $n_1 \times n_2$.
The policy: the policy $\pi(a_t \mid s_t)$ is the function by which the agent determines the next action $a_t \in \mathcal{A}$ from the current environment state $s_t$ at time $t$.
Based on a maximum-entropy stochastic policy algorithm, the invention maximizes the entropy of the visited states while maximizing the cumulative expected reward obtained by the agent; the optimal policy $\pi^*$ is given by equation (2):
$$\pi^* = \arg\max_{\pi} \sum_t \mathbb{E}_{(s_t, a_t) \sim p_\pi}\big[\gamma^t r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot \mid s_t))\big] \tag{2}$$
where $\gamma \in [0, 1]$ is a discount factor, $p_\pi$ is the trajectory distribution generated by policy $\pi$, $\alpha$ is a temperature parameter controlling whether the optimization objective focuses more on the reward or on the entropy, and $\mathcal{H}(\pi(\cdot \mid s_t))$ denotes the entropy of policy $\pi$ in state $s_t$.
The state transition probability: $p(s'_t \mid s_t, a_t)$ is the probability of transitioning to the next state $s'_t$ after taking action $a_t$ in state $s_t$ at time $t$, satisfying $\sum_{s' \in \mathcal{S}} p(s'_t \mid s_t, a_t) = 1$.
The reward: $r: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ is the reward function. The goal of the agent is to maximize the expected cumulative reward $\mathbb{E}\big[\sum_t \gamma^t r_t\big]$, where $r_t$ is the immediate reward obtained when, at each scheduling time $t$, the agent selects an appropriate batch size and model parallelism and inference is performed.
So that the reward reflects the objective function, equation (3) defines $r_t$ as a weighted combination, with weights $\omega$ and $\xi$, of the utility achieved by the batch size $b$ and model parallelism $m_c$ selected by the agent and the average system resource utilization $u = (C_u + G_u + M_u + E_u)/4$.
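By way of illustration, a minimal sketch of the scheduling state $s_t$ and the immediate reward $r_t$ follows. The field encodings, the weight values and the linear reward form are assumptions made for illustration, since the exact expression of equation (3) is not reproduced in the text.

```python
from dataclasses import dataclass

@dataclass
class SchedulingState:
    """State s_t collected by the agent at each scheduling time step."""
    model_type: int      # m_t, index of the requested model
    data_type: int       # d_t
    data_size: float     # d_s
    ddl_abs: float       # ddl_a, absolute deadline
    ddl_rel: float       # ddl_r, relative deadline
    cpu_util: float      # C_u
    gpu_util: float      # G_u
    mem_util: float      # M_u
    energy_util: float   # E_u
    queue_len: int       # summary of seq_b, the waiting request sequences

def reward(utility_value: float, cpu: float, gpu: float, mem: float,
           energy: float, omega: float = 0.7, xi: float = 0.3) -> float:
    """Immediate reward r_t: weighted mix of utility and the average
    resource utilization u = (C_u + G_u + M_u + E_u) / 4.
    The linear form and the weights omega, xi are assumptions."""
    u = (cpu + gpu + mem + energy) / 4.0
    # reward high utility, penalize high utilization to avoid overload
    return omega * utility_value - xi * u
```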
A decision algorithm based on maximum entropy reinforcement learning:
the scheduling decision algorithm BP-SACD is based on an Actor-Critic framework; the Critic uses an action-state value function (Q-function) to judge the quality of the action the Actor takes according to the policy, i.e. soft policy iteration is used to maximize the reward while also maximizing the entropy.
Soft policy iteration comprises two steps, policy evaluation and policy improvement, which alternate during training.
The policy evaluation step is as follows:
first the soft Q-function is computed, defined via the modified Bellman backup operator $\mathcal{T}^\pi$ as:
$$\mathcal{T}^\pi Q(s_t, a_t) := r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim p}\big[V(s_{t+1})\big] \tag{4}$$
where
$$V(s_t) := \pi(s_t)^T \big[Q(s_t) - \alpha \log(\pi(s_t))\big] \tag{5}$$
is the soft state value function in the discrete case;
the soft Q-function is trained by minimizing the soft Bellman residual in equation (6):
$$J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}}\Big[\tfrac{1}{2}\big(Q_\theta(s_t, a_t) - (r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim p}[V_{\bar{\theta}}(s_{t+1})])\big)^2\Big] \tag{6}$$
where $\mathcal{D}$ is the replay buffer and $\bar{\theta}$ denotes target network parameters.
The policy improvement step updates the policy according to:
$$\pi_{new} = \arg\min_{\pi' \in \Pi} D_{KL}\left(\pi'(\cdot \mid s_t) \,\Big\|\, \frac{\exp\big(Q^{\pi_{old}}(s_t, \cdot)/\alpha\big)}{Z^{\pi_{old}}(s_t)}\right) \tag{7}$$
where $D_{KL}$ denotes the KL divergence and $Z^{\pi_{old}}(s_t)$ denotes the partition function;
the parameters of the policy network are updated by minimizing the KL divergence in equation (8):
$$J_\pi(\phi) = \mathbb{E}_{s_t \sim \mathcal{D}}\Big[\pi_\phi(s_t)^T \big[\alpha \log(\pi_\phi(s_t)) - Q_\theta(s_t)\big]\Big] \tag{8}$$
Temperature parameter: the temperature parameter can be adjusted automatically using equation (9):
$$J(\alpha) = \mathbb{E}_{s_t \sim \mathcal{D}}\Big[\pi_t(s_t)^T \big[-\alpha\big(\log(\pi_t(s_t)) + \bar{\mathcal{H}}\big)\big]\Big] \tag{9}$$
where $\bar{\mathcal{H}}$ is a constant vector equal to the hyperparameter representing the target entropy.
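By way of illustration, the following minimal PyTorch sketch performs one BP-SACD-style update implementing equations (4) to (9) for a discrete action space. The network shapes, the replay-batch format and all hyperparameter values are assumptions; linear layers stand in for the actual Critic and Actor networks.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n_states, n_actions, gamma = 10, 16, 0.99          # assumed dimensions

q_net = nn.Linear(n_states, n_actions)             # Q_theta(s): one value per action
q_target = nn.Linear(n_states, n_actions)          # Q with target parameters, eq. (6)
q_target.load_state_dict(q_net.state_dict())
pi_net = nn.Linear(n_states, n_actions)            # policy logits
log_alpha = torch.zeros(1, requires_grad=True)     # temperature alpha, eq. (9)
target_entropy = 0.98 * torch.log(torch.tensor(float(n_actions)))  # assumed H-bar

def update(s, a, r, s_next, optim_q, optim_pi, optim_alpha):
    alpha = log_alpha.exp().detach()
    # soft state value of the next state, eqs. (4)-(5)
    with torch.no_grad():
        pi_next = F.softmax(pi_net(s_next), dim=-1)
        v_next = (pi_next * (q_target(s_next) - alpha * torch.log(pi_next + 1e-8))).sum(-1)
        q_backup = r + gamma * v_next
    # soft Bellman residual, eq. (6)
    q_pred = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss_q = 0.5 * F.mse_loss(q_pred, q_backup)
    optim_q.zero_grad(); loss_q.backward(); optim_q.step()
    # policy improvement by minimizing the KL divergence, eqs. (7)-(8)
    pi = F.softmax(pi_net(s), dim=-1)
    log_pi = torch.log(pi + 1e-8)
    loss_pi = (pi * (alpha * log_pi - q_net(s).detach())).sum(-1).mean()
    optim_pi.zero_grad(); loss_pi.backward(); optim_pi.step()
    # automatic temperature adjustment, eq. (9)
    loss_alpha = (pi.detach() * (-log_alpha.exp() * (log_pi.detach() + target_entropy))).sum(-1).mean()
    optim_alpha.zero_grad(); loss_alpha.backward(); optim_alpha.step()
```

The optimizers would be created in the usual way, e.g. torch.optim.Adam(q_net.parameters(), lr=3e-4), with q_target periodically synchronized to q_net.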
When the dynamic batch scheduling module faces multiple request sequences of the same model, it adds them to batch slots $s_i$ in order of their arrival.
The dynamic batcher maintains a separate request sequence for each model, containing all inference requests belonging to that model; according to the decision module these are aggregated into request sequences of different batch sizes.
FIG. 2 illustrates how the dynamic batch scheduling module schedules multiple request sequences of the same model onto multiple model instances. First, the module allocates a batch slot to a newly arrived request sequence. Priority is based on the relative deadline of each request sequence: the shorter the relative deadline, the higher the priority. The request sequences are then dispatched to the corresponding batch slots according to priority; sequences of equal priority are scheduled in order of arrival time. A sketch of this policy appears after the figure description below.
The left part of the figure shows several request sequences arriving at random. The number of inference requests in each sequence is determined by the batch size chosen by the deep-reinforcement-learning-based decision module, and the individual inference requests may arrive in any order relative to those in other sequences. The right part shows how the request sequences are scheduled onto the corresponding model instances over time.
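By way of illustration, the following minimal sketch dispatches request sequences to batch slots by earliest relative deadline, breaking ties by arrival order; the class and method names are assumptions for illustration.

```python
import heapq
import itertools
from typing import Any, List, Tuple

class BatchSlotScheduler:
    """Earliest-relative-deadline-first dispatch of request sequences
    to batch slots s_i, as described for FIG. 2."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, int, Any]] = []
        self._arrival = itertools.count()   # tie-breaker: order of arrival

    def submit(self, request_sequence: Any, ddl_rel: float) -> None:
        # a shorter relative deadline ddl_r means a higher priority
        heapq.heappush(self._heap, (ddl_rel, next(self._arrival), request_sequence))

    def dispatch(self) -> Any:
        """Pop the highest-priority sequence for the next free batch slot."""
        _, _, seq = heapq.heappop(self._heap)
        return seq
```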
In the model parallel module,
if the current GPU is idle and multiple requests arrive at the same time, the model parallel module dispatches them to the GPU immediately and the GPU's hardware scheduler begins processing them in parallel;
if multiple requests for the same model arrive at the same time, model inference is performed by scheduling only one request at a time on the GPU.
FIG. 3 shows an example in which three models (model 1, model 2 and model 3) execute in parallel; specifically, each of the three models is configured to allow two instances to execute in parallel. The first three inference requests are executed immediately in parallel; the latter three inference requests must wait until the first three requests of the same models have finished executing. A sketch of this admission policy follows.
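By way of illustration, the admission behaviour described above can be sketched with one semaphore per model, sized to the model's parallelism $m_c$; the class name and the threading-based mechanism are assumptions for illustration.

```python
import threading
from typing import Callable, Dict

class ModelParallelExecutor:
    """Admits up to m_c concurrent inference requests per model; further
    requests for the same model wait, while requests for other models
    proceed in parallel (cf. FIG. 3)."""

    def __init__(self, parallelism: Dict[str, int]) -> None:
        # parallelism maps model name -> model parallelism m_c
        self._slots = {m: threading.Semaphore(mc) for m, mc in parallelism.items()}

    def infer(self, model: str, run: Callable[[], object]) -> object:
        with self._slots[model]:   # blocks while all m_c instances are busy
            return run()           # the actual GPU inference call

# Usage matching FIG. 3: three models, two parallel instances each.
executor = ModelParallelExecutor({"model1": 2, "model2": 2, "model3": 2})
```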
The performance analyzer records, for each execution on the edge device, the system state information of a series of inference models under input data of different sizes.
This system state information is fed back in time to the deep-reinforcement-learning-based decision module, which, on the basis of an analysis of the currently available system resources, makes the scheduling decision for the inference requests at the next moment, selecting an appropriate batch size and model parallelism to maximize the utility function while avoiding system overload and improving resource utilization. This also reflects the potential advantage of the scheduler of the invention in dynamic resource management and allocation. A minimal sketch of such online state collection follows.
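By way of illustration, the following sketch uses the psutil package for CPU and memory utilization; the GPU and energy probes are left as platform-specific stubs (on Jetson-class devices they might be read from tegrastats), which is an assumption rather than the patent's implementation.

```python
import time
import psutil  # assumed available on the edge device

def collect_system_state() -> dict:
    """Online snapshot of the state fed back to the decision module."""
    return {
        "timestamp": time.time(),
        "cpu_util": psutil.cpu_percent(interval=None) / 100.0,  # C_u
        "mem_util": psutil.virtual_memory().percent / 100.0,    # M_u
        "gpu_util": None,     # G_u: fill in with a platform-specific probe
        "energy_util": None,  # E_u: fill in with a platform-specific probe
    }
```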
an electronic device comprising a memory storing a computer program and a processor implementing the steps of any of the above methods when the processor executes the computer program.
A computer readable storage medium storing computer instructions which, when executed by a processor, implement the steps of any of the above methods.
Embodiment:
1. Baseline methods: traditional heuristic methods and other reinforcement learning methods are used as baselines, each substituted for the proposed maximum-entropy decision algorithm BP-SACD within the scheduling system BPEdge for comparison. The specific baselines are:
Genetic Algorithm (GA): GA is one of the most classical heuristic algorithms; its main characteristic is that, for nonlinear extremum problems, it can escape a local optimum with probability 1 and go on to find the global optimum. The invention uses the utility function as the fitness function of GA.
Proximal Policy Optimization (PPO): PPO serves as a reinforcement learning baseline; it is an on-policy algorithm with good stability.
Double Deep Q-Network (DDQN): DDQN is an off-policy algorithm that performs well in discrete settings.
Three performance indicators are considered:
(1) Utility: reflects the degree of balance between system throughput and inference latency; the higher the utility value, the higher the system throughput and the lower the inference latency;
(2) Convergence speed: reflects the relative performance of the algorithms;
(3) Scheduling overhead: the time required to perform a scheduling operation.
2. Scheduling system utility: for real-time performance, the request rate is set to 30 fps, i.e. 30 video frames of different resolutions arrive per second as inference requests, with the absolute deadline of each inference request drawn from [10 ms, 15 ms]; in total 200,000 inference requests are generated over 1.85 hours. In addition, the number of training runs for each algorithm is set to 100.
FIG. 4 shows the degree to which the four algorithms balance system throughput and inference latency when performing batched and parallel inference on two state-of-the-art models for object detection and image classification.
For the YOLO-v5-based object detection inference in FIG. 4(a), it can be seen that the performance of the BP-SACD algorithm proposed by the invention is superior to the other three methods. Specifically, the utility value of BP-SACD is 11.2%, 41.8% and 21.2% higher than that of GA, PPO and DDQN respectively, and its stability is better than that of the other three methods.
In the inference test of the MobileNet-v3-based image classification model (FIG. 4(b)), the utility value of BP-SACD likewise shows performance improvements of 3.2%, 58% and 30.2% over the other three methods respectively.
FIGS. 5 and 6 show how the batch size and the model parallelism selected by the four algorithms vary over the course of scheduling. BP-SACD finds the optimal batch size and model parallelism for the different models faster than the other three algorithms.
FIGS. 7 and 8 show how the average system throughput and the average inference latency of the four algorithms vary over the course of scheduling. Compared with the other three algorithms, BP-SACD achieves a better balance between system throughput and inference latency, tending to accept a moderate inference throughput in exchange for lower inference latency.
The adaptive batch processing and parallel scheduling system for deep learning model inference on edge devices proposed by the invention has been introduced in detail above, and the principle and implementation of the invention have been explained. The description of the above embodiments is intended only to help in understanding the method of the invention and its core idea; meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the scope of application according to the idea of the invention. In summary, the contents of this specification should not be construed as limiting the invention.

Claims (10)

1. An adaptive batch processing and parallel scheduling system for deep learning model inference on edge devices, characterized in that:
the system comprises a decision module, a dynamic batch scheduling module, a model parallel module and a performance analyzer;
the decision module models the batching and parallel scheduling of arriving inference requests as a Markov decision process, makes batching and parallel-inference scheduling decisions through a scheduling decision algorithm, and automatically selects an appropriate batch size and model parallelism for each model;
the dynamic batch scheduling module appends inference requests to a request sequence in order of arrival and schedules them to batch slots $s_i$ on multiple instances of the model for batched inference;
the model parallel module allows multiple instances of different models or of the same model to execute in parallel, processing multiple inference requests of a model simultaneously;
the performance analyzer collects the system state of the edge device online in real time, including the current CPU, GPU, memory and energy-consumption utilization, the system throughput and the inference latency.
2. A control method of the adaptive batch processing and parallel scheduling system for deep learning model inference on edge devices according to claim 1, characterized in that:
the control method specifically comprises the following steps:
Step 1: the terminal device sends an inference request to the scheduling decision module;
Step 2: the scheduling decision module models the batching and parallel scheduling of the arriving inference requests as a Markov decision process, then makes batching and parallel-inference scheduling decisions through the scheduling decision algorithm, and automatically selects an appropriate batch size and model parallelism for each model;
Step 3: the dynamic batch scheduling module appends the inference requests to a request sequence in order of arrival and schedules them to batch slots $s_i$ on multiple instances of the model for batched inference;
Step 4: the model parallel module executes multiple instances of different models or of the same model in parallel, processing multiple inference requests of the models simultaneously;
Step 5: the performance analyzer collects the system state of the edge device online in real time, including the current CPU, GPU, memory and energy-consumption utilization, the system throughput and the inference latency, and feeds it back to the scheduling decision module, which makes the scheduling decision for the inference requests at the next moment on the basis of an analysis of the currently available system resources.
3. The control method according to claim 2, characterized in that:
the Markov decision process is described by a quintuple $(\mathcal{S}, \mathcal{A}, \pi, p, r)$, whose elements are defined as follows:
State: $\mathcal{S}$ is a discrete state space; at each scheduling time step, the reinforcement learning agent constructs a state $s_t$ ($s_t \in \mathcal{S}$) by periodically collecting inference request information and system state information on the edge device;
the system state information includes the following parts:
(I) the model type $m_t$ of the current inference request;
(II) the data type $d_t$ and data size $d_s$ of the current request;
(III) the absolute deadline $ddl_a$ and relative deadline $ddl_r$ of the current request;
(IV) the currently available CPU, GPU, memory and energy-consumption utilization of the edge device, denoted $C_u$, $G_u$, $M_u$ and $E_u$ respectively;
(V) the information $seq_b$ of the request sequences waiting to be scheduled;
Action: $\mathcal{A}$ is a discrete action space used to select an appropriate batch size $b$ and model parallelism $m_c$; the action taken by the agent at scheduling time $t$ is therefore denoted $a_t = (b, m_c)$;
Policy: the policy $\pi(a_t \mid s_t)$ is the function by which the agent determines the next action $a_t \in \mathcal{A}$ from the current environment state $s_t$ at time $t$;
the entropy of the visited states is maximized while maximizing the cumulative expected reward obtained by the agent; the optimal policy $\pi^*$ is given by equation (2):
$$\pi^* = \arg\max_{\pi} \sum_t \mathbb{E}_{(s_t, a_t) \sim p_\pi}\big[\gamma^t r(s_t, a_t) + \alpha \mathcal{H}(\pi(\cdot \mid s_t))\big] \tag{2}$$
where $\gamma \in [0, 1]$ is a discount factor, $p_\pi$ is the trajectory distribution generated by policy $\pi$, $\alpha$ is a temperature parameter controlling whether the optimization objective focuses more on the reward or on the entropy, and $\mathcal{H}(\pi(\cdot \mid s_t))$ denotes the entropy of policy $\pi$ in state $s_t$;
State transition probability: $p(s'_t \mid s_t, a_t)$ is the probability of transitioning to the next state $s'_t$ after taking action $a_t$ in state $s_t$ at time $t$, satisfying $\sum_{s' \in \mathcal{S}} p(s'_t \mid s_t, a_t) = 1$;
Reward: $r: \mathcal{S} \times \mathcal{A} \rightarrow \mathbb{R}$ is the reward function; the goal of the agent is to maximize the expected cumulative reward $\mathbb{E}\big[\sum_t \gamma^t r_t\big]$, where $r_t$ is the immediate reward obtained when, at each scheduling time $t$, the agent selects an appropriate batch size and model parallelism and inference is performed;
so that the reward reflects the objective function, equation (3) defines $r_t$ as a weighted combination, with weights $\omega$ and $\xi$, of the utility achieved by the batch size $b$ and model parallelism $m_c$ selected by the agent and the average system resource utilization $u = (C_u + G_u + M_u + E_u)/4$.
4. The control method according to claim 3, characterized in that:
the scheduling decision algorithm is based on an Actor-Critic framework; the Critic uses an action-state value function (Q-function) to judge the quality of the action the Actor takes according to the policy, i.e. soft policy iteration is used to maximize the reward while also maximizing the entropy;
soft policy iteration comprises two steps, policy evaluation and policy improvement, which alternate during training;
the policy evaluation step is as follows:
first the soft Q-function is computed, defined via the modified Bellman backup operator $\mathcal{T}^\pi$ as:
$$\mathcal{T}^\pi Q(s_t, a_t) := r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim p}\big[V(s_{t+1})\big] \tag{4}$$
where
$$V(s_t) := \pi(s_t)^T \big[Q(s_t) - \alpha \log(\pi(s_t))\big] \tag{5}$$
is the soft state value function in the discrete case;
the soft Q-function is trained by minimizing the soft Bellman residual in equation (6):
$$J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}}\Big[\tfrac{1}{2}\big(Q_\theta(s_t, a_t) - (r(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1} \sim p}[V_{\bar{\theta}}(s_{t+1})])\big)^2\Big] \tag{6}$$
5. The control method according to claim 4, characterized in that:
the policy improvement step updates the policy according to:
$$\pi_{new} = \arg\min_{\pi' \in \Pi} D_{KL}\left(\pi'(\cdot \mid s_t) \,\Big\|\, \frac{\exp\big(Q^{\pi_{old}}(s_t, \cdot)/\alpha\big)}{Z^{\pi_{old}}(s_t)}\right) \tag{7}$$
where $D_{KL}$ denotes the KL divergence and $Z^{\pi_{old}}(s_t)$ denotes the partition function;
the parameters of the policy network are updated by minimizing the KL divergence in equation (8):
$$J_\pi(\phi) = \mathbb{E}_{s_t \sim \mathcal{D}}\Big[\pi_\phi(s_t)^T \big[\alpha \log(\pi_\phi(s_t)) - Q_\theta(s_t)\big]\Big] \tag{8}$$
Temperature parameter: the temperature parameter can be adjusted automatically using equation (9):
$$J(\alpha) = \mathbb{E}_{s_t \sim \mathcal{D}}\Big[\pi_t(s_t)^T \big[-\alpha\big(\log(\pi_t(s_t)) + \bar{\mathcal{H}}\big)\big]\Big] \tag{9}$$
where $\bar{\mathcal{H}}$ is a constant vector equal to the hyperparameter representing the target entropy.
6. The control method according to claim 5, characterized in that:
when the dynamic batch scheduling module faces multiple request sequences of the same model, it adds them to batch slots $s_i$ in order of their arrival;
the dynamic batcher maintains a separate request sequence for each model, containing all inference requests belonging to that model; according to the decision module these are aggregated into request sequences of different batch sizes.
7. The control method according to claim 6, characterized in that:
in the model parallel module,
if the current GPU is idle and multiple requests arrive at the same time, the model parallel module dispatches them to the GPU immediately and the GPU's hardware scheduler begins processing them in parallel;
if multiple requests for the same model arrive at the same time, model inference is performed by scheduling only one request at a time on the GPU.
8. The control method according to claim 7, characterized in that:
the performance analyzer records, for each execution on the edge device, the system state information of a series of inference models under input data of different sizes;
the system state information is fed back to the scheduling decision module in time, and on the basis of an analysis of the currently available system resources a scheduling decision is made for the inference requests at the next moment, selecting an appropriate batch size and model parallelism to maximize the utility function while avoiding system overload and improving resource utilization.
9. An electronic device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method according to any one of claims 2 to 8 when executing the computer program.
10. A computer readable storage medium storing computer instructions, which when executed by a processor implement the steps of the method of any one of claims 2 to 8.
CN202210662359.XA 2022-06-13 2022-06-13 Adaptive batch processing and parallel scheduling system for deep learning model inference of edge equipment Pending CN115454585A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210662359.XA CN115454585A (en) 2022-06-13 2022-06-13 Adaptive batch processing and parallel scheduling system for deep learning model inference of edge equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210662359.XA CN115454585A (en) 2022-06-13 2022-06-13 Adaptive batch processing and parallel scheduling system for deep learning model inference of edge equipment

Publications (1)

Publication Number Publication Date
CN115454585A 2022-12-09

Family

ID=84296543

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210662359.XA Pending CN115454585A (en) 2022-06-13 2022-06-13 Adaptive batch processing and parallel scheduling system for deep learning model inference of edge equipment

Country Status (1)

Country Link
CN (1) CN115454585A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116090383A (en) * 2022-12-27 2023-05-09 广东高云半导体科技股份有限公司 Method, device, computer storage medium and terminal for realizing static time sequence analysis



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination