CN114706678A - Neural network inference task scheduling method for edge intelligent server - Google Patents

Neural network inference task scheduling method for edge intelligent server

Info

Publication number
CN114706678A
Authority
CN
China
Prior art keywords
task
inference
tasks
neural network
gpu
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210284033.8A
Other languages
Chinese (zh)
Inventor
王彦波
张德宇
张永敏
吕丰
张尧学
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University filed Critical Central South University
Priority to CN202210284033.8A priority Critical patent/CN114706678A/en
Publication of CN114706678A publication Critical patent/CN114706678A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083Techniques for rebalancing the load in a distributed system
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models
    • G06N5/041Abduction

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a neural network inference task scheduling method for an edge intelligent server, which comprises the following steps: virtualizing a GPU into a plurality of virtual GPUs by using GPU virtualization technology; allocating preset resources to the virtual GPUs according to a preset allocation strategy, and assigning inference tasks, in preset execution batches and via a queuing service system, to the virtual GPU corresponding to each inference task's category; collecting the average service delay and the amount of computing resources of each class of inference tasks, judging whether the allocation strategy needs to be adjusted, and, if so, calculating a new allocation strategy with a reinforcement learning algorithm; and allocating the corresponding resources to the virtual GPUs according to the new allocation strategy and assigning the neural network inference tasks, in the corresponding execution batches and via the queuing service system, to the virtual GPU corresponding to each task's category. The invention meets the real-time requirements of dynamic scenes with low computational complexity and effectively solves the load balancing problem in larger-scale edge computing scenarios.

Description

Neural network inference task scheduling method for edge intelligent server
Technical Field
The invention relates to the field of edge computing, in particular to a neural network inference task scheduling method for an edge intelligent server.
Background
Edge Computing (EC) is a new cloud computing paradigm in which servers are deployed at the edge of the network to provide computing services for users. The network edge is not the terminal device itself but a network location close to the terminal device, characterized by low communication delay with the terminal device. In complex real-life scenarios, however, a single edge server, and especially an edge intelligent server, has to handle many types of highly concurrent neural network inference tasks. The resulting problem is how to perform appropriate task and resource scheduling so as to increase the speed at which the edge server processes these tasks and to increase throughput.
A queuing system, also called a "queuing service system", is a service system composed of one or more service stations connected in parallel, in series or in a mixed manner; it serves a number of customers or work objects with different requirements and determines the order of service according to given queuing rules. Most production, manufacturing and service systems in reality are queuing systems, and the object being served may be a natural person, a piece of work to be completed or a workpiece to be processed. The batch queuing system is a derivative of this model: tasks are not processed immediately, but accumulate in the system until a certain number is reached and are then processed simultaneously as one batch. As the processing speed increases, the resource overhead gradually increases as well. If the batch size is set too large, tasks queue up in the system because of the limited resources, so that the queuing delay becomes significantly higher than expected; if the batch size is set too small, the advantages of batch processing cannot be exploited effectively.
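To make the accumulate-and-serve rule described above concrete, the following minimal Python sketch (names and structure are illustrative, not taken from the patent) shows a queue that waits until at least a tasks are present and then serves at most b of them as one batch:

```python
from collections import deque

class BatchQueue:
    """Minimal (a, b) bulk-service rule: wait until at least `a` tasks
    are queued, then serve up to `b` of them as one batch."""

    def __init__(self, a: int, b: int):
        assert 1 <= a <= b
        self.a, self.b = a, b
        self.waiting = deque()

    def submit(self, task):
        self.waiting.append(task)

    def next_batch(self):
        # Not enough tasks accumulated yet: the server keeps waiting.
        if len(self.waiting) < self.a:
            return None
        # Serve at most b tasks; the rest stay queued.
        return [self.waiting.popleft() for _ in range(min(self.b, len(self.waiting)))]
```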
Disclosure of Invention
The technical problem to be solved by the invention is as follows: aiming at the above problems in the prior art, the invention provides a neural network inference task scheduling method for an edge intelligent server. In the scenario where the edge intelligent server faces highly concurrent, multi-type neural network inference tasks, the characteristics of the inference tasks are analyzed and the tasks are assigned to the corresponding computing resources; the relevant evaluation indices of the system are obtained by modeling and simulating a dynamic batch queuing service system; and finally a reinforcement learning algorithm (D3QN) determines the system resources and the task scheduling scheme, so as to increase the execution speed of the inference tasks and solve the problem of low throughput for multi-type, high-concurrency inference tasks.
In order to solve the technical problems, the technical scheme provided by the invention is as follows:
a neural network inference task scheduling method for an edge intelligent server comprises the following steps:
s1, virtualizing the GPU into a plurality of virtual GPUs by utilizing a GPU virtualization technology;
s2, distributing preset resources for the virtual GPU, and distributing inference tasks for the virtual GPU corresponding to the type of each inference task by using a queuing service system according to a preset execution batch;
s3, collecting the average service delay and the allocated amount of computing resources of each class of inference tasks, judging whether the allocation strategy needs to be adjusted, and, if so, calculating a new allocation strategy by using a reinforcement learning algorithm;
and S4, distributing resources corresponding to the new distribution strategy for the virtual GPU, and distributing neural network inference tasks for the virtual GPU corresponding to the category of each task by using the queuing service system according to the execution batch corresponding to the new distribution strategy.
Further, the specific step of step S3 includes:
collecting the average service delay E(W_i) of all inference tasks and the amount of computing resources α_i allocated by the edge intelligent server to all inference tasks as the current state value s_t, and selecting the action A corresponding to the current state value s_t by using the ε-greedy method;
reallocating the amount of computing resources and performing task scheduling according to action A to obtain the next state s_t′, and obtaining a reward R through a reward-and-punishment mechanism according to the remaining resource amount and the execution of the inference tasks;
combining the task execution and resource allocation of the current state into the current state data s_t, combining the task execution and resource allocation of the next state into the next state data s_t′, storing the current state data s_t, the action A, the next state data s_t′ and the reward R as an array D in the prioritized experience replay pool, and calculating the sampling probability of the array D in the prioritized experience replay pool by a TD-error algorithm;
sampling arrays D from the experience replay pool into the D3QN network according to the sampling probability and performing gradient-descent error training of the D3QN network; judging whether the termination condition is met; if so, obtaining the trained GPU resource allocation and task scheduling model and executing the next step; otherwise, taking the next state s_t′ as the current state s_t and returning to the step of selecting the action A corresponding to the current state value s_t by using the ε-greedy method;
and importing the trained GPU resource allocation and task scheduling model into the edge intelligent server.
Further, the method for collecting the average service delay E(W_i) of all inference tasks and the amount of computing resources α_i allocated by the edge intelligent server to all inference tasks also comprises the following step:
an environment capable of virtualizing a GPU and a simulated task submission model are built on an edge server, and the task submission model is placed in the environment for autonomous scheduling execution by adopting a D3QN network.
Further, collecting the average service delay E(W_i) of all inference tasks specifically comprises: calculating the average service delay of each class of inference tasks according to the class of the arriving inference tasks and the corresponding average number of tasks, with the expression:
E(W_i) = E(L_i) / λ'_i
where i is the type number of the inference task, λ'_i is the effective arrival rate of class-i inference tasks, and E(L_i) is the average number of class-i inference tasks in the queuing service system.
Further, the specific steps of reallocating the amount of computing resources and performing task scheduling according to action A include:
in action A, if a class of inference tasks has an average service delay E(W_i) greater than the preset threshold, increasing the amount of computing resources or decreasing the execution batch size; if no average service delay E(W_i) is greater than the preset threshold, keeping the amount of computing resources and the execution batch size unchanged.
Further, the preset threshold for the average service delay E(W_i) is the delay of the corresponding inference task executed on its own.
Further, the expression of the reward R obtained through the reward-and-punishment mechanism is as follows:
(Equation image in the original, defining R from the execution-time improvement rate and the resource occupation; not reproduced here.)
where i is the type number of the inference task, T'_i (shown only as an image in the original) represents the execution time of the class-i task under the current allocation strategy, T_i is the standalone execution time of the class-i task, and α'_i denotes the percentage of GPU resources allocated to the class-i task.
Further, the method for collecting the average service delay E(W_i) of all inference tasks and the amount of computing resources α_i allocated by the edge intelligent server to all inference tasks also comprises the step of calculating the resource occupation of each class of inference tasks, specifically:
calculating the server rate of each class of inference tasks according to the type of the neural network, the time required to initialize each class of inference tasks, and the amount of computing resources allocated by the edge intelligent server to each class of inference tasks, with the expression:
(Equation image in the original, defining the server rate μ_i as a function of ε_i, j_i and α_i; not reproduced here.)
where i is the type number of the inference task, ε_i denotes the time required to initialize a class-i inference task, j_i is the base of the logarithmic function, related to the GFLOPS of the neural network model itself, and α_i denotes the amount of computing resources allocated by the edge intelligent server to class-i inference tasks.
The invention also provides an edge intelligent server neural network inference task scheduling system which is programmed or configured to execute any one of the edge intelligent server neural network inference task scheduling methods.
The present invention also provides a computer readable storage medium having stored therein a computer program programmed or configured to perform any of the edge intelligent server neural network inference task scheduling methods described herein.
Compared with the prior art, the invention has the advantages that:
the invention adjusts the server resource and the task scheduling strategy through mathematical modeling quantitative analysis and reinforcement learning algorithm to solve the problem of low throughput of multi-type and high-concurrency inferred tasks. Firstly, analyzing the task quantity and calculating the average service delay of each type of task in a queuing service system based on a method of experimental analysis and mathematical modeling, judging whether the strategy is unreasonable at the moment according to all the average service delays, starting a strategy adjusting algorithm realized based on reinforcement learning to adjust the strategy if the strategy is unreasonable, and repeating the process until the strategy is reasonable. The method has low computational complexity, meets the real-time requirement of a dynamic scene, and effectively solves the load balancing problem in a large-scale edge computing scene.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention.
Fig. 2 is a diagram illustrating task scheduling performed by the queuing service system according to an embodiment of the present invention.
FIG. 3 is a schematic diagram showing the relationship between the amount of computing resources allocated to the residual network inference task by the edge intelligent server and the inference time.
FIG. 4 is a schematic diagram showing the relationship between the amount of computing resources allocated to the convolutional neural network inference task by the edge intelligent server and the inference time.
FIG. 5 is a diagram illustrating the relationship between the amount of computing resources allocated by an edge intelligence server to a dense convolutional network inference task and inference time.
Detailed Description
The invention is further described below with reference to the drawings and specific preferred embodiments of the description, without thereby limiting the scope of protection of the invention.
According to our research, neural network models at the present stage show a clear trend: most models used for the same kind of task, such as computer vision tasks, are structurally similar, and some tasks can even be served by the same model. Such tasks can therefore be grouped into a batch and processed together, thereby improving the throughput of neural network inference. Increasing the batch size of inference tasks can significantly increase the processing speed of neural network inference, but as the processing speed increases, the resource overhead also grows. Once the batch size is fixed, a problem arises: if the batch size is set too large, some tasks will queue in the system because of memory constraints, making the queuing delay significantly higher than expected; if it is set too small, the advantages of batch processing cannot be exploited effectively. From this preliminary analysis it is clear that the choice of inference batch size is limited by the configuration of the edge intelligent server, that the batch size significantly affects the speed at which the queuing service system processes inference tasks, and that the useful batch size is therefore bounded. In addition, task arrivals are random in real scenarios, and different inference tasks affect each other through resource preemption and similar issues, which complicates quantitative analysis. To solve the problem of resource contention among different kinds of tasks, GPU virtualization technology is adopted: it virtualizes one physical GPU into several mutually independent GPUs, reducing the impact of resource contention on the efficiency with which the server executes inference tasks.
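The patent does not name a particular virtualization mechanism, so the short sketch below only illustrates the bookkeeping implied by this paragraph: one physical GPU whose compute budget is normalized to 1 is split into mutually independent fractional shares, one per task class. All names are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class VirtualGPU:
    task_class: int     # class of inference task bound to this vGPU
    share: float        # fraction alpha_i of the physical GPU's compute, 0 < share <= 1

@dataclass
class PhysicalGPU:
    vgpus: list = field(default_factory=list)

    def allocate(self, task_class: int, share: float) -> VirtualGPU:
        used = sum(v.share for v in self.vgpus)
        if used + share > 1.0 + 1e-9:
            raise ValueError("total allocation would exceed the physical GPU (normalized to 1)")
        vgpu = VirtualGPU(task_class, share)
        self.vgpus.append(vgpu)
        return vgpu
```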
On this basis, on the one hand, the computation amount of different kinds of tasks is calculated and a resource-speed model of the tasks is built to guide the system's resource allocation. On the other hand, the execution of the various tasks in the system is modeled according to dynamic batch queuing theory, yielding an optimization problem that ultimately guides task scheduling.
Based on the above analysis and technical idea, this embodiment provides a neural network inference task scheduling method for an edge intelligent server which, as shown in fig. 1, includes the following steps:
s1, virtualizing the GPU into a plurality of virtual GPUs by utilizing a GPU virtualization technology;
s2, distributing preset resources for the virtual GPU according to a preset distribution strategy, and distributing inference tasks for the virtual GPU corresponding to the type of each inference task by using a queuing service system according to a preset execution batch; as shown in fig. 2, for different categories of task one, task two and task three, the queuing service system sequentially allocates each batch of task one, task two and task three to the corresponding virtual GPU;
s3, collecting the average service delay and the allocated amount of computing resources of each class of inference tasks, judging whether the allocation strategy needs to be adjusted, and, if so, calculating a new allocation strategy by using a reinforcement learning algorithm;
and S4, distributing corresponding resources for the virtual GPU according to the new distribution strategy, and distributing the neural network inference task for the virtual GPU corresponding to the category of each task by using the queuing service system according to the corresponding execution batch.
Through the above steps, the neural network inference task scheduling method for the edge intelligent server in this embodiment first uses GPU virtualization technology to divide the GPU into several shareable partitions, each of which can be allocated GPU resources independently; it then calculates the average service delay of the inference tasks and finally uses a reinforcement learning algorithm, driven by the average service delay, to adjust the allocation strategy covering both resource allocation and task scheduling. In this way the method meets the real-time requirements of dynamic scenes with low computational complexity and effectively solves the load balancing problem in larger-scale edge computing scenarios.
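The following sketch ties steps S1 to S4 together as a control loop. Every callable (allocate, dispatch, collect_metrics, needs_adjustment, new_policy) is a hypothetical stand-in for the components described above, not an API defined by the patent:

```python
def scheduling_loop(allocate, dispatch, collect_metrics, needs_adjustment, new_policy, policy, rounds=100):
    """Illustrative control loop for steps S1-S4 (hypothetical callables):
    allocate(policy) partitions the GPU into vGPUs, dispatch runs one round of
    batched inference, collect_metrics returns per-class (E(W_i), alpha_i),
    needs_adjustment applies the delay-threshold test, and new_policy queries
    the trained D3QN agent."""
    vgpus = allocate(policy)                       # S1/S2: vGPUs with preset resources
    for _ in range(rounds):
        dispatch(vgpus, policy)                    # S2/S4: queued batch dispatch per task class
        metrics = collect_metrics()                # S3: E(W_i) and alpha_i for every class
        if needs_adjustment(metrics):              # delay above the standalone-execution threshold?
            policy = new_policy(metrics)           # S3: reinforcement learning proposes a new strategy
            vgpus = allocate(policy)               # S4: reallocate resources and batch sizes
    return policy
```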
Step S3 of this embodiment is used to determine whether the queuing service system has an unreasonable scheduling scheme. If so, a policy adjustment algorithm implemented with reinforcement learning is started to adjust the resource allocation and task scheduling; if the adjusted scheme is still judged unreasonable for the current scenario, adjustment continues until the policy is reasonable. Even after the policy becomes reasonable, it may become unreasonable again after some time because tasks keep arriving, so the above actions must be repeated until the policy is reasonable again. The system state at each point is recorded for subsequent offline adjustment. The specific steps include:
The first step: an environment capable of virtualizing the GPU and a simulated task submission model are built on the edge server, and the task submission model is placed in the environment for autonomous scheduling and execution with a D3QN network.
The second step: the average service delay E(W_i) of all inference tasks and the amount of computing resources α_i allocated by the edge intelligent server to all inference tasks are collected as the current state value s_t, and the action A corresponding to the current state value s_t is selected by the ε-greedy method.
The third step: the amount of computing resources is reallocated and task scheduling is performed according to action A to obtain the next state s_t′, and a reward R is obtained through the reward-and-punishment mechanism according to the remaining resource amount and the execution of the inference tasks.
The fourth step: the task execution and resource allocation of the current state are combined into the current state data s_t, and those of the next state into the next state data s_t′; the current state data s_t, the action A, the next state data s_t′ and the reward R are stored as an array D in the prioritized experience replay pool, and the sampling probability of the array D in the pool is calculated by a TD-error algorithm.
The fifth step: arrays D are sampled from the experience replay pool into the D3QN network according to their sampling probabilities, and gradient-descent error training of the D3QN network is performed; whether the termination condition is met is then judged; if so, the trained GPU resource allocation and task scheduling model is obtained and the next step is executed; otherwise, the next state s_t′ is taken as the current state s_t and the procedure returns to the step of selecting the action A corresponding to the current state value s_t with the ε-greedy method.
The sixth step: the trained GPU resource allocation and task scheduling model is imported into the edge intelligent server, resource allocation and task scheduling are performed in the real environment, and the final allocation and scheduling scheme is obtained.
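A compact sketch of the second to fifth steps is given below. The environment, the two Q-networks and the prioritized replay buffer are hypothetical stand-ins (not a specific library API); the loop only mirrors the described flow: ε-greedy action selection, applying the action, storing the transition with a TD-error priority, priority-based sampling, and a gradient step:

```python
import random

def td_error(online_net, target_net, s, a, r, s_next, gamma):
    """Double-DQN style TD error: the online network chooses the next action,
    the target network evaluates it."""
    q_next = online_net.q(s_next)
    a_star = max(range(len(q_next)), key=q_next.__getitem__)
    return r + gamma * target_net.q(s_next)[a_star] - online_net.q(s)[a]

def train_d3qn(env, online_net, target_net, replay, episodes, eps=0.1, gamma=0.99):
    """Sketch of the loop in the second to fifth steps; env, the networks and
    replay (a prioritized buffer with add/sample/update) are hypothetical."""
    for _ in range(episodes):
        s, done = env.reset(), False          # s_t = [E(W_i), alpha_i] over all task classes
        while not done:
            # epsilon-greedy selection over the scheduling actions of Table 1
            q_s = online_net.q(s)
            a = random.randrange(len(q_s)) if random.random() < eps \
                else max(range(len(q_s)), key=q_s.__getitem__)
            s_next, r, done = env.step(a)     # reallocate resources / resize batches, observe reward R
            # store the transition (array D) with a TD-error based priority
            replay.add((s, a, r, s_next), abs(td_error(online_net, target_net, s, a, r, s_next, gamma)))
            batch, idx = replay.sample()      # sample according to the priority probabilities
            online_net.gradient_step(batch, target_net, gamma)   # dueling/double DQN update
            replay.update(idx, [abs(td_error(online_net, target_net, *t, gamma)) for t in batch])
            s = s_next
```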
The amount of computing resources α_i allocated to all inference tasks is collected in the second step because the speed at which the server processes a neural network inference task, i.e. the service rate, is a function of the computing resources the server allocates to the task and of the computational load of the task itself. To quantitatively analyze the relationship among server resource allocation, task computation amount and service rate, these quantities are modeled. Our analysis shows that when the computing resources allocated to an inference task are increased, the speed at which the server processes that task first increases and then levels off, and this rule holds for different inference tasks. Therefore, normalizing the total computing resources of the server to 1 and letting α_i denote the amount of computing resources allocated to class-i inference tasks, the relationship between a task's resource occupation and its service time can be expressed as:
(Equation image in the original, defining the server rate μ_i as a function of ε_i, j_i and α_i; not reproduced here.)
where μ_i denotes the server rate for class-i inference tasks, ε_i denotes the time required to initialize a class-i inference task, j_i is the base of the logarithmic function, related to the GFLOPS of the neural network model itself, and α_i denotes the amount of computing resources the server allocates to class-i inference tasks. We tested several common models for the relationship between inference time and resource allocation, as shown in figs. 3-5.
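Since the exact μ_i(α_i) expression is reproduced only as an image, the sketch below assumes a saturating form service_time(α) ≈ ε + k / log_j(1 + α), consistent with the described "increase then level off" behavior, and fits (ε, k) to measured (α, time) pairs such as those behind figs. 3-5. The functional form itself is an assumption, not the patent's formula:

```python
import math

def fit_resource_speed(samples, j):
    """Least-squares fit of the assumed form service_time(alpha) ~= epsilon + k / log_j(1 + alpha)
    to measured (alpha, time) pairs, using the transformed variable x = 1 / log_j(1 + alpha)."""
    xs = [1.0 / (math.log1p(alpha) / math.log(j)) for alpha, _ in samples]
    ys = [t for _, t in samples]
    n = len(samples)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    k = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum((x - mean_x) ** 2 for x in xs)
    epsilon = mean_y - k * mean_x
    # Under this assumption, the server rate is mu(alpha) = 1 / (epsilon + k / log_j(1 + alpha)).
    return epsilon, k
```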
For the average service delay E(W_i) of all inference tasks collected in the second step, the task arrival process is assumed to be a Poisson process with arrival rate λ. As tasks arrive at the edge server, the task flow is stationary, memoryless and orderly: the number of arrivals in an interval depends only on the length of the interval, arrivals in non-overlapping intervals are independent, and at most one task arrives in a sufficiently small interval. The assumption therefore holds. Once server resource scheduling is completed, the service time for a given class of tasks tends to be fixed, i.e. it follows a fixed-length (deterministic) distribution. Let a and b denote the lower and upper limits of the batch size, respectively. When fewer than a tasks are waiting, the server waits until at least a tasks have arrived and then serves them as one batch; when the number of waiting tasks is between a and b, all of them are served as one batch; when more than b tasks are waiting, only b tasks are served at a time and the remaining tasks continue to queue. Let X(n, r) and Y(n, r) denote, respectively, the number of tasks remaining in the queuing service system when the nth service (batch) is completed and the number of tasks arriving while the nth batch is served, where r is the number of tasks executed in the nth batch. Then X(n+1, r) can be expressed as follows:
(Equation image in the original, giving the recursion for X(n+1, r) in terms of X(n, r) and Y(n+1, r); not reproduced here.)
It can be seen that X(n+1, r) depends only on X(n, r) and is independent of the value of n, so {X(n, r)}, n ∈ N_0, r ∈ M_{a,b} = {a, a+1, …, b}, is a homogeneous Markov chain. The queuing service system can therefore be described by an M/D(a,b)/1/N queue.
According to the relevant theory, the average service delay of a given class of tasks in the queuing service system is calculated with the expression:
E(W_i) = E(L_i) / λ'_i
where i is the type number of the inference task, λ'_i is the effective arrival rate of class-i inference tasks, and E(L_i) is the average number of class-i inference tasks in the queuing service system.
The service delay of the whole queuing service system is therefore:
E(W) = [E(W_1), E(W_2), …, E(W_N)], i ∈ {1, 2, 3, …, N} (4)
where i indexes the classes of inference tasks.
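As a small worked example of the Little's-law relation above, with hypothetical measured values:

```python
def average_service_delay(effective_arrival_rate, avg_tasks_in_system):
    """Little's law as used above: E(W_i) = E(L_i) / lambda'_i."""
    return avg_tasks_in_system / effective_arrival_rate

# E(W) vector over three task classes (illustrative values only)
lambdas = [12.0, 8.0, 5.0]          # effective arrival rates lambda'_i (tasks per second)
avg_in_system = [3.6, 1.6, 2.5]     # E(L_i) measured from the queuing service system
EW = [average_service_delay(l, L) for l, L in zip(lambdas, avg_in_system)]
# EW == [0.3, 0.2, 0.5] seconds of average service delay per class
```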
After the server calculates the service delay of the whole system in the current state, a large E(W_i) for some class of tasks indicates that the resources allocated to that class are insufficient or that the scheduling scheme is unreasonable, and the resource allocation and task scheduling schemes need to be adjusted; otherwise, the tasks already perform well under the system's current strategy. When the average service delay of some class of tasks in E(W) is significantly larger than the delay of that task executed on its own, the system's allocation and scheduling strategy is unreasonable and the reinforcement learning algorithm (D3QN) is started to adjust the strategy. The details of the reinforcement learning algorithm are as follows:
in this embodiment, for the state Space (Station Space) of the reinforcement learning algorithm: e (W)i) Can be used for describing the assignment of a certain type of task to the resource alpha in the systemiWhen the execution is completed, the set s is [ E (W)i),αi]It can be used to describe the situation of the whole system. We take this as the state of the system.
In this embodiment, for the action space of the reinforcement learning algorithm: after observing the environment state, if the current scheduling is judged to be unreasonable, an action must be selected to resolve the poor scheduling. There are two kinds of actions in this embodiment: adjusting the resource allocation and adjusting the task execution batch size, as shown in Table 1.
TABLE 1 Scheduling action table
Maintain: 0, 0, 0
Increase: +1, *1.25, *1.5
Decrease: -1, *0.75, *0.5
In Table 1, Maintain means that the resource allocation or batch size stays the same; Increase means that the resource allocation or batch size is increased, either by a fine adjustment (+1), by 25% (*1.25) or by 50% (*1.5); Decrease means the corresponding reductions (-1, *0.75, *0.5).
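A possible encoding of this action space is sketched below, purely as an illustration; the dictionary keys and the policy layout are assumptions, and only the adjustment values come from Table 1:

```python
# Each action picks a target (resource share or batch size) for one task class
# and one of the adjustment magnitudes listed in Table 1.
ADJUSTMENTS = {
    "maintain": lambda v: v,
    "increase_fine": lambda v: v + 1,      # +1 (fine adjustment, e.g. batch size)
    "increase_25": lambda v: v * 1.25,
    "increase_50": lambda v: v * 1.5,
    "decrease_fine": lambda v: v - 1,
    "decrease_25": lambda v: v * 0.75,
    "decrease_50": lambda v: v * 0.5,
}

def apply_action(policy, task_class, target, adjustment):
    """policy[target][task_class] holds either the resource share alpha_i or the batch size."""
    policy[target][task_class] = ADJUSTMENTS[adjustment](policy[target][task_class])
    return policy
```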
In this embodiment, for the reward value of the reinforcement learning algorithm: after observing a state and taking an action, the reward is used to evaluate the quality of that action. We define the reward value of each class of tasks as R, expressed as:
(Equation image in the original, defining R from the execution-time improvement rate and the resource occupation; not reproduced here.)
where i is the type number of the inference task, T'_i (shown only as an image in the original) represents the execution time of the class-i task under the current allocation strategy, T_i is the standalone execution time of the class-i task, and α'_i denotes the percentage of GPU resources allocated to the class-i task. In other words, the reward function is determined by the improvement rate of the task execution time and by the resource occupation: the greater the improvement in execution time and the fewer the resources occupied, the more effectively the server's resources are used, the higher the throughput, and the higher the reward value.
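Because the reward formula itself appears only as an image, the stand-in below merely follows the stated intent (the reward grows with the execution-time improvement rate and shrinks with the occupied GPU share); it is not the patent's exact expression:

```python
def reward(t_standalone, t_current, gpu_share):
    """Hedged stand-in for the per-class reward R_i: larger execution-time
    improvement and smaller resource occupation give a larger reward."""
    improvement_rate = (t_standalone - t_current) / t_standalone   # based on T_i and T'_i
    return improvement_rate - gpu_share                            # assumed combination with alpha'_i
```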
Therefore, in the third step, the specific steps of reallocating the amount of computing resources and performing task scheduling according to action A are as follows:
In action A, if a class of inference tasks has an average service delay E(W_i) greater than the preset threshold, the amount of computing resources is increased or the execution batch size is decreased; if no average service delay E(W_i) exceeds the preset threshold, the amount of computing resources and the execution batch size are kept unchanged. The preset threshold for the average service delay E(W_i) is the delay of the corresponding inference task executed on its own.
The invention also provides an edge intelligent server neural network inference task scheduling system which is programmed or configured to execute any one of the edge intelligent server neural network inference task scheduling methods.
The present invention also provides a computer readable storage medium having stored therein a computer program programmed or configured to perform any of the edge intelligent server neural network inference task scheduling methods described herein.
The foregoing describes preferred embodiments of the present invention and is not to be construed as limiting the invention in any way. Although the present invention has been described with reference to the preferred embodiments, it is not limited thereto. Any simple modification, equivalent change or adaptation made to the above embodiments in accordance with the technical essence of the present invention, without departing from the content of the technical solution of the present invention, falls within the protection scope of the technical solution of the present invention.

Claims (10)

1. A method for scheduling inference tasks of an edge intelligent server neural network is characterized by comprising the following steps:
s1, virtualizing the GPU into a plurality of virtual GPUs by utilizing a GPU virtualization technology;
s2, distributing preset resources for the virtual GPU, and distributing inference tasks for the virtual GPU corresponding to the type of each inference task by using a queuing service system according to a preset execution batch;
s3, collecting the average service delay and the allocated amount of computing resources of each class of inference tasks, judging whether the allocation strategy needs to be adjusted, and, if so, calculating a new allocation strategy by using a reinforcement learning algorithm;
and S4, allocating resources corresponding to the new allocation strategy for the virtual GPU, and allocating neural network inference tasks for the virtual GPU corresponding to the category of each task by using a queuing service system according to the execution batch corresponding to the new allocation strategy.
2. The method for inference task scheduling by neural network of edge intelligent server according to claim 1, wherein the specific steps of step S3 include:
collecting the average service delay E(W_i) of all inference tasks and the amount of computing resources α_i allocated by the edge intelligent server to all inference tasks as the current state value s_t, and selecting the action A corresponding to the current state value s_t by using the ε-greedy method;
reallocating the amount of computing resources and performing task scheduling according to action A to obtain the next state s_t′, and obtaining a reward R through a reward-and-punishment mechanism according to the remaining resource amount and the execution of the inference tasks;
combining the task execution and resource allocation of the current state into the current state data s_t, combining the task execution and resource allocation of the next state into the next state data s_t′, storing the current state data s_t, the action A, the next state data s_t′ and the reward R as an array D in the prioritized experience replay pool, and calculating the sampling probability of the array D in the prioritized experience replay pool by a TD-error algorithm;
sampling arrays D from the experience replay pool into the D3QN network according to the sampling probability and performing gradient-descent error training of the D3QN network; judging whether the termination condition is met; if so, obtaining the trained GPU resource allocation and task scheduling model and executing the next step; otherwise, taking the next state s_t′ as the current state s_t and returning to the step of selecting the action A corresponding to the current state value s_t by using the ε-greedy method;
and importing the trained GPU resource allocation and task scheduling model into an edge intelligent server.
3. The method of claim 2, wherein the method for collecting the average service delay E(W_i) of all inference tasks and the amount of computing resources α_i allocated by the edge intelligent server to all inference tasks also comprises the following step:
an environment capable of virtualizing a GPU and a simulated task submission model are built on an edge server, and the task submission model is placed in the environment for autonomous scheduling execution by adopting a D3QN network.
4. The method of claim 2, wherein collecting the average service delay E(W_i) of all inference tasks specifically comprises: calculating the average service delay of each class of inference tasks according to the class of the arriving inference tasks and the corresponding average number of tasks, with the expression:
E(W_i) = E(L_i) / λ'_i
where i is the type number of the inference task, λ'_i is the effective arrival rate of class-i inference tasks, and E(L_i) is the average number of class-i inference tasks in the queuing service system.
5. The method for edge intelligent server neural network inference task scheduling as claimed in claim 2, wherein the specific steps of reallocating the amount of computing resources and performing task scheduling according to action A comprise:
in action A, if a class of inference tasks has an average service delay E(W_i) greater than the preset threshold, increasing the amount of computing resources or decreasing the execution batch size; if no average service delay E(W_i) is greater than the preset threshold, keeping the amount of computing resources and the execution batch size unchanged.
6. The method as claimed in claim 5, wherein the preset threshold for the average service delay E(W_i) is the delay of the corresponding inference task executed on its own.
7. The method for edge intelligent server neural network inference task scheduling according to claim 2, wherein the expression of the reward R obtained through the reward-and-punishment mechanism is as follows:
(Equation image in the original, defining R from the execution-time improvement rate and the resource occupation; not reproduced here.)
where i is the type number of the inference task, T'_i (shown only as an image in the original) represents the execution time of the class-i task under the current allocation strategy, T_i is the standalone execution time of the class-i task, and α'_i denotes the percentage of GPU resources allocated to the class-i task.
8. The method of claim 2, wherein the method for collecting the average service delay E(W_i) of all inference tasks and the amount of computing resources α_i allocated by the edge intelligent server to all inference tasks also comprises the step of calculating the resource occupation of each class of inference tasks, specifically:
calculating the server rate of each class of inference tasks according to the type of the neural network, the time required to initialize each class of inference tasks, and the amount of computing resources allocated by the edge intelligent server to each class of inference tasks, with the expression:
(Equation image in the original, defining the server rate μ_i as a function of ε_i, j_i and α_i; not reproduced here.)
where i is the type number of the inference task, ε_i denotes the time required to initialize a class-i inference task, j_i is the base of the logarithmic function, related to the GFLOPS of the neural network model itself, and α_i denotes the amount of computing resources allocated by the edge intelligent server to class-i inference tasks.
9. An edge intelligence server neural network inference task scheduling system, wherein the edge intelligence server neural network inference task scheduling system is programmed or configured to perform the method of edge intelligence server neural network inference task scheduling of any of claims 1-8.
10. A computer readable storage medium having stored therein a computer program programmed or configured to perform the edge intelligence server neural network inference task scheduling method of any of claims 1-8.
CN202210284033.8A 2022-03-22 2022-03-22 Neural network inference task scheduling method for edge intelligent server Pending CN114706678A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210284033.8A CN114706678A (en) 2022-03-22 2022-03-22 Neural network inference task scheduling method for edge intelligent server

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210284033.8A CN114706678A (en) 2022-03-22 2022-03-22 Neural network inference task scheduling method for edge intelligent server

Publications (1)

Publication Number Publication Date
CN114706678A true CN114706678A (en) 2022-07-05

Family

ID=82168807

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210284033.8A Pending CN114706678A (en) 2022-03-22 2022-03-22 Neural network inference task scheduling method for edge intelligent server

Country Status (1)

Country Link
CN (1) CN114706678A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115334165A (en) * 2022-07-11 2022-11-11 西安交通大学 Underwater multi-unmanned platform scheduling method and system based on deep reinforcement learning
CN115334165B (en) * 2022-07-11 2023-10-17 西安交通大学 Underwater multi-unmanned platform scheduling method and system based on deep reinforcement learning

Similar Documents

Publication Publication Date Title
CN111176852B (en) Resource allocation method, device, chip and computer readable storage medium
CN104168318B (en) A kind of Resource service system and its resource allocation methods
CN110321222B (en) Decision tree prediction-based data parallel operation resource allocation method
CN109561148A (en) Distributed task dispatching method in edge calculations network based on directed acyclic graph
CN108984301A (en) Self-adaptive cloud resource allocation method and device
CN109005130B (en) Network resource allocation scheduling method and device
CN112181613B (en) Heterogeneous resource distributed computing platform batch task scheduling method and storage medium
CN109976911B (en) Self-adaptive resource scheduling method
CN112559147B (en) Dynamic matching method, system and equipment based on GPU (graphics processing Unit) occupied resource characteristics
CN111885137A (en) Edge container resource allocation method based on deep reinforcement learning
CN114564312A (en) Cloud edge-side cooperative computing method based on adaptive deep neural network
CN114638167A (en) High-performance cluster resource fair distribution method based on multi-agent reinforcement learning
CN112732444A (en) Distributed machine learning-oriented data partitioning method
CN114706678A (en) Neural network inference task scheduling method for edge intelligent server
CN108170861B (en) Distributed database system collaborative optimization method based on dynamic programming
CN111740925B (en) Deep reinforcement learning-based flow scheduling method
CN114518945A (en) Resource scheduling method, device, equipment and storage medium
CN115586961A (en) AI platform computing resource task scheduling method, device and medium
CN113722112B (en) Service resource load balancing processing method and system
CN111131447A (en) Load balancing method based on intermediate node task allocation
CN112685162A (en) High-efficiency scheduling method, system and medium for heterogeneous computing resources of edge server
CN117436485A (en) Multi-exit point end-edge-cloud cooperative system and method based on trade-off time delay and precision
CN114490094B (en) GPU (graphics processing Unit) video memory allocation method and system based on machine learning
CN116010051A (en) Federal learning multitasking scheduling method and device
CN114896070A (en) GPU resource allocation method for deep learning task

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination