CN114706678A - Neural network inference task scheduling method for edge intelligent server - Google Patents

Neural network inference task scheduling method for edge intelligent server

Info

Publication number
CN114706678A
Authority
CN
China
Prior art keywords
task
inference
tasks
neural network
gpu
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210284033.8A
Other languages
Chinese (zh)
Inventor
王彦波
张德宇
张永敏
吕丰
张尧学
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University filed Critical Central South University
Priority to CN202210284033.8A priority Critical patent/CN114706678A/en
Publication of CN114706678A publication Critical patent/CN114706678A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083Techniques for rebalancing the load in a distributed system
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/04Inference or reasoning models
    • G06N5/041Abduction

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a neural network inference task scheduling method for an edge intelligent server, which comprises the following steps: virtualizing a GPU into a plurality of virtual GPUs by using GPU virtualization technology; allocating preset resources to the virtual GPUs according to a preset allocation strategy, and assigning inference tasks, in preset execution batches and via a queuing service system, to the virtual GPU corresponding to each inference task's category; collecting the average service delay and the amount of computing resources of each class of inference tasks, judging whether the allocation strategy needs to be adjusted, and, if so, calculating a new allocation strategy with a reinforcement learning algorithm; and allocating the corresponding resources to the virtual GPUs according to the new allocation strategy and assigning the neural network inference tasks, in the corresponding execution batches and via the queuing service system, to the virtual GPU corresponding to each task's category. The invention meets the real-time requirements of dynamic scenes with low computational complexity and effectively solves the load balancing problem in larger-scale edge computing scenarios.

Description

Neural network inference task scheduling method for edge intelligent server
Technical Field
The invention relates to the field of edge computing, in particular to a neural network inference task scheduling method for an edge intelligent server.
Background
Edge Computing (EC) is a new cloud computing paradigm in which servers are deployed at the edge of the network to provide computing services for users. The network edge is not the terminal device itself but a network location close to the terminal device, characterized by low communication delay with the terminal device. In complex real-life scenarios, however, a single edge server, and especially an edge intelligent server, has to handle many types of highly concurrent neural network inference tasks. The resulting problem is how to perform appropriate task and resource scheduling so as to increase the speed at which the edge server processes these tasks and to increase throughput.
A queuing system, also called a "queuing service system", is a service system composed of one or more service stations connected in parallel, in series or in a mixed manner; it serves a number of customers or work objects with different requirements and determines the order of service according to given queuing rules. Most production, manufacturing and service systems in reality are queuing systems, and the object being served may be a natural person, a piece of work to be completed or a workpiece to be processed. The batch queuing system is a derivative of this model: tasks are not processed immediately, but accumulate in the system until a certain number is reached and are then processed simultaneously as one batch. As the processing speed increases, the resource overhead gradually increases as well. If the batch size is set too large, tasks queue up in the system because of the limited resources, so that the queuing delay becomes significantly higher than expected; if the batch size is set too small, the advantages of batch processing cannot be exploited effectively.
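To make the accumulate-and-serve rule described above concrete, the following minimal Python sketch (names and structure are illustrative, not taken from the patent) shows a queue that waits until at least a tasks are present and then serves at most b of them as one batch:

```python
from collections import deque

class BatchQueue:
    """Minimal (a, b) bulk-service rule: wait until at least `a` tasks
    are queued, then serve up to `b` of them as one batch."""

    def __init__(self, a: int, b: int):
        assert 1 <= a <= b
        self.a, self.b = a, b
        self.waiting = deque()

    def submit(self, task):
        self.waiting.append(task)

    def next_batch(self):
        # Not enough tasks accumulated yet: the server keeps waiting.
        if len(self.waiting) < self.a:
            return None
        # Serve at most b tasks; the rest stay queued.
        return [self.waiting.popleft() for _ in range(min(self.b, len(self.waiting)))]
```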
Disclosure of Invention
The technical problem to be solved by the invention is as follows: aiming at the above problems in the prior art, the invention provides a neural network inference task scheduling method for an edge intelligent server. In the scenario where the edge intelligent server faces highly concurrent, multi-type neural network inference tasks, the characteristics of the inference tasks are analyzed and the tasks are assigned to the corresponding computing resources; the relevant evaluation indices of the system are obtained by modeling and simulating a dynamic batch queuing service system; and finally a reinforcement learning algorithm (D3QN) determines the system resources and the task scheduling scheme, so as to increase the execution speed of the inference tasks and solve the problem of low throughput for multi-type, high-concurrency inference tasks.
In order to solve the technical problems, the technical scheme provided by the invention is as follows:
a neural network inference task scheduling method for an edge intelligent server comprises the following steps:
s1, virtualizing the GPU into a plurality of virtual GPUs by utilizing a GPU virtualization technology;
s2, distributing preset resources for the virtual GPU, and distributing inference tasks for the virtual GPU corresponding to the type of each inference task by using a queuing service system according to a preset execution batch;
s3, collecting the average service delay and the allocated amount of computing resources of each class of inference tasks, judging whether the allocation strategy needs to be adjusted, and, if so, calculating a new allocation strategy by using a reinforcement learning algorithm;
and S4, distributing resources corresponding to the new distribution strategy for the virtual GPU, and distributing neural network inference tasks for the virtual GPU corresponding to the category of each task by using the queuing service system according to the execution batch corresponding to the new distribution strategy.
Further, the specific step of step S3 includes:
collecting the average service delay E(W_i) of all inference tasks and the amount of computing resources α_i allocated by the edge intelligent server to all inference tasks as the current state value s_t, and selecting the action A corresponding to the current state value s_t by using the ε-greedy method;
reallocating the amount of computing resources and performing task scheduling according to action A to obtain the next state s_t′, and obtaining a reward R through a reward-and-punishment mechanism according to the remaining resource amount and the execution of the inference tasks;
combining the task execution and resource allocation of the current state into the current state data s_t, combining the task execution and resource allocation of the next state into the next state data s_t′, storing the current state data s_t, the action A, the next state data s_t′ and the reward R as an array D in the prioritized experience replay pool, and calculating the sampling probability of the array D in the prioritized experience replay pool by a TD-error algorithm;
sampling arrays D from the experience replay pool into the D3QN network according to the sampling probability and performing gradient-descent error training of the D3QN network; judging whether the termination condition is met; if so, obtaining the trained GPU resource allocation and task scheduling model and executing the next step; otherwise, taking the next state s_t′ as the current state s_t and returning to the step of selecting the action A corresponding to the current state value s_t by using the ε-greedy method;
and importing the trained GPU resource allocation and task scheduling model into the edge intelligent server.
Further, the method for collecting the average service delay E(W_i) of all inference tasks and the amount of computing resources α_i allocated by the edge intelligent server to all inference tasks also comprises the following step:
an environment capable of virtualizing a GPU and a simulated task submission model are built on an edge server, and the task submission model is placed in the environment for autonomous scheduling execution by adopting a D3QN network.
Further, collecting the average service delay E(W_i) of all inference tasks specifically comprises: calculating the average service delay of each class of inference tasks according to the class of the arriving inference tasks and the corresponding average number of tasks, with the expression:
E(W_i) = E(L_i) / λ'_i
where i is the type number of the inference task, λ'_i is the effective arrival rate of class-i inference tasks, and E(L_i) is the average number of class-i inference tasks in the queuing service system.
Further, the specific steps of reallocating the amount of computing resources and performing task scheduling according to action A include:
in action A, if a class of inference tasks has an average service delay E(W_i) greater than the preset threshold, increasing the amount of computing resources or decreasing the execution batch size; if no average service delay E(W_i) is greater than the preset threshold, keeping the amount of computing resources and the execution batch size unchanged.
Further, the preset threshold for the average service delay E(W_i) is the delay of the corresponding inference task executed on its own.
Further, the expression of the reward R obtained through the reward-and-punishment mechanism is as follows:
(Equation image in the original, defining R from the execution-time improvement rate and the resource occupation; not reproduced here.)
where i is the type number of the inference task, T'_i (shown only as an image in the original) represents the execution time of the class-i task under the current allocation strategy, T_i is the standalone execution time of the class-i task, and α'_i denotes the percentage of GPU resources allocated to the class-i task.
Further, the method for collecting the average service delay E(W_i) of all inference tasks and the amount of computing resources α_i allocated by the edge intelligent server to all inference tasks also comprises the step of calculating the resource occupation of each class of inference tasks, specifically:
calculating the server rate of each class of inference tasks according to the type of the neural network, the time required to initialize each class of inference tasks, and the amount of computing resources allocated by the edge intelligent server to each class of inference tasks, with the expression:
(Equation image in the original, defining the server rate μ_i as a function of ε_i, j_i and α_i; not reproduced here.)
where i is the type number of the inference task, ε_i denotes the time required to initialize a class-i inference task, j_i is the base of the logarithmic function, related to the GFLOPS of the neural network model itself, and α_i denotes the amount of computing resources allocated by the edge intelligent server to class-i inference tasks.
The invention also provides an edge intelligent server neural network inference task scheduling system which is programmed or configured to execute any one of the edge intelligent server neural network inference task scheduling methods.
The present invention also provides a computer readable storage medium having stored therein a computer program programmed or configured to perform any of the edge intelligent server neural network inference task scheduling methods described herein.
Compared with the prior art, the invention has the advantages that:
the invention adjusts the server resource and the task scheduling strategy through mathematical modeling quantitative analysis and reinforcement learning algorithm to solve the problem of low throughput of multi-type and high-concurrency inferred tasks. Firstly, analyzing the task quantity and calculating the average service delay of each type of task in a queuing service system based on a method of experimental analysis and mathematical modeling, judging whether the strategy is unreasonable at the moment according to all the average service delays, starting a strategy adjusting algorithm realized based on reinforcement learning to adjust the strategy if the strategy is unreasonable, and repeating the process until the strategy is reasonable. The method has low computational complexity, meets the real-time requirement of a dynamic scene, and effectively solves the load balancing problem in a large-scale edge computing scene.
Drawings
FIG. 1 is a flow chart of an embodiment of the present invention.
Fig. 2 is a diagram illustrating task scheduling performed by the queuing service system according to an embodiment of the present invention.
FIG. 3 is a schematic diagram showing the relationship between the amount of computing resources allocated to the residual network inference task by the edge intelligent server and the inference time.
FIG. 4 is a schematic diagram showing the relationship between the amount of computing resources allocated to the convolutional neural network inference task by the edge intelligent server and the inference time.
FIG. 5 is a diagram illustrating the relationship between the amount of computing resources allocated by an edge intelligence server to a dense convolutional network inference task and inference time.
Detailed Description
The invention is further described below with reference to the drawings and specific preferred embodiments of the description, without thereby limiting the scope of protection of the invention.
According to our research, neural network models at the present stage show a clear trend: most models used for the same kind of task, such as computer vision tasks, are structurally similar, and some tasks can even be served by the same model. Such tasks can therefore be grouped into a batch and processed together, thereby improving the throughput of neural network inference. Increasing the batch size of inference tasks can significantly increase the processing speed of neural network inference, but as the processing speed increases, the resource overhead also grows. Once the batch size is fixed, a problem arises: if the batch size is set too large, some tasks will queue in the system because of memory constraints, making the queuing delay significantly higher than expected; if it is set too small, the advantages of batch processing cannot be exploited effectively. From this preliminary analysis it is clear that the choice of inference batch size is limited by the configuration of the edge intelligent server, that the batch size significantly affects the speed at which the queuing service system processes inference tasks, and that the useful batch size is therefore bounded. In addition, task arrivals are random in real scenarios, and different inference tasks affect each other through resource preemption and similar issues, which complicates quantitative analysis. To solve the problem of resource contention among different kinds of tasks, GPU virtualization technology is adopted: it virtualizes one physical GPU into several mutually independent GPUs, reducing the impact of resource contention on the efficiency with which the server executes inference tasks.
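The patent does not name a particular virtualization mechanism, so the short sketch below only illustrates the bookkeeping implied by this paragraph: one physical GPU whose compute budget is normalized to 1 is split into mutually independent fractional shares, one per task class. All names are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class VirtualGPU:
    task_class: int     # class of inference task bound to this vGPU
    share: float        # fraction alpha_i of the physical GPU's compute, 0 < share <= 1

@dataclass
class PhysicalGPU:
    vgpus: list = field(default_factory=list)

    def allocate(self, task_class: int, share: float) -> VirtualGPU:
        used = sum(v.share for v in self.vgpus)
        if used + share > 1.0 + 1e-9:
            raise ValueError("total allocation would exceed the physical GPU (normalized to 1)")
        vgpu = VirtualGPU(task_class, share)
        self.vgpus.append(vgpu)
        return vgpu
```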
On this basis, on the one hand, the computation amount of different kinds of tasks is calculated and a resource-speed model of the tasks is built to guide the system's resource allocation. On the other hand, the execution of the various tasks in the system is modeled according to dynamic batch queuing theory, yielding an optimization problem that ultimately guides task scheduling.
Based on the above analysis and technical idea, this embodiment provides a neural network inference task scheduling method for an edge intelligent server which, as shown in fig. 1, includes the following steps:
s1, virtualizing the GPU into a plurality of virtual GPUs by utilizing a GPU virtualization technology;
s2, distributing preset resources for the virtual GPU according to a preset distribution strategy, and distributing inference tasks for the virtual GPU corresponding to the type of each inference task by using a queuing service system according to a preset execution batch; as shown in fig. 2, for different categories of task one, task two and task three, the queuing service system sequentially allocates each batch of task one, task two and task three to the corresponding virtual GPU;
s3, collecting the average service delay and the allocated amount of computing resources of each class of inference tasks, judging whether the allocation strategy needs to be adjusted, and, if so, calculating a new allocation strategy by using a reinforcement learning algorithm;
and S4, distributing corresponding resources for the virtual GPU according to the new distribution strategy, and distributing the neural network inference task for the virtual GPU corresponding to the category of each task by using the queuing service system according to the corresponding execution batch.
Through the above steps, the neural network inference task scheduling method for the edge intelligent server in this embodiment first uses GPU virtualization technology to divide the GPU into several shareable partitions, each of which can be allocated GPU resources independently; it then calculates the average service delay of the inference tasks and finally uses a reinforcement learning algorithm, driven by the average service delay, to adjust the allocation strategy covering both resource allocation and task scheduling. In this way the method meets the real-time requirements of dynamic scenes with low computational complexity and effectively solves the load balancing problem in larger-scale edge computing scenarios.
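The following sketch ties steps S1 to S4 together as a control loop. Every callable (allocate, dispatch, collect_metrics, needs_adjustment, new_policy) is a hypothetical stand-in for the components described above, not an API defined by the patent:

```python
def scheduling_loop(allocate, dispatch, collect_metrics, needs_adjustment, new_policy, policy, rounds=100):
    """Illustrative control loop for steps S1-S4 (hypothetical callables):
    allocate(policy) partitions the GPU into vGPUs, dispatch runs one round of
    batched inference, collect_metrics returns per-class (E(W_i), alpha_i),
    needs_adjustment applies the delay-threshold test, and new_policy queries
    the trained D3QN agent."""
    vgpus = allocate(policy)                       # S1/S2: vGPUs with preset resources
    for _ in range(rounds):
        dispatch(vgpus, policy)                    # S2/S4: queued batch dispatch per task class
        metrics = collect_metrics()                # S3: E(W_i) and alpha_i for every class
        if needs_adjustment(metrics):              # delay above the standalone-execution threshold?
            policy = new_policy(metrics)           # S3: reinforcement learning proposes a new strategy
            vgpus = allocate(policy)               # S4: reallocate resources and batch sizes
    return policy
```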
Step S3 of this embodiment is used to determine whether the queuing service system has an unreasonable scheduling scheme. If so, a policy adjustment algorithm implemented with reinforcement learning is started to adjust the resource allocation and task scheduling; if the adjusted scheme is still judged unreasonable for the current scenario, adjustment continues until the policy is reasonable. Even after the policy becomes reasonable, it may become unreasonable again after some time because tasks keep arriving, so the above actions must be repeated until the policy is reasonable again. The system state at each point is recorded for subsequent offline adjustment. The specific steps include:
The first step: an environment capable of virtualizing the GPU and a simulated task submission model are built on the edge server, and the task submission model is placed in the environment for autonomous scheduling and execution with a D3QN network.
The second step: the average service delay E(W_i) of all inference tasks and the amount of computing resources α_i allocated by the edge intelligent server to all inference tasks are collected as the current state value s_t, and the action A corresponding to the current state value s_t is selected by the ε-greedy method.
The third step: the amount of computing resources is reallocated and task scheduling is performed according to action A to obtain the next state s_t′, and a reward R is obtained through the reward-and-punishment mechanism according to the remaining resource amount and the execution of the inference tasks.
The fourth step: the task execution and resource allocation of the current state are combined into the current state data s_t, and those of the next state into the next state data s_t′; the current state data s_t, the action A, the next state data s_t′ and the reward R are stored as an array D in the prioritized experience replay pool, and the sampling probability of the array D in the pool is calculated by a TD-error algorithm.
The fifth step: arrays D are sampled from the experience replay pool into the D3QN network according to their sampling probabilities, and gradient-descent error training of the D3QN network is performed; whether the termination condition is met is then judged; if so, the trained GPU resource allocation and task scheduling model is obtained and the next step is executed; otherwise, the next state s_t′ is taken as the current state s_t and the procedure returns to the step of selecting the action A corresponding to the current state value s_t with the ε-greedy method.
The sixth step: the trained GPU resource allocation and task scheduling model is imported into the edge intelligent server, resource allocation and task scheduling are performed in the real environment, and the final allocation and scheduling scheme is obtained.
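A compact sketch of the second to fifth steps is given below. The environment, the two Q-networks and the prioritized replay buffer are hypothetical stand-ins (not a specific library API); the loop only mirrors the described flow: ε-greedy action selection, applying the action, storing the transition with a TD-error priority, priority-based sampling, and a gradient step:

```python
import random

def td_error(online_net, target_net, s, a, r, s_next, gamma):
    """Double-DQN style TD error: the online network chooses the next action,
    the target network evaluates it."""
    q_next = online_net.q(s_next)
    a_star = max(range(len(q_next)), key=q_next.__getitem__)
    return r + gamma * target_net.q(s_next)[a_star] - online_net.q(s)[a]

def train_d3qn(env, online_net, target_net, replay, episodes, eps=0.1, gamma=0.99):
    """Sketch of the loop in the second to fifth steps; env, the networks and
    replay (a prioritized buffer with add/sample/update) are hypothetical."""
    for _ in range(episodes):
        s, done = env.reset(), False          # s_t = [E(W_i), alpha_i] over all task classes
        while not done:
            # epsilon-greedy selection over the scheduling actions of Table 1
            q_s = online_net.q(s)
            a = random.randrange(len(q_s)) if random.random() < eps \
                else max(range(len(q_s)), key=q_s.__getitem__)
            s_next, r, done = env.step(a)     # reallocate resources / resize batches, observe reward R
            # store the transition (array D) with a TD-error based priority
            replay.add((s, a, r, s_next), abs(td_error(online_net, target_net, s, a, r, s_next, gamma)))
            batch, idx = replay.sample()      # sample according to the priority probabilities
            online_net.gradient_step(batch, target_net, gamma)   # dueling/double DQN update
            replay.update(idx, [abs(td_error(online_net, target_net, *t, gamma)) for t in batch])
            s = s_next
```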
The amount of computing resources α_i allocated to all inference tasks is collected in the second step because the speed at which the server processes a neural network inference task, i.e. the service rate, is a function of the computing resources the server allocates to the task and of the computational load of the task itself. To quantitatively analyze the relationship among server resource allocation, task computation amount and service rate, these quantities are modeled. Our analysis shows that when the computing resources allocated to an inference task are increased, the speed at which the server processes that task first increases and then levels off, and this rule holds for different inference tasks. Therefore, normalizing the total computing resources of the server to 1 and letting α_i denote the amount of computing resources allocated to class-i inference tasks, the relationship between a task's resource occupation and its service time can be expressed as:
(Equation image in the original, defining the server rate μ_i as a function of ε_i, j_i and α_i; not reproduced here.)
where μ_i denotes the server rate for class-i inference tasks, ε_i denotes the time required to initialize a class-i inference task, j_i is the base of the logarithmic function, related to the GFLOPS of the neural network model itself, and α_i denotes the amount of computing resources the server allocates to class-i inference tasks. We tested several common models for the relationship between inference time and resource allocation, as shown in figs. 3-5.
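Since the exact μ_i(α_i) expression is reproduced only as an image, the sketch below assumes a saturating form service_time(α) ≈ ε + k / log_j(1 + α), consistent with the described "increase then level off" behavior, and fits (ε, k) to measured (α, time) pairs such as those behind figs. 3-5. The functional form itself is an assumption, not the patent's formula:

```python
import math

def fit_resource_speed(samples, j):
    """Least-squares fit of the assumed form service_time(alpha) ~= epsilon + k / log_j(1 + alpha)
    to measured (alpha, time) pairs, using the transformed variable x = 1 / log_j(1 + alpha)."""
    xs = [1.0 / (math.log1p(alpha) / math.log(j)) for alpha, _ in samples]
    ys = [t for _, t in samples]
    n = len(samples)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    k = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum((x - mean_x) ** 2 for x in xs)
    epsilon = mean_y - k * mean_x
    # Under this assumption, the server rate is mu(alpha) = 1 / (epsilon + k / log_j(1 + alpha)).
    return epsilon, k
```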
For the average service delay E(W_i) of all inference tasks collected in the second step, the task arrival process is assumed to be a Poisson process with arrival rate λ. As tasks arrive at the edge server, the task flow is stationary, memoryless and orderly: the number of arrivals in an interval depends only on the length of the interval, arrivals in non-overlapping intervals are independent, and at most one task arrives in a sufficiently small interval. The assumption therefore holds. Once server resource scheduling is completed, the service time for a given class of tasks tends to be fixed, i.e. it follows a fixed-length (deterministic) distribution. Let a and b denote the lower and upper limits of the batch size, respectively. When fewer than a tasks are waiting, the server waits until at least a tasks have arrived and then serves them as one batch; when the number of waiting tasks is between a and b, all of them are served as one batch; when more than b tasks are waiting, only b tasks are served at a time and the remaining tasks continue to queue. Let X(n, r) and Y(n, r) denote, respectively, the number of tasks remaining in the queuing service system when the nth service (batch) is completed and the number of tasks arriving while the nth batch is served, where r is the number of tasks executed in the nth batch. Then X(n+1, r) can be expressed as follows:
(Equation image in the original, giving the recursion for X(n+1, r) in terms of X(n, r) and Y(n+1, r); not reproduced here.)
It can be seen that X(n+1, r) depends only on X(n, r) and is independent of the value of n, so {X(n, r)}, n ∈ N_0, r ∈ M_{a,b} = {a, a+1, …, b}, is a homogeneous Markov chain. The queuing service system can therefore be described by an M/D(a,b)/1/N queue.
According to the relevant theory, the average service delay of a given class of tasks in the queuing service system is calculated with the expression:
E(W_i) = E(L_i) / λ'_i
where i is the type number of the inference task, λ'_i is the effective arrival rate of class-i inference tasks, and E(L_i) is the average number of class-i inference tasks in the queuing service system.
The service delay of the whole queuing service system is therefore:
E(W) = [E(W_1), E(W_2), …, E(W_N)], i ∈ {1, 2, 3, …, N} (4)
where i indexes the classes of inference tasks.
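As a small worked example of the Little's-law relation above, with hypothetical measured values:

```python
def average_service_delay(effective_arrival_rate, avg_tasks_in_system):
    """Little's law as used above: E(W_i) = E(L_i) / lambda'_i."""
    return avg_tasks_in_system / effective_arrival_rate

# E(W) vector over three task classes (illustrative values only)
lambdas = [12.0, 8.0, 5.0]          # effective arrival rates lambda'_i (tasks per second)
avg_in_system = [3.6, 1.6, 2.5]     # E(L_i) measured from the queuing service system
EW = [average_service_delay(l, L) for l, L in zip(lambdas, avg_in_system)]
# EW == [0.3, 0.2, 0.5] seconds of average service delay per class
```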
After the server calculates the service delay of the whole system in the current state, a large E(W_i) for some class of tasks indicates that the resources allocated to that class are insufficient or that the scheduling scheme is unreasonable, and the resource allocation and task scheduling schemes need to be adjusted; otherwise, the tasks already perform well under the system's current strategy. When the average service delay of some class of tasks in E(W) is significantly larger than the delay of that task executed on its own, the system's allocation and scheduling strategy is unreasonable and the reinforcement learning algorithm (D3QN) is started to adjust the strategy. The details of the reinforcement learning algorithm are as follows:
in this embodiment, for the state Space (Station Space) of the reinforcement learning algorithm: e (W)i) Can be used for describing the assignment of a certain type of task to the resource alpha in the systemiWhen the execution is completed, the set s is [ E (W)i),αi]It can be used to describe the situation of the whole system. We take this as the state of the system.
In this embodiment, for the action space of the reinforcement learning algorithm: after observing the environment state, if the current scheduling is judged to be unreasonable, an action must be selected to resolve the poor scheduling. There are two kinds of actions in this embodiment: adjusting the resource allocation and adjusting the task execution batch size, as shown in Table 1.
TABLE 1 Scheduling action table
Maintain: 0, 0, 0
Increase: +1, *1.25, *1.5
Decrease: -1, *0.75, *0.5
In Table 1, Maintain means that the resource allocation or batch size stays the same; Increase means that the resource allocation or batch size is increased, either by a fine adjustment (+1), by 25% (*1.25) or by 50% (*1.5); Decrease means the corresponding reductions (-1, *0.75, *0.5).
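A possible encoding of this action space is sketched below, purely as an illustration; the dictionary keys and the policy layout are assumptions, and only the adjustment values come from Table 1:

```python
# Each action picks a target (resource share or batch size) for one task class
# and one of the adjustment magnitudes listed in Table 1.
ADJUSTMENTS = {
    "maintain": lambda v: v,
    "increase_fine": lambda v: v + 1,      # +1 (fine adjustment, e.g. batch size)
    "increase_25": lambda v: v * 1.25,
    "increase_50": lambda v: v * 1.5,
    "decrease_fine": lambda v: v - 1,
    "decrease_25": lambda v: v * 0.75,
    "decrease_50": lambda v: v * 0.5,
}

def apply_action(policy, task_class, target, adjustment):
    """policy[target][task_class] holds either the resource share alpha_i or the batch size."""
    policy[target][task_class] = ADJUSTMENTS[adjustment](policy[target][task_class])
    return policy
```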
In this embodiment, for the reward value of the reinforcement learning algorithm: after observing a state and taking an action, the reward is used to evaluate the quality of that action. We define the reward value of each class of tasks as R, expressed as:
(Equation image in the original, defining R from the execution-time improvement rate and the resource occupation; not reproduced here.)
where i is the type number of the inference task, T'_i (shown only as an image in the original) represents the execution time of the class-i task under the current allocation strategy, T_i is the standalone execution time of the class-i task, and α'_i denotes the percentage of GPU resources allocated to the class-i task. In other words, the reward function is determined by the improvement rate of the task execution time and by the resource occupation: the greater the improvement in execution time and the fewer the resources occupied, the more effectively the server's resources are used, the higher the throughput, and the higher the reward value.
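Because the reward formula itself appears only as an image, the stand-in below merely follows the stated intent (the reward grows with the execution-time improvement rate and shrinks with the occupied GPU share); it is not the patent's exact expression:

```python
def reward(t_standalone, t_current, gpu_share):
    """Hedged stand-in for the per-class reward R_i: larger execution-time
    improvement and smaller resource occupation give a larger reward."""
    improvement_rate = (t_standalone - t_current) / t_standalone   # based on T_i and T'_i
    return improvement_rate - gpu_share                            # assumed combination with alpha'_i
```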
Therefore, in the third step, the specific steps of reallocating the amount of computing resources and performing task scheduling according to action A are as follows:
In action A, if a class of inference tasks has an average service delay E(W_i) greater than the preset threshold, the amount of computing resources is increased or the execution batch size is decreased; if no average service delay E(W_i) exceeds the preset threshold, the amount of computing resources and the execution batch size are kept unchanged. The preset threshold for the average service delay E(W_i) is the delay of the corresponding inference task executed on its own.
The invention also provides an edge intelligent server neural network inference task scheduling system which is programmed or configured to execute any one of the edge intelligent server neural network inference task scheduling methods.
The present invention also provides a computer readable storage medium having stored therein a computer program programmed or configured to perform any of the edge intelligent server neural network inference task scheduling methods described herein.
The foregoing describes preferred embodiments of the present invention and is not to be construed as limiting the invention in any way. Although the present invention has been described with reference to the preferred embodiments, it is not limited thereto. Any simple modification, equivalent change or adaptation made to the above embodiments in accordance with the technical essence of the present invention, without departing from the content of the technical solution of the present invention, falls within the protection scope of the technical solution of the present invention.

Claims (10)

1. A method for scheduling inference tasks of an edge intelligent server neural network is characterized by comprising the following steps:
s1, virtualizing the GPU into a plurality of virtual GPUs by utilizing a GPU virtualization technology;
s2, distributing preset resources for the virtual GPU, and distributing inference tasks for the virtual GPU corresponding to the type of each inference task by using a queuing service system according to a preset execution batch;
s3, collecting the average service delay and the allocated amount of computing resources of each class of inference tasks, judging whether the allocation strategy needs to be adjusted, and, if so, calculating a new allocation strategy by using a reinforcement learning algorithm;
and S4, allocating resources corresponding to the new allocation strategy for the virtual GPU, and allocating neural network inference tasks for the virtual GPU corresponding to the category of each task by using a queuing service system according to the execution batch corresponding to the new allocation strategy.
2. The method for inference task scheduling by neural network of edge intelligent server according to claim 1, wherein the specific steps of step S3 include:
collecting the average service delay E(W_i) of all inference tasks and the amount of computing resources α_i allocated by the edge intelligent server to all inference tasks as the current state value s_t, and selecting the action A corresponding to the current state value s_t by using the ε-greedy method;
reallocating the amount of computing resources and performing task scheduling according to action A to obtain the next state s_t′, and obtaining a reward R through a reward-and-punishment mechanism according to the remaining resource amount and the execution of the inference tasks;
combining the task execution and resource allocation of the current state into the current state data s_t, combining the task execution and resource allocation of the next state into the next state data s_t′, storing the current state data s_t, the action A, the next state data s_t′ and the reward R as an array D in the prioritized experience replay pool, and calculating the sampling probability of the array D in the prioritized experience replay pool by a TD-error algorithm;
sampling arrays D from the experience replay pool into the D3QN network according to the sampling probability and performing gradient-descent error training of the D3QN network; judging whether the termination condition is met; if so, obtaining the trained GPU resource allocation and task scheduling model and executing the next step; otherwise, taking the next state s_t′ as the current state s_t and returning to the step of selecting the action A corresponding to the current state value s_t by using the ε-greedy method;
and importing the trained GPU resource allocation and task scheduling model into an edge intelligent server.
3. The method of claim 2, wherein the method for collecting the average service delay E(W_i) of all inference tasks and the amount of computing resources α_i allocated by the edge intelligent server to all inference tasks also comprises the following step:
an environment capable of virtualizing a GPU and a simulated task submission model are built on an edge server, and the task submission model is placed in the environment for autonomous scheduling execution by adopting a D3QN network.
4. The method of claim 2, wherein collecting the average service delay E(W_i) of all inference tasks specifically comprises: calculating the average service delay of each class of inference tasks according to the class of the arriving inference tasks and the corresponding average number of tasks, with the expression:
E(W_i) = E(L_i) / λ'_i
where i is the type number of the inference task, λ'_i is the effective arrival rate of class-i inference tasks, and E(L_i) is the average number of class-i inference tasks in the queuing service system.
5. The method for edge intelligent server neural network inference task scheduling as claimed in claim 2, wherein the specific steps of reallocating the amount of computing resources and performing task scheduling according to action A comprise:
in action A, if a class of inference tasks has an average service delay E(W_i) greater than the preset threshold, increasing the amount of computing resources or decreasing the execution batch size; if no average service delay E(W_i) is greater than the preset threshold, keeping the amount of computing resources and the execution batch size unchanged.
6. The method as claimed in claim 5, wherein the preset threshold for the average service delay E(W_i) is the delay of the corresponding inference task executed on its own.
7. The method for edge intelligent server neural network inference task scheduling according to claim 2, wherein the expression of the reward R obtained through the reward-and-punishment mechanism is as follows:
(Equation image in the original, defining R from the execution-time improvement rate and the resource occupation; not reproduced here.)
where i is the type number of the inference task, T'_i (shown only as an image in the original) represents the execution time of the class-i task under the current allocation strategy, T_i is the standalone execution time of the class-i task, and α'_i denotes the percentage of GPU resources allocated to the class-i task.
8. The method of claim 2, wherein the method for collecting the average service delay E(W_i) of all inference tasks and the amount of computing resources α_i allocated by the edge intelligent server to all inference tasks also comprises the step of calculating the resource occupation of each class of inference tasks, specifically:
calculating the server rate of each class of inference tasks according to the type of the neural network, the time required to initialize each class of inference tasks, and the amount of computing resources allocated by the edge intelligent server to each class of inference tasks, with the expression:
(Equation image in the original, defining the server rate μ_i as a function of ε_i, j_i and α_i; not reproduced here.)
where i is the type number of the inference task, ε_i denotes the time required to initialize a class-i inference task, j_i is the base of the logarithmic function, related to the GFLOPS of the neural network model itself, and α_i denotes the amount of computing resources allocated by the edge intelligent server to class-i inference tasks.
9. An edge intelligence server neural network inference task scheduling system, wherein the edge intelligence server neural network inference task scheduling system is programmed or configured to perform the method of edge intelligence server neural network inference task scheduling of any of claims 1-8.
10. A computer readable storage medium having stored therein a computer program programmed or configured to perform the edge intelligence server neural network inference task scheduling method of any of claims 1-8.
CN202210284033.8A 2022-03-22 2022-03-22 Neural network inference task scheduling method for edge intelligent server Pending CN114706678A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210284033.8A CN114706678A (en) 2022-03-22 2022-03-22 Neural network inference task scheduling method for edge intelligent server

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210284033.8A CN114706678A (en) 2022-03-22 2022-03-22 Neural network inference task scheduling method for edge intelligent server

Publications (1)

Publication Number Publication Date
CN114706678A true CN114706678A (en) 2022-07-05

Family

ID=82168807

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210284033.8A Pending CN114706678A (en) 2022-03-22 2022-03-22 Neural network inference task scheduling method for edge intelligent server

Country Status (1)

Country Link
CN (1) CN114706678A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115334165A (en) * 2022-07-11 2022-11-11 西安交通大学 Underwater multi-unmanned platform scheduling method and system based on deep reinforcement learning
CN115334165B (en) * 2022-07-11 2023-10-17 西安交通大学 Underwater multi-unmanned platform scheduling method and system based on deep reinforcement learning

Similar Documents

Publication Publication Date Title
CN111176852B (en) Resource allocation method, device, chip and computer readable storage medium
CN104168318B (en) A kind of Resource service system and its resource allocation methods
CN110321222B (en) Decision tree prediction-based data parallel operation resource allocation method
CN109561148A (en) Distributed task dispatching method in edge calculations network based on directed acyclic graph
CN108984301A (en) Self-adaptive cloud resource allocation method and device
CN109005130B (en) Network resource allocation scheduling method and device
CN112181613B (en) Heterogeneous resource distributed computing platform batch task scheduling method and storage medium
CN109976911B (en) Self-adaptive resource scheduling method
CN112559147B (en) Dynamic matching method, system and equipment based on GPU (graphics processing Unit) occupied resource characteristics
CN111885137A (en) Edge container resource allocation method based on deep reinforcement learning
CN114564312A (en) Cloud edge-side cooperative computing method based on adaptive deep neural network
CN114638167A (en) High-performance cluster resource fair distribution method based on multi-agent reinforcement learning
CN112732444A (en) Distributed machine learning-oriented data partitioning method
CN114706678A (en) Neural network inference task scheduling method for edge intelligent server
CN108170861B (en) Distributed database system collaborative optimization method based on dynamic programming
CN111740925B (en) Deep reinforcement learning-based flow scheduling method
CN114518945A (en) Resource scheduling method, device, equipment and storage medium
CN115586961A (en) AI platform computing resource task scheduling method, device and medium
CN113722112B (en) Service resource load balancing processing method and system
CN111131447A (en) Load balancing method based on intermediate node task allocation
CN112685162A (en) High-efficiency scheduling method, system and medium for heterogeneous computing resources of edge server
CN117436485A (en) Multi-exit point end-edge-cloud cooperative system and method based on trade-off time delay and precision
CN114490094B (en) GPU (graphics processing Unit) video memory allocation method and system based on machine learning
CN116010051A (en) Federal learning multitasking scheduling method and device
CN114896070A (en) GPU resource allocation method for deep learning task

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination