CN115686779B - DQN-based self-adaptive edge computing task scheduling method - Google Patents

DQN-based self-adaptive edge computing task scheduling method

Info

Publication number
CN115686779B
CN115686779B
Authority
CN
China
Prior art keywords: value, task, network, dqn, neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211261147.7A
Other languages
Chinese (zh)
Other versions
CN115686779A (en)
Inventor
巨涛
王志强
刘帅
火久元
张学军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lanzhou Jiaotong University
Original Assignee
Lanzhou Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lanzhou Jiaotong University filed Critical Lanzhou Jiaotong University
Priority to CN202211261147.7A priority Critical patent/CN115686779B/en
Publication of CN115686779A publication Critical patent/CN115686779A/en
Application granted granted Critical
Publication of CN115686779B publication Critical patent/CN115686779B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a DQN-based self-adaptive task scheduling method for an edge computing system. An agent acquires task configuration information and computing node configuration information and uses them as the environment state information input to a neural network; the final output of the neural network is calculated from the loss value of the previous training step, computing nodes are selected for tasks according to this final output and the loss values of the last several training steps, and learning experience is stored based on the loss values. Optimal matching between tasks and computing nodes is thereby realized, and the invention provides an effective solution for fully utilizing edge computing resources, improving the real-time performance of task processing, and reducing system overhead.

Description

DQN-based self-adaptive edge computing task scheduling method
Technical Field
The invention belongs to the field of computer architecture, relates to a self-adaptive task scheduling method, and particularly relates to a DQN-based self-adaptive task scheduling method for edge computing systems.
Background
How to fully utilize the computing resources in an edge computing system, improve the real-time performance of task processing, and reduce system overhead is a key problem faced by edge computing systems. With the development of machine learning, more and more deep reinforcement learning algorithms (such as DQN, DDPG, and Actor-Critic) are used to solve task scheduling problems in edge computing. However, task scheduling is inherently a continuous problem, which requires either discretizing the action space and state space of the algorithm or selecting an algorithm designed for continuous problems. Scheduling a task that could be divided into finer-grained subtasks as a single whole is detrimental to the efficient use of computing resources. If an algorithm such as DQN, which is suited to discrete problems, is applied, a more effective discretization must be performed, and the convergence speed of the neural network must be taken into account while reducing the impact of the "overestimation" problem caused by the algorithm itself. If the degree of exploration of the action space cannot be effectively adjusted during training, the convergence and stability of the neural network suffer. When applying deep reinforcement learning to task scheduling in edge computing, the limited computing resources, the internal characteristics of the tasks, and the convergence speed and stability of the algorithm must all be considered: an algorithm with relatively small computational cost should be selected, tasks should be reasonably divided, the efficiency of exploring the solution space should be improved, and fluctuation after convergence should be reduced. Realizing the optimal matching of tasks and computing nodes in this way improves the utilization of system computing resources, improves the real-time performance of task processing, and reduces system overhead.
Most existing research treats a task as a single whole to be scheduled, which cannot make effective use of computing resources, and the probability value in the computing-node selection strategy is fixed, which hinders effective exploration of the action space and therefore makes convergence slow and unstable. Work that uses DQN and other algorithms with small computational cost, which are suited to discrete-space problems, requires discretization, otherwise accuracy is reduced. In addition, the sampling strategy for taking part of the learning experience out of the experience pool for replay is mostly random sampling, which cannot effectively improve sample efficiency. Since algorithms such as DQN always select the computing node with the largest adaptation value for the task, an "overestimation" problem arises, i.e., the estimated value is larger than the actual value. Algorithms such as DDPG are suitable for continuous task scheduling, but their computational cost is too large for edge computing systems whose computing resources are relatively limited.
Disclosure of Invention
The invention aims to overcome the problems in the prior art and provide a DQN-based self-adaptive edge computing task scheduling method, which is based on task configuration information and computing node configuration information to realize optimal matching of tasks and computing nodes so as to fully utilize computing resources, improve the real-time performance of task processing and reduce system overhead.
In order to achieve the purpose, the invention is realized by adopting the following technical scheme:
an adaptive edge computing task scheduling method based on DQN comprises the following steps:
1) When the number of training steps of the neural network is a multiple of the specified parameter-copying interval, copying the evaluation network parameters in the DQN to the target network; when the number of training steps is a multiple of the specified experience-replay interval, replaying the learning experience in the experience pool and emptying the experience pool;
2) Acquiring computing node configuration information, terminal device configuration information, and task configuration information as environment state information, and normalizing it as the input of the deep reinforcement learning neural network; the environment state information consists of the data size of the computing task, the number of required computing resources, the number of required storage resources, and the numbers of available computing resources and available storage resources of all computing nodes, namely state_i = (ds, tc, ts, nc_i, ns_i),
wherein state_i represents the state information of the computing task and the i-th computing node; ds, tc, and ts are respectively the data size, the number of required computing resources, and the number of required storage resources of the computing task; nc and ns are respectively the number of available computing resources and the number of available storage resources of the computing node.
3) Respectively obtaining the outputs of the evaluation network and the target network, calculating the final output of the neural network by a comprehensive Q-value calculation method that combines them with the loss value of the previous training step, and taking the final output as the adaptation degree value between the task and the computing nodes; the specific calculation formula of the comprehensive Q value is as follows:
wherein TNet and ENT are respectively the target network and the evaluation network, OT and OE are respectively their outputs, and Loss is the loss of the previous iteration.
4) Based on a self-adaptive dynamic action-space exploration-degree adjustment strategy, selecting for the task, with a certain probability, the computing node corresponding to the maximum adaptation degree value according to the final output of the neural network and the loss values of the last several training steps, and otherwise selecting a computing node randomly; the adaptive dynamic action-space exploration-degree adjustment strategy is specifically:
wherein rd is a random-number generation function that generates random numbers in the range [0,1]; if the value of F is True, an offloading action corresponding to a non-maximum value is selected for the current task to be processed, and if F is False, the offloading action corresponding to the maximum value is selected;
5) Calculating the loss values of all the current tasks;
the specific calculation method is as follows:
wherein output is the output of the evaluation network and action is the selected action;
6) Prioritizing the current tasks based on their loss values by means of an adaptive lightweight replay mechanism, and storing the learning experience with the highest priority in the experience pool;
7) Updating the evaluation network parameters;
8) Repeating until the end condition is satisfied.
Further:
In step 2), the configuration information of the subtasks into which the task is divided and the configuration information of each computing node are taken as the environment state information.
In step 3), the loss value of the previous training step in the comprehensive Q-value calculation method is used to weight the proportions of the evaluation network and the target network in the final output; the output of the target network dominates in the initial training stage of the neural network, and the output of the evaluation network gradually dominates as training progresses.
In step 4), the average of the loss values over the last several training steps is calculated in the adaptive dynamic action-space exploration-degree adjustment strategy and used as the basis for the computing-node selection probability.
In step 5), a cross entropy loss function is adopted when the loss values of all the current tasks are calculated.
In step 6), the adaptive lightweight replay mechanism sorts the current learning experience by loss value and stores the middle portion of the learning experience in the experience pool, because learning experience with a small loss value easily guides the neural network to a local optimum while learning experience with a large loss value is far from the optimal solution.
Compared with the prior art, the invention has the following beneficial effects:
For the task scheduling problem in edge computing, the task is regarded as being composed of independent subtasks, and the configuration information of the subtasks and of each computing node is used as the input of the neural network. The final output of the neural network is calculated based on the loss value obtained in the previous training step, computing nodes are selected for the tasks based on this final output and the loss values of the last several training steps, and finally the tasks are prioritized according to their loss values and the learning samples of the middle portion are stored in the experience pool; parameter copying or experience replay is performed when the specified condition is met. In this way the optimal matching of tasks and computing nodes is realized, computing resources are fully utilized, the real-time performance of task processing is improved, and the system overhead is reduced.
Drawings
Fig. 1 is a general framework of the present invention.
Fig. 2 is a process flow of the present invention.
Fig. 3 is a graph of loss values for the present invention.
Fig. 4 is a graph of loss values for DQN.
Fig. 5 is a graph of loss values for D3DQN.
Fig. 6 is an overall comparison of loss value curves.
Fig. 7 is a graph showing the cumulative energy consumption of the present invention versus various baseline algorithms.
Fig. 8 is a graph of the cumulative weighted overhead of the present invention versus various baseline algorithms.
Detailed Description
The invention is described in further detail below with reference to the accompanying drawings.
Referring to fig. 1, the application scenario of the present invention may be:
In an edge computing system there is a set of computing nodes at the edge, a set of terminal devices, and a decision-making agent. When the agent receives a task scheduling request from a terminal device, it collects the task information submitted by the terminal device and the computing node information over the wireless network and makes a task offloading decision: if the task is offloaded, the task data is uploaded to an edge computing node for processing and the processing result is returned to the terminal device; if local processing is chosen, the task is processed on the terminal device.
Referring to fig. 2, an adaptive edge computing task scheduling method based on DQN includes the steps of:
1) When the number of training steps of the neural network is a multiple of the specified parameter-copying interval, the evaluation network parameters in the DQN are copied to the target network; when the number of training steps is a multiple of the specified experience-replay interval, the learning experience in the experience pool is replayed and the experience pool is emptied. Specifically: all parameters are initialized when processing starts; if the number of training steps itr reaches its maximum value, processing ends, otherwise it continues; if itr satisfies condition 1, i.e. it is a multiple of the specified parameter-copying interval, the evaluation network parameters are copied to the target network; if itr satisfies condition 2, i.e. it is a multiple of the specified experience-replay interval, the learning experience in the experience pool is replayed;
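The control flow described above can be summarized in a short sketch. The following Python fragment is a minimal illustration only, assuming hypothetical interval constants (COPY_INTERVAL, REPLAY_INTERVAL), a hypothetical step limit, and stand-in networks and helpers that the patent does not specify:

# A minimal control-flow sketch of step 1); the constants, stand-in parameter
# dictionaries, and helper functions are assumptions introduced for illustration.
import copy

COPY_INTERVAL = 100      # assumed: copy evaluation-network parameters every 100 steps
REPLAY_INTERVAL = 50     # assumed: replay and empty the experience pool every 50 steps
MAX_STEPS = 2000         # assumed maximum number of training steps

eval_params = {"w": 0.0}                      # stand-in for the evaluation network parameters
target_params = copy.deepcopy(eval_params)    # target network starts as a copy
experience_pool = []                          # learning experiences saved by the replay mechanism

def replay(experiences):
    """Stand-in for replaying saved learning experiences (training on them)."""
    pass

def train_one_step(itr):
    """Stand-in for steps 2)-7): build state, act, compute loss, update the evaluation network."""
    eval_params["w"] += 0.0                   # placeholder parameter update
    experience_pool.append({"step": itr})

for itr in range(1, MAX_STEPS + 1):
    if itr % COPY_INTERVAL == 0:              # condition 1: copy parameters to the target network
        target_params = copy.deepcopy(eval_params)
    if itr % REPLAY_INTERVAL == 0 and experience_pool:   # condition 2: replay, then empty the pool
        replay(experience_pool)
        experience_pool.clear()
    train_one_step(itr)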
2) Computing node configuration information, terminal device configuration information, and task configuration information are acquired as environment state information, which is normalized and used as the input of the deep reinforcement learning neural network. The configuration information of the subtasks into which the task is divided and the configuration information of each computing node are taken as the environment state information. The specific processing is as follows:
When processing a scheduling request sent by a terminal device, the agent needs to comprehensively consider the current task to be processed and the state information of all computing nodes in order to make an optimal scheduling decision. The scheduling request received from the terminal device contains the state information of the task to be processed and of the terminal device; in addition, the agent requests the state information of all edge computing nodes from the edge server. After acquiring the required environment state information, the agent can start making scheduling decisions.
The environment state information consists of the data size of the computing task, the number of required computing resources, the number of required storage resources, and the numbers of available computing resources and available storage resources of all computing nodes, namely state_i = (ds, tc, ts, nc_i, ns_i), where state_i represents the state information of the computing task and the i-th computing node; ds, tc, and ts are respectively the data size, the number of required computing resources, and the number of required storage resources of the computing task; nc and ns are respectively the number of available computing resources and the number of available storage resources of the computing node.
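As an illustration of step 2), the following Python sketch builds one normalized state vector per computing node from the fields listed above (ds, tc, ts, nc, ns); the dictionary field names, node capacities, and the max-value normalization constants are assumptions introduced only for the example:

# A minimal sketch of constructing and normalizing the per-node state vector
# state_i = (ds, tc, ts, nc_i, ns_i); the numeric values and max-normalization
# constants are hypothetical, not values from the patent.
import numpy as np

def build_states(task, nodes, max_vals):
    """Return one normalized state vector per computing node for the given task."""
    states = []
    for node in nodes:
        raw = np.array([task["ds"], task["tc"], task["ts"],
                        node["nc"], node["ns"]], dtype=np.float32)
        states.append(raw / max_vals)          # scale each field into [0, 1]
    return np.stack(states)

task = {"ds": 4e6, "tc": 2e9, "ts": 1e6}       # hypothetical task: data size, CPU cycles, storage
nodes = [{"nc": 8e9, "ns": 6e7}, {"nc": 4e9, "ns": 3e7}]   # hypothetical node capacities
max_vals = np.array([1e7, 1e10, 1e8, 1e10, 1e8], dtype=np.float32)
print(build_states(task, nodes, max_vals))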
3) The outputs of the evaluation network and the target network are obtained respectively, and the final output of the neural network is calculated by the comprehensive Q-value calculation method, which combines them with the loss value of the previous training step; this final output is taken as the adaptation degree value between the task and the computing nodes. The loss value of the previous training step in the comprehensive Q-value calculation method is used to weight the proportions of the evaluation network and the target network in the final output: the output of the target network dominates in the initial training stage of the neural network, and the output of the evaluation network gradually dominates as training progresses.
The design idea of the specific calculation method is as follows:
In the conventional DQN algorithm, the Q values of all possible offloading actions are output according to the environment state information; the magnitude of a Q value represents the probability that the corresponding offloading action is selected. The offloading action corresponding to the maximum Q value is then selected as the scheduling decision for the current task to be processed. However, in the initial stage of training, always selecting the maximum Q value causes the estimated Q value to be updated toward a value larger than the actual one during parameter updates, which leads to the overestimation problem. In existing work, when the parameters of the evaluation network and the target network are updated, the evaluation network parameters are updated in real time, the target network parameters are updated with a delay, and the output of the target network is used as the basis for action selection. While this reduces the effect of overestimation, it is disadvantageous for the parameter updates of the evaluation network and easily causes fluctuations of the neural network after the evaluation network parameters are copied to the target network. To solve these problems, the final output of the neural network is calculated from the loss value of the previous training step together with the outputs of the evaluation network and the target network; the specific calculation formula of the comprehensive Q value is as follows:
TNet and ENT are respectively the target network and the evaluation network, OT and OE are respectively their outputs, and Loss is the loss of the previous iteration. The loss value reflects the degree of learning of the neural network: the larger the loss value, the farther the network is from convergence, the harder it is to evaluate the current environment state accurately, and the larger the influence of overestimation; conversely, the closer the network is to convergence, the smaller the influence of overestimation. In the initial stage of learning the loss value is large, so according to the formula the output of the whole network is dominated by the output of the target network, which reduces the influence of overestimation; the closer the network is to convergence, the more the output of the evaluation network dominates. The target network and the evaluation network therefore jointly determine the final network output, reducing the influence of overestimation and ensuring the stability of the neural network.
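The exact blending formula appears in the patent drawings; as a hedged illustration of the idea described above, the Python sketch below blends the target-network output OT and the evaluation-network output OE with a weight that grows with the previous loss, so that OT dominates early in training and OE dominates near convergence. The Loss/(1+Loss) weighting is an assumption chosen only to demonstrate the behavior, not the patented formula:

# Illustrative blend of target-network and evaluation-network outputs driven by the
# previous training loss; the Loss/(1+Loss) weight is an assumed stand-in formula.
import numpy as np

def comprehensive_q(ot, oe, last_loss):
    """Blend target and evaluation outputs according to the previous training loss."""
    w_target = last_loss / (1.0 + last_loss)   # assumed weight: near 1 for large loss, near 0 at convergence
    return w_target * ot + (1.0 - w_target) * oe

ot = np.array([0.2, 0.7, 0.1])   # hypothetical target-network Q values per node
oe = np.array([0.3, 0.5, 0.2])   # hypothetical evaluation-network Q values per node
print(comprehensive_q(ot, oe, last_loss=2.0))    # early training: result is closer to OT
print(comprehensive_q(ot, oe, last_loss=0.05))   # near convergence: result is closer to OE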
4) Based on the self-adaptive dynamic action-space exploration-degree adjustment strategy, with a certain probability the computing node corresponding to the maximum adaptation degree value, determined from the final output of the neural network and the loss values of the last several training steps, is selected for the task; otherwise a computing node is selected randomly. The average of the loss values over the last several training steps is calculated in the adaptive dynamic action-space exploration-degree adjustment strategy and used as the basis for the computing-node selection probability.
The design idea of the specific computing-node selection method is as follows:
To increase exploration of the action space, existing work often adopts an ε-greedy strategy for action selection: another action is selected with a fixed probability, and otherwise the action corresponding to the maximum Q value is selected. However, the degree of exploration required differs at different stages of the learning process. In the initial stage of learning, in order to approach the optimal solution as quickly as possible, the action space should be explored with a higher probability, i.e. an offloading action corresponding to a non-maximum value should be selected; as learning proceeds, such non-maximum actions should be selected with a lower probability. The learning progress of the neural network is reflected by the loss value: if the loss value is large, the network has not yet converged and the action space needs to be explored with a high degree of exploration to find the optimal solution; if the loss value is small, the degree of exploration should be reduced accordingly. The action selection strategy is therefore designed based on the loss value, and to prevent fluctuations of the neural network caused by excessive changes of the loss value, the square of the average loss value over the last several training steps is used as the basis for the probability in the action selection strategy, realizing dynamic adjustment of the exploration degree of the action space. The calculation method is as follows:
where rd is a random-number generation function that generates random numbers in the range [0,1]. If the value of F is True, an offloading action corresponding to a non-maximum value is selected for the current task to be processed; if F is False, the offloading action corresponding to the maximum value is selected.
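As a hedged illustration of this adaptive exploration rule, the Python sketch below uses the square of the mean loss over the last several training steps as the exploration probability and otherwise exploits the maximum adaptation value; the clamping to [0,1] and the length of the recent-loss window are assumptions introduced for the example:

# Illustrative adaptive action selection: explore with probability (mean recent loss)^2,
# clamped to [0, 1] (assumed), otherwise pick the node with the maximum adaptation value.
import random

def choose_action(q_values, recent_losses):
    """Pick a node index: explore with probability ~ (mean recent loss)^2, else exploit."""
    mean_loss = sum(recent_losses) / len(recent_losses)
    explore_prob = min(1.0, mean_loss ** 2)          # assumed clamp into [0, 1]
    f = random.random() < explore_prob               # F: True -> explore, False -> exploit
    if f:
        best = q_values.index(max(q_values))
        non_max = [i for i in range(len(q_values)) if i != best]
        return random.choice(non_max) if non_max else best
    return q_values.index(max(q_values))

q = [0.3, 0.9, 0.4]
print(choose_action(q, recent_losses=[1.2, 0.9, 1.1]))     # early training: mostly explores
print(choose_action(q, recent_losses=[0.05, 0.04, 0.06]))  # near convergence: mostly exploits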
5) Calculating the loss values of all the current tasks;
the specific calculation method is as follows:
where output is the output of the evaluation network and action is the selected action.
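Since a cross-entropy loss function is adopted for the loss values of all current tasks (see the further limitations above), steps 5) and 7) can be illustrated with the following PyTorch sketch; the network architecture, batch contents, the choice of Adam as the designated optimizer, and the learning rate are assumptions introduced only for the example:

# Illustrative loss computation and evaluation-network update; sizes, optimizer, and
# learning rate are assumed for the sketch, not specified by the patent.
import torch
import torch.nn as nn
import torch.nn.functional as F

eval_net = nn.Sequential(nn.Linear(5, 32), nn.ReLU(), nn.Linear(32, 3))  # assumed: 5 state fields, 3 nodes
optimizer = torch.optim.Adam(eval_net.parameters(), lr=1e-3)             # assumed optimizer and learning rate

states = torch.rand(4, 5)                    # hypothetical batch: 4 tasks, 5 state fields each
actions = torch.tensor([0, 2, 1, 2])         # offloading actions selected for those tasks

output = eval_net(states)                    # evaluation-network outputs (one score per node)
loss = F.cross_entropy(output, actions)      # cross-entropy loss of all current tasks

optimizer.zero_grad()
loss.backward()
optimizer.step()                             # update the evaluation-network parameters
print(float(loss))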
6) The current tasks are prioritized based on their loss values by an adaptive lightweight replay mechanism, and the learning experience with the highest priority is stored in the experience pool. The adaptive lightweight replay mechanism sorts the current learning experience by loss value and stores the middle portion of the learning experience in the experience pool, because learning experience with a small loss value easily guides the neural network to a local optimum while learning experience with a large loss value is far from the optimal solution.
The specific design concept is as follows:
As the dimensionality of the state space increases, the "curse of dimensionality" arises, i.e. more learning samples are needed for the neural network to achieve a satisfactory effect. However, the number of actual samples is often limited, so the efficiency of a limited number of samples must be considered. An experience replay mechanism not only solves the problem of low learning-sample efficiency but also breaks the continuity of the action space, and is often used together with the DQN algorithm to solve complex high-dimensional problems. However, in an edge environment with limited computing resources, a conventional experience replay mechanism that saves all historical experience consumes significant memory, and randomly extracting a certain number of samples from the historical experience for replay cannot effectively exploit the more valuable samples.
Since the most recent learning experience is most beneficial to the learning of the neural network and most correlated with its current state, the replay mechanism only saves the learning experience of the most recent m iterations. The learning experience is sorted by loss value; experience with a small loss value easily guides the neural network to a local optimum, while experience with a large loss value is far from the optimal solution, so the middle x portion of the historical experience is extracted and replayed. The value of x differs at different stages of learning: in the initial stage the neural network still needs to learn new experience, so x should take a smaller value; as learning deepens, the neural network should emphasize replaying historical experience to stabilize its performance, so x should take a larger value.
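As a hedged illustration of this replay mechanism, the Python sketch below keeps only the last m experiences, sorts them by loss value, and returns the middle x portion for replay, with x growing as training deepens; the value of m, the linear schedule for x, and the experience fields are assumptions introduced for the example:

# Illustrative lightweight replay selection: keep the last m experiences, sort by loss,
# and replay the middle x fraction, where x grows with training progress (assumed schedule).
def select_replay_batch(experiences, itr, max_itr, m=64):
    """Sort recent experiences by loss and return the middle x fraction for replay."""
    recent = experiences[-m:]                               # keep only the last m experiences
    ranked = sorted(recent, key=lambda e: e["loss"])        # ascending loss
    x = 0.2 + 0.6 * (itr / max_itr)                         # assumed schedule: x grows from 0.2 toward 0.8
    keep = max(1, int(len(ranked) * x))
    start = (len(ranked) - keep) // 2                       # middle portion: skip smallest and largest losses
    return ranked[start:start + keep]

pool = [{"state": None, "action": i % 3, "loss": abs(1.5 - 0.01 * i)} for i in range(200)]
print(len(select_replay_batch(pool, itr=100, max_itr=2000)))    # early training: small middle slice
print(len(select_replay_batch(pool, itr=1800, max_itr=2000)))   # late training: larger middle slice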
7) The parameters of the evaluation network are updated by the designated optimizer based on the loss values of the current tasks, the action selection, and the learning rate.
8) Until the end condition is satisfied.
The invention can perceive the changing environment state, acquire the required environment state information, and optimally match tasks with computing nodes according to that information, thereby realizing efficient, real-time, and low-energy task scheduling in an edge computing system with limited computing resources. The specific processing flow is shown in fig. 2.
For the task scheduling problem in an edge computing system, the invention uses the DQN algorithm, which has a relatively small computational cost, combined with the several improved methods and strategies designed on the basis of loss values, and matches tasks with computing nodes on the basis of the environment state information formed by the tasks and the computing nodes. The invention can realize the optimal matching of tasks and computing nodes according to the changing environment state information, fully utilize computing resources, improve the real-time performance of task processing, reduce system overhead, and provide a self-adaptive task scheduling method for edge computing systems.
To verify the effectiveness of the present invention, its performance is compared with that of various baseline algorithms, as shown in figs. 3-8. A brief analysis follows:
D3DQN-CAA is the method of the present invention, obtained by combining the three methods and mechanisms designed in the invention on the basis of Dueling Double DQN (D3DQN); the remaining algorithms are existing algorithms and are used to compare the performance of the invention.
Figs. 3-6 are a comprehensive comparison of the loss values of D3DQN-CAA, DQN, and D3DQN, with the number of training steps on the abscissa and the loss value on the ordinate. It can be seen that the D3DQN-CAA curve is the smoothest and has the smallest fluctuation amplitude after convergence, which shows that the comprehensive Q-value calculation method and the adaptive lightweight replay mechanism stabilize the model. Comparing figs. 3, 4, and 5, the loss curves of D3DQN-CAA, DQN, and D3DQN have similar trends, but the DQN loss curve decreases too quickly, and neither DQN nor D3DQN reaches the convergence value of D3DQN-CAA. Meanwhile, although the loss curves in figs. 4 and 5 converge, they converge at a relatively higher value and with a larger fluctuation range, which shows that the adaptive dynamic action-space exploration-degree adjustment strategy can effectively control the degree of exploration of the action space and improve the stability of the model. In addition, the comparison shows that the loss curves of DQN and D3DQN tend to fluctuate strongly around 1200, 1400, and 1600 training steps and need more training steps to converge than D3DQN-CAA, which shows that the comprehensive Q-value calculation method has a remarkable effect on improving the convergence speed of the neural network and reducing its fluctuation amplitude after parameter copying. These experimental comparisons show that the loss curve of the designed method requires the fewest training steps to converge, is less prone to fluctuation, and is more stable.
Fig. 7 shows the cumulative energy consumption of D3DQN-CAA, DQN, D3DQN, Only Local, Only Edge, and Random, with the number of training steps on the abscissa and the cumulative energy consumption on the ordinate. Because D3DQN-CAA adopts the comprehensive Q-value calculation method, the adaptive dynamic action-space exploration-degree adjustment strategy, and the adaptive lightweight replay mechanism, it can effectively reduce the number of training steps of the neural network and fully utilize the computing resources while ensuring the stability of the algorithm, so its cumulative energy consumption curve is lower than those of DQN, D3DQN, and the other algorithms.
Fig. 8 is a comparison of the cumulative weighted overhead (calculated as the weighted sum of system computation delay, transmission delay, and total energy consumption) of D3DQN-CAA, DQN, D3DQN, Only Local, Only Edge, and Random, with the number of training steps on the abscissa and the cumulative weighted overhead on the ordinate. As can be seen from the figure, Only Local performs worst, i.e. its cumulative weighted overhead curve is higher than those of the other algorithms; the proposed scheduling algorithm D3DQN-CAA performs best; and the curves in between, from high to low, are Random, DQN, Only Edge, and D3DQN.
The above is only intended to illustrate the technical idea of the present invention and does not limit its protection scope; any modification made on the basis of the technical solution according to the technical idea of the present invention falls within the protection scope of the claims of the present invention.

Claims (6)

1. The adaptive edge computing task scheduling method based on the DQN is characterized by comprising the following steps of:
1) When the number of training steps of the neural network is a multiple of the specified parameter-copying interval, copying the evaluation network parameters in the DQN to the target network; when the number of training steps is a multiple of the specified experience-replay interval, replaying the learning experience in the experience pool and emptying the experience pool;
2) Acquiring computing node configuration information, terminal device configuration information, and task configuration information as environment state information, and normalizing it as the input of the deep reinforcement learning neural network; the environment state information consists of the data size of the computing task, the number of required computing resources, the number of required storage resources, and the numbers of available computing resources and available storage resources of all computing nodes, namely state_i = (ds, tc, ts, nc_i, ns_i),
wherein state_i represents the state information of the computing task and the i-th computing node; ds, tc, and ts are respectively the data size, the number of required computing resources, and the number of required storage resources of the computing task; nc and ns are respectively the number of available computing resources and the number of available storage resources of the computing node;
3) Respectively obtaining the outputs of the evaluation network and the target network, calculating the final output of the neural network by a comprehensive Q-value calculation method that combines them with the loss value of the previous training step, and taking the final output as the adaptation degree value between the task and the computing nodes; the specific calculation formula of the comprehensive Q value is as follows:
wherein TNet and ENT are respectively the target network and the evaluation network, OT and OE are respectively their outputs, and Loss is the loss of the previous iteration;
4) Based on a self-adaptive dynamic action-space exploration-degree adjustment strategy, selecting for the task, with a certain probability, the computing node corresponding to the maximum adaptation degree value according to the final output of the neural network and the loss values of the last several training steps, and otherwise selecting a computing node randomly; the adaptive dynamic action-space exploration-degree adjustment strategy is specifically:
wherein rd is a random-number generation function that generates random numbers in the range [0,1]; if the value of F is True, selecting an offloading action corresponding to a non-maximum value for the current task to be processed, and if F is False, selecting the offloading action corresponding to the maximum value;
5) Calculating the loss values of all the current tasks;
the specific calculation method is as follows:
wherein output is the output of the evaluation network and action is the selected action;
6) Prioritizing the current tasks based on their loss values by means of an adaptive lightweight replay mechanism, and storing the learning experience with the highest priority in the experience pool;
7) Updating the evaluation network parameters;
8) Repeating until the end condition is satisfied.
2. The DQN-based adaptive edge computing task scheduling method of claim 1, wherein: in step 2), the configuration information of the subtasks into which the task is divided and the configuration information of each computing node are taken as the environment state information.
3. The DQN-based adaptive edge computing task scheduling method of claim 1, wherein: in step 3), the loss value of the previous training step in the comprehensive Q-value calculation method is used to weight the proportions of the evaluation network and the target network in the final output; the output of the target network dominates in the initial training stage of the neural network, and the output of the evaluation network gradually dominates as training progresses.
4. The DQN-based adaptive edge computing task scheduling method of claim 1, wherein: in step 4), the average of the loss values over the last several training steps is calculated in the adaptive dynamic action-space exploration-degree adjustment strategy and used as the basis for the computing-node selection probability.
5. The DQN-based adaptive edge computing task scheduling method of claim 1, wherein: in step 5), a cross entropy loss function is adopted when the loss values of all the current tasks are calculated.
6. The DQN-based adaptive edge computing task scheduling method of claim 1, wherein: in step 6), the adaptive lightweight replay mechanism sorts the current learning experience by loss value and stores the middle portion of the learning experience in the experience pool, because learning experience with a small loss value easily guides the neural network to a local optimum while learning experience with a large loss value is far from the optimal solution.
CN202211261147.7A 2022-10-14 2022-10-14 DQN-based self-adaptive edge computing task scheduling method Active CN115686779B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211261147.7A CN115686779B (en) 2022-10-14 2022-10-14 DQN-based self-adaptive edge computing task scheduling method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211261147.7A CN115686779B (en) 2022-10-14 2022-10-14 DQN-based self-adaptive edge computing task scheduling method

Publications (2)

Publication Number Publication Date
CN115686779A CN115686779A (en) 2023-02-03
CN115686779B true CN115686779B (en) 2024-02-09

Family

ID=85067008

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211261147.7A Active CN115686779B (en) 2022-10-14 2022-10-14 DQN-based self-adaptive edge computing task scheduling method

Country Status (1)

Country Link
CN (1) CN115686779B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116257361B (en) * 2023-03-15 2023-11-10 北京信息科技大学 Unmanned aerial vehicle-assisted fault-prone mobile edge computing resource scheduling optimization method
CN116909717B (en) * 2023-09-12 2023-12-05 国能(北京)商务网络有限公司 Task scheduling method
CN117082008B (en) * 2023-10-17 2023-12-15 深圳云天畅想信息科技有限公司 Virtual elastic network data transmission scheduling method, computer device and storage medium
CN117806806A (en) * 2024-02-28 2024-04-02 湖南科技大学 Task part unloading scheduling method, terminal equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111866869A (en) * 2020-07-07 2020-10-30 兰州交通大学 Federal learning indoor positioning privacy protection method facing edge calculation
CN112134916A (en) * 2020-07-21 2020-12-25 南京邮电大学 Cloud edge collaborative computing migration method based on deep reinforcement learning
CN112822055A (en) * 2021-01-21 2021-05-18 国网河北省电力有限公司信息通信分公司 DQN-based edge computing node deployment algorithm
CN113269322A (en) * 2021-05-24 2021-08-17 东南大学 Deep reinforcement learning improvement method based on self-adaptive hyper-parameters
CN113296845A (en) * 2021-06-03 2021-08-24 南京邮电大学 Multi-cell task unloading algorithm based on deep reinforcement learning in edge computing environment
WO2022069747A1 (en) * 2020-10-02 2022-04-07 Deepmind Technologies Limited Training reinforcement learning agents using augmented temporal difference learning
CN114374949A (en) * 2021-12-31 2022-04-19 东莞理工学院 Power control mechanism based on information freshness optimization in Internet of vehicles

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111866869A (en) * 2020-07-07 2020-10-30 兰州交通大学 Federal learning indoor positioning privacy protection method facing edge calculation
CN112134916A (en) * 2020-07-21 2020-12-25 南京邮电大学 Cloud edge collaborative computing migration method based on deep reinforcement learning
WO2022069747A1 (en) * 2020-10-02 2022-04-07 Deepmind Technologies Limited Training reinforcement learning agents using augmented temporal difference learning
CN112822055A (en) * 2021-01-21 2021-05-18 国网河北省电力有限公司信息通信分公司 DQN-based edge computing node deployment algorithm
CN113269322A (en) * 2021-05-24 2021-08-17 东南大学 Deep reinforcement learning improvement method based on self-adaptive hyper-parameters
CN113296845A (en) * 2021-06-03 2021-08-24 南京邮电大学 Multi-cell task unloading algorithm based on deep reinforcement learning in edge computing environment
CN114374949A (en) * 2021-12-31 2022-04-19 东莞理工学院 Power control mechanism based on information freshness optimization in Internet of vehicles

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
D3DQN-CAA: A DRL-based adaptive edge computing task scheduling method; Ju Tao et al.; https://link.cnki.net/urlid/43.1061.N.20231013.0855.002; 2023-10-16; 1-13 *
Job Scheduling Based on Deep Reinforcement Learning in Cloud Data Center; Fengcun Li et al.; In Proceedings of the 2019 4th International Conference on Big Data and Computing; 2019-05-12; 48-53 *
Task offloading and resource allocation algorithm with multiple constraints in mobile edge computing; Tong Zhao et al.; Computer Engineering & Science; 2020-10-15; Vol. 42, No. 10; 1869-1879 *

Also Published As

Publication number Publication date
CN115686779A (en) 2023-02-03

Similar Documents

Publication Publication Date Title
CN115686779B (en) DQN-based self-adaptive edge computing task scheduling method
CN111556461B (en) Vehicle-mounted edge network task distribution and unloading method based on deep Q network
CN113434212B (en) Cache auxiliary task cooperative unloading and resource allocation method based on meta reinforcement learning
CN112380008B (en) Multi-user fine-grained task unloading scheduling method for mobile edge computing application
WO2021012946A1 (en) Video bit rate determining method and apparatus, electronic device, and storage medium
CN110460880B (en) Industrial wireless streaming media self-adaptive transmission method based on particle swarm and neural network
CN110780938B (en) Computing task unloading method based on differential evolution in mobile cloud environment
CN111768028B (en) GWLF model parameter adjusting method based on deep reinforcement learning
CN111813506A (en) Resource sensing calculation migration method, device and medium based on particle swarm algorithm
CN113485826A (en) Load balancing method and system for edge server
CN114760311A (en) Optimized service caching and calculation unloading method for mobile edge network system
CN113722980A (en) Ocean wave height prediction method, system, computer equipment, storage medium and terminal
CN114385272B (en) Ocean task oriented online adaptive computing unloading method and system
CN114706631B (en) Unloading decision method and system in mobile edge calculation based on deep Q learning
Dong et al. Quantum particle swarm optimization for task offloading in mobile edge computing
Li et al. An intelligent adaptive algorithm for servers balancing and tasks scheduling over mobile fog computing networks
CN114449584A (en) Distributed computing unloading method and device based on deep reinforcement learning
CN117579701A (en) Mobile edge network computing and unloading method and system
CN113157344B (en) DRL-based energy consumption perception task unloading method in mobile edge computing environment
CN116305747A (en) Workflow multi-target scheduling method based on improved whale optimization algorithm
CN116137724A (en) Task unloading and resource allocation method based on mobile edge calculation
CN116367231A (en) Edge computing Internet of vehicles resource management joint optimization method based on DDPG algorithm
CN115766241A (en) Distributed intrusion detection system task scheduling and unloading method based on DQN algorithm
CN111709578A (en) Short-time ship traffic flow prediction method and device and storage medium
Yao et al. Performance Optimization in Serverless Edge Computing Environment using DRL-Based Function Offloading

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant