CN115220818A - Real-time dependency task unloading method based on deep reinforcement learning - Google Patents

Real-time dependency task unloading method based on deep reinforcement learning

Info

Publication number
CN115220818A
CN115220818A
Authority
CN
China
Prior art keywords
subtask
task
unloading
time
dep
Prior art date
Legal status
Pending
Application number
CN202210937248.5A
Other languages
Chinese (zh)
Inventor
陈星�
胡晟熙
姚泽玮
林潮伟
Current Assignee
Fuzhou University
Original Assignee
Fuzhou University
Priority date
Filing date
Publication date
Application filed by Fuzhou University filed Critical Fuzhou University
Priority to CN202210937248.5A priority Critical patent/CN115220818A/en
Publication of CN115220818A publication Critical patent/CN115220818A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/445Program loading or initiating
    • G06F9/44594Unloading
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061Partitioning or combining of resources
    • G06F9/5066Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to a real-time dependent task unloading method based on deep reinforcement learning, which comprises the following steps: S1, training an unloading operation Q value prediction model by using the DQN algorithm in the runtime environment, based on a task unloading system model; S2, the unloading operation Q value prediction model predicts the Q values of different unloading operations according to the computing capacity of the computing nodes, the transmission rate among the computing nodes, and the applied unloading scheme, and then a proper unloading operation is selected by comparing the corresponding Q values; and S3, repeating step S2 and gradually determining an execution position for each task through feedback iteration. The method adapts well to different cloud-edge environments and generates unloading schemes efficiently.

Description

Real-time dependency task unloading method based on deep reinforcement learning
Technical Field
The invention belongs to the field of cloud-edge cooperative computing, and particularly relates to a real-time dependent task unloading method based on deep reinforcement learning.
Background
With the rise of intelligent technologies, more and more computation-intensive applications (such as automatic driving, face recognition, augmented reality, etc.) are developed to meet the needs of people, and computing platforms of the applications are not limited to smart phones and notebook computers, and have gradually been extended to smart devices such as wearable devices, vehicles, unmanned aerial vehicles, etc. Although these mobile devices are becoming more powerful, they are still limited in terms of processing power, memory capacity, and battery capacity due to size and weight constraints, and most mobile devices are still unable to handle the emerging variety of computationally intensive tasks in a short amount of time.
Computation offloading is an effective way to address the resource limitations of mobile devices: computation-intensive tasks in software are sent from the local device to remote devices for execution, using remote resources to extend local resources. Computing resources are dispersed across the mobile device, the edge server, and the cloud server, and the amount of resources owned by each computing platform differs markedly; for an application, the computing resources it can acquire are often spread across multiple different computing platforms and change dynamically as its location changes. The offloading scheme of an application determines which of its computing tasks should run on which computing platform, but no offloading scheme is final: when the running environment of the application changes, the offloading scheme must also be adjusted accordingly to provide better performance. Therefore, it is necessary to decide at runtime whether, and to which computing platforms, a computing task is offloaded. Most existing research adopts heuristic algorithms or search algorithms to find suitable offloading schemes, which can take tens or even hundreds of seconds. Considering the mobility of mobile devices, changes in the application running environment occur frequently, and selecting a suitable offloading scheme faces a combinatorial explosion, so finding an efficient offloading-scheme generation method that meets the real-time requirement of computation offloading remains a very challenging research problem.
Disclosure of Invention
In view of this, the present invention provides a real-time dependent task offloading method based on deep reinforcement learning, which efficiently generates an offloading scheme to meet the real-time requirement of computation offloading.
In order to realize the purpose, the invention adopts the following technical scheme:
a real-time dependent task unloading method based on deep reinforcement learning comprises the following steps:
s1, training an unloading operation Q value prediction model by using a DQN algorithm in a runtime environment based on a task unloading system model;
s2, an unloading operation Q value prediction model predicts Q values of different unloading operations according to the computing capacity of the computing nodes, the transmission rate among the computing nodes and an applied unloading scheme, and then selects proper unloading operation by comparing the corresponding Q values;
and S3, repeating the step S2, and gradually determining an execution position for each task through feedback iteration.
Further, the task unloading system model includes a system model and a task model, and specifically includes:
the system model comprises a mobile device MD, an edge server ES and a cloud server CS; the set of computing nodes is represented by V = {MD, ES, CS}, the computing capacity of each computing node is represented by f_k (k ∈ V), and the data transmission rate between different computing nodes is represented by v_{k,l} (k, l ∈ V);
the task model is specifically as follows: an application is represented by a directed acyclic graph G = (N, E), where N = {1, 2, ..., n} represents the set of subtasks, n is the number of subtasks, and the computation amount of each subtask is represented by c_i (i ∈ N); E = {e_{i,j} | i, j ∈ N, i < j} represents the set of dependency directed edges between subtasks, and for a directed edge e_{i,j} ∈ E, subtask i is called a direct predecessor task of subtask j and subtask j is a direct successor task of subtask i; in addition, each directed edge e_{i,j} ∈ E is associated with a weight d_{i,j}, where d_{i,j} represents the amount of data transferred from subtask i to subtask j; the direct predecessor task set and direct successor task set of subtask i are represented by pre(i) and suc(i), and a subtask can only start to execute after receiving the processing results of all its predecessor tasks.
Further, a binary variable x_{i,k} is defined to indicate the offloading scheme: x_{i,k} = 1 denotes that subtask i is assigned to computing node k, and x_{i,k} = 0 otherwise; since each subtask can only be assigned to one computing node in the network, the following constraint holds:
Σ_{k∈V} x_{i,k} = 1,  ∀i ∈ N
In addition, any subtask j ∈ N can start execution only when two conditions are met. First, the assigned computing node must be available, i.e., no other subtask is currently executing on that node; the available time T_j^avail of the node assigned to subtask j should satisfy the following constraint:
T_j^avail ≥ x_{i,k} · x_{j,k} · T_i^end,  ∀i ∈ N, i ≠ j, ∀k ∈ V
wherein T_i^end is the completion time of subtask i;
Second, subtask j should be ready, i.e., it has received the processing results of all its predecessor subtasks; the ready time T_j^ready of subtask j is defined as:
T_j^ready = max_{i∈pre(j)} ( T_i^end + d_{i,j} / v_{k,l} )
where, if subtask j and one of its predecessor subtasks i ∈ pre(j) are assigned to different computing nodes k and l respectively, the communication delay d_{i,j}/v_{k,l} needs to be taken into account; otherwise data is transferred between them through shared memory without communication delay, and in this case the second term on the right-hand side of the constraint is zero;
Considering the above two conditions together, the start time of subtask j is defined as:
T_j^start = max( T_j^avail , T_j^ready )
and the end time of subtask j is defined as:
T_j^end = T_j^start + Σ_{k∈V} x_{j,k} · c_j / f_k
D_{1:t} denotes the set of all subtasks successfully completed by the t-th time step; the cumulative execution delay T_{1:t} of the application at the t-th time step is defined as:
T_{1:t} = max_{i∈D_{1:t}} T_i^end
An application is considered complete if and only if all of its n subtasks are successfully completed; when all n tasks of the application are successfully completed, D_{1:n} = {1, 2, ..., n}; thus, the total execution delay T_{1:n} of the application is calculated by the following formula:
T_{1:n} = max_{i∈N} T_i^end
For an application with n tasks, the offloading scheme of the application is represented by DEP = (dep(1), dep(2), ..., dep(n)), where dep(i) ∈ {1, 2, 3} represents the execution position of task i ∈ N, i.e., the terminal device, the edge server, or the cloud server, respectively;
the objective function is defined as:
Minimize T_{1:n}
further, the DQN algorithm obtains a state s in a cloud edge environment during operation, then selects an action a through an epsilon-greedy strategy, and then receives an award value r and a next state s' obtained after the environment executes the action a; then the DQN algorithm stores (s, a, r, s') obtained in each step into an experience storage pool; generally, the capacity of the empirical storage pool in the DQN algorithm is preset, and when a storage threshold is reached, the neural network parameters are updated, and the loss function of the neural network is as follows:
Loss = (r + γ max Q(s′, a′; ω′) − Q(s, a; ω))²
wherein γ is the discount factor; Q(s, a; ω) is the output of "EvalNet", which calculates the Q value of the current state-action pair, and ω is the DNN weight of "EvalNet"; max Q(s′, a′; ω′) is the output of "TargetNet", which calculates the maximum Q value when action a′ is performed in the next state s′, and ω′ is the DNN weight of "TargetNet".
Further, the step S2 specifically includes:
first, the weight ω of the EvalNet neural network and the weight ω′ = ω of the TargetNet neural network are randomly initialized (line 3);
for each training period, the current offloading scheme DEP_cur, the current state s and the current response time T are initialized;
in the algorithm training process, the execution position of each subtask is sequentially determined through an epsilon-greedy strategy, one of all unloading schemes is randomly selected according to the probability of epsilon, and the unloading scheme with the maximum Q value in EvalNet is selected according to the probability of 1-epsilon;
next, performing action a and obtaining a new response time T ', calculating the reward r and updating the current response time T, observing the next state s';
then, putting (s, a, r, s') into an experience storage pool, randomly extracting m samples from the experience storage pool, and calculating a target value;
next, the Loss is obtained according to the mean square error loss function and the weight ω of the EvalNet neural network is updated by the Adam optimizer, and the weight ω′ = ω of the TargetNet neural network is updated after the set number C of iterations is reached;
and finally, updating the current state.
Compared with the prior art, the invention has the following beneficial effects:
the method can be well adapted to different cloud edge environments, and the unloading scheme can be generated efficiently.
Drawings
FIG. 1 is a system model for task offloading in one embodiment of the invention;
FIG. 2 is an example of a task offloading process in one embodiment of the invention;
FIG. 3 is a DQN algorithm architecture diagram in an embodiment of the invention;
FIG. 4 is a diagram of DAGs with different task sizes in an embodiment of the invention;
FIG. 5 is a comparison of DODQ and ideal performance under different scenarios in an embodiment of the invention;
FIG. 6 illustrates the accuracy of the unloading operation at different task sizes for DODQ according to an embodiment of the invention;
fig. 7 is a comparison of the performance of the DODQ in different scenarios with other classical methods in an embodiment of the present invention.
Detailed Description
The invention is further explained below with reference to the drawings and the embodiments.
In this embodiment, referring to FIG. 1, the system model for task offloading is composed of a Mobile Device (MD), an Edge Server (ES), and a Cloud Server (CS). The set of computing nodes is denoted by V = {MD, ES, CS}, and the computing power of each computing node by f_k (k ∈ V). Typically, the mobile device has the weakest computing power due to its size and weight constraints, and the edge server is more powerful than the terminal device but less powerful than the remote cloud server. The data transmission rate between different computing nodes is denoted by v_{k,l} (k, l ∈ V); because the edge server is deployed close to the mobile device while the remote cloud server is far away, the data transmission rate between the mobile device and the edge server is higher than that between the remote cloud server and the mobile device or the edge server.
As shown in FIG. 1, an application can be represented by a directed acyclic graph G = (N, E), where N = {1, 2, ..., n} represents the set of subtasks, n is the number of subtasks, and the computation amount of each subtask is represented by c_i (i ∈ N). E = {e_{i,j} | i, j ∈ N, i < j} represents the set of dependency directed edges between subtasks; for a directed edge e_{i,j} ∈ E, subtask i is called a direct predecessor task of subtask j, and subtask j is a direct successor task of subtask i. In addition, each directed edge e_{i,j} ∈ E is associated with a weight d_{i,j}, which represents the amount of data transferred from subtask i to subtask j. We denote the direct predecessor task set and the direct successor task set of subtask i by pre(i) and suc(i); a subtask can only start executing after receiving the processing results of all its predecessor tasks. For example, the direct predecessor set and direct successor set of subtask 9 are pre(9) = {2, 4, 5} and suc(9) = {10}, respectively, so subtask 9 must receive the processing results of subtasks 2, 4, and 5 before starting execution. Furthermore, since subtask 10 has no direct successor, we call subtask 10 the end task.
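For illustration only, the following Python sketch shows one possible in-memory representation of this task model; the TaskGraph class, the edge list, and the placeholder computation amounts are assumptions introduced here (only the edges into and out of subtask 9 follow the example above), not part of the disclosed embodiment.

```python
# Minimal sketch of the task model G = (N, E): a hypothetical 10-task DAG whose
# pre(9) = {2, 4, 5} and suc(9) = {10} match the example in the description.
from collections import defaultdict

class TaskGraph:
    def __init__(self, n, comp, edges):
        self.n = n                      # number of subtasks
        self.comp = comp                # c_i: computation amount of each subtask (Mcycles)
        self.data = {}                  # d_{i,j}: data transferred from i to j (KB)
        self.pre = defaultdict(set)     # pre(i): direct predecessor tasks
        self.suc = defaultdict(set)     # suc(i): direct successor tasks
        for i, j, d in edges:
            self.data[(i, j)] = d
            self.pre[j].add(i)
            self.suc[i].add(j)

# Hypothetical edge list (i, j, d_ij); only the edges around task 9 follow the text above.
edges = [(1, 2, 300), (1, 3, 200), (1, 4, 250), (1, 5, 150), (1, 7, 180),
         (2, 9, 400), (4, 9, 350), (5, 9, 100),
         (3, 6, 120), (6, 10, 90), (7, 8, 220), (8, 10, 140), (9, 10, 500)]
comp = {i: 100 for i in range(1, 11)}   # c_i, placeholder values
G = TaskGraph(10, comp, edges)

assert G.pre[9] == {2, 4, 5} and G.suc[9] == {10}
assert not G.suc[10]                    # task 10 has no successor: the end task
```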
In this embodiment, subtasks can be processed locally or offloaded to an edge server or a remote cloud server for execution. Thus, we define a binary variable x_{i,k} to indicate the offloading scheme: x_{i,k} = 1 indicates that subtask i is assigned to computing node k, and x_{i,k} = 0 otherwise. Since each subtask can only be assigned to one computing node in the network, the following constraint holds:
Σ_{k∈V} x_{i,k} = 1,  ∀i ∈ N
In addition, two conditions need to be satisfied for any subtask j ∈ N to start execution. First, the allocated computing node must be available, i.e., no other subtask is currently executing on it; the available time T_j^avail of the node assigned to subtask j should satisfy the following constraint:
T_j^avail ≥ x_{i,k} · x_{j,k} · T_i^end,  ∀i ∈ N, i ≠ j, ∀k ∈ V
where T_i^end is the completion time of subtask i.
Second, subtask j should be ready, i.e., it has received the processing results of all its predecessor subtasks. The ready time T_j^ready of subtask j is defined as:
T_j^ready = max_{i∈pre(j)} ( T_i^end + d_{i,j} / v_{k,l} )
If subtask j and one of its predecessor subtasks i ∈ pre(j) are assigned to different computing nodes k and l, respectively, we need to take the communication delay d_{i,j}/v_{k,l} into account; otherwise, data transfer between them is realized through shared memory without communication delay, and in this case the second term on the right-hand side of the constraint is zero.
Considering the above two conditions together, the start time of subtask j is defined as:
T_j^start = max( T_j^avail , T_j^ready )
Further, the end time of subtask j is defined as:
T_j^end = T_j^start + Σ_{k∈V} x_{j,k} · c_j / f_k
notably, the tasks in the directed acyclic task graph (DAG) are executed in parallel, and thus the delay of the application will be equal to the longest time it takes to complete the tasks in the task dependency chain. By D 1:t Representing the set of all subtasks that completed successfully at the t-th time step. Further, the cumulative execution delay T of the application at the T-th time step 1:t Can be defined as:
Figure BDA0003784072130000091
an application is considered complete if and only if all its n subtasks are successfully completed, when all n tasks of an application are successfully completed, D is then performed 1:n =1, 2,. Eta, n. Thus, the total execution delay T of the application 1:n Can be calculated by the following formula:
Figure BDA0003784072130000092
for an application with n tasks, the offload scheme of the application is represented by DEP = (DEP (1), DEP (2),.. DEP (n)). Where dep (i) e {1,2,3} represents the execution location of task i e N, i.e. end device, edge server, and cloud server, respectively. Since the mobile application is deployed on the terminal device, it is assumed that the 1 st task of the application is executed on the mobile device, i.e., dep (1) =1.
Different offloading schemes DEP lead to different values of T_{1:n}; in order to achieve a better computation offloading effect in the cloud-edge collaborative environment, the optimization goal is to minimize T_{1:n}. Thus, the objective function is defined as:
Minimize T_{1:n}
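To make the timing model concrete, the following Python sketch evaluates T_{1:n} for a complete offloading scheme under the model above; it builds on the hypothetical TaskGraph sketch given earlier, and the function name, the dictionaries F and V, and all numeric values are illustrative assumptions rather than the patent's implementation.

```python
# Minimal sketch: evaluate T_{1:n} for a full offloading scheme dep on the TaskGraph G
# defined earlier. F[k] is the computing power f_k, V[(k, l)] the transfer rate v_{k,l}.
def total_delay(G, dep, F, V):
    end = {}                                  # T_i^end of each finished subtask
    available = {k: 0.0 for k in F}           # earliest time each node is free
    for j in range(1, G.n + 1):               # tasks are numbered in topological order (i < j)
        k = dep[j]
        ready = 0.0                           # T_j^ready: all predecessor results received
        for i in G.pre[j]:
            comm = 0.0 if dep[i] == k else G.data[(i, j)] / V[(dep[i], k)]
            ready = max(ready, end[i] + comm)
        start = max(available[k], ready)      # both availability and readiness must hold
        end[j] = start + G.comp[j] / F[k]     # execution time c_j / f_k
        available[k] = end[j]                 # the node is busy until task j ends
    return max(end.values())                  # T_{1:n} = latest completion time

# Example usage with hypothetical parameters (units: Mcycles, Mcycles/s, KB, KB/s).
F = {1: 1000.0, 2: 4000.0, 3: 10000.0}        # MD, ES, CS computing power
V = {(k, l): 5000.0 for k in F for l in F if k != l}
dep = {i: 1 for i in range(1, 11)}            # run every task locally on the mobile device
print(total_delay(G, dep, F, V))
```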
based on the above definition, the MEC runtime environment is composed of 2-tuples<F,V>And (4) forming. As shown in table 1.1, wherein F = (F) MD ,f ES ,f CS ) Representing the computing power of different computing nodes. V = (V) MD,ES ,v MD,CS ,v ES,CS ) Representing the data transfer rate between the compute nodes. DEP = (DEP (1), DEP (2),.. DEP (n)) represents an unloading scheme of an application, where DEP (i) represents an execution position of an ith task. T is a unit of 1:n Represents the application response time, T, under the corresponding offload scenario 1:n Smaller means better computational offload.
TABLE 1.1 runtime Environment and application offload schemes including Performance indicators
Figure BDA0003784072130000093
In this embodiment, a method for offloading real-time dependent tasks with Deep Q-Networks (DODQ) based on deep reinforcement learning is provided, including the following steps:
step S1: training an unloading operation Q value prediction model by using a DQN algorithm in a runtime environment; the training data includes the computational power F of the compute nodes, the data transfer rate V between the compute nodes, the offload scheme of the application and the corresponding application delay, as shown in table 3. The unloading operation Q value prediction model can evaluate the Q values of unloading operations in different operating environments, so that when the current system state (the computing capacity of the computing nodes, the data transmission rate among the computing nodes and the unloading scheme of the application) is input, the model can accurately predict the Q values of different unloading operations;
step S2: the unloading operation Q value prediction model predicts the Q values of different unloading operations according to the computing capacity of the computing nodes, the transmission rate among the computing nodes and the applied unloading scheme, and then selects proper unloading operation by comparing the corresponding Q values;
and step S3: and repeating the step S2, and gradually determining the execution position for each task through feedback iteration.
In this embodiment, DEP_t = (dep_t(1), dep_t(2), ..., dep_t(n)) represents the offloading scheme of the application at time step t, where dep_t(i) ∈ {0, 1, 2, 3} (i ∈ N) corresponds to the execution position of task i at that time. In particular, dep_t(i) = 0 means that task i has not been executed by time step t. For example, as shown in FIG. 2, DEP_3 = (1, 2, 2, 0, ..., 0) indicates that at the 3rd time step task 1 is executed on the terminal device, tasks 2 and 3 are executed on the edge server, and the subsequent tasks have not been executed yet.
Table 2.1 shows the offloading process of an application (n = 10) in a certain scenario. First, the 1st task of the application is executed on the mobile device; at this time the corresponding DEP_1 = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0) and T_{1:1} = 0.069 s. Next, task 2 is offloaded to the edge server for execution; at this time the corresponding DEP_2 = (1, 2, 0, 0, 0, 0, 0, 0, 0, 0) and T_{1:2} = 0.183 s. Similarly, the execution position of each task is determined in turn. Finally, once the execution positions of all ten tasks have been determined, all tasks of the application have been executed, and the response time of the application T_{1:n} is obtained as 0.54 s.
TABLE 2.1 application uninstallation procedure under certain scenarios
[Table 2.1 is provided as an image in the original publication.]
Reinforcement learning is an algorithmic model that can make optimal decisions by self-learning in a particular scenario, and it models all real-world problems by abstracting them into the interactive process between agents and the environment. At each time step of the interactive process, the agent receives the state of the environment and selects a corresponding response action. Then, at the next time step, the agent obtains a reward value and a new state based on the feedback from the environment. Since the goal of reinforcement learning is to maximize the cumulative reward, it is typically modeled using a Markov Decision Process (MDP). More specifically, the MDP may be defined as a 4-tuple, denoted as < S, A, T, R >, where S is the state space, A is the action space, T is the state transition function, and R is the reward function.
Based on the problem definition, in the proposed computational offload problem, the corresponding state space S, action space a, state transition function T and reward function R are defined as follows:
state space: the state space is denoted S, where S ∈ S denotes a potential state. To comprehensively consider the characteristics of the operating environment and the unloading scheme, s is defined as a triple<F,V,DEP cur >. Thus, s represents the current system state of the runtime environment, i.e., the current system state consists of the computing power of the different computing nodes, the data transfer rate between the computing nodes, and the currently applied offload scheme.
Action space: the action space is represented as A = {a_1, a_2, a_3}, and an action a (a ∈ A) determines the execution position of the task currently to be decided. Specifically, a ∈ {a_1, a_2, a_3} corresponds to executing the current task on the terminal device, offloading it to the edge server for execution, or offloading it to the cloud server for execution, respectively.
State transition function: the state transition function is denoted T(s, a), and its return value is the next state reached by performing action a in state s. For example, as shown in Table 2.1, when action a_2 is performed in state s = <F, V, (1, 0, 0, ..., 0)>, the next state is s′ = <F, V, (1, 2, 0, ..., 0)>; it can be observed that task 2 is offloaded to the edge server for execution.
The reward function: to guide the RL agent to minimize the response time of the application, the reward function is defined as:
R(s, a) = T − T′
where R(s, a) represents the reward received by the RL agent after performing action a in the current state s, T corresponds to the application delay under the current offloading scheme, and T′ represents the application delay after performing action a. For example, as shown in Table 2.1, offloading task 2 to the edge server yields the reward R = R(s, a) = T − T′ = T_{1:1} − T_{1:2} = −0.114.
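The state, action, and reward definitions above can be wrapped into a small environment sketch; the OffloadEnv class below, its state encoding (concatenating F, V, and DEP_cur into one vector), and its method names are illustrative assumptions building on the earlier sketches, not the patent's implementation.

```python
# Minimal sketch of the MDP <S, A, T, R>: state s = <F, V, DEP_cur>, action index a
# (0/1/2 -> terminal device / edge server / cloud server) places the next undecided
# task, and the reward is R(s, a) = T - T'. Reuses the hypothetical G, F, V from above.
import numpy as np

class OffloadEnv:
    def __init__(self, G, F, V):
        self.G, self.F, self.V = G, F, V

    def _state(self):
        # current system state: computing power, transfer rates, current offloading scheme
        return np.array(list(self.F.values()) +
                        [self.V[p] for p in sorted(self.V)] +
                        self.dep[1:], dtype=np.float32)

    def _partial_delay(self):
        # cumulative delay T_{1:t} over the tasks decided so far (same timing model as above)
        end, avail = {}, {k: 0.0 for k in self.F}
        for j in range(1, self.next_task):
            k = self.dep[j]
            ready = 0.0
            for i in self.G.pre[j]:
                comm = 0.0 if self.dep[i] == k else self.G.data[(i, j)] / self.V[(self.dep[i], k)]
                ready = max(ready, end[i] + comm)
            start = max(avail[k], ready)
            end[j] = start + self.G.comp[j] / self.F[k]
            avail[k] = end[j]
        return max(end.values()) if end else 0.0

    def reset(self):
        self.dep = [0] * (self.G.n + 1)   # index 0 unused; dep_t(i) = 0: not yet executed
        self.dep[1] = 1                   # the 1st task always runs on the mobile device
        self.next_task = 2
        self.T = self._partial_delay()    # T_{1:1}
        return self._state()

    def step(self, action):
        self.dep[self.next_task] = action + 1   # record the execution position of the task
        self.next_task += 1
        T_new = self._partial_delay()
        reward = self.T - T_new                 # R(s, a) = T - T'
        self.T = T_new
        done = self.next_task > self.G.n        # all tasks placed -> episode ends
        return self._state(), reward, done
```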
Referring to FIG. 3, which shows the DQN algorithm architecture, the DQN agent obtains a state s (s ∈ S) in the runtime cloud-edge environment, selects an action a (a ∈ A) through an ε-greedy policy, and then receives the reward value r and the next state s′ obtained after the environment executes action a. The DQN algorithm then deposits the (s, a, r, s′) from each step into an experience storage pool. Generally, the capacity of the experience storage pool in the DQN algorithm is preset; when the storage threshold is reached, the neural network parameters are updated, with the following loss function:
Loss = (r + γ max Q(s′, a′; ω′) − Q(s, a; ω))²
wherein γ is the discount factor; Q(s, a; ω) is the output of "EvalNet", which calculates the Q value of the current state-action pair, and ω is the DNN weight of "EvalNet"; max Q(s′, a′; ω′) is the output of "TargetNet", which calculates the maximum Q value when action a′ is performed in the next state s′, and ω′ is the DNN weight of "TargetNet".
[Algorithm 1 is provided as an image in the original publication.]
Based on the above definitions, the DQN algorithm is used to evaluate the Q values of different offloading operations. The key steps are given in Algorithm 1. First, the weight ω of the EvalNet neural network and the weight ω′ = ω of the TargetNet neural network are randomly initialized (line 3). For each training episode, the current offloading scheme DEP_cur, the current state s, and the current response time T are initialized (lines 5-6). During training, the execution position of each subtask is determined in turn through an ε-greedy strategy: with probability ε one of all offloading operations is selected at random, and with probability 1−ε the offloading operation with the maximum Q value in EvalNet is selected (line 8). Next, action a is performed and a new response time T′ is obtained, the reward r is calculated, the current response time T is updated, and the next state s′ is observed (line 9). Then, (s, a, r, s′) is put into the experience storage pool, m samples are randomly drawn from the pool, and the target values are calculated; these samples may come from different cloud-edge runtime environments to ensure the sufficiency of learning (lines 10-11). Next, the Loss is obtained according to the mean square error loss function, the weight ω of the EvalNet neural network is updated using the Adam optimizer, and the weight ω′ = ω of the TargetNet neural network is updated after every C iterations (lines 12-14). Finally, the current state is updated (line 15).
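The training procedure described above can be sketched as follows with TensorFlow 2.x / Keras (the embodiment reports TensorFlow 2.3.0). The memory-pool capacity, batch size, discount factor, learning rate, network sizes, and episode count follow the values quoted later in this description; the ε value and the target-update interval C = 200 are placeholders since the text does not state them, and the environment is assumed to expose the OffloadEnv interface sketched above. This is an illustrative sketch, not the patent's actual code.

```python
# Minimal sketch of the DQN training loop (Algorithm 1): EvalNet / TargetNet, an experience
# storage pool, epsilon-greedy action selection, and a mean-squared-error loss.
import random
from collections import deque

import numpy as np
import tensorflow as tf

def build_net(state_dim, n_actions=3):
    # fully connected DNN: input layer, two hidden layers with 128 neurons, output layer
    return tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(state_dim,)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(n_actions),            # one Q value per offloading operation
    ])

def train_dqn(env, state_dim, episodes=10000, n_actions=3, memory_size=15000,
              batch_size=64, gamma=0.9, lr=0.001, target_update=200, eps=0.1):
    eval_net = build_net(state_dim, n_actions)       # EvalNet, weights omega
    target_net = build_net(state_dim, n_actions)     # TargetNet, weights omega'
    target_net.set_weights(eval_net.get_weights())   # omega' = omega (line 3)
    optimizer = tf.keras.optimizers.Adam(lr)
    memory = deque(maxlen=memory_size)               # experience storage pool
    step = 0
    for _ in range(episodes):
        s = env.reset()                              # DEP_cur, state and response time reset
        done = False
        while not done:
            # epsilon-greedy selection of the offloading operation (line 8)
            if random.random() < eps:
                a = random.randrange(n_actions)
            else:
                a = int(np.argmax(eval_net(s[None, :])[0]))
            s2, r, done = env.step(a)                # reward r = T - T' (line 9)
            memory.append((s, a, r, s2, float(done)))   # store the transition (line 10)
            s = s2
            if len(memory) < batch_size:
                continue
            bs, ba, br, bs2, bd = map(np.array, zip(*random.sample(memory, batch_size)))
            # target value: r + gamma * max_a' Q(s', a'; omega'), with terminal states masked
            target = br.astype(np.float32) + gamma * (1.0 - bd.astype(np.float32)) \
                     * tf.reduce_max(target_net(bs2), axis=1)
            with tf.GradientTape() as tape:
                q = tf.reduce_sum(eval_net(bs) * tf.one_hot(ba, n_actions), axis=1)
                loss = tf.reduce_mean(tf.square(target - q))   # mean squared error loss
            grads = tape.gradient(loss, eval_net.trainable_variables)
            optimizer.apply_gradients(zip(grads, eval_net.trainable_variables))
            step += 1
            if step % target_update == 0:            # update omega' = omega every C iterations
                target_net.set_weights(eval_net.get_weights())
    return eval_net
```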
2.2 runtime decisions for offload operations
[Algorithm 2 is provided as an image in the original publication.]
The decision process for the offloading operation is performed at run-time, with the main steps given in algorithm 2. For each task of the mobile application, the current offload scenario is first initialized (lines 9-10). Next, the Q value of each unload operation is evaluated by calling a Q value prediction model (lines 12-14). Finally, the unload operation with the largest Q value is selected and the unload scheme is updated (lines 15-17). Thus, in the decision making process, the execution position of each task is determined in turn.
By iterating this feedback control process, an optimal offloading scheme can be generated step by step in the runtime cloud-edge environment. The feedback control operation continues until all tasks of the mobile application are completed.
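As an illustration of this runtime feedback loop, the sketch below greedily applies the offloading operation with the highest predicted Q value for each task in turn; it assumes the trained EvalNet and the OffloadEnv interface from the sketches above, not the patent's actual code.

```python
# Minimal sketch of the runtime decision process (Algorithm 2): for each task in turn,
# predict the Q value of every offloading operation and execute the one with the largest Q.
import numpy as np

def decide_offloading(env, eval_net):
    s = env.reset()                                  # dep(1) = 1: the app starts on the device
    done = False
    while not done:
        q_values = eval_net(s[None, :]).numpy()[0]   # predicted Q value of each operation
        a = int(np.argmax(q_values))                 # 0/1/2 -> device / edge server / cloud
        s, _, done = env.step(a)                     # update DEP_cur and move to the next task
    return list(env.dep[1:]), env.T                  # final offloading scheme and T_{1:n}
```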
Example 1:
in this embodiment, in order to simulate the diversity of the application programs, four task graphs with the task scale n = {10,15,20,25} are constructed, and the structure of the constructed task graph is shown in fig. 4.
For each application G = (N, E), the computation amount c_i (i ∈ N) of each subtask follows a uniform distribution within [50, 500] Mcycles, and for each e_{i,j} ∈ E the amount of data transferred from subtask i to subtask j follows a uniform distribution within [0, 1000] KB. Furthermore, the computing power F = (f_MD, f_ES, f_CS) of the different computing nodes and the data transmission rates V = (v_{MD,ES}, v_{MD,CS}, v_{ES,CS}) between the computing nodes also follow uniform distributions. Table 3.1 lists the detailed settings of the simulation parameters.
TABLE 3.1 setting of simulation parameters
[Table 3.1 is provided as an image in the original publication.]
The proposed DODQ is implemented based on TensorFlow 2.3.0. In DODQ, a fully connected DNN is used that consists of one input layer, two hidden layers, and one output layer, where both hidden layers have 128 hidden neurons. The memory pool capacity M, the training batch size m, the discount factor γ, and the learning rate of the Adam optimizer are set to 15000, 64, 0.9, and 0.001, respectively. Further, different training periods are set for task graphs of different sizes: the training periods for task sizes n = 10, 15, 20, 25 are set to 10000, 15000, 20000, and 25000, respectively.
Based on the above settings, 10 different cloud-edge environment scenarios (F, V) were simulated, as shown in Table 3.2, and the runtime decision algorithm was executed for each scenario to realize adaptive task offloading in different cloud-edge environments.
TABLE 3.2 different scenarios for cloud edge Environment
[Table 3.2 is provided as an image in the original publication.]
In this embodiment, the effectiveness of DODQ for adaptive task offloading is evaluated under the different scenarios described in Table 3.2. In particular, we compare the performance of DODQ with the ideal scheme under these scenarios. By combining management experience and local verification, the ideal offloading scheme with the shortest response time can be obtained for each scenario. However, finding the ideal scheme is not feasible in practice, since it requires exhausting all possibilities and therefore has unacceptable complexity. For example, for an application with n tasks, after the terminal device starts the application (i.e., the 1st task is executed on the terminal device), we need to determine the execution positions of the remaining n−1 tasks (i.e., execute each task on the terminal device, or offload it to the edge server or cloud server), so under given F and V we would have to test the performance of up to 3^(n−1) offloading schemes. As shown in FIG. 5, in the different scenarios, DODQ achieves a response time similar to that of the ideal scheme. When n = 10, DODQ achieves the best performance in scenarios 1, 5, 6, and 8. In the other cases, the performance gap between DODQ and the ideal scheme always remains below 4%. The results verify that DODQ achieves optimal/near-optimal task offloading performance in different cloud-edge environments.
Taking scenario 1 (n = 10) as an example, the task offloading process using DODQ is explained. As shown in Table 3.3, when the offloading scheme is initialized to DEP_cur = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0), the predicted Q value of offloading operation a = 2 is higher than those of the other offloading operations. Thus, the task currently to be decided (i.e., the second subtask) is offloaded to the edge server for execution, and the offloading scheme DEP_cur is updated to (1, 2, 0, 0, 0, 0, 0, 0, 0, 0). Next, in each step the offloading operation with the highest predicted Q value is selected and executed. Finally, once the execution positions of all tasks in DEP_cur have been determined, no further offloading operation needs to be performed, and the decision process of task offloading is complete.
Table 3.3 calculation offload procedure in scenario 1 (n = 10)
[Table 3.3 is provided as an image in the original publication.]
In this embodiment, the Action Accuracy Rate (AAR) is used to measure the correctness of the offloading operations in the decision process, and is defined as:
AAR = A / O
where O and A are the number of all offloading operations that need to be performed and the number of correct offloading operations, respectively. An offloading operation is considered correct when it brings the current offloading scheme closer to the ideal scheme.
As shown in fig. 6, the DODQ can make an unloading operation decision with high accuracy under a Q value prediction model based on DQN. Specifically, the average AAR can reach 94.8% at different task scales.
In the present embodiment, the performance of DODQ is compared to Rule-based, machine-learning (ML-based), and Q-learning methods to further evaluate the advantage of DODQ in task offloading. Rule-based selects the best DAG partition point and offloads the remaining tasks from the mobile device to the cloud. And the ML-based method adopts particle swarm optimization and genetic algorithm (PSO-GA) to search the unloading scheme according to the predicted response time of the application program under different environments. Q-learning stores each state action pair and its corresponding Q value in a Q table to maximize the cumulative return of offloading schemes, which can handle offloading problems with small scale state space well. When the runtime environment changes, Q-learning needs to retrain the task-offloaded decision model to better adapt to the new environment.
As shown in FIG. 7, the DODQ response time is 13-30% better than Rule-based under different scenarios. Rule-based involves segmentation rules set by experts, but these predefined rules do not apply well to different scenarios. Therefore, the rule-based method cannot effectively adapt to the task unloading problem in the dynamic cloud edge environment. Especially when the environment is complex (e.g., n = 25), the advantage of DODQ compared to Rule-based becomes very significant. In addition, compared with ML-based, the performance of DODQ is improved by 6% -8%. In particular, we evaluated the prediction accuracy of response times in ML-based, with only 72.5% accuracy, allowing 15% model error. This is because ML-based methods require enough training data to develop an accurate prediction model. However, in the absence of training data support, task offloading is inefficient due to inaccurate predictions. Furthermore, Q-learning achieves a response time comparable to DODQ in different scenarios without considering training time. However, when the runtime environment changes, Q-learning needs to retrain the decision model for task offloading for the new environment, which results in too long training time.
Furthermore, we evaluated the convergence times of the DODQ, Rule-based, ML-based, and Q-learning methods at different task scales; the results are shown in Table 3.4. In particular, Rule-based involves no training time, since its rules are predefined; however, as shown in FIG. 7, Rule-based has the worst performance in reducing response time. Among these methods, ML-based, which searches for the offloading scheme using DNN-predicted response times, requires a large number of offloading-scheme performance predictions and therefore has the longest convergence time. In contrast, DODQ adapts well to dynamic cloud-edge environments and generates optimal/near-optimal offloading schemes within milliseconds. In addition, Q-learning suffers from a high-dimensional state-space problem, especially when the task scale becomes large, because it records all state-action pairs and their corresponding Q values in a Q-table. Meanwhile, when the runtime environment changes, Q-learning needs to retrain the decision model for task offloading to obtain better adaptability. These factors make the convergence time of Q-learning too long; for example, when the number of tasks is 25, the convergence time of Q-learning exceeds 1000 seconds, which is much longer than that of DODQ. The above results demonstrate the advantage of DODQ in achieving low response time and high training efficiency.
TABLE 3.4 Convergence time of different methods at different task Scale
[Table 3.4 is provided as an image in the original publication.]
The above description is only a preferred embodiment of the present invention, and all the equivalent changes and modifications made according to the claims of the present invention should be covered by the present invention.

Claims (5)

1. A real-time dependent task unloading method based on deep reinforcement learning is characterized by comprising the following steps:
step S1: based on a system model of task unloading, training an unloading operation Q value prediction model by using a DQN algorithm in a runtime environment;
step S2: the unloading operation Q value prediction model predicts Q values of different unloading operations according to the computing capacity of the computing nodes, the transmission rate among the computing nodes and the applied unloading scheme, and then selects proper unloading operation by comparing the corresponding Q values;
and step S3: and repeating the step S2, and gradually determining the execution position for each task through feedback iteration.
2. The deep reinforcement learning-based real-time dependent task offloading method according to claim 1, wherein the task offloading system model includes a system model and a task model, and specifically includes:
the system model comprises a mobile device MD, an edge server ES and a cloud server CS; the set of computing nodes is represented by V = {MD, ES, CS}, the computing capacity of each computing node is represented by f_k (k ∈ V), and the data transmission rate between different computing nodes is represented by v_{k,l} (k, l ∈ V);
the task model is specifically as follows: an application is represented by a directed acyclic graph G = (N, E), where N = {1, 2, ..., n} represents the set of subtasks, n is the number of subtasks, and the computation amount of each subtask is represented by c_i (i ∈ N); E = {e_{i,j} | i, j ∈ N, i < j} represents the set of dependency directed edges between subtasks, and for a directed edge e_{i,j} ∈ E, subtask i is called a direct predecessor task of subtask j and subtask j is a direct successor task of subtask i; in addition, each directed edge e_{i,j} ∈ E is associated with a weight d_{i,j}, where d_{i,j} represents the amount of data transferred from subtask i to subtask j; the direct predecessor task set and direct successor task set of subtask i are represented by pre(i) and suc(i), and a subtask can only start to execute after receiving the processing results of all its predecessor tasks.
3. The deep reinforcement learning-based real-time dependent task offloading method of claim 2, wherein a binary variable x_{i,k} is defined to indicate the offloading scheme: x_{i,k} = 1 denotes that subtask i is assigned to computing node k, and x_{i,k} = 0 otherwise; since each subtask can only be assigned to one computing node in the network, the following constraint holds:
Σ_{k∈V} x_{i,k} = 1,  ∀i ∈ N
in addition, any subtask j ∈ N can start execution only when two conditions are met; first, the assigned computing node must be available, i.e., no other subtask is currently executing on that node; the available time T_j^avail of the node assigned to subtask j should satisfy the following constraint:
T_j^avail ≥ x_{i,k} · x_{j,k} · T_i^end,  ∀i ∈ N, i ≠ j, ∀k ∈ V
wherein T_i^end is the completion time of subtask i;
second, subtask j should be ready, i.e., it has received the processing results of all its predecessor subtasks; the ready time T_j^ready of subtask j is defined as:
T_j^ready = max_{i∈pre(j)} ( T_i^end + d_{i,j} / v_{k,l} )
where, if subtask j and one of its predecessor subtasks i ∈ pre(j) are assigned to different computing nodes k and l respectively, the communication delay d_{i,j}/v_{k,l} needs to be taken into account; otherwise data is transferred between them through shared memory without communication delay, and in this case the second term on the right-hand side of the constraint is zero;
considering the above two conditions together, the start time of subtask j is defined as:
T_j^start = max( T_j^avail , T_j^ready )
and the end time of subtask j is defined as:
T_j^end = T_j^start + Σ_{k∈V} x_{j,k} · c_j / f_k
D_{1:t} denotes the set of all subtasks successfully completed by the t-th time step; the cumulative execution delay T_{1:t} of the application at the t-th time step is defined as:
T_{1:t} = max_{i∈D_{1:t}} T_i^end
an application is considered complete if and only if all of its n subtasks are successfully completed; when all n tasks of the application are successfully completed, D_{1:n} = {1, 2, ..., n}; thus, the total execution delay T_{1:n} of the application is calculated by the following formula:
T_{1:n} = max_{i∈N} T_i^end
for an application with n tasks, the offloading scheme of the application is represented by DEP = (dep(1), dep(2), ..., dep(n)), where dep(i) ∈ {1, 2, 3} represents the execution position of task i ∈ N, i.e., the terminal device, the edge server, or the cloud server, respectively;
the objective function is defined as:
Minimize T_{1:n}
4. The deep reinforcement learning-based real-time dependent task offloading method according to claim 1, wherein the DQN algorithm obtains a state s in the runtime cloud-edge environment, then selects an action a through an ε-greedy policy, and then receives the reward value r and the next state s′ obtained after the environment executes action a; the DQN algorithm then stores the (s, a, r, s′) obtained in each step into an experience storage pool; generally, the capacity of the experience storage pool in the DQN algorithm is preset, and when the storage threshold is reached, the neural network parameters are updated, with the loss function of the neural network as follows:
Loss = (r + γ max Q(s′, a′; ω′) − Q(s, a; ω))²
wherein γ is the discount factor; Q(s, a; ω) is the output of "EvalNet", which calculates the Q value of the current state-action pair, and ω is the DNN weight of "EvalNet"; max Q(s′, a′; ω′) is the output of "TargetNet", which calculates the maximum Q value when action a′ is performed in the next state s′, and ω′ is the DNN weight of "TargetNet".
5. The deep reinforcement learning-based real-time dependent task offloading method according to claim 1, wherein the step S2 specifically includes:
first, the weight ω of the EvalNet neural network and the weight ω′ = ω of the TargetNet neural network are randomly initialized (line 3);
for each training period, the current offloading scheme DEP_cur, the current state s and the current response time T are initialized;
in the algorithm training process, the execution position of each subtask is sequentially determined through an epsilon-greedy strategy, one of all unloading schemes is randomly selected according to the probability of epsilon, and the unloading scheme with the maximum Q value in EvalNet is selected according to the probability of 1-epsilon;
next, performing action a and obtaining a new response time T ', calculating a reward r and updating the current response time T, observing the next state s';
then, putting (s, a, r, s') into an experience storage pool, randomly extracting m samples from the experience storage pool, and calculating a target value;
next, the Loss is obtained according to the mean square error loss function, the weight ω of the EvalNet neural network is updated using the Adam optimizer, and the weight ω′ = ω of the TargetNet neural network is updated after the set number C of iterations is reached;
and finally, updating the current state.
CN202210937248.5A 2022-08-05 2022-08-05 Real-time dependency task unloading method based on deep reinforcement learning Pending CN115220818A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210937248.5A CN115220818A (en) 2022-08-05 2022-08-05 Real-time dependency task unloading method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210937248.5A CN115220818A (en) 2022-08-05 2022-08-05 Real-time dependency task unloading method based on deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN115220818A true CN115220818A (en) 2022-10-21

Family

ID=83615053

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210937248.5A Pending CN115220818A (en) 2022-08-05 2022-08-05 Real-time dependency task unloading method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN115220818A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115865564A (en) * 2022-11-17 2023-03-28 广州鲁邦通物联网科技股份有限公司 Edge computing gateway device supporting high-speed power line carrier communication

Similar Documents

Publication Publication Date Title
CN108958916B (en) Workflow unloading optimization method under mobile edge environment
CN112882815B (en) Multi-user edge calculation optimization scheduling method based on deep reinforcement learning
Zhu et al. A deep-reinforcement-learning-based optimization approach for real-time scheduling in cloud manufacturing
CN111813506A (en) Resource sensing calculation migration method, device and medium based on particle swarm algorithm
CN113867843B (en) Mobile edge computing task unloading method based on deep reinforcement learning
CN113568727A (en) Mobile edge calculation task allocation method based on deep reinforcement learning
CN114661466A (en) Task unloading method for intelligent workflow application in edge computing environment
Badri et al. A sample average approximation-based parallel algorithm for application placement in edge computing systems
CN115220818A (en) Real-time dependency task unloading method based on deep reinforcement learning
CN114090108B (en) Method and device for executing computing task, electronic equipment and storage medium
CN112445617B (en) Load strategy selection method and system based on mobile edge calculation
CN117436485A (en) Multi-exit point end-edge-cloud cooperative system and method based on trade-off time delay and precision
CN116954866A (en) Edge cloud task scheduling method and system based on deep reinforcement learning
CN117331693A (en) Task unloading method, device and equipment for edge cloud based on DAG
CN116828541A (en) Edge computing dependent task dynamic unloading method and system based on multi-agent reinforcement learning
CN110233763B (en) Virtual network embedding algorithm based on time sequence difference learning
CN114650321A (en) Task scheduling method for edge computing and edge computing terminal
CN114942799B (en) Workflow scheduling method based on reinforcement learning in cloud edge environment
CN111488208A (en) Edge cloud cooperative computing node scheduling optimization method based on variable step length bat algorithm
US11513866B1 (en) Method and system for managing resource utilization based on reinforcement learning
CN114138493A (en) Edge computing power resource scheduling method based on energy consumption perception
CN115150335A (en) Optimal flow segmentation method and system based on deep reinforcement learning
Li et al. Efficient data offloading using markovian decision on state reward action in edge computing
CN114302456A (en) Calculation unloading method for mobile edge calculation network considering task priority
CN116932164B (en) Multi-task scheduling method and system based on cloud platform

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination