CN115016858A - Task unloading method based on post-decision state deep reinforcement learning - Google Patents

Task unloading method based on post-decision state deep reinforcement learning

Info

Publication number
CN115016858A
CN115016858A
Authority
CN
China
Prior art keywords
task
state
post
action
decision
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210572305.4A
Other languages
Chinese (zh)
Other versions
CN115016858B (en)
Inventor
张竞哲
贺晓帆
张晨
周嘉曦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202210572305.4A priority Critical patent/CN115016858B/en
Publication of CN115016858A publication Critical patent/CN115016858A/en
Application granted granted Critical
Publication of CN115016858B publication Critical patent/CN115016858B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/44 Arrangements for executing specific programs
    • G06F 9/445 Program loading or initiating
    • G06F 9/44594 Unloading
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2209/00 Indexing scheme relating to G06F 9/00
    • G06F 2209/50 Indexing scheme relating to G06F 9/50
    • G06F 2209/509 Offload
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00 Reducing energy consumption in communication networks
    • Y02D 30/70 Reducing energy consumption in communication networks in wireless communication networks

Abstract

The invention discloses a task unloading method based on post-decision state deep reinforcement learning that can make decisions over every dimension of the unloading action, such as where the tasks are unloaded to and how many tasks are unloaded. For different optimization targets, the optimal strategy under each target is obtained by changing the cost function. The invention uses the experience replay mechanism of DQN, randomly selecting collected historical experiences as training samples to improve learning efficiency, and it further accelerates learning with a post-decision state learning framework and an additional warm-start process. The traditional post-decision learning framework learns efficiently but requires additional prior information. The disclosed method obtains the information required by traditional post-decision learning through an additional learning process and achieves fast convergence of the unloading method by combining the efficient post-decision state learning framework, the warm-start process and the experience replay mechanism.

Description

Task unloading method based on post-decision state deep reinforcement learning
Technical Field
The invention relates to the technical field of machine learning and distributed computing, in particular to a task unloading method based on post-decision state deep reinforcement learning.
Background
Against the background of explosive growth in computing demand and data scale, edge computing is widely applied to overcome the limited computing capability of terminal equipment. Edge computing is a paradigm in which tasks are offloaded to edge devices for processing. Mobile devices typically focus on reducing latency and energy consumption, so when the wireless channel conditions are poor a mobile device preferentially processes tasks on its local CPU, whereas when the channel conditions are good it tends to offload most of its tasks to the network edge for processing. If a task is unloaded to an unreliable server for processing, information such as the position and identity of the user may be leaked, which threatens user privacy. Therefore, the issue of privacy disclosure needs to be considered while balancing energy consumption.
On the other hand, as the scale of computing tasks keeps growing, distributed computing is also widely used in edge computing. The computational efficiency of a distributed computing system is susceptible to the computing power of individual nodes and to the communication environment: some nodes may take a long time to finish their computation and return results, producing the straggler effect, which adds computation delay and reduces efficiency. Coded computation is a framework that applies coding theory to distributed computing; by properly introducing redundancy through flexible coding techniques, the straggler effect can be effectively mitigated. Repetition coding is a simple and common coding scheme in which the same task is unloaded to several different nodes, so that the result is obtained as soon as any one of them completes the computation. However, when the channel conditions are poor, blindly replicating a task and unloading it to multiple servers for simultaneous processing not only wastes energy but also harms privacy protection. To balance requirements such as energy consumption and privacy protection, such problems can be modeled as a Markov decision process (MDP) with appropriate states, an action space and a cost function, and the optimal unloading strategy that minimizes the long-term cost can be solved with a reinforcement learning algorithm.
In practical situations, the state space of such Markov problems is usually large, and the efficiency of general reinforcement learning algorithms is low, which is not conducive to practical application.
Disclosure of Invention
The invention provides a task unloading method based on post-decision state deep reinforcement learning, which is used for solving or at least partially solving the technical problem of low task unloading efficiency in the prior art.
The invention discloses a task unloading method based on post-decision state deep reinforcement learning, which comprises the following steps:
S1: setting a state set, a post-decision state set and an action set, wherein the state set comprises a system state, the post-decision state set comprises a post-decision state, and the action set comprises an action to be taken;
S2: randomly initializing the starting state, specifically: initializing the state transition probability P^k(s̃ | s, a) of reaching the post-decision state s̃ after taking action a in state s, and the weight parameters of the evaluation network Q̃_eval and the target network Q̃_target, where k denotes the transition identifier from the state s to the post-decision state; performing a warm start of the evaluation network Q̃_eval with a Markov random problem corresponding to the target task, and setting the iteration counter to 1, wherein the experience buffer is used to store the state at a given moment, the action taken, the corresponding post-decision state, the cost incurred by taking the action, and the state at the next moment;
S3: selecting an action according to the strategy, wherein one action corresponds to one unloading scheme;
S4: observing the post-decision state, forming a group of experiences from the post-decision state, the cost incurred by taking the action of step S3 and the state at the next moment, and storing the group of experiences in the experience buffer;
S5: at certain intervals, updating P^k(s̃ | s, a); randomly selecting a batch of experiences from the experience buffer and updating the evaluation network Q̃_eval and the corresponding evaluation network function through experience replay; assigning the weight parameters of the updated evaluation network Q̃_eval to the target network Q̃_target and updating the corresponding target network function;
S6: adding 1 to the iteration counter and repeating steps S3-S5 until the evaluation network Q̃_eval converges, completing the warm start;
S7: setting the current iteration counter to 1, emptying the task buffer, reinitializing P^k(s̃ | s, a), and, starting from the evaluation network Q̃_eval obtained by the warm start, repeating steps S3-S6 for the target task until the evaluation network converges; based on the evaluation network Q̃_eval and the corresponding evaluation network function, the optimal unloading strategy in the different states is obtained.
In one embodiment, the system state in the state set in step S1 has the form

s_n = (b_n, h_n^1, ..., h_n^m),

where s_n is the system state at time n, defined by the channel states and the state of the task buffer; the task buffer b has i states in total, written b = {b_1, b_2, ..., b_i}, where b_1 and b_i denote the 1st and i-th states of the task buffer and b_n denotes the number of tasks in the task buffer at time n; the channel h has j states, written h = {h_1, h_2, ..., h_j}, where h_1 and h_j denote the 1st and j-th states of the channel; m denotes the number of edge servers, and h_n^1, ..., h_n^m denote the channel states of the 1st to m-th edge servers at time n.
In one embodiment, each action in the action set in step S1 corresponds to an unloading decision, and the action taken at time n is a_n. The unloading decision covers three cases: in the first, the p_n tasks stored in the task buffer are processed at the local CPU; in the second, no task is processed, i.e. p_n = 0; in the third, the p_n tasks in the task buffer are simultaneously unloaded to the k_n edge servers with the best channels for processing, where p_n is the number of tasks to be processed in the task buffer at time n, k_n is the number of edge servers processing tasks at time n, and k_n ≤ m.
In one embodiment, the strategy in step S3 is an ε-greedy strategy: with probability ε a random action is selected, and with probability 1-ε the action that minimizes the action-value function Q_eval in the current state is selected.
In one embodiment, the post-decision state in step S4 is the intermediate state reached after the current state takes the action and before the transition to the next state, and it is represented as

s̃_n = (b_n - p_n, h_n^1, ..., h_n^m),

where p_n is the number of tasks to be processed in the task buffer at time n, Δb_n denotes the number of newly arrived tasks, and s̃_n denotes the post-decision state after taking the action at time n;

the state at the next moment, s_{n+1}, i.e. the state at time n+1, is represented as

s_{n+1} = (min{b_n - p_n + Δb_n, b_max}, h_{n+1}^1, h_{n+1}^2, ..., h_{n+1}^m),

where b_max denotes the capacity of the task buffer and h_{n+1}^1, h_{n+1}^2, ..., h_{n+1}^m denote the channel states of the 1st, 2nd, ..., m-th edge servers at time n+1.
In one embodiment, step S4 includes:
acquiring the caching cost of tasks held in the task buffer, the privacy cost of unloading tasks to the edge servers, the energy consumption cost of processing tasks, and the overflow cost incurred when the task buffer overflows because its capacity is insufficient, wherein the energy consumption cost c_energy of processing tasks comprises the energy consumed by processing tasks at the local CPU and the energy consumed by unloading tasks to the edge servers for processing; the caching cost of tasks held in the task buffer is c_holding = b_n - p_n, the privacy cost of unloading tasks to the edge servers for processing is c_privacy = p_n, and the overflow cost incurred when the task buffer overflows because its capacity is insufficient is c_overflow = max{b_n - p_n + Δb_n - b_max, 0}, where b_max is the size of the task buffer, p_n is the number of tasks to be processed in the task buffer at time n, b_n denotes the number of tasks in the task buffer at time n, and Δb_n denotes the number of newly arrived tasks;

obtaining, from the caching cost, the privacy cost, the energy consumption cost and the overflow cost, the cost function c_k(s, a) from s_n to s̃_n and the cost function c_u(s̃, s') from s̃_n to s_{n+1}:

c_k(s, a) = η_1·c_holding + η_2·c_energy + η_3·c_privacy,
c_u(s̃, s') = η_4·c_overflow,

where η_1, η_2, η_3, η_4 are the corresponding weight coefficients, and k and u denote the state transition identifiers;

after a complete state transition has been observed, the current state, the action taken, the post-decision state, the cost incurred by the action and the state at the next moment form a group of experiences (s_n, a_n, s̃_n, c_n, s_{n+1}), which is stored in the experience buffer.
In one embodiment, when a task is processed at the local CPU, the energy consumed by a unit task is

e_local = κ·L^3·ζ^3/τ^2,

where κ is a CPU parameter, L is the task size, ζ is the CPU frequency parameter and τ is the time interval; when a task is unloaded to an edge server, the energy consumed by a unit task is

e_off = (τ·N_0·W/h)·(2^(L/(τ·W)) - 1),

where W is the bandwidth of the edge computing network, h is the channel power gain and N_0 is the noise power spectral density.
In one embodiment, P^k(s̃ | s, a) in step S5 is updated by a statistical updating method.
In one embodiment, when a single group of experiences is randomly selected from the experience buffer in step S5, the selected experience is (s_n, a_n, s̃_n, c_n, s_{n+1}); when a batch of experiences is replayed, the weight parameters of the evaluation network are updated by minimizing the loss of the following loss function:

Loss(θ) = [c_u(s̃_n, s_{n+1}) + γ·min_{a'} Q_target(s_{n+1}, a') - Q̃_eval(s̃_n, a_n; θ)]^2,

where γ denotes the discount factor, θ is the weight parameter of the evaluation network, c_u(s̃_n, s_{n+1}) denotes the cost function from the post-decision state s̃_n to the state s_{n+1} at time n+1, Q̃_eval(s̃_n, a_n; θ) is the output of the evaluation network Q̃_eval for the input post-decision state s̃_n and action a_n, and Q_target(s', a') is the output of the Q_target function for the input state s' and action a'. The corresponding Q_eval and Q_target functions are updated according to

Q(s, a) = Σ_{s̃} P^k(s̃ | s, a)·[c_k(s, a) + Q̃(s̃, a)],

where Q(s, a) denotes the value of the Q function in state s under action a, the Q function being either the evaluation function Q_eval or the target function Q_target, and Q̃(s̃, a) denotes the output of the evaluation network Q̃_eval or of the target network Q̃_target for the input post-decision state s̃ and action a.
Compared with the prior art, the invention has the following advantages and beneficial technical effects:
the invention updates the weight parameters of the evaluation network Q̃_eval and of the target network Q̃_target by introducing a deep neural network and using experience replay together with the minimization of a loss function; the additionally adopted warm-start learning process accelerates the updating of the deep network parameters. The invention is a reinforcement learning algorithm based on a post-decision state learning framework: whereas the traditional post-decision state learning framework requires additional prior information, the invention obtains this information through an additional learning process.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present invention, and that those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a task offloading method based on post-decision state deep reinforcement learning according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating a post-decision state according to an embodiment of the present invention;
FIG. 3 is a schematic processing framework diagram of a task offloading method based on post-decision state deep reinforcement learning according to an embodiment of the present invention;
FIG. 4 is a diagram of simulation results of a method according to an embodiment of the present invention.
Detailed Description
The invention aims to solve the problem of quickly obtaining the optimal decision in task unloading and provides a computation-unloading strategy learning framework based on post-decision state deep reinforcement learning. On the basis of the traditional deep learning algorithm DQN, a post-decision state learning framework and a warm-start process are introduced, thereby accelerating the convergence of the algorithm.
The main concept and innovation of the invention are as follows:
the invention relates to a task unloading method based on deep reinforcement learning, which can make decisions on all dimensions of unloading actions, such as unloading objects, unloading quantity and the like of tasks. And facing to different optimization targets, and realizing the optimal strategies under different targets by changing the cost function. The method utilizes an experience playback mechanism of the DQN, randomly selects collected historical experiences as training samples, and therefore learning efficiency can be improved. Meanwhile, a post-decision learning framework and an additional hot start process are utilized to accelerate the learning speed. The traditional post-decision state learning framework has higher learning efficiency but needs additional prior information, but the invention provides a task unloading algorithm based on the post-decision state deep reinforcement learning, the additional information needed in the traditional post-decision state learning is obtained by utilizing an additional learning process, and the rapid convergence of the algorithm is realized by utilizing the efficient post-decision learning framework, the hot start process and the experience playback mechanism.
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides a task unloading method based on post-decision state deep reinforcement learning, which comprises the following steps:
S1: setting a state set, a post-decision state set and an action set, wherein the state set comprises a system state, the post-decision state set comprises a post-decision state, and the action set comprises an action to be taken;
S2: randomly initializing the starting state, specifically: initializing the state transition probability P^k(s̃ | s, a) of reaching the post-decision state s̃ after taking action a in state s, and the weight parameters of the evaluation network Q̃_eval and the target network Q̃_target, where k denotes the transition identifier from the state s to the post-decision state; performing a warm start of the evaluation network Q̃_eval with a Markov random problem corresponding to the target task, and setting the iteration counter to 1, wherein the experience buffer is used to store the state at a given moment, the action taken, the corresponding post-decision state, the cost incurred by taking the action, and the state at the next moment;
S3: selecting an action according to the strategy, wherein one action corresponds to one unloading scheme;
S4: observing the post-decision state, forming a group of experiences from the post-decision state, the cost incurred by taking the action of step S3 and the state at the next moment, and storing the group of experiences in the experience buffer;
S5: at certain intervals, updating P^k(s̃ | s, a); randomly selecting a batch of experiences from the experience buffer and updating the evaluation network Q̃_eval and the corresponding evaluation network function through experience replay; assigning the weight parameters of the updated evaluation network Q̃_eval to the target network Q̃_target and updating the corresponding target network function;
S6: adding 1 to the iteration counter and repeating steps S3-S5 until the evaluation network Q̃_eval converges, completing the warm start;
S7: setting the current iteration counter to 1, emptying the task buffer, reinitializing P^k(s̃ | s, a), and, starting from the evaluation network Q̃_eval obtained by the warm start, repeating steps S3-S6 for the target task until the evaluation network converges; based on the evaluation network Q̃_eval and the corresponding evaluation network function, the optimal unloading strategy in the different states is obtained.
Fig. 1 is a flowchart of a task offloading method based on post-decision state deep reinforcement learning according to an embodiment of the present invention. In a specific implementation, the Markov random problem corresponding to the target task is a task similar to the target task that differs from the target task in the distribution of newly arriving tasks. Fig. 2 is a schematic diagram of a post-decision state according to an embodiment of the present invention.
It should be noted that steps S3-S6 are executed repeatedly both for the task used in the warm-start process and for the target task, i.e. the task for which unloading decisions have to be made.
The optimal unloading strategy comprises the unloading target and the unloading quantity: the unloading target specifies where a task is to be unloaded for processing, and the unloading quantity specifies how many tasks are unloaded.
Generally speaking, the invention discloses a task unloading method based on post-decision state deep reinforcement learning, and corresponding optimal unloading strategies can be solved for different targets by changing a cost function. The deep reinforcement learning algorithm combines the post-decision state learning framework in deep learning and reinforcement learning, has the advantages of a common deep learning algorithm DQN, can eliminate the need of prior knowledge in the post-decision state learning framework, and can further improve the training speed of the model by utilizing an additional hot start process.
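By way of illustration only, the two-phase procedure of steps S1-S7 (warm start on a similar Markov random problem, then learning on the target task) can be sketched in Python as below. The sketch assumes hypothetical helper objects (an environment exposing state(), step() and clear_task_buffer(), and an agent exposing initialize(), select_action(), store(), periodic_update(), reset_transition_estimate() and greedy_policy()); these names are not taken from the patent and only indicate where each step of the method would be carried out.

def learn_unloading_policy(agent, warm_env, target_env, warm_steps, target_steps):
    # Phase 1 (steps S2-S6): warm start on a Markov random problem similar to the target task.
    agent.initialize()                         # S2: random start state, transition estimate, network weights
    for n in range(warm_steps):
        s = warm_env.state()
        a = agent.select_action(s)             # S3: epsilon-greedy choice of an unloading scheme
        pds, cost, s_next = warm_env.step(a)   # S4: observe post-decision state, cost and next state
        agent.store(s, a, pds, cost, s_next)   # S4: push the experience into the buffer
        agent.periodic_update(n)               # S5: update P^k, replay a batch, refresh the target network
    # Phase 2 (step S7): learn the target task starting from the warm-started evaluation network.
    agent.reset_transition_estimate()          # re-initialize P^k; keep the warm-started weights
    target_env.clear_task_buffer()
    for n in range(target_steps):
        s = target_env.state()
        a = agent.select_action(s)
        pds, cost, s_next = target_env.step(a)
        agent.store(s, a, pds, cost, s_next)
        agent.periodic_update(n)
    return agent.greedy_policy()               # optimal unloading action for every state

Only the transition-probability estimate is re-initialized between the two phases; the warm-started network weights are carried over, which is what gives the method its fast convergence on the target task.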
In one embodiment, the system state in the state set in step S1 has the form

s_n = (b_n, h_n^1, ..., h_n^m),

where s_n is the system state at time n, defined by the channel states and the state of the task buffer; the task buffer b has i states in total, written b = {b_1, b_2, ..., b_i}, where b_1 and b_i denote the 1st and i-th states of the task buffer and b_n denotes the number of tasks in the task buffer at time n; the channel h has j states, written h = {h_1, h_2, ..., h_j}, where h_1 and h_j denote the 1st and j-th states of the channel; m denotes the number of edge servers, and h_n^1, ..., h_n^m denote the channel states of the 1st to m-th edge servers at time n.
In one embodiment, each action in the action set in step S1 corresponds to an unloading decision, and the action taken at time n is a_n. The unloading decision covers three cases: in the first, the p_n tasks stored in the task buffer are processed at the local CPU; in the second, no task is processed, i.e. p_n = 0; in the third, the p_n tasks in the task buffer are simultaneously unloaded to the k_n edge servers with the best channels for processing, where p_n is the number of tasks to be processed in the task buffer at time n, k_n is the number of edge servers processing the unloaded tasks at time n, and k_n ≤ m.
In a specific implementation, the method also initializes the corresponding action set and adopts a repetition coding computation scheme, i.e. as soon as one of the edge servers completes the computation, the p_n tasks have been successfully processed. In this embodiment, m = 5, j = 2, h ∈ {-130, -125} (dB) with a corresponding channel state transition probability matrix, k_n ∈ {1, 2, 3, 4, 5}, and the probability that each edge server completes the computation is 0.5.
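As an illustrative reading of this embodiment's parameters (m = 5 edge servers, j = 2 channel states, a buffer holding up to 15 tasks, k_n ∈ {1, ..., 5}), the discrete state space and one possible encoding of the action set can be enumerated in Python as follows; the (p, k) encoding of an action and the helper names are assumptions made for the example, not notation from the patent.

import itertools

M = 5                          # number of edge servers in this embodiment
CHANNEL_STATES = (-130, -125)  # channel power gains in dB (j = 2 states per server)
B_MAX = 15                     # capacity of the task buffer

# A system state s_n = (b_n, h_n^1, ..., h_n^m): buffer occupancy plus one channel state per server.
STATES = [(b,) + h for b in range(B_MAX + 1)
          for h in itertools.product(CHANNEL_STATES, repeat=M)]

def unloading_actions(state):
    # One possible (p, k) encoding of the three cases of the unloading decision:
    #   (0, 0)            -> no task is processed,
    #   (p, 0), p >= 1    -> p tasks are processed at the local CPU,
    #   (p, k), k >= 1    -> p tasks are replicated to the k servers with the best channels.
    b = state[0]
    actions = [(0, 0)]
    actions += [(p, 0) for p in range(1, b + 1)]
    actions += [(p, k) for p in range(1, b + 1) for k in range(1, M + 1)]
    return actions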
In one embodiment, the strategy in step S3 is an ε-greedy strategy: with probability ε a random action is selected, and with probability 1-ε the action that minimizes the action-value function Q_eval in the current state is selected.
Specifically, a_n = argmin_a Q_eval(s_n, a), which speeds up the convergence of the evaluation network Q̃_eval. In this example ε = 0.1. The action-value function here is the evaluation network function Q_eval, which is calculated by the formula

Q_eval(s, a) = Σ_{s̃} P^k(s̃ | s, a)·[c_k(s, a) + Q̃_eval(s̃, a)].
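Under this reading of the formula, the ε-greedy selection of step S3 can be sketched as follows; pds_transition_probs, stage_cost and q_tilde_eval are placeholder callables standing for the estimated P^k(s̃ | s, a), the cost c_k(s, a) and the post-decision evaluation network Q̃_eval, and are assumptions of the example.

import random

EPSILON = 0.1   # exploration probability used in this example

def q_eval_value(s, a, pds_transition_probs, stage_cost, q_tilde_eval):
    # Q_eval(s, a) = sum over s~ of P^k(s~ | s, a) * [ c_k(s, a) + Q~_eval(s~, a) ]
    return sum(p * (stage_cost(s, a) + q_tilde_eval(pds, a))
               for pds, p in pds_transition_probs(s, a).items())

def select_action(s, action_set, pds_transition_probs, stage_cost, q_tilde_eval):
    if random.random() < EPSILON:
        return random.choice(action_set)   # explore: pick a random unloading scheme
    # exploit: the action with the minimal action value in the current state
    return min(action_set,
               key=lambda a: q_eval_value(s, a, pds_transition_probs, stage_cost, q_tilde_eval))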
in one embodiment, the post-decision state in step S4 is an intermediate state before the transition to the next state after the action is taken by the current state, and is represented by:
Figure BDA0003659567410000074
wherein p is n For the number of tasks to be processed in the task buffer at time n, Δ b n Indicating the number of tasks that have been newly reached,
Figure BDA0003659567410000075
representing a post-decision state after taking action at n moments;
state s at the next moment n+1 Is represented by the following form:
Figure BDA0003659567410000076
b max which represents the capacity of the task buffer and,
Figure BDA0003659567410000081
respectively shows the channel state of the 1 st edge server at the time of n +1, the channel state of the 2 nd edge server at the time of n +1, the channel state of the mth edge server at the time of n +1, and the state s at the next time n+1 I.e. the state at time n + 1.
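A minimal sketch of these two transitions, under the expressions given above (the processed tasks leave the buffer to form the post-decision state; the new arrivals, capped at b_max, and the new channel states form the next state):

def post_decision_state(s, p_n):
    # s = (b_n, h_n^1, ..., h_n^m); the p_n processed tasks leave the buffer, channels are unchanged.
    b_n, channels = s[0], s[1:]
    return (b_n - p_n,) + channels

def next_state(pds, delta_b, next_channels, b_max=15):
    # New arrivals join the buffer (capped at b_max) and every channel moves to its new state.
    return (min(pds[0] + delta_b, b_max),) + tuple(next_channels)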
In one embodiment, step S4 includes:
acquiring the caching cost of tasks held in the task buffer, the privacy cost of unloading tasks to the edge servers, the energy consumption cost of processing tasks, and the overflow cost incurred when the task buffer overflows because its capacity is insufficient, wherein the energy consumption cost c_energy of processing tasks comprises the energy consumed by processing tasks at the local CPU and the energy consumed by unloading tasks to the edge servers for processing; the caching cost of tasks held in the task buffer is c_holding = b_n - p_n, the privacy cost of unloading tasks to the edge servers for processing is c_privacy = p_n, and the overflow cost incurred when the task buffer overflows because its capacity is insufficient is c_overflow = max{b_n - p_n + Δb_n - b_max, 0}, where b_max is the size of the task buffer, p_n is the number of tasks to be processed in the task buffer at time n, b_n denotes the number of tasks in the task buffer at time n, and Δb_n denotes the number of newly arrived tasks;

obtaining, from the caching cost, the privacy cost, the energy consumption cost and the overflow cost, the cost function c_k(s, a) from s_n to s̃_n and the cost function c_u(s̃, s') from s̃_n to s_{n+1}:

c_k(s, a) = η_1·c_holding + η_2·c_energy + η_3·c_privacy,
c_u(s̃, s') = η_4·c_overflow,

where η_1, η_2, η_3, η_4 are the corresponding weight coefficients, and k and u denote the state transition identifiers;

after a complete state transition has been observed, the current state, the action taken, the post-decision state, the cost incurred by the action and the state at the next moment form a group of experiences (s_n, a_n, s̃_n, c_n, s_{n+1}), which is stored in the experience buffer.
Specifically, storing unprocessed tasks in the task buffer incurs a buffering cost, and if the tasks are offloaded to the edge server for processing, a corresponding privacy cost is incurred, and if the task buffer overflows due to insufficient capacity, a corresponding overflow cost is incurred.
In a particular embodiment, the buffer can store up to 15 tasks, i.e. b_max = 15, and Δb takes values in {0, 1, 2, 3, 4, 5}; the corresponding arrival probability distribution is random for the warm-start task and uniform for the target task. The weight coefficients take the values η_1 = 50, η_2 = 10^6, η_3 = 150, η_4 = 300.
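Using the weight coefficients and buffer size quoted in this embodiment, the two stage costs can be computed as in the following sketch; energy_cost stands for the per-action energy term of the next paragraphs, and applying the privacy term only to unloaded tasks is an assumption consistent with its definition above.

ETA1, ETA2, ETA3, ETA4 = 50, 10**6, 150, 300   # weight coefficients of this embodiment
B_MAX = 15                                     # capacity of the task buffer

def first_stage_cost(b_n, p_n, offloaded, energy_cost):
    # c_k(s, a) = eta_1 * c_holding + eta_2 * c_energy + eta_3 * c_privacy
    c_holding = b_n - p_n                  # tasks still waiting in the buffer
    c_privacy = p_n if offloaded else 0    # privacy cost only when tasks leave the device
    return ETA1 * c_holding + ETA2 * energy_cost + ETA3 * c_privacy

def second_stage_cost(b_n, p_n, delta_b):
    # c_u(s~, s') = eta_4 * c_overflow, counting the tasks dropped when the buffer overflows
    c_overflow = max(b_n - p_n + delta_b - B_MAX, 0)
    return ETA4 * c_overflow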
In one embodiment, when a task is processed at the local CPU, the energy consumed by a unit task is

e_local = κ·L^3·ζ^3/τ^2,

where κ is a CPU parameter, L is the task size, ζ is the CPU frequency parameter and τ is the time interval; when a task is unloaded to an edge server, the energy consumed by a unit task is

e_off = (τ·N_0·W/h)·(2^(L/(τ·W)) - 1),

where W is the bandwidth of the edge computing network, h is the channel power gain and N_0 is the noise power spectral density.
The energy consumed by a unit task is the energy consumed by a single task.
In this example, κ = 10^-28, L = 10^3, ζ = 800, τ = 10^-3, W = 10 MHz, N_0 = 10^-19 W/Hz.
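With the numerical values of this example, the two energy expressions given above can be evaluated as in the following sketch; the closed forms follow the expressions as written above, and the dB-to-linear conversion of the power gain h is an assumption of the example.

KAPPA = 1e-28   # CPU parameter kappa
L_TASK = 1e3    # task size L
ZETA = 800      # CPU frequency parameter zeta
TAU = 1e-3      # time interval tau (s)
W = 10e6        # bandwidth of the edge computing network (Hz)
N0 = 1e-19      # noise power spectral density (W/Hz)

def local_energy():
    # e_local = kappa * L^3 * zeta^3 / tau^2
    return KAPPA * L_TASK ** 3 * ZETA ** 3 / TAU ** 2

def offload_energy(h_db):
    # e_off = (tau * N0 * W / h) * (2^(L / (tau * W)) - 1), with the power gain h converted from dB
    h = 10 ** (h_db / 10)
    return (TAU * N0 * W / h) * (2 ** (L_TASK / (TAU * W)) - 1)

# Example: local_energy() is about 5.1e-5 J; offload_energy(-125) is about 2.3e-4 J per task.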
In one embodiment, P^k(s̃ | s, a) in step S5 is updated by a statistical updating method.
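One possible reading of the statistical updating method is to maintain empirical visit counts of the observed (s, a, s̃) triples, as in the following sketch; the counting scheme itself is an assumption, since the patent only states that the update is statistical.

from collections import defaultdict

class TransitionEstimator:
    """Empirical estimate of P^k(s~ | s, a) from the observed (s, a, s~) triples."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, s, a, pds):
        self.counts[(s, a)][pds] += 1

    def probabilities(self, s, a):
        pds_counts = self.counts[(s, a)]
        total = sum(pds_counts.values())
        if total == 0:
            return {}          # no observation yet for this state-action pair
        return {pds: c / total for pds, c in pds_counts.items()}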
In one embodiment, when a single group of experiences is randomly selected from the experience buffer in step S5, the selected experience is (s_n, a_n, s̃_n, c_n, s_{n+1}); when a batch of experiences is replayed, the weight parameters of the evaluation network are updated by minimizing the loss of the following loss function:

Loss(θ) = [c_u(s̃_n, s_{n+1}) + γ·min_{a'} Q_target(s_{n+1}, a') - Q̃_eval(s̃_n, a_n; θ)]^2,

where γ denotes the discount factor, θ is the weight parameter of the evaluation network, c_u(s̃_n, s_{n+1}) denotes the cost function from the post-decision state s̃_n to the state s_{n+1} at time n+1, Q̃_eval(s̃_n, a_n; θ) is the output of the evaluation network Q̃_eval for the input post-decision state s̃_n and action a_n, and Q_target(s', a') is the output of the Q_target function for the input state s' and action a'. According to the experience replay mechanism, the evaluation network Q̃_eval and the corresponding function Q_eval are updated for each group of experiences; after a batch of experiences has been used and the network has been updated, the parameters of the evaluation network are assigned to the target network Q̃_target and the corresponding function Q_target is updated at the same time. The corresponding Q_eval and Q_target functions are updated according to

Q(s, a) = Σ_{s̃} P^k(s̃ | s, a)·[c_k(s, a) + Q̃(s̃, a)],

where Q(s, a) denotes the value of the Q function in state s under action a, the Q function being either the evaluation function Q_eval or the target function Q_target, and Q̃(s̃, a) denotes the output of the evaluation network Q̃_eval or of the target network Q̃_target for the input post-decision state s̃ and action a.
Specifically, the input variables of the evaluation network Q̃_eval are the post-decision state s̃ and the action a, and θ collects the parameters of the evaluation network Q̃_eval; each time a pair (s̃, a) is fed into the neural network, an output Q̃_eval(s̃, a; θ) is obtained and the network is updated. Because a batch of experiences is randomly selected in the experience replay mechanism, several post-decision-state and action pairs can be input to update the network. Both the evaluation network Q̃_eval and the target network Q̃_target have to be substituted into the above equation to calculate the corresponding Q_eval and Q_target functions.
The action-value function Q_eval is updated from the evaluation network Q̃_eval according to the above formula and is used only for action selection (step S3); the action-value function Q_target is updated from the target network Q̃_target according to the above formula and is used only as the target value during the network update (step S5).
Fig. 3 is a schematic processing framework diagram of a task offloading method based on post-decision state deep reinforcement learning according to an embodiment of the present invention.
In one embodiment, the evaluation network Q̃_eval and the function Q_eval are updated every 200 steps, the target network Q̃_target and the function Q_target are updated every 1000 steps, and every 10000 steps the average cost of 100000 randomly sampled state transitions is counted to evaluate the algorithm performance.
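Combining the replay mechanism, the loss function above and the update intervals of this embodiment, one network update can be sketched as follows. The sketch assumes, purely for illustration, that the evaluation network Q̃_eval and the target function Q_target are provided as callables (for example small PyTorch modules returning scalar tensors or values), that the buffer is a list of experience tuples, and that an optimizer over the evaluation network's parameters is available; the batch size and discount factor are example values not specified in the patent.

import random
import torch

GAMMA = 0.9        # discount factor (example value; not quoted in the patent)
BATCH_SIZE = 32    # replay batch size (example value)

def replay_update(buffer, q_tilde_eval, q_target, cost_u, actions_of, optimizer):
    """One experience-replay update of Q~_eval by minimizing the loss of step S5."""
    batch = random.sample(buffer, min(BATCH_SIZE, len(buffer)))
    losses = []
    for s, a, pds, cost, s_next in batch:
        with torch.no_grad():
            # target value y = c_u(s~_n, s_{n+1}) + gamma * min_{a'} Q_target(s_{n+1}, a')
            y = cost_u(pds, s_next) + GAMMA * min(float(q_target(s_next, ap)) for ap in actions_of(s_next))
        pred = q_tilde_eval(pds, a)                 # Q~_eval(s~_n, a_n; theta), a differentiable tensor
        losses.append((pred - y) ** 2)              # squared temporal-difference error
    loss = torch.stack(losses).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)

Keeping the target-value computation under torch.no_grad() ensures that gradients flow only through Q̃_eval, which is what the loss above prescribes.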
In order to more clearly illustrate the method proposed by the present invention, it is explained below by specific experimental data.
1. Simulation conditions and content
The operating system is Microsoft Windows 10 and the simulation is programmed in Python. The simulation uses one set of parameters to compare the proposed algorithm with the existing, commonly used deep reinforcement learning algorithm DQN.
2. Analysis of simulation results
Fig. 4 compares the performance of the currently popular deep reinforcement learning algorithm DQN with that of the proposed algorithm. Compared with DQN, the average cost of the proposed algorithm decreases much more quickly, showing that the proposed method is a more efficient task unloading algorithm based on post-decision state deep reinforcement learning.
Aiming at the limited convergence speed of the deep learning algorithms used in prior-art task unloading methods, the invention updates the parameter values of the evaluation network and the target network by introducing a deep neural network together with experience replay and the minimization of a loss function. The additionally adopted warm-start learning process accelerates the updating of the deep network parameters. Furthermore, a reinforcement learning algorithm based on a post-decision state learning framework is adopted; to address the fact that the traditional post-decision state learning framework requires additional prior information, an additional learning process is added to estimate the state transition probability from the current state to the post-decision state, which removes the need for prior knowledge while still exploiting the structural advantage of the post-decision state to further accelerate the network update, so that the performance of the invention is better than that of the traditional deep learning algorithm DQN.
It should be understood that the above description of the preferred embodiments is illustrative, and not restrictive, and that various changes and modifications may be made therein by those skilled in the art without departing from the scope of the invention as defined in the appended claims.

Claims (9)

1. A task unloading method based on post-decision state deep reinforcement learning is characterized by comprising the following steps:
S1: setting a state set, a post-decision state set and an action set, wherein the state set comprises a system state, the post-decision state set comprises a post-decision state, and the action set comprises an action to be taken;
S2: randomly initializing the starting state, specifically: initializing the state transition probability P^k(s̃ | s, a) of reaching the post-decision state s̃ after taking action a in state s, and the weight parameters of the evaluation network Q̃_eval and the target network Q̃_target, where k denotes the transition identifier from the state s to the post-decision state; performing a warm start of the evaluation network Q̃_eval with a Markov random problem corresponding to the target task, and setting the iteration counter to 1, wherein the experience buffer is used to store the state at a given moment, the action taken, the corresponding post-decision state, the cost incurred by taking the action, and the state at the next moment;
S3: selecting an action according to the strategy, wherein one action corresponds to one unloading scheme;
S4: observing the post-decision state, forming a group of experiences from the post-decision state, the cost incurred by taking the action of step S3 and the state at the next moment, and storing the group of experiences in the experience buffer;
S5: at certain intervals, updating P^k(s̃ | s, a); randomly selecting a batch of experiences from the experience buffer and updating the evaluation network Q̃_eval and the corresponding evaluation function through experience replay; assigning the weight parameters of the updated evaluation network Q̃_eval to the target network Q̃_target and updating the corresponding target network function;
S6: adding 1 to the iteration counter and repeating steps S3-S5 until the evaluation network Q̃_eval converges, completing the warm start;
S7: setting the current iteration counter to 1, emptying the task buffer, reinitializing P^k(s̃ | s, a), and, starting from the evaluation network Q̃_eval obtained by the warm start, repeating steps S3-S6 for the target task until the evaluation network converges; based on the evaluation network Q̃_eval and the corresponding evaluation network function, the optimal unloading strategy in the different states is obtained.
2. The task unloading method based on post-decision state deep reinforcement learning of claim 1, wherein the system state in the state set in step S1 has the form

s_n = (b_n, h_n^1, ..., h_n^m),

where s_n is the system state at time n, defined by the channel states and the state of the task buffer; the task buffer b has i states in total, written b = {b_1, b_2, ..., b_i}, where b_1 and b_i denote the 1st and i-th states of the task buffer and b_n denotes the number of tasks in the task buffer at time n; the channel h has j states, written h = {h_1, h_2, ..., h_j}, where h_1 and h_j denote the 1st and j-th states of the channel; m denotes the number of edge servers, and h_n^1, ..., h_n^m denote the channel states of the 1st to m-th edge servers at time n.
3. The task unloading method based on post-decision state deep reinforcement learning of claim 2, wherein each action in the action set in step S1 corresponds to an unloading decision, and the action taken at time n is a_n; the unloading decision covers three cases: in the first, the p_n tasks stored in the task buffer are processed at the local CPU; in the second, no task is processed, i.e. p_n = 0; in the third, the p_n tasks in the task buffer are simultaneously unloaded to the k_n edge servers with the best channels for processing, where p_n is the number of tasks to be processed in the task buffer at time n, k_n is the number of edge servers processing the unloaded tasks at time n, and k_n ≤ m.
4. The task unloading method based on post-decision state deep reinforcement learning of claim 1, wherein the strategy in step S3 is an ε-greedy strategy: with probability ε a random action is selected, and with probability 1-ε the action that minimizes the action-value function Q_eval in the current state is selected.
5. The task unloading method based on post-decision state deep reinforcement learning of claim 2, wherein the post-decision state in step S4 is the intermediate state reached after the current state takes the action and before the transition to the next state, and it is represented as

s̃_n = (b_n - p_n, h_n^1, ..., h_n^m),

where p_n is the number of tasks to be processed in the task buffer at time n, Δb_n denotes the number of newly arrived tasks, and s̃_n denotes the post-decision state after taking the action at time n;

the state at the next moment, s_{n+1}, i.e. the state at time n+1, is represented as

s_{n+1} = (min{b_n - p_n + Δb_n, b_max}, h_{n+1}^1, h_{n+1}^2, ..., h_{n+1}^m),

where b_max denotes the capacity of the task buffer and h_{n+1}^1, h_{n+1}^2, ..., h_{n+1}^m denote the channel states of the 1st, 2nd, ..., m-th edge servers at time n+1.
6. The task unloading method based on post-decision state deep reinforcement learning of claim 1, wherein step S4 comprises:
acquiring the caching cost of tasks held in the task buffer, the privacy cost of unloading tasks to the edge servers, the energy consumption cost of processing tasks, and the overflow cost incurred when the task buffer overflows because its capacity is insufficient, wherein the energy consumption cost c_energy of processing tasks comprises the energy consumed by processing tasks at the local CPU and the energy consumed by unloading tasks to the edge servers for processing; the caching cost of tasks held in the task buffer is c_holding = b_n - p_n, the privacy cost of unloading tasks to the edge servers for processing is c_privacy = p_n, and the overflow cost incurred when the task buffer overflows because its capacity is insufficient is c_overflow = max{b_n - p_n + Δb_n - b_max, 0}, where b_max is the size of the task buffer, p_n is the number of tasks to be processed in the task buffer at time n, b_n denotes the number of tasks in the task buffer at time n, and Δb_n denotes the number of newly arrived tasks;
obtaining, from the caching cost, the privacy cost, the energy consumption cost and the overflow cost, the cost function c_k(s, a) from s_n to s̃_n and the cost function c_u(s̃, s') from s̃_n to s_{n+1}:

c_k(s, a) = η_1·c_holding + η_2·c_energy + η_3·c_privacy,
c_u(s̃, s') = η_4·c_overflow,

where η_1, η_2, η_3, η_4 are the corresponding weight coefficients, and k and u denote the state transition identifiers;
after a complete state transition has been observed, the current state, the action taken, the post-decision state, the cost incurred by the action and the state at the next moment form a group of experiences (s_n, a_n, s̃_n, c_n, s_{n+1}), which is stored in the experience buffer.
7. The task unloading method based on post-decision state deep reinforcement learning of claim 6, wherein when a task is processed at the local CPU, the energy consumed by a unit task is

e_local = κ·L^3·ζ^3/τ^2,

where κ is a CPU parameter, L is the task size, ζ is the CPU frequency parameter and τ is the time interval; when a task is unloaded to an edge server, the energy consumed by a unit task is

e_off = (τ·N_0·W/h)·(2^(L/(τ·W)) - 1),

where W is the bandwidth of the edge computing network, h is the channel power gain and N_0 is the noise power spectral density.
8. The task unloading method based on post-decision state deep reinforcement learning of claim 1, wherein P^k(s̃ | s, a) in step S5 is updated by a statistical updating method.
9. The task unloading method based on post-decision state deep reinforcement learning of claim 1, wherein when step S5 is executed, a single group of experiences randomly selected from the experience buffer is (s_n, a_n, s̃_n, c_n, s_{n+1}); when a batch of experiences is replayed, the weight parameters of the evaluation network are updated by minimizing the loss of the following loss function:

Loss(θ) = [c_u(s̃_n, s_{n+1}) + γ·min_{a'} Q_target(s_{n+1}, a') - Q̃_eval(s̃_n, a_n; θ)]^2,

where γ denotes the discount factor, θ is the weight parameter of the evaluation network, c_u(s̃_n, s_{n+1}) denotes the cost function from the post-decision state s̃_n to the state s_{n+1} at time n+1, Q̃_eval(s̃_n, a_n; θ) is the output of the evaluation network Q̃_eval for the input post-decision state s̃_n and action a_n, and Q_target(s', a') is the output of the Q_target function for the input state s' and action a'; according to the experience replay mechanism, the evaluation network Q̃_eval and the corresponding function Q_eval are updated for each group of experiences, and after a batch of experiences has been used for the network update, the parameters of the evaluation network are assigned to the target network Q̃_target and the corresponding function Q_target is updated at the same time; the corresponding Q_eval and Q_target functions are updated according to

Q(s, a) = Σ_{s̃} P^k(s̃ | s, a)·[c_k(s, a) + Q̃(s̃, a)],

where Q(s, a) denotes the value of the Q function in state s under action a, the Q function being either the evaluation function Q_eval or the target function Q_target, and Q̃(s̃, a) denotes the output of the evaluation network Q̃_eval or of the target network Q̃_target for the input post-decision state s̃ and action a.
CN202210572305.4A 2022-05-24 2022-05-24 Task unloading method based on post-decision state deep reinforcement learning Active CN115016858B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210572305.4A CN115016858B (en) 2022-05-24 2022-05-24 Task unloading method based on post-decision state deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210572305.4A CN115016858B (en) 2022-05-24 2022-05-24 Task unloading method based on post-decision state deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN115016858A true CN115016858A (en) 2022-09-06
CN115016858B CN115016858B (en) 2024-03-29

Family

ID=83069645

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210572305.4A Active CN115016858B (en) 2022-05-24 2022-05-24 Task unloading method based on post-decision state deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN115016858B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111726826A (en) * 2020-05-25 2020-09-29 上海大学 Online task unloading method in base station intensive edge computing network
WO2022027776A1 (en) * 2020-08-03 2022-02-10 威胜信息技术股份有限公司 Edge computing network task scheduling and resource allocation method and edge computing system
CN113064671A (en) * 2021-04-27 2021-07-02 清华大学 Multi-agent-based edge cloud extensible task unloading method
CN113434212A (en) * 2021-06-24 2021-09-24 北京邮电大学 Cache auxiliary task cooperative unloading and resource allocation method based on meta reinforcement learning
CN113504987A (en) * 2021-06-30 2021-10-15 广州大学 Mobile edge computing task unloading method and device based on transfer learning
CN113612843A (en) * 2021-08-02 2021-11-05 吉林大学 MEC task unloading and resource allocation method based on deep reinforcement learning
CN114205353A (en) * 2021-11-26 2022-03-18 华东师范大学 Calculation unloading method based on hybrid action space reinforcement learning algorithm

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
张海波 et al., "Research on offloading strategy based on NOMA-MEC in the Internet of Vehicles" (车联网中基于NOMA-MEC的卸载策略研究), Journal of Electronics & Information Technology (电子与信息学报), vol. 42, no. 4, 30 April 2021 (2021-04-30) *
张海波; 荆昆仑; 刘开健; 贺晓帆, "An offloading strategy based on software-defined networking and mobile edge computing in the Internet of Vehicles" (车联网中一种基于软件定义网络与移动边缘计算的卸载策略), Journal of Electronics & Information Technology (电子与信息学报), no. 03, 15 March 2020 (2020-03-15) *
彭军; 王成龙; 蒋富; 顾欣; 牟??; 刘伟荣, "A fast deep Q-learning network edge-cloud migration strategy for vehicular services" (一种车载服务的快速深度Q学习网络边云迁移策略), Journal of Electronics & Information Technology (电子与信息学报), no. 01, 15 January 2020 (2020-01-15) *

Also Published As

Publication number Publication date
CN115016858B (en) 2024-03-29

Similar Documents

Publication Publication Date Title
CN112882815B (en) Multi-user edge calculation optimization scheduling method based on deep reinforcement learning
CN108962238B (en) Dialogue method, system, equipment and storage medium based on structured neural network
Bistritz et al. Online exp3 learning in adversarial bandits with delayed feedback
CN112817653A (en) Cloud-side-based federated learning calculation unloading computing system and method
WO2021227508A1 (en) Deep reinforcement learning-based industrial 5g dynamic multi-priority multi-access method
CN113434212A (en) Cache auxiliary task cooperative unloading and resource allocation method based on meta reinforcement learning
CN114065863B (en) Federal learning method, apparatus, system, electronic device and storage medium
CN112511336B (en) Online service placement method in edge computing system
CN114866494B (en) Reinforced learning intelligent agent training method, modal bandwidth resource scheduling method and device
CN115374853A (en) Asynchronous federal learning method and system based on T-Step polymerization algorithm
CN116010054A (en) Heterogeneous edge cloud AI system task scheduling frame based on reinforcement learning
CN110647403A (en) Cloud computing resource allocation method in multi-user MEC system
CN114065929A (en) Training method and device for deep reinforcement learning model and storage medium
CN113626104A (en) Multi-objective optimization unloading strategy based on deep reinforcement learning under edge cloud architecture
CN116523079A (en) Reinforced learning-based federal learning optimization method and system
CN113760511A (en) Vehicle edge calculation task unloading method based on depth certainty strategy
CN113867843A (en) Mobile edge computing task unloading method based on deep reinforcement learning
CN111740925A (en) Deep reinforcement learning-based flow scheduling method
CN117436485A (en) Multi-exit point end-edge-cloud cooperative system and method based on trade-off time delay and precision
CN115016858A (en) Task unloading method based on post-decision state deep reinforcement learning
CN113778550A (en) Task unloading system and method based on mobile edge calculation
CN111488208A (en) Edge cloud cooperative computing node scheduling optimization method based on variable step length bat algorithm
CN116367231A (en) Edge computing Internet of vehicles resource management joint optimization method based on DDPG algorithm
CN113157344B (en) DRL-based energy consumption perception task unloading method in mobile edge computing environment
CN117014355A (en) TSSDN dynamic route decision method based on DDPG deep reinforcement learning algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant