CN115016858A - Task unloading method based on post-decision state deep reinforcement learning - Google Patents
Task unloading method based on post-decision state deep reinforcement learning
- Publication number
- CN115016858A (application CN202210572305.4A)
- Authority
- CN
- China
- Prior art keywords
- task
- state
- post
- action
- decision
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/445—Program loading or initiating
- G06F9/44594—Unloading
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2209/00—Indexing scheme relating to G06F9/00
- G06F2209/50—Indexing scheme relating to G06F9/50
- G06F2209/509—Offload
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/70—Reducing energy consumption in communication networks in wireless communication networks
Abstract
The invention discloses a task offloading method based on post-decision state deep reinforcement learning that can decide every dimension of the offloading action, such as the offloading target and the number of tasks offloaded. Facing different optimization objectives, the optimal strategy under each objective is obtained by changing the cost function. The invention uses the experience replay mechanism of DQN, randomly selecting collected historical experiences as training samples to improve learning efficiency, and uses a post-decision-state learning framework together with an additional warm-start process to accelerate learning. The traditional post-decision learning framework learns efficiently but requires additional prior information. The disclosed method obtains the information required by traditional post-decision learning through an additional learning process, and achieves fast convergence of the offloading method by combining the efficient post-decision-state learning framework, the warm-start process, and the experience replay mechanism.
Description
Technical Field
The invention relates to the technical field of machine learning and distributed computing, and in particular to a task offloading method based on post-decision state deep reinforcement learning.
Background
Against the background of explosive growth in computational demand and data scale, edge computing is widely applied to overcome the limited computing capability of terminal devices. Edge computing is a paradigm in which tasks are offloaded to edge devices for processing. Mobile devices typically aim to reduce latency and energy consumption: when wireless channel conditions are poor, a mobile device preferentially processes tasks on its local CPU, whereas when channel conditions are good it tends to offload most tasks to the network edge for processing. However, if a task is offloaded to an unreliable server, information such as the user's location and identity may be revealed, threatening user privacy. The risk of privacy disclosure therefore has to be considered together with energy consumption.
On the other hand, distributed computing is widely used in edge computing because of the growing scale of computing tasks. The efficiency of a distributed computing system is sensitive to the computing power of individual nodes and to the communication environment: some nodes may take a long time to finish their computation and return results, producing the straggler effect, which adds computation delay and degrades efficiency. Coded computation is a framework that applies coding theory to distributed computing; by introducing redundancy through flexible coding techniques, it effectively mitigates the straggler effect. Replication coding is a simple and common scheme in which the same task is offloaded to several different servers, so that a result is obtained as soon as any one node finishes its computation. However, when channel conditions are poor, blindly replicating a task and offloading it to several servers for simultaneous processing both wastes energy and harms privacy protection. To balance requirements such as energy consumption and privacy protection, this class of problems can be modeled as a Markov Decision Process (MDP) with appropriate states, action space, and cost function, and the optimal offloading strategy minimizing the long-term cost can then be solved with a reinforcement learning algorithm.
In practice, the state space of such Markov problems is usually large, and generic reinforcement learning algorithms converge too slowly to be useful in practical applications.
Disclosure of Invention
The invention provides a task offloading method based on post-decision state deep reinforcement learning to solve, or at least partially solve, the technical problem of low task offloading efficiency in the prior art.
The invention discloses a task offloading method based on post-decision state deep reinforcement learning, which comprises the following steps:
s1: setting a state set, a post-decision state set and an action set, wherein the state set comprises a system state, the post-decision state set comprises a post-decision state, and the action set comprises an action to be taken;
s2: the random initialization starting state specifically includes: initializing post-decision state after taking action a in state sState transition probability ofEvaluating a networkWeight parameter of, target networkK represents the state from s to the post-decision stateUsing a Markov random problem pair corresponding to the target taskPerforming hot start, and setting the iteration number to 1, wherein the experience buffer is used for storing the state at a certain moment, the action taken and the corresponding post-decision stateCost of taking action and status to the next time;
s3: selecting an action according to the strategy, wherein one action corresponds to one unloading scheme;
s4: observing the post-decision state, forming a group of experiences by the post-decision state, the cost generated by taking the action of the step S3 and the state at the next moment, and storing the group of experiences in an experience buffer;
s5: at certain intervals, updateRandomly selecting a batch of experiences from the experience buffer, and updating the evaluation network by experience playbackAnd corresponding evaluation network function, and updating the evaluation networkAssigning the weight parameter of to the target networkUpdating the corresponding target network function;
s6: adding 1 to the iteration number, and repeatedly executing the steps S3-S5 until the network is evaluatedConverging to finish hot start;
s7: setting the current iteration number to 1, emptying the task buffer and reinitializingEvaluation network obtained by hot startRepeating the steps S3-S6 for the target task until the evaluation network converges; based on an evaluation networkAnd correspondingly evaluating the network function to obtain the optimal unloading strategy in different states.
In one embodiment, the system state in the state set in step S1 has the form:
s_n = {b_n, h_n^1, h_n^2, ..., h_n^m},
where s_n is the system state at time n, jointly determined by the channel states and the state of the task buffer; the task buffer b has i states in total, denoted b = {b_1, b_2, ..., b_i}, where b_1 and b_i represent the 1st and i-th states of the task buffer and b_n denotes the number of tasks in the task buffer at time n; the channel h has j states, denoted h = {h_1, h_2, ..., h_j}, where h_1 and h_j represent the 1st and j-th states of the channel; m represents the number of edge servers, and h_n^1, ..., h_n^m represent the channel states of the 1st through m-th edge servers at time n.
In one embodiment, each action in the action set in step S1 corresponds to an offloading decision, the action taken at time n being a_n. The offloading decision covers three cases: first, p_n of the tasks stored in the task buffer are processed on the local CPU; second, no task is processed, i.e., p_n = 0; third, the p_n tasks in the task buffer are simultaneously offloaded to the k_n edge servers with the best channels for processing, where p_n is the number of tasks to be processed from the task buffer at time n and k_n ≤ m is the number of edge servers processing tasks at time n.
In one embodiment, the strategy in step S3 is an ε-greedy strategy: a random action is selected with probability ε, and with probability 1 − ε the action that minimizes the action value function Q_eval in the current state is selected.
In one embodiment, the post-decision state in step S4 is the intermediate state reached after the current state takes the action but before the transition to the next state, represented as:
s̃_n = {b_n − p_n, h_n^1, h_n^2, ..., h_n^m},
where p_n is the number of tasks processed from the task buffer at time n, Δb_n denotes the number of newly arrived tasks, and s̃_n represents the post-decision state after taking the action at time n;
the state s_{n+1} at the next time is represented as:
s_{n+1} = {min{b_n − p_n + Δb_n, b_max}, h_{n+1}^1, h_{n+1}^2, ..., h_{n+1}^m},
where b_max represents the capacity of the task buffer and h_{n+1}^1, h_{n+1}^2, ..., h_{n+1}^m represent the channel states of the 1st through m-th edge servers at time n + 1; the state s_{n+1} at the next time is the state at time n + 1.
In one embodiment, step S4 includes:
acquiring the holding cost of tasks in the task buffer, the privacy cost of offloading tasks to the edge servers, the energy cost of processing tasks, and the overflow cost incurred when the task buffer overflows due to insufficient capacity, wherein the energy cost c_energy of processing tasks covers both the energy consumed by processing tasks on the local CPU and the energy consumed by offloading tasks to the edge servers for processing; the holding cost of tasks in the task buffer is c_holding = b_n − p_n, the privacy cost of offloading tasks to the edge servers is c_privacy = p_n, and the overflow cost incurred when the task buffer overflows due to insufficient capacity is c_overflow = max{b_n − p_n + Δb_n − b_max, 0}, where b_max is the capacity of the task buffer, p_n is the number of tasks to be processed from the task buffer at time n, b_n denotes the number of tasks in the task buffer at time n, and Δb_n denotes the number of newly arrived tasks;
obtaining, from the holding cost, the privacy cost, the energy cost, and the overflow cost, the cost function c_k(s, a) for the transition from s_n to s̃_n and the cost function c_u(s̃_n, s_{n+1}) for the transition from s̃_n to s_{n+1}:
c_k(s, a) = η_1·c_holding + η_2·c_energy + η_3·c_privacy,
c_u(s̃, s') = η_4·c_overflow,
where η_1, η_2, η_3, η_4 are the corresponding weight coefficients, and k and u denote state transition identifiers;
after observing a complete state transition, forming a set of experience {s_n, a_n, s̃_n, c_n, s_{n+1}} comprising the current state, the action taken, the post-decision state, the cost incurred by the action, and the state at the next time, and storing it in the experience buffer.
In one embodiment, when a task is processed at the local CPU, the energy consumed per unit task is:
e_local = κ·L^3·ζ^3 / τ^2,
where κ is a CPU parameter, L is the task size, ζ is the CPU frequency, and τ is the time interval; when a task is offloaded to an edge server, the energy consumed per unit task is:
e_edge = (τ·W·N_0 / h)·(2^(L/(τ·W)) − 1),
where W is the bandwidth of the edge computing network, h is the channel power gain, and N_0 is the noise power spectral density.
In one embodiment, a single set of experience randomly selected from the experience buffer in step S5 is {s_n, a_n, s̃_n, c_n, s_{n+1}}. When a batch of experiences is replayed, the weight parameters of the evaluation network are updated by minimizing the following loss function (averaged over the batch):
L(θ) = (c_u(s̃_n, s_{n+1}) + γ·min_{a'} Q_target(s_{n+1}, a') − Q̃_eval(s̃_n, a_n; θ))^2,
where γ represents the discount factor, θ is the weight parameter of the evaluation network, c_u(s̃_n, s_{n+1}) is the cost function from the post-decision state s̃_n to the state s_{n+1} at time n + 1, Q̃_eval(s̃_n, a_n; θ) is the output of the evaluation network after inputting the post-decision state s̃_n and the action a_n, and Q_target(s', a') is the output of the Q_target function after inputting state s' and action a'. The corresponding Q_eval and Q_target functions are updated according to:
Q(s, a) = c_k(s, a) + Σ_s̃ P_k(s̃ | s, a)·Q̃(s̃, a),
where Q(s, a) represents the value of a Q function in state s and action a, the Q function being either the evaluation function Q_eval or the target function Q_target, and Q̃(s̃, a) denotes the output of the evaluation network or the target network, respectively, after inputting the post-decision state s̃ and the action a.
Compared with the prior art, the invention has the following advantages and beneficial technical effects:
The invention introduces a deep neural network and uses experience replay and minimization of a loss function to update the weight parameters of the evaluation network Q_eval and the target network Q_target. The additionally adopted warm-start learning process accelerates the updating of the deep network parameters. The invention is a reinforcement learning algorithm based on the post-decision-state learning framework; whereas the traditional post-decision-state learning framework requires additional prior information, the invention obtains this information through an additional learning process.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention or of the prior art, the drawings needed in the description of the embodiments are briefly introduced below. It is apparent that the drawings described below show some embodiments of the present invention, and that those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a task offloading method based on post-decision state deep reinforcement learning according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating a post-decision state according to an embodiment of the present invention;
FIG. 3 is a schematic processing framework diagram of a task offloading method based on post-decision state deep reinforcement learning according to an embodiment of the present invention;
FIG. 4 is a diagram of simulation results of a method according to an embodiment of the present invention.
Detailed Description
The invention aims to solve the problem of quickly finding an optimal decision in task offloading, and provides a computation offloading strategy learning framework based on post-decision-state deep reinforcement learning. On the basis of the traditional deep reinforcement learning algorithm DQN, the framework introduces post-decision-state learning and a warm-start process, thereby accelerating algorithm convergence.
The main concept and innovation of the invention are as follows:
the invention relates to a task unloading method based on deep reinforcement learning, which can make decisions on all dimensions of unloading actions, such as unloading objects, unloading quantity and the like of tasks. And facing to different optimization targets, and realizing the optimal strategies under different targets by changing the cost function. The method utilizes an experience playback mechanism of the DQN, randomly selects collected historical experiences as training samples, and therefore learning efficiency can be improved. Meanwhile, a post-decision learning framework and an additional hot start process are utilized to accelerate the learning speed. The traditional post-decision state learning framework has higher learning efficiency but needs additional prior information, but the invention provides a task unloading algorithm based on the post-decision state deep reinforcement learning, the additional information needed in the traditional post-decision state learning is obtained by utilizing an additional learning process, and the rapid convergence of the algorithm is realized by utilizing the efficient post-decision learning framework, the hot start process and the experience playback mechanism.
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
An embodiment of the invention provides a task offloading method based on post-decision state deep reinforcement learning, which comprises the following steps:
s1: setting a state set, a post-decision state set and an action set, wherein the state set comprises a system state, the post-decision state set comprises a post-decision state, and the action set comprises an action to be taken;
s2: the random initialization starting state specifically includes: initializing post-decision state after taking action a in state sState transition probability ofEvaluating a networkWeight parameter of, target networkK represents a transition identifier from the state s to the post-decision state, and the evaluation network is paired with a Markov random problem corresponding to the target taskPerforming hot start, and setting the iteration number to be 1, wherein the experience buffer is used for storing the state at a certain moment, the action taken, the corresponding post-decision state, the cost generated by the action taken and the state at the next moment;
s3: selecting an action according to the strategy, wherein one action corresponds to one unloading scheme;
s4: observing the post-decision state, forming a group of experiences by the post-decision state, the cost generated by taking the action of the step S3 and the state at the next moment, and storing the group of experiences in an experience buffer;
s5: at certain intervals, updateRandomly selecting a batch of experiences from the experience buffer to update the evaluation network through experience playbackUpdating the corresponding evaluation network function, and updating the updated evaluation networkAssigning the weight parameter of to the target networkUpdating the corresponding target network function;
s6: adding 1 to the iteration number, and repeatedly executing the steps S3-S5 until the network is evaluatedConverging to finish hot start;
s7: setting the current iteration number to 1, emptying the task buffer and reinitializingEvaluation network obtained by hot startRepeating steps S3-S6 for the target task until network convergence is evaluated; based on an evaluation networkAnd correspondingly evaluating the network function to obtain the optimal unloading strategy in different states.
Fig. 1 is a flowchart of the task offloading method based on post-decision state deep reinforcement learning according to an embodiment of the present invention. In a specific implementation, the Markov random problem corresponding to the target task is a task similar to the target task, differing from the target task in the distribution of newly arriving tasks. Fig. 2 is a schematic diagram of the post-decision state according to an embodiment of the present invention.
It should be noted that steps S3-S6 are executed repeatedly both for the warm-start task and for the target task, i.e., the task for which offloading decisions actually need to be made.
The optimal offloading strategy comprises the offloading target, i.e., where each task is to be processed, and the offloading quantity, i.e., how many tasks are offloaded.
Generally speaking, with the disclosed task offloading method based on post-decision state deep reinforcement learning, the corresponding optimal offloading strategy for different objectives can be obtained by changing the cost function. The deep reinforcement learning algorithm combines deep learning with the post-decision-state learning framework of reinforcement learning: it keeps the advantages of the common deep learning algorithm DQN, eliminates the prior knowledge required by the post-decision-state learning framework, and further improves the training speed of the model through an additional warm-start process. A sketch of the overall procedure is given below.
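As an illustration only, the following Python sketch outlines how steps S2-S7 could be organized in code; the environment and agent interfaces (env.reset, env.post_decision, agent.replay_update_eval, and so on) are hypothetical names, and the update intervals of 200 and 1000 steps are taken from the embodiment described further below. This is a sketch under those assumptions, not a reference implementation from the patent.

```python
# Minimal sketch of the S2-S7 loop; all interface names are assumptions.
def train(agent, env, num_steps):
    s = env.reset()                              # S2: random starting state
    for step in range(1, num_steps + 1):
        a = agent.select_action(s)               # S3: epsilon-greedy over Q_eval
        s_pds, c_k = env.post_decision(s, a)     # S4: observe post-decision state
        s_next, c_u = env.transition(s_pds)      # random arrivals + channel change
        agent.buffer.append((s, a, s_pds, c_k + c_u, s_next))
        if step % 200 == 0:                      # S5: periodic updates
            agent.update_transition_estimate()   # refine the estimate of P_k
            agent.replay_update_eval()           # minibatch update of Q_eval
        if step % 1000 == 0:
            agent.sync_target()                  # copy eval weights to Q_target
        s = s_next                               # S6: next iteration

# Warm start (S2-S6) on a similar task, then the target task (S7):
# train(agent, warm_env, N); agent.reset_transition_estimate(); train(agent, target_env, N)
```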
In one embodiment, the system state in the state set in step S1 has the form:
s_n = {b_n, h_n^1, h_n^2, ..., h_n^m},
where s_n is the system state at time n, jointly determined by the channel states and the state of the task buffer; the task buffer b has i states in total, denoted b = {b_1, b_2, ..., b_i}, where b_1 and b_i represent the 1st and i-th states of the task buffer and b_n denotes the number of tasks in the task buffer at time n; the channel h has j states, denoted h = {h_1, h_2, ..., h_j}, where h_1 and h_j represent the 1st and j-th states of the channel; m represents the number of edge servers, and h_n^1, ..., h_n^m represent the channel states of the 1st through m-th edge servers at time n.
In one embodiment, each action in the action set in step S1 corresponds to an offloading decision, the action taken at time n being a_n. The offloading decision covers three cases: first, p_n of the tasks stored in the task buffer are processed on the local CPU; second, no task is processed, i.e., p_n = 0; third, the p_n tasks in the task buffer are simultaneously offloaded to the k_n edge servers with the best channels for processing, where p_n is the number of tasks to be processed from the task buffer at time n and k_n ≤ m is the number of edge servers processing tasks at time n.
In a specific implementation, the method of the invention also initializes the corresponding action set, and adopts replication coding, i.e., as soon as any one of the chosen edge servers completes the computation, the p_n tasks have been successfully processed. In this embodiment, m = 5, j = 2, h ∈ {−130, −125} (dB) with a corresponding two-state channel transition probability matrix, k_n ∈ {1, 2, 3, 4, 5}, and the probability that each edge server completes its computation is 0.5.
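For concreteness, a sketch of such a two-state channel chain; the transition probabilities themselves are not recoverable from the text, so the uniform matrix below is a placeholder assumption:

```python
import numpy as np

H_DB = [-130, -125]               # the j = 2 channel states in this embodiment
# Placeholder transition matrix: the embodiment's actual probabilities are
# not given in the text, so uniform values are assumed here.
P_CHANNEL = np.array([[0.5, 0.5],
                      [0.5, 0.5]])

rng = np.random.default_rng(0)

def step_channel(h_idx):
    """Sample the next channel-state index for one edge server."""
    return rng.choice(len(H_DB), p=P_CHANNEL[h_idx])
```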
In one embodiment, the strategy in step S3 is a greedy strategy, which specifically includes randomly selecting an action with a probability e, and selecting the action value function Q in the current state with probabilities 1-e eval Minimum motion.
In particular, that is to say that a n =argmin a Q eval (s n A) to expedite evaluation of the networkConvergence of (2). In this example, e is 0.1. The action value function here is an evaluation network function Q eval It needs to be calculated by the following formula:
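A minimal sketch of this selection rule under the stated assumptions; cost_k, p_hat, q_net, and reachable_pds stand for the known cost c_k(s, a), the learned estimate of P_k(s̃ | s, a), the evaluation network's output at a post-decision state, and the enumeration of post-decision states reachable from (s, a), all hypothetical names:

```python
import numpy as np

def q_eval(s, a, cost_k, p_hat, q_net, reachable_pds):
    # Q_eval(s, a) = c_k(s, a) + sum over post-decision states of
    # P_k(s_pds | s, a) * Q~_eval(s_pds, a)
    return cost_k(s, a) + sum(p_hat(s_pds, s, a) * q_net(s_pds, a)
                              for s_pds in reachable_pds(s, a))

def select_action(s, actions, q_of, epsilon=0.1):
    # epsilon-greedy: explore with probability epsilon, otherwise pick the
    # action with the smallest (cost-based) action value
    if np.random.rand() < epsilon:
        return actions[np.random.randint(len(actions))]
    return min(actions, key=lambda a: q_of(s, a))
```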
in one embodiment, the post-decision state in step S4 is an intermediate state before the transition to the next state after the action is taken by the current state, and is represented by:
wherein p is n For the number of tasks to be processed in the task buffer at time n, Δ b n Indicating the number of tasks that have been newly reached,representing a post-decision state after taking action at n moments;
state s at the next moment n+1 Is represented by the following form:
b max which represents the capacity of the task buffer and,respectively shows the channel state of the 1 st edge server at the time of n +1, the channel state of the 2 nd edge server at the time of n +1, the channel state of the mth edge server at the time of n +1, and the state s at the next time n+1 I.e. the state at time n + 1.
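For illustration, the split into a deterministic post-decision step and a random transition could look as follows, assuming the state is a tuple (b, h) and hypothetical samplers for task arrivals and per-server channel evolution:

```python
B_MAX = 15  # task buffer capacity b_max in this embodiment

def post_decision(state, p):
    b, h = state              # h: tuple of per-server channel states
    return (b - p, h)         # deterministic part: remove the processed tasks

def transition(pds, sample_arrivals, step_channel):
    b_pds, h = pds
    db = sample_arrivals()                         # newly arrived tasks
    b_next = min(b_pds + db, B_MAX)                # excess tasks overflow
    h_next = tuple(step_channel(hi) for hi in h)   # channels evolve independently
    return (b_next, h_next), db
```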
In one embodiment, step S4 includes:
acquiring the holding cost of tasks in the task buffer, the privacy cost of offloading tasks to the edge servers, the energy cost of processing tasks, and the overflow cost incurred when the task buffer overflows due to insufficient capacity, wherein the energy cost c_energy of processing tasks covers both the energy consumed by processing tasks on the local CPU and the energy consumed by offloading tasks to the edge servers for processing; the holding cost of tasks in the task buffer is c_holding = b_n − p_n, the privacy cost of offloading tasks to the edge servers is c_privacy = p_n, and the overflow cost incurred when the task buffer overflows due to insufficient capacity is c_overflow = max{b_n − p_n + Δb_n − b_max, 0}, where b_max is the capacity of the task buffer, p_n is the number of tasks to be processed from the task buffer at time n, b_n denotes the number of tasks in the task buffer at time n, and Δb_n denotes the number of newly arrived tasks;
obtaining, from the holding cost, the privacy cost, the energy cost, and the overflow cost, the cost function c_k(s, a) for the transition from s_n to s̃_n and the cost function c_u(s̃_n, s_{n+1}) for the transition from s̃_n to s_{n+1}:
c_k(s, a) = η_1·c_holding + η_2·c_energy + η_3·c_privacy,
c_u(s̃, s') = η_4·c_overflow,
where η_1, η_2, η_3, η_4 are the corresponding weight coefficients, and k and u denote state transition identifiers;
after observing a complete state transition, forming a set of experience {s_n, a_n, s̃_n, c_n, s_{n+1}} comprising the current state, the action taken, the post-decision state, the cost incurred by the action, and the state at the next time, and storing it in the experience buffer.
Specifically, storing unprocessed tasks in the task buffer incurs a holding cost; offloading tasks to an edge server for processing incurs a corresponding privacy cost; and if the task buffer overflows due to insufficient capacity, a corresponding overflow cost is incurred.
In a particular embodiment, the buffer can store up to 15 tasks, i.e., b_max = 15, and Δb ∈ {0, 1, 2, 3, 4, 5}; the corresponding arrival probability distribution is random in the warm-start tasks and uniform in the target task. The weight coefficients take the values η_1 = 50, η_2 = 10^6, η_3 = 150, η_4 = 300.
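Combining the above, a sketch of the two cost functions with this embodiment's weights; treating the privacy and energy terms as depending on whether processing is local or offloaded is an assumption consistent with the description, and e_local and e_edge are the per-task energies defined below:

```python
ETA1, ETA2, ETA3, ETA4 = 50, 10**6, 150, 300
B_MAX = 15

def cost_k(b, p, offloaded, e_local, e_edge):
    """Cost of the transition s_n -> post-decision state."""
    c_holding = b - p
    c_energy = p * (e_edge if offloaded else e_local)
    c_privacy = p if offloaded else 0   # assumed: local processing leaks nothing
    return ETA1 * c_holding + ETA2 * c_energy + ETA3 * c_privacy

def cost_u(b, p, db):
    """Cost of the transition post-decision state -> s_{n+1}."""
    c_overflow = max(b - p + db - B_MAX, 0)
    return ETA4 * c_overflow
```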
In one embodiment, when a task is processed at the local CPU, the energy consumed per unit task is:
e_local = κ·L^3·ζ^3 / τ^2,
where κ is a CPU parameter, L is the task size, ζ is the CPU frequency, and τ is the time interval; when a task is offloaded to an edge server, the energy consumed per unit task is:
e_edge = (τ·W·N_0 / h)·(2^(L/(τ·W)) − 1),
where W is the bandwidth of the edge computing network, h is the channel power gain, and N_0 is the noise power spectral density.
The energy consumed by a unit task is the energy consumed by a single task.
In this example, κ = 10^-28, L = 10^3, ζ = 800, τ = 10^-3, W = 10 MHz, and N_0 = 10^-19 W/Hz.
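A sketch of this energy model with the embodiment's parameters; the closed form of the offloading energy is an assumption (a standard Shannon-capacity transmission model built from the variables W, h, and N_0 named above):

```python
KAPPA, L_BITS, ZETA, TAU = 1e-28, 1e3, 800.0, 1e-3   # kappa, L, zeta, tau
W, N0 = 10e6, 1e-19                                   # bandwidth (Hz), noise PSD

def e_local():
    # e_local = kappa * L^3 * zeta^3 / tau^2
    return KAPPA * L_BITS**3 * ZETA**3 / TAU**2

def e_edge(h):
    # Assumed model: energy to transmit L bits in time tau over bandwidth W
    # at channel power gain h (h in linear scale, e.g. 10**(-130/10) for -130 dB)
    power = (N0 * W / h) * (2 ** (L_BITS / (TAU * W)) - 1)
    return TAU * power
```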
In one embodiment, a single set of experience randomly selected from the experience buffer in step S5 is {s_n, a_n, s̃_n, c_n, s_{n+1}}. When a batch of experiences is replayed, the weight parameters of the evaluation network are updated by minimizing the following loss function (averaged over the batch):
L(θ) = (c_u(s̃_n, s_{n+1}) + γ·min_{a'} Q_target(s_{n+1}, a') − Q̃_eval(s̃_n, a_n; θ))^2,
where γ represents the discount factor, θ is the weight parameter of the evaluation network, c_u(s̃_n, s_{n+1}) is the cost function from the post-decision state s̃_n to the state s_{n+1} at time n + 1, Q̃_eval(s̃_n, a_n; θ) is the output of the evaluation network after inputting the post-decision state s̃_n and the action a_n, and Q_target(s', a') is the output of the Q_target function after inputting state s' and action a'. Based on the experience replay mechanism, the evaluation network and the corresponding function Q_eval are updated for each set of experiences; after a batch of experiences has been processed and the network updated, the parameters of the evaluation network are assigned to the target network Q_target, and the corresponding function Q_target is updated at the same time. The corresponding Q_eval and Q_target functions are updated according to:
Q(s, a) = c_k(s, a) + Σ_s̃ P_k(s̃ | s, a)·Q̃(s̃, a),
where Q(s, a) represents the value of a Q function in state s and action a, the Q function being either the evaluation function Q_eval or the target function Q_target, and Q̃(s̃, a) denotes the output of the evaluation network or the target network, respectively, after inputting the post-decision state s̃ and the action a.
In particular, the input variables of the evaluation network are the post-decision state s̃ and the action a, and θ denotes the parameters of the evaluation network's neural network; each input pair (s̃, a) produces an output, with which the network is updated. Because the experience replay mechanism randomly selects a batch of experiences, several post-decision-state and action pairs (s̃, a) are input to update the network. Both the evaluation network and the target network must be substituted into the above formula to compute the corresponding Q_eval and Q_target functions.
The action value function Q_eval is computed from the evaluation network according to the above formula and is used only for action selection (step S3); the action value function Q_target is computed from the target network according to the same formula and is used only as the target value during network updates (step S5).
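A hedged PyTorch-style sketch of this update; the batch layout, the compose_q helper (standing in for the composition formula above applied to the target network), and the discount value are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def replay_update(eval_net, target_net, compose_q, optimizer, batch, gamma=0.9):
    s_pds, a, c_u, s_next = batch          # tensors sampled from the buffer
    with torch.no_grad():
        # target: c_u + gamma * min over a' of Q_target(s_{n+1}, a'), where
        # Q_target(s', a') = c_k(s', a') + sum_s~ P_k(s~ | s', a') * Q~_target(s~, a')
        q_next = compose_q(target_net, s_next)        # shape [batch, num_actions]
        target = c_u + gamma * q_next.min(dim=1).values
    pred = eval_net(s_pds, a)              # Q~_eval(s_pds, a; theta)
    loss = F.mse_loss(pred, target)        # squared TD error from the text
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```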
Fig. 3 is a schematic processing framework diagram of a task offloading method based on post-decision state deep reinforcement learning according to an embodiment of the present invention.
In one embodiment, the estimate of P_k and the evaluation network Q_eval are updated every 200 steps, the target network Q_target is updated every 1000 steps, and every 10000 steps the average cost over 100000 randomly sampled state transitions is measured to evaluate algorithm performance.
In order to more clearly illustrate the method proposed by the present invention, it is explained below by specific experimental data.
1. Simulation conditions and content
The operating system is Microsoft Windows 10, and the simulation is programmed in Python. The simulation uses one set of parameters to compare the effect of the proposed algorithm with the existing common deep reinforcement learning algorithm DQN.
2. Analysis of simulation results
Fig. 4 compares the effect of the currently popular deep reinforcement learning algorithm DQN with that of the proposed algorithm. Compared with DQN, the average cost of the proposed algorithm decreases more quickly, indicating a more efficient task offloading algorithm based on post-decision state deep reinforcement learning.
Aiming at the limited convergence speed of deep learning algorithms in prior-art task offloading methods, the invention introduces a deep neural network and updates the parameter values of the evaluation network and the target network by experience replay and minimization of a loss function; the additionally adopted warm-start learning process accelerates the updating of the deep network parameters. Secondly, a reinforcement learning algorithm based on the post-decision-state learning framework is adopted; to address the traditional framework's need for additional prior information, an extra learning process is added to estimate the state transition probability from the current state to the post-decision state, eliminating the requirement for prior knowledge. The structural advantage of the post-decision state further accelerates network updating, so the algorithm performance of the invention surpasses that of the traditional deep learning algorithm DQN.
It should be understood that the above description of the preferred embodiments is illustrative, and not restrictive, and that various changes and modifications may be made therein by those skilled in the art without departing from the scope of the invention as defined in the appended claims.
Claims (9)
1. A task offloading method based on post-decision state deep reinforcement learning, characterized by comprising the following steps:
s1: setting a state set, a post-decision state set and an action set, wherein the state set comprises a system state, the post-decision state set comprises a post-decision state, and the action set comprises an action to be taken;
s2: the random initialization starting state specifically includes: initializing post-decision state after taking action a in state sState transition probability ofEvaluating a networkWeight parameter of, target networkK represents a transition identifier from the state s to the post-decision state, and the evaluation network is paired with a Markov random problem corresponding to the target taskPerforming hot start, and setting the iteration number to be 1, wherein the experience buffer is used for storing the state at a certain moment, the action taken, the corresponding post-decision state, the cost generated by the action taken and the state at the next moment;
s3: selecting an action according to the strategy, wherein one action corresponds to one unloading scheme;
s4: observing the post-decision state, forming a group of experiences by the post-decision state, the cost generated by taking the action of the step S3 and the state at the next moment, and storing the group of experiences in an experience buffer;
s5: at certain intervals, updateRandomly selecting a batch of experiences from the experience buffer, and updating the evaluation network by experience playbackAnd corresponding evaluation function, and updating the evaluation networkAssigning the weight parameter of to the target networkUpdating a corresponding target network function;
s6: adding 1 to the iteration number, and repeatedly executing the steps S3-S5 until the network is evaluatedConverging to finish hot start;
s7: setting the current iteration number to 1, emptying the task buffer and reinitializingEvaluation network obtained by hot startRepeating steps S3-S6 for the target task until network convergence is evaluated; based on an evaluation networkAnd correspondingly evaluating the network function to obtain the optimal unloading strategy in different states.
2. The task offloading method based on post-decision state deep reinforcement learning of claim 1, wherein the system state in the state set in step S1 has the form:
s_n = {b_n, h_n^1, h_n^2, ..., h_n^m},
where s_n is the system state at time n, jointly determined by the channel states and the state of the task buffer; the task buffer b has i states in total, denoted b = {b_1, b_2, ..., b_i}, where b_1 and b_i represent the 1st and i-th states of the task buffer and b_n denotes the number of tasks in the task buffer at time n; the channel h has j states, denoted h = {h_1, h_2, ..., h_j}, where h_1 and h_j represent the 1st and j-th states of the channel; m represents the number of edge servers, and h_n^1, ..., h_n^m represent the channel states of the 1st through m-th edge servers at time n.
3. The task offloading method based on post-decision state deep reinforcement learning of claim 2, wherein each action in the action set in step S1 corresponds to an offloading decision, the action taken at time n being a_n; the offloading decision covers three cases: first, p_n of the tasks stored in the task buffer are processed at the local CPU; second, no task is processed, i.e., p_n = 0; third, the p_n tasks in the task buffer are simultaneously offloaded to the k_n edge servers with the best channels for processing, where p_n is the number of tasks to be processed from the task buffer at time n and k_n ≤ m is the number of edge servers processing offloaded tasks at time n.
4. The task offloading method based on post-decision state deep reinforcement learning of claim 1, wherein the strategy in step S3 is an ε-greedy strategy: a random action is selected with probability ε, and with probability 1 − ε the action that minimizes the action value function Q_eval in the current state is selected.
5. The task offloading method based on post-decision state deep reinforcement learning of claim 2, wherein the post-decision state in step S4 is the intermediate state reached after the current state takes the action but before the transition to the next state, represented as:
s̃_n = {b_n − p_n, h_n^1, h_n^2, ..., h_n^m},
where p_n is the number of tasks processed from the task buffer at time n, Δb_n denotes the number of newly arrived tasks, and s̃_n represents the post-decision state after taking the action at time n;
the state s_{n+1} at the next time is represented as:
s_{n+1} = {min{b_n − p_n + Δb_n, b_max}, h_{n+1}^1, h_{n+1}^2, ..., h_{n+1}^m},
where b_max represents the capacity of the task buffer and h_{n+1}^1, h_{n+1}^2, ..., h_{n+1}^m represent the channel states of the 1st through m-th edge servers at time n + 1; the state s_{n+1} at the next time is the state at time n + 1.
6. The task offloading method based on post-decision state deep reinforcement learning of claim 1, wherein step S4 comprises:
acquiring the holding cost of tasks in the task buffer, the privacy cost of offloading tasks to the edge servers, the energy cost of processing tasks, and the overflow cost incurred when the task buffer overflows due to insufficient capacity, wherein the energy cost c_energy of processing tasks covers both the energy consumed by processing tasks on the local CPU and the energy consumed by offloading tasks to the edge servers for processing; the holding cost of tasks in the task buffer is c_holding = b_n − p_n, the privacy cost of offloading tasks to the edge servers is c_privacy = p_n, and the overflow cost incurred when the task buffer overflows due to insufficient capacity is c_overflow = max{b_n − p_n + Δb_n − b_max, 0}, where b_max is the capacity of the task buffer, p_n is the number of tasks to be processed from the task buffer at time n, b_n denotes the number of tasks in the task buffer at time n, and Δb_n denotes the number of newly arrived tasks;
obtaining, from the holding cost, the privacy cost, the energy cost, and the overflow cost, the cost function c_k(s, a) for the transition from s_n to s̃_n and the cost function c_u(s̃_n, s_{n+1}) for the transition from s̃_n to s_{n+1}:
c_k(s, a) = η_1·c_holding + η_2·c_energy + η_3·c_privacy,
c_u(s̃, s') = η_4·c_overflow,
where η_1, η_2, η_3, η_4 are the corresponding weight coefficients, and k and u denote state transition identifiers.
7. The task offloading method based on post-decision state deep reinforcement learning of claim 6, wherein, when a task is processed at the local CPU, the energy consumed per unit task is:
e_local = κ·L^3·ζ^3 / τ^2,
where κ is a CPU parameter, L is the task size, ζ is the CPU frequency, and τ is the time interval; when a task is offloaded to an edge server, the energy consumed per unit task is:
e_edge = (τ·W·N_0 / h)·(2^(L/(τ·W)) − 1),
where W is the bandwidth of the edge computing network, h is the channel power gain, and N_0 is the noise power spectral density.
9. The task offloading method based on post-decision state deep reinforcement learning of claim 1, wherein a single set of experience randomly selected from the experience buffer when executing step S5 is {s_n, a_n, s̃_n, c_n, s_{n+1}}; when a batch of experiences is replayed, the weight parameters of the evaluation network are updated by minimizing the following loss function (averaged over the batch):
L(θ) = (c_u(s̃_n, s_{n+1}) + γ·min_{a'} Q_target(s_{n+1}, a') − Q̃_eval(s̃_n, a_n; θ))^2,
where γ represents the discount factor, θ is the weight parameter of the evaluation network, c_u(s̃_n, s_{n+1}) is the cost function from the post-decision state s̃_n to the state s_{n+1} at time n + 1, Q̃_eval(s̃_n, a_n; θ) is the output of the evaluation network after inputting the post-decision state s̃_n and the action a_n, and Q_target(s', a') is the output of the Q_target function after inputting state s' and action a'; based on the experience replay mechanism, the evaluation network and the corresponding function Q_eval are updated for each set of experiences; after a batch of experiences has been processed and the network updated, the parameters of the evaluation network are assigned to the target network Q_target, and the corresponding function Q_target is updated at the same time; the corresponding Q_eval and Q_target functions are updated according to:
Q(s, a) = c_k(s, a) + Σ_s̃ P_k(s̃ | s, a)·Q̃(s̃, a).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210572305.4A CN115016858B (en) | 2022-05-24 | 2022-05-24 | Task unloading method based on post-decision state deep reinforcement learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210572305.4A CN115016858B (en) | 2022-05-24 | 2022-05-24 | Task unloading method based on post-decision state deep reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115016858A true CN115016858A (en) | 2022-09-06 |
CN115016858B CN115016858B (en) | 2024-03-29 |
Family
ID=83069645
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210572305.4A Active CN115016858B (en) | 2022-05-24 | 2022-05-24 | Task unloading method based on post-decision state deep reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115016858B (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111726826A (en) * | 2020-05-25 | 2020-09-29 | 上海大学 | Online task unloading method in base station intensive edge computing network |
CN113064671A (en) * | 2021-04-27 | 2021-07-02 | 清华大学 | Multi-agent-based edge cloud extensible task unloading method |
CN113434212A (en) * | 2021-06-24 | 2021-09-24 | 北京邮电大学 | Cache auxiliary task cooperative unloading and resource allocation method based on meta reinforcement learning |
CN113504987A (en) * | 2021-06-30 | 2021-10-15 | 广州大学 | Mobile edge computing task unloading method and device based on transfer learning |
CN113612843A (en) * | 2021-08-02 | 2021-11-05 | 吉林大学 | MEC task unloading and resource allocation method based on deep reinforcement learning |
WO2022027776A1 (en) * | 2020-08-03 | 2022-02-10 | 威胜信息技术股份有限公司 | Edge computing network task scheduling and resource allocation method and edge computing system |
CN114205353A (en) * | 2021-11-26 | 2022-03-18 | 华东师范大学 | Calculation unloading method based on hybrid action space reinforcement learning algorithm |
-
2022
- 2022-05-24 CN CN202210572305.4A patent/CN115016858B/en active Active
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111726826A (en) * | 2020-05-25 | 2020-09-29 | 上海大学 | Online task unloading method in base station intensive edge computing network |
WO2022027776A1 (en) * | 2020-08-03 | 2022-02-10 | 威胜信息技术股份有限公司 | Edge computing network task scheduling and resource allocation method and edge computing system |
CN113064671A (en) * | 2021-04-27 | 2021-07-02 | 清华大学 | Multi-agent-based edge cloud extensible task unloading method |
CN113434212A (en) * | 2021-06-24 | 2021-09-24 | 北京邮电大学 | Cache auxiliary task cooperative unloading and resource allocation method based on meta reinforcement learning |
CN113504987A (en) * | 2021-06-30 | 2021-10-15 | 广州大学 | Mobile edge computing task unloading method and device based on transfer learning |
CN113612843A (en) * | 2021-08-02 | 2021-11-05 | 吉林大学 | MEC task unloading and resource allocation method based on deep reinforcement learning |
CN114205353A (en) * | 2021-11-26 | 2022-03-18 | 华东师范大学 | Calculation unloading method based on hybrid action space reinforcement learning algorithm |
Non-Patent Citations (3)
Title |
---|
- ZHANG Haibo et al.: "Research on Offloading Strategy Based on NOMA-MEC in the Internet of Vehicles", Journal of Electronics & Information Technology, vol. 42, no. 4, 30 April 2021 (2021-04-30) *
- ZHANG Haibo; JING Kunlun; LIU Kaijian; HE Xiaofan: "An Offloading Strategy in the Internet of Vehicles Based on Software-Defined Networking and Mobile Edge Computing", Journal of Electronics & Information Technology, no. 03, 15 March 2020 (2020-03-15) *
- PENG Jun; WANG Chenglong; JIANG Fu; GU Xin; 牟??; LIU Weirong: "A Fast Deep Q-learning Network Edge-Cloud Migration Strategy for Vehicular Services", Journal of Electronics & Information Technology, no. 01, 15 January 2020 (2020-01-15) *
Also Published As
Publication number | Publication date |
---|---|
CN115016858B (en) | 2024-03-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112882815B (en) | Multi-user edge calculation optimization scheduling method based on deep reinforcement learning | |
CN108962238B (en) | Dialogue method, system, equipment and storage medium based on structured neural network | |
Bistritz et al. | Online exp3 learning in adversarial bandits with delayed feedback | |
CN112817653A (en) | Cloud-side-based federated learning calculation unloading computing system and method | |
WO2021227508A1 (en) | Deep reinforcement learning-based industrial 5g dynamic multi-priority multi-access method | |
CN113434212A (en) | Cache auxiliary task cooperative unloading and resource allocation method based on meta reinforcement learning | |
CN114065863B (en) | Federal learning method, apparatus, system, electronic device and storage medium | |
CN112511336B (en) | Online service placement method in edge computing system | |
CN114866494B (en) | Reinforced learning intelligent agent training method, modal bandwidth resource scheduling method and device | |
CN115374853A (en) | Asynchronous federal learning method and system based on T-Step polymerization algorithm | |
CN116010054A (en) | Heterogeneous edge cloud AI system task scheduling frame based on reinforcement learning | |
CN110647403A (en) | Cloud computing resource allocation method in multi-user MEC system | |
CN114065929A (en) | Training method and device for deep reinforcement learning model and storage medium | |
CN113626104A (en) | Multi-objective optimization unloading strategy based on deep reinforcement learning under edge cloud architecture | |
CN116523079A (en) | Reinforced learning-based federal learning optimization method and system | |
CN113760511A (en) | Vehicle edge calculation task unloading method based on depth certainty strategy | |
CN113867843A (en) | Mobile edge computing task unloading method based on deep reinforcement learning | |
CN111740925A (en) | Deep reinforcement learning-based flow scheduling method | |
CN117436485A (en) | Multi-exit point end-edge-cloud cooperative system and method based on trade-off time delay and precision | |
CN115016858A (en) | Task unloading method based on post-decision state deep reinforcement learning | |
CN113778550A (en) | Task unloading system and method based on mobile edge calculation | |
CN111488208A (en) | Edge cloud cooperative computing node scheduling optimization method based on variable step length bat algorithm | |
CN116367231A (en) | Edge computing Internet of vehicles resource management joint optimization method based on DDPG algorithm | |
CN113157344B (en) | DRL-based energy consumption perception task unloading method in mobile edge computing environment | |
CN117014355A (en) | TSSDN dynamic route decision method based on DDPG deep reinforcement learning algorithm |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |