CN115016858B - Task unloading method based on post-decision state deep reinforcement learning - Google Patents

Task unloading method based on post-decision state deep reinforcement learning

Info

Publication number
CN115016858B
CN115016858B · CN202210572305.4A · CN202210572305A
Authority
CN
China
Prior art keywords
task
state
post
action
buffer
Prior art date
Legal status
Active
Application number
CN202210572305.4A
Other languages
Chinese (zh)
Other versions
CN115016858A (en)
Inventor
张竞哲
贺晓帆
张晨
周嘉曦
Current Assignee
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202210572305.4A priority Critical patent/CN115016858B/en
Publication of CN115016858A publication Critical patent/CN115016858A/en
Application granted granted Critical
Publication of CN115016858B publication Critical patent/CN115016858B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/445Program loading or initiating
    • G06F9/44594Unloading
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00Indexing scheme relating to G06F9/00
    • G06F2209/50Indexing scheme relating to G06F9/50
    • G06F2209/509Offload
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/70Reducing energy consumption in communication networks in wireless communication networks

Abstract

The invention discloses a task unloading method based on post-decision state deep reinforcement learning, which can make decisions on every dimension of the unloading action, such as the unloading target and the unloading quantity of the tasks. For different optimization targets, the optimal strategy can be obtained by changing the cost function. The invention uses the experience replay mechanism of DQN, randomly selecting collected historical experience as training samples, which improves learning efficiency. Meanwhile, a post-decision state learning framework and an additional hot start process are used to accelerate learning. The traditional post-decision learning framework has high learning efficiency but requires additional prior information. The invention acquires the additional information required by traditional post-decision learning through an extra learning process, and combines the efficient post-decision state learning framework, the hot start process and the experience replay mechanism to achieve rapid convergence of the unloading method.

Description

Task unloading method based on post-decision state deep reinforcement learning
Technical Field
The invention relates to the technical field of machine learning and distributed computing, and in particular to a task unloading method based on post-decision state deep reinforcement learning.
Background
Against the background of explosive growth in computing demand and data size, edge computing is widely applied to address the limited computing power of terminal devices. Edge computing is a paradigm in which tasks are offloaded to edge devices for processing. Mobile devices usually focus on reducing latency and energy consumption: when the wireless channel condition is poor, a mobile device prefers to process tasks on its local CPU, whereas when the channel condition is good it tends to offload most tasks to the network edge for processing. However, if a task is offloaded to an unreliable server, information such as the position and identity of the user may be revealed, threatening user privacy. Therefore, the problem of privacy disclosure needs to be considered while balancing energy consumption.
On the other hand, as the scale of computing tasks keeps growing, distributed computing is also widely used in edge computing. The computational efficiency of a distributed computing system is susceptible to the computing power of individual nodes and to the communication environment: some nodes may take a long time to finish their computation and return results, causing a straggler effect that introduces computation delay and degrades efficiency. Coded computation is a framework that applies coding theory to distributed computing; by introducing redundancy through flexible coding techniques, the straggler effect can be effectively mitigated. Repetition coding is a simple and common coding method in which the same task is offloaded to several different nodes for processing, so that a result is obtained as soon as any one node completes the computation. However, when the channel condition is poor, blindly replicating tasks and offloading them to multiple servers not only wastes energy but is also unfavorable for privacy protection. To balance demands such as energy consumption and privacy preservation, the problem can be modeled as a Markov Decision Process (MDP) with appropriate states, an action space and a cost function, and the optimal offloading strategy that minimizes the long-term cost can be solved with reinforcement learning algorithms.
In practical scenarios, the state space of such Markov problems is usually large, and generic reinforcement learning algorithms converge slowly, which hinders practical application.
Disclosure of Invention
The invention provides a task unloading method based on post-decision state deep reinforcement learning, which is used for solving, or at least partially solving, the technical problem of low task unloading efficiency in the prior art.
The invention discloses a task unloading method based on post-decision state deep reinforcement learning, which comprises the following steps:
S1: setting a state set, a post-decision state set and an action set, wherein the state set comprises the system state, the post-decision state set comprises the post-decision state, and the action set comprises the actions that can be taken;
S2: randomly initializing the initial state, which specifically comprises: initializing the state transition probability P_k(s̃ | s, a) from state s to the post-decision state s̃ reached after taking action a in state s, the weight parameters of the evaluation network Q_eval, the weight parameters of the target network Q_target, the corresponding evaluation network function, target network function and experience buffer, where k denotes the transition identifier from state s to the post-decision state s̃; performing a hot start of the evaluation network Q_eval using a Markov random problem corresponding to the target task, and setting the iteration number to 1, wherein the experience buffer is used for storing the state at a given moment, the action taken, the corresponding post-decision state, the cost incurred by taking the action, and the state at the next moment;
S3: selecting an action according to the strategy, wherein the action corresponds to an unloading scheme;
S4: observing the post-decision state, forming a set of experience from the post-decision state, the cost incurred by taking the action of step S3 and the state at the next moment, and storing this set of experience in the experience buffer;
S5: at certain intervals, updating P_k(s̃ | s, a), randomly selecting a batch of experience from the experience buffer for experience replay to update the weight parameters of the evaluation network Q_eval and the corresponding evaluation network function, assigning the updated weight parameters of the evaluation network Q_eval to the target network Q_target, and updating the corresponding target network function;
S6: adding 1 to the iteration number, and repeatedly executing steps S3-S5 until the evaluation network Q_eval converges, which completes the hot start;
S7: resetting the iteration number to 1, emptying the task buffer, re-initializing P_k(s̃ | s, a), and, with the evaluation network Q_eval obtained from the hot start, repeating steps S3-S6 for the target task until the evaluation network converges; according to the evaluation network Q_eval and the corresponding evaluation network function, the optimal unloading strategy under different states is obtained.
In one embodiment, the system state in the state set in step S1 is in the form of:
s_n = (b_n, h_n^1, ..., h_n^m),
wherein s_n is the system state at time n, defined jointly by the channel states and the state of the task buffer; the task buffer b has i states, denoted b = {b_1, b_2, ..., b_i}, where b_1 and b_i respectively represent the 1st and i-th states of the task buffer, and b_n represents the number of tasks in the task buffer at time n; the channel h has j states, denoted h = {h_1, h_2, ..., h_j}, where h_1 and h_j respectively represent the 1st and j-th states of the channel; m represents the number of edge servers, and h_n^1, ..., h_n^m respectively represent the channel state of the 1st edge server through the m-th edge server at time n.
In one embodiment, the actions in the action set in step S1 correspond to unloading decisions, the action taken at time n being a_n. The unloading decision covers three cases: in the first case, p_n tasks in the task buffer are processed on the local CPU; in the second case, no task is processed, i.e. p_n = 0; in the third case, p_n tasks in the task buffer are simultaneously offloaded to the k_n edge servers with the best channels for processing; here p_n is the number of tasks to be processed in the task buffer at time n, and k_n is the number of edge servers performing task processing at time n, with k_n ≤ m.
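As an illustration, the action set can be enumerated in code. The encoding below (a tuple of offload target, p_n and k_n) is an assumption made for this sketch and is not the patent's exact data structure.

```python
from typing import List, Tuple

# Illustrative action encoding (assumed, not the patented representation):
#   ("none", 0, 0)   -> process nothing at time n (p_n = 0)
#   ("local", p, 0)  -> process p tasks on the local CPU
#   ("edge", p, k)   -> offload p tasks to the k edge servers with the best channels
def build_action_set(b_n: int, m: int) -> List[Tuple[str, int, int]]:
    actions: List[Tuple[str, int, int]] = [("none", 0, 0)]
    for p in range(1, b_n + 1):
        actions.append(("local", p, 0))
        for k in range(1, m + 1):
            actions.append(("edge", p, k))
    return actions

# Example: 3 buffered tasks and m = 5 edge servers give 1 + 3 + 3*5 = 19 actions
print(len(build_action_set(3, 5)))
```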
In one embodiment, the strategy in step S3 is a greedy strategy, which specifically comprises randomly selecting an action with probability ε, and with probability 1−ε selecting the action that minimizes the action value function Q_eval in the current state.
In one embodiment, the post-decision state in step S4 is the intermediate state reached after the action is taken in the current state and before moving to the next state; the post-decision state is expressed in the following form:
wherein p_n is the number of tasks to be processed in the task buffer at time n, Δb_n denotes the number of newly arrived tasks, and s̃_n represents the post-decision state after taking the action at time n;
the state s_{n+1} at the next moment is expressed in the following form:
wherein b_max denotes the capacity of the task buffer, and h_{n+1}^1, h_{n+1}^2, ..., h_{n+1}^m respectively represent the channel states of the 1st, 2nd, ..., m-th edge server at time n+1; the state s_{n+1} at the next moment is the state at time n+1.
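A minimal Python sketch of one plausible reading of these dynamics is given below: the buffer part of the post-decision state reflects whether the p_n submitted tasks were actually completed (which is why the transition to the post-decision state is random and must be estimated), and the next state then adds the random arrivals Δb_n, clipped at b_max, together with new channel states. The success model and all names are assumptions made for illustration.

```python
import random

H_LEVELS = [-130, -125]            # channel power gains in dB (embodiment values)

# Assumed transition model for illustration only: the p_n tasks leave the buffer
# if at least one of the k_n chosen servers finishes (repetition coding);
# otherwise they remain. Local processing (k_n = 0) is assumed to always finish.
def sample_post_decision_state(b_n, h_n, p_n, k_n, p_complete=0.5):
    if p_n == 0 or k_n == 0:
        success = True
    else:
        success = any(random.random() < p_complete for _ in range(k_n))
    b_pds = b_n - p_n if success else b_n
    return (b_pds, tuple(h_n))

def sample_next_state(pds, delta_b, b_max, num_servers=5):
    b_pds, _ = pds
    b_next = min(b_pds + delta_b, b_max)          # arrivals beyond b_max overflow
    h_next = tuple(random.choice(H_LEVELS) for _ in range(num_servers))
    return (b_next, h_next)

# Example with the embodiment's parameters (b_max = 15, m = 5)
h_n = [random.choice(H_LEVELS) for _ in range(5)]
pds = sample_post_decision_state(b_n=7, h_n=h_n, p_n=3, k_n=2)
print(pds, sample_next_state(pds, delta_b=random.randint(0, 5), b_max=15))
```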
In one embodiment, step S4 includes:
obtaining the holding cost of tasks in the task buffer, the privacy cost of offloading tasks to an edge server, the energy cost of processing tasks, and the overflow cost incurred when the task buffer overflows due to insufficient capacity, wherein the energy cost c_energy of processing tasks comprises the energy consumption of processing tasks on the local CPU and of offloading tasks to an edge server for processing; the holding cost of tasks in the task buffer is c_holding = b_n - p_n, the privacy cost of offloading tasks to an edge server is c_privacy = p_n, and the overflow cost incurred when the task buffer overflows due to insufficient capacity is c_overflow = max{b_n - p_n + Δb_n - b_max, 0}, where b_max is the size of the task buffer, p_n is the number of tasks to be processed in the task buffer at time n, b_n denotes the number of tasks in the task buffer at time n, and Δb_n denotes the number of newly arrived tasks;
obtaining, from the holding cost, the privacy cost, the energy cost and the overflow cost, the cost function c_k(s, a) from s_n to s̃_n and the cost function c_u(s̃_n, s_{n+1}) from s̃_n to s_{n+1}:
c_k(s, a) = η_1·c_holding + η_2·c_energy + η_3·c_privacy,
c_u(s̃_n, s_{n+1}) = η_4·c_overflow,
wherein η_1, η_2, η_3, η_4 are the corresponding weight coefficients, and k and u respectively denote state transition identifiers;
after a complete state transition is observed, a set of experience formed of the current state, the action taken, the post-decision state, the costs incurred by the action and the state at the next moment is stored in the experience buffer.
In one embodiment, when a task is processed on the local CPU, the energy consumed per task is:
e_local = κ·L^3·ζ^3/τ^2,
wherein κ is the CPU parameter, L is the task size, ζ is the CPU frequency, and τ is the time interval; when a task is offloaded to an edge server, the energy consumed per task is:
wherein W is the bandwidth of the edge computing network, h is the channel power gain, and N_0 is the noise power spectral density.
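The following sketch computes both per-task energies. The local-processing expression follows the formula above; the offloading expression is an assumed standard transmission-energy model (minimum power needed to deliver L bits within a slot of length τ over bandwidth W and channel gain h), supplied here only because the original equation image is not reproduced in this text.

```python
# Per-task energy models (illustrative; the offloading formula is an assumption).
def e_local(kappa, L, zeta, tau):
    # Energy to finish an L-sized task locally within one slot of length tau.
    return kappa * L**3 * zeta**3 / tau**2

def e_offload(L, tau, W, h, N0):
    # Assumed Shannon-capacity-based model: transmit L bits within time tau.
    rate = L / tau                                  # required rate in bit/s
    power = (N0 * W / h) * (2 ** (rate / W) - 1)    # minimum transmit power
    return power * tau

print(e_local(1e-28, 1e3, 800, 1e-3))               # ~5.12e-05 J with the embodiment values
```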
In one embodiment, the update of P_k(s̃ | s, a) in step S5 adopts a statistical updating method.
In one embodiment, when a single set of experience randomly selected from the experience buffer in step S5 is replayed as part of a batch of experience, the weight parameters of the evaluation network are updated by minimizing the following loss function:
L(θ) = ( c_u(s̃_n, s_{n+1}) + γ·min_{a'} Q_target(s_{n+1}, a') - Q̃_eval(s̃_n, a_n; θ) )^2,
where γ denotes the discount factor, θ is the weight parameter of the evaluation network, c_u(s̃_n, s_{n+1}) denotes the cost function from the post-decision state s̃_n to the state s_{n+1} at time n+1, Q̃_eval(s̃_n, a_n; θ) is the output of the evaluation network for the input post-decision state s̃_n and action a_n, and Q_target(s', a') is the output of the Q_target function for the input state s' and action a'. The corresponding Q_eval and Q_target functions are updated according to the following formula:
Q(s, a) = c_k(s, a) + Σ_{s̃} P_k(s̃ | s, a) · Q̃(s̃, a),
wherein Q(s, a) denotes the value of the Q function under state s and action a, the Q function comprising the evaluation function Q_eval and the target function Q_target, and Q̃(s̃, a) denotes the output of the corresponding evaluation network or target network for the input post-decision state s̃ and action a.
Compared with the prior art, the invention has the following advantages and beneficial technical effects:
the invention realizes the evaluation network by introducing the deep neural network and playing back with experience and minimizing the loss functionIs +.>Is used for updating the weight parameters of the (a). The additionally adopted hot start learning process can accelerate the parameter updating of the deep network. Second, the invention is an enhancement based on a post-decision learning frameworkThe algorithm and the traditional post-decision state learning framework need additional prior information, and the invention carries out additional estimation on the state transition probability from the current state to the post-decision state by adding an additional learning process, so that the structural advantage of the post-decision state can be utilized to further accelerate the updating of the network while eliminating the prior knowledge requirement, and the algorithm performance of the invention is better than the traditional deep learning algorithm DQN, thereby improving the task unloading efficiency.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a task offloading method based on post-decision state deep reinforcement learning in an embodiment of the invention;
FIG. 2 is a diagram illustrating a post-decision state according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a processing framework of a task offloading method based on post-decision state deep reinforcement learning according to an embodiment of the present invention;
fig. 4 is a diagram of simulation results of a method according to an embodiment of the present invention.
Detailed Description
The invention aims to provide a computation unloading strategy learning framework based on post-decision state deep reinforcement learning for quickly solving the optimal decision in task unloading. The framework introduces a post-decision state learning framework and a hot start process on the basis of the traditional deep learning algorithm DQN, thereby accelerating algorithm convergence.
The main conception and innovation of the invention are as follows:
the invention relates to a task unloading method based on deep reinforcement learning, which can make decisions on each dimension of unloading actions, such as unloading objects, unloading quantity and the like of tasks. The optimal strategy under different targets can be realized by changing the cost function for different optimization targets. The method utilizes the experience playback mechanism of the DQN, randomly selects the collected historical experience as a training sample, and can improve the learning efficiency. Meanwhile, a post-decision learning framework and an additional hot start process are utilized to accelerate the learning speed. The traditional post-decision state learning framework has higher learning efficiency but requires additional prior information, and the invention provides a task unloading algorithm based on post-decision state deep reinforcement learning, which utilizes an additional learning process to acquire the additional information required in the traditional post-decision state learning, and utilizes an efficient post-decision learning framework, a hot start process and an experience playback mechanism to realize rapid algorithm convergence.
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The embodiment of the invention provides a task unloading method based on post-decision state deep reinforcement learning, which comprises the following steps:
S1: setting a state set, a post-decision state set and an action set, wherein the state set comprises the system state, the post-decision state set comprises the post-decision state, and the action set comprises the actions that can be taken;
S2: randomly initializing the initial state, which specifically comprises: initializing the state transition probability P_k(s̃ | s, a) from state s to the post-decision state s̃ reached after taking action a in state s, the weight parameters of the evaluation network Q_eval, the weight parameters of the target network Q_target, the corresponding evaluation network function, target network function and experience buffer, where k denotes the transition identifier from state s to the post-decision state s̃; performing a hot start of the evaluation network Q_eval using a Markov random problem corresponding to the target task, and setting the iteration number to 1, wherein the experience buffer is used for storing the state at a given moment, the action taken, the corresponding post-decision state, the cost incurred by taking the action, and the state at the next moment;
S3: selecting an action according to the strategy, wherein the action corresponds to an unloading scheme;
S4: observing the post-decision state, forming a set of experience from the post-decision state, the cost incurred by taking the action of step S3 and the state at the next moment, and storing this set of experience in the experience buffer;
S5: at certain intervals, updating P_k(s̃ | s, a), randomly selecting a batch of experience from the experience buffer for experience replay to update the weight parameters of the evaluation network Q_eval and the corresponding evaluation network function, assigning the updated weight parameters of the evaluation network Q_eval to the target network Q_target, and updating the corresponding target network function;
S6: adding 1 to the iteration number, and repeatedly executing steps S3-S5 until the evaluation network Q_eval converges, which completes the hot start;
S7: resetting the iteration number to 1, emptying the task buffer, re-initializing P_k(s̃ | s, a), and, with the evaluation network Q_eval obtained from the hot start, repeating steps S3-S6 for the target task until the evaluation network converges; according to the evaluation network Q_eval and the corresponding evaluation network function, the optimal unloading strategy under different states is obtained.
Referring to fig. 1, a flowchart of a task offloading method based on post-decision state deep reinforcement learning in an embodiment of the invention is shown. In the implementation process, the Markov random problem corresponding to the target task is a task similar to the target task; it differs from the target task of this example only in that new tasks arrive according to a different distribution. Fig. 2 is a schematic diagram of the post-decision state in the embodiment of the invention.
It should be noted that steps S3 to S6 are executed repeatedly both for the hot-start task and for the target task; the target task is the task for which the unloading decision needs to be made.
The optimal unloading strategy comprises the unloading target and the unloading quantity, wherein the unloading target is the location where the tasks are to be processed and the unloading quantity is the number of tasks unloaded.
In general, the invention relates to a task unloading method based on post-decision state deep reinforcement learning, and the corresponding optimal unloading strategy for different targets can be solved by changing the cost function. The method is a deep reinforcement learning algorithm that combines deep learning with the post-decision state learning framework of reinforcement learning; it retains the advantages of the common deep learning algorithm DQN, eliminates the need for prior knowledge in the post-decision state learning framework, and further improves the training speed of the model with an additional hot start process. A minimal sketch of the overall training procedure is given below.
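The following Python sketch illustrates the S1-S7 flow under simplifying assumptions; the agent and environment interfaces (select_action, step, replay and so on) are illustrative names, not the patented implementation.

```python
import random

def train(agent, env, hot_start_env, max_iters=100_000):
    """Sketch of S1-S7: hot start on a similar Markov problem, then the target task."""
    for phase_env in (hot_start_env, env):        # S2-S6 (hot start) and then S7
        agent.reset_transition_estimate()         # (re-)initialize P_k(pds | s, a)
        s = phase_env.reset()                     # random initial state
        for n in range(1, max_iters + 1):
            a = agent.select_action(s)            # S3: greedy w.r.t. Q_eval with prob. 1-eps
            pds, cost, s_next = phase_env.step(a) # S4: observe post-decision state and cost
            agent.buffer.append((s, a, pds, cost, s_next))
            if n % agent.update_interval == 0:    # S5: periodic updates
                agent.update_transition_estimate()            # statistical update of P_k
                batch = random.sample(agent.buffer,
                                      min(len(agent.buffer), agent.batch_size))
                agent.replay(batch)               # minimize the loss, update Q_eval
            if n % agent.target_interval == 0:
                agent.sync_target()               # copy Q_eval weights to Q_target
            s = s_next
            if agent.converged():                 # S6/S7: stop when Q_eval has converged
                break
```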
In one embodiment, the system state in the state set in step S1 is in the form of:
s_n = (b_n, h_n^1, ..., h_n^m),
wherein s_n is the system state at time n, defined jointly by the channel states and the state of the task buffer; the task buffer b has i states, denoted b = {b_1, b_2, ..., b_i}, where b_1 and b_i respectively represent the 1st and i-th states of the task buffer, and b_n represents the number of tasks in the task buffer at time n; the channel h has j states, denoted h = {h_1, h_2, ..., h_j}, where h_1 and h_j respectively represent the 1st and j-th states of the channel; m represents the number of edge servers, and h_n^1, ..., h_n^m respectively represent the channel state of the 1st edge server through the m-th edge server at time n.
In one embodiment, the actions in the action set in step S1 correspond to offloading decisions, the action taken at time n being a_n. The offloading decision covers three cases: in the first case, p_n tasks in the task buffer are processed on the local CPU; in the second case, no task is processed, i.e. p_n = 0; in the third case, p_n tasks in the task buffer are simultaneously offloaded to the k_n edge servers with the best channels for processing; here p_n is the number of tasks to be processed in the task buffer at time n, and k_n is the number of edge servers handling offloaded tasks at time n, with k_n ≤ m.
In a specific implementation process, the method of the invention also initializes the corresponding action set, and employs repetition-coded computation, i.e., as long as one of the chosen edge servers completes the computation, the p_n tasks have been successfully processed. In this embodiment, m = 5, j = 2, h = {-130, -125} (dB) with the corresponding channel state transition probabilities, k_n ∈ {1, 2, 3, 4, 5}, and the probability that each edge server completes the computation is 0.5.
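Under the usual independence assumption between servers, the probability that an offloaded batch succeeds grows quickly with k_n, which is the trade-off the unloading action has to balance against energy and privacy cost:

```python
# Probability that at least one of k_n servers finishes, assuming independent
# completions with per-server probability 0.5 as in this embodiment.
def batch_success_prob(k_n, p_complete=0.5):
    return 1 - (1 - p_complete) ** k_n

for k in range(1, 6):
    print(k, batch_success_prob(k))   # 0.5, 0.75, 0.875, 0.9375, 0.96875
```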
In one embodiment, the strategy in step S3 is a greedy strategy, which specifically comprises randomly selecting an action with probability ε, and with probability 1−ε selecting the action that minimizes the action value function Q_eval in the current state.
Specifically, a_n = argmin_a Q_eval(s_n, a), which accelerates the updating of the evaluation network Q_eval. In this embodiment, ε = 0.1. The action value function is the evaluation network function Q_eval, which is calculated by the following formula:
Q_eval(s_n, a) = c_k(s_n, a) + Σ_{s̃} P_k(s̃ | s_n, a) · Q̃_eval(s̃, a; θ).
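A sketch of this selection rule is shown below; cost_k, p_k and q_pds stand for the stage cost c_k(s, a), the estimated transition probabilities P_k(·|s, a) and the evaluation network output Q̃_eval, and are assumed callables supplied by the learner.

```python
import random

# Epsilon-greedy selection over the expanded action value
#   Q_eval(s, a) = c_k(s, a) + sum_pds P_k(pds | s, a) * Q~_eval(pds, a),
# with epsilon = 0.1 as in this embodiment.
def select_action(s, actions, cost_k, p_k, q_pds, eps=0.1):
    if random.random() < eps:
        return random.choice(actions)             # explore
    def q_value(a):
        return cost_k(s, a) + sum(prob * q_pds(pds, a)
                                  for pds, prob in p_k(s, a).items())
    return min(actions, key=q_value)              # exploit: smallest expected cost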
in one embodiment, after the post-decision state in step S4 is the intermediate state before moving to the next state after taking action for the current state, the post-decision state is expressed in the form of:
wherein p is n For the number of tasks to be processed in the task buffer at time n, Δb n Indicating the number of tasks that have been reached,representing a post-decision state after taking action at time n;
state s at the next time n+1 The expression form is:
b max indicating the capacity of the task buffer,respectively representing the channel state of the 1 st edge server at the time of n+1, the channel state of the 2 nd edge server at the time of n+1, the channel state of the m th edge server at the time of n+1, and the state s at the next time n+1 I.e. the state at time n + 1.
In one embodiment, step S4 includes:
obtaining the holding cost of tasks in the task buffer, the privacy cost of offloading tasks to an edge server, the energy cost of processing tasks, and the overflow cost incurred when the task buffer overflows due to insufficient capacity, wherein the energy cost c_energy of processing tasks comprises the energy consumption of processing tasks on the local CPU and of offloading tasks to an edge server for processing; the holding cost of tasks in the task buffer is c_holding = b_n - p_n, the privacy cost of offloading tasks to an edge server is c_privacy = p_n, and the overflow cost incurred when the task buffer overflows due to insufficient capacity is c_overflow = max{b_n - p_n + Δb_n - b_max, 0}, where b_max is the size of the task buffer, p_n is the number of tasks to be processed in the task buffer at time n, b_n denotes the number of tasks in the task buffer at time n, and Δb_n denotes the number of newly arrived tasks;
obtaining, from the holding cost, the privacy cost, the energy cost and the overflow cost, the cost function c_k(s, a) from s_n to s̃_n and the cost function c_u(s̃_n, s_{n+1}) from s̃_n to s_{n+1}:
c_k(s, a) = η_1·c_holding + η_2·c_energy + η_3·c_privacy,
c_u(s̃_n, s_{n+1}) = η_4·c_overflow,
wherein η_1, η_2, η_3, η_4 are the corresponding weight coefficients, and k and u respectively denote state transition identifiers;
after a complete state transition is observed, a set of experience formed of the current state, the action taken, the post-decision state, the costs incurred by the action and the state at the next moment is stored in the experience buffer.
Specifically, storing an unprocessed task in a task buffer will generate a buffer cost, and if the task is offloaded to an edge server, a corresponding privacy cost will be generated, and if the task buffer overflows due to insufficient capacity, a corresponding overflow cost will be generated.
In this embodiment, the buffer can store at most 15 tasks, i.e. b_max = 15, and Δb takes values in {0, 1, 2, 3, 4, 5}, where the corresponding arrival probability distribution is random for the hot-start task and uniform for the target task. The weight coefficients can be taken as η_1 = 50, η_2 = 10^6, η_3 = 150, η_4 = 300.
In one embodiment, when a task is processed on the local CPU, the energy consumed per task is:
e_local = κ·L^3·ζ^3/τ^2,
wherein κ is the CPU parameter, L is the task size, ζ is the CPU frequency, and τ is the time interval; when a task is offloaded to an edge server, the energy consumed per task is:
wherein W is the bandwidth of the edge computing network, h is the channel power gain, and N_0 is the noise power spectral density.
The energy consumed by a unit task is the energy consumed by a single task.
In this example, κ = 10^-28, L = 10^3, ζ = 800, τ = 10^-3, W = 10 MHz, and N_0 = 10^-19 W/Hz.
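Plugging these values into the local-energy expression reconstructed above (itself an assumption, since the original equation image is not reproduced here) gives on the order of tens of microjoules per locally processed task:

```python
kappa, L, zeta, tau = 1e-28, 1e3, 800, 1e-3
print(kappa * L**3 * zeta**3 / tau**2)   # 5.12e-05 J per task, under the assumed formula
```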
In one embodiment, the update of P_k(s̃ | s, a) in step S5 adopts a statistical updating method.
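One plausible reading of this statistical update is a simple count-based estimator of P_k(s̃ | s, a), sketched below; the text does not commit to this particular estimator.

```python
from collections import defaultdict

class TransitionEstimator:
    """Count-based estimate of P_k(pds | s, a) from observed transitions."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def observe(self, s, a, pds):
        self.counts[(s, a)][pds] += 1

    def probabilities(self, s, a):
        total = sum(self.counts[(s, a)].values())
        if total == 0:
            return {}                      # no observations yet for this (s, a)
        return {pds: c / total for pds, c in self.counts[(s, a)].items()}
```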
In one embodiment, when a single set of experience randomly selected from the experience buffer in step S5 is replayed as part of a batch of experience, the weight parameters of the evaluation network are updated by minimizing the following loss function:
L(θ) = ( c_u(s̃_n, s_{n+1}) + γ·min_{a'} Q_target(s_{n+1}, a') - Q̃_eval(s̃_n, a_n; θ) )^2,
where γ denotes the discount factor, θ is the weight parameter of the evaluation network, c_u(s̃_n, s_{n+1}) denotes the cost function from the post-decision state s̃_n to the state s_{n+1} at time n+1, Q̃_eval(s̃_n, a_n; θ) is the output of the evaluation network for the input post-decision state s̃_n and action a_n, and Q_target(s', a') is the output of the Q_target function for the input state s' and action a'. According to the experience replay mechanism, the evaluation network and the corresponding function Q_eval are updated for each set of experience; after a batch of experience has been used to update the network, the parameters of the evaluation network are assigned to the target network Q_target and the corresponding function Q_target is updated at the same time. The corresponding Q_eval and Q_target functions are updated according to the following formula:
Q(s, a) = c_k(s, a) + Σ_{s̃} P_k(s̃ | s, a) · Q̃(s̃, a),
wherein Q(s, a) denotes the value of the Q function under state s and action a, the Q function comprising the evaluation function Q_eval and the target function Q_target, and Q̃(s̃, a) denotes the output of the corresponding evaluation network or target network for the input post-decision state s̃ and action a.
Specifically, the input variables of the evaluation network include the post-decision state s̃ and the action a, and θ is the weight parameter of the evaluation network. For each input pair (s̃, a) the neural network produces an output Q̃_eval(s̃, a; θ), and the network is updated. Since a batch of experience is randomly selected by the experience replay mechanism, multiple such post-decision state and action pairs (s̃, a) are input to update the network. Both the evaluation network Q_eval and the target network Q_target are substituted into the above formula to calculate the corresponding Q_eval and Q_target functions.
The action value function Q_eval is updated by the evaluation network according to the above formula and is used only at the time of action selection (S3); the action value function Q_target is updated by the target network according to the above formula and is used as the target value only at the time of the network update (S5).
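A compact sketch of this replay update in PyTorch is shown below. The tensor layout, the min over next actions and the helper full_q_target (which expands Q(s', a') = c_k(s', a') + Σ P_k(·|s', a')·Q̃_target(·, a') for every candidate action a') are assumptions for illustration rather than the patented implementation.

```python
import torch
import torch.nn.functional as F

def replay_update(q_eval, optimizer, batch, gamma, full_q_target):
    """One experience-replay step on a sampled batch (illustrative sketch).

    q_eval(pds, a)        -> predicted post-decision value, shape (B,)
    full_q_target(s_next) -> Q_target(s', a') for every action a', shape (B, |A|)
    batch = (pds, a, c_u, s_next) as tensors built from the sampled experience.
    """
    pds, a, c_u, s_next = batch
    with torch.no_grad():
        q_next = full_q_target(s_next).min(dim=1).values   # best (lowest-cost) next action
        target = c_u + gamma * q_next                       # TD target from the PDS onwards
    pred = q_eval(pds, a)
    loss = F.mse_loss(pred, target)                         # the loss minimized above
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```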
Fig. 3 is a schematic diagram of a processing framework of a task offloading method based on post-decision state deep reinforcement learning according to an embodiment of the invention.
In a specific embodiment, P_k(s̃ | s, a) and Q_eval are updated every 200 iterations, the target network and Q_target are updated every 1000 iterations, and every 10000 iterations the average cost over 100000 randomly sampled state transitions is counted to evaluate the algorithm performance.
In order to more clearly illustrate the method proposed by the present invention, the following description is made by specific experimental data.
1. Simulation conditions and content
The operating system is Microsoft Windows and the programming simulation language is python. The simulation adopts a group of parameters to simulate the effect of the algorithm and the conventional common deep reinforcement learning algorithm DQN.
2. Simulation result analysis
Fig. 4 is a graph comparing the effects of the currently popular deep reinforcement learning algorithm DQN and the proposed algorithm. It can be seen from the figure that the proposed algorithm can reduce average cost faster than DQN, and is a more efficient task offloading algorithm based on post-decision state deep reinforcement learning.
Aiming at the problem that the convergence speed of deep learning algorithms is still limited in prior-art task unloading methods, the invention updates the parameter values of the evaluation network and the target network by introducing a deep neural network, using experience replay and minimizing a loss function. The additionally employed hot start learning process accelerates parameter updating of the deep network. Second, an enhanced algorithm based on the post-decision state learning framework is adopted, which solves the problem that the traditional post-decision state learning framework requires additional prior information.
It should be understood that the foregoing description of the preferred embodiments is not intended to limit the scope of the invention, but rather to limit the scope of the claims, and that those skilled in the art can make substitutions or modifications without departing from the scope of the invention as set forth in the appended claims.

Claims (9)

1. A task unloading method based on post-decision state deep reinforcement learning, characterized by comprising the following steps:
S1: setting a state set, a post-decision state set and an action set, wherein the state set comprises the system state, the post-decision state set comprises the post-decision state, and the action set comprises the actions that can be taken;
S2: randomly initializing the initial state, which specifically comprises: initializing the state transition probability P_k(s̃ | s, a) from state s to the post-decision state s̃ reached after taking action a in state s, the weight parameters of the evaluation network Q_eval, the weight parameters of the target network Q_target, the corresponding evaluation network function, target network function and experience buffer, where k denotes the transition identifier from state s to the post-decision state s̃; performing a hot start of the evaluation network Q_eval using a Markov random problem corresponding to the target task, and setting the iteration number to 1, wherein the experience buffer is used for storing the state at a given moment, the action taken, the corresponding post-decision state, the cost incurred by taking the action, and the state at the next moment;
S3: selecting an action according to the strategy, wherein the action corresponds to an unloading scheme;
S4: observing the post-decision state, forming a set of experience from the post-decision state, the cost incurred by taking the action of step S3 and the state at the next moment, and storing this set of experience in the experience buffer;
S5: at certain intervals, updating P_k(s̃ | s, a), randomly selecting a batch of experience from the experience buffer for experience replay to update the weight parameters of the evaluation network Q_eval and the corresponding evaluation network function, assigning the updated weight parameters of the evaluation network Q_eval to the target network Q_target, and updating the corresponding target network function;
S6: adding 1 to the iteration number, and repeatedly executing steps S3-S5 until the evaluation network Q_eval converges, which completes the hot start;
S7: resetting the iteration number to 1, emptying the task buffer, re-initializing P_k(s̃ | s, a), and, with the evaluation network Q_eval obtained from the hot start, repeating steps S3-S6 for the target task until the evaluation network converges; according to the evaluation network Q_eval and the corresponding evaluation network function, the optimal unloading strategy under different states is obtained.
2. The post-decision state deep reinforcement learning-based task offloading method of claim 1, wherein the system states in the state set in step S1 are in the form of:
s_n = (b_n, h_n^1, ..., h_n^m),
wherein s_n is the system state at time n, defined jointly by the channel states and the state of the task buffer; the task buffer b has i states, denoted b = {b_1, b_2, ..., b_i}, where b_1 and b_i respectively represent the 1st and i-th states of the task buffer, and b_n represents the number of tasks in the task buffer at time n; the channel h has j states, denoted h = {h_1, h_2, ..., h_j}, where h_1 and h_j respectively represent the 1st and j-th states of the channel; m represents the number of edge servers, and h_n^1, ..., h_n^m respectively represent the channel state of the 1st edge server through the m-th edge server at time n.
3. The post-decision state deep reinforcement learning based task offloading method of claim 2, wherein the actions in the action set in step S1 correspond to offloading decisions, the action taken at time n being a_n. The offloading decision covers three cases: in the first case, p_n tasks in the task buffer are processed on the local CPU; in the second case, no task is processed, i.e. p_n = 0; in the third case, p_n tasks in the task buffer are simultaneously offloaded to the k_n edge servers with the best channels for processing; here p_n is the number of tasks to be processed in the task buffer at time n, and k_n is the number of edge servers handling offloaded tasks at time n, with k_n ≤ m.
4. The task offloading method of claim 1, wherein the strategy in step S3 is a greedy strategy, specifically comprising randomly selecting an action with probability ε, and with probability 1−ε selecting the action that minimizes the action value function Q_eval in the current state.
5. The post-decision state deep reinforcement learning-based task offloading method of claim 2, wherein the post-decision state in step S4 is the intermediate state reached after the action is taken in the current state and before moving to the next state, and is expressed in the following form:
wherein p_n is the number of tasks to be processed in the task buffer at time n, Δb_n denotes the number of newly arrived tasks, and s̃_n represents the post-decision state after taking the action at time n;
the state s_{n+1} at the next moment is expressed in the following form:
wherein b_max denotes the capacity of the task buffer, and h_{n+1}^1, h_{n+1}^2, ..., h_{n+1}^m respectively represent the channel states of the 1st, 2nd, ..., m-th edge server at time n+1; the state s_{n+1} at the next moment is the state at time n+1.
6. The post-decision state deep reinforcement learning-based task offloading method of claim 1, wherein step S4 comprises:
obtaining the holding cost of tasks in the task buffer, the privacy cost of offloading tasks to an edge server, the energy cost of processing tasks, and the overflow cost incurred when the task buffer overflows due to insufficient capacity, wherein the energy cost c_energy of processing tasks comprises the energy consumption of processing tasks on the local CPU and of offloading tasks to an edge server for processing; the holding cost of tasks in the task buffer is c_holding = b_n - p_n, the privacy cost of offloading tasks to an edge server is c_privacy = p_n, and the overflow cost incurred when the task buffer overflows due to insufficient capacity is c_overflow = max{b_n - p_n + Δb_n - b_max, 0}, where b_max is the size of the task buffer, p_n is the number of tasks to be processed in the task buffer at time n, b_n denotes the number of tasks in the task buffer at time n, and Δb_n denotes the number of newly arrived tasks;
obtaining, from the holding cost, the privacy cost, the energy cost and the overflow cost, the cost function c_k(s, a) from s_n to s̃_n and the cost function c_u(s̃_n, s_{n+1}) from s̃_n to s_{n+1}:
c_k(s, a) = η_1·c_holding + η_2·c_energy + η_3·c_privacy,
c_u(s̃_n, s_{n+1}) = η_4·c_overflow,
wherein η_1, η_2, η_3, η_4 are the corresponding weight coefficients, and k and u respectively denote state transition identifiers;
after a complete state transition is observed, a set of experience formed of the current state, the action taken, the post-decision state, the costs incurred by the action and the state at the next moment is stored in the experience buffer.
7. The post-decision state deep reinforcement learning based task offloading method of claim 6, wherein when the task is processed on the local CPU, the energy consumed per task is:
e_local = κ·L^3·ζ^3/τ^2,
wherein κ is the CPU parameter, L is the task size, ζ is the CPU frequency, and τ is the time interval; when a task is offloaded to an edge server, the energy consumed per task is:
wherein W is the bandwidth of the edge computing network, h is the channel power gain, and N_0 is the noise power spectral density.
8. The post-decision state deep reinforcement learning-based task offloading method of claim 1, wherein the update of P_k(s̃ | s, a) in step S5 adopts a statistical updating method.
9. The task offloading method of claim 1, wherein when a single set of experience randomly selected from the experience buffer in step S5 is replayed as part of a batch of experience, the weight parameters of the evaluation network are updated by minimizing the following loss function:
L(θ) = ( c_u(s̃_n, s_{n+1}) + γ·min_{a'} Q_target(s_{n+1}, a') - Q̃_eval(s̃_n, a_n; θ) )^2,
where γ denotes the discount factor, θ is the weight parameter of the evaluation network, c_u(s̃_n, s_{n+1}) denotes the cost function from the post-decision state s̃_n to the state s_{n+1} at time n+1, Q̃_eval(s̃_n, a_n; θ) is the output of the evaluation network for the input post-decision state s̃_n and action a_n, and Q_target(s', a') is the output of the Q_target function for the input state s' and action a'; according to the experience replay mechanism, the evaluation network and the corresponding function Q_eval are updated for each set of experience, and after a batch of experience has been used to update the network, the parameters of the evaluation network are assigned to the target network Q_target and the corresponding function Q_target is updated at the same time; the corresponding Q_eval and Q_target functions are updated according to the following formula:
Q(s, a) = c_k(s, a) + Σ_{s̃} P_k(s̃ | s, a) · Q̃(s̃, a),
wherein Q(s, a) denotes the value of the Q function under state s and action a, the Q function comprising the evaluation function Q_eval and the target function Q_target, and Q̃(s̃, a) denotes the output of the corresponding evaluation network or target network for the input post-decision state s̃ and action a.
CN202210572305.4A 2022-05-24 2022-05-24 Task unloading method based on post-decision state deep reinforcement learning Active CN115016858B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210572305.4A CN115016858B (en) 2022-05-24 2022-05-24 Task unloading method based on post-decision state deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210572305.4A CN115016858B (en) 2022-05-24 2022-05-24 Task unloading method based on post-decision state deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN115016858A CN115016858A (en) 2022-09-06
CN115016858B true CN115016858B (en) 2024-03-29

Family

ID=83069645

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210572305.4A Active CN115016858B (en) 2022-05-24 2022-05-24 Task unloading method based on post-decision state deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN115016858B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111726826A (en) * 2020-05-25 2020-09-29 上海大学 Online task unloading method in base station intensive edge computing network
CN113064671A (en) * 2021-04-27 2021-07-02 清华大学 Multi-agent-based edge cloud extensible task unloading method
CN113434212A (en) * 2021-06-24 2021-09-24 北京邮电大学 Cache auxiliary task cooperative unloading and resource allocation method based on meta reinforcement learning
CN113504987A (en) * 2021-06-30 2021-10-15 广州大学 Mobile edge computing task unloading method and device based on transfer learning
CN113612843A (en) * 2021-08-02 2021-11-05 吉林大学 MEC task unloading and resource allocation method based on deep reinforcement learning
WO2022027776A1 (en) * 2020-08-03 2022-02-10 威胜信息技术股份有限公司 Edge computing network task scheduling and resource allocation method and edge computing system
CN114205353A (en) * 2021-11-26 2022-03-18 华东师范大学 Calculation unloading method based on hybrid action space reinforcement learning algorithm

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111726826A (en) * 2020-05-25 2020-09-29 上海大学 Online task unloading method in base station intensive edge computing network
WO2022027776A1 (en) * 2020-08-03 2022-02-10 威胜信息技术股份有限公司 Edge computing network task scheduling and resource allocation method and edge computing system
CN113064671A (en) * 2021-04-27 2021-07-02 清华大学 Multi-agent-based edge cloud extensible task unloading method
CN113434212A (en) * 2021-06-24 2021-09-24 北京邮电大学 Cache auxiliary task cooperative unloading and resource allocation method based on meta reinforcement learning
CN113504987A (en) * 2021-06-30 2021-10-15 广州大学 Mobile edge computing task unloading method and device based on transfer learning
CN113612843A (en) * 2021-08-02 2021-11-05 吉林大学 MEC task unloading and resource allocation method based on deep reinforcement learning
CN114205353A (en) * 2021-11-26 2022-03-18 华东师范大学 Calculation unloading method based on hybrid action space reinforcement learning algorithm

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A fast deep Q-learning network edge-cloud migration strategy for in-vehicle services; Peng Jun; Wang Chenglong; Jiang Fu; Gu Xin; Mou Yueyue; Liu Weirong; Journal of Electronics & Information Technology; 20200115(01); full text *
An offloading strategy based on software-defined networking and mobile edge computing in the Internet of Vehicles; Zhang Haibo; Jing Kunlun; Liu Kaijian; He Xiaofan; Journal of Electronics & Information Technology; 20200315(03); full text *
Research on NOMA-MEC-based offloading strategies in the Internet of Vehicles; Zhang Haibo et al.; Journal of Electronics & Information Technology; 20210430; 42(4); full text *

Also Published As

Publication number Publication date
CN115016858A (en) 2022-09-06

Similar Documents

Publication Publication Date Title
CN112882815B (en) Multi-user edge calculation optimization scheduling method based on deep reinforcement learning
CN113242568B (en) Task unloading and resource allocation method in uncertain network environment
CN108962238B (en) Dialogue method, system, equipment and storage medium based on structured neural network
CN112817653A (en) Cloud-side-based federated learning calculation unloading computing system and method
CN110955463B (en) Internet of things multi-user computing unloading method supporting edge computing
CN110890930B (en) Channel prediction method, related equipment and storage medium
CN113434212A (en) Cache auxiliary task cooperative unloading and resource allocation method based on meta reinforcement learning
CN114065863B (en) Federal learning method, apparatus, system, electronic device and storage medium
CN112181655A (en) Hybrid genetic algorithm-based calculation unloading method in mobile edge calculation
CN112083967B (en) Cloud edge computing task unloading method, computer equipment and storage medium
CN112995343B (en) Edge node calculation unloading method with performance and demand matching capability
CN114065929A (en) Training method and device for deep reinforcement learning model and storage medium
CN111158912A (en) Task unloading decision method based on deep learning in cloud and mist collaborative computing environment
CN112686376A (en) Node representation method based on timing diagram neural network and incremental learning method
CN113760511A (en) Vehicle edge calculation task unloading method based on depth certainty strategy
CN113867843A (en) Mobile edge computing task unloading method based on deep reinforcement learning
CN114385272B (en) Ocean task oriented online adaptive computing unloading method and system
CN115016858B (en) Task unloading method based on post-decision state deep reinforcement learning
Tang et al. Variational deep q network
CN113778550B (en) Task unloading system and method based on mobile edge calculation
Cesa-Bianchi et al. Efficient online learning via randomized rounding
CN114449584A (en) Distributed computing unloading method and device based on deep reinforcement learning
Jeong et al. Deep reinforcement learning-based task offloading decision in the time varying channel
CN113961204A (en) Vehicle networking computing unloading method and system based on multi-target reinforcement learning
CN114116995A (en) Session recommendation method, system and medium based on enhanced graph neural network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant