CN113543342A - Reinforced learning resource allocation and task unloading method based on NOMA-MEC - Google Patents
- Publication number
- CN113543342A (application CN202110756466.4A)
- Authority
- CN
- China
- Prior art keywords
- network
- task
- state
- mobile equipment
- noma
- Prior art date
- Legal status: Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W72/00—Local resource management
- H04W72/50—Allocation or scheduling criteria for wireless resources
- H04W72/53—Allocation or scheduling criteria for wireless resources based on regulatory allocation policies
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W16/00—Network planning, e.g. coverage or traffic planning tools; Network deployment, e.g. resource partitioning or cells structures
- H04W16/22—Traffic simulation tools or models
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W24/00—Supervisory, monitoring or testing arrangements
- H04W24/02—Arrangements for optimising operational condition
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W72/00—Local resource management
- H04W72/04—Wireless resource allocation
- H04W72/044—Wireless resource allocation based on the type of the allocated resource
- H04W72/0453—Resources in frequency domain, e.g. a carrier in FDMA
Abstract
The invention discloses a reinforcement learning resource allocation and task offloading method based on NOMA-MEC, belonging to the technical field of mobile communication networks. The invention effectively solves the problem of the huge task load on mobile devices, reduces the delay of the whole communication process, obtains the optimal resource allocation in different environments, and improves the utilization efficiency of channel resources.
Description
Technical Field
The invention belongs to the technical field of mobile communication networks, and particularly relates to a reinforcement learning resource allocation and task offloading method based on NOMA-MEC.
Background
With the development of the times, mobile terminals such as mobile phones have gained great popularity, and more and more mobile applications such as face recognition, online interactive games, and augmented reality are emerging and drawing great attention. These mobile applications typically require large amounts of resources, intensive computation, and low latency while consuming much energy, and handsets with limited computing resources and battery life can hardly support them. Mobile Edge Computing (MEC) can meet the high computational requirements of these tasks, and applying Non-Orthogonal Multiple Access (NOMA) technology can further reduce multi-task offloading delay.
Mobile edge computing is a form of distributed computing that places data processing, application execution, and even some functional services on nodes at the edge of the network. The mobile edge consists of one or more edge servers: a server with computing and storage functions is configured at a traditional base station, upgrading it to a mobile edge computing base station. By offloading computation-intensive or delay-sensitive applications to nearby MEC servers, resource-constrained mobile devices can reduce task processing time while reducing energy consumption and transmission cost.
Non-orthogonal multiple access is one of the key technologies of 5th-generation cellular networks. By allocating different power levels to end users, it can serve multiple users simultaneously on the same frequency band, effectively improving spectrum utilization. Compared with traditional Orthogonal Multiple Access (OMA), it can provide task offloading for more users under the same channel resources; therefore, considering the various factors that affect the task offloading process, the invention accesses users to the communication system in NOMA mode.
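A NOMA receiver typically separates the power-multiplexed users by successive interference cancellation (SIC), as detailed later in step 2. The following Python sketch is purely illustrative (function names and the linear-power model are not taken from the patent): users sharing a subchannel are decoded in descending order of received power, and each decoded signal is subtracted before the next user is decoded.

```python
def sic_sinrs(rx_powers, noise):
    """Successive interference cancellation on one subchannel.

    rx_powers: received signal powers (linear scale) of the users
    sharing the subchannel. The strongest user is decoded first; its
    signal is then subtracted, so each later user sees only the
    still-undecoded (weaker) users as interference.
    """
    order = sorted(range(len(rx_powers)), key=lambda i: -rx_powers[i])
    residual = sum(rx_powers)
    sinrs = {}
    for i in order:
        interference = residual - rx_powers[i]
        sinrs[i] = rx_powers[i] / (interference + noise)
        residual -= rx_powers[i]  # signal removed after decoding
    return sinrs
```

For example, with two users of received power 4.0 and 1.0 and unit noise, the stronger user is decoded against the weaker one plus noise, while the weaker user, decoded last, sees only noise.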
Machine Learning (ML) refers to algorithms that automatically analyze data to obtain rules and use those rules to predict unknown data; as an emerging technology with broad application prospects, it is studied by more and more scholars, and 5G mobile communication networks are now increasingly supported by machine learning. By learning method, machine learning is divided into four major categories: supervised learning, semi-supervised learning, unsupervised learning, and Reinforcement Learning (RL). Unlike the other three, RL does not require complete prior information: the agent learns continuously through interaction with the environment and finally finds the optimal policy. RL theory plays a key role in dynamic planning, system control, and decision-making problems; in particular, when dealing with dynamic optimization problems, it obtains the optimal solution through continuous "trial and error" learning in a changing environment. For the joint subcarrier channel resource allocation and task offloading problem in the NOMA-MEC environment, the diversity of transmission environments greatly increases the difficulty of designing a resource allocation strategy, and applying RL theory in wireless communication systems provides a brand-new design idea for solving it.
Each mobile device can offload all or part of its computing task to the MEC server by selecting a carrier channel, thereby reducing delay and energy consumption and obtaining a good user experience. Existing traditional algorithms are feasible for the task offloading and resource allocation problems, but are not suitable for a highly real-time MEC system. In the invention, each mobile device acts as an agent, and each agent continuously learns and improves its policy, so the environment is dynamically non-stationary from the perspective of each agent, and key techniques of DQN such as experience replay cannot be used directly.
Disclosure of Invention
Purpose of the invention: aiming at the drawbacks of low spectrum efficiency and limited computing capacity of user equipment, the invention provides a reinforcement learning resource allocation and task offloading method based on NOMA-MEC; mobile devices adopting the NOMA technology and the MADDPG algorithm can intelligently allocate subcarrier channels and offload tasks, thereby reducing system delay.
The technical scheme is as follows: in order to achieve the purpose, the invention adopts the following technical scheme:
a resource allocation and task unloading method based on NOMA-MEC reinforcement learning comprises the following steps:
step 1, setting N mobile devices (agents) in the network, denoted {1, 2, ..., N}; M subchannels, denoted {1, 2, ..., M}; and the tasks of the mobile devices, denoted {t1, t2, ..., tN};
step 2, establishing a joint task offloading and resource allocation optimization model using the NOMA technology: a joint optimization model is established for the carrier channel allocation and task offloading of all mobile devices in the network;
step 3, converting the combined optimization model into a Markov decision process model, and setting the state, action and reward in the Markov decision process;
step 4, training the learning network by the MADDPG algorithm, where the training objective is to minimize the delay of the mobile devices and the result is the optimal joint subcarrier channel allocation and task offloading strategy.
Further, the step 2 specifically includes the following steps:
when the mobile devices access the network in NOMA mode, one subchannel can be occupied by multiple mobile devices; for subchannel m, the superposed signal is Xm. At the receiving end of the system, the received signal of any user n on subchannel m is Yn,m. The received signals are sorted by signal power; assuming the n-th mobile device has the strongest power, it is decoded first and xn is output, recovering the signal estimate of the n-th mobile device. This estimate is subtracted from the received signal to obtain the signals of the remaining users, and the same operation is performed in order of power until all mobile devices are decoded and the post-decoding signal-to-noise ratios are obtained;
the Shannon formula is used to obtain the maximum information rate Rn,m of the n-th mobile device user on subchannel m in NOMA mode. The total delay for user n to upload its task to the MEC server through subchannel m for task offloading is:
where ck is the computing power of the MEC server and rn is the computation result data returned by the MEC server;
the delay when user n computes locally is:
where fn is the computing power of the mobile user.
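The rate and delay quantities above can be sketched as follows. The exact delay expressions in the patent are given as figures that did not survive extraction, so this Python sketch uses a standard model (upload time plus MEC computing time, with the return of the small result rn neglected) purely for illustration; all names are assumptions.

```python
import math

def shannon_rate(bandwidth, sinr):
    # Maximum information rate on the subchannel (Shannon formula):
    # R = B * log2(1 + SINR)
    return bandwidth * math.log2(1 + sinr)

def offload_delay(x_n, rate, cycles, c_k):
    # Upload time x_n / R_{n,m} plus MEC computing time cycles / c_k.
    # The time to return the result data r_n is neglected here.
    return x_n / rate + cycles / c_k

def local_delay(cycles, f_n):
    # Computing the whole task on the mobile device itself.
    return cycles / f_n
```

The offloading decision then amounts to comparing `offload_delay` (which depends on the selected subchannel through the achievable rate) against `local_delay` for each device.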
Further, in step 3, the parameters are initialized. an is the action of mobile device n, denoted an = {dn, cn}, where dn is a continuous value in [0, 1]: 0 means mobile device n computes locally and 1 means mobile device n offloads the task entirely to the MEC server; cn ∈ {0, 1, ..., m} denotes the subcarrier channel selected by mobile device n;
sn is the state of mobile device n, denoted sn = {Xn, xn, Gm}, where Xn ∈ {0, 1} indicates whether the subcarrier channel is idle or busy, xn denotes the data size of the offloading task, and Gm denotes the channel information of the subchannel;
rn is the reward function, defined as the negative of the system delay and denoted rn = -EE(dn, cn).
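The Markov decision process elements above can be sketched as small Python structures; the class and function names are illustrative, not part of the patent.

```python
from dataclasses import dataclass

@dataclass
class Experience:
    """One transition [x, a, r, x'] as stored in the experience pool D."""
    x: list       # joint state  [s_1, ..., s_n]
    a: list       # joint action [a_1, ..., a_n], each a_n = (d_n, c_n)
    r: list       # rewards r_n = -system delay
    x_next: list  # joint state at the next time

def make_action(d_n, c_n, num_channels):
    # d_n in [0, 1]: offloading ratio (0 = local computing,
    # 1 = task fully offloaded to the MEC server)
    # c_n in {0, ..., M}: index of the selected subcarrier channel
    assert 0.0 <= d_n <= 1.0 and 0 <= c_n <= num_channels
    return (d_n, c_n)
```

A transition is then built from the joint observations and rewards of all agents at one time step.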
Further, the step 4 includes the following steps:
step 4.1) adopting MADDPG algorithm to update the mobile equipment user network, wherein each mobile equipment user comprises an Actor network and a criticic network, the Actor network and the criticic network have respective estimation network and target network, and theta is [ theta ]1,θ2...θn]Parameters representing n agent policies, for the resulting state siEach agent generates an action a according to the deterministic policy of the Actor networkiWhile receiving an instant prize riEnters a next state s'iThe combined state, the motion, the reward and the state [ x, a, r, x 'at the next time are set']Stored into experience pool D for subsequent training, x ═ s1,s2...sn]Representing observation vectors, i.e. states, a ═ a1,a2...an]Represents an action, r ═ r1,r2...rn]Denotes prize, x '═ s'1,s′2...s′n]Indicating the state at the next time.
Step 4.2) when the number of samples in the experience pool D reaches a certain number, sampling batch data from the experience pool D for network training, and carrying out state siInputting the data into the Actor estimation network of the ith agent to obtain action aiAnd a prize riThen inputting x and a into a Critic estimation network to obtain an estimation state-action function at the current moment, namely an estimation Q function, and converting the state s 'at the next moment'iIs input to the Actor target network to obtain action a 'at the next time'iInputting x 'and a' into a Critic target network to obtain a target Q function yiAnd then updating the criticic estimation network by using a minimum Loss function, wherein the criticic network has an estimation network and a target network,q value, μ ' ═ μ ' representing critical target network output '1,μ′2...μ′n]Parameter θ 'with hysteresis update for target policy'j;
And 4.3) updating the Actor estimation network by the intelligent agent according to the deterministic strategy gradient and the Q function obtained in the criticic estimation network, and aiming at the accumulated expected reward J (mu) of the ith intelligent agenti) The strategy gradient is expressed as
Step 4.4) repeating the step 4.2) and the step 4.3), and updating the parameters in the Actor target network and the Critic target network by a soft updating method at regular intervals of iteration times;
until the iteration times are set and the network is trained, only the state s at the current moment needs to be settInputting the input into the Actor network, and outputting the action atAnd obtaining the optimal resource allocation scheme at the current moment, so that the energy efficiency of the mobile equipment user is optimized. When the network state changes, a new allocation scheme can be obtained only by inputting a new state into the Actor network again.
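The core numerical operations of steps 4.1)-4.4) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the flat-list parameter representation and the values of tau and gamma are assumptions.

```python
import random

def soft_update(target, source, tau=0.01):
    # Lagged (hysteresis) update of target-network parameters:
    # theta' <- tau * theta + (1 - tau) * theta'
    return [tau * s + (1.0 - tau) * t for t, s in zip(target, source)]

def critic_target(r_i, q_next, gamma=0.99):
    # Target Q value y_i = r_i + gamma * Q'(x', a'_1, ..., a'_n),
    # where q_next is the Critic target network's output
    return r_i + gamma * q_next

def sample_batch(pool, batch_size):
    # Uniform minibatch sampling from the experience pool D
    return random.sample(pool, batch_size)
```

The Critic estimation network is fitted to `critic_target` by minimizing a squared loss over the sampled minibatch, while `soft_update` keeps the target networks slowly tracking the estimation networks.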
Beneficial effects: compared with the prior art, the NOMA-MEC-based reinforcement learning resource allocation and task offloading method of the invention treats each mobile device in the network system as an independent agent and adopts the MADDPG method based on the Actor-Critic network structure, so that each mobile device can learn a suitable policy to minimize delay and energy consumption. The invention effectively solves the problem of the huge task load on mobile devices, reduces the delay of the whole communication process, obtains the optimal resource allocation in different environments, and improves the utilization efficiency of channel resources.
Drawings
FIG. 1 is a system model diagram;
FIG. 2 is a schematic diagram illustrating the steps of the present invention;
FIG. 3 is a block diagram of MADDPG.
Detailed Description
The present invention will be further described with reference to the following embodiments.
Assume there are N mobile devices in the whole network, each with a corresponding task to perform. Each mobile device can select a carrier channel to offload its task to the MEC server, effectively improving the spectrum utilization efficiency of the system. To reduce interference and minimize offloading delay, a distributed reinforcement learning algorithm for the NOMA-MEC environment is adopted to solve the joint problem of subcarrier channel allocation and task offloading. As training progresses, the policy of each agent changes, so the environment becomes non-stationary from the perspective of any single agent; moreover, policy gradient methods typically exhibit large variance when multiple agents must coordinate, and direct reuse of past experience replay is prevented. Traditional reinforcement learning methods (e.g. Q-learning or policy gradients) are therefore not suitable for multi-agent environments, so the invention adopts the MADDPG algorithm with its idea of centralized training and distributed execution: each mobile device performs resource allocation based only on its own observed environment.
Since each agent is continuously learning and improving its policy, the environment is dynamically non-stationary from the viewpoint of each agent, and key techniques such as experience replay cannot be used directly. In the invention, each mobile device in the network system is regarded as an independent agent, and the MADDPG method based on the Actor-Critic network structure enables each mobile device to learn a suitable policy to minimize delay and energy consumption.
Step 2, a joint task offloading and resource allocation optimization model is established using the NOMA technology: a joint optimization model is established for the carrier channel allocation and task offloading of all mobile devices in the network;
step 3, converting the combined optimization model into a Markov decision process model, and setting the state, action and reward in the Markov decision process;
step 4, training the learning network by the MADDPG algorithm, where the training objective is to minimize the delay of the mobile devices and the result is the optimal joint subcarrier channel allocation and task offloading strategy.
Examples
The model diagram of the system of the present invention is shown in fig. 1, and mainly comprises a base station integrated with an MEC server and N mobile devices. The implementation of the solution is described in further detail below.
The specific implementation steps of the invention are as follows:
Step 2, a joint task offloading and resource allocation optimization model is established using the NOMA technology.
When the mobile devices access the network in NOMA mode, one subchannel can be occupied by multiple mobile devices; for subchannel m, the superposed signal is Xm. At the receiving end of the system, the received signal of the n-th mobile device user on subchannel m is Yn,m. The received signals are sorted by signal power; assuming the n-th mobile device has the strongest power, it is decoded first and xn is output, recovering its signal estimate. This estimate is subtracted from the received signal to obtain the signals of the remaining users, and the same operation is performed in order of power until all mobile devices are decoded and the post-decoding signal-to-noise ratios are obtained.
The Shannon formula is used to obtain the maximum information rate Rn,m of the n-th mobile device user on subchannel m in NOMA mode.
The total delay for user n to upload its task to the MEC server through subchannel m for task offloading is:
where ck is the computing power of the MEC server and rn is the computation result data returned by the MEC server;
the delay when user n computes locally is:
where fn is the computing power of the mobile user.
Step 3, the joint optimization model is converted into a Markov decision process model, and the state, action, and reward of the Markov decision process are set.
The parameters are initialized. an is the action of mobile device n, denoted an = {dn, cn}, where dn is a continuous value in [0, 1]: 0 means mobile device n computes locally and 1 means it offloads the task entirely to the MEC server; cn ∈ {0, 1, ..., m} denotes the subcarrier channel selected by mobile device n;
sn is the state of mobile device n, denoted sn = {Xn, xn, Gm}, where Xn ∈ {0, 1} (0: the subcarrier channel is idle, 1: busy), xn denotes the data size of the offloading task, and Gm denotes the channel information of the subchannel;
rn is the reward function, defined as the negative of the system delay and denoted rn = -EE(dn, cn).
Step 4, the learning network is trained by the MADDPG algorithm; the training objective is to minimize the delay of the mobile devices, and the result is the optimal joint subcarrier channel allocation and task offloading strategy.
Step (1): the MADDPG algorithm is used to update the mobile device user networks. Each mobile device user has an Actor network and a Critic network, each with its own estimation network and target network; the block diagram is shown in FIG. 3. θ = [θ1, θ2, ..., θn] denotes the parameters of the n agent policies. For the obtained state si, each agent generates an action ai according to the deterministic policy of its Actor network, receives an instant reward ri, and enters the next state s'i. The joint state, action, reward, and next-time state [x, a, r, x'] are stored in the experience pool D for subsequent training, where x = [s1, s2, ..., sn] denotes the observation vector (i.e. the state), a = [a1, a2, ..., an] the action, r = [r1, r2, ..., rn] the reward, and x' = [s'1, s'2, ..., s'n] the state at the next time.
Step (2): when the samples in the experience pool D reach a certain number, a batch of data is sampled from D for network training. The state si is input to the Actor estimation network of the i-th agent to obtain the action ai and the reward ri; x and a are then input to the Critic estimation network to obtain the estimated Q function at the current time. The next-time state s'i is input to the Actor target network to obtain the next-time action a'i, and x' and a' are input to the Critic target network to obtain the target Q function yi; the Critic estimation network is then updated by minimizing the loss function. μ' = [μ'1, μ'2, ..., μ'n] denotes the target policies, whose parameters θ'j are updated with hysteresis, and the Q value output by the Critic target network is used to form yi.
Step (3): the agent updates the Actor estimation network according to the deterministic policy gradient and the Q function obtained from the Critic estimation network; for the cumulative expected reward J(μi) of the i-th agent, the policy gradient is expressed as ∇θi J(μi).
Step (4): steps (2) and (3) are repeated, and the parameters of the Actor target network and the Critic target network are updated by soft update every fixed number of iterations.
Once the set number of iterations is reached and the network is trained, only the current state st needs to be input to the Actor network, which outputs the action at and yields the optimal resource allocation scheme at the current time, optimizing the energy efficiency of the mobile device users. When the network state changes, a new allocation scheme can be obtained simply by inputting the new state into the Actor network again.
This example is merely to illustrate the process of minimizing system delay for channel allocation and task offloading of a mobile device in this invention and is not a constraint on the inventive data parameters.
The process of the MADDPG-based task offloading and resource allocation scheme using the NOMA technology is described in detail below by way of example. The concrete steps are as follows:
Step 2, a joint task offloading and resource allocation optimization model is established using the NOMA technology.
When the mobile devices access the network in NOMA mode, one subchannel can be occupied by multiple mobile devices; for subchannel m, the superposed signal is Xm. At the receiving end of the system, the received signal of any user n on subchannel m is Yn,m. The received signals are sorted by signal power; assuming the n-th mobile device has the strongest power, it is decoded first and xn is output, recovering its signal estimate. This estimate is subtracted from the received signal to obtain the signals of the remaining users, and the same operation is performed in order of power until all mobile devices are decoded and the post-decoding signal-to-noise ratios are obtained.
The Shannon formula is used to obtain the maximum information rate Rn,m of the n-th mobile device user on subchannel m in NOMA mode;
the total delay for the n-th mobile device user to upload its task to the MEC server through subchannel m for task offloading is:
where ck is the computing power of the MEC server and rn is the computation result data returned by the MEC server; the delay when the n-th mobile device user computes locally is:
where fn is the computing power of the mobile user.
Step 3, the joint optimization model is converted into a Markov decision process model, and the state, action, and reward of the Markov decision process are set.
More specifically, 10 mobile devices are each regarded as an agent. an is the action of mobile device n, denoted an = {dn, cn}, where dn is a continuous value in [0, 1]: 0 means mobile device n computes locally and 1 means it offloads the task entirely to the MEC server; cn ∈ {0, 1, ..., m} denotes the subcarrier channel selected by mobile device n;
sn is the state of mobile device n, denoted sn = {Xn, xn, Gm}, where Xn ∈ {0, 1} indicates whether the subcarrier channel is idle or busy, xn denotes the data size of the offloading task, and Gm denotes the channel information of the subchannel;
rn is the reward function, defined as the negative of the system delay and denoted rn = -EE(dn, cn).
Step 4, the learning network is trained by the MADDPG algorithm; the training objective is to minimize the delay of the mobile devices, and the result is the optimal joint subcarrier channel allocation and task offloading strategy.
Step (1): the MADDPG algorithm is used to update the mobile device user networks. Each mobile device user has an Actor network and a Critic network, each with its own estimation network and target network; the block diagram is shown in FIG. 3. θ = [θ1, θ2, ..., θn], where θn denotes the policy parameters of the n-th mobile device user. For the obtained state si, each agent generates an action ai according to the deterministic policy of its Actor network, receives an instant reward ri, and enters the next state s'i. The joint state, action, reward, and next-time state [x, a, r, x'] are stored in the experience pool D for subsequent training, where x = [s1, s2, ..., sn] denotes the observation vector (i.e. the state), a = [a1, a2, ..., an] the action, r = [r1, r2, ..., rn] the reward, and x' = [s'1, s'2, ..., s'n] the state at the next time.
Step (2): when the number of samples in the experience pool D reaches 400, a batch of data is sampled from D for network training. The state si is input to the Actor estimation network of the i-th agent to obtain the action ai and the reward ri; x and a are then input to the Critic estimation network to obtain the estimated Q function at the current time. The next-time state s'i is input to the Actor target network to obtain the next-time action a'i, and x' and a' are input to the Critic target network to obtain the target Q function yi; the Critic estimation network is then updated by minimizing the loss function. μ' = [μ'1, μ'2, ..., μ'n] denotes the target policies, whose parameters θ'j are updated with hysteresis, and the Q value output by the Critic target network is used to form yi.
Step (3): the agent updates the Actor estimation network according to the deterministic policy gradient and the Q function obtained from the Critic estimation network; for the cumulative expected reward J(μi) of the i-th agent, the policy gradient is expressed as ∇θi J(μi).
Step (4): steps (2) and (3) are repeated, and the parameters of the Actor target network and the Critic target network are updated by soft update every 100 iterations.
After 2000 iterations, when the network is trained, only the current state st needs to be input to the Actor network, which outputs the action at and yields the optimal resource allocation scheme at the current time, optimizing the energy efficiency of the mobile device users. When the network state changes, a new allocation scheme can be obtained simply by inputting the new state into the Actor network again.
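The example schedule (10 agents, training once the pool holds 400 samples, soft updates every 100 iterations, 2000 iterations in total) can be sketched as a small configuration check; the dictionary keys and function name are illustrative assumptions, not taken from the patent.

```python
CONFIG = {
    "num_agents": 10,        # mobile devices, each treated as an agent
    "warmup_samples": 400,   # start training once pool D holds this many
    "soft_update_every": 100,
    "iterations": 2000,
}

def should_train(pool_size, step, cfg=CONFIG):
    # train only after the warm-up threshold; sync the target
    # networks (soft update) every soft_update_every training steps
    train = pool_size >= cfg["warmup_samples"]
    sync_targets = train and step % cfg["soft_update_every"] == 0
    return train, sync_targets
```

Before the pool reaches 400 samples the agents only collect experience; afterwards every step trains the estimation networks, and every 100th step also soft-updates the targets.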
The above is only a preferred embodiment of the present invention; it will be apparent to those skilled in the art that various modifications and variations can be made without departing from the technical principles of the present invention, and such modifications and variations shall also fall within the scope of the present invention.
Claims (6)
1. A resource allocation and task unloading method based on NOMA-MEC reinforcement learning is characterized in that: the method comprises the following steps:
step 1, setting N mobile devices (agents) in a network, denoted {1, 2, ..., N}; M subchannels, denoted {1, 2, ..., M}; the tasks of the mobile devices are denoted {t1, t2, ..., tn, ..., tN}, N tasks in total;
step 2, establishing a task unloading and resource allocation combined optimization model by using a NOMA technology; establishing a joint optimization model aiming at carrier channel allocation and task unloading of all mobile equipment in a network;
step 3, the joint optimization model is converted into a Markov decision process model, and the state, action, and reward of the Markov decision process are defined;
and step 4, the learning network is trained with the MADDPG algorithm; the training objective is to minimize the delay of the mobile devices, and the result is the optimal joint subcarrier channel allocation and task offloading strategy.
2. The NOMA-MEC-based reinforcement learning resource allocation and task offloading method of claim 1, wherein: step 2 specifically comprises the following steps:
when mobile devices access the network in NOMA mode, one subchannel can be occupied by multiple mobile devices; for subchannel m, the superposed signal is X_m; at the receiving end of the system, the received signal of any user n on subchannel m is Y_{n,m}; the received signals are sorted by signal power, and assuming the power of the n-th mobile device is strongest, the n-th mobile device is decoded first: its signal estimate x_n is recovered and output, the estimate is subtracted from the received signal to obtain the signals of the remaining users, and the same operation is performed in turn in order of power until the signals of all mobile devices are decoded and the signal-to-noise ratio after decoding is obtained;
method for solving maximum information rate R of nth mobile equipment user on subchannel m in NOMA mode by utilizing Shannon formulan,m(ii) a User n uploads task to MEC server through sub-channel mThe total latency of the row task offload is:
in the formula, ckFor the computing power of the MEC server, rnCalculating result data for the MEC server;
the delay computed locally by user n is:
in the formula (f)nComputing power of the mobile user.
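For illustration only (hypothetical Python, not part of the claims), the quantities in claim 2 can be sketched end to end: SIC decoding yields per-user signal-to-noise ratios, the Shannon formula yields R_{n,m}, and the two delays follow. The patent's exact delay expressions appear as figures not reproduced in this text, so `offload_delay` below (upload, server computation, result download) is a common formulation consistent with the named symbols c_k, r_n, and f_n, not the patent's verbatim formula:

```python
import math

import numpy as np


def sic_sinrs(p, g, noise=1e-9):
    """SIC on one NOMA subchannel: decode users in descending received-power
    order; each decoded signal is subtracted, so a user is interfered only
    by the still-undecoded (weaker) users."""
    recv = np.asarray(p) * np.asarray(g)   # received power per user
    order = np.argsort(recv)[::-1]         # strongest first
    sinr = np.empty(len(recv))
    remaining = recv.sum()
    for idx in order:
        remaining -= recv[idx]             # cancel this user's own signal
        sinr[idx] = recv[idx] / (remaining + noise)
    return sinr


def shannon_rate(bandwidth_hz, sinr):
    """Maximum information rate R_{n,m} from the Shannon formula."""
    return bandwidth_hz * math.log2(1.0 + sinr)


def offload_delay(x_n, rate, c_k, r_n, downlink_rate):
    """Assumed total offloading delay: upload x_n bits at `rate`, execute on
    the MEC server with computing power c_k, download r_n bits of results."""
    return x_n / rate + x_n / c_k + r_n / downlink_rate


def local_delay(x_n, f_n):
    """Delay when user n computes the task locally with capability f_n."""
    return x_n / f_n
```

The offloading decision then reduces to comparing `offload_delay` against `local_delay` for each candidate subchannel.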
3. The NOMA-MEC-based reinforcement learning resource allocation and task offloading method of claim 1, wherein: in step 3, the parameters are initialized; a_n is the action of mobile device n, denoted a_n = {d_n, c_n}, where d_n is a continuous value in [0, 1]: 0 indicates that mobile device n computes locally, and 1 indicates that mobile device n offloads its task entirely to the MEC server; c_n ∈ {0, 1, ..., M} indicates which subcarrier channel mobile device n selects;
s_n is the state of mobile device n, denoted s_n = {X_n, x_n, G_m}, where X_n ∈ {0, 1} indicates whether the subcarrier channel is free or busy, x_n is the data size of the offloading task, and G_m is the channel information of the subchannel;
r_n is the reward function, defined as the negative of the system delay and denoted r_n = -EE(d_n, c_n).
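For illustration (hypothetical Python, not part of the claim), the action, state, and reward of the Markov decision process in claim 3 can be sketched as:

```python
from dataclasses import dataclass


@dataclass
class Action:
    d_n: float   # offloading ratio in [0, 1]: 0 = local compute, 1 = full offload
    c_n: int     # selected subcarrier channel in {0, 1, ..., M}


@dataclass
class State:
    X_n: int     # subcarrier channel free/busy flag, in {0, 1}
    x_n: float   # data size of the offloading task
    G_m: float   # channel information of the subchannel


def reward(system_delay: float) -> float:
    """Reward is the negative system delay, r_n = -EE(d_n, c_n)."""
    return -system_delay
```

Defining the reward as negative delay means that maximizing cumulative reward is equivalent to minimizing the delay objective of step 4.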
4. The NOMA-MEC-based reinforcement learning resource allocation and task offloading method of claim 1, wherein: in step 4, the learning network is trained by the MADDPG algorithm, specifically as follows:
the network update of the mobile device users is performed with the MADDPG algorithm; each mobile device user has an Actor network and a Critic network, each with its own estimation network and target network; θ = [θ_1, θ_2, ..., θ_n] denotes the parameters of the n agents' policies; for the obtained state s_i, each agent generates an action a_i according to the deterministic policy of its Actor network, receives an instant reward r_i, and enters the next state s'_i; the joint state, action, reward, and next-moment state [x, a, r, x'] are stored in the experience pool D for subsequent training, where x = [s_1, s_2, ..., s_n] is the observation vector (i.e., the states), a = [a_1, a_2, ..., a_n] the actions, r = [r_1, r_2, ..., r_n] the rewards, and x' = [s'_1, s'_2, ..., s'_n] the states at the next moment.
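The experience pool D and its mini-batch sampling described above can be sketched as follows (a minimal Python illustration; the capacity value is an assumption):

```python
import random
from collections import deque


class ReplayPool:
    """Experience pool D for MADDPG: stores joint transitions [x, a, r, x']
    over all agents and samples uniform mini-batches for training."""

    def __init__(self, capacity=100_000):
        # deque with maxlen discards the oldest transitions when full
        self.buffer = deque(maxlen=capacity)

    def store(self, x, a, r, x_next):
        self.buffer.append((x, a, r, x_next))

    def sample(self, batch_size):
        # uniform sampling without replacement within one batch
        return random.sample(self.buffer, batch_size)
```

Storing the joint observation x rather than each s_i separately is what lets the centralized Critic condition on all agents' states and actions during training.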
5. The NOMA-MEC-based reinforcement learning resource allocation and task offloading method of claim 1, wherein: in step 4, the training aims at minimizing the time delay of the mobile device, and specifically includes the following steps:
4.1) when the number of samples in the experience pool D reaches a certain number, batch data are sampled from D for network training: the state s_i is input into the Actor estimation network of the i-th agent to obtain the action a_i and reward r_i; x and a are then input into the Critic estimation network to obtain the estimated state-action function at the current moment, i.e., the estimated Q function; the next-moment state s'_i is input into the Actor target network to obtain the next-moment action a'_i; x' and a' are input into the Critic target network to obtain the target Q function y_i; the Critic estimation network is then updated by minimizing the loss function; the Critic network has an estimation network and a target network, the Critic target network outputs the Q value, and μ' = [μ'_1, μ'_2, ..., μ'_n] is the target policy whose parameters θ'_j are updated with a lag;
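The target Q computation and the loss minimized for the Critic estimation network in step 4.1) can be sketched as (hypothetical Python; the discount factor `gamma` is an assumption not stated in the claim):

```python
import numpy as np


def td_targets(r, q_next, gamma=0.95):
    """Target Q values y_i = r_i + gamma * Q'(x', a'), where Q' comes from
    the Critic target network; `gamma` is an assumed discount factor."""
    return np.asarray(r) + gamma * np.asarray(q_next)


def critic_loss(q_est, y):
    """Mean-squared TD error between the estimated Q function and the
    target y_i, minimized when updating the Critic estimation network."""
    return float(np.mean((np.asarray(q_est) - np.asarray(y)) ** 2))
```

Because y_i is built from the slowly-updated target networks, the regression target stays stable between soft updates.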
6. The NOMA-MEC-based reinforcement learning resource allocation and task offloading method of claim 5, wherein: in step 4), steps 4.1) and 4.2) are repeated, and the parameters of the Actor target network and the Critic target network are updated by the soft-update method every fixed number of iterations; once the set number of iterations is reached and the network is trained, the state s_t at the current moment is input into the Actor network, and the output action a_t yields the optimal resource allocation scheme at the current moment, optimizing the energy efficiency of the mobile device users; when the network state changes, a new state is input into the Actor network again to obtain a new allocation scheme.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110756466.4A CN113543342B (en) | 2021-07-05 | 2021-07-05 | NOMA-MEC-based reinforcement learning resource allocation and task unloading method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113543342A true CN113543342A (en) | 2021-10-22 |
CN113543342B CN113543342B (en) | 2024-03-29 |
Family
ID=78097770
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110756466.4A Active CN113543342B (en) | 2021-07-05 | 2021-07-05 | NOMA-MEC-based reinforcement learning resource allocation and task unloading method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113543342B (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114375066A (en) * | 2022-01-08 | 2022-04-19 | 山东大学 | Distributed channel competition method based on multi-agent reinforcement learning |
CN114665952A (en) * | 2022-03-24 | 2022-06-24 | 重庆邮电大学 | Low-orbit satellite network beam hopping optimization method based on satellite-ground fusion architecture |
CN114938381A (en) * | 2022-06-30 | 2022-08-23 | 西安邮电大学 | D2D-MEC unloading method based on deep reinforcement learning and computer program product |
CN116367223A (en) * | 2023-03-30 | 2023-06-30 | 广州爱浦路网络技术有限公司 | XR service optimization method and device based on reinforcement learning, electronic equipment and storage medium |
WO2023179010A1 (en) * | 2022-03-22 | 2023-09-28 | 南京邮电大学 | User packet and resource allocation method and apparatus in noma-mec system |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180317157A1 (en) * | 2017-04-27 | 2018-11-01 | Samsung Electronics Co., Ltd. | Method and apparatus for registration type addition for service negotiation |
CN111800828A (en) * | 2020-06-28 | 2020-10-20 | 西北工业大学 | Mobile edge computing resource allocation method for ultra-dense network |
CN111918339A (en) * | 2020-07-17 | 2020-11-10 | 西安交通大学 | AR task unloading and resource allocation method based on reinforcement learning in mobile edge network |
US20210034970A1 (en) * | 2018-02-05 | 2021-02-04 | Deepmind Technologies Limited | Distributed training using actor-critic reinforcement learning with off-policy correction factors |
CN112601284A (en) * | 2020-12-07 | 2021-04-02 | 南京邮电大学 | Downlink multi-cell OFDMA resource allocation method based on multi-agent deep reinforcement learning |
CN112788764A (en) * | 2020-12-23 | 2021-05-11 | 华北电力大学 | Method and system for task unloading and resource allocation of NOMA ultra-dense network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||