CN113543342A - Reinforcement learning resource allocation and task offloading method based on NOMA-MEC - Google Patents

Reinforcement learning resource allocation and task offloading method based on NOMA-MEC Download PDF

Info

Publication number
CN113543342A
Authority
CN
China
Prior art keywords
network
task
state
mobile equipment
noma
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110756466.4A
Other languages
Chinese (zh)
Other versions
CN113543342B (en)
Inventor
李君
沈国丽
仲星
丁文杰
张茜茜
朱明浩
王秀敏
李正权
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ictehi Technology Development Jiangsu Co ltd
Binjiang College of Nanjing University of Information Engineering
Original Assignee
Ictehi Technology Development Jiangsu Co ltd
Binjiang College of Nanjing University of Information Engineering
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ictehi Technology Development Jiangsu Co ltd, Binjiang College of Nanjing University of Information Engineering filed Critical Ictehi Technology Development Jiangsu Co ltd
Priority to CN202110756466.4A priority Critical patent/CN113543342B/en
Publication of CN113543342A publication Critical patent/CN113543342A/en
Application granted granted Critical
Publication of CN113543342B publication Critical patent/CN113543342B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W72/00 Local resource management
    • H04W72/50 Allocation or scheduling criteria for wireless resources
    • H04W72/53 Allocation or scheduling criteria for wireless resources based on regulatory allocation policies
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W16/00 Network planning, e.g. coverage or traffic planning tools; Network deployment, e.g. resource partitioning or cells structures
    • H04W16/22 Traffic simulation tools or models
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W24/00 Supervisory, monitoring or testing arrangements
    • H04W24/02 Arrangements for optimising operational condition
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W72/00 Local resource management
    • H04W72/04 Wireless resource allocation
    • H04W72/044 Wireless resource allocation based on the type of the allocated resource
    • H04W72/0453 Resources in frequency domain, e.g. a carrier in FDMA

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention discloses a reinforcement learning resource allocation and task offloading method based on NOMA-MEC, belonging to the technical field of mobile communication networks. The invention effectively handles the large task volume on mobile devices, reduces the delay of the whole communication process, obtains the optimal resource allocation mode in different environments, and improves the utilization efficiency of channel resources.

Description

Reinforcement learning resource allocation and task offloading method based on NOMA-MEC
Technical Field
The invention belongs to the technical field of mobile communication networks, and in particular relates to a resource allocation and task offloading method based on NOMA-MEC reinforcement learning.
Background
With the development of the times, mobile terminals such as smartphones have become widely popular, and mobile applications such as face recognition, online interactive games, and augmented reality are emerging in large numbers and attracting great attention. These applications typically require large amounts of resources and intensive computation, demand low latency, and consume considerable energy, while handsets with limited computing resources and battery life can hardly support them. Mobile Edge Computing (MEC) can meet the high computational requirements of these tasks, and the application of Non-Orthogonal Multiple Access (NOMA) technology can further reduce the delay of multi-task offloading.
Mobile edge computing is a form of distributed computing in which data processing, application execution, and even the realization of some functional services are placed on nodes at the edge of the network. The mobile edge consists of one or more edge servers: servers with computing and storage functions are deployed at the traditional base station, upgrading it into a mobile edge computing base station. By offloading computation-intensive or delay-sensitive applications to nearby MEC servers, resource-constrained mobile devices can reduce task processing time while reducing mobile energy consumption and transmission costs.
Non-orthogonal multiple access is one of the key technologies of the fifth-generation cellular network. By allocating different powers to terminal users, it can serve multiple users simultaneously on the same frequency band, effectively improving spectrum utilization. Compared with traditional Orthogonal Multiple Access (OMA), it can provide task offloading for more users under the same channel resources. Considering the many factors that influence the task offloading process, the invention accesses users to the communication system in the non-orthogonal multiple access (NOMA) mode.
Machine Learning (ML) refers to algorithms that automatically analyze data to obtain rules and use those rules to predict unknown data; as an emerging technology with broad application prospects, it is being studied by more and more scholars. Today, 5G mobile communication networks receive ever stronger support from machine learning. By learning method, machine learning is divided into four major categories: supervised learning, semi-supervised learning, unsupervised learning, and Reinforcement Learning (RL). Unlike the other three types of learning, RL does not require complete prior information: an agent learns continuously through interaction with the environment and eventually finds the optimal strategy. RL theory plays a key role in solving problems of dynamic programming, system control, and decision making; in particular, when dealing with dynamic optimization problems, the optimal solution is finally obtained through continual trial-and-error learning in a changing environment. For the joint resource allocation problem of subcarrier channels and task offloading in the NOMA-MEC environment, the diversity of the transmission environment greatly increases the difficulty of designing a resource allocation strategy, and the application of RL theory in wireless communication systems provides a brand-new design idea for solving the resource allocation problem.
Each mobile device can offload all or part of its computing task to the MEC server by selecting a carrier channel, thereby reducing delay and energy consumption and obtaining a good user experience. Existing traditional algorithms are feasible for solving task offloading and resource allocation, but they are not suitable for MEC systems with strict real-time requirements. In the invention, each mobile device acts as an agent, and each agent continuously learns and improves its strategy; as a result the environment is dynamically non-stationary from the perspective of each agent, and key DQN techniques such as experience replay cannot be used directly.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the shortcomings of low spectrum efficiency and limited computing capacity of user equipment, the invention provides a resource allocation and task offloading method based on NOMA-MEC reinforcement learning; mobile devices adopting the NOMA technique and the MADDPG algorithm can intelligently allocate subcarrier channels and offload tasks, thereby reducing system delay.
The technical scheme is as follows: in order to achieve the purpose, the invention adopts the following technical scheme:
a resource allocation and task unloading method based on NOMA-MEC reinforcement learning comprises the following steps:
Step 1: there are N mobile devices (agents) in the network, denoted {1, 2, ..., n, ..., N}; there are M subchannels, denoted {1, 2, ..., m, ..., M}; the tasks of the mobile devices are denoted {t_1, t_2, ..., t_n, ..., t_N}, N tasks in total;
Step 2: a joint task offloading and resource allocation optimization model is established using the NOMA technique; the joint optimization model covers carrier channel allocation and task offloading for all mobile devices in the network;
Step 3: the joint optimization model is converted into a Markov decision process model, and the state, action and reward of the Markov decision process are defined;
Step 4: the learning network is trained with the MADDPG algorithm; the training objective is to minimize the delay of the mobile devices, and the result is the optimal joint subcarrier channel allocation and task offloading strategy.
Further, the step 2 specifically includes the following steps:
When the mobile devices access the network in the NOMA mode, one subchannel can be occupied by several mobile devices. For subchannel m, the superposed signal is X_m; at the receiving end of the system, the received signal of any user n on subchannel m is Y_{n,m}. The received signals are sorted by signal power; assuming the power of the n-th mobile device is the strongest, the n-th mobile device is decoded first and x_n is output, recovering the signal estimate of the n-th mobile device. This estimate is then subtracted from the received signal to obtain the signals of the remaining users, and the same operation is performed in turn according to power until the signals of all mobile devices are decoded; the post-decoding signal-to-noise ratio is then obtained;
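For reference, the superposed signal and the post-decoding signal-to-noise ratio described above can be sketched in the standard uplink NOMA form below; the transmit powers p_n, channel gains g_{n,m}, transmitted symbols x_n, noise power σ² and user set U_m are assumed symbols that the original text does not define, so this is an illustrative sketch rather than the patent's exact expressions:

$$X_m=\sum_{n\in\mathcal{U}_m}\sqrt{p_n}\,x_n,\qquad \gamma_{n,m}=\frac{p_n\,g_{n,m}}{\sum_{j\in\mathcal{U}_m:\;p_j g_{j,m}<p_n g_{n,m}}p_j\,g_{j,m}+\sigma^2}$$

where U_m is the set of mobile devices sharing subchannel m and γ_{n,m} is the signal-to-noise ratio of user n after successive interference cancellation.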
The Shannon formula is used to obtain the maximum information rate R_{n,m} of the n-th mobile device user on subchannel m in the NOMA mode. The total delay for user n to upload its task to the MEC server through subchannel m for task offloading is:
(formula image BDA0003147409150000031 in the original)
where c_k is the computing capacity of the MEC server and r_n is the result data computed by the MEC server;
the delay when user n computes locally is:
(formula image BDA0003147409150000032 in the original)
where f_n is the computing capacity of the mobile user.
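The rate and delay expressions appear only as images in the original text. A common form consistent with the symbols already defined is sketched below as an assumption, not as the patent's exact formulas; B denotes an assumed subchannel bandwidth and η an assumed number of CPU cycles needed per bit of task data:

$$R_{n,m}=B\log_2\!\left(1+\gamma_{n,m}\right),\qquad T_{n,m}^{\mathrm{off}}=\frac{x_n}{R_{n,m}}+\frac{\eta\,x_n}{c_k}+\frac{r_n}{R_{n,m}},\qquad T_n^{\mathrm{loc}}=\frac{\eta\,x_n}{f_n}$$

Here the offloading delay is the sum of the time to upload the task data x_n, the MEC computation time, and the time to return the result r_n, while the local delay depends only on the device's own computing capacity f_n.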
Further, in step 3, the parameters are initialized. a_n is the action of mobile device n, denoted a_n = {d_n, c_n}, where d_n is a continuous value in [0, 1]: 0 means that mobile device n computes locally, and 1 means that mobile device n offloads its task entirely to the MEC server; c_n ∈ {0, 1, ..., m} indicates which subcarrier channel m mobile device n selects;
s_n is the state of mobile device n, denoted s_n = {X_n, x_n, G_m}, where X_n ∈ {0, 1} indicates whether the subcarrier channel is idle or busy, x_n is the data size of the offloaded task, and G_m is the channel information of the subchannel;
r_n is the reward function, defined as the negative of the system delay and denoted r_n = -EE(d_n, c_n).
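The action, state and reward defined above can be represented directly in code. The sketch below is only an illustration under assumed container types (the field G_m is taken to be a complex channel gain, and total_delay stands in for EE(d_n, c_n)); it is not part of the patented method itself.

```python
from dataclasses import dataclass

@dataclass
class Action:
    d_n: float  # offloading ratio in [0, 1]: 0 = compute locally, 1 = offload entirely
    c_n: int    # index of the selected subcarrier channel

@dataclass
class State:
    X_n: int       # 0 = subcarrier channel idle, 1 = busy
    x_n: float     # data size of the task to offload
    G_m: complex   # channel information of subchannel m (assumed complex gain)

def reward(total_delay: float) -> float:
    # r_n = -EE(d_n, c_n): the reward is the negative of the system delay
    return -total_delay
```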
Further, the step 4 includes the following steps:
step 4.1) adopting MADDPG algorithm to update the mobile equipment user network, wherein each mobile equipment user comprises an Actor network and a criticic network, the Actor network and the criticic network have respective estimation network and target network, and theta is [ theta ]1,θ2...θn]Parameters representing n agent policies, for the resulting state siEach agent generates an action a according to the deterministic policy of the Actor networkiWhile receiving an instant prize riEnters a next state s'iThe combined state, the motion, the reward and the state [ x, a, r, x 'at the next time are set']Stored into experience pool D for subsequent training, x ═ s1,s2...sn]Representing observation vectors, i.e. states, a ═ a1,a2...an]Represents an action, r ═ r1,r2...rn]Denotes prize, x '═ s'1,s′2...s′n]Indicating the state at the next time.
Step 4.2) When the number of samples in the experience pool D reaches a certain number, a batch of data is sampled from D for network training. The state s_i is input to the Actor estimation network of the i-th agent to obtain the action a_i and the reward r_i; then x and a are input to the Critic estimation network to obtain the estimated state-action function at the current time, i.e. the estimated Q function. The next-time state s'_i is input to the Actor target network to obtain the next-time action a'_i, and x' and a' are input to the Critic target network to obtain the target Q function y_i. The Critic estimation network is then updated by minimizing the loss function (the Critic network has an estimation network and a target network):
(formula image BDA0003147409150000041 in the original)
where Q^{μ'} denotes the Q value output by the Critic target network, and μ' = [μ'_1, μ'_2, ..., μ'_n] are the target policies with lagging-updated parameters θ'_j.
Step 4.3) Each agent updates its Actor estimation network according to the deterministic policy gradient and the Q function obtained from the Critic estimation network. For the cumulative expected reward J(μ_i) of the i-th agent, the policy gradient is expressed as
(formula image BDA0003147409150000042 in the original)
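For reference, the deterministic policy gradient used by MADDPG is conventionally written as below; this is the standard form from the literature, offered as a sketch of what the image formula above expresses:

$$\nabla_{\theta_i}J(\mu_i)=\mathbb{E}_{x,a\sim D}\Big[\nabla_{\theta_i}\mu_i(a_i\mid s_i)\,\nabla_{a_i}Q_i^{\mu}(x,a_1,\dots,a_n)\big|_{a_i=\mu_i(s_i)}\Big]$$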
Step 4.4) Steps 4.2) and 4.3) are repeated, and every fixed number of iterations the parameters of the Actor target network and the Critic target network are updated by a soft update method;
this continues until the set number of iterations is reached and the network is trained. Afterwards, only the state s_t at the current time needs to be input into the Actor network; the output action a_t gives the optimal resource allocation scheme at the current time, so that the energy efficiency of the mobile device users is optimized. When the network state changes, a new allocation scheme is obtained simply by inputting the new state into the Actor network again.
Beneficial effects: compared with the prior art, the NOMA-MEC reinforcement learning resource allocation and task offloading method of the invention treats each mobile device in the network system as an independent agent and adopts the MADDPG method based on the Actor-Critic network structure, so that each mobile device can learn a suitable strategy to minimize delay and energy consumption. The invention effectively handles the large task volume on mobile devices, reduces the delay of the whole communication process, obtains the optimal resource allocation mode in different environments, and improves the utilization efficiency of channel resources.
Drawings
FIG. 1 is a system model diagram;
FIG. 2 is a schematic diagram illustrating the steps of the present invention;
FIG. 3 is a block diagram of MADDPG.
Detailed Description
The present invention will be further described with reference to the following embodiments.
Assume there are N mobile devices in the entire network, each with a corresponding task to execute. Each mobile device can select a carrier channel to offload its task to the MEC server, which effectively improves the spectrum utilization efficiency of the system. To reduce interference and minimize offloading delay, a distributed reinforcement learning algorithm for the NOMA-MEC environment is adopted to solve the joint problem of subcarrier channel allocation and task offloading. As training progresses, the policy of each agent changes, so the environment becomes non-stationary from the perspective of any single agent; on the other hand, policy gradient methods typically exhibit large variance when multiple agents must coordinate. This also prevents the direct use of experience replay, so traditional reinforcement learning methods (e.g. Q-learning or policy gradients) are not suitable for multi-agent environments. The invention therefore employs the MADDPG algorithm with its idea of centralized training and distributed execution, in which each mobile device performs resource allocation based only on its own observed environment.
Since each agent is continuously learning and improving its strategy, the environment is dynamically non-stationary from the viewpoint of each agent, and key techniques such as experience replay cannot be used directly. In the invention, each mobile device in the network system is regarded as an independent agent, and the MADDPG method based on the Actor-Critic network structure enables each mobile device to learn a suitable strategy to minimize delay and energy consumption.
Step 1, setting N mobile devices (intelligent agents) in a network, wherein the N mobile devices are expressed as {1, 2., N., N }, and N is less than or equal to N; a total of M subchannels, denoted as {1, 2., M., M }, M ≦ M; the task for the mobile device is denoted t1,t2...,tn,...,tNH, total tNA task;
and 2, establishing a task unloading and resource allocation combined optimization model by adopting the NOMA technology. Establishing a joint optimization model aiming at carrier channel allocation and task unloading of all mobile equipment in a network;
step 3, converting the combined optimization model into a Markov decision process model, and setting the state, action and reward in the Markov decision process;
and 4, training the learning network by using the MADDPG algorithm, wherein the training aims at minimizing the time delay of the mobile equipment, and the optimal joint subcarrier channel allocation and task unloading strategy is obtained as a result.
Examples
The system model of the invention is shown in FIG. 1; it mainly comprises a base station integrated with an MEC server and N mobile devices. The implementation of the solution is described in further detail below.
The specific implementation steps of the invention are as follows:
Step 1: there are N mobile devices (agents) in the network, denoted {1, 2, ..., n, ..., N}; there are M subchannels, denoted {1, 2, ..., m, ..., M}; the tasks of the mobile devices are denoted {t_1, t_2, ..., t_n, ..., t_N}, N tasks in total.
Step 2: a joint task offloading and resource allocation optimization model is established using the NOMA technique.
When the mobile devices access the network in the NOMA mode, one subchannel can be occupied by several mobile devices. For subchannel m, the superposed signal is X_m; at the receiving end of the system, the received signal of the user of the n-th mobile device on subchannel m is Y_{n,m}. The received signals are sorted by signal power; assuming the power of the n-th mobile device is the strongest, the n-th mobile device is decoded first and x_n is output, recovering the estimate of the n-th mobile device. This estimate is subtracted from the received signal to obtain the signals of the remaining users, the same operation is performed in turn according to power until the signals of all mobile devices are decoded, and the post-decoding signal-to-noise ratio is obtained.
The Shannon formula is used to obtain the maximum information rate R_{n,m} of the n-th mobile device user on subchannel m in the NOMA mode.
The total delay for user n to upload its task to the MEC server through subchannel m for task offloading is:
(formula image BDA0003147409150000061 in the original)
where c_k is the computing capacity of the MEC server and r_n is the result data computed by the MEC server;
the delay when user n computes locally is:
(formula image BDA0003147409150000062 in the original)
where f_n is the computing capacity of the mobile user.
Step 3: the joint optimization model is converted into a Markov decision process model, and the state, action and reward of the Markov decision process are defined.
The parameters are initialized. a_n is the action of mobile device n, denoted a_n = {d_n, c_n}, where d_n is a continuous value in [0, 1]: 0 means that mobile device n computes locally, and 1 means that mobile device n offloads its task entirely to the MEC server; c_n ∈ {0, 1, ..., m} indicates which subcarrier channel m mobile device n selects;
s_n is the state of mobile device n, denoted s_n = {X_n, x_n, G_m}, where X_n ∈ {0, 1}, with 0 indicating that the subcarrier channel is idle and 1 that it is busy, x_n is the data size of the offloaded task, and G_m is the channel information of the subchannel;
r_n is the reward function, defined as the negative of the system delay and denoted r_n = -EE(d_n, c_n).
Step 4: the learning network is trained by the MADDPG algorithm; the training objective is to minimize the delay of the mobile devices, and the result is the optimal joint subcarrier channel allocation and task offloading strategy.
Step (1) The MADDPG algorithm is adopted to update the networks of the mobile device users. Each mobile device user has an Actor network and a Critic network, and the Actor and Critic networks each have their own estimation network and target network; a block diagram is shown in FIG. 3. θ = [θ_1, θ_2, ..., θ_n] denotes the parameters of the n agents' policies. For the obtained state s_i, each agent generates an action a_i according to the deterministic policy of its Actor network, receives an immediate reward r_i, and enters the next state s'_i. The joint state, action, reward, and next-time state [x, a, r, x'] are stored in the experience pool D for subsequent training, where x = [s_1, s_2, ..., s_n] is the observation vector, i.e. the state, a = [a_1, a_2, ..., a_n] is the action, r = [r_1, r_2, ..., r_n] is the reward, and x' = [s'_1, s'_2, ..., s'_n] is the state at the next time.
Step (2) When the samples in the experience pool D reach a certain number, a batch of data is sampled from D for network training. The state s_i is input to the Actor estimation network of the i-th agent to obtain the action a_i and the reward r_i; then x and a are input to the Critic estimation network to obtain the estimated state-action function at the current time, i.e. the estimated Q function. The next-time state s'_i is input to the Actor target network to obtain the next-time action a'_i, and x' and a' are input to the Critic target network to obtain the target Q function y_i. The Critic estimation network is then updated by minimizing the loss function (the Critic network has an estimation network and a target network):
(formula image BDA0003147409150000071 in the original)
where Q^{μ'} denotes the Q value output by the Critic target network, and μ' = [μ'_1, μ'_2, ..., μ'_n] are the target policies with lagging-updated parameters θ'_j.
Step (3) Each agent updates its Actor estimation network according to the deterministic policy gradient and the Q function obtained from the Critic estimation network; for the cumulative expected reward J(μ_i) of the i-th agent, the policy gradient is expressed as ∇_{θ_i} J(μ_i).
Step (4) Steps (2) and (3) are repeated, and every fixed number of iterations the parameters of the Actor target network and the Critic target network are updated by a soft update method.
After the set number of iterations is reached and the network is trained, only the state s_t at the current time needs to be input into the Actor network; the output action a_t gives the optimal resource allocation scheme at the current time, so that the energy efficiency of the mobile device users is optimized. When the network state changes, a new allocation scheme is obtained simply by inputting the new state into the Actor network again.
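A compact PyTorch sketch of the per-agent update described in steps (1)-(4) is given below. The network sizes, hyperparameters, and the attribute names on the agent objects (actor, critic, their targets and optimisers) are assumptions made for illustration; it follows the general MADDPG scheme of a centralised Critic over the joint state and action with decentralised Actors, and is not the patent's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Actor(nn.Module):
    """Maps one agent's local state to its action (outputs kept in [0, 1])."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Sigmoid())

    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Centralised critic scoring the joint observation x and joint action a."""
    def __init__(self, joint_state_dim, joint_action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(joint_state_dim + joint_action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, x, a):
        return self.net(torch.cat([x, a], dim=-1))

def soft_update(target, source, tau=0.01):
    # Lagging ("soft") update of the target-network parameters theta'.
    for t, s in zip(target.parameters(), source.parameters()):
        t.data.copy_((1.0 - tau) * t.data + tau * s.data)

def maddpg_update(i, agents, batch, gamma=0.95):
    """One update of agent i's Critic and Actor estimation networks.

    `agents[j]` is assumed to hold actor, critic, actor_target, critic_target,
    actor_opt and critic_opt; `batch` holds per-agent tensors (states, actions,
    rewards, next_states) sampled from the experience pool D.
    """
    states, actions, rewards, next_states = batch
    x  = torch.cat(states, dim=-1)        # joint observation x
    a  = torch.cat(actions, dim=-1)       # joint action a
    x2 = torch.cat(next_states, dim=-1)   # joint next-time observation x'

    # Critic update: minimise (Q(x, a) - y_i)^2 with y_i from the target networks.
    with torch.no_grad():
        a2 = torch.cat([ag.actor_target(s2) for ag, s2 in zip(agents, next_states)], dim=-1)
        y  = rewards[i] + gamma * agents[i].critic_target(x2, a2)
    critic_loss = F.mse_loss(agents[i].critic(x, a), y)
    agents[i].critic_opt.zero_grad(); critic_loss.backward(); agents[i].critic_opt.step()

    # Actor update along the deterministic policy gradient of J(mu_i).
    a_pred = list(actions)
    a_pred[i] = agents[i].actor(states[i])
    actor_loss = -agents[i].critic(x, torch.cat(a_pred, dim=-1)).mean()
    agents[i].actor_opt.zero_grad(); actor_loss.backward(); agents[i].actor_opt.step()

    # Soft (lagging) update of the target networks.
    soft_update(agents[i].critic_target, agents[i].critic)
    soft_update(agents[i].actor_target, agents[i].actor)
```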
This example merely illustrates the process by which channel allocation and task offloading of the mobile devices minimize the system delay in this invention, and does not constrain the data parameters of the invention.
The following describes in detail, by way of example, the process of the MADDPG-based task offloading and resource allocation scheme using the NOMA technique. The concrete steps are as follows:
Step 1: the network contains 10 mobile devices (agents) and 5 subchannels in total, and the tasks of the mobile devices are denoted {t_1, t_2, ..., t_n, ..., t_N}, N tasks in total.
Step 2: a joint task offloading and resource allocation optimization model is established using the NOMA technique.
When the mobile devices access the network in the NOMA mode, one subchannel can be occupied by several mobile devices. For subchannel m, the superposed signal is X_m; at the receiving end of the system, the received signal of any user n on subchannel m is Y_{n,m}. The received signals are sorted by signal power; assuming the power of the n-th mobile device is the strongest, the n-th mobile device is decoded first and x_n is output, recovering the signal estimate of the n-th mobile device. This estimate is subtracted from the received signal to obtain the signals of the remaining users, the same operation is performed in turn according to power until the signals of all mobile devices are decoded, and the post-decoding signal-to-noise ratio is obtained.
The Shannon formula is used to obtain the maximum information rate R_{n,m} of the n-th mobile device user on subchannel m in the NOMA mode.
The total delay for the user of the n-th mobile device to upload its task to the MEC server through subchannel m for task offloading is:
(formula image BDA0003147409150000081 in the original)
where c_k is the computing capacity of the MEC server and r_n is the result data computed by the MEC server; the delay when the user of the n-th mobile device computes locally is:
(formula image BDA0003147409150000091 in the original)
where f_n is the computing capacity of the mobile user.
Step 3: the joint optimization model is converted into a Markov decision process model, and the state, action and reward of the Markov decision process are defined.
More specifically, the 10 mobile devices are each regarded as an agent. a_n is the action of mobile device n, denoted a_n = {d_n, c_n}, where d_n is a continuous value in [0, 1]: 0 means that mobile device n computes locally, and 1 means that mobile device n offloads its task entirely to the MEC server; c_n ∈ {0, 1, ..., m} indicates which subcarrier channel m mobile device n selects;
s_n is the state of mobile device n, denoted s_n = {X_n, x_n, G_m}, where X_n ∈ {0, 1} indicates whether the subcarrier channel is idle or busy, x_n is the data size of the offloaded task, and G_m is the channel information of the subchannel;
r_n is the reward function, defined as the negative of the system delay and denoted r_n = -EE(d_n, c_n).
Step 4: the learning network is trained by the MADDPG algorithm; the training objective is to minimize the delay of the mobile devices, and the result is the optimal joint subcarrier channel allocation and task offloading strategy.
Step (1) The MADDPG algorithm is adopted to update the networks of the mobile device users. Each mobile device user has an Actor network and a Critic network, and the Actor and Critic networks each have their own estimation network and target network; a block diagram is shown in FIG. 3. θ = [θ_1, θ_2, ..., θ_n], where θ_n denotes the policy parameters of the n-th mobile device user. For the obtained state s_i, each agent generates an action a_i according to the deterministic policy of its Actor network, receives an immediate reward r_i, and enters the next state s'_i. The joint state, action, reward, and next-time state [x, a, r, x'] are stored in the experience pool D for subsequent training, where x = [s_1, s_2, ..., s_n] is the observation vector, i.e. the state, a = [a_1, a_2, ..., a_n] is the action, r = [r_1, r_2, ..., r_n] is the reward, and x' = [s'_1, s'_2, ..., s'_n] is the state at the next time.
Step (2) When the number of samples in the experience pool D reaches 400, a batch of data is sampled from D for network training. The state s_i is input to the Actor estimation network of the i-th agent to obtain the action a_i and the reward r_i; then x and a are input to the Critic estimation network to obtain the estimated state-action function at the current time, i.e. the estimated Q function. The next-time state s'_i is input to the Actor target network to obtain the next-time action a'_i, and x' and a' are input to the Critic target network to obtain the target Q function y_i. The Critic estimation network is then updated by minimizing the loss function (the Critic network has an estimation network and a target network):
(formula image BDA0003147409150000101 in the original)
where Q^{μ'} denotes the Q value output by the Critic target network, and μ' = [μ'_1, μ'_2, ..., μ'_n] are the target policies with lagging-updated parameters θ'_j.
Step (3) Each agent updates its Actor estimation network according to the deterministic policy gradient and the Q function obtained from the Critic estimation network; for the cumulative expected reward J(μ_i) of the i-th agent, the policy gradient is expressed as ∇_{θ_i} J(μ_i).
Step (4) Steps (2) and (3) are repeated, and every 100 iterations the parameters of the Actor target network and the Critic target network are updated by a soft update method.
After 2000 iterations the network is trained; then only the state s_t at the current time needs to be input into the Actor network, and the output action a_t gives the optimal resource allocation scheme at the current time, so that the energy efficiency of the mobile device users is optimized. When the network state changes, a new allocation scheme is obtained simply by inputting the new state into the Actor network again.
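The example's concrete settings can be gathered into a single configuration; the values below that do not appear in the text (batch size, discount factor, soft-update rate, learning rates, hidden size) are assumptions added only to make the sketch complete.

```python
CONFIG = {
    "num_devices": 10,         # mobile devices / agents in the example
    "num_subchannels": 5,      # NOMA subcarrier channels in the example
    "replay_start": 400,       # begin sampling once the experience pool holds 400 samples
    "soft_update_every": 100,  # iterations between target-network soft updates
    "max_iterations": 2000,    # total training iterations
    "batch_size": 64,          # assumed
    "gamma": 0.95,             # assumed discount factor
    "tau": 0.01,               # assumed soft-update rate
    "actor_lr": 1e-4,          # assumed
    "critic_lr": 1e-3,         # assumed
    "hidden_units": 128,       # assumed
}
```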
This example merely illustrates the process by which channel allocation and task offloading of the mobile devices minimize the system delay in this invention, and does not constrain the data parameters of the invention.
The above description is only a preferred embodiment of the present invention. It will be apparent to those skilled in the art that various modifications and variations can be made without departing from the technical principles of the present invention, and such modifications and variations should also be regarded as falling within the scope of the present invention.

Claims (6)

1. A resource allocation and task offloading method based on NOMA-MEC reinforcement learning, characterized in that the method comprises the following steps:
Step 1: there are N mobile devices (agents) in the network, denoted {1, 2, ..., n, ..., N}; there are M subchannels, denoted {1, 2, ..., m, ..., M}; the tasks of the mobile devices are denoted {t_1, t_2, ..., t_n, ..., t_N}, N tasks in total;
Step 2: a joint task offloading and resource allocation optimization model is established using the NOMA technique; the joint optimization model covers carrier channel allocation and task offloading for all mobile devices in the network;
Step 3: the joint optimization model is converted into a Markov decision process model, and the state, action and reward of the Markov decision process are defined;
Step 4: the learning network is trained with the MADDPG algorithm; the training objective is to minimize the delay of the mobile devices, and the result is the optimal joint subcarrier channel allocation and task offloading strategy.
2. The NOMA-MEC-based reinforcement learning resource allocation and task offloading method of claim 1, wherein: the step 2 specifically comprises the following steps:
when the mobile devices access the network in the NOMA mode, one subchannel can be occupied by several mobile devices; for subchannel m, the superposed signal is X_m; at the receiving end of the system, the received signal of any user n on subchannel m is Y_{n,m}; the received signals are sorted by signal power; assuming the power of the n-th mobile device is the strongest, the n-th mobile device is decoded first and x_n is output, recovering the signal estimate of the n-th mobile device; this estimate is subtracted from the received signal to obtain the signals of the remaining users, the same operation is performed in turn according to power until the signals of all mobile devices are decoded, and the post-decoding signal-to-noise ratio is obtained;
the Shannon formula is used to obtain the maximum information rate R_{n,m} of the n-th mobile device user on subchannel m in the NOMA mode; the total delay for user n to upload its task to the MEC server through subchannel m for task offloading is:
(formula image FDA0003147409140000011 in the original)
where c_k is the computing capacity of the MEC server and r_n is the result data computed by the MEC server;
the delay when user n computes locally is:
(formula image FDA0003147409140000021 in the original)
where f_n is the computing capacity of the mobile user.
3. The NOMA-MEC-based reinforcement learning resource allocation and task offloading method of claim 1, wherein: in step 3, the parameters are initialized; a_n is the action of mobile device n, denoted a_n = {d_n, c_n}, where d_n is a continuous value in [0, 1]: 0 means that mobile device n computes locally, and 1 means that mobile device n offloads its task entirely to the MEC server; c_n ∈ {0, 1, ..., m} indicates which subcarrier channel m mobile device n selects;
s_n is the state of mobile device n, denoted s_n = {X_n, x_n, G_m}, where X_n ∈ {0, 1} indicates whether the subcarrier channel is idle or busy, x_n is the data size of the offloaded task, and G_m is the channel information of the subchannel;
r_n is the reward function, defined as the negative of the system delay and denoted r_n = -EE(d_n, c_n).
4. The NOMA-MEC-based reinforcement learning resource allocation and task offloading method of claim 1, wherein: in the step 4, the learning network is trained by the MADDPG algorithm, which specifically comprises the following steps:
the MADDPG algorithm is adopted to update the networks of the mobile device users; each mobile device user has an Actor network and a Critic network, and the Actor and Critic networks each have their own estimation network and target network; θ = [θ_1, θ_2, ..., θ_n] denotes the parameters of the n agents' policies; for the obtained state s_i, each agent generates an action a_i according to the deterministic policy of its Actor network, receives an immediate reward r_i, and enters the next state s'_i; the joint state, action, reward, and next-time state [x, a, r, x'] are stored in the experience pool D for subsequent training, where x = [s_1, s_2, ..., s_n] is the observation vector, i.e. the state, a = [a_1, a_2, ..., a_n] is the action, r = [r_1, r_2, ..., r_n] is the reward, and x' = [s'_1, s'_2, ..., s'_n] is the state at the next time.
5. The NOMA-MEC-based reinforcement learning resource allocation and task offloading method of claim 1, wherein: in step 4, the training aims at minimizing the time delay of the mobile device, and specifically includes the following steps:
4.1) when the number of samples in the experience pool D reaches a certain number, a batch of data is sampled from D for network training; the state s_i is input to the Actor estimation network of the i-th agent to obtain the action a_i and the reward r_i; then x and a are input to the Critic estimation network to obtain the estimated state-action function at the current time, i.e. the estimated Q function; the next-time state s'_i is input to the Actor target network to obtain the next-time action a'_i, and x' and a' are input to the Critic target network to obtain the target Q function y_i; the Critic estimation network is then updated by minimizing the loss function (the Critic network has an estimation network and a target network):
(formula image FDA0003147409140000031 in the original)
where Q^{μ'} denotes the Q value output by the Critic target network, and μ' = [μ'_1, μ'_2, ..., μ'_n] are the target policies with lagging-updated parameters θ'_j;
4.2) each agent updates its Actor estimation network according to the deterministic policy gradient and the Q function obtained from the Critic estimation network; for the cumulative expected reward J(μ_i) of the i-th agent, the policy gradient is expressed as
(formula image FDA0003147409140000032 in the original)
6. The NOMA-MEC-based reinforcement learning resource allocation and task offloading method of claim 5, wherein: in step 4, steps 4.1) and 4.2) are repeated, and every fixed number of iterations the parameters of the Actor target network and the Critic target network are updated by a soft update method; after the set number of iterations is reached and the network is trained, the state s_t at the current time is input into the Actor network, and the output action a_t gives the optimal resource allocation scheme at the current time, so that the energy efficiency of the mobile device users is optimized; when the network state changes, a new state is input into the Actor network again to obtain a new allocation scheme.
CN202110756466.4A 2021-07-05 2021-07-05 NOMA-MEC-based reinforcement learning resource allocation and task unloading method Active CN113543342B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110756466.4A CN113543342B (en) 2021-07-05 2021-07-05 NOMA-MEC-based reinforcement learning resource allocation and task unloading method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110756466.4A CN113543342B (en) 2021-07-05 2021-07-05 NOMA-MEC-based reinforcement learning resource allocation and task unloading method

Publications (2)

Publication Number Publication Date
CN113543342A true CN113543342A (en) 2021-10-22
CN113543342B CN113543342B (en) 2024-03-29

Family

ID=78097770

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110756466.4A Active CN113543342B (en) 2021-07-05 2021-07-05 NOMA-MEC-based reinforcement learning resource allocation and task unloading method

Country Status (1)

Country Link
CN (1) CN113543342B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114375066A (en) * 2022-01-08 2022-04-19 山东大学 Distributed channel competition method based on multi-agent reinforcement learning
CN114665952A (en) * 2022-03-24 2022-06-24 重庆邮电大学 Low-orbit satellite network beam hopping optimization method based on satellite-ground fusion architecture
CN114938381A (en) * 2022-06-30 2022-08-23 西安邮电大学 D2D-MEC unloading method based on deep reinforcement learning and computer program product
CN116367223A (en) * 2023-03-30 2023-06-30 广州爱浦路网络技术有限公司 XR service optimization method and device based on reinforcement learning, electronic equipment and storage medium
WO2023179010A1 (en) * 2022-03-22 2023-09-28 南京邮电大学 User packet and resource allocation method and apparatus in noma-mec system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180317157A1 (en) * 2017-04-27 2018-11-01 Samsung Electronics Co., Ltd. Method and apparatus for registration type addition for service negotiation
CN111800828A (en) * 2020-06-28 2020-10-20 西北工业大学 Mobile edge computing resource allocation method for ultra-dense network
CN111918339A (en) * 2020-07-17 2020-11-10 西安交通大学 AR task unloading and resource allocation method based on reinforcement learning in mobile edge network
US20210034970A1 (en) * 2018-02-05 2021-02-04 Deepmind Technologies Limited Distributed training using actor-critic reinforcement learning with off-policy correction factors
CN112601284A (en) * 2020-12-07 2021-04-02 南京邮电大学 Downlink multi-cell OFDMA resource allocation method based on multi-agent deep reinforcement learning
CN112788764A (en) * 2020-12-23 2021-05-11 华北电力大学 Method and system for task unloading and resource allocation of NOMA ultra-dense network

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180317157A1 (en) * 2017-04-27 2018-11-01 Samsung Electronics Co., Ltd. Method and apparatus for registration type addition for service negotiation
US20210034970A1 (en) * 2018-02-05 2021-02-04 Deepmind Technologies Limited Distributed training using actor-critic reinforcement learning with off-policy correction factors
CN111800828A (en) * 2020-06-28 2020-10-20 西北工业大学 Mobile edge computing resource allocation method for ultra-dense network
CN111918339A (en) * 2020-07-17 2020-11-10 西安交通大学 AR task unloading and resource allocation method based on reinforcement learning in mobile edge network
CN112601284A (en) * 2020-12-07 2021-04-02 南京邮电大学 Downlink multi-cell OFDMA resource allocation method based on multi-agent deep reinforcement learning
CN112788764A (en) * 2020-12-23 2021-05-11 华北电力大学 Method and system for task unloading and resource allocation of NOMA ultra-dense network

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114375066A (en) * 2022-01-08 2022-04-19 山东大学 Distributed channel competition method based on multi-agent reinforcement learning
CN114375066B (en) * 2022-01-08 2024-03-15 山东大学 Distributed channel competition method based on multi-agent reinforcement learning
WO2023179010A1 (en) * 2022-03-22 2023-09-28 南京邮电大学 User packet and resource allocation method and apparatus in noma-mec system
CN114665952A (en) * 2022-03-24 2022-06-24 重庆邮电大学 Low-orbit satellite network beam hopping optimization method based on satellite-ground fusion architecture
CN114938381A (en) * 2022-06-30 2022-08-23 西安邮电大学 D2D-MEC unloading method based on deep reinforcement learning and computer program product
CN114938381B (en) * 2022-06-30 2023-09-01 西安邮电大学 D2D-MEC unloading method based on deep reinforcement learning
CN116367223A (en) * 2023-03-30 2023-06-30 广州爱浦路网络技术有限公司 XR service optimization method and device based on reinforcement learning, electronic equipment and storage medium
CN116367223B (en) * 2023-03-30 2024-01-02 广州爱浦路网络技术有限公司 XR service optimization method and device based on reinforcement learning, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN113543342B (en) 2024-03-29

Similar Documents

Publication Publication Date Title
CN113543342B (en) NOMA-MEC-based reinforcement learning resource allocation and task unloading method
CN109729528B (en) D2D resource allocation method based on multi-agent deep reinforcement learning
Bi et al. Joint optimization of service caching placement and computation offloading in mobile edge computing systems
CN111405568B (en) Computing unloading and resource allocation method and device based on Q learning
CN111800828B (en) Mobile edge computing resource allocation method for ultra-dense network
CN110798849A (en) Computing resource allocation and task unloading method for ultra-dense network edge computing
CN113612843A (en) MEC task unloading and resource allocation method based on deep reinforcement learning
CN112738185B (en) Edge computing system control joint optimization method based on non-orthogonal multiple access and application
Chen et al. Cache-assisted collaborative task offloading and resource allocation strategy: A metareinforcement learning approach
CN113645637B (en) Method and device for unloading tasks of ultra-dense network, computer equipment and storage medium
Wang et al. Multi-agent reinforcement learning-based user pairing in multi-carrier NOMA systems
Zheng et al. Channel assignment for hybrid NOMA systems with deep reinforcement learning
Ai et al. Dynamic offloading strategy for delay-sensitive task in mobile-edge computing networks
Li et al. Deep neural network based computational resource allocation for mobile edge computing
Zhang et al. A deep reinforcement learning approach for online computation offloading in mobile edge computing
Shang et al. Computation offloading and resource allocation in NOMA-MEC: A deep reinforcement learning approach
Fang et al. Smart collaborative optimizations strategy for mobile edge computing based on deep reinforcement learning
Chen et al. Joint optimization of task offloading and resource allocation via deep reinforcement learning for augmented reality in mobile edge network
Hu et al. Dynamic task offloading in MEC-enabled IoT networks: A hybrid DDPG-D3QN approach
Yang et al. Distributed reinforcement learning for NOMA-enabled mobile edge computing
Gan et al. A multi-agent deep reinforcement learning approach for computation offloading in 5G mobile edge computing
CN113315806B (en) Multi-access edge computing architecture for cloud network fusion
Li et al. DQN-based computation-intensive graph task offloading for internet of vehicles
Li et al. Task computation offloading for multi-access edge computing via attention communication deep reinforcement learning
Xie et al. Backscatter-aided hybrid data offloading for mobile edge computing via deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant