CN116347635A - NB-IoT wireless resource allocation method based on NOMA and multi-agent reinforcement learning - Google Patents
NB-IoT wireless resource allocation method based on NOMA and multi-agent reinforcement learning
- Publication number: CN116347635A
- Application number: CN202310427926.8A
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W72/00—Local resource management
- H04W72/50—Allocation or scheduling criteria for wireless resources
- H04W72/54—Allocation or scheduling criteria for wireless resources based on quality criteria
- H04W72/543—Allocation or scheduling criteria for wireless resources based on quality criteria based on requested quality, e.g. QoS
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/092—Reinforcement learning
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W72/00—Local resource management
- H04W72/04—Wireless resource allocation
- H04W72/044—Wireless resource allocation based on the type of the allocated resource
- H04W72/0453—Resources in frequency domain, e.g. a carrier in FDMA
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/70—Reducing energy consumption in communication networks in wireless communication networks
Abstract
The invention discloses an NB-IoT wireless resource allocation method based on NOMA and multi-agent reinforcement learning, which aims to maximize connection density in a multi-user NB-IoT scenario employing NOMA. A realistic scenario is considered in which different users have different QoS requirements and use different tone types. Unlike traditional heuristic algorithms, the invention models the joint optimization of power allocation, resource block allocation, and NOMA user pairing as a Markov decision process and solves it with MAPPO, a state-of-the-art multi-agent reinforcement learning algorithm. To handle invalid actions, the method masks them while the neural network computes the probability distribution over actions, which accelerates convergence. System-level simulations using Python 3.8.15 on the VSCode platform show that the MAPPO algorithm outperforms the baseline algorithms.
Description
Technical Field
The invention belongs to the field of wireless communication technology and artificial intelligence, and particularly relates to an NB-IoT wireless resource allocation method based on NOMA and multi-agent reinforcement learning.
Background
With the advent of various types of smart devices, massive machine-type communication (mMTC) networks will become one of the most important communication networks in the 5G and B5G Internet of Things (IoT). According to IoT industry research, over 64 billion IoT devices will exist by 2025, 30 billion of which will be active. To support such massive IoT scenarios, the Third Generation Partnership Project (3GPP) has standardized several cellular IoT technologies, including Narrowband IoT (NB-IoT). NB-IoT, a Low-Power Wide-Area (LPWA) cellular network technology typically occupying 180 kHz of system bandwidth, supports single-tone and multi-tone modes. The goal of 5G mMTC is to support a huge connection density of 1 million devices per square kilometer with limited radio resources (the goal of 6G mMTC is 10 million devices per square kilometer). Therefore, to meet the performance requirements of 5G and future 6G on mMTC, maximizing the connection density of mMTC devices under limited resources is an important research topic in the industry.
Recently, non-orthogonal multiple access (NOMA) techniques have received widespread attention. NOMA supports non-orthogonal resource allocation among different users at the expense of receiver complexity. In addition, deep reinforcement learning (DRL) has recently made great progress, and its use in wireless communication systems has also received great attention. Reinforcement learning requires no prior knowledge of an environment model; an optimal policy is trained through continuous interaction between the agent and the environment. Once training is complete, the agent can make decisions in real time, which is a great advantage over traditional optimization algorithms. Multi-agent DRL is a generalization of single-agent DRL that enables a group of agents to learn optimal policies through interaction with the environment and with each other.
Currently, mMTC resource allocation mainly adopts orthogonal multiple access (OMA), which has low spectrum efficiency compared with non-orthogonal multiple access (NOMA) and cannot effectively support massive connectivity. In addition, optimization problems in resource allocation are traditionally solved with model-based mathematical optimization. However, due to the complexity of future 5G and 6G systems, optimizing the configurable parameters of the system is extremely complex, and adapting the configuration parameters in a dynamically changing environment is also very challenging.
Disclosure of Invention
The invention aims to solve the problem of maximizing NB-IoT connection density. To this end, an NB-IoT radio resource allocation algorithm based on NOMA and multi-agent DRL with centralized training and decentralized execution is proposed, which solves the problems mentioned in the Background section.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a wireless resource allocation scheme based on NOMA and multi-agent DRL models the resource allocation problem in a multi-user NB-IoT network as a Markov decision process and adopts MAPPO algorithm to solve the problem, comprising the following steps:
step 1, establishing an optimization model, wherein the model is expressed as:

$$\max \;\; \bar{\tau} = \frac{1}{T}\sum_{t=1}^{T}\frac{1}{N_t}\sum_{n=1}^{N_t}\left(1-c_n^t\right)$$

constraint conditions:

$$\text{C1:}\;\; \sum_{m=1}^{M}\sum_{f=1}^{F}\alpha_n^t(m,f)\;\ge\;\mu_n^t\,\Omega_n^t,\qquad\forall n,t$$

$$\text{C2:}\;\; \alpha_n^t(m,f)\in\{0,1\},\quad m\in\{1,2,\dots,M\},\;f\in\{1,2,\dots,F\}$$

$$\text{C3:}\;\; p_n^t\in\{P_1,P_2,P_3\}$$

wherein $\bar{\tau}$ represents the average successful transmission ratio of the devices over a period of time; $T$ represents the number of frames; $N_t$ indicates that data packets from $N_t$ devices arrive at the base station in frame $t$; $c_n^t\in\{0,1\}$ is the outage indication, with $c_n^t=1$ indicating that device $n$ is in outage at frame $t$; the resource units and transmission power of the mMTC devices are allocated at the beginning of each frame; let $\alpha_n^t(m,f)$ be the resource unit indicator, with $\alpha_n^t(m,f)=1$ indicating that resource element RE$(m,f)$ is allocated to device $n$ at frame $t$; the total number of resource units in outage for device $n$ is $C_n^t=\sum_{m,f}\alpha_n^t(m,f)\,e_n^t(m,f)$, where $e_n^t(m,f)$ is the resource unit outage indication, with $e_n^t(m,f)=1$ representing that the SINR on RE$(m,f)$ in frame $t$ is below the threshold; the instantaneous signal-to-interference-plus-noise ratio (SINR) of device $n$ on resource element RE$(m,f)$ is defined as:

$$\gamma_n^t(m,f)=\frac{p_n^t\,\beta_n^t\,|g_n^t|^2}{\sigma^2+\sum_{i=1,\,i\ne n}^{I}p_i^t\,|h_i^t|^2}$$

where $p_n^t$ is the transmission power of mMTC device $n$ at frame $t$ and $\sigma^2$ is the noise power; $h_n^t$ denotes the fading coefficient of device $n$ at frame $t$, whose power fading coefficient is $|h_n^t|^2=\beta_n^t\,|g_n^t|^2$, where $\beta_n^t$ is the large-scale fading coefficient of device $n$ at frame $t$ and $g_n^t$ is the small-scale fading coefficient; $I$ is the total number of devices superimposed on the same resource block, $p_i^t$ represents the transmit power of the $i$-th such device, and $h_i^t$ the channel coefficient of the $i$-th device; to ensure that user $n$ decodes successfully, the SINR $\gamma_n^t(m,f)$ of user $n$'s packet must reach the SINR threshold $\gamma_{th}$;

C1 ensures that device $n$ is allocated sufficient resources at frame $t$, where $\mu_n^t=\lceil D_n^t/D_{TBS}\rceil$ represents the number of transport blocks required by device $n$ at frame $t$, $D_n^t$ is the packet size of device $n$ in frame $t$, and $D_{TBS}$ is the size of the selected transport block; $\Omega_n^t$ represents the number of resource elements required to transmit $D_{TBS}$ bits of information, i.e. the number of resource elements occupied by one resource unit of device $n$ in frame $t$, which takes one value for single-tone transmission and another for multi-tone transmission;

C2 is the time-frequency resource allocation constraint: one frame is divided into $M$ slots in the time domain and the total bandwidth is divided into $F$ orthogonal parts in the frequency domain, so that $m\in\{1,2,\dots,M\}$ and $f\in\{1,2,\dots,F\}$;

C3 is the power allocation constraint: each mMTC device selects among 3 power levels;

step 2, modeling the optimization model of step 1 as a Markov decision process;

step 3, solving the Markov decision process of step 2 with the MAPPO algorithm with action masking.
Further, modeling the optimization problem using the Markov decision process in step 2 includes the following steps:
step 2.1 definition of the State
MAPPO uses centralized training with distributed execution. During execution, each mMTC device can only observe its own state; let $o_n^t$ denote the observed state of mMTC device $n$ at frame $t$. During training, the base station can obtain the observed states of all mMTC devices; let $s_t$ denote the observed state of the base station at frame $t$, so that $s_t=\left\{o_1^t,o_2^t,\dots,o_{N_t}^t\right\}$.
Step 2.2 definition of action
When the mMTC device observes the state fed back by the environment, it autonomously selects the transmission power and the resource units to occupy, so the action of device $n$ in frame $t$ is $a_n^t=\left\{p_n^t,\alpha_n^t\right\}$, where $\alpha_n^t$ represents the resource unit allocation of the device in the frame;
step 2.3 definition of rewards
After performing action $a_n^t$, the environment immediately feeds back a reward. To optimize the network objective, i.e. to minimize the average outage rate of all devices over a period of $T$ frames, the reward is defined as:

$$r_t=-\frac{1}{N_t}\sum_{n=1}^{N_t}c_n^t$$

During training, all agents share the reward.
Further, solving the problem in step 3 using the MAPPO algorithm with action masking includes the following steps:
step 3.1, MAPPO consists of an Actor network and a Critic network; let

$$r_{i,k}(\theta)=\frac{\pi_\theta\left(a_{i,k}\mid o_{i,k}\right)}{\pi_{\theta_{old}}\left(a_{i,k}\mid o_{i,k}\right)}$$

represent the gap between the two Actor networks, where $\pi_\theta$ is the Actor network after the parameter update, mapping an observed state $o_{i,k}$ to a specific action $a_{i,k}$, and $\pi_{\theta_{old}}$ is the Actor network before the parameter update; the Actor network uses the Adam stochastic gradient ascent algorithm to optimize the following function:

$$L(\theta)=\frac{1}{BN}\sum_{i=1}^{B}\sum_{k=1}^{N}\min\left(r_{i,k}(\theta)\,A_{i,k},\;\mathrm{clip}\left(r_{i,k}(\theta),\,1-\epsilon,\,1+\epsilon\right)A_{i,k}\right)$$

where $L(\theta)$ is the objective function, $A_{i,k}$ is the advantage function, $B$ is the batch size, and $N$ is the number of agents; $\mathrm{clip}(\cdot,1-\epsilon,1+\epsilon)$ limits the ratio to between $1-\epsilon$ and $1+\epsilon$, with $\epsilon$ a hyperparameter in $[0,1]$;
in order to enhance the exploration capability of the neural network, penalty terms are added to the original optimization function, so that the entropy of the distribution is kept larger, and the neural network does not output certain fixed actions, and the original optimization function is modified as follows:
wherein S is policy entropy, and v is entropy coefficient super-parameter;
critic networks then use a gradient descent algorithm to minimize the following loss functions:
where L (phi) is the cost function of the Critic network,is the state cost function of the Critic network before updating the parameters,/->Is the state cost function of the Critic network after updating the parameters,/>Rewarding discounts;
step 3.2, action mask
The policy network selects actions based on the softmax of the action logits. However, not every action is valid. Since the input dimension of the neural network is fixed, assume the action space of each device has size $\kappa_D$; if at the current step only $\kappa_K$ actions are valid, then $\kappa_D-\kappa_K$ actions are invalid. Because the data packets arriving in each frame are random, there are many invalid actions, and they change dynamically. A conventional neural network model computes the probability of all actions (including invalid ones), which reduces the convergence speed; action masking is therefore proposed to alleviate this problem. Specifically, let $l(s)_i$ be the logit generated for the $i$-th action in state $s$; invalid actions are masked by modifying the logits as follows:

$$l'(s)_i=\begin{cases}l(s)_i,&\text{if action }i\text{ is valid}\\-\mathrm{inf},&\text{otherwise}\end{cases}$$

where $-\mathrm{inf}$ is a large negative number ensuring that the probability of selecting an invalid action is zero; action masking is employed during both training and execution.
Compared with the prior art, the invention has the beneficial effects that:
(1) The present invention contemplates using NOMA technology in NB-IoT networks, which greatly improves the spectral efficiency and connection density of the system by serving multiple users on one resource block.
(2) The invention models the resource allocation problem in a multi-user NB-IoT network containing both single-tone and multi-tone devices as a Markov decision process and solves it with deep reinforcement learning, thereby avoiding the complexity of traditional mathematical optimization algorithms. Compared with a traditional single agent, the invention uses multi-agent reinforcement learning: during centralized training, each agent can cooperate with the other agents to obtain better results; during distributed execution, each agent acts according to its own observed state, which reduces the complexity of the system.
(3) The multi-agent reinforcement learning provided by the invention is scalable and can be applied to larger and more complex environments. In multi-agent reinforcement learning, each agent only needs to attend to local information, enabling efficient parallel computation.
Drawings
The invention is further illustrated by the following examples in conjunction with the accompanying drawings:
FIG. 1 is a general flow chart of an embodiment of the present invention;
fig. 2 is a graph showing the average successful transmission probability of the single-tone and multi-tone coexistence scene according to the present invention according to the NB-IoT packet arrival rate;
fig. 3 is a graph showing the average successful transmission probability of the multitone scene according to the NB-IoT packet arrival rate.
Detailed Description
The invention provides a multi-agent wireless resource allocation method. Against the background of a multi-user NB-IoT system based on NOMA, the method takes the long-term average successful transmission probability of the system as the objective function, takes the device transmit power and the number of allocated resource units as constraints, and studies the joint optimization of resource block allocation, power allocation, and user pairing. To solve the resulting complex combinatorial optimization problem, the invention models the original problem as a Markov decision process and introduces the advanced MAPPO algorithm to solve it.
For a better illustration of the method of the invention, a more detailed embodiment follows. Consider an NB-IoT uplink transmission system based on non-orthogonal multiple access with $N$ devices, and model the packet arrival process of device $n$ as an independent renewal process with arrival intensity $\lambda_n$. According to Palm–Khinchine theory, the superposition of the renewal processes of the massive IoT devices at the BS can be approximated by a Poisson process with intensity $\lambda=\sum_{n=1}^{N}\lambda_n$. The distance between any mMTC device and the BS is denoted $d$, and the coverage radius of the BS is $R_S$. Let $d/R_S$ obey the Beta$(a,b)$ distribution shown below:

$$f(x)=\frac{x^{a-1}(1-x)^{b-1}}{\int_0^1 u^{a-1}(1-u)^{b-1}\,du},\qquad x\in[0,1]$$

wherein $a$ and $b$ are adjustable parameters, so that different spatial distributions of IoT devices can be simulated in different scenarios, and $u$ is the integration variable. Further, assume the base station serves at most $D$ devices per step.
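As an illustrative sketch (not the patent's own code), the Beta$(a,b)$ placement of devices within the cell can be sampled with Python's standard library; the radius and the Beta(2,4) parameters below are example values taken from the embodiment, and the function name is an assumption:

```python
import random

def sample_device_distances(n_devices, radius, a=2.0, b=4.0, seed=0):
    """Sample device-to-BS distances d such that d / R_S ~ Beta(a, b)."""
    rng = random.Random(seed)  # seeded for reproducibility
    return [radius * rng.betavariate(a, b) for _ in range(n_devices)]

# Roughly half the 1732 m inter-site distance as the cell radius (assumption).
distances = sample_device_distances(1000, radius=866.0)
print(min(distances), max(distances))
```

Since Beta(2,4) has mean 1/3, most devices fall in the inner third of the cell, which models an indoor-heavy deployment.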
Step 1: establishing an optimization model, wherein the model is expressed as:

$$\max \;\; \bar{\tau} = \frac{1}{T}\sum_{t=1}^{T}\frac{1}{N_t}\sum_{n=1}^{N_t}\left(1-c_n^t\right)$$

constraint conditions:

$$\text{C1:}\;\; \sum_{m=1}^{M}\sum_{f=1}^{F}\alpha_n^t(m,f)\;\ge\;\mu_n^t\,\Omega_n^t,\qquad\forall n,t$$

$$\text{C2:}\;\; \alpha_n^t(m,f)\in\{0,1\},\quad m\in\{1,2,\dots,M\},\;f\in\{1,2,\dots,F\}$$

$$\text{C3:}\;\; p_n^t\in\{P_1,P_2,P_3\}$$

wherein $\bar{\tau}$ represents the average successful transmission ratio of the devices over a period of time; $T$ represents the number of frames; $N_t$ indicates that data packets from $N_t$ devices arrive at the base station in frame $t$; $c_n^t\in\{0,1\}$ is the outage indication, with $c_n^t=1$ indicating that device $n$ is in outage at frame $t$; the resource units and transmission power of the mMTC devices are allocated at the beginning of each frame; let $\alpha_n^t(m,f)$ be the resource unit indicator, with $\alpha_n^t(m,f)=1$ indicating that resource element RE$(m,f)$ is allocated to device $n$ at frame $t$; the total number of resource units in outage for device $n$ is $C_n^t=\sum_{m,f}\alpha_n^t(m,f)\,e_n^t(m,f)$, where $e_n^t(m,f)$ is the resource unit outage indication, with $e_n^t(m,f)=1$ representing that the SINR on RE$(m,f)$ in frame $t$ is below the threshold; the instantaneous signal-to-interference-plus-noise ratio (SINR) of device $n$ on resource element RE$(m,f)$ is defined as:

$$\gamma_n^t(m,f)=\frac{p_n^t\,\beta_n^t\,|g_n^t|^2}{\sigma^2+\sum_{i=1,\,i\ne n}^{I}p_i^t\,|h_i^t|^2}$$

where $p_n^t$ is the transmission power of mMTC device $n$ at frame $t$ and $\sigma^2$ is the noise power; $h_n^t$ denotes the fading coefficient of device $n$ at frame $t$, whose power fading coefficient is $|h_n^t|^2=\beta_n^t\,|g_n^t|^2$, where $\beta_n^t$ is the large-scale fading coefficient of device $n$ at frame $t$ and $g_n^t$ is the small-scale fading coefficient; $I$ is the total number of devices superimposed on the same resource block, $p_i^t$ represents the transmit power of the $i$-th such device, and $h_i^t$ the channel coefficient of the $i$-th device. To ensure that user $n$ decodes successfully, the SINR $\gamma_n^t(m,f)$ of user $n$'s packet must reach the SINR threshold $\gamma_{th}$. If any resource unit allocated to device $n$ violates the SINR requirement, device $n$ is in outage; thus $c_n^t=1$ if $C_n^t>0$, and $c_n^t=0$ otherwise. C1 ensures that device $n$ is allocated sufficient resources in frame $t$, where $\mu_n^t=\lceil D_n^t/D_{TBS}\rceil$ represents the number of transport blocks required by device $n$ at frame $t$, $D_n^t$ is the packet size of device $n$ in frame $t$, and $D_{TBS}$ is the size of the selected transport block; $\Omega_n^t$ represents the number of resource elements required to transmit $D_{TBS}$ bits of information, i.e. the number of resource elements occupied by one resource unit of device $n$ in frame $t$, which takes one value for single-tone transmission and another for multi-tone transmission. C2 is the time-frequency resource allocation constraint: one frame is divided into $M$ slots in the time domain and the total bandwidth is divided into $F$ orthogonal parts in the frequency domain, so that $m\in\{1,2,\dots,M\}$ and $f\in\{1,2,\dots,F\}$. C3 is the power allocation constraint: each mMTC device selects among 3 power levels.
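To make the SINR definition concrete, the following sketch (an illustration, not the patent's implementation; all numeric values are assumptions) computes the SINR of a device superimposed with others on the same resource element and checks it against a decoding threshold:

```python
import math

def sinr(p_n, h2_n, interferers, noise):
    """SINR on RE(m,f): p_n * |h_n|^2 / (noise + sum_i p_i * |h_i|^2).

    interferers: list of (p_i, |h_i|^2) pairs for devices superimposed
    on the same resource block under NOMA.
    """
    interference = sum(p_i * h2_i for p_i, h2_i in interferers)
    return (p_n * h2_n) / (noise + interference)

def db(x):
    return 10.0 * math.log10(x)

# Device of interest paired with one NOMA partner on the same resource block.
gamma = sinr(p_n=0.2, h2_n=1e-9, interferers=[(0.1, 1e-10)], noise=1e-12)
outage = db(gamma) < 4.3  # single-tone SINR threshold from the embodiment
print(db(gamma), outage)
```

The outage indicator $e_n^t(m,f)$ of the model is exactly this threshold comparison, applied per allocated resource element.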
Step 2: modeling the optimization problem using a Markov decision process.
2.1 definition of the State
MAPPO uses centralized training with distributed execution. During execution, each device can only observe its own state; let $o_n^t$ denote the observed state of mMTC device $n$ at frame $t$. During training, the base station can obtain the observed states of all mMTC devices; let $s_t$ denote the observed state of the base station at frame $t$, so that $s_t=\left\{o_1^t,o_2^t,\dots,o_{N_t}^t\right\}$.
2.2 definition of actions
When the mMTC device observes the state fed back by the environment, it autonomously selects the transmission power and the resource units to occupy, so the action of device $n$ in frame $t$ is $a_n^t=\left\{p_n^t,\alpha_n^t\right\}$, where $\alpha_n^t$ represents the resource unit allocation of the device in the frame.
2.3 definition of rewards
After performing action $a_n^t$, the environment immediately feeds back a reward. To optimize the network objective, i.e. to minimize the average outage rate of all devices over a period of $T$ frames, the reward is defined as:

$$r_t=-\frac{1}{N_t}\sum_{n=1}^{N_t}c_n^t$$

During training, all agents share the reward.
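A minimal sketch of the shared reward, under the reading that the per-frame reward is the negative average outage rate (the indicator list below is illustrative):

```python
def shared_reward(outage_indicators):
    """r_t = -(1/N_t) * sum_n c_n^t: all agents receive this common reward."""
    n = len(outage_indicators)
    return -sum(outage_indicators) / n if n else 0.0

print(shared_reward([0, 1, 0, 0]))  # one of four devices in outage -> -0.25
```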
Step 3: solving the problem using a reinforcement learning algorithm based on MAPPO.
3.1 MAPPO

MAPPO is a multi-agent PPO with centralized training and distributed execution. MAPPO consists of an Actor network and a Critic network. Let

$$r_{i,k}(\theta)=\frac{\pi_\theta\left(a_{i,k}\mid o_{i,k}\right)}{\pi_{\theta_{old}}\left(a_{i,k}\mid o_{i,k}\right)}$$

represent the gap between the two Actor networks, where $\pi_\theta$ is the Actor network after the parameter update, mapping an observed state $o_{i,k}$ to a specific action $a_{i,k}$, and $\pi_{\theta_{old}}$ is the Actor network before the parameter update. The Actor network uses the Adam stochastic gradient ascent algorithm to optimize the following function:

$$L(\theta)=\frac{1}{BN}\sum_{i=1}^{B}\sum_{k=1}^{N}\min\left(r_{i,k}(\theta)\,A_{i,k},\;\mathrm{clip}\left(r_{i,k}(\theta),\,1-\epsilon,\,1+\epsilon\right)A_{i,k}\right)$$

where $L(\theta)$ is the objective function, $A_{i,k}$ is the advantage function, $B$ is the batch size, and $N$ is the number of agents; $\mathrm{clip}(\cdot,1-\epsilon,1+\epsilon)$ limits the ratio to between $1-\epsilon$ and $1+\epsilon$, with $\epsilon$ a hyperparameter in $[0,1]$;
in order to enhance the exploration capability of the neural network, the invention adds a punishment term in the original optimization function, which keeps the entropy of the distribution larger, so that the neural network does not output certain fixed actions, and the original optimization function is modified as follows:
wherein S is policy entropy, and v is entropy coefficient super-parameter;
critic networks then use a gradient descent algorithm to minimize the following loss functions:
where L (phi) is the cost function of the Critic network,is the state cost function of the Critic network before updating the parameters,/->Is the state cost function of the Critic network after updating the parameters,/>Rewarding discounts.
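The clipped Actor surrogate and value-clipped Critic loss can be sketched in plain Python (an illustration of the PPO-style objectives, not the patent's code; the $\epsilon$ values are assumptions):

```python
def clip(x, lo, hi):
    return max(lo, min(hi, x))

def actor_objective(ratios, advantages, eps=0.2):
    """PPO clipped surrogate, averaged over a batch of (ratio, advantage) pairs."""
    terms = [min(r * a, clip(r, 1.0 - eps, 1.0 + eps) * a)
             for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)

def critic_loss(v_new, v_old, returns, eps=0.2):
    """MAPPO-style value clipping: max of unclipped and clipped squared errors."""
    terms = [max((vn - R) ** 2,
                 (clip(vn, vo - eps, vo + eps) - R) ** 2)
             for vn, vo, R in zip(v_new, v_old, returns)]
    return sum(terms) / len(terms)

# A large ratio with positive advantage is capped by the clip at 1 + eps.
print(actor_objective([2.0], [1.0]))
```

The `min` in the surrogate makes the update pessimistic: the clipped term only removes the incentive to push the ratio far beyond $1\pm\epsilon$, it never rewards doing so.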
3.2 action mask
The policy network selects actions based on the softmax of the action logits. However, not every action is valid. Since the input dimension of the neural network is fixed, assume the action space of each device has size $\kappa_D$; if at the current step only $\kappa_K$ actions are valid, then $\kappa_D-\kappa_K$ actions are invalid. Because the data packets arriving in each frame are random, there are many invalid actions, and they change dynamically. A conventional neural network model computes the probability of all actions (including invalid ones), which reduces the convergence speed; the invention therefore proposes action masking to alleviate this problem. Specifically, let $l(s)_i$ be the logit generated for the $i$-th action in state $s$; invalid actions are masked by modifying the logits as follows:

$$l'(s)_i=\begin{cases}l(s)_i,&\text{if action }i\text{ is valid}\\-\mathrm{inf},&\text{otherwise}\end{cases}$$

where $-\mathrm{inf}$ is a large negative number ensuring that the probability of selecting an invalid action is zero; action masking is employed during both training and execution.
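A minimal numerical sketch of the action mask (illustrative only; the logits and validity pattern below are made up):

```python
import math

NEG_INF = -1e8  # "large negative number" standing in for -inf

def masked_softmax(logits, valid):
    """Softmax over logits with invalid actions driven to (numerically) zero probability."""
    masked = [l if v else NEG_INF for l, v in zip(logits, valid)]
    m = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in masked]
    z = sum(exps)
    return [e / z for e in exps]

probs = masked_softmax([1.0, 2.0, 0.5], valid=[True, False, True])
print(probs)  # the masked (second) action gets probability ~0
```

Because the masked logit underflows in `exp`, the invalid action can never be sampled, during either training or execution.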
System-level simulations were performed on the VSCode platform using Python 3.8.15. The carrier frequency of the BS is 900 MHz, the inter-site distance (ISD) is 1732 m, the device locations satisfy the Beta(2, 4) distribution, and the device transmit power has three options: 23 dBm, 20 dBm, and 17 dBm. The path loss follows the 3GPP mMTC evaluation model:

$$PL\;[\mathrm{dB}]=120.9+37.6\log_{10}(d)+L-G,\qquad d\text{ in km}$$

where $L$ is the indoor penetration loss of 20 dB and $G$ is the antenna gain of -4 dB.
According to 3GPP TR 37.910, the traffic model is 1 message per device every 2 hours, with a message size of 32 bytes. The MCS index of each device is 8, i.e. each message occupies at least 2 resource units. Accordingly, 12 tones are allocated 4 slots, 6 tones 8 slots, 3 tones 16 slots, and a single tone 32 slots. Without loss of generality, general-level coverage enhancement is considered, with an NPUSCH repetition number of 1. The SINR threshold is 1.4 dB for multi-tone and 4.3 dB for single-tone transmission. For the DRL model, a step is taken to be 32 time slots, and the message arrival rate at the BS is λ ∈ [1, 12] packets/step. According to the Poisson distribution, when the maximum message arrival rate is 12, the maximum number of messages is assumed to be 24. Since packets arriving at different steps are independent, only the instantaneous reward is considered, so the discount factor γ is 0.
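The tone-to-slot allocation just described (12 tones → 4 slots, 6 → 8, 3 → 16, single tone → 32) can be captured in a small lookup; this sketch and the helper name are illustrative only:

```python
# Slots occupied by one resource unit for each supported tone count
# (values taken from the embodiment above).
SLOTS_PER_RU = {12: 4, 6: 8, 3: 16, 1: 32}

def slots_needed(tones, resource_units):
    """Total slots a device occupies: number of resource units (at least 2
    per message at MCS index 8) times the slot length of one resource unit."""
    return resource_units * SLOTS_PER_RU[tones]

print(slots_needed(12, 2), slots_needed(1, 2))
```

This makes the time/frequency trade-off explicit: narrower (fewer-tone) allocations occupy proportionally more slots of the frame.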
Fig. 2 shows how the average successful transmission probability varies with the mMTC traffic arrival rate when single-tone and multi-tone devices coexist. It can be seen that the NOMA-based MAPPO approach outperforms the OMA-based and NOMA-based random approaches. Fig. 3 shows the variation of the average successful transmission probability with the mMTC traffic arrival rate considering only the multi-tone case. Here the performance of all algorithms improves, but MAPPO still achieves the best performance, which verifies the superiority of the MAPPO algorithm.
Claims (3)
1. The NB-IoT wireless resource allocation method based on NOMA and multi-agent reinforcement learning is characterized by comprising the following steps:
step 1, establishing an optimization model, wherein the model is expressed as:

$$\max \;\; \bar{\tau} = \frac{1}{T}\sum_{t=1}^{T}\frac{1}{N_t}\sum_{n=1}^{N_t}\left(1-c_n^t\right)$$

constraint conditions:

$$\text{C1:}\;\; \sum_{m=1}^{M}\sum_{f=1}^{F}\alpha_n^t(m,f)\;\ge\;\mu_n^t\,\Omega_n^t,\qquad\forall n,t$$

$$\text{C2:}\;\; \alpha_n^t(m,f)\in\{0,1\},\quad m\in\{1,2,\dots,M\},\;f\in\{1,2,\dots,F\}$$

$$\text{C3:}\;\; p_n^t\in\{P_1,P_2,P_3\}$$

wherein $\bar{\tau}$ represents the average successful transmission ratio of the devices over a period of time; $T$ represents the number of frames; $N_t$ indicates that data packets from $N_t$ devices arrive at the base station in frame $t$; $c_n^t\in\{0,1\}$ is the outage indication, with $c_n^t=1$ indicating that device $n$ is in outage at frame $t$; the resource units and transmission power of the mMTC devices are allocated at the beginning of each frame; let $\alpha_n^t(m,f)$ be the resource unit indicator, with $\alpha_n^t(m,f)=1$ indicating that resource element RE$(m,f)$ is allocated to device $n$ at frame $t$; the total number of resource units in outage for device $n$ is $C_n^t=\sum_{m,f}\alpha_n^t(m,f)\,e_n^t(m,f)$, where $e_n^t(m,f)$ is the resource unit outage indication, with $e_n^t(m,f)=1$ representing that the SINR on RE$(m,f)$ in frame $t$ is below the threshold; the instantaneous signal-to-interference-plus-noise ratio (SINR) of device $n$ on resource element RE$(m,f)$ is defined as:

$$\gamma_n^t(m,f)=\frac{p_n^t\,\beta_n^t\,|g_n^t|^2}{\sigma^2+\sum_{i=1,\,i\ne n}^{I}p_i^t\,|h_i^t|^2}$$

where $p_n^t$ is the transmission power of mMTC device $n$ at frame $t$ and $\sigma^2$ is the noise power; $h_n^t$ denotes the fading coefficient of device $n$ at frame $t$, whose power fading coefficient is $|h_n^t|^2=\beta_n^t\,|g_n^t|^2$, where $\beta_n^t$ is the large-scale fading coefficient of device $n$ at frame $t$ and $g_n^t$ is the small-scale fading coefficient; $I$ is the total number of devices superimposed on the same resource block, $p_i^t$ represents the transmit power of the $i$-th such device, and $h_i^t$ the channel coefficient of the $i$-th device; to ensure that user $n$ decodes successfully, the SINR $\gamma_n^t(m,f)$ of user $n$'s packet must reach the SINR threshold $\gamma_{th}$;

C1 ensures that device $n$ is allocated sufficient resources at frame $t$, where $\mu_n^t=\lceil D_n^t/D_{TBS}\rceil$ represents the number of transport blocks required by device $n$ at frame $t$, $D_n^t$ is the packet size of device $n$ in frame $t$, and $D_{TBS}$ is the size of the selected transport block; $\Omega_n^t$ represents the number of resource elements required to transmit $D_{TBS}$ bits of information, i.e. the number of resource elements occupied by one resource unit of the data packet of device $n$ in frame $t$;

C2 is the time-frequency resource allocation constraint: one frame is divided into $M$ slots in the time domain and the total bandwidth is divided into $F$ orthogonal parts in the frequency domain, so that $m\in\{1,2,\dots,M\}$ and $f\in\{1,2,\dots,F\}$;

C3 is the power allocation constraint: each mMTC device selects among 3 power levels;
step 2, modeling the optimization model in the step 1 into a Markov decision process;
step 3, solving the Markov decision process of step 2 with the MAPPO algorithm with action masking.
2. The NOMA and MAPPO based NB-IoT radio resource allocation method according to claim 1, wherein modeling the optimization problem using a Markov decision process in step 2 comprises the following steps:
step 2.1 definition of the State
MAPPO uses centralized training with distributed execution; during execution, each mMTC device can only observe its own state, and $o_n^t$ is used to represent the observed state of mMTC device $n$ at frame $t$; during training, the base station can obtain the observed states of all mMTC devices, and $s_t$ is used to represent the observed state of the base station at frame $t$, so that $s_t = \{o_1^t, o_2^t, \dots, o_{N_t}^t\}$;
Step 2.2 definition of action
the mMTC device selects its transmit power and the resource units it occupies for transmission according to the state fed back by the environment, so the action of device $n$ at frame $t$ is $a_n^t = \{p_n^t, \boldsymbol{\alpha}_n^t\}$, where $\boldsymbol{\alpha}_n^t$ represents the resource-unit allocation of the device in the frame;
step 2.3 definition of rewards
upon performing an action $a_n^t$, the environment immediately feeds back a reward; to optimize the network objective, i.e. minimize the average outage rate of all devices over a period of $T$ frames, the reward is defined as:

$$r_t = -\frac{1}{N_t}\sum_{n=1}^{N_t} c_n^t$$
during training, all agents share rewards.
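The shared frame reward can be sketched as follows (a minimal sketch; the negative-mean-outage form is an assumption consistent with minimizing the average outage rate, and the function name is hypothetical):

```python
def shared_reward(outage_indicators):
    """Reward at frame t: negative mean of the outage indications c_n^t over the
    N_t active devices. All agents receive this same scalar (shared reward)."""
    n = len(outage_indicators)
    return -sum(outage_indicators) / n if n else 0.0
```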
3. The NOMA and MAPPO based NB-IoT radio resource allocation method according to claim 2, wherein solving the problem using the MAPPO algorithm with action masking in step 3 comprises the following steps:
step 3.1, MAPPO consists of an Actor network and a Critic network; let $r_\theta = \frac{\pi_\theta(a_n^t \mid o_n^t)}{\pi_{\theta_{\mathrm{old}}}(a_n^t \mid o_n^t)}$ represent the ratio between the two Actor networks, where $\pi_\theta$ is the Actor network after the parameter update, which maps the observed state $o_n^t$ to a specific action $a_n^t$, and $\pi_{\theta_{\mathrm{old}}}$ is the Actor network before the parameter update; the Actor network uses the Adam stochastic gradient ascent algorithm to optimize the following function:

$$L(\theta) = \frac{1}{BN}\sum_{i=1}^{B}\sum_{n=1}^{N} \min\!\left(r_{\theta,i}^{(n)} A_i^{(n)},\ \mathrm{clip}\!\left(r_{\theta,i}^{(n)},\, 1-\epsilon,\, 1+\epsilon\right) A_i^{(n)}\right)$$
where $L(\theta)$ is the objective function, $A_i^{(n)}$ is the advantage function, $B$ is the batch size, and $N$ is the number of agents; $\mathrm{clip}(\cdot)$ limits $r_\theta$ to between $1-\epsilon$ and $1+\epsilon$, where $\epsilon$ is a hyperparameter in $[0,1]$;
an entropy regularization term is added to the original optimization function, which is corrected to:

$$L'(\theta) = L(\theta) + \nu\, S\!\left[\pi_\theta\right]$$
where $S$ is the policy entropy and $\nu$ is the entropy-coefficient hyperparameter;
the Critic network then uses a gradient descent algorithm to minimize the following loss function:

$$L(\phi) = \frac{1}{BN}\sum_{i=1}^{B}\sum_{n=1}^{N} \max\!\left[\left(V_\phi(s_i^{(n)}) - \hat{R}_i^{(n)}\right)^2,\ \left(\mathrm{clip}\!\left(V_\phi(s_i^{(n)}),\, V_{\phi_{\mathrm{old}}}(s_i^{(n)}) - \varepsilon,\, V_{\phi_{\mathrm{old}}}(s_i^{(n)}) + \varepsilon\right) - \hat{R}_i^{(n)}\right)^2\right]$$
where $L(\phi)$ is the loss function of the Critic network, $V_{\phi_{\mathrm{old}}}$ is the state value function of the Critic network before the parameter update, $V_\phi$ is the state value function of the Critic network after the parameter update, and $\hat{R}$ is the discounted return;
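The clipped Actor objective with entropy bonus and the value-clipped Critic loss can be sketched per minibatch as follows (a plain-Python illustration of the two losses; the function names and the default $\epsilon$ and $\nu$ values are assumptions):

```python
def ppo_actor_objective(ratios, advantages, entropy, eps=0.2, nu=0.01):
    """Clipped PPO surrogate plus entropy bonus, maximized by gradient ascent.
    ratios: pi_theta(a|o) / pi_theta_old(a|o) per sample; advantages: A per sample."""
    total = 0.0
    for r, a in zip(ratios, advantages):
        clipped = min(max(r, 1.0 - eps), 1.0 + eps)   # clip(r, 1-eps, 1+eps)
        total += min(r * a, clipped * a)              # pessimistic (clipped) bound
    return total / len(ratios) + nu * entropy

def critic_value_loss(v_new, v_old, returns, eps=0.2):
    """Value loss with value clipping, minimized by gradient descent:
    max((V_phi - R)^2, (clip(V_phi, V_old - eps, V_old + eps) - R)^2)."""
    total = 0.0
    for vn, vo, ret in zip(v_new, v_old, returns):
        v_clipped = min(max(vn, vo - eps), vo + eps)
        total += max((vn - ret) ** 2, (v_clipped - ret) ** 2)
    return total / len(returns)
```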
step 3.2, action mask
let $l(s)_i$ be the logit generated for the $i$-th action in state $s$; invalid actions are masked by modifying the logits of the different actions as follows:

$$l(s)_i' = \begin{cases} l(s)_i, & \text{if action } i \text{ is valid in state } s \\ -\mathrm{inf}, & \text{otherwise} \end{cases}$$
where $-\mathrm{inf}$ is a very large negative number that ensures the probability of selecting an invalid action is zero; action masking is employed both during training and during execution.
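The logit-masking step can be sketched as follows (a minimal sketch; the sentinel value, function names, and sample logits are assumptions, and the softmax is included only to show the masked probabilities):

```python
import math

def mask_logits(logits, valid):
    """Replace the logits of invalid actions with a very large negative number,
    so the subsequent softmax assigns them (numerically) zero probability."""
    NEG_INF = -1e9     # stands in for -inf in the masking rule above
    return [l if ok else NEG_INF for l, ok in zip(logits, valid)]

def softmax(xs):
    m = max(xs)                                  # shift for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# Action 1 is invalid in this state; its probability collapses to zero.
probs = softmax(mask_logits([2.0, 1.0, 0.5], [True, False, True]))
```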
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310427926.8A CN116347635A (en) | 2023-04-20 | 2023-04-20 | NB-IoT wireless resource allocation method based on NOMA and multi-agent reinforcement learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116347635A true CN116347635A (en) | 2023-06-27 |
Family
ID=86887866
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117234216A (en) * | 2023-11-10 | 2023-12-15 | 武汉大学 | Robot deep reinforcement learning motion planning method and computer readable medium |
CN117376355A (en) * | 2023-10-31 | 2024-01-09 | 重庆理工大学 | B5G mass Internet of things resource allocation method and system based on hypergraph |
CN118233312A (en) * | 2024-03-20 | 2024-06-21 | 同济大学 | Adaptive broadband resource allocation method combining deep reinforcement learning and converter |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||