CN116347635A - NB-IoT wireless resource allocation method based on NOMA and multi-agent reinforcement learning - Google Patents


Info

Publication number
CN116347635A
CN116347635A (application CN202310427926.8A)
Authority
CN
China
Prior art keywords
frame
representing
resource
action
mappo
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310427926.8A
Other languages
Chinese (zh)
Inventor
任荣
王捷
余景明
朱向宇
罗鑫鹏
赖秋宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University
Priority to CN202310427926.8A
Publication of CN116347635A
Legal status: Pending

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04W: WIRELESS COMMUNICATION NETWORKS
    • H04W 72/00: Local resource management
    • H04W 72/50: Allocation or scheduling criteria for wireless resources
    • H04W 72/54: Allocation or scheduling criteria for wireless resources based on quality criteria
    • H04W 72/543: Allocation or scheduling criteria for wireless resources based on quality criteria based on requested quality, e.g. QoS
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/092: Reinforcement learning
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04W: WIRELESS COMMUNICATION NETWORKS
    • H04W 72/00: Local resource management
    • H04W 72/04: Wireless resource allocation
    • H04W 72/044: Wireless resource allocation based on the type of the allocated resource
    • H04W 72/0453: Resources in frequency domain, e.g. a carrier in FDMA
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00: Reducing energy consumption in communication networks
    • Y02D 30/70: Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention discloses an NB-IoT wireless resource allocation method based on NOMA and multi-agent reinforcement learning, which aims to solve the connection density maximization problem in a multi-user NB-IoT scenario built on NOMA technology. A realistic scenario is considered in which different users have different QoS requirements and use different tone types. Unlike traditional heuristic algorithms, the invention models the joint optimization of power allocation, resource block allocation and NOMA user pairing as a Markov decision process and solves it with MAPPO, a state-of-the-art multi-agent reinforcement learning algorithm. Because many actions are invalid, the method masks them while the neural network computes the probability distribution over actions, which further accelerates convergence. System-level simulations with Python 3.8.15 on the VS Code platform show that the MAPPO algorithm outperforms the baseline algorithms.

Description

NB-IoT wireless resource allocation method based on NOMA and multi-agent reinforcement learning
Technical Field
The invention belongs to the field of wireless communication technology and artificial intelligence, and particularly relates to an NB-IoT wireless resource allocation method based on NOMA and multi-agent reinforcement learning.
Background
With the advent of various types of smart devices, massive machine-type communication (mMTC) networks will become one of the most important communication networks in the 5G and B5G Internet of Things (IoT). According to IoT industry studies, more than 64 billion IoT devices will exist by 2025, 30 billion of which will be active. To support such large-scale IoT scenarios, the Third Generation Partnership Project (3GPP) has standardized several cellular IoT technologies, including narrowband IoT (NB-IoT). NB-IoT is a Low Power Wide Area (LPWA) cellular network technology that supports single-tone and multi-tone modes and typically occupies 180 kHz of system bandwidth. The goal of 5G mMTC is to support a huge connection density of one million devices per square kilometer with limited radio resources (the goal of 6G mMTC is ten million devices per square kilometer). Therefore, in order to meet the performance requirements of 5G and future 6G on mMTC, maximizing the connection density of mMTC devices under limited resources is an important research topic in the industry.
Recently, non-orthogonal multiple access (NOMA) technology has received widespread attention. NOMA supports non-orthogonal resource sharing among different users at the expense of receiver complexity. In addition, deep reinforcement learning (DRL) has made great progress recently, and the use of DRL in wireless communication systems has also attracted great attention. Reinforcement learning requires no prior knowledge of an environment model; an optimal strategy is trained through continuous interaction between the agent and the environment. Once training is complete, the agent can make decisions in real time, which is a great advantage over traditional optimization algorithms. Multi-agent DRL is a generalization of single-agent DRL that enables a group of agents to learn optimal strategies through interaction with the environment and with each other.
Currently, mMTC resource allocation mainly adopts orthogonal multiple access (OMA) technology, which has low spectral efficiency compared with non-orthogonal multiple access (NOMA) and cannot effectively support massive connectivity. In addition, optimization problems in resource allocation are traditionally solved with model-based mathematical optimization. However, due to the complexity of 5G and future 6G systems, optimizing the configurable parameters of the system is extremely complex, and adapting the configuration parameters in a dynamically changing environment is also very challenging.
Disclosure of Invention
The invention aims to solve the problem of maximizing the connection density of NB-IoT. An NB-IoT radio resource allocation algorithm based on NOMA and multi-agent DRL with centralized training and decentralized execution is proposed, which can solve the problems mentioned in the background art.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a wireless resource allocation scheme based on NOMA and multi-agent DRL models the resource allocation problem in a multi-user NB-IoT network as a Markov decision process and adopts MAPPO algorithm to solve the problem, comprising the following steps:
step 1, establishing an optimization model; the model maximizes the long-term average successful transmission ratio of the devices,

$$\max\ \bar{O}=\frac{1}{T}\sum_{t=1}^{T}\frac{1}{N_t}\sum_{n=1}^{N_t}\bigl(1-O_n^t\bigr)$$

subject to the constraints C1 (resource sufficiency), C2 (time-frequency resource allocation) and C3 (power allocation) described below.

Here $\bar{O}$ represents the average successful transmission ratio of the devices over a period of time; $T$ represents the number of frames; $N_t$ indicates that the data packets of $N_t$ devices arrive at the base station in frame $t$; $O_n^t$ is the outage indicator, with $O_n^t=1$ indicating that device $n$ is in outage at frame $t$. The resource units and the transmit power of the mMTC devices are allocated at the beginning of each frame. Let $x_{n,(m,f)}^t$ be the resource unit indicator, with $x_{n,(m,f)}^t=1$ indicating that resource element RE$(m,f)$ is allocated to device $n$ at frame $t$. The total number of resource units on which device $n$ is in outage is $\sum_{m=1}^{M}\sum_{f=1}^{F}x_{n,(m,f)}^t\,\omega_{n,(m,f)}^t$, where $\omega_{n,(m,f)}^t$ is the resource unit outage indicator and $\omega_{n,(m,f)}^t=1$ represents that the SINR on RE$(m,f)$ at frame $t$ is below the threshold. The instantaneous signal-to-interference-plus-noise ratio (SINR) of device $n$ on resource element RE$(m,f)$ is defined as

$$\mathrm{SINR}_{n,(m,f)}^{t}=\frac{p_n^t\,g_n^t}{\sum_{i=1,\,i\neq n}^{\Phi}p_i^t\,g_i^t+\sigma^2}$$

where $p_n^t$ is the transmit power of mMTC device $n$ at frame $t$ and $\sigma^2$ is the noise power; $h_n^t$ denotes the fading coefficient of device $n$ at frame $t$, and $g_n^t$ denotes the power fading coefficient of device $n$ at frame $t$, composed of the large-scale fading coefficient and the small-scale fading coefficient of device $n$ at frame $t$; $\Phi$ is the total number of devices superimposed on the same resource block, $p_i^t$ represents the transmit power of the $i$-th such device, and $g_i^t$ represents the channel coefficient of the $i$-th device. To ensure that user $n$ decodes successfully, the SINR of user $n$'s packet must reach the SINR threshold $\gamma_{\mathrm{th}}$.
C1 ensures that device $n$ is allocated sufficient resources at frame $t$: the resource units allocated to device $n$ must cover the $U_n^t$ transport blocks it requires, where $U_n^t$ is determined by the packet size $\psi_n^t$ of device $n$ in frame $t$ and the size $\mathrm{TBS}$ of the selected transport block, $N_{\mathrm{RU}}$ represents the number of resource units required to transmit $\mathrm{TBS}$ bits of information, and the number of resource elements occupied by one resource unit of device $n$ in frame $t$ depends on whether single-tone or multi-tone transmission is used;
C2 is the time-frequency resource allocation constraint: one frame is divided into $M$ slots in the time domain and the total bandwidth is divided into $F$ orthogonal parts in the frequency domain, so that $m\in\{1,2,\dots,M\}$ and $f\in\{1,2,\dots,F\}$;
C3 is the power allocation constraint: each mMTC device selects one of 3 transmit power levels;
step 2, modeling the optimization model in step 1 as a Markov decision process;
step 3, solving the Markov decision process in step 2 by using a MAPPO algorithm with an action mask.
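As a concrete illustration of the outage check behind the optimization model in step 1, the following Python sketch computes the SINR of the devices superimposed on one resource element and flags those whose SINR falls below the threshold. It is a minimal sketch under simplifying assumptions (all co-scheduled devices are treated as interference, i.e. no successive interference cancellation is modeled), and the function and variable names are illustrative rather than taken from the patent.

```python
import numpy as np

def outage_flags(p_tx, g_pow, sinr_threshold_db, noise_power):
    """Per-device outage check on one resource element RE(m, f).

    p_tx:  transmit powers (W) of the devices superimposed on this RE
    g_pow: power fading coefficients (large-scale * small-scale) of those devices
    Returns a boolean array, True where a device's SINR is below the threshold.
    """
    p_tx = np.asarray(p_tx, dtype=float)
    g_pow = np.asarray(g_pow, dtype=float)
    rx = p_tx * g_pow                               # received powers on this RE
    threshold = 10.0 ** (sinr_threshold_db / 10.0)  # dB -> linear
    sinr = rx / (rx.sum() - rx + noise_power)       # co-scheduled devices treated as interference
    return sinr < threshold

# Example: two NOMA-paired devices on the same RE, assumed noise floor of -114 dBm
noise_w = 10.0 ** ((-114.0 - 30.0) / 10.0)          # dBm -> W
print(outage_flags([0.2, 0.1], [1e-12, 4e-12], 1.4, noise_w))
```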
Further, modeling the optimization problem as a Markov decision process in step 2 includes the following steps:
step 2.1, definition of the state
MAPPO uses centralized training and distributed execution; during execution, each mMTC device can only observe its own state, and $o_n^t$ denotes the observed state of mMTC device $n$ at frame $t$. During training, the base station can obtain the observed states of all mMTC devices; $s_t$ denotes the observed state of the base station at frame $t$, so that $s_t=\{o_1^t,o_2^t,\dots,o_{N_t}^t\}$;
step 2.2, definition of the action
When an mMTC device observes the state fed back by the environment, it autonomously selects its transmit power and the resource units it occupies, so the action of device $n$ in frame $t$ is $a_n^t=\{p_n^t,\mathbf{x}_n^t\}$, where $\mathbf{x}_n^t$ represents the resource unit allocation of the device in that frame;
step 2.3, definition of the reward
After an action $a_n^t$ is performed, the environment immediately feeds back a reward; to optimize the network objective, i.e. to minimize the average outage rate of all devices over a period of $T$ frames, the reward is defined as

$$r_t=-\frac{1}{N_t}\sum_{n=1}^{N_t}O_n^t$$

During training, all agents share the reward.
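The following minimal Python sketch illustrates how the shared reward of step 2.3 can be computed from the per-device outage indicators, together with a placeholder local observation. The observation fields are assumptions for illustration only, since the patent does not enumerate the exact contents of the observed state.

```python
import numpy as np

def local_observation(packet_size_bits, tone_type, slots_m, subcarriers_f):
    """Illustrative local observation o_n^t of one mMTC device at the start of a frame."""
    return np.array([packet_size_bits, tone_type, slots_m, subcarriers_f], dtype=np.float32)

def shared_reward(outage_indicators):
    """Shared team reward r_t: negative average outage rate of the N_t devices in this frame."""
    outage = np.asarray(outage_indicators, dtype=float)
    return -outage.mean() if outage.size else 0.0

print(shared_reward([0, 1, 0, 0]))  # -0.25: one of four devices is in outage
```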
Further, solving the problem in step 3 using the MAPPO algorithm with an action mask includes the following steps:
step 3.1, MAPPO consists of an Actor network and a Critic network; let

$$r_{\theta,i}^{(k)}=\frac{\pi_\theta\bigl(a_i^{(k)}\mid o_i^{(k)}\bigr)}{\pi_{\theta_{\mathrm{old}}}\bigl(a_i^{(k)}\mid o_i^{(k)}\bigr)}$$

represent the gap between the two Actor networks, where $\pi_\theta$ is the Actor network after the parameter update, which maps the observed state $o_i^{(k)}$ to a specific action $a_i^{(k)}$, and $\pi_{\theta_{\mathrm{old}}}$ is the Actor network before the parameter update; the Actor network uses the Adam stochastic gradient ascent algorithm to optimize the following function:

$$L(\theta)=\frac{1}{BN}\sum_{i=1}^{B}\sum_{k=1}^{N}\min\Bigl(r_{\theta,i}^{(k)}A_i^{(k)},\ \mathrm{clip}\bigl(r_{\theta,i}^{(k)},1-\epsilon,1+\epsilon\bigr)A_i^{(k)}\Bigr)$$

where $L(\theta)$ is the objective function, $A_i^{(k)}$ is the advantage function, $B$ is the batch size and $N$ is the number of agents; the clip operation limits $r_{\theta,i}^{(k)}$ to between $1-\epsilon$ and $1+\epsilon$, with $\epsilon\in[0,1]$ a hyperparameter;
in order to enhance the exploration capability of the neural network, a penalty term is added to the original objective so that the entropy of the action distribution stays large and the neural network does not keep outputting a few fixed actions; the objective is modified as:

$$L(\theta)=\frac{1}{BN}\sum_{i=1}^{B}\sum_{k=1}^{N}\Bigl[\min\Bigl(r_{\theta,i}^{(k)}A_i^{(k)},\ \mathrm{clip}\bigl(r_{\theta,i}^{(k)},1-\epsilon,1+\epsilon\bigr)A_i^{(k)}\Bigr)+\nu\,S\bigl[\pi_\theta\bigl(o_i^{(k)}\bigr)\bigr]\Bigr]$$

where $S$ is the policy entropy and $\nu$ is the entropy coefficient hyperparameter;
the Critic network then uses a gradient descent algorithm to minimize the following loss function:

$$L(\phi)=\frac{1}{BN}\sum_{i=1}^{B}\sum_{k=1}^{N}\max\Bigl[\bigl(V_\phi(s_i^{(k)})-\hat{R}_i\bigr)^2,\ \bigl(\mathrm{clip}\bigl(V_\phi(s_i^{(k)}),\,V_{\phi_{\mathrm{old}}}(s_i^{(k)})-\varepsilon,\,V_{\phi_{\mathrm{old}}}(s_i^{(k)})+\varepsilon\bigr)-\hat{R}_i\bigr)^2\Bigr]$$

where $L(\phi)$ is the loss function of the Critic network, $V_{\phi_{\mathrm{old}}}$ is the state value function of the Critic network before the parameter update, $V_\phi$ is the state value function of the Critic network after the parameter update, and $\hat{R}_i$ is the discounted reward;
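A minimal PyTorch sketch of the clipped Actor objective with entropy bonus and the clipped Critic loss described in step 3.1 is given below. It assumes that log-probabilities, advantages, value estimates and discounted returns have already been collected into tensors; the clipped value loss follows a common MAPPO implementation choice and is an assumption where the patent's exact form is not reproduced.

```python
import torch

def mappo_losses(logp_new, logp_old, advantages, values_new, values_old, returns,
                 entropy=None, clip_eps=0.2, entropy_coef=0.01):
    """Clipped Actor objective (returned as a loss to minimize) and clipped Critic loss.

    logp_new/logp_old: log-probabilities of the taken actions under the new/old Actor;
    values_new/values_old: Critic estimates after/before the update; returns: discounted rewards.
    """
    ratio = torch.exp(logp_new - logp_old)                         # pi_theta / pi_theta_old
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    actor_obj = torch.min(surr1, surr2).mean()
    if entropy is not None:                                        # entropy bonus keeps exploration alive
        actor_obj = actor_obj + entropy_coef * entropy.mean()

    v_clipped = values_old + torch.clamp(values_new - values_old, -clip_eps, clip_eps)
    critic_loss = torch.max((values_new - returns) ** 2,
                            (v_clipped - returns) ** 2).mean()
    return -actor_obj, critic_loss                                 # negate: optimizers minimize
```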
step 3.2, action mask
The policy network selects actions according to the softmax over the action logits. However, not every action is valid. Since the input dimension of the neural network is fixed, assume the action space of each device has size $\kappa$; the joint action space of single-agent DRL (SADRL) then has size $\kappa^D$, where $D$ is the maximum number of devices served per step. If only $K$ devices arrive in the current step, the valid action space has size $\kappa^K$ and there are $\kappa^D-\kappa^K$ invalid actions. Because of the randomness of the packet arrivals in each frame, there are many invalid actions and they change dynamically. A conventional neural network model computes the probability of all actions (including invalid ones), which slows convergence, so action masking is proposed to alleviate this problem. Specifically, let $l(s)_i$ be the logit of the $i$-th action generated in state $s$; invalid actions are masked by modifying the logits of the different actions as follows:

$$l'(s)_i=\begin{cases}l(s)_i, & \text{if action } i \text{ is valid}\\ -\mathrm{inf}, & \text{otherwise}\end{cases}$$

where $-\mathrm{inf}$ is a large negative number that ensures the probability of selecting an invalid action is zero; action masking is employed in both training and execution.
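A short PyTorch sketch of the action masking in step 3.2: logits of invalid actions are replaced with a very large negative value before the softmax, so those actions receive (numerically) zero probability. The tensor shapes and names are illustrative.

```python
import torch

def mask_logits(logits, valid_mask):
    """Replace logits of invalid actions with a very large negative number so that
    the softmax assigns them (numerically) zero probability."""
    neg_inf = torch.finfo(logits.dtype).min
    return torch.where(valid_mask, logits, torch.full_like(logits, neg_inf))

logits = torch.tensor([1.2, 0.3, -0.5, 2.0])
valid = torch.tensor([True, False, True, False])
print(torch.softmax(mask_logits(logits, valid), dim=-1))  # mass only on the two valid actions
```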
Compared with the prior art, the invention has the beneficial effects that:
(1) The present invention contemplates using NOMA technology in NB-IoT networks, which greatly improves the spectral efficiency and connection density of the system by serving multiple users on one resource block.
(2) The invention models the resource allocation problem in a multi-user NB-IoT network containing both single-tone and multi-tone devices as a Markov decision process and solves it with deep reinforcement learning, thereby avoiding the complexity of traditional mathematical optimization algorithms. Compared with a traditional single agent, the invention uses multi-agent reinforcement learning: during centralized training, each agent can cooperate with the other agents to obtain better results; during distributed execution, each agent acts according to its own observed state, which reduces the complexity of the system.
(3) The multi-agent reinforcement learning provided by the invention is scalable and can be applied to larger-scale and more complex environments. In multi-agent reinforcement learning, each agent only needs to attend to local information, and efficient parallel computation can be achieved.
Drawings
The invention is further illustrated by the following examples in conjunction with the accompanying drawings:
FIG. 1 is a general flow chart of an embodiment of the present invention;
fig. 2 is a graph showing how the average successful transmission probability varies with the NB-IoT packet arrival rate in the single-tone and multi-tone coexistence scenario according to the present invention;
fig. 3 is a graph showing how the average successful transmission probability varies with the NB-IoT packet arrival rate in the multi-tone-only scenario.
Detailed Description
The invention provides a multi-agent wireless resource allocation method. Against the background of a multi-user NB-IoT system based on NOMA technology, the method takes maximizing the long-term average successful transmission probability of the system as the objective function, takes the device transmit power and the number of allocated resource units as constraints, and studies the joint optimization of resource block allocation, power allocation and user pairing. To solve the corresponding complex combinatorial optimization problem, the invention models the original problem as a Markov decision process and introduces the advanced MAPPO algorithm to solve it.
For a better illustration of the method according to the invention, a more detailed example is given. Consider an NB-IoT uplink transmission system based on non-orthogonal multiple access with $N$ devices, and model the packet arrival process of device $n$ as an independent renewal process with arrival intensity $\lambda_n$. According to the Palm-Khintchine theorem, the superposition of the renewal processes of the massive IoT devices at the BS can be approximated by a Poisson process with intensity $\lambda=\sum_{n=1}^{N}\lambda_n$. The distance between an mMTC device and the BS is denoted $d$, and the coverage radius of the BS is $R_S$. Let $d/R_S$ obey the Beta$(a,b)$ distribution shown below:

$$F\!\left(\frac{d}{R_S}\le x\right)=\int_0^x\frac{u^{a-1}(1-u)^{b-1}}{B(a,b)}\,du$$

where $a$ and $b$ are adjustable parameters, so that the invention can simulate different distributions of IoT devices in different scenarios, and $u$ is the integration variable. Further, assume that the base station serves at most $D$ devices per step.
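To make the device placement concrete, the following Python sketch samples device-to-BS distances with the normalized distance d/R_S drawn from a Beta(a, b) distribution; the cell radius value is illustrative, and the default a = 2, b = 4 matches the Beta(2, 4) setting used later in the simulation.

```python
import numpy as np

rng = np.random.default_rng(0)

def drop_devices(num_devices, cell_radius_m=1000.0, a=2.0, b=4.0):
    """Sample device-to-BS distances with d / R_S ~ Beta(a, b)."""
    u = rng.beta(a, b, size=num_devices)  # normalized distance d / R_S in [0, 1]
    return u * cell_radius_m              # distances in metres

print(drop_devices(5))                    # five devices dropped in one cell
```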
Step 1: establishing an optimization model; the model maximizes the long-term average successful transmission ratio of the devices,

$$\max\ \bar{O}=\frac{1}{T}\sum_{t=1}^{T}\frac{1}{N_t}\sum_{n=1}^{N_t}\bigl(1-O_n^t\bigr)$$

subject to the constraints C1 (resource sufficiency), C2 (time-frequency resource allocation) and C3 (power allocation) described below.

Here $\bar{O}$ represents the average successful transmission ratio of the devices over a period of time; $T$ represents the number of frames; $N_t$ indicates that the data packets of $N_t$ devices arrive at the base station in frame $t$; $O_n^t$ is the outage indicator, with $O_n^t=1$ indicating that device $n$ is in outage at frame $t$. The resource units and the transmit power of the mMTC devices are allocated at the beginning of each frame. Let $x_{n,(m,f)}^t$ be the resource unit indicator, with $x_{n,(m,f)}^t=1$ indicating that resource element RE$(m,f)$ is allocated to device $n$ at frame $t$. The total number of resource units on which device $n$ is in outage is $\sum_{m=1}^{M}\sum_{f=1}^{F}x_{n,(m,f)}^t\,\omega_{n,(m,f)}^t$, where $\omega_{n,(m,f)}^t$ is the resource unit outage indicator and $\omega_{n,(m,f)}^t=1$ represents that the SINR on RE$(m,f)$ at frame $t$ is below the threshold. The instantaneous SINR of device $n$ on resource element RE$(m,f)$ is defined as

$$\mathrm{SINR}_{n,(m,f)}^{t}=\frac{p_n^t\,g_n^t}{\sum_{i=1,\,i\neq n}^{\Phi}p_i^t\,g_i^t+\sigma^2}$$

where $p_n^t$ is the transmit power of mMTC device $n$ at frame $t$ and $\sigma^2$ is the noise power; $h_n^t$ denotes the fading coefficient of device $n$ at frame $t$, and $g_n^t$ denotes the power fading coefficient of device $n$ at frame $t$, composed of the large-scale fading coefficient and the small-scale fading coefficient of device $n$ at frame $t$; $\Phi$ is the total number of devices superimposed on the same resource block, $p_i^t$ represents the transmit power of the $i$-th such device, and $g_i^t$ represents the channel coefficient of the $i$-th device. To ensure that user $n$ decodes successfully, the SINR of user $n$'s packet must reach the SINR threshold $\gamma_{\mathrm{th}}$. If any resource unit allocated to device $n$ violates the SINR requirement, device $n$ is in outage; thus $O_n^t=1$ if $\sum_{m=1}^{M}\sum_{f=1}^{F}x_{n,(m,f)}^t\,\omega_{n,(m,f)}^t>0$.
C1 ensures that device $n$ is allocated sufficient resources in frame $t$: the resource units allocated to device $n$ must cover the $U_n^t$ transport blocks it requires, where $U_n^t$ is determined by the packet size $\psi_n^t$ of device $n$ in frame $t$ and the size $\mathrm{TBS}$ of the selected transport block, $N_{\mathrm{RU}}$ represents the number of resource units required to transmit $\mathrm{TBS}$ bits of information, and the number of resource elements occupied by one resource unit of device $n$ in frame $t$ depends on whether single-tone or multi-tone transmission is used.
C2 is the time-frequency resource allocation constraint: one frame is divided into $M$ slots in the time domain and the total bandwidth is divided into $F$ orthogonal parts in the frequency domain, so that $m\in\{1,2,\dots,M\}$ and $f\in\{1,2,\dots,F\}$; C3 is the power allocation constraint, and each mMTC device selects one of 3 transmit power levels.
and 2, modeling the optimization problem by using a Markov decision process.
2.1 definition of the State
MAPPO uses centralized training, distributed execution. When executing, each device can only observe its own state, using
Figure BDA0004189250260000074
To indicate the observed state of the mMTC device n in frame t, then +.>
Figure BDA0004189250260000075
While the base station can obtain the observed states of all mMTC devices during training, s is used t Representing the observed state of the base station at frame t, then
Figure BDA0004189250260000076
2.2 definition of actions
When the mMTC device observes the environment feedback state, the mMTC device automatically selects the transmitted power and occupied resource units, so that the action of the device n in the frame t is that
Figure BDA0004189250260000077
Wherein->
Figure BDA0004189250260000078
Indicating the allocation of the device in the resource units of the frame.
2.3 definition of rewards
In performing an action
Figure BDA0004189250260000079
The environment will immediately feed back a prize. To optimize the network objective, i.e. minimize the average outage rate for all devices over a period of T frames, the rewards are defined as:
Figure BDA00041892502600000710
during training, all agents share rewards.
Step 3: solving the problem with a reinforcement learning algorithm based on MAPPO.
3.1 MAPPO
MAPPO is a multi-agent PPO that is trained centrally and executed in a distributed manner. MAPPO consists of an Actor network and a Critic network. Let

$$r_{\theta,i}^{(k)}=\frac{\pi_\theta\bigl(a_i^{(k)}\mid o_i^{(k)}\bigr)}{\pi_{\theta_{\mathrm{old}}}\bigl(a_i^{(k)}\mid o_i^{(k)}\bigr)}$$

represent the gap between the two Actor networks, where $\pi_\theta$ is the Actor network after the parameter update, which maps the observed state $o_i^{(k)}$ to a specific action $a_i^{(k)}$, and $\pi_{\theta_{\mathrm{old}}}$ is the Actor network before the parameter update. The Actor network uses the Adam stochastic gradient ascent algorithm to optimize the following function:

$$L(\theta)=\frac{1}{BN}\sum_{i=1}^{B}\sum_{k=1}^{N}\min\Bigl(r_{\theta,i}^{(k)}A_i^{(k)},\ \mathrm{clip}\bigl(r_{\theta,i}^{(k)},1-\epsilon,1+\epsilon\bigr)A_i^{(k)}\Bigr)$$

where $L(\theta)$ is the objective function, $A_i^{(k)}$ is the advantage function, $B$ is the batch size and $N$ is the number of agents; the clip operation limits $r_{\theta,i}^{(k)}$ to between $1-\epsilon$ and $1+\epsilon$, with $\epsilon\in[0,1]$ a hyperparameter.
In order to enhance the exploration capability of the neural network, the invention adds a penalty term to the original objective, which keeps the entropy of the action distribution large so that the neural network does not keep outputting a few fixed actions; the objective is modified as:

$$L(\theta)=\frac{1}{BN}\sum_{i=1}^{B}\sum_{k=1}^{N}\Bigl[\min\Bigl(r_{\theta,i}^{(k)}A_i^{(k)},\ \mathrm{clip}\bigl(r_{\theta,i}^{(k)},1-\epsilon,1+\epsilon\bigr)A_i^{(k)}\Bigr)+\nu\,S\bigl[\pi_\theta\bigl(o_i^{(k)}\bigr)\bigr]\Bigr]$$

where $S$ is the policy entropy and $\nu$ is the entropy coefficient hyperparameter.
The Critic network then uses a gradient descent algorithm to minimize the following loss function:

$$L(\phi)=\frac{1}{BN}\sum_{i=1}^{B}\sum_{k=1}^{N}\max\Bigl[\bigl(V_\phi(s_i^{(k)})-\hat{R}_i\bigr)^2,\ \bigl(\mathrm{clip}\bigl(V_\phi(s_i^{(k)}),\,V_{\phi_{\mathrm{old}}}(s_i^{(k)})-\varepsilon,\,V_{\phi_{\mathrm{old}}}(s_i^{(k)})+\varepsilon\bigr)-\hat{R}_i\bigr)^2\Bigr]$$

where $L(\phi)$ is the loss function of the Critic network, $V_{\phi_{\mathrm{old}}}$ is the state value function of the Critic network before the parameter update, $V_\phi$ is the state value function of the Critic network after the parameter update, and $\hat{R}_i$ is the discounted reward.
3.2 Action mask
The policy network selects actions according to the softmax over the action logits. However, not every action is valid. Since the input dimension of the neural network is fixed, assume the action space of each device has size $\kappa$; the joint action space of single-agent DRL (SADRL) then has size $\kappa^D$, where $D$ is the maximum number of devices served per step. If only $K$ devices arrive in the current step, the valid action space has size $\kappa^K$ and there are $\kappa^D-\kappa^K$ invalid actions. Because of the randomness of the packet arrivals in each frame, there are many invalid actions and they change dynamically. A conventional neural network model computes the probability of all actions (including invalid ones), which slows convergence, so the invention proposes action masking to alleviate this problem. Specifically, let $l(s)_i$ be the logit of the $i$-th action generated in state $s$; invalid actions are masked by modifying the logits of the different actions as follows:

$$l'(s)_i=\begin{cases}l(s)_i, & \text{if action } i \text{ is valid}\\ -\mathrm{inf}, & \text{otherwise}\end{cases}$$

where $-\mathrm{inf}$ is a large negative number that ensures the probability of selecting an invalid action is zero; action masking is employed in both training and execution.
Step 4, simulation
System-level simulations were performed on the VS Code platform using Python 3.8.15. The carrier frequency of the BS is 900 MHz, the inter-site distance (ISD) is 1732 m, the device locations satisfy the Beta(2, 4) distribution, and the device transmit power has three options: 23 dBm, 20 dBm and 17 dBm. The path loss model includes an indoor penetration loss $L$ of 20 dB and an antenna gain $G$ of -4 dB.
According to 3GPP TR 37.910, the traffic model for each device is 1 message per device per 2 hours, and the message size is 32 bytes. The MCS index of each device is 8, i.e. each message occupies at least 2 resource units. Therefore, 12-tone transmission is allocated 4 slots, 6-tone transmission 8 slots, 3-tone transmission 16 slots, and single-tone transmission 32 slots. The normal coverage enhancement level is considered without loss of generality, and the NPUSCH repetition number is 1. The SINR threshold is 1.4 dB for multi-tone and 4.3 dB for single-tone transmission. For the DRL model, the interval between two steps is taken to be 32 time slots, and the message arrival rate at the BS is $\lambda\in[1,12]$ packets/step. According to the Poisson distribution, when the maximum message arrival rate is 12, the maximum number of messages is assumed to be 24. Since the packets arriving at different steps are independent, the invention considers only the immediate reward, so the discount factor $\gamma$ is 0.
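The per-step packet arrival process can be sketched as below: a Poisson draw with rate λ packets/step, capped at the assumed maximum of 24 messages; the names and structure are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def packets_per_step(arrival_rate, max_packets=24):
    """Packets reaching the BS in one DRL step: Poisson(arrival_rate), capped at 24 messages."""
    return int(min(rng.poisson(arrival_rate), max_packets))

print([packets_per_step(lam) for lam in (1, 6, 12)])  # arrival rates from the simulated range
```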
Fig. 2 shows how the average successful transmission probability varies with the mMTC traffic arrival rate when single-tone and multi-tone devices coexist. It can be seen that the NOMA-based MAPPO approach outperforms both the OMA-based approach and the NOMA-based random approach. Fig. 3 shows how the average successful transmission probability varies with the mMTC traffic arrival rate when only multi-tone devices are considered. In this case the performance of all algorithms improves, but MAPPO still achieves the best performance, which verifies the superiority of the MAPPO algorithm.

Claims (3)

1. The NB-IoT wireless resource allocation method based on NOMA and multi-agent reinforcement learning is characterized by comprising the following steps:
step 1, establishing an optimization model; the model maximizes the long-term average successful transmission ratio of the devices,

$$\max\ \bar{O}=\frac{1}{T}\sum_{t=1}^{T}\frac{1}{N_t}\sum_{n=1}^{N_t}\bigl(1-O_n^t\bigr)$$

subject to constraint conditions C1 (resource sufficiency), C2 (time-frequency resource allocation) and C3 (power allocation);
wherein $\bar{O}$ represents the average successful transmission ratio of the devices over a period of time; $T$ represents the number of frames; $N_t$ indicates that the data packets of $N_t$ devices arrive at the base station in frame $t$; $O_n^t$ is the outage indicator, with $O_n^t=1$ indicating that device $n$ is in outage at frame $t$; the resource units and the transmit power of the mMTC devices are allocated at the beginning of each frame; $x_{n,(m,f)}^t$ is the resource unit indicator, with $x_{n,(m,f)}^t=1$ indicating that resource element RE$(m,f)$ is allocated to device $n$ at frame $t$; the total number of resource units on which device $n$ is in outage is $\sum_{m=1}^{M}\sum_{f=1}^{F}x_{n,(m,f)}^t\,\omega_{n,(m,f)}^t$, where $\omega_{n,(m,f)}^t$ is the resource unit outage indicator and $\omega_{n,(m,f)}^t=1$ represents that the SINR on RE$(m,f)$ at frame $t$ is below the threshold; the instantaneous SINR of device $n$ on resource element RE$(m,f)$ is defined as

$$\mathrm{SINR}_{n,(m,f)}^{t}=\frac{p_n^t\,g_n^t}{\sum_{i=1,\,i\neq n}^{\Phi}p_i^t\,g_i^t+\sigma^2}$$

where $p_n^t$ is the transmit power of mMTC device $n$ at frame $t$ and $\sigma^2$ is the noise power; $h_n^t$ denotes the fading coefficient of device $n$ at frame $t$, and $g_n^t$ denotes the power fading coefficient of device $n$ at frame $t$, composed of the large-scale fading coefficient and the small-scale fading coefficient of device $n$ at frame $t$; $\Phi$ is the total number of devices superimposed on the same resource block, $p_i^t$ represents the transmit power of the $i$-th such device, and $g_i^t$ represents the channel coefficient of the $i$-th device; to ensure that user $n$ decodes successfully, the SINR of user $n$'s packet must reach the SINR threshold $\gamma_{\mathrm{th}}$;
C1 ensures that device $n$ is allocated sufficient resources at frame $t$: the resource units allocated to device $n$ must cover the $U_n^t$ transport blocks it requires, where $U_n^t$ is determined by the packet size $\psi_n^t$ of device $n$ in frame $t$ and the size $\mathrm{TBS}$ of the selected transport block, $N_{\mathrm{RU}}$ represents the number of resource units required to transmit $\mathrm{TBS}$ bits of information, and the number of resource elements occupied by one resource unit of device $n$ in frame $t$ depends on the tone type used;
C2 is the time-frequency resource allocation constraint: one frame is divided into $M$ slots in the time domain and the total bandwidth is divided into $F$ orthogonal parts in the frequency domain, so that $m\in\{1,2,\dots,M\}$ and $f\in\{1,2,\dots,F\}$;
C3 is the power allocation constraint, and each mMTC device selects one of 3 transmit power levels;
step 2, modeling the optimization model in step 1 as a Markov decision process;
step 3, solving the Markov decision process in step 2 by using a MAPPO algorithm with an action mask.
2. The NB-IoT radio resource allocation method based on NOMA and MAPPO according to claim 1, wherein modeling the optimization problem as a Markov decision process in step 2 comprises the following steps:
step 2.1, definition of the state
MAPPO uses centralized training and distributed execution; during execution, each mMTC device can only observe its own state, and $o_n^t$ denotes the observed state of mMTC device $n$ at frame $t$; during training, the base station can obtain the observed states of all mMTC devices, and $s_t$ denotes the observed state of the base station at frame $t$, so that $s_t=\{o_1^t,o_2^t,\dots,o_{N_t}^t\}$;
step 2.2, definition of the action
when an mMTC device observes the state fed back by the environment, it autonomously selects its transmit power and the resource units it occupies, so the action of device $n$ in frame $t$ is $a_n^t=\{p_n^t,\mathbf{x}_n^t\}$, where $\mathbf{x}_n^t$ represents the resource unit allocation of the device in that frame;
step 2.3, definition of the reward
after an action $a_n^t$ is performed, the environment immediately feeds back a reward; to optimize the network objective, i.e. to minimize the average outage rate of all devices over a period of $T$ frames, the reward is defined as

$$r_t=-\frac{1}{N_t}\sum_{n=1}^{N_t}O_n^t$$

during training, all agents share the reward.
3. The NB-IoT radio resource allocation method based on NOMA and MAPPO according to claim 2, wherein solving the problem using the MAPPO algorithm with an action mask in step 3 comprises the following steps:
step 3.1, MAPPO consists of an Actor network and a Critic network; let

$$r_{\theta,i}^{(k)}=\frac{\pi_\theta\bigl(a_i^{(k)}\mid o_i^{(k)}\bigr)}{\pi_{\theta_{\mathrm{old}}}\bigl(a_i^{(k)}\mid o_i^{(k)}\bigr)}$$

represent the gap between the two Actor networks, where $\pi_\theta$ is the Actor network after the parameter update, which maps the observed state $o_i^{(k)}$ to a specific action $a_i^{(k)}$, and $\pi_{\theta_{\mathrm{old}}}$ is the Actor network before the parameter update; the Actor network uses the Adam stochastic gradient ascent algorithm to optimize the following function:

$$L(\theta)=\frac{1}{BN}\sum_{i=1}^{B}\sum_{k=1}^{N}\min\Bigl(r_{\theta,i}^{(k)}A_i^{(k)},\ \mathrm{clip}\bigl(r_{\theta,i}^{(k)},1-\epsilon,1+\epsilon\bigr)A_i^{(k)}\Bigr)$$

where $L(\theta)$ is the objective function, $A_i^{(k)}$ is the advantage function, $B$ is the batch size and $N$ is the number of agents; the clip operation limits $r_{\theta,i}^{(k)}$ to between $1-\epsilon$ and $1+\epsilon$, with $\epsilon\in[0,1]$ a hyperparameter;
a penalty term is added to the original optimization function, which is corrected to:

$$L(\theta)=\frac{1}{BN}\sum_{i=1}^{B}\sum_{k=1}^{N}\Bigl[\min\Bigl(r_{\theta,i}^{(k)}A_i^{(k)},\ \mathrm{clip}\bigl(r_{\theta,i}^{(k)},1-\epsilon,1+\epsilon\bigr)A_i^{(k)}\Bigr)+\nu\,S\bigl[\pi_\theta\bigl(o_i^{(k)}\bigr)\bigr]\Bigr]$$

where $S$ is the policy entropy and $\nu$ is the entropy coefficient hyperparameter;
the Critic network then uses a gradient descent algorithm to minimize the following loss function:

$$L(\phi)=\frac{1}{BN}\sum_{i=1}^{B}\sum_{k=1}^{N}\max\Bigl[\bigl(V_\phi(s_i^{(k)})-\hat{R}_i\bigr)^2,\ \bigl(\mathrm{clip}\bigl(V_\phi(s_i^{(k)}),\,V_{\phi_{\mathrm{old}}}(s_i^{(k)})-\varepsilon,\,V_{\phi_{\mathrm{old}}}(s_i^{(k)})+\varepsilon\bigr)-\hat{R}_i\bigr)^2\Bigr]$$

where $L(\phi)$ is the loss function of the Critic network, $V_{\phi_{\mathrm{old}}}$ is the state value function of the Critic network before the parameter update, $V_\phi$ is the state value function of the Critic network after the parameter update, and $\hat{R}_i$ is the discounted reward;
step 3.2, action mask
let $l(s)_i$ be the logit of the $i$-th action generated in state $s$; invalid actions are masked by modifying the logits of the different actions as follows:

$$l'(s)_i=\begin{cases}l(s)_i, & \text{if action } i \text{ is valid}\\ -\mathrm{inf}, & \text{otherwise}\end{cases}$$

where $-\mathrm{inf}$ is a large negative number that ensures the probability of selecting an invalid action is zero; action masking is employed in both training and execution.
CN202310427926.8A 2023-04-20 2023-04-20 NB-IoT wireless resource allocation method based on NOMA and multi-agent reinforcement learning Pending CN116347635A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310427926.8A CN116347635A (en) 2023-04-20 2023-04-20 NB-IoT wireless resource allocation method based on NOMA and multi-agent reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310427926.8A CN116347635A (en) 2023-04-20 2023-04-20 NB-IoT wireless resource allocation method based on NOMA and multi-agent reinforcement learning

Publications (1)

Publication Number Publication Date
CN116347635A 2023-06-27

Family

ID=86887866

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310427926.8A Pending CN116347635A (en) 2023-04-20 2023-04-20 NB-IoT wireless resource allocation method based on NOMA and multi-agent reinforcement learning

Country Status (1)

Country Link
CN (1) CN116347635A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117376355A (en) * 2023-10-31 2024-01-09 重庆理工大学 B5G mass Internet of things resource allocation method and system based on hypergraph
CN117234216A (en) * 2023-11-10 2023-12-15 武汉大学 Robot deep reinforcement learning motion planning method and computer readable medium
CN117234216B (en) * 2023-11-10 2024-02-09 武汉大学 Robot deep reinforcement learning motion planning method and computer readable medium
CN118233312A (en) * 2024-03-20 2024-06-21 同济大学 Adaptive broadband resource allocation method combining deep reinforcement learning and converter

Similar Documents

Publication Publication Date Title
CN109729528B (en) D2D resource allocation method based on multi-agent deep reinforcement learning
Son et al. REFIM: A practical interference management in heterogeneous wireless access networks
CN116347635A (en) NB-IoT wireless resource allocation method based on NOMA and multi-agent reinforcement learning
CN112601284B (en) Downlink multi-cell OFDMA resource allocation method based on multi-agent deep reinforcement learning
CN106358308A (en) Resource allocation method for reinforcement learning in ultra-dense network
CN114867030B (en) Dual-time scale intelligent wireless access network slicing method
CN109861728B (en) Joint multi-relay selection and time slot resource allocation method for large-scale MIMO system
CN111586697A (en) Channel resource allocation method based on directed hyper-graph greedy coloring
Bi et al. Deep reinforcement learning based power allocation for D2D network
Yu et al. Dynamic resource allocation in TDD-based heterogeneous cloud radio access networks
CN106028456A (en) Power allocation method of virtual cell in 5G high density network
Dao et al. Deep reinforcement learning-based hierarchical time division duplexing control for dense wireless and mobile networks
Liu et al. A deep reinforcement learning based adaptive transmission strategy in space-air-ground integrated networks
CN103281702B (en) A kind of collaboration communication method based on multiple cell dynamic clustering
CN103338453A (en) Dynamic frequency spectrum access method and system for hierarchical wireless network
CN110753365B (en) Heterogeneous cellular network interference coordination method
CN111314938B (en) Optimization method for time-frequency domain resource allocation of cellular network of single cell
Liang et al. Decentralized bit, subcarrier and power allocation with interference avoidance in multicell OFDMA systems using game theoretic approach
Wang et al. Deep transfer reinforcement learning for beamforming and resource allocation in multi-cell MISO-OFDMA systems
Yan et al. An adaptive subcarrier, bit and power allocation algorithm for multicell OFDM systems
Ren et al. Deep reinforcement learning for connection density maximization in NOMA-based NB-IoT networks
Jia et al. Multi-agent deep reinforcement learning for uplink power control in multi-cell systems
Song et al. Throughput maximization in multi-channel wireless mesh access networks
Allagiotis et al. Reinforcement learning approach for resource allocation in 5g hetnets
Chen et al. Cognitive wireless network resource allocation strategy based on effective capacity

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination