CN116347635A - NB-IoT wireless resource allocation method based on NOMA and multi-agent reinforcement learning - Google Patents
NB-IoT wireless resource allocation method based on NOMA and multi-agent reinforcement learning
- Publication number: CN116347635A
- Application number: CN202310427926.8A
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W72/00—Local resource management
- H04W72/50—Allocation or scheduling criteria for wireless resources
- H04W72/54—Allocation or scheduling criteria for wireless resources based on quality criteria
- H04W72/543—Allocation or scheduling criteria for wireless resources based on quality criteria based on requested quality, e.g. QoS
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/092—Reinforcement learning
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W72/00—Local resource management
- H04W72/04—Wireless resource allocation
- H04W72/044—Wireless resource allocation based on the type of the allocated resource
- H04W72/0453—Resources in frequency domain, e.g. a carrier in FDMA
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/70—Reducing energy consumption in communication networks in wireless communication networks
Abstract
The invention discloses an NB-IoT wireless resource allocation method based on NOMA and multi-agent reinforcement learning, which aims to maximize connection density in a multi-user NB-IoT scenario employing NOMA. A realistic scenario is considered in which different users have different QoS requirements and use different tone types. Unlike traditional heuristic algorithms, the invention models the joint optimization of power allocation, resource block allocation, and NOMA user pairing as a Markov decision process and solves it with MAPPO, a state-of-the-art multi-agent reinforcement learning algorithm. To handle invalid actions, the method masks them while the neural network computes the probability distribution over actions, which accelerates convergence. System-level simulations using Python 3.8.15 on the VSCode platform show that the MAPPO algorithm outperforms the baseline algorithms.
Description
Technical Field
The invention belongs to the field of wireless communication technology and artificial intelligence, and particularly relates to an NB-IoT wireless resource allocation method based on NOMA and multi-agent reinforcement learning.
Background
With the advent of various types of smart devices, massive machine-type communication (mMTC) networks will become one of the most important communication networks in the 5G and B5G Internet of Things (IoT). According to IoT industry research, over 64 billion IoT devices will exist by 2025, 30 billion of which will be active. To support such massive IoT scenarios, the Third Generation Partnership Project (3GPP) has standardized several cellular IoT technologies, including Narrowband IoT (NB-IoT). NB-IoT, a Low-Power Wide-Area (LPWA) cellular network technology typically occupying 180 kHz of system bandwidth, supports single-tone and multi-tone modes. The goal of 5G mMTC is to support a huge connection density of 1 million devices per square kilometer with limited radio resources (the goal of 6G mMTC is 10 million devices per square kilometer). Therefore, to meet the performance requirements of 5G and future 6G on mMTC, maximizing the connection density of mMTC devices under limited resources is an important research topic in the industry.
Recently, non-orthogonal multiple access (NOMA) techniques have received widespread attention. NOMA supports non-orthogonal resource allocation among different users at the expense of receiver complexity. In addition, deep reinforcement learning (DRL) has recently made great progress, and its use in wireless communication systems has also received great attention. Reinforcement learning requires no prior knowledge of an environment model; an optimal policy is trained through continuous interaction between the agent and the environment. Once training is complete, the agent can make decisions in real time, which is a great advantage over traditional optimization algorithms. Multi-agent DRL is a generalization of single-agent DRL that enables a group of agents to learn optimal policies through interaction with the environment and with each other.
Currently, mMTC resource allocation mainly adopts orthogonal multiple access (OMA), which has low spectrum efficiency compared with non-orthogonal multiple access (NOMA) and cannot effectively support massive connectivity. In addition, optimization problems in resource allocation are traditionally solved with model-based mathematical optimization. However, due to the complexity of future 5G and 6G systems, optimizing the configurable parameters of the system is extremely complex, and adapting the configuration parameters in a dynamically changing environment is also very challenging.
Disclosure of Invention
The invention aims to solve the problem of maximizing NB-IoT connection density. To this end, an NB-IoT radio resource allocation algorithm based on NOMA and multi-agent DRL with centralized training and decentralized execution is proposed, which solves the problems mentioned in the Background section.
In order to achieve the above purpose, the present invention adopts the following technical scheme:
a wireless resource allocation scheme based on NOMA and multi-agent DRL models the resource allocation problem in a multi-user NB-IoT network as a Markov decision process and adopts MAPPO algorithm to solve the problem, comprising the following steps:
step 1, establishing an optimization model, wherein the model is expressed as:

$$\max \;\; \bar{\tau} = \frac{1}{T}\sum_{t=1}^{T}\frac{1}{N_t}\sum_{n=1}^{N_t}\left(1-c_n^t\right)$$

constraint conditions:

$$\text{C1:}\;\; \sum_{m=1}^{M}\sum_{f=1}^{F}\alpha_n^t(m,f)\;\ge\;\mu_n^t\,\Omega_n^t,\qquad\forall n,t$$

$$\text{C2:}\;\; \alpha_n^t(m,f)\in\{0,1\},\quad m\in\{1,2,\dots,M\},\;f\in\{1,2,\dots,F\}$$

$$\text{C3:}\;\; p_n^t\in\{P_1,P_2,P_3\}$$

wherein $\bar{\tau}$ represents the average successful transmission ratio of the devices over a period of time; $T$ represents the number of frames; $N_t$ indicates that data packets from $N_t$ devices arrive at the base station in frame $t$; $c_n^t\in\{0,1\}$ is the outage indication, with $c_n^t=1$ indicating that device $n$ is in outage at frame $t$; the resource units and transmission power of the mMTC devices are allocated at the beginning of each frame; let $\alpha_n^t(m,f)$ be the resource unit indicator, with $\alpha_n^t(m,f)=1$ indicating that resource element RE$(m,f)$ is allocated to device $n$ at frame $t$; the total number of resource units in outage for device $n$ is $C_n^t=\sum_{m,f}\alpha_n^t(m,f)\,e_n^t(m,f)$, where $e_n^t(m,f)$ is the resource unit outage indication, with $e_n^t(m,f)=1$ representing that the SINR on RE$(m,f)$ in frame $t$ is below the threshold; the instantaneous signal-to-interference-plus-noise ratio (SINR) of device $n$ on resource element RE$(m,f)$ is defined as:

$$\gamma_n^t(m,f)=\frac{p_n^t\,\beta_n^t\,|g_n^t|^2}{\sigma^2+\sum_{i=1,\,i\ne n}^{I}p_i^t\,|h_i^t|^2}$$

where $p_n^t$ is the transmission power of mMTC device $n$ at frame $t$ and $\sigma^2$ is the noise power; $h_n^t$ denotes the fading coefficient of device $n$ at frame $t$, whose power fading coefficient is $|h_n^t|^2=\beta_n^t\,|g_n^t|^2$, where $\beta_n^t$ is the large-scale fading coefficient of device $n$ at frame $t$ and $g_n^t$ is the small-scale fading coefficient; $I$ is the total number of devices superimposed on the same resource block, $p_i^t$ represents the transmit power of the $i$-th such device, and $h_i^t$ the channel coefficient of the $i$-th device; to ensure that user $n$ decodes successfully, the SINR $\gamma_n^t(m,f)$ of user $n$'s packet must reach the SINR threshold $\gamma_{th}$;

C1 ensures that device $n$ is allocated sufficient resources at frame $t$, where $\mu_n^t=\lceil D_n^t/D_{TBS}\rceil$ represents the number of transport blocks required by device $n$ at frame $t$, $D_n^t$ is the packet size of device $n$ in frame $t$, and $D_{TBS}$ is the size of the selected transport block; $\Omega_n^t$ represents the number of resource elements required to transmit $D_{TBS}$ bits of information, i.e. the number of resource elements occupied by one resource unit of device $n$ in frame $t$, which takes one value for single-tone transmission and another for multi-tone transmission;

C2 is the time-frequency resource allocation constraint: one frame is divided into $M$ slots in the time domain and the total bandwidth is divided into $F$ orthogonal parts in the frequency domain, so that $m\in\{1,2,\dots,M\}$ and $f\in\{1,2,\dots,F\}$;

C3 is the power allocation constraint: each mMTC device selects among 3 power levels;

step 2, modeling the optimization model of step 1 as a Markov decision process;

step 3, solving the Markov decision process of step 2 with the MAPPO algorithm with action masking.
Further, modeling the optimization problem using the Markov decision process in step 2 includes the following steps:
step 2.1 definition of the State
MAPPO uses centralized training with distributed execution. During execution, each mMTC device can only observe its own state; let $o_n^t$ denote the observed state of mMTC device $n$ at frame $t$. During training, the base station can obtain the observed states of all mMTC devices; let $s_t$ denote the observed state of the base station at frame $t$, so that $s_t=\left\{o_1^t,o_2^t,\dots,o_{N_t}^t\right\}$.
Step 2.2 definition of action
When the mMTC device observes the state fed back by the environment, it autonomously selects the transmission power and the resource units to occupy, so the action of device $n$ in frame $t$ is $a_n^t=\left\{p_n^t,\alpha_n^t\right\}$, where $\alpha_n^t$ represents the resource unit allocation of the device in the frame;
step 2.3 definition of rewards
After performing action $a_n^t$, the environment immediately feeds back a reward. To optimize the network objective, i.e. to minimize the average outage rate of all devices over a period of $T$ frames, the reward is defined as:

$$r_t=-\frac{1}{N_t}\sum_{n=1}^{N_t}c_n^t$$

During training, all agents share the reward.
Further, solving the problem in step 3 using the MAPPO algorithm with action masking includes the following steps:
step 3.1, MAPPO consists of an Actor network and a Critic network; let

$$r_{i,k}(\theta)=\frac{\pi_\theta\left(a_{i,k}\mid o_{i,k}\right)}{\pi_{\theta_{old}}\left(a_{i,k}\mid o_{i,k}\right)}$$

represent the gap between the two Actor networks, where $\pi_\theta$ is the Actor network after the parameter update, mapping an observed state $o_{i,k}$ to a specific action $a_{i,k}$, and $\pi_{\theta_{old}}$ is the Actor network before the parameter update; the Actor network uses the Adam stochastic gradient ascent algorithm to optimize the following function:

$$L(\theta)=\frac{1}{BN}\sum_{i=1}^{B}\sum_{k=1}^{N}\min\left(r_{i,k}(\theta)\,A_{i,k},\;\mathrm{clip}\left(r_{i,k}(\theta),\,1-\epsilon,\,1+\epsilon\right)A_{i,k}\right)$$

where $L(\theta)$ is the objective function, $A_{i,k}$ is the advantage function, $B$ is the batch size, and $N$ is the number of agents; $\mathrm{clip}(\cdot,1-\epsilon,1+\epsilon)$ limits the ratio to between $1-\epsilon$ and $1+\epsilon$, with $\epsilon$ a hyperparameter in $[0,1]$;
in order to enhance the exploration capability of the neural network, penalty terms are added to the original optimization function, so that the entropy of the distribution is kept larger, and the neural network does not output certain fixed actions, and the original optimization function is modified as follows:
wherein S is policy entropy, and v is entropy coefficient super-parameter;
critic networks then use a gradient descent algorithm to minimize the following loss functions:
where L (phi) is the cost function of the Critic network,is the state cost function of the Critic network before updating the parameters,/->Is the state cost function of the Critic network after updating the parameters,/>Rewarding discounts;
step 3.2, action mask
The policy network selects actions based on the softmax of the action logits. However, not every action is valid. Since the input dimension of the neural network is fixed, assume the action space of each device has size $\kappa_D$; if at the current step only $\kappa_K$ actions are valid, then $\kappa_D-\kappa_K$ actions are invalid. Because the data packets arriving in each frame are random, there are many invalid actions, and they change dynamically. A conventional neural network model computes the probability of all actions (including invalid ones), which reduces the convergence speed; action masking is therefore proposed to alleviate this problem. Specifically, let $l(s)_i$ be the logit generated for the $i$-th action in state $s$; invalid actions are masked by modifying the logits as follows:

$$l'(s)_i=\begin{cases}l(s)_i,&\text{if action }i\text{ is valid}\\-\mathrm{inf},&\text{otherwise}\end{cases}$$

where $-\mathrm{inf}$ is a large negative number ensuring that the probability of selecting an invalid action is zero; action masking is employed during both training and execution.
Compared with the prior art, the invention has the beneficial effects that:
(1) The present invention contemplates using NOMA technology in NB-IoT networks, which greatly improves the spectral efficiency and connection density of the system by serving multiple users on one resource block.
(2) The invention models the resource allocation problem in a multi-user NB-IoT network containing both single-tone and multi-tone devices as a Markov decision process and solves it with deep reinforcement learning, thereby avoiding the complexity of traditional mathematical optimization algorithms. Compared with a traditional single agent, the invention uses multi-agent reinforcement learning: during centralized training, each agent can cooperate with the other agents to obtain better results; during distributed execution, each agent acts according to its own observed state, which reduces the complexity of the system.
(3) The multi-agent reinforcement learning provided by the invention is scalable and can be applied to larger and more complex environments. In multi-agent reinforcement learning, each agent only needs to attend to local information, enabling efficient parallel computation.
Drawings
The invention is further illustrated by the following examples in conjunction with the accompanying drawings:
FIG. 1 is a general flow chart of an embodiment of the present invention;
fig. 2 is a graph showing the average successful transmission probability of the single-tone and multi-tone coexistence scene according to the present invention according to the NB-IoT packet arrival rate;
fig. 3 is a graph showing the average successful transmission probability of the multitone scene according to the NB-IoT packet arrival rate.
Detailed Description
The invention provides a multi-agent wireless resource allocation method. Against the background of a multi-user NB-IoT system based on NOMA, the method takes the long-term average successful transmission probability of the system as the objective function, takes the device transmit power and the number of allocated resource units as constraints, and studies the joint optimization of resource block allocation, power allocation, and user pairing. To solve the resulting complex combinatorial optimization problem, the invention models the original problem as a Markov decision process and introduces the advanced MAPPO algorithm to solve it.
For a better illustration of the method of the invention, a more detailed embodiment follows. Consider an NB-IoT uplink transmission system based on non-orthogonal multiple access with $N$ devices, and model the packet arrival process of device $n$ as an independent renewal process with arrival intensity $\lambda_n$. According to Palm–Khinchine theory, the superposition of the renewal processes of the massive IoT devices at the BS can be approximated by a Poisson process with intensity $\lambda=\sum_{n=1}^{N}\lambda_n$. The distance between any mMTC device and the BS is denoted $d$, and the coverage radius of the BS is $R_S$. Let $d/R_S$ obey the Beta$(a,b)$ distribution shown below:

$$f(x)=\frac{x^{a-1}(1-x)^{b-1}}{\int_0^1 u^{a-1}(1-u)^{b-1}\,du},\qquad x\in[0,1]$$

wherein $a$ and $b$ are adjustable parameters, so that different spatial distributions of IoT devices can be simulated in different scenarios, and $u$ is the integration variable. Further, assume the base station serves at most $D$ devices per step.
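As an illustrative sketch (not the patent's own code), the Beta$(a,b)$ placement of devices within the cell can be sampled with Python's standard library; the radius and the Beta(2,4) parameters below are example values taken from the embodiment, and the function name is an assumption:

```python
import random

def sample_device_distances(n_devices, radius, a=2.0, b=4.0, seed=0):
    """Sample device-to-BS distances d such that d / R_S ~ Beta(a, b)."""
    rng = random.Random(seed)  # seeded for reproducibility
    return [radius * rng.betavariate(a, b) for _ in range(n_devices)]

# Roughly half the 1732 m inter-site distance as the cell radius (assumption).
distances = sample_device_distances(1000, radius=866.0)
print(min(distances), max(distances))
```

Since Beta(2,4) has mean 1/3, most devices fall in the inner third of the cell, which models an indoor-heavy deployment.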
Step 1: establishing an optimization model, wherein the model is expressed as:

$$\max \;\; \bar{\tau} = \frac{1}{T}\sum_{t=1}^{T}\frac{1}{N_t}\sum_{n=1}^{N_t}\left(1-c_n^t\right)$$

constraint conditions:

$$\text{C1:}\;\; \sum_{m=1}^{M}\sum_{f=1}^{F}\alpha_n^t(m,f)\;\ge\;\mu_n^t\,\Omega_n^t,\qquad\forall n,t$$

$$\text{C2:}\;\; \alpha_n^t(m,f)\in\{0,1\},\quad m\in\{1,2,\dots,M\},\;f\in\{1,2,\dots,F\}$$

$$\text{C3:}\;\; p_n^t\in\{P_1,P_2,P_3\}$$

wherein $\bar{\tau}$ represents the average successful transmission ratio of the devices over a period of time; $T$ represents the number of frames; $N_t$ indicates that data packets from $N_t$ devices arrive at the base station in frame $t$; $c_n^t\in\{0,1\}$ is the outage indication, with $c_n^t=1$ indicating that device $n$ is in outage at frame $t$; the resource units and transmission power of the mMTC devices are allocated at the beginning of each frame; let $\alpha_n^t(m,f)$ be the resource unit indicator, with $\alpha_n^t(m,f)=1$ indicating that resource element RE$(m,f)$ is allocated to device $n$ at frame $t$; the total number of resource units in outage for device $n$ is $C_n^t=\sum_{m,f}\alpha_n^t(m,f)\,e_n^t(m,f)$, where $e_n^t(m,f)$ is the resource unit outage indication, with $e_n^t(m,f)=1$ representing that the SINR on RE$(m,f)$ in frame $t$ is below the threshold; the instantaneous signal-to-interference-plus-noise ratio (SINR) of device $n$ on resource element RE$(m,f)$ is defined as:

$$\gamma_n^t(m,f)=\frac{p_n^t\,\beta_n^t\,|g_n^t|^2}{\sigma^2+\sum_{i=1,\,i\ne n}^{I}p_i^t\,|h_i^t|^2}$$

where $p_n^t$ is the transmission power of mMTC device $n$ at frame $t$ and $\sigma^2$ is the noise power; $h_n^t$ denotes the fading coefficient of device $n$ at frame $t$, whose power fading coefficient is $|h_n^t|^2=\beta_n^t\,|g_n^t|^2$, where $\beta_n^t$ is the large-scale fading coefficient of device $n$ at frame $t$ and $g_n^t$ is the small-scale fading coefficient; $I$ is the total number of devices superimposed on the same resource block, $p_i^t$ represents the transmit power of the $i$-th such device, and $h_i^t$ the channel coefficient of the $i$-th device. To ensure that user $n$ decodes successfully, the SINR $\gamma_n^t(m,f)$ of user $n$'s packet must reach the SINR threshold $\gamma_{th}$. If any resource unit allocated to device $n$ violates the SINR requirement, device $n$ is in outage; thus $c_n^t=1$ if $C_n^t>0$, and $c_n^t=0$ otherwise. C1 ensures that device $n$ is allocated sufficient resources in frame $t$, where $\mu_n^t=\lceil D_n^t/D_{TBS}\rceil$ represents the number of transport blocks required by device $n$ at frame $t$, $D_n^t$ is the packet size of device $n$ in frame $t$, and $D_{TBS}$ is the size of the selected transport block; $\Omega_n^t$ represents the number of resource elements required to transmit $D_{TBS}$ bits of information, i.e. the number of resource elements occupied by one resource unit of device $n$ in frame $t$, which takes one value for single-tone transmission and another for multi-tone transmission. C2 is the time-frequency resource allocation constraint: one frame is divided into $M$ slots in the time domain and the total bandwidth is divided into $F$ orthogonal parts in the frequency domain, so that $m\in\{1,2,\dots,M\}$ and $f\in\{1,2,\dots,F\}$. C3 is the power allocation constraint: each mMTC device selects among 3 power levels.
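To make the SINR definition concrete, the following sketch (an illustration, not the patent's implementation; all numeric values are assumptions) computes the SINR of a device superimposed with others on the same resource element and checks it against a decoding threshold:

```python
import math

def sinr(p_n, h2_n, interferers, noise):
    """SINR on RE(m,f): p_n * |h_n|^2 / (noise + sum_i p_i * |h_i|^2).

    interferers: list of (p_i, |h_i|^2) pairs for devices superimposed
    on the same resource block under NOMA.
    """
    interference = sum(p_i * h2_i for p_i, h2_i in interferers)
    return (p_n * h2_n) / (noise + interference)

def db(x):
    return 10.0 * math.log10(x)

# Device of interest paired with one NOMA partner on the same resource block.
gamma = sinr(p_n=0.2, h2_n=1e-9, interferers=[(0.1, 1e-10)], noise=1e-12)
outage = db(gamma) < 4.3  # single-tone SINR threshold from the embodiment
print(db(gamma), outage)
```

The outage indicator $e_n^t(m,f)$ of the model is exactly this threshold comparison, applied per allocated resource element.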
Step 2: modeling the optimization problem using a Markov decision process.
2.1 definition of the State
MAPPO uses centralized training with distributed execution. During execution, each device can only observe its own state; let $o_n^t$ denote the observed state of mMTC device $n$ at frame $t$. During training, the base station can obtain the observed states of all mMTC devices; let $s_t$ denote the observed state of the base station at frame $t$, so that $s_t=\left\{o_1^t,o_2^t,\dots,o_{N_t}^t\right\}$.
2.2 definition of actions
When the mMTC device observes the state fed back by the environment, it autonomously selects the transmission power and the resource units to occupy, so the action of device $n$ in frame $t$ is $a_n^t=\left\{p_n^t,\alpha_n^t\right\}$, where $\alpha_n^t$ represents the resource unit allocation of the device in the frame.
2.3 definition of rewards
After performing action $a_n^t$, the environment immediately feeds back a reward. To optimize the network objective, i.e. to minimize the average outage rate of all devices over a period of $T$ frames, the reward is defined as:

$$r_t=-\frac{1}{N_t}\sum_{n=1}^{N_t}c_n^t$$

During training, all agents share the reward.
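A minimal sketch of the shared reward, under the reading that the per-frame reward is the negative average outage rate (the indicator list below is illustrative):

```python
def shared_reward(outage_indicators):
    """r_t = -(1/N_t) * sum_n c_n^t: all agents receive this common reward."""
    n = len(outage_indicators)
    return -sum(outage_indicators) / n if n else 0.0

print(shared_reward([0, 1, 0, 0]))  # one of four devices in outage -> -0.25
```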
Step 3: solving the problem using a reinforcement learning algorithm based on MAPPO.
3.1 MAPPO

MAPPO is a multi-agent PPO with centralized training and distributed execution. MAPPO consists of an Actor network and a Critic network. Let

$$r_{i,k}(\theta)=\frac{\pi_\theta\left(a_{i,k}\mid o_{i,k}\right)}{\pi_{\theta_{old}}\left(a_{i,k}\mid o_{i,k}\right)}$$

represent the gap between the two Actor networks, where $\pi_\theta$ is the Actor network after the parameter update, mapping an observed state $o_{i,k}$ to a specific action $a_{i,k}$, and $\pi_{\theta_{old}}$ is the Actor network before the parameter update. The Actor network uses the Adam stochastic gradient ascent algorithm to optimize the following function:

$$L(\theta)=\frac{1}{BN}\sum_{i=1}^{B}\sum_{k=1}^{N}\min\left(r_{i,k}(\theta)\,A_{i,k},\;\mathrm{clip}\left(r_{i,k}(\theta),\,1-\epsilon,\,1+\epsilon\right)A_{i,k}\right)$$

where $L(\theta)$ is the objective function, $A_{i,k}$ is the advantage function, $B$ is the batch size, and $N$ is the number of agents; $\mathrm{clip}(\cdot,1-\epsilon,1+\epsilon)$ limits the ratio to between $1-\epsilon$ and $1+\epsilon$, with $\epsilon$ a hyperparameter in $[0,1]$;
in order to enhance the exploration capability of the neural network, the invention adds a punishment term in the original optimization function, which keeps the entropy of the distribution larger, so that the neural network does not output certain fixed actions, and the original optimization function is modified as follows:
wherein S is policy entropy, and v is entropy coefficient super-parameter;
critic networks then use a gradient descent algorithm to minimize the following loss functions:
where L (phi) is the cost function of the Critic network,is the state cost function of the Critic network before updating the parameters,/->Is the state cost function of the Critic network after updating the parameters,/>Rewarding discounts.
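The clipped Actor surrogate and value-clipped Critic loss can be sketched in plain Python (an illustration of the PPO-style objectives, not the patent's code; the $\epsilon$ values are assumptions):

```python
def clip(x, lo, hi):
    return max(lo, min(hi, x))

def actor_objective(ratios, advantages, eps=0.2):
    """PPO clipped surrogate, averaged over a batch of (ratio, advantage) pairs."""
    terms = [min(r * a, clip(r, 1.0 - eps, 1.0 + eps) * a)
             for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)

def critic_loss(v_new, v_old, returns, eps=0.2):
    """MAPPO-style value clipping: max of unclipped and clipped squared errors."""
    terms = [max((vn - R) ** 2,
                 (clip(vn, vo - eps, vo + eps) - R) ** 2)
             for vn, vo, R in zip(v_new, v_old, returns)]
    return sum(terms) / len(terms)

# A large ratio with positive advantage is capped by the clip at 1 + eps.
print(actor_objective([2.0], [1.0]))
```

The `min` in the surrogate makes the update pessimistic: the clipped term only removes the incentive to push the ratio far beyond $1\pm\epsilon$, it never rewards doing so.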
3.2 action mask
The policy network selects actions based on the softmax of the action logits. However, not every action is valid. Since the input dimension of the neural network is fixed, assume the action space of each device has size $\kappa_D$; if at the current step only $\kappa_K$ actions are valid, then $\kappa_D-\kappa_K$ actions are invalid. Because the data packets arriving in each frame are random, there are many invalid actions, and they change dynamically. A conventional neural network model computes the probability of all actions (including invalid ones), which reduces the convergence speed; the invention therefore proposes action masking to alleviate this problem. Specifically, let $l(s)_i$ be the logit generated for the $i$-th action in state $s$; invalid actions are masked by modifying the logits as follows:

$$l'(s)_i=\begin{cases}l(s)_i,&\text{if action }i\text{ is valid}\\-\mathrm{inf},&\text{otherwise}\end{cases}$$

where $-\mathrm{inf}$ is a large negative number ensuring that the probability of selecting an invalid action is zero; action masking is employed during both training and execution.
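A minimal numerical sketch of the action mask (illustrative only; the logits and validity pattern below are made up):

```python
import math

NEG_INF = -1e8  # "large negative number" standing in for -inf

def masked_softmax(logits, valid):
    """Softmax over logits with invalid actions driven to (numerically) zero probability."""
    masked = [l if v else NEG_INF for l, v in zip(logits, valid)]
    m = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in masked]
    z = sum(exps)
    return [e / z for e in exps]

probs = masked_softmax([1.0, 2.0, 0.5], valid=[True, False, True])
print(probs)  # the masked (second) action gets probability ~0
```

Because the masked logit underflows in `exp`, the invalid action can never be sampled, during either training or execution.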
System-level simulations were performed on the VSCode platform using Python 3.8.15. The carrier frequency of the BS is 900 MHz, the inter-site distance (ISD) is 1732 m, the device locations satisfy the Beta(2, 4) distribution, and the device transmit power has three options: 23 dBm, 20 dBm, and 17 dBm. The path loss follows the 3GPP mMTC evaluation model:

$$PL\;[\mathrm{dB}]=120.9+37.6\log_{10}(d)+L-G,\qquad d\text{ in km}$$

where $L$ is the indoor penetration loss of 20 dB and $G$ is the antenna gain of -4 dB.
According to 3GPP TR 37.910, the traffic model is 1 message per device every 2 hours, with a message size of 32 bytes. The MCS index of each device is 8, i.e. each message occupies at least 2 resource units. Accordingly, 12 tones are allocated 4 slots, 6 tones 8 slots, 3 tones 16 slots, and a single tone 32 slots. Without loss of generality, general-level coverage enhancement is considered, with an NPUSCH repetition number of 1. The SINR threshold is 1.4 dB for multi-tone and 4.3 dB for single-tone transmission. For the DRL model, a step is taken to be 32 time slots, and the message arrival rate at the BS is λ ∈ [1, 12] packets/step. According to the Poisson distribution, when the maximum message arrival rate is 12, the maximum number of messages is assumed to be 24. Since packets arriving at different steps are independent, only the instantaneous reward is considered, so the discount factor γ is 0.
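The tone-to-slot allocation just described (12 tones → 4 slots, 6 → 8, 3 → 16, single tone → 32) can be captured in a small lookup; this sketch and the helper name are illustrative only:

```python
# Slots occupied by one resource unit for each supported tone count
# (values taken from the embodiment above).
SLOTS_PER_RU = {12: 4, 6: 8, 3: 16, 1: 32}

def slots_needed(tones, resource_units):
    """Total slots a device occupies: number of resource units (at least 2
    per message at MCS index 8) times the slot length of one resource unit."""
    return resource_units * SLOTS_PER_RU[tones]

print(slots_needed(12, 2), slots_needed(1, 2))
```

This makes the time/frequency trade-off explicit: narrower (fewer-tone) allocations occupy proportionally more slots of the frame.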
Fig. 2 shows how the average successful transmission probability varies with the mMTC traffic arrival rate when single-tone and multi-tone devices coexist. It can be seen that the NOMA-based MAPPO approach outperforms the OMA-based and NOMA-based random approaches. Fig. 3 shows the variation of the average successful transmission probability with the mMTC traffic arrival rate considering only the multi-tone case. Here the performance of all algorithms improves, but MAPPO still achieves the best performance, which verifies the superiority of the MAPPO algorithm.
Claims (3)
1. The NB-IoT wireless resource allocation method based on NOMA and multi-agent reinforcement learning is characterized by comprising the following steps:
step 1, establishing an optimization model, wherein the model is expressed as:

$$\max \;\; \bar{\tau} = \frac{1}{T}\sum_{t=1}^{T}\frac{1}{N_t}\sum_{n=1}^{N_t}\left(1-c_n^t\right)$$

constraint conditions:

$$\text{C1:}\;\; \sum_{m=1}^{M}\sum_{f=1}^{F}\alpha_n^t(m,f)\;\ge\;\mu_n^t\,\Omega_n^t,\qquad\forall n,t$$

$$\text{C2:}\;\; \alpha_n^t(m,f)\in\{0,1\},\quad m\in\{1,2,\dots,M\},\;f\in\{1,2,\dots,F\}$$

$$\text{C3:}\;\; p_n^t\in\{P_1,P_2,P_3\}$$

wherein $\bar{\tau}$ represents the average successful transmission ratio of the devices over a period of time; $T$ represents the number of frames; $N_t$ indicates that data packets from $N_t$ devices arrive at the base station in frame $t$; $c_n^t\in\{0,1\}$ is the outage indication, with $c_n^t=1$ indicating that device $n$ is in outage at frame $t$; the resource units and transmission power of the mMTC devices are allocated at the beginning of each frame; let $\alpha_n^t(m,f)$ be the resource unit indicator, with $\alpha_n^t(m,f)=1$ indicating that resource element RE$(m,f)$ is allocated to device $n$ at frame $t$; the total number of resource units in outage for device $n$ is $C_n^t=\sum_{m,f}\alpha_n^t(m,f)\,e_n^t(m,f)$, where $e_n^t(m,f)$ is the resource unit outage indication, with $e_n^t(m,f)=1$ representing that the SINR on RE$(m,f)$ in frame $t$ is below the threshold; the instantaneous signal-to-interference-plus-noise ratio (SINR) of device $n$ on resource element RE$(m,f)$ is defined as:

$$\gamma_n^t(m,f)=\frac{p_n^t\,\beta_n^t\,|g_n^t|^2}{\sigma^2+\sum_{i=1,\,i\ne n}^{I}p_i^t\,|h_i^t|^2}$$

where $p_n^t$ is the transmission power of mMTC device $n$ at frame $t$ and $\sigma^2$ is the noise power; $h_n^t$ denotes the fading coefficient of device $n$ at frame $t$, whose power fading coefficient is $|h_n^t|^2=\beta_n^t\,|g_n^t|^2$, where $\beta_n^t$ is the large-scale fading coefficient of device $n$ at frame $t$ and $g_n^t$ is the small-scale fading coefficient; $I$ is the total number of devices superimposed on the same resource block, $p_i^t$ represents the transmit power of the $i$-th such device, and $h_i^t$ the channel coefficient of the $i$-th device; to ensure that user $n$ decodes successfully, the SINR $\gamma_n^t(m,f)$ of user $n$'s packet must reach the SINR threshold $\gamma_{th}$;

C1 ensures that device $n$ is allocated sufficient resources at frame $t$, where $\mu_n^t=\lceil D_n^t/D_{TBS}\rceil$ represents the number of transport blocks required by device $n$ at frame $t$, $D_n^t$ is the packet size of device $n$ in frame $t$, and $D_{TBS}$ is the size of the selected transport block; $\Omega_n^t$ represents the number of resource elements required to transmit $D_{TBS}$ bits of information, i.e. the number of resource elements occupied by one resource unit of the data packet of device $n$ in frame $t$;

C2 is the time-frequency resource allocation constraint: one frame is divided into $M$ slots in the time domain and the total bandwidth is divided into $F$ orthogonal parts in the frequency domain, so that $m\in\{1,2,\dots,M\}$ and $f\in\{1,2,\dots,F\}$;

C3 is the power allocation constraint: each mMTC device selects among 3 power levels;
step 2, modeling the optimization model in the step 1 into a Markov decision process;
step 3, solving the Markov decision process of step 2 with the MAPPO algorithm with action masking.
2. The NOMA and MAPPO based NB-IoT radio resource allocation method according to claim 1, wherein modeling the optimization problem using a Markov decision process in step 2 comprises the following steps:
step 2.1 definition of the State
MAPPO uses centralized training with distributed execution; during execution, each mMTC device can only observe its own state, and $o_n^t$ is used to represent the observed state of mMTC device $n$ at frame $t$; during training, the base station can obtain the observed states of all mMTC devices, and $s_t$ is used to represent the observed state of the base station at frame $t$, so that $s_t = \{o_1^t, o_2^t, \dots, o_{N_t}^t\}$;
Step 2.2 definition of action
the mMTC device selects its transmit power and the resource units it occupies for transmission according to the state fed back by the environment, so the action of device $n$ at frame $t$ is $a_n^t = \{p_n^t, \boldsymbol{\alpha}_n^t\}$, where $\boldsymbol{\alpha}_n^t$ represents the resource-unit allocation of the device in the frame;
step 2.3 definition of rewards
upon performing an action $a_n^t$, the environment immediately feeds back a reward; to optimize the network objective, i.e. minimize the average outage rate of all devices over a period of $T$ frames, the reward is defined as:

$$r_t = -\frac{1}{N_t}\sum_{n=1}^{N_t} c_n^t$$
during training, all agents share rewards.
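The shared frame reward can be sketched as follows (a minimal sketch; the negative-mean-outage form is an assumption consistent with minimizing the average outage rate, and the function name is hypothetical):

```python
def shared_reward(outage_indicators):
    """Reward at frame t: negative mean of the outage indications c_n^t over the
    N_t active devices. All agents receive this same scalar (shared reward)."""
    n = len(outage_indicators)
    return -sum(outage_indicators) / n if n else 0.0
```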
3. The NOMA and MAPPO based NB-IoT radio resource allocation method according to claim 2, wherein solving the problem using the MAPPO algorithm with action masking in step 3 comprises the following steps:
step 3.1, MAPPO consists of an Actor network and a Critic network; let $r_\theta = \frac{\pi_\theta(a_n^t \mid o_n^t)}{\pi_{\theta_{\mathrm{old}}}(a_n^t \mid o_n^t)}$ represent the ratio between the two Actor networks, where $\pi_\theta$ is the Actor network after the parameter update, which maps the observed state $o_n^t$ to a specific action $a_n^t$, and $\pi_{\theta_{\mathrm{old}}}$ is the Actor network before the parameter update; the Actor network uses the Adam stochastic gradient ascent algorithm to optimize the following function:

$$L(\theta) = \frac{1}{BN}\sum_{i=1}^{B}\sum_{n=1}^{N} \min\!\left(r_{\theta,i}^{(n)} A_i^{(n)},\ \mathrm{clip}\!\left(r_{\theta,i}^{(n)},\, 1-\epsilon,\, 1+\epsilon\right) A_i^{(n)}\right)$$
where $L(\theta)$ is the objective function, $A_i^{(n)}$ is the advantage function, $B$ is the batch size, and $N$ is the number of agents; $\mathrm{clip}(\cdot)$ limits $r_\theta$ to between $1-\epsilon$ and $1+\epsilon$, where $\epsilon$ is a hyperparameter in $[0,1]$;
an entropy regularization term is added to the original optimization function, which is corrected to:

$$L'(\theta) = L(\theta) + \nu\, S\!\left[\pi_\theta\right]$$
where $S$ is the policy entropy and $\nu$ is the entropy-coefficient hyperparameter;
the Critic network then uses a gradient descent algorithm to minimize the following loss function:

$$L(\phi) = \frac{1}{BN}\sum_{i=1}^{B}\sum_{n=1}^{N} \max\!\left[\left(V_\phi(s_i^{(n)}) - \hat{R}_i^{(n)}\right)^2,\ \left(\mathrm{clip}\!\left(V_\phi(s_i^{(n)}),\, V_{\phi_{\mathrm{old}}}(s_i^{(n)}) - \varepsilon,\, V_{\phi_{\mathrm{old}}}(s_i^{(n)}) + \varepsilon\right) - \hat{R}_i^{(n)}\right)^2\right]$$
where $L(\phi)$ is the loss function of the Critic network, $V_{\phi_{\mathrm{old}}}$ is the state value function of the Critic network before the parameter update, $V_\phi$ is the state value function of the Critic network after the parameter update, and $\hat{R}$ is the discounted return;
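The clipped Actor objective with entropy bonus and the value-clipped Critic loss can be sketched per minibatch as follows (a plain-Python illustration of the two losses; the function names and the default $\epsilon$ and $\nu$ values are assumptions):

```python
def ppo_actor_objective(ratios, advantages, entropy, eps=0.2, nu=0.01):
    """Clipped PPO surrogate plus entropy bonus, maximized by gradient ascent.
    ratios: pi_theta(a|o) / pi_theta_old(a|o) per sample; advantages: A per sample."""
    total = 0.0
    for r, a in zip(ratios, advantages):
        clipped = min(max(r, 1.0 - eps), 1.0 + eps)   # clip(r, 1-eps, 1+eps)
        total += min(r * a, clipped * a)              # pessimistic (clipped) bound
    return total / len(ratios) + nu * entropy

def critic_value_loss(v_new, v_old, returns, eps=0.2):
    """Value loss with value clipping, minimized by gradient descent:
    max((V_phi - R)^2, (clip(V_phi, V_old - eps, V_old + eps) - R)^2)."""
    total = 0.0
    for vn, vo, ret in zip(v_new, v_old, returns):
        v_clipped = min(max(vn, vo - eps), vo + eps)
        total += max((vn - ret) ** 2, (v_clipped - ret) ** 2)
    return total / len(returns)
```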
step 3.2, action mask
let $l(s)_i$ be the logit generated for the $i$-th action in state $s$; invalid actions are masked by modifying the logits of the different actions as follows:

$$l(s)_i' = \begin{cases} l(s)_i, & \text{if action } i \text{ is valid in state } s \\ -\mathrm{inf}, & \text{otherwise} \end{cases}$$
where $-\mathrm{inf}$ is a very large negative number that ensures the probability of selecting an invalid action is zero; action masking is employed both during training and during execution.
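The logit-masking step can be sketched as follows (a minimal sketch; the sentinel value, function names, and sample logits are assumptions, and the softmax is included only to show the masked probabilities):

```python
import math

def mask_logits(logits, valid):
    """Replace the logits of invalid actions with a very large negative number,
    so the subsequent softmax assigns them (numerically) zero probability."""
    NEG_INF = -1e9     # stands in for -inf in the masking rule above
    return [l if ok else NEG_INF for l, ok in zip(logits, valid)]

def softmax(xs):
    m = max(xs)                                  # shift for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# Action 1 is invalid in this state; its probability collapses to zero.
probs = softmax(mask_logits([2.0, 1.0, 0.5], [True, False, True]))
```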
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310427926.8A CN116347635A (en) | 2023-04-20 | 2023-04-20 | NB-IoT wireless resource allocation method based on NOMA and multi-agent reinforcement learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116347635A true CN116347635A (en) | 2023-06-27 |
Family
ID=86887866
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117234216A (en) * | 2023-11-10 | 2023-12-15 | 武汉大学 | Robot deep reinforcement learning motion planning method and computer readable medium |
CN117376355A (en) * | 2023-10-31 | 2024-01-09 | 重庆理工大学 | B5G mass Internet of things resource allocation method and system based on hypergraph |
CN118233312A (en) * | 2024-03-20 | 2024-06-21 | 同济大学 | Adaptive broadband resource allocation method combining deep reinforcement learning and converter |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||