CN112566261A - Deep reinforcement learning-based uplink NOMA resource allocation method - Google Patents

Deep reinforcement learning-based uplink NOMA resource allocation method Download PDF

Info

Publication number
CN112566261A
CN112566261A (application number CN202011445582.6A)
Authority
CN
China
Prior art keywords
network
allocation
power
channel
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011445582.6A
Other languages
Chinese (zh)
Inventor
徐友云 (Xu Youyun)
李大鹏 (Li Dapeng)
蒋锐 (Jiang Rui)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Ai Er Win Technology Co ltd
Original Assignee
Nanjing Ai Er Win Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Ai Er Win Technology Co ltd filed Critical Nanjing Ai Er Win Technology Co ltd
Priority to CN202011445582.6A priority Critical patent/CN112566261A/en
Publication of CN112566261A publication Critical patent/CN112566261A/en
Pending legal-status Critical Current

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04W: WIRELESS COMMUNICATION NETWORKS
    • H04W72/00: Local resource management
    • H04W72/50: Allocation or scheduling criteria for wireless resources
    • H04W72/53: Allocation or scheduling criteria for wireless resources based on regulatory allocation policies
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/084: Backpropagation, e.g. using gradient descent
    • H04W72/04: Wireless resource allocation
    • H04W72/044: Wireless resource allocation based on the type of the allocated resource
    • H04W72/0473: Wireless resource allocation based on the type of the allocated resource, the resource being transmission power
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00: Reducing energy consumption in communication networks
    • Y02D30/70: Reducing energy consumption in communication networks in wireless communication networks
    • Y02E: REDUCTION OF GREENHOUSE GAS [GHG] EMISSIONS, RELATED TO ENERGY GENERATION, TRANSMISSION OR DISTRIBUTION
    • Y02E40/00: Technologies for an efficient electrical power generation, transmission or distribution
    • Y02E40/70: Smart grids as climate change mitigation technology in the energy generation sector
    • Y04: INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S: SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00: Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50: Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Abstract

The invention discloses an uplink NOMA resource allocation method based on deep reinforcement learning. The method improves the energy efficiency of the whole system and effectively reduces the power consumed for transmission by selecting the optimal sub-channel allocation strategy and power allocation strategy while satisfying the minimum transmission rate of each user. The method is based on the deep Q network (DQN) in deep reinforcement learning and adjusts the network parameters according to the feedback of the NOMA system, thereby achieving optimal sub-channel and power allocation. The method adapts the deep Q network to the continuous resource allocation task through power discretization, reduces the output dimension of the network by using a distributed network structure, and further improves the performance of the whole resource allocation network. Compared with other methods, the method achieves better average overall energy efficiency and performs well under different transmission power limits.

Description

Deep reinforcement learning-based uplink NOMA resource allocation method
Technical Field
The invention relates to the fields of mobile communications and reinforcement learning, and in particular to an uplink NOMA radio resource allocation method based on deep reinforcement learning.
Background
Fifth-generation (5G) communication networks are required to meet the rapidly increasing demand for wireless data traffic, support high-density mobile user communications, and provide various wireless network services. The recently proposed Non-Orthogonal Multiple Access (NOMA) technology is considered an emerging technology that can effectively increase network capacity and meet the requirements of low latency, massive connectivity and high throughput. On one hand, compared with conventional Orthogonal Multiple Access (OMA), NOMA uses the Superposition Coding (SC) technique at the transmitting end to allocate the same sub-channel to multiple users at different power levels for simultaneous transmission, so that channel resources are shared, and then uses the Successive Interference Cancellation (SIC) technique at the receiving end to remove the interference. The spectrum efficiency and system capacity are thus greatly improved, making NOMA very suitable for future mobile communications.
On the other hand, since the performance gain of a NOMA system is closely related to how sub-channels and transmission power are allocated, the energy efficiency of the whole NOMA system can be maximized by designing a reasonable resource allocation scheme. In this way, a higher transmission rate is obtained with lower transmit power, and unnecessary resource waste is reduced while the advantages of the NOMA technology are fully exploited. Different approaches have been proposed in existing research to find the optimal resource allocation scheme for NOMA systems.
A search of the existing literature found the following. Manglayev et al. published a paper entitled "Optimal Power Allocation for Non-Orthogonal Multiple Access (NOMA)" in IEEE International Conference on Application of Information and Communication Technologies, Oct. 2016, pp. 1-4. This paper presents a power allocation algorithm that maximizes capacity in combination with a fairness factor, and simulations demonstrate that higher spectral efficiency can be achieved with NOMA than with the original OMA technology. Zhang et al. published a paper entitled "Energy-efficient transmission design in non-orthogonal multiple access" in IEEE Transactions on Vehicular Technology, Mar. 2017, vol. 66, no. 3, pp. 2852-2857. This paper proposes a power allocation strategy that maximizes energy efficiency while meeting the minimum rate requirements of the users. In addition, a paper entitled "Downlink power allocation for CoMP-NOMA in multi-cell networks" published by M. S. Ali et al. in IEEE Transactions on Communications, Sep. 2018, vol. 66, no. 9, pp. 3982-3998, studies a downlink power allocation scheme for multi-cell coordinated multi-point NOMA, proposes a distributed power optimization algorithm to reduce the computational complexity, and analyzes the spectrum efficiency and energy efficiency of the multi-cell NOMA system through simulation. All three works focus only on the power allocation scheme in the NOMA system; however, the quality of the sub-channel allocation scheme also has a great influence on the overall system efficiency.
It was also found that C. L. Wang et al. published a paper entitled "Low-Complexity Resource Allocation for Downlink Multicarrier NOMA Systems" in IEEE International Symposium on Personal, Indoor and Mobile Radio Communications, Sep. 2018, pp. 1-6, which, building on general power allocation studies, proposes a low-complexity joint sub-channel and power allocation method for NOMA systems. In this method, the optimal power allocation factor is obtained in closed form, the optimal subcarrier is selected based on a low-complexity channel gain criterion, and a system capacity better than that of the traditional orthogonal frequency division multiple access scheme can be obtained. Although the method has low computational complexity, it cannot guarantee that the optimal resource allocation scheme is found.
A patent search found that Zhu Rong et al. of Nanjing University of Posts and Telecommunications invented "a resource allocation method in a downlink MIMO-NOMA network" (publication No. 109922487A). That invention discloses a resource allocation method for a downlink NOMA system. The method clusters users by acquiring their channel state information and then assigns beam directions to the clustered users using zero-forcing beamforming theory. With power allocation and channel allocation each fixed in turn, the optimal channel allocation scheme and power allocation scheme are obtained using the Hungarian algorithm and a sub-gradient algorithm, respectively, iterating alternately until the user capacity converges, thereby obtaining the optimal resource allocation scheme. In addition, the search found that Down Jie et al. of South China University of Technology invented "a resource allocation method for a deep-learning-based energy-carrying NOMA system" (publication No. 108924935A). That invention discloses a joint resource allocation method based on deep learning, which minimizes the transmission power while satisfying the users' Quality of Service (QoS). The method first constructs a mathematical optimization problem of joint resource allocation based on transmission power minimization in the energy-carrying NOMA system, including the optimization variables, the optimization objective function and the constraint conditions. Then, a large amount of sample data is generated with a genetic algorithm, and a deep belief network is trained to capture the latent mapping between the inputs and outputs of the data samples. Finally, in the operation stage, the trained network directly outputs the optimal carrier and power allocation strategy. Once the network is trained, the method can efficiently obtain a resource allocation scheme, realizes low-power resource allocation, and better meets low-latency requirements.
Although these existing resource allocation schemes improve the energy efficiency or other metrics of the NOMA system to some extent, they have certain limitations. For conventional model-based resource allocation schemes, the computational complexity of the optimization process is high and the iterative algorithms take a long time. Although deep-learning-based optimization algorithms reduce the computational complexity, a large amount of time is still needed to construct enough sample data to train the network before good performance can be achieved.
Disclosure of Invention
The technical problem to be solved by the present invention is to overcome the defects of the prior art and provide a joint sub-channel allocation and power allocation method for the uplink NOMA multi-user scenario based on Deep Reinforcement Learning (DRL), which maximizes the energy efficiency of the whole system while guaranteeing the minimum rate requirement of each user. As a major branch of machine learning, DRL combines traditional reinforcement learning with the neural networks of deep learning: it collects feedback information from the system through continuous interaction and dynamically adjusts its parameters to make better decisions, thereby maximizing system performance. Consequently, DRL requires neither a mathematical model nor prior knowledge of the system, and is well suited to the dynamic resource allocation problem of an unknown system. The method uses the Deep Q Network (DQN) in DRL: it first selects a suitable sub-channel allocation strategy according to the channel gain information of the users, then selects a suitable power allocation strategy, and finally updates the parameters of the allocation strategies according to the feedback of the system, so that optimal sub-channel allocation and power allocation are achieved and the energy efficiency of the system is improved.
The invention is realized by the following technical scheme:
The invention relates to a sub-channel allocation and power allocation method for an uplink NOMA system based on DRL, which is used to solve the resource allocation problem of the uplink of a multi-user NOMA wireless communication system and comprises the following steps:
S1, state acquisition: at time t, the base station acquires the channel gain information of all users in the cell on the different sub-channels as the current state s_t.
S2, sub-channel allocation: the sub-channel allocation network at the base station selects the optimal sub-channel allocation scheme c_t^* according to the ε-greedy strategy.
S3, power allocation: after the sub-channel allocation scheme c_t^* is obtained, the power allocation network at the base station is activated and selects the optimal power allocation scheme p_t^* according to the ε-greedy strategy.
S4, feedback acquisition: all users transmit data to the base station on the given sub-channels at the given power according to the resource allocation scheme (c_t^*, p_t^*) output by the two networks. The base station returns corresponding feedback to the resource allocation networks.
S5, parameter update: according to the obtained feedback, the neural networks of all DQN units in the sub-channel allocation network and the power allocation network are trained based on the experience replay and fixed target Q network strategies, and the network parameters are updated so that better resource allocation schemes can be selected.
The S1) comprises the following specific steps:
At time t, the base station acquires the channel gain information of all users, and the state s_t at the current time is expressed as the channel gains of all users on the different sub-channels at that time. Let g_{k,m}(t) denote the channel gain of user m on sub-channel k; then s_t is expressed as:
s_t = {g_{1,1}(t), g_{2,1}(t), ..., g_{k,m}(t), ..., g_{K,M}(t)}
where K and M denote the number of sub-channels and the number of users in the cell, respectively, and g_{k,m}(t) includes both large-scale and small-scale fading effects. Large-scale fading refers to the fading caused by shadowing from fixed obstacles on the channel path between the user terminal and the base station, and includes average path loss and shadow fading; small-scale fading is caused by multipath effects, and its effect at the user terminal is assumed to follow a Rayleigh distribution.
The S2) comprises the following specific steps:
After the current state s_t is obtained, s_t is passed to the sub-channel allocation network at the base station. The network consists of one sub-channel allocation DQN unit. The unit contains two neural networks, a Q network Q(s, a; w) and a target Q network Q(s, a; w^-), where w and w^- denote the parameters of the two neural networks, respectively.
The Q network in the sub-channel allocation DQN unit estimates the Q values of all sub-channel allocation schemes from the obtained state s_t using the network parameters w, namely:
Q(s_t, a; w), a ∈ A_1
where A_1 denotes the set of all possible sub-channel allocation schemes.
Then, the sub-channel allocation DQN unit selects one of all sub-channel allocation schemes as the current best allocation scheme following the ε-greedy policy. Here the ε-greedy policy means: with probability 1-ε, a sub-channel allocation scheme is randomly selected from A_1 and output as the optimal sub-channel allocation scheme c_t^* at time t; otherwise, with probability ε, the scheme with the largest Q value is selected, namely:
c_t^* = argmax_{a ∈ A_1} Q(s_t, a; w)
where 0 < ε < 1. The sub-channel allocation network then outputs the sub-channel allocation scheme c_t^* at time t.
The S3) comprises the following specific steps:
After the sub-channel allocation scheme c_t^* is obtained, the power allocation network at the base station is activated. The network consists of M power allocation DQN units. Each power allocation DQN unit contains the same two neural networks as the sub-channel allocation unit, but the parameters of these networks are different.
Using the same state s_t as input, the Q network of the m-th power allocation DQN unit follows the ε-greedy strategy, in the same way as in S2, to select one element of the set A_2 of all power allocation schemes as the transmit power p_t^m of the m-th user and outputs it.
Then, the outputs of all M power allocation DQN units are combined by the power allocation network into the power allocation scheme at time t, namely:
p_t^* = {p_t^1, p_t^2, ..., p_t^M}
the S4) comprises the following specific steps:
resource allocation scheme for all users according to two network outputs
Figure BDA0002824183550000067
Data is transmitted to the base station on a given subchannel at a given power. If the transmission rate of each user can meet the minimum rate requirement, the base station calculates the sum of the energy efficiency of all users as the feedback r at the current time ttTo a subchannel distribution network and a power distribution network. If not, the feedback obtained by the two resource allocation networks is 0, i.e. the feedback is not satisfied
Figure BDA0002824183550000062
Wherein r istIndicating feedback at time t, RminIndicating a minimum rate requirement, Ek,mAnd Rk,mRespectively, energy efficiency and transmission rate of user m on subchannel k. Thereafter, the base station acquires new channel gain information as a new state s due to the movement of the usert+1
The S5) comprises the following specific steps:
According to the obtained system feedback r_t, the neural networks of all DQN units in the sub-channel allocation network and the power allocation network are trained based on the experience replay and fixed target Q network strategies, and the network parameters are updated so that better resource allocation schemes can be selected. The specific parameter update steps are:
(1) store the transition (s_t, a_t, r_t, s_{t+1}) at each time step into a replay memory D as training samples for the neural networks;
(2) randomly sample N groups of samples (s_i, a_i, r_i, s_{i+1}) from D to train the neural networks;
(3) for the sub-channel allocation network, the parameter w of the Q network in the sub-channel allocation DQN unit is updated by minimizing the loss function with stochastic gradient descent. The loss function is expressed as follows:
L(w) = E[(y_i - Q(s_i, a_i; w))^2]
y_i = r_i + γ max_{a'} Q(s_{i+1}, a'; w^-)
Using stochastic gradient descent, the parameter w is updated as:
w ← w - α ∇_w L(w)
where y_i denotes the target Q value produced by the target Q network Q(s, a; w^-) within the DQN unit, γ denotes the discount factor, and α denotes the learning rate.
(4) for the power allocation network, the loss functions of the M power allocation DQN units are minimized with the same stochastic gradient descent method as in (3), and the neural network parameters are updated. For the m-th power allocation unit, the loss function is expressed as follows:
L(w_m) = E[(y_i^m - Q(s_i, a_i^m; w_m))^2]
y_i^m = r_i + γ max_{a'} Q(s_{i+1}, a'; w_m^-)
where m = 1, 2, ..., M. The corresponding network parameters are then updated by stochastic gradient descent.
(5) for the M+1 target Q networks in all resource allocation DQN units, the parameter w of the corresponding Q network is copied to the target network parameter w^- every fixed number W of time steps, thereby updating the target Q network parameters.
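To make the interaction between steps S1-S5 concrete, the following is a minimal sketch of the overall training loop in Python. It is illustrative only: the objects env, subchannel_dqn, power_dqns and memory, together with their methods, are assumed placeholders standing in for the NOMA system, the sub-channel allocation DQN unit, the M power allocation DQN units and the replay memory D, and are not part of the patent itself.

```python
import random

def run_training(env, subchannel_dqn, power_dqns, memory, num_slots, batch_size):
    """Sketch of the S1-S5 loop; all objects and methods are assumed interfaces."""
    state = env.observe_channel_gains()                       # S1: current state s_t
    for t in range(num_slots):
        c = subchannel_dqn.select_action(state)               # S2: sub-channel allocation scheme
        p = [dqn.select_action(state) for dqn in power_dqns]  # S3: one power level per user
        reward, next_state = env.step(c, p)                   # S4: transmit and collect feedback r_t
        memory.append((state, (c, p), reward, next_state))    # S5(1): store transition in D
        if len(memory) >= batch_size:                         # S5(2)-(5): experience replay update
            batch = random.sample(memory, batch_size)
            subchannel_dqn.train_on(batch)
            for dqn in power_dqns:
                dqn.train_on(batch)
        state = next_state
```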
Compared with the prior art, the invention has the following beneficial effects: 1) the invention is a DRL-based, model-free joint sub-channel allocation and power allocation method with low computational complexity; it can efficiently obtain an optimal resource allocation scheme, improves the energy efficiency of the uplink NOMA system, and achieves good performance under different transmission power limits. 2) In order to apply the DQN to the power allocation task, the invention improves on the traditional DQN and provides a discretized, distributed DQN network, which reduces the output dimension of the network and further improves the performance of the whole power allocation network.
Drawings
FIG. 1 is a schematic diagram of an upstream multi-user NOMA system according to the present invention;
fig. 2 is a frame diagram of a DRL-based joint sub-channel and power allocation method according to the present invention;
FIG. 3 is a graphical illustration of the loss function over time for different learning rates according to the method of the present invention;
fig. 4 is a graph comparing the average total energy efficiency of the DRL-based joint subchannel and power allocation method of the present invention with other methods;
fig. 5 is a diagram illustrating average total energy efficiency of the DRL-based joint subchannel and power allocation method of the present invention and other methods under different transmission power constraints.
Detailed Description
The following is a detailed description of embodiments of the present invention, implemented on the premise of the technical solution of the present invention; a detailed implementation manner and a specific operation process are given, but the protection scope of the present invention is not limited to the following embodiments. All other embodiments that a person skilled in the art can derive from the embodiments of the invention without creative effort fall within the protection scope of the invention.
The invention relates to a joint sub-channel allocation and power allocation method for an uplink NOMA system based on DRL. As shown in fig. 1, the base station in the NOMA wireless communication system is located at the center of a cell, and the sub-channel allocation network and the power allocation network of the present invention are both located in a DRL controller at the base station. The M users are randomly distributed in the cell and move randomly between time slots. The total bandwidth of the base station is divided equally into K mutually orthogonal sub-channels. Each sub-channel can serve multiple users simultaneously. The maximum transmit power of each user terminal is P_max. Let b_{k,m}(t) and p_{k,m}(t) denote, respectively, the sub-channel allocation flag and the allocated power of user m on sub-channel k at time t, where b_{k,m}(t) = 1 means that user m is allocated to sub-channel k at time t, and b_{k,m}(t) = 0 otherwise.
The embodiment is realized by the following steps:
S1) state acquisition: the base station obtains the channel gain information of all users in the cell on the different sub-channels at time t as the current state s_t.
Let g_{k,m}(t) denote the channel gain of user m on sub-channel k at time t. It consists of two parts, the large-scale fading β_{k,m}(t) and the small-scale fading h_{k,m}(t) at time t. Large-scale fading refers to the fading caused by shadowing from fixed obstacles on the channel path between the user terminal and the base station, and includes average path loss and shadow fading; small-scale fading is caused by multipath effects, and its effect at the user terminal is assumed to follow a Rayleigh distribution. The channel gain g_{k,m}(t) can then be expressed in terms of these two components as:
g_{k,m}(t) = β_{k,m}(t) |h_{k,m}(t)|^2
The state s_t at the current time t is expressed as follows:
s_t = {g_{1,1}(t), g_{2,1}(t), ..., g_{k,m}(t), ..., g_{K,M}(t)}
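As an illustration of how such a state vector could be assembled in a simulation, the following Python sketch builds g_{k,m}(t) from a log-distance path-loss term plus log-normal shadowing for the large-scale part and Rayleigh fading for the small-scale part. The numeric defaults (path-loss exponent, shadowing deviation) and the exact combination g = β·|h|^2 are illustrative assumptions, not values taken from the patent.

```python
import numpy as np

def channel_state(distances_m, K, path_loss_exp=3.0, shadowing_db=8.0):
    """Return s_t as a flattened K x M vector of channel gains g_{k,m}(t).

    distances_m: array of M user-to-base-station distances in meters.
    The constants and the exact gain model are illustrative assumptions.
    """
    M = len(distances_m)
    # Large-scale fading beta_{k,m}(t): average path loss plus log-normal shadow fading.
    path_loss_db = 10.0 * path_loss_exp * np.log10(np.asarray(distances_m, dtype=float))
    shadow_db = np.random.normal(0.0, shadowing_db, size=(K, M))
    beta = 10.0 ** (-(path_loss_db[None, :] + shadow_db) / 10.0)
    # Small-scale fading h_{k,m}(t): Rayleigh, i.e. complex Gaussian coefficient.
    h = (np.random.randn(K, M) + 1j * np.random.randn(K, M)) / np.sqrt(2.0)
    g = beta * np.abs(h) ** 2                 # channel gain g_{k,m}(t)
    return g.flatten()                        # state s_t
```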
s2) sub-channel allocation: according to the obtained stThe sub-channel allocation network within the DRL controller at the base station follows the epsilon-greedy policy to select the optimal sub-channel allocation scheme
Figure BDA0002824183550000094
Sub-channel allocation scheme
Figure BDA0002824183550000095
The mark b can be assigned by a subchannelk,m(t) is expressed as:
Figure BDA00028241835500000910
wherein b isk,mThe value of (t) may be 0 or 1. All possible allocation schemes constitute a set a of sub-channel allocation schemes1
State s obtained by the base stationtIs transmitted to a subchannel allocation network within the DRL controller, which network consists of one subchannel allocation DQN unit. The unit comprises two neural networks, namely a Q network Q (s, a; w) and a target Q network Q (s, a; w)-) W and w-Respectively representing the network parameters of the two networks. The Q network is used to estimate the Q value of the selected action, and the target Q network is used to generate a target Q value to train the network parameters.
Using the obtained stAs input, the Q network in the subchannel allocation DQN unit outputs estimated Q values for all subchannel allocation schemes using parameter w, i.e.:
Figure BDA0002824183550000091
after all estimated Q values are obtained, the sub-channel assignment DQN unit follows the epsilon-greedy strategy from A1One scheme is selected as the optimal scheme at the current moment tSub-channel allocation scheme
Figure BDA00028241835500000911
Wherein, the epsilon-greedy strategy is as follows: from A with a probability of 1-epsilon1In the scheme of randomly selecting a sub-scheme
Figure BDA00028241835500000912
Or selecting the scheme with the maximum estimated Q value with the probability epsilon, i.e.
Figure BDA0002824183550000092
Wherein the value range of epsilon is more than 0 and less than 1. The smaller epsilon, the more likely the base station is to attempt to select other allocation schemes, and the larger epsilon, the more likely the base station is to select the allocation scheme with the largest Q value.
Then, the sub-channel distribution network outputs the optimal sub-channel distribution scheme at the time t
Figure BDA0002824183550000096
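The ε-greedy selection described above can be sketched as follows. Note that the code keeps the patent's convention, in which ε is the probability of taking the greedy (largest-Q) action and 1-ε is the probability of exploring; the q_net object and its predict method are assumed placeholders for the Q network Q(s, a; w).

```python
import numpy as np

def epsilon_greedy_select(q_net, state, num_actions, epsilon):
    """Pick an action index from A_1 using the patent's epsilon-greedy convention:
    probability epsilon -> exploit (largest estimated Q value),
    probability 1 - epsilon -> explore (uniformly random scheme)."""
    q_values = q_net.predict(state)              # estimated Q(s_t, a; w) for all a (assumed API)
    if np.random.rand() < epsilon:
        return int(np.argmax(q_values))          # greedy choice
    return int(np.random.randint(num_actions))   # random exploration
```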
S3) power allocation: after the sub-channel allocation scheme c_t^* is obtained, the power allocation network in the DRL controller at the base station is activated and selects the optimal power allocation scheme p_t^* according to the ε-greedy strategy.
The power allocation scheme p_t^* can be expressed in terms of the power p_{k,m}(t) that each user can be allocated on the different sub-channels:
p_t^* = {p_{k,m}(t) | k = 1, ..., K; m = 1, ..., M}
where 0 ≤ p_{k,m}(t) ≤ P_max. Since only the transmit power on the sub-channel actually allocated to user m needs to be decided, and the power of user m on the other sub-channels need not be considered, we set:
p_{k,m}(t) = 0 whenever b_{k,m}(t) = 0
This reduces the dimensionality of the DQN unit outputs and thus improves performance.
Furthermore, since the power interval available for allocation is continuous, the power has to be discretized to fit the input and output of the DQN. However, discretizing the power of all users jointly causes an exponential increase of the output dimension, so the scheme uses a distributed architecture to solve this problem.
In this scheme, the power allocation network in the DRL controller contains M power allocation DQN units, and each unit is responsible for the power allocation task of one user. The power allocation scheme p_t^* is then rewritten as:
p_t^* = {p_t^1, p_t^2, ..., p_t^M}
where p_t^m denotes the power allocation decision made by the m-th power allocation DQN unit at time t. Assuming the power is discretized into L levels, p_t^m has L candidate values, denoted as:
p_t^m ∈ {P_1, P_2, ..., P_L}
where P_1, ..., P_L are the discrete power levels obtained by quantizing the allowed power range (0, P_max].
the letter is obtained in S2Lane allocation scheme
Figure BDA0002824183550000107
Thereafter, M power distribution DQN units within the power distribution network are activated. Each power distribution DQN unit contains the same two neural networks as the subchannel distribution DQN unit described above, but the parameters of these neural networks are different. Using the same state stAs input, the Q network of the mth power allocation unit outputs the estimated Q value and selects one from all power allocation schemes as the transmission power of the mth user following the epsilon-greedy policy
Figure BDA0002824183550000108
And (6) outputting. Combining the power of the M outputs into a power allocation scheme
Figure BDA0002824183550000109
As the optimal sub-channel allocation scheme at time t
Figure BDA00028241835500001010
And (6) outputting.
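A minimal sketch of the discretized, distributed power selection is given below. Each of the M per-user DQN units chooses one of L power levels and the choices are concatenated into p_t^*; the per-user dqn objects and their predict method are assumed placeholders, and the uniform level spacing is an assumption made here for illustration.

```python
import numpy as np

def select_power_scheme(power_dqns, state, p_max, num_levels, epsilon):
    """Return p_t^* = [p_t^1, ..., p_t^M], one discrete power level per user."""
    levels = np.linspace(p_max / num_levels, p_max, num_levels)   # L candidate power levels (assumed uniform)
    scheme = []
    for dqn in power_dqns:                        # one DQN unit per user (distributed structure)
        q_values = dqn.predict(state)             # assumed API returning L Q-value estimates
        if np.random.rand() < epsilon:
            idx = int(np.argmax(q_values))        # greedy level (patent's epsilon convention)
        else:
            idx = int(np.random.randint(num_levels))
        scheme.append(float(levels[idx]))
    return scheme
```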
S4) feedback acquisition: all users transmit data to the base station on the given sub-channels at the given power according to the resource allocation scheme (c_t^*, p_t^*) output by the sub-channel allocation network and the power allocation network. The base station returns the sum of the energy efficiencies of all users as feedback.
Once the sub-channel allocation scheme c_t^* and the power allocation scheme p_t^* are known, all values of b_{k,m}(t) and p_{k,m}(t) are determined. According to the uplink NOMA transmission principle, the signal-to-interference-plus-noise ratio of user m on sub-channel k is expressed as follows:
γ_{k,m}(t) = p_{k,m}(t) g_{k,m}(t) / (I_{k,m}(t) + σ²)
where I_{k,m}(t) denotes the interference from the other users multiplexed on sub-channel k that has not yet been removed by successive interference cancellation, and σ² denotes the variance of the Gaussian white noise. Using normalized bandwidth, the corresponding transmission rate is then expressed as:
R_{k,m}(t) = log(1 + γ_{k,m}(t))
The uplink energy efficiency of user m on sub-channel k is:
E_{k,m}(t) = R_{k,m}(t) / (p_{k,m}(t) + P_m)
where P_m represents the part of the power consumed by the operation of the user equipment itself.
The feedback at time t is defined as the sum of the energy efficiencies of all users on all sub-channels at the current time. If the transmission rate R_{k,m}(t) of every user satisfies the minimum rate requirement R_min, the base station computes the sum of the energy efficiencies of all users as the feedback r_t at the current time t and returns it to the sub-channel allocation unit and all power allocation units; otherwise, the feedback r_t obtained by all resource allocation units equals 0, i.e.:
r_t = Σ_k Σ_m E_{k,m}(t), if R_{k,m}(t) ≥ R_min for every served user; r_t = 0, otherwise
Then, because the users move, the channel gain information of all users changes, and the base station acquires the channel gain information of all users again as the new state s_{t+1}.
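The feedback computation can be sketched as follows. The SIC decoding order is an assumption made here (the base station is taken to decode users on a sub-channel in descending order of received power, so residual interference comes from the not-yet-decoded users); the circuit power term p_circuit corresponds to P_m above, and all array shapes are illustrative.

```python
import numpy as np

def compute_feedback(b, p, g, noise_var, p_circuit, r_min):
    """Return r_t: the sum of user energy efficiencies, or 0 if any rate is below R_min.

    b, p, g: K x M arrays of allocation flags b_{k,m}(t), powers p_{k,m}(t) and gains g_{k,m}(t).
    The SIC decoding order below is an assumption, not stated explicitly in the patent text.
    """
    K, M = b.shape
    total_ee = 0.0
    for k in range(K):
        users = [m for m in range(M) if b[k, m] == 1]
        users.sort(key=lambda m: p[k, m] * g[k, m], reverse=True)     # assumed SIC order
        for i, m in enumerate(users):
            interference = sum(p[k, j] * g[k, j] for j in users[i + 1:])
            sinr = p[k, m] * g[k, m] / (interference + noise_var)
            rate = np.log2(1.0 + sinr)                                # normalized bandwidth
            if rate < r_min:
                return 0.0                                            # rate constraint violated
            total_ee += rate / (p[k, m] + p_circuit)                  # energy efficiency of user m
    return total_ee
```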
S5) parameter update: according to the system feedback r_t obtained in S4, the neural networks of all DQN units in the sub-channel allocation network and the power allocation network are trained based on the experience replay and fixed target Q network strategies, and the network parameters are updated so that better resource allocation schemes can be selected. The specific parameter update steps are:
(1) store the transition (s_t, a_t, r_t, s_{t+1}) at each time step into a replay memory D as training samples for the neural networks;
(2) randomly sample N groups of samples (s_i, a_i, r_i, s_{i+1}) from D to train the neural networks;
(3) for the sub-channel allocation network, the parameter w of the Q network of the sub-channel allocation DQN unit is updated by minimizing the loss function with stochastic gradient descent. The loss function of the sub-channel allocation DQN unit is expressed as follows:
L(w) = E[(y_i - Q(s_i, a_i; w))^2]
y_i = r_i + γ max_{a'} Q(s_{i+1}, a'; w^-)
Using stochastic gradient descent, the parameter w is updated as:
w ← w - α ∇_w L(w)
where y_i denotes the target Q value produced by the target Q network Q(s, a; w^-) within the DQN unit, γ denotes the discount factor, and α denotes the learning rate.
(4) for the power allocation network, the loss functions of the M power allocation DQN units are minimized with the same stochastic gradient descent method as in (3), and the neural network parameters are updated. For the m-th power allocation unit, the loss function is expressed as follows:
L(w_m) = E[(y_i^m - Q(s_i, a_i^m; w_m))^2]
y_i^m = r_i + γ max_{a'} Q(s_{i+1}, a'; w_m^-)
Using stochastic gradient descent, the parameters of the m-th power allocation unit are updated as:
w_m ← w_m - α ∇_{w_m} L(w_m)
where m = 1, 2, ..., M.
(5) for the M+1 target Q networks in all resource allocation DQN units, the parameter w of the corresponding Q network is copied to the target network parameter w^- every fixed number W of time steps, thereby updating the target Q network parameters.
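The per-unit update in steps (1)-(5) corresponds to a standard DQN training step with experience replay and a fixed target network. The following PyTorch-style sketch illustrates it under the assumption that each DQN unit is an ordinary feed-forward network mapping the state to one Q value per action; q_net, target_net, optimizer and memory are assumed objects, not the patent's implementation.

```python
import random
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, memory, batch_size, gamma, step, sync_every):
    """One training step for a single DQN unit (experience replay + fixed target network)."""
    if len(memory) < batch_size:
        return
    batch = random.sample(memory, batch_size)                 # (2) sample N transitions from D
    states = torch.stack([torch.as_tensor(s, dtype=torch.float32) for s, a, r, s2 in batch])
    actions = torch.tensor([a for s, a, r, s2 in batch], dtype=torch.int64)
    rewards = torch.tensor([r for s, a, r, s2 in batch], dtype=torch.float32)
    next_states = torch.stack([torch.as_tensor(s2, dtype=torch.float32) for s, a, r, s2 in batch])

    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)       # Q(s_i, a_i; w)
    with torch.no_grad():
        y = rewards + gamma * target_net(next_states).max(dim=1).values   # target y_i from w^-
    loss = F.mse_loss(q_sa, y)                                            # L(w) = E[(y_i - Q)^2]

    optimizer.zero_grad()
    loss.backward()                                                       # gradient step on w
    optimizer.step()

    if step % sync_every == 0:                                            # (5) copy w to w^- periodically
        target_net.load_state_dict(q_net.state_dict())
```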
Fig. 2 is a frame diagram of a method for joint sub-channel and power allocation based on DRL according to the present invention.
In this example, a multi-user uplink NOMA scenario is considered, and joint sub-channel and power allocation is optimized for all users; the main parameters of the simulation scenario are listed in Table 1.
TABLE 1 Main parameters of the simulation scenario
FIG. 3 shows the loss function over time for different learning rates under the method of the present invention. The curves correspond, from top to bottom, to the learning rate α set to 0.001, 0.005 and 0.01. Simulation results show that the algorithm of the invention converges well. As shown in fig. 3, the loss functions for the three learning rates are initially large, decrease rapidly as the number of slots increases, and all converge within 20 steps. In particular, when α = 0.01, only a few steps are needed for the loss function to reach its minimum and then stabilize. Using this learning rate therefore provides a faster convergence rate, so that the prediction of the Q value becomes more accurate and the performance of the network becomes better.
Fig. 4 compares the average energy efficiency of the DRL-based joint sub-channel and power allocation method of the present invention with other methods. The curves correspond, from top to bottom, to the DRL-based resource allocation method proposed by the invention (DQN), a method using exhaustive search and random transmission power (OptRP), a method using exhaustive search and maximum transmission power (OptMP), and a method using random sub-channels and maximum transmission power (RCMP). Here, exhaustive search refers to traversing all sub-channel schemes and then selecting the sub-channel allocation scheme that yields the highest energy efficiency. It should be noted that, to show the simulation results more clearly, the total energy efficiency is a running average taken every 100 steps. It can be seen from the figure that the energy efficiency of the NOMA system applying the resource allocation scheme of the present invention is much higher than that of the other methods. The method of the invention can dynamically select the transmit power according to the real-time channel information of the users and adaptively adjust the resource allocation scheme. On the basis of meeting the minimum rate requirement, unnecessary transmission power is reduced, so that higher energy efficiency can be provided. It can also be seen by comparison that the energy efficiency obtained using an exhaustive search far exceeds that obtained using random sub-channels, which again shows that the sub-channel assignment has a significant impact on the performance gain of the overall NOMA system.
Fig. 5 is a schematic diagram of the average energy efficiency of the DRL-based joint subchannel and power allocation method of the present invention and other methods under different transmission power constraints. The figure shows the average energy efficiency of each scheme over all time slots under different maximum transmission power constraints. It can be seen from the figure that as the maximum transmission power increases, the average energy efficiency of the method also increases and approaches a maximum value, while the average energy efficiency of the other three methods decreases to different degrees after increasing. Furthermore, it can be seen from the figure that the method of the present invention is superior to other methods under most maximum transmission power conditions.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions deviate from the technical solutions of the embodiments of the present invention.

Claims (6)

1. An uplink NOMA resource allocation method based on deep reinforcement learning, characterized by comprising the following steps:
S1, state acquisition: at time t, the base station acquires the channel gain information of all users in the cell on the different sub-channels as the current state s_t;
S2, sub-channel allocation: the sub-channel allocation network at the base station selects the optimal sub-channel allocation scheme c_t^* according to the ε-greedy strategy;
S3, power allocation: after the sub-channel allocation scheme c_t^* is obtained, the power allocation network at the base station is activated and selects the optimal power allocation scheme p_t^* according to the ε-greedy strategy;
S4, feedback acquisition: all users transmit data to the base station on the given sub-channels at the given power according to the resource allocation scheme (c_t^*, p_t^*); the base station returns the sum of the energy efficiencies of all users as feedback;
S5, parameter update: according to the system feedback r_t obtained in S4, the neural networks within the sub-channel allocation DQN unit and all power allocation DQN units are trained based on the experience replay and fixed target Q network strategies, and the parameters of the networks are updated to better select resource allocation schemes.
2. The deep reinforcement learning-based uplink NOMA resource allocation method according to claim 1, wherein the channel gain information in S1 includes large-scale fading and small-scale fading, and at time t the channel gain information of all users on the different sub-channels constitutes the state s_t.
3. The deep reinforcement learning-based uplink NOMA resource allocation method according to claim 1, wherein the sub-channel allocation in S2 comprises the following specific steps:
after the current state s_t is obtained, s_t is passed to the sub-channel allocation DQN unit at the base station; the Q network Q(s, a; w) in the unit estimates, from the obtained state s_t and the network parameters w, the Q values Q(s_t, a; w), a ∈ A_1, of all sub-channel allocation schemes, where A_1 denotes the set of all sub-channel allocation schemes;
the sub-channel allocation DQN unit selects one of all sub-channel allocation schemes according to the ε-greedy strategy, which is: with probability 1-ε, randomly select a sub-channel allocation scheme c_t^* from A_1; or, with probability ε, select the scheme with the largest Q value, namely:
c_t^* = argmax_{a ∈ A_1} Q(s_t, a; w)
where 0 < ε < 1.
4. The deep reinforcement learning-based uplink NOMA resource allocation method according to claim 1, wherein the power allocation in S3 comprises the following specific steps:
after the sub-channel allocation scheme c_t^* is obtained, the M power allocation DQN units in the power allocation network at the base station are activated; using the same state s_t as input, the Q network of the m-th power allocation DQN unit estimates the corresponding Q values and then, according to the ε-greedy policy, selects one element from the set of all power allocation schemes as the transmit power p_t^m of the m-th user and outputs it; the M output powers are then combined into the power allocation scheme p_t^*, namely:
p_t^* = {p_t^1, p_t^2, ..., p_t^M}
5. the deep reinforcement learning-based uplink NOMA resource allocation method according to claim 1, wherein the feedback acquisition in S4 comprises the specific steps of:
resource allocation scheme selected by all users according to subchannel allocation network and power allocation network
Figure FDA0002824183540000026
Transmitting data to the base station on a given subchannel at a given power; if the transmission rate of each user can meet the minimum rate requirement, the base station calculates the sum of the energy efficiency of all users as the feedback r at the current time ttTo the subchannel allocation unit and all the power allocation units; if not, all resource allocation units get a feedback of 0, i.e.
Figure FDA0002824183540000027
Wherein r istIndicating feedback at time t, RminIndicating a minimum rate requirement, Ek,mAnd Rk,mRespectively representing the energy efficiency and the transmission rate of the user m on the subchannel k; as all users move, the base station acquires new channel gain information st+1
6. The deep reinforcement learning-based uplink NOMA resource allocation method according to claim 1, wherein the parameter update in S5 comprises the following specific steps:
(1) store the transition (s_t, a_t, r_t, s_{t+1}) at each time step into a replay memory D as training samples for the neural networks;
(2) randomly sample N groups of samples (s_i, a_i, r_i, s_{i+1}) from D to train the neural networks;
(3) for the sub-channel allocation network, the parameter w of the Q network of the sub-channel allocation DQN unit is updated by minimizing the loss function with stochastic gradient descent; the loss function of the sub-channel allocation DQN unit is expressed as follows:
L(w) = E[(y_i - Q(s_i, a_i; w))^2]
y_i = r_i + γ max_{a'} Q(s_{i+1}, a'; w^-)
using stochastic gradient descent, the parameter w is updated as:
w ← w - α ∇_w L(w)
where y_i denotes the target Q value produced by the target Q network Q(s, a; w^-), γ denotes the discount factor, and α denotes the learning rate;
(4) for the power allocation network, the loss functions of all power allocation DQN units are minimized with the same stochastic gradient descent method as in (3), and the neural network parameters are updated; for the m-th power allocation DQN unit, the loss function is expressed as follows:
L(w_m) = E[(y_i^m - Q(s_i, a_i^m; w_m))^2]
y_i^m = r_i + γ max_{a'} Q(s_{i+1}, a'; w_m^-)
where m = 1, 2, ..., M;
(5) for the M+1 target Q networks in all resource allocation DQN units, the parameter w of the corresponding Q network is copied to the target network parameter w^- every fixed number W of time steps, thereby updating the target Q network parameters.
CN202011445582.6A 2020-12-08 2020-12-08 Deep reinforcement learning-based uplink NOMA resource allocation method Pending CN112566261A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011445582.6A CN112566261A (en) 2020-12-08 2020-12-08 Deep reinforcement learning-based uplink NOMA resource allocation method


Publications (1)

Publication Number Publication Date
CN112566261A true CN112566261A (en) 2021-03-26

Family

ID=75061197

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011445582.6A Pending CN112566261A (en) 2020-12-08 2020-12-08 Deep reinforcement learning-based uplink NOMA resource allocation method

Country Status (1)

Country Link
CN (1) CN112566261A (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180027507A1 (en) * 2016-07-19 2018-01-25 Institut Mines-Telecom / Telecom Bretagne Method and apparatus for power and user distribution to sub-bands in noma systems
CN108737057A (en) * 2018-04-27 2018-11-02 南京邮电大学 Multicarrier based on deep learning recognizes NOMA resource allocation methods

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XIAOMING WANG ET AL.: "DRL-Based Energy-Efficient Resource Allocation Frameworks for Uplink NOMA Systems", IEEE Internet of Things Journal *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113253249A (en) * 2021-04-19 2021-08-13 中国电子科技集团公司第二十九研究所 MIMO radar power distribution design method based on deep reinforcement learning
CN113253249B (en) * 2021-04-19 2023-04-28 中国电子科技集团公司第二十九研究所 MIMO radar power distribution design method based on deep reinforcement learning
CN113242602A (en) * 2021-05-10 2021-08-10 内蒙古大学 Millimeter wave large-scale MIMO-NOMA system resource allocation method and system
CN113242602B (en) * 2021-05-10 2022-04-22 内蒙古大学 Millimeter wave large-scale MIMO-NOMA system resource allocation method and system
CN113162682A (en) * 2021-05-13 2021-07-23 重庆邮电大学 PD-NOMA-based multi-beam LEO satellite system resource allocation method
CN113162682B (en) * 2021-05-13 2022-06-24 重庆邮电大学 PD-NOMA-based multi-beam LEO satellite system resource allocation method
CN113543271A (en) * 2021-06-08 2021-10-22 西安交通大学 Effective capacity-oriented resource allocation method and system
CN113543271B (en) * 2021-06-08 2022-06-07 西安交通大学 Effective capacity-oriented resource allocation method and system
CN113595609A (en) * 2021-08-13 2021-11-02 电子科技大学长三角研究院(湖州) Cellular mobile communication system cooperative signal sending method based on reinforcement learning
CN113595609B (en) * 2021-08-13 2024-01-19 电子科技大学长三角研究院(湖州) Collaborative signal transmission method of cellular mobile communication system based on reinforcement learning
CN114698077A (en) * 2022-02-16 2022-07-01 东南大学 Dynamic power distribution and energy level selection method in semi-authorization-free scene
CN114698077B (en) * 2022-02-16 2024-02-02 东南大学 Dynamic power distribution and energy level selection method in semi-unlicensed scene

Similar Documents

Publication Publication Date Title
CN112566261A (en) Deep reinforcement learning-based uplink NOMA resource allocation method
CN108737057B (en) Multi-carrier cognitive NOMA resource allocation method based on deep learning
CN104640220B (en) A kind of frequency and power distribution method based on NOMA systems
Pietrzyk et al. Multiuser subcarrier allocation for QoS provision in the OFDMA systems
US8174959B2 (en) Auction based resource allocation in wireless systems
CN110418399B (en) NOMA-based Internet of vehicles resource allocation method
CN104703270B (en) User&#39;s access suitable for isomery wireless cellular network and power distribution method
CN109996264B (en) Power allocation method for maximizing safe energy efficiency in non-orthogonal multiple access system
CN111586646B (en) Resource allocation method for D2D communication combining uplink and downlink channels in cellular network
Guo et al. Fairness-aware energy-efficient resource allocation in D2D communication networks
CN114885420A (en) User grouping and resource allocation method and device in NOMA-MEC system
CN107484180B (en) Resource allocation method based on D2D communication in very high frequency band
CN114423028B (en) CoMP-NOMA cooperative clustering and power distribution method based on multi-agent deep reinforcement learning
CN113923767A (en) Energy efficiency maximization method for multi-carrier cooperation non-orthogonal multiple access system
CN112367523B (en) Resource management method in SVC multicast based on NOMA in heterogeneous wireless network
CN108419298B (en) Power distribution method based on energy efficiency optimization in non-orthogonal multiple access system
CN113507716A (en) SWIPT-based CR-NOMA network interruption and energy efficiency optimization method
Masaracchia et al. The impact of user mobility into non-orthogonal multiple access (noma) transmission systems
CN107613565B (en) Wireless resource management method in full-duplex ultra-dense network
CN115833886A (en) Power control method of non-cellular large-scale MIMO system
CN107172574B (en) Power distribution method for D2D user to sharing frequency spectrum with cellular user
CN115243234A (en) User association and power control method and system for M2M heterogeneous network
Wang et al. Throughput maximization-based optimal power allocation for energy-harvesting cognitive radio networks with multiusers
Masaracchia et al. On the optimal user grouping in NOMA system technology
CN113141656B (en) NOMA cross-layer power distribution method and device based on improved simulated annealing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210326)