CN112566261A - Deep reinforcement learning-based uplink NOMA resource allocation method - Google Patents

Deep reinforcement learning-based uplink NOMA resource allocation method Download PDF

Info

Publication number
CN112566261A
CN112566261A (application number CN202011445582.6A)
Authority
CN
China
Prior art keywords
network
allocation
power
channel
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011445582.6A
Other languages
Chinese (zh)
Inventor
徐友云 (Xu Youyun)
李大鹏 (Li Dapeng)
蒋锐 (Jiang Rui)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Ai Er Win Technology Co ltd
Original Assignee
Nanjing Ai Er Win Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Ai Er Win Technology Co ltd filed Critical Nanjing Ai Er Win Technology Co ltd
Priority to CN202011445582.6A priority Critical patent/CN112566261A/en
Publication of CN112566261A publication Critical patent/CN112566261A/en
Pending legal-status Critical Current

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04W: WIRELESS COMMUNICATION NETWORKS
    • H04W72/00: Local resource management
    • H04W72/50: Allocation or scheduling criteria for wireless resources
    • H04W72/53: Allocation or scheduling criteria for wireless resources based on regulatory allocation policies
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/084: Backpropagation, e.g. using gradient descent
    • H04W72/04: Wireless resource allocation
    • H04W72/044: Wireless resource allocation based on the type of the allocated resource
    • H04W72/0473: Wireless resource allocation based on the type of the allocated resource, the resource being transmission power
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00: Reducing energy consumption in communication networks
    • Y02D30/70: Reducing energy consumption in communication networks in wireless communication networks
    • Y02E: REDUCTION OF GREENHOUSE GAS [GHG] EMISSIONS, RELATED TO ENERGY GENERATION, TRANSMISSION OR DISTRIBUTION
    • Y02E40/00: Technologies for an efficient electrical power generation, transmission or distribution
    • Y02E40/70: Smart grids as climate change mitigation technology in the energy generation sector
    • Y04: INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S: SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00: Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50: Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Abstract

The invention discloses an uplink NOMA resource allocation method based on deep reinforcement learning. The method improves the energy efficiency of the whole system and effectively reduces the power consumed for transmission by selecting the optimal sub-channel allocation strategy and power allocation strategy while satisfying the minimum transmission rate of each user. The method is based on the deep Q network (DQN) in deep reinforcement learning and adjusts the network parameters according to the feedback of the NOMA system, thereby achieving optimal sub-channel and power allocation. The method adapts the deep Q network to the continuous resource allocation task through power discretization, reduces the output dimension of the network by using a distributed network structure, and further improves the performance of the whole resource allocation network. Compared with other methods, the method achieves better average overall energy efficiency and performs well under different transmission power limits.

Description

Deep reinforcement learning-based uplink NOMA resource allocation method
Technical Field
The invention relates to the fields of mobile communications and reinforcement learning, and in particular to an uplink NOMA radio resource allocation method based on deep reinforcement learning.
Background
Fifth-generation (5G) communication networks are required to meet the rapidly increasing demand for wireless data traffic, support high-density mobile user communications, and provide various wireless network services. The recently proposed Non-Orthogonal Multiple Access (NOMA) technology is considered an emerging technology that can effectively increase network capacity and meet the requirements of low latency, massive connectivity and high throughput. On one hand, compared with conventional Orthogonal Multiple Access (OMA), NOMA uses the Superposition Coding (SC) technique at the transmitting end to allocate the same sub-channel to multiple users at different power levels for simultaneous transmission, so that channel resources are shared, and then uses the Successive Interference Cancellation (SIC) technique at the receiving end to remove the interference. The spectrum efficiency and system capacity are thus greatly improved, making NOMA very suitable for future mobile communications.
On the other hand, since the performance gain of a NOMA system is closely related to how sub-channels and transmission power are allocated, the energy efficiency of the whole NOMA system can be maximized by designing a reasonable resource allocation scheme. In this way, a higher transmission rate is obtained with lower transmit power, and unnecessary resource waste is reduced while the advantages of the NOMA technology are fully exploited. Different approaches have been proposed in existing research to find the optimal resource allocation scheme for NOMA systems.
A search of the existing literature found the following. Manglayev et al. published a paper entitled "Optimal Power Allocation for Non-Orthogonal Multiple Access (NOMA)" in IEEE International Conference on Application of Information and Communication Technologies, Oct. 2016, pp. 1-4. This paper presents a power allocation algorithm that maximizes capacity in combination with a fairness factor, and simulations demonstrate that higher spectral efficiency can be achieved with NOMA than with the original OMA technology. Zhang et al. published a paper entitled "Energy-efficient transmission design in non-orthogonal multiple access" in IEEE Transactions on Vehicular Technology, Mar. 2017, vol. 66, no. 3, pp. 2852-2857. This paper proposes a power allocation strategy that maximizes energy efficiency while meeting the minimum rate requirements of the users. In addition, a paper entitled "Downlink power allocation for CoMP-NOMA in multi-cell networks" published by M. S. Ali et al. in IEEE Transactions on Communications, Sep. 2018, vol. 66, no. 9, pp. 3982-3998, studies a downlink power allocation scheme for multi-cell coordinated multi-point NOMA, proposes a distributed power optimization algorithm to reduce the computational complexity, and analyzes the spectrum efficiency and energy efficiency of the multi-cell NOMA system through simulation. All three works focus only on the power allocation scheme in the NOMA system; however, the quality of the sub-channel allocation scheme also has a great influence on the overall system efficiency.
It was also found that C. L. Wang et al. published a paper entitled "Low-Complexity Resource Allocation for Downlink Multicarrier NOMA Systems" in IEEE International Symposium on Personal, Indoor and Mobile Radio Communications, Sep. 2018, pp. 1-6, which, building on general power allocation studies, proposes a low-complexity joint sub-channel and power allocation method for NOMA systems. In this method, the optimal power allocation factor is obtained in closed form, the optimal subcarrier is selected based on a low-complexity channel gain criterion, and a system capacity better than that of the traditional orthogonal frequency division multiple access scheme can be obtained. Although the method has low computational complexity, it cannot guarantee that the optimal resource allocation scheme is found.
A patent search found that Zhu Rong et al. of Nanjing University of Posts and Telecommunications invented "a resource allocation method in a downlink MIMO-NOMA network" (publication No. 109922487A). That invention discloses a resource allocation method for a downlink NOMA system. The method clusters users by acquiring their channel state information and then assigns beam directions to the clustered users using zero-forcing beamforming theory. With power allocation and channel allocation each fixed in turn, the optimal channel allocation scheme and power allocation scheme are obtained using the Hungarian algorithm and a sub-gradient algorithm, respectively, iterating alternately until the user capacity converges, thereby obtaining the optimal resource allocation scheme. In addition, the search found that Down Jie et al. of South China University of Technology invented "a resource allocation method for a deep-learning-based energy-carrying NOMA system" (publication No. 108924935A). That invention discloses a joint resource allocation method based on deep learning, which minimizes the transmission power while satisfying the users' Quality of Service (QoS). The method first constructs a mathematical optimization problem of joint resource allocation based on transmission power minimization in the energy-carrying NOMA system, including the optimization variables, the optimization objective function and the constraint conditions. Then, a large amount of sample data is generated with a genetic algorithm, and a deep belief network is trained to capture the latent mapping between the inputs and outputs of the data samples. Finally, in the operation stage, the trained network directly outputs the optimal carrier and power allocation strategy. Once the network is trained, the method can efficiently obtain a resource allocation scheme, realizes low-power resource allocation, and better meets low-latency requirements.
Although these existing resource allocation schemes improve the energy efficiency or other metrics of the NOMA system to some extent, they have certain limitations. For conventional model-based resource allocation schemes, the computational complexity of the optimization process is high and the iterative algorithms take a long time. Although deep-learning-based optimization algorithms reduce the computational complexity, a large amount of time is still needed to construct enough sample data to train the network before good performance can be achieved.
Disclosure of Invention
The technical problem to be solved by the present invention is to overcome the defects of the prior art and provide a joint sub-channel allocation and power allocation method for the uplink NOMA multi-user scenario based on Deep Reinforcement Learning (DRL), which maximizes the energy efficiency of the whole system while guaranteeing the minimum rate requirement of each user. As a major branch of machine learning, DRL combines traditional reinforcement learning with the neural networks of deep learning: it collects feedback information from the system through continuous interaction and dynamically adjusts its parameters to make better decisions, thereby maximizing system performance. Consequently, DRL requires neither a mathematical model nor prior knowledge of the system, and is well suited to the dynamic resource allocation problem of an unknown system. The method uses the Deep Q Network (DQN) in DRL: it first selects a suitable sub-channel allocation strategy according to the channel gain information of the users, then selects a suitable power allocation strategy, and finally updates the parameters of the allocation strategies according to the feedback of the system, so that optimal sub-channel allocation and power allocation are achieved and the energy efficiency of the system is improved.
The invention is realized by the following technical scheme:
The invention relates to a sub-channel allocation and power allocation method for an uplink NOMA system based on DRL, which is used to solve the resource allocation problem of the uplink of a multi-user NOMA wireless communication system and comprises the following steps:
S1, state acquisition: at time t, the base station acquires the channel gain information of all users in the cell on the different sub-channels as the current state s_t.
S2, sub-channel allocation: the sub-channel allocation network at the base station selects the optimal sub-channel allocation scheme c_t^* according to the ε-greedy strategy.
S3, power allocation: after the sub-channel allocation scheme c_t^* is obtained, the power allocation network at the base station is activated and selects the optimal power allocation scheme p_t^* according to the ε-greedy strategy.
S4, feedback acquisition: all users transmit data to the base station on the given sub-channels at the given power according to the resource allocation scheme (c_t^*, p_t^*) output by the two networks. The base station returns corresponding feedback to the resource allocation networks.
S5, parameter update: according to the obtained feedback, the neural networks of all DQN units in the sub-channel allocation network and the power allocation network are trained based on the experience replay and fixed target Q network strategies, and the network parameters are updated so that better resource allocation schemes can be selected.
The S1) comprises the following specific steps:
At time t, the base station acquires the channel gain information of all users, and the state s_t at the current time is expressed as the channel gains of all users on the different sub-channels at that time. Let g_{k,m}(t) denote the channel gain of user m on sub-channel k; then s_t is expressed as:
s_t = {g_{1,1}(t), g_{2,1}(t), ..., g_{k,m}(t), ..., g_{K,M}(t)}
where K and M denote the number of sub-channels and the number of users in the cell, respectively, and g_{k,m}(t) includes both large-scale and small-scale fading effects. Large-scale fading refers to the fading caused by shadowing from fixed obstacles on the channel path between the user terminal and the base station, and includes average path loss and shadow fading; small-scale fading is caused by multipath effects, and its effect at the user terminal is assumed to follow a Rayleigh distribution.
The S2) comprises the following specific steps:
After the current state s_t is obtained, s_t is passed to the sub-channel allocation network at the base station. The network consists of one sub-channel allocation DQN unit. The unit contains two neural networks, a Q network Q(s, a; w) and a target Q network Q(s, a; w^-), where w and w^- denote the parameters of the two neural networks, respectively.
The Q network in the sub-channel allocation DQN unit estimates the Q values of all sub-channel allocation schemes from the obtained state s_t using the network parameters w, namely:
Q(s_t, a; w), a ∈ A_1
where A_1 denotes the set of all possible sub-channel allocation schemes.
Then, the sub-channel allocation DQN unit selects one of all sub-channel allocation schemes as the current best allocation scheme following the ε-greedy policy. Here the ε-greedy policy means: with probability 1-ε, a sub-channel allocation scheme is randomly selected from A_1 and output as the optimal sub-channel allocation scheme c_t^* at time t; otherwise, with probability ε, the scheme with the largest Q value is selected, namely:
c_t^* = argmax_{a ∈ A_1} Q(s_t, a; w)
where 0 < ε < 1. The sub-channel allocation network then outputs the sub-channel allocation scheme c_t^* at time t.
The S3) comprises the following specific steps:
After the sub-channel allocation scheme c_t^* is obtained, the power allocation network at the base station is activated. The network consists of M power allocation DQN units. Each power allocation DQN unit contains the same two neural networks as the sub-channel allocation unit, but the parameters of these networks are different.
Using the same state s_t as input, the Q network of the m-th power allocation DQN unit follows the ε-greedy strategy, in the same way as in S2, to select one element of the set A_2 of all power allocation schemes as the transmit power p_t^m of the m-th user and outputs it.
Then, the outputs of all M power allocation DQN units are combined by the power allocation network into the power allocation scheme at time t, namely:
p_t^* = {p_t^1, p_t^2, ..., p_t^M}
the S4) comprises the following specific steps:
resource allocation scheme for all users according to two network outputs
Figure BDA0002824183550000067
Data is transmitted to the base station on a given subchannel at a given power. If the transmission rate of each user can meet the minimum rate requirement, the base station calculates the sum of the energy efficiency of all users as the feedback r at the current time ttTo a subchannel distribution network and a power distribution network. If not, the feedback obtained by the two resource allocation networks is 0, i.e. the feedback is not satisfied
Figure BDA0002824183550000062
Wherein r istIndicating feedback at time t, RminIndicating a minimum rate requirement, Ek,mAnd Rk,mRespectively, energy efficiency and transmission rate of user m on subchannel k. Thereafter, the base station acquires new channel gain information as a new state s due to the movement of the usert+1
The S5) comprises the following specific steps:
According to the obtained system feedback r_t, the neural networks of all DQN units in the sub-channel allocation network and the power allocation network are trained based on the experience replay and fixed target Q network strategies, and the network parameters are updated so that better resource allocation schemes can be selected. The specific parameter update steps are:
(1) store the transition (s_t, a_t, r_t, s_{t+1}) at each time step into a replay memory D as training samples for the neural networks;
(2) randomly sample N groups of samples (s_i, a_i, r_i, s_{i+1}) from D to train the neural networks;
(3) for the sub-channel allocation network, the parameter w of the Q network in the sub-channel allocation DQN unit is updated by minimizing the loss function with stochastic gradient descent. The loss function is expressed as follows:
L(w) = E[(y_i - Q(s_i, a_i; w))^2]
y_i = r_i + γ max_{a'} Q(s_{i+1}, a'; w^-)
Using stochastic gradient descent, the parameter w is updated as:
w ← w - α ∇_w L(w)
where y_i denotes the target Q value produced by the target Q network Q(s, a; w^-) within the DQN unit, γ denotes the discount factor, and α denotes the learning rate.
(4) for the power allocation network, the loss functions of the M power allocation DQN units are minimized with the same stochastic gradient descent method as in (3), and the neural network parameters are updated. For the m-th power allocation unit, the loss function is expressed as follows:
L(w_m) = E[(y_i^m - Q(s_i, a_i^m; w_m))^2]
y_i^m = r_i + γ max_{a'} Q(s_{i+1}, a'; w_m^-)
where m = 1, 2, ..., M. The corresponding network parameters are then updated by stochastic gradient descent.
(5) for the M+1 target Q networks in all resource allocation DQN units, the parameter w of the corresponding Q network is copied to the target network parameter w^- every fixed number W of time steps, thereby updating the target Q network parameters.
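To make the interaction between steps S1-S5 concrete, the following is a minimal sketch of the overall training loop in Python. It is illustrative only: the objects env, subchannel_dqn, power_dqns and memory, together with their methods, are assumed placeholders standing in for the NOMA system, the sub-channel allocation DQN unit, the M power allocation DQN units and the replay memory D, and are not part of the patent itself.

```python
import random

def run_training(env, subchannel_dqn, power_dqns, memory, num_slots, batch_size):
    """Sketch of the S1-S5 loop; all objects and methods are assumed interfaces."""
    state = env.observe_channel_gains()                       # S1: current state s_t
    for t in range(num_slots):
        c = subchannel_dqn.select_action(state)               # S2: sub-channel allocation scheme
        p = [dqn.select_action(state) for dqn in power_dqns]  # S3: one power level per user
        reward, next_state = env.step(c, p)                   # S4: transmit and collect feedback r_t
        memory.append((state, (c, p), reward, next_state))    # S5(1): store transition in D
        if len(memory) >= batch_size:                         # S5(2)-(5): experience replay update
            batch = random.sample(memory, batch_size)
            subchannel_dqn.train_on(batch)
            for dqn in power_dqns:
                dqn.train_on(batch)
        state = next_state
```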
Compared with the prior art, the invention has the following beneficial effects: 1) the invention is a DRL-based, model-free joint sub-channel allocation and power allocation method with low computational complexity; it can efficiently obtain an optimal resource allocation scheme, improves the energy efficiency of the uplink NOMA system, and achieves good performance under different transmission power limits. 2) In order to apply the DQN to the power allocation task, the invention improves on the traditional DQN and provides a discretized, distributed DQN network, which reduces the output dimension of the network and further improves the performance of the whole power allocation network.
Drawings
FIG. 1 is a schematic diagram of an upstream multi-user NOMA system according to the present invention;
fig. 2 is a frame diagram of a DRL-based joint sub-channel and power allocation method according to the present invention;
FIG. 3 is a graphical illustration of the loss function over time for different learning rates according to the method of the present invention;
fig. 4 is a graph comparing the average total energy efficiency of the DRL-based joint subchannel and power allocation method of the present invention with other methods;
fig. 5 is a diagram illustrating average total energy efficiency of the DRL-based joint subchannel and power allocation method of the present invention and other methods under different transmission power constraints.
Detailed Description
The following is a detailed description of embodiments of the present invention, implemented on the premise of the technical solution of the present invention; a detailed implementation manner and a specific operation process are given, but the protection scope of the present invention is not limited to the following embodiments. All other embodiments that a person skilled in the art can derive from the embodiments of the invention without creative effort fall within the protection scope of the invention.
The invention relates to a joint sub-channel allocation and power allocation method for an uplink NOMA system based on DRL. As shown in fig. 1, the base station in the NOMA wireless communication system is located at the center of a cell, and the sub-channel allocation network and the power allocation network of the present invention are both located in a DRL controller at the base station. The M users are randomly distributed in the cell and move randomly between time slots. The total bandwidth of the base station is divided equally into K mutually orthogonal sub-channels. Each sub-channel can serve multiple users simultaneously. The maximum transmit power of each user terminal is P_max. Let b_{k,m}(t) and p_{k,m}(t) denote, respectively, the sub-channel allocation flag and the allocated power of user m on sub-channel k at time t, where b_{k,m}(t) = 1 means that user m is allocated to sub-channel k at time t, and b_{k,m}(t) = 0 otherwise.
The embodiment is realized by the following steps:
S1) state acquisition: the base station obtains the channel gain information of all users in the cell on the different sub-channels at time t as the current state s_t.
Let g_{k,m}(t) denote the channel gain of user m on sub-channel k at time t. It consists of two parts, the large-scale fading β_{k,m}(t) and the small-scale fading h_{k,m}(t) at time t. Large-scale fading refers to the fading caused by shadowing from fixed obstacles on the channel path between the user terminal and the base station, and includes average path loss and shadow fading; small-scale fading is caused by multipath effects, and its effect at the user terminal is assumed to follow a Rayleigh distribution. The channel gain g_{k,m}(t) can then be expressed in terms of these two components as:
g_{k,m}(t) = β_{k,m}(t) |h_{k,m}(t)|^2
The state s_t at the current time t is expressed as follows:
s_t = {g_{1,1}(t), g_{2,1}(t), ..., g_{k,m}(t), ..., g_{K,M}(t)}
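As an illustration of how such a state vector could be assembled in a simulation, the following Python sketch builds g_{k,m}(t) from a log-distance path-loss term plus log-normal shadowing for the large-scale part and Rayleigh fading for the small-scale part. The numeric defaults (path-loss exponent, shadowing deviation) and the exact combination g = β·|h|^2 are illustrative assumptions, not values taken from the patent.

```python
import numpy as np

def channel_state(distances_m, K, path_loss_exp=3.0, shadowing_db=8.0):
    """Return s_t as a flattened K x M vector of channel gains g_{k,m}(t).

    distances_m: array of M user-to-base-station distances in meters.
    The constants and the exact gain model are illustrative assumptions.
    """
    M = len(distances_m)
    # Large-scale fading beta_{k,m}(t): average path loss plus log-normal shadow fading.
    path_loss_db = 10.0 * path_loss_exp * np.log10(np.asarray(distances_m, dtype=float))
    shadow_db = np.random.normal(0.0, shadowing_db, size=(K, M))
    beta = 10.0 ** (-(path_loss_db[None, :] + shadow_db) / 10.0)
    # Small-scale fading h_{k,m}(t): Rayleigh, i.e. complex Gaussian coefficient.
    h = (np.random.randn(K, M) + 1j * np.random.randn(K, M)) / np.sqrt(2.0)
    g = beta * np.abs(h) ** 2                 # channel gain g_{k,m}(t)
    return g.flatten()                        # state s_t
```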
s2) sub-channel allocation: according to the obtained stThe sub-channel allocation network within the DRL controller at the base station follows the epsilon-greedy policy to select the optimal sub-channel allocation scheme
Figure BDA0002824183550000094
Sub-channel allocation scheme
Figure BDA0002824183550000095
The mark b can be assigned by a subchannelk,m(t) is expressed as:
Figure BDA00028241835500000910
wherein b isk,mThe value of (t) may be 0 or 1. All possible allocation schemes constitute a set a of sub-channel allocation schemes1
State s obtained by the base stationtIs transmitted to a subchannel allocation network within the DRL controller, which network consists of one subchannel allocation DQN unit. The unit comprises two neural networks, namely a Q network Q (s, a; w) and a target Q network Q (s, a; w)-) W and w-Respectively representing the network parameters of the two networks. The Q network is used to estimate the Q value of the selected action, and the target Q network is used to generate a target Q value to train the network parameters.
Using the obtained stAs input, the Q network in the subchannel allocation DQN unit outputs estimated Q values for all subchannel allocation schemes using parameter w, i.e.:
Figure BDA0002824183550000091
after all estimated Q values are obtained, the sub-channel assignment DQN unit follows the epsilon-greedy strategy from A1One scheme is selected as the optimal scheme at the current moment tSub-channel allocation scheme
Figure BDA00028241835500000911
Wherein, the epsilon-greedy strategy is as follows: from A with a probability of 1-epsilon1In the scheme of randomly selecting a sub-scheme
Figure BDA00028241835500000912
Or selecting the scheme with the maximum estimated Q value with the probability epsilon, i.e.
Figure BDA0002824183550000092
Wherein the value range of epsilon is more than 0 and less than 1. The smaller epsilon, the more likely the base station is to attempt to select other allocation schemes, and the larger epsilon, the more likely the base station is to select the allocation scheme with the largest Q value.
Then, the sub-channel distribution network outputs the optimal sub-channel distribution scheme at the time t
Figure BDA0002824183550000096
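The ε-greedy selection described above can be sketched as follows. Note that the code keeps the patent's convention, in which ε is the probability of taking the greedy (largest-Q) action and 1-ε is the probability of exploring; the q_net object and its predict method are assumed placeholders for the Q network Q(s, a; w).

```python
import numpy as np

def epsilon_greedy_select(q_net, state, num_actions, epsilon):
    """Pick an action index from A_1 using the patent's epsilon-greedy convention:
    probability epsilon -> exploit (largest estimated Q value),
    probability 1 - epsilon -> explore (uniformly random scheme)."""
    q_values = q_net.predict(state)              # estimated Q(s_t, a; w) for all a (assumed API)
    if np.random.rand() < epsilon:
        return int(np.argmax(q_values))          # greedy choice
    return int(np.random.randint(num_actions))   # random exploration
```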
S3) power allocation: after the sub-channel allocation scheme c_t^* is obtained, the power allocation network in the DRL controller at the base station is activated and selects the optimal power allocation scheme p_t^* according to the ε-greedy strategy.
The power allocation scheme p_t^* can be expressed in terms of the power p_{k,m}(t) that each user can be allocated on the different sub-channels:
p_t^* = {p_{k,m}(t) | k = 1, ..., K; m = 1, ..., M}
where 0 ≤ p_{k,m}(t) ≤ P_max. Since only the transmit power on the sub-channel actually allocated to user m needs to be decided, and the power of user m on the other sub-channels need not be considered, we set:
p_{k,m}(t) = 0 whenever b_{k,m}(t) = 0
This reduces the dimensionality of the DQN unit outputs and thus improves performance.
Furthermore, since the power interval available for allocation is continuous, the power has to be discretized to fit the input and output of the DQN. However, discretizing the power of all users jointly causes an exponential increase of the output dimension, so the scheme uses a distributed architecture to solve this problem.
In this scheme, the power allocation network in the DRL controller contains M power allocation DQN units, and each unit is responsible for the power allocation task of one user. The power allocation scheme p_t^* is then rewritten as:
p_t^* = {p_t^1, p_t^2, ..., p_t^M}
where p_t^m denotes the power allocation decision made by the m-th power allocation DQN unit at time t. Assuming the power is discretized into L levels, p_t^m has L candidate values, denoted as:
p_t^m ∈ {P_1, P_2, ..., P_L}
where P_1, ..., P_L are the discrete power levels obtained by quantizing the allowed power range (0, P_max].
the letter is obtained in S2Lane allocation scheme
Figure BDA0002824183550000107
Thereafter, M power distribution DQN units within the power distribution network are activated. Each power distribution DQN unit contains the same two neural networks as the subchannel distribution DQN unit described above, but the parameters of these neural networks are different. Using the same state stAs input, the Q network of the mth power allocation unit outputs the estimated Q value and selects one from all power allocation schemes as the transmission power of the mth user following the epsilon-greedy policy
Figure BDA0002824183550000108
And (6) outputting. Combining the power of the M outputs into a power allocation scheme
Figure BDA0002824183550000109
As the optimal sub-channel allocation scheme at time t
Figure BDA00028241835500001010
And (6) outputting.
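A minimal sketch of the discretized, distributed power selection is given below. Each of the M per-user DQN units chooses one of L power levels and the choices are concatenated into p_t^*; the per-user dqn objects and their predict method are assumed placeholders, and the uniform level spacing is an assumption made here for illustration.

```python
import numpy as np

def select_power_scheme(power_dqns, state, p_max, num_levels, epsilon):
    """Return p_t^* = [p_t^1, ..., p_t^M], one discrete power level per user."""
    levels = np.linspace(p_max / num_levels, p_max, num_levels)   # L candidate power levels (assumed uniform)
    scheme = []
    for dqn in power_dqns:                        # one DQN unit per user (distributed structure)
        q_values = dqn.predict(state)             # assumed API returning L Q-value estimates
        if np.random.rand() < epsilon:
            idx = int(np.argmax(q_values))        # greedy level (patent's epsilon convention)
        else:
            idx = int(np.random.randint(num_levels))
        scheme.append(float(levels[idx]))
    return scheme
```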
S4) feedback acquisition: all users transmit data to the base station on the given sub-channels at the given power according to the resource allocation scheme (c_t^*, p_t^*) output by the sub-channel allocation network and the power allocation network. The base station returns the sum of the energy efficiencies of all users as feedback.
Once the sub-channel allocation scheme c_t^* and the power allocation scheme p_t^* are known, all values of b_{k,m}(t) and p_{k,m}(t) are determined. According to the uplink NOMA transmission principle, the signal-to-interference-plus-noise ratio of user m on sub-channel k is expressed as follows:
γ_{k,m}(t) = p_{k,m}(t) g_{k,m}(t) / (I_{k,m}(t) + σ²)
where I_{k,m}(t) denotes the interference from the other users multiplexed on sub-channel k that has not yet been removed by successive interference cancellation, and σ² denotes the variance of the Gaussian white noise. Using normalized bandwidth, the corresponding transmission rate is then expressed as:
R_{k,m}(t) = log(1 + γ_{k,m}(t))
The uplink energy efficiency of user m on sub-channel k is:
E_{k,m}(t) = R_{k,m}(t) / (p_{k,m}(t) + P_m)
where P_m represents the part of the power consumed by the operation of the user equipment itself.
The feedback at time t is defined as the sum of the energy efficiencies of all users on all sub-channels at the current time. If the transmission rate R_{k,m}(t) of every user satisfies the minimum rate requirement R_min, the base station computes the sum of the energy efficiencies of all users as the feedback r_t at the current time t and returns it to the sub-channel allocation unit and all power allocation units; otherwise, the feedback r_t obtained by all resource allocation units equals 0, i.e.:
r_t = Σ_k Σ_m E_{k,m}(t), if R_{k,m}(t) ≥ R_min for every served user; r_t = 0, otherwise
Then, because the users move, the channel gain information of all users changes, and the base station acquires the channel gain information of all users again as the new state s_{t+1}.
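The feedback computation can be sketched as follows. The SIC decoding order is an assumption made here (the base station is taken to decode users on a sub-channel in descending order of received power, so residual interference comes from the not-yet-decoded users); the circuit power term p_circuit corresponds to P_m above, and all array shapes are illustrative.

```python
import numpy as np

def compute_feedback(b, p, g, noise_var, p_circuit, r_min):
    """Return r_t: the sum of user energy efficiencies, or 0 if any rate is below R_min.

    b, p, g: K x M arrays of allocation flags b_{k,m}(t), powers p_{k,m}(t) and gains g_{k,m}(t).
    The SIC decoding order below is an assumption, not stated explicitly in the patent text.
    """
    K, M = b.shape
    total_ee = 0.0
    for k in range(K):
        users = [m for m in range(M) if b[k, m] == 1]
        users.sort(key=lambda m: p[k, m] * g[k, m], reverse=True)     # assumed SIC order
        for i, m in enumerate(users):
            interference = sum(p[k, j] * g[k, j] for j in users[i + 1:])
            sinr = p[k, m] * g[k, m] / (interference + noise_var)
            rate = np.log2(1.0 + sinr)                                # normalized bandwidth
            if rate < r_min:
                return 0.0                                            # rate constraint violated
            total_ee += rate / (p[k, m] + p_circuit)                  # energy efficiency of user m
    return total_ee
```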
S5) parameter update: according to the system feedback r_t obtained in S4, the neural networks of all DQN units in the sub-channel allocation network and the power allocation network are trained based on the experience replay and fixed target Q network strategies, and the network parameters are updated so that better resource allocation schemes can be selected. The specific parameter update steps are:
(1) store the transition (s_t, a_t, r_t, s_{t+1}) at each time step into a replay memory D as training samples for the neural networks;
(2) randomly sample N groups of samples (s_i, a_i, r_i, s_{i+1}) from D to train the neural networks;
(3) for the sub-channel allocation network, the parameter w of the Q network of the sub-channel allocation DQN unit is updated by minimizing the loss function with stochastic gradient descent. The loss function of the sub-channel allocation DQN unit is expressed as follows:
L(w) = E[(y_i - Q(s_i, a_i; w))^2]
y_i = r_i + γ max_{a'} Q(s_{i+1}, a'; w^-)
Using stochastic gradient descent, the parameter w is updated as:
w ← w - α ∇_w L(w)
where y_i denotes the target Q value produced by the target Q network Q(s, a; w^-) within the DQN unit, γ denotes the discount factor, and α denotes the learning rate.
(4) for the power allocation network, the loss functions of the M power allocation DQN units are minimized with the same stochastic gradient descent method as in (3), and the neural network parameters are updated. For the m-th power allocation unit, the loss function is expressed as follows:
L(w_m) = E[(y_i^m - Q(s_i, a_i^m; w_m))^2]
y_i^m = r_i + γ max_{a'} Q(s_{i+1}, a'; w_m^-)
Using stochastic gradient descent, the parameters of the m-th power allocation unit are updated as:
w_m ← w_m - α ∇_{w_m} L(w_m)
where m = 1, 2, ..., M.
(5) for the M+1 target Q networks in all resource allocation DQN units, the parameter w of the corresponding Q network is copied to the target network parameter w^- every fixed number W of time steps, thereby updating the target Q network parameters.
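The per-unit update in steps (1)-(5) corresponds to a standard DQN training step with experience replay and a fixed target network. The following PyTorch-style sketch illustrates it under the assumption that each DQN unit is an ordinary feed-forward network mapping the state to one Q value per action; q_net, target_net, optimizer and memory are assumed objects, not the patent's implementation.

```python
import random
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, memory, batch_size, gamma, step, sync_every):
    """One training step for a single DQN unit (experience replay + fixed target network)."""
    if len(memory) < batch_size:
        return
    batch = random.sample(memory, batch_size)                 # (2) sample N transitions from D
    states = torch.stack([torch.as_tensor(s, dtype=torch.float32) for s, a, r, s2 in batch])
    actions = torch.tensor([a for s, a, r, s2 in batch], dtype=torch.int64)
    rewards = torch.tensor([r for s, a, r, s2 in batch], dtype=torch.float32)
    next_states = torch.stack([torch.as_tensor(s2, dtype=torch.float32) for s, a, r, s2 in batch])

    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)       # Q(s_i, a_i; w)
    with torch.no_grad():
        y = rewards + gamma * target_net(next_states).max(dim=1).values   # target y_i from w^-
    loss = F.mse_loss(q_sa, y)                                            # L(w) = E[(y_i - Q)^2]

    optimizer.zero_grad()
    loss.backward()                                                       # gradient step on w
    optimizer.step()

    if step % sync_every == 0:                                            # (5) copy w to w^- periodically
        target_net.load_state_dict(q_net.state_dict())
```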
Fig. 2 is a frame diagram of a method for joint sub-channel and power allocation based on DRL according to the present invention.
In this example, a multi-user uplink NOMA scenario is considered, and joint sub-channel and power allocation is optimized for all users; the main parameters of the simulation scenario are listed in Table 1.
TABLE 1 Main parameters of the simulation scenario
FIG. 3 shows the loss function over time for different learning rates under the method of the present invention. The curves correspond, from top to bottom, to the learning rate α set to 0.001, 0.005 and 0.01. Simulation results show that the algorithm of the invention converges well. As shown in fig. 3, the loss functions for the three learning rates are initially large, decrease rapidly as the number of slots increases, and all converge within 20 steps. In particular, when α = 0.01, only a few steps are needed for the loss function to reach its minimum and then stabilize. Using this learning rate therefore provides a faster convergence rate, so that the prediction of the Q value becomes more accurate and the performance of the network becomes better.
Fig. 4 compares the average energy efficiency of the DRL-based joint sub-channel and power allocation method of the present invention with other methods. The curves correspond, from top to bottom, to the DRL-based resource allocation method proposed by the invention (DQN), a method using exhaustive search and random transmission power (OptRP), a method using exhaustive search and maximum transmission power (OptMP), and a method using random sub-channels and maximum transmission power (RCMP). Here, exhaustive search refers to traversing all sub-channel schemes and then selecting the sub-channel allocation scheme that yields the highest energy efficiency. It should be noted that, to show the simulation results more clearly, the total energy efficiency is a running average taken every 100 steps. It can be seen from the figure that the energy efficiency of the NOMA system applying the resource allocation scheme of the present invention is much higher than that of the other methods. The method of the invention can dynamically select the transmit power according to the real-time channel information of the users and adaptively adjust the resource allocation scheme. On the basis of meeting the minimum rate requirement, unnecessary transmission power is reduced, so that higher energy efficiency can be provided. It can also be seen by comparison that the energy efficiency obtained using an exhaustive search far exceeds that obtained using random sub-channels, which again shows that the sub-channel assignment has a significant impact on the performance gain of the overall NOMA system.
Fig. 5 is a schematic diagram of the average energy efficiency of the DRL-based joint subchannel and power allocation method of the present invention and other methods under different transmission power constraints. The figure shows the average energy efficiency of each scheme over all time slots under different maximum transmission power constraints. It can be seen from the figure that as the maximum transmission power increases, the average energy efficiency of the method also increases and approaches a maximum value, while the average energy efficiency of the other three methods decreases to different degrees after increasing. Furthermore, it can be seen from the figure that the method of the present invention is superior to other methods under most maximum transmission power conditions.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions deviate from the technical solutions of the embodiments of the present invention.

Claims (6)

1. An uplink NOMA resource allocation method based on deep reinforcement learning, characterized by comprising the following steps:
S1, state acquisition: at time t, the base station acquires the channel gain information of all users in the cell on the different sub-channels as the current state s_t;
S2, sub-channel allocation: the sub-channel allocation network at the base station selects the optimal sub-channel allocation scheme c_t^* according to the ε-greedy strategy;
S3, power allocation: after the sub-channel allocation scheme c_t^* is obtained, the power allocation network at the base station is activated and selects the optimal power allocation scheme p_t^* according to the ε-greedy strategy;
S4, feedback acquisition: all users transmit data to the base station on the given sub-channels at the given power according to the resource allocation scheme (c_t^*, p_t^*); the base station returns the sum of the energy efficiencies of all users as feedback;
S5, parameter update: according to the system feedback r_t obtained in S4, the neural networks within the sub-channel allocation DQN unit and all power allocation DQN units are trained based on the experience replay and fixed target Q network strategies, and the parameters of the networks are updated to better select resource allocation schemes.
2. The deep reinforcement learning-based uplink NOMA resource allocation method according to claim 1, wherein the channel gain information in S1 includes large-scale fading and small-scale fading, and at time t the channel gain information of all users on the different sub-channels constitutes the state s_t.
3. The deep reinforcement learning-based uplink NOMA resource allocation method according to claim 1, wherein the sub-channel allocation in S2 comprises the following specific steps:
after the current state s_t is obtained, s_t is passed to the sub-channel allocation DQN unit at the base station; the Q network Q(s, a; w) in the unit estimates, from the obtained state s_t and the network parameters w, the Q values Q(s_t, a; w), a ∈ A_1, of all sub-channel allocation schemes, where A_1 denotes the set of all sub-channel allocation schemes;
the sub-channel allocation DQN unit selects one of all sub-channel allocation schemes according to the ε-greedy strategy, which is: with probability 1-ε, randomly select a sub-channel allocation scheme c_t^* from A_1; or, with probability ε, select the scheme with the largest Q value, namely:
c_t^* = argmax_{a ∈ A_1} Q(s_t, a; w)
where 0 < ε < 1.
4. The deep reinforcement learning-based uplink NOMA resource allocation method according to claim 1, wherein the power allocation in S3 comprises the following specific steps:
after the sub-channel allocation scheme c_t^* is obtained, the M power allocation DQN units in the power allocation network at the base station are activated; using the same state s_t as input, the Q network of the m-th power allocation DQN unit estimates the corresponding Q values and then, according to the ε-greedy policy, selects one element from the set of all power allocation schemes as the transmit power p_t^m of the m-th user and outputs it; the M output powers are then combined into the power allocation scheme p_t^*, namely:
p_t^* = {p_t^1, p_t^2, ..., p_t^M}
5. the deep reinforcement learning-based uplink NOMA resource allocation method according to claim 1, wherein the feedback acquisition in S4 comprises the specific steps of:
resource allocation scheme selected by all users according to subchannel allocation network and power allocation network
Figure FDA0002824183540000026
Transmitting data to the base station on a given subchannel at a given power; if the transmission rate of each user can meet the minimum rate requirement, the base station calculates the sum of the energy efficiency of all users as the feedback r at the current time ttTo the subchannel allocation unit and all the power allocation units; if not, all resource allocation units get a feedback of 0, i.e.
Figure FDA0002824183540000027
Wherein r istIndicating feedback at time t, RminIndicating a minimum rate requirement, Ek,mAnd Rk,mRespectively representing the energy efficiency and the transmission rate of the user m on the subchannel k; as all users move, the base station acquires new channel gain information st+1
6. The deep reinforcement learning-based uplink NOMA resource allocation method according to claim 1, wherein the parameter update in S5 comprises the following specific steps:
(1) store the transition (s_t, a_t, r_t, s_{t+1}) at each time step into a replay memory D as training samples for the neural networks;
(2) randomly sample N groups of samples (s_i, a_i, r_i, s_{i+1}) from D to train the neural networks;
(3) for the sub-channel allocation network, the parameter w of the Q network of the sub-channel allocation DQN unit is updated by minimizing the loss function with stochastic gradient descent; the loss function of the sub-channel allocation DQN unit is expressed as follows:
L(w) = E[(y_i - Q(s_i, a_i; w))^2]
y_i = r_i + γ max_{a'} Q(s_{i+1}, a'; w^-)
using stochastic gradient descent, the parameter w is updated as:
w ← w - α ∇_w L(w)
where y_i denotes the target Q value produced by the target Q network Q(s, a; w^-), γ denotes the discount factor, and α denotes the learning rate;
(4) for the power allocation network, the loss functions of all power allocation DQN units are minimized with the same stochastic gradient descent method as in (3), and the neural network parameters are updated; for the m-th power allocation DQN unit, the loss function is expressed as follows:
L(w_m) = E[(y_i^m - Q(s_i, a_i^m; w_m))^2]
y_i^m = r_i + γ max_{a'} Q(s_{i+1}, a'; w_m^-)
where m = 1, 2, ..., M;
(5) for the M+1 target Q networks in all resource allocation DQN units, the parameter w of the corresponding Q network is copied to the target network parameter w^- every fixed number W of time steps, thereby updating the target Q network parameters.
CN202011445582.6A 2020-12-08 2020-12-08 Deep reinforcement learning-based uplink NOMA resource allocation method Pending CN112566261A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011445582.6A CN112566261A (en) 2020-12-08 2020-12-08 Deep reinforcement learning-based uplink NOMA resource allocation method


Publications (1)

Publication Number Publication Date
CN112566261A true CN112566261A (en) 2021-03-26

Family

ID=75061197

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011445582.6A Pending CN112566261A (en) 2020-12-08 2020-12-08 Deep reinforcement learning-based uplink NOMA resource allocation method

Country Status (1)

Country Link
CN (1) CN112566261A (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180027507A1 (en) * 2016-07-19 2018-01-25 Institut Mines-Telecom / Telecom Bretagne Method and apparatus for power and user distribution to sub-bands in noma systems
CN108737057A (en) * 2018-04-27 2018-11-02 南京邮电大学 Multicarrier based on deep learning recognizes NOMA resource allocation methods

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XIAOMING WANG ET AL.: "DRL-Based Energy-Efficient Resource Allocation Frameworks for Uplink NOMA Systems", IEEE Internet of Things Journal *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113253249A (en) * 2021-04-19 2021-08-13 中国电子科技集团公司第二十九研究所 MIMO radar power distribution design method based on deep reinforcement learning
CN113253249B (en) * 2021-04-19 2023-04-28 中国电子科技集团公司第二十九研究所 MIMO radar power distribution design method based on deep reinforcement learning
CN113242602A (en) * 2021-05-10 2021-08-10 内蒙古大学 Millimeter wave large-scale MIMO-NOMA system resource allocation method and system
CN113242602B (en) * 2021-05-10 2022-04-22 内蒙古大学 Millimeter wave large-scale MIMO-NOMA system resource allocation method and system
CN113162682A (en) * 2021-05-13 2021-07-23 重庆邮电大学 PD-NOMA-based multi-beam LEO satellite system resource allocation method
CN113162682B (en) * 2021-05-13 2022-06-24 重庆邮电大学 PD-NOMA-based multi-beam LEO satellite system resource allocation method
CN113543271A (en) * 2021-06-08 2021-10-22 西安交通大学 Effective capacity-oriented resource allocation method and system
CN113543271B (en) * 2021-06-08 2022-06-07 西安交通大学 Effective capacity-oriented resource allocation method and system
CN113595609A (en) * 2021-08-13 2021-11-02 电子科技大学长三角研究院(湖州) Cellular mobile communication system cooperative signal sending method based on reinforcement learning
CN113595609B (en) * 2021-08-13 2024-01-19 电子科技大学长三角研究院(湖州) Collaborative signal transmission method of cellular mobile communication system based on reinforcement learning
CN114698077A (en) * 2022-02-16 2022-07-01 东南大学 Dynamic power distribution and energy level selection method in semi-authorization-free scene
CN114698077B (en) * 2022-02-16 2024-02-02 东南大学 Dynamic power distribution and energy level selection method in semi-unlicensed scene

Similar Documents

Publication Publication Date Title
CN112566261A (en) Deep reinforcement learning-based uplink NOMA resource allocation method
CN108737057B (en) Multi-carrier cognitive NOMA resource allocation method based on deep learning
CN104640220B (en) A kind of frequency and power distribution method based on NOMA systems
Pietrzyk et al. Multiuser subcarrier allocation for QoS provision in the OFDMA systems
US8174959B2 (en) Auction based resource allocation in wireless systems
CN110418399B (en) NOMA-based Internet of vehicles resource allocation method
CN104703270B (en) User&#39;s access suitable for isomery wireless cellular network and power distribution method
CN109996264B (en) Power allocation method for maximizing safe energy efficiency in non-orthogonal multiple access system
CN111586646B (en) Resource allocation method for D2D communication combining uplink and downlink channels in cellular network
Guo et al. Fairness-aware energy-efficient resource allocation in D2D communication networks
CN114885420A (en) User grouping and resource allocation method and device in NOMA-MEC system
CN107484180B (en) Resource allocation method based on D2D communication in very high frequency band
CN114423028B (en) CoMP-NOMA cooperative clustering and power distribution method based on multi-agent deep reinforcement learning
CN113923767A (en) Energy efficiency maximization method for multi-carrier cooperation non-orthogonal multiple access system
CN112367523B (en) Resource management method in SVC multicast based on NOMA in heterogeneous wireless network
CN108419298B (en) Power distribution method based on energy efficiency optimization in non-orthogonal multiple access system
CN113507716A (en) SWIPT-based CR-NOMA network interruption and energy efficiency optimization method
Masaracchia et al. The impact of user mobility into non-orthogonal multiple access (noma) transmission systems
CN107613565B (en) Wireless resource management method in full-duplex ultra-dense network
CN115833886A (en) Power control method of non-cellular large-scale MIMO system
CN107172574B (en) Power distribution method for D2D user to sharing frequency spectrum with cellular user
CN115243234A (en) User association and power control method and system for M2M heterogeneous network
Wang et al. Throughput maximization-based optimal power allocation for energy-harvesting cognitive radio networks with multiusers
Masaracchia et al. On the optimal user grouping in NOMA system technology
CN113141656B (en) NOMA cross-layer power distribution method and device based on improved simulated annealing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20210326)