CN109729528B - D2D resource allocation method based on multi-agent deep reinforcement learning - Google Patents

D2D resource allocation method based on multi-agent deep reinforcement learning

Info

Publication number
CN109729528B
CN109729528B (application CN201910161391.8A)
Authority
CN
China
Prior art keywords
communication
cellular
user
link
pair
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910161391.8A
Other languages
Chinese (zh)
Other versions
CN109729528A (en)
Inventor
郭彩丽
李政
宣一荻
冯春燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Publication of CN109729528A publication Critical patent/CN109729528A/en
Application granted granted Critical
Publication of CN109729528B publication Critical patent/CN109729528B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D — CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 — Reducing energy consumption in communication networks
    • Y02D30/70 — Reducing energy consumption in communication networks in wireless communication networks

Abstract

The invention discloses a D2D resource allocation method based on multi-agent deep reinforcement learning, and belongs to the field of wireless communication. First, a heterogeneous network model in which a cellular network and D2D communication share spectrum is constructed, the signal-to-interference-plus-noise ratio (SINR) of the D2D receiving users and the SINR of the cellular users are established based on the existing interference, the unit-bandwidth communication rates of the cellular links and the D2D links are then calculated, and a D2D resource allocation optimization model in the heterogeneous network is constructed with maximized system capacity as the optimization target. For time slot t, a deep reinforcement learning model of each D2D communication pair is constructed on the basis of the D2D resource allocation optimization model. In subsequent time slots, each D2D communication pair extracts its own state feature vector and inputs it into the trained deep reinforcement learning model to obtain its resource allocation scheme. The invention optimizes spectrum allocation and transmit power, maximizes system capacity, and provides a low-complexity resource allocation algorithm.

Description

D2D resource allocation method based on multi-agent deep reinforcement learning
Technical Field
The invention belongs to the field of wireless communication, relates to a heterogeneous cellular network system, and particularly relates to a D2D resource allocation method based on multi-agent deep reinforcement learning.
Background
The popularization of intelligent terminals and the explosive growth of mobile Internet services place higher requirements on the data transmission capability of wireless communication networks. Under this trend, existing cellular networks suffer from problems such as spectrum resource shortage and heavy base station load, and cannot meet the transmission requirements of future wireless networks.
Device-to-Device (D2D) communication allows neighboring users to establish a direct link for communication; it is a promising technology for future wireless communication networks because it improves spectral efficiency, saves power consumption and offloads base station load. Introducing D2D communication into the cellular network can, on the one hand, save energy and improve the performance of cell-edge users, and on the other hand, greatly improve spectrum utilization by letting D2D links share the spectrum of cellular users.
However, when D2D communication multiplexes the spectrum of the cellular network it causes cross-layer interference to the cellular communication links, and the communication quality of cellular users, the primary users of the cellular band, must be guaranteed; meanwhile, under dense D2D deployment, multiple D2D communication links multiplexing the same spectrum cause peer-to-peer interference among themselves. Interference management when the cellular network and D2D communication coexist is therefore an urgent problem. Wireless network resource allocation aims to mitigate interference through reasonable allocation of resources and to improve the utilization efficiency of spectrum resources, and is an effective way to solve the interference management problem.
Existing research on D2D communication resource allocation in cellular networks can be divided into centralized and distributed categories. Centralized methods assume that the base station has instantaneous global channel state information (CSI) and controls the resource allocation of the D2D users; however, acquiring global CSI incurs huge signaling overhead, and in future scenarios with massive numbers of wireless devices the base station can hardly obtain instantaneous global information, so centralized algorithms are no longer applicable in device-dense scenarios.
Distributed methods let D2D users select wireless network resources autonomously; existing research is mainly based on game theory and reinforcement learning. Game-theoretic methods model D2D users as players who compete until a Nash equilibrium is reached, but solving for the Nash equilibrium requires a great deal of information exchange among users and many iterations to converge. Resource allocation research based on reinforcement learning mainly relies on Q learning, such as the Deep Q Network (DQN): a D2D user is regarded as an agent that learns a strategy to select wireless network resources autonomously. However, when multiple agents learn and train simultaneously, the strategy of each agent keeps changing, which makes the training environment non-stationary and convergence difficult. Therefore, a distributed resource allocation algorithm with good convergence and low complexity is needed to solve the interference management problem of D2D communication in cellular networks.
Disclosure of Invention
In order to solve the above problems, the invention provides a D2D resource allocation method based on multi-agent deep reinforcement learning, built on deep reinforcement learning theory, which optimizes the spectrum allocation and transmit power of D2D users, maximizes the system capacity of the cellular network and D2D communication, and guarantees the communication quality of cellular users.
The method comprises the following specific steps:
step one, constructing a heterogeneous network model of a cellular network and a D2D communication shared spectrum;
the heterogeneous network model comprises cellular base stations BS, M cellular downlink users and N D2D communication pairs.
Setting the m-th cellular user as $C_m$, wherein $1 \le m \le M$, and the n-th D2D communication pair as $D_n$, wherein $1 \le n \le N$. The transmitting user and receiving user of D2D communication pair $D_n$ are denoted by $D_n^t$ and $D_n^r$, respectively.
The cellular downlink communication links and the D2D links adopt orthogonal frequency division multiplexing; each cellular user occupies one communication resource block RB, and no interference exists between any two cellular links. One cellular user is allowed to share the same RB with multiple D2D users at the same time, and the communication resource block RB and the transmit power are selected autonomously by each D2D user.
Step two, establishing a signal-to-interference-and-noise ratio (SINR) of a D2D receiving user and an SINR of a cellular user based on interference existing in a heterogeneous network model;
interference includes three types: 1) cellular users experience interference from transmitting users in each D2D communication pair sharing the same RB; 2) interference experienced by the receiving users in each D2D communication pair from the base station; 3) the receiving user in each D2D communication pair is subject to interference from the transmitting user in all other D2D communication pairs that share the same RBs.
Cellular user $C_m$'s received-signal SINR on the $k$-th communication resource block RB from the base station is:

$$\gamma_{C_m}^{k} = \frac{P_B\, g_{B,C_m}}{\sum_{D_n \in \mathcal{D}_k} P_{D_n}\, h_{D_n^t,C_m} + N_0}$$

where $P_B$ is the fixed transmit power of the base station; $g_{B,C_m}$ is the channel gain of the downlink target link from the base station to cellular user $C_m$; $\mathcal{D}_k$ is the set of all D2D communication pairs sharing the $k$-th RB; $P_{D_n}$ is the transmit power of the transmitting user of D2D communication pair $D_n$; $h_{D_n^t,C_m}$ is the channel gain of the interfering link from transmitting user $D_n^t$ of D2D communication pair $D_n$ to cellular user $C_m$ when multiple links share the RB; and $N_0$ is the power spectral density of the additive white Gaussian noise.
The received-signal SINR of the receiving user of D2D communication pair $D_n$ on the $k$-th RB is:

$$\gamma_{D_n}^{k} = \frac{P_{D_n}\, g_{D_n}}{P_B\, h_{B,D_n^r} + \sum_{D_i \in \mathcal{D}_k,\, i \ne n} P_{D_i}\, h_{D_i^t,D_n^r} + N_0}$$

where $g_{D_n}$ is the channel gain of the D2D target link from transmitting user $D_n^t$ of D2D communication pair $D_n$ to its receiving user $D_n^r$; $h_{B,D_n^r}$ is the channel gain of the interfering link from the base station to receiving user $D_n^r$ when multiple links share the RB; $P_{D_i}$ is the transmit power of the transmitting user of D2D communication pair $D_i$; and $h_{D_i^t,D_n^r}$ is the channel gain of the interfering link from transmitting user $D_i^t$ of D2D communication pair $D_i$ to receiving user $D_n^r$ when multiple links share the RB;
thirdly, calculating the unit bandwidth communication rates of the cellular link and the D2D link respectively by using the SINR of the cellular user and the SINR of the D2D receiving user;
communication rate per unit bandwidth of cellular link
The calculation formula is:

$$r_{C_m}^{k} = \log_2\!\left(1 + \gamma_{C_m}^{k}\right)$$

The unit-bandwidth communication rate of the D2D link is:

$$r_{D_n}^{k} = \log_2\!\left(1 + \gamma_{D_n}^{k}\right)$$
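For illustration only (this sketch is not part of the claimed method), the following Python code evaluates the two SINR expressions of step two and the unit-bandwidth rates of step three; the function names, argument names and the toy channel values are assumptions introduced here.

```python
import numpy as np

def cellular_sinr(p_bs, g_bs_cm, p_d2d_tx, h_d2d_cm, n0):
    """SINR of cellular user C_m on the k-th RB; p_d2d_tx and h_d2d_cm
    are arrays over the D2D pairs sharing that RB."""
    interference = np.sum(np.asarray(p_d2d_tx) * np.asarray(h_d2d_cm))
    return (p_bs * g_bs_cm) / (interference + n0)

def d2d_sinr(p_dn, g_dn, p_bs, h_bs_dn, p_other_tx, h_other_dn, n0):
    """SINR of the receiving user of D2D pair D_n on the k-th RB;
    p_other_tx and h_other_dn cover the other D2D pairs on the same RB."""
    interference = p_bs * h_bs_dn + np.sum(np.asarray(p_other_tx) * np.asarray(h_other_dn))
    return (p_dn * g_dn) / (interference + n0)

def rate_per_hz(sinr):
    """Unit-bandwidth communication rate (Shannon formula), in bit/s/Hz."""
    return np.log2(1.0 + sinr)

# Toy numbers, purely illustrative.
gamma_c = cellular_sinr(p_bs=1.0, g_bs_cm=1e-6, p_d2d_tx=[0.1, 0.05],
                        h_d2d_cm=[1e-8, 2e-8], n0=1e-12)
gamma_d = d2d_sinr(p_dn=0.1, g_dn=1e-5, p_bs=1.0, h_bs_dn=1e-8,
                   p_other_tx=[0.05], h_other_dn=[1e-8], n0=1e-12)
print(rate_per_hz(gamma_c), rate_per_hz(gamma_d))
```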
step four, calculating system capacity by using the communication rate of the cellular link and the D2D link in unit bandwidth, and constructing a D2D resource allocation optimization model in the heterogeneous network by taking the maximized system capacity as an optimization target;
the optimization model is as follows:
$$\max_{B_{N\times K},\, \mathbf{P}_D}\ \sum_{m=1}^{M} r_{C_m}^{k} + \sum_{n=1}^{N}\sum_{k=1}^{K} b_{n,k}\, r_{D_n}^{k}$$

$$\text{C1:}\quad \gamma_{C_m}^{k} \ge \gamma_{\min}^{C},\quad \forall m$$

$$\text{C2:}\quad \sum_{k=1}^{K} b_{n,k} \le 1,\ b_{n,k}\in\{0,1\},\quad \forall n$$

$$\text{C3:}\quad 0 \le P_{D_n} \le P_{\max},\quad \forall n$$

where $B_{N\times K} = [b_{n,k}]$ is the allocation matrix of communication resource blocks RB for the D2D communication pairs, $b_{n,k}$ is the RB selection parameter of D2D communication pair $D_n$, and $\mathbf{P}_D$ is the power control vector jointly formed by the transmit powers of all D2D communication pairs.

Constraint C1 requires the SINR of each cellular user to be greater than the minimum received-SINR threshold $\gamma_{\min}^{C}$ of the cellular user, ensuring the communication quality of cellular users; constraint C2 is the D2D link spectrum allocation constraint: each D2D user pair can be allocated at most one communication resource block RB; constraint C3 requires that the transmit power of the transmitting user of each D2D communication pair does not exceed the maximum transmit power threshold $P_{\max}$.
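As a sketch of how a candidate allocation could be scored against this optimization model (illustrative only; all names are assumptions introduced here, and `rate_d` is taken as the N×K matrix of D2D unit-bandwidth rates):

```python
import numpy as np

def system_capacity(rate_c, rate_d, b):
    """Objective: sum of cellular unit-bandwidth rates plus the D2D rates
    selected by the RB allocation matrix b[n, k]."""
    return float(np.sum(rate_c) + np.sum(np.asarray(b) * np.asarray(rate_d)))

def feasible(sinr_c, b, p_d2d, sinr_min, p_max):
    """Check constraints C1-C3 for a candidate allocation."""
    b = np.asarray(b)
    p_d2d = np.asarray(p_d2d)
    c1 = np.all(np.asarray(sinr_c) >= sinr_min)                   # C1: cellular SINR threshold
    c2 = np.all(b.sum(axis=1) <= 1) and np.isin(b, (0, 1)).all()  # C2: at most one RB per D2D pair
    c3 = np.all((p_d2d >= 0) & (p_d2d <= p_max))                  # C3: transmit power budget
    return bool(c1 and c2 and c3)
```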
Step five, aiming at the time slot t, on the basis of the D2D resource allocation optimization model, constructing a deep reinforcement learning model of each D2D communication pair;
the specific construction steps are as follows:
step 501, for a given D2D communication pair $D_p$, constructing the state feature vector $s_t$ at time slot $t$:

$$s_t = \left\{\, g_t^{D_p},\ h_t^{B,D_p},\ I_{t-1},\ \mathcal{B}_{t-1}^{D},\ \mathcal{B}_{t-1}^{C} \,\right\}$$

where $g_t^{D_p}$ is the instantaneous channel state information of the D2D communication link; $h_t^{B,D_p}$ is the instantaneous channel state information of the interfering link from the base station to the receiving user of D2D communication pair $D_p$; $I_{t-1}$ is the interference power received by the receiving user of D2D communication pair $D_p$ in the previous time slot $t-1$; $\mathcal{B}_{t-1}^{D}$ is the set of RBs occupied in the previous time slot $t-1$ by the D2D communication pairs neighboring $D_p$; and $\mathcal{B}_{t-1}^{C}$ is the set of RBs occupied in the previous time slot $t-1$ by the cellular users neighboring $D_p$.
Step 502, simultaneously constructing the D2D communication pair DpA return function r at time slot tt
Figure BDA00019847654900000316
rnIn negative return, rn<0;
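The reward of step 502 can be sketched as follows; the default value of the negative reward $r_n$ is an assumption chosen only for illustration.

```python
def reward(sinr_cellular, sinr_min_cellular, rate_d2d, r_negative=-1.0):
    """Reward of a D2D pair at slot t: its unit-bandwidth rate if the cellular
    user sharing the RB still meets its SINR threshold, otherwise r_n < 0."""
    return rate_d2d if sinr_cellular >= sinr_min_cellular else r_negative
```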
Step 503, constructing the state characteristics of the multi-agent Markov game model by using the state characteristic vectors of the D2D communication pair; in order to optimize the Markov game model, a return function in the deep reinforcement learning model of the multi-agent actor critic is established by utilizing the return function of the D2D communication pair;
The Markov game model of the N agents is:

$$\left(\mathcal{S},\ \mathcal{A}_1,\ldots,\mathcal{A}_N,\ r_1,\ldots,r_N,\ p,\ \gamma\right)$$

where $\mathcal{S}$ is the state space, $\mathcal{A}_j$ is the action space of agent $j$, $r_j$ is the reward value corresponding to the reward function of the $j$-th D2D communication pair, $j \in \{1,\ldots,N\}$, $p$ is the state transition probability of the whole environment, and $\gamma$ is the discount factor.
The goal of each D2D communication pair learning is to maximize the total discount return for that D2D communication pair;
The total discounted return is calculated as:

$$R_j = \sum_{t=0}^{T} \gamma^{t}\, r_t^{j}$$

where $T$ is the time horizon, $\gamma^{t}$ is the discount factor raised to the power $t$, and $r_t^{j}$ is the reward value of the $j$-th D2D communication pair's reward function at time slot $t$.
The Actor Critic reinforcement learning model consists of an Actor (Actor) and a Critic (Critic);
in the training process, the strategy of the actor is fitted by using a deep neural network, and is updated by using the following deterministic strategy gradient formula so as to obtain the maximum expected return.
Let mu be { mu ═ mu1,...,μNDenotes the deterministic policy for all agents, θ ═ θ1,...,θNThe parameters contained in the strategy are expressed, and the gradient formula of the expected return of the jth agent is as follows:
Figure BDA0001984765490000046
s contains state information for all agents, s ═ s1,...,sN}; a contains the action information of all agents, a ═ a1,...,aN};
Figure BDA0001984765490000047
Is an experience replay buffer;
The critic is also fitted with a deep neural network and is updated by minimizing the loss function of the centralized action-value function $Q_j^{\mu}$:

$$\mathcal{L}(\theta_j) = \mathbb{E}_{(s_t,a_t,r_t,s_{t+1})\sim\mathcal{D}}\!\left[\left(Q_j^{\mu}(s_t, a_1,\ldots,a_N) - y\right)^{2}\right],\qquad y = r_t^{j} + \gamma\, Q_j^{\mu'}(s_{t+1}, a_1',\ldots,a_N')\big|_{a_i'=\mu_i'(s_i)}$$

where each sample in $\mathcal{D}$ is a tuple $(s_t, a_t, r_t, s_{t+1})$ recording the historical data of all agents, and $r_t = \{r_t^{1},\ldots,r_t^{N}\}$ includes the rewards of all agents at time slot $t$.
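For illustration, a PyTorch-style sketch of one such centralized actor-critic gradient step for agent j is given below; the tensor layout, the use of target networks for the critic target y, and all function and variable names are assumptions made here rather than the patented implementation, and each actor is assumed to map its local state to a flat action vector.

```python
import torch
import torch.nn.functional as F

def update_agent_j(j, batch, actors, critics, target_actors, target_critics,
                   actor_opt, critic_opt, gamma=0.95):
    """One gradient step for agent j from a sampled mini-batch.
    batch tensors: states/next_states [B, N, s_dim], actions [B, N, a_dim], rewards [B, N]."""
    B, N, _ = batch["states"].shape
    joint_s = batch["states"].reshape(B, -1)
    joint_a = batch["actions"].reshape(B, -1)
    joint_s_next = batch["next_states"].reshape(B, -1)

    # Critic update: minimize (Q_j(s, a_1..a_N) - y)^2, with y built from target networks (assumed).
    with torch.no_grad():
        next_a = torch.cat([target_actors[i](batch["next_states"][:, i]) for i in range(N)], dim=-1)
        y = batch["rewards"][:, j:j + 1] + gamma * target_critics[j](joint_s_next, next_a)
    critic_loss = F.mse_loss(critics[j](joint_s, joint_a), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor update: ascend the deterministic policy gradient of agent j.
    acts = [batch["actions"][:, i] for i in range(N)]
    acts[j] = actors[j](batch["states"][:, j])          # re-evaluate agent j's own action
    actor_loss = -critics[j](joint_s, torch.cat(acts, dim=-1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    return critic_loss.item(), actor_loss.item()
```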
Step 504, performing offline training on the deep reinforcement learning model by using historical communication data to obtain D2D communication D for solving the DpProblem of resource allocationThe model of (1).
And step six, extracting respective state feature vectors for each D2D communication pair in the subsequent time slot, and inputting the state feature vectors into the trained deep reinforcement learning model to obtain the resource allocation scheme of each D2D communication pair.
The resource allocation scheme includes selecting appropriate communication resource blocks RB and transmission power.
The invention has the advantages that:
(1) the D2D resource allocation method based on multi-agent deep reinforcement learning optimizes the spectrum allocation and transmit power of D2D users, and maximizes the system capacity while ensuring the communication quality of cellular users;
(2) the method designs a distributed D2D resource allocation algorithm for the heterogeneous cellular network, greatly reducing the signaling overhead incurred to obtain global instantaneous channel state information;
(3) the method introduces a multi-agent reinforcement learning model with centralized training and distributed execution, solves the resource allocation problem of multiple D2D communication pairs, obtains good training convergence performance, and provides a low-complexity resource allocation algorithm.
Drawings
Fig. 1 is a schematic diagram of a heterogeneous network model of a cellular network and D2D communication sharing spectrum, which is constructed by the present invention;
FIG. 2 is a flow chart of a D2D resource allocation method based on multi-agent deep reinforcement learning according to the present invention;
FIG. 3 is a diagram illustrating a deep reinforcement learning model for D2D communication resource allocation according to the present invention;
FIG. 4 is a diagram of the single-agent actor-critic reinforcement learning model of the present invention;
FIG. 5 is a diagram of the multi-agent actor-critic reinforcement learning model of the present invention;
Fig. 6 is a graph comparing the cellular user outage rates of the present invention with those of the DQN-based D2D resource allocation method and the random D2D resource allocation method;
Fig. 7 is a graph comparing the total system capacity of the present invention with that of the DQN-based D2D resource allocation method and the random D2D resource allocation method;
FIG. 8 is a graph of the total reward function and system capacity convergence performance of the present invention;
Fig. 9 is a graph of the total reward function and system capacity convergence performance of the DQN-based D2D resource allocation method, shown for comparison.
Detailed Description
In order that the technical principles of the present invention may be more clearly understood, embodiments of the present invention are described in detail below with reference to the accompanying drawings.
A D2D resource allocation method based on multi-agent deep reinforcement learning (MADRL, Multi-Agent Deep Reinforcement Learning based Device-to-Device Resource Allocation Method) is applied to a heterogeneous network in which a cellular network and D2D communication coexist. First, the signal-to-interference-plus-noise ratio (SINR) and unit-bandwidth communication rate expressions of the D2D receiving users and the cellular users are established; taking maximized system capacity as the optimization target, and taking the cellular user SINR being greater than the minimum SINR threshold, the D2D link spectrum allocation constraint, and the D2D transmit power being smaller than the maximum transmit power threshold as constraints, a D2D resource allocation optimization model in the heterogeneous network is constructed;
constructing a state feature vector and a return function of a multi-agent deep reinforcement learning model for D2D resource allocation according to an optimization model; establishing a multi-agent actor critic deep reinforcement learning model for D2D resource allocation based on a partially observable Markov game model and an actor critic reinforcement learning theory;
performing offline training by using historical communication data obtained by the simulation platform;
according to the instantaneous channel state information of the D2D link, the instantaneous channel state information of the interfering link from the base station to the D2D receiving user, the interference power received by the D2D receiving user in the previous time slot, the communication resource block (RB) occupied by the neighboring D2D links in the previous time slot, and the RB occupied by the neighboring cellular users in the previous time slot, the resource allocation strategy obtained by training selects a suitable RB and transmit power.
As shown in fig. 2, the whole scheme comprises five parts: establishing the system model, formulating the optimization problem and building the optimization model, establishing the multi-agent reinforcement learning model, training the model, and executing the algorithm; establishing the multi-agent reinforcement learning model in turn comprises constructing the state features, designing the reward function and building the multi-agent actor-critic reinforcement learning model;
the method comprises the following specific steps:
step one, constructing a heterogeneous network model of a cellular network and a D2D communication shared spectrum;
as shown in fig. 1, the heterogeneous network model includes a cellular Base Station (BS), M cellular downlink users, and N D2D communication pairs.
Setting the m-th cellular user as $C_m$, wherein $1 \le m \le M$, and the n-th D2D communication pair as $D_n$, wherein $1 \le n \le N$. The transmitting user and receiving user of D2D communication pair $D_n$ are denoted by $D_n^t$ and $D_n^r$, respectively.
The cellular downlink communication links and the D2D links both adopt Orthogonal Frequency Division Multiplexing (OFDM); each cellular user occupies one communication resource block RB, and there is no interference between any two cellular links. In the system model, one cellular user is allowed to share the same RB simultaneously with multiple D2D users, and the communication resource block RB and the transmit power are selected autonomously by each D2D user.
Step two, based on the Interference existing in the heterogeneous network model, establishing a signal to Interference plus Noise ratio (SINR) of a D2D receiving user and an SINR of a cellular user;
interference includes three types: 1) cellular users experience interference from transmitting users in each D2D communication pair sharing the same RB; 2) interference experienced by the receiving users in each D2D communication pair from the base station; 3) the receiving user in each D2D communication pair is subject to interference from the transmitting user in all other D2D communication pairs that share the same RBs.
Cellular user $C_m$'s received-signal SINR on the $k$-th communication resource block RB from the base station is:

$$\gamma_{C_m}^{k} = \frac{P_B\, g_{B,C_m}}{\sum_{D_n \in \mathcal{D}_k} P_{D_n}\, h_{D_n^t,C_m} + N_0}$$

where $P_B$ is the fixed transmit power of the base station; $g_{B,C_m}$ is the channel gain of the downlink target link from the base station to cellular user $C_m$; $\mathcal{D}_k$ is the set of all D2D communication pairs sharing the $k$-th RB; $P_{D_n}$ is the transmit power of the transmitting user of D2D communication pair $D_n$; $h_{D_n^t,C_m}$ is the channel gain of the interfering link from transmitting user $D_n^t$ of D2D communication pair $D_n$ to cellular user $C_m$ when multiple links share the RB; and $N_0$ is the power spectral density of the Additive White Gaussian Noise (AWGN).
The received-signal SINR of the receiving user of D2D communication pair $D_n$ on the $k$-th RB is:

$$\gamma_{D_n}^{k} = \frac{P_{D_n}\, g_{D_n}}{P_B\, h_{B,D_n^r} + \sum_{D_i \in \mathcal{D}_k,\, i \ne n} P_{D_i}\, h_{D_i^t,D_n^r} + N_0}$$

where $g_{D_n}$ is the channel gain of the D2D target link from transmitting user $D_n^t$ of D2D communication pair $D_n$ to its receiving user $D_n^r$; $h_{B,D_n^r}$ is the channel gain of the interfering link from the base station to receiving user $D_n^r$ when multiple links share the RB; $P_{D_i}$ is the transmit power of the transmitting user of D2D communication pair $D_i$; and $h_{D_i^t,D_n^r}$ is the channel gain of the interfering link from transmitting user $D_i^t$ of D2D communication pair $D_i$ to receiving user $D_n^r$ when multiple links share the RB;
thirdly, calculating the unit bandwidth communication rates of the cellular link and the D2D link respectively by using the SINR of the cellular user and the SINR of the D2D receiving user;
cellular link communication rate per bandwidth based on shannon's formula
The calculation formula is:

$$r_{C_m}^{k} = \log_2\!\left(1 + \gamma_{C_m}^{k}\right)$$

The unit-bandwidth communication rate of the D2D link is:

$$r_{D_n}^{k} = \log_2\!\left(1 + \gamma_{D_n}^{k}\right)$$
step four, calculating system capacity by using the communication rate of the cellular link and the D2D link in unit bandwidth, and constructing a D2D resource allocation optimization model in the heterogeneous network by taking the maximized system capacity as an optimization target;
Since the allocation matrix $B_{N\times K} = [b_{n,k}]$ of communication resource blocks RB of the D2D communication pairs and the power control vector $\mathbf{P}_D$ jointly formed by the transmit powers of all D2D communication pairs must be optimized to maximize the system capacity while guaranteeing the communication quality of the cellular users, the optimization model is built as:

$$\max_{B_{N\times K},\, \mathbf{P}_D}\ \sum_{m=1}^{M} r_{C_m}^{k} + \sum_{n=1}^{N}\sum_{k=1}^{K} b_{n,k}\, r_{D_n}^{k}$$

$$\text{C1:}\quad \gamma_{C_m}^{k} \ge \gamma_{\min}^{C},\quad \forall m$$

$$\text{C2:}\quad \sum_{k=1}^{K} b_{n,k} \le 1,\ b_{n,k}\in\{0,1\},\quad \forall n$$

$$\text{C3:}\quad 0 \le P_{D_n} \le P_{\max},\quad \forall n$$

where $b_{n,k}$ is the RB selection parameter of D2D communication pair $D_n$.

Constraint C1 is the cellular user SINR constraint: the SINR of each cellular user must be greater than the minimum received-SINR threshold $\gamma_{\min}^{C}$, ensuring the communication quality of cellular users; constraint C2 is the D2D link spectrum allocation constraint: each D2D user pair can be allocated at most one communication resource block RB; constraint C3 requires that the transmit power of the transmitting user of each D2D communication pair does not exceed the maximum transmit power threshold $P_{\max}$.
Step five, aiming at the time slot t, on the basis of the D2D resource allocation optimization model, constructing a deep reinforcement learning model of each D2D communication pair;
A reinforcement learning model for D2D resource allocation is established as shown in FIG. 3. The principle is as follows: in a time slot $t$, each D2D communication pair acts as an agent that observes a state $s_t$ from the state space $\mathcal{S}$; according to its policy $\pi$ and the current state, it then selects an action $a_t$ from the action space $\mathcal{A}$, i.e., the D2D communication pair selects the RB to use and the transmit power. After performing action $a_t$, the D2D communication pair observes the environment transition to a new state $s_{t+1}$ and obtains a reward $r_t$; based on the obtained reward $r_t$, the D2D communication pair adjusts its policy to achieve a higher return, as illustrated by the sketch below.
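For illustration, a minimal sketch of this per-slot agent-environment interaction is given below; the `env` interface and its methods are hypothetical stand-ins for the communication simulation environment and are not defined by the patent.

```python
def run_slot(env, agent_ids, policies):
    """One time slot t: every D2D pair observes, selects an RB and a transmit
    power, and then receives its reward and next observation."""
    transitions = []
    observations = {j: env.observe(j) for j in agent_ids}       # state s_t per agent
    for j in agent_ids:
        action = policies[j](observations[j])                   # a_t = (RB index, transmit power)
        env.apply(j, action)
    env.step()                                                   # environment moves to s_{t+1}
    for j in agent_ids:
        transitions.append((j, observations[j], env.reward(j), env.observe(j)))
    return transitions
```

The specific construction steps are as follows: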
step 501, for a given D2D communication pair $D_p$, constructing the state feature vector $s_t$ at time slot $t$.

The state features observed by each D2D communication pair are:

$$s_t = \left\{\, g_t^{D_p},\ h_t^{B,D_p},\ I_{t-1},\ \mathcal{B}_{t-1}^{D},\ \mathcal{B}_{t-1}^{C} \,\right\}$$

where $g_t^{D_p}$ is the instantaneous channel state information of the D2D communication link; $h_t^{B,D_p}$ is the instantaneous channel state information of the interfering link from the base station to the receiving user of D2D communication pair $D_p$; $I_{t-1}$ is the interference power received by the receiving user of D2D communication pair $D_p$ in the previous time slot $t-1$; $\mathcal{B}_{t-1}^{D}$ is the set of RBs occupied in the previous time slot $t-1$ by the D2D communication pairs neighboring $D_p$; and $\mathcal{B}_{t-1}^{C}$ is the set of RBs occupied in the previous time slot $t-1$ by the cellular users neighboring $D_p$.
Step 502, simultaneously, according to the optimization objective, the D2D communication pair D is constructedpA return function r at time slot tt
The reward function is designed to take into account both the lowest received SINR threshold for the cellular user and the unit bandwidth rate of the D2D communication pair. If the cellular user receiving SINR for the shared spectrum in communication with D2D can satisfy the cellular user signal-to-noise ratio constraint, a positive reward is obtained; otherwise, a negative reward r will be obtainedn,rnIs less than 0. To boost the capacity of the D2D communication link, the positive reward is set to the unit bandwidth communication rate of the D2D link:
Figure BDA0001984765490000081
thus, the reward function is as follows:
Figure BDA0001984765490000082
step 503, constructing the state features of the multi-agent Markov game model from the state feature vectors of the D2D communication pairs; in order to optimize the Markov game model, the reward function in the multi-agent actor-critic deep reinforcement learning model is established from the reward functions of the D2D communication pairs;
Each agent uses an actor-critic reinforcement learning model composed of an actor and a critic, whose policies are fitted with deep neural networks, as shown in fig. 4. The actor network of a D2D pair takes the environment state $s_t$ as input and outputs an action $a_t$, i.e., the selected RB and transmit power; the critic network takes the environment state vector $s_t$ and the selected action $a_t$ as input and outputs a temporal-difference error (TD error) computed from the Q value, and this TD error drives the learning of both networks.
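A possible realization of such actor and critic networks, sketched in PyTorch for illustration only, is shown below; the layer sizes, the split of the actor output into RB logits and a normalized power level, and the class names are assumptions, not the networks specified by the patent.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps a D2D pair's local state s_t to an action: RB-selection logits plus
    a transmit-power level in [0, 1] (to be scaled by P_max outside)."""
    def __init__(self, state_dim, num_rbs, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.rb_head = nn.Linear(hidden, num_rbs)
        self.power_head = nn.Sequential(nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, state):
        h = self.body(state)
        return self.rb_head(h), self.power_head(h)

class Critic(nn.Module):
    """Takes a state vector and an action vector and outputs a Q value; for
    centralized training, the joint state and joint action of all N agents
    are concatenated before being passed in."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```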
In the heterogeneous cellular network, the resource allocation of a plurality of D2D communication pairs is a multi-agent reinforced learning problem and can be modeled as a partially observable Markov game model, and the Markov game models of N agents are as follows:
Figure BDA0001984765490000083
wherein the content of the first and second substances,
Figure BDA0001984765490000084
is a space of states that is,
Figure BDA0001984765490000085
is an action space, rjThe value of the return value of the jth intelligent agent is the return value corresponding to the return function of the jth D2D communication pair, j ∈ { 1.·, N }, p is the state transition probability of the whole environment, and gamma is a discount coefficient.
The goal of each agent's learning is to maximize its total discount return;
The total discounted return is calculated as:

$$R_j = \sum_{t=0}^{T} \gamma^{t}\, r_t^{j}$$

where $T$ is the time horizon, $\gamma^{t}$ is the discount factor raised to the power $t$, and $r_t^{j}$ is the reward value of the $j$-th D2D communication pair's reward function at time slot $t$.
Aiming at the Markov game model, the reinforcement learning model of the actor critics is expanded to a multi-agent scene, and a deep reinforcement learning model of the multi-agent is constructed, as shown in FIG. 5. During training, the critic part uses historical global information to guide the actor part to update the strategy; when the system is executed, the single agent only uses part of the environmental information obtained by observation and uses the actor strategy obtained by training to make action selection, thereby realizing centralized training and distributed execution.
In the centralized training process, the strategy of N agents uses pi ═ pi1,...,πNDenotes, θ ═ θ1,...,θNDenotes the parameters contained in the policy, where the jth agent expects a reward
Figure BDA0001984765490000088
The gradient of (d) is:
Figure BDA0001984765490000091
here, s includes status information of all agents, and s ═ s1,...,sN}; a contains the action information of all agents, a ═ a1,...,aN};
Figure BDA0001984765490000092
The method is a centralized action-value function, takes the state information and actions of all agents as input, and outputs the Q value of the jth agent.
Extending the above to deterministic policies, consider the deterministic policy $\mu_{\theta_j}$ (abbreviated $\mu_j$) and let $\mu = \{\mu_1,\ldots,\mu_N\}$ denote the deterministic policies of all agents; the gradient of the expected return of the $j$-th agent is:

$$\nabla_{\theta_j} J(\mu_j) = \mathbb{E}_{s,a\sim\mathcal{D}}\!\left[\, \nabla_{\theta_j}\mu_j(a_j \mid s_j)\, \nabla_{a_j} Q_j^{\mu}(s, a_1,\ldots,a_N)\big|_{a_j=\mu_j(s_j)} \right]$$

where $\mathcal{D}$ is the experience replay buffer, in which each sample is a tuple $(s_t, a_t, r_t, s_{t+1})$ recording the historical data of all agents, and $r_t = \{r_t^{1},\ldots,r_t^{N}\}$ includes the rewards of all agents at time slot $t$. The actor's policy is fitted with a deep neural network; the above gradient formula is the update rule of the actor network, which is updated by gradient ascent to obtain the maximum expected return.
The critic network also uses a deep neural network for fitting by minimizing a centralized action-cost function
Figure BDA0001984765490000097
To update the loss function of:
Figure BDA0001984765490000098
wherein the content of the first and second substances,
Figure BDA0001984765490000099
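The experience replay buffer $\mathcal{D}$ that stores the joint tuples $(s_t, a_t, r_t, s_{t+1})$ of all agents can be sketched as follows; the capacity and the class interface are illustrative assumptions.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores one tuple (s_t, a_t, r_t, s_{t+1}) per slot, each entry holding
    the states, actions and rewards of all N agents."""
    def __init__(self, capacity=100_000):
        self.storage = deque(maxlen=capacity)

    def push(self, states, actions, rewards, next_states):
        self.storage.append((states, actions, rewards, next_states))

    def sample(self, batch_size):
        batch = random.sample(self.storage, batch_size)
        states, actions, rewards, next_states = zip(*batch)
        return states, actions, rewards, next_states

    def __len__(self):
        return len(self.storage)
```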
step 504, performing offline training of the deep reinforcement learning model with historical communication data to obtain a model that solves the resource allocation problem of D2D communication pair $D_p$.
The training steps are as follows:
(1) initialize the cell, base station, cellular links and D2D links using a communication simulation platform;
(2) initialize the policy models $\pi$ and parameters $\theta$ of all agents, and initialize the number of communication simulation time slots $T$;
(3) initialize the communication simulation time slot t ← 0;
(4) all D2D communication pairs observe the environment to obtain the state information $s_t$, select an action $a_t$ based on $s_t$ and $\pi$, and obtain a reward $r_t$; t ← t + 1;
(5) record $(s_t, a_t, r_t, s_{t+1})$ into the experience replay buffer $\mathcal{D}$;
(6) sample a mini-batch of data from $\mathcal{D}$;
(7) train with the mini-batch data and update the parameters $\theta$ of the policy $\pi$;
(8) return to step (4) until $t = T$, then end the training;
(9) return the parameters $\theta$.
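A condensed sketch of this offline training loop is given below; `sim`, `agents`, `buffer` and `update_fn` are assumed interfaces standing in for the communication simulation platform, the D2D agents, the replay buffer of step (5), and the multi-agent update of steps (6)-(7).

```python
def train_offline(sim, agents, buffer, update_fn, num_slots, batch_size=64, warmup=1000):
    """Steps (1)-(9): collect experience from the simulation platform and
    update the agents' policies from mini-batches of the replay buffer."""
    sim.reset()                                                          # (1) cell, BS, links
    for t in range(num_slots):                                           # (3)-(8) loop over slots
        states = [sim.observe(j) for j in range(len(agents))]
        actions = [agents[j].act(states[j]) for j in range(len(agents))] # (4) pick RB + power
        rewards, next_states = sim.step(actions)                         # environment transition
        buffer.push(states, actions, rewards, next_states)               # (5) store joint experience
        if len(buffer) >= max(batch_size, warmup):
            update_fn(agents, buffer.sample(batch_size))                 # (6)-(7) mini-batch update
    return [agent.parameters() for agent in agents]                      # (9) trained parameters
```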
and step six, extracting respective state feature vectors for each D2D communication pair in the subsequent time slot, and inputting the state feature vectors into the trained deep reinforcement learning model to obtain the resource allocation scheme of each D2D communication pair.
The resource allocation scheme includes selecting appropriate communication resource blocks RB and transmission power.
The execution steps are as follows:
(1) initialize the cell, base station, cellular links and D2D links using a communication simulation platform;
(2) initialize the policy models $\pi$ of all agents, load the trained parameters $\theta$ into the models $\pi$, and initialize the number of communication simulation time slots $T$;
(3) initialize the communication simulation time slot t ← 0;
(4) all D2D communication pairs observe the environment to obtain the state information $s_t$ and, based on $s_t$ and $\pi$, select an action $a_t$, i.e., an RB and a transmit power; the D2D receiving users' SINR and the system capacity are recorded;
(5) t ← t + 1; the simulation platform updates the environment, and all D2D communication pairs observe the environment to obtain $s_{t+1}$;
(6) return to step (4) until $t = T$.
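The distributed execution phase can be sketched as follows; the `sim` and `actors` interfaces are hypothetical, and each D2D pair uses only its own observation together with its trained actor, as described in steps (1)-(6) above.

```python
def execute(sim, actors, num_slots):
    """Run the trained policies: each D2D pair picks an RB and a transmit power
    from its local observation only; SINR and system capacity are recorded."""
    sim.reset()
    sinr_log, capacity_log = [], []
    for t in range(num_slots):
        for j, actor in enumerate(actors):
            rb, power = actor.select(sim.observe(j))     # local observation only
            sim.apply(j, rb, power)
        sim.step()                                       # environment moves to slot t+1
        sinr_log.append(sim.d2d_sinrs())                 # statistics of D2D receive SINR
        capacity_log.append(sim.system_capacity())
    return sinr_log, capacity_log
```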
Respectively comparing the D2D resource allocation method based on the multi-agent with the D2D resource allocation method and the D2D random resource allocation method based on DQN;
as shown in fig. 6, MADRL represents the method of the present invention, DQN represents the D2D resource allocation method based on the deep Q network, Random represents the D2D resource allocation method based on Random allocation, and the three methods respectively affect the communication quality of cellular users, and it can be known from the figure that the algorithm MADRL of the present invention can achieve the lowest cellular user outage probability when the number of users is different, D2D;
as shown in fig. 7, for the influence of the three methods on the total capacity of the system, the algorithm MADRL of the present invention achieves the maximum system capacity as the number of D2D communication pairs increases.
FIG. 8 illustrates the total reward function and system capacity convergence performance of the present invention, and fig. 9 shows those of the DQN-based D2D resource allocation method; compared with DQN, the present method introduces global information into the training process for centralized training, so the training environment is more stable and the convergence performance is better. It can therefore be concluded that MADRL achieves higher system throughput than Random and DQN while preserving the communication quality of cellular users, and has better convergence performance than DQN.
In conclusion, by implementing the multi-agent reinforcement learning-based D2D resource allocation method, the communication quality of cellular users can be protected, and the system throughput can be maximized; compared with a centralized algorithm, the distributed resource allocation algorithm designed by the invention reduces signaling overhead; compared with other resource allocation algorithms based on Q learning, the algorithm designed by the invention has better convergence performance.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims (3)

1. A D2D resource allocation method based on multi-agent deep reinforcement learning is characterized by comprising the following specific steps:
step one, constructing a heterogeneous network model of a cellular network and a D2D communication shared spectrum;
the heterogeneous network model comprises a cellular Base Station (BS), M cellular downlink users and N D2D communication pairs;
setting the m-th cellular user as $C_m$, wherein $1 \le m \le M$, and the n-th D2D communication pair as $D_n$, wherein $1 \le n \le N$; the transmitting user and receiving user of D2D communication pair $D_n$ are denoted by $D_n^t$ and $D_n^r$, respectively;
the cellular downlink communication link and the D2D link adopt the orthogonal frequency division multiplexing technology, each cellular user occupies one communication resource block RB, and no interference exists between any two cellular links; simultaneously, one cellular user is allowed to share the same RB with a plurality of D2D users, and communication Resource Blocks (RB) and transmission power are autonomously selected by the D2D users;
step two, establishing a signal-to-interference-and-noise ratio (SINR) of a D2D receiving user and an SINR of a cellular user based on interference existing in a heterogeneous network model;
cellular user $C_m$'s received-signal SINR on the $k$-th communication resource block RB from the base station is:

$$\gamma_{C_m}^{k} = \frac{P_B\, g_{B,C_m}}{\sum_{D_n \in \mathcal{D}_k} P_{D_n}\, h_{D_n^t,C_m} + N_0}$$

where $P_B$ is the fixed transmit power of the base station; $g_{B,C_m}$ is the channel gain of the downlink target link from the base station to cellular user $C_m$; $\mathcal{D}_k$ is the set of all D2D communication pairs sharing the $k$-th RB; $P_{D_n}$ is the transmit power of the transmitting user of D2D communication pair $D_n$; $h_{D_n^t,C_m}$ is the channel gain of the interfering link from transmitting user $D_n^t$ of D2D communication pair $D_n$ to cellular user $C_m$ when multiple links share the RB; and $N_0$ is the power spectral density of the additive white Gaussian noise;
the received-signal SINR of the receiving user of D2D communication pair $D_n$ on the $k$-th RB is:

$$\gamma_{D_n}^{k} = \frac{P_{D_n}\, g_{D_n}}{P_B\, h_{B,D_n^r} + \sum_{D_i \in \mathcal{D}_k,\, i \ne n} P_{D_i}\, h_{D_i^t,D_n^r} + N_0}$$

where $g_{D_n}$ is the channel gain of the D2D target link from transmitting user $D_n^t$ of D2D communication pair $D_n$ to its receiving user $D_n^r$; $h_{B,D_n^r}$ is the channel gain of the interfering link from the base station to receiving user $D_n^r$ when multiple links share the RB; $P_{D_i}$ is the transmit power of the transmitting user of D2D communication pair $D_i$; and $h_{D_i^t,D_n^r}$ is the channel gain of the interfering link from transmitting user $D_i^t$ of D2D communication pair $D_i$ to receiving user $D_n^r$ when multiple links share the RB;
thirdly, calculating the unit bandwidth communication rates of the cellular link and the D2D link respectively by using the SINR of the cellular user and the SINR of the D2D receiving user;
communication rate per unit bandwidth of cellular link
The calculation formula is:

$$r_{C_m}^{k} = \log_2\!\left(1 + \gamma_{C_m}^{k}\right)$$

The unit-bandwidth communication rate of the D2D link is:

$$r_{D_n}^{k} = \log_2\!\left(1 + \gamma_{D_n}^{k}\right)$$
step four, calculating system capacity by using the communication rate of the cellular link and the D2D link in unit bandwidth, and constructing a D2D resource allocation optimization model in the heterogeneous network by taking the maximized system capacity as an optimization target;
the optimization model is as follows:
$$\max_{B_{N\times K},\, \mathbf{P}_D}\ \sum_{m=1}^{M} r_{C_m}^{k} + \sum_{n=1}^{N}\sum_{k=1}^{K} b_{n,k}\, r_{D_n}^{k}$$

$$\text{C1:}\quad \gamma_{C_m}^{k} \ge \gamma_{\min}^{C},\quad \forall m$$

$$\text{C2:}\quad \sum_{k=1}^{K} b_{n,k} \le 1,\ b_{n,k}\in\{0,1\},\quad \forall n$$

$$\text{C3:}\quad 0 \le P_{D_n} \le P_{\max},\quad \forall n$$

where $B_{N\times K} = [b_{n,k}]$ is the allocation matrix of communication resource blocks RB for the D2D communication pairs, $b_{n,k}$ is the RB selection parameter of D2D communication pair $D_n$, and $\mathbf{P}_D$ is the power control vector jointly formed by the transmit powers of all D2D communication pairs;

constraint C1 requires the SINR of each cellular user to be greater than the minimum received-SINR threshold $\gamma_{\min}^{C}$ of the cellular user, ensuring the communication quality of cellular users; constraint C2 is the D2D link spectrum allocation constraint: each D2D user pair can be allocated at most one communication resource block RB; constraint C3 requires that the transmit power of the transmitting user of each D2D communication pair does not exceed the maximum transmit power threshold $P_{\max}$;
Step five, aiming at the time slot t, on the basis of the D2D resource allocation optimization model, constructing a deep reinforcement learning model of each D2D communication pair;
the specific construction steps are as follows:
step 501, for a given D2D communication pair $D_p$, constructing the state feature vector $s_t$ at time slot $t$:

$$s_t = \left\{\, g_t^{D_p},\ h_t^{B,D_p},\ I_{t-1},\ \mathcal{B}_{t-1}^{D},\ \mathcal{B}_{t-1}^{C} \,\right\}$$

where $g_t^{D_p}$ is the instantaneous channel state information of the D2D communication link; $h_t^{B,D_p}$ is the instantaneous channel state information of the interfering link from the base station to the receiving user of D2D communication pair $D_p$; $I_{t-1}$ is the interference power received by the receiving user of D2D communication pair $D_p$ in the previous time slot $t-1$; $\mathcal{B}_{t-1}^{D}$ is the set of RBs occupied in the previous time slot $t-1$ by the D2D communication pairs neighboring $D_p$; and $\mathcal{B}_{t-1}^{C}$ is the set of RBs occupied in the previous time slot $t-1$ by the cellular users neighboring $D_p$;
step 502, simultaneously constructing the reward function $r_t$ of the D2D communication pair $D_p$ at time slot $t$:

$$r_t = \begin{cases} r_{D_p}^{k}, & \text{if the SINR constraint of the cellular user sharing the RB is satisfied} \\ r_n, & \text{otherwise} \end{cases}$$

where $r_n$ is a negative reward, $r_n < 0$;
Step 503, constructing the state characteristics of the multi-agent Markov game model by using the state characteristic vectors of the D2D communication pair; in order to optimize the Markov game model, a return function in the deep reinforcement learning model of the multi-agent actor critic is established by utilizing the return function of the D2D communication pair;
the Markov game model of the N agents is:

$$\left(\mathcal{S},\ \mathcal{A}_1,\ldots,\mathcal{A}_N,\ r_1,\ldots,r_N,\ p,\ \gamma\right)$$

where $\mathcal{S}$ is the state space, $\mathcal{A}_j$ is the action space of agent $j$, $r_j$ is the reward value corresponding to the reward function of the $j$-th D2D communication pair, $j \in \{1,\ldots,N\}$, $p$ is the state transition probability of the whole environment, and $\gamma$ is the discount factor;
the goal of each D2D communication pair learning is to maximize the total discount return for that D2D communication pair;
the total discounted return is calculated as:

$$R_j = \sum_{t=0}^{T} \gamma^{t}\, r_t^{j}$$

where $T$ is the time horizon, $\gamma^{t}$ is the discount factor raised to the power $t$, and $r_t^{j}$ is the reward value of the $j$-th D2D communication pair's reward function at time slot $t$;
the deep reinforcement learning model of the actor critics consists of actors and critics;
in the training process, the strategy of the actor is fitted by using a deep neural network, and is updated by using the following deterministic strategy gradient formula so as to obtain the maximum expected return;
let $\mu = \{\mu_1,\ldots,\mu_N\}$ denote the deterministic policies of all agents and $\theta = \{\theta_1,\ldots,\theta_N\}$ denote the parameters contained in the policies; the gradient of the expected return of the $j$-th agent is:

$$\nabla_{\theta_j} J(\mu_j) = \mathbb{E}_{s,a\sim\mathcal{D}}\!\left[\, \nabla_{\theta_j}\mu_j(a_j \mid s_j)\, \nabla_{a_j} Q_j^{\mu}(s, a_1,\ldots,a_N)\big|_{a_j=\mu_j(s_j)} \right]$$

where $s$ contains the state information of all agents, $s = \{s_1,\ldots,s_N\}$; $a$ contains the action information of all agents, $a = \{a_1,\ldots,a_N\}$; $Q_j^{\mu}$ is the centralized action-value function of the $j$-th agent; and $\mathcal{D}$ is the experience replay buffer;
the critic is also fitted with a deep neural network and is updated by minimizing the loss function of the centralized action-value function $Q_j^{\mu}$:

$$\mathcal{L}(\theta_j) = \mathbb{E}_{(s_t,a_t,r_t,s_{t+1})\sim\mathcal{D}}\!\left[\left(Q_j^{\mu}(s_t, a_1,\ldots,a_N) - y\right)^{2}\right],\qquad y = r_t^{j} + \gamma\, Q_j^{\mu'}(s_{t+1}, a_1',\ldots,a_N')\big|_{a_i'=\mu_i'(s_i)}$$

where each sample in $\mathcal{D}$ is a tuple $(s_t, a_t, r_t, s_{t+1})$ recording the historical data of all agents, and $r_t = \{r_t^{1},\ldots,r_t^{N}\}$ includes the rewards of all agents at time slot $t$;
step 504, performing offline training of the deep reinforcement learning model with historical communication data to obtain a model that solves the resource allocation problem of D2D communication pair $D_p$;
and step six, extracting respective state feature vectors for each D2D communication pair in the subsequent time slot, and inputting the state feature vectors into the trained deep reinforcement learning model to obtain the resource allocation scheme of each D2D communication pair.
2. The multi-agent deep reinforcement learning-based D2D resource allocation method as claimed in claim 1, wherein the interference in step two includes three types: 1) cellular users experience interference from transmitting users in each D2D communication pair sharing the same RB; 2) interference experienced by the receiving users in each D2D communication pair from the base station; 3) the receiving user in each D2D communication pair is subject to interference from the transmitting user in all other D2D communication pairs that share the same RBs.
3. The multi-agent deep reinforcement learning-based D2D resource allocation method as claimed in claim 1, wherein the resource allocation scheme in step six includes selecting appropriate communication resource blocks RB and transmission power;
the execution steps are as follows:
(1) initializing a cell, a base station, a cellular link, a D2D link using a communication simulation platform;
(2) initializing strategy models pi of all agents, importing the trained parameters theta into the models pi, and initializing communication simulation time slot number T;
(3) initializing a communication simulation time slot t ← 0;
(4) all D2D communication pairs observe the environment to obtain the state information $s_t$ and, based on $s_t$ and $\pi$, select an action $a_t$, i.e., an RB and a transmit power; the D2D receiving users' SINR and the system capacity are recorded;
(5) t ← t + 1; the simulation platform updates the environment, and all D2D communication pairs observe the environment to obtain $s_{t+1}$;
(6) return to step (4) until $t = T$.
CN201910161391.8A 2018-12-21 2019-03-04 D2D resource allocation method based on multi-agent deep reinforcement learning Active CN109729528B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2018115721684 2018-12-21
CN201811572168 2018-12-21

Publications (2)

Publication Number Publication Date
CN109729528A CN109729528A (en) 2019-05-07
CN109729528B true CN109729528B (en) 2020-08-18

Family

ID=66300856

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910161391.8A Active CN109729528B (en) 2018-12-21 2019-03-04 D2D resource allocation method based on multi-agent deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN109729528B (en)

Families Citing this family (40)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110267274B (en) * 2019-05-09 2022-12-16 广东工业大学 Spectrum sharing method for selecting sensing users according to social credibility among users
CN110049474B (en) * 2019-05-17 2020-07-17 北京邮电大学 Wireless resource allocation method, device and base station
CN112383922B (en) * 2019-07-07 2022-09-30 东北大学秦皇岛分校 Deep reinforcement learning frequency spectrum sharing method based on prior experience replay
CN110267338B (en) * 2019-07-08 2020-05-22 西安电子科技大学 Joint resource allocation and power control method in D2D communication
CN110582072B (en) * 2019-08-16 2020-07-10 北京邮电大学 Fuzzy matching-based resource allocation method and device in cellular internet of vehicles
CN110784882B (en) * 2019-10-28 2022-06-28 南京邮电大学 Energy acquisition D2D communication resource allocation method based on reinforcement learning
CN110856268B (en) * 2019-10-30 2021-09-07 西安交通大学 Dynamic multichannel access method for wireless network
CN110769514B (en) * 2019-11-08 2023-05-12 山东师范大学 Heterogeneous cellular network D2D communication resource allocation method and system
CN111026549B (en) * 2019-11-28 2022-06-10 国网甘肃省电力公司电力科学研究院 Automatic test resource scheduling method for power information communication equipment
CN111065102B (en) * 2019-12-16 2022-04-19 北京理工大学 Q learning-based 5G multi-system coexistence resource allocation method under unlicensed spectrum
CN111526592B (en) * 2020-04-14 2022-04-08 电子科技大学 Non-cooperative multi-agent power control method used in wireless interference channel
CN111556572B (en) * 2020-04-21 2022-06-07 北京邮电大学 Spectrum resource and computing resource joint allocation method based on reinforcement learning
CN111787624B (en) * 2020-06-28 2022-04-26 重庆邮电大学 Variable dimension resource allocation method based on deep learning
CN112118632B (en) * 2020-09-22 2022-07-29 电子科技大学 Adaptive power distribution system, method and medium for micro-cell base station
CN112584347B (en) * 2020-09-28 2022-07-08 西南电子技术研究所(中国电子科技集团公司第十研究所) UAV heterogeneous network multi-dimensional resource dynamic management method
CN112272353B (en) * 2020-10-09 2021-09-28 山西大学 Device-to-device proximity service method based on reinforcement learning
CN112533237B (en) * 2020-11-16 2022-03-04 北京科技大学 Network capacity optimization method for supporting large-scale equipment communication in industrial internet
CN112752266B (en) * 2020-12-28 2022-05-24 中国人民解放军陆军工程大学 Joint spectrum access and power control method in D2D haptic communication
CN112822781B (en) * 2021-01-20 2022-04-12 重庆邮电大学 Resource allocation method based on Q learning
CN113115451A (en) * 2021-02-23 2021-07-13 北京邮电大学 Interference management and resource allocation scheme based on multi-agent deep reinforcement learning
CN113115355B (en) * 2021-04-29 2022-04-22 电子科技大学 Power distribution method based on deep reinforcement learning in D2D system
CN113473419B (en) * 2021-05-20 2023-07-07 南京邮电大学 Method for accessing machine type communication device into cellular data network based on reinforcement learning
CN113543271B (en) * 2021-06-08 2022-06-07 西安交通大学 Effective capacity-oriented resource allocation method and system
CN113596786B (en) * 2021-07-26 2023-11-14 广东电网有限责任公司广州供电局 Resource allocation grouping optimization method for end-to-end communication
CN113766661B (en) * 2021-08-30 2023-12-26 北京邮电大学 Interference control method and system for wireless network environment
CN113810910B (en) * 2021-09-18 2022-05-20 大连理工大学 Deep reinforcement learning-based dynamic spectrum sharing method between 4G and 5G networks
WO2023054776A1 (en) * 2021-10-01 2023-04-06 엘지전자 주식회사 Method and device for transmitting progressive features for edge inference
CN113867178B (en) * 2021-10-26 2022-05-31 哈尔滨工业大学 Virtual and real migration training system for multi-robot confrontation
CN114245401B (en) * 2021-11-17 2023-12-05 航天科工微电子系统研究院有限公司 Multi-channel communication decision method and system
CN114363938B (en) * 2021-12-21 2024-01-26 深圳千通科技有限公司 Cellular network flow unloading method
CN114423070B (en) * 2022-02-10 2024-03-19 吉林大学 Heterogeneous wireless network power distribution method and system based on D2D
CN114928549A (en) * 2022-04-20 2022-08-19 清华大学 Communication resource allocation method and device of unauthorized frequency band based on reinforcement learning
CN114900827A (en) * 2022-05-10 2022-08-12 福州大学 Covert communication system in D2D heterogeneous cellular network based on deep reinforcement learning
CN115173922B (en) * 2022-06-30 2024-03-15 深圳泓越信息科技有限公司 Multi-beam satellite communication system resource allocation method based on CMADDQN network
CN115442812B (en) * 2022-11-08 2023-04-07 湖北工业大学 Deep reinforcement learning-based Internet of things spectrum allocation optimization method and system
CN115544899B (en) * 2022-11-23 2023-04-07 南京邮电大学 Water plant water intake pump station energy-saving scheduling method based on multi-agent deep reinforcement learning
CN115811788B (en) * 2022-11-23 2023-07-18 齐齐哈尔大学 D2D network distributed resource allocation method combining deep reinforcement learning and unsupervised learning
CN116155991B (en) * 2023-01-30 2023-10-10 杭州滨电信息技术有限公司 Edge content caching and recommending method and system based on deep reinforcement learning
CN116193405B (en) * 2023-03-03 2023-10-27 中南大学 Heterogeneous V2X network data transmission method based on DONA framework
CN116489683B (en) * 2023-06-21 2023-08-18 北京邮电大学 Method and device for unloading computing tasks in space-sky network and electronic equipment

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104995851A (en) * 2013-03-08 2015-10-21 英特尔公司 Distributed power control for d2d communications

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108834109B (en) * 2018-05-03 2021-03-19 中国人民解放军陆军工程大学 D2D cooperative relay power control method based on Q learning under full-duplex active eavesdropping

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104995851A (en) * 2013-03-08 2015-10-21 英特尔公司 Distributed power control for d2d communications

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Joint resource allocation and power control algorithm based on Q-learning in D2D communication; Wang Qian; Journal of Nanjing University; Nov. 30, 2018; pp. 1183-1192 *
Location-Aware Hypergraph Coloring Based Spectrum Allocation for D2D Communication; Zheng Li et al.; IEEE; Oct. 15, 2018; pp. 1-6 *
Secure Social Networks in 5G Systems with Mobile Edge Computing, Caching, and Device-to-Device Communications; Ying He et al.; IEEE; Jul. 4, 2018; pp. 103-109 *

Also Published As

Publication number Publication date
CN109729528A (en) 2019-05-07

Similar Documents

Publication Publication Date Title
CN109729528B (en) D2D resource allocation method based on multi-agent deep reinforcement learning
CN109862610B (en) D2D user resource allocation method based on deep reinforcement learning DDPG algorithm
CN110267338B (en) Joint resource allocation and power control method in D2D communication
Alqerm et al. Sophisticated online learning scheme for green resource allocation in 5G heterogeneous cloud radio access networks
CN111800828B (en) Mobile edge computing resource allocation method for ultra-dense network
CN111405569A (en) Calculation unloading and resource allocation method and device based on deep reinforcement learning
Wang et al. Joint interference alignment and power control for dense networks via deep reinforcement learning
Zhang et al. Deep reinforcement learning for multi-agent power control in heterogeneous networks
AlQerm et al. Enhanced machine learning scheme for energy efficient resource allocation in 5G heterogeneous cloud radio access networks
Lu et al. A cross-layer resource allocation scheme for ICIC in LTE-Advanced
CN107172576B (en) D2D communication downlink resource sharing method for enhancing cellular network security
Elsayed et al. Deep reinforcement learning for reducing latency in mission critical services
CN113596785A (en) D2D-NOMA communication system resource allocation method based on deep Q network
CN116390125A (en) Industrial Internet of things cloud edge cooperative unloading and resource allocation method based on DDPG-D3QN
CN114867030A (en) Double-time-scale intelligent wireless access network slicing method
CN116347635A (en) NB-IoT wireless resource allocation method based on NOMA and multi-agent reinforcement learning
Giri et al. Deep Q-learning based optimal resource allocation method for energy harvested cognitive radio networks
Yin et al. Decentralized federated reinforcement learning for user-centric dynamic tfdd control
Luo et al. Communication-aware path design for indoor robots exploiting federated deep reinforcement learning
Labana et al. Joint user association and resource allocation in CoMP-enabled heterogeneous CRAN
Chen et al. iPAS: A deep Monte Carlo Tree Search-based intelligent pilot-power allocation scheme for massive MIMO system
Zhao et al. Power control for D2D communication using multi-agent reinforcement learning
Liu et al. Power allocation in ultra-dense networks through deep deterministic policy gradient
Wang et al. Resource allocation in multi-cell NOMA systems with multi-agent deep reinforcement learning
CN114423070B (en) Heterogeneous wireless network power distribution method and system based on D2D

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant