CN109729528B - D2D resource allocation method based on multi-agent deep reinforcement learning - Google Patents

D2D resource allocation method based on multi-agent deep reinforcement learning

Info

Publication number
CN109729528B
CN109729528B (application CN201910161391.8A)
Authority
CN
China
Prior art keywords
communication
cellular
user
link
pair
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910161391.8A
Other languages
Chinese (zh)
Other versions
CN109729528A (en)
Inventor
郭彩丽
李政
宣一荻
冯春燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Publication of CN109729528A publication Critical patent/CN109729528A/en
Application granted granted Critical
Publication of CN109729528B publication Critical patent/CN109729528B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D — CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 — Reducing energy consumption in communication networks
    • Y02D30/70 — Reducing energy consumption in communication networks in wireless communication networks

Abstract

The invention discloses a D2D resource allocation method based on multi-agent deep reinforcement learning, and belongs to the field of wireless communication. First, a heterogeneous network model in which a cellular network and D2D communication share spectrum is constructed, the signal-to-interference-plus-noise ratio (SINR) of the D2D receiving users and the SINR of the cellular users are established based on the existing interference, the unit-bandwidth communication rates of the cellular links and the D2D links are then calculated, and a D2D resource allocation optimization model in the heterogeneous network is constructed with maximized system capacity as the optimization target. For time slot t, a deep reinforcement learning model of each D2D communication pair is constructed on the basis of the D2D resource allocation optimization model. In subsequent time slots, each D2D communication pair extracts its own state feature vector and inputs it into the trained deep reinforcement learning model to obtain its resource allocation scheme. The invention optimizes spectrum allocation and transmit power, maximizes system capacity, and provides a low-complexity resource allocation algorithm.

Description

D2D resource allocation method based on multi-agent deep reinforcement learning
Technical Field
The invention belongs to the field of wireless communication, relates to a heterogeneous cellular network system, and particularly relates to a D2D resource allocation method based on multi-agent deep reinforcement learning.
Background
The popularization of intelligent terminals and the explosive growth of mobile Internet services place higher requirements on the data transmission capability of wireless communication networks. Under this trend, existing cellular networks suffer from problems such as spectrum resource shortage and heavy base station load, and cannot meet the transmission requirements of future wireless networks.
Device-to-Device (D2D) communication allows neighboring users to establish a direct link for communication; it is a promising technology for future wireless communication networks because it improves spectral efficiency, saves power consumption and offloads base station load. Introducing D2D communication into the cellular network can, on the one hand, save energy and improve the performance of cell-edge users, and on the other hand, greatly improve spectrum utilization by letting D2D links share the spectrum of cellular users.
However, when D2D communication multiplexes the spectrum of the cellular network it causes cross-layer interference to the cellular communication links, and the communication quality of cellular users, the primary users of the cellular band, must be guaranteed; meanwhile, under dense D2D deployment, multiple D2D communication links multiplexing the same spectrum cause peer-to-peer interference among themselves. Interference management when the cellular network and D2D communication coexist is therefore an urgent problem. Wireless network resource allocation aims to mitigate interference through reasonable allocation of resources and to improve the utilization efficiency of spectrum resources, and is an effective way to solve the interference management problem.
Existing research on D2D communication resource allocation in cellular networks can be divided into centralized and distributed categories. Centralized methods assume that the base station has instantaneous global channel state information (CSI) and controls the resource allocation of the D2D users; however, acquiring global CSI incurs huge signaling overhead, and in future scenarios with massive numbers of wireless devices the base station can hardly obtain instantaneous global information, so centralized algorithms are no longer applicable in device-dense scenarios.
Distributed methods let D2D users select wireless network resources autonomously; existing research is mainly based on game theory and reinforcement learning. Game-theoretic methods model D2D users as players who compete until a Nash equilibrium is reached, but solving for the Nash equilibrium requires a great deal of information exchange among users and many iterations to converge. Resource allocation research based on reinforcement learning mainly relies on Q learning, such as the Deep Q Network (DQN): a D2D user is regarded as an agent that learns a strategy to select wireless network resources autonomously. However, when multiple agents learn and train simultaneously, the strategy of each agent keeps changing, which makes the training environment non-stationary and convergence difficult. Therefore, a distributed resource allocation algorithm with good convergence and low complexity is needed to solve the interference management problem of D2D communication in cellular networks.
Disclosure of Invention
In order to solve the above problems, the invention provides a D2D resource allocation method based on multi-agent deep reinforcement learning, built on deep reinforcement learning theory, which optimizes the spectrum allocation and transmit power of D2D users, maximizes the system capacity of the cellular network and D2D communication, and guarantees the communication quality of cellular users.
The method comprises the following specific steps:
step one, constructing a heterogeneous network model of a cellular network and a D2D communication shared spectrum;
the heterogeneous network model comprises cellular base stations BS, M cellular downlink users and N D2D communication pairs.
Setting the m-th cellular user as $C_m$, wherein $1 \le m \le M$, and the n-th D2D communication pair as $D_n$, wherein $1 \le n \le N$. The transmitting user and receiving user of D2D communication pair $D_n$ are denoted by $D_n^t$ and $D_n^r$, respectively.
The cellular downlink communication links and the D2D links adopt orthogonal frequency division multiplexing; each cellular user occupies one communication resource block RB, and no interference exists between any two cellular links. One cellular user is allowed to share the same RB with multiple D2D users at the same time, and the communication resource block RB and the transmit power are selected autonomously by each D2D user.
Step two, establishing a signal-to-interference-and-noise ratio (SINR) of a D2D receiving user and an SINR of a cellular user based on interference existing in a heterogeneous network model;
interference includes three types: 1) cellular users experience interference from transmitting users in each D2D communication pair sharing the same RB; 2) interference experienced by the receiving users in each D2D communication pair from the base station; 3) the receiving user in each D2D communication pair is subject to interference from the transmitting user in all other D2D communication pairs that share the same RBs.
Cellular user $C_m$'s received-signal SINR on the $k$-th communication resource block RB from the base station is:

$$\gamma_{C_m}^{k} = \frac{P_B\, g_{B,C_m}}{\sum_{D_n \in \mathcal{D}_k} P_{D_n}\, h_{D_n^t,C_m} + N_0}$$

where $P_B$ is the fixed transmit power of the base station; $g_{B,C_m}$ is the channel gain of the downlink target link from the base station to cellular user $C_m$; $\mathcal{D}_k$ is the set of all D2D communication pairs sharing the $k$-th RB; $P_{D_n}$ is the transmit power of the transmitting user of D2D communication pair $D_n$; $h_{D_n^t,C_m}$ is the channel gain of the interfering link from transmitting user $D_n^t$ of D2D communication pair $D_n$ to cellular user $C_m$ when multiple links share the RB; and $N_0$ is the power spectral density of the additive white Gaussian noise.
The received-signal SINR of the receiving user of D2D communication pair $D_n$ on the $k$-th RB is:

$$\gamma_{D_n}^{k} = \frac{P_{D_n}\, g_{D_n}}{P_B\, h_{B,D_n^r} + \sum_{D_i \in \mathcal{D}_k,\, i \ne n} P_{D_i}\, h_{D_i^t,D_n^r} + N_0}$$

where $g_{D_n}$ is the channel gain of the D2D target link from transmitting user $D_n^t$ of D2D communication pair $D_n$ to its receiving user $D_n^r$; $h_{B,D_n^r}$ is the channel gain of the interfering link from the base station to receiving user $D_n^r$ when multiple links share the RB; $P_{D_i}$ is the transmit power of the transmitting user of D2D communication pair $D_i$; and $h_{D_i^t,D_n^r}$ is the channel gain of the interfering link from transmitting user $D_i^t$ of D2D communication pair $D_i$ to receiving user $D_n^r$ when multiple links share the RB;
thirdly, calculating the unit bandwidth communication rates of the cellular link and the D2D link respectively by using the SINR of the cellular user and the SINR of the D2D receiving user;
communication rate per unit bandwidth of cellular link
The calculation formula is:

$$r_{C_m}^{k} = \log_2\!\left(1 + \gamma_{C_m}^{k}\right)$$

The unit-bandwidth communication rate of the D2D link is:

$$r_{D_n}^{k} = \log_2\!\left(1 + \gamma_{D_n}^{k}\right)$$
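For illustration only (this sketch is not part of the claimed method), the following Python code evaluates the two SINR expressions of step two and the unit-bandwidth rates of step three; the function names, argument names and the toy channel values are assumptions introduced here.

```python
import numpy as np

def cellular_sinr(p_bs, g_bs_cm, p_d2d_tx, h_d2d_cm, n0):
    """SINR of cellular user C_m on the k-th RB; p_d2d_tx and h_d2d_cm
    are arrays over the D2D pairs sharing that RB."""
    interference = np.sum(np.asarray(p_d2d_tx) * np.asarray(h_d2d_cm))
    return (p_bs * g_bs_cm) / (interference + n0)

def d2d_sinr(p_dn, g_dn, p_bs, h_bs_dn, p_other_tx, h_other_dn, n0):
    """SINR of the receiving user of D2D pair D_n on the k-th RB;
    p_other_tx and h_other_dn cover the other D2D pairs on the same RB."""
    interference = p_bs * h_bs_dn + np.sum(np.asarray(p_other_tx) * np.asarray(h_other_dn))
    return (p_dn * g_dn) / (interference + n0)

def rate_per_hz(sinr):
    """Unit-bandwidth communication rate (Shannon formula), in bit/s/Hz."""
    return np.log2(1.0 + sinr)

# Toy numbers, purely illustrative.
gamma_c = cellular_sinr(p_bs=1.0, g_bs_cm=1e-6, p_d2d_tx=[0.1, 0.05],
                        h_d2d_cm=[1e-8, 2e-8], n0=1e-12)
gamma_d = d2d_sinr(p_dn=0.1, g_dn=1e-5, p_bs=1.0, h_bs_dn=1e-8,
                   p_other_tx=[0.05], h_other_dn=[1e-8], n0=1e-12)
print(rate_per_hz(gamma_c), rate_per_hz(gamma_d))
```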
step four, calculating system capacity by using the communication rate of the cellular link and the D2D link in unit bandwidth, and constructing a D2D resource allocation optimization model in the heterogeneous network by taking the maximized system capacity as an optimization target;
the optimization model is as follows:
$$\max_{B_{N\times K},\, \mathbf{P}_D}\ \sum_{m=1}^{M} r_{C_m}^{k} + \sum_{n=1}^{N}\sum_{k=1}^{K} b_{n,k}\, r_{D_n}^{k}$$

$$\text{C1:}\quad \gamma_{C_m}^{k} \ge \gamma_{\min}^{C},\quad \forall m$$

$$\text{C2:}\quad \sum_{k=1}^{K} b_{n,k} \le 1,\ b_{n,k}\in\{0,1\},\quad \forall n$$

$$\text{C3:}\quad 0 \le P_{D_n} \le P_{\max},\quad \forall n$$

where $B_{N\times K} = [b_{n,k}]$ is the allocation matrix of communication resource blocks RB for the D2D communication pairs, $b_{n,k}$ is the RB selection parameter of D2D communication pair $D_n$, and $\mathbf{P}_D$ is the power control vector jointly formed by the transmit powers of all D2D communication pairs.

Constraint C1 requires the SINR of each cellular user to be greater than the minimum received-SINR threshold $\gamma_{\min}^{C}$ of the cellular user, ensuring the communication quality of cellular users; constraint C2 is the D2D link spectrum allocation constraint: each D2D user pair can be allocated at most one communication resource block RB; constraint C3 requires that the transmit power of the transmitting user of each D2D communication pair does not exceed the maximum transmit power threshold $P_{\max}$.
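As a sketch of how a candidate allocation could be scored against this optimization model (illustrative only; all names are assumptions introduced here, and `rate_d` is taken as the N×K matrix of D2D unit-bandwidth rates):

```python
import numpy as np

def system_capacity(rate_c, rate_d, b):
    """Objective: sum of cellular unit-bandwidth rates plus the D2D rates
    selected by the RB allocation matrix b[n, k]."""
    return float(np.sum(rate_c) + np.sum(np.asarray(b) * np.asarray(rate_d)))

def feasible(sinr_c, b, p_d2d, sinr_min, p_max):
    """Check constraints C1-C3 for a candidate allocation."""
    b = np.asarray(b)
    p_d2d = np.asarray(p_d2d)
    c1 = np.all(np.asarray(sinr_c) >= sinr_min)                   # C1: cellular SINR threshold
    c2 = np.all(b.sum(axis=1) <= 1) and np.isin(b, (0, 1)).all()  # C2: at most one RB per D2D pair
    c3 = np.all((p_d2d >= 0) & (p_d2d <= p_max))                  # C3: transmit power budget
    return bool(c1 and c2 and c3)
```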
Step five, aiming at the time slot t, on the basis of the D2D resource allocation optimization model, constructing a deep reinforcement learning model of each D2D communication pair;
the specific construction steps are as follows:
step 501, for a given D2D communication pair $D_p$, constructing the state feature vector $s_t$ at time slot $t$:

$$s_t = \left\{\, g_t^{D_p},\ h_t^{B,D_p},\ I_{t-1},\ \mathcal{B}_{t-1}^{D},\ \mathcal{B}_{t-1}^{C} \,\right\}$$

where $g_t^{D_p}$ is the instantaneous channel state information of the D2D communication link; $h_t^{B,D_p}$ is the instantaneous channel state information of the interfering link from the base station to the receiving user of D2D communication pair $D_p$; $I_{t-1}$ is the interference power received by the receiving user of D2D communication pair $D_p$ in the previous time slot $t-1$; $\mathcal{B}_{t-1}^{D}$ is the set of RBs occupied in the previous time slot $t-1$ by the D2D communication pairs neighboring $D_p$; and $\mathcal{B}_{t-1}^{C}$ is the set of RBs occupied in the previous time slot $t-1$ by the cellular users neighboring $D_p$.
Step 502, simultaneously constructing the D2D communication pair DpA return function r at time slot tt
Figure BDA00019847654900000316
rnIn negative return, rn<0;
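The reward of step 502 can be sketched as follows; the default value of the negative reward $r_n$ is an assumption chosen only for illustration.

```python
def reward(sinr_cellular, sinr_min_cellular, rate_d2d, r_negative=-1.0):
    """Reward of a D2D pair at slot t: its unit-bandwidth rate if the cellular
    user sharing the RB still meets its SINR threshold, otherwise r_n < 0."""
    return rate_d2d if sinr_cellular >= sinr_min_cellular else r_negative
```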
Step 503, constructing the state characteristics of the multi-agent Markov game model by using the state characteristic vectors of the D2D communication pair; in order to optimize the Markov game model, a return function in the deep reinforcement learning model of the multi-agent actor critic is established by utilizing the return function of the D2D communication pair;
The Markov game model of the N agents is:

$$\left(\mathcal{S},\ \mathcal{A}_1,\ldots,\mathcal{A}_N,\ r_1,\ldots,r_N,\ p,\ \gamma\right)$$

where $\mathcal{S}$ is the state space, $\mathcal{A}_j$ is the action space of agent $j$, $r_j$ is the reward value corresponding to the reward function of the $j$-th D2D communication pair, $j \in \{1,\ldots,N\}$, $p$ is the state transition probability of the whole environment, and $\gamma$ is the discount factor.
The goal of each D2D communication pair learning is to maximize the total discount return for that D2D communication pair;
The total discounted return is calculated as:

$$R_j = \sum_{t=0}^{T} \gamma^{t}\, r_t^{j}$$

where $T$ is the time horizon, $\gamma^{t}$ is the discount factor raised to the power $t$, and $r_t^{j}$ is the reward value of the $j$-th D2D communication pair's reward function at time slot $t$.
The Actor Critic reinforcement learning model consists of an Actor (Actor) and a Critic (Critic);
in the training process, the strategy of the actor is fitted by using a deep neural network, and is updated by using the following deterministic strategy gradient formula so as to obtain the maximum expected return.
Let mu be { mu ═ mu1,...,μNDenotes the deterministic policy for all agents, θ ═ θ1,...,θNThe parameters contained in the strategy are expressed, and the gradient formula of the expected return of the jth agent is as follows:
Figure BDA0001984765490000046
s contains state information for all agents, s ═ s1,...,sN}; a contains the action information of all agents, a ═ a1,...,aN};
Figure BDA0001984765490000047
Is an experience replay buffer;
The critic is also fitted with a deep neural network and is updated by minimizing the loss function of the centralized action-value function $Q_j^{\mu}$:

$$\mathcal{L}(\theta_j) = \mathbb{E}_{(s_t,a_t,r_t,s_{t+1})\sim\mathcal{D}}\!\left[\left(Q_j^{\mu}(s_t, a_1,\ldots,a_N) - y\right)^{2}\right],\qquad y = r_t^{j} + \gamma\, Q_j^{\mu'}(s_{t+1}, a_1',\ldots,a_N')\big|_{a_i'=\mu_i'(s_i)}$$

where each sample in $\mathcal{D}$ is a tuple $(s_t, a_t, r_t, s_{t+1})$ recording the historical data of all agents, and $r_t = \{r_t^{1},\ldots,r_t^{N}\}$ includes the rewards of all agents at time slot $t$.
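For illustration, a PyTorch-style sketch of one such centralized actor-critic gradient step for agent j is given below; the tensor layout, the use of target networks for the critic target y, and all function and variable names are assumptions made here rather than the patented implementation, and each actor is assumed to map its local state to a flat action vector.

```python
import torch
import torch.nn.functional as F

def update_agent_j(j, batch, actors, critics, target_actors, target_critics,
                   actor_opt, critic_opt, gamma=0.95):
    """One gradient step for agent j from a sampled mini-batch.
    batch tensors: states/next_states [B, N, s_dim], actions [B, N, a_dim], rewards [B, N]."""
    B, N, _ = batch["states"].shape
    joint_s = batch["states"].reshape(B, -1)
    joint_a = batch["actions"].reshape(B, -1)
    joint_s_next = batch["next_states"].reshape(B, -1)

    # Critic update: minimize (Q_j(s, a_1..a_N) - y)^2, with y built from target networks (assumed).
    with torch.no_grad():
        next_a = torch.cat([target_actors[i](batch["next_states"][:, i]) for i in range(N)], dim=-1)
        y = batch["rewards"][:, j:j + 1] + gamma * target_critics[j](joint_s_next, next_a)
    critic_loss = F.mse_loss(critics[j](joint_s, joint_a), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor update: ascend the deterministic policy gradient of agent j.
    acts = [batch["actions"][:, i] for i in range(N)]
    acts[j] = actors[j](batch["states"][:, j])          # re-evaluate agent j's own action
    actor_loss = -critics[j](joint_s, torch.cat(acts, dim=-1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    return critic_loss.item(), actor_loss.item()
```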
Step 504, performing offline training on the deep reinforcement learning model by using historical communication data to obtain D2D communication D for solving the DpProblem of resource allocationThe model of (1).
And step six, extracting respective state feature vectors for each D2D communication pair in the subsequent time slot, and inputting the state feature vectors into the trained deep reinforcement learning model to obtain the resource allocation scheme of each D2D communication pair.
The resource allocation scheme includes selecting appropriate communication resource blocks RB and transmission power.
The invention has the advantages that:
(1) the D2D resource allocation method based on multi-agent deep reinforcement learning optimizes the spectrum allocation and transmit power of D2D users, and maximizes the system capacity while ensuring the communication quality of cellular users;
(2) the method designs a distributed D2D resource allocation algorithm for the heterogeneous cellular network, greatly reducing the signaling overhead incurred to obtain global instantaneous channel state information;
(3) the method introduces a multi-agent reinforcement learning model with centralized training and distributed execution, solves the resource allocation problem of multiple D2D communication pairs, obtains good training convergence performance, and provides a low-complexity resource allocation algorithm.
Drawings
Fig. 1 is a schematic diagram of a heterogeneous network model of a cellular network and D2D communication sharing spectrum, which is constructed by the present invention;
FIG. 2 is a flow chart of a D2D resource allocation method based on multi-agent deep reinforcement learning according to the present invention;
FIG. 3 is a diagram illustrating a deep reinforcement learning model for D2D communication resource allocation according to the present invention;
FIG. 4 is a diagram of the single-agent actor-critic reinforcement learning model of the present invention;
FIG. 5 is a diagram of the multi-agent actor-critic reinforcement learning model of the present invention;
Fig. 6 is a graph comparing the cellular user outage rates of the present invention with those of the DQN-based D2D resource allocation method and the random D2D resource allocation method;
Fig. 7 is a graph comparing the total system capacity of the present invention with that of the DQN-based D2D resource allocation method and the random D2D resource allocation method;
FIG. 8 is a graph of the total reward function and system capacity convergence performance of the present invention;
Fig. 9 is a graph of the total reward function and system capacity convergence performance of the DQN-based D2D resource allocation method, shown for comparison.
Detailed Description
In order that the technical principles of the present invention may be more clearly understood, embodiments of the present invention are described in detail below with reference to the accompanying drawings.
A D2D resource allocation method based on multi-agent deep reinforcement learning (MADRL, Multi-Agent Deep Reinforcement Learning based Device-to-Device Resource Allocation Method) is applied to a heterogeneous network in which a cellular network and D2D communication coexist. First, the signal-to-interference-plus-noise ratio (SINR) and unit-bandwidth communication rate expressions of the D2D receiving users and the cellular users are established; taking maximized system capacity as the optimization target, and taking the cellular user SINR being greater than the minimum SINR threshold, the D2D link spectrum allocation constraint, and the D2D transmit power being smaller than the maximum transmit power threshold as constraints, a D2D resource allocation optimization model in the heterogeneous network is constructed;
constructing a state feature vector and a return function of a multi-agent deep reinforcement learning model for D2D resource allocation according to an optimization model; establishing a multi-agent actor critic deep reinforcement learning model for D2D resource allocation based on a partially observable Markov game model and an actor critic reinforcement learning theory;
performing offline training by using historical communication data obtained by the simulation platform;
according to the instantaneous channel state information of the D2D link, the instantaneous channel state information of the interfering link from the base station to the D2D receiving user, the interference power received by the D2D receiving user in the previous time slot, the communication resource block (RB) occupied by the neighboring D2D links in the previous time slot, and the RB occupied by the neighboring cellular users in the previous time slot, the resource allocation strategy obtained by training selects a suitable RB and transmit power.
As shown in fig. 2, the whole scheme comprises five parts: establishing the system model, formulating the optimization problem and building the optimization model, establishing the multi-agent reinforcement learning model, training the model, and executing the algorithm; establishing the multi-agent reinforcement learning model in turn comprises constructing the state features, designing the reward function and building the multi-agent actor-critic reinforcement learning model;
the method comprises the following specific steps:
step one, constructing a heterogeneous network model of a cellular network and a D2D communication shared spectrum;
as shown in fig. 1, the heterogeneous network model includes a cellular Base Station (BS), M cellular downlink users, and N D2D communication pairs.
Setting the m-th cellular user as $C_m$, wherein $1 \le m \le M$, and the n-th D2D communication pair as $D_n$, wherein $1 \le n \le N$. The transmitting user and receiving user of D2D communication pair $D_n$ are denoted by $D_n^t$ and $D_n^r$, respectively.
The cellular downlink communication links and the D2D links both adopt Orthogonal Frequency Division Multiplexing (OFDM); each cellular user occupies one communication resource block RB, and there is no interference between any two cellular links. In the system model, one cellular user is allowed to share the same RB simultaneously with multiple D2D users, and the communication resource block RB and the transmit power are selected autonomously by each D2D user.
Step two, based on the Interference existing in the heterogeneous network model, establishing a signal to Interference plus Noise ratio (SINR) of a D2D receiving user and an SINR of a cellular user;
interference includes three types: 1) cellular users experience interference from transmitting users in each D2D communication pair sharing the same RB; 2) interference experienced by the receiving users in each D2D communication pair from the base station; 3) the receiving user in each D2D communication pair is subject to interference from the transmitting user in all other D2D communication pairs that share the same RBs.
Cellular user $C_m$'s received-signal SINR on the $k$-th communication resource block RB from the base station is:

$$\gamma_{C_m}^{k} = \frac{P_B\, g_{B,C_m}}{\sum_{D_n \in \mathcal{D}_k} P_{D_n}\, h_{D_n^t,C_m} + N_0}$$

where $P_B$ is the fixed transmit power of the base station; $g_{B,C_m}$ is the channel gain of the downlink target link from the base station to cellular user $C_m$; $\mathcal{D}_k$ is the set of all D2D communication pairs sharing the $k$-th RB; $P_{D_n}$ is the transmit power of the transmitting user of D2D communication pair $D_n$; $h_{D_n^t,C_m}$ is the channel gain of the interfering link from transmitting user $D_n^t$ of D2D communication pair $D_n$ to cellular user $C_m$ when multiple links share the RB; and $N_0$ is the power spectral density of the Additive White Gaussian Noise (AWGN).
The received-signal SINR of the receiving user of D2D communication pair $D_n$ on the $k$-th RB is:

$$\gamma_{D_n}^{k} = \frac{P_{D_n}\, g_{D_n}}{P_B\, h_{B,D_n^r} + \sum_{D_i \in \mathcal{D}_k,\, i \ne n} P_{D_i}\, h_{D_i^t,D_n^r} + N_0}$$

where $g_{D_n}$ is the channel gain of the D2D target link from transmitting user $D_n^t$ of D2D communication pair $D_n$ to its receiving user $D_n^r$; $h_{B,D_n^r}$ is the channel gain of the interfering link from the base station to receiving user $D_n^r$ when multiple links share the RB; $P_{D_i}$ is the transmit power of the transmitting user of D2D communication pair $D_i$; and $h_{D_i^t,D_n^r}$ is the channel gain of the interfering link from transmitting user $D_i^t$ of D2D communication pair $D_i$ to receiving user $D_n^r$ when multiple links share the RB;
thirdly, calculating the unit bandwidth communication rates of the cellular link and the D2D link respectively by using the SINR of the cellular user and the SINR of the D2D receiving user;
cellular link communication rate per bandwidth based on shannon's formula
The calculation formula is:

$$r_{C_m}^{k} = \log_2\!\left(1 + \gamma_{C_m}^{k}\right)$$

The unit-bandwidth communication rate of the D2D link is:

$$r_{D_n}^{k} = \log_2\!\left(1 + \gamma_{D_n}^{k}\right)$$
step four, calculating system capacity by using the communication rate of the cellular link and the D2D link in unit bandwidth, and constructing a D2D resource allocation optimization model in the heterogeneous network by taking the maximized system capacity as an optimization target;
Since the allocation matrix $B_{N\times K} = [b_{n,k}]$ of communication resource blocks RB of the D2D communication pairs and the power control vector $\mathbf{P}_D$ jointly formed by the transmit powers of all D2D communication pairs must be optimized to maximize the system capacity while guaranteeing the communication quality of the cellular users, the optimization model is built as:

$$\max_{B_{N\times K},\, \mathbf{P}_D}\ \sum_{m=1}^{M} r_{C_m}^{k} + \sum_{n=1}^{N}\sum_{k=1}^{K} b_{n,k}\, r_{D_n}^{k}$$

$$\text{C1:}\quad \gamma_{C_m}^{k} \ge \gamma_{\min}^{C},\quad \forall m$$

$$\text{C2:}\quad \sum_{k=1}^{K} b_{n,k} \le 1,\ b_{n,k}\in\{0,1\},\quad \forall n$$

$$\text{C3:}\quad 0 \le P_{D_n} \le P_{\max},\quad \forall n$$

where $b_{n,k}$ is the RB selection parameter of D2D communication pair $D_n$.

Constraint C1 is the cellular user SINR constraint: the SINR of each cellular user must be greater than the minimum received-SINR threshold $\gamma_{\min}^{C}$, ensuring the communication quality of cellular users; constraint C2 is the D2D link spectrum allocation constraint: each D2D user pair can be allocated at most one communication resource block RB; constraint C3 requires that the transmit power of the transmitting user of each D2D communication pair does not exceed the maximum transmit power threshold $P_{\max}$.
Step five, aiming at the time slot t, on the basis of the D2D resource allocation optimization model, constructing a deep reinforcement learning model of each D2D communication pair;
A reinforcement learning model for D2D resource allocation is established as shown in FIG. 3. The principle is as follows: in a time slot $t$, each D2D communication pair acts as an agent that observes a state $s_t$ from the state space $\mathcal{S}$; according to its policy $\pi$ and the current state, it then selects an action $a_t$ from the action space $\mathcal{A}$, i.e., the D2D communication pair selects the RB to use and the transmit power. After performing action $a_t$, the D2D communication pair observes the environment transition to a new state $s_{t+1}$ and obtains a reward $r_t$; based on the obtained reward $r_t$, the D2D communication pair adjusts its policy to achieve a higher return, as illustrated by the sketch below.
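For illustration, a minimal sketch of this per-slot agent-environment interaction is given below; the `env` interface and its methods are hypothetical stand-ins for the communication simulation environment and are not defined by the patent.

```python
def run_slot(env, agent_ids, policies):
    """One time slot t: every D2D pair observes, selects an RB and a transmit
    power, and then receives its reward and next observation."""
    transitions = []
    observations = {j: env.observe(j) for j in agent_ids}       # state s_t per agent
    for j in agent_ids:
        action = policies[j](observations[j])                   # a_t = (RB index, transmit power)
        env.apply(j, action)
    env.step()                                                   # environment moves to s_{t+1}
    for j in agent_ids:
        transitions.append((j, observations[j], env.reward(j), env.observe(j)))
    return transitions
```

The specific construction steps are as follows: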
step 501, for a given D2D communication pair $D_p$, constructing the state feature vector $s_t$ at time slot $t$.

The state features observed by each D2D communication pair are:

$$s_t = \left\{\, g_t^{D_p},\ h_t^{B,D_p},\ I_{t-1},\ \mathcal{B}_{t-1}^{D},\ \mathcal{B}_{t-1}^{C} \,\right\}$$

where $g_t^{D_p}$ is the instantaneous channel state information of the D2D communication link; $h_t^{B,D_p}$ is the instantaneous channel state information of the interfering link from the base station to the receiving user of D2D communication pair $D_p$; $I_{t-1}$ is the interference power received by the receiving user of D2D communication pair $D_p$ in the previous time slot $t-1$; $\mathcal{B}_{t-1}^{D}$ is the set of RBs occupied in the previous time slot $t-1$ by the D2D communication pairs neighboring $D_p$; and $\mathcal{B}_{t-1}^{C}$ is the set of RBs occupied in the previous time slot $t-1$ by the cellular users neighboring $D_p$.
Step 502, simultaneously, according to the optimization objective, the D2D communication pair D is constructedpA return function r at time slot tt
The reward function is designed to take into account both the lowest received SINR threshold for the cellular user and the unit bandwidth rate of the D2D communication pair. If the cellular user receiving SINR for the shared spectrum in communication with D2D can satisfy the cellular user signal-to-noise ratio constraint, a positive reward is obtained; otherwise, a negative reward r will be obtainedn,rnIs less than 0. To boost the capacity of the D2D communication link, the positive reward is set to the unit bandwidth communication rate of the D2D link:
Figure BDA0001984765490000081
thus, the reward function is as follows:
Figure BDA0001984765490000082
step 503, constructing the state features of the multi-agent Markov game model from the state feature vectors of the D2D communication pairs; in order to optimize the Markov game model, the reward function in the multi-agent actor-critic deep reinforcement learning model is established from the reward functions of the D2D communication pairs;
Each agent uses an actor-critic reinforcement learning model composed of an actor and a critic, whose policies are fitted with deep neural networks, as shown in fig. 4. The actor network of a D2D pair takes the environment state $s_t$ as input and outputs an action $a_t$, i.e., the selected RB and transmit power; the critic network takes the environment state vector $s_t$ and the selected action $a_t$ as input and outputs a temporal-difference error (TD error) computed from the Q value, and this TD error drives the learning of both networks.
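A possible realization of such actor and critic networks, sketched in PyTorch for illustration only, is shown below; the layer sizes, the split of the actor output into RB logits and a normalized power level, and the class names are assumptions, not the networks specified by the patent.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps a D2D pair's local state s_t to an action: RB-selection logits plus
    a transmit-power level in [0, 1] (to be scaled by P_max outside)."""
    def __init__(self, state_dim, num_rbs, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                  nn.Linear(hidden, hidden), nn.ReLU())
        self.rb_head = nn.Linear(hidden, num_rbs)
        self.power_head = nn.Sequential(nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, state):
        h = self.body(state)
        return self.rb_head(h), self.power_head(h)

class Critic(nn.Module):
    """Takes a state vector and an action vector and outputs a Q value; for
    centralized training, the joint state and joint action of all N agents
    are concatenated before being passed in."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))
```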
In the heterogeneous cellular network, the resource allocation of a plurality of D2D communication pairs is a multi-agent reinforced learning problem and can be modeled as a partially observable Markov game model, and the Markov game models of N agents are as follows:
Figure BDA0001984765490000083
wherein the content of the first and second substances,
Figure BDA0001984765490000084
is a space of states that is,
Figure BDA0001984765490000085
is an action space, rjThe value of the return value of the jth intelligent agent is the return value corresponding to the return function of the jth D2D communication pair, j ∈ { 1.·, N }, p is the state transition probability of the whole environment, and gamma is a discount coefficient.
The goal of each agent's learning is to maximize its total discount return;
The total discounted return is calculated as:

$$R_j = \sum_{t=0}^{T} \gamma^{t}\, r_t^{j}$$

where $T$ is the time horizon, $\gamma^{t}$ is the discount factor raised to the power $t$, and $r_t^{j}$ is the reward value of the $j$-th D2D communication pair's reward function at time slot $t$.
Aiming at the Markov game model, the reinforcement learning model of the actor critics is expanded to a multi-agent scene, and a deep reinforcement learning model of the multi-agent is constructed, as shown in FIG. 5. During training, the critic part uses historical global information to guide the actor part to update the strategy; when the system is executed, the single agent only uses part of the environmental information obtained by observation and uses the actor strategy obtained by training to make action selection, thereby realizing centralized training and distributed execution.
In the centralized training process, the strategy of N agents uses pi ═ pi1,...,πNDenotes, θ ═ θ1,...,θNDenotes the parameters contained in the policy, where the jth agent expects a reward
Figure BDA0001984765490000088
The gradient of (d) is:
Figure BDA0001984765490000091
here, s includes status information of all agents, and s ═ s1,...,sN}; a contains the action information of all agents, a ═ a1,...,aN};
Figure BDA0001984765490000092
The method is a centralized action-value function, takes the state information and actions of all agents as input, and outputs the Q value of the jth agent.
Extending the above to deterministic policies, consider the deterministic policy $\mu_{\theta_j}$ (abbreviated $\mu_j$) and let $\mu = \{\mu_1,\ldots,\mu_N\}$ denote the deterministic policies of all agents; the gradient of the expected return of the $j$-th agent is:

$$\nabla_{\theta_j} J(\mu_j) = \mathbb{E}_{s,a\sim\mathcal{D}}\!\left[\, \nabla_{\theta_j}\mu_j(a_j \mid s_j)\, \nabla_{a_j} Q_j^{\mu}(s, a_1,\ldots,a_N)\big|_{a_j=\mu_j(s_j)} \right]$$

where $\mathcal{D}$ is the experience replay buffer, in which each sample is a tuple $(s_t, a_t, r_t, s_{t+1})$ recording the historical data of all agents, and $r_t = \{r_t^{1},\ldots,r_t^{N}\}$ includes the rewards of all agents at time slot $t$. The actor's policy is fitted with a deep neural network; the above gradient formula is the update rule of the actor network, which is updated by gradient ascent to obtain the maximum expected return.
The critic network also uses a deep neural network for fitting by minimizing a centralized action-cost function
Figure BDA0001984765490000097
To update the loss function of:
Figure BDA0001984765490000098
wherein the content of the first and second substances,
Figure BDA0001984765490000099
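The experience replay buffer $\mathcal{D}$ that stores the joint tuples $(s_t, a_t, r_t, s_{t+1})$ of all agents can be sketched as follows; the capacity and the class interface are illustrative assumptions.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores one tuple (s_t, a_t, r_t, s_{t+1}) per slot, each entry holding
    the states, actions and rewards of all N agents."""
    def __init__(self, capacity=100_000):
        self.storage = deque(maxlen=capacity)

    def push(self, states, actions, rewards, next_states):
        self.storage.append((states, actions, rewards, next_states))

    def sample(self, batch_size):
        batch = random.sample(self.storage, batch_size)
        states, actions, rewards, next_states = zip(*batch)
        return states, actions, rewards, next_states

    def __len__(self):
        return len(self.storage)
```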
step 504, performing offline training of the deep reinforcement learning model with historical communication data to obtain a model that solves the resource allocation problem of D2D communication pair $D_p$.
The training steps are as follows:
(1) initialize the cell, base station, cellular links and D2D links using a communication simulation platform;
(2) initialize the policy models $\pi$ and parameters $\theta$ of all agents, and initialize the number of communication simulation time slots $T$;
(3) initialize the communication simulation time slot t ← 0;
(4) all D2D communication pairs observe the environment to obtain the state information $s_t$, select an action $a_t$ based on $s_t$ and $\pi$, and obtain a reward $r_t$; t ← t + 1;
(5) record $(s_t, a_t, r_t, s_{t+1})$ into the experience replay buffer $\mathcal{D}$;
(6) sample a mini-batch of data from $\mathcal{D}$;
(7) train with the mini-batch data and update the parameters $\theta$ of the policy $\pi$;
(8) return to step (4) until $t = T$, then end the training;
(9) return the parameters $\theta$.
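A condensed sketch of this offline training loop is given below; `sim`, `agents`, `buffer` and `update_fn` are assumed interfaces standing in for the communication simulation platform, the D2D agents, the replay buffer of step (5), and the multi-agent update of steps (6)-(7).

```python
def train_offline(sim, agents, buffer, update_fn, num_slots, batch_size=64, warmup=1000):
    """Steps (1)-(9): collect experience from the simulation platform and
    update the agents' policies from mini-batches of the replay buffer."""
    sim.reset()                                                          # (1) cell, BS, links
    for t in range(num_slots):                                           # (3)-(8) loop over slots
        states = [sim.observe(j) for j in range(len(agents))]
        actions = [agents[j].act(states[j]) for j in range(len(agents))] # (4) pick RB + power
        rewards, next_states = sim.step(actions)                         # environment transition
        buffer.push(states, actions, rewards, next_states)               # (5) store joint experience
        if len(buffer) >= max(batch_size, warmup):
            update_fn(agents, buffer.sample(batch_size))                 # (6)-(7) mini-batch update
    return [agent.parameters() for agent in agents]                      # (9) trained parameters
```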
and step six, extracting respective state feature vectors for each D2D communication pair in the subsequent time slot, and inputting the state feature vectors into the trained deep reinforcement learning model to obtain the resource allocation scheme of each D2D communication pair.
The resource allocation scheme includes selecting appropriate communication resource blocks RB and transmission power.
The execution steps are as follows:
(1) initialize the cell, base station, cellular links and D2D links using a communication simulation platform;
(2) initialize the policy models $\pi$ of all agents, load the trained parameters $\theta$ into the models $\pi$, and initialize the number of communication simulation time slots $T$;
(3) initialize the communication simulation time slot t ← 0;
(4) all D2D communication pairs observe the environment to obtain the state information $s_t$ and, based on $s_t$ and $\pi$, select an action $a_t$, i.e., an RB and a transmit power; the D2D receiving users' SINR and the system capacity are recorded;
(5) t ← t + 1; the simulation platform updates the environment, and all D2D communication pairs observe the environment to obtain $s_{t+1}$;
(6) return to step (4) until $t = T$.
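The distributed execution phase can be sketched as follows; the `sim` and `actors` interfaces are hypothetical, and each D2D pair uses only its own observation together with its trained actor, as described in steps (1)-(6) above.

```python
def execute(sim, actors, num_slots):
    """Run the trained policies: each D2D pair picks an RB and a transmit power
    from its local observation only; SINR and system capacity are recorded."""
    sim.reset()
    sinr_log, capacity_log = [], []
    for t in range(num_slots):
        for j, actor in enumerate(actors):
            rb, power = actor.select(sim.observe(j))     # local observation only
            sim.apply(j, rb, power)
        sim.step()                                       # environment moves to slot t+1
        sinr_log.append(sim.d2d_sinrs())                 # statistics of D2D receive SINR
        capacity_log.append(sim.system_capacity())
    return sinr_log, capacity_log
```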
Respectively comparing the D2D resource allocation method based on the multi-agent with the D2D resource allocation method and the D2D random resource allocation method based on DQN;
as shown in fig. 6, MADRL represents the method of the present invention, DQN represents the D2D resource allocation method based on the deep Q network, Random represents the D2D resource allocation method based on Random allocation, and the three methods respectively affect the communication quality of cellular users, and it can be known from the figure that the algorithm MADRL of the present invention can achieve the lowest cellular user outage probability when the number of users is different, D2D;
as shown in fig. 7, for the influence of the three methods on the total capacity of the system, the algorithm MADRL of the present invention achieves the maximum system capacity as the number of D2D communication pairs increases.
FIG. 8 illustrates the total reward function and system capacity convergence performance of the present invention, and fig. 9 shows those of the DQN-based D2D resource allocation method; compared with DQN, the present method introduces global information into the training process for centralized training, so the training environment is more stable and the convergence performance is better. It can therefore be concluded that MADRL achieves higher system throughput than Random and DQN while preserving the communication quality of cellular users, and has better convergence performance than DQN.
In conclusion, by implementing the multi-agent reinforcement learning-based D2D resource allocation method, the communication quality of cellular users can be protected, and the system throughput can be maximized; compared with a centralized algorithm, the distributed resource allocation algorithm designed by the invention reduces signaling overhead; compared with other resource allocation algorithms based on Q learning, the algorithm designed by the invention has better convergence performance.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims (3)

1. A D2D resource allocation method based on multi-agent deep reinforcement learning is characterized by comprising the following specific steps:
step one, constructing a heterogeneous network model of a cellular network and a D2D communication shared spectrum;
the heterogeneous network model comprises a cellular Base Station (BS), M cellular downlink users and N D2D communication pairs;
setting the m-th cellular user as $C_m$, wherein $1 \le m \le M$, and the n-th D2D communication pair as $D_n$, wherein $1 \le n \le N$; the transmitting user and receiving user of D2D communication pair $D_n$ are denoted by $D_n^t$ and $D_n^r$, respectively;
the cellular downlink communication link and the D2D link adopt the orthogonal frequency division multiplexing technology, each cellular user occupies one communication resource block RB, and no interference exists between any two cellular links; simultaneously, one cellular user is allowed to share the same RB with a plurality of D2D users, and communication Resource Blocks (RB) and transmission power are autonomously selected by the D2D users;
step two, establishing a signal-to-interference-and-noise ratio (SINR) of a D2D receiving user and an SINR of a cellular user based on interference existing in a heterogeneous network model;
cellular user $C_m$'s received-signal SINR on the $k$-th communication resource block RB from the base station is:

$$\gamma_{C_m}^{k} = \frac{P_B\, g_{B,C_m}}{\sum_{D_n \in \mathcal{D}_k} P_{D_n}\, h_{D_n^t,C_m} + N_0}$$

where $P_B$ is the fixed transmit power of the base station; $g_{B,C_m}$ is the channel gain of the downlink target link from the base station to cellular user $C_m$; $\mathcal{D}_k$ is the set of all D2D communication pairs sharing the $k$-th RB; $P_{D_n}$ is the transmit power of the transmitting user of D2D communication pair $D_n$; $h_{D_n^t,C_m}$ is the channel gain of the interfering link from transmitting user $D_n^t$ of D2D communication pair $D_n$ to cellular user $C_m$ when multiple links share the RB; and $N_0$ is the power spectral density of the additive white Gaussian noise;
the received-signal SINR of the receiving user of D2D communication pair $D_n$ on the $k$-th RB is:

$$\gamma_{D_n}^{k} = \frac{P_{D_n}\, g_{D_n}}{P_B\, h_{B,D_n^r} + \sum_{D_i \in \mathcal{D}_k,\, i \ne n} P_{D_i}\, h_{D_i^t,D_n^r} + N_0}$$

where $g_{D_n}$ is the channel gain of the D2D target link from transmitting user $D_n^t$ of D2D communication pair $D_n$ to its receiving user $D_n^r$; $h_{B,D_n^r}$ is the channel gain of the interfering link from the base station to receiving user $D_n^r$ when multiple links share the RB; $P_{D_i}$ is the transmit power of the transmitting user of D2D communication pair $D_i$; and $h_{D_i^t,D_n^r}$ is the channel gain of the interfering link from transmitting user $D_i^t$ of D2D communication pair $D_i$ to receiving user $D_n^r$ when multiple links share the RB;
thirdly, calculating the unit bandwidth communication rates of the cellular link and the D2D link respectively by using the SINR of the cellular user and the SINR of the D2D receiving user;
communication rate per unit bandwidth of cellular link
The calculation formula is:

$$r_{C_m}^{k} = \log_2\!\left(1 + \gamma_{C_m}^{k}\right)$$

The unit-bandwidth communication rate of the D2D link is:

$$r_{D_n}^{k} = \log_2\!\left(1 + \gamma_{D_n}^{k}\right)$$
step four, calculating system capacity by using the communication rate of the cellular link and the D2D link in unit bandwidth, and constructing a D2D resource allocation optimization model in the heterogeneous network by taking the maximized system capacity as an optimization target;
the optimization model is as follows:
$$\max_{B_{N\times K},\, \mathbf{P}_D}\ \sum_{m=1}^{M} r_{C_m}^{k} + \sum_{n=1}^{N}\sum_{k=1}^{K} b_{n,k}\, r_{D_n}^{k}$$

$$\text{C1:}\quad \gamma_{C_m}^{k} \ge \gamma_{\min}^{C},\quad \forall m$$

$$\text{C2:}\quad \sum_{k=1}^{K} b_{n,k} \le 1,\ b_{n,k}\in\{0,1\},\quad \forall n$$

$$\text{C3:}\quad 0 \le P_{D_n} \le P_{\max},\quad \forall n$$

where $B_{N\times K} = [b_{n,k}]$ is the allocation matrix of communication resource blocks RB for the D2D communication pairs, $b_{n,k}$ is the RB selection parameter of D2D communication pair $D_n$, and $\mathbf{P}_D$ is the power control vector jointly formed by the transmit powers of all D2D communication pairs;

constraint C1 requires the SINR of each cellular user to be greater than the minimum received-SINR threshold $\gamma_{\min}^{C}$ of the cellular user, ensuring the communication quality of cellular users; constraint C2 is the D2D link spectrum allocation constraint: each D2D user pair can be allocated at most one communication resource block RB; constraint C3 requires that the transmit power of the transmitting user of each D2D communication pair does not exceed the maximum transmit power threshold $P_{\max}$;
Step five, aiming at the time slot t, on the basis of the D2D resource allocation optimization model, constructing a deep reinforcement learning model of each D2D communication pair;
the specific construction steps are as follows:
step 501, for a given D2D communication pair $D_p$, constructing the state feature vector $s_t$ at time slot $t$:

$$s_t = \left\{\, g_t^{D_p},\ h_t^{B,D_p},\ I_{t-1},\ \mathcal{B}_{t-1}^{D},\ \mathcal{B}_{t-1}^{C} \,\right\}$$

where $g_t^{D_p}$ is the instantaneous channel state information of the D2D communication link; $h_t^{B,D_p}$ is the instantaneous channel state information of the interfering link from the base station to the receiving user of D2D communication pair $D_p$; $I_{t-1}$ is the interference power received by the receiving user of D2D communication pair $D_p$ in the previous time slot $t-1$; $\mathcal{B}_{t-1}^{D}$ is the set of RBs occupied in the previous time slot $t-1$ by the D2D communication pairs neighboring $D_p$; and $\mathcal{B}_{t-1}^{C}$ is the set of RBs occupied in the previous time slot $t-1$ by the cellular users neighboring $D_p$;
step 502, simultaneously constructing the reward function $r_t$ of the D2D communication pair $D_p$ at time slot $t$:

$$r_t = \begin{cases} r_{D_p}^{k}, & \text{if the SINR constraint of the cellular user sharing the RB is satisfied} \\ r_n, & \text{otherwise} \end{cases}$$

where $r_n$ is a negative reward, $r_n < 0$;
Step 503, constructing the state characteristics of the multi-agent Markov game model by using the state characteristic vectors of the D2D communication pair; in order to optimize the Markov game model, a return function in the deep reinforcement learning model of the multi-agent actor critic is established by utilizing the return function of the D2D communication pair;
the Markov game model of the N agents is:

$$\left(\mathcal{S},\ \mathcal{A}_1,\ldots,\mathcal{A}_N,\ r_1,\ldots,r_N,\ p,\ \gamma\right)$$

where $\mathcal{S}$ is the state space, $\mathcal{A}_j$ is the action space of agent $j$, $r_j$ is the reward value corresponding to the reward function of the $j$-th D2D communication pair, $j \in \{1,\ldots,N\}$, $p$ is the state transition probability of the whole environment, and $\gamma$ is the discount factor;
the goal of each D2D communication pair learning is to maximize the total discount return for that D2D communication pair;
the total discounted return is calculated as:

$$R_j = \sum_{t=0}^{T} \gamma^{t}\, r_t^{j}$$

where $T$ is the time horizon, $\gamma^{t}$ is the discount factor raised to the power $t$, and $r_t^{j}$ is the reward value of the $j$-th D2D communication pair's reward function at time slot $t$;
the deep reinforcement learning model of the actor critics consists of actors and critics;
in the training process, the strategy of the actor is fitted by using a deep neural network, and is updated by using the following deterministic strategy gradient formula so as to obtain the maximum expected return;
let $\mu = \{\mu_1,\ldots,\mu_N\}$ denote the deterministic policies of all agents and $\theta = \{\theta_1,\ldots,\theta_N\}$ denote the parameters contained in the policies; the gradient of the expected return of the $j$-th agent is:

$$\nabla_{\theta_j} J(\mu_j) = \mathbb{E}_{s,a\sim\mathcal{D}}\!\left[\, \nabla_{\theta_j}\mu_j(a_j \mid s_j)\, \nabla_{a_j} Q_j^{\mu}(s, a_1,\ldots,a_N)\big|_{a_j=\mu_j(s_j)} \right]$$

where $s$ contains the state information of all agents, $s = \{s_1,\ldots,s_N\}$; $a$ contains the action information of all agents, $a = \{a_1,\ldots,a_N\}$; $Q_j^{\mu}$ is the centralized action-value function of the $j$-th agent; and $\mathcal{D}$ is the experience replay buffer;
the critic is also fitted with a deep neural network and is updated by minimizing the loss function of the centralized action-value function $Q_j^{\mu}$:

$$\mathcal{L}(\theta_j) = \mathbb{E}_{(s_t,a_t,r_t,s_{t+1})\sim\mathcal{D}}\!\left[\left(Q_j^{\mu}(s_t, a_1,\ldots,a_N) - y\right)^{2}\right],\qquad y = r_t^{j} + \gamma\, Q_j^{\mu'}(s_{t+1}, a_1',\ldots,a_N')\big|_{a_i'=\mu_i'(s_i)}$$

where each sample in $\mathcal{D}$ is a tuple $(s_t, a_t, r_t, s_{t+1})$ recording the historical data of all agents, and $r_t = \{r_t^{1},\ldots,r_t^{N}\}$ includes the rewards of all agents at time slot $t$;
step 504, performing offline training of the deep reinforcement learning model with historical communication data to obtain a model that solves the resource allocation problem of D2D communication pair $D_p$;
and step six, extracting respective state feature vectors for each D2D communication pair in the subsequent time slot, and inputting the state feature vectors into the trained deep reinforcement learning model to obtain the resource allocation scheme of each D2D communication pair.
2. The multi-agent deep reinforcement learning-based D2D resource allocation method as claimed in claim 1, wherein the interference in step two includes three types: 1) cellular users experience interference from transmitting users in each D2D communication pair sharing the same RB; 2) interference experienced by the receiving users in each D2D communication pair from the base station; 3) the receiving user in each D2D communication pair is subject to interference from the transmitting user in all other D2D communication pairs that share the same RBs.
3. The multi-agent deep reinforcement learning-based D2D resource allocation method as claimed in claim 1, wherein the resource allocation scheme in step six includes selecting appropriate communication resource blocks RB and transmission power;
the execution steps are as follows:
(1) initializing a cell, a base station, a cellular link, a D2D link using a communication simulation platform;
(2) initializing strategy models pi of all agents, importing the trained parameters theta into the models pi, and initializing communication simulation time slot number T;
(3) initializing a communication simulation time slot t ← 0;
(4) all D2D communication pairs observe the environment to obtain the state information $s_t$ and, based on $s_t$ and $\pi$, select an action $a_t$, i.e., an RB and a transmit power; the D2D receiving users' SINR and the system capacity are recorded;
(5) t ← t + 1; the simulation platform updates the environment, and all D2D communication pairs observe the environment to obtain $s_{t+1}$;
(6) return to step (4) until $t = T$.
CN201910161391.8A 2018-12-21 2019-03-04 D2D resource allocation method based on multi-agent deep reinforcement learning Active CN109729528B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2018115721684 2018-12-21
CN201811572168 2018-12-21

Publications (2)

Publication Number Publication Date
CN109729528A CN109729528A (en) 2019-05-07
CN109729528B true CN109729528B (en) 2020-08-18

Family

ID=66300856

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910161391.8A Active CN109729528B (en) 2018-12-21 2019-03-04 D2D resource allocation method based on multi-agent deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN109729528B (en)

Families Citing this family (40)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110267274B (en) * 2019-05-09 2022-12-16 广东工业大学 Spectrum sharing method for selecting sensing users according to social credibility among users
CN110049474B (en) * 2019-05-17 2020-07-17 北京邮电大学 Wireless resource allocation method, device and base station
CN112383922B (en) * 2019-07-07 2022-09-30 东北大学秦皇岛分校 Deep reinforcement learning frequency spectrum sharing method based on prior experience replay
CN110267338B (en) * 2019-07-08 2020-05-22 西安电子科技大学 Joint resource allocation and power control method in D2D communication
CN110582072B (en) * 2019-08-16 2020-07-10 北京邮电大学 Fuzzy matching-based resource allocation method and device in cellular internet of vehicles
CN110784882B (en) * 2019-10-28 2022-06-28 南京邮电大学 Energy acquisition D2D communication resource allocation method based on reinforcement learning
CN110856268B (en) * 2019-10-30 2021-09-07 西安交通大学 Dynamic multichannel access method for wireless network
CN110769514B (en) * 2019-11-08 2023-05-12 山东师范大学 Heterogeneous cellular network D2D communication resource allocation method and system
CN111026549B (en) * 2019-11-28 2022-06-10 国网甘肃省电力公司电力科学研究院 Automatic test resource scheduling method for power information communication equipment
CN111065102B (en) * 2019-12-16 2022-04-19 北京理工大学 Q learning-based 5G multi-system coexistence resource allocation method under unlicensed spectrum
CN111526592B (en) * 2020-04-14 2022-04-08 电子科技大学 Non-cooperative multi-agent power control method used in wireless interference channel
CN111556572B (en) * 2020-04-21 2022-06-07 北京邮电大学 Spectrum resource and computing resource joint allocation method based on reinforcement learning
CN111787624B (en) * 2020-06-28 2022-04-26 重庆邮电大学 Variable dimension resource allocation method based on deep learning
CN112118632B (en) * 2020-09-22 2022-07-29 电子科技大学 Adaptive power distribution system, method and medium for micro-cell base station
CN112584347B (en) * 2020-09-28 2022-07-08 西南电子技术研究所(中国电子科技集团公司第十研究所) UAV heterogeneous network multi-dimensional resource dynamic management method
CN112272353B (en) * 2020-10-09 2021-09-28 山西大学 Device-to-device proximity service method based on reinforcement learning
CN112533237B (en) * 2020-11-16 2022-03-04 北京科技大学 Network capacity optimization method for supporting large-scale equipment communication in industrial internet
CN112752266B (en) * 2020-12-28 2022-05-24 中国人民解放军陆军工程大学 Joint spectrum access and power control method in D2D haptic communication
CN112822781B (en) * 2021-01-20 2022-04-12 重庆邮电大学 Resource allocation method based on Q learning
CN113115451A (en) * 2021-02-23 2021-07-13 北京邮电大学 Interference management and resource allocation scheme based on multi-agent deep reinforcement learning
CN113115355B (en) * 2021-04-29 2022-04-22 电子科技大学 Power distribution method based on deep reinforcement learning in D2D system
CN113473419B (en) * 2021-05-20 2023-07-07 南京邮电大学 Method for accessing machine type communication device into cellular data network based on reinforcement learning
CN113543271B (en) * 2021-06-08 2022-06-07 西安交通大学 Effective capacity-oriented resource allocation method and system
CN113596786B (en) * 2021-07-26 2023-11-14 广东电网有限责任公司广州供电局 Resource allocation grouping optimization method for end-to-end communication
CN113766661B (en) * 2021-08-30 2023-12-26 北京邮电大学 Interference control method and system for wireless network environment
CN113810910B (en) * 2021-09-18 2022-05-20 大连理工大学 Deep reinforcement learning-based dynamic spectrum sharing method between 4G and 5G networks
WO2023054776A1 (en) * 2021-10-01 2023-04-06 엘지전자 주식회사 Method and device for transmitting progressive features for edge inference
CN113867178B (en) * 2021-10-26 2022-05-31 哈尔滨工业大学 Virtual and real migration training system for multi-robot confrontation
CN114245401B (en) * 2021-11-17 2023-12-05 航天科工微电子系统研究院有限公司 Multi-channel communication decision method and system
CN114363938B (en) * 2021-12-21 2024-01-26 深圳千通科技有限公司 Cellular network flow unloading method
CN114423070B (en) * 2022-02-10 2024-03-19 吉林大学 Heterogeneous wireless network power distribution method and system based on D2D
CN114928549A (en) * 2022-04-20 2022-08-19 清华大学 Communication resource allocation method and device of unauthorized frequency band based on reinforcement learning
CN114900827A (en) * 2022-05-10 2022-08-12 福州大学 Covert communication system in D2D heterogeneous cellular network based on deep reinforcement learning
CN115173922B (en) * 2022-06-30 2024-03-15 深圳泓越信息科技有限公司 Multi-beam satellite communication system resource allocation method based on CMADDQN network
CN115442812B (en) * 2022-11-08 2023-04-07 湖北工业大学 Deep reinforcement learning-based Internet of things spectrum allocation optimization method and system
CN115544899B (en) * 2022-11-23 2023-04-07 南京邮电大学 Water plant water intake pump station energy-saving scheduling method based on multi-agent deep reinforcement learning
CN115811788B (en) * 2022-11-23 2023-07-18 齐齐哈尔大学 D2D network distributed resource allocation method combining deep reinforcement learning and unsupervised learning
CN116155991B (en) * 2023-01-30 2023-10-10 杭州滨电信息技术有限公司 Edge content caching and recommending method and system based on deep reinforcement learning
CN116193405B (en) * 2023-03-03 2023-10-27 中南大学 Heterogeneous V2X network data transmission method based on DONA framework
CN116489683B (en) * 2023-06-21 2023-08-18 北京邮电大学 Method and device for unloading computing tasks in space-sky network and electronic equipment

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104995851A (en) * 2013-03-08 2015-10-21 英特尔公司 Distributed power control for d2d communications

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108834109B (en) * 2018-05-03 2021-03-19 中国人民解放军陆军工程大学 D2D cooperative relay power control method based on Q learning under full-duplex active eavesdropping

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104995851A (en) * 2013-03-08 2015-10-21 英特尔公司 Distributed power control for d2d communications

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Joint resource allocation and power control algorithm based on Q-learning in D2D communication; Wang Qian; Journal of Nanjing University; Nov. 30, 2018; pp. 1183-1192 *
Location-Aware Hypergraph Coloring Based Spectrum Allocation for D2D Communication; Zheng Li et al.; IEEE; Oct. 15, 2018; pp. 1-6 *
Secure Social Networks in 5G Systems with Mobile Edge Computing, Caching, and Device-to-Device Communications; Ying He et al.; IEEE; Jul. 4, 2018; pp. 103-109 *

Also Published As

Publication number Publication date
CN109729528A (en) 2019-05-07

Similar Documents

Publication Publication Date Title
CN109729528B (en) D2D resource allocation method based on multi-agent deep reinforcement learning
CN109862610B (en) D2D user resource allocation method based on deep reinforcement learning DDPG algorithm
CN110267338B (en) Joint resource allocation and power control method in D2D communication
Alqerm et al. Sophisticated online learning scheme for green resource allocation in 5G heterogeneous cloud radio access networks
CN111800828B (en) Mobile edge computing resource allocation method for ultra-dense network
CN111405569A (en) Calculation unloading and resource allocation method and device based on deep reinforcement learning
Wang et al. Joint interference alignment and power control for dense networks via deep reinforcement learning
Zhang et al. Deep reinforcement learning for multi-agent power control in heterogeneous networks
AlQerm et al. Enhanced machine learning scheme for energy efficient resource allocation in 5G heterogeneous cloud radio access networks
Lu et al. A cross-layer resource allocation scheme for ICIC in LTE-Advanced
CN107172576B (en) D2D communication downlink resource sharing method for enhancing cellular network security
Elsayed et al. Deep reinforcement learning for reducing latency in mission critical services
CN113596785A (en) D2D-NOMA communication system resource allocation method based on deep Q network
CN116390125A (en) Industrial Internet of things cloud edge cooperative unloading and resource allocation method based on DDPG-D3QN
CN114867030A (en) Double-time-scale intelligent wireless access network slicing method
CN116347635A (en) NB-IoT wireless resource allocation method based on NOMA and multi-agent reinforcement learning
Giri et al. Deep Q-learning based optimal resource allocation method for energy harvested cognitive radio networks
Yin et al. Decentralized federated reinforcement learning for user-centric dynamic tfdd control
Luo et al. Communication-aware path design for indoor robots exploiting federated deep reinforcement learning
Labana et al. Joint user association and resource allocation in CoMP-enabled heterogeneous CRAN
Chen et al. iPAS: A deep Monte Carlo Tree Search-based intelligent pilot-power allocation scheme for massive MIMO system
Zhao et al. Power control for D2D communication using multi-agent reinforcement learning
Liu et al. Power allocation in ultra-dense networks through deep deterministic policy gradient
Wang et al. Resource allocation in multi-cell NOMA systems with multi-agent deep reinforcement learning
CN114423070B (en) Heterogeneous wireless network power distribution method and system based on D2D

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant