CN113115355A - Power distribution method based on deep reinforcement learning in D2D system - Google Patents

Power distribution method based on deep reinforcement learning in D2D system

Info

Publication number
CN113115355A
CN113115355A (application CN202110475005.XA)
Authority
CN
China
Prior art keywords
network
link
power
agent
information
Prior art date
Legal status
Granted
Application number
CN202110475005.XA
Other languages
Chinese (zh)
Other versions
CN113115355B (en)
Inventor
梁应敞
史佳琦
Current Assignee
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202110475005.XA priority Critical patent/CN113115355B/en
Publication of CN113115355A publication Critical patent/CN113115355A/en
Application granted granted Critical
Publication of CN113115355B publication Critical patent/CN113115355B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W72/00 Local resource management
    • H04W72/04 Wireless resource allocation
    • H04W72/044 Wireless resource allocation based on the type of the allocated resource
    • H04W72/0473 Wireless resource allocation based on the type of the allocated resource the resource being transmission power
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W24/00 Supervisory, monitoring or testing arrangements
    • H04W24/06 Testing, supervising or monitoring using simulated traffic
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02E REDUCTION OF GREENHOUSE GAS [GHG] EMISSIONS, RELATED TO ENERGY GENERATION, TRANSMISSION OR DISTRIBUTION
    • Y02E40/00 Technologies for an efficient electrical power generation, transmission or distribution
    • Y02E40/70 Smart grids as climate change mitigation technology in the energy generation sector
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04 INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00 Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50 Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention belongs to the technical field of wireless communication, and particularly relates to a power distribution method based on deep reinforcement learning in a D2D system. In the scheme of the invention, a deep neural network is constructed independently for each link pair; the channel information of all links does not need to be obtained in real time. Instead, the communication environment around the current link is predicted from partial historical information and the decision information of the other links, and the link pairs cooperate with each other to make real-time power decisions that maximize the weighted sum-rate of the global network, thereby realizing an iteration-free power distribution method based on deep reinforcement learning.

Description

Power distribution method based on deep reinforcement learning in D2D system
Technical Field
The invention belongs to the technical field of wireless communication, and particularly relates to a power distribution method based on deep reinforcement learning in a D2D system.
Background
Network operators worldwide have shown strong interest in the development and application of 5G. A basic idea of 5G is to take advantage of direct connections between mobile users to relieve the burden on the base station. To improve the energy efficiency of cellular networks and increase system throughput, device-to-device (D2D) communication is considered a good and viable solution. In a D2D network, multiple D2D link pairs coexist with full frequency reuse in a cell, so the interference between links becomes very complex. In the D2D scenario, system capacity is generally optimized by managing interference through power control. Most conventional power control algorithms are implemented through continuous iteration based on real-time channel information, and real-time power adjustment is very difficult because channel estimation is time-consuming and the matrix operations involved are complex.
Disclosure of Invention
Aiming at the problems of conventional power control, the invention provides an iteration-free power allocation method based on deep reinforcement learning in a D2D system.
The technical scheme of the invention is as follows:
A power allocation method based on deep reinforcement learning in a D2D system, assuming that there are N link pairs, namely N agents, in the D2D system, comprises the following steps:
S1, information collection: the N link pairs respectively receive the outdated channel and power information and the decision information of the other links from a central controller (CC) to obtain their respective observation vectors;
S2, network construction: each link pair independently creates a network and establishes its own experience storage pool (Replay Buffer);
S3, online decision and network training: an online power decision is made according to the outdated observation vector of the communication environment around the link collected in step S1, and the state, action, reward and observation vector obtained from the interaction between the agent and the environment are stored in the experience pool; meanwhile, each link randomly takes a group of data out of the experience pool to train the network built in S2 and update the network parameters, and the network with the updated parameters is used for the next online decision.
The power control model based on deep reinforcement learning provided by the invention mainly comprises the following parts of online decision making and training:
Data: the D2D system provides channel information and power data for the offline module and the online module, respectively. For the offline module, the D2D system provides labeled sample data as the training set; for the online module, the D2D system provides unlabeled sampled data as the data to be processed.
Network construction: a network is constructed independently for each link according to a specified structure; the network is responsible for giving the specific power decision and the loss function of the network according to the input information.
Online training: through online training, continuous power allocation is treated as a multi-agent cooperation task. The system establishes a fixed-size experience pool (Replay Buffer) for each link pair to store data. Each link pair independently takes data out of its own experience pool; the output of the online reinforcement learning can then be modeled as a posterior probability, from which a cost function suitable for power allocation (for example, the cost function based on the maximum posterior probability designed by the invention) is developed. Given a training set, a trained network is obtained through continuous online training and feedback.
Online decision making: during online training, the power allocation output of the network is taken as the real-time power allocation result. Meanwhile, the data collected during online decision making are stored in the experience pool as training data for later training. The quality of the online decisions becomes better and better as the online training progresses. A per-agent sketch of this decision and training loop is given below.
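The following Python sketch outlines how one link pair could alternate between making an online power decision, storing the experience tuple, and training on a randomly sampled batch from its own experience pool. The class and method names (the agent interface with act, env_feedback and train, the buffer capacity and batch size) are illustrative assumptions, not part of the patent:

    import random
    from collections import deque

    class ReplayBuffer:
        """Fixed-size experience pool of one link pair."""
        def __init__(self, capacity=100000):
            self.buffer = deque(maxlen=capacity)

        def store(self, state, action, reward, next_state):
            self.buffer.append((state, action, reward, next_state))

        def sample(self, batch_size):
            # randomly select a group of stored tuples for training
            return random.sample(list(self.buffer), batch_size)

    def online_step(agent, buffer, state, batch_size=64):
        # online decision with the current network parameters
        action = agent.act(state)                        # real-time power decision
        reward, next_state = agent.env_feedback(action)  # interaction with the environment
        buffer.store(state, action, reward, next_state)  # collected data becomes training data
        # online training on a random batch; the updated network is used next time
        if len(buffer.buffer) >= batch_size:
            agent.train(buffer.sample(batch_size))
        return next_state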
Based on the deep neural network, the invention uses the linear rectification function (ReLU) as the activation function of the input and hidden layers, in the smooth (softplus) form

    ReLU(x) = log(1 + exp(x))

The output layer uses the tanh function to determine the final power output level; the resulting value in (-1, 1) is mapped to the transmit power by equation (22).
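A minimal TensorFlow/Keras sketch of such a network is given below. The layer sizes and function names are illustrative assumptions; the softplus-style hidden activation and the tanh output follow the description above, and the mapping to a power value via equation (22) is applied outside the network:

    import tensorflow as tf

    def build_actor(state_dim, hidden_units=(64, 64)):
        """Actor network sketch: state in, tanh-bounded decision value out."""
        inputs = tf.keras.Input(shape=(state_dim,))
        x = inputs
        for units in hidden_units:
            # softplus(x) = log(1 + exp(x)), the smooth rectifier described above
            x = tf.keras.layers.Dense(units, activation="softplus")(x)
        # output lies in (-1, 1); mapped to [0, P_max] by equation (22)
        outputs = tf.keras.layers.Dense(1, activation="tanh")(x)
        return tf.keras.Model(inputs, outputs)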
the power distribution mechanism based on multi-agent deep reinforcement learning provided by the invention is a universal reinforcement learning framework, can be suitable for any type of network, and can be generalized to different networks.
The method has the advantage that a deep neural network is constructed independently for each link pair in the scheme of the invention; the channel information of all links does not need to be obtained in real time. The communication environment around the current link is predicted from partial historical information and the decision information of the other links, and all link pairs can cooperate with each other to make real-time power decisions that maximize the weighted sum-rate of the global network, thereby realizing an iteration-free power distribution method based on deep reinforcement learning.
Drawings
FIG. 1 shows the D2D communication system model in the present invention;
FIG. 2 shows the frame structure of the D2D communication system in the present invention;
FIG. 3 shows the network structure of each pair of links in the present invention;
FIG. 4 shows the power decision flow for each pair of link users in the present invention;
FIG. 5 shows a comparison of the performance of the reinforcement-learning-based power allocation scheme proposed by the present invention with other power allocation schemes for different numbers of test links;
FIG. 6 shows the actor network training loss variation of a pair of links in the reinforcement-learning-based power allocation scheme proposed by the present invention;
FIG. 7 shows the critic network training loss variation of a pair of links in the reinforcement-learning-based power allocation scheme proposed by the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings.
Fig. 1 shows the D2D network model in the present invention. The system consists of a cellular mobile communication system and a D2D communication system. In this example, the macrocell base station reserves a small portion of exclusive cellular spectrum for the D2D communication system, so the cellular mobile communication system and the D2D communication system do not interfere with each other, and the macrocell base station only serves as a relay that helps the D2D communication devices exchange a small amount of control information with low delay. Assume that there are M D2D communication devices and 1 channel in this example system, as shown in Fig. 1.

Let g_ij^(t) denote the channel gain from transmitter i to receiver j in frame t, composed of a large-scale fading component θ_ij and a small-scale fading component h_ij^(t):

    g_ij^(t) = θ_ij |h_ij^(t)|^2

The correlation of the channel over time is modeled as a first-order Gauss-Markov process, and the Jakes model is used to express the variation of the small-scale fading in frame t:

    h_ij^(t) = ρ h_ij^(t-1) + sqrt(1 - ρ^2) e_ij^(t)

where the channel parameter at time zero, h_ij^(0), obeys a complex Gaussian distribution with mean μ and variance σ^2, the innovation terms e_ij^(t) are independent and identically distributed complex Gaussian CN(0, 1) random variables, and ρ is the channel correlation coefficient between adjacent frames, ρ = J_0(2π f_d T), where J_0 denotes the zeroth-order Bessel function of the first kind and f_d is the maximum Doppler frequency. The large-scale fading θ_ij follows the ITU-1411 short-range outdoor model with 5 MHz bandwidth and 2.4 GHz carrier frequency; it is related to the distance between the two communication nodes. The small-scale fading remains constant within one frame but varies from frame to frame.

The signal-to-interference-plus-noise ratio (SINR) of user i at time t is

    γ_i^(t) = g_ii^(t) p_i^(t) / ( Σ_{j≠i} g_ji^(t) p_j^(t) + σ^2 )

where p_i^(t) denotes the transmit power of user i at time t, p^(t) = [p_1^(t), ..., p_M^(t)] denotes the power vector of all users in the network at time t, and σ^2 denotes the power of the additive white Gaussian noise. The rate of user i at time t is

    C_i^(t) = log2(1 + γ_i^(t))
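The channel evolution and rate computation above can be sketched in a few lines of Python. The innovation term is drawn as CN(0, 1) as described above; the array shapes and helper names are illustrative assumptions:

    import numpy as np

    def evolve_channel(h_prev, rho):
        """First-order Gauss-Markov (Jakes) update of the small-scale fading matrix."""
        M = h_prev.shape[0]
        innovation = (np.random.randn(M, M) + 1j * np.random.randn(M, M)) / np.sqrt(2)  # CN(0, 1)
        return rho * h_prev + np.sqrt(1.0 - rho ** 2) * innovation

    def user_rates(theta, h, p, noise_power):
        """Per-user SINR and rate for channel gains g_ij = theta_ij * |h_ij|^2."""
        g = theta * np.abs(h) ** 2            # M x M channel gain matrix
        received = g * p[:, None]             # power of transmitter i seen at receiver j
        signal = np.diag(received)            # g_ii * p_i
        interference = received.sum(axis=0) - signal
        sinr = signal / (interference + noise_power)
        return np.log2(1.0 + sinr)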
The present invention aims to find an efficient power allocation scheme that maximizes the weighted sum-rate of all D2D users, i.e.

    maximize over p^(t):  Σ_i w_i^(t) C_i^(t)
    subject to:           0 ≤ p_i^(t) ≤ P_max for every user i

where w_i^(t) represents the weight of user i at time t, typically assigned according to the user's long-term average rate. The weights ensure user fairness in the network by giving users with poor channel conditions more transmission opportunities.

In a large D2D network it is practically difficult to obtain real-time CSI because of the large overhead and latency of the backhaul network. The invention therefore assumes that only past information is available, so only the conditional expectation of the real-time weighted sum-rate can be maximized:

    maximize over p^(t):  E[ Σ_i w_i^(t) C_i^(t) | past information ]
    subject to:           0 ≤ p_i^(t) ≤ P_max for every user i

where the past information comprises the outdated channel, power and decision information of the previous frames. The problem above is non-convex and difficult to solve with conventional optimization methods, requiring high-dimensional integration and complex matrix operations. The invention uses deep reinforcement learning to obtain the power allocation result directly from past information, skipping the complex matrix operations.
Fig. 2 shows the frame structure of user communications in the D2D network in the present invention. The D2D link pair divides the data frame of one time slot into three parts. In the first part, the frame header, the D2D link pair receives the outdated interference information and the power decision information of the last time from the CC, and then inputs the processed information into the neural network to make the power decision. In the second part, the D2D link pair transmits data with the allocated power while collecting interference information in real time. Finally, in the third part, the frame tail, the interference information of this time and the link's own power allocation information are transmitted to the CC.
Fig. 4 shows the power decision flow of the present algorithm.
Fig. 3 shows the reinforcement learning network structure of each pair of links in the present invention. There are three main components in the network: the Replay Buffer, the Main Net and the Target Net.
The Replay Buffer is responsible for storing the sample data tuples generated by the Main Net. During network training, stored data tuples are taken out of the Replay Buffer according to a certain strategy, which can be random sampling or some designed weighted selection strategy.
The network structures of the Main Net and the Target Net are exactly the same, and each contains an actor network and a critic network. The actor network receives the state information of the link and outputs a power decision value, and the critic network evaluates the current output of the actor, i.e., judges whether the power decision is good or not.
The Main Net has two functions: it generates the real-time data tuples and stores them in the Replay Buffer, and it is updated in real time after the actor and critic networks compute their loss functions.
The Target Net has only one role: computing the target Q value in the loss function. It fixes the Q value to stabilize the network, preventing the target value from jumping continuously and degrading the training. The parameters of the Target Net are overwritten by the parameters of the Main Net after a fixed period of time or a fixed number of training steps.
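The parameter copy from the Main Net to the Target Net can be sketched as follows; get_weights and set_weights are standard Keras model calls, and the soft-update coefficient tau (used by the soft-update variant mentioned later in the description) is an illustrative value:

    def hard_update(target_net, main_net):
        """Overwrite the Target Net parameters with the Main Net parameters."""
        target_net.set_weights(main_net.get_weights())

    def soft_update(target_net, main_net, tau=0.005):
        """Move the Target Net parameters a small step toward the Main Net parameters."""
        mixed = [tau * w_main + (1.0 - tau) * w_target
                 for w_main, w_target in zip(main_net.get_weights(), target_net.get_weights())]
        target_net.set_weights(mixed)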
The following introduces more important variables in the network:
1) Action space A: in each time slot, each agent needs to decide its own transmit power level. The network in the invention does not need power discretization, i.e., it can make power decisions over a continuous action range, which conventional algorithms cannot do. The action space A in the invention is therefore defined as the continuous interval

    A = [0, P_max]

so the network action space contains infinitely many values. For link pair i, the action a_i^(t) in time slot t is an arbitrary real number that the agent selects from the range [0, P_max]. In addition, a^(t) is defined as the decision vector that the current link stores into the experience pool.
2) State space S: as the basis for power decisions, the state must provide the network with enough information for the agent to understand the surrounding communication environment well enough to support correct decisions. In a communication network, the communication environment around a link consists of three parts: the communication quality between the link's own transmitter and receiver, the interference the local transmitter causes to other receivers, and the interference other transmitters cause to the local receiver. With these three pieces of information, the link can sense the surrounding communication environment. Define s_i^(t) as the state information set of agent i in time slot t, where K is the number of state information elements. The elements of s_i^(t) are described in detail below.

(1) For a particular D2D pair, the local CSI best represents the quality of the communication between the current transmitter and receiver, so the channel gain g_ii^(t-1) of the previous time is included.

(2) Another determining factor affecting the rate of the link is the power information of the previous time, p_i^(t-1).

(3) Third, the rate C_i^(t-1) of the link at the previous time can also represent the communication environment around the link.

(4) The interference of the link's transmitter to the other receivers.

(5) The interference of the other links' transmitters to the local link's receiver.

(6) Sixth, in this algorithm the network has to make an accurate power decision independently by sensing the communication environment of the surrounding links, so the channel information of the links around the current link is also included; this neighbor information is determined by equation (15). In the corresponding expressions, d = rank(a, b) means that a ranks d-th when the values in the set b are sorted in descending order.

In summary, s_i^(t) can be expressed as the collection of the above information. An illustrative assembly of these components is sketched below.
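As an illustration only, the sketch below concatenates the six components listed above into one observation vector and implements the rank(a, b) ordering; the exact composition and ordering given by equation (15) in the patent are not reproduced, so the arrangement and names here are assumptions:

    import numpy as np

    def rank(a, values):
        """d = rank(a, b): position of a when the values in b are sorted in descending order."""
        return int(np.sum(np.asarray(values) > a)) + 1

    def build_state(local_gain, own_power, last_rate,
                    caused_interference, suffered_interference, neighbor_info):
        """Concatenate the state components of agent i for one time slot."""
        return np.concatenate([
            [local_gain],             # (1) local CSI of the previous slot
            [own_power],              # (2) own transmit power of the previous slot
            [last_rate],              # (3) achieved rate of the previous slot
            [caused_interference],    # (4) interference caused to other receivers
            [suffered_interference],  # (5) interference received from other transmitters
            np.ravel(neighbor_info),  # (6) channel information of surrounding links
        ])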
3) Reward function r_i^(t): in order to make the link aware of the surrounding communication environment while maximizing the global weighted sum-rate, three parts are considered in the design of the reward function.

First, the most direct feedback for measuring the quality of one power allocation of the link is the transmission rate itself, so the first component of the reward function is the link's own rate C_i^(t).

Second, it is desirable that the links learn to cooperate with each other. If the reward function contained only the link's own rate, the link would certainly cause large interference to the surrounding links, so the interference information around the link is also added to the reward function. This interference information falls into two categories: the interference the current link causes to the other link pairs through its own transmission, and the interference the other links cause to the current link.

Finally, the complete expression of the reward function r_i^(t) in equation (17) combines these parts. It uses C_{j\i}^(t), the rate that link j would achieve after removing the interference caused by link i, and C_{i,alone}^(t), the rate that the current link could achieve if none of the remaining links had an impact on the current link i.

The meaning of the reward function (17) is that the rate of link i at the current moment is reduced by the influence of the current link on the actual rates of the other links, and then increased by the influence of the other links on the rate of this link.
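One consistent reading of this description is sketched below: the reward is the rate of link i, minus the rate loss it causes to the other links, plus the rate loss it suffers from them. This combination and the helper names are assumptions inferred from the sentence above and may differ from the exact equation (17):

    import numpy as np

    def per_user_rates(g, p, noise_power):
        """Per-user rates from a channel-gain matrix g_ij and power vector p."""
        received = g * p[:, None]
        signal = np.diag(received)
        interference = received.sum(axis=0) - signal
        return np.log2(1.0 + signal / (interference + noise_power))

    def reward(i, g, p, noise_power):
        """Assumed reward of agent i, built from the three components described above."""
        c = per_user_rates(g, p, noise_power)                      # actual rates C_j
        p_without_i = p.copy()
        p_without_i[i] = 0.0
        c_without_i = per_user_rates(g, p_without_i, noise_power)  # C_{j\i}: link i silenced
        caused = np.sum(np.delete(c_without_i - c, i))             # rate loss caused to the other links
        p_alone = np.zeros_like(p)
        p_alone[i] = p[i]
        c_alone = per_user_rates(g, p_alone, noise_power)          # C_{i,alone}: only link i active
        suffered = c_alone[i] - c[i]                               # rate loss suffered by link i
        return c[i] - caused + suffered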
For link i, the overall algorithm flow is as follows.

First, the current state s_i^(t) is fed into the main network to obtain the current action a_i^(t) and reward r_i^(t). Combined with the actions of the other links in training, the state vector s_i^(t+1) of the next moment is obtained, and the tuple (s_i^(t), a_i^(t), r_i^(t), s_i^(t+1)) is stored into the data experience pool.

Second, data tuples are picked out of the experience pool.

Third, the data tuples are fed directly into the main network to obtain the evaluation value corresponding to the current latest policy.

Fourth, the next-moment data s_i^(t+1) in the data tuple is fed into the target network to compute the action a_i^(t+1) of the current link at the next moment, and the next-moment actions of the other links are used to compute the target evaluation value.

Finally, the loss functions are computed from this information and the Main Net is updated accordingly. In addition, the network updates the parameters of the Target Net in a soft-update manner, i.e., the parameters are moved a little at each training step, which reduces the variance of the network. It is also worth emphasizing that the value range of the actor network output after the tanh activation function is (-1, 1), which does not directly correspond to a power value, so the following mapping between the actor network output x and the power p_i is designed:

    p_i = P_max × (x + 1) / 2    (22)
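Equation (22) can be applied directly to the actor output; a one-line helper (the name is illustrative) is:

    def actor_output_to_power(x, p_max):
        """Map the tanh-bounded actor output x in (-1, 1) to a transmit power in (0, P_max)."""
        return p_max * (x + 1.0) / 2.0

    # for example, an actor output of 0.0 corresponds to half of the maximum transmit power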
In the following, the performance of the proposed solution is illustrated through simulation results. First, consider a network of 4 D2D link pairs. The transmitters of all link pairs are randomly distributed in a square area with a side length of 50 meters, and the distance between the receiver and the transmitter of each link pair is uniformly distributed between 2 m and 50 m. The maximum transmit power of a D2D transmitter is set to P_max = 38 dBm, the background noise power is set to σ^2 = -114 dBm, the Doppler shift is 10 Hz, and the correlation coefficient ρ between adjacent channels is 0.01. The path loss model is 32.45 + 20 log10(f) + 20 log10(d) - G_t - G_r (in dB), where f (MHz) is the carrier frequency, d (km) is the distance, G_t denotes the transmit antenna gain and G_r denotes the receive antenna gain. The invention sets f = 2.4 GHz and G_t = G_r = 2.5 dB. The multi-agent deep reinforcement learning algorithm is implemented using TensorFlow.
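The stated path loss model can be transcribed directly; the helper name and default arguments below are illustrative:

    import math

    def path_loss_db(distance_km, carrier_mhz=2400.0, g_t=2.5, g_r=2.5):
        """Path loss 32.45 + 20*log10(f[MHz]) + 20*log10(d[km]) - Gt - Gr, in dB."""
        return (32.45 + 20.0 * math.log10(carrier_mhz)
                + 20.0 * math.log10(distance_km) - g_t - g_r)

    # for example, a 50 m link: path_loss_db(0.05) is roughly 69 dB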
FIG. 5 illustrates a comparison of the performance of the multi-agent reinforcement-learning-based power allocation scheme with other power allocation schemes in different test areas. The three comparison algorithms are a full power transmission strategy (MPT), an FP scheme using real-time channel information, and an AA scheme in which all links transmit at maximum power. With only 4 link pairs, the network of the invention stabilizes after about 60,000 training steps, and its performance is remarkable: the proposed algorithm is about 20% better than the FP algorithm and about 50% better than the always-on AA algorithm. Showing such performance on only four links demonstrates the effectiveness of the algorithm. It is worth emphasizing that the training here is performed while the positions of the 4 links change constantly; only changing link locations can test whether the network really learns to use the interference data around the links to infer the real-time communication environment and make decisions. Some previous reinforcement-learning algorithms are trained with the geographical positions of the links fixed. Although such training can achieve good results, it is of little significance in a practical communication system, because the positions of the link pairs cannot stay unchanged forever, and once they change, those algorithms become invalid and must be retrained. The significance of the present algorithm is that the network does not need to be retrained while the locations of the link pairs keep changing, so the algorithm remains effective at all times.
Some of the loss variations of the reinforcement learning during training are shown below; taking agent 1 as an example makes the unsupervised framework of the algorithm clearer and more intuitive. First, FIG. 6 shows the loss curve of the actor network of one link pair. The loss function of the actor network keeps increasing until about 40,000 steps, indicating that the performance of the network is still deteriorating. After about 40,000 training steps the network finally finds a strategy that reduces the loss function, so the loss keeps decreasing, and after about 60,000 training steps the loss function of the network finally stabilizes. Second, the loss function of the critic network is shown in FIG. 7; the critic loss is minimized to reduce the gap between the actual and expected Q values. During the first 30,000 training steps the change of the critic loss function is irregular: the network is still exploring, so the randomness of the actions is relatively high and different strategies are continuously tried. Consistent with the trend of the actor network, the loss function of the critic network also stabilizes after about 40,000 training steps.

Claims (3)

1. A power allocation method based on deep reinforcement learning in a D2D system, wherein N link pairs, namely N agents, are assumed in the D2D system, and the method comprises the following steps:
S1, each agent receives the outdated channel and power information and the power decision information of the other links from the central controller to obtain its own observation vector;
S2, each agent independently creates a power distribution network based on deep learning and establishes an experience storage pool;
S3, based on the outdated observation vector of the previous moment obtained in step S1, an online decision is made with the power distribution network to obtain the power allocation result of the current moment; the state, action, reward and observation vector obtained from the interaction between the agent and the environment are stored in the experience pool; meanwhile, data are taken out of the respective experience storage pools to train the network and update the network parameters, and the network with the updated parameters is used for the next online decision.
2. The method for power distribution based on deep reinforcement learning in a D2D system of claim 1, wherein in step S2 the specific structure of the power distribution network created individually by each agent is as follows: the power distribution network comprises a Main network for training and a Target network for calculation, wherein the input and the output of the Main network are connected with the experience storage pool;
the structure of the Main network and the Target network is completely the same, and each comprises an actor network for receiving the state information of a link and outputting a power decision value and a critic network for evaluating the current output; the Main network is updated in real time after the actor and critic networks compute their loss functions, and the Target network is used for computing a target Q value, fixing the Q value to stabilize the network.
3. The method for power allocation based on deep reinforcement learning in a D2D system of claim 2, wherein in step S3 the state, action and reward obtained from the interaction between the agent and the environment are defined as follows:
the state s_i^(t) is defined as the state information set of agent i in time slot t, where K is the number of state information elements; the set contains: the channel gain from transmitter i to receiver j at the last time; the power information at the last time; the interference of the link's transmitter to the other receivers, where σ^2 represents the power of the additive white Gaussian noise; the interference suffered by the link's receiver from the other links' transmitters; the rate of the link at the last time; the SINR of user i at time t; the channel information of the links around the present link; and the past information;
the action space A is defined as the continuous interval [0, P_max]; for agent i, a decision vector is defined as the vector the agent currently stores into the experience pool, and the agent's action in time slot t is an arbitrarily chosen real number in [0, P_max], where P_max is the maximum power;
the reward function r_i^(t) is defined from the following quantities: the rate of link i at the current moment; the weight w; C_{j\i}^(t), the rate of link j after the interference generated by link i is removed; and C_{i,alone}^(t), the rate that the current link can achieve if none of the remaining links have an impact on the current link i.
CN202110475005.XA 2021-04-29 2021-04-29 Power distribution method based on deep reinforcement learning in D2D system Active CN113115355B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110475005.XA CN113115355B (en) 2021-04-29 2021-04-29 Power distribution method based on deep reinforcement learning in D2D system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110475005.XA CN113115355B (en) 2021-04-29 2021-04-29 Power distribution method based on deep reinforcement learning in D2D system

Publications (2)

Publication Number Publication Date
CN113115355A true CN113115355A (en) 2021-07-13
CN113115355B CN113115355B (en) 2022-04-22

Family

ID=76720455

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110475005.XA Active CN113115355B (en) 2021-04-29 2021-04-29 Power distribution method based on deep reinforcement learning in D2D system

Country Status (1)

Country Link
CN (1) CN113115355B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109474980A (en) * 2018-12-14 2019-03-15 北京科技大学 A kind of wireless network resource distribution method based on depth enhancing study
CN109729528A (en) * 2018-12-21 2019-05-07 北京邮电大学 A kind of D2D resource allocation methods based on the study of multiple agent deeply
WO2020135312A1 (en) * 2018-12-26 2020-07-02 上海交通大学 Artificial neural network-based power positioning and thrust distribution apparatus and method
CN109862610A (en) * 2019-01-08 2019-06-07 华中科技大学 A kind of D2D subscriber resource distribution method based on deeply study DDPG algorithm
CN110213814A (en) * 2019-07-04 2019-09-06 电子科技大学 A kind of distributed power distributing method based on deep neural network
US20190370086A1 (en) * 2019-08-15 2019-12-05 Intel Corporation Methods and apparatus to manage power of deep learning accelerator systems
CN112396172A (en) * 2019-08-15 2021-02-23 英特尔公司 Method and apparatus for managing power of deep learning accelerator system
CN111901862A (en) * 2020-07-07 2020-11-06 西安交通大学 User clustering and power distribution method, device and medium based on deep Q network
CN112261725A (en) * 2020-10-23 2021-01-22 安徽理工大学 Data packet transmission intelligent decision method based on deep reinforcement learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JIAQI SHI: "Distributed Deep Learning Power Allocation for D2D Network Based on Outdated Information", 《 2020 IEEE WIRELESS COMMUNICATIONS AND NETWORKING CONFERENCE (WCNC)》 *
吕亚平: "基于深度学习的家庭基站下行链路功率分配", 《计算机工程》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114257994A (en) * 2021-11-25 2022-03-29 西安电子科技大学 D2D network robust power control method, system, equipment and terminal
CN114257994B (en) * 2021-11-25 2024-04-26 西安电子科技大学 Method, system, equipment and terminal for controlling robust power of D2D network

Also Published As

Publication number Publication date
CN113115355B (en) 2022-04-22

Similar Documents

Publication Publication Date Title
EP3635505B1 (en) System and method for deep learning and wireless network optimization using deep learning
US10375585B2 (en) System and method for deep learning and wireless network optimization using deep learning
Li et al. Downlink transmit power control in ultra-dense UAV network based on mean field game and deep reinforcement learning
CN109962728B (en) Multi-node joint power control method based on deep reinforcement learning
US11533115B2 (en) Systems and methods for wireless signal configuration by a neural network
CN110213814B (en) Distributed power distribution method based on deep neural network
WO2021036414A1 (en) Co-channel interference prediction method for satellite-to-ground downlink under low earth orbit satellite constellation
CN111526592B (en) Non-cooperative multi-agent power control method used in wireless interference channel
CN114698128B (en) Anti-interference channel selection method and system for cognitive satellite-ground network
US11284361B2 (en) System and method for device-to-device communication
CN106604288B (en) Wireless sensor network interior joint adaptively covers distribution method and device on demand
CN113239632A (en) Wireless performance prediction method and device, electronic equipment and storage medium
Adeel et al. Critical analysis of learning algorithms in random neural network based cognitive engine for lte systems
CN113115355B (en) Power distribution method based on deep reinforcement learning in D2D system
CN115499921A (en) Three-dimensional trajectory design and resource scheduling optimization method for complex unmanned aerial vehicle network
CN113382060B (en) Unmanned aerial vehicle track optimization method and system in Internet of things data collection
Bhadauria et al. QoS based deep reinforcement learning for V2X resource allocation
CN110505604B (en) Method for accessing frequency spectrum of D2D communication system
CN115811788B (en) D2D network distributed resource allocation method combining deep reinforcement learning and unsupervised learning
CN110753367B (en) Safety performance prediction method for mobile communication system
Liu et al. A deep reinforcement learning based adaptive transmission strategy in space-air-ground integrated networks
CN113747386A (en) Intelligent power control method in cognitive radio network spectrum sharing
Ren et al. Joint spectrum allocation and power control in vehicular communications based on dueling double DQN
CN114268348A (en) Honeycomb-free large-scale MIMO power distribution method based on deep reinforcement learning
Adeel et al. Random neural network based power controller for inter-cell interference coordination in lte-ul

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant