CN113115355A - Power distribution method based on deep reinforcement learning in D2D system - Google Patents

Power distribution method based on deep reinforcement learning in D2D system

Info

Publication number
CN113115355A
CN113115355A (application CN202110475005.XA)
Authority
CN
China
Prior art keywords
network
link
power
agent
information
Prior art date
Legal status
Granted
Application number
CN202110475005.XA
Other languages
Chinese (zh)
Other versions
CN113115355B (en)
Inventor
梁应敞
史佳琦
Current Assignee
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202110475005.XA priority Critical patent/CN113115355B/en
Publication of CN113115355A publication Critical patent/CN113115355A/en
Application granted granted Critical
Publication of CN113115355B publication Critical patent/CN113115355B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W72/00 Local resource management
    • H04W72/04 Wireless resource allocation
    • H04W72/044 Wireless resource allocation based on the type of the allocated resource
    • H04W72/0473 Wireless resource allocation based on the type of the allocated resource the resource being transmission power
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W24/00 Supervisory, monitoring or testing arrangements
    • H04W24/06 Testing, supervising or monitoring using simulated traffic
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02E REDUCTION OF GREENHOUSE GAS [GHG] EMISSIONS, RELATED TO ENERGY GENERATION, TRANSMISSION OR DISTRIBUTION
    • Y02E40/00 Technologies for an efficient electrical power generation, transmission or distribution
    • Y02E40/70 Smart grids as climate change mitigation technology in the energy generation sector
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04 INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00 Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50 Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention belongs to the technical field of wireless communication, and particularly relates to a power distribution method based on deep reinforcement learning in a D2D system. In the scheme of the invention, a deep neural network is constructed independently for each link pair; the channel information of all links does not need to be obtained in real time. Instead, the communication environment around the current link is predicted from partial historical information and the decision information of the other links, and the link pairs cooperate with each other to make real-time power decisions that maximize the weighted sum-rate of the global network, thereby realizing an iteration-free power distribution method based on deep reinforcement learning.

Description

Power distribution method based on deep reinforcement learning in D2D system
Technical Field
The invention belongs to the technical field of wireless communication, and particularly relates to a power distribution method based on deep reinforcement learning in a D2D system.
Background
Network operators worldwide have shown strong interest in the development and application of 5G. A basic idea of 5G is to take advantage of direct connections between mobile users to relieve the burden on the base station. To improve the energy efficiency of cellular networks and increase system throughput, device-to-device (D2D) communication is considered a good and viable solution. In a D2D network, multiple D2D link pairs coexist with full frequency reuse in a cell, so the interference between links becomes very complex. In the D2D scenario, system capacity is generally optimized by managing interference through power control. Most conventional power control algorithms are implemented through continuous iteration based on real-time channel information, and real-time power adjustment is very difficult because channel estimation is time-consuming and the matrix operations involved are complex.
Disclosure of Invention
Aiming at the problems of conventional power control, the invention provides an iteration-free power allocation method based on deep reinforcement learning in a D2D system.
The technical scheme of the invention is as follows:
A power allocation method based on deep reinforcement learning in a D2D system, assuming that there are N link pairs, namely N agents, in the D2D system, comprises the following steps:
S1, information collection: the N link pairs respectively receive the outdated channel and power information and the decision information of the other links from a central controller (CC) to obtain their respective observation vectors;
S2, network construction: each link pair independently creates a network and establishes its own experience storage pool (Replay Buffer);
S3, online decision and network training: an online power decision is made according to the outdated observation vector of the communication environment around the link collected in step S1, and the state, action, reward and observation vector obtained from the interaction between the agent and the environment are stored in the experience pool; meanwhile, each link randomly takes a group of data out of the experience pool to train the network built in S2 and update the network parameters, and the network with the updated parameters is used for the next online decision.
The power control model based on deep reinforcement learning provided by the invention mainly comprises the following parts of online decision making and training:
Data: the D2D system provides channel information and power data for the offline module and the online module, respectively. For the offline module, the D2D system provides labeled sample data as the training set; for the online module, the D2D system provides unlabeled sampled data as the data to be processed.
Network construction: a network is constructed independently for each link according to a specified structure; the network is responsible for giving the specific power decision and the loss function of the network according to the input information.
Online training: through online training, continuous power allocation is treated as a multi-agent cooperation task. The system establishes a fixed-size experience pool (Replay Buffer) for each link pair to store data. Each link pair independently takes data out of its own experience pool; the output of the online reinforcement learning can then be modeled as a posterior probability, from which a cost function suitable for power allocation (for example, the cost function based on the maximum posterior probability designed by the invention) is developed. Given a training set, a trained network is obtained through continuous online training and feedback.
Online decision making: during online training, the power allocation output of the network is taken as the real-time power allocation result. Meanwhile, the data collected during online decision making are stored in the experience pool as training data for later training. The quality of the online decisions becomes better and better as the online training progresses. A per-agent sketch of this decision and training loop is given below.
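The following Python sketch outlines how one link pair could alternate between making an online power decision, storing the experience tuple, and training on a randomly sampled batch from its own experience pool. The class and method names (the agent interface with act, env_feedback and train, the buffer capacity and batch size) are illustrative assumptions, not part of the patent:

    import random
    from collections import deque

    class ReplayBuffer:
        """Fixed-size experience pool of one link pair."""
        def __init__(self, capacity=100000):
            self.buffer = deque(maxlen=capacity)

        def store(self, state, action, reward, next_state):
            self.buffer.append((state, action, reward, next_state))

        def sample(self, batch_size):
            # randomly select a group of stored tuples for training
            return random.sample(list(self.buffer), batch_size)

    def online_step(agent, buffer, state, batch_size=64):
        # online decision with the current network parameters
        action = agent.act(state)                        # real-time power decision
        reward, next_state = agent.env_feedback(action)  # interaction with the environment
        buffer.store(state, action, reward, next_state)  # collected data becomes training data
        # online training on a random batch; the updated network is used next time
        if len(buffer.buffer) >= batch_size:
            agent.train(buffer.sample(batch_size))
        return next_state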
Based on the deep neural network, the invention uses the linear rectification function (ReLU) as the activation function of the input and hidden layers, in the smooth (softplus) form

    ReLU(x) = log(1 + exp(x))

The output layer uses the tanh function to determine the final power output level; the resulting value in (-1, 1) is mapped to the transmit power by equation (22).
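A minimal TensorFlow/Keras sketch of such a network is given below. The layer sizes and function names are illustrative assumptions; the softplus-style hidden activation and the tanh output follow the description above, and the mapping to a power value via equation (22) is applied outside the network:

    import tensorflow as tf

    def build_actor(state_dim, hidden_units=(64, 64)):
        """Actor network sketch: state in, tanh-bounded decision value out."""
        inputs = tf.keras.Input(shape=(state_dim,))
        x = inputs
        for units in hidden_units:
            # softplus(x) = log(1 + exp(x)), the smooth rectifier described above
            x = tf.keras.layers.Dense(units, activation="softplus")(x)
        # output lies in (-1, 1); mapped to [0, P_max] by equation (22)
        outputs = tf.keras.layers.Dense(1, activation="tanh")(x)
        return tf.keras.Model(inputs, outputs)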
the power distribution mechanism based on multi-agent deep reinforcement learning provided by the invention is a universal reinforcement learning framework, can be suitable for any type of network, and can be generalized to different networks.
The method has the advantage that a deep neural network is constructed independently for each link pair in the scheme of the invention; the channel information of all links does not need to be obtained in real time. The communication environment around the current link is predicted from partial historical information and the decision information of the other links, and all link pairs can cooperate with each other to make real-time power decisions that maximize the weighted sum-rate of the global network, thereby realizing an iteration-free power distribution method based on deep reinforcement learning.
Drawings
FIG. 1 shows the D2D communication system model in the present invention;
FIG. 2 shows the frame structure of the D2D communication system in the present invention;
FIG. 3 shows the network structure of each pair of links in the present invention;
FIG. 4 shows the power decision flow for each pair of link users in the present invention;
FIG. 5 shows a comparison of the performance of the reinforcement-learning-based power allocation scheme proposed by the present invention with other power allocation schemes for different numbers of test links;
FIG. 6 shows the actor network training loss variation of a pair of links in the reinforcement-learning-based power allocation scheme proposed by the present invention;
FIG. 7 shows the critic network training loss variation of a pair of links in the reinforcement-learning-based power allocation scheme proposed by the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings.
Fig. 1 shows the D2D network model in the present invention. The system consists of a cellular mobile communication system and a D2D communication system. In this example, the macrocell base station reserves a small portion of exclusive cellular spectrum for the D2D communication system, so the cellular mobile communication system and the D2D communication system do not interfere with each other, and the macrocell base station only serves as a relay that helps the D2D communication devices exchange a small amount of control information with low delay. Assume that there are M D2D communication devices and 1 channel in this example system, as shown in Fig. 1.

Let g_ij^(t) denote the channel gain from transmitter i to receiver j in frame t, composed of a large-scale fading component θ_ij and a small-scale fading component h_ij^(t):

    g_ij^(t) = θ_ij |h_ij^(t)|^2

The correlation of the channel over time is modeled as a first-order Gauss-Markov process, and the Jakes model is used to express the variation of the small-scale fading in frame t:

    h_ij^(t) = ρ h_ij^(t-1) + sqrt(1 - ρ^2) e_ij^(t)

where the channel parameter at time zero, h_ij^(0), obeys a complex Gaussian distribution with mean μ and variance σ^2, the innovation terms e_ij^(t) are independent and identically distributed complex Gaussian CN(0, 1) random variables, and ρ is the channel correlation coefficient between adjacent frames, ρ = J_0(2π f_d T), where J_0 denotes the zeroth-order Bessel function of the first kind and f_d is the maximum Doppler frequency. The large-scale fading θ_ij follows the ITU-1411 short-range outdoor model with 5 MHz bandwidth and 2.4 GHz carrier frequency; it is related to the distance between the two communication nodes. The small-scale fading remains constant within one frame but varies from frame to frame.

The signal-to-interference-plus-noise ratio (SINR) of user i at time t is

    γ_i^(t) = g_ii^(t) p_i^(t) / ( Σ_{j≠i} g_ji^(t) p_j^(t) + σ^2 )

where p_i^(t) denotes the transmit power of user i at time t, p^(t) = [p_1^(t), ..., p_M^(t)] denotes the power vector of all users in the network at time t, and σ^2 denotes the power of the additive white Gaussian noise. The rate of user i at time t is

    C_i^(t) = log2(1 + γ_i^(t))
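The channel evolution and rate computation above can be sketched in a few lines of Python. The innovation term is drawn as CN(0, 1) as described above; the array shapes and helper names are illustrative assumptions:

    import numpy as np

    def evolve_channel(h_prev, rho):
        """First-order Gauss-Markov (Jakes) update of the small-scale fading matrix."""
        M = h_prev.shape[0]
        innovation = (np.random.randn(M, M) + 1j * np.random.randn(M, M)) / np.sqrt(2)  # CN(0, 1)
        return rho * h_prev + np.sqrt(1.0 - rho ** 2) * innovation

    def user_rates(theta, h, p, noise_power):
        """Per-user SINR and rate for channel gains g_ij = theta_ij * |h_ij|^2."""
        g = theta * np.abs(h) ** 2            # M x M channel gain matrix
        received = g * p[:, None]             # power of transmitter i seen at receiver j
        signal = np.diag(received)            # g_ii * p_i
        interference = received.sum(axis=0) - signal
        sinr = signal / (interference + noise_power)
        return np.log2(1.0 + sinr)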
The present invention aims to find an efficient power allocation scheme that maximizes the weighted sum-rate of all D2D users, i.e.

    maximize over p^(t):  Σ_i w_i^(t) C_i^(t)
    subject to:           0 ≤ p_i^(t) ≤ P_max for every user i

where w_i^(t) represents the weight of user i at time t, typically assigned according to the user's long-term average rate. The weights ensure user fairness in the network by giving users with poor channel conditions more transmission opportunities.

In a large D2D network it is practically difficult to obtain real-time CSI because of the large overhead and latency of the backhaul network. The invention therefore assumes that only past information is available, so only the conditional expectation of the real-time weighted sum-rate can be maximized:

    maximize over p^(t):  E[ Σ_i w_i^(t) C_i^(t) | past information ]
    subject to:           0 ≤ p_i^(t) ≤ P_max for every user i

where the past information comprises the outdated channel, power and decision information of the previous frames. The problem above is non-convex and difficult to solve with conventional optimization methods, requiring high-dimensional integration and complex matrix operations. The invention uses deep reinforcement learning to obtain the power allocation result directly from past information, skipping the complex matrix operations.
Fig. 2 shows the frame structure of user communications in the D2D network in the present invention. The D2D link pair divides the data frame of one time slot into three parts. In the first part, the frame header, the D2D link pair receives the outdated interference information and the power decision information of the last time from the CC, and then inputs the processed information into the neural network to make the power decision. In the second part, the D2D link pair transmits data with the allocated power while collecting interference information in real time. Finally, in the third part, the frame tail, the interference information of this time and the link's own power allocation information are transmitted to the CC.
Fig. 4 shows the power decision flow of the present algorithm.
Fig. 3 shows the reinforcement learning network structure of each pair of links in the present invention. There are three main components in the network: the Replay Buffer, the Main Net and the Target Net.
The Replay Buffer is responsible for storing the sample data tuples generated by the Main Net. During network training, stored data tuples are taken out of the Replay Buffer according to a certain strategy, which can be random sampling or some designed weighted selection strategy.
The network structures of the Main Net and the Target Net are exactly the same, and each contains an actor network and a critic network. The actor network receives the state information of the link and outputs a power decision value, and the critic network evaluates the current output of the actor, i.e., judges whether the power decision is good or not.
The Main Net has two functions: it generates the real-time data tuples and stores them in the Replay Buffer, and it is updated in real time after the actor and critic networks compute their loss functions.
The Target Net has only one role: computing the target Q value in the loss function. It fixes the Q value to stabilize the network, preventing the target value from jumping continuously and degrading the training. The parameters of the Target Net are overwritten by the parameters of the Main Net after a fixed period of time or a fixed number of training steps.
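The parameter copy from the Main Net to the Target Net can be sketched as follows; get_weights and set_weights are standard Keras model calls, and the soft-update coefficient tau (used by the soft-update variant mentioned later in the description) is an illustrative value:

    def hard_update(target_net, main_net):
        """Overwrite the Target Net parameters with the Main Net parameters."""
        target_net.set_weights(main_net.get_weights())

    def soft_update(target_net, main_net, tau=0.005):
        """Move the Target Net parameters a small step toward the Main Net parameters."""
        mixed = [tau * w_main + (1.0 - tau) * w_target
                 for w_main, w_target in zip(main_net.get_weights(), target_net.get_weights())]
        target_net.set_weights(mixed)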
The following introduces more important variables in the network:
1) Action space A: in each time slot, each agent needs to decide its own transmit power level. The network in the invention does not need power discretization, i.e., it can make power decisions over a continuous action range, which conventional algorithms cannot do. The action space A in the invention is therefore defined as the continuous interval

    A = [0, P_max]

so the network action space contains infinitely many values. For link pair i, the action a_i^(t) in time slot t is an arbitrary real number that the agent selects from the range [0, P_max]. In addition, a^(t) is defined as the decision vector that the current link stores into the experience pool.
2) State space S: as the basis for power decisions, the state must provide the network with enough information for the agent to understand the surrounding communication environment well enough to support correct decisions. In a communication network, the communication environment around a link consists of three parts: the communication quality between the link's own transmitter and receiver, the interference the local transmitter causes to other receivers, and the interference other transmitters cause to the local receiver. With these three pieces of information, the link can sense the surrounding communication environment. Define s_i^(t) as the state information set of agent i in time slot t, where K is the number of state information elements. The elements of s_i^(t) are described in detail below.

(1) For a particular D2D pair, the local CSI best represents the quality of the communication between the current transmitter and receiver, so the channel gain g_ii^(t-1) of the previous time is included.

(2) Another determining factor affecting the rate of the link is the power information of the previous time, p_i^(t-1).

(3) Third, the rate C_i^(t-1) of the link at the previous time can also represent the communication environment around the link.

(4) The interference of the link's transmitter to the other receivers.

(5) The interference of the other links' transmitters to the local link's receiver.

(6) Sixth, in this algorithm the network has to make an accurate power decision independently by sensing the communication environment of the surrounding links, so the channel information of the links around the current link is also included; this neighbor information is determined by equation (15). In the corresponding expressions, d = rank(a, b) means that a ranks d-th when the values in the set b are sorted in descending order.

In summary, s_i^(t) can be expressed as the collection of the above information. An illustrative assembly of these components is sketched below.
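As an illustration only, the sketch below concatenates the six components listed above into one observation vector and implements the rank(a, b) ordering; the exact composition and ordering given by equation (15) in the patent are not reproduced, so the arrangement and names here are assumptions:

    import numpy as np

    def rank(a, values):
        """d = rank(a, b): position of a when the values in b are sorted in descending order."""
        return int(np.sum(np.asarray(values) > a)) + 1

    def build_state(local_gain, own_power, last_rate,
                    caused_interference, suffered_interference, neighbor_info):
        """Concatenate the state components of agent i for one time slot."""
        return np.concatenate([
            [local_gain],             # (1) local CSI of the previous slot
            [own_power],              # (2) own transmit power of the previous slot
            [last_rate],              # (3) achieved rate of the previous slot
            [caused_interference],    # (4) interference caused to other receivers
            [suffered_interference],  # (5) interference received from other transmitters
            np.ravel(neighbor_info),  # (6) channel information of surrounding links
        ])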
3) Reward function r_i^(t): in order to make the link aware of the surrounding communication environment while maximizing the global weighted sum-rate, three parts are considered in the design of the reward function.

First, the most direct feedback for measuring the quality of one power allocation of the link is the transmission rate itself, so the first component of the reward function is the link's own rate C_i^(t).

Second, it is desirable that the links learn to cooperate with each other. If the reward function contained only the link's own rate, the link would certainly cause large interference to the surrounding links, so the interference information around the link is also added to the reward function. This interference information falls into two categories: the interference the current link causes to the other link pairs through its own transmission, and the interference the other links cause to the current link.

Finally, the complete expression of the reward function r_i^(t) in equation (17) combines these parts. It uses C_{j\i}^(t), the rate that link j would achieve after removing the interference caused by link i, and C_{i,alone}^(t), the rate that the current link could achieve if none of the remaining links had an impact on the current link i.

The meaning of the reward function (17) is that the rate of link i at the current moment is reduced by the influence of the current link on the actual rates of the other links, and then increased by the influence of the other links on the rate of this link.
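One consistent reading of this description is sketched below: the reward is the rate of link i, minus the rate loss it causes to the other links, plus the rate loss it suffers from them. This combination and the helper names are assumptions inferred from the sentence above and may differ from the exact equation (17):

    import numpy as np

    def per_user_rates(g, p, noise_power):
        """Per-user rates from a channel-gain matrix g_ij and power vector p."""
        received = g * p[:, None]
        signal = np.diag(received)
        interference = received.sum(axis=0) - signal
        return np.log2(1.0 + signal / (interference + noise_power))

    def reward(i, g, p, noise_power):
        """Assumed reward of agent i, built from the three components described above."""
        c = per_user_rates(g, p, noise_power)                      # actual rates C_j
        p_without_i = p.copy()
        p_without_i[i] = 0.0
        c_without_i = per_user_rates(g, p_without_i, noise_power)  # C_{j\i}: link i silenced
        caused = np.sum(np.delete(c_without_i - c, i))             # rate loss caused to the other links
        p_alone = np.zeros_like(p)
        p_alone[i] = p[i]
        c_alone = per_user_rates(g, p_alone, noise_power)          # C_{i,alone}: only link i active
        suffered = c_alone[i] - c[i]                               # rate loss suffered by link i
        return c[i] - caused + suffered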
For link i, the overall algorithm flow is as follows.

First, the current state s_i^(t) is fed into the main network to obtain the current action a_i^(t) and reward r_i^(t). Combined with the actions of the other links in training, the state vector s_i^(t+1) of the next moment is obtained, and the tuple (s_i^(t), a_i^(t), r_i^(t), s_i^(t+1)) is stored into the data experience pool.

Second, data tuples are picked out of the experience pool.

Third, the data tuples are fed directly into the main network to obtain the evaluation value corresponding to the current latest policy.

Fourth, the next-moment data s_i^(t+1) in the data tuple is fed into the target network to compute the action a_i^(t+1) of the current link at the next moment, and the next-moment actions of the other links are used to compute the target evaluation value.

Finally, the loss functions are computed from this information and the Main Net is updated accordingly. In addition, the network updates the parameters of the Target Net in a soft-update manner, i.e., the parameters are moved a little at each training step, which reduces the variance of the network. It is also worth emphasizing that the value range of the actor network output after the tanh activation function is (-1, 1), which does not directly correspond to a power value, so the following mapping between the actor network output x and the power p_i is designed:

    p_i = P_max × (x + 1) / 2    (22)
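Equation (22) can be applied directly to the actor output; a one-line helper (the name is illustrative) is:

    def actor_output_to_power(x, p_max):
        """Map the tanh-bounded actor output x in (-1, 1) to a transmit power in (0, P_max)."""
        return p_max * (x + 1.0) / 2.0

    # for example, an actor output of 0.0 corresponds to half of the maximum transmit power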
In the following, the performance of the proposed solution is illustrated through simulation results. First, consider a network of 4 D2D link pairs. The transmitters of all link pairs are randomly distributed in a square area with a side length of 50 meters, and the distance between the receiver and the transmitter of each link pair is uniformly distributed between 2 m and 50 m. The maximum transmit power of a D2D transmitter is set to P_max = 38 dBm, the background noise power is set to σ^2 = -114 dBm, the Doppler shift is 10 Hz, and the correlation coefficient ρ between adjacent channels is 0.01. The path loss model is 32.45 + 20 log10(f) + 20 log10(d) - G_t - G_r (in dB), where f (MHz) is the carrier frequency, d (km) is the distance, G_t denotes the transmit antenna gain and G_r denotes the receive antenna gain. The invention sets f = 2.4 GHz and G_t = G_r = 2.5 dB. The multi-agent deep reinforcement learning algorithm is implemented using TensorFlow.
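The stated path loss model can be transcribed directly; the helper name and default arguments below are illustrative:

    import math

    def path_loss_db(distance_km, carrier_mhz=2400.0, g_t=2.5, g_r=2.5):
        """Path loss 32.45 + 20*log10(f[MHz]) + 20*log10(d[km]) - Gt - Gr, in dB."""
        return (32.45 + 20.0 * math.log10(carrier_mhz)
                + 20.0 * math.log10(distance_km) - g_t - g_r)

    # for example, a 50 m link: path_loss_db(0.05) is roughly 69 dB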
FIG. 5 illustrates a comparison of the performance of the multi-agent reinforcement-learning-based power allocation scheme with other power allocation schemes in different test areas. The three comparison algorithms are a full power transmission strategy (MPT), an FP scheme using real-time channel information, and an AA scheme in which all links transmit at maximum power. With only 4 link pairs, the network of the invention stabilizes after about 60,000 training steps, and its performance is remarkable: the proposed algorithm is about 20% better than the FP algorithm and about 50% better than the always-on AA algorithm. Showing such performance on only four links demonstrates the effectiveness of the algorithm. It is worth emphasizing that the training here is performed while the positions of the 4 links change constantly; only changing link locations can test whether the network really learns to use the interference data around the links to infer the real-time communication environment and make decisions. Some previous reinforcement-learning algorithms are trained with the geographical positions of the links fixed. Although such training can achieve good results, it is of little significance in a practical communication system, because the positions of the link pairs cannot stay unchanged forever, and once they change, those algorithms become invalid and must be retrained. The significance of the present algorithm is that the network does not need to be retrained while the locations of the link pairs keep changing, so the algorithm remains effective at all times.
Some of the loss variations of the reinforcement learning during training are shown below; taking agent 1 as an example makes the unsupervised framework of the algorithm clearer and more intuitive. First, FIG. 6 shows the loss curve of the actor network of one link pair. The loss function of the actor network keeps increasing until about 40,000 steps, indicating that the performance of the network is still deteriorating. After about 40,000 training steps the network finally finds a strategy that reduces the loss function, so the loss keeps decreasing, and after about 60,000 training steps the loss function of the network finally stabilizes. Second, the loss function of the critic network is shown in FIG. 7; the critic loss is minimized to reduce the gap between the actual and expected Q values. During the first 30,000 training steps the change of the critic loss function is irregular: the network is still exploring, so the randomness of the actions is relatively high and different strategies are continuously tried. Consistent with the trend of the actor network, the loss function of the critic network also stabilizes after about 40,000 training steps.

Claims (3)

1. A power allocation method based on deep reinforcement learning in a D2D system, wherein N link pairs, namely N agents, are assumed in the D2D system, and the method comprises the following steps:
S1, each agent receives the outdated channel and power information and the power decision information of the other links from the central controller to obtain its own observation vector;
S2, each agent independently creates a power distribution network based on deep learning and establishes an experience storage pool;
S3, based on the outdated observation vector of the previous moment obtained in step S1, an online decision is made with the power distribution network to obtain the power allocation result of the current moment; the state, action, reward and observation vector obtained from the interaction between the agent and the environment are stored in the experience pool; meanwhile, data are taken out of the respective experience storage pools to train the network and update the network parameters, and the network with the updated parameters is used for the next online decision.
2. The method for power distribution based on deep reinforcement learning in a D2D system of claim 1, wherein in step S2 the specific structure of the power distribution network created individually by each agent is as follows: the power distribution network comprises a Main network for training and a Target network for calculation, wherein the input and the output of the Main network are connected with the experience storage pool;
the structure of the Main network and the Target network is completely the same, and each comprises an actor network for receiving the state information of a link and outputting a power decision value and a critic network for evaluating the current output; the Main network is updated in real time after the actor and critic networks compute their loss functions, and the Target network is used for computing a target Q value, fixing the Q value to stabilize the network.
3. The method for power allocation based on deep reinforcement learning in a D2D system of claim 2, wherein in step S3 the state, action and reward obtained from the interaction between the agent and the environment are defined as follows:
the state s_i^(t) is defined as the state information set of agent i in time slot t, where K is the number of state information elements; the set contains: the channel gain from transmitter i to receiver j at the last time; the power information at the last time; the interference of the link's transmitter to the other receivers, where σ^2 represents the power of the additive white Gaussian noise; the interference suffered by the link's receiver from the other links' transmitters; the rate of the link at the last time; the SINR of user i at time t; the channel information of the links around the present link; and the past information;
the action space A is defined as the continuous interval [0, P_max]; for agent i, a decision vector is defined as the vector the agent currently stores into the experience pool, and the agent's action in time slot t is an arbitrarily chosen real number in [0, P_max], where P_max is the maximum power;
the reward function r_i^(t) is defined from the following quantities: the rate of link i at the current moment; the weight w; C_{j\i}^(t), the rate of link j after the interference generated by link i is removed; and C_{i,alone}^(t), the rate that the current link can achieve if none of the remaining links have an impact on the current link i.
CN202110475005.XA 2021-04-29 2021-04-29 Power distribution method based on deep reinforcement learning in D2D system Active CN113115355B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110475005.XA CN113115355B (en) 2021-04-29 2021-04-29 Power distribution method based on deep reinforcement learning in D2D system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110475005.XA CN113115355B (en) 2021-04-29 2021-04-29 Power distribution method based on deep reinforcement learning in D2D system

Publications (2)

Publication Number Publication Date
CN113115355A true CN113115355A (en) 2021-07-13
CN113115355B CN113115355B (en) 2022-04-22

Family

ID=76720455

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110475005.XA Active CN113115355B (en) 2021-04-29 2021-04-29 Power distribution method based on deep reinforcement learning in D2D system

Country Status (1)

Country Link
CN (1) CN113115355B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109474980A (en) * 2018-12-14 2019-03-15 北京科技大学 A kind of wireless network resource distribution method based on depth enhancing study
CN109729528A (en) * 2018-12-21 2019-05-07 北京邮电大学 A kind of D2D resource allocation methods based on the study of multiple agent deeply
WO2020135312A1 (en) * 2018-12-26 2020-07-02 上海交通大学 Artificial neural network-based power positioning and thrust distribution apparatus and method
CN109862610A (en) * 2019-01-08 2019-06-07 华中科技大学 A kind of D2D subscriber resource distribution method based on deeply study DDPG algorithm
CN110213814A (en) * 2019-07-04 2019-09-06 电子科技大学 A kind of distributed power distributing method based on deep neural network
US20190370086A1 (en) * 2019-08-15 2019-12-05 Intel Corporation Methods and apparatus to manage power of deep learning accelerator systems
CN112396172A (en) * 2019-08-15 2021-02-23 英特尔公司 Method and apparatus for managing power of deep learning accelerator system
CN111901862A (en) * 2020-07-07 2020-11-06 西安交通大学 User clustering and power distribution method, device and medium based on deep Q network
CN112261725A (en) * 2020-10-23 2021-01-22 安徽理工大学 Data packet transmission intelligent decision method based on deep reinforcement learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JIAQI SHI: "Distributed Deep Learning Power Allocation for D2D Network Based on Outdated Information", 《 2020 IEEE WIRELESS COMMUNICATIONS AND NETWORKING CONFERENCE (WCNC)》 *
吕亚平: "基于深度学习的家庭基站下行链路功率分配", 《计算机工程》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114257994A (en) * 2021-11-25 2022-03-29 西安电子科技大学 D2D network robust power control method, system, equipment and terminal
CN114257994B (en) * 2021-11-25 2024-04-26 西安电子科技大学 Method, system, equipment and terminal for controlling robust power of D2D network

Also Published As

Publication number Publication date
CN113115355B (en) 2022-04-22

Similar Documents

Publication Publication Date Title
EP3635505B1 (en) System and method for deep learning and wireless network optimization using deep learning
US10375585B2 (en) System and method for deep learning and wireless network optimization using deep learning
Li et al. Downlink transmit power control in ultra-dense UAV network based on mean field game and deep reinforcement learning
CN109962728B (en) Multi-node joint power control method based on deep reinforcement learning
US11533115B2 (en) Systems and methods for wireless signal configuration by a neural network
CN110213814B (en) Distributed power distribution method based on deep neural network
WO2021036414A1 (en) Co-channel interference prediction method for satellite-to-ground downlink under low earth orbit satellite constellation
CN111526592B (en) Non-cooperative multi-agent power control method used in wireless interference channel
CN114698128B (en) Anti-interference channel selection method and system for cognitive satellite-ground network
US11284361B2 (en) System and method for device-to-device communication
CN106604288B (en) Wireless sensor network interior joint adaptively covers distribution method and device on demand
CN113239632A (en) Wireless performance prediction method and device, electronic equipment and storage medium
Adeel et al. Critical analysis of learning algorithms in random neural network based cognitive engine for lte systems
CN113115355B (en) Power distribution method based on deep reinforcement learning in D2D system
CN115499921A (en) Three-dimensional trajectory design and resource scheduling optimization method for complex unmanned aerial vehicle network
CN113382060B (en) Unmanned aerial vehicle track optimization method and system in Internet of things data collection
Bhadauria et al. QoS based deep reinforcement learning for V2X resource allocation
CN110505604B (en) Method for accessing frequency spectrum of D2D communication system
CN115811788B (en) D2D network distributed resource allocation method combining deep reinforcement learning and unsupervised learning
CN110753367B (en) Safety performance prediction method for mobile communication system
Liu et al. A deep reinforcement learning based adaptive transmission strategy in space-air-ground integrated networks
CN113747386A (en) Intelligent power control method in cognitive radio network spectrum sharing
Ren et al. Joint spectrum allocation and power control in vehicular communications based on dueling double DQN
CN114268348A (en) Honeycomb-free large-scale MIMO power distribution method based on deep reinforcement learning
Adeel et al. Random neural network based power controller for inter-cell interference coordination in lte-ul

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant