CN112118556B - Unmanned aerial vehicle track and power joint optimization method based on deep reinforcement learning - Google Patents

Unmanned aerial vehicle track and power joint optimization method based on deep reinforcement learning

Info

Publication number
CN112118556B
CN112118556B (application number CN202011079226.7A)
Authority
CN
China
Prior art keywords
unmanned aerial
aerial vehicle
network
user equipment
drone
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011079226.7A
Other languages
Chinese (zh)
Other versions
CN112118556A (en
Inventor
赵楠
程一强
萧洒
裴一扬
刘聪
刘泽华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hubei University of Technology
Original Assignee
Hubei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hubei University of Technology filed Critical Hubei University of Technology
Publication of CN112118556A publication Critical patent/CN112118556A/en
Application granted granted Critical
Publication of CN112118556B publication Critical patent/CN112118556B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W4/00Services specially adapted for wireless communication networks; Facilities therefor
    • H04W4/30Services specially adapted for particular environments, situations or purposes
    • H04W4/40Services specially adapted for particular environments, situations or purposes for vehicles, e.g. vehicle-to-pedestrians [V2P]
    • H04W4/44Services specially adapted for particular environments, situations or purposes for vehicles, e.g. vehicle-to-pedestrians [V2P] for communication between vehicles and infrastructures, e.g. vehicle-to-cloud [V2C] or vehicle-to-home [V2H]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W24/00Supervisory, monitoring or testing arrangements
    • H04W24/02Arrangements for optimising operational condition
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W52/00Power management, e.g. TPC [Transmission Power Control], power saving or power classes
    • H04W52/04TPC
    • H04W52/06TPC algorithms
    • H04W52/14Separate analysis of uplink or downlink
    • H04W52/146Uplink power control
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W52/00Power management, e.g. TPC [Transmission Power Control], power saving or power classes
    • H04W52/04TPC
    • H04W52/18TPC being performed according to specific parameters
    • H04W52/24TPC being performed according to specific parameters using SIR [Signal to Interference Ratio] or other wireless path parameters
    • H04W52/241TPC being performed according to specific parameters using SIR [Signal to Interference Ratio] or other wireless path parameters taking into account channel quality metrics, e.g. SIR, SNR, CIR, Eb/lo
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W52/00Power management, e.g. TPC [Transmission Power Control], power saving or power classes
    • H04W52/04TPC
    • H04W52/18TPC being performed according to specific parameters
    • H04W52/24TPC being performed according to specific parameters using SIR [Signal to Interference Ratio] or other wireless path parameters
    • H04W52/242TPC being performed according to specific parameters using SIR [Signal to Interference Ratio] or other wireless path parameters taking into account path loss
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W52/00Power management, e.g. TPC [Transmission Power Control], power saving or power classes
    • H04W52/04TPC
    • H04W52/18TPC being performed according to specific parameters
    • H04W52/26TPC being performed according to specific parameters using transmission rate or quality of service QoS [Quality of Service]
    • H04W52/265TPC being performed according to specific parameters using transmission rate or quality of service QoS [Quality of Service] taking into account the quality of service QoS
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W52/00Power management, e.g. TPC [Transmission Power Control], power saving or power classes
    • H04W52/04TPC
    • H04W52/18TPC being performed according to specific parameters
    • H04W52/28TPC being performed according to specific parameters using user profile, e.g. mobile speed, priority or network state, e.g. standby, idle or non transmission
    • H04W52/282TPC being performed according to specific parameters using user profile, e.g. mobile speed, priority or network state, e.g. standby, idle or non transmission taking into account the speed of the mobile
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04WWIRELESS COMMUNICATION NETWORKS
    • H04W52/00Power management, e.g. TPC [Transmission Power Control], power saving or power classes
    • H04W52/04TPC
    • H04W52/18TPC being performed according to specific parameters
    • H04W52/28TPC being performed according to specific parameters using user profile, e.g. mobile speed, priority or network state, e.g. standby, idle or non transmission
    • H04W52/283Power depending on the position of the mobile
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention provides an unmanned aerial vehicle track and power joint optimization method based on deep reinforcement learning, which comprises the following steps: establishing a drone system model and describing the drone trajectory control and power allocation problems; establishing a Markov model, which includes determining a Markov decision process by setting the state, action space and reward function; and adopting a deep deterministic policy gradient method to realize the joint optimization of trajectory control and power allocation. By applying the invention, a drone can move precisely to the vicinity of its target user equipment to provide wireless service, which reduces the co-channel interference to user equipment that is not being served, while the transmit power of the drone is controlled to achieve a balance between spectrum efficiency and interference avoidance.

Description

Unmanned aerial vehicle track and power joint optimization method based on deep reinforcement learning
Technical Field
The invention belongs to the technical field of wireless communication, and particularly relates to an unmanned aerial vehicle track setting and power distribution joint optimization method based on deep reinforcement learning.
Background
Recently, drones have been regarded as an effective technology for future wireless networks. Owing to their rapid deployment, flexible configuration, wide coverage and low cost, drones can be used as relays for cooperative communication with ground user equipment. Furthermore, drones can also be designed as aerial base stations for wireless communication, since they can intelligently change their location to provide on-demand wireless services to ground user equipment. As a result, drone-assisted cellular networks have been applied in various fields such as remote sensing, traffic monitoring, public safety and military use.
However, drone-assisted cellular networks still face several technical challenges, including trajectory control, resource allocation and interference management. By properly designing the trajectory of a drone, the drone can move into the proximity of its target user equipment to provide wireless service, which mitigates co-channel interference to user equipment that is not being served. Furthermore, the transmit power of the drone should also be controlled to achieve a balance between spectrum efficiency and interference avoidance. Therefore, the invention considers the trajectory control and power allocation problems jointly.
Disclosure of Invention
In order to overcome the non-convexity of the trajectory control and power allocation problems, the invention aims to provide an optimal technical scheme of joint trajectory control and power allocation based on deep reinforcement learning.
To achieve this aim, the technical scheme adopted by the invention is an unmanned aerial vehicle trajectory and power joint optimization method based on deep reinforcement learning: a drone system model is established, and the drone trajectory control and power allocation problems are described; a Markov model is established, which includes determining a Markov decision process by setting the state, action space and reward function; and the joint optimization of trajectory control and power allocation is realized by a deep deterministic policy gradient method in the following way.
The deep deterministic policy gradient method combines an actor network and a critic network, and corresponding target networks are set; the core ground base station first initializes an experience replay memory D and the weights of the actor-critic networks and the corresponding target networks.
The training process consists of EP training episodes, each with T time slots. In each episode, the network state is initialized first, and in each time slot of each episode the action is produced by the actor network with added random noise. After the core ground base station sends the selected action to all drones, each drone sets its own trajectory and transmission power accordingly. When a drone flies out of the network area, it chooses a random direction angle; if the altitude h_i(t) of a drone leaves the range [H_min, H_max], the drone stays at the altitude H_min or H_max, where H_min and H_max respectively denote the minimum and maximum altitude of the drone. Once the drones have learned the optimal trajectories and powers and provide wireless service to the user equipment within coverage, the training process ends.
Furthermore, each user equipment measures the received power from all drones through the pilot signal and is associated with the drone providing the maximum received signal power; after user association, the user equipment reports its current state to the associated drone.
Finally, with the help of a backhaul link, the core ground base station obtains the next state of the global network and the instant reward, and the corresponding information, comprising the state S(t), next state S'(t), action A(t) and reward R(t), is stored in the experience replay memory D; mini-batch transition samples are randomly drawn from the experience replay memory D to update the actor network and the critic network, while the weights of the target networks are correspondingly updated slowly.
The training process is repeated until all drones cover all hotspots without overlap and the quality-of-service requirements of all user equipments are met.
Moreover, the drone system model is established as follows.
In the drone-assisted cellular network, N drones are deployed as aerial base stations to provide wireless service to M user equipments in N non-overlapping hotspots; the sets of user equipments and drones are denoted as {1, 2, ..., M} and {1, 2, ..., N}, respectively. The number of user equipments in hotspot i is denoted M(i); assuming that the ith drone serves the ith hotspot using the same frequency band and that each user equipment belongs to exactly one hotspot, it follows that
Σ_{i=1}^{N} M(i) = M.
Meanwhile, all drones are controlled by one core ground base station, and at time t the user equipments in the same hotspot are served simultaneously by the same drone. The plane coordinates of the mth user equipment are denoted [x_m, y_m]^T, where x_m and y_m are the X and Y coordinates of the mth user equipment in the real-valued domain.
At time t, the horizontal coordinate of the ith drone is expressed as v_i(t) = [x_i(t), y_i(t)]^T, where x_i(t) and y_i(t) are the X and Y coordinates of the ith drone; the horizontal distance between the mth user equipment and the ith drone is obtained as
d_{i,m}(t) = ||v_i(t) - [x_m, y_m]^T||.
The altitude of the ith drone is defined as h_i(t) ∈ [H_min, H_max], where H_min and H_max denote the minimum and maximum altitude of the drone, respectively; the distance between the ith drone and the mth user equipment is then
D_{i,m}(t) = √(d_{i,m}(t)² + h_i(t)²).
Based on the finite flying speed of the unmanned aerial vehicle, the track of the unmanned aerial vehicle takes the maximum driving distance as the standard:
||v_i(t+1) - v_i(t)|| ≤ V_L T_s,   (1)
||h_i(t+1) - h_i(t)|| ≤ V_A T_s,   (2)
where V_L and V_A respectively denote the horizontal and vertical flight speeds of the drone in each time slot T_s.
Furthermore, to avoid collision of any two drones, the collision constraint of the drones is considered; for the ith and jth drone,
||v_i(t) - v_j(t)|| ≥ D_min, for all i ≠ j,   (3)
where D_min denotes the shortest allowed distance between any two drones.
The time slot T_s is set small enough that the channel can be approximated as constant; considering collision avoidance between any two drones, T_s should also satisfy the constraint T_s ≤ T_max, where T_max is the threshold corresponding to D_min. The maximum horizontal and vertical distances a drone can travel in each time slot are then V_L T_s and V_A T_s, respectively.
Let the radio signal sent from the drone be composed of line-of-sight (LoS) and non-line-of-sight (NLoS) transmission; the probability of a line-of-sight connection between the mth user equipment and the ith drone is expressed as
P_LoS(t) = 1 / (1 + a exp(-b(θ_{i,m}(t) - a))),   (4)
where a and b are environment-related parameters and θ_{i,m}(t) is the elevation angle between the mth user equipment and the ith drone; furthermore, the probability of non-line-of-sight transmission is P_NLoS(t) = 1 - P_LoS(t).
At time t, the path loss of line-of-sight and non-line-of-sight transmission can be represented by the following model:
PL_LoS(t) = 20 log10(4π f_c D_{i,m}(t) / c) + η_LoS,   (5)
PL_NLoS(t) = 20 log10(4π f_c D_{i,m}(t) / c) + η_NLoS,   (6)
where f_c is the carrier frequency, c is the speed of light, and η_LoS and η_NLoS are the average excess losses of line-of-sight and non-line-of-sight transmission, respectively.
The expected average path loss is expressed as
PL_{i,m}(t) = P_LoS(t) PL_LoS(t) + P_NLoS(t) PL_NLoS(t).
The total available bandwidth B is equally allocated to the user equipments, so the bandwidth of the mth user equipment in the ith hotspot is B_{i,m} = B/M(i); the transmit power of the drone is likewise evenly distributed to each user equipment, p_{i,m}(t) = p_i(t)/M(i), where p_i(t) ∈ [0, P_max] denotes the transmit power of the ith drone with maximum transmit power P_max.
The signal-to-noise ratio received by the mth user equipment from the ith drone is expressed as
Γ_{i,m}(t) = p_{i,m}(t) g_{i,m}(t) / (B_{i,m} N_0),   (7)
where g_{i,m}(t) is the channel gain between the ith drone and the mth user equipment, and N_0 is the noise power spectral density.
The achievable rate of the mth user equipment from the ith drone is r_{i,m}(t) = B_{i,m} log2(1 + Γ_{i,m}(t)), and the total rate of the ith drone is obtained as
r_i(t) = Σ_{m=1}^{M(i)} r_{i,m}(t).   (8)
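As a concrete illustration of the link model in equations (4)-(8), the sketch below computes the expected path loss, channel gain, signal-to-noise ratio and per-user rate for one drone-user pair in Python. The numerical defaults (environment parameters a and b, carrier frequency, excess losses, noise density) and the mapping from path loss to channel gain, g = 10^(-PL/10), are illustrative assumptions rather than values fixed by the invention.

```python
import math

def link_rate(d_hor, h, p_uav, n_ue, a=9.61, b=0.16, fc=2e9,
              eta_los=1.0, eta_nlos=20.0, bandwidth=1e6, n0=4e-21, c=3e8):
    """Expected per-user rate for one drone-user pair (sketch of eqs. (4)-(8))."""
    dist = math.hypot(d_hor, h)                                 # 3-D distance D_{i,m}(t)
    theta = math.degrees(math.atan2(h, d_hor))                  # elevation angle in degrees
    p_los = 1.0 / (1.0 + a * math.exp(-b * (theta - a)))        # LoS probability, eq. (4)
    fspl = 20.0 * math.log10(4.0 * math.pi * fc * dist / c)     # free-space term
    pl_avg = p_los * (fspl + eta_los) + (1.0 - p_los) * (fspl + eta_nlos)   # eqs. (5)-(6)
    gain = 10.0 ** (-pl_avg / 10.0)                             # assumed path-loss-to-gain mapping
    b_ue, p_ue = bandwidth / n_ue, p_uav / n_ue                 # equal bandwidth and power split
    snr = p_ue * gain / (b_ue * n0)                             # eq. (7)
    return b_ue * math.log2(1.0 + snr)                          # r_{i,m}(t) = B_{i,m} log2(1+SNR)
```

Summing this per-user rate over the M(i) user equipments of hotspot i gives the total rate r_i(t) of equation (8).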
furthermore, the description of the problem of trajectory control and power distribution of drones, carried out as follows,
ensuring per-user setup in drone assisted cellular networksIs prepared to meet the minimum service quality requirement omega m Signal-to-noise ratio Γ for mth user equipment i,m (t) should not be less than Ω m
Γ i,m (t)≥Ω m . (9)
Utility w of ith unmanned aerial vehicle i (t) is defined as the cost of transmission minus the profit that can be achieved:
Figure BDA0002717342120000045
where ρ is i To make a profit, λ p The unit price of the unmanned aerial vehicle transmitting power;
at the same time, the trajectory (v) of the unmanned aerial vehicle is optimized by obtaining the joint i (t) and h i (t)) and transmission power p i (t) to maximize the overall network utility, the optimization problem is:
Figure BDA0002717342120000051
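A minimal sketch of the utility of equation (10) and the objective of problem (11) is given below; the linear form ρ_i·rate − λ_p·power follows the verbal definition above (unit profit ρ_i, power price λ_p) and should be read as an illustrative assumption, not the exact patented expression.

```python
def drone_utility(rate_total, p_tx, rho_i=1.0, lambda_p=0.5):
    """Utility of one drone: achievable profit minus transmission cost (sketch of eq. (10))."""
    return rho_i * rate_total - lambda_p * p_tx

def network_utility(rates, powers, rho=None, lambda_p=0.5):
    """Objective of problem (11): utility summed over all N drones."""
    rho = rho if rho is not None else [1.0] * len(rates)
    return sum(drone_utility(r, p, rho_i, lambda_p)
               for r, p, rho_i in zip(rates, powers, rho))
```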
furthermore, the establishment of the Markov model is realized as follows,
the Markov model is set to be composed of five elements (S, A, R, P) ss′ γ) composition, wherein S is the state space, A is the action space, R is reward, P ss′ For transition probability, gamma belongs to [0, 1) as attenuation factor;
defining the state S (t) as whether all the user equipments meet their QoS requirement, and recording as S (t) = { S = 1 (t),s 2 (t),...,s M (t) }, in which, s m (t) is ∈ {0,1}; if the mth user equipment meets its minimum quality of service requirement, Γ i,m (t)≥Ω m ,s m (t) =1, otherwise s m (t)=0;
Considering the determination of the motion trajectory and the transmit power of the drone, the motion space is defined as a (t) = { P (t), L (t), phi (t), H (t) }, where P (t) = { P (P) } 1 (t),p 2 (t),...,p N (t) is with p i (t)∈{0,P max Nobody of the} is not presentMachine transmission power, L (t) = { L = 1 (t),l 2 (t),...,l N (t) } is the horizontal distance of the drone; setting taking into account horizontal trajectory constraints
Figure BDA0002717342120000055
φ(t)={φ 1 (t),φ 2 (t),...,φ N (t)},φ i (t) is the angle of the horizontal direction of the unmanned plane corresponding to the {0,2 pi }; defining vertical movement distance in consideration of the constraint of the altitude of the drone
Figure BDA0002717342120000052
H(t)={Δh 1 (t),Δh 2 (t),...,Δh N (t) is the offset of the unmanned aerial vehicle in the vertical direction;
in order to ensure that all drones provide downlink wireless service, the coverage of the user equipment should be considered in a reward function, and a penalty is introduced into a utility function (3) of the collision constraint on the basis of an optimization problem (11), wherein the reward function is as follows:
R(t) = (overall network utility) - ζ_1 · (number of user equipments not covered by any drone) - ζ_2 · (number of drone collision events),   (12)
where M'(i) is the number of user equipments covered by the ith drone, ζ_1 is the penalty factor related to the degree of coverage, and ζ_2 is the penalty for a drone collision.
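A minimal sketch of a reward with the structure described for equation (12) follows; the specific weights and the way the utility, coverage and collision terms are combined are illustrative assumptions, since only the structure of the reward is stated above.

```python
import math

def reward(utilities, covered, users_per_hotspot, positions, d_min,
           zeta1=1.0, zeta2=10.0):
    """Reward sketch: network utility minus coverage and collision penalties (eq. (12))."""
    uncovered = sum(m - c for m, c in zip(users_per_hotspot, covered))
    collisions = sum(
        1
        for i in range(len(positions))
        for j in range(i + 1, len(positions))
        if math.dist(positions[i], positions[j]) < d_min   # violates constraint (3)
    )
    return sum(utilities) - zeta1 * uncovered - zeta2 * collisions
```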
Moreover, the joint optimization of trajectory control and power allocation with the deep deterministic policy gradient method is realized as follows.
In the deep deterministic policy gradient method, over a finite period T, the optimal policy is learned so as to obtain the maximum expected discounted reward
Φ(t) = Σ_{t'=t}^{T} γ^(t'-t) R(t'),
where t is the current time, t' a later time, and T the period; here, the state-action value function Q(S(t), A(t)) = E[Φ(t) | S(t), A(t)] is defined as the expected reward of taking action A(t) in state S(t), with E[·] the expectation operator.
In addition, based on the actor-critic framework, the actor network and the critic network are implemented with deep neural networks; the critic network is denoted Q(S(t), A(t) | θ^Q) with weights θ^Q, and the actor network is denoted μ(o(t) | θ^μ) with weights θ^μ, where o(t) is an observation of the network environment.
Meanwhile, to improve the stability of learning, target networks are set in the DDPG; the target networks are copies of the actor network and the critic network, and the target network weights are updated as
θ^Q' ← τ θ^Q + (1 - τ) θ^Q',   θ^μ' ← τ θ^μ + (1 - τ) θ^μ',   (13)
where τ is the soft update rate of the target network weights, and θ^Q' and θ^μ' are respectively the weights of the corresponding target networks.
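The soft update of equation (13) is only a few lines of code; the sketch below assumes network weights stored as a dictionary of numeric arrays, which is an implementation convenience rather than part of the method.

```python
def soft_update(target_weights, online_weights, tau=0.001):
    """Soft target-network update of eq. (13): theta' <- tau*theta + (1 - tau)*theta'."""
    return {name: tau * online_weights[name] + (1.0 - tau) * w
            for name, w in target_weights.items()}
```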
adopting an empirical replay strategy for model-free characteristics, and storing transition samples in an empirical replay memory D, wherein the transition samples comprise a state S (t), a next state S' (t), an action A (t) and a reward R (t); during the learning process, the mini-batch sample is randomly sampled from the empirical playback memory D, including the state s i S 'in the next state' i And action a i And a prize r i To update the operator network and the critic network;
updating weights of the operator network using a policy gradient method, comprising calculating the gradient as follows:
∇_{θ^μ} J ≈ (1/M_b) Σ_i ∇_a Q(s_i, a | θ^Q)|_{a = μ(o_i | θ^μ)} ∇_{θ^μ} μ(o_i | θ^μ),   (14)
where M_b is the size of the mini-batch.
In addition, the critic network is updated by minimizing the loss function L(θ^Q), written as:
L(θ^Q) = (1/M_b) Σ_i (y_i - Q(s_i, a_i | θ^Q))²,   (15)
where y_i = r_i + γ Q'(s_{i+1}, a_{i+1} | θ^Q') is the target value generated by the target network of the critic network.
Using (14) and (15), the weights of the actor network and the critic network are updated via
θ^μ ← θ^μ + δ_μ ∇_{θ^μ} J   and   θ^Q ← θ^Q - δ_Q ∇_{θ^Q} L(θ^Q),
where δ_μ and δ_Q are the corresponding learning rates.
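In practice the gradient computations of equations (14)-(15) are delegated to automatic differentiation. The PyTorch-style sketch below shows one actor-critic update step in that spirit; the network classes, the optimizers and the use of torch itself are assumptions made for illustration and are not prescribed by the invention.

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=0.99):
    """One actor-critic update step in the spirit of eqs. (14)-(15)."""
    obs, actions, rewards, next_obs = batch        # tensors drawn from the replay memory D

    # Critic: minimize (y_i - Q(s_i, a_i))^2 with y_i built from the target networks, eq. (15).
    with torch.no_grad():
        y = rewards + gamma * target_critic(next_obs, target_actor(next_obs))
    critic_loss = F.mse_loss(critic(obs, actions), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: ascend the policy gradient of eq. (14), i.e. maximize Q(s, mu(o)).
    actor_loss = -critic(obs, actor(obs)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```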
Compared with the prior art, the method introduces deep reinforcement learning into the drone network and provides a joint optimization scheme for trajectory control and power allocation. In large application scenarios, the number of user equipments may be very large, and this intelligent, automatic optimization scheme can cope with such complex situations and provide efficient and reasonable drone network support.
Detailed Description
The present invention will be described in further detail below with reference to examples, in order to facilitate understanding and practice of the invention by those of ordinary skill in the art; it is to be understood that the embodiments described here are illustrative only and are not to be construed as limiting the invention.
The embodiment of the invention provides an unmanned aerial vehicle trajectory and power joint optimization method based on deep reinforcement learning. First, a drone system model is established and the drone trajectory control and power allocation problems are described. Secondly, addressing the non-convexity of the trajectory control and power allocation problems, this patent provides a reinforcement-learning-based method in which a Markov decision process is determined by setting the state, action space and reward function. On this basis, since the Markov model has a continuous action space, a deep reinforcement learning method is studied and a deep deterministic policy gradient method is provided to realize the joint optimization of trajectory control and power allocation. The specific implementation of the method provided by the embodiment comprises the following steps:
step 1, establishing an unmanned aerial vehicle system model:
in a typical drone-assisted cellular network, N drones are deployed as air base stations to provide wireless service to M user equipments in N non-overlapping hotspots, the sets of user equipments and drones being denoted as user equipment and drone, respectively
The user equipments are indexed by m ∈ {1, 2, ..., M} and the drones by i ∈ {1, 2, ..., N}; the number of user equipments in hotspot i is denoted as M(i). For simplicity of discussion, the embodiment assumes that the ith drone provides service to the ith hotspot using the same frequency band; given that each user equipment belongs to exactly one hotspot, it follows that
Σ_{i=1}^{N} M(i) = M.
Meanwhile, all drones are controlled by one core ground base station, and at time t the user equipments in the same hotspot are served simultaneously by the same drone. The plane coordinates of the mth user equipment are [x_m, y_m]^T, where x_m and y_m are the X and Y coordinates of the mth user equipment in the real-valued domain.
Thus, at time t, the horizontal coordinate of the ith drone is represented as v_i(t) = [x_i(t), y_i(t)]^T, where x_i(t) and y_i(t) respectively represent the X and Y coordinates of the ith drone. Next, the embodiment obtains the horizontal distance between the mth user equipment and the ith drone as
d_{i,m}(t) = ||v_i(t) - [x_m, y_m]^T||.
In addition, the embodiment defines the altitude of the ith drone as h_i(t) ∈ [H_min, H_max], where H_min and H_max represent the minimum and maximum altitude of the drone. The distance between the ith drone and the mth user equipment is
D_{i,m}(t) = √(d_{i,m}(t)² + h_i(t)²).
Here, considering that the flight speed of the drone is limited, the trajectory of the drone should respect the maximum travel distance per slot:
||v_i(t+1) - v_i(t)|| ≤ V_L T_s,   (1)
||h_i(t+1) - h_i(t)|| ≤ V_A T_s,   (2)
where V_L and V_A respectively represent the horizontal and vertical flight speeds of the drone in each time slot T_s.
Furthermore, in order to avoid collision of any two drones, the collision constraint of the drones should also be considered, i.e. for the ith and jth drone:
||v_i(t) - v_j(t)|| ≥ D_min, for all i ≠ j,   (3)
where D_min represents the shortest allowed distance between any two drones.
Notably, the time slot T_s should be small enough that the channel can be approximated as constant. In addition, considering collision avoidance between any two drones, T_s should satisfy the constraint T_s ≤ T_max, where T_max is the threshold corresponding to D_min. Thus, the embodiment obtains the maximum horizontal distance of a drone per time slot as V_L T_s and the maximum vertical distance as V_A T_s.
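As an illustration of how constraints (1)-(3) and the altitude limits can be enforced when a drone executes one action, a minimal Python sketch is given below; the clipping-based handling of the service-area boundary and the data layout are assumptions for illustration, following the training procedure described in step 4.

```python
import math
import random

def move_drone(x, y, h, length, phi, dh, area, h_min, h_max, vl_ts, va_ts):
    """Advance one drone by one time slot while respecting the per-slot limits (1)-(2)."""
    length = min(max(length, 0.0), vl_ts)             # horizontal move bounded by V_L * T_s
    dh = min(max(dh, -va_ts), va_ts)                   # vertical move bounded by V_A * T_s
    x, y = x + length * math.cos(phi), y + length * math.sin(phi)
    if not (0.0 <= x <= area[0] and 0.0 <= y <= area[1]):
        phi = random.uniform(0.0, 2.0 * math.pi)       # left the area: pick a random heading
        x, y = min(max(x, 0.0), area[0]), min(max(y, 0.0), area[1])
    h = min(max(h + dh, h_min), h_max)                 # altitude clipped to [H_min, H_max]
    return x, y, h, phi

def violates_separation(p_i, p_j, d_min):
    """Check the pairwise collision constraint (3) for two drone positions."""
    return math.dist(p_i, p_j) < d_min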
Here, the radio signal emitted from the drone consists of line-of-sight and non-line-of-sight components, and the probability of a line-of-sight connection between the mth user equipment and the ith drone is expressed as:
P_LoS(t) = 1 / (1 + a exp(-b(θ_{i,m}(t) - a))),   (4)
where a and b are environment-related parameters and θ_{i,m}(t) is the elevation angle between the mth user equipment and the ith drone. Furthermore, the probability of non-line-of-sight transmission is P_NLoS(t) = 1 - P_LoS(t).
Thus, at time t, the path loss of line-of-sight and non-line-of-sight transmission can be represented by the following model:
PL_LoS(t) = 20 log10(4π f_c D_{i,m}(t) / c) + η_LoS,   (5)
PL_NLoS(t) = 20 log10(4π f_c D_{i,m}(t) / c) + η_NLoS,   (6)
where f_c is the carrier frequency, c is the speed of light, and η_LoS and η_NLoS are the average excess losses of line-of-sight and non-line-of-sight transmission, respectively.
Thus, the expected average path loss can be expressed as
PL_{i,m}(t) = P_LoS(t) PL_LoS(t) + P_NLoS(t) PL_NLoS(t).
The total available bandwidth B is equally distributed to the user equipments, so the bandwidth of the mth user equipment in the ith hotspot is denoted as B_{i,m} = B/M(i), and the transmit power of the drone is also evenly distributed to each user equipment, p_{i,m}(t) = p_i(t)/M(i), where p_i(t) ∈ [0, P_max] denotes the transmit power of the ith drone with maximum transmit power P_max.
Furthermore, the signal-to-noise ratio received by the mth user equipment from the ith drone is expressed as:
Γ_{i,m}(t) = p_{i,m}(t) g_{i,m}(t) / (B_{i,m} N_0),   (7)
where g_{i,m}(t) is the channel gain between the ith drone and the mth user equipment, and N_0 is the noise power spectral density.
Thus, the achievable rate of the mth user equipment from the ith drone is r_{i,m}(t) = B_{i,m} log2(1 + Γ_{i,m}(t)), and the embodiment obtains the total rate of the ith drone:
r_i(t) = Σ_{m=1}^{M(i)} r_{i,m}(t).   (8)
step 2, unmanned aerial vehicle trajectory control and power distribution problem description:
in the unmanned aerial vehicle assisted cellular network, each user equipment is ensured to meet the minimum service quality requirement omega m Signal-to-noise ratio Γ for mth user equipment i,m (t) should not be less than Ω m I.e. by
Γ_{i,m}(t) ≥ Ω_m.   (9)
Then, the utility w_i(t) of the ith drone is defined as the achievable profit minus the transmission cost, which can be given as
w_i(t) = ρ_i r_i(t) - λ_p p_i(t),   (10)
where ρ_i is the unit profit and λ_p is the unit price of the drone transmit power.
Meanwhile, the optimization problem is to obtain the jointly optimal drone trajectories (v_i(t) and h_i(t)) and transmit powers p_i(t) so as to maximize the overall network utility:
max over {v_i(t), h_i(t), p_i(t)} of Σ_{i=1}^{N} w_i(t), subject to constraints (1)-(3) and (9).   (11)
since the problem is non-convex and combinatorial, especially in large networks, the solved optimization problem may be difficult and the lack of knowledge of the user equipment information and channel status makes the traditional optimization strategy difficult to implement. The embodiment will propose a reinforcement learning solution to find the optimal joint trajectory control and power allocation strategy.
Step 3, establishing a Markov model:
the Markov decision process is formulated by designing a state, an action space and a reward function. The Markov model consists of five elements (S, A, R, P) ss′ γ) composition, wherein S is the state space, A is the action space, R is reward, P ss′ For transition probability, γ ∈ [0, 1) is the decay factor. Defining the state S (t) as whether all the user equipments meet their QoS requirement, and is S (t) = { S = 1 (t),s 2 (t),...,s M (t) }, in which, s m (t) is equal to {0,1}. If the mth user equipment meets its minimum quality of service requirement, Γ i,m (t)≥Ω m ,s m (t) =1, otherwise s m (t) =0. Note that the state space is 2 M This can be very large for a large number of M.
Further, considering the determination of the drone trajectories and transmit powers, the action space is defined as A(t) = {P(t), L(t), φ(t), H(t)}, where P(t) = {p_1(t), p_2(t), ..., p_N(t)}, with p_i(t) ∈ [0, P_max], are the drone transmit powers and L(t) = {l_1(t), l_2(t), ..., l_N(t)} are the horizontal flight distances of the drones. Considering the horizontal trajectory constraint, l_i(t) ∈ [0, V_L T_s]; φ(t) = {φ_1(t), φ_2(t), ..., φ_N(t)}, with φ_i(t) ∈ [0, 2π], are the horizontal direction angles of the drones. Also, considering the constraint on the drone altitude, the embodiment bounds the vertical movement distance as Δh_i(t) ∈ [-V_A T_s, V_A T_s]; then H(t) = {Δh_1(t), Δh_2(t), ..., Δh_N(t)} are the offsets of the drones in the vertical direction. One way of decoding such an action vector into per-drone commands is sketched below.
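The following sketch maps a flat action vector produced by the actor network onto the four components P(t), L(t), φ(t) and H(t); the ordering of the components and the assumed [-1, 1] output range of the actor are illustrative choices, not part of the patented method.

```python
import math

def decode_action(action, n_drones, p_max, vl_ts, va_ts):
    """Split a flat actor output in [-1, 1]^(4N) into P(t), L(t), phi(t) and H(t)."""
    p, l, phi, dh = (action[k * n_drones:(k + 1) * n_drones] for k in range(4))
    return {
        "power":    [p_max * (a + 1.0) / 2.0 for a in p],    # p_i(t) in [0, P_max]
        "distance": [vl_ts * (a + 1.0) / 2.0 for a in l],    # l_i(t) in [0, V_L*T_s]
        "angle":    [math.pi * (a + 1.0) for a in phi],      # phi_i(t) in [0, 2*pi]
        "climb":    [va_ts * a for a in dh],                 # dh_i(t) in [-V_A*T_s, V_A*T_s]
    }
```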
Thus, to ensure that all drones provide downlink wireless service, the coverage of the user equipments should be taken into account in the reward function: if certain user equipments are not covered by any drone, a penalty is applied in the reward function. Therefore, on the basis of the optimization problem (11), penalties for coverage and for violation of the collision constraint (3) are introduced into the utility, and the reward function is:
R(t) = (overall network utility) - ζ_1 · (number of user equipments not covered by any drone) - ζ_2 · (number of drone collision events),   (12)
where M'(i) is the number of user equipments covered by the ith drone, ζ_1 is the penalty factor related to the degree of coverage, and ζ_2 is the penalty for a drone collision. The first part of (12) is the overall network utility, in which s_m(t) = 1 if the mth user equipment meets its quality-of-service requirement and s_m(t) = 0 otherwise. The second part of (12) is a penalty on the degree of user equipment coverage, which becomes zero if the drones cover all user equipments. For the third part of (12), when the distance between two drones is less than the minimum distance D_min, each drone receives the penalty ζ_2.
Step 4, realizing optimal control of joint trajectory control and power allocation based on deep reinforcement learning:
in this patent, due to the Markov continuous action space, an accurate state transition probability P is obtained ss′ It is difficult. While policy-based learning methods may produce continuous learning behavior, the learning variance may be large. Furthermore, value-based learning can result in an optimal strategy with a lower learning variance, but it can only be applied to discrete action spaces. Therefore, the embodiment preferably adopts a DDPG method to realize the unmanned aerial vehicle trajectory and power joint optimization method, and includes combining a Policy-based learning method (actor network) and a value-based learning method (critic network) to obtain a Deep Deterministic Policy Gradient (DDPG) method.
In the DDPG method, during a finite period T, the optimal strategy is learned to obtain the maximum expected decay reward
Figure BDA0002717342120000114
Wherein T is the current time, T' is the next time, and T is the period. Here, the state-action value function Q (S (t), a (t)) = E [ Φ (t) | S (t), a (t)]Is defined as the expected reward on state S (t), A (t) is the action, E [ · C]Is the desired operator.
In addition, based on the actor-critic framework, the actor network and critic network are implemented using a deep neural network. Here, the critic network is represented as Q (S (t), A (t) | θ Q ) With a weight of θ Q The operator network is denoted as μ (o (t) | θ μ ) Weight of theta μ And o (t) is an observation of the network environment.
Meanwhile, in order to improve the stability of learning, a target network strategy is designed in the DDPG. The target network is a copy of the operator network and the critic network, and the target network weight is updated as follows:
Figure BDA0002717342120000121
where τ is the soft update rate of the target network weights, θ Q′ And theta μ′ Respectively, the weights of the corresponding target networks.
Thus, an experience replay strategy is employed for the model-free nature of the method. Transition samples (state S(t), next state S'(t), action A(t) and reward R(t)) are stored in the experience replay memory D. During the learning process, mini-batch samples (states s_i, next states s'_i, actions a_i and rewards r_i) are randomly drawn from the experience replay memory D to update the actor network and the critic network; here, a mini-batch means a small batch of data selected at random from the training data, as in the replay memory sketch below.
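A compact experience replay memory of the kind described here might be implemented as follows; the fixed capacity and the deque-based storage are implementation assumptions rather than requirements of the method.

```python
import random
from collections import deque

class ReplayMemory:
    """Experience replay memory D storing transitions (S, A, R, S')."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)     # oldest transitions are discarded first

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Draw a random mini-batch of transitions for the actor-critic update."""
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states = zip(*batch)
        return states, actions, rewards, next_states
```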
Here, the weights of the actor network are updated using the policy gradient method, which computes the gradient as follows:
∇_{θ^μ} J ≈ (1/M_b) Σ_i ∇_a Q(s_i, a | θ^Q)|_{a = μ(o_i | θ^μ)} ∇_{θ^μ} μ(o_i | θ^μ),   (14)
where M_b is the size of the mini-batch.
In addition, the critic network is updated by minimizing the loss function L(θ^Q), written as:
L(θ^Q) = (1/M_b) Σ_i (y_i - Q(s_i, a_i | θ^Q))²,   (15)
where y_i = r_i + γ Q'(s_{i+1}, a_{i+1} | θ^Q') is the target value generated by the target network of the critic network.
Thus, using (14) and (15), the weights of the actor network and the critic network can be updated via
θ^μ ← θ^μ + δ_μ ∇_{θ^μ} J   and   θ^Q ← θ^Q - δ_Q ∇_{θ^Q} L(θ^Q),
where δ_μ and δ_Q are the corresponding learning rates.
The core ground base station first initializes the experience replay memory D and the weights of the actor-critic networks and the corresponding target networks. Let the training process have EP training episodes, each with T time slots. In each episode, the network state S(t) is initialized first, and in each time slot of each episode the action is produced by the actor network μ(o(t) | θ^μ) with added random exploration noise. Then, after the core ground base station sends the selected action A(t) to all drones, each drone sets its own trajectory and transmit power accordingly. When a drone flies out of the network area, it chooses a random direction angle φ_i(t). If the altitude h_i(t) of a drone leaves the range [H_min, H_max], the drone stays at the altitude H_min or H_max. Once the drones have learned the optimal trajectories and powers and provide wireless service to the user equipment within coverage, the training process ends.
Furthermore, with the pilot signal, each user equipment measures the received power from all drones and associates with the drone providing the maximum received signal power. After user association, the user equipments report their current state to the associated drone. Finally, with the help of the backhaul link, the core ground base station obtains the next global network state S'(t) and the immediate reward R(t); the tuple (S(t), A(t), R(t), S'(t)) is saved in the experience replay memory D. Then, mini-batch transition samples are randomly drawn from the experience replay memory D to update the actor network and the critic network, and the weights of the two target networks are slowly updated according to (13). The training process is repeated until all drones cover all hotspots without overlap and the quality-of-service requirements of all user equipments are met. The overall procedure is summarized in the training loop sketch below.
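Putting the pieces together, the training procedure of step 4 can be sketched as the loop below. The environment interface (env.reset, env.step, env.exploration_noise), the actor and critic objects, and the helper names ReplayMemory, ddpg_update and soft_update_from refer to the illustrative sketches above; all of them are assumptions for illustration rather than components fixed by the patent.

```python
def train(env, actor, critic, target_actor, target_critic,
          actor_opt, critic_opt, episodes, slots, batch_size=64, tau=0.001):
    """DDPG training loop sketch for joint trajectory control and power allocation."""
    memory = ReplayMemory()
    for _ in range(episodes):                        # EP training episodes
        state = env.reset()                          # initialize the network state S(t)
        for _ in range(slots):                       # T time slots per episode
            action = actor.act(state) + env.exploration_noise()   # actor output plus noise
            next_state, reward = env.step(action)    # drones move, UEs associate and report
            memory.store(state, action, reward, next_state)
            if len(memory.buffer) >= batch_size:
                ddpg_update(memory.sample(batch_size), actor, critic,
                            target_actor, target_critic, actor_opt, critic_opt)
                target_actor.soft_update_from(actor, tau)     # slow target update, eq. (13)
                target_critic.soft_update_from(critic, tau)
            state = next_state
```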
The above process can be run automatically using computer software, and a device running the method flow is also within the protection scope of the invention.
It is noted and understood that various changes and modifications can be made to the invention herein before described in detail without departing from the spirit and scope of the invention as claimed in the appended claims. Accordingly, the scope of the claimed subject matter is not limited by any of the specific exemplary teachings provided.

Claims (1)

1. An unmanned aerial vehicle track and power joint optimization method based on deep reinforcement learning, characterized in that: a drone system model is established, and the drone trajectory control and power allocation problems are described; a Markov model is established, wherein a Markov decision process is determined by setting a state, an action space and a reward function; the joint optimization of the trajectory control and the power allocation is realized by adopting a deep deterministic policy gradient method in the following way,
the deep deterministic policy gradient method combines an actor network and a critic network, and corresponding target networks are set; the core ground base station first initializes an experience replay memory D and the weights of the actor-critic networks and the corresponding target networks;
the training process comprises EP training episodes, each with T time slots; in each episode the network state is initialized first, and in each time slot of each episode an action is produced by the actor network with added random noise; after the core ground base station sends the selected action to all drones, each drone sets its own trajectory and transmission power accordingly; when a drone flies out of the network area, it chooses a random direction angle; if the altitude h_i(t) of a drone leaves the range [H_min, H_max], the drone stays at the altitude H_min or H_max, where H_min and H_max respectively represent the minimum and maximum altitude of the drone; once the drones learn the optimal trajectories and powers and provide wireless service to the user equipment within coverage, the training process ends;
furthermore, each user equipment measures the received power from all drones through the pilot signal and is associated with the drone providing the maximum received signal power; after user association, the user equipment reports its current state to the associated drone;
finally, with the help of the backhaul link, the core ground base station obtains the next state of the global network and the instant reward, and the corresponding information, comprising the state S(t), next state S'(t), action A(t) and reward R(t), is stored in the experience replay memory D; mini-batch transition samples are randomly drawn from the experience replay memory D to update the actor network and the critic network; the weights of the target networks are correspondingly updated slowly;
repeating the training process until all the unmanned aerial vehicles cover all the hotspots without overlapping and the service quality requirements of all the user equipment are met;
the description of the unmanned aerial vehicle trajectory control and power allocation problem is implemented as follows,
in the drone-assisted cellular network, each user equipment is ensured to meet its minimum quality-of-service requirement Ω_m, i.e. the signal-to-noise ratio Γ_{i,m}(t) of the mth user equipment should not be less than Ω_m:
Γ_{i,m}(t) ≥ Ω_m;   (9)
the utility w_i(t) of the ith drone is defined as the achievable profit minus the transmission cost:
w_i(t) = ρ_i r_i(t) - λ_p p_i(t),   (10)
where ρ_i is the unit profit and λ_p is the unit price of the drone transmit power;
meanwhile, the trajectories and the transmit powers p_i(t) of the drones are jointly optimized so as to maximize the overall network utility; the optimization problem is
max over {v_i(t), h_i(t), p_i(t)} of Σ_{i=1}^{N} w_i(t), subject to constraints (1)-(3) and (9);   (11)
the establishment of the markov model is carried out as follows,
the Markov model is set up from five elements (S, A, R, P) ss′ γ) wherein S is a state space, A is an action space, R is a reward, P ss′ For transition probability, gamma belongs to [0, 1) as attenuation factor;
defining the state S (t) as whether all the user equipments meet their QoS requirement, and recording as S (t) = { S = 1 (t),s 2 (t),...,s M (t) }, in which, s m (t) is ∈ {0,1}; if the mth user equipment meets its minimum quality of service requirement, Γ i,m (t)≥Ω m ,s m (t) =1, otherwise s m (t)=0;
Considering the determination of the trajectory of motion and the transmission power of the drone, the motion space is defined as a (t) = { P (t), L (t), phi (t), H (t) }, where drone transmission power P (t) = { P (P) } 1 (t),p 2 (t),...,p N (t)},p i (t)∈{0,P max },L(t)={l 1 (t),l 2 (t),...,l N (t) } is the horizontal distance of the drone; setting taking into account horizontal trajectory constraints
Figure FDA0003858780460000023
φ(t)={φ 1 (t),φ 2 (t),...,φ N (t)},φ i (t) is an angle of the horizontal direction of the unmanned aerial vehicle, wherein the (t) belongs to {0,2 pi }; defining vertical movement distance in consideration of the constraint of the height of the unmanned aerial vehicle
Figure FDA0003858780460000024
H(t)={Δh 1 (t),Δh 2 (t),...,Δh N (t) is the offset of the unmanned aerial vehicle in the vertical direction;
in order to ensure that all drones provide downlink wireless service, the coverage of the user equipment should be considered in the reward function; on the basis of the optimization problem (11), penalties for coverage and for violation of the collision constraint (3) are introduced into the utility, and the reward function is as follows:
R(t) = (overall network utility) - ζ_1 · (number of user equipments not covered by any drone) - ζ_2 · (number of drone collision events),   (12)
where M'(i) is the number of user equipments covered by the ith drone, ζ_1 is the penalty factor related to the degree of coverage, and ζ_2 is the penalty for a drone collision;
when the joint optimization of the trajectory control and the power allocation is realized by adopting the deep deterministic policy gradient method,
in the deep deterministic policy gradient method, over a finite period T, the optimal policy is learned so as to obtain the maximum expected discounted reward
Φ(t) = Σ_{t'=t}^{T} γ^(t'-t) R(t'),
where t is the current time, t' a later time, and T the period; here, the state-action value function Q(S(t), A(t)) = E[Φ(t) | S(t), A(t)] is defined as the expected reward of taking action A(t) in state S(t), with E[·] the expectation operator;
in addition, based on the actor-critic framework, the actor network and the critic network are realized by using deep neural networks; the critic network is expressed as the state-action value function Q(S(t), A(t) | θ^Q) with weights θ^Q, and the actor network is denoted as μ(o(t) | θ^μ) with weights θ^μ, where o(t) is an observation of the network environment;
meanwhile, in order to improve the stability of learning, target networks are set in the DDPG; the target networks are copies of the actor network and the critic network, and the target network weights are updated as follows:
θ^Q' ← τ θ^Q + (1 - τ) θ^Q',   θ^μ' ← τ θ^μ + (1 - τ) θ^μ',   (13)
where τ is the soft update rate of the target network weights, and θ^Q' and θ^μ' are respectively the weights of the corresponding target networks;
an experience replay strategy is adopted for the model-free characteristic, and transition samples, comprising the state S(t), next state S'(t), action A(t) and reward R(t), are stored in the experience replay memory D; during the learning process, mini-batch samples, comprising states s_i, next states s'_i, actions a_i and rewards r_i, are randomly drawn from the experience replay memory D to update the actor network and the critic network;
the weights of the actor network are updated using a policy gradient method, which comprises calculating the policy gradient as follows:
∇_{θ^μ} J ≈ (1/M_b) Σ_i ∇_{a_i} Q(s_i, a_i | θ^Q)|_{a_i = μ(o_i | θ^μ)} ∇_{θ^μ} μ(o_i | θ^μ),   (14)
where M_b is the size of the mini-batch, ∇_{θ^μ} J represents the policy gradient, ∇_{θ^μ} μ(o_i | θ^μ) represents the gradient value of the actor network μ(o_i | θ^μ), with ∇_{θ^μ} the gradient operator with respect to the weights θ^μ, and ∇_{a_i} Q(s_i, a_i) represents the gradient value of the state-action value function Q(s_i, a_i), with ∇_{a_i} the gradient operator with respect to the action a_i;
in addition, the critic network is updated by minimizing the loss function L(θ^Q), written as:
L(θ^Q) = (1/M_b) Σ_i (y_i - Q(s_i, a_i | θ^Q))²,   (15)
where y_i = r_i + γ Q'(s_{i+1}, a_{i+1} | θ^Q') is the target value generated by the target network of the critic network;
using equations (14) and (15), the weights of the actor network and the critic network are updated via
θ^μ ← θ^μ + δ_μ ∇_{θ^μ} J   and   θ^Q ← θ^Q - δ_Q ∇_{θ^Q} L(θ^Q),
where δ_μ and δ_Q are the corresponding learning rates;
the establishment of the unmanned aerial vehicle system model is realized as follows,
in drone-assisted cellular networks, N drones are deployed as aerial base stations to provide wireless service to M user equipments in N non-overlapping hotspots, the sets of user equipments and drones being denoted respectively as
{1, 2, ..., M} and {1, 2, ..., N}; the number of user equipments in hotspot i is denoted as M(i); assuming that the ith drone provides service to the ith hotspot using the same frequency band and that each user equipment belongs to exactly one hotspot, it is obtained that
Σ_{i=1}^{N} M(i) = M;
meanwhile, all drones are controlled by one core ground base station, and at time t the user equipments in the same hotspot are served simultaneously by the same drone; the plane coordinates of the mth user equipment are denoted [x_m, y_m]^T, where x_m and y_m are the X and Y coordinates of the mth user equipment in the real-valued domain;
at time t, the horizontal coordinate of the ith drone is expressed as v_i(t) = [x_i(t), y_i(t)]^T, where x_i(t) and y_i(t) are respectively the X and Y coordinates of the ith drone; the horizontal distance between the mth user equipment and the ith drone is obtained as
d_{i,m}(t) = ||v_i(t) - [x_m, y_m]^T||;
the altitude of the ith drone is defined as h_i(t) ∈ [H_min, H_max], where H_min and H_max respectively represent the minimum and maximum altitude of the drone; the distance between the ith drone and the mth user equipment is
D_{i,m}(t) = √(d_{i,m}(t)² + h_i(t)²);
Based on the finite flying speed of the unmanned aerial vehicle, the track of the unmanned aerial vehicle takes the maximum driving distance as the standard:
||v_i(t+1) - v_i(t)|| ≤ V_L T_s,   (1)
||h_i(t+1) - h_i(t)|| ≤ V_A T_s,   (2)
where V_L and V_A respectively represent the horizontal and vertical flight speeds of the drone in each time slot T_s;
furthermore, to avoid collision of any two drones, considering the collision constraint of the drones, for the ith and jth drone there is:
||v_i(t) - v_j(t)|| ≥ D_min, for all i ≠ j,   (3)
where D_min represents the shortest allowed distance between any two drones;
the time slot T_s is set small enough that the channel can be approximated as constant; considering collision avoidance between any two drones, T_s should satisfy the constraint T_s ≤ T_max, where T_max is the threshold corresponding to D_min; the maximum horizontal and vertical distances of a drone per time slot are then obtained as V_L T_s and V_A T_s, respectively;
let the radio signal sent from the drone be composed of line-of-sight and non-line-of-sight transmission; the probability of a line-of-sight connection between the mth user equipment and the ith drone is expressed as:
P_LoS(t) = 1 / (1 + a exp(-b(θ_{i,m}(t) - a))),   (4)
where a and b are parameters related to the environment and θ_{i,m}(t) is the elevation angle between the mth user equipment and the ith drone; furthermore, the probability of non-line-of-sight transmission is P_NLoS(t) = 1 - P_LoS(t);
at time t, the path loss of line-of-sight and non-line-of-sight transmission can be represented by the following model:
PL_LoS(t) = 20 log10(4π f_c D_{i,m}(t) / c) + η_LoS,   (5)
PL_NLoS(t) = 20 log10(4π f_c D_{i,m}(t) / c) + η_NLoS,   (6)
where f_c is the carrier frequency, c is the speed of light, and η_LoS and η_NLoS are the average excess losses of line-of-sight and non-line-of-sight transmission, respectively;
the expected average path loss is expressed as
PL_{i,m}(t) = P_LoS(t) PL_LoS(t) + P_NLoS(t) PL_NLoS(t);
The total available bandwidth B is equally distributed to the user equipments, so the bandwidth of the mth user equipment in the ith hotspot is represented as B_{i,m} = B/M(i), and the transmit power of the drone is also evenly distributed to each user equipment, p_{i,m}(t) = p_i(t)/M(i), where p_i(t) ∈ [0, P_max] represents the transmit power of the ith drone with maximum transmit power P_max;
the signal-to-noise ratio received by the mth user equipment from the ith drone is expressed as:
Γ_{i,m}(t) = p_{i,m}(t) g_{i,m}(t) / (B_{i,m} N_0),   (7)
where g_{i,m}(t) is the channel gain between the ith drone and the mth user equipment, and N_0 is the noise power spectral density;
the achievable rate of the mth user equipment from the ith drone is r_{i,m}(t) = B_{i,m} log2(1 + Γ_{i,m}(t)), and the total rate of the ith drone is obtained as:
r_i(t) = Σ_{m=1}^{M(i)} r_{i,m}(t);   (8)
in the drone-assisted cellular network, each user equipment is ensured to meet its minimum quality-of-service requirement Ω_m, i.e. the signal-to-noise ratio Γ_{i,m}(t) of the mth user equipment should not be less than Ω_m, i.e.
Γ_{i,m}(t) ≥ Ω_m;   (9)
the utility w_i(t) of the ith drone is defined as the achievable profit minus the transmission cost.
CN202011079226.7A 2020-03-02 2020-10-10 Unmanned aerial vehicle track and power joint optimization method based on deep reinforcement learning Active CN112118556B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010136467.4A CN111263332A (en) 2020-03-02 2020-03-02 Unmanned aerial vehicle track and power joint optimization method based on deep reinforcement learning
CN2020101364674 2020-03-02

Publications (2)

Publication Number Publication Date
CN112118556A CN112118556A (en) 2020-12-22
CN112118556B true CN112118556B (en) 2022-11-18

Family

ID=70952865

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202010136467.4A Pending CN111263332A (en) 2020-03-02 2020-03-02 Unmanned aerial vehicle track and power joint optimization method based on deep reinforcement learning
CN202011079226.7A Active CN112118556B (en) 2020-03-02 2020-10-10 Unmanned aerial vehicle track and power joint optimization method based on deep reinforcement learning

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202010136467.4A Pending CN111263332A (en) 2020-03-02 2020-03-02 Unmanned aerial vehicle track and power joint optimization method based on deep reinforcement learning

Country Status (1)

Country Link
CN (2) CN111263332A (en)

Families Citing this family (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112118632B (en) * 2020-09-22 2022-07-29 电子科技大学 Adaptive power distribution system, method and medium for micro-cell base station
CN112533237B (en) * 2020-11-16 2022-03-04 北京科技大学 Network capacity optimization method for supporting large-scale equipment communication in industrial internet
CN112991384B (en) * 2021-01-27 2023-04-18 西安电子科技大学 DDPG-based intelligent cognitive management method for emission resources
CN112802061B (en) * 2021-03-22 2021-08-06 浙江师范大学 Robust target tracking method and system based on hierarchical decision network
CN113194488B (en) * 2021-03-31 2023-03-31 西安交通大学 Unmanned aerial vehicle track and intelligent reflecting surface phase shift joint optimization method and system
CN113162679B (en) * 2021-04-01 2023-03-10 南京邮电大学 DDPG algorithm-based IRS (intelligent resilient software) assisted unmanned aerial vehicle communication joint optimization method
CN113253249B (en) * 2021-04-19 2023-04-28 中国电子科技集团公司第二十九研究所 MIMO radar power distribution design method based on deep reinforcement learning
CN113316169B (en) * 2021-05-08 2023-01-31 北京科技大学 UAV auxiliary communication energy efficiency optimization method and device for smart port
CN113258989B (en) * 2021-05-17 2022-06-03 东南大学 Method for obtaining relay track of unmanned aerial vehicle by using reinforcement learning
CN113286314B (en) * 2021-05-25 2022-03-08 重庆邮电大学 Unmanned aerial vehicle base station deployment and user association method based on Q learning algorithm
CN113364495B (en) * 2021-05-25 2022-08-05 西安交通大学 Multi-unmanned aerial vehicle track and intelligent reflecting surface phase shift joint optimization method and system
CN113242556B (en) * 2021-06-04 2022-08-23 重庆邮电大学 Unmanned aerial vehicle resource dynamic deployment method based on differentiated services
CN113507717A (en) * 2021-06-08 2021-10-15 山东师范大学 Unmanned aerial vehicle track optimization method and system based on vehicle track prediction
CN113393495B (en) * 2021-06-21 2022-02-01 暨南大学 High-altitude parabolic track identification method based on reinforcement learning
CN113645589B (en) * 2021-07-09 2024-05-17 北京邮电大学 Unmanned aerial vehicle cluster route calculation method based on inverse fact policy gradient
CN113776531A (en) * 2021-07-21 2021-12-10 电子科技大学长三角研究院(湖州) Multi-unmanned-aerial-vehicle autonomous navigation and task allocation algorithm of wireless self-powered communication network
CN114422056B (en) * 2021-12-03 2023-05-23 北京航空航天大学 Space-to-ground non-orthogonal multiple access uplink transmission method based on intelligent reflecting surface
CN114221727B (en) * 2021-12-16 2024-05-24 浙江建德通用航空研究院 Same-frequency interference characterization method for WLAN (wireless local area network) system of unmanned aerial vehicle and interconnected vehicle
CN114546660A (en) * 2022-03-01 2022-05-27 重庆邮电大学 Multi-unmanned aerial vehicle cooperative edge calculation method
CN114696942B (en) * 2022-03-25 2023-10-10 电子科技大学 Interference method suitable for unmanned aerial vehicle communication link
CN115278849B (en) * 2022-09-29 2022-12-20 香港中文大学(深圳) Transmission opportunity and power control method for dynamic topology of unmanned aerial vehicle
CN116009590B (en) * 2023-02-01 2023-11-17 中山大学 Unmanned aerial vehicle network distributed track planning method, system, equipment and medium
CN116367291B (en) * 2023-06-01 2023-08-18 四川腾盾科技有限公司 Unmanned aerial vehicle interference avoidance group topology optimization method based on self-adaptive power control
CN117241300B (en) * 2023-11-16 2024-03-08 南京信息工程大学 Unmanned aerial vehicle-assisted general sense calculation network fusion method

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018053187A1 (en) * 2016-09-15 2018-03-22 Google Inc. Deep reinforcement learning for robotic manipulation
WO2019133048A1 (en) * 2017-12-30 2019-07-04 Intel Corporation Methods and devices for wireless communications
CN109803344A (en) * 2018-12-28 2019-05-24 北京邮电大学 Joint UAV network topology and routing construction method
CN109934332A (en) * 2018-12-31 2019-06-25 中国科学院软件研究所 Deep deterministic policy gradient learning method based on a critic and a double-ended experience pool
CN110488861A (en) * 2019-07-30 2019-11-22 北京邮电大学 UAV trajectory optimization method and device based on deep reinforcement learning, and UAV
CN110428115A (en) * 2019-08-13 2019-11-08 南京理工大学 System-benefit maximization method in dynamic environments based on deep reinforcement learning
CN110673620A (en) * 2019-10-22 2020-01-10 西北工业大学 Quadrotor UAV route-following control method based on deep reinforcement learning
CN110809274A (en) * 2019-10-28 2020-02-18 南京邮电大学 Narrowband Internet of things-oriented unmanned aerial vehicle base station enhanced network optimization method

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Fast and Accurate Trajectory Tracking for Unmanned Aerial Vehicles based on Deep Reinforcement Learning;Yilan Li;《2019 IEEE 25th International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA)》;20191014;full text *
Optimal Trajectory Learning for UAV-BS Video Provisioning System: A Deep Reinforcement Learning Approach;Dohyun Kwon;《2019 International Conference on Information Networking (ICOIN)》;20190111;full text *
UAV flight trajectory design for U2D communication scenarios based on deep reinforcement learning;吴凡毅;《Selected Papers of the 2019 National Symposium on Public Safety Communications》;20190815;full text *
Path planning of a mobile platform in a known environment based on deep reinforcement learning;张心觉;《China Master's Theses Full-text Database, Information Science and Technology》;20200215;full text *
Research on UAV position deployment and energy optimization mechanisms for air-ground cooperative networking;郜富晓;《China Master's Theses Full-text Database, Information Science and Technology》;20200115;full text *

Also Published As

Publication number Publication date
CN111263332A (en) 2020-06-09
CN112118556A (en) 2020-12-22

Similar Documents

Publication Publication Date Title
CN112118556B (en) Unmanned aerial vehicle track and power joint optimization method based on deep reinforcement learning
CN111786713B (en) Unmanned aerial vehicle network hovering position optimization method based on multi-agent deep reinforcement learning
US9986440B2 (en) Interference and mobility management in UAV-assisted wireless networks
CN110364031B (en) Path planning and wireless communication method for unmanned aerial vehicle cluster in ground sensor network
JP7074063B2 (en) Circuits, base stations, methods and recording media
Zeng et al. UAV-assisted data dissemination scheduling in VANETs
CN113055078B (en) Effective information age determination method and unmanned aerial vehicle flight trajectory optimization method
CN108668257B (en) A kind of distribution unmanned plane postman relaying track optimizing method
JP2020537423A (en) Uplink transmission power management for unmanned aerial vehicles
WO2021044819A1 (en) Communication control device, communication control method, and communication control program
WO2019067277A1 (en) Preamble management for unmanned aerial vehicles
El Hammouti et al. A distributed mechanism for joint 3D placement and user association in UAV-assisted networks
Zhong et al. Deployment optimization of UAV relays for collecting data from sensors: A potential game approach
Arani et al. Learning in the sky: Towards efficient 3D placement of UAVs
Xu et al. Joint trajectory and transmission optimization for energy efficient UAV enabled eLAA network
Park et al. Joint trajectory and resource optimization of MEC-assisted UAVs in sub-THz networks: A resources-based multi-agent proximal policy optimization DRL with attention mechanism
Li et al. Intelligent uav navigation: A DRL-QiER solution
Zhang et al. Machine learning driven UAV-assisted edge computing
Madelkhanova et al. Optimization of cell individual offset for handover of flying base station
Ahn et al. Velocity optimization for uav-mounted transmitter in population-varying fields
CN117063575A (en) Electronic device, method and storage medium for wireless communication system
Lee et al. QoS-aware UAV-BS deployment optimization based on reinforcement learning
CN114665947A (en) Optimization design method for joint power control and position planning of relay communication system supported by unmanned aerial vehicle
Kota et al. UAV assisted MIMO-NOMA for maximizing the sum capacity by satisfying the QoS of the users
Lyu et al. Movement and communication co-design in multi-UAV enabled wireless systems via DRL

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant