CN116227767A - Multi-unmanned aerial vehicle base station collaborative coverage path planning method based on deep reinforcement learning - Google Patents


Info

Publication number
CN116227767A
CN116227767A
Authority
CN
China
Prior art keywords
unmanned aerial
aerial vehicle
experience
coverage
state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310021781.1A
Other languages
Chinese (zh)
Inventor
管昕洁
许昱雯
万夕里
张毅晔
徐波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Tech University
Jiangsu Future Networks Innovation Institute
Original Assignee
Nanjing Tech University
Jiangsu Future Networks Innovation Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Tech University, Jiangsu Future Networks Innovation Institute filed Critical Nanjing Tech University
Priority to CN202310021781.1A
Publication of CN116227767A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06Q - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 - Administration; Management
    • G06Q10/04 - Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G06Q10/047 - Optimisation of routes or paths, e.g. travelling salesman problem
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 - Computer-aided design [CAD]
    • G06F30/20 - Design optimisation, verification or simulation
    • G06F30/27 - Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F2111/00 - Details relating to CAD techniques
    • G06F2111/08 - Probabilistic or stochastic CAD
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 - Reducing energy consumption in communication networks
    • Y02D30/70 - Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Human Resources & Organizations (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Strategic Management (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Economics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Computer Hardware Design (AREA)
  • Development Economics (AREA)
  • Game Theory and Decision Science (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Geometry (AREA)
  • Tourism & Hospitality (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • General Business, Economics & Management (AREA)
  • Quality & Reliability (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a multi-unmanned aerial vehicle base station collaborative coverage path planning method based on deep reinforcement learning, which comprises the following steps: first, a Markov model based on deep reinforcement learning is defined, and the five-tuple of the Markov decision process is modeled; a deep deterministic policy gradient (DDPG) algorithm is then designed according to this model; the experience buffer pool of the DDPG algorithm is then improved by classifying the experience data stored in it and placing the experience data into different experience buffer pools, so that the improved DDPG algorithm overcomes unstable convergence; finally, a simulation environment is designed and the unmanned aerial vehicle group interacts with the environment to acquire training data. With this method, the target task of cooperative coverage of ground nodes by the unmanned aerial vehicle group under multiple constraint conditions is accomplished, and the unmanned aerial vehicle group achieves higher planning efficiency and lower flight cost.

Description

Multi-unmanned aerial vehicle base station collaborative coverage path planning method based on deep reinforcement learning
Technical Field
The invention provides a multi-unmanned aerial vehicle base station collaborative coverage path planning method based on deep reinforcement learning, which belongs to the field of computer artificial intelligence.
Background
Unmanned aerial vehicles have the advantages of high maneuverability, flexible deployment and low cost, and are widely applied in industries such as terrain coverage, agricultural production, environmental reconnaissance, aerial rescue and disaster early warning. An unmanned aerial vehicle can serve as an aerial base station to enhance the coverage and performance of a communication network in various scenarios. When the ground communication network is unexpectedly interrupted, unmanned aerial vehicles can be deployed quickly, establish communication links with the ground to transmit data, and at the same time realize cooperative interaction with the ground network. The coverage path planning algorithm is a key technology supporting the successful application of unmanned aerial vehicles in such complex scenarios.
When planning paths for an unmanned aerial vehicle to cover ground nodes, the energy constraint of the unmanned aerial vehicle must be considered; at the same time, the unmanned aerial vehicle must maintain signal transmission with the ground base station while executing the task, and the loss incurred in signal transmission affects the coverage quality of service. On the other hand, a single unmanned aerial vehicle is difficult to apply to large-scale ground coverage tasks because of energy and communication constraints; cooperative flight of multiple unmanned aerial vehicles is an effective scheme for large-scale coverage tasks, but communication between the unmanned aerial vehicles must be maintained at all times. Therefore, how to efficiently realize cooperative coverage of ground nodes by an unmanned aerial vehicle group under the constraints of limited energy, limited communication distance and signal transmission loss is a highly challenging theoretical and application problem.
Disclosure of Invention
In order to solve the problem of realizing efficient collaborative coverage of the ground under multiple constraint conditions, the invention provides a multi-unmanned aerial vehicle base station collaborative coverage path planning method based on deep reinforcement learning, which specifically comprises the following steps:
Step one, define a Markov model, i.e., model the five-tuple (S, A, P, R, γ) of the Markov decision process;
Step two, based on the five-tuple (S, A, P, R, γ) of the Markov decision process modeled in step one, design a deep deterministic policy gradient (DDPG) algorithm using basic deep reinforcement learning;
Step three, improve the experience buffer pool of the DDPG algorithm by classifying the experience data stored in the experience buffer pool and placing them into different experience buffer pools;
Step four, design a simulation environment, let the unmanned aerial vehicle group interact with the environment to acquire training data, sample the training data for simulation training, and plan the collaborative coverage path of the target ground nodes.
The specific steps of the first step comprise:
Step 1.1, determining the state S of the unmanned aerial vehicle:
The whole target area is divided into I × J cells, and m ground nodes with fixed positions and n unmanned aerial vehicles flying at a fixed height H are randomly distributed in the area. The coordinates of unmanned aerial vehicle i at time t are expressed as p_i^t = (x_i^t, y_i^t), and the position coordinate of the u-th ground node is denoted as q_u = (x_u, y_u). The fixed total energy of one unmanned aerial vehicle is e_max, the energy consumed by an unmanned aerial vehicle moving one unit is e_1, and the energy consumed by hovering over a ground node is e_2; e_1 and e_2 are both constants, and the unmanned aerial vehicle must complete the task before its energy is exhausted. Therefore, the energy e_i^t consumed by unmanned aerial vehicle i in flying from its initial position to its position at time t is
e_i^t = e_1 · (number of units moved up to time t) + e_2 · c_i^t,   e_i^t ≤ e_max
where c_i^t is the number of ground nodes covered by unmanned aerial vehicle i at time t.
The communication radius of each unmanned aerial vehicle is fixed to R_s. Due to the limitation of communication connectivity, unmanned aerial vehicle i must always keep the unmanned aerial vehicle j nearest to itself within its communication radius, which gives the following formula:
min(||p_i − p_j||, i ≠ j) ≤ R_s
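As an illustration of this connectivity constraint, the sketch below checks, for a set of UAV positions, whether every UAV keeps its nearest neighbour within the communication radius R_s; the function name and the array layout are illustrative assumptions, not details fixed by the patent.

```python
import numpy as np

def connectivity_satisfied(positions: np.ndarray, r_s: float) -> bool:
    """Check min_j ||p_i - p_j|| <= R_s for every UAV i.

    positions: (n, 2) array of UAV horizontal coordinates (all fly at height H).
    r_s:       fixed communication radius R_s.
    """
    n = len(positions)
    if n < 2:
        return True  # a single UAV has no connectivity constraint
    # pairwise distances; the diagonal is set to +inf so a UAV is not its own neighbour
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, np.inf)
    # every UAV must have at least one other UAV within R_s
    return bool(np.all(dist.min(axis=1) <= r_s))
```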
The process of an unmanned aerial vehicle transmitting signals to the ground nodes produces channel fading, and the signal loss affects the quality of service when the ground nodes are covered. If obstacles such as buildings and trees exist in the surroundings, extra loss is added on top of the channel fading. The probability of a line-of-sight (LoS) fading link between the unmanned aerial vehicle and the ground is:
P_LoS = 1 / (1 + f·exp(−g·[(180/π)·arctan(H/d_iu) − f]))
where f and g are constants related to the type of environment, H represents the altitude of the unmanned aerial vehicle, and d_iu is the horizontal distance between the i-th unmanned aerial vehicle and the u-th ground node:
d_iu = sqrt((x_i^t − x_u)² + (y_i^t − y_u)²)
The probability of a non-line-of-sight (NLoS) fading link is:
P_NLoS = 1 − P_LoS
The LoS and NLoS link loss models are:
L_LoS = 20·log10(4π·f_c·ω_iu/c) + η_LoS
L_NLoS = 20·log10(4π·f_c·ω_iu/c) + η_NLoS
ω_iu = sqrt(H² + d_iu²)
where c is the propagation speed of light, f_c is the carrier frequency, ω_iu is the distance between the i-th unmanned aerial vehicle and the u-th ground node, and η_LoS and η_NLoS are the extra losses of the LoS and NLoS links, respectively. Under the LoS and NLoS models, the signal loss of the u-th ground node is:
L_u = P_LoS·L_LoS + P_NLoS·L_NLoS,   L_u ≤ κ
To guarantee the quality of service while the unmanned aerial vehicle covers the ground nodes, the signal loss suffered by each ground node during coverage must be less than or equal to a certain threshold κ; only then is the ground node successfully covered, otherwise coverage of that ground node fails.
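For illustration, the sketch below evaluates this quality-of-service check (LoS probability, expected path loss, and comparison against the threshold κ), assuming the widely used sigmoid elevation-angle LoS probability model and free-space path loss plus the extra losses η_LoS and η_NLoS; the function names and the example parameter values are assumptions made for the example, not values fixed by the patent.

```python
import math

def expected_path_loss(h, d_iu, f_c, f, g, eta_los, eta_nlos):
    """Average signal loss L_u (dB) for a ground node at horizontal distance
    d_iu from a UAV flying at altitude h."""
    c = 3.0e8  # propagation speed of light (m/s)
    # LoS probability: sigmoid in the elevation angle (degrees)
    theta = math.degrees(math.atan2(h, d_iu))
    p_los = 1.0 / (1.0 + f * math.exp(-g * (theta - f)))
    p_nlos = 1.0 - p_los
    # free-space path loss at the 3-D distance, plus environment-dependent extra loss
    omega = math.hypot(h, d_iu)
    fspl = 20.0 * math.log10(4.0 * math.pi * f_c * omega / c)
    return p_los * (fspl + eta_los) + p_nlos * (fspl + eta_nlos)

def covered_with_qos(h, d_iu, f_c, f, g, eta_los, eta_nlos, kappa):
    """A node counts as successfully covered only if L_u <= kappa."""
    return expected_path_loss(h, d_iu, f_c, f, g, eta_los, eta_nlos) <= kappa

# example usage with illustrative urban-like parameters
print(covered_with_qos(h=100.0, d_iu=80.0, f_c=2.0e9,
                       f=9.61, g=0.16, eta_los=1.0, eta_nlos=20.0,
                       kappa=110.0))
```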
The state comprises the following parts: at time t, the position and energy consumption of unmanned aerial vehicle i, and the signal loss suffered by each ground node. The state of unmanned aerial vehicle i at time t is:
s_i^t = {p_i^t, e_i^t, L_1, ..., L_u, ..., L_m}
Step 1.2, determining the action set A of the unmanned aerial vehicle:
The flying speed of unmanned aerial vehicle i is fixed during flight, and the next action is either a moving direction a_t ∈ (0, 2π) or the hover action a_t = 0. The hover action means that the unmanned aerial vehicle keeps its current position unchanged after it has covered a ground node. Therefore, the action of unmanned aerial vehicle i is:
a_t ∈ [0, 2π)
Step 1.3, defining the state transition probability function P: when the unmanned aerial vehicle is in state s at time t and takes action a, the probability of reaching the next state s′ is:
P_ss′^a = P[S_{t+1} = s′ | S_t = s, A_t = a]
Step 1.4, determining the reward function R of the unmanned aerial vehicle:
Let the set of ground node coverage states be B = {b_1, b_2, ..., b_u, ..., b_m}, where b_u, the coverage state of the u-th ground node, takes values in the Boolean domain {0, 1}: b_u = 1 means the ground node has been covered by an unmanned aerial vehicle, and b_u = 0 means it has not. The coverage rate is the ratio of the number of covered ground nodes to the total number of ground nodes; at time t the coverage rate is:
α_t = (1/m) · Σ_{u=1}^{m} b_u
The coverage radius of each unmanned aerial vehicle is R_c, and the coverage effect of an unmanned aerial vehicle on a target node decreases from strong to weak from the circle centre outward; the effect is strongest when the unmanned aerial vehicle is directly above the ground node. The degree of effect φ_u with which the u-th ground node is first covered is given by a formula [shown only as an image in the original] in which λ is the coverage effect constant.
Planning the optimal path requires that the ground nodes be transformed from the initial state to the target state: the initial state of a ground node is the uncovered state, and the target state is the state of being covered by an unmanned aerial vehicle. The coverage efficiency E_c is designed as a formula combining the rate of covered ground nodes and the coverage effect [shown only as an image in the original].
A reward function is defined to represent the feedback obtained after the unmanned aerial vehicle selects an action in the current state. The basic reward r_t° is built from the coverage increment and the energy consumption increment [its formula is shown only as an image in the original], where the coverage increment is Δα_t = α_t − α_{t−1} and the energy consumption increment of the i-th unmanned aerial vehicle is Δe_i^t = e_i^t − e_i^{t−1}.
If a positive reward were given only when the unmanned aerial vehicle group successfully completes the task, the reward would be too sparse and good results would be difficult to obtain over many training rounds. Extra rewards and penalties are therefore added so that the reward is no longer sparse. In the extra reward/penalty setting, an increasing negative penalty is applied while the overall coverage has not reached the expected value α_ev, and no penalty is applied once the overall coverage reaches the expected value. The coverage reward for each ground node covered by the unmanned aerial vehicle group for the first time is set to +0.1; if an unmanned aerial vehicle exceeds the energy consumption budget during flight, a penalty is applied at every frame; and if the unmanned aerial vehicles cannot communicate with each other, a penalty of −1 is applied. With the extra reward/penalty amount denoted r_extra, the reward value is calculated as:
r_t = r_t° + r_extra
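To make the reward shaping concrete, the sketch below combines a basic reward built from the coverage and energy increments with the extra rewards and penalties described above; since the patent gives the basic-reward formula only as an image, the weighting of Δα_t against Δe_i^t and the magnitude of the budget penalty are illustrative assumptions.

```python
def step_reward(delta_alpha, delta_e, coverage, expected_coverage,
                newly_covered_nodes, over_energy_budget, comms_broken,
                w_energy=0.1):
    """Reward r_t = r_t_base + r_extra for one UAV at one time step."""
    # basic reward: favour coverage gain, discourage energy use
    # (this particular combination is an assumption; the patent's formula is an image)
    r_base = delta_alpha - w_energy * delta_e

    r_extra = 0.0
    # +0.1 for every ground node covered for the first time at this step
    r_extra += 0.1 * newly_covered_nodes
    # increasing penalty while overall coverage is below the expected value
    if coverage < expected_coverage:
        r_extra -= (expected_coverage - coverage)
    # per-frame penalty if the UAV exceeds its energy budget (illustrative magnitude)
    if over_energy_budget:
        r_extra -= 0.5
    # -1 if the UAVs can no longer communicate with each other
    if comms_broken:
        r_extra -= 1.0
    return r_base + r_extra
```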
Step 1.5, defining the discount factor γ, γ ∈ (0, 1). The cumulative reward value over the whole process is calculated; the reward value is discounted over time, and the larger the discount coefficient, the more emphasis is placed on long-term benefits.
The specific steps of the second step comprise:
Step 2.1, adopt the actor-critic (Actor-Critic) framework: one network is the actor (Actor), the other is the critic (Critic), and the two networks stimulate and compete with each other. Randomly initialize the state-action value function Q(s, a|θ^Q) of the Critic network and the policy function μ(s|θ^μ) of the Actor network, and copy the weights of the Critic network and the Actor network to the target network parameters of the respective networks, i.e. θ^Q → θ^Q′ and θ^μ → θ^μ′, where θ^Q and θ^μ denote the Critic network parameters and the Actor network parameters, and θ^Q′ and θ^μ′ denote the Critic target network parameters and the Actor target network parameters.
Step 2.2, when the task starts, the initial state of unmanned aerial vehicle i is s_i^0. As the task proceeds, action a_t is chosen according to the current state s_t:
a_t = μ(s_t|θ^μ) + β
where β is random noise. Executing action a_t yields the reward r_t and a new state s_{t+1}.
Step 2.3, obtain the experience tuple (s_t, a_t, r_t, s_{t+1}) and store it in the experience pool: the newest experience tuple is stored at the first position of the pool, and the experience tuples already in the pool are shifted back by one position in turn. A batch of samples is then randomly drawn from the experience pool for training; assuming (s_i, a_i, r_i, s_{i+1}) is a randomly sampled batch of data, TD-target training is carried out, and the target value Y_i is expressed as:
Y_i = r_i + γ·Q′(s_{i+1}, μ′(s_{i+1}|θ^μ′)|θ^Q′)
where μ′ denotes the policy obtained by evaluating s_{i+1}, and Q′ denotes the state-action value obtained at s_{i+1} under the μ′ policy.
Step 2.4, update the Critic network by minimizing the loss function L:
L = (1/N)·Σ_i (Y_i − Q(s_i, a_i|θ^Q))²
where N represents the number of random samples drawn from the experience pool for action exploration.
Step 2.5, update the Actor network parameters θ^μ using the policy gradient; the gradient of the objective function, ∇_{θ^μ}J, is:
∇_{θ^μ}J ≈ (1/N)·Σ_i ∇_a Q(s, a|θ^Q)|_{s=s_i, a=μ(s_i)} · ∇_{θ^μ} μ(s|θ^μ)|_{s=s_i}
where ∇_a Q(s, a|θ^Q) denotes the gradient of the Critic network state-action value function, ∇_{θ^μ} μ(s|θ^μ) denotes the gradient of the Actor network policy function, μ(s_i) denotes the action selected by the Actor network for input state s_i, Q(s_i, μ(s_i)|θ^Q) denotes the Critic network state-action value function at state s_i, and μ(s_i|θ^μ) denotes the Actor network policy function at state s_i.
Step 2.6, the target network values are computed with copies of the networks, and the weight parameters of the target networks are updated by slowly tracking the learned networks. The corresponding Critic and Actor target networks are gradually updated with the current network parameters:
θ^Q′ ← τ·θ^Q + (1 − τ)·θ^Q′
θ^μ′ ← τ·θ^μ + (1 − τ)·θ^μ′
where τ denotes the update proportion coefficient, τ ∈ (0, 1).
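The update cycle of steps 2.1 to 2.6 is the standard DDPG loop. The sketch below shows one training step (TD target, critic loss, policy gradient through the critic, and soft target updates) using PyTorch; the network sizes, optimizers and hyper-parameter values are illustrative assumptions rather than settings specified by the patent.

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, in_dim, out_dim, squash=False):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 256), nn.ReLU(),
                                 nn.Linear(256, out_dim))
        self.squash = squash  # actor outputs a bounded action

    def forward(self, x):
        y = self.net(x)
        return torch.tanh(y) if self.squash else y

def ddpg_update(actor, critic, actor_t, critic_t,
                actor_opt, critic_opt, batch, gamma=0.99, tau=0.005):
    s, a, r, s_next = batch  # tensors from the experience pool, r shaped (N, 1)
    # TD target: Y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1}))
    with torch.no_grad():
        y = r + gamma * critic_t(torch.cat([s_next, actor_t(s_next)], dim=-1))
    # critic update: minimise (Y_i - Q(s_i, a_i))^2
    q = critic(torch.cat([s, a], dim=-1))
    critic_loss = nn.functional.mse_loss(q, y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # actor update: follow the gradient of Q(s, mu(s)) with respect to the policy
    actor_loss = -critic(torch.cat([s, actor(s)], dim=-1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    # soft target updates: theta' <- tau*theta + (1 - tau)*theta'
    for net, net_t in ((critic, critic_t), (actor, actor_t)):
        for p, p_t in zip(net.parameters(), net_t.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)
```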
The specific steps of the third step comprise:
step 3.1, dividing the experience pool into M success and Mfailure Respectively storing successful and failed flight experiences, and setting a temporary experience pool M temp The latest flight experience is stored. M is M temp Once full, the earliest experience is taken out and stored in the experience pool M according to the first-in first-out principle success The latest flight experience will continue to be stored in the experience pool M temp . Repeating the steps, and finally according to the final state of the unmanned aerial vehicle, selecting the model M from the experience pool success and Mfailure And respectively extracting a plurality of experiences to train the neural network.
Step 3.2 for the purpose of removing the experience pool M success More multivalent experience is extracted, and the proportional sampling from two experience pools is set:
Figure BDA0004042407820000051
wherein ,ηsuccess 、η failure From experience pool M success and Mfailure The number of samples extracted in (1) [ phi ] is the total number of samples, and [ beta ] ∈ [0,1 ]]Is the success rate, representing the experience pool M success To the experience.
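A minimal sketch of this classified experience buffer is given below: transitions are buffered in M_temp, routed to the success or failure pool when the episode ends, and mini-batches are drawn in the ratio η_success = β·φ, η_failure = (1 − β)·φ. Routing on episode end rather than strictly when M_temp fills, together with the class interface, is an illustrative simplification.

```python
import random
from collections import deque

class ClassifiedReplayBuffer:
    """Experience pools M_success / M_failure with a temporary pool M_temp."""

    def __init__(self, capacity=100_000, temp_capacity=1_000):
        self.m_success = deque(maxlen=capacity)
        self.m_failure = deque(maxlen=capacity)
        self.m_temp = deque(maxlen=temp_capacity)

    def push(self, s, a, r, s_next):
        # the newest experience tuples go into the temporary pool first
        self.m_temp.append((s, a, r, s_next))

    def end_episode(self, success: bool):
        # route the buffered episode according to the UAV group's final state
        target = self.m_success if success else self.m_failure
        target.extend(self.m_temp)
        self.m_temp.clear()

    def sample(self, phi: int, beta: float):
        """Draw phi transitions: beta*phi from M_success, (1 - beta)*phi from M_failure."""
        n_success = min(int(beta * phi), len(self.m_success))
        n_failure = min(phi - n_success, len(self.m_failure))
        batch = random.sample(list(self.m_success), n_success) + \
                random.sample(list(self.m_failure), n_failure)
        random.shuffle(batch)
        return batch
```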
The technical scheme of the invention has the following advantages:
1. The method establishes a multi-unmanned aerial vehicle collaborative coverage scene model in which the unmanned aerial vehicle group interacts with the environment to obtain training data and plans the optimal paths autonomously. The simulation environment used in this process has high practical application value.
2. The method uses a deep deterministic policy gradient (DDPG) algorithm and improves it by classifying the experience data stored in the experience buffer pool; this effectively solves the continuous control problem of the unmanned aerial vehicles, improves the success rate of sample acquisition during the task, and yields a better convergence effect.
3. The method achieves better coverage efficiency, balances the overall energy consumption, and keeps the task flight cost lower and the completion time shorter.
By the method, the target task that the unmanned aerial vehicle group performs cooperative coverage on the ground nodes under the limitation of a plurality of constraint conditions is realized, and the method can enable the unmanned aerial vehicle group to have higher planning efficiency and lower flight cost.
Drawings
FIG. 1 is a flow chart of the overall method of the present invention;
FIG. 2 is a schematic illustration of an application scenario of the present invention;
FIG. 3 is a graph comparing the coverage efficiency of the unmanned aerial vehicle group at different coverage rates under the four algorithms;
FIG. 4 is a graph comparing the balance degree of the energy used by the unmanned aerial vehicle group during flight under the four algorithms.
Detailed Description
The technical scheme of the invention is further described in detail below with reference to the accompanying drawings:
it will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Aiming at the problems of numerous motion constraints, high flight cost and poor motion continuity that arise when multiple unmanned aerial vehicles cooperatively execute a coverage task, the invention provides an improved DDPG algorithm based on deep reinforcement learning. The DDPG algorithm is improved by classifying the experience data stored in the experience buffer pool. Finally, cooperative coverage path planning and dynamic adjustment of the unmanned aerial vehicle group are realized, with higher planning efficiency and lower flight cost.
The improved DDPG algorithm model and its application structure are shown in figure 1.
The method specifically comprises the following steps:
Step one, define a Markov model, i.e., model the five-tuple (S, A, P, R, γ) of the Markov decision process; the specific steps are as follows:
Step 1.1, determining the state S of the unmanned aerial vehicle:
The whole target area is divided into I × J cells, and m ground nodes with fixed positions and n unmanned aerial vehicles flying at a fixed height H are randomly distributed in the area. The coordinates of unmanned aerial vehicle i at time t are expressed as p_i^t = (x_i^t, y_i^t), and the position coordinate of the u-th ground node is denoted as q_u = (x_u, y_u). The fixed total energy of one unmanned aerial vehicle is e_max, the energy consumed by an unmanned aerial vehicle moving one unit is e_1, and the energy consumed by hovering over a ground node is e_2; e_1 and e_2 are both constants, and the unmanned aerial vehicle must complete the task before its energy is exhausted. Therefore, the energy e_i^t consumed by unmanned aerial vehicle i in flying from its initial position to its position at time t is
e_i^t = e_1 · (number of units moved up to time t) + e_2 · c_i^t,   e_i^t ≤ e_max
where c_i^t is the number of ground nodes covered by unmanned aerial vehicle i at time t.
The communication radius of each unmanned aerial vehicle is fixed to R_s. Due to the limitation of communication connectivity, unmanned aerial vehicle i must always keep the unmanned aerial vehicle j nearest to itself within its communication radius, which gives the following formula:
min(||p_i − p_j||, i ≠ j) ≤ R_s
The process of an unmanned aerial vehicle transmitting signals to the ground nodes produces channel fading, and the signal loss affects the quality of service when the ground nodes are covered. If obstacles such as buildings and trees exist in the surroundings, extra loss is added on top of the channel fading. The probability of a line-of-sight (LoS) fading link between the unmanned aerial vehicle and the ground is:
P_LoS = 1 / (1 + f·exp(−g·[(180/π)·arctan(H/d_iu) − f]))
where f and g are constants related to the type of environment, H represents the altitude of the unmanned aerial vehicle, and d_iu is the horizontal distance between the i-th unmanned aerial vehicle and the u-th ground node:
d_iu = sqrt((x_i^t − x_u)² + (y_i^t − y_u)²)
The probability of a non-line-of-sight (NLoS) fading link is:
P_NLoS = 1 − P_LoS
The LoS and NLoS link loss models are:
L_LoS = 20·log10(4π·f_c·ω_iu/c) + η_LoS
L_NLoS = 20·log10(4π·f_c·ω_iu/c) + η_NLoS
ω_iu = sqrt(H² + d_iu²)
where c is the propagation speed of light, f_c is the carrier frequency, ω_iu is the distance between the i-th unmanned aerial vehicle and the u-th ground node, and η_LoS and η_NLoS are the extra losses of the LoS and NLoS links, respectively. Under the LoS and NLoS models, the signal loss of the u-th ground node is:
L_u = P_LoS·L_LoS + P_NLoS·L_NLoS,   L_u ≤ κ
To guarantee the quality of service while the unmanned aerial vehicle covers the ground nodes, the signal loss suffered by each ground node during coverage must be less than or equal to a certain threshold κ; only then is the ground node successfully covered, otherwise coverage of that ground node fails.
The state comprises the following parts: at time t, the position and energy consumption of unmanned aerial vehicle i, and the signal loss suffered by each ground node. The state of unmanned aerial vehicle i at time t is:
s_i^t = {p_i^t, e_i^t, L_1, ..., L_u, ..., L_m}
Step 1.2, determining the action set A of the unmanned aerial vehicle:
The flying speed of unmanned aerial vehicle i is fixed during flight, and the next action is either a moving direction a_t ∈ (0, 2π) or the hover action a_t = 0. The hover action means that the unmanned aerial vehicle keeps its current position unchanged after it has covered a ground node. Therefore, the action of unmanned aerial vehicle i is:
a_t ∈ [0, 2π)
Step 1.3, defining the state transition probability function P: when the unmanned aerial vehicle is in state s at time t and takes action a, the probability of reaching the next state s′ is:
P_ss′^a = P[S_{t+1} = s′ | S_t = s, A_t = a]
Step 1.4, determining the reward function R of the unmanned aerial vehicle:
Let the set of ground node coverage states be B = {b_1, b_2, ..., b_u, ..., b_m}, where b_u, the coverage state of the u-th ground node, takes values in the Boolean domain {0, 1}: b_u = 1 means the ground node has been covered by an unmanned aerial vehicle, and b_u = 0 means it has not. The coverage rate is the ratio of the number of covered ground nodes to the total number of ground nodes; at time t the coverage rate is:
α_t = (1/m) · Σ_{u=1}^{m} b_u
The coverage radius of each unmanned aerial vehicle is R_c, and the coverage effect of an unmanned aerial vehicle on a target node decreases from strong to weak from the circle centre outward; the effect is strongest when the unmanned aerial vehicle is directly above the ground node. The degree of effect φ_u with which the u-th ground node is first covered is given by a formula [shown only as an image in the original] in which λ is the coverage effect constant.
Planning the optimal path requires that the ground nodes be transformed from the initial state to the target state: the initial state of a ground node is the uncovered state, and the target state is the state of being covered by an unmanned aerial vehicle. The coverage efficiency E_c is designed as a formula combining the rate of covered ground nodes and the coverage effect [shown only as an image in the original].
A reward function is defined to represent the feedback obtained after the unmanned aerial vehicle selects an action in the current state. The basic reward r_t° is built from the coverage increment and the energy consumption increment [its formula is shown only as an image in the original], where the coverage increment is Δα_t = α_t − α_{t−1} and the energy consumption increment of the i-th unmanned aerial vehicle is Δe_i^t = e_i^t − e_i^{t−1}.
If a positive reward were given only when the unmanned aerial vehicle group successfully completes the task, the reward would be too sparse and good results would be difficult to obtain over many training rounds. Extra rewards and penalties are therefore added so that the reward is no longer sparse. In the extra reward/penalty setting, an increasing negative penalty is applied while the overall coverage has not reached the expected value α_ev, and no penalty is applied once the overall coverage reaches the expected value. The coverage reward for each ground node covered by the unmanned aerial vehicle group for the first time is set to +0.1; if an unmanned aerial vehicle exceeds the energy consumption budget during flight, a penalty is applied at every frame; and if the unmanned aerial vehicles cannot communicate with each other, a penalty of −1 is applied. With the extra reward/penalty amount denoted r_extra, the reward value is calculated as:
r_t = r_t° + r_extra
Step 1.5, defining the discount factor γ, γ ∈ (0, 1). The cumulative reward value over the whole process is calculated; the reward value is discounted over time, and the larger the discount coefficient, the more emphasis is placed on long-term benefits.
Step two, based on the five-tuple (S, A, P, R, γ) of the Markov decision process modeled in step one, design a deep deterministic policy gradient (DDPG) algorithm using basic deep reinforcement learning; the specific steps are as follows:
Step 2.1, adopt the actor-critic (Actor-Critic) framework: one network is the actor (Actor), the other is the critic (Critic), and the two networks stimulate and compete with each other. Randomly initialize the state-action value function Q(s, a|θ^Q) of the Critic network and the policy function μ(s|θ^μ) of the Actor network, and copy the weights of the Critic network and the Actor network to the target network parameters of the respective networks, i.e. θ^Q → θ^Q′ and θ^μ → θ^μ′, where θ^Q and θ^μ denote the Critic network parameters and the Actor network parameters, and θ^Q′ and θ^μ′ denote the Critic target network parameters and the Actor target network parameters.
Step 2.2, when the task starts, the initial state of unmanned aerial vehicle i is s_i^0. As the task proceeds, action a_t is chosen according to the current state s_t:
a_t = μ(s_t|θ^μ) + β
where β is random noise. Executing action a_t yields the reward r_t and a new state s_{t+1}.
Step 2.3, obtain the experience tuple (s_t, a_t, r_t, s_{t+1}) and store it in the experience pool: the newest experience tuple is stored at the first position of the pool, and the experience tuples already in the pool are shifted back by one position in turn. A batch of samples is then randomly drawn from the experience pool for training; assuming (s_i, a_i, r_i, s_{i+1}) is a randomly sampled batch of data, TD-target training is carried out, and the target value Y_i is expressed as:
Y_i = r_i + γ·Q′(s_{i+1}, μ′(s_{i+1}|θ^μ′)|θ^Q′)
where μ′ denotes the policy obtained by evaluating s_{i+1}, and Q′ denotes the state-action value obtained at s_{i+1} under the μ′ policy.
Step 2.4, update the Critic network by minimizing the loss function L:
L = (1/N)·Σ_i (Y_i − Q(s_i, a_i|θ^Q))²
where N represents the number of random samples drawn from the experience pool for action exploration.
Step 2.5, update the Actor network parameters θ^μ using the policy gradient; the gradient of the objective function, ∇_{θ^μ}J, is:
∇_{θ^μ}J ≈ (1/N)·Σ_i ∇_a Q(s, a|θ^Q)|_{s=s_i, a=μ(s_i)} · ∇_{θ^μ} μ(s|θ^μ)|_{s=s_i}
where ∇_a Q(s, a|θ^Q) denotes the gradient of the Critic network state-action value function, ∇_{θ^μ} μ(s|θ^μ) denotes the gradient of the Actor network policy function, μ(s_i) denotes the action selected by the Actor network for input state s_i, Q(s_i, μ(s_i)|θ^Q) denotes the Critic network state-action value function at state s_i, and μ(s_i|θ^μ) denotes the Actor network policy function at state s_i.
Step 2.6, the target network values are computed with copies of the networks, and the weight parameters of the target networks are updated by slowly tracking the learned networks. The corresponding Critic and Actor target networks are gradually updated with the current network parameters:
θ^Q′ ← τ·θ^Q + (1 − τ)·θ^Q′
θ^μ′ ← τ·θ^μ + (1 − τ)·θ^μ′
where τ denotes the update proportion coefficient, τ ∈ (0, 1).
Step three, improve the experience buffer pool of the DDPG algorithm by classifying the experience data stored in the experience buffer pool and placing them into different experience buffer pools; the improved DDPG algorithm solves the problem of unstable convergence.
Step 3.1, divide the experience pool into M_success and M_failure, which store successful and failed flight experiences respectively, and set up a temporary experience pool M_temp that stores the latest flight experiences. Once M_temp is full, the earliest experiences are taken out according to the first-in-first-out principle and stored into the experience pool M_success, while the latest flight experiences continue to be stored in M_temp. This is repeated, and finally, according to the final state of the unmanned aerial vehicle, a number of experiences are drawn from the experience pools M_success and M_failure respectively to train the neural network.
Step 3.2, in order to draw more valuable experiences from the experience pool M_success, sampling in proportion from the two experience pools is set as:
η_success = β·φ,   η_failure = (1 − β)·φ
where η_success and η_failure are the numbers of samples drawn from the experience pools M_success and M_failure, φ is the total number of samples, and β ∈ [0, 1] is the success rate, representing the proportion of experiences drawn from M_success.
Step four, design a simulation environment and let the unmanned aerial vehicle group interact with the environment to acquire training data.
The invention can be applied in actual scenarios: the unmanned aerial vehicle serves as an aerial base station to enhance the coverage and performance of communication networks in various scenarios; when the ground communication network is unexpectedly interrupted, unmanned aerial vehicles can be deployed quickly and, by covering ground targets, establish communication links with the ground to transmit data while realizing cooperative interaction with the ground network. The planar scenario in which the unmanned aerial vehicle group performs cooperative coverage is shown in FIG. 2: m ground nodes with fixed positions and n unmanned aerial vehicles flying at a fixed height H are randomly distributed in the area, and all unmanned aerial vehicles take off from random positions at the same moment. The unmanned aerial vehicles are planned to cooperatively cover the ground nodes under multiple constraint conditions to obtain optimal paths, providing fast, reliable, economical and efficient data transmission and network communication for the ground.
By comparison with the random algorithm, the particle swarm algorithm and the DDPG algorithm, the improved DDPG algorithm of the present invention exceeds the foregoing algorithms in terms of coverage efficiency and energy consumption balance. Wherein:
The random algorithm means that at each moment every unmanned aerial vehicle randomly selects a flight direction within the range [0, 2π) as its current action; if the new position exceeds the boundary of the target area, the action is abandoned and the unmanned aerial vehicle stays in place.
The particle swarm algorithm is a meta-heuristic algorithm and a method commonly used at present for searching for an optimal path; it finds the optimal solution by initializing a group of random particles and iterating many times. During each iteration, a particle updates itself by tracking two extrema: the best solution it has found itself and the best solution currently found by the whole population; the extrema of the particle's neighbours can also be used to update it.
Referring to FIG. 3 and FIG. 4, the motion paths of the unmanned aerial vehicle group obtained by the four different algorithms are compared in terms of the coverage efficiency at different coverage rates and the balance degree of the energy used during flight. The improved DDPG algorithm provided by the invention improves the training success rate, converges faster, maximizes the coverage efficiency under the same conditions, effectively balances the flight energy consumption of each unmanned aerial vehicle, avoids the barrel effect caused by excessive energy consumption of a single unmanned aerial vehicle, and further reduces the flight time and cost of the multiple unmanned aerial vehicles.
The foregoing is only a partial embodiment of the present invention, and it should be noted that it will be apparent to those skilled in the art that modifications and adaptations can be made without departing from the principles of the present invention, and such modifications and adaptations are intended to be comprehended within the scope of the present invention.

Claims (5)

1. A multi-unmanned aerial vehicle base station collaborative coverage path planning method based on deep reinforcement learning, in which a deep reinforcement learning model is first designed, the unmanned aerial vehicle group then interacts with the environment in a simulation environment to obtain training data, the training data are sampled for simulation training, and collaborative coverage path planning of the target ground nodes is finally realized;
the method is characterized by comprising the following steps:
step one, defining a Markov model: modeling the constraint conditions of the unmanned aerial vehicle base station by using the five-tuple (S, A, P, R, γ) of a Markov decision process; the unmanned aerial vehicle base station is a base station carried by an unmanned aerial vehicle, hereinafter referred to as the unmanned aerial vehicle;
step two, designing a deep deterministic policy gradient (DDPG) algorithm based on the five-tuple (S, A, P, R, γ) of the Markov decision process obtained by modeling in step one, wherein the DDPG algorithm uses basic deep reinforcement learning;
step three, improving the experience buffer pool of the DDPG algorithm by classifying the experience data stored in the experience buffer pool and placing the obtained experience data into different experience buffer pools;
in the first step:
step 1.1, determining the state S of the unmanned aerial vehicle:
m fixed ground nodes and n unmanned aerial vehicles are randomly distributed in a target area;
the unmanned aerial vehicle state S contains: at time t, the position p_i^t of unmanned aerial vehicle i, its energy consumption e_i^t, and the signal loss L_1, ..., L_u, ..., L_m experienced by each ground node; the state of unmanned aerial vehicle i at time t is expressed as:
s_i^t = {p_i^t, e_i^t, L_1, ..., L_u, ..., L_m}
where p_i^t = (x_i^t, y_i^t) is the coordinate of unmanned aerial vehicle i at time t, and e_i^t is the energy consumed by unmanned aerial vehicle i from its initial position to its position at time t;
step 1.2, determining an action set A of the unmanned aerial vehicle:
the flying speed of unmanned aerial vehicle i is fixed during flight, and the next action is either a moving direction a_t ∈ (0, 2π) or the hover action a_t = 0; the hover action means that the unmanned aerial vehicle keeps its current position unchanged after it has covered a ground node; the action of unmanned aerial vehicle i is: a_t ∈ [0, 2π);
step 1.3, defining the state transition probability function P: when the unmanned aerial vehicle is in state s at time t and takes action a, the probability of reaching the next state s′ is:
P_ss′^a = P[S_{t+1} = s′ | S_t = s, A_t = a]
step 1.4, determining a reward function R of the unmanned aerial vehicle:
let the set of ground node coverage states be B = {b_1, b_2, ..., b_u, ..., b_m}, where b_u, the coverage state of the u-th ground node, takes values in the Boolean domain {0, 1}; if b_u = 1 the ground node has been covered by an unmanned aerial vehicle, and if b_u = 0 the ground node has not been covered;
the coverage rate α_t at time t is the ratio of the number of covered ground nodes to the total number of ground nodes m:
α_t = (1/m) · Σ_{u=1}^{m} b_u
the coverage radius of each unmanned aerial vehicle is R_c, and the coverage effect of the unmanned aerial vehicle on a target ground node decreases from strong to weak from the circle centre outward; the degree of effect φ_u with which the u-th ground node is first covered is given by a formula [shown only as an image in the original], wherein λ is the coverage effect constant;
planning the optimal path requires that the ground nodes be transformed from the initial state to the target state, wherein the initial state of a ground node is the uncovered state and the target state is the state of being covered by an unmanned aerial vehicle; the coverage efficiency E_c is designed as a formula combining the rate of covered ground nodes and the coverage effect [shown only as an image in the original];
a reward function is defined to represent the feedback obtained after the unmanned aerial vehicle selects an action in the current state; the basic reward r_t° is built from the coverage increment Δα_t = α_t − α_{t−1} and the energy consumption increment Δe_i^t = e_i^t − e_i^{t−1} of the i-th unmanned aerial vehicle [its formula is shown only as an image in the original]; the basic reward r_t° serves as the reward value of the reward function R;
step 1.5, defining the discount factor γ, γ ∈ (0, 1); the cumulative reward value over the whole process is calculated, the reward value is discounted over time, and the larger the discount coefficient, the more emphasis is placed on long-term benefits;
in the second step:
step 2.1, adopting the actor-critic (Actor-Critic) framework, in which one network is the actor (Actor), the other network is the critic (Critic), and the two networks stimulate and compete with each other;
randomly initializing the state-action value function Q(s, a|θ^Q) of the Critic network and the policy function μ(s|θ^μ) of the Actor network; copying the weights of the Critic network and the Actor network to the target network parameters of the respective networks, i.e. θ^Q → θ^Q′ and θ^μ → θ^μ′, where θ^Q and θ^μ denote the Critic network parameters and the Actor network parameters, and θ^Q′ and θ^μ′ denote the Critic target network parameters and the Actor target network parameters;
step 2.2, when the task starts, the initial state of unmanned aerial vehicle i is s_i^0; as the task proceeds, action a_t is chosen according to the current state s_t:
a_t = μ(s_t|θ^μ) + β
where β is random noise;
executing action a_t yields the reward r_t and a new state s_{t+1};
step 2.3, obtaining the experience tuple (s_t, a_t, r_t, s_{t+1}) and storing it in the experience pool;
a batch of samples is randomly drawn from the experience pool for training; assuming (s_i, a_i, r_i, s_{i+1}) is a randomly sampled batch of data, TD-target training is carried out, and the target value Y_i is expressed as:
Y_i = r_i + γ·Q′(s_{i+1}, μ′(s_{i+1}|θ^μ′)|θ^Q′)
where μ′ denotes the policy obtained by evaluating s_{i+1}, and Q′ denotes the state-action value obtained at s_{i+1} under the μ′ policy;
step 2.4, updating the Critic network by minimizing the loss function L:
L = (1/N)·Σ_i (Y_i − Q(s_i, a_i|θ^Q))²
where N represents the number of random samples drawn from the experience pool for action exploration;
step 2.5, updating the Actor network parameters θ^μ using the policy gradient; the gradient of the objective function, ∇_{θ^μ}J, is:
∇_{θ^μ}J ≈ (1/N)·Σ_i ∇_a Q(s, a|θ^Q)|_{s=s_i, a=μ(s_i)} · ∇_{θ^μ} μ(s|θ^μ)|_{s=s_i}
where ∇_a Q(s, a|θ^Q) denotes the gradient of the Critic network state-action value function, ∇_{θ^μ} μ(s|θ^μ) denotes the gradient of the Actor network policy function, μ(s_i) denotes the action selected by the Actor network for input state s_i, Q(s_i, μ(s_i)|θ^Q) denotes the Critic network state-action value function at state s_i, and μ(s_i|θ^μ) denotes the Actor network policy function at state s_i;
step 2.6, computing the target network values with copies of the networks, the weight parameters of the target networks being updated by slowly tracking the learned networks; meanwhile, the corresponding Critic and Actor target networks are gradually updated with the current network parameters:
θ^Q′ ← τ·θ^Q + (1 − τ)·θ^Q′
θ^μ′ ← τ·θ^μ + (1 − τ)·θ^μ′
where τ denotes the update proportion coefficient, τ ∈ (0, 1);
in the third step:
step 3.1, dividing the experience pool into M_success and M_failure, which respectively store successful and failed flight experiences; a number of experiences are drawn from the experience pools M_success and M_failure respectively to train the neural network;
step 3.2, setting sampling in proportion from the two experience pools:
η_success = β·φ,   η_failure = (1 − β)·φ
where η_success and η_failure are the numbers of samples drawn from the experience pools M_success and M_failure, φ is the total number of samples, and β ∈ [0, 1] is the success rate, representing the proportion of experiences drawn from M_success.
2. The method for planning the cooperative coverage path of the base stations of the multiple unmanned aerial vehicles based on deep reinforcement learning according to claim 1, wherein in the step 1.1,
the whole target area is divided into I × J cells, m ground nodes with fixed positions and n unmanned aerial vehicles flying at a fixed height H are randomly distributed in the area, the coordinates of unmanned aerial vehicle i at time t are expressed as p_i^t = (x_i^t, y_i^t), and the position coordinate of the u-th ground node is denoted as q_u = (x_u, y_u);
the fixed total energy of one unmanned aerial vehicle is e_max, the energy consumed by the unmanned aerial vehicle moving one unit is e_1, and the energy consumed by hovering over a ground node is e_2; e_1 and e_2 are both constants, and the unmanned aerial vehicle must complete the task before its energy is exhausted;
therefore, the energy e_i^t consumed by unmanned aerial vehicle i in flying from its initial position to its position at time t is
e_i^t = e_1 · (number of units moved up to time t) + e_2 · c_i^t,   e_i^t ≤ e_max
where c_i^t is the number of ground nodes covered by unmanned aerial vehicle i at time t;
the communication radius of each unmanned aerial vehicle is fixed to R_s; due to the limitation of communication connectivity, unmanned aerial vehicle i must always keep the unmanned aerial vehicle j nearest to itself within its communication radius, which gives the following formula:
min(||p_i − p_j||, i ≠ j) ≤ R_s
where p_i and p_j respectively denote the positions of unmanned aerial vehicle i and unmanned aerial vehicle j;
the process of the unmanned aerial vehicle transmitting signals to the ground nodes produces channel fading, and the probability of a line-of-sight (LoS) fading link between the unmanned aerial vehicle and the ground is:
P_LoS = 1 / (1 + f·exp(−g·[(180/π)·arctan(H/d_iu) − f]))
where f and g are constants related to the type of environment, H represents the altitude of the unmanned aerial vehicle, and d_iu is the horizontal distance between the i-th unmanned aerial vehicle and the u-th ground node:
d_iu = sqrt((x_i^t − x_u)² + (y_i^t − y_u)²)
the probability of a non-line-of-sight (NLoS) fading link is:
P_NLoS = 1 − P_LoS
the LoS and NLoS link loss models are:
L_LoS = 20·log10(4π·f_c·ω_iu/c) + η_LoS
L_NLoS = 20·log10(4π·f_c·ω_iu/c) + η_NLoS
ω_iu = sqrt(H² + d_iu²)
where c is the propagation speed of light, f_c is the carrier frequency, ω_iu is the distance between the i-th unmanned aerial vehicle and the u-th ground node, and η_LoS and η_NLoS are respectively the extra losses of the LoS and NLoS links;
under the LoS and NLoS models, the signal loss of the u-th ground node is:
L_u = P_LoS·L_LoS + P_NLoS·L_NLoS,   L_u ≤ κ
in order to guarantee the quality of service while the unmanned aerial vehicle covers the ground nodes, the signal loss suffered by each ground node during coverage must be less than or equal to a certain threshold κ; only then is the ground node successfully covered, otherwise coverage of that ground node fails.
3. The method for planning the cooperative coverage path of multiple unmanned aerial vehicle base stations based on deep reinforcement learning according to claim 1, characterized in that in step 1.4, extra rewards and penalties are additionally set, and the sum of the basic reward and the extra reward/penalty is used as the reward value of the reward function;
in the extra reward/penalty setting, an increasing negative penalty is applied while the overall coverage has not reached the expected value α_ev, and no penalty is applied once the overall coverage reaches the expected value;
the coverage reward for each ground node covered by the unmanned aerial vehicle group for the first time is set to an increasing value; if the unmanned aerial vehicle exceeds the energy consumption budget during flight, an increasing negative penalty is applied at every frame; if the unmanned aerial vehicles cannot communicate with each other, a negative penalty is applied;
with the extra reward/penalty amount denoted r_extra, the reward value is r_t = r_t° + r_extra.
4. The method for planning the collaborative coverage path of multiple unmanned aerial vehicle base stations based on deep reinforcement learning according to claim 1, characterized in that in step 2.3, the experience tuples are stored in an experience pool, the newly stored experience tuple is placed at the first position in the experience pool, and the experience tuples already in the pool are shifted back by one position in turn.
5. The method for planning a cooperative coverage path of multiple unmanned aerial vehicle base stations based on deep reinforcement learning according to claim 1, characterized in that in step 3.1, a temporary experience pool M_temp is further provided to store the latest flight experiences;
once M_temp is full, the earliest experiences are taken out according to the first-in-first-out principle and stored into the experience pool M_success, while the latest flight experiences continue to be stored in M_temp; this is repeated, and finally, according to the final state of the unmanned aerial vehicle, a number of experiences are drawn from the experience pools M_success and M_failure respectively to train the neural network.
CN202310021781.1A 2023-01-07 2023-01-07 Multi-unmanned aerial vehicle base station collaborative coverage path planning method based on deep reinforcement learning Pending CN116227767A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310021781.1A CN116227767A (en) 2023-01-07 2023-01-07 Multi-unmanned aerial vehicle base station collaborative coverage path planning method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310021781.1A CN116227767A (en) 2023-01-07 2023-01-07 Multi-unmanned aerial vehicle base station collaborative coverage path planning method based on deep reinforcement learning

Publications (1)

Publication Number Publication Date
CN116227767A true CN116227767A (en) 2023-06-06

Family

ID=86572376

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310021781.1A Pending CN116227767A (en) 2023-01-07 2023-01-07 Multi-unmanned aerial vehicle base station collaborative coverage path planning method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN116227767A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116502547A (en) * 2023-06-29 2023-07-28 深圳大学 Multi-unmanned aerial vehicle wireless energy transmission method based on graph reinforcement learning


Similar Documents

Publication Publication Date Title
CN110488861B (en) Unmanned aerial vehicle track optimization method and device based on deep reinforcement learning and unmanned aerial vehicle
CN113190039B (en) Unmanned aerial vehicle acquisition path planning method based on layered deep reinforcement learning
CN113162679A (en) DDPG algorithm-based IRS (inter-Range instrumentation System) auxiliary unmanned aerial vehicle communication joint optimization method
CN111580544B (en) Unmanned aerial vehicle target tracking control method based on reinforcement learning PPO algorithm
CN112511250B (en) DRL-based multi-unmanned aerial vehicle air base station dynamic deployment method and system
CN113433967B (en) Chargeable unmanned aerial vehicle path planning method and system
CN113346944A (en) Time delay minimization calculation task unloading method and system in air-space-ground integrated network
CN110181508B (en) Three-dimensional route planning method and system for underwater robot
CN112788699B (en) Method and system for determining network topology of self-organizing network
CN114422056A (en) Air-ground non-orthogonal multiple access uplink transmission method based on intelligent reflecting surface
CN114567888B (en) Multi-unmanned aerial vehicle dynamic deployment method
JP2020077392A (en) Method and device for learning neural network at adaptive learning rate, and testing method and device using the same
CN116227767A (en) Multi-unmanned aerial vehicle base station collaborative coverage path planning method based on deep reinforcement learning
CN112363532B (en) Method for simultaneously taking off and gathering multiple unmanned aerial vehicles based on QUATRE algorithm
CN116451934B (en) Multi-unmanned aerial vehicle edge calculation path optimization and dependent task scheduling optimization method and system
CN116627162A (en) Multi-agent reinforcement learning-based multi-unmanned aerial vehicle data acquisition position optimization method
CN113406965A (en) Unmanned aerial vehicle energy consumption optimization method based on reinforcement learning
CN113507717A (en) Unmanned aerial vehicle track optimization method and system based on vehicle track prediction
CN113485409A (en) Unmanned aerial vehicle path planning and distribution method and system for geographic fairness
CN113382060B (en) Unmanned aerial vehicle track optimization method and system in Internet of things data collection
CN113377131B (en) Method for acquiring unmanned aerial vehicle collected data track by using reinforcement learning
Khamidehi et al. Reinforcement-learning-aided safe planning for aerial robots to collect data in dynamic environments
Seong et al. Multi-UAV trajectory optimizer: A sustainable system for wireless data harvesting with deep reinforcement learning
Zhang et al. Multi-objective optimization for UAV-enabled wireless powered IoT networks: an LSTM-based deep reinforcement learning approach
CN116859989A (en) Unmanned aerial vehicle cluster intelligent countermeasure strategy generation method based on group cooperation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination