CN116227767A - Multi-unmanned aerial vehicle base station collaborative coverage path planning method based on deep reinforcement learning - Google Patents
- Publication number
- CN116227767A CN116227767A CN202310021781.1A CN202310021781A CN116227767A CN 116227767 A CN116227767 A CN 116227767A CN 202310021781 A CN202310021781 A CN 202310021781A CN 116227767 A CN116227767 A CN 116227767A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/04—Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
- G06Q10/047—Optimisation of routes or paths, e.g. travelling salesman problem
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/20—Design optimisation, verification or simulation
- G06F30/27—Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2111/00—Details relating to CAD techniques
- G06F2111/08—Probabilistic or stochastic CAD
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/70—Reducing energy consumption in communication networks in wireless communication networks
Abstract
The invention discloses a multi-unmanned aerial vehicle base station collaborative coverage path planning method based on deep reinforcement learning, which comprises the following steps: first, a Markov model based on deep reinforcement learning is defined and the five-tuple of the Markov decision process is modeled; a deep deterministic policy gradient (DDPG) algorithm is then designed on the basis of this model; next, the experience buffer pool of the DDPG algorithm is improved by classifying the experience data stored in the buffer pool and placing it into different experience buffer pools, and the improved DDPG algorithm resolves the problem of unstable convergence; finally, a simulation environment is designed in which the unmanned aerial vehicle group interacts with the environment to obtain training data. Through this method, the target task of cooperative coverage of ground nodes by the unmanned aerial vehicle group under multiple constraint conditions is realized with higher planning efficiency and lower flight cost.
Description
Technical Field
The invention provides a multi-unmanned aerial vehicle base station collaborative coverage path planning method based on deep reinforcement learning, and belongs to the field of artificial intelligence.
Background
Unmanned aerial vehicles have the advantages of high maneuverability, flexible deployment and low cost, and are widely applied in industries such as terrain coverage, agricultural production, environmental reconnaissance, air rescue and disaster early warning. A drone can serve as an aerial base station to extend the coverage and performance of a communication network in many scenarios. When the ground communication network is unexpectedly interrupted, drones can be deployed quickly to establish communication links with the ground, transmit data, and cooperate with the ground network. The coverage path planning algorithm is a key technology for applying unmanned aerial vehicles successfully in such complex scenarios.
When planning paths for drones to cover ground nodes, the energy constraints of each drone must be considered; at the same time, the drone must maintain signal transmission with the ground base station while executing its task, and this transmission suffers losses that degrade the coverage quality of service. Moreover, a single drone is difficult to apply to large-scale ground coverage tasks because of its energy and communication constraints; cooperative flight of multiple drones is an effective scheme for large-scale coverage tasks, but communication between the drones must be maintained at all times. How to efficiently realize cooperative coverage of ground nodes under the constraints of limited energy consumption, limited communication distance and signal transmission loss is therefore a challenging theoretical and practical problem.
Disclosure of Invention
In order to solve the problem of realizing efficient collaborative coverage of the ground under multiple constraint conditions, the invention provides a multi-unmanned aerial vehicle base station collaborative coverage path planning method based on deep reinforcement learning, which specifically comprises the following steps:
Step one, defining a Markov model, namely modeling the five-tuple (S, A, P, R, γ) of the Markov decision process;
Step two, designing a deep deterministic policy gradient (DDPG) algorithm of basic deep reinforcement learning based on the five-tuple (S, A, P, R, γ) obtained by modeling in step one;
Step three, improving the experience buffer pool of the DDPG algorithm by classifying the experience data stored in the buffer pool and placing it into different experience buffer pools;
Step four, designing a simulation environment in which the unmanned aerial vehicle group interacts with the environment to obtain training data, sampling the training data for simulation training, and planning the collaborative coverage path over the target ground nodes.
The specific steps of the first step comprise:
Step 1.1, determining the state S of the unmanned aerial vehicle:
The whole target area is divided into I × J cells, in which m ground nodes with fixed positions and n unmanned aerial vehicles flying at a fixed height H are randomly distributed. The coordinates of unmanned aerial vehicle i at time t are expressed as p_i^t = (x_i^t, y_i^t, H), and the position of the u-th ground node is denoted q_u = (x_u, y_u). The fixed total energy of each unmanned aerial vehicle is e_max; moving one unit consumes energy e_1 and hovering over a ground node consumes e_2, where e_1 and e_2 are constants, and the drone must complete its task before its energy is exhausted. The energy e_i^t consumed by unmanned aerial vehicle i from its initial position up to time t is therefore:
e_i^t = e_1 · n_move + e_2 · n_hover ≤ e_max
where n_move and n_hover denote the numbers of unit moves and hovering steps performed by drone i up to time t.
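The energy accounting above can be sketched in a few lines. This is a minimal illustration, assuming the stated per-step costs; the function names and the move/hover step counts are not from the patent:

```python
# Hypothetical helpers for the per-UAV energy model: e1 is the cost of one
# unit move, e2 the cost of one hover step, e_max the fixed total energy.
def energy_consumed(n_moves: int, n_hovers: int, e1: float, e2: float) -> float:
    """Energy used after n_moves unit moves and n_hovers hover steps."""
    return e1 * n_moves + e2 * n_hovers

def within_budget(n_moves: int, n_hovers: int,
                  e1: float, e2: float, e_max: float) -> bool:
    # The drone must complete its task before its energy is exhausted.
    return energy_consumed(n_moves, n_hovers, e1, e2) <= e_max
```

For example, a drone that has moved 10 units and hovered 5 steps with e1 = 1.0 and e2 = 0.5 has consumed 12.5 units of energy.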
The communication radius of each unmanned aerial vehicle is fixed at R_s. Because of the communication connectivity constraint, each unmanned aerial vehicle i must always keep the unmanned aerial vehicle j nearest to it within the communication radius, which gives:
min_{j ≠ i} ‖p_i − p_j‖ ≤ R_s
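The connectivity constraint can be checked directly from the fleet positions. A hypothetical sketch, assuming planar (x, y) positions:

```python
import math

def connected(positions, r_s):
    """Check the constraint min_{j != i} ||p_i - p_j|| <= R_s for every
    UAV i, i.e. every drone's nearest neighbour lies within radius r_s."""
    for i, p in enumerate(positions):
        nearest = min(
            math.dist(p, q) for j, q in enumerate(positions) if j != i
        )
        if nearest > r_s:
            return False
    return True
```

A chain of drones spaced one unit apart is connected for R_s = 1.5, while moving one drone far away breaks the constraint.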
Transmitting signals from the unmanned aerial vehicle to the ground nodes causes channel fading, and the resulting signal loss affects the quality of service when covering the ground nodes. Surrounding obstacles such as buildings and trees add extra loss on top of the channel fading. The probability of a line-of-sight (LoS) fading link between the unmanned aerial vehicle and the ground is:
P_LoS = 1 / (1 + f · exp(−g(θ_iu − f)))
where f and g are constants related to the type of environment, θ_iu = arctan(H / d_iu) is the elevation angle, H represents the altitude of the drone, and d_iu is the horizontal distance between the i-th unmanned aerial vehicle and the u-th ground node:
d_iu = √((x_i − x_u)² + (y_i − y_u)²)
The probability of a non-line-of-sight (NLoS) fading link is:
P_NLoS = 1 − P_LoS
The LoS and NLoS link loss models are:
L_LoS = 20 log₁₀(4π f_c ω_iu / c) + η_LoS
L_NLoS = 20 log₁₀(4π f_c ω_iu / c) + η_NLoS
where c is the propagation speed of light, f_c is the carrier frequency, ω_iu = √(H² + d_iu²) is the distance between the i-th unmanned aerial vehicle and the u-th ground node, and η_LoS, η_NLoS are the extra losses of the LoS and NLoS fading links. Under the LoS and NLoS models, the signal loss of the u-th ground node is:
L_u = P_LoS · L_LoS + P_NLoS · L_NLoS
To guarantee the quality of service while the unmanned aerial vehicles cover the ground nodes, the signal loss suffered by each ground node during coverage must be less than or equal to a threshold κ:
L_u ≤ κ
Only then is the ground node successfully covered; otherwise its coverage fails.
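A hedged sketch of the air-to-ground channel model above. The elevation-angle form of the LoS probability is a common modelling assumption (the patent gives only the environment constants f and g), and the sample constants in the test are illustrative, not the patent's values:

```python
import math

def p_los(h, d_iu, f, g):
    # LoS probability as a function of the elevation angle (assumed form;
    # f and g are environment-type constants).
    theta = math.degrees(math.atan2(h, d_iu))  # elevation angle in degrees
    return 1.0 / (1.0 + f * math.exp(-g * (theta - f)))

def mean_path_loss(h, d_iu, f_c, f, g, eta_los, eta_nlos):
    c = 3e8  # propagation speed of light, m/s
    omega = math.hypot(h, d_iu)  # 3D UAV-to-node distance omega_iu
    fspl = 20 * math.log10(4 * math.pi * f_c * omega / c)  # free-space term, dB
    p = p_los(h, d_iu, f, g)
    # L_u = P_LoS * L_LoS + P_NLoS * L_NLoS
    return p * (fspl + eta_los) + (1 - p) * (fspl + eta_nlos)

def covered(loss, kappa):
    # A node is successfully covered only if its signal loss stays within kappa.
    return loss <= kappa
```

As expected, a node far from the drone suffers a larger expected loss than a node almost directly below it, and fails coverage under a tight threshold.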
The state comprises the following parts: at time t, the position and energy consumption of unmanned aerial vehicle i and the signal loss suffered by each ground node. The state of unmanned aerial vehicle i at time t is therefore:
s_i^t = (p_i^t, e_i^t, L_1, …, L_m)
Step 1.2, determining the action set A of the unmanned aerial vehicle:
The flying speed of unmanned aerial vehicle i is fixed during flight. At each step the drone either moves in a direction a_t ∈ (0, 2π) or takes the hovering action a_t = 0, where hovering means the drone keeps its current position unchanged after it has covered a ground node. The action of unmanned aerial vehicle i is therefore:
a_t ∈ [0, 2π)
Step 1.3, defining the state transition probability function P: when the unmanned aerial vehicle is in state s at time t and takes action a, the probability of reaching the next state s′ is:
P(s′ | s, a) = Pr(S_{t+1} = s′ | S_t = s, A_t = a)
Step 1.4, determining the reward function R of the unmanned aerial vehicle:
Let B = {b_1, b_2, …, b_u, …, b_m} be the set of ground node coverage states, where b_u ∈ {0, 1} is the Boolean coverage state of the u-th ground node: b_u = 1 means the node has been covered by a drone and b_u = 0 means it has not. The coverage rate is the ratio of the number of covered ground nodes to the total number of ground nodes; at time t it is:
α_t = (1/m) Σ_{u=1}^{m} b_u
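The coverage rate α_t follows directly from the Boolean coverage states; a trivial sketch (the function name is assumed):

```python
def coverage_rate(b):
    """alpha_t: fraction of ground nodes already covered, given the
    coverage-state list B = [b_1, ..., b_m] with b_u in {0, 1}."""
    return sum(b) / len(b)
```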
The coverage radius of each unmanned aerial vehicle is R_c. The coverage effect on a target node weakens gradually from the circle centre to the periphery, and is strongest when the drone is directly above the ground node. The degree of effect obtained when the u-th ground node is first covered decreases accordingly with this distance, where λ is the coverage effect constant.
Planning the optimal path requires the ground nodes to transition from the initial state to the target state: the initial state of a ground node is uncovered, and the target state is covered by a drone. The coverage efficiency E_c is designed as a combined measure of the coverage rate of ground nodes and the coverage effect.
A reward function is then defined, representing the feedback the unmanned aerial vehicle obtains after selecting an action in the current state. The basic reward is built from the coverage increment Δα_t = α_t − α_{t−1} and the energy consumption increment of the i-th unmanned aerial vehicle, Δe_i^t = e_i^t − e_i^{t−1}.
If a positive reward were given only when the unmanned aerial vehicle group successfully completes the whole task, rewards would be too sparse and good results would be hard to obtain over many training rounds. Extra rewards and penalties are therefore added so the reward is no longer sparse. In the extra-penalty setting, a penalty is given when the overall coverage rate does not reach the expected value α_ev, and no penalty is applied once coverage reaches the expected value; a coverage reward of +0.1 is granted each time the group covers a ground node for the first time; a penalty is given in every frame in which a drone exceeds its energy consumption budget; and a penalty of −1 is given whenever communication between unmanned aerial vehicles becomes impossible. The total extra reward and penalty amount is r_extra.
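The reward-shaping terms above can be sketched as one function. This is a hedged illustration: the +0.1 and −1 values come from the text, but the other penalty magnitudes are assumptions, not the patent's exact values:

```python
def extra_reward(coverage, expected_coverage, newly_covered,
                 over_energy_budget, comms_broken):
    """Sketch of r_extra: shaping terms added to the sparse base reward.
    Penalty magnitudes other than +0.1 and -1 are illustrative assumptions."""
    r = 0.0
    if coverage < expected_coverage:
        r -= 1.0              # coverage below expected value alpha_ev (assumed -1)
    r += 0.1 * newly_covered  # +0.1 per node covered for the first time
    if over_energy_budget:
        r -= 1.0              # energy budget exceeded this frame (assumed -1)
    if comms_broken:
        r -= 1.0              # inter-UAV communication lost: penalty of -1
    return r
```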
Step 1.5, defining the discount factor γ, with γ ∈ (0, 1). The cumulative reward over the whole process is computed with rewards discounted over time: the larger the discount factor, the more emphasis is placed on long-term benefit.
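As a small illustration of the discount factor (the function name is assumed), the cumulative discounted reward is:

```python
def discounted_return(rewards, gamma):
    """Cumulative discounted reward sum_k gamma**k * r_k. A larger gamma
    weights long-term benefit more heavily."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))
```

A reward arriving two steps in the future counts for more under γ = 0.99 than under γ = 0.9.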
The specific steps of the second step comprise:
Step 2.1, an Actor-Critic framework is adopted: one network is the actor (Actor) and the other is the critic (Critic), and the two networks improve each other through their interaction. The Critic network's state-action value function Q(s, a | θ^Q) and the Actor network's policy function μ(s | θ^μ) are randomly initialized, and the weights of the Critic and Actor networks are copied to the target network parameters of the respective networks, i.e., θ^Q → θ^{Q′} and θ^μ → θ^{μ′}, where θ^Q and θ^μ denote the Critic and Actor network parameters, and θ^{Q′} and θ^{μ′} denote the Critic and Actor target network parameters.
Step 2.2, when the task starts, unmanned aerial vehicle i is in its initial state s_i^0. As the task proceeds, the action a_t is selected according to the current state s_t:
a_t = μ(s_t | θ^μ) + β
where β is random exploration noise. Executing action a_t yields the reward r_t and the new state s_{t+1}.
Step 2.3, the experience tuple (s_t, a_t, r_t, s_{t+1}) is obtained and stored in an experience pool: the newest tuple is stored at the first position and the existing tuples in the pool are each shifted back one position. A batch of samples is then randomly drawn from the experience pool for training. Let (s_i, a_i, r_i, s_{i+1}) be a randomly sampled batch; TD-target training is carried out, with the target value Y_i expressed as:
Y_i = r_i + γ Q′(s_{i+1}, μ′(s_{i+1} | θ^{μ′}) | θ^{Q′})
where μ′ denotes the target policy obtained for s_{i+1} and Q′ denotes the state-action value obtained at s_{i+1} under the μ′ policy.
Step 2.4, the Critic network is updated by minimizing the loss function L:
L = (1/N) Σ_i (Y_i − Q(s_i, a_i | θ^Q))²
where N is the number of random samples drawn from the experience pool, i.e. the minibatch size.
Step 2.5, the Actor network parameters θ^μ are updated with the sampled policy gradient:
∇_{θ^μ} J ≈ (1/N) Σ_i ∇_a Q(s, a | θ^Q)|_{s=s_i, a=μ(s_i)} · ∇_{θ^μ} μ(s | θ^μ)|_{s=s_i}
where ∇_a Q(s, a | θ^Q) is the gradient of the Critic network's state-action value function with respect to the action, ∇_{θ^μ} μ(s | θ^μ) is the gradient of the Actor network's policy function with respect to its parameters, and μ(s_i) is the action selected by the Actor network for input state s_i.
Step 2.6, the target network values are computed with the duplicate (target) networks, whose weight parameters are updated by slowly tracking the learned networks. The corresponding Critic and Actor target networks are gradually updated from the current network parameters:
θ^{Q′} ← τθ^Q + (1 − τ)θ^{Q′}
θ^{μ′} ← τθ^μ + (1 − τ)θ^{μ′}
where τ ∈ (0, 1) is the update scaling factor.
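The soft target update is a per-parameter interpolation; a minimal sketch over flat parameter lists (names assumed):

```python
def soft_update(target_params, online_params, tau):
    """Soft update theta' <- tau*theta + (1 - tau)*theta', applied
    element-wise; tau in (0, 1) controls how fast the target tracks."""
    return [tau * p + (1.0 - tau) * tp
            for tp, p in zip(target_params, online_params)]
```

With τ = 1 the target is copied outright; small τ gives the slow tracking the method relies on.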
The specific steps of the third step comprise:
Step 3.1, the experience pool is divided into M_success and M_failure, which store the experiences of successful and failed flights respectively, and a temporary experience pool M_temp stores the latest flight experience. Once M_temp is full, the earliest experiences are removed from it first-in first-out and, according to the final state of the unmanned aerial vehicles, placed into M_success or M_failure, while the latest flight experience continues to be stored in M_temp. This process is repeated, and finally a number of experiences are drawn from M_success and M_failure respectively to train the neural network.
Step 3.2, in order to draw more valuable experiences from the experience pool M_success, proportional sampling from the two experience pools is set as:
η_success = β · φ,  η_failure = (1 − β) · φ
where η_success and η_failure are the numbers of samples drawn from M_success and M_failure, φ is the total number of samples, and β ∈ [0, 1] is the success rate, representing the proportion of experiences taken from M_success.
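The classified buffer and its proportional sampling can be sketched as follows. This is one reading of the text under stated assumptions: the class and method names are illustrative, and experiences are routed to M_success or M_failure at episode end according to the final state:

```python
import random
from collections import deque

class ClassifiedReplay:
    """Sketch of the improved experience buffer: a FIFO temporary pool
    M_temp plus success/failure pools M_success and M_failure."""

    def __init__(self, temp_capacity):
        self.temp = deque()
        self.temp_capacity = temp_capacity
        self.success = []
        self.failure = []

    def store(self, transition):
        # M_temp keeps only the newest experiences, first-in first-out.
        if len(self.temp) == self.temp_capacity:
            self.temp.popleft()
        self.temp.append(transition)

    def end_episode(self, succeeded):
        # Classify the buffered experiences by the UAVs' final state.
        pool = self.success if succeeded else self.failure
        pool.extend(self.temp)
        self.temp.clear()

    def sample(self, phi, beta):
        # eta_success = beta * phi, eta_failure = (1 - beta) * phi,
        # the proportional rule reconstructed from the text.
        n_succ = min(int(beta * phi), len(self.success))
        n_fail = min(phi - n_succ, len(self.failure))
        return (random.sample(self.success, n_succ)
                + random.sample(self.failure, n_fail))
```

Raising β skews each training batch toward successful flight experience, which is the intent of the improved DDPG buffer.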
The technical scheme of the invention has the following advantages:
1. The method establishes a multi-UAV collaborative coverage scene model in which the unmanned aerial vehicle group interacts with the environment to obtain training data and autonomously plans an optimal path. The simulation environment built in this process has high practical application value.
2. The method uses the deep deterministic policy gradient (DDPG) algorithm and improves it by classifying the experience data stored in the experience buffer pool; this effectively handles the continuous control of the unmanned aerial vehicles, raises the rate of successful samples collected during the task, and yields better convergence.
3. The method achieves better coverage efficiency and balances the overall energy consumption, so the flight cost of the task is lower and its completion time shorter.
Through this method, the target task of cooperative coverage of ground nodes by the unmanned aerial vehicle group under multiple constraint conditions is realized with higher planning efficiency and lower flight cost.
Drawings
FIG. 1 is a flow chart of the overall method of the present invention;
FIG. 2 is a schematic illustration of an application scenario of the present invention;
FIG. 3 compares the coverage efficiency of four unmanned aerial vehicle groups at different coverage rates under the same algorithm;
FIG. 4 compares the balance of the energy used by the unmanned aerial vehicle group during flight under four algorithms.
Detailed Description
The technical scheme of the invention is further described in detail below with reference to the accompanying drawings:
it will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Aiming at the problems of numerous motion constraints, high flight cost and poor motion continuity that arise when multiple unmanned aerial vehicles cooperatively execute a coverage task, the invention provides an improved DDPG algorithm based on deep reinforcement learning. The DDPG algorithm is improved by classifying the experience data stored in the experience buffer pool. Collaborative coverage path planning and dynamic adjustment of the unmanned aerial vehicle group are thereby realized, with higher planning efficiency and lower flight cost.
The improved DDPG algorithm model and its application structure are shown in figure 1.
The method specifically comprises the following steps:
defining a Markov model, namely modeling five-element groups (S, A, P, R, gamma) of a Markov decision process, wherein the method comprises the following specific steps of:
step 1.1, determining the state S of the unmanned aerial vehicle:
the whole target area is divided into I multiplied by J cells, m ground nodes with fixed positions and n unmanned aerial vehicles flying at a fixed height H are randomly distributed in the area, and the coordinates of the unmanned aerial vehicle I at the moment t are expressed asThe position coordinate of the u-th ground node is denoted as q u =(x u ,y u ). The fixed total energy of one unmanned aerial vehicle is e max The energy consumption of unmanned plane moving by one unit is e 1 Hovering over a ground node with energy consumption e 2 ,e 1 、e 2 All are constant and the drone must complete the task before the energy is exhausted. Therefore, the unmanned plane i flies from the initial position to the energy consumption +.>The method comprises the following steps:
the communication radius of each unmanned aerial vehicle is fixed to be R s Due to the limitation of communication connectivity, the unmanned aerial vehicle i must always keep the unmanned aerial vehicle j nearest to itself within the communication radius range, and the following formula is given:
min(||p i -p j ||,i≠j)<=R s
the process of the unmanned plane transmitting signals to the ground nodes can generate channel fading, and the signal loss can influence the service quality when the ground nodes are covered. If obstacles such as buildings and trees exist around, extra loss is brought on the basis of channel fading, and the probability formula of the line-of-sight fading LoS link between the unmanned aerial vehicle and the ground is as follows:
where f and g are constants related to the type of environment, H represents the altitude of the drone, d iu The formula is as follows for the horizontal distance between the ith unmanned plane and the ith ground node:
the probability formula for non-line-of-sight fading NLoS links is:
P NLoS =1-P LoS
the LoS and NLoS link loss models are:
where c is the propagation speed of light, f c For carrier frequency omega iu η is the distance between the ith unmanned plane and the ith ground node LoS 、η NLoS Is the extra loss of the line-of-sight fading LoS link and the non-line-of-sight fading NLoS link. Under the LoS and NLoS models, the signal loss formula of the jth ground node is as follows:
L u ≤κ
in order to ensure the service quality in the process of covering the ground nodes by the unmanned plane, the signal loss suffered by each ground node in the process of covering must be smaller than or equal to a certain threshold value k, so that the ground node is successfully covered, otherwise, the coverage of the ground node fails.
The state comprises the following parts: at time t, the position and energy consumption of the unmanned plane i and the signal loss suffered by each ground node. The state of the unmanned plane i at time t is:
step 1.2, determining an action set A of the unmanned aerial vehicle:
the flying speed of the unmanned aerial vehicle i in the flying process is fixed, and the next moving direction can be a t E (0, 2 pi) or hover action a t =0. The hovering action refers to that the current position of the unmanned aerial vehicle needs to be kept unchanged after the unmanned aerial vehicle covers the ground node. Therefore, the unmanned aerial vehicle i operates as follows:
a t ∈[0,2π)
step 1.3, defining a state s of the unmanned aerial vehicle at a time t and taking action a, wherein a state transition probability function P capable of reaching a next input state s' is as follows:
step 1.4, determining a reward function R of the unmanned aerial vehicle:
set the set b= { B of ground node coverage states 1 ,b 2 ,...,b u ,...,b m}. wherein bu The coverage state of the u-th ground node is Boolean domain {0,1}. If b u =1 then this ground node has been covered by the drone, b u And =0 is not covered. Coverage is the ratio of the number of covered ground nodes to the total number of ground nodes, and at time t the coverage is:
the coverage area of each unmanned plane is R c The coverage effect of the unmanned aerial vehicle on the target node is gradually decreased from the center of a circle to the periphery from strong to weak, and when the unmanned aerial vehicle is right above the ground node, the coverage effect is most obvious. Degree of effect of the u-th ground node being first coveredThe formula is:
where lambda is the coverage effect constant.
Planning the optimal path requires that the ground node be transitioned from an initial state to a target state. The initial state of the ground node is an uncovered state, and the target state is a covered state of the unmanned plane. The coverage efficiency E is designed as a cooperative formula of coverage ground node rate and coverage effect c The formula is:
and defining a reward function, and representing feedback obtained after the unmanned aerial vehicle selects a certain action in the current state. The basic rewards formula is:
wherein coverage increment: Δα t =α t -α t-1 Energy consumption increment of the ith unmanned aerial vehicle:
if the forward rewards are given only when the unmanned aerial vehicle group successfully completes the task, the rewards are too sparse, and better results are difficult to obtain in multiple training rounds. So extra rewards and punishments are added, so that rewards are not sparse any more. In the additional punishment setting, when the overall coverage does not reach the expected value alpha ev When the overall coverage reaches our expected value, no penalty will be made; and setting the coverage rewards of each ground node covered by the unmanned aerial vehicle group for the first time to be +0.1, giving punishment to each frame if the unmanned aerial vehicle exceeds the energy consumption budget in the flight process in the process, and giving punishment to-1 if communication between unmanned aerial vehicles is impossible, wherein punishment to-1 is made. The extra prize and punishment amount is r extra The prize value calculation method is as follows:
step 1.5, defining discount factor gamma, wherein gamma is E (0, 1). The cumulative prize value over the course of the process is calculated, and the prize value will give rise to a discount over time, the greater the discount coefficient, the more emphasis is placed on long-term benefits.
Step two, designing a depth deterministic strategy gradient (DDPG) algorithm using basic depth reinforcement learning based on the five-tuple (S, A, P, R, gamma) of the Markov decision process modeled in the step one, wherein the specific steps are as follows:
and 2.1, adopting a Actor-reviewer (Actor-Critic) framework, wherein one network is an Actor and the other network is a reviewer Critic, and the two networks are mutually stimulated to compete with each other. Randomly initializing the network state-behavior value function Q (s, a|θ for Critic networks Q ) Policy function μ (s, a|θμ) of the Actor network, copies weights of the Critic network and the Actor network to target network parameters of the respective networks, i.e., θ Q →θ Q′ 、θ μ →θ μ′, wherein θQ 、θ μ Respectively represent Critic network parameters and Actor network parameters, theta Q′ 、θ μ′ Respectively representing Critic target network parameters and Actor target network parameters.
Step 2.2, when the task starts, the initial state of unmanned aerial vehicle i is as follows:
As the task proceeds, action a_t is taken according to the current state s_t, by the formula:
a_t = μ(s_t|θ^μ) + β
where β is random noise. Executing action a_t yields the reward r_t and the new state s_{t+1}.
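The noisy action selection of step 2.2 can be sketched as follows; the Gaussian form of the noise β and the wrap of the result into the heading range [0, 2π) are assumptions for illustration:

```python
import math
import random

def select_action(actor_mu, state, noise_scale=0.1):
    """a_t = mu(s_t | theta_mu) + beta: the actor's deterministic output plus
    random exploration noise beta (a Gaussian form is assumed here), wrapped
    back into the valid heading range [0, 2*pi)."""
    beta = random.gauss(0.0, noise_scale)
    return (actor_mu(state) + beta) % (2.0 * math.pi)
```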
Step 2.3, obtaining the experience tuple (s_t, a_t, r_t, s_{t+1}) and saving it in the experience pool: the newly stored tuple is placed at the first position in the pool, and the existing tuples are each moved back one position. A batch of samples is then drawn at random from the experience pool for training. Let (s_i, a_i, r_i, s_{i+1}) be a randomly sampled batch; TD-target training is performed, with the target value Y_i expressed as:
Y_i = r_i + γQ′(s_{i+1}, μ′(s_{i+1}|θ^μ′)|θ^Q′)
where μ′ denotes the policy obtained by analyzing s_{i+1}, and Q′ denotes the state-behavior value obtained at s_{i+1} under the policy μ′.
Step 2.4, updating the Critic network by calculating and minimizing the loss function L:

L = (1/N)·Σ_{i=1}^{N} (Y_i − Q(s_i, a_i|θ^Q))²

where N is the number of random samples drawn from the experience pool for a training step.
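Steps 2.3 and 2.4 can be sketched together: computing the TD targets Y_i and the mean-squared critic loss. Here `target_actor`, `target_critic` and `critic` stand in for the networks μ′, Q′ and Q, represented as plain callables for illustration:

```python
def td_targets(batch, gamma, target_actor, target_critic):
    """Y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1})) -- one target per sample."""
    return [r + gamma * target_critic(s2, target_actor(s2))
            for (_s, _a, r, s2) in batch]

def critic_loss(batch, targets, critic):
    """Mean-squared TD error L = (1/N) * sum_i (Y_i - Q(s_i, a_i))^2."""
    n = len(batch)
    return sum((y - critic(s, a)) ** 2
               for (s, a, _r, _s2), y in zip(batch, targets)) / n
```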
Step 2.5, updating the Actor network parameters θ^μ using the policy gradient algorithm; the gradient of the objective function J is:

∇_{θ^μ}J ≈ (1/N)·Σ_i ∇_a Q(s, a|θ^Q)|_{s=s_i, a=μ(s_i)} · ∇_{θ^μ}μ(s|θ^μ)|_{s=s_i}

where ∇_a Q(s, a|θ^Q) is the gradient of the Critic network state-behavior value function, ∇_{θ^μ}μ(s|θ^μ) is the gradient of the Actor network policy function, and μ(s_i) is the action selected by the Actor network for input state s_i.
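The policy-gradient update of step 2.5 can be illustrated for a scalar actor parameter θ; the finite-difference approximation below is only a stand-in for the backpropagation a real DDPG implementation would use:

```python
def actor_gradient(batch, critic_q, actor_mu, theta, eps=1e-5):
    """Deterministic policy gradient
        (1/N) sum_i dQ(s_i, a)/da |_{a=mu(s_i)} * dmu(s_i)/dtheta,
    approximated by central finite differences for a scalar parameter theta
    (illustrative only)."""
    g = 0.0
    for (s, _a, _r, _s2) in batch:
        a = actor_mu(s, theta)
        dq_da = (critic_q(s, a + eps) - critic_q(s, a - eps)) / (2 * eps)
        dmu_dth = (actor_mu(s, theta + eps) - actor_mu(s, theta - eps)) / (2 * eps)
        g += dq_da * dmu_dth
    return g / len(batch)
```

Gradient ascent along this quantity increases the critic's value of the actor's actions, which is the sense in which the Actor is "updated by the gradient of J".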
Step 2.6, calculating the target network values using the duplicate networks; the weight parameters of the target networks are updated with a delay, tracking the learned networks. Meanwhile, the current network parameters are used to gradually update the corresponding Critic and Actor target networks:
θ^Q′ ← τθ^Q + (1 − τ)θ^Q′

θ^μ′ ← τθ^μ + (1 − τ)θ^μ′
where τ represents the update scaling factor, τ ∈ (0, 1).
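The soft target update of step 2.6 reduces to a per-parameter interpolation:

```python
def soft_update(online, target, tau):
    """theta' <- tau * theta + (1 - tau) * theta': each target parameter
    slowly tracks its online counterpart, tau in (0, 1)."""
    return [tau * p + (1.0 - tau) * tp for p, tp in zip(online, target)]
```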
Step three, improving the experience buffer pool of the DDPG algorithm: the experience data stored in the experience buffer are classified and placed into different experience buffer pools. The improved DDPG algorithm alleviates the problem of unstable convergence.
Step 3.1, dividing the experience pool into M_success and M_failure, storing successful and failed flight experience respectively, and setting up a temporary experience pool M_temp that stores the latest flight experience. Once M_temp is full, the earliest experience is taken out on a first-in-first-out basis and stored in the experience pool M_success, while the latest flight experience continues to be stored in M_temp. This is repeated, and finally, according to the final state of the unmanned aerial vehicle, a number of experiences are drawn from M_success and M_failure respectively to train the neural network.
Step 3.2, in order to extract more valuable experience from the pool M_success, sampling from the two experience pools is set in proportion:

η_success = βΦ, η_failure = (1 − β)Φ

where η_success and η_failure are the numbers of samples drawn from M_success and M_failure respectively, Φ is the total number of samples, and β ∈ [0, 1] is the success rate, representing the proportion of experience drawn from M_success.
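The classified experience buffer of steps 3.1 and 3.2 can be sketched as below. The patent text is ambiguous about exactly when experience moves from M_temp into M_success versus M_failure; this sketch routes each episode's experience by its final outcome, which is one plausible reading:

```python
import random
from collections import deque

class ClassifiedReplay:
    """Sketch of the classified experience buffer: fresh experience enters a
    FIFO pool M_temp and is flushed to M_success or M_failure according to the
    episode's final outcome. Sampling draws eta_success = round(beta * phi)
    tuples from M_success and the remaining phi - eta_success from M_failure,
    beta being the success rate."""

    def __init__(self, temp_size=1000):
        self.temp = deque(maxlen=temp_size)   # M_temp, first-in-first-out
        self.success = []                     # M_success
        self.failure = []                     # M_failure

    def store(self, experience):
        self.temp.append(experience)

    def end_episode(self, succeeded):
        (self.success if succeeded else self.failure).extend(self.temp)
        self.temp.clear()

    def sample(self, phi, beta):
        n_s = min(round(beta * phi), len(self.success))
        n_f = min(phi - n_s, len(self.failure))
        return random.sample(self.success, n_s) + random.sample(self.failure, n_f)
```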
And finally, designing a simulation environment, and enabling the unmanned aerial vehicle group to interact with the environment to acquire training data.
The invention can be applied to real scenarios: the unmanned aerial vehicle serves as an aerial base station that enhances the coverage range and performance of the communication network in various scenes. When the terrestrial communication network is interrupted by an accident, unmanned aerial vehicles can be deployed rapidly; by covering ground targets they establish communication links with the ground to transmit data, while interacting cooperatively with the terrestrial network. The planar scene covered cooperatively by the unmanned aerial vehicle group is shown in fig. 2: m fixed-position ground nodes and n unmanned aerial vehicles flying at a fixed height H are randomly distributed in the area; all unmanned aerial vehicles take off from random positions at the same moment, and their paths are planned so as to cooperatively cover the ground nodes under a number of constraints, obtaining optimal paths and providing fast, reliable, economical and efficient data transmission and network communication for the ground.
Compared with the random algorithm, the particle swarm algorithm and the standard DDPG algorithm, the improved DDPG algorithm of the present invention exceeds these algorithms in coverage efficiency and energy-consumption balance, where:
the random algorithm means that at each moment every unmanned aerial vehicle randomly selects a flight direction in the range [0, 2π) as its current action; if the new position would exceed the boundary of the target area, all unmanned aerial vehicles abandon the action and stay in place;
the particle swarm algorithm is a meta-heuristic algorithm commonly used at present for searching optimal paths; it finds an optimal solution by creating a group of random particles and iterating many times. During each iteration, a particle updates itself by tracking two extrema: the best solution it has found itself, and the best solution found so far by the whole population; the extremum of a particle's neighbors can also be used to update it.
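A canonical one-dimensional particle-swarm update of the kind described above looks like this; the inertia weight and acceleration coefficients are typical textbook values, not taken from the patent:

```python
import random

def pso_step(xs, vs, pbest, gbest, w=0.7, c1=1.5, c2=1.5):
    """One iteration of a canonical (1-D) particle swarm update: each particle
    is pulled toward its own best solution pbest and the population best gbest."""
    new_xs, new_vs = [], []
    for x, v, pb in zip(xs, vs, pbest):
        r1, r2 = random.random(), random.random()
        # inertia + cognitive (own best) + social (population best) terms
        v = w * v + c1 * r1 * (pb - x) + c2 * r2 * (gbest - x)
        new_xs.append(x + v)
        new_vs.append(v)
    return new_xs, new_vs
```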
Reference is made to fig. 3 and fig. 4. Comparing the motion paths of the unmanned aerial vehicle groups obtained by the four algorithms, and observing the coverage efficiency at different coverage rates and the balance of energy used during flight, the improved DDPG algorithm of the invention improves the training success rate, converges faster, maximizes coverage efficiency under the same conditions, and effectively balances the flight energy consumption of each unmanned aerial vehicle, avoiding the "barrel effect" of excessive energy consumption by a single unmanned aerial vehicle and further reducing the flight time and cost of the multiple unmanned aerial vehicles.
The foregoing is only a partial embodiment of the present invention, and it should be noted that it will be apparent to those skilled in the art that modifications and adaptations can be made without departing from the principles of the present invention, and such modifications and adaptations are intended to be comprehended within the scope of the present invention.
Claims (5)
1. A multi-unmanned aerial vehicle base station collaborative coverage path planning method based on deep reinforcement learning includes the steps that firstly, a deep reinforcement learning model is designed, then, unmanned aerial vehicle groups interact with the environment in a simulation environment to obtain training data, the training data are sampled for simulation training, and finally, collaborative coverage path planning of target ground nodes is achieved;
the method is characterized by comprising the following steps of:
step one, defining a Markov model: constraint conditions of the unmanned aerial vehicle base station are modeled with the five-tuple (S, A, P, R, γ) of a Markov decision process; the unmanned aerial vehicle base station is a base station carried by an unmanned aerial vehicle, hereinafter referred to as unmanned aerial vehicle;
step two, designing a deep deterministic policy gradient (DDPG) algorithm, which uses basic deep reinforcement learning, based on the five-tuple (S, A, P, R, γ) of the Markov decision process obtained by modeling in step one;
step three, improving an experience buffer pool of the DDPG algorithm, classifying experience data stored in the experience buffer pool, and placing the obtained experience data into different experience buffer pools;
in the first step:
step 1.1, determining the state S of the unmanned aerial vehicle:
m fixed ground nodes and n unmanned aerial vehicles are randomly distributed in a target area;
the unmanned aerial vehicle state S contains: at time t, the position of unmanned aerial vehicle i, its energy consumption, and the signal losses L_1, ..., L_u, ..., L_m experienced by the ground nodes; the state of unmanned aerial vehicle i at time t is expressed as:
the position term is the coordinates of unmanned aerial vehicle i at time t; the energy term is the energy consumed by unmanned aerial vehicle i flying from its initial position to its position at time t;
step 1.2, determining an action set A of the unmanned aerial vehicle:
the flight speed of unmanned aerial vehicle i is fixed during flight, and the next movement is either a direction a_t ∈ (0, 2π) or the hover action a_t = 0; the hover action means that after covering a ground node the unmanned aerial vehicle keeps its current position unchanged; the action of unmanned aerial vehicle i is thus a_t ∈ [0, 2π);
step 1.3, defining the state transition probability function P: the probability that the unmanned aerial vehicle, in state s at time t and taking action a, reaches the next state s′ is:
step 1.4, determining a reward function R of the unmanned aerial vehicle:
let the set of ground node coverage states be B = {b_1, b_2, ..., b_u, ..., b_m}, where b_u is the coverage state of the u-th ground node, taking values in the Boolean domain {0, 1}; if b_u = 1, the ground node has been covered by an unmanned aerial vehicle; if b_u = 0, the ground node has not been covered;
the coverage rate α_t at time t is the ratio of the number of covered ground nodes to the total number m of ground nodes:

α_t = (1/m)·Σ_{u=1}^{m} b_u
the coverage radius of each unmanned aerial vehicle is R_c, and the coverage effect of an unmanned aerial vehicle on a target ground node decreases from strong to weak from the circle centre outward; the degree of effect when the u-th ground node is first covered is given by the formula:
where λ is the coverage effect constant;
an optimal path must be planned so that the ground nodes are converted from the initial state to the target state, the initial state of a ground node being uncovered and the target state being covered by an unmanned aerial vehicle; the coverage efficiency E_c is designed as a formula combining the ground-node coverage rate and the coverage effect:
a reward function is defined, representing the feedback obtained after the unmanned aerial vehicle selects an action in the current state; the base reward formula is:
where the coverage increment is Δα_t = α_t − α_{t−1} and the energy-consumption increment of the i-th unmanned aerial vehicle is defined analogously; the base reward r_t° serves as the reward value of the reward function R;
step 1.5, defining the discount factor γ, where γ ∈ (0, 1); the cumulative reward value over the whole process is calculated; the reward value is discounted over time, and the larger the discount factor, the more emphasis is placed on long-term benefit;
in the second step:
step 2.1, adopting the actor-critic (Actor-Critic) framework, in which one network is the actor (Actor) and the other is the critic (Critic); the two networks stimulate and compete with each other;
the state-behavior value function Q(s, a|θ^Q) of the Critic network and the policy function μ(s|θ^μ) of the Actor network are randomly initialized; the weights of the Critic network and the Actor network are copied to the target network parameters of the respective networks, i.e. θ^Q → θ^Q′ and θ^μ → θ^μ′, where θ^Q and θ^μ denote the Critic network parameters and Actor network parameters, and θ^Q′ and θ^μ′ denote the Critic target network parameters and Actor target network parameters respectively;
step 2.2, when the task starts, the initial state of unmanned aerial vehicle i is as follows
as the task proceeds, action a_t is taken according to the current state s_t, by the formula:
a_t = μ(s_t|θ^μ) + β
wherein β is random noise;
executing action a_t yields the reward r_t and the new state s_{t+1};
step 2.3, obtaining the experience tuple (s_t, a_t, r_t, s_{t+1}) and saving it in the experience pool;
a batch of samples is drawn at random from the experience pool for training; let (s_i, a_i, r_i, s_{i+1}) be a randomly sampled batch; TD-target training is performed, with the target value Y_i expressed as:
Y_i = r_i + γQ′(s_{i+1}, μ′(s_{i+1}|θ^μ′)|θ^Q′)
where μ′ denotes the policy obtained by analyzing s_{i+1}, and Q′ denotes the state-behavior value obtained at s_{i+1} under the policy μ′;
step 2.4, updating the Critic network by calculating and minimizing the loss function L:

L = (1/N)·Σ_{i=1}^{N} (Y_i − Q(s_i, a_i|θ^Q))²

where N is the number of random samples drawn from the experience pool for a training step;
step 2.5, updating the Actor network parameters θ^μ using the policy gradient algorithm; the gradient of the objective function J is:

∇_{θ^μ}J ≈ (1/N)·Σ_i ∇_a Q(s, a|θ^Q)|_{s=s_i, a=μ(s_i)} · ∇_{θ^μ}μ(s|θ^μ)|_{s=s_i}

where ∇_a Q(s, a|θ^Q) is the gradient of the Critic network state-behavior value function, ∇_{θ^μ}μ(s|θ^μ) is the gradient of the Actor network policy function, and μ(s_i) is the action selected by the Actor network for input state s_i;
step 2.6, calculating the target network values using the duplicate networks; the weight parameters of the target networks are updated with a delay, tracking the learned networks; meanwhile, the current network parameters are used to gradually update the corresponding Critic and Actor target networks:
θ^Q′ ← τθ^Q + (1 − τ)θ^Q′

θ^μ′ ← τθ^μ + (1 − τ)θ^μ′
where τ represents the update scaling factor, τ ∈ (0, 1);
in the third step:
step 3.1, dividing the experience pool into M_success and M_failure, which store successful and failed flight experience respectively; a number of experiences are drawn from M_success and M_failure respectively to train the neural network;
step 3.2, sampling from the two experience pools is set in proportion:

η_success = βΦ, η_failure = (1 − β)Φ

where η_success and η_failure are the numbers of samples drawn from M_success and M_failure respectively, Φ is the total number of samples, and β ∈ [0, 1] is the success rate, representing the proportion of experience drawn from M_success.
2. The method for planning the cooperative coverage path of the base stations of the multiple unmanned aerial vehicles based on deep reinforcement learning according to claim 1, wherein in the step 1.1,
the whole target area is divided into I × J cells; m fixed-position ground nodes and n unmanned aerial vehicles flying at a fixed height H are randomly distributed in the area; the coordinates of unmanned aerial vehicle i at time t are expressed as p_i, and the position coordinates of the u-th ground node are denoted q_u = (x_u, y_u);
the fixed total energy of one unmanned aerial vehicle is e_max, the energy consumed by an unmanned aerial vehicle moving one unit is e_1, and the energy consumed hovering over a ground node is e_2, where e_1 and e_2 are both constants; the unmanned aerial vehicle must complete its task before its energy is exhausted;
therefore, the energy consumed by unmanned aerial vehicle i flying from its initial position to its position at time t is:
the communication radius of each unmanned aerial vehicle is fixed at R_s; owing to the communication-connectivity constraint, unmanned aerial vehicle i must always keep the unmanned aerial vehicle j nearest to it within its communication radius, i.e.:

min(||p_i − p_j||, i ≠ j) ≤ R_s

where p_i and p_j denote the positions of unmanned aerial vehicle i and unmanned aerial vehicle j respectively;
channel fading occurs as an unmanned aerial vehicle propagates its signal to a ground node; the probability of a line-of-sight (LoS) fading link between the unmanned aerial vehicle and the ground is:

P_LoS = 1 / (1 + f·exp(−g·(θ_iu − f)))

where f and g are constants related to the type of environment, θ_iu = (180/π)·arctan(H/d_iu) is the elevation angle, H represents the altitude of the unmanned aerial vehicle, and d_iu is the horizontal distance between the i-th unmanned aerial vehicle and the u-th ground node, given by:

d_iu = sqrt((x_i − x_u)² + (y_i − y_u)²)
the probability of non-line-of-sight fading NLoS links is:
P NLoS =1-P LoS
the LoS and NLoS link loss models are:

L_LoS = 20·log10(4π·f_c·ω_iu/c) + η_LoS
L_NLoS = 20·log10(4π·f_c·ω_iu/c) + η_NLoS

where c is the propagation speed of light, f_c is the carrier frequency, ω_iu is the distance between the i-th unmanned aerial vehicle and the u-th ground node, and η_LoS and η_NLoS are the extra losses of the line-of-sight (LoS) and non-line-of-sight (NLoS) fading links respectively;
under the LoS and NLoS models, the signal loss of the u-th ground node is:

L_u = P_LoS·L_LoS + P_NLoS·L_NLoS

to guarantee quality of service while the unmanned aerial vehicles cover the ground nodes, the signal loss experienced by each ground node during coverage must be smaller than or equal to a threshold κ, i.e. L_u ≤ κ, for the ground node to be covered successfully; otherwise coverage of the ground node fails.
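The air-to-ground channel constraint of claim 2 can be sketched with the widely used probabilistic LoS/NLoS model; because the patent's formula images are not reproduced in the text, the exact closed forms of P_LoS and the link losses below are assumptions:

```python
import math

def expected_pathloss(h, d_horiz, f_c, f, g, eta_los, eta_nlos):
    """Expected air-to-ground signal loss L_u for one ground node, in dB,
    under the common probabilistic LoS/NLoS model (closed forms assumed)."""
    c = 3.0e8                                      # propagation speed of light, m/s
    theta = math.degrees(math.atan2(h, d_horiz))   # elevation angle seen by the node
    p_los = 1.0 / (1.0 + f * math.exp(-g * (theta - f)))
    p_nlos = 1.0 - p_los
    omega = math.hypot(h, d_horiz)                 # omega_iu: UAV-to-node distance
    fspl = 20.0 * math.log10(4.0 * math.pi * f_c * omega / c)  # free-space loss
    return p_los * (fspl + eta_los) + p_nlos * (fspl + eta_nlos)
```

A ground node then counts as successfully covered only if this expected loss does not exceed the threshold κ.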
3. The method for planning the cooperative coverage path of multiple unmanned aerial vehicle base stations based on deep reinforcement learning according to claim 1, characterized in that in step 1.4 additional rewards and penalties are further provided, and the sum of the base reward and the additional reward and penalty is used as the reward value of the reward function;
in the additional penalty setting, a negative increasing penalty is applied when the overall coverage does not reach the expected value α_ev, and no penalty is applied when the overall coverage reaches the expected value;
a coverage reward is set for each ground node covered by the unmanned aerial vehicle group for the first time; if an unmanned aerial vehicle exceeds the energy consumption budget during flight, a negative increasing penalty is applied at each frame; and if the unmanned aerial vehicles cannot communicate with each other, a negative additional penalty is applied;
the additional reward and penalty amount is r_extra, and the reward value is r_t = r_t° + r_extra.
4. The method for planning the collaborative coverage path of the multi-unmanned aerial vehicle base station based on deep reinforcement learning according to claim 1, characterized in that in step 2.3 the experience tuples are stored in an experience pool: a newly stored tuple is placed at the first position in the pool, and the existing tuples are each moved back one position.
5. The method for planning a cooperative coverage path of multiple unmanned aerial vehicle base stations based on deep reinforcement learning as claimed in claim 1, wherein in step 3.1 a temporary experience pool M_temp is further provided to store the latest flight experience; once M_temp is full, the earliest experience is taken out on a first-in-first-out basis and stored in the experience pool M_success, while the latest flight experience continues to be stored in M_temp; this is repeated, and finally, according to the final state of the unmanned aerial vehicle, a number of experiences are drawn from M_success and M_failure respectively to train the neural network.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310021781.1A CN116227767A (en) | 2023-01-07 | 2023-01-07 | Multi-unmanned aerial vehicle base station collaborative coverage path planning method based on deep reinforcement learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116227767A true CN116227767A (en) | 2023-06-06 |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116502547A (en) * | 2023-06-29 | 2023-07-28 | 深圳大学 | Multi-unmanned aerial vehicle wireless energy transmission method based on graph reinforcement learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||