CN116227767A - Multi-unmanned aerial vehicle base station collaborative coverage path planning method based on deep reinforcement learning - Google Patents
- Publication number
- CN116227767A CN116227767A CN202310021781.1A CN202310021781A CN116227767A CN 116227767 A CN116227767 A CN 116227767A CN 202310021781 A CN202310021781 A CN 202310021781A CN 116227767 A CN116227767 A CN 116227767A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/04—Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
- G06Q10/047—Optimisation of routes or paths, e.g. travelling salesman problem
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/20—Design optimisation, verification or simulation
- G06F30/27—Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2111/00—Details relating to CAD techniques
- G06F2111/08—Probabilistic or stochastic CAD
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/70—Reducing energy consumption in communication networks in wireless communication networks
Abstract
The invention discloses a multi-unmanned aerial vehicle base station collaborative coverage path planning method based on deep reinforcement learning, which comprises the following steps: first, a Markov model based on deep reinforcement learning is defined and the five-tuple of the Markov decision process is modeled; a deep deterministic policy gradient (DDPG) algorithm is then designed on the basis of this model; next, the experience buffer pool of the DDPG algorithm is improved by classifying the experience data stored in the buffer pool and placing it into different experience buffer pools, and the improved DDPG algorithm resolves the problem of unstable convergence; finally, a simulation environment is designed in which the unmanned aerial vehicle group interacts with the environment to obtain training data. Through this method, the target task of cooperative coverage of ground nodes by the unmanned aerial vehicle group under multiple constraint conditions is realized with higher planning efficiency and lower flight cost.
Description
Technical Field
The invention provides a multi-unmanned aerial vehicle base station collaborative coverage path planning method based on deep reinforcement learning, and belongs to the field of artificial intelligence.
Background
Unmanned aerial vehicles have the advantages of high maneuverability, flexible deployment and low cost, and are widely applied in industries such as terrain coverage, agricultural production, environmental reconnaissance, air rescue and disaster early warning. A drone can serve as an aerial base station to extend the coverage and performance of a communication network in many scenarios. When the ground communication network is unexpectedly interrupted, drones can be deployed quickly to establish communication links with the ground, transmit data, and cooperate with the ground network. The coverage path planning algorithm is a key technology for applying unmanned aerial vehicles successfully in such complex scenarios.
When planning paths for drones to cover ground nodes, the energy constraints of each drone must be considered; at the same time, the drone must maintain signal transmission with the ground base station while executing its task, and this transmission suffers losses that degrade the coverage quality of service. Moreover, a single drone is difficult to apply to large-scale ground coverage tasks because of its energy and communication constraints; cooperative flight of multiple drones is an effective scheme for large-scale coverage tasks, but communication between the drones must be maintained at all times. How to efficiently realize cooperative coverage of ground nodes under the constraints of limited energy consumption, limited communication distance and signal transmission loss is therefore a challenging theoretical and practical problem.
Disclosure of Invention
In order to solve the problem of realizing efficient collaborative coverage of the ground under multiple constraint conditions, the invention provides a multi-unmanned aerial vehicle base station collaborative coverage path planning method based on deep reinforcement learning, which specifically comprises the following steps:
Step one, defining a Markov model, namely modeling the five-tuple (S, A, P, R, γ) of the Markov decision process;
Step two, designing a deep deterministic policy gradient (DDPG) algorithm of basic deep reinforcement learning based on the five-tuple (S, A, P, R, γ) obtained by modeling in step one;
Step three, improving the experience buffer pool of the DDPG algorithm by classifying the experience data stored in the buffer pool and placing it into different experience buffer pools;
Step four, designing a simulation environment in which the unmanned aerial vehicle group interacts with the environment to obtain training data, sampling the training data for simulation training, and planning the collaborative coverage path over the target ground nodes.
The specific steps of the first step comprise:
Step 1.1, determining the state S of the unmanned aerial vehicle:
The whole target area is divided into I × J cells, in which m ground nodes with fixed positions and n unmanned aerial vehicles flying at a fixed height H are randomly distributed. The coordinates of unmanned aerial vehicle i at time t are expressed as p_i^t = (x_i^t, y_i^t, H), and the position of the u-th ground node is denoted q_u = (x_u, y_u). The fixed total energy of each unmanned aerial vehicle is e_max; moving one unit consumes energy e_1 and hovering over a ground node consumes e_2, where e_1 and e_2 are constants, and the drone must complete its task before its energy is exhausted. The energy e_i^t consumed by unmanned aerial vehicle i from its initial position up to time t is therefore:
e_i^t = e_1 · n_move + e_2 · n_hover ≤ e_max
where n_move and n_hover denote the numbers of unit moves and hovering steps performed by drone i up to time t.
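The energy accounting above can be sketched in a few lines. This is a minimal illustration, assuming the stated per-step costs; the function names and the move/hover step counts are not from the patent:

```python
# Hypothetical helpers for the per-UAV energy model: e1 is the cost of one
# unit move, e2 the cost of one hover step, e_max the fixed total energy.
def energy_consumed(n_moves: int, n_hovers: int, e1: float, e2: float) -> float:
    """Energy used after n_moves unit moves and n_hovers hover steps."""
    return e1 * n_moves + e2 * n_hovers

def within_budget(n_moves: int, n_hovers: int,
                  e1: float, e2: float, e_max: float) -> bool:
    # The drone must complete its task before its energy is exhausted.
    return energy_consumed(n_moves, n_hovers, e1, e2) <= e_max
```

For example, a drone that has moved 10 units and hovered 5 steps with e1 = 1.0 and e2 = 0.5 has consumed 12.5 units of energy.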
The communication radius of each unmanned aerial vehicle is fixed at R_s. Because of the communication connectivity constraint, each unmanned aerial vehicle i must always keep the unmanned aerial vehicle j nearest to it within the communication radius, which gives:
min_{j ≠ i} ‖p_i − p_j‖ ≤ R_s
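The connectivity constraint can be checked directly from the fleet positions. A hypothetical sketch, assuming planar (x, y) positions:

```python
import math

def connected(positions, r_s):
    """Check the constraint min_{j != i} ||p_i - p_j|| <= R_s for every
    UAV i, i.e. every drone's nearest neighbour lies within radius r_s."""
    for i, p in enumerate(positions):
        nearest = min(
            math.dist(p, q) for j, q in enumerate(positions) if j != i
        )
        if nearest > r_s:
            return False
    return True
```

A chain of drones spaced one unit apart is connected for R_s = 1.5, while moving one drone far away breaks the constraint.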
Transmitting signals from the unmanned aerial vehicle to the ground nodes causes channel fading, and the resulting signal loss affects the quality of service when covering the ground nodes. Surrounding obstacles such as buildings and trees add extra loss on top of the channel fading. The probability of a line-of-sight (LoS) fading link between the unmanned aerial vehicle and the ground is:
P_LoS = 1 / (1 + f · exp(−g(θ_iu − f)))
where f and g are constants related to the type of environment, θ_iu = arctan(H / d_iu) is the elevation angle, H represents the altitude of the drone, and d_iu is the horizontal distance between the i-th unmanned aerial vehicle and the u-th ground node:
d_iu = √((x_i − x_u)² + (y_i − y_u)²)
The probability of a non-line-of-sight (NLoS) fading link is:
P_NLoS = 1 − P_LoS
The LoS and NLoS link loss models are:
L_LoS = 20 log₁₀(4π f_c ω_iu / c) + η_LoS
L_NLoS = 20 log₁₀(4π f_c ω_iu / c) + η_NLoS
where c is the propagation speed of light, f_c is the carrier frequency, ω_iu = √(H² + d_iu²) is the distance between the i-th unmanned aerial vehicle and the u-th ground node, and η_LoS, η_NLoS are the extra losses of the LoS and NLoS fading links. Under the LoS and NLoS models, the signal loss of the u-th ground node is:
L_u = P_LoS · L_LoS + P_NLoS · L_NLoS
To guarantee the quality of service while the unmanned aerial vehicles cover the ground nodes, the signal loss suffered by each ground node during coverage must be less than or equal to a threshold κ:
L_u ≤ κ
Only then is the ground node successfully covered; otherwise its coverage fails.
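A hedged sketch of the air-to-ground channel model above. The elevation-angle form of the LoS probability is a common modelling assumption (the patent gives only the environment constants f and g), and the sample constants in the test are illustrative, not the patent's values:

```python
import math

def p_los(h, d_iu, f, g):
    # LoS probability as a function of the elevation angle (assumed form;
    # f and g are environment-type constants).
    theta = math.degrees(math.atan2(h, d_iu))  # elevation angle in degrees
    return 1.0 / (1.0 + f * math.exp(-g * (theta - f)))

def mean_path_loss(h, d_iu, f_c, f, g, eta_los, eta_nlos):
    c = 3e8  # propagation speed of light, m/s
    omega = math.hypot(h, d_iu)  # 3D UAV-to-node distance omega_iu
    fspl = 20 * math.log10(4 * math.pi * f_c * omega / c)  # free-space term, dB
    p = p_los(h, d_iu, f, g)
    # L_u = P_LoS * L_LoS + P_NLoS * L_NLoS
    return p * (fspl + eta_los) + (1 - p) * (fspl + eta_nlos)

def covered(loss, kappa):
    # A node is successfully covered only if its signal loss stays within kappa.
    return loss <= kappa
```

As expected, a node far from the drone suffers a larger expected loss than a node almost directly below it, and fails coverage under a tight threshold.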
The state comprises the following parts: at time t, the position and energy consumption of unmanned aerial vehicle i and the signal loss suffered by each ground node. The state of unmanned aerial vehicle i at time t is therefore:
s_i^t = (p_i^t, e_i^t, L_1, …, L_m)
Step 1.2, determining the action set A of the unmanned aerial vehicle:
The flying speed of unmanned aerial vehicle i is fixed during flight. At each step the drone either moves in a direction a_t ∈ (0, 2π) or takes the hovering action a_t = 0, where hovering means the drone keeps its current position unchanged after it has covered a ground node. The action of unmanned aerial vehicle i is therefore:
a_t ∈ [0, 2π)
Step 1.3, defining the state transition probability function P: when the unmanned aerial vehicle is in state s at time t and takes action a, the probability of reaching the next state s′ is:
P(s′ | s, a) = Pr(S_{t+1} = s′ | S_t = s, A_t = a)
Step 1.4, determining the reward function R of the unmanned aerial vehicle:
Let B = {b_1, b_2, …, b_u, …, b_m} be the set of ground node coverage states, where b_u ∈ {0, 1} is the Boolean coverage state of the u-th ground node: b_u = 1 means the node has been covered by a drone and b_u = 0 means it has not. The coverage rate is the ratio of the number of covered ground nodes to the total number of ground nodes; at time t it is:
α_t = (1/m) Σ_{u=1}^{m} b_u
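The coverage rate α_t follows directly from the Boolean coverage states; a trivial sketch (the function name is assumed):

```python
def coverage_rate(b):
    """alpha_t: fraction of ground nodes already covered, given the
    coverage-state list B = [b_1, ..., b_m] with b_u in {0, 1}."""
    return sum(b) / len(b)
```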
The coverage radius of each unmanned aerial vehicle is R_c. The coverage effect on a target node weakens gradually from the circle centre to the periphery, and is strongest when the drone is directly above the ground node. The degree of effect obtained when the u-th ground node is first covered decreases accordingly with this distance, where λ is the coverage effect constant.
Planning the optimal path requires the ground nodes to transition from the initial state to the target state: the initial state of a ground node is uncovered, and the target state is covered by a drone. The coverage efficiency E_c is designed as a combined measure of the coverage rate of ground nodes and the coverage effect.
A reward function is then defined, representing the feedback the unmanned aerial vehicle obtains after selecting an action in the current state. The basic reward is built from the coverage increment Δα_t = α_t − α_{t−1} and the energy consumption increment of the i-th unmanned aerial vehicle, Δe_i^t = e_i^t − e_i^{t−1}.
If a positive reward were given only when the unmanned aerial vehicle group successfully completes the whole task, rewards would be too sparse and good results would be hard to obtain over many training rounds. Extra rewards and penalties are therefore added so the reward is no longer sparse. In the extra-penalty setting, a penalty is given when the overall coverage rate does not reach the expected value α_ev, and no penalty is applied once coverage reaches the expected value; a coverage reward of +0.1 is granted each time the group covers a ground node for the first time; a penalty is given in every frame in which a drone exceeds its energy consumption budget; and a penalty of −1 is given whenever communication between unmanned aerial vehicles becomes impossible. The total extra reward and penalty amount is r_extra.
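The reward-shaping terms above can be sketched as one function. This is a hedged illustration: the +0.1 and −1 values come from the text, but the other penalty magnitudes are assumptions, not the patent's exact values:

```python
def extra_reward(coverage, expected_coverage, newly_covered,
                 over_energy_budget, comms_broken):
    """Sketch of r_extra: shaping terms added to the sparse base reward.
    Penalty magnitudes other than +0.1 and -1 are illustrative assumptions."""
    r = 0.0
    if coverage < expected_coverage:
        r -= 1.0              # coverage below expected value alpha_ev (assumed -1)
    r += 0.1 * newly_covered  # +0.1 per node covered for the first time
    if over_energy_budget:
        r -= 1.0              # energy budget exceeded this frame (assumed -1)
    if comms_broken:
        r -= 1.0              # inter-UAV communication lost: penalty of -1
    return r
```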
Step 1.5, defining the discount factor γ, with γ ∈ (0, 1). The cumulative reward over the whole process is computed with rewards discounted over time: the larger the discount factor, the more emphasis is placed on long-term benefit.
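As a small illustration of the discount factor (the function name is assumed), the cumulative discounted reward is:

```python
def discounted_return(rewards, gamma):
    """Cumulative discounted reward sum_k gamma**k * r_k. A larger gamma
    weights long-term benefit more heavily."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))
```

A reward arriving two steps in the future counts for more under γ = 0.99 than under γ = 0.9.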
The specific steps of the second step comprise:
Step 2.1, an Actor-Critic framework is adopted: one network is the actor (Actor) and the other is the critic (Critic), and the two networks improve each other through their interaction. The Critic network's state-action value function Q(s, a | θ^Q) and the Actor network's policy function μ(s | θ^μ) are randomly initialized, and the weights of the Critic and Actor networks are copied to the target network parameters of the respective networks, i.e., θ^Q → θ^{Q′} and θ^μ → θ^{μ′}, where θ^Q and θ^μ denote the Critic and Actor network parameters, and θ^{Q′} and θ^{μ′} denote the Critic and Actor target network parameters.
Step 2.2, when the task starts, unmanned aerial vehicle i is in its initial state s_i^0. As the task proceeds, the action a_t is selected according to the current state s_t:
a_t = μ(s_t | θ^μ) + β
where β is random exploration noise. Executing action a_t yields the reward r_t and the new state s_{t+1}.
Step 2.3, the experience tuple (s_t, a_t, r_t, s_{t+1}) is obtained and stored in an experience pool: the newest tuple is stored at the first position and the existing tuples in the pool are each shifted back one position. A batch of samples is then randomly drawn from the experience pool for training. Let (s_i, a_i, r_i, s_{i+1}) be a randomly sampled batch; TD-target training is carried out, with the target value Y_i expressed as:
Y_i = r_i + γ Q′(s_{i+1}, μ′(s_{i+1} | θ^{μ′}) | θ^{Q′})
where μ′ denotes the target policy obtained for s_{i+1} and Q′ denotes the state-action value obtained at s_{i+1} under the μ′ policy.
Step 2.4, the Critic network is updated by minimizing the loss function L:
L = (1/N) Σ_i (Y_i − Q(s_i, a_i | θ^Q))²
where N is the number of random samples drawn from the experience pool, i.e. the minibatch size.
Step 2.5, the Actor network parameters θ^μ are updated with the sampled policy gradient:
∇_{θ^μ} J ≈ (1/N) Σ_i ∇_a Q(s, a | θ^Q)|_{s=s_i, a=μ(s_i)} · ∇_{θ^μ} μ(s | θ^μ)|_{s=s_i}
where ∇_a Q(s, a | θ^Q) is the gradient of the Critic network's state-action value function with respect to the action, ∇_{θ^μ} μ(s | θ^μ) is the gradient of the Actor network's policy function with respect to its parameters, and μ(s_i) is the action selected by the Actor network for input state s_i.
Step 2.6, the target network values are computed with the duplicate (target) networks, whose weight parameters are updated by slowly tracking the learned networks. The corresponding Critic and Actor target networks are gradually updated from the current network parameters:
θ^{Q′} ← τθ^Q + (1 − τ)θ^{Q′}
θ^{μ′} ← τθ^μ + (1 − τ)θ^{μ′}
where τ ∈ (0, 1) is the update scaling factor.
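The soft target update is a per-parameter interpolation; a minimal sketch over flat parameter lists (names assumed):

```python
def soft_update(target_params, online_params, tau):
    """Soft update theta' <- tau*theta + (1 - tau)*theta', applied
    element-wise; tau in (0, 1) controls how fast the target tracks."""
    return [tau * p + (1.0 - tau) * tp
            for tp, p in zip(target_params, online_params)]
```

With τ = 1 the target is copied outright; small τ gives the slow tracking the method relies on.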
The specific steps of the third step comprise:
Step 3.1, the experience pool is divided into M_success and M_failure, which store the experiences of successful and failed flights respectively, and a temporary experience pool M_temp stores the latest flight experience. Once M_temp is full, the earliest experiences are removed from it first-in first-out and, according to the final state of the unmanned aerial vehicles, placed into M_success or M_failure, while the latest flight experience continues to be stored in M_temp. This process is repeated, and finally a number of experiences are drawn from M_success and M_failure respectively to train the neural network.
Step 3.2, in order to draw more valuable experiences from the experience pool M_success, proportional sampling from the two experience pools is set as:
η_success = β · φ,  η_failure = (1 − β) · φ
where η_success and η_failure are the numbers of samples drawn from M_success and M_failure, φ is the total number of samples, and β ∈ [0, 1] is the success rate, representing the proportion of experiences taken from M_success.
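The classified buffer and its proportional sampling can be sketched as follows. This is one reading of the text under stated assumptions: the class and method names are illustrative, and experiences are routed to M_success or M_failure at episode end according to the final state:

```python
import random
from collections import deque

class ClassifiedReplay:
    """Sketch of the improved experience buffer: a FIFO temporary pool
    M_temp plus success/failure pools M_success and M_failure."""

    def __init__(self, temp_capacity):
        self.temp = deque()
        self.temp_capacity = temp_capacity
        self.success = []
        self.failure = []

    def store(self, transition):
        # M_temp keeps only the newest experiences, first-in first-out.
        if len(self.temp) == self.temp_capacity:
            self.temp.popleft()
        self.temp.append(transition)

    def end_episode(self, succeeded):
        # Classify the buffered experiences by the UAVs' final state.
        pool = self.success if succeeded else self.failure
        pool.extend(self.temp)
        self.temp.clear()

    def sample(self, phi, beta):
        # eta_success = beta * phi, eta_failure = (1 - beta) * phi,
        # the proportional rule reconstructed from the text.
        n_succ = min(int(beta * phi), len(self.success))
        n_fail = min(phi - n_succ, len(self.failure))
        return (random.sample(self.success, n_succ)
                + random.sample(self.failure, n_fail))
```

Raising β skews each training batch toward successful flight experience, which is the intent of the improved DDPG buffer.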
The technical scheme of the invention has the following advantages:
1. The method establishes a multi-UAV collaborative coverage scene model in which the unmanned aerial vehicle group interacts with the environment to obtain training data and autonomously plans an optimal path. The simulation environment built in this process has high practical application value.
2. The method uses the deep deterministic policy gradient (DDPG) algorithm and improves it by classifying the experience data stored in the experience buffer pool; this effectively handles the continuous control of the unmanned aerial vehicles, raises the rate of successful samples collected during the task, and yields better convergence.
3. The method achieves better coverage efficiency and balances the overall energy consumption, so the flight cost of the task is lower and its completion time shorter.
Through this method, the target task of cooperative coverage of ground nodes by the unmanned aerial vehicle group under multiple constraint conditions is realized with higher planning efficiency and lower flight cost.
Drawings
FIG. 1 is a flow chart of the overall method of the present invention;
FIG. 2 is a schematic illustration of an application scenario of the present invention;
FIG. 3 compares the coverage efficiency of four unmanned aerial vehicle groups at different coverage rates under the same algorithm;
FIG. 4 compares the balance of the energy used by the unmanned aerial vehicle group during flight under four algorithms.
Detailed Description
The technical scheme of the invention is further described in detail below with reference to the accompanying drawings:
it will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Aiming at the problems of numerous motion constraints, high flight cost and poor motion continuity that arise when multiple unmanned aerial vehicles cooperatively execute a coverage task, the invention provides an improved DDPG algorithm based on deep reinforcement learning. The DDPG algorithm is improved by classifying the experience data stored in the experience buffer pool. Collaborative coverage path planning and dynamic adjustment of the unmanned aerial vehicle group are thereby realized, with higher planning efficiency and lower flight cost.
The improved DDPG algorithm model and its application structure are shown in figure 1.
The method specifically comprises the following steps:
defining a Markov model, namely modeling five-element groups (S, A, P, R, gamma) of a Markov decision process, wherein the method comprises the following specific steps of:
step 1.1, determining the state S of the unmanned aerial vehicle:
the whole target area is divided into I multiplied by J cells, m ground nodes with fixed positions and n unmanned aerial vehicles flying at a fixed height H are randomly distributed in the area, and the coordinates of the unmanned aerial vehicle I at the moment t are expressed asThe position coordinate of the u-th ground node is denoted as q u =(x u ,y u ). The fixed total energy of one unmanned aerial vehicle is e max The energy consumption of unmanned plane moving by one unit is e 1 Hovering over a ground node with energy consumption e 2 ,e 1 、e 2 All are constant and the drone must complete the task before the energy is exhausted. Therefore, the unmanned plane i flies from the initial position to the energy consumption +.>The method comprises the following steps:
the communication radius of each unmanned aerial vehicle is fixed to be R s Due to the limitation of communication connectivity, the unmanned aerial vehicle i must always keep the unmanned aerial vehicle j nearest to itself within the communication radius range, and the following formula is given:
min(||p i -p j ||,i≠j)<=R s
the process of the unmanned plane transmitting signals to the ground nodes can generate channel fading, and the signal loss can influence the service quality when the ground nodes are covered. If obstacles such as buildings and trees exist around, extra loss is brought on the basis of channel fading, and the probability formula of the line-of-sight fading LoS link between the unmanned aerial vehicle and the ground is as follows:
where f and g are constants related to the type of environment, H represents the altitude of the drone, d iu The formula is as follows for the horizontal distance between the ith unmanned plane and the ith ground node:
the probability formula for non-line-of-sight fading NLoS links is:
P NLoS =1-P LoS
the LoS and NLoS link loss models are:
where c is the propagation speed of light, f c For carrier frequency omega iu η is the distance between the ith unmanned plane and the ith ground node LoS 、η NLoS Is the extra loss of the line-of-sight fading LoS link and the non-line-of-sight fading NLoS link. Under the LoS and NLoS models, the signal loss formula of the jth ground node is as follows:
L u ≤κ
in order to ensure the service quality in the process of covering the ground nodes by the unmanned plane, the signal loss suffered by each ground node in the process of covering must be smaller than or equal to a certain threshold value k, so that the ground node is successfully covered, otherwise, the coverage of the ground node fails.
The state comprises the following parts: at time t, the position and energy consumption of the unmanned plane i and the signal loss suffered by each ground node. The state of the unmanned plane i at time t is:
step 1.2, determining an action set A of the unmanned aerial vehicle:
the flying speed of the unmanned aerial vehicle i in the flying process is fixed, and the next moving direction can be a t E (0, 2 pi) or hover action a t =0. The hovering action refers to that the current position of the unmanned aerial vehicle needs to be kept unchanged after the unmanned aerial vehicle covers the ground node. Therefore, the unmanned aerial vehicle i operates as follows:
a t ∈[0,2π)
step 1.3, defining a state s of the unmanned aerial vehicle at a time t and taking action a, wherein a state transition probability function P capable of reaching a next input state s' is as follows:
step 1.4, determining a reward function R of the unmanned aerial vehicle:
set the set b= { B of ground node coverage states 1 ,b 2 ,...,b u ,...,b m}. wherein bu The coverage state of the u-th ground node is Boolean domain {0,1}. If b u =1 then this ground node has been covered by the drone, b u And =0 is not covered. Coverage is the ratio of the number of covered ground nodes to the total number of ground nodes, and at time t the coverage is:
the coverage area of each unmanned plane is R c The coverage effect of the unmanned aerial vehicle on the target node is gradually decreased from the center of a circle to the periphery from strong to weak, and when the unmanned aerial vehicle is right above the ground node, the coverage effect is most obvious. Degree of effect of the u-th ground node being first coveredThe formula is:
where lambda is the coverage effect constant.
Planning the optimal path requires that the ground node be transitioned from an initial state to a target state. The initial state of the ground node is an uncovered state, and the target state is a covered state of the unmanned plane. The coverage efficiency E is designed as a cooperative formula of coverage ground node rate and coverage effect c The formula is:
and defining a reward function, and representing feedback obtained after the unmanned aerial vehicle selects a certain action in the current state. The basic rewards formula is:
wherein coverage increment: Δα t =α t -α t-1 Energy consumption increment of the ith unmanned aerial vehicle:
if the forward rewards are given only when the unmanned aerial vehicle group successfully completes the task, the rewards are too sparse, and better results are difficult to obtain in multiple training rounds. So extra rewards and punishments are added, so that rewards are not sparse any more. In the additional punishment setting, when the overall coverage does not reach the expected value alpha ev When the overall coverage reaches our expected value, no penalty will be made; and setting the coverage rewards of each ground node covered by the unmanned aerial vehicle group for the first time to be +0.1, giving punishment to each frame if the unmanned aerial vehicle exceeds the energy consumption budget in the flight process in the process, and giving punishment to-1 if communication between unmanned aerial vehicles is impossible, wherein punishment to-1 is made. The extra prize and punishment amount is r extra The prize value calculation method is as follows:
step 1.5, defining discount factor gamma, wherein gamma is E (0, 1). The cumulative prize value over the course of the process is calculated, and the prize value will give rise to a discount over time, the greater the discount coefficient, the more emphasis is placed on long-term benefits.
Step two, designing a depth deterministic strategy gradient (DDPG) algorithm using basic depth reinforcement learning based on the five-tuple (S, A, P, R, gamma) of the Markov decision process modeled in the step one, wherein the specific steps are as follows:
and 2.1, adopting a Actor-reviewer (Actor-Critic) framework, wherein one network is an Actor and the other network is a reviewer Critic, and the two networks are mutually stimulated to compete with each other. Randomly initializing the network state-behavior value function Q (s, a|θ for Critic networks Q ) Policy function μ (s, a|θμ) of the Actor network, copies weights of the Critic network and the Actor network to target network parameters of the respective networks, i.e., θ Q →θ Q′ 、θ μ →θ μ′, wherein θQ 、θ μ Respectively represent Critic network parameters and Actor network parameters, theta Q′ 、θ μ′ Respectively representing Critic target network parameters and Actor target network parameters.
Step 2.2, when the task starts, the initial state of unmanned aerial vehicle i is as follows:
As the task proceeds, action a_t is taken according to the current state s_t, by the formula:
a_t = μ(s_t|θ^μ) + β
where β is random noise. Executing action a_t yields the reward r_t and the new state s_{t+1}.
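The noisy action selection of step 2.2 can be sketched as follows; the Gaussian form of the noise β and the wrap of the result into the heading range [0, 2π) are assumptions for illustration:

```python
import math
import random

def select_action(actor_mu, state, noise_scale=0.1):
    """a_t = mu(s_t | theta_mu) + beta: the actor's deterministic output plus
    random exploration noise beta (a Gaussian form is assumed here), wrapped
    back into the valid heading range [0, 2*pi)."""
    beta = random.gauss(0.0, noise_scale)
    return (actor_mu(state) + beta) % (2.0 * math.pi)
```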
Step 2.3, obtaining the experience tuple (s_t, a_t, r_t, s_{t+1}) and saving it in the experience pool: the newly stored tuple is placed at the first position in the pool, and the existing tuples are each moved back one position. A batch of samples is then drawn at random from the experience pool for training. Let (s_i, a_i, r_i, s_{i+1}) be a randomly sampled batch; TD-target training is performed, with the target value Y_i expressed as:
Y_i = r_i + γQ′(s_{i+1}, μ′(s_{i+1}|θ^μ′)|θ^Q′)
where μ′ denotes the policy obtained by analyzing s_{i+1}, and Q′ denotes the state-behavior value obtained at s_{i+1} under the policy μ′.
Step 2.4, updating the Critic network by calculating and minimizing the loss function L:

L = (1/N)·Σ_{i=1}^{N} (Y_i − Q(s_i, a_i|θ^Q))²

where N is the number of random samples drawn from the experience pool for a training step.
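Steps 2.3 and 2.4 can be sketched together: computing the TD targets Y_i and the mean-squared critic loss. Here `target_actor`, `target_critic` and `critic` stand in for the networks μ′, Q′ and Q, represented as plain callables for illustration:

```python
def td_targets(batch, gamma, target_actor, target_critic):
    """Y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1})) -- one target per sample."""
    return [r + gamma * target_critic(s2, target_actor(s2))
            for (_s, _a, r, s2) in batch]

def critic_loss(batch, targets, critic):
    """Mean-squared TD error L = (1/N) * sum_i (Y_i - Q(s_i, a_i))^2."""
    n = len(batch)
    return sum((y - critic(s, a)) ** 2
               for (s, a, _r, _s2), y in zip(batch, targets)) / n
```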
Step 2.5, updating the Actor network parameters θ^μ using the policy gradient algorithm; the gradient of the objective function J is:

∇_{θ^μ}J ≈ (1/N)·Σ_i ∇_a Q(s, a|θ^Q)|_{s=s_i, a=μ(s_i)} · ∇_{θ^μ}μ(s|θ^μ)|_{s=s_i}

where ∇_a Q(s, a|θ^Q) is the gradient of the Critic network state-behavior value function, ∇_{θ^μ}μ(s|θ^μ) is the gradient of the Actor network policy function, and μ(s_i) is the action selected by the Actor network for input state s_i.
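The policy-gradient update of step 2.5 can be illustrated for a scalar actor parameter θ; the finite-difference approximation below is only a stand-in for the backpropagation a real DDPG implementation would use:

```python
def actor_gradient(batch, critic_q, actor_mu, theta, eps=1e-5):
    """Deterministic policy gradient
        (1/N) sum_i dQ(s_i, a)/da |_{a=mu(s_i)} * dmu(s_i)/dtheta,
    approximated by central finite differences for a scalar parameter theta
    (illustrative only)."""
    g = 0.0
    for (s, _a, _r, _s2) in batch:
        a = actor_mu(s, theta)
        dq_da = (critic_q(s, a + eps) - critic_q(s, a - eps)) / (2 * eps)
        dmu_dth = (actor_mu(s, theta + eps) - actor_mu(s, theta - eps)) / (2 * eps)
        g += dq_da * dmu_dth
    return g / len(batch)
```

Gradient ascent along this quantity increases the critic's value of the actor's actions, which is the sense in which the Actor is "updated by the gradient of J".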
Step 2.6, calculating the target network values using the duplicate networks; the weight parameters of the target networks are updated with a delay, tracking the learned networks. Meanwhile, the current network parameters are used to gradually update the corresponding Critic and Actor target networks:
θ^Q′ ← τθ^Q + (1 − τ)θ^Q′

θ^μ′ ← τθ^μ + (1 − τ)θ^μ′
where τ represents the update scaling factor, τ ∈ (0, 1).
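The soft target update of step 2.6 reduces to a per-parameter interpolation:

```python
def soft_update(online, target, tau):
    """theta' <- tau * theta + (1 - tau) * theta': each target parameter
    slowly tracks its online counterpart, tau in (0, 1)."""
    return [tau * p + (1.0 - tau) * tp for p, tp in zip(online, target)]
```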
Step three, improving the experience buffer pool of the DDPG algorithm: the experience data stored in the experience buffer are classified and placed into different experience buffer pools. The improved DDPG algorithm alleviates the problem of unstable convergence.
Step 3.1, dividing the experience pool into M_success and M_failure, storing successful and failed flight experience respectively, and setting up a temporary experience pool M_temp that stores the latest flight experience. Once M_temp is full, the earliest experience is taken out on a first-in-first-out basis and stored in the experience pool M_success, while the latest flight experience continues to be stored in M_temp. This is repeated, and finally, according to the final state of the unmanned aerial vehicle, a number of experiences are drawn from M_success and M_failure respectively to train the neural network.
Step 3.2, in order to extract more valuable experience from the pool M_success, sampling from the two experience pools is set in proportion:

η_success = βΦ, η_failure = (1 − β)Φ

where η_success and η_failure are the numbers of samples drawn from M_success and M_failure respectively, Φ is the total number of samples, and β ∈ [0, 1] is the success rate, representing the proportion of experience drawn from M_success.
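The classified experience buffer of steps 3.1 and 3.2 can be sketched as below. The patent text is ambiguous about exactly when experience moves from M_temp into M_success versus M_failure; this sketch routes each episode's experience by its final outcome, which is one plausible reading:

```python
import random
from collections import deque

class ClassifiedReplay:
    """Sketch of the classified experience buffer: fresh experience enters a
    FIFO pool M_temp and is flushed to M_success or M_failure according to the
    episode's final outcome. Sampling draws eta_success = round(beta * phi)
    tuples from M_success and the remaining phi - eta_success from M_failure,
    beta being the success rate."""

    def __init__(self, temp_size=1000):
        self.temp = deque(maxlen=temp_size)   # M_temp, first-in-first-out
        self.success = []                     # M_success
        self.failure = []                     # M_failure

    def store(self, experience):
        self.temp.append(experience)

    def end_episode(self, succeeded):
        (self.success if succeeded else self.failure).extend(self.temp)
        self.temp.clear()

    def sample(self, phi, beta):
        n_s = min(round(beta * phi), len(self.success))
        n_f = min(phi - n_s, len(self.failure))
        return random.sample(self.success, n_s) + random.sample(self.failure, n_f)
```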
And finally, designing a simulation environment, and enabling the unmanned aerial vehicle group to interact with the environment to acquire training data.
The invention can be applied to real scenarios: the unmanned aerial vehicle serves as an aerial base station that enhances the coverage range and performance of the communication network in various scenes. When the terrestrial communication network is interrupted by an accident, unmanned aerial vehicles can be deployed rapidly; by covering ground targets they establish communication links with the ground to transmit data, while interacting cooperatively with the terrestrial network. The planar scene covered cooperatively by the unmanned aerial vehicle group is shown in fig. 2: m fixed-position ground nodes and n unmanned aerial vehicles flying at a fixed height H are randomly distributed in the area; all unmanned aerial vehicles take off from random positions at the same moment, and their paths are planned so as to cooperatively cover the ground nodes under a number of constraints, obtaining optimal paths and providing fast, reliable, economical and efficient data transmission and network communication for the ground.
Compared with the random algorithm, the particle swarm algorithm and the standard DDPG algorithm, the improved DDPG algorithm of the present invention exceeds these algorithms in coverage efficiency and energy-consumption balance, where:
the random algorithm means that at each moment every unmanned aerial vehicle randomly selects a flight direction in the range [0, 2π) as its current action; if the new position would exceed the boundary of the target area, all unmanned aerial vehicles abandon the action and stay in place;
the particle swarm algorithm is a meta-heuristic algorithm commonly used at present for searching optimal paths; it finds an optimal solution by creating a group of random particles and iterating many times. During each iteration, a particle updates itself by tracking two extrema: the best solution it has found itself, and the best solution found so far by the whole population; the extremum of a particle's neighbors can also be used to update it.
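A canonical one-dimensional particle-swarm update of the kind described above looks like this; the inertia weight and acceleration coefficients are typical textbook values, not taken from the patent:

```python
import random

def pso_step(xs, vs, pbest, gbest, w=0.7, c1=1.5, c2=1.5):
    """One iteration of a canonical (1-D) particle swarm update: each particle
    is pulled toward its own best solution pbest and the population best gbest."""
    new_xs, new_vs = [], []
    for x, v, pb in zip(xs, vs, pbest):
        r1, r2 = random.random(), random.random()
        # inertia + cognitive (own best) + social (population best) terms
        v = w * v + c1 * r1 * (pb - x) + c2 * r2 * (gbest - x)
        new_xs.append(x + v)
        new_vs.append(v)
    return new_xs, new_vs
```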
Reference is made to fig. 3 and fig. 4. Comparing the motion paths of the unmanned aerial vehicle groups obtained by the four algorithms, and observing the coverage efficiency at different coverage rates and the balance of energy used during flight, the improved DDPG algorithm of the invention improves the training success rate, converges faster, maximizes coverage efficiency under the same conditions, and effectively balances the flight energy consumption of each unmanned aerial vehicle, avoiding the "barrel effect" of excessive energy consumption by a single unmanned aerial vehicle and further reducing the flight time and cost of the multiple unmanned aerial vehicles.
The foregoing is only a partial embodiment of the present invention, and it should be noted that it will be apparent to those skilled in the art that modifications and adaptations can be made without departing from the principles of the present invention, and such modifications and adaptations are intended to be comprehended within the scope of the present invention.
Claims (5)
1. A multi-unmanned aerial vehicle base station collaborative coverage path planning method based on deep reinforcement learning includes the steps that firstly, a deep reinforcement learning model is designed, then, unmanned aerial vehicle groups interact with the environment in a simulation environment to obtain training data, the training data are sampled for simulation training, and finally, collaborative coverage path planning of target ground nodes is achieved;
the method is characterized by comprising the following steps of:
step one, defining a Markov model: constraint conditions of the unmanned aerial vehicle base station are modeled with the five-tuple (S, A, P, R, γ) of a Markov decision process; the unmanned aerial vehicle base station is a base station carried by an unmanned aerial vehicle, hereinafter referred to as unmanned aerial vehicle;
step two, designing a deep deterministic policy gradient (DDPG) algorithm, which uses basic deep reinforcement learning, based on the five-tuple (S, A, P, R, γ) of the Markov decision process obtained by modeling in step one;
step three, improving an experience buffer pool of the DDPG algorithm, classifying experience data stored in the experience buffer pool, and placing the obtained experience data into different experience buffer pools;
in the first step:
step 1.1, determining the state S of the unmanned aerial vehicle:
m fixed ground nodes and n unmanned aerial vehicles are randomly distributed in a target area;
the unmanned aerial vehicle state S contains: at time t, the position of unmanned aerial vehicle i, its energy consumption, and the signal losses L_1, ..., L_u, ..., L_m experienced by the ground nodes; the state of unmanned aerial vehicle i at time t is expressed as:
the position term is the coordinates of unmanned aerial vehicle i at time t; the energy term is the energy consumed by unmanned aerial vehicle i flying from its initial position to its position at time t;
step 1.2, determining an action set A of the unmanned aerial vehicle:
the flight speed of unmanned aerial vehicle i is fixed during flight, and the next movement is either a direction a_t ∈ (0, 2π) or the hover action a_t = 0; the hover action means that after covering a ground node the unmanned aerial vehicle keeps its current position unchanged; the action of unmanned aerial vehicle i is thus a_t ∈ [0, 2π);
step 1.3, defining the state transition probability function P: the probability that the unmanned aerial vehicle, in state s at time t and taking action a, reaches the next state s′ is:
step 1.4, determining a reward function R of the unmanned aerial vehicle:
let the set of ground node coverage states be B = {b_1, b_2, ..., b_u, ..., b_m}, where b_u is the coverage state of the u-th ground node, taking values in the Boolean domain {0, 1}; if b_u = 1, the ground node has been covered by an unmanned aerial vehicle; if b_u = 0, the ground node has not been covered;
the coverage rate α_t at time t is the ratio of the number of covered ground nodes to the total number m of ground nodes:

α_t = (1/m)·Σ_{u=1}^{m} b_u
the coverage radius of each unmanned aerial vehicle is R_c, and the coverage effect of an unmanned aerial vehicle on a target ground node decreases from strong to weak from the circle centre outward; the degree of effect when the u-th ground node is first covered is given by the formula:
where λ is the coverage effect constant;
an optimal path must be planned so that the ground nodes are converted from the initial state to the target state, the initial state of a ground node being uncovered and the target state being covered by an unmanned aerial vehicle; the coverage efficiency E_c is designed as a formula combining the ground-node coverage rate and the coverage effect:
a reward function is defined, representing the feedback obtained after the unmanned aerial vehicle selects an action in the current state; the base reward formula is:
where the coverage increment is Δα_t = α_t − α_{t−1} and the energy-consumption increment of the i-th unmanned aerial vehicle is defined analogously; the base reward r_t° serves as the reward value of the reward function R;
step 1.5, defining the discount factor γ, where γ ∈ (0, 1); the cumulative reward value over the whole process is calculated; the reward value is discounted over time, and the larger the discount factor, the more emphasis is placed on long-term benefit;
in the second step:
step 2.1, adopting the actor-critic (Actor-Critic) framework, in which one network is the actor (Actor) and the other is the critic (Critic); the two networks stimulate and compete with each other;
the state-behavior value function Q(s, a|θ^Q) of the Critic network and the policy function μ(s|θ^μ) of the Actor network are randomly initialized; the weights of the Critic network and the Actor network are copied to the target network parameters of the respective networks, i.e. θ^Q → θ^Q′ and θ^μ → θ^μ′, where θ^Q and θ^μ denote the Critic network parameters and Actor network parameters, and θ^Q′ and θ^μ′ denote the Critic target network parameters and Actor target network parameters respectively;
step 2.2, when the task starts, the initial state of unmanned aerial vehicle i is as follows
as the task proceeds, action a_t is taken according to the current state s_t, by the formula:
a_t = μ(s_t|θ^μ) + β
wherein β is random noise;
executing action a_t yields the reward r_t and the new state s_{t+1};
step 2.3, obtaining the experience tuple (s_t, a_t, r_t, s_{t+1}) and saving it in the experience pool;
a batch of samples is drawn at random from the experience pool for training; let (s_i, a_i, r_i, s_{i+1}) be a randomly sampled batch; TD-target training is performed, with the target value Y_i expressed as:
Y_i = r_i + γQ′(s_{i+1}, μ′(s_{i+1}|θ^μ′)|θ^Q′)
where μ′ denotes the policy obtained by analyzing s_{i+1}, and Q′ denotes the state-behavior value obtained at s_{i+1} under the policy μ′;
step 2.4, updating the Critic network by calculating and minimizing the loss function L:

L = (1/N)·Σ_{i=1}^{N} (Y_i − Q(s_i, a_i|θ^Q))²

where N is the number of random samples drawn from the experience pool for a training step;
step 2.5, updating the Actor network parameters θ^μ using the policy gradient algorithm; the gradient of the objective function J is:

∇_{θ^μ}J ≈ (1/N)·Σ_i ∇_a Q(s, a|θ^Q)|_{s=s_i, a=μ(s_i)} · ∇_{θ^μ}μ(s|θ^μ)|_{s=s_i}

where ∇_a Q(s, a|θ^Q) is the gradient of the Critic network state-behavior value function, ∇_{θ^μ}μ(s|θ^μ) is the gradient of the Actor network policy function, and μ(s_i) is the action selected by the Actor network for input state s_i;
step 2.6, calculating the target network values using the duplicate networks; the weight parameters of the target networks are updated with a delay, tracking the learned networks; meanwhile, the current network parameters are used to gradually update the corresponding Critic and Actor target networks:
θ^Q′ ← τθ^Q + (1 − τ)θ^Q′

θ^μ′ ← τθ^μ + (1 − τ)θ^μ′
where τ represents the update scaling factor, τ ∈ (0, 1);
in the third step:
step 3.1, dividing the experience pool into M_success and M_failure, which store successful and failed flight experience respectively; a number of experiences are drawn from M_success and M_failure respectively to train the neural network;
step 3.2, sampling from the two experience pools is set in proportion:

η_success = βΦ, η_failure = (1 − β)Φ

where η_success and η_failure are the numbers of samples drawn from M_success and M_failure respectively, Φ is the total number of samples, and β ∈ [0, 1] is the success rate, representing the proportion of experience drawn from M_success.
2. The method for planning the cooperative coverage path of the base stations of the multiple unmanned aerial vehicles based on deep reinforcement learning according to claim 1, wherein in the step 1.1,
the whole target area is divided into I × J cells; m fixed-position ground nodes and n unmanned aerial vehicles flying at a fixed height H are randomly distributed in the area; the coordinates of unmanned aerial vehicle i at time t are expressed as p_i, and the position coordinates of the u-th ground node are denoted q_u = (x_u, y_u);
the fixed total energy of one unmanned aerial vehicle is e_max, the energy consumed by an unmanned aerial vehicle moving one unit is e_1, and the energy consumed hovering over a ground node is e_2, where e_1 and e_2 are both constants; the unmanned aerial vehicle must complete its task before its energy is exhausted;
therefore, the energy consumed by unmanned aerial vehicle i flying from its initial position to its position at time t is:
the communication radius of each unmanned aerial vehicle is fixed at R_s; owing to the communication-connectivity constraint, unmanned aerial vehicle i must always keep the unmanned aerial vehicle j nearest to it within its communication radius, i.e.:

min(||p_i − p_j||, i ≠ j) ≤ R_s

where p_i and p_j denote the positions of unmanned aerial vehicle i and unmanned aerial vehicle j respectively;
channel fading occurs as an unmanned aerial vehicle propagates its signal to a ground node; the probability of a line-of-sight (LoS) fading link between the unmanned aerial vehicle and the ground is:

P_LoS = 1 / (1 + f·exp(−g·(θ_iu − f)))

where f and g are constants related to the type of environment, θ_iu = (180/π)·arctan(H/d_iu) is the elevation angle, H represents the altitude of the unmanned aerial vehicle, and d_iu is the horizontal distance between the i-th unmanned aerial vehicle and the u-th ground node, given by:

d_iu = sqrt((x_i − x_u)² + (y_i − y_u)²)
the probability of non-line-of-sight fading NLoS links is:
P NLoS =1-P LoS
the LoS and NLoS link loss models are:

L_LoS = 20·log10(4π·f_c·ω_iu/c) + η_LoS
L_NLoS = 20·log10(4π·f_c·ω_iu/c) + η_NLoS

where c is the propagation speed of light, f_c is the carrier frequency, ω_iu is the distance between the i-th unmanned aerial vehicle and the u-th ground node, and η_LoS and η_NLoS are the extra losses of the line-of-sight (LoS) and non-line-of-sight (NLoS) fading links respectively;
under the LoS and NLoS models, the signal loss of the u-th ground node is:

L_u = P_LoS·L_LoS + P_NLoS·L_NLoS

to guarantee quality of service while the unmanned aerial vehicles cover the ground nodes, the signal loss experienced by each ground node during coverage must be smaller than or equal to a threshold κ, i.e. L_u ≤ κ, for the ground node to be covered successfully; otherwise coverage of the ground node fails.
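The air-to-ground channel constraint of claim 2 can be sketched with the widely used probabilistic LoS/NLoS model; because the patent's formula images are not reproduced in the text, the exact closed forms of P_LoS and the link losses below are assumptions:

```python
import math

def expected_pathloss(h, d_horiz, f_c, f, g, eta_los, eta_nlos):
    """Expected air-to-ground signal loss L_u for one ground node, in dB,
    under the common probabilistic LoS/NLoS model (closed forms assumed)."""
    c = 3.0e8                                      # propagation speed of light, m/s
    theta = math.degrees(math.atan2(h, d_horiz))   # elevation angle seen by the node
    p_los = 1.0 / (1.0 + f * math.exp(-g * (theta - f)))
    p_nlos = 1.0 - p_los
    omega = math.hypot(h, d_horiz)                 # omega_iu: UAV-to-node distance
    fspl = 20.0 * math.log10(4.0 * math.pi * f_c * omega / c)  # free-space loss
    return p_los * (fspl + eta_los) + p_nlos * (fspl + eta_nlos)
```

A ground node then counts as successfully covered only if this expected loss does not exceed the threshold κ.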
3. The method for planning the cooperative coverage path of multiple unmanned aerial vehicle base stations based on deep reinforcement learning according to claim 1, characterized in that in step 1.4 additional rewards and penalties are further provided, and the sum of the base reward and the additional reward and penalty is used as the reward value of the reward function;
in the additional penalty setting, a negative increasing penalty is applied when the overall coverage does not reach the expected value α_ev, and no penalty is applied when the overall coverage reaches the expected value;
a coverage reward is set for each ground node covered by the unmanned aerial vehicle group for the first time; if an unmanned aerial vehicle exceeds the energy consumption budget during flight, a negative increasing penalty is applied at each frame; and if the unmanned aerial vehicles cannot communicate with each other, a negative additional penalty is applied;
the additional reward and penalty amount is r_extra, and the reward value is r_t = r_t° + r_extra.
4. The method for planning the collaborative coverage path of the multi-unmanned aerial vehicle base station based on deep reinforcement learning according to claim 1, characterized in that in step 2.3 the experience tuples are stored in an experience pool: a newly stored tuple is placed at the first position in the pool, and the existing tuples are each moved back one position.
5. The method for planning a cooperative coverage path of multiple unmanned aerial vehicle base stations based on deep reinforcement learning as claimed in claim 1, wherein in step 3.1 a temporary experience pool M_temp is further provided to store the latest flight experience; once M_temp is full, the earliest experience is taken out on a first-in-first-out basis and stored in the experience pool M_success, while the latest flight experience continues to be stored in M_temp; this is repeated, and finally, according to the final state of the unmanned aerial vehicle, a number of experiences are drawn from M_success and M_failure respectively to train the neural network.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310021781.1A CN116227767A (en) | 2023-01-07 | 2023-01-07 | Multi-unmanned aerial vehicle base station collaborative coverage path planning method based on deep reinforcement learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116227767A true CN116227767A (en) | 2023-06-06 |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116502547A (en) * | 2023-06-29 | 2023-07-28 | 深圳大学 | Multi-unmanned aerial vehicle wireless energy transmission method based on graph reinforcement learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||