CN115696211A - Unmanned aerial vehicle track self-adaptive optimization method based on information age - Google Patents


Info

Publication number
CN115696211A
CN115696211A
Authority
CN
China
Prior art keywords
unmanned aerial
aerial vehicle
strategy
function
aoi
Prior art date
Legal status
Pending
Application number
CN202211348121.6A
Other languages
Chinese (zh)
Inventor
胡昊南
韩铭
张�杰
陈前斌
Current Assignee
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202211348121.6A priority Critical patent/CN115696211A/en
Publication of CN115696211A publication Critical patent/CN115696211A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention belongs to the field of unmanned aerial vehicle communication, and particularly relates to an unmanned aerial vehicle trajectory adaptive optimization method based on information age, which comprises the following steps: constructing an unmanned aerial vehicle-to-ground communication system model, and modeling the trajectory planning problem with this model; the unmanned aerial vehicle collects the data packets generated by the ground nodes in its current flight state; current environment state information is acquired, and a deep reinforcement learning algorithm is adopted to solve the information-age optimization objective function for an optimal solution according to the current environment state information. According to the invention, the optimal decision is obtained through the AoI optimization objective function, and the optimal unmanned aerial vehicle trajectory is obtained according to the optimal decision, so that optimal communication is realized.

Description

Unmanned aerial vehicle track self-adaptive optimization method based on information age
Technical Field
The invention belongs to the field of unmanned aerial vehicle communication, and particularly relates to an unmanned aerial vehicle trajectory adaptive optimization method based on information age.
Background
With the development and deployment of 6G technology, more and more emerging Internet of Things applications are appearing in people's daily lives, such as smart homes, intelligent transportation, and smart health. In Internet of Things systems oriented toward real-time applications, devices need to sense the surrounding physical environment and monitor the system state in real time, thereby providing timely and valid information for intelligent decision-making and control. This information may be the instantaneous acceleration and position of a vehicle, or the ambient temperature, soil humidity, or the status of a network control or decision-making system. For such time-sensitive information, if the receiving terminal obtains outdated information, invalid decisions and erroneous control will result. In these scenarios and applications, information freshness is extremely important to the system. The concept of Age of Information (AoI) was introduced to measure the freshness of data in a wireless sensor network. Specifically, the age of information refers to the time elapsed since the most recent packet received by the receiving end was generated at the source device.
In order to meet the ubiquitous, full-coverage network requirements of the future, in addition to the terrestrial communication network, the 6G network needs to satisfy coverage and capacity requirements based on satellites and unmanned aerial vehicles, thereby forming an integrated space-air-ground network. Because an unmanned aerial vehicle has good flexibility and mobility, when the source node is far away from the target node, the unmanned aerial vehicle can act as a relay node to collect data from the sensor nodes, thereby greatly reducing the staleness of the data. Therefore, research on AoI in unmanned aerial vehicle networks is of great significance. Much previous work in the field of wireless communications has been devoted to the study of data collection, the AoI of Internet of Things devices, and cellular network improvements. However, in large-scale, energy-scarce terrestrial scenarios where the devices form a network, most existing schemes are based on a single-UAV system. Such networks often suffer from problems such as outdated information collection, inefficient energy harvesting, and the limited energy of the drones. The unmanned aerial vehicle needs a long time to collect information and return it to the control center, and because the unmanned aerial vehicle has limited energy and poor endurance, the ongoing task may be interrupted due to insufficient battery, so that reception or updating of the sensor node data stops.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides an unmanned aerial vehicle trajectory adaptive optimization method based on information age, which comprises the following steps:
S1: constructing an unmanned aerial vehicle-to-ground communication system model, and determining the trajectory of the unmanned aerial vehicle with this model; determining an AoI optimization objective function according to the unmanned aerial vehicle trajectory;
S2: acquiring current environment state information, solving the optimization objective function for an optimal solution by adopting a deep reinforcement learning algorithm according to the current environment state information, and determining the current flight state of the unmanned aerial vehicle according to the optimal solution;
S3: the unmanned aerial vehicle collects the data packets generated by the ground nodes in the current flight state; if data collection for all nodes is finished, the flight of the unmanned aerial vehicle ends, otherwise the method returns to step S2.
Preferably, constructing the unmanned aerial vehicle-to-ground communication system model comprises: acquiring the flight environment information of the unmanned aerial vehicle, dividing the acquired environment into a series of cells of the same size by a grid method, and designating some of the cells as no-fly zones; and acquiring the position information of the ground sensor nodes, and constructing the unmanned aerial vehicle-to-ground communication system model according to the flight environment of the unmanned aerial vehicle and the position information of the ground sensor nodes.
Preferably, the process of determining the AoI optimization objective function includes: discretizing the time for which the unmanned aerial vehicle executes its task into several time intervals of equal length; determining the flight altitude and flight speed of the unmanned aerial vehicle; constructing an unmanned aerial vehicle speed constraint according to the flight altitude, the flight speed, and the task time; the unmanned aerial vehicle collects ground information, and when the unmanned aerial vehicle collects the latest data stored at a ground node, the AoI of that ground node is updated, otherwise the AoI of the ground node increases linearly; if no data is stored in the buffer of the ground node or the data has already been collected, the AoI is set to 1; the time at which unmanned aerial vehicle n completes collection from ground node m is recorded as $t_s$, and the flight trajectory and connection strategy of the unmanned aerial vehicle are planned according to $t_s$.
Further, the optimization objective function is:

$$(\mathrm{P1}):\quad \min_{q,K}\ \sum_{m\in M} A_m(t_s)$$

$$\mathrm{s.t.}\quad \|q_n(t)-q_n(t-1)\|\le V_{\max},$$

$$\sum_{m\in M} K_{n,m}(t)\le 1,\ \forall n\in N,\qquad \sum_{n\in N} K_{n,m}(t)\le 1,\ \forall m\in M,$$

wherein q represents the trajectory sequence formed by the positions of the unmanned aerial vehicles, K represents the connection relation between the unmanned aerial vehicles and the ground nodes, M represents the number of ground nodes, $A_m$ represents the AoI of ground node m, $t_s$ represents the time at which unmanned aerial vehicle n completes collection from ground node m, $q_n(t)$ represents the position of the n-th unmanned aerial vehicle at time t, $V_{\max}$ represents the maximum flight speed of the unmanned aerial vehicle, $K_{n,m}(t)\in\{0,1\}$ denotes the connection relation between unmanned aerial vehicle $n\in N$ and ground node $m\in M$ at time $t\in[0,T]$, and N represents the number of unmanned aerial vehicles.
Preferably, the connection policy is modeled as a Markov decision process, the Markov decision process comprising a quadruple <S, A, P, R>, where S and A are the state space and the action space, respectively, representing the state and the action of the unmanned aerial vehicle; P is the state transition function, representing the probability of transitioning to the next state when the unmanned aerial vehicle executes an action in the current state; and R is the reward function, representing the reward obtainable when the unmanned aerial vehicle is in the current state.
Further, the reward function is designed according to the target problem; the goal of trajectory planning is to minimize the AoI of the collected information, so the reward function is a function of AoI: when a target point is found, the reward is $r_1$; when the unmanned aerial vehicle flies out of the active area, the reward is $-r_2$; the number of episodes is fixed, and when a single episode ends, whether the data packets of all ground nodes have been collected is judged, and if they have been collected the reward is $r_3$, otherwise it is $-r_4$; in other cases the reward is $-A_m(t)$; wherein $r_1, r_2, r_3, r_4$ are positive numbers.
Preferably, the process of performing deep reinforcement learning on the AoI optimization objective function by using the improved PPO algorithm includes:
S21: inputting the state information $(s_1, s_2 \dots s_n)$ into the Actor network to obtain the probabilities of all actions, and outputting the joint action $(a_1, a_2 \dots a_n)$ according to the probabilities of all actions; all agents share one Actor network, the input of each agent i is the globally observed environment information, and the output is the joint action of agent i;
S22: inputting the joint action $(a_1, a_2 \dots a_n)$ into the environment to obtain the global reward r and the next state s', obtaining the trajectory τ (the sequence of states, joint actions, and rewards) according to the successive states, and storing it in an experience pool;
S23: inputting all states s in the trajectory τ into the Critic network to obtain the state values $V(s_t)$ corresponding to all states of the unmanned aerial vehicle in one trajectory;
S24: after the unmanned aerial vehicle executes the joint action $a_t$ and reaches state $s_{t+1}$, calculating the expected cumulative reward $G_t = r_t + \gamma V(s_{t+1})$ for different actions, computing the advantage function $A(s_t, a_t) = G_t - V(s_t)$ from the cumulative reward, and using generalized advantage estimation to balance the variance and bias of the value function estimate;
S25: calculating the loss of the Critic network, the Critic loss function being the mean square of the advantage function;
S26: using the obtained advantage function $A(s_t, a_t)$ as the Critic network's evaluation of the action strategy, and improving the output strategy of the Actor network to obtain a new strategy $\pi_\theta$;
S27: inputting all stored state combinations s into the Actor networks of the new and old strategies $\pi_\theta$ and $\pi_{\theta'}$, respectively, to obtain the unmanned aerial vehicle action probability distributions prob1 and prob2 under the different strategies; calculating the importance weights according to prob1 and prob2; obtaining the corrected difference between the two action distributions of the different strategies θ and θ' according to the importance weights, and calculating the updated expected return of the strategy according to the difference between the two action distributions;
S28: setting the constraint condition of the updated strategy, and calculating the loss function of the Actor network according to the constraint condition and the expected return of the strategy;
S29: updating the parameters of the Actor network and the Critic network by using a gradient descent algorithm according to the loss functions, until the reward converges and no longer changes, and outputting the current optimal flight strategy of the unmanned aerial vehicle.
The invention has the beneficial effects that:
the method utilizes a low-complexity algorithm to plan the optimal track of the unmanned aerial vehicle, the algorithm has high convergence speed and stable training result, and finally, the AoI of the ground node can be obviously reduced, the information freshness of the transmitted data can be effectively ensured, and the wrong decision caused by untimely data delivery is avoided.
Drawings
FIG. 1 is a flow chart of the unmanned aerial vehicle trajectory optimization method based on deep reinforcement learning of the present invention;
FIG. 2 is a diagram of a model of a UAV-to-ground communication system of the present invention;
fig. 3 is a flow chart of the PPO algorithm of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
An unmanned aerial vehicle trajectory adaptive optimization method based on information age is disclosed, as shown in fig. 1; the method comprises the following steps:
S1: constructing an unmanned aerial vehicle-to-ground communication system model, and determining the trajectory of the unmanned aerial vehicle with this model; determining an AoI optimization objective function according to the unmanned aerial vehicle trajectory;
S2: obtaining current environment state information, solving the optimization objective function for an optimal solution by adopting a deep reinforcement learning algorithm according to the current environment state information, and determining the current flight state of the unmanned aerial vehicle according to the optimal solution;
S3: the unmanned aerial vehicle collects the data packets generated by the ground nodes in the current flight state; if data collection for all nodes is finished, the flight of the unmanned aerial vehicle ends, otherwise the method returns to step S2.
Specifically, an implementation of the unmanned aerial vehicle trajectory adaptive optimization method based on information age includes:
S1: establishing an unmanned aerial vehicle-to-ground communication system model, and determining an AoI optimization objective function according to the trajectory of the unmanned aerial vehicle.
S11: the process of determining the AoI optimization objective function includes: discretizing the time for which the unmanned aerial vehicle executes its task into several time intervals of equal length; determining the flight altitude and flight speed of the unmanned aerial vehicle; constructing a mobility constraint of the unmanned aerial vehicle according to the flight altitude, the flight speed, and the task time; the unmanned aerial vehicle collects ground information, and when the unmanned aerial vehicle successfully collects the latest data stored at a certain ground node, the AoI of that ground node is updated, otherwise the AoI of the node increases linearly; if the buffer of the ground node has no stored data or the data has already been collected, the AoI is set to 1, otherwise it is set to 0; the time at which unmanned aerial vehicle n completes collection from ground node m is recorded as $t_s$, and the flight trajectory and connection strategy of the unmanned aerial vehicle are planned according to $t_s$.
Planning the flight trajectory and the connection strategy of the unmanned aerial vehicle includes the following: for the communication connection between the unmanned aerial vehicles and the nodes, each unmanned aerial vehicle communicates with only one node at any time; meanwhile, in order to achieve better cooperation among the unmanned aerial vehicles, when a ground node is occupied by one unmanned aerial vehicle, other unmanned aerial vehicles cannot visit it, and any node m can be served by at most one unmanned aerial vehicle at the same time. The connection relation between the unmanned aerial vehicles and the ground nodes is related to the trajectory, and during flight each unmanned aerial vehicle always selects the nearest unoccupied node to communicate with.
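The following is a minimal sketch of this nearest-unoccupied-node rule (the function and variable names are illustrative assumptions, not taken from the patent text):

```python
import math

def assign_nodes(drone_positions, node_positions, node_done):
    """One-slot greedy assignment: each drone picks the nearest ground node that
    still holds data and is not occupied by another drone; a node is served by
    at most one drone and a drone serves at most one node."""
    occupied = set()
    assignment = {}
    for n, q in enumerate(drone_positions):
        best_m, best_d = None, math.inf
        for m, g in enumerate(node_positions):
            if node_done[m] or m in occupied:
                continue  # node already collected or taken by another drone this slot
            d = math.dist(q, g)
            if d < best_d:
                best_m, best_d = m, d
        if best_m is not None:
            occupied.add(best_m)   # at most one drone per node
        assignment[n] = best_m     # at most one node per drone
    return assignment
```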
Specifically, as shown in fig. 2, consider multiple unmanned aerial vehicles communicating with multiple ground nodes, where the unmanned aerial vehicles act as mobile relays to collect data information from the ground nodes, and the positions of the ground nodes are fixed and known to the unmanned aerial vehicles. The task time is divided into slots of equal size. Within the given T time slots, the unmanned aerial vehicle needs to take off from the starting position and collect the data of M sensors. The area in which the unmanned aerial vehicle executes its task is divided into small grid cells of uniform size. After the area is evenly divided by the grid, it is assumed that in each time slot the unmanned aerial vehicle flies along the grid-cell centers at a fixed altitude H. The center position of grid cell i is denoted $c_i=(x_i,y_i)\in C$, where C represents the set of center positions of all grid cells in the region, and the distance between the center positions of two adjacent cells is denoted D. In each slot, the unmanned aerial vehicle flies at a fixed speed V to the center of an adjacent cell, so the flight trajectory of the unmanned aerial vehicle can be represented by an ordered set of grid-cell center positions. A three-dimensional Cartesian coordinate system is established, in which the position of a ground node is expressed as $g(t)=(x(t),y(t),0)$, $t\in[1,T]$; during the communication time T the unmanned aerial vehicle flies at the fixed altitude H, and its position is expressed as $q(t)=[x(t),y(t),H]$, $0\le t\le T$. When the discrete time interval is sufficiently small, the speed and position of the unmanned aerial vehicle satisfy

$$\|q_n(t)-q_n(t-1)\|\le V_{\max}.$$

With the aim of minimizing AoI, let u(t) be the generation time of the latest data uploaded to the unmanned aerial vehicle; the AoI is then $a(t)=t-u(t)$. When the unmanned aerial vehicle successfully collects the latest updated data from a ground node, the AoI of that node becomes the AoI of the latest information; otherwise, the AoI of the node increases linearly.
When there is data in the buffer of ground node m, the AoI of node m is reset upon a successful collection and otherwise increases by one per slot, i.e.

$$A_m(t+1)=\begin{cases}1, & \text{if the data of node } m \text{ is collected in slot } t,\\ A_m(t)+1, & \text{otherwise.}\end{cases}$$

In the absence of conventional terrestrial infrastructure communication facilities, multiple unmanned aerial vehicles are dispatched to fly and collect the data of the ground nodes. The base station is located at the center of the area, and the coverage of the BS is a circular area with radius R; within the coverage area there are M randomly distributed ground nodes M = {1,2,…,M}, and each ground node holds a data packet storing information about the surrounding environment, such as temperature and air pollution conditions, with a packet size of 1 M. The area contains N unmanned aerial vehicles N = {1,2,…,N}, and the base station control station sends control commands to the unmanned aerial vehicles through a satellite link. Each ground node has a spherical transmission range of radius r, within which a good line-of-sight channel exists between the unmanned aerial vehicle and the ground node; because the unmanned aerial vehicle flies at high altitude, a line-of-sight (LoS) communication link is established between the unmanned aerial vehicle and the sensors on the ground. Similarly, a LoS link also exists between the unmanned aerial vehicle and the base station, so the channel gain between the unmanned aerial vehicle and the base station in time slot t can be obtained, specifically expressed as $h=\beta_0 d^{-2}$, where d is the distance between the two and $\beta_0$ is the channel gain at a reference distance of 1 meter.
The effective coverage range of a sensor is determined by the transmit power P of the sensor, the channel bandwidth B between the unmanned aerial vehicle and the sensor, the noise power $\sigma^2$ at the unmanned aerial vehicle, and the size S of the data status update packet generated by the sensor (the explicit expression is given as an equation image in the original filing).
The invention aims to provide an AoI-minimizing unmanned aerial vehicle trajectory planning method that satisfies the communication requirements of multiple unmanned aerial vehicles and multiple sensor nodes on the ground. The method jointly optimizes the unmanned aerial vehicle trajectory {q[t]} and the transmission packet scheduling {K[t]}. The optimization problem can be expressed as (P1):
$$(\mathrm{P1}):\quad \min_{q,K}\ \sum_{m\in M} A_m(t_s)$$

$$\mathrm{s.t.}\quad \|q_n(t)-q_n(t-1)\|\le V_{\max},$$

$$\sum_{m\in M} K_{n,m}(t)\le 1,\ \forall n\in N,\qquad \sum_{n\in N} K_{n,m}(t)\le 1,\ \forall m\in M,$$

wherein q represents the trajectory sequence formed by the positions of the unmanned aerial vehicles, K represents the connection relation between the unmanned aerial vehicles and the ground nodes, M represents the number of ground nodes, $A_m$ represents the AoI of ground node m, $t_s$ represents the time at which unmanned aerial vehicle n completes collection from ground node m, $q_n(t)$ represents the position of the n-th unmanned aerial vehicle at time t, $V_{\max}$ represents the maximum flight speed of the unmanned aerial vehicle, $K_{n,m}(t)\in\{0,1\}$ denotes the connection relation between unmanned aerial vehicle $n\in N$ and ground node $m\in M$ at time $t\in[0,T]$, and N represents the number of unmanned aerial vehicles.
The connection relation between the unmanned aerial vehicles and the nodes is modeled with a binary variable $K_{n,m}(t)$, which represents the connection relation between unmanned aerial vehicle $n\in N$ and node $m\in M$ at time $t\in[0,T]$. When $K_{n,m}(t)=1$, unmanned aerial vehicle n is accessing node m; conversely, when $K_{n,m}(t)=0$, the two are not in communication connection with each other. For the communication connection between an unmanned aerial vehicle and the nodes, there is the constraint

$$\sum_{m\in M} K_{n,m}(t)\le 1,\ \forall n\in N,$$

indicating that each unmanned aerial vehicle communicates with only one node at any time; meanwhile, in order to achieve better cooperation among the unmanned aerial vehicles, when a ground node is occupied by one unmanned aerial vehicle, other unmanned aerial vehicles cannot visit it, which is expressed by the constraint

$$\sum_{n\in N} K_{n,m}(t)\le 1,\ \forall m\in M,$$

meaning that any node m can be served by at most one unmanned aerial vehicle at the same time. The connection relation between the unmanned aerial vehicles and the ground nodes is related to the trajectory, and during flight each unmanned aerial vehicle always selects the nearest unoccupied node to communicate with.
The optimization problem can be modeled as a Markov decision process, which is trained with a deep reinforcement learning algorithm and comprises the quadruple <S, A, P, R>; S and A are the state space and the action space, respectively, representing the state and the action of the unmanned aerial vehicle; P is the state transition function, representing the probability of transitioning to the next state when the unmanned aerial vehicle executes an action in the current state; R is the reward function, representing the reward obtainable when the unmanned aerial vehicle is in the current state. The objective of the Markov decision process is to find the optimal strategy

$$\pi^{*}=\arg\max_{\pi}\ \mathbb{E}_{\tau\sim\pi}\!\left[\sum_{t}\gamma^{t}r_{t}\right]$$

for decision making. The method specifically comprises the following elements:
(1) State space
The state space consists of the positions of all unmanned aerial vehicles, the AoI of the ground nodes, and the connections between the unmanned aerial vehicles and the ground nodes. The position of unmanned aerial vehicle n at time t is $q_{t,n}$, the AoI of a ground node is $A_{t,m}$, and at time t the ground node to which unmanned aerial vehicle n is connected, i.e. the selected subtask, is $C_{t,n}\in[1,M]$. Thus, the state space is defined as $X_{t,n}=\{s_{t,1},\dots,s_{t,N}\}$.
(2) Action space
For the multi-agent scenario, the action space is defined as the joint action of all unmanned aerial vehicles in slot k, containing the flight direction of each unmanned aerial vehicle, i.e. $a_{t,n}\in\{\mathrm{N},\mathrm{S},\mathrm{W},\mathrm{E}\}$. The joint action space is thus $A_{t,n}=\{a_{t,1},\dots,a_{t,N}\}$.
(3) Reward function
The reward function is designed according to the target problem; the goal of trajectory planning is to minimize the AoI of the collected information, so the reward function is a function of AoI. When a target point is found, the reward is $r_1$; when the unmanned aerial vehicle flies out of the active area, the reward is $-r_2$; the number of episodes is fixed, and when a single episode ends, whether the data packets of all ground nodes have been collected is judged: if they have been collected, the reward is $r_3$, otherwise it is $-r_4$; in other cases the reward is $-A_m(t)$ (where $r_1, r_2, r_3, r_4$ are positive numbers).
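A minimal sketch of a per-step reward of this shape is given here, with illustrative values for $r_1$ to $r_4$ (the constants and the function name are hypothetical, not specified by the patent):

```python
def step_reward(found_target, out_of_area, episode_done, all_collected, aoi_m,
                r1=10.0, r2=5.0, r3=20.0, r4=20.0):
    """AoI-shaped reward: terminal bonuses/penalties, otherwise the negative AoI."""
    if found_target:
        return r1                              # a target point was found
    if out_of_area:
        return -r2                             # the drone flew out of the active area
    if episode_done:
        return r3 if all_collected else -r4    # end-of-episode collection check
    return -aoi_m                              # default case: -A_m(t)
```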
(4) State transition
In slot t, if the unmanned aerial vehicle selects sensor m to send a data status update packet and the unmanned aerial vehicle is located within the effective coverage range of sensor m at that moment, the AoI of sensor m is reduced to 1; otherwise its AoI increases by 1.
The position update equation of the unmanned aerial vehicle can be expressed as:

$$q(t+1)=q(t)+D\,e(V_t),\qquad e(\mathrm{North})=(0,1),\ e(\mathrm{South})=(0,-1),\ e(\mathrm{East})=(1,0),\ e(\mathrm{West})=(-1,0).$$

The state update equation of the AoI is:

$$A_m(t+1)=\begin{cases}1, & \text{if } K_{n,m}(t)=1 \text{ and the unmanned aerial vehicle is within the coverage of sensor } m,\\ A_m(t)+1, & \text{otherwise,}\end{cases}$$

wherein q(t) represents the horizontal position coordinates of the unmanned aerial vehicle, D represents the distance between the center positions of two adjacent grid cells, $V_t$ represents the flight direction of the unmanned aerial vehicle (North, South, East, or West), $A_m(t)$ represents the AoI of ground sensor m at time t, and $K_{n,m}(t)$ represents the connection relation between unmanned aerial vehicle $n\in N$ and node $m\in M$ at time $t\in[0,T]$.
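As an illustrative sketch of these transition rules (the direction encoding and the coverage test are assumptions made for the example, not the patent's exact formulation):

```python
DIRS = {"North": (0, 1), "South": (0, -1), "East": (1, 0), "West": (-1, 0)}

def transition(q, aoi, direction, served_node, in_coverage, D):
    """One-slot transition: move the drone one grid cell of size D and update
    the AoI of every ground node (reset to 1 only for the node served in coverage)."""
    dx, dy = DIRS[direction]
    q_next = (q[0] + D * dx, q[1] + D * dy)        # position update
    aoi_next = []
    for m, a in enumerate(aoi):
        if m == served_node and in_coverage:
            aoi_next.append(1)                      # fresh data collected from node m
        else:
            aoi_next.append(a + 1)                  # otherwise AoI grows by one slot
    return q_next, aoi_next
```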
S2: based on the current environment state information, the action in the current state is selected according to the obtained strategy by using a deep reinforcement learning algorithm.
S21: for the multi-agent task, the goal is to determine an optimal strategy that maximizes $J(\pi_\theta)$. A fully centralized learning framework is considered, in which all unmanned aerial vehicles are treated as a single super-agent.
In particular, in the fully centralized learning framework, it is assumed that the UAVs can know the global state $X_{t,n}$ and the joint action $A_{t,n}$, and a centralized strategy $\pi_\theta$ is learned by using an RL algorithm. Therefore, both the training and execution phases always need the global information contained in $X_{t,n}$ and $A_{t,n}$. The centralized Critic estimates a joint value function based on the global information, and the Actor makes decisions based on the global state.
S22: during flight the unmanned aerial vehicle continuously receives flight position data broadcast from the ground base station, and updates its strategy with the reward returns obtained by interacting with the environment; the pseudo-code of the algorithm for optimization problem (P1) is as follows:
(The pseudo-code table of the algorithm for problem (P1) is provided as an image in the original filing.)
As shown in fig. 3, the specific process of performing deep reinforcement learning on the AoI optimization objective function by using the improved PPO algorithm is as follows: all hyper-parameters are set, including the hyper-parameter ω of the Critic value network $Q_\omega$ and the hyper-parameter θ of the Actor strategy network π; the hyper-parameter θ of the Actor strategy network π is initialized to obtain θ', and the hyper-parameter ω of the Critic value network $Q_\omega$ is initialized to obtain ω'; the maximum number of iteration rounds L is set. Trajectory data are collected using the old Actor network: during data collection, each UAV interacts with the environment using the policy $\pi_{old}$, and in each iteration each UAV collects a trajectory τ of T slots. The advantage function and the target V value are calculated, and the trajectory, the advantage function, and the target V value are then stored in a batch for later sampling. All batches are cycled through for K epochs, mini-batches of size mini-batch are drawn from them, and the policy loss and the value loss are calculated and minimized with the Adam optimizer. Finally, the Actor network is updated with the gradient of the policy loss, and the Critic network is updated with the gradient of the value loss.
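A compact, assumption-laden sketch of such a centralized PPO update is given below in PyTorch style (the network sizes, hyper-parameter values, and class names are illustrative and not prescribed by the patent):

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Shared policy over the joint action space of all drones."""
    def __init__(self, state_dim, n_joint_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 128), nn.Tanh(),
                                 nn.Linear(128, n_joint_actions))
    def forward(self, s):
        return torch.distributions.Categorical(logits=self.net(s))

class Critic(nn.Module):
    """Centralized value function evaluated on the global state."""
    def __init__(self, state_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 128), nn.Tanh(),
                                 nn.Linear(128, 1))
    def forward(self, s):
        return self.net(s).squeeze(-1)

def ppo_update(actor, critic, opt_a, opt_c, batch, clip_eps=0.2, epochs=4):
    """One PPO update on a batch collected with the old policy.
    batch = (states, actions, old_log_probs, target_values, advantages)."""
    s, a, old_logp, target_v, adv = batch
    for _ in range(epochs):
        dist = actor(s)
        ratio = torch.exp(dist.log_prob(a) - old_logp)              # importance weight r(theta)
        clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
        actor_loss = -torch.min(ratio * adv, clipped * adv).mean()  # clipped surrogate objective
        critic_loss = ((target_v - critic(s)) ** 2).mean()          # mean-squared value error
        opt_a.zero_grad(); actor_loss.backward(); opt_a.step()
        opt_c.zero_grad(); critic_loss.backward(); opt_c.step()
```

In a full run, the trajectories, GAE advantages, and target V values collected with the old Actor would be stored in a buffer and sampled in mini-batches exactly as described above, with Adam used as the optimizer.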
Specifically, when a policy gradient algorithm is used to train an agent, one of the challenges is that the algorithm is susceptible to sudden performance collapse, in which the agent abruptly starts to perform poorly. This situation may be difficult to recover from, because the agent then starts generating poorly performing trajectories that are subsequently used to further train the strategy. In addition, because an on-policy algorithm cannot reuse data, its sample utilization is insufficient. The Proximal Policy Optimization (PPO) algorithm is an optimization algorithm that addresses both of the above problems. The main idea behind proximal policy optimization is to introduce a surrogate objective function that avoids sudden performance degradation by ensuring monotonic policy improvement, and that has the advantage of reusing off-policy data during training.
In a multi-agent system, since the reward function $r_t$ of each agent is influenced by the actions of the other agents, the agents need to consider the value function of the joint action in order to learn the optimal strategy. Centralized training means that the joint action-value function is used to train the agents during training; compared with distributed training, i.e. using a local action-value function, a centralized value function can evaluate the joint strategy.
The specific method is as follows: a centralized Critic network utilizes the joint action information; during training, the n agents interact with the environment using a joint strategy, the joint action-value function of each agent is evaluated at the same time, and the gradient of the strategy parameters is updated according to the joint action-value function. The value function only knows the global reward, and a single agent does not know its actual contribution, so the Critic in the PPO algorithm uses an advantage function to evaluate how good an action is. First a strategy π is given, and its next iteration (after the parameter update) is denoted π'; an associated policy performance identifier is defined to measure the performance difference between the two strategies. The goal of the agent is to obtain the strategy $\pi_\theta$ that maximizes the expected return

$$J(\pi_\theta)=\mathbb{E}_{\tau\sim\pi_\theta}\!\left[\sum_{t}\gamma^{t}r_{t}\right],$$

where θ is the strategy parameter and the return refers to the discounted reward sum over a trajectory $\tau=(s_0,a_0,s_1,a_1,\dots)$,

$$R(\tau)=\sum_{t}\gamma^{t}r_{t}.$$

The optimal strategy is

$$\pi^{*}=\arg\max_{\pi_\theta} J(\pi_\theta).$$
Through a first-order approximation, the surrogate loss function is optimized, a new strategy is calculated in each iteration, and a non-negative improvement at each strategy iteration is guaranteed.
The Critic in the PPO algorithm adopts an advantage function to evaluate the quality of an action:

$$A^{\pi}(s_t,a_t)=Q^{\pi}(s_t,a_t)-V^{\pi}(s_t).$$

Since PPO is an on-policy algorithm, importance sampling is introduced in order to improve the sample utilization; sampling with the old strategy $\pi_{\theta'}$ gives

$$\nabla J(\theta)=\mathbb{E}_{(s_t,a_t)\sim\pi_{\theta'}}\!\left[\frac{\pi_{\theta}(a_t\mid s_t)}{\pi_{\theta'}(a_t\mid s_t)}A^{\pi_{\theta'}}(s_t,a_t)\,\nabla\log\pi_{\theta}(a_t\mid s_t)\right],$$

and this gradient corresponds to the optimization objective function, namely:

$$J^{\theta'}(\theta)=\mathbb{E}_{(s_t,a_t)\sim\pi_{\theta'}}\!\left[\frac{\pi_{\theta}(a_t\mid s_t)}{\pi_{\theta'}(a_t\mid s_t)}A^{\pi_{\theta'}}(s_t,a_t)\right].$$

In practical application, the expectation is estimated from samples, the optimization target of PPO, i.e. the surrogate loss function, is obtained in simplified form, and the magnitude of the policy update is limited through the clipping operation (clip), which guarantees training stability:

$$J^{CLIP}(\theta)=\mathbb{E}\!\left[\min\bigl(r(\theta)A^{\pi},\ \mathrm{clip}(r(\theta),1-\varepsilon,1+\varepsilon)A^{\pi}\bigr)\right].$$

The above expression is called the clipped surrogate objective function, where

$$r(\theta)=\frac{\pi_{\theta}(a\mid s)}{\pi_{\theta'}(a\mid s)}$$

is the ratio of the new and old strategies and ε is the clipping magnitude hyper-parameter; the idea is to limit r(θ) to the neighborhood $[1-\varepsilon,\,1+\varepsilon]$.
The objective function assumes that the samples come from the old strategy, which is independent of the current strategy except for the importance weights, so the sampled trajectories can be reused multiple times to perform parameter updates. Such a modification enables more stable training with better sample utilization.
A specific embodiment of performing deep reinforcement learning on the AoI optimization objective function by using the improved PPO algorithm comprises the following steps:
Step 1: the state information $(s_1, s_2 \dots s_n)$ is input into the Actor network to obtain the probabilities of all actions, and the joint action $(a_1, a_2 \dots a_n)$ is output according to the probabilities of all actions; all agents share one Actor network, the input of each agent i is the globally observed environment information, and the output is the joint action of agent i.
Step 2: the joint action $(a_1, a_2 \dots a_n)$ is input into the environment to obtain the global reward r and the next state s'; the trajectory τ (the sequence of states, joint actions, and rewards) is obtained from the successive states and stored in an experience pool.
Step 3: all states s in the trajectory τ are input into the Critic network to obtain the state values $V(s_t)$ corresponding to all states of the unmanned aerial vehicle in one trajectory.
Step 4: after the unmanned aerial vehicle executes the joint action $a_t$ and reaches state $s_{t+1}$, the reward estimate $G_t=r_t+\gamma V(s_{t+1})$ is calculated, and the advantage function $A(s_t,a_t)=G_t-V(s_t)$ is obtained from $G_t$; generalized advantage estimation is applied to the advantage function to balance the variance and bias of the value function estimate. The expression is:

$$\hat{A}_t^{GAE}=\sum_{l=0}^{\infty}(\gamma\lambda)^{l}\,\delta_{t+l},\qquad \delta_t=r_t+\gamma V(s_{t+1})-V(s_t),$$

wherein $\hat{A}_t^{GAE}$ represents the advantage function calculated using GAE, γ represents the discount factor of the reward, λ represents an adjustable hyper-parameter, l represents the accumulation index, $r_t$ represents the reward at time t, $V(s_{t+1})$ represents the state value function at time t+1, $V(s_t)$ represents the state value function at time t, and $s_t$ represents the state at time t.
Step 5: the loss of the Critic network is calculated; the Critic loss function is the mean square of the advantage function.
Step 6: the obtained advantage function $A(s_t,a_t)$ is used as the Critic network's evaluation of the action strategy, and the output strategy of the Actor network is improved to obtain a new strategy $\pi_\theta$.
Step 7: the loss of the Critic network is calculated, and the generalized-advantage-estimated advantage function $A(s_t,a_t)$ is used to optimize the Critic's evaluation of the action strategy $\pi_\theta$, obtaining the new strategy $\pi_\theta$.
Step 8: all stored state combinations s are respectively input into the Actor networks of the new and old strategies $\pi_\theta$ and $\pi_{\theta'}$ to obtain the unmanned aerial vehicle action probability distributions prob1 and prob2 under the different strategies; the importance weights are calculated according to prob1 and prob2; the corrected difference between the two action distributions of the different strategies θ and θ' is obtained according to the importance weights, and the updated expected return of the strategy is calculated according to the difference between the two action distributions.
Calculating the updated expected return of the strategy includes: before the strategy is updated, the expected return of the new strategy cannot be calculated directly, so importance sampling is introduced and the distribution of the old strategy is used to estimate the distribution of the new strategy; the calculated updated expected return of the strategy is:

$$J^{\theta'}(\theta)=\mathbb{E}_{(s,a)\sim\pi_{\theta'}}\!\left[\frac{\pi_{\theta}(a\mid s)}{\pi_{\theta'}(a\mid s)}A^{\pi_{\theta'}}(s,a)\right],\qquad r(t)=\frac{\pi_{\theta}(a\mid s)}{\pi_{\theta'}(a\mid s)},$$

wherein r(t) is the ratio of the new and old strategies, $J^{\theta'}(\theta)$ represents the updated expected return of the strategy, $J(\theta')$ represents the expected return of the old strategy, $\pi_\theta(a\mid s)$ represents the new strategy, $\pi_{\theta'}(a\mid s)$ represents the old strategy, a represents the action, and s represents the state information.
Step 9: the constraint condition of the updated strategy is set, and the loss function of the Actor network is calculated according to the constraint condition and the expected return of the strategy; setting the constraint condition of the updated strategy means limiting the magnitude of the strategy update through the clipping operation (clip) and restricting r(t) to the neighborhood $[1-\varepsilon,\,1+\varepsilon]$, thereby guaranteeing training stability. The loss function of the Actor network is:

$$J^{CLIP}(\theta)=\mathbb{E}\!\left[\min\bigl(r(t)A^{\pi},\ \mathrm{clip}(r(t),1-\varepsilon,1+\varepsilon)A^{\pi}\bigr)\right],$$

wherein $J^{CLIP}(\theta)$ represents the objective function of the Actor network, θ represents the network weights of the Actor, E represents the expectation operation, r(t) is the ratio of the new and old strategies, $A^{\pi}$ represents the advantage function under strategy π, and $\mathrm{clip}(r(t),1-\varepsilon,1+\varepsilon)$ outputs $1-\varepsilon$ if r(t) is smaller than $1-\varepsilon$, outputs $1+\varepsilon$ if r(t) is larger than $1+\varepsilon$, and otherwise outputs r(t), in order to limit the probability ratio to a reasonable range; ε represents the clipping magnitude hyper-parameter.
Step 10: the parameters of the Actor network and the Critic network are updated by using a gradient descent algorithm according to the loss functions, and when the reward has converged, the current optimal flight strategy of the unmanned aerial vehicle is output.
S3: if the unmanned aerial vehicle has finished collecting the data packets generated by all ground nodes, the flight ends; otherwise, step S2 continues to be executed.
Each episode starts from an initial state, and the episode is finished and a new round of learning restarts when the unmanned aerial vehicle meets either of the following conditions: 1) the data packets of all ground sensors have been completely collected; 2) the maximum range has been reached. In this embodiment, if the unmanned aerial vehicle does not complete the data collection task within the maximum number of 500 steps in a single episode, the unmanned aerial vehicle is considered to have reached the maximum range. When the maximum number of episodes is reached, the loop is exited and the training ends.
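A minimal sketch of this episode-termination check (the function name is illustrative; only the 500-step cap and the all-collected condition come from this embodiment):

```python
MAX_STEPS = 500  # maximum number of steps in a single episode in this embodiment

def episode_done(all_collected, step_count, max_steps=MAX_STEPS):
    """The episode ends when every ground sensor packet has been collected
    or when the step cap (treated as reaching the maximum range) is hit."""
    if all_collected:
        return True, "all ground sensor data packets collected"
    if step_count >= max_steps:
        return True, "maximum range reached"
    return False, ""
```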
The above-mentioned embodiments further describe the objects, technical solutions, and advantages of the invention in detail. It should be understood that the above-mentioned embodiments are only preferred embodiments of the present invention and should not be construed as limiting the present invention; any modifications, equivalent substitutions, improvements, and the like made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.

Claims (10)

1. An unmanned aerial vehicle trajectory adaptive optimization method based on information age, characterized by comprising the following steps:
S1: constructing an unmanned aerial vehicle-to-ground communication system model, and determining the trajectory of the unmanned aerial vehicle with this model; determining an optimization objective function of the AoI according to the unmanned aerial vehicle trajectory, wherein AoI represents the age of information;
S2: acquiring current environment state information, and performing deep reinforcement learning on the AoI optimization objective function by adopting an improved PPO algorithm according to the current environment state information to obtain the current flight state of the unmanned aerial vehicle, the improved PPO algorithm comprising optimizing the PPO algorithm by adopting centralized strategy learning and shared rewards;
S3: the unmanned aerial vehicle collects the data packets generated by the ground nodes in the current flight state; if data collection for all nodes is finished, the flight of the unmanned aerial vehicle ends, otherwise the method returns to step S2.
2. The method of claim 1, wherein building the unmanned aerial vehicle-to-ground communication system model comprises: acquiring the flight environment information of the unmanned aerial vehicle, and dividing the acquired environment information into a series of cells of the same size by a grid method; arranging a base station at the center of the flight area of the unmanned aerial vehicle, the coverage of the base station being a circular area with radius R, and designating the cells outside the coverage of the base station signal as no-fly zones; and acquiring the position information of the ground communication nodes, and constructing the unmanned aerial vehicle-to-ground communication system model according to the flight environment of the unmanned aerial vehicle and the position information of the ground communication nodes.
3. The method of claim 1, wherein the step of determining the optimization objective function of the AoI comprises: discretizing the time for which the unmanned aerial vehicle executes the task into at least two time intervals of equal length; determining the flight altitude and flight speed of the unmanned aerial vehicle; constructing an unmanned aerial vehicle speed constraint according to the flight altitude, the flight speed, and the task time of the unmanned aerial vehicle; the unmanned aerial vehicle collects ground information, and when the unmanned aerial vehicle collects the latest data stored at a ground node, the AoI of the ground node is updated, otherwise the AoI of the ground node increases linearly; if no data is stored in the buffer of the ground node or the data has already been collected, the AoI is set to 1, otherwise the AoI is set to 0; and recording the time at which unmanned aerial vehicle n completes collection from ground node m as $t_s$, and planning the flight trajectory and the connection strategy of the unmanned aerial vehicle according to $t_s$.
4. The method of claim 3, wherein the optimization objective function is as follows:

$$(\mathrm{P1}):\quad \min_{q,K}\ \sum_{m\in M} A_m(t_s)$$

$$\mathrm{s.t.}\quad \|q_n(t)-q_n(t-1)\|\le V_{\max},$$

$$\sum_{m\in M} K_{n,m}(t)\le 1,\ \forall n\in N,\qquad \sum_{n\in N} K_{n,m}(t)\le 1,\ \forall m\in M,$$

wherein q represents the trajectory sequence formed by the positions of the unmanned aerial vehicles, K represents the connection relation between the unmanned aerial vehicles and the ground nodes, M represents the number of ground nodes, $A_m$ represents the AoI of ground node m, $t_s$ represents the time at which unmanned aerial vehicle n completes collection from ground node m, $q_n(t)$ represents the position of the n-th unmanned aerial vehicle at time t, $V_{\max}$ represents the maximum flight speed of the unmanned aerial vehicle, $K_{n,m}(t)\in\{0,1\}$ denotes the connection relation between unmanned aerial vehicle n and node m at time $t\in[0,T]$, and N represents the number of unmanned aerial vehicles.
5. The unmanned aerial vehicle trajectory adaptive optimization method based on the information age is characterized in that the connection strategy of the unmanned aerial vehicle is a Markov decision process, the Markov decision process comprising a quadruple <S, A, P, R>, wherein S and A are the state space and the action space, respectively, representing the state and the action of the unmanned aerial vehicle; P is the state transition function, representing the probability of transitioning to the next state when the unmanned aerial vehicle executes an action in the current state; and R is the reward function, representing the reward that can be obtained when the unmanned aerial vehicle is in the current state.
6. The method for unmanned aerial vehicle trajectory adaptive optimization based on information age according to claim 5, wherein the state transition function comprises a position update equation of the unmanned aerial vehicle and a state update equation of the AoI; the position update equation of the unmanned aerial vehicle is:

$$q(t+1)=q(t)+D\,e(V_t),\qquad e(\mathrm{North})=(0,1),\ e(\mathrm{South})=(0,-1),\ e(\mathrm{East})=(1,0),\ e(\mathrm{West})=(-1,0);$$

the state update equation of the AoI is:

$$A_m(t+1)=\begin{cases}1, & \text{if } K_{n,m}(t)=1 \text{ and the unmanned aerial vehicle is within the coverage of sensor } m,\\ A_m(t)+1, & \text{otherwise,}\end{cases}$$

wherein q(t) represents the horizontal position coordinates of the unmanned aerial vehicle, D represents the distance between the center positions of two adjacent grid cells, $V_t$ represents the flight direction of the unmanned aerial vehicle, North, South, East, and West denote the four flight directions, $A_m(t)$ represents the AoI of ground sensor m at time t, and $K_{n,m}(t)$ represents the connection relation between unmanned aerial vehicle n and node m at time $t\in[0,T]$.
7. The method of claim 5, wherein the reward function is constructed according to the target optimization problem, the goal of trajectory planning being to minimize the AoI of the acquired target information, and the reward function being a function related to the AoI; when a target point is found, the reward is $r_1$; when the unmanned aerial vehicle flies out of the active area, the reward is $-r_2$; when the information acquisition of the unmanned aerial vehicle is finished, whether the data packets of all ground nodes have been acquired is judged, and if they have been acquired the reward is $r_3$, otherwise it is $-r_4$; in other cases the reward is $-A_m(t)$; wherein $r_1, r_2, r_3, r_4$ are positive numbers.
8. The unmanned aerial vehicle trajectory adaptive optimization method based on the information age is characterized in that the process of performing deep reinforcement learning on the AoI optimization objective function by adopting the improved PPO algorithm comprises the following steps:
S21: inputting the state information $(s_1, s_2 \dots s_n)$ into the Actor network to obtain the probabilities of all actions, and outputting the joint action $(a_1, a_2 \dots a_n)$ according to the probabilities of all actions; all agents share one Actor network, the input of each agent i is the globally observed environment information, and the output is the joint action of agent i;
S22: inputting the joint action $(a_1, a_2 \dots a_n)$ into the environment to obtain the global reward r and the next state s', obtaining the trajectory τ (the sequence of states, joint actions, and rewards) according to the successive states, and storing it in an experience pool;
S23: inputting all states s in the trajectory τ into the Critic network to obtain the state values $V(s_t)$ corresponding to all states of the unmanned aerial vehicle in one trajectory;
S24: after the unmanned aerial vehicle executes the joint action $a_t$ and reaches state $s_{t+1}$, calculating the expected cumulative reward $G_t = r_t + \gamma V(s_{t+1})$ for different actions, computing the advantage function $A(s_t, a_t) = G_t - V(s_t)$ from the cumulative reward, and using generalized advantage estimation to balance the variance and bias of the value function estimate;
S25: calculating the loss of the Critic network, the Critic loss function being the mean square of the advantage function;
S26: using the obtained advantage function $A(s_t, a_t)$ as the Critic network's evaluation of the action strategy, and improving the output strategy of the Actor network to obtain a new strategy $\pi_\theta$;
S27: inputting all stored state combinations s into the Actor networks of the new and old strategies $\pi_\theta$ and $\pi_{\theta'}$, respectively, to obtain the unmanned aerial vehicle action probability distributions prob1 and prob2 under the different strategies; calculating the importance weights according to prob1 and prob2; obtaining the corrected difference between the two action distributions of the different strategies θ and θ' according to the importance weights, and calculating the updated expected return of the strategy according to the difference between the two action distributions;
S28: setting the constraint condition of the updated strategy, and calculating the loss function of the Actor network according to the constraint condition and the expected return of the strategy;
S29: updating the parameters of the Actor network and the Critic network by using a gradient descent algorithm according to the loss functions, until the reward converges and no longer changes, and outputting the current optimal flight strategy of the unmanned aerial vehicle.
9. The method of claim 8, wherein the optimal strategy is as follows:

$$\pi^{*}=\arg\max_{\pi}\ \mathbb{E}_{\tau\sim\pi}\!\left[\sum_{t}\gamma^{t}r_{t}\right],$$

wherein $\mathbb{E}_{\tau\sim\pi}[\cdot]$ represents the expectation of the discounted reward taken over the trajectory τ, γ represents the discount factor, and $r_t$ represents the instantaneous reward at time t.
10. The method of claim 8, wherein the loss function of the Actor network is as follows:

$$J^{CLIP}(\theta)=\mathbb{E}\!\left[\min\bigl(r(t)A^{\pi},\ \mathrm{clip}(r(t),1-\varepsilon,1+\varepsilon)A^{\pi}\bigr)\right],$$

wherein $J^{CLIP}(\theta)$ represents the loss function of the Actor network, θ represents the network weights of the Actor, E represents the expectation operation, r(t) is the ratio of the new and old strategies, $A^{\pi}$ represents the advantage function under strategy π, clip is a selecting output function, i.e. if r(t) is less than $1-\varepsilon$, $1-\varepsilon$ is output, if r(t) is greater than $1+\varepsilon$, $1+\varepsilon$ is output, and otherwise r(t) is output, and ε represents the clipping magnitude hyper-parameter.
CN202211348121.6A 2022-10-31 2022-10-31 Unmanned aerial vehicle track self-adaptive optimization method based on information age Pending CN115696211A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211348121.6A CN115696211A (en) 2022-10-31 2022-10-31 Unmanned aerial vehicle track self-adaptive optimization method based on information age


Publications (1)

Publication Number Publication Date
CN115696211A true CN115696211A (en) 2023-02-03

Family

ID=85046179

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211348121.6A Pending CN115696211A (en) 2022-10-31 2022-10-31 Unmanned aerial vehicle track self-adaptive optimization method based on information age

Country Status (1)

Country Link
CN (1) CN115696211A (en)


Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116320984A (en) * 2023-03-22 2023-06-23 扬州宇安电子科技有限公司 Unmanned aerial vehicle safety communication system and method based on cooperative interference
CN116320984B (en) * 2023-03-22 2023-10-03 扬州宇安电子科技有限公司 Unmanned aerial vehicle safety communication system and method based on cooperative interference
CN116232440A (en) * 2023-03-23 2023-06-06 鹏城实验室 Data acquisition method, system and storage medium
CN116233791A (en) * 2023-03-23 2023-06-06 重庆邮电大学 Track optimization and resource allocation method in multi-machine cooperative internet of vehicles
CN116232440B (en) * 2023-03-23 2024-05-14 鹏城实验室 Data acquisition method, system and storage medium
CN116233791B (en) * 2023-03-23 2024-05-24 重庆邮电大学 Track optimization and resource allocation method in multi-machine cooperative internet of vehicles
CN117193378A (en) * 2023-10-24 2023-12-08 安徽大学 Multi-unmanned aerial vehicle path planning method based on improved PPO algorithm
CN117193378B (en) * 2023-10-24 2024-04-12 安徽大学 Multi-unmanned aerial vehicle path planning method based on improved PPO algorithm
CN117193381A (en) * 2023-11-07 2023-12-08 天津云圣智能科技有限责任公司 Unmanned aerial vehicle control method and device and computer storage medium
CN117193381B (en) * 2023-11-07 2024-02-23 天津云圣智能科技有限责任公司 Unmanned aerial vehicle control method and device and computer storage medium
CN117729555A (en) * 2024-02-18 2024-03-19 北京中电飞华通信有限公司 Air base station deployment method, cooperative system and related equipment
CN117729555B (en) * 2024-02-18 2024-04-26 北京中电飞华通信有限公司 Air base station deployment method, cooperative system and related equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination