CN115696211A - Unmanned aerial vehicle track self-adaptive optimization method based on information age - Google Patents


Info

Publication number
CN115696211A
CN115696211A
Authority
CN
China
Prior art keywords
unmanned aerial
aerial vehicle
strategy
function
aoi
Prior art date
Legal status
Pending
Application number
CN202211348121.6A
Other languages
Chinese (zh)
Inventor
胡昊南
韩铭
张�杰
陈前斌
Current Assignee
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202211348121.6A priority Critical patent/CN115696211A/en
Publication of CN115696211A publication Critical patent/CN115696211A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention belongs to the field of unmanned aerial vehicle communication, and particularly relates to an unmanned aerial vehicle trajectory adaptive optimization method based on information age, which comprises the following steps: constructing an unmanned aerial vehicle-to-ground communication system model, and modeling the trajectory planning problem with this model; the unmanned aerial vehicle collects the data packets generated by the ground nodes in its current flight state; current environment state information is acquired, and a deep reinforcement learning algorithm is adopted to solve the information-age optimization objective function for an optimal solution according to the current environment state information. According to the invention, the optimal decision is obtained through the AoI optimization objective function, and the optimal unmanned aerial vehicle trajectory is obtained according to the optimal decision, so that optimal communication is realized.

Description

Unmanned aerial vehicle track self-adaptive optimization method based on information age
Technical Field
The invention belongs to the field of unmanned aerial vehicle communication, and particularly relates to an unmanned aerial vehicle trajectory adaptive optimization method based on information age.
Background
With the development and deployment of 6G technology, more and more emerging Internet of Things applications are appearing in people's daily lives, such as smart homes, intelligent transportation, and smart health. In Internet of Things systems oriented toward real-time applications, devices need to sense the surrounding physical environment and monitor the system state in real time, thereby providing timely and valid information for intelligent decision-making and control. This information may be the instantaneous acceleration and position of a vehicle, or the ambient temperature, soil humidity, or the status of a network control or decision-making system. For such time-sensitive information, if the receiving terminal obtains outdated information, invalid decisions and erroneous control will result. In these scenarios and applications, information freshness is extremely important to the system. The concept of Age of Information (AoI) was introduced to measure the freshness of data in a wireless sensor network. Specifically, the age of information refers to the time elapsed since the most recent packet received by the receiving end was generated at the source device.
In order to meet the ubiquitous, full-coverage network requirements of the future, in addition to the terrestrial communication network, the 6G network needs to satisfy coverage and capacity requirements based on satellites and unmanned aerial vehicles, thereby forming an integrated space-air-ground network. Because an unmanned aerial vehicle has good flexibility and mobility, when the source node is far away from the target node, the unmanned aerial vehicle can act as a relay node to collect data from the sensor nodes, thereby greatly reducing the staleness of the data. Therefore, research on AoI in unmanned aerial vehicle networks is of great significance. Much previous work in the field of wireless communications has been devoted to the study of data collection, the AoI of Internet of Things devices, and cellular network improvements. However, in large-scale, energy-scarce terrestrial scenarios where the devices form a network, most existing schemes are based on a single-UAV system. Such networks often suffer from problems such as outdated information collection, inefficient energy harvesting, and the limited energy of the drones. The unmanned aerial vehicle needs a long time to collect information and return it to the control center, and because the unmanned aerial vehicle has limited energy and poor endurance, the ongoing task may be interrupted due to insufficient battery, so that reception or updating of the sensor node data stops.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides an unmanned aerial vehicle trajectory adaptive optimization method based on information age, which comprises the following steps:
S1: constructing an unmanned aerial vehicle-to-ground communication system model, and determining the trajectory of the unmanned aerial vehicle with this model; determining an AoI optimization objective function according to the unmanned aerial vehicle trajectory;
S2: acquiring current environment state information, solving the optimization objective function for an optimal solution by adopting a deep reinforcement learning algorithm according to the current environment state information, and determining the current flight state of the unmanned aerial vehicle according to the optimal solution;
S3: the unmanned aerial vehicle collects the data packets generated by the ground nodes in the current flight state; if data collection for all nodes is finished, the flight of the unmanned aerial vehicle ends, otherwise the method returns to step S2.
Preferably, constructing the unmanned aerial vehicle-to-ground communication system model comprises: acquiring the flight environment information of the unmanned aerial vehicle, dividing the acquired environment into a series of cells of the same size by a grid method, and designating some of the cells as no-fly zones; and acquiring the position information of the ground sensor nodes, and constructing the unmanned aerial vehicle-to-ground communication system model according to the flight environment of the unmanned aerial vehicle and the position information of the ground sensor nodes.
Preferably, the process of determining the AoI optimization objective function includes: discretizing the time for which the unmanned aerial vehicle executes its task into several time intervals of equal length; determining the flight altitude and flight speed of the unmanned aerial vehicle; constructing an unmanned aerial vehicle speed constraint according to the flight altitude, the flight speed, and the task time; the unmanned aerial vehicle collects ground information, and when the unmanned aerial vehicle collects the latest data stored at a ground node, the AoI of that ground node is updated, otherwise the AoI of the ground node increases linearly; if no data is stored in the buffer of the ground node or the data has already been collected, the AoI is set to 1; the time at which unmanned aerial vehicle n completes collection from ground node m is recorded as $t_s$, and the flight trajectory and connection strategy of the unmanned aerial vehicle are planned according to $t_s$.
Further, the optimization objective function is:

$$(\mathrm{P1}):\quad \min_{q,K}\ \sum_{m\in M} A_m(t_s)$$

$$\mathrm{s.t.}\quad \|q_n(t)-q_n(t-1)\|\le V_{\max},$$

$$\sum_{m\in M} K_{n,m}(t)\le 1,\ \forall n\in N,\qquad \sum_{n\in N} K_{n,m}(t)\le 1,\ \forall m\in M,$$

wherein q represents the trajectory sequence formed by the positions of the unmanned aerial vehicles, K represents the connection relation between the unmanned aerial vehicles and the ground nodes, M represents the number of ground nodes, $A_m$ represents the AoI of ground node m, $t_s$ represents the time at which unmanned aerial vehicle n completes collection from ground node m, $q_n(t)$ represents the position of the n-th unmanned aerial vehicle at time t, $V_{\max}$ represents the maximum flight speed of the unmanned aerial vehicle, $K_{n,m}(t)\in\{0,1\}$ denotes the connection relation between unmanned aerial vehicle $n\in N$ and ground node $m\in M$ at time $t\in[0,T]$, and N represents the number of unmanned aerial vehicles.
Preferably, the connection policy is modeled as a Markov decision process, the Markov decision process comprising a quadruple <S, A, P, R>, where S and A are the state space and the action space, respectively, representing the state and the action of the unmanned aerial vehicle; P is the state transition function, representing the probability of transitioning to the next state when the unmanned aerial vehicle executes an action in the current state; and R is the reward function, representing the reward obtainable when the unmanned aerial vehicle is in the current state.
Further, the reward function is designed according to the target problem; the goal of trajectory planning is to minimize the AoI of the collected information, so the reward function is a function of AoI: when a target point is found, the reward is $r_1$; when the unmanned aerial vehicle flies out of the active area, the reward is $-r_2$; the number of episodes is fixed, and when a single episode ends, whether the data packets of all ground nodes have been collected is judged, and if they have been collected the reward is $r_3$, otherwise it is $-r_4$; in other cases the reward is $-A_m(t)$; wherein $r_1, r_2, r_3, r_4$ are positive numbers.
Preferably, the process of performing deep reinforcement learning on the AoI optimization objective function by using the improved PPO algorithm includes:
S21: inputting the state information $(s_1, s_2 \dots s_n)$ into the Actor network to obtain the probabilities of all actions, and outputting the joint action $(a_1, a_2 \dots a_n)$ according to the probabilities of all actions; all agents share one Actor network, the input of each agent i is the globally observed environment information, and the output is the joint action of agent i;
S22: inputting the joint action $(a_1, a_2 \dots a_n)$ into the environment to obtain the global reward r and the next state s', obtaining the trajectory τ (the sequence of states, joint actions, and rewards) according to the successive states, and storing it in an experience pool;
S23: inputting all states s in the trajectory τ into the Critic network to obtain the state values $V(s_t)$ corresponding to all states of the unmanned aerial vehicle in one trajectory;
S24: after the unmanned aerial vehicle executes the joint action $a_t$ and reaches state $s_{t+1}$, calculating the expected cumulative reward $G_t = r_t + \gamma V(s_{t+1})$ for different actions, computing the advantage function $A(s_t, a_t) = G_t - V(s_t)$ from the cumulative reward, and using generalized advantage estimation to balance the variance and bias of the value function estimate;
S25: calculating the loss of the Critic network, the Critic loss function being the mean square of the advantage function;
S26: using the obtained advantage function $A(s_t, a_t)$ as the Critic network's evaluation of the action strategy, and improving the output strategy of the Actor network to obtain a new strategy $\pi_\theta$;
S27: inputting all stored state combinations s into the Actor networks of the new and old strategies $\pi_\theta$ and $\pi_{\theta'}$, respectively, to obtain the unmanned aerial vehicle action probability distributions prob1 and prob2 under the different strategies; calculating the importance weights according to prob1 and prob2; obtaining the corrected difference between the two action distributions of the different strategies θ and θ' according to the importance weights, and calculating the updated expected return of the strategy according to the difference between the two action distributions;
S28: setting the constraint condition of the updated strategy, and calculating the loss function of the Actor network according to the constraint condition and the expected return of the strategy;
S29: updating the parameters of the Actor network and the Critic network by using a gradient descent algorithm according to the loss functions, until the reward converges and no longer changes, and outputting the current optimal flight strategy of the unmanned aerial vehicle.
The invention has the beneficial effects that:
the method utilizes a low-complexity algorithm to plan the optimal track of the unmanned aerial vehicle, the algorithm has high convergence speed and stable training result, and finally, the AoI of the ground node can be obviously reduced, the information freshness of the transmitted data can be effectively ensured, and the wrong decision caused by untimely data delivery is avoided.
Drawings
FIG. 1 is a flow chart of the unmanned aerial vehicle trajectory optimization method based on deep reinforcement learning of the present invention;
FIG. 2 is a diagram of a model of a UAV-to-ground communication system of the present invention;
fig. 3 is a flow chart of the PPO algorithm of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
An unmanned aerial vehicle trajectory adaptive optimization method based on information age is disclosed, as shown in fig. 1; the method comprises the following steps:
S1: constructing an unmanned aerial vehicle-to-ground communication system model, and determining the trajectory of the unmanned aerial vehicle with this model; determining an AoI optimization objective function according to the unmanned aerial vehicle trajectory;
S2: obtaining current environment state information, solving the optimization objective function for an optimal solution by adopting a deep reinforcement learning algorithm according to the current environment state information, and determining the current flight state of the unmanned aerial vehicle according to the optimal solution;
S3: the unmanned aerial vehicle collects the data packets generated by the ground nodes in the current flight state; if data collection for all nodes is finished, the flight of the unmanned aerial vehicle ends, otherwise the method returns to step S2.
Specifically, an implementation of the unmanned aerial vehicle trajectory adaptive optimization method based on information age includes:
S1: establishing an unmanned aerial vehicle-to-ground communication system model, and determining an AoI optimization objective function according to the trajectory of the unmanned aerial vehicle.
S11: the process of determining the AoI optimization objective function includes: discretizing the time for which the unmanned aerial vehicle executes its task into several time intervals of equal length; determining the flight altitude and flight speed of the unmanned aerial vehicle; constructing a mobility constraint of the unmanned aerial vehicle according to the flight altitude, the flight speed, and the task time; the unmanned aerial vehicle collects ground information, and when the unmanned aerial vehicle successfully collects the latest data stored at a certain ground node, the AoI of that ground node is updated, otherwise the AoI of the node increases linearly; if the buffer of the ground node has no stored data or the data has already been collected, the AoI is set to 1, otherwise it is set to 0; the time at which unmanned aerial vehicle n completes collection from ground node m is recorded as $t_s$, and the flight trajectory and connection strategy of the unmanned aerial vehicle are planned according to $t_s$.
Planning the flight trajectory and the connection strategy of the unmanned aerial vehicle includes the following: for the communication connection between the unmanned aerial vehicles and the nodes, each unmanned aerial vehicle communicates with only one node at any time; meanwhile, in order to achieve better cooperation among the unmanned aerial vehicles, when a ground node is occupied by one unmanned aerial vehicle, other unmanned aerial vehicles cannot visit it, and any node m can be served by at most one unmanned aerial vehicle at the same time. The connection relation between the unmanned aerial vehicles and the ground nodes is related to the trajectory, and during flight each unmanned aerial vehicle always selects the nearest unoccupied node to communicate with.
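The following is a minimal sketch of this nearest-unoccupied-node rule (the function and variable names are illustrative assumptions, not taken from the patent text):

```python
import math

def assign_nodes(drone_positions, node_positions, node_done):
    """One-slot greedy assignment: each drone picks the nearest ground node that
    still holds data and is not occupied by another drone; a node is served by
    at most one drone and a drone serves at most one node."""
    occupied = set()
    assignment = {}
    for n, q in enumerate(drone_positions):
        best_m, best_d = None, math.inf
        for m, g in enumerate(node_positions):
            if node_done[m] or m in occupied:
                continue  # node already collected or taken by another drone this slot
            d = math.dist(q, g)
            if d < best_d:
                best_m, best_d = m, d
        if best_m is not None:
            occupied.add(best_m)   # at most one drone per node
        assignment[n] = best_m     # at most one node per drone
    return assignment
```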
Specifically, as shown in fig. 2, consider multiple unmanned aerial vehicles communicating with multiple ground nodes, where the unmanned aerial vehicles act as mobile relays to collect data information from the ground nodes, and the positions of the ground nodes are fixed and known to the unmanned aerial vehicles. The task time is divided into slots of equal size. Within the given T time slots, the unmanned aerial vehicle needs to take off from the starting position and collect the data of M sensors. The area in which the unmanned aerial vehicle executes its task is divided into small grid cells of uniform size. After the area is evenly divided by the grid, it is assumed that in each time slot the unmanned aerial vehicle flies along the grid-cell centers at a fixed altitude H. The center position of grid cell i is denoted $c_i=(x_i,y_i)\in C$, where C represents the set of center positions of all grid cells in the region, and the distance between the center positions of two adjacent cells is denoted D. In each slot, the unmanned aerial vehicle flies at a fixed speed V to the center of an adjacent cell, so the flight trajectory of the unmanned aerial vehicle can be represented by an ordered set of grid-cell center positions. A three-dimensional Cartesian coordinate system is established, in which the position of a ground node is expressed as $g(t)=(x(t),y(t),0)$, $t\in[1,T]$; during the communication time T the unmanned aerial vehicle flies at the fixed altitude H, and its position is expressed as $q(t)=[x(t),y(t),H]$, $0\le t\le T$. When the discrete time interval is sufficiently small, the speed and position of the unmanned aerial vehicle satisfy

$$\|q_n(t)-q_n(t-1)\|\le V_{\max}.$$

With the aim of minimizing AoI, let u(t) be the generation time of the latest data uploaded to the unmanned aerial vehicle; the AoI is then $a(t)=t-u(t)$. When the unmanned aerial vehicle successfully collects the latest updated data from a ground node, the AoI of that node becomes the AoI of the latest information; otherwise, the AoI of the node increases linearly.
When there is data in the buffer of ground node m, the AoI of node m is reset upon a successful collection and otherwise increases by one per slot, i.e.

$$A_m(t+1)=\begin{cases}1, & \text{if the data of node } m \text{ is collected in slot } t,\\ A_m(t)+1, & \text{otherwise.}\end{cases}$$

In the absence of conventional terrestrial infrastructure communication facilities, multiple unmanned aerial vehicles are dispatched to fly and collect the data of the ground nodes. The base station is located at the center of the area, and the coverage of the BS is a circular area with radius R; within the coverage area there are M randomly distributed ground nodes M = {1,2,…,M}, and each ground node holds a data packet storing information about the surrounding environment, such as temperature and air pollution conditions, with a packet size of 1 M. The area contains N unmanned aerial vehicles N = {1,2,…,N}, and the base station control station sends control commands to the unmanned aerial vehicles through a satellite link. Each ground node has a spherical transmission range of radius r, within which a good line-of-sight channel exists between the unmanned aerial vehicle and the ground node; because the unmanned aerial vehicle flies at high altitude, a line-of-sight (LoS) communication link is established between the unmanned aerial vehicle and the sensors on the ground. Similarly, a LoS link also exists between the unmanned aerial vehicle and the base station, so the channel gain between the unmanned aerial vehicle and the base station in time slot t can be obtained, specifically expressed as $h=\beta_0 d^{-2}$, where d is the distance between the two and $\beta_0$ is the channel gain at a reference distance of 1 meter.
The effective coverage range of a sensor is determined by the transmit power P of the sensor, the channel bandwidth B between the unmanned aerial vehicle and the sensor, the noise power $\sigma^2$ at the unmanned aerial vehicle, and the size S of the data status update packet generated by the sensor (the explicit expression is given as an equation image in the original filing).
The invention aims to provide an AoI-minimizing unmanned aerial vehicle trajectory planning method that satisfies the communication requirements of multiple unmanned aerial vehicles and multiple sensor nodes on the ground. The method jointly optimizes the unmanned aerial vehicle trajectory {q[t]} and the transmission packet scheduling {K[t]}. The optimization problem can be expressed as (P1):
$$(\mathrm{P1}):\quad \min_{q,K}\ \sum_{m\in M} A_m(t_s)$$

$$\mathrm{s.t.}\quad \|q_n(t)-q_n(t-1)\|\le V_{\max},$$

$$\sum_{m\in M} K_{n,m}(t)\le 1,\ \forall n\in N,\qquad \sum_{n\in N} K_{n,m}(t)\le 1,\ \forall m\in M,$$

wherein q represents the trajectory sequence formed by the positions of the unmanned aerial vehicles, K represents the connection relation between the unmanned aerial vehicles and the ground nodes, M represents the number of ground nodes, $A_m$ represents the AoI of ground node m, $t_s$ represents the time at which unmanned aerial vehicle n completes collection from ground node m, $q_n(t)$ represents the position of the n-th unmanned aerial vehicle at time t, $V_{\max}$ represents the maximum flight speed of the unmanned aerial vehicle, $K_{n,m}(t)\in\{0,1\}$ denotes the connection relation between unmanned aerial vehicle $n\in N$ and ground node $m\in M$ at time $t\in[0,T]$, and N represents the number of unmanned aerial vehicles.
The connection relation between the unmanned aerial vehicles and the nodes is modeled with a binary variable $K_{n,m}(t)$, which represents the connection relation between unmanned aerial vehicle $n\in N$ and node $m\in M$ at time $t\in[0,T]$. When $K_{n,m}(t)=1$, unmanned aerial vehicle n is accessing node m; conversely, when $K_{n,m}(t)=0$, the two are not in communication connection with each other. For the communication connection between an unmanned aerial vehicle and the nodes, there is the constraint

$$\sum_{m\in M} K_{n,m}(t)\le 1,\ \forall n\in N,$$

indicating that each unmanned aerial vehicle communicates with only one node at any time; meanwhile, in order to achieve better cooperation among the unmanned aerial vehicles, when a ground node is occupied by one unmanned aerial vehicle, other unmanned aerial vehicles cannot visit it, which is expressed by the constraint

$$\sum_{n\in N} K_{n,m}(t)\le 1,\ \forall m\in M,$$

meaning that any node m can be served by at most one unmanned aerial vehicle at the same time. The connection relation between the unmanned aerial vehicles and the ground nodes is related to the trajectory, and during flight each unmanned aerial vehicle always selects the nearest unoccupied node to communicate with.
The optimization problem can be modeled as a Markov decision process, which is trained with a deep reinforcement learning algorithm and comprises the quadruple <S, A, P, R>; S and A are the state space and the action space, respectively, representing the state and the action of the unmanned aerial vehicle; P is the state transition function, representing the probability of transitioning to the next state when the unmanned aerial vehicle executes an action in the current state; R is the reward function, representing the reward obtainable when the unmanned aerial vehicle is in the current state. The objective of the Markov decision process is to find the optimal strategy

$$\pi^{*}=\arg\max_{\pi}\ \mathbb{E}_{\tau\sim\pi}\!\left[\sum_{t}\gamma^{t}r_{t}\right]$$

for decision making. The method specifically comprises the following elements:
(1) State space
The state space consists of the positions of all unmanned aerial vehicles, the AoI of the ground nodes, and the connections between the unmanned aerial vehicles and the ground nodes. The position of unmanned aerial vehicle n at time t is $q_{t,n}$, the AoI of a ground node is $A_{t,m}$, and at time t the ground node to which unmanned aerial vehicle n is connected, i.e. the selected subtask, is $C_{t,n}\in[1,M]$. Thus, the state space is defined as $X_{t,n}=\{s_{t,1},\dots,s_{t,N}\}$.
(2) Action space
For the multi-agent scenario, the action space is defined as the joint action of all unmanned aerial vehicles in slot k, containing the flight direction of each unmanned aerial vehicle, i.e. $a_{t,n}\in\{\mathrm{N},\mathrm{S},\mathrm{W},\mathrm{E}\}$. The joint action space is thus $A_{t,n}=\{a_{t,1},\dots,a_{t,N}\}$.
(3) Reward function
The reward function is designed according to the target problem; the goal of trajectory planning is to minimize the AoI of the collected information, so the reward function is a function of AoI. When a target point is found, the reward is $r_1$; when the unmanned aerial vehicle flies out of the active area, the reward is $-r_2$; the number of episodes is fixed, and when a single episode ends, whether the data packets of all ground nodes have been collected is judged: if they have been collected, the reward is $r_3$, otherwise it is $-r_4$; in other cases the reward is $-A_m(t)$ (where $r_1, r_2, r_3, r_4$ are positive numbers).
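A minimal sketch of a per-step reward of this shape is given here, with illustrative values for $r_1$ to $r_4$ (the constants and the function name are hypothetical, not specified by the patent):

```python
def step_reward(found_target, out_of_area, episode_done, all_collected, aoi_m,
                r1=10.0, r2=5.0, r3=20.0, r4=20.0):
    """AoI-shaped reward: terminal bonuses/penalties, otherwise the negative AoI."""
    if found_target:
        return r1                              # a target point was found
    if out_of_area:
        return -r2                             # the drone flew out of the active area
    if episode_done:
        return r3 if all_collected else -r4    # end-of-episode collection check
    return -aoi_m                              # default case: -A_m(t)
```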
(4) State transition
In slot t, if the unmanned aerial vehicle selects sensor m to send a data status update packet and the unmanned aerial vehicle is located within the effective coverage range of sensor m at that moment, the AoI of sensor m is reduced to 1; otherwise its AoI increases by 1.
The position update equation of the unmanned aerial vehicle can be expressed as:

$$q(t+1)=q(t)+D\,e(V_t),\qquad e(\mathrm{North})=(0,1),\ e(\mathrm{South})=(0,-1),\ e(\mathrm{East})=(1,0),\ e(\mathrm{West})=(-1,0).$$

The state update equation of the AoI is:

$$A_m(t+1)=\begin{cases}1, & \text{if } K_{n,m}(t)=1 \text{ and the unmanned aerial vehicle is within the coverage of sensor } m,\\ A_m(t)+1, & \text{otherwise,}\end{cases}$$

wherein q(t) represents the horizontal position coordinates of the unmanned aerial vehicle, D represents the distance between the center positions of two adjacent grid cells, $V_t$ represents the flight direction of the unmanned aerial vehicle (North, South, East, or West), $A_m(t)$ represents the AoI of ground sensor m at time t, and $K_{n,m}(t)$ represents the connection relation between unmanned aerial vehicle $n\in N$ and node $m\in M$ at time $t\in[0,T]$.
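As an illustrative sketch of these transition rules (the direction encoding and the coverage test are assumptions made for the example, not the patent's exact formulation):

```python
DIRS = {"North": (0, 1), "South": (0, -1), "East": (1, 0), "West": (-1, 0)}

def transition(q, aoi, direction, served_node, in_coverage, D):
    """One-slot transition: move the drone one grid cell of size D and update
    the AoI of every ground node (reset to 1 only for the node served in coverage)."""
    dx, dy = DIRS[direction]
    q_next = (q[0] + D * dx, q[1] + D * dy)        # position update
    aoi_next = []
    for m, a in enumerate(aoi):
        if m == served_node and in_coverage:
            aoi_next.append(1)                      # fresh data collected from node m
        else:
            aoi_next.append(a + 1)                  # otherwise AoI grows by one slot
    return q_next, aoi_next
```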
S2: based on the current environment state information, the action in the current state is selected according to the obtained strategy by using a deep reinforcement learning algorithm.
S21: for the multi-agent task, the goal is to determine an optimal strategy that maximizes $J(\pi_\theta)$. A fully centralized learning framework is considered, in which all unmanned aerial vehicles are treated as a single super-agent.
In particular, in the fully centralized learning framework, it is assumed that the UAVs can know the global state $X_{t,n}$ and the joint action $A_{t,n}$, and a centralized strategy $\pi_\theta$ is learned by using an RL algorithm. Therefore, both the training and execution phases always need the global information contained in $X_{t,n}$ and $A_{t,n}$. The centralized Critic estimates a joint value function based on the global information, and the Actor makes decisions based on the global state.
S22: during flight the unmanned aerial vehicle continuously receives flight position data broadcast from the ground base station, and updates its strategy with the reward returns obtained by interacting with the environment; the pseudo-code of the algorithm for optimization problem (P1) is as follows:
(The pseudo-code table of the algorithm for problem (P1) is provided as an image in the original filing.)
As shown in fig. 3, the specific process of performing deep reinforcement learning on the AoI optimization objective function by using the improved PPO algorithm is as follows: all hyper-parameters are set, including the hyper-parameter ω of the Critic value network $Q_\omega$ and the hyper-parameter θ of the Actor strategy network π; the hyper-parameter θ of the Actor strategy network π is initialized to obtain θ', and the hyper-parameter ω of the Critic value network $Q_\omega$ is initialized to obtain ω'; the maximum number of iteration rounds L is set. Trajectory data are collected using the old Actor network: during data collection, each UAV interacts with the environment using the policy $\pi_{old}$, and in each iteration each UAV collects a trajectory τ of T slots. The advantage function and the target V value are calculated, and the trajectory, the advantage function, and the target V value are then stored in a batch for later sampling. All batches are cycled through for K epochs, mini-batches of size mini-batch are drawn from them, and the policy loss and the value loss are calculated and minimized with the Adam optimizer. Finally, the Actor network is updated with the gradient of the policy loss, and the Critic network is updated with the gradient of the value loss.
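A compact, assumption-laden sketch of such a centralized PPO update is given below in PyTorch style (the network sizes, hyper-parameter values, and class names are illustrative and not prescribed by the patent):

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Shared policy over the joint action space of all drones."""
    def __init__(self, state_dim, n_joint_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 128), nn.Tanh(),
                                 nn.Linear(128, n_joint_actions))
    def forward(self, s):
        return torch.distributions.Categorical(logits=self.net(s))

class Critic(nn.Module):
    """Centralized value function evaluated on the global state."""
    def __init__(self, state_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 128), nn.Tanh(),
                                 nn.Linear(128, 1))
    def forward(self, s):
        return self.net(s).squeeze(-1)

def ppo_update(actor, critic, opt_a, opt_c, batch, clip_eps=0.2, epochs=4):
    """One PPO update on a batch collected with the old policy.
    batch = (states, actions, old_log_probs, target_values, advantages)."""
    s, a, old_logp, target_v, adv = batch
    for _ in range(epochs):
        dist = actor(s)
        ratio = torch.exp(dist.log_prob(a) - old_logp)              # importance weight r(theta)
        clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
        actor_loss = -torch.min(ratio * adv, clipped * adv).mean()  # clipped surrogate objective
        critic_loss = ((target_v - critic(s)) ** 2).mean()          # mean-squared value error
        opt_a.zero_grad(); actor_loss.backward(); opt_a.step()
        opt_c.zero_grad(); critic_loss.backward(); opt_c.step()
```

In a full run, the trajectories, GAE advantages, and target V values collected with the old Actor would be stored in a buffer and sampled in mini-batches exactly as described above, with Adam used as the optimizer.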
Specifically, when a policy gradient algorithm is used to train an agent, one of the challenges is that the algorithm is susceptible to sudden performance collapse, in which the agent abruptly starts to perform poorly. This situation may be difficult to recover from, because the agent then starts generating poorly performing trajectories that are subsequently used to further train the strategy. In addition, because an on-policy algorithm cannot reuse data, its sample utilization is insufficient. The Proximal Policy Optimization (PPO) algorithm is an optimization algorithm that addresses both of the above problems. The main idea behind proximal policy optimization is to introduce a surrogate objective function that avoids sudden performance degradation by ensuring monotonic policy improvement, and that has the advantage of reusing off-policy data during training.
In a multi-agent system, since the reward function $r_t$ of each agent is influenced by the actions of the other agents, the agents need to consider the value function of the joint action in order to learn the optimal strategy. Centralized training means that the joint action-value function is used to train the agents during training; compared with distributed training, i.e. using a local action-value function, a centralized value function can evaluate the joint strategy.
The specific method is as follows: a centralized Critic network utilizes the joint action information; during training, the n agents interact with the environment using a joint strategy, the joint action-value function of each agent is evaluated at the same time, and the gradient of the strategy parameters is updated according to the joint action-value function. The value function only knows the global reward, and a single agent does not know its actual contribution, so the Critic in the PPO algorithm uses an advantage function to evaluate how good an action is. First a strategy π is given, and its next iteration (after the parameter update) is denoted π'; an associated policy performance identifier is defined to measure the performance difference between the two strategies. The goal of the agent is to obtain the strategy $\pi_\theta$ that maximizes the expected return

$$J(\pi_\theta)=\mathbb{E}_{\tau\sim\pi_\theta}\!\left[\sum_{t}\gamma^{t}r_{t}\right],$$

where θ is the strategy parameter and the return refers to the discounted reward sum over a trajectory $\tau=(s_0,a_0,s_1,a_1,\dots)$,

$$R(\tau)=\sum_{t}\gamma^{t}r_{t}.$$

The optimal strategy is

$$\pi^{*}=\arg\max_{\pi_\theta} J(\pi_\theta).$$
Through a first-order approximation, the surrogate loss function is optimized, a new strategy is calculated in each iteration, and a non-negative improvement at each strategy iteration is guaranteed.
The Critic in the PPO algorithm adopts an advantage function to evaluate the quality of an action:

$$A^{\pi}(s_t,a_t)=Q^{\pi}(s_t,a_t)-V^{\pi}(s_t).$$

Since PPO is an on-policy algorithm, importance sampling is introduced in order to improve the sample utilization; sampling with the old strategy $\pi_{\theta'}$ gives

$$\nabla J(\theta)=\mathbb{E}_{(s_t,a_t)\sim\pi_{\theta'}}\!\left[\frac{\pi_{\theta}(a_t\mid s_t)}{\pi_{\theta'}(a_t\mid s_t)}A^{\pi_{\theta'}}(s_t,a_t)\,\nabla\log\pi_{\theta}(a_t\mid s_t)\right],$$

and this gradient corresponds to the optimization objective function, namely:

$$J^{\theta'}(\theta)=\mathbb{E}_{(s_t,a_t)\sim\pi_{\theta'}}\!\left[\frac{\pi_{\theta}(a_t\mid s_t)}{\pi_{\theta'}(a_t\mid s_t)}A^{\pi_{\theta'}}(s_t,a_t)\right].$$

In practical application, the expectation is estimated from samples, the optimization target of PPO, i.e. the surrogate loss function, is obtained in simplified form, and the magnitude of the policy update is limited through the clipping operation (clip), which guarantees training stability:

$$J^{CLIP}(\theta)=\mathbb{E}\!\left[\min\bigl(r(\theta)A^{\pi},\ \mathrm{clip}(r(\theta),1-\varepsilon,1+\varepsilon)A^{\pi}\bigr)\right].$$

The above expression is called the clipped surrogate objective function, where

$$r(\theta)=\frac{\pi_{\theta}(a\mid s)}{\pi_{\theta'}(a\mid s)}$$

is the ratio of the new and old strategies and ε is the clipping magnitude hyper-parameter; the idea is to limit r(θ) to the neighborhood $[1-\varepsilon,\,1+\varepsilon]$.
The objective function assumes that the samples come from the old strategy, which is independent of the current strategy except for the importance weights, so the sampled trajectories can be reused multiple times to perform parameter updates. Such a modification enables more stable training with better sample utilization.
A specific embodiment of performing deep reinforcement learning on the AoI optimization objective function by using the improved PPO algorithm comprises the following steps:
Step 1: the state information $(s_1, s_2 \dots s_n)$ is input into the Actor network to obtain the probabilities of all actions, and the joint action $(a_1, a_2 \dots a_n)$ is output according to the probabilities of all actions; all agents share one Actor network, the input of each agent i is the globally observed environment information, and the output is the joint action of agent i.
Step 2: the joint action $(a_1, a_2 \dots a_n)$ is input into the environment to obtain the global reward r and the next state s'; the trajectory τ (the sequence of states, joint actions, and rewards) is obtained from the successive states and stored in an experience pool.
Step 3: all states s in the trajectory τ are input into the Critic network to obtain the state values $V(s_t)$ corresponding to all states of the unmanned aerial vehicle in one trajectory.
Step 4: after the unmanned aerial vehicle executes the joint action $a_t$ and reaches state $s_{t+1}$, the reward estimate $G_t=r_t+\gamma V(s_{t+1})$ is calculated, and the advantage function $A(s_t,a_t)=G_t-V(s_t)$ is obtained from $G_t$; generalized advantage estimation is applied to the advantage function to balance the variance and bias of the value function estimate. The expression is:

$$\hat{A}_t^{GAE}=\sum_{l=0}^{\infty}(\gamma\lambda)^{l}\,\delta_{t+l},\qquad \delta_t=r_t+\gamma V(s_{t+1})-V(s_t),$$

wherein $\hat{A}_t^{GAE}$ represents the advantage function calculated using GAE, γ represents the discount factor of the reward, λ represents an adjustable hyper-parameter, l represents the accumulation index, $r_t$ represents the reward at time t, $V(s_{t+1})$ represents the state value function at time t+1, $V(s_t)$ represents the state value function at time t, and $s_t$ represents the state at time t.
Step 5: the loss of the Critic network is calculated; the Critic loss function is the mean square of the advantage function.
Step 6: the obtained advantage function $A(s_t,a_t)$ is used as the Critic network's evaluation of the action strategy, and the output strategy of the Actor network is improved to obtain a new strategy $\pi_\theta$.
Step 7: the loss of the Critic network is calculated, and the generalized-advantage-estimated advantage function $A(s_t,a_t)$ is used to optimize the Critic's evaluation of the action strategy $\pi_\theta$, obtaining the new strategy $\pi_\theta$.
Step 8: all stored state combinations s are respectively input into the Actor networks of the new and old strategies $\pi_\theta$ and $\pi_{\theta'}$ to obtain the unmanned aerial vehicle action probability distributions prob1 and prob2 under the different strategies; the importance weights are calculated according to prob1 and prob2; the corrected difference between the two action distributions of the different strategies θ and θ' is obtained according to the importance weights, and the updated expected return of the strategy is calculated according to the difference between the two action distributions.
Calculating the updated expected return of the strategy includes: before the strategy is updated, the expected return of the new strategy cannot be calculated directly, so importance sampling is introduced and the distribution of the old strategy is used to estimate the distribution of the new strategy; the calculated updated expected return of the strategy is:

$$J^{\theta'}(\theta)=\mathbb{E}_{(s,a)\sim\pi_{\theta'}}\!\left[\frac{\pi_{\theta}(a\mid s)}{\pi_{\theta'}(a\mid s)}A^{\pi_{\theta'}}(s,a)\right],\qquad r(t)=\frac{\pi_{\theta}(a\mid s)}{\pi_{\theta'}(a\mid s)},$$

wherein r(t) is the ratio of the new and old strategies, $J^{\theta'}(\theta)$ represents the updated expected return of the strategy, $J(\theta')$ represents the expected return of the old strategy, $\pi_\theta(a\mid s)$ represents the new strategy, $\pi_{\theta'}(a\mid s)$ represents the old strategy, a represents the action, and s represents the state information.
Step 9: the constraint condition of the updated strategy is set, and the loss function of the Actor network is calculated according to the constraint condition and the expected return of the strategy; setting the constraint condition of the updated strategy means limiting the magnitude of the strategy update through the clipping operation (clip) and restricting r(t) to the neighborhood $[1-\varepsilon,\,1+\varepsilon]$, thereby guaranteeing training stability. The loss function of the Actor network is:

$$J^{CLIP}(\theta)=\mathbb{E}\!\left[\min\bigl(r(t)A^{\pi},\ \mathrm{clip}(r(t),1-\varepsilon,1+\varepsilon)A^{\pi}\bigr)\right],$$

wherein $J^{CLIP}(\theta)$ represents the objective function of the Actor network, θ represents the network weights of the Actor, E represents the expectation operation, r(t) is the ratio of the new and old strategies, $A^{\pi}$ represents the advantage function under strategy π, and $\mathrm{clip}(r(t),1-\varepsilon,1+\varepsilon)$ outputs $1-\varepsilon$ if r(t) is smaller than $1-\varepsilon$, outputs $1+\varepsilon$ if r(t) is larger than $1+\varepsilon$, and otherwise outputs r(t), in order to limit the probability ratio to a reasonable range; ε represents the clipping magnitude hyper-parameter.
Step 10: the parameters of the Actor network and the Critic network are updated by using a gradient descent algorithm according to the loss functions, and when the reward has converged, the current optimal flight strategy of the unmanned aerial vehicle is output.
S3: if the unmanned aerial vehicle has finished collecting the data packets generated by all ground nodes, the flight ends; otherwise, step S2 continues to be executed.
Each episode starts from an initial state, and the episode is finished and a new round of learning restarts when the unmanned aerial vehicle meets either of the following conditions: 1) the data packets of all ground sensors have been completely collected; 2) the maximum range has been reached. In this embodiment, if the unmanned aerial vehicle does not complete the data collection task within the maximum number of 500 steps in a single episode, the unmanned aerial vehicle is considered to have reached the maximum range. When the maximum number of episodes is reached, the loop is exited and the training ends.
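A minimal sketch of this episode-termination check (the function name is illustrative; only the 500-step cap and the all-collected condition come from this embodiment):

```python
MAX_STEPS = 500  # maximum number of steps in a single episode in this embodiment

def episode_done(all_collected, step_count, max_steps=MAX_STEPS):
    """The episode ends when every ground sensor packet has been collected
    or when the step cap (treated as reaching the maximum range) is hit."""
    if all_collected:
        return True, "all ground sensor data packets collected"
    if step_count >= max_steps:
        return True, "maximum range reached"
    return False, ""
```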
The above-mentioned embodiments further describe the objects, technical solutions, and advantages of the invention in detail. It should be understood that the above-mentioned embodiments are only preferred embodiments of the present invention and should not be construed as limiting the present invention; any modifications, equivalent substitutions, improvements, and the like made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.

Claims (10)

1. An unmanned aerial vehicle trajectory adaptive optimization method based on information age, characterized by comprising the following steps:
S1: constructing an unmanned aerial vehicle-to-ground communication system model, and determining the trajectory of the unmanned aerial vehicle with this model; determining an optimization objective function of the AoI according to the unmanned aerial vehicle trajectory, wherein AoI represents the age of information;
S2: acquiring current environment state information, and performing deep reinforcement learning on the AoI optimization objective function by adopting an improved PPO algorithm according to the current environment state information to obtain the current flight state of the unmanned aerial vehicle, the improved PPO algorithm comprising optimizing the PPO algorithm by adopting centralized strategy learning and shared rewards;
S3: the unmanned aerial vehicle collects the data packets generated by the ground nodes in the current flight state; if data collection for all nodes is finished, the flight of the unmanned aerial vehicle ends, otherwise the method returns to step S2.
2. The method of claim 1, wherein building the unmanned aerial vehicle-to-ground communication system model comprises: acquiring the flight environment information of the unmanned aerial vehicle, and dividing the acquired environment information into a series of cells of the same size by a grid method; arranging a base station at the center of the flight area of the unmanned aerial vehicle, the coverage of the base station being a circular area with radius R, and designating the cells outside the coverage of the base station signal as no-fly zones; and acquiring the position information of the ground communication nodes, and constructing the unmanned aerial vehicle-to-ground communication system model according to the flight environment of the unmanned aerial vehicle and the position information of the ground communication nodes.
3. The method of claim 1, wherein the step of determining the optimization objective function of the AoI comprises: discretizing the time for which the unmanned aerial vehicle executes the task into at least two time intervals of equal length; determining the flight altitude and flight speed of the unmanned aerial vehicle; constructing an unmanned aerial vehicle speed constraint according to the flight altitude, the flight speed, and the task time of the unmanned aerial vehicle; the unmanned aerial vehicle collects ground information, and when the unmanned aerial vehicle collects the latest data stored at a ground node, the AoI of the ground node is updated, otherwise the AoI of the ground node increases linearly; if no data is stored in the buffer of the ground node or the data has already been collected, the AoI is set to 1, otherwise the AoI is set to 0; and recording the time at which unmanned aerial vehicle n completes collection from ground node m as $t_s$, and planning the flight trajectory and the connection strategy of the unmanned aerial vehicle according to $t_s$.
4. The method of claim 3, wherein the optimization objective function is as follows:

$$(\mathrm{P1}):\quad \min_{q,K}\ \sum_{m\in M} A_m(t_s)$$

$$\mathrm{s.t.}\quad \|q_n(t)-q_n(t-1)\|\le V_{\max},$$

$$\sum_{m\in M} K_{n,m}(t)\le 1,\ \forall n\in N,\qquad \sum_{n\in N} K_{n,m}(t)\le 1,\ \forall m\in M,$$

wherein q represents the trajectory sequence formed by the positions of the unmanned aerial vehicles, K represents the connection relation between the unmanned aerial vehicles and the ground nodes, M represents the number of ground nodes, $A_m$ represents the AoI of ground node m, $t_s$ represents the time at which unmanned aerial vehicle n completes collection from ground node m, $q_n(t)$ represents the position of the n-th unmanned aerial vehicle at time t, $V_{\max}$ represents the maximum flight speed of the unmanned aerial vehicle, $K_{n,m}(t)\in\{0,1\}$ denotes the connection relation between unmanned aerial vehicle n and node m at time $t\in[0,T]$, and N represents the number of unmanned aerial vehicles.
5. The unmanned aerial vehicle trajectory adaptive optimization method based on the information age is characterized in that the connection strategy of the unmanned aerial vehicle is a Markov decision process, the Markov decision process comprising a quadruple <S, A, P, R>, wherein S and A are the state space and the action space, respectively, representing the state and the action of the unmanned aerial vehicle; P is the state transition function, representing the probability of transitioning to the next state when the unmanned aerial vehicle executes an action in the current state; and R is the reward function, representing the reward that can be obtained when the unmanned aerial vehicle is in the current state.
6. The method for unmanned aerial vehicle trajectory adaptive optimization based on information age according to claim 5, wherein the state transition function comprises a position update equation of the unmanned aerial vehicle and a state update equation of the AoI; the position update equation of the unmanned aerial vehicle is:

$$q(t+1)=q(t)+D\,e(V_t),\qquad e(\mathrm{North})=(0,1),\ e(\mathrm{South})=(0,-1),\ e(\mathrm{East})=(1,0),\ e(\mathrm{West})=(-1,0);$$

the state update equation of the AoI is:

$$A_m(t+1)=\begin{cases}1, & \text{if } K_{n,m}(t)=1 \text{ and the unmanned aerial vehicle is within the coverage of sensor } m,\\ A_m(t)+1, & \text{otherwise,}\end{cases}$$

wherein q(t) represents the horizontal position coordinates of the unmanned aerial vehicle, D represents the distance between the center positions of two adjacent grid cells, $V_t$ represents the flight direction of the unmanned aerial vehicle, North, South, East, and West denote the four flight directions, $A_m(t)$ represents the AoI of ground sensor m at time t, and $K_{n,m}(t)$ represents the connection relation between unmanned aerial vehicle n and node m at time $t\in[0,T]$.
7. The method of claim 5, wherein the reward function is constructed according to the target optimization problem, the goal of trajectory planning being to minimize the AoI of the acquired target information, and the reward function being a function related to the AoI; when a target point is found, the reward is $r_1$; when the unmanned aerial vehicle flies out of the active area, the reward is $-r_2$; when the information acquisition of the unmanned aerial vehicle is finished, whether the data packets of all ground nodes have been acquired is judged, and if they have been acquired the reward is $r_3$, otherwise it is $-r_4$; in other cases the reward is $-A_m(t)$; wherein $r_1, r_2, r_3, r_4$ are positive numbers.
8. The unmanned aerial vehicle trajectory adaptive optimization method based on the information age is characterized in that the process of performing deep reinforcement learning on the AoI optimization objective function by adopting the improved PPO algorithm comprises the following steps:
S21: inputting the state information $(s_1, s_2 \dots s_n)$ into the Actor network to obtain the probabilities of all actions, and outputting the joint action $(a_1, a_2 \dots a_n)$ according to the probabilities of all actions; all agents share one Actor network, the input of each agent i is the globally observed environment information, and the output is the joint action of agent i;
S22: inputting the joint action $(a_1, a_2 \dots a_n)$ into the environment to obtain the global reward r and the next state s', obtaining the trajectory τ (the sequence of states, joint actions, and rewards) according to the successive states, and storing it in an experience pool;
S23: inputting all states s in the trajectory τ into the Critic network to obtain the state values $V(s_t)$ corresponding to all states of the unmanned aerial vehicle in one trajectory;
S24: after the unmanned aerial vehicle executes the joint action $a_t$ and reaches state $s_{t+1}$, calculating the expected cumulative reward $G_t = r_t + \gamma V(s_{t+1})$ for different actions, computing the advantage function $A(s_t, a_t) = G_t - V(s_t)$ from the cumulative reward, and using generalized advantage estimation to balance the variance and bias of the value function estimate;
S25: calculating the loss of the Critic network, the Critic loss function being the mean square of the advantage function;
S26: using the obtained advantage function $A(s_t, a_t)$ as the Critic network's evaluation of the action strategy, and improving the output strategy of the Actor network to obtain a new strategy $\pi_\theta$;
S27: inputting all stored state combinations s into the Actor networks of the new and old strategies $\pi_\theta$ and $\pi_{\theta'}$, respectively, to obtain the unmanned aerial vehicle action probability distributions prob1 and prob2 under the different strategies; calculating the importance weights according to prob1 and prob2; obtaining the corrected difference between the two action distributions of the different strategies θ and θ' according to the importance weights, and calculating the updated expected return of the strategy according to the difference between the two action distributions;
S28: setting the constraint condition of the updated strategy, and calculating the loss function of the Actor network according to the constraint condition and the expected return of the strategy;
S29: updating the parameters of the Actor network and the Critic network by using a gradient descent algorithm according to the loss functions, until the reward converges and no longer changes, and outputting the current optimal flight strategy of the unmanned aerial vehicle.
9. The method of claim 8, wherein the optimal strategy is as follows:

$$\pi^{*}=\arg\max_{\pi}\ \mathbb{E}_{\tau\sim\pi}\!\left[\sum_{t}\gamma^{t}r_{t}\right],$$

wherein $\mathbb{E}_{\tau\sim\pi}[\cdot]$ represents the expectation of the discounted reward taken over the trajectory τ, γ represents the discount factor, and $r_t$ represents the instantaneous reward at time t.
10. The method of claim 8, wherein the loss function of the Actor network is as follows:

$$J^{CLIP}(\theta)=\mathbb{E}\!\left[\min\bigl(r(t)A^{\pi},\ \mathrm{clip}(r(t),1-\varepsilon,1+\varepsilon)A^{\pi}\bigr)\right],$$

wherein $J^{CLIP}(\theta)$ represents the loss function of the Actor network, θ represents the network weights of the Actor, E represents the expectation operation, r(t) is the ratio of the new and old strategies, $A^{\pi}$ represents the advantage function under strategy π, clip is a selecting output function, i.e. if r(t) is less than $1-\varepsilon$, $1-\varepsilon$ is output, if r(t) is greater than $1+\varepsilon$, $1+\varepsilon$ is output, and otherwise r(t) is output, and ε represents the clipping magnitude hyper-parameter.
CN202211348121.6A 2022-10-31 2022-10-31 Unmanned aerial vehicle track self-adaptive optimization method based on information age Pending CN115696211A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211348121.6A CN115696211A (en) 2022-10-31 2022-10-31 Unmanned aerial vehicle track self-adaptive optimization method based on information age


Publications (1)

Publication Number Publication Date
CN115696211A true CN115696211A (en) 2023-02-03

Family

ID=85046179

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211348121.6A Pending CN115696211A (en) 2022-10-31 2022-10-31 Unmanned aerial vehicle track self-adaptive optimization method based on information age

Country Status (1)

Country Link
CN (1) CN115696211A (en)


Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116320984A (en) * 2023-03-22 2023-06-23 扬州宇安电子科技有限公司 Unmanned aerial vehicle safety communication system and method based on cooperative interference
CN116320984B (en) * 2023-03-22 2023-10-03 扬州宇安电子科技有限公司 Unmanned aerial vehicle safety communication system and method based on cooperative interference
CN116232440A (en) * 2023-03-23 2023-06-06 鹏城实验室 Data acquisition method, system and storage medium
CN116233791A (en) * 2023-03-23 2023-06-06 重庆邮电大学 Track optimization and resource allocation method in multi-machine cooperative internet of vehicles
CN116232440B (en) * 2023-03-23 2024-05-14 鹏城实验室 Data acquisition method, system and storage medium
CN116233791B (en) * 2023-03-23 2024-05-24 重庆邮电大学 Track optimization and resource allocation method in multi-machine cooperative internet of vehicles
CN117193378A (en) * 2023-10-24 2023-12-08 安徽大学 Multi-unmanned aerial vehicle path planning method based on improved PPO algorithm
CN117193378B (en) * 2023-10-24 2024-04-12 安徽大学 Multi-unmanned aerial vehicle path planning method based on improved PPO algorithm
CN117193381A (en) * 2023-11-07 2023-12-08 天津云圣智能科技有限责任公司 Unmanned aerial vehicle control method and device and computer storage medium
CN117193381B (en) * 2023-11-07 2024-02-23 天津云圣智能科技有限责任公司 Unmanned aerial vehicle control method and device and computer storage medium
CN117729555A (en) * 2024-02-18 2024-03-19 北京中电飞华通信有限公司 Air base station deployment method, cooperative system and related equipment
CN117729555B (en) * 2024-02-18 2024-04-26 北京中电飞华通信有限公司 Air base station deployment method, cooperative system and related equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination