WO2021017227A1 - Path optimization method and device for unmanned aerial vehicle, and storage medium - Google Patents

Path optimization method and device for unmanned aerial vehicle, and storage medium

Info

Publication number
WO2021017227A1
Authority
WO
WIPO (PCT)
Prior art keywords
uav
drone
data
flight
time
Prior art date
Application number
PCT/CN2019/114200
Other languages
French (fr)
Chinese (zh)
Inventor
许文俊
徐越
吴思雷
张治
张平
林家儒
Original Assignee
北京邮电大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京邮电大学
Publication of WO2021017227A1

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course or altitude of land, water, air, or space vehicles, e.g. automatic pilot
    • G05D1/10Simultaneous control of position or course in three dimensions
    • G05D1/101Simultaneous control of position or course in three dimensions specially adapted for aircraft

Definitions

  • This specification relates to the field of wireless communication technology, and in particular to methods, devices and storage media for trajectory optimization of drones.
  • UAV communication technology is considered to be an indispensable part of the fifth generation (5G) and subsequent evolution (5G+) mobile communication networks.
  • the UAV communication system has a unique air-to-ground channel model, highly dynamic three-dimensional flight capabilities and limited flight energy, making UAV communication systems more complex than traditional communication systems.
  • Some embodiments of this specification propose a UAV trajectory optimization method, which includes: acquiring UAV state data and action decision data during the flight of the UAV; determining the instantaneous energy efficiency of the UAV according to the state data and the action decision data; training a pre-built deep reinforcement learning network with the state data as input, the action decision data as output, and the instantaneous energy efficiency as the reward, to optimize the strategy parameters of the deep reinforcement learning network; and outputting the UAV flight strategy according to the trained deep reinforcement learning network.
  • The above method may further include: pre-constructing a deep learning network structure including an action network and an evaluation network, wherein the action network uses the proximal policy optimization (PPO) algorithm and a deep neural network to fit the UAV flight action strategy function and decide the UAV flight actions, and the evaluation network uses a deep neural network to fit the state value function and optimize the strategy parameters in the UAV flight action strategy function.
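  • For illustration only, the following is a minimal PyTorch sketch of such an Actor-Critic structure; the hidden-layer sizes, the Gaussian parameterization of the continuous action, and all class and variable names are assumptions for illustration and are not specified in this patent.

```python
import torch
import torch.nn as nn

class ActionNetwork(nn.Module):
    """Action (actor) network: fits the UAV flight strategy function pi_theta.

    Outputs the mean and standard deviation of a Gaussian over the continuous
    action (horizontal angle, vertical angle, acceleration).
    """
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.mu = nn.Linear(hidden, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state):
        h = self.body(state)
        return self.mu(h), self.log_std.exp()

class EvaluationNetwork(nn.Module):
    """Evaluation (critic) network: fits the state value function V(s)."""
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.v = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state):
        return self.v(state).squeeze(-1)
```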
  • Wherein, acquiring the UAV state data and action data includes: determining the distance between the UAV and the IoT devices, the data transmission rate from the IoT devices to the UAV, and the remaining energy of the UAV as the state data; and collecting the acceleration and flight control angles of the UAV as the action data.
  • Wherein, the distance between the IoT device and the UAV includes the Euclidean distance between the IoT device and the UAV.
  • Wherein, determining the instantaneous energy efficiency of the UAV includes using the following formula: $r(s_t, a_t) = R_u^{\max}(t) / E(q(t))$, where $r(s_t, a_t)$ represents the instantaneous energy efficiency of the UAV when its state at time t is $s_t$ and its action is $a_t$; $R_u^{\max}(t)$ is the maximum rate at which IoT device u transmits data to the UAV at time t; and $E(q(t))$ represents the energy loss of the UAV at time t.
  • Wherein, the maximum rate at which IoT device u transmits data to the UAV at time t is determined by the following process: determining the average path loss of the UAV; determining the signal-to-noise ratio between the UAV and IoT device u at time t according to the average path loss; and determining the maximum rate at which device u transmits data to the UAV at time t according to the signal-to-noise ratio.
  • Wherein, determining the average path loss of the UAV includes determining it by the following formula: $L_u(t) = 20\log_{10}\!\left(\frac{4\pi f_c d_u(t)}{c}\right) + \eta_{LoS}$, where $L_u(t)$ represents the average path loss of the UAV; $f_c$ represents the center frequency; $d_u(t)$ represents the distance between the UAV and device u at time t; $c$ represents the speed of light; and $\eta_{LoS}$ represents the additional spatial propagation loss of the LoS link.
  • Wherein, determining the signal-to-noise ratio of the UAV includes determining the signal-to-noise ratio between the UAV and IoT device u at time t by the following formula: $\Gamma_u(t) = \frac{P_u\, g_u(t)}{N_0}$, where $\Gamma_u(t)$ represents the signal-to-noise ratio between the UAV and IoT device u at time t; $P_u$ represents the transmission power of the uplink of device u; $g_u(t)$ represents the gain of the channel between the UAV and device u at time t; and $N_0$ is the noise power; wherein $g_u(t) = 10^{-L_u(t)/10}$.
  • Wherein, determining the maximum rate at which device u transmits data to the UAV at time t includes determining it by the following formula: $R_u^{\max}(t) = B \log_2\!\left(1 + \Gamma_u(t)\right)$, where $B$ represents the channel bandwidth and $\Gamma_u(t)$ represents the signal-to-noise ratio between the UAV and IoT device u at time t.
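  • As a concrete illustration of this chain (average path loss, signal-to-noise ratio, maximum rate), the sketch below implements the free-space LoS model and Shannon-capacity relation described above; the numerical constants (center frequency, bandwidth, transmit power, noise power, LoS excess loss) are placeholder assumptions, not values from this specification.

```python
import math

def average_path_loss_db(d, f_c=2.4e9, eta_los_db=1.0, c=3.0e8):
    """Average LoS path loss (dB): free-space loss at distance d plus the
    additional spatial propagation loss of the LoS link (eta_los_db)."""
    return 20.0 * math.log10(4.0 * math.pi * f_c * d / c) + eta_los_db

def sinr(d, p_u=0.1, n0=1e-13):
    """Uplink SINR from IoT device u at distance d; the channel gain is
    determined only by the path loss."""
    gain = 10.0 ** (-average_path_loss_db(d) / 10.0)
    return p_u * gain / n0

def max_rate(d, bandwidth=1e6):
    """Maximum transmission rate (bit/s) from device u to the UAV (Shannon)."""
    return bandwidth * math.log2(1.0 + sinr(d))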
  • Wherein, the remaining energy of the UAV is the difference between the initial total energy of the UAV and the energy loss of the UAV; the energy loss of the UAV includes at least one of the flight energy loss and the communication energy loss of the UAV.
  • Determining the instantaneous energy efficiency of the UAV further includes: when the UAV exhausts its energy on the way back, adding a penalty term of a preset value to the formula for calculating the instantaneous energy efficiency.
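  • A hedged sketch of the resulting reward follows: it assumes the instantaneous energy efficiency is the ratio of the achieved rate to the energy loss, and uses the illustrative crash penalty of -100 mentioned later in this specification.

```python
def instantaneous_energy_efficiency(rate_bits, energy_loss, crashed=False,
                                    penalty=-100.0):
    """Reward r(s_t, a_t): transmission rate over energy loss, with a large
    negative penalty added when the UAV exhausts its energy on the way back."""
    reward = rate_bits / energy_loss
    if crashed:
        reward += penalty  # penalty term for running out of energy before returning
    return reward
```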
  • Wherein, training the pre-built deep reinforcement learning network includes rewriting the target equation of the deep reinforcement learning network, using the proximal policy optimization algorithm, as:
  • $L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta), 1-\varepsilon, 1+\varepsilon\right)\hat{A}_t\right)\right]$
  • where $\theta$ is the strategy parameter to be optimized; $\varepsilon$ is a preset constant used to control the update range of the UAV flight strategy; $\hat{\mathbb{E}}_t$ is the expected value at time t; $\hat{A}_t$ represents the advantage function; clip represents the clipping function; and $r_t(\theta)$ is the ratio of the new strategy function to the old strategy function in one iterative update, which can be expressed as:
  • $r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$
  • where $\pi_\theta$ represents the UAV flight strategy function, $\pi_\theta(a_t \mid s_t)$ represents the new UAV flight strategy function for state $s_t$ and action $a_t$ at time t, and $\pi_{\theta_{old}}(a_t \mid s_t)$ represents the old UAV flight strategy function for state $s_t$ and action $a_t$ at time t.
  • The advantage function $\hat{A}_t$ can be expressed as $\hat{A}_t = \delta_t + (\gamma\lambda)\delta_{t+1} + \cdots + (\gamma\lambda)^{T-t-1}\delta_{T-1}$, where $\gamma$ is the attenuation index; $\lambda$ is the track parameter; $\delta_t$ is the temporal-difference error at time t; $\delta_{T-1}$ is the temporal-difference error at time T-1; and T is the total duration of autonomous flight.
  • Through at least one iterative update, the maximum value of the target equation is found, the strategy parameters in the UAV flight strategy function are optimized, and the strategy parameters corresponding to the maximum of the target equation are output as the UAV flight strategy.
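  • The clipped target equation above can be sketched in PyTorch as follows (names assumed; the objective would be negated when fed to an optimizer that minimizes):

```python
import torch

def ppo_clip_objective(log_prob_new, log_prob_old, advantages, eps=0.2):
    """Clipped PPO surrogate: E_t[min(r_t(theta) * A_t,
    clip(r_t(theta), 1 - eps, 1 + eps) * A_t)]."""
    ratio = torch.exp(log_prob_new - log_prob_old)      # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return torch.min(unclipped, clipped).mean()
```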
  • Wherein, the above advantage function $\hat{A}_t$ is obtained by optimization with a deep neural network according to the UAV state data, the UAV action decision data, and the instantaneous energy efficiency of the UAV.
  • Specifically, obtaining the advantage function $\hat{A}_t$ in this way includes: using the state data, the action decision data, and the instantaneous energy efficiency of the UAV, estimating the advantage function $\hat{A}_t^{\omega}$ with a deep neural network, computing its objective function, updating the parameter ω by gradient descent, and iterating for a predetermined number of iterations; and taking the advantage function $\hat{A}_t$ for which this objective function reaches its maximum value.
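  • A minimal sketch of estimating the advantage function with the evaluation network and updating its parameter ω by gradient descent, assuming the generalized-advantage form given above; all function and variable names are illustrative assumptions.

```python
import torch

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Advantage A_t = delta_t + (gamma*lam)*delta_{t+1} + ...
    + (gamma*lam)^(T-1-t)*delta_{T-1}, with
    delta_t = r_t + gamma*V(s_{t+1}) - V(s_t)."""
    T = len(rewards)
    adv = torch.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        next_v = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_v - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

def update_critic(critic, optimizer, states, returns):
    """One gradient-descent step on the evaluation-network parameters omega."""
    loss = ((critic(states) - returns) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```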
  • the above method further includes: determining the action decision data of the drone according to the drone flight strategy.
  • Other embodiments of this specification provide a UAV trajectory optimization device, which includes:
  • a construction module, used to construct the deep reinforcement learning network;
  • a training data collection module, used to obtain the state data and action decision data of the UAV during its flight and to calculate the instantaneous energy efficiency of the UAV; and
  • a training module, used to train the deep reinforcement learning network with the state data as input, the action decision data as output, and the instantaneous energy efficiency as the reward, to optimize the strategy parameters and output the UAV flight strategy.
  • Wherein, the above construction module is used to construct a deep learning network structure including an action network and an evaluation network; the action network uses the proximal policy optimization algorithm and a deep neural network to fit the UAV flight action strategy function and decide the UAV flight actions, and the evaluation network uses a deep neural network to fit the state value function and optimize the strategy parameters in the UAV flight action strategy function.
  • Wherein, the above training data collection module is used to determine the distance between the UAV and the IoT devices, the data transmission rate from the IoT devices to the UAV, and the remaining energy of the UAV as the state data, and to collect the acceleration and flight control angles of the UAV as the action decision data.
  • Still other embodiments of this specification provide a UAV trajectory optimization device, including at least one processor and a memory communicatively connected with the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute the above-mentioned UAV trajectory optimization method.
  • Still other embodiments of this specification also provide a computer-readable storage medium on which computer instructions are stored, and the above-mentioned UAV trajectory optimization method is realized when the processor executes the above-mentioned computer instructions.
  • This specification discloses a UAV trajectory optimization method, device, and UAV based on deep reinforcement learning.
  • Deep reinforcement learning technology is introduced into UAV trajectory optimization, so that the UAV can interact with the environment in real time during flight, collecting the state data and action decision data of the current flight trajectory as training data and using the instantaneous energy efficiency as the reward function.
  • Through real-time autonomous learning, the strategy parameters that decide the flight trajectory are continuously optimized; that is, the UAV is given the ability to learn autonomously online in its environment and can adapt to changes in the dynamic environment as needed.
  • In addition, based on autonomous learning with the PPO algorithm, the UAV trajectory optimization method described in this specification also has the advantage of not being limited by the choice of learning step size.
  • The autonomous learning method based on the PPO algorithm proposed in this specification can process three-dimensional, continuous, bounded data: the input data, output data, and so on are not limited to the discrete domain, realizing flight control optimization of the UAV in three-dimensional space in the continuous domain, which is closer to real scenarios. Compared with control methods based on discrete-domain data or a limited number of solutions in a lookup table, it better matches the needs of the actual flight environment.
  • Furthermore, while the reward function is assigned the instantaneous energy efficiency of UAV flight, a penalty term is added to the reward function when the UAV cannot return home for charging/refueling. After continuous learning, the return time of the UAV can be determined, so that the UAV can return home in time to avoid losses and improve the energy efficiency of UAV flight work.
  • Figure 1 is a schematic diagram of the overall structure and data interaction of the system applied by the UAV trajectory optimization method according to some embodiments of this specification;
  • FIG. 2 is a schematic flowchart of a method for optimizing the trajectory of a drone according to some embodiments of this specification;
  • FIG. 3 is a schematic flowchart of a method for optimizing the trajectory of a drone according to some other embodiments of this specification;
  • FIG. 4 is a schematic flowchart of a specific modeling method of a deep reinforcement learning network according to some embodiments of this specification
  • Figure 5 shows the specific process of the deep reinforcement learning PPO algorithm in the embodiment of this specification
  • Figure 6 shows a schematic structural diagram of the UAV trajectory optimization device according to the embodiment of the specification.
  • Fig. 7 shows a schematic diagram of the hardware structure of the UAV trajectory optimization device according to an embodiment of this specification.
  • Deep reinforcement learning technology is a machine learning technology that combines reinforcement learning and deep neural networks. Specifically, the reinforcement learning agent collects the returns of different actions taken in different environmental states by interacting with the environment and, from the collected data, summarizes and learns the optimal behavior strategy, so as to obtain the ability to adapt to an unknown dynamic environment. Deep neural networks can significantly improve the generalization ability of the algorithm in high-dimensional state spaces and high-dimensional action spaces, thereby providing the ability to adapt to more complex environments.
  • The technical solution provided in this specification combines deep reinforcement learning technology with UAV technology: it collects the state data during UAV flight, determines the action decision data taken under that state data, and further determines the reward information. Then, based on the collected data, the best flight strategy of the UAV can be summarized and learned, so that the UAV acquires the ability to adapt to an unknown dynamic environment.
  • Figure 1 shows the overall structure and related data interaction process of the system applied in the UAV trajectory optimization method provided by some embodiments of this specification.
  • The system to which the UAV trajectory optimization method described in this specification is applied can be one in which a single UAV 102 provides services for multiple fixed IoT devices 104, and the IoT devices 104 are activated randomly or periodically to collect data and transmit it to the UAV 102.
  • the device that executes the drone trajectory optimization method may be referred to as the drone trajectory optimization device 106.
  • the above-mentioned UAV trajectory optimization device 106 may be located in the UAV 102, may also be located in the IoT device 104, or may be independent of the UAV 102 and the IoT device 104 Computing device, and can communicate with drone 102 and IoT device 104.
  • The above-mentioned UAV trajectory optimization device 106 may include a data acquisition module 1062 and a deep reinforcement learning network 1064.
  • the above-mentioned deep reinforcement learning network 1064 may adopt the deep reinforcement learning structure of the Actor-Critic framework, that is, it includes an action network 1066 and an evaluation network 1068.
  • The action network 1066 can use the proximal policy optimization (PPO) algorithm and a deep neural network to fit the UAV flight strategy function, so as to decide the flight actions of the UAV; and the evaluation network 1068 can use a deep neural network to fit the state value function and optimize the strategy parameters in the UAV flight action strategy function.
  • The UAV 102 inputs the distance between it and the IoT devices 104, the transmission rate, and the remaining energy of the UAV 102 as state data into the action network 1066 of the deep reinforcement learning network 1064; action decision data such as the acceleration and flight control angle (i.e., flight direction) of the UAV are used as the behavior output by the deep reinforcement learning network 1064, and the instantaneous energy efficiency of the UAV 102 is used as the reward.
  • the data of state input, action decision, reward and return are generated as training data for the evaluation network 1068 and the action network 1066.
  • The evaluation network uses a deep neural network to fit the state value function and provide an advantage function for the optimization of the action network; the action network uses the PPO algorithm to optimize the strategy parameters and a deep neural network to fit the strategy function.
  • the UAV 102 can adapt to the environment and obtain the optimal flight strategy, that is, the optimal flight trajectory of the UAV 102 can be obtained, thereby realizing the optimization of the UAV trajectory.
  • Fig. 2 is a schematic flowchart of a method for optimizing the trajectory of an unmanned aerial vehicle according to some embodiments of this specification.
  • the method can be executed by a drone, an Internet of Things device, or a computing device independent of the drone and the Internet of Things device.
  • the method may include the following steps:
  • Step 202 Obtain status data and action decision data during the flight of the drone.
  • the aforementioned status data may include: the distance between the drone and the Internet of Things device, the transmission rate of the Internet of Things device to the drone, and the remaining energy of the drone.
  • the aforementioned action decision data may include: the acceleration of the UAV and the flight control angle (ie, the flight direction) and so on.
  • The distance between the aforementioned UAV and the IoT device may be the Euclidean distance between the UAV and the IoT device.
  • the aforementioned action decision data can be determined based on the aforementioned state data and the current drone flight strategy.
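  • For example, with the Gaussian action network sketched earlier, an action decision could be drawn from the current policy as follows; the bound on the acceleration and the way the angles are wrapped and clamped are illustrative assumptions.

```python
import torch

def decide_action(actor, state, ac_max=5.0):
    """Sample (horizontal angle, vertical angle, acceleration) from pi_theta(.|s_t)
    and keep each component within its bounded continuous range."""
    mu, std = actor(state)
    action = torch.normal(mu, std)
    omega = action[0].remainder(2 * torch.pi)   # horizontal angle in [0, 2*pi)
    theta = action[1].clamp(0.0, torch.pi)      # vertical angle in [0, pi]
    ac = action[2].clamp(0.0, ac_max)           # acceleration magnitude
    return omega, theta, ac
```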
  • Step 204 Determine the instantaneous energy efficiency of the UAV based on the aforementioned state data and action decision data.
  • Step 206 using the state data as the input, the action decision data as the output, and the instantaneous energy efficiency as the reward, the pre-built deep reinforcement learning network is trained to optimize the policy parameters of the deep reinforcement learning network.
  • the PPO algorithm may be used to train the deep reinforcement learning network to optimize the policy parameters of the deep reinforcement learning network.
  • the above strategy parameters are parameters that can determine the UAV action strategy data. Therefore, the above strategy parameters are also parameters that can determine the flight trajectory of the UAV.
  • the above-mentioned pre-built deep reinforcement learning network may adopt the deep reinforcement learning structure of the Actor-Critic framework, which is composed of two networks, an action network and an evaluation network.
  • the action network uses the PPO algorithm and the deep neural network to fit the flight strategy function and decision-making action; while the evaluation network uses the deep neural network to fit the state value function and optimize the strategy parameters.
  • Step 208 Output the flight strategy of the drone according to the trained deep reinforcement learning network.
  • the above-mentioned steps 202 to 208 can be executed in a loop, and the strategy parameters of the deep reinforcement learning network can be iteratively updated using the collected data to obtain a continuously optimized drone flight strategy.
  • the above-mentioned UAV flight strategy is a flight strategy that maximizes energy efficiency obtained through autonomous learning.
  • The UAV can further determine its own flight parameters, that is, the acceleration and flight control angle of the UAV, based on the above flight strategy and the current state data, and can perform flight tasks according to these flight parameters.
  • Alternatively, the flight strategy can be transmitted to the UAV, and the UAV then determines its own flight parameters, that is, the acceleration and flight control angle of the UAV, according to the above flight strategy and the current state data, and can perform flight tasks according to these flight parameters.
  • Moreover, a penalty term can be added to the UAV's instantaneous energy efficiency, and the instantaneous energy efficiency with the added penalty term is used as the reward function. After continuous learning, the UAV can return home in time to avoid losses, thereby further improving the energy efficiency of UAV flight work.
  • Step 302 Initialize the reinforcement learning decision strategy and related parameters, and the deep neural network related parameters.
  • Step 304 During the flight of the drone, the drone flies autonomously and records relevant data.
  • the above-mentioned step 304 may include: the drone calculates the distance, transmission rate and remaining energy from the Internet of Things device, determines the flight parameters based on the current flight strategy, receives the data sent by the Internet of Things device, and calculates the Instantaneous energy efficiency under flight trajectory.
  • Step 306: Based on the data collected over the predetermined period of time, the evaluation network fits the state value function, calculates the advantage function, and transmits it to the action network.
  • Step 308: The parameters of the deep neural networks of the action network and the evaluation network are trained separately, and the UAV flight strategy is updated.
  • Step 310 Repeat the above steps 304-308 until the drone mission ends.
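  • As an illustration of steps 304-308, the following is a minimal sketch of the per-episode data collection; the `env` and `actor` objects and their methods are hypothetical placeholders standing in for the UAV/IoT flight environment and the action network.

```python
def run_episode(env, actor, horizon):
    """Fly autonomously for one episode, recording states, actions and
    instantaneous energy-efficiency rewards as training data (step 304)."""
    states, actions, rewards = [], [], []
    state = env.reset()
    for _ in range(horizon):
        action = actor.act(state)          # distance/rate/energy in, angles/acceleration out
        next_state, reward, done = env.step(action)
        states.append(state)
        actions.append(action)
        rewards.append(reward)
        state = next_state
        if done:                           # mission finished or energy exhausted
            break
    return states, actions, rewards
```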
  • the aforementioned UAV trajectory optimization method introduces deep reinforcement learning technology in the UAV trajectory optimization.
  • the UAV interacts with the environment in real time during the flight, and collects the status data and movement data of the current flight trajectory as training data.
  • Through real-time autonomous learning, the strategy parameters that decide the flight trajectory are continuously optimized; that is, the UAV is given the ability to learn autonomously online in its environment and to adapt to changes in the dynamic environment as needed.
  • the above-mentioned autonomous learning based on the PPO algorithm also has the advantage of not being limited to the choice of learning step size.
  • The data objects processed by the above autonomous learning method can be three-dimensional, continuous, bounded data; the input data, output data, and so on are not limited to the discrete domain, realizing flight control optimization of the UAV in three-dimensional space in the continuous domain, which is closer to real scenarios.
  • Compared with control methods based on discrete-domain data or a limited number of solutions in a lookup table, it better matches the needs of the actual flight environment.
  • The specific modeling method of the deep reinforcement learning network described in the embodiments of this specification is shown in Fig. 4, and may include:
  • Step 402 Extract the status data of the drone from the flight environment.
  • The above state data can be extracted from the environment and obtained by calculation, and can be characterized as the following three parts: i) the distance from the UAV to each IoT device; ii) the transmission rate at which each IoT device transmits information to the UAV; iii) the remaining energy of the UAV.
  • The above state data can be quantified as $\phi(s_t) = \left[d_1(t), \ldots, d_N(t), R_1(t), \ldots, R_N(t), E^{re}(t)\right]^T$ (here T denotes the transpose of the matrix), where $\phi(s_t)$ represents the state data matrix; $s_t$ represents the state at time t; $d_1(t), \ldots, d_N(t)$ represent the distances between the 1st to N-th IoT devices and the UAV at time t; $R_1(t), \ldots, R_N(t)$ represent the transmission rates at which the 1st to N-th IoT devices transmit information to the UAV at time t; $E^{re}(t)$ represents the remaining energy of the UAV at time t; and $q(t)$ represents the flight trajectory of the UAV.
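  • A small sketch of assembling this state vector for N IoT devices (NumPy assumed; the caller supplies the per-device transmission rates and the remaining energy):

```python
import numpy as np

def build_state(uav_pos, device_positions, rates, remaining_energy):
    """phi(s_t) = [d_1(t)..d_N(t), R_1(t)..R_N(t), E_re(t)]^T for N IoT devices."""
    distances = [np.linalg.norm(uav_pos - p) for p in device_positions]
    return np.array(distances + list(rates) + [remaining_energy], dtype=np.float32)
```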
  • Step 404 Obtain the action decision data of the drone.
  • the aforementioned action decision data is used to characterize the actions of the drone, and these actions are issued by the drone to control the flight trajectory.
  • The action decision data can include the following three parts: i) the horizontal flight control angle of the UAV at time t, $\omega_t \in [0, 2\pi]$; ii) the vertical flight control angle of the UAV at time t, $\theta_t \in [0, \pi]$; iii) the acceleration $AC_t$ of the UAV at time t; that is, the action can be expressed as $a_t = [\omega_t, \theta_t, AC_t]^T$.
  • The instantaneous flight velocity of the UAV can be expressed as $v(t) = \mathrm{d}q(t)/\mathrm{d}t$ and its acceleration as $a(t) = \mathrm{d}v(t)/\mathrm{d}t$, and both of these parameters are three-dimensional, continuous, and bounded.
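  • One way to interpret this action as a 3-D acceleration vector and integrate the trajectory is sketched below; the spherical-coordinate mapping of the two angles and the unit time step are assumptions for illustration, not a formula from this specification.

```python
import numpy as np

def step_kinematics(position, velocity, omega_t, theta_t, ac_t, dt=1.0):
    """Update the UAV state by one time step: the action angles give the
    direction of the acceleration vector, AC_t gives its magnitude."""
    accel = ac_t * np.array([
        np.sin(theta_t) * np.cos(omega_t),
        np.sin(theta_t) * np.sin(omega_t),
        np.cos(theta_t),
    ])
    velocity = velocity + accel * dt
    position = position + velocity * dt
    return position, velocity
```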
  • Step 406 Calculate the average path loss of the UAV.
  • the communication channel between the UAV and the Internet of Things device usually adopts an air-to-ground link in the Sub-6GHz frequency band, and line-of-sight transmission (LoS) is dominant in this wireless link.
  • The average path loss between the UAV and the ground IoT device u at time t can be expressed by the following formula (1): $L_u(t) = 20\log_{10}\!\left(\frac{4\pi f_c d_u(t)}{c}\right) + \eta_{LoS}$, where $f_c$ represents the center frequency; $d_u(t)$ represents the Euclidean distance between the UAV and device u at time t; $c$ represents the speed of light; and $\eta_{LoS}$ represents the additional spatial propagation loss of the LoS link, which is usually a constant.
  • Step 408 Calculate the signal-to-noise ratio based on the above average path loss.
  • The signal-to-noise ratio (SINR) between the UAV and IoT device u at time t can be expressed by the following formula (2): $\Gamma_u(t) = \frac{P_u\, g_u(t)}{N_0}$, where $P_u$ represents the transmission power of the uplink of device u; $g_u(t)$ represents the gain of the channel between the UAV and device u at time t; and $N_0$ is the noise power.
  • Since the channel gain is determined only by the path loss, $g_u(t) = 10^{-L_u(t)/10}$.
  • Step 410 Determine the maximum transmission rate of the device u to the UAV according to the foregoing signal-to-noise ratio.
  • The maximum transmission rate of device u to the UAV can be expressed as $R_u^{\max}(t) = B \log_2\!\left(1 + \Gamma_u(t)\right)$, where B represents the channel bandwidth, assuming that all devices have the same bandwidth.
  • the maximum transmission rate of the device u to the drone can be determined through the above steps 406-410.
  • Step 412 Calculate the energy loss of the drone.
  • the energy loss of the drone may include one or a combination of flight energy loss caused by propulsion and communication energy loss related to communication. Therefore, in an embodiment of this specification, the energy loss of the drone may be the sum of the flight energy loss and the communication energy loss.
  • The flight energy loss caused by the propulsion force keeps the UAV flying in the air and allows it to change its flight trajectory; its power is related to the flight speed and acceleration of the UAV. Therefore, the flight energy loss of the UAV at time t can be expressed as a function of the flight trajectory q(t), as given by formula (3).
  • communication energy loss includes radiation, signal processing, and other circuit losses, of which the energy loss caused by signal processing dominates.
  • The energy loss caused by signal processing has nothing to do with the flight of the UAV and is an inversely proportional function of the square of the flight time. Therefore, the communication energy loss of the UAV at time t can be expressed by formula (4), in which $E_{comp}$ is the communication-related energy loss at time t, G represents the hardware computation constant of the UAV node, D represents the number of bits of data that the UAV needs to process, and t is the time.
  • The remaining energy of the UAV can be expressed as the difference between the initial total energy of the UAV and the energy loss of the UAV. For example, suppose E(q(t)) is the energy loss of the UAV at time t and $E^{re}(t)$ represents the remaining energy of the UAV at time t; then the remaining energy of the UAV at time t is the initial total energy of the UAV before this flight minus the energy loss of the UAV at time t, namely $E^{re}(t) = E_0 - E(q(t))$, where $E_0$ is the initial total energy of the UAV before this flight.
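  • As an illustration of this bookkeeping, the sketch below assumes the simplest form consistent with the description above (communication energy growing with G and D and inversely proportional to the square of the flight time); the exact formulas (3) and (4) of the specification are not reproduced here, so the function bodies are placeholder assumptions.

```python
def communication_energy(g_const, d_bits, t):
    """Assumed form of the communication energy loss at time t: increases with the
    hardware computation constant g_const and the number of bits d_bits, and is
    inversely proportional to the square of the flight time t."""
    return g_const * d_bits / (t ** 2)

def remaining_energy(e0, flight_energy, comm_energy):
    """E_re(t) = E_0 - E(q(t)), where E(q(t)) is the sum of the flight and
    communication energy losses up to time t."""
    return e0 - (flight_energy + comm_energy)
```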
  • Step 414 Build a reward function.
  • The reward function can be defined as the instantaneous energy efficiency of the UAV, that is, the maximum transmission rate of device u to the UAV divided by the energy loss of the UAV.
  • Further, a penalty term should be added to the reward function; that is, the new reward function can be determined as the sum of the original reward function $r(s_t, a_t)$ and a penalty term $\phi$.
  • The penalty term $\phi$ can be a predetermined negative value. For example, if the UAV runs out of energy on its way back, causing the UAV to crash, the reward function value is directly set to a large negative number, such as -100. Of course, the above penalty can also be set to a positive value.
  • In that case, the new reward function can be determined as the difference between the original reward function $r(s_t, a_t)$ and the above-mentioned positive penalty term $\phi$.
  • the value of the specific penalty item can be flexibly set by those skilled in the art according to the actual scenario, and is not unique, and this specification does not list them one by one.
  • Therefore, the embodiments of this specification can further determine the return time of the UAV through continuous learning, so that the UAV can return home in time to avoid losses and improve the energy efficiency of UAV flight work.
  • the instantaneous energy efficiency of the drone can be determined through the above steps 410-414, that is, the reward function can be established.
  • Step 416 Establish a strategy function.
  • The reinforcement learning method based on the policy gradient parameterizes the policy, modeling it as a stochastic mapping $\pi_\theta: S \to P(A)$, which gives, for any state in the state set S (that is, the set of states s), the probability of taking each action in the action set A (that is, the set of actions a); $\theta \in \mathbb{R}^n$ is the strategy parameter that needs to be optimized.
  • $\mathbb{R}^n$ represents the set of n-dimensional real vectors, and the size of n is equal to the dimension of θ.
  • Step 418 Establish a target equation based on the above reward function and strategy function.
  • the state value function of state s under the strategy ⁇ ⁇ is defined as the long-term cumulative return.
  • The state value function can be expressed as the following formula (5): $V^{\pi_\theta}(s) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{k \geq 0} \gamma^{k}\, r(s_{t+k}, a_{t+k}) \,\middle|\, s_t = s\right]$, where $\gamma$ is the discount factor, with value range $\gamma \in [0, 1]$.
  • The state-action value function of action a can be defined as the following formula (6): $Q^{\pi_\theta}(s, a) = \mathbb{E}_{\pi_\theta}\!\left[r(s_t, a_t) + \gamma V^{\pi_\theta}(s_{t+1}) \,\middle|\, s_t = s, a_t = a\right]$.
  • $C_1$ and $C_2$ are the constraint conditions on the UAV flight speed and acceleration, respectively.
  • the strategy gradient method can be applied to optimize the strategy ⁇ ⁇ to maximize the target equation.
  • The gradient of the target equation with respect to the variable θ can be expressed by the following formula (9): $\nabla_\theta J(\theta) = \hat{\mathbb{E}}_t\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(Q^{\pi_\theta}(s_t, a_t) - b_t\big)\right]$, where $b_t$ is a constant baseline introduced into the reward in order to reduce the variance of the policy gradient.
  • When such a constant is introduced into the reward, the policy gradient remains unchanged but its variance decreases.
  • $b_t$ is typically chosen as the estimated value of the state value function $V^{\pi}(s_t)$.
  • The policy gradient usually has a large variance, so it changes greatly under the influence of the parameters.
  • The parameter update equation is $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$, where α is the update step size.
  • If the step size is inappropriate, the strategy corresponding to the updated parameters may be a worse strategy.
  • The trust region policy optimization (TRPO) algorithm improves the robustness of the algorithm by limiting the size of the policy change in each iteration.
  • the deep reinforcement learning algorithm PPO inherits the advantages of the trust region method algorithm, while the implementation method is simpler, more versatile, and has better sample complexity based on experience.
  • Step 420 Use the PPO algorithm to rewrite the above-mentioned target equation.
  • Using the PPO algorithm, the target equation can be rewritten as the following formula (10): $L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta), 1-\varepsilon, 1+\varepsilon\right)\hat{A}_t\right)\right]$
  • where θ is the parameter to be optimized in the strategy function; ε is a preset constant whose purpose is to control the update range of the strategy; $\hat{\mathbb{E}}_t$ is the mathematical expectation symbol, which means taking the average value over time t; and $\hat{A}_t$ is the advantage function.
  • $r_t(\theta)$ is the ratio of the new strategy function to the old strategy function, which can be expressed as $r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$.
  • The old strategy function and the new strategy function mean that, in one iterative update, the strategy function after the update is the new strategy function and the strategy function before the update is the old strategy function.
  • The advantage function can be written as $\hat{A}_t = \delta_t + (\gamma\lambda)\delta_{t+1} + \cdots + (\gamma\lambda)^{T-t-1}\delta_{T-1}$, with the temporal-difference error $\delta_t = r(s_t, a_t) + \gamma V(s_{t+1}) - V(s_t)$.
  • γ is the attenuation index, which is a preset fixed value; λ is the track parameter, which is also a preset fixed value; the value range of γ is (0,1), and the value range of λ is also (0,1).
  • $\delta_t$ is the temporal-difference error at time t, whose mathematical expression is given above; $\delta_{T-1}$ is the temporal-difference error at time T-1; and T is the total duration of autonomous flight.
  • Here T is discretized and can also be referred to as the maximum number of consecutive decision moments; in this way, T-1 represents the last consecutive decision moment.
  • Calculating the advantage function at moment t requires all the data collected over the period from moment t onward.
  • FIG. 5 shows the specific process of the deep reinforcement learning PPO algorithm in the embodiment of this specification.
  • The above PPO algorithm includes the steps shown in FIG. 5.
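  • Combining the pieces sketched earlier, one plausible per-iteration PPO update (collect data, estimate advantages, then take several clipped-objective gradient steps) is shown below; it reuses the helper functions from the previous sketches, `actor_log_prob` is a hypothetical helper returning log π_θ(a_t|s_t) for each recorded step, and the epoch count and clipping range are illustrative assumptions.

```python
import torch

def ppo_update(actor, critic, actor_opt, critic_opt,
               states, actions, rewards, epochs=10, eps=0.2):
    """One PPO iteration over a recorded flight segment."""
    values = critic(states).detach()
    advantages = gae_advantages(rewards, values)             # from the earlier sketch
    returns = advantages + values
    with torch.no_grad():
        old_log_prob = actor_log_prob(actor, states, actions)  # hypothetical helper

    for _ in range(epochs):
        new_log_prob = actor_log_prob(actor, states, actions)
        objective = ppo_clip_objective(new_log_prob, old_log_prob, advantages, eps)
        actor_opt.zero_grad()
        (-objective).backward()                              # maximize the clipped objective
        actor_opt.step()
        update_critic(critic, critic_opt, states, returns)   # from the earlier sketch
```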
  • The UAV trajectory optimization solution proposed in the embodiments of this specification can take the remaining energy of the UAV into consideration as part of the state input to the reinforcement learning network, and can directly output the acceleration and flight direction of the UAV.
  • Since a penalty value is added to the reward function, the return time of the UAV can also be output.
  • the program uses online learning to dynamically adjust the learned strategies according to environmental changes to adapt to the environment. At the same time, this scheme considers the control problem in the continuous domain, which is consistent with the continuous domain flight control mechanism in the actual scenario.
  • The PPO algorithm is the continuous-domain control algorithm with the best robustness and the most outstanding performance; it eliminates the difficulty of determining an appropriate learning step size and reduces the complexity of the algorithm.
  • an embodiment of this specification also provides an UAV trajectory optimization device, the internal structure of which is shown in FIG. 6, including: a construction module 602, a training data collection module 604, and a training module 606.
  • the UAV trajectory optimization device can be built into the UAV or the Internet of Things device, and it can also be a separate device that can communicate with the UAV and the Internet of Things device. The embodiment of this specification does not limit this.
  • the aforementioned construction module 602 is used to construct a deep reinforcement learning network.
  • the above-mentioned training data collection module 604 is used to obtain status data and action decision data of the drone during the flight of the drone, and calculate the instantaneous energy efficiency of the drone.
  • the training module 606 is configured to take the above-mentioned state data as input, the above-mentioned action decision data as the output, and the above-mentioned instantaneous energy efficiency as a reward return, to train the deep reinforcement learning network, optimize the strategy parameters, and output the drone flight strategy.
  • The construction module 602, training data collection module 604, and training module 606 can implement their specific functions through the methods described in the foregoing embodiments, which will not be repeated here.
  • Fig. 7 shows the hardware structure of the UAV trajectory optimization device provided by an embodiment of this specification.
  • The above-mentioned UAV trajectory optimization device includes at least one processor 702 and a memory 704 communicatively connected with the at least one processor, wherein the memory 704 stores instructions executable by the at least one processor 702, and the instructions are executed by the at least one processor so that the at least one processor can execute the UAV trajectory optimization method described above.
  • the electronic device includes a processor 702 and a memory 704, and may also include: an input device and an output device.
  • the processor, memory, input device, and output device may be connected by a bus or in other ways.
  • As a non-volatile computer-readable storage medium, the memory can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the UAV trajectory optimization method in the embodiments of this specification.
  • the processor executes various functional applications and data processing of the server by running non-volatile software programs, instructions, and modules stored in the memory, that is, realizing the UAV trajectory optimization method of the above method embodiment.
  • the memory may include a storage program area and a storage data area.
  • the storage program area can store an operating system and an application program required by at least one function; the storage data area can store data created according to the use of the drone trajectory optimization device.
  • the memory may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other non-volatile solid-state storage devices.
  • The memory may optionally include memory remotely provided with respect to the processor, and these remote memories may be connected to the UAV trajectory optimization device via a network. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
  • the input device can receive input digital or character information, and generate key signal inputs related to the user settings and function control of the drone trajectory optimization device.
  • the output device may include a display device such as a display screen.
  • the one or more modules are stored in the memory, and when executed by the processor, the drone trajectory optimization method in any of the foregoing method embodiments is executed. Any embodiment of the electronic device that executes the method for optimizing the drone trajectory can achieve the same or similar effect as any of the foregoing method embodiments.
  • All or part of the processes in the above method embodiments can be completed by a computer program instructing relevant hardware; the program can be stored in a computer-readable storage medium, and when executed, it may include the processes of the above-mentioned method embodiments.
  • the storage medium may be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM), etc.
  • the embodiment of the computer program can achieve the same or similar effect as any of the foregoing corresponding method embodiments.
  • the method according to the present disclosure may also be implemented as a computer program executed by a CPU, and the computer program may be stored in a computer-readable storage medium.
  • the computer program executes the above-mentioned functions defined in the method of the present disclosure.
  • The technical solution disclosed in this specification is better able to adapt to the scene and environment than conventional solutions based on convex optimization algorithms, because reinforcement learning algorithms are introduced to optimize the policy parameters during the learning process rather than relying on a fixed target equation, which gives greater flexibility; moreover, the deep reinforcement learning network in this specification strengthens its interaction with the external environment by taking environmental states as input and obtaining reward returns, so it can respond more quickly to changes in the scene and environment.
  • In addition, this specification adopts a continuous-domain UAV trajectory optimization solution.
  • The continuous speed and acceleration of the actions output by the reinforcement learning are closer to reality and make it easy to expand the flight area.
  • For trajectory optimization over a large area, there is no potential problem of dimensional explosion.
  • the technical solution disclosed in this specification integrates deep reinforcement learning and UAV trajectory optimization problems, and the PPO algorithm is used to solve this problem for the first time.
  • The PPO algorithm is less affected by the training step size and is more adaptable when solving control problems in real scenarios; it solves the difficulty of determining the learning step size when using the DDPG algorithm in the prior art, and has higher efficiency.
  • Moreover, this specification also considers the optimal return time for charging/refueling: the UAV can flexibly adjust its flight time and trajectory under the condition of returning home safely, improving its own energy efficiency as much as possible.

Abstract

A path optimization method for an unmanned aerial vehicle (102). The method comprises: obtaining status data and action decision-making data of an unmanned aerial vehicle (102) during a flight thereof (202); determining, according to the status data and the action decision-making data of the unmanned aerial vehicle (102), the instantaneous energy efficiency of the unmanned aerial vehicle (102) (204); training a pre-built deep reinforcement learning network (1064) using the status data as input, the action decision-making data as output, and the instantaneous energy efficiency as a reward, and optimizing a policy parameter of the deep reinforcement learning network (1064) (206); and outputting, according to the trained deep reinforcement learning network (1064), a flight policy of the unmanned aerial vehicle (102) (208). Further provided are a path optimization device (106) for an unmanned aerial vehicle, and a computer-readable storage medium (704).

Description

UAV trajectory optimization method, device and storage medium

This specification is based on, and claims priority to, the Chinese patent application with application number 201910697007.6 filed on July 30, 2019, the entire content of which is hereby incorporated into this specification by reference.

Technical field

This specification relates to the field of wireless communication technology, and in particular to trajectory optimization methods, devices and storage media for unmanned aerial vehicles (UAVs).

Background

UAV communication technology is considered to be an indispensable part of the fifth-generation (5G) and subsequent-evolution (5G+) mobile communication networks. However, the UAV communication system has a unique air-to-ground channel model, highly dynamic three-dimensional flight capability, and limited flight energy, making UAV communication systems more complex than traditional communication systems.

However, some existing technical solutions for optimizing UAV flight trajectories support only limited flight scenarios and flight action schemes; they struggle to cope with the dynamically changing environmental information during UAV flight and deviate from the actual flight requirements of the UAV.
Summary of the invention

Some embodiments of this specification propose a UAV trajectory optimization method, which includes: acquiring UAV state data and action decision data during the flight of the UAV; determining the instantaneous energy efficiency of the UAV according to the state data and the action decision data; training a pre-built deep reinforcement learning network with the state data as input, the action decision data as output, and the instantaneous energy efficiency as the reward, to optimize the strategy parameters of the deep reinforcement learning network; and outputting the UAV flight strategy according to the trained deep reinforcement learning network.

The above method may further include: pre-constructing a deep learning network structure including an action network and an evaluation network, wherein the action network uses the proximal policy optimization algorithm and a deep neural network to fit the UAV flight action strategy function and decide the UAV flight actions, and the evaluation network uses a deep neural network to fit the state value function and optimize the strategy parameters in the UAV flight action strategy function.

Wherein, acquiring the UAV state data and action data includes: determining the distance between the UAV and the IoT devices, the data transmission rate from the IoT devices to the UAV, and the remaining energy of the UAV as the state data; and collecting the acceleration and flight control angles of the UAV as the action data.
Wherein, acquiring the UAV state data and action data further includes: quantifying the state data as $\phi(s_t) = \left[d_1(t), \ldots, d_N(t), R_1(t), \ldots, R_N(t), E^{re}(t)\right]^T$, where $\phi(s_t)$ represents the state data matrix; $s_t$ represents the state at time t; $d_1(t), \ldots, d_N(t)$ respectively represent the distances between the 1st to N-th IoT devices and the UAV at time t; $R_1(t), \ldots, R_N(t)$ respectively represent the transmission rates at which the 1st to N-th IoT devices transmit information to the UAV at time t; and $E^{re}(t)$ represents the remaining energy of the UAV at time t. The action data is represented as $a_t = [\omega_t, \theta_t, AC_t]^T$, where $a_t$ represents the action at time t; $\omega_t \in [0, 2\pi]$ represents the horizontal flight control angle of the UAV at time t; $\theta_t \in [0, \pi]$ represents the vertical flight control angle of the UAV at time t; and $AC_t$ represents the magnitude of the acceleration of the UAV at time t, where $AC_t$ is continuous bounded data.

Wherein, the distance between the IoT device and the UAV includes the Euclidean distance between the IoT device and the UAV.
Wherein, determining the instantaneous energy efficiency of the UAV includes using the following formula: $r(s_t, a_t) = R_u^{\max}(t) / E(q(t))$, where $r(s_t, a_t)$ represents the instantaneous energy efficiency of the UAV when its state at time t is $s_t$ and its action is $a_t$; $R_u^{\max}(t)$ is the maximum rate at which IoT device u transmits data to the UAV at time t; and $E(q(t))$ represents the energy loss of the UAV at time t.
Wherein, the maximum rate at which IoT device u transmits data to the UAV at time t is determined by the following process: determining the average path loss of the UAV; determining the signal-to-noise ratio between the UAV and IoT device u at time t according to the average path loss; and determining the maximum rate at which device u transmits data to the UAV at time t according to the signal-to-noise ratio.

Wherein, determining the average path loss of the UAV includes determining it by the following formula: $L_u(t) = 20\log_{10}\!\left(\frac{4\pi f_c d_u(t)}{c}\right) + \eta_{LoS}$, where $L_u(t)$ represents the average path loss of the UAV; $f_c$ represents the center frequency; $d_u(t)$ represents the distance between the UAV and device u at time t; $c$ represents the speed of light; and $\eta_{LoS}$ represents the additional spatial propagation loss of the LoS link.
Wherein, determining the signal-to-noise ratio of the UAV includes determining the signal-to-noise ratio between the UAV and IoT device u at time t by the following formula: $\Gamma_u(t) = \frac{P_u\, g_u(t)}{N_0}$, where $\Gamma_u(t)$ represents the signal-to-noise ratio between the UAV and IoT device u at time t; $P_u$ represents the transmission power of the uplink of device u; $g_u(t)$ represents the gain of the channel between the UAV and device u at time t; and $N_0$ is the noise power; wherein $g_u(t) = 10^{-L_u(t)/10}$.
Wherein, determining the maximum rate at which device u transmits data to the UAV at time t includes determining it by the following formula: $R_u^{\max}(t) = B \log_2\!\left(1 + \Gamma_u(t)\right)$, where B represents the channel bandwidth and $\Gamma_u(t)$ represents the signal-to-noise ratio between the UAV and IoT device u at time t.

Wherein, the remaining energy of the UAV is the difference between the initial total energy of the UAV and the energy loss of the UAV; the energy loss of the UAV includes at least one of the flight energy loss and the communication energy loss of the UAV.

Determining the instantaneous energy efficiency of the UAV further includes: when the UAV exhausts its energy on the way back, adding a penalty term of a preset value to the formula for calculating the instantaneous energy efficiency.
Wherein, training the pre-built deep reinforcement learning network includes:

Using the proximal policy optimization algorithm, rewriting the target equation of the deep reinforcement learning network as $L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta), 1-\varepsilon, 1+\varepsilon\right)\hat{A}_t\right)\right]$, where θ is the strategy parameter to be optimized; ε is a preset constant used to control the update range of the UAV flight strategy; $\hat{\mathbb{E}}_t$ is the expected value at time t; $\hat{A}_t$ represents the advantage function; clip represents the clipping function; and $r_t(\theta)$ is the ratio of the new strategy function to the old strategy function in one iterative update, which can be expressed as $r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$, where $\pi_\theta$ represents the UAV flight strategy function, $\pi_\theta(a_t \mid s_t)$ represents the new UAV flight strategy function for state $s_t$ and action $a_t$ at time t, and $\pi_{\theta_{old}}(a_t \mid s_t)$ represents the old UAV flight strategy function for state $s_t$ and action $a_t$ at time t.

The advantage function $\hat{A}_t$ can be expressed by the following equation: $\hat{A}_t = \delta_t + (\gamma\lambda)\delta_{t+1} + \cdots + (\gamma\lambda)^{T-t-1}\delta_{T-1}$, where γ is the attenuation index; λ is the track parameter; $\delta_t$ is the temporal-difference error at time t; $\delta_{T-1}$ is the temporal-difference error at time T-1; and T is the total duration of autonomous flight.

Through at least one iterative update, the maximum value of the target equation is found, the strategy parameters in the UAV flight strategy function are optimized, and the strategy parameters corresponding to the maximum of the target equation are output as the UAV flight strategy.
Wherein, the above advantage function $\hat{A}_t$ is obtained by optimization with a deep neural network according to the UAV state data, the UAV action decision data, and the instantaneous energy efficiency of the UAV.

Specifically, obtaining the advantage function $\hat{A}_t$ in this way includes: using the state data, the action decision data, and the instantaneous energy efficiency of the UAV, estimating the advantage function $\hat{A}_t^{\omega}$ with a deep neural network, computing its objective function, updating the parameter ω by gradient descent, and iterating for a predetermined number of iterations; and taking the advantage function $\hat{A}_t$ for which this objective function reaches its maximum value.
上述方法进一步包括:根据所述无人机飞行策略确定无人机的动作决策数据。The above method further includes: determining the action decision data of the drone according to the drone flight strategy.
本说明书的另一些实施例提供了一种无人机轨迹优化装置,该装置包括:Other embodiments of this specification provide a UAV trajectory optimization device, which includes:
构建模块,用于构建深度强化学习网络;Building modules for building deep reinforcement learning networks;
训练数据收集模块,用于在无人机飞行过程中获取无人机的状态数据和动作决策数据,并计算无人机的瞬时能量效率;以及The training data collection module is used to obtain the status data and action decision data of the drone during the flight of the drone, and calculate the instantaneous energy efficiency of the drone; and
训练模块,用于以所述状态数据为输入、以所述动作决策数据为输出,以所述瞬时能量效率为奖励回报,对深度强化学习网络进行训练,优化策略参数,并输出无人机飞行策略。The training module is used to train the deep reinforcement learning network with the state data as input, the action decision data as the output, and the instantaneous energy efficiency as a reward, to train the deep reinforcement learning network, optimize strategy parameters, and output drone flight Strategy.
The construction module is configured to construct a deep learning network structure including an action network and an evaluation network, where the action network uses the proximal policy optimization algorithm and a deep neural network to fit the UAV flight action strategy function and decide the UAV flight actions, and the evaluation network uses a deep neural network to fit the state value function and optimize the strategy parameters in the UAV flight action strategy function.
The training data collection module is configured to determine the distance between the UAV and the IoT devices, the data transmission rate from the IoT devices to the UAV and the remaining energy of the UAV as the state data, and to collect the acceleration and flight control angles of the UAV as the action decision data.
Still other embodiments of this specification provide a UAV trajectory optimization device including at least one processor and a memory communicatively connected with the at least one processor, where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the above UAV trajectory optimization method.
Still other embodiments of this specification further provide a computer-readable storage medium storing computer instructions which, when executed by a processor, implement the above UAV trajectory optimization method.
This specification discloses a UAV trajectory optimization method and device and a UAV based on deep reinforcement learning. Deep reinforcement learning is introduced into UAV trajectory optimization, so that the UAV can interact with the environment in real time during flight, collect the state data and action decision data under the current flight trajectory as training data and, with the instantaneous energy efficiency as the reward function, learn autonomously in real time. The strategy parameters that decide the flight trajectory are thus continuously optimized; that is, the UAV is given the ability to learn autonomously online in its environment and can adapt to changes in a dynamic environment as required.
In addition, thanks to the autonomous learning based on the above PPO algorithm, the UAV trajectory optimization method described in this specification is not constrained by the choice of learning step size.
The autonomous learning method based on the PPO algorithm proposed in this specification can process three-dimensional continuous bounded data: the input data, output data and so on are not limited to a discrete domain, so flight control of the UAV is optimized in three-dimensional space over a continuous domain, which is closer to real scenarios. Compared with control methods based on discrete-domain data or on a limited number of options in a table, this better matches the requirements of the actual flight environment.
Furthermore, while the reward function is assigned the instantaneous energy efficiency of the UAV flight, a penalty term is added to the reward function when the UAV cannot return for charging/refueling. After continuous learning, the return time of the UAV can be determined, so that the UAV can return in time to avoid losses and the energy efficiency of the UAV's flight work is improved.
Description of the drawings
In order to explain the embodiments of this specification or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of this specification; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative labor.
Fig. 1 is a schematic diagram of the overall structure and data interaction of the system to which the UAV trajectory optimization method according to some embodiments of this specification is applied;
Fig. 2 is a schematic flowchart of the UAV trajectory optimization method according to some embodiments of this specification;
Fig. 3 is a schematic flowchart of the UAV trajectory optimization method according to other embodiments of this specification;
Fig. 4 is a schematic flowchart of the specific modeling method of the deep reinforcement learning network according to some embodiments of this specification;
Fig. 5 shows the specific flow of the deep reinforcement learning PPO algorithm in an embodiment of this specification;
Fig. 6 shows a schematic structural diagram of the UAV trajectory optimization device according to an embodiment of this specification; and
Fig. 7 shows a schematic diagram of the hardware structure of the UAV trajectory optimization device according to an embodiment of this specification.
Detailed description
To make the objectives, technical solutions and advantages of the embodiments of this specification clearer, the technical solutions of the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this specification. Based on the described embodiments, all other embodiments obtained by those of ordinary skill in the art without creative labor fall within the protection scope of this specification.
Deep reinforcement learning is a machine learning technique that combines reinforcement learning with deep neural networks. Specifically, a reinforcement learning agent interacts with the environment to collect the reward information of taking different actions in different environment states and, from the collected data, inductively learns the optimal behavior strategy, thereby acquiring the ability to adapt to an unknown dynamic environment. Deep neural networks can significantly improve the generalization ability of the algorithm in high-dimensional state spaces and high-dimensional action spaces, thereby providing the ability to adapt to more complex environments.
The technical solution provided in this specification combines deep reinforcement learning with UAV technology: the state data during the UAV's flight are collected, the action decision data taken under these state data are determined, and the reward information is further determined. Then, based on the collected data, the optimal flight strategy of the UAV is inductively learned, so that the UAV acquires the ability to adapt to an unknown dynamic environment.
Fig. 1 shows the overall structure of the system to which the UAV trajectory optimization method provided by some embodiments of this specification is applied, together with the related data interaction process.
As shown in Fig. 1, in this embodiment, the system to which the UAV trajectory optimization method of this specification is applied may be a single UAV 102 serving multiple fixed IoT devices 104; the IoT devices 104 are activated randomly or periodically to collect data and transmit them to the UAV 102.
In addition, in this system, the apparatus that executes the UAV trajectory optimization method may be referred to as the UAV trajectory optimization device 106. In the embodiments of this specification, the UAV trajectory optimization device 106 may be located in the UAV 102 or in an IoT device 104, or it may be a computing device independent of the UAV 102 and the IoT devices 104 that can communicate with the UAV 102 and the IoT devices 104.
As shown in Fig. 1, the UAV trajectory optimization device 106 may include a data acquisition module 1062 and a deep reinforcement learning network 1064.
Specifically, in the embodiments of this specification, the deep reinforcement learning network 1064 may adopt a deep reinforcement learning structure based on the Actor-Critic framework, that is, it includes an action network 1066 and an evaluation network 1068. The action network 1066 can use the proximal policy optimization (PPO) algorithm and a deep neural network to fit the UAV flight strategy function and thereby decide the UAV's flight actions, while the evaluation network can use a deep neural network to fit the state value function and optimize the strategy parameters in the UAV flight action strategy function.
As one implementation, the UAV 102 inputs its distances to the IoT devices 104, the transmission rates and its remaining energy as state data into the action network 1066 of the deep reinforcement learning network 1064; action decision data such as the UAV's acceleration and flight control angles (i.e. the flight direction) are taken as the behavior output by the deep reinforcement learning network 1064, and the instantaneous energy efficiency of the UAV 102 is used as the reward. By continuously interacting with the environment, data of state inputs, action decisions and rewards are generated as training data for the evaluation network 1068 and the action network 1066. The evaluation network uses a deep neural network to fit the state value function and provides the advantage function for optimizing the action network; the action network uses the PPO algorithm to optimize the strategy parameters and a deep neural network to fit the strategy function. After multiple rounds of iterative updating, the UAV 102 can adapt to the environment and obtain the optimal flight strategy, that is, the optimal flight trajectory of the UAV 102, thereby realizing UAV trajectory optimization.
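To make the Actor-Critic structure described above more concrete, the following is a minimal sketch of the two networks. PyTorch is assumed, and the hidden-layer sizes, the Gaussian policy head and the tanh activations are illustrative assumptions rather than details given in this specification.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Action network: maps the state phi(s_t) to a distribution over
    (horizontal angle, vertical angle, acceleration magnitude)."""
    def __init__(self, state_dim, action_dim=3, hidden=128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.mu = nn.Linear(hidden, action_dim)           # mean of the Gaussian policy
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state):
        h = self.body(state)
        return torch.distributions.Normal(self.mu(h), self.log_std.exp())

class Critic(nn.Module):
    """Evaluation network: fits the state value function V(s_t)."""
    def __init__(self, state_dim, hidden=128):
        super().__init__()
        self.v = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state):
        return self.v(state).squeeze(-1)
```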
The specific implementation of the UAV trajectory optimization method is described in detail below with specific examples. Fig. 2 is a schematic flowchart of a UAV trajectory optimization method provided by some embodiments of this specification. The method may be executed by the UAV, by an IoT device, or by a computing device independent of the UAV and the IoT devices. As shown in Fig. 2, the method may include the following steps.
Step 202: acquire the state data and action decision data during the UAV's flight.
In the implementation of this specification, the state data may include: the distances between the UAV and the IoT devices, the transmission rates at which the IoT devices transmit data to the UAV, and the remaining energy of the UAV. The action decision data may include: the acceleration of the UAV, the flight control angles (i.e. the flight direction), and so on.
Specifically, the distance between the UAV and an IoT device may be the Euclidean distance between them.
In the implementation of this specification, the action decision data may be determined according to the state data and the current UAV flight strategy.
Step 204: determine the instantaneous energy efficiency of the UAV according to the state data and the action decision data.
The method for determining the instantaneous energy efficiency of the UAV will be described in detail in later embodiments and is not elaborated here.
Step 206: with the state data as input, the action decision data as output and the instantaneous energy efficiency as the reward, train the pre-built deep reinforcement learning network and optimize the strategy parameters of the deep reinforcement learning network.
In the embodiments of this specification, the PPO algorithm may be used to train the deep reinforcement learning network so as to optimize its strategy parameters. The strategy parameters are the parameters that determine the UAV's action strategy data and are therefore also the parameters that determine the UAV's flight trajectory.
As mentioned above, in some embodiments of this specification, the pre-built deep reinforcement learning network may adopt a deep reinforcement learning structure based on the Actor-Critic framework, consisting of an action network and an evaluation network. The action network uses the PPO algorithm and a deep neural network to fit the flight strategy function and decide actions, while the evaluation network uses a deep neural network to fit the state value function and optimize the strategy parameters.
Step 208: output the flight strategy of the UAV according to the trained deep reinforcement learning network.
In the embodiments of this specification, the above steps 202 to 208 may be executed in a loop, and the strategy parameters of the deep reinforcement learning network are iteratively updated with the collected data to obtain a continuously optimized UAV flight strategy.
In the embodiments of this specification, through the above continuous optimization process, the UAV flight strategy is the flight strategy, obtained by autonomous learning, that maximizes the energy efficiency.
If the above UAV trajectory optimization method is executed by the UAV, then after obtaining the flight strategy the UAV can further determine its own flight parameters, namely the UAV's acceleration and flight control angles, according to the flight strategy and the current state data, and can perform the flight task according to these flight parameters.
If the above UAV trajectory optimization method is executed by a device other than the UAV, then after obtaining the flight strategy the flight strategy can be transmitted to the UAV, and the UAV further determines its own flight parameters, namely its acceleration and flight control angles, according to the flight strategy and the current state data, and can perform the flight task according to these flight parameters.
Furthermore, in the embodiments of this specification, when the UAV cannot return for charging/refueling, a penalty term can also be added to the instantaneous energy efficiency of the UAV, and the instantaneous energy efficiency with the added penalty term is used as the reward function. After continuous learning, this enables the UAV to return in time to avoid losses, thereby further improving the energy efficiency of the UAV's flight work.
As one implementation, an example of the process of the UAV trajectory optimization method provided by some embodiments of this specification may be as shown in Fig. 3 and includes the following steps.
Step 302: initialize the reinforcement learning decision strategy and its related parameters, as well as the parameters related to the deep neural networks.
The specific method for determining the reinforcement learning decision strategy and the related parameters will be described in detail later.
Step 304: during the flight, the UAV flies autonomously and records the relevant data.
In the embodiments of this specification, step 304 may include: the UAV calculates the distances to the IoT devices, the transmission rates and the remaining energy, decides the flight parameters based on the current flight strategy, receives the data sent by the IoT devices, and calculates the instantaneous energy efficiency under the current flight trajectory.
Step 306: with the data collected over the preset period of time, the evaluation network fits the state value function, calculates the advantage function and passes it to the action network.
Step 308: train the parameters of the deep neural networks of the action network and the evaluation network respectively, and update the UAV flight strategy.
Step 310: repeat the above steps 304 to 308 until the UAV mission ends.
The above UAV trajectory optimization method introduces deep reinforcement learning into UAV trajectory optimization. In this way, the UAV interacts with the environment in real time during flight, collects the state data and action data under the current flight trajectory as training data and, with the instantaneous energy efficiency as the reward function, learns autonomously in real time, realizing continuous optimization of the strategy parameters that decide the flight trajectory. That is, the UAV is given the ability to learn autonomously online in its environment and can adapt to changes in a dynamic environment as required.
In addition, the above autonomous learning based on the PPO algorithm has the advantage of not being constrained by the choice of learning step size.
Further, the data objects processed by the above autonomous learning method can be three-dimensional continuous bounded data: the input data, output data and so on are not limited to a discrete domain, so flight control of the UAV is optimized in three-dimensional space over a continuous domain, which is closer to real scenarios. Compared with control methods based on discrete-domain data or on a limited number of options in a table, this better matches the requirements of the actual flight environment.
The UAV communication modeling method used in this specification and the energy-efficient UAV trajectory optimization method based on deep reinforcement learning are further explained in detail below with specific examples.
In the deep reinforcement learning network model for UAV trajectory optimization established in this embodiment, consider a scenario in which one UAV provides delay-tolerant services for N ground IoT devices; the IoT devices are randomly distributed and fixed in position, and collect data periodically or randomly and transmit them to the UAV. The goal is to optimize the UAV's flight trajectory and maximize the cumulative energy efficiency under the condition of limited energy. To accomplish this goal, the UAV should be able to monitor its remaining energy and decide the optimal time to return for charging/refueling.
The specific modeling method of the deep reinforcement learning network described in the embodiments of this specification is shown in Fig. 4 and may include the following steps.
Step 402: extract the state data of the UAV from the flight environment.
In the embodiments of this specification, the state data can be extracted from the environment and obtained by calculation, and can be characterized as the following three parts: i) the distance from the UAV to each IoT device; ii) the transmission rate at which each IoT device transmits information to the UAV; iii) the remaining energy of the UAV.
Further, in the embodiments of this specification, the state data can be quantified as

    φ(s_t) = [d_t^1, ..., d_t^N, R_t^1, ..., R_t^N, E_t^rem]^T

(here "T" denotes the transpose of the matrix), where φ(s_t) is the state data matrix; s_t is the state at time t; d_t^1, ..., d_t^N are the distances between the 1st to Nth IoT devices and the UAV at time t; R_t^1, ..., R_t^N are the transmission rates at which the 1st to Nth IoT devices transmit information to the UAV at time t; E_t^rem is the remaining energy of the UAV at time t; and q(t) denotes the flight trajectory of the UAV.
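The assembly of the state data into the vector φ(s_t) described above can be sketched as follows; the array layout and the NumPy-based helper are assumptions of this sketch, not part of the specification.

```python
import numpy as np

def build_state(uav_pos, device_pos, rates, remaining_energy):
    """Assemble phi(s_t) = [d_t^1..d_t^N, R_t^1..R_t^N, E_t^rem]^T.

    uav_pos:          (3,) UAV position at time t
    device_pos:       (N, 3) fixed IoT device positions
    rates:            (N,) current uplink rates of the N devices
    remaining_energy: scalar, energy left on board at time t
    """
    distances = np.linalg.norm(device_pos - uav_pos, axis=1)   # Euclidean distances
    return np.concatenate([distances, rates, [remaining_energy]])
```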
Step 404: acquire the action decision data of the UAV.
In the embodiments of this specification, the action decision data are used to characterize the actions of the UAV; these actions are issued by the UAV to control the flight trajectory. The action decision data can generally include the following parts: i) the horizontal flight control angle of the UAV at time t, ω_t ∈ [0, 2π]; ii) the vertical flight control angle of the UAV at time t, θ_t ∈ [0, π]; iii) the magnitude of the acceleration of the UAV at time t, AC_t.
Further, in the embodiments of this specification, the action decision data can be quantified as a_t = [ω_t, θ_t, AC_t]^T (here "T" denotes the transpose of the matrix).
It should be noted that, in the embodiments of this specification, the instantaneous flight velocity of the UAV can be expressed as v(t) = dq(t)/dt and the acceleration of the UAV as AC(t) = dv(t)/dt, and both of these quantities are three-dimensional, continuous and bounded.
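The specification gives the action as the two flight control angles and the acceleration magnitude but does not spell out how they map to a three-dimensional acceleration vector; the spherical-coordinate decomposition below is one plausible, purely illustrative reading.

```python
import numpy as np

def action_to_acceleration(omega, theta, ac):
    """Map (omega, theta, AC) to a 3-D acceleration vector.

    omega in [0, 2*pi] is the horizontal control angle, theta in [0, pi] the
    vertical control angle and ac the acceleration magnitude; the
    spherical-coordinate mapping used here is an assumption of this sketch.
    """
    ax = ac * np.sin(theta) * np.cos(omega)
    ay = ac * np.sin(theta) * np.sin(omega)
    az = ac * np.cos(theta)
    return np.array([ax, ay, az])
```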
Step 406: calculate the average path loss of the UAV.
In the embodiments of this specification, the communication channel between the UAV and the IoT devices usually adopts an air-to-ground link in the Sub-6 GHz band, in which line-of-sight (LoS) transmission is dominant. In this case, the average path loss between the UAV and the ground IoT device u at time t can be expressed, in the standard free-space LoS form, by the following formula (1):

    L_u(t) = 20 log10( 4π f_c d_u(t) / c ) + η_LoS                (1)

where f_c is the center frequency, d_u(t) is the Euclidean distance between the UAV and device u at time t, c is the speed of light, and η_LoS is the additional spatial propagation loss of the LoS link, which is usually a constant.
Step 408: calculate the signal-to-noise ratio according to the above average path loss.
In the embodiments of this specification, the signal-to-noise ratio (SINR) between the UAV and the IoT device u at time t can be expressed by the following formula (2):

    SINR_u(t) = P_u g_u(t) / N_0                (2)

where P_u is the transmission power of the uplink of device u, g_u(t) is the gain of the channel between the UAV and device u at time t, and N_0 is the noise power.
Assuming that the transmission power and noise power of all devices are the same, the channel gain is determined only by the path loss, so g_u(t) = 10^(−L_u(t)/10).
Step 410: determine, according to the above signal-to-noise ratio, the maximum rate at which device u transmits to the UAV.
In the embodiments of this specification, assuming that the Doppler effect caused by the movement of the UAV can be perfectly compensated by existing techniques such as phase-locked loops, the maximum rate at which device u transmits to the UAV can be expressed as R_u(t) = B log2(1 + SINR_u(t)), where B is the channel bandwidth, assumed to be the same for all devices.
That is, in the embodiments of this specification, the maximum rate at which device u transmits to the UAV can be determined through the above steps 406-410.
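Under the reconstruction of formulas (1) and (2) given above, the per-device uplink rate can be sketched as follows; all numerical values (carrier frequency, LoS excess loss, transmit power, noise power, bandwidth) are placeholder assumptions, since the specification does not fix them.

```python
import numpy as np

def uplink_rate(distance_m, f_c=2.4e9, eta_los_db=1.0, p_u=0.1,
                n0=1e-13, bandwidth=1e6):
    """Rate of one IoT device under the LoS model of formulas (1)-(2).

    All numeric defaults are illustrative placeholder values.
    """
    c = 3e8                                                    # speed of light (m/s)
    path_loss_db = 20 * np.log10(4 * np.pi * f_c * distance_m / c) + eta_los_db
    gain = 10 ** (-path_loss_db / 10)                          # channel gain from path loss
    sinr = p_u * gain / n0                                     # formula (2), no interference term
    return bandwidth * np.log2(1 + sinr)                       # Shannon rate
```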
Step 412: calculate the energy loss of the UAV.
In the embodiments of this specification, the energy loss of the UAV can include one or a combination of the flight energy loss caused by propulsion and the communication-related energy loss. Therefore, in one embodiment of this specification, the UAV energy loss can be the sum of the flight energy loss and the communication energy loss.
The flight energy loss caused by propulsion allows the UAV to remain airborne and to change its flight trajectory; its power is related to the UAV's flight speed and acceleration. Therefore, the flight energy loss of the UAV at time t can be expressed as a function of the flight trajectory q(t), given by formula (3), in which ||v(t)|| is the instantaneous speed of the UAV, AC(t) is the acceleration of the UAV (AC^T(t) denoting its transpose, "T" here being the transpose symbol), and c_1 and c_2 are two constants related to the physical properties of the UAV itself, such as the number of wings and the weight.
In addition, the communication energy loss includes radiation, signal processing and other circuit losses, among which the energy loss caused by signal processing is dominant. The energy loss caused by signal processing is independent of the UAV's flight and is an inversely proportional function of the square of the flight time. Therefore, the communication energy loss of the UAV up to time t can be expressed by formula (4) as E_comp, which depends on the hardware computing constant G of the UAV node, the number of bits of data D that the UAV needs to process, and the time t.
In the embodiments of this specification, the remaining energy of the UAV can be expressed as the difference between the UAV's initial total energy and the UAV's energy loss. For example, let E(q(t)) be the energy loss of the UAV at time t and E_t^rem the remaining energy of the UAV at time t; then the remaining energy of the UAV at time t is the initial total energy of the UAV before this flight minus the energy loss of the UAV at time t, that is

    E_t^rem = E_0 − E(q(t))

where E_0 is the initial total energy of the UAV before this flight.
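A minimal sketch of the corresponding energy bookkeeping is given below; the discrete time step and the assumption that the propulsion and signal-processing powers are computed elsewhere (per formulas (3) and (4)) are simplifications of this sketch.

```python
def update_remaining_energy(e_rem, flight_power, comm_power, dt):
    """One discrete-time step of E_t^rem = E_0 - E(q(t)).

    flight_power is the propulsion power behind formula (3) (a function of
    the instantaneous speed and acceleration) and comm_power the
    signal-processing power behind formula (4); both are assumed to be
    computed elsewhere, and the step size dt is an assumption of this sketch.
    """
    return e_rem - (flight_power + comm_power) * dt
```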
Step 414: establish the reward function.
In the embodiments of this specification, the reward function can be defined as the instantaneous energy efficiency of the UAV, that is, the ratio of the maximum rate R_u(t) at which device u transmits to the UAV to the instantaneous energy loss E(q(t)) of the UAV:

    r(s_t, a_t) = R_u(t) / E(q(t))

In addition, in other embodiments of this specification, since the algorithm needs to decide automatically the UAV's return time for charging/refueling, a penalty term should be added to the reward function when the UAV's energy is exhausted on the way back. That is, the new reward function can be determined as the sum of the original reward function r(s_t, a_t) and a penalty term ∈: r'(s_t, a_t) = r(s_t, a_t) + ∈. Usually in this case the penalty term ∈ can be a predetermined negative value. For example, if the UAV runs out of energy on the way back and crashes, the value of the reward function is directly set to a large negative number, such as -100. Of course, the penalty term can also be set to a positive value; in that case, the new reward function can be determined as the difference between the original reward function r(s_t, a_t) and the positive penalty term ∈: r'(s_t, a_t) = r(s_t, a_t) − ∈. The specific value of the penalty term can be flexibly set by those skilled in the art according to the actual scenario and is not unique; this specification does not list the possibilities one by one. By introducing a penalty term into the above reward function, the embodiments of this specification can further determine, through continuous learning, the return time of the UAV, so that the UAV can return in time to avoid losses and the energy efficiency of the UAV's flight work is improved.
In this way, through the above steps 410-414 the instantaneous energy efficiency of the UAV can be determined, that is, the reward function is established.
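A sketch of the reward described in step 414, including the crash penalty variant, could look like the following; the -100 value follows the example given in the text, while the boolean crash flag is an assumption of this sketch.

```python
def reward(rate, instantaneous_energy_loss, crashed, penalty=-100.0):
    """Reward r(s_t, a_t): the instantaneous energy efficiency, i.e. the
    maximum uplink rate divided by the instantaneous energy loss.  If the UAV
    runs out of energy before returning (crashed), a penalty term is added;
    -100 is the example value mentioned in the text."""
    r = rate / instantaneous_energy_loss
    if crashed:
        r += penalty
    return r
```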
Step 416: establish the strategy function.
The reinforcement learning method based on the policy gradient parameterizes the strategy and models it as a stochastic function, namely π_θ: S → P(A), which represents the probability of taking the actions in the action set A (i.e. the set of actions a) in any state in the state set S (i.e. the set of states s); θ ∈ R^n is the strategy parameter that needs to be optimized, R^n denotes the set of n-dimensional real vectors, and n equals the dimension of θ.
Step 418: establish the objective equation based on the above reward function and strategy function.
In reinforcement learning, the state value function of a state s under the strategy π_θ is defined as the long-term accumulated return. When the state is s and the strategy is π_θ, the state value function can be expressed in the form of formula (5):

    V^{π_θ}(s) = E[ Σ_{t≥0} γ^t r(s_t, a_t) | s_0 = s; π_θ ]                (5)

where γ is the discount factor with value range γ ∈ [0, 1]. Similarly, under the strategy π_θ, the state-action value function of an action a can be defined in the form of formula (6):

    Q^{π_θ}(s, a) = E[ Σ_{t≥0} γ^t r(s_t, a_t) | s_0 = s, a_0 = a; π_θ ]                (6)

Thus, the objective equation of reinforcement learning can be expressed by formula (7):

    J(θ) = E_{s∼ρ^{π_θ}, a∼π_θ}[ r(s, a) ]                (7)

where ρ^{π_θ} is the discounted state visitation probability distribution under the strategy π_θ.
Therefore, the UAV trajectory optimization problem based on reinforcement learning can finally be simplified to formula (8):

    max_θ J(θ),   subject to C1 and C2                (8)

where C1 and C2 are the constraints on the UAV's flight speed and acceleration, respectively.
In the embodiments of this specification, the policy gradient method can be applied to optimize the strategy π_θ so as to maximize the objective equation. The gradient of the objective equation with respect to the variable θ can be expressed by formula (9):

    ∇_θ J(θ) = E[ ∇_θ log π_θ(a_t | s_t) · (R_t − b_t) ]                (9)

where b_t is a constant baseline introduced into the return in order to reduce the variance of the policy gradient; introducing such a constant leaves the policy gradient unchanged while reducing its variance. In particular, b_t is usually chosen as an estimate of the state value function V_θ(s_t), and R_t − b_t can then be regarded as an estimate of the advantage function A(a_t, s_t) = Q(a_t, s_t) − V(s_t).
When the policy gradient algorithm is used, the policy gradient usually has a large variance and is therefore strongly affected by the parameters. Moreover, according to the policy gradient algorithm, the parameter update equation is θ ← θ + α ∇_θ J(θ), where α is the update step size; when the step size is inappropriate, the strategy corresponding to the updated parameters will be a worse strategy.
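As a reference point for formula (9), a policy-gradient-with-baseline loss can be sketched as follows (PyTorch assumed); minimizing it with automatic differentiation yields the gradient estimate of formula (9).

```python
import torch

def vanilla_pg_loss(log_probs, returns, baselines):
    """REINFORCE-with-baseline estimate of formula (9): the gradient of
    J(theta) is E[grad log pi_theta(a_t|s_t) * (R_t - b_t)].  Minimizing this
    loss with autograd produces that gradient."""
    advantages = returns - baselines                  # R_t - b_t, an estimate of A(a_t, s_t)
    return -(log_probs * advantages.detach()).mean()
```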
The trust-region method TRPO (trust region policy optimization) improves the robustness of the algorithm by limiting the size of the policy change in each iteration. The deep reinforcement learning algorithm PPO inherits the advantages of the trust-region methods while being simpler to implement, more general, and empirically having better sample complexity.
Step 420: rewrite the above objective equation using the PPO algorithm.
In the embodiments of this specification, using the PPO algorithm the objective equation can be rewritten as formula (10):

    L^{CLIP}(θ) = Ê_t[ min( r_t(θ) Â_t , clip(r_t(θ), 1 − ε, 1 + ε) Â_t ) ]                (10)

where θ is the parameter to be optimized in the strategy function, and ε is a preset fixed value, ε = 0.1 to 0.3, whose purpose is to control the magnitude of the strategy update. Ê_t is the mathematical expectation symbol, denoting an average over time t. r_t(θ) is the ratio of the new strategy function to the old strategy function and can be expressed as

    r_t(θ) = π_θ(a_t | s_t) / π_{θ_old}(a_t | s_t)

where the old strategy function and the new strategy function refer, within one iterative update, to the strategy function before the update and the strategy function after the update, respectively.
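The clipped surrogate objective of formula (10) can be sketched as a loss to be minimized (the negative of the objective), assuming PyTorch; ε = 0.2 follows the value used later in the algorithm flow.

```python
import torch

def ppo_clip_loss(new_log_prob, old_log_prob, advantage, eps=0.2):
    """Clipped surrogate objective of formula (10), written as a loss."""
    ratio = torch.exp(new_log_prob - old_log_prob)            # r_t(theta)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return -torch.mean(torch.min(unclipped, clipped))
```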
Here, the advantage function Â_t appearing in formula (10) can be expressed by formula (11):

    Â_t = δ_t + (γλ) δ_{t+1} + … + (γλ)^{T−t−1} δ_{T−1},
    δ_t = r_t + γ V(s_{t+1}) − V(s_t)                (11)

where γ is the decay factor, a preset fixed value, and λ is the trace parameter, also a preset fixed value; the value range of γ is (0, 1) and the value range of λ is also (0, 1). δ_t is the temporal-difference error at time t, whose specific mathematical expression is given in the second line of the above formula; δ_{T−1} is the temporal-difference error at time T−1, and T is the total duration of autonomous flight. In the examples of this specification, T is discretized and can also be called the maximum number of consecutive decision moments; T−1 then represents the last of the consecutive decision moments.
It should be noted that computing the advantage function requires all the data collected over the period from the current moment up to time t.
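A sketch of the advantage computation of formula (11) is given below; γ = 0.99 follows the text, while λ = 0.95 is an illustrative choice since the specification only requires λ ∈ (0, 1).

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Advantage estimates of formula (11).

    rewards: r_0..r_{T-1};  values: V(s_0)..V(s_T) (one extra bootstrap value).
    """
    T = len(rewards)
    adv = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error delta_t
        gae = delta + gamma * lam * gae
        adv[t] = gae
    return adv
```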
Therefore, this specification introduces deep neural networks in two places, used respectively to represent the state-action value function Q_ω(s, a) ≈ Q^π(s, a) and learn the parameter ω, and to represent the strategy function π_θ(s) = π(s) and learn the parameter θ.
Specifically, Fig. 5 shows the specific flow of the deep reinforcement learning PPO algorithm in the embodiments of this specification. As shown in Fig. 5, the PPO algorithm includes the following steps.
First, set the parameters of the deep reinforcement learning neural networks.
These parameters may specifically include: randomly initializing the parameters ω and θ; setting the total duration of autonomous flight (the maximum number of consecutive decision moments) to T; setting the numbers of iterations of the two deep neural networks to M and B, respectively; taking ε = 0.2 and γ = 0.99; and setting the total task time to N.
Initialize the first iteration-count parameter i to 0.
Execute a loop from the 1st time segment to the Nth time segment, whose initial step is to initialize the second iteration-count parameter j to 0.
Then, based on the current strategy π_θ, autonomously decide actions T consecutive times while interacting with the environment to collect the tuples {s_t, a_t, r_t}. The number of autonomous decision actions is monitored by the second iteration-count parameter j.
Next, using the collected tuples {s_t, a_t, r_t}, estimate the advantage function Â_t with a deep neural network.
Then compute the objective function L^{CLIP}(θ) and update the parameter θ by gradient descent, iterating M times.
Also compute the objective function associated with the parameter ω and update the parameter ω by gradient descent, iterating B times.
After the above two rounds of iteration, perform the operation of adding 1 to the first iteration-count parameter i.
When the first iteration-count parameter i is less than N, return to the above step 506 to realize the loop from the 1st time segment to the Nth time segment; when the first iteration-count parameter i equals N, the above procedure ends.
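Putting the pieces together, the flow of Fig. 5 can be sketched as the following training loop. It reuses the Actor, Critic, ppo_clip_loss and gae_advantages helpers sketched earlier; the environment interface (reset/step), the roll-out handling and all hyperparameter values are assumptions of this sketch rather than details fixed by the specification.

```python
import numpy as np
import torch

def train(env, actor, critic, num_episodes, T=200, M=10, B=10,
          eps=0.2, gamma=0.99, lam=0.95, lr=3e-4):
    opt_actor = torch.optim.Adam(actor.parameters(), lr=lr)
    opt_critic = torch.optim.Adam(critic.parameters(), lr=lr)
    for _ in range(num_episodes):
        states, actions, rewards, old_log_probs = [], [], [], []
        s = env.reset()
        for _ in range(T):                                    # act T times with the current strategy
            dist = actor(torch.as_tensor(s, dtype=torch.float32))
            a = dist.sample()
            states.append(np.asarray(s, dtype=np.float32))
            actions.append(a)
            old_log_probs.append(dist.log_prob(a).sum().detach())
            s, r, done = env.step(a.numpy())                  # assumed environment interface
            rewards.append(r)
            if done:
                break
        s_batch = torch.as_tensor(np.stack(states))
        a_batch = torch.stack(actions)
        logp_old = torch.stack(old_log_probs)
        with torch.no_grad():
            v = critic(s_batch)
            v = torch.cat([v, v[-1:]])                        # crude bootstrap for the last state
        adv = torch.as_tensor(gae_advantages(rewards, v.numpy(), gamma, lam),
                              dtype=torch.float32)
        returns = adv + v[:-1]
        for _ in range(M):                                    # action-network updates
            dist = actor(s_batch)
            logp = dist.log_prob(a_batch).sum(-1)
            loss_pi = ppo_clip_loss(logp, logp_old, adv, eps)
            opt_actor.zero_grad(); loss_pi.backward(); opt_actor.step()
        for _ in range(B):                                    # evaluation-network updates
            loss_v = ((critic(s_batch) - returns) ** 2).mean()
            opt_critic.zero_grad(); loss_v.backward(); opt_critic.step()
```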
The UAV trajectory optimization solution proposed in the embodiments of this specification can take the remaining energy of the UAV into account in the state values input to the reinforcement learning network and can directly output the UAV's acceleration and flight direction. In addition, when a penalty value is added to the reward function, the UAV's return time can also be output. Through online learning, this solution dynamically adjusts the learned strategy according to environmental changes and thus adapts to the environment. At the same time, this solution considers the control problem in the continuous domain, which is consistent with the continuous-domain flight control mechanism of real scenarios.
On the other hand, the PPO algorithm is the continuous-domain control algorithm with the best robustness and the most outstanding performance; it eliminates the drawback that an appropriate learning step size is difficult to determine and reduces the complexity of the algorithm.
Based on the above UAV trajectory optimization method, the embodiments of this specification further provide a UAV trajectory optimization device whose internal structure is shown in Fig. 6, including: a construction module 602, a training data collection module 604 and a training module 606. As mentioned above, the UAV trajectory optimization device can be built into the UAV or into an IoT device, or it can be a separate device that can communicate with the UAV and the IoT devices; the embodiments of this specification do not limit this.
The construction module 602 is configured to construct the deep reinforcement learning network.
The training data collection module 604 is configured to acquire the state data and action decision data of the UAV during its flight and to calculate the instantaneous energy efficiency of the UAV.
The training module 606 is configured to train the deep reinforcement learning network with the state data as input, the action decision data as output and the instantaneous energy efficiency as the reward, to optimize the strategy parameters, and to output the UAV flight strategy.
It should be noted that the construction module 602, the training data collection module 604 and the training module 606 can implement their respective specific functions through the methods described in the foregoing embodiments, which are not repeated here.
Fig. 7 shows the hardware structure of the UAV trajectory optimization device provided by an embodiment of this specification. As shown in Fig. 7, the UAV trajectory optimization device includes at least one processor 702 and a memory 704 communicatively connected with the at least one processor; the memory 704 stores instructions executable by the at least one processor 702, and the instructions are executed by the at least one processor so that the at least one processor can perform the UAV trajectory optimization method described above.
The electronic device includes one processor 702 and one memory 704, and may further include an input device and an output device. The processor, memory, input device and output device may be connected by a bus or in other ways. The memory, as a non-volatile computer-readable storage medium, can be used to store non-volatile software programs, non-volatile computer-executable programs and modules, such as the program instructions/modules corresponding to the UAV trajectory optimization method in the embodiments of this specification. The processor executes the various functional applications and data processing of the server by running the non-volatile software programs, instructions and modules stored in the memory, that is, it implements the UAV trajectory optimization method of the above method embodiments.
The memory may include a program storage area and a data storage area, where the program storage area can store the operating system and the application programs required by at least one function, and the data storage area can store data created according to the use of the UAV trajectory optimization device, and so on. In addition, the memory may include a high-speed random access memory and may also include a non-volatile memory, such as at least one magnetic disk storage device, flash memory device or other non-volatile solid-state storage device. In some embodiments, the memory may optionally include memory arranged remotely relative to the processor. Examples of the above networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks and combinations thereof.
The input device can receive input digital or character information and generate key signal inputs related to the user settings and function control of the UAV trajectory optimization device. The output device may include a display device such as a display screen. The one or more modules are stored in the memory and, when executed by the processor, perform the UAV trajectory optimization method in any of the above method embodiments. Any embodiment of the electronic device that executes the UAV trajectory optimization method can achieve the same or similar effects as any of the corresponding foregoing method embodiments.
Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, can include the processes of the above method embodiments. The storage medium can be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like. The embodiments of the computer program can achieve the same or similar effects as any of the corresponding foregoing method embodiments.
In addition, the method according to the present disclosure can also be implemented as a computer program executed by a CPU, and the computer program can be stored in a computer-readable storage medium. When the computer program is executed by the CPU, it performs the functions defined in the method of the present disclosure.
It should be noted that the above embodiments are only used to illustrate the technical solutions of this specification and not to limit them. Although this specification has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they can still modify the technical solutions described in the foregoing embodiments or make equivalent replacements of some of the technical features; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and protection scope of the technical solutions of the embodiments of this specification.
Those of ordinary skill in the art should understand that the discussion of any of the above embodiments is only exemplary and is not intended to imply that the scope of the present disclosure (including the claims) is limited to these examples. Within the spirit of this specification, the technical features of the above embodiments or of different embodiments can also be combined, the steps can be implemented in any order, and there are many other variations of the different aspects of this specification as described above which, for brevity, are not provided in detail.
In addition, to simplify the description and discussion, and so as not to obscure this specification, the well-known power/ground connections with integrated circuit (IC) chips and other components may or may not be shown in the provided drawings. Furthermore, devices may be shown in block diagram form in order to avoid obscuring this specification, which also takes into account the fact that the details of the implementation of these block-diagram devices are highly dependent on the platform on which this specification is to be implemented (i.e. these details should be fully within the understanding of those skilled in the art). Where specific details (for example, circuits) are set forth to describe exemplary embodiments of this specification, it is obvious to those skilled in the art that the technical solutions of this specification can be implemented without these specific details or with variations of these specific details. Therefore, these descriptions should be regarded as illustrative rather than restrictive.
Although this specification has been described in conjunction with specific embodiments, many substitutions, modifications and variations of these embodiments will be apparent to those of ordinary skill in the art from the foregoing description. For example, other memory architectures (for example, dynamic RAM (DRAM)) may use the discussed embodiments.
The technical solutions disclosed in this specification are stronger than the prior-art solutions using convex optimization algorithms in their ability to adapt to scenarios and environments. Because a reinforcement learning algorithm is introduced and the strategy parameters are optimized during the learning process rather than being based on a fixed objective equation, the solution has greater flexibility; moreover, the deep reinforcement learning network of this specification strengthens the interaction with the external environment by taking the environment state as input and obtaining rewards, and can therefore respond more quickly to changes in scenario and environment.
Secondly, compared with the Q-learning-based solutions in the prior art, this specification adopts a continuous-domain UAV trajectory optimization solution; the reinforcement learning outputs continuous action velocities and accelerations, which is closer to reality, makes it easy to extend the flight area, and avoids the potential problem of dimensional explosion when optimizing trajectories over a large area.
The technical solution disclosed in this specification combines deep reinforcement learning with the UAV trajectory optimization problem and, for the first time, adopts the PPO algorithm to solve this problem. Compared with optimization solutions that update with the deep deterministic policy gradient (DDPG) algorithm, the PPO algorithm is less affected by the training step size and is more adaptable when solving control problems in real scenarios; it solves the problem in the prior art that an appropriate learning step size is difficult to determine with the DDPG algorithm, and is more efficient.
In addition, this specification also considers the optimal return time for charging/refueling. By adding a penalty term to the reward function, the UAV can flexibly adjust its flight time and trajectory under the condition of returning safely, improving its energy utilization efficiency as much as possible.
The embodiments of this specification are intended to cover all such substitutions, modifications and variations that fall within the broad scope of the appended claims. Therefore, any omissions, modifications, equivalent replacements, improvements and the like made within the spirit and principles of this specification shall be included in the protection scope of this specification.

Claims (21)

  1. A UAV trajectory optimization method, comprising:
    acquiring UAV state data and action decision data during UAV flight;
    determining the instantaneous energy efficiency of the UAV according to the UAV state data and the action decision data;
    training a pre-built deep reinforcement learning network with the state data as input, the action decision data as output, and the instantaneous energy efficiency as the reward, so as to optimize the policy parameters of the deep reinforcement learning network; and
    outputting a UAV flight policy according to the trained deep reinforcement learning network.
  2. The UAV trajectory optimization method according to claim 1, wherein the method further comprises: pre-constructing a deep learning network structure comprising an action network and an evaluation network;
    wherein the action network uses a proximal policy optimization algorithm and a deep neural network to fit a UAV flight action policy function and decides the UAV flight actions; and
    the evaluation network uses a deep neural network to fit a state value function and optimizes the policy parameters in the UAV flight action policy function.
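    For illustration only, the following Python sketch shows one possible realization of the action (actor) and evaluation (critic) networks described in claim 2, using PyTorch; the layer sizes, hidden dimensions and class names are assumptions and are not taken from the application:

    import torch
    import torch.nn as nn

    class ActionNetwork(nn.Module):
        """Fits the UAV flight-action policy: maps a state to a Gaussian over (omega, theta, AC)."""
        def __init__(self, state_dim, action_dim=3, hidden=64):
            super().__init__()
            self.body = nn.Sequential(
                nn.Linear(state_dim, hidden), nn.Tanh(),
                nn.Linear(hidden, hidden), nn.Tanh(),
            )
            self.mean = nn.Linear(hidden, action_dim)              # mean of the continuous action
            self.log_std = nn.Parameter(torch.zeros(action_dim))   # learned log standard deviation

        def forward(self, state):
            h = self.body(state)
            return self.mean(h), self.log_std.exp()

    class EvaluationNetwork(nn.Module):
        """Fits the state value function V(s) used to optimize the policy parameters."""
        def __init__(self, state_dim, hidden=64):
            super().__init__()
            self.v = nn.Sequential(
                nn.Linear(state_dim, hidden), nn.Tanh(),
                nn.Linear(hidden, hidden), nn.Tanh(),
                nn.Linear(hidden, 1),
            )

        def forward(self, state):
            return self.v(state)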
  3. The UAV trajectory optimization method according to claim 1, wherein acquiring the UAV state data and action data comprises:
    determining the distance between the UAV and each Internet-of-Things (IoT) device, the data transmission rate from each IoT device to the UAV, and the remaining energy of the UAV as the state data; and
    collecting the acceleration and flight control angles of the UAV as the action data.
  4. The UAV trajectory optimization method according to claim 3, wherein acquiring the UAV state data and action data further comprises:
    quantizing the state data as
    φ(s_t) = [d_t^1, …, d_t^N, v_t^1, …, v_t^N, e_t]^T,
    where φ(s_t) denotes the state data matrix and s_t denotes the state at time t; d_t^1, …, d_t^N respectively denote the distances between the 1st to N-th IoT devices and the UAV at time t; v_t^1, …, v_t^N respectively denote the transmission rates at which the 1st to N-th IoT devices transmit information to the UAV at time t; and e_t denotes the remaining energy of the UAV at time t; and
    representing the action data as a_t = [ω_t, θ_t, AC_t]^T, where a_t denotes the action at time t; ω_t ∈ [0, 2π] denotes the horizontal flight control angle of the UAV at time t; θ_t ∈ [0, π] denotes the vertical flight control angle of the UAV at time t; and AC_t denotes the acceleration of the UAV at time t.
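    For illustration, a minimal Python (NumPy) sketch of how the state vector and action vector of claim 4 could be assembled; the device positions, rates and numeric values are hypothetical:

    import numpy as np

    def build_state(uav_pos, device_pos, rates, remaining_energy):
        """phi(s_t) = [d_t^1..d_t^N, v_t^1..v_t^N, e_t]: distances, uplink rates, remaining energy."""
        dists = np.linalg.norm(device_pos - uav_pos, axis=1)   # Euclidean distances (see claim 5)
        return np.concatenate([dists, rates, [remaining_energy]])

    def build_action(omega, theta, acc):
        """a_t = [omega_t, theta_t, AC_t]: horizontal angle, vertical angle, acceleration."""
        assert 0.0 <= omega <= 2 * np.pi and 0.0 <= theta <= np.pi
        return np.array([omega, theta, acc])

    # hypothetical example: UAV at 100 m altitude, three IoT devices on the ground
    s_t = build_state(np.array([0.0, 0.0, 100.0]),
                      np.array([[50.0, 20.0, 0.0], [-30.0, 60.0, 0.0], [10.0, -40.0, 0.0]]),
                      rates=np.array([1.2e6, 0.8e6, 2.0e6]),
                      remaining_energy=5.0e4)
    a_t = build_action(omega=np.pi / 4, theta=np.pi / 2, acc=1.5)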
  5. The UAV trajectory optimization method according to claim 4, wherein the distance between an IoT device and the UAV comprises the Euclidean distance between the IoT device and the UAV.
  6. The UAV trajectory optimization method according to claim 1, wherein determining the instantaneous energy efficiency of the UAV comprises: determining the instantaneous energy efficiency of the UAV using the following formula:
    r(s_t, a_t) = (Σ_{u=1}^{N} R_u^max(t)) / E(q(t)),
    where r(s_t, a_t) denotes the instantaneous energy efficiency of the UAV when its state at time t is s_t and its action is a_t; R_u^max(t) is the maximum transmission rate at which IoT device u transmits data to the UAV at time t; and E(q(t)) denotes the energy loss of the UAV at time t.
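    As a non-authoritative Python sketch of the energy-efficiency reward of claim 6, assuming the efficiency is taken as the aggregate maximum uplink rate divided by the instantaneous energy loss (the aggregation over devices and the numeric values below are assumptions):

    def instantaneous_energy_efficiency(max_rates_t, energy_loss_t):
        """r(s_t, a_t): achievable uplink data per unit of UAV energy spent at time t."""
        return sum(max_rates_t) / energy_loss_t  # assumed aggregation over IoT devices

    # hypothetical numbers: three devices, 200 J of energy spent in this time slot
    r_t = instantaneous_energy_efficiency([1.2e6, 0.8e6, 2.0e6], energy_loss_t=200.0)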
  7. The UAV trajectory optimization method according to claim 6, wherein the maximum transmission rate at which IoT device u transmits data to the UAV at time t is determined through the following process:
    determining the average path loss of the UAV;
    determining the signal-to-noise ratio between the UAV and IoT device u at time t according to the average path loss; and
    determining, according to the signal-to-noise ratio, the maximum transmission rate at which device u transmits data to the UAV at time t.
  8. The UAV trajectory optimization method according to claim 7, wherein determining the average path loss of the UAV comprises: determining the average path loss of the UAV through the following formula:
    L_u(t) = 20·log10(4π·f_c·d_t^u / c) + η_LoS,
    where L_u(t) denotes the average path loss of the UAV; f_c denotes the center frequency; d_t^u denotes the distance between the UAV and device u at time t; c denotes the speed of light; and η_LoS denotes the additional spatial propagation loss of the LoS link.
  9. The UAV trajectory optimization method according to claim 7, wherein determining the signal-to-noise ratio of the UAV comprises: determining the signal-to-noise ratio between the UAV and IoT device u at time t through the following formula:
    Γ_u(t) = P_u·g_u(t) / N_0,
    where Γ_u(t) denotes the signal-to-noise ratio between the UAV and IoT device u at time t; P_u denotes the uplink transmission power of device u; g_u(t) denotes the gain of the channel between the UAV and device u at time t; and N_0 is the noise power; wherein
    g_u(t) = 10^(−L_u(t)/10).
  10. The UAV trajectory optimization method according to claim 7, wherein determining the maximum transmission rate at which device u transmits data to the UAV at time t comprises: determining the maximum transmission rate at which device u transmits data to the UAV at time t through the following formula:
    R_u^max(t) = B·log2(1 + Γ_u(t)),
    where B denotes the channel bandwidth and Γ_u(t) denotes the signal-to-noise ratio between the UAV and IoT device u at time t.
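    Claims 7 to 10 describe the chain from path loss to SNR to maximum uplink rate. The following Python sketch illustrates that chain under the stated assumptions (free-space LoS path loss with an additional loss term, channel gain g = 10^(−L/10), and the Shannon rate); all parameter values are hypothetical:

    import math

    C = 3.0e8  # speed of light, m/s

    def avg_path_loss_db(f_c, dist, eta_los):
        """Average LoS path loss in dB: 20*log10(4*pi*f_c*d/c) + eta_LoS (claim 8)."""
        return 20.0 * math.log10(4.0 * math.pi * f_c * dist / C) + eta_los

    def snr(p_u, path_loss_db, n0):
        """SNR = P_u * g / N_0 with channel gain g = 10^(-L/10) (claim 9)."""
        gain = 10.0 ** (-path_loss_db / 10.0)
        return p_u * gain / n0

    def max_rate(bandwidth, snr_value):
        """Maximum uplink rate R = B * log2(1 + SNR) (claim 10)."""
        return bandwidth * math.log2(1.0 + snr_value)

    # hypothetical link: 2 GHz carrier, 120 m distance, 1 dB extra LoS loss,
    # 0.1 W uplink power, 1e-14 W noise power, 1 MHz bandwidth
    loss = avg_path_loss_db(f_c=2.0e9, dist=120.0, eta_los=1.0)
    rate = max_rate(1.0e6, snr(0.1, loss, n0=1e-14))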
  11. The UAV trajectory optimization method according to claim 6, wherein the remaining energy of the UAV is the difference between the initial total energy of the UAV and the energy loss of the UAV, the energy loss of the UAV comprising at least one of UAV flight energy loss and UAV communication energy loss.
  12. The UAV trajectory optimization method according to claim 6, wherein determining the instantaneous energy efficiency of the UAV further comprises:
    adding a penalty term of a preset value to the formula for calculating the instantaneous energy efficiency when the UAV runs out of energy on its return trip.
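    A minimal Python sketch of the penalty of claim 12: when the remaining energy would not cover the flight back, a preset negative penalty is added to the reward. The penalty value and the energy check below are assumptions for illustration:

    RETURN_PENALTY = -100.0  # preset penalty value (assumed)

    def reward_with_return_penalty(efficiency_t, remaining_energy, energy_needed_to_return):
        """Add a penalty to the instantaneous efficiency when safe return is no longer possible."""
        if remaining_energy < energy_needed_to_return:
            return efficiency_t + RETURN_PENALTY
        return efficiency_t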
  13. The UAV trajectory optimization method according to claim 1, wherein training the pre-built deep reinforcement learning network comprises:
    rewriting, with the proximal policy optimization algorithm, the objective equation of the deep reinforcement learning network as
    L^CLIP(θ) = E_t[ min( r_t(θ)·Â_t , clip(r_t(θ), 1−ε, 1+ε)·Â_t ) ],
    where θ is the policy parameter to be optimized; ε is a preset constant used to control the update amplitude of the UAV flight policy; E_t[·] denotes the expectation at time t; Â_t denotes the advantage function; clip denotes the clipping function; and r_t(θ) is the ratio between the new and old policy functions in one iterative update, which can be expressed as
    r_t(θ) = π_θ(a_t|s_t) / π_θold(a_t|s_t),
    where π_θ denotes the UAV flight policy function, π_θ(a_t|s_t) denotes the new UAV flight policy function when the state at time t is s_t and the action is a_t, and π_θold(a_t|s_t) denotes the old UAV flight policy function when the state at time t is s_t and the action is a_t;
    wherein the advantage function Â_t can be expressed by the following equations:
    Â_t = δ_t + (γλ)·δ_(t+1) + … + (γλ)^(T−t−1)·δ_(T−1),
    δ_t = r_t + γ·V(s_(t+1)) − V(s_t),
    where γ is the discount factor; λ is the trace parameter; δ_t is the temporal-difference error at time t; δ_(T−1) is the temporal-difference error at time T−1; and T is the total duration of autonomous flight; and
    finding the maximum value of the objective equation through at least one iterative update, optimizing the policy parameters in the UAV flight policy function, and outputting the policy parameters corresponding to the maximum value of the objective equation as the UAV flight policy.
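    For illustration, a minimal Python (PyTorch) sketch of the clipped objective and the advantage estimate described in claim 13; the hyper-parameter values (gamma, lam, eps) are assumptions, and the sketch is not a full training loop:

    import torch

    def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
        """A_hat_t = sum_k (gamma*lam)^k * delta_{t+k}, with delta_t = r_t + gamma*V(s_{t+1}) - V(s_t).
        `values` must contain one more entry than `rewards` (the bootstrap value)."""
        T = len(rewards)
        deltas = [rewards[t] + gamma * values[t + 1] - values[t] for t in range(T)]
        adv, running = [0.0] * T, 0.0
        for t in reversed(range(T)):
            running = deltas[t] + gamma * lam * running
            adv[t] = running
        return torch.tensor(adv)

    def ppo_clip_objective(log_probs_new, log_probs_old, advantages, eps=0.2):
        """L_CLIP(theta) = E_t[min(r_t*A_t, clip(r_t, 1-eps, 1+eps)*A_t)], r_t = pi_new/pi_old."""
        ratio = torch.exp(log_probs_new - log_probs_old)
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
        return torch.min(unclipped, clipped).mean()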
  14. The UAV trajectory optimization method according to claim 12, wherein the advantage function Â_t is obtained through optimization with a deep neural network according to the UAV state data, the UAV action decision data and the instantaneous energy efficiency of the UAV.
  15. The UAV trajectory optimization method according to claim 14, wherein obtaining the advantage function Â_t through optimization with a deep neural network according to the UAV state data, the UAV action decision data and the instantaneous energy efficiency of the UAV comprises:
    estimating the advantage function Â_t with a deep neural network based on the state data, the action decision data and the instantaneous energy efficiency of the UAV;
    calculating the function given in Fig. PCTCN2019114200-appb-100026, updating the parameter ω by gradient descent, and iterating for a predetermined number of iterations; and
    obtaining the advantage function Â_t for which the function reaches its maximum value.
  16. The UAV trajectory optimization method according to claim 1, wherein the method further comprises: determining the action decision data of the UAV according to the UAV flight policy.
  17. A UAV trajectory optimization apparatus, comprising:
    a construction module, configured to construct a deep reinforcement learning network;
    a training data collection module, configured to acquire state data and action decision data of a UAV during UAV flight, and to calculate the instantaneous energy efficiency of the UAV; and
    a training module, configured to train the deep reinforcement learning network with the state data as input, the action decision data as output, and the instantaneous energy efficiency as the reward, to optimize the policy parameters, and to output a UAV flight policy.
  18. The UAV trajectory optimization apparatus according to claim 17, wherein the construction module is configured to construct a deep learning network structure comprising an action network and an evaluation network, wherein the action network uses a proximal policy optimization algorithm and a deep neural network to fit a UAV flight action policy function and decides the UAV flight actions, and the evaluation network uses a deep neural network to fit a state value function and optimizes the policy parameters in the UAV flight action policy function.
  19. The UAV trajectory optimization apparatus according to claim 17, wherein the training data collection module is configured to determine the distance between the UAV and each IoT device, the data transmission rate from each IoT device to the UAV, and the remaining energy of the UAV as the state data, and to collect the acceleration and flight control angles of the UAV as the action decision data.
  20. A UAV trajectory optimization apparatus, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the UAV trajectory optimization method according to any one of claims 1 to 16.
  21. A computer-readable storage medium having computer instructions stored thereon, wherein the UAV trajectory optimization method according to any one of claims 1 to 16 is implemented when a processor executes the computer instructions.
PCT/CN2019/114200 2019-07-30 2019-10-30 Path optimization method and device for unmanned aerial vehicle, and storage medium WO2021017227A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910697007.6A CN110488861B (en) 2019-07-30 2019-07-30 Unmanned aerial vehicle track optimization method and device based on deep reinforcement learning and unmanned aerial vehicle
CN201910697007.6 2019-07-30

Publications (1)

Publication Number Publication Date
WO2021017227A1 true WO2021017227A1 (en) 2021-02-04

Family

ID=68548830

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/114200 WO2021017227A1 (en) 2019-07-30 2019-10-30 Path optimization method and device for unmanned aerial vehicle, and storage medium

Country Status (2)

Country Link
CN (1) CN110488861B (en)
WO (1) WO2021017227A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113268074A (en) * 2021-06-07 2021-08-17 哈尔滨工程大学 Unmanned aerial vehicle flight path planning method based on joint optimization

Families Citing this family (68)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110879595A (en) * 2019-11-29 2020-03-13 江苏徐工工程机械研究院有限公司 Unmanned mine card tracking control system and method based on deep reinforcement learning
CN110958680B (en) * 2019-12-09 2022-09-13 长江师范学院 Energy efficiency-oriented unmanned aerial vehicle cluster multi-agent deep reinforcement learning optimization method
CN111132192B (en) * 2019-12-13 2023-01-17 广东工业大学 Unmanned aerial vehicle base station online track optimization method
CN111123963B (en) * 2019-12-19 2021-06-08 南京航空航天大学 Unknown environment autonomous navigation system and method based on reinforcement learning
CN111026147B (en) * 2019-12-25 2021-01-08 北京航空航天大学 Zero overshoot unmanned aerial vehicle position control method and device based on deep reinforcement learning
CN111191728B (en) * 2019-12-31 2023-05-09 中国电子科技集团公司信息科学研究院 Deep reinforcement learning distributed training method and system based on asynchronization or synchronization
CN111314929B (en) * 2020-01-20 2023-06-09 浙江工业大学 Contract-based unmanned aerial vehicle edge cache strategy and rewarding optimization method
CN111385806B (en) * 2020-02-18 2021-10-26 清华大学 Unmanned aerial vehicle base station path planning and bandwidth resource allocation method and device
CN111263332A (en) * 2020-03-02 2020-06-09 湖北工业大学 Unmanned aerial vehicle track and power joint optimization method based on deep reinforcement learning
CN111381499B (en) * 2020-03-10 2022-09-27 东南大学 Internet-connected aircraft self-adaptive control method based on three-dimensional space radio frequency map learning
CN111565065B (en) * 2020-03-24 2021-06-04 北京邮电大学 Unmanned aerial vehicle base station deployment method and device and electronic equipment
CN111580544B (en) * 2020-03-25 2021-05-07 北京航空航天大学 Unmanned aerial vehicle target tracking control method based on reinforcement learning PPO algorithm
CN111460650B (en) * 2020-03-31 2022-11-01 北京航空航天大学 Unmanned aerial vehicle end-to-end control method based on deep reinforcement learning
CN112180967B (en) * 2020-04-26 2022-08-19 北京理工大学 Multi-unmanned aerial vehicle cooperative countermeasure decision-making method based on evaluation-execution architecture
CN111552313B (en) * 2020-04-29 2022-06-28 南京理工大学 Multi-unmanned aerial vehicle path planning method based on edge calculation dynamic task arrival
CN112198870B (en) * 2020-06-01 2022-09-02 西北工业大学 Unmanned aerial vehicle autonomous guiding maneuver decision method based on DDQN
CN111786713B (en) * 2020-06-04 2021-06-08 大连理工大学 Unmanned aerial vehicle network hovering position optimization method based on multi-agent deep reinforcement learning
CN111752304B (en) * 2020-06-23 2022-10-14 深圳清华大学研究院 Unmanned aerial vehicle data acquisition method and related equipment
CN111724001B (en) * 2020-06-29 2023-08-29 重庆大学 Aircraft detection sensor resource scheduling method based on deep reinforcement learning
CN112097783B (en) * 2020-08-14 2022-05-20 广东工业大学 Electric taxi charging navigation path planning method based on deep reinforcement learning
CN112068590A (en) * 2020-08-21 2020-12-11 广东工业大学 Unmanned aerial vehicle base station flight planning method and system, storage medium and unmanned aerial vehicle base station
CN112235810B (en) * 2020-09-17 2021-07-09 广州番禺职业技术学院 Multi-dimensional optimization method and system of unmanned aerial vehicle communication system based on reinforcement learning
CN112051863A (en) * 2020-09-25 2020-12-08 南京大学 Unmanned aerial vehicle autonomous anti-reconnaissance and enemy attack avoidance method
CN112362522B (en) * 2020-10-23 2022-08-02 浙江中烟工业有限责任公司 Tobacco leaf volume weight measuring method based on reinforcement learning
CN114527737A (en) * 2020-11-06 2022-05-24 百度在线网络技术(北京)有限公司 Speed planning method, device, equipment, medium and vehicle for automatic driving
CN112566209A (en) * 2020-11-24 2021-03-26 山西三友和智慧信息技术股份有限公司 UAV-BSs energy and service priority track design method based on double Q learning
CN112711271B (en) * 2020-12-16 2022-05-17 中山大学 Autonomous navigation unmanned aerial vehicle power optimization method based on deep reinforcement learning
CN112865855B (en) * 2021-01-04 2022-04-08 福州大学 High-efficiency wireless covert transmission method based on unmanned aerial vehicle relay
CN112819215B (en) * 2021-01-26 2024-01-12 北京百度网讯科技有限公司 Recommendation strategy training method and device, electronic equipment and readable storage medium
CN112791394B (en) * 2021-02-02 2022-09-30 腾讯科技(深圳)有限公司 Game model training method and device, electronic equipment and storage medium
CN115046433B (en) * 2021-03-09 2023-04-07 北京理工大学 Aircraft time collaborative guidance method based on deep reinforcement learning
CN113159386A (en) * 2021-03-22 2021-07-23 中国科学技术大学 Unmanned aerial vehicle return state estimation method and system
CN113050673B (en) * 2021-03-25 2021-12-28 四川大学 Three-dimensional trajectory optimization method for high-energy-efficiency unmanned aerial vehicle of auxiliary communication system
CN113115344B (en) * 2021-04-19 2021-12-14 中国人民解放军火箭军工程大学 Unmanned aerial vehicle base station communication resource allocation strategy prediction method based on noise optimization
CN113110546B (en) * 2021-04-20 2022-09-23 南京大学 Unmanned aerial vehicle autonomous flight control method based on offline reinforcement learning
CN113110550B (en) * 2021-04-23 2022-09-23 南京大学 Unmanned aerial vehicle flight control method based on reinforcement learning and network model distillation
CN113316239B (en) * 2021-05-10 2022-07-08 北京科技大学 Unmanned aerial vehicle network transmission power distribution method and device based on reinforcement learning
CN113258989B (en) * 2021-05-17 2022-06-03 东南大学 Method for obtaining relay track of unmanned aerial vehicle by using reinforcement learning
CN113110516B (en) * 2021-05-20 2023-12-22 广东工业大学 Operation planning method for limited space robot with deep reinforcement learning
CN113283169B (en) * 2021-05-24 2022-04-26 北京理工大学 Three-dimensional group exploration method based on multi-head attention asynchronous reinforcement learning
CN113255218B (en) * 2021-05-27 2022-05-31 电子科技大学 Unmanned aerial vehicle autonomous navigation and resource scheduling method of wireless self-powered communication network
CN113419548A (en) * 2021-05-28 2021-09-21 北京控制工程研究所 Spacecraft deep reinforcement learning Levier flight control system
CN113157002A (en) * 2021-05-28 2021-07-23 南开大学 Air-ground cooperative full-coverage trajectory planning method based on multiple unmanned aerial vehicles and multiple base stations
CN113543068B (en) * 2021-06-07 2024-02-02 北京邮电大学 Forest area unmanned aerial vehicle network deployment method and system based on hierarchical clustering
CN113382060B (en) * 2021-06-07 2022-03-22 北京理工大学 Unmanned aerial vehicle track optimization method and system in Internet of things data collection
CN113507717A (en) * 2021-06-08 2021-10-15 山东师范大学 Unmanned aerial vehicle track optimization method and system based on vehicle track prediction
CN113283013B (en) * 2021-06-10 2022-07-19 北京邮电大学 Multi-unmanned aerial vehicle charging and task scheduling method based on deep reinforcement learning
CN113423060B (en) * 2021-06-22 2022-05-10 广东工业大学 Online optimization method for flight route of unmanned aerial communication platform
CN113377131B (en) * 2021-06-23 2022-06-03 东南大学 Method for acquiring unmanned aerial vehicle collected data track by using reinforcement learning
CN113239639B (en) * 2021-06-29 2022-08-26 暨南大学 Policy information generation method, policy information generation device, electronic device, and storage medium
CN113359480B (en) * 2021-07-16 2022-02-01 中国人民解放军火箭军工程大学 Multi-unmanned aerial vehicle and user cooperative communication optimization method based on MAPPO algorithm
CN113776531A (en) * 2021-07-21 2021-12-10 电子科技大学长三角研究院(湖州) Multi-unmanned-aerial-vehicle autonomous navigation and task allocation algorithm of wireless self-powered communication network
CN113639755A (en) * 2021-08-20 2021-11-12 江苏科技大学苏州理工学院 Fire scene escape-rescue combined system based on deep reinforcement learning
CN113721655B (en) * 2021-08-26 2023-06-16 南京大学 Control period self-adaptive reinforcement learning unmanned aerial vehicle stable flight control method
CN114200950B (en) * 2021-10-26 2023-06-02 北京航天自动控制研究所 Flight attitude control method
CN113885549B (en) * 2021-11-23 2023-11-21 江苏科技大学 Four-rotor gesture track control method based on dimension clipping PPO algorithm
CN114142912B (en) * 2021-11-26 2023-01-06 西安电子科技大学 Resource control method for guaranteeing time coverage continuity of high-dynamic air network
CN114268986A (en) * 2021-12-14 2022-04-01 北京航空航天大学 Unmanned aerial vehicle computing unloading and charging service efficiency optimization method
CN114372612B (en) * 2021-12-16 2023-04-28 电子科技大学 Path planning and task unloading method for unmanned aerial vehicle mobile edge computing scene
CN114384931B (en) * 2021-12-23 2023-08-29 同济大学 Multi-target optimal control method and equipment for unmanned aerial vehicle based on strategy gradient
CN114550540A (en) * 2022-02-10 2022-05-27 北方天途航空技术发展(北京)有限公司 Intelligent monitoring method, device, equipment and medium for training machine
CN114785397B (en) * 2022-03-11 2023-04-07 成都三维原光通讯技术有限公司 Unmanned aerial vehicle base station control method, flight trajectory optimization model construction and training method
CN114741886B (en) * 2022-04-18 2022-11-22 中国人民解放军军事科学院战略评估咨询中心 Unmanned aerial vehicle cluster multi-task training method and system based on contribution degree evaluation
CN115202377B (en) * 2022-06-13 2023-06-09 北京理工大学 Fuzzy self-adaptive NMPC track tracking control and energy management method
CN115061371B (en) * 2022-06-20 2023-08-04 中国航空工业集团公司沈阳飞机设计研究所 Unmanned plane control strategy reinforcement learning generation method capable of preventing strategy jitter
CN115001002B (en) * 2022-08-01 2022-12-30 广东电网有限责任公司肇庆供电局 Optimal scheduling method and system for solving problem of energy storage participation peak clipping and valley filling
CN116741019A (en) * 2023-08-11 2023-09-12 成都飞航智云科技有限公司 Flight model training method and training system based on AI
CN116736729B (en) * 2023-08-14 2023-10-27 成都蓉奥科技有限公司 Method for generating perception error-resistant maneuvering strategy of air combat in line of sight

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20100016909A (en) * 2008-08-05 2010-02-16 주식회사 케이티 Apparatus and method of policy modeling based on partially observable markov decision processes
CN106019950A (en) * 2016-08-09 2016-10-12 中国科学院软件研究所 Mobile phone satellite self-adaptive attitude control method
CN108594638A (en) * 2018-03-27 2018-09-28 南京航空航天大学 The in-orbit reconstructing methods of spacecraft ACS towards the constraint of multitask multi-index optimization
CN109343341A (en) * 2018-11-21 2019-02-15 北京航天自动控制研究所 It is a kind of based on deeply study carrier rocket vertically recycle intelligent control method
CN109445456A (en) * 2018-10-15 2019-03-08 清华大学 A kind of multiple no-manned plane cluster air navigation aid
CN109639377A (en) * 2018-12-13 2019-04-16 西安电子科技大学 Dynamic spectrum resource management method based on deeply study

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101813697B1 (en) * 2015-12-22 2017-12-29 한국항공대학교산학협력단 Unmanned aerial vehicle flight control system and method using deep learning
CN106168808A (en) * 2016-08-25 2016-11-30 南京邮电大学 A kind of rotor wing unmanned aerial vehicle automatic cruising method based on degree of depth study and system thereof
CN106970615B (en) * 2017-03-21 2019-10-22 西北工业大学 A kind of real-time online paths planning method of deeply study
CN108319286B (en) * 2018-03-12 2020-09-22 西北工业大学 Unmanned aerial vehicle air combat maneuver decision method based on reinforcement learning
CN109443366B (en) * 2018-12-20 2020-08-21 北京航空航天大学 Unmanned aerial vehicle group path planning method based on improved Q learning algorithm
CN109933086B (en) * 2019-03-14 2022-08-30 天津大学 Unmanned aerial vehicle environment perception and autonomous obstacle avoidance method based on deep Q learning

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20100016909A (en) * 2008-08-05 2010-02-16 주식회사 케이티 Apparatus and method of policy modeling based on partially observable markov decision processes
CN106019950A (en) * 2016-08-09 2016-10-12 中国科学院软件研究所 Mobile phone satellite self-adaptive attitude control method
CN108594638A (en) * 2018-03-27 2018-09-28 南京航空航天大学 The in-orbit reconstructing methods of spacecraft ACS towards the constraint of multitask multi-index optimization
CN109445456A (en) * 2018-10-15 2019-03-08 清华大学 A kind of multiple no-manned plane cluster air navigation aid
CN109343341A (en) * 2018-11-21 2019-02-15 北京航天自动控制研究所 It is a kind of based on deeply study carrier rocket vertically recycle intelligent control method
CN109639377A (en) * 2018-12-13 2019-04-16 西安电子科技大学 Dynamic spectrum resource management method based on deeply study

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113268074A (en) * 2021-06-07 2021-08-17 哈尔滨工程大学 Unmanned aerial vehicle flight path planning method based on joint optimization
CN113268074B (en) * 2021-06-07 2022-05-13 哈尔滨工程大学 Unmanned aerial vehicle flight path planning method based on joint optimization

Also Published As

Publication number Publication date
CN110488861B (en) 2020-08-28
CN110488861A (en) 2019-11-22

Similar Documents

Publication Publication Date Title
WO2021017227A1 (en) Path optimization method and device for unmanned aerial vehicle, and storage medium
CN113162679B (en) DDPG algorithm-based IRS (intelligent resilient software) assisted unmanned aerial vehicle communication joint optimization method
CN113132943B (en) Task unloading scheduling and resource allocation method for vehicle-side cooperation in Internet of vehicles
CN113660681B (en) Multi-agent resource optimization method applied to unmanned aerial vehicle cluster auxiliary transmission
Apostolopoulos et al. Satisfaction-aware data offloading in surveillance systems
CN115277689A (en) Yun Bianwang network communication optimization method and system based on distributed federal learning
CN114690799A (en) Air-space-ground integrated unmanned aerial vehicle Internet of things data acquisition method based on information age
CN115499921A (en) Three-dimensional trajectory design and resource scheduling optimization method for complex unmanned aerial vehicle network
CN113406965A (en) Unmanned aerial vehicle energy consumption optimization method based on reinforcement learning
CN113382060B (en) Unmanned aerial vehicle track optimization method and system in Internet of things data collection
Yang et al. Federated imitation learning for uav swarm coordination in urban traffic monitoring
CN114268986A (en) Unmanned aerial vehicle computing unloading and charging service efficiency optimization method
Mu et al. Memory-event-triggered consensus control for multi-UAV systems against deception attacks
CN110673651B (en) Robust formation method for unmanned aerial vehicle cluster under limited communication condition
CN116321237A (en) Unmanned aerial vehicle auxiliary internet of vehicles data collection method based on deep reinforcement learning
CN114895710A (en) Control method and system for autonomous behavior of unmanned aerial vehicle cluster
CN114022731A (en) Federal learning node selection method based on DRL
Blair et al. A continuum approach for collaborative task processing in UAV MEC networks
CN116634388B (en) Electric power fusion network-oriented big data edge caching and resource scheduling method and system
CN115633320B (en) Multi-unmanned aerial vehicle assisted data acquisition and return method, system, equipment and medium
Qi et al. Edge-edge Collaboration Based Micro-service Deployment in Edge Computing Networks
Quan et al. Interpretable and Secure Trajectory Optimization for UAV-Assisted Communication
CN114745693B (en) PSO-GA hybrid algorithm-based UAV auxiliary Internet of vehicles resource allocation method
Kumar et al. Proximal Policy Optimization based computations offloading for delay optimization in UAV-assisted mobile edge computing
CN117336735A (en) Rapid phase shift optimization method for intelligent reflection surface-oriented auxiliary unmanned aerial vehicle line inspection system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19939274

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19939274

Country of ref document: EP

Kind code of ref document: A1