WO2021017227A1 - Path optimization method and device for unmanned aerial vehicle, and storage medium - Google Patents

Path optimization method and device for unmanned aerial vehicle, and storage medium

Info

Publication number
WO2021017227A1
Authority
WO
WIPO (PCT)
Prior art keywords
uav
drone
data
flight
time
Prior art date
Application number
PCT/CN2019/114200
Other languages
French (fr)
Chinese (zh)
Inventor
许文俊
徐越
吴思雷
张治
张平
林家儒
Original Assignee
北京邮电大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京邮电大学
Publication of WO2021017227A1

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course or altitude of land, water, air, or space vehicles, e.g. automatic pilot
    • G05D1/10Simultaneous control of position or course in three dimensions
    • G05D1/101Simultaneous control of position or course in three dimensions specially adapted for aircraft

Definitions

  • This specification relates to the field of wireless communication technology, and in particular to methods, devices and storage media for trajectory optimization of drones.
  • UAV communication technology is considered to be an indispensable part of the fifth generation (5G) and subsequent evolution (5G+) mobile communication networks.
  • the UAV communication system has a unique air-to-ground channel model, highly dynamic three-dimensional flight capabilities and limited flight energy, making UAV communication systems more complex than traditional communication systems.
  • Some embodiments of this specification propose a UAV trajectory optimization method, which includes: acquiring UAV state data and action decision data during the flight of the UAV; determining the instantaneous energy efficiency of the UAV according to the state data and the action decision data; training a pre-built deep reinforcement learning network with the state data as input, the action decision data as output, and the instantaneous energy efficiency as the reward, to optimize the strategy parameters of the deep reinforcement learning network; and outputting the UAV flight strategy according to the trained deep reinforcement learning network.
  • The above method may further include: pre-constructing a deep learning network structure including an action network and an evaluation network, wherein the action network uses the proximal policy optimization (PPO) algorithm and a deep neural network to fit the UAV flight action strategy function and decide the UAV flight actions, and the evaluation network uses a deep neural network to fit the state value function and optimize the strategy parameters in the UAV flight action strategy function.
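  • For illustration only, the following is a minimal PyTorch sketch of such an Actor-Critic structure; the hidden-layer sizes, the Gaussian parameterization of the continuous action, and all class and variable names are assumptions for illustration and are not specified in this patent.

```python
import torch
import torch.nn as nn

class ActionNetwork(nn.Module):
    """Action (actor) network: fits the UAV flight strategy function pi_theta.

    Outputs the mean and standard deviation of a Gaussian over the continuous
    action (horizontal angle, vertical angle, acceleration).
    """
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.mu = nn.Linear(hidden, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state):
        h = self.body(state)
        return self.mu(h), self.log_std.exp()

class EvaluationNetwork(nn.Module):
    """Evaluation (critic) network: fits the state value function V(s)."""
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.v = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state):
        return self.v(state).squeeze(-1)
```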
  • Wherein, acquiring the UAV state data and action data includes: determining the distance between the UAV and the IoT devices, the data transmission rate from the IoT devices to the UAV, and the remaining energy of the UAV as the state data; and collecting the acceleration and flight control angles of the UAV as the action data.
  • Wherein, the distance between the IoT device and the UAV includes the Euclidean distance between the IoT device and the UAV.
  • Wherein, determining the instantaneous energy efficiency of the UAV includes using the following formula: $r(s_t, a_t) = R_u^{\max}(t) / E(q(t))$, where $r(s_t, a_t)$ represents the instantaneous energy efficiency of the UAV when its state at time t is $s_t$ and its action is $a_t$; $R_u^{\max}(t)$ is the maximum rate at which IoT device u transmits data to the UAV at time t; and $E(q(t))$ represents the energy loss of the UAV at time t.
  • Wherein, the maximum rate at which IoT device u transmits data to the UAV at time t is determined by the following process: determining the average path loss of the UAV; determining the signal-to-noise ratio between the UAV and IoT device u at time t according to the average path loss; and determining the maximum rate at which device u transmits data to the UAV at time t according to the signal-to-noise ratio.
  • Wherein, determining the average path loss of the UAV includes determining it by the following formula: $L_u(t) = 20\log_{10}\!\left(\frac{4\pi f_c d_u(t)}{c}\right) + \eta_{LoS}$, where $L_u(t)$ represents the average path loss of the UAV; $f_c$ represents the center frequency; $d_u(t)$ represents the distance between the UAV and device u at time t; $c$ represents the speed of light; and $\eta_{LoS}$ represents the additional spatial propagation loss of the LoS link.
  • Wherein, determining the signal-to-noise ratio of the UAV includes determining the signal-to-noise ratio between the UAV and IoT device u at time t by the following formula: $\Gamma_u(t) = \frac{P_u\, g_u(t)}{N_0}$, where $\Gamma_u(t)$ represents the signal-to-noise ratio between the UAV and IoT device u at time t; $P_u$ represents the transmission power of the uplink of device u; $g_u(t)$ represents the gain of the channel between the UAV and device u at time t; and $N_0$ is the noise power; wherein $g_u(t) = 10^{-L_u(t)/10}$.
  • Wherein, determining the maximum rate at which device u transmits data to the UAV at time t includes determining it by the following formula: $R_u^{\max}(t) = B \log_2\!\left(1 + \Gamma_u(t)\right)$, where $B$ represents the channel bandwidth and $\Gamma_u(t)$ represents the signal-to-noise ratio between the UAV and IoT device u at time t.
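  • As a concrete illustration of this chain (average path loss, signal-to-noise ratio, maximum rate), the sketch below implements the free-space LoS model and Shannon-capacity relation described above; the numerical constants (center frequency, bandwidth, transmit power, noise power, LoS excess loss) are placeholder assumptions, not values from this specification.

```python
import math

def average_path_loss_db(d, f_c=2.4e9, eta_los_db=1.0, c=3.0e8):
    """Average LoS path loss (dB): free-space loss at distance d plus the
    additional spatial propagation loss of the LoS link (eta_los_db)."""
    return 20.0 * math.log10(4.0 * math.pi * f_c * d / c) + eta_los_db

def sinr(d, p_u=0.1, n0=1e-13):
    """Uplink SINR from IoT device u at distance d; the channel gain is
    determined only by the path loss."""
    gain = 10.0 ** (-average_path_loss_db(d) / 10.0)
    return p_u * gain / n0

def max_rate(d, bandwidth=1e6):
    """Maximum transmission rate (bit/s) from device u to the UAV (Shannon)."""
    return bandwidth * math.log2(1.0 + sinr(d))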
  • Wherein, the remaining energy of the UAV is the difference between the initial total energy of the UAV and the energy loss of the UAV; the energy loss of the UAV includes at least one of the flight energy loss and the communication energy loss of the UAV.
  • Determining the instantaneous energy efficiency of the UAV further includes: when the UAV exhausts its energy on the way back, adding a penalty term of a preset value to the formula for calculating the instantaneous energy efficiency.
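  • A hedged sketch of the resulting reward follows: it assumes the instantaneous energy efficiency is the ratio of the achieved rate to the energy loss, and uses the illustrative crash penalty of -100 mentioned later in this specification.

```python
def instantaneous_energy_efficiency(rate_bits, energy_loss, crashed=False,
                                    penalty=-100.0):
    """Reward r(s_t, a_t): transmission rate over energy loss, with a large
    negative penalty added when the UAV exhausts its energy on the way back."""
    reward = rate_bits / energy_loss
    if crashed:
        reward += penalty  # penalty term for running out of energy before returning
    return reward
```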
  • Wherein, training the pre-built deep reinforcement learning network includes rewriting the target equation of the deep reinforcement learning network, using the proximal policy optimization algorithm, as:
  • $L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta), 1-\varepsilon, 1+\varepsilon\right)\hat{A}_t\right)\right]$
  • where $\theta$ is the strategy parameter to be optimized; $\varepsilon$ is a preset constant used to control the update range of the UAV flight strategy; $\hat{\mathbb{E}}_t$ is the expected value at time t; $\hat{A}_t$ represents the advantage function; clip represents the clipping function; and $r_t(\theta)$ is the ratio of the new strategy function to the old strategy function in one iterative update, which can be expressed as:
  • $r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$
  • where $\pi_\theta$ represents the UAV flight strategy function, $\pi_\theta(a_t \mid s_t)$ represents the new UAV flight strategy function for state $s_t$ and action $a_t$ at time t, and $\pi_{\theta_{old}}(a_t \mid s_t)$ represents the old UAV flight strategy function for state $s_t$ and action $a_t$ at time t.
  • The advantage function $\hat{A}_t$ can be expressed as $\hat{A}_t = \delta_t + (\gamma\lambda)\delta_{t+1} + \cdots + (\gamma\lambda)^{T-t-1}\delta_{T-1}$, where $\gamma$ is the attenuation index; $\lambda$ is the track parameter; $\delta_t$ is the temporal-difference error at time t; $\delta_{T-1}$ is the temporal-difference error at time T-1; and T is the total duration of autonomous flight.
  • Through at least one iterative update, the maximum value of the target equation is found, the strategy parameters in the UAV flight strategy function are optimized, and the strategy parameters corresponding to the maximum of the target equation are output as the UAV flight strategy.
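  • The clipped target equation above can be sketched in PyTorch as follows (names assumed; the objective would be negated when fed to an optimizer that minimizes):

```python
import torch

def ppo_clip_objective(log_prob_new, log_prob_old, advantages, eps=0.2):
    """Clipped PPO surrogate: E_t[min(r_t(theta) * A_t,
    clip(r_t(theta), 1 - eps, 1 + eps) * A_t)]."""
    ratio = torch.exp(log_prob_new - log_prob_old)      # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return torch.min(unclipped, clipped).mean()
```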
  • Wherein, the above advantage function $\hat{A}_t$ is obtained by optimization with a deep neural network according to the UAV state data, the UAV action decision data, and the instantaneous energy efficiency of the UAV.
  • Specifically, obtaining the advantage function $\hat{A}_t$ in this way includes: using the state data, the action decision data, and the instantaneous energy efficiency of the UAV, estimating the advantage function $\hat{A}_t^{\omega}$ with a deep neural network, computing its objective function, updating the parameter ω by gradient descent, and iterating for a predetermined number of iterations; and taking the advantage function $\hat{A}_t$ for which this objective function reaches its maximum value.
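  • A minimal sketch of estimating the advantage function with the evaluation network and updating its parameter ω by gradient descent, assuming the generalized-advantage form given above; all function and variable names are illustrative assumptions.

```python
import torch

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Advantage A_t = delta_t + (gamma*lam)*delta_{t+1} + ...
    + (gamma*lam)^(T-1-t)*delta_{T-1}, with
    delta_t = r_t + gamma*V(s_{t+1}) - V(s_t)."""
    T = len(rewards)
    adv = torch.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        next_v = values[t + 1] if t + 1 < T else 0.0
        delta = rewards[t] + gamma * next_v - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

def update_critic(critic, optimizer, states, returns):
    """One gradient-descent step on the evaluation-network parameters omega."""
    loss = ((critic(states) - returns) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```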
  • the above method further includes: determining the action decision data of the drone according to the drone flight strategy.
  • Other embodiments of this specification provide a UAV trajectory optimization device, which includes:
  • a construction module, used to construct the deep reinforcement learning network;
  • a training data collection module, used to obtain the state data and action decision data of the UAV during its flight and to calculate the instantaneous energy efficiency of the UAV; and
  • a training module, used to train the deep reinforcement learning network with the state data as input, the action decision data as output, and the instantaneous energy efficiency as the reward, to optimize the strategy parameters and output the UAV flight strategy.
  • Wherein, the above construction module is used to construct a deep learning network structure including an action network and an evaluation network; the action network uses the proximal policy optimization algorithm and a deep neural network to fit the UAV flight action strategy function and decide the UAV flight actions, and the evaluation network uses a deep neural network to fit the state value function and optimize the strategy parameters in the UAV flight action strategy function.
  • Wherein, the above training data collection module is used to determine the distance between the UAV and the IoT devices, the data transmission rate from the IoT devices to the UAV, and the remaining energy of the UAV as the state data, and to collect the acceleration and flight control angles of the UAV as the action decision data.
  • Still other embodiments of this specification provide a UAV trajectory optimization device, including at least one processor and a memory communicatively connected with the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute the above-mentioned UAV trajectory optimization method.
  • Still other embodiments of this specification also provide a computer-readable storage medium on which computer instructions are stored, and the above-mentioned UAV trajectory optimization method is realized when the processor executes the above-mentioned computer instructions.
  • This specification discloses a UAV trajectory optimization method, device, and UAV based on deep reinforcement learning.
  • Deep reinforcement learning technology is introduced into UAV trajectory optimization, so that the UAV can interact with the environment in real time during flight, collecting the state data and action decision data of the current flight trajectory as training data and using the instantaneous energy efficiency as the reward function.
  • Through real-time autonomous learning, the strategy parameters that decide the flight trajectory are continuously optimized; that is, the UAV is given the ability to learn autonomously online in its environment and can adapt to changes in the dynamic environment as needed.
  • In addition, based on autonomous learning with the PPO algorithm, the UAV trajectory optimization method described in this specification also has the advantage of not being limited by the choice of learning step size.
  • The autonomous learning method based on the PPO algorithm proposed in this specification can process three-dimensional, continuous, bounded data: the input data, output data, and so on are not limited to the discrete domain, realizing flight control optimization of the UAV in three-dimensional space in the continuous domain, which is closer to real scenarios. Compared with control methods based on discrete-domain data or a limited number of solutions in a lookup table, it better matches the needs of the actual flight environment.
  • Furthermore, while the reward function is assigned the instantaneous energy efficiency of UAV flight, a penalty term is added to the reward function when the UAV cannot return home for charging/refueling. After continuous learning, the return time of the UAV can be determined, so that the UAV can return home in time to avoid losses and improve the energy efficiency of UAV flight work.
  • Figure 1 is a schematic diagram of the overall structure and data interaction of the system applied by the UAV trajectory optimization method according to some embodiments of this specification;
  • FIG. 2 is a schematic flowchart of a method for optimizing the trajectory of a drone according to some embodiments of this specification;
  • FIG. 3 is a schematic flowchart of a method for optimizing the trajectory of a drone according to some other embodiments of this specification;
  • FIG. 4 is a schematic flowchart of a specific modeling method of a deep reinforcement learning network according to some embodiments of this specification
  • Figure 5 shows the specific process of the deep reinforcement learning PPO algorithm in the embodiment of this specification
  • Figure 6 shows a schematic structural diagram of the UAV trajectory optimization device according to the embodiment of the specification.
  • Fig. 7 shows a schematic diagram of the hardware structure of the UAV trajectory optimization device according to an embodiment of this specification.
  • Deep reinforcement learning technology is a machine learning technology that combines reinforcement learning and deep neural networks. Specifically, the reinforcement learning agent collects the returns of different actions taken in different environmental states by interacting with the environment and, from the collected data, summarizes and learns the optimal behavior strategy, so as to obtain the ability to adapt to an unknown dynamic environment. Deep neural networks can significantly improve the generalization ability of the algorithm in high-dimensional state spaces and high-dimensional action spaces, thereby providing the ability to adapt to more complex environments.
  • The technical solution provided in this specification combines deep reinforcement learning technology with UAV technology: it collects the state data during UAV flight, determines the action decision data taken under that state data, and further determines the reward information. Then, based on the collected data, the best flight strategy of the UAV can be summarized and learned, so that the UAV acquires the ability to adapt to an unknown dynamic environment.
  • Figure 1 shows the overall structure and related data interaction process of the system applied in the UAV trajectory optimization method provided by some embodiments of this specification.
  • The system to which the UAV trajectory optimization method described in this specification is applied can be one in which a single UAV 102 provides services for multiple fixed IoT devices 104, and the IoT devices 104 are activated randomly or periodically to collect data and transmit it to the UAV 102.
  • the device that executes the drone trajectory optimization method may be referred to as the drone trajectory optimization device 106.
  • the above-mentioned UAV trajectory optimization device 106 may be located in the UAV 102, may also be located in the IoT device 104, or may be independent of the UAV 102 and the IoT device 104 Computing device, and can communicate with drone 102 and IoT device 104.
  • The above-mentioned UAV trajectory optimization device 106 may include a data acquisition module 1062 and a deep reinforcement learning network 1064.
  • the above-mentioned deep reinforcement learning network 1064 may adopt the deep reinforcement learning structure of the Actor-Critic framework, that is, it includes an action network 1066 and an evaluation network 1068.
  • The action network 1066 can use the proximal policy optimization (PPO) algorithm and a deep neural network to fit the UAV flight strategy function, so as to decide the flight actions of the UAV; and the evaluation network 1068 can use a deep neural network to fit the state value function and optimize the strategy parameters in the UAV flight action strategy function.
  • The UAV 102 inputs the distance between it and the IoT devices 104, the transmission rate, and the remaining energy of the UAV 102 as state data into the action network 1066 of the deep reinforcement learning network 1064; action decision data such as the acceleration and flight control angle (i.e., flight direction) of the UAV are used as the behavior output by the deep reinforcement learning network 1064, and the instantaneous energy efficiency of the UAV 102 is used as the reward.
  • the data of state input, action decision, reward and return are generated as training data for the evaluation network 1068 and the action network 1066.
  • The evaluation network uses a deep neural network to fit the state value function and provide an advantage function for the optimization of the action network; the action network uses the PPO algorithm to optimize the strategy parameters and a deep neural network to fit the strategy function.
  • the UAV 102 can adapt to the environment and obtain the optimal flight strategy, that is, the optimal flight trajectory of the UAV 102 can be obtained, thereby realizing the optimization of the UAV trajectory.
  • Fig. 2 is a schematic flowchart of a method for optimizing the trajectory of an unmanned aerial vehicle according to some embodiments of this specification.
  • the method can be executed by a drone, an Internet of Things device, or a computing device independent of the drone and the Internet of Things device.
  • the method may include the following steps:
  • Step 202 Obtain status data and action decision data during the flight of the drone.
  • the aforementioned status data may include: the distance between the drone and the Internet of Things device, the transmission rate of the Internet of Things device to the drone, and the remaining energy of the drone.
  • the aforementioned action decision data may include: the acceleration of the UAV and the flight control angle (ie, the flight direction) and so on.
  • The distance between the aforementioned UAV and the IoT device may be the Euclidean distance between the UAV and the IoT device.
  • the aforementioned action decision data can be determined based on the aforementioned state data and the current drone flight strategy.
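  • For example, with the Gaussian action network sketched earlier, an action decision could be drawn from the current policy as follows; the bound on the acceleration and the way the angles are wrapped and clamped are illustrative assumptions.

```python
import torch

def decide_action(actor, state, ac_max=5.0):
    """Sample (horizontal angle, vertical angle, acceleration) from pi_theta(.|s_t)
    and keep each component within its bounded continuous range."""
    mu, std = actor(state)
    action = torch.normal(mu, std)
    omega = action[0].remainder(2 * torch.pi)   # horizontal angle in [0, 2*pi)
    theta = action[1].clamp(0.0, torch.pi)      # vertical angle in [0, pi]
    ac = action[2].clamp(0.0, ac_max)           # acceleration magnitude
    return omega, theta, ac
```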
  • Step 204 Determine the instantaneous energy efficiency of the UAV based on the aforementioned state data and action decision data.
  • Step 206 using the state data as the input, the action decision data as the output, and the instantaneous energy efficiency as the reward, the pre-built deep reinforcement learning network is trained to optimize the policy parameters of the deep reinforcement learning network.
  • the PPO algorithm may be used to train the deep reinforcement learning network to optimize the policy parameters of the deep reinforcement learning network.
  • the above strategy parameters are parameters that can determine the UAV action strategy data. Therefore, the above strategy parameters are also parameters that can determine the flight trajectory of the UAV.
  • the above-mentioned pre-built deep reinforcement learning network may adopt the deep reinforcement learning structure of the Actor-Critic framework, which is composed of two networks, an action network and an evaluation network.
  • the action network uses the PPO algorithm and the deep neural network to fit the flight strategy function and decision-making action; while the evaluation network uses the deep neural network to fit the state value function and optimize the strategy parameters.
  • Step 208 Output the flight strategy of the drone according to the trained deep reinforcement learning network.
  • the above-mentioned steps 202 to 208 can be executed in a loop, and the strategy parameters of the deep reinforcement learning network can be iteratively updated using the collected data to obtain a continuously optimized drone flight strategy.
  • the above-mentioned UAV flight strategy is a flight strategy that maximizes energy efficiency obtained through autonomous learning.
  • The UAV can further determine its own flight parameters, that is, the acceleration and flight control angle of the UAV, based on the above flight strategy and the current state data, and can perform flight tasks according to these flight parameters.
  • Alternatively, the flight strategy can be transmitted to the UAV, and the UAV then determines its own flight parameters, that is, the acceleration and flight control angle of the UAV, according to the above flight strategy and the current state data, and can perform flight tasks according to these flight parameters.
  • Moreover, a penalty term can be added to the UAV's instantaneous energy efficiency, and the instantaneous energy efficiency with the added penalty term is used as the reward function. After continuous learning, the UAV can return home in time to avoid losses, thereby further improving the energy efficiency of UAV flight work.
  • Step 302 Initialize the reinforcement learning decision strategy and related parameters, and the deep neural network related parameters.
  • Step 304 During the flight of the drone, the drone flies autonomously and records relevant data.
  • the above-mentioned step 304 may include: the drone calculates the distance, transmission rate and remaining energy from the Internet of Things device, determines the flight parameters based on the current flight strategy, receives the data sent by the Internet of Things device, and calculates the Instantaneous energy efficiency under flight trajectory.
  • Step 306: Based on the data collected over the predetermined period of time, the evaluation network fits the state value function, calculates the advantage function, and transmits it to the action network.
  • Step 308: The parameters of the deep neural networks of the action network and the evaluation network are trained separately, and the UAV flight strategy is updated.
  • Step 310 Repeat the above steps 304-308 until the drone mission ends.
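  • As an illustration of steps 304-308, the following is a minimal sketch of the per-episode data collection; the `env` and `actor` objects and their methods are hypothetical placeholders standing in for the UAV/IoT flight environment and the action network.

```python
def run_episode(env, actor, horizon):
    """Fly autonomously for one episode, recording states, actions and
    instantaneous energy-efficiency rewards as training data (step 304)."""
    states, actions, rewards = [], [], []
    state = env.reset()
    for _ in range(horizon):
        action = actor.act(state)          # distance/rate/energy in, angles/acceleration out
        next_state, reward, done = env.step(action)
        states.append(state)
        actions.append(action)
        rewards.append(reward)
        state = next_state
        if done:                           # mission finished or energy exhausted
            break
    return states, actions, rewards
```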
  • the aforementioned UAV trajectory optimization method introduces deep reinforcement learning technology in the UAV trajectory optimization.
  • the UAV interacts with the environment in real time during the flight, and collects the status data and movement data of the current flight trajectory as training data.
  • Through real-time autonomous learning, the strategy parameters that decide the flight trajectory are continuously optimized; that is, the UAV is given the ability to learn autonomously online in its environment and to adapt to changes in the dynamic environment as needed.
  • the above-mentioned autonomous learning based on the PPO algorithm also has the advantage of not being limited to the choice of learning step size.
  • The data objects processed by the above autonomous learning method can be three-dimensional, continuous, bounded data; the input data, output data, and so on are not limited to the discrete domain, realizing flight control optimization of the UAV in three-dimensional space in the continuous domain, which is closer to real scenarios.
  • Compared with control methods based on discrete-domain data or a limited number of solutions in a lookup table, it better matches the needs of the actual flight environment.
  • The specific modeling method of the deep reinforcement learning network described in the embodiments of this specification is shown in Fig. 4, and may include:
  • Step 402 Extract the status data of the drone from the flight environment.
  • The above state data can be extracted from the environment and obtained by calculation, and can be characterized as the following three parts: i) the distance from the UAV to each IoT device; ii) the transmission rate at which each IoT device transmits information to the UAV; iii) the remaining energy of the UAV.
  • The above state data can be quantified as $\phi(s_t) = \left[d_1(t), \ldots, d_N(t), R_1(t), \ldots, R_N(t), E^{re}(t)\right]^T$ (here T denotes the transpose of the matrix), where $\phi(s_t)$ represents the state data matrix; $s_t$ represents the state at time t; $d_1(t), \ldots, d_N(t)$ represent the distances between the 1st to N-th IoT devices and the UAV at time t; $R_1(t), \ldots, R_N(t)$ represent the transmission rates at which the 1st to N-th IoT devices transmit information to the UAV at time t; $E^{re}(t)$ represents the remaining energy of the UAV at time t; and $q(t)$ represents the flight trajectory of the UAV.
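  • A small sketch of assembling this state vector for N IoT devices (NumPy assumed; the caller supplies the per-device transmission rates and the remaining energy):

```python
import numpy as np

def build_state(uav_pos, device_positions, rates, remaining_energy):
    """phi(s_t) = [d_1(t)..d_N(t), R_1(t)..R_N(t), E_re(t)]^T for N IoT devices."""
    distances = [np.linalg.norm(uav_pos - p) for p in device_positions]
    return np.array(distances + list(rates) + [remaining_energy], dtype=np.float32)
```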
  • Step 404 Obtain the action decision data of the drone.
  • the aforementioned action decision data is used to characterize the actions of the drone, and these actions are issued by the drone to control the flight trajectory.
  • The action decision data can include the following three parts: i) the horizontal flight control angle of the UAV at time t, $\omega_t \in [0, 2\pi]$; ii) the vertical flight control angle of the UAV at time t, $\theta_t \in [0, \pi]$; iii) the acceleration $AC_t$ of the UAV at time t; that is, the action can be expressed as $a_t = [\omega_t, \theta_t, AC_t]^T$.
  • The instantaneous flight velocity of the UAV can be expressed as $v(t) = \mathrm{d}q(t)/\mathrm{d}t$ and its acceleration as $a(t) = \mathrm{d}v(t)/\mathrm{d}t$, and both of these parameters are three-dimensional, continuous, and bounded.
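  • One way to interpret this action as a 3-D acceleration vector and integrate the trajectory is sketched below; the spherical-coordinate mapping of the two angles and the unit time step are assumptions for illustration, not a formula from this specification.

```python
import numpy as np

def step_kinematics(position, velocity, omega_t, theta_t, ac_t, dt=1.0):
    """Update the UAV state by one time step: the action angles give the
    direction of the acceleration vector, AC_t gives its magnitude."""
    accel = ac_t * np.array([
        np.sin(theta_t) * np.cos(omega_t),
        np.sin(theta_t) * np.sin(omega_t),
        np.cos(theta_t),
    ])
    velocity = velocity + accel * dt
    position = position + velocity * dt
    return position, velocity
```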
  • Step 406 Calculate the average path loss of the UAV.
  • the communication channel between the UAV and the Internet of Things device usually adopts an air-to-ground link in the Sub-6GHz frequency band, and line-of-sight transmission (LoS) is dominant in this wireless link.
  • The average path loss between the UAV and the ground IoT device u at time t can be expressed by the following formula (1): $L_u(t) = 20\log_{10}\!\left(\frac{4\pi f_c d_u(t)}{c}\right) + \eta_{LoS}$, where $f_c$ represents the center frequency; $d_u(t)$ represents the Euclidean distance between the UAV and device u at time t; $c$ represents the speed of light; and $\eta_{LoS}$ represents the additional spatial propagation loss of the LoS link, which is usually a constant.
  • Step 408 Calculate the signal-to-noise ratio based on the above average path loss.
  • The signal-to-noise ratio (SINR) between the UAV and IoT device u at time t can be expressed by the following formula (2): $\Gamma_u(t) = \frac{P_u\, g_u(t)}{N_0}$, where $P_u$ represents the transmission power of the uplink of device u; $g_u(t)$ represents the gain of the channel between the UAV and device u at time t; and $N_0$ is the noise power.
  • Since the channel gain is determined only by the path loss, $g_u(t) = 10^{-L_u(t)/10}$.
  • Step 410 Determine the maximum transmission rate of the device u to the UAV according to the foregoing signal-to-noise ratio.
  • The maximum transmission rate of device u to the UAV can be expressed as $R_u^{\max}(t) = B \log_2\!\left(1 + \Gamma_u(t)\right)$, where B represents the channel bandwidth, assuming that all devices have the same bandwidth.
  • the maximum transmission rate of the device u to the drone can be determined through the above steps 406-410.
  • Step 412 Calculate the energy loss of the drone.
  • the energy loss of the drone may include one or a combination of flight energy loss caused by propulsion and communication energy loss related to communication. Therefore, in an embodiment of this specification, the energy loss of the drone may be the sum of the flight energy loss and the communication energy loss.
  • The flight energy loss caused by the propulsion force keeps the UAV flying in the air and allows it to change its flight trajectory; its power is related to the flight speed and acceleration of the UAV. Therefore, the flight energy loss of the UAV at time t can be expressed as a function of the flight trajectory q(t), as given by formula (3).
  • communication energy loss includes radiation, signal processing, and other circuit losses, of which the energy loss caused by signal processing dominates.
  • The energy loss caused by signal processing has nothing to do with the flight of the UAV and is an inversely proportional function of the square of the flight time. Therefore, the communication energy loss of the UAV at time t can be expressed by formula (4), in which $E_{comp}$ is the communication-related energy loss at time t, G represents the hardware computation constant of the UAV node, D represents the number of bits of data that the UAV needs to process, and t is the time.
  • The remaining energy of the UAV can be expressed as the difference between the initial total energy of the UAV and the energy loss of the UAV. For example, suppose E(q(t)) is the energy loss of the UAV at time t and $E^{re}(t)$ represents the remaining energy of the UAV at time t; then the remaining energy of the UAV at time t is the initial total energy of the UAV before this flight minus the energy loss of the UAV at time t, namely $E^{re}(t) = E_0 - E(q(t))$, where $E_0$ is the initial total energy of the UAV before this flight.
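  • As an illustration of this bookkeeping, the sketch below assumes the simplest form consistent with the description above (communication energy growing with G and D and inversely proportional to the square of the flight time); the exact formulas (3) and (4) of the specification are not reproduced here, so the function bodies are placeholder assumptions.

```python
def communication_energy(g_const, d_bits, t):
    """Assumed form of the communication energy loss at time t: increases with the
    hardware computation constant g_const and the number of bits d_bits, and is
    inversely proportional to the square of the flight time t."""
    return g_const * d_bits / (t ** 2)

def remaining_energy(e0, flight_energy, comm_energy):
    """E_re(t) = E_0 - E(q(t)), where E(q(t)) is the sum of the flight and
    communication energy losses up to time t."""
    return e0 - (flight_energy + comm_energy)
```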
  • Step 414 Build a reward function.
  • The reward function can be defined as the instantaneous energy efficiency of the UAV, that is, the maximum transmission rate of device u to the UAV divided by the energy loss of the UAV.
  • Further, a penalty term should be added to the reward function; that is, the new reward function can be determined as the sum of the original reward function $r(s_t, a_t)$ and a penalty term $\phi$.
  • The penalty term $\phi$ can be a predetermined negative value. For example, if the UAV runs out of energy on its way back, causing the UAV to crash, the reward function value is directly set to a large negative number, such as -100. Of course, the above penalty can also be set to a positive value.
  • In that case, the new reward function can be determined as the difference between the original reward function $r(s_t, a_t)$ and the above-mentioned positive penalty term $\phi$.
  • the value of the specific penalty item can be flexibly set by those skilled in the art according to the actual scenario, and is not unique, and this specification does not list them one by one.
  • Therefore, the embodiments of this specification can further determine the return time of the UAV through continuous learning, so that the UAV can return home in time to avoid losses and improve the energy efficiency of UAV flight work.
  • the instantaneous energy efficiency of the drone can be determined through the above steps 410-414, that is, the reward function can be established.
  • Step 416 Establish a strategy function.
  • The reinforcement learning method based on the policy gradient parameterizes the policy, modeling it as a stochastic mapping $\pi_\theta: S \to P(A)$, which gives, for any state in the state set S (that is, the set of states s), the probability of taking each action in the action set A (that is, the set of actions a); $\theta \in \mathbb{R}^n$ is the strategy parameter that needs to be optimized.
  • $\mathbb{R}^n$ represents the set of n-dimensional real vectors, and the size of n is equal to the dimension of θ.
  • Step 418 Establish a target equation based on the above reward function and strategy function.
  • the state value function of state s under the strategy ⁇ ⁇ is defined as the long-term cumulative return.
  • The state value function can be expressed as the following formula (5): $V^{\pi_\theta}(s) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{k \geq 0} \gamma^{k}\, r(s_{t+k}, a_{t+k}) \,\middle|\, s_t = s\right]$, where $\gamma$ is the discount factor, with value range $\gamma \in [0, 1]$.
  • The state-action value function of action a can be defined as the following formula (6): $Q^{\pi_\theta}(s, a) = \mathbb{E}_{\pi_\theta}\!\left[r(s_t, a_t) + \gamma V^{\pi_\theta}(s_{t+1}) \,\middle|\, s_t = s, a_t = a\right]$.
  • $C_1$ and $C_2$ are the constraint conditions on the UAV flight speed and acceleration, respectively.
  • the strategy gradient method can be applied to optimize the strategy ⁇ ⁇ to maximize the target equation.
  • The gradient of the target equation with respect to the variable θ can be expressed by the following formula (9): $\nabla_\theta J(\theta) = \hat{\mathbb{E}}_t\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(Q^{\pi_\theta}(s_t, a_t) - b_t\big)\right]$, where $b_t$ is a constant baseline introduced into the reward in order to reduce the variance of the policy gradient.
  • When such a constant is introduced into the reward, the policy gradient remains unchanged but its variance decreases.
  • $b_t$ is typically chosen as the estimated value of the state value function $V^{\pi}(s_t)$.
  • The policy gradient usually has a large variance, so it changes greatly under the influence of the parameters.
  • The parameter update equation is $\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)$, where α is the update step size.
  • If the step size is inappropriate, the strategy corresponding to the updated parameters may be a worse strategy.
  • The trust region policy optimization (TRPO) algorithm improves the robustness of the algorithm by limiting the size of the policy change in each iteration.
  • the deep reinforcement learning algorithm PPO inherits the advantages of the trust region method algorithm, while the implementation method is simpler, more versatile, and has better sample complexity based on experience.
  • Step 420 Use the PPO algorithm to rewrite the above-mentioned target equation.
  • Using the PPO algorithm, the target equation can be rewritten as the following formula (10): $L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta), 1-\varepsilon, 1+\varepsilon\right)\hat{A}_t\right)\right]$
  • where θ is the parameter to be optimized in the strategy function; ε is a preset constant whose purpose is to control the update range of the strategy; $\hat{\mathbb{E}}_t$ is the mathematical expectation symbol, which means taking the average value over time t; and $\hat{A}_t$ is the advantage function.
  • $r_t(\theta)$ is the ratio of the new strategy function to the old strategy function, which can be expressed as $r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$.
  • The old strategy function and the new strategy function mean that, in one iterative update, the strategy function after the update is the new strategy function and the strategy function before the update is the old strategy function.
  • The advantage function can be written as $\hat{A}_t = \delta_t + (\gamma\lambda)\delta_{t+1} + \cdots + (\gamma\lambda)^{T-t-1}\delta_{T-1}$, with the temporal-difference error $\delta_t = r(s_t, a_t) + \gamma V(s_{t+1}) - V(s_t)$.
  • γ is the attenuation index, which is a preset fixed value; λ is the track parameter, which is also a preset fixed value; the value range of γ is (0,1), and the value range of λ is also (0,1).
  • $\delta_t$ is the temporal-difference error at time t, whose mathematical expression is given above; $\delta_{T-1}$ is the temporal-difference error at time T-1; and T is the total duration of autonomous flight.
  • Here T is discretized and can also be referred to as the maximum number of consecutive decision moments; in this way, T-1 represents the last consecutive decision moment.
  • Calculating the advantage function at moment t requires all the data collected over the period from moment t onward.
  • FIG. 5 shows the specific process of the deep reinforcement learning PPO algorithm in the embodiment of this specification.
  • The above PPO algorithm includes the steps shown in FIG. 5.
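  • Combining the pieces sketched earlier, one plausible per-iteration PPO update (collect data, estimate advantages, then take several clipped-objective gradient steps) is shown below; it reuses the helper functions from the previous sketches, `actor_log_prob` is a hypothetical helper returning log π_θ(a_t|s_t) for each recorded step, and the epoch count and clipping range are illustrative assumptions.

```python
import torch

def ppo_update(actor, critic, actor_opt, critic_opt,
               states, actions, rewards, epochs=10, eps=0.2):
    """One PPO iteration over a recorded flight segment."""
    values = critic(states).detach()
    advantages = gae_advantages(rewards, values)             # from the earlier sketch
    returns = advantages + values
    with torch.no_grad():
        old_log_prob = actor_log_prob(actor, states, actions)  # hypothetical helper

    for _ in range(epochs):
        new_log_prob = actor_log_prob(actor, states, actions)
        objective = ppo_clip_objective(new_log_prob, old_log_prob, advantages, eps)
        actor_opt.zero_grad()
        (-objective).backward()                              # maximize the clipped objective
        actor_opt.step()
        update_critic(critic, critic_opt, states, returns)   # from the earlier sketch
```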
  • The UAV trajectory optimization solution proposed in the embodiments of this specification can take the remaining energy of the UAV into consideration as part of the state input to the reinforcement learning network, and can directly output the acceleration and flight direction of the UAV.
  • Since a penalty value is added to the reward function, the return time of the UAV can also be output.
  • the program uses online learning to dynamically adjust the learned strategies according to environmental changes to adapt to the environment. At the same time, this scheme considers the control problem in the continuous domain, which is consistent with the continuous domain flight control mechanism in the actual scenario.
  • The PPO algorithm is the continuous-domain control algorithm with the best robustness and the most outstanding performance; it eliminates the difficulty of determining an appropriate learning step size and reduces the complexity of the algorithm.
  • an embodiment of this specification also provides an UAV trajectory optimization device, the internal structure of which is shown in FIG. 6, including: a construction module 602, a training data collection module 604, and a training module 606.
  • the UAV trajectory optimization device can be built into the UAV or the Internet of Things device, and it can also be a separate device that can communicate with the UAV and the Internet of Things device. The embodiment of this specification does not limit this.
  • the aforementioned construction module 602 is used to construct a deep reinforcement learning network.
  • the above-mentioned training data collection module 604 is used to obtain status data and action decision data of the drone during the flight of the drone, and calculate the instantaneous energy efficiency of the drone.
  • the training module 606 is configured to take the above-mentioned state data as input, the above-mentioned action decision data as the output, and the above-mentioned instantaneous energy efficiency as a reward return, to train the deep reinforcement learning network, optimize the strategy parameters, and output the drone flight strategy.
  • The construction module 602, training data collection module 604, and training module 606 can implement their specific functions through the methods described in the foregoing embodiments, which will not be repeated here.
  • Fig. 7 shows the hardware structure of the UAV trajectory optimization device provided by an embodiment of this specification.
  • The above-mentioned UAV trajectory optimization device includes at least one processor 702 and a memory 704 communicatively connected with the at least one processor, wherein the memory 704 stores instructions executable by the at least one processor 702, and the instructions are executed by the at least one processor so that the at least one processor can execute the UAV trajectory optimization method described above.
  • the electronic device includes a processor 702 and a memory 704, and may also include: an input device and an output device.
  • the processor, memory, input device, and output device may be connected by a bus or in other ways.
  • As a non-volatile computer-readable storage medium, the memory can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the UAV trajectory optimization method in the embodiments of this specification.
  • the processor executes various functional applications and data processing of the server by running non-volatile software programs, instructions, and modules stored in the memory, that is, realizing the UAV trajectory optimization method of the above method embodiment.
  • the memory may include a storage program area and a storage data area.
  • the storage program area can store an operating system and an application program required by at least one function; the storage data area can store data created according to the use of the drone trajectory optimization device.
  • the memory may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other non-volatile solid-state storage devices.
  • The memory may optionally include memory remotely provided with respect to the processor, and these remote memories may be connected to the UAV trajectory optimization device via a network. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
  • the input device can receive input digital or character information, and generate key signal inputs related to the user settings and function control of the drone trajectory optimization device.
  • the output device may include a display device such as a display screen.
  • the one or more modules are stored in the memory, and when executed by the processor, the drone trajectory optimization method in any of the foregoing method embodiments is executed. Any embodiment of the electronic device that executes the method for optimizing the drone trajectory can achieve the same or similar effect as any of the foregoing method embodiments.
  • All or part of the processes in the above method embodiments can be completed by a computer program instructing relevant hardware; the program can be stored in a computer-readable storage medium, and when executed, it may include the processes of the above-mentioned method embodiments.
  • the storage medium may be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), or a random access memory (Random Access Memory, RAM), etc.
  • the embodiment of the computer program can achieve the same or similar effect as any of the foregoing corresponding method embodiments.
  • the method according to the present disclosure may also be implemented as a computer program executed by a CPU, and the computer program may be stored in a computer-readable storage medium.
  • the computer program executes the above-mentioned functions defined in the method of the present disclosure.
  • The technical solution disclosed in this specification is better able to adapt to the scene and environment than conventional solutions based on convex optimization algorithms, because reinforcement learning algorithms are introduced to optimize the policy parameters during the learning process rather than relying on a fixed target equation, which gives greater flexibility; moreover, the deep reinforcement learning network in this specification strengthens its interaction with the external environment by taking environmental states as input and obtaining reward returns, so it can respond more quickly to changes in the scene and environment.
  • In addition, this specification adopts a continuous-domain UAV trajectory optimization solution.
  • The continuous speed and acceleration of the actions output by the reinforcement learning are closer to reality and make it easy to expand the flight area.
  • For trajectory optimization over a large area, there is no potential problem of dimensional explosion.
  • the technical solution disclosed in this specification integrates deep reinforcement learning and UAV trajectory optimization problems, and the PPO algorithm is used to solve this problem for the first time.
  • The PPO algorithm is less affected by the training step size and is more adaptable when solving control problems in real scenarios; it solves the difficulty of determining the learning step size when using the DDPG algorithm in the prior art, and has higher efficiency.
  • Moreover, this specification also considers the optimal return time for charging/refueling: the UAV can flexibly adjust its flight time and trajectory under the condition of returning home safely, improving its own energy efficiency as much as possible.

Abstract

A path optimization method for an unmanned aerial vehicle (102). The method comprises: obtaining status data and action decision-making data of an unmanned aerial vehicle (102) during a flight thereof (202); determining, according to the status data and the action decision-making data of the unmanned aerial vehicle (102), the instantaneous energy efficiency of the unmanned aerial vehicle (102) (204); training a pre-built deep reinforcement learning network (1064) using the status data as input, the action decision-making data as output, and the instantaneous energy efficiency as a reward, and optimizing a policy parameter of the deep reinforcement learning network (1064) (206); and outputting, according to the trained deep reinforcement learning network (1064), a flight policy of the unmanned aerial vehicle (102) (208). Further provided are a path optimization device (106) for an unmanned aerial vehicle, and a computer-readable storage medium (704).

Description

UAV trajectory optimization method, device and storage medium

This specification is based on, and claims priority to, the Chinese patent application with application number 201910697007.6 filed on July 30, 2019, the entire content of which is hereby incorporated into this specification by reference.

Technical field

This specification relates to the field of wireless communication technology, and in particular to trajectory optimization methods, devices and storage media for unmanned aerial vehicles (UAVs).

Background

UAV communication technology is considered to be an indispensable part of the fifth-generation (5G) and subsequent-evolution (5G+) mobile communication networks. However, the UAV communication system has a unique air-to-ground channel model, highly dynamic three-dimensional flight capability, and limited flight energy, making UAV communication systems more complex than traditional communication systems.

However, some existing technical solutions for optimizing UAV flight trajectories support only limited flight scenarios and flight action schemes; they struggle to cope with the dynamically changing environmental information during UAV flight and deviate from the actual flight requirements of the UAV.
Summary of the invention

Some embodiments of this specification propose a UAV trajectory optimization method, which includes: acquiring UAV state data and action decision data during the flight of the UAV; determining the instantaneous energy efficiency of the UAV according to the state data and the action decision data; training a pre-built deep reinforcement learning network with the state data as input, the action decision data as output, and the instantaneous energy efficiency as the reward, to optimize the strategy parameters of the deep reinforcement learning network; and outputting the UAV flight strategy according to the trained deep reinforcement learning network.

The above method may further include: pre-constructing a deep learning network structure including an action network and an evaluation network, wherein the action network uses the proximal policy optimization algorithm and a deep neural network to fit the UAV flight action strategy function and decide the UAV flight actions, and the evaluation network uses a deep neural network to fit the state value function and optimize the strategy parameters in the UAV flight action strategy function.

Wherein, acquiring the UAV state data and action data includes: determining the distance between the UAV and the IoT devices, the data transmission rate from the IoT devices to the UAV, and the remaining energy of the UAV as the state data; and collecting the acceleration and flight control angles of the UAV as the action data.
Wherein, acquiring the UAV state data and action data further includes: quantifying the state data as $\phi(s_t) = \left[d_1(t), \ldots, d_N(t), R_1(t), \ldots, R_N(t), E^{re}(t)\right]^T$, where $\phi(s_t)$ represents the state data matrix; $s_t$ represents the state at time t; $d_1(t), \ldots, d_N(t)$ respectively represent the distances between the 1st to N-th IoT devices and the UAV at time t; $R_1(t), \ldots, R_N(t)$ respectively represent the transmission rates at which the 1st to N-th IoT devices transmit information to the UAV at time t; and $E^{re}(t)$ represents the remaining energy of the UAV at time t. The action data is represented as $a_t = [\omega_t, \theta_t, AC_t]^T$, where $a_t$ represents the action at time t; $\omega_t \in [0, 2\pi]$ represents the horizontal flight control angle of the UAV at time t; $\theta_t \in [0, \pi]$ represents the vertical flight control angle of the UAV at time t; and $AC_t$ represents the magnitude of the acceleration of the UAV at time t, where $AC_t$ is continuous bounded data.

Wherein, the distance between the IoT device and the UAV includes the Euclidean distance between the IoT device and the UAV.
Wherein, determining the instantaneous energy efficiency of the UAV includes using the following formula: $r(s_t, a_t) = R_u^{\max}(t) / E(q(t))$, where $r(s_t, a_t)$ represents the instantaneous energy efficiency of the UAV when its state at time t is $s_t$ and its action is $a_t$; $R_u^{\max}(t)$ is the maximum rate at which IoT device u transmits data to the UAV at time t; and $E(q(t))$ represents the energy loss of the UAV at time t.
Wherein, the maximum rate at which IoT device u transmits data to the UAV at time t is determined by the following process: determining the average path loss of the UAV; determining the signal-to-noise ratio between the UAV and IoT device u at time t according to the average path loss; and determining the maximum rate at which device u transmits data to the UAV at time t according to the signal-to-noise ratio.

Wherein, determining the average path loss of the UAV includes determining it by the following formula: $L_u(t) = 20\log_{10}\!\left(\frac{4\pi f_c d_u(t)}{c}\right) + \eta_{LoS}$, where $L_u(t)$ represents the average path loss of the UAV; $f_c$ represents the center frequency; $d_u(t)$ represents the distance between the UAV and device u at time t; $c$ represents the speed of light; and $\eta_{LoS}$ represents the additional spatial propagation loss of the LoS link.
Wherein, determining the signal-to-noise ratio of the UAV includes determining the signal-to-noise ratio between the UAV and IoT device u at time t by the following formula: $\Gamma_u(t) = \frac{P_u\, g_u(t)}{N_0}$, where $\Gamma_u(t)$ represents the signal-to-noise ratio between the UAV and IoT device u at time t; $P_u$ represents the transmission power of the uplink of device u; $g_u(t)$ represents the gain of the channel between the UAV and device u at time t; and $N_0$ is the noise power; wherein $g_u(t) = 10^{-L_u(t)/10}$.
Wherein, determining the maximum rate at which device u transmits data to the UAV at time t includes determining it by the following formula: $R_u^{\max}(t) = B \log_2\!\left(1 + \Gamma_u(t)\right)$, where B represents the channel bandwidth and $\Gamma_u(t)$ represents the signal-to-noise ratio between the UAV and IoT device u at time t.

Wherein, the remaining energy of the UAV is the difference between the initial total energy of the UAV and the energy loss of the UAV; the energy loss of the UAV includes at least one of the flight energy loss and the communication energy loss of the UAV.

Determining the instantaneous energy efficiency of the UAV further includes: when the UAV exhausts its energy on the way back, adding a penalty term of a preset value to the formula for calculating the instantaneous energy efficiency.
Wherein, training the pre-built deep reinforcement learning network includes:

Using the proximal policy optimization algorithm, rewriting the target equation of the deep reinforcement learning network as $L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta), 1-\varepsilon, 1+\varepsilon\right)\hat{A}_t\right)\right]$, where θ is the strategy parameter to be optimized; ε is a preset constant used to control the update range of the UAV flight strategy; $\hat{\mathbb{E}}_t$ is the expected value at time t; $\hat{A}_t$ represents the advantage function; clip represents the clipping function; and $r_t(\theta)$ is the ratio of the new strategy function to the old strategy function in one iterative update, which can be expressed as $r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{old}}(a_t \mid s_t)}$, where $\pi_\theta$ represents the UAV flight strategy function, $\pi_\theta(a_t \mid s_t)$ represents the new UAV flight strategy function for state $s_t$ and action $a_t$ at time t, and $\pi_{\theta_{old}}(a_t \mid s_t)$ represents the old UAV flight strategy function for state $s_t$ and action $a_t$ at time t.

The advantage function $\hat{A}_t$ can be expressed by the following equation: $\hat{A}_t = \delta_t + (\gamma\lambda)\delta_{t+1} + \cdots + (\gamma\lambda)^{T-t-1}\delta_{T-1}$, where γ is the attenuation index; λ is the track parameter; $\delta_t$ is the temporal-difference error at time t; $\delta_{T-1}$ is the temporal-difference error at time T-1; and T is the total duration of autonomous flight.

Through at least one iterative update, the maximum value of the target equation is found, the strategy parameters in the UAV flight strategy function are optimized, and the strategy parameters corresponding to the maximum of the target equation are output as the UAV flight strategy.
Wherein, the above advantage function $\hat{A}_t$ is obtained by optimization with a deep neural network according to the UAV state data, the UAV action decision data, and the instantaneous energy efficiency of the UAV.

Specifically, obtaining the advantage function $\hat{A}_t$ in this way includes: using the state data, the action decision data, and the instantaneous energy efficiency of the UAV, estimating the advantage function $\hat{A}_t^{\omega}$ with a deep neural network, computing its objective function, updating the parameter ω by gradient descent, and iterating for a predetermined number of iterations; and taking the advantage function $\hat{A}_t$ for which this objective function reaches its maximum value.
上述方法进一步包括:根据所述无人机飞行策略确定无人机的动作决策数据。The above method further includes: determining the action decision data of the drone according to the drone flight strategy.
本说明书的另一些实施例提供了一种无人机轨迹优化装置,该装置包括:Other embodiments of this specification provide a UAV trajectory optimization device, which includes:
构建模块,用于构建深度强化学习网络;Building modules for building deep reinforcement learning networks;
训练数据收集模块,用于在无人机飞行过程中获取无人机的状态数据和动作决策数据,并计算无人机的瞬时能量效率;以及The training data collection module is used to obtain the status data and action decision data of the drone during the flight of the drone, and calculate the instantaneous energy efficiency of the drone; and
训练模块,用于以所述状态数据为输入、以所述动作决策数据为输出,以所述瞬时能量效率为奖励回报,对深度强化学习网络进行训练,优化策略参数,并输出无人机飞行策略。The training module is used to train the deep reinforcement learning network with the state data as input, the action decision data as the output, and the instantaneous energy efficiency as a reward, to train the deep reinforcement learning network, optimize strategy parameters, and output drone flight Strategy.
The construction module is configured to construct a deep learning network structure including an action network and an evaluation network, where the action network uses the proximal policy optimization algorithm and a deep neural network to fit the UAV flight action strategy function and decide the UAV flight actions, and the evaluation network uses a deep neural network to fit the state value function and optimize the strategy parameters in the UAV flight action strategy function.
The training data collection module is configured to determine the distance between the UAV and the IoT devices, the data transmission rate from the IoT devices to the UAV and the remaining energy of the UAV as the state data, and to collect the acceleration and flight control angles of the UAV as the action decision data.
Still other embodiments of this specification provide a UAV trajectory optimization device including at least one processor and a memory communicatively connected with the at least one processor, where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the above UAV trajectory optimization method.
Still other embodiments of this specification further provide a computer-readable storage medium storing computer instructions which, when executed by a processor, implement the above UAV trajectory optimization method.
This specification discloses a UAV trajectory optimization method and device and a UAV based on deep reinforcement learning. Deep reinforcement learning is introduced into UAV trajectory optimization, so that the UAV can interact with the environment in real time during flight, collect the state data and action decision data under the current flight trajectory as training data and, with the instantaneous energy efficiency as the reward function, learn autonomously in real time. The strategy parameters that decide the flight trajectory are thus continuously optimized; that is, the UAV is given the ability to learn autonomously online in its environment and can adapt to changes in a dynamic environment as required.
In addition, thanks to the autonomous learning based on the above PPO algorithm, the UAV trajectory optimization method described in this specification is not constrained by the choice of learning step size.
The autonomous learning method based on the PPO algorithm proposed in this specification can process three-dimensional continuous bounded data: the input data, output data and so on are not limited to a discrete domain, so flight control of the UAV is optimized in three-dimensional space over a continuous domain, which is closer to real scenarios. Compared with control methods based on discrete-domain data or on a limited number of options in a table, this better matches the requirements of the actual flight environment.
Furthermore, while the reward function is assigned the instantaneous energy efficiency of the UAV flight, a penalty term is added to the reward function when the UAV cannot return for charging/refueling. After continuous learning, the return time of the UAV can be determined, so that the UAV can return in time to avoid losses and the energy efficiency of the UAV's flight work is improved.
Description of the drawings
In order to explain the embodiments of this specification or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of this specification; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative labor.
Fig. 1 is a schematic diagram of the overall structure and data interaction of the system to which the UAV trajectory optimization method according to some embodiments of this specification is applied;
Fig. 2 is a schematic flowchart of the UAV trajectory optimization method according to some embodiments of this specification;
Fig. 3 is a schematic flowchart of the UAV trajectory optimization method according to other embodiments of this specification;
Fig. 4 is a schematic flowchart of the specific modeling method of the deep reinforcement learning network according to some embodiments of this specification;
Fig. 5 shows the specific flow of the deep reinforcement learning PPO algorithm in an embodiment of this specification;
Fig. 6 shows a schematic structural diagram of the UAV trajectory optimization device according to an embodiment of this specification; and
Fig. 7 shows a schematic diagram of the hardware structure of the UAV trajectory optimization device according to an embodiment of this specification.
Detailed description
To make the objectives, technical solutions and advantages of the embodiments of this specification clearer, the technical solutions of the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this specification. Based on the described embodiments, all other embodiments obtained by those of ordinary skill in the art without creative labor fall within the protection scope of this specification.
Deep reinforcement learning is a machine learning technique that combines reinforcement learning with deep neural networks. Specifically, a reinforcement learning agent interacts with the environment to collect the reward information of taking different actions in different environment states and, from the collected data, inductively learns the optimal behavior strategy, thereby acquiring the ability to adapt to an unknown dynamic environment. Deep neural networks can significantly improve the generalization ability of the algorithm in high-dimensional state spaces and high-dimensional action spaces, thereby providing the ability to adapt to more complex environments.
The technical solution provided in this specification combines deep reinforcement learning with UAV technology: the state data during the UAV's flight are collected, the action decision data taken under these state data are determined, and the reward information is further determined. Then, based on the collected data, the optimal flight strategy of the UAV is inductively learned, so that the UAV acquires the ability to adapt to an unknown dynamic environment.
Fig. 1 shows the overall structure of the system to which the UAV trajectory optimization method provided by some embodiments of this specification is applied, together with the related data interaction process.
As shown in Fig. 1, in this embodiment, the system to which the UAV trajectory optimization method of this specification is applied may be a single UAV 102 serving multiple fixed IoT devices 104; the IoT devices 104 are activated randomly or periodically to collect data and transmit them to the UAV 102.
In addition, in this system, the apparatus that executes the UAV trajectory optimization method may be referred to as the UAV trajectory optimization device 106. In the embodiments of this specification, the UAV trajectory optimization device 106 may be located in the UAV 102 or in an IoT device 104, or it may be a computing device independent of the UAV 102 and the IoT devices 104 that can communicate with the UAV 102 and the IoT devices 104.
As shown in Fig. 1, the UAV trajectory optimization device 106 may include a data acquisition module 1062 and a deep reinforcement learning network 1064.
Specifically, in the embodiments of this specification, the deep reinforcement learning network 1064 may adopt a deep reinforcement learning structure based on the Actor-Critic framework, that is, it includes an action network 1066 and an evaluation network 1068. The action network 1066 can use the proximal policy optimization (PPO) algorithm and a deep neural network to fit the UAV flight strategy function and thereby decide the UAV's flight actions, while the evaluation network can use a deep neural network to fit the state value function and optimize the strategy parameters in the UAV flight action strategy function.
As one implementation, the UAV 102 inputs its distances to the IoT devices 104, the transmission rates and its remaining energy as state data into the action network 1066 of the deep reinforcement learning network 1064; action decision data such as the UAV's acceleration and flight control angles (i.e. the flight direction) are taken as the behavior output by the deep reinforcement learning network 1064, and the instantaneous energy efficiency of the UAV 102 is used as the reward. By continuously interacting with the environment, data of state inputs, action decisions and rewards are generated as training data for the evaluation network 1068 and the action network 1066. The evaluation network uses a deep neural network to fit the state value function and provides the advantage function for optimizing the action network; the action network uses the PPO algorithm to optimize the strategy parameters and a deep neural network to fit the strategy function. After multiple rounds of iterative updating, the UAV 102 can adapt to the environment and obtain the optimal flight strategy, that is, the optimal flight trajectory of the UAV 102, thereby realizing UAV trajectory optimization.
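To make the Actor-Critic structure described above more concrete, the following is a minimal sketch of the two networks. PyTorch is assumed, and the hidden-layer sizes, the Gaussian policy head and the tanh activations are illustrative assumptions rather than details given in this specification.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Action network: maps the state phi(s_t) to a distribution over
    (horizontal angle, vertical angle, acceleration magnitude)."""
    def __init__(self, state_dim, action_dim=3, hidden=128):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.mu = nn.Linear(hidden, action_dim)           # mean of the Gaussian policy
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state):
        h = self.body(state)
        return torch.distributions.Normal(self.mu(h), self.log_std.exp())

class Critic(nn.Module):
    """Evaluation network: fits the state value function V(s_t)."""
    def __init__(self, state_dim, hidden=128):
        super().__init__()
        self.v = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state):
        return self.v(state).squeeze(-1)
```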
The specific implementation of the UAV trajectory optimization method is described in detail below with specific examples. Fig. 2 is a schematic flowchart of a UAV trajectory optimization method provided by some embodiments of this specification. The method may be executed by the UAV, by an IoT device, or by a computing device independent of the UAV and the IoT devices. As shown in Fig. 2, the method may include the following steps.
Step 202: acquire the state data and action decision data during the UAV's flight.
In the implementation of this specification, the state data may include: the distances between the UAV and the IoT devices, the transmission rates at which the IoT devices transmit data to the UAV, and the remaining energy of the UAV. The action decision data may include: the acceleration of the UAV, the flight control angles (i.e. the flight direction), and so on.
Specifically, the distance between the UAV and an IoT device may be the Euclidean distance between them.
In the implementation of this specification, the action decision data may be determined according to the state data and the current UAV flight strategy.
Step 204: determine the instantaneous energy efficiency of the UAV according to the state data and the action decision data.
The method for determining the instantaneous energy efficiency of the UAV will be described in detail in later embodiments and is not elaborated here.
Step 206: with the state data as input, the action decision data as output and the instantaneous energy efficiency as the reward, train the pre-built deep reinforcement learning network and optimize the strategy parameters of the deep reinforcement learning network.
In the embodiments of this specification, the PPO algorithm may be used to train the deep reinforcement learning network so as to optimize its strategy parameters. The strategy parameters are the parameters that determine the UAV's action strategy data and are therefore also the parameters that determine the UAV's flight trajectory.
As mentioned above, in some embodiments of this specification, the pre-built deep reinforcement learning network may adopt a deep reinforcement learning structure based on the Actor-Critic framework, consisting of an action network and an evaluation network. The action network uses the PPO algorithm and a deep neural network to fit the flight strategy function and decide actions, while the evaluation network uses a deep neural network to fit the state value function and optimize the strategy parameters.
Step 208: output the flight strategy of the UAV according to the trained deep reinforcement learning network.
In the embodiments of this specification, the above steps 202 to 208 may be executed in a loop, and the strategy parameters of the deep reinforcement learning network are iteratively updated with the collected data to obtain a continuously optimized UAV flight strategy.
In the embodiments of this specification, through the above continuous optimization process, the UAV flight strategy is the flight strategy, obtained by autonomous learning, that maximizes the energy efficiency.
If the above UAV trajectory optimization method is executed by the UAV, then after obtaining the flight strategy the UAV can further determine its own flight parameters, namely the UAV's acceleration and flight control angles, according to the flight strategy and the current state data, and can perform the flight task according to these flight parameters.
If the above UAV trajectory optimization method is executed by a device other than the UAV, then after obtaining the flight strategy the flight strategy can be transmitted to the UAV, and the UAV further determines its own flight parameters, namely its acceleration and flight control angles, according to the flight strategy and the current state data, and can perform the flight task according to these flight parameters.
Furthermore, in the embodiments of this specification, when the UAV cannot return for charging/refueling, a penalty term can also be added to the instantaneous energy efficiency of the UAV, and the instantaneous energy efficiency with the added penalty term is used as the reward function. After continuous learning, this enables the UAV to return in time to avoid losses, thereby further improving the energy efficiency of the UAV's flight work.
As one implementation, an example of the process of the UAV trajectory optimization method provided by some embodiments of this specification may be as shown in Fig. 3 and includes the following steps.
Step 302: initialize the reinforcement learning decision strategy and its related parameters, as well as the parameters related to the deep neural networks.
The specific method for determining the reinforcement learning decision strategy and the related parameters will be described in detail later.
Step 304: during the flight, the UAV flies autonomously and records the relevant data.
In the embodiments of this specification, step 304 may include: the UAV calculates the distances to the IoT devices, the transmission rates and the remaining energy, decides the flight parameters based on the current flight strategy, receives the data sent by the IoT devices, and calculates the instantaneous energy efficiency under the current flight trajectory.
Step 306: with the data collected over the preset period of time, the evaluation network fits the state value function, calculates the advantage function and passes it to the action network.
Step 308: train the parameters of the deep neural networks of the action network and the evaluation network respectively, and update the UAV flight strategy.
Step 310: repeat the above steps 304 to 308 until the UAV mission ends.
The above UAV trajectory optimization method introduces deep reinforcement learning into UAV trajectory optimization. In this way, the UAV interacts with the environment in real time during flight, collects the state data and action data under the current flight trajectory as training data and, with the instantaneous energy efficiency as the reward function, learns autonomously in real time, realizing continuous optimization of the strategy parameters that decide the flight trajectory. That is, the UAV is given the ability to learn autonomously online in its environment and can adapt to changes in a dynamic environment as required.
In addition, the above autonomous learning based on the PPO algorithm has the advantage of not being constrained by the choice of learning step size.
Further, the data objects processed by the above autonomous learning method can be three-dimensional continuous bounded data: the input data, output data and so on are not limited to a discrete domain, so flight control of the UAV is optimized in three-dimensional space over a continuous domain, which is closer to real scenarios. Compared with control methods based on discrete-domain data or on a limited number of options in a table, this better matches the requirements of the actual flight environment.
The UAV communication modeling method used in this specification and the energy-efficient UAV trajectory optimization method based on deep reinforcement learning are further explained in detail below with specific examples.
In the deep reinforcement learning network model for UAV trajectory optimization established in this embodiment, consider a scenario in which one UAV provides delay-tolerant services for N ground IoT devices; the IoT devices are randomly distributed and fixed in position, and collect data periodically or randomly and transmit them to the UAV. The goal is to optimize the UAV's flight trajectory and maximize the cumulative energy efficiency under the condition of limited energy. To accomplish this goal, the UAV should be able to monitor its remaining energy and decide the optimal time to return for charging/refueling.
The specific modeling method of the deep reinforcement learning network described in the embodiments of this specification is shown in Fig. 4 and may include the following steps.
Step 402: extract the state data of the UAV from the flight environment.
In the embodiments of this specification, the state data can be extracted from the environment and obtained by calculation, and can be characterized as the following three parts: i) the distance from the UAV to each IoT device; ii) the transmission rate at which each IoT device transmits information to the UAV; iii) the remaining energy of the UAV.
Further, in the embodiments of this specification, the state data can be quantified as

    φ(s_t) = [d_t^1, ..., d_t^N, R_t^1, ..., R_t^N, E_t^rem]^T

(here "T" denotes the transpose of the matrix), where φ(s_t) is the state data matrix; s_t is the state at time t; d_t^1, ..., d_t^N are the distances between the 1st to Nth IoT devices and the UAV at time t; R_t^1, ..., R_t^N are the transmission rates at which the 1st to Nth IoT devices transmit information to the UAV at time t; E_t^rem is the remaining energy of the UAV at time t; and q(t) denotes the flight trajectory of the UAV.
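The assembly of the state data into the vector φ(s_t) described above can be sketched as follows; the array layout and the NumPy-based helper are assumptions of this sketch, not part of the specification.

```python
import numpy as np

def build_state(uav_pos, device_pos, rates, remaining_energy):
    """Assemble phi(s_t) = [d_t^1..d_t^N, R_t^1..R_t^N, E_t^rem]^T.

    uav_pos:          (3,) UAV position at time t
    device_pos:       (N, 3) fixed IoT device positions
    rates:            (N,) current uplink rates of the N devices
    remaining_energy: scalar, energy left on board at time t
    """
    distances = np.linalg.norm(device_pos - uav_pos, axis=1)   # Euclidean distances
    return np.concatenate([distances, rates, [remaining_energy]])
```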
Step 404: acquire the action decision data of the UAV.
In the embodiments of this specification, the action decision data are used to characterize the actions of the UAV; these actions are issued by the UAV to control the flight trajectory. The action decision data can generally include the following parts: i) the horizontal flight control angle of the UAV at time t, ω_t ∈ [0, 2π]; ii) the vertical flight control angle of the UAV at time t, θ_t ∈ [0, π]; iii) the magnitude of the acceleration of the UAV at time t, AC_t.
Further, in the embodiments of this specification, the action decision data can be quantified as a_t = [ω_t, θ_t, AC_t]^T (here "T" denotes the transpose of the matrix).
It should be noted that, in the embodiments of this specification, the instantaneous flight velocity of the UAV can be expressed as v(t) = dq(t)/dt and the acceleration of the UAV as AC(t) = dv(t)/dt, and both of these quantities are three-dimensional, continuous and bounded.
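The specification gives the action as the two flight control angles and the acceleration magnitude but does not spell out how they map to a three-dimensional acceleration vector; the spherical-coordinate decomposition below is one plausible, purely illustrative reading.

```python
import numpy as np

def action_to_acceleration(omega, theta, ac):
    """Map (omega, theta, AC) to a 3-D acceleration vector.

    omega in [0, 2*pi] is the horizontal control angle, theta in [0, pi] the
    vertical control angle and ac the acceleration magnitude; the
    spherical-coordinate mapping used here is an assumption of this sketch.
    """
    ax = ac * np.sin(theta) * np.cos(omega)
    ay = ac * np.sin(theta) * np.sin(omega)
    az = ac * np.cos(theta)
    return np.array([ax, ay, az])
```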
Step 406: calculate the average path loss of the UAV.
In the embodiments of this specification, the communication channel between the UAV and the IoT devices usually adopts an air-to-ground link in the Sub-6 GHz band, in which line-of-sight (LoS) transmission is dominant. In this case, the average path loss between the UAV and the ground IoT device u at time t can be expressed, in the standard free-space LoS form, by the following formula (1):

    L_u(t) = 20 log10( 4π f_c d_u(t) / c ) + η_LoS                (1)

where f_c is the center frequency, d_u(t) is the Euclidean distance between the UAV and device u at time t, c is the speed of light, and η_LoS is the additional spatial propagation loss of the LoS link, which is usually a constant.
Step 408: calculate the signal-to-noise ratio according to the above average path loss.
In the embodiments of this specification, the signal-to-noise ratio (SINR) between the UAV and the IoT device u at time t can be expressed by the following formula (2):

    SINR_u(t) = P_u g_u(t) / N_0                (2)

where P_u is the transmission power of the uplink of device u, g_u(t) is the gain of the channel between the UAV and device u at time t, and N_0 is the noise power.
Assuming that the transmission power and noise power of all devices are the same, the channel gain is determined only by the path loss, so g_u(t) = 10^(−L_u(t)/10).
Step 410: determine, according to the above signal-to-noise ratio, the maximum rate at which device u transmits to the UAV.
In the embodiments of this specification, assuming that the Doppler effect caused by the movement of the UAV can be perfectly compensated by existing techniques such as phase-locked loops, the maximum rate at which device u transmits to the UAV can be expressed as R_u(t) = B log2(1 + SINR_u(t)), where B is the channel bandwidth, assumed to be the same for all devices.
That is, in the embodiments of this specification, the maximum rate at which device u transmits to the UAV can be determined through the above steps 406-410.
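Under the reconstruction of formulas (1) and (2) given above, the per-device uplink rate can be sketched as follows; all numerical values (carrier frequency, LoS excess loss, transmit power, noise power, bandwidth) are placeholder assumptions, since the specification does not fix them.

```python
import numpy as np

def uplink_rate(distance_m, f_c=2.4e9, eta_los_db=1.0, p_u=0.1,
                n0=1e-13, bandwidth=1e6):
    """Rate of one IoT device under the LoS model of formulas (1)-(2).

    All numeric defaults are illustrative placeholder values.
    """
    c = 3e8                                                    # speed of light (m/s)
    path_loss_db = 20 * np.log10(4 * np.pi * f_c * distance_m / c) + eta_los_db
    gain = 10 ** (-path_loss_db / 10)                          # channel gain from path loss
    sinr = p_u * gain / n0                                     # formula (2), no interference term
    return bandwidth * np.log2(1 + sinr)                       # Shannon rate
```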
Step 412: calculate the energy loss of the UAV.
In the embodiments of this specification, the energy loss of the UAV can include one or a combination of the flight energy loss caused by propulsion and the communication-related energy loss. Therefore, in one embodiment of this specification, the UAV energy loss can be the sum of the flight energy loss and the communication energy loss.
The flight energy loss caused by propulsion allows the UAV to remain airborne and to change its flight trajectory; its power is related to the UAV's flight speed and acceleration. Therefore, the flight energy loss of the UAV at time t can be expressed as a function of the flight trajectory q(t), given by formula (3), in which ||v(t)|| is the instantaneous speed of the UAV, AC(t) is the acceleration of the UAV (AC^T(t) denoting its transpose, "T" here being the transpose symbol), and c_1 and c_2 are two constants related to the physical properties of the UAV itself, such as the number of wings and the weight.
In addition, the communication energy loss includes radiation, signal processing and other circuit losses, among which the energy loss caused by signal processing is dominant. The energy loss caused by signal processing is independent of the UAV's flight and is an inversely proportional function of the square of the flight time. Therefore, the communication energy loss of the UAV up to time t can be expressed by formula (4) as E_comp, which depends on the hardware computing constant G of the UAV node, the number of bits of data D that the UAV needs to process, and the time t.
In the embodiments of this specification, the remaining energy of the UAV can be expressed as the difference between the UAV's initial total energy and the UAV's energy loss. For example, let E(q(t)) be the energy loss of the UAV at time t and E_t^rem the remaining energy of the UAV at time t; then the remaining energy of the UAV at time t is the initial total energy of the UAV before this flight minus the energy loss of the UAV at time t, that is

    E_t^rem = E_0 − E(q(t))

where E_0 is the initial total energy of the UAV before this flight.
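A minimal sketch of the corresponding energy bookkeeping is given below; the discrete time step and the assumption that the propulsion and signal-processing powers are computed elsewhere (per formulas (3) and (4)) are simplifications of this sketch.

```python
def update_remaining_energy(e_rem, flight_power, comm_power, dt):
    """One discrete-time step of E_t^rem = E_0 - E(q(t)).

    flight_power is the propulsion power behind formula (3) (a function of
    the instantaneous speed and acceleration) and comm_power the
    signal-processing power behind formula (4); both are assumed to be
    computed elsewhere, and the step size dt is an assumption of this sketch.
    """
    return e_rem - (flight_power + comm_power) * dt
```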
Step 414: establish the reward function.
In the embodiments of this specification, the reward function can be defined as the instantaneous energy efficiency of the UAV, that is, the ratio of the maximum rate R_u(t) at which device u transmits to the UAV to the instantaneous energy loss E(q(t)) of the UAV:

    r(s_t, a_t) = R_u(t) / E(q(t))

In addition, in other embodiments of this specification, since the algorithm needs to decide automatically the UAV's return time for charging/refueling, a penalty term should be added to the reward function when the UAV's energy is exhausted on the way back. That is, the new reward function can be determined as the sum of the original reward function r(s_t, a_t) and a penalty term ∈: r'(s_t, a_t) = r(s_t, a_t) + ∈. Usually in this case the penalty term ∈ can be a predetermined negative value. For example, if the UAV runs out of energy on the way back and crashes, the value of the reward function is directly set to a large negative number, such as -100. Of course, the penalty term can also be set to a positive value; in that case, the new reward function can be determined as the difference between the original reward function r(s_t, a_t) and the positive penalty term ∈: r'(s_t, a_t) = r(s_t, a_t) − ∈. The specific value of the penalty term can be flexibly set by those skilled in the art according to the actual scenario and is not unique; this specification does not list the possibilities one by one. By introducing a penalty term into the above reward function, the embodiments of this specification can further determine, through continuous learning, the return time of the UAV, so that the UAV can return in time to avoid losses and the energy efficiency of the UAV's flight work is improved.
In this way, through the above steps 410-414 the instantaneous energy efficiency of the UAV can be determined, that is, the reward function is established.
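A sketch of the reward described in step 414, including the crash penalty variant, could look like the following; the -100 value follows the example given in the text, while the boolean crash flag is an assumption of this sketch.

```python
def reward(rate, instantaneous_energy_loss, crashed, penalty=-100.0):
    """Reward r(s_t, a_t): the instantaneous energy efficiency, i.e. the
    maximum uplink rate divided by the instantaneous energy loss.  If the UAV
    runs out of energy before returning (crashed), a penalty term is added;
    -100 is the example value mentioned in the text."""
    r = rate / instantaneous_energy_loss
    if crashed:
        r += penalty
    return r
```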
Step 416: establish the strategy function.
The reinforcement learning method based on the policy gradient parameterizes the strategy and models it as a stochastic function, namely π_θ: S → P(A), which represents the probability of taking the actions in the action set A (i.e. the set of actions a) in any state in the state set S (i.e. the set of states s); θ ∈ R^n is the strategy parameter that needs to be optimized, R^n denotes the set of n-dimensional real vectors, and n equals the dimension of θ.
Step 418: establish the objective equation based on the above reward function and strategy function.
In reinforcement learning, the state value function of a state s under the strategy π_θ is defined as the long-term accumulated return. When the state is s and the strategy is π_θ, the state value function can be expressed in the form of formula (5):

    V^{π_θ}(s) = E[ Σ_{t≥0} γ^t r(s_t, a_t) | s_0 = s; π_θ ]                (5)

where γ is the discount factor with value range γ ∈ [0, 1]. Similarly, under the strategy π_θ, the state-action value function of an action a can be defined in the form of formula (6):

    Q^{π_θ}(s, a) = E[ Σ_{t≥0} γ^t r(s_t, a_t) | s_0 = s, a_0 = a; π_θ ]                (6)

Thus, the objective equation of reinforcement learning can be expressed by formula (7):

    J(θ) = E_{s∼ρ^{π_θ}, a∼π_θ}[ r(s, a) ]                (7)

where ρ^{π_θ} is the discounted state visitation probability distribution under the strategy π_θ.
Therefore, the UAV trajectory optimization problem based on reinforcement learning can finally be simplified to formula (8):

    max_θ J(θ),   subject to C1 and C2                (8)

where C1 and C2 are the constraints on the UAV's flight speed and acceleration, respectively.
In the embodiments of this specification, the policy gradient method can be applied to optimize the strategy π_θ so as to maximize the objective equation. The gradient of the objective equation with respect to the variable θ can be expressed by formula (9):

    ∇_θ J(θ) = E[ ∇_θ log π_θ(a_t | s_t) · (R_t − b_t) ]                (9)

where b_t is a constant baseline introduced into the return in order to reduce the variance of the policy gradient; introducing such a constant leaves the policy gradient unchanged while reducing its variance. In particular, b_t is usually chosen as an estimate of the state value function V_θ(s_t), and R_t − b_t can then be regarded as an estimate of the advantage function A(a_t, s_t) = Q(a_t, s_t) − V(s_t).
When the policy gradient algorithm is used, the policy gradient usually has a large variance and is therefore strongly affected by the parameters. Moreover, according to the policy gradient algorithm, the parameter update equation is θ ← θ + α ∇_θ J(θ), where α is the update step size; when the step size is inappropriate, the strategy corresponding to the updated parameters will be a worse strategy.
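As a reference point for formula (9), a policy-gradient-with-baseline loss can be sketched as follows (PyTorch assumed); minimizing it with automatic differentiation yields the gradient estimate of formula (9).

```python
import torch

def vanilla_pg_loss(log_probs, returns, baselines):
    """REINFORCE-with-baseline estimate of formula (9): the gradient of
    J(theta) is E[grad log pi_theta(a_t|s_t) * (R_t - b_t)].  Minimizing this
    loss with autograd produces that gradient."""
    advantages = returns - baselines                  # R_t - b_t, an estimate of A(a_t, s_t)
    return -(log_probs * advantages.detach()).mean()
```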
The trust-region method TRPO (trust region policy optimization) improves the robustness of the algorithm by limiting the size of the policy change in each iteration. The deep reinforcement learning algorithm PPO inherits the advantages of the trust-region methods while being simpler to implement, more general, and empirically having better sample complexity.
Step 420: rewrite the above objective equation using the PPO algorithm.
In the embodiments of this specification, using the PPO algorithm the objective equation can be rewritten as formula (10):

    L^{CLIP}(θ) = Ê_t[ min( r_t(θ) Â_t , clip(r_t(θ), 1 − ε, 1 + ε) Â_t ) ]                (10)

where θ is the parameter to be optimized in the strategy function, and ε is a preset fixed value, ε = 0.1 to 0.3, whose purpose is to control the magnitude of the strategy update. Ê_t is the mathematical expectation symbol, denoting an average over time t. r_t(θ) is the ratio of the new strategy function to the old strategy function and can be expressed as

    r_t(θ) = π_θ(a_t | s_t) / π_{θ_old}(a_t | s_t)

where the old strategy function and the new strategy function refer, within one iterative update, to the strategy function before the update and the strategy function after the update, respectively.
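The clipped surrogate objective of formula (10) can be sketched as a loss to be minimized (the negative of the objective), assuming PyTorch; ε = 0.2 follows the value used later in the algorithm flow.

```python
import torch

def ppo_clip_loss(new_log_prob, old_log_prob, advantage, eps=0.2):
    """Clipped surrogate objective of formula (10), written as a loss."""
    ratio = torch.exp(new_log_prob - old_log_prob)            # r_t(theta)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return -torch.mean(torch.min(unclipped, clipped))
```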
Here, the advantage function Â_t appearing in formula (10) can be expressed by formula (11):

    Â_t = δ_t + (γλ) δ_{t+1} + … + (γλ)^{T−t−1} δ_{T−1},
    δ_t = r_t + γ V(s_{t+1}) − V(s_t)                (11)

where γ is the decay factor, a preset fixed value, and λ is the trace parameter, also a preset fixed value; the value range of γ is (0, 1) and the value range of λ is also (0, 1). δ_t is the temporal-difference error at time t, whose specific mathematical expression is given in the second line of the above formula; δ_{T−1} is the temporal-difference error at time T−1, and T is the total duration of autonomous flight. In the examples of this specification, T is discretized and can also be called the maximum number of consecutive decision moments; T−1 then represents the last of the consecutive decision moments.
It should be noted that computing the advantage function requires all the data collected over the period from the current moment up to time t.
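A sketch of the advantage computation of formula (11) is given below; γ = 0.99 follows the text, while λ = 0.95 is an illustrative choice since the specification only requires λ ∈ (0, 1).

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Advantage estimates of formula (11).

    rewards: r_0..r_{T-1};  values: V(s_0)..V(s_T) (one extra bootstrap value).
    """
    T = len(rewards)
    adv = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error delta_t
        gae = delta + gamma * lam * gae
        adv[t] = gae
    return adv
```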
Therefore, this specification introduces deep neural networks in two places, used respectively to represent the state-action value function Q_ω(s, a) ≈ Q^π(s, a) and learn the parameter ω, and to represent the strategy function π_θ(s) = π(s) and learn the parameter θ.
Specifically, Fig. 5 shows the specific flow of the deep reinforcement learning PPO algorithm in the embodiments of this specification. As shown in Fig. 5, the PPO algorithm includes the following steps.
First, set the parameters of the deep reinforcement learning neural networks.
These parameters may specifically include: randomly initializing the parameters ω and θ; setting the total duration of autonomous flight (the maximum number of consecutive decision moments) to T; setting the numbers of iterations of the two deep neural networks to M and B, respectively; taking ε = 0.2 and γ = 0.99; and setting the total task time to N.
Initialize the first iteration-count parameter i to 0.
Execute a loop from the 1st time segment to the Nth time segment, whose initial step is to initialize the second iteration-count parameter j to 0.
Then, based on the current strategy π_θ, autonomously decide actions T consecutive times while interacting with the environment to collect the tuples {s_t, a_t, r_t}. The number of autonomous decision actions is monitored by the second iteration-count parameter j.
Next, using the collected tuples {s_t, a_t, r_t}, estimate the advantage function Â_t with a deep neural network.
Then compute the objective function L^{CLIP}(θ) and update the parameter θ by gradient descent, iterating M times.
Also compute the objective function associated with the parameter ω and update the parameter ω by gradient descent, iterating B times.
After the above two rounds of iteration, perform the operation of adding 1 to the first iteration-count parameter i.
When the first iteration-count parameter i is less than N, return to the above step 506 to realize the loop from the 1st time segment to the Nth time segment; when the first iteration-count parameter i equals N, the above procedure ends.
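Putting the pieces together, the flow of Fig. 5 can be sketched as the following training loop. It reuses the Actor, Critic, ppo_clip_loss and gae_advantages helpers sketched earlier; the environment interface (reset/step), the roll-out handling and all hyperparameter values are assumptions of this sketch rather than details fixed by the specification.

```python
import numpy as np
import torch

def train(env, actor, critic, num_episodes, T=200, M=10, B=10,
          eps=0.2, gamma=0.99, lam=0.95, lr=3e-4):
    opt_actor = torch.optim.Adam(actor.parameters(), lr=lr)
    opt_critic = torch.optim.Adam(critic.parameters(), lr=lr)
    for _ in range(num_episodes):
        states, actions, rewards, old_log_probs = [], [], [], []
        s = env.reset()
        for _ in range(T):                                    # act T times with the current strategy
            dist = actor(torch.as_tensor(s, dtype=torch.float32))
            a = dist.sample()
            states.append(np.asarray(s, dtype=np.float32))
            actions.append(a)
            old_log_probs.append(dist.log_prob(a).sum().detach())
            s, r, done = env.step(a.numpy())                  # assumed environment interface
            rewards.append(r)
            if done:
                break
        s_batch = torch.as_tensor(np.stack(states))
        a_batch = torch.stack(actions)
        logp_old = torch.stack(old_log_probs)
        with torch.no_grad():
            v = critic(s_batch)
            v = torch.cat([v, v[-1:]])                        # crude bootstrap for the last state
        adv = torch.as_tensor(gae_advantages(rewards, v.numpy(), gamma, lam),
                              dtype=torch.float32)
        returns = adv + v[:-1]
        for _ in range(M):                                    # action-network updates
            dist = actor(s_batch)
            logp = dist.log_prob(a_batch).sum(-1)
            loss_pi = ppo_clip_loss(logp, logp_old, adv, eps)
            opt_actor.zero_grad(); loss_pi.backward(); opt_actor.step()
        for _ in range(B):                                    # evaluation-network updates
            loss_v = ((critic(s_batch) - returns) ** 2).mean()
            opt_critic.zero_grad(); loss_v.backward(); opt_critic.step()
```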
The UAV trajectory optimization solution proposed in the embodiments of this specification can take the remaining energy of the UAV into account in the state values input to the reinforcement learning network and can directly output the UAV's acceleration and flight direction. In addition, when a penalty value is added to the reward function, the UAV's return time can also be output. Through online learning, this solution dynamically adjusts the learned strategy according to environmental changes and thus adapts to the environment. At the same time, this solution considers the control problem in the continuous domain, which is consistent with the continuous-domain flight control mechanism of real scenarios.
On the other hand, the PPO algorithm is the continuous-domain control algorithm with the best robustness and the most outstanding performance; it eliminates the drawback that an appropriate learning step size is difficult to determine and reduces the complexity of the algorithm.
Based on the above UAV trajectory optimization method, the embodiments of this specification further provide a UAV trajectory optimization device whose internal structure is shown in Fig. 6, including: a construction module 602, a training data collection module 604 and a training module 606. As mentioned above, the UAV trajectory optimization device can be built into the UAV or into an IoT device, or it can be a separate device that can communicate with the UAV and the IoT devices; the embodiments of this specification do not limit this.
The construction module 602 is configured to construct the deep reinforcement learning network.
The training data collection module 604 is configured to acquire the state data and action decision data of the UAV during its flight and to calculate the instantaneous energy efficiency of the UAV.
The training module 606 is configured to train the deep reinforcement learning network with the state data as input, the action decision data as output and the instantaneous energy efficiency as the reward, to optimize the strategy parameters, and to output the UAV flight strategy.
It should be noted that the construction module 602, the training data collection module 604 and the training module 606 can implement their respective specific functions through the methods described in the foregoing embodiments, which are not repeated here.
Fig. 7 shows the hardware structure of the UAV trajectory optimization device provided by an embodiment of this specification. As shown in Fig. 7, the UAV trajectory optimization device includes at least one processor 702 and a memory 704 communicatively connected with the at least one processor; the memory 704 stores instructions executable by the at least one processor 702, and the instructions are executed by the at least one processor so that the at least one processor can perform the UAV trajectory optimization method described above.
The electronic device includes one processor 702 and one memory 704, and may further include an input device and an output device. The processor, memory, input device and output device may be connected by a bus or in other ways. The memory, as a non-volatile computer-readable storage medium, can be used to store non-volatile software programs, non-volatile computer-executable programs and modules, such as the program instructions/modules corresponding to the UAV trajectory optimization method in the embodiments of this specification. The processor executes the various functional applications and data processing of the server by running the non-volatile software programs, instructions and modules stored in the memory, that is, it implements the UAV trajectory optimization method of the above method embodiments.
The memory may include a program storage area and a data storage area, where the program storage area can store the operating system and the application programs required by at least one function, and the data storage area can store data created according to the use of the UAV trajectory optimization device, and so on. In addition, the memory may include a high-speed random access memory and may also include a non-volatile memory, such as at least one magnetic disk storage device, flash memory device or other non-volatile solid-state storage device. In some embodiments, the memory may optionally include memory arranged remotely relative to the processor. Examples of the above networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks and combinations thereof.
The input device can receive input digital or character information and generate key signal inputs related to the user settings and function control of the UAV trajectory optimization device. The output device may include a display device such as a display screen. The one or more modules are stored in the memory and, when executed by the processor, perform the UAV trajectory optimization method in any of the above method embodiments. Any embodiment of the electronic device that executes the UAV trajectory optimization method can achieve the same or similar effects as any of the corresponding foregoing method embodiments.
Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, can include the processes of the above method embodiments. The storage medium can be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like. The embodiments of the computer program can achieve the same or similar effects as any of the corresponding foregoing method embodiments.
In addition, the method according to the present disclosure can also be implemented as a computer program executed by a CPU, and the computer program can be stored in a computer-readable storage medium. When the computer program is executed by the CPU, it performs the functions defined in the method of the present disclosure.
It should be noted that the above embodiments are only used to illustrate the technical solutions of this specification and not to limit them. Although this specification has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they can still modify the technical solutions described in the foregoing embodiments or make equivalent replacements of some of the technical features; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and protection scope of the technical solutions of the embodiments of this specification.
Those of ordinary skill in the art should understand that the discussion of any of the above embodiments is only exemplary and is not intended to imply that the scope of the present disclosure (including the claims) is limited to these examples. Within the spirit of this specification, the technical features of the above embodiments or of different embodiments can also be combined, the steps can be implemented in any order, and there are many other variations of the different aspects of this specification as described above which, for brevity, are not provided in detail.
In addition, to simplify the description and discussion, and so as not to obscure this specification, the well-known power/ground connections with integrated circuit (IC) chips and other components may or may not be shown in the provided drawings. Furthermore, devices may be shown in block diagram form in order to avoid obscuring this specification, which also takes into account the fact that the details of the implementation of these block-diagram devices are highly dependent on the platform on which this specification is to be implemented (i.e. these details should be fully within the understanding of those skilled in the art). Where specific details (for example, circuits) are set forth to describe exemplary embodiments of this specification, it is obvious to those skilled in the art that the technical solutions of this specification can be implemented without these specific details or with variations of these specific details. Therefore, these descriptions should be regarded as illustrative rather than restrictive.
Although this specification has been described in conjunction with specific embodiments, many substitutions, modifications and variations of these embodiments will be apparent to those of ordinary skill in the art from the foregoing description. For example, other memory architectures (for example, dynamic RAM (DRAM)) may use the discussed embodiments.
The technical solutions disclosed in this specification are stronger than the prior-art solutions using convex optimization algorithms in their ability to adapt to scenarios and environments. Because a reinforcement learning algorithm is introduced and the strategy parameters are optimized during the learning process rather than being based on a fixed objective equation, the solution has greater flexibility; moreover, the deep reinforcement learning network of this specification strengthens the interaction with the external environment by taking the environment state as input and obtaining rewards, and can therefore respond more quickly to changes in scenario and environment.
Secondly, compared with the Q-learning-based solutions in the prior art, this specification adopts a continuous-domain UAV trajectory optimization solution; the reinforcement learning outputs continuous action velocities and accelerations, which is closer to reality, makes it easy to extend the flight area, and avoids the potential problem of dimensional explosion when optimizing trajectories over a large area.
The technical solution disclosed in this specification combines deep reinforcement learning with the UAV trajectory optimization problem and, for the first time, adopts the PPO algorithm to solve this problem. Compared with optimization solutions that update with the deep deterministic policy gradient (DDPG) algorithm, the PPO algorithm is less affected by the training step size and is more adaptable when solving control problems in real scenarios; it solves the problem in the prior art that an appropriate learning step size is difficult to determine with the DDPG algorithm, and is more efficient.
In addition, this specification also considers the optimal return time for charging/refueling. By adding a penalty term to the reward function, the UAV can flexibly adjust its flight time and trajectory under the condition of returning safely, improving its energy utilization efficiency as much as possible.
The embodiments of this specification are intended to cover all such substitutions, modifications and variations that fall within the broad scope of the appended claims. Therefore, any omissions, modifications, equivalent replacements, improvements and the like made within the spirit and principles of this specification shall be included in the protection scope of this specification.

Claims (21)

  1. A UAV trajectory optimization method, comprising:
    acquiring UAV state data and action decision data during UAV flight;
    determining the instantaneous energy efficiency of the UAV according to the UAV state data and the action decision data;
    training a pre-built deep reinforcement learning network with the state data as input, the action decision data as output, and the instantaneous energy efficiency as the reward, so as to optimize the policy parameters of the deep reinforcement learning network; and
    outputting a UAV flight policy according to the trained deep reinforcement learning network.
  2. The UAV trajectory optimization method according to claim 1, wherein the method further comprises: pre-constructing a deep learning network structure comprising an action network and an evaluation network;
    wherein the action network uses a proximal policy optimization algorithm and a deep neural network to fit a UAV flight action policy function and decides the UAV flight actions; and
    the evaluation network uses a deep neural network to fit a state value function and optimizes the policy parameters in the UAV flight action policy function.
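    For illustration only, the following Python sketch shows one possible realization of the action (actor) and evaluation (critic) networks described in claim 2, using PyTorch; the layer sizes, hidden dimensions and class names are assumptions and are not taken from the application:

    import torch
    import torch.nn as nn

    class ActionNetwork(nn.Module):
        """Fits the UAV flight-action policy: maps a state to a Gaussian over (omega, theta, AC)."""
        def __init__(self, state_dim, action_dim=3, hidden=64):
            super().__init__()
            self.body = nn.Sequential(
                nn.Linear(state_dim, hidden), nn.Tanh(),
                nn.Linear(hidden, hidden), nn.Tanh(),
            )
            self.mean = nn.Linear(hidden, action_dim)              # mean of the continuous action
            self.log_std = nn.Parameter(torch.zeros(action_dim))   # learned log standard deviation

        def forward(self, state):
            h = self.body(state)
            return self.mean(h), self.log_std.exp()

    class EvaluationNetwork(nn.Module):
        """Fits the state value function V(s) used to optimize the policy parameters."""
        def __init__(self, state_dim, hidden=64):
            super().__init__()
            self.v = nn.Sequential(
                nn.Linear(state_dim, hidden), nn.Tanh(),
                nn.Linear(hidden, hidden), nn.Tanh(),
                nn.Linear(hidden, 1),
            )

        def forward(self, state):
            return self.v(state)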
  3. The UAV trajectory optimization method according to claim 1, wherein acquiring the UAV state data and action data comprises:
    determining the distance between the UAV and each Internet-of-Things (IoT) device, the data transmission rate from each IoT device to the UAV, and the remaining energy of the UAV as the state data; and
    collecting the acceleration and flight control angles of the UAV as the action data.
  4. The UAV trajectory optimization method according to claim 3, wherein acquiring the UAV state data and action data further comprises:
    quantizing the state data as
    φ(s_t) = [d_t^1, …, d_t^N, v_t^1, …, v_t^N, e_t]^T,
    where φ(s_t) denotes the state data matrix and s_t denotes the state at time t; d_t^1, …, d_t^N respectively denote the distances between the 1st to N-th IoT devices and the UAV at time t; v_t^1, …, v_t^N respectively denote the transmission rates at which the 1st to N-th IoT devices transmit information to the UAV at time t; and e_t denotes the remaining energy of the UAV at time t; and
    representing the action data as a_t = [ω_t, θ_t, AC_t]^T, where a_t denotes the action at time t; ω_t ∈ [0, 2π] denotes the horizontal flight control angle of the UAV at time t; θ_t ∈ [0, π] denotes the vertical flight control angle of the UAV at time t; and AC_t denotes the acceleration of the UAV at time t.
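    For illustration, a minimal Python (NumPy) sketch of how the state vector and action vector of claim 4 could be assembled; the device positions, rates and numeric values are hypothetical:

    import numpy as np

    def build_state(uav_pos, device_pos, rates, remaining_energy):
        """phi(s_t) = [d_t^1..d_t^N, v_t^1..v_t^N, e_t]: distances, uplink rates, remaining energy."""
        dists = np.linalg.norm(device_pos - uav_pos, axis=1)   # Euclidean distances (see claim 5)
        return np.concatenate([dists, rates, [remaining_energy]])

    def build_action(omega, theta, acc):
        """a_t = [omega_t, theta_t, AC_t]: horizontal angle, vertical angle, acceleration."""
        assert 0.0 <= omega <= 2 * np.pi and 0.0 <= theta <= np.pi
        return np.array([omega, theta, acc])

    # hypothetical example: UAV at 100 m altitude, three IoT devices on the ground
    s_t = build_state(np.array([0.0, 0.0, 100.0]),
                      np.array([[50.0, 20.0, 0.0], [-30.0, 60.0, 0.0], [10.0, -40.0, 0.0]]),
                      rates=np.array([1.2e6, 0.8e6, 2.0e6]),
                      remaining_energy=5.0e4)
    a_t = build_action(omega=np.pi / 4, theta=np.pi / 2, acc=1.5)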
  5. The UAV trajectory optimization method according to claim 4, wherein the distance between an IoT device and the UAV comprises the Euclidean distance between the IoT device and the UAV.
  6. The UAV trajectory optimization method according to claim 1, wherein determining the instantaneous energy efficiency of the UAV comprises: determining the instantaneous energy efficiency of the UAV using the following formula:
    r(s_t, a_t) = (Σ_{u=1}^{N} R_u^max(t)) / E(q(t)),
    where r(s_t, a_t) denotes the instantaneous energy efficiency of the UAV when its state at time t is s_t and its action is a_t; R_u^max(t) is the maximum transmission rate at which IoT device u transmits data to the UAV at time t; and E(q(t)) denotes the energy loss of the UAV at time t.
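    As a non-authoritative Python sketch of the energy-efficiency reward of claim 6, assuming the efficiency is taken as the aggregate maximum uplink rate divided by the instantaneous energy loss (the aggregation over devices and the numeric values below are assumptions):

    def instantaneous_energy_efficiency(max_rates_t, energy_loss_t):
        """r(s_t, a_t): achievable uplink data per unit of UAV energy spent at time t."""
        return sum(max_rates_t) / energy_loss_t  # assumed aggregation over IoT devices

    # hypothetical numbers: three devices, 200 J of energy spent in this time slot
    r_t = instantaneous_energy_efficiency([1.2e6, 0.8e6, 2.0e6], energy_loss_t=200.0)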
  7. The UAV trajectory optimization method according to claim 6, wherein the maximum transmission rate at which IoT device u transmits data to the UAV at time t is determined through the following process:
    determining the average path loss of the UAV;
    determining the signal-to-noise ratio between the UAV and IoT device u at time t according to the average path loss; and
    determining, according to the signal-to-noise ratio, the maximum transmission rate at which device u transmits data to the UAV at time t.
  8. The UAV trajectory optimization method according to claim 7, wherein determining the average path loss of the UAV comprises: determining the average path loss of the UAV through the following formula:
    L_u(t) = 20·log10(4π·f_c·d_t^u / c) + η_LoS,
    where L_u(t) denotes the average path loss of the UAV; f_c denotes the center frequency; d_t^u denotes the distance between the UAV and device u at time t; c denotes the speed of light; and η_LoS denotes the additional spatial propagation loss of the LoS link.
  9. The UAV trajectory optimization method according to claim 7, wherein determining the signal-to-noise ratio of the UAV comprises: determining the signal-to-noise ratio between the UAV and IoT device u at time t through the following formula:
    Γ_u(t) = P_u·g_u(t) / N_0,
    where Γ_u(t) denotes the signal-to-noise ratio between the UAV and IoT device u at time t; P_u denotes the uplink transmission power of device u; g_u(t) denotes the gain of the channel between the UAV and device u at time t; and N_0 is the noise power; wherein
    g_u(t) = 10^(−L_u(t)/10).
  10. The UAV trajectory optimization method according to claim 7, wherein determining the maximum transmission rate at which device u transmits data to the UAV at time t comprises: determining the maximum transmission rate at which device u transmits data to the UAV at time t through the following formula:
    R_u^max(t) = B·log2(1 + Γ_u(t)),
    where B denotes the channel bandwidth and Γ_u(t) denotes the signal-to-noise ratio between the UAV and IoT device u at time t.
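    Claims 7 to 10 describe the chain from path loss to SNR to maximum uplink rate. The following Python sketch illustrates that chain under the stated assumptions (free-space LoS path loss with an additional loss term, channel gain g = 10^(−L/10), and the Shannon rate); all parameter values are hypothetical:

    import math

    C = 3.0e8  # speed of light, m/s

    def avg_path_loss_db(f_c, dist, eta_los):
        """Average LoS path loss in dB: 20*log10(4*pi*f_c*d/c) + eta_LoS (claim 8)."""
        return 20.0 * math.log10(4.0 * math.pi * f_c * dist / C) + eta_los

    def snr(p_u, path_loss_db, n0):
        """SNR = P_u * g / N_0 with channel gain g = 10^(-L/10) (claim 9)."""
        gain = 10.0 ** (-path_loss_db / 10.0)
        return p_u * gain / n0

    def max_rate(bandwidth, snr_value):
        """Maximum uplink rate R = B * log2(1 + SNR) (claim 10)."""
        return bandwidth * math.log2(1.0 + snr_value)

    # hypothetical link: 2 GHz carrier, 120 m distance, 1 dB extra LoS loss,
    # 0.1 W uplink power, 1e-14 W noise power, 1 MHz bandwidth
    loss = avg_path_loss_db(f_c=2.0e9, dist=120.0, eta_los=1.0)
    rate = max_rate(1.0e6, snr(0.1, loss, n0=1e-14))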
  11. The UAV trajectory optimization method according to claim 6, wherein the remaining energy of the UAV is the difference between the initial total energy of the UAV and the energy loss of the UAV, the energy loss of the UAV comprising at least one of UAV flight energy loss and UAV communication energy loss.
  12. The UAV trajectory optimization method according to claim 6, wherein determining the instantaneous energy efficiency of the UAV further comprises:
    adding a penalty term of a preset value to the formula for calculating the instantaneous energy efficiency when the UAV runs out of energy on its return trip.
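    A minimal Python sketch of the penalty of claim 12: when the remaining energy would not cover the flight back, a preset negative penalty is added to the reward. The penalty value and the energy check below are assumptions for illustration:

    RETURN_PENALTY = -100.0  # preset penalty value (assumed)

    def reward_with_return_penalty(efficiency_t, remaining_energy, energy_needed_to_return):
        """Add a penalty to the instantaneous efficiency when safe return is no longer possible."""
        if remaining_energy < energy_needed_to_return:
            return efficiency_t + RETURN_PENALTY
        return efficiency_t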
  13. The UAV trajectory optimization method according to claim 1, wherein training the pre-built deep reinforcement learning network comprises:
    rewriting, with the proximal policy optimization algorithm, the objective equation of the deep reinforcement learning network as
    L^CLIP(θ) = E_t[ min( r_t(θ)·Â_t , clip(r_t(θ), 1−ε, 1+ε)·Â_t ) ],
    where θ is the policy parameter to be optimized; ε is a preset constant used to control the update amplitude of the UAV flight policy; E_t[·] denotes the expectation at time t; Â_t denotes the advantage function; clip denotes the clipping function; and r_t(θ) is the ratio between the new and old policy functions in one iterative update, which can be expressed as
    r_t(θ) = π_θ(a_t|s_t) / π_θold(a_t|s_t),
    where π_θ denotes the UAV flight policy function, π_θ(a_t|s_t) denotes the new UAV flight policy function when the state at time t is s_t and the action is a_t, and π_θold(a_t|s_t) denotes the old UAV flight policy function when the state at time t is s_t and the action is a_t;
    wherein the advantage function Â_t can be expressed by the following equations:
    Â_t = δ_t + (γλ)·δ_(t+1) + … + (γλ)^(T−t−1)·δ_(T−1),
    δ_t = r_t + γ·V(s_(t+1)) − V(s_t),
    where γ is the discount factor; λ is the trace parameter; δ_t is the temporal-difference error at time t; δ_(T−1) is the temporal-difference error at time T−1; and T is the total duration of autonomous flight; and
    finding the maximum value of the objective equation through at least one iterative update, optimizing the policy parameters in the UAV flight policy function, and outputting the policy parameters corresponding to the maximum value of the objective equation as the UAV flight policy.
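    For illustration, a minimal Python (PyTorch) sketch of the clipped objective and the advantage estimate described in claim 13; the hyper-parameter values (gamma, lam, eps) are assumptions, and the sketch is not a full training loop:

    import torch

    def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
        """A_hat_t = sum_k (gamma*lam)^k * delta_{t+k}, with delta_t = r_t + gamma*V(s_{t+1}) - V(s_t).
        `values` must contain one more entry than `rewards` (the bootstrap value)."""
        T = len(rewards)
        deltas = [rewards[t] + gamma * values[t + 1] - values[t] for t in range(T)]
        adv, running = [0.0] * T, 0.0
        for t in reversed(range(T)):
            running = deltas[t] + gamma * lam * running
            adv[t] = running
        return torch.tensor(adv)

    def ppo_clip_objective(log_probs_new, log_probs_old, advantages, eps=0.2):
        """L_CLIP(theta) = E_t[min(r_t*A_t, clip(r_t, 1-eps, 1+eps)*A_t)], r_t = pi_new/pi_old."""
        ratio = torch.exp(log_probs_new - log_probs_old)
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
        return torch.min(unclipped, clipped).mean()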
  14. The UAV trajectory optimization method according to claim 12, wherein the advantage function Â_t is obtained through optimization with a deep neural network according to the UAV state data, the UAV action decision data and the instantaneous energy efficiency of the UAV.
  15. The UAV trajectory optimization method according to claim 14, wherein obtaining the advantage function Â_t through optimization with a deep neural network according to the UAV state data, the UAV action decision data and the instantaneous energy efficiency of the UAV comprises:
    estimating the advantage function Â_t with a deep neural network based on the state data, the action decision data and the instantaneous energy efficiency of the UAV;
    calculating the function given in Fig. PCTCN2019114200-appb-100026, updating the parameter ω by gradient descent, and iterating for a predetermined number of iterations; and
    obtaining the advantage function Â_t for which the function reaches its maximum value.
  16. The UAV trajectory optimization method according to claim 1, wherein the method further comprises: determining the action decision data of the UAV according to the UAV flight policy.
  17. A UAV trajectory optimization apparatus, comprising:
    a construction module, configured to construct a deep reinforcement learning network;
    a training data collection module, configured to acquire state data and action decision data of a UAV during UAV flight, and to calculate the instantaneous energy efficiency of the UAV; and
    a training module, configured to train the deep reinforcement learning network with the state data as input, the action decision data as output, and the instantaneous energy efficiency as the reward, to optimize the policy parameters, and to output a UAV flight policy.
  18. The UAV trajectory optimization apparatus according to claim 17, wherein the construction module is configured to construct a deep learning network structure comprising an action network and an evaluation network, wherein the action network uses a proximal policy optimization algorithm and a deep neural network to fit a UAV flight action policy function and decides the UAV flight actions, and the evaluation network uses a deep neural network to fit a state value function and optimizes the policy parameters in the UAV flight action policy function.
  19. The UAV trajectory optimization apparatus according to claim 17, wherein the training data collection module is configured to determine the distance between the UAV and each IoT device, the data transmission rate from each IoT device to the UAV, and the remaining energy of the UAV as the state data, and to collect the acceleration and flight control angles of the UAV as the action decision data.
  20. A UAV trajectory optimization apparatus, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the UAV trajectory optimization method according to any one of claims 1 to 16.
  21. A computer-readable storage medium having computer instructions stored thereon, wherein the UAV trajectory optimization method according to any one of claims 1 to 16 is implemented when a processor executes the computer instructions.
PCT/CN2019/114200 2019-07-30 2019-10-30 Path optimization method and device for unmanned aerial vehicle, and storage medium WO2021017227A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910697007.6A CN110488861B (en) 2019-07-30 2019-07-30 Unmanned aerial vehicle track optimization method and device based on deep reinforcement learning and unmanned aerial vehicle
CN201910697007.6 2019-07-30

Publications (1)

Publication Number Publication Date
WO2021017227A1 true WO2021017227A1 (en) 2021-02-04

Family

ID=68548830

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/114200 WO2021017227A1 (en) 2019-07-30 2019-10-30 Path optimization method and device for unmanned aerial vehicle, and storage medium

Country Status (2)

Country Link
CN (1) CN110488861B (en)
WO (1) WO2021017227A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113268074A (en) * 2021-06-07 2021-08-17 哈尔滨工程大学 Unmanned aerial vehicle flight path planning method based on joint optimization

Families Citing this family (68)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110879595A (en) * 2019-11-29 2020-03-13 江苏徐工工程机械研究院有限公司 Unmanned mine card tracking control system and method based on deep reinforcement learning
CN110958680B (en) * 2019-12-09 2022-09-13 长江师范学院 Energy efficiency-oriented unmanned aerial vehicle cluster multi-agent deep reinforcement learning optimization method
CN111132192B (en) * 2019-12-13 2023-01-17 广东工业大学 Unmanned aerial vehicle base station online track optimization method
CN111123963B (en) * 2019-12-19 2021-06-08 南京航空航天大学 Unknown environment autonomous navigation system and method based on reinforcement learning
CN111026147B (en) * 2019-12-25 2021-01-08 北京航空航天大学 Zero overshoot unmanned aerial vehicle position control method and device based on deep reinforcement learning
CN111191728B (en) * 2019-12-31 2023-05-09 中国电子科技集团公司信息科学研究院 Deep reinforcement learning distributed training method and system based on asynchronization or synchronization
CN111314929B (en) * 2020-01-20 2023-06-09 浙江工业大学 Contract-based unmanned aerial vehicle edge cache strategy and rewarding optimization method
CN111385806B (en) * 2020-02-18 2021-10-26 清华大学 Unmanned aerial vehicle base station path planning and bandwidth resource allocation method and device
CN111263332A (en) * 2020-03-02 2020-06-09 湖北工业大学 Unmanned aerial vehicle track and power joint optimization method based on deep reinforcement learning
CN111381499B (en) * 2020-03-10 2022-09-27 东南大学 Internet-connected aircraft self-adaptive control method based on three-dimensional space radio frequency map learning
CN111565065B (en) * 2020-03-24 2021-06-04 北京邮电大学 Unmanned aerial vehicle base station deployment method and device and electronic equipment
CN111580544B (en) * 2020-03-25 2021-05-07 北京航空航天大学 Unmanned aerial vehicle target tracking control method based on reinforcement learning PPO algorithm
CN111460650B (en) * 2020-03-31 2022-11-01 北京航空航天大学 Unmanned aerial vehicle end-to-end control method based on deep reinforcement learning
CN112180967B (en) * 2020-04-26 2022-08-19 北京理工大学 Multi-unmanned aerial vehicle cooperative countermeasure decision-making method based on evaluation-execution architecture
CN111552313B (en) * 2020-04-29 2022-06-28 南京理工大学 Multi-unmanned aerial vehicle path planning method based on edge calculation dynamic task arrival
CN112198870B (en) * 2020-06-01 2022-09-02 西北工业大学 Unmanned aerial vehicle autonomous guiding maneuver decision method based on DDQN
CN111786713B (en) * 2020-06-04 2021-06-08 大连理工大学 Unmanned aerial vehicle network hovering position optimization method based on multi-agent deep reinforcement learning
CN111752304B (en) * 2020-06-23 2022-10-14 深圳清华大学研究院 Unmanned aerial vehicle data acquisition method and related equipment
CN111724001B (en) * 2020-06-29 2023-08-29 重庆大学 Aircraft detection sensor resource scheduling method based on deep reinforcement learning
CN112097783B (en) * 2020-08-14 2022-05-20 广东工业大学 Electric taxi charging navigation path planning method based on deep reinforcement learning
CN112068590A (en) * 2020-08-21 2020-12-11 广东工业大学 Unmanned aerial vehicle base station flight planning method and system, storage medium and unmanned aerial vehicle base station
CN112235810B (en) * 2020-09-17 2021-07-09 广州番禺职业技术学院 Multi-dimensional optimization method and system of unmanned aerial vehicle communication system based on reinforcement learning
CN112051863A (en) * 2020-09-25 2020-12-08 南京大学 Unmanned aerial vehicle autonomous anti-reconnaissance and enemy attack avoidance method
CN112362522B (en) * 2020-10-23 2022-08-02 浙江中烟工业有限责任公司 Tobacco leaf volume weight measuring method based on reinforcement learning
CN114527737A (en) * 2020-11-06 2022-05-24 百度在线网络技术(北京)有限公司 Speed planning method, device, equipment, medium and vehicle for automatic driving
CN112566209A (en) * 2020-11-24 2021-03-26 山西三友和智慧信息技术股份有限公司 UAV-BSs energy and service priority track design method based on double Q learning
CN112711271B (en) * 2020-12-16 2022-05-17 中山大学 Autonomous navigation unmanned aerial vehicle power optimization method based on deep reinforcement learning
CN112865855B (en) * 2021-01-04 2022-04-08 福州大学 High-efficiency wireless covert transmission method based on unmanned aerial vehicle relay
CN112819215B (en) * 2021-01-26 2024-01-12 北京百度网讯科技有限公司 Recommendation strategy training method and device, electronic equipment and readable storage medium
CN112791394B (en) * 2021-02-02 2022-09-30 腾讯科技(深圳)有限公司 Game model training method and device, electronic equipment and storage medium
CN115046433B (en) * 2021-03-09 2023-04-07 北京理工大学 Aircraft time collaborative guidance method based on deep reinforcement learning
CN113159386A (en) * 2021-03-22 2021-07-23 中国科学技术大学 Unmanned aerial vehicle return state estimation method and system
CN113050673B (en) * 2021-03-25 2021-12-28 四川大学 Three-dimensional trajectory optimization method for high-energy-efficiency unmanned aerial vehicle of auxiliary communication system
CN113115344B (en) * 2021-04-19 2021-12-14 中国人民解放军火箭军工程大学 Unmanned aerial vehicle base station communication resource allocation strategy prediction method based on noise optimization
CN113110546B (en) * 2021-04-20 2022-09-23 南京大学 Unmanned aerial vehicle autonomous flight control method based on offline reinforcement learning
CN113110550B (en) * 2021-04-23 2022-09-23 南京大学 Unmanned aerial vehicle flight control method based on reinforcement learning and network model distillation
CN113316239B (en) * 2021-05-10 2022-07-08 北京科技大学 Unmanned aerial vehicle network transmission power distribution method and device based on reinforcement learning
CN113258989B (en) * 2021-05-17 2022-06-03 东南大学 Method for obtaining relay track of unmanned aerial vehicle by using reinforcement learning
CN113110516B (en) * 2021-05-20 2023-12-22 广东工业大学 Operation planning method for limited space robot with deep reinforcement learning
CN113283169B (en) * 2021-05-24 2022-04-26 北京理工大学 Three-dimensional group exploration method based on multi-head attention asynchronous reinforcement learning
CN113255218B (en) * 2021-05-27 2022-05-31 电子科技大学 Unmanned aerial vehicle autonomous navigation and resource scheduling method of wireless self-powered communication network
CN113419548A (en) * 2021-05-28 2021-09-21 北京控制工程研究所 Spacecraft deep reinforcement learning Levier flight control system
CN113157002A (en) * 2021-05-28 2021-07-23 南开大学 Air-ground cooperative full-coverage trajectory planning method based on multiple unmanned aerial vehicles and multiple base stations
CN113543068B (en) * 2021-06-07 2024-02-02 北京邮电大学 Forest area unmanned aerial vehicle network deployment method and system based on hierarchical clustering
CN113382060B (en) * 2021-06-07 2022-03-22 北京理工大学 Unmanned aerial vehicle track optimization method and system in Internet of things data collection
CN113507717A (en) * 2021-06-08 2021-10-15 山东师范大学 Unmanned aerial vehicle track optimization method and system based on vehicle track prediction
CN113283013B (en) * 2021-06-10 2022-07-19 北京邮电大学 Multi-unmanned aerial vehicle charging and task scheduling method based on deep reinforcement learning
CN113423060B (en) * 2021-06-22 2022-05-10 广东工业大学 Online optimization method for flight route of unmanned aerial communication platform
CN113377131B (en) * 2021-06-23 2022-06-03 东南大学 Method for acquiring unmanned aerial vehicle collected data track by using reinforcement learning
CN113239639B (en) * 2021-06-29 2022-08-26 暨南大学 Policy information generation method, policy information generation device, electronic device, and storage medium
CN113359480B (en) * 2021-07-16 2022-02-01 中国人民解放军火箭军工程大学 Multi-unmanned aerial vehicle and user cooperative communication optimization method based on MAPPO algorithm
CN113776531A (en) * 2021-07-21 2021-12-10 电子科技大学长三角研究院(湖州) Multi-unmanned-aerial-vehicle autonomous navigation and task allocation algorithm of wireless self-powered communication network
CN113639755A (en) * 2021-08-20 2021-11-12 江苏科技大学苏州理工学院 Fire scene escape-rescue combined system based on deep reinforcement learning
CN113721655B (en) * 2021-08-26 2023-06-16 南京大学 Control period self-adaptive reinforcement learning unmanned aerial vehicle stable flight control method
CN114200950B (en) * 2021-10-26 2023-06-02 北京航天自动控制研究所 Flight attitude control method
CN113885549B (en) * 2021-11-23 2023-11-21 江苏科技大学 Four-rotor gesture track control method based on dimension clipping PPO algorithm
CN114142912B (en) * 2021-11-26 2023-01-06 西安电子科技大学 Resource control method for guaranteeing time coverage continuity of high-dynamic air network
CN114268986A (en) * 2021-12-14 2022-04-01 北京航空航天大学 Unmanned aerial vehicle computing unloading and charging service efficiency optimization method
CN114372612B (en) * 2021-12-16 2023-04-28 电子科技大学 Path planning and task unloading method for unmanned aerial vehicle mobile edge computing scene
CN114384931B (en) * 2021-12-23 2023-08-29 同济大学 Multi-target optimal control method and equipment for unmanned aerial vehicle based on strategy gradient
CN114550540A (en) * 2022-02-10 2022-05-27 北方天途航空技术发展(北京)有限公司 Intelligent monitoring method, device, equipment and medium for training machine
CN114785397B (en) * 2022-03-11 2023-04-07 成都三维原光通讯技术有限公司 Unmanned aerial vehicle base station control method, flight trajectory optimization model construction and training method
CN114741886B (en) * 2022-04-18 2022-11-22 中国人民解放军军事科学院战略评估咨询中心 Unmanned aerial vehicle cluster multi-task training method and system based on contribution degree evaluation
CN115202377B (en) * 2022-06-13 2023-06-09 北京理工大学 Fuzzy self-adaptive NMPC track tracking control and energy management method
CN115061371B (en) * 2022-06-20 2023-08-04 中国航空工业集团公司沈阳飞机设计研究所 Unmanned plane control strategy reinforcement learning generation method capable of preventing strategy jitter
CN115001002B (en) * 2022-08-01 2022-12-30 广东电网有限责任公司肇庆供电局 Optimal scheduling method and system for solving problem of energy storage participation peak clipping and valley filling
CN116741019A (en) * 2023-08-11 2023-09-12 成都飞航智云科技有限公司 Flight model training method and training system based on AI
CN116736729B (en) * 2023-08-14 2023-10-27 成都蓉奥科技有限公司 Method for generating perception error-resistant maneuvering strategy of air combat in line of sight

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20100016909A (en) * 2008-08-05 2010-02-16 주식회사 케이티 Apparatus and method of policy modeling based on partially observable markov decision processes
CN106019950A (en) * 2016-08-09 2016-10-12 中国科学院软件研究所 Mobile phone satellite self-adaptive attitude control method
CN108594638A (en) * 2018-03-27 2018-09-28 南京航空航天大学 The in-orbit reconstructing methods of spacecraft ACS towards the constraint of multitask multi-index optimization
CN109343341A (en) * 2018-11-21 2019-02-15 北京航天自动控制研究所 It is a kind of based on deeply study carrier rocket vertically recycle intelligent control method
CN109445456A (en) * 2018-10-15 2019-03-08 清华大学 A kind of multiple no-manned plane cluster air navigation aid
CN109639377A (en) * 2018-12-13 2019-04-16 西安电子科技大学 Dynamic spectrum resource management method based on deeply study

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101813697B1 (en) * 2015-12-22 2017-12-29 한국항공대학교산학협력단 Unmanned aerial vehicle flight control system and method using deep learning
CN106168808A (en) * 2016-08-25 2016-11-30 南京邮电大学 A kind of rotor wing unmanned aerial vehicle automatic cruising method based on degree of depth study and system thereof
CN106970615B (en) * 2017-03-21 2019-10-22 西北工业大学 A kind of real-time online paths planning method of deeply study
CN108319286B (en) * 2018-03-12 2020-09-22 西北工业大学 Unmanned aerial vehicle air combat maneuver decision method based on reinforcement learning
CN109443366B (en) * 2018-12-20 2020-08-21 北京航空航天大学 Unmanned aerial vehicle group path planning method based on improved Q learning algorithm
CN109933086B (en) * 2019-03-14 2022-08-30 天津大学 Unmanned aerial vehicle environment perception and autonomous obstacle avoidance method based on deep Q learning

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20100016909A (en) * 2008-08-05 2010-02-16 주식회사 케이티 Apparatus and method of policy modeling based on partially observable markov decision processes
CN106019950A (en) * 2016-08-09 2016-10-12 中国科学院软件研究所 Mobile phone satellite self-adaptive attitude control method
CN108594638A (en) * 2018-03-27 2018-09-28 南京航空航天大学 The in-orbit reconstructing methods of spacecraft ACS towards the constraint of multitask multi-index optimization
CN109445456A (en) * 2018-10-15 2019-03-08 清华大学 A kind of multiple no-manned plane cluster air navigation aid
CN109343341A (en) * 2018-11-21 2019-02-15 北京航天自动控制研究所 It is a kind of based on deeply study carrier rocket vertically recycle intelligent control method
CN109639377A (en) * 2018-12-13 2019-04-16 西安电子科技大学 Dynamic spectrum resource management method based on deeply study

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113268074A (en) * 2021-06-07 2021-08-17 哈尔滨工程大学 Unmanned aerial vehicle flight path planning method based on joint optimization
CN113268074B (en) * 2021-06-07 2022-05-13 哈尔滨工程大学 Unmanned aerial vehicle flight path planning method based on joint optimization

Also Published As

Publication number Publication date
CN110488861B (en) 2020-08-28
CN110488861A (en) 2019-11-22

Similar Documents

Publication Publication Date Title
WO2021017227A1 (en) Path optimization method and device for unmanned aerial vehicle, and storage medium
CN113162679B (en) DDPG algorithm-based IRS (intelligent resilient software) assisted unmanned aerial vehicle communication joint optimization method
CN113132943B (en) Task unloading scheduling and resource allocation method for vehicle-side cooperation in Internet of vehicles
CN113660681B (en) Multi-agent resource optimization method applied to unmanned aerial vehicle cluster auxiliary transmission
Apostolopoulos et al. Satisfaction-aware data offloading in surveillance systems
CN115277689A (en) Yun Bianwang network communication optimization method and system based on distributed federal learning
CN114690799A (en) Air-space-ground integrated unmanned aerial vehicle Internet of things data acquisition method based on information age
CN115499921A (en) Three-dimensional trajectory design and resource scheduling optimization method for complex unmanned aerial vehicle network
CN113406965A (en) Unmanned aerial vehicle energy consumption optimization method based on reinforcement learning
CN113382060B (en) Unmanned aerial vehicle track optimization method and system in Internet of things data collection
Yang et al. Federated imitation learning for uav swarm coordination in urban traffic monitoring
CN114268986A (en) Unmanned aerial vehicle computing unloading and charging service efficiency optimization method
Mu et al. Memory-event-triggered consensus control for multi-UAV systems against deception attacks
CN110673651B (en) Robust formation method for unmanned aerial vehicle cluster under limited communication condition
CN116321237A (en) Unmanned aerial vehicle auxiliary internet of vehicles data collection method based on deep reinforcement learning
CN114895710A (en) Control method and system for autonomous behavior of unmanned aerial vehicle cluster
CN114022731A (en) Federal learning node selection method based on DRL
Blair et al. A continuum approach for collaborative task processing in UAV MEC networks
CN116634388B (en) Electric power fusion network-oriented big data edge caching and resource scheduling method and system
CN115633320B (en) Multi-unmanned aerial vehicle assisted data acquisition and return method, system, equipment and medium
Qi et al. Edge-edge Collaboration Based Micro-service Deployment in Edge Computing Networks
Quan et al. Interpretable and Secure Trajectory Optimization for UAV-Assisted Communication
CN114745693B (en) PSO-GA hybrid algorithm-based UAV auxiliary Internet of vehicles resource allocation method
Kumar et al. Proximal Policy Optimization based computations offloading for delay optimization in UAV-assisted mobile edge computing
CN117336735A (en) Rapid phase shift optimization method for intelligent reflection surface-oriented auxiliary unmanned aerial vehicle line inspection system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19939274

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19939274

Country of ref document: EP

Kind code of ref document: A1