CN110488861B - Unmanned aerial vehicle track optimization method and device based on deep reinforcement learning and unmanned aerial vehicle


Info

Publication number
CN110488861B
CN110488861B
Authority
CN
China
Prior art keywords
aerial vehicle
unmanned aerial
time
strategy
function
Prior art date
Legal status
Active
Application number
CN201910697007.6A
Other languages
Chinese (zh)
Other versions
CN110488861A (en
Inventor
许文俊
徐越
吴思雷
张治
张平
林家儒
Current Assignee
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN201910697007.6A priority Critical patent/CN110488861B/en
Priority to PCT/CN2019/114200 priority patent/WO2021017227A1/en
Publication of CN110488861A publication Critical patent/CN110488861A/en
Application granted granted Critical
Publication of CN110488861B publication Critical patent/CN110488861B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/10Simultaneous control of position or course in three dimensions
    • G05D1/101Simultaneous control of position or course in three dimensions specially adapted for aircraft

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses an unmanned aerial vehicle trajectory optimization method and device based on deep reinforcement learning, and an unmanned aerial vehicle. A reinforcement learning network is constructed in advance, and state data and action decision data are generated in real time during the flight of the unmanned aerial vehicle. With the state data as input, the action decision data as output and the instantaneous energy efficiency as the reward return, the strategy parameters are optimized with the PPO algorithm and an optimal strategy is output. The device comprises a construction module, a training data collection module and a training module. The unmanned aerial vehicle comprises a processor configured to execute the unmanned aerial vehicle trajectory optimization method based on deep reinforcement learning. The invention has the capability of autonomous learning from accumulated flight data, can intelligently determine the optimal flight speed, acceleration, flight direction and return time of the aircraft in an unknown communication scenario, generalizes an energy-efficiency-optimal flight strategy, and has strong environmental adaptability and generalization capability.

Description

Unmanned aerial vehicle track optimization method and device based on deep reinforcement learning and unmanned aerial vehicle
Technical Field
The invention relates to the technical field of wireless communication, in particular to an unmanned aerial vehicle track optimization method and device based on deep reinforcement learning and an unmanned aerial vehicle.
Background
Unmanned aerial vehicle communication technology is considered an indispensable component of fifth-generation (5G) and beyond-5G (5G+) mobile communication networks. However, the unique air-to-ground channel model, highly dynamic three-dimensional flight capability and limited flight energy of unmanned aerial vehicles make an unmanned aerial vehicle communication system more complex than a traditional communication system.
At present, unmanned aerial vehicle trajectory optimization mainly relies on conventional convex optimization algorithms and reinforcement learning algorithms. For example, the Chinese patent application with application number 201811144956.3 discloses an energy consumption optimization method for an unmanned aerial vehicle mobile edge computing system based on a cellular network. In that method, the position, speed and acceleration of the unmanned aerial vehicle at every moment are optimized with a convex optimization algorithm subject to the constraints imposed by data processing, communication and flight. As another example, the Chinese patent application with application number 201811564184.9 discloses an unmanned aerial vehicle group path planning method based on an improved Q-learning algorithm. That method combines the Q-learning reinforcement learning algorithm with unmanned aerial vehicle trajectory optimization: a discretized environment model is first established with a grid method; then a finite set of environment state values is input, the reinforcement learning network queries a state-action value matrix to output actions and obtains returns from the environment to update the matrix; and finally trajectory planning of the unmanned aerial vehicle in an unknown environment is realized.
When a convex optimization algorithm is used for unmanned aerial vehicle trajectory optimization, the objective equation in a real scenario is very complex, so the scenario must be simplified, scenario assumptions must be made, and the flight control optimization of the unmanned aerial vehicle must be carried out in a discrete domain before a simplified, solvable objective problem can be obtained; the result therefore usually deviates from the actual optimum. Moreover, a trajectory optimization method based on a convex optimization algorithm has difficulty handling dynamically changing environment information: when the communication demand changes dynamically, for example, the original objective equation no longer applies. Prior-art schemes that optimize the trajectory with a reinforcement learning algorithm such as Q-learning must first establish a table mapping environment states to actions, and then select the action corresponding to the maximum state-action value (Q value) by looking up the current state value in the table. Because of the limitations of the state-action table, both the definable states and the available actions are finite. In practice, however, states and actions are often infinite or continuous; converting them into a finite number loses information and risks a dimensional explosion.
It can be seen that, in the prior-art schemes for optimizing the flight trajectory of an unmanned aerial vehicle, the flight scenarios that can be handled and the flight actions that can be provided are relatively limited; such schemes have difficulty coping with dynamically changing environment information during flight and deviate from the actual flight requirements of the unmanned aerial vehicle.
Disclosure of Invention
The invention aims to provide an unmanned aerial vehicle track optimization method and device based on deep reinforcement learning and an unmanned aerial vehicle, so as to solve the technical problems.
In order to achieve the purpose, the invention provides the following scheme:
the first aspect of the embodiment of the invention provides an unmanned aerial vehicle trajectory optimization method based on deep reinforcement learning, which comprises the following steps:
a deep reinforcement learning network based on a PPO algorithm is constructed in advance;
interacting with the environment in real time in the flight process of the unmanned aerial vehicle to generate state data and action decision data, and calculating instantaneous energy efficiency;
and training the deep reinforcement learning network with the PPO algorithm, with the state data as input, the action decision data as output and the instantaneous energy efficiency as the reward return, optimizing the strategy parameters and outputting an optimal strategy through repeated iterative updates.
Optionally, the pre-constructing a deep reinforcement learning network based on a PPO algorithm includes:
constructing a deep learning network structure comprising an action network and an evaluation network;
the action network utilizes a PPO algorithm and a deep neural network to fit a strategy function and decide flight actions; the evaluation network utilizes a deep neural network to fit a state cost function and optimize strategy parameters in a strategy function.
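For illustration only — the following sketch is not part of the original disclosure — the action network and evaluation network described above could be realized as two small fully connected networks. The layer sizes, the use of PyTorch and the Gaussian policy head for the continuous action [ω_t, a_t] are assumptions rather than details taken from the patent.

```python
import torch
import torch.nn as nn

class ActionNetwork(nn.Module):
    """Actor: fits the strategy function pi_theta(a|s) for the continuous action [omega_t, a_t]."""
    def __init__(self, state_dim, action_dim=2, hidden=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.mu = nn.Linear(hidden, action_dim)               # mean of a Gaussian policy
        self.log_std = nn.Parameter(torch.zeros(action_dim))  # learned log standard deviation

    def forward(self, state):
        h = self.body(state)
        return self.mu(h), self.log_std.exp()

class EvaluationNetwork(nn.Module):
    """Critic: fits the state value function V(s) used to build the advantage function."""
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.v = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state):
        return self.v(state).squeeze(-1)
```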
Optionally, generating the state data and the action decision data includes:
calculating the distance between the unmanned aerial vehicle and the Internet of things equipment, the transmission rate and the self residual energy as state data;
and acquiring the acceleration and the flight direction of the unmanned aerial vehicle as action decision data.
Optionally, generating the state data and the action decision data includes:
quantizing the state data and representing it as:

φ(s_t) = [d_t^1, …, d_t^N, R_t^1, …, R_t^N, E_t^res]^T,

where φ(s_t) represents the state data matrix, s_t represents the state at time t, d_t^1, …, d_t^N respectively represent the Euclidean distances between the 1st to Nth Internet of things devices and the unmanned aerial vehicle at time t, R_t^1, …, R_t^N respectively represent the transmission rates at which the 1st to Nth Internet of things devices transmit information to the unmanned aerial vehicle at time t, and E_t^res represents the remaining energy of the unmanned aerial vehicle at time t;
representing the action decision data as a_t = [ω_t, a_t]^T, where a_t represents the action at time t; ω_t ∈ [0, 2π] represents the flight steering angle of the unmanned aerial vehicle at time t; a_t represents the magnitude of the acceleration of the unmanned aerial vehicle at time t and is continuous bounded data.
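As a hedged illustration of the state and action representations defined above, the following sketch assembles φ(s_t) and a_t from raw measurements; the helper names (uav_pos, device_pos, rates, remaining_energy) and the use of NumPy are assumptions, not part of the patent text.

```python
import numpy as np

def build_state(uav_pos, device_pos, rates, remaining_energy):
    """phi(s_t) = [d_t^1..d_t^N, R_t^1..R_t^N, E_t^res]^T (see the representation above)."""
    # Euclidean distance from the UAV position to each of the N IoT device positions
    dists = np.linalg.norm(device_pos - uav_pos, axis=1)
    return np.concatenate([dists, rates, [remaining_energy]]).astype(np.float32)

def build_action(steering_angle, acceleration):
    """a_t = [omega_t, a_t]^T, with omega_t kept in [0, 2*pi] and a bounded acceleration."""
    omega = np.clip(steering_angle, 0.0, 2 * np.pi)
    return np.array([omega, acceleration], dtype=np.float32)
```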
Optionally, calculating the instantaneous energy efficiency comprises calculating according to the following formula:
r(s_t, a_t) = [the explicit expression is given as an equation image in the original, in terms of the maximum transmission rates R_t^u and the remaining energy of the unmanned aerial vehicle, and is not reproduced here],

where r(s_t, a_t) denotes the instantaneous energy efficiency when the state of the unmanned aerial vehicle at time t is s_t and the action is a_t, R_t^u is the maximum transmission rate at which the Internet of things device u transmits data to the unmanned aerial vehicle at time t, and E_t^res denotes the remaining energy of the unmanned aerial vehicle.
Optionally, training the deep reinforcement learning network by using a PPO algorithm, optimizing strategy parameters, and outputting an optimal strategy through multiple iterative updates, including:
rewriting the objective equation with the PPO algorithm into the following form:

J(θ) = Ê_t[ min( r_t(θ) Â_t, clip(r_t(θ), 1 − ε, 1 + ε) Â_t ) ],

where θ is the strategy parameter to be optimized, ε is a preset constant for controlling the strategy update amplitude, Ê_t[·] denotes the expected value at time t, Â_t represents the advantage function, clip represents the clipping function, and r_t(θ) is the ratio of the new strategy function to the old strategy function within one iterative update, which can be expressed as:

r_t(θ) = π_θ(a_t | s_t) / π_θold(a_t | s_t),

where π_θ represents the strategy function, π_θ(a_t | s_t) denotes the new strategy function when the state at time t is s_t and the action is a_t, and π_θold(a_t | s_t) denotes the old strategy function when the state at time t is s_t and the action is a_t;
solving the advantage function equation as follows:

Â_t = δ_t + (γλ)δ_{t+1} + … + (γλ)^(T−t−1) δ_{T−1},  with  δ_t = r_t + γV(s_{t+1}) − V(s_t),

where γ is the attenuation index and λ is the trajectory parameter; δ_t is the time-difference error value at time t, and δ_{T−1} is the time-difference error value at time T−1; T is the total duration of the autonomous flight;
and solving the maximum value of the target equation through repeated iteration updating so as to optimize the strategy parameters in the strategy function, and outputting the strategy parameters corresponding to the maximum value of the target equation as the optimal strategy.
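A minimal numerical sketch — an illustrative implementation, not the patent's reference code — of evaluating the clipped objective above for one batch of transitions, assuming the per-step advantages and the old and new action log-probabilities are already available; eps plays the role of the constant ε controlling the update amplitude.

```python
import numpy as np

def ppo_clipped_objective(logp_new, logp_old, advantages, eps=0.2):
    """J(theta) = mean_t[ min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t) ]."""
    ratio = np.exp(logp_new - logp_old)              # r_t(theta) = pi_new / pi_old
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)   # clipped probability ratio
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))
```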
Optionally, calculating the instantaneous energy efficiency comprises:
when the unmanned aerial vehicle exhausts its energy on the way back, a penalty term of a preset value is added after the equation of instantaneous energy efficiency is calculated.
In a second aspect of the embodiment of the present invention, there is also provided an unmanned aerial vehicle trajectory optimization device based on deep reinforcement learning, including a construction module, a training data collection module, and a training module;
the construction module is used for constructing a deep reinforcement learning network based on a PPO algorithm;
the training data collection module is used for interacting with the environment in real time in the flight process of the unmanned aerial vehicle, generating state data and action decision data and calculating instantaneous energy efficiency;
and the training module is used for training the deep reinforcement learning network by using the PPO algorithm with state data as input, action decision data as output and instantaneous energy efficiency as reward return, optimizing strategy parameters and outputting an optimal strategy through repeated iterative updating.
Optionally, the construction module is configured to: construct an action network and an evaluation network; fit a state value function with a deep neural network and pass it to the evaluation network, calculate an advantage function through the evaluation network, and pass the advantage function to the action network; and fit a strategy function through the action network with a deep neural network, the strategy function being used by the action network;
and/or a training data collection module to: calculating the distance between the unmanned aerial vehicle and the Internet of things equipment, the transmission rate and the self residual energy as state data; and acquiring the acceleration and the flight direction of the unmanned aerial vehicle as action decision data.
In a third aspect of the embodiment of the present invention, an unmanned aerial vehicle is further provided, which includes a processor, where the processor is configured to execute the above unmanned aerial vehicle trajectory optimization method based on deep reinforcement learning.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the invention discloses an unmanned aerial vehicle track optimization method and device based on deep reinforcement learning and an unmanned aerial vehicle, wherein a deep reinforcement learning technology PPO algorithm is introduced in the unmanned aerial vehicle track optimization, the unmanned aerial vehicle interacts with the environment in real time in the flight process, state data and action data under the current flight track are collected as training data, instantaneous energy efficiency is taken as a return function, and the continuous optimization of strategy parameters of a decision flight track is realized through the real-time autonomous learning of the PPO algorithm, namely the unmanned aerial vehicle is endowed with the online autonomous learning capability in the environment, and can adapt to the change of a dynamic environment according to the requirement; in addition, the autonomous learning based on the PPO algorithm also has the advantage of being not limited by the selection of the learning step length;
in addition, the data objects processed by the autonomous learning method based on the PPO algorithm can be three-dimensional continuous bounded data, such as input data, output data and the like which are not limited to discrete domains, so that flight control optimization of the unmanned aerial vehicle in a three-dimensional space under the continuous domains is realized, and the method is closer to a real scene; compared with a control mode based on discrete domain data or a limited number of coping schemes in a table, the method is more suitable for the requirement of the actual flight environment;
furthermore, when the return function is assigned as the instantaneous energy efficiency of the unmanned aerial vehicle in flight, a penalty term is added whenever the aircraft fails to return and charge/refuel successfully; after continuous learning the unmanned aerial vehicle learns to return in time to avoid losses, improving the energy efficiency of its flight operations.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.
Fig. 1 is a schematic flow chart of an embodiment of an unmanned aerial vehicle trajectory optimization method based on deep reinforcement learning according to the present invention;
fig. 2 is a schematic view of the overall structure and the interaction of related data in another embodiment of the unmanned aerial vehicle trajectory optimization method based on deep reinforcement learning according to the present invention;
fig. 3 is a schematic flow chart of an unmanned aerial vehicle trajectory optimization method based on deep reinforcement learning according to a preferred embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the drawings of the embodiments of the present invention. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the described embodiments of the invention without any inventive step, are within the scope of protection of the invention.
Example 1
Deep reinforcement learning is a machine learning technology that combines reinforcement learning with deep neural networks. Specifically, the reinforcement learning agent collects, by interacting with the environment, the return information obtained from taking different actions in different environment states, and induces and learns the optimal behavior strategy from the collected data, thereby acquiring the ability to adapt to an unknown dynamic environment. The deep neural network can remarkably improve the generalization capability of the algorithm over high-dimensional state and action spaces, thereby providing the ability to adapt to more complex environments.
The embodiment 1 of the invention provides an unmanned aerial vehicle trajectory optimization method based on deep reinforcement learning, and as shown in fig. 1, the method comprises the following steps:
s101, a deep reinforcement learning network based on a PPO algorithm (near-end policy optimization algorithm) is constructed in advance.
This deep reinforcement learning network model can be installed on the unmanned aerial vehicle in advance, before take-off, or can be installed at the Internet of things device end; during flight, the unmanned aerial vehicle exchanges data with the Internet of things device end in real time to realize online autonomous learning.
And S102, interacting with the environment in real time in the flight process of the unmanned aerial vehicle to generate state data and action decision data.
S103, calculating the instantaneous energy efficiency.
And S104, training the deep reinforcement learning network by utilizing a PPO algorithm, and optimizing strategy parameters.
Steps S102 to S104 are executed in a loop, and the network parameters are continuously and iteratively updated with the collected data until the optimal state is finally reached.
And S105, obtaining the trained optimal strategy after repeated iteration updating, and outputting the optimal strategy.
And training the deep reinforcement learning network by taking the state data as input, taking the action decision data as output and taking the instantaneous energy efficiency as reward return, and realizing the optimization of the strategy parameters through repeated iterative updating.
The strategy parameters are the action parameters for determining the flight trajectory, and the optimal strategy is the flight strategy which is obtained by autonomous learning and enables the energy efficiency to be maximized.
Example 2
The embodiment 2 of the invention provides another embodiment of an unmanned aerial vehicle trajectory optimization method based on deep reinforcement learning.
In embodiment 2 of the present invention, the PPO algorithm adopts a deep reinforcement learning structure of an Actor Critic (Actor-Critic) framework, and is composed of two networks, namely, an action network and an evaluation network: the action network utilizes a PPO algorithm and a deep neural network to fit a strategy function and decide actions; the evaluation network utilizes a deep neural network to fit a state cost function and optimize strategy parameters. The overall structure and the related data interaction of the optimization method provided by embodiment 2 of the present invention are shown in fig. 2.
In this embodiment, the unmanned aerial vehicle communication scenario used is one in which a single unmanned aerial vehicle base station provides service for a plurality of fixed Internet of things devices, and the Internet of things devices are activated randomly or periodically to collect data and transmit it to the unmanned aerial vehicle base station.
As one implementable manner, the unmanned aerial vehicle uses the distances to the Internet of things devices, the transmission rates and its own remaining energy as the reinforcement-learning state input to the action network, uses the acceleration and flight direction (i.e., the flight control angle) of the unmanned aerial vehicle as the output behavior, and uses the instantaneous energy efficiency obtained from the environment as the reward. Through continuous interaction with the environment, data consisting of state inputs, action decisions and reward returns are generated and used as training data for the evaluation network and the action network. The evaluation network uses a deep neural network to fit a state value function and provides an advantage function for the optimization of the action network; the action network optimizes the strategy parameters with the PPO algorithm and fits a strategy function with a deep neural network. Through repeated iterative updates, the unmanned aerial vehicle adapts to the environment and obtains the optimal strategy.
As an implementable manner, the method for optimizing the trajectory of the unmanned aerial vehicle based on the deep reinforcement learning provided in embodiment 2 of the present invention may include the following steps:
s201, initializing a reinforcement learning decision strategy, relevant parameters and relevant parameters of a deep neural network.
S202, in a period of preset duration, the unmanned aerial vehicle autonomously flies to complete a task and records related data. The unmanned aerial vehicle calculates the distance to the Internet of things equipment, the transmission rate and the residual energy of the unmanned aerial vehicle, decides a flight track based on the current strategy, receives data sent by the Internet of things equipment, and calculates the instantaneous energy efficiency under the flight track.
S203, the evaluation network fits the state value function from the data collected within the period of the preset duration, calculates the advantage function, and passes the advantage function to the action network. The deep neural network parameters of the action network and the evaluation network are then trained respectively, and the flight strategy of the unmanned aerial vehicle is updated.
S204, repeat step S202 and step S203 until the unmanned aerial vehicle task is finished.
Example 3
Embodiment 3 of the present invention provides a preferred embodiment of an unmanned aerial vehicle trajectory optimization method based on deep reinforcement learning, and further details of an unmanned aerial vehicle communication modeling method used in the present invention and an unmanned aerial vehicle high energy efficiency trajectory optimization method based on deep reinforcement learning are described in this embodiment.
The unmanned aerial vehicle communication model established in this embodiment considers a scenario in which one unmanned aerial vehicle provides delay-tolerant service for N ground Internet of things devices; the Internet of things devices are randomly distributed and fixed in position, and periodically or randomly acquire data and transmit it to the unmanned aerial vehicle. The objective is to optimize the flight trajectory of the unmanned aerial vehicle and maximize the cumulative energy efficiency under the limited-energy condition. To accomplish this goal, the unmanned aerial vehicle should be able to detect its own remaining energy and decide the optimal time to return for charging/refueling.
The specific modeling method is as follows:
s301: the average path loss is calculated.
The communication channel between the unmanned aerial vehicle and the Internet of things equipment adopts a sub-6 GHz air-to-ground link, in which line-of-sight (LoS) transmission is dominant. The average path loss between the unmanned aerial vehicle and the ground Internet of things device u at time t can be expressed as follows:
L_t^u = 20 log10(4π f_c d_t^u / c) + η_LoS,

where f_c represents the center frequency of the signal, d_t^u represents the Euclidean distance between the unmanned aerial vehicle and the device u at time t, c represents the speed of light, and η_LoS, the additional spatial propagation loss of the LoS link, is a constant.
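For illustration under the free-space LoS assumption stated above, the average path loss in dB could be evaluated as in the following sketch; the function name and the argument units (Hz, metres, dB) are assumptions.

```python
import math

def average_path_loss_db(d_tu, f_c, eta_los_db):
    """LoS path loss in dB: 20*log10(4*pi*f_c*d/c) + eta_LoS, with c the speed of light."""
    c = 3.0e8  # speed of light in m/s
    return 20.0 * math.log10(4.0 * math.pi * f_c * d_tu / c) + eta_los_db
```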
S302: calculating the signal-to-noise ratio.
The signal-to-noise ratio (SINR) between the unmanned aerial vehicle and the Internet of things device u at time t can be expressed as:

Γ_t^u = P_u g_t^u / N_0,

where P_u represents the uplink transmission power of device u, g_t^u represents the channel gain between the unmanned aerial vehicle and device u at time t, and N_0 is the noise power. Assuming that the transmission power and the noise power of all devices are the same, the channel gain is determined only by the path loss, so:

g_t^u = 10^(−L_t^u / 10).

Assuming that the Doppler effect caused by the movement of the unmanned aerial vehicle can be perfectly compensated by existing techniques, such as phase-locked-loop techniques, the maximum transmission rate from device u to the unmanned aerial vehicle can therefore be expressed as:

R_t^u = B log2(1 + Γ_t^u),

where B represents the channel bandwidth, assuming that the bandwidth of all devices is the same.
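A short illustrative sketch of the link-budget chain above — channel gain from path loss, signal-to-noise ratio, and the maximum transmission rate — with function and argument names assumed rather than taken from the patent.

```python
import math

def snr_and_rate(p_u, path_loss_db, n0, bandwidth):
    """Gamma_t^u = P_u * g_t^u / N_0 with g_t^u = 10^(-L_t^u/10); R_t^u = B * log2(1 + Gamma_t^u)."""
    g_tu = 10.0 ** (-path_loss_db / 10.0)   # channel gain determined only by the path loss
    gamma = p_u * g_tu / n0                 # signal-to-noise ratio
    rate = bandwidth * math.log2(1.0 + gamma)
    return gamma, rate
```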
S303: and calculating the self residual energy.
The energy loss of the drone includes flight energy loss caused by propulsion and communication related energy loss. The flight energy consumption caused by the propulsion allows the drone to fly in the air, changing the flight trajectory, the power of which is related to the speed and acceleration of the drone, so the flight energy consumption can be expressed as the equation of the flight trajectory q (t), as follows:
Figure BDA0002149653010000087
wherein E (p (t)) is the self energy loss,
Figure BDA0002149653010000088
representing the self residual energy, which is the initial total energy minus the self energy loss, i.e.
Figure BDA0002149653010000089
Wherein E0The initial total energy of the unmanned aerial vehicle before the current flight is obtained. The self energy loss is the integral of the instantaneous energy loss from t-0 to t-t.
p (t) is the instantaneous energy loss,
Figure BDA0002149653010000091
representing the instantaneous speed of the drone,
Figure BDA0002149653010000092
acceleration on behalf of the drone, c1And c2Are two constants that are related to the physical properties of the drone itself, such as wing number and weight. It is to be noted that, here aTDenotes the transpose of a, where "T" is the transpose symbol.
Communication related energy losses include radiation, signal processing, and other circuit consumption, where the energy losses due to signal processing dominate. The energy loss caused by signal processing is independent of the flight of the drone and is an inverse proportional function of the square of the flight time and can be expressed as:
Figure BDA0002149653010000093
wherein E iscompNamely, the communication-related energy loss at the time t, G represents a hardware calculation constant of the node of the unmanned aerial vehicle, D represents the number of bits of data to be processed by the unmanned aerial vehicle, and t is the time t.
In the invention, the self energy loss is flight energy loss plus communication related energy loss; self remaining energy is initial total energy-self energy loss.
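As an assumed, discrete-time sketch of the remaining-energy bookkeeping described above: the instantaneous propulsion power is supplied by a caller-provided helper, since the patent gives its exact expression (with the constants c_1 and c_2) only as an equation image; all identifiers here are illustrative.

```python
def remaining_energy(e0, velocities, accelerations, comm_energy, dt, propulsion_power):
    """E_t^res = E_0 - (flight energy + communication-related energy).

    velocities/accelerations: sequences of per-step v(t), a(t);
    propulsion_power(v, a):   instantaneous propulsion power p(t) (model-dependent helper);
    comm_energy:              accumulated signal-processing energy up to time t;
    dt:                       duration of one discrete time step.
    """
    flight_energy = sum(propulsion_power(v, a) * dt
                        for v, a in zip(velocities, accelerations))
    return e0 - flight_energy - comm_energy
```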
S304: status data is extracted from the flight environment.
The state data is obtained by extraction and calculation from the environment and can be characterized by the following three parts: i) the distance from the unmanned aerial vehicle to each Internet of things device; ii) the transmission rate at which each Internet of things device transmits information to the unmanned aerial vehicle; iii) the self remaining energy. Thus, the state data may be represented as φ(s_t) = [d_t^1, …, d_t^N, R_t^1, …, R_t^N, E_t^res]^T (here, "T" denotes the transpose of the matrix).
S305: motion data is acquired.
The action is issued by the unmanned aerial vehicle to control the flight path and includes the following two parts: i) the flight control angle of the unmanned aerial vehicle at time t, ω_t ∈ [0, 2π]; ii) the acceleration of the unmanned aerial vehicle at time t, a_t. Thus, the action may be collectively denoted as a_t = [ω_t, a_t]^T (here, "T" denotes the transpose of the matrix).
It should be noted that the instantaneous flight speed v(t) and the acceleration a(t) of the unmanned aerial vehicle are three-dimensional, continuous and bounded.
S306, a return function is established.
The return function is defined as the instantaneous energy efficiency, i.e. r(s_t, a_t) [the explicit expression is given as an equation image in the original, in terms of the maximum transmission rates R_t^u and the remaining energy of the unmanned aerial vehicle, and is not reproduced here].
Because the algorithm also needs to decide the return charging/refueling time of the unmanned aerial vehicle automatically, a penalty term with a comparatively large value is added to the return function when the unmanned aerial vehicle exhausts its energy on the way back. Exhausting the energy on the return leg causes the unmanned aerial vehicle to crash, so the return function value is directly set to a large negative number, such as -100. The specific value of the penalty term can be set flexibly by a person skilled in the art according to the actual scenario; it is not unique and the possibilities are not enumerated one by one in the invention.
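A minimal sketch of the crash-penalty logic described above; the ratio sum_rate / energy_consumed merely stands in for the patent's instantaneous energy-efficiency expression (an assumption), and -100 follows the example penalty value given in the text.

```python
def reward(sum_rate, energy_consumed, crashed, penalty=-100.0):
    """Instantaneous energy efficiency as the return, with a crash penalty.

    sum_rate / energy_consumed is an assumed stand-in for the patent's
    energy-efficiency expression; a large negative value is returned if the
    UAV runs out of energy before completing its return flight.
    """
    if crashed:
        return penalty
    return sum_rate / max(energy_consumed, 1e-9)
```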
S307: and establishing a policy function.
A policy-gradient-based reinforcement learning method parameterizes the strategy and models it as a stochastic mapping, namely π_θ: S → P(A), which represents the probability of taking an action from the action set A (i.e., the set of actions a) in any state from the state set S (i.e., the set of states s); θ ∈ R^n is the strategy parameter to be optimized, where R^n denotes the set of n-dimensional real vectors and n equals the dimension of θ.
S308: and establishing an objective equation.
In reinforcement learning, the state value function of a state under strategy π_θ is defined as the long-term cumulative return. When the state is s and the strategy is π_θ, the state value function has the form:

V^{π_θ}(s) = E_{π_θ}[ Σ_{k=0}^{∞} γ^k r_{t+k} | s_t = s ],

where γ is a discount factor with value range γ ∈ [0, 1]. Similarly, under strategy π_θ, the state-action value function for action a may be defined as:

Q^{π_θ}(s, a) = E_{π_θ}[ Σ_{k=0}^{∞} γ^k r_{t+k} | s_t = s, a_t = a ].

The objective equation of reinforcement learning is defined as:

J(θ) = E_{s ∼ ρ^{π_θ}, a ∼ π_θ}[ r(s, a) ],

where ρ^{π_θ} is the discounted state visitation probability distribution under strategy π_θ.
Therefore, the final reinforcement-learning-based trajectory optimization problem of the unmanned aerial vehicle is obtained as:

max_θ J(θ)  subject to C1, C2,

where C1 and C2 are the constraint conditions on the flight speed and the acceleration of the unmanned aerial vehicle, respectively.
The strategy gradient method can be applied to optimize the strategy π_θ so as to maximize the objective equation. The gradient of the objective equation with respect to the parameter θ can be expressed as:

∇_θ J(θ) = E_{π_θ}[ ∇_θ log π_θ(a_t | s_t) (R_t − b_t) ],

where b_t is a constant baseline introduced into the return to reduce the variance of the strategy gradient; introducing such a constant leaves the strategy gradient unchanged while reducing its variance. In particular, b_t is typically chosen as the estimated value of the state value function V_θ(s_t), so that R_t − b_t can be regarded as an estimate of the advantage function A(a_t, s_t) = Q(a_t, s_t) − V(s_t).
In practice the strategy gradient generally has a large variance, so the algorithm is strongly affected by the parameter update. According to the strategy gradient algorithm, the parameter update equation is

θ ← θ + α ∇_θ J(θ),

where α is the update step size; when the step size is not appropriate, the strategy corresponding to the updated parameters may be a worse strategy.
The trust-region method TRPO (trust region policy optimization) improves the robustness of the algorithm by limiting how much the strategy may change in each iteration. The deep reinforcement learning algorithm PPO inherits the advantages of the trust-region method while being simpler and more general to implement, and empirically has better sample complexity.
S309: and rewriting the target equation by adopting a PPO algorithm.
With the PPO algorithm, the objective equation can be rewritten as:

J(θ) = Ê_t[ min( r_t(θ) Â_t, clip(r_t(θ), 1 − ε, 1 + ε) Â_t ) ],

where θ is the parameter to be optimized in the strategy function, and ε is a preset fixed value, typically 0.1-0.3, whose purpose is to control the update amplitude of the strategy. Ê_t[·] is the mathematical expectation symbol and denotes averaging over time t. r_t(θ) is the ratio of the new strategy function to the old strategy function and can be expressed as:

r_t(θ) = π_θ(a_t | s_t) / π_θold(a_t | s_t).

The old strategy function and the new strategy function refer to one iterative update: the strategy function after the update is the new strategy function and the strategy function before the update is the old strategy function.
The advantage function equation is:

Â_t = δ_t + (γλ)δ_{t+1} + … + (γλ)^(T−t−1) δ_{T−1},  with  δ_t = r_t + γV(s_{t+1}) − V(s_t),

where γ is the attenuation index and is a preset fixed value; λ is the trajectory parameter and is also a preset fixed value; the range of γ is (0, 1) and the range of λ is (0, 1). δ_t is the time-difference (temporal-difference) error value at time t, whose mathematical expression is given in the second part of the formula above; δ_{T−1} is the time-difference error value at time T−1, and T is the total duration of the autonomous flight.
It is noted that computing the advantage function at time t requires all of the data collected from time t up to the end of the period T.
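For illustration only, the truncated advantage estimate Â_t above can be computed with the following sketch, which uses the mathematically equivalent backward recursion over the TD errors δ_t; the array names and the default γ, λ values are assumptions.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """A_hat_t = delta_t + (gamma*lam)*delta_{t+1} + ..., with delta_t = r_t + gamma*V(s_{t+1}) - V(s_t).

    rewards: r_0..r_{T-1}; values: V(s_0)..V(s_T) (one extra bootstrap value at the end).
    """
    T = len(rewards)
    adv = np.zeros(T, dtype=np.float64)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv
```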
Therefore, the invention introduces deep neural networks in two places: one to represent the state-action value function, Q_ω(s, a) ≈ Q^π(s, a), and learn the parameter ω, and the other to represent the strategy function, π_θ(s) ≈ π(s), and learn the parameter θ.
Specifically, referring to fig. 3, a specific process of deep reinforcement learning PPO algorithm in the embodiment of the present invention is as follows:
Initialize each parameter of the deep reinforcement learning neural networks: randomly assign values to the parameters ω and θ, set the autonomous flight duration to T, set the numbers of iterations of the two deep neural networks to M and B respectively, set ε to 0.2, set γ to 0.99, and set the total task duration to L.
For episode = 1, …, L do (execute a loop from the 1st time segment to the Lth time segment):
based on the current strategy π_θ, make autonomous action decisions for T consecutive steps while collecting the interaction tuples {s_t, a_t, r_t} from the environment; using the collected tuples {s_t, a_t, r_t}, estimate the advantage function Â_t with the deep neural network;
calculate the objective function J(θ) and update the parameter θ with a gradient descent method, iterating M times;
calculate the loss function of the evaluation network [given as an equation image in the original and not reproduced here] and update the parameter ω with a gradient descent method, iterating B times.
End for (end of the loop).
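Purely as an illustrative sketch of the loop described above — not the patent's reference implementation — assuming the ActionNetwork/EvaluationNetwork classes and the gae_advantages helper sketched earlier, and a hypothetical environment object env whose reset()/step() calls return (state), and (state, reward, done), respectively.

```python
import numpy as np
import torch

def train(env, actor, critic, L=1000, T=200, M=10, B=10,
          eps=0.2, gamma=0.99, lam=0.95):
    """One possible realisation: collect T steps, then update theta M times and omega B times."""
    opt_actor = torch.optim.Adam(actor.parameters(), lr=3e-4)
    opt_critic = torch.optim.Adam(critic.parameters(), lr=1e-3)
    for episode in range(L):
        # collect up to T interaction tuples {s_t, a_t, r_t} under the current policy
        states, actions, rewards, old_logps = [], [], [], []
        s = env.reset()
        for _ in range(T):
            mu, std = actor(torch.as_tensor(s, dtype=torch.float32))
            dist = torch.distributions.Normal(mu, std)
            a = dist.sample()
            old_logps.append(dist.log_prob(a).sum().item())
            states.append(s)
            actions.append(a.numpy())
            s, r, done = env.step(a.numpy())
            rewards.append(r)
            if done:
                break
        S = torch.as_tensor(np.asarray(states), dtype=torch.float32)
        A = torch.as_tensor(np.asarray(actions), dtype=torch.float32)
        old_lp = torch.as_tensor(old_logps, dtype=torch.float32)
        with torch.no_grad():
            V = critic(S)                                  # V(s_0..s_{T-1})
            values = torch.cat([V, torch.zeros(1)])        # bootstrap 0 at the end
        adv = torch.as_tensor(
            gae_advantages(rewards, values.numpy(), gamma, lam), dtype=torch.float32)
        returns = adv + V
        for _ in range(M):                                 # actor: maximise the clipped objective
            mu, std = actor(S)
            dist = torch.distributions.Normal(mu, std)
            new_lp = dist.log_prob(A).sum(-1)
            ratio = torch.exp(new_lp - old_lp)
            clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
            loss_pi = -torch.min(ratio * adv, clipped * adv).mean()
            opt_actor.zero_grad()
            loss_pi.backward()
            opt_actor.step()
        for _ in range(B):                                 # critic: fit V_omega to the returns
            loss_v = ((critic(S) - returns) ** 2).mean()
            opt_critic.zero_grad()
            loss_v.backward()
            opt_critic.step()
```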
The embodiment of the invention provides an energy-efficient unmanned aerial vehicle trajectory optimization scheme based on the deep reinforcement learning PPO algorithm. In this scheme, the remaining energy of the unmanned aerial vehicle is included in the state value input to the reinforcement learning network, and the flight speed, acceleration, flight direction and return time of the unmanned aerial vehicle are output directly. Through online learning, the scheme dynamically adjusts the learned strategy according to environmental changes and thus adapts to the environment. Meanwhile, the scheme addresses the control problem in the continuous domain and conforms to the continuous-domain flight control mechanism in real scenarios. Furthermore, the PPO algorithm is a continuous-domain control algorithm with strong robustness and outstanding performance; it removes the drawback that an appropriate learning step size is difficult to determine, and reduces the complexity of the algorithm.
Example 4
The embodiment of the invention also provides an unmanned aerial vehicle track optimization device based on deep reinforcement learning, which comprises a construction module, a training data collection module and a training module.
The building module is used for building a deep reinforcement learning network based on a PPO algorithm; the training data collection module is used for interacting with the environment in real time in the flight process of the unmanned aerial vehicle, generating state data and action decision data and calculating instantaneous energy efficiency; and the training module is used for training the deep reinforcement learning network by using the PPO algorithm with state data as input, action decision data as output and instantaneous energy efficiency as reward return, optimizing strategy parameters and outputting an optimal strategy through repeated iterative updating.
Example 5
The embodiment of the invention also provides the unmanned aerial vehicle, which comprises a processor, wherein the processor is used for executing the unmanned aerial vehicle track optimization method based on the deep reinforcement learning.
In conclusion, the invention introduces the deep reinforcement learning PPO algorithm to perform autonomous exploratory learning of environmental information, aims to improve the energy efficiency of the unmanned aerial vehicle, and intelligently decides and optimizes the flight path and the return flight time.
Compared with the prior art, the invention achieves the following technical effects:
firstly, the capability of the invention in adapting to scenes and environments is stronger than the scheme of adopting a convex optimization algorithm in the prior art. Because a reinforcement learning algorithm is introduced, strategy parameters are optimized in the learning process instead of being based on a fixed target equation, so that the method has stronger flexibility; in addition, the deep reinforcement learning network strengthens the interaction with the external environment by inputting the environment state and acquiring the reward, and can more quickly respond to the change of the scene and the environment.
Secondly, compared with the prior-art scheme based on Q-learning, the invention adopts a continuous-domain trajectory optimization scheme: the continuous speed and acceleration actions output by reinforcement learning are closer to the actual situation, the flight area is easy to extend, and the potential problem of dimensional explosion does not arise during large-area trajectory optimization.
In the prior art, the DDPG algorithm is used for continuous-domain machine control; that method has the drawback that an appropriate learning step size is not easy to determine, and the choice of hyper-parameters has a great influence on the optimization result.
Compared with an optimization scheme updated with the deep deterministic policy gradient (DDPG) algorithm, the PPO algorithm is less affected by the training step size and adapts better when solving control problems in real scenarios; it overcomes the prior-art difficulty of determining the learning step size with DDPG and is more efficient.
In addition, the invention also considers the optimal return charging/refueling time, so that the unmanned plane can flexibly adjust the flight time and the track under the condition of safe return, and the energy utilization efficiency of the unmanned plane is improved as much as possible.
In one or more exemplary designs, the functions may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage media may be any available media that can be accessed by a general purpose or special purpose computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a general-purpose or special-purpose computer, or a general-purpose or special-purpose processor. Also, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, includes Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk, blu-ray disc where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principle and the implementation manner of the present invention are explained by applying specific examples, the above description of the embodiments is only used to help understanding the method of the present invention and the core idea thereof, the described embodiments are only a part of the embodiments of the present invention, not all embodiments, and all other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative efforts belong to the protection scope of the present invention.

Claims (8)

1. The unmanned aerial vehicle track optimization method based on deep reinforcement learning is characterized by comprising the following steps:
a deep reinforcement learning network based on a PPO algorithm is constructed in advance;
interacting with the environment in real time during the flight of the unmanned aerial vehicle to generate state data and action decision data, and calculating the instantaneous energy efficiency; wherein calculating the instantaneous energy efficiency comprises calculating according to the following formula:

r(s_t, a_t) = [the explicit expression is given as an equation image in the original, in terms of the maximum transmission rates R_t^u and the remaining energy of the unmanned aerial vehicle, and is not reproduced here],

where r(s_t, a_t) denotes the instantaneous energy efficiency when the state of the unmanned aerial vehicle at time t is s_t and the action is a_t, R_t^u is the maximum transmission rate at which the Internet of things device u transmits data to the unmanned aerial vehicle at time t, E_t^res denotes the remaining energy of the unmanned aerial vehicle, s_t denotes the state of the unmanned aerial vehicle at time t, and a_t denotes the action of the unmanned aerial vehicle at time t;
training the deep reinforcement learning network by using the PPO algorithm with the state data as input, the action decision data as output and the instantaneous energy efficiency as reward return, optimizing strategy parameters, and outputting an optimal strategy through repeated iterative updating;
the method comprises the following steps of training the deep reinforcement learning network by utilizing a PPO algorithm, optimizing strategy parameters, and outputting an optimal strategy through repeated iteration updating, and comprises the following steps:
and (3) rewriting the target equation into the following equation by adopting a PPO algorithm:
Figure FDA0002500672040000014
wherein theta is a strategy parameter to be optimized and is a preset constant for controlling strategy updating amplitude,
Figure FDA0002500672040000015
for the desired value of the time t,
Figure FDA0002500672040000016
representing a dominance function, clip representing a clipping function, rt(θ) is the ratio of the old policy function to the new policy function in one iteration of the update, and can be expressed as:
Figure FDA0002500672040000017
wherein piθRepresenting a policy function, piθ(at|st) Indicates that the state at time t is stThe action is atThe new policy function of (2) is,
Figure FDA0002500672040000018
indicates that the state at time t is stThe action is atOld policy function of;
the dominant function equation is solved as follows:
Figure FDA0002500672040000019
wherein gamma is an attenuation index and lambda is a trajectory parameter;tfor a time differential error value at time t,T-1the time difference error value at the T-1 moment; t is the total duration of the autonomous flight;
and solving the maximum value of the target equation through multiple iterative updating so as to optimize the strategy parameters in the strategy function, and outputting the strategy parameters corresponding to the maximum value of the target equation as the optimal strategy.
2. The unmanned aerial vehicle trajectory optimization method based on deep reinforcement learning of claim 1, wherein the step of pre-constructing a deep reinforcement learning network based on a PPO algorithm comprises:
constructing a deep learning network structure comprising an action network and an evaluation network;
the action network utilizes a PPO algorithm and a deep neural network to fit a strategy function and decide flight actions; the evaluation network utilizes a deep neural network to fit a state cost function and optimize strategy parameters in the strategy function.
3. The method for unmanned aerial vehicle trajectory optimization based on deep reinforcement learning of claim 1, wherein the step of generating state data and action decision data comprises:
calculating the distance between the unmanned aerial vehicle and the Internet of things equipment, the transmission rate and the self residual energy as state data;
and acquiring the acceleration and the flight direction of the unmanned aerial vehicle as action decision data.
4. The method of claim 3, wherein the step of generating state data and action decision data comprises:
quantizing the state data and representing it as:

φ(s_t) = [d_t^1, …, d_t^N, R_t^1, …, R_t^N, E_t^res]^T,

where φ(s_t) represents the state data matrix, s_t represents the state at time t, d_t^1, …, d_t^N respectively represent the Euclidean distances between the 1st to Nth Internet of things devices and the unmanned aerial vehicle at time t, R_t^1, …, R_t^N respectively represent the transmission rates at which the 1st to Nth Internet of things devices transmit information to the unmanned aerial vehicle at time t, and E_t^res represents the remaining energy of the unmanned aerial vehicle at time t;
representing the action decision data as a_t = [ω_t, a_t]^T, where a_t represents the action at time t; ω_t ∈ [0, 2π] represents the flight steering angle of the unmanned aerial vehicle at time t; a_t represents the magnitude of the acceleration of the unmanned aerial vehicle at time t and is continuous bounded data.
5. The method of any one of claims 1-4, wherein the step of calculating the instantaneous energy efficiency comprises:
when the unmanned aerial vehicle exhausts its energy on the way back, a penalty term of a preset value is added after the equation of instantaneous energy efficiency is calculated.
6. The unmanned aerial vehicle track optimization device based on deep reinforcement learning is characterized by comprising a construction module, a training data collection module and a training module;
the construction module is used for constructing a deep reinforcement learning network based on a PPO algorithm;
the training data collection module is used for interacting with the environment in real time during the flight of the unmanned aerial vehicle, generating state data and action decision data, and calculating the instantaneous energy efficiency; wherein calculating the instantaneous energy efficiency comprises calculating according to the following formula:

r(s_t, a_t) = [the explicit expression is given as an equation image in the original, in terms of the maximum transmission rates R_t^u and the remaining energy of the unmanned aerial vehicle, and is not reproduced here],

where r(s_t, a_t) denotes the instantaneous energy efficiency when the state of the unmanned aerial vehicle at time t is s_t and the action is a_t, R_t^u is the maximum transmission rate at which the Internet of things device u transmits data to the unmanned aerial vehicle at time t, E_t^res denotes the remaining energy of the unmanned aerial vehicle, s_t denotes the state of the unmanned aerial vehicle at time t, and a_t denotes the action of the unmanned aerial vehicle at time t;
the training module is used for training the deep reinforcement learning network by using the PPO algorithm with the state data as input, the action decision data as output and the instantaneous energy efficiency as reward return, optimizing strategy parameters, and outputting an optimal strategy through repeated iterative updating;
the method comprises the following steps of training the deep reinforcement learning network by utilizing a PPO algorithm, optimizing strategy parameters, and outputting an optimal strategy through repeated iteration updating, and comprises the following steps:
and (3) rewriting the target equation into the following equation by adopting a PPO algorithm:
Figure FDA0002500672040000034
wherein theta is a strategy parameter to be optimized and is a preset constant for controlling strategy updating amplitude,
Figure FDA0002500672040000035
for the desired value of the time t,
Figure FDA0002500672040000036
representing a dominance function, clip representing a clipping function, rt(θ) is the ratio of the old policy function to the new policy function in one iteration of the update, and can be expressed as:
Figure FDA0002500672040000037
wherein piθTo representPolicy function, piθ(at|st) Indicates that the state at time t is stThe action is atThe new policy function of (2) is,
Figure FDA0002500672040000038
indicates that the state at time t is stThe action is atOld policy function of;
the dominant function equation is solved as follows:
Figure FDA0002500672040000041
wherein gamma is an attenuation index and lambda is a trajectory parameter;tfor a time differential error value at time t,T-1the time difference error value at the T-1 moment; t is the total duration of the autonomous flight;
and solving the maximum value of the target equation through multiple iterative updating so as to optimize the strategy parameters in the strategy function, and outputting the strategy parameters corresponding to the maximum value of the target equation as the optimal strategy.
7. The unmanned aerial vehicle trajectory optimization device based on deep reinforcement learning of claim 6, wherein:
the building module is configured to: constructing a mobile network and an evaluation network; fitting a state cost function by using a deep neural network and transmitting the state cost function into the evaluation network, calculating an advantage function through the evaluation network, and transmitting the advantage function into the action network; fitting a policy function through the action network by using a deep neural network, and transmitting the policy function into the action network;
and/or the training data collection module is configured to: calculating the distance between the unmanned aerial vehicle and the Internet of things equipment, the transmission rate and the self residual energy as state data; and acquiring the acceleration and the flight direction of the unmanned aerial vehicle as action decision data.
8. A drone comprising a processor, wherein the processor is configured to perform the method of drone trajectory optimization based on deep reinforcement learning of any one of claims 1-4.
CN201910697007.6A 2019-07-30 2019-07-30 Unmanned aerial vehicle track optimization method and device based on deep reinforcement learning and unmanned aerial vehicle Active CN110488861B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910697007.6A CN110488861B (en) 2019-07-30 2019-07-30 Unmanned aerial vehicle track optimization method and device based on deep reinforcement learning and unmanned aerial vehicle
PCT/CN2019/114200 WO2021017227A1 (en) 2019-07-30 2019-10-30 Path optimization method and device for unmanned aerial vehicle, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910697007.6A CN110488861B (en) 2019-07-30 2019-07-30 Unmanned aerial vehicle track optimization method and device based on deep reinforcement learning and unmanned aerial vehicle

Publications (2)

Publication Number Publication Date
CN110488861A CN110488861A (en) 2019-11-22
CN110488861B true CN110488861B (en) 2020-08-28

Family

ID=68548830

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910697007.6A Active CN110488861B (en) 2019-07-30 2019-07-30 Unmanned aerial vehicle track optimization method and device based on deep reinforcement learning and unmanned aerial vehicle

Country Status (2)

Country Link
CN (1) CN110488861B (en)
WO (1) WO2021017227A1 (en)

Families Citing this family (72)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110879595A (en) * 2019-11-29 2020-03-13 江苏徐工工程机械研究院有限公司 Unmanned mine card tracking control system and method based on deep reinforcement learning
CN110958680B (en) * 2019-12-09 2022-09-13 长江师范学院 Energy efficiency-oriented unmanned aerial vehicle cluster multi-agent deep reinforcement learning optimization method
CN111132192B (en) * 2019-12-13 2023-01-17 广东工业大学 Unmanned aerial vehicle base station online track optimization method
CN111123963B (en) * 2019-12-19 2021-06-08 南京航空航天大学 Unknown environment autonomous navigation system and method based on reinforcement learning
CN111026147B (en) * 2019-12-25 2021-01-08 北京航空航天大学 Zero overshoot unmanned aerial vehicle position control method and device based on deep reinforcement learning
CN111191728B (en) * 2019-12-31 2023-05-09 中国电子科技集团公司信息科学研究院 Deep reinforcement learning distributed training method and system based on asynchronization or synchronization
CN111314929B (en) * 2020-01-20 2023-06-09 浙江工业大学 Contract-based unmanned aerial vehicle edge cache strategy and rewarding optimization method
CN111385806B (en) * 2020-02-18 2021-10-26 清华大学 Unmanned aerial vehicle base station path planning and bandwidth resource allocation method and device
CN111263332A (en) * 2020-03-02 2020-06-09 湖北工业大学 Unmanned aerial vehicle track and power joint optimization method based on deep reinforcement learning
CN111381499B (en) * 2020-03-10 2022-09-27 东南大学 Internet-connected aircraft self-adaptive control method based on three-dimensional space radio frequency map learning
CN111565065B (en) * 2020-03-24 2021-06-04 北京邮电大学 Unmanned aerial vehicle base station deployment method and device and electronic equipment
CN111580544B (en) * 2020-03-25 2021-05-07 北京航空航天大学 Unmanned aerial vehicle target tracking control method based on reinforcement learning PPO algorithm
CN111460650B (en) * 2020-03-31 2022-11-01 北京航空航天大学 Unmanned aerial vehicle end-to-end control method based on deep reinforcement learning
CN112180967B (en) * 2020-04-26 2022-08-19 北京理工大学 Multi-unmanned aerial vehicle cooperative countermeasure decision-making method based on evaluation-execution architecture
CN111552313B (en) * 2020-04-29 2022-06-28 南京理工大学 Multi-unmanned aerial vehicle path planning method based on edge calculation dynamic task arrival
US20220308598A1 (en) * 2020-04-30 2022-09-29 Rakuten Group, Inc. Learning device, information processing device, and learned control model
CN112198870B (en) * 2020-06-01 2022-09-02 西北工业大学 Unmanned aerial vehicle autonomous guiding maneuver decision method based on DDQN
CN111786713B (en) * 2020-06-04 2021-06-08 大连理工大学 Unmanned aerial vehicle network hovering position optimization method based on multi-agent deep reinforcement learning
CN111752304B (en) * 2020-06-23 2022-10-14 深圳清华大学研究院 Unmanned aerial vehicle data acquisition method and related equipment
CN111724001B (en) * 2020-06-29 2023-08-29 重庆大学 Aircraft detection sensor resource scheduling method based on deep reinforcement learning
CN112097783B (en) * 2020-08-14 2022-05-20 广东工业大学 Electric taxi charging navigation path planning method based on deep reinforcement learning
CN112068590A (en) * 2020-08-21 2020-12-11 广东工业大学 Unmanned aerial vehicle base station flight planning method and system, storage medium and unmanned aerial vehicle base station
CN112235810B (en) * 2020-09-17 2021-07-09 广州番禺职业技术学院 Multi-dimensional optimization method and system of unmanned aerial vehicle communication system based on reinforcement learning
CN112051863A (en) * 2020-09-25 2020-12-08 南京大学 Unmanned aerial vehicle autonomous anti-reconnaissance and enemy attack avoidance method
CN112362522B (en) * 2020-10-23 2022-08-02 浙江中烟工业有限责任公司 Tobacco leaf volume weight measuring method based on reinforcement learning
CN114527737A (en) * 2020-11-06 2022-05-24 百度在线网络技术(北京)有限公司 Speed planning method, device, equipment, medium and vehicle for automatic driving
CN112566209A (en) * 2020-11-24 2021-03-26 山西三友和智慧信息技术股份有限公司 UAV-BSs energy and service priority track design method based on double Q learning
CN112711271B (en) * 2020-12-16 2022-05-17 中山大学 Autonomous navigation unmanned aerial vehicle power optimization method based on deep reinforcement learning
CN112865855B (en) * 2021-01-04 2022-04-08 福州大学 High-efficiency wireless covert transmission method based on unmanned aerial vehicle relay
CN112819215B (en) * 2021-01-26 2024-01-12 北京百度网讯科技有限公司 Recommendation strategy training method and device, electronic equipment and readable storage medium
CN112791394B (en) * 2021-02-02 2022-09-30 腾讯科技(深圳)有限公司 Game model training method and device, electronic equipment and storage medium
CN115046433B (en) * 2021-03-09 2023-04-07 北京理工大学 Aircraft time collaborative guidance method based on deep reinforcement learning
CN113159386A (en) * 2021-03-22 2021-07-23 中国科学技术大学 Unmanned aerial vehicle return state estimation method and system
CN113050673B (en) * 2021-03-25 2021-12-28 四川大学 Three-dimensional trajectory optimization method for high-energy-efficiency unmanned aerial vehicle of auxiliary communication system
CN113115344B (en) * 2021-04-19 2021-12-14 中国人民解放军火箭军工程大学 Unmanned aerial vehicle base station communication resource allocation strategy prediction method based on noise optimization
CN113110546B (en) * 2021-04-20 2022-09-23 南京大学 Unmanned aerial vehicle autonomous flight control method based on offline reinforcement learning
CN113110550B (en) * 2021-04-23 2022-09-23 南京大学 Unmanned aerial vehicle flight control method based on reinforcement learning and network model distillation
CN113316239B (en) * 2021-05-10 2022-07-08 北京科技大学 Unmanned aerial vehicle network transmission power distribution method and device based on reinforcement learning
CN113258989B (en) * 2021-05-17 2022-06-03 东南大学 Method for obtaining relay track of unmanned aerial vehicle by using reinforcement learning
CN113110516B (en) * 2021-05-20 2023-12-22 广东工业大学 Operation planning method for limited space robot with deep reinforcement learning
CN113283169B (en) * 2021-05-24 2022-04-26 北京理工大学 Three-dimensional group exploration method based on multi-head attention asynchronous reinforcement learning
CN113255218B (en) * 2021-05-27 2022-05-31 电子科技大学 Unmanned aerial vehicle autonomous navigation and resource scheduling method of wireless self-powered communication network
CN113419548A (en) * 2021-05-28 2021-09-21 北京控制工程研究所 Spacecraft deep reinforcement learning Levier flight control system
CN113157002A (en) * 2021-05-28 2021-07-23 南开大学 Air-ground cooperative full-coverage trajectory planning method based on multiple unmanned aerial vehicles and multiple base stations
CN113382060B (en) * 2021-06-07 2022-03-22 北京理工大学 Unmanned aerial vehicle track optimization method and system in Internet of things data collection
CN113268074B (en) * 2021-06-07 2022-05-13 哈尔滨工程大学 Unmanned aerial vehicle flight path planning method based on joint optimization
CN113543068B (en) * 2021-06-07 2024-02-02 北京邮电大学 Forest area unmanned aerial vehicle network deployment method and system based on hierarchical clustering
CN113507717A (en) * 2021-06-08 2021-10-15 山东师范大学 Unmanned aerial vehicle track optimization method and system based on vehicle track prediction
CN113283013B (en) * 2021-06-10 2022-07-19 北京邮电大学 Multi-unmanned aerial vehicle charging and task scheduling method based on deep reinforcement learning
CN113423060B (en) * 2021-06-22 2022-05-10 广东工业大学 Online optimization method for flight route of unmanned aerial communication platform
CN113377131B (en) * 2021-06-23 2022-06-03 东南大学 Method for acquiring unmanned aerial vehicle collected data track by using reinforcement learning
CN113239639B (en) * 2021-06-29 2022-08-26 暨南大学 Policy information generation method, policy information generation device, electronic device, and storage medium
CN113359480B (en) * 2021-07-16 2022-02-01 中国人民解放军火箭军工程大学 Multi-unmanned aerial vehicle and user cooperative communication optimization method based on MAPPO algorithm
CN113776531A (en) * 2021-07-21 2021-12-10 电子科技大学长三角研究院(湖州) Multi-unmanned-aerial-vehicle autonomous navigation and task allocation algorithm of wireless self-powered communication network
CN113639755A (en) * 2021-08-20 2021-11-12 江苏科技大学苏州理工学院 Fire scene escape-rescue combined system based on deep reinforcement learning
CN113721655B (en) * 2021-08-26 2023-06-16 南京大学 Control period self-adaptive reinforcement learning unmanned aerial vehicle stable flight control method
CN114200950B (en) * 2021-10-26 2023-06-02 北京航天自动控制研究所 Flight attitude control method
CN114117633A (en) * 2021-11-18 2022-03-01 中国人民解放军国防科技大学 Unmanned aerial vehicle information collection control method and system
CN113885549B (en) * 2021-11-23 2023-11-21 江苏科技大学 Four-rotor gesture track control method based on dimension clipping PPO algorithm
CN114142912B (en) * 2021-11-26 2023-01-06 西安电子科技大学 Resource control method for guaranteeing time coverage continuity of high-dynamic air network
CN114268986A (en) * 2021-12-14 2022-04-01 北京航空航天大学 Unmanned aerial vehicle computing unloading and charging service efficiency optimization method
CN114372612B (en) * 2021-12-16 2023-04-28 电子科技大学 Path planning and task unloading method for unmanned aerial vehicle mobile edge computing scene
CN114384931B (en) * 2021-12-23 2023-08-29 同济大学 Multi-target optimal control method and equipment for unmanned aerial vehicle based on strategy gradient
CN114550540A (en) * 2022-02-10 2022-05-27 北方天途航空技术发展(北京)有限公司 Intelligent monitoring method, device, equipment and medium for training machine
CN114785397B (en) * 2022-03-11 2023-04-07 成都三维原光通讯技术有限公司 Unmanned aerial vehicle base station control method, flight trajectory optimization model construction and training method
CN114741886B (en) * 2022-04-18 2022-11-22 中国人民解放军军事科学院战略评估咨询中心 Unmanned aerial vehicle cluster multi-task training method and system based on contribution degree evaluation
CN115202377B (en) * 2022-06-13 2023-06-09 北京理工大学 Fuzzy self-adaptive NMPC track tracking control and energy management method
CN115061371B (en) * 2022-06-20 2023-08-04 中国航空工业集团公司沈阳飞机设计研究所 Unmanned plane control strategy reinforcement learning generation method capable of preventing strategy jitter
CN115167506B (en) * 2022-06-27 2024-06-28 华南师范大学 Method, device, equipment and storage medium for updating and planning unmanned aerial vehicle flight route
CN115001002B (en) * 2022-08-01 2022-12-30 广东电网有限责任公司肇庆供电局 Optimal scheduling method and system for solving problem of energy storage participation peak clipping and valley filling
CN116741019A (en) * 2023-08-11 2023-09-12 成都飞航智云科技有限公司 Flight model training method and training system based on AI
CN116736729B (en) * 2023-08-14 2023-10-27 成都蓉奥科技有限公司 Method for generating perception error-resistant maneuvering strategy of air combat in line of sight

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101002125B1 (en) * 2008-08-05 2010-12-16 주식회사 케이티 Apparatus and method of policy modeling based on partially observable markov decision processes
KR101813697B1 (en) * 2015-12-22 2017-12-29 한국항공대학교산학협력단 Unmanned aerial vehicle flight control system and method using deep learning
CN106019950B (en) * 2016-08-09 2018-11-16 中国科学院软件研究所 A kind of mobile phone satellite Adaptive Attitude control method
CN106168808A (en) * 2016-08-25 2016-11-30 南京邮电大学 A kind of rotor wing unmanned aerial vehicle automatic cruising method based on degree of depth study and system thereof
CN106970615B (en) * 2017-03-21 2019-10-22 西北工业大学 A kind of real-time online paths planning method of deeply study
CN108319286B (en) * 2018-03-12 2020-09-22 西北工业大学 Unmanned aerial vehicle air combat maneuver decision method based on reinforcement learning
CN108594638B (en) * 2018-03-27 2020-07-24 南京航空航天大学 Spacecraft ACS (auto-configuration transform) on-orbit reconstruction method oriented to multitask and multi-index optimization constraints
CN109445456A (en) * 2018-10-15 2019-03-08 清华大学 A kind of multiple no-manned plane cluster air navigation aid
CN109343341B (en) * 2018-11-21 2021-10-01 北京航天自动控制研究所 Carrier rocket vertical recovery intelligent control method based on deep reinforcement learning
CN109639377B (en) * 2018-12-13 2021-03-23 西安电子科技大学 Spectrum resource management method based on deep reinforcement learning
CN109443366B (en) * 2018-12-20 2020-08-21 北京航空航天大学 Unmanned aerial vehicle group path planning method based on improved Q learning algorithm
CN109933086B (en) * 2019-03-14 2022-08-30 天津大学 Unmanned aerial vehicle environment perception and autonomous obstacle avoidance method based on deep Q learning

Also Published As

Publication number Publication date
WO2021017227A1 (en) 2021-02-04
CN110488861A (en) 2019-11-22

Similar Documents

Publication Publication Date Title
CN110488861B (en) Unmanned aerial vehicle track optimization method and device based on deep reinforcement learning and unmanned aerial vehicle
Bayerlein et al. Trajectory optimization for autonomous flying base station via reinforcement learning
CN113162679B (en) DDPG algorithm-based IRS (intelligent resilient software) assisted unmanned aerial vehicle communication joint optimization method
CN110531617B (en) Multi-unmanned aerial vehicle 3D hovering position joint optimization method and device and unmanned aerial vehicle base station
WO2021208771A1 (en) Reinforced learning method and device
CN113543176B (en) Unloading decision method of mobile edge computing system based on intelligent reflecting surface assistance
CN110956148B (en) Autonomous obstacle avoidance method and device for unmanned vehicle, electronic equipment and readable storage medium
CN114422056B (en) Space-to-ground non-orthogonal multiple access uplink transmission method based on intelligent reflecting surface
CN113191484A (en) Federal learning client intelligent selection method and system based on deep reinforcement learning
CN113433967B (en) Chargeable unmanned aerial vehicle path planning method and system
CN113568727A (en) Mobile edge calculation task allocation method based on deep reinforcement learning
CN114261400B (en) Automatic driving decision method, device, equipment and storage medium
CN113469325A (en) Layered federated learning method, computer equipment and storage medium for edge aggregation interval adaptive control
CN114302407B (en) Network decision method and device, electronic equipment and storage medium
CN112804103B (en) Intelligent computing migration method for joint resource allocation and control in block chain energized Internet of things
CN113406965A (en) Unmanned aerial vehicle energy consumption optimization method based on reinforcement learning
CN116227767A (en) Multi-unmanned aerial vehicle base station collaborative coverage path planning method based on deep reinforcement learning
CN113377131A (en) Method for obtaining unmanned aerial vehicle collected data track by using reinforcement learning
CN116627162A (en) Multi-agent reinforcement learning-based multi-unmanned aerial vehicle data acquisition position optimization method
CN113382060B (en) Unmanned aerial vehicle track optimization method and system in Internet of things data collection
Han et al. Multi-uav automatic dynamic obstacle avoidance with experience-shared a2c
CN116774584A (en) Unmanned aerial vehicle differentiated service track optimization method based on multi-agent deep reinforcement learning
Tan et al. A hybrid architecture of cognitive decision engine based on particle swarm optimization algorithms and case database
Hoang et al. Deep Reinforcement Learning for Wireless Communications and Networking: Theory, Applications and Implementation
Sakthitharan et al. Establishing an emergency communication network and optimal path using multiple autonomous rover robots

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant