CN110488861B - Unmanned aerial vehicle track optimization method and device based on deep reinforcement learning and unmanned aerial vehicle


Info

Publication number
CN110488861B
CN110488861B
Authority
CN
China
Prior art keywords
aerial vehicle
unmanned aerial
time
strategy
function
Prior art date
Legal status
Active
Application number
CN201910697007.6A
Other languages
Chinese (zh)
Other versions
CN110488861A (en
Inventor
许文俊
徐越
吴思雷
张治
张平
林家儒
Current Assignee
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN201910697007.6A priority Critical patent/CN110488861B/en
Priority to PCT/CN2019/114200 priority patent/WO2021017227A1/en
Publication of CN110488861A publication Critical patent/CN110488861A/en
Application granted granted Critical
Publication of CN110488861B publication Critical patent/CN110488861B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/10Simultaneous control of position or course in three dimensions
    • G05D1/101Simultaneous control of position or course in three dimensions specially adapted for aircraft

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention discloses an unmanned aerial vehicle trajectory optimization method and device based on deep reinforcement learning, and an unmanned aerial vehicle. A reinforcement learning network is constructed in advance, and state data and action decision data are generated in real time during the flight of the unmanned aerial vehicle. With the state data as input, the action decision data as output and the instantaneous energy efficiency as the reward return, the strategy parameters are optimized with the PPO algorithm and an optimal strategy is output. The device comprises a construction module, a training data collection module and a training module. The unmanned aerial vehicle comprises a processor configured to execute the unmanned aerial vehicle trajectory optimization method based on deep reinforcement learning. The invention has the capability of autonomous learning from accumulated flight data, can intelligently determine the optimal flight speed, acceleration, flight direction and return time of the aircraft in an unknown communication scenario, generalizes an energy-efficiency-optimal flight strategy, and has strong environmental adaptability and generalization capability.

Description

Unmanned aerial vehicle track optimization method and device based on deep reinforcement learning and unmanned aerial vehicle
Technical Field
The invention relates to the technical field of wireless communication, in particular to an unmanned aerial vehicle track optimization method and device based on deep reinforcement learning and an unmanned aerial vehicle.
Background
Unmanned aerial vehicle communication technology is considered an indispensable component of fifth-generation (5G) and beyond-5G (5G+) mobile communication networks. However, the unique air-to-ground channel model, highly dynamic three-dimensional flight capability and limited flight energy of unmanned aerial vehicles make an unmanned aerial vehicle communication system more complex than a traditional communication system.
At present, unmanned aerial vehicle trajectory optimization mainly relies on conventional convex optimization algorithms and reinforcement learning algorithms. For example, the Chinese patent application with application number 201811144956.3 discloses an energy consumption optimization method for an unmanned aerial vehicle mobile edge computing system based on a cellular network. In that method, the position, speed and acceleration of the unmanned aerial vehicle at every moment are optimized with a convex optimization algorithm subject to the constraints imposed by data processing, communication and flight. As another example, the Chinese patent application with application number 201811564184.9 discloses an unmanned aerial vehicle group path planning method based on an improved Q-learning algorithm. That method combines the Q-learning reinforcement learning algorithm with unmanned aerial vehicle trajectory optimization: a discretized environment model is first established with a grid method; then a finite set of environment state values is input, the reinforcement learning network queries a state-action value matrix to output actions and obtains returns from the environment to update the matrix; and finally trajectory planning of the unmanned aerial vehicle in an unknown environment is realized.
When a convex optimization algorithm is used for unmanned aerial vehicle trajectory optimization, the objective equation in a real scenario is very complex, so the scenario must be simplified, scenario assumptions must be made, and the flight control optimization of the unmanned aerial vehicle must be carried out in a discrete domain before a simplified, solvable objective problem can be obtained; the result therefore usually deviates from the actual optimum. Moreover, a trajectory optimization method based on a convex optimization algorithm has difficulty handling dynamically changing environment information: when the communication demand changes dynamically, for example, the original objective equation no longer applies. Prior-art schemes that optimize the trajectory with a reinforcement learning algorithm such as Q-learning must first establish a table mapping environment states to actions, and then select the action corresponding to the maximum state-action value (Q value) by looking up the current state value in the table. Because of the limitations of the state-action table, both the definable states and the available actions are finite. In practice, however, states and actions are often infinite or continuous; converting them into a finite number loses information and risks a dimensional explosion.
It can be seen that, in the prior-art schemes for optimizing the flight trajectory of an unmanned aerial vehicle, the flight scenarios that can be handled and the flight actions that can be provided are relatively limited; such schemes have difficulty coping with dynamically changing environment information during flight and deviate from the actual flight requirements of the unmanned aerial vehicle.
Disclosure of Invention
The invention aims to provide an unmanned aerial vehicle track optimization method and device based on deep reinforcement learning and an unmanned aerial vehicle, so as to solve the technical problems.
In order to achieve the purpose, the invention provides the following scheme:
the first aspect of the embodiment of the invention provides an unmanned aerial vehicle trajectory optimization method based on deep reinforcement learning, which comprises the following steps:
a deep reinforcement learning network based on a PPO algorithm is constructed in advance;
interacting with the environment in real time in the flight process of the unmanned aerial vehicle to generate state data and action decision data, and calculating instantaneous energy efficiency;
and training the deep reinforcement learning network with the PPO algorithm, with the state data as input, the action decision data as output and the instantaneous energy efficiency as the reward return, optimizing the strategy parameters and outputting an optimal strategy through repeated iterative updates.
Optionally, the pre-constructing a deep reinforcement learning network based on a PPO algorithm includes:
constructing a deep learning network structure comprising an action network and an evaluation network;
the action network utilizes a PPO algorithm and a deep neural network to fit a strategy function and decide flight actions; the evaluation network utilizes a deep neural network to fit a state cost function and optimize strategy parameters in a strategy function.
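For illustration only — the following sketch is not part of the original disclosure — the action network and evaluation network described above could be realized as two small fully connected networks. The layer sizes, the use of PyTorch and the Gaussian policy head for the continuous action [ω_t, a_t] are assumptions rather than details taken from the patent.

```python
import torch
import torch.nn as nn

class ActionNetwork(nn.Module):
    """Actor: fits the strategy function pi_theta(a|s) for the continuous action [omega_t, a_t]."""
    def __init__(self, state_dim, action_dim=2, hidden=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.mu = nn.Linear(hidden, action_dim)               # mean of a Gaussian policy
        self.log_std = nn.Parameter(torch.zeros(action_dim))  # learned log standard deviation

    def forward(self, state):
        h = self.body(state)
        return self.mu(h), self.log_std.exp()

class EvaluationNetwork(nn.Module):
    """Critic: fits the state value function V(s) used to build the advantage function."""
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.v = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state):
        return self.v(state).squeeze(-1)
```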
Optionally, generating the state data and the action decision data includes:
calculating the distance between the unmanned aerial vehicle and the Internet of things equipment, the transmission rate and the self residual energy as state data;
and acquiring the acceleration and the flight direction of the unmanned aerial vehicle as action decision data.
Optionally, generating the state data and the action decision data includes:
quantizing the state data and representing it as:

φ(s_t) = [d_t^1, …, d_t^N, R_t^1, …, R_t^N, E_t^res]^T,

where φ(s_t) represents the state data matrix, s_t represents the state at time t, d_t^1, …, d_t^N respectively represent the Euclidean distances between the 1st to Nth Internet of things devices and the unmanned aerial vehicle at time t, R_t^1, …, R_t^N respectively represent the transmission rates at which the 1st to Nth Internet of things devices transmit information to the unmanned aerial vehicle at time t, and E_t^res represents the remaining energy of the unmanned aerial vehicle at time t;
representing the action decision data as a_t = [ω_t, a_t]^T, where a_t represents the action at time t; ω_t ∈ [0, 2π] represents the flight steering angle of the unmanned aerial vehicle at time t; a_t represents the magnitude of the acceleration of the unmanned aerial vehicle at time t and is continuous bounded data.
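As a hedged illustration of the state and action representations defined above, the following sketch assembles φ(s_t) and a_t from raw measurements; the helper names (uav_pos, device_pos, rates, remaining_energy) and the use of NumPy are assumptions, not part of the patent text.

```python
import numpy as np

def build_state(uav_pos, device_pos, rates, remaining_energy):
    """phi(s_t) = [d_t^1..d_t^N, R_t^1..R_t^N, E_t^res]^T (see the representation above)."""
    # Euclidean distance from the UAV position to each of the N IoT device positions
    dists = np.linalg.norm(device_pos - uav_pos, axis=1)
    return np.concatenate([dists, rates, [remaining_energy]]).astype(np.float32)

def build_action(steering_angle, acceleration):
    """a_t = [omega_t, a_t]^T, with omega_t kept in [0, 2*pi] and a bounded acceleration."""
    omega = np.clip(steering_angle, 0.0, 2 * np.pi)
    return np.array([omega, acceleration], dtype=np.float32)
```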
Optionally, calculating the instantaneous energy efficiency comprises calculating according to the following formula:
r(s_t, a_t) = [the explicit expression is given as an equation image in the original, in terms of the maximum transmission rates R_t^u and the remaining energy of the unmanned aerial vehicle, and is not reproduced here],

where r(s_t, a_t) denotes the instantaneous energy efficiency when the state of the unmanned aerial vehicle at time t is s_t and the action is a_t, R_t^u is the maximum transmission rate at which the Internet of things device u transmits data to the unmanned aerial vehicle at time t, and E_t^res denotes the remaining energy of the unmanned aerial vehicle.
Optionally, training the deep reinforcement learning network by using a PPO algorithm, optimizing strategy parameters, and outputting an optimal strategy through multiple iterative updates, including:
rewriting the objective equation with the PPO algorithm into the following form:

J(θ) = Ê_t[ min( r_t(θ) Â_t, clip(r_t(θ), 1 − ε, 1 + ε) Â_t ) ],

where θ is the strategy parameter to be optimized, ε is a preset constant for controlling the strategy update amplitude, Ê_t[·] denotes the expected value at time t, Â_t represents the advantage function, clip represents the clipping function, and r_t(θ) is the ratio of the new strategy function to the old strategy function within one iterative update, which can be expressed as:

r_t(θ) = π_θ(a_t | s_t) / π_θold(a_t | s_t),

where π_θ represents the strategy function, π_θ(a_t | s_t) denotes the new strategy function when the state at time t is s_t and the action is a_t, and π_θold(a_t | s_t) denotes the old strategy function when the state at time t is s_t and the action is a_t;
solving the advantage function equation as follows:

Â_t = δ_t + (γλ)δ_{t+1} + … + (γλ)^(T−t−1) δ_{T−1},  with  δ_t = r_t + γV(s_{t+1}) − V(s_t),

where γ is the attenuation index and λ is the trajectory parameter; δ_t is the time-difference error value at time t, and δ_{T−1} is the time-difference error value at time T−1; T is the total duration of the autonomous flight;
and solving the maximum value of the target equation through repeated iteration updating so as to optimize the strategy parameters in the strategy function, and outputting the strategy parameters corresponding to the maximum value of the target equation as the optimal strategy.
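A minimal numerical sketch — an illustrative implementation, not the patent's reference code — of evaluating the clipped objective above for one batch of transitions, assuming the per-step advantages and the old and new action log-probabilities are already available; eps plays the role of the constant ε controlling the update amplitude.

```python
import numpy as np

def ppo_clipped_objective(logp_new, logp_old, advantages, eps=0.2):
    """J(theta) = mean_t[ min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t) ]."""
    ratio = np.exp(logp_new - logp_old)              # r_t(theta) = pi_new / pi_old
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)   # clipped probability ratio
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))
```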
Optionally, calculating the instantaneous energy efficiency comprises:
when the unmanned aerial vehicle exhausts its energy on the way back, a penalty term of a preset value is added after the equation of instantaneous energy efficiency is calculated.
In a second aspect of the embodiment of the present invention, there is also provided an unmanned aerial vehicle trajectory optimization device based on deep reinforcement learning, including a construction module, a training data collection module, and a training module;
the construction module is used for constructing a deep reinforcement learning network based on a PPO algorithm;
the training data collection module is used for interacting with the environment in real time in the flight process of the unmanned aerial vehicle, generating state data and action decision data and calculating instantaneous energy efficiency;
and the training module is used for training the deep reinforcement learning network by using the PPO algorithm with state data as input, action decision data as output and instantaneous energy efficiency as reward return, optimizing strategy parameters and outputting an optimal strategy through repeated iterative updating.
Optionally, the construction module is configured to: construct an action network and an evaluation network; fit a state value function with a deep neural network and pass it to the evaluation network, calculate an advantage function through the evaluation network, and pass the advantage function to the action network; and fit a strategy function through the action network with a deep neural network, the strategy function being used by the action network;
and/or a training data collection module to: calculating the distance between the unmanned aerial vehicle and the Internet of things equipment, the transmission rate and the self residual energy as state data; and acquiring the acceleration and the flight direction of the unmanned aerial vehicle as action decision data.
In a third aspect of the embodiment of the present invention, an unmanned aerial vehicle is further provided, which includes a processor, where the processor is configured to execute the above unmanned aerial vehicle trajectory optimization method based on deep reinforcement learning.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the invention discloses an unmanned aerial vehicle track optimization method and device based on deep reinforcement learning and an unmanned aerial vehicle, wherein a deep reinforcement learning technology PPO algorithm is introduced in the unmanned aerial vehicle track optimization, the unmanned aerial vehicle interacts with the environment in real time in the flight process, state data and action data under the current flight track are collected as training data, instantaneous energy efficiency is taken as a return function, and the continuous optimization of strategy parameters of a decision flight track is realized through the real-time autonomous learning of the PPO algorithm, namely the unmanned aerial vehicle is endowed with the online autonomous learning capability in the environment, and can adapt to the change of a dynamic environment according to the requirement; in addition, the autonomous learning based on the PPO algorithm also has the advantage of being not limited by the selection of the learning step length;
in addition, the data objects processed by the autonomous learning method based on the PPO algorithm can be three-dimensional continuous bounded data, such as input data, output data and the like which are not limited to discrete domains, so that flight control optimization of the unmanned aerial vehicle in a three-dimensional space under the continuous domains is realized, and the method is closer to a real scene; compared with a control mode based on discrete domain data or a limited number of coping schemes in a table, the method is more suitable for the requirement of the actual flight environment;
furthermore, when the return function is assigned as the instantaneous energy efficiency of the unmanned aerial vehicle in flight, a penalty term is added whenever the aircraft fails to return and charge/refuel successfully; after continuous learning the unmanned aerial vehicle learns to return in time to avoid losses, improving the energy efficiency of its flight operations.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.
Fig. 1 is a schematic flow chart of an embodiment of an unmanned aerial vehicle trajectory optimization method based on deep reinforcement learning according to the present invention;
fig. 2 is a schematic view of the overall structure and the interaction of related data in another embodiment of the unmanned aerial vehicle trajectory optimization method based on deep reinforcement learning according to the present invention;
fig. 3 is a schematic flow chart of an unmanned aerial vehicle trajectory optimization method based on deep reinforcement learning according to a preferred embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the drawings of the embodiments of the present invention. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the described embodiments of the invention without any inventive step, are within the scope of protection of the invention.
Example 1
Deep reinforcement learning is a machine learning technology that combines reinforcement learning with deep neural networks. Specifically, the reinforcement learning agent collects, by interacting with the environment, the return information obtained from taking different actions in different environment states, and induces and learns the optimal behavior strategy from the collected data, thereby acquiring the ability to adapt to an unknown dynamic environment. The deep neural network can remarkably improve the generalization capability of the algorithm over high-dimensional state and action spaces, thereby providing the ability to adapt to more complex environments.
The embodiment 1 of the invention provides an unmanned aerial vehicle trajectory optimization method based on deep reinforcement learning, and as shown in fig. 1, the method comprises the following steps:
s101, a deep reinforcement learning network based on a PPO algorithm (near-end policy optimization algorithm) is constructed in advance.
This deep reinforcement learning network model can be installed on the unmanned aerial vehicle in advance, before take-off, or can be installed at the Internet of things device end; during flight, the unmanned aerial vehicle exchanges data with the Internet of things device end in real time to realize online autonomous learning.
And S102, interacting with the environment in real time in the flight process of the unmanned aerial vehicle to generate state data and action decision data.
S103, calculating the instantaneous energy efficiency.
And S104, training the deep reinforcement learning network by utilizing a PPO algorithm, and optimizing strategy parameters.
Steps S102 to S104 are executed in a loop, and the network parameters are continuously and iteratively updated with the collected data until the optimal state is finally reached.
And S105, obtaining the trained optimal strategy after repeated iteration updating, and outputting the optimal strategy.
And training the deep reinforcement learning network by taking the state data as input, taking the action decision data as output and taking the instantaneous energy efficiency as reward return, and realizing the optimization of the strategy parameters through repeated iterative updating.
The strategy parameters are the action parameters for determining the flight trajectory, and the optimal strategy is the flight strategy which is obtained by autonomous learning and enables the energy efficiency to be maximized.
Example 2
The embodiment 2 of the invention provides another embodiment of an unmanned aerial vehicle trajectory optimization method based on deep reinforcement learning.
In embodiment 2 of the present invention, the PPO algorithm adopts a deep reinforcement learning structure of an Actor Critic (Actor-Critic) framework, and is composed of two networks, namely, an action network and an evaluation network: the action network utilizes a PPO algorithm and a deep neural network to fit a strategy function and decide actions; the evaluation network utilizes a deep neural network to fit a state cost function and optimize strategy parameters. The overall structure and the related data interaction of the optimization method provided by embodiment 2 of the present invention are shown in fig. 2.
In this embodiment, the unmanned aerial vehicle communication scenario used is one in which a single unmanned aerial vehicle base station provides service for a plurality of fixed Internet of things devices, and the Internet of things devices are activated randomly or periodically to collect data and transmit it to the unmanned aerial vehicle base station.
As one implementable manner, the unmanned aerial vehicle uses the distances to the Internet of things devices, the transmission rates and its own remaining energy as the reinforcement-learning state input to the action network, uses the acceleration and flight direction (i.e., the flight control angle) of the unmanned aerial vehicle as the output behavior, and uses the instantaneous energy efficiency obtained from the environment as the reward. Through continuous interaction with the environment, data consisting of state inputs, action decisions and reward returns are generated and used as training data for the evaluation network and the action network. The evaluation network uses a deep neural network to fit a state value function and provides an advantage function for the optimization of the action network; the action network optimizes the strategy parameters with the PPO algorithm and fits a strategy function with a deep neural network. Through repeated iterative updates, the unmanned aerial vehicle adapts to the environment and obtains the optimal strategy.
As an implementable manner, the method for optimizing the trajectory of the unmanned aerial vehicle based on the deep reinforcement learning provided in embodiment 2 of the present invention may include the following steps:
s201, initializing a reinforcement learning decision strategy, relevant parameters and relevant parameters of a deep neural network.
S202, in a period of preset duration, the unmanned aerial vehicle autonomously flies to complete a task and records related data. The unmanned aerial vehicle calculates the distance to the Internet of things equipment, the transmission rate and the residual energy of the unmanned aerial vehicle, decides a flight track based on the current strategy, receives data sent by the Internet of things equipment, and calculates the instantaneous energy efficiency under the flight track.
S203, the evaluation network fits the state value function from the data collected within the period of the preset duration, calculates the advantage function, and passes the advantage function to the action network. The deep neural network parameters of the action network and the evaluation network are then trained respectively, and the flight strategy of the unmanned aerial vehicle is updated.
S204, repeat step S202 and step S203 until the unmanned aerial vehicle task is finished.
Example 3
Embodiment 3 of the present invention provides a preferred embodiment of an unmanned aerial vehicle trajectory optimization method based on deep reinforcement learning, and further details of an unmanned aerial vehicle communication modeling method used in the present invention and an unmanned aerial vehicle high energy efficiency trajectory optimization method based on deep reinforcement learning are described in this embodiment.
The unmanned aerial vehicle communication model established in this embodiment considers a scenario in which one unmanned aerial vehicle provides delay-tolerant service for N ground Internet of things devices; the Internet of things devices are randomly distributed and fixed in position, and periodically or randomly acquire data and transmit it to the unmanned aerial vehicle. The objective is to optimize the flight trajectory of the unmanned aerial vehicle and maximize the cumulative energy efficiency under the limited-energy condition. To accomplish this goal, the unmanned aerial vehicle should be able to detect its own remaining energy and decide the optimal time to return for charging/refueling.
The specific modeling method is as follows:
s301: the average path loss is calculated.
The communication channel between the unmanned aerial vehicle and the Internet of things equipment adopts a sub-6 GHz air-to-ground link, in which line-of-sight (LoS) transmission is dominant. The average path loss between the unmanned aerial vehicle and the ground Internet of things device u at time t can be expressed as follows:
L_t^u = 20 log10(4π f_c d_t^u / c) + η_LoS,

where f_c represents the center frequency of the signal, d_t^u represents the Euclidean distance between the unmanned aerial vehicle and the device u at time t, c represents the speed of light, and η_LoS, the additional spatial propagation loss of the LoS link, is a constant.
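For illustration under the free-space LoS assumption stated above, the average path loss in dB could be evaluated as in the following sketch; the function name and the argument units (Hz, metres, dB) are assumptions.

```python
import math

def average_path_loss_db(d_tu, f_c, eta_los_db):
    """LoS path loss in dB: 20*log10(4*pi*f_c*d/c) + eta_LoS, with c the speed of light."""
    c = 3.0e8  # speed of light in m/s
    return 20.0 * math.log10(4.0 * math.pi * f_c * d_tu / c) + eta_los_db
```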
S302: calculating the signal-to-noise ratio.
The signal-to-noise ratio (SINR) between the unmanned aerial vehicle and the Internet of things device u at time t can be expressed as:

Γ_t^u = P_u g_t^u / N_0,

where P_u represents the uplink transmission power of device u, g_t^u represents the channel gain between the unmanned aerial vehicle and device u at time t, and N_0 is the noise power. Assuming that the transmission power and the noise power of all devices are the same, the channel gain is determined only by the path loss, so:

g_t^u = 10^(−L_t^u / 10).

Assuming that the Doppler effect caused by the movement of the unmanned aerial vehicle can be perfectly compensated by existing techniques, such as phase-locked-loop techniques, the maximum transmission rate from device u to the unmanned aerial vehicle can therefore be expressed as:

R_t^u = B log2(1 + Γ_t^u),

where B represents the channel bandwidth, assuming that the bandwidth of all devices is the same.
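A short illustrative sketch of the link-budget chain above — channel gain from path loss, signal-to-noise ratio, and the maximum transmission rate — with function and argument names assumed rather than taken from the patent.

```python
import math

def snr_and_rate(p_u, path_loss_db, n0, bandwidth):
    """Gamma_t^u = P_u * g_t^u / N_0 with g_t^u = 10^(-L_t^u/10); R_t^u = B * log2(1 + Gamma_t^u)."""
    g_tu = 10.0 ** (-path_loss_db / 10.0)   # channel gain determined only by the path loss
    gamma = p_u * g_tu / n0                 # signal-to-noise ratio
    rate = bandwidth * math.log2(1.0 + gamma)
    return gamma, rate
```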
S303: and calculating the self residual energy.
The energy loss of the drone includes flight energy loss caused by propulsion and communication related energy loss. The flight energy consumption caused by the propulsion allows the drone to fly in the air, changing the flight trajectory, the power of which is related to the speed and acceleration of the drone, so the flight energy consumption can be expressed as the equation of the flight trajectory q (t), as follows:
Figure BDA0002149653010000087
wherein E (p (t)) is the self energy loss,
Figure BDA0002149653010000088
representing the self residual energy, which is the initial total energy minus the self energy loss, i.e.
Figure BDA0002149653010000089
Wherein E0The initial total energy of the unmanned aerial vehicle before the current flight is obtained. The self energy loss is the integral of the instantaneous energy loss from t-0 to t-t.
p (t) is the instantaneous energy loss,
Figure BDA0002149653010000091
representing the instantaneous speed of the drone,
Figure BDA0002149653010000092
acceleration on behalf of the drone, c1And c2Are two constants that are related to the physical properties of the drone itself, such as wing number and weight. It is to be noted that, here aTDenotes the transpose of a, where "T" is the transpose symbol.
Communication related energy losses include radiation, signal processing, and other circuit consumption, where the energy losses due to signal processing dominate. The energy loss caused by signal processing is independent of the flight of the drone and is an inverse proportional function of the square of the flight time and can be expressed as:
Figure BDA0002149653010000093
wherein E iscompNamely, the communication-related energy loss at the time t, G represents a hardware calculation constant of the node of the unmanned aerial vehicle, D represents the number of bits of data to be processed by the unmanned aerial vehicle, and t is the time t.
In the invention, the self energy loss is flight energy loss plus communication related energy loss; self remaining energy is initial total energy-self energy loss.
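As an assumed, discrete-time sketch of the remaining-energy bookkeeping described above: the instantaneous propulsion power is supplied by a caller-provided helper, since the patent gives its exact expression (with the constants c_1 and c_2) only as an equation image; all identifiers here are illustrative.

```python
def remaining_energy(e0, velocities, accelerations, comm_energy, dt, propulsion_power):
    """E_t^res = E_0 - (flight energy + communication-related energy).

    velocities/accelerations: sequences of per-step v(t), a(t);
    propulsion_power(v, a):   instantaneous propulsion power p(t) (model-dependent helper);
    comm_energy:              accumulated signal-processing energy up to time t;
    dt:                       duration of one discrete time step.
    """
    flight_energy = sum(propulsion_power(v, a) * dt
                        for v, a in zip(velocities, accelerations))
    return e0 - flight_energy - comm_energy
```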
S304: status data is extracted from the flight environment.
The state data is obtained by extraction and calculation from the environment and can be characterized by the following three parts: i) the distance from the unmanned aerial vehicle to each Internet of things device; ii) the transmission rate at which each Internet of things device transmits information to the unmanned aerial vehicle; iii) the self remaining energy. Thus, the state data may be represented as φ(s_t) = [d_t^1, …, d_t^N, R_t^1, …, R_t^N, E_t^res]^T (here, "T" denotes the transpose of the matrix).
S305: motion data is acquired.
The action is issued by the unmanned aerial vehicle to control the flight path and includes the following two parts: i) the flight control angle of the unmanned aerial vehicle at time t, ω_t ∈ [0, 2π]; ii) the acceleration of the unmanned aerial vehicle at time t, a_t. Thus, the action may be collectively denoted as a_t = [ω_t, a_t]^T (here, "T" denotes the transpose of the matrix).
It should be noted that the instantaneous flight speed v(t) and the acceleration a(t) of the unmanned aerial vehicle are three-dimensional, continuous and bounded.
S306, a return function is established.
The return function is defined as the instantaneous energy efficiency, i.e. r(s_t, a_t) [the explicit expression is given as an equation image in the original, in terms of the maximum transmission rates R_t^u and the remaining energy of the unmanned aerial vehicle, and is not reproduced here].
Because the algorithm also needs to decide the return charging/refueling time of the unmanned aerial vehicle automatically, a penalty term with a comparatively large value is added to the return function when the unmanned aerial vehicle exhausts its energy on the way back. Exhausting the energy on the return leg causes the unmanned aerial vehicle to crash, so the return function value is directly set to a large negative number, such as -100. The specific value of the penalty term can be set flexibly by a person skilled in the art according to the actual scenario; it is not unique and the possibilities are not enumerated one by one in the invention.
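A minimal sketch of the crash-penalty logic described above; the ratio sum_rate / energy_consumed merely stands in for the patent's instantaneous energy-efficiency expression (an assumption), and -100 follows the example penalty value given in the text.

```python
def reward(sum_rate, energy_consumed, crashed, penalty=-100.0):
    """Instantaneous energy efficiency as the return, with a crash penalty.

    sum_rate / energy_consumed is an assumed stand-in for the patent's
    energy-efficiency expression; a large negative value is returned if the
    UAV runs out of energy before completing its return flight.
    """
    if crashed:
        return penalty
    return sum_rate / max(energy_consumed, 1e-9)
```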
S307: and establishing a policy function.
A policy-gradient-based reinforcement learning method parameterizes the strategy and models it as a stochastic mapping, namely π_θ: S → P(A), which represents the probability of taking an action from the action set A (i.e., the set of actions a) in any state from the state set S (i.e., the set of states s); θ ∈ R^n is the strategy parameter to be optimized, where R^n denotes the set of n-dimensional real vectors and n equals the dimension of θ.
S308: and establishing an objective equation.
In reinforcement learning, the state value function of a state under strategy π_θ is defined as the long-term cumulative return. When the state is s and the strategy is π_θ, the state value function has the form:

V^{π_θ}(s) = E_{π_θ}[ Σ_{k=0}^{∞} γ^k r_{t+k} | s_t = s ],

where γ is a discount factor with value range γ ∈ [0, 1]. Similarly, under strategy π_θ, the state-action value function for action a may be defined as:

Q^{π_θ}(s, a) = E_{π_θ}[ Σ_{k=0}^{∞} γ^k r_{t+k} | s_t = s, a_t = a ].

The objective equation of reinforcement learning is defined as:

J(θ) = E_{s ∼ ρ^{π_θ}, a ∼ π_θ}[ r(s, a) ],

where ρ^{π_θ} is the discounted state visitation probability distribution under strategy π_θ.
Therefore, the final reinforcement-learning-based trajectory optimization problem of the unmanned aerial vehicle is obtained as:

max_θ J(θ)  subject to C1, C2,

where C1 and C2 are the constraint conditions on the flight speed and the acceleration of the unmanned aerial vehicle, respectively.
The strategy gradient method can be applied to optimize the strategy π_θ so as to maximize the objective equation. The gradient of the objective equation with respect to the parameter θ can be expressed as:

∇_θ J(θ) = E_{π_θ}[ ∇_θ log π_θ(a_t | s_t) (R_t − b_t) ],

where b_t is a constant baseline introduced into the return to reduce the variance of the strategy gradient; introducing such a constant leaves the strategy gradient unchanged while reducing its variance. In particular, b_t is typically chosen as the estimated value of the state value function V_θ(s_t), so that R_t − b_t can be regarded as an estimate of the advantage function A(a_t, s_t) = Q(a_t, s_t) − V(s_t).
In practice the strategy gradient generally has a large variance, so the algorithm is strongly affected by the parameter update. According to the strategy gradient algorithm, the parameter update equation is

θ ← θ + α ∇_θ J(θ),

where α is the update step size; when the step size is not appropriate, the strategy corresponding to the updated parameters may be a worse strategy.
The trust-region method TRPO (trust region policy optimization) improves the robustness of the algorithm by limiting how much the strategy may change in each iteration. The deep reinforcement learning algorithm PPO inherits the advantages of the trust-region method while being simpler and more general to implement, and empirically has better sample complexity.
S309: and rewriting the target equation by adopting a PPO algorithm.
With the PPO algorithm, the objective equation can be rewritten as:

J(θ) = Ê_t[ min( r_t(θ) Â_t, clip(r_t(θ), 1 − ε, 1 + ε) Â_t ) ],

where θ is the parameter to be optimized in the strategy function, and ε is a preset fixed value, typically 0.1-0.3, whose purpose is to control the update amplitude of the strategy. Ê_t[·] is the mathematical expectation symbol and denotes averaging over time t. r_t(θ) is the ratio of the new strategy function to the old strategy function and can be expressed as:

r_t(θ) = π_θ(a_t | s_t) / π_θold(a_t | s_t).

The old strategy function and the new strategy function refer to one iterative update: the strategy function after the update is the new strategy function and the strategy function before the update is the old strategy function.
The advantage function equation is:

Â_t = δ_t + (γλ)δ_{t+1} + … + (γλ)^(T−t−1) δ_{T−1},  with  δ_t = r_t + γV(s_{t+1}) − V(s_t),

where γ is the attenuation index and is a preset fixed value; λ is the trajectory parameter and is also a preset fixed value; the range of γ is (0, 1) and the range of λ is (0, 1). δ_t is the time-difference (temporal-difference) error value at time t, whose mathematical expression is given in the second part of the formula above; δ_{T−1} is the time-difference error value at time T−1, and T is the total duration of the autonomous flight.
It is noted that computing the advantage function at time t requires all of the data collected from time t up to the end of the period T.
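For illustration only, the truncated advantage estimate Â_t above can be computed with the following sketch, which uses the mathematically equivalent backward recursion over the TD errors δ_t; the array names and the default γ, λ values are assumptions.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """A_hat_t = delta_t + (gamma*lam)*delta_{t+1} + ..., with delta_t = r_t + gamma*V(s_{t+1}) - V(s_t).

    rewards: r_0..r_{T-1}; values: V(s_0)..V(s_T) (one extra bootstrap value at the end).
    """
    T = len(rewards)
    adv = np.zeros(T, dtype=np.float64)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv
```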
Therefore, the invention introduces deep neural networks in two places: one to represent the state-action value function, Q_ω(s, a) ≈ Q^π(s, a), and learn the parameter ω, and the other to represent the strategy function, π_θ(s) ≈ π(s), and learn the parameter θ.
Specifically, referring to fig. 3, a specific process of deep reinforcement learning PPO algorithm in the embodiment of the present invention is as follows:
Initialize each parameter of the deep reinforcement learning neural networks: randomly assign values to the parameters ω and θ, set the autonomous flight duration to T, set the numbers of iterations of the two deep neural networks to M and B respectively, set ε to 0.2, set γ to 0.99, and set the total task duration to L.
For episode = 1, …, L do (execute a loop from the 1st time segment to the Lth time segment):
based on the current strategy π_θ, make autonomous action decisions for T consecutive steps while collecting the interaction tuples {s_t, a_t, r_t} from the environment; using the collected tuples {s_t, a_t, r_t}, estimate the advantage function Â_t with the deep neural network;
calculate the objective function J(θ) and update the parameter θ with a gradient descent method, iterating M times;
calculate the loss function of the evaluation network [given as an equation image in the original and not reproduced here] and update the parameter ω with a gradient descent method, iterating B times.
End for (end of the loop).
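Purely as an illustrative sketch of the loop described above — not the patent's reference implementation — assuming the ActionNetwork/EvaluationNetwork classes and the gae_advantages helper sketched earlier, and a hypothetical environment object env whose reset()/step() calls return (state), and (state, reward, done), respectively.

```python
import numpy as np
import torch

def train(env, actor, critic, L=1000, T=200, M=10, B=10,
          eps=0.2, gamma=0.99, lam=0.95):
    """One possible realisation: collect T steps, then update theta M times and omega B times."""
    opt_actor = torch.optim.Adam(actor.parameters(), lr=3e-4)
    opt_critic = torch.optim.Adam(critic.parameters(), lr=1e-3)
    for episode in range(L):
        # collect up to T interaction tuples {s_t, a_t, r_t} under the current policy
        states, actions, rewards, old_logps = [], [], [], []
        s = env.reset()
        for _ in range(T):
            mu, std = actor(torch.as_tensor(s, dtype=torch.float32))
            dist = torch.distributions.Normal(mu, std)
            a = dist.sample()
            old_logps.append(dist.log_prob(a).sum().item())
            states.append(s)
            actions.append(a.numpy())
            s, r, done = env.step(a.numpy())
            rewards.append(r)
            if done:
                break
        S = torch.as_tensor(np.asarray(states), dtype=torch.float32)
        A = torch.as_tensor(np.asarray(actions), dtype=torch.float32)
        old_lp = torch.as_tensor(old_logps, dtype=torch.float32)
        with torch.no_grad():
            V = critic(S)                                  # V(s_0..s_{T-1})
            values = torch.cat([V, torch.zeros(1)])        # bootstrap 0 at the end
        adv = torch.as_tensor(
            gae_advantages(rewards, values.numpy(), gamma, lam), dtype=torch.float32)
        returns = adv + V
        for _ in range(M):                                 # actor: maximise the clipped objective
            mu, std = actor(S)
            dist = torch.distributions.Normal(mu, std)
            new_lp = dist.log_prob(A).sum(-1)
            ratio = torch.exp(new_lp - old_lp)
            clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
            loss_pi = -torch.min(ratio * adv, clipped * adv).mean()
            opt_actor.zero_grad()
            loss_pi.backward()
            opt_actor.step()
        for _ in range(B):                                 # critic: fit V_omega to the returns
            loss_v = ((critic(S) - returns) ** 2).mean()
            opt_critic.zero_grad()
            loss_v.backward()
            opt_critic.step()
```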
The embodiment of the invention provides an energy-efficient unmanned aerial vehicle trajectory optimization scheme based on the deep reinforcement learning PPO algorithm. In this scheme, the remaining energy of the unmanned aerial vehicle is included in the state value input to the reinforcement learning network, and the flight speed, acceleration, flight direction and return time of the unmanned aerial vehicle are output directly. Through online learning, the scheme dynamically adjusts the learned strategy according to environmental changes and thus adapts to the environment. Meanwhile, the scheme addresses the control problem in the continuous domain and conforms to the continuous-domain flight control mechanism in real scenarios. Furthermore, the PPO algorithm is a continuous-domain control algorithm with strong robustness and outstanding performance; it removes the drawback that an appropriate learning step size is difficult to determine, and reduces the complexity of the algorithm.
Example 4
The embodiment of the invention also provides an unmanned aerial vehicle track optimization device based on deep reinforcement learning, which comprises a construction module, a training data collection module and a training module.
The building module is used for building a deep reinforcement learning network based on a PPO algorithm; the training data collection module is used for interacting with the environment in real time in the flight process of the unmanned aerial vehicle, generating state data and action decision data and calculating instantaneous energy efficiency; and the training module is used for training the deep reinforcement learning network by using the PPO algorithm with state data as input, action decision data as output and instantaneous energy efficiency as reward return, optimizing strategy parameters and outputting an optimal strategy through repeated iterative updating.
Example 5
The embodiment of the invention also provides the unmanned aerial vehicle, which comprises a processor, wherein the processor is used for executing the unmanned aerial vehicle track optimization method based on the deep reinforcement learning.
In conclusion, the invention introduces the deep reinforcement learning PPO algorithm to perform autonomous exploratory learning of environmental information, aims to improve the energy efficiency of the unmanned aerial vehicle, and intelligently decides and optimizes the flight path and the return flight time.
Compared with the prior art, the invention achieves the following technical effects:
firstly, the capability of the invention in adapting to scenes and environments is stronger than the scheme of adopting a convex optimization algorithm in the prior art. Because a reinforcement learning algorithm is introduced, strategy parameters are optimized in the learning process instead of being based on a fixed target equation, so that the method has stronger flexibility; in addition, the deep reinforcement learning network strengthens the interaction with the external environment by inputting the environment state and acquiring the reward, and can more quickly respond to the change of the scene and the environment.
Secondly, compared with the prior-art scheme based on Q-learning, the invention adopts a continuous-domain trajectory optimization scheme: the continuous speed and acceleration actions output by reinforcement learning are closer to the actual situation, the flight area is easy to extend, and the potential problem of dimensional explosion does not arise during large-area trajectory optimization.
In the prior art, the DDPG algorithm is used for continuous-domain machine control; that method has the drawback that an appropriate learning step size is not easy to determine, and the choice of hyper-parameters has a great influence on the optimization result.
Compared with an optimization scheme updated with the deep deterministic policy gradient (DDPG) algorithm, the PPO algorithm is less affected by the training step size and adapts better when solving control problems in real scenarios; it overcomes the prior-art difficulty of determining the learning step size with DDPG and is more efficient.
In addition, the invention also considers the optimal return charging/refueling time, so that the unmanned plane can flexibly adjust the flight time and the track under the condition of safe return, and the energy utilization efficiency of the unmanned plane is improved as much as possible.
In one or more exemplary designs, the functions may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage media may be any available media that can be accessed by a general purpose or special purpose computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a general-purpose or special-purpose computer, or a general-purpose or special-purpose processor. Also, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, includes Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk, blu-ray disc where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principle and the implementation manner of the present invention are explained by applying specific examples, the above description of the embodiments is only used to help understanding the method of the present invention and the core idea thereof, the described embodiments are only a part of the embodiments of the present invention, not all embodiments, and all other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative efforts belong to the protection scope of the present invention.

Claims (8)

1. The unmanned aerial vehicle track optimization method based on deep reinforcement learning is characterized by comprising the following steps:
a deep reinforcement learning network based on a PPO algorithm is constructed in advance;
interacting with the environment in real time during the flight of the unmanned aerial vehicle to generate state data and action decision data, and calculating the instantaneous energy efficiency; wherein calculating the instantaneous energy efficiency comprises calculating according to the following formula:

r(s_t, a_t) = [the explicit expression is given as an equation image in the original, in terms of the maximum transmission rates R_t^u and the remaining energy of the unmanned aerial vehicle, and is not reproduced here],

where r(s_t, a_t) denotes the instantaneous energy efficiency when the state of the unmanned aerial vehicle at time t is s_t and the action is a_t, R_t^u is the maximum transmission rate at which the Internet of things device u transmits data to the unmanned aerial vehicle at time t, E_t^res denotes the remaining energy of the unmanned aerial vehicle, s_t denotes the state of the unmanned aerial vehicle at time t, and a_t denotes the action of the unmanned aerial vehicle at time t;
training the deep reinforcement learning network by using the PPO algorithm with the state data as input, the action decision data as output and the instantaneous energy efficiency as reward return, optimizing strategy parameters, and outputting an optimal strategy through repeated iterative updating;
the method comprises the following steps of training the deep reinforcement learning network by utilizing a PPO algorithm, optimizing strategy parameters, and outputting an optimal strategy through repeated iteration updating, and comprises the following steps:
and (3) rewriting the target equation into the following equation by adopting a PPO algorithm:
Figure FDA0002500672040000014
wherein theta is a strategy parameter to be optimized and is a preset constant for controlling strategy updating amplitude,
Figure FDA0002500672040000015
for the desired value of the time t,
Figure FDA0002500672040000016
representing a dominance function, clip representing a clipping function, rt(θ) is the ratio of the old policy function to the new policy function in one iteration of the update, and can be expressed as:
Figure FDA0002500672040000017
wherein piθRepresenting a policy function, piθ(at|st) Indicates that the state at time t is stThe action is atThe new policy function of (2) is,
Figure FDA0002500672040000018
indicates that the state at time t is stThe action is atOld policy function of;
the dominant function equation is solved as follows:
Figure FDA0002500672040000019
wherein gamma is an attenuation index and lambda is a trajectory parameter;tfor a time differential error value at time t,T-1the time difference error value at the T-1 moment; t is the total duration of the autonomous flight;
and solving the maximum value of the target equation through multiple iterative updating so as to optimize the strategy parameters in the strategy function, and outputting the strategy parameters corresponding to the maximum value of the target equation as the optimal strategy.
2. The unmanned aerial vehicle trajectory optimization method based on deep reinforcement learning of claim 1, wherein the step of pre-constructing a deep reinforcement learning network based on a PPO algorithm comprises:
constructing a deep learning network structure comprising an action network and an evaluation network;
the action network utilizes a PPO algorithm and a deep neural network to fit a strategy function and decide flight actions; the evaluation network utilizes a deep neural network to fit a state cost function and optimize strategy parameters in the strategy function.
3. The method for unmanned aerial vehicle trajectory optimization based on deep reinforcement learning of claim 1, wherein the step of generating state data and action decision data comprises:
calculating the distance between the unmanned aerial vehicle and the Internet of things equipment, the transmission rate and the self residual energy as state data;
and acquiring the acceleration and the flight direction of the unmanned aerial vehicle as action decision data.
4. The method of claim 3, wherein the step of generating state data and action decision data comprises:
quantizing the state data and representing it as:

φ(s_t) = [d_t^1, …, d_t^N, R_t^1, …, R_t^N, E_t^res]^T,

where φ(s_t) represents the state data matrix, s_t represents the state at time t, d_t^1, …, d_t^N respectively represent the Euclidean distances between the 1st to Nth Internet of things devices and the unmanned aerial vehicle at time t, R_t^1, …, R_t^N respectively represent the transmission rates at which the 1st to Nth Internet of things devices transmit information to the unmanned aerial vehicle at time t, and E_t^res represents the remaining energy of the unmanned aerial vehicle at time t;
representing the action decision data as a_t = [ω_t, a_t]^T, where a_t represents the action at time t; ω_t ∈ [0, 2π] represents the flight steering angle of the unmanned aerial vehicle at time t; a_t represents the magnitude of the acceleration of the unmanned aerial vehicle at time t and is continuous bounded data.
5. The method of any one of claims 1-4, wherein the step of calculating the instantaneous energy efficiency comprises:
when the unmanned aerial vehicle exhausts its energy on the way back, a penalty term of a preset value is added after the equation of instantaneous energy efficiency is calculated.
6. The unmanned aerial vehicle track optimization device based on deep reinforcement learning is characterized by comprising a construction module, a training data collection module and a training module;
the construction module is used for constructing a deep reinforcement learning network based on a PPO algorithm;
the training data collection module is used for interacting with the environment in real time during the flight of the unmanned aerial vehicle, generating state data and action decision data, and calculating the instantaneous energy efficiency; wherein calculating the instantaneous energy efficiency comprises calculating according to the following formula:

r(s_t, a_t) = [the explicit expression is given as an equation image in the original, in terms of the maximum transmission rates R_t^u and the remaining energy of the unmanned aerial vehicle, and is not reproduced here],

where r(s_t, a_t) denotes the instantaneous energy efficiency when the state of the unmanned aerial vehicle at time t is s_t and the action is a_t, R_t^u is the maximum transmission rate at which the Internet of things device u transmits data to the unmanned aerial vehicle at time t, E_t^res denotes the remaining energy of the unmanned aerial vehicle, s_t denotes the state of the unmanned aerial vehicle at time t, and a_t denotes the action of the unmanned aerial vehicle at time t;
the training module is used for training the deep reinforcement learning network by using the PPO algorithm with the state data as input, the action decision data as output and the instantaneous energy efficiency as reward return, optimizing strategy parameters, and outputting an optimal strategy through repeated iterative updating;
the method comprises the following steps of training the deep reinforcement learning network by utilizing a PPO algorithm, optimizing strategy parameters, and outputting an optimal strategy through repeated iteration updating, and comprises the following steps:
and (3) rewriting the target equation into the following equation by adopting a PPO algorithm:
Figure FDA0002500672040000034
wherein theta is a strategy parameter to be optimized and is a preset constant for controlling strategy updating amplitude,
Figure FDA0002500672040000035
for the desired value of the time t,
Figure FDA0002500672040000036
representing a dominance function, clip representing a clipping function, rt(θ) is the ratio of the old policy function to the new policy function in one iteration of the update, and can be expressed as:
Figure FDA0002500672040000037
wherein piθTo representPolicy function, piθ(at|st) Indicates that the state at time t is stThe action is atThe new policy function of (2) is,
Figure FDA0002500672040000038
indicates that the state at time t is stThe action is atOld policy function of;
the dominant function equation is solved as follows:
Figure FDA0002500672040000041
wherein gamma is an attenuation index and lambda is a trajectory parameter;tfor a time differential error value at time t,T-1the time difference error value at the T-1 moment; t is the total duration of the autonomous flight;
and solving the maximum value of the target equation through multiple iterative updating so as to optimize the strategy parameters in the strategy function, and outputting the strategy parameters corresponding to the maximum value of the target equation as the optimal strategy.
7. The unmanned aerial vehicle trajectory optimization device based on deep reinforcement learning of claim 6, wherein:
the building module is configured to: constructing a mobile network and an evaluation network; fitting a state cost function by using a deep neural network and transmitting the state cost function into the evaluation network, calculating an advantage function through the evaluation network, and transmitting the advantage function into the action network; fitting a policy function through the action network by using a deep neural network, and transmitting the policy function into the action network;
and/or the training data collection module is configured to: calculating the distance between the unmanned aerial vehicle and the Internet of things equipment, the transmission rate and the self residual energy as state data; and acquiring the acceleration and the flight direction of the unmanned aerial vehicle as action decision data.
8. A drone comprising a processor, wherein the processor is configured to perform the method of drone trajectory optimization based on deep reinforcement learning of any one of claims 1-4.
CN201910697007.6A 2019-07-30 2019-07-30 Unmanned aerial vehicle track optimization method and device based on deep reinforcement learning and unmanned aerial vehicle Active CN110488861B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910697007.6A CN110488861B (en) 2019-07-30 2019-07-30 Unmanned aerial vehicle track optimization method and device based on deep reinforcement learning and unmanned aerial vehicle
PCT/CN2019/114200 WO2021017227A1 (en) 2019-07-30 2019-10-30 Path optimization method and device for unmanned aerial vehicle, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910697007.6A CN110488861B (en) 2019-07-30 2019-07-30 Unmanned aerial vehicle track optimization method and device based on deep reinforcement learning and unmanned aerial vehicle

Publications (2)

Publication Number Publication Date
CN110488861A CN110488861A (en) 2019-11-22
CN110488861B true CN110488861B (en) 2020-08-28

Family

ID=68548830

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910697007.6A Active CN110488861B (en) 2019-07-30 2019-07-30 Unmanned aerial vehicle track optimization method and device based on deep reinforcement learning and unmanned aerial vehicle

Country Status (2)

Country Link
CN (1) CN110488861B (en)
WO (1) WO2021017227A1 (en)

Families Citing this family (72)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110879595A (en) * 2019-11-29 2020-03-13 江苏徐工工程机械研究院有限公司 Unmanned mine card tracking control system and method based on deep reinforcement learning
CN110958680B (en) * 2019-12-09 2022-09-13 长江师范学院 Energy efficiency-oriented unmanned aerial vehicle cluster multi-agent deep reinforcement learning optimization method
CN111132192B (en) * 2019-12-13 2023-01-17 广东工业大学 Unmanned aerial vehicle base station online track optimization method
CN111123963B (en) * 2019-12-19 2021-06-08 南京航空航天大学 Unknown environment autonomous navigation system and method based on reinforcement learning
CN111026147B (en) * 2019-12-25 2021-01-08 北京航空航天大学 Zero overshoot unmanned aerial vehicle position control method and device based on deep reinforcement learning
CN111191728B (en) * 2019-12-31 2023-05-09 中国电子科技集团公司信息科学研究院 Deep reinforcement learning distributed training method and system based on asynchronization or synchronization
CN111314929B (en) * 2020-01-20 2023-06-09 浙江工业大学 Contract-based unmanned aerial vehicle edge cache strategy and rewarding optimization method
CN111385806B (en) * 2020-02-18 2021-10-26 清华大学 Unmanned aerial vehicle base station path planning and bandwidth resource allocation method and device
CN111263332A (en) * 2020-03-02 2020-06-09 湖北工业大学 Unmanned aerial vehicle track and power joint optimization method based on deep reinforcement learning
CN111381499B (en) * 2020-03-10 2022-09-27 东南大学 Internet-connected aircraft self-adaptive control method based on three-dimensional space radio frequency map learning
CN111565065B (en) * 2020-03-24 2021-06-04 北京邮电大学 Unmanned aerial vehicle base station deployment method and device and electronic equipment
CN111580544B (en) * 2020-03-25 2021-05-07 北京航空航天大学 Unmanned aerial vehicle target tracking control method based on reinforcement learning PPO algorithm
CN111460650B (en) * 2020-03-31 2022-11-01 北京航空航天大学 Unmanned aerial vehicle end-to-end control method based on deep reinforcement learning
CN112180967B (en) * 2020-04-26 2022-08-19 北京理工大学 Multi-unmanned aerial vehicle cooperative countermeasure decision-making method based on evaluation-execution architecture
CN111552313B (en) * 2020-04-29 2022-06-28 南京理工大学 Multi-unmanned aerial vehicle path planning method based on edge calculation dynamic task arrival
US20220308598A1 (en) * 2020-04-30 2022-09-29 Rakuten Group, Inc. Learning device, information processing device, and learned control model
CN112198870B (en) * 2020-06-01 2022-09-02 西北工业大学 Unmanned aerial vehicle autonomous guiding maneuver decision method based on DDQN
CN111786713B (en) * 2020-06-04 2021-06-08 大连理工大学 Unmanned aerial vehicle network hovering position optimization method based on multi-agent deep reinforcement learning
CN111752304B (en) * 2020-06-23 2022-10-14 深圳清华大学研究院 Unmanned aerial vehicle data acquisition method and related equipment
CN111724001B (en) * 2020-06-29 2023-08-29 重庆大学 Aircraft detection sensor resource scheduling method based on deep reinforcement learning
CN112097783B (en) * 2020-08-14 2022-05-20 广东工业大学 Electric taxi charging navigation path planning method based on deep reinforcement learning
CN112068590A (en) * 2020-08-21 2020-12-11 广东工业大学 Unmanned aerial vehicle base station flight planning method and system, storage medium and unmanned aerial vehicle base station
CN112235810B (en) * 2020-09-17 2021-07-09 广州番禺职业技术学院 Multi-dimensional optimization method and system of unmanned aerial vehicle communication system based on reinforcement learning
CN112051863A (en) * 2020-09-25 2020-12-08 南京大学 Unmanned aerial vehicle autonomous anti-reconnaissance and enemy attack avoidance method
CN112362522B (en) * 2020-10-23 2022-08-02 浙江中烟工业有限责任公司 Tobacco leaf volume weight measuring method based on reinforcement learning
CN114527737A (en) * 2020-11-06 2022-05-24 百度在线网络技术(北京)有限公司 Speed planning method, device, equipment, medium and vehicle for automatic driving
CN112566209A (en) * 2020-11-24 2021-03-26 山西三友和智慧信息技术股份有限公司 UAV-BSs energy and service priority track design method based on double Q learning
CN112711271B (en) * 2020-12-16 2022-05-17 中山大学 Autonomous navigation unmanned aerial vehicle power optimization method based on deep reinforcement learning
CN112865855B (en) * 2021-01-04 2022-04-08 福州大学 High-efficiency wireless covert transmission method based on unmanned aerial vehicle relay
CN112819215B (en) * 2021-01-26 2024-01-12 北京百度网讯科技有限公司 Recommendation strategy training method and device, electronic equipment and readable storage medium
CN112791394B (en) * 2021-02-02 2022-09-30 腾讯科技(深圳)有限公司 Game model training method and device, electronic equipment and storage medium
CN115046433B (en) * 2021-03-09 2023-04-07 北京理工大学 Aircraft time collaborative guidance method based on deep reinforcement learning
CN113159386A (en) * 2021-03-22 2021-07-23 中国科学技术大学 Unmanned aerial vehicle return state estimation method and system
CN113050673B (en) * 2021-03-25 2021-12-28 四川大学 Three-dimensional trajectory optimization method for high-energy-efficiency unmanned aerial vehicle of auxiliary communication system
CN113115344B (en) * 2021-04-19 2021-12-14 中国人民解放军火箭军工程大学 Unmanned aerial vehicle base station communication resource allocation strategy prediction method based on noise optimization
CN113110546B (en) * 2021-04-20 2022-09-23 南京大学 Unmanned aerial vehicle autonomous flight control method based on offline reinforcement learning
CN113110550B (en) * 2021-04-23 2022-09-23 南京大学 Unmanned aerial vehicle flight control method based on reinforcement learning and network model distillation
CN113316239B (en) * 2021-05-10 2022-07-08 北京科技大学 Unmanned aerial vehicle network transmission power distribution method and device based on reinforcement learning
CN113258989B (en) * 2021-05-17 2022-06-03 东南大学 Method for obtaining relay track of unmanned aerial vehicle by using reinforcement learning
CN113110516B (en) * 2021-05-20 2023-12-22 广东工业大学 Operation planning method for limited space robot with deep reinforcement learning
CN113283169B (en) * 2021-05-24 2022-04-26 北京理工大学 Three-dimensional group exploration method based on multi-head attention asynchronous reinforcement learning
CN113255218B (en) * 2021-05-27 2022-05-31 电子科技大学 Unmanned aerial vehicle autonomous navigation and resource scheduling method of wireless self-powered communication network
CN113419548A (en) * 2021-05-28 2021-09-21 北京控制工程研究所 Spacecraft deep reinforcement learning Levier flight control system
CN113157002A (en) * 2021-05-28 2021-07-23 南开大学 Air-ground cooperative full-coverage trajectory planning method based on multiple unmanned aerial vehicles and multiple base stations
CN113382060B (en) * 2021-06-07 2022-03-22 北京理工大学 Unmanned aerial vehicle track optimization method and system in Internet of things data collection
CN113268074B (en) * 2021-06-07 2022-05-13 哈尔滨工程大学 Unmanned aerial vehicle flight path planning method based on joint optimization
CN113543068B (en) * 2021-06-07 2024-02-02 北京邮电大学 Forest area unmanned aerial vehicle network deployment method and system based on hierarchical clustering
CN113507717A (en) * 2021-06-08 2021-10-15 山东师范大学 Unmanned aerial vehicle track optimization method and system based on vehicle track prediction
CN113283013B (en) * 2021-06-10 2022-07-19 北京邮电大学 Multi-unmanned aerial vehicle charging and task scheduling method based on deep reinforcement learning
CN113423060B (en) * 2021-06-22 2022-05-10 广东工业大学 Online optimization method for flight route of unmanned aerial communication platform
CN113377131B (en) * 2021-06-23 2022-06-03 东南大学 Method for acquiring unmanned aerial vehicle collected data track by using reinforcement learning
CN113239639B (en) * 2021-06-29 2022-08-26 暨南大学 Policy information generation method, policy information generation device, electronic device, and storage medium
CN113359480B (en) * 2021-07-16 2022-02-01 中国人民解放军火箭军工程大学 Multi-unmanned aerial vehicle and user cooperative communication optimization method based on MAPPO algorithm
CN113776531A (en) * 2021-07-21 2021-12-10 电子科技大学长三角研究院(湖州) Multi-unmanned-aerial-vehicle autonomous navigation and task allocation algorithm of wireless self-powered communication network
CN113639755A (en) * 2021-08-20 2021-11-12 江苏科技大学苏州理工学院 Fire scene escape-rescue combined system based on deep reinforcement learning
CN113721655B (en) * 2021-08-26 2023-06-16 南京大学 Control period self-adaptive reinforcement learning unmanned aerial vehicle stable flight control method
CN114200950B (en) * 2021-10-26 2023-06-02 北京航天自动控制研究所 Flight attitude control method
CN114117633A (en) * 2021-11-18 2022-03-01 中国人民解放军国防科技大学 Unmanned aerial vehicle information collection control method and system
CN113885549B (en) * 2021-11-23 2023-11-21 江苏科技大学 Four-rotor gesture track control method based on dimension clipping PPO algorithm
CN114142912B (en) * 2021-11-26 2023-01-06 西安电子科技大学 Resource control method for guaranteeing time coverage continuity of high-dynamic air network
CN114268986A (en) * 2021-12-14 2022-04-01 北京航空航天大学 Unmanned aerial vehicle computing unloading and charging service efficiency optimization method
CN114372612B (en) * 2021-12-16 2023-04-28 电子科技大学 Path planning and task unloading method for unmanned aerial vehicle mobile edge computing scene
CN114384931B (en) * 2021-12-23 2023-08-29 同济大学 Multi-target optimal control method and equipment for unmanned aerial vehicle based on strategy gradient
CN114550540A (en) * 2022-02-10 2022-05-27 北方天途航空技术发展(北京)有限公司 Intelligent monitoring method, device, equipment and medium for training machine
CN114785397B (en) * 2022-03-11 2023-04-07 成都三维原光通讯技术有限公司 Unmanned aerial vehicle base station control method, flight trajectory optimization model construction and training method
CN114741886B (en) * 2022-04-18 2022-11-22 中国人民解放军军事科学院战略评估咨询中心 Unmanned aerial vehicle cluster multi-task training method and system based on contribution degree evaluation
CN115202377B (en) * 2022-06-13 2023-06-09 北京理工大学 Fuzzy self-adaptive NMPC track tracking control and energy management method
CN115061371B (en) * 2022-06-20 2023-08-04 中国航空工业集团公司沈阳飞机设计研究所 Unmanned plane control strategy reinforcement learning generation method capable of preventing strategy jitter
CN115167506B (en) * 2022-06-27 2024-06-28 华南师范大学 Method, device, equipment and storage medium for updating and planning unmanned aerial vehicle flight route
CN115001002B (en) * 2022-08-01 2022-12-30 广东电网有限责任公司肇庆供电局 Optimal scheduling method and system for solving problem of energy storage participation peak clipping and valley filling
CN116741019A (en) * 2023-08-11 2023-09-12 成都飞航智云科技有限公司 Flight model training method and training system based on AI
CN116736729B (en) * 2023-08-14 2023-10-27 成都蓉奥科技有限公司 Method for generating perception error-resistant maneuvering strategy of air combat in line of sight

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101002125B1 (en) * 2008-08-05 2010-12-16 주식회사 케이티 Apparatus and method of policy modeling based on partially observable markov decision processes
KR101813697B1 (en) * 2015-12-22 2017-12-29 한국항공대학교산학협력단 Unmanned aerial vehicle flight control system and method using deep learning
CN106019950B (en) * 2016-08-09 2018-11-16 中国科学院软件研究所 A kind of mobile phone satellite Adaptive Attitude control method
CN106168808A (en) * 2016-08-25 2016-11-30 南京邮电大学 A kind of rotor wing unmanned aerial vehicle automatic cruising method based on degree of depth study and system thereof
CN106970615B (en) * 2017-03-21 2019-10-22 西北工业大学 A kind of real-time online paths planning method of deeply study
CN108319286B (en) * 2018-03-12 2020-09-22 西北工业大学 Unmanned aerial vehicle air combat maneuver decision method based on reinforcement learning
CN108594638B (en) * 2018-03-27 2020-07-24 南京航空航天大学 Spacecraft ACS (auto-configuration transform) on-orbit reconstruction method oriented to multitask and multi-index optimization constraints
CN109445456A (en) * 2018-10-15 2019-03-08 清华大学 A kind of multiple no-manned plane cluster air navigation aid
CN109343341B (en) * 2018-11-21 2021-10-01 北京航天自动控制研究所 Carrier rocket vertical recovery intelligent control method based on deep reinforcement learning
CN109639377B (en) * 2018-12-13 2021-03-23 西安电子科技大学 Spectrum resource management method based on deep reinforcement learning
CN109443366B (en) * 2018-12-20 2020-08-21 北京航空航天大学 Unmanned aerial vehicle group path planning method based on improved Q learning algorithm
CN109933086B (en) * 2019-03-14 2022-08-30 天津大学 Unmanned aerial vehicle environment perception and autonomous obstacle avoidance method based on deep Q learning

Also Published As

Publication number Publication date
WO2021017227A1 (en) 2021-02-04
CN110488861A (en) 2019-11-22

Similar Documents

Publication Publication Date Title
CN110488861B (en) Unmanned aerial vehicle track optimization method and device based on deep reinforcement learning and unmanned aerial vehicle
Bayerlein et al. Trajectory optimization for autonomous flying base station via reinforcement learning
CN113162679B (en) DDPG algorithm-based IRS (intelligent resilient software) assisted unmanned aerial vehicle communication joint optimization method
CN110531617B (en) Multi-unmanned aerial vehicle 3D hovering position joint optimization method and device and unmanned aerial vehicle base station
WO2021208771A1 (en) Reinforced learning method and device
CN113543176B (en) Unloading decision method of mobile edge computing system based on intelligent reflecting surface assistance
CN110956148B (en) Autonomous obstacle avoidance method and device for unmanned vehicle, electronic equipment and readable storage medium
CN114422056B (en) Space-to-ground non-orthogonal multiple access uplink transmission method based on intelligent reflecting surface
CN113191484A (en) Federal learning client intelligent selection method and system based on deep reinforcement learning
CN113433967B (en) Chargeable unmanned aerial vehicle path planning method and system
CN113568727A (en) Mobile edge calculation task allocation method based on deep reinforcement learning
CN114261400B (en) Automatic driving decision method, device, equipment and storage medium
CN113469325A (en) Layered federated learning method, computer equipment and storage medium for edge aggregation interval adaptive control
CN114302407B (en) Network decision method and device, electronic equipment and storage medium
CN112804103B (en) Intelligent computing migration method for joint resource allocation and control in block chain energized Internet of things
CN113406965A (en) Unmanned aerial vehicle energy consumption optimization method based on reinforcement learning
CN116227767A (en) Multi-unmanned aerial vehicle base station collaborative coverage path planning method based on deep reinforcement learning
CN113377131A (en) Method for obtaining unmanned aerial vehicle collected data track by using reinforcement learning
CN116627162A (en) Multi-agent reinforcement learning-based multi-unmanned aerial vehicle data acquisition position optimization method
CN113382060B (en) Unmanned aerial vehicle track optimization method and system in Internet of things data collection
Han et al. Multi-uav automatic dynamic obstacle avoidance with experience-shared a2c
CN116774584A (en) Unmanned aerial vehicle differentiated service track optimization method based on multi-agent deep reinforcement learning
Tan et al. A hybrid architecture of cognitive decision engine based on particle swarm optimization algorithms and case database
Hoang et al. Deep Reinforcement Learning for Wireless Communications and Networking: Theory, Applications and Implementation
Sakthitharan et al. Establishing an emergency communication network and optimal path using multiple autonomous rover robots

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant