CN113281999A

CN113281999A - Unmanned aerial vehicle autonomous flight training method based on reinforcement learning and transfer learning

Info

Publication number: CN113281999A
Application number: CN202110441572.3A
Authority: CN
Inventors: 俞扬; 詹德川; 周志华; 黄军富; 庞竟成; 张云天; 管聪; 陈雄辉
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2021-04-23
Filing date: 2021-04-23
Publication date: 2021-08-20

Abstract

The invention discloses an unmanned aerial vehicle autonomous flight training method based on reinforcement learning and transfer learning, which comprises the steps of (1) establishing an unmanned aerial vehicle simulator environment; (2) constructing an environment transfer model based on deep learning, and randomly initializing mapping in the environment transfer model; (3) constructing a learning-enhanced A3C algorithm, and randomly initializing a flight strategy of the algorithm; (4) constructing an environment inverse transfer model based on deep learning; (5) collecting flight data obtained by operating an unmanned aerial vehicle to fly by an unmanned aerial vehicle operator and a strategy in a real environment; (6) updating the environment transfer model based on the real flight data; (7) using and performing transfer learning based on action correction, correcting a flight strategy, and executing in a simulator to obtain simulated flight data; (8) based on the simulated flight data, the flight strategy is updated using the A3C algorithm, while the environmental reverse transfer model is updated. Until the strategy converges. And finally obtaining the strategy as the initial flight strategy of the real unmanned aerial vehicle.

Description

Unmanned aerial vehicle autonomous flight training method based on reinforcement learning and transfer learning

Technical Field

The invention relates to an unmanned aerial vehicle autonomous flight training method based on reinforcement learning and transfer learning, and belongs to the technical field of unmanned aerial vehicle autonomous flight control.

Background

Autonomous flight control of an unmanned aerial vehicle in various, complex and rapidly changing environments has always been a difficult point in the field of unmanned aerial vehicle flight control. The conventional flight control is realized by compiling flight control rules in a manual mode, namely, all possible situations in the flight process of the unmanned aerial vehicle are considered in advance, and the listed situations are processed one by one through modes such as feedback control, rule compiling and the like by combining with the professional knowledge and experience of experts in the field of the unmanned aerial vehicle. However, first, the rule writing requires a large amount of labor cost; secondly, if the mutual influence among various situations cannot be considered, the flight control fails; finally, the autonomous flight control of the unmanned aerial vehicle needs to consider high-dimensional sensing information such as radar and a camera, and the processing of the high-dimensional information is a huge challenge for the conventional flight control of the unmanned aerial vehicle.

In recent years, intensive learning has been successful in autonomous control in both a simulation environment such as an electronic game and a real environment such as a robot arm. The reinforcement learning training needs a large number of samples, the samples are obtained in a real environment, and the disadvantages of high risk, low speed, high equipment cost and the like exist, so that a simulator is needed for simulating the real environment. In some related researches in the field of unmanned aerial vehicles, a simulation simulator for simulating a real environment is constructed through a reinforcement learning algorithm, so that the simulated unmanned aerial vehicle can perform a large number of trial and error in the simulation environment, and an autonomous flight strategy is obtained through learning.

However, the simulator environment and the simulated drone are inevitably different from the real environment and the real drone, so that the flight strategy learned in the simulator is unlikely to bring promotion to the performance of the drone in the real environment. In order to solve the above-mentioned problem of human cost in the conventional autonomous flight control and the problem of difference between the simulator and the real environment in the reinforcement learning, one method is to combine the reinforcement learning and the transfer learning. Unmanned aerial vehicle carries out a large amount of trial and error study through the mode of reinforcement study in the simulator environment to the mode through migration study reduces the adverse effect that the difference of simulator environment and reality environment brought. It is critical which transfer learning algorithm is selected for a particular problem. A commonly used transfer learning algorithm is Domain adaptation (Domain adaptation), which requires optimal flight data in a real environment, and requires that a simulated flight trajectory of an unmanned aerial vehicle has the same distribution as the optimal flight trajectory in the real environment when training a flight strategy of the unmanned aerial vehicle in a simulator. The rest large amount of non-optimal flight data cannot be fully utilized by the domain-adaptive transfer learning, and the optimal flight data of the real environment is not easy to acquire.

Disclosure of Invention

The purpose of the invention is as follows: the invention provides an unmanned aerial vehicle autonomous flight training method based on reinforcement learning and transfer learning, aiming at the problems that the unmanned aerial vehicle autonomous flight control cannot process complex and variable environments due to manual rule control, and flight strategies cannot be applied to real environments due to unavoidable difference between simulation environments for carrying out unmanned aerial vehicle flight strategy training by using reinforcement learning algorithms and real environments. The adverse effect of simulator and reality difference on flight strategy is reduced by combining the transfer learning based on the action correction based on the reinforcement learning. The optimal flight strategy obtained by learning of the invention has smaller difference with the actual environment, and the flight action is smooth, so that the method can be used as a better initial flight strategy or an auxiliary flight strategy in the actual unmanned aerial vehicle.

The technical scheme is as follows: an unmanned aerial vehicle autonomous flight training method based on reinforcement learning and transfer learning is characterized in that flight data in a real environment are collected, and a state transition model of the real environment is obtained through learning; and simultaneously training the flight strategy of the unmanned aerial vehicle and the reverse transfer model of the simulator environment in the simulator, and correcting the flight action to be executed in the simulator by using the transfer model of the real environment and the reverse transfer model of the simulation environment. The method comprises the following steps:

(1) creating an unmanned aerial vehicle simulator environment; (2) construction of an environmental transfer model f based on deep learning_αI.e. the mapping of "current state-current action" to the next state and randomly initializing the mapping; (3) constructing a strengthened learning A3C algorithm and randomly initializing a flight strategy pi thereof_θ(ii) a (4) Construction of environment reverse transfer model f 'based on deep learning'_βI.e., the mapping of "current state-next state" to the current action, and randomly initializing the mapping; (5) collecting unmanned aerial vehicle operators and policies pi_θOperating the unmanned aerial vehicle to fly in a real environment to obtain flight data, namely track data formed by continuous 'state-action' pairs; (6) updating an environmental transfer model f based on real flight data_α(ii) a (7) Using f_αAnd f'_βPerforming a transfer learning based on a motion correction (grouped Action Transformation) to correct a flight strategy pi_θObtaining a flight strategy pi ', and executing pi' in a simulator to obtain simulated flight data; (8) updating flight strategy pi by using A3C algorithm based on simulated flight data_θUpdating environmental reverse transfer model f 'at the same time'_β. Repeating (5) - (8) until strategy pi_θAnd (6) converging. Finally, the strategy pi is obtained_θAs an initial flight strategy for a real drone.

And constructing a simulation simulator based on the aerodynamic model, the unmanned aerial vehicle model and the possible flight scene and flight mission encountered by the unmanned aerial vehicle, and visualizing by using a Unreal4 game engine. The simulation simulator comprises an unmanned aerial vehicle, a flight scene and a flight task, wherein in the simulation simulator, the flight state of the unmanned aerial vehicle changes along with the time in the flight process, and the simulation environment can continuously generate various obstacles. The process can be approximated as a Markov Decision Process (MDP) represented by a five-tuple < S, A, P, R, γ > where S is the state space, A is the motion space, P is the state transition probability, R is the single step reward derived from the environment, and γ is the discount factor for the accumulated reward. The observation information provided by the simulator includes the relative orientation and distance of the target, the lane offset, the wireless and radar detection information, etc.

Flight strategy pi using drone operator and simulator_θControlling the unmanned aerial vehicle, collecting flight data of the unmanned aerial vehicle in the real environment, extracting all triples (s, a, s '), wherein s is the current state, a is the current action, and s' is the next state, and obtaining a data set D for training a state transition model of the real environment_real＝{(s₁,a₁,s₂),(s₂,a₂,s₃),...,(s_n-1,a_n-1,s_n)}。

Using the current state-current action pair as a feature (feature) and the next state as a label (label), performing regression learning, and training a state transition model f in a real environment_α. By minimizing the transfer loss function:

and updating the model parameters.

Repeatedly, state s is input into flight strategy π in A3C_θThe output is obtained as action a, and action a ' is obtained as f ' by using the action correction '_β(s,f_α(s, a)), an action a' is performed in the simulator resulting in a new state s, and a prize value r. Until enough data sets were collected to train the A3C algorithm model: d {(s)₁,a₁,r₁,s₂),(s₂,a₂,r₂,s₃),...,(s_n-1,a_n-1,r_n-1,s_n) Data set of the inverse transfer model f': d_sim＝{(s₁,s₂,a₁),(s₂,s₃,a₂),...,(s_n-1,s_n,a_n-1)}。

In A3C reinforcement learning algorithm flight strategy training, a deep neural network is used as a flight strategy pi_θ(actor) and evaluation network (critic). In the actor-critic algorithm framework, commentsThe family is responsible for evaluating the actors, and the actors perform and perform skill improvement according to the evaluation. The A3C (Advantage operator-critic) algorithm is based on the operator-critic and is improved by two steps: (1) A3C uses an Advantage function (Advantage function) as critic, which reduces the variance of critic, thereby reducing the variance of strategy gradient and further stabilizing training; (2) the data samples are asynchronously collected and the actor and the critic are asynchronously updated by using a plurality of simulation environments, so that more samples can be collected in unit time, the network updating is performed, the training speed is increased, and the variance of the critic gradient of the actor is further reduced. The dominance function of A3C is: advantage (s, a) ═ Q (s, a) -V_φ(s) where Q is a function of the state action value, V is a value network, and φ is a neural network parameter of V,

is a constant for the discount factor.

Updating a strategic neural network (Actor) and a value neural network (criticic) based on an A3C algorithm based on the collected data set, defining a loss function of the value network in A3C as:

define the actor network pi in A3C network_θThe loss function of (d) is: j. the design is a square_π(θ)＝-E_(s,a)～D[logπ_θ(a|s)(Q(s,a)-V_φ(s))]Training an optimization strategy by iteratively sampling data and optimizing the loss function, minimizing the loss function J_π(theta) and J_v(phi), r (s, a) is a reward function given by the environment, phi_oldOld values stored for phi, pi being flight strategy pi_θFor short. .

Training reverse transfer model f 'of simulator environment based on regression learning algorithm'_βThe neural network parameters are updated by minimizing the following loss function:

until the model converges or a maximum number of iterations is reached. The finally obtained flight strategy model is applied to the actual unmanned aerial vehicle, and the effect of the flight strategy model is observed.

Compared with the prior art, the invention has the following advantages:

1. the invention uses the deep reinforcement learning algorithm to enable the unmanned aerial vehicle to have the capability of autonomous flight in the complex and variable flight environment, and the realization is more efficient than the traditional control mode of manually writing rules.

2. According to the invention, by means of the migration learning algorithm based on the Action correction (grouped Action Transformation), the adverse effect on the flight strategy caused by the difference between the real environment and the simulator environment is reduced, and the strategy obtained by training in the simulator can be better used as the initial flight strategy of the real unmanned aerial vehicle.

3. The migration learning algorithm used by the invention can utilize the non-optimal flight data of the unmanned aerial vehicle in the real environment, so that the algorithm has higher robustness.

Drawings

FIG. 1 is an overall frame diagram of the present invention;

FIG. 2 is a training flow diagram of the present invention.

Detailed Description

The present invention is further illustrated by the following examples, which are intended to be purely exemplary and are not intended to limit the scope of the invention, as various equivalent modifications of the invention will occur to those skilled in the art upon reading the present disclosure and fall within the scope of the appended claims.

The unmanned aerial vehicle autonomous flight training method based on reinforcement learning and transfer learning is used for acquiring flight data in a real environment and learning to obtain a state transition model of the real environment; and simultaneously training the flight strategy of the unmanned aerial vehicle and the reverse transfer model of the simulator environment in the simulator, and correcting the flight action to be executed in the simulator by using the transfer model of the real environment and the reverse transfer model of the simulation environment. Comprises the following steps:

the method comprises the following steps:

a simulation simulator is constructed based on the aerodynamic model, the unmanned aerial vehicle model and the flight scenario and flight mission that the unmanned aerial vehicle may encounter, and is visualized using the Unreal4 game engine. In the simulator, unmanned aerial vehicle is along with time lapse in the flight process, and the flight state of self can change, and various complicated barriers also can be constantly produced to the simulated environment. The process can be approximated as a Markov Decision Process (MDP) represented by a five-tuple < S, A, P, R, γ > where S is the state space, A is the motion space, P is the state transition probability, R is the single step reward derived from the environment, and γ is the discount factor for the accumulated reward.

Step two:

constructing and initializing a model of reinforcement learning A3C Algorithm, an inverse transfer model f 'of a simulator Environment'_βAnd a transfer model f of the real environment_α. Wherein f'_βFor the mapping of "Current State-Next State" to Current action, f_αIs the mapping of the "current state-current action" pair to the next state.

Step three:

Step four:

according to the data obtained in the third step, taking the current state-current action pair as a feature (feature) and the next state as a label (label), performing regression learning, and training the state transition model f of the real environment_α. By minimizing the loss function:

step five:

repeatedly, state s is input into flight strategy π in A3C_θThe output is obtained as action a, and action a ' is obtained as f ' by using the action correction '_β(s,f_α(s, a)), an action a' is performed in the simulator resulting in a new state s, and a prize value r. Until enough data sets were collected to train the A3C algorithm model:

D＝{(s₁,a₁,r₁,s₂),(s₂,a₂,r₂,s₃),...,(s_n-1,a_n-1,r_n-1,s_n)}

and reverse transfer model f'_αThe data set of (a):

D_sim＝{(s₁,s₂,a₁),(s₂,s₃,a₂),...,(s_n-1,s_n,a_n-1)}。

step six:

based on the data collected in step five, the strategic neural network (Actor) and the value neural network (Critic) are updated based on the A3C algorithm by minimizing the loss function J described below_π(theta) and J_v(φ)：

J_π(θ)＝-E_(s,a)～D[logπ_θ(a|s)(Q(s,a)-V_φ(s))]

Wherein Q is a function of the state action value, a function of the V state value,

V(s)＝E_{(s,a,s',r)～D}[r(s,a)+V_φ(s')],

Q(s,a)＝r(s,a)+Ε_s'～p(s,a)[V_φ(s')]

training reverse transfer model f 'of simulator environment based on regression learning algorithm'_βBy minimizing the following loss function:

and repeating the third, fourth, fifth and sixth steps until the model converges or the maximum iteration times. The finally obtained flight strategy model is applied to the actual unmanned aerial vehicle, and the effect of the flight strategy model is observed. The overall algorithm flow is shown in algorithm 1 below.

Claims

1. An unmanned aerial vehicle autonomous flight training method based on reinforcement learning and transfer learning is characterized by comprising the following steps:

(1) creating an unmanned aerial vehicle simulator environment;

(2) construction of an environmental transfer model f based on deep learning_αI.e. the mapping of "current state-current action" to the next state and randomly initializing the mapping;

(3) constructing a strengthened learning A3C algorithm and randomly initializing a flight strategy pi thereof_θ；

(4) Construction of environment reverse transfer model f 'based on deep learning'_βI.e., the mapping of "current state-next state" to the current action, and randomly initializing the mapping;

(5) collecting unmanned aerial vehicle operator and flight strategy pi_θOperating the unmanned aerial vehicle to fly in a real environment to obtain flight data, namely track data formed by continuous 'state-action' pairs;

(6) updating an environmental transfer model f based on real flight data_α(ii) a (7) Using f_αAnd f'_βPerforming transfer learning based on action correction, correcting flight strategy pi_θObtaining a flight strategy pi ', and executing pi' in a simulator to obtain simulated flight data;

(8) updating flight strategy pi by using A3C algorithm based on simulated flight data_θUpdating environmental reverse transfer model f 'at the same time'_β；

Repeating (5) - (8) until strategy pi_θConverging; finally, the strategy pi is obtained_θAs an initial flight strategy for a real drone.

2. The unmanned aerial vehicle autonomous flight training method based on reinforcement learning and transfer learning of claim 1, wherein a simulation simulator is constructed based on an aerodynamic model, an unmanned aerial vehicle model and an unmanned aerial vehicle encountering flight scene and flight mission, and is visualized by using a Unreal4 game engine; the simulation simulator comprises an unmanned aerial vehicle, a flight scene and a flight task, wherein in the simulation simulator, the flight state of the unmanned aerial vehicle changes along with the time lapse in the flight process, and the simulation environment can continuously generate various obstacles; the process is expressed by a Markov decision process, and is expressed by a quintuple < S, A, P, R, gamma, wherein S is a state space, A is an action space, P is a state transition probability, R is a single step reward obtained from the environment, and gamma is a discount factor of the accumulated reward.

3. The reinforcement learning and migration learning-based unmanned aerial vehicle autonomous flight training method according to claim 1, characterized in that a unmanned aerial vehicle operator and simulator flight strategy pi is used_θControlling the unmanned aerial vehicle, collecting flight data of the unmanned aerial vehicle in the real environment, extracting all triples (s, a, s '), wherein s is the current state, a is the current action, and s' is the next state, and obtaining a data set D for training a state transition model of the real environment_real＝{(s₁,a₁,s₂),(s₂,a₂,s₃),...,(s_n-1,a_n-1,s_n)}。

4. The unmanned aerial vehicle autonomous flight training method based on reinforcement learning and transfer learning of claim 1, wherein the state of the real environment is trained by performing regression learning with a current state-current action pair as a feature and a next state as a labelState transition model f_αBy minimizing the transfer loss function:

and updating the neural network parameter alpha of the transfer model.

5. The unmanned aerial vehicle autonomous flight training method based on reinforcement learning and transfer learning of claim 1, wherein state s is repeatedly input into flight strategy pi in A3C_θThe output is obtained as action a, and action a ' is obtained as f ' by using the action correction '_β(s,f_α(s, a)), an action a' is performed in the simulator resulting in a new state s, and a prize value r. Until enough data sets were collected to train the A3C algorithm model: d {(s)₁,a₁,r₁,s₂),(s₂,a₂,r₂,s₃),...,(s_n-1,a_n-1,r_n-1,s_n) And reverse transfer model f'_βThe data set of (a): d_sim＝{(s₁,s₂,a₁),(s₂,s₃,a₂),...,(s_n-1,s_n,a_n-1)}。

6. The unmanned aerial vehicle autonomous flight training method based on reinforcement learning and transfer learning of claim 1, wherein in the training of the flight strategy of the A3C reinforcement learning algorithm, a deep neural network is used as a network model of the flight strategy and an evaluation network; the A3C algorithm is based on the actor-critic with two improvements: (1) A3C uses the merit function as critic; (2) asynchronously collecting data samples and asynchronously updating an actor and a critic using a plurality of simulation environments; the dominance function of A3C is: advantage (s, a) ═ Q (s, a) -V_φ(s) where Q is a function of the state action value, V is the value network, phi is the neural network parameter of V,

gamma is 0 < gamma < 1, is a discount factor, and is a constant。

7. The unmanned aerial vehicle autonomous flight training method based on reinforcement learning and migration learning of claim 5, wherein based on the collected data set, the strategy neural network (Actor) and the value neural network (Critic) are updated based on the A3C algorithm, and a loss function of the value network in A3C is defined as:

define the actor network pi in A3C network_θThe loss function of (d) is: j. the design is a square_π(θ)＝-E_(s,a)～D[logπ_θ(a|s)(Q(s,a)-V_φ(s))]Training an optimization strategy by iteratively sampling data and optimizing the loss function, minimizing the loss function J_π(theta) and J_v(phi), where r (s, a) is a reward function given by the environment, phi_oldOld values stored for phi, pi being flight strategy pi_θFor short.

8. The unmanned aerial vehicle autonomous flight training method based on reinforcement learning and transfer learning of claim 5, wherein an inverse transfer model f 'of a simulator environment is trained based on a regression learning algorithm'_βUpdating the neural network parameter β by minimizing the following inverse transfer loss function:

until the model converges or the maximum number of iterations is reached; finally obtained flight strategy pi_θAnd the method is applied to the real unmanned aerial vehicle.