CN114895697B - Unmanned aerial vehicle flight decision method based on meta reinforcement learning parallel training algorithm - Google Patents

Unmanned aerial vehicle flight decision method based on meta reinforcement learning parallel training algorithm

Info

Publication number
CN114895697B
CN114895697B
Authority
CN
China
Prior art keywords
aerial vehicle
unmanned aerial
flight
meta
rewards
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210594911.6A
Other languages
Chinese (zh)
Other versions
CN114895697A (en)
Inventor
李波
白双霞
甘志刚
康培棋
杨慧林
万开方
高晓光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University
Priority to CN202210594911.6A priority Critical patent/CN114895697B/en
Publication of CN114895697A publication Critical patent/CN114895697A/en
Application granted granted Critical
Publication of CN114895697B publication Critical patent/CN114895697B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/08 Control of attitude, i.e. control of roll, pitch, or yaw
    • G05D1/0808 Control of attitude, i.e. control of roll, pitch, or yaw specially adapted for aircraft
    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/10 Simultaneous control of position or course in three dimensions
    • G05D1/101 Simultaneous control of position or course in three dimensions specially adapted for aircraft
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention provides an unmanned aerial vehicle flight decision method based on a meta reinforcement learning parallel training algorithm, which comprises the steps of: first constructing an unmanned aerial vehicle flight control model; then constructing a state space, an action space and a reward function for the unmanned aerial vehicle flight decision according to a Markov decision process; next, constructing a multi-task experience pool for storing the training sample data of the meta reinforcement learning algorithm; then defining the parameters of the meta reinforcement learning algorithm and training in parallel in a plurality of environments to obtain an unmanned aerial vehicle meta reinforcement learning decision model; and finally, randomly initializing a new flight environment and unmanned aerial vehicle state, testing the unmanned aerial vehicle flight decision model based on the meta reinforcement learning algorithm, and evaluating the flight decision performance. By training the strategy in a plurality of environments, the method overcomes the insufficient generalization of the SAC algorithm, optimizes the unmanned aerial vehicle flight decision strategy as a whole, converges with little additional training in a new environment, and effectively improves the generalization capability and universality of the strategy.

Description

Unmanned aerial vehicle flight decision method based on meta reinforcement learning parallel training algorithm
Technical Field
The invention relates to the technical field of unmanned aerial vehicles, in particular to an unmanned aerial vehicle flight decision method.
Background
With the development of unmanned aerial vehicle technology, unmanned aerial vehicles have been applied to many aspects of production and daily life, from early experimental test flights to civil aerial photography and, in recent years, autonomous navigation and even distributed positioning and three-dimensional reconstruction. With its high maneuverability and many degrees of freedom, the unmanned aerial vehicle is becoming an important component of the future field of artificial intelligence.
With the progress of science and technology, deep reinforcement learning, which combines the perception capability of deep learning with the decision-making capability of reinforcement learning, provides a new solution for unmanned aerial vehicle flight decision-making. An unmanned aerial vehicle decision policy based on deep reinforcement learning can achieve a good flight effect when trained in a single environment, but because reinforcement learning algorithms generalize poorly, the effect drops greatly when the decision strategy is applied directly to a new environment. Prior related research only increases the randomness of the reinforcement learning training process in a single environment so as to improve the generalization of the flight decision policy across different test environments; it does not introduce different environments and tasks in the training stage, so the generalization capability of the algorithm remains very limited.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides an unmanned aerial vehicle flight decision method based on a meta reinforcement learning parallel training algorithm. Firstly, an unmanned aerial vehicle flight control model is constructed; then a state space, an action space and a reward function for the unmanned aerial vehicle flight decision are constructed according to a Markov decision process; next, a multi-task experience pool is constructed for storing the training sample data of the meta reinforcement learning algorithm; then the parameters of the meta reinforcement learning algorithm are defined and training is performed in parallel in a plurality of environments to obtain the unmanned aerial vehicle meta reinforcement learning decision model; finally, a new flight environment and unmanned aerial vehicle state are randomly initialized, the unmanned aerial vehicle flight decision model based on the meta reinforcement learning algorithm is tested, and the flight decision performance is evaluated. The method addresses the insufficient generalization of the SAC algorithm by training the strategy in a plurality of environments. By combining meta-learning with the SAC algorithm, the reinforcement learning policy can achieve good flight performance in a new environment with only a small amount of additional training, which improves the generalization of the algorithm.
The technical scheme adopted by the invention for solving the technical problems comprises the following steps:
Step S1: constructing an unmanned aerial vehicle flight control model;
In order to solve the position and attitude information of the unmanned aerial vehicle in real time, an unmanned aerial vehicle flight control rigid body model is adopted, wherein the rigid body model comprises an unmanned aerial vehicle kinematics model and an unmanned aerial vehicle dynamics model;
Step S2: constructing a state space, an action space and a reward function of unmanned aerial vehicle flight decisions according to a Markov decision process;
(1) State space design
The state space consists of two parts, namely the environment information acquired by the sensors in real time and the unmanned aerial vehicle flight state information, wherein the environment information comprises the image information acquired by the forward camera of the unmanned aerial vehicle, and the unmanned aerial vehicle flight state information is expressed in vector form as:
s = [P^e, v^e, q, ω^b]
where P^e = [p_x^e, p_y^e, p_z^e]^T represents the position coordinates of the unmanned aerial vehicle in the earth coordinate system o_e x_e y_e z_e, with p_x^e, p_y^e, p_z^e the position components along the x_e, y_e, z_e coordinate axes; v^e = [v_x^e, v_y^e, v_z^e]^T represents the linear velocity of the unmanned aerial vehicle in the earth coordinate system, with v_x^e, v_y^e, v_z^e the linear velocity components along the x_e, y_e, z_e coordinate axes; q is the quaternion representing the attitude of the unmanned aerial vehicle; ω^b = [ω_x^b, ω_y^b, ω_z^b]^T represents the angular velocity of the unmanned aerial vehicle in the body coordinate system o_b x_b y_b z_b, with ω_x^b, ω_y^b, ω_z^b the angular velocity components about the x_b, y_b, z_b coordinate axes;
(2) Action space design
The action space is defined as the linear velocity of the unmanned aerial vehicle in the earth coordinate system o_e x_e y_e z_e, i.e. a = v^e = [v_x^e, v_y^e, v_z^e]^T;
(3) Reward function design
The reward function consists of sparse rewards and continuous rewards, and comprises a position reward, a collision reward and a speed reward;
Step S3: constructing a multitask experience pool for storing the training sample data of the element reinforcement learning algorithm;
Suppose there are n different flight environments; the unmanned aerial vehicle task in each environment is defined as a Markov decision process M_i = <S_i, A_i, P_i, R_i> ∈ M. During execution of flight task T_i, the interaction of the unmanned aerial vehicle with the environment generates experience data <s, a, P_i, R_i>, which is stored in the experience pool D_i. The experience pools D_i, i ∈ [1, n], of all tasks are combined together to form the multi-task experience pool D;
Step S4: initializing n different flight environments and unmanned aerial vehicle states, setting the meta-learning update frequency and number of update steps, and realizing the unmanned aerial vehicle meta reinforcement learning decision model through parallel training in a plurality of environments;
Step S5: and randomly initializing a new flight environment and an unmanned aerial vehicle state, testing an unmanned aerial vehicle flight decision model based on a meta reinforcement learning algorithm, and evaluating flight decision performance.
S51: initializing the flight state of the unmanned aerial vehicle and obtaining the initial decision-model state s_t;
S52: initializing the Actor network and the Critic networks according to the meta-policy network weight φ_meta and the meta-Critic network weight θ_meta, and executing the Actor and Critic network training update of step S43;
S53: inputting the state s_t into the trained Actor network to obtain the unmanned aerial vehicle decision action a_t, executing the action, and obtaining the new state s_{t+1};
S54: judging whether the flight decision task is finished; if so, ending; otherwise, letting s_t = s_{t+1} and executing steps S51 to S54.
The unmanned aerial vehicle flight control rigid body model comprises an unmanned aerial vehicle kinematics model and an unmanned aerial vehicle dynamics model;
(1) Unmanned aerial vehicle kinematics model
The unmanned aerial vehicle kinematics model is independent of the mass of the unmanned aerial vehicle and the forces acting on it; it only studies the relationships among the velocity, angular velocity, position and attitude of the unmanned aerial vehicle. The kinematics model takes the velocity and angular velocity of the unmanned aerial vehicle as inputs and outputs the corresponding position and attitude, and comprises a position kinematics model and an attitude kinematics model. The position kinematics model is defined as follows:
dP^e/dt = v^e
where P^e = [p_x^e, p_y^e, p_z^e]^T represents the position coordinates of the center of gravity of the unmanned aerial vehicle in the earth coordinate system o_e x_e y_e z_e, dP^e/dt represents the rate of change of the unmanned aerial vehicle position, and v^e represents the velocity of the unmanned aerial vehicle in the earth coordinate system;
The attitude of the unmanned aerial vehicle is represented by a quaternion:
q = [q_0, q_v^T]^T = [q_0, q_1, q_2, q_3]^T
where q_0 is the scalar part of q and q_v = [q_1, q_2, q_3]^T is the vector part. For a real number s, the corresponding quaternion is written q = [s, 0_{1×3}]^T; for a pure vector v, the corresponding quaternion is written q = [0, v^T]^T;
The attitude angles of the unmanned aerial vehicle are recovered from the quaternion:
φ = atan2(2(q_0 q_1 + q_2 q_3), 1 − 2(q_1^2 + q_2^2))
θ = arcsin(2(q_0 q_2 − q_1 q_3))
ψ = atan2(2(q_0 q_3 + q_1 q_2), 1 − 2(q_2^2 + q_3^2))
where φ ∈ [−π, π] is the roll angle of the unmanned aerial vehicle, ψ ∈ [−π, π] is the yaw angle of the unmanned aerial vehicle, and θ ∈ [−π/2, π/2] is the pitch angle of the unmanned aerial vehicle;
The attitude kinematics model is defined as follows:
dq/dt = (1/2) [ −q_v^T ; q_0 I_3 + [q_v]_× ] ω^b
where ω^b represents the angular velocity of the unmanned aerial vehicle in the body coordinate system o_b x_b y_b z_b, q_0 is the scalar part of the quaternion, q_v is the vector part of the quaternion, q_v^T denotes the transpose of q_v, [q_v]_× denotes the skew-symmetric matrix formed from q_v, dq/dt represents the rate of change of the attitude of the unmanned aerial vehicle, and I_3 denotes the third-order identity matrix;
(2) Unmanned aerial vehicle dynamics model
The inputs of the unmanned aerial vehicle dynamics model are the thrust and the moments (the pitching moment, the rolling moment and the yawing moment), and the outputs are the corresponding unmanned aerial vehicle velocity and angular velocity; the dynamics model comprises a position dynamics model and an attitude dynamics model;
The position dynamics model is defined as follows:
dv^e/dt = g e_3 − (f/m) R_b^e e_3
where dv^e/dt represents the rate of change of the unmanned aerial vehicle velocity in the earth coordinate system o_e x_e y_e z_e, m represents the mass of the unmanned aerial vehicle, f represents the total pulling force of the propellers, g represents the gravitational acceleration, e_3 = [0, 0, 1]^T is a unit vector, R_b^e represents the rotation matrix from the body coordinate system to the earth coordinate system, φ represents the roll angle of the unmanned aerial vehicle, θ represents the pitch angle of the unmanned aerial vehicle, and ψ represents the yaw angle of the unmanned aerial vehicle;
The attitude dynamics equation is established in the body coordinate system as follows:
J dω^b/dt = −ω^b × (J ω^b) + G_a + τ
where τ represents the moments generated by the rotation of the propellers about the axes of the unmanned aerial vehicle body, J represents the moment of inertia matrix of the unmanned aerial vehicle itself, and G_a represents the gyroscopic moment;
Combining the above, the position and attitude kinematics models together with the position and attitude dynamics models constitute the rigid body model of unmanned aerial vehicle flight control.
The reward function consists of sparse rewards and continuous rewards, and comprises a position reward, a collision reward and a speed reward;
The position reward comprises a position sparse reward and a position continuous reward; the position sparse reward is a reward given when the unmanned aerial vehicle successfully passes an obstacle, and is used to evaluate the obstacle avoidance performance of the flight decision strategy;
The position sparse reward is defined as r_2, where N_barrier represents the total number of obstacles in the environment and level represents the number of obstacles the unmanned aerial vehicle has passed;
The position continuous reward is defined as r_1, where y_t and y_{t-1} respectively represent the y_e-axis coordinate values of the unmanned aerial vehicle in the earth coordinate system o_e x_e y_e z_e at times t and t-1, and y_goal represents the y_e-axis coordinate value of the destination of the unmanned aerial vehicle flight task;
The collision reward is a sparse reward used to evaluate whether the unmanned aerial vehicle has collided; the unmanned aerial vehicle obtains the collision reward r_3 during flight;
The speed reward r_4 is defined as
r_4 = r' + r''
where v represents the current speed of the unmanned aerial vehicle, v_limit represents the set minimum speed of the unmanned aerial vehicle, and v_y^e represents the component of the unmanned aerial vehicle speed on the y_e axis in the earth coordinate system o_e x_e y_e z_e;
Combining the above,
R = r_1 + r_2 + r_3 + r_4
i.e. the reward function R contains the position rewards r_1 and r_2, the collision reward r_3 and the speed reward r_4.
The parallel training of the meta reinforcement learning comprises the following steps:
Step S41: setting the batch training sample number batch_size and the number of training update steps in the n different flight environments; for environment i, initializing the experience pool D_i, randomly generating the Actor network weight φ_i and the Critic network weights θ_1^i and θ_2^i, initializing the Actor network π_{φ_i} and the Critic networks Q_{θ_1^i} and Q_{θ_2^i}, and letting θ_1'^i = θ_1^i, θ_2'^i = θ_2^i to initialize the target Critic networks Q_{θ_1'^i} and Q_{θ_2'^i};
Step S42: inputting the state of the unmanned aerial vehicle into the Actor network to obtain a Gaussian policy distribution with mean μ and variance σ; sampling the policy randomly to obtain the unmanned aerial vehicle decision action a_t ~ π_{φ_i}(·|s_t); after executing action a_t the unmanned aerial vehicle obtains the next state s_{t+1}, the reward r(s_t, a_t) is computed according to the reward function in step S3, and the decision data {s_t, a_t, r(s_t, a_t), s_{t+1}} are stored in the experience pool D_i;
Step S43: when the number of experiences in the experience pool is larger than batch_size, randomly extracting batch_size experience samples M as training data for the SAC algorithm; during training, gradient descent with learning rate lr is performed on the Actor network loss function J_π(φ) and the Critic network loss functions J_Q(θ_i), i = 1, 2, to update the Actor network and Critic network weights; by training with the SAC algorithm in the n different flight environments, the Actor network weight φ_i and the Critic network weights θ_1^i and θ_2^i are obtained for each task i;
Step S44: judging whether the set number of training update steps is reached; if so, executing step S45; otherwise, executing steps S41 to S44;
Step S45: performing the meta-learning update on the Actor network weights φ_i and the Critic network weights θ_1^i, θ_2^i obtained in the n different flight environments;
Parallel multi-task reinforcement learning based on meta-learning requires the decision policy to maximize the total reward over the n different environments; the meta-learning update is then performed, where φ_meta represents the meta-policy network weight updated through meta-learning, φ_i represents the Actor network weight obtained by learning task i with the SAC algorithm, θ_meta represents the meta-Critic network weight updated through meta-learning, θ_1^i and θ_2^i represent the Critic network weights obtained with the SAC algorithm, and β represents the meta-learning update learning rate;
Step S46: judging whether the model has converged; the convergence condition is that the reward function has stabilized or the set number of training meta-learning update steps has been reached; if so, training ends and the trained unmanned aerial vehicle meta reinforcement learning flight decision model is obtained; otherwise, steps S41 to S46 are executed.
The invention has the beneficial effects that:
(1) According to the invention, a plurality of flight decision environments are introduced in the parallel training process, and the flight decision data of the environments are shared, so that the unmanned aerial vehicle flight decision strategy can be integrally optimized.
(2) According to the invention, after decision samples are collected from a plurality of environments, the unmanned aerial vehicle flight decision model based on the meta reinforcement learning algorithm is trained. The meta reinforcement learning parallel training algorithm adopts an alternating inner and outer update scheme: the inner loop is updated with the reinforcement learning model and the outer loop with the meta-learning algorithm, so that the flight decision strategy is optimized as a whole. The resulting meta reinforcement learning strategy can converge with little additional training in a new environment, which effectively improves the generalization capability and universality of the strategy.
Drawings
FIG. 1 is a block diagram of a multi-tasking experience pool of the present invention.
FIG. 2 is a schematic diagram of the meta reinforcement learning parallel training process of the present invention.
FIG. 3 is a graph of a reward function of the parallel training algorithm based on meta reinforcement learning of the present invention.
Fig. 4 is a flight path of the unmanned aerial vehicle of the present invention. Fig. 4 (a) is a top view of a flight path of the unmanned aerial vehicle, and fig. 4 (b) is a coordinate change diagram of the position of the unmanned aerial vehicle on each coordinate axis during the flight of the unmanned aerial vehicle.
Detailed Description
The invention will be further described with reference to the drawings and examples.
According to the design scheme provided by the invention, the unmanned aerial vehicle flight decision based on the meta reinforcement learning algorithm comprises the following steps:
Step S1: constructing unmanned aerial vehicle flight control model
In order to describe the pose and position of the unmanned aerial vehicle, it is crucial to establish appropriate coordinate systems. A suitable coordinate system helps clarify the relationships between variables and facilitates representation and calculation. The position of the unmanned aerial vehicle is defined in the earth coordinate system, and its attitude in space mainly describes the rotational relationship between the body coordinate system and the earth coordinate system.
The earth coordinate system o_e x_e y_e z_e ignores the curvature of the earth, i.e. the surface of the earth is assumed to be a plane; it is used to study the motion state of the aircraft relative to the ground and to determine the three-dimensional position of the airframe. The origin of coordinates o_e is usually taken as the take-off position of the unmanned aerial vehicle or the center of the earth; the o_e x_e axis is defined as pointing in a chosen direction in the horizontal plane, the o_e z_e axis is defined as pointing perpendicular to the ground, and the o_e y_e axis is then determined by the right-hand rule.
The body coordinate system o_b x_b y_b z_b is fixed to the airframe of the aircraft, with its origin o_b defined at the center of gravity of the aircraft; the o_b x_b axis is defined as pointing toward the aircraft nose in the plane of symmetry of the aircraft; the o_b z_b axis is defined in the plane of symmetry of the aircraft, perpendicular to the o_b x_b axis; and the o_b y_b axis is determined according to the right-hand rule.
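The rotation relationship between the two frames is usually captured by the rotation matrix R_b^e used later in the dynamics model. The sketch below builds it from the roll, pitch and yaw angles; the Z-Y-X (yaw-pitch-roll) rotation sequence and the NumPy helper itself are illustrative assumptions, since the patent does not state its convention.

```python
import numpy as np

def rotation_body_to_earth(phi, theta, psi):
    """Rotation matrix R_b^e from the body frame o_b x_b y_b z_b to the earth
    frame o_e x_e y_e z_e, assuming a Z-Y-X (yaw-pitch-roll) rotation sequence.
    The convention is an assumption; the patent does not fix it."""
    c, s = np.cos, np.sin
    Rz = np.array([[c(psi), -s(psi), 0.0],
                   [s(psi),  c(psi), 0.0],
                   [0.0,     0.0,    1.0]])
    Ry = np.array([[ c(theta), 0.0, s(theta)],
                   [ 0.0,      1.0, 0.0     ],
                   [-s(theta), 0.0, c(theta)]])
    Rx = np.array([[1.0, 0.0,     0.0    ],
                   [0.0, c(phi), -s(phi)],
                   [0.0, s(phi),  c(phi)]])
    return Rz @ Ry @ Rx
```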
The attitude of the unmanned aerial vehicle is represented by a quaternion:
q = [q_0, q_v^T]^T = [q_0, q_1, q_2, q_3]^T
where q_0 is the scalar part of q and q_v = [q_1, q_2, q_3]^T is the vector part. For example, for a real number s the corresponding quaternion is written q = [s, 0_{1×3}]^T, and for a pure vector v the corresponding quaternion is written q = [0, v^T]^T.
The attitude angles of the unmanned aerial vehicle can be recovered from the quaternion:
φ = atan2(2(q_0 q_1 + q_2 q_3), 1 − 2(q_1^2 + q_2^2))
θ = arcsin(2(q_0 q_2 − q_1 q_3))
ψ = atan2(2(q_0 q_3 + q_1 q_2), 1 − 2(q_2^2 + q_3^2))
where φ ∈ [−π, π] is the roll angle of the unmanned aerial vehicle, ψ ∈ [−π, π] is the yaw angle of the unmanned aerial vehicle, and θ ∈ [−π/2, π/2] is the pitch angle of the unmanned aerial vehicle.
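The attitude-angle recovery can be checked with a short helper. The sketch below implements the standard scalar-first quaternion-to-Euler conversion; because the patent's own formula is reproduced above only from context, treat the exact expression as an assumption.

```python
import numpy as np

def quaternion_to_euler(q):
    """Recover roll (phi), pitch (theta), yaw (psi) from a unit quaternion
    q = [q0, q1, q2, q3] (scalar first). Standard conversion, assumed to
    correspond to the patent's formula."""
    q0, q1, q2, q3 = q
    phi = np.arctan2(2.0 * (q0 * q1 + q2 * q3), 1.0 - 2.0 * (q1**2 + q2**2))
    theta = np.arcsin(np.clip(2.0 * (q0 * q2 - q1 * q3), -1.0, 1.0))
    psi = np.arctan2(2.0 * (q0 * q3 + q1 * q2), 1.0 - 2.0 * (q2**2 + q3**2))
    return phi, theta, psi
```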
In order to solve the position and posture information of the unmanned aerial vehicle in real time, the unmanned aerial vehicle flight control rigid body model is adopted, wherein the unmanned aerial vehicle flight control rigid body model comprises unmanned aerial vehicle kinematics and dynamics models.
(1) Unmanned aerial vehicle kinematics model
The unmanned aerial vehicle kinematics model takes the velocity and angular velocity of the unmanned aerial vehicle as inputs, and the corresponding position and attitude of the unmanned aerial vehicle are obtained as outputs. The kinematics model comprises a position kinematics model and an attitude kinematics model:
The position kinematics model is defined as follows:
dP^e/dt = v^e
where P^e = [p_x^e, p_y^e, p_z^e]^T represents the position coordinates of the center of gravity of the unmanned aerial vehicle in the earth coordinate system o_e x_e y_e z_e, dP^e/dt represents the rate of change of the unmanned aerial vehicle position, and v^e represents the velocity of the unmanned aerial vehicle in the earth coordinate system.
The attitude kinematics model is defined as follows:
dq/dt = (1/2) [ −q_v^T ; q_0 I_3 + [q_v]_× ] ω^b
where ω^b represents the angular velocity of the unmanned aerial vehicle in the body coordinate system o_b x_b y_b z_b, q_0 is the scalar part of the quaternion, q_v is the vector part of the quaternion, q_v^T denotes the transpose of q_v, [q_v]_× denotes the skew-symmetric matrix formed from q_v, dq/dt represents the rate of change of the attitude of the unmanned aerial vehicle, and I_3 denotes the third-order identity matrix.
(2) Unmanned aerial vehicle dynamics model
The inputs of the unmanned aerial vehicle dynamics model are the thrust and the moments (the pitching moment, the rolling moment and the yawing moment), and the outputs are the corresponding unmanned aerial vehicle velocity and angular velocity; the dynamics model comprises a position dynamics model and an attitude dynamics model;
The position dynamics model is defined as follows:
dv^e/dt = g e_3 − (f/m) R_b^e e_3
where dv^e/dt represents the rate of change of the unmanned aerial vehicle velocity in the earth coordinate system o_e x_e y_e z_e, m represents the mass of the unmanned aerial vehicle, f represents the total pulling force of the propellers, g represents the gravitational acceleration, e_3 = [0, 0, 1]^T is a unit vector, R_b^e represents the rotation matrix from the body coordinate system to the earth coordinate system, φ represents the roll angle of the unmanned aerial vehicle, θ represents the pitch angle of the unmanned aerial vehicle, and ψ represents the yaw angle of the unmanned aerial vehicle;
The attitude dynamics model is established in the body coordinate system as follows:
J dω^b/dt = −ω^b × (J ω^b) + G_a + τ
where τ represents the moments generated by the rotation of the propellers about the axes of the unmanned aerial vehicle body, J is defined as the moment of inertia matrix of the unmanned aerial vehicle itself, and G_a represents the gyroscopic moment.
Combining the above, the position and attitude kinematics models together with the position and attitude dynamics models constitute the rigid body model of unmanned aerial vehicle flight control.
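As a reading aid, the kinematics and dynamics equations above can be combined into one simulation step. The sketch below uses simple forward-Euler integration; the integration scheme, the quaternion-to-rotation-matrix convention, the gravity sign (z_e taken as pointing toward the ground) and the function signature are all assumptions of this illustration, not part of the patent.

```python
import numpy as np

def rigid_body_step(P_e, v_e, q, omega_b, f, tau, J, m, dt, g=9.81, G_a=None):
    """One forward-Euler step of the UAV rigid-body model.
    P_e, v_e: position/velocity in the earth frame; q = [q0, q1, q2, q3]
    (scalar first); omega_b: body-frame angular velocity; f: total propeller
    thrust; tau: body moments; J: 3x3 inertia matrix; m: mass."""
    q = np.asarray(q, dtype=float)
    q0, qv = q[0], q[1:]
    qv_x = np.array([[0.0, -qv[2], qv[1]],
                     [qv[2], 0.0, -qv[0]],
                     [-qv[1], qv[0], 0.0]])
    # Rotation matrix R_b^e from the quaternion (scalar-first, body-to-earth; assumed convention).
    R = (q0**2 - qv @ qv) * np.eye(3) + 2.0 * np.outer(qv, qv) + 2.0 * q0 * qv_x
    e3 = np.array([0.0, 0.0, 1.0])
    G_a = np.zeros(3) if G_a is None else np.asarray(G_a, dtype=float)

    # Position kinematics and dynamics (z_e assumed to point toward the ground).
    P_dot = np.asarray(v_e, dtype=float)
    v_dot = g * e3 - (f / m) * (R @ e3)

    # Attitude kinematics: dq/dt = 0.5 * [-qv^T; q0*I3 + [qv]x] * omega_b
    omega_b = np.asarray(omega_b, dtype=float)
    q_dot = 0.5 * np.concatenate(([-qv @ omega_b],
                                  (q0 * np.eye(3) + qv_x) @ omega_b))

    # Attitude dynamics: J * d(omega)/dt = tau - omega x (J omega) + G_a
    omega_dot = np.linalg.solve(J, np.asarray(tau, dtype=float)
                                - np.cross(omega_b, J @ omega_b) + G_a)

    q_new = q + dt * q_dot
    return (np.asarray(P_e, dtype=float) + dt * P_dot,
            P_dot + dt * v_dot,
            q_new / np.linalg.norm(q_new),
            omega_b + dt * omega_dot)
```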
step S2: and constructing a state space, an action space and a reward function of the unmanned aerial vehicle flight decision according to the Markov decision process.
(1) State space design
The state space designed by the invention consists of two parts: the unmanned aerial vehicle flight state information and the environment information acquired by the sensors in real time. The environment information comprises the image information obtained by the forward camera of the unmanned aerial vehicle, and the unmanned aerial vehicle flight state information is expressed in vector form as:
s = [P^e, v^e, q, ω^b]
where P^e = [p_x^e, p_y^e, p_z^e]^T represents the position coordinates of the unmanned aerial vehicle in the earth coordinate system o_e x_e y_e z_e, with p_x^e, p_y^e, p_z^e the position components along the x_e, y_e, z_e coordinate axes; v^e = [v_x^e, v_y^e, v_z^e]^T represents the linear velocity of the unmanned aerial vehicle in the earth coordinate system, with v_x^e, v_y^e, v_z^e the linear velocity components along the x_e, y_e, z_e coordinate axes; q is the quaternion representing the attitude of the unmanned aerial vehicle; ω^b = [ω_x^b, ω_y^b, ω_z^b]^T represents the angular velocity of the unmanned aerial vehicle in the body coordinate system o_b x_b y_b z_b, with ω_x^b, ω_y^b, ω_z^b the angular velocity components about the x_b, y_b, z_b coordinate axes.
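A minimal sketch of how the 13-dimensional flight-state vector [P^e, v^e, q, ω^b] could be packed together with the forward-camera image; the dictionary layout and field names are illustrative assumptions, not the patent's data format.

```python
import numpy as np

def build_state(P_e, v_e, q, omega_b, camera_image):
    """Assemble the flight-state vector [P^e, v^e, q, omega^b] (3+3+4+3 = 13
    values) and pair it with the forward-camera image. The dictionary layout
    is an assumption made for illustration."""
    flight_state = np.concatenate([P_e, v_e, q, omega_b]).astype(np.float32)
    return {"flight_state": flight_state, "image": camera_image}
```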
(2) Action space design
The action space is defined as the linear velocity of the unmanned aerial vehicle in the earth coordinate system o_e x_e y_e z_e, i.e. a = v^e = [v_x^e, v_y^e, v_z^e]^T.
(3) Reward function design
The reward function designed by the invention consists of sparse rewards and continuous rewards, and comprises a position reward, a collision reward and a speed reward;
The position reward comprises a position sparse reward and a position continuous reward. The position sparse reward is set as a reward given when the unmanned aerial vehicle successfully passes an obstacle, in order to evaluate the obstacle avoidance performance of the flight decision strategy.
The position continuous reward is defined as r_1, where y_t and y_{t-1} respectively represent the y_e-axis coordinate values of the unmanned aerial vehicle in the earth coordinate system o_e x_e y_e z_e at times t and t-1, and y_goal represents the y_e-axis coordinate value of the destination of the unmanned aerial vehicle flight task;
The position sparse reward is defined as r_2, where N_barrier represents the total number of obstacles in the environment and level represents the number of obstacles the unmanned aerial vehicle has passed;
The collision reward is a sparse reward used to evaluate whether the unmanned aerial vehicle has collided; the unmanned aerial vehicle obtains the collision reward r_3 during flight;
The speed reward r_4 is defined as
r_4 = r' + r''
where v represents the current speed of the unmanned aerial vehicle, v_limit represents the set minimum speed of the unmanned aerial vehicle, and v_y^e represents the component of the unmanned aerial vehicle speed on the y_e axis in the earth coordinate system o_e x_e y_e z_e;
Combining the above, the reward function contains the position rewards r_1 and r_2, the collision reward r_3 and the speed reward r_4, and is defined as follows:
R = r_1 + r_2 + r_3 + r_4
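The composition R = r_1 + r_2 + r_3 + r_4 can be sketched as below. The individual reward terms appear in the patent only as images, so the concrete expressions used here (progress toward y_goal, a per-obstacle bonus, a fixed collision penalty and a minimum-speed penalty) are assumptions that merely follow the stated sparse/continuous split.

```python
def reward(y_t, y_prev, y_goal, level, n_barrier, collided, v, v_y, v_limit):
    """Illustrative composition of R = r1 + r2 + r3 + r4.
    All four term definitions below are assumed forms, not the patent's."""
    r1 = (y_t - y_prev) / abs(y_goal)                  # continuous position reward (assumed form)
    r2 = 10.0 * level / n_barrier                      # sparse reward for obstacles passed (assumed form)
    r3 = -10.0 if collided else 0.0                    # sparse collision penalty (assumed value)
    r4 = (-1.0 if v < v_limit else 0.0) + 0.1 * v_y    # speed reward r' + r'' (assumed forms)
    return r1 + r2 + r3 + r4
```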
step S3: and constructing a multitasking experience pool for storing the training sample data of the reinforcement learning algorithm.
Suppose there are n different flight environments; the unmanned aerial vehicle task in each environment is defined as a Markov decision process M_i = <S_i, A_i, P_i, R_i> ∈ M. During execution of flight task T_i, the interaction of the unmanned aerial vehicle with the environment generates experience data <s, a, P_i, R_i>, which is stored in the experience pool D_i. The experience pools D_i, i ∈ [1, n], of all tasks are combined together to form the multi-task experience pool D. The structure of the multi-task experience pool is shown in figure 1.
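A minimal sketch of the multi-task experience pool D built from per-task pools D_i; the capacity and uniform-sampling scheme are illustrative choices consistent with the embodiment's settings, not requirements of the patent.

```python
import random
from collections import deque

class MultiTaskReplayBuffer:
    """Multi-task experience pool D: one bounded buffer D_i per flight
    environment/task i, sampled independently by the per-task SAC learners."""
    def __init__(self, n_tasks, capacity=100_000):
        self.pools = [deque(maxlen=capacity) for _ in range(n_tasks)]

    def add(self, task_id, s, a, r, s_next):
        """Store one decision tuple {s_t, a_t, r(s_t, a_t), s_{t+1}} in pool D_i."""
        self.pools[task_id].append((s, a, r, s_next))

    def sample(self, task_id, batch_size=256):
        """Uniformly sample a training batch from pool D_i."""
        return random.sample(self.pools[task_id], batch_size)

    def __len__(self):
        return sum(len(p) for p in self.pools)
```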
Step S4: initializing n different flight environments and unmanned aerial vehicle states, setting the meta-learning update frequency and number of update steps, and realizing the unmanned aerial vehicle meta reinforcement learning decision model through parallel training in a plurality of environments.
The parallel training process realized by meta reinforcement learning is shown in fig. 2, and comprises the following steps:
Step S41: setting the batch training sample number batch_size and the number of training update steps of the algorithm in the n different flight environments; for environment i, initializing the experience pool D_i, randomly generating the Actor network weight φ_i and the Critic network weights θ_1^i and θ_2^i, initializing the Actor network π_{φ_i} and the Critic networks Q_{θ_1^i} and Q_{θ_2^i}, and letting θ_1'^i = θ_1^i, θ_2'^i = θ_2^i to initialize the target Critic networks Q_{θ_1'^i} and Q_{θ_2'^i}.
Step S42: inputting the state of the unmanned aerial vehicle into the Actor network to obtain a Gaussian policy distribution with mean μ and variance σ; sampling the policy randomly to obtain the unmanned aerial vehicle decision action a_t ~ π_{φ_i}(·|s_t). After the unmanned aerial vehicle executes action a_t, the next state s_{t+1} is acquired. The reward r(s_t, a_t) is computed according to the reward function in step S3, and the decision data {s_t, a_t, r(s_t, a_t), s_{t+1}} are stored in the experience pool D_i.
Step S43: when the number of experiences in the experience pool is larger than batch_size, batch_size experience samples M are randomly extracted as training data for the SAC algorithm to update the Actor network and Critic network weights. During training, gradient descent with learning rate lr is performed on the Actor network loss function J_π(φ) and the Critic network loss functions J_Q(θ_i), i = 1, 2; the specific neural network loss functions and update procedure are as follows:
The double Soft-Q function is defined as the minimum of the outputs of the target Critic networks Q_{θ_1'} and Q_{θ_2'}; the target Q value is therefore
y = r(s_t, a_t) + γ ( min_{j=1,2} Q_{θ_j'}(s_{t+1}, a_{t+1}) − α log π_φ(a_{t+1}|s_{t+1}) )
where Q_{θ_1'} and Q_{θ_2'} are the target Critic networks, γ is the discount factor, and y is the target Q value used to update the Critic networks.
The Actor network loss function J_π(φ) is defined as follows:
J_π(φ) = E[ α log π_φ(a_t|s_t) − min_{j=1,2} Q_{θ_j}(s_t, a_t) ]
The Critic network loss functions J_Q(θ_i), i = 1, 2, are defined as follows:
J_Q(θ_i) = E[ (1/2) ( Q_{θ_i}(s_t, a_t) − y )^2 ]
where α is the regularization coefficient of the policy entropy.
The target Critic network weights θ_1', θ_2' are updated by
θ_i' ← τ θ_i + (1 − τ) θ_i', i = 1, 2
where τ is the target Critic network soft-update parameter.
By training with the SAC algorithm in the n different flight environments, the Actor network weight φ_i and the Critic network weights θ_1^i and θ_2^i are obtained for each task i.
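For reference, one SAC gradient step corresponding to the losses above might look as follows in PyTorch. The network interfaces (an actor.sample method returning an action and its log-probability, Critics taking (s, a)) and the absence of an episode-termination mask are assumptions of this sketch, not the patent's implementation.

```python
import torch
import torch.nn.functional as F

def sac_update(actor, critic1, critic2, target1, target2,
               actor_opt, critic_opt, batch, alpha=0.2, gamma=0.99, tau=0.005):
    """One SAC step: clipped double-Q target with entropy bonus, Critic loss
    J_Q, Actor loss J_pi, and soft update of the target Critics."""
    s, a, r, s_next = batch

    # Target value: y = r + gamma * (min_j Q_target_j(s', a') - alpha * log pi(a'|s'))
    with torch.no_grad():
        a_next, logp_next = actor.sample(s_next)
        q_next = torch.min(target1(s_next, a_next), target2(s_next, a_next))
        y = r + gamma * (q_next - alpha * logp_next)

    # Critic loss J_Q(theta_i) = E[ 0.5 * (Q_theta_i(s, a) - y)^2 ]
    critic_loss = F.mse_loss(critic1(s, a), y) + F.mse_loss(critic2(s, a), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor loss J_pi(phi) = E[ alpha * log pi(a|s) - min_j Q_theta_j(s, a) ]
    a_new, logp = actor.sample(s)
    actor_loss = (alpha * logp - torch.min(critic1(s, a_new), critic2(s, a_new))).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft update of target Critics: theta' <- tau * theta + (1 - tau) * theta'
    for tgt, src in ((target1, critic1), (target2, critic2)):
        for p_t, p in zip(tgt.parameters(), src.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```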
Step S44: judging whether the set updating step number is reached, if so, executing step S45; otherwise, steps S41 to S44 are performed.
Step S45: the Actor network weights φ_i and the Critic network weights θ_1^i, θ_2^i obtained in the n different flight environments are used to perform the meta-learning update.
Parallel multi-task reinforcement learning based on meta-learning requires the decision policy to maximize the total reward over the n different environments; the meta-learning update is then performed, where φ_meta represents the meta-policy network weight updated through meta-learning, φ_i represents the Actor network weight obtained by learning task i with the SAC algorithm, θ_meta represents the meta-Critic network weight updated through meta-learning, θ_1^i and θ_2^i represent the Critic network weights obtained with the SAC algorithm, and β represents the meta-learning update learning rate.
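The patent's meta-learning update rule is shown only as an image. One reading consistent with the surrounding description, where the meta-weights φ_meta and θ_meta are moved toward the task-specific weights with meta learning rate β, is a Reptile-style interpolation; the sketch below shows this assumed form. With β = 1 it reduces to simply averaging the task weights into the meta-weights.

```python
def meta_update(meta_params, task_params_list, beta):
    """Reptile-style meta-learning update: move the meta-weights (phi_meta or
    theta_meta) toward the average of the task-specific weights learned by SAC.
    The exact rule in the patent is an image, so this form is an assumption.
    All arguments are dictionaries mapping parameter names to numpy arrays."""
    n = len(task_params_list)
    return {
        name: w + beta * sum(task[name] - w for task in task_params_list) / n
        for name, w in meta_params.items()
    }
```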
Step S46: judging whether the model has converged, i.e. whether the reward function has stabilized or the set number of training meta-learning update steps has been reached; if so, training ends and the trained unmanned aerial vehicle meta reinforcement learning flight decision model is obtained; otherwise, steps S41 to S46 are executed.
Step S5: and randomly initializing a new flight environment and an unmanned aerial vehicle state, testing an unmanned aerial vehicle flight decision model based on a meta reinforcement learning algorithm, and evaluating flight decision performance.
S51: initializing the flight state of the unmanned aerial vehicle and obtaining the initial decision-model state s_t.
S52: initializing the Actor network and the Critic networks according to the meta-policy network weight φ_meta and the meta-Critic network weight θ_meta, and executing the Actor and Critic network training update of step S43.
S53: inputting the state s_t into the trained Actor network to obtain the unmanned aerial vehicle decision action a_t, executing the action, and obtaining the new state s_{t+1}.
S54: judging whether the flight decision task is completed; if so, ending; otherwise, letting s_t = s_{t+1} and executing steps S51 to S54.
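Steps S51 to S54 amount to a plain evaluation rollout. The sketch below shows one such loop; the env.reset/env.step interface and the deterministic actor.act call are assumptions for illustration.

```python
def evaluate_policy(env, actor, max_steps=1000):
    """Test-phase rollout corresponding to steps S51 to S54: initialise the
    environment, then repeatedly query the trained Actor and step the
    environment until the flight task ends."""
    s_t = env.reset()                     # S51: initial decision-model state
    total_reward, done, step = 0.0, False, 0
    while not done and step < max_steps:  # S54: loop until the task is finished
        a_t = actor.act(s_t)              # S53: decision action from the trained Actor
        s_t, r_t, done = env.step(a_t)
        total_reward += r_t
        step += 1
    return total_reward
```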
Detailed description of the invention
In the embodiment, the unmanned aerial vehicle meta reinforcement learning decision model is realized through parallel training in 3 environments. The environments are initialized as follows:
environment 1: the endpoint y-axis was 57, the environment contained 4 obstacles, and the y-axis coordinates were 7, 17, 27.5, 45, respectively.
Environment 2: the endpoint y-axis was 55, the environment contained 4 obstacles, and the y-axis coordinates were 10, 20, 25, 35, respectively.
Environment 3: the endpoint y-axis is 60, the environment contains 5 obstructions, and the y-axis coordinates are 5,9, 20, 34, 50, respectively.
The unmanned aerial vehicle state is initialized to [P^e v^e q ω^b] = [0,0,0,0,0,0,0,0,0,0].
The meta learning update frequency is set to 1, and the update step number is set to 1000.
The entropy regularization coefficient α is set to 0.2 with automatic decay, the learning rate lr is 0.0005, the experience pool size is 100000, and the batch training sample number batch_size is 256.
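For convenience, the training settings of this embodiment can be gathered in one place; the dictionary keys below are illustrative names, not identifiers from the patent.

```python
# Hyperparameters from the embodiment, gathered for reference; key names are illustrative.
CONFIG = {
    "n_envs": 3,                 # parallel flight environments
    "meta_update_frequency": 1,  # meta-learning update frequency
    "meta_update_steps": 1000,   # meta-learning update steps
    "alpha": 0.2,                # entropy regularisation coefficient (auto-decayed)
    "lr": 0.0005,                # SAC learning rate
    "replay_capacity": 100_000,  # experience pool size per task
    "batch_size": 256,           # batch training sample number
}
```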
The unmanned aerial vehicle meta reinforcement learning decision model is trained and the change of the reward value during training is recorded. The reward value curve during training of the SAC algorithm is shown in fig. 3. The SAC algorithm obtains a maximum reward of 52.03 during training. Over the whole training process, the SAC reward curve converges at around 750 rounds and finally settles at 50.04.
After training, the unmanned aerial vehicle state is initialized to [P^e v^e q ω^b] = [0,0,0,0,0,0,0,0,0,0], and the trained unmanned aerial vehicle flight decision model based on the meta reinforcement learning algorithm is tested in environments 1, 2 and 3. The flight trajectory of the unmanned aerial vehicle is drawn from the recorded states; the flight decision effect in environment 3 is shown in fig. 4. In the figure, the flight path decided by the unmanned aerial vehicle flight decision method based on the meta reinforcement learning parallel training algorithm successfully avoids the obstacles, finally reaches the end point with a y_e-axis coordinate of 50, and smoothly completes the flight task.
Comprehensively comparing the training-process performance and the unmanned aerial vehicle flight decision performance shows that the unmanned aerial vehicle flight decision method based on the meta reinforcement learning parallel training algorithm has better convergence for the unmanned aerial vehicle flight decision process. A plurality of flight decision environments are introduced in the parallel training process and their flight decision data are shared, so the unmanned aerial vehicle flight decision strategy can be optimized as a whole; the method therefore has better generalization performance, and the flight task is completed quickly and safely.
The above is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above examples, and all technical solutions belonging to the concept of the present invention belong to the protection scope of the present invention. It should be noted that modifications and adaptations to the invention without departing from the principles thereof are intended to be within the scope of the invention as set forth in the following claims.

Claims (4)

1. The unmanned aerial vehicle flight decision method based on the meta reinforcement learning parallel training algorithm is characterized by comprising the following steps of:
Step S1: constructing an unmanned aerial vehicle flight control model;
In order to solve the position and attitude information of the unmanned aerial vehicle in real time, an unmanned aerial vehicle flight control rigid body model is adopted, wherein the rigid body model comprises an unmanned aerial vehicle kinematics model and an unmanned aerial vehicle dynamics model;
Step S2: constructing a state space, an action space and a reward function of unmanned aerial vehicle flight decisions according to a Markov decision process;
(1) State space design
The state space consists of two parts, namely the environment information acquired by the sensors in real time and the unmanned aerial vehicle flight state information, wherein the environment information comprises the image information acquired by the forward camera of the unmanned aerial vehicle, and the unmanned aerial vehicle flight state information is expressed in vector form as:
s = [P^e, v^e, q, ω^b]
where P^e = [p_x^e, p_y^e, p_z^e]^T represents the position coordinates of the unmanned aerial vehicle in the earth coordinate system o_e x_e y_e z_e, with p_x^e, p_y^e, p_z^e the position components along the x_e, y_e, z_e coordinate axes; v^e = [v_x^e, v_y^e, v_z^e]^T represents the linear velocity of the unmanned aerial vehicle in the earth coordinate system, with v_x^e, v_y^e, v_z^e the linear velocity components along the x_e, y_e, z_e coordinate axes; q is the quaternion representing the attitude of the unmanned aerial vehicle; ω^b = [ω_x^b, ω_y^b, ω_z^b]^T represents the angular velocity of the unmanned aerial vehicle in the body coordinate system o_b x_b y_b z_b, with ω_x^b, ω_y^b, ω_z^b the angular velocity components about the x_b, y_b, z_b coordinate axes;
(2) Action space design
The action space is defined as the linear velocity of the unmanned aerial vehicle in the earth coordinate system o_e x_e y_e z_e, i.e. a = v^e = [v_x^e, v_y^e, v_z^e]^T;
(3) Reward function design
The reward function consists of sparse rewards and continuous rewards, and comprises a position reward, a collision reward and a speed reward;
Step S3: constructing a multitask experience pool for storing the training sample data of the element reinforcement learning algorithm;
Suppose there are n different flight environments; the unmanned aerial vehicle task in each environment is defined as a Markov decision process M_i = <S_i, A_i, P_i, R_i> ∈ M. During execution of flight task T_i, the interaction of the unmanned aerial vehicle with the environment generates experience data <s, a, P_i, R_i>, which is stored in the experience pool D_i. The experience pools D_i, i ∈ [1, n], of all tasks are combined together to form the multi-task experience pool D;
Step S4: initializing n different flight environments and unmanned aerial vehicle states, setting the meta-learning update frequency and number of update steps, and realizing the unmanned aerial vehicle meta reinforcement learning decision model through parallel training in a plurality of environments;
Step S5: randomly initializing a new flight environment and an unmanned aerial vehicle state, testing an unmanned aerial vehicle flight decision model based on a meta reinforcement learning algorithm, and evaluating flight decision performance;
S51: initializing the flight state of the unmanned aerial vehicle and obtaining the initial decision-model state s_t;
S52: initializing the Actor network and the Critic networks according to the meta-policy network weight φ_meta and the meta-Critic network weight θ_meta, and executing the Actor and Critic network training update of step S43;
S53: inputting the state s_t into the trained Actor network to obtain the unmanned aerial vehicle decision action a_t, executing the action, and obtaining the new state s_{t+1};
S54: judging whether the flight decision task is finished; if so, ending; otherwise, letting s_t = s_{t+1} and executing steps S51 to S54.
2. The unmanned aerial vehicle flight decision method based on the meta reinforcement learning parallel training algorithm according to claim 1, wherein the unmanned aerial vehicle flight decision method is characterized in that:
the unmanned aerial vehicle flight control rigid body model comprises an unmanned aerial vehicle kinematics model and an unmanned aerial vehicle dynamics model;
(1) Unmanned aerial vehicle kinematics model
The unmanned aerial vehicle kinematics model is independent of the mass of the unmanned aerial vehicle and the forces acting on it; it only studies the relationships among the velocity, angular velocity, position and attitude of the unmanned aerial vehicle; the kinematics model takes the velocity and angular velocity of the unmanned aerial vehicle as inputs and outputs the corresponding position and attitude, and comprises a position kinematics model and an attitude kinematics model; the position kinematics model is defined as follows:
dP^e/dt = v^e
where P^e = [p_x^e, p_y^e, p_z^e]^T represents the position coordinates of the center of gravity of the unmanned aerial vehicle in the earth coordinate system o_e x_e y_e z_e, dP^e/dt represents the rate of change of the unmanned aerial vehicle position, and v^e represents the velocity of the unmanned aerial vehicle in the earth coordinate system;
The attitude of the unmanned aerial vehicle is represented by a quaternion:
q = [q_0, q_v^T]^T = [q_0, q_1, q_2, q_3]^T
where q_0 is the scalar part of q and q_v = [q_1, q_2, q_3]^T is the vector part; for a real number s, the corresponding quaternion is written q = [s, 0_{1×3}]^T; for a pure vector v, the corresponding quaternion is written q = [0, v^T]^T;
The attitude angles of the unmanned aerial vehicle are recovered from the quaternion:
φ = atan2(2(q_0 q_1 + q_2 q_3), 1 − 2(q_1^2 + q_2^2))
θ = arcsin(2(q_0 q_2 − q_1 q_3))
ψ = atan2(2(q_0 q_3 + q_1 q_2), 1 − 2(q_2^2 + q_3^2))
where φ ∈ [−π, π] is the roll angle of the unmanned aerial vehicle, ψ ∈ [−π, π] is the yaw angle of the unmanned aerial vehicle, and θ ∈ [−π/2, π/2] is the pitch angle of the unmanned aerial vehicle;
The attitude kinematics model is defined as follows:
dq/dt = (1/2) [ −q_v^T ; q_0 I_3 + [q_v]_× ] ω^b
where ω^b represents the angular velocity of the unmanned aerial vehicle in the body coordinate system o_b x_b y_b z_b, q_0 is the scalar part of the quaternion, q_v is the vector part of the quaternion, q_v^T denotes the transpose of q_v, [q_v]_× denotes the skew-symmetric matrix formed from q_v, dq/dt represents the rate of change of the attitude of the unmanned aerial vehicle, and I_3 denotes the third-order identity matrix;
(2) Unmanned aerial vehicle dynamics model
The inputs of the unmanned aerial vehicle dynamics model are the thrust and the moments (the pitching moment, the rolling moment and the yawing moment), and the outputs are the corresponding unmanned aerial vehicle velocity and angular velocity; the dynamics model comprises a position dynamics model and an attitude dynamics model;
The position dynamics model is defined as follows:
dv^e/dt = g e_3 − (f/m) R_b^e e_3
where dv^e/dt represents the rate of change of the unmanned aerial vehicle velocity in the earth coordinate system o_e x_e y_e z_e, m represents the mass of the unmanned aerial vehicle, f represents the total pulling force of the propellers, g represents the gravitational acceleration, e_3 = [0, 0, 1]^T is a unit vector, R_b^e represents the rotation matrix from the body coordinate system to the earth coordinate system, φ represents the roll angle of the unmanned aerial vehicle, θ represents the pitch angle of the unmanned aerial vehicle, and ψ represents the yaw angle of the unmanned aerial vehicle;
The attitude dynamics equation is established in the body coordinate system as follows:
J dω^b/dt = −ω^b × (J ω^b) + G_a + τ
where τ represents the moments generated by the rotation of the propellers about the axes of the unmanned aerial vehicle body, J represents the moment of inertia matrix of the unmanned aerial vehicle itself, and G_a represents the gyroscopic moment;
Combining the above, the position and attitude kinematics models together with the position and attitude dynamics models constitute the rigid body model of unmanned aerial vehicle flight control.
3. The unmanned aerial vehicle flight decision method based on the meta reinforcement learning parallel training algorithm according to claim 1, wherein the unmanned aerial vehicle flight decision method is characterized in that:
the reward function consists of sparse rewards and continuous rewards, and comprises a position reward, a collision reward and a speed reward;
The position reward comprises a position sparse reward and a position continuous reward; the position sparse reward is a reward given when the unmanned aerial vehicle successfully passes an obstacle, and is used to evaluate the obstacle avoidance performance of the flight decision strategy;
The position sparse reward is defined as r_2, where N_barrier represents the total number of obstacles in the environment and level represents the number of obstacles the unmanned aerial vehicle has passed;
The position continuous reward is defined as r_1, where y_t and y_{t-1} respectively represent the y_e-axis coordinate values of the unmanned aerial vehicle in the earth coordinate system o_e x_e y_e z_e at times t and t-1, and y_goal represents the y_e-axis coordinate value of the destination of the unmanned aerial vehicle flight task;
The collision reward is a sparse reward used to evaluate whether the unmanned aerial vehicle has collided; the unmanned aerial vehicle obtains the collision reward r_3 during flight;
The speed reward r_4 is defined as
r_4 = r' + r''
where v represents the current speed of the unmanned aerial vehicle, v_limit represents the set minimum speed of the unmanned aerial vehicle, and v_y^e represents the component of the unmanned aerial vehicle speed on the y_e axis in the earth coordinate system o_e x_e y_e z_e;
Combining the above,
R = r_1 + r_2 + r_3 + r_4
i.e. the reward function R contains the position rewards r_1 and r_2, the collision reward r_3 and the speed reward r_4.
4. The unmanned aerial vehicle flight decision method based on the meta reinforcement learning parallel training algorithm according to claim 1, wherein the unmanned aerial vehicle flight decision method is characterized in that:
The parallel training of the meta reinforcement learning comprises the following steps:
Step S41: setting the batch training sample number batch_size and the number of training update steps in the n different flight environments; for environment i, initializing the experience pool D_i, randomly generating the Actor network weight φ_i and the Critic network weights θ_1^i and θ_2^i, initializing the Actor network π_{φ_i} and the Critic networks Q_{θ_1^i} and Q_{θ_2^i}, and letting θ_1'^i = θ_1^i, θ_2'^i = θ_2^i to initialize the target Critic networks Q_{θ_1'^i} and Q_{θ_2'^i};
Step S42: inputting the state of the unmanned aerial vehicle into the Actor network to obtain a Gaussian policy distribution with mean μ and variance σ; sampling the policy randomly to obtain the unmanned aerial vehicle decision action a_t ~ π_{φ_i}(·|s_t); after executing action a_t the unmanned aerial vehicle obtains the next state s_{t+1}, the reward r(s_t, a_t) is computed according to the reward function in step S3, and the decision data {s_t, a_t, r(s_t, a_t), s_{t+1}} are stored in the experience pool D_i;
Step S43: when the number of experiences in the experience pool is larger than batch_size, randomly extracting batch_size experience samples M as training data for the SAC algorithm; during training, gradient descent with learning rate lr is performed on the Actor network loss function J_π(φ) and the Critic network loss functions J_Q(θ_i), i = 1, 2, to update the Actor network and Critic network weights; by training with the SAC algorithm in the n different flight environments, the Actor network weight φ_i and the Critic network weights θ_1^i and θ_2^i are obtained for each task i;
Step S44: judging whether the set number of training update steps is reached; if so, executing step S45; otherwise, executing steps S41 to S44;
Step S45: performing the meta-learning update on the Actor network weights φ_i and the Critic network weights θ_1^i, θ_2^i obtained in the n different flight environments;
Parallel multi-task reinforcement learning based on meta-learning requires the decision policy to maximize the total reward over the n different environments; the meta-learning update is then performed, where φ_meta represents the meta-policy network weight updated through meta-learning, φ_i represents the Actor network weight obtained by learning task i with the SAC algorithm, θ_meta represents the meta-Critic network weight updated through meta-learning, θ_1^i and θ_2^i represent the Critic network weights obtained with the SAC algorithm, and β represents the meta-learning update learning rate;
Step S46: judging whether the model has converged; the convergence condition is that the reward function has stabilized or the set number of training meta-learning update steps has been reached; if so, training ends and the trained unmanned aerial vehicle meta reinforcement learning flight decision model is obtained; otherwise, steps S41 to S46 are executed.
CN202210594911.6A 2022-05-27 2022-05-27 Unmanned aerial vehicle flight decision method based on meta reinforcement learning parallel training algorithm Active CN114895697B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210594911.6A CN114895697B (en) 2022-05-27 2022-05-27 Unmanned aerial vehicle flight decision method based on meta reinforcement learning parallel training algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210594911.6A CN114895697B (en) 2022-05-27 2022-05-27 Unmanned aerial vehicle flight decision method based on meta reinforcement learning parallel training algorithm

Publications (2)

Publication Number Publication Date
CN114895697A CN114895697A (en) 2022-08-12
CN114895697B true CN114895697B (en) 2024-04-30

Family

ID=82726496

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210594911.6A Active CN114895697B (en) 2022-05-27 2022-05-27 Unmanned aerial vehicle flight decision method based on meta reinforcement learning parallel training algorithm

Country Status (1)

Country Link
CN (1) CN114895697B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115494879B (en) * 2022-10-31 2023-09-15 中山大学 Rotor unmanned aerial vehicle obstacle avoidance method, device and equipment based on reinforcement learning SAC
CN116476825B (en) * 2023-05-19 2024-02-27 同济大学 Automatic driving lane keeping control method based on safe and reliable reinforcement learning
CN117168468B (en) * 2023-11-03 2024-02-06 安徽大学 Multi-unmanned-ship deep reinforcement learning collaborative navigation method based on near-end strategy optimization
CN117666332B (en) * 2024-02-02 2024-04-05 北京航空航天大学 Self-learning anti-interference control method for multi-rotor aircraft in dynamic disturbance environment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2021034050A (en) * 2019-08-21 2021-03-01 哈爾浜工程大学 Auv action plan and operation control method based on reinforcement learning
CN112684794A (en) * 2020-12-07 2021-04-20 杭州未名信科科技有限公司 Foot type robot motion control method, device and medium based on meta reinforcement learning
CN113093802A (en) * 2021-04-03 2021-07-09 西北工业大学 Unmanned aerial vehicle maneuver decision method based on deep reinforcement learning
WO2022052406A1 (en) * 2020-09-08 2022-03-17 苏州浪潮智能科技有限公司 Automatic driving training method, apparatus and device, and medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2021034050A (en) * 2019-08-21 2021-03-01 哈爾浜工程大学 Auv action plan and operation control method based on reinforcement learning
WO2022052406A1 (en) * 2020-09-08 2022-03-17 苏州浪潮智能科技有限公司 Automatic driving training method, apparatus and device, and medium
CN112684794A (en) * 2020-12-07 2021-04-20 杭州未名信科科技有限公司 Foot type robot motion control method, device and medium based on meta reinforcement learning
CN113093802A (en) * 2021-04-03 2021-07-09 西北工业大学 Unmanned aerial vehicle maneuver decision method based on deep reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Application of deep reinforcement learning in autonomous shape optimization of morphing aircraft; Wen Nuan; Liu Zhenghua; Zhu Lingpu; Sun Yang; Journal of Astronautics; 2017-11-30 (No. 11); 19-25 *

Also Published As

Publication number Publication date
CN114895697A (en) 2022-08-12

Similar Documents

Publication Publication Date Title
CN114895697B (en) Unmanned aerial vehicle flight decision method based on meta reinforcement learning parallel training algorithm
CN110673620B (en) Four-rotor unmanned aerial vehicle air line following control method based on deep reinforcement learning
Mulgaonkar et al. Robust aerial robot swarms without collision avoidance
CN110806756B (en) Unmanned aerial vehicle autonomous guidance control method based on DDPG
CN110531786B (en) Unmanned aerial vehicle maneuvering strategy autonomous generation method based on DQN
CN109625333B (en) Spatial non-cooperative target capturing method based on deep reinforcement learning
CN112947562B (en) Multi-unmanned aerial vehicle motion planning method based on artificial potential field method and MADDPG
CN111880567B (en) Fixed-wing unmanned aerial vehicle formation coordination control method and device based on deep reinforcement learning
CN111240356B (en) Unmanned aerial vehicle cluster convergence method based on deep reinforcement learning
CN113093802A (en) Unmanned aerial vehicle maneuver decision method based on deep reinforcement learning
CN112925319B (en) Underwater autonomous vehicle dynamic obstacle avoidance method based on deep reinforcement learning
CN114253296B (en) Hypersonic aircraft airborne track planning method and device, aircraft and medium
CN114089776B (en) Unmanned aerial vehicle obstacle avoidance method based on deep reinforcement learning
CN115509251A (en) Multi-unmanned aerial vehicle multi-target cooperative tracking control method based on MAPPO algorithm
CN115755956B (en) Knowledge and data collaborative driving unmanned aerial vehicle maneuvering decision method and system
CN116242364A (en) Multi-unmanned aerial vehicle intelligent navigation method based on deep reinforcement learning
CN114355980B (en) Four-rotor unmanned aerial vehicle autonomous navigation method and system based on deep reinforcement learning
CN116700079A (en) Unmanned aerial vehicle countermeasure occupation maneuver control method based on AC-NFSP
Chen et al. Deep reinforcement learning based strategy for quadrotor UAV pursuer and evader problem
Sandström et al. Fighter pilot behavior cloning
Wu et al. Improved reinforcement learning using stability augmentation with application to quadrotor attitude control
Zhou et al. Vision-based navigation of uav with continuous action space using deep reinforcement learning
CN116697829A (en) Rocket landing guidance method and system based on deep reinforcement learning
CN116820134A (en) Unmanned aerial vehicle formation maintaining control method based on deep reinforcement learning
CN115185288B (en) Unmanned aerial vehicle layered flight decision method based on SAC algorithm

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant