CN114895697B - Unmanned aerial vehicle flight decision method based on meta reinforcement learning parallel training algorithm - Google Patents

Unmanned aerial vehicle flight decision method based on meta reinforcement learning parallel training algorithm

Info

Publication number
CN114895697B
CN114895697B
Authority
CN
China
Prior art keywords
aerial vehicle
unmanned aerial
flight
meta
rewards
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210594911.6A
Other languages
Chinese (zh)
Other versions
CN114895697A (en)
Inventor
李波
白双霞
甘志刚
康培棋
杨慧林
万开方
高晓光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University
Priority to CN202210594911.6A priority Critical patent/CN114895697B/en
Publication of CN114895697A publication Critical patent/CN114895697A/en
Application granted granted Critical
Publication of CN114895697B publication Critical patent/CN114895697B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/08 Control of attitude, i.e. control of roll, pitch, or yaw
    • G05D1/0808 Control of attitude, i.e. control of roll, pitch, or yaw specially adapted for aircraft
    • G PHYSICS
    • G05 CONTROLLING; REGULATING
    • G05D SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00 Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/10 Simultaneous control of position or course in three dimensions
    • G05D1/101 Simultaneous control of position or course in three dimensions specially adapted for aircraft
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Control Of Position, Course, Altitude, Or Attitude Of Moving Bodies (AREA)

Abstract

The invention provides an unmanned aerial vehicle flight decision method based on a meta reinforcement learning parallel training algorithm, which comprises the steps of: first constructing an unmanned aerial vehicle flight control model; then constructing a state space, an action space and a reward function for the unmanned aerial vehicle flight decision according to a Markov decision process; next, constructing a multi-task experience pool for storing the training sample data of the meta reinforcement learning algorithm; then defining the parameters of the meta reinforcement learning algorithm and training in parallel in a plurality of environments to obtain an unmanned aerial vehicle meta reinforcement learning decision model; and finally, randomly initializing a new flight environment and unmanned aerial vehicle state, testing the unmanned aerial vehicle flight decision model based on the meta reinforcement learning algorithm, and evaluating the flight decision performance. By training the strategy in a plurality of environments, the method overcomes the insufficient generalization of the SAC algorithm, optimizes the unmanned aerial vehicle flight decision strategy as a whole, converges with little additional training in a new environment, and effectively improves the generalization capability and universality of the strategy.

Description

Unmanned aerial vehicle flight decision method based on meta reinforcement learning parallel training algorithm
Technical Field
The invention relates to the technical field of unmanned aerial vehicles, in particular to an unmanned aerial vehicle flight decision method.
Background
With the development of unmanned aerial vehicle technology, unmanned aerial vehicles have been applied to many aspects of production and daily life, from early experimental test flights to civil aerial photography and, in recent years, autonomous navigation and even distributed positioning and three-dimensional reconstruction. With its high maneuverability and many degrees of freedom, the unmanned aerial vehicle is becoming an important component of the future field of artificial intelligence.
With the progress of science and technology, deep reinforcement learning, which combines the perception capability of deep learning with the decision-making capability of reinforcement learning, provides a new solution for unmanned aerial vehicle flight decision-making. An unmanned aerial vehicle decision policy based on deep reinforcement learning can achieve a good flight effect when trained in a single environment, but because reinforcement learning algorithms generalize poorly, the effect drops greatly when the decision strategy is applied directly to a new environment. Prior related research only increases the randomness of the reinforcement learning training process in a single environment so as to improve the generalization of the flight decision policy across different test environments; it does not introduce different environments and tasks in the training stage, so the generalization capability of the algorithm remains very limited.
Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides an unmanned aerial vehicle flight decision method based on a meta reinforcement learning parallel training algorithm. Firstly, an unmanned aerial vehicle flight control model is constructed; then a state space, an action space and a reward function for the unmanned aerial vehicle flight decision are constructed according to a Markov decision process; next, a multi-task experience pool is constructed for storing the training sample data of the meta reinforcement learning algorithm; then the parameters of the meta reinforcement learning algorithm are defined and training is performed in parallel in a plurality of environments to obtain the unmanned aerial vehicle meta reinforcement learning decision model; finally, a new flight environment and unmanned aerial vehicle state are randomly initialized, the unmanned aerial vehicle flight decision model based on the meta reinforcement learning algorithm is tested, and the flight decision performance is evaluated. The method addresses the insufficient generalization of the SAC algorithm by training the strategy in a plurality of environments. By combining meta-learning with the SAC algorithm, the reinforcement learning policy can achieve good flight performance in a new environment with only a small amount of additional training, which improves the generalization of the algorithm.
The technical scheme adopted by the invention for solving the technical problems comprises the following steps:
Step S1: constructing an unmanned aerial vehicle flight control model;
In order to solve the position and attitude information of the unmanned aerial vehicle in real time, an unmanned aerial vehicle flight control rigid body model is adopted, wherein the rigid body model comprises an unmanned aerial vehicle kinematics model and an unmanned aerial vehicle dynamics model;
Step S2: constructing a state space, an action space and a reward function of unmanned aerial vehicle flight decisions according to a Markov decision process;
(1) State space design
The state space consists of two parts, namely the environment information acquired by the sensors in real time and the unmanned aerial vehicle flight state information, wherein the environment information comprises the image information acquired by the forward camera of the unmanned aerial vehicle, and the unmanned aerial vehicle flight state information is expressed in vector form as:
s = [P^e, v^e, q, ω^b]
where P^e = [p_x^e, p_y^e, p_z^e]^T represents the position coordinates of the unmanned aerial vehicle in the earth coordinate system o_e x_e y_e z_e, with p_x^e, p_y^e, p_z^e the position components along the x_e, y_e, z_e coordinate axes; v^e = [v_x^e, v_y^e, v_z^e]^T represents the linear velocity of the unmanned aerial vehicle in the earth coordinate system, with v_x^e, v_y^e, v_z^e the linear velocity components along the x_e, y_e, z_e coordinate axes; q is the quaternion representing the attitude of the unmanned aerial vehicle; ω^b = [ω_x^b, ω_y^b, ω_z^b]^T represents the angular velocity of the unmanned aerial vehicle in the body coordinate system o_b x_b y_b z_b, with ω_x^b, ω_y^b, ω_z^b the angular velocity components about the x_b, y_b, z_b coordinate axes;
(2) Action space design
The action space is defined as the linear velocity of the unmanned aerial vehicle in the earth coordinate system o_e x_e y_e z_e, i.e. a = v^e = [v_x^e, v_y^e, v_z^e]^T;
(3) Reward function design
The reward function consists of sparse rewards and continuous rewards, and comprises a position reward, a collision reward and a speed reward;
Step S3: constructing a multitask experience pool for storing the training sample data of the element reinforcement learning algorithm;
Suppose there are n different flight environments; the unmanned aerial vehicle task in each environment is defined as a Markov decision process M_i = <S_i, A_i, P_i, R_i> ∈ M. During execution of flight task T_i, the interaction of the unmanned aerial vehicle with the environment generates experience data <s, a, P_i, R_i>, which is stored in the experience pool D_i. The experience pools D_i, i ∈ [1, n], of all tasks are combined together to form the multi-task experience pool D;
Step S4: initializing n different flight environments and unmanned aerial vehicle states, setting the meta-learning update frequency and number of update steps, and realizing the unmanned aerial vehicle meta reinforcement learning decision model through parallel training in a plurality of environments;
Step S5: and randomly initializing a new flight environment and an unmanned aerial vehicle state, testing an unmanned aerial vehicle flight decision model based on a meta reinforcement learning algorithm, and evaluating flight decision performance.
S51: initializing the flight state of the unmanned aerial vehicle and obtaining the initial decision-model state s_t;
S52: initializing the Actor network and the Critic networks according to the meta-policy network weight φ_meta and the meta-Critic network weight θ_meta, and executing the Actor and Critic network training update of step S43;
S53: inputting the state s_t into the trained Actor network to obtain the unmanned aerial vehicle decision action a_t, executing the action, and obtaining the new state s_{t+1};
S54: judging whether the flight decision task is finished; if so, ending; otherwise, letting s_t = s_{t+1} and executing steps S51 to S54.
The unmanned aerial vehicle flight control rigid body model comprises an unmanned aerial vehicle kinematics model and an unmanned aerial vehicle dynamics model;
(1) Unmanned aerial vehicle kinematics model
The unmanned aerial vehicle kinematics model is independent of the mass of the unmanned aerial vehicle and the forces acting on it; it only studies the relationships among the velocity, angular velocity, position and attitude of the unmanned aerial vehicle. The kinematics model takes the velocity and angular velocity of the unmanned aerial vehicle as inputs and outputs the corresponding position and attitude, and comprises a position kinematics model and an attitude kinematics model. The position kinematics model is defined as follows:
dP^e/dt = v^e
where P^e = [p_x^e, p_y^e, p_z^e]^T represents the position coordinates of the center of gravity of the unmanned aerial vehicle in the earth coordinate system o_e x_e y_e z_e, dP^e/dt represents the rate of change of the unmanned aerial vehicle position, and v^e represents the velocity of the unmanned aerial vehicle in the earth coordinate system;
The attitude of the unmanned aerial vehicle is represented by a quaternion:
q = [q_0, q_v^T]^T = [q_0, q_1, q_2, q_3]^T
where q_0 is the scalar part of q and q_v = [q_1, q_2, q_3]^T is the vector part. For a real number s, the corresponding quaternion is written q = [s, 0_{1×3}]^T; for a pure vector v, the corresponding quaternion is written q = [0, v^T]^T;
The attitude angles of the unmanned aerial vehicle are recovered from the quaternion:
φ = atan2(2(q_0 q_1 + q_2 q_3), 1 − 2(q_1^2 + q_2^2))
θ = arcsin(2(q_0 q_2 − q_1 q_3))
ψ = atan2(2(q_0 q_3 + q_1 q_2), 1 − 2(q_2^2 + q_3^2))
where φ ∈ [−π, π] is the roll angle of the unmanned aerial vehicle, ψ ∈ [−π, π] is the yaw angle of the unmanned aerial vehicle, and θ ∈ [−π/2, π/2] is the pitch angle of the unmanned aerial vehicle;
The attitude kinematics model is defined as follows:
dq/dt = (1/2) [ −q_v^T ; q_0 I_3 + [q_v]_× ] ω^b
where ω^b represents the angular velocity of the unmanned aerial vehicle in the body coordinate system o_b x_b y_b z_b, q_0 is the scalar part of the quaternion, q_v is the vector part of the quaternion, q_v^T denotes the transpose of q_v, [q_v]_× denotes the skew-symmetric matrix formed from q_v, dq/dt represents the rate of change of the attitude of the unmanned aerial vehicle, and I_3 denotes the third-order identity matrix;
(2) Unmanned aerial vehicle dynamics model
The inputs of the unmanned aerial vehicle dynamics model are the thrust and the moments (the pitching moment, the rolling moment and the yawing moment), and the outputs are the corresponding unmanned aerial vehicle velocity and angular velocity; the dynamics model comprises a position dynamics model and an attitude dynamics model;
The position dynamics model is defined as follows:
dv^e/dt = g e_3 − (f/m) R_b^e e_3
where dv^e/dt represents the rate of change of the unmanned aerial vehicle velocity in the earth coordinate system o_e x_e y_e z_e, m represents the mass of the unmanned aerial vehicle, f represents the total pulling force of the propellers, g represents the gravitational acceleration, e_3 = [0, 0, 1]^T is a unit vector, R_b^e represents the rotation matrix from the body coordinate system to the earth coordinate system, φ represents the roll angle of the unmanned aerial vehicle, θ represents the pitch angle of the unmanned aerial vehicle, and ψ represents the yaw angle of the unmanned aerial vehicle;
The attitude dynamics equation is established in the body coordinate system as follows:
J dω^b/dt = −ω^b × (J ω^b) + G_a + τ
where τ represents the moments generated by the rotation of the propellers about the axes of the unmanned aerial vehicle body, J represents the moment of inertia matrix of the unmanned aerial vehicle itself, and G_a represents the gyroscopic moment;
Combining the above, the position and attitude kinematics models together with the position and attitude dynamics models constitute the rigid body model of unmanned aerial vehicle flight control.
The reward function consists of sparse rewards and continuous rewards, and comprises a position reward, a collision reward and a speed reward;
The position reward comprises a position sparse reward and a position continuous reward; the position sparse reward is a reward given when the unmanned aerial vehicle successfully passes an obstacle, and is used to evaluate the obstacle avoidance performance of the flight decision strategy;
The position sparse reward is defined as r_2, where N_barrier represents the total number of obstacles in the environment and level represents the number of obstacles the unmanned aerial vehicle has passed;
The position continuous reward is defined as r_1, where y_t and y_{t-1} respectively represent the y_e-axis coordinate values of the unmanned aerial vehicle in the earth coordinate system o_e x_e y_e z_e at times t and t-1, and y_goal represents the y_e-axis coordinate value of the destination of the unmanned aerial vehicle flight task;
The collision reward is a sparse reward used to evaluate whether the unmanned aerial vehicle has collided; the unmanned aerial vehicle obtains the collision reward r_3 during flight;
The speed reward r_4 is defined as
r_4 = r' + r''
where v represents the current speed of the unmanned aerial vehicle, v_limit represents the set minimum speed of the unmanned aerial vehicle, and v_y^e represents the component of the unmanned aerial vehicle speed on the y_e axis in the earth coordinate system o_e x_e y_e z_e;
Combining the above,
R = r_1 + r_2 + r_3 + r_4
i.e. the reward function R contains the position rewards r_1 and r_2, the collision reward r_3 and the speed reward r_4.
The parallel training of the meta reinforcement learning comprises the following steps:
Step S41: setting the batch training sample number batch_size and the number of training update steps in the n different flight environments; for environment i, initializing the experience pool D_i, randomly generating the Actor network weight φ_i and the Critic network weights θ_1^i and θ_2^i, initializing the Actor network π_{φ_i} and the Critic networks Q_{θ_1^i} and Q_{θ_2^i}, and letting θ_1'^i = θ_1^i, θ_2'^i = θ_2^i to initialize the target Critic networks Q_{θ_1'^i} and Q_{θ_2'^i};
Step S42: inputting the state of the unmanned aerial vehicle into the Actor network to obtain a Gaussian policy distribution with mean μ and variance σ; sampling the policy randomly to obtain the unmanned aerial vehicle decision action a_t ~ π_{φ_i}(·|s_t); after executing action a_t the unmanned aerial vehicle obtains the next state s_{t+1}, the reward r(s_t, a_t) is computed according to the reward function in step S3, and the decision data {s_t, a_t, r(s_t, a_t), s_{t+1}} are stored in the experience pool D_i;
Step S43: when the number of experiences in the experience pool is larger than batch_size, randomly extracting batch_size experience samples M as training data for the SAC algorithm; during training, gradient descent with learning rate lr is performed on the Actor network loss function J_π(φ) and the Critic network loss functions J_Q(θ_i), i = 1, 2, to update the Actor network and Critic network weights; by training with the SAC algorithm in the n different flight environments, the Actor network weight φ_i and the Critic network weights θ_1^i and θ_2^i are obtained for each task i;
Step S44: judging whether the set number of training update steps is reached; if so, executing step S45; otherwise, executing steps S41 to S44;
Step S45: performing the meta-learning update on the Actor network weights φ_i and the Critic network weights θ_1^i, θ_2^i obtained in the n different flight environments;
Parallel multi-task reinforcement learning based on meta-learning requires the decision policy to maximize the total reward over the n different environments; the meta-learning update is then performed, where φ_meta represents the meta-policy network weight updated through meta-learning, φ_i represents the Actor network weight obtained by learning task i with the SAC algorithm, θ_meta represents the meta-Critic network weight updated through meta-learning, θ_1^i and θ_2^i represent the Critic network weights obtained with the SAC algorithm, and β represents the meta-learning update learning rate;
Step S46: judging whether the model has converged; the convergence condition is that the reward function has stabilized or the set number of training meta-learning update steps has been reached; if so, training ends and the trained unmanned aerial vehicle meta reinforcement learning flight decision model is obtained; otherwise, steps S41 to S46 are executed.
The invention has the beneficial effects that:
(1) According to the invention, a plurality of flight decision environments are introduced in the parallel training process, and the flight decision data of the environments are shared, so that the unmanned aerial vehicle flight decision strategy can be integrally optimized.
(2) According to the invention, after decision samples are collected from a plurality of environments, the unmanned aerial vehicle flight decision model based on the meta reinforcement learning algorithm is trained. The meta reinforcement learning parallel training algorithm adopts an alternating inner and outer update scheme: the inner loop is updated with the reinforcement learning model and the outer loop with the meta-learning algorithm, so that the flight decision strategy is optimized as a whole. The resulting meta reinforcement learning strategy can converge with little additional training in a new environment, which effectively improves the generalization capability and universality of the strategy.
Drawings
FIG. 1 is a block diagram of a multi-tasking experience pool of the present invention.
FIG. 2 is a schematic diagram of the meta reinforcement learning parallel training process of the present invention.
FIG. 3 is a graph of a reward function of the parallel training algorithm based on meta reinforcement learning of the present invention.
Fig. 4 is a flight path of the unmanned aerial vehicle of the present invention. Fig. 4 (a) is a top view of a flight path of the unmanned aerial vehicle, and fig. 4 (b) is a coordinate change diagram of the position of the unmanned aerial vehicle on each coordinate axis during the flight of the unmanned aerial vehicle.
Detailed Description
The invention will be further described with reference to the drawings and examples.
According to the design scheme provided by the invention, the unmanned aerial vehicle flight decision based on the meta reinforcement learning algorithm comprises the following steps:
Step S1: constructing unmanned aerial vehicle flight control model
In order to describe the pose and position of the unmanned aerial vehicle, it is crucial to establish appropriate coordinate systems. A suitable coordinate system helps clarify the relationships between variables and facilitates representation and calculation. The position of the unmanned aerial vehicle is defined in the earth coordinate system, and its attitude in space mainly describes the rotational relationship between the body coordinate system and the earth coordinate system.
The earth coordinate system o_e x_e y_e z_e ignores the curvature of the earth, i.e. the surface of the earth is assumed to be a plane; it is used to study the motion state of the aircraft relative to the ground and to determine the three-dimensional position of the airframe. The origin of coordinates o_e is usually taken as the take-off position of the unmanned aerial vehicle or the center of the earth; the o_e x_e axis is defined as pointing in a chosen direction in the horizontal plane, the o_e z_e axis is defined as pointing perpendicular to the ground, and the o_e y_e axis is then determined by the right-hand rule.
The body coordinate system o_b x_b y_b z_b is fixed to the airframe of the aircraft, with its origin o_b defined at the center of gravity of the aircraft; the o_b x_b axis is defined as pointing toward the aircraft nose in the plane of symmetry of the aircraft; the o_b z_b axis is defined in the plane of symmetry of the aircraft, perpendicular to the o_b x_b axis; and the o_b y_b axis is determined according to the right-hand rule.
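The rotation relationship between the two frames is usually captured by the rotation matrix R_b^e used later in the dynamics model. The sketch below builds it from the roll, pitch and yaw angles; the Z-Y-X (yaw-pitch-roll) rotation sequence and the NumPy helper itself are illustrative assumptions, since the patent does not state its convention.

```python
import numpy as np

def rotation_body_to_earth(phi, theta, psi):
    """Rotation matrix R_b^e from the body frame o_b x_b y_b z_b to the earth
    frame o_e x_e y_e z_e, assuming a Z-Y-X (yaw-pitch-roll) rotation sequence.
    The convention is an assumption; the patent does not fix it."""
    c, s = np.cos, np.sin
    Rz = np.array([[c(psi), -s(psi), 0.0],
                   [s(psi),  c(psi), 0.0],
                   [0.0,     0.0,    1.0]])
    Ry = np.array([[ c(theta), 0.0, s(theta)],
                   [ 0.0,      1.0, 0.0     ],
                   [-s(theta), 0.0, c(theta)]])
    Rx = np.array([[1.0, 0.0,     0.0    ],
                   [0.0, c(phi), -s(phi)],
                   [0.0, s(phi),  c(phi)]])
    return Rz @ Ry @ Rx
```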
The attitude of the unmanned aerial vehicle is represented by a quaternion:
q = [q_0, q_v^T]^T = [q_0, q_1, q_2, q_3]^T
where q_0 is the scalar part of q and q_v = [q_1, q_2, q_3]^T is the vector part. For example, for a real number s the corresponding quaternion is written q = [s, 0_{1×3}]^T, and for a pure vector v the corresponding quaternion is written q = [0, v^T]^T.
The attitude angles of the unmanned aerial vehicle can be recovered from the quaternion:
φ = atan2(2(q_0 q_1 + q_2 q_3), 1 − 2(q_1^2 + q_2^2))
θ = arcsin(2(q_0 q_2 − q_1 q_3))
ψ = atan2(2(q_0 q_3 + q_1 q_2), 1 − 2(q_2^2 + q_3^2))
where φ ∈ [−π, π] is the roll angle of the unmanned aerial vehicle, ψ ∈ [−π, π] is the yaw angle of the unmanned aerial vehicle, and θ ∈ [−π/2, π/2] is the pitch angle of the unmanned aerial vehicle.
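The attitude-angle recovery can be checked with a short helper. The sketch below implements the standard scalar-first quaternion-to-Euler conversion; because the patent's own formula is reproduced above only from context, treat the exact expression as an assumption.

```python
import numpy as np

def quaternion_to_euler(q):
    """Recover roll (phi), pitch (theta), yaw (psi) from a unit quaternion
    q = [q0, q1, q2, q3] (scalar first). Standard conversion, assumed to
    correspond to the patent's formula."""
    q0, q1, q2, q3 = q
    phi = np.arctan2(2.0 * (q0 * q1 + q2 * q3), 1.0 - 2.0 * (q1**2 + q2**2))
    theta = np.arcsin(np.clip(2.0 * (q0 * q2 - q1 * q3), -1.0, 1.0))
    psi = np.arctan2(2.0 * (q0 * q3 + q1 * q2), 1.0 - 2.0 * (q2**2 + q3**2))
    return phi, theta, psi
```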
In order to solve the position and posture information of the unmanned aerial vehicle in real time, the unmanned aerial vehicle flight control rigid body model is adopted, wherein the unmanned aerial vehicle flight control rigid body model comprises unmanned aerial vehicle kinematics and dynamics models.
(1) Unmanned aerial vehicle kinematics model
The unmanned aerial vehicle kinematics model takes the velocity and angular velocity of the unmanned aerial vehicle as inputs, and the corresponding position and attitude of the unmanned aerial vehicle are obtained as outputs. The kinematics model comprises a position kinematics model and an attitude kinematics model:
The position kinematics model is defined as follows:
dP^e/dt = v^e
where P^e = [p_x^e, p_y^e, p_z^e]^T represents the position coordinates of the center of gravity of the unmanned aerial vehicle in the earth coordinate system o_e x_e y_e z_e, dP^e/dt represents the rate of change of the unmanned aerial vehicle position, and v^e represents the velocity of the unmanned aerial vehicle in the earth coordinate system.
The attitude kinematics model is defined as follows:
dq/dt = (1/2) [ −q_v^T ; q_0 I_3 + [q_v]_× ] ω^b
where ω^b represents the angular velocity of the unmanned aerial vehicle in the body coordinate system o_b x_b y_b z_b, q_0 is the scalar part of the quaternion, q_v is the vector part of the quaternion, q_v^T denotes the transpose of q_v, [q_v]_× denotes the skew-symmetric matrix formed from q_v, dq/dt represents the rate of change of the attitude of the unmanned aerial vehicle, and I_3 denotes the third-order identity matrix.
(2) Unmanned aerial vehicle dynamics model
The inputs of the unmanned aerial vehicle dynamics model are the thrust and the moments (the pitching moment, the rolling moment and the yawing moment), and the outputs are the corresponding unmanned aerial vehicle velocity and angular velocity; the dynamics model comprises a position dynamics model and an attitude dynamics model;
The position dynamics model is defined as follows:
dv^e/dt = g e_3 − (f/m) R_b^e e_3
where dv^e/dt represents the rate of change of the unmanned aerial vehicle velocity in the earth coordinate system o_e x_e y_e z_e, m represents the mass of the unmanned aerial vehicle, f represents the total pulling force of the propellers, g represents the gravitational acceleration, e_3 = [0, 0, 1]^T is a unit vector, R_b^e represents the rotation matrix from the body coordinate system to the earth coordinate system, φ represents the roll angle of the unmanned aerial vehicle, θ represents the pitch angle of the unmanned aerial vehicle, and ψ represents the yaw angle of the unmanned aerial vehicle;
The attitude dynamics model is established in the body coordinate system as follows:
J dω^b/dt = −ω^b × (J ω^b) + G_a + τ
where τ represents the moments generated by the rotation of the propellers about the axes of the unmanned aerial vehicle body, J is defined as the moment of inertia matrix of the unmanned aerial vehicle itself, and G_a represents the gyroscopic moment.
Combining the above, the position and attitude kinematics models together with the position and attitude dynamics models constitute the rigid body model of unmanned aerial vehicle flight control.
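As a reading aid, the kinematics and dynamics equations above can be combined into one simulation step. The sketch below uses simple forward-Euler integration; the integration scheme, the quaternion-to-rotation-matrix convention, the gravity sign (z_e taken as pointing toward the ground) and the function signature are all assumptions of this illustration, not part of the patent.

```python
import numpy as np

def rigid_body_step(P_e, v_e, q, omega_b, f, tau, J, m, dt, g=9.81, G_a=None):
    """One forward-Euler step of the UAV rigid-body model.
    P_e, v_e: position/velocity in the earth frame; q = [q0, q1, q2, q3]
    (scalar first); omega_b: body-frame angular velocity; f: total propeller
    thrust; tau: body moments; J: 3x3 inertia matrix; m: mass."""
    q = np.asarray(q, dtype=float)
    q0, qv = q[0], q[1:]
    qv_x = np.array([[0.0, -qv[2], qv[1]],
                     [qv[2], 0.0, -qv[0]],
                     [-qv[1], qv[0], 0.0]])
    # Rotation matrix R_b^e from the quaternion (scalar-first, body-to-earth; assumed convention).
    R = (q0**2 - qv @ qv) * np.eye(3) + 2.0 * np.outer(qv, qv) + 2.0 * q0 * qv_x
    e3 = np.array([0.0, 0.0, 1.0])
    G_a = np.zeros(3) if G_a is None else np.asarray(G_a, dtype=float)

    # Position kinematics and dynamics (z_e assumed to point toward the ground).
    P_dot = np.asarray(v_e, dtype=float)
    v_dot = g * e3 - (f / m) * (R @ e3)

    # Attitude kinematics: dq/dt = 0.5 * [-qv^T; q0*I3 + [qv]x] * omega_b
    omega_b = np.asarray(omega_b, dtype=float)
    q_dot = 0.5 * np.concatenate(([-qv @ omega_b],
                                  (q0 * np.eye(3) + qv_x) @ omega_b))

    # Attitude dynamics: J * d(omega)/dt = tau - omega x (J omega) + G_a
    omega_dot = np.linalg.solve(J, np.asarray(tau, dtype=float)
                                - np.cross(omega_b, J @ omega_b) + G_a)

    q_new = q + dt * q_dot
    return (np.asarray(P_e, dtype=float) + dt * P_dot,
            P_dot + dt * v_dot,
            q_new / np.linalg.norm(q_new),
            omega_b + dt * omega_dot)
```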
step S2: and constructing a state space, an action space and a reward function of the unmanned aerial vehicle flight decision according to the Markov decision process.
(1) State space design
The state space designed by the invention consists of two parts: the unmanned aerial vehicle flight state information and the environment information acquired by the sensors in real time. The environment information comprises the image information obtained by the forward camera of the unmanned aerial vehicle, and the unmanned aerial vehicle flight state information is expressed in vector form as:
s = [P^e, v^e, q, ω^b]
where P^e = [p_x^e, p_y^e, p_z^e]^T represents the position coordinates of the unmanned aerial vehicle in the earth coordinate system o_e x_e y_e z_e, with p_x^e, p_y^e, p_z^e the position components along the x_e, y_e, z_e coordinate axes; v^e = [v_x^e, v_y^e, v_z^e]^T represents the linear velocity of the unmanned aerial vehicle in the earth coordinate system, with v_x^e, v_y^e, v_z^e the linear velocity components along the x_e, y_e, z_e coordinate axes; q is the quaternion representing the attitude of the unmanned aerial vehicle; ω^b = [ω_x^b, ω_y^b, ω_z^b]^T represents the angular velocity of the unmanned aerial vehicle in the body coordinate system o_b x_b y_b z_b, with ω_x^b, ω_y^b, ω_z^b the angular velocity components about the x_b, y_b, z_b coordinate axes.
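A minimal sketch of how the 13-dimensional flight-state vector [P^e, v^e, q, ω^b] could be packed together with the forward-camera image; the dictionary layout and field names are illustrative assumptions, not the patent's data format.

```python
import numpy as np

def build_state(P_e, v_e, q, omega_b, camera_image):
    """Assemble the flight-state vector [P^e, v^e, q, omega^b] (3+3+4+3 = 13
    values) and pair it with the forward-camera image. The dictionary layout
    is an assumption made for illustration."""
    flight_state = np.concatenate([P_e, v_e, q, omega_b]).astype(np.float32)
    return {"flight_state": flight_state, "image": camera_image}
```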
(2) Action space design
The action space is defined as the linear velocity of the unmanned aerial vehicle in the earth coordinate system o_e x_e y_e z_e, i.e. a = v^e = [v_x^e, v_y^e, v_z^e]^T.
(3) Reward function design
The reward function designed by the invention consists of sparse rewards and continuous rewards, and comprises a position reward, a collision reward and a speed reward;
The position reward comprises a position sparse reward and a position continuous reward. The position sparse reward is set as a reward given when the unmanned aerial vehicle successfully passes an obstacle, in order to evaluate the obstacle avoidance performance of the flight decision strategy.
The position continuous reward is defined as r_1, where y_t and y_{t-1} respectively represent the y_e-axis coordinate values of the unmanned aerial vehicle in the earth coordinate system o_e x_e y_e z_e at times t and t-1, and y_goal represents the y_e-axis coordinate value of the destination of the unmanned aerial vehicle flight task;
The position sparse reward is defined as r_2, where N_barrier represents the total number of obstacles in the environment and level represents the number of obstacles the unmanned aerial vehicle has passed;
The collision reward is a sparse reward used to evaluate whether the unmanned aerial vehicle has collided; the unmanned aerial vehicle obtains the collision reward r_3 during flight;
The speed reward r_4 is defined as
r_4 = r' + r''
where v represents the current speed of the unmanned aerial vehicle, v_limit represents the set minimum speed of the unmanned aerial vehicle, and v_y^e represents the component of the unmanned aerial vehicle speed on the y_e axis in the earth coordinate system o_e x_e y_e z_e;
Combining the above, the reward function contains the position rewards r_1 and r_2, the collision reward r_3 and the speed reward r_4, and is defined as follows:
R = r_1 + r_2 + r_3 + r_4
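The composition R = r_1 + r_2 + r_3 + r_4 can be sketched as below. The individual reward terms appear in the patent only as images, so the concrete expressions used here (progress toward y_goal, a per-obstacle bonus, a fixed collision penalty and a minimum-speed penalty) are assumptions that merely follow the stated sparse/continuous split.

```python
def reward(y_t, y_prev, y_goal, level, n_barrier, collided, v, v_y, v_limit):
    """Illustrative composition of R = r1 + r2 + r3 + r4.
    All four term definitions below are assumed forms, not the patent's."""
    r1 = (y_t - y_prev) / abs(y_goal)                  # continuous position reward (assumed form)
    r2 = 10.0 * level / n_barrier                      # sparse reward for obstacles passed (assumed form)
    r3 = -10.0 if collided else 0.0                    # sparse collision penalty (assumed value)
    r4 = (-1.0 if v < v_limit else 0.0) + 0.1 * v_y    # speed reward r' + r'' (assumed forms)
    return r1 + r2 + r3 + r4
```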
step S3: and constructing a multitasking experience pool for storing the training sample data of the reinforcement learning algorithm.
Suppose there are n different flight environments; the unmanned aerial vehicle task in each environment is defined as a Markov decision process M_i = <S_i, A_i, P_i, R_i> ∈ M. During execution of flight task T_i, the interaction of the unmanned aerial vehicle with the environment generates experience data <s, a, P_i, R_i>, which is stored in the experience pool D_i. The experience pools D_i, i ∈ [1, n], of all tasks are combined together to form the multi-task experience pool D. The structure of the multi-task experience pool is shown in figure 1.
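A minimal sketch of the multi-task experience pool D built from per-task pools D_i; the capacity and uniform-sampling scheme are illustrative choices consistent with the embodiment's settings, not requirements of the patent.

```python
import random
from collections import deque

class MultiTaskReplayBuffer:
    """Multi-task experience pool D: one bounded buffer D_i per flight
    environment/task i, sampled independently by the per-task SAC learners."""
    def __init__(self, n_tasks, capacity=100_000):
        self.pools = [deque(maxlen=capacity) for _ in range(n_tasks)]

    def add(self, task_id, s, a, r, s_next):
        """Store one decision tuple {s_t, a_t, r(s_t, a_t), s_{t+1}} in pool D_i."""
        self.pools[task_id].append((s, a, r, s_next))

    def sample(self, task_id, batch_size=256):
        """Uniformly sample a training batch from pool D_i."""
        return random.sample(self.pools[task_id], batch_size)

    def __len__(self):
        return sum(len(p) for p in self.pools)
```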
Step S4: initializing n different flight environments and unmanned aerial vehicle states, setting the meta-learning update frequency and number of update steps, and realizing the unmanned aerial vehicle meta reinforcement learning decision model through parallel training in a plurality of environments.
The parallel training process realized by meta reinforcement learning is shown in fig. 2, and comprises the following steps:
Step S41: setting the batch training sample number batch_size and the number of training update steps of the algorithm in the n different flight environments; for environment i, initializing the experience pool D_i, randomly generating the Actor network weight φ_i and the Critic network weights θ_1^i and θ_2^i, initializing the Actor network π_{φ_i} and the Critic networks Q_{θ_1^i} and Q_{θ_2^i}, and letting θ_1'^i = θ_1^i, θ_2'^i = θ_2^i to initialize the target Critic networks Q_{θ_1'^i} and Q_{θ_2'^i}.
Step S42: inputting the state of the unmanned aerial vehicle into the Actor network to obtain a Gaussian policy distribution with mean μ and variance σ; sampling the policy randomly to obtain the unmanned aerial vehicle decision action a_t ~ π_{φ_i}(·|s_t). After the unmanned aerial vehicle executes action a_t, the next state s_{t+1} is acquired. The reward r(s_t, a_t) is computed according to the reward function in step S3, and the decision data {s_t, a_t, r(s_t, a_t), s_{t+1}} are stored in the experience pool D_i.
Step S43: when the number of experiences in the experience pool is larger than batch_size, batch_size experience samples M are randomly extracted as training data for the SAC algorithm to update the Actor network and Critic network weights. During training, gradient descent with learning rate lr is performed on the Actor network loss function J_π(φ) and the Critic network loss functions J_Q(θ_i), i = 1, 2; the specific neural network loss functions and update procedure are as follows:
The double Soft-Q function is defined as the minimum of the outputs of the target Critic networks Q_{θ_1'} and Q_{θ_2'}; the target Q value is therefore
y = r(s_t, a_t) + γ ( min_{j=1,2} Q_{θ_j'}(s_{t+1}, a_{t+1}) − α log π_φ(a_{t+1}|s_{t+1}) )
where Q_{θ_1'} and Q_{θ_2'} are the target Critic networks, γ is the discount factor, and y is the target Q value used to update the Critic networks.
The Actor network loss function J_π(φ) is defined as follows:
J_π(φ) = E[ α log π_φ(a_t|s_t) − min_{j=1,2} Q_{θ_j}(s_t, a_t) ]
The Critic network loss functions J_Q(θ_i), i = 1, 2, are defined as follows:
J_Q(θ_i) = E[ (1/2) ( Q_{θ_i}(s_t, a_t) − y )^2 ]
where α is the regularization coefficient of the policy entropy.
The target Critic network weights θ_1', θ_2' are updated by
θ_i' ← τ θ_i + (1 − τ) θ_i', i = 1, 2
where τ is the target Critic network soft-update parameter.
By training with the SAC algorithm in the n different flight environments, the Actor network weight φ_i and the Critic network weights θ_1^i and θ_2^i are obtained for each task i.
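For reference, one SAC gradient step corresponding to the losses above might look as follows in PyTorch. The network interfaces (an actor.sample method returning an action and its log-probability, Critics taking (s, a)) and the absence of an episode-termination mask are assumptions of this sketch, not the patent's implementation.

```python
import torch
import torch.nn.functional as F

def sac_update(actor, critic1, critic2, target1, target2,
               actor_opt, critic_opt, batch, alpha=0.2, gamma=0.99, tau=0.005):
    """One SAC step: clipped double-Q target with entropy bonus, Critic loss
    J_Q, Actor loss J_pi, and soft update of the target Critics."""
    s, a, r, s_next = batch

    # Target value: y = r + gamma * (min_j Q_target_j(s', a') - alpha * log pi(a'|s'))
    with torch.no_grad():
        a_next, logp_next = actor.sample(s_next)
        q_next = torch.min(target1(s_next, a_next), target2(s_next, a_next))
        y = r + gamma * (q_next - alpha * logp_next)

    # Critic loss J_Q(theta_i) = E[ 0.5 * (Q_theta_i(s, a) - y)^2 ]
    critic_loss = F.mse_loss(critic1(s, a), y) + F.mse_loss(critic2(s, a), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor loss J_pi(phi) = E[ alpha * log pi(a|s) - min_j Q_theta_j(s, a) ]
    a_new, logp = actor.sample(s)
    actor_loss = (alpha * logp - torch.min(critic1(s, a_new), critic2(s, a_new))).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft update of target Critics: theta' <- tau * theta + (1 - tau) * theta'
    for tgt, src in ((target1, critic1), (target2, critic2)):
        for p_t, p in zip(tgt.parameters(), src.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```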
Step S44: judging whether the set updating step number is reached, if so, executing step S45; otherwise, steps S41 to S44 are performed.
Step S45: the Actor network weights φ_i and the Critic network weights θ_1^i, θ_2^i obtained in the n different flight environments are used to perform the meta-learning update.
Parallel multi-task reinforcement learning based on meta-learning requires the decision policy to maximize the total reward over the n different environments; the meta-learning update is then performed, where φ_meta represents the meta-policy network weight updated through meta-learning, φ_i represents the Actor network weight obtained by learning task i with the SAC algorithm, θ_meta represents the meta-Critic network weight updated through meta-learning, θ_1^i and θ_2^i represent the Critic network weights obtained with the SAC algorithm, and β represents the meta-learning update learning rate.
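The patent's meta-learning update rule is shown only as an image. One reading consistent with the surrounding description, where the meta-weights φ_meta and θ_meta are moved toward the task-specific weights with meta learning rate β, is a Reptile-style interpolation; the sketch below shows this assumed form. With β = 1 it reduces to simply averaging the task weights into the meta-weights.

```python
def meta_update(meta_params, task_params_list, beta):
    """Reptile-style meta-learning update: move the meta-weights (phi_meta or
    theta_meta) toward the average of the task-specific weights learned by SAC.
    The exact rule in the patent is an image, so this form is an assumption.
    All arguments are dictionaries mapping parameter names to numpy arrays."""
    n = len(task_params_list)
    return {
        name: w + beta * sum(task[name] - w for task in task_params_list) / n
        for name, w in meta_params.items()
    }
```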
Step S46: judging whether the model has converged, i.e. whether the reward function has stabilized or the set number of training meta-learning update steps has been reached; if so, training ends and the trained unmanned aerial vehicle meta reinforcement learning flight decision model is obtained; otherwise, steps S41 to S46 are executed.
Step S5: and randomly initializing a new flight environment and an unmanned aerial vehicle state, testing an unmanned aerial vehicle flight decision model based on a meta reinforcement learning algorithm, and evaluating flight decision performance.
S51: initializing the flight state of the unmanned aerial vehicle and obtaining the initial decision-model state s_t.
S52: initializing the Actor network and the Critic networks according to the meta-policy network weight φ_meta and the meta-Critic network weight θ_meta, and executing the Actor and Critic network training update of step S43.
S53: inputting the state s_t into the trained Actor network to obtain the unmanned aerial vehicle decision action a_t, executing the action, and obtaining the new state s_{t+1}.
S54: judging whether the flight decision task is completed; if so, ending; otherwise, letting s_t = s_{t+1} and executing steps S51 to S54.
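Steps S51 to S54 amount to a plain evaluation rollout. The sketch below shows one such loop; the env.reset/env.step interface and the deterministic actor.act call are assumptions for illustration.

```python
def evaluate_policy(env, actor, max_steps=1000):
    """Test-phase rollout corresponding to steps S51 to S54: initialise the
    environment, then repeatedly query the trained Actor and step the
    environment until the flight task ends."""
    s_t = env.reset()                     # S51: initial decision-model state
    total_reward, done, step = 0.0, False, 0
    while not done and step < max_steps:  # S54: loop until the task is finished
        a_t = actor.act(s_t)              # S53: decision action from the trained Actor
        s_t, r_t, done = env.step(a_t)
        total_reward += r_t
        step += 1
    return total_reward
```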
Detailed description of the invention
In the embodiment, the unmanned aerial vehicle meta reinforcement learning decision model is realized through parallel training in 3 environments. The environments are initialized as follows:
environment 1: the endpoint y-axis was 57, the environment contained 4 obstacles, and the y-axis coordinates were 7, 17, 27.5, 45, respectively.
Environment 2: the endpoint y-axis was 55, the environment contained 4 obstacles, and the y-axis coordinates were 10, 20, 25, 35, respectively.
Environment 3: the endpoint y-axis is 60, the environment contains 5 obstructions, and the y-axis coordinates are 5,9, 20, 34, 50, respectively.
The unmanned aerial vehicle state is initialized to [P^e v^e q ω^b] = [0,0,0,0,0,0,0,0,0,0].
The meta learning update frequency is set to 1, and the update step number is set to 1000.
The entropy regularization coefficient α is set to 0.2 with automatic decay, the learning rate lr is 0.0005, the experience pool size is 100000, and the batch training sample number batch_size is 256.
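For convenience, the training settings of this embodiment can be gathered in one place; the dictionary keys below are illustrative names, not identifiers from the patent.

```python
# Hyperparameters from the embodiment, gathered for reference; key names are illustrative.
CONFIG = {
    "n_envs": 3,                 # parallel flight environments
    "meta_update_frequency": 1,  # meta-learning update frequency
    "meta_update_steps": 1000,   # meta-learning update steps
    "alpha": 0.2,                # entropy regularisation coefficient (auto-decayed)
    "lr": 0.0005,                # SAC learning rate
    "replay_capacity": 100_000,  # experience pool size per task
    "batch_size": 256,           # batch training sample number
}
```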
The unmanned aerial vehicle meta reinforcement learning decision model is trained and the change of the reward value during training is recorded. The reward value curve during training of the SAC algorithm is shown in fig. 3. The SAC algorithm obtains a maximum reward of 52.03 during training. Over the whole training process, the SAC reward curve converges at around 750 rounds and finally settles at 50.04.
After training, the unmanned aerial vehicle state is initialized to [P^e v^e q ω^b] = [0,0,0,0,0,0,0,0,0,0], and the trained unmanned aerial vehicle flight decision model based on the meta reinforcement learning algorithm is tested in environments 1, 2 and 3. The flight trajectory of the unmanned aerial vehicle is drawn from the recorded states; the flight decision effect in environment 3 is shown in fig. 4. In the figure, the flight path decided by the unmanned aerial vehicle flight decision method based on the meta reinforcement learning parallel training algorithm successfully avoids the obstacles, finally reaches the end point with a y_e-axis coordinate of 50, and smoothly completes the flight task.
Comprehensively comparing the training-process performance and the unmanned aerial vehicle flight decision performance shows that the unmanned aerial vehicle flight decision method based on the meta reinforcement learning parallel training algorithm has better convergence for the unmanned aerial vehicle flight decision process. A plurality of flight decision environments are introduced in the parallel training process and their flight decision data are shared, so the unmanned aerial vehicle flight decision strategy can be optimized as a whole; the method therefore has better generalization performance, and the flight task is completed quickly and safely.
The above is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above examples, and all technical solutions belonging to the concept of the present invention belong to the protection scope of the present invention. It should be noted that modifications and adaptations to the invention without departing from the principles thereof are intended to be within the scope of the invention as set forth in the following claims.

Claims (4)

1. The unmanned aerial vehicle flight decision method based on the meta reinforcement learning parallel training algorithm is characterized by comprising the following steps of:
Step S1: constructing an unmanned aerial vehicle flight control model;
In order to solve the position and attitude information of the unmanned aerial vehicle in real time, an unmanned aerial vehicle flight control rigid body model is adopted, wherein the rigid body model comprises an unmanned aerial vehicle kinematics model and an unmanned aerial vehicle dynamics model;
Step S2: constructing a state space, an action space and a reward function of unmanned aerial vehicle flight decisions according to a Markov decision process;
(1) State space design
The state space consists of two parts, namely the environment information acquired by the sensors in real time and the unmanned aerial vehicle flight state information, wherein the environment information comprises the image information acquired by the forward camera of the unmanned aerial vehicle, and the unmanned aerial vehicle flight state information is expressed in vector form as:
s = [P^e, v^e, q, ω^b]
where P^e = [p_x^e, p_y^e, p_z^e]^T represents the position coordinates of the unmanned aerial vehicle in the earth coordinate system o_e x_e y_e z_e, with p_x^e, p_y^e, p_z^e the position components along the x_e, y_e, z_e coordinate axes; v^e = [v_x^e, v_y^e, v_z^e]^T represents the linear velocity of the unmanned aerial vehicle in the earth coordinate system, with v_x^e, v_y^e, v_z^e the linear velocity components along the x_e, y_e, z_e coordinate axes; q is the quaternion representing the attitude of the unmanned aerial vehicle; ω^b = [ω_x^b, ω_y^b, ω_z^b]^T represents the angular velocity of the unmanned aerial vehicle in the body coordinate system o_b x_b y_b z_b, with ω_x^b, ω_y^b, ω_z^b the angular velocity components about the x_b, y_b, z_b coordinate axes;
(2) Action space design
The action space is defined as the linear velocity of the unmanned aerial vehicle in the earth coordinate system o_e x_e y_e z_e, i.e. a = v^e = [v_x^e, v_y^e, v_z^e]^T;
(3) Reward function design
The reward function consists of sparse rewards and continuous rewards, and comprises a position reward, a collision reward and a speed reward;
Step S3: constructing a multitask experience pool for storing the training sample data of the element reinforcement learning algorithm;
Suppose there are n different flight environments; the unmanned aerial vehicle task in each environment is defined as a Markov decision process M_i = <S_i, A_i, P_i, R_i> ∈ M. During execution of flight task T_i, the interaction of the unmanned aerial vehicle with the environment generates experience data <s, a, P_i, R_i>, which is stored in the experience pool D_i. The experience pools D_i, i ∈ [1, n], of all tasks are combined together to form the multi-task experience pool D;
Step S4: initializing n different flight environments and unmanned aerial vehicle states, setting the meta-learning update frequency and number of update steps, and realizing the unmanned aerial vehicle meta reinforcement learning decision model through parallel training in a plurality of environments;
Step S5: randomly initializing a new flight environment and an unmanned aerial vehicle state, testing an unmanned aerial vehicle flight decision model based on a meta reinforcement learning algorithm, and evaluating flight decision performance;
S51: initializing the flight state of the unmanned aerial vehicle and obtaining the initial decision-model state s_t;
S52: initializing the Actor network and the Critic networks according to the meta-policy network weight φ_meta and the meta-Critic network weight θ_meta, and executing the Actor and Critic network training update of step S43;
S53: inputting the state s_t into the trained Actor network to obtain the unmanned aerial vehicle decision action a_t, executing the action, and obtaining the new state s_{t+1};
S54: judging whether the flight decision task is finished; if so, ending; otherwise, letting s_t = s_{t+1} and executing steps S51 to S54.
2. The unmanned aerial vehicle flight decision method based on the meta reinforcement learning parallel training algorithm according to claim 1, wherein the unmanned aerial vehicle flight decision method is characterized in that:
the unmanned aerial vehicle flight control rigid body model comprises an unmanned aerial vehicle kinematics model and an unmanned aerial vehicle dynamics model;
(1) Unmanned aerial vehicle kinematics model
The unmanned aerial vehicle kinematics model is independent of the mass of the unmanned aerial vehicle and the forces acting on it; it only studies the relationships among the velocity, angular velocity, position and attitude of the unmanned aerial vehicle; the kinematics model takes the velocity and angular velocity of the unmanned aerial vehicle as inputs and outputs the corresponding position and attitude, and comprises a position kinematics model and an attitude kinematics model; the position kinematics model is defined as follows:
dP^e/dt = v^e
where P^e = [p_x^e, p_y^e, p_z^e]^T represents the position coordinates of the center of gravity of the unmanned aerial vehicle in the earth coordinate system o_e x_e y_e z_e, dP^e/dt represents the rate of change of the unmanned aerial vehicle position, and v^e represents the velocity of the unmanned aerial vehicle in the earth coordinate system;
The attitude of the unmanned aerial vehicle is represented by a quaternion:
q = [q_0, q_v^T]^T = [q_0, q_1, q_2, q_3]^T
where q_0 is the scalar part of q and q_v = [q_1, q_2, q_3]^T is the vector part; for a real number s, the corresponding quaternion is written q = [s, 0_{1×3}]^T; for a pure vector v, the corresponding quaternion is written q = [0, v^T]^T;
The attitude angles of the unmanned aerial vehicle are recovered from the quaternion:
φ = atan2(2(q_0 q_1 + q_2 q_3), 1 − 2(q_1^2 + q_2^2))
θ = arcsin(2(q_0 q_2 − q_1 q_3))
ψ = atan2(2(q_0 q_3 + q_1 q_2), 1 − 2(q_2^2 + q_3^2))
where φ ∈ [−π, π] is the roll angle of the unmanned aerial vehicle, ψ ∈ [−π, π] is the yaw angle of the unmanned aerial vehicle, and θ ∈ [−π/2, π/2] is the pitch angle of the unmanned aerial vehicle;
The attitude kinematics model is defined as follows:
dq/dt = (1/2) [ −q_v^T ; q_0 I_3 + [q_v]_× ] ω^b
where ω^b represents the angular velocity of the unmanned aerial vehicle in the body coordinate system o_b x_b y_b z_b, q_0 is the scalar part of the quaternion, q_v is the vector part of the quaternion, q_v^T denotes the transpose of q_v, [q_v]_× denotes the skew-symmetric matrix formed from q_v, dq/dt represents the rate of change of the attitude of the unmanned aerial vehicle, and I_3 denotes the third-order identity matrix;
(2) Unmanned aerial vehicle dynamics model
The inputs of the unmanned aerial vehicle dynamics model are the thrust and the moments (the pitching moment, the rolling moment and the yawing moment), and the outputs are the corresponding unmanned aerial vehicle velocity and angular velocity; the dynamics model comprises a position dynamics model and an attitude dynamics model;
The position dynamics model is defined as follows:
dv^e/dt = g e_3 − (f/m) R_b^e e_3
where dv^e/dt represents the rate of change of the unmanned aerial vehicle velocity in the earth coordinate system o_e x_e y_e z_e, m represents the mass of the unmanned aerial vehicle, f represents the total pulling force of the propellers, g represents the gravitational acceleration, e_3 = [0, 0, 1]^T is a unit vector, R_b^e represents the rotation matrix from the body coordinate system to the earth coordinate system, φ represents the roll angle of the unmanned aerial vehicle, θ represents the pitch angle of the unmanned aerial vehicle, and ψ represents the yaw angle of the unmanned aerial vehicle;
The attitude dynamics equation is established in the body coordinate system as follows:
J dω^b/dt = −ω^b × (J ω^b) + G_a + τ
where τ represents the moments generated by the rotation of the propellers about the axes of the unmanned aerial vehicle body, J represents the moment of inertia matrix of the unmanned aerial vehicle itself, and G_a represents the gyroscopic moment;
Combining the above, the position and attitude kinematics models together with the position and attitude dynamics models constitute the rigid body model of unmanned aerial vehicle flight control.
3. The unmanned aerial vehicle flight decision method based on the meta reinforcement learning parallel training algorithm according to claim 1, wherein the unmanned aerial vehicle flight decision method is characterized in that:
the reward function consists of sparse rewards and continuous rewards, and comprises a position reward, a collision reward and a speed reward;
The position reward comprises a position sparse reward and a position continuous reward; the position sparse reward is a reward given when the unmanned aerial vehicle successfully passes an obstacle, and is used to evaluate the obstacle avoidance performance of the flight decision strategy;
The position sparse reward is defined as r_2, where N_barrier represents the total number of obstacles in the environment and level represents the number of obstacles the unmanned aerial vehicle has passed;
The position continuous reward is defined as r_1, where y_t and y_{t-1} respectively represent the y_e-axis coordinate values of the unmanned aerial vehicle in the earth coordinate system o_e x_e y_e z_e at times t and t-1, and y_goal represents the y_e-axis coordinate value of the destination of the unmanned aerial vehicle flight task;
The collision reward is a sparse reward used to evaluate whether the unmanned aerial vehicle has collided; the unmanned aerial vehicle obtains the collision reward r_3 during flight;
The speed reward r_4 is defined as
r_4 = r' + r''
where v represents the current speed of the unmanned aerial vehicle, v_limit represents the set minimum speed of the unmanned aerial vehicle, and v_y^e represents the component of the unmanned aerial vehicle speed on the y_e axis in the earth coordinate system o_e x_e y_e z_e;
Combining the above,
R = r_1 + r_2 + r_3 + r_4
i.e. the reward function R contains the position rewards r_1 and r_2, the collision reward r_3 and the speed reward r_4.
4. The unmanned aerial vehicle flight decision method based on the meta reinforcement learning parallel training algorithm according to claim 1, wherein the unmanned aerial vehicle flight decision method is characterized in that:
The parallel training of the meta reinforcement learning comprises the following steps:
Step S41: setting the batch training sample number batch_size and the number of training update steps in the n different flight environments; for environment i, initializing the experience pool D_i, randomly generating the Actor network weight φ_i and the Critic network weights θ_1^i and θ_2^i, initializing the Actor network π_{φ_i} and the Critic networks Q_{θ_1^i} and Q_{θ_2^i}, and letting θ_1'^i = θ_1^i, θ_2'^i = θ_2^i to initialize the target Critic networks Q_{θ_1'^i} and Q_{θ_2'^i};
Step S42: inputting the state of the unmanned aerial vehicle into the Actor network to obtain a Gaussian policy distribution with mean μ and variance σ; sampling the policy randomly to obtain the unmanned aerial vehicle decision action a_t ~ π_{φ_i}(·|s_t); after executing action a_t the unmanned aerial vehicle obtains the next state s_{t+1}, the reward r(s_t, a_t) is computed according to the reward function in step S3, and the decision data {s_t, a_t, r(s_t, a_t), s_{t+1}} are stored in the experience pool D_i;
Step S43: when the number of experiences in the experience pool is larger than batch_size, randomly extracting batch_size experience samples M as training data for the SAC algorithm; during training, gradient descent with learning rate lr is performed on the Actor network loss function J_π(φ) and the Critic network loss functions J_Q(θ_i), i = 1, 2, to update the Actor network and Critic network weights; by training with the SAC algorithm in the n different flight environments, the Actor network weight φ_i and the Critic network weights θ_1^i and θ_2^i are obtained for each task i;
Step S44: judging whether the set number of training update steps is reached; if so, executing step S45; otherwise, executing steps S41 to S44;
Step S45: performing the meta-learning update on the Actor network weights φ_i and the Critic network weights θ_1^i, θ_2^i obtained in the n different flight environments;
Parallel multi-task reinforcement learning based on meta-learning requires the decision policy to maximize the total reward over the n different environments; the meta-learning update is then performed, where φ_meta represents the meta-policy network weight updated through meta-learning, φ_i represents the Actor network weight obtained by learning task i with the SAC algorithm, θ_meta represents the meta-Critic network weight updated through meta-learning, θ_1^i and θ_2^i represent the Critic network weights obtained with the SAC algorithm, and β represents the meta-learning update learning rate;
Step S46: judging whether the model has converged; the convergence condition is that the reward function has stabilized or the set number of training meta-learning update steps has been reached; if so, training ends and the trained unmanned aerial vehicle meta reinforcement learning flight decision model is obtained; otherwise, steps S41 to S46 are executed.
CN202210594911.6A 2022-05-27 2022-05-27 Unmanned aerial vehicle flight decision method based on meta reinforcement learning parallel training algorithm Active CN114895697B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210594911.6A CN114895697B (en) 2022-05-27 2022-05-27 Unmanned aerial vehicle flight decision method based on meta reinforcement learning parallel training algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210594911.6A CN114895697B (en) 2022-05-27 2022-05-27 Unmanned aerial vehicle flight decision method based on meta reinforcement learning parallel training algorithm

Publications (2)

Publication Number Publication Date
CN114895697A CN114895697A (en) 2022-08-12
CN114895697B true CN114895697B (en) 2024-04-30

Family

ID=82726496

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210594911.6A Active CN114895697B (en) 2022-05-27 2022-05-27 Unmanned aerial vehicle flight decision method based on meta reinforcement learning parallel training algorithm

Country Status (1)

Country Link
CN (1) CN114895697B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115494879B (en) * 2022-10-31 2023-09-15 中山大学 Rotor unmanned aerial vehicle obstacle avoidance method, device and equipment based on reinforcement learning SAC
CN116476825B (en) * 2023-05-19 2024-02-27 同济大学 Automatic driving lane keeping control method based on safe and reliable reinforcement learning
CN117168468B (en) * 2023-11-03 2024-02-06 安徽大学 Multi-unmanned-ship deep reinforcement learning collaborative navigation method based on near-end strategy optimization
CN117666332B (en) * 2024-02-02 2024-04-05 北京航空航天大学 Self-learning anti-interference control method for multi-rotor aircraft in dynamic disturbance environment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2021034050A (en) * 2019-08-21 2021-03-01 哈爾浜工程大学 Auv action plan and operation control method based on reinforcement learning
CN112684794A (en) * 2020-12-07 2021-04-20 杭州未名信科科技有限公司 Foot type robot motion control method, device and medium based on meta reinforcement learning
CN113093802A (en) * 2021-04-03 2021-07-09 西北工业大学 Unmanned aerial vehicle maneuver decision method based on deep reinforcement learning
WO2022052406A1 (en) * 2020-09-08 2022-03-17 苏州浪潮智能科技有限公司 Automatic driving training method, apparatus and device, and medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2021034050A (en) * 2019-08-21 2021-03-01 哈爾浜工程大学 Auv action plan and operation control method based on reinforcement learning
WO2022052406A1 (en) * 2020-09-08 2022-03-17 苏州浪潮智能科技有限公司 Automatic driving training method, apparatus and device, and medium
CN112684794A (en) * 2020-12-07 2021-04-20 杭州未名信科科技有限公司 Foot type robot motion control method, device and medium based on meta reinforcement learning
CN113093802A (en) * 2021-04-03 2021-07-09 西北工业大学 Unmanned aerial vehicle maneuver decision method based on deep reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Application of deep reinforcement learning in autonomous shape optimization of morphing aircraft; Wen Nuan; Liu Zhenghua; Zhu Lingpu; Sun Yang; Journal of Astronautics; 2017-11-30 (No. 11); 19-25 *

Also Published As

Publication number Publication date
CN114895697A (en) 2022-08-12

Similar Documents

Publication Publication Date Title
CN114895697B (en) Unmanned aerial vehicle flight decision method based on meta reinforcement learning parallel training algorithm
CN110673620B (en) Four-rotor unmanned aerial vehicle air line following control method based on deep reinforcement learning
Mulgaonkar et al. Robust aerial robot swarms without collision avoidance
CN110806756B (en) Unmanned aerial vehicle autonomous guidance control method based on DDPG
CN110531786B (en) Unmanned aerial vehicle maneuvering strategy autonomous generation method based on DQN
CN109625333B (en) Spatial non-cooperative target capturing method based on deep reinforcement learning
CN112947562B (en) Multi-unmanned aerial vehicle motion planning method based on artificial potential field method and MADDPG
CN111880567B (en) Fixed-wing unmanned aerial vehicle formation coordination control method and device based on deep reinforcement learning
CN111240356B (en) Unmanned aerial vehicle cluster convergence method based on deep reinforcement learning
CN113093802A (en) Unmanned aerial vehicle maneuver decision method based on deep reinforcement learning
CN112925319B (en) Underwater autonomous vehicle dynamic obstacle avoidance method based on deep reinforcement learning
CN114253296B (en) Hypersonic aircraft airborne track planning method and device, aircraft and medium
CN114089776B (en) Unmanned aerial vehicle obstacle avoidance method based on deep reinforcement learning
CN115509251A (en) Multi-unmanned aerial vehicle multi-target cooperative tracking control method based on MAPPO algorithm
CN115755956B (en) Knowledge and data collaborative driving unmanned aerial vehicle maneuvering decision method and system
CN116242364A (en) Multi-unmanned aerial vehicle intelligent navigation method based on deep reinforcement learning
CN114355980B (en) Four-rotor unmanned aerial vehicle autonomous navigation method and system based on deep reinforcement learning
CN116700079A (en) Unmanned aerial vehicle countermeasure occupation maneuver control method based on AC-NFSP
Chen et al. Deep reinforcement learning based strategy for quadrotor UAV pursuer and evader problem
Sandström et al. Fighter pilot behavior cloning
Wu et al. Improved reinforcement learning using stability augmentation with application to quadrotor attitude control
Zhou et al. Vision-based navigation of uav with continuous action space using deep reinforcement learning
CN116697829A (en) Rocket landing guidance method and system based on deep reinforcement learning
CN116820134A (en) Unmanned aerial vehicle formation maintaining control method based on deep reinforcement learning
CN115185288B (en) Unmanned aerial vehicle layered flight decision method based on SAC algorithm

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant