CN113821057A - Planetary soft landing control method and system based on reinforcement learning and storage medium - Google Patents
- Publication number
- CN113821057A (application number CN202111196380.7A)
- Authority
- CN
- China
- Prior art keywords
- soft landing
- reinforcement learning
- lander
- lim
- planetary
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000000034 method Methods 0.000 title claims abstract description 46
- 230000002787 reinforcement Effects 0.000 title claims abstract description 36
- 238000012549 training Methods 0.000 claims abstract description 36
- 238000004422 calculation algorithm Methods 0.000 claims abstract description 31
- 238000012360 testing method Methods 0.000 claims abstract description 21
- 230000006870 function Effects 0.000 claims abstract description 20
- 230000009471 action Effects 0.000 claims abstract description 7
- 230000000694 effects Effects 0.000 claims abstract description 6
- 238000013528 artificial neural network Methods 0.000 claims abstract description 5
- 238000004088 simulation Methods 0.000 claims abstract description 5
- 230000002452 interceptive effect Effects 0.000 claims abstract description 4
- 239000000446 fuel Substances 0.000 claims description 24
- 230000008569 process Effects 0.000 claims description 12
- 239000013598 vector Substances 0.000 claims description 12
- 238000013461 design Methods 0.000 claims description 8
- 230000001133 acceleration Effects 0.000 claims description 5
- 239000011159 matrix material Substances 0.000 claims description 5
- 238000009434 installation Methods 0.000 claims description 4
- 230000008859 change Effects 0.000 claims description 3
- 230000002349 favourable effect Effects 0.000 claims description 3
- 230000005484 gravity Effects 0.000 claims description 3
- 238000004590 computer program Methods 0.000 claims description 2
- 238000011056 performance test Methods 0.000 claims description 2
- 238000005457 optimization Methods 0.000 abstract description 5
- 239000003795 chemical substances by application Substances 0.000 description 11
- 238000009826 distribution Methods 0.000 description 4
- 238000002474 experimental method Methods 0.000 description 3
- 230000004913 activation Effects 0.000 description 2
- 230000009286 beneficial effect Effects 0.000 description 2
- 238000011217 control strategy Methods 0.000 description 2
- 238000001514 detection method Methods 0.000 description 2
- 230000014509 gene expression Effects 0.000 description 2
- 230000003993 interaction Effects 0.000 description 2
- 238000012545 processing Methods 0.000 description 2
- 238000011160 research Methods 0.000 description 2
- 230000008901 benefit Effects 0.000 description 1
- 238000004364 calculation method Methods 0.000 description 1
- 238000013135 deep learning Methods 0.000 description 1
- 230000010354 integration Effects 0.000 description 1
- 238000010801 machine learning Methods 0.000 description 1
- 238000013507 mapping Methods 0.000 description 1
- 238000013508 migration Methods 0.000 description 1
- 230000005012 migration Effects 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 230000010355 oscillation Effects 0.000 description 1
- 238000005070 sampling Methods 0.000 description 1
- 230000008685 targeting Effects 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/10—Simultaneous control of position or course in three dimensions
- G05D1/101—Simultaneous control of position or course in three dimensions specially adapted for aircraft
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T90/00—Enabling technologies or technologies with a potential or indirect contribution to GHG emissions mitigation
Landscapes
- Engineering & Computer Science (AREA)
- Aviation & Aerospace Engineering (AREA)
- Radar, Positioning & Navigation (AREA)
- Remote Sensing (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Automation & Control Theory (AREA)
- Feedback Control In General (AREA)
Abstract
A planetary soft landing control method, system, and storage medium based on reinforcement learning relate to the field of soft landing trajectory optimization and control, and aim to solve the problems that existing planetary soft landing control cannot guarantee an optimal guidance law, relies on complex models, and is difficult to train to convergence. The invention comprises the following steps. Step one: establish a six-degree-of-freedom dynamics model of the lander's powered descent phase based on lander characteristics such as hardware configuration and engine power configuration. Step two: design the reward function, observation space, action space, and neural network structure of the training interaction environment. Step three: build a numerical simulation environment and train soft landing controllers with reinforcement learning algorithms. Step four: evaluate the trained controllers through a speed tracking test and a soft landing test. Executing steps one and two yields the soft landing reinforcement learning environment model, with which the agent interacts to obtain training data. Finally, step four selects the best-performing training result as the optimal soft landing controller. The method is used for soft landing trajectory optimization and control.
Description
Technical Field
The invention relates to a planetary soft landing control method based on reinforcement learning, and belongs to the technical field of soft landing trajectory optimization and control and the technical field of deep space exploration.
Background
Reinforcement learning is a family of machine learning algorithms in which an agent learns by trial and error, assessing the quality of each action through the reward obtained by interacting with the environment, with the goal of maximizing the agent's cumulative reward. Such algorithms can generally be classified as model-based or model-free.
Patent document CN110466805B discloses an asteroid landing guidance method based on optimized guidance parameters. It establishes the dynamics equations of the probe in a landing-point coordinate system; analyzes the probe's motion along the three directions of that coordinate system to obtain relations among position, speed, acceleration, and time in each direction; establishes a functional relation between the guidance parameters and the initial state based on the probe's motion relations, solving the coefficients of that relation by parameter estimation to obtain an optimized selection formula for the guidance parameters; substitutes the optimized selection formulas for the three directions into the corresponding position-speed-acceleration-time relations and combines them with the dynamics equations to obtain an asteroid landing guidance law based on the optimized guidance parameters in all three directions; and performs asteroid soft landing guidance with that law. The method can improve the efficiency of asteroid landing guidance, but it is a model-based landing guidance design algorithm that requires complex processing of the model, and it does not address how the probe executes the control.
In addition, researchers have studied the planetary soft landing phase based on reinforcement learning, integrating guidance and control so that an end-to-end planetary soft landing algorithm can be obtained through training. However, because guidance and control are integrated in such algorithms, the guidance law is designed only indirectly through the reward function, so its optimality cannot be guaranteed; that is, fuel consumption cannot be minimized. Moreover, the integration of guidance and control makes the model highly complex, training hard to converge, and the parameter selection and debugging cycle long.
Disclosure of Invention
The technical problem to be solved by the invention is as follows:
the invention provides a planetary soft landing control method and system based on reinforcement learning and a storage medium, aiming at solving the problems that the existing planetary soft landing control can not ensure optimal guidance law, has a complex model and is difficult to train and converge and the like and aiming at realizing more autonomous planetary soft landing.
The technical scheme adopted by the invention for solving the technical problems is as follows:
a planetary soft landing control method based on reinforcement learning comprises the following steps:
Step one: establishing a six-degree-of-freedom dynamics model of the lander's powered descent phase based on the lander mass m, the inertia matrix I, and the engine power configuration; the engine power configuration comprises the number of engines n, the installation position of each engine, and the thrust range of a single engine, T_i ∈ [T_min, T_max], where T_i is the thrust of a single engine and i = 1, 2, …, n is the engine index;
Step two: based on the dynamics model established in the first step, the thrust of each engine is used as control output, information reflecting the state of the lander is used as an observation vector, a reward function is designed for evaluating the control performance, and a corresponding neural network is designed according to different reinforcement learning algorithm frameworks;
Step three: building a numerical simulation environment based on the dynamics model established in step one and the interaction environment designed in step two, and training with different reinforcement learning algorithm frameworks to obtain soft landing controllers as candidates for selection in subsequent tests;
step four: and based on the soft landing controller obtained in the step three, comprehensively evaluating the speed tracking capability and the soft landing precision through a speed tracking performance test and a power descent section soft landing test respectively, and selecting the controller with the optimal performance for planetary soft landing control according to the test effect.
The six-degree-of-freedom dynamics model of the powered descent phase in step one comprises a centroid translational dynamics model and an attitude dynamics model:

m·(dv_b/dt) = F_b − m·(ω_t × v_b)
I·(dω_t/dt) = M_b − ω_t × (I·ω_t)

where ω_t and v_b are the component forms of the attitude angular velocity and the velocity in the body coordinate system, F_b and M_b are respectively the resultant force and resultant moment vectors of the external forces on the lander, and I is the inertia matrix of the lander;
The resultant external force and moment vectors F_b and M_b on the lander are expressed as follows:

F_b = F_T + m·g_b + F_N
M_b = M_T + M_N

where F_T and M_T are respectively the resultant force and moment vectors generated by the engines, F_N and M_N are respectively the resultant force and moment caused by external disturbances, g_b is the component form of the planetary surface gravitational acceleration in the lander body coordinate system, and m is the mass of the lander;
The aerodynamic force during the powered descent phase is small compared with the engine thrust and the planetary gravity, and is treated as a disturbance embodied in F_N and M_N:

F_N = F_wind + δF
M_N = M_wind + δM

where F_wind and M_wind are respectively the resultant force and moment caused by air, and δF and δM are respectively the resultant force and moment caused by unmodeled disturbances, which include engine installation deviation and thrust magnitude fluctuation.
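For illustration, the disturbance model above can be sketched as a sampler. The uniform distributions and the magnitude bounds (in newtons) are assumptions; the patent does not specify how F_wind, δF are distributed:

```python
import random

def sample_disturbance(rng, f_wind_max=50.0, df_max=20.0):
    """Draw one sample of F_N = F_wind + dF: an aerodynamic wind force plus
    unmodeled effects (engine installation deviation, thrust fluctuation).
    Distributions and bounds are hypothetical."""
    f_wind = [rng.uniform(-f_wind_max, f_wind_max) for _ in range(3)]  # F_wind
    d_f = [rng.uniform(-df_max, df_max) for _ in range(3)]             # dF
    return [w + d for w, d in zip(f_wind, d_f)]                        # F_N
```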
The reward function in step two takes the form:

r = r_fuel + r_vel + r_crash + r_constant + r_goal

where r_fuel is the fuel consumption penalty, r_vel is the speed tracking reward, r_crash is the landing rollover penalty, r_constant is a constant reward, and r_goal is the reward for a successful soft landing.
The fuel consumption penalty r_fuel is scaled by the fuel consumption penalty coefficient α, a negative real number of small absolute value. The larger |α| is, the lower the fuel consumption of the trained controller, at the cost of control accuracy.
The speed tracking reward takes the form

r_vel = β·||v − v_ref||

where β is the speed error reward coefficient, a negative real number of small absolute value; the larger |β| is, the higher the control accuracy of the training result, at the cost of higher fuel consumption. v_ref is the reference speed to be tracked during the landing process, generated according to a guidance law.
The landing rollover penalty takes the form

r_crash = η·(φ > φ_lim or θ > θ_lim or ψ > ψ_lim)

where η is the rollover penalty coefficient, a negative real number whose absolute value is larger than those of α and β; introducing the rollover penalty helps keep the attitude from exceeding its constraint values during landing. ψ, θ, and φ are respectively the yaw, pitch, and roll angles, with rotation sequence zyx, representing the attitude of the lander body coordinate system relative to the navigation coordinate system (see Fig. 1); ψ_lim, θ_lim, and φ_lim are the upper bound values of ψ, θ, and φ, respectively.
The constant reward r_constant is a positive real number whose absolute value is greater than those of α and β. Since r_fuel, r_vel, and r_crash are all negative, adding the constant reward encourages the lander to keep exploring before a reasonable control policy has been learned, avoiding premature episode termination and aiding training convergence.
The reward for a successful soft landing takes the form

r_goal = λ·(h < 0 and v_z > 0 and ||v|| < v_lim and φ < φ_lim and θ < θ_lim and ψ < ψ_lim and ||ω|| < ω_lim)

where λ is the soft landing reward coefficient, h is the lander height, v_z is the velocity component in the height direction, v_lim is the soft landing speed threshold, and ω_lim is the soft landing attitude angular velocity threshold.
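Putting the five reward terms together, a minimal sketch follows. The coefficient values, the attitude limits, and the proportional-to-total-thrust form of the fuel penalty (whose exact formula appears only as an image in the source) are assumptions:

```python
import math

# Illustrative values; the patent only constrains signs and relative magnitudes:
# alpha, beta small negative; eta negative with larger magnitude;
# the constant reward positive; lambda a large soft-landing bonus.
ALPHA, BETA, ETA, KAPPA, LAM = -0.001, -0.05, -10.0, 0.1, 100.0
PHI_LIM = THETA_LIM = PSI_LIM = math.radians(15.0)
V_LIM, OMEGA_LIM = 1.0, 0.1

def _norm(v):
    return math.sqrt(sum(x * x for x in v))

def reward(thrusts, v, v_ref, phi, theta, psi, h, v_z, omega):
    """Sum the five terms r = r_fuel + r_vel + r_crash + r_constant + r_goal."""
    r_fuel = ALPHA * sum(thrusts)                            # fuel consumption penalty
    r_vel = BETA * _norm([a - b for a, b in zip(v, v_ref)])  # speed tracking reward
    rollover = phi > PHI_LIM or theta > THETA_LIM or psi > PSI_LIM
    r_crash = ETA if rollover else 0.0                       # landing rollover penalty
    r_constant = KAPPA                                       # constant exploration reward
    landed = (h < 0 and v_z > 0 and _norm(v) < V_LIM and not rollover
              and _norm(omega) < OMEGA_LIM)
    r_goal = LAM if landed else 0.0                          # successful soft landing bonus
    return r_fuel + r_vel + r_crash + r_constant + r_goal
```

A descending state that meets all the landing conditions collects the large λ bonus on top of the per-step terms, which is what drives the policy toward a soft touchdown.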
The observation vector in step two takes the form:

s = [δv_b, sin(φ), cos(φ), sin(θ), cos(θ), sin(ψ), cos(ψ), ω_t, Q]

where δv_b is the component of the difference between the true speed and the desired speed in the body coordinate system, and Q is the attitude quaternion;
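Assembling that observation can be sketched as below; the argument order and the use of 3-vectors plus a 4-element quaternion are assumptions consistent with the text. Encoding each angle as a (sin, cos) pair keeps the network input continuous across the ±180° wrap-around:

```python
import math

def observation(v_b, v_ref_b, omega, quat, phi, theta, psi):
    """Assemble s = [dv_b, sin/cos of each attitude angle, omega_t, Q]."""
    dv = [a - b for a, b in zip(v_b, v_ref_b)]   # body-frame velocity error
    trig = []
    for ang in (phi, theta, psi):                # (sin, cos) pair per angle
        trig += [math.sin(ang), math.cos(ang)]
    return dv + trig + list(omega) + list(quat)  # 3 + 6 + 3 + 4 = 16 entries
```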
The thrust output range in step two is

a_i ∈ [−1, 1]

which is converted by the linear transformation

T_i = T_min + (a_i + 1)·(T_max − T_min)/2

into the thrust output T_i, satisfying the thrust constraint T_i ∈ [T_min, T_max].
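The linear action-to-thrust transformation can be sketched directly; the thrust bounds below are hypothetical values, and the clipping step is an assumption for safety against out-of-range actions:

```python
T_MIN, T_MAX = 400.0, 3000.0   # hypothetical single-engine thrust bounds, N

def action_to_thrust(a_i):
    """Map a policy action a_i in [-1, 1] linearly onto [T_MIN, T_MAX]:
    T_i = T_min + (a_i + 1) * (T_max - T_min) / 2."""
    a_i = max(-1.0, min(1.0, a_i))              # clip to the valid action range
    return T_MIN + 0.5 * (a_i + 1.0) * (T_MAX - T_MIN)
```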
In step three, the state spaces and action spaces of the reinforcement learning algorithm frameworks are continuous.
A planetary soft landing control system based on reinforcement learning executes the steps in the planetary soft landing control method based on reinforcement learning during operation.
A computer-readable storage medium characterized by: the computer readable storage medium stores a computer program configured to, when invoked by a processor, implement the steps of the reinforcement learning-based planetary soft landing control method described above.
The invention has at least the following beneficial technical effects:
the planet soft landing control method based on reinforcement learning provided by the invention realizes more autonomous planet soft landing. Firstly, a six-degree-of-freedom dynamic model of a soft landing power descent section is established according to the configuration of a lander and the power attribute of an engine, observation information, environment reward feedback, action output of an intelligent agent and network results of the landing controller are designed according to the problem characteristics, an enhanced learning environment model is set up for training, the training results are tested, and the planetary soft landing controller based on data driving is realized.
Planetary soft landing is a precondition for planetary surface exploration missions; an accurate soft landing makes it possible to examine high-value targets while avoiding damage to the scientific instruments. Reinforcement-learning-based planetary soft landing trains the lander controller through soft landing interaction, realizing accurate planetary soft landing. Compared with traditional model-based soft landing guidance and control algorithms, reinforcement-learning-based control has the following advantages: 1) training can proceed without a model, so the designer need not manipulate the model when designing the controller; control performance is evaluated through the reward function, which indirectly guides the controller to learn and optimizes control performance; 2) the strong fitting capability of deep neural networks handles the strong nonlinearity of the soft landing planning and control problem well; 3) reinforcement learning is an end-to-end algorithm that directly perceives the landing environment state and outputs engine thrust commands, requiring no offline pre-computation of the landing trajectory, and therefore offers strong real-time performance.
The reinforcement-learning-based planetary soft landing control algorithm provided by the invention tracks the desired speed well during the soft landing process, realizing planetary soft landing control. At the same time, rewarding reference-speed tracking effectively mitigates the sparse-reward problem during training and greatly improves the success rate of training convergence. Learning to track the desired speed, rather than a specific landing guidance signal, also improves the transferability of the training results. To realize more autonomous planetary soft landing, the invention optimizes the soft landing controller from experience through reinforcement learning's trial-and-error mode, avoiding manipulation of the lander dynamics model and realizing model-free soft landing control.
Drawings
Fig. 1 is a value function network structure.
Fig. 2 is a DDPG and TD3 algorithm policy network structure.
Fig. 3 is a SAC algorithm policy network structure.
Figure 4 is a DDPG algorithm training process reward variation curve.
FIG. 5 is a reward variation curve of the TD3 algorithm training process.
Fig. 6 is a change curve of reward in the SAC algorithm training process.
Figure 7 is a DDPG algorithm speed control test curve.
FIG. 8 is a TD3 algorithm speed control test curve.
Fig. 9 is a SAC algorithm speed control test curve.
FIG. 10 is a DDPG algorithm soft landing test landing site distribution.
FIG. 11 is a TD3 algorithm soft landing test landing site distribution.
FIG. 12 is a SAC algorithm soft landing test landing site distribution.
Detailed Description
As shown in fig. 1 to 12, the planetary soft landing control method based on reinforcement learning according to the present embodiment includes the following steps:
the method comprises the following steps: lander power descending section six-freedom-degree dynamic model establishment
In this embodiment, taking a Mars soft landing as an example, the centroid translational dynamics and attitude dynamics are established as

m·(dv_b/dt) = F_b − m·(ω_t × v_b)
I·(dω_t/dt) = M_b − ω_t × (I·ω_t)

where ω_t and v_b are respectively the components of the attitude angular velocity and the velocity in the body coordinate system, m is the mass of the lander, I is the rotational inertia matrix of the lander, and F_b and M_b are respectively the resultant force and resultant moment vectors of the external forces on the lander:
F_b = F_T + m·g_b + F_N
M_b = M_T + M_N

where F_T and M_T are respectively the resultant force and moment vectors generated by the engines, F_N and M_N are respectively the resultant force and moment caused by external disturbances, and g_b is the component form of the Martian surface gravitational acceleration in the lander body coordinate system.
According to this model, integrating with the lander's engine thrust outputs yields the lander state at each moment, s = [x y z v_x v_y v_z φ θ ψ p q r], where x, y, z are the three-axis positions, v_x, v_y, v_z are the three-axis velocities, φ, θ, ψ are the attitude angles with rotation sequence zyx, and p, q, r are the three-axis attitude angular velocities.
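One integration step of the translational part can be sketched as below. The body-frame form m·dv_b/dt = F_b − m·(ω × v_b) is a common convention assumed here (the patent's exact equations appear only as images), and a real simulator would use a higher-order integrator such as RK4 rather than explicit Euler:

```python
def _cross(a, b):
    """Cross product of two 3-vectors."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def euler_step_velocity(v_b, omega, f_b, m, dt):
    """One explicit-Euler step of body-frame translational dynamics:
    dv_b/dt = F_b/m - omega x v_b."""
    w_cross_v = _cross(omega, v_b)
    acc = [f_b[i] / m - w_cross_v[i] for i in range(3)]
    return [v_b[i] + dt * acc[i] for i in range(3)]
```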
Step two: design of an interactive environment
1) Reward function
The reward function is in the form of
r=rfuel+rvel+rcrash+rconstant+rgoal
where r_fuel, r_vel, r_crash, r_constant, and r_goal are respectively the fuel consumption penalty, the speed tracking reward, the rollover penalty, the constant reward, and the soft landing reward, calculated as

r_vel = β·||v − v_ref||
r_crash = η·(φ > φ_lim or θ > θ_lim or ψ > ψ_lim)
r_constant = κ
r_goal = λ·(h < 0 and v_z > 0 and ||v|| < v_lim and φ < φ_lim and θ < θ_lim and ψ < ψ_lim and ||ω|| < ω_lim)
where α is the fuel consumption penalty coefficient, a negative real number of small absolute value (the larger |α| is, the lower the fuel consumption of the trained controller, at the cost of control accuracy); β is the speed error reward coefficient, a negative real number of small absolute value (the larger |β| is, the higher the control accuracy of the training result, at the cost of higher fuel consumption); η is the rollover penalty coefficient, a negative real number whose absolute value is larger than those of α and β, and introducing the rollover penalty helps keep the attitude from exceeding its constraint values during landing; κ is the constant reward coefficient, a positive real number whose absolute value is greater than those of α and β. Since r_fuel, r_vel, and r_crash are all negative, adding the constant reward encourages the lander to keep exploring before a reasonable control policy has been learned, avoiding premature episode termination and aiding training convergence;
2) observed value
The observation is designed as

s = [δv_b, sin(φ), cos(φ), sin(θ), cos(θ), sin(ψ), cos(ψ), ω_t, Q]

where δv_b is the component of the difference between the true speed and the desired speed in the body coordinate system, and Q is the attitude quaternion.
3) Motion output
The agent's output action a_i ∈ [−1, 1] is converted into a thrust output through the linear mapping

T_i = T_min + (a_i + 1)·(T_max − T_min)/2

so that the resulting thrust output satisfies the constraint T_i ∈ [T_min, T_max].
4) Network design
According to the selected algorithm, the corresponding value function and policy networks are designed, including the depth, width, and activation function of each network. In the present invention, three hidden layers with ReLU activation and a width of 200 meet the training requirements.
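A sketch of that policy network in plain NumPy follows. The three hidden ReLU layers of width 200 come from the text; the output dimension of 4 engines, the initialization scale, and the tanh output activation (consistent with the a_i ∈ [−1, 1] action range) are assumptions:

```python
import numpy as np

def init_policy(obs_dim=16, act_dim=4, width=200, seed=0):
    """Parameters for an obs_dim -> 200 -> 200 -> 200 -> act_dim policy network."""
    rng = np.random.default_rng(seed)
    dims = [obs_dim, width, width, width, act_dim]
    ws = [rng.normal(0.0, 0.1, (dims[i + 1], dims[i])) for i in range(len(dims) - 1)]
    bs = [np.zeros(dims[i + 1]) for i in range(len(dims) - 1)]
    return ws, bs

def policy_forward(x, ws, bs):
    """Forward pass: ReLU hidden layers, tanh output so actions stay in [-1, 1]."""
    h = np.asarray(x, dtype=float)
    for w, b in zip(ws[:-1], bs[:-1]):
        h = np.maximum(w @ h + b, 0.0)   # ReLU hidden layer
    return np.tanh(ws[-1] @ h + bs[-1])  # bounded action output
```

The document itself builds these networks with PyTorch; the NumPy version is only meant to make the layer structure concrete.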
Step three: setting up simulation environment for training
Based on steps one and two, the environment dynamics model, the feedback interaction model, and the agent's policy and value function networks are constructed, and the agent is trained using a deep learning framework.
Step four: testing landing control effect
First, the speed tracking capability of the lander is tested: a desired speed v_d = [v_dx, v_dy, v_dz] and various initial speeds v_0 = [v_x0, v_y0, v_z0] are set, and the trained controller drives the speed to the desired value.
Then the soft landing test is performed: initial positions, speeds, and attitudes are generated randomly according to the initial conditions of the soft landing powered descent phase, the trained speed controller is used for soft landing, and the training effect is evaluated through multiple targeting experiments.
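The targeting-experiment procedure just described can be sketched as a success-rate estimator. The initial-condition ranges and the `run_episode` callable (returning True on a successful soft landing) are hypothetical stand-ins for the trained controller plus simulator:

```python
import random

def landing_success_rate(run_episode, n_trials=100, seed=1):
    """Repeat the soft landing test from randomized initial conditions
    and report the fraction of successful landings."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(n_trials):
        init = {
            "position": [rng.uniform(-50.0, 50.0) for _ in range(3)],  # m
            "velocity": [rng.uniform(-5.0, 5.0) for _ in range(3)],    # m/s
            "attitude": [rng.uniform(-0.2, 0.2) for _ in range(3)],    # rad
        }
        if run_episode(init):
            successes += 1
    return successes / n_trials
```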
The invention can train the controller with the model-free DDPG, TD3, and SAC algorithms. First, the policy network parameters θ, the value function network parameters φ, and the experience replay pool D are initialized. Then the agent interacts with the environment: at each step the agent observes the environment state s and outputs an action a according to the current policy; the environment state transitions to s' and feeds back a reward r and an episode-termination signal d; and the experience tuple (s, a, s', r, d) is stored in D. Whenever an update period is reached, a batch of experiences is randomly sampled from D and the network parameters are updated along the gradient of the constructed loss function. These steps repeat until the episode reward converges.
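The shared off-policy loop just described (the DDPG/TD3/SAC pseudocode itself appears only as images in the source) can be sketched as follows; the `env` and `agent` interfaces are hypothetical minimal stand-ins:

```python
import random

def train(env, agent, replay, episodes=20, batch_size=32, update_every=10):
    """Observe s, act a, store (s, a, s', r, d) in the replay pool D,
    and periodically update the agent from a random minibatch."""
    step = 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = agent.act(s)                    # current policy (+ exploration noise)
            s2, r, done = env.step(a)           # environment feedback
            replay.append((s, a, s2, r, done))  # experience replay pool D
            s = s2
            step += 1
            if step % update_every == 0 and len(replay) >= batch_size:
                batch = random.sample(replay, batch_size)  # random minibatch
                agent.update(batch)             # gradient step on the loss
    return step
```

The three algorithms differ only in what `agent.update` does (deterministic vs. stochastic policy, single vs. twin critics, entropy regularization); the interaction loop is identical.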
The following examples were used to demonstrate the beneficial effects of the present invention:
example (b):
1) experimental Environment settings
This example selects DDPG, TD3, and SAC for training and testing. The value function network of the three algorithms is designed as in Fig. 1, the policy network structure of DDPG and TD3 as in Fig. 2, and the SAC policy network structure as in Fig. 3. The neural networks are built and trained in Python using PyTorch.
The software environment for the simulation tests of all algorithms herein is Ubuntu 16.04, and the hardware environment is an Intel(R) Core(TM) i5-9300H CPU + NVIDIA GeForce GTX 1660 Ti + 16.0 GB RAM.
2) Results and analysis of the experiments
The training process curves of the DDPG, TD3, and SAC algorithms are shown in Figs. 4, 5, and 6, respectively. The DDPG episode reward begins to rise at about 10,000 episodes, improves markedly between 10,000 and 20,000 episodes, then gradually converges to about 300, with training finishing at about 40,000 episodes; the training stabilizes in terms of average reward, but the per-episode reward fluctuates between 100 and 400 and is quite unstable. With TD3, the agent's reward improves significantly after 700 episodes of training and essentially stabilizes at 450 after about 5,000 episodes; compared with DDPG, it is more stable and converges faster. SAC undergoes clear improvements at about 3,000 and 5,000 episodes and finally stabilizes after 9,500 episodes of training, with the reward converging to 500; compared with TD3, SAC's per-episode reward once stabilized is higher, approaching 600.
The DDPG, TD3, and SAC speed test curves are shown in Figs. 7, 8, and 9, respectively. Starting from a large initial vertical error, DDPG can control the speed error to within 2 m/s but cannot stabilize it, oscillating continuously; TD3 controls the speed close to the desired value, with an accuracy near 1 m/s; SAC achieves the highest accuracy, within 0.1 m/s.
The two-dimensional distributions of the DDPG, TD3, and SAC landing sites are shown in Figs. 10, 11, and 12, respectively. Over 100 targeting experiments, the landing success rates of DDPG, TD3, and SAC were 74%, 92%, and 96%, respectively. The landing-point accuracy of TD3 and SAC is significantly better than that of DDPG.
The method described above realizes planetary soft landing control and provides a new approach for research on planetary soft landing trajectory optimization and control.
The present invention is capable of other embodiments and its several details are capable of modifications in various obvious respects, all without departing from the spirit and scope of the present invention.
Claims (10)
1. A planet soft landing control method based on reinforcement learning is characterized in that: the method comprises the following steps:
step one: establishing a six-degree-of-freedom dynamics model of the lander's powered descent phase based on the lander mass m, the inertia matrix I, and the engine power configuration; the engine power configuration comprises the number of engines n, the installation position of each engine, and the thrust range of a single engine, T_i ∈ [T_min, T_max], where T_i is the thrust of a single engine and i = 1, 2, …, n is the engine index;
Step two: based on the dynamics model established in the first step, the thrust of each engine is used as control output, information reflecting the state of the lander is used as an observation vector, a reward function is designed for evaluating the control performance, and a corresponding neural network is designed according to different reinforcement learning algorithm frameworks;
step three: building a numerical simulation environment based on the dynamics model built in the step one and the interactive environment design in the step two, and respectively training by utilizing different reinforcement learning algorithm frames to obtain a soft landing controller as an alternative controller selected in subsequent tests;
step four: and based on the soft landing controller obtained in the step three, comprehensively evaluating the speed tracking capability and the soft landing precision through a speed tracking performance test and a power descent section soft landing test respectively, and selecting the controller with the optimal performance for planetary soft landing control according to the test effect.
2. The reinforcement-learning-based planetary soft landing control method according to claim 1, wherein the six-degree-of-freedom dynamics model of the powered descent phase in step one comprises a centroid translational dynamics model and an attitude dynamics model:

m·(dv_b/dt) = F_b − m·(ω_t × v_b)
I·(dω_t/dt) = M_b − ω_t × (I·ω_t)

where ω_t and v_b are the component forms of the attitude angular velocity and the velocity in the body coordinate system, F_b and M_b are respectively the resultant force and resultant moment vectors of the external forces on the lander, and I is the inertia matrix of the lander;
The resultant external force and moment vectors F_b and M_b on the lander are expressed as follows:

F_b = F_T + m·g_b + F_N
M_b = M_T + M_N

where F_T and M_T are respectively the resultant force and moment vectors generated by the engines, F_N and M_N are respectively the resultant force and moment caused by external disturbances, g_b is the component form of the planetary surface gravitational acceleration in the lander body coordinate system, and m is the mass of the lander;
the aerodynamic force in the powered descent phase is small compared with the engine thrust and the planetary gravity, and is treated as a disturbance embodied in F_N and M_N, namely
F_N = F_wind + δF
M_N = M_wind + δM
where F_wind and M_wind are respectively the resultant force and resultant moment caused by the atmosphere, and δF and δM are respectively the resultant force and resultant moment caused by unmodeled disturbances, which include engine installation deviation and thrust magnitude fluctuation.
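The translational and attitude dynamics equations of claim 2 are published as images in the original document. As an illustrative sketch only, the standard Newton-Euler form consistent with the symbols defined above (this exact form, and all function names, are assumptions):

```python
import numpy as np

def dynamics_step(v_b, omega_t, m, I, F_b, M_b, dt):
    """One explicit-Euler step of six-degree-of-freedom rigid-body dynamics
    in the body frame. The patent's own equations are images; this standard
    Newton-Euler form is an assumption consistent with the defined symbols."""
    # Centroid translational dynamics: m * (dv_b/dt + omega_t x v_b) = F_b
    v_b_dot = F_b / m - np.cross(omega_t, v_b)
    # Attitude dynamics: I * domega_t/dt + omega_t x (I @ omega_t) = M_b
    omega_t_dot = np.linalg.solve(I, M_b - np.cross(omega_t, I @ omega_t))
    return v_b + v_b_dot * dt, omega_t + omega_t_dot * dt
```

A fixed-step integrator like this is typical for RL training environments, where speed matters more than high-order accuracy.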
3. The reinforcement learning based planetary soft landing control method of claim 2, wherein the reward function in step two takes the form:
r = r_fuel + r_vel + r_crash + r_constant + r_goal
where r_fuel is the fuel consumption penalty, r_vel is the speed tracking reward, r_crash is the landing rollover penalty, r_constant is a constant-value reward, and r_goal is the reward for a successful soft landing.
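The composite reward is a plain sum of the five terms. As a sketch (the function name is an assumption; each term is computed per the forms of claims 4-7):

```python
def total_reward(r_fuel, r_vel, r_crash, r_constant, r_goal):
    """Claim 3: the per-step reward is the sum of the fuel penalty, speed
    tracking reward, rollover penalty, constant-value reward, and soft
    landing success reward."""
    return r_fuel + r_vel + r_crash + r_constant + r_goal
```

The constant-value term acts as a per-step "alive" bonus in many RL landing formulations, though its exact role here is not spelled out in the extract.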
4. The reinforcement learning based planetary soft landing control method of claim 3, wherein the fuel consumption penalty takes the form
where α is the fuel consumption penalty coefficient, a negative real number of small absolute value; the larger its absolute value, the lower the fuel consumption of the trained controller, at the cost of control accuracy.
5. The reinforcement learning based planetary soft landing control method of claim 3, wherein the speed tracking reward takes the form
r_vel = β·||v − v_ref||
where β is the speed error reward coefficient; the larger its absolute value, the higher the control precision of the training result, but the greater the fuel consumption; v_ref is the reference velocity to be tracked during landing, generated by the guidance law.
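Claims 4 and 5 can be sketched together. The exact fuel-penalty expression is published as an image, so penalizing the summed commanded thrust per step is an assumed (though common) choice, and both coefficient values are assumptions:

```python
import numpy as np

def fuel_penalty(thrust_magnitudes, alpha=-0.001):
    """Claim 4 sketch: alpha is a negative real number of small absolute
    value. The published penalty form is an image; penalizing the summed
    commanded thrust per step is an assumption."""
    return alpha * float(np.sum(thrust_magnitudes))

def velocity_reward(v, v_ref, beta=-0.05):
    """Claim 5: r_vel = beta * ||v - v_ref||. A negative beta turns the
    tracking error norm into a penalty (the value -0.05 is an assumption)."""
    return beta * float(np.linalg.norm(np.asarray(v) - np.asarray(v_ref)))
```

Both coefficients trade off against each other, matching the claims' remark that tighter tracking (larger |β|) costs fuel and stronger fuel saving (larger |α|) costs accuracy.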
6. The reinforcement learning based planetary soft landing control method of claim 3, wherein the landing rollover penalty takes the form
r_crash = η·(φ > φ_lim or θ > θ_lim or ψ > ψ_lim)
where η is the rollover penalty coefficient, and introducing this penalty helps prevent the attitude from exceeding its constraint values during landing; ψ, θ and φ are respectively the yaw, pitch and roll angles, with rotation order zyx, representing the attitude of the lander body coordinate system relative to the navigation coordinate system; ψ_lim, θ_lim and φ_lim are the upper bound values of ψ, θ and φ, respectively.
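The rollover penalty is an indicator term: η is paid once any attitude angle leaves its bound. A minimal sketch (η's value is an assumption, and comparing angle magnitudes rather than signed angles is also an assumption):

```python
def rollover_penalty(phi, theta, psi, phi_lim, theta_lim, psi_lim, eta=-50.0):
    """Claim 6: r_crash = eta * (phi > phi_lim or theta > theta_lim or
    psi > psi_lim). Angles in radians; magnitudes are compared here, an
    assumption, and eta = -50 is illustrative."""
    exceeded = (abs(phi) > phi_lim) or (abs(theta) > theta_lim) or (abs(psi) > psi_lim)
    return eta * float(exceeded)
```

Because the term is zero everywhere inside the constraint set, it shapes behavior only near the boundary, which is why the claims describe it as helping the attitude stay within its constraint values.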
7. The reinforcement learning based planetary soft landing control method of claim 3, wherein the reward for a successful soft landing takes the form
r_goal = λ·(h < 0 and v_z > 0 and ||v|| < v_lim and φ < φ_lim and θ < θ_lim and ψ < ψ_lim and ||ω|| < ω_lim)
where λ is the soft landing reward coefficient, h is the lander height, v_z is the velocity component in the height direction, and ω_lim is the soft landing attitude angular velocity threshold.
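The success reward pays λ only when every touchdown condition holds simultaneously. A sketch (λ's value is an assumption; the v_z > 0 test follows the claim's sign convention, under which a positive height-direction component apparently denotes descent):

```python
import numpy as np

def soft_landing_reward(h, v, omega, phi, theta, psi,
                        v_lim, phi_lim, theta_lim, psi_lim, omega_lim,
                        lam=100.0):
    """Claim 7: reward lam only if the lander has reached the ground
    (h < 0), is moving in the descent direction (v_z > 0 per the claim's
    convention), and speed, attitude, and angular rate are all within
    their soft landing thresholds. lam = 100 is illustrative."""
    v = np.asarray(v, dtype=float)
    ok = (h < 0 and v[2] > 0
          and np.linalg.norm(v) < v_lim
          and abs(phi) < phi_lim and abs(theta) < theta_lim and abs(psi) < psi_lim
          and np.linalg.norm(omega) < omega_lim)
    return lam * float(ok)
```

Gating all conditions with a logical AND makes the sparse terminal reward fire only on a genuinely soft, upright touchdown.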
8. The reinforcement learning based planetary soft landing control method of claim 1, wherein the observation vector in step two takes the form:
s = [δv_b, sin(φ), cos(φ), sin(θ), cos(θ), sin(ψ), cos(ψ), ω_t, Q]
where δv_b is the component form of the difference between the true velocity and the desired velocity in the body coordinate system, and Q is the attitude quaternion;
in step two the range of each thrust output is a_i ∈ [−1, 1], which is mapped by a linear change to the thrust output T_i satisfying the thrust constraint T_i ∈ [T_min, T_max].
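The observation construction and the thrust mapping of claim 8 can be sketched as follows. The exact "linear change" is published as an image, so the affine form below is an assumption (it is the natural reading that sends −1 to T_min and +1 to T_max):

```python
import numpy as np

def observation(v_b, v_ref_b, phi, theta, psi, omega_t, quat):
    """Claim 8 observation s = [dv_b, sin/cos of Euler angles, omega_t, Q].
    Feeding sin/cos pairs instead of raw angles avoids the 2*pi wrap-around
    discontinuity in the network input."""
    dv_b = np.asarray(v_b, dtype=float) - np.asarray(v_ref_b, dtype=float)
    trig = [f(a) for a in (phi, theta, psi) for f in (np.sin, np.cos)]
    return np.concatenate([dv_b, trig, omega_t, quat])  # 3 + 6 + 3 + 4 = 16

def action_to_thrust(a_i, t_min, t_max):
    """Map a normalized action a_i in [-1, 1] linearly to a thrust T_i in
    [T_min, T_max]; this affine form is an assumption (the patent's formula
    is an image)."""
    return t_min + (a_i + 1.0) * 0.5 * (t_max - t_min)
```

Bounding the policy output in [−1, 1] and rescaling per engine is standard practice for continuous-action RL, since tanh-squashed policies naturally produce that range.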
9. The planetary soft landing control method based on reinforcement learning of claim 1, wherein the state space and the action space of the reinforcement learning algorithm framework in the third step are continuous.
10. A computer-readable storage medium, characterized in that: the computer-readable storage medium stores a computer program configured to, when invoked by a processor, implement the steps of the reinforcement learning based planetary soft landing control method of any of claims 1-9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111196380.7A CN113821057B (en) | 2021-10-14 | 2021-10-14 | Planetary soft landing control method and system based on reinforcement learning and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113821057A true CN113821057A (en) | 2021-12-21 |
CN113821057B CN113821057B (en) | 2023-05-30 |
Family
ID=78916535
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111196380.7A Active CN113821057B (en) | 2021-10-14 | 2021-10-14 | Planetary soft landing control method and system based on reinforcement learning and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113821057B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117631547B (en) * | 2024-01-26 | 2024-04-26 | 哈尔滨工业大学 | Landing control method for quadruped robot under irregular weak gravitational field of small celestial body |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107065571A (en) * | 2017-06-06 | 2017-08-18 | 上海航天控制技术研究所 | A kind of objects outside Earth soft landing Guidance and control method based on machine learning algorithm |
CN107656439A (en) * | 2017-11-13 | 2018-02-02 | 浙江大学 | A kind of moon detector in flexible landing optimal control system based on Self Adaptive Control grid |
CN109212976A (en) * | 2018-11-20 | 2019-01-15 | 北京理工大学 | The small feature loss soft landing robust trajectory tracking control method of input-bound |
CN109733415A (en) * | 2019-01-08 | 2019-05-10 | 同济大学 | A kind of automatic Pilot following-speed model that personalizes based on deeply study |
CN111460650A (en) * | 2020-03-31 | 2020-07-28 | 北京航空航天大学 | Unmanned aerial vehicle end-to-end control method based on deep reinforcement learning |
US20200272625A1 (en) * | 2019-02-22 | 2020-08-27 | National Geographic Society | Platform and method for evaluating, exploring, monitoring and predicting the status of regions of the planet through time |
WO2021125395A1 (en) * | 2019-12-18 | 2021-06-24 | 한국항공우주연구원 | Method for determining specific area for optical navigation on basis of artificial neural network, on-board map generation device, and method for determining direction of lander |
CN113408796A (en) * | 2021-06-04 | 2021-09-17 | 北京理工大学 | Deep space probe soft landing path planning method for multitask deep reinforcement learning |
Also Published As
Publication number | Publication date |
---|---|
CN113821057B (en) | 2023-05-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110806759B (en) | Aircraft route tracking method based on deep reinforcement learning | |
CN110850719B (en) | Spatial non-cooperative target parameter self-tuning tracking method based on reinforcement learning | |
Wang et al. | Nonlinear aeroelastic control of very flexible aircraft using model updating | |
CN112462792A (en) | Underwater robot motion control method based on Actor-Critic algorithm | |
CN107085435A (en) | Hypersonic aircraft attitude harmony control method based on coupling analysis | |
Kapnopoulos et al. | A cooperative particle swarm optimization approach for tuning an MPC-based quadrotor trajectory tracking scheme | |
CN112859889A (en) | Autonomous underwater robot control method and system based on self-adaptive dynamic planning | |
CN113821057B (en) | Planetary soft landing control method and system based on reinforcement learning and storage medium | |
CN114200950B (en) | Flight attitude control method | |
Dong et al. | Trial input method and own-aircraft state prediction in autonomous air combat | |
CN114637312A (en) | Unmanned aerial vehicle energy-saving flight control method and system based on intelligent deformation decision | |
Wang et al. | A new spacecraft attitude stabilization mechanism using deep reinforcement learning method | |
CN116820134A (en) | Unmanned aerial vehicle formation maintaining control method based on deep reinforcement learning | |
Chen et al. | An experimental study of the wire-driven compliant robotic fish | |
CN116697829A (en) | Rocket landing guidance method and system based on deep reinforcement learning | |
CN117289709A (en) | High-ultrasonic-speed appearance-changing aircraft attitude control method based on deep reinforcement learning | |
CN116620566A (en) | Non-cooperative target attached multi-node intelligent cooperative guidance method | |
CN116068894A (en) | Rocket recovery guidance method based on double-layer reinforcement learning | |
Wang et al. | Deep learning based missile trajectory prediction | |
CN116360258A (en) | Hypersonic deformed aircraft anti-interference control method based on fixed time convergence | |
CN113050420B (en) | AUV path tracking method and system based on S-plane control and TD3 | |
CN113418674A (en) | Wind tunnel track capture test method with three-degree-of-freedom motion of primary model | |
Hong et al. | Control of a fly-mimicking flyer in complex flow using deep reinforcement learning | |
Breese et al. | Physics-Based Neural Networks for Modeling & Control of Aerial Vehicles | |
Cheng et al. | Cross-cycle iterative unmanned aerial vehicle reentry guidance based on reinforcement learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||