CN113821057A - Planetary soft landing control method and system based on reinforcement learning and storage medium


Info

Publication number
CN113821057A
Authority
CN (China)
Prior art keywords
soft landing
reinforcement learning
lander
planetary
Legal status
Granted
Application number
CN202111196380.7A
Other languages
Chinese (zh)
Other versions
CN113821057B
Inventors
白成超 (Bai Chengchao)
郭继峰 (Guo Jifeng)
陈宇燊 (Chen Yushen)
Current Assignee
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Filing date
2021-10-14
Publication date
2021-12-21 (CN113821057A); 2023-05-30 (CN113821057B)
Application CN202111196380.7A filed by Harbin Institute of Technology
Publication of CN113821057A
Application granted
Publication of CN113821057B
Legal status: Active


Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
    • G05D1/10Simultaneous control of position or course in three dimensions
    • G05D1/101Simultaneous control of position or course in three dimensions specially adapted for aircraft
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T90/00Enabling technologies or technologies with a potential or indirect contribution to GHG emissions mitigation

Landscapes

  • Engineering & Computer Science (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Automation & Control Theory (AREA)
  • Feedback Control In General (AREA)

Abstract

A planetary soft landing control method and system based on reinforcement learning, and a storage medium, relate to the field of soft landing trajectory optimization and control, and aim to solve the problems that existing planetary soft landing control cannot guarantee an optimal guidance law, relies on a complex model, and is difficult to train to convergence. The invention comprises the following steps. Step one: establish a six-degree-of-freedom dynamic model of the lander's powered descent phase based on lander characteristics such as hardware configuration and engine power configuration. Step two: design the reward function, observation space, action space and neural network structure of the training interaction environment. Step three: build a numerical simulation environment and train with reinforcement learning algorithms to obtain soft landing controllers. Step four: evaluate the training control effect through a velocity tracking test and a soft landing test. Executing steps one and two yields the soft landing reinforcement learning environment model, and the agent interacts with this environment model to obtain training data. Finally, through step four, the best-performing training result is selected as the optimal soft landing controller. The method is used for soft landing trajectory optimization and control.

Description

Planetary soft landing control method and system based on reinforcement learning and storage medium
Technical Field
The invention relates to a planetary soft landing control method based on reinforcement learning, and belongs to the technical field of soft landing trajectory optimization and control and the technical field of deep space exploration.
Background
A reinforcement learning algorithm is a type of machine learning algorithm in which an agent learns by trial and error, assessing the quality of an action through the reward obtained by interacting with the environment, with the goal of maximizing the reward the agent obtains. Reinforcement learning algorithms can generally be classified as model-based or model-free.
Patent document CN110466805B discloses an asteroid landing guidance method based on optimized guidance parameters. The method establishes the dynamic equations of the probe in a landing-site coordinate system; analyzes the motion of the probe along the three directions of the landing-site coordinate system to obtain relations among the probe's position, velocity, acceleration and time in each direction; establishes a functional relation between the guidance parameters and the initial state based on these motion relations and solves the coefficients of that relation by parameter estimation, the relation being an optimized selection formula for the guidance parameters; substitutes the optimized selection formulas for the three directions into the corresponding position-velocity-acceleration-time relations and combines them with the dynamic equations to obtain an asteroid landing guidance law based on the optimized guidance parameters in the three directions; and performs asteroid soft landing guidance with this guidance law. The method can improve the efficiency of asteroid landing guidance, but it is a model-based landing guidance design algorithm, requires complex processing of the model, and does not address how the probe executes the control.
In addition, researchers have studied the planetary soft landing phase with reinforcement learning in which guidance and control are integrated, so that an end-to-end planetary soft landing algorithm can be obtained through training. However, because guidance and control are integrated in such algorithms, the guidance law is designed only indirectly through the reward function, so the optimality of the guidance law cannot be guaranteed, i.e., fuel consumption cannot be minimized. Moreover, the integration of guidance and control makes the model highly complex, makes training difficult to converge, and lengthens the parameter selection and tuning cycle.
Disclosure of Invention
The technical problem to be solved by the invention is as follows:
the invention provides a planetary soft landing control method and system based on reinforcement learning and a storage medium, aiming at solving the problems that the existing planetary soft landing control can not ensure optimal guidance law, has a complex model and is difficult to train and converge and the like and aiming at realizing more autonomous planetary soft landing.
The technical scheme adopted by the invention for solving the technical problems is as follows:
a planetary soft landing control method based on reinforcement learning comprises the following steps:
Step one: establishing a six-degree-of-freedom dynamic model of the lander's powered descent phase based on the lander mass m, the inertia matrix I and the engine power configuration; the engine power configuration comprises the number of engines n, the installation position of each engine and the single-engine thrust range T_i ∈ [T_min, T_max], where T_i is the thrust of a single engine and i is the engine index, i = 1, 2, ..., n;
Step two: based on the dynamics model established in the first step, the thrust of each engine is used as control output, information reflecting the state of the lander is used as an observation vector, a reward function is designed for evaluating the control performance, and a corresponding neural network is designed according to different reinforcement learning algorithm frameworks;
step three: building a numerical simulation environment based on the dynamics model built in the step one and the interactive environment design in the step two, and respectively training by utilizing different reinforcement learning algorithm frames to obtain a soft landing controller as an alternative controller selected in subsequent tests;
step four: and based on the soft landing controller obtained in the step three, comprehensively evaluating the speed tracking capability and the soft landing precision through a speed tracking performance test and a power descent section soft landing test respectively, and selecting the controller with the optimal performance for planetary soft landing control according to the test effect.
In step one, the six-degree-of-freedom dynamic model of the powered descent phase comprises a centroid translational dynamic model and an attitude dynamic model, in the standard body-frame Newton-Euler form:
dv_b/dt = F_b/m - ω_t × v_b
dω_t/dt = I^(-1)·(M_b - ω_t × I·ω_t)
where ω_t and v_b are the component forms of the attitude angular velocity and the velocity in the body coordinate system, F_b and M_b are respectively the resultant force and resultant moment vectors of the external forces acting on the lander, and I is the inertia matrix of the lander;
The resultant external force and moment vectors F_b and M_b acting on the lander are expressed as follows:
F_b = F_T + m·g_b + F_N
M_b = M_T + M_N
where F_T and M_T are respectively the resultant force and resultant moment vectors generated by the engines, F_N and M_N are respectively the resultant force and resultant moment caused by external disturbances, g_b is the component form of the planetary surface gravitational acceleration in the lander body coordinate system, and m is the mass of the lander;
The aerodynamic force during the powered descent phase is small compared with the engine thrust and the planetary gravity, and is treated as a disturbance, embodied in F_N and M_N as
F_N = F_wind + δF
M_N = M_wind + δM
where F_wind and M_wind are respectively the resultant force and resultant moment caused by the atmosphere, and δF and δM are respectively the resultant force and resultant moment caused by unmodeled disturbances, which include engine installation deviation and thrust magnitude fluctuation.
In step two, the reward function takes the form:
r = r_fuel + r_vel + r_crash + r_constant + r_goal
where r_fuel is the fuel consumption penalty, r_vel is the velocity tracking reward, r_crash is the landing rollover penalty, r_constant is a constant reward, and r_goal is the reward for a successful soft landing.
The fuel consumption penalty, proportional to the total commanded thrust, takes the form
r_fuel = α · Σ_{i=1..n} T_i
where α is the fuel consumption penalty coefficient, a negative real number of small absolute value; the larger its absolute value, the lower the fuel consumption of the trained controller, at the cost of control accuracy.
The velocity tracking reward takes the form
r_vel = β · ||v - v_ref||
where β is the velocity error reward coefficient, a negative real number of small absolute value; the larger its absolute value, the higher the control accuracy of the training result, but the larger the fuel consumption. v_ref is the reference velocity to be tracked during the landing process, generated by the guidance law.
The landing rollover penalty takes the form
r_crash = η · (φ > φ_lim or θ > θ_lim or ψ > ψ_lim)
where η is the rollover penalty coefficient, a negative real number with absolute value larger than those of α and β; introducing the rollover penalty helps prevent the attitude from exceeding its constraint values during landing. ψ, θ and φ are respectively the yaw, pitch and roll angles, with rotation order zyx, describing the attitude of the lander body coordinate system relative to the navigation coordinate system; ψ_lim, θ_lim and φ_lim are the upper bound values of ψ, θ and φ, respectively.
The constant reward r_constant is a positive real number with absolute value greater than those of α and β. Since r_fuel, r_vel and r_crash are all negative, adding the constant reward encourages the lander to keep exploring before a reasonable control strategy has been learned, avoiding premature episode termination and aiding training convergence.
The reward for a successful soft landing takes the form
r_goal = λ · (h < 0 and v_z > 0 and ||v|| < v_lim and φ < φ_lim and θ < θ_lim and ψ < ψ_lim and ||ω|| < ω_lim)
where λ is the soft landing reward coefficient, h is the lander altitude, v_z is the velocity component along the altitude direction, v_lim is the soft landing velocity threshold, and ω_lim is the soft landing attitude angular velocity threshold.
In step two, the observation vector takes the form:
s = [δv_b, sin(φ), cos(φ), sin(θ), cos(θ), sin(ψ), cos(ψ), ω_t, Q]
where δv_b is the component of the difference between the actual velocity and the desired velocity in the body coordinate system, and Q is the attitude quaternion;
step two the thrust output range is
ai∈[-1,1]
By linear change
Figure BDA0003303137240000041
To obtain thrust output TiCan satisfy the thrust constraint Ti∈[Tmin,Tmax]。
In step three, the state space and the action space of the reinforcement learning algorithm framework are continuous.
A planetary soft landing control system based on reinforcement learning executes the steps in the planetary soft landing control method based on reinforcement learning during operation.
A computer-readable storage medium characterized by: the computer readable storage medium stores a computer program configured to, when invoked by a processor, implement the steps of the reinforcement learning-based planetary soft landing control method described above.
The invention has at least the following beneficial technical effects:
the planet soft landing control method based on reinforcement learning provided by the invention realizes more autonomous planet soft landing. Firstly, a six-degree-of-freedom dynamic model of a soft landing power descent section is established according to the configuration of a lander and the power attribute of an engine, observation information, environment reward feedback, action output of an intelligent agent and network results of the landing controller are designed according to the problem characteristics, an enhanced learning environment model is set up for training, the training results are tested, and the planetary soft landing controller based on data driving is realized.
Planetary soft landing is a precondition for planetary surface exploration missions: an accurate soft landing allows high-value targets to be explored while avoiding damage to the scientific instruments. Reinforcement learning-based planetary soft landing trains the lander controller through soft landing interaction, thereby realizing an accurate planetary soft landing. Compared with traditional model-based soft landing guidance and control algorithms, reinforcement learning-based control has the following advantages: 1) training can be carried out without a model, so the designer does not need to process the model when designing the controller; control performance is evaluated through the reward function, which indirectly guides the controller to learn and optimizes the control performance; 2) the strong fitting capability of deep neural networks handles the strong nonlinearity of the soft landing planning and control problem well; 3) reinforcement learning is an end-to-end algorithm that directly perceives the landing environment state and outputs engine thrust commands, with no need to solve the landing trajectory offline in advance, giving strong real-time performance.
The planetary soft landing control algorithm based on reinforcement learning provided by the invention can track the desired velocity well during the soft landing process and realize planetary soft landing control. At the same time, rewarding the tracking of a reference velocity effectively mitigates the sparse-reward problem during training and greatly improves the success rate of training convergence. Learning to track a desired velocity, rather than being tied to a specific landing guidance signal, also improves the transferability of the training results. To realize more autonomous planetary soft landing, the invention optimizes the soft landing controller from experience through the trial-and-error learning of reinforcement learning, avoids processing the lander's dynamic model, and realizes model-free soft landing control.
Drawings
Fig. 1 is a value function network structure.
Fig. 2 is a DDPG and TD3 algorithm policy network structure.
Fig. 3 is a SAC algorithm policy network structure.
Figure 4 is a DDPG algorithm training process reward variation curve.
FIG. 5 is a reward variation curve of the TD3 algorithm training process.
Fig. 6 is a change curve of reward in the SAC algorithm training process.
Figure 7 is a DDPG algorithm speed control test curve.
FIG. 8 is a TD3 algorithm speed control test curve.
Fig. 9 is a SAC algorithm speed control test curve.
FIG. 10 is a DDPG algorithm soft landing test landing site distribution.
FIG. 11 is a TD3 algorithm soft landing test landing site distribution.
FIG. 12 is a SAC algorithm soft landing test landing site distribution.
Detailed Description
As shown in fig. 1 to 12, the planetary soft landing control method based on reinforcement learning according to the present embodiment includes the following steps:
the method comprises the following steps: lander power descending section six-freedom-degree dynamic model establishment
In this embodiment, taking a Mars soft landing as an example, the centroid translational dynamics and attitude dynamics are established, in the standard body-frame Newton-Euler form, as
dv_b/dt = F_b/m - ω_t × v_b
dω_t/dt = I^(-1)·(M_b - ω_t × I·ω_t)
where ω_t and v_b are respectively the components of the attitude angular velocity and the velocity in the body coordinate system, m is the mass of the lander, I is the rotational inertia matrix of the lander, and F_b and M_b are respectively the resultant force and resultant moment vectors of the external forces acting on the lander:
F_b = F_T + m·g_b + F_N
M_b = M_T + M_N
where F_T and M_T are respectively the resultant force and resultant moment vectors generated by the engines, F_N and M_N are respectively the resultant force and resultant moment caused by external disturbances, and g_b is the component form of the Martian surface gravitational acceleration in the lander body coordinate system.
According to this model, given the engine thrust output of the lander, the lander state at each moment can be obtained by integration as s = [x, y, z, v_x, v_y, v_z, φ, θ, ψ, p, q, r], where x, y, z are the three-axis positions, v_x, v_y, v_z are the three-axis velocities, φ, θ, ψ are the attitude angles with rotation order zyx, and p, q, r are the three-axis attitude angular velocities.
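For illustration, a minimal Python sketch of one propagation step of this dynamics model is given below; the mass, inertia matrix, gravity vector, initial velocity and step size are placeholder assumptions for the sketch, not values specified by the invention.

```python
import numpy as np

# Illustrative 6-DOF powered-descent propagation step (body-frame Newton-Euler).
# All numerical values are placeholder assumptions.
m = 1000.0                              # lander mass [kg] (assumed)
I_mat = np.diag([800.0, 800.0, 600.0])  # inertia matrix [kg*m^2] (assumed)
g_b = np.array([0.0, 0.0, 3.71])        # Mars gravity expressed in the body frame [m/s^2] (assumed)

def dynamics_step(v_b, omega_t, F_T, M_T, F_N, M_N, dt=0.01):
    """One explicit-Euler step of the translational and attitude dynamics."""
    F_b = F_T + m * g_b + F_N                       # resultant external force
    M_b = M_T + M_N                                 # resultant external moment
    v_dot = F_b / m - np.cross(omega_t, v_b)        # dv_b/dt
    omega_dot = np.linalg.solve(I_mat, M_b - np.cross(omega_t, I_mat @ omega_t))  # domega_t/dt
    return v_b + v_dot * dt, omega_t + omega_dot * dt

# Example: one step with a net braking thrust and no disturbances.
v_b = np.array([0.0, 0.0, 50.0])        # descending at 50 m/s along body z (assumed)
omega_t = np.zeros(3)
v_b, omega_t = dynamics_step(v_b, omega_t,
                             F_T=np.array([0.0, 0.0, -5000.0]), M_T=np.zeros(3),
                             F_N=np.zeros(3), M_N=np.zeros(3))
```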
Step two: design of an interactive environment
1) Reward function
The reward function is in the form of
r = r_fuel + r_vel + r_crash + r_constant + r_goal
where r_fuel, r_vel, r_crash, r_constant and r_goal are respectively the fuel consumption penalty, the velocity tracking reward, the rollover penalty, the constant reward and the soft landing reward, calculated as
r_fuel = α · Σ_{i=1..n} T_i
r_vel = β · ||v - v_ref||
r_crash = η · (φ > φ_lim or θ > θ_lim or ψ > ψ_lim)
r_constant = κ
r_goal = λ · (h < 0 and v_z > 0 and ||v|| < v_lim and φ < φ_lim and θ < θ_lim and ψ < ψ_lim and ||ω|| < ω_lim)
where α is the fuel consumption penalty coefficient, a negative real number of small absolute value; the larger its absolute value, the lower the fuel consumption of the trained controller, at the cost of control accuracy. β is the velocity error reward coefficient, a negative real number of small absolute value; the larger its absolute value, the higher the control accuracy of the training result, but the larger the fuel consumption. η is the rollover penalty coefficient, a negative real number with absolute value larger than those of α and β; introducing the rollover penalty helps prevent the attitude from exceeding its constraint values during landing. κ is the constant reward coefficient, a positive real number with absolute value larger than those of α and β; since r_fuel, r_vel and r_crash are all negative, adding the constant reward encourages the lander to keep exploring before a reasonable control strategy has been learned, avoiding premature episode termination and aiding training convergence;
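For illustration, a minimal Python sketch of this reward computation is given below; the coefficient values and thresholds are placeholder assumptions, not tuned values from the invention.

```python
import numpy as np

# Illustrative per-step reward for the soft-landing environment.
# Coefficients and limits below are placeholder assumptions.
ALPHA, BETA, ETA, KAPPA, LAMBDA = -0.001, -0.05, -10.0, 0.2, 100.0
PHI_LIM = THETA_LIM = PSI_LIM = np.deg2rad(30.0)
V_LIM, OMEGA_LIM = 2.0, np.deg2rad(5.0)

def reward(thrusts, v, v_ref, euler, omega, h, v_z):
    phi, theta, psi = euler
    r_fuel = ALPHA * np.sum(thrusts)                      # fuel consumption penalty
    r_vel = BETA * np.linalg.norm(v - v_ref)              # velocity tracking reward
    rollover = phi > PHI_LIM or theta > THETA_LIM or psi > PSI_LIM
    r_crash = ETA if rollover else 0.0                    # rollover penalty
    r_constant = KAPPA                                    # constant per-step reward
    landed = (h < 0 and v_z > 0 and np.linalg.norm(v) < V_LIM
              and phi < PHI_LIM and theta < THETA_LIM and psi < PSI_LIM
              and np.linalg.norm(omega) < OMEGA_LIM)
    r_goal = LAMBDA if landed else 0.0                    # successful soft-landing reward
    return r_fuel + r_vel + r_crash + r_constant + r_goal
```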
2) observed value
Design the observed value as
s = [δv_b, sin(φ), cos(φ), sin(θ), cos(θ), sin(ψ), cos(ψ), ω_t, Q]
where δv_b is the component of the difference between the actual velocity and the desired velocity in the body coordinate system, and Q is the attitude quaternion.
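For illustration, a minimal Python sketch assembling this observation vector is given below; the Euler-to-quaternion conversion assumes the zyx rotation order stated above.

```python
import numpy as np

# Illustrative construction of the observation vector from the lander state.
def euler_zyx_to_quat(phi, theta, psi):
    """Quaternion [w, x, y, z] for yaw-pitch-roll (psi, theta, phi), rotation order zyx."""
    cphi, sphi = np.cos(phi / 2), np.sin(phi / 2)
    cth, sth = np.cos(theta / 2), np.sin(theta / 2)
    cpsi, spsi = np.cos(psi / 2), np.sin(psi / 2)
    return np.array([
        cpsi * cth * cphi + spsi * sth * sphi,
        cpsi * cth * sphi - spsi * sth * cphi,
        cpsi * sth * cphi + spsi * cth * sphi,
        spsi * cth * cphi - cpsi * sth * sphi,
    ])

def observation(v_b, v_ref_b, euler, omega_t):
    phi, theta, psi = euler
    dv_b = v_b - v_ref_b                           # velocity tracking error in the body frame
    Q = euler_zyx_to_quat(phi, theta, psi)         # attitude quaternion
    return np.concatenate([
        dv_b,
        [np.sin(phi), np.cos(phi), np.sin(theta), np.cos(theta), np.sin(psi), np.cos(psi)],
        omega_t,
        Q,
    ])
```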
3) Action output
The action output by the agent is set as a_i ∈ [-1, 1] and is converted into a thrust output through the linear mapping
T_i = T_min + (a_i + 1)(T_max - T_min)/2
so that the resulting thrust output satisfies the constraint T_i ∈ [T_min, T_max].
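As a sketch of this mapping (the thrust bounds below are placeholder assumptions):

```python
import numpy as np

# Illustrative mapping from the agent action a_i in [-1, 1] to engine thrust in [T_min, T_max].
T_MIN, T_MAX = 400.0, 3000.0   # single-engine thrust range [N] (assumed)

def action_to_thrust(a):
    a = np.clip(np.asarray(a), -1.0, 1.0)
    return T_MIN + (a + 1.0) * (T_MAX - T_MIN) / 2.0

# action_to_thrust([-1.0, 0.0, 1.0]) -> [400., 1700., 3000.]
```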
4) Network design
According to the selected algorithm, the corresponding value function and policy networks are designed, including the depth, the width and the activation function of each network. In the invention, three hidden layers of width 200 with ReLU activation are sufficient to meet the training requirement.
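A minimal PyTorch sketch of such a network is given below; the input and output dimensions are illustrative assumptions, since in practice they follow from the observation vector length and the number of engines.

```python
import torch
import torch.nn as nn

# Illustrative three-hidden-layer MLP with width 200 and ReLU activations,
# usable as a policy or value-function backbone. Dimensions are assumptions.
class MLP(nn.Module):
    def __init__(self, in_dim, out_dim, width=200):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, out_dim),
        )

    def forward(self, x):
        return self.net(x)

# Example: deterministic policy head squashed to [-1, 1] for n = 4 engines (assumed).
policy = nn.Sequential(MLP(in_dim=16, out_dim=4), nn.Tanh())
action = policy(torch.zeros(1, 16))
```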
Step three: setting up simulation environment for training
Based on steps one and two, the environment dynamics model, the feedback interaction model, and the agent's policy and value function networks are constructed respectively, and the agent is trained using a deep learning framework.
Step four: testing landing control effect
First, the velocity tracking capability of the lander is tested: a desired velocity v_d = [v_dx, v_dy, v_dz] and different initial velocities v_0 = [v_x0, v_y0, v_z0] are set, and the trained controller is used to drive the velocity to the desired value.
Then a soft landing test is performed: initial positions, velocities and attitudes are generated randomly according to the initial conditions of the soft landing powered descent phase, the trained velocity controller is used for soft landing, and the training effect is evaluated through a number of targeting experiments.
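For illustration, a minimal Python sketch of such a Monte Carlo targeting evaluation is given below; the environment factory and controller interface (make_landing_env, controller.act, the info keys) are hypothetical placeholders standing in for the trained environment and policy.

```python
import numpy as np

# Illustrative Monte Carlo soft-landing evaluation loop; 100 episodes matches the
# targeting tests reported below. All names are hypothetical placeholders.
def evaluate(controller, make_landing_env, n_episodes=100, seed=0):
    rng = np.random.default_rng(seed)
    successes, landing_points = 0, []
    for _ in range(n_episodes):
        env = make_landing_env(rng)          # random initial position, velocity, attitude
        obs, done = env.reset(), False
        while not done:
            action = controller.act(obs)     # thrust commands in [-1, 1]
            obs, reward, done, info = env.step(action)
        successes += int(info["soft_landing"])
        landing_points.append(info["landing_xy"])
    return successes / n_episodes, np.array(landing_points)
```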
The invention can train the controller with the model-free DDPG, TD3 and SAC algorithms. First the policy network parameters θ, the value function network parameters φ and the experience replay pool D are initialized. Then the agent interacts with the environment: at each step, the agent observes the environment state s, outputs an action a according to the current policy and applies it to the environment; the environment state transitions to s' and the environment feeds back a reward r and an episode termination signal d; the experience tuple (s, a, s', r, d) is stored in D. When the update period is reached, a batch of experiences is randomly sampled from D and the network parameters are updated along the gradient of the constructed loss functions. These steps are repeated until the episode reward converges.
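A minimal, framework-agnostic Python sketch of this interaction-and-update loop is given below; a DDPG-style update is shown for concreteness (TD3 and SAC differ in their critic and actor losses), the environment is assumed to be any Gym-style object exposing reset()/step(), and all hyperparameters are placeholder assumptions.

```python
import copy
import random
from collections import deque

import numpy as np
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, width=200):
    """Three-hidden-layer MLP with ReLU activations, as described above."""
    return nn.Sequential(nn.Linear(in_dim, width), nn.ReLU(),
                         nn.Linear(width, width), nn.ReLU(),
                         nn.Linear(width, width), nn.ReLU(),
                         nn.Linear(width, out_dim))

def train(env, obs_dim, act_dim, episodes=10000, gamma=0.99, tau=0.005, batch=256):
    actor = nn.Sequential(mlp(obs_dim, act_dim), nn.Tanh())          # policy network (theta)
    critic = mlp(obs_dim + act_dim, 1)                               # value-function network (phi)
    actor_t, critic_t = copy.deepcopy(actor), copy.deepcopy(critic)  # target networks
    opt_a = torch.optim.Adam(actor.parameters(), lr=1e-4)
    opt_c = torch.optim.Adam(critic.parameters(), lr=1e-3)
    buffer = deque(maxlen=1_000_000)                                 # experience replay pool D

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            with torch.no_grad():
                a = actor(torch.as_tensor(s, dtype=torch.float32))
                a = (a + 0.1 * torch.randn_like(a)).clamp(-1.0, 1.0)  # exploration noise
            s2, r, done, _ = env.step(a.numpy())
            buffer.append((s, a.numpy(), r, s2, float(done)))         # store (s, a, r, s', d)
            s = s2
            if len(buffer) < batch:
                continue
            S, A, R, S2, D = (torch.as_tensor(np.array(x), dtype=torch.float32)
                              for x in zip(*random.sample(buffer, batch)))
            with torch.no_grad():                                     # TD target
                y = R.unsqueeze(1) + gamma * (1 - D).unsqueeze(1) * critic_t(
                    torch.cat([S2, actor_t(S2)], dim=1))
            critic_loss = ((critic(torch.cat([S, A], dim=1)) - y) ** 2).mean()
            opt_c.zero_grad()
            critic_loss.backward()
            opt_c.step()
            actor_loss = -critic(torch.cat([S, actor(S)], dim=1)).mean()
            opt_a.zero_grad()
            actor_loss.backward()
            opt_a.step()
            for net, tgt in ((actor, actor_t), (critic, critic_t)):   # soft target update
                for p, pt in zip(net.parameters(), tgt.parameters()):
                    pt.data.mul_(1 - tau).add_(tau * p.data)
```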
The following example is used to demonstrate the beneficial effects of the present invention.
Example:
1) Experimental environment settings
The example selects DDPG, TD3 and SAC for training and testing; the value function network of the three algorithms is designed as in Fig. 1, the policy network structure of DDPG and TD3 as in Fig. 2, and the SAC policy network structure as in Fig. 3. The neural networks are built on PyTorch and trained using Python.
The software environment for the simulation tests of all the algorithms is Ubuntu 16.04, and the hardware environment is an Intel(R) Core(TM) i5-9300H CPU + NVIDIA GeForce GTX 1660 Ti + 16.0 GB RAM.
2) Results and analysis of the experiments
The training process curves of the DDPG, TD3 and SAC algorithms are shown in Fig. 4, Fig. 5 and Fig. 6, respectively. The DDPG episode reward begins to rise at about 10,000 episodes, improves markedly between 10,000 and 20,000 episodes, then gradually and stably converges to about 300, and training finishes at about 40,000 episodes. In terms of the average reward the training tends to be stable, but the per-episode reward fluctuates between 100 and 400 and is quite unstable. With TD3, the agent's reward improves significantly after about 700 episodes of training and then essentially stabilizes at 450 after about 5,000 episodes; compared with DDPG it is more stable and converges faster. SAC shows clear improvements at about 3,000 and 5,000 episodes and finally stabilizes after 9,500 episodes of training, with the reward converging to 500; compared with TD3, the per-episode reward of SAC once finally stabilized is higher, close to 600.
The DDPG, TD3 and SAC velocity test curves are shown in Fig. 7, Fig. 8 and Fig. 9, respectively. Starting from a large initial vertical velocity error, DDPG can bring the velocity error to within 2 m/s but cannot stabilize it and oscillates continuously; TD3 can control the velocity close to the desired value with an accuracy of about 1 m/s; SAC achieves the highest accuracy, within 0.1 m/s.
The two-dimensional distributions of the DDPG, TD3 and SAC landing sites are shown in Fig. 10, Fig. 11 and Fig. 12, respectively. Over 100 targeting experiments, the landing success rates of DDPG, TD3 and SAC were 74%, 92% and 96%, respectively. The landing-point accuracies of TD3 and SAC are significantly better than that of DDPG.
The above shows that the method can realize planetary soft landing control and provides a new approach for research on planetary soft landing trajectory optimization and control.
The present invention is capable of other embodiments and its several details are capable of modifications in various obvious respects, all without departing from the spirit and scope of the present invention.

Claims (10)

1. A planetary soft landing control method based on reinforcement learning, characterized in that the method comprises the following steps:
Step one: establishing a six-degree-of-freedom dynamic model of the lander's powered descent phase based on the lander mass m, the inertia matrix I and the engine power configuration; the engine power configuration comprises the number of engines n, the installation position of each engine and the single-engine thrust range T_i ∈ [T_min, T_max], where T_i is the thrust of a single engine and i is the engine index, i = 1, 2, ..., n;
Step two: based on the dynamics model established in the first step, the thrust of each engine is used as control output, information reflecting the state of the lander is used as an observation vector, a reward function is designed for evaluating the control performance, and a corresponding neural network is designed according to different reinforcement learning algorithm frameworks;
step three: building a numerical simulation environment based on the dynamics model built in the step one and the interactive environment design in the step two, and respectively training by utilizing different reinforcement learning algorithm frames to obtain a soft landing controller as an alternative controller selected in subsequent tests;
step four: and based on the soft landing controller obtained in the step three, comprehensively evaluating the speed tracking capability and the soft landing precision through a speed tracking performance test and a power descent section soft landing test respectively, and selecting the controller with the optimal performance for planetary soft landing control according to the test effect.
2. The planetary soft landing control method based on reinforcement learning of claim 1, wherein in step one the six-degree-of-freedom dynamic model of the powered descent phase comprises a centroid translational dynamic model and an attitude dynamic model, in the standard body-frame Newton-Euler form:
dv_b/dt = F_b/m - ω_t × v_b
dω_t/dt = I^(-1)·(M_b - ω_t × I·ω_t)
where ω_t and v_b are the component forms of the attitude angular velocity and the velocity in the body coordinate system, F_b and M_b are respectively the resultant force and resultant moment vectors of the external forces acting on the lander, and I is the inertia matrix of the lander;
The resultant external force and moment vectors F_b and M_b acting on the lander are expressed as follows:
F_b = F_T + m·g_b + F_N
M_b = M_T + M_N
where F_T and M_T are respectively the resultant force and resultant moment vectors generated by the engines, F_N and M_N are respectively the resultant force and resultant moment caused by external disturbances, g_b is the component form of the planetary surface gravitational acceleration in the lander body coordinate system, and m is the mass of the lander;
The aerodynamic force during the powered descent phase is small compared with the engine thrust and the planetary gravity, and is treated as a disturbance, embodied in F_N and M_N as
F_N = F_wind + δF
M_N = M_wind + δM
where F_wind and M_wind are respectively the resultant force and resultant moment caused by the atmosphere, and δF and δM are respectively the resultant force and resultant moment caused by unmodeled disturbances, which include engine installation deviation and thrust magnitude fluctuation.
3. The planetary soft landing control method based on reinforcement learning of claim 2, wherein in step two the reward function takes the form:
r = r_fuel + r_vel + r_crash + r_constant + r_goal
where r_fuel is the fuel consumption penalty, r_vel is the velocity tracking reward, r_crash is the landing rollover penalty, r_constant is a constant reward, and r_goal is the reward for a successful soft landing.
4. The planetary soft landing control method based on reinforcement learning of claim 3, wherein the fuel consumption penalty takes the form
r_fuel = α · Σ_{i=1..n} T_i
where α is the fuel consumption penalty coefficient, a negative real number of small absolute value; the larger its absolute value, the lower the fuel consumption of the trained controller, at the cost of control accuracy.
5. The planetary soft landing control method based on reinforcement learning of claim 3, wherein the velocity tracking reward takes the form
r_vel = β · ||v - v_ref||
where β is the velocity error reward coefficient; the larger its absolute value, the higher the control accuracy of the training result, but the larger the fuel consumption; v_ref is the reference velocity to be tracked during the landing process, generated by the guidance law.
6. The planetary soft landing control method based on reinforcement learning of claim 3, wherein the landing rollover penalty takes the form
r_crash = η · (φ > φ_lim or θ > θ_lim or ψ > ψ_lim)
where η is the rollover penalty coefficient; introducing the rollover penalty helps prevent the attitude from exceeding its constraint values during landing; ψ, θ and φ are respectively the yaw, pitch and roll angles, with rotation order zyx, describing the attitude of the lander body coordinate system relative to the navigation coordinate system; ψ_lim, θ_lim and φ_lim are the upper bound values of ψ, θ and φ, respectively.
7. The planetary soft landing control method based on reinforcement learning of claim 3, wherein the reward for a successful soft landing takes the form
r_goal = λ · (h < 0 and v_z > 0 and ||v|| < v_lim and φ < φ_lim and θ < θ_lim and ψ < ψ_lim and ||ω|| < ω_lim)
where λ is the soft landing reward coefficient, h is the lander altitude, v_z is the velocity component along the altitude direction, v_lim is the soft landing velocity threshold, and ω_lim is the soft landing attitude angular velocity threshold.
8. The planetary soft landing control method based on reinforcement learning of claim 1, wherein in step two the observation vector takes the form:
s = [δv_b, sin(φ), cos(φ), sin(θ), cos(θ), sin(ψ), cos(ψ), ω_t, Q]
where δv_b is the component of the difference between the actual velocity and the desired velocity in the body coordinate system, and Q is the attitude quaternion;
in step two, the thrust output range is a_i ∈ [-1, 1]; through the linear mapping
T_i = T_min + (a_i + 1)(T_max - T_min)/2
the thrust output T_i is obtained, which satisfies the thrust constraint T_i ∈ [T_min, T_max].
9. The planetary soft landing control method based on reinforcement learning of claim 1, wherein the state space and the action space of the reinforcement learning algorithm framework in the third step are continuous.
10. A computer-readable storage medium characterized by: the computer readable storage medium stores a computer program configured to, when invoked by a processor, implement the steps of the reinforcement learning-based planetary soft landing control method of any of claims 1-9.
CN202111196380.7A 2021-10-14 2021-10-14 Planetary soft landing control method and system based on reinforcement learning and storage medium Active CN113821057B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111196380.7A CN113821057B (en) 2021-10-14 2021-10-14 Planetary soft landing control method and system based on reinforcement learning and storage medium


Publications (2)

Publication Number Publication Date
CN113821057A 2021-12-21
CN113821057B 2023-05-30

Family

ID=78916535

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111196380.7A Active CN113821057B (en) 2021-10-14 2021-10-14 Planetary soft landing control method and system based on reinforcement learning and storage medium

Country Status (1)

Country Link
CN (1) CN113821057B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117631547B (en) * 2024-01-26 2024-04-26 哈尔滨工业大学 Landing control method for quadruped robot under irregular weak gravitational field of small celestial body

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107065571A (en) * 2017-06-06 2017-08-18 上海航天控制技术研究所 A kind of objects outside Earth soft landing Guidance and control method based on machine learning algorithm
CN107656439A (en) * 2017-11-13 2018-02-02 浙江大学 A kind of moon detector in flexible landing optimal control system based on Self Adaptive Control grid
CN109212976A (en) * 2018-11-20 2019-01-15 北京理工大学 The small feature loss soft landing robust trajectory tracking control method of input-bound
CN109733415A (en) * 2019-01-08 2019-05-10 同济大学 A kind of automatic Pilot following-speed model that personalizes based on deeply study
CN111460650A (en) * 2020-03-31 2020-07-28 北京航空航天大学 Unmanned aerial vehicle end-to-end control method based on deep reinforcement learning
US20200272625A1 (en) * 2019-02-22 2020-08-27 National Geographic Society Platform and method for evaluating, exploring, monitoring and predicting the status of regions of the planet through time
WO2021125395A1 (en) * 2019-12-18 2021-06-24 한국항공우주연구원 Method for determining specific area for optical navigation on basis of artificial neural network, on-board map generation device, and method for determining direction of lander
CN113408796A (en) * 2021-06-04 2021-09-17 北京理工大学 Deep space probe soft landing path planning method for multitask deep reinforcement learning


Also Published As

Publication number Publication date
CN113821057B (en) 2023-05-30

Similar Documents

Publication Publication Date Title
CN110806759B (en) Aircraft route tracking method based on deep reinforcement learning
CN110850719B (en) Spatial non-cooperative target parameter self-tuning tracking method based on reinforcement learning
Wang et al. Nonlinear aeroelastic control of very flexible aircraft using model updating
CN112462792A (en) Underwater robot motion control method based on Actor-Critic algorithm
CN107085435A (en) Hypersonic aircraft attitude harmony control method based on coupling analysis
Kapnopoulos et al. A cooperative particle swarm optimization approach for tuning an MPC-based quadrotor trajectory tracking scheme
CN112859889A (en) Autonomous underwater robot control method and system based on self-adaptive dynamic planning
CN113821057B (en) Planetary soft landing control method and system based on reinforcement learning and storage medium
CN114200950B (en) Flight attitude control method
Dong et al. Trial input method and own-aircraft state prediction in autonomous air combat
CN114637312A (en) Unmanned aerial vehicle energy-saving flight control method and system based on intelligent deformation decision
Wang et al. A new spacecraft attitude stabilization mechanism using deep reinforcement learning method
CN116820134A (en) Unmanned aerial vehicle formation maintaining control method based on deep reinforcement learning
Chen et al. An experimental study of the wire-driven compliant robotic fish
CN116697829A (en) Rocket landing guidance method and system based on deep reinforcement learning
CN117289709A (en) High-ultrasonic-speed appearance-changing aircraft attitude control method based on deep reinforcement learning
CN116620566A (en) Non-cooperative target attached multi-node intelligent cooperative guidance method
CN116068894A (en) Rocket recovery guidance method based on double-layer reinforcement learning
Wang et al. Deep learning based missile trajectory prediction
CN116360258A (en) Hypersonic deformed aircraft anti-interference control method based on fixed time convergence
CN113050420B (en) AUV path tracking method and system based on S-plane control and TD3
CN113418674A (en) Wind tunnel track capture test method with three-degree-of-freedom motion of primary model
Hong et al. Control of a fly-mimicking flyer in complex flow using deep reinforcement learning
Breese et al. Physics-Based Neural Networks for Modeling & Control of Aerial Vehicles
Cheng et al. Cross-cycle iterative unmanned aerial vehicle reentry guidance based on reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant