CN113821057A - Planetary soft landing control method and system based on reinforcement learning and storage medium - Google Patents
- Publication number
- CN113821057A (application number CN202111196380.7A)
- Authority
- CN
- China
- Prior art keywords
- soft landing
- reinforcement learning
- lander
- lim
- planetary
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000000034 method Methods 0.000 title claims abstract description 46
- 230000002787 reinforcement Effects 0.000 title claims abstract description 36
- 238000012549 training Methods 0.000 claims abstract description 36
- 238000004422 calculation algorithm Methods 0.000 claims abstract description 31
- 238000012360 testing method Methods 0.000 claims abstract description 21
- 230000006870 function Effects 0.000 claims abstract description 20
- 230000009471 action Effects 0.000 claims abstract description 7
- 230000000694 effects Effects 0.000 claims abstract description 6
- 238000013528 artificial neural network Methods 0.000 claims abstract description 5
- 238000004088 simulation Methods 0.000 claims abstract description 5
- 230000002452 interceptive effect Effects 0.000 claims abstract description 4
- 239000000446 fuel Substances 0.000 claims description 24
- 230000008569 process Effects 0.000 claims description 12
- 239000013598 vector Substances 0.000 claims description 12
- 238000013461 design Methods 0.000 claims description 8
- 230000001133 acceleration Effects 0.000 claims description 5
- 239000011159 matrix material Substances 0.000 claims description 5
- 238000009434 installation Methods 0.000 claims description 4
- 230000008859 change Effects 0.000 claims description 3
- 230000002349 favourable effect Effects 0.000 claims description 3
- 230000005484 gravity Effects 0.000 claims description 3
- 238000004590 computer program Methods 0.000 claims description 2
- 238000011056 performance test Methods 0.000 claims description 2
- 238000005457 optimization Methods 0.000 abstract description 5
- 239000003795 chemical substances by application Substances 0.000 description 11
- 238000009826 distribution Methods 0.000 description 4
- 238000002474 experimental method Methods 0.000 description 3
- 230000004913 activation Effects 0.000 description 2
- 230000009286 beneficial effect Effects 0.000 description 2
- 238000011217 control strategy Methods 0.000 description 2
- 238000001514 detection method Methods 0.000 description 2
- 230000014509 gene expression Effects 0.000 description 2
- 230000003993 interaction Effects 0.000 description 2
- 238000012545 processing Methods 0.000 description 2
- 238000011160 research Methods 0.000 description 2
- 230000008901 benefit Effects 0.000 description 1
- 238000004364 calculation method Methods 0.000 description 1
- 238000013135 deep learning Methods 0.000 description 1
- 230000010354 integration Effects 0.000 description 1
- 238000010801 machine learning Methods 0.000 description 1
- 238000013507 mapping Methods 0.000 description 1
- 238000013508 migration Methods 0.000 description 1
- 230000005012 migration Effects 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 230000010355 oscillation Effects 0.000 description 1
- 238000005070 sampling Methods 0.000 description 1
- 230000008685 targeting Effects 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/10—Simultaneous control of position or course in three dimensions
- G05D1/101—Simultaneous control of position or course in three dimensions specially adapted for aircraft
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T90/00—Enabling technologies or technologies with a potential or indirect contribution to GHG emissions mitigation
Landscapes
- Engineering & Computer Science (AREA)
- Aviation & Aerospace Engineering (AREA)
- Radar, Positioning & Navigation (AREA)
- Remote Sensing (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Automation & Control Theory (AREA)
- Feedback Control In General (AREA)
Abstract
A planetary soft landing control method, system, and storage medium based on reinforcement learning relate to the field of soft landing trajectory optimization and control, and aim to solve the problems that existing planetary soft landing control cannot guarantee an optimal guidance law, relies on complex models, and is difficult to train to convergence. The invention comprises the following steps. Step one: establish a six-degree-of-freedom dynamics model of the lander's powered descent phase based on lander characteristics such as hardware configuration and engine power configuration. Step two: design the reward function, observation space, action space, and neural network structure of the training interaction environment. Step three: build a numerical simulation environment and train soft landing controllers with reinforcement learning algorithms. Step four: evaluate the trained controllers through a speed tracking test and a soft landing test. Executing steps one and two yields the soft landing reinforcement learning environment model, with which the agent interacts to obtain training data. Finally, step four selects the best-performing training result as the optimal soft landing controller. The method is used for soft landing trajectory optimization and control.
Description
Technical Field
The invention relates to a planetary soft landing control method based on reinforcement learning, and belongs to the technical field of soft landing trajectory optimization and control and the technical field of deep space exploration.
Background
Reinforcement learning is a family of machine learning algorithms in which an agent learns by trial and error, assessing the quality of each action through the reward obtained by interacting with the environment, with the goal of maximizing the agent's cumulative reward. Such algorithms can generally be classified as model-based or model-free.
Patent document CN110466805B discloses an asteroid landing guidance method based on optimized guidance parameters. It establishes the dynamics equations of the probe in a landing-point coordinate system; analyzes the probe's motion along the three directions of that coordinate system to obtain relations among position, speed, acceleration, and time in each direction; establishes a functional relation between the guidance parameters and the initial state based on the probe's motion relations, solving the coefficients of that relation by parameter estimation to obtain an optimized selection formula for the guidance parameters; substitutes the optimized selection formulas for the three directions into the corresponding position-speed-acceleration-time relations and combines them with the dynamics equations to obtain an asteroid landing guidance law based on the optimized guidance parameters in all three directions; and performs asteroid soft landing guidance with that law. The method can improve the efficiency of asteroid landing guidance, but it is a model-based landing guidance design algorithm that requires complex processing of the model, and it does not address how the probe executes the control.
In addition, researchers have studied the planetary soft landing phase based on reinforcement learning, integrating guidance and control so that an end-to-end planetary soft landing algorithm can be obtained through training. However, because guidance and control are integrated in such algorithms, the guidance law is designed only indirectly through the reward function, so its optimality cannot be guaranteed; that is, fuel consumption cannot be minimized. Moreover, the integration of guidance and control makes the model highly complex, training hard to converge, and the parameter selection and debugging cycle long.
Disclosure of Invention
The technical problem to be solved by the invention is as follows:
the invention provides a planetary soft landing control method and system based on reinforcement learning and a storage medium, aiming at solving the problems that the existing planetary soft landing control can not ensure optimal guidance law, has a complex model and is difficult to train and converge and the like and aiming at realizing more autonomous planetary soft landing.
The technical scheme adopted by the invention for solving the technical problems is as follows:
a planetary soft landing control method based on reinforcement learning comprises the following steps:
Step one: establishing a six-degree-of-freedom dynamics model of the lander's powered descent phase based on the lander mass m, the inertia matrix I, and the engine power configuration; the engine power configuration comprises the number of engines n, the installation position of each engine, and the thrust range of a single engine, T_i ∈ [T_min, T_max], where T_i is the thrust of a single engine and i = 1, 2, …, n is the engine index;
Step two: based on the dynamics model established in the first step, the thrust of each engine is used as control output, information reflecting the state of the lander is used as an observation vector, a reward function is designed for evaluating the control performance, and a corresponding neural network is designed according to different reinforcement learning algorithm frameworks;
Step three: building a numerical simulation environment based on the dynamics model established in step one and the interaction environment designed in step two, and training with different reinforcement learning algorithm frameworks to obtain soft landing controllers as candidates for selection in subsequent tests;
step four: and based on the soft landing controller obtained in the step three, comprehensively evaluating the speed tracking capability and the soft landing precision through a speed tracking performance test and a power descent section soft landing test respectively, and selecting the controller with the optimal performance for planetary soft landing control according to the test effect.
The six-degree-of-freedom dynamics model of the powered descent phase in step one comprises a centroid translational dynamics model and an attitude dynamics model:

m·(dv_b/dt) = F_b − m·(ω_t × v_b)
I·(dω_t/dt) = M_b − ω_t × (I·ω_t)

where ω_t and v_b are the component forms of the attitude angular velocity and the velocity in the body coordinate system, F_b and M_b are respectively the resultant force and resultant moment vectors of the external forces on the lander, and I is the inertia matrix of the lander;
The resultant external force and moment vectors F_b and M_b on the lander are expressed as follows:

F_b = F_T + m·g_b + F_N
M_b = M_T + M_N

where F_T and M_T are respectively the resultant force and moment vectors generated by the engines, F_N and M_N are respectively the resultant force and moment caused by external disturbances, g_b is the component form of the planetary surface gravitational acceleration in the lander body coordinate system, and m is the mass of the lander;
The aerodynamic force during the powered descent phase is small compared with the engine thrust and the planetary gravity, and is treated as a disturbance embodied in F_N and M_N:

F_N = F_wind + δF
M_N = M_wind + δM

where F_wind and M_wind are respectively the resultant force and moment caused by air, and δF and δM are respectively the resultant force and moment caused by unmodeled disturbances, which include engine installation deviation and thrust magnitude fluctuation.
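For illustration, the disturbance model above can be sketched as a sampler. The uniform distributions and the magnitude bounds (in newtons) are assumptions; the patent does not specify how F_wind, δF are distributed:

```python
import random

def sample_disturbance(rng, f_wind_max=50.0, df_max=20.0):
    """Draw one sample of F_N = F_wind + dF: an aerodynamic wind force plus
    unmodeled effects (engine installation deviation, thrust fluctuation).
    Distributions and bounds are hypothetical."""
    f_wind = [rng.uniform(-f_wind_max, f_wind_max) for _ in range(3)]  # F_wind
    d_f = [rng.uniform(-df_max, df_max) for _ in range(3)]             # dF
    return [w + d for w, d in zip(f_wind, d_f)]                        # F_N
```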
The reward function in step two takes the form:

r = r_fuel + r_vel + r_crash + r_constant + r_goal

where r_fuel is the fuel consumption penalty, r_vel is the speed tracking reward, r_crash is the landing rollover penalty, r_constant is a constant reward, and r_goal is the reward for a successful soft landing.
The fuel consumption penalty r_fuel is scaled by the fuel consumption penalty coefficient α, a negative real number of small absolute value. The larger |α| is, the lower the fuel consumption of the trained controller, at the cost of control accuracy.
The speed tracking reward takes the form

r_vel = β·||v − v_ref||

where β is the speed error reward coefficient, a negative real number of small absolute value; the larger |β| is, the higher the control accuracy of the training result, at the cost of higher fuel consumption. v_ref is the reference speed to be tracked during the landing process, generated according to a guidance law.
The landing rollover penalty takes the form

r_crash = η·(φ > φ_lim or θ > θ_lim or ψ > ψ_lim)

where η is the rollover penalty coefficient, a negative real number whose absolute value is larger than those of α and β; introducing the rollover penalty helps keep the attitude from exceeding its constraint values during landing. ψ, θ, and φ are respectively the yaw, pitch, and roll angles, with rotation sequence zyx, representing the attitude of the lander body coordinate system relative to the navigation coordinate system (see Fig. 1); ψ_lim, θ_lim, and φ_lim are the upper bound values of ψ, θ, and φ, respectively.
The constant reward r_constant is a positive real number whose absolute value is greater than those of α and β. Since r_fuel, r_vel, and r_crash are all negative, adding the constant reward encourages the lander to keep exploring before a reasonable control policy has been learned, avoiding premature episode termination and aiding training convergence.
The reward for a successful soft landing takes the form

r_goal = λ·(h < 0 and v_z > 0 and ||v|| < v_lim and φ < φ_lim and θ < θ_lim and ψ < ψ_lim and ||ω|| < ω_lim)

where λ is the soft landing reward coefficient, h is the lander height, v_z is the velocity component in the height direction, v_lim is the soft landing speed threshold, and ω_lim is the soft landing attitude angular velocity threshold.
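Putting the five reward terms together, a minimal sketch follows. The coefficient values, the attitude limits, and the proportional-to-total-thrust form of the fuel penalty (whose exact formula appears only as an image in the source) are assumptions:

```python
import math

# Illustrative values; the patent only constrains signs and relative magnitudes:
# alpha, beta small negative; eta negative with larger magnitude;
# the constant reward positive; lambda a large soft-landing bonus.
ALPHA, BETA, ETA, KAPPA, LAM = -0.001, -0.05, -10.0, 0.1, 100.0
PHI_LIM = THETA_LIM = PSI_LIM = math.radians(15.0)
V_LIM, OMEGA_LIM = 1.0, 0.1

def _norm(v):
    return math.sqrt(sum(x * x for x in v))

def reward(thrusts, v, v_ref, phi, theta, psi, h, v_z, omega):
    """Sum the five terms r = r_fuel + r_vel + r_crash + r_constant + r_goal."""
    r_fuel = ALPHA * sum(thrusts)                            # fuel consumption penalty
    r_vel = BETA * _norm([a - b for a, b in zip(v, v_ref)])  # speed tracking reward
    rollover = phi > PHI_LIM or theta > THETA_LIM or psi > PSI_LIM
    r_crash = ETA if rollover else 0.0                       # landing rollover penalty
    r_constant = KAPPA                                       # constant exploration reward
    landed = (h < 0 and v_z > 0 and _norm(v) < V_LIM and not rollover
              and _norm(omega) < OMEGA_LIM)
    r_goal = LAM if landed else 0.0                          # successful soft landing bonus
    return r_fuel + r_vel + r_crash + r_constant + r_goal
```

A descending state that meets all the landing conditions collects the large λ bonus on top of the per-step terms, which is what drives the policy toward a soft touchdown.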
The observation vector in step two takes the form:

s = [δv_b, sin(φ), cos(φ), sin(θ), cos(θ), sin(ψ), cos(ψ), ω_t, Q]

where δv_b is the component of the difference between the true speed and the desired speed in the body coordinate system, and Q is the attitude quaternion;
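Assembling that observation can be sketched as below; the argument order and the use of 3-vectors plus a 4-element quaternion are assumptions consistent with the text. Encoding each angle as a (sin, cos) pair keeps the network input continuous across the ±180° wrap-around:

```python
import math

def observation(v_b, v_ref_b, omega, quat, phi, theta, psi):
    """Assemble s = [dv_b, sin/cos of each attitude angle, omega_t, Q]."""
    dv = [a - b for a, b in zip(v_b, v_ref_b)]   # body-frame velocity error
    trig = []
    for ang in (phi, theta, psi):                # (sin, cos) pair per angle
        trig += [math.sin(ang), math.cos(ang)]
    return dv + trig + list(omega) + list(quat)  # 3 + 6 + 3 + 4 = 16 entries
```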
The thrust output range in step two is

a_i ∈ [−1, 1]

which is converted by the linear transformation

T_i = T_min + (a_i + 1)·(T_max − T_min)/2

into the thrust output T_i, satisfying the thrust constraint T_i ∈ [T_min, T_max].
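The linear action-to-thrust transformation can be sketched directly; the thrust bounds below are hypothetical values, and the clipping step is an assumption for safety against out-of-range actions:

```python
T_MIN, T_MAX = 400.0, 3000.0   # hypothetical single-engine thrust bounds, N

def action_to_thrust(a_i):
    """Map a policy action a_i in [-1, 1] linearly onto [T_MIN, T_MAX]:
    T_i = T_min + (a_i + 1) * (T_max - T_min) / 2."""
    a_i = max(-1.0, min(1.0, a_i))              # clip to the valid action range
    return T_MIN + 0.5 * (a_i + 1.0) * (T_MAX - T_MIN)
```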
In step three, the state spaces and action spaces of the reinforcement learning algorithm frameworks are continuous.
A planetary soft landing control system based on reinforcement learning executes the steps in the planetary soft landing control method based on reinforcement learning during operation.
A computer-readable storage medium characterized by: the computer readable storage medium stores a computer program configured to, when invoked by a processor, implement the steps of the reinforcement learning-based planetary soft landing control method described above.
The invention has at least the following beneficial technical effects:
the planet soft landing control method based on reinforcement learning provided by the invention realizes more autonomous planet soft landing. Firstly, a six-degree-of-freedom dynamic model of a soft landing power descent section is established according to the configuration of a lander and the power attribute of an engine, observation information, environment reward feedback, action output of an intelligent agent and network results of the landing controller are designed according to the problem characteristics, an enhanced learning environment model is set up for training, the training results are tested, and the planetary soft landing controller based on data driving is realized.
Planetary soft landing is a precondition for planetary surface exploration missions; an accurate soft landing makes it possible to examine high-value targets while avoiding damage to the scientific instruments. Reinforcement-learning-based planetary soft landing trains the lander controller through soft landing interaction, realizing accurate planetary soft landing. Compared with traditional model-based soft landing guidance and control algorithms, reinforcement-learning-based control has the following advantages: 1) training can proceed without a model, so the designer need not manipulate the model when designing the controller; control performance is evaluated through the reward function, which indirectly guides the controller to learn and optimizes control performance; 2) the strong fitting capability of deep neural networks handles the strong nonlinearity of the soft landing planning and control problem well; 3) reinforcement learning is an end-to-end algorithm that directly perceives the landing environment state and outputs engine thrust commands, requiring no offline pre-computation of the landing trajectory, and therefore offers strong real-time performance.
The reinforcement-learning-based planetary soft landing control algorithm provided by the invention tracks the desired speed well during the soft landing process, realizing planetary soft landing control. At the same time, rewarding reference-speed tracking effectively mitigates the sparse-reward problem during training and greatly improves the success rate of training convergence. Learning to track the desired speed, rather than a specific landing guidance signal, also improves the transferability of the training results. To realize more autonomous planetary soft landing, the invention optimizes the soft landing controller from experience through reinforcement learning's trial-and-error mode, avoiding manipulation of the lander dynamics model and realizing model-free soft landing control.
Drawings
Fig. 1 is a value function network structure.
Fig. 2 is a DDPG and TD3 algorithm policy network structure.
Fig. 3 is a SAC algorithm policy network structure.
Figure 4 is a DDPG algorithm training process reward variation curve.
FIG. 5 is a reward variation curve of the TD3 algorithm training process.
Fig. 6 is a change curve of reward in the SAC algorithm training process.
Figure 7 is a DDPG algorithm speed control test curve.
FIG. 8 is a TD3 algorithm speed control test curve.
Fig. 9 is a SAC algorithm speed control test curve.
FIG. 10 is a DDPG algorithm soft landing test landing site distribution.
FIG. 11 is a TD3 algorithm soft landing test landing site distribution.
FIG. 12 is a SAC algorithm soft landing test landing site distribution.
Detailed Description
As shown in fig. 1 to 12, the planetary soft landing control method based on reinforcement learning according to the present embodiment includes the following steps:
the method comprises the following steps: lander power descending section six-freedom-degree dynamic model establishment
In this embodiment, taking a Mars soft landing as an example, the centroid translational dynamics and attitude dynamics are established as

m·(dv_b/dt) = F_b − m·(ω_t × v_b)
I·(dω_t/dt) = M_b − ω_t × (I·ω_t)

where ω_t and v_b are respectively the components of the attitude angular velocity and the velocity in the body coordinate system, m is the mass of the lander, I is the rotational inertia matrix of the lander, and F_b and M_b are respectively the resultant force and resultant moment vectors of the external forces on the lander:
F_b = F_T + m·g_b + F_N
M_b = M_T + M_N

where F_T and M_T are respectively the resultant force and moment vectors generated by the engines, F_N and M_N are respectively the resultant force and moment caused by external disturbances, and g_b is the component form of the Martian surface gravitational acceleration in the lander body coordinate system.
According to this model, integrating with the lander's engine thrust outputs yields the lander state at each moment, s = [x y z v_x v_y v_z φ θ ψ p q r], where x, y, z are the three-axis positions, v_x, v_y, v_z are the three-axis velocities, φ, θ, ψ are the attitude angles with rotation sequence zyx, and p, q, r are the three-axis attitude angular velocities.
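One integration step of the translational part can be sketched as below. The body-frame form m·dv_b/dt = F_b − m·(ω × v_b) is a common convention assumed here (the patent's exact equations appear only as images), and a real simulator would use a higher-order integrator such as RK4 rather than explicit Euler:

```python
def _cross(a, b):
    """Cross product of two 3-vectors."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def euler_step_velocity(v_b, omega, f_b, m, dt):
    """One explicit-Euler step of body-frame translational dynamics:
    dv_b/dt = F_b/m - omega x v_b."""
    w_cross_v = _cross(omega, v_b)
    acc = [f_b[i] / m - w_cross_v[i] for i in range(3)]
    return [v_b[i] + dt * acc[i] for i in range(3)]
```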
Step two: design of an interactive environment
1) Reward function
The reward function is in the form of
r=rfuel+rvel+rcrash+rconstant+rgoal
where r_fuel, r_vel, r_crash, r_constant, and r_goal are respectively the fuel consumption penalty, the speed tracking reward, the rollover penalty, the constant reward, and the soft landing reward, calculated as

r_vel = β·||v − v_ref||
r_crash = η·(φ > φ_lim or θ > θ_lim or ψ > ψ_lim)
r_constant = κ
r_goal = λ·(h < 0 and v_z > 0 and ||v|| < v_lim and φ < φ_lim and θ < θ_lim and ψ < ψ_lim and ||ω|| < ω_lim)
where α is the fuel consumption penalty coefficient, a negative real number of small absolute value (the larger |α| is, the lower the fuel consumption of the trained controller, at the cost of control accuracy); β is the speed error reward coefficient, a negative real number of small absolute value (the larger |β| is, the higher the control accuracy of the training result, at the cost of higher fuel consumption); η is the rollover penalty coefficient, a negative real number whose absolute value is larger than those of α and β, and introducing the rollover penalty helps keep the attitude from exceeding its constraint values during landing; κ is the constant reward coefficient, a positive real number whose absolute value is greater than those of α and β. Since r_fuel, r_vel, and r_crash are all negative, adding the constant reward encourages the lander to keep exploring before a reasonable control policy has been learned, avoiding premature episode termination and aiding training convergence;
2) observed value
The observation is designed as

s = [δv_b, sin(φ), cos(φ), sin(θ), cos(θ), sin(ψ), cos(ψ), ω_t, Q]

where δv_b is the component of the difference between the true speed and the desired speed in the body coordinate system, and Q is the attitude quaternion.
3) Motion output
The agent's output action a_i ∈ [−1, 1] is converted into a thrust output through the linear mapping

T_i = T_min + (a_i + 1)·(T_max − T_min)/2

so that the resulting thrust output satisfies the constraint T_i ∈ [T_min, T_max].
4) Network design
According to the selected algorithm, the corresponding value function and policy networks are designed, including the depth, width, and activation function of each network. In the present invention, three hidden layers with ReLU activation and a width of 200 meet the training requirements.
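A sketch of that policy network in plain NumPy follows. The three hidden ReLU layers of width 200 come from the text; the output dimension of 4 engines, the initialization scale, and the tanh output activation (consistent with the a_i ∈ [−1, 1] action range) are assumptions:

```python
import numpy as np

def init_policy(obs_dim=16, act_dim=4, width=200, seed=0):
    """Parameters for an obs_dim -> 200 -> 200 -> 200 -> act_dim policy network."""
    rng = np.random.default_rng(seed)
    dims = [obs_dim, width, width, width, act_dim]
    ws = [rng.normal(0.0, 0.1, (dims[i + 1], dims[i])) for i in range(len(dims) - 1)]
    bs = [np.zeros(dims[i + 1]) for i in range(len(dims) - 1)]
    return ws, bs

def policy_forward(x, ws, bs):
    """Forward pass: ReLU hidden layers, tanh output so actions stay in [-1, 1]."""
    h = np.asarray(x, dtype=float)
    for w, b in zip(ws[:-1], bs[:-1]):
        h = np.maximum(w @ h + b, 0.0)   # ReLU hidden layer
    return np.tanh(ws[-1] @ h + bs[-1])  # bounded action output
```

The document itself builds these networks with PyTorch; the NumPy version is only meant to make the layer structure concrete.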
Step three: setting up simulation environment for training
Based on steps one and two, the environment dynamics model, the feedback interaction model, and the agent's policy and value function networks are constructed, and the agent is trained using a deep learning framework.
Step four: testing landing control effect
First, the speed tracking capability of the lander is tested: a desired speed v_d = [v_dx, v_dy, v_dz] and various initial speeds v_0 = [v_x0, v_y0, v_z0] are set, and the trained controller drives the speed to the desired value.
Then the soft landing test is performed: initial positions, speeds, and attitudes are generated randomly according to the initial conditions of the soft landing powered descent phase, the trained speed controller is used for soft landing, and the training effect is evaluated through multiple targeting experiments.
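The targeting-experiment procedure just described can be sketched as a success-rate estimator. The initial-condition ranges and the `run_episode` callable (returning True on a successful soft landing) are hypothetical stand-ins for the trained controller plus simulator:

```python
import random

def landing_success_rate(run_episode, n_trials=100, seed=1):
    """Repeat the soft landing test from randomized initial conditions
    and report the fraction of successful landings."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(n_trials):
        init = {
            "position": [rng.uniform(-50.0, 50.0) for _ in range(3)],  # m
            "velocity": [rng.uniform(-5.0, 5.0) for _ in range(3)],    # m/s
            "attitude": [rng.uniform(-0.2, 0.2) for _ in range(3)],    # rad
        }
        if run_episode(init):
            successes += 1
    return successes / n_trials
```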
The invention can train the controller with the model-free DDPG, TD3, and SAC algorithms. First, the policy network parameters θ, the value function network parameters φ, and the experience replay pool D are initialized. Then the agent interacts with the environment: at each step the agent observes the environment state s and outputs an action a according to the current policy; the environment state transitions to s' and feeds back a reward r and an episode-termination signal d; and the experience tuple (s, a, s', r, d) is stored in D. Whenever an update period is reached, a batch of experiences is randomly sampled from D and the network parameters are updated along the gradient of the constructed loss function. These steps repeat until the episode reward converges.
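The shared off-policy loop just described (the DDPG/TD3/SAC pseudocode itself appears only as images in the source) can be sketched as follows; the `env` and `agent` interfaces are hypothetical minimal stand-ins:

```python
import random

def train(env, agent, replay, episodes=20, batch_size=32, update_every=10):
    """Observe s, act a, store (s, a, s', r, d) in the replay pool D,
    and periodically update the agent from a random minibatch."""
    step = 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = agent.act(s)                    # current policy (+ exploration noise)
            s2, r, done = env.step(a)           # environment feedback
            replay.append((s, a, s2, r, done))  # experience replay pool D
            s = s2
            step += 1
            if step % update_every == 0 and len(replay) >= batch_size:
                batch = random.sample(replay, batch_size)  # random minibatch
                agent.update(batch)             # gradient step on the loss
    return step
```

The three algorithms differ only in what `agent.update` does (deterministic vs. stochastic policy, single vs. twin critics, entropy regularization); the interaction loop is identical.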
The following examples were used to demonstrate the beneficial effects of the present invention:
example (b):
1) experimental Environment settings
This example selects DDPG, TD3, and SAC for training and testing. The value function network of the three algorithms is designed as in Fig. 1, the policy network structure of DDPG and TD3 as in Fig. 2, and the SAC policy network structure as in Fig. 3. The neural networks are built and trained in Python using PyTorch.
The software environment for the simulation tests of all algorithms herein is Ubuntu 16.04, and the hardware environment is an Intel(R) Core(TM) i5-9300H CPU + NVIDIA GeForce GTX 1660 Ti + 16.0 GB RAM.
2) Results and analysis of the experiments
The training process curves of the DDPG, TD3, and SAC algorithms are shown in Figs. 4, 5, and 6, respectively. The DDPG episode reward begins to rise at about 10,000 episodes, improves markedly between 10,000 and 20,000 episodes, then gradually converges to about 300, with training finishing at about 40,000 episodes; the training stabilizes in terms of average reward, but the per-episode reward fluctuates between 100 and 400 and is quite unstable. With TD3, the agent's reward improves significantly after 700 episodes of training and essentially stabilizes at 450 after about 5,000 episodes; compared with DDPG, it is more stable and converges faster. SAC undergoes clear improvements at about 3,000 and 5,000 episodes and finally stabilizes after 9,500 episodes of training, with the reward converging to 500; compared with TD3, SAC's per-episode reward once stabilized is higher, approaching 600.
The DDPG, TD3, and SAC speed test curves are shown in Figs. 7, 8, and 9, respectively. Starting from a large initial vertical error, DDPG can control the speed error to within 2 m/s but cannot stabilize it, oscillating continuously; TD3 controls the speed close to the desired value, with an accuracy near 1 m/s; SAC achieves the highest accuracy, within 0.1 m/s.
The two-dimensional distributions of the DDPG, TD3, and SAC landing sites are shown in Figs. 10, 11, and 12, respectively. Over 100 targeting experiments, the landing success rates of DDPG, TD3, and SAC were 74%, 92%, and 96%, respectively. The landing-point accuracy of TD3 and SAC is significantly better than that of DDPG.
The method described above realizes planetary soft landing control and provides a new approach for research on planetary soft landing trajectory optimization and control.
The present invention is capable of other embodiments and its several details are capable of modifications in various obvious respects, all without departing from the spirit and scope of the present invention.
Claims (10)
1. A planet soft landing control method based on reinforcement learning is characterized in that: the method comprises the following steps:
step one: establishing a six-degree-of-freedom dynamics model of the lander's powered descent phase based on the lander mass m, the inertia matrix I, and the engine power configuration; the engine power configuration comprises the number of engines n, the installation position of each engine, and the thrust range of a single engine, T_i ∈ [T_min, T_max], where T_i is the thrust of a single engine and i = 1, 2, …, n is the engine index;
Step two: based on the dynamics model established in the first step, the thrust of each engine is used as control output, information reflecting the state of the lander is used as an observation vector, a reward function is designed for evaluating the control performance, and a corresponding neural network is designed according to different reinforcement learning algorithm frameworks;
step three: building a numerical simulation environment based on the dynamics model built in the step one and the interactive environment design in the step two, and respectively training by utilizing different reinforcement learning algorithm frames to obtain a soft landing controller as an alternative controller selected in subsequent tests;
step four: and based on the soft landing controller obtained in the step three, comprehensively evaluating the speed tracking capability and the soft landing precision through a speed tracking performance test and a power descent section soft landing test respectively, and selecting the controller with the optimal performance for planetary soft landing control according to the test effect.
2. The reinforcement-learning-based planetary soft landing control method according to claim 1, wherein the six-degree-of-freedom dynamics model of the powered descent phase in step one comprises a centroid translational dynamics model and an attitude dynamics model:

m·(dv_b/dt) = F_b − m·(ω_t × v_b)
I·(dω_t/dt) = M_b − ω_t × (I·ω_t)

where ω_t and v_b are the component forms of the attitude angular velocity and the velocity in the body coordinate system, F_b and M_b are respectively the resultant force and resultant moment vectors of the external forces on the lander, and I is the inertia matrix of the lander;
The resultant external force and moment vectors F_b and M_b on the lander are expressed as follows:

F_b = F_T + m·g_b + F_N
M_b = M_T + M_N

where F_T and M_T are respectively the resultant force and moment vectors generated by the engines, F_N and M_N are respectively the resultant force and moment caused by external disturbances, g_b is the component form of the planetary surface gravitational acceleration in the lander body coordinate system, and m is the mass of the lander;
the aerodynamic force in the powered descent phase is small compared with the engine thrust and the planetary gravity, and is treated as a disturbance embodied in F_N and M_N, namely
F_N = F_wind + δF
M_N = M_wind + δM
where F_wind and M_wind are respectively the resultant force and resultant moment caused by the atmosphere, and δF and δM are respectively the resultant force and resultant moment caused by unmodeled disturbances, which include engine installation deviation and thrust magnitude fluctuation.
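The translational and attitude dynamics equations of claim 2 are published as images in the original document. As an illustrative sketch only, the standard Newton-Euler form consistent with the symbols defined above (this exact form, and all function names, are assumptions):

```python
import numpy as np

def dynamics_step(v_b, omega_t, m, I, F_b, M_b, dt):
    """One explicit-Euler step of six-degree-of-freedom rigid-body dynamics
    in the body frame. The patent's own equations are images; this standard
    Newton-Euler form is an assumption consistent with the defined symbols."""
    # Centroid translational dynamics: m * (dv_b/dt + omega_t x v_b) = F_b
    v_b_dot = F_b / m - np.cross(omega_t, v_b)
    # Attitude dynamics: I * domega_t/dt + omega_t x (I @ omega_t) = M_b
    omega_t_dot = np.linalg.solve(I, M_b - np.cross(omega_t, I @ omega_t))
    return v_b + v_b_dot * dt, omega_t + omega_t_dot * dt
```

A fixed-step integrator like this is typical for RL training environments, where speed matters more than high-order accuracy.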
3. The reinforcement learning based planetary soft landing control method of claim 2, wherein the reward function in step two takes the form:
r = r_fuel + r_vel + r_crash + r_constant + r_goal
where r_fuel is the fuel consumption penalty, r_vel is the speed tracking reward, r_crash is the landing rollover penalty, r_constant is a constant-value reward, and r_goal is the reward for a successful soft landing.
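The composite reward is a plain sum of the five terms. As a sketch (the function name is an assumption; each term is computed per the forms of claims 4-7):

```python
def total_reward(r_fuel, r_vel, r_crash, r_constant, r_goal):
    """Claim 3: the per-step reward is the sum of the fuel penalty, speed
    tracking reward, rollover penalty, constant-value reward, and soft
    landing success reward."""
    return r_fuel + r_vel + r_crash + r_constant + r_goal
```

The constant-value term acts as a per-step "alive" bonus in many RL landing formulations, though its exact role here is not spelled out in the extract.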
4. The reinforcement learning based planetary soft landing control method of claim 3, wherein the fuel consumption penalty takes the form
where α is the fuel consumption penalty coefficient, a negative real number of small absolute value; the larger its absolute value, the lower the fuel consumption of the trained controller, at the cost of control accuracy.
5. The reinforcement learning based planetary soft landing control method of claim 3, wherein the speed tracking reward takes the form
r_vel = β·||v − v_ref||
where β is the speed error reward coefficient; the larger its absolute value, the higher the control precision of the training result, but the greater the fuel consumption; v_ref is the reference velocity to be tracked during landing, generated by the guidance law.
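Claims 4 and 5 can be sketched together. The exact fuel-penalty expression is published as an image, so penalizing the summed commanded thrust per step is an assumed (though common) choice, and both coefficient values are assumptions:

```python
import numpy as np

def fuel_penalty(thrust_magnitudes, alpha=-0.001):
    """Claim 4 sketch: alpha is a negative real number of small absolute
    value. The published penalty form is an image; penalizing the summed
    commanded thrust per step is an assumption."""
    return alpha * float(np.sum(thrust_magnitudes))

def velocity_reward(v, v_ref, beta=-0.05):
    """Claim 5: r_vel = beta * ||v - v_ref||. A negative beta turns the
    tracking error norm into a penalty (the value -0.05 is an assumption)."""
    return beta * float(np.linalg.norm(np.asarray(v) - np.asarray(v_ref)))
```

Both coefficients trade off against each other, matching the claims' remark that tighter tracking (larger |β|) costs fuel and stronger fuel saving (larger |α|) costs accuracy.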
6. The reinforcement learning based planetary soft landing control method of claim 3, wherein the landing rollover penalty takes the form
r_crash = η·(φ > φ_lim or θ > θ_lim or ψ > ψ_lim)
where η is the rollover penalty coefficient, and introducing this penalty helps prevent the attitude from exceeding its constraint values during landing; ψ, θ and φ are respectively the yaw, pitch and roll angles, with rotation order zyx, representing the attitude of the lander body coordinate system relative to the navigation coordinate system; ψ_lim, θ_lim and φ_lim are the upper bound values of ψ, θ and φ, respectively.
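The rollover penalty is an indicator term: η is paid once any attitude angle leaves its bound. A minimal sketch (η's value is an assumption, and comparing angle magnitudes rather than signed angles is also an assumption):

```python
def rollover_penalty(phi, theta, psi, phi_lim, theta_lim, psi_lim, eta=-50.0):
    """Claim 6: r_crash = eta * (phi > phi_lim or theta > theta_lim or
    psi > psi_lim). Angles in radians; magnitudes are compared here, an
    assumption, and eta = -50 is illustrative."""
    exceeded = (abs(phi) > phi_lim) or (abs(theta) > theta_lim) or (abs(psi) > psi_lim)
    return eta * float(exceeded)
```

Because the term is zero everywhere inside the constraint set, it shapes behavior only near the boundary, which is why the claims describe it as helping the attitude stay within its constraint values.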
7. The reinforcement learning based planetary soft landing control method of claim 3, wherein the reward for a successful soft landing takes the form
r_goal = λ·(h < 0 and v_z > 0 and ||v|| < v_lim and φ < φ_lim and θ < θ_lim and ψ < ψ_lim and ||ω|| < ω_lim)
where λ is the soft landing reward coefficient, h is the lander height, v_z is the velocity component in the height direction, and ω_lim is the soft landing attitude angular velocity threshold.
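The success reward pays λ only when every touchdown condition holds simultaneously. A sketch (λ's value is an assumption; the v_z > 0 test follows the claim's sign convention, under which a positive height-direction component apparently denotes descent):

```python
import numpy as np

def soft_landing_reward(h, v, omega, phi, theta, psi,
                        v_lim, phi_lim, theta_lim, psi_lim, omega_lim,
                        lam=100.0):
    """Claim 7: reward lam only if the lander has reached the ground
    (h < 0), is moving in the descent direction (v_z > 0 per the claim's
    convention), and speed, attitude, and angular rate are all within
    their soft landing thresholds. lam = 100 is illustrative."""
    v = np.asarray(v, dtype=float)
    ok = (h < 0 and v[2] > 0
          and np.linalg.norm(v) < v_lim
          and abs(phi) < phi_lim and abs(theta) < theta_lim and abs(psi) < psi_lim
          and np.linalg.norm(omega) < omega_lim)
    return lam * float(ok)
```

Gating all conditions with a logical AND makes the sparse terminal reward fire only on a genuinely soft, upright touchdown.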
8. The reinforcement learning based planetary soft landing control method of claim 1, wherein the observation vector in step two takes the form:
s = [δv_b, sin(φ), cos(φ), sin(θ), cos(θ), sin(ψ), cos(ψ), ω_t, Q]
where δv_b is the component form of the difference between the true velocity and the desired velocity in the body coordinate system, and Q is the attitude quaternion;
in step two the range of each thrust output is a_i ∈ [−1, 1], which is mapped by a linear change to the thrust output T_i satisfying the thrust constraint T_i ∈ [T_min, T_max].
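The observation construction and the thrust mapping of claim 8 can be sketched as follows. The exact "linear change" is published as an image, so the affine form below is an assumption (it is the natural reading that sends −1 to T_min and +1 to T_max):

```python
import numpy as np

def observation(v_b, v_ref_b, phi, theta, psi, omega_t, quat):
    """Claim 8 observation s = [dv_b, sin/cos of Euler angles, omega_t, Q].
    Feeding sin/cos pairs instead of raw angles avoids the 2*pi wrap-around
    discontinuity in the network input."""
    dv_b = np.asarray(v_b, dtype=float) - np.asarray(v_ref_b, dtype=float)
    trig = [f(a) for a in (phi, theta, psi) for f in (np.sin, np.cos)]
    return np.concatenate([dv_b, trig, omega_t, quat])  # 3 + 6 + 3 + 4 = 16

def action_to_thrust(a_i, t_min, t_max):
    """Map a normalized action a_i in [-1, 1] linearly to a thrust T_i in
    [T_min, T_max]; this affine form is an assumption (the patent's formula
    is an image)."""
    return t_min + (a_i + 1.0) * 0.5 * (t_max - t_min)
```

Bounding the policy output in [−1, 1] and rescaling per engine is standard practice for continuous-action RL, since tanh-squashed policies naturally produce that range.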
9. The planetary soft landing control method based on reinforcement learning of claim 1, wherein the state space and the action space of the reinforcement learning algorithm framework in the third step are continuous.
10. A computer-readable storage medium, characterized in that: the computer-readable storage medium stores a computer program configured to, when invoked by a processor, implement the steps of the reinforcement learning based planetary soft landing control method of any of claims 1-9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111196380.7A CN113821057B (en) | 2021-10-14 | 2021-10-14 | Planetary soft landing control method and system based on reinforcement learning and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113821057A true CN113821057A (en) | 2021-12-21 |
CN113821057B CN113821057B (en) | 2023-05-30 |
Family
ID=78916535
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111196380.7A Active CN113821057B (en) | 2021-10-14 | 2021-10-14 | Planetary soft landing control method and system based on reinforcement learning and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113821057B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117631547B (en) * | 2024-01-26 | 2024-04-26 | 哈尔滨工业大学 | Landing control method for quadruped robot under irregular weak gravitational field of small celestial body |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107065571A (en) * | 2017-06-06 | 2017-08-18 | 上海航天控制技术研究所 | A kind of objects outside Earth soft landing Guidance and control method based on machine learning algorithm |
CN107656439A (en) * | 2017-11-13 | 2018-02-02 | 浙江大学 | A kind of moon detector in flexible landing optimal control system based on Self Adaptive Control grid |
CN109212976A (en) * | 2018-11-20 | 2019-01-15 | 北京理工大学 | The small feature loss soft landing robust trajectory tracking control method of input-bound |
CN109733415A (en) * | 2019-01-08 | 2019-05-10 | 同济大学 | A kind of automatic Pilot following-speed model that personalizes based on deeply study |
CN111460650A (en) * | 2020-03-31 | 2020-07-28 | 北京航空航天大学 | Unmanned aerial vehicle end-to-end control method based on deep reinforcement learning |
US20200272625A1 (en) * | 2019-02-22 | 2020-08-27 | National Geographic Society | Platform and method for evaluating, exploring, monitoring and predicting the status of regions of the planet through time |
WO2021125395A1 (en) * | 2019-12-18 | 2021-06-24 | 한국항공우주연구원 | Method for determining specific area for optical navigation on basis of artificial neural network, on-board map generation device, and method for determining direction of lander |
CN113408796A (en) * | 2021-06-04 | 2021-09-17 | 北京理工大学 | Deep space probe soft landing path planning method for multitask deep reinforcement learning |
Also Published As
Publication number | Publication date |
---|---|
CN113821057B (en) | 2023-05-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110806759B (en) | Aircraft route tracking method based on deep reinforcement learning | |
CN110850719B (en) | Spatial non-cooperative target parameter self-tuning tracking method based on reinforcement learning | |
Wang et al. | Nonlinear aeroelastic control of very flexible aircraft using model updating | |
CN112462792A (en) | Underwater robot motion control method based on Actor-Critic algorithm | |
CN107085435A (en) | Hypersonic aircraft attitude harmony control method based on coupling analysis | |
Kapnopoulos et al. | A cooperative particle swarm optimization approach for tuning an MPC-based quadrotor trajectory tracking scheme | |
CN112859889A (en) | Autonomous underwater robot control method and system based on self-adaptive dynamic planning | |
CN113821057B (en) | Planetary soft landing control method and system based on reinforcement learning and storage medium | |
CN114200950B (en) | Flight attitude control method | |
Dong et al. | Trial input method and own-aircraft state prediction in autonomous air combat | |
CN114637312A (en) | Unmanned aerial vehicle energy-saving flight control method and system based on intelligent deformation decision | |
Wang et al. | A new spacecraft attitude stabilization mechanism using deep reinforcement learning method | |
CN116820134A (en) | Unmanned aerial vehicle formation maintaining control method based on deep reinforcement learning | |
Chen et al. | An experimental study of the wire-driven compliant robotic fish | |
CN116697829A (en) | Rocket landing guidance method and system based on deep reinforcement learning | |
CN117289709A (en) | High-ultrasonic-speed appearance-changing aircraft attitude control method based on deep reinforcement learning | |
CN116620566A (en) | Non-cooperative target attached multi-node intelligent cooperative guidance method | |
CN116068894A (en) | Rocket recovery guidance method based on double-layer reinforcement learning | |
Wang et al. | Deep learning based missile trajectory prediction | |
CN116360258A (en) | Hypersonic deformed aircraft anti-interference control method based on fixed time convergence | |
CN113050420B (en) | AUV path tracking method and system based on S-plane control and TD3 | |
CN113418674A (en) | Wind tunnel track capture test method with three-degree-of-freedom motion of primary model | |
Hong et al. | Control of a fly-mimicking flyer in complex flow using deep reinforcement learning | |
Breese et al. | Physics-Based Neural Networks for Modeling & Control of Aerial Vehicles | |
Cheng et al. | Cross-cycle iterative unmanned aerial vehicle reentry guidance based on reinforcement learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||