CN115524964B - Rocket landing real-time robust guidance method and system based on reinforcement learning - Google Patents

Rocket landing real-time robust guidance method and system based on reinforcement learning

Info

Publication number
CN115524964B
CN115524964B (application CN202210972207.XA)
Authority
CN
China
Prior art keywords: rocket, landing, flight, intelligent agent, real
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210972207.XA
Other languages
Chinese (zh)
Other versions
CN115524964A (en)
Inventor
王劲博
李辉旭
施健林
苏霖锋
陈洪波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202210972207.XA priority Critical patent/CN115524964B/en
Publication of CN115524964A publication Critical patent/CN115524964A/en
Application granted granted Critical
Publication of CN115524964B publication Critical patent/CN115524964B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G05: CONTROLLING; REGULATING
    • G05B: CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
    • G05B13/00: Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
    • G05B13/02: Adaptive control systems, electric
    • G05B13/04: Adaptive control systems, electric, involving the use of models or simulators
    • G05B13/042: Adaptive control systems in which a parameter or coefficient is automatically adjusted to optimise the performance
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T90/00: Enabling technologies or technologies with a potential or indirect contribution to GHG emissions mitigation

Abstract

The invention provides a rocket landing real-time robust guidance method and system based on reinforcement learning. A rocket three-degree-of-freedom motion model is established from the forces acting on the rocket during the powered-descent landing phase, and a rocket landing Markov decision process model is built on top of it. An intelligent Agent constructed from this decision process model is then trained through interactive simulation with a pre-built rocket landing flight simulation environment to obtain a landing control Agent, and the rocket landing flight is guided by the real-time control instructions generated by the landing control Agent.

Description

Rocket landing real-time robust guidance method and system based on reinforcement learning
Technical Field
The invention relates to the technical field of Earth-landing guidance for vertical take-off and landing rockets, and in particular to a rocket landing real-time robust guidance method and system based on reinforcement learning.
Background
The vertical take-off and landing reusable launch vehicle is a new class of carrier rocket and an effective means of reducing the cost of space launch missions and improving access to space. Earth-landing guidance of the rocket sub-stage is the key technology for controlling the position and velocity of the vehicle's center of mass during the three-degree-of-freedom return-and-landing flight: commands steering the motion of the rocket's center of mass are generated according to a given principle or strategy so that the flight satisfies the constraint conditions and the terminal state meets a preset target, thereby ensuring recovery accuracy, reducing fuel consumption, and enabling reliable reuse.
Existing rocket sub-stage Earth-landing guidance methods fall mainly into two categories: online trajectory-optimization guidance, which builds a dynamics model of the target rocket and a corresponding trajectory-optimization problem and solves the optimal flight trajectory online with an indirect or direct method, and deep-learning guidance, which follows an offline-training, online-application strategy. Although these methods offer a certain degree of real-time performance and optimality and can, to some extent, support reuse of the launch vehicle, they are all model-based: their algorithmic efficiency and the usability of their solutions depend heavily on modeling precision and accuracy, and their robustness is poor. Once the environment contains unmodeled unknown factors, or the model carries deviations and uncertain disturbances, algorithm performance and solution usability deteriorate severely, which can lead to guidance failure.
Disclosure of Invention
The invention aims to provide a rocket landing real-time robust guidance method based on reinforcement learning. A rocket three-degree-of-freedom motion model is constructed through force analysis of the powered-descent flight of the rocket's Earth landing, a rocket landing Markov decision process model is constructed by combining the gaze-heuristic idea, and an intelligent Agent built from a value-function neural network and a policy neural network is trained through interactive simulation with a rocket landing flight simulation environment to obtain a landing control Agent that generates the rocket landing guidance control strategy; real-time control instructions are then generated according to this strategy to guide the rocket landing flight.
In order to achieve the above purpose, it is necessary to provide a rocket landing real-time robust guidance method and system based on reinforcement learning to solve the above technical problems.
In a first aspect, an embodiment of the present invention provides a rocket landing real-time robust guidance method based on reinforcement learning, including the following steps:
constructing a rocket three-degree-of-freedom motion model according to acting force borne by the rocket in the earth landing power descent stage;
constructing a rocket landing Markov decision process model according to the rocket three-degree-of-freedom motion model;
constructing an intelligent Agent according to the rocket landing Markov decision process model, and performing interactive training on the intelligent Agent and a pre-constructed rocket landing flight simulation environment to obtain a landing control Agent; the intelligent Agent comprises a value function-based neural network and a strategy-based neural network;
and generating a real-time control instruction according to the landing control Agent, and guiding the rocket to land and fly according to the real-time control instruction.
Further, the step of constructing the rocket three-degree-of-freedom motion model according to the acting force borne by the rocket earth landing power descent segment flight comprises the following steps:
establishing a landing point coordinate system by taking the rocket sub-level target landing point as an original point; the landing point coordinate system is a coordinate system which takes a target landing point for rocket sublevel landing as a coordinate origin O, takes the vertical upward direction of the geocentric as a coordinate axis Oz, takes the main flight direction of the rocket during landing as a coordinate axis Ox, and takes the direction which is perpendicular to the plane xOz and forms a right-hand rectangular coordinate system with the coordinate axis Ox and the coordinate axis Oz as a coordinate axis Oy;
based on the landing point coordinate system, carrying out stress analysis on the rocket flying in the earth landing power descent stage, and determining corresponding earth attraction, aerodynamic resistance and engine thrust;
constructing the rocket three-degree-of-freedom motion model according to the earth attraction, the aerodynamic drag and the engine thrust; the rocket three-degree-of-freedom motion model is expressed as:

dr/dt = V
dV/dt = g(r) + (T + D) / m
dm/dt = -||T|| / (I_sp · g_0)

in the formula,

D = -(1/2) · C_D · S_ref · ρ · ||V|| · V,  ρ = ρ_0 · exp(-h / h_ref)

wherein r represents the rocket position vector; V represents the rocket velocity vector; m represents the rocket mass; g(r) represents the gravitational acceleration vector acting on the rocket; T represents the engine thrust vector; D represents the aerodynamic drag vector; I_sp represents the fuel specific impulse; g_0 represents the average gravitational acceleration at sea level of the earth; ṁ represents the propellant mass flow rate after engine start; C_D represents the drag coefficient; S_ref represents the reference area of the rocket sub-stage; ρ_0 represents the reference atmospheric density at sea level of the earth; h represents the flight altitude of the rocket sub-stage; h_ref represents the reference height.
Further, the step of constructing a rocket landing Markov decision process model according to the rocket three-degree-of-freedom motion model comprises the following steps:
based on the gaze-heuristic idea, converting the state variables of the rocket to obtain the state quantity of the rocket landing Markov decision process model; the state quantity is expressed as:
(the state quantity S, the line-of-sight vector V_sight and the remaining flight time t_go are defined by formulas given in the original as images; the explicitly stated relation is)

V_error = V - V_sight

wherein S represents the state quantity of the rocket landing Markov decision process model; r, V and V_0 respectively denote the rocket position vector, the rocket velocity vector and the initial rocket velocity; t_go denotes the remaining flight time of the rocket; r_z denotes the Z-axis component of the rocket position vector; V_sight denotes the line-of-sight vector; V_error denotes the error between the rocket velocity vector and the line-of-sight vector; λ denotes a parameter that adjusts the magnitude of the line-of-sight vector over time;
obtaining the action quantity of the rocket landing Markov decision process model according to the control instruction of the rocket; the action quantity is expressed as:

A = T = [T_x, T_y, T_z]ᵀ

wherein A represents the action quantity of the rocket landing Markov decision process model; T represents the engine thrust vector; T_x, T_y and T_z represent the X-axis, Y-axis and Z-axis components of the engine thrust, respectively;
determining a return function design principle according to rocket fixed point soft landing requirements, and obtaining a return function of the rocket landing Markov decision process model according to the return function design principle;
discretizing a continuous rocket landing process according to a preset period, and determining the state transition probability of the rocket landing Markov decision process model according to rocket integral dynamics.
Further, the step of constructing an intelligent Agent according to the rocket landing Markov decision process model comprises the following steps:
selecting the proximal policy optimization algorithm as the reinforcement learning algorithm of the intelligent Agent according to the rocket landing Markov decision process model;
and constructing the value-function-based neural network and the policy-based neural network from multilayer perceptron models in accordance with the proximal policy optimization algorithm.
Further, the rocket landing flight simulation environment construction step comprises:
and constructing a rocket landing operation environment based on the rocket three-degree-of-freedom motion model, and synchronously constructing a corresponding initial value condition generator and a corresponding flight termination determiner to obtain the rocket landing flight simulation environment.
Further, the step of interactively training the intelligent Agent and the pre-constructed rocket landing flight simulation environment to obtain the landing control Agent comprises the following steps:
and training a strategy-based neural network of the intelligent Agent until convergence through interactive simulation of the intelligent Agent and the rocket landing flight simulation environment to obtain the landing control Agent.
Further, the step of training a policy-based neural network of the intelligent Agent through interactive simulation of the intelligent Agent and the rocket landing flight simulation environment until convergence to obtain the landing control Agent comprises:
randomly selecting an initial state to be simulated from a preset initial state space according to the initial value condition generator;
according to the initial state to be simulated, executing interactive simulation of the intelligent Agent and the flight simulation environment, terminating the simulation flight of the current wheel when a simulation termination condition preset by the flight termination judging device is reached, evaluating and obtaining an accumulated return value of each state point in the current simulation flight track according to a return function, and updating parameters of the intelligent Agent based on a value function neural network according to the accumulated return value;
predicting expected accumulated return values of all state points in the current simulated flight trajectory according to the updated value-based function neural network of the intelligent Agent, calculating an advantage function according to the accumulated return values and the expected return values, and updating parameters of the intelligent Agent based on a strategy neural network according to the advantage function;
and judging whether the strategy-based neural network of the intelligent Agent reaches a preset convergence condition, if so, stopping simulation training to obtain the landing control Agent, otherwise, reselecting an initial state to be simulated according to the initial condition generator, and starting the next round of interactive simulation training.
In a second aspect, an embodiment of the present invention provides a rocket landing real-time robust guidance system based on reinforcement learning, where the system includes:
the motion model building module is used for building a rocket three-degree-of-freedom motion model according to acting force borne by the rocket in the earth landing power descent stage;
the optimization model building module is used for building a rocket landing Markov decision process model according to the rocket three-degree-of-freedom motion model;
the control strategy training module is used for constructing an intelligent Agent according to the rocket landing Markov decision process model and interactively training the intelligent Agent and a pre-constructed rocket landing flight simulation environment to obtain a landing control Agent; the intelligent Agent comprises a value function-based neural network and a strategy-based neural network;
and the rocket landing guidance module is used for generating a real-time control instruction according to the landing control Agent and guiding the rocket to land and fly according to the real-time control instruction.
In a third aspect, an embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the method when executing the computer program.
In a fourth aspect, the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the steps of the above method.
The method implements the following technical scheme: a rocket three-degree-of-freedom motion model is constructed from the forces acting on the rocket during the Earth-landing powered-descent flight; a rocket landing Markov decision process model is constructed from the motion model; an intelligent Agent built from the decision process model is trained through interactive simulation with a pre-constructed rocket landing flight simulation environment to obtain a landing control Agent; and the rocket landing flight is guided by the real-time control instructions generated by the landing control Agent. Compared with the prior art, the reinforcement-learning-based rocket landing real-time robust guidance method not only offers extremely high real-time performance but also strong algorithmic robustness: it adapts to a relatively wide range of modeling deviations, can still guide the rocket to a high-precision fixed-point soft landing under operating conditions with uncertain environmental disturbances, and therefore has high application value.
Drawings
FIG. 1 is a schematic diagram of an application scenario of a rocket landing real-time robust guidance method based on reinforcement learning in an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a rocket landing real-time robust guidance method based on reinforcement learning in the embodiment of the invention;
FIG. 3 is a schematic diagram of a landing site coordinate system used for building a rocket three-degree-of-freedom motion model in an embodiment of the present invention;
FIG. 4 is a schematic diagram of a policy-based neural network of an intelligent Agent in an embodiment of the present invention;
FIG. 5 is a schematic diagram of a value-based function neural network of an intelligent Agent in an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a rocket landing real-time robust guidance system based on reinforcement learning in the embodiment of the present invention;
fig. 7 is an internal structural diagram of a computer device in the embodiment of the present invention.
Detailed Description
In order to make the purpose, technical solution and advantages of the present invention more clearly apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments, and it is obvious that the embodiments described below are part of the embodiments of the present invention, and are used for illustrating the present invention only, but not for limiting the scope of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The rocket landing real-time robust guidance method based on reinforcement learning provided by the invention can be applied to the Earth return-and-landing guidance of a vertical take-off and landing reusable launch vehicle. Based on the overall framework shown in FIG. 1, the method maps the rocket's real-time state to an engine thrust command, and the issued commands are robust to wide-ranging rocket model deviations and environmental disturbances, so that the rocket sub-stage is guided to a high-precision fixed-point soft landing under complex uncertainty. A deep neural network is used as the policy network for reinforcement learning and is trained with an improved PPO algorithm, so that commands in a high-dimensional continuous action space can be fitted effectively; the state quantities are set with a gaze-heuristic method, and different reward discount rates are designed for the terminal and process indices of the rocket landing trajectory to accelerate policy convergence. As a result, the method has extremely high real-time performance and strong algorithmic robustness, adapts to relatively wide modeling deviations, copes effectively with uncertain disturbances in the environment, provides a reliable guarantee for guiding the rocket to a high-precision fixed-point soft landing, and has high application value. It should be noted that the method of the present invention can be executed by a server that undertakes the related functions; the following embodiments all take a server as the execution subject, and the reinforcement-learning-based rocket landing real-time robust guidance method of the present invention is described in detail below.
In one embodiment, as shown in fig. 2, a rocket landing real-time robust guidance method based on reinforcement learning is provided, which comprises the following steps:
s11, constructing a rocket three-degree-of-freedom motion model according to acting force borne by the rocket in the earth landing power descent stage; the rocket three-degree-of-freedom motion model can be understood as a motion model obtained by performing targeted improvement on a current rocket landing operation model based on the consideration of actual flight conditions and task targets and establishing a nonlinear and continuous rocket fuel optimal landing trajectory optimization problem; because the rocket sublevel is mainly influenced by engine thrust, earth gravitation and aerodynamic force generated in a dense atmosphere environment in the final landing process, in order to simplify the problem as much as possible on the basis of ensuring the reliability of the research problem, the embodiment mainly considers the acting force borne by the rocket in the earth landing power descent stage flight to construct a corresponding rocket three-degree-of-freedom motion model; specifically, the step of constructing the rocket three-degree-of-freedom motion model according to the acting force borne by the rocket earth landing power descent segment flight comprises the following steps:
establishing a landing point coordinate system with the rocket sub-stage target landing point as the origin; the landing point coordinate system takes the target landing point of the rocket sub-stage as the coordinate origin O, the vertically upward direction (away from the geocenter) as coordinate axis Oz, the main flight direction of the rocket during landing as coordinate axis Ox, and, as coordinate axis Oy, the direction perpendicular to plane xOz that forms a right-handed rectangular coordinate system with axes Ox and Oz, as shown in FIG. 3; it should be noted that, because the final flight phase of the rocket sub-stage landing process considered by the invention features a short flight time and a narrow flight airspace, the influence of the curvature of the Earth's surface and of the Earth's rotation can be neglected and the Earth's surface treated as a plane; therefore, to describe the sub-stage flight more intuitively and to simplify the solution, this embodiment preferably establishes the landing point coordinate system for the force analysis when constructing the rocket three-degree-of-freedom motion model;
based on the landing point coordinate system, carrying out force analysis on the rocket flying in the Earth-landing powered-descent phase, and determining the corresponding Earth gravity, aerodynamic drag and engine thrust; in the rocket three-degree-of-freedom motion model the Earth's gravitational acceleration is set to a constant value: since the powered-descent flight time is short (tens of seconds) and the flight airspace is narrow (tens of kilometers), the influence of the Earth's rotation can be neglected, and adopting a flat landing field with a uniform gravitational field meets the accuracy requirement while effectively simplifying the solution of the problem; the aerodynamic drag is the drag experienced by the rocket in the dense atmosphere and can be expressed as:

D = -(1/2) · C_D · S_ref · ρ · ||V|| · V

with the exponential atmospheric density model

ρ = ρ_0 · exp(-h / h_ref)

wherein C_D represents the drag coefficient; S_ref represents the reference area of the rocket sub-stage; ρ represents the atmospheric density in the Earth-landing environment, expressed by the exponential atmospheric density model above; V represents the velocity vector of the rocket; ρ_0 represents the reference atmospheric density at sea level of the earth; h represents the flight altitude of the rocket sub-stage, i.e. the Z-axis component of the rocket position in the landing point coordinate system; h_ref represents the reference height;
the engine thrust is understood to be that under the condition of not considering rocket attitude transformation, a plurality of engines equipped at the rocket sublevel are combined and equivalently become one engine which provides thrust as a rocket, so that the engine thrust is obtained, and is expressed as follows:
Figure BDA0003795449160000092
wherein, I sp Is the specific impulse of fuel, g 0 Is the average gravitational acceleration at sea level of the earth,
Figure BDA0003795449160000093
the second consumption of the propellant after the engine is started.
In addition, in the landing problem studied by the present invention, the thrust generated by the rocket engine is used as the only Control amount without considering the influence of Control mechanisms such as a grid rudder and a Reaction Control System (RCS) on rocket adjustment; meanwhile, the attitude motion of the rocket is not considered, the landing motion of the rocket is taken as a mass center motion, so that the total thrust T of the engine can be decomposed in the established landing point coordinate system according to the three-axis direction, and the thrust component along the three axes in the landing point coordinate system is obtained, namely T = [ T ]) x ,T y ,T z ] T The method can effectively avoid complex trigonometric function thrust resolving, directly takes three thrust components as the control quantity of the rocket sublevel in the subsequent problem modeling and solving, and imposes the following constraint on the form:
Figure BDA0003795449160000101
wherein, the thrust amplitude is | | | T | |; because of the limitation of the current reusable engine technology, and in order to ensure the safety in the landing process, the engine is not shut down after being ignited and started in the last section of landing flight process, namely, in the whole power descent section flight, the rocket stage can be acted by the nonzero minimum thrust, and the corresponding thrust amplitude of the engine has the following constraints:
T min ≤||T||≤T max
wherein, T max And T min The upper bound and the lower bound of the thrust amplitude of the rocket engine are respectively;
constructing the rocket three-degree-of-freedom motion model according to the Earth gravity, the aerodynamic drag and the engine thrust; the rocket three-degree-of-freedom motion model is expressed as:

dr/dt = V
dV/dt = g(r) + (T + D) / m
dm/dt = -||T|| / (I_sp · g_0)

where r represents the rocket position vector; V represents the rocket velocity vector; m represents the rocket mass; g(r) represents the gravitational acceleration vector acting on the rocket, which is in general a function of the rocket position r and is set to a constant value in the problem solving of the invention; T represents the engine thrust vector, the control variable of the trajectory optimization problem of the invention; D represents the aerodynamic drag vector defined by the drag formula above; I_sp represents the fuel specific impulse; g_0 represents the average gravitational acceleration at sea level of the earth; ṁ represents the propellant mass flow rate after engine start; C_D represents the drag coefficient; S_ref represents the reference area of the rocket sub-stage; ρ_0 represents the reference atmospheric density at sea level of the earth; h represents the flight altitude of the rocket sub-stage; h_ref represents the reference height.
For the rocket sub-stage landing process, the system state and the system control can be expressed as:

x = [rᵀ, Vᵀ, m]ᵀ

wherein the system state x of the rocket comprises the rocket position, velocity and mass;

u = T = [T_x, T_y, T_z]ᵀ

wherein the system control u of the rocket is the thrust of the equivalent rocket engine;
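For illustration, a minimal Python sketch of this three-degree-of-freedom model is given below. The numerical constants (specific impulse, drag coefficient, reference area, atmospheric parameters) are placeholders chosen for the sketch and are not values taken from the patent:

import numpy as np

# Illustrative constants (placeholders, not values from the patent)
G0 = 9.80665      # average sea-level gravitational acceleration, m/s^2
ISP = 300.0       # fuel specific impulse, s
CD = 0.8          # drag coefficient
S_REF = 10.0      # rocket sub-stage reference area, m^2
RHO0 = 1.225      # reference sea-level atmospheric density, kg/m^3
H_REF = 7200.0    # reference height of the exponential atmosphere, m

def rocket_dynamics(state, thrust):
    """Right-hand side of the three-degree-of-freedom motion model.

    state  : np.ndarray [x, y, z, vx, vy, vz, m] in the landing-point frame
    thrust : [Tx, Ty, Tz] engine thrust components, N
    """
    r, v, m = state[:3], state[3:6], state[6]
    T = np.asarray(thrust, dtype=float)

    g = np.array([0.0, 0.0, -G0])                    # constant gravity (flat landing field)
    rho = RHO0 * np.exp(-r[2] / H_REF)               # exponential atmosphere
    drag = -0.5 * CD * S_REF * rho * np.linalg.norm(v) * v

    r_dot = v
    v_dot = g + (T + drag) / m
    m_dot = -np.linalg.norm(T) / (ISP * G0)          # propellant mass flow
    return np.concatenate([r_dot, v_dot, [m_dot]])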
s12, constructing a rocket landing Markov decision process model according to the rocket three-degree-of-freedom motion model; wherein, the rocket landing Markov decision process model comprises five elements: the state quantity S, the action quantity A, the return function R, the state transition probability P and the discount factor gamma; specifically, the step of constructing a rocket landing Markov decision process model according to the rocket three-degree-of-freedom motion model comprises the following steps:
based on the gaze-heuristic idea, converting the state variables of the rocket to obtain the state quantity of the rocket landing Markov decision process model; the state quantity S does not directly use the raw state variables of the rocket: the observed rocket state is converted using the gaze-heuristic idea so as to accelerate the convergence of the Agent's policy in the early stage of learning; it should be noted that the rocket system states referred to below are the state quantities obtained through this conversion; the state quantity S can be expressed as:
(the state quantity S, the line-of-sight vector V_sight and the remaining flight time t_go are defined by formulas given in the original as images; the explicitly stated relation is)

V_error = V - V_sight

wherein S represents the state quantity of the rocket landing Markov decision process model; r, V and V_0 respectively denote the rocket position vector, the rocket velocity vector and the initial rocket velocity; t_go denotes the remaining flight time of the rocket; r_z denotes the Z-axis component of the rocket position vector; V_sight denotes the line-of-sight vector; V_error denotes the error between the rocket velocity vector and the line-of-sight vector; λ denotes a parameter that adjusts the magnitude of the line-of-sight vector over time;
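A minimal Python sketch of this state conversion is given below. Because the exact line-of-sight velocity schedule and the composition of the five-dimensional state are given in the patent only as figure images, the V_sight profile and the inclusion of r_z used here are assumptions made for the sketch, not the patent's own formulas:

import numpy as np

def gaze_observation(r, v, t_go):
    """Convert the raw rocket state into the MDP observation (illustrative only).

    The line-of-sight velocity below simply steers toward the landing point over
    the remaining flight time; the patent's schedule (with its time-varying
    magnitude parameter lambda) is not reproduced here.
    """
    r = np.asarray(r, dtype=float)
    v = np.asarray(v, dtype=float)
    v_sight = -r / max(t_go, 1e-3)                    # assumed line-of-sight velocity profile
    v_error = v - v_sight                             # V_error = V - V_sight (as in the patent)
    return np.concatenate([v_error, [r[2], t_go]])    # assumed 5-D observation: [V_error, r_z, t_go]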
obtaining the action quantity of the rocket landing Markov decision process model according to the control instruction of the rocket; the action quantity can be understood as directly using the rocket's control command, i.e. the engine thrust, expressed as:

A = T = [T_x, T_y, T_z]ᵀ

wherein A represents the action quantity of the rocket landing Markov decision process model; T represents the engine thrust vector; T_x, T_y and T_z represent the X-axis, Y-axis and Z-axis components of the engine thrust, respectively; in addition, considering that the control commands given by the subsequent Agent policy do not by themselves constrain the magnitude of the output action, and in order to ensure that the output control command satisfies the thrust-magnitude constraint of the rocket engine, this embodiment preferably also clips the amplitude of the control command so that the engine thrust-magnitude constraint is strictly satisfied;
determining a return function design principle according to rocket fixed point soft landing requirements, and obtaining a return function of the rocket landing Markov decision process model according to the return function design principle; wherein, the return function design principle can be understood as rocket fixed point soft landing limiting conditions, such as:
(1) The rocket landing terminal position coincides with the landing site, i.e. r_f = 0;
(2) The rocket landing terminal velocity is zero, i.e. V_f = 0;
(3) The rocket landing terminal residual mass m_f is as large as possible, i.e. fuel consumption during flight is reduced as much as possible;
(4) Lateral maneuvering must not be too large during the rocket's landing flight;
it should be noted that the reward-function design principles may include, but are not limited to, the principles listed above, depending on the actual analysis; once the design principles are determined, the trajectory reward function can be split, in combination with the gaze-heuristic idea, into two parts: a process cumulative reward R_prog and a terminal reward R_term. The process cumulative reward is expressed as:
R_prog = α·||V_error|| + β·||F_use|| + η·P_glide

s.t. V_error = V - V_sight

F_use = ||A|| / T_max

(the track-slope quantity gs and the envelope-constraint term P_glide are defined in the original by formulas given as images)

wherein V_error is the error between the current rocket velocity V and the "line-of-sight" vector V_sight; F_use is the fuel consumption of the rocket at the current moment, which depends on the amplitude of the adopted command, where ||A|| is the thrust amplitude of the output control command and T_max is the maximum rocket thrust; gs (glideslope) denotes the track slope, and P_glide is an envelope constraint that limits the lateral maneuvering of the rocket: during landing, every time the altitude drops by more than 2 m between two states, the ratio gs between the longitudinal maneuver dr_z and the lateral maneuver sqrt(dr_x² + dr_y²) is computed; the remaining variables are initialization parameters, e.g. α = -0.01, β = -0.05 and η = -100 are the scale factors of the corresponding terms in the cumulative reward R_prog, while gs_limit = 0.1 and gs_τ = 0.05 respectively denote the minimum track slope and a scale factor of the P_glide envelope-constraint formula;
the terminal reward is expressed as:
R term =reward landing +P term
Figure BDA0003795449160000135
Figure BDA0003795449160000136
wherein, reward term Is a reward for satisfying the requirements of rocket landing terminal position and speed, P term Punishment is carried out when the transverse maneuver is too large at the moment before the rocket is landed; i V term | and | | r term | | respectively represents the module values of the terminal speed and the terminal position; gs is term Is the ratio between the longitudinal displacement and the transverse displacement of the rocket during landing, and the processThe calculation methods of gs in the constraint are consistent; the other variables V limit 、r limit And gs limit Setting parameters for initialization;
through the accumulated return of the process and the terminal reward return, the intelligent Agent can be guided to control the rocket to finish the aim of vertical fixed-point soft landing.
Discretizing the continuous rocket landing process with a preset period, and determining the state transition probability of the rocket landing Markov decision process model from the rocket's integrated dynamics; specifically, the state transition probability P is expressed as:

P(s_{τ+1} = f(s_τ, a_τ) | s_τ, a_τ) = 1

wherein s_τ and a_τ respectively denote the current state of the system at time τ and the action currently taken; s_{τ+1} denotes the state of the system at time τ+1; f(s, a) denotes the system's state-transition dynamics equation; P denotes the probability of transitioning from state s_τ at time τ to state s_{τ+1} at time τ+1 given the state quantity s_τ and the action quantity a_τ;

correspondingly, the discount factor γ in the rocket landing Markov decision process model is used to attenuate the cumulative reward of future steps along the trajectory over time, and is preferably taken as 0.95.
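A sketch of the deterministic transition s_{τ+1} = f(s_τ, a_τ) is shown below: the dynamics are integrated over one guidance period. The period length and the RK4 integrator are illustrative choices, and rocket_dynamics refers to the earlier sketch:

def transition(state, action, dt=0.1):
    """Deterministic state transition s_{tau+1} = f(s_tau, a_tau): one RK4 step of the
    three-degree-of-freedom dynamics over the (assumed) discretization period dt.
    state is an np.ndarray [x, y, z, vx, vy, vz, m]."""
    k1 = rocket_dynamics(state, action)
    k2 = rocket_dynamics(state + 0.5 * dt * k1, action)
    k3 = rocket_dynamics(state + 0.5 * dt * k2, action)
    k4 = rocket_dynamics(state + dt * k3, action)
    return state + (dt / 6.0) * (k1 + 2.0 * k2 + 2.0 * k3 + k4)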
S13, constructing an intelligent Agent according to the rocket landing Markov decision process model, and performing interactive training on the intelligent Agent and a pre-constructed rocket landing flight simulation environment to obtain a landing control Agent; the intelligent Agent comprises a value function-based neural network and a strategy-based neural network, and specifically, the step of constructing the intelligent Agent according to the rocket landing Markov decision process model comprises the following steps:
selecting the proximal policy optimization algorithm as the reinforcement learning algorithm of the intelligent Agent according to the rocket landing Markov decision process model; the proximal policy optimization algorithm can be understood as an improved PPO algorithm used to train the intelligent Agent on the rocket sub-stage landing task;
constructing the value-function-based neural network and the policy-based neural network from multilayer perceptron models in accordance with the proximal policy optimization algorithm; the input of the policy-based neural network is the system state S obtained by processing the Agent's observations, and the corresponding output is the thrust-vector control command value A for rocket landing; the value-function-based neural network is used to accelerate the convergence of the policy-based neural network: trained on the rocket state values x of the trajectories in the episode simulations and the corresponding actual cumulative returns Q(s, a), it predicts the expected cumulative return V(s) of a given state;
the neural network based on the strategy and the neural network based on the value function both adopt a 3-hidden-layer structure, the hidden-layer activation function adopts a tanh function, the output-layer activation function adopts a linear activation function, and the number n of neurons in the input layer based on the strategy and the value function neural networks in Are all 5, the dimension of the state quantity S; the output layer of the strategy-based neural network comprises 3 neurons which respectively correspond to rocket three-dimensional thrust components, and the output layer of the value-function-based neural network only comprises one neuron which corresponds to expected accumulated return. Specifically, the specific structural parameters of the policy-based neural network shown in fig. 4 and the value-based function neural network shown in fig. 5 are shown in table 1:
TABLE 1. Structural parameters of the policy-based neural network and the value-function-based neural network (given in the original as an image and not reproduced here).
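As an illustration of this architecture, a minimal PyTorch sketch is given below; the hidden-layer widths are assumptions, since Table 1 is available only as an image:

import torch
import torch.nn as nn

class MLP(nn.Module):
    """Multilayer perceptron with 3 tanh hidden layers and a linear output layer."""
    def __init__(self, n_in, n_out, hidden=(64, 64, 64)):   # hidden widths assumed (Table 1 not reproduced)
        super().__init__()
        layers, last = [], n_in
        for width in hidden:
            layers += [nn.Linear(last, width), nn.Tanh()]
            last = width
        layers.append(nn.Linear(last, n_out))                # linear output activation
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

policy_net = MLP(n_in=5, n_out=3)   # state quantity S (5-D) -> three thrust components
value_net = MLP(n_in=5, n_out=1)    # state quantity S (5-D) -> expected cumulative return V(s)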
The rocket landing flight simulation environment can be understood as a simulation model which is constructed based on a rocket landing dynamics model and is used for simulating rocket landing flight; specifically, the rocket landing flight simulation environment construction step includes:
building a rocket landing operation environment based on the rocket three-degree-of-freedom motion model, and synchronously building a corresponding initial-value condition generator and flight-termination judge to obtain the rocket landing flight simulation environment; the initial-value condition generator can be understood as an initial-state selector that randomly selects an initial state from the preset initial-state space to start each round of trajectory simulation; the flight-termination judge can be understood as a motion-state detector that applies both abnormal and normal termination criteria for the rocket flight to detect in real time whether the landing has terminated;
after the intelligent Agent and the rocket landing flight simulation environment have been constructed by the above steps, each episode of the rocket landing operation environment can be simulated through continuous interaction between the Agent and the environment as follows: the initial-value condition generator first randomly selects an initial rocket landing state from the initial-state space; the Agent then guides the rocket landing flight with the control commands fitted by the policy-based neural network from the observed system state; when the rocket lands successfully, or flight is stopped early because a cut-off condition is reached, the episode simulation ends and one complete rocket landing flight trajectory has been produced; with different initial states, the reinforcement-learning training is completed after many rounds of such episode simulations;
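A minimal skeleton of such an environment is sketched below; the initial-state bounds, dry mass and termination criteria are illustrative assumptions, and transition refers to the earlier sketch:

import numpy as np

class RocketLandingEnv:
    """Skeleton of the rocket landing flight simulation environment (illustrative)."""

    def __init__(self, init_low, init_high, dry_mass=1000.0, dt=0.1):
        self.init_low = np.asarray(init_low, dtype=float)     # lower bounds of the initial-state space
        self.init_high = np.asarray(init_high, dtype=float)   # upper bounds of the initial-state space
        self.dry_mass = dry_mass                               # assumed structural (dry) mass, kg
        self.dt = dt
        self.state = None

    def reset(self):
        """Initial-value condition generator: pick a random initial state for this episode."""
        self.state = np.random.uniform(self.init_low, self.init_high)
        return self.state

    def terminated(self):
        """Flight-termination judge: touchdown or abnormal cut-off (assumed criteria)."""
        altitude, mass = self.state[2], self.state[6]
        return altitude <= 0.0 or mass <= self.dry_mass

    def step(self, action):
        self.state = transition(self.state, action, self.dt)  # deterministic dynamics transition
        done = self.terminated()
        reward = 0.0    # placeholder: process/terminal rewards as described above
        return self.state, reward, done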
specifically, the step of interactively training the intelligent Agent and a pre-constructed rocket landing flight simulation environment to obtain the landing control Agent comprises the following steps:
training a strategy-based neural network of the intelligent Agent until convergence through interactive simulation of the intelligent Agent and the rocket landing flight simulation environment to obtain the landing control Agent; correspondingly, the training process of the landing control Agent is as follows:
randomly selecting an initial state to be simulated from a preset initial state space according to the initial value condition generator;
according to the initial state to be simulated, executing the interactive simulation between the intelligent Agent and the flight simulation environment; when the simulation termination condition preset by the flight-termination judge is reached, the simulated flight of the current round is terminated, the cumulative return value of each state point in the current simulated flight trajectory is evaluated with the reward function, and the parameters of the Agent's value-function-based neural network are updated according to the cumulative return values; in the PPO learning framework, in each episode simulation the intelligent Agent obtains, through interaction with the flight simulation environment, a complete trajectory of observation states, actions and rewards (s_l, a_l, r_l), where s_l is the environment state observed by the Agent, a_l is the action taken by the Agent according to the observation, and r_l is the reward fed back by the environment, generally written as a function of s_l and a_l; the trajectory from time k to the episode end time T can be written as (s_k, a_k, ..., s_T, a_T), and its cumulative discounted return can be expressed as:

R_k = Σ_{l=k}^{T} γ^(l-k) · r_l

wherein γ ∈ [0, 1] is the discount factor applied to the reward of each time node in the trajectory; the goal of the reinforcement learning algorithm is to find a policy that maximizes the expected cumulative discounted return of the trajectory;
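A short sketch of computing this cumulative discounted return for every step of one episode (the discount factor follows the preferred value stated above):

def discounted_returns(rewards, gamma=0.95):
    """Cumulative discounted return R_k = sum_{l>=k} gamma^(l-k) * r_l for each step k."""
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns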
predicting the expected cumulative return value of each state point in the current simulated flight trajectory with the updated value-function-based neural network of the intelligent Agent, calculating the advantage function from the cumulative return values and the expected return values, and updating the parameters of the Agent's policy-based neural network according to the advantage function; the advantage function can be expressed as:

A(s, a) = Q(s, a) - V(s)

wherein A(s, a), Q(s, a) and V(s) denote the advantage function, the cumulative return value and the expected cumulative return value, respectively;
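The advantage values and a policy update step might look as follows. The clipped surrogate shown is the standard PPO objective, not necessarily the patent's "improved" variant, and the clipping parameter is an assumed value:

import torch

def compute_advantages(returns, values):
    """Advantage A(s,a) = Q(s,a) - V(s) for every visited state point."""
    return [q - v for q, v in zip(returns, values)]

def ppo_policy_loss(new_logp, old_logp, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate loss (clip_eps assumed)."""
    adv = torch.as_tensor(advantages, dtype=torch.float32)
    ratio = torch.exp(new_logp - old_logp)                     # pi_new(a|s) / pi_old(a|s)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * adv, clipped * adv).mean()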
and judging whether the strategy-based neural network of the intelligent Agent reaches a preset convergence condition, if so, stopping simulation training to obtain the landing control Agent, otherwise, reselecting an initial state to be simulated according to the initial condition generator, and starting the next round of interactive simulation training.
It should be noted that, during reinforcement learning, in order to improve the convergence rate of the networks and avoid reaching the saturation region of the hidden-layer activation function as far as possible, the network inputs are preferably normalized: for the input data of the network, the mean and standard deviation of each dimension are computed and the data are scaled as

x̂ = (x - μ) / σ

wherein μ and σ denote the per-dimension mean and standard deviation of the input data.
Meanwhile, for the output of the policy-based neural network, in order to satisfy the thrust constraint, a clipping operation is preferably applied to the total amplitude of the output thrust command (the clipping formula is given in the original as an image), wherein A and Ã respectively denote the thrust commands before and after the amplitude-clipping operation; the definitions of the other variables are as described above and are not repeated here;
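One plausible implementation of such an amplitude clip keeps the commanded direction and forces the magnitude into [T_min, T_max]; this is an assumption about the exact operation, which the patent gives only as an image:

import numpy as np

def clip_thrust(command, t_min, t_max):
    """Clip the total amplitude of the thrust command into [t_min, t_max] (assumed form)."""
    command = np.asarray(command, dtype=float)
    norm = np.linalg.norm(command)
    if norm < 1e-8:
        return np.array([0.0, 0.0, t_min])      # assumed fallback for a near-zero command
    return command * (np.clip(norm, t_min, t_max) / norm)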
S14, generating real-time control instructions according to the landing control Agent, and guiding the rocket landing flight according to the real-time control instructions; after the landing control Agent has been obtained through the interactive simulation training above, the trained policy-based neural network can be used for online rocket landing guidance: without any further aid from the value-function-based neural network, it gives the corresponding control command in real time according to the rocket's state during flight, thereby guiding the rocket to complete a high-precision landing in an environment containing deviations.
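An online guidance loop built from the earlier sketches might look as follows; the thrust bounds, initial-state bounds and initial time-to-go estimate are assumed values used only for illustration:

import numpy as np
import torch

T_MIN, T_MAX = 200_000.0, 800_000.0     # assumed engine thrust bounds, N

env = RocketLandingEnv(init_low=[-500.0, -500.0, 1500.0, -60.0, -60.0, -90.0, 20000.0],
                       init_high=[500.0, 500.0, 2500.0, 60.0, 60.0, -60.0, 22000.0])

state = env.reset()
t_go = 30.0                              # assumed initial time-to-go estimate, s
done = False
while not done:
    obs = gaze_observation(state[:3], state[3:6], t_go)
    with torch.no_grad():
        raw_cmd = policy_net(torch.as_tensor(obs, dtype=torch.float32)).numpy()
    thrust_cmd = clip_thrust(raw_cmd, T_MIN, T_MAX)   # enforce the engine thrust-magnitude constraint
    state, _, done = env.step(thrust_cmd)
    t_go = max(t_go - env.dt, 0.0)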
According to the above method, an engine thrust command that copes with wide-ranging rocket model deviations and environmental disturbances can be mapped from the rocket's real-time state, enabling high-precision fixed-point soft landing under complex uncertainty. Moreover, a deep neural network is used as the policy network for reinforcement learning and is trained in simulation with an improved PPO algorithm, so that commands in a high-dimensional continuous action space are fitted effectively; the state quantities are set with the gaze-heuristic method, and different discount rates are designed for the terminal and process indices of the landing trajectory, which accelerates convergence and effectively improves the learning efficiency of the fixed-point soft-landing policy. Compared with the prior art, the method has extremely high real-time performance and strong algorithmic robustness, adapts to relatively wide modeling deviations, can still guide the rocket to a high-precision fixed-point soft landing under operating conditions with uncertain environmental disturbances, and therefore has high application value.
It should be noted that, although the steps in the above flowcharts are shown in a sequence indicated by arrows, they are not necessarily executed in that sequence. Unless explicitly stated otherwise herein, the order of these steps is not strictly limited, and they may be performed in other orders.
In one embodiment, as shown in fig. 6, there is provided a reinforcement learning-based rocket landing real-time robust guidance system, the system comprising:
the motion model building module 1 is used for building a rocket three-degree-of-freedom motion model according to acting force borne by the rocket in the earth landing power descent stage;
the optimization model building module 2 is used for building a rocket landing Markov decision process model according to the rocket three-degree-of-freedom motion model;
the control strategy training module 3 is used for constructing an intelligent Agent according to the rocket landing Markov decision process model and carrying out interactive training on the intelligent Agent and a pre-constructed rocket landing flight simulation environment to obtain a landing control Agent; the intelligent Agent comprises a value function-based neural network and a strategy-based neural network;
and the rocket landing guidance module 4 is used for generating a real-time control instruction according to the landing control Agent and guiding the rocket to land and fly according to the real-time control instruction.
For specific limitations of a rocket landing real-time robust guidance system based on reinforcement learning, reference may be made to the above limitations of a rocket landing real-time robust guidance method based on reinforcement learning, which are not described herein again. All modules in the rocket landing real-time robust guidance system based on reinforcement learning can be completely or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
Fig. 7 shows an internal structure diagram of a computer device in one embodiment, and the computer device may be specifically a terminal or a server. As shown in fig. 7, the computer apparatus includes a processor, a memory, a network interface, a display, and an input device, which are connected through a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to realize a rocket landing real-time robust guidance method based on reinforcement learning. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
It will be appreciated by those of ordinary skill in the art that the architecture shown in FIG. 7 is merely a block diagram of part of the architecture associated with aspects of the present application and does not limit the computing devices to which these aspects may be applied; a particular computing device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the steps of the above method being performed when the computer program is executed by the processor.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the above-mentioned method.
In summary, the rocket landing real-time robust guidance method and system based on reinforcement learning provided by the embodiments of the present invention construct a rocket three-degree-of-freedom motion model from the forces acting on the rocket during the Earth-landing powered-descent flight, construct a rocket landing Markov decision process model from the motion model, build an intelligent Agent from the decision process model, train the Agent through interactive simulation with a pre-constructed rocket landing flight simulation environment to obtain a landing control Agent, and guide the rocket landing flight with the real-time control instructions generated by the landing control Agent. The engine thrust command is thus mapped from the rocket's real-time state, and the issued commands are robust to wide-ranging rocket model deviations and environmental disturbances, so that a high-precision fixed-point soft landing is guided under complex uncertain operating conditions. A deep neural network is used as the policy network of the reinforcement learning and is trained with an improved PPO algorithm, effectively fitting high-dimensional continuous action-space commands; the gaze-heuristic setting of the state quantities and the different discount rates designed for the terminal and process indices of the landing trajectory accelerate policy convergence and improve learning efficiency, providing a reliable guarantee for guiding the rocket sub-stage to a high-precision fixed-point soft landing, with high application value.
The embodiments in this specification are described in a progressive manner, and all the same or similar parts of the embodiments are directly referred to each other, and each embodiment is described with emphasis on differences from other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment. It should be noted that, the technical features of the embodiments may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments only express several preferred embodiments of the present application, and the description thereof is specific and detailed, but not to be understood as limiting the scope of the invention. It should be noted that, for those skilled in the art, various modifications and substitutions can be made without departing from the technical principle of the present invention, and these should be construed as the protection scope of the present application. Therefore, the protection scope of the present patent shall be subject to the protection scope of the claims.

Claims (10)

1. A rocket landing real-time robust guidance method based on reinforcement learning is characterized by comprising the following steps:
constructing a rocket three-degree-of-freedom motion model according to the forces acting on the rocket during the earth landing powered descent stage of flight;
constructing a rocket landing Markov decision process model according to the rocket three-degree-of-freedom motion model;
constructing an intelligent Agent according to the rocket landing Markov decision process model, and performing interactive training between the intelligent Agent and a pre-constructed rocket landing flight simulation environment to obtain a landing control Agent, wherein the intelligent Agent comprises a value-function-based neural network and a strategy-based neural network;
and generating a real-time control instruction according to the landing control Agent, and guiding the rocket landing flight according to the real-time control instruction.
2. A rocket landing real-time robust guidance method based on reinforcement learning as claimed in claim 1, wherein said step of constructing a rocket three-degree-of-freedom motion model according to the forces acting on the rocket during the earth landing powered descent stage of flight comprises:
establishing a landing point coordinate system with the rocket sub-stage target landing point as the origin; the landing point coordinate system takes the target landing point of the rocket sub-stage landing as the coordinate origin O, the direction pointing vertically upward away from the Earth's center as coordinate axis Oz, the main flight direction of the rocket during landing as coordinate axis Ox, and the direction perpendicular to the plane xOz that forms a right-handed rectangular coordinate system with axes Ox and Oz as coordinate axis Oy;
based on the landing point coordinate system, performing a force analysis of the rocket flying in the earth landing powered descent stage, and determining the corresponding Earth gravitational force, aerodynamic drag and engine thrust;
constructing the rocket three-degree-of-freedom motion model according to the Earth gravitational force, the aerodynamic drag and the engine thrust; the rocket three-degree-of-freedom motion model is expressed as follows (reconstructed from the variable definitions; the original filing gives the formulas as images):
$$\dot{\boldsymbol{r}} = \boldsymbol{V}, \qquad \dot{\boldsymbol{V}} = \boldsymbol{g}(\boldsymbol{r}) + \frac{\boldsymbol{T} + \boldsymbol{D}}{m}, \qquad \dot{m} = -\dot{m}_p = -\frac{\lVert \boldsymbol{T} \rVert}{I_{sp}\, g_0}$$
with the aerodynamic drag modeled as
$$\boldsymbol{D} = -\tfrac{1}{2}\, \rho_0\, e^{-h/h_{ref}}\, C_D\, S_{ref}\, \lVert \boldsymbol{V} \rVert\, \boldsymbol{V}$$
where r represents the rocket position vector; V represents the rocket velocity vector; m represents the rocket mass; g(r) represents the gravitational acceleration vector acting on the rocket; T represents the engine thrust vector; D represents the aerodynamic drag vector; I_sp represents the propellant specific impulse; g_0 represents the average gravitational acceleration at the Earth's sea level; ṁ_p represents the propellant mass flow rate after engine ignition; C_D represents the drag coefficient; S_ref represents the reference area of the rocket sub-stage; ρ_0 represents the reference atmospheric density at the Earth's sea level; h represents the flight height of the rocket sub-stage; and h_ref represents the reference height.
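For illustration, a minimal Python sketch of three-degree-of-freedom landing dynamics of the kind recited in claim 2, assuming point-mass inverse-square gravity and an exponential atmosphere; the constants, function signature and state layout are assumptions for the sketch, not values from the filing.

```python
import numpy as np

G0 = 9.80665          # average sea-level gravitational acceleration, m/s^2
MU = 3.986004418e14   # Earth gravitational parameter, m^3/s^2 (assumed gravity model)
RE = 6.371e6          # mean Earth radius, m

def three_dof_rhs(x, thrust, isp, cd, s_ref, rho0, h_ref):
    """Right-hand side of an assumed 3-DOF landing model, x = [r(3), v(3), m]."""
    r, v, m = x[0:3], x[3:6], x[6]
    h = r[2]                                         # height above the landing point
    g = np.array([0.0, 0.0, -MU / (RE + h) ** 2])    # gravity toward the Earth center
    rho = rho0 * np.exp(-h / h_ref)                  # exponential atmosphere
    drag = -0.5 * rho * cd * s_ref * np.linalg.norm(v) * v
    mdot = np.linalg.norm(thrust) / (isp * G0)       # propellant mass flow rate
    return np.concatenate([v, g + (thrust + drag) / m, [-mdot]])
```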
3. A rocket landing real-time robust guidance method based on reinforcement learning as claimed in claim 1, wherein said step of constructing a rocket landing Markov decision process model according to said rocket three-degree-of-freedom motion model comprises:
based on a line-of-sight-inspired concept, converting the state variables of the rocket to obtain the state quantity of the rocket landing Markov decision process model, wherein the state quantity S is assembled from the quantities defined below (the full state-quantity, remaining-flight-time and line-of-sight-vector expressions are given as formula images in the original filing), and the velocity error satisfies
V_error = V - V_sight
wherein S represents the state quantity of the rocket landing Markov decision process model; r, V and V_0 respectively represent the rocket position vector, the rocket velocity vector and the rocket initial velocity; t_go represents the remaining flight time of the rocket; r_z represents the Z-axis component of the rocket position vector; V_sight represents the line-of-sight vector; V_error represents the error between the rocket velocity vector and the line-of-sight vector; and λ represents a parameter that adjusts the magnitude of the line-of-sight vector over time;
obtaining the action quantity of the rocket landing Markov decision process model according to the control instruction of the rocket; the action quantity is expressed as (reconstructed from the component definitions):
A = T = [T_x, T_y, T_z]^T
wherein A represents the action quantity of the rocket landing Markov decision process model; T represents the engine thrust vector; and T_x, T_y and T_z respectively represent the X-axis, Y-axis and Z-axis components of the engine thrust;
determining a return function design principle according to the rocket fixed-point soft landing requirements, and obtaining the return function of the rocket landing Markov decision process model according to the return function design principle;
discretizing the continuous rocket landing process according to a preset period, and determining the state transition probability of the rocket landing Markov decision process model according to the numerical integration of the rocket dynamics.
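For illustration, a hedged Python sketch of a state quantity and shaped return of the kind described in claim 3. The time-to-go estimate, line-of-sight velocity field and reward weights shown are assumptions drawn from common practice, since the exact expressions of the filing are given as formula images.

```python
import numpy as np

def build_state(r, v, v0, lam):
    """Assemble an observation of the kind claim 3 describes (assumed expressions)."""
    t_go = np.linalg.norm(r) / max(np.linalg.norm(v), 1e-6)   # assumed time-to-go
    # Assumed line-of-sight velocity field: point at the pad, shrink as t_go shrinks.
    v_sight = -v0 * (r / max(np.linalg.norm(r), 1e-6)) * (1.0 - np.exp(-lam * t_go))
    v_error = v - v_sight
    return np.concatenate([v_error, [t_go, r[2]]])

def step_return(r, v, thrust, landed, crashed,
                w_pos=1e-3, w_vel=1e-2, w_fuel=1e-4):
    """Assumed shaped return for fixed-point soft landing: penalize position and
    velocity error and fuel use each step, plus terminal bonuses/penalties."""
    value = -w_pos * np.linalg.norm(r) - w_vel * np.linalg.norm(v) \
            - w_fuel * np.linalg.norm(thrust)
    if landed:
        value += 10.0
    if crashed:
        value -= 10.0
    return value
```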
4. A rocket landing real-time robust guidance method based on reinforcement learning according to claim 1, wherein said step of constructing an intelligent Agent according to said rocket landing Markov decision process model comprises:
selecting the proximal policy optimization (PPO) algorithm as the reinforcement learning algorithm of the intelligent Agent according to the rocket landing Markov decision process model;
and constructing the value-function-based neural network and the strategy-based neural network according to a multilayer perceptron model based on the proximal policy optimization algorithm.
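For illustration, a minimal PyTorch sketch of the value-function-based and strategy-based neural networks described in claim 4, built as multilayer perceptrons for a PPO-style agent. The layer widths, activations and Gaussian action distribution are assumptions; the filing does not fix them in this claim.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Multilayer-perceptron strategy network (actor): state -> action distribution."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim), nn.Tanh(),   # mean action in [-1, 1]
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs):
        mean = self.body(obs)
        return torch.distributions.Normal(mean, self.log_std.exp())

class ValueNet(nn.Module):
    """Multilayer-perceptron value-function network (critic): state -> expected return."""
    def __init__(self, obs_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs):
        return self.body(obs).squeeze(-1)
```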
5. A rocket landing real-time robust guidance method based on reinforcement learning according to claim 1, wherein the rocket landing flight simulation environment construction step comprises:
and constructing a rocket landing operation environment based on the rocket three-degree-of-freedom motion model, and synchronously constructing a corresponding initial value condition generator and a corresponding flight termination determiner to obtain the rocket landing flight simulation environment.
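For illustration, a skeleton of a rocket landing flight simulation environment of the kind described in claim 5, with a random initial value condition generator and a flight termination determiner. The dispersion bounds, step size and the omitted dynamics-integration step are assumptions for the sketch.

```python
import numpy as np

class LandingEnv:
    """Minimal sketch of a landing flight simulation environment; the step()
    method that integrates the three-degree-of-freedom dynamics is omitted."""
    def __init__(self, dt=0.1, rng=None):
        self.dt = dt
        self.rng = rng or np.random.default_rng()

    def reset(self):
        # Initial value condition generator: dispersed position, velocity and mass.
        self.r = self.rng.uniform([-200.0, -200.0, 1500.0], [200.0, 200.0, 2500.0])
        self.v = self.rng.uniform([-20.0, -20.0, -90.0], [20.0, 20.0, -60.0])
        self.m = self.rng.uniform(20e3, 25e3)
        return np.concatenate([self.r, self.v, [self.m]])

    def terminated(self):
        # Flight termination determiner: reaching the ground or exhausting propellant.
        return self.r[2] <= 0.0 or self.m <= 15e3
```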
6. A rocket landing real-time robust guidance method based on reinforcement learning as claimed in claim 5, wherein said step of interactive training said intelligent Agent with a pre-constructed rocket landing flight simulation environment to obtain a landing control Agent comprises:
and training a strategy-based neural network of the intelligent Agent until convergence through interactive simulation of the intelligent Agent and the rocket landing flight simulation environment to obtain the landing control Agent.
7. A rocket landing real-time robust guidance method based on reinforcement learning as claimed in claim 6, wherein said step of training the strategy-based neural network of the intelligent Agent until convergence through interactive simulation of said intelligent Agent and said rocket landing flight simulation environment to obtain said landing control Agent comprises:
randomly selecting an initial state to be simulated from a preset initial state space according to the initial value condition generator;
according to the initial state to be simulated, executing interactive simulation between the intelligent Agent and the flight simulation environment, terminating the current round of simulated flight when a simulation termination condition preset by the flight termination determiner is reached, evaluating the accumulated return value of each state point in the current simulated flight trajectory according to the return function, and updating the parameters of the value-function-based neural network of the intelligent Agent according to the accumulated return values;
predicting the expected accumulated return value of each state point in the current simulated flight trajectory according to the updated value-function-based neural network of the intelligent Agent, calculating an advantage function according to the accumulated return values and the expected accumulated return values, and updating the parameters of the strategy-based neural network of the intelligent Agent according to the advantage function;
and judging whether the strategy-based neural network of the intelligent Agent reaches a preset convergence condition; if so, stopping the simulation training to obtain the landing control Agent; otherwise, reselecting an initial state to be simulated according to the initial value condition generator and starting the next round of interactive simulation training.
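For illustration, a simplified Python sketch of one round of the interactive simulation training described in claim 7: roll out a trajectory, accumulate discounted returns, update the value-function-based network, form the advantage, and apply a clipped PPO update to the strategy-based network. The single-trajectory update, the `env.step` interface and the hyperparameters are assumptions, and the sketch omits batching, GAE and other refinements of a practical PPO implementation.

```python
import torch

def training_round(env, policy_net, value_net, pi_opt, v_opt,
                   gamma=0.99, clip=0.2, max_steps=2000):
    """One round of interactive simulation training (simplified PPO-style update)."""
    obs_buf, act_buf, logp_buf, rew_buf = [], [], [], []
    obs = torch.tensor(env.reset(), dtype=torch.float32)
    for _ in range(max_steps):
        dist = policy_net(obs)
        act = dist.sample()
        nxt, rew, done = env.step(act.numpy())        # assumed env.step interface
        obs_buf.append(obs); act_buf.append(act)
        logp_buf.append(dist.log_prob(act).sum()); rew_buf.append(rew)
        obs = torch.tensor(nxt, dtype=torch.float32)
        if done:
            break
    # Accumulated (discounted) return of every visited state point.
    returns, g = [], 0.0
    for r in reversed(rew_buf):
        g = r + gamma * g
        returns.insert(0, g)
    obs_t = torch.stack(obs_buf)
    ret_t = torch.tensor(returns, dtype=torch.float32)
    # Update the value-function-based network toward the observed returns.
    v_loss = ((value_net(obs_t) - ret_t) ** 2).mean()
    v_opt.zero_grad(); v_loss.backward(); v_opt.step()
    # Advantage = observed return minus predicted return; clipped PPO policy update.
    adv = ret_t - value_net(obs_t).detach()
    dist = policy_net(obs_t)
    logp_new = dist.log_prob(torch.stack(act_buf)).sum(-1)
    ratio = torch.exp(logp_new - torch.stack(logp_buf).detach())
    pi_loss = -torch.min(ratio * adv,
                         torch.clamp(ratio, 1 - clip, 1 + clip) * adv).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()
    return ret_t.sum().item()     # e.g. monitored across rounds for convergence
```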
8. A rocket landing real-time robust guidance system based on reinforcement learning, which is characterized in that the system comprises:
the motion model building module is used for building a rocket three-degree-of-freedom motion model according to the forces acting on the rocket during the earth landing powered descent stage of flight;
the optimization model building module is used for building a rocket landing Markov decision process model according to the rocket three-degree-of-freedom motion model;
the control strategy training module is used for constructing an intelligent Agent according to the rocket landing Markov decision process model and performing interactive training between the intelligent Agent and a pre-constructed rocket landing flight simulation environment to obtain a landing control Agent, wherein the intelligent Agent comprises a value-function-based neural network and a strategy-based neural network;
and the rocket landing guidance module is used for generating a real-time control instruction according to the landing control Agent and guiding the rocket landing flight according to the real-time control instruction.
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202210972207.XA 2022-08-12 2022-08-12 Rocket landing real-time robust guidance method and system based on reinforcement learning Active CN115524964B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210972207.XA CN115524964B (en) 2022-08-12 2022-08-12 Rocket landing real-time robust guidance method and system based on reinforcement learning


Publications (2)

Publication Number Publication Date
CN115524964A CN115524964A (en) 2022-12-27
CN115524964B true CN115524964B (en) 2023-04-11

Family

ID=84696584

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210972207.XA Active CN115524964B (en) 2022-08-12 2022-08-12 Rocket landing real-time robust guidance method and system based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN115524964B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110687918A (en) * 2019-10-17 2020-01-14 哈尔滨工程大学 Underwater robot trajectory tracking control method based on regression type neural network online approximation
CN113486938A (en) * 2021-06-28 2021-10-08 重庆大学 Multi-branch time convolution network-based re-landing analysis method and device
CN114265308A (en) * 2021-09-08 2022-04-01 哈尔滨工程大学 Anti-saturation model-free preset performance track tracking control method for autonomous water surface vehicle

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5959574A (en) * 1993-12-21 1999-09-28 Colorado State University Research Foundation Method and system for tracking multiple regional objects by multi-dimensional relaxation
US7454037B2 (en) * 2005-10-21 2008-11-18 The Boeing Company System, method and computer program product for adaptive video processing
US10579053B2 (en) * 2016-09-08 2020-03-03 Ge Aviation Systems, Llc Aircraft control based on fuel, time, and deterioration costs
US10739736B2 (en) * 2017-07-11 2020-08-11 General Electric Company Apparatus and method for event detection and duration determination
CN109343341B (en) * 2018-11-21 2021-10-01 北京航天自动控制研究所 Carrier rocket vertical recovery intelligent control method based on deep reinforcement learning
US20210123741A1 (en) * 2019-10-29 2021-04-29 Loon Llc Systems and Methods for Navigating Aerial Vehicles Using Deep Reinforcement Learning
CN111338375B (en) * 2020-02-27 2024-02-23 中国科学院国家空间科学中心 Control method and system for mobile landing of four-rotor unmanned aerial vehicle based on hybrid strategy
CN111766782B (en) * 2020-06-28 2021-07-13 浙江大学 Strategy selection method based on Actor-Critic framework in deep reinforcement learning
CN112069997B (en) * 2020-09-04 2023-07-28 中山大学 Unmanned aerial vehicle autonomous landing target extraction method and device based on DenseHR-Net
CN112278334B (en) * 2020-11-06 2022-07-01 北京登火汇智科技有限公司 Method for controlling the landing process of a rocket
US11613384B2 (en) * 2021-01-25 2023-03-28 Brian Haney Precision landing for rockets using deep reinforcement learning
CN112818599B (en) * 2021-01-29 2022-06-14 四川大学 Air control method based on reinforcement learning and four-dimensional track
CN113065709B (en) * 2021-04-13 2023-06-30 西北工业大学 Cross-domain heterogeneous cluster path planning method based on reinforcement learning
CN113359843B (en) * 2021-07-02 2023-06-20 成都睿沿芯创科技有限公司 Unmanned aerial vehicle autonomous landing method and device, electronic equipment and storage medium


Also Published As

Publication number Publication date
CN115524964A (en) 2022-12-27

Similar Documents

Publication Publication Date Title
CN109343341B (en) Carrier rocket vertical recovery intelligent control method based on deep reinforcement learning
CN110806759B (en) Aircraft route tracking method based on deep reinforcement learning
CN111027143B (en) Shipboard aircraft approach guiding method based on deep reinforcement learning
CN104007665A (en) Flight simulation test system for solid-liquid power aircraft
CN103995540A (en) Method for rapidly generating finite time track of hypersonic aircraft
CN112162564A (en) Unmanned aerial vehicle flight control method based on simulation learning and reinforcement learning algorithm
CN111240345A (en) Underwater robot trajectory tracking method based on double BP network reinforcement learning framework
Bairstow Reentry guidance with extended range capability for low L/D spacecraft
CN111538255A (en) Aircraft control method and system for anti-swarm unmanned aerial vehicle
CN114253296A (en) Airborne trajectory planning method and device for hypersonic aircraft, aircraft and medium
CN112859889A (en) Autonomous underwater robot control method and system based on self-adaptive dynamic planning
CN113671825A (en) Maneuvering intelligent decision missile avoidance method based on reinforcement learning
CN113504723B (en) Carrier rocket load shedding control method based on inverse reinforcement learning
CN115524964B (en) Rocket landing real-time robust guidance method and system based on reinforcement learning
CN116697829A (en) Rocket landing guidance method and system based on deep reinforcement learning
CN116820134A (en) Unmanned aerial vehicle formation maintaining control method based on deep reinforcement learning
CN109062044B (en) Terminal iterative learning docking control method
CN111272173A (en) Gradient solving iterative guidance method considering earth rotation and large yaw angle
CN115289917B (en) Rocket sublevel landing real-time optimal guidance method and system based on deep learning
CN116339373A (en) Monte Carlo self-adaptive dynamic programming unmanned aerial vehicle control method and system
CN112380692B (en) Method for planning online track in atmosphere of carrier rocket
Dziuk et al. Fuzzy logic controlled UAV autopilot using C-Mean clustering
CN112800546B (en) Method and device for analyzing controllability of rocket vertical recovery state
CN113778117A (en) Multi-stage pseudo-spectrum method for intelligently selecting initial values for planning longitudinal optimal paths of airplanes
Barron et al. Improved indirect method for air-vehicle trajectory optimization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant