CN115524964A - A real-time robust guidance method and system for rocket landing based on reinforcement learning - Google Patents
- Publication number
- CN115524964A (Application CN202210972207.XA)
- Authority
- CN
- China
- Prior art keywords
- rocket
- landing
- flight
- real
- intelligent agent
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000000034 method Methods 0.000 title claims abstract description 129
- 230000002787 reinforcement Effects 0.000 title claims abstract description 37
- 230000008569 process Effects 0.000 claims abstract description 73
- 238000004088 simulation Methods 0.000 claims abstract description 64
- 238000012549 training Methods 0.000 claims abstract description 29
- 238000004422 calculation algorithm Methods 0.000 claims abstract description 23
- 230000002452 interceptive effect Effects 0.000 claims abstract description 20
- 239000003795 chemical substances by application Substances 0.000 claims description 156
- 238000013528 artificial neural network Methods 0.000 claims description 58
- 230000006870 function Effects 0.000 claims description 55
- 230000009471 action Effects 0.000 claims description 18
- 238000004590 computer program Methods 0.000 claims description 14
- 238000005457 optimization Methods 0.000 claims description 14
- 238000013461 design Methods 0.000 claims description 12
- 230000008901 benefit Effects 0.000 claims description 8
- 230000001133 acceleration Effects 0.000 claims description 7
- 239000000446 fuel Substances 0.000 claims description 7
- 238000004458 analytical method Methods 0.000 claims description 6
- 230000007704 transition Effects 0.000 claims description 6
- 238000010276 construction Methods 0.000 claims description 5
- 238000011217 control strategy Methods 0.000 claims description 5
- 238000012545 processing Methods 0.000 claims description 4
- 238000006243 chemical reaction Methods 0.000 claims description 3
- 239000003380 propellant Substances 0.000 claims description 2
- 230000001186 cumulative effect Effects 0.000 description 19
- 238000010586 diagram Methods 0.000 description 6
- 230000004913 activation Effects 0.000 description 4
- 230000007613 environmental effect Effects 0.000 description 4
- 230000005484 gravity Effects 0.000 description 4
- 238000004364 calculation method Methods 0.000 description 3
- 238000005516 engineering process Methods 0.000 description 3
- 230000006872 improvement Effects 0.000 description 3
- 210000002569 neuron Anatomy 0.000 description 3
- 230000008859 change Effects 0.000 description 2
- 238000006073 displacement reaction Methods 0.000 description 2
- 238000011156 evaluation Methods 0.000 description 2
- 238000011084 recovery Methods 0.000 description 2
- 238000006467 substitution reaction Methods 0.000 description 2
- 230000005856 abnormality Effects 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000007547 defect Effects 0.000 description 1
- 230000001419 dependent effect Effects 0.000 description 1
- 230000007246 mechanism Effects 0.000 description 1
- 230000000750 progressive effect Effects 0.000 description 1
- 238000011160 research Methods 0.000 description 1
- 239000002760 rocket fuel Substances 0.000 description 1
- 230000009466 transformation Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05B—CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
- G05B13/00—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion
- G05B13/02—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric
- G05B13/04—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators
- G05B13/042—Adaptive control systems, i.e. systems automatically adjusting themselves to have a performance which is optimum according to some preassigned criterion electric involving the use of models or simulators in which a parameter or coefficient is automatically adjusted to optimise the performance
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T90/00—Enabling technologies or technologies with a potential or indirect contribution to GHG emissions mitigation
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- Medical Informatics (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Automation & Control Theory (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention provides a reinforcement-learning-based real-time robust guidance method and system for rocket landing. A three-degree-of-freedom rocket motion model is built from the forces acting on the rocket during the powered-descent phase of its Earth landing, and a rocket-landing Markov decision process model is constructed from that motion model. An intelligent agent is then built from the Markov decision process model and trained by interactive simulation against a pre-built rocket landing flight simulation environment to obtain a landing-control agent, and the rocket's landing flight is guided by the real-time control commands that the landing-control agent generates. The method not only offers extremely high real-time performance but is also extremely robust: it adapts to a wide range of modeling deviations and can still guide the rocket to a high-precision pinpoint soft landing when the environment contains uncertain disturbances, giving it high application value.
Description
Technical Field
The invention relates to the technical field of Earth-landing guidance for vertical take-off and landing (VTOL) rockets, and in particular to a reinforcement-learning-based real-time robust guidance method and system for rocket landing.
Background
The VTOL reusable launch vehicle is a new type of launch vehicle and an effective tool for reducing the cost of space launch missions and improving the efficiency of access to space. Booster-stage Earth-landing guidance refers to controlling the position and velocity of the launch vehicle's center of mass during the three-degree-of-freedom flight of the returning stage: according to some principle or strategy, commands are generated to steer the motion of the center of mass so that the flight satisfies the process constraints and the terminal state meets the prescribed targets. It is a key technology for guaranteeing recovery accuracy, reducing fuel consumption, and achieving reliable reuse of the launch vehicle.
Existing booster-stage Earth-landing guidance methods mainly comprise online trajectory-optimization guidance, which builds a dynamics model of the target rocket and a corresponding trajectory-optimization problem and solves for the optimal flight trajectory online by an indirect or a direct method, and deep-learning guidance, which follows an "offline training + online application" strategy. Although these methods offer a degree of real-time performance and optimality and can to some extent enable recovery and reuse of launch vehicles, they are all model-based: their algorithmic efficiency and the usability of their solutions depend heavily on the precision and accuracy of the model, and their robustness is poor. Once the environment contains unmodelable unknown factors, or the model suffers from deviations and uncertain disturbances, algorithm performance and the usability of the solutions are severely degraded, which can cause the guidance to fail.
Summary of the Invention
The purpose of the invention is to provide a reinforcement-learning-based real-time robust guidance method for rocket landing: a three-degree-of-freedom rocket motion model is built from a force analysis of the powered-descent phase of the rocket's Earth landing; a rocket-landing Markov decision process model is constructed with the help of the gaze-heuristic idea; an intelligent agent comprising a value-function neural network and a policy neural network is trained by interactive simulation against a rocket landing flight simulation environment to obtain a landing-control agent that produces the landing guidance control strategy; and real-time control commands generated from that strategy guide the rocket's landing flight. On the basis of effectively overcoming the application defects of existing booster-stage Earth-landing guidance methods, the method copes with large deviations of the rocket dynamics model during the return and landing flight, is strongly robust to environmental disturbances, offers high real-time performance, and can guide the booster stage to a high-precision pinpoint soft landing under complex and uncertain conditions.
To achieve the above purpose, it is necessary to provide, for the above technical problem, a reinforcement-learning-based real-time robust guidance method and system for rocket landing.
In a first aspect, an embodiment of the invention provides a reinforcement-learning-based real-time robust guidance method for rocket landing, the method comprising the following steps:
constructing a three-degree-of-freedom rocket motion model from the forces acting on the rocket during the powered-descent phase of its Earth landing;
constructing a rocket-landing Markov decision process model from the three-degree-of-freedom rocket motion model;
constructing an intelligent agent from the rocket-landing Markov decision process model, and training the agent interactively against a pre-built rocket landing flight simulation environment to obtain a landing-control agent, the intelligent agent comprising a value-function neural network and a policy neural network;
generating real-time control commands with the landing-control agent, and guiding the rocket's landing flight according to the real-time control commands.
Further, the step of constructing the three-degree-of-freedom rocket motion model from the forces acting on the rocket during the powered-descent phase of its Earth landing comprises:
establishing a landing-point coordinate frame with the target landing point of the booster stage as its origin; in this frame the target landing point is the origin O, the axis Oz points vertically upward away from the Earth's center, the axis Ox points along the rocket's main flight direction during landing, and the axis Oy is perpendicular to the xOz plane and completes a right-handed Cartesian frame with Ox and Oz;
performing, in the landing-point frame, a force analysis of the rocket flying in the powered-descent phase of its Earth landing and determining the corresponding Earth gravity, aerodynamic drag, and engine thrust;
constructing the three-degree-of-freedom rocket motion model from the Earth gravity, aerodynamic drag, and engine thrust; the model is expressed as

$$\dot{\mathbf{r}} = \mathbf{v}, \qquad \dot{\mathbf{v}} = \mathbf{g}(\mathbf{r}) + \frac{\mathbf{T} + \mathbf{D}}{m}, \qquad \dot{m} = -\frac{\lVert\mathbf{T}\rVert}{I_{sp}\, g_0}, \qquad \mathbf{D} = -\tfrac{1}{2}\, C_D\, S_{ref}\, \rho\, \lVert\mathbf{v}\rVert\, \mathbf{v}, \qquad \rho = \rho_0\, e^{-h/h_{ref}}$$

where r denotes the rocket position vector; v the rocket velocity vector; m the rocket mass; g(r) the gravitational-acceleration vector acting on the rocket; T the engine thrust vector; D the aerodynamic drag vector; $I_{sp}$ the fuel specific impulse; $g_0$ the mean gravitational acceleration at Earth sea level; $\dot{m}$ the propellant mass flow after engine ignition; $C_D$ the drag coefficient; $S_{ref}$ the reference area of the booster stage; $\rho_0$ the reference atmospheric density at Earth sea level; h the flight altitude of the booster stage; and $h_{ref}$ the reference altitude.
Further, the step of constructing the rocket-landing Markov decision process model from the three-degree-of-freedom rocket motion model comprises:
converting the rocket's state variables according to the gaze-heuristic idea to obtain the state quantity of the rocket-landing Markov decision process model, the state quantity satisfying

$$V_{error} = V - V_{sight}$$

where S denotes the state quantity of the rocket-landing Markov decision process model; r, V and $V_0$ denote the rocket position vector, the rocket velocity vector and the initial rocket velocity, respectively; $t_{go}$ denotes the remaining flight time; $r_z$ denotes the Z-axis component of the rocket position vector; $V_{sight}$ denotes the sight vector; $V_{error}$ denotes the error between the rocket velocity vector and the sight vector; and λ is a parameter adjusting how the magnitude of the sight vector varies with time;
obtaining the action quantity of the rocket-landing Markov decision process model from the rocket's control command, the action quantity being expressed as

$$A = \mathbf{T} = [T_x,\ T_y,\ T_z]^{\mathrm{T}}$$

where A denotes the action quantity of the model, T the engine thrust vector, and $T_x$, $T_y$ and $T_z$ the X-, Y- and Z-axis components of the engine thrust;
determining the reward-function design principles from the requirements of the rocket's pinpoint soft landing, and obtaining the reward function of the rocket-landing Markov decision process model from those design principles;
discretizing the continuous rocket-landing process with a preset period, and determining the state-transition probability of the rocket-landing Markov decision process model from the rocket's integrated dynamics.
Further, the step of constructing the intelligent agent from the rocket-landing Markov decision process model comprises:
selecting, on the basis of the rocket-landing Markov decision process model, the proximal policy optimization algorithm as the reinforcement-learning algorithm of the intelligent agent;
building, on the basis of the proximal policy optimization algorithm, the value-function neural network and the policy neural network from a multi-layer perceptron model.
Further, the construction of the rocket landing flight simulation environment comprises:
building the rocket landing operating environment on the basis of the three-degree-of-freedom rocket motion model, and building in parallel the corresponding initial-condition generator and flight-termination detector, to obtain the rocket landing flight simulation environment.
Further, the step of training the intelligent agent interactively against the pre-built rocket landing flight simulation environment to obtain the landing-control agent comprises:
training the agent's policy neural network until convergence, through interactive simulation between the intelligent agent and the rocket landing flight simulation environment, to obtain the landing-control agent.
Further, the step of training the agent's policy neural network until convergence through interactive simulation between the intelligent agent and the rocket landing flight simulation environment, to obtain the landing-control agent, comprises:
randomly selecting an initial state to be simulated from a preset initial-state space with the initial-condition generator;
executing the interactive simulation between the intelligent agent and the flight simulation environment from the initial state to be simulated; terminating the current round of simulated flight when a termination condition preset in the flight-termination detector is reached; evaluating, with the reward function, the cumulative return of every state point on the current simulated flight trajectory; and updating the parameters of the agent's value-function neural network according to the cumulative returns;
predicting, with the agent's updated value-function neural network, the expected cumulative return of every state point on the current simulated flight trajectory; computing the advantage function from the cumulative returns and the expected returns; and updating the parameters of the agent's policy neural network according to the advantage function;
judging whether the agent's policy neural network has reached a preset convergence condition: if so, stopping the simulation training to obtain the landing-control agent; otherwise, reselecting an initial state to be simulated with the initial-condition generator and starting the next round of interactive simulation training.
In a second aspect, an embodiment of the invention provides a reinforcement-learning-based real-time robust guidance system for rocket landing, the system comprising:
a motion-model construction module for constructing the three-degree-of-freedom rocket motion model from the forces acting on the rocket during the powered-descent phase of its Earth landing;
an optimization-model construction module for constructing the rocket-landing Markov decision process model from the three-degree-of-freedom rocket motion model;
a control-strategy training module for constructing the intelligent agent from the rocket-landing Markov decision process model, and for training the agent interactively against the pre-built rocket landing flight simulation environment to obtain the landing-control agent, the intelligent agent comprising a value-function neural network and a policy neural network;
a rocket-landing guidance module for generating real-time control commands with the landing-control agent and guiding the rocket's landing flight according to the real-time control commands.
In a third aspect, an embodiment of the invention further provides a computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the above method when executing the computer program.
In a fourth aspect, an embodiment of the invention further provides a computer-readable storage medium on which a computer program is stored, the computer program implementing the steps of the above method when executed by a processor.
The application described above provides a reinforcement-learning-based real-time robust guidance method and system for rocket landing. The method constructs a three-degree-of-freedom rocket motion model from the forces acting on the rocket during the powered-descent phase of its Earth landing, builds a rocket-landing Markov decision process model from that motion model, constructs an intelligent agent from the Markov decision process model, trains the agent by interactive simulation against a pre-built rocket landing flight simulation environment to obtain a landing-control agent, and guides the rocket's landing flight with the real-time control commands generated by the landing-control agent. Compared with the prior art, this reinforcement-learning-based real-time robust guidance method not only offers extremely high real-time performance but is also extremely robust: it adapts to a wide range of modeling deviations and can still guide the rocket to a high-precision pinpoint soft landing when the environment contains uncertain disturbances, giving it high application value.
Brief Description of the Drawings
Fig. 1 is a schematic diagram of an application scenario of the reinforcement-learning-based real-time robust guidance method for rocket landing in an embodiment of the invention;
Fig. 2 is a flow chart of the reinforcement-learning-based real-time robust guidance method for rocket landing in an embodiment of the invention;
Fig. 3 is a schematic diagram of the landing-point coordinate frame used to establish the three-degree-of-freedom rocket motion model in an embodiment of the invention;
Fig. 4 is a schematic diagram of the structure of the intelligent agent's policy neural network in an embodiment of the invention;
Fig. 5 is a schematic diagram of the structure of the intelligent agent's value-function neural network in an embodiment of the invention;
Fig. 6 is a schematic diagram of the structure of the reinforcement-learning-based real-time robust guidance system for rocket landing in an embodiment of the invention;
Fig. 7 is an internal structure diagram of a computer device in an embodiment of the invention.
Detailed Description
To make the purpose, technical solution, and beneficial effects of the application clearer, the invention is described in further detail below with reference to the drawings and embodiments. Obviously, the embodiments described below are only a part of the embodiments of the invention; they illustrate the invention but are not intended to limit its scope. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the invention without creative effort fall within the protection scope of the invention.
The reinforcement-learning-based real-time robust guidance method for rocket landing provided by the invention can be applied to the Earth-return landing guidance of a VTOL reusable launch vehicle. Based on the overall architecture shown in Fig. 1, it maps engine thrust commands from the rocket's real-time state, and the commands it produces adapt to large rocket-model deviations and environmental disturbances, ensuring that the booster stage is guided to a high-precision pinpoint soft landing under complex and uncertain conditions. A deep neural network serves as the reinforcement-learning policy network and is trained with an improved PPO algorithm, so that control commands in a high-dimensional continuous action space are fitted effectively; a gaze-heuristic construction of the state guides the learning, and different reward discount rates are designed for the terminal and process metrics of the landing trajectory to accelerate the convergence of the policy. The resulting landing guidance is therefore highly real-time and strongly robust: it adapts to a wide range of modeling deviations and copes effectively with uncertain environmental disturbances, providing a reliable guarantee for guiding the rocket to a high-precision pinpoint soft landing, and it has high application value. It should be noted that the method of the invention may be executed by a server that carries the relevant functions, and the following embodiments all take a server as the executing subject when describing the reinforcement-learning-based real-time robust guidance method for rocket landing in detail.
In one embodiment, as shown in Fig. 2, a reinforcement-learning-based real-time robust guidance method for rocket landing is provided, comprising the following steps:
S11. Construct the three-degree-of-freedom rocket motion model from the forces acting on the rocket during the powered-descent phase of its Earth landing. The three-degree-of-freedom motion model can be understood as a nonlinear, continuous fuel-optimal landing-trajectory optimization problem established with the actual flight conditions and mission goals in mind, obtained by targeted refinement of the current rocket-landing operating model. Because the booster stage is mainly acted upon during the final landing phase by engine thrust, the Earth's gravity, and the aerodynamic forces generated in the dense atmosphere, and in order to simplify the problem as far as possible while keeping it reliable, this embodiment builds the corresponding three-degree-of-freedom motion model mainly from the forces acting on the rocket during the powered-descent phase. Specifically, the step of constructing the three-degree-of-freedom rocket motion model from these forces comprises:
establishing the landing-point coordinate frame with the target landing point of the booster stage as its origin; as shown in Fig. 3, the target landing point is the coordinate origin O, the axis Oz points vertically upward away from the Earth's center, the axis Ox points along the rocket's main flight direction during landing, and the axis Oy is perpendicular to the xOz plane and completes a right-handed Cartesian frame with Ox and Oz. It should be noted that, because the final flight phase of the booster-stage landing considered by the invention has a short flight time and a narrow flight airspace, the curvature of the Earth's surface and the effect of the Earth's rotation can be neglected and the surface can be treated as a plane; therefore, to describe the booster-stage flight more intuitively and to simplify the solution of the problem, this embodiment preferably establishes the landing-point frame for the force analysis used in building the three-degree-of-freedom motion model;
performing, in the landing-point frame, a force analysis of the rocket flying in the powered-descent phase of its Earth landing and determining the corresponding Earth gravity, aerodynamic drag, and engine thrust. The Earth's gravity is set to a constant value in the three-degree-of-freedom motion model: the powered-descent flight time is short (tens of seconds), so the effect of the Earth's rotation can be neglected, and the flight airspace is narrow (on the order of ten-odd kilometres), so a flat landing site and a constant gravity field satisfy the accuracy requirements, which effectively simplifies the problem. The aerodynamic drag is the drag experienced by the rocket in the dense atmosphere and can be expressed as
$$\mathbf{D} = -\tfrac{1}{2}\, C_D\, S_{ref}\, \rho\, \lVert\mathbf{V}\rVert\, \mathbf{V}, \qquad \rho = \rho_0\, e^{-h/h_{ref}}$$

where $C_D$ is the drag coefficient; $S_{ref}$ is the reference area of the booster stage; ρ is the atmospheric density of the Earth-landing environment, represented by an exponential atmospheric-density model; V is the rocket velocity vector; $\rho_0$ is the reference atmospheric density at Earth sea level; h is the flight altitude of the booster stage, i.e. the Z-axis component of the rocket position in the landing-point frame; and $h_{ref}$ is the reference altitude.
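The drag and exponential-atmosphere model above can be sketched as the following minimal Python functions; the numerical defaults for $\rho_0$, $h_{ref}$, $C_D$ and $S_{ref}$ are illustrative placeholders, not values taken from the patent.

```python
import numpy as np

def atmospheric_density(h, rho0=1.225, h_ref=7200.0):
    """Exponential atmosphere: density decays with altitude h [m] (placeholder constants)."""
    return rho0 * np.exp(-h / h_ref)

def aerodynamic_drag(v, h, c_d=1.0, s_ref=10.0):
    """Drag vector D = -0.5 * C_D * S_ref * rho * ||V|| * V, opposing the velocity v [m/s]."""
    rho = atmospheric_density(h)
    return -0.5 * c_d * s_ref * rho * np.linalg.norm(v) * v
```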
The engine thrust is obtained by merging the several engines carried by the booster stage into a single equivalent engine that provides thrust to the rocket, without considering changes of the rocket's attitude; its magnitude is expressed as

$$\lVert\mathbf{T}\rVert = I_{sp}\, g_0\, \dot{m}$$

where $I_{sp}$ is the fuel specific impulse, $g_0$ is the mean gravitational acceleration at Earth sea level, and $\dot{m}$ is the propellant mass flow after engine ignition.
In addition, in the landing problem studied by the invention, the influence of control mechanisms such as grid fins and the reaction control system (RCS) on rocket adjustment is not considered, and the thrust produced by the rocket engine is taken as the only control quantity. The rocket's attitude motion is likewise not considered, and the landing motion is treated as the motion of the center of mass, so the total engine thrust T can be decomposed along the three axes of the established landing-point frame to obtain its thrust components along the three axes, i.e. $\mathbf{T} = [T_x, T_y, T_z]^{\mathrm{T}}$. This effectively avoids complicated trigonometric thrust resolution; in the subsequent problem modeling and solution the three thrust components are used directly as the control quantities of the booster stage, and their form is constrained by

$$\lVert\mathbf{T}\rVert = \sqrt{T_x^2 + T_y^2 + T_z^2}$$

where ||T|| is the thrust magnitude. Limited by the current level of reusable-engine technology, and in order to guarantee safety during landing, the engine is not shut down again once it has been ignited during the final landing flight; that is, throughout the powered-descent flight the booster stage is subject to a non-zero minimum thrust, and the engine thrust magnitude is constrained by

$$T_{min} \le \lVert\mathbf{T}\rVert \le T_{max}$$

where $T_{max}$ and $T_{min}$ are the upper and lower bounds of the rocket engine thrust magnitude, respectively.
Constructing the three-degree-of-freedom rocket motion model from the Earth gravity, aerodynamic drag, and engine thrust; the model is expressed as

$$\dot{\mathbf{r}} = \mathbf{v}, \qquad \dot{\mathbf{v}} = \mathbf{g}(\mathbf{r}) + \frac{\mathbf{T} + \mathbf{D}}{m}, \qquad \dot{m} = -\frac{\lVert\mathbf{T}\rVert}{I_{sp}\, g_0}, \qquad \mathbf{D} = -\tfrac{1}{2}\, C_D\, S_{ref}\, \rho\, \lVert\mathbf{v}\rVert\, \mathbf{v}, \qquad \rho = \rho_0\, e^{-h/h_{ref}}$$

where r denotes the rocket position vector, v the rocket velocity vector, and m the rocket mass; g(r) is the gravitational-acceleration vector acting on the rocket, in general a function of the rocket position r, which is set to a constant value in the problem solved by the invention; T is the engine thrust vector and the control variable of the trajectory-optimization problem of the invention; D is the aerodynamic drag vector; $I_{sp}$ is the fuel specific impulse; $g_0$ is the mean gravitational acceleration at Earth sea level; $\dot{m}$ is the propellant mass flow after engine ignition; $C_D$ is the drag coefficient; $S_{ref}$ is the reference area of the booster stage; $\rho_0$ is the reference atmospheric density at Earth sea level; h is the flight altitude of the booster stage; and $h_{ref}$ is the reference altitude.
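Under the constant-gravity assumption described above, the three-degree-of-freedom dynamics can be sketched as a state derivative plus a simple integration step over one guidance period, reusing the aerodynamic_drag sketch above; the constants and the explicit-Euler step are illustrative choices, not specifics from the patent.

```python
import numpy as np

G0 = 9.80665   # mean sea-level gravitational acceleration [m/s^2]
ISP = 300.0    # fuel specific impulse [s] (placeholder value)

def rocket_dynamics(state, thrust):
    """state = [rx, ry, rz, vx, vy, vz, m]; thrust is the commanded vector T = [Tx, Ty, Tz]."""
    r, v, m = state[0:3], state[3:6], state[6]
    g = np.array([0.0, 0.0, -G0])                  # constant gravity in the landing-point frame
    drag = aerodynamic_drag(v, r[2])               # drag evaluated at the current altitude rz
    r_dot = v
    v_dot = g + (thrust + drag) / m
    m_dot = -np.linalg.norm(thrust) / (ISP * G0)   # propellant mass flow
    return np.concatenate([r_dot, v_dot, [m_dot]])

def transition(state, thrust, dt=0.1):
    """One discretized guidance period (explicit Euler), i.e. the deterministic MDP transition."""
    return state + dt * rocket_dynamics(state, thrust)
```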
For the booster-stage landing process, the system state and the system control can be expressed as

$$x = [\mathbf{r}^{\mathrm{T}},\ \mathbf{v}^{\mathrm{T}},\ m]^{\mathrm{T}}, \qquad u = \mathbf{T}$$

where the rocket's system state x comprises the rocket's position, velocity, and mass, and the rocket's system control u is the equivalent rocket-engine thrust described above.
S12. Construct the rocket-landing Markov decision process model from the three-degree-of-freedom rocket motion model. The rocket-landing Markov decision process model comprises five elements: the state quantity S, the action quantity A, the reward function R, the state-transition probability P, and the discount factor γ. Specifically, the step of constructing the rocket-landing Markov decision process model from the three-degree-of-freedom motion model comprises:
converting the rocket's state variables according to the gaze-heuristic idea to obtain the state quantity of the rocket-landing Markov decision process model. The state quantity S does not take the rocket's state variables directly; instead, the observed rocket state is transformed according to the gaze-heuristic idea so as to accelerate the convergence of the subsequent agent policy in the early stages of learning. It should be noted that the rocket system states referred to below are understood to be the state quantities obtained by this transformation. The state quantity S satisfies

$$V_{error} = V - V_{sight}$$

where S denotes the state quantity of the rocket-landing Markov decision process model; r, V and $V_0$ denote the rocket position vector, the rocket velocity vector and the initial rocket velocity, respectively; $t_{go}$ denotes the remaining flight time; $r_z$ denotes the Z-axis component of the rocket position vector; $V_{sight}$ denotes the sight vector; $V_{error}$ denotes the error between the rocket velocity vector and the sight vector; and λ is a parameter adjusting how the magnitude of the sight vector varies with time;
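The full expressions for the sight vector, the time-to-go and the composition of S are not reproduced in the extracted text, so the following Python sketch is only an illustrative construction of a gaze-heuristic observation: the time-to-go estimate, the sight-vector form and the 5-dimensional layout (consistent with the network input dimension of 5 stated later) are assumptions.

```python
import numpy as np

def build_observation(r, v, v0, lam=1.0):
    """Illustrative gaze-heuristic state: r, v in the landing-point frame, v0 the initial speed."""
    t_go = max(r[2] / max(abs(v[2]), 1e-3), 1e-3)  # assumed time-to-go from altitude / sink rate
    v_sight = -lam * r / t_go                      # assumed sight vector pointing toward the pad
    v_error = v - v_sight                          # velocity error the agent is asked to null
    return np.concatenate([v_error / v0, [t_go, r[2]]])   # assumed 5-dimensional state S
```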
obtaining the action quantity of the rocket-landing Markov decision process model from the rocket's control command. The action quantity directly adopts the rocket's control command, i.e. the engine thrust, and is expressed as

$$A = \mathbf{T} = [T_x,\ T_y,\ T_z]^{\mathrm{T}}$$

where A denotes the action quantity of the rocket-landing Markov decision process model, T the engine thrust vector, and $T_x$, $T_y$ and $T_z$ the X-, Y- and Z-axis components of the engine thrust. In addition, since the control command given by the subsequent agent policy cannot by itself bound the modulus of the output action, and in order to ensure that the output command satisfies the thrust-magnitude constraint of the rocket engine, this embodiment preferably also clips the magnitude of the control command so that it strictly satisfies the engine thrust-magnitude constraint;
determining the reward-function design principles from the requirements of the rocket's pinpoint soft landing, and obtaining the reward function of the rocket-landing Markov decision process model from those design principles. The design principles can be understood as the conditions of the rocket's pinpoint soft landing, for example:
(1) the terminal landing position of the rocket reaches the landing point, i.e. $r_f = 0$;
(2) the terminal landing velocity of the rocket is zero, i.e. $V_f = 0$;
(3) the remaining terminal mass $m_f$ of the rocket is as large as possible, i.e. fuel consumption is minimized during the flight;
(4) the lateral maneuver during the landing flight must not be excessive;
It should be noted that the reward-function design principles may, depending on the actual analysis, include but are not limited to the principles listed above. Once the design principles are determined, the trajectory reward function is divided, following the gaze-heuristic idea, into two parts: a cumulative process reward and a terminal reward, where the cumulative process reward $R_{prog}$ is expressed as

$$R_{prog} = \alpha \lVert V_{error} \rVert + \beta \lVert F_{use} \rVert + \eta \cdot P_{glide}, \qquad \text{s.t.}\ V_{error} = V - V_{sight}$$

where $V_{error}$ is the error between the current rocket velocity V (the rocket velocity vector) and the "sight" vector $V_{sight}$; $F_{use}$ is the rocket fuel consumption at the current moment and is related to the magnitude of the command taken, with ||A|| the magnitude of the thrust output by the control command and $T_{max}$ the maximum rocket thrust; gs (glide slope) denotes the trajectory slope and $P_{glide}$ is an envelope constraint that limits the rocket's lateral maneuver: during the landing, whenever the altitude drop between two states exceeds 2 m, the ratio gs between the longitudinal maneuver $dr_z$ and the lateral maneuver is computed. The remaining variables are initialized parameters: α = -0.01, β = -0.05 and η = -100 are the scale factors of the corresponding terms of the cumulative return $R_{prog}$, and $gs_{limit}$ = 0.1 and $gs_\tau$ = 0.05 denote the minimum trajectory slope and the scale factor of the $P_{glide}$ envelope-constraint formula, respectively;
The terminal reward is expressed as

$$R_{term} = reward_{landing} + P_{term}$$

where $reward_{landing}$ is the reward given when the terminal landing position and velocity of the rocket meet the requirements, and $P_{term}$ is the penalty applied when the lateral maneuver at the moment before landing is too large; $\lVert V_{term} \rVert$ and $\lVert r_{term} \rVert$ denote the moduli of the terminal velocity and the terminal position; $gs_{term}$ is the ratio between the longitudinal and lateral displacements of the rocket at landing, computed in the same way as gs in the process constraint; and the remaining variables $V_{limit}$, $r_{limit}$, $gs_{limit}$ and so on are initialized parameters.
Together, the cumulative process reward and the terminal reward described above guide the intelligent agent to control the rocket toward the goal of a vertical pinpoint soft landing.
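A minimal Python sketch of such a two-part reward is given below. The scale factors α = -0.01, β = -0.05, η = -100 and gs_limit = 0.1, gs_τ = 0.05 come from the description, but the exact forms of F_use, P_glide and the terminal bonus/penalty (and their thresholds) are not reproduced in the extracted text, so those parts are assumptions.

```python
import numpy as np

ALPHA, BETA, ETA = -0.01, -0.05, -100.0   # scale factors stated in the description
GS_LIMIT, GS_TAU = 0.1, 0.05              # minimum glide slope and its scale factor

def process_reward(v_error, thrust, t_max, gs):
    """Per-step reward R_prog; the fuel and glide-slope terms below are assumed forms."""
    fuel_use = np.linalg.norm(thrust) / t_max                # assumed normalized fuel term F_use
    glide_violation = max(0.0, (GS_LIMIT - gs) / GS_TAU)     # assumed penalty active when gs < gs_limit
    return ALPHA * np.linalg.norm(v_error) + BETA * fuel_use + ETA * glide_violation

def terminal_reward(r_term, v_term, gs_term,
                    r_limit=5.0, v_limit=2.0, bonus=10.0, penalty=-10.0):
    """Terminal reward R_term = landing bonus + lateral-maneuver penalty (thresholds assumed)."""
    landed = np.linalg.norm(r_term) <= r_limit and np.linalg.norm(v_term) <= v_limit
    reward = bonus if landed else 0.0
    if gs_term < GS_LIMIT:
        reward += penalty
    return reward
```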
discretizing the continuous rocket-landing process with a preset period, and determining the state-transition probability of the rocket-landing Markov decision process model from the rocket's integrated dynamics. Specifically, the state-transition probability P is expressed as

$$P\bigl(s_{\tau+1} = f(s_\tau, a_\tau)\ \big|\ s_\tau, a_\tau\bigr) = 1$$

where $s_\tau$ and $a_\tau$ denote the current system state and the action currently taken by the system at time τ; $s_{\tau+1}$ denotes the system state at time τ+1; f(s, a) denotes the state-transition dynamics of the system; and P denotes the probability that, given the state $s_\tau$ and the action $a_\tau$, the state $s_\tau$ at time τ transitions to the state $s_{\tau+1}$ at time τ+1.
Correspondingly, the discount factor γ of the rocket-landing Markov decision process model is used to decay the cumulative future process reward of the trajectory over time, and preferably takes the value 0.95.
S13. Construct the intelligent agent from the rocket-landing Markov decision process model, and train it interactively against the pre-built rocket landing flight simulation environment to obtain the landing-control agent. The intelligent agent comprises a value-function neural network and a policy neural network. Specifically, the step of constructing the intelligent agent from the rocket-landing Markov decision process model comprises:
selecting, on the basis of the rocket-landing Markov decision process model, the proximal policy optimization algorithm as the reinforcement-learning algorithm of the intelligent agent; the proximal policy optimization algorithm can be understood as an improved PPO algorithm used to train the intelligent agent on the booster-stage landing task;
building, on the basis of the proximal policy optimization algorithm, the value-function neural network and the policy neural network from a multi-layer perceptron model. The input of the policy neural network is the system state S obtained by processing the agent's observations, and the corresponding output is the thrust-vector control command value A for the rocket landing; the value-function neural network is used to accelerate the convergence of the policy network: from the rocket state values x along the trajectories of the episode simulations and their corresponding actual cumulative returns Q(s,a), the value-function network is trained to predict the expected cumulative return V(s) of a given state;
Both the policy neural network and the value-function neural network use a structure with three hidden layers, with the tanh function as the hidden-layer activation and a linear activation in the output layer; the number of input-layer neurons $n_{in}$ of both networks is 5, i.e. the dimension of the state quantity S. The output layer of the policy network contains three neurons, corresponding to the three-dimensional thrust components of the rocket, while the output layer of the value-function network contains a single neuron, corresponding to the expected cumulative return. The specific structural parameters of the policy neural network shown in Fig. 4 and the value-function neural network shown in Fig. 5 are listed in Table 1 (structural parameters of the policy and value-function neural networks).
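A PyTorch sketch of the two multi-layer perceptrons described above (three tanh hidden layers, linear output, 5-dimensional input, 3-dimensional policy output and scalar value output) might look as follows; the hidden-layer width of 64 is a placeholder, since the widths in Table 1 are not reproduced in the extracted text.

```python
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=64):
    """Three tanh hidden layers followed by a linear output layer."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.Tanh(),
        nn.Linear(hidden, hidden), nn.Tanh(),
        nn.Linear(hidden, hidden), nn.Tanh(),
        nn.Linear(hidden, out_dim),             # linear output activation
    )

policy_net = mlp(5, 3)   # state S (dim 5) -> thrust command [Tx, Ty, Tz]
value_net = mlp(5, 1)    # state S (dim 5) -> expected cumulative return V(s)
```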
The rocket landing flight simulation environment described above can be understood as a simulation model, built on the rocket-landing dynamics model, for simulating the rocket's landing flight. Specifically, the construction of the rocket landing flight simulation environment comprises:
building the rocket landing operating environment on the basis of the three-degree-of-freedom rocket motion model, and building in parallel the corresponding initial-condition generator and flight-termination detector, to obtain the rocket landing flight simulation environment. The initial-condition generator can be understood as an initial-state selector that randomly selects an initial state from the configured initial-state space to start a round of trajectory simulation; the flight-termination detector can be understood as a motion-state detector that holds both abnormal-flight and normal-termination criteria and detects in real time whether the rocket landing has terminated;
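The initial-condition generator and the flight-termination detector can be sketched as the two helpers below; the sampling ranges, dry mass and divergence radius are illustrative assumptions rather than values from the patent.

```python
import numpy as np

def sample_initial_state(rng):
    """Initial-condition generator: draw a start state from the initial-state space (ranges assumed)."""
    r0 = rng.uniform([-500.0, -500.0, 2000.0], [500.0, 500.0, 3000.0])   # position [m]
    v0 = rng.uniform([-30.0, -30.0, -90.0], [30.0, 30.0, -60.0])         # velocity [m/s]
    m0 = rng.uniform(20000.0, 25000.0)                                    # wet mass [kg]
    return np.concatenate([r0, v0, [m0]])

def flight_terminated(state, m_dry=15000.0, max_range=5000.0):
    """Flight-termination detector: touchdown, propellant depletion or divergence (criteria assumed)."""
    r, m = state[0:3], state[6]
    touchdown = r[2] <= 0.0
    anomaly = (m <= m_dry) or (np.linalg.norm(r[:2]) > max_range)
    return touchdown or anomaly

rng = np.random.default_rng(0)
state0 = sample_initial_state(rng)   # start one round of trajectory simulation
```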
Once the intelligent agent and the rocket landing flight simulation environment have been constructed by the above steps, every episode in the rocket landing operating environment can be simulated by letting the intelligent agent interact continuously with the environment, as follows: the initial-condition generator first randomly selects the initial state of the rocket landing from the initial-state space; the intelligent agent then fits the corresponding control commands with the policy neural network from the observed system state and guides the rocket's landing flight; when the rocket lands successfully, or a truncation condition is reached early and the flight is terminated, the round of episode simulation ends, completing one full rocket-landing flight trajectory. In this way, after many rounds of episode simulation started from different initial states, the corresponding reinforcement-learning training is completed.
Specifically, the step of training the intelligent agent interactively against the pre-built rocket landing flight simulation environment to obtain the landing-control agent comprises:
training the agent's policy neural network until convergence, through interactive simulation between the intelligent agent and the rocket landing flight simulation environment, to obtain the landing-control agent; correspondingly, the landing-control agent is trained as follows:
randomly selecting an initial state to be simulated from the preset initial-state space with the initial-condition generator;
根据所述待仿真初始状态,执行所述智能代理Agent与所述飞行仿真环境的交互仿真,并在达到所述飞行终止判定器预设的仿真终止条件时,终止当前轮的仿真飞行,并根据回报函数评估得到当前仿真飞行轨迹中各个状态点的累积回报值,以及根据所述累积回报值更新所述智能代理Agent的基于值函数神经网络的参数;在PPO算法学习框架中,每一轮情节仿真中智能代理Agent通过与飞行仿真环境交互得到一条又观测状态、动作以及回报组成的完整轨迹(sl,al,rl),其中,sl为智能代理Agent观测到的环境状态,al为智能代理Agent根据观测值所采取的动作,rl为环境反馈给智能代理Agent的回报,且回报rl通常表示为sl以及al的函数,则从k时刻到T时刻(情节终止时间)的轨迹可表示为(sk,ak,...,sT,aT),该轨迹的累积折扣回报可表示为:According to the initial state to be simulated, execute the interactive simulation of the intelligent agent Agent and the flight simulation environment, and when the simulation termination condition preset by the flight termination determiner is reached, terminate the simulation flight of the current wheel, and according to The reward function evaluation obtains the cumulative reward value of each state point in the current simulated flight trajectory, and updates the parameters of the value-based function neural network of the intelligent agent Agent according to the cumulative reward value; in the PPO algorithm learning framework, each round of plot In the simulation, the intelligent agent Agent interacts with the flight simulation environment to obtain a complete trajectory (s l , a l , r l ) consisting of observed states, actions and rewards, where s l is the environment state observed by the intelligent agent, and a l is the action taken by the intelligent agent according to the observed value, r l is the return from the environment to the intelligent agent, and the return r l is usually expressed as a function of s l and a l , then from time k to time T (episode termination time) can be expressed as (s k ,a k ,...,s T ,a T ), and the cumulative discounted return of this trajectory can be expressed as:
where γ ∈ [0,1] is the discount factor, which discounts the reward obtained at each time step of the trajectory; for a reinforcement learning algorithm, the objective is therefore to find a policy that makes the expected cumulative discounted return of the trajectory as large as possible;
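Written out (again as the standard reinforcement-learning formulation, not a formula taken from the patent), the training objective over the policy parameters θ is:

\max_{\theta}\; J(\theta) = \mathbb{E}_{\tau \sim \pi_{\theta}}\left[ R_0(\tau) \right]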
predicting the expected cumulative return value of each state point on the current simulated flight trajectory with the updated value function neural network of the Agent, calculating the advantage function from the cumulative return values and the expected return values, and updating the parameters of the Agent's policy neural network according to the advantage function; the advantage function can be expressed as:
A(s,a) = Q(s,a) - V(s)
where A(s,a), Q(s,a) and V(s) denote the advantage function, the cumulative return value and the expected cumulative return value, respectively;
determining whether the policy neural network of the Agent has reached the preset convergence condition; if so, stopping the simulation training to obtain the landing control Agent; otherwise, reselecting an initial state to be simulated with the initial value condition generator and starting the next round of interactive simulation training.
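To make the update step concrete, the sketch below follows a generic clipped-surrogate PPO update on one collected trajectory. It is only a simplified illustration under standard PPO assumptions; the patent's improved PPO algorithm, its network structures and its hyper-parameters are not reproduced, and the names used here (`policy_net.log_prob`, `value_net`, `clip_eps`, and so on) are assumed interfaces, not the patent's implementation.

```python
import torch

def ppo_update(trajectory, policy_net, value_net, policy_opt, value_opt,
               gamma=0.99, clip_eps=0.2, epochs=10):
    states, actions, rewards = zip(*trajectory)
    states, actions = torch.stack(states), torch.stack(actions)

    # Cumulative discounted return Q(s, a) of every state point on the trajectory.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    returns = torch.tensor(returns, dtype=torch.float32)

    # Quantities frozen at collection time: old policy log-probs and advantages.
    with torch.no_grad():
        old_log_prob = policy_net.log_prob(states, actions)
        advantages = returns - value_net(states).squeeze(-1)   # A = Q - V

    for _ in range(epochs):
        # Value function network regresses toward the observed discounted returns.
        value_loss = torch.nn.functional.mse_loss(
            value_net(states).squeeze(-1), returns)
        value_opt.zero_grad(); value_loss.backward(); value_opt.step()

        # Policy network maximizes the clipped surrogate objective.
        ratio = torch.exp(policy_net.log_prob(states, actions) - old_log_prob)
        surrogate = torch.min(ratio * advantages,
                              torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages)
        policy_loss = -surrogate.mean()
        policy_opt.zero_grad(); policy_loss.backward(); policy_opt.step()
```

In this sketch the value function network is fitted to the observed cumulative returns, the advantage A(s,a) = Q(s,a) - V(s) is held fixed at collection time, and the policy network is updated through the clipped surrogate objective, mirroring the two update steps described above.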
It should be noted that, in the above reinforcement learning process, in order to increase the convergence rate of the network and to avoid driving the hidden-layer activation functions into their saturation region, the network inputs are preferably normalized: for the network input data, the mean and standard deviation of each dimension are computed, and the data are then scaled according to the following formula:
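(The scaling formula is likewise not reproduced in the text; the description of per-dimension means and standard deviations matches ordinary standardization, which, as an assumption about the omitted formula, reads:)

\hat{x}_i = \frac{x_i - \mu_i}{\sigma_i}

where x_i is the i-th dimension of the input and μ_i, σ_i are its mean and standard deviation.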
Meanwhile, for the output of the policy neural network, in order to satisfy the thrust constraint, the total magnitude of the output thrust command is preferably limited; the specific process is shown in the following formula:
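(The limiting formula is also omitted from the text. A magnitude clamp consistent with the description is shown below; the notation ā for the limited command and T_min, T_max for the engine thrust bounds is introduced only for illustration and is not taken from the patent:)

\bar{a} = \begin{cases} T_{\max}\, a/\lVert a \rVert, & \lVert a \rVert > T_{\max} \\ T_{\min}\, a/\lVert a \rVert, & \lVert a \rVert < T_{\min} \\ a, & T_{\min} \le \lVert a \rVert \le T_{\max} \end{cases}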
where a and ā denote the thrust command before and after the limiting operation, respectively; for the meaning of the other variables, refer to the preceding description, which is not repeated here;
S14. Real-time control commands are generated according to the landing control Agent, and the rocket's landing flight is guided according to these real-time control commands. Once the landing control Agent has been obtained through the above interactive simulation training, the resulting policy neural network can be used for online landing guidance of the rocket; the value function neural network is no longer required, and the corresponding control command can be given in real time according to the rocket's state during flight, guiding the rocket to a high-precision landing in an environment subject to deviations.
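As an illustration of this online use of the trained policy, a minimal guidance-cycle sketch is given below; `normalize` and `clamp_thrust` are assumed helpers corresponding to the input normalization and thrust limiting described above, not functions named in the patent.

```python
def guidance_step(policy_net, navigation_state):
    """One guidance cycle: map the rocket's current state estimate to a thrust
    command using only the trained policy network (no value network needed)."""
    obs = normalize(navigation_state)      # same normalization as in training
    thrust_cmd = policy_net.act(obs)       # raw command from the policy network
    return clamp_thrust(thrust_cmd)        # enforce the engine thrust bounds
```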
In the embodiment of the present application, a three-degree-of-freedom rocket motion model is built from the forces acting on the rocket during the powered-descent phase of an Earth landing; a rocket landing Markov decision process model is built from the three-degree-of-freedom motion model; after an intelligent agent (Agent) is constructed from the Markov decision process model, the Agent is trained through interactive simulation with a pre-built rocket landing flight simulation environment to obtain a landing control Agent, and the rocket's landing flight is guided by the real-time control commands generated by the landing control Agent. This method can map the rocket's real-time state to engine thrust commands that tolerate large rocket-model deviations and environmental disturbances, ensuring that a rocket stage can be guided to a high-precision pinpoint soft landing under complex and uncertain conditions. Moreover, by adopting a deep neural network as the policy network for reinforcement learning and training it with an improved PPO algorithm, high-dimensional continuous action-space commands are fitted effectively; by using the gaze heuristic to shape the state guidance and designing different reward discount rates for the terminal and process indicators of the rocket landing trajectory, the convergence of the policy is accelerated and the efficiency with which the policy learns pinpoint soft-landing decisions is effectively improved. Compared with the prior art, the method of the present invention has extremely high real-time performance and a highly robust algorithm; it can accommodate a wide range of modeling deviations and can still guide the rocket to a high-precision pinpoint soft landing under uncertain environmental disturbances, and therefore has high application value.
It should be noted that, although the steps in the above flow chart are displayed sequentially as indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, there is no strict restriction on the order in which these steps are executed, and they may be executed in other orders.
In one embodiment, as shown in FIG. 6, a reinforcement-learning-based real-time robust guidance system for rocket landing is provided, the system comprising:
a motion model construction module 1, configured to build a three-degree-of-freedom rocket motion model from the forces acting on the rocket during the powered-descent phase of an Earth landing;
an optimization model construction module 2, configured to build a rocket landing Markov decision process model from the three-degree-of-freedom rocket motion model;
a control strategy training module 3, configured to construct an intelligent agent (Agent) from the rocket landing Markov decision process model and to train the Agent interactively with a pre-built rocket landing flight simulation environment to obtain a landing control Agent, the Agent comprising a value function neural network and a policy neural network;
a rocket landing guidance module 4, configured to generate real-time control commands according to the landing control Agent and to guide the rocket's landing flight according to the real-time control commands.
For the specific limitations of the reinforcement-learning-based real-time robust guidance system for rocket landing, reference may be made to the limitations of the reinforcement-learning-based real-time robust guidance method for rocket landing described above, which are not repeated here. Each module of the above system may be implemented wholly or partly by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of a computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to each module.
FIG. 7 shows an internal structure diagram of a computer device in one embodiment; the computer device may specifically be a terminal or a server. As shown in FIG. 7, the computer device includes a processor, a memory, a network interface, a display and an input device connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program, and the internal memory provides an environment for running them. The network interface of the computer device is used to communicate with external terminals through a network connection. When the computer program is executed by the processor, the reinforcement-learning-based real-time robust guidance method for rocket landing is implemented. The display of the computer device may be a liquid crystal display or an electronic ink display; the input device may be a touch layer covering the display, a button, trackball or touchpad provided on the housing of the computer device, or an external keyboard, touchpad or mouse.
Those of ordinary skill in the art will understand that the structure shown in FIG. 7 is only a block diagram of a partial structure related to the solution of the present application and does not limit the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the above method when executing the computer program.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, wherein the computer program implements the steps of the above method when executed by a processor.
In summary, in the reinforcement-learning-based real-time robust guidance method and system for rocket landing provided by the embodiments of the present invention, the method builds a three-degree-of-freedom rocket motion model from the forces acting on the rocket during the powered-descent phase of an Earth landing, builds a rocket landing Markov decision process model from the three-degree-of-freedom motion model, constructs an intelligent agent (Agent) from the Markov decision process model, trains the Agent through interactive simulation with a pre-built rocket landing flight simulation environment to obtain a landing control Agent, and guides the rocket's landing flight according to the real-time control commands generated by the landing control Agent. This technical solution can map the rocket's real-time state to engine thrust commands, and the resulting commands adapt to large rocket-model deviations and environmental disturbances, ensuring that a rocket stage is guided to a high-precision pinpoint soft landing under complex and uncertain conditions. By adopting a deep neural network as the policy network for reinforcement learning and training it with an improved PPO algorithm, high-dimensional continuous action-space commands are fitted effectively; by using the gaze heuristic to shape the state guidance and designing different reward discount rates for the terminal and process indicators of the rocket landing trajectory, the convergence of the policy is accelerated. The rocket landing guidance method therefore has extremely high real-time performance and a strongly robust algorithm, able both to accommodate a wide range of modeling deviations and to cope effectively with uncertain environmental disturbances, providing a reliable guarantee for guiding the rocket to a high-precision pinpoint soft landing and having high application value.
Each embodiment in this specification is described in a progressive manner; for identical or similar parts the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, since the system embodiment is substantially similar to the method embodiment, it is described relatively briefly, and reference may be made to the corresponding description of the method embodiment. It should be noted that the technical features of the above embodiments may be combined arbitrarily; for brevity, not all possible combinations are described, but as long as such combinations are not contradictory, they should be regarded as falling within the scope of this specification.
The above embodiments express only several preferred implementations of the present application and are described in relatively specific detail, but they should not be construed as limiting the scope of the invention patent. It should be pointed out that those of ordinary skill in the art may make a number of improvements and substitutions without departing from the technical principle of the present invention, and such improvements and substitutions shall also be regarded as falling within the protection scope of the present application. Therefore, the protection scope of this patent application shall be determined by the appended claims.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210972207.XA CN115524964B (en) | 2022-08-12 | 2022-08-12 | A real-time robust guidance method and system for rocket landing based on reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115524964A true CN115524964A (en) | 2022-12-27 |
CN115524964B CN115524964B (en) | 2023-04-11 |
Family ID: 84696584
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210972207.XA Active CN115524964B (en) | 2022-08-12 | 2022-08-12 | A real-time robust guidance method and system for rocket landing based on reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115524964B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117688826A (en) * | 2023-07-13 | 2024-03-12 | 东方空间技术(山东)有限公司 | A sea-launched rocket sub-stage recovery method, equipment and storage medium |
Citations (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020008657A1 (en) * | 1993-12-21 | 2002-01-24 | Aubrey B. Poore Jr | Method and system for tracking multiple regional objects by multi-dimensional relaxation |
US7454037B2 (en) * | 2005-10-21 | 2008-11-18 | The Boeing Company | System, method and computer program product for adaptive video processing |
CN107807617A (en) * | 2016-09-08 | 2018-03-16 | 通用电气航空系统有限责任公司 | Based on fuel, time and the improved flying vehicles control for consuming cost |
US20190018375A1 (en) * | 2017-07-11 | 2019-01-17 | General Electric Company | Apparatus and method for event detection and duration determination |
CN109343341A (en) * | 2018-11-21 | 2019-02-15 | 北京航天自动控制研究所 | An intelligent control method for vertical recovery of launch vehicle based on deep reinforcement learning |
CN110687918A (en) * | 2019-10-17 | 2020-01-14 | 哈尔滨工程大学 | A Trajectory Tracking Control Method for Underwater Robots Based on Regressive Neural Network Online Approximation |
CN111338375A (en) * | 2020-02-27 | 2020-06-26 | 中国科学院国家空间科学中心 | Control method and system for moving and landing of quadrotor UAV based on hybrid strategy |
CN111766782A (en) * | 2020-06-28 | 2020-10-13 | 浙江大学 | Strategy Selection Method Based on Actor-Critic Framework in Deep Reinforcement Learning |
CN112069997A (en) * | 2020-09-04 | 2020-12-11 | 中山大学 | A method and device for autonomous landing target extraction of unmanned aerial vehicles based on DenseHR-Net |
CN112278334A (en) * | 2020-11-06 | 2021-01-29 | 北京登火汇智科技有限公司 | Method for controlling the landing process of a rocket |
US20210123741A1 (en) * | 2019-10-29 | 2021-04-29 | Loon Llc | Systems and Methods for Navigating Aerial Vehicles Using Deep Reinforcement Learning |
CN112818599A (en) * | 2021-01-29 | 2021-05-18 | 四川大学 | Air control method based on reinforcement learning and four-dimensional track |
CN113065709A (en) * | 2021-04-13 | 2021-07-02 | 西北工业大学 | Cross-domain heterogeneous cluster path planning method based on reinforcement learning |
CN113359843A (en) * | 2021-07-02 | 2021-09-07 | 成都睿沿芯创科技有限公司 | Unmanned aerial vehicle autonomous landing method and device, electronic equipment and storage medium |
CN113486938A (en) * | 2021-06-28 | 2021-10-08 | 重庆大学 | Multi-branch time convolution network-based re-landing analysis method and device |
CN114265308A (en) * | 2021-09-08 | 2022-04-01 | 哈尔滨工程大学 | Anti-saturation model-free preset performance track tracking control method for autonomous water surface vehicle |
US20220234765A1 (en) * | 2021-01-25 | 2022-07-28 | Brian Haney | Precision Landing for Rockets using Deep Reinforcement Learning |
Non-Patent Citations (4)
Title |
---|
WANG, JB等: "Optimal Rocket Landing Guidance Using Convex Optimization and Model Predictive Control" * |
ZHANG, HY等: "Autonomous Navigation with Improved Hierarchical Neural Network Based on Deep Reinforcement Learning" * |
王劲博: "可重复使用运载火箭在线轨迹优化与制导方法研究" * |
程子龙: "采用可重复使用地月转移飞船的载人月球探测系统建模与优化研究" * |
Also Published As
Publication number | Publication date |
---|---|
CN115524964B (en) | 2023-04-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Hao et al. | Dynamic path planning of a three-dimensional underwater AUV based on an adaptive genetic algorithm | |
CN114253296B (en) | Hypersonic aircraft airborne track planning method and device, aircraft and medium | |
CN113485304A (en) | Aircraft hierarchical fault-tolerant control method based on deep learning fault diagnosis | |
CN115033022A (en) | DDPG unmanned aerial vehicle landing method based on expert experience and oriented to mobile platform | |
CN116697829A (en) | A rocket landing guidance method and system based on deep reinforcement learning | |
CN115524964B (en) | A real-time robust guidance method and system for rocket landing based on reinforcement learning | |
CN113031448A (en) | Aircraft ascending section track optimization method based on neural network | |
CN111007724A (en) | A Quantitative Tracking Control Method for Specified Performance of Hypersonic Aircraft Based on Interval Type II Fuzzy Neural Network | |
CN114167756A (en) | Self-learning and hardware-in-the-loop simulation verification method for multi-UAV cooperative air combat decision-making | |
CN118732589A (en) | Intelligent control method of hypersonic shape-shifting aircraft driven by knowledge and data hybridization | |
CN111679685B (en) | Unmanned aerial vehicle total energy based flight control method and device | |
CN112161626B (en) | High-flyability route planning method based on route tracking mapping network | |
CN115922706B (en) | Flexible space mechanical arm control method, equipment and medium based on evaluation network | |
CN116484645A (en) | Aircraft optimization decision-making method, system, electronic equipment and medium | |
CN111830848A (en) | A simulation training system and method for supermaneuvering flight performance of unmanned aerial vehicle | |
CN115289917B (en) | Rocket sublevel landing real-time optimal guidance method and system based on deep learning | |
CN118192263B (en) | A spacecraft rendezvous and docking control method and system based on safety reinforcement learning | |
CN114115314B (en) | Sweep angle determining method and system for variant aircraft | |
CN117908565A (en) | Unmanned aerial vehicle safety path planning method based on maximum entropy multi-agent reinforcement learning | |
CN116068894A (en) | Rocket recovery guidance method based on double-layer reinforcement learning | |
CN115018074A (en) | Pilot decision deduction method based on multi-level fuzzy branch structure dynamic optimization | |
CN106647327A (en) | Landing signal officer vertical forced instruction modeling method based on virtual flight experience | |
CN117850240B (en) | A brain-computer shared control method and system for air-ground collaborative unmanned systems | |
Jang et al. | Robust attitude control for pavs using dnn with exponentially stabilizing control lyapunov functions | |
Yin et al. | Human strategy modeling via critical coupling of multiple characteristic patterns |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |