CN112231845B - Stratospheric airship height control method and system - Google Patents

Stratospheric airship height control method and system

Info

Publication number
CN112231845B
CN112231845B (application CN202011210395.XA; also published as CN112231845A)
Authority
CN
China
Prior art keywords
state
action
airship
space
actions
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011210395.XA
Other languages
Chinese (zh)
Other versions
CN112231845A (en)
Inventor
杨希祥
杨晓伟
王曰英
邓小龙
杨燕初
朱炳杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202011210395.XA priority Critical patent/CN112231845B/en
Publication of CN112231845A publication Critical patent/CN112231845A/en
Application granted granted Critical
Publication of CN112231845B publication Critical patent/CN112231845B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/10Geometric CAD
    • G06F30/15Vehicle, aircraft or watercraft design
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/20Design optimisation, verification or simulation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00Computing arrangements based on specific mathematical models
    • G06N7/01Probabilistic graphical models, e.g. probabilistic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2111/00Details relating to CAD techniques
    • G06F2111/08Probabilistic or stochastic CAD
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T90/00Enabling technologies or technologies with a potential or indirect contribution to GHG emissions mitigation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Geometry (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Optimization (AREA)
  • Computer Hardware Design (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Algebra (AREA)
  • Automation & Control Theory (AREA)
  • Aviation & Aerospace Engineering (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computing Systems (AREA)
  • Feedback Control In General (AREA)

Abstract

The invention relates to a stratospheric airship height control method and system, belonging to the technical field of airship height control. A Markov decision process model for airship height control is first established. Based on the current state, a plurality of actions is selected from the action space according to a Q-learning algorithm and the state transition probability; all actions selected in transferring from the initial state of the airship to a termination state are determined and taken as the action sequence corresponding to that termination state. Finally, a target optimization function is used to select, among all action sequences, the sequence with the maximum target value as the optimal sequence, and the motion of the airship is controlled according to the optimal sequence. The parameters of the stratospheric airship's dynamic model therefore no longer need to be obtained, and the real-time state of the stratospheric airship together with the set reference state serves as the control input of the control system, which simplifies the control loop.

Description

Stratospheric airship height control method and system
Technical Field
The invention relates to the technical field of airship height control, and in particular to a stratospheric airship height control method and system.
Background
The stratospheric airship is a typical low-speed near-space aircraft. During station-keeping it relies for energy on a day-night closed-loop system combining solar cells and storage batteries, and for power on a vectored propulsion device, so that the horizontal track and the altitude of the airship can be controlled and the airship can remain on station over a region for months at a time. It has huge application potential in fields such as high-resolution earth observation, communication relay, reconnaissance and surveillance, and environmental monitoring, and is known as a "stratospheric satellite".
Long-duration station-keeping is difficult mainly because the stratospheric wind field is distributed with altitude and exhibits pronounced short-period spatio-temporal variation, while the stratospheric airship has a huge volume and a large windward area and its energy consumption is proportional to the cube of the wind speed; when the airship operates in a strong wind field, the platform consumes excessive energy. The altitude of the airship therefore needs to be regulated appropriately so that it flies in a weak-wind layer as much as possible, reducing energy consumption and prolonging the station-keeping time. However, airship altitude control suffers from under-actuation, large inertia, long time delay, uncertain dynamic model parameters, obvious nonlinear characteristics, and similar problems.
The flight principle of the stratospheric airship differs markedly from that of conventional aircraft such as airplanes. For altitude regulation, climbing and descending are mainly adjusted by the inflation valve/fan, with control-surface deflection and vectored thrust used only as short-term auxiliary means. Scholars at home and abroad have carried out a series of studies on the altitude control of stratospheric airships and similar aerostats. Traditional approaches include conventional linear PID control, nonlinear gain-scheduling control, backstepping control, and the like; they can control altitude effectively under certain conditions, but suffer to varying degrees from difficulty in parameter selection and poor self-adaptation. With the improvement of computer performance and the rapid development of machine learning algorithms and platforms, artificial-intelligence decision and control has gradually penetrated various application scenarios, including Monte-Carlo learning methods that handle continuous state-action spaces with Gaussian processes, partially observable Bayesian learning based on Gaussian-process dynamics models, and Q-learning methods using CMAC neural networks. These intelligent control methods, however, generally exhibit large randomness and poor convergence.
Based on this, a need exists for a control method and system that can improve the flexibility and stability of airship height control.
Disclosure of Invention
The invention aims to provide a stratospheric airship height control method and system that no longer depend on a dynamic model, control the altitude of the airship on the basis of its actual motion state, and can improve the adaptability, effectiveness and stability of control.
In order to achieve the purpose, the invention provides the following scheme:
a stratospheric airship height control method comprises the following steps:
establishing a Markov decision process model for airship height control; the Markov decision process model comprises a state space, an action space, a return value space, a state transition probability and a target optimization function;
randomly determining an initial state of an airship, and taking the initial state of the airship as a current state;
selecting a plurality of actions from the action space according to a Q-learning algorithm and the state transition probability based on the current state;
for each action, determining a reward value corresponding to the action and a state at a next moment according to the action and the reward value space;
for each state at the next moment, judging whether the state at the next moment is a termination state or not to obtain a first judgment result;
when the first judgment result is yes, determining all selected actions for transferring from the initial state of the airship to the termination state by taking the state at the next moment as the termination state, and taking all the selected actions as an action sequence corresponding to the termination state;
when the first judgment result is negative, taking the state at the next moment as the current state, returning to the step of selecting a plurality of actions from the action space according to a Q-learning algorithm and the state transition probability based on the current state until the state at each next moment is the termination state, and determining an action sequence corresponding to each termination state;
calculating a target value corresponding to each action sequence by using the target optimization function; selecting the action sequence with the maximum target value in all the action sequences as an optimal sequence; and controlling the airship to move according to the optimal sequence.
A stratospheric airship height control system, the control system comprising:
the Markov decision process model acquisition module is used for establishing a Markov decision process model for airship height control; the Markov decision process model comprises a state space, an action space, a return value space, a state transition probability and a target optimization function;
the airship initial state determining module is used for randomly determining the initial state of the airship and taking the initial state of the airship as the current state;
an action selection module, configured to select, based on the current state, a plurality of actions from the action space according to a Q-learning algorithm and the state transition probability;
the return reward value determining module is used for determining a return reward value corresponding to each action and a state of the next moment according to the action and the return value space;
the judging module is used for judging whether the state at the next moment is a termination state or not to obtain a first judging result;
an action sequence acquisition module, configured to determine, when the first determination result is yes, all actions selected to transition from the initial state to the end state of the airship by taking the state at the next time as the end state, and take all the selected actions as an action sequence corresponding to the end state;
a returning module, configured to, when the first determination result is negative, take the state at the next time as the current state, return to the step of "selecting multiple actions from the action space according to a Q-learning algorithm and the state transition probability based on the current state" until the state at each next time is the termination state, and determine an action sequence corresponding to each termination state;
the optimal sequence determining module is used for calculating a target value corresponding to each action sequence by using the target optimization function; selecting the action sequence with the maximum target value from all the action sequences as an optimal sequence; and controlling the airship to move according to the optimal sequence.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the invention provides a method and a system for controlling the height of an airship on a stratosphere, which are characterized in that a Markov decision process model for controlling the height of the airship is established, a plurality of actions are selected from an action space according to a Q-learning algorithm and a state transition probability based on a current state, for each action, a reward value corresponding to the action and a state at the next moment are determined according to the action and the reward value space, until the state at each next moment is a termination state, the state at the next moment is taken as the termination state, all actions selected for transferring from an initial state to the termination state of the airship are determined, all the selected actions are taken as action sequences corresponding to the termination state, finally, a target value corresponding to each action sequence is calculated by using a target optimization function, the action sequence with the largest target value in all the action sequences is selected as an optimal sequence, and the airship is controlled to move according to the optimal sequence, so that the parameters of a relevant dynamic model of the stratospheric airship are not required to be acquired, and the real-time state of the stratospheric airship and the set reference state are used as the control input of a control system, so that a control loop is simplified. Meanwhile, the action selection is determined by adopting the uncertain probability strategy of the Q-learning algorithm and the state transition probability determined by the speed distribution, namely the deterministic probability strategy, so that the deterministic probability effectively avoids poor convergence characteristic in the reinforcement learning process, the uncertain probability effectively solves the problem of 'exploration-utilization' and enhances the adaptability of the controller. In addition, Q-learning controller design is carried out according to an MDP model under certain human intervention, real-time state and motion realizability of the stratospheric airship are considered through pre-initial probability distribution and a return value space based on an expected height change curve, and real-time control is carried out by combining a static action selection strategy and a dynamic action selection strategy, so that the Q-learning controller has strong autonomy and stability.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without creative efforts.
Fig. 1 is a flowchart of a method for controlling the height of an airship according to embodiment 1 of the present invention.
Fig. 2 is a structural diagram of an airship height controller based on a Q-learning algorithm according to embodiment 1 of the present invention.
Fig. 3 is a block diagram of a flow chart for selecting an optimal sequence based on a Q-learning algorithm according to embodiment 1 of the present invention.
Fig. 4 is a block diagram of a process of generating a return value space according to embodiment 1 of the present invention.
Fig. 5 is a schematic diagram of a tracking result of the Q-learning algorithm under different distribution strategies provided in embodiment 1 of the present invention.
Fig. 6 is a schematic diagram of the tracking results of the Q-learning algorithm and the SARSA algorithm under the same distribution strategy provided in embodiment 1 of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention aims to provide a stratospheric airship height control method and system that no longer depend on a dynamic model, control the altitude of the airship on the basis of its actual motion state, and can improve the adaptability, effectiveness and stability of control.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Example 1:
For the altitude-control problem of the stratospheric airship, this embodiment provides a Markov decision process model composed of state transition probabilities decided by the airship velocity distribution and a reward value space based on an obstacle-avoidance idea, and adopts an intelligent altitude-control method based on the Q-learning algorithm. The velocity distribution serves as the basis for probabilistic action selection in the Q-learning process and is one of the key elements of the algorithm; to realize the interaction between the external environment and the agent, the velocity must be corrected by feedback, and the real-time altitude difference is used as the basis for updating the ascent-velocity distribution, ensuring that the state transition probability remains reasonable and effective. The reward value space based on the obstacle-avoidance idea is the main basis for judging the optimal action sequence: the action taken in each state is evaluated against the expected altitude change curve, each action sequence is screened, the action strategy is finally determined, and control of the stratospheric airship altitude is completed. The method is a learning-training control process independent of any dynamic model and integrates training and control. Because the stratospheric airship is a slow-moving, stable and fault-tolerant controlled object, an early erroneous action command does not noticeably affect the motion of the airship, so the altitude of the airship can be controlled online and the airship state can be regulated in real time during training. In practical application, the inputs of the control system are the current altitude and ascent speed of the airship, the output is the opening degree of the inflation/deflation valve, and tracking control of the reference altitude by the stratospheric airship is achieved by adjusting the valve.
This embodiment provides a stratospheric airship height control method. As shown in fig. 1, fig. 2 and fig. 3, the method comprises the following steps:
step 101: establishing a Markov decision process model for airship height control; the Markov decision process model comprises a state space, an action space, a return value space, a state transition probability and a target optimization function;
The Markov decision process model (MDP) in the Q-learning algorithm is mapped onto the airship object through five elements, which are specified as follows:
1) state space S
The state-space elements of airship height control are characterized as the set of airship altitude and ascent speed at each moment, i.e. the state space contains the airship state at each moment: $s_t = [h_t, v_t]$, where $s_t$ is the airship state at the t-th moment, $h_t$ is the airship altitude at the t-th moment, and $v_t$ is the airship ascent speed at the t-th moment. The altitude and ascent speed representing the airship state are used as the input of the Q-learning algorithm.
2) Action space A
By analyzing the effective mechanisms for adjusting the altitude of the stratospheric airship, the action space of the altitude control is characterized as the set of inflation/deflation-valve opening degrees in each altitude state, i.e. an action in the action space is an opening degree of the airship's inflation valve or deflation valve. In the simulation the valve is given 9 settings, and the action space is $A = [a_1, a_2, a_3, a_4, a_5, a_6, a_7, a_8, a_9]$; the first four actions are inflation-valve openings, the last four actions are deflation-valve openings, and the fifth action represents both inflation and deflation valves being closed.
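For illustration, the state and action spaces above can be written down directly. The short Python sketch below does so; the concrete valve-opening fractions attached to $a_1$ through $a_9$ are assumed values, since the text only fixes their roles (inflate, closed, deflate).

```python
from dataclasses import dataclass

@dataclass
class State:
    """Airship state s_t = [h_t, v_t]."""
    h: float  # altitude at time t [m]
    v: float  # ascent speed at time t [m/s]

# Nine valve settings: a1-a4 inflation-valve openings, a5 all valves closed,
# a6-a9 deflation-valve openings. The opening fractions are illustrative only.
ACTIONS = {
    "a1": ("inflate", 1.00), "a2": ("inflate", 0.75),
    "a3": ("inflate", 0.50), "a4": ("inflate", 0.25),
    "a5": ("closed",  0.00),
    "a6": ("deflate", 0.25), "a7": ("deflate", 0.50),
    "a8": ("deflate", 0.75), "a9": ("deflate", 1.00),
}

s0 = State(h=20000.0, v=0.0)  # example initial state near the reference curve
```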
3) Return value space
As shown in fig. 4, the space around the expected altitude change curve is rasterized with a certain precision; at each moment a state transition is performed according to the achievable actions that have been designed, and the reward value is determined by the Euclidean distance between the post-transition state and the expected state. The way the reward value is assigned follows an obstacle-avoidance idea: according to the error between the state obtained after selecting an action in the current state and the expected state, the reward value decreases step by step with distance from the expected state. That is, a small error yields a large reward value and a large error yields a small one. After all actions have been selected for all states, the return value space is obtained, laying the foundation for generating the optimal action strategy.
Specifically, the design process of the return value space includes: acquiring a pre-planned expected height change curve; rasterizing the surrounding space of the expected height change curve to obtain a grid space; selecting an action in the action space for each state in the state space; performing state transition on the state based on the action to obtain a state of a next moment corresponding to the action; calculating the Euclidean distance between the state of the next moment and the expected state corresponding to the next moment; obtaining a return reward value corresponding to the action according to the Euclidean distance; the larger the Euclidean distance is, the smaller the reward value is; judging whether the actions in the action space are all selected; if not, selecting an unselected action in the action space, returning to the step of performing state transition on the state based on the action to obtain the state at the next moment corresponding to the action until the actions in the action space are all selected, obtaining a return reward value corresponding to each action in the action space, and determining a return reward value set of the state; and assigning the grid space according to the return reward value set of each state to obtain a return value space.
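A minimal sketch of this reward-assignment rule is given below. The exponential decay with distance and the helper signatures (`step`, `expected_curve`) are assumptions introduced for illustration; the text only requires that the reward decrease as the Euclidean distance to the expected state grows.

```python
import math

def reward(next_state, expected_state, scale=1.0):
    """Obstacle-avoidance-style reward: large near the expected state, small far from it."""
    dist = math.hypot(next_state[0] - expected_state[0],
                      next_state[1] - expected_state[1])  # Euclidean distance
    return math.exp(-dist / scale)

def build_reward_space(states, actions, step, expected_curve):
    """Tabulate the reward for every (state, action) pair on the rasterized grid.

    `step(state, action)` returns the state at the next moment and
    `expected_curve(t)` returns the expected (h, v) at moment t; both are
    user-supplied placeholders here.
    """
    R = {}
    for t, s in enumerate(states):
        for a in actions:
            R[(s, a)] = reward(step(s, a), expected_curve(t + 1))
    return R
```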
4) Probability of state transition P
Based on the idea of probabilistic dynamics, the probability distribution of the velocity is used as the basis for action selection in the Q-learning algorithm, i.e. as the basis for state transitions. An ideal velocity distribution is designed to obtain the initial state transition probability, and the velocity distribution and state transition probability are then corrected according to the actual altitude difference, completing the design of a Markov decision process model under a certain degree of human intervention. For the planned expected altitude change curve, with the altitude difference and the initial and final speeds known, uniformly accelerated and uniformly decelerated motion is assumed, and on this ideal basis the initial airship velocity distribution is obtained. The designed velocity distribution is then corrected by feedback according to the altitude difference produced during actual flight, ensuring the validity and correctness of the velocity, and the state transition probability is updated to guarantee that every action or state transition is reasonable.
The formula for the state transition probability is:

$$P(s_t, a_t) = \frac{N(v_t)}{\sum_{i} N(v_i)} \qquad (1)$$

where $P(s_t, a_t)$ is the probability of selecting action $a_t$ when the state at the t-th moment is $s_t$; $N(v_t)$ is the number of times the speed interval containing the current speed $v_t$ has been selected; and $\sum_i N(v_i)$ is the total number of times all speed intervals have been selected.
Specifically, the design process of the state transition probability includes: determining a speed change interval and a height change value according to a pre-planned expected height change curve; according to the height change value and the initial and final speeds, designing uniform acceleration and uniform deceleration motion to obtain initial airship speed distribution; correcting the initial airship speed distribution according to the actual altitude difference in the flying motion process to obtain the airship speed at each moment; dividing the speed change interval into a plurality of speed change subintervals; and calculating the state transition probability of the airship according to the speed of the airship at each moment and the plurality of speed change subintervals.
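The sketch below implements formula (1) as a selection-count ratio over the nine speed sub-intervals. The interval edges and the feedback-update rule that maps the real-time altitude difference back onto a speed interval are assumptions, since those details are given only in the image-only Tables 1 and 2.

```python
import numpy as np

# Nine speed sub-intervals on [-5, 5] m/s, finer near zero (edge values assumed).
SPEED_EDGES = np.array([-5.0, -3.0, -1.5, -0.6, -0.2, 0.2, 0.6, 1.5, 3.0, 5.0])

class VelocityDistribution:
    def __init__(self):
        self.counts = np.ones(len(SPEED_EDGES) - 1)  # ideal initial distribution

    def interval_of(self, v):
        """Index of the speed sub-interval containing v."""
        return int(np.clip(np.searchsorted(SPEED_EDGES, v) - 1, 0, len(self.counts) - 1))

    def transition_probs(self):
        """Formula (1): selection count of each interval over the total count."""
        return self.counts / self.counts.sum()

    def feedback_update(self, altitude_error):
        """Correct the distribution with the real-time altitude difference by
        reinforcing the speed interval that would reduce the error
        (this error-to-speed mapping is an assumed stand-in for Table 2)."""
        desired_v = float(np.clip(altitude_error, SPEED_EDGES[0], SPEED_EDGES[-1]))
        self.counts[self.interval_of(desired_v)] += 1.0
```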
To make the design process of the state transition probability clearer to those skilled in the art, a specific example is given here. An expected altitude change curve is designed; take an altitude change from 20000 m to 20010 m as an example. In the ideal case, to complete the 10 m climb in the shortest time, uniformly accelerated and decelerated motion is designed: the airship accelerates at 1.25 m/s² to 2.5 m/s and then decelerates at -1.25 m/s² back to zero. Based on this idea, the ascent speed is designed ideally, giving the initial airship velocity distribution. Considering that the altitude difference of the designed expected altitude change curve varies within 10 m, the speed variation interval is designed as [-5, 5] m/s in the speed-design stage, and this interval is divided into 9 sub-intervals, the same number as the actions in the action space, with the interval size gradually decreasing as the speed approaches zero; this improves the speed-correction precision and in turn the control precision. The correspondence between speed interval, altitude change and action value is shown in Table 1.
TABLE 1
[Table 1 appears only as an image in the original publication: correspondence between speed interval, altitude change and action value; the numerical entries are not recoverable from the text.]
In practice, owing to uncertain factors such as aerodynamic forces and environmental disturbances, the velocity distribution of the airship deviates from the ideal initial velocity distribution. The designed initial airship velocity distribution therefore has to be corrected by feedback, using the actual altitude difference to adjust the velocity and so guarantee the validity of the state transition probability determined by the velocity distribution. The altitude difference is divided into 7 sub-intervals, as shown in Table 2.
TABLE 2
[Table 2 appears only as an image in the original publication: the 7 altitude-difference sub-intervals used for velocity feedback correction; the numerical entries are not recoverable from the text.]
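As a quick kinematic check of the ideal profile in the example above (an illustration added here, not part of the original text): with $a = 1.25\,\mathrm{m/s^2}$ and a peak speed of $2.5\,\mathrm{m/s}$, the acceleration and deceleration phases each take $2\,\mathrm{s}$ and cover $2.5\,\mathrm{m}$, so covering the full $10\,\mathrm{m}$ additionally requires roughly $2\,\mathrm{s}$ at the peak speed:

$$t_{\mathrm{acc}} = \frac{2.5}{1.25} = 2\,\mathrm{s},\qquad d_{\mathrm{acc}} = \tfrac12 \cdot 1.25 \cdot 2^2 = 2.5\,\mathrm{m},\qquad d_{\mathrm{cruise}} = 10 - 2\cdot 2.5 = 5\,\mathrm{m},\qquad t_{\mathrm{cruise}} = \frac{5}{2.5} = 2\,\mathrm{s},$$

giving a total climb time of about $6\,\mathrm{s}$ for the $10\,\mathrm{m}$ altitude change.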
5) Objective optimization function J
According to the reward value obtained at each state transition, probabilistic action selection is performed under a given action strategy, and the reward values of the action sequence actually produced by these selections are summed with weights. The optimization objective is to maximize this sum, and the maximum sum corresponds to the optimal action sequence.
The objective optimization function is:

$$J = \sum_{t} \gamma_t \, P(s_t, a_t)\, R\!\left(h_{c(t+1)} - h_{d(t+1)}\right) \qquad (2)$$

where the sum runs over the n actions of the action sequence; J is the target value; n is the number of actions in the action sequence; $\gamma_t$ is the discount factor at the t-th moment; $P(s_t, a_t)$ is the probability of selecting action $a_t$ in state $s_t$; $R(h_{c(t+1)} - h_{d(t+1)})$ is the reward value determined by the Euclidean distance between the state at the next moment and the expected state corresponding to the next moment; $h_{c(t+1)}$ is the state at the next moment; and $h_{d(t+1)}$ is the expected state corresponding to the next moment.
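A minimal sketch of evaluating objective (2) for candidate action sequences follows. The data layout of a sequence (a list of per-step selection probability and reward) and the discount value are assumptions for illustration, and $\gamma_t$ is taken here as a constant discount raised to the power t, one plausible reading of the per-moment discount factor.

```python
def sequence_value(sequence, gamma=0.95):
    """Objective (2): discounted, probability-weighted sum of rewards along one sequence.

    `sequence` is a list of (prob, reward) pairs, one per executed action.
    """
    return sum((gamma ** t) * p * r for t, (p, r) in enumerate(sequence))

def optimal_sequence(sequences, gamma=0.95):
    """Pick the action sequence whose target value J is largest."""
    return max(sequences, key=lambda seq: sequence_value(seq, gamma))
```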
Through the design of these five elements, the Markov decision process model for airship height control is obtained, and the action strategy during height control is planned on the basis of this model.
Step 102: randomly determining an initial state of an airship, and taking the initial state of the airship as a current state;
step 103: selecting a plurality of actions from the action space according to a Q-learning algorithm and the state transition probability based on the current state;
specifically, the selecting, based on the current state, a plurality of actions from the action space according to a Q-learning algorithm and the state transition probability specifically includes: randomly generating a random number based on the current state; judging whether the random number is smaller than a preset value; if yes, selecting a plurality of actions from the action space according to the state transition probability; and if not, selecting a plurality of actions from the action space according to a Q-learning algorithm.
Step 104: for each action, determining a reward value corresponding to the action and a state at a next moment according to the action and the reward value space;
step 105: for each state at the next moment, judging whether the state at the next moment is a termination state or not to obtain a first judgment result;
step 106: when the first judgment result is yes, determining all selected actions for transferring from the initial state of the airship to the termination state by taking the state at the next moment as the termination state, and taking all the selected actions as an action sequence corresponding to the termination state;
step 107: when the first judgment result is negative, taking the state at the next moment as the current state, returning to the step of selecting a plurality of actions from the action space according to a Q-learning algorithm and the state transition probability based on the current state until the state at each next moment is the termination state, and determining an action sequence corresponding to each termination state;
step 108: calculating a target value corresponding to each action sequence by using the target optimization function; selecting the action sequence with the maximum target value in all the action sequences as an optimal sequence; and controlling the airship to move according to the optimal sequence.
This embodiment makes full use of the real-time altitude and speed states of the stratospheric airship, uses the altitude difference as the feedback-correction term of the ideal speed, and replaces the conventional random probability distribution of the state transition in the Q-learning algorithm with the speed probability distribution, so that state transitions better match the actual ascent process; the actions of the stratospheric airship's actuators are decided from the distribution of the speed, which enhances the effectiveness and robustness of the controller.
Altitude tracking was simulated with the above airship height control method. Based on the same altitude change curve, the tracking performance of Q-learning controllers under different probability distributions was compared, i.e. the probability based on the Boltzmann distribution versus the probability based on the ascent-speed distribution; the tracking results of the two are shown in fig. 5. With the same speed probability distribution, different algorithms were compared, i.e. the Q-learning controller versus the SARSA controller, as shown in fig. 6. The simulation results show that the Q-learning controller designed with the transition probability derived from the ascent-speed distribution and the reward value space based on the obstacle-avoidance idea can track the expected altitude change curve, effectively overcomes controller failure caused by perturbation of model parameters, avoids the obvious oscillation caused by the Boltzmann random probability distribution, and remedies the slow convergence of the SARSA algorithm. The comparison of the simulation results shows that the method offers good tracking performance together with effectiveness, stability, robustness and intelligent autonomy.
Example 2:
the present embodiment is configured to provide a stratospheric airship height control system, which operates by using the airship height control method according to embodiment 1, and the control system includes:
the Markov decision process model acquisition module is used for establishing a Markov decision process model for airship height control; the Markov decision process model comprises a state space, an action space, a return value space, a state transition probability and a target optimization function;
the Markov decision process model obtaining module comprises a return value space design submodule; the return value space design submodule comprises:
the grid space acquisition unit is used for acquiring a pre-planned expected height change curve; rasterizing the surrounding space of the expected height change curve to obtain a grid space;
the reward value calculation unit is used for selecting one action in the action space for each state in the state space; performing state transition on the state based on the action to obtain a state of a next moment corresponding to the action; calculating the Euclidean distance between the state of the next moment and the expected state corresponding to the next moment; obtaining a return reward value corresponding to the action according to the Euclidean distance; the larger the Euclidean distance is, the smaller the reward value is;
the first judgment unit is used for judging whether the actions in the action space are all selected;
a reward value set calculation unit, configured to, if not, select an unselected action in the action space, and return to the step of "performing state transition on the state based on the action to obtain a state at a next time corresponding to the action", until all actions in the action space have been selected, obtain a reward value corresponding to each action in the action space, and determine a reward value set of the state;
and the return value space acquisition unit is used for assigning values to the grid space according to the return reward value set of each state to obtain a return value space.
The airship initial state determining module is used for randomly determining the initial state of the airship and taking the initial state of the airship as the current state;
an action selection module, configured to select, based on the current state, a plurality of actions from the action space according to a Q-learning algorithm and the state transition probability;
the action selection module specifically comprises:
a random number generation unit for randomly generating a random number based on the current state;
the second judging unit is used for judging whether the random number is smaller than a preset value or not;
a first selection unit, configured to select, if yes, a plurality of actions from the action space according to the state transition probability;
and the second selection unit is used for selecting a plurality of actions from the action space according to a Q-learning algorithm if the random number is not smaller than the preset value.
The return reward value determining module is used for determining a return reward value corresponding to each action and a state of the next moment according to the action and the return value space;
the judging module is used for judging whether the state at the next moment is a termination state or not for the state at each next moment to obtain a first judging result;
the action sequence acquisition module is used for determining all actions selected from the transition from the initial state of the airship to the terminal state by taking the state at the next moment as the terminal state when the first judgment result is yes, and taking all the selected actions as an action sequence corresponding to the terminal state;
a returning module, configured to, when the first determination result is negative, take the state at the next time as the current state, return to the step of "selecting a plurality of actions from the action space according to a Q-learning algorithm and the state transition probability based on the current state", until the state at each next time is the termination state, and determine an action sequence corresponding to each termination state;
the optimal sequence determining module is used for calculating a target value corresponding to each action sequence by using the target optimization function; selecting the action sequence with the maximum target value in all the action sequences as an optimal sequence; and controlling the airship to move according to the optimal sequence.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the description of the method part.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (7)

1. A stratospheric airship height control method is characterized by comprising the following steps:
establishing a Markov decision process model for airship height control; the Markov decision process model comprises a state space, an action space, a return value space, a state transition probability and a target optimization function; the design process of the state transition probability comprises the following steps: determining a speed change interval and a height change value according to a pre-planned expected height change curve; according to the height change value and the initial and final speeds, designing uniform acceleration and uniform deceleration motion to obtain initial airship speed distribution; correcting the initial airship speed distribution according to the actual altitude difference in the flying motion process to obtain the airship speed at each moment; dividing the speed change interval into a plurality of speed change subintervals; calculating the state transition probability of the airship according to the speed of the airship at each moment and the plurality of speed change subintervals;
randomly determining an initial state of an airship, and taking the initial state of the airship as a current state;
selecting a plurality of actions from the action space according to a Q-learning algorithm and the state transition probability based on the current state;
for each action, determining a reward value corresponding to the action and a state at a next moment according to the action and the reward value space;
for each state at the next moment, judging whether the state at the next moment is a termination state or not to obtain a first judgment result;
when the first judgment result is yes, determining all selected actions for transferring from the initial state of the airship to the termination state by taking the state at the next moment as the termination state, and taking all the selected actions as an action sequence corresponding to the termination state;
when the first judgment result is negative, taking the state at the next moment as the current state, returning to the step of selecting a plurality of actions from the action space according to a Q-learning algorithm and the state transition probability based on the current state until the state at each next moment is the termination state, and determining an action sequence corresponding to each termination state;
calculating a target value corresponding to each action sequence by using the target optimization function; selecting the action sequence with the maximum target value in all the action sequences as an optimal sequence; controlling the airship to move according to the optimal sequence;
selecting a plurality of actions from the action space according to a Q-learning algorithm and the state transition probability based on the current state specifically includes: randomly generating a random number based on the current state; judging whether the random number is smaller than a preset value; if yes, selecting a plurality of actions from the action space according to the state transition probability; and if not, selecting a plurality of actions from the action space according to a Q-learning algorithm.
2. The stratospheric airship height control method as recited in claim 1, wherein the state space includes an airship state at each moment;
wherein $s_t = [h_t, v_t]$; $s_t$ is the airship state at the t-th moment; $h_t$ is the airship altitude at the t-th moment; and $v_t$ is the airship speed at the t-th moment.
3. The stratospheric airship height control method according to claim 1, wherein the action of the action space is an opening degree of an airship inflation valve or a deflation valve.
4. The method of claim 2, wherein the designing of the return value space comprises:
acquiring a pre-planned expected height change curve;
rasterizing the surrounding space of the expected height change curve to obtain a grid space;
selecting an action in the action space for each state in the state space;
performing state transition on the state based on the action to obtain a state of a next moment corresponding to the action; calculating the Euclidean distance between the state of the next moment and the expected state corresponding to the next moment; obtaining a return reward value corresponding to the action according to the Euclidean distance; the larger the Euclidean distance is, the smaller the return reward value is;
judging whether the actions in the action space are all selected;
if not, selecting an unselected action in the action space, returning to the step of performing state transition on the state based on the action to obtain a state at the next moment corresponding to the action until all actions in the action space are selected, obtaining a return reward value corresponding to each action in the action space, and determining a return reward value set of the state;
and assigning the grid space according to the return reward value set of each state to obtain a return value space.
5. The method of claim 1, wherein the objective optimization function is:
$$J = \sum_{t} \gamma_t \, P(s_t, a_t)\, R\!\left(h_{c(t+1)} - h_{d(t+1)}\right)$$

wherein the sum runs over the n actions of the action sequence; J is the target value; n is the number of actions in the action sequence; $\gamma_t$ is the discount factor at the t-th moment; $P(s_t, a_t)$ is the probability of selecting action $a_t$ in state $s_t$; $R(h_{c(t+1)} - h_{d(t+1)})$ is the reward value determined by the Euclidean distance between the state at the next moment and the expected state corresponding to the next moment; $h_{c(t+1)}$ is the state at the next moment; and $h_{d(t+1)}$ is the expected state corresponding to the next moment.
6. A stratospheric airship height control system, the control system comprising:
the Markov decision process model acquisition module is used for establishing a Markov decision process model for airship height control; the Markov decision process model comprises a state space, an action space, a return value space, a state transition probability and a target optimization function; the design process of the state transition probability comprises the following steps: determining a speed change interval and a height change value according to a pre-planned expected height change curve; according to the height change value and the initial and final speeds, designing uniform acceleration and uniform deceleration motion to obtain initial airship speed distribution; correcting the initial airship speed distribution according to the actual altitude difference in the flying motion process to obtain the airship speed at each moment; dividing the speed change interval into a plurality of speed change subintervals; calculating the state transition probability of the airship according to the speed of the airship at each moment and the plurality of speed change subintervals;
the airship initial state determining module is used for randomly determining the initial state of the airship and taking the initial state of the airship as the current state;
an action selection module, configured to select, based on the current state, a plurality of actions from the action space according to a Q-learning algorithm and the state transition probability;
the return reward value determining module is used for determining a return reward value corresponding to each action and a state of the next moment according to the action and the return value space;
the judging module is used for judging whether the state at the next moment is a termination state or not for the state at each next moment to obtain a first judging result;
an action sequence acquisition module, configured to determine, when the first determination result is yes, all actions selected to transition from the initial state to the end state of the airship by taking the state at the next time as the end state, and take all the selected actions as an action sequence corresponding to the end state;
a returning module, configured to, when the first determination result is negative, take the state at the next time as the current state, return to the step of "selecting multiple actions from the action space according to a Q-learning algorithm and the state transition probability based on the current state" until the state at each next time is the termination state, and determine an action sequence corresponding to each termination state;
the optimal sequence determining module is used for calculating a target value corresponding to each action sequence by using the target optimization function; selecting the action sequence with the maximum target value from all the action sequences as an optimal sequence; controlling the airship to move according to the optimal sequence;
the action selection module specifically comprises:
a random number generation unit for randomly generating a random number based on the current state;
the second judging unit is used for judging whether the random number is smaller than a preset value or not;
a first selection unit, configured to select, if yes, a plurality of actions from the action space according to the state transition probability;
and the second selection unit is used for selecting a plurality of actions from the action space according to a Q-learning algorithm if the random number is not smaller than the preset value.
7. The stratospheric airship height control system of claim 6, wherein the markov decision process model obtaining module comprises a return value space design submodule; the return value space design submodule comprises:
the grid space acquisition unit is used for acquiring a pre-planned expected height change curve; rasterizing the surrounding space of the expected height change curve to obtain a grid space;
the reward value calculation unit is used for selecting one action in the action space for each state in the state space; performing state transition on the state based on the action to obtain the state of the next moment corresponding to the action; calculating the Euclidean distance between the state of the next moment and the expected state corresponding to the next moment; obtaining a return reward value corresponding to the action according to the Euclidean distance; the larger the Euclidean distance is, the smaller the return reward value is;
the first judgment unit is used for judging whether the actions in the action space are all selected;
a reward value set calculation unit, configured to, if not, select an unselected action in the action space, and return to the step of "performing state transition on the state based on the action to obtain a state at a next time corresponding to the action", until all actions in the action space have been selected, obtain a reward value corresponding to each action in the action space, and determine a reward value set of the state;
and the return value space acquisition unit is used for assigning values to the grid space according to the return reward value set of each state to obtain a return value space.
CN202011210395.XA 2020-11-03 2020-11-03 Stratospheric airship height control method and system Active CN112231845B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011210395.XA CN112231845B (en) 2020-11-03 2020-11-03 Stratospheric airship height control method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011210395.XA CN112231845B (en) 2020-11-03 2020-11-03 Stratospheric airship height control method and system

Publications (2)

Publication Number Publication Date
CN112231845A CN112231845A (en) 2021-01-15
CN112231845B true CN112231845B (en) 2022-09-02

Family

ID=74122722

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011210395.XA Active CN112231845B (en) 2020-11-03 2020-11-03 Stratospheric airship height control method and system

Country Status (1)

Country Link
CN (1) CN112231845B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113176786A (en) * 2021-04-23 2021-07-27 成都凯天通导科技有限公司 Q-Learning-based hypersonic aircraft dynamic path planning method
CN113552902A (en) * 2021-08-10 2021-10-26 中国人民解放军国防科技大学 Three-dimensional trajectory tracking control method and system for stratospheric airship

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106483852B (en) * 2016-12-30 2019-03-15 北京天恒长鹰科技股份有限公司 A kind of stratospheric airship control method based on Q-Learning algorithm and neural network
CN111538241B (en) * 2020-04-30 2022-12-23 中国人民解放军国防科技大学 Intelligent control method for horizontal track of stratospheric airship

Also Published As

Publication number Publication date
CN112231845A (en) 2021-01-15

Similar Documents

Publication Publication Date Title
Mohammadi et al. A modified crow search algorithm (MCSA) for solving economic load dispatch problem
ES2644528T3 (en) Method for the computer-aided determination of the use of electric power produced by a power generation plant, particularly a renewable power generation plant
CN112231845B (en) Stratospheric airship height control method and system
Chakrabarty et al. Flight path planning for uav atmospheric energy harvesting using heuristic search
CN112733462B (en) Ultra-short-term wind power plant power prediction method combining meteorological factors
CN104850009A (en) Coordination control method for multi-unmanned aerial vehicle team based on predation escape pigeon optimization
Chakrabarty et al. UAV flight path planning in time varying complex wind-fields
CN108388250B (en) Water surface unmanned ship path planning method based on self-adaptive cuckoo search algorithm
Zhang et al. Receding horizon control for multi-UAVs close formation control based on differential evolution
CN109858106A (en) Aircraft winglet stroke optimization method based on Gauss puppet spectrometry
CN114217524A (en) Power grid real-time self-adaptive decision-making method based on deep reinforcement learning
Liu et al. Study on UAV parallel planning system for transmission line project acceptance under the background of industry 5.0
CN109737008A (en) Wind turbines intelligence variable blade control system and method, Wind turbines
Wenjun et al. Energy-optimal trajectory planning for solar-powered aircraft using soft actor-critic
Xi et al. Energy-optimized trajectory planning for solar-powered aircraft in a wind field using reinforcement learning
CN114696340A (en) Wind power frequency modulation gradual inertia control method based on deep learning
CN114089776B (en) Unmanned aerial vehicle obstacle avoidance method based on deep reinforcement learning
CN110399697A (en) Control distribution method based on the aircraft for improving genetic learning particle swarm algorithm
CN114237268A (en) Unmanned aerial vehicle strong robust attitude control method based on deep reinforcement learning
CN114528750A (en) Intelligent air combat imitation learning training sample generation method based on self-game model
Ni et al. Energy-optimal flight strategy for solar-powered aircraft using reinforcement learning with discrete actions
CN116432968A (en) Energy storage planning method and equipment for power system
CN108453737A (en) A kind of robot motion track acquisition system and method based on neural network
Ghadi et al. A new method for short-term wind power forecasting
CN116224964A (en) Fault-tolerant control method for underwater robot propulsion system fault

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant