CN114676471B - Method and device for establishing mission planning model of mars vehicle, electronic equipment and medium - Google Patents

Method and device for establishing mission planning model of mars vehicle, electronic equipment and medium

Info

Publication number
CN114676471B
CN114676471B (application CN202210419866.0A)
Authority
CN
China
Prior art keywords
task
state
environment state
day
heaven
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210419866.0A
Other languages
Chinese (zh)
Other versions
CN114676471A
Inventor
卢皓
张辉
崔晓峰
费立刚
赵焕洲
于天一
胡晓东
谢圆
张宽
孙鸿强
润冬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Aerospace Control Center
Original Assignee
Beijing Aerospace Control Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Aerospace Control Center filed Critical Beijing Aerospace Control Center
Priority to CN202210419866.0A priority Critical patent/CN114676471B/en
Publication of CN114676471A publication Critical patent/CN114676471A/en
Application granted granted Critical
Publication of CN114676471B publication Critical patent/CN114676471B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G  PHYSICS
    • G06  COMPUTING; CALCULATING OR COUNTING
    • G06F  ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00  Computer-aided design [CAD]
    • G06F30/10  Geometric CAD
    • G06F30/20  Design optimisation, verification or simulation
    • G06F30/27  Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G06F17/00  Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10  Complex mathematical operations
    • G06F17/16  Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G06F17/18  Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G06F2111/00  Details relating to CAD techniques
    • G06F2111/04  Constraint-based CAD

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Geometry (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Algebra (AREA)
  • Computer Hardware Design (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Operations Research (AREA)
  • Probability & Statistics with Applications (AREA)
  • Train Traffic Observation, Control, And Security (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a method, a device, electronic equipment and a medium for establishing a mission planning model of a Mars vehicle. The method comprises the following steps: a corresponding task is determined according to the acquired first sky-ground environment state; the second sky-ground environment state reached after the Mars vehicle executes the task in the first sky-ground environment state is determined; for each task, a reward value corresponding to the task is determined according to the task and a target sky-ground environment state, and the value of the element corresponding to the task in the state-task value function corresponding to the first sky-ground environment state is updated; the second sky-ground environment state is then taken as the new first sky-ground environment state, and the process is repeated until the accumulated execution time of the tasks reaches the preset task planning duration, which completes one training period; the training period is executed n times until the total reward value of the n-th period satisfies the training end condition, and the mission planning model is obtained. By the method, the efficiency of Mars vehicle mission planning can be improved.

Description

Method and device for establishing mission planning model of mars vehicle, electronic equipment and medium
Technical Field
The invention relates to the technical field of spaceflight, in particular to a method and a device for establishing a mission planning model of a mars vehicle, electronic equipment and a medium.
Background
In the fields of aerospace measurement and control and Mars teleoperation, a Mars vehicle performs patrol exploration on the Martian surface to obtain scientific detection data. To complete the patrol exploration mission efficiently, the ground must plan and schedule the Mars vehicle's tasks under a variety of constraint conditions.
In lunar rover missions, which are similar to Mars vehicle missions, mission planning has mainly been addressed by resource-scheduling methods and path-based methods, and considerable results have been achieved. In resource scheduling, research has focused on rule-based modeling and problem-solving methods, producing a series of model description languages such as PDDL and NDDL. After modeling, the constraints in the rover's time dimension are solved with greedy algorithms, heuristic algorithms and the like.
The prior art has also achieved many practical results in mission planning based on path planning. In the mission planning of a lunar rover, the path planning problem under a dynamic environment model is converted into a series of path planning problems under static environment models, realizing dynamic path planning at the task level; real-time constraint checks are performed during path planning to realize behavior planning, their effects are iterated back into the dynamic path planning process, and rover mission planning is finally achieved. This technical route for lunar rover mission planning rests on two premises: on the one hand, the limited intelligence level of the lunar rover, and on the other hand, the lunar rover's good communication conditions with the ground. Compared with a lunar rover, a Mars vehicle has a higher level of intelligence but is subject to harsher constraints. Because of the large communication delay to Mars and the relay links the mission depends on, Mars vehicle mission planning must also consider complex constraints such as delay, store-and-forward relay, and more limited energy supply.
In summary, in the prior art, when constructing a planner that provides planning strategies for a Mars vehicle, the constraint conditions must first be converted into rules, which are then modeled and solved. Both the construction of the planner and the solution of the planning problem are cumbersome and complex, and the efficiency is low.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a method, a device, electronic equipment and a medium for establishing a mission planning model of a Mars vehicle, so as to solve the problems that the construction of the planner and the solving of the planning problem are cumbersome and inefficient.
In a first aspect, the technical solution for solving the above technical problem of the present invention is as follows: a method for establishing a mission planning model of a mars vehicle comprises the following steps:
step S1, acquiring a first sky-ground environment state, wherein the first sky-ground environment state is an initial sky-ground environment state that satisfies the constraint conditions;
step S2, in the first sky-ground environment state, determining the task corresponding to the first sky-ground environment state from the candidate tasks according to an epsilon-greedy strategy;
step S3, determining the second sky-ground environment state reached after the Mars vehicle executes the task in the first sky-ground environment state;
step S4, for each task, determining the reward value corresponding to the task according to the task and a target sky-ground environment state, wherein the target sky-ground environment state comprises the first sky-ground environment state, or the first sky-ground environment state together with the second sky-ground environment state;
step S5, for each task in the second sky-ground environment state, updating the value of the element corresponding to the task in the state-task value function corresponding to the first sky-ground environment state according to the reward value corresponding to the task and the maximum value of the elements corresponding to the task in the state-task value function corresponding to the second sky-ground environment state, thereby obtaining an updated state-task value function corresponding to the first sky-ground environment state; each state-task value function represents the expectation of the accumulated reward values of the tasks, other than the target task, that may be executed after the target task has been executed in the second sky-ground environment state;
step S6, taking the second sky-ground environment state as the new first sky-ground environment state and repeating steps S2 to S5 until the accumulated execution time of the tasks involved in steps S2 to S5 reaches the preset task planning duration, thereby completing one training period, and recording the total reward value corresponding to the training period, wherein the total reward value is determined from the reward values corresponding to the tasks of that training period;
step S7, executing the training period n times until the total reward value corresponding to the n-th period satisfies the training end condition, stopping before the (n+1)-th training period, and taking the updated state-task value function obtained when the training end condition is satisfied as the target state-task value function;
and step S8, taking the target state-task value function as a task planning model, so that, for a sky-ground environment state to be planned, the task planning model determines the target task that maximizes the value of the element corresponding to that state.
The invention has the beneficial effects that: in the scheme of the application, starting from an initial sky-ground environment state that satisfies the constraint conditions, the task corresponding to the first sky-ground environment state is selected from the candidate tasks according to an epsilon-greedy strategy; the second sky-ground environment state reached after the Mars vehicle executes the task in the first sky-ground environment state is then determined, and the state-task value function is continuously updated according to the reward value of each task; the second sky-ground environment state is taken as the new first sky-ground environment state and the process is repeated until the accumulated execution time of the tasks reaches the preset task planning duration, completing one training period; the training period is executed n times until the total reward value of the n-th period satisfies the training end condition, and the task planning model is obtained. Because the model is trained by continuously updating the state-task value function, the task planning model obtained by the scheme of the invention can plan in the time dimension, achieve the shortest movement period under the constraint conditions, and reach the movement target point as quickly as possible.
On the basis of the technical scheme, the invention can be further improved as follows.
Further, the task planning model is established based on a reinforcement learning method; the reinforcement learning method includes the Q-learning reinforcement learning method, and if the Q-learning reinforcement learning method is used, the state-task value function is a Q matrix.
The beneficial effect of adopting the further scheme is that the learning capability of the model can be better represented through the Q matrix obtained by adopting the Q-learning reinforcement learning method in the scheme of the invention.
Further, if the state task cost function is a Q matrix, for each task in the second heaven and earth environment state, updating the value of the element corresponding to the task in the state task cost function corresponding to the first heaven and earth environment state according to the reward value corresponding to the task and the maximum value of the element corresponding to the task in the state task cost function corresponding to the second heaven and earth environment state, and obtaining an updated state task cost function, which is determined by the following formula:
Q(S_t, A_t) = Q(S_t, A_t) + α(λ max_a Q(S_{t+1}, a) - Q(S_t, A_t) + R_{t+1})
wherein Q(S_t, A_t) denotes the value, in the state-task value function corresponding to the first sky-ground environment state, of the element for state S_t and task A_t; max_a Q(S_{t+1}, a) denotes the maximum value, over the tasks a, of the elements in the state-task value function corresponding to the second sky-ground environment state S_{t+1}; R_{t+1} denotes the reward value corresponding to the task; α is the learning rate and λ is the discount rate, both of which are set values.
The method has the advantages that the state task cost function corresponding to the environment state of the first day is updated according to the state task cost function corresponding to the environment state of the second day and the reward value corresponding to the task, and the accuracy of the value of each element in the state task cost function can be improved by adopting the reverse updating mode.
Further, if the target world environment state includes a first world environment state, the step S4 includes:
for each task, determining a reward value corresponding to the task according to the task and the environment state of the first day;
if the target heaven and earth environment state includes the first heaven and earth environment state and the second heaven and earth environment state, the step S4 includes:
and for each task, determining a state variable quantity corresponding to the task according to the first day environment state and the second day environment state corresponding to the task, and determining an award value corresponding to the task according to the state variable quantity.
The method has the advantages that the reward value corresponding to each task is determined by different methods, and accuracy of determining the reward value can be improved.
Further, the first sky-ground environment state includes a first external environment state, a first Mars vehicle state and a first ground environment state; when the first sky-ground environment state includes the first external environment state, the step S3 includes:
determining the second external environment state corresponding to each task after the Mars vehicle executes that task in the first external environment state;
when the first sky-ground environment state includes the first Mars vehicle state, the step S3 includes:
determining the second Mars vehicle state corresponding to each task after the Mars vehicle executes that task in the first Mars vehicle state;
when the first sky-ground environment state includes the first ground environment state, the step S3 includes:
determining the second ground environment state corresponding to each task after the Mars vehicle executes that task in the first ground environment state.
The method has the advantages that in the process of determining the environment state of the second day and the ground, the environment states of the second day and the ground corresponding to the three different environment states of the first day and the ground are respectively determined, and the data processing efficiency can be improved.
Further, the constraint condition includes at least one of a communication constraint condition, a lighting constraint condition, an energy constraint condition, a temperature constraint condition, a terrain constraint condition, a data transmission capacity constraint condition and a time constraint condition; the plurality of tasks include a probe task, a mobile task, a perception task, a communication task, and a ground planning task.
The beneficial effect of adopting the further scheme is that different task requirements and constraint condition requirements of the mars can be met through different constraint conditions and tasks, so that the applicability of the trained task planning model is wider.
Further, the step S3 is implemented by a state model, and the step S4 is implemented by a bonus model.
The method has the advantages that in the model training process, the environment state of the second day and the ground can be determined through the state model obtained through pre-training, the reward value corresponding to each task can be determined through the reward model obtained through pre-training, and accuracy and training efficiency of data in the training process can be improved.
In a second aspect, the present invention provides a device for establishing a mission planning model of a mars train to solve the above technical problem, the device comprising:
the data acquisition module is used for acquiring a first day-ground environment state, wherein the first day-ground environment state is a first initial day-ground environment state meeting constraint conditions;
the task determining module is used for determining a task corresponding to the first day environment state from all tasks according to an epsilon-greedy strategy under the first day environment state;
the second heaven and earth environment state determining module is used for determining a corresponding second heaven and earth environment state after the mars vehicle executes the task under the first heaven and earth environment state;
the reward value determining module is used for determining a reward value corresponding to each task according to the task and a target heaven and earth environment state, wherein the target heaven and earth environment state comprises a first day and earth environment state or the first day and earth environment state and a second day and earth environment state;
the updating module is used for updating the value of the element corresponding to the task in the state task cost function corresponding to the first day-ground environment state according to the reward value corresponding to the task and the maximum value of the element corresponding to the task in the state task cost function corresponding to the second day-ground environment state for each task in the second day-ground environment state, so as to obtain an updated state task cost function, wherein the updated state task cost function is the updated state task cost function corresponding to the first day-ground environment state, and for each state task cost function, the expectation that the accumulated reward values of other tasks except the target task in each task can be executed after the target task in each task is executed in the second day-ground environment state is represented;
the single training period module is used for taking the environment state of the second day and the ground as a new environment state of the first day and repeatedly executing the processing process from the task determining module to the updating module until the execution accumulated time corresponding to each task corresponding to the task determining module to the updating module reaches the preset task planning time length, completing a training period, and recording a total reward value corresponding to the training period, wherein the total reward value is determined according to the reward value corresponding to each task corresponding to the training period;
the training end module is used for executing the training period for n times until the total reward value corresponding to the nth time meets the training end condition, stopping executing the training period for n +1 times, and taking the corresponding updated state task value function when the training end condition is met as a target state task value function;
and the model generation module is used for taking the target state task value function as a task planning model so as to determine the target task which enables the value of the element corresponding to the heaven and earth environment state to be planned to be the maximum value according to the heaven and earth environment state to be planned through the task planning model.
In a third aspect, the present invention provides an electronic device to solve the above technical problem, where the electronic device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the method for establishing a mission planning model of a Mars vehicle according to the present application is implemented.
In a fourth aspect, the present invention further provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program implements the method for establishing a mission planning model of a Mars vehicle according to the present application.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings used in the description of the embodiments of the present invention will be briefly described below.
Fig. 1 is a schematic flow chart of a method for establishing a mission planning model of a mars vehicle according to an embodiment of the present invention;
fig. 2 is a schematic flow chart of a method for establishing a mission planning model of a Mars vehicle according to another embodiment of the present invention;
fig. 3 is a schematic structural diagram of a mission planning model building apparatus for a mars train according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The principles and features of this invention are described below in conjunction with examples which are set forth to illustrate, but are not to be construed to limit the scope of the invention.
The technical solution of the present invention and how to solve the above technical problems will be described in detail with specific embodiments below. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present invention will be described below with reference to the accompanying drawings.
An embodiment of the present invention provides a possible implementation manner, and as shown in fig. 1, provides a flowchart of a method for establishing a mission planning model of a mars car. For convenience of description, the method provided by the embodiment of the present invention will be described below by taking a server as an execution subject, and as shown in the flowchart shown in fig. 1, the method may include the following steps:
step S1, acquiring a first sky-ground environment state, wherein the first sky-ground environment state is an initial sky-ground environment state that satisfies the constraint conditions;
step S2, in the first sky-ground environment state, determining the task corresponding to the first sky-ground environment state from the candidate tasks according to an epsilon-greedy strategy;
step S3, determining the second sky-ground environment state reached after the Mars vehicle executes the task in the first sky-ground environment state;
step S4, for each task, determining the reward value corresponding to the task according to the task and a target sky-ground environment state, wherein the target sky-ground environment state comprises the first sky-ground environment state, or the first sky-ground environment state together with the second sky-ground environment state;
step S5, for each task in the second sky-ground environment state, updating the value of the element corresponding to the task in the state-task value function corresponding to the first sky-ground environment state according to the reward value corresponding to the task and the maximum value of the elements corresponding to the task in the state-task value function corresponding to the second sky-ground environment state, thereby obtaining an updated state-task value function corresponding to the first sky-ground environment state; each state-task value function represents the expectation of the accumulated reward values of the tasks, other than the target task, that may be executed after the target task has been executed in the second sky-ground environment state;
step S6, taking the second sky-ground environment state as the new first sky-ground environment state and repeating steps S2 to S5 until the accumulated execution time of the tasks involved in steps S2 to S5 reaches the preset task planning duration, thereby completing one training period, and recording the total reward value corresponding to the training period, wherein the total reward value is determined from the reward values corresponding to the tasks of that training period;
step S7, executing the training period n times until the total reward value corresponding to the n-th period satisfies the training end condition, stopping before the (n+1)-th training period, and taking the updated state-task value function obtained when the training end condition is satisfied as the target state-task value function;
and step S8, taking the target state-task value function as a task planning model, so that, for a sky-ground environment state to be planned, the task planning model determines the target task that maximizes the value of the element corresponding to that state.
By the method, starting from an initial sky-ground environment state that satisfies the constraint conditions, the task corresponding to the first sky-ground environment state is selected from the candidate tasks according to an epsilon-greedy strategy; the second sky-ground environment state reached after the Mars vehicle executes the task in the first sky-ground environment state is then determined, and the state-task value function is continuously updated according to the reward value of each task; the second sky-ground environment state is taken as the new first sky-ground environment state and the process is repeated until the accumulated execution time of the tasks reaches the preset task planning duration, completing one training period; the training period is executed n times until the total reward value of the n-th period satisfies the training end condition, and the task planning model is obtained. Because the model is trained by continuously updating the state-task value function, the task planning model obtained by the scheme of the invention can plan in the time dimension, achieve the shortest movement period under the constraint conditions, and reach the movement target point as quickly as possible.
The scheme of the present invention is further described below with reference to a specific embodiment, in which the method for establishing the mission planning model of the Mars vehicle may include the following steps:
step S1, a first day-ground environment state is obtained, where the first day-ground environment state is a first initial day-ground environment state that satisfies the constraint condition.
The sky-ground environment state refers to the state of the space environment in which the Mars vehicle operates (also referred to as the external environment), the operating state of the Mars vehicle itself (also referred to as the Mars vehicle state), and the ground environment state. Specifically, the external environment state comprises parameters describing the environment in which the Mars vehicle is located, such as the terrain, temperature, illumination environment and relay link environment; the Mars vehicle state comprises parameters describing the state of the Mars vehicle itself, such as whether the Mars vehicle is in a communication state, the data transmission capacity of the Mars vehicle, the position of the Mars vehicle and instruction uplink; the ground environment state comprises parameters describing the state of the ground equipment (e.g., a station) corresponding to the Mars vehicle, such as the location of the ground equipment, whether an instruction of the ground equipment has been uplinked to the Mars vehicle, and whether the sensing data of the Mars vehicle has been downlinked.
As an example, the external environment state, the Mars vehicle state and the ground environment state described above may be as shown in Table 1 below.
TABLE 1
[Table 1 is provided as an image in the original publication and is not reproduced here.]
The external environment states in Table 1 include a communicable state, indicating that the external environment allows communication, and a lighted state, indicating that the external environment is illuminated. The sensing data generation state indicates that the Mars vehicle has generated sensing data that has not yet been downlinked; the instruction uplink state indicates whether the Mars vehicle has received an instruction sent by other equipment; the position update state indicates whether the position of the Mars vehicle has changed; the sensing data download state indicates whether the sensing data acquired after the Mars vehicle executes a sensing task has been downlinked to the ground equipment; and the plan generation state indicates whether the ground equipment has completed the planning of a planning task (a task that the Mars vehicle needs to execute). It should be noted that the sky-ground environment states in Table 1 are only examples and do not limit the present invention.
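For illustration only, and not as part of the original disclosure, the sky-ground environment state of Table 1 can be viewed as a vector of boolean flags. The following Python sketch shows one possible encoding and how it could be packed into an integer index for a Q matrix; the field names are assumptions derived from the table description, not the patent's actual state definition.

```python
from dataclasses import dataclass, astuple

@dataclass(frozen=True)
class SkyGroundState:
    """Hypothetical boolean encoding of the sky-ground environment state."""
    communicable: bool              # relay link available (external environment)
    illuminated: bool               # lighting condition (external environment)
    sensing_data_generated: bool    # rover holds sensing data not yet downlinked
    instruction_uplinked: bool      # ground instruction received by the rover
    position_updated: bool          # rover position has changed
    sensing_data_downlinked: bool   # sensing data delivered to the ground equipment
    plan_generated: bool            # ground planning completed

    def index(self) -> int:
        """Pack the flags into an integer usable as a row index of a Q matrix
        (2**7 = 128 possible states)."""
        value = 0
        for bit in astuple(self):
            value = (value << 1) | int(bit)
        return value

# Example: an initial state with only a relay link and illumination.
s0 = SkyGroundState(True, True, False, False, False, False, False)
print(s0.index())  # 96
```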
The constraint conditions can include at least one of communication constraint conditions, illumination constraint conditions, energy constraint conditions, temperature constraint conditions, terrain constraint conditions, data transmission capacity constraint conditions and time constraint conditions, wherein the communication constraint conditions refer to whether communication can be carried out between the Mars train and ground equipment, the illumination constraint conditions refer to whether illumination exists in the external environment where the Mars train is located, the energy constraint conditions refer to whether the battery electric quantity of each task executed in a planning time period is above a safety limit, the temperature constraint conditions refer to whether the external environment where the Mars train is located meets a set temperature, the terrain constraint conditions refer to whether the external environment where the Mars train is located meets a movement requirement, and the data transmission capacity constraint conditions refer to whether the capacity of data which can be transmitted by a spatial link meets the set capacity.
The constraint conditions can be determined based on the actual working conditions of the mars, so that not only can the task be normally executed, but also the actual working requirements can be better met under the constraint conditions, and it can be understood that the constraint conditions can be updated in an updating mode including deletion, addition and the like.
The first sky-ground environment state satisfying the constraint conditions means that, among the sky-ground environment states described above, a state that satisfies the constraint conditions is selected as the training sample.
The tasks are tasks executed by the Mars vehicle on the Mars and can comprise a detection task, a movement task, a perception task, a communication task and a ground planning task, wherein the main objective of the detection task is to obtain scientific detection data by using a scientific load, the precondition for implementing the detection task is that the Mars vehicle reaches a detection target area, the constraint conditions required to be met during detection comprise an energy constraint condition, an illumination constraint condition and a temperature constraint condition, and a device communication arc section is required to be arranged to download the detection data to ground equipment on the ground after detection. The moving task mainly realizes the control of the course and the position of the mars, and the implementation of the moving task needs to meet the following preconditions: the planning task is completed through ground equipment, a control command is sent to the mars train in an ascending mode, and the terrain constraint condition, the illumination constraint condition and the energy constraint condition are met when the movement task is implemented. The perception task mainly realizes the perception of the surrounding environment of the mars, obtains an environment image, and needs to meet the following preconditions when the perception task is implemented: the planning task is completed through ground equipment, a control instruction is uploaded, the illumination constraint condition, the energy constraint condition and the temperature constraint condition are met when the sensing task is implemented, and after sensing, an arc section is required to be arranged to download the sensing image to the ground equipment for mobile task planning. The communication task mainly completes the uplink transmission tasks of instructions and injected data and the downlink transmission tasks of data transmission data. And the ground planning task generates subsequent work plan arrangement of the Mars train and generates an uplink instruction according to the sensing and telemetering data downloaded by the Mars train.
Step S2, determining the task corresponding to the first sky-ground environment state from the candidate tasks according to an epsilon-greedy strategy, based on the first sky-ground environment state.
One implementation of this determination is as follows: in the first sky-ground environment state, the task corresponding to that state is selected greedily from the candidate tasks with the preset probability 1-epsilon, and a task is selected at random with probability epsilon. Here epsilon is preset and can be set according to actual requirements.
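A minimal sketch of this epsilon-greedy selection is given below, assuming the state-task value function is stored as a NumPy array indexed by (state, task); this data structure and the example numbers are assumptions for illustration, not the patent's prescribed implementation.

```python
import numpy as np

def epsilon_greedy_task(q_matrix: np.ndarray, state_index: int,
                        epsilon: float, rng: np.random.Generator) -> int:
    """With probability epsilon choose a random task (exploration); otherwise
    choose the task with the largest Q value for the current state."""
    n_tasks = q_matrix.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_tasks))
    return int(np.argmax(q_matrix[state_index]))

# Usage sketch: 128 states and 5 candidate tasks.
rng = np.random.default_rng(0)
Q = np.zeros((128, 5))
task = epsilon_greedy_task(Q, state_index=3, epsilon=0.1, rng=rng)
```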
Step S3, determining the second sky-ground environment state reached after the Mars vehicle executes the task in the first sky-ground environment state.
The sky-ground environment states before and after the Mars vehicle executes a task are usually different: the sky-ground environment state before the task is executed is the first sky-ground environment state, and the sky-ground environment state after the task is executed is the second sky-ground environment state. It will be appreciated that, for different tasks, the same first sky-ground environment state may lead to different second sky-ground environment states after the task is executed.
As an example, for task A, the first sky-ground environment state before task A is executed is a1, and the second sky-ground environment state after task A is executed is a2.
In an alternative aspect of the present invention, the first sky-ground environment state includes a first external environment state, a first Mars vehicle state and a first ground environment state. When the first sky-ground environment state includes the first external environment state, step S3 includes: determining the second external environment state corresponding to each task after the Mars vehicle executes that task in the first external environment state; when the first sky-ground environment state includes the first Mars vehicle state, step S3 includes: determining the second Mars vehicle state corresponding to each task after the Mars vehicle executes that task in the first Mars vehicle state; when the first sky-ground environment state includes the first ground environment state, step S3 includes: determining the second ground environment state corresponding to each task after the Mars vehicle executes that task in the first ground environment state;
The second sky-ground environment state includes a second external environment state, a second Mars vehicle state and a second ground environment state.
Optionally, the step S3 is implemented by a state model, wherein the state model may also be divided into three models according to different sky-ground environment states, including an external environment model, a train state model, and a ground state model, because the first day-ground environment state includes a first external environment state, a first train state, and a first ground environment state.
The determining of the second external environment state corresponding to the Mars vehicle after executing each task in the first external environment state includes:
according to the first external environment state and each task, determining a corresponding second external environment state after each task is executed by the Mars vehicle in the first external environment state through a pre-established external environment model; determining a second Mars train state corresponding to each task after the Mars train executes each task in the first Mars train state, wherein the second Mars train state comprises the following steps: according to the state of the first Mars train and each task, determining the corresponding state of a second Mars train after each task is executed when the Mars train is in the state of the first Mars train through a pre-established Mars train state model;
the determining a second ground environment state corresponding to the Mars vehicle after executing each task in the first ground environment state includes: and determining a second ground environment state corresponding to the Mars vehicle after executing each task under the first ground environment state through a pre-established ground state model according to the first ground environment state and each task.
The external environment model, the mars train state model and the ground state model are all pre-established models based on the existing method, the external environment model is established based on a track dynamics model, and the mars train state model and the ground state model are subjected to simulation modeling based on a design state. And will not be described in detail herein.
The external environment modeling mainly represents a lighting environment, a relay link environment, and the like, such as the sky-ground environment states shown in table 1 above, which are independent of the operating state of the mars.
As an example, referring to the schematic diagram of the training process of the mission planning model shown in fig. 2, the planner inputs a mission to be executed and a first day environment state before the mission is executed into an environment model, where the environment model includes three models, namely an external environment model (the external environment shown in fig. 2), a train state model (the train state shown in fig. 2) and a ground state model (the ground state shown in fig. 2), and a second day environment state after the mission is executed (the state shown in fig. 2) is predicted by the three models.
Step S4, for each task, determining a reward value corresponding to the task according to the task and a target heaven and earth environment state, where the target heaven and earth environment state includes a first-day environment state, or a first-day environment state and a second-day environment state.
Wherein, two schemes can be adopted to determine the reward value corresponding to each task, if the target heaven and earth environment state includes the first day and earth environment state, the step S4 includes:
for each task, determining a reward value corresponding to the task according to the task and the environment state of the first day;
for each task, the environment state of the first day corresponding to each task is not changed after the task is executed, and for each task, the reward value corresponding to the task can be determined according to the task and the environment state of the first day.
If the target heaven and earth environment state includes a first heaven and earth environment state and a second heaven and earth environment state, the step S4 includes:
and for each task, determining a state variation corresponding to the task according to the first day environment state and the second day environment state corresponding to the task, and determining an award value corresponding to the task according to the state variation.
For each task, the state change amount corresponding to the task represents the change of the environment state of the corresponding second day after the task is executed compared with the environment state of the first day before the task is executed. For example, if the first day environment state is no communication and the second day environment state is communication, the state change amount between the first day environment state and the second day environment state indicates the change of the communication.
As one example, see the corresponding amount of state change after execution of each task in table 2 (state transition in table 2).
TABLE 2
[Table 2 is provided as an image in the original publication and is not reproduced here.]
After the perception task is executed, the state variation corresponding to the heaven and earth environment state with the state variation comprises Mars vehicle generated perception data (on-board generated perception data), and the perception data are not downloaded. After the moving task is executed, the state change amount corresponding to the heaven and earth environment state with the state change comprises station position updating (station position is changed), plan is not generated (task is not generated), and the command is not received (the mars car does not receive the command sent by other equipment). After the communication task is executed, the state change amount corresponding to the heaven and earth environment state with the state change comprises that the sensing data is descended to the ground (the sensing data is transmitted to ground equipment by a mars car), and the ground planning instruction is ascended to the mars car (the instruction planned by the ground equipment is ascended to the mars car). If the state change quantity corresponding to the heaven and earth environment state with the state change comprises that the Mars vehicle does not generate sensing data (no sensing data is generated on the Mars vehicle), the sensing data is not downloaded, the instruction is not received, and the ground is not updated (the ground environment is not changed); and if the scene is a standby scene, the state of the heaven and earth environment is not changed, namely, the state transition is not carried out.
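A hypothetical sketch of a reward model that derives the reward value from the state change is given below; the flag names, task names and reward values are illustrative assumptions loosely echoing Tables 2 and 3, and are not the actual reward model of the invention.

```python
def reward_from_transition(task: str, before: dict, after: dict) -> float:
    """Map a (task, state change) pair to a reward value; purely illustrative."""
    if task == "move" and after["position_updated"] and not before["position_updated"]:
        return 2000.0        # the rover made progress toward the movement target
    if task == "communicate" and after["sensing_data_downlinked"] and not before["sensing_data_downlinked"]:
        return 10.0          # sensing data relayed down to the ground equipment
    if task == "sense" and after["sensing_data_generated"] and not before["sensing_data_generated"]:
        return -10.0         # sensing costs time and energy before any payoff
    return float("-inf")     # transition not allowed under the constraint conditions

# Usage sketch with two flag dictionaries.
s1 = {"position_updated": False, "sensing_data_downlinked": False, "sensing_data_generated": False}
s2 = dict(s1, sensing_data_generated=True)
print(reward_from_transition("sense", s1, s2))  # -10.0
```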
Referring to fig. 2, the planner performs step 4 described above.
Step S5, for each task in the second day-ground environment state, updating the value of the element corresponding to the task in the state task cost function corresponding to the first day-ground environment state according to the reward value corresponding to the task and the maximum value of the element corresponding to the task in the state task cost function corresponding to the second day-ground environment state, so as to obtain an updated state task cost function, wherein the updated state task cost function is the updated state task cost function corresponding to the first day-ground environment state, and for each state task cost function, the expectation that the accumulated reward values of other tasks except the target task in each task can be executed after the target task in each task is executed in the second day-ground environment state is represented;
the initial value of the state task cost function corresponding to the environment state of the second place may be randomly given, that is, the initial values of the elements in the state task cost function corresponding to the environment state of the second place may be randomly determined. Different heaven and earth environment states can correspond to different state task cost functions, and the state task cost functions describe the heaven and earth environment states and the tasks
For each element in the state task cost function, the value of the element is the expectation of the accumulated reward value of the task except the target task which is possibly executed after the task (the target task) is executed, the value of the element represents the profit after the task is executed by the mars vehicle in the first day environment state, and the profit can be embodied by the second day environment state. In the scheme of the invention, whether the profit is generated or not can be determined by determining whether the energy consumption, the time consumption, the information change and the like exist or not according to the environment state of the second day after the task is executed.
It should be noted that, for each task in the second sky-ground environment state, the value of the element corresponding to that task in the state-task value function corresponding to the first sky-ground environment state is updated through the process of step S5. If the second sky-ground environment state corresponds to 2 tasks, the state-task value function corresponding to the first sky-ground environment state is updated twice. For example, let the 2 tasks be task 1 and task 2. First, according to the reward value corresponding to task 1 and the maximum reward value over all tasks (task 1 and task 2) in the state-task value function corresponding to the second sky-ground environment state, the value of the element corresponding to task 1 in the state-task value function corresponding to the first sky-ground environment state is updated, giving a first function. Then, according to the reward value corresponding to task 2 and the maximum reward value over all tasks in the state-task value function corresponding to the second sky-ground environment state, the value of the element corresponding to task 2 in the first function is updated, giving a second function; the second function is the updated state-task value function.
Referring to fig. 2, the planner inputs tasks and state variables (state variables shown in fig. 2) to be performed, or the tasks and the target heaven and earth environment states into the reward model, and outputs reward values corresponding to the tasks (corresponding to the rewards shown in fig. 2).
And step S6, taking the environment state of the second day and place as a new environment state of the first day and place, repeatedly executing the step S2 to the step S5 until the execution accumulated time corresponding to each task corresponding to the step S2 to the step S5 reaches the preset task planning time length, completing a training period, and recording the total reward value corresponding to the training period, wherein the total reward value is determined according to the reward value corresponding to each task corresponding to the training period.
Each time a task is executed, its execution time is recorded. After each task is executed, the execution times of the tasks are accumulated, i.e., added together, to obtain the accumulated execution time; when the accumulated execution time equals the preset task planning duration, one training period is recorded as complete. It should be noted that, because each task takes a different time to execute, within the preset task planning duration one training period may not complete all tasks, and a task may be only partially executed.
The total reward value corresponding to one training period refers to the sum of the reward values corresponding to each task or an approximate value of the sum of the reward values corresponding to each task in the training period.
It should be noted that, when the second training period is executed, the task corresponding to the new environment state of the first day may still be determined from the tasks.
Step S7, executing the training period n times until the total reward value corresponding to the n-th period satisfies the training end condition, stopping before the (n+1)-th training period, and taking the updated state-task value function obtained when the training end condition is satisfied as the target state-task value function; in the target state-task value function, the maximum-valued element under a given sky-ground environment state indicates the corresponding target task;
If the total reward value corresponding to the n-th period does not satisfy the training end condition, the (n+1)-th training period is executed, and so on, until the total reward value corresponding to the k-th period satisfies the training end condition, where k is an integer not less than n+1. The training end condition may be configured according to actual requirements, for example as a threshold value or a threshold range; when the total reward value of the n-th period satisfies the training end condition, the reward values of the tasks have become stable, i.e. they no longer vary much, which indicates that the training accuracy is sufficient and training may be stopped.
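One possible concrete form of such a training end condition is sketched below; the windowed stability test and its parameters are assumptions made for illustration, not the patent's exact criterion.

```python
def training_finished(total_rewards: list, window: int = 10,
                      tolerance: float = 1e-3) -> bool:
    """Stop training when the total reward of the last `window` training
    periods has stabilised, i.e. its spread falls below `tolerance`."""
    if len(total_rewards) < window:
        return False
    recent = total_rewards[-window:]
    return max(recent) - min(recent) <= tolerance
```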
And step S8, taking the target state task value function as a task planning model, and determining the target task which enables the value of the element corresponding to the to-be-planned heaven and earth environment state to be the maximum value according to the to-be-planned heaven and earth environment state through the task planning model.
In the target state-task value function, each element corresponds to a reward value; for the sky-ground environment state corresponding to an element, the element with the maximum value indicates which task should be executed as the target task in that state.
As an example, if the element corresponding to task A under the first sky-ground environment state a1 has the maximum value Q1, then when the sky-ground environment state to be planned is a1, the corresponding target task is task A.
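A minimal sketch of this planning (inference) step is shown below, again assuming the target state-task value function is stored as a NumPy Q matrix; the task list and its ordering are illustrative assumptions.

```python
import numpy as np

TASKS = ["detect", "move", "sense", "communicate", "ground_plan"]  # assumed order

def plan_next_task(q_matrix: np.ndarray, state_index: int) -> str:
    """Return the task whose element has the maximum value for the sky-ground
    environment state to be planned."""
    return TASKS[int(np.argmax(q_matrix[state_index]))]
```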
In an alternative of the present invention, the task planning model can be established based on a reinforcement learning method, that is, learned through reinforcement learning. The reinforcement learning method may include, but is not limited to, Q-learning, SARSA, DQN (Deep Q Network) and DUAL-DQN (Dueling DQN). If the reinforcement learning method is Q-learning, the state-task value function is a Q matrix.
Optionally, if the state task cost function is a Q matrix, for each task in the second heaven and earth environment state, updating the value of the element corresponding to the task in the state task cost function corresponding to the first day and earth environment state according to the reward value corresponding to the task and the maximum value of the element corresponding to the task in the state task cost function corresponding to the second heaven and earth environment state, so as to obtain an updated state task cost function, which is determined by the following formula:
Q(S_t, A_t) = Q(S_t, A_t) + α(λ max_a Q(S_{t+1}, a) - Q(S_t, A_t) + R_{t+1})
wherein Q(S_t, A_t) denotes the value, in the state-task value function corresponding to the first sky-ground environment state, of the element for state S_t and task A_t; max_a Q(S_{t+1}, a) denotes the maximum value, over the tasks a, of the elements in the state-task value function corresponding to the second sky-ground environment state S_{t+1}; R_{t+1} denotes the reward value corresponding to the task; α is the learning rate and λ is the discount rate, both of which are set values.
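Assuming again that the Q matrix is stored as a NumPy array indexed by (state, task), the update formula above can be sketched as follows; this is an illustrative implementation, not code from the patent.

```python
import numpy as np

def q_update(q: np.ndarray, s_t: int, a_t: int, r_next: float,
             s_next: int, alpha: float, lam: float) -> None:
    """Apply, in place, Q(S_t, A_t) <- Q(S_t, A_t)
    + alpha * (lam * max_a Q(S_{t+1}, a) - Q(S_t, A_t) + R_{t+1})."""
    td_target = r_next + lam * np.max(q[s_next])
    q[s_t, a_t] += alpha * (td_target - q[s_t, a_t])
```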
For one example, see the reward values output by the reward model shown in table 3, and the first day environmental status and task to which each reward value corresponds.
TABLE 3
[Table 3 is provided as an image in the original publication and is not reproduced here.]
As can be seen from table 3, for the sensing task, under the first day environmental condition (with light, location update (location change of mars) and sensing data not generated) satisfying the constraint condition (the condition in table 3), the maximum reward value corresponding to the execution of the sensing task is-10; for the mobile task, under the environment state (with light, plan generation and instruction received) of the first day meeting the constraint condition, the maximum reward value corresponding to the execution of the mobile task is 2000; for a communication task (relay communication in table 3), in the first day-to-earth environment state (communicable) satisfying the constraint condition, the maximum bonus value corresponding to the execution of the communication task is 10; for the ground planning, under the first day ground environment state (sensing data is generated, sensing data is downloaded, plan is not generated and position is not updated) meeting the constraint condition, the maximum reward value corresponding to the ground planning is executed to be-20; for all tasks, under the first day environment state (other states) meeting the constraint conditions, the maximum reward value corresponding to all tasks is executed is-inf; -inf denotes that all numbers in the prize value are larger than-inf. It should be noted that the contents in table 3 are only examples, and do not limit the scope of the present invention.
For a better illustration and understanding of the principles of the method provided by the present invention, the solution of the invention is described below with reference to an alternative embodiment. It should be noted that the specific implementation manner of each step in this specific embodiment should not be construed as a limitation to the scheme of the present invention, and other implementation manners that can be conceived by those skilled in the art based on the principle of the scheme provided by the present invention should also be considered as within the protection scope of the present invention.
In this example, the training process of the mission planning model is specifically described, which includes the following steps:
Step 1 (initialization and external environment calculation): randomly initialize the Q matrix, assign values to the hyper-parameters α and λ, and assign an initial space-ground environment state S; obtain time-dependent external environment parameters such as illumination and relay links according to the planning period, where the planning period is the time span covered by the planning task and represents the time information related to task execution.
Step 2 (round-end judgement): judge whether the current round identifier has reached the maximum number of rounds; if so, end training; if not, repeat steps 3 to 7. The current round identifier is the identifier generated after each training period and indicates which round (the n-th) the corresponding total reward value belongs to; the maximum number of rounds is the number of training periods executed when the training end condition is satisfied.
Step 3 (single-planning end judgement): judge whether the current time (the cumulative execution time after one training period) has reached the end of the planning period (the preset mission planning duration): if so, end the current round; if not, repeat steps 4 to 7.
Step 4 (select action): in state S_t (the first space-ground environment state), select task A_t from the action set using an epsilon-greedy strategy.
Step 5 (update state): update the state S_t to S_{t+1} according to A_t, i.e. determine the second space-ground environment state S_{t+1} reached after task A_t is executed in state S_t.
Step 6 (calculate reward): for each task, call the multi-constraint algorithm function and determine the reward value R_{t+1} corresponding to the task according to the task and the target space-ground environment state.
Step 7 (update the Q function): for each task in the second space-ground environment state, determine the target state task value function described above according to the reward value corresponding to the task and the maximum value of the element corresponding to the task in the state task value function of the second space-ground environment state, until the current round identifier reaches the maximum number of rounds; training is then finished and the task planning model is obtained.
Step 8 (final planning): plan the space-ground tasks according to the trained task planning model.
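The eight steps above can be summarised as a small training loop. The sketch below is an illustrative skeleton only: the state model, the reward model, the sizes and the hyper-parameter values are placeholder assumptions rather than the patent's implementation:

```python
import random
import numpy as np

# Step 1: initialisation; all sizes and hyper-parameters are illustrative.
N_STATES, N_TASKS = 16, 5
ALPHA, LAM, EPSILON = 0.1, 0.9, 0.1
MAX_ROUNDS, PLAN_HORIZON = 200, 24           # rounds, planning steps per round

q = np.zeros((N_STATES, N_TASKS))             # Q matrix (state task value function)

def step_environment(state: int, task: int) -> int:
    """Placeholder state model (step 5): next space-ground environment state."""
    return (state + task + 1) % N_STATES

def placeholder_reward(task: int, state: int) -> float:
    """Placeholder reward model (step 6); see the Table 3 sketch above."""
    return random.choice([-10.0, 10.0, 2000.0, -20.0])

for round_id in range(MAX_ROUNDS):             # step 2: round-end judgement
    state, total_reward = 0, 0.0
    for _ in range(PLAN_HORIZON):               # step 3: single-planning end judgement
        if random.random() < EPSILON:           # step 4: epsilon-greedy task selection
            task = random.randrange(N_TASKS)
        else:
            task = int(np.argmax(q[state]))
        next_state = step_environment(state, task)    # step 5: update state
        reward = placeholder_reward(task, next_state)  # step 6: calculate reward
        q[state, task] += ALPHA * (                    # step 7: update Q function
            LAM * np.max(q[next_state]) - q[state, task] + reward)
        total_reward += reward
        state = next_state
    # Step 2 again: a test on total_reward could end training before MAX_ROUNDS.

# Step 8: the trained Q matrix is then used as the task planning model.
```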
Based on the same principle as the method shown in Fig. 1, an embodiment of the present invention further provides a mission planning model establishing apparatus 20 for a Mars vehicle. As shown in Fig. 3, the mission planning model establishing apparatus 20 may include a data obtaining module 210, a task determining module 220, a second space-ground environment state determining module 230, a reward value determining module 240, an updating module 250, a single training period module 260, a training end module 270 and a model generating module 280, wherein:
the data obtaining module 210 is configured to obtain a first space-ground environment state, where the first space-ground environment state is an initial space-ground environment state satisfying the constraint conditions;
the task determining module 220 is configured to determine, under the first space-ground environment state, the task corresponding to the first space-ground environment state from the tasks according to an epsilon-greedy strategy;
the second space-ground environment state determining module 230 is configured to determine the second space-ground environment state reached after the Mars vehicle executes the task in the first space-ground environment state;
the reward value determining module 240 is configured to determine, for each task, the reward value corresponding to the task according to the task and a target space-ground environment state, where the target space-ground environment state includes the first space-ground environment state, or the first space-ground environment state and the second space-ground environment state;
the updating module 250 is configured to update, for each task in the second space-ground environment state, the value of the element corresponding to the task in the state task value function of the first space-ground environment state according to the reward value corresponding to the task and the maximum value of the element corresponding to the task in the state task value function of the second space-ground environment state, so as to obtain an updated state task value function, where the updated state task value function is the state task value function of the updated first space-ground environment state, and each state task value function represents the expectation of the accumulated reward value obtainable by executing the tasks other than the target task after the target task is executed in the second space-ground environment state;
the single training period module 260 is configured to take the second space-ground environment state as a new first space-ground environment state and repeat the processing from the task determining module to the updating module until the cumulative execution time of the tasks reaches the preset mission planning duration, thereby completing one training period, and to record the total reward value of the training period, where the total reward value is determined from the reward values of the tasks in that training period;
the training end module 270 is configured to execute the training period n times until the total reward value of the n-th training period satisfies the training end condition, stop executing the (n+1)-th training period, and take the updated state task value function obtained when the training end condition is satisfied as the target state task value function;
and the model generating module 280 is configured to take the target state task value function as the task planning model, so that, for a to-be-planned space-ground environment state, the target task whose corresponding element has the maximum value is determined through the task planning model.
Optionally, the task planning model is built based on a reinforcement learning method, the reinforcement learning method includes Q-learning, and if the reinforcement learning method is Q-learning, the state task value function is a Q matrix.
Optionally, if the state task value function is a Q matrix, then for each task in the second space-ground environment state the updating module 250 updates the value of the element corresponding to the task in the state task value function of the first space-ground environment state according to the reward value corresponding to the task and the maximum value of the element corresponding to the task in the state task value function of the second space-ground environment state; the updated state task value function is determined by the following formula:
Q(S_t, A_t) = Q(S_t, A_t) + α(λ·max_a Q(S_{t+1}, A_{t+1}) − Q(S_t, A_t) + R_{t+1})
where Q(S_t, A_t) denotes the state task value function corresponding to the first space-ground environment state, Q(S_{t+1}, A_{t+1}) denotes the state task value function corresponding to the second space-ground environment state, R_{t+1} denotes the reward value corresponding to the task, max_a Q(S_{t+1}, A_{t+1}) denotes the maximum value of the elements corresponding to the task in the state task value function of the second space-ground environment state, α is the learning rate, λ is the discount rate, and α and λ are set values.
Optionally, if the target space-ground environment state includes the first space-ground environment state, the reward value determining module 240 is specifically configured to:
determine, for each task, the reward value corresponding to the task according to the task and the first space-ground environment state;
if the target space-ground environment state includes the first space-ground environment state and the second space-ground environment state, the reward value determining module 240 is specifically configured to:
determine, for each task, the state variation corresponding to the task according to the first and second space-ground environment states corresponding to the task, and determine the reward value corresponding to the task according to the state variation.
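A minimal sketch of the second case, assuming the space-ground environment state is represented as a dictionary of flags; the flag names and reward values below are illustrative assumptions, not the patent's reward model:

```python
def state_variation(first_state: dict, second_state: dict) -> dict:
    """Collect the flags whose values differ between the two states."""
    keys = set(first_state) | set(second_state)
    return {k: (first_state.get(k), second_state.get(k))
            for k in keys if first_state.get(k) != second_state.get(k)}

def reward_from_variation(task: str, variation: dict) -> float:
    """Map the state variation to a reward value (illustrative rules only)."""
    if task == "perception" and "perception_data_generated" in variation:
        return 10.0
    if task == "movement" and "position_updated" in variation:
        return 50.0
    return float("-inf")

delta = state_variation({"position_updated": False}, {"position_updated": True})
print(reward_from_variation("movement", delta))  # 50.0
```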
Optionally, the first space-ground environment state includes a first external environment state, a first Mars vehicle state and a first ground environment state. When the first space-ground environment state includes the first external environment state, the second space-ground environment state determining module 230 is specifically configured to:
determine the second external environment state reached after the Mars vehicle executes each task in the first external environment state;
when the first space-ground environment state includes the first Mars vehicle state, the second space-ground environment state determining module 230 is specifically configured to:
determine the second Mars vehicle state reached after the Mars vehicle executes each task in the first Mars vehicle state;
and when the first space-ground environment state includes the first ground environment state, the second space-ground environment state determining module 230 is specifically configured to:
determine the second ground environment state reached after the Mars vehicle executes each task in the first ground environment state.
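A minimal sketch of this decomposition, assuming the space-ground environment state is split into an external environment sub-state, a Mars vehicle sub-state and a ground sub-state; all field names and update rules below are illustrative assumptions:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class SpaceGroundState:
    light: bool                       # external environment sub-state
    relay_link: bool                  # external environment sub-state
    position_updated: bool            # Mars vehicle sub-state
    perception_data_generated: bool   # Mars vehicle sub-state
    perception_data_downloaded: bool  # ground environment sub-state
    plan_generated: bool              # ground environment sub-state

def apply_task(state: SpaceGroundState, task: str) -> SpaceGroundState:
    """Return the second space-ground environment state after executing `task`,
    updating only the sub-state that the task affects."""
    if task == "movement" and state.light:
        return replace(state, position_updated=True)
    if task == "perception" and state.light:
        return replace(state, perception_data_generated=True)
    if task == "relay_communication" and state.relay_link:
        return replace(state, perception_data_downloaded=state.perception_data_generated)
    if task == "ground_planning" and state.perception_data_downloaded:
        return replace(state, plan_generated=True)
    return state  # a task whose preconditions fail leaves the state unchanged
```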
Optionally, the constraint conditions include at least one of a communication constraint, an illumination constraint, an energy constraint, a temperature constraint, a terrain constraint, a data transmission capacity constraint and a time constraint; the tasks include a probe task, a movement task, a perception task, a communication task and a ground planning task.
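As an illustration of how such constraint conditions can be checked against a candidate initial state, the sketch below expresses each constraint type listed above as a predicate; the thresholds and field names are assumptions, not values from the patent:

```python
from typing import Callable, Dict, Iterable

CONSTRAINTS: Dict[str, Callable[[dict], bool]] = {
    "communication":  lambda s: s.get("relay_link", False),
    "illumination":   lambda s: s.get("light", False),
    "energy":         lambda s: s.get("battery_level", 0.0) > 0.2,
    "temperature":    lambda s: -40.0 < s.get("temperature_c", 0.0) < 40.0,
    "terrain":        lambda s: s.get("slope_deg", 90.0) < 20.0,
    "data_capacity":  lambda s: s.get("buffer_free_mb", 0.0) > 0.0,
    "time":           lambda s: s.get("within_window", False),
}

def satisfies_constraints(state: dict, required: Iterable[str]) -> bool:
    """A space-ground environment state is usable as an initial state only if
    every required constraint predicate holds."""
    return all(CONSTRAINTS[name](state) for name in required)

print(satisfies_constraints(
    {"light": True, "relay_link": True, "battery_level": 0.8,
     "slope_deg": 5.0, "buffer_free_mb": 128.0, "within_window": True},
    ["communication", "illumination", "energy", "terrain", "time"]))  # True
```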
Optionally, the processing of the second space-ground environment state determining module 230 is implemented by a state model, and the processing of the reward value determining module 240 is implemented by a reward model.
The mission planning model establishing apparatus for a Mars vehicle according to the embodiment of the present invention can execute the mission planning model establishing method for a Mars vehicle according to the embodiment of the present invention, and their implementation principles are similar. The actions performed by each module and unit of the apparatus correspond to the steps of the method described above, and the detailed functional description of each module can be found in the description of the corresponding method shown above, which is not repeated here.
The mission planning model establishing apparatus of the Mars vehicle may be a computer program (including program code) running in a computer device, for example application software; the apparatus may be used to perform the corresponding steps in the methods provided by the embodiments of the present invention.
In some embodiments, the mission planning model establishing apparatus for a Mars vehicle provided in the embodiments of the present invention may be implemented by a combination of software and hardware. As an example, it may be a processor in the form of a hardware decoding processor programmed to execute the mission planning model establishing method for a Mars vehicle provided in the embodiments of the present invention; for example, the processor in the form of a hardware decoding processor may adopt one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs) or other electronic components.
In other embodiments, the mission planning model establishing apparatus for a Mars vehicle provided in the embodiment of the present invention may be implemented in software. Fig. 3 illustrates the apparatus stored in a memory, which may be software in the form of programs and plug-ins and includes a series of modules, namely the data obtaining module 210, the task determining module 220, the second space-ground environment state determining module 230, the reward value determining module 240, the updating module 250, the single training period module 260, the training end module 270 and the model generating module 280, for implementing the mission planning model establishing method for a Mars vehicle provided in the embodiment of the present invention.
The modules described in the embodiments of the present invention may be implemented by software or hardware. Wherein the name of a module in some cases does not constitute a limitation on the module itself.
Based on the same principle as the method shown in the embodiment of the present invention, an embodiment of the present invention also provides an electronic device, which may include but is not limited to: a processor and a memory; a memory for storing a computer program; a processor for executing the method according to any of the embodiments of the present invention by calling a computer program.
In an alternative embodiment, an electronic device is provided, as shown in fig. 4, the electronic device 4000 shown in fig. 4 comprising: a processor 4001 and a memory 4003. Processor 4001 is coupled to memory 4003, such as via bus 4002. Optionally, the electronic device 4000 may further include a transceiver 4004, and the transceiver 4004 may be used for data interaction between the electronic device and other electronic devices, such as transmission of data and/or reception of data. In addition, the transceiver 4004 is not limited to one in practical applications, and the structure of the electronic device 4000 is not limited to the embodiment of the present invention.
The Processor 4001 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, transistor logic device, hardware component, or any combination thereof. It may implement or perform the various illustrative logical blocks, modules and circuits described in connection with the disclosure. The processor 4001 may also be a combination that performs a computing function, e.g. a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
Bus 4002 may include a path that carries information between the aforementioned components. The bus 4002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 4002 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 4, but this does not indicate only one bus or one type of bus.
The Memory 4003 may be a ROM (Read Only Memory) or other types of static storage devices that can store static information and instructions, a RAM (Random Access Memory) or other types of dynamic storage devices that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical Disc storage, optical Disc storage (including Compact Disc, laser Disc, optical Disc, digital versatile Disc, blu-ray Disc, etc.), a magnetic Disc storage medium or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited to these.
The memory 4003 is used for storing application program codes (computer programs) for executing the aspects of the present invention, and the execution is controlled by the processor 4001. Processor 4001 is configured to execute application code stored in memory 4003 to implement what is shown in the foregoing method embodiments.
The electronic device may also be a terminal device, and the electronic device shown in fig. 4 is only an example and should not bring any limitation to the functions and the application range of the embodiment of the present invention.
An embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, and when the computer program runs on a computer, the computer is enabled to execute the corresponding content in the foregoing method embodiment.
According to another aspect of the invention, there is also provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the methods provided in the various embodiment implementations described above.
Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object-oriented programming language such as Java, Smalltalk or C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
It should be understood that the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Embodiments of the present invention provide a computer readable storage medium that may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The computer-readable storage medium carries one or more programs which, when executed by the electronic device, cause the electronic device to perform the method shown in the above embodiments.
The foregoing description is only exemplary of the preferred embodiments of the invention and illustrates the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure is not limited to the particular combination of features described above, but also encompasses other embodiments formed by any combination of the above features or their equivalents without departing from the spirit of the disclosure, for example technical solutions formed by replacing the above features with (but not limited to) features having similar functions disclosed in the present invention.

Claims (10)

1. A method for establishing a mission planning model of a Mars vehicle is characterized by comprising the following steps:
step S1, acquiring a first space-ground environment state, wherein the first space-ground environment state is an initial space-ground environment state satisfying constraint conditions;
step S2, determining, according to the first space-ground environment state, the task corresponding to the first space-ground environment state from the tasks according to an epsilon-greedy strategy;
step S3, determining the second space-ground environment state reached after the Mars vehicle executes the task in the first space-ground environment state;
step S4, for each task, determining the reward value corresponding to the task according to the task and a target space-ground environment state, wherein the target space-ground environment state comprises the first space-ground environment state, or the first space-ground environment state and the second space-ground environment state;
step S5, for each task in the second space-ground environment state, updating the value of the element corresponding to the task in the state task value function of the first space-ground environment state according to the reward value corresponding to the task and the maximum value of the element corresponding to the task in the state task value function of the second space-ground environment state, so as to obtain an updated state task value function, wherein the updated state task value function is the state task value function of the updated first space-ground environment state, and each state task value function represents the expectation of the accumulated reward value obtainable by executing the tasks other than the target task after the target task is executed in the second space-ground environment state;
step S6, taking the second space-ground environment state as a new first space-ground environment state and repeating steps S2 to S5 until the cumulative execution time of the tasks in steps S2 to S5 reaches a preset mission planning duration, thereby completing one training period, and recording the total reward value of the training period, wherein the total reward value is determined from the reward values of the tasks in that training period;
step S7, executing the training period n times until the total reward value of the n-th training period satisfies the training end condition, stopping executing the (n+1)-th training period, and taking the updated state task value function obtained when the training end condition is satisfied as the target state task value function;
step S8, taking the target state task value function as a task planning model, so as to determine, according to a to-be-planned space-ground environment state, the target task whose corresponding element has the maximum value through the task planning model.
2. The method of claim 1, wherein the mission planning model is built based on a reinforcement learning method, the reinforcement learning method comprises Q-learning, and if the reinforcement learning method is Q-learning, the state task value function is a Q matrix.
3. The method according to claim 2, wherein if the state task value function is a Q matrix, for each task in the second space-ground environment state, updating the value of the element corresponding to the task in the state task value function of the first space-ground environment state according to the reward value corresponding to the task and the maximum value of the element corresponding to the task in the state task value function of the second space-ground environment state to obtain the updated state task value function is determined by the following formula:
Q(S_t, A_t) = Q(S_t, A_t) + α(λ·max_a Q(S_{t+1}, A_{t+1}) − Q(S_t, A_t) + R_{t+1})
wherein Q(S_t, A_t) denotes the state task value function corresponding to the first space-ground environment state, Q(S_{t+1}, A_{t+1}) denotes the state task value function corresponding to the second space-ground environment state, R_{t+1} denotes the reward value corresponding to said task, max_a Q(S_{t+1}, A_{t+1}) denotes the maximum value of the elements corresponding to the task in the state task value function of the second space-ground environment state, α is a learning rate, λ is a discount rate, and α and λ are set values.
4. The method according to any one of claims 1 to 3, wherein if the target space-ground environment state comprises the first space-ground environment state, the step S4 comprises:
for each task, determining the reward value corresponding to the task according to the task and the first space-ground environment state;
and if the target space-ground environment state comprises the first space-ground environment state and the second space-ground environment state, the step S4 comprises:
for each task, determining the state variation corresponding to the task according to the first and second space-ground environment states corresponding to the task, and determining the reward value corresponding to the task according to the state variation.
5. The method according to any one of claims 1 to 3, wherein the first space-ground environment state comprises a first external environment state, a first Mars vehicle state and a first ground environment state, and when the first space-ground environment state comprises the first external environment state, the step S3 comprises:
determining the second external environment state reached after the Mars vehicle executes each task in the first external environment state;
when the first space-ground environment state comprises the first Mars vehicle state, the step S3 comprises:
determining the second Mars vehicle state reached after the Mars vehicle executes each task in the first Mars vehicle state;
and when the first space-ground environment state comprises the first ground environment state, the step S3 comprises:
determining the second ground environment state reached after the Mars vehicle executes each task in the first ground environment state.
6. The method of any one of claims 1 to 3, wherein the constraint conditions comprise at least one of a communication constraint, an illumination constraint, an energy constraint, a temperature constraint, a terrain constraint, a data transmission capacity constraint and a time constraint; and the tasks comprise a probe task, a movement task, a perception task, a communication task and a ground planning task.
7. The method according to any one of claims 1 to 3, wherein the step S3 is implemented by a state model and the step S4 is implemented by a reward model.
8. A mission planning model establishing apparatus for a Mars vehicle, characterized by comprising:
a data obtaining module, configured to obtain a first space-ground environment state, wherein the first space-ground environment state is an initial space-ground environment state satisfying constraint conditions;
a task determining module, configured to determine, under the first space-ground environment state, the task corresponding to the first space-ground environment state from the tasks according to an epsilon-greedy strategy;
a second space-ground environment state determining module, configured to determine the second space-ground environment state reached after the Mars vehicle executes the task in the first space-ground environment state;
a reward value determining module, configured to determine, for each task, the reward value corresponding to the task according to the task and a target space-ground environment state, wherein the target space-ground environment state comprises the first space-ground environment state, or the first space-ground environment state and the second space-ground environment state;
an updating module, configured to update, for each task in the second space-ground environment state, the value of the element corresponding to the task in the state task value function of the first space-ground environment state according to the reward value corresponding to the task and the maximum value of the element corresponding to the task in the state task value function of the second space-ground environment state, so as to obtain an updated state task value function, wherein the updated state task value function is the state task value function of the updated first space-ground environment state, and each state task value function represents the expectation of the accumulated reward value obtainable by executing the tasks other than the target task after the target task is executed in the second space-ground environment state;
a single training period module, configured to take the second space-ground environment state as a new first space-ground environment state and repeat the processing from the task determining module to the updating module until the cumulative execution time of the tasks reaches a preset mission planning duration, thereby completing one training period, and to record the total reward value of the training period, wherein the total reward value is determined from the reward values of the tasks in that training period;
a training end module, configured to execute the training period n times until the total reward value of the n-th training period satisfies the training end condition, stop executing the (n+1)-th training period, and take the updated state task value function obtained when the training end condition is satisfied as the target state task value function;
and a model generating module, configured to take the target state task value function as a task planning model, so as to determine, according to a to-be-planned space-ground environment state, the target task whose corresponding element has the maximum value through the task planning model.
9. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the method of any one of claims 1-7 when executing the computer program.
10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, which computer program, when being executed by a processor, carries out the method of any one of claims 1-7.
CN202210419866.0A 2022-04-21 2022-04-21 Method and device for establishing mission planning model of mars vehicle, electronic equipment and medium Active CN114676471B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210419866.0A CN114676471B (en) 2022-04-21 2022-04-21 Method and device for establishing mission planning model of mars vehicle, electronic equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210419866.0A CN114676471B (en) 2022-04-21 2022-04-21 Method and device for establishing mission planning model of mars vehicle, electronic equipment and medium

Publications (2)

Publication Number Publication Date
CN114676471A CN114676471A (en) 2022-06-28
CN114676471B true CN114676471B (en) 2022-09-13

Family

ID=82077552

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210419866.0A Active CN114676471B (en) 2022-04-21 2022-04-21 Method and device for establishing mission planning model of mars vehicle, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN114676471B (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9620022B2 (en) * 2014-06-10 2017-04-11 Sikorsky Aircraft Corporation Aircraft motion planning method
US11188797B2 (en) * 2018-10-30 2021-11-30 International Business Machines Corporation Implementing artificial intelligence agents to perform machine learning tasks using predictive analytics to leverage ensemble policies for maximizing long-term returns
CN114021773A (en) * 2021-09-26 2022-02-08 北京百度网讯科技有限公司 Path planning method and device, electronic equipment and storage medium

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2011142933A2 (en) * 2010-05-14 2011-11-17 The Boeing Company Real time mission planning
JP2017045222A (en) * 2015-08-26 2017-03-02 株式会社日立製作所 Update planning apparatus and update planning method
CN111950873A (en) * 2020-07-30 2020-11-17 上海卫星工程研究所 Satellite real-time guiding task planning method and system based on deep reinforcement learning
CN111982129A (en) * 2020-08-24 2020-11-24 哈尔滨工业大学 Comprehensive global path planning method based on lunar surface digital elevation map
CN112325897A (en) * 2020-11-19 2021-02-05 东北大学 Path planning method based on heuristic deep reinforcement learning
CN113467487A (en) * 2021-09-06 2021-10-01 中国科学院自动化研究所 Path planning model training method, path planning device and electronic equipment
CN114019948A (en) * 2021-09-22 2022-02-08 北京空间飞行器总体设计部 Efficient autonomous operation method for Mars vehicle detection task

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
"祝融号"火星车遥操作技术;张辉 等;《深空探测学报(中英文)》;20220204;全文 *
A Novel Deep Neural Network Architecture for Mars Visual Navigation;J Zhang等;《arXiv.org》;20180825;全文 *
Research on mission planning technology for teleoperation of lunar rovers; Zhou Jianliang et al.; Scientia Sinica Informationis; 2014-04-20 (No. 04); full text *
UAV path planning based on PF-DQN in unknown environments; He Jin et al.; Ordnance Industry Automation; 2020-09-09 (No. 09); full text *

Also Published As

Publication number Publication date
CN114676471A (en) 2022-06-28

Similar Documents

Publication Publication Date Title
US20210278825A1 (en) Real-Time Production Scheduling with Deep Reinforcement Learning and Monte Carlo Tree Research
US20220363259A1 (en) Method for generating lane changing decision-making model, method for lane changing decision-making of unmanned vehicle and electronic device
CN109540150B (en) Multi-robot path planning method applied to hazardous chemical environment
US11861474B2 (en) Dynamic placement of computation sub-graphs
CN112001585B (en) Multi-agent decision method, device, electronic equipment and storage medium
JP6917508B2 (en) Environmental prediction using reinforcement learning
CN108321795B (en) Generator set start-stop configuration method and system based on deep certainty strategy algorithm
CN110481536B (en) Control method and device applied to hybrid electric vehicle
Aljohani et al. Real-Time metadata-driven routing optimization for electric vehicle energy consumption minimization using deep reinforcement learning and Markov chain model
CN110717600B (en) Sample pool construction method and device, and algorithm training method and device
CN105631528A (en) NSGA-II and approximate dynamic programming-based multi-objective dynamic optimal power flow solving method
CN116345578B (en) Micro-grid operation optimization scheduling method based on depth deterministic strategy gradient
CN110889530A (en) Destination prediction method based on recurrent neural network and server
Chen et al. Routing and scheduling of mobile energy storage system for electricity arbitrage based on two-layer deep reinforcement learning
CN115496201A (en) Train accurate parking control method based on deep reinforcement learning
CN114676471B (en) Method and device for establishing mission planning model of mars vehicle, electronic equipment and medium
CN115906673B (en) Combat entity behavior model integrated modeling method and system
CN116795720A (en) Unmanned driving system credibility evaluation method and device based on scene
CN117149410A (en) AI intelligent model based training, scheduling, commanding and monitoring system
CN111225034A (en) WebService-based dynamic integration method and assembly of water environment safety regulation and control model
CN116009542A (en) Dynamic multi-agent coverage path planning method, device, equipment and storage medium
EP3742344A1 (en) Computer-implemented method of and apparatus for training a neural network
Chu et al. A Model Based Approach for Digital Testbed Development supporting Virtual Experimentation of an Unmanned Surface Vehicle
CN116662815B (en) Training method of time prediction model and related equipment
CN110351755A (en) A kind of method and device of control node

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant