CN115481779A - Satellite resource scheduling optimization method based on federated reinforcement learning

Satellite resource scheduling optimization method based on federated reinforcement learning

Info

Publication number
CN115481779A
Authority
CN
China
Prior art keywords
observation
satellite
task
model
reinforcement learning
Prior art date
Legal status
Pending
Application number
CN202210931479.5A
Other languages
Chinese (zh)
Inventor
陈华洋
王冠
段然
钱浩煜
刘聪
吴逸汀
邢清雄
Current Assignee
CETC 28 Research Institute
Original Assignee
CETC 28 Research Institute
Priority date
Filing date
Publication date
Application filed by CETC 28 Research Institute
Priority to CN202210931479.5A
Publication of CN115481779A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/08 Learning methods
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0631 Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G06Q10/06315 Needs-based resource requirements planning or analysis
    • G06Q10/06316 Sequencing of tasks or work

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Strategic Management (AREA)
  • Economics (AREA)
  • General Physics & Mathematics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Tourism & Hospitality (AREA)
  • Operations Research (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Development Economics (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Game Theory and Decision Science (AREA)
  • Health & Medical Sciences (AREA)
  • Educational Administration (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Radio Relay Systems (AREA)

Abstract

The invention provides a satellite resource scheduling optimization method based on federated reinforcement learning, which abstracts the earth observation satellite resource scheduling optimization problem into a discrete Markov decision problem and applies a federated reinforcement learning algorithm to solve for the optimal earth observation satellite resource scheduling solution.

Description

Satellite resource scheduling optimization method based on federated reinforcement learning
Technical Field
The invention relates to the technical field of earth observation satellite resource planning, and in particular to a satellite resource scheduling optimization method based on federated reinforcement learning.
Background
The earth observation satellite resource scheduling optimization problem is a complex combinatorial optimization problem with time window constraints and resource constraints. The characteristics of the various satellite resources and observation tasks, and the many constraint relations among them, must be considered comprehensively. Taking the maximum satellite resource utilization rate or task completion rate as the scheduling optimization objective, and fully accounting for the characteristics of the satellite resources, the task characteristics and the constraints between tasks and resources, a scheduling plan for the satellite observation resources is reasonably arranged in combination with the task objectives, and an optimal satellite earth observation scheduling scheme is generated;
the traditional method for solving the problems is based on a programming problem with constraints, a heuristic algorithm or a meta-heuristic algorithm and a machine learning algorithm are utilized, an experience rule is adopted to seek an optimal and the latest satellite resource scheduling scheme within an acceptable time range, but the intelligent algorithm is relatively dependent on the experience rule, and the design of the experience rule needs a large amount of professional knowledge and rich industrial experience as a basis, so that the difficulty is high, and the construction cost is high.
Disclosure of Invention
Purpose of the invention: to address the deficiencies of the prior art, the invention provides a satellite resource scheduling optimization method based on federated reinforcement learning, which effectively overcomes the problems of the traditional methods described in the background art, namely their heavy reliance on empirical rules whose design requires a large amount of professional knowledge and rich industry experience, making them difficult and costly to construct.
The invention specifically provides a satellite resource scheduling optimization method based on federated reinforcement learning, in which the earth observation satellite resource scheduling optimization problem is abstracted into a discrete Markov decision problem and a federated reinforcement learning algorithm is applied to solve for the optimal earth observation satellite resource scheduling solution; the invention specifically comprises the following steps:
step 1, establishing a reinforcement learning DQN model for each agent in the federated reinforcement learning algorithm, and setting the state space of the agent in the environment, the action space over which the agent can make decisions, and the action reward given by the environment to the agent;
step 2, establishing a corresponding neural network for each agent according to the reinforcement learning DQN algorithm, and obtaining an approximate value function by using a target neural network (Target-Q);
step 3, the agent decides the next action to take using an ε-greedy strategy according to the assigned target and its own current state, interacts with the environment to obtain its next state, stores the decision experience in a replay memory unit, and updates the target neural network (Target-Q) model parameters in time according to the gradient of the error function;
step 4, after the loop reaches a set number of iterations, transmitting the local target neural network model parameters to a DQN model used for parameter aggregation, recording this DQN model as the joint virtual model, for subsequent federated learning;
step 5, the joint virtual model aggregates the parameters uploaded by all the agents and returns the corresponding aggregation result to each agent for parameter updating;
step 6, each agent performs a soft update between the received aggregation result and its target neural network (Target-Q) model parameters to obtain the latest local reinforcement learning model parameters;
step 7, repeating steps 3 to 6 until the target task is completed, obtaining the optimal reinforcement learning model parameters;
step 8, constructing an enhanced reinforcement learning model (E-Agent DQN) from the optimal reinforcement learning model parameters to obtain the optimal satellite resource scheduling scheme;
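Read as a whole, steps 1 to 8 describe one federated training loop: local DQN training with ε-greedy exploration and a replay memory, periodic upload of local parameters, aggregation by the joint virtual model, and a soft update of each local model. The following minimal Python sketch is purely illustrative and not taken from the patent: the toy environment, the tabular Q-function standing in for the per-agent DQN/Target-Q networks, and all names (ToyEnv, FederatedAgent, aggregate) are assumptions made for the example.

```python
# Minimal sketch of the federated training loop described in steps 1-8.
# Everything here is an illustrative assumption: the toy environment and the
# tabular Q approximation stand in for the real DQN networks.
import random
from collections import deque

import numpy as np

N_STATES, N_ACTIONS = 8, 4          # toy discretised state / action spaces
GAMMA, LR, EPS, TAU = 0.95, 0.1, 0.1, 0.5


class ToyEnv:
    """Stand-in for the satellite-scheduling environment (assumption)."""
    def reset(self):
        self.s = 0
        return self.s

    def step(self, a):
        self.s = (self.s + a + 1) % N_STATES
        reward = 1.0 if self.s == N_STATES - 1 else 0.0
        return self.s, reward, self.s == N_STATES - 1


class FederatedAgent:
    """One local agent: the Q table plays the role of the DQN / Target-Q pair."""
    def __init__(self):
        self.q = np.zeros((N_STATES, N_ACTIONS))        # online parameters
        self.target_q = np.zeros_like(self.q)           # Target-Q parameters
        self.memory = deque(maxlen=1000)                 # replay memory unit
        self.env = ToyEnv()

    def act(self, s):
        # epsilon-greedy decision (step 3)
        if random.random() < EPS:
            return random.randrange(N_ACTIONS)
        return int(np.argmax(self.q[s]))

    def local_round(self, steps=200):
        s = self.env.reset()
        for _ in range(steps):
            a = self.act(s)
            s2, r, done = self.env.step(a)
            self.memory.append((s, a, r, s2, done))
            # sampled TD update standing in for the DQN gradient step
            bs, ba, br, bs2, bdone = random.choice(self.memory)
            td_target = br + (0 if bdone else GAMMA * self.target_q[bs2].max())
            self.q[bs, ba] += LR * (td_target - self.q[bs, ba])
            self.target_q = 0.99 * self.target_q + 0.01 * self.q  # track online net
            s = self.env.reset() if done else s2


def aggregate(params):
    """Joint virtual model: plain parameter averaging (step 5)."""
    return np.mean(params, axis=0)


agents = [FederatedAgent() for _ in range(4)]
for round_no in range(20):                 # repeat steps 3-6 (step 7)
    for ag in agents:
        ag.local_round()                   # local training and upload (steps 3-4)
    theta_avg = aggregate([ag.target_q for ag in agents])
    for ag in agents:                      # soft update (step 6)
        ag.target_q = (1 - TAU) * ag.target_q + TAU * theta_avg
print("trained; greedy value of state 0:", agents[0].q[0].max())
```

A real implementation would replace the tabular Q-function with the per-agent deep Q-networks and the error-gradient update described in step 3.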
In step 5, the parameter sample of the deep reinforcement learning DQN model uploaded by the i-th agent is denoted θ_i (for example memory capacity N, initial weights ω, and the like); at the same time, a deep reinforcement learning DQN model used for fusion learning is constructed and recorded as the joint virtual agent, whose parameter sample set is Θ = {θ_i | 1 ≤ i ≤ N}. The centre point θ_avg of the samples is obtained by computing the average value:
θ_avg = (1/N) · Σ_{i=1}^{N} θ_i
θ_avg is the aggregation result returned by the joint virtual model.
In step 6, after an agent receives the aggregation result θ_avg returned by the joint virtual model, the local deep reinforcement learning DQN model is updated in a soft-update manner, i.e. θ_avg is blended into the model parameter sample θ_i with proportion τ. The updated neural network parameters θ′_i of the deep reinforcement learning DQN model are:
θ′_i = (1 - τ)·θ_i + τ·θ_avg
This completes one round of federated learning, where τ ∈ [0, 1]. When τ = 0, the parameters are not updated and θ_avg is not fused into the local deep reinforcement learning DQN model; when τ = 1, the local deep reinforcement learning DQN model directly copies the update parameter θ_avg.
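The aggregation and soft-update formulas above can be checked with a tiny numeric example; the parameter vectors below are invented three-dimensional values used only to illustrate the computation.

```python
# Tiny numeric illustration (not from the patent) of the averaging and
# soft-update formulas above, with made-up 3-dimensional parameter vectors.
import numpy as np

thetas = [np.array([1.0, 2.0, 3.0]),      # theta_1 ... theta_3 (assumed values)
          np.array([2.0, 0.0, 4.0]),
          np.array([3.0, 1.0, 2.0])]
theta_avg = np.mean(thetas, axis=0)        # centre point returned by the joint virtual model

tau = 0.3                                  # soft-update proportion in [0, 1]
theta_new = [(1 - tau) * th + tau * theta_avg for th in thetas]

assert np.allclose((1 - 0.0) * thetas[0] + 0.0 * theta_avg, thetas[0])   # tau = 0: no update
assert np.allclose((1 - 1.0) * thetas[0] + 1.0 * theta_avg, theta_avg)   # tau = 1: copy theta_avg
print(theta_avg, theta_new[0])
```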
The parameter update formulas of the joint virtual agent are as follows:
[equation images BDA0003781737210000031 and BDA0003781737210000032 in the original publication]
where the superscript v denotes the joint virtual agent, θ_t(v) are the neural network parameters of the joint virtual agent's deep reinforcement learning DQN model at time t, θ_t(i) are the neural network training parameters of the i-th agent's deep reinforcement learning DQN model at time t, v_t(v) is the parameter change value of the v-th joint virtual agent's deep reinforcement learning DQN model, l_t is the learning rate, N_t is the number of active agents at time t, Loss(·) is the loss function, and ρ is the system weight, generally taken as 0.5.
In step 8, modeling the satellite resource scheduling optimization problem by using a Markov Decision Process (MDP), wherein three elements forming the Markov Decision Process are an environment state s, a Decision action a and a reward return r respectively;
the decision process is to select corresponding action to make decision according to the strategy based on the current state, obtain corresponding decision return, and describe the expected reward return of the whole Markov decision process by using a Q value function;
in the process of solving Markov decision by reinforcement learning optimization, the intelligent agent selects a corresponding decision action a according to a certain strategy in an environment state s, and the decision action a acts on the interactive external environment of the intelligent agent, so that the environment state s is correspondingly changed, and a corresponding reward return r is obtained, wherein the goal is to obtain the strategy of optimal reward return based on the interactive process;
modeling the satellite resource scheduling optimization problem by using a Markov decision process, wherein the fact is that a random process formally describes a geosynchronous observation satellite resource scheduling application scene, and three elements of the Markov decision process are extracted, so that the Markov decision process is converted into a resource scheduling optimization model which can be described and solved by reinforcement learning;
specifically, for an earth observation satellite resource scheduling application scenario, a model including an external environment state, a decision action and a reward return evaluation index is abstracted, which is specifically as follows:
abstracting each satellite resource in the earth observation task and a state set of the observation task into a state of a Markov decision process, and recording the state as an environment state; and abstracting the satellite resource decision action variable into an action of a Markov decision process, recording the action as a decision action, and taking the satellite resource scheduling performance evaluation index as a decision return in the Markov decision process.
In step 8, the environment state of earth observation satellite resource scheduling is a description of the earth observation satellite resource scheduling application scenario, including descriptions of the attribute characteristics of the earth observation satellite resources and of the observation task characteristics; the whole environment state comprises an observation state and a task state;
when an idle time window of a satellite observation resource is visible and available for a task, the state at the corresponding observation state matrix position is set to 1, otherwise it is set to 0;
for the different satellite resources of each time window, a 0 or 1 mark is assigned according to whether the satellite resource can meet the observation requirement in the current time window, so as to determine whether the various satellite observation resources are idle relative to the observation tasks in each time window;
a 0-1 matrix is used to represent the state matrix of the satellite observation resources in a given satellite resource scheduling scenario, thereby determining the availability of the satellite observation resources at each moment relative to the observation tasks, determining the observation state matrix, and constructing the state of the satellite resources in the time dimension;
the observation state matrix and the task state matrix in the same time window are integrated to form the environment state of earth observation satellite resource scheduling in the current time window, and the designed environment state matrix S_[TaskS,TaskE] has the form:
[matrix image BDA0003781737210000041 in the original publication]
where TaskS and TaskE respectively denote the start time and end time of the current time window; in the environment state matrix S_[TaskS,TaskE], the first column is the serial number of each task, and the other columns are the corresponding task start time, task end time, task priority, imaging time, total number of observed tasks, number of targets to be observed, task observation duration requirement, …, equipment switching time, storage capacity, …;
the leading columns of the environment state matrix S_[TaskS,TaskE] are task states and the trailing columns are resource states; their values depend on the actual observation task scenario, the data above are only used to illustrate the form of the matrix, and the matrices of the different time windows form the state space of the earth observation satellite resource MDP model.
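As an illustration only (the exact column layout is shown as an image in the original and the numbers below are invented), the environment state for one time window could be assembled from a per-task state matrix and a 0-1 observation (availability) matrix like this:

```python
# Illustrative construction (an assumption, not the patent's exact layout) of the
# environment state for one scheduling time window: a per-task state matrix plus
# a 0-1 observation state matrix, concatenated column-wise.
import numpy as np

n_tasks, n_resources = 3, 2

# task state: [task id, start, end, priority, imaging time] -- invented example values
task_state = np.array([
    [1, 10.0, 40.0, 3, 5.0],
    [2, 15.0, 60.0, 1, 8.0],
    [3, 30.0, 90.0, 2, 4.0],
])

# observation state: 1 where the resource's idle time window is visible and
# available for the task in the current window, 0 otherwise (invented values)
observation_state = np.array([
    [1, 0],
    [1, 1],
    [0, 1],
])

# environment state for the window [TaskS, TaskE]: tasks as rows, task-state
# columns first, resource-availability columns after
S_window = np.hstack([task_state, observation_state])
print(S_window.shape)   # (3, 7)
```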
In step 8, for decision actions: the earth observation satellite resource scheduling problem is essentially an optimization problem with a plurality of constraints, is an NP-difficult problem, and needs to comprehensively consider the constraints of various satellite resources and task characteristics in a specific application scene and determine the reachable range of corresponding decision behaviors under the condition of meeting the constraints of various resources in the current state;
earth satellite observation task scheduling is described by a five-tuple <E, S, T, C, F>, where E is the observation period, generally defined as 24 hours; S = {S_1, S_2, …, S_{N_S}} is the set of observation satellites, with S_{N_S} denoting the N_S-th observation satellite; T = {T_1, T_2, …, T_{N_T}} is the set of observation tasks, with T_{N_T} denoting the N_T-th observation task; C is the set of constraint conditions; and F is the objective function;
an observation task can be observed and imaged by two or more satellites, and each observation satellite has several visible time windows for the observation task; the k-th visible time window of the j-th observation satellite S_j for the i-th observation task T_i is denoted O_ijk = [ws_ijk, we_ijk], where ws_ijk is the visible-window start time and we_ijk is the visible-window end time; within a specific time window the observation satellites define for task T_i a set of visible time windows O_i:
O_i = {O_ijk | 1 ≤ j ≤ N_S, 1 ≤ k ≤ N_ij}
where N_ij is the number of visible time windows of the j-th observation satellite S_j for the i-th observation task T_i, and N_S is the total number of selectable satellites;
let the priority of the i-th observation task T_i be p_i and its required imaging time be d_i; let the remote sensor side-sway rate of the j-th observation satellite S_j be r_j, its stabilization time after side sway be h_j, the storage space it requires per unit imaging time be α_j, its maximum storage capacity be M_j, and its maximum allowed side-view count be R_j; x_ijk is the decision variable, where
x_ijk = 1 if observation task T_i is observed by satellite S_j within the visible time window O_ijk, and x_ijk = 0 otherwise;
In step 8, earth satellite observation must satisfy the following constraints:
observation task uniqueness constraint: an observation task is observed by an observation satellite only once and cannot be interrupted, expressed as:
Σ_{j=1}^{N_S} Σ_{k=1}^{N_ij} x_ijk ≤ 1,  i = 1, …, N_T;
constraint on transitions between satellite observation activities: between two consecutive imaging activities of an observation satellite there must be enough time for the satellite-borne remote sensor to perform attitude conversion, which comprises the remote sensor side-sway rotation time |g_ikj - g_i'jk'| and the stabilization time h_j after the side sway; for any two observations of tasks T_i and T_i' performed consecutively by satellite S_j (x_ijk = x_i'jk' = 1, with we_ijk ≤ ws_i'jk') this is expressed as:
we_ijk + |g_ikj - g_i'jk'| + h_j ≤ ws_i'jk'
where g_ikj and g_i'jk' respectively denote the starting time and the ending time of the attitude conversion;
satellite memory capacity constraint: the capacity of the satellite-borne memory is limited, and the data acquired by satellite imaging of duration d_i cannot exceed the memory capacity limit M_j, expressed as:
Σ_{i=1}^{N_T} Σ_{k=1}^{N_ij} α_j · d_i · x_ijk ≤ M_j,  j = 1, …, N_S
where α_j is the storage space required by satellite S_j per unit imaging time;
satellite side-view count constraint: limited by satellite resources and maneuverability, a satellite can only complete a limited number R_j of side-view imaging actions, expressed as:
Σ_{i=1}^{N_T} Σ_{k=1}^{N_ij} x_ijk ≤ R_j,  j = 1, …, N_S;
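A small sketch of how a candidate assignment could be checked against these constraints follows; it is an assumption made for illustration (the dimensions, numeric values and the exact algebraic form of the side-view constraint are invented, and the transition constraint is omitted for brevity):

```python
# Sketch (assumption) of checking a candidate assignment x[i, j, k] against the
# uniqueness, memory-capacity and side-view-count constraints written above.
import numpy as np

n_tasks, n_sats, n_windows = 3, 2, 2
x = np.zeros((n_tasks, n_sats, n_windows), dtype=int)
x[0, 0, 0] = 1          # task 0 observed by satellite 0 in window 0
x[1, 1, 0] = 1
x[2, 1, 1] = 1

d = np.array([5.0, 8.0, 4.0])        # required imaging time per task
alpha = np.array([1.0, 1.5])         # storage per unit imaging time, per satellite
M = np.array([20.0, 25.0])           # memory capacity per satellite
R = np.array([2, 2])                 # maximum side-view count per satellite

uniqueness_ok = bool((x.sum(axis=(1, 2)) <= 1).all())
memory_used = np.array([(alpha[j] * d * x[:, j, :].sum(axis=1)).sum()
                        for j in range(n_sats)])
memory_ok = bool((memory_used <= M).all())
sideview_ok = bool((x.sum(axis=(0, 2)) <= R).all())

print(uniqueness_ok, memory_ok, sideview_ok)
```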
for a satellite observation task, the decision action of satellite resource scheduling is converted into a variable a_i that describes whether a satellite observation resource accepts the current observation task at a moment t:
a_i = 1 if satellite observation resource i accepts the current observation task at moment t, and a_i = 0 otherwise;
the whole satellite observation resource scheduling strategy π is described as π = {a_1, a_2, …, a_{N_S}}, with i ranging from 1 to N_S.
In step 8, for reward return: the satellite observation scheduling performance comprehensive evaluation index is constructed by comprehensively considering three aspects of observation task completion degree, observation target priority and satellite observation resource utilization rate, wherein:
sub-goals 1: maximizing the priority y of the observed target 1 Comprises the following steps:
Figure BDA0003781737210000066
wherein a is a weight parameter, p i Is the priority of target i;
sub-goals 2: maximizing the priority y of the number of target observations 2 Comprises the following steps:
Figure BDA0003781737210000067
wherein b is a weight parameter and max is a maximum function;
sub-goals 3: minimizing resource consumption y 3 Comprises the following steps:
Figure BDA0003781737210000071
wherein C i The number of resources consumed for observation of the task i, and c is a weight parameter;
the objective function of satellite resource scheduling is a multi-objective programming problem; to simplify the calculation, the ideal point method is used to convert it into a single-objective programming problem, i.e. the single-objective optimal solutions of the three sub-goals, y_1*, y_2* and y_3*, and the worst solutions of the three sub-goals, y_1^-, y_2^- and y_3^-, are solved first; then, for any scheme, the relative closeness H of the objective values to the optimal solution and to the worst solution is calculated:
[equation images BDA0003781737210000076 and BDA0003781737210000077 in the original publication]
where η, ρ and γ are the weights of the objectives y_1, y_2 and y_3 and satisfy η + ρ + γ = 1; they are set according to the actual requirements of the observation task, for example η = 0.5, ρ = 0.3 and γ = 0.2; according to the above formulas, the objective function is converted into maximizing the reward r, expressed as follows:
[equation image BDA0003781737210000078 in the original publication]
the instant return r in the Markov decision process model of observation satellite resource scheduling is set as:
[equation image BDA0003781737210000079 in the original publication]
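The closeness and reward formulas themselves appear only as images in the source, so the scalarization below is a generic ideal-point form written only to illustrate how the weights η, ρ, γ and the best and worst sub-goal values could combine into a single reward; it is an assumption, not the patent's exact formula, and all numeric values are invented:

```python
# Generic ideal-point scalarisation (assumption, not the patent's exact formula).
import numpy as np

eta, rho, gamma = 0.5, 0.3, 0.2                    # example weights, eta + rho + gamma = 1

y       = np.array([12.0, 7.0, 3.0])               # sub-objective values of a candidate schedule (invented)
y_best  = np.array([15.0, 9.0, 1.0])               # per-objective optimal solutions (invented)
y_worst = np.array([4.0, 2.0, 8.0])                # per-objective worst solutions (invented)

w = np.array([eta, rho, gamma])
d_best = np.sqrt((w * (y - y_best) ** 2).sum())    # weighted distance to the ideal point
d_worst = np.sqrt((w * (y - y_worst) ** 2).sum())  # weighted distance to the anti-ideal point

reward = d_worst / (d_best + d_worst)              # closer to the ideal point -> larger reward
print(round(float(reward), 3))
```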
In step 8, based on the abstraction and description of each element of the satellite observation resource scheduling problem, the satellite observation resource scheduling strategy, i.e. the action of the reinforcement learning network, is described specifically as: determining the visible time windows of the satellite observation task, determining whether to accept the measurement and control task, and determining the observation time window and the observation resource used to perform the satellite observation task.
In step 8, determining the visible time windows of the satellite observation task comprises: based on the discretization of the observation period and the design of the observation state, judging whether the start time and end time of the observation task to be assigned fall within the visible time window range of each satellite observation resource, thereby determining the set of time windows in which different satellite observation resources could complete the same observation task.
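A minimal sketch of this visibility test is given below; the function name, the window data and the resource names are assumptions made for the example, not part of the patent:

```python
# Minimal sketch (assumption) of the visibility test: a task's required start
# and end times must fall inside one of a resource's visible time windows.
def visible_windows(task_start, task_end, windows):
    """Return the windows [ws, we] of one satellite resource that can hold the task."""
    return [(ws, we) for ws, we in windows if ws <= task_start and task_end <= we]

# invented example: two satellite resources with their visible windows
resource_windows = {
    "sat_1": [(0.0, 30.0), (50.0, 90.0)],
    "sat_2": [(20.0, 70.0)],
}
candidate_set = {name: visible_windows(25.0, 60.0, wins)
                 for name, wins in resource_windows.items()}
print(candidate_set)   # {'sat_1': [], 'sat_2': [(20.0, 70.0)]}
```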
Determining whether to accept the measurement and control task comprises: judging whether to accept the current observation task according to the satellite visible time windows in which the observation task can be completed and the constraint conditions; if an observation task has no observation visible time window in which it can be completed, it is judged to be an observation task that temporarily cannot be completed.
Determining the observation time window and the observation resource used to perform the satellite observation task comprises: according to the federated reinforcement learning algorithm, based on the set of visible time windows in which the observation task can be completed, the visible time window uniquely corresponding to a satellite observation resource is obtained by decision; the satellite observation multi-agent federated reinforcement learning algorithm can thus determine the satellite observation resource and the observation visible time window that complete the observation task.
The method first uses federated learning and reinforcement learning techniques to mine the implicit internal association between earth observation tasks and satellite resources: each single agent autonomously learns and optimizes the model parameters of the satellite resource scheduling process, and federated learning fuses the model parameters of the individual agents on the basis of reinforcement learning to construct an earth observation satellite resource scheduling optimization model; second, a federated learning method with adaptive weights is used to fully mine the usable features generated during agent training and form a higher-quality global model.
The invention has the following beneficial effects:
1. The method provided by the invention uses federated learning and reinforcement learning techniques to mine the implicit association between earth observation tasks and satellite resources; each single agent autonomously learns and optimizes the model parameters of the satellite resource scheduling process, federated learning fuses the model parameters of the individual agents on the basis of reinforcement learning to construct an earth observation satellite resource scheduling optimization model, the various associated characteristic indexes of each agent during training are fully mined to form an efficient, high-quality global scheduling optimization model, and finally an optimal, conflict-free earth observation satellite resource scheduling optimization scheme is generated.
2. The aggregation method adopted by the invention is a federated learning method with adaptive weights: the joint virtual agent can calculate its objective function index from the local training models received from each agent; if an agent's weight update condition is reached, the contribution of that agent's model precision index to the model training precision is calculated, and a weighted average with the corresponding weights generates the global model parameters;
the joint virtual agent issues the updated global model parameters to the local agents; after receiving the parameters, each local agent retains its local model parameter characteristics and performs model training on its local training data, and after training the local model and the training precision index are uploaded to the joint virtual agent again, so that the usable features of the agent training process are fully mined, a higher-quality global model is formed, and model precision and convergence efficiency are improved.
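The following short sketch illustrates one way such precision-weighted aggregation could look; it is an assumption for illustration only (the precision indexes, parameter vectors and the exact weighting rule are invented and are not taken from the patent):

```python
# Sketch (assumption) of adaptive-weight aggregation: each agent uploads its
# parameters together with a training-precision index, and the joint virtual
# agent averages the parameters weighted by each agent's relative precision.
import numpy as np

local_params = [np.array([1.0, 2.0]), np.array([1.5, 1.0]), np.array([0.5, 3.0])]
precision = np.array([0.82, 0.91, 0.64])     # per-agent training precision index (invented)

weights = precision / precision.sum()        # relative contribution of each agent
theta_global = sum(w * p for w, p in zip(weights, local_params))

print(weights.round(3), theta_global.round(3))
```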
Drawings
The foregoing and/or other advantages of the invention will become further apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings.
FIG. 1 is a flow chart of the steps of the method of the present invention;
FIG. 2 is a flow chart of the construction of the satellite observation resource scheduling strategy of the present invention;
FIG. 3 is a schematic diagram of a multi-agent federated reinforcement learning satellite resource scheduling algorithm in accordance with the present invention.
Detailed Description
Embodiment: as shown in fig. 1, fig. 2 and fig. 3, the invention provides a technical solution, namely a satellite resource scheduling optimization method based on federated reinforcement learning, which abstracts the earth observation satellite resource scheduling optimization problem into a discrete Markov decision problem and applies a federated reinforcement learning algorithm to solve for the optimal earth observation satellite resource scheduling solution;
the imaging satellite resource scheduling method specifically comprises the following steps:
step 1, establishing a reinforcement learning DQN model for each agent, and setting the state space S of the agent in the environment, the action space A over which the agent can make decisions, and the action reward R given by the environment to the agent;
step 2, establishing a corresponding neural network for each agent according to the reinforcement learning DQN algorithm, and obtaining an approximate value function by using a target neural network (Target-Q);
step 3, the agent decides the next action to take using an ε-greedy strategy according to the assigned target and its own current state, interacts with the environment to obtain its next state, stores the decision experience in a replay memory unit, and updates the Target-Q model parameters in time according to the gradient of the error function;
step 4, after the loop reaches a set number of iterations, e.g. 100000, transmitting the local target neural network (Target-Q) model parameters to the joint virtual model for subsequent federated learning;
step 5, the joint virtual model aggregates the parameters uploaded by all the agents and returns the corresponding aggregation result to each agent for parameter updating;
step 6, each agent performs a soft update between the received aggregation result and its target neural network (Target-Q) model parameters to obtain the latest local reinforcement learning model parameters;
step 7, repeating steps 3 to 6 until the target task is completed, obtaining the optimal reinforcement learning model parameters;
step 8, finally, constructing an enhanced reinforcement learning model (E-Agent DQN) from the obtained optimal reinforcement learning model parameters to obtain the optimal satellite resource scheduling scheme;
the parameter aggregation algorithm uses federated learning to aggregate the parameters uploaded by the agents on the joint virtual model, eliminates abnormal and outlier parameters, performs a weighted average of the remaining parameters and returns the result to each agent for parameter updating; it is specifically described as follows:
the model parameter sample uploaded by the i-th agent is denoted θ_i, the model parameter sample set of the joint virtual agent is Θ = {θ_i | 1 ≤ i ≤ N}, and the centre point θ_avg of the samples is obtained by computing the average value:
θ_avg = (1/N) · Σ_{i=1}^{N} θ_i
The agent performs a soft update between the received result and its local neural network parameters, i.e. the result is blended into the local parameters in a certain proportion;
after the agent receives the updated parameter θ_avg returned by the joint virtual model, the local neural network model is updated in a soft-update manner, i.e. θ_avg is added to the local current neural model parameters θ_i with a certain proportion τ; the updated neural network parameters θ′_i are:
θ′_i = (1 - τ)·θ_i + τ·θ_avg
this completes one round of federated learning, where τ ∈ [0, 1]; when τ = 0, the parameters are not updated and θ_avg is not fused into the local model; when τ = 1, the local model directly copies the update parameter θ_avg;
according to experience, a relatively large value is set for the proportion τ at the beginning to speed up training, and the proportion is then gradually reduced during the training iterations to ensure convergence stability, so that each agent can learn from the experience of the other agents, optimize its local model, form a good cooperative effect and complete the tasks together;
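As an illustration of such a schedule (the start value, end value and decay horizon below are invented, not values given in the patent), τ could be decayed linearly over the federated rounds:

```python
# Illustrative decay schedule (assumption) for the soft-update proportion tau:
# start relatively large to speed up early training, then shrink it over the
# federated rounds for convergence stability.
def tau_schedule(round_no, tau_start=0.8, tau_end=0.05, decay_rounds=50):
    frac = min(round_no / decay_rounds, 1.0)
    return tau_start + frac * (tau_end - tau_start)

for r in (0, 10, 25, 50, 100):
    print(r, round(tau_schedule(r), 3))
# 0 -> 0.8, 10 -> 0.65, 25 -> 0.425, 50 -> 0.05, 100 -> 0.05
```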
the parameter update formulas of the joint virtual agent are:
[equation images BDA0003781737210000102 and BDA0003781737210000103 in the original publication]
where θ_t(v) are the neural network parameters of the joint virtual agent, θ_t(i) are the neural network training parameters of the i-th agent, and l_t is the learning rate.
Based on the above technical solution, the three elements constituting the Markov decision process (MDP) are the environment state (s), the decision action (a) and the reward return (r);
the decision process selects, according to the policy and based on the current state, a corresponding action to make a decision and obtains the corresponding decision return, and a Q-value function is used to describe the expected reward return of the whole Markov decision process;
in solving the MDP by reinforcement learning optimization, the agent selects a corresponding decision action (a) according to a certain policy in environment state (s); the decision action (a) acts on the external environment with which the agent interacts, so that the environment state (s) changes correspondingly and a corresponding reward return (r) is obtained; the goal is to obtain the policy with the optimal reward return based on this interaction process;
modeling the earth observation satellite resource scheduling optimization MDP model means formally describing the earth observation satellite resource scheduling application scenario as a stochastic process and extracting the three MDP elements, thereby converting it into a resource scheduling optimization model that reinforcement learning can describe and solve;
specifically, the earth observation satellite resource scheduling application scenario is abstracted into a model comprising the external environment state, the decision actions and the reward return evaluation index;
the state set of the relevant objects in the earth observation task, comprising the satellite resources and the observation tasks, is abstracted into the MDP state and recorded as the environment state; the satellite resource decision action variables are abstracted into the MDP action and recorded as the decision action; and the satellite resource scheduling performance evaluation index is taken as the decision return in the MDP.
Based on the above technical solution, the environment state: the environment state of earth observation satellite resource scheduling is a description of the earth observation satellite resource scheduling application scenario, including descriptions of the attribute characteristics of the earth observation satellite resources and of the observation task characteristics; the whole environment state consists of an observation state and a task state;
on the one hand, the satellite observation scheduling time is periodically discretized: the time-interval division scale is determined according to the specific requirements of the observation task, the scheduling time is divided into time windows of different scales, and it is further determined whether the earth observation satellite resources are visible and available relative to the observation tasks;
in a given time window, the observation state matrix of the satellite resources relative to the observation tasks is constructed according to whether a satellite resource is visible and available for a task; specifically, when an idle time window of a satellite observation resource is visible and available for a task, the state at the corresponding observation state matrix position is set to 1, otherwise it is set to 0;
for the different satellite resources of each time window, such as the satellite's payload type and memory capacity, a 0 or 1 mark is assigned according to whether the satellite resource can meet the observation requirement in the current time window, so as to determine whether the various satellite observation resources are idle relative to the observation tasks in each time window;
that is, a 0-1 matrix is used to represent the state matrix of the satellite observation resources in a given satellite resource scheduling scenario, so that the availability of the satellite observation resources at each moment relative to the observation tasks can be determined, the observation state matrix can be determined, and the state of the satellite resources in the time dimension is constructed;
on the other hand, the task state, based on the observation task dimension, mainly defines the task states of the measurement and control task in different time windows through static indexes such as the task serial number, observation task start time, observation task end time, task priority, image resolution requirement, imaging time, shortest observation time of the observation task and total observation time for completing the observation task, together with the number of observation tasks already completed, the total priority of the completed observation tasks and the total actual observation time for completing the observation tasks, thereby constructing the task state matrix;
the observation state matrix and the task state matrix in the same time window are integrated to form the environment state of earth observation satellite resource scheduling in the current time window; the designed environment state matrix S has the form:
[matrix image BDA0003781737210000121 in the original publication]
where [TaskS, TaskE] denotes the start time and end time of the current time window; the first column is the serial number of each task, and the other columns are the corresponding task start time, task end time, task priority, imaging time, total number of observed tasks, number of targets to be observed, task observation duration requirement, …, equipment switching time, storage capacity, …;
the leading part of the environment state matrix S contains task states and the trailing part contains resource states; their values depend on the actual observation task scenario, the data above are only used to illustrate the form of the matrix, and the matrices of the different time windows form the state space of the earth observation satellite resource MDP model.
Based on the above technical solution, the decision action: the earth observation satellite resource scheduling problem is essentially an optimization problem with multiple constraints and is NP-hard; the constraints of the various satellite resources and the task characteristics must be considered comprehensively in the specific application scenario, and the reachable range of the corresponding decision behaviors is determined under the condition that the various resource constraints of the current state are satisfied;
the decision action is selected according to the actual observation application scenario, i.e. suitable satellite observation resources are selected to meet the observation requirements of the corresponding task, ensuring that the current observation resources can satisfy the various constraints of the current observation task and that the various satellite resources can indeed be practically scheduled in real applications, which gives the method strong practicability;
because the earth rotates and the satellites fly around the earth, a satellite can observe a ground target only within specific time periods; the observation targets are point targets, i.e. each observation task can be completed by a single satellite in a single pass; each task has imaging time-period limits and payload type and image resolution requirements; satellite resource scheduling means selecting, among multiple satellites, those meeting the conditions to observe the observation targets, i.e. the decision action is to select the optimal satellite observation resource among the different visible time windows of the multiple satellite observation resources so as to optimize the scheduling objective function;
the earth satellite observation task scheduling problem is described by a five-tuple <E, S, T, C, F>, where E is the observation period, generally defined as 24 hours, S = {S_1, S_2, …, S_{N_S}} is the set of observation satellites, T = {T_1, T_2, …, T_{N_T}} is the set of observation tasks, C is the set of constraint conditions, and F is the objective function;
an observation task can be observed and imaged by several satellites, and each observation satellite has several visible time windows for the observation task; with O_ijk = [ws_ijk, we_ijk] denoting a visible time window, the observation satellites define for task T_i within a specific time window the set of visible time windows:
O_i = {O_ijk | 1 ≤ j ≤ N_S, 1 ≤ k ≤ N_ij}
where N_ij is the number of visible time windows of satellite S_j for task T_i, O_ijk is the k-th time window of satellite S_j for task T_i, and N_S is the total number of selectable satellites.
Let task T_i have priority p_i and required imaging time d_i; let satellite S_j have remote sensor side-sway rate r_j, stabilization time h_j after side sway, storage space α_j required per unit imaging time, maximum storage capacity M_j and maximum allowed side-view count R_j; x_ijk is the decision variable, where
x_ijk = 1 if task T_i is observed by satellite S_j within the visible time window O_ijk, and x_ijk = 0 otherwise.
based on the technical scheme, the earth satellite observation must meet the following constraints:
the uniqueness of the task is restricted, the task is observed by the satellite only once and can not be interrupted;
Figure BDA0003781737210000135
constraint of transitions between satellite observation activities: enough time must be provided between two continuous imaging activities of the satellite to ensure that the satellite-borne remote sensor performs attitude conversion, including the lateral swing rotation time of the remote sensor and the stabilization time after the lateral swing;
we ijk +|g ikj -g i'jk' |+h j ≤ws i'jk'
Figure BDA0003781737210000136
and we ijk ≤ws i'jk'
Satellite memory capacity constraints: the satellite-borne memory capacity is limited, and the data acquired by satellite imaging cannot exceed the memory capacity limit;
Figure BDA0003781737210000141
satellite side-view count constraint: limited by satellite resources and maneuverability, a satellite can only complete a limited number of side-view imaging actions;
Σ_{i=1}^{N_T} Σ_{k=1}^{N_ij} x_ijk ≤ R_j,  j = 1, …, N_S
for a satellite observation task, the decision action of satellite resource scheduling is converted into deciding, at a certain moment t, whether a satellite observation resource accepts the current observation task and which satellite observation resource accepts it;
a_i = 1 if satellite observation resource i accepts the current observation task at moment t, and a_i = 0 otherwise;
the whole satellite observation resource scheduling strategy π is described as π = {a_1, a_2, …, a_{N_S}}, with i ranging from 1 to N_S.
Based on the above technical solution, the reward return: a comprehensive evaluation index of satellite observation scheduling performance is constructed by comprehensively considering three aspects, namely observation task completion degree, observation target priority and satellite observation resource utilization, where:
sub-goal 1: maximize the priority of the observed targets:
[equation image BDA0003781737210000145 in the original publication]
sub-goal 2: maximize the priority of the number of target observations:
[equation image BDA0003781737210000146 in the original publication]
sub-goal 3: minimize the resource consumption:
[equation image BDA0003781737210000147 in the original publication]
where C_i is the number of resources consumed by observing task i and c is a weight parameter;
the objective function of satellite resource scheduling is a multi-objective programming problem; to simplify the calculation, the ideal point method is used to convert it into a single-objective programming problem, i.e. the single-objective optimal solutions y_1*, y_2* and y_3* and the worst solutions y_1^-, y_2^- and y_3^- are solved first, and then the relative closeness of the objective values to the optimal solution and to the worst solution under any scheme is calculated:
[equation images BDA0003781737210000151 and BDA0003781737210000152 in the original publication]
where η, ρ and γ are respectively the weights of the objectives y_1, y_2 and y_3, which can be set according to the actual requirements of the observation task; according to the above formulas the objective function can be converted into:
[equation image BDA0003781737210000153 in the original publication]
the immediate reward in the MDP model of observation satellite resource scheduling is set as:
[equation image BDA0003781737210000154 in the original publication]
As shown in fig. 2, based on the above technical solution, the satellite observation resource scheduling strategy, i.e. the action of the reinforcement learning network, is described, based on the abstraction and description of each element of the satellite observation resource scheduling problem, specifically as: determining the visible time windows of the satellite observation task, determining whether to accept the measurement and control task, and determining the observation time window and the observation resource used to perform the satellite observation task.
Based on the above technical solution, determining the visible time windows of the satellite observation task: based on the discretization of the observation period and the design of the observation state, it is judged whether the start time and end time of the observation task to be assigned fall within the visible time window range of each satellite observation resource, thereby determining the set of time windows in which different satellite observation resources could complete the same observation task.
Based on the above technical solution, determining whether to accept the measurement and control task: whether to accept the current observation task is judged according to the satellite visible time windows in which the observation task can be completed and the constraint conditions; if an observation task has no observation visible time window in which it can be completed, it is judged to be an observation task that temporarily cannot be completed.
Based on the above technical solution, determining the observation time window and the observation resource for performing the satellite observation task: according to the multi-agent deep reinforcement learning satellite resource scheduling algorithm, based on the set of visible time windows in which the observation task can be completed, the visible time window uniquely corresponding to a satellite observation resource can be obtained by decision; the satellite observation multi-agent deep reinforcement learning satellite resource scheduling algorithm can thus determine the satellite observation resource and the observation visible time window that complete the observation task.
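As a final illustration (the candidate pairs and Q-values below are invented placeholders, not outputs of the patent's model), this decision step can be pictured as selecting, among the feasible (resource, window) pairs, the one with the highest learned Q-value:

```python
# Sketch (assumption) of the final decision step: among the candidate
# (resource, window) pairs that can complete an observation task, pick the one
# with the highest learned Q-value. The Q-values are invented stand-ins.
candidates = [("sat_1", (50.0, 90.0)), ("sat_2", (20.0, 70.0)), ("sat_3", (25.0, 65.0))]
q_values = {("sat_1", (50.0, 90.0)): 0.41,
            ("sat_2", (20.0, 70.0)): 0.73,
            ("sat_3", (25.0, 65.0)): 0.58}

best_resource, best_window = max(candidates, key=lambda c: q_values[c])
print(best_resource, best_window)   # sat_2 (20.0, 70.0)
```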
In a specific implementation, the present application provides a computer storage medium and a corresponding data processing unit; the computer storage medium can store a computer program which, when executed by the data processing unit, can run the inventive content of the satellite resource scheduling optimization method based on federated reinforcement learning provided by the invention and all or part of the steps of its embodiments; the storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
It is obvious to those skilled in the art that the technical solutions in the embodiments of the invention can be implemented by means of a computer program and a corresponding general-purpose hardware platform; based on this understanding, the technical solutions in the embodiments of the invention may, in essence or in part, be embodied in the form of a computer program or software product, which may be stored in a storage medium and include instructions for enabling a device comprising a data processing unit (which may be a personal computer, a server, a single-chip microcomputer, an MCU or a network device) to execute the methods of the embodiments or of parts of the embodiments of the invention.
The invention provides a satellite resource scheduling optimization method based on federated reinforcement learning, and there are many methods and ways to implement this technical solution; the above is only a preferred embodiment of the invention, and it should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the invention, and these improvements and modifications should also be regarded as falling within the protection scope of the invention; all components not specified in this embodiment can be implemented with the prior art.

Claims (10)

1. A satellite resource scheduling optimization method based on federated reinforcement learning, characterized by comprising the following steps:
step 1, establishing a deep reinforcement learning DQN model for each agent in the federated reinforcement learning algorithm, and setting the state space of each agent in the environment, the action space over which the agent can make decisions, and the action reward given by the environment to the agent;
step 2, establishing a corresponding neural network for each agent according to the reinforcement learning DQN algorithm, and obtaining an approximate value function by using a target neural network;
step 3, the agent decides the next action to take using an ε-greedy strategy according to the assigned target and its own current state, interacts with the environment to obtain its next state, stores the decision experience in a replay memory unit, and updates the target neural network model parameters according to the gradient of the error function;
step 4, after the loop reaches a set number of iterations, transmitting the local target neural network model parameters to a DQN model used for parameter aggregation, recording this DQN model as the joint virtual model, for subsequent federated learning;
step 5, the joint virtual model aggregates the parameters uploaded by all the agents and returns the corresponding aggregation result to each agent for parameter updating;
step 6, each agent performs a soft update between the received aggregation result and its target neural network model parameters to obtain the latest local reinforcement learning model parameters;
step 7, repeating steps 3 to 6 until the target task is completed, obtaining the optimal reinforcement learning model parameters;
step 8, constructing an enhanced reinforcement learning model from the optimal reinforcement learning model parameters to obtain the optimal satellite resource scheduling scheme.
2. The method of claim 1, wherein in step 5, the deep reinforcement learning DQN model parameter sample uploaded by the i-th agent is denoted θ_i; meanwhile, a deep reinforcement learning DQN model for fusion learning is constructed and recorded as the joint virtual agent, whose parameter sample set is Θ = {θ_i | 1 ≤ i ≤ N}, and the centre point θ_avg of the samples is obtained by computing the average value:
θ_avg = (1/N) · Σ_{i=1}^{N} θ_i
θ_avg is the aggregation result returned by the joint virtual model.
3. The method of claim 2, wherein in step 6, after the agent receives the aggregation result θ_avg returned by the joint virtual model, the local deep reinforcement learning DQN model is updated in a soft-update manner, i.e. θ_avg is blended into the model parameter sample θ_i with proportion τ; the updated neural network parameters θ′_i of the deep reinforcement learning DQN model are:
θ′_i = (1 - τ)·θ_i + τ·θ_avg
This completes one round of federated learning, where τ ∈ [0, 1]; when τ = 0, the parameters are not updated and θ_avg is not fused into the local deep reinforcement learning DQN model; when τ = 1, the local deep reinforcement learning DQN model directly copies the update parameter θ_avg.
The parameter update formulas of the joint virtual agent are:
[equation images FDA0003781737200000021 and FDA0003781737200000022 in the original publication]
where θ_t(v) are the neural network parameters of the joint virtual agent's deep reinforcement learning DQN model, θ_t(i) are the neural network training parameters of the i-th agent's deep reinforcement learning DQN model at time t, v_t(v) is the parameter change value of the v-th joint virtual agent's deep reinforcement learning DQN model, l_t is the learning rate, N_t is the number of active agents at time t, Loss(·) is the loss function, and ρ is the system weight.
4. The method according to claim 3, wherein in step 8, the satellite resource scheduling optimization problem is modeled by a Markov decision process, and the three elements constituting the Markov decision process are an environment state s, a decision action a and an awarding return r;
the decision process is to select corresponding action to make decision based on the current state according to the strategy, obtain corresponding decision return, and use Q value function to describe the expected reward return of the whole Markov decision process;
in the process of solving Markov decision by reinforcement learning optimization, the intelligent agent selects a corresponding decision action a according to a strategy in an environment state s, and the decision action a acts on the interactive external environment of the intelligent agent, so that the environment state s is correspondingly changed, and corresponding reward return r is obtained, and the goal is to obtain the strategy of optimal reward return based on the interactive process;
modeling the satellite resource scheduling optimization problem by a Markov decision process, wherein the random process formally describes the earth observation satellite resource scheduling application scenario and the three elements of the Markov decision process are extracted, so that the satellite resource scheduling optimization problem is converted into a resource scheduling optimization model that can be described and solved by reinforcement learning;
the abstraction covers the external environment state, the decision action and the reward return evaluation index, and concretely comprises the following steps:
abstracting the state set of each satellite resource and of the observation task in the earth observation task into the state of the Markov decision process, recorded as the environment state; abstracting the satellite resource decision action variable into the action of the Markov decision process, recorded as the decision action; and taking the satellite resource scheduling performance evaluation index as the decision return of the Markov decision process.
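As an illustration of this abstraction, a Gym-style environment skeleton could be organized as follows; the class and method names are assumptions chosen for illustration only:

```python
import numpy as np

class SatelliteSchedulingEnv:
    """Illustrative Gym-style skeleton of the Markov decision process in claim 4.

    state  : the 0-1 environment state matrix of the current time window
    action : the decision action (e.g. accept or reject the current task)
    reward : the decision return derived from the scheduling evaluation index
    """

    def __init__(self, initial_state: np.ndarray):
        self.state = initial_state

    def step(self, action: int):
        # Applying the decision action would change the observation/task state
        # and yield the reward return; both are left as placeholders here.
        next_state = self.state.copy()
        reward = 0.0
        done = False
        self.state = next_state
        return next_state, reward, done
```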
5. The method according to claim 4, wherein in step 8, the environment state of earth observation satellite resource scheduling is a description of the earth observation satellite resource scheduling application scenario, including a description of the attribute characteristics of the earth observation satellite resources and of the observation task characteristics, and the overall environment state comprises an observation state and a task state;
when an idle time window of the satellite observation resource is visible and available for a task, the state corresponding to the observation state matrix position is set to 1, otherwise, the state is set to 0;
marking different satellite resources of each time window by using a number 0 or 1 according to whether the satellite resources can meet the observation requirement in the current time window, and determining whether various satellite observation resources are idle relative to the observation task in each time window;
representing a state matrix of satellite observation resources in a given satellite resource scheduling scene by using a 0-1 matrix, thereby determining the available condition of the satellite observation resources at each moment relative to an observation task, determining an observation state matrix, and constructing the state of the satellite resources on a time dimension;
integrating the observation state matrix and the task state matrix in the same time window to form the environment state of earth observation satellite resource scheduling in the current time window, and designing the environment state matrix S_[TaskS,TaskE] in the form:
[Matrix image FDA0003781737200000031: the environment state matrix S_[TaskS,TaskE].]
wherein TaskS and TaskE respectively represent the starting time and the ending time of the current time window, and the first column of the environment state matrix S_[TaskS,TaskE] is the sequence number of each task.
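A minimal sketch of how one 0-1 row of the observation state could be built from a resource's visible time windows, assuming a fixed discretization step; the function and parameter names are illustrative:

```python
import numpy as np

def observation_state_row(visible_windows, task_start, task_end, step=1.0):
    """Build one 0-1 row of the observation state for a single satellite resource.

    visible_windows: list of (ws, we) visible/idle intervals of the resource.
    A time slot in [task_start, task_end) is marked 1 when it falls inside a
    visible window (the resource is usable for the task), otherwise 0.
    """
    times = np.arange(task_start, task_end, step)
    row = np.zeros(len(times), dtype=int)
    for ws, we in visible_windows:
        row[(times >= ws) & (times < we)] = 1
    return row

# A resource visible during [2, 5) and [8, 10) over the window [0, 12):
row = observation_state_row([(2, 5), (8, 10)], 0, 12)   # -> [0 0 1 1 1 0 0 0 1 1 0 0]
```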
6. The method of claim 5, wherein in step 8, for the decision action: the earth satellite observation task scheduling problem is described by a five-tuple <E, S, T, C, F>, where E is the observation period, S = {S_1, S_2, …, S_{N_S}} is the set of observation satellites, S_{N_S} denotes the N_S-th observation satellite, T = {T_1, T_2, …, T_{N_T}} is the set of observation tasks, T_{N_T} denotes the N_T-th observation task, C is the set of constraint conditions, and F is the objective function;
One observation task can be observed and imaged by two or more satellites, and each observation satellite has visible time windows for the observation task; the k-th visible time window of the j-th observation satellite S_j for the i-th observation task T_i is recorded as O_ijk = [ws_ijk, we_ijk], where ws_ijk is the visible-window start time and we_ijk is the visible-window end time; within the given time period, the set of visible time windows in which the observation satellites can observe task T_i is O_i:
O_i = {O_ijk | 1 ≤ j ≤ N_S, 1 ≤ k ≤ N_ij}
wherein N_ij is the number of visible time windows of the j-th observation satellite S_j for the i-th observation task T_i, and N_S is the total number of selectable satellites;
let the priority of the i-th observation task T_i be p_i and its required imaging time be d_i; for the j-th observation satellite S_j, let the remote-sensor side-sway rate be r_j, the stabilization time after side sway be h_j, the storage space required per unit imaging time be α_j, the maximum storage capacity be M_j, and the maximum allowable number of side sways be R_j; x_ijk are the decision variables, wherein
x_ijk = 1 if observation task T_i is observed by satellite S_j within visible time window O_ijk, and x_ijk = 0 otherwise.
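For illustration, the elements of the five-tuple could be carried by simple data structures such as the following; the field names are assumptions chosen to mirror the symbols above:

```python
from dataclasses import dataclass

@dataclass
class VisibleWindow:
    ws: float                # visible-window start time ws_ijk
    we: float                # visible-window end time we_ijk

@dataclass
class ObservationTask:
    priority: float          # p_i
    imaging_time: float      # d_i

@dataclass
class ObservationSatellite:
    sway_rate: float         # r_j, remote-sensor side-sway rate
    settle_time: float       # h_j, stabilization time after side sway
    storage_per_unit: float  # alpha_j, storage per unit imaging time
    max_storage: float       # M_j
    max_side_sways: int      # R_j

# x[i][j][k] = 1 when task T_i is imaged by satellite S_j in its k-th window O_ijk.
```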
7. The method of claim 6, wherein in step 8, the earth satellite observations satisfy the following constraints:
Observation task uniqueness constraint: an observation task is observed by an observation satellite at most once and cannot be interrupted, expressed as follows:
∑_{j=1}^{N_S} ∑_{k=1}^{N_ij} x_ijk ≤ 1, ∀ T_i ∈ T
Constraint on transitions between satellite observation activities: enough time must be left between two consecutive imaging activities of an observation satellite to allow the satellite-borne remote sensor to perform attitude conversion, which comprises the side-sway rotation time |g_ikj − g_i'jk'| and the stabilization time after side sway h_j, expressed as follows:
we_ijk + |g_ikj − g_i'jk'| + h_j ≤ ws_i'jk'
for all observation tasks T_i, T_i' and visible time windows O_ijk, O_i'jk' of the same satellite S_j such that x_ijk·x_i'jk' = 1 and we_ijk ≤ ws_i'jk',
wherein g_ikj and g_i'jk' respectively represent the starting time and the ending time of the attitude conversion;
Satellite memory capacity constraint: the satellite-borne memory has limited capacity, and the data acquired by satellite imaging, determined by the imaging time d_i, cannot exceed the memory capacity limit M_j, expressed as follows:
∑_{i=1}^{N_T} ∑_{k=1}^{N_ij} α_j·d_i·x_ijk ≤ M_j, ∀ S_j ∈ S
wherein α_j is the storage space required by satellite S_j per unit imaging time;
Satellite side-sway count constraint: limited by satellite resources and maneuverability, the satellite can only perform a limited number R_j of side sways, expressed as follows:
∑_{i=1}^{N_T} ∑_{k=1}^{N_ij} x_ijk ≤ R_j, ∀ S_j ∈ S
For a satellite observation task, the decision action of satellite resource scheduling is converted into whether a satellite observation resource accepts the current observation task at time t, described by the variable a_i:
a_i = 1 if the i-th satellite observation resource accepts the current observation task at time t, and a_i = 0 otherwise;
the scheduling strategy π over all satellite observation resources is described as π = (a_1, a_2, …, a_i, …, a_{N_S}), where 1 ≤ i ≤ N_S.
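A minimal sketch of checking the memory-capacity and side-sway-count constraints for one satellite, under the simplifying assumption that each accepted observation consumes one side sway; names and the assumption are illustrative:

```python
def satellite_constraints_ok(assigned_imaging_times, alpha_j, max_storage_j, max_sways_j):
    """Check the memory-capacity and side-sway-count constraints of claim 7
    for one satellite, given the imaging times d_i of the tasks assigned to it."""
    storage_used = sum(alpha_j * d_i for d_i in assigned_imaging_times)
    sways_used = len(assigned_imaging_times)      # one side sway assumed per observation
    return storage_used <= max_storage_j and sways_used <= max_sways_j

# alpha_j = 2.0, M_j = 100, R_j = 5, three assigned tasks -> storage 61.0, 3 sways -> True
print(satellite_constraints_ok([10.0, 8.5, 12.0], 2.0, 100.0, 5))
```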
8. The method of claim 7, wherein in step 8, for the reward return: the satellite observation scheduling performance comprehensive evaluation index is constructed by comprehensively considering three aspects of observation task completion degree, observation target priority and satellite observation resource utilization rate, wherein:
Sub-goal 1: maximizing the priority of the observed targets, y_1:
[Formula image FDA0003781737200000054: the expression for y_1 in terms of the weight parameter a and the target priorities p_i.]
wherein a is a weight parameter and p_i is the priority of target i;
Sub-goal 2: maximizing the number of target observations, y_2:
[Formula image FDA0003781737200000055: the expression for y_2 in terms of the weight parameter b and the max function.]
wherein b is a weight parameter and max is the maximum function;
Sub-goal 3: minimizing resource consumption, y_3:
[Formula image FDA0003781737200000061: the expression for y_3 in terms of the weight parameter c and the resource consumption C_i.]
wherein C_i is the number of resources consumed by observation task i and c is a weight parameter;
The objective function of satellite resource scheduling is a multi-objective programming problem, which is converted into a single-objective programming problem by the ideal point method: first, the optimal solutions y_1*, y_2* and y_3* and the worst solutions y_1^-, y_2^- and y_3^- of the three sub-goals are solved;
then, for any scheduling scheme, the relative closeness H of the target values to the optimal solution and to the worst solution is calculated:
[Formula images FDA0003781737200000066 and FDA0003781737200000067: the calculation of the relative closeness H from y_1, y_2, y_3, the optimal and worst solutions, and the weights η, ρ and γ.]
wherein η, ρ and γ are the weights of the targets y_1, y_2 and y_3, satisfying η + ρ + γ = 1 and set according to the actual requirements of the observation task; the objective function is thus converted into maximizing the return r, expressed as follows:
[Formula image FDA0003781737200000068: the return r expressed in terms of the relative closeness.]
The instant return r in the Markov decision process model of observation satellite resource scheduling is set as:
[Formula image FDA0003781737200000069: the definition of the instant return r.]
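Because the closeness and return formulas appear only as images, the following sketch assumes a TOPSIS-style relative closeness as the scalarised reward; it illustrates the ideal point idea rather than the patent's exact formula:

```python
import numpy as np

def ideal_point_reward(y, y_best, y_worst, weights):
    """Hypothetical ideal-point scalarisation of the three sub-goals (claim 8).

    The closeness formulas appear in the patent only as images; this sketch
    assumes a TOPSIS-style relative closeness: the weighted distance to the
    worst solution divided by the sum of the distances to the best and worst.
    y, y_best, y_worst: (y1, y2, y3); weights: (eta, rho, gamma) summing to 1.
    """
    y, y_best, y_worst, w = map(np.asarray, (y, y_best, y_worst, weights))
    d_best = np.sqrt(np.sum(w * (y - y_best) ** 2))
    d_worst = np.sqrt(np.sum(w * (y - y_worst) ** 2))
    return d_worst / (d_best + d_worst + 1e-12)   # relative closeness in [0, 1]

r = ideal_point_reward([0.8, 0.6, 0.4], [1.0, 1.0, 1.0], [0.0, 0.0, 0.0], [0.5, 0.3, 0.2])
```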
9. The method according to claim 8, wherein in step 8, based on the abstraction and description of each element in the satellite observation resource scheduling problem, the method determines the visible time windows of the satellite observation task, determines whether to accept the measurement and control task, and determines the observation time window and the observation resource for performing the satellite observation task.
10. The method of claim 9, wherein in step 8, the determining the satellite observation task visible time windows comprises: based on the discretization of the observation time period and the design of the observation state, determining the set of time windows in which different satellite observation resources can possibly complete the same observation task, by judging whether the starting time and the ending time of the observation task to be allocated lie within the visible time window range of each satellite observation resource;
the determining whether to accept the measurement and control task comprises: judging whether to accept the current observation task according to the satellite visible time windows in which the observation task can be completed and the constraint conditions; if an observation task has no visible time window in which it can be completed, the observation task is judged to be temporarily infeasible;
the determining the observation time window and the observation resource for performing the satellite observation task comprises: according to the federal reinforcement learning algorithm and based on the set of visible time windows in which the observation task can be completed, obtaining through decision a visible time window uniquely corresponding to a satellite observation resource, so that the satellite observation multi-agent federal reinforcement learning algorithm determines the satellite observation resource and the observation visible time window that complete the observation task.
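A minimal sketch of the visible-time-window screening step in claim 10, assuming a task must fit entirely inside a resource's visible window; names are illustrative:

```python
def visible_windows_for_task(task_start, task_end, resource_windows):
    """Screen the candidate visible time windows for one observation task (claim 10).

    resource_windows: {resource_id: [(ws, we), ...]} visible windows of each
    satellite observation resource.  A window is kept when the task's start
    and end times both lie inside it; an empty result means no resource can
    currently complete the task, so it is judged temporarily infeasible.
    """
    candidates = {}
    for rid, windows in resource_windows.items():
        feasible = [(ws, we) for ws, we in windows if ws <= task_start and task_end <= we]
        if feasible:
            candidates[rid] = feasible
    return candidates

cands = visible_windows_for_task(3.0, 6.0, {"sat1": [(2.0, 7.0)], "sat2": [(5.0, 9.0)]})
# -> {"sat1": [(2.0, 7.0)]}
```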
CN202210931479.5A 2022-08-04 2022-08-04 Satellite resource scheduling optimization method based on federal reinforcement learning Pending CN115481779A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210931479.5A CN115481779A (en) 2022-08-04 2022-08-04 Satellite resource scheduling optimization method based on federal reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210931479.5A CN115481779A (en) 2022-08-04 2022-08-04 Satellite resource scheduling optimization method based on federal reinforcement learning

Publications (1)

Publication Number Publication Date
CN115481779A true CN115481779A (en) 2022-12-16

Family

ID=84422135

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210931479.5A Pending CN115481779A (en) 2022-08-04 2022-08-04 Satellite resource scheduling optimization method based on federal reinforcement learning

Country Status (1)

Country Link
CN (1) CN115481779A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116302448A (en) * 2023-05-12 2023-06-23 中国科学技术大学先进技术研究院 Task scheduling method and system
CN116302448B (en) * 2023-05-12 2023-08-11 中国科学技术大学先进技术研究院 Task scheduling method and system
CN116739323A (en) * 2023-08-16 2023-09-12 北京航天晨信科技有限责任公司 Intelligent evaluation method and system for emergency resource scheduling
CN116739323B (en) * 2023-08-16 2023-11-10 北京航天晨信科技有限责任公司 Intelligent evaluation method and system for emergency resource scheduling

Similar Documents

Publication Publication Date Title
Lei et al. Deep reinforcement learning for autonomous internet of things: Model, applications and challenges
CN115481779A (en) Satellite resource scheduling optimization method based on federal reinforcement learning
CN111667513B (en) Unmanned aerial vehicle maneuvering target tracking method based on DDPG transfer learning
He et al. A generic Markov decision process model and reinforcement learning method for scheduling agile earth observation satellites
Li et al. Minimizing packet expiration loss with path planning in UAV-assisted data sensing
CN109884897B (en) Unmanned aerial vehicle task matching and calculation migration method based on deep reinforcement learning
CN112465151A (en) Multi-agent federal cooperation method based on deep reinforcement learning
Zhang et al. Ship motion attitude prediction based on an adaptive dynamic particle swarm optimization algorithm and bidirectional LSTM neural network
CN112937564A (en) Lane change decision model generation method and unmanned vehicle lane change decision method and device
CN109496305A (en) Nash equilibrium strategy on continuous action space and social network public opinion evolution model
Berkenkamp Safe exploration in reinforcement learning: Theory and applications in robotics
CN114415735B (en) Dynamic environment-oriented multi-unmanned aerial vehicle distributed intelligent task allocation method
CN114510012A (en) Unmanned cluster evolution system and method based on meta-action sequence reinforcement learning
CN114896899A (en) Multi-agent distributed decision method and system based on information interaction
CN112180730B (en) Hierarchical optimal consistency control method and device for multi-agent system
Schepers et al. Autonomous building control using offline reinforcement learning
Abed-Alguni Cooperative reinforcement learning for independent learners
CN113867934A (en) Multi-node task unloading scheduling method assisted by unmanned aerial vehicle
WO2024066675A1 (en) Multi-agent multi-task hierarchical continuous control method based on temporal equilibrium analysis
CN117828286A (en) Multi-agent countermeasure decision-making method and device based on deep reinforcement learning
CN116307331B (en) Aircraft trajectory planning method
CN115963724A (en) Unmanned aerial vehicle cluster task allocation method based on crowd-sourcing-inspired alliance game
Han et al. Ensemblefollower: A hybrid car-following framework based on reinforcement learning and hierarchical planning
CN114757101A (en) Single-satellite autonomous task scheduling method and system for non-time-sensitive moving target tracking
CN113469369A (en) Method for relieving catastrophic forgetting for multitask reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination