CN115481779A - Satellite resource scheduling optimization method based on federated reinforcement learning

Satellite resource scheduling optimization method based on federated reinforcement learning

Info

Publication number
CN115481779A
Authority
CN
China
Prior art keywords
observation
satellite
task
model
reinforcement learning
Prior art date
Legal status
Pending
Application number
CN202210931479.5A
Other languages
Chinese (zh)
Inventor
陈华洋
王冠
段然
钱浩煜
刘聪
吴逸汀
邢清雄
Current Assignee
CETC 28 Research Institute
Original Assignee
CETC 28 Research Institute
Priority date
Filing date
Publication date
Application filed by CETC 28 Research Institute
Priority to CN202210931479.5A
Publication of CN115481779A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/08 Learning methods
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0631 Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G06Q10/06315 Needs-based resource requirements planning or analysis
    • G06Q10/06316 Sequencing of tasks or work

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Strategic Management (AREA)
  • Economics (AREA)
  • General Physics & Mathematics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Tourism & Hospitality (AREA)
  • Operations Research (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Development Economics (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Game Theory and Decision Science (AREA)
  • Health & Medical Sciences (AREA)
  • Educational Administration (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Radio Relay Systems (AREA)

Abstract

The invention provides a satellite resource scheduling optimization method based on federated reinforcement learning, which abstracts the earth observation satellite resource scheduling optimization problem into a discrete Markov decision problem and applies a federated reinforcement learning algorithm to solve for the optimal earth observation satellite resource scheduling solution.

Description

Satellite resource scheduling optimization method based on federated reinforcement learning
Technical Field
The invention relates to the technical field of earth observation satellite resource planning, and in particular to a satellite resource scheduling optimization method based on federated reinforcement learning.
Background
The earth observation satellite resource scheduling optimization problem is a complex combinatorial optimization problem with time window constraints and resource constraints. The characteristics of the various satellite resources and observation tasks, and the many constraint relations among them, must be considered comprehensively. Taking the maximum satellite resource utilization rate or task completion rate as the scheduling optimization objective, and fully accounting for the characteristics of the satellite resources, the task characteristics and the constraints between tasks and resources, a scheduling plan for the satellite observation resources is reasonably arranged in combination with the task objectives, and an optimal satellite earth observation scheduling scheme is generated;
the traditional method for solving the problems is based on a programming problem with constraints, a heuristic algorithm or a meta-heuristic algorithm and a machine learning algorithm are utilized, an experience rule is adopted to seek an optimal and the latest satellite resource scheduling scheme within an acceptable time range, but the intelligent algorithm is relatively dependent on the experience rule, and the design of the experience rule needs a large amount of professional knowledge and rich industrial experience as a basis, so that the difficulty is high, and the construction cost is high.
Disclosure of Invention
Purpose of the invention: to address the deficiencies of the prior art, the invention provides a satellite resource scheduling optimization method based on federated reinforcement learning, which effectively overcomes the problems of the traditional methods described in the background art, namely their heavy reliance on empirical rules whose design requires a large amount of professional knowledge and rich industry experience, making them difficult and costly to construct.
The invention specifically provides a satellite resource scheduling optimization method based on federated reinforcement learning, in which the earth observation satellite resource scheduling optimization problem is abstracted into a discrete Markov decision problem and a federated reinforcement learning algorithm is applied to solve for the optimal earth observation satellite resource scheduling solution; the invention specifically comprises the following steps:
step 1, establishing a reinforcement learning DQN model for each agent in the federated reinforcement learning algorithm, and setting the state space of the agent in the environment, the action space over which the agent can make decisions, and the action reward given by the environment to the agent;
step 2, establishing a corresponding neural network for each agent according to the reinforcement learning DQN algorithm, and obtaining an approximate value function by using a target neural network (Target-Q);
step 3, the agent decides the next action to take using an ε-greedy strategy according to the assigned target and its own current state, interacts with the environment to obtain its next state, stores the decision experience in a replay memory unit, and updates the target neural network (Target-Q) model parameters in time according to the gradient of the error function;
step 4, after the loop reaches a set number of iterations, transmitting the local target neural network model parameters to a DQN model used for parameter aggregation, recording this DQN model as the joint virtual model, for subsequent federated learning;
step 5, the joint virtual model aggregates the parameters uploaded by all the agents and returns the corresponding aggregation result to each agent for parameter updating;
step 6, each agent performs a soft update between the received aggregation result and its target neural network (Target-Q) model parameters to obtain the latest local reinforcement learning model parameters;
step 7, repeating steps 3 to 6 until the target task is completed, obtaining the optimal reinforcement learning model parameters;
step 8, constructing an enhanced reinforcement learning model (E-Agent DQN) from the optimal reinforcement learning model parameters to obtain the optimal satellite resource scheduling scheme;
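Read as a whole, steps 1 to 8 describe one federated training loop: local DQN training with ε-greedy exploration and a replay memory, periodic upload of local parameters, aggregation by the joint virtual model, and a soft update of each local model. The following minimal Python sketch is purely illustrative and not taken from the patent: the toy environment, the tabular Q-function standing in for the per-agent DQN/Target-Q networks, and all names (ToyEnv, FederatedAgent, aggregate) are assumptions made for the example.

```python
# Minimal sketch of the federated training loop described in steps 1-8.
# Everything here is an illustrative assumption: the toy environment and the
# tabular Q approximation stand in for the real DQN networks.
import random
from collections import deque

import numpy as np

N_STATES, N_ACTIONS = 8, 4          # toy discretised state / action spaces
GAMMA, LR, EPS, TAU = 0.95, 0.1, 0.1, 0.5


class ToyEnv:
    """Stand-in for the satellite-scheduling environment (assumption)."""
    def reset(self):
        self.s = 0
        return self.s

    def step(self, a):
        self.s = (self.s + a + 1) % N_STATES
        reward = 1.0 if self.s == N_STATES - 1 else 0.0
        return self.s, reward, self.s == N_STATES - 1


class FederatedAgent:
    """One local agent: the Q table plays the role of the DQN / Target-Q pair."""
    def __init__(self):
        self.q = np.zeros((N_STATES, N_ACTIONS))        # online parameters
        self.target_q = np.zeros_like(self.q)           # Target-Q parameters
        self.memory = deque(maxlen=1000)                 # replay memory unit
        self.env = ToyEnv()

    def act(self, s):
        # epsilon-greedy decision (step 3)
        if random.random() < EPS:
            return random.randrange(N_ACTIONS)
        return int(np.argmax(self.q[s]))

    def local_round(self, steps=200):
        s = self.env.reset()
        for _ in range(steps):
            a = self.act(s)
            s2, r, done = self.env.step(a)
            self.memory.append((s, a, r, s2, done))
            # sampled TD update standing in for the DQN gradient step
            bs, ba, br, bs2, bdone = random.choice(self.memory)
            td_target = br + (0 if bdone else GAMMA * self.target_q[bs2].max())
            self.q[bs, ba] += LR * (td_target - self.q[bs, ba])
            self.target_q = 0.99 * self.target_q + 0.01 * self.q  # track online net
            s = self.env.reset() if done else s2


def aggregate(params):
    """Joint virtual model: plain parameter averaging (step 5)."""
    return np.mean(params, axis=0)


agents = [FederatedAgent() for _ in range(4)]
for round_no in range(20):                 # repeat steps 3-6 (step 7)
    for ag in agents:
        ag.local_round()                   # local training and upload (steps 3-4)
    theta_avg = aggregate([ag.target_q for ag in agents])
    for ag in agents:                      # soft update (step 6)
        ag.target_q = (1 - TAU) * ag.target_q + TAU * theta_avg
print("trained; greedy value of state 0:", agents[0].q[0].max())
```

A real implementation would replace the tabular Q-function with the per-agent deep Q-networks and the error-gradient update described in step 3.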
In step 5, the parameter sample of the deep reinforcement learning DQN model uploaded by the i-th agent is denoted θ_i (for example memory capacity N, initial weights ω, and the like); at the same time, a deep reinforcement learning DQN model used for fusion learning is constructed and recorded as the joint virtual agent, whose parameter sample set is Θ = {θ_i | 1 ≤ i ≤ N}. The centre point θ_avg of the samples is obtained by computing the average value:
θ_avg = (1/N) · Σ_{i=1}^{N} θ_i
θ_avg is the aggregation result returned by the joint virtual model.
In step 6, after an agent receives the aggregation result θ_avg returned by the joint virtual model, the local deep reinforcement learning DQN model is updated in a soft-update manner, i.e. θ_avg is blended into the model parameter sample θ_i with proportion τ. The updated neural network parameters θ′_i of the deep reinforcement learning DQN model are:
θ′_i = (1 - τ)·θ_i + τ·θ_avg
This completes one round of federated learning, where τ ∈ [0, 1]. When τ = 0, the parameters are not updated and θ_avg is not fused into the local deep reinforcement learning DQN model; when τ = 1, the local deep reinforcement learning DQN model directly copies the update parameter θ_avg.
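The aggregation and soft-update formulas above can be checked with a tiny numeric example; the parameter vectors below are invented three-dimensional values used only to illustrate the computation.

```python
# Tiny numeric illustration (not from the patent) of the averaging and
# soft-update formulas above, with made-up 3-dimensional parameter vectors.
import numpy as np

thetas = [np.array([1.0, 2.0, 3.0]),      # theta_1 ... theta_3 (assumed values)
          np.array([2.0, 0.0, 4.0]),
          np.array([3.0, 1.0, 2.0])]
theta_avg = np.mean(thetas, axis=0)        # centre point returned by the joint virtual model

tau = 0.3                                  # soft-update proportion in [0, 1]
theta_new = [(1 - tau) * th + tau * theta_avg for th in thetas]

assert np.allclose((1 - 0.0) * thetas[0] + 0.0 * theta_avg, thetas[0])   # tau = 0: no update
assert np.allclose((1 - 1.0) * thetas[0] + 1.0 * theta_avg, theta_avg)   # tau = 1: copy theta_avg
print(theta_avg, theta_new[0])
```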
The parameter update formulas of the joint virtual agent are as follows:
[equation images BDA0003781737210000031 and BDA0003781737210000032 in the original publication]
where the superscript v denotes the joint virtual agent, θ_t(v) are the neural network parameters of the joint virtual agent's deep reinforcement learning DQN model at time t, θ_t(i) are the neural network training parameters of the i-th agent's deep reinforcement learning DQN model at time t, v_t(v) is the parameter change value of the v-th joint virtual agent's deep reinforcement learning DQN model, l_t is the learning rate, N_t is the number of active agents at time t, Loss(·) is the loss function, and ρ is the system weight, generally taken as 0.5.
In step 8, modeling the satellite resource scheduling optimization problem by using a Markov Decision Process (MDP), wherein three elements forming the Markov Decision Process are an environment state s, a Decision action a and a reward return r respectively;
the decision process is to select corresponding action to make decision according to the strategy based on the current state, obtain corresponding decision return, and describe the expected reward return of the whole Markov decision process by using a Q value function;
in the process of solving Markov decision by reinforcement learning optimization, the intelligent agent selects a corresponding decision action a according to a certain strategy in an environment state s, and the decision action a acts on the interactive external environment of the intelligent agent, so that the environment state s is correspondingly changed, and a corresponding reward return r is obtained, wherein the goal is to obtain the strategy of optimal reward return based on the interactive process;
modeling the satellite resource scheduling optimization problem by using a Markov decision process, wherein the fact is that a random process formally describes a geosynchronous observation satellite resource scheduling application scene, and three elements of the Markov decision process are extracted, so that the Markov decision process is converted into a resource scheduling optimization model which can be described and solved by reinforcement learning;
specifically, for an earth observation satellite resource scheduling application scenario, a model including an external environment state, a decision action and a reward return evaluation index is abstracted, which is specifically as follows:
abstracting each satellite resource in the earth observation task and a state set of the observation task into a state of a Markov decision process, and recording the state as an environment state; and abstracting the satellite resource decision action variable into an action of a Markov decision process, recording the action as a decision action, and taking the satellite resource scheduling performance evaluation index as a decision return in the Markov decision process.
In step 8, the environment state of earth observation satellite resource scheduling is a description of the earth observation satellite resource scheduling application scenario, including descriptions of the attribute characteristics of the earth observation satellite resources and of the observation task characteristics; the whole environment state comprises an observation state and a task state;
when an idle time window of a satellite observation resource is visible and available for a task, the state at the corresponding observation state matrix position is set to 1, otherwise it is set to 0;
for the different satellite resources of each time window, a 0 or 1 mark is assigned according to whether the satellite resource can meet the observation requirement in the current time window, so as to determine whether the various satellite observation resources are idle relative to the observation tasks in each time window;
a 0-1 matrix is used to represent the state matrix of the satellite observation resources in a given satellite resource scheduling scenario, thereby determining the availability of the satellite observation resources at each moment relative to the observation tasks, determining the observation state matrix, and constructing the state of the satellite resources in the time dimension;
the observation state matrix and the task state matrix in the same time window are integrated to form the environment state of earth observation satellite resource scheduling in the current time window, and the designed environment state matrix S_[TaskS,TaskE] has the form:
[matrix image BDA0003781737210000041 in the original publication]
where TaskS and TaskE respectively denote the start time and end time of the current time window; in the environment state matrix S_[TaskS,TaskE], the first column is the serial number of each task, and the other columns are the corresponding task start time, task end time, task priority, imaging time, total number of observed tasks, number of targets to be observed, task observation duration requirement, …, equipment switching time, storage capacity, …;
the leading columns of the environment state matrix S_[TaskS,TaskE] are task states and the trailing columns are resource states; their values depend on the actual observation task scenario, the data above are only used to illustrate the form of the matrix, and the matrices of the different time windows form the state space of the earth observation satellite resource MDP model.
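As an illustration only (the exact column layout is shown as an image in the original and the numbers below are invented), the environment state for one time window could be assembled from a per-task state matrix and a 0-1 observation (availability) matrix like this:

```python
# Illustrative construction (an assumption, not the patent's exact layout) of the
# environment state for one scheduling time window: a per-task state matrix plus
# a 0-1 observation state matrix, concatenated column-wise.
import numpy as np

n_tasks, n_resources = 3, 2

# task state: [task id, start, end, priority, imaging time] -- invented example values
task_state = np.array([
    [1, 10.0, 40.0, 3, 5.0],
    [2, 15.0, 60.0, 1, 8.0],
    [3, 30.0, 90.0, 2, 4.0],
])

# observation state: 1 where the resource's idle time window is visible and
# available for the task in the current window, 0 otherwise (invented values)
observation_state = np.array([
    [1, 0],
    [1, 1],
    [0, 1],
])

# environment state for the window [TaskS, TaskE]: tasks as rows, task-state
# columns first, resource-availability columns after
S_window = np.hstack([task_state, observation_state])
print(S_window.shape)   # (3, 7)
```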
In step 8, for decision actions: the earth observation satellite resource scheduling problem is essentially an optimization problem with a plurality of constraints, is an NP-difficult problem, and needs to comprehensively consider the constraints of various satellite resources and task characteristics in a specific application scene and determine the reachable range of corresponding decision behaviors under the condition of meeting the constraints of various resources in the current state;
earth satellite observation task scheduling is described by a five-tuple <E, S, T, C, F>, where E is the observation period, generally defined as 24 hours; S = {S_1, S_2, …, S_{N_S}} is the set of observation satellites, with S_{N_S} denoting the N_S-th observation satellite; T = {T_1, T_2, …, T_{N_T}} is the set of observation tasks, with T_{N_T} denoting the N_T-th observation task; C is the set of constraint conditions; and F is the objective function;
an observation task can be observed and imaged by two or more satellites, and each observation satellite has several visible time windows for the observation task; the k-th visible time window of the j-th observation satellite S_j for the i-th observation task T_i is denoted O_ijk = [ws_ijk, we_ijk], where ws_ijk is the visible-window start time and we_ijk is the visible-window end time; within a specific time window the observation satellites define for task T_i a set of visible time windows O_i:
O_i = {O_ijk | 1 ≤ j ≤ N_S, 1 ≤ k ≤ N_ij}
where N_ij is the number of visible time windows of the j-th observation satellite S_j for the i-th observation task T_i, and N_S is the total number of selectable satellites;
let the priority of the i-th observation task T_i be p_i and its required imaging time be d_i; let the remote sensor side-sway rate of the j-th observation satellite S_j be r_j, its stabilization time after side sway be h_j, the storage space it requires per unit imaging time be α_j, its maximum storage capacity be M_j, and its maximum allowed side-view count be R_j; x_ijk is the decision variable, where
x_ijk = 1 if observation task T_i is observed by satellite S_j within the visible time window O_ijk, and x_ijk = 0 otherwise;
In step 8, earth satellite observation must satisfy the following constraints:
observation task uniqueness constraint: an observation task is observed by an observation satellite only once and cannot be interrupted, expressed as:
Σ_{j=1}^{N_S} Σ_{k=1}^{N_ij} x_ijk ≤ 1,  i = 1, …, N_T;
constraint on transitions between satellite observation activities: between two consecutive imaging activities of an observation satellite there must be enough time for the satellite-borne remote sensor to perform attitude conversion, which comprises the remote sensor side-sway rotation time |g_ikj - g_i'jk'| and the stabilization time h_j after the side sway; for any two observations of tasks T_i and T_i' performed consecutively by satellite S_j (x_ijk = x_i'jk' = 1, with we_ijk ≤ ws_i'jk') this is expressed as:
we_ijk + |g_ikj - g_i'jk'| + h_j ≤ ws_i'jk'
where g_ikj and g_i'jk' respectively denote the starting time and the ending time of the attitude conversion;
satellite memory capacity constraint: the capacity of the satellite-borne memory is limited, and the data acquired by satellite imaging of duration d_i cannot exceed the memory capacity limit M_j, expressed as:
Σ_{i=1}^{N_T} Σ_{k=1}^{N_ij} α_j · d_i · x_ijk ≤ M_j,  j = 1, …, N_S
where α_j is the storage space required by satellite S_j per unit imaging time;
satellite side-view count constraint: limited by satellite resources and maneuverability, a satellite can only complete a limited number R_j of side-view imaging actions, expressed as:
Σ_{i=1}^{N_T} Σ_{k=1}^{N_ij} x_ijk ≤ R_j,  j = 1, …, N_S;
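A small sketch of how a candidate assignment could be checked against these constraints follows; it is an assumption made for illustration (the dimensions, numeric values and the exact algebraic form of the side-view constraint are invented, and the transition constraint is omitted for brevity):

```python
# Sketch (assumption) of checking a candidate assignment x[i, j, k] against the
# uniqueness, memory-capacity and side-view-count constraints written above.
import numpy as np

n_tasks, n_sats, n_windows = 3, 2, 2
x = np.zeros((n_tasks, n_sats, n_windows), dtype=int)
x[0, 0, 0] = 1          # task 0 observed by satellite 0 in window 0
x[1, 1, 0] = 1
x[2, 1, 1] = 1

d = np.array([5.0, 8.0, 4.0])        # required imaging time per task
alpha = np.array([1.0, 1.5])         # storage per unit imaging time, per satellite
M = np.array([20.0, 25.0])           # memory capacity per satellite
R = np.array([2, 2])                 # maximum side-view count per satellite

uniqueness_ok = bool((x.sum(axis=(1, 2)) <= 1).all())
memory_used = np.array([(alpha[j] * d * x[:, j, :].sum(axis=1)).sum()
                        for j in range(n_sats)])
memory_ok = bool((memory_used <= M).all())
sideview_ok = bool((x.sum(axis=(0, 2)) <= R).all())

print(uniqueness_ok, memory_ok, sideview_ok)
```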
for a satellite observation task, the decision action of satellite resource scheduling is converted into a variable a_i that describes whether a satellite observation resource accepts the current observation task at a moment t:
a_i = 1 if satellite observation resource i accepts the current observation task at moment t, and a_i = 0 otherwise;
the whole satellite observation resource scheduling strategy π is described as π = {a_1, a_2, …, a_{N_S}}, with i ranging from 1 to N_S.
In step 8, for reward return: the satellite observation scheduling performance comprehensive evaluation index is constructed by comprehensively considering three aspects of observation task completion degree, observation target priority and satellite observation resource utilization rate, wherein:
sub-goals 1: maximizing the priority y of the observed target 1 Comprises the following steps:
Figure BDA0003781737210000066
wherein a is a weight parameter, p i Is the priority of target i;
sub-goals 2: maximizing the priority y of the number of target observations 2 Comprises the following steps:
Figure BDA0003781737210000067
wherein b is a weight parameter and max is a maximum function;
sub-goals 3: minimizing resource consumption y 3 Comprises the following steps:
Figure BDA0003781737210000071
wherein C i The number of resources consumed for observation of the task i, and c is a weight parameter;
the objective function of satellite resource scheduling is a multi-objective programming problem; to simplify the calculation, the ideal point method is used to convert it into a single-objective programming problem, i.e. the single-objective optimal solutions of the three sub-goals, y_1*, y_2* and y_3*, and the worst solutions of the three sub-goals, y_1^-, y_2^- and y_3^-, are solved first; then, for any scheme, the relative closeness H of the objective values to the optimal solution and to the worst solution is calculated:
[equation images BDA0003781737210000076 and BDA0003781737210000077 in the original publication]
where η, ρ and γ are the weights of the objectives y_1, y_2 and y_3 and satisfy η + ρ + γ = 1; they are set according to the actual requirements of the observation task, for example η = 0.5, ρ = 0.3 and γ = 0.2; according to the above formulas, the objective function is converted into maximizing the reward r, expressed as follows:
[equation image BDA0003781737210000078 in the original publication]
the instant return r in the Markov decision process model of observation satellite resource scheduling is set as:
[equation image BDA0003781737210000079 in the original publication]
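The closeness and reward formulas themselves appear only as images in the source, so the scalarization below is a generic ideal-point form written only to illustrate how the weights η, ρ, γ and the best and worst sub-goal values could combine into a single reward; it is an assumption, not the patent's exact formula, and all numeric values are invented:

```python
# Generic ideal-point scalarisation (assumption, not the patent's exact formula).
import numpy as np

eta, rho, gamma = 0.5, 0.3, 0.2                    # example weights, eta + rho + gamma = 1

y       = np.array([12.0, 7.0, 3.0])               # sub-objective values of a candidate schedule (invented)
y_best  = np.array([15.0, 9.0, 1.0])               # per-objective optimal solutions (invented)
y_worst = np.array([4.0, 2.0, 8.0])                # per-objective worst solutions (invented)

w = np.array([eta, rho, gamma])
d_best = np.sqrt((w * (y - y_best) ** 2).sum())    # weighted distance to the ideal point
d_worst = np.sqrt((w * (y - y_worst) ** 2).sum())  # weighted distance to the anti-ideal point

reward = d_worst / (d_best + d_worst)              # closer to the ideal point -> larger reward
print(round(float(reward), 3))
```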
In step 8, based on the abstraction and description of each element of the satellite observation resource scheduling problem, the satellite observation resource scheduling strategy, i.e. the action of the reinforcement learning network, is described specifically as: determining the visible time windows of the satellite observation task, determining whether to accept the measurement and control task, and determining the observation time window and the observation resource used to perform the satellite observation task.
In step 8, determining the visible time windows of the satellite observation task comprises: based on the discretization of the observation period and the design of the observation state, judging whether the start time and end time of the observation task to be assigned fall within the visible time window range of each satellite observation resource, thereby determining the set of time windows in which different satellite observation resources could complete the same observation task.
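A minimal sketch of this visibility test is given below; the function name, the window data and the resource names are assumptions made for the example, not part of the patent:

```python
# Minimal sketch (assumption) of the visibility test: a task's required start
# and end times must fall inside one of a resource's visible time windows.
def visible_windows(task_start, task_end, windows):
    """Return the windows [ws, we] of one satellite resource that can hold the task."""
    return [(ws, we) for ws, we in windows if ws <= task_start and task_end <= we]

# invented example: two satellite resources with their visible windows
resource_windows = {
    "sat_1": [(0.0, 30.0), (50.0, 90.0)],
    "sat_2": [(20.0, 70.0)],
}
candidate_set = {name: visible_windows(25.0, 60.0, wins)
                 for name, wins in resource_windows.items()}
print(candidate_set)   # {'sat_1': [], 'sat_2': [(20.0, 70.0)]}
```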
Determining whether to accept the measurement and control task comprises: judging whether to accept the current observation task according to the satellite visible time windows in which the observation task can be completed and the constraint conditions; if an observation task has no observation visible time window in which it can be completed, it is judged to be an observation task that temporarily cannot be completed.
Determining the observation time window and the observation resource used to perform the satellite observation task comprises: according to the federated reinforcement learning algorithm, based on the set of visible time windows in which the observation task can be completed, the visible time window uniquely corresponding to a satellite observation resource is obtained by decision; the satellite observation multi-agent federated reinforcement learning algorithm can thus determine the satellite observation resource and the observation visible time window that complete the observation task.
The method first uses federated learning and reinforcement learning techniques to mine the implicit internal association between earth observation tasks and satellite resources: each single agent autonomously learns and optimizes the model parameters of the satellite resource scheduling process, and federated learning fuses the model parameters of the individual agents on the basis of reinforcement learning to construct an earth observation satellite resource scheduling optimization model; second, a federated learning method with adaptive weights is used to fully mine the usable features generated during agent training and form a higher-quality global model.
The invention has the following beneficial effects:
1. The method provided by the invention uses federated learning and reinforcement learning techniques to mine the implicit association between earth observation tasks and satellite resources; each single agent autonomously learns and optimizes the model parameters of the satellite resource scheduling process, federated learning fuses the model parameters of the individual agents on the basis of reinforcement learning to construct an earth observation satellite resource scheduling optimization model, the various associated characteristic indexes of each agent during training are fully mined to form an efficient, high-quality global scheduling optimization model, and finally an optimal, conflict-free earth observation satellite resource scheduling optimization scheme is generated.
2. The aggregation method adopted by the invention is a federated learning method with adaptive weights: the joint virtual agent can calculate its objective function index from the local training models received from each agent; if an agent's weight update condition is reached, the contribution of that agent's model precision index to the model training precision is calculated, and a weighted average with the corresponding weights generates the global model parameters;
the joint virtual agent issues the updated global model parameters to the local agents; after receiving the parameters, each local agent retains its local model parameter characteristics and performs model training on its local training data, and after training the local model and the training precision index are uploaded to the joint virtual agent again, so that the usable features of the agent training process are fully mined, a higher-quality global model is formed, and model precision and convergence efficiency are improved.
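The following short sketch illustrates one way such precision-weighted aggregation could look; it is an assumption for illustration only (the precision indexes, parameter vectors and the exact weighting rule are invented and are not taken from the patent):

```python
# Sketch (assumption) of adaptive-weight aggregation: each agent uploads its
# parameters together with a training-precision index, and the joint virtual
# agent averages the parameters weighted by each agent's relative precision.
import numpy as np

local_params = [np.array([1.0, 2.0]), np.array([1.5, 1.0]), np.array([0.5, 3.0])]
precision = np.array([0.82, 0.91, 0.64])     # per-agent training precision index (invented)

weights = precision / precision.sum()        # relative contribution of each agent
theta_global = sum(w * p for w, p in zip(weights, local_params))

print(weights.round(3), theta_global.round(3))
```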
Drawings
The foregoing and/or other advantages of the invention will become further apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings.
FIG. 1 is a flow chart of the steps of the method of the present invention;
FIG. 2 is a flow chart of the construction of the satellite observation resource scheduling strategy of the present invention;
FIG. 3 is a schematic diagram of a multi-agent federated reinforcement learning satellite resource scheduling algorithm in accordance with the present invention.
Detailed Description
Embodiment: as shown in fig. 1, fig. 2 and fig. 3, the invention provides a technical solution, namely a satellite resource scheduling optimization method based on federated reinforcement learning, which abstracts the earth observation satellite resource scheduling optimization problem into a discrete Markov decision problem and applies a federated reinforcement learning algorithm to solve for the optimal earth observation satellite resource scheduling solution;
the imaging satellite resource scheduling method specifically comprises the following steps:
step 1, establishing a reinforcement learning DQN model for each agent, and setting the state space S of the agent in the environment, the action space A over which the agent can make decisions, and the action reward R given by the environment to the agent;
step 2, establishing a corresponding neural network for each agent according to the reinforcement learning DQN algorithm, and obtaining an approximate value function by using a target neural network (Target-Q);
step 3, the agent decides the next action to take using an ε-greedy strategy according to the assigned target and its own current state, interacts with the environment to obtain its next state, stores the decision experience in a replay memory unit, and updates the Target-Q model parameters in time according to the gradient of the error function;
step 4, after the loop reaches a set number of iterations, e.g. 100000, transmitting the local target neural network (Target-Q) model parameters to the joint virtual model for subsequent federated learning;
step 5, the joint virtual model aggregates the parameters uploaded by all the agents and returns the corresponding aggregation result to each agent for parameter updating;
step 6, each agent performs a soft update between the received aggregation result and its target neural network (Target-Q) model parameters to obtain the latest local reinforcement learning model parameters;
step 7, repeating steps 3 to 6 until the target task is completed, obtaining the optimal reinforcement learning model parameters;
step 8, finally, constructing an enhanced reinforcement learning model (E-Agent DQN) from the obtained optimal reinforcement learning model parameters to obtain the optimal satellite resource scheduling scheme;
the parameter aggregation algorithm uses federated learning to aggregate the parameters uploaded by the agents on the joint virtual model, eliminates abnormal and outlier parameters, performs a weighted average of the remaining parameters and returns the result to each agent for parameter updating; it is specifically described as follows:
the model parameter sample uploaded by the i-th agent is denoted θ_i, the model parameter sample set of the joint virtual agent is Θ = {θ_i | 1 ≤ i ≤ N}, and the centre point θ_avg of the samples is obtained by computing the average value:
θ_avg = (1/N) · Σ_{i=1}^{N} θ_i
The agent performs a soft update between the received result and its local neural network parameters, i.e. the result is blended into the local parameters in a certain proportion;
after the agent receives the updated parameter θ_avg returned by the joint virtual model, the local neural network model is updated in a soft-update manner, i.e. θ_avg is added to the local current neural model parameters θ_i with a certain proportion τ; the updated neural network parameters θ′_i are:
θ′_i = (1 - τ)·θ_i + τ·θ_avg
this completes one round of federated learning, where τ ∈ [0, 1]; when τ = 0, the parameters are not updated and θ_avg is not fused into the local model; when τ = 1, the local model directly copies the update parameter θ_avg;
according to experience, a relatively large value is set for the proportion τ at the beginning to speed up training, and the proportion is then gradually reduced during the training iterations to ensure convergence stability, so that each agent can learn from the experience of the other agents, optimize its local model, form a good cooperative effect and complete the tasks together;
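As an illustration of such a schedule (the start value, end value and decay horizon below are invented, not values given in the patent), τ could be decayed linearly over the federated rounds:

```python
# Illustrative decay schedule (assumption) for the soft-update proportion tau:
# start relatively large to speed up early training, then shrink it over the
# federated rounds for convergence stability.
def tau_schedule(round_no, tau_start=0.8, tau_end=0.05, decay_rounds=50):
    frac = min(round_no / decay_rounds, 1.0)
    return tau_start + frac * (tau_end - tau_start)

for r in (0, 10, 25, 50, 100):
    print(r, round(tau_schedule(r), 3))
# 0 -> 0.8, 10 -> 0.65, 25 -> 0.425, 50 -> 0.05, 100 -> 0.05
```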
the parameter update formulas of the joint virtual agent are:
[equation images BDA0003781737210000102 and BDA0003781737210000103 in the original publication]
where θ_t(v) are the neural network parameters of the joint virtual agent, θ_t(i) are the neural network training parameters of the i-th agent, and l_t is the learning rate.
Based on the above technical solution, the three elements constituting the Markov decision process (MDP) are the environment state (s), the decision action (a) and the reward return (r);
the decision process selects, according to the policy and based on the current state, a corresponding action to make a decision and obtains the corresponding decision return, and a Q-value function is used to describe the expected reward return of the whole Markov decision process;
in solving the MDP by reinforcement learning optimization, the agent selects a corresponding decision action (a) according to a certain policy in environment state (s); the decision action (a) acts on the external environment with which the agent interacts, so that the environment state (s) changes correspondingly and a corresponding reward return (r) is obtained; the goal is to obtain the policy with the optimal reward return based on this interaction process;
modeling the earth observation satellite resource scheduling optimization MDP model means formally describing the earth observation satellite resource scheduling application scenario as a stochastic process and extracting the three MDP elements, thereby converting it into a resource scheduling optimization model that reinforcement learning can describe and solve;
specifically, the earth observation satellite resource scheduling application scenario is abstracted into a model comprising the external environment state, the decision actions and the reward return evaluation index;
the state set of the relevant objects in the earth observation task, comprising the satellite resources and the observation tasks, is abstracted into the MDP state and recorded as the environment state; the satellite resource decision action variables are abstracted into the MDP action and recorded as the decision action; and the satellite resource scheduling performance evaluation index is taken as the decision return in the MDP.
Based on the above technical solution, the environment state: the environment state of earth observation satellite resource scheduling is a description of the earth observation satellite resource scheduling application scenario, including descriptions of the attribute characteristics of the earth observation satellite resources and of the observation task characteristics; the whole environment state consists of an observation state and a task state;
on the one hand, the satellite observation scheduling time is periodically discretized: the time-interval division scale is determined according to the specific requirements of the observation task, the scheduling time is divided into time windows of different scales, and it is further determined whether the earth observation satellite resources are visible and available relative to the observation tasks;
in a given time window, the observation state matrix of the satellite resources relative to the observation tasks is constructed according to whether a satellite resource is visible and available for a task; specifically, when an idle time window of a satellite observation resource is visible and available for a task, the state at the corresponding observation state matrix position is set to 1, otherwise it is set to 0;
for the different satellite resources of each time window, such as the satellite's payload type and memory capacity, a 0 or 1 mark is assigned according to whether the satellite resource can meet the observation requirement in the current time window, so as to determine whether the various satellite observation resources are idle relative to the observation tasks in each time window;
that is, a 0-1 matrix is used to represent the state matrix of the satellite observation resources in a given satellite resource scheduling scenario, so that the availability of the satellite observation resources at each moment relative to the observation tasks can be determined, the observation state matrix can be determined, and the state of the satellite resources in the time dimension is constructed;
on the other hand, the task state, based on the observation task dimension, mainly defines the task states of the measurement and control task in different time windows through static indexes such as the task serial number, observation task start time, observation task end time, task priority, image resolution requirement, imaging time, shortest observation time of the observation task and total observation time for completing the observation task, together with the number of observation tasks already completed, the total priority of the completed observation tasks and the total actual observation time for completing the observation tasks, thereby constructing the task state matrix;
the observation state matrix and the task state matrix in the same time window are integrated to form the environment state of earth observation satellite resource scheduling in the current time window; the designed environment state matrix S has the form:
[matrix image BDA0003781737210000121 in the original publication]
where [TaskS, TaskE] denotes the start time and end time of the current time window; the first column is the serial number of each task, and the other columns are the corresponding task start time, task end time, task priority, imaging time, total number of observed tasks, number of targets to be observed, task observation duration requirement, …, equipment switching time, storage capacity, …;
the leading part of the environment state matrix S contains task states and the trailing part contains resource states; their values depend on the actual observation task scenario, the data above are only used to illustrate the form of the matrix, and the matrices of the different time windows form the state space of the earth observation satellite resource MDP model.
Based on the above technical solution, the decision action: the earth observation satellite resource scheduling problem is essentially an optimization problem with multiple constraints and is NP-hard; the constraints of the various satellite resources and the task characteristics must be considered comprehensively in the specific application scenario, and the reachable range of the corresponding decision behaviors is determined under the condition that the various resource constraints of the current state are satisfied;
the decision action is selected according to the actual observation application scenario, i.e. suitable satellite observation resources are selected to meet the observation requirements of the corresponding task, ensuring that the current observation resources can satisfy the various constraints of the current observation task and that the various satellite resources can indeed be practically scheduled in real applications, which gives the method strong practicability;
because the earth rotates and the satellites fly around the earth, a satellite can observe a ground target only within specific time periods; the observation targets are point targets, i.e. each observation task can be completed by a single satellite in a single pass; each task has imaging time-period limits and payload type and image resolution requirements; satellite resource scheduling means selecting, among multiple satellites, those meeting the conditions to observe the observation targets, i.e. the decision action is to select the optimal satellite observation resource among the different visible time windows of the multiple satellite observation resources so as to optimize the scheduling objective function;
the earth satellite observation task scheduling problem is described by a five-tuple <E, S, T, C, F>, where E is the observation period, generally defined as 24 hours, S = {S_1, S_2, …, S_{N_S}} is the set of observation satellites, T = {T_1, T_2, …, T_{N_T}} is the set of observation tasks, C is the set of constraint conditions, and F is the objective function;
an observation task can be observed and imaged by several satellites, and each observation satellite has several visible time windows for the observation task; with O_ijk = [ws_ijk, we_ijk] denoting a visible time window, the observation satellites define for task T_i within a specific time window the set of visible time windows:
O_i = {O_ijk | 1 ≤ j ≤ N_S, 1 ≤ k ≤ N_ij}
where N_ij is the number of visible time windows of satellite S_j for task T_i, O_ijk is the k-th time window of satellite S_j for task T_i, and N_S is the total number of selectable satellites.
Let task T_i have priority p_i and required imaging time d_i; let satellite S_j have remote sensor side-sway rate r_j, stabilization time h_j after side sway, storage space α_j required per unit imaging time, maximum storage capacity M_j and maximum allowed side-view count R_j; x_ijk is the decision variable, where
x_ijk = 1 if task T_i is observed by satellite S_j within the visible time window O_ijk, and x_ijk = 0 otherwise.
based on the technical scheme, the earth satellite observation must meet the following constraints:
the uniqueness of the task is restricted, the task is observed by the satellite only once and can not be interrupted;
Figure BDA0003781737210000135
constraint of transitions between satellite observation activities: enough time must be provided between two continuous imaging activities of the satellite to ensure that the satellite-borne remote sensor performs attitude conversion, including the lateral swing rotation time of the remote sensor and the stabilization time after the lateral swing;
we ijk +|g ikj -g i'jk' |+h j ≤ws i'jk'
Figure BDA0003781737210000136
and we ijk ≤ws i'jk'
Satellite memory capacity constraints: the satellite-borne memory capacity is limited, and the data acquired by satellite imaging cannot exceed the memory capacity limit;
Figure BDA0003781737210000141
satellite side-view count constraint: limited by satellite resources and maneuverability, a satellite can only complete a limited number of side-view imaging actions;
Σ_{i=1}^{N_T} Σ_{k=1}^{N_ij} x_ijk ≤ R_j,  j = 1, …, N_S
for a satellite observation task, the decision action of satellite resource scheduling is converted into deciding, at a certain moment t, whether a satellite observation resource accepts the current observation task and which satellite observation resource accepts it;
a_i = 1 if satellite observation resource i accepts the current observation task at moment t, and a_i = 0 otherwise;
the whole satellite observation resource scheduling strategy π is described as π = {a_1, a_2, …, a_{N_S}}, with i ranging from 1 to N_S.
Based on the above technical solution, the reward return: a comprehensive evaluation index of satellite observation scheduling performance is constructed by comprehensively considering three aspects, namely observation task completion degree, observation target priority and satellite observation resource utilization, where:
sub-goal 1: maximize the priority of the observed targets:
[equation image BDA0003781737210000145 in the original publication]
sub-goal 2: maximize the priority of the number of target observations:
[equation image BDA0003781737210000146 in the original publication]
sub-goal 3: minimize the resource consumption:
[equation image BDA0003781737210000147 in the original publication]
where C_i is the number of resources consumed by observing task i and c is a weight parameter;
the objective function of satellite resource scheduling is a multi-objective programming problem; to simplify the calculation, the ideal point method is used to convert it into a single-objective programming problem, i.e. the single-objective optimal solutions y_1*, y_2* and y_3* and the worst solutions y_1^-, y_2^- and y_3^- are solved first, and then the relative closeness of the objective values to the optimal solution and to the worst solution under any scheme is calculated:
[equation images BDA0003781737210000151 and BDA0003781737210000152 in the original publication]
where η, ρ and γ are respectively the weights of the objectives y_1, y_2 and y_3, which can be set according to the actual requirements of the observation task; according to the above formulas the objective function can be converted into:
[equation image BDA0003781737210000153 in the original publication]
the immediate reward in the MDP model of observation satellite resource scheduling is set as:
[equation image BDA0003781737210000154 in the original publication]
As shown in fig. 2, based on the above technical solution, the satellite observation resource scheduling strategy, i.e. the action of the reinforcement learning network, is described, based on the abstraction and description of each element of the satellite observation resource scheduling problem, specifically as: determining the visible time windows of the satellite observation task, determining whether to accept the measurement and control task, and determining the observation time window and the observation resource used to perform the satellite observation task.
Based on the above technical solution, determining the visible time windows of the satellite observation task: based on the discretization of the observation period and the design of the observation state, it is judged whether the start time and end time of the observation task to be assigned fall within the visible time window range of each satellite observation resource, thereby determining the set of time windows in which different satellite observation resources could complete the same observation task.
Based on the above technical solution, determining whether to accept the measurement and control task: whether to accept the current observation task is judged according to the satellite visible time windows in which the observation task can be completed and the constraint conditions; if an observation task has no observation visible time window in which it can be completed, it is judged to be an observation task that temporarily cannot be completed.
Based on the above technical solution, determining the observation time window and the observation resource for performing the satellite observation task: according to the multi-agent deep reinforcement learning satellite resource scheduling algorithm, based on the set of visible time windows in which the observation task can be completed, the visible time window uniquely corresponding to a satellite observation resource can be obtained by decision; the satellite observation multi-agent deep reinforcement learning satellite resource scheduling algorithm can thus determine the satellite observation resource and the observation visible time window that complete the observation task.
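As a final illustration (the candidate pairs and Q-values below are invented placeholders, not outputs of the patent's model), this decision step can be pictured as selecting, among the feasible (resource, window) pairs, the one with the highest learned Q-value:

```python
# Sketch (assumption) of the final decision step: among the candidate
# (resource, window) pairs that can complete an observation task, pick the one
# with the highest learned Q-value. The Q-values are invented stand-ins.
candidates = [("sat_1", (50.0, 90.0)), ("sat_2", (20.0, 70.0)), ("sat_3", (25.0, 65.0))]
q_values = {("sat_1", (50.0, 90.0)): 0.41,
            ("sat_2", (20.0, 70.0)): 0.73,
            ("sat_3", (25.0, 65.0)): 0.58}

best_resource, best_window = max(candidates, key=lambda c: q_values[c])
print(best_resource, best_window)   # sat_2 (20.0, 70.0)
```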
In a specific implementation, the present application provides a computer storage medium and a corresponding data processing unit; the computer storage medium can store a computer program which, when executed by the data processing unit, can run the inventive content of the satellite resource scheduling optimization method based on federated reinforcement learning provided by the invention and all or part of the steps of its embodiments; the storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
It is obvious to those skilled in the art that the technical solutions in the embodiments of the invention can be implemented by means of a computer program and a corresponding general-purpose hardware platform; based on this understanding, the technical solutions in the embodiments of the invention may, in essence or in part, be embodied in the form of a computer program or software product, which may be stored in a storage medium and include instructions for enabling a device comprising a data processing unit (which may be a personal computer, a server, a single-chip microcomputer, an MCU or a network device) to execute the methods of the embodiments or of parts of the embodiments of the invention.
The invention provides a satellite resource scheduling optimization method based on federated reinforcement learning, and there are many methods and ways to implement this technical solution; the above is only a preferred embodiment of the invention, and it should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the invention, and these improvements and modifications should also be regarded as falling within the protection scope of the invention; all components not specified in this embodiment can be implemented with the prior art.

Claims (10)

1. A satellite resource scheduling optimization method based on federated reinforcement learning, characterized by comprising the following steps:
step 1, establishing a deep reinforcement learning DQN model for each agent in the federated reinforcement learning algorithm, and setting the state space of each agent in the environment, the action space over which the agent can make decisions, and the action reward given by the environment to the agent;
step 2, establishing a corresponding neural network for each agent according to the reinforcement learning DQN algorithm, and obtaining an approximate value function by using a target neural network;
step 3, the agent decides the next action to take using an ε-greedy strategy according to the assigned target and its own current state, interacts with the environment to obtain its next state, stores the decision experience in a replay memory unit, and updates the target neural network model parameters according to the gradient of the error function;
step 4, after the loop reaches a set number of iterations, transmitting the local target neural network model parameters to a DQN model used for parameter aggregation, recording this DQN model as the joint virtual model, for subsequent federated learning;
step 5, the joint virtual model aggregates the parameters uploaded by all the agents and returns the corresponding aggregation result to each agent for parameter updating;
step 6, each agent performs a soft update between the received aggregation result and its target neural network model parameters to obtain the latest local reinforcement learning model parameters;
step 7, repeating steps 3 to 6 until the target task is completed, obtaining the optimal reinforcement learning model parameters;
step 8, constructing an enhanced reinforcement learning model from the optimal reinforcement learning model parameters to obtain the optimal satellite resource scheduling scheme.
2. The method of claim 1, wherein in step 5, the deep reinforcement learning DQN model parameter sample uploaded by the i-th agent is denoted θ_i; meanwhile, a deep reinforcement learning DQN model for fusion learning is constructed and recorded as the joint virtual agent, whose parameter sample set is Θ = {θ_i | 1 ≤ i ≤ N}, and the centre point θ_avg of the samples is obtained by computing the average value:
θ_avg = (1/N) · Σ_{i=1}^{N} θ_i
θ_avg is the aggregation result returned by the joint virtual model.
3. The method of claim 2, wherein in step 6, after the agent receives the aggregation result θ_avg returned by the joint virtual model, the local deep reinforcement learning DQN model is updated in a soft-update manner, i.e. θ_avg is blended into the model parameter sample θ_i with proportion τ; the updated neural network parameters θ′_i of the deep reinforcement learning DQN model are:
θ′_i = (1 - τ)·θ_i + τ·θ_avg
This completes one round of federated learning, where τ ∈ [0, 1]; when τ = 0, the parameters are not updated and θ_avg is not fused into the local deep reinforcement learning DQN model; when τ = 1, the local deep reinforcement learning DQN model directly copies the update parameter θ_avg.
The parameter update formulas of the joint virtual agent are:
[equation images FDA0003781737200000021 and FDA0003781737200000022 in the original publication]
where θ_t(v) are the neural network parameters of the joint virtual agent's deep reinforcement learning DQN model, θ_t(i) are the neural network training parameters of the i-th agent's deep reinforcement learning DQN model at time t, v_t(v) is the parameter change value of the v-th joint virtual agent's deep reinforcement learning DQN model, l_t is the learning rate, N_t is the number of active agents at time t, Loss(·) is the loss function, and ρ is the system weight.
4. The method according to claim 3, wherein in step 8, the satellite resource scheduling optimization problem is modeled by a Markov decision process, and the three elements constituting the Markov decision process are an environment state s, a decision action a and an awarding return r;
the decision process is to select corresponding action to make decision based on the current state according to the strategy, obtain corresponding decision return, and use Q value function to describe the expected reward return of the whole Markov decision process;
in the process of solving Markov decision by reinforcement learning optimization, the intelligent agent selects a corresponding decision action a according to a strategy in an environment state s, and the decision action a acts on the interactive external environment of the intelligent agent, so that the environment state s is correspondingly changed, and corresponding reward return r is obtained, and the goal is to obtain the strategy of optimal reward return based on the interactive process;
modeling the satellite resource scheduling optimization problem by a Markov decision process, wherein the random process formally describes the earth observation satellite resource scheduling application scenario and the three elements of the Markov decision process are extracted, so that the satellite resource scheduling optimization problem is converted into a resource scheduling optimization model that can be described and solved by reinforcement learning;
the abstraction covers the external environment state, the decision action and the reward return evaluation index, and concretely comprises the following steps:
abstracting the state set of each satellite resource and of the observation task in the earth observation task into the state of the Markov decision process, recorded as the environment state; abstracting the satellite resource decision action variable into the action of the Markov decision process, recorded as the decision action; and taking the satellite resource scheduling performance evaluation index as the decision return of the Markov decision process.
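As an illustration of this abstraction, a Gym-style environment skeleton could be organized as follows; the class and method names are assumptions chosen for illustration only:

```python
import numpy as np

class SatelliteSchedulingEnv:
    """Illustrative Gym-style skeleton of the Markov decision process in claim 4.

    state  : the 0-1 environment state matrix of the current time window
    action : the decision action (e.g. accept or reject the current task)
    reward : the decision return derived from the scheduling evaluation index
    """

    def __init__(self, initial_state: np.ndarray):
        self.state = initial_state

    def step(self, action: int):
        # Applying the decision action would change the observation/task state
        # and yield the reward return; both are left as placeholders here.
        next_state = self.state.copy()
        reward = 0.0
        done = False
        self.state = next_state
        return next_state, reward, done
```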
5. The method according to claim 4, wherein in step 8, the environment state of earth observation satellite resource scheduling is a description of the earth observation satellite resource scheduling application scenario, including a description of the attribute characteristics of the earth observation satellite resources and of the observation task characteristics, and the overall environment state comprises an observation state and a task state;
when an idle time window of the satellite observation resource is visible and available for a task, the state corresponding to the observation state matrix position is set to 1, otherwise, the state is set to 0;
marking different satellite resources of each time window by using a number 0 or 1 according to whether the satellite resources can meet the observation requirement in the current time window, and determining whether various satellite observation resources are idle relative to the observation task in each time window;
representing a state matrix of satellite observation resources in a given satellite resource scheduling scene by using a 0-1 matrix, thereby determining the available condition of the satellite observation resources at each moment relative to an observation task, determining an observation state matrix, and constructing the state of the satellite resources on a time dimension;
integrating the observation state matrix and the task state matrix in the same time window to form the environment state of earth observation satellite resource scheduling in the current time window, and designing the environment state matrix S_[TaskS,TaskE] in the form:
[Matrix image FDA0003781737200000031: the environment state matrix S_[TaskS,TaskE].]
wherein TaskS and TaskE respectively represent the starting time and the ending time of the current time window, and the first column of the environment state matrix S_[TaskS,TaskE] is the sequence number of each task.
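A minimal sketch of how one 0-1 row of the observation state could be built from a resource's visible time windows, assuming a fixed discretization step; the function and parameter names are illustrative:

```python
import numpy as np

def observation_state_row(visible_windows, task_start, task_end, step=1.0):
    """Build one 0-1 row of the observation state for a single satellite resource.

    visible_windows: list of (ws, we) visible/idle intervals of the resource.
    A time slot in [task_start, task_end) is marked 1 when it falls inside a
    visible window (the resource is usable for the task), otherwise 0.
    """
    times = np.arange(task_start, task_end, step)
    row = np.zeros(len(times), dtype=int)
    for ws, we in visible_windows:
        row[(times >= ws) & (times < we)] = 1
    return row

# A resource visible during [2, 5) and [8, 10) over the window [0, 12):
row = observation_state_row([(2, 5), (8, 10)], 0, 12)   # -> [0 0 1 1 1 0 0 0 1 1 0 0]
```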
6. The method of claim 5, wherein in step 8, for the decision action: the earth satellite observation task scheduling problem is described by a five-tuple <E, S, T, C, F>, where E is the observation period, S = {S_1, S_2, …, S_{N_S}} is the set of observation satellites, S_{N_S} denotes the N_S-th observation satellite, T = {T_1, T_2, …, T_{N_T}} is the set of observation tasks, T_{N_T} denotes the N_T-th observation task, C is the set of constraint conditions, and F is the objective function;
One observation task can be observed and imaged by two or more satellites, and each observation satellite has visible time windows for the observation task; the k-th visible time window of the j-th observation satellite S_j for the i-th observation task T_i is recorded as O_ijk = [ws_ijk, we_ijk], where ws_ijk is the visible-window start time and we_ijk is the visible-window end time; within the given time period, the set of visible time windows in which the observation satellites can observe task T_i is O_i:
O_i = {O_ijk | 1 ≤ j ≤ N_S, 1 ≤ k ≤ N_ij}
wherein N_ij is the number of visible time windows of the j-th observation satellite S_j for the i-th observation task T_i, and N_S is the total number of selectable satellites;
let the priority of the i-th observation task T_i be p_i and its required imaging time be d_i; for the j-th observation satellite S_j, let the remote-sensor side-sway rate be r_j, the stabilization time after side sway be h_j, the storage space required per unit imaging time be α_j, the maximum storage capacity be M_j, and the maximum allowable number of side sways be R_j; x_ijk are the decision variables, wherein
x_ijk = 1 if observation task T_i is observed by satellite S_j within visible time window O_ijk, and x_ijk = 0 otherwise.
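For illustration, the elements of the five-tuple could be carried by simple data structures such as the following; the field names are assumptions chosen to mirror the symbols above:

```python
from dataclasses import dataclass

@dataclass
class VisibleWindow:
    ws: float                # visible-window start time ws_ijk
    we: float                # visible-window end time we_ijk

@dataclass
class ObservationTask:
    priority: float          # p_i
    imaging_time: float      # d_i

@dataclass
class ObservationSatellite:
    sway_rate: float         # r_j, remote-sensor side-sway rate
    settle_time: float       # h_j, stabilization time after side sway
    storage_per_unit: float  # alpha_j, storage per unit imaging time
    max_storage: float       # M_j
    max_side_sways: int      # R_j

# x[i][j][k] = 1 when task T_i is imaged by satellite S_j in its k-th window O_ijk.
```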
7. The method of claim 6, wherein in step 8, the earth satellite observations satisfy the following constraints:
Observation task uniqueness constraint: an observation task is observed by an observation satellite at most once and cannot be interrupted, expressed as follows:
∑_{j=1}^{N_S} ∑_{k=1}^{N_ij} x_ijk ≤ 1, ∀ T_i ∈ T
Constraint on transitions between satellite observation activities: enough time must be left between two consecutive imaging activities of an observation satellite to allow the satellite-borne remote sensor to perform attitude conversion, which comprises the side-sway rotation time |g_ikj − g_i'jk'| and the stabilization time after side sway h_j, expressed as follows:
we_ijk + |g_ikj − g_i'jk'| + h_j ≤ ws_i'jk'
for all observation tasks T_i, T_i' and visible time windows O_ijk, O_i'jk' of the same satellite S_j such that x_ijk·x_i'jk' = 1 and we_ijk ≤ ws_i'jk',
wherein g_ikj and g_i'jk' respectively represent the starting time and the ending time of the attitude conversion;
Satellite memory capacity constraint: the satellite-borne memory has limited capacity, and the data acquired by satellite imaging, determined by the imaging time d_i, cannot exceed the memory capacity limit M_j, expressed as follows:
∑_{i=1}^{N_T} ∑_{k=1}^{N_ij} α_j·d_i·x_ijk ≤ M_j, ∀ S_j ∈ S
wherein α_j is the storage space required by satellite S_j per unit imaging time;
Satellite side-sway count constraint: limited by satellite resources and maneuverability, the satellite can only perform a limited number R_j of side sways, expressed as follows:
∑_{i=1}^{N_T} ∑_{k=1}^{N_ij} x_ijk ≤ R_j, ∀ S_j ∈ S
For a satellite observation task, the decision action of satellite resource scheduling is converted into whether a satellite observation resource accepts the current observation task at time t, described by the variable a_i:
a_i = 1 if the i-th satellite observation resource accepts the current observation task at time t, and a_i = 0 otherwise;
the scheduling strategy π over all satellite observation resources is described as π = (a_1, a_2, …, a_i, …, a_{N_S}), where 1 ≤ i ≤ N_S.
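A minimal sketch of checking the memory-capacity and side-sway-count constraints for one satellite, under the simplifying assumption that each accepted observation consumes one side sway; names and the assumption are illustrative:

```python
def satellite_constraints_ok(assigned_imaging_times, alpha_j, max_storage_j, max_sways_j):
    """Check the memory-capacity and side-sway-count constraints of claim 7
    for one satellite, given the imaging times d_i of the tasks assigned to it."""
    storage_used = sum(alpha_j * d_i for d_i in assigned_imaging_times)
    sways_used = len(assigned_imaging_times)      # one side sway assumed per observation
    return storage_used <= max_storage_j and sways_used <= max_sways_j

# alpha_j = 2.0, M_j = 100, R_j = 5, three assigned tasks -> storage 61.0, 3 sways -> True
print(satellite_constraints_ok([10.0, 8.5, 12.0], 2.0, 100.0, 5))
```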
8. The method of claim 7, wherein in step 8, for the reward return: the satellite observation scheduling performance comprehensive evaluation index is constructed by comprehensively considering three aspects of observation task completion degree, observation target priority and satellite observation resource utilization rate, wherein:
Sub-goal 1: maximizing the priority of the observed targets, y_1:
[Formula image FDA0003781737200000054: the expression for y_1 in terms of the weight parameter a and the target priorities p_i.]
wherein a is a weight parameter and p_i is the priority of target i;
Sub-goal 2: maximizing the number of target observations, y_2:
[Formula image FDA0003781737200000055: the expression for y_2 in terms of the weight parameter b and the max function.]
wherein b is a weight parameter and max is the maximum function;
Sub-goal 3: minimizing resource consumption, y_3:
[Formula image FDA0003781737200000061: the expression for y_3 in terms of the weight parameter c and the resource consumption C_i.]
wherein C_i is the number of resources consumed by observation task i and c is a weight parameter;
The objective function of satellite resource scheduling is a multi-objective programming problem, which is converted into a single-objective programming problem by the ideal point method: first, the optimal solutions y_1*, y_2* and y_3* and the worst solutions y_1^-, y_2^- and y_3^- of the three sub-goals are solved;
then, for any scheduling scheme, the relative closeness H of the target values to the optimal solution and to the worst solution is calculated:
[Formula images FDA0003781737200000066 and FDA0003781737200000067: the calculation of the relative closeness H from y_1, y_2, y_3, the optimal and worst solutions, and the weights η, ρ and γ.]
wherein η, ρ and γ are the weights of the targets y_1, y_2 and y_3, satisfying η + ρ + γ = 1 and set according to the actual requirements of the observation task; the objective function is thus converted into maximizing the return r, expressed as follows:
[Formula image FDA0003781737200000068: the return r expressed in terms of the relative closeness.]
The instant return r in the Markov decision process model of observation satellite resource scheduling is set as:
[Formula image FDA0003781737200000069: the definition of the instant return r.]
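Because the closeness and return formulas appear only as images, the following sketch assumes a TOPSIS-style relative closeness as the scalarised reward; it illustrates the ideal point idea rather than the patent's exact formula:

```python
import numpy as np

def ideal_point_reward(y, y_best, y_worst, weights):
    """Hypothetical ideal-point scalarisation of the three sub-goals (claim 8).

    The closeness formulas appear in the patent only as images; this sketch
    assumes a TOPSIS-style relative closeness: the weighted distance to the
    worst solution divided by the sum of the distances to the best and worst.
    y, y_best, y_worst: (y1, y2, y3); weights: (eta, rho, gamma) summing to 1.
    """
    y, y_best, y_worst, w = map(np.asarray, (y, y_best, y_worst, weights))
    d_best = np.sqrt(np.sum(w * (y - y_best) ** 2))
    d_worst = np.sqrt(np.sum(w * (y - y_worst) ** 2))
    return d_worst / (d_best + d_worst + 1e-12)   # relative closeness in [0, 1]

r = ideal_point_reward([0.8, 0.6, 0.4], [1.0, 1.0, 1.0], [0.0, 0.0, 0.0], [0.5, 0.3, 0.2])
```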
9. The method according to claim 8, wherein in step 8, based on the abstraction and description of each element in the satellite observation resource scheduling problem, the method determines the visible time windows of the satellite observation task, determines whether to accept the measurement and control task, and determines the observation time window and the observation resource for performing the satellite observation task.
10. The method of claim 9, wherein in step 8, the determining the satellite observation task visible time windows comprises: based on the discretization of the observation time period and the design of the observation state, determining the set of time windows in which different satellite observation resources can possibly complete the same observation task, by judging whether the starting time and the ending time of the observation task to be allocated lie within the visible time window range of each satellite observation resource;
the determining whether to accept the measurement and control task comprises: judging whether to accept the current observation task according to the satellite visible time windows in which the observation task can be completed and the constraint conditions; if an observation task has no visible time window in which it can be completed, the observation task is judged to be temporarily infeasible;
the determining the observation time window and the observation resource for performing the satellite observation task comprises: according to the federal reinforcement learning algorithm and based on the set of visible time windows in which the observation task can be completed, obtaining through decision a visible time window uniquely corresponding to a satellite observation resource, so that the satellite observation multi-agent federal reinforcement learning algorithm determines the satellite observation resource and the observation visible time window that complete the observation task.
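A minimal sketch of the visible-time-window screening step in claim 10, assuming a task must fit entirely inside a resource's visible window; names are illustrative:

```python
def visible_windows_for_task(task_start, task_end, resource_windows):
    """Screen the candidate visible time windows for one observation task (claim 10).

    resource_windows: {resource_id: [(ws, we), ...]} visible windows of each
    satellite observation resource.  A window is kept when the task's start
    and end times both lie inside it; an empty result means no resource can
    currently complete the task, so it is judged temporarily infeasible.
    """
    candidates = {}
    for rid, windows in resource_windows.items():
        feasible = [(ws, we) for ws, we in windows if ws <= task_start and task_end <= we]
        if feasible:
            candidates[rid] = feasible
    return candidates

cands = visible_windows_for_task(3.0, 6.0, {"sat1": [(2.0, 7.0)], "sat2": [(5.0, 9.0)]})
# -> {"sat1": [(2.0, 7.0)]}
```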
CN202210931479.5A 2022-08-04 2022-08-04 Satellite resource scheduling optimization method based on federal reinforcement learning Pending CN115481779A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210931479.5A CN115481779A (en) 2022-08-04 2022-08-04 Satellite resource scheduling optimization method based on federal reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210931479.5A CN115481779A (en) 2022-08-04 2022-08-04 Satellite resource scheduling optimization method based on federal reinforcement learning

Publications (1)

Publication Number Publication Date
CN115481779A true CN115481779A (en) 2022-12-16

Family

ID=84422135

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210931479.5A Pending CN115481779A (en) 2022-08-04 2022-08-04 Satellite resource scheduling optimization method based on federal reinforcement learning

Country Status (1)

Country Link
CN (1) CN115481779A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116302448A (en) * 2023-05-12 2023-06-23 中国科学技术大学先进技术研究院 Task scheduling method and system
CN116302448B (en) * 2023-05-12 2023-08-11 中国科学技术大学先进技术研究院 Task scheduling method and system
CN116739323A (en) * 2023-08-16 2023-09-12 北京航天晨信科技有限责任公司 Intelligent evaluation method and system for emergency resource scheduling
CN116739323B (en) * 2023-08-16 2023-11-10 北京航天晨信科技有限责任公司 Intelligent evaluation method and system for emergency resource scheduling

Similar Documents

Publication Publication Date Title
Lei et al. Deep reinforcement learning for autonomous internet of things: Model, applications and challenges
CN115481779A (en) Satellite resource scheduling optimization method based on federal reinforcement learning
CN111667513B (en) Unmanned aerial vehicle maneuvering target tracking method based on DDPG transfer learning
He et al. A generic Markov decision process model and reinforcement learning method for scheduling agile earth observation satellites
Li et al. Minimizing packet expiration loss with path planning in UAV-assisted data sensing
CN109884897B (en) Unmanned aerial vehicle task matching and calculation migration method based on deep reinforcement learning
CN112465151A (en) Multi-agent federal cooperation method based on deep reinforcement learning
Zhang et al. Ship motion attitude prediction based on an adaptive dynamic particle swarm optimization algorithm and bidirectional LSTM neural network
CN112937564A (en) Lane change decision model generation method and unmanned vehicle lane change decision method and device
CN109496305A (en) Nash equilibrium strategy on continuous action space and social network public opinion evolution model
Berkenkamp Safe exploration in reinforcement learning: Theory and applications in robotics
CN114415735B (en) Dynamic environment-oriented multi-unmanned aerial vehicle distributed intelligent task allocation method
CN114510012A (en) Unmanned cluster evolution system and method based on meta-action sequence reinforcement learning
CN114896899A (en) Multi-agent distributed decision method and system based on information interaction
CN112180730B (en) Hierarchical optimal consistency control method and device for multi-agent system
Schepers et al. Autonomous building control using offline reinforcement learning
Abed-Alguni Cooperative reinforcement learning for independent learners
CN113867934A (en) Multi-node task unloading scheduling method assisted by unmanned aerial vehicle
WO2024066675A1 (en) Multi-agent multi-task hierarchical continuous control method based on temporal equilibrium analysis
CN117828286A (en) Multi-agent countermeasure decision-making method and device based on deep reinforcement learning
CN116307331B (en) Aircraft trajectory planning method
CN115963724A (en) Unmanned aerial vehicle cluster task allocation method based on crowd-sourcing-inspired alliance game
Han et al. Ensemblefollower: A hybrid car-following framework based on reinforcement learning and hierarchical planning
CN114757101A (en) Single-satellite autonomous task scheduling method and system for non-time-sensitive moving target tracking
CN113469369A (en) Method for relieving catastrophic forgetting for multitask reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination