CN117314049A - Satellite network intelligent resource scheduling method based on reinforcement learning - Google Patents

Satellite network intelligent resource scheduling method based on reinforcement learning

Info

Publication number
CN117314049A
CN117314049A (application CN202311121709.2A)
Authority
CN
China
Prior art keywords
state
satellite
action
reinforcement learning
reward
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311121709.2A
Other languages
Chinese (zh)
Inventor
王守斌
朱皓俊
王士成
孙康
马万权
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 54 Research Institute
Original Assignee
CETC 54 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 54 Research Institute filed Critical CETC 54 Research Institute
Priority to CN202311121709.2A priority Critical patent/CN117314049A/en
Publication of CN117314049A publication Critical patent/CN117314049A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0631Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G06Q10/06312Adjustment or analysis of established resource schedule, e.g. resource or task levelling, or dynamic rescheduling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0499Feedforward networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/092Reinforcement learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/04Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/70Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Strategic Management (AREA)
  • Economics (AREA)
  • General Physics & Mathematics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Software Systems (AREA)
  • Development Economics (AREA)
  • Health & Medical Sciences (AREA)
  • Game Theory and Decision Science (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Operations Research (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Educational Administration (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a satellite network intelligent resource scheduling method based on reinforcement learning, belonging to the technical field of satellite network resource scheduling. The method comprises the following steps: collecting satellite resource information and on-board task state information; determining the constraint conditions, states, actions and reward value function of the algorithm from the collected information; sending the collected information to a reinforcement learning action module to perform satellite and link selection and reward value calculation; and forwarding the task by the satellite server according to the action result. The method comprehensively considers the bandwidth and duration required by on-board tasks, the idle computing capacity and storage resources of each satellite, and the available bandwidth of inter-satellite links; it realizes intelligent scheduling of ground-user tasks or cloud data center tasks among satellites, reduces inter-satellite forwarding delay, and improves the communication performance of the on-board cloud data center.

Description

Satellite network intelligent resource scheduling method based on reinforcement learning
Technical Field
The invention belongs to the technical field of satellite network resource scheduling, and particularly relates to a satellite network intelligent resource scheduling method based on reinforcement learning.
Background
With the continued development of new-generation medium- and low-orbit satellite constellation technology, satellite networks built from such constellations can effectively compensate for the insufficient service capability of terrestrial networks. Compared with traditional terrestrial networks, satellite networks offer global coverage, deployment unconstrained by region or terrain, and strong system survivability. As a result, more and more users obtain Internet service through satellite networks. However, given the high complexity, high dynamics and resource scarcity of medium- and low-orbit satellite networks, how to achieve efficient and intelligent scheduling of limited resources across different services has become a key challenge.
At present, the more classical on-board resource scheduling algorithms include first-in first-out, shortest job first, priority queues, weighted fair queuing and the like. In complex, resource-scarce environments, however, the resource-allocation computation these conventional algorithms perform for user service scheduling occupies a large share of the total resource overhead, so combining them with more efficient AI techniques becomes a feasible approach. For example, a genetic algorithm can select tasks with different resource demands for scheduling based on dynamically updated information, compute a fitness value from the total task weight and the task completion time, and then search the candidate scheduling schemes for an optimal solution. Alternatively, the Hungarian algorithm can compute the resource allocation that maximizes QoE, improving user experience quality while maintaining resource utilization. Such algorithms improve on the shortcomings of the classical algorithms, but still fall short for scheduling specific bandwidth service resources, multi-satellite joint scheduling and special traffic requirements. For these problems, reinforcement learning, which has risen in recent years, offers good intelligent solving capability for complex optimization and decision problems. Facing the many challenges that the characteristics of medium- and low-orbit satellite networks (multi-satellite cooperation, full-time-space interconnection, massive numbers of users) impose on resource scheduling, namely high complexity, high dynamics, scarce resources and the large amount of information required, intelligent decisions made by a reinforcement learning algorithm can realize the most reasonable resource scheduling.
Disclosure of Invention
To address the above technical problems, the invention provides a satellite network intelligent resource scheduling method based on reinforcement learning, which both adapts to complex, dynamic satellite network systems and reduces the delay of resource scheduling.
In order to achieve the above purpose, the invention adopts the following technical scheme:
a satellite network intelligent resource scheduling method based on reinforcement learning comprises the following steps:
step 1: collecting satellite resource information and on-board task state information;
step 2: determining the constraint conditions, states, actions and reward value function according to the collected information;
step 3: transmitting the collected information to a reinforcement learning action module to execute satellite and link selection and reward value function calculation;
step 4: and the satellite server forwards the task according to the action result.
Further, the specific manner of step 1 is as follows:
step 1a, a satellite resource sensing module is placed on each satellite to acquire satellite resource state information, including the satellite number SN_i, the available computing resources SCPU_i and the available memory resources SM_i;
step 1b, a task information collection module is deployed on each satellite to collect task requests from ground users, or subtask request information that needs on-board processing after decomposition by the ground core cloud node, including the bandwidth WBW_k required by each subtask and the task duration τ;
step 1c, designing a visual constraint condition FLAG of a source satellite node and a destination satellite node:
FLAG=α*β*δ
where α, β and δ are respectively the pitch angle and azimuth angle of the link terminal and the visible angle of the two-satellite link under Earth occlusion; α_min and α_max respectively denote the minimum and maximum pitch angles of the link terminal; β_min and β_max respectively denote the minimum and maximum azimuth angles of the link terminal; δ_max is the maximum visible angle for establishing the two-satellite link under Earth occlusion.
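As an illustration, the following is a minimal sketch of one plausible way to evaluate the visibility constraint FLAG, treating each factor of the product as a 0/1 indicator that the corresponding angle lies in its allowed range. The function name and the numeric thresholds in the example call are assumptions for illustration only, not values from the patent.

```python
# Hedged sketch: evaluate FLAG = alpha_ok * beta_ok * delta_ok as a product of indicators.
def link_visible(alpha, beta, delta,
                 alpha_min, alpha_max,
                 beta_min, beta_max,
                 delta_max):
    """Return 1 if the source-destination inter-satellite link can be established.

    alpha: pitch angle of the link terminal
    beta:  azimuth angle of the link terminal
    delta: visible angle of the two-satellite link under Earth occlusion
    """
    alpha_ok = 1 if alpha_min <= alpha <= alpha_max else 0
    beta_ok = 1 if beta_min <= beta <= beta_max else 0
    delta_ok = 1 if delta <= delta_max else 0
    return alpha_ok * beta_ok * delta_ok  # FLAG

# Example with made-up angles: all three indicators are 1, so FLAG = 1.
FLAG = link_visible(30.0, 120.0, 50.0, 10.0, 60.0, 0.0, 180.0, 70.0)
```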
Further, the specific manner of step 2 is as follows:
step 2a, based on the Actor-Critic (AC) algorithm, multiple agents interact with their own environments, each agent holding a copy of the environment; an actor-critic-based Asynchronous Advantage Actor-Critic (A3C) agent is constructed, in which the actor generates actions and the critic evaluates and criticizes the current policy by processing the rewards obtained from the environment, using the TD-error method;
step 2b, constructing a network state set at the time t as follows:
S_n = {SLA, SBW_OD, SCPU_D, SM_D, WBW_k, τ};
where SLA is the link angle and SBW_OD is the available bandwidth resource of the link between the source and destination satellite nodes;
step 2c, constructing an action set A = {N_i}, which contains all actions selectable by the decision algorithm;
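For illustration, the sketch below shows one way the state vector S_n of step 2b and the action set A of step 2c could be encoded. The field names mirror the symbols above; the constellation size of 100 nodes is an assumption (it only echoes the 100-dimensional actor output mentioned later), and the concrete values are made up.

```python
# Hedged sketch: encode S_n and the action set A = {N_i}.
from dataclasses import dataclass, astuple

@dataclass
class NetworkState:
    sla: float      # SLA: inter-satellite link angle
    sbw_od: float   # SBW_OD: available bandwidth of the source-destination link
    scpu_d: float   # SCPU_D: idle computing resources of the destination satellite
    sm_d: float     # SM_D: available memory of the destination satellite
    wbw_k: float    # WBW_k: bandwidth required by subtask k
    tau: float      # τ: task duration

    def as_vector(self):
        return list(astuple(self))

NUM_SATELLITES = 100                      # assumed constellation size
action_set = list(range(NUM_SATELLITES))  # A = {N_i}: candidate destination nodes

state = NetworkState(sla=42.0, sbw_od=150.0, scpu_d=0.6, sm_d=512.0, wbw_k=20.0, tau=3.0)
```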
step 2d, constructing a behavior action strategy of the reinforcement learning system:
π(s,a)=π(a|s)=p(A=a|S=s)
where S is the state space set and A is the action space set;
step 2e, constructing the reward of the reinforcement learning system, represented by the γ-discounted reward: the γ-discounted reward is the cumulative reward obtained from the current state to future states, in which each future reward value is multiplied by a discount factor γ, a real number between 0 and 1 that expresses how strongly future rewards are weighted; γ^t means that the longer the elapsed time, the smaller the effect of that reward, expressed in detail as follows:
V(s_0) = E[ Σ_{t=0}^{∞} γ^t · R(s_t) ]
where R represents the immediate reward fed back by the environment, s_0 is the initial state of the agent, and s_t is the state of the agent at time t;
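A minimal sketch of the γ-discounted cumulative reward described in step 2e follows; the gamma value and the example rewards are illustrative only.

```python
# Hedged sketch: G = sum_t gamma**t * R_t, later rewards weighted down by gamma**t.
def discounted_return(rewards, gamma=0.9):
    """Accumulate immediate rewards R_t from the initial state onward."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Example: rewards 1, -1, 1 with gamma = 0.9 give 1 - 0.9 + 0.81 = 0.91.
g = discounted_return([1, -1, 1], gamma=0.9)
```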
step 2f: constructing state transition probability of the reinforcement learning system:
the state transition probability denotes the probability that, after action a_n is executed, the network environment changes from state S_n to state S_{n+1}, written P(S_{n+1} | S_n, a_n);
step 2g: constructing the reward function of the reinforcement learning system: after action a_n is executed, the environment feeds back a reward value r_n that measures how well the action was performed; for each step, the reward value r_n is determined by the time needed to process the entire subtask; the reward function is set as:
f = T_d + T_w + T_t + T_p
where T_d denotes the processing delay, determined by the response speed of the wavelength routing device; T_w denotes the propagation delay, obtained by dividing the inter-satellite distance by the speed of light; T_t denotes the transmission delay, obtained by dividing the task bandwidth by the communication rate; and T_p denotes the queuing delay;
let f_o be the threshold of the reward function f; if f_o ≥ f, the action is considered effective and the reward value is 1; otherwise, the action is considered invalid and the reward value is -1.
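The following is a small, hedged sketch of the delay-based reward in step 2g, using the total time f = T_d + T_w + T_t + T_p reconstructed above; the numeric values in the example call (link distance, task size, rate, threshold) are assumptions.

```python
# Hedged sketch of the threshold reward: +1 if f_o >= f, otherwise -1.
LIGHT_SPEED = 3e8  # m/s

def reward(t_d, inter_sat_distance, task_bandwidth, comm_rate, t_p, f_threshold):
    t_w = inter_sat_distance / LIGHT_SPEED   # propagation delay
    t_t = task_bandwidth / comm_rate         # transmission delay
    f = t_d + t_w + t_t + t_p                # total time to handle the subtask
    return 1 if f_threshold >= f else -1

# Example with illustrative numbers: a 2000 km link, 20 Mbit task, 100 Mbit/s rate.
r_n = reward(t_d=0.005, inter_sat_distance=2_000_000, task_bandwidth=20e6,
             comm_rate=100e6, t_p=0.01, f_threshold=0.5)
```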
Further, the specific manner of step 3 is as follows:
step 3a, the state transitions of reinforcement learning form a Markov decision process: the state of the current environment is related only to the state at the previous moment and not to earlier states, so the state value function simplifies to:
V^π(s) = Σ_a π(a|s) Σ_{s'} p(s'|s, a) [ R(s, a, s') + γ V^π(s') ]
where p is the state transition probability;
step 3b, a state-action value function is used to evaluate the quality of the selected action in a given state, namely:
Q*(s, a) = Σ_{s'} p(s'|s, a) [ R(s, a, s') + γ V*(s') ]
where Q*(s, a) means that action a is selected in state s and, in the subsequent policy selection, the larger cumulative reward is obtained by following the optimal policy;
the relationship between the optimal state value function and the state-action value function satisfies:
V*(s) = max_a Q*(s, a)
step 3c, the optimal action strategy is then:
π*(s) = argmax_a Q*(s, a)
step 3d, when the action strategy is updated, the strategy is adjusted, optimized and the action selected according to the state value function rather than the immediate return function;
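As an illustration of steps 3b and 3c, the sketch below performs one Bellman backup and extracts the greedy policy π*(s) = argmax_a Q*(s, a) on a toy tabular MDP. The toy transition and reward arrays, and the NumPy tabular representation itself, are assumptions made purely for illustration.

```python
# Hedged sketch: Bellman backup Q(s,a) = sum_{s'} p(s'|s,a)[r + gamma*V(s')] and greedy policy.
import numpy as np

def q_from_v(p, r, v, gamma=0.9):
    # p: (S, A, S') transition probabilities, r: (S, A, S') rewards, v: (S,) state values
    return np.einsum('sat,sat->sa', p, r + gamma * v[None, None, :])

def greedy_policy(q_table):
    """For each state, pick the action with the largest state-action value."""
    return np.argmax(q_table, axis=1)

# Toy example with 3 states and 2 actions.
S, A = 3, 2
p = np.full((S, A, S), 1.0 / S)
r = np.random.uniform(-1, 1, size=(S, A, S))
v = np.zeros(S)
q = q_from_v(p, r, v)
pi_star = greedy_policy(q)
```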
step 3e, the neural network of the AC algorithm uses a classical fully connected structure: the input layer of the actor network is a 100 × 6 matrix, there are three hidden layers, and the output is a 100-dimensional column vector; the critic network outputs the reward value;
the reward value is judged by the discounted cumulative reward G = Σ_t γ^t r_t; the task is split into 5 subtasks, so the AC algorithm needs to run 5 times; the goal of reinforcement learning is to maximize this cumulative reward, i.e. max Σ_t γ^t r_t;
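A hedged sketch of the fully connected actor and critic of step 3e is given below, written with PyTorch. The hidden-layer width (256), the choice to flatten the 100 × 6 state matrix into one vector, and the use of PyTorch itself are assumptions; the patent only fixes the input size, the three hidden layers, and the 100-dimensional softmax output.

```python
# Hedged sketch: actor emits a softmax distribution over 100 candidate satellite nodes,
# critic emits a scalar value used to form the TD-error delta = r + gamma*V(s') - V(s).
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, num_nodes=100, state_dim=6, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_nodes * state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_nodes),
        )

    def forward(self, x):                              # x: (batch, 100 * 6)
        return torch.softmax(self.net(x), dim=-1)      # probabilities over 100 nodes

class Critic(nn.Module):
    def __init__(self, num_nodes=100, state_dim=6, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_nodes * state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),                      # scalar value estimate
        )

    def forward(self, x):
        return self.net(x)
```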
Step 3f, in the satellite-cloud data center, each node calculates, under the current environment, the time it would need for each subtask; the node with the minimum time for computing the k-th subtask and the corresponding reward value {i, f} are selected and uploaded to the global manager; the global manager then selects the nodes corresponding to all the selected sub-files, ensuring that all sub-files are stored simultaneously.
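The sketch below illustrates the per-subtask selection of step 3f: for each subtask the node with the minimum completion time is kept together with its reward value {i, f}, and a global manager collects one choice per subtask. The dictionary shapes and the example numbers are assumptions.

```python
# Hedged sketch of step 3f: minimum-time node selection per subtask.
def select_node_for_subtask(times_per_node, rewards_per_node):
    """times_per_node[i] is the completion time of this subtask on node i."""
    best_node = min(times_per_node, key=times_per_node.get)
    return best_node, rewards_per_node[best_node]

def global_manager(subtask_times, subtask_rewards):
    """Pick one node per subtask so every sub-file can be stored at the same time."""
    plan = {}
    for k, times in subtask_times.items():
        plan[k] = select_node_for_subtask(times, subtask_rewards[k])
    return plan

# Example: 2 subtasks, 3 candidate nodes.
times = {0: {0: 0.4, 1: 0.2, 2: 0.7}, 1: {0: 0.3, 1: 0.9, 2: 0.1}}
rews  = {0: {0: -1,  1: 1,   2: -1 }, 1: {0: 1,   1: -1,  2: 1  }}
schedule = global_manager(times, rews)   # {0: (1, 1), 1: (2, 1)}
```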
The invention has the beneficial effects that:
1. The resource scheduling method provided by the invention treats all data as a whole, fuses the data through the neural network algorithm, and feeds the historical data of the state parameters into the neural network as input neurons at the same time, thereby ensuring the timeliness of satellite network tasks and adapting to more complex systems.
2. The method comprehensively considers the bandwidth and duration required by on-board tasks, the idle computing capacity and storage resources of each satellite, and the available bandwidth of inter-satellite links; it realizes intelligent scheduling of ground-user tasks or cloud data center tasks among satellites, reduces inter-satellite forwarding delay, and improves the communication performance of the on-board cloud data center.
Drawings
FIG. 1 is a schematic diagram of a network framework provided by the present invention;
FIG. 2 is a diagram of an algorithm framework;
FIG. 3 is a reinforcement learning schematic;
FIG. 4 is a state transition process diagram;
fig. 5 is an overall algorithm flow chart.
Detailed Description
The invention will be better explained by the following detailed description of the embodiments with reference to the drawings.
A satellite network intelligent resource scheduling method based on reinforcement learning comprises the following steps:
step 1: collecting satellite resource information and on-board task state information;
step 2: determining a constraint condition, a state, an action and a reward value function of the algorithm according to the collected information;
step 3: the obtained information is sent to a reinforcement learning action module to execute satellite and link selection and reward value function calculation;
step 4: and the satellite server forwards the task according to the action result.
Wherein, step 1 is specifically as follows:
step 1a, a satellite resource sensing module is placed on each satellite to acquire satellite resource state information, mainly including the satellite number SN_i, the available computing resources SCPU_i, the available memory resources SM_i, and so on.
Step 1b, a task information collection module is deployed on each satellite to collect task requests from ground users, or subtask request information that needs on-board processing after decomposition by the ground core cloud node, mainly including the bandwidth WBW_k required by each subtask and the task duration τ.
In step 1c, O and D denote the serial numbers of the source and destination satellite nodes respectively, and SBW_OD is the available bandwidth resource of the link between the source and destination satellite nodes; an SBW_OD constraint formula is designed. Because the satellites move at high speed and the topology changes in real time, the link angle also changes with time and is denoted SLA. FLAG is the visibility constraint between the source and destination satellite nodes, mainly comprising link constraints and orbit constraints; the specific decision function of FLAG is:
FLAG=α*β*δ
where α_min and α_max respectively denote the minimum and maximum pitch angles of the link terminal; β_min and β_max respectively denote the minimum and maximum azimuth angles of the link terminal; δ_max is the maximum visible angle for establishing the two-satellite link under Earth occlusion.
Wherein, step 2 is specifically as follows:
In step 2a, the AC algorithm can reduce the effect of the correlation between successive tasks on the final result and can update the global model asynchronously. It uses multiple agents that each interact with their own environment, each agent holding a copy of the environment. A specific A3C agent is based on the actor-critic structure, in which the actor generates actions and the critic evaluates and criticizes the current policy by processing the rewards obtained from the environment. The TD-error method is generally used as the Critic, i.e., to evaluate the quality of the Actor.
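For illustration, the following minimal sketch shows the asynchronous, multi-worker arrangement just described: several workers, each with its own environment copy, push updates into a shared global model. The threading setup, the stand-in environment, and the update rule are all assumptions, not the patent's implementation.

```python
# Hedged sketch: workers asynchronously update a shared global model.
import threading
import random

class GlobalModel:
    def __init__(self):
        self.value = 0.0
        self.lock = threading.Lock()

    def apply_update(self, delta):
        with self.lock:                      # asynchronous but race-free update
            self.value += delta

def worker(global_model, episodes=10):
    env_copy = random.Random()               # stand-in for this worker's environment copy
    for _ in range(episodes):
        td_error = env_copy.uniform(-1, 1)   # placeholder for r + gamma*V(s') - V(s)
        global_model.apply_update(0.01 * td_error)

model = GlobalModel()
threads = [threading.Thread(target=worker, args=(model,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
```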
Step 2b, the state of the network environment is the basis of the decision, and the network environment information needs to be acquired in order to obtain the decision, so that the network state set at the time t is
S_n = {SLA, SBW_OD, SCPU_D, SM_D, WBW_k, τ};
Step 2c, the action set includes all the actions selectable by the decision algorithm, and corresponds to the decision variables in the dynamic planning model, and the corresponding action set a is a= { N i };
Step 2d, besides the agent and the environment it operates in, the main elements of the reinforcement learning system include the action policy, the environment reward and the state value function. The policy guides the agent's action selection in a specific environment state and is the core of reinforcement learning. It is generally denoted S × A → [0, 1]; another policy form is S → A, which indicates that the probability of performing action a in state s is 1 and the probability of performing any other action is 0. The policy is defined as:
π(s,a)=π(a|s)=p(A=a|S=s)
where S is the state space set and A is the action space set. The learning goal of reinforcement learning is to continually update and optimize the strategy during the iterative process of interacting with the environment, so that the agent can maximize its long-term reward according to the strategy.
Step 2e, the reward is an evaluation of the environment state or of the agent's actions; it can guide the agent in updating and optimizing its policy and can also serve as the agent's learning target. The reward value must be objective and impartial and cannot be altered by the agent. The purpose of reinforcement learning is to maximize the total reward finally obtained. The reward function is the environment's immediate evaluation of the action taken in the current state and is a short-term signal, whereas the state value function represents the agent's expectation of the accumulated reward and is a long-term form. An action may bring the agent a lower immediate reward yet a higher cumulative benefit.
The state value function is typically represented by the γ-discounted reward. The γ-discounted reward is the cumulative reward obtained from the current state to future states, in which each future reward value is multiplied by a discount factor γ, a real number between 0 and 1 that expresses how strongly future rewards are weighted. γ^t means that the longer the elapsed time, the smaller the effect of that reward, expressed in detail as follows:
V(s_0) = E[ Σ_{t=0}^{∞} γ^t · R(s_t) ]
where R represents the immediate reward fed back by the environment, s_0 is the initial state of the agent, and s_t is the state of the agent at time t.
Step 2f: the state transition probability denotes the probability that, after action a_n is executed, the network environment changes from state S_n to state S_{n+1}. Because the queued tasks change over a short time span, the network state within that span is considered unchanged, so the state at the next moment is determined only by the state at the current moment. The state transition probability is expressed as:
P(S_{n+1} | S_n, a_n)
Step 2g: reward function: after action a_n is executed, the environment feeds back a reward value r_n that measures how well the action was performed. For each step, the reward r_n is determined by the time needed to process the entire subtask, and the reward function is set as:
f = T_d + T_w + T_t + T_p
where T_d denotes the processing delay, determined by the response speed of the wavelength routing device; T_w denotes the propagation delay, obtained by dividing the inter-satellite distance by the speed of light; T_t denotes the transmission delay, obtained by dividing the task bandwidth by the communication rate; and T_p denotes the queuing delay. Let f_o be the threshold of the reward function f. If f_o ≥ f, the action is considered effective and the reward value is 1; otherwise, the action is considered invalid and the reward value is -1.
Wherein, step 3 is specifically as follows:
In step 3a, the state transitions of reinforcement learning actually form a Markov decision process: the state of the current environment is related only to the state at the previous moment and not to earlier states, so the state value function can be simplified to:
V^π(s) = Σ_a π(a|s) Σ_{s'} p(s'|s, a) [ R(s, a, s') + γ V^π(s') ]
where p is the state transition probability.
Step 3b, a state-action value function is used to evaluate the quality of the selected action in a given state, namely:
Q*(s, a) = Σ_{s'} p(s'|s, a) [ R(s, a, s') + γ V*(s') ]
where Q*(s, a) means that action a is selected in state s and, in the subsequent policy selection, the larger cumulative reward can be obtained by following the optimal policy. From the above analysis, the relationship between the optimal state value function and the state-action value function satisfies:
V*(s) = max_a Q*(s, a)
Step 3c, further, the optimal action policy can be obtained as:
π*(s) = argmax_a Q*(s, a)
and 3d, obtaining from the analysis, wherein the state value function is a return function in long term and can be used as an optimization objective function of the interactive learning process of the intelligent agent and the environment. In addition, when the action strategy is updated, the strategy is adjusted, optimized and selected according to the state value function, rather than the action is selected according to the instant return function, so that the intelligent agent can pay more attention to long-term accumulated benefits.
Step 3e, for the neural network of the AC algorithm, a classical full-connection structure, wherein the input layer of the Actor network is a matrix of 100×6, the hidden layer is three layers, and the output is a 100-dimensional column vector (softmax function); the Critic network outputs the prize value.
The prize value requires accumulating prizes by discountsTo judge. The task is split into 5 subtasks, so the AC algorithm needs to run 5 times. The goal of reinforcement learning is to maximize the cumulative rewards, i.e. +.>
Step 3f, in the satellite-cloud data center, each node calculates, under the current environment, the time it would need for each subtask; the node with the minimum time for computing the k-th subtask and the corresponding reward value {i, f} are selected and uploaded to the global manager. The global manager then selects the nodes corresponding to all the selected sub-files, ensuring that all sub-files are stored simultaneously.
The following is a more specific example:
For the satellite network intelligent resource scheduling method based on reinforcement learning, FIG. 1 shows the satellite network framework, FIG. 2 the reinforcement learning algorithm framework, FIG. 3 the reinforcement learning schematic, and FIG. 4 the state transition process; combining the reinforcement learning algorithm with satellite network resource scheduling gives the overall algorithm flow shown in FIG. 5. As shown in FIG. 5, the method comprises the following steps:
step 1: collecting satellite resource information and on-board task state information;
step 2: determining a constraint condition, a state, an action and a reward value function of the algorithm according to the collected information;
step 3: the obtained information is sent to a reinforcement learning action module to execute satellite and link selection and reward value function calculation;
step 4: and the satellite server forwards the task according to the action result.
Wherein, step 1 is specifically as follows:
step 1a, a satellite resource sensing module is placed on each satellite to acquire satellite resource state information, mainly including the satellite number SN_i, the available computing resources SCPU_i, the available memory resources SM_i, and so on.
Step 1b, a task information collection module is deployed on each satellite to collect task requests from ground users, or subtask request information that needs on-board processing after decomposition by the ground core cloud node, mainly including the bandwidth WBW_k required by each subtask and the task duration τ.
In step 1c, O and D denote the serial numbers of the source and destination satellite nodes respectively, and SBW_OD is the available bandwidth resource of the link between the source and destination satellite nodes; an SBW_OD constraint formula is designed. Because the satellites move at high speed and the topology changes in real time, the link angle also changes with time and is denoted SLA. FLAG is the visibility constraint between the source and destination satellite nodes, mainly comprising link constraints and orbit constraints; the specific decision function of FLAG is:
FLAG=α*β*δ
where α_min and α_max respectively denote the minimum and maximum pitch angles of the link terminal; β_min and β_max respectively denote the minimum and maximum azimuth angles of the link terminal; δ_max is the maximum visible angle for establishing the two-satellite link under Earth occlusion.
Wherein, step 2 is specifically as follows:
In step 2a, the AC algorithm can reduce the effect of the correlation between successive tasks on the final result and can update the global model asynchronously. It uses multiple agents that each interact with their own environment, each agent holding a copy of the environment. A specific A3C agent is based on the actor-critic structure, in which the actor generates actions and the critic evaluates and criticizes the current policy by processing the rewards obtained from the environment. The TD-error method is generally used as the Critic, i.e., to evaluate the quality of the Actor.
Step 2b, the state of the network environment is the basis of the decision, and the network environment information needs to be acquired in order to obtain the decision, so that the network state set at the time t is
S_n = {SLA, SBW_OD, SCPU_D, SM_D, WBW_k, τ};
Step 2c, the action set includes all the actions selectable by the decision algorithm, and corresponds to the decision variables in the dynamic planning model, and the corresponding action set a is a= { N i };
Step 2d, besides the agent and the environment it operates in, the main elements of the reinforcement learning system include the action policy, the environment reward and the state value function. The policy guides the agent's action selection in a specific environment state and is the core of reinforcement learning. It is generally denoted S × A → [0, 1]; another policy form is S → A, which indicates that the probability of performing action a in state s is 1 and the probability of performing any other action is 0. The policy is defined as:
π(s,a)=π(a|s)=p(A=a|S=s)
where S is the state space set and A is the action space set. The learning goal of reinforcement learning is to continually update and optimize the strategy during the iterative process of interacting with the environment, so that the agent can maximize its long-term reward according to the strategy.
Step 2e, the reward is an evaluation of the environment state or of the agent's actions; it can guide the agent in updating and optimizing its policy and can also serve as the agent's learning target. The reward value must be objective and impartial and cannot be altered by the agent. The purpose of reinforcement learning is to maximize the total reward finally obtained. The reward function is the environment's immediate evaluation of the action taken in the current state and is a short-term signal, whereas the state value function represents the agent's expectation of the accumulated reward and is a long-term form. An action may bring the agent a lower immediate reward yet a higher cumulative benefit.
The state value function is typically represented by the γ-discounted reward. The γ-discounted reward is the cumulative reward obtained from the current state to future states, in which each future reward value is multiplied by a discount factor γ, a real number between 0 and 1 that expresses how strongly future rewards are weighted. γ^t means that the longer the elapsed time, the smaller the effect of that reward, expressed in detail as follows:
V(s_0) = E[ Σ_{t=0}^{∞} γ^t · R(s_t) ]
where R represents the immediate reward fed back by the environment, s_0 is the initial state of the agent, and s_t is the state of the agent at time t.
Step 2f: the state transition probability denotes the probability that, after action a_n is executed, the network environment changes from state S_n to state S_{n+1}. Because the queued tasks change over a short time span, the network state within that span is considered unchanged, so the state at the next moment is determined only by the state at the current moment. The state transition probability is expressed as:
P(S_{n+1} | S_n, a_n)
Step 2g: reward function: after action a_n is executed, the environment feeds back a reward value r_n that measures how well the action was performed. For each step, the reward r_n is determined by the time needed to process the entire subtask, and the reward function is set as:
f = T_d + T_w + T_t + T_p
where T_d denotes the processing delay, determined by the response speed of the wavelength routing device; T_w denotes the propagation delay, obtained by dividing the inter-satellite distance by the speed of light; T_t denotes the transmission delay, obtained by dividing the task bandwidth by the communication rate; and T_p denotes the queuing delay. Let f_o be the threshold of the reward function f. If f_o ≥ f, the action is considered effective and the reward value is 1; otherwise, the action is considered invalid and the reward value is -1.
Wherein, step 3 is specifically as follows:
In step 3a, the state transitions of reinforcement learning actually form a Markov decision process: the state of the current environment is related only to the state at the previous moment and not to earlier states, so the state value function can be simplified to:
V^π(s) = Σ_a π(a|s) Σ_{s'} p(s'|s, a) [ R(s, a, s') + γ V^π(s') ]
where p is the state transition probability.
Step 3b, a state-action value function is used to evaluate the quality of the selected action in a given state, namely:
Q*(s, a) = Σ_{s'} p(s'|s, a) [ R(s, a, s') + γ V*(s') ]
where Q*(s, a) means that action a is selected in state s and, in the subsequent policy selection, the larger cumulative reward can be obtained by following the optimal policy. From the above analysis, the relationship between the optimal state value function and the state-action value function satisfies:
V*(s) = max_a Q*(s, a)
Step 3c, further, the optimal action policy can be obtained as:
π*(s) = argmax_a Q*(s, a)
and 3d, obtaining from the analysis, wherein the state value function is a return function in long term and can be used as an optimization objective function of the interactive learning process of the intelligent agent and the environment. In addition, when the action strategy is updated, the strategy is adjusted, optimized and selected according to the state value function, rather than the action is selected according to the instant return function, so that the intelligent agent can pay more attention to long-term accumulated benefits.
Step 3e, for the neural network of the AC algorithm, a classical full-connection structure, wherein the input layer of the Actor network is a matrix of 100×6, the hidden layer is three layers, and the output is a 100-dimensional column vector (softmax function); the Critic network outputs the prize value.
The prize value requires accumulating prizes by discountsTo judge. The task is split into 5 subtasks, so the AC algorithm needs to run 5 times. The goal of reinforcement learning is to maximize the cumulative rewards, i.e. +.>
Step 3f, in the satellite-cloud data center, each node calculates, under the current environment, the time it would need for each subtask; the node with the minimum time for computing the k-th subtask and the corresponding reward value {i, f} are selected and uploaded to the global manager. The global manager then selects the nodes corresponding to all the selected sub-files, ensuring that all sub-files are stored simultaneously.
In short, the invention is mainly intended for spaceborne cloud data center scenarios. The spaceborne cloud data center consists of ground core cloud nodes and satellite edge computing nodes. The ground core cloud nodes comprise several distributed computing centers, each built from a large-scale high-performance computer cluster, providing services such as model training, large-scale complex task processing and data storage. The satellite edge computing nodes comprise several medium-orbit satellites that control low-orbit remote sensing satellites and receive remote sensing data, and at the same time carry a cloud computing platform with independent decision-making capability that can receive and execute tasks. The invention collects satellite resource information and on-board task state information and, through the AC reinforcement learning scheduling algorithm, allocates each task to a suitable satellite edge server for execution according to the resources it requires, realizing intelligent forwarding of tasks among satellites and reducing communication delay.
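As a high-level illustration of how the four steps fit together in this spaceborne cloud data center scenario, the sketch below strings the collection, decision and forwarding stages into one loop. Every class and method name here is an illustrative stand-in, not the patent's API, and the data inside them is made up.

```python
# Hedged sketch: collect resources and subtasks, let the agent pick a node per subtask.
import random

class SensingModule:
    def collect(self):                       # step 1: satellite resource information
        return {i: {"cpu": random.random(), "mem": random.random()} for i in range(5)}

class TaskCollector:
    def collect(self):                       # step 1: decomposed subtask requests
        return [{"wbw": 20.0, "tau": 3.0} for _ in range(5)]

class RLAgent:
    def build_state(self, resources, subtask):       # step 2: state from collected info
        return [subtask["wbw"], subtask["tau"]] + [r["cpu"] for r in resources.values()]
    def select_action(self, state):                  # step 3: actor chooses a node index
        return random.randrange(len(state) - 2)

def schedule(sensing, collector, agent):
    resources = sensing.collect()
    plan = []
    for subtask in collector.collect():
        state = agent.build_state(resources, subtask)
        plan.append(agent.select_action(state))      # step 4 would forward to this node
    return plan

plan = schedule(SensingModule(), TaskCollector(), RLAgent())
```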
The technical principles of the present invention have been described above in connection with specific embodiments, which are provided for the purpose of explaining the principles of the present invention and are not to be construed as limiting the scope of the present invention in any way. Other embodiments of the invention will be apparent to those skilled in the art from consideration of this specification without undue burden.

Claims (4)

1. A satellite network intelligent resource scheduling method based on reinforcement learning is characterized by comprising the following steps:
step 1: collecting satellite resource information and on-board task state information;
step 2: determining the constraint conditions, states, actions and reward value function according to the collected information;
step 3: transmitting the collected information to a reinforcement learning action module to execute satellite and link selection and reward value function calculation;
step 4: and the satellite server forwards the task according to the action result.
2. The method for intelligent resource scheduling of a satellite network based on reinforcement learning according to claim 1, wherein the specific mode of step 1 is as follows:
step 1a, a satellite resource sensing module is placed on each satellite to acquire satellite resource state information, including the satellite number SN_i, the available computing resources SCPU_i and the available memory resources SM_i;
step 1b, a task information collection module is deployed on each satellite to collect task requests from ground users, or subtask request information that needs on-board processing after decomposition by the ground core cloud node, including the bandwidth WBW_k required by each subtask and the task duration τ;
step 1c, designing a visual constraint condition FLAG of a source satellite node and a destination satellite node:
FLAG=α*β*δ
where α, β and δ are respectively the pitch angle and azimuth angle of the link terminal and the visible angle of the two-satellite link under Earth occlusion; α_min and α_max respectively denote the minimum and maximum pitch angles of the link terminal; β_min and β_max respectively denote the minimum and maximum azimuth angles of the link terminal; δ_max is the maximum visible angle for establishing the two-satellite link under Earth occlusion.
3. The method for intelligent resource scheduling of a satellite network based on reinforcement learning according to claim 2, wherein the specific mode of step 2 is as follows:
step 2a, based on the AC algorithm, multiple agents interact with their own environments, each agent holding a copy of the environment; an actor-critic-based A3C agent is constructed, in which the actor generates actions and the critic evaluates and criticizes the current policy by processing the rewards obtained from the environment, using the TD-error method;
step 2b, constructing a network state set at the time t as follows:
S_n = {SLA, SBW_OD, SCPU_D, SM_D, WBW_k, τ};
where SLA is the link angle and SBW_OD is the available bandwidth resource of the link between the source and destination satellite nodes;
step 2c, constructing an action set A = {N_i}, which contains all actions selectable by the decision algorithm;
step 2d, constructing a behavior action strategy of the reinforcement learning system:
π(s,a)=π(a|s)=p(A=a|S=s)
where S is the state space set and A is the action space set;
step 2e, constructing the reward of the reinforcement learning system, represented by the γ-discounted reward: the γ-discounted reward is the cumulative reward obtained from the current state to future states, in which each future reward value is multiplied by a discount factor γ, a real number between 0 and 1 that expresses how strongly future rewards are weighted; γ^t means that the longer the elapsed time, the smaller the effect of that reward, expressed in detail as follows:
V(s_0) = E[ Σ_{t=0}^{∞} γ^t · R(s_t) ]
where R represents the immediate reward fed back by the environment, s_0 is the initial state of the agent, and s_t is the state of the agent at time t;
step 2f: constructing state transition probability of the reinforcement learning system:
the state transition probability denotes the probability that, after action a_n is executed, the network environment changes from state S_n to state S_{n+1}, written P(S_{n+1} | S_n, a_n);
step 2g: constructing the reward function of the reinforcement learning system: after action a_n is executed, the environment feeds back a reward value r_n that measures how well the action was performed; for each step, the reward value r_n is determined by the time needed to process the entire subtask; the reward function is set as:
f = T_d + T_w + T_t + T_p
where T_d denotes the processing delay, determined by the response speed of the wavelength routing device; T_w denotes the propagation delay, obtained by dividing the inter-satellite distance by the speed of light; T_t denotes the transmission delay, obtained by dividing the task bandwidth by the communication rate; and T_p denotes the queuing delay;
let f_o be the threshold of the reward function f; if f_o ≥ f, the action is considered effective and the reward value is 1; otherwise, the action is considered invalid and the reward value is -1.
4. The method for intelligent resource scheduling of a satellite network based on reinforcement learning according to claim 3, wherein the specific manner of step 3 is as follows:
step 3a, the state transitions of reinforcement learning form a Markov decision process: the state of the current environment is related only to the state at the previous moment and not to earlier states, so the state value function simplifies to:
V^π(s) = Σ_a π(a|s) Σ_{s'} p(s'|s, a) [ R(s, a, s') + γ V^π(s') ]
where p is the state transition probability;
step 3b, a state-action value function is used to evaluate the quality of the selected action in a given state, namely:
Q*(s, a) = Σ_{s'} p(s'|s, a) [ R(s, a, s') + γ V*(s') ]
where Q*(s, a) means that action a is selected in state s and, in the subsequent policy selection, the larger cumulative reward is obtained by following the optimal policy;
the relationship between the optimal state value function and the state-action value function satisfies:
V*(s) = max_a Q*(s, a)
step 3c, the optimal action strategy is then:
π*(s) = argmax_a Q*(s, a)
step 3d, when the action strategy is updated, the strategy is adjusted, optimized and the action selected according to the state value function rather than the immediate return function;
step 3e, the neural network of the AC algorithm uses a classical fully connected structure: the input layer of the actor network is a 100 × 6 matrix, there are three hidden layers, and the output is a 100-dimensional column vector; the critic network outputs the reward value;
the reward value is judged by the discounted cumulative reward G = Σ_t γ^t r_t; the task is split into 5 subtasks, so the AC algorithm needs to run 5 times; the goal of reinforcement learning is to maximize this cumulative reward, i.e. max Σ_t γ^t r_t;
step 3f, in the satellite-cloud data center, each node calculates, under the current environment, the time it would need for each subtask; the node with the minimum time for computing the k-th subtask and the corresponding reward value {i, f} are selected and uploaded to the global manager; the global manager then selects the nodes corresponding to all the selected sub-files, ensuring that all sub-files are stored simultaneously.
CN202311121709.2A 2023-09-01 2023-09-01 Satellite network intelligent resource scheduling method based on reinforcement learning Pending CN117314049A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311121709.2A CN117314049A (en) 2023-09-01 2023-09-01 Satellite network intelligent resource scheduling method based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311121709.2A CN117314049A (en) 2023-09-01 2023-09-01 Satellite network intelligent resource scheduling method based on reinforcement learning

Publications (1)

Publication Number Publication Date
CN117314049A true CN117314049A (en) 2023-12-29

Family

ID=89283887

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311121709.2A Pending CN117314049A (en) 2023-09-01 2023-09-01 Satellite network intelligent resource scheduling method based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN117314049A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117521716A (en) * 2024-01-02 2024-02-06 山东大学 Collaborative decision-making method and medium for mass unknown options and limited memory space
CN117521716B (en) * 2024-01-02 2024-03-19 山东大学 Collaborative decision-making method and medium for mass unknown options and limited memory space

Similar Documents

Publication Publication Date Title
CN113346944B (en) Time delay minimization calculation task unloading method and system in air-space-ground integrated network
CN113128828B (en) Satellite observation distributed online planning method based on multi-agent reinforcement learning
CN113794494B (en) Edge computing system and computing unloading optimization method for low-orbit satellite network
CN117314049A (en) Satellite network intelligent resource scheduling method based on reinforcement learning
CN115408151A (en) Method for accelerating learning training of bang
WO2021036414A1 (en) Co-channel interference prediction method for satellite-to-ground downlink under low earth orbit satellite constellation
CN113822456A (en) Service combination optimization deployment method based on deep reinforcement learning in cloud and mist mixed environment
CN111884703B (en) Service request distribution method based on cooperative computing between communication satellites
CN116069512B (en) Serverless efficient resource allocation method and system based on reinforcement learning
CN109947574A (en) A kind of vehicle big data calculating discharging method based on mist network
CN114546608A (en) Task scheduling method based on edge calculation
Yang et al. Onboard coordination and scheduling of multiple autonomous satellites in an uncertain environment
Belokonov et al. Multi-agent planning of the network traffic between nanosatellites and ground stations
Wang Edge artificial intelligence-based affinity task offloading under resource adjustment in a 5G network
CN114698118A (en) Comprehensive benefit-oriented resource intelligent cooperative scheduling method in space-ground-air integrated network
CN116755867B (en) Satellite cloud-oriented computing resource scheduling system, method and storage medium
CN113946423A (en) Multi-task edge computing scheduling optimization method based on graph attention network
CN116886154A (en) Low-orbit satellite access method and system based on flow density
CN116760722A (en) Storage auxiliary MEC task unloading system and resource scheduling method
CN116822863A (en) Multi-platform collaborative awareness intelligent planning method and system
CN116862167A (en) Low-orbit remote sensing satellite constellation emergency task planning method based on multi-knapsack model
Qiao et al. A service function chain deployment scheme of the software defined satellite network
CN115484205B (en) Deterministic network routing and queue scheduling method and device
CN114710200B (en) Satellite network resource arrangement method and system based on reinforcement learning
CN116582407A (en) Containerized micro-service arrangement system and method based on deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination