CN117314049A - Satellite network intelligent resource scheduling method based on reinforcement learning - Google Patents

Satellite network intelligent resource scheduling method based on reinforcement learning

Info

Publication number
CN117314049A
CN117314049A (application CN202311121709.2A)
Authority
CN
China
Prior art keywords
state
satellite
action
reinforcement learning
reward
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311121709.2A
Other languages
Chinese (zh)
Inventor
王守斌
朱皓俊
王士成
孙康
马万权
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 54 Research Institute
Original Assignee
CETC 54 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 54 Research Institute filed Critical CETC 54 Research Institute
Priority to CN202311121709.2A priority Critical patent/CN117314049A/en
Publication of CN117314049A publication Critical patent/CN117314049A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0631Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G06Q10/06312Adjustment or analysis of established resource schedule, e.g. resource or task levelling, or dynamic rescheduling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0499Feedforward networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/092Reinforcement learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/04Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00Reducing energy consumption in communication networks
    • Y02D30/70Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Strategic Management (AREA)
  • Economics (AREA)
  • General Physics & Mathematics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Software Systems (AREA)
  • Development Economics (AREA)
  • Health & Medical Sciences (AREA)
  • Game Theory and Decision Science (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Operations Research (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Educational Administration (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a satellite network intelligent resource scheduling method based on reinforcement learning, belonging to the technical field of satellite network resource scheduling. The method comprises the following steps: collecting satellite resource information and on-board task state information; determining the constraint conditions, states, actions and reward value function of the algorithm from the collected information; sending the collected information to a reinforcement learning action module to perform satellite and link selection and reward value calculation; and forwarding the task by the satellite server according to the action result. The method comprehensively considers the bandwidth and duration required by on-board tasks, the idle computing capacity and storage resources of each satellite, and the available bandwidth of inter-satellite links; it realizes intelligent scheduling of ground-user tasks or cloud data center tasks among satellites, reduces inter-satellite forwarding delay, and improves the communication performance of the on-board cloud data center.

Description

Satellite network intelligent resource scheduling method based on reinforcement learning
Technical Field
The invention belongs to the technical field of satellite network resource scheduling, and particularly relates to a satellite network intelligent resource scheduling method based on reinforcement learning.
Background
With the continued development of new-generation medium- and low-orbit satellite constellation technology, satellite networks built from such constellations can effectively compensate for the insufficient service capability of terrestrial networks. Compared with traditional terrestrial networks, satellite networks offer global coverage, deployment unconstrained by region or terrain, and strong system survivability. As a result, more and more users obtain Internet service through satellite networks. However, given the high complexity, high dynamics and resource scarcity of medium- and low-orbit satellite networks, how to achieve efficient and intelligent scheduling of limited resources across different services has become a key challenge.
At present, the more classical on-board resource scheduling algorithms include first-in first-out, shortest job first, priority queues, weighted fair queuing and the like. In complex, resource-scarce environments, however, the resource-allocation computation these conventional algorithms perform for user service scheduling occupies a large share of the total resource overhead, so combining them with more efficient AI techniques becomes a feasible approach. For example, a genetic algorithm can select tasks with different resource demands for scheduling based on dynamically updated information, compute a fitness value from the total task weight and the task completion time, and then search the candidate scheduling schemes for an optimal solution. Alternatively, the Hungarian algorithm can compute the resource allocation that maximizes QoE, improving user experience quality while maintaining resource utilization. Such algorithms improve on the shortcomings of the classical algorithms, but still fall short for scheduling specific bandwidth service resources, multi-satellite joint scheduling and special traffic requirements. For these problems, reinforcement learning, which has risen in recent years, offers good intelligent solving capability for complex optimization and decision problems. Facing the many challenges that the characteristics of medium- and low-orbit satellite networks (multi-satellite cooperation, full-time-space interconnection, massive numbers of users) impose on resource scheduling, namely high complexity, high dynamics, scarce resources and the large amount of information required, intelligent decisions made by a reinforcement learning algorithm can realize the most reasonable resource scheduling.
Disclosure of Invention
To address the above technical problems, the invention provides a satellite network intelligent resource scheduling method based on reinforcement learning, which both adapts to complex, dynamic satellite network systems and reduces the delay of resource scheduling.
In order to achieve the above purpose, the invention adopts the following technical scheme:
a satellite network intelligent resource scheduling method based on reinforcement learning comprises the following steps:
step 1: collecting satellite resource information and on-board task state information;
step 2: determining the constraint conditions, states, actions and reward value function according to the collected information;
step 3: transmitting the collected information to a reinforcement learning action module to execute satellite and link selection and reward value function calculation;
step 4: and the satellite server forwards the task according to the action result.
Further, the specific manner of step 1 is as follows:
step 1a, a satellite resource sensing module is placed on each satellite to acquire satellite resource state information, including the satellite number SN_i, the available computing resources SCPU_i and the available memory resources SM_i;
step 1b, a task information collection module is deployed on each satellite to collect task requests from ground users, or subtask request information that needs on-board processing after decomposition by the ground core cloud node, including the bandwidth WBW_k required by each subtask and the task duration τ;
step 1c, designing a visual constraint condition FLAG of a source satellite node and a destination satellite node:
FLAG=α*β*δ
where α, β and δ are respectively the pitch angle and azimuth angle of the link terminal and the visible angle of the two-satellite link under Earth occlusion; α_min and α_max respectively denote the minimum and maximum pitch angles of the link terminal; β_min and β_max respectively denote the minimum and maximum azimuth angles of the link terminal; δ_max is the maximum visible angle for establishing the two-satellite link under Earth occlusion.
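As an illustration, the following is a minimal sketch of one plausible way to evaluate the visibility constraint FLAG, treating each factor of the product as a 0/1 indicator that the corresponding angle lies in its allowed range. The function name and the numeric thresholds in the example call are assumptions for illustration only, not values from the patent.

```python
# Hedged sketch: evaluate FLAG = alpha_ok * beta_ok * delta_ok as a product of indicators.
def link_visible(alpha, beta, delta,
                 alpha_min, alpha_max,
                 beta_min, beta_max,
                 delta_max):
    """Return 1 if the source-destination inter-satellite link can be established.

    alpha: pitch angle of the link terminal
    beta:  azimuth angle of the link terminal
    delta: visible angle of the two-satellite link under Earth occlusion
    """
    alpha_ok = 1 if alpha_min <= alpha <= alpha_max else 0
    beta_ok = 1 if beta_min <= beta <= beta_max else 0
    delta_ok = 1 if delta <= delta_max else 0
    return alpha_ok * beta_ok * delta_ok  # FLAG

# Example with made-up angles: all three indicators are 1, so FLAG = 1.
FLAG = link_visible(30.0, 120.0, 50.0, 10.0, 60.0, 0.0, 180.0, 70.0)
```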
Further, the specific manner of step 2 is as follows:
step 2a, based on the Actor-Critic (AC) algorithm, multiple agents interact with their own environments, each agent holding a copy of the environment; an actor-critic-based Asynchronous Advantage Actor-Critic (A3C) agent is constructed, in which the actor generates actions and the critic evaluates and criticizes the current policy by processing the rewards obtained from the environment, using the TD-error method;
step 2b, constructing a network state set at the time t as follows:
S_n = {SLA, SBW_OD, SCPU_D, SM_D, WBW_k, τ};
where SLA is the link angle and SBW_OD is the available bandwidth resource of the link between the source and destination satellite nodes;
step 2c, constructing an action set A = {N_i}, which contains all actions selectable by the decision algorithm;
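For illustration, the sketch below shows one way the state vector S_n of step 2b and the action set A of step 2c could be encoded. The field names mirror the symbols above; the constellation size of 100 nodes is an assumption (it only echoes the 100-dimensional actor output mentioned later), and the concrete values are made up.

```python
# Hedged sketch: encode S_n and the action set A = {N_i}.
from dataclasses import dataclass, astuple

@dataclass
class NetworkState:
    sla: float      # SLA: inter-satellite link angle
    sbw_od: float   # SBW_OD: available bandwidth of the source-destination link
    scpu_d: float   # SCPU_D: idle computing resources of the destination satellite
    sm_d: float     # SM_D: available memory of the destination satellite
    wbw_k: float    # WBW_k: bandwidth required by subtask k
    tau: float      # τ: task duration

    def as_vector(self):
        return list(astuple(self))

NUM_SATELLITES = 100                      # assumed constellation size
action_set = list(range(NUM_SATELLITES))  # A = {N_i}: candidate destination nodes

state = NetworkState(sla=42.0, sbw_od=150.0, scpu_d=0.6, sm_d=512.0, wbw_k=20.0, tau=3.0)
```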
step 2d, constructing a behavior action strategy of the reinforcement learning system:
π(s,a)=π(a|s)=p(A=a|S=s)
where S is the state space set and A is the action space set;
step 2e, constructing the reward of the reinforcement learning system, represented by the γ-discounted reward: the γ-discounted reward is the cumulative reward obtained from the current state to future states, in which each future reward value is multiplied by a discount factor γ, a real number between 0 and 1 that expresses how strongly future rewards are weighted; γ^t means that the longer the elapsed time, the smaller the effect of that reward, expressed in detail as follows:
V(s_0) = E[ Σ_{t=0}^{∞} γ^t · R(s_t) ]
where R represents the immediate reward fed back by the environment, s_0 is the initial state of the agent, and s_t is the state of the agent at time t;
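A minimal sketch of the γ-discounted cumulative reward described in step 2e follows; the gamma value and the example rewards are illustrative only.

```python
# Hedged sketch: G = sum_t gamma**t * R_t, later rewards weighted down by gamma**t.
def discounted_return(rewards, gamma=0.9):
    """Accumulate immediate rewards R_t from the initial state onward."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Example: rewards 1, -1, 1 with gamma = 0.9 give 1 - 0.9 + 0.81 = 0.91.
g = discounted_return([1, -1, 1], gamma=0.9)
```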
step 2f: constructing state transition probability of the reinforcement learning system:
the state transition probability denotes the probability that, after action a_n is executed, the network environment changes from state S_n to state S_{n+1}, written P(S_{n+1} | S_n, a_n);
step 2g: constructing the reward function of the reinforcement learning system: after action a_n is executed, the environment feeds back a reward value r_n that measures how well the action was performed; for each step, the reward value r_n is determined by the time needed to process the entire subtask; the reward function is set as:
f = T_d + T_w + T_t + T_p
where T_d denotes the processing delay, determined by the response speed of the wavelength routing device; T_w denotes the propagation delay, obtained by dividing the inter-satellite distance by the speed of light; T_t denotes the transmission delay, obtained by dividing the task bandwidth by the communication rate; and T_p denotes the queuing delay;
let f_o be the threshold of the reward function f; if f_o ≥ f, the action is considered effective and the reward value is 1; otherwise, the action is considered invalid and the reward value is -1.
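The following is a small, hedged sketch of the delay-based reward in step 2g, using the total time f = T_d + T_w + T_t + T_p reconstructed above; the numeric values in the example call (link distance, task size, rate, threshold) are assumptions.

```python
# Hedged sketch of the threshold reward: +1 if f_o >= f, otherwise -1.
LIGHT_SPEED = 3e8  # m/s

def reward(t_d, inter_sat_distance, task_bandwidth, comm_rate, t_p, f_threshold):
    t_w = inter_sat_distance / LIGHT_SPEED   # propagation delay
    t_t = task_bandwidth / comm_rate         # transmission delay
    f = t_d + t_w + t_t + t_p                # total time to handle the subtask
    return 1 if f_threshold >= f else -1

# Example with illustrative numbers: a 2000 km link, 20 Mbit task, 100 Mbit/s rate.
r_n = reward(t_d=0.005, inter_sat_distance=2_000_000, task_bandwidth=20e6,
             comm_rate=100e6, t_p=0.01, f_threshold=0.5)
```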
Further, the specific manner of step 3 is as follows:
step 3a, the state transitions of reinforcement learning form a Markov decision process: the state of the current environment is related only to the state at the previous moment and not to earlier states, so the state value function simplifies to:
V^π(s) = Σ_a π(a|s) Σ_{s'} p(s'|s, a) [ R(s, a, s') + γ V^π(s') ]
where p is the state transition probability;
step 3b, a state-action value function is used to evaluate the quality of the selected action in a given state, namely:
Q*(s, a) = Σ_{s'} p(s'|s, a) [ R(s, a, s') + γ V*(s') ]
where Q*(s, a) means that action a is selected in state s and, in the subsequent policy selection, the larger cumulative reward is obtained by following the optimal policy;
the relationship between the optimal state value function and the state-action value function satisfies:
V*(s) = max_a Q*(s, a)
step 3c, the optimal action strategy is then:
π*(s) = argmax_a Q*(s, a)
step 3d, when the action strategy is updated, the strategy is adjusted, optimized and the action selected according to the state value function rather than the immediate return function;
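As an illustration of steps 3b and 3c, the sketch below performs one Bellman backup and extracts the greedy policy π*(s) = argmax_a Q*(s, a) on a toy tabular MDP. The toy transition and reward arrays, and the NumPy tabular representation itself, are assumptions made purely for illustration.

```python
# Hedged sketch: Bellman backup Q(s,a) = sum_{s'} p(s'|s,a)[r + gamma*V(s')] and greedy policy.
import numpy as np

def q_from_v(p, r, v, gamma=0.9):
    # p: (S, A, S') transition probabilities, r: (S, A, S') rewards, v: (S,) state values
    return np.einsum('sat,sat->sa', p, r + gamma * v[None, None, :])

def greedy_policy(q_table):
    """For each state, pick the action with the largest state-action value."""
    return np.argmax(q_table, axis=1)

# Toy example with 3 states and 2 actions.
S, A = 3, 2
p = np.full((S, A, S), 1.0 / S)
r = np.random.uniform(-1, 1, size=(S, A, S))
v = np.zeros(S)
q = q_from_v(p, r, v)
pi_star = greedy_policy(q)
```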
step 3e, the neural network of the AC algorithm uses a classical fully connected structure: the input layer of the actor network is a 100 × 6 matrix, there are three hidden layers, and the output is a 100-dimensional column vector; the critic network outputs the reward value;
the reward value is judged by the discounted cumulative reward G = Σ_t γ^t r_t; the task is split into 5 subtasks, so the AC algorithm needs to run 5 times; the goal of reinforcement learning is to maximize this cumulative reward, i.e. max Σ_t γ^t r_t;
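A hedged sketch of the fully connected actor and critic of step 3e is given below, written with PyTorch. The hidden-layer width (256), the choice to flatten the 100 × 6 state matrix into one vector, and the use of PyTorch itself are assumptions; the patent only fixes the input size, the three hidden layers, and the 100-dimensional softmax output.

```python
# Hedged sketch: actor emits a softmax distribution over 100 candidate satellite nodes,
# critic emits a scalar value used to form the TD-error delta = r + gamma*V(s') - V(s).
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, num_nodes=100, state_dim=6, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_nodes * state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_nodes),
        )

    def forward(self, x):                              # x: (batch, 100 * 6)
        return torch.softmax(self.net(x), dim=-1)      # probabilities over 100 nodes

class Critic(nn.Module):
    def __init__(self, num_nodes=100, state_dim=6, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_nodes * state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),                      # scalar value estimate
        )

    def forward(self, x):
        return self.net(x)
```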
Step 3f, in the satellite-cloud data center, each node calculates, under the current environment, the time it would need for each subtask; the node with the minimum time for computing the k-th subtask and the corresponding reward value {i, f} are selected and uploaded to the global manager; the global manager then selects the nodes corresponding to all the selected sub-files, ensuring that all sub-files are stored simultaneously.
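The sketch below illustrates the per-subtask selection of step 3f: for each subtask the node with the minimum completion time is kept together with its reward value {i, f}, and a global manager collects one choice per subtask. The dictionary shapes and the example numbers are assumptions.

```python
# Hedged sketch of step 3f: minimum-time node selection per subtask.
def select_node_for_subtask(times_per_node, rewards_per_node):
    """times_per_node[i] is the completion time of this subtask on node i."""
    best_node = min(times_per_node, key=times_per_node.get)
    return best_node, rewards_per_node[best_node]

def global_manager(subtask_times, subtask_rewards):
    """Pick one node per subtask so every sub-file can be stored at the same time."""
    plan = {}
    for k, times in subtask_times.items():
        plan[k] = select_node_for_subtask(times, subtask_rewards[k])
    return plan

# Example: 2 subtasks, 3 candidate nodes.
times = {0: {0: 0.4, 1: 0.2, 2: 0.7}, 1: {0: 0.3, 1: 0.9, 2: 0.1}}
rews  = {0: {0: -1,  1: 1,   2: -1 }, 1: {0: 1,   1: -1,  2: 1  }}
schedule = global_manager(times, rews)   # {0: (1, 1), 1: (2, 1)}
```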
The invention has the beneficial effects that:
1. The resource scheduling method provided by the invention treats all data as a whole, fuses the data through the neural network algorithm, and feeds the historical data of the state parameters into the neural network as input neurons at the same time, thereby ensuring the timeliness of satellite network tasks and adapting to more complex systems.
2. The method comprehensively considers the bandwidth and duration required by on-board tasks, the idle computing capacity and storage resources of each satellite, and the available bandwidth of inter-satellite links; it realizes intelligent scheduling of ground-user tasks or cloud data center tasks among satellites, reduces inter-satellite forwarding delay, and improves the communication performance of the on-board cloud data center.
Drawings
FIG. 1 is a schematic diagram of a network framework provided by the present invention;
FIG. 2 is a diagram of an algorithm framework;
FIG. 3 is a reinforcement learning schematic;
FIG. 4 is a state transition process diagram;
fig. 5 is an overall algorithm flow chart.
Detailed Description
The invention will be better explained by the following detailed description of the embodiments with reference to the drawings.
A satellite network intelligent resource scheduling method based on reinforcement learning comprises the following steps:
step 1: collecting satellite resource information and on-board task state information;
step 2: determining a constraint condition, a state, an action and a reward value function of the algorithm according to the collected information;
step 3: the obtained information is sent to a reinforcement learning action module to execute satellite and link selection and reward value function calculation;
step 4: and the satellite server forwards the task according to the action result.
Wherein, step 1 is specifically as follows:
step 1a, a satellite resource sensing module is placed on each satellite to acquire satellite resource state information, mainly including the satellite number SN_i, the available computing resources SCPU_i, the available memory resources SM_i, and so on.
Step 1b, a task information collection module is deployed on each satellite to collect task requests from ground users, or subtask request information that needs on-board processing after decomposition by the ground core cloud node, mainly including the bandwidth WBW_k required by each subtask and the task duration τ.
In step 1c, O and D denote the serial numbers of the source and destination satellite nodes respectively, and SBW_OD is the available bandwidth resource of the link between the source and destination satellite nodes; an SBW_OD constraint formula is designed. Because the satellites move at high speed and the topology changes in real time, the link angle also changes with time and is denoted SLA. FLAG is the visibility constraint between the source and destination satellite nodes, mainly comprising link constraints and orbit constraints; the specific decision function of FLAG is:
FLAG=α*β*δ
where α_min and α_max respectively denote the minimum and maximum pitch angles of the link terminal; β_min and β_max respectively denote the minimum and maximum azimuth angles of the link terminal; δ_max is the maximum visible angle for establishing the two-satellite link under Earth occlusion.
Wherein, step 2 is specifically as follows:
In step 2a, the AC algorithm can reduce the effect of the correlation between successive tasks on the final result and can update the global model asynchronously. It uses multiple agents that each interact with their own environment, each agent holding a copy of the environment. A specific A3C agent is based on the actor-critic structure, in which the actor generates actions and the critic evaluates and criticizes the current policy by processing the rewards obtained from the environment. The TD-error method is generally used as the Critic, i.e., to evaluate the quality of the Actor.
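For illustration, the following minimal sketch shows the asynchronous, multi-worker arrangement just described: several workers, each with its own environment copy, push updates into a shared global model. The threading setup, the stand-in environment, and the update rule are all assumptions, not the patent's implementation.

```python
# Hedged sketch: workers asynchronously update a shared global model.
import threading
import random

class GlobalModel:
    def __init__(self):
        self.value = 0.0
        self.lock = threading.Lock()

    def apply_update(self, delta):
        with self.lock:                      # asynchronous but race-free update
            self.value += delta

def worker(global_model, episodes=10):
    env_copy = random.Random()               # stand-in for this worker's environment copy
    for _ in range(episodes):
        td_error = env_copy.uniform(-1, 1)   # placeholder for r + gamma*V(s') - V(s)
        global_model.apply_update(0.01 * td_error)

model = GlobalModel()
threads = [threading.Thread(target=worker, args=(model,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
```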
Step 2b, the state of the network environment is the basis of the decision, and the network environment information needs to be acquired in order to obtain the decision, so that the network state set at the time t is
S_n = {SLA, SBW_OD, SCPU_D, SM_D, WBW_k, τ};
Step 2c, the action set includes all the actions selectable by the decision algorithm, and corresponds to the decision variables in the dynamic planning model, and the corresponding action set a is a= { N i };
Step 2d, besides the agent and the environment it operates in, the main elements of the reinforcement learning system include the action policy, the environment reward and the state value function. The policy guides the agent's action selection in a specific environment state and is the core of reinforcement learning. It is generally denoted S × A → [0, 1]; another policy form is S → A, which indicates that the probability of performing action a in state s is 1 and the probability of performing any other action is 0. The policy is defined as:
π(s,a)=π(a|s)=p(A=a|S=s)
where S is the state space set and A is the action space set. The learning goal of reinforcement learning is to continually update and optimize the strategy during the iterative process of interacting with the environment, so that the agent can maximize its long-term reward according to the strategy.
Step 2e, the reward is an evaluation of the environment state or of the agent's actions; it can guide the agent in updating and optimizing its policy and can also serve as the agent's learning target. The reward value must be objective and impartial and cannot be altered by the agent. The purpose of reinforcement learning is to maximize the total reward finally obtained. The reward function is the environment's immediate evaluation of the action taken in the current state and is a short-term signal, whereas the state value function represents the agent's expectation of the accumulated reward and is a long-term form. An action may bring the agent a lower immediate reward yet a higher cumulative benefit.
The state value function is typically represented by the γ-discounted reward. The γ-discounted reward is the cumulative reward obtained from the current state to future states, in which each future reward value is multiplied by a discount factor γ, a real number between 0 and 1 that expresses how strongly future rewards are weighted. γ^t means that the longer the elapsed time, the smaller the effect of that reward, expressed in detail as follows:
V(s_0) = E[ Σ_{t=0}^{∞} γ^t · R(s_t) ]
where R represents the immediate reward fed back by the environment, s_0 is the initial state of the agent, and s_t is the state of the agent at time t.
Step 2f: the state transition probability denotes the probability that, after action a_n is executed, the network environment changes from state S_n to state S_{n+1}. Because the queued tasks change over a short time span, the network state within that span is considered unchanged, so the state at the next moment is determined only by the state at the current moment. The state transition probability is expressed as:
P(S_{n+1} | S_n, a_n)
Step 2g: reward function: after action a_n is executed, the environment feeds back a reward value r_n that measures how well the action was performed. For each step, the reward r_n is determined by the time needed to process the entire subtask, and the reward function is set as:
f = T_d + T_w + T_t + T_p
where T_d denotes the processing delay, determined by the response speed of the wavelength routing device; T_w denotes the propagation delay, obtained by dividing the inter-satellite distance by the speed of light; T_t denotes the transmission delay, obtained by dividing the task bandwidth by the communication rate; and T_p denotes the queuing delay. Let f_o be the threshold of the reward function f. If f_o ≥ f, the action is considered effective and the reward value is 1; otherwise, the action is considered invalid and the reward value is -1.
Wherein, step 3 is specifically as follows:
In step 3a, the state transitions of reinforcement learning actually form a Markov decision process: the state of the current environment is related only to the state at the previous moment and not to earlier states, so the state value function can be simplified to:
V^π(s) = Σ_a π(a|s) Σ_{s'} p(s'|s, a) [ R(s, a, s') + γ V^π(s') ]
where p is the state transition probability.
Step 3b, a state-action value function is used to evaluate the quality of the selected action in a given state, namely:
Q*(s, a) = Σ_{s'} p(s'|s, a) [ R(s, a, s') + γ V*(s') ]
where Q*(s, a) means that action a is selected in state s and, in the subsequent policy selection, the larger cumulative reward can be obtained by following the optimal policy. From the above analysis, the relationship between the optimal state value function and the state-action value function satisfies:
V*(s) = max_a Q*(s, a)
Step 3c, further, the optimal action policy can be obtained as:
π*(s) = argmax_a Q*(s, a)
and 3d, obtaining from the analysis, wherein the state value function is a return function in long term and can be used as an optimization objective function of the interactive learning process of the intelligent agent and the environment. In addition, when the action strategy is updated, the strategy is adjusted, optimized and selected according to the state value function, rather than the action is selected according to the instant return function, so that the intelligent agent can pay more attention to long-term accumulated benefits.
Step 3e, for the neural network of the AC algorithm, a classical full-connection structure, wherein the input layer of the Actor network is a matrix of 100×6, the hidden layer is three layers, and the output is a 100-dimensional column vector (softmax function); the Critic network outputs the prize value.
The prize value requires accumulating prizes by discountsTo judge. The task is split into 5 subtasks, so the AC algorithm needs to run 5 times. The goal of reinforcement learning is to maximize the cumulative rewards, i.e. +.>
Step 3f, in the satellite-cloud data center, each node calculates, under the current environment, the time it would need for each subtask; the node with the minimum time for computing the k-th subtask and the corresponding reward value {i, f} are selected and uploaded to the global manager. The global manager then selects the nodes corresponding to all the selected sub-files, ensuring that all sub-files are stored simultaneously.
The following is a more specific example:
For the satellite network intelligent resource scheduling method based on reinforcement learning, FIG. 1 shows the satellite network framework, FIG. 2 the reinforcement learning algorithm framework, FIG. 3 the reinforcement learning schematic, and FIG. 4 the state transition process; combining the reinforcement learning algorithm with satellite network resource scheduling gives the overall algorithm flow shown in FIG. 5. As shown in FIG. 5, the method comprises the following steps:
step 1: collecting satellite resource information and on-board task state information;
step 2: determining a constraint condition, a state, an action and a reward value function of the algorithm according to the collected information;
step 3: the obtained information is sent to a reinforcement learning action module to execute satellite and link selection and reward value function calculation;
step 4: and the satellite server forwards the task according to the action result.
Wherein, step 1 is specifically as follows:
step 1a, a satellite resource sensing module is placed on each satellite to acquire satellite resource state information, mainly including the satellite number SN_i, the available computing resources SCPU_i, the available memory resources SM_i, and so on.
Step 1b, a task information collection module is deployed on each satellite to collect task requests from ground users, or subtask request information that needs on-board processing after decomposition by the ground core cloud node, mainly including the bandwidth WBW_k required by each subtask and the task duration τ.
In step 1c, O and D denote the serial numbers of the source and destination satellite nodes respectively, and SBW_OD is the available bandwidth resource of the link between the source and destination satellite nodes; an SBW_OD constraint formula is designed. Because the satellites move at high speed and the topology changes in real time, the link angle also changes with time and is denoted SLA. FLAG is the visibility constraint between the source and destination satellite nodes, mainly comprising link constraints and orbit constraints; the specific decision function of FLAG is:
FLAG=α*β*δ
where α_min and α_max respectively denote the minimum and maximum pitch angles of the link terminal; β_min and β_max respectively denote the minimum and maximum azimuth angles of the link terminal; δ_max is the maximum visible angle for establishing the two-satellite link under Earth occlusion.
Wherein, step 2 is specifically as follows:
In step 2a, the AC algorithm can reduce the effect of the correlation between successive tasks on the final result and can update the global model asynchronously. It uses multiple agents that each interact with their own environment, each agent holding a copy of the environment. A specific A3C agent is based on the actor-critic structure, in which the actor generates actions and the critic evaluates and criticizes the current policy by processing the rewards obtained from the environment. The TD-error method is generally used as the Critic, i.e., to evaluate the quality of the Actor.
Step 2b, the state of the network environment is the basis of the decision, and the network environment information needs to be acquired in order to obtain the decision, so that the network state set at the time t is
S_n = {SLA, SBW_OD, SCPU_D, SM_D, WBW_k, τ};
Step 2c, the action set includes all the actions selectable by the decision algorithm, and corresponds to the decision variables in the dynamic planning model, and the corresponding action set a is a= { N i };
Step 2d, besides the agent and the environment it operates in, the main elements of the reinforcement learning system include the action policy, the environment reward and the state value function. The policy guides the agent's action selection in a specific environment state and is the core of reinforcement learning. It is generally denoted S × A → [0, 1]; another policy form is S → A, which indicates that the probability of performing action a in state s is 1 and the probability of performing any other action is 0. The policy is defined as:
π(s,a)=π(a|s)=p(A=a|S=s)
where S is the state space set and A is the action space set. The learning goal of reinforcement learning is to continually update and optimize the strategy during the iterative process of interacting with the environment, so that the agent can maximize its long-term reward according to the strategy.
Step 2e, the reward is an evaluation of the environment state or of the agent's actions; it can guide the agent in updating and optimizing its policy and can also serve as the agent's learning target. The reward value must be objective and impartial and cannot be altered by the agent. The purpose of reinforcement learning is to maximize the total reward finally obtained. The reward function is the environment's immediate evaluation of the action taken in the current state and is a short-term signal, whereas the state value function represents the agent's expectation of the accumulated reward and is a long-term form. An action may bring the agent a lower immediate reward yet a higher cumulative benefit.
The state value function is typically represented by the γ-discounted reward. The γ-discounted reward is the cumulative reward obtained from the current state to future states, in which each future reward value is multiplied by a discount factor γ, a real number between 0 and 1 that expresses how strongly future rewards are weighted. γ^t means that the longer the elapsed time, the smaller the effect of that reward, expressed in detail as follows:
V(s_0) = E[ Σ_{t=0}^{∞} γ^t · R(s_t) ]
where R represents the immediate reward fed back by the environment, s_0 is the initial state of the agent, and s_t is the state of the agent at time t.
Step 2f: the state transition probability denotes the probability that, after action a_n is executed, the network environment changes from state S_n to state S_{n+1}. Because the queued tasks change over a short time span, the network state within that span is considered unchanged, so the state at the next moment is determined only by the state at the current moment. The state transition probability is expressed as:
P(S_{n+1} | S_n, a_n)
Step 2g: reward function: after action a_n is executed, the environment feeds back a reward value r_n that measures how well the action was performed. For each step, the reward r_n is determined by the time needed to process the entire subtask, and the reward function is set as:
f = T_d + T_w + T_t + T_p
where T_d denotes the processing delay, determined by the response speed of the wavelength routing device; T_w denotes the propagation delay, obtained by dividing the inter-satellite distance by the speed of light; T_t denotes the transmission delay, obtained by dividing the task bandwidth by the communication rate; and T_p denotes the queuing delay. Let f_o be the threshold of the reward function f. If f_o ≥ f, the action is considered effective and the reward value is 1; otherwise, the action is considered invalid and the reward value is -1.
Wherein, step 3 is specifically as follows:
In step 3a, the state transitions of reinforcement learning actually form a Markov decision process: the state of the current environment is related only to the state at the previous moment and not to earlier states, so the state value function can be simplified to:
V^π(s) = Σ_a π(a|s) Σ_{s'} p(s'|s, a) [ R(s, a, s') + γ V^π(s') ]
where p is the state transition probability.
Step 3b, a state-action value function is used to evaluate the quality of the selected action in a given state, namely:
Q*(s, a) = Σ_{s'} p(s'|s, a) [ R(s, a, s') + γ V*(s') ]
where Q*(s, a) means that action a is selected in state s and, in the subsequent policy selection, the larger cumulative reward can be obtained by following the optimal policy. From the above analysis, the relationship between the optimal state value function and the state-action value function satisfies:
V*(s) = max_a Q*(s, a)
Step 3c, further, the optimal action policy can be obtained as:
π*(s) = argmax_a Q*(s, a)
and 3d, obtaining from the analysis, wherein the state value function is a return function in long term and can be used as an optimization objective function of the interactive learning process of the intelligent agent and the environment. In addition, when the action strategy is updated, the strategy is adjusted, optimized and selected according to the state value function, rather than the action is selected according to the instant return function, so that the intelligent agent can pay more attention to long-term accumulated benefits.
Step 3e, for the neural network of the AC algorithm, a classical full-connection structure, wherein the input layer of the Actor network is a matrix of 100×6, the hidden layer is three layers, and the output is a 100-dimensional column vector (softmax function); the Critic network outputs the prize value.
The prize value requires accumulating prizes by discountsTo judge. The task is split into 5 subtasks, so the AC algorithm needs to run 5 times. The goal of reinforcement learning is to maximize the cumulative rewards, i.e. +.>
Step 3f, in the satellite-cloud data center, each node calculates, under the current environment, the time it would need for each subtask; the node with the minimum time for computing the k-th subtask and the corresponding reward value {i, f} are selected and uploaded to the global manager. The global manager then selects the nodes corresponding to all the selected sub-files, ensuring that all sub-files are stored simultaneously.
In short, the invention is mainly intended for spaceborne cloud data center scenarios. The spaceborne cloud data center consists of ground core cloud nodes and satellite edge computing nodes. The ground core cloud nodes comprise several distributed computing centers, each built from a large-scale high-performance computer cluster, providing services such as model training, large-scale complex task processing and data storage. The satellite edge computing nodes comprise several medium-orbit satellites that control low-orbit remote sensing satellites and receive remote sensing data, and at the same time carry a cloud computing platform with independent decision-making capability that can receive and execute tasks. The invention collects satellite resource information and on-board task state information and, through the AC reinforcement learning scheduling algorithm, allocates each task to a suitable satellite edge server for execution according to the resources it requires, realizing intelligent forwarding of tasks among satellites and reducing communication delay.
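As a high-level illustration of how the four steps fit together in this spaceborne cloud data center scenario, the sketch below strings the collection, decision and forwarding stages into one loop. Every class and method name here is an illustrative stand-in, not the patent's API, and the data inside them is made up.

```python
# Hedged sketch: collect resources and subtasks, let the agent pick a node per subtask.
import random

class SensingModule:
    def collect(self):                       # step 1: satellite resource information
        return {i: {"cpu": random.random(), "mem": random.random()} for i in range(5)}

class TaskCollector:
    def collect(self):                       # step 1: decomposed subtask requests
        return [{"wbw": 20.0, "tau": 3.0} for _ in range(5)]

class RLAgent:
    def build_state(self, resources, subtask):       # step 2: state from collected info
        return [subtask["wbw"], subtask["tau"]] + [r["cpu"] for r in resources.values()]
    def select_action(self, state):                  # step 3: actor chooses a node index
        return random.randrange(len(state) - 2)

def schedule(sensing, collector, agent):
    resources = sensing.collect()
    plan = []
    for subtask in collector.collect():
        state = agent.build_state(resources, subtask)
        plan.append(agent.select_action(state))      # step 4 would forward to this node
    return plan

plan = schedule(SensingModule(), TaskCollector(), RLAgent())
```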
The technical principles of the present invention have been described above in connection with specific embodiments, which are provided for the purpose of explaining the principles of the present invention and are not to be construed as limiting the scope of the present invention in any way. Other embodiments of the invention will be apparent to those skilled in the art from consideration of this specification without undue burden.

Claims (4)

1. A satellite network intelligent resource scheduling method based on reinforcement learning is characterized by comprising the following steps:
step 1: collecting satellite resource information and on-board task state information;
step 2: determining the constraint conditions, states, actions and reward value function according to the collected information;
step 3: transmitting the collected information to a reinforcement learning action module to execute satellite and link selection and reward value function calculation;
step 4: and the satellite server forwards the task according to the action result.
2. The method for intelligent resource scheduling of a satellite network based on reinforcement learning according to claim 1, wherein the specific mode of step 1 is as follows:
step 1a, a satellite resource sensing module is placed on each satellite to acquire satellite resource state information, including the satellite number SN_i, the available computing resources SCPU_i and the available memory resources SM_i;
step 1b, a task information collection module is deployed on each satellite to collect task requests from ground users, or subtask request information that needs on-board processing after decomposition by the ground core cloud node, including the bandwidth WBW_k required by each subtask and the task duration τ;
step 1c, designing a visual constraint condition FLAG of a source satellite node and a destination satellite node:
FLAG=α*β*δ
where α, β and δ are respectively the pitch angle and azimuth angle of the link terminal and the visible angle of the two-satellite link under Earth occlusion; α_min and α_max respectively denote the minimum and maximum pitch angles of the link terminal; β_min and β_max respectively denote the minimum and maximum azimuth angles of the link terminal; δ_max is the maximum visible angle for establishing the two-satellite link under Earth occlusion.
3. The method for intelligent resource scheduling of a satellite network based on reinforcement learning according to claim 2, wherein the specific mode of step 2 is as follows:
step 2a, based on the AC algorithm, multiple agents interact with their own environments, each agent holding a copy of the environment; an actor-critic-based A3C agent is constructed, in which the actor generates actions and the critic evaluates and criticizes the current policy by processing the rewards obtained from the environment, using the TD-error method;
step 2b, constructing a network state set at the time t as follows:
S_n = {SLA, SBW_OD, SCPU_D, SM_D, WBW_k, τ};
where SLA is the link angle and SBW_OD is the available bandwidth resource of the link between the source and destination satellite nodes;
step 2c, constructing an action set A = {N_i}, which contains all actions selectable by the decision algorithm;
step 2d, constructing a behavior action strategy of the reinforcement learning system:
π(s,a)=π(a|s)=p(A=a|S=s)
where S is the state space set and A is the action space set;
step 2e, constructing the reward of the reinforcement learning system, represented by the γ-discounted reward: the γ-discounted reward is the cumulative reward obtained from the current state to future states, in which each future reward value is multiplied by a discount factor γ, a real number between 0 and 1 that expresses how strongly future rewards are weighted; γ^t means that the longer the elapsed time, the smaller the effect of that reward, expressed in detail as follows:
V(s_0) = E[ Σ_{t=0}^{∞} γ^t · R(s_t) ]
where R represents the immediate reward fed back by the environment, s_0 is the initial state of the agent, and s_t is the state of the agent at time t;
step 2f: constructing state transition probability of the reinforcement learning system:
the state transition probability denotes the probability that, after action a_n is executed, the network environment changes from state S_n to state S_{n+1}, written P(S_{n+1} | S_n, a_n);
step 2g: constructing the reward function of the reinforcement learning system: after action a_n is executed, the environment feeds back a reward value r_n that measures how well the action was performed; for each step, the reward value r_n is determined by the time needed to process the entire subtask; the reward function is set as:
f = T_d + T_w + T_t + T_p
where T_d denotes the processing delay, determined by the response speed of the wavelength routing device; T_w denotes the propagation delay, obtained by dividing the inter-satellite distance by the speed of light; T_t denotes the transmission delay, obtained by dividing the task bandwidth by the communication rate; and T_p denotes the queuing delay;
let f_o be the threshold of the reward function f; if f_o ≥ f, the action is considered effective and the reward value is 1; otherwise, the action is considered invalid and the reward value is -1.
4. The method for intelligent resource scheduling of a satellite network based on reinforcement learning according to claim 3, wherein the specific manner of step 3 is as follows:
step 3a, the state transitions of reinforcement learning form a Markov decision process: the state of the current environment is related only to the state at the previous moment and not to earlier states, so the state value function simplifies to:
V^π(s) = Σ_a π(a|s) Σ_{s'} p(s'|s, a) [ R(s, a, s') + γ V^π(s') ]
where p is the state transition probability;
step 3b, a state-action value function is used to evaluate the quality of the selected action in a given state, namely:
Q*(s, a) = Σ_{s'} p(s'|s, a) [ R(s, a, s') + γ V*(s') ]
where Q*(s, a) means that action a is selected in state s and, in the subsequent policy selection, the larger cumulative reward is obtained by following the optimal policy;
the relationship between the optimal state value function and the state-action value function satisfies:
V*(s) = max_a Q*(s, a)
step 3c, the optimal action strategy is then:
π*(s) = argmax_a Q*(s, a)
step 3d, when the action strategy is updated, the strategy is adjusted, optimized and the action selected according to the state value function rather than the immediate return function;
step 3e, the neural network of the AC algorithm uses a classical fully connected structure: the input layer of the actor network is a 100 × 6 matrix, there are three hidden layers, and the output is a 100-dimensional column vector; the critic network outputs the reward value;
the reward value is judged by the discounted cumulative reward G = Σ_t γ^t r_t; the task is split into 5 subtasks, so the AC algorithm needs to run 5 times; the goal of reinforcement learning is to maximize this cumulative reward, i.e. max Σ_t γ^t r_t;
step 3f, in the satellite-cloud data center, each node calculates, under the current environment, the time it would need for each subtask; the node with the minimum time for computing the k-th subtask and the corresponding reward value {i, f} are selected and uploaded to the global manager; the global manager then selects the nodes corresponding to all the selected sub-files, ensuring that all sub-files are stored simultaneously.
CN202311121709.2A 2023-09-01 2023-09-01 Satellite network intelligent resource scheduling method based on reinforcement learning Pending CN117314049A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311121709.2A CN117314049A (en) 2023-09-01 2023-09-01 Satellite network intelligent resource scheduling method based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311121709.2A CN117314049A (en) 2023-09-01 2023-09-01 Satellite network intelligent resource scheduling method based on reinforcement learning

Publications (1)

Publication Number Publication Date
CN117314049A true CN117314049A (en) 2023-12-29

Family

ID=89283887

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311121709.2A Pending CN117314049A (en) 2023-09-01 2023-09-01 Satellite network intelligent resource scheduling method based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN117314049A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117521716A (en) * 2024-01-02 2024-02-06 山东大学 Collaborative decision-making method and medium for mass unknown options and limited memory space
CN117521716B (en) * 2024-01-02 2024-03-19 山东大学 Collaborative decision-making method and medium for mass unknown options and limited memory space

Similar Documents

Publication Publication Date Title
CN113346944B (en) Time delay minimization calculation task unloading method and system in air-space-ground integrated network
CN113128828B (en) Satellite observation distributed online planning method based on multi-agent reinforcement learning
CN113794494B (en) Edge computing system and computing unloading optimization method for low-orbit satellite network
CN117314049A (en) Satellite network intelligent resource scheduling method based on reinforcement learning
CN115408151A (en) Method for accelerating learning training of bang
WO2021036414A1 (en) Co-channel interference prediction method for satellite-to-ground downlink under low earth orbit satellite constellation
CN113822456A (en) Service combination optimization deployment method based on deep reinforcement learning in cloud and mist mixed environment
CN111884703B (en) Service request distribution method based on cooperative computing between communication satellites
CN116069512B (en) Serverless efficient resource allocation method and system based on reinforcement learning
CN109947574A (en) A kind of vehicle big data calculating discharging method based on mist network
CN114546608A (en) Task scheduling method based on edge calculation
Yang et al. Onboard coordination and scheduling of multiple autonomous satellites in an uncertain environment
Belokonov et al. Multi-agent planning of the network traffic between nanosatellites and ground stations
Wang Edge artificial intelligence-based affinity task offloading under resource adjustment in a 5G network
CN114698118A (en) Comprehensive benefit-oriented resource intelligent cooperative scheduling method in space-ground-air integrated network
CN116755867B (en) Satellite cloud-oriented computing resource scheduling system, method and storage medium
CN113946423A (en) Multi-task edge computing scheduling optimization method based on graph attention network
CN116886154A (en) Low-orbit satellite access method and system based on flow density
CN116760722A (en) Storage auxiliary MEC task unloading system and resource scheduling method
CN116822863A (en) Multi-platform collaborative awareness intelligent planning method and system
CN116862167A (en) Low-orbit remote sensing satellite constellation emergency task planning method based on multi-knapsack model
Qiao et al. A service function chain deployment scheme of the software defined satellite network
CN115484205B (en) Deterministic network routing and queue scheduling method and device
CN114710200B (en) Satellite network resource arrangement method and system based on reinforcement learning
CN116582407A (en) Containerized micro-service arrangement system and method based on deep reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination