CN111767991A - Measurement and control resource scheduling method based on deep Q learning - Google Patents


Info

Publication number: CN111767991A
Application number: CN202010609039.9A
Authority: CN (China)
Prior art keywords: measurement, control, task, resource, scheduling
Legal status: Granted; Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Other versions: CN111767991B (en)
Inventors: 郭茂耘, 武艺, 唐奇, 梁皓星
Current and original assignee: Chongqing University (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Application filed by Chongqing University
Priority to CN202010609039.9A; publication of CN111767991A; application granted; publication of CN111767991B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/048 Activation functions
    • G06N3/08 Learning methods
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0631 Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G06Q10/06315 Needs-based resource requirements planning or analysis


Abstract

The invention relates to a measurement and control resource scheduling method based on deep Q learning, belonging to the field of intelligent scheduling. The method comprises the following steps: S1: describing the complex measurement and control scene; S2: designing evaluation indexes of measurement and control scheduling performance; S3: forming the measurement and control resource scheduling scheme; S4: applying the DQN algorithm to generate the measurement and control resource scheduling scheme; S5: implementing the DQN-based measurement and control resource scheduling method. Without accurately modeling the measurement and control environment, the method can generate a measurement and control resource scheduling strategy adapted to the measurement and control scene in a complex measurement and control environment, thereby maximizing measurement and control resource scheduling efficiency.

Description

Measurement and control resource scheduling method based on deep Q learning
Technical Field
The invention belongs to the field of intelligent scheduling, and relates to a measurement and control resource scheduling method based on deep Q learning.
Background
At present, the methods for solving the satellite measurement and control resource scheduling problem mainly include: intelligent algorithms such as the ant colony algorithm, the particle swarm algorithm, and SVM-based methods; deterministic algorithms such as branch and bound and Lagrangian relaxation; and heuristic algorithms such as greedy algorithms, neighborhood search, and simulated annealing. Research on space-ground integrated measurement and control resources is comparatively scarce, and most of it approaches the problem from the perspective of traditional algorithms such as Lagrangian relaxation, the ant colony algorithm, and genetic algorithms, so applications of deep reinforcement learning algorithms remain relatively rare.
The invention mainly resolves the conflict between measurement and control resources and measurement and control objects caused by the growing number of measurement and control tasks. From the perspective of visibility between measurement and control resources and measurement and control objects, a measurement and control scene based on measurement and control time windows is constructed, the optimal execution period of each measurement and control task is solved using deep Q learning (Deep Q Network, DQN), and finally an optimal measurement and control scheduling scheme is formed, realizing optimal operation of the measurement and control system under a specific index.
Disclosure of Invention
In view of this, the present invention provides a measurement and control resource scheduling method based on deep Q learning. The conflict between measurement and control tasks and the limited quantity of measurement and control resources is increasingly severe: with limited resources, measurement and control tasks remain constrained by conditions such as visibility between resources and objects, measurement and control duration, and task priority, so scheduling the measurement and control resources becomes a complex combinatorial optimization problem under multiple spatio-temporal constraints. A single type of measurement and control resource differs in, and is limited in, the measurement and control services it offers and the range it covers, while measurement and control tasks grow ever more complex and diverse, continuously increasing the difficulty of scheduling decisions. Joint scheduling of space-based and ground-based measurement and control resources is therefore necessary, so that the comprehensive scheduling performance of the space-ground integrated measurement and control resources is optimal.
The invention aims to construct a measurement and control resource scheduling method based on deep reinforcement learning, which uses deep reinforcement learning to realize intelligent scheduling of the space-ground integrated measurement and control resources, performs more accurate abstraction and feature extraction of the measurement and control system and scene, and finds a scheduling scheme adapted to the scene, so as to complete the measurement and control tasks and improve the comprehensive utilization efficiency of the measurement and control resources. The innovative application of the DQN algorithm is realized by abstracting the resource scheduling problem under multiple constraints.
In order to achieve the purpose, the invention provides the following technical scheme:
a measurement and control resource scheduling method based on deep Q learning comprises the following steps:
s1: describing a complex measurement and control scene;
s2: designing evaluation indexes of measurement and control scheduling performance;
s3: forming a measurement and control resource scheduling scheme;
s4: the DQN algorithm is applied to the generation of the measurement and control resource scheduling scheme;
s5: and implementing the measurement and control resource scheduling method based on the DQN.
Optionally, step S1 specifically includes:
(1) description of entities in a measurement and control scenario
From the perspective of measurement and control resources of the space-ground integrated measurement and control system, elements in the measurement and control scene are described based on visible time windows.
The space-ground integrated measurement and control resources are described as follows:
RESOURCE = {S, TYPE, TS, DS, L, LMAX}
wherein S is the set of space-ground integrated measurement and control resources, with all resources numbered uniformly: S = {s_1, s_2, ... s_j, ... s_M}; j is the number of a measurement and control resource, and M is the total number of measurement and control resources.
TYPE characterizes the type of a measurement and control resource: TYPE = 1 denotes a space-based resource, and TYPE = 0 denotes a ground-based resource.
TS characterizes the idle time windows of each measurement and control resource, i.e. the time windows currently available for measurement and control:
TS = {TS_1, TS_2, ... TS_j, ... TS_M}
   = {[t_b1(s_1), t_e1(s_1)], [t_b2(s_1), t_e2(s_1)], ..., [t_b1(s_2), t_e1(s_2)], [t_b2(s_2), t_e2(s_2)], ..., [t_b1(s_M), t_e1(s_M)]}
TS_j characterizes all available time windows, i.e. idle time windows, of the j-th measurement and control resource; t_b1(s_j) and t_e1(s_j) respectively denote the start and end times of the 1st visible time window of the j-th resource, with visible windows numbered in chronological order, and so on.
DS characterizes the length of each idle time window of a measurement and control resource:
DS = {ds_j^k}, where ds_j^k (shown as a formula image in the original) characterizes the length of the k-th idle time window of the j-th measurement and control resource.
LS_j characterizes the occupation of a single measurement and control resource by all medium and low orbit satellites:
LS_j = {l_ij | i = 1, ..., n}, where l_ij (shown as a formula image in the original) represents the load that measurement and control task i places on the single resource j; i is the number of a measurement and control task, and n is the total number of measurement and control tasks.
L characterizes the occupation of the whole set of space-ground integrated measurement and control resources by all medium and low orbit satellites, specifically:
L = {L_1, ... L_j, ... L_M}, where L_j (shown as a formula image in the original) represents the load placed on the single resource j by all measurement and control tasks.
LMAX = {LMAX_1, LMAX_2, ... LMAX_j, ... LMAX_M}
LMAX_j characterizes the maximum measurement and control task load that resource j can accept, i.e. the maximum load of the resource.
from the perspective of a measurement and control task, elements in a measurement and control scene are described based on a visible time window; the measurement and control task is described as follows:
TASK={T,Sat,P,D,TA,TC,TOi}
wherein, T is the number set of all measurement and control tasks, and T is { T ═ T1,T2,...Ti...Tn};
TiA number representing a measurement and control task; in the formula and the following formula, i is the order of the measurement and control tasks, and n is the total number of the measurement and control tasks;
sat represents a measurement and control task source, namely a corresponding task satellite, and Sat is { Sat ═1,Sat2,…Sato}
SatiA source satellite representing the measurement and control tasks with the sequence i;
p is the priority of the measurement and control task, and P is { P ═ P1,P2,...Pi...Pn},PiThe priority of the measurement and control tasks with the sequence i is represented;
d is the shortest measurement and control time D ═ D corresponding to each measurement and control task1,d2,...di...dn);diRepresenting the shortest duration of the measurement and control tasks with the sequence i;
TAtime interval for representing measurement and control task
TA={[t1B,t1E],[t2B,t2E],....[tiB,tiE],...[tnB,tnE]};
[tiB,tiE]Time window, t, indicating that the measurement and control task with the order i can perform the measurement and control taskiBFor the earliest starting time of the measurement and control task, tiEThe latest ending time of the measurement and control task is taken as the latest ending time of the measurement and control task;
TCactual measurement and control interval of characterization task
TC={[t1b,t1e],[t2b,t2e],....[tib,tie],...[tnb,tne]};
[tib,tie]Representing the time window, t, during which the measurement and control tasks in the order i are actually performedibActual start time, t, after scheduling for measurement and control tasksieThe actual end time after the actual scheduling of the measurement and control task is obtained;
Toidescribing sets of visible arc segments corresponding to respective tasks
Figure BDA0002560215510000041
Figure BDA0002560215510000042
The k-th visible time window of the m-th measurement and control resource for the measurement and control task with the sequence i is shown, and is specifically shown as [ tb1(sim),te1(sim)],tb1(sim) Is the start time of the visible window, te1(sim) Is the end time of the visible window;
(2) Measurement and control state design
The measurement and control state s is designed to express, with visible time windows, the different visibility/availability states in the measurement and control system according to the utilization of the measurement and control resources, i.e. on the basis of spatio-temporal visibility. For a specific measurement and control scene, a 0-1 matrix characterizing the state of every measurement and control resource is used as the state of the scene; the size of the matrix is determined by the number of measurement and control resources and the division scale of the measurement and control time window. For each resource, a division scale is chosen according to specific requirements to partition the resource's daily working time, and the visibility of each resulting time slot is marked: the matrix entry for a visible/available unit time is set to 0, and the entry for an invisible/unavailable unit time is set to 1. This determines the usage of the measurement and control equipment at any given moment, i.e. the measurement and control state.
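The 0-1 state matrix described above can be sketched in code. The following is a minimal illustration, not the patent's implementation; the function name, the slot-based window representation, and the default division scale of 96 slots per day (15-minute units) are assumptions:

```python
def build_state_matrix(visible_windows, num_resources, slots_per_day=96):
    """Build the 0-1 measurement and control state matrix.

    visible_windows maps a resource index to a list of half-open
    (start_slot, end_slot) ranges in which the resource is visible/available.
    Entry convention follows the patent: 0 = visible/available unit time,
    1 = invisible/unavailable unit time.
    """
    state = [[1] * slots_per_day for _ in range(num_resources)]
    for j, windows in visible_windows.items():
        for start, end in windows:
            for t in range(max(start, 0), min(end, slots_per_day)):
                state[j][t] = 0  # mark this unit time as visible/available
    return state

# Example: resource 0 visible in slots [8, 20), resource 1 in slots [40, 60)
state = build_state_matrix({0: [(8, 20)], 1: [(40, 60)]}, num_resources=2)
```

The resulting matrix has one row per resource and one column per unit time, matching the matrix size rule stated above.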
the step S3 specifically includes:
(3) design of measurement and control action
The design of the measurement and control actions adopts a layer-by-layer progressive decision idea: decide whether to accept a measurement and control task, which measurement and control resource receives the task, which visible time window of that resource is used, and the actual start time within that window. The measurement and control action is designed as the tuple
(a_i, type, x_ij, y_jk, t_ib)
(shown as a formula image in the original), wherein a_i characterizes whether the measurement and control task is accepted, type characterizes the type of the resource accepting the task, x_ij characterizes the number of the resource accepting the task, y_jk indicates that the task is executed in the k-th visible time window of resource j, and t_ib characterizes the actual start time of the task.
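For illustration, the action components named above can be held in a small container type. This is a hypothetical sketch; the class and field names are not from the patent, which specifies only the components a_i, type, x_ij, y_jk, and t_ib:

```python
from dataclasses import dataclass

@dataclass
class ControlAction:
    """Hypothetical container for the measurement and control action components."""
    accept: bool        # a_i: whether the measurement and control task is accepted
    resource_type: int  # type: 1 = space-based, 0 = ground-based resource
    resource_id: int    # x_ij: number of the resource accepting the task
    window_index: int   # y_jk: k-th visible time window of resource j used
    start_time: float   # t_ib: actual start time of the task

action = ControlAction(accept=True, resource_type=0, resource_id=3,
                       window_index=1, start_time=120.0)
```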
Optionally, step S2 specifically includes:
A comprehensive measurement and control performance evaluation index is designed that takes into account three indexes: the degree of completion of the measurement and control tasks, the balance of measurement and control resource utilization, and the balance of measurement and control resource load. It serves as the decision basis for applying the DQN algorithm to measurement and control scheduling; the scheduling of the measurement and control resources is expected to yield a strategy that maximizes this comprehensive index.
The evaluation index of measurement and control resource scheduling performance is set as r = s_R * RUR / load,
wherein s_R characterizes the satisfaction degree of the measurement and control tasks, load characterizes the balance of measurement and control resource utilization, and RUR characterizes the average utilization rate of all measurement and control resources.
The satisfaction degree of the measurement and control tasks s_R, the measurement and control resource load balance degree load, and the average utilization rate of the measurement and control resources RUR are each given by formulas shown as images in the original.
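Since the patent shows the component formulas only as images, the following sketch of the composite index r = s_R * RUR / load uses stand-in definitions that are assumptions, not the patent's formulas: a priority-weighted completion ratio for s_R, the standard deviation of per-resource loads (plus one, to keep the divisor positive) for the balance term load, and the mean utilization for RUR:

```python
import statistics

def scheduling_reward(completed_priorities, all_priorities,
                      resource_loads, resource_utilizations):
    """Composite evaluation index r = s_R * RUR / load.

    The three component definitions below are stand-ins (assumptions),
    since the patent gives the formulas only as images.
    """
    # s_R: priority-weighted satisfaction of the measurement and control tasks
    s_r = sum(completed_priorities) / sum(all_priorities)
    # load: imbalance of per-resource loads; +1 keeps the divisor positive
    load = statistics.pstdev(resource_loads) + 1.0
    # RUR: average utilization rate over all measurement and control resources
    rur = sum(resource_utilizations) / len(resource_utilizations)
    return s_r * rur / load

# Two of three tasks (priorities 3 and 2 out of 3, 2, 1) scheduled,
# both resources half loaded and half utilized
r = scheduling_reward([3, 2], [3, 2, 1], [0.5, 0.5], [0.5, 0.5])
```

Under any such definitions the index rises with task satisfaction and average utilization and falls with load imbalance, which is the behavior the text describes.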
optionally, step S3 specifically includes:
According to the design of the measurement and control actions in S1, forming the measurement and control scheduling scheme mainly consists of deciding whether to accept a measurement and control task, determining the measurement and control resource that performs it, and determining the measurement and control arc segment in which it is completed.
Specifically: with the visible time windows, i.e. the visible arc segments, as the modeling basis of the measurement and control state, whether a specific task is accepted is decided by judging whether a visible time window for that task exists. When modeling the measurement and control scene, the measurement and control resources and tasks are numbered uniformly; for a specific task, the visible arc segments satisfying the conditions are found, and the type and number of the resource completing the task are determined from the correspondence between visible arc segments and resources.
In the design of the measurement and control state, the visible arc segment corresponding to a task is discretized; the measurement arc slides along the selected visible arc segment over the possible start times of the task, and the optimal measurement arc capable of completing the task is determined.
Optionally, step S4 specifically includes:
(1) when the task state at the current moment changes and the visible time window of the measurement and control resource changes, the measurement and control state of the system changes;
(2) updating a measurement and control environment, extracting scene characteristics, and updating the measurement and control state of the system;
(3) selecting a decision strategy of the measurement and control action according to an action selection rule of a deep reinforcement learning algorithm, so that measurement and control resources are matched with the measurement and control task in time and space, and the measurement and control task is realized;
(4) evaluating and feeding back the measurement and control scheduling result aiming at the update of the measurement and control environment and the measurement and control state caused by the selected measurement and control strategy;
(5) updating the measurement and control decision strategy by using a deep reinforcement learning network according to the evaluation feedback result of the measurement and control strategy, and observing the measurement and control scene and the updating of the measurement and control state;
Through cyclic updating of the algorithm, the selection and optimization of the measurement and control resource allocation strategy is realized, and the optimal measurement and control scheduling strategy is obtained.
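Steps (1) to (5) above form a standard reinforcement-learning interaction cycle, which can be sketched as follows. The `env` and `agent` interfaces are hypothetical stand-ins (a scene model exposing reset/step and a DQN agent exposing act/learn), not APIs defined by the patent:

```python
def scheduling_loop(env, agent, episodes=10):
    """Sketch of the DQN scheduling decision cycle in steps (1)-(5).

    `env` models the measurement and control scene: reset() returns the
    initial state, step(action) returns (next_state, reward, done).
    `agent` wraps the DQN: act(state) selects a measurement and control
    action, learn(...) updates the decision strategy.
    """
    for _ in range(episodes):
        state = env.reset()                 # (1)-(2) update scene, extract state
        done = False
        while not done:
            action = agent.act(state)       # (3) select action (e.g. epsilon-greedy)
            next_state, reward, done = env.step(action)  # (4) evaluate and feed back
            agent.learn(state, action, reward, next_state, done)  # (5) update network
            state = next_state
```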
Optionally, step S5 specifically includes:
(1) describing the measurement and control scene and defining the basic physical elements in it; based on the actual physical scene, the relevant elements involved in applying the DQN method to measurement and control scheduling are sorted and summarized, and the composition of the basic elements (measurement and control states, measurement and control actions, measurement and control action rewards, and the measurement and control scheme) is determined;
(2) initializing a deep Q learning measurement and control resource scheduling network, initializing a memory base according to actual capacity requirements, and initializing network parameters including learning rate, discount factors and structures and parameters of an actual value neural network and a target value neural network describing a Q value;
(3) designing the measurement and control state s according to the measurement and control scene model, initializing the input of the measurement and control scheduling network, and computing the corresponding output; with probability ε the measurement and control action is selected randomly, and with probability 1 - ε it is selected through the Q value output by the scheduling network, i.e. the greedy strategy, and the corresponding measurement and control action is executed in the measurement and control resource scheduling network; after the action is executed, the reward r, i.e. the evaluation index of the measurement and control action, and the measurement and control state before the next action, i.e. the state s_{i+1} at the next moment, are obtained; according to the currently selected action and the current state, the Q values of the actual-value neural network and the target-value neural network of the scheduling network at the next moment, i.e. the actual Q value and the estimated Q value, are computed;
(4) the four parameters (s_i, X_i, r_i, s_{i+1}), where X_i denotes the selected measurement and control action, are stored in the memory bank as one sample;
(5) a certain number of sample states are randomly drawn from the memory bank, the target value of each state is calculated, and the Q value is updated toward this target value using the reward obtained after execution; the actual-value neural network parameters are updated by stochastic gradient descent, and after every N iterations of updating, the current parameters of the actual-value network are assigned to the target-value network, thereby updating the target-value network parameters of the measurement and control scheduling network; the parameters are updated continuously to train the measurement and control scheduling network;
(6) the selection and optimization of measurement and control resource allocation strategies are realized through cyclic algorithm updating, and the selection of an optimal measurement and control scheduling strategy is realized; and finishing the measurement and control resource scheduling process.
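Two mechanisms from steps (4) and (5), the memory bank of (s_i, X_i, r_i, s_{i+1}) samples and the periodic copy of actual-value network parameters into the target-value network, can be sketched as follows. This is a minimal dependency-free illustration; the class names and the dictionary representation of network parameters are assumptions:

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity memory bank storing (s_i, X_i, r_i, s_{i+1}) samples;
    the oldest samples are evicted automatically once capacity is reached."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def store(self, sample):
        self.buffer.append(sample)

    def sample(self, batch_size):
        # Random draw breaks the correlation between consecutive samples
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

def update_target(actual_params, target_params):
    """Assign the current actual-value network parameters to the
    target-value network (performed after every N iterations)."""
    target_params.clear()
    target_params.update(actual_params)

mem = ReplayMemory(capacity=100)
for i in range(5):
    mem.store((i, 0, 1.0, i + 1))  # (state, action, reward, next_state)
batch = mem.sample(3)
```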
The invention has the following beneficial effects: without accurately modeling the measurement and control environment, the method can generate a measurement and control resource scheduling strategy adapted to the measurement and control scene in a complex measurement and control environment, thereby maximizing measurement and control resource scheduling efficiency.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.
Drawings
For the purposes of promoting a better understanding of the objects, aspects and advantages of the invention, reference will now be made to the following detailed description taken in conjunction with the accompanying drawings in which:
FIG. 1 is a schematic view of a measurement and control state design;
FIG. 2 is a flowchart illustrating the measurement and control resource scheduling scheme;
fig. 3 is a DQN-based measurement and control resource scheduling decision flow;
fig. 4 is a schematic view of a measurement and control state in the embodiment.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention in a schematic way, and the features in the following embodiments and examples may be combined with each other without conflict.
The drawings are provided for the purpose of illustrating the invention only and are not intended to limit it; to better illustrate the embodiments of the present invention, some parts of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings, and descriptions thereof, may be omitted.
The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components; in the description of the present invention, it should be understood that if there is an orientation or positional relationship indicated by terms such as "upper", "lower", "left", "right", "front", "rear", etc., based on the orientation or positional relationship shown in the drawings, it is only for convenience of description and simplification of description, but it is not an indication or suggestion that the referred device or element must have a specific orientation, be constructed in a specific orientation, and be operated, and therefore, the terms describing the positional relationship in the drawings are only used for illustrative purposes, and are not to be construed as limiting the present invention, and the specific meaning of the terms may be understood by those skilled in the art according to specific situations.
Please refer to fig. 1 to 4, which illustrate a measurement and control resource scheduling method based on deep Q learning.
The invention relates to a measurement and control resource scheduling method based on the DQN algorithm. A measurement and control scene is constructed based on the visible windows between measurement and control resources and measurement and control objects; the strong model description capability of the neural network in the DQN algorithm is used to characterize the long-term reward of a measurement and control action, and a memory replay mechanism is used to break the correlation between data, so that by interacting with the measurement and control scene and learning to evaluate the quality of the current state, the optimal strategy is learned, making the method suitable for complex measurement and control resource scheduling environments. The technical scheme of the method is as follows:
1. description of complex measurement and control scenarios
In the complex measurement and control scene involved in this method, measurement and control resources mainly refer to the space-ground integrated measurement and control resources, i.e. ground-based and space-based measurement and control resources: the ground-based resources are mainly ground stations, and the space-based resources mainly consider tracking and data relay satellites. The type of a measurement and control resource is made explicit through the type variable. The description of the measurement and control scene is performed mainly on the basis of the visibility states and visible time windows between the measurement and control resources and the measurement and control objects. Specifically, the description of the complex measurement and control scene is completed by abstractly expressing each physical entity in the scene (including the description of measurement and control tasks and the related constraints) and by designing the measurement and control state and the measurement and control actions.
(1) Description of entities in a measurement and control scenario
From the perspective of measurement and control resources of the space-ground integrated measurement and control system, elements in a measurement and control scene are described based on a visible time window.
The space-ground integrated measurement and control resource can be described as follows:
RESOURCE={S,TYPE,TS,DS,L,LMAX}
wherein S is the set of space-ground integrated measurement and control resources, with all resources numbered uniformly: S = {s_1, s_2, ..., s_j, ..., s_M}; in this formula and those below, j is the number of a measurement and control resource and M is the total number of measurement and control resources.
TYPE denotes the type of the measurement and control resource: TYPE = 1 indicates a space-based resource, and TYPE = 0 indicates a ground-based resource;
TS characterizes an idle time window for each measurement and control resource (i.e. the time window currently available for measurement and control);
TS = {TS_1, TS_2, ..., TS_j, ..., TS_M}
TS_j = {[t_b1(s_j), t_e1(s_j)], [t_b2(s_j), t_e2(s_j)], ...}
TS_j characterizes all available time windows (i.e. idle time windows) of the j-th measurement and control resource; t_bk(s_j) and t_ek(s_j) respectively denote the start time and the end time of the k-th visible time window of the j-th resource, the visible windows being numbered in chronological order, and so on.
DS characterizes the length of each idle time window of the measurement and control resources:
ds_j^k = t_ek(s_j) - t_bk(s_j)
where ds_j^k is the length of the k-th idle time window of the j-th measurement and control resource.
LS_j denotes the occupation of a single measurement and control resource by all medium and low orbit satellites:
LS_j = {L_1^j, L_2^j, ..., L_i^j, ..., L_n^j}
where L_i^j denotes the load that measurement and control task i places on the single resource j; i is the number of a measurement and control task and n is the total number of measurement and control tasks.
L denotes the occupation of the space-ground integrated measurement and control resources by all medium and low orbit satellites, specifically:
L = {LS_1, LS_2, ..., LS_j, ..., LS_M}
where LS_j denotes the load placed on the single resource j by all measurement and control tasks.
LMAX = {LMAX_1, LMAX_2, ..., LMAX_j, ..., LMAX_M}
where LMAX_j denotes the largest measurement and control task load that resource j can accept, i.e. the maximum load of the resource.
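Under these definitions, the RESOURCE fields and the DS window lengths can be sketched minimally in Python (the field and method names are illustrative, not taken from the patent):

```python
from dataclasses import dataclass

@dataclass
class Resource:
    """One space-ground integrated measurement and control resource s_j."""
    number: int        # j, unified number of the resource
    rtype: int         # TYPE: 1 = space-based, 0 = ground-based
    windows: list      # TS_j: [(t_bk, t_ek), ...] idle windows, chronological
    l_max: float       # LMAX_j: maximum acceptable measurement task load

    def ds(self):
        """DS_j: length of each idle time window (end minus start)."""
        return [te - tb for tb, te in self.windows]

# e.g. a ground station idle 08:00-10:00 and 14:00-15:30 (times in hours)
s1 = Resource(number=1, rtype=0, windows=[(8.0, 10.0), (14.0, 15.5)], l_max=6.0)
```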
From the perspective of the measurement and control task, elements in the measurement and control scene are described based on a visible time window. The measurement and control tasks can be described as:
TASK = {T, Sat, P, D, T_A, T_C, T_Oi}
wherein T is the number set of all measurement and control tasks, T = {T_1, T_2, ..., T_i, ..., T_n}; T_i denotes the number of a measurement and control task. In this formula and those below, i is the order of a measurement and control task and n is the total number of measurement and control tasks.
Sat denotes the sources of the measurement and control tasks, i.e. the corresponding task satellites: Sat = {Sat_1, Sat_2, ..., Sat_o}; Sat_i denotes the source satellite of the task with order i.
P is the priority of the measurement and control tasks, P = {P_1, P_2, ..., P_i, ..., P_n}; P_i denotes the priority of the task with order i.
D is the shortest measurement and control duration of each task, D = (d_1, d_2, ..., d_i, ..., d_n); d_i denotes the shortest duration of the task with order i.
T_A characterizes the interval in which each task can be measured and controlled:
T_A = {[t_1B, t_1E], [t_2B, t_2E], ..., [t_iB, t_iE], ..., [t_nB, t_nE]}
where [t_iB, t_iE] is the time window within which the task with order i can be performed; t_iB is the earliest start time and t_iE the latest finish time of the task.
T_C characterizes the actual measurement and control interval of each task:
T_C = {[t_1b, t_1e], [t_2b, t_2e], ..., [t_ib, t_ie], ..., [t_nb, t_ne]}
where [t_ib, t_ie] is the time window in which the task with order i is actually performed; t_ib is the actual start time and t_ie the actual end time of the task after scheduling.
T_Oi describes the set of visible arc segments corresponding to each task:
T_Oi = { [t_bk(s_im), t_ek(s_im)] }
where [t_bk(s_im), t_ek(s_im)] is the k-th visible time window of the m-th measurement and control resource for the task with order i; t_bk(s_im) is the start time and t_ek(s_im) the end time of that visible window.
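The TASK fields translate directly into a small Python sketch; the acceptance check below — a task can only be scheduled if its visible-arc set T_Oi contains an arc at least d_i long — is a hedged reading of the definitions above, with all names illustrative rather than taken from the patent:

```python
from dataclasses import dataclass

@dataclass
class Task:
    """One measurement and control task T_i."""
    number: int      # T_i
    sat: int         # Sat_i, source satellite
    priority: int    # P_i
    d_min: float     # d_i, shortest required measurement duration
    t_window: tuple  # (t_iB, t_iE): earliest start, latest finish
    arcs: dict       # T_Oi: {resource number j: [(t_bk, t_ek), ...]}

def acceptable(task):
    """A task can be accepted only if some visible arc is at least d_i long."""
    return any(te - tb >= task.d_min
               for windows in task.arcs.values()
               for tb, te in windows)

t1 = Task(number=1, sat=1, priority=3, d_min=0.5,
          t_window=(8.0, 12.0), arcs={1: [(8.5, 9.25)], 2: [(10.0, 10.4)]})
```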
(2) Measurement and control state design
The measurement and control state s is designed to express the different visibility/availability states in the measurement and control system using visible time windows, according to the utilization of the measurement and control resources, i.e. on the basis of time-space visibility. As shown in fig. 1, for a specific measurement and control scene, a 0-1 matrix representing the state of each measurement and control resource is used as the state of the scene; the size of the matrix is determined by the number of measurement and control resources and the division scale of the measurement and control time window. For each resource, a division scale is chosen according to the specific requirements and the daily working time of the resource is divided accordingly; the visibility of the resource in each divided interval is then marked, with the matrix entry for a visible/available unit time set to 0 and the entry for an invisible/unavailable unit time set to 1. This determines the usage of the measurement and control equipment at any given moment, i.e. the measurement and control state.
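A minimal sketch of the 0-1 state matrix construction described above, assuming a 1 h division scale and fractional-hour window endpoints (a slot is marked available if any part of it is covered by an idle window — one possible convention, not specified in the patent):

```python
def state_matrix(idle_windows, hours=24):
    """Build the M x 24 0-1 state matrix: entry (j, h) is 0 when resource j
    is visible/available at some point of the hour slot [h, h+1), else 1."""
    state = []
    for windows in idle_windows:
        row = [0 if any(tb < h + 1 and te > h for tb, te in windows) else 1
               for h in range(hours)]
        state.append(row)
    return state

# three resources, as in the worked scenario: a 3 x 24 matrix
m = state_matrix([[(8, 10)], [(0, 1), (23, 24)], []])
```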
(3) Design of measurement and control action
The measurement and control actions are designed with a layer-by-layer progressive decision idea: first whether to accept a measurement and control task, then which measurement and control resource performs the accepted task, and finally in which measurement and control time interval of that resource the task is carried out. The measurement and control action is therefore designed as:
X = (a_i, type, x_ij, y_jk, t_ib)
wherein a_i indicates whether the measurement and control task is accepted, type denotes the type of the resource accepting the task, x_ij denotes the number of the resource accepting the task, y_jk indicates that the task is executed in the k-th visible time window of resource j, and t_ib characterizes the actual start time of the task.
2. Evaluation index design for measurement and control scheduling performance
In the method, a comprehensive measurement and control performance evaluation index which takes three indexes of measurement and control task completion degree, measurement and control resource utilization balance degree and measurement and control resource load balance degree into consideration is designed and used as a decision basis for applying a DQN algorithm in measurement and control scheduling. The measurement and control resource scheduling expects to obtain a scheduling strategy which enables the comprehensive evaluation index to be maximum.
Specifically, the measurement and control resource scheduling performance evaluation index is set as r = s_R · RUR / load.
wherein s_R characterizes the satisfaction degree of the measurement and control tasks, load characterizes the balance degree of measurement and control resource utilization, and RUR characterizes the average utilization rate of all measurement and control resources.
The measurement and control task satisfaction degree s_R, the measurement and control resource load balance degree load, and the average utilization rate RUR of the measurement and control resources are each defined by a formula given as an equation image in the original publication.
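Because the three component indices are given only as equation images, the sketch below adopts one plausible reading — a priority-weighted completion ratio for s_R, the mean per-resource utilization for RUR, and a dispersion-based load term — purely as assumptions, to make the composite index r = s_R · RUR / load concrete:

```python
def reward(completed, priorities, busy, capacity):
    """r = s_R * RUR / load with ASSUMED component definitions.

    completed : list of 0/1 flags per task (scheduled or not)
    priorities: P_i per task
    busy      : time actually used on each resource j
    capacity  : total idle-window time available on each resource j
    """
    # s_R: priority-weighted fraction of completed tasks (assumed form)
    s_r = sum(p * c for p, c in zip(priorities, completed)) / sum(priorities)
    # RUR: average utilization rate over all resources
    rates = [b / cap for b, cap in zip(busy, capacity)]
    rur = sum(rates) / len(rates)
    # load: dispersion of per-resource loads (assumed form); 1 is added so a
    # perfectly balanced plan does not divide by zero
    mean = sum(busy) / len(busy)
    load = 1 + (sum((b - mean) ** 2 for b in busy) / len(busy)) ** 0.5
    return s_r * rur / load
```

Under this reading, a plan that completes every task while using resources evenly scores highest, matching the stated goal of maximizing the composite index.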
3. measurement and control resource scheduling scheme formation
According to the design of the measurement and control actions in step 1, forming the measurement and control scheduling scheme mainly means determining whether to accept a measurement and control task, determining the measurement and control resource that performs the task, and determining the measurement and control arc segment in which the task is completed. Specifically: the invention takes the visible arc segments, derived from the visible time windows, as the modeling basis of the measurement and control state, so for a specific task, whether the task is accepted is determined by judging whether a visible time window exists for it. Because the measurement and control resources and tasks are uniformly numbered during scene modeling, the visible arc segments satisfying the conditions can be obtained for a specific task, and the type and number of the resource completing the task can then be determined from the correspondence between visible arc segments and resources. In the design of the measurement and control state, the visible arc segment corresponding to a task is discretized, so the measurement and control arc slides over the selected visible arc segment according to the possible start times of the task, and the optimal measurement and control arc capable of completing the task is determined.
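The sliding of the measurement arc over a selected visible arc segment can be sketched as an earliest-feasible search; the discretization step and the occupancy check below are assumptions introduced for illustration, not details given in the patent:

```python
def slide_arc(visible, d, task_window, occupied, step=0.25):
    """Slide a candidate arc of duration d over the visible arc [vb, ve],
    inside the task's allowed interval [t_iB, t_iE], skipping intervals
    already occupied on the resource; return the earliest feasible
    [t_ib, t_ie], or None if the arc cannot fit."""
    vb, ve = visible
    tB, tE = task_window
    t = max(vb, tB)
    end = min(ve, tE)
    while t + d <= end + 1e-9:
        if not any(t < oe and t + d > ob for ob, oe in occupied):
            return (t, t + d)
        t += step       # slide forward by the discretization step
    return None
```

Taking the earliest feasible start is one simple policy; the DQN agent described below is what ultimately ranks the candidate arcs.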
Therefore, the forming flow of the measurement and control resource scheduling scheme is shown in fig. 2:
application of DQN algorithm in generation of measurement and control resource scheduling scheme
In the method, based on a deep reinforcement learning framework and a learning principle of DQN, the following measurement and control resource scheduling decision process can be constructed, so that a measurement and control resource scheduling strategy with optimal measurement and control efficiency is selected.
The implementation steps can be summarized as follows:
(1) When the task state at the current moment changes and the visible time windows of the measurement and control resources change, the measurement and control state of the system changes.
(2) The measurement and control environment is updated, scene features are extracted, and the measurement and control state of the system is updated.
(3) A decision strategy for the measurement and control action is selected according to the action selection rule of the deep reinforcement learning algorithm, so that the measurement and control resources are matched with the measurement and control tasks in time and space and the tasks are realized.
(4) The measurement and control scheduling result is evaluated and fed back with respect to the updates of the measurement and control environment and state caused by the selected strategy.
(5) The measurement and control decision strategy is updated with the deep reinforcement learning network according to the evaluation feedback, and the updates of the measurement and control scene and state are observed.
And through cyclic algorithm updating, selection and optimization of measurement and control resource allocation strategies are realized, and selection of an optimal measurement and control scheduling strategy is realized.
5. DQN-based measurement and control resource scheduling method implementation process
(1) Describe the measurement and control scene and define the basic physical elements in it. Based on the actual physical scene, the elements involved in the DQN method for measurement and control scheduling are collated and summarized, and the composition of the basic elements, such as the measurement and control state, the measurement and control actions, the measurement and control action reward and the measurement and control scheme, is determined.
(2) The method comprises the steps of initializing a deep Q learning measurement and control resource scheduling network, initializing a memory base according to actual capacity requirements, and initializing network parameters including learning rate, discount factors and structures and parameters of an actual value neural network and a target value neural network for describing a Q value.
(3) A measurement and control state s is designed according to the measurement and control scene model, the input of the measurement and control scheduling network is initialized, and the corresponding output is calculated. With probability ε a measurement and control action is selected at random, and with probability 1 - ε the action with the largest Q value output by the scheduling network is selected (the ε-greedy strategy); the corresponding action is then executed in the measurement and control resource scheduling network. After the action is executed, the reward r (i.e. the evaluation index of the measurement and control action) and the measurement and control state before the next action, i.e. the state s_{i+1} at the next moment, are obtained. According to the currently selected action and the current state, the Q values of the actual-value neural network and the current-value neural network in the scheduling network at the next moment, namely the actual Q value and the estimated Q value, are calculated.
(4) The four parameters (s_i, X_i, r_i, s_{i+1}) are stored together as one sample in the memory bank.
(5) A certain number of sample states are randomly drawn from the memory bank, and the target value of each state is calculated (the Q value is updated as the target value by means of the reward obtained after execution). The parameters of the actual-value neural network are updated by stochastic gradient descent, and after every N iterative updates the current parameters of the actual-value network are assigned to the target-value network, thereby updating the target-value network parameters in the measurement and control scheduling network. The parameters are updated continuously to train the measurement and control scheduling network.
(6) Through cyclic updating of the algorithm, the selection and optimization of the measurement and control resource allocation strategy are realized and the optimal measurement and control scheduling strategy is selected, completing the measurement and control resource scheduling process.
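The loop in steps (1)–(6) can be condensed into a runnable skeleton. The tabular stand-in for the two Q networks and all hyperparameter values are assumptions, intended only to show the ε-greedy selection, memory replay and periodic target-network synchronization described above:

```python
import random
from collections import deque

class DQNScheduler:
    def __init__(self, n_actions, capacity=1000, eps=0.1,
                 gamma=0.9, lr=0.1, sync_every=20):
        self.n_actions = n_actions
        self.memory = deque(maxlen=capacity)   # initialized memory bank
        self.eps, self.gamma, self.lr = eps, gamma, lr
        self.sync_every, self.steps = sync_every, 0
        # dict-of-lists stand-in for the actual-value / target-value networks
        self.q_actual, self.q_target = {}, {}

    def q(self, table, s):
        return table.setdefault(s, [0.0] * self.n_actions)

    def act(self, s):
        """epsilon-greedy selection of the measurement and control action."""
        if random.random() < self.eps:
            return random.randrange(self.n_actions)
        qs = self.q(self.q_actual, s)
        return qs.index(max(qs))

    def store(self, s, a, r, s_next):
        self.memory.append((s, a, r, s_next))   # sample (s_i, X_i, r_i, s_{i+1})

    def learn(self, batch_size=8):
        batch = random.sample(self.memory, min(batch_size, len(self.memory)))
        for s, a, r, s_next in batch:
            target = r + self.gamma * max(self.q(self.q_target, s_next))
            qs = self.q(self.q_actual, s)
            qs[a] += self.lr * (target - qs[a])    # gradient-step stand-in
        self.steps += 1
        if self.steps % self.sync_every == 0:      # copy parameters every N updates
            self.q_target = {s: v[:] for s, v in self.q_actual.items()}

# one transition stored and one learning step, with target sync every update
random.seed(0)
sched = DQNScheduler(n_actions=2, sync_every=1)
sched.store('s0', 0, 1.0, 's1')
sched.learn(batch_size=1)
```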
Example:
1. Describe the complex measurement and control scene. Taking a scene with 2 ground-based measurement and control resources, 1 space-based measurement and control resource and 9 measurement and control tasks to be completed as an example, the measurement and control resource scene is initialized and described uniformly. According to the actual measurement and control scene, from the perspective of the space-ground integrated measurement and control resources, the scene can be described in the following form:
the measurement and control resources of the space-ground integrated measurement and control system are as follows:
RESOURCE={S,TYPE,TS,DS,L,LMAX}
wherein S is the set of space-ground integrated measurement and control resources, S = {s_1, s_2, ..., s_j, ..., s_M}
TYPE denotes the type of the measurement and control resource: TYPE = 1 indicates a space-based resource, and TYPE = 0 indicates a ground-based resource;
TS characterizes an idle time window for each measurement and control resource (i.e. the time window currently available for measurement and control):
TS = {TS_1, TS_2, ..., TS_j, ..., TS_M}
TS_j = {[t_b1(s_j), t_e1(s_j)], [t_b2(s_j), t_e2(s_j)], ...}
DS characterizes the length of each idle time window of the measurement and control resources: ds_j^k = t_ek(s_j) - t_bk(s_j).
LS_j denotes the occupation of a single measurement and control resource by all medium and low orbit satellites: LS_j = {L_1^j, L_2^j, ..., L_i^j, ..., L_n^j}.
L denotes the occupation of the space-ground integrated measurement and control resources by all medium and low orbit satellites, specifically:
L = {LS_1, LS_2, ..., LS_j, ..., LS_M}
from the perspective of the measurement and control task, the description of the elements in the measurement and control scene based on the visible time window is as follows:
TASK={T,Sat,P,D,TA,TC,TOi}
wherein T is the set of measurement and control tasks of all medium and low orbit satellites, T = {T_1, T_2, ..., T_i, ..., T_n}
Sat denotes the sources of the measurement and control tasks, i.e. the corresponding task satellites: Sat = {Sat_1, Sat_2, ..., Sat_o}
P is the priority of the measurement and control tasks, P = {P_1, P_2, ..., P_i, ..., P_n}
D is the shortest measurement and control duration of each task, D = (d_1, d_2, ..., d_i, ..., d_n);
T_A characterizes the interval in which each task can be measured and controlled, T_A = {[t_1B, t_1E], [t_2B, t_2E], ..., [t_iB, t_iE], ..., [t_nB, t_nE]},
T_C characterizes the actual measurement and control interval of each task, T_C = {[t_1b, t_1e], [t_2b, t_2e], ..., [t_ib, t_ie], ..., [t_nb, t_ne]},
T_Oi describes the set of visible arc segments corresponding to each task: T_Oi = { [t_bk(s_im), t_ek(s_im)] }.
A measurement and control state s is designed according to the measurement and control scene model; for a specific scene, a 0-1 matrix representing the state of each measurement and control resource is taken as the measurement and control state of the scene. Taking 1 h as the division scale, there are 3 measurement and control resources in total in this scene, so for each day the state matrix has size 3 × 24, with the entry for a visible/available unit time set to 0 and the entry for an invisible/unavailable unit time set to 1. Accordingly, the measurement and control state in this case can be visualized as in fig. 4.
The measurement and control actions, i.e. the decision variables, are described as:
X = (a_i, type, x_ij, y_jk, t_ib)
wherein a_i indicates whether the measurement and control task is accepted, type denotes the type of the resource accepting the task, x_ij denotes the number of the resource accepting the task, y_jk indicates that the task is executed in the k-th visible time window of resource j, and t_ib characterizes the actual start time of the task.
The evaluation index of the measurement and control scheduling performance is expressed as r = s_R · RUR / load, which comprehensively evaluates the scheduling performance; s_R characterizes the satisfaction degree of the measurement and control tasks, load characterizes the balance degree of resource utilization, and RUR characterizes the average utilization rate of all measurement and control resources.
2. According to the requirements of the measurement and control scene, a convolutional neural network is constructed to describe the Q value in the measurement and control resource scheduling network. The actual-value neural network and the target-value neural network are two convolutional neural networks with the same structure but not fully identical parameters; each comprises 2 convolutional layers and 1 fully connected layer, and the sigmoid function is adopted as the activation function. During initialization of the deep Q-learning scheduling network, the memory bank is initialized according to the actual capacity requirement, and the network parameters are initialized, including the learning rate, the discount factor, and the relevant parameters of the actual-value and target-value neural networks describing the Q value.
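Since the kernel sizes, strides and padding of the two convolutional layers are not specified, the sketch below only sanity-checks the layer shapes with the standard convolution output-size formula, under assumed settings (1×3 kernels, stride 1, no padding, applied along the 24-slot time axis of the 3 × 24 state matrix):

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of one convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

# assumed configuration: two conv layers with 1x3 kernels, stride 1, no padding
h, w = 3, 24                 # the 3 x 24 measurement and control state matrix
for _ in range(2):           # 2 convolutional layers
    w = conv_out(w, kernel=3)
# the fully connected layer then sees h * w * channels input values
```

The small spatial extent of the state matrix is why the layer count (2 conv + 1 FC) stays low: deeper stacks with unpadded 3-wide kernels would quickly exhaust the 24 time slots.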
3. According to the detailed description of the measurement and control scene in step 1, the measurement and control state, the measurement and control actions, the measurement and control action reward and the measurement and control scheme are further refined. On this basis, the measurement and control state s is designed according to the scene model, the input of the scheduling network is initialized, and the corresponding output is calculated. With probability ε a measurement and control action is selected at random, and with probability 1 - ε the action with the largest Q value output by the scheduling network is selected (the ε-greedy strategy); the corresponding action is then executed in the measurement and control resource scheduling network. After the action is executed, the reward r and the measurement and control state before the next action, i.e. the state s_{i+1} at the next moment, are obtained. According to the currently selected action and the current state, the Q values of the actual-value and current-value neural networks in the scheduling network at the next moment are calculated.
4. The four parameters (s_i, X_i, r_i, s_{i+1}) are stored together as one sample in the memory bank.
5. A certain number of sample states are randomly drawn from the memory bank, and the target value of each state is calculated (the Q value is updated as the target value by means of the reward obtained after execution). The parameters of the actual-value neural network are updated by stochastic gradient descent, and after every N iterative updates the current parameters of the actual-value network are assigned to the target-value network, thereby updating the target-value network parameters in the scheduling network.
The parameters are updated continuously to train the measurement and control scheduling network.
6. And through cyclic algorithm updating, selection and optimization of measurement and control resource allocation strategies are realized, and selection of an optimal measurement and control scheduling strategy is realized. And finishing the measurement and control resource scheduling process.
Finally, the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit the present invention, and although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions, and all of them should be covered by the claims of the present invention.

Claims (6)

1. A measurement and control resource scheduling method based on deep Q learning is characterized in that: the method comprises the following steps:
s1: describing a complex measurement and control scene;
s2: designing evaluation indexes of measurement and control scheduling performance;
s3: forming a measurement and control resource scheduling scheme;
s4: the DQN algorithm is applied to the generation of the measurement and control resource scheduling scheme;
s5: and implementing the measurement and control resource scheduling method based on the DQN.
2. The measurement and control resource scheduling method based on deep Q learning according to claim 1, characterized in that: the step S1 specifically includes:
(1) description of entities in a measurement and control scenario
From the perspective of measurement and control resources of the space-ground integrated measurement and control system, elements in a measurement and control scene are described based on a visible time window;
the space-ground integrated measurement and control resources are described as follows:
RESOURCE={S,TYPE,TS,DS,L,LMAX}
wherein S is the set of space-ground integrated measurement and control resources, with all resources numbered uniformly: S = {s_1, s_2, ..., s_j, ..., s_M}; j is the number of a measurement and control resource and M is the total number of measurement and control resources;
TYPE denotes the type of the measurement and control resource: TYPE = 1 indicates a space-based resource, and TYPE = 0 indicates a ground-based resource;
TS represents an idle time window for each measurement and control resource, namely the current time window which can be used for measurement and control;
TS = {TS_1, TS_2, ..., TS_j, ..., TS_M}
TS_j = {[t_b1(s_j), t_e1(s_j)], [t_b2(s_j), t_e2(s_j)], ...}
TS_j characterizes all available time windows, i.e. idle time windows, of the j-th measurement and control resource; t_bk(s_j) and t_ek(s_j) respectively denote the start time and the end time of the k-th visible time window of the j-th resource, the visible windows being numbered in chronological order, and so on;
DS characterizes the length of each idle time window of the measurement and control resources:
ds_j^k = t_ek(s_j) - t_bk(s_j)
where ds_j^k is the length of the k-th idle time window of the j-th measurement and control resource;
LS_j denotes the occupation of a single measurement and control resource by all medium and low orbit satellites:
LS_j = {L_1^j, L_2^j, ..., L_i^j, ..., L_n^j}
where L_i^j denotes the load that measurement and control task i places on the single resource j; i is the number of a measurement and control task and n is the total number of measurement and control tasks;
L denotes the occupation of the space-ground integrated measurement and control resources by all medium and low orbit satellites, specifically:
L = {LS_1, LS_2, ..., LS_j, ..., LS_M}
where LS_j denotes the load placed on the single resource j by all measurement and control tasks;
LMAX = {LMAX_1, LMAX_2, ..., LMAX_j, ..., LMAX_M}
where LMAX_j denotes the largest measurement and control task load that resource j can accept, i.e. the maximum load of the resource;
from the perspective of a measurement and control task, elements in a measurement and control scene are described based on a visible time window; the measurement and control task is described as follows:
TASK = {T, Sat, P, D, T_A, T_C, T_Oi}
wherein T is the number set of all measurement and control tasks, T = {T_1, T_2, ..., T_i, ..., T_n}; T_i denotes the number of a measurement and control task; in this formula and those below, i is the order of a measurement and control task and n is the total number of measurement and control tasks;
Sat denotes the sources of the measurement and control tasks, i.e. the corresponding task satellites: Sat = {Sat_1, Sat_2, ..., Sat_o}; Sat_i denotes the source satellite of the task with order i;
P is the priority of the measurement and control tasks, P = {P_1, P_2, ..., P_i, ..., P_n}; P_i denotes the priority of the task with order i;
D is the shortest measurement and control duration of each task, D = (d_1, d_2, ..., d_i, ..., d_n); d_i denotes the shortest duration of the task with order i;
T_A characterizes the interval in which each task can be measured and controlled:
T_A = {[t_1B, t_1E], [t_2B, t_2E], ..., [t_iB, t_iE], ..., [t_nB, t_nE]}
where [t_iB, t_iE] is the time window within which the task with order i can be performed; t_iB is the earliest start time and t_iE the latest finish time of the task;
T_C characterizes the actual measurement and control interval of each task:
T_C = {[t_1b, t_1e], [t_2b, t_2e], ..., [t_ib, t_ie], ..., [t_nb, t_ne]}
where [t_ib, t_ie] is the time window in which the task with order i is actually performed; t_ib is the actual start time and t_ie the actual end time of the task after scheduling;
T_Oi describes the set of visible arc segments corresponding to each task:
T_Oi = { [t_bk(s_im), t_ek(s_im)] }
where [t_bk(s_im), t_ek(s_im)] is the k-th visible time window of the m-th measurement and control resource for the task with order i; t_bk(s_im) is the start time and t_ek(s_im) the end time of the visible window;
(2) measurement and control state design
The measurement and control state s is designed to express the different visibility/availability states in the measurement and control system using visible time windows, according to the utilization of the measurement and control resources, i.e. on the basis of time-space visibility; for a specific measurement and control scene, a 0-1 matrix representing the state of each measurement and control resource is used as the state of the scene, and the size of the matrix is determined by the number of measurement and control resources and the division scale of the measurement and control time window; for each resource, a division scale is chosen according to the specific requirements and the daily working time of the resource is divided accordingly, and the visibility of the resource in each divided interval is marked, with the matrix entry for a visible/available unit time set to 0 and the entry for an invisible/unavailable unit time set to 1, thereby determining the usage of the measurement and control equipment at any given moment, i.e. the measurement and control state;
the step S3 specifically includes:
(3) design of measurement and control action
The design of the measurement and control actions adopts a progressive decision idea layer by layer to determine whether to accept the measurement and control task and receive the measurement and control resources of the measurement and control task, the measurement and control resources of the accepted task are specifically used in a measurement and control time interval of the task, and the measurement and control actions are designed as follows:
Figure FDA0002560215500000033 (equation image: definition of the measurement and control action)
where ai indicates whether the measurement and control task is accepted, type denotes the type of measurement and control resource receiving the task, xij denotes the number of the measurement and control resource receiving the task, yjk indicates that the measurement and control task is executed in the k-th visible time window of resource j, and tib denotes the actual start time of the measurement and control task.
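The action above bundles five decisions (accept flag, resource type, resource number, visible-window index, actual start time). A minimal sketch of such an action record, with illustrative field names, might be:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TTCAction:
    """One measurement and control action: mirrors (ai, type, xij, yjk, tib).
    Field names are illustrative assumptions, not the patent's notation."""
    accept: bool                          # ai: accept the task or not
    resource_type: Optional[str] = None   # type: kind of resource used
    resource_id: Optional[int] = None     # xij: number of the chosen resource
    window_index: Optional[int] = None    # yjk: k-th visible window of resource j
    start_time: Optional[float] = None    # tib: actual start time of the task

def is_valid(action: TTCAction) -> bool:
    """Progressive-decision consistency check: a rejected task carries no
    resource assignment; an accepted task must name resource, window, start."""
    if not action.accept:
        return action.resource_id is None
    return None not in (action.resource_type, action.resource_id,
                        action.window_index, action.start_time)
```

This makes the layer-by-layer structure explicit: later fields are only meaningful once the accept decision is positive.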
3. The measurement and control resource scheduling method based on deep Q learning according to claim 1, characterized in that: the step S2 specifically includes:
designing a comprehensive measurement and control performance evaluation index that jointly considers three indicators (measurement and control task completion degree, resource utilization balance degree, and resource load balance degree) as the decision basis for applying the DQN algorithm to measurement and control scheduling; the measurement and control resource scheduling aims to obtain the scheduling strategy that maximizes this comprehensive evaluation index;
setting the evaluation index of measurement and control resource scheduling performance as r = sR * RUR / load;
where sR denotes the satisfaction degree of the measurement and control tasks, load denotes the balance degree of measurement and control resource utilization, and RUR denotes the average utilization rate of all measurement and control resources;
the satisfaction degree of the measurement and control task is as follows:
Figure FDA0002560215500000041, Figure FDA0002560215500000042 (equation images: formulas for the measurement and control task satisfaction degree sR)
and (3) measuring and controlling the resource load balance degree:
Figure FDA0002560215500000043 (equation image: formula for the measurement and control resource load balance degree load)
average utilization rate of measurement and control resources:
Figure FDA0002560215500000044 (equation image: formula for the average measurement and control resource utilization rate RUR)
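The composite index r = sR * RUR / load of this claim can be sketched as follows; since the component formulas appear only as equation images, the definitions of sR, RUR, and load below are plausible stand-ins, not the patented formulas:

```python
import numpy as np

def scheduling_reward(n_accepted, n_tasks, busy_time, total_time):
    """Composite scheduling index r = sR * RUR / load (claim 3).
    Stand-in component definitions (assumptions):
      sR   - fraction of measurement and control tasks accepted,
      RUR  - mean per-resource utilization,
      load - max/mean utilization ratio (1.0 means perfectly balanced)."""
    util = np.asarray(busy_time, dtype=float) / total_time
    s_r = n_accepted / n_tasks
    rur = util.mean()
    load = util.max() / util.mean() if util.mean() > 0 else 1.0
    return s_r * rur / load

# 8 of 10 tasks accepted; two resources each busy 4 h out of an 8 h window.
r_balanced = scheduling_reward(8, 10, [4.0, 4.0], 8.0)
# Same total work concentrated on one resource: load imbalance halves r.
r_skewed = scheduling_reward(8, 10, [8.0, 0.0], 8.0)
```

With these stand-ins the index rewards accepting more tasks and using resources heavily but evenly, matching the three indicators the claim names.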
4. the measurement and control resource scheduling method based on deep Q learning according to claim 1, characterized in that: the step S3 specifically includes:
according to the design of the measurement and control actions in S1, forming the measurement and control scheduling scheme mainly consists of deciding whether to accept a measurement and control task, determining the measurement and control resource that performs it, and determining the measurement and control arc segment in which it is completed;
specifically: taking the visible time window, i.e., the visible arc segment, as the modeling basis of the measurement and control state, whether a specific measurement and control task is accepted is decided by judging whether a visible time window for that task exists; in modeling the measurement and control scenario, measurement and control resources and tasks are uniformly numbered, the visible arc segments satisfying the conditions of a specific task are solved, and the type and number of the resource completing the task are determined from the correspondence between visible arc segments and measurement and control resources;
in the design of the measurement and control state, the visible arc segment corresponding to a measurement and control task is discretized; the measurement arc slides over the selected visible arc segment according to the possible start times of the task, and the optimal measurement arc capable of completing the task is determined.
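A minimal sketch of the sliding-arc search described above, assuming a hypothetical per-start-time quality measure score_fn:

```python
def best_arc(window_start, window_end, duration, step, score_fn):
    """Slide a candidate measurement arc of fixed duration across the
    discretized visible arc segment and keep the start time with the best
    score. score_fn is a hypothetical task-specific quality measure (e.g.
    elevation angle or conflict cost at that start time)."""
    best_t, best_score = None, float("-inf")
    t = window_start
    while t + duration <= window_end:   # arc must fit inside the visible window
        s = score_fn(t)
        if s > best_score:
            best_t, best_score = t, s
        t += step                       # discretization step of the arc
    return best_t

# Prefer arcs starting as close to t=5 as possible inside window [0, 10].
start = best_arc(0.0, 10.0, 2.0, 1.0, lambda t: -abs(t - 5.0))
```

`best_arc` returns `None` when no arc of the required duration fits, which corresponds to rejecting the task in the progressive decision scheme.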
5. The measurement and control resource scheduling method based on deep Q learning according to claim 1, characterized in that: the step S4 specifically includes:
(1) when the task state at the current moment changes and the visible time window of the measurement and control resource changes, the measurement and control state of the system changes;
(2) updating a measurement and control environment, extracting scene characteristics, and updating the measurement and control state of the system;
(3) selecting a decision strategy of the measurement and control action according to an action selection rule of a deep reinforcement learning algorithm, so that measurement and control resources are matched with the measurement and control task in time and space, and the measurement and control task is realized;
(4) evaluating and feeding back the measurement and control scheduling result with respect to the updates of the measurement and control environment and state caused by the selected measurement and control strategy;
(5) updating the measurement and control decision strategy by using a deep reinforcement learning network according to the evaluation feedback result of the measurement and control strategy, and observing the measurement and control scene and the updating of the measurement and control state;
and through cyclic algorithm updates, the selection and optimization of measurement and control resource allocation strategies is realized, yielding the optimal measurement and control scheduling strategy.
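Steps (1)-(5) above form a standard agent-environment interaction loop, which might be sketched as follows with hypothetical env/agent interfaces (the method names are assumptions):

```python
def run_episode(env, agent, max_steps=100):
    """One pass of the cycle in steps (1)-(5): observe the measurement and
    control state, select an action, collect the evaluation feedback, learn,
    and repeat until the scheduling episode ends."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(state)                       # step (3): select strategy
        next_state, reward, done = env.step(action)     # steps (1)-(2): state update
        agent.learn(state, action, reward, next_state)  # steps (4)-(5): feedback, update
        total_reward += reward
        state = next_state
        if done:
            break
    return total_reward
```

Repeating this loop over many episodes is the "cyclic algorithm updating" by which the allocation strategy is optimized.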
6. The measurement and control resource scheduling method based on deep Q learning according to claim 1, characterized in that: the step S5 specifically includes:
(1) describing the measurement and control scenario and defining the basic physical elements in it; based on the actual physical scenario, the elements involved in applying the DQN method to measurement and control scheduling are sorted and summarized, and the composition of the basic elements, namely the measurement and control state, the measurement and control action, the action reward, and the scheduling scheme, is determined;
(2) initializing the deep Q-learning measurement and control resource scheduling network, initializing the memory bank according to the actual capacity requirement, and initializing the network parameters, including the learning rate, the discount factor, and the structures and parameters of the actual-value neural network and the target-value neural network describing the Q value;
(3) designing the measurement and control state s according to the measurement and control scene model, initializing the input of the measurement and control scheduling network, and computing the corresponding output; selecting a measurement and control action at random with probability ε, or, with probability 1−ε (the greedy strategy), selecting the action with the largest Q value output by the scheduling network, and executing the corresponding action in the measurement and control resource scheduling network; obtaining the reward r after the action is executed, i.e., the evaluation index of the measurement and control action, and the measurement and control state before the next action, i.e., the state si+1 at the next moment; computing, from the currently selected action and the current state, the Q values of the actual-value neural network and the target-value neural network in the scheduling network at the next moment, namely the estimated Q value and the target Q value;
(4) storing the four parameters (si, ai, ri, si+1) as one sample in the memory bank;
(5) randomly sampling a certain number of sample states from the memory bank, calculating the target value of each state, and updating the Q value toward this target using the reward obtained after execution; updating the actual-value neural network parameters by stochastic gradient descent, and after every N iterations of updating, assigning the current parameters of the actual-value neural network to the target-value neural network, thereby updating the target-value network parameters in the measurement and control scheduling network; the scheduling network is trained by continually updating the parameters in this way;
(6) through cyclic algorithm updates, realizing the selection and optimization of measurement and control resource allocation strategies and the selection of the optimal measurement and control scheduling strategy, thereby completing the measurement and control resource scheduling process.
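Steps (2)-(6) follow the standard DQN recipe: a replay memory, ε-greedy action selection, stochastic-gradient TD updates, and a target network synchronized every N training steps. A compact sketch, with a linear Q-function standing in for the patent's neural networks (an assumption made for brevity):

```python
import random
from collections import deque

import numpy as np

class DQNScheduler:
    """Compact DQN sketch for steps (2)-(6). All hyperparameter values here
    are illustrative assumptions."""

    def __init__(self, state_dim, n_actions, lr=0.01, gamma=0.9,
                 epsilon=0.1, memory_size=1000, sync_every=50):
        self.W = np.zeros((state_dim, n_actions))   # actual-value (current) net
        self.W_target = self.W.copy()               # target-value net
        self.memory = deque(maxlen=memory_size)     # replay memory bank
        self.lr, self.gamma, self.epsilon = lr, gamma, epsilon
        self.sync_every, self.train_steps = sync_every, 0
        self.n_actions = n_actions

    def q(self, state, W=None):
        return state @ (self.W if W is None else W)

    def act(self, state):
        # epsilon-greedy: explore with probability epsilon, otherwise greedy.
        if random.random() < self.epsilon:
            return random.randrange(self.n_actions)
        return int(np.argmax(self.q(state)))

    def remember(self, s, a, r, s_next, done):
        # step (4): store (s_i, a_i, r_i, s_{i+1}) as one sample.
        self.memory.append((s, a, r, s_next, done))

    def train_step(self, batch_size=16):
        # step (5): sample from memory, update toward the target Q value.
        if len(self.memory) < batch_size:
            return
        for s, a, r, s_next, done in random.sample(self.memory, batch_size):
            # target Q from the target net, estimated Q from the current net.
            target = r if done else r + self.gamma * np.max(
                self.q(s_next, self.W_target))
            td_error = target - self.q(s)[a]
            self.W[:, a] += self.lr * td_error * s   # SGD on squared TD error
        self.train_steps += 1
        if self.train_steps % self.sync_every == 0:
            self.W_target = self.W.copy()            # sync target-value net
```

The periodic copy into `W_target` corresponds to assigning the actual-value network parameters to the target-value network after every N iterations, which stabilizes the target used in the TD update.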
CN202010609039.9A 2020-06-29 2020-06-29 Measurement and control resource scheduling method based on deep Q learning Active CN111767991B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010609039.9A CN111767991B (en) 2020-06-29 2020-06-29 Measurement and control resource scheduling method based on deep Q learning


Publications (2)

Publication Number Publication Date
CN111767991A true CN111767991A (en) 2020-10-13
CN111767991B CN111767991B (en) 2023-08-15

Family

ID=72724129

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010609039.9A Active CN111767991B (en) 2020-06-29 2020-06-29 Measurement and control resource scheduling method based on deep Q learning

Country Status (1)

Country Link
CN (1) CN111767991B (en)


Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140277599A1 (en) * 2013-03-13 2014-09-18 Oracle International Corporation Innovative Approach to Distributed Energy Resource Scheduling
CN107798388A (en) * 2017-11-23 2018-03-13 航天天绘科技有限公司 The method of TT&C Resources dispatching distribution based on Multi Agent and DNN
CN109388484A (en) * 2018-08-16 2019-02-26 广东石油化工学院 A kind of more resource cloud job scheduling methods based on Deep Q-network algorithm
CN109409763A (en) * 2018-11-08 2019-03-01 北京航空航天大学 A kind of dynamic test assignment dispatching method and dispatching platform based on Greedy grouping strategy
CN109542613A (en) * 2017-09-22 2019-03-29 中兴通讯股份有限公司 Distribution method, device and the storage medium of service dispatch in a kind of CDN node
CN109729586A (en) * 2017-10-30 2019-05-07 上海诺基亚贝尔股份有限公司 Dispatching method, equipment and computer-readable medium based on window
CN109960544A (en) * 2019-03-26 2019-07-02 中国人民解放军国防科技大学 Task parallel scheduling method based on data driving type agile satellite
CN110781614A (en) * 2019-12-06 2020-02-11 北京工业大学 Shipboard aircraft tripping recovery online scheduling method based on deep reinforcement learning
CN111026549A (en) * 2019-11-28 2020-04-17 国网甘肃省电力公司电力科学研究院 Automatic test resource scheduling method for power information communication equipment
CN111026548A (en) * 2019-11-28 2020-04-17 国网甘肃省电力公司电力科学研究院 Power communication equipment test resource scheduling method for reverse deep reinforcement learning
CN111162831A (en) * 2019-12-24 2020-05-15 中国科学院遥感与数字地球研究所 Ground station resource scheduling method


Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
BERND WASCHNECK等: "Optimization of global production scheduling with deep reinforcement learning", 《51ST CIRP CONFERENCE ON MANUFACTURING SYSTEMS》, vol. 72, pages 1264 - 1269 *
XIAOYU CHEN等: "A mixed integer linear programming model for multi-satellite scheduling", 《EUROPEAN JOURNAL OF OPERATIONAL RESEARCH》, vol. 275, no. 2, pages 694 - 707 *
YI WU等: "A TT&C Resources Schedule Method Based on Markov Decision Process", 《PROCEEDINGS OF 2018 CHINESE INTELLIGENT SYSTEMS CONFERENCE》, pages 815 - 825 *
刘冰雁等: "基于改进DQN的复合模式在轨服务资源分配", 《航空学报》, vol. 41, no. 5, pages 1 - 9 *
康宁等: "基于任务开始时刻的天地基测控资源调度模型", 《装备指挥技术学院学报》, vol. 22, no. 6, pages 97 - 101 *
张天骄等: "基于混合蚁群优化的天地一体化调度方法", 《系统工程与电子技术》, vol. 38, no. 7, pages 1555 - 1562 *
武艺: "基于深度强化学习的多星测控资源调度方法研究", 《中国优秀硕士学位论文全文数据库 工程科技II辑》, no. 2022, pages 031 - 341 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113613332A (en) * 2021-07-14 2021-11-05 广东工业大学 Spectrum resource allocation method and system based on cooperative distributed DQN (differential Quadrature reference network) combined simulated annealing algorithm
CN113613332B (en) * 2021-07-14 2023-06-09 广东工业大学 Spectrum resource allocation method and system based on cooperative distributed DQN (differential signal quality network) joint simulated annealing algorithm
CN113779856A (en) * 2021-09-15 2021-12-10 成都中科合迅科技有限公司 Discrete particle swarm algorithm modeling method for electronic system function online recombination
CN113779856B (en) * 2021-09-15 2023-06-27 成都中科合迅科技有限公司 Discrete particle swarm optimization modeling method for electronic system function online recombination



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant