CN111767991B - Measurement and control resource scheduling method based on deep Q learning - Google Patents

Measurement and control resource scheduling method based on deep Q learning Download PDF

Info

Publication number
CN111767991B
CN111767991B CN202010609039.9A
Authority
CN
China
Prior art keywords
measurement
control
task
resource
scheduling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010609039.9A
Other languages
Chinese (zh)
Other versions
CN111767991A (en)
Inventor
郭茂耘 (Guo Maoyun)
武艺 (Wu Yi)
唐奇 (Tang Qi)
梁皓星 (Liang Haoxing)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University
Original Assignee
Chongqing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University filed Critical Chongqing University
Priority to CN202010609039.9A priority Critical patent/CN111767991B/en
Publication of CN111767991A publication Critical patent/CN111767991A/en
Application granted granted Critical
Publication of CN111767991B publication Critical patent/CN111767991B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G06N 3/045 Combinations of networks (Physics — Computing — Computing arrangements based on specific computational models — Neural networks — Architecture, e.g. interconnection topology)
    • G06N 3/048 Activation functions (Neural networks — Architecture)
    • G06N 3/08 Learning methods (Neural networks)
    • G06Q 10/06315 Needs-based resource requirements planning or analysis (ICT specially adapted for administrative, commercial, financial, managerial or supervisory purposes — Resource planning, allocation, distributing or scheduling for enterprises or organisations)


Abstract

The invention relates to a measurement and control resource scheduling method based on deep Q learning, belonging to the field of intelligent scheduling. The method comprises the following steps: S1: describing the complex measurement and control scene; S2: designing the measurement and control scheduling performance evaluation indexes; S3: forming the measurement and control resource scheduling scheme; S4: applying the DQN algorithm to the generation of the measurement and control resource scheduling scheme; S5: implementing the DQN-based measurement and control resource scheduling method. In a complex measurement and control environment, and without requiring an accurate model of that environment, the method can generate a measurement and control resource scheduling strategy adapted to the measurement and control scene, thereby maximizing the measurement and control resource scheduling efficiency.

Description

Measurement and control resource scheduling method based on deep Q learning
Technical Field
The invention belongs to the field of intelligent scheduling, and relates to a measurement and control resource scheduling method based on deep Q learning.
Background
At present, methods for solving the satellite measurement and control resource scheduling problem mainly include: intelligent algorithms such as the ant colony algorithm, the particle swarm algorithm and the SVM method; deterministic algorithms such as the Lagrangian relaxation algorithm; and heuristic algorithms such as the greedy algorithm, the neighborhood search algorithm and the simulated annealing algorithm. Research on space-earth integrated measurement and control resources is relatively scarce and has mostly proceeded from traditional algorithms, such as the Lagrangian relaxation algorithm, the ant colony algorithm and the genetic algorithm; applications of deep reinforcement learning algorithms in this area remain comparatively rare.
The invention mainly resolves the conflict between measurement and control resources and measurement and control objects caused by the ever-increasing number of measurement and control tasks. From the perspective of the visibility between measurement and control resources and measurement and control objects, a measurement and control scene based on measurement and control time windows is constructed, the optimal running period of each measurement and control task is solved by deep Q learning (Deep Q-Network, DQN), and an optimal measurement and control scheduling scheme is finally formed, realizing the optimal running of the measurement and control system under specific indexes.
Disclosure of Invention
In view of the above, the present invention aims to provide a measurement and control resource scheduling method based on deep Q learning. The conflict between the existing measurement and control tasks and the number of measurement and control resources is growing increasingly acute: while the measurement and control resources are limited in number, the measurement and control tasks are still constrained by various conditions such as the visibility between measurement and control resources and measurement and control objects, the measurement and control duration, and the priority of the measurement and control tasks, so that measurement and control resource scheduling becomes a complex combinatorial optimization problem under multiple space-time constraints. A single kind of measurement and control resource differs in, and is limited in, its measurement and control service and measurement and control range, while the measurement and control tasks tend to become complex and diversified, continuously increasing the difficulty of measurement and control scheduling decisions; joint scheduling of the space-earth measurement and control resources is therefore necessary, so that the comprehensive scheduling performance of the space-earth integrated measurement and control resources is optimal.
The invention aims to construct a measurement and control resource scheduling method based on deep reinforcement learning, which uses deep reinforcement learning to realize intelligent scheduling of the space-earth integrated measurement and control resources, performs more accurate abstraction and feature extraction of the measurement and control system and the measurement and control scene, and finds a measurement and control resource scheduling scheme adapted to the scene, so as to complete the measurement and control tasks and improve the comprehensive efficiency of measurement and control resource utilization. A novel application of the DQN algorithm is realized by abstracting the resource scheduling problem under multiple constraint conditions.
In order to achieve the above purpose, the present invention provides the following technical solutions:
a measurement and control resource scheduling method based on deep Q learning comprises the following steps:
s1: describing a complex measurement and control scene;
s2: designing measurement and control scheduling performance evaluation indexes;
s3: forming a measurement and control resource scheduling scheme;
s4: the DQN algorithm is applied to the generation of a measurement and control resource scheduling scheme;
s5: and (3) implementing the DQN-based measurement and control resource scheduling method.
Optionally, the step S1 specifically includes:
(1) Description of entities in measurement and control scenarios
From the perspective of measurement and control resources of the heaven-earth integrated measurement and control system, describing elements in a measurement and control scene based on a visible time window;
The space-earth integrated measurement and control resources are described as follows:
RESOURCE = {S, TYPE, TS, D_S, L, L_MAX}
wherein S is the set of space-earth integrated measurement and control resources, in which measurement and control resources of multiple types are numbered uniformly, S = {s_1, s_2, ..., s_j, ..., s_M}; j is the number of a measurement and control resource and M is the total number of all measurement and control resources;
TYPE characterizes the type of a measurement and control resource: the resource is a space-based measurement and control resource when TYPE is 1 and a ground-based measurement and control resource when TYPE is 0;
TS characterizes the idle time windows of each measurement and control resource, namely the time windows currently available for measurement and control;
TS = {TS_1, TS_2, ..., TS_j, ..., TS_M}
   = {[t_b1(s_1), t_e1(s_1)], [t_b2(s_1), t_e2(s_1)], ..., [t_b1(s_2), t_e1(s_2)], [t_b2(s_2), t_e2(s_2)], ..., [t_b1(s_M), t_e1(s_M)], ...}
TS_j characterizes all available time windows, i.e. idle time windows, of the jth measurement and control resource; t_b1(s_j) and t_e1(s_j) respectively represent the start time and the end time of the 1st visible time window of the jth measurement and control resource; the visible windows are numbered in chronological order, and so on;
D_S characterizes the length of each idle time window of a measurement and control resource;
D_Sjk = t_ek(s_j) − t_bk(s_j) represents the length of the kth idle time window of the jth measurement and control resource;
L_Sj represents the occupation of a single measurement and control resource by all medium and low orbit satellites; L_ij represents the load occupation of measurement and control task i on a single measurement and control resource j, where i denotes the order of the measurement and control tasks and n is the total number of measurement and control tasks;
L represents the occupation of the space-earth integrated measurement and control resources by all medium and low orbit satellites; specifically:
L = {L_S1, L_S2, ..., L_Sj, ..., L_SM}
  = {L_1, L_2, ..., L_i, ..., L_n},
L_Sj represents the load occupation of all measurement and control tasks on a single measurement and control resource j;
L_MAX = {L_MAX1, L_MAX2, ..., L_MAXj, ..., L_MAXM}
L_MAXj represents the maximum receivable measurement and control task load of measurement and control resource j, i.e. the maximum load of the measurement and control resource;
from the perspective of measurement and control tasks, describing elements in a measurement and control scene based on a visible time window; the measurement and control tasks are described as follows:
TASK = {T, Sat, P, D, T_A, T_C, T_Oi}
wherein T is the numbered set of all measurement and control tasks, T = {T_1, T_2, ..., T_i, ..., T_n};
T_i represents the number of a measurement and control task; in this and the following formulas, i is the order of the measurement and control tasks and n is the total number of measurement and control tasks;
Sat characterizes the sources of the measurement and control tasks, namely the corresponding task satellites, Sat = {Sat_1, Sat_2, ..., Sat_o};
Sat_i represents the source satellite of the measurement and control task with order i;
P is the priority of the measurement and control tasks, P = {P_1, P_2, ..., P_i, ..., P_n}; P_i represents the priority of the measurement and control task with order i;
D is the shortest measurement and control duration corresponding to each task, D = {d_1, d_2, ..., d_i, ..., d_n}; d_i represents the shortest duration of the measurement and control task with order i;
T_A represents the time intervals during which the measurement and control tasks can be measured and controlled:
T_A = {[t_1B, t_1E], [t_2B, t_2E], ..., [t_iB, t_iE], ..., [t_nB, t_nE]};
[t_iB, t_iE] denotes the time window available to the measurement and control task with order i, where t_iB is the earliest start time and t_iE the latest end time of the task;
T_C represents the actual measurement and control intervals of the tasks:
T_C = {[t_1b, t_1e], [t_2b, t_2e], ..., [t_ib, t_ie], ..., [t_nb, t_ne]};
[t_ib, t_ie] denotes the time window in which the measurement and control task with order i is actually performed, where t_ib is the actual start time and t_ie the actual end time of the task after scheduling;
T_Oi describes the set of visible arc segments corresponding to each task;
[t_bk(s_im), t_ek(s_im)] represents the kth visible time window of the mth measurement and control resource for the measurement and control task with order i, where t_bk(s_im) is the start time and t_ek(s_im) the end time of the visible window;
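For illustration only, the RESOURCE and TASK descriptions above can be carried by simple data structures; all field and class names below are illustrative shorthand, not notation from the patent:

```python
from dataclasses import dataclass

@dataclass
class Resource:
    """One ground-based or space-based measurement and control resource."""
    sid: int            # j: uniform resource number
    rtype: int          # TYPE: 1 = space-based, 0 = ground-based
    idle_windows: list  # TS_j: [(t_bk, t_ek), ...] in chronological order
    max_load: float     # L_MAXj: maximum receivable task load
    load: float = 0.0   # current occupation by scheduled tasks

    def window_lengths(self):
        """D_Sjk: length of each idle time window of this resource."""
        return [te - tb for tb, te in self.idle_windows]

@dataclass
class Task:
    tid: int            # i: order number of the task
    sat: int            # Sat_i: source satellite
    priority: float     # P_i
    min_duration: float # d_i: shortest measurement and control time
    allowed: tuple      # [t_iB, t_iE]: earliest start / latest end
    visible_arcs: list  # T_Oi: [(resource id, (t_b, t_e)), ...]

r1 = Resource(sid=1, rtype=0, idle_windows=[(0, 600), (900, 1500)], max_load=10.0)
assert r1.window_lengths() == [600, 600]
```

A scheduler would hold one `Resource` per numbered resource and one `Task` per numbered task, matching the uniform numbering described above.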
(2) Measurement and control state design
The measurement and control state s is designed on the basis of the utilization of the measurement and control resources, i.e. on time-space visibility, expressing the different visible/available states in the measurement and control system through the visible time windows. For a specific measurement and control scene, a 0-1 matrix representing the state of each measurement and control resource is used as the state of the scene; the size of the matrix is determined by the number of measurement and control resources and the division scale of the measurement and control time window. For each measurement and control resource, a division scale is determined according to specific requirements, the daily working time of the resource is divided accordingly, and the visible state of each divided time interval of the measurement and control equipment is marked: the matrix entry corresponding to a visible/usable unit time is set to 0 and the entry corresponding to an invisible/unusable unit time is set to 1, thereby determining the usage of the measurement and control equipment at any given moment, i.e. the measurement and control state;
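The 0-1 state matrix described above can be sketched as follows; the division scale, the window data and the function name are illustrative assumptions:

```python
import numpy as np

def state_matrix(resources_windows, horizon, step):
    """Build the 0-1 measurement and control state matrix.

    resources_windows: per resource, a list of (t_begin, t_end) usable intervals
    horizon: daily working time of the resources (same time unit as step)
    step: division scale of the measurement and control time window

    Entry [j, k] is 0 when resource j is visible/usable in slot k, 1 otherwise,
    matching the 0/1 convention of the state design.
    """
    n_slots = horizon // step
    s = np.ones((len(resources_windows), n_slots), dtype=np.int8)  # 1 = unusable
    for j, windows in enumerate(resources_windows):
        for tb, te in windows:
            s[j, tb // step : te // step] = 0                      # 0 = usable
    return s

# Two resources over a 60-minute horizon with 10-minute slots:
m = state_matrix([[(0, 20), (40, 60)], [(10, 30)]], horizon=60, step=10)
# row 0: slots 0-1 and 4-5 usable; row 1: slots 1-2 usable
```

The matrix size is the resource count times the number of division slots, as the text states.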
(3) Design of measurement and control actions
The design of the measurement and control actions adopts a layer-by-layer progressive decision idea: it determines in turn whether to accept a measurement and control task, which measurement and control resource accepts the task, and the specific measurement and control time interval used for the task. The measurement and control action is designed as:
X_i = (a_i, type, x_ij, y_jk, t_ib)
where a_i represents whether the measurement and control task is accepted, type represents the type of the measurement and control resource accepting the task, x_ij represents the number of the accepting measurement and control resource, y_jk indicates that the task is executed in the kth visible time window of resource j, and t_ib characterizes the actual start time of the task.
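As a minimal sketch, the five-component action X_i can be modeled as a named tuple; the field names are illustrative, not from the patent:

```python
from collections import namedtuple

# X_i = (a_i, type, x_ij, y_jk, t_ib)
Action = namedtuple("Action", [
    "accept",    # a_i: whether the measurement and control task is accepted
    "rtype",     # type: 1 = space-based resource, 0 = ground-based resource
    "resource",  # x_ij: number of the accepting measurement and control resource
    "window",    # y_jk: index k of the visible window of resource j that is used
    "start",     # t_ib: actual start time of the task
])

x = Action(accept=1, rtype=0, resource=3, window=2, start=120.0)
assert x.resource == 3 and x.start == 120.0
```

Each scheduling decision then reduces to emitting one such tuple per task, following the layer-by-layer order above.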
Optionally, the step S2 specifically includes:
A comprehensive measurement and control performance evaluation index is designed that takes into account the satisfaction degree of the measurement and control tasks, the load balance degree of the measurement and control resources and the average utilization rate of the measurement and control resources, and serves as the decision basis for applying the DQN algorithm to measurement and control scheduling; measurement and control resource scheduling is expected to yield a scheduling strategy that maximizes this comprehensive evaluation index;
The measurement and control resource scheduling performance evaluation index is set to r = s_R · RUR / load;
where s_R represents the satisfaction degree of the measurement and control tasks, load represents the load balance degree of the measurement and control resources, and RUR represents the average utilization rate of all measurement and control resources;
satisfaction of measurement and control task:
measuring and controlling the load balance degree of the resource:
wherein :
average utilization rate of measurement and control resources:
optionally, the step S3 specifically includes:
According to the design of the measurement and control actions in step S1, forming a measurement and control scheduling scheme likewise involves three aspects: determining whether to accept a measurement and control task, determining the measurement and control resource that carries out the task, and determining the measurement and control arc segment that completes the task;
Specifically: on the modeling basis that a visible time window can serve as the measurement and control state of a visible arc segment, whether to accept a measurement and control task is determined by judging whether a visible time window exists for that task; in the process of modeling the measurement and control scene, the measurement and control resources and tasks are numbered uniformly, the visible arc segments satisfying the conditions are solved for a specific task, and the resource type and number for completing the task are determined according to the correspondence between visible arc segments and measurement and control resources;
In the design of the measurement and control state, the visible arc segments corresponding to the measurement and control tasks are discretized, and the measurement and control arc segment is slid over the selected visible arc segment according to the possible start times of the task, thereby determining the optimal measurement and control arc segment that can complete the task.
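The arc-segment sliding step can be sketched as a scan over candidate start times within a selected visible window; the scoring rule, step size and function name are illustrative assumptions:

```python
def slide_arc(window, duration, step, score):
    """Slide a measurement and control arc of length `duration` over the visible
    window [t_b, t_e] and return the start time with the best score, or None
    if the arc does not fit in the window."""
    t_b, t_e = window
    best_t, best_s = None, float("-inf")
    t = t_b
    while t + duration <= t_e:        # the arc must lie inside the window
        s = score(t, t + duration)    # evaluate this candidate arc
        if s > best_s:
            best_t, best_s = t, s
        t += step                     # slide by the discretization step
    return best_t

# Toy scoring rule preferring arcs that start as late as possible:
t0 = slide_arc(window=(100, 400), duration=120, step=20, score=lambda b, e: b)
assert t0 == 280
```

In the method proper, the score would come from the comprehensive evaluation index of step S2 rather than this toy rule.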
Optionally, the step S4 specifically includes:
(1) The task state at the current moment changes, the visible time window of the measurement and control resource changes, and the measurement and control state of the system changes;
(2) Updating the measurement and control environment, extracting scene characteristics, and updating the measurement and control state of the system;
(3) According to action selection rules of the deep reinforcement learning algorithm, a decision strategy of measurement and control actions is selected, so that measurement and control resources are matched with measurement and control tasks in time and space, and the realization of the measurement and control tasks is completed;
(4) Evaluating and feeding back the result of measurement and control scheduling of the measurement and control environment and the measurement and control state caused by the selected measurement and control strategy;
(5) According to the evaluation feedback result of the measurement and control strategy, the deep reinforcement learning network is utilized to update the measurement and control decision strategy, and the measurement and control scene and the measurement and control state are observed to update;
Through cyclic and repeated algorithm updates, the selection and optimization of the measurement and control resource allocation strategy are realized, and the optimal measurement and control scheduling strategy is selected.
Optionally, the step S5 specifically includes:
(1) Describing a measurement and control scene, and defining basic physical elements in the scene; based on actual physical scenes, relevant elements involved in the DQN method of measurement and control scheduling are arranged and summarized, and the measurement and control state, measurement and control actions, measurement and control action rewards and the constitution of basic elements of a measurement and control scheme are defined;
(2) Initialize the deep Q learning measurement and control resource scheduling network: initialize the memory bank according to the actual capacity requirement, and initialize the network parameters, including the learning rate, the discount factor, and the structures and parameters of the actual-value neural network and the target-value neural network that describe the Q value;
(3) Design the measurement and control state s according to the measurement and control scene model, initialize the input of the measurement and control scheduling network, and calculate the corresponding output. A measurement and control action is selected at random with probability ε, and with probability 1−ε according to the Q value output by the scheduling network (the ε-greedy strategy); the corresponding action is executed in the measurement and control resource scheduling network. After the action is executed, the reward r, i.e. the evaluation index of the measurement and control action, is obtained, together with the measurement and control state before the next action, i.e. the state s_{i+1} at the next moment. According to the currently selected action and the current state, the Q value of the actual-value neural network and the next-moment Q value of the target-value neural network, i.e. the actual Q value and the estimated Q value, are calculated in the scheduling network;
(4) The four parameters (s_i, X_i, r_i, s_{i+1}) are stored together as one sample in the memory bank;
(5) A certain number of sample states are drawn at random from the memory bank, the target value of each state is calculated, and the Q value is updated toward this target through the obtained reward. The parameters of the actual-value neural network are updated by stochastic gradient descent; after every N iterative updates, the current parameters of the actual-value network are assigned to the target-value network, realizing the update of the target-value network parameters in the scheduling network. The parameters are updated continuously to train the measurement and control scheduling network;
(6) Through cyclic and repeated algorithm updates, the selection and optimization of the measurement and control resource allocation strategy are realized and the optimal measurement and control scheduling strategy is selected; the measurement and control resource scheduling process is thereby completed.
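Steps (2)-(6) can be condensed into a generic DQN skeleton. To keep the sketch self-contained, a linear Q approximator in NumPy stands in for the actual-value and target-value neural networks; the class name, hyperparameters and toy state are all assumptions, not the patent's implementation:

```python
import random
from collections import deque
import numpy as np

class DQNScheduler:
    """Minimal DQN loop: replay memory, epsilon-greedy selection, target-network sync."""
    def __init__(self, n_features, n_actions, lr=0.01, gamma=0.9,
                 epsilon=0.1, memory_size=500, sync_every=50):
        self.gamma, self.epsilon, self.lr = gamma, epsilon, lr
        self.n_actions = n_actions
        self.W = np.zeros((n_actions, n_features))   # actual-value "network"
        self.W_target = self.W.copy()                # target-value "network"
        self.memory = deque(maxlen=memory_size)      # replay memory bank
        self.sync_every, self.steps = sync_every, 0

    def q(self, W, s):
        return W @ s                                 # Q(s, ·) for all actions

    def choose(self, s):
        # epsilon-greedy: random action with probability epsilon, else argmax Q
        if random.random() < self.epsilon:
            return random.randrange(self.n_actions)
        return int(np.argmax(self.q(self.W, s)))

    def store(self, s, a, r, s_next):
        self.memory.append((s, a, r, s_next))        # the (s_i, X_i, r_i, s_{i+1}) sample

    def learn(self, batch_size=16):
        batch = random.sample(self.memory, min(batch_size, len(self.memory)))
        for s, a, r, s_next in batch:
            target = r + self.gamma * np.max(self.q(self.W_target, s_next))
            td = target - self.q(self.W, s)[a]
            self.W[a] += self.lr * td * s            # gradient step on the actual-value net
        self.steps += 1
        if self.steps % self.sync_every == 0:
            self.W_target = self.W.copy()            # assign parameters to target net

agent = DQNScheduler(n_features=4, n_actions=3)
s = np.array([1.0, 0.0, 1.0, 0.0])                   # toy flattened 0-1 state
a = agent.choose(s)
agent.store(s, a, 1.0, s)
agent.learn()
```

In the method proper, the state would be the flattened 0-1 matrix of step S1, the action would encode X_i, and r would be the comprehensive evaluation index of step S2.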
The invention has the beneficial effects that: the method can generate the measurement and control resource scheduling strategy which is suitable for the measurement and control scene under the condition that accurate modeling is not needed for the measurement and control environment in the complex measurement and control environment, thereby maximizing the measurement and control resource scheduling efficiency.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objects and other advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out in the specification.
Drawings
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the present invention is described below in preferred detail with reference to the accompanying drawings, in which:
FIG. 1 is a schematic diagram of a measurement and control state design;
FIG. 2 is a flow chart for forming a measurement and control resource scheduling scheme;
FIG. 3 is a DQN-based measurement and control resource scheduling decision flow;
FIG. 4 is a schematic diagram of measurement and control states in an embodiment.
Detailed Description
Other advantages and effects of the present invention will become apparent to those skilled in the art from the disclosure of this specification, which describes the embodiments of the invention with reference to specific examples. The invention may also be practiced or applied through other, different embodiments, and the details of this specification may be modified or varied on the basis of different viewpoints and applications without departing from the spirit of the invention. It should be noted that the illustrations provided in the following embodiments merely illustrate the basic idea of the invention schematically, and the following embodiments and the features therein may be combined with each other in the absence of conflict.
Wherein the drawings are for illustrative purposes only and are shown in schematic, non-physical, and not intended to limit the invention; for the purpose of better illustrating embodiments of the invention, certain elements of the drawings may be omitted, enlarged or reduced and do not represent the size of the actual product; it will be appreciated by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The same or similar reference numbers in the drawings of embodiments of the invention correspond to the same or similar components; in the description of the present invention, it should be understood that, if there are terms such as "upper", "lower", "left", "right", "front", "rear", etc., that indicate an azimuth or a positional relationship based on the azimuth or the positional relationship shown in the drawings, it is only for convenience of describing the present invention and simplifying the description, but not for indicating or suggesting that the referred device or element must have a specific azimuth, be constructed and operated in a specific azimuth, so that the terms describing the positional relationship in the drawings are merely for exemplary illustration and should not be construed as limiting the present invention, and that the specific meaning of the above terms may be understood by those of ordinary skill in the art according to the specific circumstances.
Please refer to fig. 1-4, which are a measurement and control resource scheduling method based on deep Q learning.
The invention relates to a measurement and control resource scheduling method based on the DQN algorithm. By constructing a measurement and control scene based on the visible windows between measurement and control resources and measurement and control objects, the method mainly uses the strong model description capability of the neural network in the DQN algorithm to describe the long-term rewards of measurement and control actions; it breaks the correlation between data by means of the memory playback mechanism in the measurement and control scene, learns an optimal strategy by evaluating the quality of the current state through interaction with the scene, and adapts to a complex measurement and control resource scheduling environment. The technical scheme of the method is as follows:
1. description of Complex measurement and control scenarios
In the complex measurement and control scene, the measurement and control resources mainly refer to the space-earth integrated measurement and control resources, namely ground-based and space-based measurement and control resources: the ground-based resources mainly refer to ground stations, while the space-based resources mainly consider tracking and data relay satellites. The type of a measurement and control resource is defined by the TYPE variable. The description of the measurement and control scene is mainly based on the visible states of the measurement and control resources and objects and on the visible time windows. Specifically, the complex measurement and control scene is described by abstractly expressing each physical entity in the scene (including the measurement and control tasks and the related constraint conditions) and by designing the measurement and control state and the measurement and control actions.
(1) Description of entities in measurement and control scenarios
From the perspective of measurement and control resources of the heaven-earth integrated measurement and control system, elements in a measurement and control scene are described based on a visible time window.
The space-earth integrated measurement and control resources can be described as:
RESOURCE = {S, TYPE, TS, D_S, L, L_MAX}
wherein S is the set of space-earth integrated measurement and control resources, in which measurement and control resources of multiple types are numbered uniformly, S = {s_1, s_2, ..., s_j, ..., s_M}; in this and the following formulas, j is the number of a measurement and control resource and M is the total number of all measurement and control resources.
TYPE characterizes the TYPEs of measurement and control resources, the measurement and control resources are day-based measurement and control resources when TYPE is 1, and the resources are foundation measurement and control resources when TYPE is 0;
TS characterizes an idle time window (i.e., a time window currently available for measurement and control) for each measurement and control resource;
TS={TS 1 ,TS 2 ,...TS j ,...TS M }
={[t b1 (s 1 ),t e1 (s 1 )],[t b2 (s 1 ),t e2 (s 1 )],...,[t b1 (s 2 ),t e1 (s 2 )],[t b2 (s 2 ),t e2 (s 2 )].....,....[t b1 (s M ),t e1 (s M )]}
TS j characterizing all available time windows (i.e., idle time windows) of the jth measurement and control resource, t b1 (s j ) And t e1 (s j ) The starting time and the ending time of the 1 st visible time window of the 1 st measurement and control resource are respectively represented, and the sequence of the visible windows is marked according to the time sequence. And so on.
D S Characterizing the length of each idle time window of a measurement and control resource And the length of a kth idle time window of the jth measurement and control resource is represented.
LS j Representing occupation of single measurement and control resources by all medium-low orbit satellites And (3) representing the load occupation condition of the measurement and control task i on a single measurement and control resource j, wherein i represents the sequence of the measurement and control tasks, and n is the total number of the measurement and control tasks.
L represents the occupation of all middle-low orbit satellites on the space-earth integrated measurement and control resources. The method comprises the following steps:
L Sj and the load occupation condition of all measurement and control tasks for a single measurement and control resource j is represented.
L MAX ={L MAX1 ,L MAX2 ,...L MAXj ,...L MAXM }
L MAXj And the maximum load of the measurement and control resource j can be received.
From the perspective of the measurement and control tasks, the elements in the scene are likewise described based on visible time windows. The measurement and control tasks can be described as:
TASK = {T, Sat, P, D, T_A, T_C, To_i}
where T is the numbered set of all measurement and control tasks, T = {T_1, T_2, ..., T_i, ..., T_n};
T_i denotes the number of a measurement and control task. In this formula and the following formulas, i is the order of the measurement and control tasks and n is the total number of tasks.
Sat characterizes the task sources, i.e. the corresponding task satellites: Sat = {Sat_1, Sat_2, ..., Sat_o};
Sat_i represents the source satellite of the measurement and control task with order i.
P is the priority of the measurement and control tasks, P = {P_1, P_2, ..., P_i, ..., P_n}; P_i denotes the priority of the task with order i.
D is the set of shortest measurement and control durations corresponding to the tasks, D = {d_1, d_2, ..., d_i, ..., d_n}; d_i denotes the shortest duration of the task with order i.
T_A represents the time intervals in which the tasks can be measured and controlled:
T_A = {[t_1B, t_1E], [t_2B, t_2E], ..., [t_iB, t_iE], ..., [t_nB, t_nE]};
[t_iB, t_iE] is the window available to the task with order i, t_iB being its earliest start time and t_iE its latest end time.
T_C represents the actual measurement and control intervals of the tasks:
T_C = {[t_1b, t_1e], [t_2b, t_2e], ..., [t_ib, t_ie], ..., [t_nb, t_ne]};
[t_ib, t_ie] is the window in which the task with order i is actually carried out, t_ib being the actual start time after scheduling and t_ie the actual end time after scheduling.
To_i describes the set of visible arc segments corresponding to each task; its element represents the k-th visible time window of the m-th measurement and control resource for the task with order i, which can be written as [t_bk(s_im), t_ek(s_im)], where t_bk(s_im) is the start time of the visible window and t_ek(s_im) its end time.
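The RESOURCE and TASK descriptions above can be collected into plain data structures. The following Python sketch is illustrative only: the class and field names are hypothetical, and load-related elements are simplified to scalars.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Window = Tuple[float, float]  # (start time, end time)

@dataclass
class Resource:
    """One space- or ground-based measurement and control resource (names hypothetical)."""
    number: int                  # j: uniform resource number
    type: int                    # TYPE: 1 = space-based, 0 = ground-based
    idle_windows: List[Window]   # TS_j: idle windows in chronological order
    max_load: float              # L_MAXj: maximum acceptable task load
    load: float = 0.0            # L_Sj: load already occupied by scheduled tasks

@dataclass
class Task:
    """One measurement and control task (names hypothetical)."""
    number: int                             # i: task order
    satellite: int                          # Sat_i: source satellite
    priority: float                         # P_i
    min_duration: float                     # d_i: shortest measurement duration
    allowed_window: Window                  # [t_iB, t_iE]
    visible_arcs: List[Tuple[int, float, float]] = field(default_factory=list)  # To_i
    actual_window: Optional[Window] = None  # [t_ib, t_ie], set after scheduling
```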
(2) Measurement and control state design
The measurement and control state s is designed on the basis of the utilization of the measurement and control resources, i.e. on the basis of time-space visibility, by expressing the different visible/available states in the measurement and control system with visible time windows. As shown in fig. 1, for a specific measurement and control scene, a 0-1 matrix representing the state of each measurement and control resource is used as the state of the scene; the size of the matrix is determined by the number of measurement and control resources and by the division scale of the measurement and control time window. For each resource, a division scale is chosen according to the specific requirements, the daily working time of the resource is divided accordingly, and the visible state of each divided time interval of the measurement and control equipment is marked: the matrix entry corresponding to a visible/usable unit time is set to 0, and the entry corresponding to an invisible/unusable unit time is set to 1. This determines the usage of the measurement and control equipment at any given moment, i.e. the measurement and control state.
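As a sketch, the 0-1 state matrix described above can be built as follows. The function name, the dict-based window representation, and the hour-based division scale are assumptions for illustration, not part of the patent.

```python
import numpy as np

def build_state_matrix(idle_windows, n_resources, hours_per_day=24, scale_h=1.0):
    """Build the 0-1 measurement and control state matrix described above.

    idle_windows maps a resource index j to its list of (start_h, end_h) idle
    intervals within one day; entries are 0 where the resource is visible/usable
    in a time slot and 1 where it is not.
    """
    n_slots = int(hours_per_day / scale_h)
    state = np.ones((n_resources, n_slots), dtype=int)   # default: unusable
    for j, windows in idle_windows.items():
        for start, end in windows:
            a = int(start / scale_h)                     # first covered slot
            b = int(np.ceil(end / scale_h))              # one past last slot
            state[j, a:b] = 0                            # mark usable slots
    return state
```

With 3 resources and a 1 h scale this yields a 3×24 matrix, matching the embodiment given later.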
(3) Design of measurement and control actions
The design of the measurement and control actions adopts a layer-by-layer progressive decision idea: it determines in turn whether a measurement and control task is accepted, which measurement and control resource the accepted task uses, and the specific measurement and control time interval of the task. The measurement and control action is therefore designed as:
X_i = (a_i, type, x_ij, y_jk, t_ib)
where a_i indicates whether the measurement and control task is accepted, type indicates the type of measurement and control resource accepting the task, x_ij indicates the number of the resource accepting the task, y_jk indicates that the task is executed in the k-th visible time window of resource j, and t_ib characterizes the actual start time of the task.
2. Measurement and control scheduling performance evaluation index design
In the method, a comprehensive measurement and control performance evaluation index is designed that takes into account three indexes: the satisfaction degree of the measurement and control tasks, the load balance degree of the measurement and control resources, and the average utilization rate of the measurement and control resources. It serves as the decision basis for the application of the DQN algorithm in measurement and control scheduling; the resource scheduling is expected to obtain a scheduling strategy that maximizes this comprehensive evaluation index.
Specifically, the measurement and control resource scheduling performance evaluation index is set to r = s_R * RUR / load, where s_R represents the satisfaction degree of the measurement and control tasks, load represents the load balance degree of the measurement and control resources, and RUR represents the average utilization rate of all measurement and control resources.
Satisfaction of measurement and control task:
measuring and controlling the load balance degree of the resource:
wherein :
average utilization rate of measurement and control resources:
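Once the three components are known, the composite index can be computed directly. A minimal sketch follows; the patent gives the exact component formulas only as figures, so here they are assumed to be computed elsewhere and passed in as scalars.

```python
def scheduling_reward(s_r, rur, load_balance):
    """Composite evaluation index r = s_R * RUR / load.

    s_r:          satisfaction degree of the measurement and control tasks
    rur:          average utilization rate of all measurement and control resources
    load_balance: load balance degree of the resources (appears as the divisor)
    """
    if load_balance == 0:
        raise ValueError("load balance degree must be non-zero")
    return s_r * rur / load_balance
```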
3. measurement and control resource scheduling scheme formation
According to the design of the measurement and control actions in step 1, forming the measurement and control scheduling scheme mainly involves three decisions: whether a measurement and control task is accepted, which measurement and control resource carries it out, and which measurement and control arc segment completes it. Specifically, the invention takes the visible arc segments, derived from the visible time windows, as the modeling basis of the measurement and control state, so whether a specific task is accepted is determined by judging whether a visible time window exists for it. Because the measurement and control resources and tasks are numbered uniformly during scene modeling, the visible arc segments satisfying the conditions can be solved for a specific task, and the type and number of the resource completing the task can then be determined from the correspondence between visible arc segments and resources. In the design of the measurement and control state, the visible arc segments corresponding to a task are discretized, so the measurement and control arc is slid over the selected visible arc segment according to the possible start times of the task, and the optimal measurement and control arc segment that can complete the task is determined.
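The three decisions above can be sketched as an enumeration of candidate placements: a task is rejected when no placement exists, and otherwise each candidate fixes the resource, the visible window, and a start time obtained by sliding the arc. The function name, the arc tuple format (resource_j, t_b, t_e), and the step size are illustrative assumptions.

```python
def candidate_actions(visible_arcs, min_duration, step=0.5):
    """Enumerate candidate placements for one task by sliding a measurement
    arc of length min_duration over each visible arc segment.

    Returns a list of (resource_number, window_index, start_time) tuples;
    an empty list means the task has no usable window and is rejected.
    """
    actions = []
    for k, (resource_j, t_b, t_e) in enumerate(visible_arcs):
        t = t_b
        while t + min_duration <= t_e:      # slide inside the visible arc
            actions.append((resource_j, k, t))
            t += step
    return actions
```

A learned policy (the DQN below in the patent's flow) would then score these candidates rather than pick the first fit.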
Therefore, the measurement and control resource scheduling scheme forming flow is as shown in fig. 2:
4. Application of the DQN algorithm in the generation of the measurement and control resource scheduling scheme
In the method, based on a deep reinforcement learning framework and a DQN learning principle, the following measurement and control resource scheduling decision flow can be constructed, so that a measurement and control resource scheduling strategy with optimal measurement and control efficiency is selected.
The implementation steps can be summarized as follows:
(1) The task state at the current moment changes, the visible time windows of the measurement and control resources change, and the measurement and control state of the system changes accordingly.
(2) The measurement and control environment is updated, scene features are extracted, and the measurement and control state of the system is updated.
(3) A decision strategy for the measurement and control action is selected according to the action selection rule of the deep reinforcement learning algorithm, so that the measurement and control resources are matched with the measurement and control tasks in time and space and the tasks are carried out.
(4) The measurement and control scheduling result is evaluated and fed back with respect to the update of the measurement and control environment and state caused by the selected measurement and control strategy.
(5) The measurement and control decision strategy is updated by the deep reinforcement learning network according to the evaluation feedback, and the update of the measurement and control scene and state is observed.
Through cyclic and repeated algorithm updating, the selection and optimization of the measurement and control resource allocation strategy is realized, and the optimal measurement and control scheduling strategy is selected.
5. DQN-based measurement and control resource scheduling method implementation flow
(1) Describing a measurement and control scene, and defining basic physical elements in the scene. Based on the actual physical scene, relevant elements related to the DQN method of measurement and control scheduling are arranged and summarized, and the basic elements such as measurement and control states, measurement and control actions, measurement and control action rewards, measurement and control schemes and the like are defined.
(2) Initializing a deep Q learning measurement and control resource scheduling network, initializing a memory bank according to actual capacity requirements, and initializing network parameters including learning rate, discount factors, and structures and parameters of an actual value neural network and a target value neural network for describing a Q value.
(3) The measurement and control state s is designed according to the measurement and control scene model, the input of the measurement and control scheduling network is initialized, and the corresponding output is calculated. A measurement and control action is selected randomly with probability ε, and with probability 1−ε the action is selected according to the Q value output by the scheduling network (the ε-greedy strategy); the corresponding action is then executed in the measurement and control resource scheduling network. The reward r after execution (i.e. the evaluation index of the measurement and control action) and the measurement and control state before the next action, i.e. the state s_{i+1} at the next moment, are obtained. The Q values of the actual-value neural network and of the current-value neural network at the next moment, i.e. the actual Q value and the estimated Q value, are calculated in the scheduling network from the currently selected action and the current state.
(4) The four parameters (s_i, X_i, r_i, s_{i+1}) are stored together as one sample in the memory bank.
(5) A certain number of sample states are randomly taken from the memory bank, and the target value of each state is calculated (the Q value is updated as the target value using the reward after execution). The actual-value network parameters are updated by stochastic gradient descent, and after every N iterative updates the current parameters of the actual-value network are assigned to the target-value network, realizing the update of the target-value network parameters in the scheduling network. The scheduling network is trained by continuously updating the parameters in this way.
(6) And through cyclic and reciprocating algorithm updating, the selection and optimization of the measurement and control resource allocation strategy are realized, and the selection of the optimal measurement and control scheduling strategy is realized. And (5) completing the measurement and control resource scheduling process.
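The flow of steps (1)-(6) can be sketched in code. To keep the example self-contained, a linear Q approximator stands in for the convolutional networks of the patent; the class name and all hyperparameters (learning rate, discount factor, memory size, sync period N) are illustrative assumptions.

```python
import random
from collections import deque

import numpy as np

class TinyDQNScheduler:
    """Minimal sketch of the DQN mechanics used for scheduling: epsilon-greedy
    action choice, replay memory of (s, X, r, s') samples, and periodic sync
    of the target-value network from the actual-value network."""

    def __init__(self, n_features, n_actions, lr=0.01, gamma=0.9,
                 epsilon=0.1, memory_size=500, sync_every=50, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0, 0.1, (n_actions, n_features))  # actual-value net
        self.W_target = self.W.copy()                          # target-value net
        self.lr, self.gamma, self.epsilon = lr, gamma, epsilon
        self.memory = deque(maxlen=memory_size)                # replay memory
        self.sync_every, self.updates = sync_every, 0
        self.n_actions = n_actions

    def q_values(self, s, target=False):
        W = self.W_target if target else self.W
        return W @ s

    def choose_action(self, s):
        # epsilon-greedy: explore with probability epsilon, else exploit Q
        if random.random() < self.epsilon:
            return random.randrange(self.n_actions)
        return int(np.argmax(self.q_values(s)))

    def store(self, s, a, r, s_next):
        self.memory.append((s, a, r, s_next))  # one sample (s_i, X_i, r_i, s_{i+1})

    def learn(self, batch_size=16):
        batch = random.sample(self.memory, min(batch_size, len(self.memory)))
        for s, a, r, s_next in batch:
            # target value from the target network, as in step (5)
            target = r + self.gamma * np.max(self.q_values(s_next, target=True))
            td_error = target - self.q_values(s)[a]
            self.W[a] += self.lr * td_error * s    # SGD step on the actual net
        self.updates += 1
        if self.updates % self.sync_every == 0:    # copy parameters every N updates
            self.W_target = self.W.copy()
```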
Examples:
1. Describe the complex measurement and control scene. Taking as a sample a measurement and control scene with 2 ground-based measurement and control resources, 1 space-based measurement and control resource, and 9 measurement and control tasks to be completed, the measurement and control resource scene is initialized and uniformly described. According to the actual measurement and control scene, from the viewpoint of the integrated measurement and control resources, the scene can be described in the following form:
the measurement and control resources of the heaven and earth integrated measurement and control system are as follows:
RESOURCE = {S, TYPE, TS, D_S, L, L_MAX}
where S is the set of space-ground integrated measurement and control resources, S = {s_1, s_2, ..., s_j, ..., s_M};
TYPE characterizes the type of a measurement and control resource: the resource is space-based when TYPE is 1 and ground-based when TYPE is 0;
TS characterizes the idle time windows (i.e. the time windows currently available for measurement and control) of each measurement and control resource;
D_S characterizes the length of each idle time window of a measurement and control resource;
LS_j characterizes the occupation of the single measurement and control resource j by all medium- and low-orbit satellites;
L represents the occupation of the space-ground integrated measurement and control resources by all medium- and low-orbit satellites:
L = {L_S1, L_S2, ..., L_Sj, ..., L_SM}
From the perspective of the measurement and control tasks, the description of the elements in the scene based on visible time windows is as follows:
TASK = {T, Sat, P, D, T_A, T_C, To_i}
where T is the set of measurement and control tasks of all medium- and low-orbit satellites, T = {T_1, T_2, ..., T_i, ..., T_n};
Sat characterizes the task sources, i.e. the corresponding task satellites, Sat = {Sat_1, Sat_2, ..., Sat_o};
P is the priority of the measurement and control tasks, P = {P_1, P_2, ..., P_i, ..., P_n};
D is the set of shortest measurement and control durations corresponding to the tasks, D = {d_1, d_2, ..., d_i, ..., d_n};
T_A represents the intervals in which the tasks can be measured and controlled, T_A = {[t_1B, t_1E], [t_2B, t_2E], ..., [t_iB, t_iE], ..., [t_nB, t_nE]};
T_C represents the actual measurement and control intervals of the tasks, T_C = {[t_1b, t_1e], [t_2b, t_2e], ..., [t_ib, t_ie], ..., [t_nb, t_ne]};
To_i describes the set of visible arc segments corresponding to each task.
The measurement and control state s is designed according to the measurement and control scene model: for the specific scene, a 0-1 matrix representing the state of each measurement and control resource is used as the measurement and control state. Taking 1 h as the division scale, there are 3 measurement and control resources in this scene, so the state matrix has size 3×24 for each day; the matrix entry corresponding to a visible/usable unit time is set to 0 and the entry corresponding to an invisible/unusable unit time is set to 1. Accordingly, the measurement and control state in this case can be visualized as in fig. 4.
The measurement and control action, i.e. the decision variable, is described as:
X_i = (a_i, type, x_ij, y_jk, t_ib)
where a_i indicates whether the measurement and control task is accepted, type indicates the type of measurement and control resource accepting the task, x_ij indicates the number of the resource accepting the task, y_jk indicates that the task is executed in the k-th visible time window of resource j, and t_ib characterizes the actual start time of the task.
The measurement and control scheduling performance evaluation index is expressed as r = s_R * RUR / load to comprehensively evaluate the scheduling performance of the measurement and control resources, where s_R represents the satisfaction degree of the measurement and control tasks, load represents the balance degree of measurement and control resource utilization, and RUR represents the average utilization rate of all measurement and control resources.
2. A convolutional neural network is constructed to describe the Q value in the measurement and control scheduling network according to the scene requirements. The actual-value network and the target-value network are two convolutional neural networks with identical structures but not fully identical parameters; each comprises 2 convolutional layers and 1 fully connected layer and uses the sigmoid function as its activation function. During initialization of the deep Q-learning scheduling network, the memory bank is initialized according to the actual capacity requirement, and the network parameters are initialized, including the learning rate, the discount factor, and the related parameters of the actual-value and target-value networks describing the Q value.
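A minimal numpy sketch of such a network's forward pass follows (pure numpy is used instead of a deep learning framework; the kernel sizes, the number of actions, and the flattening step are illustrative assumptions, while the 2-conv + 1-FC structure and sigmoid activations follow the text above).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv2d_valid(x, kernel):
    """'valid' 2-D correlation of a single-channel map with one kernel."""
    kh, kw = kernel.shape
    H, W = x.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i+kh, j:j+kw] * kernel)
    return out

def q_network_forward(state, k1, k2, W_fc, b_fc):
    """Forward pass of a 2-conv + 1-FC Q network with sigmoid activations:
    the 0-1 state matrix goes in, one Q value per action comes out."""
    h1 = sigmoid(conv2d_valid(state, k1))    # first convolutional layer
    h2 = sigmoid(conv2d_valid(h1, k2))       # second convolutional layer
    return W_fc @ h2.ravel() + b_fc          # fully connected output layer
```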
3. The measurement and control state, the measurement and control action rewards, and the measurement and control scheme are further refined according to the specific description of the scene in step 1. On this basis, the state s is designed according to the scene model, the input of the scheduling network is initialized, and the corresponding output is calculated. A measurement and control action is selected randomly with probability ε, and with probability 1−ε the action is selected according to the Q value output by the scheduling network (the ε-greedy strategy); the corresponding action is then executed in the measurement and control resource scheduling network. The reward r after execution and the measurement and control state before the next action, i.e. the state s_{i+1} at the next moment, are obtained. The Q values of the actual-value neural network and of the current-value neural network at the next moment are calculated from the currently selected action and the current state.
4. The four parameters (s_i, X_i, r_i, s_{i+1}) are stored together as one sample in the memory bank.
5. A certain number of sample states are randomly taken from the memory bank, and the target value of each state is calculated (the Q value is updated as the target value using the reward after execution). The actual-value network parameters are updated by stochastic gradient descent, and after every N iterative updates the current parameters of the actual-value network are assigned to the target-value network, realizing the update of the target-value network parameters in the measurement and control scheduling network.
The scheduling network is trained by continuously updating the parameters in this way.
6. Through cyclic and repeated algorithm updating, the selection and optimization of the measurement and control resource allocation strategy is realized and the optimal measurement and control scheduling strategy is selected, completing the measurement and control resource scheduling process.
Finally, it is noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the present invention, which is intended to be covered by the claims of the present invention.

Claims (5)

1. A measurement and control resource scheduling method based on deep Q learning is characterized in that: the method comprises the following steps:
S1: describing a complex measurement and control scene;
S2: designing measurement and control scheduling performance evaluation indexes;
S3: forming a measurement and control resource scheduling scheme;
S4: applying the DQN algorithm to the generation of a measurement and control resource scheduling scheme;
S5: implementing the DQN-based measurement and control resource scheduling method;
the step S1 specifically comprises the following steps:
(1) Description of entities in measurement and control scenarios
From the perspective of measurement and control resources of the heaven-earth integrated measurement and control system, describing elements in a measurement and control scene based on a visible time window;
the space-ground integrated measurement and control resources are described as follows:
RESOURCE = {S, TYPE, TS, D_S, L, L_MAX}
wherein S is the set of space-ground integrated measurement and control resources, in which the multiple resources of multiple types are numbered uniformly: S = {s_1, s_2, ..., s_j, ..., s_M}; j is the number of a measurement and control resource and M is the total number of measurement and control resources;
TYPE characterizes the type of a measurement and control resource: the resource is space-based when TYPE is 1 and ground-based when TYPE is 0;
TS characterizes the idle time windows, i.e. the time windows currently available for measurement and control, of each measurement and control resource:
TS = {TS_1, TS_2, ..., TS_j, ..., TS_M}
   = {[t_b1(s_1), t_e1(s_1)], [t_b2(s_1), t_e2(s_1)], ..., [t_b1(s_2), t_e1(s_2)], [t_b2(s_2), t_e2(s_2)], ..., [t_b1(s_M), t_e1(s_M)]}
TS_j characterizes all available, i.e. idle, time windows of the j-th measurement and control resource; t_b1(s_j) and t_e1(s_j) respectively denote the start and end times of the 1st visible time window of the j-th resource, the visible windows being numbered in chronological order, and so on;
D_S characterizes the lengths of the idle time windows of the measurement and control resources; its element d_jk characterizes the length of the k-th idle time window of the j-th resource;
LS_j characterizes the occupation of the single measurement and control resource j by all medium- and low-orbit satellites; its element l_ij represents the load occupation of measurement and control task i on the single resource j, where i denotes the order of the measurement and control tasks and n is the total number of tasks;
L represents the occupation of the space-ground integrated measurement and control resources by all medium- and low-orbit satellites:
L = {L_S1, L_S2, ..., L_Sj, ..., L_SM}
L_Sj represents the load occupation of all measurement and control tasks on the single resource j;
L_MAX = {L_MAX1, L_MAX2, ..., L_MAXj, ..., L_MAXM}
L_MAXj represents the maximum measurement and control task load that resource j can accept, i.e. the maximum load of the resource;
from the perspective of the measurement and control tasks, the elements in the scene are likewise described based on visible time windows; the measurement and control tasks are described as follows:
TASK = {T, Sat, P, D, T_A, T_C, To_i}
wherein T is the numbered set of all measurement and control tasks, T = {T_1, T_2, ..., T_i, ..., T_n};
T_i denotes the number of a measurement and control task; in this formula and the following formulas, i is the order of the measurement and control tasks and n is the total number of tasks;
Sat characterizes the task sources, i.e. the corresponding task satellites: Sat = {Sat_1, Sat_2, ..., Sat_o};
Sat_i represents the source satellite of the measurement and control task with order i;
P is the priority of the measurement and control tasks, P = {P_1, P_2, ..., P_i, ..., P_n}; P_i denotes the priority of the task with order i;
D is the set of shortest measurement and control durations corresponding to the tasks, D = {d_1, d_2, ..., d_i, ..., d_n}; d_i denotes the shortest duration of the task with order i;
T_A represents the time intervals in which the tasks can be measured and controlled:
T_A = {[t_1B, t_1E], [t_2B, t_2E], ..., [t_iB, t_iE], ..., [t_nB, t_nE]};
[t_iB, t_iE] is the window available to the task with order i, t_iB being its earliest start time and t_iE its latest end time;
T_C represents the actual measurement and control intervals of the tasks:
T_C = {[t_1b, t_1e], [t_2b, t_2e], ..., [t_ib, t_ie], ..., [t_nb, t_ne]};
[t_ib, t_ie] is the window in which the task with order i is actually carried out, t_ib being the actual start time after scheduling and t_ie the actual end time after scheduling;
To_i describes the set of visible arc segments corresponding to each task; its element represents the k-th visible time window of the m-th measurement and control resource for the task with order i, written as [t_bk(s_im), t_ek(s_im)], where t_bk(s_im) is the start time of the visible window and t_ek(s_im) its end time;
(2) Measurement and control state design
The measurement and control state s is designed on the basis of the utilization of the measurement and control resources, i.e. on the basis of time-space visibility, by expressing the different visible/available states in the measurement and control system with visible time windows; for a specific measurement and control scene, a 0-1 matrix representing the state of each measurement and control resource is used as the state of the scene, the size of the 0-1 matrix being determined by the number of measurement and control resources and the division scale of the measurement and control time window; for each resource, a division scale is determined according to the specific requirements, the daily working time of the resource is divided accordingly, and the visible state of each divided time interval of the measurement and control equipment is marked: the matrix entry corresponding to a visible/usable unit time is set to 0 and the entry corresponding to an invisible/unusable unit time is set to 1, which determines the usage of the measurement and control equipment at a given moment, i.e. the measurement and control state;
(3) Design of measurement and control actions
The design of the measurement and control actions adopts a layer-by-layer progressive decision idea: it determines in turn whether a measurement and control task is accepted, which measurement and control resource the accepted task uses, and the specific measurement and control time interval of the task; the measurement and control action with order i is designed as:
X_i = (a_i, type, x_ij, y_jk, t_ib)
wherein a_i indicates whether the measurement and control task with order i is accepted, type indicates the type of measurement and control resource of the task with order i, x_ij indicates the number of the resource accepting the task with order i, y_jk indicates that the task is executed in the k-th visible time window of resource j, and t_ib characterizes the actual start time of the task with order i.
2. The measurement and control resource scheduling method based on deep Q learning according to claim 1, wherein step S2 specifically comprises the following steps:
designing a comprehensive measurement and control performance evaluation index that takes into account the satisfaction degree of the measurement and control tasks, the load balance degree of the measurement and control resources and the average utilization rate of the measurement and control resources, and using it as the decision basis for applying the DQN algorithm to measurement and control scheduling; measurement and control resource scheduling is expected to yield a scheduling strategy that maximizes this comprehensive evaluation index;
setting the evaluation index of measurement and control resource scheduling performance to r = s_R * RUR / load;
wherein s_R denotes the satisfaction degree of the measurement and control tasks, load denotes the load balance degree of the measurement and control resources, and RUR denotes the average utilization rate of all the measurement and control resources;
the satisfaction degree s_R of the measurement and control tasks, the load balance degree load of the measurement and control resources, and the average utilization rate RUR of the measurement and control resources are each computed by their respective formulas;
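The claim defines the scheduling index r = s_R * RUR / load, but the formulas for its three components are rendered as images in the source and are not reproduced here. The sketch below therefore assumes common definitions purely for illustration: s_R as the fraction of tasks scheduled, RUR as the mean per-resource utilization, and load as the standard deviation of per-resource utilization (with a small epsilon to avoid division by zero).

```python
import numpy as np

def scheduling_reward(n_scheduled, n_tasks, busy_hours, total_hours):
    """r = s_R * RUR / load, following the claim's index structure.

    Assumed (not given in the source): s_R = scheduled fraction of tasks,
    RUR = mean utilization, load = std. dev. of per-resource utilization.
    """
    s_R = n_scheduled / n_tasks
    util = np.asarray(busy_hours, dtype=float) / total_hours
    rur = util.mean()
    load = util.std() + 1e-9  # epsilon guards against a perfectly balanced load
    return s_R * rur / load

# Higher reward when more tasks are satisfied and the load is more balanced.
r_unbalanced = scheduling_reward(8, 10, busy_hours=[6, 4, 5], total_hours=24)
r_balanced = scheduling_reward(8, 10, busy_hours=[5, 5, 5], total_hours=24)
```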
3. The measurement and control resource scheduling method based on deep Q learning according to claim 1, wherein step S3 specifically comprises the following steps:
according to the design of the measurement and control actions in step S1, forming a measurement and control scheduling scheme mainly comprises three aspects: determining whether to accept a measurement and control task, determining the measurement and control resource that carries out the task, and determining the measurement and control arc segment that completes the task;
specifically: whether to accept a measurement and control task is determined by judging whether a visible time window exists for it, based on the modeling principle that a visible time window, i.e. a visible arc segment, provides a usable measurement and control state; during modeling of the measurement and control scene, the measurement and control resources and tasks are numbered uniformly, the visible arc segments satisfying the conditions are solved for each specific task, and the type and number of the resource completing the task are determined from the correspondence between visible arc segments and measurement and control resources;
in the design of the measurement and control state, the visible arc segments corresponding to a measurement and control task are discretized, and the measurement and control arc segment is slid over the selected visible arc segment according to the possible start times of the task, thereby determining the optimal measurement and control arc segment that can complete the task.
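The arc-segment sliding step above can be sketched as a scan over a discretized visible arc: the task's required duration is slid across candidate start slots and the first feasible placement is kept. The slot representation reuses the 0-1 convention from the state design (0 = visible/usable); the function name and slot sizes are illustrative assumptions.

```python
def slide_arc(state_row, duration):
    """state_row: 0/1 slots for one resource (0 = visible/usable).

    Slide a window of `duration` slots over the arc and return the earliest
    start index where all slots are free, or None if no placement fits.
    """
    for start in range(len(state_row) - duration + 1):
        if all(s == 0 for s in state_row[start:start + duration]):
            return start
    return None

row = [1, 0, 0, 1, 0, 0, 0, 1]
print(slide_arc(row, 3))  # 4 : first run of three usable slots
print(slide_arc(row, 5))  # None : no feasible placement
```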
4. The measurement and control resource scheduling method based on deep Q learning according to claim 1, wherein step S4 specifically comprises:
(1) The task state at the current moment changes, the visible time window of the measurement and control resource changes, and the measurement and control state of the system changes;
(2) Updating the measurement and control environment, extracting scene characteristics, and updating the measurement and control state of the system;
(3) According to the action selection rule of the deep reinforcement learning algorithm, a decision strategy for the measurement and control actions is selected, so that measurement and control resources are matched with measurement and control tasks in time and space and the measurement and control tasks are carried out;
(4) Evaluating and feeding back the result of measurement and control scheduling of the measurement and control environment and the measurement and control state caused by the selected measurement and control strategy;
(5) According to the evaluation feedback result of the measurement and control strategy, the deep reinforcement learning network is utilized to update the measurement and control decision strategy, and the measurement and control scene and the measurement and control state are observed to update;
and through repeated, iterative algorithm updates, the selection and optimization of the measurement and control resource allocation strategy is realized, yielding the optimal measurement and control scheduling strategy.
5. The measurement and control resource scheduling method based on deep Q learning according to claim 1, wherein step S5 specifically comprises the following steps:
(1) Describing the measurement and control scene and defining the basic physical elements in it; on the basis of the actual physical scene, the relevant elements involved in applying the DQN method to measurement and control scheduling are organized and summarized, and the basic elements, namely the measurement and control state, the measurement and control actions, the measurement and control action rewards and the measurement and control scheme, are defined;
(2) Initializing the deep Q learning measurement and control resource scheduling network: initializing the memory bank according to the actual capacity requirement, and initializing the network parameters, including the learning rate, the discount factor, and the structures and parameters of the actual-value neural network and the target-value neural network that describe the Q value;
(3) Designing the measurement and control state s according to the measurement and control scene model, initializing the input of the measurement and control scheduling network, and calculating the corresponding output; a measurement and control action is selected at random with probability epsilon, and with probability 1-epsilon according to the Q value output by the measurement and control scheduling network, i.e. the epsilon-greedy strategy, and the corresponding action is executed in the measurement and control resource scheduling network; after the action is executed, the reward r, i.e. the evaluation index of the measurement and control action, is obtained, together with the measurement and control state before the next action is executed, i.e. the measurement and control state s_{i+1} at the next moment; according to the currently selected measurement and control action and the current state, the Q value of the actual-value neural network and the Q value of the target-value neural network at the next moment in the measurement and control scheduling network, i.e. the actual Q value and the estimated Q value, are calculated;
(4) The four parameters (s_i, X_i, r_i, s_{i+1}) are stored together as one sample in the memory bank;
(5) A certain number of sample states are drawn at random from the memory bank, the target value of each state is calculated, and the Q value is updated toward this target using the reward obtained from execution; the parameters of the actual-value neural network are updated by stochastic gradient descent, and after every N iterative updates the current parameters of the actual-value network are assigned to the target-value neural network, realizing the update of the target-value network parameters in the measurement and control scheduling network; the parameters are updated continuously to train the measurement and control scheduling network;
(6) Through repeated, iterative algorithm updates, the selection and optimization of the measurement and control resource allocation strategy is realized and the optimal measurement and control scheduling strategy is selected; the measurement and control resource scheduling process is thereby completed.
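Steps (2) through (6) can be condensed into a toy training loop. A linear Q-function and a random stand-in environment replace the neural networks and the real measurement and control scene, so this is only a structural sketch, under stated assumptions, of the epsilon-greedy selection, the (s_i, X_i, r_i, s_{i+1}) memory bank, and the periodic target-network synchronization; it is not the patented method itself.

```python
import random
import numpy as np

random.seed(0)
rng = np.random.default_rng(0)
STATE_DIM, N_ACTIONS, N_SYNC, GAMMA, LR, EPS = 4, 3, 20, 0.9, 0.01, 0.1

W = rng.normal(size=(N_ACTIONS, STATE_DIM)) * 0.1  # actual-value "network"
W_target = W.copy()                                # target-value "network"
memory = []                                        # replay memory bank

def q_values(weights, s):
    return weights @ s

def select_action(s):
    # epsilon-greedy: explore with probability EPS, else act greedily on Q
    if random.random() < EPS:
        return random.randrange(N_ACTIONS)
    return int(np.argmax(q_values(W, s)))

def step_env(s, a):
    # toy stand-in for the measurement and control environment
    s_next = rng.normal(size=STATE_DIM)
    reward = 1.0 if a == 0 else 0.0
    return reward, s_next

s = rng.normal(size=STATE_DIM)
for t in range(1, 201):
    a = select_action(s)
    r, s_next = step_env(s, a)
    memory.append((s, a, r, s_next))        # store (s_i, X_i, r_i, s_{i+1})
    batch = random.sample(memory, min(32, len(memory)))
    for bs, ba, br, bs_next in batch:
        target = br + GAMMA * np.max(q_values(W_target, bs_next))
        td_error = target - q_values(W, bs)[ba]
        W[ba] += LR * td_error * bs         # stochastic gradient step
    if t % N_SYNC == 0:
        W_target = W.copy()                 # sync target network every N updates
    s = s_next

print(W.shape)  # (3, 4)
```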
CN202010609039.9A 2020-06-29 2020-06-29 Measurement and control resource scheduling method based on deep Q learning Active CN111767991B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010609039.9A CN111767991B (en) 2020-06-29 2020-06-29 Measurement and control resource scheduling method based on deep Q learning


Publications (2)

Publication Number Publication Date
CN111767991A CN111767991A (en) 2020-10-13
CN111767991B true CN111767991B (en) 2023-08-15

Family

ID=72724129

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010609039.9A Active CN111767991B (en) 2020-06-29 2020-06-29 Measurement and control resource scheduling method based on deep Q learning

Country Status (1)

Country Link
CN (1) CN111767991B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113613332B * 2021-07-14 2023-06-09 广东工业大学 Spectrum resource allocation method and system based on cooperative distributed DQN combined with a simulated annealing algorithm
CN113779856B (en) * 2021-09-15 2023-06-27 成都中科合迅科技有限公司 Discrete particle swarm optimization modeling method for electronic system function online recombination

Citations (10)

Publication number Priority date Publication date Assignee Title
CN107798388A (en) * 2017-11-23 2018-03-13 航天天绘科技有限公司 The method of TT&C Resources dispatching distribution based on Multi Agent and DNN
CN109388484A (en) * 2018-08-16 2019-02-26 广东石油化工学院 A kind of more resource cloud job scheduling methods based on Deep Q-network algorithm
CN109409763A (en) * 2018-11-08 2019-03-01 北京航空航天大学 A kind of dynamic test assignment dispatching method and dispatching platform based on Greedy grouping strategy
CN109542613A (en) * 2017-09-22 2019-03-29 中兴通讯股份有限公司 Distribution method, device and the storage medium of service dispatch in a kind of CDN node
CN109729586A (en) * 2017-10-30 2019-05-07 上海诺基亚贝尔股份有限公司 Dispatching method, equipment and computer-readable medium based on window
CN109960544A (en) * 2019-03-26 2019-07-02 中国人民解放军国防科技大学 Task parallel scheduling method based on data driving type agile satellite
CN110781614A (en) * 2019-12-06 2020-02-11 北京工业大学 Shipboard aircraft tripping recovery online scheduling method based on deep reinforcement learning
CN111026549A (en) * 2019-11-28 2020-04-17 国网甘肃省电力公司电力科学研究院 Automatic test resource scheduling method for power information communication equipment
CN111026548A (en) * 2019-11-28 2020-04-17 国网甘肃省电力公司电力科学研究院 Power communication equipment test resource scheduling method for reverse deep reinforcement learning
CN111162831A (en) * 2019-12-24 2020-05-15 中国科学院遥感与数字地球研究所 Ground station resource scheduling method

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US9373960B2 (en) * 2013-03-13 2016-06-21 Oracle International Corporation Computerized system and method for distributed energy resource scheduling


Non-Patent Citations (1)

Title
Research on multi-satellite measurement and control resource scheduling methods based on deep reinforcement learning; Wu Yi; China Master's Theses Full-text Database, Engineering Science and Technology II, No. 04 (2022); C031-341 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant