CN113163447A - Communication network task resource scheduling method based on Q learning


Info

Publication number
CN113163447A
Authority
CN
China
Prior art keywords: task, scheduling, node, communication network, value
Prior art date: 2021-03-12
Legal status: Granted
Application number: CN202110271286.7A
Other languages: Chinese (zh)
Other versions: CN113163447B (en)
Inventors
桂劲松 (Gui Jinsong)
刘尧 (Liu Yao)
Current Assignee: Central South University
Original Assignee: Central South University
Priority date: 2021-03-12
Filing date: 2021-03-12
Publication date: 2021-07-23
Application filed by Central South University
Priority to CN202110271286.7A
Publication of CN113163447A: 2021-07-23
Application granted; publication of CN113163447B: 2022-05-20
Legal status: Active

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04W - WIRELESS COMMUNICATION NETWORKS
    • H04W 28/00 - Network traffic management; Network resource management
    • H04W 28/02 - Traffic management, e.g. flow control or congestion control
    • H04W 28/08 - Load balancing or load distribution
    • H04W 72/00 - Local resource management
    • H04W 72/50 - Allocation or scheduling criteria for wireless resources
    • H04W 72/52 - Allocation or scheduling criteria for wireless resources based on load
    • H04W 72/535 - Allocation or scheduling criteria for wireless resources based on resource usage policies
    • Y02D - Climate change mitigation technologies in information and communication technologies (ICT)
    • Y02D 30/00 - Reducing energy consumption in communication networks
    • Y02D 30/70 - Reducing energy consumption in wireless communication networks

Abstract

The invention discloses a communication network task resource scheduling method based on Q learning. The method comprises the steps of: obtaining the real-time communication state and communication parameters of the communication network and initializing an R table; each task scheduling node of the communication network training its own Q table; each task scheduling node making decisions based on its own Q table; the communication network carrying out subsequent task resource scheduling according to the Q tables obtained by the task scheduling nodes; each task scheduling node updating its own R table; and repeating these steps for continuous task resource scheduling. The invention uses the characteristics of Q learning to model the interaction between task survival rate and resource utilization in an uncertain, highly dynamic network environment, and through innovative algorithm design and implementation achieves balanced task resource scheduling for a communication network under complex conditions. The method has high reliability, good stability, and is simple and convenient to apply.

Description

Communication network task resource scheduling method based on Q learning
Technical Field
The invention belongs to the field of decentralized computing, and particularly relates to a communication network task resource scheduling method based on Q learning.
Background
In a harsh wireless communication environment, especially one in which network throughput is severely limited while user applications require near-real-time responses, an application mode based on decentralized computing is a solution worth exploring to resolve the contradiction between the complexity and variability of computing tasks and the severe limitation of node resources. In a decentralized computing environment, to ensure that scheduled tasks can survive in a harsh battlefield environment and successfully complete military applications and other work, a damage-resistant relay mode for cross-node computing tasks needs to be studied. In this survivability and relay mode for node computing tasks, a key problem is determining, for a given number of scheduled tasks, a reasonable match between the amount of resources available during the task completion period and the number of tasks. If this match deviates too far from a reasonable value, either the resource utilization becomes too low or the task survival rate becomes too low, which aggravates the contradiction between severely limited resources and the huge number of tasks in a harsh battlefield environment.
When a computing node is physically damaged, a simple and effective way to allow the tasks executing on it to survive is to reschedule them to other computing nodes. Therefore, how well the total number of tasks scheduled for execution matches the total amount of resources available during a particular period directly affects the survival rate of that batch of tasks. If the number of scheduled tasks is chosen purely from the perspective of fully utilizing resources, the same resources can serve more tasks, but the probability that tasks fail because of physical damage to computing nodes is higher (for example, because no spare resources are left for survivability relay), so the task survival rate will not be high. Conversely, if the total number of tasks scheduled in the same period is reduced excessively, the probability of task failure caused by physical damage to computing nodes drops sharply, mainly because more alternative successor computing nodes are available for rescheduling; however, the resource utilization during that period will be low. In this case the task survival rate may be high, but exchanging a serious reduction in resource utilization for a high survival rate is not meaningful, especially in a resource-limited battlefield environment. Therefore, the interaction between task survival rate and resource utilization needs to be analyzed, and a reasonable balance point between the two needs to be found.
However, existing research and technical schemes aimed at finding this balance point are often unreliable and overly complicated.
Disclosure of Invention
The invention aims to provide a communication network task resource scheduling method based on Q learning that has high reliability and good stability and is simple and convenient to apply.
The invention provides a communication network task resource scheduling method based on Q learning, which comprises the following steps:
S1, acquiring the real-time communication state and communication parameters of the communication network, and initializing the R table;
S2, each task scheduling node of the communication network trains its own Q table;
S3, each task scheduling node of the communication network makes decisions based on its own Q table;
S4, the communication network carries out subsequent task resource scheduling according to the Q tables obtained by the task scheduling nodes in step S3;
S5, each task scheduling node of the communication network updates its own R table;
S6, steps S2 to S5 are repeated for continuous communication network task resource scheduling.
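Purely for illustration, the overall control flow of steps S1 to S6 can be sketched as the following Python skeleton; the class, method and function names (SchedulingNode, init_r_table, train_q_table, decide_from_q, schedule_tasks, update_r_table) are hypothetical and not part of the invention, and the bodies are left as stubs.

class SchedulingNode:
    """Hypothetical task scheduling node holding its own R table and Q table."""
    def __init__(self, node_id):
        self.node_id = node_id
        self.r_table = {}  # R_i: (state, action) -> return value
        self.q_table = {}  # Q_i: (state, action) -> Q value
    def init_r_table(self): pass   # S1: initialize R_i from the sensed network state
    def train_q_table(self): pass  # S2: train the node's own Q table
    def decide_from_q(self): pass  # S3: make decisions based on the node's own Q table
    def update_r_table(self): pass # S5: update R_i with the observed period statistics

def schedule_tasks(nodes):
    pass  # S4: network-level task resource scheduling based on the nodes' Q tables

def run(nodes, rounds=3):
    for node in nodes:
        node.init_r_table()          # S1
    for _ in range(rounds):          # S6: repeat S2 to S5 continuously
        for node in nodes:
            node.train_q_table()     # S2
            node.decide_from_q()     # S3
        schedule_tasks(nodes)        # S4
        for node in nodes:
            node.update_r_table()    # S5

run([SchedulingNode(i) for i in range(3)])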
The R table in step S1 is initialized by the following steps:
I. The value of the resource item in each initial state s_i^0 must not exceed the sum of the initialized resource amounts of all nodes; for each s_i^0 ∈ S_i, repeat the following steps II to VIII, where s_i^0 is the state of task scheduling node i at time 0 and S_i is the state space set of task scheduling node i;
II. For each a_i^0 ∈ A_i, repeat the following steps III to VIII, where a_i^0 is the action taken by task scheduling node i at time 0 and A_i is the action set of task scheduling node i;
III. Estimate the number of tasks to be scheduled according to the initial action a_i^0;
IV. Estimate the amount of resources required by these tasks according to the number of tasks to be scheduled;
V. Estimate the resource utilization u_i^0 according to the amount of resources required by the tasks to be scheduled and the value of the resource item in the initial state s_i^0;
VI. Estimate the mean damage probability of all nodes according to the damage probability initialized for each node;
VII. Judge: if the value of the task item in the initial state s_i^0 is not greater than the value of the resource item, take the mean node damage probability as the initial task success rate μ_i^0; otherwise, set the initial task success rate μ_i^0 to 0;
VIII. Initialize the return value r_i^0 obtained by task scheduling node i at time 0 as r_i^0 = ε_2·μ_i^0 + (1 - ε_2)·u_i^0, where ε_2 is a weight factor with a value range of 0 to 1.
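As a small illustration of step VIII, and assuming the weighted-sum form reconstructed above, the initial return value could be computed as follows (the function and parameter names are illustrative only):

def initial_return(success_rate, utilization, eps2=0.5):
    """r_i^0 = eps2 * mu_i^0 + (1 - eps2) * u_i^0 (assumed weighted-sum form)."""
    assert 0.0 <= eps2 <= 1.0, "the weight factor is restricted to the range 0 to 1"
    return eps2 * success_rate + (1.0 - eps2) * utilization

print(initial_return(success_rate=0.1, utilization=0.75))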
In step S2, each task scheduling node of the communication network trains its own Q table by the following steps:
Repeat the following steps A to F until the number of repetitions reaches the set number K:
A. Randomly select an initial state s_i^t ∈ S_i, where s_i^t is the state of task scheduling node i at time t and S_i is the state space set of task scheduling node i;
B. Set a first variable Q_max to 0;
C. For each a_i^t ∈ A_i, carry out the following steps a to c, where a_i^t is the action taken by task scheduling node i at time t and A_i is the action set of task scheduling node i:
a. Calculate the Q value of task scheduling node i at time t+1 using the following formula:
Q_i^{t+1}(s_i^t, a_i^t) = (1 - α)·Q_i^t(s_i^t, a_i^t) + α·[ r_i^{t+1} + β·Q_i^{t+1}(s_i^{t+1}, a*) ]
where Q_i^{t+1}(s_i^t, a_i^t) is the Q value of task scheduling node i at time t+1; α is a learning factor with value range [0, 1], and the larger α is, the more the performer of the action focuses on the current return; Q_i^t(s_i^t, a_i^t) is the Q value of task scheduling node i at time t; r_i^{t+1} is the return value obtained by task scheduling node i at time t+1; β is a discount factor with value range [0, 1), and the larger β is, the more the performer of the action focuses on future returns; s_i^{t+1} is the new state to which state s_i^t transitions after task scheduling node i takes action a_i^t at time t; a* is the action that obtains the maximum Q value for task scheduling node i in the new state s_i^{t+1}; Q_i^{t+1}(s_i^{t+1}, a*) is the Q value of task scheduling node i taking action a* in the new state s_i^{t+1} at time t+1;
b. Update the corresponding element in Q_i, where Q_i is the Q table of task scheduling node i;
c. Judge the updated element of Q_i: if Q_i^{t+1}(s_i^t, a_i^t) > Q_max, update Q_max to Q_i^{t+1}(s_i^t, a_i^t) and at the same time update a_max to a_i^t, where a_max is the action that obtains the maximum Q value for task scheduling node i in state s_i^t at time t+1; otherwise, Q_max and a_max remain unchanged;
D. Set the detection probability p_i^t;
E. Generate a random number ε with a value range of 0 to 1;
F. Compare the detection probability p_i^t with the generated random number ε: if ε does not exceed p_i^t, judge again: if action a_max can transition state s_i^t to the next state s_i^{t+1}, assign s_i^{t+1} to s_i^t and jump to step B; otherwise, jump back to step A; otherwise, randomly select an action other than a_max from the set A_i and judge again: if the selected action can transition state s_i^t to the next state s_i^{t+1}, assign s_i^{t+1} to s_i^t and jump to step B; otherwise, jump back to step A.
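Steps D to F amount to an exploration/exploitation choice over the action set. Below is a minimal sketch under the assumption that the detection probability is the probability of keeping the greedy action a_max (one possible reading of the comparison in step F); the function and parameter names are illustrative only.

import random

def select_action(actions, a_max, detect_prob):
    """Return a_max with probability detect_prob, otherwise another random action."""
    eps = random.random()          # step E: random number in the range 0 to 1
    if eps <= detect_prob:         # step F: keep the greedy action
        return a_max
    others = [a for a in actions if a != a_max]
    return random.choice(others) if others else a_max  # explore a different action

print(select_action(actions=[10, 20, 30, 40], a_max=30, detect_prob=0.9))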
In step S3, each task scheduling node of the communication network makes decisions based on its own Q table by the following steps:
(1) Initially set the state s_i^t and set a second variable V = 0;
(2) For each a_i^t ∈ A_i, carry out the following operations: according to s_i^t, look up Q_i^t(s_i^t, a_i^t) in Q_i and judge: if Q_i^t(s_i^t, a_i^t) > V, assign Q_i^t(s_i^t, a_i^t) to V and at the same time assign a_i^t to a_0, where a_0 is the action that obtains the maximum Q value for task scheduling node i in state s_i^t at time t; otherwise, V and a_0 remain unchanged;
(3) Judge: if action a_0 can transition state s_i^t to the next state s_i^{t+1}, calculate Q_i^{t+1}(s_i^t, a_0) using the following formula:
Q_i^{t+1}(s_i^t, a_0) = (1 - α)·Q_i^t(s_i^t, a_0) + α·[ r_i^{t+1} + β·Q_i^{t+1}(s_i^{t+1}, a*) ]
(4) Update the corresponding element in Q_i;
(5) Assign s_i^{t+1} to s_i^t and return to step (2).
In step S5, each task scheduling node of the communication network updates its own R table by the following steps:
1) Count the total amount of resources in the resource view during the period from l_t - τ_t to l_t and denote it as f_i^t, where l_t is the virtual time t of task scheduling and execution, τ_t is the task scheduling and execution cycle, and the resource view is the set of execution nodes visible to scheduling node i in the current scheduling period;
2) Count the number of tasks that have been scheduled for execution during the period from l_t - τ_t to l_t, denote it as n_i^t, and count the total amount of resources occupied by these n_i^t tasks;
3) Estimate the resource utilization u_i^t from the statistical results of steps 1) and 2) and record it; the resource utilization is defined as the ratio of the actually occupied resource amount to the total resource amount;
4) Estimate the success rate of task execution according to the damage rate of each node that executes tasks during the period from l_t - τ_t to l_t;
5) Based on the success rate of each task obtained in step 4), compute the average success rate of all tasks and record it as μ_i^t;
6) Calculate the return value r_i^t obtained by task scheduling node i at time t using the following formula:
r_i^t = ε_1·μ_i^t + (1 - ε_1)·u_i^t
where ε_1 is a weight factor with a value range of 0 to 1, μ_i^t is the average success rate of all tasks counted by task scheduling node i at time t, and u_i^t is the resource utilization counted by task scheduling node i at time t;
7) According to s_i^t, find the most recent state in the return table R_i;
8) According to a_i^t, find the most recent action in the return table R_i;
9) Use r_i^t to update the return value in R_i corresponding to the most recent state and the most recent action found.
By exploiting the characteristics of Q learning, the communication network task resource scheduling method based on Q learning provided by the invention models the interaction between task survival rate and resource utilization in an uncertain, highly dynamic network environment, and through innovative algorithm design and implementation achieves balanced task resource scheduling for a communication network under complex conditions, with high reliability, good stability, simplicity and convenience.
Drawings
FIG. 1 is a schematic flow diagram of the method of the present invention.
Fig. 2 is a schematic diagram of a network structure according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of task survival rate and resource utilization rate of different parameter combinations of the Q learning model under different training rounds according to the embodiment of the present invention.
Fig. 4 is a schematic diagram illustrating an influence of a damage probability of a task execution node and the number of training rounds on a task survival rate and a resource utilization rate according to an embodiment of the present invention.
Fig. 5 is a schematic diagram illustrating an influence of the number of currently scheduled tasks and the number of training rounds on task survival rate and resource utilization rate according to an embodiment of the present invention.
Detailed Description
The invention provides a method for modeling the interaction between task survival rate and resource utilization. To clarify this interaction and obtain a balanced result that accounts for both performance requirements, an effective approach is to model the problem as a multi-objective constrained optimization problem. However, because a large number of modeling parameters are involved in an uncertain, highly dynamic network environment, purely mathematical modeling is complicated and the resulting problem is also quite difficult to solve. By exploiting the characteristics of Q learning, a breakthrough can be found for modeling the interaction between task survival rate and resource utilization in an uncertain, highly dynamic network environment.
Task survival rate refers to the ratio of the number of tasks successfully completed within a particular period to the total number of tasks scheduled in the system. In a resource-limited environment (such as a battlefield environment), the total set of tasks to be executed must be scheduled and deployed to appropriate computing nodes according to the distribution and availability of node and network resources sensed in real time; the earliest task to start and the latest task to complete determine the execution period of the batch, i.e. the specific period in the definition of task survival rate. Two main factors affect the survival of a task: first, unreasonable allocation of computing resources, so that a task fails because no resources are available, which is mainly related to task decomposition and the performance of the scheduling algorithm; second, physical damage to computing nodes, which causes task execution to fail. The invention mainly focuses on the latter and provides a Q-learning-based method for solving the multi-objective constrained problem.
Before the Q learning method is used, a state space, returns and a Q table must be defined for the problem of interest. Since each task scheduling node can acquire in real time only the node resource distribution, resource usage status and network parameters in its neighboring area, the neighboring area centered on the task scheduling node is taken as the unit for defining the state space, action space, return, R table and Q table. The state space is defined as a combined space whose dimensions are "available resource amount" and "number of scheduled tasks"; the value range of each dimension can be set according to the fluctuation range of the available resource amount in the area and historical experience of the variation range of the task load.
An action refers to an operation in which the system can change states through parameter adjustment. Here, since the amount of available resources cannot be actively adjusted, the action space is defined as a set of the number of tasks that can be selected for scheduling. In a particular state, after an action is taken, the current state transitions to a new state and the performer of the action receives a return.
The R table is defined as a two-dimensional matrix, each row of the matrix representing a state, each column representing an action, the values in the matrix representing specific return values that can be evaluated. Similarly, a Q table is also defined as a two-dimensional matrix in which each row represents a state and each column represents an action, and the values in the matrix are referred to as Q values. The Q value represents the degree to which the agent has acquired "knowledge" in different environments. When an action is taken, the transition from the current state to the next new state occurs, and the actor of the action obtains a new return value.
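As a data-structure illustration only (the sizes and state encoding below are placeholders, not values from the invention), the R table and Q table can be represented as two-dimensional arrays indexed by state and action:

import numpy as np

n_states, n_actions = 20, 10          # e.g. discretized (available resources, scheduled tasks) states
R = np.zeros((n_states, n_actions))   # R table: evaluated return for each (state, action) pair
Q = np.zeros((n_states, n_actions))   # Q table: accumulated "knowledge" for each (state, action) pair

# Each row is a state and each column an action; taking action a in state s moves
# the system to a new state and yields a return that is recorded in R[s, a] and
# gradually folded into Q[s, a] during training.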
Because different task scheduling nodes have different resource views, each task scheduling node independently maintains its own R table and Q table. Due to the dynamics of resources and tasks, the actual state space is large. To obtain more accurate decision results, the state space of Q learning should be as large as possible and the action set should also be large enough, so the training process of Q learning can be very time-consuming and computationally intensive. The training task of Q learning can therefore be scheduled to nearby nodes, using the cooperation among scattered computing nodes to provide sufficient computing power. The decision process based on the trained Q table requires little computing power; the computing power of a single scattered node is usually sufficient, so the decision can be executed by the task scheduling node itself. The ultimate goal of Q learning is a converged Q table, i.e. one whose values no longer change; however, in practical applications, because the state space and action space are large, a long training time is needed to reach convergence, so the Q table is often used directly after being trained for a certain time and is then updated during use, so that its Q values continuously approach the converged values.
Therefore, the communication network task resource scheduling method based on Q learning provided by the invention comprises the following steps (as shown in fig. 1):
S1, acquiring the real-time communication state and communication parameters of the communication network, and initializing the R table; specifically, the initialization is performed by the following steps:
I. The value of the resource item in each initial state s_i^0 must not exceed the sum of the initialized resource amounts of all nodes; for each s_i^0 ∈ S_i, repeat the following steps II to VIII, where s_i^0 is the state of task scheduling node i at time 0 and S_i is the state space set of task scheduling node i;
II. For each a_i^0 ∈ A_i, repeat the following steps III to VIII, where a_i^0 is the action taken by task scheduling node i at time 0 and A_i is the action set of task scheduling node i;
III. Estimate the number of tasks to be scheduled according to the initial action a_i^0;
IV. Estimate the amount of resources required by these tasks according to the number of tasks to be scheduled;
V. Estimate the resource utilization u_i^0 according to the amount of resources required by the tasks to be scheduled and the value of the resource item in the initial state s_i^0;
VI. Estimate the mean damage probability of all nodes according to the damage probability initialized for each node;
VII. Judge: if the value of the task item in the initial state s_i^0 is not greater than the value of the resource item, take the mean node damage probability as the initial task success rate μ_i^0; otherwise, set the initial task success rate μ_i^0 to 0;
VIII. Initialize the return value r_i^0 obtained by task scheduling node i at time 0 as r_i^0 = ε_2·μ_i^0 + (1 - ε_2)·u_i^0, where ε_2 is a weight factor with a value range of 0 to 1.
The pseudo code for this part is given in the corresponding figure of the original filing and is not reproduced here.
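In place of the omitted figure, the following Python sketch illustrates one possible reading of initialization steps I to VIII; the per-task resource estimate, the utilization estimator and the weighted-sum return are assumptions, not the exact formulas of the original filing.

import numpy as np

def init_r_table(states, actions, total_resources, damage_probs,
                 resources_per_task=1.0, eps2=0.5):
    """Initialize R[s, a] for every initial state s and action a (steps I to VIII)."""
    mean_damage = float(np.mean(damage_probs))                 # step VI
    R = np.zeros((len(states), len(actions)))
    for si, (resource_item, task_item) in enumerate(states):
        assert resource_item <= total_resources                # step I
        for ai, n_tasks in enumerate(actions):                 # steps II-III
            needed = n_tasks * resources_per_task              # step IV
            utilization = min(needed / resource_item, 1.0) if resource_item else 0.0  # step V
            success = mean_damage if task_item <= resource_item else 0.0              # step VII
            R[si, ai] = eps2 * success + (1.0 - eps2) * utilization                   # step VIII
    return R

# Example: states as (available resources, scheduled tasks), actions as task counts.
states = [(r, t) for r in (4, 8) for t in (2, 6)]
R = init_r_table(states, actions=[1, 2, 4], total_resources=16,
                 damage_probs=[0.1, 0.2, 0.1])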
S2, each task scheduling node of the communication network trains its own Q table; specifically, the training is performed by the following steps:
Repeat the following steps A to F until the number of repetitions reaches the set number K:
A. Randomly select an initial state s_i^t ∈ S_i, where s_i^t is the state of task scheduling node i at time t and S_i is the state space set of task scheduling node i;
B. Set a first variable Q_max to 0;
C. For each a_i^t ∈ A_i, carry out the following steps a to c, where a_i^t is the action taken by task scheduling node i at time t and A_i is the action set of task scheduling node i:
a. Calculate the Q value of task scheduling node i at time t+1 using the following formula:
Q_i^{t+1}(s_i^t, a_i^t) = (1 - α)·Q_i^t(s_i^t, a_i^t) + α·[ r_i^{t+1} + β·Q_i^{t+1}(s_i^{t+1}, a*) ]
where Q_i^{t+1}(s_i^t, a_i^t) is the Q value of task scheduling node i at time t+1; α is a learning factor with value range [0, 1], and the larger α is, the more the performer of the action focuses on the current return; Q_i^t(s_i^t, a_i^t) is the Q value of task scheduling node i at time t; r_i^{t+1} is the return value obtained by task scheduling node i at time t+1; β is a discount factor with value range [0, 1), and the larger β is, the more the performer of the action focuses on future returns; s_i^{t+1} is the new state to which state s_i^t transitions after task scheduling node i takes action a_i^t at time t; a* is the action that obtains the maximum Q value for task scheduling node i in the new state s_i^{t+1}; Q_i^{t+1}(s_i^{t+1}, a*) is the Q value of task scheduling node i taking action a* in the new state s_i^{t+1} at time t+1;
b. Update the corresponding element in Q_i, where Q_i is the Q table of task scheduling node i;
c. Judge the updated element of Q_i: if Q_i^{t+1}(s_i^t, a_i^t) > Q_max, update Q_max to Q_i^{t+1}(s_i^t, a_i^t) and at the same time update a_max to a_i^t, where a_max is the action that obtains the maximum Q value for task scheduling node i in state s_i^t at time t+1; otherwise, Q_max and a_max remain unchanged;
D. Set the detection probability p_i^t;
E. Generate a random number ε with a value range of 0 to 1;
F. Compare the detection probability p_i^t with the generated random number ε: if ε does not exceed p_i^t, judge again: if action a_max can transition state s_i^t to the next state s_i^{t+1}, assign s_i^{t+1} to s_i^t and jump to step B; otherwise, jump back to step A; otherwise, randomly select an action other than a_max from the set A_i and judge again: if the selected action can transition state s_i^t to the next state s_i^{t+1}, assign s_i^{t+1} to s_i^t and jump to step B; otherwise, jump back to step A.
The pseudo code for this part is given in the corresponding figures of the original filing and is not reproduced here.
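In place of the omitted figures, a compact Python sketch of training steps A to F is given below; the transition function, the fixed detection probability and the use of the R table as the source of the return r_i^{t+1} are simplifying assumptions.

import random
import numpy as np

def train_q_table(R, transition, K, alpha=0.8, beta=0.2, detect_prob=0.9):
    """Train a Q table over K repetitions of steps A to F.

    R          -- return table of shape (n_states, n_actions)
    transition -- transition(s, a) -> next state index, or None if impossible
    """
    n_states, n_actions = R.shape
    Q = np.zeros_like(R)
    s = random.randrange(n_states)                      # step A
    for _ in range(K):
        q_max, a_max = 0.0, 0                           # step B
        for a in range(n_actions):                      # step C
            s_next = transition(s, a)
            future = Q[s_next].max() if s_next is not None else 0.0
            Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (R[s, a] + beta * future)  # step a
            if Q[s, a] > q_max:                         # step c
                q_max, a_max = Q[s, a], a
        eps = random.random()                           # step E
        if eps <= detect_prob:                          # step F: greedy branch
            a_sel = a_max
        else:                                           # step F: explore another action
            others = [a for a in range(n_actions) if a != a_max]
            a_sel = random.choice(others) if others else a_max
        s_next = transition(s, a_sel)
        s = s_next if s_next is not None else random.randrange(n_states)  # back to step A if stuck
    return Q

# Example with a toy ring-shaped state space.
R = np.random.rand(6, 3)
Q = train_q_table(R, transition=lambda s, a: (s + a + 1) % 6, K=1000)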
S3, each task scheduling node of the communication network makes decisions based on its own Q table; specifically, the decision is made by the following steps:
(1) Initially set the state s_i^t and set a second variable V = 0;
(2) For each a_i^t ∈ A_i, carry out the following operations: according to s_i^t, look up Q_i^t(s_i^t, a_i^t) in Q_i and judge: if Q_i^t(s_i^t, a_i^t) > V, assign Q_i^t(s_i^t, a_i^t) to V and at the same time assign a_i^t to a_0, where a_0 is the action that obtains the maximum Q value for task scheduling node i in state s_i^t at time t; otherwise, V and a_0 remain unchanged;
(3) Judge: if action a_0 can transition state s_i^t to the next state s_i^{t+1}, calculate Q_i^{t+1}(s_i^t, a_0) using the following formula:
Q_i^{t+1}(s_i^t, a_0) = (1 - α)·Q_i^t(s_i^t, a_0) + α·[ r_i^{t+1} + β·Q_i^{t+1}(s_i^{t+1}, a*) ]
(4) Update the corresponding element in Q_i;
(5) Assign s_i^{t+1} to s_i^t and return to step (2).
The pseudo code for this part is given in the corresponding figures of the original filing and is not reproduced here.
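In place of the omitted figures, a minimal Python sketch of decision steps (1) to (5) follows; the in-use Q update mirrors the training update reconstructed above, and the max_steps bound is added purely so the illustration terminates.

import numpy as np

def decide(Q, R, transition, s, alpha=0.8, beta=0.2, max_steps=50):
    """Greedily choose actions from the trained Q table, updating Q while it is used."""
    chosen = []
    for _ in range(max_steps):
        v, a0 = 0.0, 0                          # step (1): second variable V = 0
        for a in range(Q.shape[1]):             # step (2): action with the largest Q value
            if Q[s, a] > v:
                v, a0 = Q[s, a], a
        s_next = transition(s, a0)              # step (3)
        if s_next is None:
            break
        Q[s, a0] = (1 - alpha) * Q[s, a0] + alpha * (R[s, a0] + beta * Q[s_next].max())  # step (4)
        chosen.append(a0)
        s = s_next                              # step (5): continue from the new state
    return chosen

Q = np.random.rand(6, 3)
R = np.random.rand(6, 3)
print(decide(Q, R, transition=lambda s, a: (s + a + 1) % 6, s=0))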
S4, the communication network carries out subsequent task resource scheduling according to the Q tables obtained by the task scheduling nodes in step S3;
S5, each task scheduling node of the communication network updates its own R table; specifically, the update is performed by the following steps:
1) Count the total amount of resources in the resource view during the period from l_t - τ_t to l_t and denote it as f_i^t, where l_t is the virtual time t of task scheduling and execution, τ_t is the task scheduling and execution cycle, and the resource view is the set of execution nodes visible to scheduling node i in the current scheduling period;
2) Count the number of tasks that have been scheduled for execution during the period from l_t - τ_t to l_t, denote it as n_i^t, and count the total amount of resources occupied by these n_i^t tasks;
3) Estimate the resource utilization u_i^t from the statistical results of steps 1) and 2) and record it; the resource utilization is defined as the ratio of the actually occupied resource amount to the total resource amount;
4) Estimate the success rate of task execution according to the damage rate of each node that executes tasks during the period from l_t - τ_t to l_t;
5) Based on the success rate of each task obtained in step 4), compute the average success rate of all tasks and record it as μ_i^t;
6) Calculate the return value r_i^t obtained by task scheduling node i at time t using the following formula:
r_i^t = ε_1·μ_i^t + (1 - ε_1)·u_i^t
where ε_1 is a weight factor with a value range of 0 to 1, μ_i^t is the average success rate of all tasks counted by task scheduling node i at time t, and u_i^t is the resource utilization counted by task scheduling node i at time t;
7) According to s_i^t, find the most recent state in the return table R_i;
8) According to a_i^t, find the most recent action in the return table R_i;
9) Use r_i^t to update the return value in R_i corresponding to the most recent state and the most recent action found.
The pseudo code for this part is given in the corresponding figure of the original filing and is not reproduced here.
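In place of the omitted figure, a sketch of the per-period statistics and the R table update in steps 1) to 9) follows; the success-rate estimator (one minus the damage rate) and the weighted-sum return are assumptions rather than the exact formulas of the original filing.

import numpy as np

def update_r_table(R, state, action, view_resources, occupied_resources,
                   damage_rates, eps1=0.5):
    """Update R[state, action] with the return observed over the last scheduling period."""
    f = float(sum(view_resources))                              # step 1): resources in the view
    utilization = occupied_resources / f if f else 0.0          # steps 2)-3)
    successes = [1.0 - d for d in damage_rates]                 # step 4): assumed estimator
    mu = sum(successes) / len(successes) if successes else 0.0  # step 5): average success rate
    r = eps1 * mu + (1.0 - eps1) * utilization                  # step 6): return value r_i^t
    R[state, action] = r                                        # steps 7)-9): write back into R_i
    return r

R = np.zeros((6, 3))
print(update_r_table(R, state=2, action=1, view_resources=[4, 6, 2],
                     occupied_resources=8, damage_rates=[0.1, 0.2]))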
S6, steps S2 to S5 are repeated for continuous communication network task resource scheduling.
The invention was evaluated with a simulation network built in OMNeT++. 50 nodes are placed in a circular field with a radius of 500 meters. Parameters such as the initial coordinates, computing resources (for example, the number of CPUs) and communication radius of each node are read from a data file (with an .ini suffix) when the simulation system is initialized. The data in the file are prepared in advance: the node coordinates are generated randomly within the circular field of radius 500 meters, the number of idle CPUs of each node is drawn randomly from the range 0 to 16, and the communication radius of each node is 250 meters. After the simulation system starts, the coordinate position of each node is refreshed once every time interval T; to simplify the refresh process, the current value is randomly increased or decreased by 0 to 50%. To simplify other simulation details and focus the discussion on the relationship between task survival rate and resource utilization, a node with a fixed ID is set as the task scheduling node, and the number of tasks to be scheduled and executed is counted in units of standard small tasks.
A standard small task needs at least one CPU for a duration of T to complete successfully. To reflect the characteristics of a communication environment in which resources are always limited (such as a battlefield environment), the number of small tasks to be scheduled on the task scheduling node is set large enough that the total amount of resources they require exceeds the amount of resources in the resource view obtained by the task scheduling node. It is assumed that the task scheduling node can obtain in real time only a predicted resource view within its two-hop communication range; according to this resource view, tasks are distributed to the selected task execution nodes, and each task execution node is set to be damaged with a certain probability. If damage occurs, the task scheduling node reschedules the tasks on the damaged node to other nodes with idle resources; if no resources are available, the task is declared failed. The resource view of the task scheduling node is illustrated in fig. 2.
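For reference, the random node parameters described above could be generated with a short script like the following before being written into the simulator's data file; the dictionary field names are illustrative and do not reflect the actual .ini syntax used by OMNeT++.

import math
import random

def generate_nodes(n=50, field_radius=500.0, comm_radius=250.0, max_cpus=16):
    """Random coordinates inside a disk of the given radius, idle CPUs and communication radius."""
    nodes = []
    for i in range(n):
        r = field_radius * math.sqrt(random.random())   # uniform sampling over the disk
        theta = random.uniform(0.0, 2.0 * math.pi)
        nodes.append({
            "id": i,
            "x": r * math.cos(theta),
            "y": r * math.sin(theta),
            "cpus": random.randint(0, max_cpus),         # idle CPUs drawn from 0 to 16
            "comm_radius": comm_radius,                  # fixed 250 m communication radius
        })
    return nodes

nodes = generate_nodes()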
First, the task survival rate and resource utilization of different parameter combinations of the Q learning model under different numbers of training rounds are analyzed; the simulation results are shown in fig. 3. Here, the damage probability of the task execution nodes is uniformly set to 10%, and the number of currently scheduled tasks is uniformly set to 50% of N (N is the maximum number of tasks that can be scheduled in the simulation system). As can be seen from fig. 3(a), the task survival rate increases as the number of training rounds of the Q table increases, because the training results of the Q table gradually converge to the optimum. Fig. 3(b) shows that the resource utilization also increases with the number of training rounds, for the same reason as in fig. 3(a). Meanwhile, the parameter combination α = 0.8 and β = 0.2 can be seen to give the best task survival rate and resource utilization.
Next, the task survival rate and resource utilization for different damage probabilities of the task execution nodes under different numbers of training rounds are analyzed; the simulation results are shown in fig. 4. Here, the Q learning parameters are uniformly set to α = 0.8 and β = 0.2, and the number of currently scheduled tasks is uniformly set to 50% of N (N is the maximum number of tasks schedulable in the simulation system). As can be seen from fig. 4(a), the task survival rate increases as the number of training rounds of the Q table increases, for the same reason as in fig. 3(a). Fig. 4(a) also shows that a larger node damage probability results in a lower task survival rate, because a higher damage rate reduces the survival rate of the tasks assigned to the nodes. Fig. 4(b) shows that the resource utilization increases with the number of training rounds, for the same reason as in fig. 3(b); the utilization differs across node damage probabilities because a larger damage probability means that the amount of resources in the available resource view decreases while the resources occupied by the tasks do not change.
Finally, the influence of the damage probability of the task execution nodes and the number of currently scheduled tasks on the task survival rate and resource utilization is analyzed; the simulation results are shown in fig. 5. Here, the Q learning parameters are uniformly set to α = 0.8 and β = 0.2, and 100000 training rounds are uniformly used. Each data point in fig. 5 is obtained with a Q table trained for 100000 rounds under the set Q learning parameter combination, giving the task survival rate and resource utilization under different task execution node damage probabilities as the number of currently scheduled tasks takes different values.
As shown in fig. 5(a), as the number of scheduled tasks increases, the task survival rate shows a downward trend, mainly because when a task execution node is damaged, the chance of finding a successor node is reduced, and a larger node damage probability aggravates this effect. Fig. 5(b) shows that the resource utilization becomes larger as the number of scheduled tasks increases, because the amount of resources occupied by the tasks increases while the amount of resources in the view available to the system stays constant. Meanwhile, as seen from fig. 5(b), a larger node damage probability results in a smaller amount of resources in the available resource view, while the amount of resources required by the tasks does not change for the same number of currently scheduled tasks, so the resource utilization shows an upward trend.

Claims (5)

1. A communication network task resource scheduling method based on Q learning, comprising the following steps:
S1, acquiring the real-time communication state and communication parameters of the communication network, and initializing the R table;
S2, each task scheduling node of the communication network trains its own Q table;
S3, each task scheduling node of the communication network makes decisions based on its own Q table;
S4, the communication network carries out subsequent task resource scheduling according to the Q tables obtained by the task scheduling nodes in step S3;
S5, each task scheduling node of the communication network updates its own R table;
S6, steps S2 to S5 are repeated for continuous communication network task resource scheduling.
2. The communication network task resource scheduling method based on Q learning according to claim 1, wherein the R table in step S1 is initialized by the following steps:
I. The value of the resource item in each initial state s_i^0 must not exceed the sum of the initialized resource amounts of all nodes; for each s_i^0 ∈ S_i, repeat the following steps II to VIII, where s_i^0 is the state of task scheduling node i at time 0 and S_i is the state space set of task scheduling node i;
II. For each a_i^0 ∈ A_i, repeat the following steps III to VIII, where a_i^0 is the action taken by task scheduling node i at time 0 and A_i is the action set of task scheduling node i;
III. Estimate the number of tasks to be scheduled according to the initial action a_i^0;
IV. Estimate the amount of resources required by these tasks according to the number of tasks to be scheduled;
V. Estimate the resource utilization u_i^0 according to the amount of resources required by the tasks to be scheduled and the value of the resource item in the initial state s_i^0;
VI. Estimate the mean damage probability of all nodes according to the damage probability initialized for each node;
VII. Judge: if the value of the task item in the initial state s_i^0 is not greater than the value of the resource item, take the mean node damage probability as the initial task success rate μ_i^0; otherwise, set the initial task success rate μ_i^0 to 0;
VIII. Initialize the return value r_i^0 obtained by task scheduling node i at time 0 as r_i^0 = ε_2·μ_i^0 + (1 - ε_2)·u_i^0, where ε_2 is a weight factor with a value range of 0 to 1.
3. The communication network task resource scheduling method based on Q learning according to claim 1 or 2, wherein in step S2 each task scheduling node of the communication network trains its own Q table by the following steps:
Repeat the following steps A to F until the number of repetitions reaches the set number K:
A. Randomly select an initial state s_i^t ∈ S_i, where s_i^t is the state of task scheduling node i at time t and S_i is the state space set of task scheduling node i;
B. Set a first variable Q_max to 0;
C. For each a_i^t ∈ A_i, carry out the following steps a to c, where a_i^t is the action taken by task scheduling node i at time t and A_i is the action set of task scheduling node i:
a. Calculate the Q value of task scheduling node i at time t+1 using the following formula:
Q_i^{t+1}(s_i^t, a_i^t) = (1 - α)·Q_i^t(s_i^t, a_i^t) + α·[ r_i^{t+1} + β·Q_i^{t+1}(s_i^{t+1}, a*) ]
where Q_i^{t+1}(s_i^t, a_i^t) is the Q value of task scheduling node i at time t+1; α is a learning factor with value range [0, 1], and the larger α is, the more the performer of the action focuses on the current return; Q_i^t(s_i^t, a_i^t) is the Q value of task scheduling node i at time t; r_i^{t+1} is the return value obtained by task scheduling node i at time t+1; β is a discount factor with value range [0, 1), and the larger β is, the more the performer of the action focuses on future returns; s_i^{t+1} is the new state to which state s_i^t transitions after task scheduling node i takes action a_i^t at time t; a* is the action that obtains the maximum Q value for task scheduling node i in the new state s_i^{t+1}; Q_i^{t+1}(s_i^{t+1}, a*) is the Q value of task scheduling node i taking action a* in the new state s_i^{t+1} at time t+1;
b. Update the corresponding element in Q_i, where Q_i is the Q table of task scheduling node i;
c. Judge the updated element of Q_i: if Q_i^{t+1}(s_i^t, a_i^t) > Q_max, update Q_max to Q_i^{t+1}(s_i^t, a_i^t) and at the same time update a_max to a_i^t, where a_max is the action that obtains the maximum Q value for task scheduling node i in state s_i^t at time t+1; otherwise, Q_max and a_max remain unchanged;
D. Set the detection probability p_i^t;
E. Generate a random number ε with a value range of 0 to 1;
F. Compare the detection probability p_i^t with the generated random number ε: if ε does not exceed p_i^t, judge again: if action a_max can transition state s_i^t to the next state s_i^{t+1}, assign s_i^{t+1} to s_i^t and jump to step B; otherwise, jump back to step A; otherwise, randomly select an action other than a_max from the set A_i and judge again: if the selected action can transition state s_i^t to the next state s_i^{t+1}, assign s_i^{t+1} to s_i^t and jump to step B; otherwise, jump back to step A.
4. The communication network task resource scheduling method based on Q learning according to claim 3, wherein in step S3 each task scheduling node of the communication network makes decisions based on its own Q table by the following steps:
(1) Initially set the state s_i^t and set a second variable V = 0;
(2) For each a_i^t ∈ A_i, carry out the following operations: according to s_i^t, look up Q_i^t(s_i^t, a_i^t) in Q_i and judge: if Q_i^t(s_i^t, a_i^t) > V, assign Q_i^t(s_i^t, a_i^t) to V and at the same time assign a_i^t to a_0, where a_0 is the action that obtains the maximum Q value for task scheduling node i in state s_i^t at time t; otherwise, V and a_0 remain unchanged;
(3) Judge: if action a_0 can transition state s_i^t to the next state s_i^{t+1}, calculate Q_i^{t+1}(s_i^t, a_0) using the following formula:
Q_i^{t+1}(s_i^t, a_0) = (1 - α)·Q_i^t(s_i^t, a_0) + α·[ r_i^{t+1} + β·Q_i^{t+1}(s_i^{t+1}, a*) ]
(4) Update the corresponding element in Q_i;
(5) Assign s_i^{t+1} to s_i^t and return to step (2).
5. The communication network task resource scheduling method based on Q learning according to claim 4, wherein in step S5 each task scheduling node of the communication network updates its own R table by the following steps:
1) Count the total amount of resources in the resource view during the period from l_t - τ_t to l_t and denote it as f_i^t, where l_t is the virtual time t of task scheduling and execution, τ_t is the task scheduling and execution cycle, and the resource view is the set of execution nodes visible to scheduling node i in the current scheduling period;
2) Count the number of tasks that have been scheduled for execution during the period from l_t - τ_t to l_t, denote it as n_i^t, and count the total amount of resources occupied by these n_i^t tasks;
3) Estimate the resource utilization u_i^t from the statistical results of steps 1) and 2) and record it; the resource utilization is defined as the ratio of the actually occupied resource amount to the total resource amount;
4) Estimate the success rate of task execution according to the damage rate of each node that executes tasks during the period from l_t - τ_t to l_t;
5) Based on the success rate of each task obtained in step 4), compute the average success rate of all tasks and record it as μ_i^t;
6) Calculate the return value r_i^t obtained by task scheduling node i at time t using the following formula:
r_i^t = ε_1·μ_i^t + (1 - ε_1)·u_i^t
where ε_1 is a weight factor with a value range of 0 to 1, μ_i^t is the average success rate of all tasks counted by task scheduling node i at time t, and u_i^t is the resource utilization counted by task scheduling node i at time t;
7) According to s_i^t, find the most recent state in the return table R_i;
8) According to a_i^t, find the most recent action in the return table R_i;
9) Use r_i^t to update the return value in R_i corresponding to the most recent state and the most recent action found.
CN202110271286.7A (filed 2021-03-12, priority 2021-03-12): Communication network task resource scheduling method based on Q learning; status: Active; granted as CN113163447B.

Priority Applications (1)

CN202110271286.7A: Communication network task resource scheduling method based on Q learning (granted as CN113163447B)


Publications (2)

CN113163447A: published 2021-07-23
CN113163447B: published 2022-05-20

Family ID: 76887502

Family Applications (1)

CN202110271286.7A (Active): Communication network task resource scheduling method based on Q learning

Country Status (1)

CN: CN113163447B


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150282166A1 (en) * 2012-11-14 2015-10-01 China Academy Of Telecommunications Technology Method and device for scheduling slot resources
CN108139930A (en) * 2016-05-24 2018-06-08 华为技术有限公司 Resource regulating method and device based on Q study
US20190124667A1 (en) * 2017-10-23 2019-04-25 Commissariat A L'energie Atomique Et Aux Energies Alternatives Method for allocating transmission resources using reinforcement learning
CN108112082A (en) * 2017-12-18 2018-06-01 北京工业大学 A kind of wireless network distributed freedom resource allocation methods based on statelessly Q study
CN110515735A (en) * 2019-08-29 2019-11-29 哈尔滨理工大学 A kind of multiple target cloud resource dispatching method based on improvement Q learning algorithm
CN110636523A (en) * 2019-09-20 2019-12-31 中南大学 Millimeter wave mobile backhaul link energy efficiency stabilization scheme based on Q learning
CN111405568A (en) * 2020-03-19 2020-07-10 三峡大学 Computing unloading and resource allocation method and device based on Q learning
CN111556572A (en) * 2020-04-21 2020-08-18 北京邮电大学 Spectrum resource and computing resource joint allocation method based on reinforcement learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Jinsong Gui: "Stabilizing Transmission Capacity in Millimeter Wave Links by Q-Learning-Based Scheme", Mobile Information Systems *
喻鹏: "Energy-efficient resource allocation method based on double deep Q learning in mobile edge networks" (移动边缘网络中基于双深度Q学习的高能效资源分配方法), Journal on Communications (通信学报) *
李孜恒: "Resource allocation algorithm for wireless networks based on deep reinforcement learning" (基于深度强化学习的无线网络资源分配算法), Communications Technology (通信技术) *

Also Published As

Publication number Publication date
CN113163447B (en) 2022-05-20

Similar Documents

Publication Publication Date Title
Chen et al. iRAF: A deep reinforcement learning approach for collaborative mobile edge computing IoT networks
CN108846570B (en) Method for solving resource-limited project scheduling problem
Yang et al. A prediction-based user selection framework for heterogeneous mobile crowdsensing
Shyalika et al. Reinforcement learning in dynamic task scheduling: A review
Palacios et al. Genetic tabu search for the fuzzy flexible job shop problem
Gonzalez et al. Instance-based learning: integrating sampling and repeated decisions from experience.
CN107831685B (en) Group robot control method and system
CN108038622B (en) Method for recommending users by crowd sensing system
Jia Efficient computing budget allocation for simulation-based policy improvement
CN111367657A (en) Computing resource collaborative cooperation method based on deep reinforcement learning
CN111866187B (en) Task scheduling method for distributed deep learning reasoning cloud platform
Yan et al. Efficient selection of a set of good enough designs with complexity preference
CN109445386A (en) A kind of most short production time dispatching method of the cloud manufacturing operation based on ONBA
Nie et al. Hypergraphical real-time multirobot task allocation in a smart factory
Tomy et al. Battery charge scheduling in long-life autonomous mobile robots via multi-objective decision making under uncertainty
Peleteiro et al. Using reputation and adaptive coalitions to support collaboration in competitive environments
CN113163447B (en) Communication network task resource scheduling method based on Q learning
Karabulut et al. The value of adaptive menu sizes in peer-to-peer platforms
CN107180286B (en) Manufacturing service supply chain optimization method and system based on improved pollen algorithm
CN112613761A (en) Service scheduling method based on dynamic game and self-adaptive ant colony algorithm
Danassis et al. Improving multi-agent coordination by learning to estimate contention
Prikopa et al. Fault-tolerant least squares solvers for wireless sensor networks based on gossiping
CN116932201A (en) Multi-resource sharing scheduling method for deep learning training task
Vahidipour et al. Priority assignment in queuing systems with unknown characteristics using learning automata and adaptive stochastic Petri nets
Fukasawa et al. Bi-objective short-term scheduling in a rolling horizon framework: a priori approaches with alternative operational objectives

Legal Events

Code: Description
PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant
GR01 Patent grant