CN113163447A - Communication network task resource scheduling method based on Q learning - Google Patents
- Publication number
- CN113163447A CN113163447A CN202110271286.7A CN202110271286A CN113163447A CN 113163447 A CN113163447 A CN 113163447A CN 202110271286 A CN202110271286 A CN 202110271286A CN 113163447 A CN113163447 A CN 113163447A
- Authority
- CN
- China
- Prior art keywords
- task
- scheduling
- node
- communication network
- value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W28/00—Network traffic management; Network resource management
- H04W28/02—Traffic management, e.g. flow control or congestion control
- H04W28/08—Load balancing or load distribution
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W72/00—Local resource management
- H04W72/50—Allocation or scheduling criteria for wireless resources
- H04W72/52—Allocation or scheduling criteria for wireless resources based on load
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W72/00—Local resource management
- H04W72/50—Allocation or scheduling criteria for wireless resources
- H04W72/535—Allocation or scheduling criteria for wireless resources based on resource usage policies
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/70—Reducing energy consumption in communication networks in wireless communication networks
Abstract
The invention discloses a communication network task resource scheduling method based on Q learning, which comprises the steps of: obtaining the real-time communication state and communication parameters of the communication network and initializing an R table; each task scheduling node of the communication network training its own Q table; each task scheduling node making decisions from its own Q table; the communication network carrying out subsequent task resource scheduling according to the Q tables obtained by the task scheduling nodes in step S3; each task scheduling node updating its own R table; and repeating these steps for continuous communication network task resource scheduling. The invention exploits the characteristics of Q learning to find a breakthrough for the problem of modeling the mutual influence between task survival rate and resource utilization rate in an uncertain, highly dynamic network environment, realizes task resource scheduling and balancing for a communication network under complex conditions through innovative algorithm research and implementation, and has high reliability, good stability, simplicity, and convenience.
Description
Technical Field
The invention belongs to the field of decentralized computing, and particularly relates to a communication network task resource scheduling method based on Q learning.
Background
In a severe wireless communication environment, especially one in which network throughput is severely limited and user applications require near-real-time response, the application mode based on decentralized computing is a solution worth exploring to resolve the contradiction between the complexity and variability of computing tasks and the severe limitation of node resources. In a distributed computing environment, in order to ensure that scheduled tasks can survive a severe battlefield environment and successfully complete military applications and other work, a damage-resistant relay mode for cross-node computing tasks needs to be studied. In this survivability and replacement mode for node computing tasks, a key problem is determining, for a given number of scheduled tasks, a reasonable matching relation between the amount of available resources and the number of tasks within the task completion period. Deviating too far from this reasonable value leads either to low resource utilization or to a low task survival rate, aggravating the contradiction between severely limited resources and the huge number of tasks in a severe battlefield environment.
When a computing node is physically damaged, a simple and effective way to let the tasks executing on it survive is to reschedule them to other computing nodes. Therefore, how well the total number of tasks scheduled for execution matches the total amount of resources available during a given period directly affects the survival rate of that batch of tasks. If the number of scheduled tasks is chosen from the perspective of fully utilizing resources, the same resources can serve more tasks, but the probability of task execution failure due to physical damage of computing nodes is higher (for example, because no spare resources remain for survivability replacement), so the task survival rate is not high. Conversely, if the total number of tasks scheduled in the same period is excessively reduced, the probability of task execution failure caused by physical damage of computing nodes drops sharply, mainly because more alternative successor computing nodes are available when rescheduling; however, resource utilization in that period will be low. In this case, although the task survival rate may be high, buying it at the cost of a serious reduction in resource utilization is not meaningful, especially in a resource-limited battlefield environment. Therefore, the mutual influence between task survival rate and resource utilization rate needs to be analyzed, and a reasonable balance point between the two needs to be found.
However, existing research and technical schemes aimed at this balance point are often unreliable, and the methods are very complicated.
Disclosure of Invention
The invention aims to provide a communication network task resource scheduling method based on Q learning, which has high reliability and good stability and is simple and convenient.
The invention provides a communication network task resource scheduling method based on Q learning, which comprises the following steps:
s1, acquiring a real-time communication state and a communication parameter of a communication network, and initializing an R table;
s2, each task scheduling node of the communication network carries out training of a self Q table;
s3, each task scheduling node of the communication network makes a decision of a self Q table;
s4, the communication network carries out subsequent task resource scheduling according to the Q table obtained by each task scheduling node in the step S3;
s5, each task scheduling node of the communication network updates the R table of the task scheduling node;
s6, repeating the steps S2-S5 to carry out continuous communication network task resource scheduling.
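The six steps above can be sketched as a single scheduling round. The following is a minimal Python sketch under the assumption of dictionary-backed R and Q tables; every class, method, and parameter name here is illustrative, not from the patent:

```python
# Hypothetical top-level driver for steps S1-S6; names are illustrative.
class SchedulingNode:
    def __init__(self, name):
        self.name = name
        self.R = {}  # R table: (state, action) -> return value
        self.Q = {}  # Q table: (state, action) -> Q value

    def init_r_table(self, comm_state):          # S1
        self.R = {(s, a): 0.0 for s in comm_state["states"]
                  for a in comm_state["actions"]}

    def train_q_table(self):                     # S2 (placeholder training pass)
        for key, value in self.R.items():
            self.Q[key] = value

    def decide(self):                            # S3: best (state, action) entry
        return max(self.Q, key=self.Q.get) if self.Q else None

    def update_r_table(self, reward):            # S5
        for key in self.R:
            self.R[key] = reward

def schedule_round(nodes, comm_state, reward):
    for n in nodes:
        n.init_r_table(comm_state)               # S1 (first round only in practice)
    for n in nodes:
        n.train_q_table()                        # S2
    decisions = {n.name: n.decide() for n in nodes}  # S3
    # S4: the network would dispatch tasks according to `decisions` here.
    for n in nodes:
        n.update_r_table(reward)                 # S5
    return decisions                             # S6: the caller repeats S2-S5
```

Each node keeps its own R and Q tables, matching the per-node maintenance described later in the specification.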
The R table in step S1 is specifically initialized through the following steps:
I. the value of the resource item in each initial state s_i^0 does not exceed the sum of the initialized resource amounts of all nodes;
II. for each s_i^0 ∈ S_i, repeat the following steps III to VIII, where s_i^0 is the state of task scheduling node i at time 0 and S_i is the state space set of node i;
III. for each a_i^0 ∈ A_i, repeat the following steps IV to VIII, where a_i^0 is the action taken by node i at time 0 and A_i is the action set of node i;
IV. estimate the amount of resources required by the tasks according to the number of tasks to be scheduled;
V. estimate the resource utilization u_i^0 from the resource amount required by the tasks to be scheduled and the value of the resource item in the initial state s_i^0;
VI. estimate the mean damage probability of all nodes from the damage probability initialized for each node;
VII. judge: if the value of the task item in the initial state s_i^0 is not greater than the value of the resource item, take the mean node damage probability as the initial task success rate p_i^0; otherwise, set the initial task success rate p_i^0 to 0;
VIII. initialize the return value obtained by task scheduling node i at time 0 as r_i^0 = ε2·p_i^0 + (1 − ε2)·u_i^0, where ε2 is a weight factor with value range 0 to 1.
In step S2, each task scheduling node of the communication network trains its own Q table; specifically, the training adopts the following steps:
Repeat the following steps A to F until the number of repetitions reaches a set number K:
A. randomly select an initial state s_i^t ∈ S_i, where s_i^t is the state of task scheduling node i at time t and S_i is the state space set of node i;
B. set a first variable Q_max to 0;
C. for each a_i^t ∈ A_i, perform the following steps a to c, where a_i^t is the action taken by node i at time t and A_i is the action set of node i:
a. calculate the Q value of task scheduling node i at time t+1 using the following formula:
Q_i^(t+1)(s_i^t, a_i^t) = (1 − α)·Q_i^t(s_i^t, a_i^t) + α·(r_i^(t+1) + β·Q_i^(t+1)(s_i^(t+1), a*))
where Q_i^(t+1) is the Q value of node i at time t+1; α is the learning factor with value range [0, 1], and the larger α is, the more the performer of the action attends to the current return; Q_i^t is the Q value of node i at time t; r_i^(t+1) is the return value obtained by node i at time t+1; β is the discount factor with value range [0, 1), and the larger β is, the more the performer of the action attends to future returns; s_i^(t+1) is the new state reached from state s_i^t after node i takes action a_i^t at time t; a* is the action that obtains the maximum Q value in the new state s_i^(t+1); and Q_i^(t+1)(s_i^(t+1), a*) is the Q value of node i at time t+1 when taking action a* in the new state s_i^(t+1);
b. update the corresponding element in Q_i, where Q_i is the Q table of node i;
c. judge the updated element of Q_i: if Q_i^(t+1)(s_i^t, a_i^t) > Q_max, update Q_max to Q_i^(t+1)(s_i^t, a_i^t) and update a_max to a_i^t, where a_max is the action that obtains the maximum Q value for node i at time t+1 in state s_i^t; otherwise, Q_max and a_max remain unchanged;
E. generate a random number ε with value range 0 to 1;
F. if ε is not less than a preset exploration probability, judge again: if action a_max can transition state s_i^t to a next state s_i^(t+1), assign s_i^(t+1) to s_i^t and jump to step B; otherwise, jump back to step A; if instead ε is less than the preset exploration probability, randomly select from the set A_i an action other than a_max and judge again: if the selected action can transition state s_i^t to a next state s_i^(t+1), assign s_i^(t+1) to s_i^t and jump to step B; otherwise, jump back to step A.
In step S3, each task scheduling node of the communication network makes decisions from its own Q table; specifically, the decision adopts the following steps:
(1) set a variable V to 0 and, for the current state s_i^t, perform step (2) for each action a in the action set A_i;
(2) judge: if Q_i^t(s_i^t, a) > V, assign Q_i^t(s_i^t, a) to V and assign a to a_0, where a_0 is the action that obtains the maximum Q value for node i in state s_i^t at time t; otherwise, V and a_0 remain unchanged;
(3) judge: if action a_0 can transition state s_i^t to a next state s_i^(t+1), calculate Q_i^(t+1)(s_i^t, a_0) using the same update formula as in the training of step S2;
(4) update the corresponding element in Q_i.
In step S5, each task scheduling node of the communication network updates its own R table; specifically, the update adopts the following steps:
1) count the total amount of resources in the resource view from l_t to l_t+τ_t and denote it f_i^t, where l_t is the virtual time of task scheduling execution t, τ_t is the task scheduling execution cycle, and the resource view is the set of execution nodes visible to scheduling node i in the current scheduling period;
2) count the number of tasks scheduled for execution from l_t to l_t+τ_t, and count the total amount of resources they occupy;
3) from the statistical results of steps 1) and 2), estimate and record the resource utilization u_i^t, defined as the ratio of the amount of resources actually occupied to the total amount of resources;
4) estimate the execution success rate of each task according to the damage rate, from l_t to l_t+τ_t, of each node executing the task;
5) from the per-task success rates obtained in step 4), compute the average success rate of all tasks and record it as p_i^t;
6) calculate the return value obtained by task scheduling node i at time t as r_i^t = ε1·p_i^t + (1 − ε1)·u_i^t, where ε1 is a weight factor with value range 0 to 1, p_i^t is the average success rate of all tasks counted by node i at time t, and u_i^t is the resource utilization counted by node i at time t;
7) use r_i^t to update, in the R table R_i, the return value corresponding to the latest found state and the latest found action.
The communication network task resource scheduling method based on Q learning provided by the invention exploits the characteristics of Q learning to find a breakthrough for the problem of modeling the mutual influence between task survival rate and resource utilization rate in an uncertain, highly dynamic network environment, realizes task resource scheduling and balancing for a communication network under complex conditions through innovative algorithm research and implementation, and has the advantages of high reliability, good stability, simplicity, and convenience.
Drawings
FIG. 1 is a schematic process flow diagram of the process of the present invention.
Fig. 2 is a schematic diagram of a network structure according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of task survival rate and resource utilization rate of different parameter combinations of the Q learning model under different training rounds according to the embodiment of the present invention.
Fig. 4 is a schematic diagram illustrating an influence of a damage probability of a task execution node and the number of training rounds on a task survival rate and a resource utilization rate according to an embodiment of the present invention.
Fig. 5 is a schematic diagram illustrating an influence of the number of currently scheduled tasks and the number of training rounds on task survival rate and resource utilization rate according to an embodiment of the present invention.
Detailed Description
The invention provides a method for modeling the mutual influence between task survival rate and resource utilization rate. In order to clarify this mutual influence and obtain a balanced result that takes both performance requirements into account, an effective approach is to model the problem as a multi-objective constrained optimization problem. However, because of the large number of modeling parameters involved in an uncertain, highly dynamic network environment, purely mathematical modeling is complicated and solving it is also quite difficult. By exploiting the characteristics of Q learning, a breakthrough is found for modeling the mutual influence between task survival rate and resource utilization rate in such an environment.
Task survival rate refers to the ratio of the number of tasks successfully completed within a specific period to the total number of tasks scheduled in the system. In a resource-limited environment (such as a battlefield environment), the full set of tasks to be executed must be scheduled and deployed to appropriate computing nodes according to the distribution and availability of node and network resources sensed in real time; the start of the earliest task and the completion of the latest task determine the execution period of the batch, i.e., the specific period in the definition of task survival rate. Two main factors influence a task's survival: first, unreasonable allocation of computing resources, so that the task fails for lack of available resources, which mainly depends on task decomposition and the performance of the scheduling algorithm; second, physical damage to computing nodes, which causes task execution to fail. The invention mainly focuses on the latter and provides a Q-learning-based method for solving the multi-objective constrained problem.
Before using the Q learning method, a state space, actions, returns, and the R and Q tables must be defined for the problem of interest. Since each task scheduling node can acquire in real time only the node resource distribution, usage status, and network parameters in its neighboring area, the neighboring area centered on the task scheduling node is taken as the unit for defining the state space, action space, return, R table, and Q table. The state space is defined as a combined space whose dimensions are "amount of available resources" and "number of scheduled tasks"; the value range of each dimension can be set according to the fluctuation range of available resources in the region and historical experience of the variation range of the task load.
An action refers to an operation in which the system can change states through parameter adjustment. Here, since the amount of available resources cannot be actively adjusted, the action space is defined as a set of the number of tasks that can be selected for scheduling. In a particular state, after an action is taken, the current state transitions to a new state and the performer of the action receives a return.
The R table is defined as a two-dimensional matrix, each row of the matrix representing a state, each column representing an action, the values in the matrix representing specific return values that can be evaluated. Similarly, a Q table is also defined as a two-dimensional matrix in which each row represents a state and each column represents an action, and the values in the matrix are referred to as Q values. The Q value represents the degree to which the agent has acquired "knowledge" in different environments. When an action is taken, the transition from the current state to the next new state occurs, and the actor of the action obtains a new return value.
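As a concrete (and deliberately tiny) illustration of the two tables just described, rows index states and columns index actions; the sizes below are illustrative, and plain Python lists keep the sketch dependency-free:

```python
# R and Q as two-dimensional matrices: one row per state, one column per
# action. The sizes are illustrative, not from the patent.
n_states, n_actions = 6, 4
R = [[0.0] * n_actions for _ in range(n_states)]  # evaluated return values
Q = [[0.0] * n_actions for _ in range(n_states)]  # learned Q values

# Taking action a in state s yields return R[s][a]; after the transition
# to a new state, the agent reads that state's row of Q to pick the next
# action with the maximum Q value.
s, a = 2, 1
R[s][a] = 0.7
best_next_action = max(range(n_actions), key=lambda col: Q[s][col])
```

With all Q values still zero, `best_next_action` simply falls back to the first column; training is what differentiates the entries.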
Because different task scheduling nodes have different resource views, each task scheduling node independently maintains its own R table and Q table. Owing to the dynamics of resources and tasks, the actual state space is large. To obtain more accurate decision results, the state space of Q learning should be as large as possible and the action set should also be large enough, so the training process of Q learning can be very time-consuming and computationally intensive. The training task of Q learning can therefore be reasonably scheduled nearby, using cooperation among scattered computing nodes to ensure sufficient computing power. The decision process based on the trained Q table, by contrast, requires little computing power; the computing power of a single scattered node is usually sufficient, so the decision can be executed by the task scheduling node itself. The ultimate goal of Q learning is a converged Q table, i.e., one whose values no longer change; however, in practical application, because the state space and action space are large, the Q table needs a long training time to converge, so it is often used directly after being trained for a certain time and then updated during use, letting the Q values continuously approach their converged values as updates proceed.
Therefore, the communication network task resource scheduling method based on Q learning provided by the invention comprises the following steps (as shown in fig. 1):
s1, acquiring the real-time communication state and communication parameters of the communication network, and initializing an R table; specifically, the initialization adopts the following steps:
I. the value of the resource item in each initial state s_i^0 does not exceed the sum of the initialized resource amounts of all nodes;
II. for each s_i^0 ∈ S_i, repeat the following steps III to VIII, where s_i^0 is the state of task scheduling node i at time 0 and S_i is the state space set of node i;
III. for each a_i^0 ∈ A_i, repeat the following steps IV to VIII, where a_i^0 is the action taken by node i at time 0 and A_i is the action set of node i;
IV. estimate the amount of resources required by the tasks according to the number of tasks to be scheduled;
V. estimate the resource utilization u_i^0 from the resource amount required by the tasks to be scheduled and the value of the resource item in the initial state s_i^0;
VI. estimate the mean damage probability of all nodes from the damage probability initialized for each node;
VII. judge: if the value of the task item in the initial state s_i^0 is not greater than the value of the resource item, take the mean node damage probability as the initial task success rate p_i^0; otherwise, set the initial task success rate p_i^0 to 0;
VIII. initialize the return value obtained by task scheduling node i at time 0 as r_i^0 = ε2·p_i^0 + (1 − ε2)·u_i^0, where ε2 is a weight factor with value range 0 to 1;
the pseudo code for this section is as follows:
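The pseudo-code figure is not reproduced in this text; a hedged Python sketch of steps I to VIII might look as follows, assuming each state is a (resource item, task item) pair, each action is a candidate number of tasks to schedule, and a fixed resource cost per task (all names are illustrative):

```python
# Sketch of the R-table initialization (steps I-VIII). The per-task
# resource cost and the weight eps2 are assumed parameters.
def init_r_table(states, actions, node_resources, node_damage_probs,
                 resource_per_task=1.0, eps2=0.5):
    total_resources = sum(node_resources)                    # bound for step I
    mean_damage = sum(node_damage_probs) / len(node_damage_probs)  # step VI
    R = {}
    for s in states:                                         # step II
        resource_item, task_item = s
        assert resource_item <= total_resources              # step I
        for a in actions:                                    # step III
            needed = a * resource_per_task                   # step IV
            u = min(1.0, needed / resource_item) if resource_item else 0.0  # step V
            # step VII: initial success rate from the mean damage probability
            p = mean_damage if task_item <= resource_item else 0.0
            R[(s, a)] = eps2 * p + (1 - eps2) * u            # step VIII
    return R
```

The weighted sum in step VIII mirrors the return formula used later in step S5, with ε2 in place of ε1.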
s2, each task scheduling node of the communication network trains its own Q table; specifically, the training adopts the following steps:
Repeat the following steps A to F until the number of repetitions reaches a set number K:
A. randomly select an initial state s_i^t ∈ S_i, where s_i^t is the state of task scheduling node i at time t and S_i is the state space set of node i;
B. set a first variable Q_max to 0;
C. for each a_i^t ∈ A_i, perform the following steps a to c, where a_i^t is the action taken by node i at time t and A_i is the action set of node i:
a. calculate the Q value of task scheduling node i at time t+1 using the following formula:
Q_i^(t+1)(s_i^t, a_i^t) = (1 − α)·Q_i^t(s_i^t, a_i^t) + α·(r_i^(t+1) + β·Q_i^(t+1)(s_i^(t+1), a*))
where Q_i^(t+1) is the Q value of node i at time t+1; α is the learning factor with value range [0, 1], and the larger α is, the more the performer of the action attends to the current return; Q_i^t is the Q value of node i at time t; r_i^(t+1) is the return value obtained by node i at time t+1; β is the discount factor with value range [0, 1), and the larger β is, the more the performer of the action attends to future returns; s_i^(t+1) is the new state reached from state s_i^t after node i takes action a_i^t at time t; a* is the action that obtains the maximum Q value in the new state s_i^(t+1); and Q_i^(t+1)(s_i^(t+1), a*) is the Q value of node i at time t+1 when taking action a* in the new state s_i^(t+1);
b. update the corresponding element in Q_i, where Q_i is the Q table of node i;
c. judge the updated element of Q_i: if Q_i^(t+1)(s_i^t, a_i^t) > Q_max, update Q_max to Q_i^(t+1)(s_i^t, a_i^t) and update a_max to a_i^t, where a_max is the action that obtains the maximum Q value for node i at time t+1 in state s_i^t; otherwise, Q_max and a_max remain unchanged;
E. generate a random number ε with value range 0 to 1;
F. if ε is not less than a preset exploration probability, judge again: if action a_max can transition state s_i^t to a next state s_i^(t+1), assign s_i^(t+1) to s_i^t and jump to step B; otherwise, jump back to step A; if instead ε is less than the preset exploration probability, randomly select from the set A_i an action other than a_max and judge again: if the selected action can transition state s_i^t to a next state s_i^(t+1), assign s_i^(t+1) to s_i^t and jump to step B; otherwise, jump back to step A;
the pseudo code for this section is as follows:
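In place of the missing pseudo-code figure, a minimal Python sketch of the training loop (steps A to F) is given below; the `transition` and `reward` callables stand in for the node's real environment, and the exploration probability is an assumed parameter:

```python
import random

# Sketch of the Q-table training loop with the update
#   Q <- (1-alpha)*Q + alpha*(r + beta*max_a' Q(s', a')).
# `transition(s, a)` returns the next state or None if the move is invalid.
def train_q_table(states, actions, transition, reward,
                  K=200, alpha=0.8, beta=0.2, explore=0.1):
    Q = {(s, a): 0.0 for s in states for a in actions}
    s = random.choice(states)                         # step A
    for _ in range(K):
        q_max, a_max = 0.0, actions[0]                # step B
        for a in actions:                             # step C
            s_next = transition(s, a)
            if s_next is None:
                continue
            future = max(Q[(s_next, a2)] for a2 in actions)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (
                reward(s, a) + beta * future)         # steps a-b
            if Q[(s, a)] > q_max:                     # step c
                q_max, a_max = Q[(s, a)], a
        eps = random.random()                         # step E
        if eps >= explore:                            # step F: exploit
            chosen = a_max
        else:                                         # step F: explore
            others = [a for a in actions if a != a_max]
            chosen = random.choice(others or actions)
        s_next = transition(s, chosen)
        s = s_next if s_next is not None else random.choice(states)
    return Q
```

Because the inner loop sweeps every action of the current state, the ε-greedy choice here only steers which state is visited next, matching the jump-to-B/jump-to-A structure of the patent's steps.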
s3, each task scheduling node of the communication network makes decisions from its own Q table; specifically, the decision adopts the following steps:
(1) set a variable V to 0 and, for the current state s_i^t, perform step (2) for each action a in the action set A_i;
(2) judge: if Q_i^t(s_i^t, a) > V, assign Q_i^t(s_i^t, a) to V and assign a to a_0, where a_0 is the action that obtains the maximum Q value for node i in state s_i^t at time t; otherwise, V and a_0 remain unchanged;
(3) judge: if action a_0 can transition state s_i^t to a next state s_i^(t+1), calculate Q_i^(t+1)(s_i^t, a_0) using the same update formula as in the training of step S2;
(4) update the corresponding element in Q_i;
the pseudo code for this section is as follows:
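As a stand-in for the missing pseudo-code figure, the decision steps (1) to (4) might be sketched in Python as follows; the fallback for a completely untrained state is an added assumption, as are all names:

```python
# Sketch of the decision step (S3): pick the action with the largest Q
# value in the current state, then refresh that entry with the same
# update rule used during training.
def decide_and_refresh(Q, actions, s, transition, reward,
                       alpha=0.8, beta=0.2):
    V, a0 = 0.0, None
    for a in actions:                           # steps (1)-(2)
        if Q.get((s, a), 0.0) > V:
            V, a0 = Q[(s, a)], a
    if a0 is None:                              # untrained state: assumed fallback
        a0 = actions[0]
    s_next = transition(s, a0)                  # step (3)
    if s_next is not None:
        future = max(Q.get((s_next, a), 0.0) for a in actions)
        Q[(s, a0)] = (1 - alpha) * Q.get((s, a0), 0.0) + alpha * (
            reward(s, a0) + beta * future)      # step (4)
    return a0
```

Refreshing the chosen entry in place is what lets the Q table keep approaching its converged values while it is already being used for scheduling.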
s4, the communication network carries out subsequent task resource scheduling according to the Q table obtained by each task scheduling node in the step S3;
s5, each task scheduling node of the communication network updates its own R table; specifically, the update adopts the following steps:
1) count the total amount of resources in the resource view from l_t to l_t+τ_t and denote it f_i^t, where l_t is the virtual time of task scheduling execution t, τ_t is the task scheduling execution cycle, and the resource view is the set of execution nodes visible to scheduling node i in the current scheduling period;
2) count the number of tasks scheduled for execution from l_t to l_t+τ_t, and count the total amount of resources they occupy;
3) from the statistical results of steps 1) and 2), estimate and record the resource utilization u_i^t, defined as the ratio of the amount of resources actually occupied to the total amount of resources;
4) estimate the execution success rate of each task according to the damage rate, from l_t to l_t+τ_t, of each node executing the task;
5) from the per-task success rates obtained in step 4), compute the average success rate of all tasks and record it as p_i^t;
6) calculate the return value obtained by task scheduling node i at time t as r_i^t = ε1·p_i^t + (1 − ε1)·u_i^t, where ε1 is a weight factor with value range 0 to 1, p_i^t is the average success rate of all tasks counted by node i at time t, and u_i^t is the resource utilization counted by node i at time t;
7) use r_i^t to update, in the R table R_i, the return value corresponding to the latest found state and the latest found action;
the pseudo code for this section is as follows:
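Again in place of the missing pseudo-code figure, a sketch of the R-table update: utilization and mean task success rate over the scheduling period are combined with weight ε1 and written into the R table. The per-task success rate is assumed here to be one minus the executing node's damage rate, and all field names are illustrative:

```python
# Sketch of the R-table update (steps 1-7). `view_resources` is the
# resource amount of each node in the current resource view;
# `scheduled_tasks` carries each task's executing-node damage rate.
def update_r_table(R, state, action, view_resources, scheduled_tasks,
                   resource_per_task=1.0, eps1=0.5):
    f = sum(view_resources)                                # step 1: f_i^t
    occupied = len(scheduled_tasks) * resource_per_task    # step 2
    u = occupied / f if f else 0.0                         # step 3: utilization
    # steps 4-5: assumed per-task success = 1 - executing node damage rate
    successes = [1.0 - t["node_damage"] for t in scheduled_tasks]
    p = sum(successes) / len(successes) if successes else 0.0
    r = eps1 * p + (1 - eps1) * u                          # step 6: return value
    R[(state, action)] = r                                 # step 7
    return r
```

Writing the fresh return into the (state, action) entry is what couples the observed scheduling outcome back into subsequent Q-table training.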
s6, repeating the steps S2-S5 to carry out continuous communication network task resource scheduling.
The invention uses OMNeT++ to build a simulation network. 50 nodes are arranged in a circular area with a radius of 500 meters. Parameters such as each node's initial coordinates, computing resources (for example, the number of CPUs), and communication radius are read from a data file (with an .ini suffix) when the simulation system is initialized; the data in the file are prepared in advance, with node coordinates generated randomly within the 500-meter circular area, the number of idle CPUs of each node drawn randomly from the range 0 to 16, and the communication radius of each node set to 250 meters. After the simulation system starts, the coordinate position of each node is refreshed once per time interval T. To simplify the refresh process, the current value can be randomly increased or decreased by 0-50%. To simplify other simulation details and focus on the discussion and analysis of the relationship between task survival rate and resource utilization rate, a node with a fixed ID is set as the task scheduling node, and the number of tasks to be scheduled and executed is counted in standard small tasks.
A standard small task requires at least one CPU for a duration of T to complete successfully. To reflect the characteristics of a communication environment (such as a battlefield environment) in which resources are always limited, the number of small tasks to be scheduled on the task scheduling node is set large enough that the total amount of resources they require exceeds the amount of resources in the resource view obtained by the task scheduling node. It is assumed that the task scheduling node can obtain in real time only a predicted resource view within its two-hop communication range; according to this resource view, tasks are distributed to selected task execution nodes, and each task execution node is set to be damaged with a certain probability. If damage occurs, the task scheduling node reschedules the tasks on the damaged node to other nodes with idle resources; if no resources are available, the task is declared failed. The resource view of the task scheduling node is illustrated in fig. 2.
First, the task survival rate and resource utilization rate of different Q-learning parameter combinations under different numbers of training rounds are analyzed; the simulation result is shown in fig. 3. Here, the damage probability of the task execution nodes is uniformly set to 10%, and the number of currently scheduled tasks is uniformly set to 50% N (N is the maximum number of tasks schedulable in the simulation system). As can be seen from fig. 3(a), the task survival rate increases as the number of training rounds of the Q table increases. This is because the training results of the Q table gradually converge to the optimum. Fig. 3(b) shows that the resource utilization rate also increases with the number of training rounds, for the same reason as in fig. 3(a). Meanwhile, the Q-learning parameter combination α = 0.8, β = 0.2 can be seen to give the best task survival rate and resource utilization rate.
Then, the task survival rate and resource utilization rate under different task execution node damage probabilities and different numbers of training rounds are analyzed; the simulation result is shown in fig. 4. Here, the Q-learning parameter combination is uniformly set to α = 0.8, β = 0.2, and the number of currently scheduled tasks is uniformly set to 50% N (N is the maximum number of tasks schedulable in the simulation system). As can be seen from fig. 4(a), the task survival rate increases with the number of training rounds of the Q table, for the same reason as in fig. 3(a). Meanwhile, fig. 4(a) shows that a greater node damage probability results in a lower task survival rate, because a higher node damage rate reduces the survival rate of the tasks assigned to the nodes. Fig. 4(b) shows that resource utilization increases with the number of training rounds, for the same reason as in fig. 3(b); the utilization differs across node damage probabilities because a larger damage probability means a smaller amount of resources in the available resource view while the resources occupied by the tasks do not change.
Finally, the influence of the task-execution-node damage probability and the number of currently scheduled tasks on the task survival rate and resource utilization is analyzed; the simulation results are shown in FIG. 5. Here, the Q-learning parameters are uniformly set to α = 0.8 and β = 0.2, and 100,000 training rounds are used throughout. Each data point in FIG. 5 is obtained from a Q table trained for 100,000 rounds with this parameter combination; the task survival rate and resource utilization are then measured for different node damage probabilities as the number of currently scheduled tasks varies.
As shown in FIG. 5(a), the task survival rate trends downward as the number of scheduled tasks increases, mainly because when a task execution node is damaged, the chance of finding a successor node shrinks, and a greater node damage probability aggravates this. FIG. 5(b) shows that resource utilization grows with the number of scheduled tasks: the amount of resources occupied by tasks increases while the total amount of resources in the system's view is constant, so utilization rises. FIG. 5(b) also shows that a larger node damage probability reduces the amount of resources in the available resource view while, for the same number of scheduled tasks, the resources required by the tasks do not change, so the resource utilization again trends upward.
Claims (5)
1. A communication network task resource scheduling method based on Q learning, comprising the following steps:
S1, acquiring the real-time communication state and communication parameters of the communication network, and initializing the R table;
S2, each task scheduling node of the communication network trains its own Q table;
S3, each task scheduling node of the communication network makes decisions with its own Q table;
S4, the communication network performs subsequent task resource scheduling according to the Q tables obtained by the task scheduling nodes in step S3;
S5, each task scheduling node of the communication network updates its own R table;
S6, repeating steps S2 to S5 for continuous communication network task resource scheduling.
2. The Q-learning-based communication network task resource scheduling method according to claim 1, wherein the R table in step S1 is initialized by the following steps:
the method comprises the following steps: each initial stateThe value of the medium resource item does not exceed the sum of the initialized resource quantities of all the nodes;
for each oneThe following steps II to VIII are repeated; whereinScheduling the state of the node i at the time 0 for the task; siScheduling a state space set of a node i for the task;
for each oneAll repeatedly carry out the following step IIIStep VIII;scheduling the action taken by node i at time 0 for the task; a. theiScheduling an action set of node i for the task;
IV, estimating the resource quantity required by the task according to the quantity of the task to be scheduled;
v. according to the resource quantity and initial state of the task to be scheduledEstimating resource utilization by values of medium resource items
Estimating the mean value of the damage probability of all nodes according to the damage probability initialized by each node;
VII, judging: if the initial state isIf the value of the middle task item is not greater than the value of the resource item, taking the mean value of the node damage probability as the success rate of the initial taskOtherwise, the success rate of the initial task is determinedSet to 0;
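Steps I–VII amount to a sweep over each initial (state, action) pair, estimating a utilization and an initial success rate. A minimal sketch follows, assuming each state is a dict with hypothetical `tasks` and `resources` fields; the field names, the per-task resource model, and the capping of utilization at 1.0 are assumptions, not from the patent:

```python
def init_r_table(states, actions, total_node_resources, node_damage_probs,
                 num_tasks_to_schedule, resource_per_task):
    """Sketch of steps I-VII of claim 2 (field names and units are assumed)."""
    # Step VI: mean of the initialized node damage probabilities.
    mean_damage = sum(node_damage_probs) / len(node_damage_probs)
    r_table = {}
    for s0 in states:                                            # step II
        # Step I: resource item must not exceed total initialized resources.
        assert s0["resources"] <= total_node_resources
        for a0 in actions:                                       # step III
            required = num_tasks_to_schedule * resource_per_task  # step IV
            utilization = min(1.0, required / s0["resources"])    # step V
            # Step VII: initial success rate (the claim uses the mean
            # damage probability here when the task item fits the resources).
            if s0["tasks"] <= s0["resources"]:
                success = mean_damage
            else:
                success = 0.0
            r_table[(s0["id"], a0)] = (utilization, success)
    return r_table
```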
3. The Q-learning-based communication network task resource scheduling method according to claim 1 or 2, wherein in step S2 each task scheduling node of the communication network trains its own Q table by the following steps:
repeating the following steps A to F until the number of repetitions reaches a set number K:
A. randomly selecting an initial state s_i^t ∈ S_i, where s_i^t is the state of task scheduling node i at time t, and S_i is the state space set of task scheduling node i;
B. setting a first variable Q_max to 0;
C. for each a_i^t ∈ A_i, performing the following steps a to c, where a_i^t is the action taken by task scheduling node i at time t, and A_i is the action set of task scheduling node i:
a. calculating the Q value of task scheduling node i at time t+1 by the following formula:
Q_i^{t+1}(s_i^t, a_i^t) = (1 − α) · Q_i^t(s_i^t, a_i^t) + α · [ r_i^{t+1} + β · Q_i^{t+1}(s_i^{t+1}, a_i^{t+1,max}) ]
where Q_i^{t+1}(s_i^t, a_i^t) is the Q value of task scheduling node i at time t+1; α is the learning factor with value range [0, 1], and the larger α is, the more the performer of the action weights the current return; Q_i^t(s_i^t, a_i^t) is the Q value of task scheduling node i at time t; r_i^{t+1} is the return value obtained by task scheduling node i at time t+1; β is the discount factor with value range [0, 1), and the larger β is, the more the performer of the action weights future returns; s_i^{t+1} is the new state to which state s_i^t transitions after task scheduling node i takes action a_i^t at time t; a_i^{t+1,max} is the action with which task scheduling node i obtains the maximum Q value in the new state s_i^{t+1}; Q_i^{t+1}(s_i^{t+1}, a_i^{t+1,max}) is the Q value of task scheduling node i taking action a_i^{t+1,max} in the new state s_i^{t+1} at time t+1;
b. updating the corresponding element in Q_i, where Q_i is the Q table of task scheduling node i;
c. judging the updated element of Q_i:
if Q_i^{t+1}(s_i^{t+1}, a_i^{t+1,max}) > Q_max, then Q_max is updated to Q_i^{t+1}(s_i^{t+1}, a_i^{t+1,max}) and a_max is updated to a_i^{t+1,max}, where a_max is the action with which task scheduling node i obtains the maximum Q value in state s_i^{t+1} at time t+1;
otherwise, Q_max and a_max remain unchanged;
E. generating a random number ε with value range 0 to 1;
F. if ε satisfies the set condition, a further judgment is made: if action a_max can transition state s_i^t to the next state s_i^{t+1}, then s_i^{t+1} is assigned to s_i^t and the procedure jumps to step B; otherwise, the procedure jumps back to step A;
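Steps A–F describe tabular Q-learning with the update Q ← (1 − α)·Q + α·(r + β·max Q). The sketch below implements that loop; the environment callbacks `reward` and `next_state` are assumptions, and since the exact ε test of step F is not reproduced in the text, a simple threshold is used in its place:

```python
import random

def train_q_table(states, actions, reward, next_state,
                  alpha=0.8, beta=0.2, eps_threshold=0.9, K=1000, seed=0):
    """Tabular Q-learning sketch of steps A-F of claim 3 (callbacks assumed)."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(K):                         # repeat until the set count K
        s = rng.choice(states)                 # A. random initial state
        while True:
            q_max, a_max = 0.0, None           # B. reset the running maximum
            for a in actions:                  # C. sweep every action
                s_next = next_state(s, a)
                best_next = max(q[(s_next, a2)] for a2 in actions)
                # a. Q <- (1-alpha)*Q + alpha*(r + beta*max_a' Q(s', a'))
                q[(s, a)] = ((1 - alpha) * q[(s, a)]
                             + alpha * (reward(s, a) + beta * best_next))
                if q[(s, a)] > q_max:          # c. track the best action
                    q_max, a_max = q[(s, a)], a
            eps = rng.random()                 # E. random number in [0, 1)
            # F. assumed form of the epsilon test (exact condition not in the text)
            if eps < eps_threshold and a_max is not None:
                s = next_state(s, a_max)       # follow a_max, jump back to B
            else:
                break                          # restart from a random state (A)
    return q
```

On the toy two-state environment used in the test, the trained table correctly prefers the rewarding action in each state.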
4. The Q-learning-based communication network task resource scheduling method according to claim 3, wherein in step S3 each task scheduling node of the communication network makes decisions with its own Q table by the following steps:
(2) a judgment is made: if Q_i^t(s_i^t, a_i^t) > V, then Q_i^t(s_i^t, a_i^t) is assigned to V and a_i^t is assigned to a_0, where a_0 is the action with which task scheduling node i obtains the maximum Q value in state s_i^t at time t;
otherwise, V and a_0 remain unchanged;
(3) a judgment is made: if action a_0 can transition state s_i^t to the next state s_i^{t+1}, then the corresponding Q value is calculated;
(4) the corresponding element in Q_i is updated;
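The decision stage reduces to a greedy scan of the trained Q table: keep a running maximum V and best action a_0, then follow a_0's state transition. A compact sketch, with assumed function and argument names:

```python
def decide(q_table, state, actions, next_state):
    """Greedy decision sketch of claim 4: a0 = argmax_a Q(state, a)."""
    v, a0 = float("-inf"), None
    for a in actions:                          # steps (1)-(2): running maximum
        q_val = q_table.get((state, a), 0.0)
        if q_val > v:
            v, a0 = q_val, a
    s_next = next_state(state, a0)             # step (3): resulting transition
    return a0, s_next
```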
5. The Q-learning-based communication network task resource scheduling method according to claim 4, wherein in step S5 each task scheduling node of the communication network updates its R table by the following steps:
1) counting the total amount of resources in the resource view during the period from l_t to l_t + τ_t, where l_t is the virtual time t of task scheduling and execution, τ_t is the task scheduling and execution cycle, and the resource view is the set of execution nodes visible to scheduling node i in the current scheduling cycle;
2) counting the number of tasks scheduled for execution during the period from l_t to l_t + τ_t, together with the total amount of resources they occupy;
3) estimating and recording the resource utilization u_i^t from the statistics of steps 1) and 2), the resource utilization being defined as the ratio of the actually occupied resource amount to the total resource amount;
4) estimating the task execution success rate according to the damage rate of each node that executes a task during the period from l_t to l_t + τ_t;
5) from the success rate of each task obtained in step 4), computing the average success rate p_i^t of all tasks;
6) calculating the return value r_i^t obtained by task scheduling node i at time t by the following formula:
r_i^t = ω_1 · p_i^t + (1 − ω_1) · u_i^t
where ω_1 is a weight factor with value range 0 to 1; p_i^t is the average success rate of all tasks counted by task scheduling node i at time t; and u_i^t is the resource utilization counted by task scheduling node i at time t.
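Given that the claim names a single weight factor ω1 alongside the average success rate and the resource utilization, the return plausibly takes the form of a weighted combination; the one-liner below is a reconstruction under that assumption:

```python
def return_value(avg_success_rate, resource_utilization, omega1=0.5):
    """Assumed form of the claim-5 return: r = w1 * p + (1 - w1) * u."""
    assert 0.0 <= omega1 <= 1.0, "omega1 is a weight factor in the range 0 to 1"
    return omega1 * avg_success_rate + (1.0 - omega1) * resource_utilization
```

With ω1 = 1 the return rewards task success only; with ω1 = 0 it rewards utilization only, so ω1 trades the two objectives off directly.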
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110271286.7A CN113163447B (en) | 2021-03-12 | 2021-03-12 | Communication network task resource scheduling method based on Q learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113163447A true CN113163447A (en) | 2021-07-23 |
CN113163447B CN113163447B (en) | 2022-05-20 |
Family
ID=76887502
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110271286.7A Active CN113163447B (en) | 2021-03-12 | 2021-03-12 | Communication network task resource scheduling method based on Q learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113163447B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150282166A1 (en) * | 2012-11-14 | 2015-10-01 | China Academy Of Telecommunications Technology | Method and device for scheduling slot resources |
CN108112082A (en) * | 2017-12-18 | 2018-06-01 | Beijing University of Technology | Distributed autonomous resource allocation method for wireless networks based on stateless Q-learning |
CN108139930A (en) * | 2016-05-24 | 2018-06-08 | Huawei Technologies Co., Ltd. | Q-learning-based resource scheduling method and device |
US20190124667A1 (en) * | 2017-10-23 | 2019-04-25 | Commissariat A L'energie Atomique Et Aux Energies Alternatives | Method for allocating transmission resources using reinforcement learning |
CN110515735A (en) * | 2019-08-29 | 2019-11-29 | Harbin University of Science and Technology | Multi-objective cloud resource scheduling method based on an improved Q-learning algorithm |
CN110636523A (en) * | 2019-09-20 | 2019-12-31 | Central South University | Millimeter-wave mobile backhaul link energy efficiency stabilization scheme based on Q learning |
CN111405568A (en) * | 2020-03-19 | 2020-07-10 | China Three Gorges University | Computation offloading and resource allocation method and device based on Q learning |
CN111556572A (en) * | 2020-04-21 | 2020-08-18 | Beijing University of Posts and Telecommunications | Joint allocation method for spectrum and computing resources based on reinforcement learning |
Non-Patent Citations (3)
Title |
---|
JINSONG GUI: "Stabilizing Transmission Capacity in Millimeter Wave Links by Q-Learning-Based Scheme", Mobile Information Systems *
YU PENG: "Energy-efficient resource allocation method based on double deep Q-learning in mobile edge networks", Journal on Communications *
LI ZIHENG: "Wireless network resource allocation algorithm based on deep reinforcement learning", Communications Technology *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Chen et al. | iRAF: A deep reinforcement learning approach for collaborative mobile edge computing IoT networks | |
CN108846570B (en) | Method for solving resource-limited project scheduling problem | |
Yang et al. | A prediction-based user selection framework for heterogeneous mobile crowdsensing | |
Shyalika et al. | Reinforcement learning in dynamic task scheduling: A review | |
Palacios et al. | Genetic tabu search for the fuzzy flexible job shop problem | |
Gonzalez et al. | Instance-based learning: integrating sampling and repeated decisions from experience. | |
CN107831685B (en) | Group robot control method and system | |
CN108038622B (en) | Method for recommending users by crowd sensing system | |
Jia | Efficient computing budget allocation for simulation-based policy improvement | |
CN111367657A (en) | Computing resource collaborative cooperation method based on deep reinforcement learning | |
CN111866187B (en) | Task scheduling method for distributed deep learning reasoning cloud platform | |
Yan et al. | Efficient selection of a set of good enough designs with complexity preference | |
CN109445386A (en) | A kind of most short production time dispatching method of the cloud manufacturing operation based on ONBA | |
Nie et al. | Hypergraphical real-time multirobot task allocation in a smart factory | |
Tomy et al. | Battery charge scheduling in long-life autonomous mobile robots via multi-objective decision making under uncertainty | |
Peleteiro et al. | Using reputation and adaptive coalitions to support collaboration in competitive environments | |
CN113163447B (en) | Communication network task resource scheduling method based on Q learning | |
Karabulut et al. | The value of adaptive menu sizes in peer-to-peer platforms | |
CN107180286B (en) | Manufacturing service supply chain optimization method and system based on improved pollen algorithm | |
CN112613761A (en) | Service scheduling method based on dynamic game and self-adaptive ant colony algorithm | |
Danassis et al. | Improving multi-agent coordination by learning to estimate contention | |
Prikopa et al. | Fault-tolerant least squares solvers for wireless sensor networks based on gossiping | |
CN116932201A (en) | Multi-resource sharing scheduling method for deep learning training task | |
Vahidipour et al. | Priority assignment in queuing systems with unknown characteristics using learning automata and adaptive stochastic Petri nets | |
Fukasawa et al. | Bi-objective short-term scheduling in a rolling horizon framework: a priori approaches with alternative operational objectives |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |