CN113163447A

CN113163447A - Communication network task resource scheduling method based on Q learning

Info

Publication number: CN113163447A
Application number: CN202110271286.7A
Authority: CN
Inventors: 桂劲松; 刘尧
Original assignee: Central South University
Current assignee: Central South University
Priority date: 2021-03-12
Filing date: 2021-03-12
Publication date: 2021-07-23
Anticipated expiration: 2041-03-12
Also published as: CN113163447B

Abstract

The invention discloses a communication network task resource scheduling method based on Q learning, which comprises the steps of obtaining the real-time communication state and communication parameters of a communication network and initializing an R table; each task scheduling node of the communication network carries out the training of a self Q table; each task scheduling node of the communication network makes a decision of a self Q table; the communication network carries out subsequent task resource scheduling according to the Q table obtained by each task scheduling node in the step S3; each task scheduling node of the communication network updates the R table of the task scheduling node; and repeating the steps to carry out continuous communication network task resource scheduling. The invention utilizes the characteristic of Q learning to find a breakthrough for the problem of the mutual influence relation between the survival rate of the modeling task and the resource utilization rate in the high dynamic network environment with uncertainty, realizes the task resource scheduling and balance of the communication network under the complex condition through innovative algorithm research and implementation, and has high reliability, good stability, simplicity and convenience.

Description

Communication network task resource scheduling method based on Q learning

Technical Field

The invention belongs to the field of decentralized computing, and particularly relates to a communication network task resource scheduling method based on Q learning.

Background

In a severe wireless communication environment, especially in an environment facing to the condition that the network throughput is severely limited and the user application requires near real-time response, in order to solve the contradiction between the complexity and the variability of the calculation task and the severe limitation of the node resources, the application mode based on the decentralized calculation is a solution which is worth exploring. In a distributed computing environment, in order to ensure that scheduled tasks can survive in a severe battlefield environment and successfully complete military applications and other work, an anti-destruction relay mode of cross-node computing tasks needs to be researched. In the survivability and replacement mode of the node computing task, a key problem is that when the number of the scheduled tasks is determined, the reasonable matching relation between the available amount of resources and the number of the tasks in the task completion period is required to be determined. If the deviation is too far away from the reasonable value, the resource utilization rate is too low, or the task survival rate is not high, the contradiction between the serious limitation of resources and the huge amount of tasks in the severe battlefield environment is aggravated.

In the case of physical damage to a computing node, a simple and effective means for allowing the task executed thereon to survive is to reschedule the task to be executed on another computing point. Therefore, the matching of the total number of tasks scheduled to be executed to the total amount of resources available during a particular time period directly affects the survival rate of the batch of tasks. Considering the number of scheduled tasks from the perspective of fully utilizing resources, the same resources may serve more tasks, but the probability of task execution failure due to physical damage of a computing node is higher (for example, due to lack of survivability replacing resources), and then the task survival rate is not very high. On the contrary, if the total number of tasks scheduled in the same period is excessively reduced, the probability of task execution failure caused by physical damage of the computing node is greatly reduced. This is mainly due to the fact that there are more alternative successor compute nodes when rescheduling. However, contemporaneous resource utilization will be low. In this case, although the survival rate of the mission may be high, it is not meaningful to have the survival rate of the mission high in exchange for a serious reduction in resource utilization, especially in a resource-limited battlefield environment. Therefore, the interaction relationship between the task survival rate and the resource utilization rate needs to be discussed, and a reasonable balance point between the task survival rate and the resource utilization rate needs to be found.

However, the existing research and technical scheme aiming at the reasonable balance point between the two methods are often not reliable and the method is very complicated.

Disclosure of Invention

The invention aims to provide a communication network task resource scheduling method based on Q learning, which has high reliability and good stability and is simple and convenient.

The invention provides a communication network task resource scheduling method based on Q learning, which comprises the following steps:

s1, acquiring a real-time communication state and a communication parameter of a communication network, and initializing an R table;

s2, each task scheduling node of the communication network carries out training of a self Q table;

s3, each task scheduling node of the communication network makes a decision of a self Q table;

s4, the communication network carries out subsequent task resource scheduling according to the Q table obtained by each task scheduling node in the step S3;

s5, each task scheduling node of the communication network updates the R table of the task scheduling node;

s6, repeating the steps S2-S5 to carry out continuous communication network task resource scheduling.

The initializing R table in step S1 is specifically initialized by the following steps:

the method comprises the following steps: each initial state

The value of the medium resource item does not exceed the sum of the initialized resource quantities of all the nodes;

for each one

The following steps II to VIII are repeated; wherein

Scheduling the state of the node i at the time 0 for the task; s_iScheduling a state space set of a node i for the task;

for each one

The following steps III to VIII are repeated;

scheduling the action taken by node i at time 0 for the task; a. the_iScheduling an action set of node i for the task;

according to the initial action

Estimating the amount of tasks to be scheduled;

IV, estimating the resource quantity required by the task according to the quantity of the task to be scheduled;

v. according to the resource quantity and initial state of the task to be scheduled

Estimating resource utilization by values of medium resource items

Estimating the mean value of the damage probability of all nodes according to the damage probability initialized by each node;

VII, judging: if the initial state is

If the value of the middle task item is not greater than the value of the resource item, taking the mean value of the node damage probability as the success rate of the initial task

Otherwise, the success rate of the initial task is determined

Set to 0;

VIII, initializing the return value r obtained by task scheduling node i at time 0_i ⁰：

ε₂Is a weight factor and has a value range of 0 to 1.

Step S2, each task scheduling node of the communication network trains its own Q table, specifically, the following steps are adopted for training:

repeating the following steps A to F until the repetition times reach the set times K:

A. randomly selecting an initial state

Scheduling the state of the node i at the moment t for the task; s_iScheduling a state space set of a node i for the task;

B. setting a first variable Q_maxIs 0;

C. for each one

The following steps a to c are all carried out;

scheduling the action taken by node i at time t for the task; a. the_iScheduling the action set of node i for the task:

a. and calculating the Q value of the task scheduling node i at the t +1 moment by adopting the following formula:

in the formula

Scheduling the Q value of the node i at the moment t +1 for the task; alpha is a learning factor and has a value range of [0, 1%]And the larger the value of alpha is, the more the performer of the action pays more attention to the current return;

scheduling the Q value of the node i at the moment t for the task;

a report value obtained by the task scheduling node i at the time t + 1; beta is a discount factor, the value range is [0,1 ], and the larger the value of beta is, the more important the future return is put on the executor of the action;

taking action at time t for task scheduling node i

Rear slave status

A new state of transition;

scheduling node i in a new state for a task

An action for obtaining the maximum Q value;

scheduling node i for a task at time t +1 in a new state

Take action

The Q value of (1);

b. update Q_iThe corresponding elements in (1); q_iScheduling the Q table of the node i for the task;

c. for updated Q_iThe element (2) is judged:

if it is

Then Q will be_maxIs updated to

At the same time a_maxIs updated to

a_maxScheduling node i for a task to be in state at time t +1

An action for obtaining the maximum Q value;

otherwise, Q_maxAnd a_maxThe change is not changed;

D. setting detection probability

E. Generating a random number epsilon, wherein the value range of epsilon is 0-1;

F. for detection probability

And the generated random number epsilon:

if it is

Then the judgment is made again: if action a_maxCan change the state

Transition to the next state

Then will be

Is assigned to

And jumping to the step B; otherwise, jumping back to the step A;

otherwise, from set A_iIn the step (a) randomly selects a division a_maxAnd performing the following actions again: if the selected action can change the state

Transition to the next state

Will be used

Is assigned to

And jumping to the step B; otherwise, jumping back to step A.

Step S3, where each task scheduling node of the communication network makes a decision on its own Q table, specifically, the following steps are adopted to make the decision:

(1) initial setting

And a second variable V ═ 0;

(2) for each one

The following operations are all carried out:

according to

From Q_iFind out

And (4) judging: if it is

Then will be

Assign a value to V and simultaneously will

Is assigned to a₀，a₀Scheduling node i for a task to be in state at time t

An action for obtaining the maximum Q value;

otherwise, V and a₀The change is not changed;

(3) and (4) judging: if action a₀Can change the state

Transition to the next state

Then the following formula is adopted to calculate

(4) Update Q_iThe corresponding elements in (1);

(5) will be provided with

Is assigned to

And returning to the step (2).

Step S5, each task scheduling node of the communication network updates its R table, specifically, the following steps are adopted for updating:

1) statistics from_tTo l_t+τ_tTotal amount of resources in the period resource view, and is denoted as f_i ^t；l_tScheduling and executing a virtual time t for the task; tau is_tScheduling and executing cycles for the tasks; the resource view is visible for the scheduling node i in the current scheduling periodExecuting the node set;

2) statistics from_tTo l_t+τ_tThe amount of tasks during which tasks have been scheduled for execution is noted

And make statistics of

The total amount of occupied resources;

3) estimating and recording the resource utilization rate according to the statistical results of the step 1) and the step 2)

The resource utilization rate is defined as the ratio of the actual occupied resource amount to the total resource amount;

4) according to the order of_tTo l_t+τ_tThe damage rate of each node executing the task in the period, and the success rate of task execution is estimated;

5) based on the success rate of each task obtained in the step 4), counting the average success rate of all the tasks and recording the average success rate as the success rate

6) Calculating the return value obtained by the task scheduling node i at the moment t by adopting the following formula

In the formula of₁Is a weight factor, and the value range is 0-1;

counting the average success rate of all tasks at the moment t for the task scheduling node i;

counting the resource utilization rate of the task scheduling node i at the moment t;

7) according to

In a report table R_iFinds the most recent state;

8) according to

In a report table R_iFind the most recent action;

9) using r_i ^tUpdate the report back R_iThe latest status found and the corresponding reward value of the latest action found.

The communication network task resource scheduling method based on Q learning provided by the invention utilizes the characteristics of Q learning to find a breakthrough for the problem of the mutual influence relation between the survival rate of the modeling task and the resource utilization rate in the uncertain high dynamic network environment, realizes the task resource scheduling and balance of the communication network under complex conditions through innovative algorithm research and implementation, and has the advantages of high reliability, good stability, simplicity and convenience.

Drawings

FIG. 1 is a schematic process flow diagram of the process of the present invention.

Fig. 2 is a schematic diagram of a network structure according to an embodiment of the present invention.

Fig. 3 is a schematic diagram of task survival rate and resource utilization rate of different parameter combinations of the Q learning model under different training rounds according to the embodiment of the present invention.

Fig. 4 is a schematic diagram illustrating an influence of a damage probability of a task execution node and the number of training rounds on a task survival rate and a resource utilization rate according to an embodiment of the present invention.

Fig. 5 is a schematic diagram illustrating an influence of the number of currently scheduled tasks and the number of training rounds on task survival rate and resource utilization rate according to an embodiment of the present invention.

Detailed Description

The invention provides a method for modeling the interaction relation between the task survival rate and the resource utilization rate. In order to clarify the interaction relationship between the task survival rate and the resource utilization rate and obtain a balance result which takes into account the two performance requirements of the task survival rate and the resource utilization rate, an effective approach is to model the problem as a multi-objective constrained optimization problem. However, due to the large number of modeling parameters involved in an uncertain high dynamic network environment, purely mathematical modeling is complicated and the solution is also quite difficult. By utilizing the characteristics of Q learning, a breakthrough is found for the problem of the mutual influence relationship between the survival rate of the modeling task and the resource utilization rate in the high dynamic network environment with uncertainty.

Task survival refers to the ratio of the number of tasks successfully completed within a particular period of time in the system to the total number of tasks scheduled. In a resource-limited environment (such as a battlefield environment), the total number of tasks to be executed needs to be scheduled and deployed to an appropriate computing node for execution according to the distribution of node resources and network resources sensed in real time and the resource availability, wherein the earliest task to be executed and the latest task to be completed determine the execution period of the batch of tasks, i.e. a specific period in the definition of the task survival rate. Two main factors can influence the survival of one task, one is that the computing resource allocation is unreasonable, and the task fails to be executed due to no resource availability, which is mainly related to the task decomposition and the performance of a scheduling algorithm; secondly, the physical damage of the computing nodes causes the failure of task execution. The invention mainly focuses on the latter and provides a Q learning-based multi-target constraint problem solving method.

Before using the Q learning method, a state space, reward, Q table is defined for the problem of interest. Since each task scheduling node may acquire only the node resource distribution and usage status and network parameters in its neighboring area in real time, it is considered that a neighboring area centered around the task scheduling node is a unit to define a state space, an action space, a report, an R table, and a Q table. The state space is defined as a combined space with dimensions of 'available resource amount' and 'number of scheduled tasks'; the value range of each dimension can be set according to the fluctuation range of the available resource amount in the region and the historical experience value of the variation range of the task load.

An action refers to an operation in which the system can change states through parameter adjustment. Here, since the amount of available resources cannot be actively adjusted, the action space is defined as a set of the number of tasks that can be selected for scheduling. In a particular state, after an action is taken, the current state transitions to a new state and the performer of the action receives a return.

The R table is defined as a two-dimensional matrix, each row of the matrix representing a state, each column representing an action, the values in the matrix representing specific return values that can be evaluated. Similarly, a Q table is also defined as a two-dimensional matrix in which each row represents a state and each column represents an action, and the values in the matrix are referred to as Q values. The Q value represents the degree to which the agent has acquired "knowledge" in different environments. When an action is taken, the transition from the current state to the next new state occurs, and the actor of the action obtains a new return value.

Because different task scheduling nodes have different resource views, each task scheduling node independently maintains an R table and a Q table of the task scheduling node. Due to the dynamics of resources and tasks, the actual state space is large. In order to obtain more accurate decision results, the state space of Q learning should be as large as possible, and the action set should also be large enough, so the training process of Q learning can be very time-consuming and computationally intensive. Therefore, the training task of Q learning can be reasonably scheduled nearby, and the cooperation among scattered computing nodes is utilized to ensure enough computing power. The decision making process based on the trained Q table has low requirement on computing power, and the computing power of a single scattered node is sufficient usually and can be considered to be executed by a task scheduling node. The ultimate goal of Q learning is to obtain a converged Q table, i.e., no longer so does the value in the Q table; however, in the practical application process, because the state space and the action space are large, the Q table needs a long training time to reach the convergence, so the Q table is often used directly after being trained for a certain time, and then the Q table is updated in the using process, so that the Q value in the Q table can be continuously close to the convergence value in the updating process.

Therefore, the communication network task resource scheduling method based on Q learning provided by the invention comprises the following steps (as shown in fig. 1):

s1, acquiring a real-time communication state and a communication parameter of a communication network, and initializing an R table; specifically, the following steps are adopted for initialization:

the method comprises the following steps: each initial state

for each one

The following steps II to VIII are repeated; wherein

for each one

The following steps III to VIII are repeated;

according to the initial action

Estimating the amount of tasks to be scheduled;

Estimating resource utilization by values of medium resource items

VII, judging: if the initial state is

Otherwise, the success rate of the initial task is determined

Set to 0;

ε₂Is a weight factor, and the value range is 0-1;

the pseudo code for this section is as follows:

s2, each task scheduling node of the communication network carries out training of a self Q table; specifically, the following steps are adopted for training:

A. randomly selecting an initial state

B. setting a first variable Q_maxIs 0;

C. for each one

The following steps a to c are all carried out;

in the formula

scheduling the Q value of the node i at the moment t for the task;

taking action at time t for task scheduling node i

Rear slave status

A new state of transition;

scheduling node i in a new state for a task

An action for obtaining the maximum Q value;

scheduling node i for a task at time t +1 in a new state

Take action

The Q value of (1);

c. for updated Q_iThe element (2) is judged:

if it is

Then Q will be_maxIs updated to

At the same time a_maxIs updated to

a_maxScheduling node i for a task to be in state at time t +1

An action for obtaining the maximum Q value;

otherwise, Q_maxAnd a_maxThe change is not changed;

D. setting detection probability

F. for detection probability

And the generated random number epsilon:

if it is

Then the judgment is made again: if action a_maxCan change the state

Transition to the next state

Then will be

Is assigned to

And jumping to the step B; otherwise, jumping back to the step A;

Transition to the next state

Will be used

Is assigned to

And jumping to the step B; otherwise, jumping back to the step A;

the pseudo code for this section is as follows:

s3, each task scheduling node of the communication network makes a decision of a self Q table; specifically, the following steps are adopted for decision making:

(1) initial setting

And a second variable V ═ 0;

(2) for each one

The following operations are all carried out:

according to

From Q_iFind out

And (4) judging: if it is

Then will be

Assign a value to V and simultaneously will

Is assigned to a₀，a₀Scheduling for tasksNode i is in state at time t

An action for obtaining the maximum Q value;

otherwise, V and a₀The change is not changed;

(3) and (4) judging: if action a₀Can change the state

Transition to the next state

Then the following formula is adopted to calculate

(4) Update Q_iThe corresponding elements in (1);

(5) will be provided with

Is assigned to

And returning to the step (2);

the pseudo code for this section is as follows:

s5, each task scheduling node of the communication network updates the R table of the task scheduling node; specifically, the method comprises the following steps:

1) statistics from_tTo l_t+τ_tTotal amount of resources in the period resource view, and is denoted as f_i ^t；l_tScheduling and executing a virtual time t for the task; tau is_tScheduling and executing cycles for the tasks; the resource view is a visible execution node set of a scheduling node i in the current scheduling period;

And make statistics of

The total amount of occupied resources;

6) Calculating the return value r obtained by the task scheduling node i at the moment t by adopting the following formula_i ^t：

In the formula of₁Is a weight factor and takes a valueThe range is 0-1;

7) according to

In a report table R_iFinds the most recent state;

8) according to

In a report table R_iFind the most recent action;

9) use of

Update back report R_iThe found latest state and the found corresponding return value of the latest action;

the pseudo code for this section is as follows:

The invention uses OMNeT + + to build a simulation network. 50 nodes are arranged in a circular platform with the radius of 500 meters, parameters such as initial coordinates, computing resources (for example, the number of CPUs), communication radius and the like of each node are read from a data file (with an ini suffix) when a simulation system is initialized, data in the data file is prepared in advance, coordinate data of the nodes are randomly generated in the circular platform with the radius of 500 meters, the number of idle CPUs of each node is randomly extracted in a range from 0 to 16, and the communication radius of each node is 250 meters. And after the simulation system is started, the coordinate position of each node is refreshed once again at a time interval T. In order to simplify the refresh process, the current value can be randomly increased or decreased by 0-50%. In order to simplify other simulation details and focus on the discussion and analysis of the relationship between the task survival rate and the resource utilization rate, a node with a fixed ID is set as a task scheduling node, and the number of tasks to be scheduled and executed is counted according to standard small tasks.

A standard tasklet requires at least one CPU for a duration of T to complete successfully. In order to embody the characteristics of a communication environment (such as a battlefield environment) in which resources are always limited, the number of small tasks to be scheduled on the task scheduling node should be set to be enough so that the total amount of resources required by the small tasks is larger than the amount of resources in the resource view obtained by the task scheduling node. The task scheduling node is supposed to obtain a predicted resource view within a two-hop communication range of the task scheduling node only in real time, and according to the resource view, the task is distributed to the selected task execution node, and the task execution node is set to be damaged with a certain probability. If the damage happens, the task scheduling node schedules the task on the damaged node to other nodes with idle resources for execution, and if no resources are available, the task is declared to fail. The resource view of the task scheduling node is illustrated in fig. 2.

Firstly, the task survival rate and the resource utilization rate of different parameter combinations of the Q learning model under different training rounds are analyzed, and the simulation result is shown in FIG. 3. Here, the damage probability of the task execution node is uniformly set to 10%, and the number of the currently scheduled tasks is uniformly set to 50% N (N is the maximum number of tasks that can be scheduled in the simulation system). As can be seen from fig. 3(a), the task survival rate increases as the number of training rounds of the Q-table increases. This is because the training results of the Q-table gradually converge to the optimum. FIG. 3(b) shows that as the number of training rounds increases, the resource utilization rate also increases, and the explanation is the same as that in FIG. 3 (a). Meanwhile, the parameter combination of Q learning is α ═ 0.8 and β ═ 0.2, which can be seen as the best effect of task survival rate and resource utilization rate.

Then, the task survival rate and the resource utilization rate of different task execution node damage probabilities under different training rounds are analyzed, and the simulation result is shown in fig. 4. Here, the parameter combinations of Q learning are uniformly set to α ═ 0.8 and β ═ 0.2; and the number of currently scheduled tasks is uniformly set to 50% N (N is the maximum number of tasks schedulable in the simulation system). As can be seen from fig. 4(a), the task survival rate increases as the number of training iterations of the Q-table increases, and the explanation is the same as that of fig. 3 (a). Meanwhile, as seen from fig. 4(a), a greater probability of node destruction may result in a decreased task survival rate. This is because the survival rate of tasks assigned to nodes is reduced because the node destruction rate is increased. Fig. 4(b) shows that the explanation of the resource utilization rate that increases as the number of training iterations of the Q table increases is the same as that of fig. 3(b), and the reason that the utilization rate is different due to different node destruction probabilities is that the larger destruction probability means that the resource amount of the available resource view decreases, and the resource occupied by the task does not change.

Finally, the influence of the damage probability of different task execution nodes and the number of the current scheduling tasks on the task survival rate and the resource utilization rate is analyzed, and the simulation result is shown in fig. 5. Here, the parameter combinations of Q learning are uniformly set to α ═ 0.8 and β ═ 0.2, and the number of training rounds is uniformly employed 100000 times. Each data point in fig. 5 is a Q table after 100000 rounds of training based on a set Q learning parameter combination, and results of task survival rate and resource utilization rate under different task execution node damage probabilities are obtained when the current scheduling task number takes different values.

As shown in fig. 5(a), as the task scheduling number increases, the task survival rate is in a downward trend, mainly because if a task execution node is damaged, the chance of finding a successor node is reduced, and the condition is aggravated by a greater node damage probability. Fig. 5(b) shows that the resource utilization becomes larger as the number of task schedules increases. This is because the amount of resources occupied by the task increases, the amount of views of resources available to the system is constant, and the resource utilization rate increases. Meanwhile, as seen from fig. 5(b), a larger node destruction probability results in a smaller amount of resources of the available resource view, and the amount of resources required by the task does not change under the same current task scheduling number, so that the resource utilization rate shows an upward trend.

Claims

1. A communication network task resource scheduling method based on Q learning comprises the following steps:

2. The method for scheduling task resources of communication network based on Q learning according to claim 1, wherein the initializing R table in step S1 is specifically initialized by the following steps:

the method comprises the following steps: each initial state

for each one

The following steps II to VIII are repeated; wherein

for each one

All repeatedly carry out the following step IIIStep VIII;

according to the initial action

Estimating the amount of tasks to be scheduled;

Estimating resource utilization by values of medium resource items

VII, judging: if the initial state is

Otherwise, the success rate of the initial task is determined

Set to 0;

VIII, initializing the return value obtained by task scheduling node i at time 0

ε₂Is a weight factor and has a value range of 0 to 1.

3. The communication network task resource scheduling method based on Q learning according to claim 1 or 2, wherein each task scheduling node of the communication network in step S2 performs training of its own Q table, specifically, the following steps are adopted for training:

A. randomly selecting an initial state

B. setting a first variable Q_maxIs 0;

C. for each one

The following steps a to c are all carried out;

in the formula

scheduling the Q value of the node i at the moment t for the task;

taking action at time t for task scheduling node i

Rear slave status

A new state of transition;

scheduling node i in a new state for a task

An action for obtaining the maximum Q value;

scheduling node i for a task at time t +1 in a new state

Take action

The Q value of (1);

c. for updated Q_iThe element (2) is judged:

if it is

Then Q will be_maxIs updated to

At the same time a_maxIs updated to

a_maxScheduling node i for a task to be in state at time t +1

An action for obtaining the maximum Q value;

otherwise, Q_maxAnd a_maxThe change is not changed;

D. setting detection probability

F. for detection probability

And the generated random number epsilon:

if it is

Then the judgment is made again: if action a_maxCan change the state

Switch to the next oneStatus of state

Then will be

Is assigned to

And jumping to the step B; otherwise, jumping back to the step A;

Transition to the next state

Will be used

Is assigned to

And jumping to the step B; otherwise, jumping back to step A.

4. The method for scheduling task resources in communication network based on Q learning as claimed in claim 3, wherein each task scheduling node in the communication network in step S3 makes a decision on its own Q table, specifically, the following steps are adopted for making the decision:

(1) initial setting

And a second variable V ═ 0;

(2) for each one

The following operations are all carried out:

according to

From Q_iFind out

And (4) judging: if it is

Then will be

Assign a value to V and simultaneously will

Is assigned to a₀，a₀Scheduling node i for a task to be in state at time t

An action for obtaining the maximum Q value;

otherwise, V and a₀The change is not changed;

(3) and (4) judging: if action a₀Can change the state

Transition to the next state

Then the following formula is adopted to calculate

(4) Update Q_iThe corresponding elements in (1);

(5) will be provided with

Is assigned to

And returning to the step (2).

5. The method for scheduling task resources in communication network based on Q learning as claimed in claim 4, wherein each task scheduling node in communication network in step S5 updates its R table, specifically, the following steps are adopted for updating:

1) statistics from_tTo l_t+τ_tTotal amount of resources in the period resource view, and is noted

l_tScheduling and executing a virtual time t for the task; tau is_tScheduling and executing cycles for the tasks; the resource view is a visible execution node set of a scheduling node i in the current scheduling period;