CN113672372B - Multi-edge collaborative load balancing task scheduling method based on reinforcement learning - Google Patents

Multi-edge collaborative load balancing task scheduling method based on reinforcement learning

Info

Publication number
CN113672372B
Authority
CN
China
Prior art keywords
value
edge
load balancing
action
balancing scheme
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111000830.0A
Other languages
Chinese (zh)
Other versions
CN113672372A (en)
Inventor
陈哲毅
胡俊钦
陈星
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University filed Critical Fuzhou University
Priority to CN202111000830.0A priority Critical patent/CN113672372B/en
Publication of CN113672372A publication Critical patent/CN113672372A/en
Application granted granted Critical
Publication of CN113672372B publication Critical patent/CN113672372B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/505Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Abstract

The invention relates to a multi-edge collaborative load balancing task scheduling method based on reinforcement learning, which comprises the following steps: step S1: according to the historical data set, a reinforcement learning algorithm is used for evaluating the Q value of each adjustment operation under different system states; step S2: preprocessing the Q value of the adjustment operation in the Q value table constructed in the step S1, and then training a Q value prediction model by using a machine learning algorithm; step S3: each edge independently makes decisions in parallel according to the Q-value prediction model. The invention combines reinforcement learning and machine learning, and designs a multi-edge cooperative load balancing algorithm in the wireless metropolitan area network. Each edge node can independently perform load balancing scheduling between the edge node and the adjacent nodes by only using local information, and gradually search for a proper load balancing scheme through feedback control and multi-edge cooperation. The sought solution can effectively reduce the response time of the task.

Description

Multi-edge collaborative load balancing task scheduling method based on reinforcement learning
Technical Field
The invention relates to the field of load balancing scheduling strategies in edge computing, and in particular to a multi-edge collaborative load balancing task scheduling method based on reinforcement learning.
Background
In recent years, advances in mobile computing technology have enabled users to experience a wide variety of applications. However, as the resource requirements of newly developed applications continue to grow, the computing power of mobile devices remains limited. A traditional approach to overcoming mobile device resource scarcity is to utilize the abundant computing resources in the remote cloud. Mobile devices can reduce their workload and extend battery life by offloading computation-intensive tasks to the remote cloud for execution. However, the long distance between the cloud and the user causes delays in applications with frequent user interaction, degrading the user experience. To minimize the delay of offloading tasks to the remote cloud, researchers have proposed offloading tasks to edges closer to the user.
An edge is a cluster of computers rich in resources that connect to nearby mobile users through a wireless network. By providing low latency access to its rich computing resources, edges can significantly improve the performance of mobile applications. Although edges are often defined as isolated "data centers in boxes," it is a clear benefit to connect multiple edges together to form a network. Cities are typically highly populated, meaning that edges will be available to a large number of users. This increases the cost effectiveness of the edges because they are less likely to be idle. In addition, due to the size of the network, wireless metropolitan area network service providers can take advantage of economies of scale in providing edge services over wireless metropolitan area networks, making the edge services more acceptable to the public.
One major problem faced by wireless metropolitan area network service providers is how to distribute users' task requests to different edges so that the workload among the edges in the wireless metropolitan area network is well balanced, thereby shortening task response delay and enhancing the user experience. In particular, the large number of users in the network means that the workload of each edge will be highly unstable. If an edge is suddenly overwhelmed by user requests, its task response time will increase dramatically, causing delays in user applications and degrading the user experience. To prevent some edges from being overloaded, it is important to assign user requests to different edges so that the workload among the edges is well balanced, thereby reducing the maximum response time.
Disclosure of Invention
In view of the above, the present invention aims to provide a multi-edge collaborative load balancing task scheduling method based on reinforcement learning for solving the multi-edge collaborative load balancing problem in a wireless metropolitan area network. Each edge node independently performs load balancing scheduling between itself and its adjacent nodes based on local information, makes individual scheduling decisions using a method that combines reinforcement learning and machine learning, and gradually searches for a suitable load balancing scheme through feedback control and multi-edge collaboration, which can effectively reduce task response time.
The invention is realized by adopting the following scheme: a multi-edge collaborative load balancing task scheduling method based on reinforcement learning, comprising the following steps:
step S1: according to the historical data set, using a reinforcement learning algorithm to evaluate the Q value of each adjustment operation under different multi-edge collaborative system states;
step S2: preprocessing the Q value of the adjustment operation in the Q value table constructed in the step S1, and then training a Q value prediction model by using a machine learning algorithm;
step S3: each edge independently makes decisions in parallel according to the Q-value prediction model.
Further, the reinforcement learning algorithm in step S1 is:
State space: the state of edge e_i is represented by a triplet consisting of the task arrival rate λ_i of edge e_i, the current local load balancing scheme of edge e_i, and the load rates L_i of edge e_i and its adjacent edges;
Action space: the action space of edge e_i consists of adjustment operations, each of which either increases or decreases the amount of edge e_i's arriving tasks that are scheduled to be executed on an adjacent edge;
Reward function: a reward function is defined over the state transitions caused by the adjustment operations;
the reinforcement learning algorithm adopts the Q-learning algorithm, and the Q value update formula is as follows:
Q(s,a)=Q(s,a)+α[r+γ·max(Q(s',a'))-Q(s,a)]
where max(Q(s',a')) represents the maximum Q value obtained by selecting action a' in state s', the parameter α represents the learning rate, and the parameter γ represents the reward discount factor.
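For illustration, a minimal Python sketch of how the state triplet and the adjustment actions of one edge could be represented is given below; the names EdgeState, Action and action_space, and the integer encoding of the adjustment amounts, are assumptions made for this sketch and are not part of the patent.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class EdgeState:
    arrival_rate: float              # lambda_i: task arrival rate of edge e_i
    local_scheme: Tuple[int, ...]    # current local load balancing scheme (task amount sent to each neighbour)
    load_rates: Tuple[float, ...]    # load rates of e_i and its adjacent edges

@dataclass(frozen=True)
class Action:
    neighbor: int                    # index of the adjacent edge whose share is adjusted
    delta: int                       # +1: schedule more arriving tasks there, -1: schedule fewer

def action_space(num_neighbors: int) -> List[Action]:
    """One increase action and one decrease action per adjacent edge."""
    return [Action(j, d) for j in range(num_neighbors) for d in (+1, -1)]

print(action_space(2))               # an edge with two neighbours has four adjustment operations
```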
Further, the step S1 specifically includes the following steps:
step S11: initializing a Q value table;
step S12: using a Q-learning algorithm to evaluate the Q value of the adjustment operation in each piece of historical data, and continuing the training process until the Q value converges;
step S13: obtaining a Q value table that records, at different moments, the task arrival rate λ_i of edge e_i, the current local load balancing scheme of edge e_i, the load rates L_i of edge e_i and its adjacent edges, and the Q value corresponding to each adjustment operation.
Further, the step S12 specifically includes the following steps:
step S121: in each round, first randomly initializing the current local load balancing scheme and generating the current system state;
the current local load balancing scheme is initialized randomly;
the state of edge e_i is represented by a triplet consisting of the task arrival rate λ_i of edge e_i, the current local load balancing scheme of edge e_i, and the load rates L_i of edge e_i and its adjacent edges;
step S122: if the current local load balancing scheme is not the target local load balancing scheme, cyclically executing steps S123 to S125;
step S123: selecting an action a from the action space according to an epsilon-greedy strategy; selecting an action a with an epsilon-greedy policy:
a=select_action(s,Q_table);
step S124: reaching state s' and obtaining a reward r under the action of the state transfer function;
the current state is transformed into s' by the state transfer function:
s'=T(s,a)
and the agent obtains a reward value r;
step S125: updating the Q value according to the following formula, and finally replacing the current state s with the state s';
updating the Q value:
Q(s,a)=Q(s,a)+α[r+γ·max(Q(s',a'))-Q(s,a)]
updating the current state:
s=s'。
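A minimal Python sketch of the Q value table construction described in steps S11 to S13 follows. The environment object env with reset(), is_target() and step() methods is a stand-in for the patent's state transfer and reward functions, which are not reproduced here; the hyper-parameter values are taken from the embodiment described later (100 rounds, α = 0.1, γ = 0.9, ε = 0.1).

```python
import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON, EPISODES = 0.1, 0.9, 0.1, 100    # values used in the embodiment

def train_q_table(env, actions):
    Q = defaultdict(float)                               # step S11: initialise the Q value table
    for _ in range(EPISODES):                            # step S12: repeat until the Q values converge
        s = env.reset()                                  # step S121: random initial local scheme -> state
        while not env.is_target(s):                      # step S122: loop until the target scheme is reached
            if random.random() < EPSILON:                # step S123: epsilon-greedy action selection
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s_next, r = env.step(s, a)                   # step S124: state transfer function + reward
            best_next = max(Q[(s_next, x)] for x in actions)
            Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])   # step S125: Q value update
            s = s_next                                   # replace the current state s with s'
    return Q                                             # step S13: the resulting Q value table
```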
Further, in step S2 the Q values are preprocessed according to the following rule:
if a Q value equals 0 and the current local load balancing scheme is not the target local load balancing scheme, the corresponding adjustment operation is considered illegal and its Q value is marked as I; if the current local load balancing scheme equals the target local load balancing scheme, i.e., a target local load balancing scheme has been found, the Q value remains 0; in the remaining cases the Q value is set to its inverse; after preprocessing, the closer the current local load balancing scheme is to the target local load balancing scheme, the smaller the Q value of the adjustment operation, and the Q value takes its minimum of 0 when the target local load balancing scheme is found; based on the preprocessed Q value table, an SVR algorithm is used to train a Q value prediction model, whose regression equation is expressed as:
f(x) = Σ_{i=1}^{m} (α̂_i − α_i)·κ(x, x_i) + b
where m is the number of training samples, κ(x, x_i) is a kernel function, and the remaining parameters are model parameters; a Gaussian kernel is used as the kernel function, i.e.
κ(x, x_i) = exp(−‖x − x_i‖² / (2χ²))
where χ > 0 is the bandwidth of the Gaussian kernel.
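The following Python sketch illustrates the preprocessing rule and the SVR training of step S2 using scikit-learn's SVR with an RBF (Gaussian) kernel. Reading "inverse" as the arithmetic opposite of the Q value, using NaN to stand for the label I, and the feature encoding are assumptions of this sketch.

```python
import numpy as np
from sklearn.svm import SVR

def preprocess_q(q_value, is_target_scheme):
    """Apply the preprocessing rule to one Q value from the Q value table."""
    if q_value == 0 and not is_target_scheme:
        return np.nan            # illegal adjustment operation, marked I (excluded from training)
    if is_target_scheme:
        return 0.0               # target local load balancing scheme found: Q value stays 0
    return -q_value              # remaining cases: opposite value, so smaller means closer to the target

def fit_q_predictor(features, q_values, is_target_flags, bandwidth=1.0):
    """Train a Q value prediction model on the preprocessed Q value table."""
    y = np.array([preprocess_q(q, t) for q, t in zip(q_values, is_target_flags)])
    mask = ~np.isnan(y)                                   # drop the entries marked I
    X = np.asarray(features, dtype=float)[mask]
    # RBF kernel exp(-gamma * ||x - x_i||^2); gamma is derived here from an assumed bandwidth parameter
    model = SVR(kernel="rbf", gamma=1.0 / (2 * bandwidth ** 2))
    model.fit(X, y[mask])
    return model
```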
Further, the specific content of the step S3 is as follows:
adopting an adjustment operation decision algorithm: during decision making, the Q value prediction model is used to predict the Q value of each adjustment operation under different multi-edge collaborative system states, and the action with the smallest Q value is selected; a threshold T is set, and when the Q value corresponding to every adjustment operation is smaller than the threshold T, the current local load balancing scheme is considered close to the target local load balancing scheme and is approximately taken as the target local load balancing scheme; the inputs of the adjustment operation decision algorithm include the task arrival rate λ_i of edge e_i, the current local load balancing scheme of edge e_i, and the load rates L_i of edge e_i and its adjacent edges; the output is the next load balancing adjustment operation a of edge e_i; the specific process is as follows:
firstly, the Q value of each action, i.e., each adjustment operation, is evaluated: if an action is deemed illegal, its corresponding Q value is marked as I; otherwise, the Q value corresponding to each action is predicted by the Q value prediction model;
where prediction_model() — invokes the Q value prediction model
Q_value(a) — the Q value of action a
Then, judging whether the Q values of all legal adjustment operations are smaller than a threshold T, wherein the Q values marked as I are excluded; if the Q values of all actions are smaller than the threshold T, the target local load balancing scheme is considered to be found, and adjustment is not needed, so that the adjustment operation is Null; otherwise, selecting the adjustment operation with the minimum Q value, and if the Q values of a plurality of adjustment operations are the same and are the minimum Q value, randomly selecting one adjustment operation from the adjustment operations;
if(for each Q_value(a)≤T||Q_value(a)==I):
a=Null
else:
record the actions with the minimum Q value that are not marked I:
a_List = A_i.getAction_MinQvalue()
randomly select an action from a_List:
a = a_List.get_Action_Random()
finally, returning the selected next load balancing adjustment operation a of edge e_i.
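A Python sketch of the adjustment operation decision described above is given below; encode_state() and is_legal() are assumed helper functions, None stands for the Null adjustment operation, and the threshold value 0.15 is taken from the embodiment.

```python
import random

THRESHOLD_T = 0.15   # threshold T used in the embodiment

def next_adjustment(model, state, actions, encode_state, is_legal):
    """Return the next adjustment operation for one edge, or None (Null) if no adjustment is needed."""
    predicted = []                                             # (action, Q value) for legal actions only
    for a in actions:
        if not is_legal(state, a):
            continue                                           # illegal action: Q value marked I, excluded
        q = float(model.predict([encode_state(state, a)])[0])  # Q value predicted by the trained model
        predicted.append((a, q))
    if not predicted or all(q <= THRESHOLD_T for _, q in predicted):
        return None                                            # target local scheme (approximately) found
    q_min = min(q for _, q in predicted)
    candidates = [a for a, q in predicted if q == q_min]
    return random.choice(candidates)                           # break ties among minimum-Q actions randomly
```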
Compared with the prior art, the invention has the following beneficial effects:
the invention combines reinforcement learning and machine learning, and designs a multi-edge cooperative load balancing method in a wireless metropolitan area network. Each edge node can independently perform load balancing scheduling between the edge node and the adjacent nodes by only using local information, and gradually search for a proper load balancing scheme through feedback control and multi-edge cooperation. The sought solution can effectively reduce the response time of the task.
Drawings
Fig. 1 is a general frame diagram of an embodiment of the present invention.
FIG. 2 is a graph comparing performance of an embodiment of the present invention with that of a conventional method.
Detailed Description
The invention will be further described with reference to the accompanying drawings and examples.
It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the present application. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments in accordance with the present application. As used herein, the singular is also intended to include the plural unless the context clearly indicates otherwise, and furthermore, it is to be understood that the terms "comprises" and/or "comprising" when used in this specification are taken to specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof.
As shown in fig. 1 and 2, the present embodiment provides a multi-edge collaborative load balancing task scheduling method based on reinforcement learning, which includes the following steps:
step S1: according to the historical data set, using a reinforcement learning algorithm to evaluate the Q value of each adjustment operation under different multi-edge collaborative system states; the multi-edge cooperative system refers to a multi-edge cooperative system in a wireless metropolitan area network, and consists of a plurality of edges connected with each other through the wireless metropolitan area network.
Step S2: preprocessing the Q value of the adjustment operation in the Q value table constructed in the step S1, and then training a Q value prediction model by using a machine learning algorithm;
step S3: each edge independently makes decisions in parallel according to the Q-value prediction model.
In this embodiment, the reinforcement learning algorithm in step S1 is:
State space: the state of edge e_i is represented by a triplet consisting of the task arrival rate λ_i of edge e_i, the current local load balancing scheme of edge e_i, and the load rates L_i of edge e_i and its adjacent edges;
Action space: the action space of edge e_i consists of adjustment operations, each of which either increases or decreases the amount of edge e_i's arriving tasks that are scheduled to be executed on an adjacent edge;
Reward function: a reward function is defined over the state transitions caused by the adjustment operations;
the reinforcement learning algorithm adopts the Q-learning algorithm, and the Q value update formula is as follows:
Q(s,a)=Q(s,a)+α[r+γ·max(Q(s',a'))-Q(s,a)]
where max(Q(s',a')) represents the maximum Q value obtained by selecting action a' in state s', the parameter α represents the learning rate, and the parameter γ represents the reward discount factor.
In this embodiment, the step S1 specifically includes the following steps:
step S11: initializing a Q value table;
step S12: using a Q-learning algorithm to evaluate the Q value of the adjustment operation in each piece of historical data, and continuing the training process until the Q value converges;
step S13: obtaining a Q value table that records, at different moments, the task arrival rate λ_i of edge e_i, the current local load balancing scheme of edge e_i, the load rates L_i of edge e_i and its adjacent edges, and the Q value corresponding to each adjustment operation.
In this embodiment, the step S12 specifically includes the following steps:
step S121: in each round, first randomly initializing the current local load balancing scheme and generating the current system state;
the current local load balancing scheme is initialized randomly;
the state of edge e_i is represented by a triplet consisting of the task arrival rate λ_i of edge e_i, the current local load balancing scheme of edge e_i, and the load rates L_i of edge e_i and its adjacent edges;
step S122: if the current local load balancing scheme is not the target local load balancing scheme, cyclically executing steps S123 to S125;
step S123: selecting an action a from the action space according to an epsilon-greedy strategy; selecting an action a with an epsilon-greedy policy:
a=select_action(s,Q_table);
step S124: reaching state s' and obtaining a reward r under the action of the state transfer function;
the current state is transformed into s' by the state transfer function:
s'=T(s,a)
and the agent obtains a reward value r;
step S125: updating the Q value according to the following formula, and finally replacing the current state s with the state s';
updating the Q value:
Q(s,a)=Q(s,a)+α[r+γ·max(Q(s',a'))-Q(s,a)]
updating the current state:
s=s'。
In the present embodiment, in step S2 the Q values are preprocessed according to the following rule:
if a Q value equals 0 and the current local load balancing scheme is not the target local load balancing scheme, the corresponding adjustment operation is considered illegal and its Q value is marked as I; if the current local load balancing scheme equals the target local load balancing scheme, i.e., a target local load balancing scheme has been found, the Q value remains 0; in the remaining cases the Q value is set to its inverse; after preprocessing, the closer the current local load balancing scheme is to the target local load balancing scheme, the smaller the Q value of the adjustment operation, and the Q value takes its minimum of 0 when the target local load balancing scheme is found; based on the preprocessed Q value table, an SVR algorithm is used to train a Q value prediction model (training the Q value prediction model means training the parameters of its regression equation), whose regression equation is expressed as:
f(x) = Σ_{i=1}^{m} (α̂_i − α_i)·κ(x, x_i) + b
where m is the number of training samples, κ(x, x_i) is a kernel function, and the remaining parameters are model parameters; a Gaussian kernel is used as the kernel function, i.e.
κ(x, x_i) = exp(−‖x − x_i‖² / (2χ²))
where χ > 0 is the bandwidth of the Gaussian kernel.
In this embodiment, the specific content of step S3 is as follows:
adopting an adjustment operation decision algorithm: during decision making, the Q value prediction model is used to predict the Q value of each adjustment operation under different multi-edge collaborative system states, and the action with the smallest Q value is selected; a threshold T is set, and when the Q value corresponding to every adjustment operation is smaller than the threshold T, the current local load balancing scheme is considered close to the target local load balancing scheme and is approximately taken as the target local load balancing scheme; the inputs of the adjustment operation decision algorithm include the task arrival rate λ_i of edge e_i, the current local load balancing scheme of edge e_i, and the load rates L_i of edge e_i and its adjacent edges; the output is the next load balancing adjustment operation a of edge e_i; the specific process is as follows:
firstly, the Q value of each action, i.e., each adjustment operation, is evaluated: if an action is deemed illegal, its corresponding Q value is marked as I; otherwise, the Q value corresponding to each action is predicted by the Q value prediction model;
where prediction_model() — invokes the Q value prediction model
Q_value(a) — the Q value of action a
Then, judging whether the Q values of all legal adjustment operations are smaller than a threshold T, wherein the Q values marked as I are excluded; if the Q values of all actions are smaller than the threshold T, the target local load balancing scheme is considered to be found, and adjustment is not needed, so that the adjustment operation is Null; otherwise, selecting the adjustment operation with the minimum Q value, and if the Q values of a plurality of adjustment operations are the same and are the minimum Q value, randomly selecting one adjustment operation from the adjustment operations;
if(for each Q_value(a)≤T||Q_value(a)==I):
a=Null
else:
record the actions with the minimum Q value that are not marked I:
a_List = A_i.getAction_MinQvalue()
randomly select an action from a_List:
a = a_List.get_Action_Random()
finally, returning the selected next load balancing adjustment operation a of edge e_i.
Preferably, in the present embodiment, the symbols are defined as follows:
Definition 1: the set of edges deployed in the wireless metropolitan area network is denoted E = {e_1, e_2, ..., e_N}, where e_i represents the i-th edge and N is the total number of edges.
Definition 2: the service rates of the N edges are denoted V = {v_1, v_2, ..., v_N}, where v_i represents the service rate of edge e_i.
Definition 3: the unit task transfer times between the N edges are expressed as a matrix D:
where d_{i,j} represents the unit task transmission time between edge e_i and edge e_j.
Definition 4: the task arrival rates of the N edges are denoted λ = {λ_1, λ_2, ..., λ_N}, where λ_i > 0 represents the task arrival rate of edge e_i. For convenience of description, the initial tasks offloaded by users to an edge are hereinafter referred to as the arriving tasks of that edge. Definition 5: the global load balancing scheme is expressed as:
where F_i represents the local load balancing scheme of edge e_i.
Definition 6: the actual task loads of the N edges are denoted W = {w_1, w_2, ..., w_N}, where w_i > 0 represents the actual task amount of edge e_i per unit time.
Definition 7: the actual task load rate l_j of edge e_j per unit time is expressed as:
preferably, the problem in this embodiment is defined as follows:
Definition 8: based on queuing theory, the average execution time of tasks on different edges is:
Definition 9: the task response time consists of execution time and transmission time, so the average response time of the tasks in edge e_i's arriving tasks that are scheduled to be executed on adjacent edge e_j per unit time is:
t_{i,j} = T_a(l_j) + d_{i,j}
Definition 10: the average response time T_i^r of the arriving tasks of edge e_i is:
Definition 11: the maximum average response time of arriving tasks on the N edges is defined as:
T_max = max{T_1^r, T_2^r, ..., T_N^r}
Definition 12: the objective function is:
min(T_max)
i.e., to minimize the maximum average response time of arriving tasks on the N edges.
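The following Python sketch illustrates definitions 9 to 12; the M/M/1-style execution time used for T_a is an assumption, since the queuing formula of definition 8 is not reproduced in the text.

```python
from typing import List

def execution_time(load_rate: float, service_rate: float) -> float:
    """Assumed queuing-style execution time T_a: grows sharply as the load rate approaches 1."""
    return 1.0 / (service_rate * (1.0 - load_rate))

def response_time(i: int, j: int, load_rates: List[float],
                  service_rates: List[float], d: List[List[float]]) -> float:
    """Definition 9: t_ij = T_a(l_j) + d_ij (execution time plus transmission time)."""
    return execution_time(load_rates[j], service_rates[j]) + d[i][j]

def max_avg_response_time(avg_response_per_edge: List[float]) -> float:
    """Definition 11: T_max; the objective of definition 12 is to minimise this value."""
    return max(avg_response_per_edge)
```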
Reinforcement learning:
State space: the state of edge e_i is represented by a triplet consisting of the task arrival rate λ_i of edge e_i, the current local load balancing scheme of edge e_i, and the load rates L_i of edge e_i and its adjacent edges.
Action space: the action space of edge e_i consists of adjustment operations, each of which either increases or decreases the amount of edge e_i's arriving tasks that are scheduled to be executed on an adjacent edge.
Reward function: a reward function is defined herein over the state transitions caused by the adjustment operations.
The reinforcement learning algorithm of this embodiment adopts the Q-learning algorithm, and the Q value update formula is as follows:
Q(s,a)=Q(s,a)+α[r+γ·max(Q(s',a'))-Q(s,a)]
where max(Q(s',a')) represents the maximum Q value obtained by selecting action a' in state s', the parameter α represents the learning rate, and the parameter γ represents the reward discount factor.
Based on the above definitions, the Q-learning algorithm is used to evaluate the Q values of the different adjustment operations according to the data set of Table 3, as shown in Algorithm 1. First, the Q value table is initialized (line 1). The Q value of the adjustment operation in each piece of historical data is then evaluated using the Q-learning algorithm, and the training process continues until the Q values converge (lines 2-12). In each round, the current local load balancing scheme is first randomly initialized and the current system state is generated (line 3). Then, as long as the current local load balancing scheme is not the target local load balancing scheme, the following procedure (lines 5-11) is cyclically performed: first, an action a is selected from the action space according to the ε-greedy strategy (line 6); then state s' is reached under the state transfer function and a reward r is obtained (lines 7-8); next, the Q value is updated according to equation (9) (line 9); finally, the current state s is replaced by the state s' (line 10).
Thus, a Q value table is obtained that records, at different moments, the task arrival rate λ_i of edge e_i, the current local load balancing scheme of edge e_i, the load rates L_i of edge e_i and its adjacent edges, and the Q value corresponding to each adjustment operation.
Q value prediction model:
the Q value is preprocessed, and the processing rule is as follows:
If a Q value equals 0 and the current local load balancing scheme is not the target local load balancing scheme, the corresponding adjustment operation is considered illegal and its Q value is marked as I. If the current local load balancing scheme equals the target local load balancing scheme (i.e., the target local load balancing scheme is found), the Q value remains 0. For the remaining cases, the Q value is set to its inverse. After preprocessing, the closer the current local load balancing scheme is to the target local load balancing scheme, the smaller the Q value of the adjustment operation, and the Q value takes its minimum of 0 when the target local load balancing scheme is found. Based on the preprocessed Q value table, an SVR algorithm is used to train a Q value prediction model, whose regression equation can be expressed as:
f(x) = Σ_{i=1}^{m} (α̂_i − α_i)·κ(x, x_i) + b
where m is the number of training samples, κ(x, x_i) is a kernel function, and the remaining parameters are model parameters. We choose a Gaussian kernel as the kernel function, i.e.
κ(x, x_i) = exp(−‖x − x_i‖² / (2χ²))
where χ > 0 is the bandwidth of the Gaussian kernel.
Local load balancing scheduling algorithm:
First, edge e_i's current local load balancing scheme is initialized (line 1), i.e., the arriving tasks of edge e_i are all executed at the node itself.
Then edge e_i repeatedly performs the following procedure (lines 2 to 11):
Step 1: edge e_i acquires the load rates L_i of itself and its adjacent edges (line 3).
Step 2: using the task arrival rate λ_i of edge e_i, the current local load balancing scheme of edge e_i, and the load rates L_i of edge e_i and its adjacent edges as input, obtain edge e_i's next load balancing adjustment operation according to Algorithm 3 (line 4).
Step 3: if the adjustment operation returned by Algorithm 3 is Null, edge e_i is declared to have found a target local load balancing scheme and no further adjustment is needed (lines 5-6); otherwise, edge e_i performs the obtained adjustment operation and updates its current local load balancing scheme (lines 7 to 9).
The algorithm 3 predicts the Q value of each operation in different system states by using a Q value prediction model, and selects the action with the smallest Q value. When the Q value corresponding to each adjustment operation is smaller than the threshold T, we consider that the current local load balancing scheme is close enough to the target local load balancing scheme, and at this time, the current local load balancing scheme can be approximately used as the target local load balancing scheme.
The inputs to Algorithm 3 include the task arrival rate λ_i of edge e_i, the current local load balancing scheme of edge e_i, and the load rates L_i of edge e_i and its adjacent edges; the output is the next load balancing adjustment operation a of edge e_i. The specific process is as follows:
First, the Q value of each action (adjustment operation) is evaluated. If an action is deemed illegal, its corresponding Q value is marked as I; in other cases, the Q value corresponding to each action is predicted by the Q value prediction model (lines 1 to 7).
Then, it is determined whether the Q values of all legal adjustment operations are less than the threshold T (the Q values marked as I are excluded). If the Q values of all actions are smaller than the threshold T, the target local load balancing scheme is considered to be found and no adjustment is needed, so the adjustment operation is Null. Otherwise, the adjustment operation with the smallest Q value is selected; if several adjustment operations share the same smallest Q value, one of them is selected randomly (lines 8-13).
Finally, return the selected edge e i Next load balancing adjustment operation a (line 14).
Preferably, five areas are randomly selected on the distribution map of wireless base stations in Shanghai, and five different simulation scenarios are designed. In each scenario the total number of edges is N = 15, and the longitude and latitude coordinates of 15 wireless base stations are randomly selected in each area as the coordinates of the edges. The task arrival rate λ_i of each edge follows the normal distribution N(10, 4) and the service rate v_i follows the normal distribution N(15, 6); the number of edges each edge is connected to is greater than 0 and at most 3, and the unit task transmission time D between edges is mapped into the interval [0.1, 0.2] according to the distance between the edges, so the closer two edges are, the smaller the unit task transmission time. The number of reinforcement learning iteration rounds (episodes), the learning rate α, and the reward discount γ are set to 100, 0.1, and 0.9, respectively. ε = 0.1 is used in the ε-greedy policy, and the threshold T is set to 0.15.
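For illustration, one such simulation scenario could be generated as in the Python sketch below; reading N(10, 4) and N(15, 6) as mean/standard-deviation pairs, the placeholder coordinate ranges, the clipping to positive values, and the linear distance-to-time mapping are assumptions of this sketch.

```python
import random

N = 15                                   # total number of edges per scenario
random.seed(0)

# Placeholder longitude/latitude ranges standing in for randomly chosen base-station coordinates.
coords = [(random.uniform(121.0, 121.9), random.uniform(30.7, 31.5)) for _ in range(N)]
arrival_rate = [max(0.1, random.gauss(10, 4)) for _ in range(N)]   # lambda_i drawn from N(10, 4)
service_rate = [max(0.1, random.gauss(15, 6)) for _ in range(N)]   # v_i drawn from N(15, 6)

def unit_transfer_time(p, q, d_min, d_max):
    """Map the inter-edge distance linearly into [0.1, 0.2]; closer edges transfer faster."""
    dist = ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
    if d_max <= d_min:
        return 0.1
    return 0.1 + 0.1 * (dist - d_min) / (d_max - d_min)
```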
The experimental results, comparing the method proposed in this embodiment (RF-CLB) with a classical ML-based method and a rule-based method, are shown in FIG. 2. The results show that the response time of the load balancing scheme obtained by RF-CLB is 6-9% and 10-12% lower than that of the classical ML-based method and the rule-based method, respectively.
The foregoing description is only of the preferred embodiments of the invention, and all changes and modifications that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims (1)

1. A multi-edge collaborative load balancing task scheduling method based on reinforcement learning, characterized in that the method comprises the following steps:
step S1: according to the historical data set, using a reinforcement learning algorithm to evaluate the Q value of each adjustment operation under different multi-edge collaborative system states;
step S2: preprocessing the Q value of the adjustment operation in the Q value table constructed in the step S1, and then training a Q value prediction model by using a machine learning algorithm;
step S3: each edge independently and parallelly makes decisions according to the Q value prediction model;
the reinforcement learning algorithm in step S1 is:
State space: the state of edge e_i is represented by a triplet consisting of the task arrival rate λ_i of edge e_i, the current local load balancing scheme of edge e_i, and the load rates L_i of edge e_i and its adjacent edges;
Action space: the action space of edge e_i consists of adjustment operations, each of which either increases or decreases the amount of edge e_i's arriving tasks that are scheduled to be executed on an adjacent edge;
Reward function: a reward function is defined over the state transitions caused by the adjustment operations;
the reinforcement learning algorithm adopts the Q-learning algorithm, and the Q value update formula is as follows:
Q(s,a)=Q(s,a)+α[r+γmax(Q(s',a'))-Q(s,a)]
wherein max(Q(s',a')) represents the maximum Q value obtained by selecting action a' in state s', the parameter α represents the learning rate, and the parameter γ represents the reward discount factor;
the step S1 specifically comprises the following steps:
step S11: initializing a Q value table;
step S12: using a Q-learning algorithm to evaluate the Q value of the adjustment operation in each piece of historical data, and continuing the training process until the Q value converges;
step S13: obtaining a Q value table that records, at different moments, the task arrival rate λ_i of edge e_i, the current local load balancing scheme of edge e_i, the load rates L_i of edge e_i and its adjacent edges, and the Q value corresponding to each adjustment operation;
the step S12 specifically includes the following steps:
step S121: in each round, first randomly initializing the current local load balancing scheme and generating the current system state;
the current local load balancing scheme is initialized randomly;
the state of edge e_i is represented by a triplet consisting of the task arrival rate λ_i of edge e_i, the current local load balancing scheme of edge e_i, and the load rates L_i of edge e_i and its adjacent edges;
step S122: if the current local load balancing scheme is not the target local load balancing scheme, cyclically executing steps S123 to S125;
step S123: selecting an action a from the action space according to an epsilon-greedy strategy;
selecting an action a with an epsilon-greedy policy:
a=select_action(s,Q_table);
step S124: reaching state s' and obtaining a reward r under the action of the state transfer function;
the current state is transformed into s' by the state transfer function:
s'=T(s,a)
and the agent obtains a reward value r;
step S125: updating the Q value according to the following formula, and finally replacing the current state s with the state s';
updating the Q value:
Q(s,a)=Q(s,a)+α[r+γmax(Q(s',a'))-Q(s,a)]
updating the current state:
s=s';
in step S2, the Q values are preprocessed according to the following rule:
if a Q value equals 0 and the current local load balancing scheme is not the target local load balancing scheme, the corresponding adjustment operation is considered illegal and its Q value is marked as I; if the current local load balancing scheme equals the target local load balancing scheme, i.e., a target local load balancing scheme has been found, the Q value remains 0; in the remaining cases the Q value is set to its inverse; after preprocessing, the closer the current local load balancing scheme is to the target local load balancing scheme, the smaller the Q value of the adjustment operation, and the Q value takes its minimum of 0 when the target local load balancing scheme is found; based on the preprocessed Q value table, an SVR algorithm is used to train a Q value prediction model, whose regression equation is expressed as:
f(x) = Σ_{i=1}^{m} (α̂_i − α_i)·κ(x, x_i) + b
where m is the number of training samples, κ(x, x_i) is a kernel function, and the remaining parameters are model parameters; a Gaussian kernel is used as the kernel function, i.e.
κ(x, x_i) = exp(−‖x − x_i‖² / (2χ²))
where χ > 0 is the bandwidth of the Gaussian kernel;
the specific content of the step S3 is as follows:
adopting an adjustment operation decision algorithm: during decision making, the Q value prediction model is used to predict the Q value of each adjustment operation under different multi-edge collaborative system states, and the action with the smallest Q value is selected; a threshold T is set, and when the Q value corresponding to every adjustment operation is smaller than the threshold T, the current local load balancing scheme is considered close to the target local load balancing scheme and is approximately taken as the target local load balancing scheme; the inputs of the adjustment operation decision algorithm include the task arrival rate λ_i of edge e_i, the current local load balancing scheme of edge e_i, and the load rates L_i of edge e_i and its adjacent edges; the output is the next load balancing adjustment operation a of edge e_i; the specific process is as follows:
firstly, the Q value of each action, i.e., each adjustment operation, is evaluated: if an action is deemed illegal, its corresponding Q value is marked as I; otherwise, the Q value corresponding to each action is predicted by the Q value prediction model;
where prediction_model() — invokes the Q value prediction model
Q_value(a) — the Q value of action a
Then, judging whether the Q values of all legal adjustment operations are smaller than a threshold T, wherein the Q values marked as I are excluded; if the Q values of all actions are smaller than the threshold T, the target local load balancing scheme is considered to be found, and adjustment is not needed, so that the adjustment operation is Null; otherwise, selecting the adjustment operation with the minimum Q value, and if the Q values of a plurality of adjustment operations are the same and are the minimum Q value, randomly selecting one adjustment operation from the adjustment operations;
finally, returning the selected next load balancing adjustment operation a of edge e_i.
CN202111000830.0A 2021-08-30 2021-08-30 Multi-edge collaborative load balancing task scheduling method based on reinforcement learning Active CN113672372B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111000830.0A CN113672372B (en) 2021-08-30 2021-08-30 Multi-edge collaborative load balancing task scheduling method based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111000830.0A CN113672372B (en) 2021-08-30 2021-08-30 Multi-edge collaborative load balancing task scheduling method based on reinforcement learning

Publications (2)

Publication Number Publication Date
CN113672372A CN113672372A (en) 2021-11-19
CN113672372B true CN113672372B (en) 2023-08-08

Family

ID=78547253

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111000830.0A Active CN113672372B (en) 2021-08-30 2021-08-30 Multi-edge collaborative load balancing task scheduling method based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN113672372B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115118728B (en) * 2022-06-21 2024-01-19 福州大学 Edge load balancing task scheduling method based on ant colony algorithm

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109947567A (en) * 2019-03-14 2019-06-28 深圳先进技术研究院 A kind of multiple agent intensified learning dispatching method, system and electronic equipment
WO2020011068A1 (en) * 2018-07-10 2020-01-16 第四范式(北京)技术有限公司 Method and system for executing machine learning process
CN112506643A (en) * 2020-10-12 2021-03-16 苏州浪潮智能科技有限公司 Load balancing method and device of distributed system and electronic equipment
CN112948112A (en) * 2021-02-26 2021-06-11 杭州电子科技大学 Edge computing workload scheduling method based on reinforcement learning

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020011068A1 (en) * 2018-07-10 2020-01-16 第四范式(北京)技术有限公司 Method and system for executing machine learning process
CN109947567A (en) * 2019-03-14 2019-06-28 深圳先进技术研究院 A kind of multiple agent intensified learning dispatching method, system and electronic equipment
CN112506643A (en) * 2020-10-12 2021-03-16 苏州浪潮智能科技有限公司 Load balancing method and device of distributed system and electronic equipment
CN112948112A (en) * 2021-02-26 2021-06-11 杭州电子科技大学 Edge computing workload scheduling method based on reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on distributed machine learning task scheduling algorithms for cloud computing; 孟彬彬; 吴艳; 西安文理学院学报(自然科学版) (01); full text *

Also Published As

Publication number Publication date
CN113672372A (en) 2021-11-19

Similar Documents

Publication Publication Date Title
Cui et al. Novel method of mobile edge computation offloading based on evolutionary game strategy for IoT devices
Qu et al. DMRO: A deep meta reinforcement learning-based task offloading framework for edge-cloud computing
Wei et al. Joint optimization of caching, computing, and radio resources for fog-enabled IoT using natural actor–critic deep reinforcement learning
CN109947545B (en) Task unloading and migration decision method based on user mobility
Liu et al. A reinforcement learning-based resource allocation scheme for cloud robotics
Li et al. NOMA-enabled cooperative computation offloading for blockchain-empowered Internet of Things: A learning approach
Sun et al. Autonomous resource slicing for virtualized vehicular networks with D2D communications based on deep reinforcement learning
Zhang et al. Joint task offloading and data caching in mobile edge computing networks
CN108416465B (en) Workflow optimization method in mobile cloud environment
Zhao et al. MESON: A mobility-aware dependent task offloading scheme for urban vehicular edge computing
Zhou et al. Learning from peers: Deep transfer reinforcement learning for joint radio and cache resource allocation in 5G RAN slicing
CN115190033B (en) Cloud edge fusion network task unloading method based on reinforcement learning
Supreeth et al. Hybrid genetic algorithm and modified-particle swarm optimization algorithm (GA-MPSO) for predicting scheduling virtual machines in educational cloud platforms
EP4024212B1 (en) Method for scheduling inference workloads on edge network resources
CN107566535B (en) Self-adaptive load balancing method based on concurrent access timing sequence rule of Web map service
CN113672372B (en) Multi-edge collaborative load balancing task scheduling method based on reinforcement learning
ABDULKAREEM et al. OPTIMIZATION OF LOAD BALANCING ALGORITHMS TO DEAL WITH DDOS ATTACKS USING WHALE OPTIMIZATION ALGORITHM
Li et al. DQN-enabled content caching and quantum ant colony-based computation offloading in MEC
CN115022926A (en) Multi-objective optimization container migration method based on resource balance
Akhtar et al. A comparative study of the application of glowworm swarm optimization algorithm with other nature-inspired algorithms in the network load balancing problem
Chen et al. Joint optimization of task offloading and resource allocation via deep reinforcement learning for augmented reality in mobile edge network
CN115499875B (en) Satellite internet task unloading method, system and readable storage medium
Mohammadi et al. SDN-IoT: SDN-based efficient clustering scheme for IoT using improved Sailfish optimization algorithm
CN116302578A (en) QoS (quality of service) constraint stream application delay ensuring method and system
CN114500561B (en) Power Internet of things network resource allocation decision-making method, system, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant