CN114090239A - Model-based reinforcement learning edge resource scheduling method and device - Google Patents

Model-based reinforcement learning edge resource scheduling method and device

Info

Publication number
CN114090239A
CN114090239A
Authority
CN
China
Prior art keywords
edge
reinforcement learning
model
resource scheduling
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111285553.2A
Other languages
Chinese (zh)
Other versions
CN114090239B (en)
Inventor
缪巍巍
曾锃
张明轩
张震
张瑞
滕昌志
李世豪
毕思博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Information and Telecommunication Branch of State Grid Jiangsu Electric Power Co Ltd
Original Assignee
Information and Telecommunication Branch of State Grid Jiangsu Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Information and Telecommunication Branch of State Grid Jiangsu Electric Power Co Ltd filed Critical Information and Telecommunication Branch of State Grid Jiangsu Electric Power Co Ltd
Priority to CN202111285553.2A priority Critical patent/CN114090239B/en
Publication of CN114090239A publication Critical patent/CN114090239A/en
Application granted granted Critical
Publication of CN114090239B publication Critical patent/CN114090239B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/505Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5055Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering software capabilities, i.e. software resources associated or available to the machine
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a model-based reinforcement learning edge resource scheduling method and device. Historical data of the load information, resource information and user request information of edge nodes are collected through an edge server, and an edge environment model is constructed from this historical data through supervised learning; reinforcement learning edge node resource scheduling is then realized on the basis of the edge environment model, and each user request is distributed to a suitable edge node. The method and device handle dynamic resource load requests in edge computing resource scheduling scenarios, with higher sample utilization and stronger practicability.

Description

Model-based reinforcement learning edge resource scheduling method and device
Technical Field
The invention relates to a model-based reinforcement learning edge resource scheduling method and device, and belongs to the technical field of the Internet of Things.
Background
Because the load of an edge node changes dynamically, it must be scheduled reasonably by an algorithm: user task requests are distributed to different edge nodes so as to guarantee optimal service while keeping the load balanced.
The prior art generally performs resource scheduling in one of the following ways:
1. Manual rules, for example allocating requests with low resource demand to edge nodes that are already busy and requests with high resource demand to idle nodes;
2. Combinatorial optimization, which treats the allocation as an approximate bin-packing problem, solves for a near-optimal assignment, and distributes each request to the corresponding edge node;
3. Heuristic algorithms, for example simulated annealing, used to derive a load request distribution;
4. Reinforcement-learning-based load request distribution algorithms.
The manual-rule approach depends on experienced personnel, requires maintaining a very complex rule system, and often performs poorly. Combinatorial optimization can only handle static resource requests and cannot be applied to dynamic request scenarios. Heuristic methods often fail to reach the global optimum. Although general reinforcement-learning-based resource scheduling can handle dynamic requests, it must explore by trial and error in the real edge computing environment, which causes performance loss and reduced user satisfaction.
Disclosure of Invention
The purpose is as follows: to overcome the defects of the prior art, the invention provides a model-based reinforcement learning edge resource scheduling method and device that achieve very high sample efficiency, can allocate resources in edge computing scenarios, and are better suited to deployment in real edge computing environments.
The technical scheme is as follows: to solve the above technical problems, the invention adopts the following technical scheme:
In a first aspect, a model-based reinforcement learning edge resource scheduling method includes the following steps:
collecting, through an edge server, historical data of the load information, resource information and user request information of edge nodes, and constructing an edge environment model through supervised learning according to the historical data;
realizing reinforcement learning edge node resource scheduling based on the edge environment model, and distributing each user request to a suitable edge node.
In a second aspect, an apparatus for scheduling edge resources based on model-based reinforcement learning includes the following modules:
the edge environment model building module: used for collecting, through the edge server, historical data of the load information, resource information and user request information of the edge nodes, and building an edge environment model through supervised learning according to the historical data;
a reinforcement learning module: used for realizing reinforcement learning edge node resource scheduling based on the edge environment model and distributing each user request to a suitable edge node.
Preferably, the method for constructing the edge environment model through supervised learning according to the historical data comprises the following steps:
based on the collected historical data, through the supervised learning of the deep neural network, the input of the edge environment model is the current state and the current action as an input vector X, and the current state comprises the following steps: the method comprises the steps of obtaining resource information of edge nodes, load information of the edge nodes and user request data; the current actions include: allocation is requested for each user. The output of the edge environment model is the state at the next moment as an output vector y, and the state at the next moment comprises: resource information of the edge node, load information of the edge node, and user request data.
The dimension of the deep neural network input is the second dimension of the input vector X, and in the deep neural network, network output is performed through a full connection layer by taking a plurality of full connection layers, a ReLU activation layer and a batch normalization layer as intermediate network layers.
The deep neural network updates parameters of the deep neural network through a gradient descent and back propagation method according to a loss function.
Preferably, the resource information of the edge node includes the number of CPU cores, the total amount of memory, the total amount of bandwidth and the number of servers of the edge node. The load information of the edge node includes yesterday's historical load, the historical average load of the last week, the historical average load of the last month and the historical average load of the last year. The user request information includes the amount of resources requested by each user and the response time of the user request.
As a preferred scheme, the method for implementing reinforcement learning edge node resource scheduling based on the edge environment model and allocating the user's request to a suitable edge node includes the following steps:
for reinforcement learning, the elements of the Markov decision process are defined as follows:
state s: the resource information of the edge nodes, the load information of the edge nodes and the user request data;
action a: the assignment of the user's request to an edge node;
reward r: a weighted sum of user satisfaction and load balancing.
A state-action value function Q(s, a) = E[r | s_0 = s, a_0 = a] is constructed to obtain the cumulative reward; a policy function μ(o), by which the edge node allocates resources, outputs different actions with different probabilities; and the resource allocation scheme for each user request that maximizes the cumulative reward is output according to the cumulative reward and the actions. Here s is the initial state, a is the initial action, and o is the state observed by the edge node.
As a preferred scheme, the state-action value function and the policy function for the edge node's resource allocation are modeled by multilayer neural networks; the neural network built for the state-action value function updates its parameters by minimizing the temporal-difference error, and the neural network built for the edge node's resource allocation policy function updates its parameters according to the policy gradient theorem, yielding the updated neural networks.
Preferably, the state-action value function is updated according to the following formula:
Q ← (1 − w)Q_g + wQ
where Q_g is the global state-action value function and w is a weight.
Preferably, the satisfaction is a linear function of the response time (the longer the response time, the lower the satisfaction); the load balancing term is the minimum load among the plurality of edge nodes; and the weights of the weighted sum are set according to the preference of the edge node administrator.
Beneficial effects: the model-based reinforcement learning edge resource scheduling method and device provided by the invention handle dynamic resource load requests in edge computing resource scheduling scenarios, with higher sample utilization and stronger practicability.
Drawings
FIG. 1 is a schematic diagram of a system architecture for edge computing resource allocation.
FIG. 2 is a schematic flow diagram of the method of the present invention.
Fig. 3 is a schematic diagram of a model for resource allocation by multi-edge node cooperation.
Detailed Description
The present invention will be further described with reference to the following examples.
The invention discloses a model-based reinforcement learning edge resource scheduling system, which schedules resources for dynamic user load requests and distributes the user requests to different edge nodes, thereby maximizing user satisfaction while balancing the load among the edge nodes.
As shown in fig. 1, the system consists of several unmanned aerial vehicle terminal devices, a base station and an edge device cluster. When edge computing resources are allocated, the terminal devices send load tasks to the edge device cluster through the base station, and the edge devices determine how many resources (CPU, memory) to allocate to each task according to the load and resource requirements of the different tasks.
As shown in fig. 2, a model-based edge resource scheduling method for reinforcement learning includes the following steps:
and collecting load information, resource information and historical data of user request information of the edge nodes through the edge server, and constructing an edge environment model through supervised learning.
And realizing reinforcement learning edge node resource scheduling based on the edge environment model, and distributing the user request to a proper edge node.
The specific method comprises the following steps:
the construction method of the edge environment model comprises the following steps:
step 1: collecting historical data of edge nodes, specifically including the following categories:
the resource information of the edge node includes: the number of CPU cores of the edge nodes, the total amount of memory, the total amount of bandwidth and the number of servers of the edge nodes.
The load information of the edge node includes: yesterday's historical load, the historical average load of the last week, the historical average load of the last month and the historical average load of the last year.
The user request information includes: the amount of resources requested by each user, the response time of the user request.
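As an illustration only (not part of the original disclosure), the collected historical records could be assembled into supervised training pairs for the environment model roughly as sketched below; the record fields and their layout are assumptions made for the sketch.

```python
import numpy as np

def build_training_pairs(records):
    """Assemble (X, y) pairs for the edge environment model from historical records.

    Each record is assumed to be a dict holding:
      'resources' - static node resources (CPU cores, memory, bandwidth, server count)
      'load'      - historical load features (yesterday / last week / last month / last year)
      'requests'  - per-user requested resource amounts and response times
      'action'    - the allocation decided for each user request
    Field names and shapes are illustrative assumptions, not the patent's exact format.
    """
    X, y = [], []
    for t in range(len(records) - 1):
        cur, nxt = records[t], records[t + 1]
        state = np.concatenate([cur['resources'], cur['load'], cur['requests']])
        next_state = np.concatenate([nxt['resources'], nxt['load'], nxt['requests']])
        X.append(np.concatenate([state, cur['action']]))  # input: current state + current action
        y.append(next_state)                              # target: state at the next moment
    return np.stack(X), np.stack(y)
```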
Step 2: construct the edge environment model through a supervised learning algorithm.
Based on the collected historical data, the edge environment model is built through the supervised learning of a deep neural network. The input of the edge environment model is the current state and the current action, forming an input vector X; the current state comprises the resource information of the edge nodes, the load information of the edge nodes and the user request data, and the current action comprises the allocation of each user request. The output of the edge environment model is the state at the next moment, forming an output vector y; the state at the next moment comprises the resource information of the edge nodes, the load information of the edge nodes and the user request data.
The input dimension of the deep neural network equals the second dimension of the input vector X; inside the network, several fully connected layers, ReLU activation layers and batch normalization layers serve as intermediate layers, and the network output is produced through a final fully connected layer. At the output, the prediction is compared with the true state y and the following loss function is calculated:
L(θ) = Σ_i ‖ f(X_i) − y_i ‖²
wherein: f () represents the output of the deep neural network, yiFor the real state, the network parameters can then be updated by gradient descent and back propagation methods, according to the loss function.
In addition, since the resource information of the edge node is static data, prediction is not required. For the load information of the edge node, when the user request information is known, the part can also be directly determined, so that the output part only needs to include the user request data at the next moment.
In the method for realizing reinforcement learning edge node resource scheduling based on the edge environment model, reinforcement learning explores and makes trial-and-error decisions inside the edge environment model and thereby finds an optimal resource scheduling policy. The method comprises the following steps:
step 1: for reinforcement learning, the elements in the markov decision process are respectively defined as follows:
and a state s: resource information of the edge node, load information of the edge node, and user request data.
Action a: the user's request is distributed to the edge nodes.
Reward r: a weighted sum of user satisfaction and load balancing. The satisfaction degree comprises: a linear function of the response time, the longer the response time, the lower the satisfaction; the load balancing comprises: a minimum load among the plurality of edge nodes; the weight of the weighted sum is set according to the preference of the edge node administrator.
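To make the reward concrete, a small sketch of this weighted sum is given below; the linear slope of the satisfaction term and the administrator-chosen weight are assumptions.

```python
def reward(response_times, node_loads, alpha=0.7, slope=0.01):
    """Reward r = weighted sum of user satisfaction and load balancing.

    Satisfaction is a linear function of response time (longer response -> lower
    satisfaction); the load-balancing term is the minimum load among the edge
    nodes; alpha is the administrator-set weight and slope is an assumed coefficient.
    """
    satisfaction = sum(1.0 - slope * t for t in response_times) / len(response_times)
    load_balance = min(node_loads)
    return alpha * satisfaction + (1.0 - alpha) * load_balance
```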
Step 2: and outputting a resource allocation scheme requested by each user through a deep reinforcement learning algorithm, thereby achieving the maximization of long-term accumulated benefits.
Define the state-action value function Q(s, a) = E[r | s_0 = s, a_0 = a], i.e., the cumulative reward the policy can obtain when the initial state and the initial action are s and a, respectively.
Define the policy function for the edge node's resource allocation as μ(o), i.e., the probability of adopting each allocation scheme after the edge node observes the state o. The state-action value function and the policy function for the edge node's resource allocation are both modeled by multilayer neural networks, whose parameters are updated as follows.
For the state-action value function, the neural network parameters are updated by minimizing the temporal-difference error:
L(θ) = E[(Q(s, a) − y)²]
where y = r + γ max_{a'} Q(s', a'), s' is the state after action a is executed, a' is the action at the next moment, and γ is the discount factor. The reward r and the next-moment state s' are predicted by the edge environment model, so no interaction with the real environment is needed; this effectively improves the training speed and also improves the stability of the algorithm.
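A sketch of this model-based temporal-difference update is shown below; the critic architecture, the way the reward is derived from the model's predicted state, and the set of candidate next actions are assumptions introduced for illustration.

```python
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Critic Q(s, a): concatenates state and action and outputs a scalar value."""
    def __init__(self, s_dim, a_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(s_dim + a_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1)).squeeze(-1)

def td_update(q_net, env_model, reward_fn, states, actions, candidate_actions, opt, gamma=0.95):
    """Minimize L = E[(Q(s, a) - y)^2] with y = r + gamma * max_a' Q(s', a').

    s' and r are obtained from the learned edge environment model (env_model, reward_fn),
    so the real environment is never touched during training. candidate_actions is an
    assumed list of batched candidate next actions a' over which the max is taken.
    """
    q_sa = q_net(states, actions)
    with torch.no_grad():
        next_states = env_model(torch.cat([states, actions], dim=-1))  # model predicts s'
        rewards = reward_fn(next_states)                               # reward derived from predicted state
        q_next = torch.stack([q_net(next_states, a) for a in candidate_actions]).max(dim=0).values
        y = rewards + gamma * q_next
    loss = ((q_sa - y) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```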
For the policy function of the edge node's resource allocation, the neural network parameters are updated according to the policy gradient theorem:
∇_θ J(θ) = E[∇_θ log μ_θ(a | o) · Q(s, a)]
it should also be noted that in using reinforcement learning, exploration in the environment is required. In the invention, the strategy function is assumed to be a probability function, so that different actions can be output with different probabilities, and in the process of reinforcement learning execution, the variance of the probability function is gradually reduced, so that the finally executed action is output with a more stable value.
As shown in fig. 3, the above method performs unified, global scheduling across a plurality of edge nodes. If the edge nodes can only be scheduled in a distributed manner and each edge node performs only local resource allocation, it is difficult to maximize the global benefit. On the other hand, if the states of the other edge nodes are taken into account, the overall benefit can be improved through cooperation. A state-action value function Q and a policy function μ may be maintained for each edge node; in addition, to improve the overall cooperative efficiency, a global Q_g may be used as the global state-action value function. The value of Q_g is not used directly to output actions, but may be used to update part of the Q value:
Q ← (1 − w)Q_g + wQ
where w is a weight: the smaller w is, the more weight each edge node places on the cooperative global value, and the larger it is, the more the node emphasizes its own reward.
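A minimal sketch of this cooperative value mixing, with each node keeping its own critic and blending in a shared global critic, is shown below; how Q_g is trained and how w is chosen are assumptions.

```python
def mixed_q_value(q_local, q_global, w=0.5):
    """Blend the node's own value with the shared global one: Q <- (1 - w) * Q_g + w * Q.
    Smaller w puts more weight on the shared global value Q_g; larger w on the node's own Q."""
    return (1.0 - w) * q_global + w * q_local

# Usage sketch: each edge node i keeps its own critic q_nets[i] and policy, while a shared
# q_global_net provides the cooperative value mixed into its targets, e.g.
#   blended = mixed_q_value(q_nets[i](s, a), q_global_net(s, a), w=0.3)
```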
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only of the preferred embodiments of the present invention, and it should be noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.

Claims (8)

1. A model-based reinforcement learning edge resource scheduling method, characterized in that it comprises the following steps:
collecting historical data of load information, resource information and user request information of edge nodes through an edge server, and constructing an edge environment model through supervised learning according to the historical data;
and realizing reinforcement learning edge node resource scheduling based on the edge environment model, and distributing the user request to a proper edge node.
2. The method of claim 1, wherein the model-based edge resource scheduling method for reinforcement learning comprises: the method for constructing the edge environment model through supervised learning according to the historical data comprises the following steps:
based on the collected historical data, through the supervised learning of a deep neural network, the input of the edge environment model is the current state and the current action, forming an input vector X, and the current state comprises: the resource information of the edge nodes, the load information of the edge nodes and the user request data; the current action comprises: the allocation of each user request; the output of the edge environment model is the state at the next moment, forming an output vector y, and the state at the next moment comprises: the resource information of the edge nodes, the load information of the edge nodes and the user request data;
the dimension of the deep neural network input is the second dimension of the input vector X, and in the deep neural network, network output is carried out through a full connection layer by taking a plurality of full connection layers, a ReLU activation layer and a batch normalization layer as intermediate network layers;
the deep neural network updates parameters of the deep neural network through a gradient descent and back propagation method according to a loss function.
3. The method of claim 1, wherein the model-based edge resource scheduling method for reinforcement learning comprises: the resource information of the edge node includes: the number of CPU cores, the total amount of memory, the total amount of bandwidth and the number of servers of the edge node; the load information of the edge node includes: yesterday's historical load, the historical average load of the last week, the historical average load of the last month and the historical average load of the last year; the user request information includes: the amount of resources requested by each user and the response time of the user request.
4. The method of claim 1, wherein the model-based edge resource scheduling method for reinforcement learning comprises: the method for realizing the reinforcement learning edge node resource scheduling based on the edge environment model and distributing the user request to the proper edge node comprises the following steps:
for reinforcement learning, the elements of the Markov decision process are defined:
state s: the resource information of the edge nodes, the load information of the edge nodes and the user request data;
action a: distributing the request of the user to an edge node;
reward r: a weighted sum of user satisfaction and load balancing;
by constructing a state-action value function Q(s, a) = E[r | s_0 = s, a_0 = a], a cumulative reward is acquired; different actions output with different probabilities are acquired through a policy function μ(o) by which the edge node allocates resources; and the resource allocation scheme for each user request that maximizes the cumulative reward is output according to the cumulative reward and the actions; where s is the initial state, a is the initial action, and o is the state observed by the edge node.
5. The method of claim 4, wherein the model-based reinforcement learning edge resource scheduling method comprises: the state-action value function and the policy function for the edge node's resource allocation are modeled by multilayer neural networks; the neural network built for the state-action value function updates its parameters by minimizing the temporal-difference error, and the neural network built for the edge node's resource allocation policy function updates its parameters according to the policy gradient theorem, so as to obtain the updated neural networks.
6. The method of claim 4, wherein the model-based reinforcement learning edge resource scheduling method comprises: the state-action value function is updated according to the following formula:
Q ← (1 − w)Q_g + wQ
where Q_g is the global state-action value function and w is a weight.
7. The method of claim 4, wherein the model-based reinforcement learning edge resource scheduling method comprises: the satisfaction is a linear function of the response time, with longer response time giving lower satisfaction; the load balancing term is the minimum load among the plurality of edge nodes; and the weights of the weighted sum are set according to the preference of the edge node administrator.
8. A model-based reinforcement learning edge resource scheduling device, characterized in that it comprises the following modules:
an edge environment model building module: used for collecting, through the edge server, historical data of the load information, resource information and user request information of the edge nodes and building an edge environment model through supervised learning according to the historical data;
a reinforcement learning module: used for realizing reinforcement learning edge node resource scheduling based on the edge environment model and distributing each user request to a suitable edge node.
CN202111285553.2A 2021-11-01 2021-11-01 Method and device for dispatching edge resources based on model reinforcement learning Active CN114090239B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111285553.2A CN114090239B (en) 2021-11-01 2021-11-01 Method and device for dispatching edge resources based on model reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111285553.2A CN114090239B (en) 2021-11-01 2021-11-01 Method and device for dispatching edge resources based on model reinforcement learning

Publications (2)

Publication Number Publication Date
CN114090239A true CN114090239A (en) 2022-02-25
CN114090239B CN114090239B (en) 2024-08-13

Family

ID=80298547

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111285553.2A Active CN114090239B (en) 2021-11-01 2021-11-01 Method and device for dispatching edge resources based on model reinforcement learning

Country Status (1)

Country Link
CN (1) CN114090239B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115022189A (en) * 2022-05-31 2022-09-06 武汉大学 Edge user distribution model construction method, device, equipment and readable storage medium
CN118227369A (en) * 2024-05-22 2024-06-21 苏州元脑智能科技有限公司 Active fault tolerance method, apparatus, device, medium and computer program product

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190384641A1 (en) * 2018-06-15 2019-12-19 EMC IP Holding Company LLC Method, apparatus, and computer program product for processing computing task
CN111506405A (en) * 2020-04-08 2020-08-07 北京交通大学 Edge calculation time slice scheduling method based on deep reinforcement learning
CN112069903A (en) * 2020-08-07 2020-12-11 之江实验室 Method and device for achieving face recognition end side unloading calculation based on deep reinforcement learning
CN113282368A (en) * 2021-05-25 2021-08-20 国网湖北省电力有限公司检修公司 Edge computing resource scheduling method for substation inspection
US20210303481A1 (en) * 2020-03-27 2021-09-30 Intel Corporation Efficient data sharing for graphics data processing operations
CN113467952A (en) * 2021-07-15 2021-10-01 北京邮电大学 Distributed federated learning collaborative computing method and system
CN113495793A (en) * 2020-04-02 2021-10-12 英特尔公司 Method and apparatus for buffer sharing
CN113543156A (en) * 2021-06-24 2021-10-22 中国科学院沈阳自动化研究所 Industrial wireless network resource allocation method based on multi-agent deep reinforcement learning

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190384641A1 (en) * 2018-06-15 2019-12-19 EMC IP Holding Company LLC Method, apparatus, and computer program product for processing computing task
US20210303481A1 (en) * 2020-03-27 2021-09-30 Intel Corporation Efficient data sharing for graphics data processing operations
CN113495793A (en) * 2020-04-02 2021-10-12 英特尔公司 Method and apparatus for buffer sharing
CN111506405A (en) * 2020-04-08 2020-08-07 北京交通大学 Edge calculation time slice scheduling method based on deep reinforcement learning
CN112069903A (en) * 2020-08-07 2020-12-11 之江实验室 Method and device for achieving face recognition end side unloading calculation based on deep reinforcement learning
CN113282368A (en) * 2021-05-25 2021-08-20 国网湖北省电力有限公司检修公司 Edge computing resource scheduling method for substation inspection
CN113543156A (en) * 2021-06-24 2021-10-22 中国科学院沈阳自动化研究所 Industrial wireless network resource allocation method based on multi-agent deep reinforcement learning
CN113467952A (en) * 2021-07-15 2021-10-01 北京邮电大学 Distributed federated learning collaborative computing method and system

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
LEI LEI: ""Multiuser Resource Control With Deep Reinforcement Learning in IoT Edge Computing"", 《IEEE INTERNET OF THINGS JOURNAL》, vol. 6, no. 6, 31 December 2019 (2019-12-31), pages 10119 - 10133, XP011760732, DOI: 10.1109/JIOT.2019.2935543 *
机器学习算法工程师 (Machine Learning Algorithm Engineer): "A Plain-Language Understanding of Reinforcement Learning, Part 1: The Markov Reward Process (MRP)", Retrieved from the Internet <URL:https://cloud.tencent.com/developer/article/1167673> *
缪巍巍: "Resource Allocation Algorithm for Edge IoT Agents Based on Multi-Agent Reinforcement Learning", Electric Power Information and Communication Technology, vol. 19, no. 12, 25 December 2021 (2021-12-25), pages 9-15 *
陆知遥: "Research on the Dynamic Dispatch of Shared Vehicles Based on Multi-Agent Methods", China Master's Theses Full-text Database, Engineering Science and Technology II, no. 2021, 15 September 2021 (2021-09-15), pages 034-64 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115022189A (en) * 2022-05-31 2022-09-06 武汉大学 Edge user distribution model construction method, device, equipment and readable storage medium
CN115022189B (en) * 2022-05-31 2024-03-26 武汉大学 Edge user allocation model construction method, device, equipment and readable storage medium
CN118227369A (en) * 2024-05-22 2024-06-21 苏州元脑智能科技有限公司 Active fault tolerance method, apparatus, device, medium and computer program product
CN118227369B (en) * 2024-05-22 2024-09-13 苏州元脑智能科技有限公司 Active fault tolerance method, apparatus, device, medium and computer program product

Also Published As

Publication number Publication date
CN114090239B (en) 2024-08-13

Similar Documents

Publication Publication Date Title
Prem Jacob et al. A multi-objective optimal task scheduling in cloud environment using cuckoo particle swarm optimization
CN109617826B (en) Storm dynamic load balancing method based on cuckoo search
CN115248728A (en) Distributed training task scheduling method, system and device for intelligent computing
Kaur et al. Deep‐Q learning‐based heterogeneous earliest finish time scheduling algorithm for scientific workflows in cloud
CN114090239B (en) Method and device for dispatching edge resources based on model reinforcement learning
CN112052092B (en) Risk-aware edge computing task allocation method
CN110297699A (en) Dispatching method, scheduler, storage medium and system
CN114610474B (en) Multi-strategy job scheduling method and system under heterogeneous supercomputing environment
CN113342510B (en) Water and power basin emergency command cloud-side computing resource cooperative processing method
CN115134371A (en) Scheduling method, system, equipment and medium containing edge network computing resources
CN115237581A (en) Heterogeneous computing power-oriented multi-strategy intelligent scheduling method and device
CN112732444A (en) Distributed machine learning-oriented data partitioning method
CN113641445B (en) Cloud resource self-adaptive configuration method and system based on depth deterministic strategy
CN111309472A (en) Online virtual resource allocation method based on virtual machine pre-deployment
CN116932198A (en) Resource scheduling method, device, electronic equipment and readable storage medium
CN106407007B (en) Cloud resource configuration optimization method for elastic analysis process
CN118210609A (en) Cloud computing scheduling method and system based on DQN model
CN112632615B (en) Scientific workflow data layout method based on hybrid cloud environment
CN116701001B (en) Target task allocation method and device, electronic equipment and storage medium
CN116500896B (en) Intelligent real-time scheduling model and method for intelligent network-connected automobile domain controller multi-virtual CPU tasks
CN117436627A (en) Task allocation method, device, terminal equipment and medium
CN112698911B (en) Cloud job scheduling method based on deep reinforcement learning
CN115185651A (en) Workflow optimization scheduling algorithm based on cloud computing
CN113238873A (en) Method for optimizing and configuring spacecraft resources
Liu A Programming Model for the Cloud Platform

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant