CN111813538B

CN111813538B - Edge computing resource allocation method

Info

Publication number: CN111813538B
Application number: CN202010460707.6A
Authority: CN
Inventors: 袁新杰; 杜清河
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2020-05-27
Filing date: 2020-05-27
Publication date: 2024-03-29
Anticipated expiration: 2040-05-27
Also published as: CN111813538A

Abstract

The application belongs to the technical field of resource allocation strategies, and particularly relates to an edge computing resource allocation method. With respect to how efficiently edge computing resources are allocated, a mobile edge computing environment can be modeled as a Markov decision process, and the model is highly random and very complex. A method of edge computing resource allocation, the method comprising: 1): defining edge computing model states, actions, and rewards; 2): analyzing and utilizing the state, the action and the rewards defined in the 1) to define the structure of the neural network and the structure of input and output; 3): updating, training and applying the neural network defined in the step 2) according to a given training method. The deep reinforcement learning technology in machine learning is very suitable for being applied to the model due to the strong perceptibility and decision capability of the deep reinforcement learning technology to the environment so as to solve the problem of maximum return.

Description

Edge computing resource allocation method

Technical Field

The application belongs to the technical field of resource allocation strategies, and particularly relates to an edge computing resource allocation method.

Background

With the rise of the fifth generation mobile communication technology (5G), mobile applications and services also put higher demands on time delay, and the conventional cloud computing mode can no longer completely adapt to the low-time-delay requirement, so that the mobile edge computing (mobileedge computing) technology is generated.

In the traditional cloud computing mode, a user uploads a computation-intensive task to a cloud server for processing through a core network, and although the computing resources of the cloud server are sufficient, the computation can be completed in a short time, but the transmission delay is larger due to factors such as limited bandwidth of the core network, network jitter and the like. In order to reduce transmission delay, the mobile edge computing mode configures computing resources beside a network edge, such as a wireless access point of a base station and the like, and a user only needs to offload tasks to an edge computing server for processing, so that larger transmission delay caused when data is transmitted through a core network can be avoided, and meanwhile, the advantages of saving the bandwidth of the core network, facilitating an operator to provide personalized service strategies according to different places, further guaranteeing privacy safety and the like are brought. However, since computing resources at the network edge are very limited relative to the cloud server, how to effectively allocate and utilize the edge computing resources becomes one of the keys of the mobile edge computing technology.

With respect to how efficiently edge computing resources are allocated, a mobile edge computing environment can be modeled as a Markov decision process, and the model is highly random and very complex.

Disclosure of Invention

1. Technical problem to be solved

Based on how to allocate edge computing resources effectively, the mobile edge computing environment can be modeled as a markov decision process, and the model is highly random and very complex, and the application provides an edge computing resource allocation method facing 5G communication requirements.

2. Technical proposal

To achieve the above object, the present application provides a method for allocating edge computing resources, the method including:

1): defining edge computing model states, actions, and rewards;

2): defining the structure of a neural network and the structure of input and output;

3): the neural network is updated, trained and applied according to a given training method.

Another embodiment provided herein is: the edge computation model state, action, and prize definition process of 1) is as follows:

in the kth frame, the environment observed by the agent is x ^(k) ＝[d ^(k) ,w ^(k) ,q ^(k) ,η ^(k) ]And define x ^(k) Is an observation of the environment at the kth frame, and is not a state, where each element has the meaning:

a vector formed by the size of the task data at the head of the cache queue;

a vector consisting of the waiting time of the task at the head of the cache queue;

a vector consisting of the length of each cache queue;

a vector consisting of the signal-to-noise ratios of the channels.

Another embodiment provided herein is: the environment observed by the agent comprises the current state of the previous W frame and the current frame.

Another embodiment provided herein is: the policy adopted by the agent is recorded as pi, then a _k ＝π(s _k ) The method comprises the steps of carrying out a first treatment on the surface of the In the edge computing model, the edge computing server has C CPU cores, N users and terminals, and the total of the N users and terminals is obtained by using a partition method in permutation and combinationDifferent allocation schemes, thus action a _k There may beAnd (5) seed selection.

Another embodiment provided herein is: the rewards and penalties are set to the maximum tolerable delay d _r ＝αT _f And maximum tolerable error probability ε _max 。

Another embodiment provided herein is: the rewards and penalties include three things:

1) Task latency<d _r Probability of error<ε _max The method comprises the steps of carrying out a first treatment on the surface of the At this time, the task is successfully completed, and a prize R= +1 is obtained;

2) Task latency<d _r Probability of error>ε _max The method comprises the steps of carrying out a first treatment on the surface of the At this time, the task is obtainedProcessing, wherein the error probability is too high, and the prize R= -1 is obtained, namely punishment is carried out;

3) Task latency>d _r At this time, the task waits too long in the buffer queue and is not processed, and obtains the prize r= -1.5, namely, is punished.

Another embodiment provided herein is: the neural network in the step 2) is a multi-layer fully-connected neural network, each output node corresponds to an output scalar, the ReLU function is selected except for the activation function outside the output layer, and the activation function is not used by the output layer. The neural network parameter of the intelligent agent is theta _k In this case, the action state value function is recorded asAt this time, the strategy of the agent is pi (s _k ；θ _k ) The method comprises the steps of carrying out a first treatment on the surface of the For simplicity of description, Q is used when ambiguity is not caused ^k (s _k ,a _k ) Representation->The Q function is estimated by a neural network, and the network is called Q-network.

Another embodiment provided herein is: the training method in 3) comprises an experience replay method, wherein samples of the interaction of the intelligent agent and the environment are stored in a memory bank, and a batch of samples are randomly selected from the samples in the training process to update and iterate the network. The empirical replay method and the fixed Q target network method are skills in algorithm 1, and are included in algorithm 1.

Another embodiment provided herein is: the training method adopts a fixed Q-network method when updating, and the training process needs two Q-networks: q-policy network Q _policy And Q-target network Q _target The parameters are respectivelyAnd theta ^target The method comprises the steps of carrying out a first treatment on the surface of the At the time of iteration, Q _target The frequency of variation of (2) is far lower than Q _policy 。

3. Advantageous effects

Compared with the prior art, the edge computing resource allocation method provided by the application has the beneficial effects that:

the edge computing resource allocation method provided by the application is very suitable for being applied to the model due to the strong perceptibility and decision capability of the edge computing resource allocation method to the environment in the machine learning so as to solve the problem of maximum return.

According to the edge computing resource allocation method, in the scene that a single edge computing node, multiple users and multiple terminals are used for network coding with limited block length, the edge computing resource allocation method based on deep reinforcement learning is provided.

According to the edge computing resource allocation method, the computing resources, particularly the CPU cores, in the edge computing server are reasonably allocated, so that the processing success rate of tasks unloaded to the edge computing server by users is improved.

According to the edge computing resource allocation method, states, actions, rewards, neural network structures, neural network input and output structures, training and application methods and the like can be effectively perceived and decided on an edge computing environment.

According to the edge computing resource allocation method, after the states are decoupled into the states of the queue buffers, the specially designed neural network is input, the action state cost functions of all possible allocation schemes are output, and then the allocation scheme with the highest action state cost function is selected to allocate the CPU core, so that the probability of task success is improved. The neural network architecture shown in fig. 1 is specifically designed and includes the manner of input. The input layer of a neural network, for example, is divided into several independent blocks of the neural network, which are specifically designed for each buffer state.

Drawings

FIG. 1 is a schematic diagram of the DQN network architecture of the present application;

fig. 2 is a schematic diagram of simulation results of the present application.

Detailed Description

Hereinafter, specific embodiments of the present application will be described in detail with reference to the accompanying drawings, and according to these detailed descriptions, those skilled in the art can clearly understand the present application and can practice the present application. Features from various embodiments may be combined to obtain new implementations or to replace certain features from certain embodiments to obtain other preferred implementations without departing from the principles of the present application.

1-2, the present application provides a method for allocating edge computing resources, the method comprising:

1): defining edge computing model states, actions, and rewards;

2): analyzing and utilizing the state, the action and the rewards defined in the 1) to define the structure of the neural network and the structure of input and output;

3): updating, training and applying the neural network defined in the step 2) according to a given training method.

Further, the edge computation model state, action, and prize definition process in 1) is as follows:

a vector formed by the size of the task data at the head of the cache queue;

a vector consisting of the length of each cache queue;

a vector consisting of the signal-to-noise ratios of the channels.

Further, the environment observed by the agent comprises the current state of the previous W frame and the current frame, namely s _k ＝[x ^(k-W) ,x ^(k-W+1) ,...,x ^(k) ]. In view of the correlation of the channel, each decision is not only related to the currently observed environment, but also to the observations of several frames before,

further, if the policy adopted by the agent is recorded as pi, then a _k ＝π(s _k ) The method comprises the steps of carrying out a first treatment on the surface of the In the edge computing model, the edge computing server has C CPU cores, N users and terminals, and the total of the N users and terminals is obtained by using a partition method in permutation and combinationDifferent allocation schemes, thus action a _k There may be->And (5) seed selection. Each action a _k Is an N-dimensional vector, each action in the vector being the number of CPU cores allocated to the corresponding task.

Further, the rewards and penalties are set to a maximum tolerable delay d _r ＝αT _f And maximum tolerable error probability ε _max 。

Further, the rewards and penalties include three scenarios:

2) Task latency<d _r Probability of error>ε _max The method comprises the steps of carrying out a first treatment on the surface of the At this time, although the task is processed, the error probability is too high, and the obtained prize R= -1 is punished;

3) Task latency>d _r At this time, the task waits too long in the buffer queue, is not processed, and obtains a prize r=-1.5, i.e. penalized.

Further, the neural network in the 2) is a multi-layer fully connected neural network, each output node corresponds to an output scalar, the ReLU function is selected except for the activation function of the output layer, and the activation function is not used by the output layer.

Further, the neural network parameter of the agent is θ _k In this case, the action state value function is recorded asAt this time, the strategy of the agent is pi (s _k ；θ _k ) The method comprises the steps of carrying out a first treatment on the surface of the For simplicity of description, Q is used when ambiguity is not caused ^k (s _k ,a _k ) Representation->The Q function is estimated by a neural network, and the network is called Q-network.

Further, the updating method in 3) includes an empirical replay method, wherein samples of the interaction between the agent and the environment are stored in a memory bank, and a batch of samples are randomly selected from the samples in the training process to update and iterate the network.

Further, the update method adopts a fixed Q-target method when updating, and the training process requires two Q-networks: q-policy network Q _policy And Q-target network Q _target The parameters are respectivelyAnd theta ^target The method comprises the steps of carrying out a first treatment on the surface of the At the time of iteration, Q _target The frequency of variation of (2) is far lower than Q _policy 。

2) In the above, the structure of the neural network and the input/output structure are as follows

In this algorithm, we use the deep neural network structure shown in figure 1. In the figure, the left side is input, the right side is output, the left side two-layer cube represents a multi-layer fully connected neural network (DNN), the rightmost small cube represents output nodes, each node corresponds to an output scalar, and functions are activated except the output layersThe ReLU function is selected and the output layer does not use an activation function. After normalization of the state vectors, the state vectors are recombined in the input structure shown in the figure, i.e. each group of inputs isA state corresponding to buffer n; the network will firstly read the state of each buffer, then the read characteristics are collected and then further analyzed by the next part of neural network, and finally the output is +.>The action state value of the |a| different actions are expressed respectively.

3) The training method and the application process are as follows:

in the course of the neural network update we introduced a method that we introduced a fixed Q-target and empirical replay.

With the fixed Q-target approach, the training process requires two Q-networks: q-policy network Q _policy And Q-target network Q _target The parameters are respectivelyAnd theta ^target . So that the network to be updated and the calculated target value y can be updated _i Network separation of (c), calculate y _i When using the Q-target network, the updated network is the Q-policy network. At the time of iteration, Q _target The frequency of variation of (2) is far lower than Q _policy Specifically, every iteration M turns, let +.>To update one time Q _target . The loss function may be defined as:

wherein the method comprises the steps of

And (3) storing samples interacted by the intelligent agent and the environment into a memory bank by adopting an experience replay method, and randomly selecting a batch of samples from the samples in the training process to update and iterate the network. Let us note e _k ＝(s _k ,a _k ,R _k ,s _k+1 ) For one Transition, all that is stored in memory bank D is the transitions, i.e., d= { e ₁ ,e ₂ ,...,e _k ,...}。

The specific update algorithm is shown as algorithm 1 (edge computing resource allocation algorithm based on deep reinforcement learning), where a widely used e-greedy exploration strategy is also used.

After the neural network training is finished, when in application, at each frame of the edge computing environment, making a decision by using the neural network, wherein the input and output methods are as shown in the step 2), and the actions selected during application are as follows

Examples

In this section, we consider a specific edge calculation model as follows:

the edge computing model is a scenario of a mobile edge computing node (MEC node) with C CPU cores, N buffers, and N users, N terminals. In this scenario, the user sends tasks to the edge computing node, which allocates CPU cores for it and processes the tasks, and then sends the processing results to the terminal. Each frame of the model is divided into T _f Each symbol time t _s One frame can be divided into three phases again. Taking the kth frame as an example, in the first stage, user U _n (n.epsilon {1,2,.,. N }) send MEC nodes a size ofIs occupied in time>The length of each symbol, after the task is sent to the MEC node, the node stores the task in a Buffer queue Buffer n; in the second stage, the MEC node distributes the computing resources of the C CPU cores to users, so that the users of the computing resources can process the first task positioned at the head of the queue in the corresponding buffer, and the processing time length occupies +.>A number of symbols; in the third stage, after the task has been processed, the MEC node sends the calculation result to the terminal T corresponding to the user _n Occupy->And a symbol.

In the downlink, the channel gain is seen at each frame timeModeling the channel gain as +.>Wherein ρ is _DL,n As the channel correlation coefficient(s),is a complex gaussian random variable subject to zero mean unit variance. Thus, the downlink Signal-to-Noise Ratio (SNR) can be expressed as +.>Wherein->Is the average signal to noise ratio.

Due to T _f 、m _DL,n With limited length, we consider a limited blockDecoding error probability problem in length coding, assuming that the error probability isThe following formula can be obtained:

wherein the method comprises the steps ofTherefore, the decoding error probability is deduced to be

In moving edge computation, the computation result is typically smaller than the uploaded data, so we assume in this model that for a certain task D _DL，n ＝βD _UL，n Wherein β is a positive number less than 1.

In the second stage, the edge computing node is moved to the user U at the kth frame _n DispensingThe number of CPU cores is an integer->Needs to meet->At the same time can obtain the treatment time as

Where L represents how many CPU cycles are needed per bit task, f ₀ Representing the frequency of each CPU core, ceil (x) represents the integer rounded up closest to x. Therefore, in order to ensure the synchronization of frames, inT _f Given the time, the third stage availability time can be calculated as

In deep reinforcement learning, the edge computation model is an environment with which an agent interacts, which can be modeled as a Markov decision process, and reinforcement learning algorithms solve the maximized success rate problem. In this algorithm, the agent changes the environment by deciding on the allocation policy of the computing resources, i.e., the CPU cores, and the environment determines rewards or penalties by task success or failure.

In the kth frame, the environment observed by the agent is x ^(k) ＝[d ^(k) ,w ^(k) ,q ^(k) ,η ^(k) ]And define x ^(k) Observations (not states) of the environment at the kth frame, where each element has the meaning:

a vector formed by the size of the task data at the head of the cache queue;

a vector consisting of the length of each cache queue;

a vector consisting of the signal-to-noise ratios of the channels.

However, considering the correlation of the channel, each decision is not only related to the currently observed environment, but also to the observations of several frames before, so we define to include the previous W frame and the current frameIs the current state, i.e. s _k ＝[x ^(k-W) ,x ^(k ^-W+1) ,...,x ^(k) ]. We mark the policy adopted by the agent as pi, namely a _k ＝π(s _k ). In the edge computing model, the edge computing server has C CPU cores, and the number of users and terminals is N, and the total number can be obtained by using a partition method in permutation and combinationDifferent allocation schemes, thus action a _k There may be->And (5) seed selection. Each action a _k Is an N-dimensional vector, each action in the vector being the number of CPU cores allocated to the corresponding task.

For rewards and penalties we set the maximum tolerable delay d _r ＝αT _f And maximum tolerable error probability ε _max . We consider three cases:

1) Task latency<d _r Probability of error<ε _max . At this time, the task is successfully completed, and a prize r= +1 is obtained.

2) Task latency<d _r Probability of error>ε _max . At this time, although the task is processed, the error probability is too high, and the prize r= -1 is obtained, namely, punishment is performed.

3) Task latency>d _r . At this time, the task waits too long in the buffer queue and is not processed, and obtains the prize r= -1.5, namely, is punished.

In calculating the rewards, we use the gamma discount rewards, i.e. the rewards are scored as being at the kth frame

G _k ＝R _k +γR _k+1 +γ ² R _k+2 +...＝R _k +γG _k+1 .\*MERGEFORMAT(6)

In this algorithm, we use the deep neural network structure shown in figure 1. In the figure, the left side is input, the right side is output, and the left two-layer cube represents multiple layersThe full-connected neural network (DNN) of the layer, the right-most small cube represents output nodes, each node corresponds to an output scalar, the ReLU function is selected except for the activation function of the output layer, and the activation function is not used by the output layer. After normalization of the state vectors, the state vectors are recombined in the input structure shown in the figure, i.e. each group of inputs isA state corresponding to buffer n; the network will firstly read the state of each buffer, then the read characteristics are collected and then further analyzed by the next part of neural network, and finally the output is +.>The action state value of the |a| different actions are expressed respectively.

In the course of the neural network update we introduced a method that we introduced a fixed Q-target and empirical replay. With the fixed Q-network approach, the training process requires two Q-networks: q-policy network Q _policy And Q-target network Q _target The parameters are respectivelyAnd theta ^target . So that the network to be updated and the calculated target value y can be updated _i Network separation of (c), calculate y _i When using the Q-target network, the updated network is the Q-policy network. At the time of iteration, Q _target The frequency of variation of (2) is far lower than Q _policy Specifically, every iteration M turns, let +.>To update one time Q _target . The loss function may be defined as:

wherein the method comprises the steps of

The specific updating algorithm is shown as algorithm 1, wherein a widely used epsilon-greedy exploration strategy is also used.

In fig. 2, the abscissa represents the training process set number, and the ordinate represents the task success rate. The two horizontal lines represent the task success rates at the time of average allocation and random allocation, respectively. The trade-off represents the success rate variation of the present invention during training. It can be seen that after training 40-60 rounds, the task success rate of the method of the present invention is higher than the baseline scheme, at which time the neural network can stop training and apply.

Although the present application has been described with reference to particular embodiments, those skilled in the art will appreciate that many modifications are possible in the principles and scope of the disclosure. The scope of the application is to be determined by the appended claims, and it is intended that the claims cover all modifications that are within the literal meaning or range of equivalents of the technical features of the claims.

Claims

1. An edge computing resource allocation method, characterized in that: the method comprises the following steps:

1): defining edge computing model states, actions, and rewards;

3): updating and training the neural network defined in the step 2) according to a given training method and applying the neural network; the policy adopted by the agent is recorded as pi, then a _k ＝π(s _k ) The method comprises the steps of carrying out a first treatment on the surface of the In the edge computing model, the edge computing server has C CPU cores, N users and terminals, and the total of the N users and terminals is obtained by using a partition method in permutation and combinationDifferent allocation schemes, thus action a _k There may be->Seed selection; the edge computation model state, action, and prize definition process of 1) is as follows:

a vector formed by the size of the task data at the head of the cache queue;

a vector consisting of the length of each cache queue;

a vector formed by the signal-to-noise ratio of each channel;

the rewards and penalties include three things:

3) Task latency>d _r At this time, the task waits too long in the buffer queue and is not processed, and obtains the reward R= -1.5, namely, is punished; the neural network parameter of the intelligent agent is theta _k In this case, the action state value function is recorded asAt this time, the strategy of the agent is pi (s _k ；θ _k ) The method comprises the steps of carrying out a first treatment on the surface of the For simplicity of description, Q is used when ambiguity is not caused ^k (s _k ,a _k ) Representation->Estimating the Q function by a neural network, and the network is called Q-network; the updating method in the 3) comprises an experience replay method, wherein samples of the interaction of the intelligent agent and the environment are stored in a memory bank, and a batch of samples are randomly selected from the samples in the training process to update and iterate the network; the updating method adopts a fixed Q-network method when updating, and the training process needs two Q-networks: q-policy network Q _policy And Q-target network Q _target The parameters are->And theta ^target The method comprises the steps of carrying out a first treatment on the surface of the At the time of iteration, Q _target Frequency of variation of (2)Far below Q _policy 。

2. The edge computing resource allocation method of claim 1, wherein: the environment observed by the agent comprises the current state of the previous W frame and the current frame.

3. The edge computing resource allocation method of claim 1, wherein: the rewards and penalties are set to the maximum tolerable delay d _r ＝αT _f And maximum tolerable error probability ε _max 。

4. The edge computing resource allocation method of claim 1, wherein: the neural network in the step 2) is a multi-layer fully-connected neural network, each output node corresponds to an output scalar, the ReLU function is selected except for the activation function outside the output layer, and the activation function is not used by the output layer.