CN114116156A - Cloud-edge cooperative double-profit equilibrium taboo reinforcement learning resource allocation method - Google Patents

Cloud-edge cooperative double-profit equilibrium taboo reinforcement learning resource allocation method

Info

Publication number
CN114116156A
CN114116156A (application CN202111209997.8A)
Authority
CN
China
Prior art keywords
resource
task
representing
user
computing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111209997.8A
Other languages
Chinese (zh)
Other versions
CN114116156B (en)
Inventor
袁景凌 (Yuan Jingling)
向尧 (Xiang Yao)
罗忆 (Luo Yi)
毛慧华 (Mao Huihua)
李新平 (Li Xinping)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University of Technology WUT
Original Assignee
Wuhan University of Technology WUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University of Technology (WUT)
Priority to CN202111209997.8A
Publication of CN114116156A
Application granted
Publication of CN114116156B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 - Multiprogramming arrangements
    • G06F9/48 - Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 - Task transfer initiation or dispatching
    • G06F9/4843 - Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 - Multiprogramming arrangements
    • G06F9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011 - Allocation of resources to service a request, the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016 - Allocation of resources to service a request, the resource being the memory
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 - Multiprogramming arrangements
    • G06F9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 - Allocation of resources to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 - Multiprogramming arrangements
    • G06F9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061 - Partitioning or combining of resources
    • G06F9/5072 - Grid computing
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00 - Indexing scheme relating to G06F9/00
    • G06F2209/50 - Indexing scheme relating to G06F9/50
    • G06F2209/502 - Proximity
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a cloud-edge cooperative double-profit equilibrium taboo reinforcement learning resource allocation method, which comprises the following steps: 1) establishing a resource allocation framework in a cloud-edge environment; 2) determining a user benefit optimization objective function, a service provider benefit optimization objective function and a bilateral benefit balancing objective function; 3) constructing the three elements of reinforcement learning in the resource allocator; 4) selecting a computing node a_i; 5) updating the state according to the selected action a_i to obtain the new state s_{t+1}; 6) simulating the action a'_i according to the new state s_{t+1}; 7) calculating the target value y_i; 8) calculating the Actor-Critic network parameter η^Q; 9) updating the Actor-Critic network parameters; 10) repeating steps 3) to 9) until the Actor-Critic network converges, thereby obtaining the optimal solution of the bilateral benefit balancing objective function. The invention takes the average completion time of user tasks as the user benefit index and the average resource utilization rate of the service provider as the service provider benefit index, and adaptively makes the optimal resource allocation decision for real-time, dynamic user tasks through a taboo (tabu) reinforcement learning method.

Description

Cloud-edge cooperative double-profit equilibrium taboo reinforcement learning resource allocation method
Technical Field
The invention relates to a system resource allocation method in the fields of cloud computing and edge computing, and in particular to a cloud-edge cooperative double-profit equilibrium taboo reinforcement learning resource allocation method.
Background Art
Cloud-edge collaboration is a new Internet-of-Things computing paradigm in which large-scale, complex computing tasks are executed through computation/data migration and resource cooperation between the remote cloud and edge clouds, and through cooperation among computing nodes; it has gradually become a focus and frontier of attention in academia and industry at home and abroad. In the traditional cloud computing and edge computing modes, the user is only the final "consumer" of data, for example watching an online video on a mobile phone. In contrast, the cloud-edge collaborative mode is an interconnected system composed of heterogeneous computing nodes of multiple types, forming an integrated collaborative computing system that provides intelligent services to users nearby. The user plays the dual roles of data consumer and data producer, for example sharing videos through WeChat, Douyin (TikTok) and the like. Users care about how much benefit they can obtain from completing their task requests, how much they must pay the service provider to complete these requests, the user experience, and so on. If users obtain poor benefits when using the cloud-edge collaborative computing mode, they will refuse to use the cloud-edge collaborative service and choose to complete their job tasks locally instead. Conversely, if the interests of a large number of users can be optimized, users will be more willing to use the cloud-edge collaborative computing mode, which will also attract more potential users in the market to adopt cloud-edge collaboration.
In fact, the interests of the user are closely related to the interests of the service provider. As mentioned above, cloud-edge collaboration is a new application paradigm that includes software, platform and infrastructure services shared by users. For the service provider, revenue is derived from the fees charged to users for providing services (the user as consumer) and the fees charged for the use of data shared by users (the user as producer). Increased revenue allows the provider to improve the quality of service, attract more users, and eventually form a virtuous circle. Therefore, how to increase the interests of the service provider while optimizing the interests of the user must also be considered. Accordingly, reasonably allocating resources so as to balance the interests of users and service providers in the cloud-edge collaborative environment is of great significance.
In much of the existing research, the resource allocation problem has been identified as a multi-constraint, multi-objective NP-hard optimization problem. Existing resource allocation solutions are oriented only to a single cloud computing or edge computing environment and lack generality, so they are difficult to apply directly to the complex cloud-edge collaborative environment. In addition, most of these schemes maximize the benefit of a single party and do not consider the benefits of the user and the service provider jointly. Therefore, it is necessary to provide a resource allocation method that balances the interests of users and service providers to solve the above problems.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a cloud-edge cooperative double-profit equilibrium taboo reinforcement learning resource allocation method, which takes the average completion time of user tasks as the user benefit index and the average resource utilization rate of the service provider as the service provider benefit index, and adaptively makes the optimal resource allocation decision for real-time, dynamic user tasks through a taboo (tabu) reinforcement learning method.
In order to achieve the above object, the invention provides a cloud-edge cooperative double-profit equilibrium taboo reinforcement learning resource allocation method, characterized by comprising the following steps:
1) establishing a resource allocation framework in a cloud edge environment, comprising: a user resource demand model, a computing node resource state model and a resource distributor;
2) determining a user benefit optimization objective function, a service provider benefit optimization objective function and a bilateral benefit balance objective function;
3) constructing the three elements of reinforcement learning in the resource allocator: state space, action space and reward function;
4) the resource allocator sends the state space to the Actor network, which selects a set of computing nodes a_i from the action space according to the policy as the action vector to which user tasks are assigned:
a_i = μ(s_t, η^μ) + Ψ
wherein s_t represents the state of the cloud-edge system at time t; μ represents the policy simulated by a convolutional neural network; Ψ is random noise; η^μ is an Actor-Critic network parameter;
5) the state space is updated according to the action a_i selected in step 4) to obtain the new state s_{t+1}; the resource allocator assigns the users' tasks to the nodes a_i in sequence and calculates the reward value r_t over the time period t; if the obtained reward value is negative, the selected action vector is stored in the tabu (taboo) list; if it is positive, the selected action vector is stored in the experience replay pool;
6) the action a'_i is simulated according to the new state s_{t+1}:
a'_i = μ'(s_{t+1}, η^{μ'}) + Ψ
wherein μ' represents the policy simulated by a convolutional neural network; Ψ is random noise; η^{μ'} is an Actor-Critic network parameter;
7) the resource allocator calculates the target value y_i:
y_i = Reward + γ*Q^{μ'}(s_{t+1}, a'_i, η^{Q'})
wherein Reward represents the reward function, γ is the decay factor, Q^{μ'} denotes the Q evaluation value of the policy adopted in state s_{t+1}, η^{Q'} is the target policy network parameter in the Critic network, and η^{μ'} is the target policy network parameter in the Actor network;
8) the Actor-Critic network parameter η^Q is calculated using a minimum mean square error loss function:
L(η^Q) = (1/X) * Σ_{i=1}^{X} (y_i - Q^μ(s_t, a_i, η^Q))^2
wherein X represents the number of experiences in the experience replay pool, and Q^μ denotes the Q value obtained by taking action a_i in state s_t and always following the policy μ thereafter;
9) measuring the policy μ by a Monte Carlo method and updating the Actor-Critic network parameters;
10) repeating steps 3) to 9) until the Actor-Critic network converges, thereby obtaining the optimal solution of the bilateral benefit balancing objective function.
Preferably, in step 1), at each scheduling time t:
each computing node transmits its own state to the resource allocator, the state specifically including: CPU resource margin, memory resource margin and storage resource margin;
each user transmits his or her own computing task requirements to the resource allocator by means of a terminal device, the requirements specifically including: the user's position, the size of the task, the CPU resource requirement, the memory resource requirement and the storage resource requirement.
Preferably, the resource allocator in step 1) stores the user requirements and the states of the computing nodes in the form of matrices:
U^t = [ s^t_{1,cpu}  s^t_{1,mem}  s^t_{1,storage} ; … ; s^t_{k,cpu}  s^t_{k,mem}  s^t_{k,storage} ]
C^t = [ c^t_{1,cpu}  c^t_{1,mem}  c^t_{1,storage} ; … ; c^t_{m,cpu}  c^t_{m,mem}  c^t_{m,storage} ]
wherein U^t represents the user demand matrix at time t; k represents the total number of users at time t; s^t_{k,cpu} represents the kth user's demand for CPU resources; s^t_{k,mem} represents the kth user's demand for memory resources; s^t_{k,storage} represents the kth user's demand for storage resources; C^t represents the state matrix of the computing nodes at time t; m represents the total number of computing nodes; c^t_{m,cpu} represents the CPU resource margin of the mth computing node; c^t_{m,mem} represents the memory resource margin of the mth computing node; c^t_{m,storage} represents the storage resource margin of the mth computing node.
Preferably, the user benefit optimization objective function in step 2) is composed of the average task execution time of all users:
ART = (1/k) * Σ_{i=1}^{k} art_i
wherein art_i represents the task execution time of user i; ART represents the average task execution time of all users; k denotes the number of users; art_i comprises the delay of transmitting the task to the computing node, the delay of waiting to execute in the computing node, and the computing time of the task:
art_i = art_delay + art_wait + art_computing
wherein art_delay denotes the delay of transmitting the task to the computing node, art_wait denotes the delay of the task waiting to execute in the computing node, and art_computing denotes the time for which the task is computed in the computing node.
Preferably, the service provider benefit optimization objective function in step 2) is composed of the resource utilization rates of all the computing nodes:
(the defining formulas for asr_j and ASR are given as images in the original document)
wherein asr_j represents the resource utilization rate of computing node j; ASR represents the resource utilization rate of all the computing nodes; N represents the total number of resource types n; c^t_{m,n} represents the remaining margin of the nth resource of the mth computing node at time t; s^t_{k,n} represents the requirement of the kth task for the nth resource at time t; A^t represents the scheduling action selected by the scheduler at time t.
Preferably, the bilateral benefit balancing objective function in step 2) is composed of the user benefit optimization objective function and the service provider benefit optimization objective function:
(the defining formula for Z is given as an image in the original document)
wherein Z represents the bilateral benefit balancing objective function; θ represents the weight coefficient of the user benefit optimization objective function, and a second weight coefficient (shown only as an image in the original) represents the weight of the service provider benefit optimization objective function.
preferably, the resource allocator in step 6) evaluates the policy μ using bellman's formula:
Qμ(st,ai,ημ)=E[Reward+γQμ(st+1,μ(st+1,ηQ),ημ)]
e represents expectation.
Preferably, the method for measuring the policy μ by the Monte Carlo method and updating the Actor-Critic network parameters in step 9) is:
∇_{η^μ} J ≈ (1/X) * Σ_{i=1}^{X} ∇_a Q^μ(s_t, a, η^Q)|_{a=μ(s_t)} * ∇_{η^μ} μ(s_t, η^μ)
η^{Q'} ← v*η^Q + (1 - v)*η^{Q'}
η^{μ'} ← v*η^μ + (1 - v)*η^{μ'}
wherein ∇ denotes the gradient, and v is the update factor with a value of 0.001.
Preferably, the delay art_delay of transmitting the task to the computing node is calculated as follows:
art_delay = α * Distance_ij
Distance_ij = R * arccos[ sin(Mlat_i)*sin(Mlat_j)*cos(Mlon_i - Mlon_j) + cos(Mlat_i)*cos(Mlat_j) ] * π / 180;
wherein α is the delay factor; Distance_ij represents the distance between user i and computing node j; R represents the average radius of the earth, taken as 6371.004 km; π represents the circular constant; Mlat_i represents the calculated latitude value of user i and Mlon_i represents the calculated longitude value of user i.
Preferably, the delay art_wait of the task waiting to execute in the computing node is calculated as follows:
art_wait = task_begin - task_arrive
wherein task_begin represents the time at which computation of the task begins, obtained from the system record; task_arrive represents the arrival time of the task, obtained from the system record;
the time art_computing for which the task is computed in the computing node is calculated as follows:
art_computing = task_size / f_j
wherein task_size is the size of the task and f_j represents the computation frequency of computing node j.
The invention acquires environment information by interacting with the cloud-edge environment and performs corresponding allocation actions according to changes in the environment information to realize optimal resource allocation. Its advantages are:
1. Compared with existing methods, the average resource utilization rate is improved by 35.08%.
2. Compared with existing methods, the average task completion time is reduced by 24.2%.
3. Compared with existing methods, the method guarantees the benefits of the service provider while improving user benefits by 32.96%, and has better benefit balancing performance.
Drawings
FIG. 1 is a system architecture diagram of a resource allocation method for balancing bilateral interests of users and resource providers based on tabu reinforcement learning.
Fig. 2 is an overall architecture diagram of a tabu reinforcement learning algorithm.
FIG. 3 shows the user benefit results of the method of the invention (SHARER) compared with existing methods (NSGA-II, MSQL, ICPSO) in an embodiment of the present invention.
FIG. 4 shows the service provider benefit results of the method of the invention (SHARER) compared with existing methods (NSGA-II, MSQL, ICPSO) in an embodiment of the present invention.
FIG. 5 shows the average task completion time results of the method of the invention (SHARER) compared with existing methods (NSGA-II, MSQL, ICPSO) in an embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and specific embodiments.
As shown in FIG. 1, the cloud-edge cooperative double-profit equilibrium taboo reinforcement learning resource allocation method provided by the present invention interacts with the cloud-edge environment to obtain environment information and performs corresponding allocation actions according to changes in the environment information, thereby realizing optimal resource allocation. The specific steps are as follows:
1) A resource allocation framework in the cloud-edge environment is established, comprising: a user resource demand model, a computing node resource state model and a resource allocator. At each scheduling instant t:
each computing node transmits its own state to the resource allocator, the state specifically including: CPU resource margin, memory resource margin and storage resource margin;
each user transmits his or her own computing task requirements to the resource allocator by means of the terminal device, the requirements specifically including: the user's position, the size of the task, the CPU resource requirement, the memory resource requirement and the storage resource requirement.
The resource allocator stores the user requirements and computing node states in the form of matrices:
U^t = [ s^t_{1,cpu}  s^t_{1,mem}  s^t_{1,storage} ; … ; s^t_{k,cpu}  s^t_{k,mem}  s^t_{k,storage} ]
C^t = [ c^t_{1,cpu}  c^t_{1,mem}  c^t_{1,storage} ; … ; c^t_{m,cpu}  c^t_{m,mem}  c^t_{m,storage} ]
wherein U^t represents the user demand matrix at time t; k represents the total number of users at time t; s^t_{k,cpu} represents the kth user's demand for CPU resources; s^t_{k,mem} represents the kth user's demand for memory resources; s^t_{k,storage} represents the kth user's demand for storage resources; C^t represents the state matrix of the computing nodes at time t; m represents the total number of computing nodes; c^t_{m,cpu} represents the CPU resource margin of the mth computing node; c^t_{m,mem} represents the memory resource margin of the mth computing node; c^t_{m,storage} represents the storage resource margin of the mth computing node.
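For illustration only, a minimal Python/NumPy sketch of how the resource allocator might hold these two matrices and screen feasible nodes is given below; the variable names, example values and the fixed column order cpu/memory/storage are assumptions for the example and do not appear in the original.

import numpy as np

# k users x 3 resource demands (cpu, memory, storage) at scheduling time t
U_t = np.array([
    [2.0, 4.0, 10.0],   # user 1: cpu cores, memory GB, storage GB
    [1.0, 2.0,  5.0],   # user 2
])

# m computing nodes x 3 resource margins (cpu, memory, storage) at time t
C_t = np.array([
    [8.0, 16.0, 100.0],  # node 1 remaining capacity
    [4.0,  8.0,  50.0],  # node 2 remaining capacity
])

# nodes that can currently satisfy user 1's demand (candidate set for the action a_i)
feasible = np.all(C_t >= U_t[0], axis=1)
print(np.flatnonzero(feasible))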
2) The user benefit optimization objective function, the service provider benefit optimization objective function and the bilateral benefit balancing objective function are determined, wherein:
the user benefit optimization objective function consists of the average task execution time of all users:
ART = (1/k) * Σ_{i=1}^{k} art_i
wherein art_i represents the task execution time of user i; ART represents the average task execution time of all users; k is the number of users; art_i comprises the delay of transmitting the task to the computing node, the delay of waiting to execute in the computing node, and the computing time of the task:
art_i = art_delay + art_wait + art_computing
wherein art_delay denotes the delay of transmitting the task to the computing node, art_wait denotes the delay of the task waiting to execute in the computing node, and art_computing denotes the time for which the task is computed in the computing node.
The delay art_delay of transmitting the task to the computing node is calculated as follows:
art_delay = α * Distance_ij
Distance_ij = R * arccos[ sin(Mlat_i)*sin(Mlat_j)*cos(Mlon_i - Mlon_j) + cos(Mlat_i)*cos(Mlat_j) ] * π / 180;
wherein α is the delay factor; Distance_ij represents the distance between user i and computing node j; R represents the average radius of the earth, taken as 6371.004 km; π represents the circular constant. Mlat_i represents the calculated latitude value of user i: if the geographic location is in the northern hemisphere, Mlat_i = 90 - lat_i; if in the southern hemisphere, Mlat_i = 90 + lat_i, where lat_i is the true latitude value of user i obtained from GPS data; Mlat_j is computed in the same way as Mlat_i. Mlon_i represents the calculated longitude value of user i: if the geographic location is in the eastern hemisphere, Mlon_i = lon_i; if in the western hemisphere, Mlon_i = -lon_i, where lon_i is the true longitude value of user i obtained from GPS data; Mlon_j is computed in the same way as Mlon_i.
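A small Python sketch of this distance calculation follows. Function and variable names are illustrative; signed GPS coordinates (south and west negative) are assumed, which collapses the hemisphere cases above into Mlat = 90 - lat and Mlon = lon, and the linear relation art_delay = α * Distance_ij follows the reconstruction given above rather than the original image.

import math

EARTH_RADIUS_KM = 6371.004  # average earth radius used in the patent

def distance_km(lat_i, lon_i, lat_j, lon_j):
    # Colatitude transform described above (signed coordinates assumed)
    mlat_i = math.radians(90.0 - lat_i)
    mlat_j = math.radians(90.0 - lat_j)
    dlon = math.radians(lon_i - lon_j)
    # Spherical law of cosines written with colatitudes, as in the formula above;
    # working in radians replaces the explicit *pi/180 factor.
    cos_angle = (math.sin(mlat_i) * math.sin(mlat_j) * math.cos(dlon)
                 + math.cos(mlat_i) * math.cos(mlat_j))
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding error
    return EARTH_RADIUS_KM * math.acos(cos_angle)

def transmission_delay(alpha, lat_i, lon_i, lat_j, lon_j):
    # art_delay = alpha * Distance_ij (assumed linear relation)
    return alpha * distance_km(lat_i, lon_i, lat_j, lon_j)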
The delay art_wait of the task waiting to execute in the computing node is calculated as follows:
art_wait = task_begin - task_arrive
wherein task_begin represents the time at which computation of the task begins, obtained from the system record; task_arrive represents the arrival time of the task, obtained from the system record.
The time art_computing for which the task is computed in the computing node is calculated as follows:
art_computing = task_size / f_j
wherein task_size is the size of the task and f_j represents the computation frequency of computing node j.
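Putting the three components together, a short Python sketch of the user benefit index follows; the helper names are illustrative only.

def task_execution_time(delay, begin, arrive, size, freq):
    art_wait = begin - arrive           # queueing delay in the node
    art_computing = size / freq         # task size divided by node computation frequency
    return delay + art_wait + art_computing   # art_i

def average_execution_time(art_list):
    # ART: user benefit objective, averaged over the k users
    return sum(art_list) / len(art_list)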
The service provider benefit optimization objective function consists of the resource utilization rates of all the computing nodes:
(the defining formulas for asr_j and ASR are given as images in the original document)
wherein asr_j represents the resource utilization rate of computing node j; ASR represents the resource utilization rate of all the computing nodes; N represents the total number of resource types n; c^t_{m,n} represents the remaining margin of the nth resource of the mth computing node at time t; s^t_{k,n} represents the requirement of the kth task for the nth resource at time t; A^t represents the scheduling action selected by the scheduler at time t.
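The exact formulas are not reproduced here (they appear only as images in the original). The following Python sketch shows one plausible reading, assumed for illustration only: asr_j averages, over the N resource types, the fraction of node j's capacity occupied by the tasks placed on it by A^t, and ASR averages asr_j over all nodes.

def node_utilization(placed_demands, capacities):
    # placed_demands: per-resource demand sums placed on node j by the scheduling action A^t
    # capacities: per-resource total capacities of node j
    return sum(d / c for d, c in zip(placed_demands, capacities)) / len(capacities)  # asr_j

def average_utilization(per_node_utilizations):
    return sum(per_node_utilizations) / len(per_node_utilizations)                   # ASR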
The bilateral benefit balancing objective function consists of the user benefit optimization objective function and the service provider benefit optimization objective function:
(the defining formula for Z is given as an image in the original document)
wherein Z represents the bilateral benefit balancing objective function; θ represents the weight coefficient of the user benefit optimization objective function, and a second weight coefficient (shown only as an image in the original) represents the weight of the service provider benefit optimization objective function.
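Because a smaller ART favors users while a larger ASR favors the service provider, one illustrative way to combine them into a single score Z is a weighted sum of "larger is better" terms. This is an assumption made for the sketch below, not the patent's exact formula, and the weight names theta and phi are placeholders.

def bilateral_objective(art, asr, theta, phi):
    # theta: weight of the user benefit term, phi: weight of the provider benefit term
    # 1/ART is used so that both terms increase as the corresponding benefit improves
    return theta * (1.0 / art) + phi * asr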
3) The three elements of reinforcement learning are constructed in the resource allocator: state space, action space and reward function. As shown in FIG. 2, this embodiment employs the DDPG algorithm, which is composed of an Actor network and a Critic network. The algorithm determines the computing nodes to which user tasks are allocated at each time t. The state space is represented by the computing node state matrix:
S = {C^t}
where S represents the state space. The action space is represented by the sets of computing nodes that can satisfy the execution of user tasks:
A = {a_1, a_2, …, a_i}
wherein A represents the action space and a_i represents a set of computing nodes that satisfies the execution of user tasks.
The reward function is composed of the bilateral benefit balancing objective function and is calculated as follows:
(the formula is given as an image in the original document)
where Reward represents the reward function.
4) The resource allocator sends the state space to the Actor network, which selects a set of computing nodes a_i from the action space according to the policy as the action vector to which user tasks are assigned:
a_i = μ(s_t, η^μ) + Ψ
wherein s_t represents the state of the cloud-edge system at time t; μ represents the policy simulated by a convolutional neural network; Ψ is random noise; η^μ is an Actor-Critic network parameter.
5) The state space is updated according to the action a_i selected in step 4) to obtain the new state s_{t+1}; the resource allocator assigns the users' tasks to the nodes a_i in sequence and calculates the reward value r_t over the time period t; if the obtained reward value is negative, the selected action vector is stored in the tabu (taboo) list, and if it is positive, the selected action vector is stored in the experience replay pool.
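A minimal Python sketch of this tabu filtering and experience storage step follows; the container names and the deque-based replay pool are illustrative choices, not taken from the patent.

from collections import deque

tabu_list = set()                  # action vectors that produced a negative reward
replay_pool = deque(maxlen=10000)  # experience replay pool

def record_transition(state, action, reward, next_state):
    if reward < 0:
        tabu_list.add(tuple(action))            # forbid re-selecting this allocation
    else:
        replay_pool.append((state, tuple(action), reward, next_state))

def is_tabu(action):
    # checked before the Actor's chosen allocation is accepted
    return tuple(action) in tabu_list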
6) The action a'_i is simulated according to the new state s_{t+1}:
a'_i = μ'(s_{t+1}, η^{μ'}) + Ψ
wherein μ' represents the policy simulated by a convolutional neural network; Ψ is random noise; η^{μ'} is an Actor-Critic network parameter. The resource allocator evaluates the policy μ using the Bellman equation:
Q^μ(s_t, a_i, η^μ) = E[Reward + γ*Q^μ(s_{t+1}, μ(s_{t+1}, η^Q), η^μ)]
wherein γ is the decay factor and η^Q is an Actor-Critic network parameter.
7) The resource allocator calculates the target value y_i:
y_i = Reward + γ*Q^{μ'}(s_{t+1}, a'_i, η^{Q'})
wherein Reward represents the reward function, γ is the decay factor, Q^{μ'} denotes the Q evaluation value of the policy adopted in state s_{t+1}, η^{Q'} is the target policy network parameter in the Critic network, and η^{μ'} is the target policy network parameter in the Actor network.
8) The Actor-Critic network parameter η^Q is calculated using a minimum mean square error loss function:
L(η^Q) = (1/X) * Σ_{i=1}^{X} (y_i - Q^μ(s_t, a_i, η^Q))^2
wherein X represents the number of experiences in the experience replay pool, and Q^μ denotes the Q value obtained by taking action a_i in state s_t and always following the policy μ thereafter.
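A compact PyTorch sketch of steps 7) and 8) under standard DDPG conventions follows; the network modules, optimizer and batch layout (rewards as a column tensor) are assumptions made for illustration and are not the patent's own code.

import torch
import torch.nn.functional as F

def critic_update(critic, critic_target, actor_target, critic_opt,
                  states, actions, rewards, next_states, gamma=0.99):
    with torch.no_grad():
        next_actions = actor_target(next_states)                        # a'_i = mu'(s_{t+1})
        y = rewards + gamma * critic_target(next_states, next_actions)  # target value y_i
    q = critic(states, actions)                                          # Q^mu(s_t, a_i)
    loss = F.mse_loss(q, y)                                              # minimum mean square error loss
    critic_opt.zero_grad()
    loss.backward()
    critic_opt.step()
    return loss.item()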
9) The policy μ is measured by the Monte Carlo method and the Actor-Critic network parameters are updated:
∇_{η^μ} J ≈ (1/X) * Σ_{i=1}^{X} ∇_a Q^μ(s_t, a, η^Q)|_{a=μ(s_t)} * ∇_{η^μ} μ(s_t, η^μ)
η^{Q'} ← v*η^Q + (1 - v)*η^{Q'}
η^{μ'} ← v*η^μ + (1 - v)*η^{μ'}
wherein ∇ denotes the gradient, and v is the update factor with a value of 0.001.
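A matching PyTorch sketch of the actor (policy) update and the soft target-network updates with v = 0.001 follows; again this is an illustrative implementation under standard DDPG conventions, not the patent's code.

def actor_update(actor, critic, actor_opt, states):
    # Sampled (Monte Carlo) deterministic policy gradient: maximize Q(s, mu(s))
    loss = -critic(states, actor(states)).mean()
    actor_opt.zero_grad()
    loss.backward()
    actor_opt.step()

def soft_update(target_net, net, v=0.001):
    # eta' <- v * eta + (1 - v) * eta'
    for tp, p in zip(target_net.parameters(), net.parameters()):
        tp.data.copy_(v * p.data + (1.0 - v) * tp.data)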
10) Steps 3) to 9) are repeated until the Actor-Critic network converges, and the optimal solution of the bilateral benefit balancing objective function is obtained.
The method interacts with the cloud-edge environment to obtain environment information and performs corresponding allocation actions according to changes in the environment information, thereby realizing optimal resource allocation. The MO-FJSPW data set is used for multi-angle performance comparison between the invention and existing methods (NSGA-II, MSQL, ICPSO). The ratio of user benefit to service provider benefit is adopted as the performance index for measuring the double-benefit balance; as can be seen from FIG. 3, the invention obtains the highest ratio under different numbers of users, which demonstrates the superiority of the method in benefit balancing. FIG. 4 shows that the method of the invention achieves a higher average resource utilization rate under different numbers of users, and the effect is significantly better than the other three methods. As can be seen from FIG. 5, the average task completion time is also advantageous over the other methods for different numbers of users.
Finally, it should be noted that the above detailed description is intended only to illustrate the technical solution of this patent and not to limit it. Although the patent has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that the technical solution of the patent can be modified or equivalently replaced without departing from the spirit and scope of the patent, and all such modifications shall be covered by the claims of the patent.

Claims (10)

1. A cloud-edge cooperative double-profit equilibrium taboo reinforcement learning resource allocation method, characterized by comprising the following steps:
1) establishing a resource allocation framework in a cloud-edge environment, comprising: a user resource demand model, a computing node resource state model and a resource allocator;
2) determining a user benefit optimization objective function, a service provider benefit optimization objective function and a bilateral benefit balancing objective function;
3) constructing the three elements of reinforcement learning in the resource allocator: state space, action space and reward function;
4) the resource allocator sends the state space to the Actor network, which selects a set of computing nodes a_i from the action space according to the policy as the action vector to which user tasks are assigned:
a_i = μ(s_t, η^μ) + Ψ
wherein s_t represents the state of the cloud-edge system at time t; μ represents the policy simulated by a convolutional neural network; Ψ is random noise; η^μ is an Actor-Critic network parameter;
5) the state space is updated according to the action a_i selected in step 4) to obtain the new state s_{t+1}; the resource allocator assigns the users' tasks to the nodes a_i in sequence and calculates the reward value r_t over the time period t; if the obtained reward value is negative, the selected action vector is stored in the tabu (taboo) list; if it is positive, the selected action vector is stored in the experience replay pool;
6) simulating the action a'_i according to the new state s_{t+1}:
a'_i = μ'(s_{t+1}, η^{μ'}) + Ψ
wherein μ' represents the policy simulated by a convolutional neural network; Ψ is random noise; η^{μ'} is an Actor-Critic network parameter;
7) the resource allocator calculates the target value y_i:
y_i = Reward + γ*Q^{μ'}(s_{t+1}, a'_i, η^{Q'})
wherein Reward represents the reward function, γ is the decay factor, Q^{μ'} denotes the Q evaluation value of the policy adopted in state s_{t+1}, η^{Q'} is the target policy network parameter in the Critic network, and η^{μ'} is the target policy network parameter in the Actor network;
8) the Actor-Critic network parameter η^Q is calculated using a minimum mean square error loss function:
L(η^Q) = (1/X) * Σ_{i=1}^{X} (y_i - Q^μ(s_t, a_i, η^Q))^2
wherein X represents the number of experiences in the experience replay pool, and Q^μ denotes the Q value obtained by taking action a_i in state s_t and always following the policy μ thereafter;
9) measuring the policy μ by a Monte Carlo method and updating the Actor-Critic network parameters;
10) repeating steps 3) to 9) until the Actor-Critic network converges, thereby obtaining the optimal solution of the bilateral benefit balancing objective function.
2. The cloud-edge cooperative double-profit equilibrium taboo reinforcement learning resource allocation method according to claim 1, characterized in that: in step 1), at each scheduling time t:
each computing node transmits its own state to the resource allocator, the state specifically including: CPU resource margin, memory resource margin and storage resource margin;
each user transmits his or her own computing task requirements to the resource allocator by means of a terminal device, the requirements specifically including: the user's position, the size of the task, the CPU resource requirement, the memory resource requirement and the storage resource requirement.
3. The cloud-edge cooperative double-profit equilibrium taboo reinforcement learning resource allocation method according to claim 1, characterized in that: the resource allocator in step 1) stores the user requirements and the states of the computing nodes in the form of matrices:
U^t = [ s^t_{1,cpu}  s^t_{1,mem}  s^t_{1,storage} ; … ; s^t_{k,cpu}  s^t_{k,mem}  s^t_{k,storage} ]
C^t = [ c^t_{1,cpu}  c^t_{1,mem}  c^t_{1,storage} ; … ; c^t_{m,cpu}  c^t_{m,mem}  c^t_{m,storage} ]
wherein U^t represents the user demand matrix at time t; k represents the total number of users at time t; s^t_{k,cpu} represents the kth user's demand for CPU resources; s^t_{k,mem} represents the kth user's demand for memory resources; s^t_{k,storage} represents the kth user's demand for storage resources; C^t represents the state matrix of the computing nodes at time t; m represents the total number of computing nodes; c^t_{m,cpu} represents the CPU resource margin of the mth computing node; c^t_{m,mem} represents the memory resource margin of the mth computing node; c^t_{m,storage} represents the storage resource margin of the mth computing node.
4. The cloud-edge cooperative double-profit equilibrium taboo reinforcement learning resource allocation method according to claim 1, characterized in that: the user benefit optimization objective function in step 2) is composed of the average task execution time of all users:
ART = (1/k) * Σ_{i=1}^{k} art_i
wherein art_i represents the task execution time of user i; ART represents the average task execution time of all users; k denotes the number of users; art_i comprises the delay of transmitting the task to the computing node, the delay of waiting to execute in the computing node, and the computing time of the task:
art_i = art_delay + art_wait + art_computing
wherein art_delay denotes the delay of transmitting the task to the computing node, art_wait denotes the delay of the task waiting to execute in the computing node, and art_computing denotes the time for which the task is computed in the computing node.
5. The cloud-edge cooperative double-profit equilibrium taboo reinforcement learning resource allocation method according to claim 1, characterized in that: the service provider benefit optimization objective function in step 2) is composed of the resource utilization rates of all the computing nodes:
(the defining formulas for asr_j and ASR are given as images in the original document)
wherein asr_j represents the resource utilization rate of computing node j; ASR represents the resource utilization rate of all the computing nodes; N represents the total number of resource types n; c^t_{m,n} represents the remaining margin of the nth resource of the mth computing node at time t; s^t_{k,n} represents the requirement of the kth task for the nth resource at time t; A^t represents the scheduling action selected by the scheduler at time t.
6. The cloud-edge cooperative double-profit equilibrium taboo reinforcement learning resource allocation method according to claim 1, characterized in that: the bilateral benefit balancing objective function in step 2) is composed of the user benefit optimization objective function and the service provider benefit optimization objective function:
(the defining formula for Z is given as an image in the original document)
wherein Z represents the bilateral benefit balancing objective function; θ represents the weight coefficient of the user benefit optimization objective function, and a second weight coefficient (shown only as an image in the original) represents the weight of the service provider benefit optimization objective function.
7. The cloud-edge cooperative double-profit equilibrium taboo reinforcement learning resource allocation method according to claim 1, characterized in that: the resource allocator in step 6) evaluates the policy μ using the Bellman equation:
Q^μ(s_t, a_i, η^μ) = E[Reward + γ*Q^μ(s_{t+1}, μ(s_{t+1}, η^Q), η^μ)]
wherein E denotes the expectation.
8. The cloud-edge cooperative double-profit equilibrium taboo reinforcement learning resource allocation method according to claim 1, characterized in that: the method for measuring the policy μ by the Monte Carlo method and updating the Actor-Critic network parameters in step 9) is:
∇_{η^μ} J ≈ (1/X) * Σ_{i=1}^{X} ∇_a Q^μ(s_t, a, η^Q)|_{a=μ(s_t)} * ∇_{η^μ} μ(s_t, η^μ)
η^{Q'} ← v*η^Q + (1 - v)*η^{Q'}
η^{μ'} ← v*η^μ + (1 - v)*η^{μ'}
wherein ∇ denotes the gradient, and v is the update factor with a value of 0.001.
9. The cloud-edge cooperative double-profit equilibrium taboo reinforcement learning resource allocation method according to claim 4, characterized in that: the delay art_delay of transmitting the task to the computing node is calculated as follows:
art_delay = α * Distance_ij
Distance_ij = R * arccos[ sin(Mlat_i)*sin(Mlat_j)*cos(Mlon_i - Mlon_j) + cos(Mlat_i)*cos(Mlat_j) ] * π / 180;
wherein α is the delay factor; Distance_ij represents the distance between user i and computing node j; R represents the average radius of the earth, taken as 6371.004 km; π represents the circular constant; Mlat_i represents the calculated latitude value of user i and Mlon_i represents the calculated longitude value of user i.
10. The cloud-edge cooperative double-profit equilibrium taboo reinforcement learning resource allocation method according to claim 4, characterized in that: the delay art_wait of the task waiting to execute in the computing node is calculated as follows:
art_wait = task_begin - task_arrive
wherein task_begin represents the time at which computation of the task begins, obtained from the system record; task_arrive represents the arrival time of the task, obtained from the system record;
the time art_computing for which the task is computed in the computing node is calculated as follows:
art_computing = task_size / f_j
wherein task_size is the size of the task and f_j represents the computation frequency of computing node j.
CN202111209997.8A 2021-10-18 2021-10-18 Cloud-edge cooperative double-profit equilibrium taboo reinforcement learning resource allocation method Active CN114116156B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111209997.8A CN114116156B (en) 2021-10-18 2021-10-18 Cloud-edge cooperative double-profit equilibrium taboo reinforcement learning resource allocation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111209997.8A CN114116156B (en) 2021-10-18 2021-10-18 Cloud-edge cooperative double-profit equilibrium taboo reinforcement learning resource allocation method

Publications (2)

Publication Number Publication Date
CN114116156A (en) 2022-03-01
CN114116156B CN114116156B (en) 2022-09-09

Family

ID=80376227

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111209997.8A Active CN114116156B (en) 2021-10-18 2021-10-18 Cloud-edge cooperative double-profit equilibrium taboo reinforcement learning resource allocation method

Country Status (1)

Country Link
CN (1) CN114116156B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111444009A (en) * 2019-11-15 2020-07-24 北京邮电大学 Resource allocation method and device based on deep reinforcement learning
CN111813539A (en) * 2020-05-29 2020-10-23 西安交通大学 Edge computing resource allocation method based on priority and cooperation
CN111918339A (en) * 2020-07-17 2020-11-10 西安交通大学 AR task unloading and resource allocation method based on reinforcement learning in mobile edge network
CN112367353A (en) * 2020-10-08 2021-02-12 大连理工大学 Mobile edge computing unloading method based on multi-agent reinforcement learning
CN112363829A (en) * 2020-11-03 2021-02-12 武汉理工大学 Dynamic user resource allocation method based on elastic scale aggregation
CN112351433A (en) * 2021-01-05 2021-02-09 南京邮电大学 Heterogeneous network resource allocation method based on reinforcement learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
BINBIN HUANG et al.: "Deep Reinforcement Learning for Performance-Aware Adaptive Resource Allocation in Mobile Edge Computing", Wireless Communications and Mobile Computing *
CHEN MINCHENG et al.: "Two-Sided Matching Scheduling Using Multi-Level Look-Ahead Queue of Supply and Demand", 18th International Conference on Service-Oriented Computing (ICSOC) *
YE QING et al.: "Simulation Study on Quantum Learning Algorithm for Virtual Resource Allocation Optimization" (虚拟资源分配优化量子学习算法仿真研究), Computer Simulation (计算机仿真) *

Also Published As

Publication number Publication date
CN114116156B (en) 2022-09-09

Similar Documents

Publication Publication Date Title
WO2021248607A1 (en) Deep reinforcement learning-based taxi dispatching method and system
CN108009023B (en) Task scheduling method based on BP neural network time prediction in hybrid cloud
Pillai et al. Resource allocation in cloud computing using the uncertainty principle of game theory
Shi et al. Location-aware and budget-constrained service deployment for composite applications in multi-cloud environment
Sotiriadis et al. Towards inter-cloud schedulers: A survey of meta-scheduling approaches
Murad et al. A review on job scheduling technique in cloud computing and priority rule based intelligent framework
Keshk et al. Cloud task scheduling for load balancing based on intelligent strategy
CN113037877B (en) Optimization method for time-space data and resource scheduling under cloud edge architecture
CN113238847B (en) Distribution and scheduling method based on distributed network environment and capable of distributing tasks
CN109343945A (en) A kind of multitask dynamic allocation method based on contract net algorithm
CN116489708B (en) Meta universe oriented cloud edge end collaborative mobile edge computing task unloading method
Aloqaily et al. Fairness-aware game theoretic approach for service management in vehicular clouds
Salimi et al. Task scheduling with Load balancing for computational grid using NSGA II with fuzzy mutation
Liwang et al. Resource trading in edge computing-enabled IoV: An efficient futures-based approach
CN115225643A (en) Point cloud platform big data distributed management method, device and system
CN113139639B (en) MOMBI-oriented smart city application multi-target computing migration method and device
Zhou et al. DPS: Dynamic pricing and scheduling for distributed machine learning jobs in edge-cloud networks
CN114116156B (en) Cloud-edge cooperative double-profit equilibrium taboo reinforcement learning resource allocation method
Zhu et al. SAAS parallel task scheduling based on cloud service flow load algorithm
CN110012507B (en) Internet of vehicles resource allocation method and system with priority of user experience
Milocco et al. Evaluating the upper bound of energy cost saving by proactive data center management
Sotiriadis The inter-cloud meta-scheduling framework
Zhao et al. Hypergraph-based task-bundle scheduling towards efficiency and fairness in heterogeneous distributed systems
Sun et al. An intelligent resource allocation mechanism in the cloud computing environment
CN112700269A (en) Distributed data center selection method based on anisotropic reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant