CN114116156A - Cloud-edge cooperative double-profit equilibrium taboo reinforcement learning resource allocation method - Google Patents

Cloud-edge cooperative double-profit equilibrium taboo reinforcement learning resource allocation method

Info

Publication number
CN114116156A
CN114116156A (application CN202111209997.8A)
Authority
CN
China
Prior art keywords
resource
task
representing
user
computing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111209997.8A
Other languages
Chinese (zh)
Other versions
CN114116156B (en)
Inventor
袁景凌 (Yuan Jingling)
向尧 (Xiang Yao)
罗忆 (Luo Yi)
毛慧华 (Mao Huihua)
李新平 (Li Xinping)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University of Technology WUT
Original Assignee
Wuhan University of Technology WUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University of Technology (WUT)
Priority to CN202111209997.8A
Publication of CN114116156A
Application granted
Publication of CN114116156B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 - Multiprogramming arrangements
    • G06F9/48 - Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 - Task transfer initiation or dispatching
    • G06F9/4843 - Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 - Multiprogramming arrangements
    • G06F9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011 - Allocation of resources to service a request, the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016 - Allocation of resources to service a request, the resource being the memory
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 - Multiprogramming arrangements
    • G06F9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 - Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 - Allocation of resources to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 - Multiprogramming arrangements
    • G06F9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061 - Partitioning or combining of resources
    • G06F9/5072 - Grid computing
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F2209/00 - Indexing scheme relating to G06F9/00
    • G06F2209/50 - Indexing scheme relating to G06F9/50
    • G06F2209/502 - Proximity
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a cloud-edge cooperative double-profit equilibrium taboo reinforcement learning resource allocation method, which comprises the following steps: 1) establishing a resource allocation framework in a cloud-edge environment; 2) determining a user benefit optimization objective function, a service provider benefit optimization objective function and a bilateral benefit balancing objective function; 3) constructing the three elements of reinforcement learning in the resource allocator; 4) selecting a computing node a_i; 5) updating the state according to the selected action a_i to obtain the new state s_{t+1}; 6) simulating the action a'_i according to the new state s_{t+1}; 7) calculating the target value y_i; 8) calculating the Actor-Critic network parameter η^Q; 9) updating the Actor-Critic network parameters; 10) repeating steps 3) to 9) until the Actor-Critic network converges, thereby obtaining the optimal solution of the bilateral benefit balancing objective function. The invention takes the average completion time of user tasks as the user benefit index and the average resource utilization rate of the service provider as the service provider benefit index, and adaptively makes the optimal resource allocation decision for real-time, dynamic user tasks through a taboo (tabu) reinforcement learning method.

Description

Cloud-edge cooperative double-profit equilibrium taboo reinforcement learning resource allocation method
Technical Field
The invention relates to a system resource allocation method in the fields of cloud computing and edge computing, and in particular to a cloud-edge cooperative double-profit equilibrium taboo reinforcement learning resource allocation method.
Background Art
Cloud-edge collaboration is a new Internet-of-Things computing paradigm in which large-scale, complex computing tasks are executed through computation/data migration and resource cooperation between the remote cloud and edge clouds, and through cooperation among computing nodes; it has gradually become a focus and frontier of attention in academia and industry at home and abroad. In the traditional cloud computing and edge computing modes, the user is only the final "consumer" of data, for example watching an online video on a mobile phone. In contrast, the cloud-edge collaborative mode is an interconnected system composed of heterogeneous computing nodes of multiple types, forming an integrated collaborative computing system that provides intelligent services to users nearby. The user plays the dual roles of data consumer and data producer, for example sharing videos through WeChat, Douyin (TikTok) and the like. Users care about how much benefit they can obtain from completing their task requests, how much they must pay the service provider to complete these requests, the user experience, and so on. If users obtain poor benefits when using the cloud-edge collaborative computing mode, they will refuse to use the cloud-edge collaborative service and choose to complete their job tasks locally instead. Conversely, if the interests of a large number of users can be optimized, users will be more willing to use the cloud-edge collaborative computing mode, which will also attract more potential users in the market to adopt cloud-edge collaboration.
In fact, the interests of the user are closely related to the interests of the service provider. As mentioned above, cloud-edge collaboration is a new application paradigm that includes software, platform and infrastructure services shared by users. For the service provider, revenue is derived from the fees charged to users for providing services (the user as consumer) and the fees charged for the use of data shared by users (the user as producer). Increased revenue allows the provider to improve the quality of service, attract more users, and eventually form a virtuous circle. Therefore, how to increase the interests of the service provider while optimizing the interests of the user must also be considered. Accordingly, reasonably allocating resources so as to balance the interests of users and service providers in the cloud-edge collaborative environment is of great significance.
In much of the existing research, the resource allocation problem has been identified as a multi-constraint, multi-objective NP-hard optimization problem. Existing resource allocation solutions are oriented only to a single cloud computing or edge computing environment and lack generality, so they are difficult to apply directly to the complex cloud-edge collaborative environment. In addition, most of these schemes maximize the benefit of a single party and do not consider the benefits of the user and the service provider jointly. Therefore, it is necessary to provide a resource allocation method that balances the interests of users and service providers to solve the above problems.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a cloud-edge cooperative double-profit equilibrium taboo reinforcement learning resource allocation method, which takes the average completion time of user tasks as the user benefit index and the average resource utilization rate of the service provider as the service provider benefit index, and adaptively makes the optimal resource allocation decision for real-time, dynamic user tasks through a taboo (tabu) reinforcement learning method.
In order to achieve the above object, the invention provides a cloud-edge cooperative double-profit equilibrium taboo reinforcement learning resource allocation method, characterized by comprising the following steps:
1) establishing a resource allocation framework in a cloud edge environment, comprising: a user resource demand model, a computing node resource state model and a resource distributor;
2) determining a user benefit optimization objective function, a service provider benefit optimization objective function and a bilateral benefit balance objective function;
3) constructing the three elements of reinforcement learning in the resource allocator: state space, action space and reward function;
4) the resource allocator sends the state space to the Actor network, which selects a set of computing nodes a_i from the action space according to the policy as the action vector to which user tasks are assigned:
a_i = μ(s_t, η^μ) + Ψ
wherein s_t represents the state of the cloud-edge system at time t; μ represents the policy simulated by a convolutional neural network; Ψ is random noise; η^μ is an Actor-Critic network parameter;
5) the state space is updated according to the action a_i selected in step 4) to obtain the new state s_{t+1}; the resource allocator assigns the users' tasks to the nodes a_i in sequence and calculates the reward value r_t over the time period t; if the obtained reward value is negative, the selected action vector is stored in the tabu (taboo) list; if it is positive, the selected action vector is stored in the experience replay pool;
6) the action a'_i is simulated according to the new state s_{t+1}:
a'_i = μ'(s_{t+1}, η^{μ'}) + Ψ
wherein μ' represents the policy simulated by a convolutional neural network; Ψ is random noise; η^{μ'} is an Actor-Critic network parameter;
7) the resource allocator calculates the target value y_i:
y_i = Reward + γ*Q^{μ'}(s_{t+1}, a'_i, η^{Q'})
wherein Reward represents the reward function, γ is the decay factor, Q^{μ'} denotes the Q evaluation value of the policy adopted in state s_{t+1}, η^{Q'} is the target policy network parameter in the Critic network, and η^{μ'} is the target policy network parameter in the Actor network;
8) the Actor-Critic network parameter η^Q is calculated using a minimum mean square error loss function:
L(η^Q) = (1/X) * Σ_{i=1}^{X} (y_i - Q^μ(s_t, a_i, η^Q))^2
wherein X represents the number of experiences in the experience replay pool, and Q^μ denotes the Q value obtained by taking action a_i in state s_t and always following the policy μ thereafter;
9) measuring the policy μ by a Monte Carlo method and updating the Actor-Critic network parameters;
10) repeating steps 3) to 9) until the Actor-Critic network converges, thereby obtaining the optimal solution of the bilateral benefit balancing objective function.
Preferably, in step 1), at each scheduling time t:
each computing node transmits its own state to the resource allocator, the state specifically including: CPU resource margin, memory resource margin and storage resource margin;
each user transmits his or her own computing task requirements to the resource allocator by means of a terminal device, the requirements specifically including: the user's position, the size of the task, the CPU resource requirement, the memory resource requirement and the storage resource requirement.
Preferably, the resource allocator in step 1) stores the user requirements and the states of the computing nodes in the form of matrices:
U^t = [ s^t_{1,cpu}  s^t_{1,mem}  s^t_{1,storage} ; … ; s^t_{k,cpu}  s^t_{k,mem}  s^t_{k,storage} ]
C^t = [ c^t_{1,cpu}  c^t_{1,mem}  c^t_{1,storage} ; … ; c^t_{m,cpu}  c^t_{m,mem}  c^t_{m,storage} ]
wherein U^t represents the user demand matrix at time t; k represents the total number of users at time t; s^t_{k,cpu} represents the kth user's demand for CPU resources; s^t_{k,mem} represents the kth user's demand for memory resources; s^t_{k,storage} represents the kth user's demand for storage resources; C^t represents the state matrix of the computing nodes at time t; m represents the total number of computing nodes; c^t_{m,cpu} represents the CPU resource margin of the mth computing node; c^t_{m,mem} represents the memory resource margin of the mth computing node; c^t_{m,storage} represents the storage resource margin of the mth computing node.
Preferably, the user benefit optimization objective function in step 2) is composed of the average task execution time of all users:
ART = (1/k) * Σ_{i=1}^{k} art_i
wherein art_i represents the task execution time of user i; ART represents the average task execution time of all users; k denotes the number of users; art_i comprises the delay of transmitting the task to the computing node, the delay of waiting to execute in the computing node, and the computing time of the task:
art_i = art_delay + art_wait + art_computing
wherein art_delay denotes the delay of transmitting the task to the computing node, art_wait denotes the delay of the task waiting to execute in the computing node, and art_computing denotes the time for which the task is computed in the computing node.
Preferably, the service provider benefit optimization objective function in step 2) is composed of the resource utilization rates of all the computing nodes:
(the defining formulas for asr_j and ASR are given as images in the original document)
wherein asr_j represents the resource utilization rate of computing node j; ASR represents the resource utilization rate of all the computing nodes; N represents the total number of resource types n; c^t_{m,n} represents the remaining margin of the nth resource of the mth computing node at time t; s^t_{k,n} represents the requirement of the kth task for the nth resource at time t; A^t represents the scheduling action selected by the scheduler at time t.
Preferably, the bilateral benefit balancing objective function in step 2) is composed of the user benefit optimization objective function and the service provider benefit optimization objective function:
(the defining formula for Z is given as an image in the original document)
wherein Z represents the bilateral benefit balancing objective function; θ represents the weight coefficient of the user benefit optimization objective function, and a second weight coefficient (shown only as an image in the original) represents the weight of the service provider benefit optimization objective function.
preferably, the resource allocator in step 6) evaluates the policy μ using bellman's formula:
Qμ(st,ai,ημ)=E[Reward+γQμ(st+1,μ(st+1,ηQ),ημ)]
e represents expectation.
Preferably, the method for measuring the policy μ by the Monte Carlo method and updating the Actor-Critic network parameters in step 9) is:
∇_{η^μ} J ≈ (1/X) * Σ_{i=1}^{X} ∇_a Q^μ(s_t, a, η^Q)|_{a=μ(s_t)} * ∇_{η^μ} μ(s_t, η^μ)
η^{Q'} ← v*η^Q + (1 - v)*η^{Q'}
η^{μ'} ← v*η^μ + (1 - v)*η^{μ'}
wherein ∇ denotes the gradient, and v is the update factor with a value of 0.001.
Preferably, the delay art_delay of transmitting the task to the computing node is calculated as follows:
art_delay = α * Distance_ij
Distance_ij = R * arccos[ sin(Mlat_i)*sin(Mlat_j)*cos(Mlon_i - Mlon_j) + cos(Mlat_i)*cos(Mlat_j) ] * π / 180;
wherein α is the delay factor; Distance_ij represents the distance between user i and computing node j; R represents the average radius of the earth, taken as 6371.004 km; π represents the circular constant; Mlat_i represents the calculated latitude value of user i and Mlon_i represents the calculated longitude value of user i.
Preferably, the delay art_wait of the task waiting to execute in the computing node is calculated as follows:
art_wait = task_begin - task_arrive
wherein task_begin represents the time at which computation of the task begins, obtained from the system record; task_arrive represents the arrival time of the task, obtained from the system record;
the time art_computing for which the task is computed in the computing node is calculated as follows:
art_computing = task_size / f_j
wherein task_size is the size of the task and f_j represents the computation frequency of computing node j.
The invention acquires environment information by interacting with the cloud-edge environment and performs corresponding allocation actions according to changes in the environment information to realize optimal resource allocation. Its advantages are:
1. Compared with existing methods, the average resource utilization rate is improved by 35.08%.
2. Compared with existing methods, the average task completion time is reduced by 24.2%.
3. Compared with existing methods, the method guarantees the benefits of the service provider while improving user benefits by 32.96%, and has better benefit balancing performance.
Drawings
FIG. 1 is a system architecture diagram of a resource allocation method for balancing bilateral interests of users and resource providers based on tabu reinforcement learning.
Fig. 2 is an overall architecture diagram of a tabu reinforcement learning algorithm.
FIG. 3 shows the user benefit results of the method of the invention (SHARER) compared with existing methods (NSGA-II, MSQL, ICPSO) in an embodiment of the present invention.
FIG. 4 shows the service provider benefit results of the method of the invention (SHARER) compared with existing methods (NSGA-II, MSQL, ICPSO) in an embodiment of the present invention.
FIG. 5 shows the average task completion time results of the method of the invention (SHARER) compared with existing methods (NSGA-II, MSQL, ICPSO) in an embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and specific embodiments.
As shown in FIG. 1, the cloud-edge cooperative double-profit equilibrium taboo reinforcement learning resource allocation method provided by the present invention interacts with the cloud-edge environment to obtain environment information and performs corresponding allocation actions according to changes in the environment information, thereby realizing optimal resource allocation. The specific steps are as follows:
1) A resource allocation framework in the cloud-edge environment is established, comprising: a user resource demand model, a computing node resource state model and a resource allocator. At each scheduling instant t:
each computing node transmits its own state to the resource allocator, the state specifically including: CPU resource margin, memory resource margin and storage resource margin;
each user transmits his or her own computing task requirements to the resource allocator by means of the terminal device, the requirements specifically including: the user's position, the size of the task, the CPU resource requirement, the memory resource requirement and the storage resource requirement.
The resource allocator stores the user requirements and computing node states in the form of matrices:
U^t = [ s^t_{1,cpu}  s^t_{1,mem}  s^t_{1,storage} ; … ; s^t_{k,cpu}  s^t_{k,mem}  s^t_{k,storage} ]
C^t = [ c^t_{1,cpu}  c^t_{1,mem}  c^t_{1,storage} ; … ; c^t_{m,cpu}  c^t_{m,mem}  c^t_{m,storage} ]
wherein U^t represents the user demand matrix at time t; k represents the total number of users at time t; s^t_{k,cpu} represents the kth user's demand for CPU resources; s^t_{k,mem} represents the kth user's demand for memory resources; s^t_{k,storage} represents the kth user's demand for storage resources; C^t represents the state matrix of the computing nodes at time t; m represents the total number of computing nodes; c^t_{m,cpu} represents the CPU resource margin of the mth computing node; c^t_{m,mem} represents the memory resource margin of the mth computing node; c^t_{m,storage} represents the storage resource margin of the mth computing node.
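For illustration only, a minimal Python/NumPy sketch of how the resource allocator might hold these two matrices and screen feasible nodes is given below; the variable names, example values and the fixed column order cpu/memory/storage are assumptions for the example and do not appear in the original.

import numpy as np

# k users x 3 resource demands (cpu, memory, storage) at scheduling time t
U_t = np.array([
    [2.0, 4.0, 10.0],   # user 1: cpu cores, memory GB, storage GB
    [1.0, 2.0,  5.0],   # user 2
])

# m computing nodes x 3 resource margins (cpu, memory, storage) at time t
C_t = np.array([
    [8.0, 16.0, 100.0],  # node 1 remaining capacity
    [4.0,  8.0,  50.0],  # node 2 remaining capacity
])

# nodes that can currently satisfy user 1's demand (candidate set for the action a_i)
feasible = np.all(C_t >= U_t[0], axis=1)
print(np.flatnonzero(feasible))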
2) The user benefit optimization objective function, the service provider benefit optimization objective function and the bilateral benefit balancing objective function are determined, wherein:
the user benefit optimization objective function consists of the average task execution time of all users:
ART = (1/k) * Σ_{i=1}^{k} art_i
wherein art_i represents the task execution time of user i; ART represents the average task execution time of all users; k is the number of users; art_i comprises the delay of transmitting the task to the computing node, the delay of waiting to execute in the computing node, and the computing time of the task:
art_i = art_delay + art_wait + art_computing
wherein art_delay denotes the delay of transmitting the task to the computing node, art_wait denotes the delay of the task waiting to execute in the computing node, and art_computing denotes the time for which the task is computed in the computing node.
The delay art_delay of transmitting the task to the computing node is calculated as follows:
art_delay = α * Distance_ij
Distance_ij = R * arccos[ sin(Mlat_i)*sin(Mlat_j)*cos(Mlon_i - Mlon_j) + cos(Mlat_i)*cos(Mlat_j) ] * π / 180;
wherein α is the delay factor; Distance_ij represents the distance between user i and computing node j; R represents the average radius of the earth, taken as 6371.004 km; π represents the circular constant. Mlat_i represents the calculated latitude value of user i: if the geographic location is in the northern hemisphere, Mlat_i = 90 - lat_i; if in the southern hemisphere, Mlat_i = 90 + lat_i, where lat_i is the true latitude value of user i obtained from GPS data; Mlat_j is computed in the same way as Mlat_i. Mlon_i represents the calculated longitude value of user i: if the geographic location is in the eastern hemisphere, Mlon_i = lon_i; if in the western hemisphere, Mlon_i = -lon_i, where lon_i is the true longitude value of user i obtained from GPS data; Mlon_j is computed in the same way as Mlon_i.
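A small Python sketch of this distance calculation follows. Function and variable names are illustrative; signed GPS coordinates (south and west negative) are assumed, which collapses the hemisphere cases above into Mlat = 90 - lat and Mlon = lon, and the linear relation art_delay = α * Distance_ij follows the reconstruction given above rather than the original image.

import math

EARTH_RADIUS_KM = 6371.004  # average earth radius used in the patent

def distance_km(lat_i, lon_i, lat_j, lon_j):
    # Colatitude transform described above (signed coordinates assumed)
    mlat_i = math.radians(90.0 - lat_i)
    mlat_j = math.radians(90.0 - lat_j)
    dlon = math.radians(lon_i - lon_j)
    # Spherical law of cosines written with colatitudes, as in the formula above;
    # working in radians replaces the explicit *pi/180 factor.
    cos_angle = (math.sin(mlat_i) * math.sin(mlat_j) * math.cos(dlon)
                 + math.cos(mlat_i) * math.cos(mlat_j))
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding error
    return EARTH_RADIUS_KM * math.acos(cos_angle)

def transmission_delay(alpha, lat_i, lon_i, lat_j, lon_j):
    # art_delay = alpha * Distance_ij (assumed linear relation)
    return alpha * distance_km(lat_i, lon_i, lat_j, lon_j)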
The delay art_wait of the task waiting to execute in the computing node is calculated as follows:
art_wait = task_begin - task_arrive
wherein task_begin represents the time at which computation of the task begins, obtained from the system record; task_arrive represents the arrival time of the task, obtained from the system record.
The time art_computing for which the task is computed in the computing node is calculated as follows:
art_computing = task_size / f_j
wherein task_size is the size of the task and f_j represents the computation frequency of computing node j.
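Putting the three components together, a short Python sketch of the user benefit index follows; the helper names are illustrative only.

def task_execution_time(delay, begin, arrive, size, freq):
    art_wait = begin - arrive           # queueing delay in the node
    art_computing = size / freq         # task size divided by node computation frequency
    return delay + art_wait + art_computing   # art_i

def average_execution_time(art_list):
    # ART: user benefit objective, averaged over the k users
    return sum(art_list) / len(art_list)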
The service provider benefit optimization objective function consists of the resource utilization rates of all the computing nodes:
(the defining formulas for asr_j and ASR are given as images in the original document)
wherein asr_j represents the resource utilization rate of computing node j; ASR represents the resource utilization rate of all the computing nodes; N represents the total number of resource types n; c^t_{m,n} represents the remaining margin of the nth resource of the mth computing node at time t; s^t_{k,n} represents the requirement of the kth task for the nth resource at time t; A^t represents the scheduling action selected by the scheduler at time t.
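The exact formulas are not reproduced here (they appear only as images in the original). The following Python sketch shows one plausible reading, assumed for illustration only: asr_j averages, over the N resource types, the fraction of node j's capacity occupied by the tasks placed on it by A^t, and ASR averages asr_j over all nodes.

def node_utilization(placed_demands, capacities):
    # placed_demands: per-resource demand sums placed on node j by the scheduling action A^t
    # capacities: per-resource total capacities of node j
    return sum(d / c for d, c in zip(placed_demands, capacities)) / len(capacities)  # asr_j

def average_utilization(per_node_utilizations):
    return sum(per_node_utilizations) / len(per_node_utilizations)                   # ASR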
The bilateral benefit balancing objective function consists of the user benefit optimization objective function and the service provider benefit optimization objective function:
(the defining formula for Z is given as an image in the original document)
wherein Z represents the bilateral benefit balancing objective function; θ represents the weight coefficient of the user benefit optimization objective function, and a second weight coefficient (shown only as an image in the original) represents the weight of the service provider benefit optimization objective function.
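Because a smaller ART favors users while a larger ASR favors the service provider, one illustrative way to combine them into a single score Z is a weighted sum of "larger is better" terms. This is an assumption made for the sketch below, not the patent's exact formula, and the weight names theta and phi are placeholders.

def bilateral_objective(art, asr, theta, phi):
    # theta: weight of the user benefit term, phi: weight of the provider benefit term
    # 1/ART is used so that both terms increase as the corresponding benefit improves
    return theta * (1.0 / art) + phi * asr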
3) The three elements of reinforcement learning are constructed in the resource allocator: state space, action space and reward function. As shown in FIG. 2, this embodiment employs the DDPG algorithm, which is composed of an Actor network and a Critic network. The algorithm determines the computing nodes to which user tasks are allocated at each time t. The state space is represented by the computing node state matrix:
S = {C^t}
where S represents the state space. The action space is represented by the sets of computing nodes that can satisfy the execution of user tasks:
A = {a_1, a_2, …, a_i}
wherein A represents the action space and a_i represents a set of computing nodes that satisfies the execution of user tasks.
The reward function is composed of the bilateral benefit balancing objective function and is calculated as follows:
(the formula is given as an image in the original document)
where Reward represents the reward function.
4) The resource allocator sends the state space to the Actor network, which selects a set of computing nodes a_i from the action space according to the policy as the action vector to which user tasks are assigned:
a_i = μ(s_t, η^μ) + Ψ
wherein s_t represents the state of the cloud-edge system at time t; μ represents the policy simulated by a convolutional neural network; Ψ is random noise; η^μ is an Actor-Critic network parameter.
5) The state space is updated according to the action a_i selected in step 4) to obtain the new state s_{t+1}; the resource allocator assigns the users' tasks to the nodes a_i in sequence and calculates the reward value r_t over the time period t; if the obtained reward value is negative, the selected action vector is stored in the tabu (taboo) list, and if it is positive, the selected action vector is stored in the experience replay pool.
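A minimal Python sketch of this tabu filtering and experience storage step follows; the container names and the deque-based replay pool are illustrative choices, not taken from the patent.

from collections import deque

tabu_list = set()                  # action vectors that produced a negative reward
replay_pool = deque(maxlen=10000)  # experience replay pool

def record_transition(state, action, reward, next_state):
    if reward < 0:
        tabu_list.add(tuple(action))            # forbid re-selecting this allocation
    else:
        replay_pool.append((state, tuple(action), reward, next_state))

def is_tabu(action):
    # checked before the Actor's chosen allocation is accepted
    return tuple(action) in tabu_list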
6) The action a'_i is simulated according to the new state s_{t+1}:
a'_i = μ'(s_{t+1}, η^{μ'}) + Ψ
wherein μ' represents the policy simulated by a convolutional neural network; Ψ is random noise; η^{μ'} is an Actor-Critic network parameter. The resource allocator evaluates the policy μ using the Bellman equation:
Q^μ(s_t, a_i, η^μ) = E[Reward + γ*Q^μ(s_{t+1}, μ(s_{t+1}, η^Q), η^μ)]
wherein γ is the decay factor and η^Q is an Actor-Critic network parameter.
7) The resource allocator calculates the target value y_i:
y_i = Reward + γ*Q^{μ'}(s_{t+1}, a'_i, η^{Q'})
wherein Reward represents the reward function, γ is the decay factor, Q^{μ'} denotes the Q evaluation value of the policy adopted in state s_{t+1}, η^{Q'} is the target policy network parameter in the Critic network, and η^{μ'} is the target policy network parameter in the Actor network.
8) The Actor-Critic network parameter η^Q is calculated using a minimum mean square error loss function:
L(η^Q) = (1/X) * Σ_{i=1}^{X} (y_i - Q^μ(s_t, a_i, η^Q))^2
wherein X represents the number of experiences in the experience replay pool, and Q^μ denotes the Q value obtained by taking action a_i in state s_t and always following the policy μ thereafter.
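A compact PyTorch sketch of steps 7) and 8) under standard DDPG conventions follows; the network modules, optimizer and batch layout (rewards as a column tensor) are assumptions made for illustration and are not the patent's own code.

import torch
import torch.nn.functional as F

def critic_update(critic, critic_target, actor_target, critic_opt,
                  states, actions, rewards, next_states, gamma=0.99):
    with torch.no_grad():
        next_actions = actor_target(next_states)                        # a'_i = mu'(s_{t+1})
        y = rewards + gamma * critic_target(next_states, next_actions)  # target value y_i
    q = critic(states, actions)                                          # Q^mu(s_t, a_i)
    loss = F.mse_loss(q, y)                                              # minimum mean square error loss
    critic_opt.zero_grad()
    loss.backward()
    critic_opt.step()
    return loss.item()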
9) The policy μ is measured by the Monte Carlo method and the Actor-Critic network parameters are updated:
∇_{η^μ} J ≈ (1/X) * Σ_{i=1}^{X} ∇_a Q^μ(s_t, a, η^Q)|_{a=μ(s_t)} * ∇_{η^μ} μ(s_t, η^μ)
η^{Q'} ← v*η^Q + (1 - v)*η^{Q'}
η^{μ'} ← v*η^μ + (1 - v)*η^{μ'}
wherein ∇ denotes the gradient, and v is the update factor with a value of 0.001.
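A matching PyTorch sketch of the actor (policy) update and the soft target-network updates with v = 0.001 follows; again this is an illustrative implementation under standard DDPG conventions, not the patent's code.

def actor_update(actor, critic, actor_opt, states):
    # Sampled (Monte Carlo) deterministic policy gradient: maximize Q(s, mu(s))
    loss = -critic(states, actor(states)).mean()
    actor_opt.zero_grad()
    loss.backward()
    actor_opt.step()

def soft_update(target_net, net, v=0.001):
    # eta' <- v * eta + (1 - v) * eta'
    for tp, p in zip(target_net.parameters(), net.parameters()):
        tp.data.copy_(v * p.data + (1.0 - v) * tp.data)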
10) Steps 3) to 9) are repeated until the Actor-Critic network converges, and the optimal solution of the bilateral benefit balancing objective function is obtained.
The method interacts with the cloud-edge environment to obtain environment information and performs corresponding allocation actions according to changes in the environment information, thereby realizing optimal resource allocation. The MO-FJSPW data set is used for multi-angle performance comparison between the invention and existing methods (NSGA-II, MSQL, ICPSO). The ratio of user benefit to service provider benefit is adopted as the performance index for measuring the double-benefit balance; as can be seen from FIG. 3, the invention obtains the highest ratio under different numbers of users, which demonstrates the superiority of the method in benefit balancing. FIG. 4 shows that the method of the invention achieves a higher average resource utilization rate under different numbers of users, and the effect is significantly better than the other three methods. As can be seen from FIG. 5, the average task completion time is also advantageous over the other methods for different numbers of users.
Finally, it should be noted that the above detailed description is intended only to illustrate the technical solution of this patent and not to limit it. Although the patent has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that the technical solution of the patent can be modified or equivalently replaced without departing from the spirit and scope of the patent, and all such modifications shall be covered by the claims of the patent.

Claims (10)

1. A cloud-edge cooperative double-profit equilibrium taboo reinforcement learning resource allocation method, characterized by comprising the following steps:
1) establishing a resource allocation framework in a cloud-edge environment, comprising: a user resource demand model, a computing node resource state model and a resource allocator;
2) determining a user benefit optimization objective function, a service provider benefit optimization objective function and a bilateral benefit balancing objective function;
3) constructing the three elements of reinforcement learning in the resource allocator: state space, action space and reward function;
4) the resource allocator sends the state space to the Actor network, which selects a set of computing nodes a_i from the action space according to the policy as the action vector to which user tasks are assigned:
a_i = μ(s_t, η^μ) + Ψ
wherein s_t represents the state of the cloud-edge system at time t; μ represents the policy simulated by a convolutional neural network; Ψ is random noise; η^μ is an Actor-Critic network parameter;
5) the state space is updated according to the action a_i selected in step 4) to obtain the new state s_{t+1}; the resource allocator assigns the users' tasks to the nodes a_i in sequence and calculates the reward value r_t over the time period t; if the obtained reward value is negative, the selected action vector is stored in the tabu (taboo) list; if it is positive, the selected action vector is stored in the experience replay pool;
6) simulating the action a'_i according to the new state s_{t+1}:
a'_i = μ'(s_{t+1}, η^{μ'}) + Ψ
wherein μ' represents the policy simulated by a convolutional neural network; Ψ is random noise; η^{μ'} is an Actor-Critic network parameter;
7) the resource allocator calculates the target value y_i:
y_i = Reward + γ*Q^{μ'}(s_{t+1}, a'_i, η^{Q'})
wherein Reward represents the reward function, γ is the decay factor, Q^{μ'} denotes the Q evaluation value of the policy adopted in state s_{t+1}, η^{Q'} is the target policy network parameter in the Critic network, and η^{μ'} is the target policy network parameter in the Actor network;
8) the Actor-Critic network parameter η^Q is calculated using a minimum mean square error loss function:
L(η^Q) = (1/X) * Σ_{i=1}^{X} (y_i - Q^μ(s_t, a_i, η^Q))^2
wherein X represents the number of experiences in the experience replay pool, and Q^μ denotes the Q value obtained by taking action a_i in state s_t and always following the policy μ thereafter;
9) measuring the policy μ by a Monte Carlo method and updating the Actor-Critic network parameters;
10) repeating steps 3) to 9) until the Actor-Critic network converges, thereby obtaining the optimal solution of the bilateral benefit balancing objective function.
2. The cloud-edge cooperative double-profit equilibrium taboo reinforcement learning resource allocation method according to claim 1, characterized in that: in step 1), at each scheduling time t:
each computing node transmits its own state to the resource allocator, the state specifically including: CPU resource margin, memory resource margin and storage resource margin;
each user transmits his or her own computing task requirements to the resource allocator by means of a terminal device, the requirements specifically including: the user's position, the size of the task, the CPU resource requirement, the memory resource requirement and the storage resource requirement.
3. The cloud-edge cooperative double-profit equilibrium taboo reinforcement learning resource allocation method according to claim 1, characterized in that: the resource allocator in step 1) stores the user requirements and the states of the computing nodes in the form of matrices:
U^t = [ s^t_{1,cpu}  s^t_{1,mem}  s^t_{1,storage} ; … ; s^t_{k,cpu}  s^t_{k,mem}  s^t_{k,storage} ]
C^t = [ c^t_{1,cpu}  c^t_{1,mem}  c^t_{1,storage} ; … ; c^t_{m,cpu}  c^t_{m,mem}  c^t_{m,storage} ]
wherein U^t represents the user demand matrix at time t; k represents the total number of users at time t; s^t_{k,cpu} represents the kth user's demand for CPU resources; s^t_{k,mem} represents the kth user's demand for memory resources; s^t_{k,storage} represents the kth user's demand for storage resources; C^t represents the state matrix of the computing nodes at time t; m represents the total number of computing nodes; c^t_{m,cpu} represents the CPU resource margin of the mth computing node; c^t_{m,mem} represents the memory resource margin of the mth computing node; c^t_{m,storage} represents the storage resource margin of the mth computing node.
4. The cloud-edge cooperative double-profit equilibrium taboo reinforcement learning resource allocation method according to claim 1, characterized in that: the user benefit optimization objective function in step 2) is composed of the average task execution time of all users:
ART = (1/k) * Σ_{i=1}^{k} art_i
wherein art_i represents the task execution time of user i; ART represents the average task execution time of all users; k denotes the number of users; art_i comprises the delay of transmitting the task to the computing node, the delay of waiting to execute in the computing node, and the computing time of the task:
art_i = art_delay + art_wait + art_computing
wherein art_delay denotes the delay of transmitting the task to the computing node, art_wait denotes the delay of the task waiting to execute in the computing node, and art_computing denotes the time for which the task is computed in the computing node.
5. The cloud-edge cooperative double-profit equilibrium taboo reinforcement learning resource allocation method according to claim 1, characterized in that: the service provider benefit optimization objective function in step 2) is composed of the resource utilization rates of all the computing nodes:
(the defining formulas for asr_j and ASR are given as images in the original document)
wherein asr_j represents the resource utilization rate of computing node j; ASR represents the resource utilization rate of all the computing nodes; N represents the total number of resource types n; c^t_{m,n} represents the remaining margin of the nth resource of the mth computing node at time t; s^t_{k,n} represents the requirement of the kth task for the nth resource at time t; A^t represents the scheduling action selected by the scheduler at time t.
6. The cloud-edge cooperative double-profit equilibrium taboo reinforcement learning resource allocation method according to claim 1, characterized in that: the bilateral benefit balancing objective function in step 2) is composed of the user benefit optimization objective function and the service provider benefit optimization objective function:
(the defining formula for Z is given as an image in the original document)
wherein Z represents the bilateral benefit balancing objective function; θ represents the weight coefficient of the user benefit optimization objective function, and a second weight coefficient (shown only as an image in the original) represents the weight of the service provider benefit optimization objective function.
7. The cloud-edge cooperative double-profit equilibrium taboo reinforcement learning resource allocation method according to claim 1, characterized in that: the resource allocator in step 6) evaluates the policy μ using the Bellman equation:
Q^μ(s_t, a_i, η^μ) = E[Reward + γ*Q^μ(s_{t+1}, μ(s_{t+1}, η^Q), η^μ)]
wherein E denotes the expectation.
8. The cloud-edge cooperative double-profit equilibrium taboo reinforcement learning resource allocation method according to claim 1, characterized in that: the method for measuring the policy μ by the Monte Carlo method and updating the Actor-Critic network parameters in step 9) is:
∇_{η^μ} J ≈ (1/X) * Σ_{i=1}^{X} ∇_a Q^μ(s_t, a, η^Q)|_{a=μ(s_t)} * ∇_{η^μ} μ(s_t, η^μ)
η^{Q'} ← v*η^Q + (1 - v)*η^{Q'}
η^{μ'} ← v*η^μ + (1 - v)*η^{μ'}
wherein ∇ denotes the gradient, and v is the update factor with a value of 0.001.
9. The cloud-edge cooperative double-profit equilibrium taboo reinforcement learning resource allocation method according to claim 4, characterized in that: the delay art_delay of transmitting the task to the computing node is calculated as follows:
art_delay = α * Distance_ij
Distance_ij = R * arccos[ sin(Mlat_i)*sin(Mlat_j)*cos(Mlon_i - Mlon_j) + cos(Mlat_i)*cos(Mlat_j) ] * π / 180;
wherein α is the delay factor; Distance_ij represents the distance between user i and computing node j; R represents the average radius of the earth, taken as 6371.004 km; π represents the circular constant; Mlat_i represents the calculated latitude value of user i and Mlon_i represents the calculated longitude value of user i.
10. The cloud-edge cooperative double-profit equilibrium taboo reinforcement learning resource allocation method according to claim 4, characterized in that: the delay art_wait of the task waiting to execute in the computing node is calculated as follows:
art_wait = task_begin - task_arrive
wherein task_begin represents the time at which computation of the task begins, obtained from the system record; task_arrive represents the arrival time of the task, obtained from the system record;
the time art_computing for which the task is computed in the computing node is calculated as follows:
art_computing = task_size / f_j
wherein task_size is the size of the task and f_j represents the computation frequency of computing node j.
CN202111209997.8A 2021-10-18 2021-10-18 Cloud-edge cooperative double-profit equilibrium taboo reinforcement learning resource allocation method Active CN114116156B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111209997.8A CN114116156B (en) 2021-10-18 2021-10-18 Cloud-edge cooperative double-profit equilibrium taboo reinforcement learning resource allocation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111209997.8A CN114116156B (en) 2021-10-18 2021-10-18 Cloud-edge cooperative double-profit equilibrium taboo reinforcement learning resource allocation method

Publications (2)

Publication Number Publication Date
CN114116156A (en) 2022-03-01
CN114116156B CN114116156B (en) 2022-09-09

Family

ID=80376227

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111209997.8A Active CN114116156B (en) 2021-10-18 2021-10-18 Cloud-edge cooperative double-profit equilibrium taboo reinforcement learning resource allocation method

Country Status (1)

Country Link
CN (1) CN114116156B (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111444009A (en) * 2019-11-15 2020-07-24 北京邮电大学 Resource allocation method and device based on deep reinforcement learning
CN111813539A (en) * 2020-05-29 2020-10-23 西安交通大学 Edge computing resource allocation method based on priority and cooperation
CN111918339A (en) * 2020-07-17 2020-11-10 西安交通大学 AR task unloading and resource allocation method based on reinforcement learning in mobile edge network
CN112367353A (en) * 2020-10-08 2021-02-12 大连理工大学 Mobile edge computing unloading method based on multi-agent reinforcement learning
CN112363829A (en) * 2020-11-03 2021-02-12 武汉理工大学 Dynamic user resource allocation method based on elastic scale aggregation
CN112351433A (en) * 2021-01-05 2021-02-09 南京邮电大学 Heterogeneous network resource allocation method based on reinforcement learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
BINBIN HUANG et al.: "Deep Reinforcement Learning for Performance-Aware Adaptive Resource Allocation in Mobile Edge Computing", Wireless Communications and Mobile Computing *
CHEN MINCHENG et al.: "Two-Sided Matching Scheduling Using Multi-Level Look-Ahead Queue of Supply and Demand", 18th International Conference on Service-Oriented Computing (ICSOC) *
YE QING et al.: "Simulation Study on Quantum Learning Algorithm for Virtual Resource Allocation Optimization" (虚拟资源分配优化量子学习算法仿真研究), Computer Simulation (计算机仿真) *

Also Published As

Publication number Publication date
CN114116156B (en) 2022-09-09

Similar Documents

Publication Publication Date Title
WO2021248607A1 (en) Deep reinforcement learning-based taxi dispatching method and system
CN108009023B (en) Task scheduling method based on BP neural network time prediction in hybrid cloud
Pillai et al. Resource allocation in cloud computing using the uncertainty principle of game theory
Shi et al. Location-aware and budget-constrained service deployment for composite applications in multi-cloud environment
Sotiriadis et al. Towards inter-cloud schedulers: A survey of meta-scheduling approaches
Murad et al. A review on job scheduling technique in cloud computing and priority rule based intelligent framework
Keshk et al. Cloud task scheduling for load balancing based on intelligent strategy
CN113037877B (en) Optimization method for time-space data and resource scheduling under cloud edge architecture
CN113238847B (en) Distribution and scheduling method based on distributed network environment and capable of distributing tasks
CN109343945A (en) A kind of multitask dynamic allocation method based on contract net algorithm
CN116489708B (en) Meta universe oriented cloud edge end collaborative mobile edge computing task unloading method
Aloqaily et al. Fairness-aware game theoretic approach for service management in vehicular clouds
Salimi et al. Task scheduling with Load balancing for computational grid using NSGA II with fuzzy mutation
Liwang et al. Resource trading in edge computing-enabled IoV: An efficient futures-based approach
CN115225643A (en) Point cloud platform big data distributed management method, device and system
CN113139639B (en) MOMBI-oriented smart city application multi-target computing migration method and device
Zhou et al. DPS: Dynamic pricing and scheduling for distributed machine learning jobs in edge-cloud networks
CN114116156B (en) Cloud-edge cooperative double-profit equilibrium taboo reinforcement learning resource allocation method
Zhu et al. SAAS parallel task scheduling based on cloud service flow load algorithm
CN110012507B (en) Internet of vehicles resource allocation method and system with priority of user experience
Milocco et al. Evaluating the upper bound of energy cost saving by proactive data center management
Sotiriadis The inter-cloud meta-scheduling framework
Zhao et al. Hypergraph-based task-bundle scheduling towards efficiency and fairness in heterogeneous distributed systems
Sun et al. An intelligent resource allocation mechanism in the cloud computing environment
CN112700269A (en) Distributed data center selection method based on anisotropic reinforcement learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant