CN111367657A - Computing resource collaborative cooperation method based on deep reinforcement learning - Google Patents

Computing resource collaborative cooperation method based on deep reinforcement learning

Info

Publication number
CN111367657A
CN111367657A CN202010107300.5A CN202010107300A CN111367657A CN 111367657 A CN111367657 A CN 111367657A CN 202010107300 A CN202010107300 A CN 202010107300A CN 111367657 A CN111367657 A CN 111367657A
Authority
CN
China
Prior art keywords
state
edge server
cpu
experience
action
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010107300.5A
Other languages
Chinese (zh)
Other versions
CN111367657B (en)
Inventor
陈沛锐
于秀兰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202010107300.5A priority Critical patent/CN111367657B/en
Publication of CN111367657A publication Critical patent/CN111367657A/en
Application granted granted Critical
Publication of CN111367657B publication Critical patent/CN111367657B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • H04L67/1001Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • H04L67/1004Server selection for load balancing
    • H04L67/1008Server selection for load balancing based on parameters of servers, e.g. available memory or workload

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer And Data Communications (AREA)

Abstract

The invention relates to a computing resource collaborative cooperation method based on deep reinforcement learning, and belongs to the field of edge computing resource allocation. The method comprises the following steps: deploying edge servers in a honeycomb (cellular) layout in a dense 5G area; regarding each edge server as an agent and recording samples of its computing-resource state and the corresponding action over a period of time; at each time t, randomly selecting a state sample from the experience replay to obtain an experience tuple, and storing each experience tuple in the experience replay to accumulate experience; deriving new experience tuples from the Q value at the same time and adding them to the replay; and iterating the Q value by training a target-net and an eval-net to obtain a near-optimal cooperation decision. The invention breaks the correlation between state samples, makes the samples mutually independent, and improves the utilization rate of computing resources in cooperative cooperation.

Description

Computing resource collaborative cooperation method based on deep reinforcement learning
Technical Field
The invention belongs to the field of edge computing resource allocation, and relates to a computing resource collaborative cooperation method based on deep reinforcement learning.
Background
Currently, the Internet of Things (IoT) extends Internet technology to connect ubiquitous Mobile Devices (MDs) and sensors over wireless networks. The Internet of Things has been widely adopted in many fields, and the amount of data in mobile Internet applications is growing exponentially. To improve efficiency, the pursuit of low delay has become a trend. In the traditional model, however, data is uploaded from the terminal device to the cloud, computed there, and then transmitted back to the terminal device; traditional cloud computing therefore cannot meet the increasingly high requirements on computing efficiency.
5G can connect countless intelligent devices and enables data sharing and interaction. In addition, 5G underpins the basic idea of the Internet of Things, expanding its coverage and bringing billions of MDs online. The demand for data services has surged, creating new challenges for service providers and mobile network operators. Many 5G applications, such as face recognition and natural language processing, can run on the terminal. Therefore, to exploit computation offloading, the offloading decisions and the related radio resource allocation need to be managed jointly, a problem that has attracted much attention from researchers. Edge Computing (EC) enables a mobile terminal to transfer computing tasks to a nearby edge server, which in turn allows cooperative edge artificial intelligence, and provides an effective way to meet the growing demand for large-scale cluster computing and to realize efficient cooperative computation transfer and cooperation.
Disclosure of Invention
In view of the above, the present invention provides a computing resource collaborative cooperation method based on deep reinforcement learning.
In order to achieve the purpose, the invention provides the following technical scheme:
a computing resource collaborative cooperation method based on deep reinforcement learning comprises the following steps:
the method comprises the following steps: for seamless connection, the edge servers are formed into a honeycomb shape and are deployed in a 5G network dense area;
step two: regarding each edge server as an agent, taking the state of the computing resource and the corresponding action recorded at a certain moment as samples and putting the samples into an experience replay;
step three: in order to increase the independence of the samples, randomly selecting a state sample from the experience replay at each time t to obtain an experience tuple, and then storing each experience tuple into the experience replay to accumulate and store experiences;
step four: and iterating the Q value through the target network target net and the evaluation network eval net to obtain a new state, putting the new state into an experience replay, updating the weight parameter by using the loss function, finally obtaining an optimal approximate solution, and obtaining an optimal decision of edge server cooperation.
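The four steps above can be illustrated with a minimal sketch of the training loop. This is only an illustration under assumed interfaces: the env/agent objects with their reset, step, act, learn and sync_target methods, as well as the buffer size, batch size and copy interval, are hypothetical placeholders rather than part of the claimed method.

```python
# Minimal sketch of steps one to four (interfaces and sizes are assumptions,
# not part of the claimed method).
import random
import collections

ReplayItem = collections.namedtuple("ReplayItem", "state action reward next_state")

def train_cooperation_policy(env, agent, episodes=500, batch_size=32, copy_every=100):
    """env: edge-server cooperation environment with reset()/step(action);
    agent: DQN-style agent with act(), learn() and sync_target()."""
    replay = collections.deque(maxlen=10000)        # steps two/three: experience replay
    step = 0
    for _ in range(episodes):
        state = env.reset()                         # computing-resource state sample
        done = False
        while not done:
            action = agent.act(state)               # epsilon-greedy cooperation action
            next_state, reward, done = env.step(action)
            replay.append(ReplayItem(state, action, reward, next_state))
            if len(replay) >= batch_size:           # random minibatch breaks sample correlation
                agent.learn(random.sample(list(replay), batch_size))
            if step % copy_every == 0:              # step four: copy eval-net weights to target-net
                agent.sync_target()
            state, step = next_state, step + 1
    return agent
```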
Optionally, in the first step, the effort and time spent by the edge server in receiving the collaborative calculation result are ignored;
considering the system model, N mobile users offload computation tasks to the edge server over the wireless link; each user has M independent tasks to be completed;
to model the tasks, a cellular network shape is used to maximize the coverage utilization of the edge servers; by jointly optimizing the offloading decision of each edge task, the allocation of server computing resources, and the transmission and reception of tasks, an optimization problem is formulated that minimizes the energy consumption of completing the computing tasks while fully utilizing the computing resources.
Optionally, in the second step, each edge server is regarded as an agent, and the computing-resource state of the CPU, the task amount and the energy consumption at each moment are taken as a state sample. The CPU idle profile of a partner is defined as the amount of the terminal device's data (in bits) that the partner can compute within the duration t ∈ [0, T], denoted U_bit(t);
the CPU state information of a cooperating edge server with a free CPU, i.e. the state of its CPU over time, is recorded by defining the following cooperative CPU event space, process and epochs, where α = {α1, α2} denotes the CPU state space of the cooperating edge server, and α1 and α2 denote the cooperating edge server switching from busy to idle and from idle to busy, respectively; the cooperative edge processor process is then defined as the sequence of time instants of the CPU events {s_k : k = 1, 2, 3, …}, with the time interval between two successive events T_k = s_k − s_{k−1} called an epoch;
knowledge of the CPU process allows offline design of the cooperative computing strategy. Given a sample path of the partner CPU process, let I_k denote the CPU state indicator for each epoch k, where the values 1 and 0 represent the idle and busy states, respectively; the CPU idle profile U_bit(t) of the server is then accumulated over the idle epochs of this sample path;
the edge collaborator is assumed to have non-causal knowledge of the cooperating edge server's CPU idle profile; a q-bit buffer is assumed to be reserved for storing offloaded data before the partner processes it in the CPU;
two forms of data arrival at an edge server are considered. For one-shot task arrival it is assumed that an L-bit input arrives at time t = 0, so the event space and process of the edge server CPU follow the definitions above; bursty data arrival, on the other hand, forms a random process. For bursty data arrival, the combined event space α̃ = {α1, α2, α3} is used, where α3 denotes a new task arriving at the edge server, and the corresponding process is the sequence of variables {s_k : k = 1, 2, 3, …} representing the time instants of the event sequence. Moreover, at each instant s_k, let L_k denote the size of the arriving data, where L_k = 0 corresponds to the α1 and α2 states and L_k ≠ 0 corresponds to the α3 state; a task must arrive before its deadline, otherwise it cannot be computed, and the total input data is the sum of the arrivals L_k;
Then taking the state of the computing resource at each moment as a state sample;
after selecting a state sample, an action is selected to represent how two different adjacent edge servers collaborate; each action corresponds to a particular change/shift between two different adjacent states. The variable v ∈ {1, 2, …, NM + 3N} indexes the possible state changes, and the action a(t) = {a_v(t)} is a 1 × (NM + 3N) vector whose entries depend on the selection of v; the possible actions are as follows;
when 1 ≤ v ≤ NM, the corresponding action a_v(t) changes the offloading decision of task x_nm: if a_v(t) = 1 the offloading decision x_nm is set to 1, and if a_v(t) = 0 it is set to 0; integer division and the remainder operation mod(v, M) are used to recover the server index and task index addressed by v;
when NM + 1 ≤ v ≤ NM + 2N, the corresponding action a_v(t) arranges the cooperative computing resources of the edge server: if a_v(t) = 1 the cooperative computing resource is allocated, and if a_v(t) = 0 it is released; the cooperative computing resources are then updated as a function of C_co, C_co,max and N_do,tot, where C_co is the number of CPU cycles the edge server's CPU uses to process the computing task, C_co,max is the maximum number of cycles the CPU can compute, and N_do,tot is the number of CPU cores.
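As one way of reading the index ranges above, the following sketch decodes an action index v into either an offloading flip or a cooperative-resource arrangement. The offload matrix, the coop_resource vector and the exact index-to-server mapping are illustrative assumptions, not the claimed encoding.

```python
def decode_action(v, N, M, offload, coop_resource):
    """Decode a 1-based action index v per the ranges described above.
    offload: N x M list of offloading decisions x_nm in {0, 1};
    coop_resource: length-N list of cooperative CPU allocation flags.
    Both containers and the index mapping are illustrative assumptions."""
    if 1 <= v <= N * M:
        # flip the offloading decision of the task addressed by v
        n, m = (v - 1) // M, (v - 1) % M            # integer division and mod(v, M)
        offload[n][m] = 1 - offload[n][m]
    elif N * M + 1 <= v <= N * M + 2 * N:
        # arrange (grant or release) the cooperative computing resource of a server
        n = (v - N * M - 1) % N
        coop_resource[n] = 1 - coop_resource[n]
    # remaining indices up to NM + 3N are reserved for further resource actions
    return offload, coop_resource
```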
Optionally, in the third step, with the CPU and the computing task taken as the state sample and the action selection defined as above, the system state of any edge server is defined. In the initial stage, a corresponding action is taken for the corresponding state; a state sample is selected as the state at the given time, and a specific action is taken so as to obtain the maximum accumulated reward, namely the Q value;
Q_π(s, a) = E_π[r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … | A_t = a, S_t = s]
Q(s, a) ← Q(s, a) + δ[r + γ max_{a′} Q(s′, a′) − Q(s, a)]
where Q(s, a) is the action-state value function, rewards after time t are discounted, and γ controls the decay of the Q function: the closer γ is to 1, the more later decisions are taken into account; the closer γ is to 0, the more the immediate reward is emphasized;
at this point there is an experience tuple table D_t = (e_1, …, e_t) recording each pair of state and action; while the experience replay of tuples e_t = (s_t, a, r_t, s_{t+1}) is not yet full, a state sample is randomly selected at each time step t to obtain an experience tuple, and each experience tuple is then stored in the experience replay to accumulate experience.
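A minimal experience-replay container of the kind described above, storing tuples (s_t, a, r_t, s_{t+1}) and sampling them uniformly at random, might look as follows; the capacity and batch size are illustrative assumptions.

```python
import random
from collections import deque

class ExperienceReplay:
    """Stores experience tuples (s_t, a, r_t, s_{t+1}) and returns random
    minibatches, so correlated consecutive samples are not trained on
    back-to-back (illustrative sketch; sizes are assumptions)."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=32):
        # uniform random sampling breaks the temporal correlation between samples
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

    def __len__(self):
        return len(self.buffer)
```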
Optionally, in the fourth step, unlike the tabular method above, the experience tuples e_t = (s_t, a, r_t, s_{t+1}) drawn from the experience replay are fed into two identical neural networks for training, namely the target network (target net) and the evaluation network (eval net); the output value Q_tar of the target net represents the discounted score when action a is selected in the current state sample s of the edge server, i.e.:
Q_tar(s, a) = r + γ max_{a′} Q(s′, a′, w′)
where r and s′ respectively denote the score obtained and the next observed state when action a is taken in the current state s of the edge server; γ is the attenuation (discount) factor; a′ is the action taken by the edge server in state s′, and w′ is the weight parameter of the target net;
the output value Q_eval of the eval net represents the score when action a is taken in the current state sample s of the edge server:
Q_eval(s, a) = Q(s, a, w)
where w is the weight parameter of the eval net;
an ε-greedy strategy is adopted to obtain the action a, so that while decisions are generated by the network for cooperating with a single adjacent edge server, several candidate cooperating edge servers can still be explored with a certain probability; the experience tuples in the experience pool are continuously updated and used as input of the target net and eval net to obtain Q_eval and Q_tar; the difference between Q_eval and Q_tar is used as the loss function, and the weight parameters of the evaluation network are updated by gradient descent; for training to converge, the weight parameters of the target network are updated by copying the weight parameters of the evaluation network at regular intervals, and the model is as follows:
Q_target(s_t, a) = r + γ max_{a′} Q(s_{t+1}, a′, w)
where s_t and a respectively denote the current state of the edge server and the action currently taken, r denotes the reward obtained by taking this action, γ is the discount factor, s_{t+1} denotes the state of the next step, and w is the weight vector used to fit the deep neural network;
then, the gradient descent algorithm is used to minimize the difference between the target network output and the prediction, i.e. the eval (main) network output:
Loss = (Q_target(s_t, a) − Q_eval(s_t, a, w))²
finally, the two neural networks are trained with the experience tuples and the Q value is iterated continuously, so that an edge server with limited computing resources obtains a near-optimal solution, which is used as the optimal strategy for the edge servers to cooperate with each other.
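One way to realize the eval-net/target-net split, the squared loss and the periodic weight copy described above is sketched below with a deliberately simplified linear Q-approximation (the description itself uses a deeper network); all shapes, the learning rate and the discount factor are assumptions.

```python
import numpy as np

class TinyDQN:
    """Linear Q-approximation sketch of the eval-net / target-net scheme.
    A real embodiment would use a deeper network; this only illustrates the
    Q_eval / Q_tar split, the squared loss and the periodic weight copy."""
    def __init__(self, state_dim, n_actions, lr=0.01, gamma=0.9):
        self.w_eval = np.zeros((state_dim, n_actions))   # eval-net weights w
        self.w_tar = self.w_eval.copy()                   # target-net weights w'
        self.lr, self.gamma = lr, gamma

    def q_eval(self, s):
        return s @ self.w_eval                             # Q_eval(s, a; w)

    def q_tar(self, s):
        return s @ self.w_tar                              # Q_tar(s, a; w')

    def learn(self, batch):
        for s, a, r, s_next in batch:
            target = r + self.gamma * np.max(self.q_tar(s_next))  # r + γ max_a' Q_tar
            pred = self.q_eval(s)[a]
            err = pred - target                            # gradient of 0.5*(pred-target)^2
            self.w_eval[:, a] -= self.lr * err * s         # gradient descent on eval-net

    def sync_target(self):
        self.w_tar = self.w_eval.copy()                    # copy eval weights into target-net
```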
The invention has the following beneficial effects: the acquired computing-resource states and the corresponding selected actions are placed as samples into an experience pool, and samples are then drawn at random from the experience pool to train the two neural networks. This first step breaks the correlation between samples and makes them independent of each other. A fixed Q target network is used to compute the target value of the network, which requires an existing Q value; this Q value is provided by a network that is updated more slowly. This improves the stability and convergence of training, and thereby further improves the efficiency and reduces the cost of collaboration between edge servers.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.
Drawings
For the purposes of promoting a better understanding of the objects, aspects and advantages of the invention, reference will now be made to the following detailed description taken in conjunction with the accompanying drawings in which:
FIG. 1 is a flow chart of the computing resource collaborative cooperation method based on deep reinforcement learning;
FIG. 2 is a schematic diagram of an edge server deployment;
FIG. 3 is a schematic diagram of the cooperation under the optimal solution of the demand.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention in a schematic way, and the features in the following embodiments and examples may be combined with each other without conflict.
The drawings are provided for the purpose of illustrating the invention only and are not intended to limit it; to better illustrate the embodiments of the present invention, some parts of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings, and descriptions thereof, may be omitted.
The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components; in the description of the present invention, it should be understood that if there is an orientation or positional relationship indicated by terms such as "upper", "lower", "left", "right", "front", "rear", etc., based on the orientation or positional relationship shown in the drawings, it is only for convenience of description and simplification of description, but it is not an indication or suggestion that the referred device or element must have a specific orientation, be constructed in a specific orientation, and be operated, and therefore, the terms describing the positional relationship in the drawings are only used for illustrative purposes, and are not to be construed as limiting the present invention, and the specific meaning of the terms may be understood by those skilled in the art according to specific situations.
The technical scheme for solving the technical problems is as follows:
referring to fig. 1 to 3, the following description is made with reference to the accompanying drawings, and the present invention includes the following steps:
the method comprises the following steps: the effort and time spent by the edge servers in receiving the collaborative computing results are negligible, because they are typically much smaller than those of the offloaded tasks; extending the present analysis to include this overhead is straightforward but tedious. Considering the system model, N mobile users offload their computing tasks to the edge server over the wireless link. Each user has M independent tasks to be completed. To model the tasks, a cellular network shape is used to maximize the coverage utilization of the edge servers. Then, by jointly optimizing the offloading decision of each edge task, the allocation of server computing resources, and the transmission and reception of tasks, an optimization problem is formulated that minimizes the energy consumption of completing the computing tasks while fully utilizing the computing resources, as shown in fig. 2.
And step two, each edge server is regarded as an agent, and its computing-resource state (CPU, task amount, energy consumption, etc.) at each moment is taken as a state sample. The CPU idle profile of a partner is defined as the amount of the terminal device's data (in bits) that the partner can compute within the duration t ∈ [0, T], denoted U_bit(t). The CPU state information of a cooperating edge server with a free CPU, i.e. the state of its CPU over time, can be recorded by defining a cooperative CPU event space, process and epochs, where α = {α1, α2} denotes the CPU state space of the cooperating edge server, and α1 and α2 denote the cooperating edge server switching from busy to idle and from idle to busy, respectively. The cooperative edge processor process can then be defined as the sequence of time instants of the CPU events {s_k : k = 1, 2, 3, …}, and the time interval between two successive events, T_k = s_k − s_{k−1}, is called an epoch.
Knowledge of the CPU process allows offline design of the cooperative computing strategy. Given a sample path of the partner CPU process, let I_k denote the CPU state indicator for each epoch k, with values 1 and 0 indicating the idle and busy states, respectively. Further, f_h denotes the constant CPU frequency of the collaborator, and C denotes the number of CPU cycles required to compute 1 bit of the user's input data. Based on the above definitions, the CPU idle profile U_bit(t) of the server accumulates the bits the collaborator can compute during its idle epochs.
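Under one natural reading of the definitions above (idle indicator I, frequency f_h, C cycles per bit), the idle profile can be accumulated over a discretized sample path as sketched below; the time discretization and the exact accumulation rule are assumptions, not the patented formula.

```python
def cpu_idle_profile(idle_indicator, f_h, C, dt):
    """Accumulate the bits a cooperating CPU could process while idle.
    idle_indicator: sequence of I(t) in {0, 1} sampled every dt seconds;
    f_h: collaborator CPU frequency (cycles/s); C: cycles needed per bit.
    The time-discretized accumulation U(t) ~ (f_h / C) * sum(I * dt) is an
    assumed reading of the idle-profile definition, not a verbatim formula."""
    u, profile = 0.0, []
    for i in idle_indicator:
        u += (f_h / C) * i * dt       # bits computable during this slot if idle
        profile.append(u)
    return profile
```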
The edge collaborator is assumed to have non-causal knowledge of the cooperating edge server's CPU idle profile. Finally, a q-bit buffer is assumed to be reserved for storing offloaded data before the partner processes it in the CPU.
Two forms of data arrival at an edge server are considered. For one-shot task arrival it is assumed that an L-bit input arrives at time t = 0, so the event space and process of the edge server CPU follow the definitions above. Bursty data arrival, on the other hand, forms a random process. For brevity, it is useful to define a random process that combines the data arrivals of both forms with the edge server CPU process, yielding a combined random process for bursty task arrivals. For bursty data arrival, the combined event space α̃ = {α1, α2, α3} is used, where α1 and α2 have been introduced above and α3 denotes a new task arriving at the edge server; the corresponding process is the sequence of variables {s_k : k = 1, 2, 3, …} representing the time instants of the event sequence. Moreover, at each instant s_k, let L_k denote the size of the arriving data, where L_k = 0 corresponds to the α1 and α2 states and L_k ≠ 0 corresponds to the α3 state. In addition, a task must arrive before its deadline, otherwise it cannot be computed, and the total input data is the sum of the arrivals L_k.
Then, the state of the computing resources such as the CPU, the task amount and the like at each moment is taken as a state sample.
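For illustration, the per-moment state sample combining CPU state, task amount and energy consumption could be packed into a flat vector as below; the field set and normalization constants are assumptions.

```python
import numpy as np

def build_state_sample(cpu_idle, task_bits, energy, buffer_bits, q_bits):
    """Pack one edge server's computing-resource state at time t into a vector.
    cpu_idle: 0/1 CPU idle indicator; task_bits: pending task amount in bits;
    energy: accumulated energy consumption; buffer_bits/q_bits: used and total
    offloading buffer. Normalization constants are illustrative assumptions."""
    return np.array([
        float(cpu_idle),
        task_bits / 1e6,               # task amount, normalized to Mbit
        energy,                        # accumulated energy consumption
        buffer_bits / max(q_bits, 1),  # buffer occupancy in [0, 1]
    ], dtype=np.float32)
```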
After a state sample is selected, an action is selected to represent how two different adjacent edge servers collaborate; each action corresponds to a particular change/shift between two different adjacent states. The variable v ∈ {1, 2, …, NM + 3N} indexes the possible state changes, and the action a(t) = {a_v(t)} is a 1 × (NM + 3N) vector whose entries depend on the selection of v; the possible actions are as follows.
When 1 ≤ v ≤ NM, the corresponding action a_v(t) changes the offloading decision of task x_nm: if a_v(t) = 1 the offloading decision x_nm is set to 1, and if a_v(t) = 0 it is set to 0; integer division and the remainder operation mod(v, M) are used to find the server index and task index addressed by v.
When NM + 1 ≤ v ≤ NM + 2N, the corresponding action a_v(t) arranges the cooperative computing resources of the edge server: if a_v(t) = 1 the cooperative computing resource is allocated, and if a_v(t) = 0 it is released. The cooperative computing resources are then updated as a function of C_co, C_co,max and N_do,tot, where C_co is the number of CPU cycles the edge server's CPU uses to process the computing task, C_co,max is the maximum number of cycles the CPU can compute, and N_do,tot is the number of CPU cores.
Step three: with the CPU and the computing task taken as the state sample and the action selection defined as above, the system state of any edge server is defined. In the initial stage, a corresponding action is taken for the corresponding state; a state sample is selected as the state at the given time, and a specific action is then taken so as to obtain the maximum accumulated reward, i.e. the Q value:
Q_π(s, a) = E_π[r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … | A_t = a, S_t = s]
Q(s, a) ← Q(s, a) + δ[r + γ max_{a′} Q(s′, a′) − Q(s, a)]
where Q(s, a) is the action-state value function, rewards after time t are discounted, and γ controls the decay of the Q function: the closer γ is to 1, the more later decisions are taken into account; the closer γ is to 0, the more the immediate reward is emphasized.
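The tabular form of the Q-value iteration written above can be sketched as a one-step update; the dictionary-based Q table and the learning-rate value are assumptions for illustration.

```python
def q_update(Q, s, a, r, s_next, actions, delta=0.1, gamma=0.9):
    """One step of Q(s,a) <- Q(s,a) + delta*[r + gamma*max_a' Q(s',a') - Q(s,a)].
    Q is a dict keyed by (state, action); delta is the learning rate."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + delta * (r + gamma * best_next - old)
    return Q
```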
At this point there is an experience tuple table D_t = (e_1, …, e_t) recording each pair of state and action. While the experience replay of tuples e_t = (s_t, a, r_t, s_{t+1}) is not yet full, a state sample is randomly selected at each time step t to obtain an experience tuple; each experience tuple is then stored in the experience replay, accumulating experience.
Step four: unlike the tabular method above, the experience tuples e_t = (s_t, a, r_t, s_{t+1}) drawn from the experience replay are fed into two identical neural networks (three convolutional layers and two fully-connected layers) for training, namely the target net and the eval net. The output value Q_tar of the target net represents the discounted score when action a is selected in the current state sample s of the edge server, i.e.:
Q_tar(s, a) = r + γ max_{a′} Q(s′, a′, w′)
where r and s′ respectively denote the score obtained and the next observed state when action a is taken in the current state s of the edge server; γ is the attenuation (discount) factor; a′ is the action taken by the edge server in state s′, and w′ is the weight parameter of the target net.
The output value Q_eval of the eval net represents the score when action a is taken in the current state sample s of the edge server:
Q_eval(s, a) = Q(s, a, w)
where w is the weight parameter of the eval net.
An ε-greedy strategy (with ε gradually reduced from 1 to 0) is adopted to obtain the action a, so that while decisions are generated by the network for cooperating with a single adjacent edge server, several candidate cooperating edge servers can still be explored with a certain probability, which avoids getting stuck in a suboptimal solution. Here, the experience tuples in the experience pool are continuously updated and used as the input of the target net and eval net to obtain Q_eval and Q_tar. The difference between Q_eval and Q_tar is used as the loss function, and the weight parameters of the evaluation network are updated by gradient descent. For training to converge, the weight parameters of the target network are updated by copying the weight parameters of the evaluation network at regular intervals; the model is as follows:
Q_target(s_t, a) = r + γ max_{a′} Q(s_{t+1}, a′, w)
where s_t and a respectively denote the current state of the edge server and the action currently taken, r denotes the reward obtained by taking this action, γ is the discount factor, s_{t+1} denotes the state of the next step, and w is the weight vector used to fit the deep neural network.
Then, the difference between the target network output and the prediction is minimized using a gradient descent algorithm:
Loss = (Q_target(s_t, a) − Q_eval(s_t, a, w))²
Finally, the two neural networks are trained with the experience tuples and the Q value is iterated continuously, so that an edge server with limited computing resources obtains a near-optimal solution, which is used as the optimal strategy for the edge servers to cooperate with each other.
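The ε-greedy selection (ε annealed from 1 towards 0) and the periodic copy of eval-net weights into the target net described above might be realized as in the following sketch; the annealing schedule, the copy interval and the agent.sync_target() hook are assumptions.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon):
    """Pick a random cooperation action with probability epsilon (exploring
    several candidate cooperating edge servers), otherwise the greedy one."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))
    return int(np.argmax(q_values))

def anneal_epsilon(step, eps_start=1.0, eps_end=0.0, decay_steps=10000):
    """Linearly anneal epsilon from 1 towards 0 (schedule is an assumption)."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def maybe_sync_target(agent, step, copy_every=200):
    """Every copy_every steps, copy the eval-net weights into the target net
    to stabilize training (interval and hook are assumptions)."""
    if step % copy_every == 0:
        agent.sync_target()
```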
Finally, the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit the present invention, and although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions, and all of them should be covered by the claims of the present invention.

Claims (5)

1. A computing resource collaborative cooperation method based on deep reinforcement learning is characterized in that: the method comprises the following steps:
the method comprises the following steps: for seamless connection, the edge servers are formed into a honeycomb shape and are deployed in a 5G network dense area;
step two: regarding each edge server as an agent, recording the state of the computing resource at a certain moment and the corresponding action as a sample and putting the sample into an experience replay;
step three: in order to increase the independence of the samples, randomly selecting a state sample from the experience replay at each time t to obtain an experience tuple, and then storing each experience tuple into the experience replay to accumulate and store experiences;
step four: and iterating the Q value through the target network (target net) and the evaluation network (eval net) to obtain a new state, putting the new state into the experience replay again, updating the weight parameters by using the loss function, finally obtaining an optimal approximate solution, and obtaining an optimal decision of edge server cooperation.
2. The computing resource collaborative cooperation method based on deep reinforcement learning according to claim 1, wherein: in the first step, the effort and time spent by the edge server on receiving the collaborative calculation result are ignored;
considering the system model, N mobile users offload computation tasks to the edge server over the wireless link; each user has M independent tasks to be completed;
to model the tasks, a cellular network shape is used to maximize the coverage utilization of the edge servers; by jointly optimizing the offloading decision of each edge task, the allocation of server computing resources, and the transmission and reception of tasks, an optimization problem is formulated that minimizes the energy consumption of completing the computing tasks while fully utilizing the computing resources.
3. The method according to claim 1, wherein in the second step, each edge server is regarded as an agent, and the computing-resource state of the CPU, the task amount and the energy consumption at each moment is taken as a state sample, wherein the CPU idle profile of a partner is defined as the amount of the terminal device's data that the partner can compute within the duration t ∈ [0, T], denoted U_bit(t);
the CPU state information of a cooperating edge server with a free CPU, i.e. the state of its CPU over time, is recorded by defining the following cooperative CPU event space, process and epochs, where α = {α1, α2} denotes the CPU state space of the cooperating edge server, and α1 and α2 denote the cooperating edge server switching from busy to idle and from idle to busy, respectively; the cooperative edge processor process is then defined as the sequence of time instants of the CPU events {s_k : k = 1, 2, 3, …}, with the time interval between two successive events T_k = s_k − s_{k−1} called an epoch;
knowledge of the CPU process allows offline design of the cooperative computing strategy. Given a sample path of the partner CPU process, let I_k denote the CPU state indicator for each epoch k, where the values 1 and 0 represent the idle and busy states, respectively; the CPU idle profile U_bit(t) of the server is then accumulated over the idle epochs of this sample path;
the edge collaborator is assumed to have non-causal knowledge of the cooperating edge server's CPU idle profile; a q-bit buffer is assumed to be reserved for storing offloaded data before the partner processes it in the CPU;
two forms of data arrival at an edge server are considered. For one-shot task arrival it is assumed that an L-bit input arrives at time t = 0, so the event space and process of the edge server CPU follow the definitions above; bursty data arrival, on the other hand, forms a random process. For bursty data arrival, the combined event space α̃ = {α1, α2, α3} is used, where α3 denotes a new task arriving at the edge server, and the corresponding process is the sequence of variables {s_k : k = 1, 2, 3, …} representing the time instants of the event sequence. Moreover, at each instant s_k, let L_k denote the size of the arriving data, where L_k = 0 corresponds to the α1 and α2 states and L_k ≠ 0 corresponds to the α3 state; a task must arrive before its deadline, otherwise it cannot be computed, and the total input data is the sum of the arrivals L_k;
Then taking the state of the computing resource at each moment as a state sample;
after selecting a state sample, an action is selected to represent how two different adjacent edge servers collaborate; each action corresponds to a particular change/shift between two different adjacent states. The variable v ∈ {1, 2, …, NM + 3N} indexes the possible state changes, and the action a(t) = {a_v(t)} is a 1 × (NM + 3N) vector whose entries depend on the selection of v; the possible actions are as follows;
when 1 ≤ v ≤ NM, the corresponding action a_v(t) changes the offloading decision of task x_nm: if a_v(t) = 1 the offloading decision x_nm is set to 1, and if a_v(t) = 0 it is set to 0; integer division and the remainder operation mod(v, M) are used to find the server index and task index addressed by v;
when NM + 1 ≤ v ≤ NM + 2N, the corresponding action a_v(t) arranges the cooperative computing resources of the edge server: if a_v(t) = 1 the cooperative computing resource is allocated, and if a_v(t) = 0 it is released; the cooperative computing resources are then updated as a function of C_co, C_co,max and N_do,tot, where C_co is the number of CPU cycles the edge server's CPU uses to process the computing task, C_co,max is the maximum number of cycles the CPU can compute, and N_do,tot is the number of CPU cores.
4. The computing resource collaborative cooperation method based on deep reinforcement learning according to claim 1, wherein: in the third step, with the CPU and the computing task taken as the state sample and the action selection defined as above, the system state of any edge server is defined; in the initial stage, a corresponding action is taken for the corresponding state, a state sample is selected as the state of the given time, and a specific action is taken so as to obtain the maximum accumulated reward, namely the Q value;
Q_π(s, a) = E_π[r_{t+1} + γ r_{t+2} + γ² r_{t+3} + … | A_t = a, S_t = s]
Q(s, a) ← Q(s, a) + δ[r + γ max_{a′} Q(s′, a′) − Q(s, a)]
where Q(s, a) is the action-state value function, rewards after time t are discounted, and γ controls the decay of the Q function: the closer γ is to 1, the more later decisions are taken into account; the closer γ is to 0, the more the immediate reward is emphasized;
at this point there is an experience tuple table D_t = (e_1, …, e_t) recording each pair of state and action; while the experience replay of tuples e_t = (s_t, a, r_t, s_{t+1}) is not yet full, a state sample is randomly selected at each time step t to obtain an experience tuple, and each experience tuple is then stored in the experience replay to accumulate experience.
5. The computing resource collaborative cooperation method based on deep reinforcement learning according to claim 1, wherein: in the fourth step, unlike the tabular method above, the experience tuples e_t = (s_t, a, r_t, s_{t+1}) in the experience replay are fed into two identical neural networks for training, namely the target network (target net) and the evaluation network (eval net); the output value Q_tar of the target net represents the discounted score when action a is selected in the current state sample s of the edge server, i.e.:
Q_tar(s, a) = r + γ max_{a′} Q(s′, a′, w′)
where r and s′ respectively denote the score obtained and the next observed state when action a is taken in the current state s of the edge server; γ is the attenuation (discount) factor; a′ is the action taken by the edge server in state s′, and w′ is the weight parameter of the target net;
the output value Q_eval of the eval net represents the score when action a is taken in the current state sample s of the edge server:
Q_eval(s, a) = Q(s, a, w)
where w is the weight parameter of the eval net;
an ε-greedy strategy is adopted to obtain the action a, so that while decisions are generated by the network for cooperating with a single adjacent edge server, several candidate cooperating edge servers can still be explored with a certain probability; the experience tuples in the experience pool are continuously updated and used as input of the target net and eval net to obtain Q_eval and Q_tar; the difference between Q_eval and Q_tar is used as the loss function, and the weight parameters of the evaluation network are updated by gradient descent; for training to converge, the weight parameters of the target network are updated by copying the weight parameters of the evaluation network at regular intervals, and the model is as follows:
Q_target(s_t, a) = r + γ max_{a′} Q(s_{t+1}, a′, w)
where s_t and a respectively denote the current state of the edge server and the action currently taken, r denotes the reward obtained by taking this action, γ is the discount factor, s_{t+1} denotes the state of the next step, and w is the weight vector used to fit the deep neural network;
then, the gradient descent algorithm is used to minimize the difference between the target network output and the prediction, i.e. the eval (main) network output:
Loss = (Q_target(s_t, a) − Q_eval(s_t, a, w))²
finally, the two neural networks are trained with the experience tuples and the Q value is iterated continuously, so that an edge server with limited computing resources obtains a near-optimal solution, which is used as the optimal strategy for the edge servers to cooperate with each other.
CN202010107300.5A 2020-02-21 2020-02-21 Computing resource collaborative cooperation method based on deep reinforcement learning Active CN111367657B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010107300.5A CN111367657B (en) 2020-02-21 2020-02-21 Computing resource collaborative cooperation method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010107300.5A CN111367657B (en) 2020-02-21 2020-02-21 Computing resource collaborative cooperation method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN111367657A true CN111367657A (en) 2020-07-03
CN111367657B CN111367657B (en) 2022-04-19

Family

ID=71206211

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010107300.5A Active CN111367657B (en) 2020-02-21 2020-02-21 Computing resource collaborative cooperation method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN111367657B (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112099510A (en) * 2020-09-25 2020-12-18 东南大学 Intelligent agent control method based on end edge cloud cooperation
CN112134916A (en) * 2020-07-21 2020-12-25 南京邮电大学 Cloud edge collaborative computing migration method based on deep reinforcement learning
CN112506673A (en) * 2021-02-04 2021-03-16 国网江苏省电力有限公司信息通信分公司 Intelligent edge calculation-oriented collaborative model training task configuration method
CN112511336A (en) * 2020-11-05 2021-03-16 上海大学 Online service placement method in edge computing system
CN112948112A (en) * 2021-02-26 2021-06-11 杭州电子科技大学 Edge computing workload scheduling method based on reinforcement learning
CN113836788A (en) * 2021-08-24 2021-12-24 浙江大学 Acceleration method for flow industry reinforcement learning control based on local data enhancement
CN113835878A (en) * 2021-08-24 2021-12-24 润联软件系统(深圳)有限公司 Resource allocation method and device, computer equipment and storage medium
CN114140033A (en) * 2022-01-29 2022-03-04 北京新唐思创教育科技有限公司 Service personnel allocation method and device, electronic equipment and storage medium
CN115496208A (en) * 2022-11-15 2022-12-20 清华大学 Unsupervised multi-agent reinforcement learning method with collaborative mode diversity guidance
CN116821693A (en) * 2023-08-29 2023-09-29 腾讯科技(深圳)有限公司 Model training method and device for virtual scene, electronic equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170085488A1 (en) * 2015-09-22 2017-03-23 Brocade Communications Systems, Inc. Intelligent, load adaptive, and self optimizing master node selection in an extended bridge
CN108920280A (en) * 2018-07-13 2018-11-30 哈尔滨工业大学 A kind of mobile edge calculations task discharging method under single user scene
CN109583582A (en) * 2017-09-28 2019-04-05 中国石油化工股份有限公司 Neural network intensified learning method and system based on Eligibility traces
CN109710404A (en) * 2018-12-20 2019-05-03 上海交通大学 Method for scheduling task in distributed system
CN109976909A (en) * 2019-03-18 2019-07-05 中南大学 Low delay method for scheduling task in edge calculations network based on study
CN110427261A (en) * 2019-08-12 2019-11-08 电子科技大学 A kind of edge calculations method for allocating tasks based on the search of depth Monte Carlo tree
CN110619385A (en) * 2019-08-31 2019-12-27 电子科技大学 Structured network model compression acceleration method based on multi-stage pruning

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170085488A1 (en) * 2015-09-22 2017-03-23 Brocade Communications Systems, Inc. Intelligent, load adaptive, and self optimizing master node selection in an extended bridge
CN109583582A (en) * 2017-09-28 2019-04-05 中国石油化工股份有限公司 Neural network intensified learning method and system based on Eligibility traces
CN108920280A (en) * 2018-07-13 2018-11-30 哈尔滨工业大学 A kind of mobile edge calculations task discharging method under single user scene
CN109710404A (en) * 2018-12-20 2019-05-03 上海交通大学 Method for scheduling task in distributed system
CN109976909A (en) * 2019-03-18 2019-07-05 中南大学 Low delay method for scheduling task in edge calculations network based on study
CN110427261A (en) * 2019-08-12 2019-11-08 电子科技大学 A kind of edge calculations method for allocating tasks based on the search of depth Monte Carlo tree
CN110619385A (en) * 2019-08-31 2019-12-27 电子科技大学 Structured network model compression acceleration method based on multi-stage pruning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
EJAZ AHMED: "Bringing Computation Closer toward the User Network: Is Edge Computing the Solution?", IEEE COMMUNICATIONS MAGAZINE *
戴亚盛: "边缘计算可信协同服务机制研究" ("Research on Trusted Collaborative Service Mechanisms for Edge Computing"), 中国优秀硕士学位论文全文数据库 信息科技辑 (China Master's Theses Full-text Database, Information Science and Technology) *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112134916A (en) * 2020-07-21 2020-12-25 南京邮电大学 Cloud edge collaborative computing migration method based on deep reinforcement learning
CN112099510A (en) * 2020-09-25 2020-12-18 东南大学 Intelligent agent control method based on end edge cloud cooperation
CN112511336A (en) * 2020-11-05 2021-03-16 上海大学 Online service placement method in edge computing system
CN112506673A (en) * 2021-02-04 2021-03-16 国网江苏省电力有限公司信息通信分公司 Intelligent edge calculation-oriented collaborative model training task configuration method
CN112948112A (en) * 2021-02-26 2021-06-11 杭州电子科技大学 Edge computing workload scheduling method based on reinforcement learning
CN113835878A (en) * 2021-08-24 2021-12-24 润联软件系统(深圳)有限公司 Resource allocation method and device, computer equipment and storage medium
CN113836788A (en) * 2021-08-24 2021-12-24 浙江大学 Acceleration method for flow industry reinforcement learning control based on local data enhancement
CN113836788B (en) * 2021-08-24 2023-10-27 浙江大学 Acceleration method for flow industrial reinforcement learning control based on local data enhancement
CN114140033A (en) * 2022-01-29 2022-03-04 北京新唐思创教育科技有限公司 Service personnel allocation method and device, electronic equipment and storage medium
CN114140033B (en) * 2022-01-29 2022-04-12 北京新唐思创教育科技有限公司 Service personnel allocation method and device, electronic equipment and storage medium
CN115496208A (en) * 2022-11-15 2022-12-20 清华大学 Unsupervised multi-agent reinforcement learning method with collaborative mode diversity guidance
CN116821693A (en) * 2023-08-29 2023-09-29 腾讯科技(深圳)有限公司 Model training method and device for virtual scene, electronic equipment and storage medium
CN116821693B (en) * 2023-08-29 2023-11-03 腾讯科技(深圳)有限公司 Model training method and device for virtual scene, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN111367657B (en) 2022-04-19

Similar Documents

Publication Publication Date Title
CN111367657B (en) Computing resource collaborative cooperation method based on deep reinforcement learning
CN113254197B (en) Network resource scheduling method and system based on deep reinforcement learning
CN113950066B (en) Single server part calculation unloading method, system and equipment under mobile edge environment
CN109753751B (en) MEC random task migration method based on machine learning
CN110809306B (en) Terminal access selection method based on deep reinforcement learning
CN113543176B (en) Unloading decision method of mobile edge computing system based on intelligent reflecting surface assistance
CN110113190A (en) Time delay optimization method is unloaded in a kind of mobile edge calculations scene
CN111858009A (en) Task scheduling method of mobile edge computing system based on migration and reinforcement learning
CN109947545A (en) A kind of decision-making technique of task unloading and migration based on user mobility
CN113098714A (en) Low-delay network slicing method based on deep reinforcement learning
CN113469325A (en) Layered federated learning method, computer equipment and storage medium for edge aggregation interval adaptive control
CN113286317B (en) Task scheduling method based on wireless energy supply edge network
CN114390057B (en) Multi-interface self-adaptive data unloading method based on reinforcement learning under MEC environment
CN114546608B (en) Task scheduling method based on edge calculation
CN115277689A (en) Yun Bianwang network communication optimization method and system based on distributed federal learning
CN114205353B (en) Calculation unloading method based on hybrid action space reinforcement learning algorithm
Yang et al. Deep reinforcement learning based wireless network optimization: A comparative study
CN116489712B (en) Mobile edge computing task unloading method based on deep reinforcement learning
CN114938372B (en) Federal learning-based micro-grid group request dynamic migration scheduling method and device
Tao et al. Drl-driven digital twin function virtualization for adaptive service response in 6g networks
CN116321307A (en) Bidirectional cache placement method based on deep reinforcement learning in non-cellular network
Henna et al. Distributed and collaborative high-speed inference deep learning for mobile edge with topological dependencies
CN113821346B (en) Edge computing unloading and resource management method based on deep reinforcement learning
CN114154685A (en) Electric energy data scheduling method in smart power grid
CN117749796A (en) Cloud edge computing power network system calculation unloading method and system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant