CN111444009A - Resource allocation method and device based on deep reinforcement learning


Info

Publication number: CN111444009A
Application number: CN201911117328.0A
Authority: CN (China)
Prior art keywords: service, resource, representing, state parameters, evaluation parameter
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN111444009B (en)
Inventors: 张海涛, 郭彤宇, 郭建立, 黄瀚, 何晨泽
Current Assignee: Beijing University of Posts and Telecommunications; CETC 54 Research Institute
Original Assignee: Beijing University of Posts and Telecommunications; CETC 54 Research Institute
Application filed by Beijing University of Posts and Telecommunications and CETC 54 Research Institute
Priority to CN201911117328.0A
Publication of CN111444009A
Application granted; publication of CN111444009B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5061 Partitioning or combining of resources
    • G06F 9/5072 Grid computing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5083 Techniques for rebalancing the load in a distributed system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2209/00 Indexing scheme relating to G06F9/00
    • G06F 2209/50 Indexing scheme relating to G06F9/50
    • G06F 2209/5021 Priority


Abstract

Embodiments of the invention provide a resource allocation method and apparatus based on deep reinforcement learning. The method comprises: determining the services requiring resource allocation that are contained in a user's application program request, together with the allocation priority of each service; determining the state parameters of the current edge micro cloud system, the state parameters comprising a resource balance evaluation parameter, a response delay evaluation parameter and the remaining amount of resources of each computing node in each micro cloud; inputting the state parameters into a pre-trained resource balance optimization model to obtain a first target computing node for a first service, the model being trained by deep reinforcement learning; deploying the first service on the first target computing node; and updating the state parameters and returning to the parameter input step until every service requiring resources in the application program request has been allocated resources. Compared with traditional resource allocation methods, this method can meet communication delay requirements while achieving a higher balance of resource utilization.

Description

Resource allocation method and device based on deep reinforcement learning
Technical Field
The present invention relates to the field of wireless communication technologies, and in particular, to a resource allocation method and apparatus based on deep reinforcement learning.
Background
In recent years, with the progress of informatization and networking, information systems have played an increasingly important role in fields such as military operations and disaster relief. In such highly dynamic environments, mission plans and equipment configurations may change frequently, and network connectivity may fluctuate. The service resources of stand-alone equipment are very limited and cannot handle complex computing tasks. Cloud computing is an effective means of addressing such scenarios: resources can be configured in a user-defined manner according to task requirements, providing convenient and flexible management services for large-scale applications. However, a traditional cloud platform is usually deployed far away from the user, its communication delay is high, and it is difficult for it to provide continuous and reliable service when the network is unstable.
To solve the above problem, the edge micro cloud platform has emerged. The edge micro cloud platform is an emerging cloud computing model composed of multiple edge micro clouds deployed in a distributed manner; each edge micro cloud comprises several small servers, and the scale of the platform can be adjusted with task requirements. Most edge micro clouds are deployed on mobile vehicles and move according to task requirements, thereby providing higher-quality cloud services. With the development of microservice technology, an application is generally composed of multiple combined services that communicate with each other, and each combined service has different requirements for resources of different dimensions. Because the computing power of a single edge micro cloud is limited and cannot meet all service requirements, the combined services can be deployed across different edge micro clouds, which cooperate to provide computing power together.
However, when the existing edge micro cloud technology allocates resources for services, resource fragmentation often occurs, that is, resource allocation is unbalanced, which wastes resources in certain dimensions. In addition, when allocating resources, not only the resource requirements of each service but also the communication requirements between services need to be considered, which further increases the complexity of resource allocation; the related art does not consider the communication requirements between services.
Therefore, a resource allocation method that can satisfy the communication requirements between services and has a high balance of resource utilization is needed.
Disclosure of Invention
An object of the embodiments of the present invention is to provide a method and an apparatus for resource allocation based on deep reinforcement learning, so as to improve the balance of resource utilization. The specific technical scheme is as follows:
in order to achieve the above object, an embodiment of the present invention provides a resource allocation method based on deep reinforcement learning, which is applied to a control platform of an edge cloudlet system, where the edge cloudlet system further includes a plurality of cloudlets, each cloudlet includes a plurality of computing nodes, and the method includes:
determining the services requiring resource allocation that are contained in an application program request of a user, and the allocation priority of each service;
determining state parameters of a current edge micro cloud system, wherein the state parameters comprise a resource balance degree evaluation parameter, a response delay evaluation parameter and the resource surplus of each computing node in each micro cloud;
inputting the state parameters into a resource balance optimization model which is trained in advance to obtain a first target computing node of a first service; the first service is the service with the highest priority currently allocated; the resource balance optimization model is completed based on deep reinforcement learning training, wherein a training set of the deep reinforcement learning comprises: sample state parameters of the edge micro-cloud system;
deploying the first service to the first target computing node;
and updating the state parameters, and returning to the step of inputting the state parameters into the resource balance optimization model which is trained in advance until the resource allocation of each service of the resources to be allocated contained in the application program request is completed.
Optionally, the resource balance evaluation parameter is calculated based on the formulas presented as images in the original publication, namely a resource utilization variance for each computing node, a resource balance rate for each computing node, a normalized resource utilization balance degree for each computing node, a resource utilization balance degree RUBD_i for each micro cloud, and a resource balance evaluation parameter RUBD_Total for the edge micro cloud system;
wherein σ²_ij(X) represents the resource utilization variance of the jth computing node in the ith micro cloud, D represents the number of resource types, u_ij^d represents the utilization rate of the dth type of resource in the jth computing node of the ith micro cloud, ū_ij represents the average utilization rate over all resource types in the jth computing node of the ith micro cloud, X represents the resource allocation policy, a further symbol (shown as an image in the original publication) represents the resource balance rate of the jth computing node in the ith micro cloud, RUBD_i represents the resource utilization balance degree of the ith micro cloud, L_i represents the total number of computing nodes in the ith micro cloud, RUBD_Total represents the resource balance evaluation parameter of the edge micro cloud system, and K represents the total number of micro clouds in the edge micro cloud system;
calculating the response delay evaluation parameter based on the following formula:
t_Total = T_Comp(X) + T_TR(X)
wherein t_Total represents the response delay evaluation parameter, T_Comp(X) represents the computation delay, and T_TR(X) represents the transmission delay.
Optionally, the resource balancing optimization model is trained according to the following steps:
acquiring a preset neural network model and the training set;
inputting the sample state parameters into the neural network model to obtain service placement actions; the service placement action represents determining a placed target compute node for a sample service;
updating the sample state parameters based on the service placement action to obtain updated sample state parameters;
calculating a reward value of the service placement action based on the resource balance degree evaluation parameter and the response delay evaluation parameter contained in the sample state parameter, and the resource balance degree evaluation parameter and the response delay evaluation parameter contained in the updated sample state parameter;
substituting the sample state parameter, the updated sample state parameter, the service placing action and the reward value of the service placing action into a preset loss function, and calculating the loss value of the service placing action;
determining whether the neural network model converges according to the loss value;
if not, adjusting parameter values in the neural network model, and returning to the step of inputting the updated sample state parameters into the neural network model to obtain a service placing action;
and if so, determining the current neural network model as a resource balance optimization model.
Optionally, the loss function is given by the two formulas presented as images in the original publication: an n-step loss over the prioritized historical iteration data, and the corresponding cumulative reward over the n iterations;
wherein L represents the loss function, E[·] represents a mathematical expectation, n represents the number of sets of historical iteration data referenced in each iteration, t represents the time step, w_t represents the priority weight of the n sets of historical iteration data after time t, G_t represents the sum of the reward values of the n iterations after time t, γ^n represents the decay factor applied to the reward value over the n iterations after time t, Q_target represents the target network, Q_eva represents the estimated network, s_t represents the sample state parameter at time t, a_t represents the service placement action at time t, s_{t+n} represents the sample state parameter after n iterations, a' represents the service placement action that maximizes the output of the estimated network, k represents the iteration index, γ^k represents the decay factor applied to the reward value of the kth iteration after time t, and r_{t+k+1} represents the reward value of the kth iteration after time t.
In order to achieve the above object, an embodiment of the present invention further provides a resource allocation apparatus based on deep reinforcement learning, which is applied to a control platform of an edge cloudlet system, where the edge cloudlet system further includes a plurality of cloudlets, each cloudlet includes a plurality of computing nodes, and the apparatus includes:
the first determining module is used for determining services of various resources to be allocated contained in an application program request of a user and the allocation priority of each service;
the second determining module is used for determining state parameters of the current edge micro cloud system, wherein the state parameters comprise a resource balance degree evaluation parameter, a response delay evaluation parameter and the resource surplus of each computing node in each micro cloud;
the input module is used for inputting the state parameters into a resource balance optimization model which is trained in advance to obtain a first target computing node of a first service; the first service is the service with the highest priority currently allocated; the resource balance optimization model is completed based on deep reinforcement learning training, wherein a training set of the deep reinforcement learning comprises: sample state parameters of the edge micro-cloud system;
a deployment module to deploy the first service to the first target computing node;
and the updating module is used for updating the state parameters and triggering the input module until each service of the resources to be allocated contained in the application program request completes resource allocation.
Optionally, the apparatus further comprises: a calculating module, configured to calculate the resource balance evaluation parameter based on the formulas presented as images in the original publication, namely a resource utilization variance for each computing node, a resource balance rate for each computing node, a normalized resource utilization balance degree for each computing node, a resource utilization balance degree RUBD_i for each micro cloud, and a resource balance evaluation parameter RUBD_Total for the edge micro cloud system;
wherein σ²_ij(X) represents the resource utilization variance of the jth computing node in the ith micro cloud, D represents the number of resource types, u_ij^d represents the utilization rate of the dth type of resource in the jth computing node of the ith micro cloud, ū_ij represents the average utilization rate over all resource types in the jth computing node of the ith micro cloud, X represents the resource allocation policy, a further symbol (shown as an image in the original publication) represents the resource balance rate of the jth computing node in the ith micro cloud, RUBD_i represents the resource utilization balance degree of the ith micro cloud, L_i represents the total number of computing nodes in the ith micro cloud, RUBD_Total represents the resource balance evaluation parameter of the edge micro cloud system, and K represents the total number of micro clouds in the edge micro cloud system;
and to calculate the response delay evaluation parameter based on the following formula:
t_Total = T_Comp(X) + T_TR(X)
wherein t_Total represents the response delay evaluation parameter, T_Comp(X) represents the computation delay, and T_TR(X) represents the transmission delay.
Optionally, the apparatus further comprises: a training module, configured to train the resource balancing optimization model according to the following steps:
acquiring a preset neural network model and the training set;
inputting the sample state parameters into the neural network model to obtain service placement actions; the service placement action represents determining a placed target compute node for a sample service;
updating the sample state parameters based on the service placement action to obtain updated sample state parameters;
calculating a reward value of the service placement action based on the resource balance degree evaluation parameter and the response delay evaluation parameter contained in the sample state parameter, and the resource balance degree evaluation parameter and the response delay evaluation parameter contained in the updated sample state parameter;
substituting the sample state parameter, the updated sample state parameter, the service placing action and the reward value of the service placing action into a preset loss function, and calculating the loss value of the service placing action;
determining whether the neural network model converges according to the loss value;
if not, adjusting parameter values in the neural network model, and returning to the step of inputting the updated sample state parameters into the neural network model to obtain a service placing action;
and if so, determining the current neural network model as a resource balance optimization model.
Optionally, the loss function is given by the two formulas presented as images in the original publication: an n-step loss over the prioritized historical iteration data, and the corresponding cumulative reward over the n iterations;
wherein L represents the loss function, E[·] represents a mathematical expectation, n represents the number of sets of historical iteration data referenced in each iteration, t represents the time step, w_t represents the priority weight of the n sets of historical iteration data after time t, G_t represents the sum of the reward values of the n iterations after time t, γ^n represents the decay factor applied to the reward value over the n iterations after time t, Q_target represents the target network, Q_eva represents the estimated network, s_t represents the sample state parameter at time t, a_t represents the service placement action at time t, s_{t+n} represents the sample state parameter after n iterations, a' represents the service placement action that maximizes the output of the estimated network, k represents the iteration index, γ^k represents the decay factor applied to the reward value of the kth iteration after time t, and r_{t+k+1} represents the reward value of the kth iteration after time t.
In order to achieve the above object, an embodiment of the present invention further provides an electronic device, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;
a memory for storing a computer program;
and the processor is used for realizing any method step when executing the program stored in the memory.
To achieve the above object, an embodiment of the present invention further provides a computer-readable storage medium, in which a computer program is stored, and the computer program, when executed by a processor, implements any of the above method steps.
Therefore, the resource allocation method and device based on deep reinforcement learning provided by the embodiment of the invention can determine the services of various resources to be allocated contained in the application program request of the user and the allocation priority of each service; determining state parameters of a current edge micro cloud system, wherein the state parameters comprise a resource balance degree evaluation parameter, a response delay evaluation parameter and the resource surplus of each computing node in each micro cloud; inputting the state parameters into a resource balance optimization model which is trained in advance to obtain a first target computing node of a first service, and deploying the first service in the first target computing node; and updating the state parameters, and returning to the step of inputting the state parameters into the resource balance optimization model which is trained in advance until each service of the resources to be distributed contained in the application program request completes the resource distribution. Therefore, response delay and resource allocation balance are comprehensively considered, and a network model is trained in a deep reinforcement learning mode.
Of course, it is not necessary for any product or method of practicing the invention to achieve all of the above-described advantages at the same time.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic flowchart of a resource allocation method based on deep reinforcement learning according to an embodiment of the present invention;
fig. 2 is a schematic flowchart of a process for training a resource balancing optimization model according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a deep reinforcement learning-based resource allocation apparatus according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to solve the technical problem of low resource utilization balance of the existing service resource allocation method in the field of edge cloudiness, the embodiment of the invention provides a resource allocation method and device based on deep reinforcement learning, electronic equipment and a computer readable storage medium.
For ease of understanding, the following description will first describe an application scenario of the embodiment of the present invention.
The resource allocation method based on deep reinforcement learning provided by the embodiment of the invention can be applied to highly dynamic scenarios, such as the military and disaster relief fields, in which an edge micro cloud system is usually used to provide services for applications. An edge micro cloud system may include a control platform and a plurality of micro clouds, each micro cloud containing a plurality of computing nodes. A computing node may be an electronic device containing a processor, a communication interface, a memory and a communication bus, such as a personal computer, and is typically carried on a mobile vehicle. A micro cloud represents a collection of computing nodes and may typically include the computing nodes on a mobile vehicle. The resource allocation method based on deep reinforcement learning provided by the embodiment of the invention can be applied to the control platform, that is, the control platform determines on which computing node each service contained in an application program is deployed.
Specifically, referring to fig. 1, the resource allocation method based on deep reinforcement learning according to an embodiment of the present invention may include the following steps:
s101: determining a plurality of services of resources to be allocated contained in the application program request of the user and the allocation priority of each service.
In the embodiment of the present invention, the application request of the user may include a plurality of services, such as a location service, an image processing service, and the like, which need to be deployed on the computing node in the micro cloud, so that the computing node can provide corresponding resources for the services. In the embodiment of the present invention, the process of allocating resources to a service may also be understood as a process of allocating a computing node to a service.
In addition, since there is a certain association relationship between services, the services to be allocated with resources have allocation priorities. For example, if service a needs to depend on service b, then service b has a higher allocation priority than service a, i.e. service a can only be allocated if service b is allocated a compute node first.
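For illustration, such dependency-based priorities could be derived by topologically ordering the service dependency graph, as in the sketch below (the dependency-graph input format, function and service names are assumptions made for illustration; the patent does not prescribe a particular ordering algorithm):

    from collections import deque

    def allocation_priority(services, depends_on):
        """Order services so that a service is allocated only after every service
        it depends on. depends_on[a] = {b, ...} means service a depends on b
        (hypothetical input format). Returns services from highest to lowest
        allocation priority."""
        indegree = {s: 0 for s in services}
        dependents = {s: [] for s in services}
        for svc, deps in depends_on.items():
            for dep in deps:
                indegree[svc] += 1            # svc must wait for dep
                dependents[dep].append(svc)   # dep unlocks svc
        queue = deque(s for s in services if indegree[s] == 0)
        order = []
        while queue:
            s = queue.popleft()
            order.append(s)
            for nxt in dependents[s]:
                indegree[nxt] -= 1
                if indegree[nxt] == 0:
                    queue.append(nxt)
        return order

    # Example from the text: service a depends on service b, so b is allocated first
    print(allocation_priority(['service_a', 'service_b'], {'service_a': {'service_b'}}))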
S102: and determining state parameters of the current edge micro cloud system, wherein the state parameters comprise a resource balance degree evaluation parameter, a response delay evaluation parameter and the resource residual quantity of each computing node in each micro cloud.
In the embodiment of the invention, the resource balance evaluation parameter represents the balance of resource allocation of the computing nodes in the micro cloud.
In one embodiment of the present invention, the resource balance evaluation parameter may be calculated as follows.
First, the resource utilization variance is defined (the formula appears as an image in the original publication) as the variance, over the D resource types, of the per-type resource utilization of a computing node, wherein σ²_ij(X) denotes the resource utilization variance of the jth computing node in the ith micro cloud, D denotes the number of resource types, u_ij^d denotes the utilization rate of the dth type of resource in the jth computing node of the ith micro cloud, ū_ij denotes the average utilization rate over all resource types in the jth computing node of the ith micro cloud, and X denotes the resource allocation policy.
Because the resource utilization variance alone cannot reflect the imbalanced situation in which the utilization of one type of resource is very high while the utilization of the other types is low, the resource balance rate of the jth computing node in the ith micro cloud is further defined (formula shown as an image in the original publication).
Normalizing the resource balance rate then gives the resource utilization balance degree RUBD_ij of the jth computing node in the ith micro cloud (formula shown as an image in the original publication). Its value lies between 0 and 1, and a larger value indicates more balanced resource utilization.
Further, the resource utilization balance degree of the whole edge micro cloud system can be determined by aggregating the per-node values (the two aggregation formulas appear as images in the original publication), wherein RUBD_i denotes the resource utilization balance degree of the ith micro cloud, L_i denotes the total number of computing nodes in the ith micro cloud, RUBD_Total denotes the resource utilization balance degree of the edge micro cloud system, namely the resource balance evaluation parameter, and K denotes the total number of micro clouds in the edge micro cloud system.
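The exact closed forms of the resource balance rate and of the normalization step are only available as images in the original publication. The following Python sketch is therefore a plausible reconstruction under explicit assumptions: the balance rate is taken as the standard deviation of the per-type utilizations relative to their mean, the normalization maps it into (0, 1], and the micro cloud level and system level balance degrees are simple averages. All function and variable names are illustrative.

    import numpy as np

    def node_balance_degree(util):
        """util: utilization of each of the D resource types on one node (0..1).
        Returns a value in (0, 1]; larger means more balanced usage."""
        util = np.asarray(util, dtype=float)
        mean = util.mean()
        variance = ((util - mean) ** 2).mean()            # resource utilization variance
        # Assumed balance rate: dispersion relative to the mean utilization, so that
        # "one resource very high, the others low" is penalized even at a low mean.
        balance_rate = np.sqrt(variance) / (mean + 1e-9)
        return 1.0 / (1.0 + balance_rate)                 # assumed normalization into (0, 1]

    def system_balance_degree(cloudlets):
        """cloudlets: list of micro clouds, each a list of per-node utilization vectors.
        Returns RUBD_Total, assumed to be the mean over micro clouds of the mean
        per-node balance degree (RUBD_i)."""
        rubd_per_cloudlet = [
            np.mean([node_balance_degree(node) for node in cloudlet])
            for cloudlet in cloudlets
        ]
        return float(np.mean(rubd_per_cloudlet))

    # Example: two micro clouds with two nodes each and three resource types per node
    cloudlets = [
        [[0.6, 0.5, 0.55], [0.9, 0.1, 0.2]],
        [[0.3, 0.3, 0.35], [0.4, 0.45, 0.5]],
    ]
    print(system_balance_degree(cloudlets))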
In the embodiment of the present invention, the response delay evaluation parameter represents the sum of the computation delay and the communication delay of the edge micro-cloud system, that is:
t_Total = T_Comp(X) + T_TR(X)
wherein t_Total denotes the response delay evaluation parameter, T_Comp(X) denotes the computation delay, and T_TR(X) denotes the transmission delay.
In the embodiment of the present invention, a symbol (shown as an image in the original publication) may be used to represent the remaining amount of resources of the jth computing node in the ith micro cloud.
Furthermore, in the embodiment of the present invention, a set s may be used to represent the state parameters of the current edge micro cloud system (the definition of s appears as an image in the original publication). That is, the state parameters of the edge micro cloud system may be composed of the resource balance evaluation parameter, the response delay evaluation parameter, and the remaining amount of resources of each computing node in each micro cloud.
S103: inputting the state parameters into a resource balance optimization model which is trained in advance to obtain a first target computing node of a first service; the first service is the service with the highest priority currently allocated, and the resource balance optimization model is completed based on deep reinforcement learning training, wherein the training set of the deep reinforcement learning comprises: sample state parameters of the edge clouding system.
In the embodiment of the invention, after the control platform obtains the state parameters of the current edge micro cloud system, it can input the state parameters into the resource balance optimization model. Because the resource balance optimization model has been trained by deep reinforcement learning on the training set, it can output the resource allocation decision best suited to the current state parameters.
Specifically, the resource balance optimization model outputs a first target computing node for the first service, where the first service is the service with the highest current allocation priority. For example, if the service currently having the highest allocation priority is the positioning service, the resource balance optimization model outputs the target computing node of the positioning service, which can be recorded by an indicator (shown as an image in the original publication) denoting that the positioning service is assigned to the jth computing node in the ith micro cloud.
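One simple way to map the model's discrete output to a (micro cloud, computing node) pair is shown below; the flat action index and the argmax over masked Q-values are assumptions made for illustration, since the patent does not fix a particular encoding:

    import numpy as np

    def decode_action(action_index, nodes_per_cloudlet):
        """Map a flat action index to (micro cloud i, node j), assuming actions are
        enumerated micro cloud by micro cloud."""
        for i, n_nodes in enumerate(nodes_per_cloudlet):
            if action_index < n_nodes:
                return i, action_index
            action_index -= n_nodes
        raise ValueError("action index out of range")

    def select_node(q_values, feasible_mask, nodes_per_cloudlet):
        """Pick the feasible placement with the highest Q-value."""
        q = np.where(feasible_mask, q_values, -np.inf)
        return decode_action(int(np.argmax(q)), nodes_per_cloudlet)

    # Two micro clouds with 2 and 3 nodes: flat action 3 is node 1 of micro cloud 1
    print(decode_action(3, [2, 3]))  # (1, 1)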
The training process of the resource balancing optimization model can be referred to below, and is not described herein again.
S104: the first service is deployed to a first target compute node.
In this embodiment of the present invention, after determining a first target computing node of a first service, the first service may be deployed in the first target computing node, and then the first service may be run based on resources in the first target computing node.
S105: and updating the state parameters, and returning to the step of inputting the state parameters into the resource balance optimization model which is trained in advance until each service of the resources to be distributed contained in the application program request completes the resource distribution.
In the embodiment of the invention, after resources are allocated to the first service, the state parameters of the edge micro cloud system change, so the control platform collects the current state parameters of the edge micro cloud system again and continues to allocate resources to the next service.
And the control platform inputs the current state parameters into the resource balance optimization model so as to obtain the target computing node of the service with the highest current distribution priority.
In the embodiment of the present invention, the above steps may be executed in a loop until each service of the resource to be allocated included in the application request completes resource allocation.
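Putting S101 to S105 together, the control platform's allocation loop could look roughly like the sketch below. The helpers allocation_priority and select_node are the illustrative ones defined above, while request, model and env (with observe, feasible, deploy and nodes_per_cloudlet) are hypothetical stand-ins for the application request, the pre-trained resource balance optimization model and the edge micro cloud system:

    def allocate_application(request, model, env):
        """Sketch of the allocation loop S101-S105."""
        ordered = allocation_priority(request.services, request.depends_on)   # S101
        for service in ordered:                           # highest allocation priority first
            state = env.observe()                         # S102: RUBD_Total, t_Total, remaining resources
            q_values = model.q_values(state, service)     # S103: pre-trained optimization model
            i, j = select_node(q_values, env.feasible(service), env.nodes_per_cloudlet)
            env.deploy(service, cloudlet=i, node=j)       # S104: place the service on the target node
            # S105: the next env.observe() call reflects the updated state parameters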
Therefore, the resource allocation method based on deep reinforcement learning provided by the embodiment of the invention can determine the services of various resources to be allocated contained in the application program request of the user and the allocation priority of each service; determining state parameters of a current edge micro cloud system, wherein the state parameters comprise a resource balance degree evaluation parameter, a response delay evaluation parameter and the resource surplus of each computing node in each micro cloud; inputting the state parameters into a resource balance optimization model which is trained in advance to obtain a first target computing node of a first service, and deploying the first service in the first target computing node; and updating the state parameters, and returning to the step of inputting the state parameters into the resource balance optimization model which is trained in advance until each service of the resources to be distributed contained in the application program request completes the resource distribution. Therefore, response delay and resource allocation balance are comprehensively considered, and a network model is trained in a deep reinforcement learning mode.
In the embodiment of the invention, the resource balance optimization model can be trained based on deep reinforcement learning. Deep reinforcement learning is a combination of reinforcement learning and deep learning.
For ease of understanding, reinforcement learning is briefly described below.
Reinforcement learning is a type of machine learning. Its basic idea is to generate actions according to the state of the environment, obtain learning signals by receiving the environment's rewards for those actions, and update the model parameters accordingly, so that eventually the optimal action can be produced for a given environment state.
In the embodiment of the invention, the resource allocation process can be modeled as a reinforcement learning model, wherein the scene state is a state parameter of the edge micro-cloud system, and the action is a resource allocation strategy for a certain service. Therefore, the trained resource balance optimization model can output an optimal resource allocation strategy for a certain service according to the state parameters of the edge micro-cloud system.
In one embodiment of the present invention, referring to fig. 2, the resource balancing optimization model may be trained by the following steps:
s201: and acquiring a preset neural network model and a training set.
As those skilled in the art will understand, in contrast to conventional supervised learning, reinforcement learning does not require labeled samples during training; only initial input states are required as the training set.
In the embodiment of the invention, the training set can be a sample state parameter of the edge micro-cloud system.
S202: inputting the sample state parameters into a neural network model to obtain service placement actions; the service placement action represents determining a placed target compute node for the sample service.
In the embodiment of the invention, the input of the neural network model is a sample state parameter and the output is a service placement action, which represents the target computing node on which a sample service is placed. The sample state parameter may be denoted by s and, as above, consists of the resource balance evaluation parameter, the response delay evaluation parameter, and the remaining amount of resources of each computing node in each micro cloud. The service placement action may be denoted by a, indicating that the service is assigned to the jth computing node in the ith micro cloud (the detailed notation appears as images in the original publication).
S203: and updating the sample state parameters based on the service placing action to obtain the updated sample state parameters.
After the service placement action is generated, the sample state parameters are updated to obtain updated sample state parameters which are used as input of the next round of training.
S204: and calculating the reward value of the service placement action according to the resource balance evaluation parameter and the response delay evaluation parameter contained in the sample state parameter and the resource balance evaluation parameter contained in the updated sample state parameter.
In the embodiment of the invention, after each iteration, the reward value of the service placement action of that iteration can be calculated. Intuitively, the more balanced the resource distribution indicated by the resource balance evaluation parameter after the current placement action, and the lower the response delay indicated by the response delay evaluation parameter, the higher the reward value of the current placement action.
Specifically, in one embodiment of the present invention, the reward value for the service placement action may be calculated based on the following formula:
(The reward formula appears as an image in the original publication.) In this formula, r_n denotes the reward value of the nth iteration, RUBD_Total^(n-1) denotes the resource balance evaluation parameter after the (n-1)th iteration, RUBD_Total^(n) denotes the resource balance evaluation parameter after the nth iteration, t_Total^(n) denotes the service delay after the nth iteration, and L_ave denotes the average service delay constraint.
The above formula is only one way to calculate the reward value, and the embodiments of the present invention are not limited to calculating the reward value using this formula.
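Because the exact reward expression is only available as an image, the following is a plausible sketch that is consistent with the description: the reward grows with the improvement in the balance degree and is penalized when the post-placement service delay exceeds the average delay constraint. The specific combination (difference of balance degrees minus a normalized delay penalty) is an assumption.

    def reward(rubd_prev, rubd_curr, delay_curr, l_ave):
        """Hypothetical reward for one placement action.
        rubd_prev / rubd_curr: RUBD_Total before / after the action (0..1).
        delay_curr: service delay after the action; l_ave: average delay constraint."""
        balance_gain = rubd_curr - rubd_prev                   # more balanced -> positive
        delay_penalty = max(0.0, delay_curr - l_ave) / l_ave   # penalize only violations
        return balance_gain - delay_penalty

    print(reward(0.70, 0.78, delay_curr=30.0, l_ave=40.0))  # 0.08: balance improved, no delay violation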
S205: and substituting the sample state parameter, the updated sample state parameter, the service placing action and the reward value of the service placing action into a preset loss function, and calculating the loss value of the service placing action.
For ease of understanding, take the nth iteration as an example: the sample state parameter after the (n-1)th iteration is s_{n-1}, the sample state parameter after the nth iteration is s_n, the service placement action output by the nth iteration is a_n, and the reward value of the nth iteration is r_n. The loss value of the service placement action in the nth iteration can then be calculated from s_{n-1}, s_n, a_n and r_n according to a preset loss function.
Those skilled in the art will appreciate that in deep reinforcement learning each iteration yields a new Q value, which is a function of the state s and the action a and represents the expected gain obtained by taking action a in state s. The Q value is typically determined by a target network and an estimated network, where the target network is denoted Q_target and the estimated network is denoted Q_eva; the Q value output by the target network is the iteratively updated value, while the Q value output by the estimated network is the value before the iterative update.
In the embodiment of the invention, the output Q values of the target network and the estimated network can be determined based on the sample state parameters before and after iteration and corresponding service placement actions, and then a loss function is constructed by combining the reward values.
Specifically, in an embodiment of the present invention, the preset loss function may be:
Loss = E[(r_{n+1} + γ·Q_target(s_{n+1}, argmax_{a'} Q_eva(s_{n+1}, a')) - Q_eva(s_n, a_n))²]
wherein E[·] denotes a mathematical expectation, r_{n+1} denotes the reward value of the (n+1)th iteration, γ denotes the decay factor, Q_target denotes the target network, Q_eva denotes the estimated network, s_n denotes the sample state parameter after the nth iteration, a_n denotes the service placement action of the nth iteration, and a' denotes the service placement action that maximizes the output of the estimated network.
In the embodiment of the invention, the sample state parameters before and after each iteration, the service placement action and the reward value are used as variables in the loss function in order to train the neural network model. To accelerate convergence and improve the accuracy of the network model, the data of several previous steps can be combined in subsequent iterations. For example, for the third iteration, the sample state parameter, service placement action and reward value of the first iteration and those of the second iteration may both be used as training data; that is, the data of the preceding iterations can be considered together in the loss function.
In addition, the difference between the Q values output by the target network and the estimated network in an iteration reflects how valuable the data of that iteration is as training data: the larger the difference, the more that iteration's data is worth training on, so a larger sampling weight can be set for it. For example, suppose the current iteration is the 5th iteration and the data of the 2nd to 4th iterations are selected for training; if, among the 2nd to 4th iterations, the difference between the Q values output by the target network and the estimated network is largest in the 3rd iteration, a larger sampling weight may be set for the data of the 3rd iteration.
In the embodiment of the present invention, the loss function may be improved based on two aspects, namely, multi-step joint training and setting of sampling weights, and specifically, the loss function improved based on the two aspects may be:
(The improved loss function and the corresponding n-step cumulative reward appear as images in the original publication.)
wherein L denotes the improved loss function, E[·] denotes a mathematical expectation, n denotes the number of sets of historical iteration data referenced in each iteration, t denotes the time step, w_t denotes the priority weight of the n sets of historical iteration data after time t and can be determined from the difference between the Q values output by the target network and the estimated network in the historical iteration data (the larger the difference, the larger the priority weight of that round of historical iteration data), G_t denotes the sum of the reward values of the n iterations after time t, γ^n denotes the decay factor applied to the reward value over the n iterations after time t and can be set according to the actual situation, Q_target denotes the target network, Q_eva denotes the estimated network, s_t denotes the sample state parameter at time t, a_t denotes the service placement action at time t, s_{t+n} denotes the sample state parameter after n iterations, a' denotes the service placement action that maximizes the output of the estimated network, k denotes the iteration index, γ^k denotes the decay factor applied to the reward value of the kth iteration after time t, and r_{t+k+1} denotes the reward value of the kth iteration after time t.
Therefore, the improved loss function takes into account multiple sets of historical iteration data generated by previous iterations and sets priority weights based on the difference of the Q values in each previous iteration, so the neural network can be trained in a more targeted way, which accelerates convergence and improves the accuracy of the network model.
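Because the exact form of the improved loss is only shown as images in the original publication, the sketch below assumes the standard combination of the two ideas described above: an n-step double-DQN target and priority weights proportional to the gap between the target and estimated Q values. The proportional prioritization, the squared-error form and the Transition layout are assumptions.

    import numpy as np
    from collections import namedtuple

    Transition = namedtuple("Transition", "s_t a_t rewards s_t_plus_n")

    def n_step_loss(batch, q_target, q_eva, gamma=0.99):
        """batch: Transition records whose rewards field holds the n rewards
        r_{t+1}..r_{t+n}. q_target / q_eva: callables mapping a state to a vector
        of Q-values. Returns the priority-weighted squared n-step TD error."""
        td_errors, weights = [], []
        for tr in batch:
            n = len(tr.rewards)
            g = sum(gamma ** k * r for k, r in enumerate(tr.rewards))   # n-step return
            a_best = int(np.argmax(q_eva(tr.s_t_plus_n)))               # double-DQN action choice
            target = g + gamma ** n * q_target(tr.s_t_plus_n)[a_best]
            td = target - q_eva(tr.s_t)[tr.a_t]
            td_errors.append(td)
            weights.append(abs(td))            # assumed: priority weight ~ |Q-value gap|
        weights = np.asarray(weights)
        weights = weights / (weights.sum() + 1e-9)
        return float(np.sum(weights * np.square(td_errors)))

    # Toy example with a 2-action Q-function shared by both networks
    q = lambda s: np.array([0.1, 0.5])
    tr = Transition(s_t=0, a_t=1, rewards=[0.2, 0.3], s_t_plus_n=1)
    print(n_step_loss([tr], q_target=q, q_eva=q))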
S206: and determining whether the neural network model converges according to the loss value, otherwise executing S207, and executing S208.
When the loss value does not exceed the preset loss threshold, the neural network model may be considered to have converged. In addition, the maximum number of iterations may also be preset, and when the maximum number of iterations is reached, the neural network model may also be considered to have converged, which is not limited.
S207: and adjusting the parameter values in the neural network model, and returning to execute the step S202.
And when the loss value shows that the neural network model does not converge, adjusting the parameter value, returning to the step S202, and starting a new round of iterative training.
S208: and determining the current neural network model as a resource balance optimization model.
Based on the same inventive concept, according to the above embodiment of the resource allocation method based on deep reinforcement learning, an embodiment of the present invention further provides a resource allocation apparatus based on deep reinforcement learning. Referring to fig. 3, the apparatus may include the following modules:
a first determining module 301, configured to determine services of multiple resources to be allocated included in an application request of a user, and an allocation priority of each service;
a second determining module 302, configured to determine state parameters of the current edge micro-cloud system, where the state parameters include a resource balance evaluation parameter, a response delay evaluation parameter, and a resource remaining amount of each computing node in each micro-cloud;
an input module 303, configured to input the state parameter into a resource balancing optimization model that is trained in advance, to obtain a first target computing node of the first service; the first service is the service with the highest priority currently allocated; the resource balance optimization model is completed based on deep reinforcement learning training, wherein the training set of the deep reinforcement learning comprises the following steps: sample state parameters of the edge micro-cloud system;
a deployment module 304 for deploying the first service to the first target computing node;
the updating module 305 is configured to update the status parameters and trigger the input module until the resource allocation is completed for each service of the resource to be allocated included in the application request.
In an embodiment of the present invention, on the basis of the apparatus shown in fig. 3, a calculation module may further be included, where the calculation module is configured to calculate a resource balance evaluation parameter based on the following formula:
(The formulas are presented as images in the original publication: a resource utilization variance for each computing node, a resource balance rate for each computing node, a normalized resource utilization balance degree for each computing node, a resource utilization balance degree RUBD_i for each micro cloud, and a resource balance evaluation parameter RUBD_Total for the edge micro cloud system.)
wherein σ²_ij(X) represents the resource utilization variance of the jth computing node in the ith micro cloud, D represents the number of resource types, u_ij^d represents the utilization rate of the dth type of resource in the jth computing node of the ith micro cloud, ū_ij represents the average utilization rate over all resource types in the jth computing node of the ith micro cloud, X represents the resource allocation policy, a further symbol (shown as an image in the original publication) represents the resource balance rate of the jth computing node in the ith micro cloud, RUBD_i represents the resource utilization balance degree of the ith micro cloud, L_i represents the total number of computing nodes in the ith micro cloud, RUBD_Total represents the resource balance evaluation parameter of the edge micro cloud system, and K represents the total number of micro clouds in the edge micro cloud system;
calculating the response delay evaluation parameter based on the following formula:
t_Total = T_Comp(X) + T_TR(X)
wherein t_Total represents the response delay evaluation parameter, T_Comp(X) represents the computation delay, and T_TR(X) represents the transmission delay.
In an embodiment of the present invention, on the basis of the apparatus shown in fig. 3, a training module may further be included, where the training module is configured to train a resource balancing optimization model according to the following steps:
acquiring a preset neural network model and a training set;
inputting the sample state parameters into a neural network model to obtain service placement actions; the service placement action represents determining a placed target compute node for the sample service;
updating the sample state parameters based on the service placement action to obtain updated sample state parameters;
calculating the reward value of the service placement action based on the resource balance evaluation parameter and the response delay evaluation parameter contained in the sample state parameter, and the resource balance evaluation parameter and the response delay evaluation parameter contained in the updated sample state parameter;
substituting the sample state parameter, the updated sample state parameter, the service placing action and the reward value of the service placing action into a preset loss function, and calculating the loss value of the service placing action;
determining whether the neural network model converges according to the loss value;
if not, adjusting parameter values in the neural network model, and returning to the step of inputting the updated sample state parameters into the neural network model to obtain a service placing action;
and if so, determining the current neural network model as a resource balance optimization model.
In one embodiment of the invention, the loss function may be:
(The two formulas appear as images in the original publication: an n-step loss over the prioritized historical iteration data, and the corresponding cumulative reward over the n iterations.)
wherein L represents the loss function, E[·] represents a mathematical expectation, n represents the number of sets of historical iteration data referenced in each iteration, t represents the time step, w_t represents the priority weight of the n sets of historical iteration data after time t, G_t represents the sum of the reward values of the n iterations after time t, γ^n represents the decay factor applied to the reward value over the n iterations after time t, Q_target represents the target network, Q_eva represents the estimated network, s_t represents the sample state parameter at time t, a_t represents the service placement action at time t, s_{t+n} represents the sample state parameter after n iterations, a' represents the service placement action that maximizes the output of the estimated network, k represents the iteration index, γ^k represents the decay factor applied to the reward value of the kth iteration after time t, and r_{t+k+1} represents the reward value of the kth iteration after time t.
By applying the resource allocation device based on deep reinforcement learning provided by the embodiment of the invention, the services of various resources to be allocated contained in the application program request of a user and the allocation priority of each service can be determined; determining state parameters of a current edge micro cloud system, wherein the state parameters comprise a resource balance degree evaluation parameter, a response delay evaluation parameter and the resource surplus of each computing node in each micro cloud; inputting the state parameters into a resource balance optimization model which is trained in advance to obtain a first target computing node of a first service, and deploying the first service in the first target computing node; and updating the state parameters, and returning to the step of inputting the state parameters into the resource balance optimization model which is trained in advance until each service of the resources to be distributed contained in the application program request completes the resource distribution. Therefore, response delay and resource allocation balance are comprehensively considered, and a network model is trained in a deep reinforcement learning mode.
Based on the same inventive concept, according to the implementation of the above resource allocation method based on deep reinforcement learning, the embodiment of the present invention provides an electronic device, as shown in fig. 4, including a processor 401, a communication interface 402, a memory 403, and a communication bus 404, where the processor 401, the communication interface 402, and the memory 403 complete mutual communication via the communication bus 404,
a memory 403 for storing a computer program;
the processor 401, when executing the program stored in the memory 403, implements the following steps:
determining services of various resources to be allocated contained in an application program request of a user and allocation priority of each service;
determining state parameters of a current edge micro cloud system, wherein the state parameters comprise a resource balance degree evaluation parameter, a response delay evaluation parameter and the resource surplus of each computing node in each micro cloud;
inputting the state parameters into a resource balance optimization model which is trained in advance to obtain a first target computing node of a first service; the first service is the service with the highest priority currently allocated; the resource balance optimization model is completed based on deep reinforcement learning training, wherein the training set of the deep reinforcement learning comprises the following steps: sample state parameters of the edge micro-cloud system;
deploying a first service at a first target computing node;
and updating the state parameters, and returning to the step of inputting the state parameters into the resource balance optimization model which is trained in advance until each service of the resources to be distributed contained in the application program request completes the resource distribution.
The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 4, but this does not indicate only one bus or one type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; but may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component.
By applying the electronic device provided by the embodiment of the invention, the services of the various resources to be allocated contained in the application program request of a user and the allocation priority of each service can be determined; the state parameters of the current edge micro cloud system are determined, wherein the state parameters comprise a resource balance degree evaluation parameter, a response delay evaluation parameter and the resource surplus of each computing node in each micro cloud; the state parameters are input into a resource balance optimization model which is trained in advance to obtain a first target computing node of a first service, and the first service is deployed on the first target computing node; and the state parameters are updated and the step of inputting the state parameters into the pre-trained resource balance optimization model is repeated until each service of the resources to be allocated contained in the application program request has completed resource allocation. In this way, response delay and resource allocation balance are considered jointly, and the network model is trained by deep reinforcement learning.
Based on the same inventive concept, and in accordance with the above deep reinforcement learning-based resource allocation method, yet another embodiment of the present invention further provides a computer-readable storage medium in which a computer program is stored, and the computer program, when executed by a processor, implements the steps of any of the above deep reinforcement learning-based resource allocation methods.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the above-mentioned resource allocation apparatus embodiment, electronic device embodiment and computer-readable storage medium embodiment based on deep reinforcement learning, since they are substantially similar to the above-mentioned resource allocation method embodiment based on deep reinforcement learning, the description is relatively simple, and for the relevant points, refer to the partial description of the above-mentioned resource allocation method embodiment based on deep reinforcement learning.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A resource allocation method based on deep reinforcement learning is applied to a control platform of an edge micro-cloud system, wherein the edge micro-cloud system further comprises a plurality of micro-clouds, each micro-cloud comprises a plurality of computing nodes, and the method comprises the following steps:
determining services of various resources to be allocated contained in an application program request of a user and allocation priority of each service;
determining state parameters of a current edge micro cloud system, wherein the state parameters comprise a resource balance degree evaluation parameter, a response delay evaluation parameter and the resource surplus of each computing node in each micro cloud;
inputting the state parameters into a resource balance optimization model which is trained in advance to obtain a first target computing node of a first service; the first service is the service with the highest priority currently allocated; the resource balance optimization model is completed based on deep reinforcement learning training, wherein a training set of the deep reinforcement learning comprises: sample state parameters of the edge micro-cloud system;
deploying the first service to the first target computing node;
and updating the state parameters, and returning to the step of inputting the state parameters into the resource balance optimization model which is trained in advance until the resource allocation of each service of the resources to be allocated contained in the application program request is completed.
2. The method of claim 1, wherein the resource balance evaluation parameter is calculated based on the following formula:
Figure FDA0002274422590000011
Figure FDA0002274422590000012
Figure FDA0002274422590000013
Figure FDA0002274422590000014
Figure FDA0002274422590000021
wherein RUV_i^j represents the resource utilization variance of the j-th computing node in the i-th cloudlet, D represents the number of types of resources,
Figure FDA0002274422590000022
represents the resource utilization rate of the d-th type of resource in the j-th computing node in the i-th micro cloud,
Figure FDA0002274422590000023
represents the average resource utilization rate over all types of resources in the j-th computing node in the i-th micro cloud, X represents the resource allocation strategy,
Figure FDA0002274422590000024
represents the resource balance rate of the j-th computing node in the i-th cloudlet, RUBD_i represents the resource utilization balance of the i-th cloudlet, L_i represents the total number of computing nodes in the i-th cloudlet, RUBD_Total represents the resource balance evaluation parameter of the edge micro cloud system, and K represents the total number of micro clouds in the edge micro cloud system;
calculating the response delay evaluation parameter based on the following formula:
t_Total = T_Comp(X) + T_TR(X)
wherein t_Total represents the response delay evaluation parameter, T_Comp(X) represents the computation delay, and T_TR(X) represents the transmission delay.
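For illustration only, one possible reading of the balance and delay metrics of this claim is sketched in Python below; the exact formulas are given as images in the original publication, so the per-node variance, the square-root balance rate and the plain averaging over nodes and cloudlets used here, as well as the function names, are assumptions.

import numpy as np

def balance_evaluation(utilization):
    # utilization[i][j] is a length-D vector with the per-resource-type utilization of
    # computing node j in cloudlet i; the aggregation below is an assumed instantiation.
    cloudlet_scores = []
    for nodes in utilization:
        node_scores = []
        for u in nodes:
            u = np.asarray(u, dtype=float)
            ruv = np.mean((u - u.mean()) ** 2)        # resource utilization variance RUV_i^j
            node_scores.append(np.sqrt(ruv))          # assumed balance rate of node j
        cloudlet_scores.append(np.mean(node_scores))  # RUBD_i over the L_i nodes of cloudlet i
    return float(np.mean(cloudlet_scores))            # RUBD_Total over the K cloudlets

def response_delay(t_comp, t_tr):
    # t_Total = T_Comp(X) + T_TR(X): computation delay plus transmission delay
    return t_comp + t_tr

print(balance_evaluation([[[0.5, 0.7], [0.2, 0.9]], [[0.4, 0.4]]]))
print(response_delay(0.12, 0.03))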
3. The method of claim 1, wherein the resource balancing optimization model is trained by:
acquiring a preset neural network model and the training set;
inputting the sample state parameters into the neural network model to obtain a service placement action; the service placement action represents the target computing node on which a sample service is to be placed;
updating the sample state parameters based on the service placement action to obtain updated sample state parameters;
calculating a reward value of the service placement action based on the resource balance degree evaluation parameter and the response delay evaluation parameter contained in the sample state parameters and the resource balance degree evaluation parameter and the response delay evaluation parameter contained in the updated sample state parameters;
substituting the sample state parameters, the updated sample state parameters, the service placement action and the reward value of the service placement action into a preset loss function, and calculating a loss value of the service placement action;
determining whether the neural network model converges according to the loss value;
if not, adjusting parameter values in the neural network model, and returning to the step of inputting the updated sample state parameters into the neural network model to obtain a service placement action;
and if so, determining the current neural network model as a resource balance optimization model.
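For illustration only, the training procedure of this claim can be sketched in Python as follows; to keep the sketch short the neural network is replaced by a small Q-table, the loss is a one-step TD error rather than the claimed n-step loss, and the callables env_step and reward_fn together with the toy usage values are assumptions.

import numpy as np

def train_balance_model(train_states, env_step, reward_fn, n_actions,
                        gamma=0.9, lr=0.1, tol=1e-4, max_rounds=500):
    # Q-table stand-in for the neural network: one row per sample state, one column per node.
    q = np.zeros((len(train_states), n_actions))
    for _ in range(max_rounds):
        total_loss = 0.0
        for s_id, state in enumerate(train_states):
            # feed the sample state parameters to the model to obtain a placement action
            action = int(np.argmax(q[s_id]))
            # update the sample state parameters according to the placement
            next_id, next_state = env_step(s_id, state, action)
            # reward from the balance and delay metrics before and after the placement
            r = reward_fn(state, next_state)
            # loss value of the placement action (one-step TD error as a stand-in)
            td_error = r + gamma * np.max(q[next_id]) - q[s_id, action]
            total_loss += td_error ** 2
            # adjust the model parameters
            q[s_id, action] += lr * td_error
        # convergence check on the loss value
        if total_loss / len(train_states) < tol:
            break
    return q

# toy usage: two sample states, two candidate computing nodes
states = [{"balance": 0.6, "delay": 1.0}, {"balance": 0.3, "delay": 0.8}]
env_step = lambda s_id, st, a: (1 - s_id, states[1 - s_id])
reward_fn = lambda before, after: (before["balance"] - after["balance"]) + (before["delay"] - after["delay"])
print(train_balance_model(states, env_step, reward_fn, n_actions=2))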
4. The method of claim 3, wherein the loss function is:
Figure FDA0002274422590000031
Figure FDA0002274422590000032
wherein L represents the loss function, E[·] represents the mathematical expectation, n represents the number of sets of historical iteration data referenced in each iteration, t represents the time step,
Figure FDA0002274422590000033
represents the priority weight of the n sets of historical iteration data after time t, r_t^(n) represents the sum of the reward values of the n iterations after time t,
Figure FDA0002274422590000034
represents the decay factor applied to the reward value over the n iterations after time t, Q_target represents the target network, Q_eva represents the estimation network, s_t represents the sample state parameters at time t, a_t represents the service placement action at time t, s_{t+n} represents the sample state parameters after n iterations, a' represents the service placement action that maximizes the output of the estimation network, and k represents the iteration index,
Figure FDA0002274422590000035
represents the decay factor applied to the reward of the k-th iteration after time t, and r_{t+k+1} represents the reward value of the k-th iteration after time t.
5. A resource allocation device based on deep reinforcement learning is applied to a control platform of an edge micro-cloud system, wherein the edge micro-cloud system further comprises a plurality of micro-clouds, each micro-cloud comprises a plurality of computing nodes, and the device comprises:
the first determining module is used for determining services of various resources to be allocated contained in an application program request of a user and the allocation priority of each service;
the second determining module is used for determining state parameters of the current edge micro cloud system, wherein the state parameters comprise a resource balance degree evaluation parameter, a response delay evaluation parameter and the resource surplus of each computing node in each micro cloud;
the input module is used for inputting the state parameters into a resource balance optimization model which is trained in advance to obtain a first target computing node of a first service; the first service is the service with the highest priority currently allocated; the resource balance optimization model is completed based on deep reinforcement learning training, wherein a training set of the deep reinforcement learning comprises: sample state parameters of the edge micro-cloud system;
a deployment module to deploy the first service to the first target computing node;
and the updating module is used for updating the state parameters and triggering the input module until each service of the resources to be allocated contained in the application program request completes resource allocation.
6. The apparatus of claim 5, further comprising: a calculating module, configured to calculate the resource balance evaluation parameter based on the following formula:
Figure FDA0002274422590000041
Figure FDA0002274422590000042
Figure FDA0002274422590000043
Figure FDA0002274422590000044
Figure FDA0002274422590000045
wherein RUV_i^j represents the resource utilization variance of the j-th computing node in the i-th cloudlet, D represents the number of types of resources,
Figure FDA0002274422590000046
represents the resource utilization rate of the d-th type of resource in the j-th computing node in the i-th micro cloud,
Figure FDA0002274422590000047
represents the average resource utilization rate over all types of resources in the j-th computing node in the i-th micro cloud, X represents the resource allocation strategy,
Figure FDA0002274422590000048
represents the resource balance rate of the j-th computing node in the i-th cloudlet, RUBD_i represents the resource utilization balance of the i-th cloudlet, L_i represents the total number of computing nodes in the i-th cloudlet, RUBD_Total represents the resource balance evaluation parameter of the edge micro cloud system, and K represents the total number of micro clouds in the edge micro cloud system;
calculating the response delay evaluation parameter based on the following formula:
t_Total = T_Comp(X) + T_TR(X)
wherein t_Total represents the response delay evaluation parameter, T_Comp(X) represents the computation delay, and T_TR(X) represents the transmission delay.
7. The apparatus of claim 5, further comprising: a training module, configured to train the resource balancing optimization model according to the following steps:
acquiring a preset neural network model and the training set;
inputting the sample state parameters into the neural network model to obtain a service placement action; the service placement action represents the target computing node on which a sample service is to be placed;
updating the sample state parameters based on the service placement action to obtain updated sample state parameters;
calculating a reward value of the service placement action based on the resource balance degree evaluation parameter and the response delay evaluation parameter contained in the sample state parameters and the resource balance degree evaluation parameter and the response delay evaluation parameter contained in the updated sample state parameters;
substituting the sample state parameters, the updated sample state parameters, the service placement action and the reward value of the service placement action into a preset loss function, and calculating a loss value of the service placement action;
determining whether the neural network model converges according to the loss value;
if not, adjusting parameter values in the neural network model, and returning to the step of inputting the updated sample state parameters into the neural network model to obtain a service placement action;
and if so, determining the current neural network model as a resource balance optimization model.
8. The apparatus of claim 7, wherein the loss function is:
Figure FDA0002274422590000051
Figure FDA0002274422590000052
wherein L represents the loss function, E[·] represents the mathematical expectation, n represents the number of sets of historical iteration data referenced in each iteration, t represents the time step,
Figure FDA0002274422590000053
represents the priority weight of the n sets of historical iteration data after time t, r_t^(n) represents the sum of the reward values of the n iterations after time t,
Figure FDA0002274422590000054
represents the decay factor applied to the reward value over the n iterations after time t, Q_target represents the target network, Q_eva represents the estimation network, s_t represents the sample state parameters at time t, a_t represents the service placement action at time t, s_{t+n} represents the sample state parameters after n iterations, a' represents the service placement action that maximizes the output of the estimation network, and k represents the iteration index,
Figure FDA0002274422590000055
represents the decay factor applied to the reward of the k-th iteration after time t, and r_{t+k+1} represents the reward value of the k-th iteration after time t.
9. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any of claims 1 to 4 when executing a program stored in the memory.
10. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, which computer program, when being executed by a processor, carries out the method steps of any one of claims 1 to 4.
CN201911117328.0A 2019-11-15 2019-11-15 Resource allocation method and device based on deep reinforcement learning Active CN111444009B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911117328.0A CN111444009B (en) 2019-11-15 2019-11-15 Resource allocation method and device based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN111444009A true CN111444009A (en) 2020-07-24
CN111444009B CN111444009B (en) 2022-10-14

Family

ID=71626797

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911117328.0A Active CN111444009B (en) 2019-11-15 2019-11-15 Resource allocation method and device based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN111444009B (en)

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112492651A (en) * 2020-11-23 2021-03-12 中国联合网络通信集团有限公司 Resource scheduling scheme optimization method and device
CN112600906A (en) * 2020-12-09 2021-04-02 中国科学院深圳先进技术研究院 Resource allocation method and device for online scene and electronic equipment
CN112650583A (en) * 2020-12-23 2021-04-13 新智数字科技有限公司 Resource allocation method, device, readable medium and electronic equipment
CN112799817A (en) * 2021-02-02 2021-05-14 中国科学院计算技术研究所 Micro-service resource scheduling system and method
CN112836796A (en) * 2021-01-27 2021-05-25 北京理工大学 Method for super-parameter collaborative optimization of system resources and model in deep learning training
CN112860512A (en) * 2021-01-29 2021-05-28 平安国际智慧城市科技股份有限公司 Interface monitoring optimization method and device, computer equipment and storage medium
CN112866041A (en) * 2021-04-23 2021-05-28 南京蓝洋智能科技有限公司 Adaptive network system and training method
CN112988380A (en) * 2021-02-25 2021-06-18 电子科技大学 Kubernetes-based cluster load adjusting method and storage medium
CN113014649A (en) * 2021-02-26 2021-06-22 济南浪潮高新科技投资发展有限公司 Cloud Internet of things load balancing method, device and equipment based on deep learning
CN113176947A (en) * 2021-05-08 2021-07-27 武汉理工大学 Dynamic task placement method based on delay and cost balance in serverless computing
CN113364831A (en) * 2021-04-27 2021-09-07 国网浙江省电力有限公司电力科学研究院 Multi-domain heterogeneous computing network resource credible cooperation method based on block chain
CN113391907A (en) * 2021-06-25 2021-09-14 中债金科信息技术有限公司 Task placement method, device, equipment and medium
CN113408641A (en) * 2021-06-30 2021-09-17 北京百度网讯科技有限公司 Method and device for training resource generation model and generating service resources
CN113448425A (en) * 2021-07-19 2021-09-28 哈尔滨工业大学 Dynamic parallel application program energy consumption runtime optimization method and system based on reinforcement learning
CN113691840A (en) * 2021-08-31 2021-11-23 江苏赞奇科技股份有限公司 Video stream control method and system with high availability
CN114116156A (en) * 2021-10-18 2022-03-01 武汉理工大学 Cloud-edge cooperative double-profit equilibrium taboo reinforcement learning resource allocation method
CN114301922A (en) * 2020-10-07 2022-04-08 智捷科技股份有限公司 Reverse proxy method with delay perception load balancing and storage device
CN114339311A (en) * 2021-12-09 2022-04-12 北京邮电大学 Video cloud transcoding and distribution joint decision method and system
CN114338504A (en) * 2022-03-15 2022-04-12 武汉烽火凯卓科技有限公司 Micro-service deployment and routing method based on network edge system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106874108A (en) * 2016-12-28 2017-06-20 广东工业大学 Thin cloud is minimized in mobile cloud computing use number technology
US20190325304A1 (en) * 2018-04-24 2019-10-24 EMC IP Holding Company LLC Deep Reinforcement Learning for Workflow Optimization
CN110351571A (en) * 2019-07-05 2019-10-18 清华大学 Live video cloud transcoding resource allocation and dispatching method based on deeply study
CN110418416A (en) * 2019-07-26 2019-11-05 东南大学 Resource allocation methods based on multiple agent intensified learning in mobile edge calculations system
CN110445866A (en) * 2019-08-12 2019-11-12 南京工业大学 Task immigration and collaborative load-balancing method in a kind of mobile edge calculations environment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XIN HU et al.: "A Deep Reinforcement Learning-Based Framework for Dynamic", IEEE COMMUNICATIONS LETTERS *

Cited By (28)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114301922A (en) * 2020-10-07 2022-04-08 智捷科技股份有限公司 Reverse proxy method with delay perception load balancing and storage device
CN112492651B (en) * 2020-11-23 2023-07-21 中国联合网络通信集团有限公司 Resource scheduling scheme optimization method and device
CN112492651A (en) * 2020-11-23 2021-03-12 中国联合网络通信集团有限公司 Resource scheduling scheme optimization method and device
CN112600906A (en) * 2020-12-09 2021-04-02 中国科学院深圳先进技术研究院 Resource allocation method and device for online scene and electronic equipment
CN112650583A (en) * 2020-12-23 2021-04-13 新智数字科技有限公司 Resource allocation method, device, readable medium and electronic equipment
CN112836796A (en) * 2021-01-27 2021-05-25 北京理工大学 Method for super-parameter collaborative optimization of system resources and model in deep learning training
CN112836796B (en) * 2021-01-27 2022-07-01 北京理工大学 Method for super-parameter collaborative optimization of system resources and model in deep learning training
CN112860512A (en) * 2021-01-29 2021-05-28 平安国际智慧城市科技股份有限公司 Interface monitoring optimization method and device, computer equipment and storage medium
CN112860512B (en) * 2021-01-29 2022-07-15 平安国际智慧城市科技股份有限公司 Interface monitoring optimization method and device, computer equipment and storage medium
CN112799817A (en) * 2021-02-02 2021-05-14 中国科学院计算技术研究所 Micro-service resource scheduling system and method
CN112988380A (en) * 2021-02-25 2021-06-18 电子科技大学 Kubernetes-based cluster load adjusting method and storage medium
CN112988380B (en) * 2021-02-25 2022-06-17 电子科技大学 Kubernetes-based cluster load adjusting method and storage medium
CN113014649A (en) * 2021-02-26 2021-06-22 济南浪潮高新科技投资发展有限公司 Cloud Internet of things load balancing method, device and equipment based on deep learning
CN112866041A (en) * 2021-04-23 2021-05-28 南京蓝洋智能科技有限公司 Adaptive network system and training method
CN113364831A (en) * 2021-04-27 2021-09-07 国网浙江省电力有限公司电力科学研究院 Multi-domain heterogeneous computing network resource credible cooperation method based on block chain
CN113364831B (en) * 2021-04-27 2022-07-19 国网浙江省电力有限公司电力科学研究院 Multi-domain heterogeneous computing network resource credible cooperation method based on block chain
CN113176947A (en) * 2021-05-08 2021-07-27 武汉理工大学 Dynamic task placement method based on delay and cost balance in serverless computing
CN113176947B (en) * 2021-05-08 2024-05-24 武汉理工大学 Dynamic task placement method based on delay and cost balance in server-free calculation
CN113391907A (en) * 2021-06-25 2021-09-14 中债金科信息技术有限公司 Task placement method, device, equipment and medium
CN113408641B (en) * 2021-06-30 2024-04-26 北京百度网讯科技有限公司 Training of resource generation model and generation method and device of service resource
CN113408641A (en) * 2021-06-30 2021-09-17 北京百度网讯科技有限公司 Method and device for training resource generation model and generating service resources
CN113448425B (en) * 2021-07-19 2022-09-09 哈尔滨工业大学 Dynamic parallel application program energy consumption runtime optimization method and system based on reinforcement learning
CN113448425A (en) * 2021-07-19 2021-09-28 哈尔滨工业大学 Dynamic parallel application program energy consumption runtime optimization method and system based on reinforcement learning
CN113691840A (en) * 2021-08-31 2021-11-23 江苏赞奇科技股份有限公司 Video stream control method and system with high availability
CN114116156A (en) * 2021-10-18 2022-03-01 武汉理工大学 Cloud-edge cooperative double-profit equilibrium taboo reinforcement learning resource allocation method
CN114116156B (en) * 2021-10-18 2022-09-09 武汉理工大学 Cloud-edge cooperative double-profit equilibrium taboo reinforcement learning resource allocation method
CN114339311A (en) * 2021-12-09 2022-04-12 北京邮电大学 Video cloud transcoding and distribution joint decision method and system
CN114338504A (en) * 2022-03-15 2022-04-12 武汉烽火凯卓科技有限公司 Micro-service deployment and routing method based on network edge system

Also Published As

Publication number Publication date
CN111444009B (en) 2022-10-14

Similar Documents

Publication Publication Date Title
CN111444009B (en) Resource allocation method and device based on deep reinforcement learning
CN110851529B (en) Calculation power scheduling method and related equipment
CN110163368B (en) Deep learning model training method, device and system based on mixed precision
CN113326126B (en) Task processing method, task scheduling method, device and computer equipment
US10838839B2 (en) Optimizing adaptive monitoring in resource constrained environments
US20190325304A1 (en) Deep Reinforcement Learning for Workflow Optimization
US20190251443A1 (en) Automatically scaling neural networks based on load
CN110413396B (en) Resource scheduling method, device and equipment and readable storage medium
CN109799550B (en) Method and device for predicting rainfall intensity
US20170277620A1 (en) Systems and methods for providing dynamic and real time simulations of matching resources to requests
CN113824489A (en) Satellite network resource dynamic allocation method, system and device based on deep learning
CN112884016A (en) Cloud platform credibility evaluation model training method and cloud platform credibility evaluation method
CN110795217A (en) Task allocation method and system based on resource management platform
CN113485833B (en) Resource prediction method and device
CN113378498A (en) Task allocation method and device
CN112019382B (en) Health assessment method, system and device of cloud computing management platform
CN111813524B (en) Task execution method and device, electronic equipment and storage medium
CN111836274B (en) Service processing method and device
CN113591999A (en) End edge cloud federal learning model training system and method
CN112101394B (en) Provider domain deployment method, device, computing equipment and computer storage medium
US11599793B2 (en) Data integration demand management using artificial intelligence
CN117455660B (en) Financial real-time safety detection system, method, equipment and storage medium
CN113313195B (en) Labeling task processing method, labeling task processing device, labeling task processing equipment, labeling task processing storage medium and labeling task processing program product
US20230064500A1 (en) Optimizing machine learning as-a-service performance for cellular communication systems
US11567809B2 (en) Accelerating large-scale image distribution

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant