CN109947567B - Multi-agent reinforcement learning scheduling method and system and electronic equipment - Google Patents
Info
- Publication number
- CN109947567B CN109947567B CN201910193429.XA CN201910193429A CN109947567B CN 109947567 B CN109947567 B CN 109947567B CN 201910193429 A CN201910193429 A CN 201910193429A CN 109947567 B CN109947567 B CN 109947567B
- Authority
- CN
- China
- Prior art keywords
- scheduling
- agent
- virtual machine
- service node
- reinforcement learning
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06N3/04—Neural networks; Architecture, e.g. interconnection topology
- G06N3/08—Neural networks; Learning methods
Abstract
The application relates to a multi-agent reinforcement learning scheduling method and system and to electronic equipment. The method comprises the following steps: step a: collecting server parameters from the network data center and the load information of the virtual machines running on each server; step b: building a virtual simulation environment from the server parameters and virtual machine load information, and building a multi-agent deep reinforcement learning model; step c: performing offline training and learning with the multi-agent deep reinforcement learning model in the simulation environment, training one agent model for each server; step d: deploying the agent models to the real service nodes and scheduling according to the load of each service node. The method and system virtualize the services running on the servers through virtualization technology and balance load by scheduling virtual machines; resource allocation therefore operates at a macroscopic level, and the multiple agents can learn cooperative strategies in a complex dynamic environment.
Description
Technical Field
The present application relates to the field of multi-agent systems, and in particular, to a method, a system, and an electronic device for multi-agent reinforcement learning scheduling.
Background
In a cloud computing environment, the traditional service deployment mode copes poorly with variable access patterns. Although a fixed allocation of resources can provide services stably, it wastes a large amount of resources: within the same network topology, some servers may routinely run at full load while others host only a few services and leave much of their storage and computing capacity idle. Traditional deployment thus struggles to avoid this waste and to schedule efficiently, so resources cannot be fully utilized. A scheduling algorithm that adapts to dynamic environments is therefore needed to balance the load of the individual servers in the network.
With the development of virtualization technology, the appearance of virtual machines, containers, and related technologies has moved the resource scheduling problem from static allocation to dynamic allocation, and in recent years schemes for adaptive resource scheduling have multiplied. Most adopt a heuristic algorithm: they schedule dynamically by adjusting parameters, judge whether the available resources in the operating environment are abundant or insufficient against a threshold, and iteratively compute a suitable threshold with the heuristic. However, such a method merely seeks an optimal solution over a massive combination of data, and the resulting decision is optimal only for the current time node; the time-series information is not fully exploited, so the method struggles with resource allocation in a large-scale, complex, dynamic environment.
With the rise of artificial intelligence, the development of deep reinforcement learning has made agent decision-making over large state spaces possible. In multi-agent reinforcement learning, however, distributed learning with a traditional algorithm such as Q-learning or PG (the Policy Gradient method) still falls short of the expected effect: at every step each agent tries to learn and predict the actions of the other agents, and because the other agents keep changing in a dynamic environment, the environment becomes non-stationary, knowledge is hard to learn, and optimal resource allocation cannot be achieved. Moreover, most current scheduling approaches use either single-agent or distributed reinforcement learning. Training a single agent centrally is difficult and converges poorly, because a network topology produces a huge action space of complex state changes and permutations. Distributed reinforcement learning faces another problem: it usually trains multiple agents together only to accelerate convergence, so the agents share one scheduling strategy; the multiple entities merely speed up training, and the resulting homogeneous agents have no cooperative ability. In the traditional multi-agent method, each agent predicts the decisions of the other agents at every decision step, but since those decisions are unstable in a dynamic environment, training is very difficult and the agents end up behaving almost identically, without a cooperative strategy.
Disclosure of Invention
The application provides a multi-agent reinforcement learning scheduling method, a multi-agent reinforcement learning scheduling system and electronic equipment, and aims to solve at least one of the technical problems in the prior art to a certain extent.
In order to solve the above problems, the present application provides the following technical solutions:
a multi-agent reinforcement learning scheduling method comprises the following steps:
step a: collecting server parameters from the network data center and the load information of the virtual machines running on each server;
step b: building a virtual simulation environment from the server parameters and virtual machine load information, and building a multi-agent deep reinforcement learning model;
step c: performing offline training and learning with the multi-agent deep reinforcement learning model in the simulation environment, training one agent model for each server;
step d: deploying the agent models to the real service nodes and scheduling according to the load of each service node.
The technical scheme adopted by the embodiment of the application further comprises: the step a further comprises performing a normalization preprocessing operation on the collected server parameters and virtual machine load information. The normalization preprocessing comprises: defining the virtual machine information of each service node as a tuple containing the number of virtual machines and their respective configurations; each virtual machine has two scheduling states, to-be-scheduled and running; each service node has two states, saturated and hungry; and the sum of the resource ratios occupied by the virtual machines must remain below the configured upper limit of the server hosting them.
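As an illustration only, the node/VM tuple described above could be sketched in Python as follows. All field and class names here are assumptions for the sketch; the patent only specifies a tuple of VM count plus per-VM configuration, VM states {to-be-scheduled, running}, node states {saturated, hungry}, and the resource-ratio upper bound:

```python
from dataclasses import dataclass, field

@dataclass
class VirtualMachine:
    cpu: float   # fraction of host CPU occupied (hypothetical normalization)
    mem: float   # fraction of host memory occupied
    disk: float  # fraction of host disk occupied
    state: str = "running"  # "running" or "to_be_scheduled"

@dataclass
class ServiceNode:
    vms: list = field(default_factory=list)
    state: str = "hungry"   # "saturated" or "hungry"

    def is_valid(self) -> bool:
        # The sum of per-VM resource ratios must stay below the host's
        # configured upper limit (normalized here to 1.0).
        return all(
            sum(getattr(vm, r) for vm in self.vms) < 1.0
            for r in ("cpu", "mem", "disk")
        )

node = ServiceNode(vms=[VirtualMachine(0.3, 0.2, 0.1),
                        VirtualMachine(0.4, 0.3, 0.2)])
print(node.is_valid())
```

A node whose VMs together exceed the host limit on any resource would fail this check and could not absorb further scheduled-in VMs.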
The technical scheme adopted by the embodiment of the application further comprises: in the step b, the multi-agent deep reinforcement learning model specifically comprises a prediction module and a scheduling module. The prediction module predicts, from the information input by each service node, the resources that need to be scheduled out in the current state, and maps the action space onto the total capacity of the current service node according to its configuration information. The scheduling module reschedules and distributes the virtual machines marked as to-be-scheduled to generate a scheduling strategy, and the agent on each service node computes a return function from the generated scheduling action. The prediction module measures the quality of the scheduling strategy, so that the load of each service node in the whole network is balanced.
The technical scheme adopted by the embodiment of the application further comprises: in step c, performing offline training and learning with the multi-agent deep reinforcement learning model in the simulation environment, and training an agent model for each server, specifically comprises: the agent on each service node adjusts the amount of resource to be scheduled through its prediction module and marks the virtual machines to be scheduled out; a scheduling strategy is generated from the virtual machines in the to-be-scheduled state; the return value of each service node is computed, and the per-node returns are summed into a total return value, according to which the parameters of each prediction module are adjusted.
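A minimal sketch of this offline training loop, with the prediction, marking/scheduling, and per-node return computations replaced by simple stand-in functions (all names and the balancing heuristic are assumptions, not the patent's actual modules):

```python
import random

random.seed(0)
N_NODES = 4

def predict_space_to_free(node_load):
    # Stand-in for the prediction module: how much load this node should
    # schedule out (negative means it can absorb load from others).
    return node_load - 0.5

def mark_and_schedule(predictions):
    # Stand-in for the scheduling module: shift load from over-predicted
    # nodes toward under-predicted ones.
    mean = sum(predictions) / len(predictions)
    return [p - mean for p in predictions]

def node_reward(load):
    # Stand-in per-node return: less negative the closer a node sits
    # to the balanced load of 0.5.
    return -abs(load - 0.5)

loads = [random.random() for _ in range(N_NODES)]
for step in range(10):
    preds = [predict_space_to_free(l) for l in loads]
    moves = mark_and_schedule(preds)
    loads = [l - m for l, m in zip(loads, moves)]
    # Per-node returns are summed into one total return value, which in
    # the patent's scheme drives the update of every prediction module.
    total_return = sum(node_reward(l) for l in loads)
print(loads, total_return)
```

After a few iterations the stand-in scheduler equalizes the node loads, and the total return stabilizes; in the real model the prediction modules would instead be updated by gradient steps on this summed return.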
The technical scheme adopted by the embodiment of the application further comprises: in the step d, deploying the agent models to the real service nodes and scheduling according to the load of each service node specifically comprises: deploying each trained agent model to its corresponding service node in the real environment; sensing the state information of its host server over a period of time as input, predicting the resources the current server needs to release, and using a knapsack algorithm to select the virtual machines closest to that target, marking them as to-be-scheduled; then collecting, through the scheduling module, the predictions from all servers together with the virtual machines marked as to-be-scheduled, assigning the to-be-scheduled virtual machines to suitable servers as required to generate a scheduling strategy, and distributing the scheduling commands to the corresponding service nodes for execution; before executing the scheduling strategy, checking whether each scheduling command is legal; if not, feeding back a penalty reward to update the parameters and regenerating the scheduling strategy; if legal, executing the scheduling operation and using the feedback reward value to update the agent parameters.
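The legality check and penalty feedback described above could be sketched as follows. The capacity model, command format, and penalty value are assumptions for the sketch, not details from the patent:

```python
def is_legal(command, capacities, usage):
    # A scheduling command is taken to be legal when the target server
    # can absorb the VM without exceeding its capacity.
    vm_size, target = command
    return usage[target] + vm_size <= capacities[target]

def apply_or_penalize(commands, capacities, usage, penalty=-1.0):
    # Illegal strategy: feed back a penalty reward so the agents update
    # their parameters and regenerate the scheduling strategy.
    if not all(is_legal(c, capacities, usage) for c in commands):
        return penalty, usage
    # Legal strategy: execute every scheduling operation.
    for vm_size, target in commands:
        usage[target] += vm_size
    return 0.0, usage

caps = {"s1": 1.0, "s2": 1.0}
use = {"s1": 0.9, "s2": 0.2}
reward, use = apply_or_penalize([(0.3, "s2")], caps, dict(use))
print(reward, use["s2"])
```

Moving a 0.3-sized VM onto the lightly loaded server is legal and executes; trying to move it onto the nearly full server would instead return the penalty with the usage unchanged.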
Another technical scheme adopted by the embodiment of the application is as follows: a multi-agent reinforcement learning scheduling system comprising:
an information collection module: configured to collect the server parameters of the network data center and the load information of the virtual machines running on each server;
a reinforcement learning model construction module: configured to build a virtual simulation environment from the server parameters and virtual machine load information, and to build a multi-agent deep reinforcement learning model;
an agent model training module: configured to perform offline training and learning with the multi-agent deep reinforcement learning model in the simulation environment, training one agent model for each server;
an agent deployment module: configured to deploy the agent models to the real service nodes and to schedule according to the load of each service node.
The technical scheme adopted by the embodiment of the application further comprises a preprocessing module, configured to perform a normalization preprocessing operation on the collected server parameters and virtual machine load information. The normalization preprocessing comprises: defining the virtual machine information of each service node as a tuple containing the number of virtual machines and their respective configurations; each virtual machine has two scheduling states, to-be-scheduled and running; each service node has two states, saturated and hungry; and the sum of the resource ratios occupied by the virtual machines must remain below the configured upper limit of the server hosting them.
The technical scheme adopted by the embodiment of the application further comprises: the reinforcement learning model construction module comprises a prediction module and a scheduling module, wherein the prediction module comprises:
a state sensing unit: configured to predict, from the information input by each service node, the resources that need to be scheduled out in the current state;
an action space unit: configured to map the action space onto the total capacity of the current service node according to its configuration information;
the scheduling module reschedules and distributes the virtual machines marked as to-be-scheduled to generate a scheduling strategy, and the agent on each service node computes a return function from the generated scheduling action;
the prediction module further comprises:
a return function unit: configured to measure the quality of the scheduling strategy, so that the load of each service node in the whole network is balanced.
The technical scheme adopted by the embodiment of the application further comprises: the agent model training module performs offline training and learning with the multi-agent deep reinforcement learning model in the simulation environment; training an agent model for each server specifically comprises: the agent on each service node adjusts the amount of resource to be scheduled through its prediction module and marks the virtual machines to be scheduled out; a scheduling strategy is generated from the virtual machines in the to-be-scheduled state; the return value of each service node is computed, and the per-node returns are summed into a total return value, according to which the parameters of each prediction module are adjusted.
The technical scheme adopted by the embodiment of the application further comprises: the agent deployment module deploys the agent models to the real service nodes, and scheduling according to the load of each service node specifically comprises: deploying each trained agent model to its corresponding service node in the real environment; sensing the state information of its host server over a period of time as input, predicting the resources the current server needs to release, and using a knapsack algorithm to select the virtual machines closest to that target, marking them as to-be-scheduled; then collecting, through the scheduling module, the predictions from all servers together with the virtual machines marked as to-be-scheduled, assigning the to-be-scheduled virtual machines to suitable servers as required to generate a scheduling strategy, and distributing the scheduling commands to the corresponding service nodes for execution; before executing the scheduling strategy, checking whether each scheduling command is legal; if not, feeding back a penalty reward to update the parameters and regenerating the scheduling strategy; if legal, executing the scheduling operation and using the feedback reward value to update the agent parameters.
The embodiment of the application adopts another technical scheme that: an electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the following operations of the multi-agent reinforcement learning scheduling method described above:
step a: collecting server parameters from the network data center and the load information of the virtual machines running on each server;
step b: building a virtual simulation environment from the server parameters and virtual machine load information, and building a multi-agent deep reinforcement learning model;
step c: performing offline training and learning with the multi-agent deep reinforcement learning model in the simulation environment, training one agent model for each server;
step d: deploying the agent models to the real service nodes and scheduling according to the load of each service node.
Compared with the prior art, the embodiment of the application is advantageous in that the multi-agent reinforcement learning scheduling method and system and the electronic equipment virtualize the services running on the servers through virtualization technology and balance load by scheduling virtual machines. Because the scheduling range is not limited to a single server, a virtual machine on a server in a high-load state can be scheduled onto other low-load servers; compared with per-server resource allocation schemes, this is more macroscopic. Meanwhile, the MADDPG framework extends the actor-critic architecture: the critic receives extra information about the decisions of the other agents during training, while each agent uses only local information when acting, and under this framework multiple agents can learn a cooperative strategy in a complex dynamic environment.
Drawings
FIG. 1 is a flow chart of a multi-agent reinforcement learning scheduling method according to an embodiment of the present application;
FIG. 2 is a diagram of a MADDPG scheduling framework according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a scheduling overall framework according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a multi-agent reinforcement learning scheduling system according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a hardware device of a multi-agent reinforcement learning scheduling method according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
To remedy the defects of the prior art, the multi-agent reinforcement learning scheduling method of the embodiment of the application applies multi-agent reinforcement learning: a model is built from the load information of each service node in the cloud service environment, a recurrent neural network learns the time-series information used for decisions, one agent is trained for each server, and the multiple agents with different tasks compete or cooperate to maintain load balance over the whole network topology. After initial training, each agent is placed on a real service node and schedules according to that node's load; while deciding and scheduling, each agent continues to learn and improve from its own environment and the decision memory of the other nodes, so that it can cooperate with the agents of the other nodes to generate a scheduling strategy and achieve load balance across the service nodes.
Specifically, please refer to fig. 1, which is a flowchart illustrating a multi-agent reinforcement learning scheduling method according to an embodiment of the present application. The multi-agent reinforcement learning scheduling method comprises the following steps:
step 100: collecting server parameters of a network data center and virtual machine load information running on each server;
in step 100, the collected server parameters specifically include: collecting configuration information, memory, hard disk storage space and the like of each server in a real scene for a period of time; the collected virtual machine load information specifically includes: and collecting parameters of resources occupied by the virtual machine running on each server, such as CPU occupancy rate, memory and hard disk occupancy rate and the like.
Step 200: performing preprocessing operations such as normalization on the collected server parameters and virtual machine load information;
in step 200, the preprocessing operation specifically includes: defining the virtual machine information of each service node as a tuple, wherein the tuple comprises the number of virtual machines and the respective configuration of the virtual machines, including a CPU, a memory, a hard disk and the current state, each virtual machine comprises two scheduling states, namely a to-be-scheduled state and a running state, each service node comprises two states, namely a saturated state and a hungry state, and the sum of the resource ratio occupied by each virtual machine cannot be more than the upper limit of the configuration of the server.
Step 300: establishing a virtual simulation environment with the preprocessed data, and establishing the multi-agent deep reinforcement learning model;
in step 300, establishing a deep reinforcement learning model of a multi-agent specifically includes: modeling the collected time sequence dynamic information (server parameters and virtual machine load information) to create a simulation environment for off-line training, wherein the model adopts a multi-agent deep reinforcement learning model, and in order to fully utilize the influence of time sequence data, an LSTM model is adopted in a deep network part in the model to extract the time sequence information, so that the influence of abnormal data fluctuation in an instantaneous state on decision making is avoided. The model adopts a MADDPG (Multi-Agent Deep Deterministic Policy Gradient, namely, a Multi-Agent activator-critical for Mixed Cooperative-comprehensive environment from OpenAI) framework, the MADDPG framework is the expansion of a DDPG (continuous control with Deep learning article published by Google Deep Mind) algorithm in the Multi-Agent field, and the DDPG algorithm applies Deep reinforcement learning to a continuous action space. And the action space obtained by the deep learning part is set as the resource occupation ratio of the virtual machine in the state to be scheduled, namely the load balance of the current service node can be maintained only by scheduling the occupied space. 
According to the obtained to-be-scheduled space, virtual machines of suitable size are marked as to-be-scheduled; then, for each service node in the whole network, the return rewards of the to-be-scheduled virtual machines with respect to each service node are computed, and the reward a virtual machine would obtain by being assigned to a service node is used as a distance measure to generate a scheduling strategy. Finally, the strategy is checked for executability: if executable, the to-be-scheduled virtual machines are scheduled to other suitable service nodes; if not, a negative-feedback penalty is returned and the agents generate the scheduling strategy again. The detailed scheduling framework is shown in fig. 2.
In the embodiment of the application, to counter the effect of instantaneous abnormal load fluctuations in a dynamic environment, a recurrent neural network, the LSTM (long short-term memory network), replaces the fully connected neural network in the deep reinforcement learning model, so that the agent can learn the hidden information among time-series data and thereby achieve adaptive scheduling based on spatio-temporal perception.
In the above, the agent on each service node marks virtual machines as to-be-scheduled by solving a knapsack problem: the predicted to-be-scheduled space serves as the knapsack capacity, the resources occupied by each virtual machine serve as both the weight and the value of the items, the maximum value that fits into the knapsack is computed, and the loaded virtual machines are marked as to-be-scheduled. The to-be-scheduled spaces predicted on the service nodes are then aggregated (negative values indicate how many scheduled-out resources a node can absorb to make full use of its resources); the objective is to minimize, over the service nodes, the sum of the occupied to-be-scheduled space and the remaining to-be-scheduled space, and a scheduling strategy is obtained by this calculation.
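The knapsack-style marking step can be sketched as a 0/1 knapsack solved by dynamic programming on a discretized resource grid, with each VM's occupied resource acting as both item weight and value, as described above. The function name, grid resolution, and backtracking details are assumptions of this sketch:

```python
def mark_vms_for_scheduling(vm_sizes, space_to_free, grid=100):
    """Select the subset of VMs whose total occupied resource comes
    closest to (without exceeding) the predicted space to free."""
    cap = int(space_to_free * grid)
    w = [int(s * grid) for s in vm_sizes]
    # dp[c] = best achievable total weight with capacity c
    dp = [0] * (cap + 1)
    keep = [[False] * (cap + 1) for _ in vm_sizes]
    for i, wi in enumerate(w):
        for c in range(cap, wi - 1, -1):  # iterate downward: 0/1 knapsack
            if dp[c - wi] + wi > dp[c]:
                dp[c] = dp[c - wi] + wi
                keep[i][c] = True
    # Backtrack to recover which VMs get marked as to-be-scheduled.
    marked, c = [], cap
    for i in range(len(vm_sizes) - 1, -1, -1):
        if keep[i][c]:
            marked.append(i)
            c -= w[i]
    return sorted(marked)

# Free 0.5 of the host: the 0.2 and 0.3 VMs together fit exactly.
print(mark_vms_for_scheduling([0.2, 0.3, 0.4], 0.5))  # [0, 1]
```

A VM larger than the space to free is never marked, matching the requirement that the selection must fit within the predicted knapsack capacity.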
In the embodiment of the application, the MADDPG framework extends deep reinforcement learning to the multi-agent field; the algorithm follows centralized learning with decentralized execution in a multi-agent environment, and with this framework multiple agents can learn to cooperate and compete.
Specifically, the MADDPG algorithm considers a game among $n$ agents whose policies are parameterized by $\theta = \{\theta_1, \theta_2, \theta_3, \dots, \theta_n\}$; the set of all agent policies is $\pi = \{\pi_1, \pi_2, \pi_3, \dots, \pi_n\}$. The expected return of the $i$-th agent is $J(\theta_i) = \mathbb{E}[R_i]$. Considering a deterministic policy $\mu_{\theta_i}$ (written $\mu_i$), the gradient can be expressed as:

$$\nabla_{\theta_i} J(\mu_i) = \mathbb{E}_{x,\, a \sim \mathcal{D}}\left[\nabla_{\theta_i}\mu_i(a_i \mid o_i)\, \nabla_{a_i} Q_i^{\mu}(x, a_1, \dots, a_n)\,\big|_{a_i=\mu_i(o_i)}\right]$$

where $x = (o_1, \dots, o_n)$ is the joint observation and $\mathcal{D}$ is the experience replay buffer.
Specifically, the deep reinforcement learning model comprises a prediction module and a scheduling module, the prediction module comprises a state sensing unit, an action space unit and a reward function unit, and the specific functions are as follows:
a state sensing unit: predicting resources needing to be scheduled out in the current state through information input by each node, wherein the input state is defined through load information of each node and resources occupied by running virtual machines;
an action space unit: mapping the action space to the total capacity of the current service node according to the configuration information of the current node;
a scheduling module: according to the marked virtual machine in the state to be scheduled, rescheduling and distributing are carried out to generate a scheduling strategy, and an agent on each service node calculates a return function according to the generated scheduling action;
a reward function unit: measuring the quality of a scheduling strategy, wherein the target is load balance of each service node in the whole network, and a return function on each service node is calculated independently; the return function is formulated as follows:
in the above formula, the first and second carbon atoms are,riis the reward return on each service node, wherein c represents the CPU occupancy rate on the ith machine, and alpha and beta are penalty coefficients. α can be set as the case may be, indicating a threshold value at which it is desired that the server CPU occupancy load remain steady.
In the above formula, R is the overall reward function, and the final optimization target is to maximize R for the scheduling policy cooperatively generated by the agents.
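The patent references reward formula images that are not reproduced in this text, so the exact functional form of r_i is unknown. The sketch below is a hedged reconstruction consistent with the surrounding description: it assumes a simple penalty of the form r_i = −β·|c_i − α|, which is maximal when a node's CPU occupancy sits at the desired threshold α, and sums the per-node rewards into the overall R.

```python
def node_reward(cpu, alpha, beta):
    """Assumed per-node reward: penalize deviation of CPU occupancy `cpu`
    from the desired steady-load threshold `alpha`, scaled by `beta`.
    The actual formula in the patent is not reproduced here."""
    return -beta * abs(cpu - alpha)

def total_reward(cpus, alpha, beta):
    """Overall R: sum of per-node rewards, which the cooperating agents
    jointly try to maximize."""
    return sum(node_reward(c, alpha, beta) for c in cpus)
```

Under this assumed form, a perfectly balanced network (every node at occupancy α) yields the maximum possible R of zero, and any deviation on any node lowers the shared return.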
Step 400: off-line training and learning are carried out by utilizing a deep reinforcement learning model of a plurality of intelligent agents and a simulation environment, and an intelligent agent model is trained for each server respectively;
In step 400, offline training is performed in a simulation environment established from real data, and an agent is created for each service node. The agent on each service node adjusts the amount of resources to be scheduled through its prediction module and marks the virtual machines to be scheduled out; a scheduling policy is generated from the virtual machines in the to-be-scheduled state, the return value of each service node is calculated, and the return values are summed to obtain a total return, according to which the parameters of each prediction module are adjusted.
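One offline training step, as described above, can be sketched as follows. This is an outline under stated assumptions: the method names on the `agents` and `env` objects (`predict_space`, `mark_vms`, `build_schedule`, `execute`, `update`) are illustrative, not APIs defined by the patent.

```python
def offline_training_step(agents, env):
    """One simulated step: each agent predicts its space to be scheduled,
    VMs are marked, a joint scheduling policy is built and executed, the
    per-node returns are summed into a total return, and every agent's
    prediction module is adjusted on that shared signal."""
    spaces = [a.predict_space(env.observe(i)) for i, a in enumerate(agents)]
    marked = env.mark_vms(spaces)        # knapsack-style marking per node
    policy = env.build_schedule(marked)  # joint scheduling policy
    rewards = env.execute(policy)        # per-service-node return values
    total = sum(rewards)                 # summarized total return
    for a in agents:
        a.update(total)                  # adjust prediction-module parameters
    return total
```

Note that every agent updates on the same summed return, matching the cooperative objective in which the agents jointly maximize the overall reward rather than their individual node rewards.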
Step 500: and deploying the trained intelligent agent model to the real service nodes, and scheduling according to the load condition of each service node.
In general, multi-agent reinforcement learning obtains a scheduling action directly from the environment input. In a complex network topology, however, the action space of a virtual machine scheduling strategy becomes too large, and an overly large action space makes the algorithm difficult to converge. In that formulation, each virtual machine running in the topology must be assigned a global id to specify a scheduling target; although the id can index the virtual machine, the resources it occupies are likely to change at runtime, so the strategy learned during training is unreliable. Even if the occupied resources do not change, an agent trained with that algorithm will not consider a newly added virtual machine in its decisions. The method therefore improves on the algorithm by redefining the model's action space as the resources the current server wants to release, that is, how many resources it expects to schedule out to maintain load balance across the overall network topology. With this arrangement, marking each virtual machine with a global id is avoided, and operation can continue even if a new virtual machine is added midway, making the scheduling algorithm more flexible and adaptable to a wider range of scenarios.
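The redefined action space can be made concrete with a small sketch. The squashing of the policy output into [−1, 1] is an assumption (a common DDPG convention via a tanh output layer), not something the patent specifies; the mapping onto node capacity follows the action space unit described above, with negative values meaning the node could absorb load instead of releasing it.

```python
def action_to_release_target(raw_action, node_capacity):
    """Map a normalized policy output (assumed in [-1, 1]) onto the node's
    total capacity, so the action reads as "resources to release".
    Negative results mean the node has room to absorb that much load."""
    clipped = max(-1.0, min(1.0, raw_action))
    return clipped * node_capacity
```

Because the action is an amount of resources rather than a per-VM choice, its dimensionality is independent of how many virtual machines exist, which is what lets newly added virtual machines participate without retraining.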
Please refer to fig. 4, which is a schematic structural diagram of a multi-agent reinforcement learning scheduling system according to an embodiment of the present application. The multi-agent reinforcement learning scheduling system comprises an information collection module, a preprocessing module, a reinforcement learning model construction module, an agent model training module and an agent deployment module.
An information collection module: used for collecting the server parameters of the data center and the load information of the virtual machines running on each server. The collected server parameters specifically include: configuration information, memory, hard disk storage space, and the like, gathered for each server in a real scenario over a period of time. The collected virtual machine load information specifically includes: the resources occupied by the virtual machines running on each server, such as CPU occupancy, memory and hard disk occupancy, and the like.
A preprocessing module: used for performing preprocessing operations such as normalization on the collected server parameters and virtual machine load information. The preprocessing specifically comprises: defining the virtual machine information of each service node as a tuple containing the number of virtual machines and their respective configurations, including CPU, memory, hard disk, and current state. Each virtual machine has two scheduling states, to-be-scheduled and running; each service node has two states, saturated and hungry; and the sum of the resource ratios occupied by the virtual machines must not exceed the configured upper limit of the server.
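The tuple representation and its invariants can be sketched with plain dataclasses. The field names and the 0.8 saturation threshold are illustrative assumptions; resource ratios are normalized so the server's configured upper limit is 1.0.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class VM:
    cpu: float    # fraction of the host's CPU this VM occupies
    mem: float    # fraction of the host's memory
    disk: float   # fraction of the host's disk
    state: str    # "running" or "to_schedule"

@dataclass
class ServiceNode:
    vms: List[VM]

    def valid(self) -> bool:
        # the summed resource ratios of the VMs must not exceed the
        # server's configured upper limit (normalized to 1.0 here)
        return all(
            sum(getattr(v, r) for v in self.vms) <= 1.0
            for r in ("cpu", "mem", "disk")
        )

    def status(self, threshold: float = 0.8) -> str:
        # a node is "saturated" above the CPU threshold, else "hungry";
        # the 0.8 cut-off is an illustrative assumption
        return "saturated" if sum(v.cpu for v in self.vms) > threshold else "hungry"
```

A node holding VMs at 0.5 and 0.2 CPU is valid and hungry; pushing the CPU sum past 1.0 violates the invariant and would be rejected during preprocessing.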
A reinforcement learning model construction module: used for establishing a virtual simulation environment from the preprocessed data and building the multi-agent deep reinforcement learning model. Establishing the model specifically comprises: modeling the collected time-series dynamic information (server parameters and virtual machine load information) to create a simulation environment for offline training. The model is a multi-agent deep reinforcement learning model; to make full use of the time-series data, the deep network part adopts an LSTM to extract temporal information, avoiding the influence of transient abnormal data fluctuations on decision making. The model adopts the MADDPG framework, which extends the DDPG algorithm to the multi-agent setting; DDPG applies deep reinforcement learning to continuous action spaces. The action space produced by the deep learning part is defined as the resource ratio of virtual machines in the to-be-scheduled state, that is, the occupied space that must be scheduled out to keep the current service node load-balanced.
Virtual machines of suitable size are marked as to-be-scheduled according to the obtained to-be-scheduled space. The reward returns between the virtual machines in the to-be-scheduled state and each service node in the whole network are then calculated; using the reward a virtual machine would obtain on each service node as a distance measure, a scheduling strategy is generated. Finally, the strategy is checked for executability: if executable, the virtual machines in the to-be-scheduled state are scheduled to other suitable service nodes; if not, a negative-feedback penalty is returned and the agents regenerate the scheduling strategy.
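The assignment-with-feasibility-check step above can be sketched as a greedy placement. This is a simplification under stated assumptions: remaining free space stands in for the reward-based distance measure, and returning `None` stands in for the negative-feedback penalty that triggers regeneration.

```python
def assign_vms(marked_vms, free_space):
    """Greedy sketch: each VM in the to-be-scheduled state (a dict of
    vm_id -> load, largest first) goes to the node with the most remaining
    free space. Returns a vm_id -> node plan, or None when no node can
    host a VM (i.e. the strategy is not executable and must be redone)."""
    plan = {}
    free = dict(free_space)  # node -> remaining capacity, copied locally
    for vm_id, load in sorted(marked_vms.items(), key=lambda kv: -kv[1]):
        target = max(free, key=free.get)  # most spacious node first
        if free[target] < load:
            return None  # not executable -> penalize and regenerate
        plan[vm_id] = target
        free[target] -= load
    return plan
```

Placing big VMs first onto the emptiest nodes is a classic best-fit-decreasing heuristic; the patent's actual placement uses learned reward values, which this sketch does not model.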
In the embodiment of the application, in order to mitigate the influence of transient abnormal load fluctuations in a dynamic environment, a recurrent neural network, the LSTM (long short-term memory network), is used in place of the fully-connected neural network in deep reinforcement learning, so that the agent can learn hidden information among time-series data, thereby achieving adaptive scheduling based on spatio-temporal perception.
As described above, the agent on each service node marks virtual machines as being in the to-be-scheduled state by solving a knapsack problem: the predicted space to be scheduled serves as the knapsack capacity, the resources occupied by each virtual machine serve as both the weight and the value of the items, the maximum value that can be loaded into the knapsack is computed, and the loaded virtual machines are marked as to-be-scheduled. The predicted to-be-scheduled space on each service node is then aggregated (negative values indicate how many resources the node could absorb to fully utilize its capacity); the objective is to minimize, over all service nodes, the sum of the occupied space and the space still to be scheduled, from which a scheduling strategy can be computed.
In the embodiment of the application, the MADDPG framework extends deep reinforcement learning to the multi-agent setting; the algorithm follows centralized training with decentralized execution in a multi-agent environment, and with this framework multiple agents can learn to cooperate and compete.
Specifically, the MADDPG algorithm considers n agents with policies parameterized by θ = {θ_1, θ_2, θ_3, …, θ_n}, computed through the game among the agents; the set of all agents' policies is defined as π = {π_1, π_2, π_3, …, π_n}, and the expected return of the i-th agent is J(θ_i) = E[R_i]. For deterministic policies μ_{θ_i} parameterized by θ_i, the gradient can be expressed as:

∇_{θ_i} J(μ_i) = E_{x,a∼D}[ ∇_{θ_i} μ_i(a_i | o_i) ∇_{a_i} Q_i^μ(x, a_1, …, a_n) |_{a_i = μ_i(o_i)} ]

where x = (o_1, …, o_n) is the joint observation and D is the experience replay buffer.
Further, the reinforcement learning model building module comprises a prediction module and a scheduling module, the prediction module comprises a state sensing unit, an action space unit and a reward function unit, and the specific functions are as follows:
a state sensing unit: predicting resources needing to be scheduled out in the current state through information input by each node, wherein the input state is defined through load information of each node and resources occupied by running virtual machines;
an action space unit: mapping the action space to the total capacity of the current service node according to the configuration information of the current node;
a scheduling module: according to the marked virtual machine in the state to be scheduled, rescheduling and distributing are carried out to generate a scheduling strategy, and an agent on each service node calculates a return function according to the generated scheduling action;
a reward function unit: measuring the quality of a scheduling strategy, wherein the target is load balance of each service node in the whole network, and a return function on each service node is calculated independently; the return function is formulated as follows:
In the above formula, r_i is the reward return on each service node, c_i denotes the CPU occupancy on the i-th machine, and α and β are penalty coefficients. α can be set as circumstances require; it indicates the threshold at which the server CPU occupancy load is desired to remain steady.
In the above formula, R is the overall reward function, and the final optimization target is to maximize R for the scheduling policy cooperatively generated by the agents.
The agent model training module: used for performing offline training and learning with the multi-agent deep reinforcement learning model and the simulation environment, training one agent model for each server. Offline training is performed in a simulation environment established from real data, with an agent created for each service node. The agent on each service node adjusts the amount of resources to be scheduled through its prediction module and marks the virtual machines to be scheduled out; a scheduling strategy is generated from the virtual machines in the to-be-scheduled state, the return value of each service node is calculated, and the return values are summed to obtain a total return, according to which the parameters of each prediction module are finally adjusted.
An agent deployment module: used for deploying the trained agent models to the real service nodes and scheduling according to the load condition of each service node. Each trained agent model is placed on its corresponding service node in the real environment. The agent's prediction module then predicts and marks the to-be-scheduled state, the scheduling module performs unified allocation to generate a scheduling strategy, and the scheduling commands are distributed to the corresponding nodes for execution. Before a scheduling action is executed, it is checked for executability; if it cannot be executed or fails, a penalty reward is fed back to update the parameters and the scheduling strategy is regenerated, iterating until all scheduling strategies are executable.
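The predict/check/execute-or-regenerate loop of the deployment module can be sketched as follows. The method names on the environment and agent objects are illustrative assumptions, as is the bounded retry count; the patent describes iterating until an executable strategy is found.

```python
def deploy_and_schedule(agent_models, env, max_retries=10):
    """Online scheduling loop: predict per-node states, build a joint
    policy, check legality before executing, and regenerate on failure,
    feeding back a penalty reward to update the agents' parameters."""
    for _ in range(max_retries):
        states = [m.predict(env.node_state(i))
                  for i, m in enumerate(agent_models)]
        policy = env.build_schedule(states)
        if env.is_legal(policy):             # legality check before execution
            reward = env.execute(policy)
            for m in agent_models:
                m.update(reward)             # feedback reward updates params
            return policy
        for m in agent_models:
            m.update(env.penalty())          # negative feedback, then retry
    raise RuntimeError("no executable scheduling policy found")
```

Checking legality before execution matters in a live cluster: an illegal migration command should cost only a penalty signal, never a failed migration of a running virtual machine.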
In general, multi-agent reinforcement learning obtains a scheduling action directly from the environment input. In a complex network topology, however, the action space of a virtual machine scheduling strategy becomes too large, and an overly large action space makes the algorithm difficult to converge. In that formulation, each virtual machine running in the topology must be assigned a global id to specify a scheduling target; although the id can index the virtual machine, the resources it occupies are likely to change at runtime, so the strategy learned during training is unreliable. Even if the occupied resources do not change, an agent trained with that algorithm will not consider a newly added virtual machine in its decisions. The method therefore improves on the algorithm by redefining the model's action space as the resources the current server wants to release, that is, how many resources it expects to schedule out to maintain load balance across the overall network topology. With this arrangement, marking each virtual machine with a global id is avoided, and operation can continue even if a new virtual machine is added midway, making the scheduling algorithm more flexible and adaptable to a wider range of scenarios.
Fig. 5 is a schematic structural diagram of a hardware device of a multi-agent reinforcement learning scheduling method according to an embodiment of the present application. As shown in fig. 5, the device includes one or more processors and memory. Taking a processor as an example, the apparatus may further include: an input system and an output system.
The processor, memory, input system, and output system may be connected by a bus or other means, as exemplified by the bus connection in fig. 5.
The memory, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules. The processor executes various functional applications and data processing of the electronic device, i.e., implements the processing method of the above-described method embodiment, by executing the non-transitory software program, instructions and modules stored in the memory.
The memory may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data and the like. Further, the memory may include high speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor, and these remote memories may be connected to the processing system over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input system may receive input numeric or character information and generate a signal input. The output system may include a display device such as a display screen.
The one or more modules are stored in the memory and, when executed by the one or more processors, perform the following for any of the above method embodiments:
step a: collecting server parameters of a network data center and virtual machine load information running on each server;
step b: establishing a virtual simulation environment by using the server parameters and the virtual machine load information, and establishing a deep reinforcement learning model of the multi-agent;
step c: off-line training and learning are carried out by utilizing the deep reinforcement learning model and the simulation environment of the multi-agent, and an agent model is trained for each server respectively;
step d: and deploying the intelligent agent model to real service nodes, and scheduling according to the load condition of each service node.
The product can execute the method provided by the embodiment of the application, and has the corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the methods provided in the embodiments of the present application.
Embodiments of the present application provide a non-transitory (non-volatile) computer storage medium having stored thereon computer-executable instructions that may perform the following operations:
step a: collecting server parameters of a network data center and virtual machine load information running on each server;
step b: establishing a virtual simulation environment by using the server parameters and the virtual machine load information, and establishing a deep reinforcement learning model of the multi-agent;
step c: off-line training and learning are carried out by utilizing the deep reinforcement learning model and the simulation environment of the multi-agent, and an agent model is trained for each server respectively;
step d: and deploying the intelligent agent model to real service nodes, and scheduling according to the load condition of each service node.
Embodiments of the present application provide a computer program product comprising a computer program stored on a non-transitory computer readable storage medium, the computer program comprising program instructions that, when executed by a computer, cause the computer to perform the following:
step a: collecting server parameters of a network data center and virtual machine load information running on each server;
step b: establishing a virtual simulation environment by using the server parameters and the virtual machine load information, and establishing a deep reinforcement learning model of the multi-agent;
step c: off-line training and learning are carried out by utilizing the deep reinforcement learning model and the simulation environment of the multi-agent, and an agent model is trained for each server respectively;
step d: and deploying the intelligent agent model to real service nodes, and scheduling according to the load condition of each service node.
The multi-agent reinforcement learning scheduling method, system, and electronic device of the application virtualize the services running on the servers through virtualization technology and perform load balancing by scheduling virtual machines. Because the scheduling scope is not limited to a single server, when one server is under high load its virtual machines can be scheduled to other low-load servers, a more global approach than per-server resource allocation schemes. Meanwhile, the MADDPG framework extends the actor-critic (AC) framework: the critic is given extra information about the decisions of the other agents, while each agent trains using only local information; through this framework, multiple agents can develop cooperative strategies in a complex dynamic environment.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (9)
1. A multi-agent reinforcement learning scheduling method is characterized by comprising the following steps:
step a: collecting server parameters of a network data center and virtual machine load information running on each server;
step b: establishing a virtual simulation environment by using the server parameters and the virtual machine load information, and establishing a deep reinforcement learning model of the multi-agent;
step c: off-line training and learning are carried out by utilizing the deep reinforcement learning model of the multi-agent and the virtual simulation environment, and an agent model is trained for each server;
step d: deploying the intelligent agent model to a real service node, and scheduling according to the load condition of each service node;
in the step d, the deploying the agent model to the real service nodes and scheduling according to the load condition of each service node specifically comprises: deploying a trained intelligent agent model to a corresponding service node in a real environment, sensing state information of a server where the intelligent agent model is located within a period of time as input, predicting to obtain resources needing to be released by the current server, and selecting a virtual machine closest to a standard by using a knapsack algorithm to mark the virtual machine as a state to be scheduled; then, collecting prediction results on all servers and the virtual machines marked as the to-be-scheduled states through a scheduling module, assigning the virtual machines in the to-be-scheduled states to suitable servers as required to generate a scheduling strategy, and distributing a scheduling command to corresponding service nodes to execute scheduling operation; before executing the scheduling strategy, checking whether each scheduling command is legal or not, if not, feeding back a punishment reward updating parameter, and regenerating the scheduling strategy; and if the intelligent agent parameter is legal, executing the scheduling operation, and obtaining the feedback reward value to update the intelligent agent parameter.
2. The multi-agent reinforcement learning scheduling method of claim 1, wherein the step a further comprises: carrying out standardized preprocessing operation on the collected server parameters and the virtual machine load information; the normalized preprocessing operation comprises: defining the virtual machine information of each service node as a tuple, wherein the tuple comprises the number of virtual machines and respective configuration of the virtual machines, each virtual machine comprises two scheduling states, namely a to-be-scheduled state and an operating state, each service node comprises two states, namely a saturated state and a hungry state, and the sum of the resource ratio occupied by each virtual machine is less than the upper limit of the configuration of the server where the virtual machine is located.
3. The multi-agent reinforcement learning scheduling method according to claim 1 or 2, wherein in the step b, the deep reinforcement learning model of the multi-agent specifically includes a prediction module and a scheduling module, the prediction module predicts the resources to be scheduled out in the current state according to the information input by each service node, and maps the action space to the total capacity of the current service node according to the configuration information of the current service node; the scheduling module carries out rescheduling and distribution to generate a scheduling strategy according to the marked virtual machine in the state to be scheduled, and an agent on each service node calculates a return function according to the generated scheduling action; the prediction module measures the quality of the scheduling strategy, so that the load of each service node in the whole network is balanced.
4. The multi-agent reinforcement learning scheduling method of claim 3, wherein in the step c, the off-line training and learning are performed by using the deep reinforcement learning model and the virtual simulation environment of the multi-agent, and the training of one agent model for each server specifically comprises: the intelligent agent on each service node adjusts the size of the resource to be scheduled through the prediction module, marks the virtual machine to be scheduled out, generates a scheduling strategy according to the virtual machine in the state to be scheduled, calculates the return value of each service node, summarizes and sums the return values to obtain a total return value, and adjusts the parameters of each prediction module according to the total return value.
5. A multi-agent reinforcement learning scheduling system, comprising:
an information collection module: the system comprises a data center, a data center and a server, wherein the data center is used for collecting server parameters of the data center and virtual machine load information running on each server;
a reinforcement learning model construction module: the system comprises a virtual simulation environment and a deep reinforcement learning model of a multi-agent, wherein the virtual simulation environment is established by using the server parameters and the virtual machine load information;
the intelligent agent model training module: the system comprises a plurality of servers, a deep reinforcement learning model of a multi-agent and a virtual simulation environment, wherein the deep reinforcement learning model of the multi-agent and the virtual simulation environment are used for off-line training and learning, and an agent model is trained for each server;
an agent deployment module: the intelligent agent model is deployed to real service nodes and is scheduled according to the load condition of each service node;
the intelligent agent deployment module deploys the intelligent agent model to the real service nodes, and the scheduling according to the load condition of each service node specifically comprises the following steps: deploying a trained intelligent agent model to a corresponding service node in a real environment, sensing state information of a server where the intelligent agent model is located within a period of time as input, predicting to obtain resources needing to be released by the current server, and selecting a virtual machine closest to a standard by using a knapsack algorithm to mark the virtual machine as a state to be scheduled; then, collecting prediction results on all servers and the virtual machines marked as the to-be-scheduled states through a scheduling module, assigning the virtual machines in the to-be-scheduled states to suitable servers as required to generate a scheduling strategy, and distributing a scheduling command to corresponding service nodes to execute scheduling operation; before executing the scheduling strategy, checking whether each scheduling command is legal or not, if not, feeding back a punishment reward updating parameter, and regenerating the scheduling strategy; and if the intelligent agent parameter is legal, executing the scheduling operation, and obtaining the feedback reward value to update the intelligent agent parameter.
6. The multi-agent reinforcement learning scheduling system of claim 5, further comprising a preprocessing module for performing a normalized preprocessing operation on the collected server parameters and virtual machine load information; the normalized preprocessing operation comprises: defining the virtual machine information of each service node as a tuple, wherein the tuple comprises the number of virtual machines and respective configuration of the virtual machines, each virtual machine comprises two scheduling states, namely a to-be-scheduled state and an operating state, each service node comprises two states, namely a saturated state and a hungry state, and the sum of the resource ratio occupied by each virtual machine is less than the upper limit of the configuration of the server where the virtual machine is located.
7. The multi-agent reinforcement learning scheduling system of claim 5 or 6, wherein the reinforcement learning model building module comprises a prediction module and a scheduling module, the prediction module comprising:
a state sensing unit: the system is used for predicting the resources needing to be scheduled out in the current state through the information input by each service node;
an action space unit: the action space is mapped into the total capacity of the current service node according to the configuration information of the current service node;
the scheduling module carries out rescheduling and distribution to generate a scheduling strategy according to the marked virtual machine in the state to be scheduled, and an agent on each service node calculates a return function according to the generated scheduling action;
the prediction module further comprises:
a reward function unit: the method is used for measuring the quality of the scheduling strategy, so that the load of each service node in the whole network is balanced.
8. The multi-agent reinforcement learning scheduling system of claim 7, wherein the agent model training module performs off-line training and learning using the deep reinforcement learning model and the virtual simulation environment of the multi-agent, and training one agent model for each server specifically comprises: the intelligent agent on each service node adjusts the size of the resource to be scheduled through the prediction module, marks the virtual machine to be scheduled out, generates a scheduling strategy according to the virtual machine in the state to be scheduled, calculates the return value of each service node, summarizes and sums the return values to obtain a total return value, and adjusts the parameters of each prediction module according to the total return value.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the multi-agent reinforcement learning scheduling method of any one of claims 1 to 4.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910193429.XA CN109947567B (en) | 2019-03-14 | 2019-03-14 | Multi-agent reinforcement learning scheduling method and system and electronic equipment |
PCT/CN2019/130582 WO2020181896A1 (en) | 2019-03-14 | 2019-12-31 | Multi-agent reinforcement learning scheduling method and system and electronic device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910193429.XA CN109947567B (en) | 2019-03-14 | 2019-03-14 | Multi-agent reinforcement learning scheduling method and system and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109947567A CN109947567A (en) | 2019-06-28 |
CN109947567B true CN109947567B (en) | 2021-07-20 |
Family
ID=67009966
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910193429.XA Active CN109947567B (en) | 2019-03-14 | 2019-03-14 | Multi-agent reinforcement learning scheduling method and system and electronic equipment |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN109947567B (en) |
WO (1) | WO2020181896A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
RU2791840C2 (en) * | 2021-12-21 | 2023-03-13 | Владимир Германович Крюков | Decision-making system in a multi-agent environment |
Families Citing this family (48)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109947567B (en) * | 2019-03-14 | 2021-07-20 | 深圳先进技术研究院 | Multi-agent reinforcement learning scheduling method and system and electronic equipment |
CN110362411B (en) * | 2019-07-25 | 2022-08-02 | 哈尔滨工业大学 | CPU resource scheduling method based on Xen system |
CN110442129B (en) * | 2019-07-26 | 2021-10-22 | 中南大学 | Control method and system for multi-agent formation |
CN110471297B (en) * | 2019-07-30 | 2020-08-11 | 清华大学 | Multi-agent cooperative control method, system and equipment |
CN110427006A (en) * | 2019-08-22 | 2019-11-08 | 齐鲁工业大学 | A kind of multi-agent cooperative control system and method for process industry |
CN110516795B (en) * | 2019-08-28 | 2022-05-10 | 北京达佳互联信息技术有限公司 | Method and device for allocating processors to model variables and electronic equipment |
CN110728368B (en) * | 2019-10-25 | 2022-03-15 | 中国人民解放军国防科技大学 | Acceleration method for deep reinforcement learning of simulation robot |
CN111031387B (en) * | 2019-11-21 | 2020-12-04 | 南京大学 | Method for controlling video coding flow rate of monitoring video sending end |
CN111026549B (en) * | 2019-11-28 | 2022-06-10 | 国网甘肃省电力公司电力科学研究院 | Automatic test resource scheduling method for power information communication equipment |
CN110882544B (en) * | 2019-11-28 | 2023-09-15 | 网易(杭州)网络有限公司 | Multi-agent training method and device and electronic equipment |
CN111047014B (en) * | 2019-12-11 | 2023-06-23 | 中国航空工业集团公司沈阳飞机设计研究所 | Multi-agent air countermeasure distributed sampling training method and equipment |
CN111178545B (en) * | 2019-12-31 | 2023-02-24 | 中国电子科技集团公司信息科学研究院 | Dynamic reinforcement learning decision training system |
CN113067714B (en) * | 2020-01-02 | 2022-12-13 | 中国移动通信有限公司研究院 | Content distribution network scheduling processing method, device and equipment |
CN111310915B (en) * | 2020-01-21 | 2023-09-01 | 浙江工业大学 | Data anomaly detection defense method oriented to reinforcement learning |
CN111324358B (en) * | 2020-02-14 | 2020-10-16 | 南栖仙策(南京)科技有限公司 | Training method for automatic operation and maintenance strategy of information system |
CN111343095B (en) * | 2020-02-15 | 2021-11-05 | 北京理工大学 | Method for realizing controller load balance in software defined network |
CN111461338A (en) * | 2020-03-06 | 2020-07-28 | 北京仿真中心 | Intelligent system updating method and device based on digital twin |
CN111339675B (en) * | 2020-03-10 | 2020-12-01 | 南栖仙策(南京)科技有限公司 | Training method for intelligent marketing strategy based on machine learning simulation environment |
CN111538668B (en) * | 2020-04-28 | 2023-08-15 | 山东浪潮科学研究院有限公司 | Mobile terminal application testing method, device, equipment and medium based on reinforcement learning |
CN111585811B (en) * | 2020-05-06 | 2022-09-02 | 郑州大学 | Virtual optical network mapping method based on multi-agent deep reinforcement learning |
CN113822456A (en) * | 2020-06-18 | 2021-12-21 | 复旦大学 | Service combination optimization deployment method based on deep reinforcement learning in cloud and mist mixed environment |
CN111722910B (en) * | 2020-06-19 | 2023-07-21 | 广东石油化工学院 | Cloud job scheduling and resource allocation method |
CN111724001B (en) * | 2020-06-29 | 2023-08-29 | 重庆大学 | Aircraft detection sensor resource scheduling method based on deep reinforcement learning |
CN111860777B (en) * | 2020-07-06 | 2021-07-02 | 中国人民解放军军事科学院战争研究院 | Distributed reinforcement learning training method and device for super real-time simulation environment |
CN112001585B (en) * | 2020-07-14 | 2023-09-22 | 北京百度网讯科技有限公司 | Multi-agent decision method, device, electronic equipment and storage medium |
CN111967645B (en) * | 2020-07-15 | 2022-04-29 | 清华大学 | Social network information propagation range prediction method and system |
CN112422651A (en) * | 2020-11-06 | 2021-02-26 | 电子科技大学 | Cloud resource scheduling performance bottleneck prediction method based on reinforcement learning |
CN112838946B (en) * | 2020-12-17 | 2023-04-28 | 国网江苏省电力有限公司信息通信分公司 | Method for constructing intelligent sensing and early warning model based on communication network faults |
CN112766705B (en) * | 2021-01-13 | 2024-07-09 | 北京洛塔信息技术有限公司 | Distributed work order processing method, system, equipment and storage medium |
CN112966431B (en) * | 2021-02-04 | 2023-04-28 | 西安交通大学 | Data center energy consumption joint optimization method, system, medium and equipment |
CN112801303A (en) * | 2021-02-07 | 2021-05-14 | 中兴通讯股份有限公司 | Intelligent pipeline processing method and device, storage medium and electronic device |
CN113115451A (en) * | 2021-02-23 | 2021-07-13 | 北京邮电大学 | Interference management and resource allocation scheme based on multi-agent deep reinforcement learning |
CN113094171B (en) * | 2021-03-31 | 2024-07-26 | 北京达佳互联信息技术有限公司 | Data processing method, device, electronic equipment and storage medium |
US20220321605A1 (en) * | 2021-04-01 | 2022-10-06 | Cisco Technology, Inc. | Verifying trust postures of heterogeneous confidential computing clusters |
CN113325721B (en) * | 2021-08-02 | 2021-11-05 | 北京中超伟业信息安全技术股份有限公司 | Model-free adaptive control method and system for industrial system |
CN113672372B (en) * | 2021-08-30 | 2023-08-08 | 福州大学 | Multi-edge collaborative load balancing task scheduling method based on reinforcement learning |
CN114003121B (en) * | 2021-09-30 | 2023-10-31 | 中国科学院计算技术研究所 | Data center server energy efficiency optimization method and device, electronic equipment and storage medium |
CN113641462B (en) * | 2021-10-14 | 2021-12-21 | 西南民族大学 | Virtual network hierarchical distributed deployment method and system based on reinforcement learning |
WO2023121514A1 (en) * | 2021-12-21 | 2023-06-29 | Владимир Германович КРЮКОВ | System for making decisions in a multi-agent environment |
CN114116183B (en) * | 2022-01-28 | 2022-04-29 | 华北电力大学 | Data center service load scheduling method and system based on deep reinforcement learning |
CN114518948B (en) * | 2022-02-21 | 2024-09-24 | 南京航空航天大学 | Dynamic perception rescheduling method for large-scale micro-service application and application |
CN114648165B (en) * | 2022-03-24 | 2024-05-31 | 浙江英集动力科技有限公司 | Multi-heat source heating system optimal scheduling method based on multi-agent game |
CN114816659B (en) * | 2022-03-24 | 2024-08-23 | 阿里云计算有限公司 | Decision model training method for virtual machine network deployment scheme |
CN114924684A (en) * | 2022-04-24 | 2022-08-19 | 南栖仙策(南京)科技有限公司 | Environmental modeling method and device based on decision flow graph and electronic equipment |
CN114860416B (en) * | 2022-06-06 | 2024-04-09 | 清华大学 | Distributed multi-agent detection task allocation method and device in countermeasure scene |
CN114781072A (en) * | 2022-06-17 | 2022-07-22 | 北京理工大学前沿技术研究院 | Decision-making method and system for unmanned vehicle |
CN115293451B (en) * | 2022-08-24 | 2023-06-16 | 中国西安卫星测控中心 | Resource dynamic scheduling method based on deep reinforcement learning |
CN116151137B (en) * | 2023-04-24 | 2023-07-28 | 之江实验室 | Simulation system, method and device |
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103873569B (en) * | 2014-03-05 | 2017-04-19 | 兰雨晴 | Resource optimized deployment method based on IaaS (infrastructure as a service) cloud platform |
CN105607952B (en) * | 2015-12-18 | 2021-04-20 | 航天恒星科技有限公司 | Method and device for scheduling virtualized resources |
CN108009016B (en) * | 2016-10-31 | 2021-10-22 | 华为技术有限公司 | Resource load balancing control method and cluster scheduler |
US10649966B2 (en) * | 2017-06-09 | 2020-05-12 | Microsoft Technology Licensing, Llc | Filter suggestion for selective data import |
CN108021451B (en) * | 2017-12-07 | 2021-08-13 | 上海交通大学 | Self-adaptive container migration method in fog computing environment |
CN108829494B (en) * | 2018-06-25 | 2020-09-29 | 杭州谐云科技有限公司 | Container cloud platform intelligent resource optimization method based on load prediction |
CN109165081B (en) * | 2018-08-15 | 2021-09-28 | 福州大学 | Web application self-adaptive resource allocation method based on machine learning |
CN109068350B (en) * | 2018-08-15 | 2021-09-28 | 西安电子科技大学 | Terminal autonomous network selection system and method for wireless heterogeneous network |
CN109947567B (en) * | 2019-03-14 | 2021-07-20 | 深圳先进技术研究院 | Multi-agent reinforcement learning scheduling method and system and electronic equipment |
Application timeline:
- 2019-03-14: CN application CN201910193429.XA, granted as CN109947567B (status: Active)
- 2019-12-31: PCT application PCT/CN2019/130582, published as WO2020181896A1 (status: Application Filing)
Also Published As
Publication number | Publication date |
---|---|
WO2020181896A1 (en) | 2020-09-17 |
CN109947567A (en) | 2019-06-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109947567B (en) | Multi-agent reinforcement learning scheduling method and system and electronic equipment | |
Liu et al. | Adaptive asynchronous federated learning in resource-constrained edge computing | |
Torabi et al. | A dynamic task scheduling framework based on chicken swarm and improved raven roosting optimization methods in cloud computing | |
CN104317658B (en) | A kind of loaded self-adaptive method for scheduling task based on MapReduce | |
CN104408518B (en) | Based on the neural network learning optimization method of particle swarm optimization algorithm | |
US20230206132A1 (en) | Method and Apparatus for Training AI Model, Computing Device, and Storage Medium | |
Mechalikh et al. | PureEdgeSim: A simulation framework for performance evaluation of cloud, edge and mist computing environments | |
CN114237869B (en) | Ray double-layer scheduling method and device based on reinforcement learning and electronic equipment | |
CN112732444A (en) | Distributed machine learning-oriented data partitioning method | |
CN115168027A (en) | Calculation power resource measurement method based on deep reinforcement learning | |
CN115085202A (en) | Power grid multi-region intelligent power collaborative optimization method, device, equipment and medium | |
CN115543626A (en) | Power defect image simulation method adopting heterogeneous computing resource load balancing scheduling | |
Gand et al. | A Fuzzy Controller for Self-adaptive Lightweight Edge Container Orchestration. | |
CN114567560B (en) | Edge node dynamic resource allocation method based on generation of countermeasure imitation learning | |
CN114090239B (en) | Method and device for dispatching edge resources based on model reinforcement learning | |
CN115934344A (en) | Heterogeneous distributed reinforcement learning calculation method, system and storage medium | |
Moazeni et al. | Dynamic resource allocation using an adaptive multi-objective teaching-learning based optimization algorithm in cloud | |
CN114492052A (en) | Global stream level network simulation method, system and device | |
Faraji-Mehmandar et al. | A self-learning approach for proactive resource and service provisioning in fog environment | |
Tuli et al. | Optimizing the performance of fog computing environments using ai and co-simulation | |
CN115883371B (en) | Virtual network function placement method based on learning optimization method in edge-cloud cooperative system | |
CN111612124A (en) | Network structure adaptive optimization method for task-oriented intelligent scheduling | |
Yang et al. | Energy saving strategy of cloud data computing based on convolutional neural network and policy gradient algorithm | |
Su et al. | A power-aware virtual machine mapper using firefly optimization | |
Chen et al. | Conlar: Learning to allocate resources to docker containers under time-varying workloads |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |