WO2020181896A1 - Multi-agent reinforcement learning scheduling method and system, and electronic device - Google Patents

Multi-agent reinforcement learning scheduling method and system, and electronic device

Info

Publication number
WO2020181896A1
Authority
WO
WIPO (PCT)
Prior art keywords
agent
scheduling
service node
server
reinforcement learning
Prior art date
Application number
PCT/CN2019/130582
Other languages
English (en)
French (fr)
Inventor
任宏帅
王洋
须成忠
Original Assignee
深圳先进技术研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳先进技术研究院 filed Critical 深圳先进技术研究院
Publication of WO2020181896A1 publication Critical patent/WO2020181896A1/zh

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/48 Program initiating; Program switching, e.g. by interrupt
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Definitions

  • This application belongs to the technical field of multi-agent systems, and in particular relates to a multi-agent reinforcement learning scheduling method, system and electronic equipment.
  • In a cloud computing environment, traditional service deployment struggles to cope with constantly changing access patterns.
  • Although a fixed allocation of resources can provide services stably, it also wastes a large amount of resources; for example, under the same network topology, some servers may often run at full load while others host only a few services and leave much of their storage and computing capacity unused.
  • Traditional deployment therefore has difficulty avoiding this waste and achieving efficient scheduling, so resources cannot be used efficiently; a scheduling algorithm that can adapt to a dynamic environment is needed to balance the load of the servers in the network.
  • If a single agent is trained in a centralized way, the complex state changes under the network topology and the combinatorially large action space make the algorithm hard to train and slow to converge.
  • Distributed reinforcement learning faces another problem: it uses multiple agents that train together in order to speed up convergence, but these agents all share the same scheduling policy. Multiple clones are merely used to accelerate training, so the end result is a set of homogeneous agents with no ability to collaborate.
  • In traditional multi-agent methods, each agent tries to predict the decisions of the other agents at every step, but in a dynamic environment those decisions are themselves unstable, so training is very difficult and the agents end up behaving almost identically, without a collaborative strategy.
  • the present application provides a multi-agent reinforcement learning scheduling method, system, and electronic device, which aim to solve at least one of the above technical problems in the prior art to a certain extent.
  • a multi-agent reinforcement learning scheduling method includes the following steps:
  • Step a: Collect server parameters of the network data center and load information of the virtual machines running on each server;
  • Step b: Use the server parameters and virtual machine load information to establish a virtual simulation environment, and establish a multi-agent deep reinforcement learning model;
  • Step c: Use the multi-agent deep reinforcement learning model and the simulation environment for offline training and learning, and train an agent model for each server;
  • Step d: Deploy the agent models to the real service nodes, and perform scheduling according to the load conditions of each service node.
  • The technical solution adopted in the embodiments of the application further includes: step a further includes performing a standardized preprocessing operation on the collected server parameters and virtual machine load information.
  • The standardized preprocessing operation includes defining the virtual machine information of each service node as a tuple, where the tuple contains the number of virtual machines and their respective configurations.
  • Each virtual machine has two scheduling states, a pending (to-be-scheduled) state and a running state, and each service node has two states, a saturated state and a starved state; the sum of the resource shares occupied by the virtual machines on a node must remain below the upper limit of that server's configuration.
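  • As an illustration only (the patent does not prescribe a concrete data layout), the tuple and the two state pairs described above could be represented roughly as in the following sketch; all field names and units are assumptions:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class VMState(Enum):
    PENDING = "pending"   # marked for (re)scheduling
    RUNNING = "running"


class NodeState(Enum):
    SATURATED = "saturated"
    STARVED = "starved"


@dataclass
class VirtualMachine:
    vm_id: str
    cpu: float      # cores requested
    memory: float   # GiB
    disk: float     # GiB
    state: VMState = VMState.RUNNING


@dataclass
class ServiceNode:
    node_id: str
    cpu_capacity: float
    memory_capacity: float
    disk_capacity: float
    vms: List[VirtualMachine] = field(default_factory=list)

    def utilization(self) -> float:
        """Fraction of CPU capacity occupied by hosted VMs (must stay below 1.0)."""
        return sum(vm.cpu for vm in self.vms) / self.cpu_capacity

    def classify(self, threshold: float = 0.8) -> NodeState:
        """Classify the node as saturated or starved relative to a load threshold."""
        return NodeState.SATURATED if self.utilization() >= threshold else NodeState.STARVED
```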
  • In step b, the multi-agent deep reinforcement learning model specifically includes a prediction module and a scheduling module. The prediction module uses the information input by each service node to predict the resources that should be scheduled away in the current state,
  • and maps the action space onto the total capacity of the current service node according to that node's configuration information.
  • The scheduling module performs rescheduling and allocation according to the virtual machines marked as pending, producing a scheduling strategy,
  • and the agent on each service node computes its reward function from the generated scheduling actions.
  • The prediction module measures the quality of the scheduling strategy, so as to balance the load of every service node in the entire network.
  • The technical solution adopted in the embodiments of the present application further includes: in step c, using the multi-agent deep reinforcement learning model and the simulation environment for offline training and learning, and training an agent model for each server, specifically includes:
  • the agent on each service node adjusts the amount of resources to be scheduled through its prediction module, marks the virtual machines that need to be scheduled away, and generates a scheduling strategy from the virtual machines in the pending state;
  • each service node then calculates its own return value, the return values are summed to obtain the total return, and the parameters of each prediction module are adjusted according to the total return.
  • The technical solution adopted in the embodiments of the present application further includes: in step d, deploying the agent models to the real service nodes and scheduling according to the load of each service node specifically means deploying each trained agent model onto the corresponding service node in the real environment.
  • The agent model takes the state information observed on its server over a period of time as input, predicts the resources that the current server needs to release, and uses a knapsack algorithm to select the virtual machines that come closest to that target, marking them as pending. The scheduling module then collects the predictions from all servers together with the virtual machines marked as pending, assigns the pending virtual machines to suitable servers as needed to generate a scheduling strategy, and distributes the scheduling commands to the corresponding service nodes for execution. Before the strategy is executed, every scheduling command is checked for legality: if a command is illegal, a penalty reward is fed back to update the parameters and the strategy is regenerated; if it is legal, the scheduling operation is performed and the feedback reward value is used to update the agent parameters.
  • a multi-agent reinforcement learning scheduling system including:
  • Information collection module: used to collect server parameters of the network data center and the load information of the virtual machines running on each server;
  • Reinforcement learning model building module: used to establish a virtual simulation environment from the server parameters and virtual machine load information, and to build a multi-agent deep reinforcement learning model;
  • Agent model training module: used to perform offline training and learning with the multi-agent deep reinforcement learning model and the simulation environment, training an agent model for each server;
  • Agent deployment module: used to deploy the agent models to the real service nodes and perform scheduling according to the load conditions of each service node.
  • The technical solution adopted in the embodiments of the application further includes a preprocessing module, which is used to perform standardized preprocessing operations on the collected server parameters and virtual machine load information.
  • The standardized preprocessing operations include defining the virtual machine information of each service node as a tuple containing the number of virtual machines and their respective configurations; each virtual machine has two scheduling states, a pending state and a running state, each service node has two states, a saturated state and a starved state, and the sum of the resource shares occupied by the virtual machines is less than the upper limit of the server's configuration.
  • the reinforcement learning model building module includes a prediction module and a scheduling module
  • the prediction module includes:
  • State perception unit: used to predict, from the information input by each service node, the resources that need to be scheduled away in the current state;
  • Action space unit: used to map the action space onto the total capacity of the current service node according to the node's configuration information;
  • the scheduling module performs rescheduling and allocation according to the virtual machines marked as pending to generate a scheduling strategy, and the agent on each service node computes a reward function from the generated scheduling actions;
  • the prediction module further includes:
  • Reward function unit: used to measure the quality of the scheduling strategy and balance the load of each service node in the entire network.
  • The technical solution adopted in the embodiments of the application further includes: the agent model training module uses the multi-agent deep reinforcement learning model and the simulation environment for offline training and learning; training an agent model for each server specifically means that
  • the agent on each service node adjusts the amount of resources to be scheduled through its prediction module, marks the virtual machines that need to be scheduled away, and generates a scheduling strategy from the virtual machines in the pending state;
  • each service node calculates its own return value, the values are summed to obtain the total return, and the parameters of each prediction module are adjusted according to the total return.
  • The technical solution adopted in the embodiments of the application further includes: the agent deployment module deploys the agent models to the real service nodes and schedules according to the load of each service node, specifically: the trained agent models are deployed onto the corresponding service nodes in the real environment; each agent model takes the state information observed on its server over a period of time as input, predicts the resources the current server needs to release, and uses a knapsack algorithm to select the virtual machines closest to that target, marking them as pending. The scheduling module then collects the predictions from all servers together with the pending virtual machines, assigns the pending virtual machines to suitable servers to generate a scheduling strategy, and distributes the scheduling commands to the corresponding service nodes for execution. Before execution, each scheduling command is checked for legality: if it is illegal, a penalty reward is fed back to update the parameters and the strategy is regenerated; if it is legal, the scheduling operation is performed and the feedback reward value is used to update the agent parameters.
  • an electronic device including:
  • at least one processor; and
  • a memory communicatively connected to the at least one processor; wherein
  • the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the following operations of the foregoing multi-agent reinforcement learning scheduling method:
  • Step a: Collect server parameters of the network data center and load information of the virtual machines running on each server;
  • Step b: Use the server parameters and virtual machine load information to establish a virtual simulation environment, and establish a multi-agent deep reinforcement learning model;
  • Step c: Use the multi-agent deep reinforcement learning model and the simulation environment for offline training and learning, and train an agent model for each server;
  • Step d: Deploy the agent models to the real service nodes, and perform scheduling according to the load conditions of each service node.
  • The beneficial effects produced by the embodiments of the present application are as follows: the multi-agent reinforcement learning scheduling method, system and electronic device of the embodiments virtualize the services running on the servers through virtualization technology and perform load balancing by scheduling virtual machines, because the scheduling scope is no longer limited to a single server.
  • When a server is under high load, its virtual machines can be scheduled to run on other low-load servers, which is more macroscopic than schemes that only reallocate resources within one machine.
  • At the same time, this application uses the MADDPG framework, which extends the actor-critic (AC) framework: the critic is given additional information about the decisions of the other agents, while each actor is trained using only local information.
  • Through this framework, multiple agents can produce collaborative strategies in a complex, dynamic environment.
  • Fig. 1 is a flowchart of a multi-agent reinforcement learning scheduling method according to an embodiment of the present application
  • FIG. 2 is a schematic diagram of the MADDPG scheduling framework of an embodiment of the present application.
  • Figure 3 is a schematic diagram of the overall scheduling framework of an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of a multi-agent reinforcement learning scheduling system according to an embodiment of the present application.
  • FIG. 5 is a schematic diagram of the hardware device structure of the multi-agent reinforcement learning scheduling method provided by an embodiment of the present application.
  • The multi-agent reinforcement learning scheduling method of the embodiments of the present application applies multi-agent reinforcement learning: it models the load information on each service node in a cloud service environment, uses a recurrent neural network to learn temporal information for decision-making, trains an agent for each server, and lets these agents with different tasks compete or cooperate to maintain load balance under the entire network topology.
  • After preliminary training, each agent is sent down to its real service node and scheduling is performed according to the load of each node. While making decisions and scheduling, every agent keeps learning and improving from its own local environment and the decision memory of the other nodes, so that it can cooperate with the agents on other nodes to generate scheduling strategies and achieve load balance across the service nodes.
  • FIG. 1 is a flowchart of a multi-agent reinforcement learning scheduling method according to an embodiment of the present application.
  • the multi-agent reinforcement learning scheduling method of the embodiment of the present application includes the following steps:
  • Step 100: Collect server parameters of the network data center and load information of the virtual machines running on each server;
  • the collected server parameters specifically include the configuration information of each server over a period of time in the real scenario, such as memory and hard-disk storage space; the collected virtual machine load information specifically includes the resource-usage parameters of the virtual machines running on each server, such as CPU occupancy and memory and hard-disk occupancy.
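  • The patent does not name a collection tool; as one possible sketch, host-level metrics of the kind listed above could be sampled with the psutil library as shown below (per-VM statistics would come from the hypervisor's own interfaces, which are not shown here):

```python
import time
import psutil


def sample_host_load(window_s: int = 60, period_s: int = 5):
    """Sample CPU, memory and disk utilization of the local host over a time window.

    Returns a list of (timestamp, cpu_percent, mem_percent, disk_percent) tuples
    that could be fed to the simulation environment as one node's load trace.
    """
    trace = []
    for _ in range(window_s // period_s):
        cpu = psutil.cpu_percent(interval=period_s)   # blocks for period_s seconds
        mem = psutil.virtual_memory().percent
        disk = psutil.disk_usage("/").percent
        trace.append((time.time(), cpu, mem, disk))
    return trace
```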
  • Step 200: Perform preprocessing operations such as normalization on the collected server parameters and virtual machine load information;
  • in step 200, the preprocessing defines the virtual machine information of each service node as a tuple that contains the number of virtual machines and their respective configurations, including CPU, memory, hard disk and current state.
  • Each virtual machine has two scheduling states, the pending state and the running state, and each service node has two states, the saturated state and the starved state; the sum of the resources occupied by the virtual machines cannot exceed the upper limit of the server's configuration.
  • Step 300: Use the preprocessed data to establish a virtual simulation environment, and establish a multi-agent deep reinforcement learning model;
  • in step 300, establishing the model specifically includes modeling the collected time-series dynamic information (server parameters and virtual machine load information) to create a simulation environment for offline training. The model is a multi-agent deep reinforcement learning model, and in order to make full use of the time-series data, the deep network part of the model uses an LSTM to extract temporal information, which avoids the influence of transient abnormal data fluctuations on decision-making.
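  • A minimal sketch of such an LSTM-based state encoder is shown below (dimensions and layer sizes are illustrative assumptions, not values from the patent):

```python
import torch
import torch.nn as nn


class LSTMStateEncoder(nn.Module):
    """Encode a window of load observations into a fixed-size state vector.

    Each time step carries `obs_dim` features (e.g. CPU/memory/disk usage of the
    node and its VMs); the last hidden state summarizes the window.
    """

    def __init__(self, obs_dim: int = 8, hidden_dim: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=obs_dim, hidden_size=hidden_dim, batch_first=True)

    def forward(self, obs_seq: torch.Tensor) -> torch.Tensor:
        # obs_seq: (batch, seq_len, obs_dim) -> use the last hidden state as the summary
        _, (h_n, _) = self.lstm(obs_seq)
        return h_n[-1]                      # (batch, hidden_dim)


# Usage: encode a batch of 32 nodes, each with a 20-step observation window
encoder = LSTMStateEncoder()
state = encoder(torch.randn(32, 20, 8))     # -> torch.Size([32, 64])
```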
  • The model adopts the MADDPG framework (Multi-Agent Deep Deterministic Policy Gradient, from OpenAI's "Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments").
  • MADDPG is the extension, to the multi-agent setting, of the DDPG algorithm (from Google DeepMind's "Continuous Control with Deep Reinforcement Learning").
  • The DDPG algorithm applies deep reinforcement learning to continuous action spaces.
  • The action space produced by the deep learning part is defined as the share of resources of the virtual machines to be placed in the pending state, that is, how much space must be scheduled away to keep the current service node's load balanced.
  • According to the predicted space to be released, virtual machines of suitable size are marked as pending; then the pending virtual machines on every service node in the network and the return rewards of the service nodes are computed, and the reward obtained by assigning a virtual machine to a service node is used as a distance metric to generate the scheduling strategy.
  • The detailed scheduling framework is shown in Figure 2.
  • To cope with transient abnormal load fluctuations in a dynamic environment, a recurrent neural network, the LSTM (Long Short-Term Memory) network, replaces the fully connected network usually used in deep reinforcement learning, so that the agents can learn the information hidden in the time-series data and achieve adaptive scheduling based on spatio-temporal awareness.
  • The agent on each service node marks virtual machines as pending by solving a knapsack problem:
  • the predicted space to be released is treated as the knapsack capacity, and the resources occupied by each virtual machine serve as both the item's weight and its value. It suffices to compute the maximum value the knapsack can hold and mark the packed virtual machines as pending. The predicted to-be-scheduled space on each service node is then tallied (a negative number indicates how many resources could be scheduled in to make full use of the node), and the objective is to minimize the sum, over all service nodes, of the space to be scheduled away and the space to be scheduled in; the scheduling strategy is obtained from this calculation.
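  • A minimal sketch of the 0/1 knapsack marking step described above, assuming resource sizes quantized to integer units (the quantization and the dynamic-programming solver are illustrative choices, not specified by the patent):

```python
from typing import List, Tuple


def mark_vms_for_scheduling(vm_sizes: List[int], space_to_release: int) -> Tuple[int, List[int]]:
    """0/1 knapsack: pick the set of VMs whose total size best fills the space the
    agent predicts should be released; return (total_marked, indices_to_mark).

    Weight and value are both the VM's resource occupation, as in the setup above.
    """
    n, cap = len(vm_sizes), space_to_release
    best = [[0] * (cap + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        w = vm_sizes[i - 1]
        for c in range(cap + 1):
            best[i][c] = best[i - 1][c]
            if w <= c:
                best[i][c] = max(best[i][c], best[i - 1][c - w] + w)
    # Backtrack to recover which VMs were "packed", i.e. should be marked as pending.
    marked, c = [], cap
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            marked.append(i - 1)
            c -= vm_sizes[i - 1]
    return best[n][cap], sorted(marked)


# Example: the agent wants to release 10 units; VMs occupy 4, 3, 6 and 2 units.
print(mark_vms_for_scheduling([4, 3, 6, 2], 10))   # -> (10, [0, 2])
```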
  • The MADDPG framework extends deep reinforcement learning techniques to the multi-agent field.
  • The algorithm supports centralized learning and decentralized execution in a multi-agent environment,
  • and with this framework multiple agents can learn to cooperate and compete.
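  • A minimal sketch of the centralized-critic / decentralized-actor structure that this style of training relies on is shown below; layer sizes, the sigmoid output range and the omission of replay buffers and target networks are all simplifying assumptions:

```python
import torch
import torch.nn as nn


class Actor(nn.Module):
    """Decentralized execution: each agent's actor sees only its own node's observation."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Sigmoid(),   # action in [0, 1]: share of capacity to release
        )

    def forward(self, obs):
        return self.net(obs)


class CentralizedCritic(nn.Module):
    """Centralized learning: the critic is conditioned on all agents' observations and actions."""
    def __init__(self, n_agents: int, obs_dim: int, act_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_agents * (obs_dim + act_dim), hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, all_obs, all_acts):
        # all_obs: (batch, n_agents, obs_dim); all_acts: (batch, n_agents, act_dim)
        x = torch.cat([all_obs.flatten(1), all_acts.flatten(1)], dim=-1)
        return self.net(x)
```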
  • the deep reinforcement learning model includes a prediction module and a scheduling module.
  • the prediction module includes a state perception unit, an action space unit, and a reward function unit.
  • the specific functions are as follows:
  • State perception unit: predicts, from the information input by each node, the resources that need to be scheduled away in the current state; the input state is defined by each node's load information and the resources occupied by its running virtual machines;
  • Action space unit: maps the action space onto the total capacity of the current service node according to the node's configuration information;
  • Scheduling module: performs rescheduling and allocation according to the virtual machines marked as pending to generate a scheduling strategy, and the agent on each service node computes a reward function from the generated scheduling actions;
  • Reward function unit: measures the quality of the scheduling strategy; its goal is to balance the load of every service node in the entire network, and the reward function of each service node is computed separately.
  • In the reward formula, r_i is the reward obtained on each service node, c represents the CPU occupancy of the i-th machine, and α and β are penalty coefficients;
  • α can be set according to the situation and represents the threshold around which the server's CPU load is expected to remain steady.
  • R is the overall return function, obtained by summing the per-node rewards,
  • and the final optimization goal is for the scheduling strategy produced by the agents' cooperation to maximize R.
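  • The published per-node reward formula is only available as an equation image, so the functional form below is an assumption consistent with the description (penalize deviation of CPU occupancy from the steady-state threshold, then sum over nodes to obtain R):

```python
from typing import List


def node_reward(c: float, alpha: float = 0.6, beta: float = 2.0) -> float:
    """Per-node reward r_i (assumed form; the published formula is only an image).

    c     : CPU occupancy of node i, in [0, 1]
    alpha : target steady-state CPU load threshold
    beta  : penalty coefficient for deviating from the threshold
    """
    return -beta * abs(c - alpha)


def overall_return(cpu_loads: List[float]) -> float:
    """Overall return R: per-node rewards are computed separately and summed;
    the cooperative scheduling strategy is trained to maximize R."""
    return sum(node_reward(c) for c in cpu_loads)


# A balanced cluster scores higher than an unbalanced one with the same total load.
print(overall_return([0.6, 0.6, 0.6]))   # 0.0
print(overall_return([1.0, 0.6, 0.2]))   # -1.6
```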
  • Step 400: Use the multi-agent deep reinforcement learning model and the simulation environment for offline training and learning, and train an agent model for each server;
  • in step 400, offline training is performed in the simulation environment built from real data, and an agent is created for each service node.
  • The agent on each service node adjusts the amount of resources to be scheduled through its prediction module and marks the virtual machines that need to be scheduled away;
  • a scheduling strategy is generated from the pending virtual machines, each service node then computes its own return value, the values are summed to obtain the total return, and finally the parameters of each prediction module are adjusted according to the total return.
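  • At a high level, one training step of this offline loop could look like the following sketch; `env` and `agents` are assumed placeholder interfaces (a multi-agent gym-style environment and agents exposing act/remember/update), not the patent's implementation:

```python
def train_offline(env, agents, episodes: int = 1000):
    """Offline training in the simulated environment built from collected traces."""
    for _ in range(episodes):
        obs = env.reset()                        # one observation window per service node
        done = False
        while not done:
            # Each agent predicts the share of its node's resources to release and
            # marks VMs as pending; the environment turns the joint action into a plan.
            actions = [agent.act(o) for agent, o in zip(agents, obs)]
            next_obs, rewards, done, _ = env.step(actions)
            total_return = sum(rewards)          # per-node returns summed into R
            for agent, o, a, o2 in zip(agents, obs, actions, next_obs):
                agent.remember(o, a, total_return, o2)
                agent.update()                   # adjust the prediction module's parameters
            obs = next_obs
```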
  • Step 500: Deploy the trained agent models to the real service nodes, and perform scheduling according to the load conditions of each service node.
  • In step 500, each trained agent model is sent down to the corresponding service node in the real environment.
  • The agent first takes the state information observed on its server over a period of time as input and, through its prediction module, predicts the resources the current server hopes to release; it then uses the knapsack algorithm to select the virtual machines closest to that target and marks them as pending.
  • The scheduling module then collects the predictions from all servers together with the virtual machines marked as pending, assigns the pending virtual machines to suitable servers as needed to generate a scheduling strategy, and distributes the scheduling commands to the corresponding nodes for execution.
  • Before the strategy is executed, every scheduling command must be checked for legality. If a command is illegal, a penalty reward is fed back to update the parameters and the strategy is regenerated, iterating until all scheduling commands are executable; if a command is legal, it is executed and the feedback reward value is used to update the agent parameters.
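  • A sketch of this validate-then-execute loop is given below; the scheduler and command interfaces (generate_plan, is_legal, execute, feedback) are assumed placeholders used only to make the control flow concrete:

```python
def dispatch_with_validation(scheduler, nodes, max_retries: int = 10):
    """Online dispatch loop: propose a plan, validate every command, penalize and
    regenerate on illegal commands, execute legal plans and feed the reward back."""
    for _ in range(max_retries):
        plan = scheduler.generate_plan(nodes)            # assign pending VMs to target nodes
        illegal = [cmd for cmd in plan if not cmd.is_legal()]
        if illegal:
            scheduler.feedback(penalty=-1.0)             # penalty reward, then retry
            continue
        rewards = [cmd.execute() for cmd in plan]        # perform the migrations
        scheduler.feedback(reward=sum(rewards))          # keep updating agent parameters online
        return plan
    raise RuntimeError("no executable scheduling plan found")
```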
  • the specific overall scheduling framework is shown in Figure 3.
  • This application therefore improves on the above algorithms by replacing the model's action space with the resources the current server hopes to release, that is, how many resources should be scheduled away from it to maintain load balance under the overall network topology.
  • This setting avoids using a global id to mark each virtual machine, and it continues to work even if new virtual machines are added midway, which makes the scheduling algorithm more flexible and able to adapt to a wider range of scenarios.
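  • As an illustrative interpretation of this action space (the normalization to [0, 1] and the clipping rule are assumptions), a scalar action can be mapped onto the node's own capacity without referring to any VM ids:

```python
def action_to_release_amount(action: float, node_capacity: float, node_used: float) -> float:
    """Map a normalized action in [0, 1] to an absolute amount of resource to release.

    The action is bounded by the node's own total capacity, so no global VM ids are
    needed and newly added VMs are handled naturally; a node cannot release more
    than it currently hosts.
    """
    requested = action * node_capacity
    return min(requested, node_used)


# A node with 32 cores, 20 in use; the agent outputs 0.25 -> release 8 cores' worth of VMs.
print(action_to_release_amount(0.25, 32.0, 20.0))   # 8.0
```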
  • FIG. 4 is a schematic structural diagram of a multi-agent reinforcement learning scheduling system according to an embodiment of the present application.
  • the multi-agent reinforcement learning scheduling system of the embodiment of the present application includes an information collection module, a preprocessing module, a reinforcement learning model construction module, an agent model training module, and an agent deployment module.
  • Information collection module: used to collect the server parameters of the network data center and the load information of the virtual machines running on each server. The collected server parameters specifically include the configuration information of each server over a period of time in the real scenario, such as memory and hard-disk storage space; the collected virtual machine load information specifically includes the resource-usage parameters of the virtual machines running on each server, such as CPU occupancy and memory and hard-disk occupancy.
  • Preprocessing module: used to perform preprocessing operations such as normalization on the collected server parameters and virtual machine load information. The preprocessing defines the virtual machine information of each service node as a tuple that contains the number of virtual machines and their respective configurations, including CPU, memory, hard disk and current state; each virtual machine has two scheduling states, the pending state and the running state, each service node has two states, the saturated state and the starved state, and the sum of the resources occupied by the virtual machines cannot exceed the upper limit of the server's configuration.
  • Reinforcement learning model building module: used to establish a virtual simulation environment from the preprocessed data and to build a multi-agent deep reinforcement learning model. Building the model specifically includes modeling the collected time-series dynamic information (server parameters and virtual machine load information) to create a simulation environment for offline training.
  • The model is a multi-agent deep reinforcement learning model;
  • to make full use of the time-series data, the deep network part of the model uses an LSTM to extract temporal information and avoid the influence of transient abnormal data fluctuations on decision-making.
  • The model adopts the MADDPG framework, which is the extension of the DDPG algorithm to the multi-agent field;
  • the DDPG algorithm applies deep reinforcement learning to continuous action spaces.
  • The action space produced by the deep learning part is defined as the share of resources of the virtual machines to be placed in the pending state, that is, how much space must be scheduled away to keep the current service node's load balanced.
  • According to the predicted space to be released, virtual machines of suitable size are marked as pending; then the pending virtual machines on every service node in the network and the return rewards of the service nodes are computed, and the reward obtained by assigning a virtual machine to a service node is used as a distance metric to generate the scheduling strategy.
  • To cope with transient abnormal load fluctuations in a dynamic environment, a recurrent neural network, the LSTM (Long Short-Term Memory) network, replaces the fully connected network usually used in deep reinforcement learning, so that the agents can learn the information hidden in the time-series data and achieve adaptive scheduling based on spatio-temporal awareness.
  • The agent on each service node marks virtual machines as pending by solving a knapsack problem:
  • the predicted space to be released is treated as the knapsack capacity, and the resources occupied by each virtual machine serve as both the item's weight and its value. It suffices to compute the maximum value the knapsack can hold and mark the packed virtual machines as pending. The predicted to-be-scheduled space on each service node is then tallied (a negative number indicates how many resources could be scheduled in to make full use of the node), and the objective is to minimize the sum, over all service nodes, of the space to be scheduled away and the space to be scheduled in; the scheduling strategy is obtained from this calculation.
  • The MADDPG framework extends deep reinforcement learning techniques to the multi-agent field.
  • The algorithm supports centralized learning and decentralized execution in a multi-agent environment,
  • and with this framework multiple agents can learn to cooperate and compete.
  • the reinforcement learning model building module includes a prediction module and a scheduling module.
  • the prediction module includes a state perception unit, an action space unit, and a reward function unit. The specific functions are as follows:
  • State perception unit: predicts, from the information input by each node, the resources that need to be scheduled away in the current state; the input state is defined by each node's load information and the resources occupied by its running virtual machines;
  • Action space unit: maps the action space onto the total capacity of the current service node according to the node's configuration information;
  • Scheduling module: performs rescheduling and allocation according to the virtual machines marked as pending to generate a scheduling strategy, and the agent on each service node computes a reward function from the generated scheduling actions;
  • Reward function unit: measures the quality of the scheduling strategy; its goal is to balance the load of every service node in the entire network, and the reward function of each service node is computed separately.
  • In the reward formula, r_i is the reward obtained on each service node, c represents the CPU occupancy of the i-th machine, and α and β are penalty coefficients;
  • α can be set according to the situation and represents the threshold around which the server's CPU load is expected to remain steady.
  • R is the overall return function, obtained by summing the per-node rewards,
  • and the final optimization goal is for the scheduling strategy produced by the agents' cooperation to maximize R.
  • Agent model training module: used to perform offline training and learning with the multi-agent deep reinforcement learning model and the simulation environment, training an agent model for each server. Offline training is carried out in the simulation environment built from real data, and an agent is created for each service node; the agent on each service node adjusts the amount of resources to be scheduled through its prediction module, marks the virtual machines that need to be scheduled away, and generates a scheduling strategy from the pending virtual machines. Each service node then computes its own return value, the values are summed to obtain the total return, and finally the parameters of each prediction module are adjusted according to the total return.
  • Agent deployment module: used to deploy the trained agent models to the real service nodes and schedule according to the load of each service node. Each trained agent model is sent down to the corresponding service node in the real environment, and the agent's prediction module is then used to predict and mark the pending state;
  • the scheduling module allocates uniformly to produce the scheduling strategy and distributes the scheduling commands to the corresponding nodes for execution. Before a scheduling action is executed, it must be determined whether the action can be executed; if it cannot be executed or execution fails, a penalty reward is fed back to update the parameters and the scheduling strategy is regenerated, iterating until all scheduling strategies can be executed.
  • This application therefore improves on the above algorithms by replacing the model's action space with the resources the current server hopes to release, that is, how many resources should be scheduled away from it to maintain load balance under the overall network topology.
  • This setting avoids using a global id to mark each virtual machine, and it continues to work even if new virtual machines are added midway, which makes the scheduling algorithm more flexible and able to adapt to a wider range of scenarios.
  • FIG. 5 is a schematic diagram of the hardware device structure of the multi-agent reinforcement learning scheduling method provided by an embodiment of the present application.
  • As shown in FIG. 5, the device includes one or more processors and a memory; taking one processor as an example, the device may also include an input system and an output system.
  • The processor, the memory, the input system and the output system may be connected by a bus or in other ways;
  • in FIG. 5, connection by a bus is taken as an example.
  • the memory can be used to store non-transitory software programs, non-transitory computer executable programs, and modules.
  • the processor executes various functional applications and data processing of the electronic device by running non-transitory software programs, instructions, and modules stored in the memory, that is, realizing the processing methods of the foregoing method embodiments.
  • the memory may include a program storage area and a data storage area, where the program storage area can store an operating system and an application program required by at least one function; the data storage area can store data and the like.
  • the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device.
  • In some embodiments, the memory may optionally include memory arranged remotely with respect to the processor, and these remote memories may be connected to the processing system through a network. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
  • the input system can receive input digital or character information, and generate signal input.
  • the output system may include display devices such as a display screen.
  • the one or more modules are stored in the memory, and when executed by the one or more processors, the following operations of any of the foregoing method embodiments are performed:
  • Step a: Collect server parameters of the network data center and load information of the virtual machines running on each server;
  • Step b: Use the server parameters and virtual machine load information to establish a virtual simulation environment, and establish a multi-agent deep reinforcement learning model;
  • Step c: Use the multi-agent deep reinforcement learning model and the simulation environment for offline training and learning, and train an agent model for each server;
  • Step d: Deploy the agent models to the real service nodes, and perform scheduling according to the load conditions of each service node.
  • the embodiments of the present application provide a non-transitory (non-volatile) computer storage medium, the computer storage medium stores computer executable instructions, and the computer executable instructions can perform the following operations:
  • Step a: Collect server parameters of the network data center and load information of the virtual machines running on each server;
  • Step b: Use the server parameters and virtual machine load information to establish a virtual simulation environment, and establish a multi-agent deep reinforcement learning model;
  • Step c: Use the multi-agent deep reinforcement learning model and the simulation environment for offline training and learning, and train an agent model for each server;
  • Step d: Deploy the agent models to the real service nodes, and perform scheduling according to the load conditions of each service node.
  • The embodiments of the present application provide a computer program product; the computer program product includes a computer program stored on a non-transitory computer-readable storage medium, and the computer program includes program instructions that, when executed by a computer, cause the computer to perform the following operations:
  • Step a: Collect server parameters of the network data center and load information of the virtual machines running on each server;
  • Step b: Use the server parameters and virtual machine load information to establish a virtual simulation environment, and establish a multi-agent deep reinforcement learning model;
  • Step c: Use the multi-agent deep reinforcement learning model and the simulation environment for offline training and learning, and train an agent model for each server;
  • Step d: Deploy the agent models to the real service nodes, and perform scheduling according to the load conditions of each service node.
  • The multi-agent reinforcement learning scheduling method, system and electronic device of the embodiments of the present application virtualize the services running on the servers through virtualization technology and perform load balancing by scheduling virtual machines, because the scheduling scope is no longer limited to a single server.
  • When a server is under high load, its virtual machines can be scheduled to run on other low-load servers, which is more macroscopic than schemes that only reallocate resources within one machine.
  • At the same time, this application uses the MADDPG framework, which extends the actor-critic (AC) framework: the critic is given additional information about the decisions of the other agents, while each actor is trained using only local information.
  • Through this framework, multiple agents can produce collaborative strategies in a complex, dynamic environment.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer And Data Communications (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A multi-agent reinforcement learning scheduling method and system, and an electronic device. The method includes: step a: collecting server parameters of a network data center and load information of the virtual machines running on each server (100); step b: using the server parameters and virtual machine load information to establish a virtual simulation environment, and establishing a multi-agent deep reinforcement learning model; step c: using the multi-agent deep reinforcement learning model and the simulation environment for offline training and learning, and training an agent model for each server; step d: deploying the agent models to the real service nodes, and performing scheduling according to the load conditions of each service node. The services running on the servers are virtualized by virtualization technology and load balancing is performed by scheduling virtual machines, so resource allocation is more macroscopic, and multiple agents can produce collaborative strategies in a complex, dynamic environment.

Description

Multi-agent reinforcement learning scheduling method and system, and electronic device
Technical Field
This application belongs to the technical field of multi-agent systems, and in particular relates to a multi-agent reinforcement learning scheduling method and system and an electronic device.
Background Art
In a cloud computing environment, traditional service deployment methods can hardly cope with changing access patterns. Although a fixed allocation of resources can provide services stably, it also wastes a large amount of resources. For example, under the same network topology, some servers may often run at full load, while others host only a few services and still have much unused storage space and computing power. Traditional service deployment therefore has difficulty avoiding this waste of resources and achieving efficient scheduling, so resources cannot be used efficiently. A scheduling algorithm that can adapt to a dynamic environment is therefore needed to balance the load of the servers in the network.
With the development of virtualization technology, the emergence of virtual machines, containers and similar techniques has pushed the resource scheduling problem from static allocation to dynamic allocation. In recent years, schemes for adaptive resource scheduling have emerged one after another; most of them adopt heuristic algorithms that schedule dynamically by tuning parameters, judge against thresholds whether the available resources of the running environment are abundant or tight, and iteratively compute suitable thresholds with heuristic algorithms. However, this kind of scheduling only searches for an optimum over massive data combinations, and the optimal decision it finds applies only to the current point in time; it does not make full use of temporal information and can hardly solve the resource allocation problem in large, complex and dynamic environments.
With the rise of artificial intelligence, the development of deep reinforcement learning has made it possible for agents to make decisions over large state spaces. In the field of multi-agent reinforcement learning, distributed learning with traditional reinforcement learning algorithms such as Q-learning or PG (policy gradient methods) still cannot achieve the expected results, because at every step each agent tries to learn to predict the actions of the other agents, and in a dynamic environment the other agents are always changing, so the environment becomes unstable, knowledge is hard to learn, and optimal resource allocation cannot be achieved. In addition, from the perspective of reinforcement learning methods, most current scheduling approaches are single-agent reinforcement learning or distributed reinforcement learning. If only one agent is trained in a centralized way, the complex state changes under the network topology and the combinatorially large action space make the algorithm hard to train and slow to converge. Distributed reinforcement learning faces another problem: it usually trains multiple agents together to speed up convergence, but in fact these agents all share the same scheduling policy; multiple clones are merely used to accelerate training, so the result is a set of homogeneous agents without the ability to collaborate. In traditional multi-agent methods, each agent predicts the decisions of the other agents at every decision step, but because those decisions are also unstable in a dynamic environment, training is very difficult and the agents end up able to do almost the same things, without a collaborative strategy.
Summary of the Invention
This application provides a multi-agent reinforcement learning scheduling method and system and an electronic device, which aim to solve, at least to some extent, one of the above technical problems in the prior art.
To solve the above problems, this application provides the following technical solutions:
A multi-agent reinforcement learning scheduling method, including the following steps:
Step a: collecting server parameters of the network data center and load information of the virtual machines running on each server;
Step b: using the server parameters and virtual machine load information to establish a virtual simulation environment, and establishing a multi-agent deep reinforcement learning model;
Step c: using the multi-agent deep reinforcement learning model and the simulation environment for offline training and learning, and training an agent model for each server;
Step d: deploying the agent models to the real service nodes, and performing scheduling according to the load conditions of each service node.
The technical solution adopted in the embodiments of this application further includes: step a further includes performing a standardized preprocessing operation on the collected server parameters and virtual machine load information. The standardized preprocessing operation includes defining the virtual machine information of each service node as a tuple, where the tuple contains the number of virtual machines and their respective configurations; each virtual machine has two scheduling states, a pending state and a running state, each service node has two states, a saturated state and a starved state, and the sum of the resource shares occupied by the virtual machines is less than the upper limit of the server's configuration.
The technical solution adopted in the embodiments of this application further includes: in step b, the multi-agent deep reinforcement learning model specifically includes a prediction module and a scheduling module. The prediction module uses the information input by each service node to predict the resources that need to be scheduled away in the current state and maps the action space onto the total capacity of the current service node according to the node's configuration information; the scheduling module performs rescheduling and allocation according to the virtual machines marked as pending to generate a scheduling strategy, and the agent on each service node computes a reward function from the generated scheduling actions; the prediction module measures the quality of the scheduling strategy so that the load of every service node in the entire network is balanced.
The technical solution adopted in the embodiments of this application further includes: in step c, using the multi-agent deep reinforcement learning model and the simulation environment for offline training and learning and training an agent model for each server specifically includes: the agent on each service node adjusts the amount of resources to be scheduled through the prediction module, marks the virtual machines that need to be scheduled away, and generates a scheduling strategy from the virtual machines in the pending state; each service node computes its own return value, the values are summed to obtain the total return, and the parameters of each prediction module are adjusted according to the total return.
The technical solution adopted in the embodiments of this application further includes: in step d, deploying the agent models to the real service nodes and scheduling according to the load of each service node specifically means: deploying the trained agent models onto the corresponding service nodes in the real environment; the agent model takes the state information observed on its server over a period of time as input, predicts the resources the current server needs to release, and uses a knapsack algorithm to select the virtual machines closest to that target and mark them as pending; the scheduling module then collects the predictions from all servers together with the virtual machines marked as pending, assigns the pending virtual machines to suitable servers as needed to generate a scheduling strategy, and distributes the scheduling commands to the corresponding service nodes for execution; before the strategy is executed, each scheduling command is checked for legality; if it is illegal, a penalty reward is fed back to update the parameters and the strategy is regenerated; if it is legal, the scheduling operation is performed and the feedback reward value is used to update the agent parameters.
Another technical solution adopted in the embodiments of this application is a multi-agent reinforcement learning scheduling system, including:
an information collection module, used to collect server parameters of the network data center and the load information of the virtual machines running on each server;
a reinforcement learning model building module, used to establish a virtual simulation environment from the server parameters and virtual machine load information and to build a multi-agent deep reinforcement learning model;
an agent model training module, used to perform offline training and learning with the multi-agent deep reinforcement learning model and the simulation environment, training an agent model for each server;
an agent deployment module, used to deploy the agent models to the real service nodes and perform scheduling according to the load conditions of each service node.
The technical solution adopted in the embodiments of this application further includes a preprocessing module, which is used to perform standardized preprocessing operations on the collected server parameters and virtual machine load information. The standardized preprocessing operations include defining the virtual machine information of each service node as a tuple, where the tuple contains the number of virtual machines and their respective configurations; each virtual machine has two scheduling states, a pending state and a running state, each service node has two states, a saturated state and a starved state, and the sum of the resource shares occupied by the virtual machines is less than the upper limit of the server's configuration.
The technical solution adopted in the embodiments of this application further includes: the reinforcement learning model building module includes a prediction module and a scheduling module, and the prediction module includes:
a state perception unit, used to predict, from the information input by each service node, the resources that need to be scheduled away in the current state;
an action space unit, used to map the action space onto the total capacity of the current service node according to the node's configuration information;
the scheduling module performs rescheduling and allocation according to the virtual machines marked as pending to generate a scheduling strategy, and the agent on each service node computes a reward function from the generated scheduling actions;
the prediction module further includes:
a reward function unit, used to measure the quality of the scheduling strategy so that the load of every service node in the entire network is balanced.
The technical solution adopted in the embodiments of this application further includes: the agent model training module uses the multi-agent deep reinforcement learning model and the simulation environment for offline training and learning, and training an agent model for each server specifically means: the agent on each service node adjusts the amount of resources to be scheduled through the prediction module, marks the virtual machines that need to be scheduled away, and generates a scheduling strategy from the pending virtual machines; each service node computes its own return value, the values are summed to obtain the total return, and the parameters of each prediction module are adjusted according to the total return.
The technical solution adopted in the embodiments of this application further includes: the agent deployment module deploys the agent models to the real service nodes and schedules according to the load of each service node, specifically: the trained agent models are deployed onto the corresponding service nodes in the real environment; the agent model takes the state information observed on its server over a period of time as input, predicts the resources the current server needs to release, and uses a knapsack algorithm to select the virtual machines closest to that target and mark them as pending; the scheduling module then collects the predictions from all servers together with the virtual machines marked as pending, assigns the pending virtual machines to suitable servers as needed to generate a scheduling strategy, and distributes the scheduling commands to the corresponding service nodes for execution; before the strategy is executed, each scheduling command is checked for legality; if it is illegal, a penalty reward is fed back to update the parameters and the strategy is regenerated; if it is legal, the scheduling operation is performed and the feedback reward value is used to update the agent parameters.
Yet another technical solution adopted in the embodiments of this application is an electronic device, including:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the following operations of the above multi-agent reinforcement learning scheduling method:
Step a: collecting server parameters of the network data center and load information of the virtual machines running on each server;
Step b: using the server parameters and virtual machine load information to establish a virtual simulation environment, and establishing a multi-agent deep reinforcement learning model;
Step c: using the multi-agent deep reinforcement learning model and the simulation environment for offline training and learning, and training an agent model for each server;
Step d: deploying the agent models to the real service nodes, and performing scheduling according to the load conditions of each service node.
Compared with the prior art, the beneficial effects produced by the embodiments of this application are as follows: the multi-agent reinforcement learning scheduling method, system and electronic device of the embodiments virtualize the services running on the servers through virtualization technology and perform load balancing by scheduling virtual machines. Because the scheduling scope is no longer limited to a single server, when one server is under high load its virtual machines can be scheduled to run on other low-load servers, which is more macroscopic than schemes that only reallocate resources. At the same time, this application uses the MADDPG framework as an extension of the AC (actor-critic) framework: the critic is given additional information about the decisions of the other agents, while each actor can only be trained with local information. Through this framework, multiple agents can produce collaborative strategies in a complex, dynamic environment.
Brief Description of the Drawings
FIG. 1 is a flowchart of the multi-agent reinforcement learning scheduling method of an embodiment of this application;
FIG. 2 is a schematic diagram of the MADDPG scheduling framework of an embodiment of this application;
FIG. 3 is a schematic diagram of the overall scheduling framework of an embodiment of this application;
FIG. 4 is a schematic structural diagram of the multi-agent reinforcement learning scheduling system of an embodiment of this application;
FIG. 5 is a schematic diagram of the hardware device structure of the multi-agent reinforcement learning scheduling method provided by an embodiment of this application.
Detailed Description of the Embodiments
To make the purpose, technical solutions and advantages of this application clearer, this application is further described in detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain this application and are not intended to limit it.
To overcome the deficiencies of the prior art, the multi-agent reinforcement learning scheduling method of the embodiments of this application applies multi-agent reinforcement learning: it models the load information on each service node in a cloud service environment, uses a recurrent neural network to learn temporal information for decision-making, trains an agent for each server, and lets multiple agents with different tasks compete or cooperate to maintain load balance under the entire network topology. After preliminary training, each agent is sent down to its real service node and scheduling is performed according to the load of each node. While making decisions and scheduling, every agent keeps learning and improving from its own local environment and the decision memory of the other nodes, so that each agent can cooperate with the agents of the other nodes to generate scheduling strategies and achieve load balance across the service nodes.
Specifically, please refer to FIG. 1, which is a flowchart of the multi-agent reinforcement learning scheduling method of an embodiment of this application. The multi-agent reinforcement learning scheduling method of the embodiment includes the following steps:
Step 100: collect server parameters of the network data center and the load information of the virtual machines running on each server;
In step 100, the collected server parameters specifically include the configuration information of each server over a period of time in the real scenario, such as memory and hard-disk storage space; the collected virtual machine load information specifically includes the resource-usage parameters of the virtual machines running on each server, such as CPU occupancy and memory and hard-disk occupancy.
Step 200: perform preprocessing operations such as normalization on the collected server parameters and virtual machine load information;
In step 200, the preprocessing specifically includes defining the virtual machine information of each service node as a tuple that contains the number of virtual machines and their respective configurations, including CPU, memory, hard disk and current state; each virtual machine has two scheduling states, the pending state and the running state, each service node has two states, the saturated state and the starved state, and the sum of the resources occupied by the virtual machines cannot exceed the upper limit of the server's configuration.
Step 300: use the preprocessed data to establish a virtual simulation environment, and establish a multi-agent deep reinforcement learning model;
In step 300, establishing the multi-agent deep reinforcement learning model specifically includes: modeling the collected time-series dynamic information (server parameters and virtual machine load information) to create a simulation environment for offline training. The model is a multi-agent deep reinforcement learning model; to make full use of the time-series data, the deep network part of the model uses an LSTM to extract temporal information and avoid the influence of transient abnormal data fluctuations on decision-making. The model adopts the MADDPG framework (Multi-Agent Deep Deterministic Policy Gradient, from OpenAI's "Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments"), which is the extension to the multi-agent field of the DDPG algorithm (from Google DeepMind's "Continuous Control with Deep Reinforcement Learning"); DDPG applies deep reinforcement learning to continuous action spaces. The action space produced by the deep learning part is defined as the share of resources of the virtual machines to be placed in the pending state, that is, how much space must be scheduled away to keep the current service node's load balanced. According to the predicted space to be released, virtual machines of suitable size are marked as pending; then the pending virtual machines on every service node in the network and the return rewards of the service nodes are computed, and the reward obtained by assigning a virtual machine to a service node is used as a distance metric to generate the scheduling strategy. Finally, the scheduling strategy is checked for executability: if it is executable, the pending virtual machines are scheduled onto other suitable service nodes; a non-executable strategy returns a negative feedback penalty and the agents regenerate the strategy. The detailed scheduling framework is shown in FIG. 2.
In the embodiments of this application, to cope with the influence of transient abnormal load fluctuations in a dynamic environment, a recurrent neural network, the LSTM (Long Short-Term Memory) network, replaces the fully connected network used in deep reinforcement learning, so that the agents can learn the information hidden in the time-series data and achieve adaptive scheduling based on spatio-temporal awareness.
As described above, the agent on each service node marks virtual machines as pending by solving a knapsack problem: the predicted space to be released is treated as the knapsack capacity and the resources occupied by each virtual machine serve as both the item's weight and its value; it suffices to compute the maximum value the knapsack can hold and mark the packed virtual machines as pending. The predicted to-be-scheduled space on each service node is then tallied (a negative number indicates how many resources could be scheduled in to make full use of the node), and the objective is to minimize the sum, over all service nodes, of the space to be scheduled away and the space to be scheduled in; the scheduling strategy is obtained by this calculation.
In the embodiments of this application, the MADDPG framework extends deep reinforcement learning techniques to the multi-agent field. The algorithm supports centralized learning and decentralized execution in a multi-agent environment, and with this framework multiple agents can learn to cooperate and compete.
Specifically, the MADDPG algorithm computes the policy by considering a game among multiple agents parameterized by θ = {θ_1, θ_2, θ_3, ..., θ_n}. The policies of all agents can be defined as π = {π_1, π_2, π_3, ..., π_n}, and the expected return of the i-th agent is J(θ_i) = E[R_i]. When a deterministic policy μ_{θ_i} with parameters θ_i is considered, the gradient can be expressed as (standard MADDPG deterministic policy gradient; in the published document the equation appears as image PCTCN2019130582-appb-000001):
\nabla_{\theta_i} J(\mu_i) = \mathbb{E}_{x, a \sim \mathcal{D}} \left[ \nabla_{\theta_i} \mu_i(a_i \mid o_i) \, \nabla_{a_i} Q_i^{\mu}(x, a_1, \ldots, a_n) \big|_{a_i = \mu_i(o_i)} \right],
where x = (o_1, ..., o_n).
Specifically, the deep reinforcement learning model includes a prediction module and a scheduling module; the prediction module includes a state perception unit, an action space unit and a reward function unit, whose functions are as follows:
State perception unit: predicts, from the information input by each node, the resources that need to be scheduled away in the current state; the input state is defined by each node's load information and the resources occupied by its running virtual machines;
Action space unit: maps the action space onto the total capacity of the current service node according to the node's configuration information;
Scheduling module: performs rescheduling and allocation according to the virtual machines marked as pending to generate a scheduling strategy, and the agent on each service node computes a reward function from the generated scheduling actions;
Reward function unit: measures the quality of the scheduling strategy; its goal is load balance across every service node in the entire network, and the reward function of each service node is computed separately. The reward function is given by the following formula:
[Equation image PCTCN2019130582-appb-000002: per-node reward r_i]
In the above formula, r_i is the reward obtained on each service node, c represents the CPU occupancy of the i-th machine, and α and β are penalty coefficients. α can be set according to the situation and represents the threshold around which the server's CPU load is expected to remain steady.
[Equation image PCTCN2019130582-appb-000003: overall return function R, the sum of the per-node rewards]
In the above formula, R is the overall return function, and the final optimization goal is for the scheduling strategy produced by the cooperation of the agents to obtain the maximum R.
Step 400: use the multi-agent deep reinforcement learning model and the simulation environment for offline training and learning, and train an agent model for each server;
In step 400, offline training is performed in the simulation environment built from real data and an agent is created for each service node. The agent on each service node adjusts the amount of resources to be scheduled through its prediction module, marks the virtual machines that need to be scheduled away, and generates a scheduling strategy from the pending virtual machines; each service node then computes its own return value, the values are summed to obtain the total return, and finally the parameters of each prediction module are adjusted according to the total return.
Step 500: deploy the trained agent models to the real service nodes, and perform scheduling according to the load conditions of each service node.
In step 500, each trained agent model is sent down to the corresponding service node in the real environment. The agent first takes the state information observed on its server over a period of time as input and, through its prediction module, predicts the resources the current server hopes to release; it then uses the knapsack algorithm to select the virtual machines closest to that target and marks them as pending. The scheduling module then collects the predictions from all servers together with the virtual machines marked as pending, assigns the pending virtual machines to suitable servers as needed to generate a scheduling strategy, and distributes the scheduling commands to the corresponding nodes for execution. Before the strategy is executed, each scheduling command must be checked for legality: if it is illegal, a penalty reward is fed back to update the parameters and the strategy is regenerated, iterating until all scheduling strategies are executable; if it is legal, it is executed and the feedback reward value is used to update the agent parameters. The specific overall scheduling framework is shown in FIG. 3.
Ordinary multi-agent reinforcement learning usually derives scheduling actions directly from the environment input, but in a complex network topology the action space for virtual machine scheduling is far too large, which makes the algorithm hard to converge. Moreover, this approach requires every running virtual machine to be given a global id to specify the scheduling target; although an id can index a virtual machine, the resources occupied by a virtual machine may change at run time, so policies learned this way are unreliable. Even assuming that the resources occupied by virtual machines never change, if a new virtual machine is added, an agent trained with such an algorithm will not take it into account when making decisions. This application therefore improves on the above algorithms by replacing the model's action space with the resources the current server hopes to release, that is, how many resources should be scheduled away from it to maintain load balance under the overall network topology. This setting avoids using a global id to mark each virtual machine, and it continues to work even if new virtual machines are added midway, so the scheduling algorithm is more flexible and can adapt to a wider range of scenarios.
Please refer to FIG. 4, which is a schematic structural diagram of the multi-agent reinforcement learning scheduling system of an embodiment of this application. The multi-agent reinforcement learning scheduling system of the embodiment includes an information collection module, a preprocessing module, a reinforcement learning model building module, an agent model training module and an agent deployment module.
Information collection module: used to collect server parameters of the network data center and the load information of the virtual machines running on each server. The collected server parameters specifically include the configuration information of each server over a period of time in the real scenario, such as memory and hard-disk storage space; the collected virtual machine load information specifically includes the resource-usage parameters of the virtual machines running on each server, such as CPU occupancy and memory and hard-disk occupancy.
Preprocessing module: used to perform preprocessing operations such as normalization on the collected server parameters and virtual machine load information. The preprocessing specifically includes defining the virtual machine information of each service node as a tuple that contains the number of virtual machines and their respective configurations, including CPU, memory, hard disk and current state; each virtual machine has two scheduling states, the pending state and the running state, each service node has two states, the saturated state and the starved state, and the sum of the resources occupied by the virtual machines cannot exceed the upper limit of the server's configuration.
Reinforcement learning model building module: used to establish a virtual simulation environment from the preprocessed data and to build a multi-agent deep reinforcement learning model. Building the model specifically includes modeling the collected time-series dynamic information (server parameters and virtual machine load information) to create a simulation environment for offline training. The model is a multi-agent deep reinforcement learning model; to make full use of the time-series data, the deep network part uses an LSTM to extract temporal information and avoid the influence of transient abnormal data fluctuations on decision-making. The model adopts the MADDPG framework, the extension of the DDPG algorithm to the multi-agent field; DDPG applies deep reinforcement learning to continuous action spaces. The action space produced by the deep learning part is defined as the share of resources of the virtual machines to be placed in the pending state, that is, how much space must be scheduled away to keep the current service node's load balanced. According to the predicted space to be released, virtual machines of suitable size are marked as pending; then the pending virtual machines on every service node in the network and the return rewards of the service nodes are computed, and the reward obtained by assigning a virtual machine to a service node is used as a distance metric to generate the scheduling strategy. Finally, the strategy is checked for executability: if it is executable, the pending virtual machines are scheduled onto other suitable service nodes; a non-executable strategy returns a negative feedback penalty and the agents regenerate the strategy.
In the embodiments of this application, to cope with the influence of transient abnormal load fluctuations in a dynamic environment, a recurrent neural network, the LSTM (Long Short-Term Memory) network, replaces the fully connected network used in deep reinforcement learning, so that the agents can learn the information hidden in the time-series data and achieve adaptive scheduling based on spatio-temporal awareness.
As described above, the agent on each service node marks virtual machines as pending by solving a knapsack problem: the predicted space to be released is treated as the knapsack capacity and the resources occupied by each virtual machine serve as both the item's weight and its value; it suffices to compute the maximum value the knapsack can hold and mark the packed virtual machines as pending. The predicted to-be-scheduled space on each service node is then tallied (a negative number indicates how many resources could be scheduled in to make full use of the node), and the objective is to minimize the sum, over all service nodes, of the space to be scheduled away and the space to be scheduled in; the scheduling strategy is obtained by this calculation.
In the embodiments of this application, the MADDPG framework extends deep reinforcement learning techniques to the multi-agent field. The algorithm supports centralized learning and decentralized execution in a multi-agent environment, and with this framework multiple agents can learn to cooperate and compete.
Specifically, the MADDPG algorithm computes the policy by considering a game among multiple agents parameterized by θ = {θ_1, θ_2, θ_3, ..., θ_n}. The policies of all agents can be defined as π = {π_1, π_2, π_3, ..., π_n}, and the expected return of the i-th agent is J(θ_i) = E[R_i]. When a deterministic policy μ_{θ_i} with parameters θ_i is considered, the gradient can be expressed as (standard MADDPG deterministic policy gradient; in the published document the equation appears as image PCTCN2019130582-appb-000004):
\nabla_{\theta_i} J(\mu_i) = \mathbb{E}_{x, a \sim \mathcal{D}} \left[ \nabla_{\theta_i} \mu_i(a_i \mid o_i) \, \nabla_{a_i} Q_i^{\mu}(x, a_1, \ldots, a_n) \big|_{a_i = \mu_i(o_i)} \right],
where x = (o_1, ..., o_n).
Further, the reinforcement learning model building module includes a prediction module and a scheduling module; the prediction module includes a state perception unit, an action space unit and a reward function unit, whose functions are as follows:
State perception unit: predicts, from the information input by each node, the resources that need to be scheduled away in the current state; the input state is defined by each node's load information and the resources occupied by its running virtual machines;
Action space unit: maps the action space onto the total capacity of the current service node according to the node's configuration information;
Scheduling module: performs rescheduling and allocation according to the virtual machines marked as pending to generate a scheduling strategy, and the agent on each service node computes a reward function from the generated scheduling actions;
Reward function unit: measures the quality of the scheduling strategy; its goal is load balance across every service node in the entire network, and the reward function of each service node is computed separately. The reward function is given by the following formula:
[Equation image PCTCN2019130582-appb-000005: per-node reward r_i]
In the above formula, r_i is the reward obtained on each service node, c represents the CPU occupancy of the i-th machine, and α and β are penalty coefficients. α can be set according to the situation and represents the threshold around which the server's CPU load is expected to remain steady.
[Equation image PCTCN2019130582-appb-000006: overall return function R, the sum of the per-node rewards]
In the above formula, R is the overall return function, and the final optimization goal is for the scheduling strategy produced by the cooperation of the agents to obtain the maximum R.
Agent model training module: used to perform offline training and learning with the multi-agent deep reinforcement learning model and the simulation environment, training an agent model for each server. Offline training is carried out in the simulation environment built from real data and an agent is created for each service node; the agent on each service node adjusts the amount of resources to be scheduled through its prediction module, marks the virtual machines that need to be scheduled away, and generates a scheduling strategy from the pending virtual machines; each service node then computes its own return value, the values are summed to obtain the total return, and finally the parameters of each prediction module are adjusted according to the total return.
Agent deployment module: used to deploy the trained agent models to the real service nodes and schedule according to the load of each service node. Each trained agent model is sent down to the corresponding service node in the real environment; the agent's prediction module is then used to predict and mark the pending state, and the scheduling module allocates uniformly to produce the scheduling strategy and distributes the scheduling commands to the corresponding nodes for execution. Before a scheduling action is executed, it must be determined whether the action can be executed; if it cannot be executed or execution fails, a penalty reward is fed back to update the parameters and the scheduling strategy is regenerated, iterating until all scheduling strategies can be executed.
Ordinary multi-agent reinforcement learning usually derives scheduling actions directly from the environment input, but in a complex network topology the action space for virtual machine scheduling is far too large, which makes the algorithm hard to converge. Moreover, this approach requires every running virtual machine to be given a global id to specify the scheduling target; although an id can index a virtual machine, the resources occupied by a virtual machine may change at run time, so policies learned this way are unreliable. Even assuming that the resources occupied by virtual machines never change, if a new virtual machine is added, an agent trained with such an algorithm will not take it into account when making decisions. This application therefore improves on the above algorithms by replacing the model's action space with the resources the current server hopes to release, that is, how many resources should be scheduled away from it to maintain load balance under the overall network topology. This setting avoids using a global id to mark each virtual machine, and it continues to work even if new virtual machines are added midway, so the scheduling algorithm is more flexible and can adapt to a wider range of scenarios.
FIG. 5 is a schematic diagram of the hardware device structure of the multi-agent reinforcement learning scheduling method provided by an embodiment of this application. As shown in FIG. 5, the device includes one or more processors and a memory; taking one processor as an example, the device may also include an input system and an output system.
The processor, the memory, the input system and the output system may be connected by a bus or in other ways; in FIG. 5, connection by a bus is taken as an example.
As a non-transitory computer-readable storage medium, the memory can be used to store non-transitory software programs, non-transitory computer-executable programs and modules. The processor executes the various functional applications and data processing of the electronic device by running the non-transitory software programs, instructions and modules stored in the memory, that is, it implements the processing methods of the above method embodiments.
The memory may include a program storage area and a data storage area, where the program storage area can store the operating system and the application programs required by at least one function, and the data storage area can store data and the like. In addition, the memory may include high-speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device or other non-transitory solid-state storage device. In some embodiments, the memory may optionally include memory arranged remotely with respect to the processor, and these remote memories may be connected to the processing system through a network. Examples of such networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.
The input system can receive input digital or character information and generate signal input. The output system may include display devices such as a display screen.
The one or more modules are stored in the memory and, when executed by the one or more processors, perform the following operations of any of the above method embodiments:
Step a: collecting server parameters of the network data center and load information of the virtual machines running on each server;
Step b: using the server parameters and virtual machine load information to establish a virtual simulation environment, and establishing a multi-agent deep reinforcement learning model;
Step c: using the multi-agent deep reinforcement learning model and the simulation environment for offline training and learning, and training an agent model for each server;
Step d: deploying the agent models to the real service nodes, and performing scheduling according to the load conditions of each service node.
The above product can execute the methods provided by the embodiments of this application and has the corresponding functional modules and beneficial effects for executing those methods. For technical details not described in detail in this embodiment, refer to the methods provided by the embodiments of this application.
The embodiments of this application provide a non-transitory (non-volatile) computer storage medium that stores computer-executable instructions, and the computer-executable instructions can perform the following operations:
Step a: collecting server parameters of the network data center and load information of the virtual machines running on each server;
Step b: using the server parameters and virtual machine load information to establish a virtual simulation environment, and establishing a multi-agent deep reinforcement learning model;
Step c: using the multi-agent deep reinforcement learning model and the simulation environment for offline training and learning, and training an agent model for each server;
Step d: deploying the agent models to the real service nodes, and performing scheduling according to the load conditions of each service node.
The embodiments of this application provide a computer program product; the computer program product includes a computer program stored on a non-transitory computer-readable storage medium, and the computer program includes program instructions that, when executed by a computer, cause the computer to perform the following operations:
Step a: collecting server parameters of the network data center and load information of the virtual machines running on each server;
Step b: using the server parameters and virtual machine load information to establish a virtual simulation environment, and establishing a multi-agent deep reinforcement learning model;
Step c: using the multi-agent deep reinforcement learning model and the simulation environment for offline training and learning, and training an agent model for each server;
Step d: deploying the agent models to the real service nodes, and performing scheduling according to the load conditions of each service node.
The multi-agent reinforcement learning scheduling method, system and electronic device of the embodiments of this application virtualize the services running on the servers through virtualization technology and perform load balancing by scheduling virtual machines. Because the scheduling scope is no longer limited to a single server, when one server is under high load its virtual machines can be scheduled to run on other low-load servers, which is more macroscopic than schemes that only reallocate resources. At the same time, this application uses the MADDPG framework as an extension of the AC framework: the critic is given additional information about the decisions of the other agents, while each actor can only be trained with local information. Through this framework, multiple agents can produce collaborative strategies in a complex, dynamic environment.
The above description of the disclosed embodiments enables those skilled in the art to implement or use this application. Various modifications to these embodiments will be obvious to those skilled in the art, and the general principles defined in this application can be implemented in other embodiments without departing from the spirit or scope of this application. Therefore, this application will not be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (11)

  1. A multi-agent reinforcement learning scheduling method, characterized by including the following steps:
    Step a: collecting server parameters of a network data center and load information of the virtual machines running on each server;
    Step b: using the server parameters and virtual machine load information to establish a virtual simulation environment, and establishing a multi-agent deep reinforcement learning model;
    Step c: using the multi-agent deep reinforcement learning model and the simulation environment for offline training and learning, and training an agent model for each server;
    Step d: deploying the agent models to real service nodes, and performing scheduling according to the load conditions of each service node.
  2. The multi-agent reinforcement learning scheduling method according to claim 1, characterized in that step a further includes: performing a standardized preprocessing operation on the collected server parameters and virtual machine load information; the standardized preprocessing operation includes: defining the virtual machine information of each service node as a tuple, the tuple including the number of virtual machines and their respective configurations, each virtual machine having two scheduling states, namely a pending state and a running state, and each service node having two states, namely a saturated state and a starved state, the sum of the resource shares occupied by the virtual machines being less than the upper limit of the server's configuration.
  3. The multi-agent reinforcement learning scheduling method according to claim 1 or 2, characterized in that, in step b, the multi-agent deep reinforcement learning model specifically includes a prediction module and a scheduling module; the prediction module predicts, from the information input by each service node, the resources that need to be scheduled away in the current state, and maps the action space onto the total capacity of the current service node according to the node's configuration information; the scheduling module performs rescheduling and allocation according to the virtual machines marked as pending to generate a scheduling strategy, and the agent on each service node computes a reward function from the generated scheduling actions; the prediction module measures the quality of the scheduling strategy so that the load of every service node in the entire network is balanced.
  4. The multi-agent reinforcement learning scheduling method according to claim 3, characterized in that, in step c, using the multi-agent deep reinforcement learning model and the simulation environment for offline training and learning and training an agent model for each server specifically includes: the agent on each service node adjusts the amount of resources to be scheduled through the prediction module, marks the virtual machines that need to be scheduled away, and generates a scheduling strategy from the virtual machines in the pending state; each service node computes its own return value, the values are summed to obtain the total return, and the parameters of each prediction module are adjusted according to the total return.
  5. The multi-agent reinforcement learning scheduling method according to claim 4, characterized in that, in step d, deploying the agent models to real service nodes and scheduling according to the load of each service node specifically means: deploying the trained agent models onto the corresponding service nodes in the real environment; the agent model takes the state information observed on its server over a period of time as input, predicts the resources the current server needs to release, and uses a knapsack algorithm to select the virtual machines closest to that target and mark them as pending; the scheduling module then collects the predictions from all servers together with the virtual machines marked as pending, assigns the pending virtual machines to suitable servers as needed to generate a scheduling strategy, and distributes the scheduling commands to the corresponding service nodes for execution; before the strategy is executed, each scheduling command is checked for legality; if it is illegal, a penalty reward is fed back to update the parameters and the strategy is regenerated; if it is legal, the scheduling operation is performed and the feedback reward value is used to update the agent parameters.
  6. A multi-agent reinforcement learning scheduling system, characterized by including:
    an information collection module: used to collect server parameters of a network data center and the load information of the virtual machines running on each server;
    a reinforcement learning model building module: used to establish a virtual simulation environment from the server parameters and virtual machine load information and to build a multi-agent deep reinforcement learning model;
    an agent model training module: used to perform offline training and learning with the multi-agent deep reinforcement learning model and the simulation environment, training an agent model for each server;
    an agent deployment module: used to deploy the agent models to real service nodes and perform scheduling according to the load conditions of each service node.
  7. The multi-agent reinforcement learning scheduling system according to claim 6, characterized by further including a preprocessing module, which is used to perform a standardized preprocessing operation on the collected server parameters and virtual machine load information; the standardized preprocessing operation includes: defining the virtual machine information of each service node as a tuple, the tuple including the number of virtual machines and their respective configurations, each virtual machine having two scheduling states, namely a pending state and a running state, and each service node having two states, namely a saturated state and a starved state, the sum of the resource shares occupied by the virtual machines being less than the upper limit of the server's configuration.
  8. The multi-agent reinforcement learning scheduling system according to claim 6 or 7, characterized in that the reinforcement learning model building module includes a prediction module and a scheduling module, and the prediction module includes:
    a state perception unit: used to predict, from the information input by each service node, the resources that need to be scheduled away in the current state;
    an action space unit: used to map the action space onto the total capacity of the current service node according to the node's configuration information;
    the scheduling module performs rescheduling and allocation according to the virtual machines marked as pending to generate a scheduling strategy, and the agent on each service node computes a reward function from the generated scheduling actions;
    the prediction module further includes:
    a reward function unit: used to measure the quality of the scheduling strategy so that the load of every service node in the entire network is balanced.
  9. The multi-agent reinforcement learning scheduling system according to claim 8, characterized in that the agent model training module uses the multi-agent deep reinforcement learning model and the simulation environment for offline training and learning, and training an agent model for each server specifically means: the agent on each service node adjusts the amount of resources to be scheduled through the prediction module, marks the virtual machines that need to be scheduled away, and generates a scheduling strategy from the pending virtual machines; each service node computes its own return value, the values are summed to obtain the total return, and the parameters of each prediction module are adjusted according to the total return.
  10. The multi-agent reinforcement learning scheduling system according to claim 9, characterized in that the agent deployment module deploys the agent models to real service nodes and schedules according to the load of each service node, specifically: the trained agent models are deployed onto the corresponding service nodes in the real environment; the agent model takes the state information observed on its server over a period of time as input, predicts the resources the current server needs to release, and uses a knapsack algorithm to select the virtual machines closest to that target and mark them as pending; the scheduling module then collects the predictions from all servers together with the virtual machines marked as pending, assigns the pending virtual machines to suitable servers as needed to generate a scheduling strategy, and distributes the scheduling commands to the corresponding service nodes for execution; before the strategy is executed, each scheduling command is checked for legality; if it is illegal, a penalty reward is fed back to update the parameters and the strategy is regenerated; if it is legal, the scheduling operation is performed and the feedback reward value is used to update the agent parameters.
  11. An electronic device, including:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the following operations of the multi-agent reinforcement learning scheduling method according to any one of claims 1 to 5:
    Step a: collecting server parameters of a network data center and load information of the virtual machines running on each server;
    Step b: using the server parameters and virtual machine load information to establish a virtual simulation environment, and establishing a multi-agent deep reinforcement learning model;
    Step c: using the multi-agent deep reinforcement learning model and the simulation environment for offline training and learning, and training an agent model for each server;
    Step d: deploying the agent models to real service nodes, and performing scheduling according to the load conditions of each service node.
PCT/CN2019/130582 2019-03-14 2019-12-31 Multi-agent reinforcement learning scheduling method and system, and electronic device WO2020181896A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910193429.X 2019-03-14
CN201910193429.XA CN109947567B (zh) 2019-03-14 2019-03-14 Multi-agent reinforcement learning scheduling method and system, and electronic device

Publications (1)

Publication Number Publication Date
WO2020181896A1 true WO2020181896A1 (zh) 2020-09-17

Family

ID=67009966

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/130582 WO2020181896A1 (zh) 2019-03-14 2019-12-31 Multi-agent reinforcement learning scheduling method and system, and electronic device

Country Status (2)

Country Link
CN (1) CN109947567B (zh)
WO (1) WO2020181896A1 (zh)


Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10649966B2 (en) * 2017-06-09 2020-05-12 Microsoft Technology Licensing, Llc Filter suggestion for selective data import
CN108021451B (zh) * 2017-12-07 2021-08-13 上海交通大学 一种雾计算环境下的自适应容器迁移方法
CN109165081B (zh) * 2018-08-15 2021-09-28 福州大学 基于机器学习的Web应用自适应资源配置方法
CN109068350B (zh) * 2018-08-15 2021-09-28 西安电子科技大学 一种无线异构网络的终端自主选网系统及方法

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103873569A (zh) * 2014-03-05 2014-06-18 兰雨晴 一种基于IaaS云平台的资源优化部署方法
CN105607952A (zh) * 2015-12-18 2016-05-25 航天恒星科技有限公司 一种虚拟化资源的调度方法及装置
WO2018076791A1 (zh) * 2016-10-31 2018-05-03 华为技术有限公司 一种资源负载均衡控制方法及集群调度器
CN108829494A (zh) * 2018-06-25 2018-11-16 杭州谐云科技有限公司 基于负载预测的容器云平台智能资源优化方法
CN109947567A (zh) * 2019-03-14 2019-06-28 深圳先进技术研究院 一种多智能体强化学习调度方法、系统及电子设备

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WEI, LIANG: "Research on Resource Scheduling Algorithm and Experimental Platform for Cloud-network Integration", CNKI, CHINA DOCTORAL DISSERTATIONS FULL-TEXT DATABASE, 30 September 2018 (2018-09-30), DOI: 20200315230946X *

Also Published As

Publication number Publication date
CN109947567A (zh) 2019-06-28
CN109947567B (zh) 2021-07-20

Similar Documents

Publication Publication Date Title
WO2020181896A1 (zh) 一种多智能体强化学习调度方法、系统及电子设备
Rossi et al. Horizontal and vertical scaling of container-based applications using reinforcement learning
Liu et al. Adaptive asynchronous federated learning in resource-constrained edge computing
Ghobaei-Arani et al. A cost-efficient IoT service placement approach using whale optimization algorithm in fog computing environment
CN107888669B (zh) 一种基于深度学习神经网络的大规模资源调度系统及方法
Han et al. Tailored learning-based scheduling for kubernetes-oriented edge-cloud system
CN109491790A (zh) 基于容器的工业物联网边缘计算资源分配方法及系统
CN107404523A (zh) 云平台自适应资源调度系统和方法
CN110231976B (zh) 一种基于负载预测的边缘计算平台容器部署方法及系统
CN108965014A (zh) QoS感知的服务链备份方法及系统
CN104102533B (zh) 一种基于带宽感知的Hadoop调度方法和系统
CN114787830A (zh) 异构集群中的机器学习工作负载编排
CN109783225B (zh) 一种多租户大数据平台的租户优先级管理方法及系统
CN114841345B (zh) 一种基于深度学习算法的分布式计算平台及其应用
CN112732444A (zh) 一种面向分布式机器学习的数据划分方法
CN113742089A (zh) 异构资源中神经网络计算任务的分配方法、装置和设备
Cardellini et al. Self-adaptive container deployment in the fog: A survey
CN115543626A (zh) 采用异构计算资源负载均衡调度的电力缺陷图像仿真方法
Ye et al. SHWS: Stochastic hybrid workflows dynamic scheduling in cloud container services
CN109976873B (zh) 容器化分布式计算框架的调度方案获取方法及调度方法
CN112446484A (zh) 一种多任务训练集群智能网络系统及集群网络优化方法
Tuli et al. Optimizing the Performance of Fog Computing Environments Using AI and Co-Simulation
CN115562812A (zh) 面向机器学习训练的分布式虚拟机调度方法、装置和系统
Guérout et al. Autonomic energy-aware tasks scheduling
CN111782354A (zh) 一种基于强化学习的集中式数据处理时间优化方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19919235

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19919235

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 04/02/2022)

122 Ep: pct application non-entry in european phase

Ref document number: 19919235

Country of ref document: EP

Kind code of ref document: A1