WO2023161947A1 - Handling heterogeneous computation in multi-agent reinforcement learning - Google Patents

Handling heterogeneous computation in multi-agent reinforcement learning

Info

Publication number
WO2023161947A1
Authority
WO
WIPO (PCT)
Application number
PCT/IN2022/050162
Other languages
French (fr)
Inventor
Saravanan M
Perepu SATHEESH KUMAR
Original Assignee
Telefonaktiebolaget Lm Ericsson (Publ)
Application filed by Telefonaktiebolaget Lm Ericsson (Publ) filed Critical Telefonaktiebolaget Lm Ericsson (Publ)
Priority to PCT/IN2022/050162 priority Critical patent/WO2023161947A1/en
Publication of WO2023161947A1 publication Critical patent/WO2023161947A1/en

Classifications

    • G PHYSICS
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06F ELECTRIC DIGITAL DATA PROCESSING
                • G06F 8/00 Arrangements for software engineering
                    • G06F 8/60 Software deployment
                        • G06F 8/65 Updates
            • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
                • G06N 3/00 Computing arrangements based on biological models
                    • G06N 3/004 Artificial life, i.e. computing arrangements simulating life
                        • G06N 3/006 Artificial life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
                    • G06N 3/02 Neural networks
                        • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
                            • G06N 3/063 Physical realisation using electronic means
                        • G06N 3/08 Learning methods
                            • G06N 3/092 Reinforcement learning

Abstract

A method is provided that includes obtaining, from a first computing agent operating in an environment, a first value of a first action and a first computational metric of the first computing agent and obtaining, from a second computing agent operating in the environment, a second value of a second action and a second computational metric of the second computing agent. The method includes obtaining a global state of the environment. The method includes updating, using a first machine learning model and a second machine learning model, a global value of a joint action based on the first value of the first action, the first computational metric, the second value of the second action, the second computational metric, and the state of the environment. The method includes transmitting the updated global value of the joint action towards the first computing agent and the second computing agent.

Description

HANDLING HETEROGENEOUS COMPUTATION IN MULTI- AGENT REINFORCEMENT LEARNING
TECHNICAL FIELD
[001] Disclosed are embodiments related to handling heterogeneous computation in multi-agent reinforcement learning environments.
INTRODUCTION
[002] Advances in reinforcement learning (RL) have recorded sublime success in various domains. Although its single-agent counterpart has overshadowed the multi-agent domain during this progress, multi-agent reinforcement learning is gaining rapid traction, and its latest accomplishments address problems with real-world complexity. Traditional RL techniques aim to maximize some notion of long-term reward. The goal of RL is to find a policy that maps states of the world to the actions executed by the agent. The key assumption for RL is that the reward function being maximized is accessible to the agent.
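For reference, the Q-values referred to throughout this disclosure are action values of the standard temporal-difference form; a minimal example (textbook Q-learning, not specific to this disclosure) is the update

    Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right],

where \alpha is the learning rate and \gamma is the discount factor.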
[003] Cognitive networks may enable zero-touch deployment and operation, as well as continuous real-time performance improvements. The network does this with minimum human intervention by self-learning at scale from its experience and its interactions with the environment. In such self-learning systems, multiple agents may be observed individually and a reward function for each agent is learned independently from the other agents.
[004] This is not always a good solution, however, as the optimal behavior of one agent may be suboptimal for another. Another challenge is that one may never be able to observe the actions of the centralized controller directly but only the actions of the individual agents. Finally, considering the cross product of the state and action spaces of the individual agents can lead to a prohibitively large space. Moreover, it is challenging for multiple agents to learn policies effectively, because the policies of other agents are also part of the environment from the perspective of an individual agent. The multi-agent environment is non-stationary; thus, single-agent algorithms cannot be directly applied. In a multi-agent scenario, different agents participate together and work either collaboratively or competitively.
[005] Decentralized training or centralized training may be used to train these agents. In many cases, where there is a large number of agents, one may use decentralized training methods such as Independent Q-Learning (IQL) or centralized-training, decentralized-execution methods such as QMIX.
[006] Sunehag et al. [1] propose a value decomposition network (VDN) method, which updates the Q-networks of individual agents based on the sum of all the agents' value functions, using centralized training and decentralized execution. However, such methods may not consider the extra state information of the environment and cannot be applied to all general MARL problems, particularly where the joint Q function is not a linear function of the individual Q functions. To address this problem, Rashid et al. [2] proposed the QMIX method, which lies between the extremes of Value Decomposition Networks (VDN) and Counterfactual Multi-Agent (COMA) policies. The proposed approach uses a mixing network that combines the individual agents' value functions, which is then used to obtain the global Q value. Further, the agents' value functions are trained based on the global Q value, and the mixing network is trained by conditioning it on the state of the environment. In this way, the state of the environment can be included in the training of the individual agents, and the agents are trained based on other agents' performances. However, the issue with this approach is that the global Q value must increase monotonically with respect to each individual agent's value function. To handle cases where this monotonicity assumption does not hold, Mahajan et al. [3] proposed a method known as MAVEN, which trains decentralized policies for agents that condition their behavior on a shared latent variable controlled by a hierarchical policy.
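For context, the standard formulations from references [1] and [2] can be written as follows (restated here in conventional notation; τ^i denotes agent i's action-observation history and u^i its action):

    Q_{tot}^{VDN}(\boldsymbol{\tau}, \mathbf{u}) = \sum_{i=1}^{N} Q_i(\tau^i, u^i),
    \qquad \text{QMIX:}\quad \frac{\partial Q_{tot}(\boldsymbol{\tau}, \mathbf{u}, s)}{\partial Q_i(\tau^i, u^i)} \geq 0 \quad \forall i,

i.e., VDN factorizes the joint action value as a sum of per-agent utilities, whereas QMIX only requires the learned mixing to be monotonic in each agent's utility.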
SUMMARY
[007] One problem with the above methods is that they generally assume the agents have similar computational power. For example, in QMIX, the individual Q-values of the agents are mixed through a mixing network to arrive at Q_tot. The individual agents are then trained with Q_tot instead of their local Q-values.
[008] Accordingly, the performance of each agent depends on Q_tot. If one agent's Q-value is not accurate, the inaccurate Q-value will spoil the training of all the agents. All the agents' Q-values are typically assumed to have the same accuracy in order to compute Q_tot using the mixing function.
[009] In [4] the authors proposed a bottom-up MARL approach where other agents' rewards are predicted by another agent and used to calculate the global reward. Here also, however, the agents are assumed to have the same computational power. [0010] In many cases, though, different agents have different computational power, and because the Q-values are computed at the local agent level, an agent's computational power can affect the accuracy of its local Q-values. Accordingly, if different agents' Q-values are used blindly, the result can be inaccurate training and poor collaboration. Hence, aspects of the present disclosure describe techniques for handling heterogeneous computation in different agents and training the agents while taking the computational power of the agents into account.
[0011] Embodiments of the present disclosure address situations where different agents have different computational powers. Depending on the agent computational power, the agent Q-values will be weighted and the Q_tot value will be computed using, for example, VDN and/or QMIX. According to some embodiments, a deep neural network is used to obtain the weighting factors, and the weights are used to scale the respective individual Q-values. According to some embodiments, a hypernetwork-based approach is used to train the network using the global state of the environment.
[0012] In some embodiments, the techniques disclosed herein arrive at a solution in multi-agent reinforcement learning (MARL) where different agents have different computational powers. A hypernetwork-based approach may be used to predict the weights that can handle the different computational powers. The computational power may be measured by different metrics, for example the number of iterations or epochs trained in each agent, the number of cores available to train the model, or the available memory.
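As a purely illustrative baseline (not the learned approach of this disclosure), such metrics could be normalized directly into weights; the metric names and values below are hypothetical:

    import numpy as np

    def naive_weights_from_metrics(metrics):
        # Normalize per-agent computational metrics into weights that sum to one.
        # The keys 'epochs_trained', 'num_cores', and 'memory_gb' are assumed for
        # illustration; the disclosure instead learns the weights with a neural
        # network conditioned on the metrics and on the global state.
        scores = np.array(
            [m["epochs_trained"] * m["num_cores"] * m["memory_gb"] for m in metrics],
            dtype=float,
        )
        return scores / scores.sum()

    # Example: a GPU server, a CPU server, and a Raspberry-Pi-class device.
    print(naive_weights_from_metrics([
        {"epochs_trained": 100, "num_cores": 16, "memory_gb": 64},
        {"epochs_trained": 100, "num_cores": 8, "memory_gb": 16},
        {"epochs_trained": 20, "num_cores": 4, "memory_gb": 1},
    ]))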
[0013] According to one aspect, a computer-implemented method for training a plurality of computing agents is provided. The method includes obtaining, from a first computing agent operating in an environment, a first value of a first action and a first computational metric of the first computing agent. The method includes obtaining, from a second computing agent different than the first computing agent operating in the environment, a second value of a second action and a second computational metric of the second computing agent. The method includes obtaining a global state of the environment. The method includes updating, using a first machine learning model and a second machine learning model, a global value of a joint action based on the first value of the first action, the first computational metric, the second value of the second action, the second computational metric, and the state of the environment. The method includes transmitting the updated global value of the joint action towards the first computing agent. The method includes transmitting the updated global value of the joint action towards the second computing agent.
[0014] In another aspect there is provided a device with processing circuitry adapted to perform the method. In another aspect there is provided a computer program comprising instructions which, when executed by processing circuitry of a device, cause the device to perform the method. In another aspect there is provided a carrier containing the computer program, where the carrier is one of an electronic signal, an optical signal, a radio signal, and a computer readable storage medium.
BRIEF DESCRIPTION OF THE DRAWINGS
[0015] The accompanying drawings, which are incorporated herein and form part of the specification, illustrate various embodiments.
[0016] FIG. 1 illustrates a QMIX algorithm.
[0017] FIG. 2 is a block diagram of a system architecture, according to some embodiments.
[0018] FIG. 3 illustrates results of a testing environment, according to some embodiments.
[0019] FIG. 4 is a signalling diagram, according to some embodiments.
[0020] FIG. 5 illustrates a method, according to some embodiments.
[0021] FIG. 6 is a block diagram of a device, according to some embodiments.
DETAILED DESCRIPTION
[0022] Embodiments of the present disclosure are directed to decentralized learning in MARL environments where the agents have different computational powers. Depending on the agent computational power, the agent Q-values will be weighted and the Q_tot value will be computed using one or more machine learning (ML) models described herein. Embodiments of the present disclosure may be used to address a variety of MARL problems arising in telecommunications applications where the agents have different computing powers, such as, for example, an antenna tilting problem to provide good quality of experience for different users while minimizing interference, a channel prediction problem, an optimal handover problem, a load balancing problem, or a resource allocation problem among Internet of Things (IoT) devices. In addition, embodiments of the present disclosure may be used to address MARL problems in multi-player games, such as battle strategies among multiple characters in a game. [0023] According to some embodiments, the below steps are performed to predict the weights of the agents (e.g., controllers in IoT devices, network nodes, gaming agents) and update the utility networks (e.g., allocation of resources, gaming strategy) of the agents.
[0024] First, compute the utility values of the agents at the edge level (e.g., where a computing resource interfaces with a communications network) and send the computed Q-values to the central server (e.g., a node in the communications network). In addition, the available computational power of each edge node is sent to the central server.
[0025] Second, a weight prediction network is used to compute the weights of the individual Q-values to arrive at Q_tot, corresponding to a collaborative or joint action of the agents (e.g., agent one tilts its antenna and agent two does not tilt its antenna).
[0026] Third, based on the computed Q_tot value, the back gradients (in the case of QMIX) are estimated and sent to the local edge nodes. The back gradients may include updated global values of a joint action for each agent.
[0027] Fourth, based on these gradients, the individual utility networks are updated.
[0028] Fifth, the joint action based on the updated utility networks is applied to the environment. [0029] Based on the reward obtained, steps one to five are repeated until convergence.
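A minimal sketch of steps one through five is shown below. All object and method names (Agent, CentralServer, env, and so on) are illustrative assumptions standing in for the components described above; the sketch shows the control flow only, not the claimed implementation.

    # Hypothetical control flow for steps one to five.
    def train_until_convergence(agents, central_server, env, max_rounds=10000):
        state = env.reset()                                  # global state S_t
        for _ in range(max_rounds):
            # Step 1: each edge agent computes its local Q-values and reports them,
            # together with its available computational power, to the central server.
            reports = [(agent.q_values(agent.observe(env)), agent.computational_power())
                       for agent in agents]
            # Step 2: the weight prediction network computes per-agent weights and the
            # mixing network combines the weighted Q-values into Q_tot.
            q_tot = central_server.mix(reports, state)
            # Step 3: back gradients (in the QMIX case) are estimated from Q_tot and
            # sent back to the local edge nodes.
            gradients = central_server.back_gradients(q_tot)
            # Step 4: each agent updates its utility network using its gradient.
            for agent, grad in zip(agents, gradients):
                agent.update_utility_network(grad)
            # Step 5: the joint action from the updated utility networks is applied to
            # the environment; based on the reward, the loop repeats until convergence.
            joint_action = [agent.act(agent.observe(env)) for agent in agents]
            state, reward, done = env.step(joint_action)
            if done or central_server.converged():
                break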
[0030] There are several advantages of the proposed technique in MARL scenarios. For example, it can handle different computational powers and is scalable with the number of agents. These techniques are therefore applicable to more realistic scenarios, such as where different cells have different computational powers and where there are collaboration needs between cells. Additionally, the proposed techniques can be easily adapted to many types of automation systems; multi-agent scenarios exist in many automation systems, including vehicular computing, unmanned aerial vehicles (UAVs) for disaster environments, autonomous cars, and 5G MIMO environments.
[0031] FIG. 1 illustrates the working of the QMIX algorithm described in reference [2]. In a multi-agent scenario, all the agents 102a-n collaborate towards achieving a common goal. For example, there are N agents 102a-n in the system, and they want to achieve a common target. QMIX may be used as the underlying collaborative mechanism.
[0032] In this case, assume the Q-values are calculated by the individual agents 102a-n and are sent to the mixing agent 104. The mixing agent then sends the total or combined Q-value (e.g., a combined Q-value of all the agents) to the respective agents. The agents are then updated with the combined Q-value generated by the mixing agent 104 instead of their local Q-values. In this way, the information of each agent is passed to the other agents to encourage collaboration.
[0033] All the agents train on the total Q-value Q_tot, which is obtained by passing the individual agent Q-values through the mixing network.
[0034] However, if the computational power of the individual agents varies, it can result in inaccurate local Q-values for the agents, which in turn affects Q_tot. To prevent inaccurate Q-values from corrupting the training, the Q-value of each agent may be weighted based on the computational power available, and the weighted Q-values are used to arrive at Q_tot.
[0035] FIG. 2 is a block diagram of the system, according to some embodiments. In some embodiments, in order to arrive at weights to scale the Q-values, a first machine learning model 212 (e.g., a neural network) is used to predict the weights. Since the computational power of the devices can be different, the weights may be relative to each other. Hence, a neural network may be used to predict the weights of the individual agents so that the weights can be used while computing Q_tot.
[0036] According to some embodiments, the local computational power C_t^i and the global state S_t of the environment at time instant t are used to compute the weights for the individual agents. Since the agent computational powers can be different, the computational values may be used to arrive at the weights. Also, since the agent performance can be measured through the state of the environment, the state of the environment may be utilized to arrive at the weights. Hence, both the local computational power and the state of the environment may need to be used in order to compute the weights of the individual agents.
[0037] In some embodiments, the choice of weights depends on the computational power available at the devices. However, a poor choice of weights can lead to poor collaboration. The measure of collaboration can be seen through the state of the environment, i.e., the global state. In some embodiments, if a choice of weights does not affect the collaboration between agents, then the weights should not be changed, i.e., the previous weights may be used. Accordingly, both parameters, i.e., the computational power available (percentage) and the state of the environment, may be used to predict the weights.
[0038] Block 202 corresponds to the Utility Networks, which correspond to the individual Q-value networks. These networks take individual observations as input and output the Q-value for a particular agent. The network chosen here is a gated recurrent unit (GRU) network, which incorporates the past history of observations.
[0039] Block 204 corresponds to the Mixing Network, which comprises a second machine learning model. In the case of a QMIX method, the mixing network takes all individual Q-values and combines them. In some embodiments, the mixing network is a fully connected network. The weights of this network may be calculated by using the mixing hypernetwork 206.
[0040] Block 212 corresponds to the Weight Prediction Network, which comprises a first machine learning model. The first machine learning model, which may be a fully connected network, uses the computational power of the agents to estimate the weights. However, in some embodiments, additional information about the global state may be required. The global state contains information on the collective performance of all the agents. The global state is a vector of a different size than the computational power vector. Where the computational power and the global state are on different scales, they may not be combined in the same network. In some embodiments, the weights need to be positive, and the network cannot be trained directly. Accordingly, a weight hypernetwork 210 may be used.
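A compact sketch of blocks 202, 204, 206, 210, and 212 in PyTorch is given below. All layer sizes are assumptions, the hypernetworks generate the parameters of the downstream networks from the global state, and a softplus keeps the predicted weights positive; this is one plausible realization, not the only one consistent with the description.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    N_AGENTS, OBS_DIM, N_ACTIONS, STATE_DIM, HID = 3, 32, 9, 48, 64  # assumed sizes

    class UtilityNetwork(nn.Module):
        # Block 202: per-agent GRU Q-value network over local observations.
        def __init__(self):
            super().__init__()
            self.gru = nn.GRUCell(OBS_DIM, HID)
            self.q_head = nn.Linear(HID, N_ACTIONS)
        def forward(self, obs, hidden):
            hidden = self.gru(obs, hidden)        # carries the history of observations
            return self.q_head(hidden), hidden

    class WeightPrediction(nn.Module):
        # Blocks 210 + 212: the weight hypernetwork maps the global state to the
        # parameters of the prediction network, which is applied to the
        # computational-power vector; softplus keeps the weights w_t^i positive.
        def __init__(self):
            super().__init__()
            self.hyper_w = nn.Linear(STATE_DIM, N_AGENTS * N_AGENTS)
            self.hyper_b = nn.Linear(STATE_DIM, N_AGENTS)
        def forward(self, comp_power, state):     # comp_power: [B, N], state: [B, S]
            W = self.hyper_w(state).view(-1, N_AGENTS, N_AGENTS)
            b = self.hyper_b(state)
            return F.softplus(torch.bmm(W, comp_power.unsqueeze(-1)).squeeze(-1) + b)

    class MixingNetwork(nn.Module):
        # Blocks 204 + 206: QMIX-style monotonic mixing of the weighted Q-values,
        # with its layer parameters generated by the mixing hypernetwork.
        def __init__(self):
            super().__init__()
            self.hyper_w1 = nn.Linear(STATE_DIM, N_AGENTS * HID)
            self.hyper_b1 = nn.Linear(STATE_DIM, HID)
            self.hyper_w2 = nn.Linear(STATE_DIM, HID)
            self.hyper_b2 = nn.Linear(STATE_DIM, 1)
        def forward(self, weighted_q, state):     # weighted_q: [B, N] = w_t^i * Q_t^i
            w1 = torch.abs(self.hyper_w1(state)).view(-1, N_AGENTS, HID)
            h = F.elu(torch.bmm(weighted_q.unsqueeze(1), w1).squeeze(1) + self.hyper_b1(state))
            w2 = torch.abs(self.hyper_w2(state)).unsqueeze(-1)
            return torch.bmm(h.unsqueeze(1), w2).squeeze(-1) + self.hyper_b2(state)

The weighted Q-values w_t^i * Q_t^i produced with WeightPrediction would then be passed to MixingNetwork together with the global state to obtain Q_tot.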
[0041] As shown in FIG. 2, t is the time instant of training, S_t is the state of the environment at time instant t, τ_t is the history of past actions and observations until time instant t, O_t^i is the observation of agent i at time instant t, Q_t^i is the Q-value obtained for a specific agent i at time instant t, C_t^i is the computational power available for agent i at time instant t, and w_t^i is the weight chosen for agent i at time instant t.
[0042] The computational power of all N agents 202 is sent to the Weight Prediction Network 212 (the network, parametrized by θ_w, which computes the weights). The Weight Hypernetwork 210 (the hypernetwork, parametrized by θ_h) is used to arrive at the weights (w).
[0043] According to some embodiments, the Combined Loss function 208 used to train θ_w and θ_h is shown in equation (1) below.
[Equation (1) appears as an image in the original publication and is not reproduced in the extracted text; a plausible reconstruction is sketched below.]
[0044] In equation (1) above, g(.) is the function of the mixing network (parametrized by θ_tot), which is in turn parametrized by the weight prediction network θ_w.
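The exact form of equation (1) cannot be recovered from the text-only extraction. Under the assumption that it is a QMIX-style temporal-difference loss over the weighted mixture, consistent with paragraphs [0042]-[0045], one plausible form is:

    \mathcal{L}(\theta_w, \theta_h) = \sum_{b=1}^{B} \Big( y_b^{tot} - g\big(w_t^1 Q_t^1, \ldots, w_t^N Q_t^N, S_t ; \theta_{tot}\big) \Big)^2,
    \qquad w_t^i = f\big(C_t^1, \ldots, C_t^N, S_t ; \theta_w, \theta_h\big),

where y_b^{tot} is the usual bootstrapped target (reward plus the discounted maximum of the mixed value at the next time step) and f(.) denotes the weight prediction network whose parameters are generated by the weight hypernetwork.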
[0045] There are two unknown parameter sets in equation (1) above: (i) θ_w and (ii) θ_h. Since the parameters are interdependent, i.e., θ_h depends on θ_w and vice versa, they need to be solved in an iterative fashion. For every B samples, first the hypernetwork θ_h is updated for a fixed θ_w, and then θ_w is updated for the computed θ_h. So, at each step, both θ_h and θ_w are updated iteratively.
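A minimal sketch of this alternating update in PyTorch is shown below; combined_loss is assumed to compute the scalar loss of equation (1) from a batch, and the two optimizers each own one parameter group.

    import torch

    def alternating_update(theta_w_params, theta_h_params, combined_loss, batches):
        # Alternately update the weight hypernetwork (theta_h) with theta_w fixed,
        # then theta_w with the freshly computed theta_h, once per batch of B samples.
        opt_h = torch.optim.Adam(theta_h_params, lr=1e-3)
        opt_w = torch.optim.Adam(theta_w_params, lr=1e-3)
        for batch in batches:
            # (1) update theta_h for fixed theta_w
            opt_h.zero_grad()
            combined_loss(batch).backward()
            opt_h.step()
            # (2) update theta_w for the computed theta_h
            opt_w.zero_grad()
            combined_loss(batch).backward()
            opt_w.step()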
[0046] Gaming Example
[0047] FIG. 3 illustrates results of a testing environment, according to some embodiments. The approach was tested on a StarCraft Multi-Agent Challenge (SMAC) environment with three agents running on (i) a 16-core GPU machine with 64 GB RAM per core, (ii) an 8-core CPU machine with 16 GB RAM, and (iii) a Raspberry Pi (RPi) with 1 GB RAM.
[0048] SMAC simulates the battle scenarios of the popular real-time strategy game StarCraft II. SMAC focuses on the decentralized micromanagement challenge. In this challenge, a team of units, each controlled by an agent, observes other units within a fixed radius and takes actions based on its local observations. These agents are trained to solve challenging combat scenarios known as maps. In the StarCraft game, there are two groups of units: (i) the allied group and (ii) the enemy group. [0049] In this experiment, the units of the allied group were controlled by the decentralized agents trained using the techniques described herein and using existing techniques, and the enemies were controlled by the built-in StarCraft artificial intelligence (AI). Line 302 shows the results (% of wins) of the techniques disclosed herein. Line 304 shows the results of an iterative penalization technique. Line 306 shows the results of QMIX. Line 308 shows the results of IQL.
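For reference, a minimal random-action rollout using the interface of the open-source SMAC package looks roughly as follows; the map name and package version are assumptions, and the trained utility networks would replace the random action choice.

    from smac.env import StarCraft2Env   # open-source SMAC package
    import numpy as np

    env = StarCraft2Env(map_name="2s3z")  # map name assumed; the description calls it 2s_3z
    info = env.get_env_info()
    n_agents, n_actions = info["n_agents"], info["n_actions"]

    env.reset()
    terminated, episode_reward = False, 0.0
    while not terminated:
        obs = env.get_obs()       # per-agent local observations (fixed sight radius)
        state = env.get_state()   # global state, used only during centralized training
        actions = []
        for agent_id in range(n_agents):
            avail = np.nonzero(env.get_avail_agent_actions(agent_id))[0]
            actions.append(np.random.choice(avail))   # random policy, for illustration
        reward, terminated, _ = env.step(actions)
        episode_reward += reward
    env.close()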
[0050] The placement of these units changes across episodes. At the start of each episode, the enemy units attack the allies, and the goal is to kill all the enemies as quickly as possible. The results are compared on one set of maps, which comprises 2 Stalkers and 3 Zealots (2s_3z).
[0051] To evaluate the performance of the agents, training was paused every 100 episodes and testing was performed for 20 episodes. The plot of the percentage (%) of winning test episodes is shown in FIG. 3.
[0052] From the plot it can be seen that the IQL approach 308 failed badly, since two of the agents are stochastic and this resulted in poor collaboration between the agents. Although QMIX 306 gave good performance compared with IQL, it also settled around a 40% winning rate, which is also not enough. Both cases fail as they look too far into the future with the choice of a high discount factor, which leads to poor planning.
[0053] The results of the proposed approach herein 302 and simple penalization 304 show an improvement in collaboration through predicting the weighting factors for individual agents. The simple penalization method 304 improves the collaboration to a value of 60%, whereas the proposed techniques disclosed herein 302 improve collaboration further, achieving a top value of 95%.
[0054] From the results shown in FIG. 3, it is evident that the proposed approach 302 performs better than the existing methods, which shows the efficacy of the proposed approach. One reason is that the proposed approach dynamically chooses the best weights based on the conditions, whereas the existing methods use no weights.
[0055] Telecommunications Example
[0056] FIG. 4 is a signalling diagram, according to some embodiments. FIG. 4 shows signalling in a telecommunications network among a first cell 402a, a second cell 402b, a third cell 402c, a central server 404, and an environment 410.
[0057] At 401, the first cell sends its Q-values and the computational power available at time instant t towards the central server. Similarly, at 403 and 405, the second cell and the third cell send their respective Q-values and computational powers available at time instant t towards the central server.
[0058] At 407, the environment transmits a global state from the environment towards the central server. The global state may include a set of joint actions taken and a corresponding global value or reward.
[0059] At 409, the central server estimates the weights using the first machine learning model, e.g. weight prediction network 212.
[0060] At 411, the central server computes the Q_tot value using a second machine learning model, e.g., the mixing network 204, and computes back-propagation gradients, e.g., updated global values.
[0061] At 413, 415, and 417, the central server sends the global values back to cell 1, cell 2, and cell 3, respectively.
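For clarity, the per-time-step message contents exchanged in FIG. 4 can be summarized as follows; the field and class names are illustrative assumptions, not terms used in the disclosure.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class CellReport:                 # messages 401, 403, 405: cell -> central server
        cell_id: int
        q_values: List[float]         # local Q-values at time instant t
        comp_power: float             # computational power available at time instant t

    @dataclass
    class EnvironmentReport:          # message 407: environment -> central server
        global_state: List[float]     # may include the joint actions taken and the global reward

    @dataclass
    class ServerUpdate:               # messages 413, 415, 417: central server -> cell
        cell_id: int
        updated_global_value: float   # Q_tot-based update / back-propagation gradient for the cell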
[0062] At 419, the process is repeated until all the networks converge. [0063] In one embodiment, the technique proposed herein may be used to address a telecommunications antenna tilting problem in 5G. The problem relates to how to tilt antennas at specific angles in order to maximize the QoE for the customers and minimize the interference. Since there can be many antennas covering a particular area, these antennas should collaborate with each other to provide good QoE for all the users in the network. This problem can be formulated as a MARL problem, with the actions being antenna tilts and the environmental state being the QoE for each user present in the area. In this example, each agent may be present in a different gNodeB or base station in the network. In addition to this MARL model, there are many machine learning (ML) models running in parallel in the different cells. Hence the computational power available for the MARL model can differ at any given time.
[0064] In general, agents may be trained in an online fashion because offline training can be very difficult and can lead to high-variance Q-values. For example, the error in Q-values may be three times greater in offline training when compared with online training. Accordingly, an online training approach is used to train these agents. Observations are collected across agents, and training is performed at the central agent by sending the respective Q-values. However, since the computational resources can change at any time due to other ML models running, the weights with which the Q-values are to be mixed must be estimated. In order to estimate the weights, the computational power is sent to the central server, which in turn estimates the weights with which the individual Q-values are mixed. Finally, the central server computes Q_tot, which is sent to the local agents. The local agents are updated with Q_tot, which results in improved collaboration.
[0065] Accordingly, using the techniques disclosed herein, different computational resources can be handled in different cells while training a collaborative MARL model.
[0066] In addition, different applications of the proposed approach may be used in the telecom environment, such as for the channel prediction problem, the optimal handover problem, etc.
[0067] For example, the techniques disclosed herein can be used for load balancing among gNodeBs. In general, gNodeBs may have to collaborate among themselves to create a smooth and optimal experience for the users. In a case where one of the gNodeBs cannot take too much load, the other nearby gNodeBs have to take this load. In order to create optimal collaborations between the gNodeBs, one may consider various factors such as the mobility patterns of the users, the available capacities of the gNodeBs, etc. [0068] In another example, the techniques disclosed herein can be used for resource allocation for Internet of Things (IoT) applications. To support popular IoT applications, edge computing provides a front-end distributed computing paradigm that complements centralized cloud computing with low latency. The network has to allocate the optimal resources to these devices based on their usage and criticality. Hence, the proposed techniques herein for MARL may be used to allocate resources to these devices.
[0069] FIG. 5 illustrates a method, according to some embodiments. In some embodiments, method 500 is a computer-implemented method for training a plurality of computing agents.
[0070] Step 501 includes obtaining, from a first computing agent operating in an environment, a first value of a first action and a first computational metric of the first computing agent.
[0071] Step 503 includes obtaining, from a second computing agent different than the first computing agent operating in the environment, a second value of a second action and a second computational metric of the second computing agent.
[0072] Step 505 includes obtaining a global state of the environment.
[0073] Step 507 includes updating, using a first machine learning model and a second machine learning model, a global value of a joint action based on the first value of the first action, the first computational metric, the second value of the second action, the second computational metric, and the state of the environment.
[0074] Step 509 includes transmitting the updated global value of the joint action towards the first computing agent.
[0075] Step 511 includes transmitting the updated global value of the joint action towards the second computing agent.
[0076] In some embodiments, the method may further include determining, using the first machine learning model, a first weight for the first value of the first action based on the first computational metric and the global state of the environment. The method may further include determining, using the first machine learning model, a second weight for the second value of the second action based on the second computational metric and the global state of the environment. In some embodiments, the method further includes updating, using the second machine learning model, the global value of the joint action based on the determined first weight and the determined second weight. [0077] In some embodiments, the first machine learning model is a first neural network and the second machine learning model is a second neural network different than the first neural network. The first neural network may correspond to weight prediction network 212 and the second neural network may correspond to mixing network 204.
[0078] In some embodiments, the first computational metric has a value that is different than a value of the second computational metric. In some embodiments, the first computational metric and the second computational metric comprise one or more of: a number of iterations or epochs trained, a measure of processing resources, a measure of available memory, or other relevant parameters.
[0079] In some embodiments, the updating in step 507 comprises using at least one of: a value decomposition method, a counterfactual multi-agent method, or a combination thereof.
[0080] In some embodiments, the first value, the second value, and the global value are determined using reinforcement learning (RL) models. In some embodiments, the first value is a first Q-value corresponding to the first action and the second value is a second Q-value corresponding to the second action.
[0081] In some embodiments, the joint action in method 500 comprises at least one of the first action and the second action.
[0082] FIG. 6 is a block diagram of a computing device 600 according to some embodiments. In some embodiments, computing device 600 may comprise one or more of the components of the central server as described above. As shown in FIG. 6, the device may comprise: processing circuitry (PC) 602, which may include one or more processors (P) 655 (e.g., one or more general purpose microprocessors and/or one or more other processors, such as an application specific integrated circuit (ASIC), field-programmable gate arrays (FPGAs), and the like); communication circuitry 648, comprising a transmitter (Tx) 645 and a receiver (Rx) 647 for enabling the device to transmit data and receive data (e.g., wirelessly transmit/receive data); and a local storage unit (a.k.a., “data storage system”) 608, which may include one or more non-volatile storage devices and/or one or more volatile storage devices. In embodiments where PC 602 includes a programmable processor, a computer program product (CPP) 641 may be provided. CPP 641 includes a computer readable medium (CRM) 642 storing a computer program (CP) 643 comprising computer readable instructions (CRI) 644. CRM 642 may be a non-transitory computer readable medium, such as, magnetic media (e.g., a hard disk), optical media, memory devices (e.g., random access memory, flash memory), and the like. In some embodiments, the CRI 644 of computer program 643 is configured such that when executed by PC 602, the CRI causes the apparatus to perform steps described herein (e.g., steps described herein with reference to the flow charts). In other embodiments, the apparatus may be configured to perform steps described herein without the need for code. That is, for example, PC 602 may consist merely of one or more ASICs. Hence, the features of the embodiments described herein may be implemented in hardware and/or software.
[0083] While various embodiments are described herein, it should be understood that they have been presented by way of example only, and not limitation. Thus, the breadth and scope of this disclosure should not be limited by any of the above described embodiments. Moreover, any combination of the above-described elements in all possible variations thereof is encompassed by the disclosure unless otherwise indicated herein or otherwise clearly contradicted by context.
[0084] Additionally, while the processes described above and illustrated in the drawings are shown as a sequence of steps, this was done solely for the sake of illustration. Accordingly, it is contemplated that some steps may be added, some steps may be omitted, the order of the steps may be re-arranged, and some steps may be performed in parallel.
[0085] ABBREVIATIONS
[0086] SLA Service Level Agreement
[0087] QoE Quality of Experience
[0088] VDN Value Decomposition Network
[0089] COMA Counterfactual multi-agent policy gradient
[0090] MAVEN Multi-agent Variational Exploration
[0091] MARL Multi-agent Reinforcement Learning
[0092] REFERENCES
[0093] [1] Sunehag, Peter, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius Zambaldi, Max Jaderberg, Marc Lanctot et al. "Value-decomposition networks for cooperative multi-agent learning." arXiv preprint arXiv:1706.05296 (2017).
[0094] [2] Rashid, Tabish, Mikayel Samvelyan, Christian Schroeder, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. "QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning." In International Conference on Machine Learning, pp. 4295-4304. PMLR, 2018.
[0095] [3] Mahajan, Anuj, Tabish Rashid, Mikayel Samvelyan, and Shimon Whiteson. "MAVEN: Multi-agent variational exploration." arXiv preprint arXiv:1910.07483 (2019).
[0096] [4] Aotani, Takumi, Taisuke Kobayashi, and Kenji Sugimoto. "Bottom-up multi-agent reinforcement learning for selective cooperation." In 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 3590-3595. IEEE, 2018.

Claims

1. A computer-implemented method (500) for training a plurality of computing agents, the method comprising: obtaining (501), from a first computing agent (102, 202, 402) operating in an environment, a first value of a first action and a first computational metric of the first computing agent; obtaining (503), from a second computing agent (102, 202, 402) different than the first computing agent operating in the environment, a second value of a second action and a second computational metric of the second computing agent; obtaining (505) a global state of the environment; updating (507), using a first machine learning model (212) and a second machine learning model (204), a global value of a joint action based on the first value of the first action, the first computational metric, the second value of the second action, the second computational metric, and the state of the environment; transmitting (509) the updated global value of the joint action towards the first computing agent; and transmitting (511) the updated global value of the joint action towards the second computing agent.
2. The method of claim 1, further comprising: determining, using the first machine learning model (212), a first weight for the first value of the first action based on the first computational metric and the global state of the environment; and determining, using the first machine learning model, a second weight for the second value of the second action based on the second computational metric and the global state of the environment.
3. The method of claim 2, further comprising: updating, using the second machine learning model (204), the global value of the joint action based on the determined first weight and the determined second weight.
4. The method of any one of claims 2-3, wherein the first machine learning model is a first neural network and the second machine learning model is a second neural network different than the first neural network.
5. The method of any one of claims 1-4, wherein the first computational metric has a value that is different than a value of the second computational metric.
6. The method of any one of claims 1-5, wherein the first computational metric and the second computational metric comprise one or more of: a number of iterations or epochs trained, a measure of processing resources, a measure of available memory, or other relevant parameters.
7. The method of any one of claims 1-6, wherein the updating comprises using at least one of: a value decomposition method, a counterfactual multi-agent method, or a combination thereof.
8. The method of any one of claims 1-7, wherein the environment comprises a communications network, the first computing agent is a first network node (402), and the second computing agent is a second network node (402).
9. The method of claim 8, wherein the state of the environment comprises a quality of experience of one or more user devices in the communications network and a configuration of one or more network resources in the communications network.
10. The method of claim 9, wherein the configuration of one or more network resources comprises a measure of a tilt of one or more antennas.
11. The method of claim 8, wherein the state of the environment comprises one or more of: a mobility pattern of one or more user devices, a capability of the first network node or the second network node, or a balance of load across the first network node and the second network node.
12. The method of any one of claims 1-7, wherein the environment comprises a distributed computing architecture, the first computing agent is a first device, and the second computing agent is a second device.
13. The method of claim 12, wherein the state of the environment comprises one or more of: a usage of the first device or the second device, or one or more resources allocated to the first device or the second device.
14. The method of any one of claims 1-7, wherein the environment comprises a computer game, the first computing agent is a first actor in the computer game, and the second computing agent is a second actor in the computer game.
15. The method of any one of claims 1-14, wherein the first computing agent is associated with a first device, the second computing agent is associated with a second device, and the first device is different than the second device.
16. The method of any one of claims 1-15, wherein the first value, the second value, and the global value are determined using reinforcement learning (RL) models.
17. The method of any one of claims 1-16, wherein the first value is a first Q-value corresponding to the first action and the second value is a second Q-value corresponding to the second action.
18. The method of any one of claims 1-17, wherein the first action and the second action comprise one or more of: tilting an antenna, offloading network traffic, allocating one or more resources in a network node, or configuring one or more resources in a network node.
19. The method of any one of claims 1-17, wherein the first action and the second action comprise a move by one or more players in a computer game.
20. The method of any one of claims 1-19, wherein the joint action comprises at least one of the first action and the second action.
21. A device (600) with processing circuitry (602), wherein the processing circuitry is adapted to: obtain (501), from a first computing agent operating in an environment, a first value of a first action and a first computational metric of the first computing agent; obtain (503), from a second computing agent different than the first computing agent operating in the environment, a second value of a second action and a second computational metric of the second computing agent; obtain (505) a global state of the environment; update (507), using a first machine learning model and a second machine learning model, a global value of a joint action based on the first value of the first action, the first computational metric, the second value of the second action, the second computational metric, and the state of the environment; transmit (509) the updated global value of the joint action towards the first computing agent; and transmit (511) the updated global value of the joint action towards the second computing agent.
22. The device of claim 21, wherein the processing circuitry is further configured to perform any one of the methods of claims 1-20.
23. A computer program (643) comprising instructions (644) which, when executed by processing circuitry (602) of a device (600), cause the device to perform the method of any one of claims 1-20.
24. A carrier containing the computer program of claim 23, wherein the carrier is one of an electronic signal, an optical signal, a radio signal, and a computer readable storage medium.