WO2023161947A1 - Handling heterogeneous computation in multi-agent reinforcement learning - Google Patents

Handling heterogeneous computation in multi-agent reinforcement learning

Info

Publication number
WO2023161947A1
Authority
WO
WIPO (PCT)
Application number
PCT/IN2022/050162
Other languages
French (fr)
Inventor
Saravanan M
Perepu SATHEESH KUMAR
Original Assignee
Telefonaktiebolaget Lm Ericsson (Publ)
Application filed by Telefonaktiebolaget Lm Ericsson (Publ) filed Critical Telefonaktiebolaget Lm Ericsson (Publ)
Priority to PCT/IN2022/050162 priority Critical patent/WO2023161947A1/en
Publication of WO2023161947A1 publication Critical patent/WO2023161947A1/en

Classifications

    • G PHYSICS
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06F ELECTRIC DIGITAL DATA PROCESSING
                • G06F 8/00 Arrangements for software engineering
                    • G06F 8/60 Software deployment
                        • G06F 8/65 Updates
            • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
                • G06N 3/00 Computing arrangements based on biological models
                    • G06N 3/004 Artificial life, i.e. computing arrangements simulating life
                        • G06N 3/006 Artificial life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
                    • G06N 3/02 Neural networks
                        • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
                            • G06N 3/063 Physical realisation using electronic means
                        • G06N 3/08 Learning methods
                            • G06N 3/092 Reinforcement learning

Abstract

A method is provided that includes obtaining, from a first computing agent operating in an environment, a first value of a first action and a first computational metric of the first computing agent and obtaining, from a second computing agent operating in the environment, a second value of a second action and a second computational metric of the second computing agent. The method includes obtaining a global state of the environment. The method includes updating, using a first machine learning model and a second machine learning model, a global value of a joint action based on the first value of the first action, the first computational metric, the second value of the second action, the second computational metric, and the state of the environment. The method includes transmitting the updated global value of the joint action towards the first computing agent and the second computing agent.

Description

HANDLING HETEROGENEOUS COMPUTATION IN MULTI- AGENT REINFORCEMENT LEARNING
TECHNICAL FIELD
[001] Disclosed are embodiments related to handling heterogeneous computation in multi-agent reinforcement learning environments.
INTRODUCTION
[002] Advances in reinforcement learning (RL) have recorded sublime success in various domains. Although its single-agent counterpart has overshadowed the multi-agent domain during this progress, multi-agent reinforcement learning is gaining rapid traction, and its latest accomplishments address problems with real-world complexity. Traditional RL techniques aim to maximize some notion of long-term reward. The goal of RL is to find a policy that maps states of the world to the actions executed by the agent. The key assumption for RL is that the reward function being maximized is accessible to the agent.
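For reference, the Q-values referred to throughout this disclosure are action values of the standard temporal-difference form; a minimal example (textbook Q-learning, not specific to this disclosure) is the update

    Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right],

where \alpha is the learning rate and \gamma is the discount factor.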
[003] Cognitive networks may enable zero-touch deployment and operation, as well as continuous real-time performance improvements. The network does this with minimum human intervention by self-learning at scale from its experience and its interactions with the environment. In such self-learning systems, multiple agents may be observed individually and a reward function for each agent is learned independently from the other agents.
[004] This is not always a good solution, however, as the optimal behavior of one agent may be suboptimal for another. Another challenge is that one may never be able to observe the actions of the centralized controller directly but only the actions of the individual agents. Finally, considering the cross product of the state and action spaces of the individual agents can lead to a prohibitively large space. Moreover, it is challenging for multiple agents to learn policies effectively, because the policies of other agents are also part of the environment from the perspective of an individual agent. The multi-agent environment is non-stationary; thus, single-agent algorithms cannot be directly applied. In a multi-agent scenario, different agents participate together and work either collaboratively or competitively.
[005] Decentralized training or centralized training may be used to train these agents. In many cases, where there is a large number of agents, one may use decentralized training methods such as Independent Q-Learning (IQL) or centralized-training, decentralized-execution methods such as QMIX.
[006] Sunehag et al. [1] propose a value decomposition network (VDN) method, which updates the Q-networks of individual agents based on the sum of all the agents' value functions, using centralized training and decentralized execution. However, such methods may not consider the extra state information of the environment and cannot be applied to all general MARL problems, particularly where the joint Q function is not a linear function of the individual Q functions. To address this problem, Rashid et al. [2] proposed the QMIX method, which lies between the extremes of Value Decomposition Networks (VDN) and Counterfactual Multi-Agent (COMA) policies. The proposed approach uses a mixing network that combines the individual agents' value functions, which is then used to obtain the global Q value. Further, the agents' value functions are trained based on the global Q value, and the mixing network is trained by conditioning it on the state of the environment. In this way, the state of the environment can be included in the training of the individual agents, and the agents are trained based on other agents' performances. However, the issue with this approach is that the global Q value must increase monotonically with respect to each individual agent's value function. To handle cases where this monotonicity assumption does not hold, Mahajan et al. [3] proposed a method known as MAVEN, which trains decentralized policies for agents that condition their behavior on a shared latent variable controlled by a hierarchical policy.
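For context, the standard formulations from references [1] and [2] can be written as follows (restated here in conventional notation; τ^i denotes agent i's action-observation history and u^i its action):

    Q_{tot}^{VDN}(\boldsymbol{\tau}, \mathbf{u}) = \sum_{i=1}^{N} Q_i(\tau^i, u^i),
    \qquad \text{QMIX:}\quad \frac{\partial Q_{tot}(\boldsymbol{\tau}, \mathbf{u}, s)}{\partial Q_i(\tau^i, u^i)} \geq 0 \quad \forall i,

i.e., VDN factorizes the joint action value as a sum of per-agent utilities, whereas QMIX only requires the learned mixing to be monotonic in each agent's utility.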
SUMMARY
[007] One problem with the above methods is that they generally assume the agents have similar computational power. For example, in QMIX, the individual Q-values of the agents are mixed through a mixing network to arrive at Q_tot. The individual agents are then trained with Q_tot instead of their local Q-values.
[008] Accordingly, the performance of each agent depends on Q_tot. If one agent's Q-value is not accurate, the inaccurate Q-value will spoil the training of all the agents. All the agents' Q-values are typically assumed to have the same accuracy in order to compute Q_tot using the mixing function.
[009] In [4] the authors proposed a bottom-up MARL approach where other agents' rewards are predicted by another agent and used to calculate the global reward. Here also, however, the agents are assumed to have the same computational power. [0010] In many cases, though, different agents have different computational power, and because the Q-values are computed at the local agent level, an agent's computational power can affect the accuracy of its local Q-values. Accordingly, if different agents' Q-values are used blindly, the result can be inaccurate training and poor collaboration. Hence, aspects of the present disclosure describe techniques for handling heterogeneous computation in different agents and training the agents while taking the computational power of the agents into account.
[0011] Embodiments of the present disclosure address situations where different agents have different computational powers. Depending on the agent computational power, the agent Q-values will be weighted and the Q_tot value will be computed using, for example, VDN and/or QMIX. According to some embodiments, a deep neural network is used to obtain the weighting factors, and the weights are used to scale the respective individual Q-values. According to some embodiments, a hypernetwork-based approach is used to train the network using the global state of the environment.
[0012] In some embodiments, the techniques disclosed herein arrive at a solution in multi-agent reinforcement learning (MARL) where different agents have different computational powers. A hypernetwork-based approach may be used to predict the weights that can handle the different computational powers. The computational power may be measured by different metrics, for example the number of iterations or epochs trained in each agent, the number of cores available to train the model, or the available memory.
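As a purely illustrative baseline (not the learned approach of this disclosure), such metrics could be normalized directly into weights; the metric names and values below are hypothetical:

    import numpy as np

    def naive_weights_from_metrics(metrics):
        # Normalize per-agent computational metrics into weights that sum to one.
        # The keys 'epochs_trained', 'num_cores', and 'memory_gb' are assumed for
        # illustration; the disclosure instead learns the weights with a neural
        # network conditioned on the metrics and on the global state.
        scores = np.array(
            [m["epochs_trained"] * m["num_cores"] * m["memory_gb"] for m in metrics],
            dtype=float,
        )
        return scores / scores.sum()

    # Example: a GPU server, a CPU server, and a Raspberry-Pi-class device.
    print(naive_weights_from_metrics([
        {"epochs_trained": 100, "num_cores": 16, "memory_gb": 64},
        {"epochs_trained": 100, "num_cores": 8, "memory_gb": 16},
        {"epochs_trained": 20, "num_cores": 4, "memory_gb": 1},
    ]))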
[0013] According to one aspect, a computer-implemented method for training a plurality of computing agents is provided. The method includes obtaining, from a first computing agent operating in an environment, a first value of a first action and a first computational metric of the first computing agent. The method includes obtaining, from a second computing agent different than the first computing agent operating in the environment, a second value of a second action and a second computational metric of the second computing agent. The method includes obtaining a global state of the environment. The method includes updating, using a first machine learning model and a second machine learning model, a global value of a joint action based on the first value of the first action, the first computational metric, the second value of the second action, the second computational metric, and the state of the environment. The method includes transmitting the updated global value of the joint action towards the first computing agent. The method includes transmitting the updated global value of the joint action towards the second computing agent.
[0014] In another aspect there is provided a device with processing circuitry adapted to perform the method. In another aspect there is provided a computer program comprising instructions which, when executed by processing circuitry of a device, cause the device to perform the method. In another aspect there is provided a carrier containing the computer program, where the carrier is one of an electronic signal, an optical signal, a radio signal, and a computer readable storage medium.
BRIEF DESCRIPTION OF THE DRAWINGS
[0015] The accompanying drawings, which are incorporated herein and form part of the specification, illustrate various embodiments.
[0016] FIG. 1 illustrates a QMIX algorithm.
[0017] FIG. 2 is a block diagram of a system architecture, according to some embodiments.
[0018] FIG. 3 illustrates results of a testing environment, according to some embodiments.
[0019] FIG. 4 is a signalling diagram, according to some embodiments.
[0020] FIG. 5 illustrates a method, according to some embodiments.
[0021] FIG. 6 is a block diagram of a device, according to some embodiments.
DETAILED DESCRIPTION
[0022] Embodiments of the present disclosure are directed to decentralized learning in MARL environments where the agents have different computational powers. Depending on the agent computational power, the agent Q-values will be weighted and the Q_tot value will be computed using one or more machine learning (ML) models described herein. Embodiments of the present disclosure may be used to address a variety of MARL problems arising in telecommunications applications where the agents have different computing powers, such as, for example, an antenna tilting problem to provide good quality of experience for different users while minimizing interference, a channel prediction problem, an optimal handover problem, a load balancing problem, or a resource allocation problem among Internet of Things (IoT) devices. In addition, embodiments of the present disclosure may be used to address MARL problems in multi-player games, such as battle strategies among multiple characters in a game. [0023] According to some embodiments, the below steps are performed to predict the weights of the agents (e.g., controllers in IoT devices, network nodes, gaming agents) and update the utility networks (e.g., allocation of resources, gaming strategy) of the agents.
[0024] First, compute the utility values of the agents at the edge level (e.g., where a computing resource interfaces with a communications network) and send the computed Q-values to the central server (e.g., a node in the communications network). In addition, the available computational power of each edge node is sent to the central server.
[0025] Second, a weight prediction network is used to compute the weights of the individual Q-values to arrive at Q_tot, corresponding to a collaborative or joint action of the agents (e.g., agent one tilts its antenna and agent two does not tilt its antenna).
[0026] Third, based on the computed Q_tot value, the back gradients (in the case of QMIX) are estimated and sent to the local edge nodes. The back gradients may include updated global values of a joint action for each agent.
[0027] Fourth, based on these gradients, the individual utility networks are updated.
[0028] Fifth, the joint action based on the updated utility networks is applied to the environment. [0029] Based on the reward obtained, steps one to five are repeated until convergence.
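A minimal sketch of steps one through five is shown below. All object and method names (Agent, CentralServer, env, and so on) are illustrative assumptions standing in for the components described above; the sketch shows the control flow only, not the claimed implementation.

    # Hypothetical control flow for steps one to five.
    def train_until_convergence(agents, central_server, env, max_rounds=10000):
        state = env.reset()                                  # global state S_t
        for _ in range(max_rounds):
            # Step 1: each edge agent computes its local Q-values and reports them,
            # together with its available computational power, to the central server.
            reports = [(agent.q_values(agent.observe(env)), agent.computational_power())
                       for agent in agents]
            # Step 2: the weight prediction network computes per-agent weights and the
            # mixing network combines the weighted Q-values into Q_tot.
            q_tot = central_server.mix(reports, state)
            # Step 3: back gradients (in the QMIX case) are estimated from Q_tot and
            # sent back to the local edge nodes.
            gradients = central_server.back_gradients(q_tot)
            # Step 4: each agent updates its utility network using its gradient.
            for agent, grad in zip(agents, gradients):
                agent.update_utility_network(grad)
            # Step 5: the joint action from the updated utility networks is applied to
            # the environment; based on the reward, the loop repeats until convergence.
            joint_action = [agent.act(agent.observe(env)) for agent in agents]
            state, reward, done = env.step(joint_action)
            if done or central_server.converged():
                break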
[0030] There are several advantages of the proposed technique in MARL scenarios. For example, it can handle different computational powers and is scalable with the number of agents. These techniques are therefore applicable to more realistic scenarios, such as where different cells have different computational powers and where there are collaboration needs between cells. Additionally, the proposed techniques can be easily adapted to many types of automation systems; multi-agent scenarios exist in many automation systems, including vehicular computing, unmanned aerial vehicles (UAVs) for disaster environments, autonomous cars, and 5G MIMO environments.
[0031] FIG. 1 illustrates the working of the QMIX algorithm described in reference [2]. In a multi-agent scenario, all the agents 102a-n collaborate towards achieving a common goal. For example, there are N agents 102a-n in the system, and they want to achieve a common target. QMIX may be used as the underlying collaborative mechanism.
[0032] In this case, assume the Q-values are calculated by the individual agents 102a-n and are sent to the mixing agent 104. The mixing agent then sends the total or combined Q-value (e.g., a combined Q-value of all the agents) to the respective agents. The agents are then updated with the combined Q-value generated by the mixing agent 104 instead of their local Q-values. In this way, the information of each agent is passed to the other agents to encourage collaboration.
[0033] All the agents train on the total Q-value Q_tot, which is obtained by passing the individual agent Q-values through the mixing network.
[0034] However, if the computational power of the individual agents varies, it can result in inaccurate local Q-values for the agents, which in turn affects Q_tot. To prevent inaccurate Q-values from corrupting the training, the Q-value of each agent may be weighted based on the computational power available, and the weighted Q-values are used to arrive at Q_tot.
[0035] FIG. 2 is a block diagram of the system, according to some embodiments. In some embodiments, in order to arrive at weights to scale the Q-values, a first machine learning model 212 (e.g., a neural network) is used to predict the weights. Since the computational power of the devices can be different, the weights may be relative to each other. Hence, a neural network may be used to predict the weights of the individual agents so that the weights can be used while computing Q_tot.
[0036] According to some embodiments, the local computational power C_t^i and the global state S_t of the environment at time instant t are used to compute the weights for the individual agents. Since the agent computational powers can be different, the computational values may be used to arrive at the weights. Also, since the agent performance can be measured through the state of the environment, the state of the environment may be utilized to arrive at the weights. Hence, both the local computational power and the state of the environment may need to be used in order to compute the weights of the individual agents.
[0037] In some embodiments, the choice of weights depends on the computational power available at the devices. However, a poor choice of weights can lead to poor collaboration. The measure of collaboration can be seen through the state of the environment, i.e., the global state. In some embodiments, if a choice of weights does not affect the collaboration between agents, then the weights should not be changed, i.e., the previous weights may be used. Accordingly, both parameters, i.e., the computational power available (percentage) and the state of the environment, may be used to predict the weights.
[0038] Block 202 corresponds to the Utility Networks, which correspond to the individual Q-value networks. These networks take individual observations as input and output the Q-value for a particular agent. The network chosen here is a gated recurrent unit (GRU) network, which incorporates the past history of observations.
[0039] Block 204 corresponds to the Mixing Network, which comprises a second machine learning model. In the case of a QMIX method, the mixing network takes all individual Q-values and combines them. In some embodiments, the mixing network is a fully connected network. The weights of this network may be calculated by using the mixing hypernetwork 206.
[0040] Block 212 corresponds to the Weight Prediction Network, which comprises a first machine learning model. The first machine learning model, which may be a fully connected network, uses the computational power of the agents to estimate the weights. However, in some embodiments, additional information about the global state may be required. The global state contains information on the collective performance of all the agents. The global state is a vector of a different size than the computational power vector. Where the computational power and the global state are on different scales, they may not be combined in the same network. In some embodiments, the weights need to be positive, and the network cannot be trained directly. Accordingly, a weight hypernetwork 210 may be used.
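A compact sketch of blocks 202, 204, 206, 210, and 212 in PyTorch is given below. All layer sizes are assumptions, the hypernetworks generate the parameters of the downstream networks from the global state, and a softplus keeps the predicted weights positive; this is one plausible realization, not the only one consistent with the description.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    N_AGENTS, OBS_DIM, N_ACTIONS, STATE_DIM, HID = 3, 32, 9, 48, 64  # assumed sizes

    class UtilityNetwork(nn.Module):
        # Block 202: per-agent GRU Q-value network over local observations.
        def __init__(self):
            super().__init__()
            self.gru = nn.GRUCell(OBS_DIM, HID)
            self.q_head = nn.Linear(HID, N_ACTIONS)
        def forward(self, obs, hidden):
            hidden = self.gru(obs, hidden)        # carries the history of observations
            return self.q_head(hidden), hidden

    class WeightPrediction(nn.Module):
        # Blocks 210 + 212: the weight hypernetwork maps the global state to the
        # parameters of the prediction network, which is applied to the
        # computational-power vector; softplus keeps the weights w_t^i positive.
        def __init__(self):
            super().__init__()
            self.hyper_w = nn.Linear(STATE_DIM, N_AGENTS * N_AGENTS)
            self.hyper_b = nn.Linear(STATE_DIM, N_AGENTS)
        def forward(self, comp_power, state):     # comp_power: [B, N], state: [B, S]
            W = self.hyper_w(state).view(-1, N_AGENTS, N_AGENTS)
            b = self.hyper_b(state)
            return F.softplus(torch.bmm(W, comp_power.unsqueeze(-1)).squeeze(-1) + b)

    class MixingNetwork(nn.Module):
        # Blocks 204 + 206: QMIX-style monotonic mixing of the weighted Q-values,
        # with its layer parameters generated by the mixing hypernetwork.
        def __init__(self):
            super().__init__()
            self.hyper_w1 = nn.Linear(STATE_DIM, N_AGENTS * HID)
            self.hyper_b1 = nn.Linear(STATE_DIM, HID)
            self.hyper_w2 = nn.Linear(STATE_DIM, HID)
            self.hyper_b2 = nn.Linear(STATE_DIM, 1)
        def forward(self, weighted_q, state):     # weighted_q: [B, N] = w_t^i * Q_t^i
            w1 = torch.abs(self.hyper_w1(state)).view(-1, N_AGENTS, HID)
            h = F.elu(torch.bmm(weighted_q.unsqueeze(1), w1).squeeze(1) + self.hyper_b1(state))
            w2 = torch.abs(self.hyper_w2(state)).unsqueeze(-1)
            return torch.bmm(h.unsqueeze(1), w2).squeeze(-1) + self.hyper_b2(state)

The weighted Q-values w_t^i * Q_t^i produced with WeightPrediction would then be passed to MixingNetwork together with the global state to obtain Q_tot.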
[0041] As shown in FIG. 2, t is the time instant of training, S_t is the state of the environment at time instant t, τ_t is the history of past actions and observations until time instant t, O_t^i is the observation of agent i at time instant t, Q_t^i is the Q-value obtained for a specific agent i at time instant t, C_t^i is the computational power available for agent i at time instant t, and w_t^i is the weight chosen for agent i at time instant t.
[0042] The computational power of all N agents 202 is sent to the Weight Prediction Network 212 (the network, parametrized by θ_w, which computes the weights). The Weight Hypernetwork 210 (the hypernetwork, parametrized by θ_h) is used to arrive at the weights (w).
[0043] According to some embodiments, the Combined Loss function 208 used to train θ_w and θ_h is shown in equation (1) below.
[Equation (1) appears as an image in the original publication and is not reproduced in the extracted text; a plausible reconstruction is sketched below.]
[0044] In equation (1) above, g(.) is the function of the mixing network (parametrized by θ_tot), which is in turn parametrized by the weight prediction network θ_w.
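The exact form of equation (1) cannot be recovered from the text-only extraction. Under the assumption that it is a QMIX-style temporal-difference loss over the weighted mixture, consistent with paragraphs [0042]-[0045], one plausible form is:

    \mathcal{L}(\theta_w, \theta_h) = \sum_{b=1}^{B} \Big( y_b^{tot} - g\big(w_t^1 Q_t^1, \ldots, w_t^N Q_t^N, S_t ; \theta_{tot}\big) \Big)^2,
    \qquad w_t^i = f\big(C_t^1, \ldots, C_t^N, S_t ; \theta_w, \theta_h\big),

where y_b^{tot} is the usual bootstrapped target (reward plus the discounted maximum of the mixed value at the next time step) and f(.) denotes the weight prediction network whose parameters are generated by the weight hypernetwork.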
[0045] There are two unknown parameter sets in equation (1) above: (i) θ_w and (ii) θ_h. Since the parameters are interdependent, i.e., θ_h depends on θ_w and vice versa, they need to be solved in an iterative fashion. For every B samples, first the hypernetwork θ_h is updated for a fixed θ_w, and then θ_w is updated for the computed θ_h. So, at each step, both θ_h and θ_w are updated iteratively.
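A minimal sketch of this alternating update in PyTorch is shown below; combined_loss is assumed to compute the scalar loss of equation (1) from a batch, and the two optimizers each own one parameter group.

    import torch

    def alternating_update(theta_w_params, theta_h_params, combined_loss, batches):
        # Alternately update the weight hypernetwork (theta_h) with theta_w fixed,
        # then theta_w with the freshly computed theta_h, once per batch of B samples.
        opt_h = torch.optim.Adam(theta_h_params, lr=1e-3)
        opt_w = torch.optim.Adam(theta_w_params, lr=1e-3)
        for batch in batches:
            # (1) update theta_h for fixed theta_w
            opt_h.zero_grad()
            combined_loss(batch).backward()
            opt_h.step()
            # (2) update theta_w for the computed theta_h
            opt_w.zero_grad()
            combined_loss(batch).backward()
            opt_w.step()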
[0046] Gaming Example
[0047] FIG. 3 illustrates results of a testing environment, according to some embodiments. The approach was tested on a StarCraft Multi-Agent Challenge (SMAC) environment with three agents running on (i) a 16-core GPU machine with 64 GB RAM per core, (ii) an 8-core CPU machine with 16 GB RAM, and (iii) a Raspberry Pi (RPi) with 1 GB RAM.
[0048] SMAC simulates the battle scenarios of the popular real-time strategy game StarCraft II. SMAC focuses on the decentralized micromanagement challenge. In this challenge, a team of units, each controlled by an agent, observes other units within a fixed radius and takes actions based on its local observations. These agents are trained to solve challenging combat scenarios known as maps. In the StarCraft game, there are two groups of units: (i) the allied group and (ii) the enemy group. [0049] In this experiment, the units of the allied group were controlled by the decentralized agents trained using the techniques described herein and using existing techniques, and the enemies were controlled by the built-in StarCraft artificial intelligence (AI). Line 302 shows the results (% of wins) of the techniques disclosed herein. Line 304 shows the results of an iterative penalization technique. Line 306 shows the results of QMIX. Line 308 shows the results of IQL.
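For reference, a minimal random-action rollout using the interface of the open-source SMAC package looks roughly as follows; the map name and package version are assumptions, and the trained utility networks would replace the random action choice.

    from smac.env import StarCraft2Env   # open-source SMAC package
    import numpy as np

    env = StarCraft2Env(map_name="2s3z")  # map name assumed; the description calls it 2s_3z
    info = env.get_env_info()
    n_agents, n_actions = info["n_agents"], info["n_actions"]

    env.reset()
    terminated, episode_reward = False, 0.0
    while not terminated:
        obs = env.get_obs()       # per-agent local observations (fixed sight radius)
        state = env.get_state()   # global state, used only during centralized training
        actions = []
        for agent_id in range(n_agents):
            avail = np.nonzero(env.get_avail_agent_actions(agent_id))[0]
            actions.append(np.random.choice(avail))   # random policy, for illustration
        reward, terminated, _ = env.step(actions)
        episode_reward += reward
    env.close()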
[0050] The placement of these units changes across episodes. At the start of each episode, the enemy units attack the allies, and the goal is to kill all the enemies as quickly as possible. The results are compared on one set of maps, which comprises 2 Stalkers and 3 Zealots (2s_3z).
[0051] To evaluate the performance of the agents, training was paused every 100 episodes and testing was performed for 20 episodes. The plot of the percentage (%) of winning test episodes is shown in FIG. 3.
[0052] From the plot it can be seen that the IQL approach 308 failed badly, since two of the agents are stochastic and this resulted in poor collaboration between the agents. Although QMIX 306 gave good performance compared with IQL, it also settled around a 40% winning rate, which is also not enough. Both cases fail as they look too far into the future with the choice of a high discount factor, which leads to poor planning.
[0053] The results of the proposed approach herein 302 and simple penalization 304 show an improvement in collaboration through predicting the weighting factors for individual agents. The simple penalization method 304 improves the collaboration to a value of 60%, whereas the proposed techniques disclosed herein 302 improve collaboration further, achieving a top value of 95%.
[0054] From the results shown in FIG. 3, it is evident that the proposed approach 302 performs better than the existing methods, which shows the efficacy of the proposed approach. One reason is that the proposed approach dynamically chooses the best weights based on the conditions, whereas the existing methods use no weights.
[0055] Telecommunications Example
[0056] FIG. 4 is a signalling diagram, according to some embodiments. FIG. 4 shows signalling in a telecommunications network among a first cell 402a, a second cell 402b, a third cell 402c, a central server 404, and an environment 410.
[0057] At 401, the first cell sends its Q-values and the computational power available at time instant t towards the central server. Similarly, at 403 and 405, the second cell and the third cell send their respective Q-values and computational powers available at time instant t towards the central server.
[0058] At 407, the environment transmits a global state from the environment towards the central server. The global state may include a set of joint actions taken and a corresponding global value or reward.
[0059] At 409, the central server estimates the weights using the first machine learning model, e.g. weight prediction network 212.
[0060] At 411, the central server computes the Q_tot value using a second machine learning model, e.g., the mixing network 204, and computes back-propagation gradients, e.g., updated global values.
[0061] At 413, 415, and 417, the central server sends the global values back to cell 1, cell 2, and cell 3, respectively.
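For clarity, the per-time-step message contents exchanged in FIG. 4 can be summarized as follows; the field and class names are illustrative assumptions, not terms used in the disclosure.

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class CellReport:                 # messages 401, 403, 405: cell -> central server
        cell_id: int
        q_values: List[float]         # local Q-values at time instant t
        comp_power: float             # computational power available at time instant t

    @dataclass
    class EnvironmentReport:          # message 407: environment -> central server
        global_state: List[float]     # may include the joint actions taken and the global reward

    @dataclass
    class ServerUpdate:               # messages 413, 415, 417: central server -> cell
        cell_id: int
        updated_global_value: float   # Q_tot-based update / back-propagation gradient for the cell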
[0062] At 419, the process is repeated until all the networks converge. [0063] In one embodiment, the technique proposed herein may be used to address a telecommunications antenna tilting problem in 5G. The problem relates to how to tilt antennas at specific angles in order to maximize the QoE for the customers and minimize the interference. Since there can be many antennas covering a particular area, these antennas should collaborate with each other to provide good QoE for all the users in the network. This problem can be formulated as a MARL problem, with the actions being antenna tilts and the environmental state being the QoE for each user present in the area. In this example, each agent may be present in a different gNodeB or base station in the network. In addition to this MARL model, there are many machine learning (ML) models running in parallel in the different cells. Hence the computational power available for the MARL model can differ at any given time.
[0064] In general, agents may be trained in an online fashion because offline training can be very difficult and can lead to high-variance Q-values. For example, the error in Q-values may be three times greater in offline training when compared with online training. Accordingly, an online training approach is used to train these agents. Observations are collected across agents, and training is performed at the central agent by sending the respective Q-values. However, since the computational resources can change at any time due to other ML models running, the weights with which the Q-values are to be mixed must be estimated. In order to estimate the weights, the computational power is sent to the central server, which in turn estimates the weights with which the individual Q-values are mixed. Finally, the central server computes Q_tot, which is sent to the local agents. The local agents are updated with Q_tot, which results in improved collaboration.
[0065] Accordingly, using the techniques disclosed herein, different computational resources can be handled in different cells while training a collaborative MARL model.
[0066] In addition, different applications of the proposed approach may be used in the telecom environment, such as for the channel prediction problem, the optimal handover problem, etc.
[0067] For example, the techniques disclosed herein can be used for load balancing among gNodeBs. In general, gNodeBs may have to collaborate among themselves to create a smooth and optimal experience for the users. In a case where one of the gNodeBs cannot take too much load, the other nearby gNodeBs have to take this load. In order to create optimal collaborations between the gNodeBs, one may consider various factors such as the mobility patterns of the users, the available capacities of the gNodeBs, etc. [0068] In another example, the techniques disclosed herein can be used for resource allocation for Internet of Things (IoT) applications. To support popular IoT applications, edge computing provides a front-end distributed computing paradigm that complements centralized cloud computing with low latency. The network has to allocate the optimal resources to these devices based on their usage and criticality. Hence, the proposed techniques herein for MARL may be used to allocate resources to these devices.
[0069] FIG. 5 illustrates a method, according to some embodiments. In some embodiments, method 500 is a computer-implemented method for training a plurality of computing agents.
[0070] Step 501 includes obtaining, from a first computing agent operating in an environment, a first value of a first action and a first computational metric of the first computing agent.
[0071] Step 503 includes obtaining, from a second computing agent different than the first computing agent operating in the environment, a second value of a second action and a second computational metric of the second computing agent.
[0072] Step 505 includes obtaining a global state of the environment.
[0073] Step 507 includes updating, using a first machine learning model and a second machine learning model, a global value of a joint action based on the first value of the first action, the first computational metric, the second value of the second action, the second computational metric, and the state of the environment.
[0074] Step 509 includes transmitting the updated global value of the joint action towards the first computing agent.
[0075] Step 511 includes transmitting the updated global value of the joint action towards the second computing agent.
[0076] In some embodiments, the method may further include determining, using the first machine learning model, a first weight for the first value of the first action based on the first computational metric and the global state of the environment. The method may further include determining, using the first machine learning model, a second weight for the second value of the second action based on the second computational metric and the global state of the environment. In some embodiments, the method further includes updating, using the second machine learning model, the global value of the joint action based on the determined first weight and the determined second weight. [0077] In some embodiments, the first machine learning model is a first neural network and the second machine learning model is a second neural network different than the first neural network. The first neural network may correspond to weight prediction network 212 and the second neural network may correspond to mixing network 204.
[0078] In some embodiments, the first computational metric has a value that is different than a value of the second computational metric. In some embodiments, the first computational metric and the second computational metric comprise one or more of: a number of iterations or epochs trained, a measure of processing resources, a measure of available memory, or other relevant parameters.
[0079] In some embodiments, the updating in step 507 comprises using at least one of: a value decomposition method, a counterfactual multi-agent method, or a combination thereof.
[0080] In some embodiments, the first value, the second value, and the global value are determined using reinforcement learning (RL) models. In some embodiments, the first value is a first Q-value corresponding to the first action and the second value is a second Q-value corresponding to the second action.
[0081] In some embodiments, the joint action in method 500 comprises at least one of the first action and the second action.
[0082] FIG. 6 is a block diagram of a computing device 600 according to some embodiments. In some embodiments, computing device 600 may comprise one or more of the components of the central server as described above. As shown in FIG. 6, the device may comprise: processing circuitry (PC) 602, which may include one or more processors (P) 655 (e.g., one or more general purpose microprocessors and/or one or more other processors, such as an application specific integrated circuit (ASIC), field-programmable gate arrays (FPGAs), and the like); communication circuitry 648, comprising a transmitter (Tx) 645 and a receiver (Rx) 647 for enabling the device to transmit data and receive data (e.g., wirelessly transmit/receive data); and a local storage unit (a.k.a., “data storage system”) 608, which may include one or more non-volatile storage devices and/or one or more volatile storage devices. In embodiments where PC 602 includes a programmable processor, a computer program product (CPP) 641 may be provided. CPP 641 includes a computer readable medium (CRM) 642 storing a computer program (CP) 643 comprising computer readable instructions (CRI) 644. CRM 642 may be a non-transitory computer readable medium, such as, magnetic media (e.g., a hard disk), optical media, memory devices (e.g., random access memory, flash memory), and the like. In some embodiments, the CRI 644 of computer program 643 is configured such that when executed by PC 602, the CRI causes the apparatus to perform steps described herein (e.g., steps described herein with reference to the flow charts). In other embodiments, the apparatus may be configured to perform steps described herein without the need for code. That is, for example, PC 602 may consist merely of one or more ASICs. Hence, the features of the embodiments described herein may be implemented in hardware and/or software.
[0083] While various embodiments are described herein, it should be understood that they have been presented by way of example only, and not limitation. Thus, the breadth and scope of this disclosure should not be limited by any of the above described embodiments. Moreover, any combination of the above-described elements in all possible variations thereof is encompassed by the disclosure unless otherwise indicated herein or otherwise clearly contradicted by context.
[0084] Additionally, while the processes described above and illustrated in the drawings are shown as a sequence of steps, this was done solely for the sake of illustration. Accordingly, it is contemplated that some steps may be added, some steps may be omitted, the order of the steps may be re-arranged, and some steps may be performed in parallel.
[0085] ABBREVIATIONS
[0086] SLA Service Level Agreement
[0087] QoE Quality of Experience
[0088] VDN Value Decomposition Network
[0089] COMA Counterfactual multi-agent policy gradient
[0090] MAVEN Multi-agent Variational Exploration
[0091] MARL Multi-agent Reinforcement Learning
[0092] REFERENCES
[0093] [1] Sunehag, Peter, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius Zambaldi, Max Jaderberg, Marc Lanctot et al. "Value-decomposition networks for cooperative multi-agent learning." arXiv preprint arXiv:1706.05296 (2017).
[0094] [2] Rashid, Tabish, Mikayel Samvelyan, Christian Schroeder, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. "QMIX: Monotonic value function factorisation for deep multi-agent reinforcement learning." In International Conference on Machine Learning, pp. 4295-4304. PMLR, 2018.
[0095] [3] Mahajan, Anuj, Tabish Rashid, Mikayel Samvelyan, and Shimon Whiteson. "MAVEN: Multi-agent variational exploration." arXiv preprint arXiv:1910.07483 (2019).
[0096] [4] Aotani, Takumi, Taisuke Kobayashi, and Kenji Sugimoto. "Bottom-up multi-agent reinforcement learning for selective cooperation." In 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 3590-3595. IEEE, 2018.

Claims

1. A computer-implemented method (500) for training a plurality of computing agents, the method comprising: obtaining (501), from a first computing agent (102, 202, 402) operating in an environment, a first value of a first action and a first computational metric of the first computing agent; obtaining (503), from a second computing agent (102, 202, 402) different than the first computing agent operating in the environment, a second value of a second action and a second computational metric of the second computing agent; obtaining (505) a global state of the environment; updating (507), using a first machine learning model (212) and a second machine learning model (204), a global value of a joint action based on the first value of the first action, the first computational metric, the second value of the second action, the second computational metric, and the state of the environment; transmitting (509) the updated global value of the joint action towards the first computing agent; and transmitting (511) the updated global value of the joint action towards the second computing agent.
2. The method of claim 1, further comprising: determining, using the first machine learning model (212), a first weight for the first value of the first action based on the first computational metric and the global state of the environment; and determining, using the first machine learning model, a second weight for the second value of the second action based on the second computational metric and the global state of the environment.
3. The method of claim 2, further comprising: updating, using the second machine learning model (204), the global value of the joint action based on the determined first weight and the determined second weight.
4. The method of any one of claims 2-3, wherein the first machine learning model is a first neural network and the second machine learning model is a second neural network different than the first neural network.
5. The method of any one of claims 1-4, wherein the first computational metric has a value that is different than a value of the second computational metric.
6. The method of any one of claims 1-5, wherein the first computational metric and the second computational metric comprise one or more of: a number of iterations or epochs trained, a measure of processing resources, a measure of available memory, or other relevant parameters.
7. The method of any one of claims 1-6, wherein the updating comprises using at least one of: a value decomposition method, a counterfactual multi-agent method, or a combination thereof.
8. The method of any one of claims 1-7, wherein the environment comprises a communications network, the first computing agent is a first network node (402), and the second computing agent is a second network node (402).
9. The method of claim 8, wherein the state of the environment comprises a quality of experience of one or more user devices in the communications network and a configuration of one or more network resources in the communications network.
10. The method of claim 9, wherein the configuration of one or more network resources comprises a measure of a tilt of one or more antennas.
11. The method of claim 8, wherein the state of the environment comprises one or more of: a mobility pattern of one or more user devices, a capability of the first network node or the second network node, or a balance of load across the first network node and the second network node.
12. The method of any one of claims 1-7, wherein the environment comprises a distributed computing architecture, the first computing agent is a first device, and the second computing agent is a second device.
13. The method of claim 12, wherein the state of the environment comprises one or more of: a usage of the first device or the second device, or one or more resources allocated to the first device or the second device.
14. The method of any one of claims 1-7, wherein the environment comprises a computer game, the first computing agent is a first actor in the computer game, and the second computing agent is a second actor in the computer game.
15. The method of any one of claims 1-14, wherein the first computing agent is associated with a first device, the second computing agent is associated with a second device, and the first device is different than the second device.
16. The method of any one of claims 1-15, wherein the first value, the second value, and the global value are determined using reinforcement learning (RL) models.
17. The method of any one of claims 1-16, wherein the first value is a first Q-value corresponding to the first action and the second value is a second Q-value corresponding to the second action.
18. The method of any one of claims 1-17, wherein the first action and the second action comprise one or more of: tilting an antenna, offloading network traffic, allocating one or more resources in a network node, or configuring one or more resources in a network node.
19. The method of any one of claims 1-17, wherein the first action and the second action comprise a move by one or more players in a computer game.
20. The method of any one of claims 1-19, wherein the joint action comprises at least one of the first action and the second action.
21. A device (600) with processing circuitry (602), wherein the processing circuitry is adapted to: obtain (501), from a first computing agent operating in an environment, a first value of a first action and a first computational metric of the first computing agent; obtain (503), from a second computing agent different than the first computing agent operating in the environment, a second value of a second action and a second computational metric of the second computing agent; obtain (505) a global state of the environment; update (507), using a first machine learning model and a second machine learning model, a global value of a joint action based on the first value of the first action, the first computational metric, the second value of the second action, the second computational metric, and the state of the environment; transmit (509) the updated global value of the joint action towards the first computing agent; and transmit (511) the updated global value of the joint action towards the second computing agent.
22. The device of claim 21, wherein the processing circuitry is further configured to perform any one of the methods of claims 1-20.
23. A computer program (643) comprising instructions (644) which, when executed by processing circuitry (602) of a device (600), cause the device to perform the method of any one of claims 1-20.
24. A carrier containing the computer program of claim 23, wherein the carrier is one of an electronic signal, an optical signal, a radio signal, and a computer readable storage medium.