WO2023213402A1 - Multi-agent reinforcement learning systems - Google Patents

Multi-agent reinforcement learning systems

Info

Publication number
WO2023213402A1
WO2023213402A1 PCT/EP2022/062169 EP2022062169W
Authority
WO
WIPO (PCT)
Prior art keywords
machine learning
action
value function
agent
actions
Prior art date
Application number
PCT/EP2022/062169
Other languages
English (en)
Inventor
David MGUNI
Taher JAFFERJEE
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. filed Critical Huawei Technologies Co., Ltd.
Priority to PCT/EP2022/062169 priority Critical patent/WO2023213402A1/fr
Publication of WO2023213402A1 publication Critical patent/WO2023213402A1/fr

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/004: Artificial life, i.e. computing arrangements simulating life
    • G06N 3/006: Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 7/00: Computing arrangements based on specific mathematical models
    • G06N 7/01: Probabilistic graphical models, e.g. probabilistic networks

Definitions

  • This invention relates to multi-agent reinforcement learning (MARL).
  • MARL multi-agent reinforcement learning
  • CT-DE centralised training with decentralised execution
  • MASs Multi-agent systems
  • Centralised training has typically demonstrated particularly good performance in multi-agent reinforcement learning.
  • CT relies on agents learning from one-off observations of other agents' actions at a given state. Since MARL agents explore and progressively update their policies, the one-off observations may provide little information about the typical behaviour of other agents and consequently, little information about the expected return for a given action. As such, CT methods may have high variance and can provide error-prone estimates which harm learning.
  • centralised learning is typically computationally expensive and can often require large numbers of samples to complete training. Scaling such CT methods therefore typically requires value function (VF) factorisations, which in turn require strong assumptions (e.g. monotonicity in QMIX) and can lead to poor performance when those assumptions are violated. Recent results have indicated that in some settings centralised learning can hinder learning. There are cases in which centralised learning is greatly outperformed by independent learning (IL) algorithms. IL algorithms typically learn quickly and are easy to tune.
  • IL independent learning
  • Independent learning algorithms have other disadvantages, however. First, they do not explicitly account for the actions of other agents. Unlike single agent reinforcement learning (RL), in MASs the collective of agents is typically required to coordinate to successfully solve the task. For independent learning methods, the presence of other RL agents that are not accounted for produces the appearance of a non-stationary environment from the perspective of a given agent. This weakens such algorithms because their convergence guarantees no longer hold.
  • RL single agent reinforcement learning
  • a second issue for IL algorithms is that they are unable to distinguish between (i) randomness due to the stochasticity of the modelled environment and (ii) randomness produced by the explorative actions of RL agents. This hinders the ability of independent learning methods to learn optimal policies in some environments.
  • the value function may be a real number which represents the expected sum of rewards as a result of the action of the first machine learning agent in the current state while accounting for one of the plurality of joint actions of the plurality of other machine learning agents. This can assist in identifying an optimal solution.
  • the mean value function may be a mean of the total value function of a plurality of joint actions being considered over the number of the plurality of joint actions being considered. This can assist in identifying an optimal solution.
  • the method steps of determining and calculating may be repeated for a different action of the first machine learning agent. This can assist in providing an optimised solution for training the MARL algorithm.
  • the value function may be calculated using a value function equation arranged to take one of the determined plurality of joint actions as an input.
  • the value function equation may be written as: V_i : S × A_1 × ⋯ × A_N → ℝ
  • V_i is the value function
  • S is the current state
  • a ∈ A_1 × ⋯ × A_N represents each action of a joint action for the N other machine learning agents
  • ℝ is the real-number output which represents the expected sum of rewards as a result of the action of the first machine learning agent i. This can assist in providing a solution which takes into account the exact actions of multiple other agents.
  • the value function equation may be implemented as a multi-layer perceptron with the joint action appended to the current state term. This can assist in implementing the proposed approach to identify an optimal solution.
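  • For illustration only, the sketch below shows one way such a critic could be realised: a multi-layer perceptron whose input is the state vector with a one-hot encoding of the joint action appended. The class name JointActionCritic, the encoding and the layer sizes are assumptions made for the sketch, not details disclosed above.

```python
# Minimal sketch (assumed details): a critic V_i(s, a) realised as an MLP whose
# input is the current state with the joint action appended, as described above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointActionCritic(nn.Module):  # hypothetical name
    def __init__(self, state_dim: int, n_agents: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.n_actions = n_actions
        # The joint action is one-hot encoded per agent and concatenated with the state.
        self.net = nn.Sequential(
            nn.Linear(state_dim + n_agents * n_actions, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # real-number output: expected discounted return
        )

    def forward(self, state: torch.Tensor, joint_action: torch.Tensor) -> torch.Tensor:
        # state: (batch, state_dim); joint_action: (batch, n_agents) integer action indices
        one_hot = F.one_hot(joint_action, self.n_actions).float().flatten(start_dim=1)
        return self.net(torch.cat([state, one_hot], dim=-1)).squeeze(-1)
```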
  • the method may comprise updating the first machine learning agent and optimising the policy of the first machine learning agent based on the calculated mean value function. This can assist in training the first machine learning agent to identify an optimal solution.
  • a device for implementing a policy of a first machine learning agent, the policy being optimised by calculating a mean value function of one or more actions of the first machine learning agent within a multi-agent reinforcement learning framework comprising a plurality of other machine learning agents, where the mean value function is calculated to account for a plurality of joint actions of the plurality of other machine learning agents.
  • the device may implement the policy optimised according to the method of any of the preceding claims.
  • Figure 1 shows a simple example of a situation where the present invention may be advantageous.
  • Figure 2 shows a schematic diagram of the combined joint action and state in a network.
  • Figure 3 shows an example algorithm.
  • Figure 4 shows a computer system suitable for implementing the present invention.
  • the proposed approach succeeds in computing accurate value function, VF, estimates that incorporate the exact behaviour of other agents while maintaining a decentralised critic.
  • the proposed approach comprises a method of embedding shared knowledge in the agents' individual VFs.
  • the proposed method dramatically reduces the variance of the VFs during training and enables agents to quickly determine their optimal joint policy given the influence of other agents.
  • MARL Multi-agent Reinforcement Learning
  • MARL is a setting in which a collection of decision-making agents observe a state of the environment and attempt to jointly take actions to maximise their utility.
  • N agents each observe a state of the environment, s_t.
  • each agent i samples its respective policy to select an action a_t^i ~ π^i(· | s_t).
  • the collection of each agent's actions forms a joint-action a_t.
  • Actuating this joint-action in the environment results in the environment emitting a joint-reward r_t ~ R(s_t, a_t) and transitioning to a new state s_{t+1} ~ P(s_t, a_t, s_{t+1}).
  • achieving optimal outcomes with respect to this goal may require agents to coordinate, which is a central component of MARL.
  • the setting consists of an environment E (which has a set of states S, allowed actions A, a reward function R, and transition dynamics P), and a collection of decision-making agents with joint policy π.
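  • For illustration only, the generic interaction loop just described can be sketched as follows; the environment interface (reset, step) and the policy callables are assumptions for the sketch rather than elements of the above description.

```python
# Sketch of one episode of the generic MARL loop described above (assumed API).
import numpy as np

def run_episode(env, policies, max_steps=100):
    """policies[i](state) is assumed to return a sampled action a_t^i ~ pi^i(.|s_t)."""
    state = env.reset()                                  # agents observe s_t
    total_reward = np.zeros(len(policies))
    for _ in range(max_steps):
        joint_action = [pi(state) for pi in policies]    # the joint action a_t
        state, reward, done = env.step(joint_action)     # r_t ~ R(s_t, a_t), s_{t+1} ~ P
        total_reward += np.asarray(reward)               # joint reward, one entry per agent
        if done:
            break
    return total_reward
```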
  • a simple example of a MARL environment 101 is shown in Figure 1.
  • This is a traffic junction in which agents 102 are represented by squares.
  • the squares can move on orthogonal roads 103.
  • the state of the environment consists of the locations of the agent squares 102 on the roads 103 at any given time.
  • Each agent’s 102 action at an iteration is to either move forward or stand still.
  • the so-called joint-action consists of the collection of these actions, i.e. a set of actions in which each agent 102 takes a respective action.
  • a typical reward function in this MARL example might reward forward movement of an agent 102 and severely penalise accidents at the intersection 104 of the roads 103.
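  • As an illustration of such a reward function (the constants and the helper name junction_reward are invented for the sketch and are not taken from the description above):

```python
# Toy reward for the traffic-junction example: reward forward movement,
# severely penalise two agents occupying the same cell (an accident).
from collections import Counter

def junction_reward(positions, moved_forward):
    """positions: one (row, col) cell per agent; moved_forward: one bool per agent."""
    occupancy = Counter(positions)
    rewards = []
    for pos, fwd in zip(positions, moved_forward):
        r = 0.1 if fwd else 0.0      # small bonus for moving forward
        if occupancy[pos] > 1:       # another agent is in the same cell: accident
            r -= 10.0                # severe penalty
        rewards.append(r)
    return rewards
```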
  • CT-DE Centralised Training -- Decentralised Execution
  • VDN value decomposition networks
  • QMIX monotonic value function factorisation for deep multi-agent reinforcement learning
  • MAPPO multi-agent proximal policy optimisation
  • HATRPO heterogeneous-agent trust region policy optimisation
  • a central feature of this CT-DE framework is a centralised value function, also called a centralised critic, that is used to aid coordination between agents.
  • the critic is privy to extra information which aids in learning, for example the global state of the environment, while agents still execute their policies independently. That is, agents still execute their policies based on their own local observations without consultation regarding other agents’ actions.
  • the proposed approach consists of two core components: (1) a value function which takes the joint action as input and (2) marginalisation of the actions of all agents except that of the i-th agent. That is, when updating the parameters of the i-th agent.
  • the proposed approach focuses on training a first machine learning agent in a multi-agent reinforcement learning framework.
  • the framework comprises a plurality of other machine learning agents, each implementing their own respective policy.
  • the method comprises determining a plurality of joint actions, each comprising a different combination of respective actions to be taken by each of the plurality of machine learning agents based on their respective policies.
  • the method also comprises calculating, based on the determined plurality of joint actions and an action of the first machine learning agent, the mean value function of the action of the first machine learning agent.
  • the proposed method is applicable to multi-agent reinforcement learning algorithms.
  • the steps of determining and calculating may also be repeated for a different action of the first machine learning agent.
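  • A minimal sketch of these two steps is given below, assuming a callable value_fn(state, joint_action) for the joint-action-augmented value function and policy objects with a sample(state) method; both interfaces are assumptions made for the sketch.

```python
# Sketch: for each candidate action of the first agent, average the augmented value
# function over joint actions sampled from the other agents' current policies.
def mean_value_per_action(value_fn, state, own_actions, other_policies, n_samples=16):
    """Returns a dict mapping each candidate action of the first agent to its mean value."""
    means = {}
    for a_i in own_actions:
        total = 0.0
        for _ in range(n_samples):
            # one joint action: the candidate action plus one sampled action per other agent
            joint_action = (a_i,) + tuple(pi.sample(state) for pi in other_policies)
            total += value_fn(state, joint_action)
        means[a_i] = total / n_samples
    return means
```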
  • any agent i ∈ N is able to compute Q_i(s, a), which estimates the agent's expected return under its policy, given the behaviour of all other agents N \ {i} and given that the agent takes an action a ∈ A_i. Therefore, Q_i(s, a) seeks to provide an estimate of the value of the agent's own action while accounting for the actions of other agents.
  • the joint action a_{-i} of the other agents provides little information about the usefulness of taking action a in state s and may also be an action tuple that is unlikely to be played again. This can occur if at least one of the agents -i samples an action to be executed which has low probability under either its current policy or under its policies after subsequent updates.
  • This procedure involves making a single observation (a_1, ..., a_N) ~ (π_1, ..., π_N) of other agents' actions sampled from their individual policies. This is not very useful in providing information about other agents' behaviour. Learning requires many samples and many revisitations of the same state. Including global observations is computationally expensive, as the set of possible actions explodes with increasing numbers of agents.
  • the proposed method comprises receiving current state information about each of the plurality of other machine learning agents in the environment and evaluating each respective policy given the current state information to provide a respective action of the other machine learning agent, where each different possible combination of actions provides a different joint action.
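  • For illustration, one way to form the different joint actions is sketched below; the support(state) method, which is assumed to return the actions an agent may take in the current state, is an invented interface for the sketch.

```python
# Sketch: enumerate every combination of the other agents' possible actions,
# each combination giving a different candidate joint action.
from itertools import product

def candidate_joint_actions(state, other_policies):
    supports = [policy.support(state) for policy in other_policies]  # assumed method
    return list(product(*supports))  # all combinations of the other agents' actions
```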
  • the value function (also called the critic) of the i-th agent is a function or equation defined as follows: V_i : S → ℝ
  • the value function outputs a real number representing the expected sum of discounted rewards of policy π^i from state s_t.
  • the proposed implementation augments the value function equation to also take the joint action a_t ~ π(· | s_t) as input: V_i : S × A_1 × ⋯ × A_N → ℝ
  • the value function now outputs a real number representing the expected sum of discounted rewards of policy π^i from state s_t given joint-action a_t. That is, the expected sum of rewards is a result of the action of the first machine learning agent in the current state while accounting for one of the plurality of joint actions of the plurality of other machine learning agents.
  • the value function, that is, the real number output of the function called the value function shown above, is calculated using a value function equation arranged to take one of the determined plurality of joint actions as an input: V_i : S × A_1 × ⋯ × A_N → ℝ
  • V_i is the value function
  • S is the current state
  • a ∈ A_1 × ⋯ × A_N represents each action of a joint action for the N other machine learning agents
  • ℝ is the real-number output which represents the expected sum of rewards as a result of the action of the first machine learning agent i.
  • the value function is typically implemented as a multi-layer perceptron (MLP).
  • MLP multi-layer perceptron
  • Figure 2 shows a schematic diagram of the combined joint action and state 201 passing through the network 200.
  • the output 204 is generated.
  • the output 204 is a real number representing the expected sum of discounted rewards of policy π^i from state s_t which also takes into account the joint action a_t.
  • a key object in the proposed framework is the jointly marginalised action-value function Q_i, which is defined via the following expression:
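  • The expression itself is not reproduced in this extract; a form consistent with the surrounding description, in which the joint-action-augmented value function is marginalised over the other agents' current policies, would be (an assumed reconstruction, not the verbatim equation):

```latex
Q_i(s, a_i) \;=\; \mathbb{E}_{a_{-i} \sim \pi_{-i}(\cdot \mid s)}
  \big[\, V_i\big(s, (a_i, a_{-i})\big) \,\big]
\;\approx\; \frac{1}{K} \sum_{k=1}^{K} V_i\big(s, (a_i, a_{-i}^{(k)})\big),
\qquad a_{-i}^{(k)} \sim \pi_{-i}(\cdot \mid s)
```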
  • the function Q_i seeks to estimate the expected return following agent i taking action a_i.
  • Q_i builds in the distribution of the actions played by other agents under their current policies, therefore giving less weight to low-probability actions and, conversely, more weight to high-probability actions.
  • Q_i depends only on the agent's own action and state, which retains the benefit of IL (decentralised learning) while factoring in the behaviour of other agents. This avoids the need for rigid factorisations of the value functions, which generally require imposing restrictive assumptions on the reward structure.
  • the critic now takes a_t as input as well.
  • the critic’s parameters are updated with respect to the gradient of the following loss:
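  • The loss itself is not reproduced in this extract; a standard temporal-difference regression loss of the kind commonly used for such critics is sketched below as an assumption, with an invented batch layout, a target network and a discount factor gamma that are not taken from the description above.

```python
# Sketch (assumed form): TD regression loss for the joint-action-augmented critic.
import torch
import torch.nn.functional as F

def critic_loss(critic, target_critic, batch, gamma=0.99):
    """batch: dict of tensors 'state', 'joint_action', 'reward',
    'next_state', 'next_joint_action', 'done' (0/1 floats)."""
    with torch.no_grad():
        next_v = target_critic(batch["next_state"], batch["next_joint_action"])
        target = batch["reward"] + gamma * (1.0 - batch["done"]) * next_v
    value = critic(batch["state"], batch["joint_action"])
    return F.mse_loss(value, target)
```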
  • Stages 302, 304, and 305 explicitly indicate components of the proposed method and stages 301, 303, and 306 indicate algorithmic components of generic MARL algorithms which benefit from the other proposed components.
  • the first stage 301 starts by initialising the environment alongside a MARL learning algorithm that uses a value function (or critic) and a policy.
  • the value function or critic is augmented to take the joint action into account by appending the joint action vector to the current state.
  • the output value function takes into account the actions of the other agents without expensive computational processes and in doing so reduces the variance of the value function compared to other computationally inexpensive solutions.
  • the next stage 303 comprises a typical MARL step where policies of the other agents in the environment whose actions are being considered are rolled out.
  • action data is obtained for each of those agents.
  • D is a data set defined by a set of observations o_1, ..., o_T, joint actions a_1, ..., a_T, and reward tuples r_1, ..., r_T.
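  • By way of illustration, such a data set could be held in a structure like the one below; the field names are assumptions for the sketch.

```python
# Sketch of the rollout data set D: observations, joint actions and rewards over T steps.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class RolloutDataset:
    observations: List[list] = field(default_factory=list)              # o_1 ... o_T
    joint_actions: List[Tuple[int, ...]] = field(default_factory=list)  # a_1 ... a_T
    rewards: List[float] = field(default_factory=list)                  # r_1 ... r_T

    def add(self, obs, joint_action, reward):
        self.observations.append(obs)
        self.joint_actions.append(joint_action)
        self.rewards.append(reward)
```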
  • the next stage 304 involves repeating the evaluation of the augmented value function for each of the possible joint actions. That is, one joint action per agent to be considered within the environment.
  • the value function or critic can then provide a real number output which accounts for that specific joint action.
  • o is the observation or local observation for agent i.
  • the mean of all of the value function outputs or critics' values can be calculated by averaging them over the number of joint actions considered, which is the same as the number of agents considered within the environment.
  • the mean value function is a mean of the total value function of a plurality of joint actions being considered over the number of the plurality of joint actions being considered.
  • the value function output or critic value is used to update the value function or critic and the corresponding policy. That is, the MARL algorithm is trained by subsequently updating the policy based on the calculated mean value function.
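  • One common way to use such a value as a baseline in the policy update is sketched below; the actor-critic form, the optimiser interface and the assumption that policy(state) returns a torch distribution are all choices made for the sketch, not the specific update prescribed above.

```python
# Sketch: policy-gradient update that uses the mean (marginalised) value as a baseline.
import torch

def policy_update(policy, optimiser, state, action, empirical_return, mean_value):
    """empirical_return and mean_value are scalar tensors; policy(state) is assumed
    to return a torch.distributions.Distribution over the agent's actions."""
    advantage = (empirical_return - mean_value).detach()   # lower-variance advantage
    log_prob = policy(state).log_prob(torch.as_tensor(action))
    loss = -(log_prob * advantage)
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
```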
  • the method may comprise updating the first machine learning agent and optimising the policy of the first machine learning agent based on the calculated mean value function.
  • Figure 4 shows a computing system 401 comprising a processing entity 402 and a program store 403.
  • the processing entity 402 comprises one or more processors.
  • the program store 403 stores in non-transient form program code executable by the processing entity 402 to perform the algorithmic functions described herein.
  • the computing system 401 may be a single computer or multiple computers. It may be in a single location or distributed between multiple locations.
  • the computing system 401 may be contained within one or more devices.
  • a device for implementing a policy of a first machine learning agent within a multi-agent reinforcement learning framework comprising a plurality of other machine learning agents, the policy being optimised by calculating a mean value function of one or more actions of the first machine learning agent.
  • the mean value function is calculated to account for a plurality of joint actions of the plurality of other machine learning agents. Therefore, there is provided a device configured to operate according to the optimised training of the policy it implements.

Abstract

The present approach involves a method of training a first machine learning agent in a multi-agent reinforcement learning framework. The framework comprises a plurality of other machine learning agents each implementing their own respective policy. The method comprises determining a plurality of joint actions, each comprising a different combination of respective actions to be taken by each of the plurality of machine learning agents based on their respective policies, and calculating, based on the determined plurality of joint actions and an action of the first machine learning agent, the mean value function of the action of the first machine learning agent.
PCT/EP2022/062169 2022-05-05 2022-05-05 Multi-agent reinforcement learning systems WO2023213402A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/EP2022/062169 WO2023213402A1 (fr) 2022-05-05 2022-05-05 Multi-agent reinforcement learning systems

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2022/062169 WO2023213402A1 (fr) 2022-05-05 2022-05-05 Multi-agent reinforcement learning systems

Publications (1)

Publication Number Publication Date
WO2023213402A1 true WO2023213402A1 (fr) 2023-11-09

Family

ID=81927796

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2022/062169 WO2023213402A1 (fr) 2022-05-05 2022-05-05 Multi-agent reinforcement learning systems

Country Status (1)

Country Link
WO (1) WO2023213402A1 (fr)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200279136A1 (en) * 2019-03-01 2020-09-03 Royal Bank Of Canada System and method for multi-type mean field reinforcement machine learning
US20210252711A1 (en) * 2020-02-14 2021-08-19 Robert Bosch Gmbh Method and device for controlling a robot

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
DE WITT ET AL., IS INDEPENDENT LEARNING ALL YOU NEED IN THE STARCRAFT MULTI-AGENT CHALLENGE?, 2021
RASHID ET AL., QMIX: MONOTONIC VALUE FUNCTION FACTORISATION FOR DEEP MULTI-AGENT REINFORCEMENT LEARNING, 2018
TAN ET AL., MULTI-AGENT REINFORCEMENT LEARNING: INDEPENDENT VS. COOPERATIVE AGENTS, 1993
ZHI ZHANG ET AL: "Integrating independent and centralized multi-agent reinforcement learning for traffic signal network optimization", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 24 September 2019 (2019-09-24), XP081480849 *


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22727344

Country of ref document: EP

Kind code of ref document: A1