WO2022207087A1 - Device and method for approximating nash equilibrium in two-player zero-sum games - Google Patents

Device and method for approximating nash equilibrium in two-player zero-sum games Download PDF

Info

Publication number
WO2022207087A1
Authority
WO
WIPO (PCT)
Prior art keywords
agents
input
machine learning
learning process
aggregate functions
Prior art date
Application number
PCT/EP2021/058392
Other languages
French (fr)
Inventor
Yaodong YANG
Nicolas PEREZ NIEVES
Oliver SLUMBERS
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. filed Critical Huawei Technologies Co., Ltd.
Priority to PCT/EP2021/058392 priority Critical patent/WO2022207087A1/en
Priority to EP21717001.8A priority patent/EP4298552A1/en
Priority to CN202180096388.8A priority patent/CN117083617A/en
Publication of WO2022207087A1 publication Critical patent/WO2022207087A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/004Artificial life, i.e. computing arrangements simulating life
    • G06N3/006Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Definitions

  • This invention relates to a computer-implemented device and method for application in two-player zero-sum game frameworks, particularly to approximating Nash equilibrium and promoting the diversity of policies in such frameworks.
  • a desirable configuration is known as a fixed point. This is a configuration in which no agent can improve their payoff by unilaterally changing their current policy behaviour. This concept is known as a Nash equilibrium (NE).
  • a simple example of a two-player zero-sum game is the Rock-Paper-Scissors game, where Rock beats Scissors, Scissors beats Paper, and Paper beats Rock; the Nash equilibrium is to play the three strategies uniformly (1/3, 1/3, 1/3). A player who plays the Nash strategy can no longer be exploited.
  • in more sophisticated two-player zero-sum games such as Texas Hold'em Poker or StarCraft, where the strategy space is much larger (for example, StarCraft has 10^26 atomic actions at every time step), approximate solvers are required to compute the Nash equilibrium.
  • the transitive part of a game represents the structure in which the rule of winning is transitive (i.e., if strategy A beats B, B beats C, then A beats C), and the non-transitive part refers to the structure in which the set of strategies follows a cyclic rule (for example, the endless cycles among Rock, Paper and Scissors). Diversity matters, especially for the non-transitive part simply because there is no consistent winner in such part of a game: if a player only plays Rock, he can be exploited by Paper, but not so if he has a diverse strategy set of Rock and Scissor.
  • a computer-implemented device for processing a two-agent system input to form multiple at least partially optimised outputs, each output indicative of an action policy for each of the two agents
  • the device comprising one or more processors configured to perform the steps of: receiving the two-agent system input, the two-agent system input comprising a definition of a two-agent system and defining behaviour patterns of two agents based on system states; receiving an indication of an input system state; and performing an iterative machine learning process to estimate multiple aggregate functions, each representing the behaviour patterns of the two agents over a set of system states, wherein the multiple aggregate functions are determined in a single iteration of the iterative machine learning process.
  • This may allow for an iterative method that is scalable for approximating Nash equilibria in two-player zero-sum game frameworks.
  • the multiple aggregate functions may correspond to multiple best-response policies for the agents.
  • the processor may be configured to iteratively process the multiple aggregate functions for the input system state to estimate multiple at least partially optimised set of actions for each of the two agents in the input system state.
  • the multiple aggregate functions may be iteratively processed until a predefined level of convergence is reached.
  • the multiple aggregate functions may be determined in a single iteration of the iterative machine learning process in parallel.
  • the device may implement a parallel double-oracle scheme that is designed to find multiple best-response policies in a distributed way at the same time.
  • Multiple aggregate functions may be determined in each iteration of the machine learning process.
  • the multiple aggregate functions may be refined in subsequent iterations of the iterative machine learning process. This may allow the device to keep finding best-response strategies in an iterative manner.
  • the iterative machine learning process may be performed so as to promote behavioural diversity among the multiple aggregate functions determined in each iteration. Promoting diversity of best-response policies may strengthen the performance of a model trained by the iterative machine learning process.
  • the iterative machine learning process may be performed in dependence on a diversity measure.
  • the diversity measure may be modelled by a determinantal point process.
  • the diversity measure may be based on the expected cardinality of a determinantal point process. This may allow diverse best-response policies to be determined.
  • the multiple at least partially optimised outputs may each comprise a collectively optimal action policy for each of the two agents in the input system state. This may allow for optimal behaviour of the agents.
  • the multiple at least partially optimised outputs may each represent a Nash equilibrium behaviour pattern of the two agents in the input system state. This may allow policies corresponding to the Nash equilibrium to be learned.
  • the step of performing an iterative machine learning process may comprise repeatedly performing the following steps until a predetermined level of convergence is reached: generating a set of random system states; estimating based on the two-agent system input the behaviour patterns of the two agents in the system states; estimating an error between the estimated behaviour patterns and the behaviour patterns predicted by each of multiple predetermined candidate aggregate functions, the error representing the level of convergence; and adapting the multiple predetermined candidate aggregate functions based on the estimated behaviour patterns. This can enable the device to find suitable aggregate functions in a manageable time period.
  • the set of random system states may be generated based on a predetermined probability distribution. This may be convenient for generating the system states.
  • the agents may be autonomous vehicles and the system states may be vehicular system states. This may allow the device to be implemented in a driverless car.
  • the agents may be data processing devices and the system states may be computation tasks. This may allow the device to be implemented in a communication system.
  • a method for processing a two-agent system input to form multiple at least partially optimised outputs, each indicative of an action policy, the method comprising the steps of: receiving the two-agent system input, the two-agent system input comprising a definition of a two-agent system and defining behaviour patterns of two agents based on system states; receiving an indication of an input system state; and performing an iterative machine learning process to estimate multiple aggregate functions, each representing the behaviour patterns of the two agents over a set of system states, wherein the multiple aggregate functions are determined in a single iteration of the iterative machine learning process.
  • the method may allow for an iterative method that is scalable for approximating Nash equilibria in two-player zero-sum game frameworks.
  • the method may further comprise the step of causing each of the agents to implement a respective action of the at least partially optimised set of actions. This can result in efficient operation of the agents. In this way the method can be used to control the actions of a physical entity.
  • according to a third aspect there is provided a computer readable medium storing in non-transient form a set of instructions for causing one or more processors to perform the method described above.
  • the method may be performed by a computer system comprising one or more processors programmed with executable code stored non-transiently in one or more memories.
  • Figure 1 shows an algorithm for general meta-game solvers.
  • Figure 2 shows a summary of prior methods.
  • Figure 3 shows an example of a determinantal point process.
  • Figure 4 shows an example of the pseudo-code for one implementation of the method described herein.
  • Figure 5 schematically illustrates the main goal of the approach described herein.
  • Figure 6 summarises an example of a method of processing a two-agent system input to form multiple at least partially optimised outputs, each indicative of an action policy.
  • Figure 7 schematically illustrates an example of a procedural diagram of the training of the solver.
  • Figure 8 summarises an example of the steps of the iterative machine learning process described herein.
  • Figure 9 shows an example of a computing device configured to perform the methods described herein.
  • Described herein is a computer-implemented device and method for application in two-player zero-sum game frameworks, implementing a general Nash solver suitable for large-scale two-player zero-sum games.
  • the approach can provide a parallel implementation to keep finding best-response strategies for the two agents in an iterative manner. Furthermore, the approach can find policies that are diverse in behaviours. In other words, the solver promotes behavioural diversity during the learning process.
  • a DPP is a type of point process, which measures the probability of selecting a random subset from a ground set where only diverse subsets are desired.
  • DPPs have origins in modelling repulsive quantum particles in physics (see Macchi, O., “The fermion process - a model of stochastic point process with repulsive points”, Transactions of the Seventh Prague Conference on Information Theory, Statistical Decision Functions, Random Processes and of the 1974 European Meeting of Statisticians, pp. 391-398. Springer, 1977).
  • the expected cardinality of a DPP is formulated as the diversity metric.
  • the diversity metric is a general tool for game solvers.
  • the diversity metric is incorporated into the best-response dynamics, and diversity-aware extensions of fictitious play (FP) (see Brown, 1951) and policy-space response oracles (PSRO) (see Lanctot et al., 2017) are developed.
  • maximising the DPP-based diversity metric guarantees an expansion of the gamescape (convex polytopes spanned by agents’ mixtures of policies).
  • the diversity-aware learning methods may converge to the respective solution concept of Nash equilibrium and α-Rank (see Omidshafiei, S., Papadimitriou, C., Piliouras, G., Tuyls, K., Rowland, M., Lespiau, J.-B., Czarnecki, W. M., Lanctot, M., Perolat, J., and Munos, R., “α-rank: Multi-agent evaluation by evolution”, Scientific reports, 9(1): 1-29, 2019) in two-player games.
  • a further preferred implementation of the method involves a parallel double-oracle scheme that is designed to find multiple best-responses in a distributed way at the same time.
  • the method defines and promotes behavioural diversity among the multiple best-response policies using a distributed version of the solver in which multiple best-responses can be found in one iteration.
  • Nash equilibrium exists in all finite games (see Nash, J. F. et al. , “Equilibrium points in n-person games”, Proceedings of the national academy of sciences, 36(1): 48-49, 1950).
  • the NE is a joint mixed-strategy profile p in which each player i ∈ N plays the best-response to the other players.
  • The framework of NFGs is often limited in describing real-world games. In solving games such as StarCraft or GO, it is inefficient to list all atomic actions. Instead, of more interest are games at the policy level, where a policy can be a “higher-level” strategy (e.g., an RL model powered by a DNN), and the resulting game is a meta-game. A meta-game payoff table, M, is constructed by simulating games that cover different policy combinations.
  • Meta-games are often open-ended because there could exist an infinite number of policies to play a game. The openness also refers to the fact that new strategies will be continuously discovered and added to agents’ policy sets during training; the dimension of M will grow.
  • In solving NFGs, fictitious play (FP) describes the learning process where each player chooses a best-response to their opponents’ time-average strategies, and the resulting strategies are guaranteed to converge to the NE in two-player zero-sum, or potential, games.
  • Generalised weakened fictitious play (GWFP) (see Leslie, D. S. and Collins, E. J., “Generalised weakened fictitious play”, Games and Economic Behavior , 56(2):285-298, 2006) generalises FP by allowing for approximate best-responses and perturbed average strategy updates.
  • GWFP is a process of strategy updates following the updating rule of Eq. (4) below.
  • a general solver for open-ended (meta-)games involves an iterative process of solving the equilibrium (meta-)policy first, and then, based on the (meta-)policy, finding a new better-performing policy to augment the existing population.
  • the (meta-)policy solver computes a joint (meta-)policy profile p based on the current payoff M (or G), where different solution concepts can be adopted (for example, NE or α-Rank).
  • in two-player zero-sum cases, an Oracle returns a best-response to the opponent’s (meta-)policy. Generally, Oracles can be implemented through optimisation subroutines such as gradient-descent methods or RL algorithms. After a new policy is learned, the payoff table is expanded, and the missing entries will be filled by running new game simulations. The above process loops over each player at every iteration, and it terminates if no player can find a new best-response policy (i.e., Eq. (1) reaches zero).
  • Algorithm 1 in Figure 1 shows an exemplary algorithm for general meta-game solvers. The step of finding a new policy is shown in step 5.
  • smooth FP (Fudenberg, D. and Levine, D., “Consistency and cautious fictitious play”, Journal of Economic Dynamics and Control, 1995) is a solver that accounts for diversity through adopting a policy entropy term in the original FP (Brown et al., 1951, see above).
  • Double Oracle (DO) (McMahan, H. B., Gordon, G. J., and Blum, A., “Planning in the presence of cost functions controlled by an adversary”, Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 536-543, 2003) provides an iterative method where agents progressively expand their policy pool by, at each iteration, adding one best-response versus the opponent’s Nash strategy.
  • PSRO generalises FP and DO via adopting a RL subroutine to approximate the best-response (Lanctot, M., Zambaldi, V., Gruslys, A., Lazaridou, A., Tuyls, K., Perolat, J., Silver, D., and Graepel, T., “A unified game-theoretic approach to multiagent reinforcement learning”, Advances in neural information processing systems, pp. 4190-4203, 2017).
  • Pipeline-PSRO (McAleer, S., Lanier, J., Fox, R., and Baldi, P., “Pipeline psro: A scalable approach for finding approximate nash equilibria in large games”, arXiv preprint arXiv:2006.08555, 2020) trains multiple best-responses in parallel and efficiently solves games of size 10^50.
  • PSRO_rN (Balduzzi, D., Garnelo, M., Bachrach, Y., Czarnecki, W., Perolat, J., Jaderberg, M., and Graepel, T., “Open-ended learning in symmetric zero-sum games”, ICML, volume 97, pp. 434-443, PMLR, 2019) is a variation of PSRO that accounts for diversity.
  • α-PSRO (Muller, P., Omidshafiei, S., Rowland, M., Tuyls, K., Perolat, J., Liu, S., Hennes, D., Marris, L., Lanctot, M., Hughes, E., et al., “A generalized training approach for multiagent learning”, International Conference on Learning Representations, 2019) replaces NE with α-Rank. Yet, how to promote diversity in the context of α-PSRO is still unknown.
  • a DPP is a probabilistic framework that characterises how likely a subset of items is to be sampled from a ground set where diverse subsets are preferred.
  • a DPP defines a probability measure P on the power set of the ground set (i.e., 2^Y) such that, given an M x M positive semi-definite (PSD) kernel L that measures the pairwise similarity of the items, and letting Y be a random subset drawn from the DPP, the probability of sampling a subset Y is proportional to det(L_Y), where L_Y denotes the submatrix of L whose entries are indexed by the items included in Y.
  • given a PSD kernel L = WW^T, each row W_i represents a P-dimensional feature vector of item i.
  • the geometric meaning of det(L_Y) is the squared volume of the parallelepiped spanned by the rows of W that correspond to the sampled items in Y.
  • a PSD matrix ensures all principal minors of L are non-negative (det(L_Y) ≥ 0), which suffices for a proper probability distribution.
  • the normaliser of the DPP can be computed as the sum of det(L_Y) over all subsets, which equals det(L + I), where I is the M x M identity matrix.
  • the entries of L are pairwise inner products between item vectors.
  • the kernel can intuitively be thought of as representing dual effects: the diagonal elements L_ii aim to capture the quality of item i, whereas the off-diagonal elements L_ij capture the similarity between the items i and j.
  • a DPP models the repulsive connections among the items in a sampled subset. For example, in a two-item subset {i, j}, the sampling probability is proportional to L_ii L_jj - L_ij L_ji, so two perfectly similar items will not co-occur.
  • the target is to find a population of diverse policies, with each of them performing differently from other policies due to their unique characteristics. Therefore, when modelling the behavioural diversity in games, the payoff matrix can be used to construct a DPP kernel so that the similarity between two policies depends on their performance in terms of payoffs against different types of opponents.
  • a game DPP (G-DPP) for each player is a DPP in which the ground set is the strategy population and the DPP kernel L is written by Eq. (10), which is a Gram matrix based on the payoff table M (see Figure 3): L = MM^T.
  • a diversity measure can be designed based on the expected cardinality of random samples from a G-DPP, i.e. E[|Y|].
  • the diversity metric, defined as the expected cardinality of a G-DPP, can be computed in polynomial time by the following equation: E[|Y|] = Tr(I - (L + I)^(-1)).
  • Figure 3 shows an example of a G-DPP.
  • the squared volume of the grey cube 300 is equal to the determinant of the corresponding kernel submatrix.
  • the payoff vectors of the three strategies are shown at 301, 302 and 303 respectively. Since the second and third strategies have similar payoff vectors (302 and 303), this leads to a smaller shaded area 304, and thus the probability of these two strategies co-occurring is low, i.e. the probability of selecting this pair (the shaded area 304) from the G-DPP is smaller than that of selecting a pair with orthogonal payoff vectors.
  • the diversity values given by Eq. (11) for the populations shown are 0, 1 and 1.2 respectively.
  • the diversity measure is therefore based on the expected cardinality of a determinantal point process.
  • the classical FP approach can be expanded to a diverse version such that at each iteration, the player not only considers a best-response, but also considers how this new strategy can help enrich the existing strategy pool after the update.
  • the diverse FP method maintains the same update rule as Eq. (4), but with the best-response of Eq. (12) changed to maximise the payoff plus τ times the diversity measure of the enlarged population, where τ is a tuneable constant.
  • Eq. (12) has a unique solution at each iteration.
  • the algorithm maintains a population of policies learned so far by player i.
  • the goal here is to design an Oracle to train a new strategy S_θ, parameterised by θ ∈ R^d (for example, a deep neural net), which both maximises player i’s payoff and is diverse from all strategies in the existing population. Therefore, the ground set of the G-DPP at iteration t can be defined to be the union of the existing population and the new model to add.
  • the diversity measure can be computed by Eq. (11).
  • the objective of an Oracle can be written as maximising the payoff against the opponent’s (meta-)policy plus τ times the diversity measure of the enlarged population, where the opponent’s (meta-)policy is that of player two. Depending on the game solver, it can be the NE, UNIFORM, etc.
  • the general solver may therefore approximate the Nash strategy in large-scale two-player zero-sum games.
  • Figure 4 shows an example of the pseudo-code for one implementation of the method.
  • the steps concerning the learning of multiple best-response policies are indicated at 401 , and for promoting diversity among all best-responses at 402.
  • Figure 5 illustrates a summary of the main goal of the approach.
  • a black-box multi-agent game engine 501 takes as input a joint strategy, shown at 502, and outputs the reward 503.
  • the output is multiple “good” strategies, as shown at 505.
  • Figure 6 summarises an example of a computer-implemented method 600 for processing a two-agent system input to form an at least partially optimised output indicative of an action policy.
  • the method comprises receiving the two-agent system input, the two-agent system input comprising a definition of a two-agent system and defining behaviour patterns of two agents based on system states.
  • the method comprises receiving an indication of an input system state.
  • the method comprises performing an iterative machine learning process to estimate multiple aggregate functions, each representing the behaviour patterns of the two agents over a set of system states, wherein the multiple aggregate functions are determined in a single iteration of the iterative machine learning process.
  • the multiple at least partially optimised outputs each comprise a collectively optimal action policy for each of the two agents in the input system state and a Nash equilibrium behaviour pattern of the two agents in the input system state.
  • the procedural diagram for the training of the solver can be visualized in the plot shown in Figure 7.
  • at each time step (i.e. each iteration of the iterative machine learning process), two policies are trained in parallel.
  • Each new policy is trained against all existing policies.
  • the first policy, shown at 701, is fixed at all time steps.
  • the policy shown at 702 is trained against the policy shown at 701 at time step 0, leading to the new policy shown at 703 at time step 1; the policy shown at 704 is trained against both of these, which leads to the policy shown at 705.
  • each newly generated policy will be diverse in the sense that it will be different from all existing policies. For example, the policy at 705 is diverse from the policies at 703 and 701 at time step 1. Once a newly generated policy converges in the training, it is kept fixed and unchanged in the pool. At time step 2, the policy shown at 706 converges, and its parameters will be fixed and stay unchanged in later time steps, as indicated at 707.
  • DBR denotes the diverse Oracle (best-response) function.
  • the approach described herein therefore offers a geometric interpretation of behavioural diversity for learning in game frameworks by introducing a new diversity measure built upon the expected cardinality of a DPP.
  • the diversity metric can be used as part of a general solver for normalform games and open-ended (meta)games.
  • the method can converge to NE and ⁇ -Rank in two-player games and show theoretical guarantees of expanding the gamescapes.
  • Figure 8 summarises an example of the process performed as part of the step of performing an iterative machine learning process.
  • the process comprises repeatedly performing the following steps until a predetermined level of convergence is reached.
  • the method comprises generating a set of random system states.
  • the set of random system states may be initially generated based on a predetermined probability distribution.
  • the method comprises estimating based on the two-agent system input the behaviour patterns of the two agents in the system states.
  • the method comprises estimating an error between the estimated behaviour patterns and the behaviour patterns predicted by each of multiple predetermined candidate aggregate functions, the error representing the level of convergence.
  • the method comprises adapting the multiple predetermined candidate aggregate functions based on the estimated behaviour patterns. This iterative machine learning process can be used to enable the device to find suitable aggregate functions in a manageable time period.
  • each of the agents can implement a respective action of the at least partially optimised set of actions.
  • the iterative method is scalable for approximating Nash equilibria in two-player zero-sum games.
  • the method preferably involves a parallel double-oracle scheme that is designed to find multiple best-responses in a distributed way at the same time.
  • the preferred implementation of the method defines and promotes so-called behavioural diversity among the multiple best-responses based on a determinantal point process.
  • the method has been shown in some embodiments to demonstrate state-of-the-art performance, outperforming existing baselines, in approximating Nash equilibrium in large-scale two-player zero-sum games.
  • Figure 9 shows a schematic diagram of a computing device 900 configured to implement the computer implemented method described above and its associated components.
  • the device may comprise a processor 901 and a non-volatile memory 902.
  • the system may comprise more than one processor and more than one memory.
  • the memory may store data that is executable by the processor.
  • the processor may be configured to operate in accordance with a computer program stored in non-transitory form on a machine readable storage medium.
  • the computer program may store instructions for causing the processor to perform its methods in the manner described herein.
  • the agents may be autonomous vehicles and the system states may be vehicular system states.
  • the agents may be communications routing devices and the system states may be data flows.
  • the agents may be data processing devices and the system states may be computation tasks.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

Described is a computer-implemented device (900) for processing a two-agent system input to form multiple at least partially optimised outputs, each output indicative of an action policy for each of the two agents, the device comprising one or more processors (901) configured to perform the steps of: receiving (601) the two-agent system input, the two-agent system input comprising a definition of a two-agent system and defining behaviour patterns of two agents based on system states; receiving (602) an indication of an input system state; and performing (603) an iterative machine learning process (800) to estimate multiple aggregate functions, each representing the behaviour patterns of the two agents over a set of system states, wherein the multiple aggregate functions are determined in a single iteration of the iterative machine learning process. This may allow for an iterative method that is scalable for approximating Nash equilibria in two-player zero-sum games.

Description

DEVICE AND METHOD FOR APPROXIMATING NASH EQUILIBRIUM IN TWO-PLAYER
ZERO-SUM GAMES
FIELD OF THE INVENTION
This invention relates to a computer-implemented device and method for application in two-player zero-sum game frameworks, particularly to approximating Nash equilibrium and promoting the diversity of policies in such frameworks.
BACKGROUND
Computing the strategic configuration in which the agents in a system are executing their best-response actions is difficult because of the interdependencies between each of the agents’ actions. In particular, a desirable configuration is known as a fixed point. This is a configuration in which no agent can improve their payoff by unilaterally changing their current policy behaviour. This concept is known as a Nash equilibrium (NE).
A simple example of a two-player zero-sum game is the Rock-Paper-Scissors game, where Rock beats Scissors, Scissors beats Paper, and Paper beats Rock; the Nash equilibrium is to play the three strategies uniformly (1/3, 1/3, 1/3). When a player plays the Nash strategy, he can no longer be exploited. However, in more sophisticated two-player zero-sum games, such as Texas Hold'em Poker or StarCraft, where the strategy space is much larger (for example, StarCraft has 10^26 atomic actions at every time step), it is required to design approximate solvers to compute the Nash equilibrium.
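As a concrete illustration (not part of the original disclosure), the following minimal Python sketch builds the Rock-Paper-Scissors payoff matrix and checks that no pure strategy gains a positive expected payoff against the uniform mixture, confirming that (1/3, 1/3, 1/3) cannot be exploited:

import numpy as np

# Row player's payoff for Rock-Paper-Scissors (win = +1, loss = -1, tie = 0);
# rows and columns are ordered (Rock, Paper, Scissors).
G = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]])

uniform = np.array([1/3, 1/3, 1/3])
payoffs_vs_uniform = G @ uniform   # expected payoff of each pure strategy vs the uniform mix
print(payoffs_vs_uniform)          # [0. 0. 0.]
print(payoffs_vs_uniform.max())    # 0.0: no deviation gains anything, so (1/3, 1/3, 1/3) is a NE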
Many real-world applications, such as designing gaming AI, involve solving an approximate Nash equilibrium in two-player zero-sum games. Often, these games comprise a large number of dimensions which make traditional Linear Programming solvers infeasible and require scalable methods to solve them. In designing scalable approximating solutions, promoting behavioural diversity during training is very important. Promoting behavioural diversity is particularly important for solving games with non-transitive dynamics where strategic cycles exist, and there is no consistent winner. For example, a player who can only play Rock can never win a Rock-Paper-Scissors game. Yet, there is a lack of rigorous treatment for defining diversity and constructing diversity-aware learning dynamics. In essence, existing solvers either cannot solve large-scale zero-sum games, or they do not promote behavioural diversity when approximating Nash equilibrium. In general, an arbitrary game of either the normal-form type (for example, see Candogan, O., Menache, I., Ozdaglar, A., and Parrilo, P. A., “Flows and decompositions of games: Harmonic and potential games”, Mathematics of Operations Research, 36(3):474-503, 2011) or the differential type (for example, see Balduzzi, D., Racaniere, S., Martens, J., Foerster, J., Tuyls, K., and Graepel, T., “The mechanics of n-player differentiable games”, ICML, volume 80, pp. 363-372. JMLR.org, 2018a) can always be decomposed into a sum of two components: a transitive part and a non-transitive part. The transitive part of a game represents the structure in which the rule of winning is transitive (i.e., if strategy A beats B, B beats C, then A beats C), and the non-transitive part refers to the structure in which the set of strategies follows a cyclic rule (for example, the endless cycles among Rock, Paper and Scissors). Diversity matters, especially for the non-transitive part simply because there is no consistent winner in such part of a game: if a player only plays Rock, he can be exploited by Paper, but not so if he has a diverse strategy set of Rock and Scissors.
Many real-world games demonstrate strong nontransitivity (for example, see Czarnecki, W. M., Gidel, G., Tracey, B., Tuyls, K., Omidshafiei, S., Balduzzi, D., and Jaderberg, M., “Real world games look like spinning tops”, arXiv, pp. arXiv-2004, 2020). Therefore, it is highly desirable to design objectives in the learning framework that can lead to behavioural diversity. In multi-agent reinforcement learning (MARL), promoting diversity not only prevents AI agents from checking the same policies repeatedly, but more importantly, helps them discover niche skills, avoid being exploited and maintain robust performance when encountering unfamiliar types of opponents. In the examples of building AIs to play StarCraft (see Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H., Powell, R., Ewalds, T., Georgiev, P., et al., “Grandmaster level in starcraft ii using multi-agent reinforcement learning”, Nature, 575(7782):350-354, 2019b), Honour of Kings (Ye, D., Chen, G., Zhang, W., Chen, S., Yuan, B., Liu, B., Chen, J., Liu, Z., Qiu, F., Yu, H., et al., “Towards playing full moba games with deep reinforcement learning”, arXiv e-prints, pp. arXiv-2011, 2020) and Soccer (Kurach, K., Raichuk, A., Stanczyk, P., Zajac, M., Bachem, O., Espeholt, L., Riquelme, C., Vincent, D., Michalski, M., Bousquet, O., et al., “Google research football: A novel reinforcement learning environment”, Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 4501-4510, 2020), learning a diverse set of strategies has been reported as an imperative step in strengthening AI’s performance.
Despite the importance of diversity, there is very little prior work that offers a rigorous treatment in even defining diversity. The majority of work so far has followed a heuristic approach. For example, the idea of co-evolution (see Durham, W. H., “Coevolution: Genes, culture, and human diversity”, Stanford University Press, 1991, and Paredis, J. Coevolutionary computation. Artificial life, 2(4): 355-375, 1995) has drawn forth a series of effective methods, such as open-ended evolution (see Standish, R. K, Open-ended artificial evolution”, International Journal of Computational Intelligence and Applications, 3(02): 167— 175, 2003, Banzhaf, W., Baumgaertner, B., Beslon, G., Doursat, R., Foster, J. A., McMullin, B., De Melo, V. V., Miconi, T., Spector, L, Stepney, S., et al., “Defining and simulating open-ended novelty: requirements, guidelines, and challenges”, Theory in Biosciences, 135(3):131— 161 , 2016, and Lehman, J. and Stanley, K. O., “Exploiting open-endedness to solve problems through the search for novelty”, ALIFE, pp. 329-336, 2008), population based training methods (see Jaderberg, M., Czarnecki, W. M., Dunning, I., Marris, L., Lever, G., Castaneda, A. G., Beattie, C., Rabinowitz, N. C., Morcos, A. S., Ruderman, A., et al., “Human-level performance in 3d multiplayer games with population based reinforcement learning”, Science, 364(6443): 859- 865, 2019, and Liu, S., Lever, G., Merel, J., Tunyasuvunakool, S., Heess, N., and Graepel, T., “Emergent coordination through competition”, International Conference on Learning Representations, 2018), and auto-curricula (see Leibo, J. Z., Hughes, E., Lanctot, M., and Graepel, T., “Autocurricula and the emergence of innovation from social interaction: A manifesto for multi-agent intelligence research”, arXiv, pp. arXiv-1903, 2019 and Baker, B., Kanitscheider, I., Markov, T., Wu, Y., Powell, G., McGrew, B., and Mordatch, I., “Emergent tool use from multi-agent autocurricula”, International Conference on Learning Representations, 2019).
Despite many empirical successes, the lack of rigorous treatment for behavioural diversity still hinders one from developing a principled approach.
It is desirable to develop a method that overcomes such problems.
SUMMARY OF THE INVENTION
According to one aspect there is provided a computer-implemented device for processing a two-agent system input to form multiple at least partially optimised outputs, each output indicative of an action policy for each of the two agents, the device comprising one or more processors configured to perform the steps of: receiving the two-agent system input, the two-agent system input comprising a definition of a two-agent system and defining behaviour patterns of two agents based on system states; receiving an indication of an input system state; and performing an iterative machine learning process to estimate multiple aggregate functions, each representing the behaviour patterns of the two agents over a set of system states, wherein the multiple aggregate functions are determined in a single iteration of the iterative machine learning process. This may allow for an iterative method that is scalable for approximating Nash equilibria in two-player zero-sum game frameworks.
The multiple aggregate functions may correspond to multiple best-response policies for the agents.
The processor may be configured to iteratively process the multiple aggregate functions for the input system state to estimate multiple at least partially optimised set of actions for each of the two agents in the input system state. The multiple aggregate functions may be iteratively processed until a predefined level of convergence is reached.
The multiple aggregate functions may be determined in a single iteration of the iterative machine learning process in parallel. For example, the device may implement a parallel double-oracle scheme that is designed to find multiple best-response policies in a distributed way at the same time.
Multiple aggregate functions may be determined in each iteration of the machine learning process. The multiple aggregate functions may be refined in subsequent iterations of the iterative machine learning process. This may allow the device to keep finding best-response strategies in an iterative manner.
The iterative machine learning process may be performed so as to promote behavioural diversity among the multiple aggregate functions determined in each iteration. Promoting diversity of best-response policies may strengthen the performance of a model trained by the iterative machine learning process.
The iterative machine learning process may be performed in dependence on a diversity measure. The diversity measure may be modelled by a determinantal point process. The diversity measure may be based on the expected cardinality of a determinantal point process. This may allow diverse best-response policies to be determined.
The multiple at least partially optimised outputs may each comprise a collectively optimal action policy for each of the two agents in the input system state. This may allow for optimal behaviour of the agents. The multiple at least partially optimised outputs may each represent a Nash equilibrium behaviour pattern of the two agents in the input system state. This may allow policies corresponding to the Nash equilibrium to be learned.
The step of performing an iterative machine learning process may comprise repeatedly performing the following steps until a predetermined level of convergence is reached: generating a set of random system states; estimating based on the two-agent system input the behaviour patterns of the two agents in the system states; estimating an error between the estimated behaviour patterns and the behaviour patterns predicted by each of multiple predetermined candidate aggregate functions, the error representing the level of convergence; and adapting the multiple predetermined candidate aggregate functions based on the estimated behaviour patterns. This can enable the device to find suitable aggregate functions in a manageable time period.
The set of random system states may be generated based on a predetermined probability distribution. This may be convenient for generating the system states.
The agents may be autonomous vehicles and the system states may be vehicular system states. This may allow the device to be implemented in a driverless car.
The agents may be data processing devices and the system states may be computation tasks. This may allow the device to be implemented in a communication system.
According to a second aspect there is provided a method for processing a two-agent system input to form multiple at least partially optimised outputs, each indicative of an action policy, the method comprising the steps of: receiving the two-agent system input, the two-agent system input comprising a definition of a two-agent system and defining behaviour patterns of two agents based on system states; receiving an indication of an input system state; and performing an iterative machine learning process to estimate multiple aggregate functions, each representing the behaviour patterns of the two agents over a set of system states, wherein the multiple aggregate functions are determined in a single iteration of the iterative machine learning process.
This may allow for an iterative method that is scalable for approximating Nash equilibria in two-player zero-sum game frameworks. The method may further comprise the step of causing each of the agents to implement a respective action of the at least partially optimised set of actions. This can result in efficient operation of the agents. In this way the method can be used to control the actions of a physical entity.
According to a third aspect there is provided a computer readable medium storing in non-transient form a set of instructions for causing one or more processors to perform the method described above. The method may be performed by a computer system comprising one or more processors programmed with executable code stored non-transiently in one or more memories.
BRIEF DESCRIPTION OF THE FIGURES
The present invention will now be described by way of example with reference to the accompanying drawings. In the drawings:
Figure 1 shows an algorithm for general meta-game solvers.
Figure 2 shows a summary of prior methods.
Figure 3 shows an example of a determinantal point process.
Figure 4 shows an example of the pseudo-code for one implementation of the method described herein.
Figure 5 schematically illustrates the main goal of the approach described herein.
Figure 6 summarises an example of a method of processing a two-agent system input to form multiple at least partially optimised outputs, each indicative of an action policy.
Figure 7 schematically illustrates an example of a procedural diagram of the training of the solver.
Figure 8 summarises an example of the steps of the iterative machine learning process described herein.
Figure 9 shows an example of a computing device configured to perform the methods described herein.
DETAILED DESCRIPTION
Described herein is a computer-implemented device and method for application in two-player zero-sum game frameworks, implementing a general Nash solver suitable for large-scale two-player zero-sum games.
As will be described in more detail below, the approach can provide a parallel implementation to keep finding best-response strategies for the two agents in an iterative manner. Furthermore, the approach can find policies that are diverse in behaviours. In other words, the solver promotes behavioural diversity during the learning process.
The preferred implementation of the approach described herein offers a geometric interpretation of behavioural diversity in game frameworks and introduces a diversity metric based on determinantal point processes (DPP).
A DPP is a type of point process, which measures the probability of selecting a random subset from a ground set where only diverse subsets are desired. DPPs have origins in modelling repulsive quantum particles in physics (see Macchi, O., “The fermion process - a model of stochastic point process with repulsive points”, Transactions of the Seventh Prague Conference on Information Theory, Statistical Decision Functions, Random Processes and of the 1974 European Meeting of Statisticians, pp. 391-398. Springer, 1977).
In the preferred implementation of the method described herein, the expected cardinality of a DPP is formulated as the diversity metric. The diversity metric is a general tool for game solvers. The diversity metric is incorporated into the best-response dynamics, and diversity-aware extensions of fictitious play (FP) (see Brown, G. W., “Iterative solution of games by fictitious play”, Activity analysis of production and allocation, 13(1):374-376, 1951) and policy-space response oracles (PSRO) (see Lanctot, M., Zambaldi, V., Gruslys, A., Lazaridou, A., Tuyls, K., Perolat, J., Silver, D., and Graepel, T., “A unified game-theoretic approach to multiagent reinforcement learning”, Advances in neural information processing systems, pp. 4190-4203, 2017) are developed. By incorporating the diversity metric into the best-response dynamics, diverse FP and diverse PSRO may be developed for solving normal-form games and open-ended games.
Theoretically, maximising the DPP-based diversity metric guarantees an expansion of the gamescape (convex polytopes spanned by agents’ mixtures of policies). Meanwhile, the diversity-aware learning methods may converge to the respective solution concept of Nash equilibrium and α-Rank (see Omidshafiei, S., Papadimitriou, C., Piliouras, G., Tuyls, K., Rowland, M., Lespiau, J.-B., Czarnecki, W. M., Lanctot, M., Perolat, J., and Munos, R., “α-rank: Multi-agent evaluation by evolution”, Scientific reports, 9(1): 1-29, 2019) in two-player games.
A further preferred implementation of the method involves a parallel double-oracle scheme that is designed to find multiple best-responses in a distributed way at the same time. The method defines and promotes behavioural diversity among the multiple best-response policies using a distributed version of the solver in which multiple best-responses can be found in one iteration.
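One way such a parallel Oracle step might be organised is sketched below. This is a hypothetical illustration only: the function names, the random-search "training" stub and the thread pool are assumptions rather than the patented implementation. It simply shows several best-responses being produced within a single iteration.

import numpy as np
from concurrent.futures import ThreadPoolExecutor

def train_best_response(opponent_meta_policy, payoff_against, n_candidates=100, seed=0):
    """Hypothetical Oracle: returns one approximate best-response "policy".
    payoff_against(candidate) is assumed to return the candidate's payoff vector
    against each opponent policy; a real Oracle would run an RL subroutine instead."""
    rng = np.random.default_rng(seed)
    candidates = [rng.standard_normal(8) for _ in range(n_candidates)]  # toy policy parameters
    scores = [payoff_against(c) @ opponent_meta_policy for c in candidates]
    return candidates[int(np.argmax(scores))]

def parallel_oracle(opponent_meta_policy, payoff_against, n_best_responses=4):
    # Several best-responses are produced within one iteration, "at the same time".
    # A thread pool stands in here for the distributed workers of the full scheme.
    with ThreadPoolExecutor(max_workers=n_best_responses) as pool:
        futures = [pool.submit(train_best_response, opponent_meta_policy,
                               payoff_against, 100, seed)
                   for seed in range(n_best_responses)]
        return [f.result() for f in futures]

In a genuinely distributed deployment, each worker would typically run on its own device or process and train an RL model rather than performing a random search.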
The following basic notations are first introduced to aid understanding and to highlight differences of the present approach over the prior art.
Consider a normal-form game (NFG) denoted by $\mathcal{G} = (N, \{S^i\}_{i \in N}, \{G^i\}_{i \in N})$, where each player $i \in N$ has a finite set of pure strategies $S^i$. Let $S = \prod_{i \in N} S^i$ denote the space of joint pure-strategy profiles, and $S^{-i}$ denote the set of joint strategy profiles except the i-th player. A mixed strategy of player i is written as $p^i \in \Delta_{S^i}$, where $\Delta$ is a probability simplex. A joint mixed-strategy profile is $p = \prod_{i \in N} p^i$, where $p(S)$ represents the probability of joint strategy profile $S$. For each $S \in S$, let $G(S) = (G^1(S), \dots, G^{|N|}(S))$ denote the vector of payoff values for each player. The expected payoff of player i under a joint mixed-strategy profile p is thus written as $G^i(p) = \mathbb{E}_{S \sim p}[G^i(S)] = \sum_{S} p(S)\, G^i(S)$.
Nash equilibrium (NE) exists in all finite games (see Nash, J. F., “Equilibrium points in n-person games”, Proceedings of the National Academy of Sciences, 36(1): 48-49, 1950). The NE is a joint mixed-strategy profile p in which each player $i \in N$ plays the best-response to the other players, i.e. $G^i(p^i, p^{-i}) \geq G^i(\hat{p}^i, p^{-i})$ for all $\hat{p}^i \in \Delta_{S^i}$. For $\epsilon > 0$, an $\epsilon$-best-response to $p^{-i}$ is a profile $p^i$ satisfying $G^i(p^i, p^{-i}) \geq \max_{\hat{p}^i} G^i(\hat{p}^i, p^{-i}) - \epsilon$. The exploitability (see Davis, T., Burch, N., and Bowling, M., “Using response functions to measure strategy strength”, Proceedings of the AAAI Conference on Artificial Intelligence, volume 28, 2014) measures the distance of a joint strategy profile p to a NE, written as:

$$\mathrm{Exploitability}(p) = \sum_{i \in N}\Big(\max_{\hat{p}^i} G^i(\hat{p}^i, p^{-i}) - G^i(p)\Big) \qquad (1)$$

When the exploitability reaches zero, all players reach their best-responses, and thus p is a NE.
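For a finite two-player zero-sum game given by a payoff matrix, the exploitability of Eq. (1) can be evaluated directly. The sketch below is illustrative only; G holds player 1's payoffs and player 2 receives the negative.

import numpy as np

def exploitability(G, p1, p2):
    """Exploitability of the joint profile (p1, p2); G[i, j] is player 1's payoff,
    player 2 receives -G[i, j] (zero-sum)."""
    value = p1 @ G @ p2
    br1_value = np.max(G @ p2)         # best pure response of player 1 to p2
    br2_value = np.max(-(p1 @ G))      # best pure response of player 2 to p1
    # Sum of what each player could still gain; zero exactly at a Nash equilibrium.
    return (br1_value - value) + (br2_value + value)

G = np.array([[0., -1., 1.], [1., 0., -1.], [-1., 1., 0.]])       # Rock-Paper-Scissors
print(exploitability(G, np.ones(3) / 3, np.ones(3) / 3))          # 0.0 at the Nash equilibrium
print(exploitability(G, np.array([1., 0., 0.]), np.ones(3) / 3))  # 1.0: "always Rock" is exploitable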
The framework of NFGs is often limited in describing real-world games. In solving games such as StarCraft or GO, it is inefficient to list all atomic actions. Instead, of more interest are games at the policy level, where a policy can be a “higher-level” strategy (e.g., an RL model powered by a DNN), and the resulting game is a meta-game. A meta-game payoff table, M, is constructed by simulating games that cover different policy combinations. In meta-games, $\mathbb{S}^i$ can be used to denote the policy set (e.g., a population of deep RL models), and $\pi^i \in \Delta_{\mathbb{S}^i}$ may be used to denote the meta-policy (e.g., player i plays [RL-Model 1, RL-Model 2] with probability [0.3, 0.7]), and thus $p = (\pi^1, \dots, \pi^{|N|})$ is a joint meta-policy profile. Meta-games are often open-ended because there could exist an infinite number of policies to play a game. The openness also refers to the fact that new strategies will be continuously discovered and added to agents’ policy sets during training; the dimension of M will grow.
In the meta-game analysis (a.k.a. empirical game-theoretic analysis), traditional solution concepts (for example, NE or α-Rank) can still be computed based on M, even in a more scalable manner. This is because the number of “higher-level” strategies in the meta-game is usually far smaller than the number of atomic actions of the underlying game. For example, in tackling StarCraft (see Vinyals, O., Babuschkin, I., Chung, J., Mathieu, M., Jaderberg, M., Czarnecki, W. M., Dudzik, A., Huang, A., Georgiev, P., Powell, R., et al. Alphastar: Mastering the real-time strategy game starcraft ii. DeepMind blog, 2, 2019a), hundreds of deep RL models were trained, which is a trivial amount compared to the number of atomic actions: 10^26 at every timestep.
Many real-world games such as Poker, GO and StarCraft can be described through an open-ended zero-sum meta-game. Given a game engine $\phi: \mathbb{S}^1 \times \mathbb{S}^2 \to \mathbb{R}$, where $\phi(S^1, S^2) > 0$ means $S^1$ beats $S^2$, and $\phi < 0$, $\phi = 0$ refer to losses and ties, the meta-game payoff is:

$$\mathbf{M} = \big\{\phi(S^1, S^2) : (S^1, S^2) \in \mathbb{S}^1 \times \mathbb{S}^2\big\}$$

A game is symmetric if $\mathbb{S}^1 = \mathbb{S}^2$ and $\phi(S^1, S^2) = -\phi(S^2, S^1)$. It is transitive if there is a monotonic rating function $f$ such that $\phi(S^1, S^2) = f(S^1) - f(S^2), \forall S^1, S^2$, meaning that performance on the game is the difference in ratings. It is non-transitive if $\sum_{S^2 \in \mathbb{S}^2} \phi(S^1, S^2) = 0, \forall S^1 \in \mathbb{S}^1$, meaning that winning against some strategies will be counterbalanced by losses against others; the game has no consistent winner. Lastly, the gamescape of a population of strategies $\mathbb{S}$ (see Balduzzi, D., Garnelo, M., Bachrach, Y., Czarnecki, W., Perolat, J., Jaderberg, M., and Graepel, T., “Open-ended learning in symmetric zero-sum games”, ICML, volume 97, pp. 434-443, PMLR, 2019) in a meta-game is defined as the convex hull of the payoff vectors of all policies in $\mathbb{S}$, written as:

$$\mathcal{GS}(\mathbb{S}) := \Big\{\sum_{k} \alpha_k \mathbf{M}_k : \alpha_k \geq 0, \sum_{k} \alpha_k = 1, \ \mathbf{M}_k \text{ the k-th row of } \mathbf{M}\Big\}$$
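The construction of a meta-game payoff table from a game engine φ can be illustrated with a toy engine. This is a sketch under the assumption that the engine simply implements the Rock-Paper-Scissors rule; a real engine would run full game simulations.

import numpy as np

def phi(s1, s2):
    """Toy game engine: phi > 0 if s1 beats s2, < 0 if it loses, 0 for a tie.
    Strategies are encoded 0 = Rock, 1 = Paper, 2 = Scissors."""
    return float((s1 - s2) % 3 == 1) - float((s2 - s1) % 3 == 1)

population_1 = [0, 1, 2]   # player 1's current policy set
population_2 = [0, 1, 2]   # player 2's current policy set

# Meta-game payoff table M: simulate every policy combination once.
M = np.array([[phi(s1, s2) for s2 in population_2] for s1 in population_1])

# For a symmetric zero-sum meta-game, phi(s1, s2) = -phi(s2, s1).
assert np.allclose(M, -M.T)
print(M)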
In solving NFGs, fictitious play (FP) describes the learning process where each player chooses a best-response to their opponents’ time-average strategies, and the resulting strategies are guaranteed to converge to the NE in two-player zero-sum, or potential, games. Generalised weakened fictitious play (GWFP) (see Leslie, D. S. and Collins, E. J., “Generalised weakened fictitious play”, Games and Economic Behavior, 56(2):285-298, 2006) generalises FP by allowing for approximate best-responses and perturbed average strategy updates.

GWFP is a process of mixed strategies $\{p_t\}$ following the updating rule:

$$p^i_{t+1} \in (1 - \alpha_{t+1})\, p^i_t + \alpha_{t+1}\big(\mathrm{BR}^i_{\epsilon_t}(p^{-i}_t) + M^i_{t+1}\big) \qquad (4)$$

where $\alpha_t \to 0$ and $\epsilon_t \to 0$ as $t \to \infty$, $\sum_{t \geq 1} \alpha_t = \infty$, and $\{M_t\}$ is a sequence of perturbations that satisfies, $\forall T > 0$:

$$\lim_{t \to \infty}\ \sup_{k}\Big\{\Big\|\sum_{j=t}^{k-1} \alpha_{j+1} M_{j+1}\Big\| : \sum_{j=t}^{k-1} \alpha_{j+1} \leq T\Big\} = 0$$

GWFP recovers FP if $\alpha_t = 1/t$, $\epsilon_t = 0$ and $M_t = 0, \forall t$.
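A plain fictitious-play loop consistent with the update rule above (taking the 1/t-style averaging with ε_t = 0 and M_t = 0, i.e. classical FP) might be sketched as follows; this is an illustration, not the claimed method.

import numpy as np

def fictitious_play(G, iterations=5000):
    """Classical fictitious play on a zero-sum game with payoff matrix G
    (player 1 maximises G, player 2 maximises -G)."""
    n, m = G.shape
    counts1, counts2 = np.ones(n), np.ones(m)       # initial (arbitrary) play counts
    for t in range(1, iterations + 1):
        avg1, avg2 = counts1 / counts1.sum(), counts2 / counts2.sum()
        # Each player best-responds to the opponent's time-average strategy.
        br1 = int(np.argmax(G @ avg2))
        br2 = int(np.argmax(-(avg1 @ G)))
        counts1[br1] += 1
        counts2[br2] += 1
    return counts1 / counts1.sum(), counts2 / counts2.sum()

G = np.array([[0., -1., 1.], [1., 0., -1.], [-1., 1., 0.]])   # Rock-Paper-Scissors
p1, p2 = fictitious_play(G)
print(p1, p2)   # both time-averages approach the uniform Nash strategy (1/3, 1/3, 1/3)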
A general solver for open-ended (meta-)games involves an iterative process of solving the equilibrium (meta-)policy first, and then, based on the (meta-)policy, finding a new better-performing policy to augment the existing population. The (meta-)policy solver computes a joint (meta-)policy profile p based on the current payoff M (or G), where different solution concepts can be adopted (for example, NE or α-Rank). With p, each agent then finds a new best-response policy, which is equivalent to solving a single-player optimisation problem against the opponents’ (meta-)policies $p^{-i}$. One can regard a best-response policy as given by an Oracle, denoted by $\mathcal{O}$. In two-player zero-sum cases, an Oracle represents

$$\mathcal{O}^i(p^{-i}) := \arg\max_{S^i \in \mathbb{S}^i} G^i(S^i, p^{-i})$$

Generally, Oracles can be implemented through optimisation subroutines such as gradient-descent methods or RL algorithms. After a new policy is learned, the payoff table is expanded, and the missing entries will be filled by running new game simulations. The above process loops over each player at every iteration, and it terminates if no players can find new best-response policies (i.e., Eq. (1) reaches zero). Algorithm 1 in Figure 1 shows an exemplary algorithm for general meta-game solvers. The step of finding a new policy is shown in step 5.
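The loop of Algorithm 1 can be sketched for a small normal-form game with an exact best-response Oracle. The code below is illustrative only: it uses a linear-programming Nash meta-solver on the restricted payoff table, whereas the described approach may equally plug in α-Rank meta-solvers and RL-based Oracles.

import numpy as np
from scipy.optimize import linprog

def solve_zero_sum_nash(A):
    """Maximin (Nash) strategy of the row player for payoff matrix A, via an LP."""
    n, m = A.shape
    c = np.zeros(n + 1); c[-1] = -1.0                      # maximise the game value v
    A_ub = np.hstack([-A.T, np.ones((m, 1))])              # v - (A^T x)_j <= 0 for all j
    A_eq = np.hstack([np.ones((1, n)), np.zeros((1, 1))])  # x lies on the simplex
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(m), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * n + [(None, None)])
    return res.x[:n]

def meta_game_solver(G, max_iter=50):
    """Double-oracle style loop: solve the restricted meta-game, then expand each
    population with a best response to the opponent's meta-policy."""
    pop1, pop2 = [0], [0]                                  # start from arbitrary policies
    for _ in range(max_iter):
        M = G[np.ix_(pop1, pop2)]                          # restricted payoff table
        meta1 = solve_zero_sum_nash(M)                     # meta-policy of player 1
        meta2 = solve_zero_sum_nash(-M.T)                  # meta-policy of player 2
        br1 = int(np.argmax(G[:, pop2] @ meta2))           # Oracle: best response to meta2
        br2 = int(np.argmax(-(meta1 @ G[pop1, :])))        # Oracle: best response to meta1
        expanded = False
        if br1 not in pop1: pop1.append(br1); expanded = True
        if br2 not in pop2: pop2.append(br2); expanded = True
        if not expanded:                                   # no new best responses: terminate
            break
    return pop1, pop2

G = np.array([[0., -1., 1.], [1., 0., -1.], [-1., 1., 0.]])   # Rock-Paper-Scissors
print(meta_game_solver(G))   # both populations expand to cover all three strategies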
With the above notations, the prior art can be summarised in the table shown in Figure 2. These prior methods cannot promote behavioural diversity in large-scale games.
For two-player zero-sum games, smooth FP (Fudenberg, D. and Levine, D., “Consistency and cautious fictitious play”, Journal of Economic Dynamics and Control, 1995) is a solver that accounts for diversity through adopting a policy entropy term in the original FP (Brown et al., 1951, see above).
When the game size is large, Double Oracle (DO) (McMahan, H. B., Gordon, G. J., and Blum, A., “Planning in the presence of cost functions controlled by an adversary”, Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 536-543, 2003) provides an iterative method where agents progressively expand their policy pool by, at each iteration, adding one best-response versus the opponent’s Nash strategy.
PSRO generalises FP and DO via adopting an RL subroutine to approximate the best-response (Lanctot, M., Zambaldi, V., Gruslys, A., Lazaridou, A., Tuyls, K., Perolat, J., Silver, D., and Graepel, T., “A unified game-theoretic approach to multiagent reinforcement learning”, Advances in neural information processing systems, pp. 4190-4203, 2017). Pipeline-PSRO (McAleer, S., Lanier, J., Fox, R., and Baldi, P., “Pipeline psro: A scalable approach for finding approximate nash equilibria in large games”, arXiv preprint arXiv:2006.08555, 2020) trains multiple best-responses in parallel and efficiently solves games of size 10^50. PSRO_rN (Balduzzi, D., Garnelo, M., Bachrach, Y., Czarnecki, W., Perolat, J., Jaderberg, M., and Graepel, T., “Open-ended learning in symmetric zero-sum games”, ICML, volume 97, pp. 434-443. PMLR, 2019) is a specific variation of PSRO that accounts for diversity. However, it suffers from poor performance in a selection of tasks. Since computing NE is PPAD-hard, another important extension of PSRO is α-PSRO (Muller, P., Omidshafiei, S., Rowland, M., Tuyls, K., Perolat, J., Liu, S., Hennes, D., Marris, L., Lanctot, M., Hughes, E., et al., “A generalized training approach for multiagent learning”, International Conference on Learning Representations, 2019), which replaces NE with α-Rank. Yet, how to promote diversity in the context of α-PSRO is still unknown.
Using the method described herein, diversity-aware extensions of FP, PSRO and α-PSRO can be developed. The method introduces two primary differences over the prior art. Firstly, a distributed implementation of zero-sum game solvers is designed. At each iteration of the training process, the algorithm can search for multiple best-response policies. In contrast, prior methods only consider finding one new best-response policy at each iteration. Secondly, behavioural diversity of the many policies can be defined based on the determinantal point process. Therefore, algorithms can be developed that promote behavioural diversity when the many best-response policies are searched for.
Therefore, instead of choosing between amplifying strengths or overcoming weaknesses, a different approach is adopted of modelling the behavioural diversity in games.
As noted above, a DPP is a probabilistic framework that characterises how likely a subset of items is to be sampled from a ground set where diverse subsets are preferred.
Formally, for a ground set $\mathcal{Y} = \{1, 2, ..., M\}$, a DPP defines a probability measure $\mathbb{P}$ on the power set of $\mathcal{Y}$ (i.e., $2^{\mathcal{Y}}$) as follows. Given an $M \times M$ positive semi-definite (PSD) kernel $\mathcal{L}$ that measures the pairwise similarity for items in $\mathcal{Y}$, and letting $\mathbf{Y}$ be a random subset drawn from the DPP, the probability of sampling any $Y \subseteq \mathcal{Y}$ is written as:

$$\mathbb{P}_{\mathcal{L}}(\mathbf{Y} = Y) \propto \det(\mathcal{L}_Y),$$

where $\mathcal{L}_Y := [\mathcal{L}_{i,j}]_{i,j \in Y}$ denotes the submatrix of $\mathcal{L}$ whose entries are indexed by the items included in $Y$. Given a PSD kernel decomposed as $\mathcal{L} = WW^{\top}$ with $W \in \mathbb{R}^{M \times P}$, where each row $w_i$ represents a $P$-dimensional feature vector of item $i \in \mathcal{Y}$, the geometric meaning of $\det(\mathcal{L}_Y) = \mathrm{Vol}^2\!\left(\{w_i\}_{i \in Y}\right)$ is the squared volume of the parallelepiped spanned by the rows of $W$ that correspond to the sampled items in $Y$.
A PSD kernel ensures that all principal minors of $\mathcal{L}$ are non-negative, i.e. $\det(\mathcal{L}_Y) \geq 0$, which suffices for $\mathbb{P}_{\mathcal{L}}$ to be a proper probability distribution. The normaliser of $\mathbb{P}_{\mathcal{L}}$ can be computed as $\sum_{Y \subseteq \mathcal{Y}} \det(\mathcal{L}_Y) = \det(\mathcal{L} + I)$, where $I$ is the $M \times M$ identity matrix.
The entries of $\mathcal{L} = WW^{\top}$ are pairwise inner products between item feature vectors, $\mathcal{L}_{i,j} = w_i \cdot w_j$. The kernel can intuitively be thought of as representing dual effects: the diagonal elements $\mathcal{L}_{i,i}$ aim to capture the quality of item $i$, whereas the off-diagonal elements $\mathcal{L}_{i,j}$ capture the similarity between the items $i$ and $j$. A DPP models the repulsive connections among the items in a sampled subset. For example, for a two-item subset, since $\mathbb{P}_{\mathcal{L}}(\{i, j\}) \propto \mathcal{L}_{i,i}\mathcal{L}_{j,j} - \mathcal{L}_{i,j}\mathcal{L}_{j,i}$, if item $i$ and item $j$ are perfectly similar such that $w_i = w_j$, then $\det(\mathcal{L}_{\{i,j\}}) = 0$ and these two items will not co-occur; hence such a subset $Y = \{i, j\}$ will be sampled with probability zero.
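This repulsion property can be checked numerically. The following sketch (the toy feature matrix W is an arbitrary illustration, not taken from the document) computes the normalised DPP probability $\det(\mathcal{L}_Y)/\det(\mathcal{L}+I)$ and shows that a subset containing two identical items has probability zero.

```python
import numpy as np

def dpp_probability(W, subset):
    """P(Y = subset) = det(L_subset) / det(L + I) for the kernel L = W @ W.T."""
    L = W @ W.T
    idx = np.array(subset)
    L_sub = L[np.ix_(idx, idx)]
    return np.linalg.det(L_sub) / np.linalg.det(L + np.eye(len(W)))

# Three item feature vectors; items 1 and 2 are identical (w_1 == w_2).
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 1.0]])

print(dpp_probability(W, [0, 1]))  # orthogonal rows -> non-zero probability
print(dpp_probability(W, [1, 2]))  # duplicated rows -> determinant, and probability, is zero
```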
In embodiments of the present invention, the target is to find a population of diverse policies, with each of them performing differently from other policies due to their unique characteristics. Therefore, when modelling the behavioural diversity in games, the payoff matrix can be used to construct a DPP kernel so that the similarity between two policies depends on their performance in terms of payoffs against different types of opponents.
A game DPP (G-DPP) for each player is a DPP in which the ground set is the strategy population and the DPP kernel $\mathcal{L}$ is given by Eq. (10), a Gram matrix based on the payoff table $\mathbf{M}$ (see Figure 3):

$$\mathcal{L} = \mathbf{M}\mathbf{M}^{\top}. \quad (10)$$
For learning in open-ended games, it is desirable to keep adding diverse policies to the population. In other words, at each iteration, if a random sample is taken from the G-DPP that consists of all existing policies, it is desirable that the cardinality of such a random sample is large (since policies with similar payoff vectors are unlikely to co-occur). In this sense, a diversity measure can be designed based on the expected cardinality of random samples from a G-DPP, i.e. $\mathbb{E}_{Y \sim \mathbb{P}_{\mathcal{L}}}[|Y|]$. The diversity metric, defined as the expected cardinality of a G-DPP, can be computed in polynomial time by the following equation:

$$\mathrm{Diversity}(\mathcal{Y}) := \mathbb{E}_{Y \sim \mathbb{P}_{\mathcal{L}}}\left[|Y|\right] = \mathrm{Tr}\!\left(I - (\mathcal{L} + I)^{-1}\right). \quad (11)$$
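Eq. (11) is cheap to evaluate with standard linear algebra. The sketch below (the toy payoff tables are illustrative assumptions) builds the G-DPP kernel of Eq. (10) from a payoff table and returns the expected cardinality; a population of duplicated payoff vectors scores lower than one with orthogonal payoff vectors.

```python
import numpy as np

def gdpp_diversity(M):
    """Diversity of a population, Eq. (11): expected cardinality of the G-DPP
    whose kernel is the Gram matrix L = M @ M.T of the payoff table (Eq. (10))."""
    L = M @ M.T
    n = L.shape[0]
    return np.trace(np.eye(n) - np.linalg.inv(L + np.eye(n)))

# Toy payoff tables (rows = own strategies, columns = opponent strategies).
identical = np.array([[1.0, 0.0],
                      [1.0, 0.0]])   # two strategies with identical payoff vectors
orthogonal = np.array([[1.0, 0.0],
                       [0.0, 1.0]])  # two strategies with orthogonal payoff vectors

print(gdpp_diversity(identical))    # lower expected cardinality
print(gdpp_diversity(orthogonal))   # higher expected cardinality
```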
Figure 3 shows an example of a G-DPP. The squared volume of the grey cube 300 is equal to $\det(\mathcal{L}_Y)$. The payoff vectors of the three strategies are shown at 301, 302 and 303 respectively. Since the strategies whose payoff vectors are shown at 302 and 303 are similar, the shaded area 304 they span is small, and thus the probability of these two strategies co-occurring is low; i.e., the probability of selecting this pair (the shaded area 304) from the G-DPP is smaller than that of selecting a pair with orthogonal payoff vectors. In this example, the diversity values of Eq. (11) for the populations are 0, 1 and 1.2 respectively.
The diversity measure is therefore based on the expected cardinality of a determinantal point process.
An advantageous property of this diversity measure is that it is well defined even in the case when $\mathcal{Y}$ has duplicated policies. Dealing with redundant policies can be a challenge for game evaluation. Here, redundancy also prevents one from directly using $\det(\mathcal{L})$ as the diversity measure, because the determinant becomes zero with duplicated entries.
With the diversity measure of Eq. (11), diversity-aware learning algorithms can now be designed.
The classical FP approach can be expanded to a diverse version such that, at each iteration, the player not only considers a best-response, but also considers how this new strategy can help enrich the existing strategy pool after the update. Formally, the diverse FP method maintains the same update rule as Eq. (4), but with the best-response changing into one that augments the expected payoff against the opponent's empirical strategy with the diversity bonus of Eq. (11):

$$\mathrm{BR}^{i}\!\left(\pi^{-i}_{t}\right) = \arg\max_{\pi^{i}} \; \mathbb{E}_{a^{i} \sim \pi^{i},\; a^{-i} \sim \pi^{-i}_{t}}\!\left[\mathbf{M}^{i}\!\left(a^{i}, a^{-i}\right)\right] + \tau \cdot \mathrm{Diversity}\!\left(\mathcal{S}^{i}_{t} \cup \{\pi^{i}\}\right), \quad (12)$$

where $\tau$ is a tuneable constant. For diverse FP, the expected cardinality is guaranteed to be a strictly concave function. Therefore, Eq. (12) has a unique solution at each iteration.
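One possible reading of this diverse best-response, for a normal-form game with a finite strategy set, is sketched below. The helper names, the use of payoff rows as DPP features, and the choice of which strategies form the existing pool (chosen_rows) are assumptions made purely for illustration.

```python
import numpy as np

def expected_cardinality(vectors):
    """Expected cardinality of a DPP whose kernel is the Gram matrix of `vectors`."""
    V = np.array(vectors)
    L = V @ V.T
    n = L.shape[0]
    return np.trace(np.eye(n) - np.linalg.inv(L + np.eye(n)))

def diverse_best_response(M, opponent_mixture, chosen_rows, tau=0.5):
    """Pick the row strategy maximising expected payoff against the opponent's
    empirical mixture plus tau times the diversity of the already-chosen rows
    augmented with the candidate (a simplified reading of Eq. (12))."""
    scores = []
    for a in range(M.shape[0]):
        payoff = M[a] @ opponent_mixture
        diversity = expected_cardinality([M[r] for r in chosen_rows] + [M[a]])
        scores.append(payoff + tau * diversity)
    return int(np.argmax(scores))

# Rock-Paper-Scissors payoff table for the row player.
M = np.array([[0.0, -1.0, 1.0],
              [1.0, 0.0, -1.0],
              [-1.0, 1.0, 0.0]])
print(diverse_best_response(M, opponent_mixture=np.ones(3) / 3, chosen_rows=[0]))
```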
In solving open-ended games, at the $t$-th iteration, the algorithm maintains a population $\mathcal{S}^{i}_{t}$ of policies learned so far by player $i$. The goal here is to design an Oracle to train a new strategy $S_{\theta}$, parameterised by $\theta \in \mathbb{R}^{d}$ (for example, a deep neural network), which both maximises player $i$'s payoff and is diverse from all strategies in $\mathcal{S}^{i}_{t}$. Therefore, the ground set of the G-DPP at iteration $t$ can be defined to be the union of the existing population and the new model to add:

$$\mathcal{Y}^{i}_{t} = \mathcal{S}^{i}_{t} \cup \{S_{\theta}\}.$$
With the ground set at each iteration, the diversity measure can be computed by Eq. (11). Subsequently, the objective of an Oracle can be written as:

$$\max_{\theta} \;\; \mathbb{E}_{a^{-i} \sim \pi^{-i}}\!\left[\mathbf{M}^{i}\!\left(S_{\theta}, a^{-i}\right)\right] + \tau \cdot \mathrm{Diversity}\!\left(\mathcal{Y}^{i}_{t}\right),$$

where $\pi^{-i}$ is the policy of player two (the opponent); depending on the game solver, it can be NE, UNIFORM, etc.
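A hedged sketch of this Oracle objective follows: for a candidate parameter vector θ it evaluates the expected payoff against the opponent's meta-policy plus τ times the diversity of the augmented ground set. The interface (a play(policy, opponent) payoff function supplied by the game engine) is an assumption; in practice the objective would be maximised over θ with an RL or gradient-based subroutine rather than merely evaluated.

```python
import numpy as np

def oracle_objective(theta, play, own_pool, opponent_pool, opponent_meta, tau=0.5):
    """Sketch of the diverse Oracle objective: expected payoff of the candidate
    policy against the opponent's meta-strategy, plus tau times the diversity
    (Eq. (11)) of the existing population united with the candidate."""
    # Payoff vector of the candidate against every opponent policy.
    candidate_row = np.array([play(theta, opp) for opp in opponent_pool])
    exp_payoff = candidate_row @ opponent_meta

    # Payoff vectors of the existing population plus the candidate -> G-DPP kernel.
    rows = [np.array([play(p, opp) for opp in opponent_pool]) for p in own_pool]
    V = np.vstack(rows + [candidate_row])
    L = V @ V.T
    diversity = np.trace(np.eye(len(V)) - np.linalg.inv(L + np.eye(len(V))))
    return exp_payoff + tau * diversity
```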
The general solver may therefore approximate the Nash strategy in large-scale two-player zero-sum games.

Figure 4 shows an example of the pseudo-code for one implementation of the method. The steps concerning the learning of multiple best-response policies are indicated at 401, and those promoting diversity among all best-responses at 402.
Figure 5 illustrates a summary of the main goal of the approach. A black-box multi-agent game engine 501 takes as input a joint strategy, shown at 502, and outputs the reward 503. Using the described algorithm, shown at 504, the output is multiple “good” strategies, as shown at 505.
Figure 6 summarises an example of a computer-implemented method 600 for processing a two-agent system input to form an at least partially optimised output indicative of an action policy. At step 601, the method comprises receiving the two-agent system input, the two-agent system input comprising a definition of a two-agent system and defining behaviour patterns of two agents based on system states. At step 602, the method comprises receiving an indication of an input system state. At step 603, the method comprises performing an iterative machine learning process to estimate multiple aggregate functions, each representing the behaviour patterns of the two agents over a set of system states, wherein the multiple aggregate functions are determined in a single iteration of the iterative machine learning process.
The multiple at least partially optimised outputs each comprise a collectively optimal action policy for each of the two agents in the input system state and a Nash equilibrium behaviour pattern of the two agents in the input system state.
The procedural diagram for the training of the solver can be visualised in Figure 7. At each time step (i.e. each iteration of the iterative machine learning process), multiple new policies are trained in parallel; in the example shown in Figure 7, two policies are trained in parallel at a time. Each new policy is trained against all existing policies. The policy shown at 701 is fixed at all time steps. For example, the policy shown at 702 is trained against the policy shown at 701 at time step 0, leading to the new policy shown at 703 at time step 1, while the policy shown at 704 is trained against both of these, which leads to the policy shown at 705. During the training process, through a diverse Oracle function (denoted DBR), each newly generated policy is diverse in the sense that it differs from all existing policies. For example, the policy at 705 is diverse from the policies at 703 and 701 at time step 1. Once a newly generated policy converges in training, it is kept fixed and unchanged in the pool. In time step 2, the policy shown at 706 converges, and its parameters are fixed and stay unchanged in later time steps, as indicated at 707.
The approach described herein therefore offers a geometric interpretation of behavioural diversity for learning in game frameworks by introducing a new diversity measure built upon the expected cardinality of a DPP. The diversity metric can be used as part of a general solver for normal-form games and open-ended (meta-)games. The method can converge to NE and α-Rank in two-player games and offers theoretical guarantees of expanding the gamescapes.
Figure 8 summarises an example of the process performed as part of the step of performing an iterative machine learning process. The process comprises repeatedly performing the following steps until a predetermined level of convergence is reached. At step 801, the method comprises generating a set of random system states. The set of random system states may be initially generated based on a predetermined probability distribution. At step 802, the method comprises estimating based on the two-agent system input the behaviour patterns of the two agents in the system states. At step 803, the method comprises estimating an error between the estimated behaviour patterns and the behaviour patterns predicted by each of multiple predetermined candidate aggregate functions, the error representing the level of convergence. At step 804, the method comprises adapting the multiple predetermined candidate aggregate functions based on the estimated behaviour patterns. This iterative machine learning process can be used to enable the device to find suitable aggregate functions in a manageable time period.
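The loop of Figure 8 can be sketched as follows. The interface (a system(state) observation function, candidate objects exposing predict and update methods, a sample_states generator, and the batch size) is assumed purely for illustration and is not prescribed by the document.

```python
import numpy as np

def fit_aggregate_functions(system, candidates, sample_states,
                            tolerance=1e-3, max_iters=100, batch=32):
    """Sketch of the iterative machine learning process 800 (Figure 8)."""
    for _ in range(max_iters):
        states = sample_states(batch)                       # step 801: random system states
        observed = [system(s) for s in states]              # step 802: behaviour patterns
        errors = []                                         # step 803: per-candidate error
        for f in candidates:
            predicted = [f.predict(s) for s in states]
            errors.append(np.mean([np.linalg.norm(np.asarray(o) - np.asarray(p))
                                   for o, p in zip(observed, predicted)]))
        if max(errors) < tolerance:                         # predetermined convergence level
            break
        for f in candidates:                                # step 804: adapt the candidates
            for s, o in zip(states, observed):
                f.update(s, o)
    return candidates
```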
When the predetermined level of convergence is reached, each of the agents can implement a respective action of the at least partially optimised set of actions.
The iterative method is scalable for approximating Nash equilibria in two-player zero-sum games. As described above, the method preferably involves a parallel double-oracle scheme that is designed to find multiple best-responses in a distributed way at the same time. The preferred implementation of the method defines and promotes so-called behavioural diversity among the multiple best-responses based on a determinantal point process. The method has been shown in some embodiments to demonstrate state-of-the-art performance, outperforming existing baselines, in approximating Nash equilibrium in large-scale two-player zero-sum games.
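The parallel search for multiple best-responses within one iteration can be sketched with a thread pool. This is a simplification: a real deployment would typically distribute the trainings over processes or machines, and the seed argument passed to the Oracle is an assumed convention for obtaining different best-responses.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_best_responses(oracle, opponent_pool, opponent_meta, num_responses=2):
    """Launch several best-response trainings at the same iteration and collect
    the resulting policies; `oracle` is any best-response trainer (e.g. an RL
    subroutine) taking (opponent_pool, opponent_meta, seed)."""
    with ThreadPoolExecutor(max_workers=num_responses) as executor:
        futures = [executor.submit(oracle, opponent_pool, opponent_meta, seed)
                   for seed in range(num_responses)]
        return [f.result() for f in futures]
```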
Figure 9 shows a schematic diagram of a computing device 900 configured to implement the computer implemented method described above and its associated components. The device may comprise a processor 901 and a non-volatile memory 902. The system may comprise more than one processor and more than one memory. The memory may store data that is executable by the processor. The processor may be configured to operate in accordance with a computer program stored in non-transitory form on a machine readable storage medium. The computer program may store instructions for causing the processor to perform its methods in the manner described herein.
Other examples of applications of this approach in practical applications include but are not limited to: driverless cars/autonomous vehicles, unmanned locomotive devices, packet delivery and routing devices, computer servers and ledgers in blockchains. For example, the agents may be autonomous vehicles and the system states may be vehicular system states. The agents may be communications routing devices and the system states may be data flows. The agents may be data processing devices and the system states may be computation tasks.
The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in the light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problems disclosed herein, and without limitation to the scope of the claims. The applicant indicates that aspects of the present invention may consist of any such individual feature or combination of features. In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the invention.

Claims

1. A computer-implemented device (900) for processing a two-agent system input to form multiple at least partially optimised outputs, each output indicative of an action policy for each of the two agents, the device comprising one or more processors (901) configured to perform the steps of: receiving (601) the two-agent system input, the two-agent system input comprising a definition of a two-agent system and defining behaviour patterns of two agents based on system states; receiving (602) an indication of an input system state; and performing (603) an iterative machine learning process (800) to estimate multiple aggregate functions, each representing the behaviour patterns of the two agents over a set of system states, wherein the multiple aggregate functions are determined in a single iteration of the iterative machine learning process.
2. A device (900) as claimed in claim 1, wherein the processor (901) is configured to iteratively process the multiple aggregate functions for the input system state to estimate multiple at least partially optimised sets of actions for each of the two agents in the input system state.
3. A device (900) as claimed in claim 1 or claim 2, wherein the multiple aggregate functions are determined in a single iteration of the iterative machine learning process in parallel.
4. A device (900) as claimed in any preceding claim, wherein multiple aggregate functions are determined in each iteration of the machine learning process.
5. A device (900) as claimed in any preceding claim, wherein the iterative machine learning process (800) is performed so as to promote behavioural diversity among the multiple aggregate functions determined in each iteration.
6. A device (900) as claimed in any preceding claim, wherein the iterative machine learning process (800) is performed in dependence on a diversity measure, wherein the diversity measure is modelled by a determinantal point process.
7. A device (900) as claimed in any preceding claim, wherein the multiple at least partially optimised outputs each comprise a collectively optimal action policy for each of the two agents in the input system state.
8. A device (900) as claimed in any preceding claim, wherein the multiple at least partially optimised outputs each represent a Nash equilibrium behaviour pattern of the two agents in the input system state.
9. A device (900) as claimed in any preceding claim, wherein the step of performing an iterative machine learning process comprises repeatedly performing the following steps until a predetermined level of convergence is reached: generating (801) a set of random system states; estimating (802) based on the two-agent system input the behaviour patterns of the two agents in the system states; estimating (803) an error between the estimated behaviour patterns and the behaviour patterns predicted by each of multiple predetermined candidate aggregate functions, the error representing the level of convergence; and adapting (804) the multiple predetermined candidate aggregate functions based on the estimated behaviour patterns.
10. A device (900) as claimed in claim 9, wherein the set of random system states are generated based on a predetermined probability distribution.
11. A device (900) as claimed in any preceding claim, wherein the agents are autonomous vehicles and the system states are vehicular system states.
12. A device (900) as claimed in any of claims 1 to 10, wherein the agents are data processing devices and the system states are computation tasks.
13. A method (600) for processing a two-agent system input to form multiple at least partially optimised outputs, each indicative of an action policy, the method comprising the steps of: receiving (601) the two-agent system input, the two-agent system input comprising a definition of a two-agent system and defining behaviour patterns of two agents based on system states; receiving (602) an indication of an input system state; and performing (603) an iterative machine learning process to estimate multiple aggregate functions, each representing the behaviour patterns of the two agents over a set of system states, wherein the multiple aggregate functions are determined in a single iteration of the iterative machine learning process.
14. The method (600) of claim 13, further comprising the step of causing each of the agents to implement a respective action of the at least partially optimised set of actions.
15. A computer readable medium (902) storing in non-transient form a set of instructions for causing one or more processors to perform the method (600) of claim 13 or 14.
PCT/EP2021/058392 2021-03-31 2021-03-31 Device and method for approximating nash equilibrium in two-player zero-sum games WO2022207087A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
PCT/EP2021/058392 WO2022207087A1 (en) 2021-03-31 2021-03-31 Device and method for approximating nash equilibrium in two-player zero-sum games
EP21717001.8A EP4298552A1 (en) 2021-03-31 2021-03-31 Device and method for approximating nash equilibrium in two-player zero-sum games
CN202180096388.8A CN117083617A (en) 2021-03-31 2021-03-31 Apparatus and method for approximating Nash equalization in two-person zero and gaming

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2021/058392 WO2022207087A1 (en) 2021-03-31 2021-03-31 Device and method for approximating nash equilibrium in two-player zero-sum games

Publications (1)

Publication Number Publication Date
WO2022207087A1 true WO2022207087A1 (en) 2022-10-06

Family

ID=75426585

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2021/058392 WO2022207087A1 (en) 2021-03-31 2021-03-31 Device and method for approximating nash equilibrium in two-player zero-sum games

Country Status (3)

Country Link
EP (1) EP4298552A1 (en)
CN (1) CN117083617A (en)
WO (1) WO2022207087A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220410878A1 (en) * 2021-06-23 2022-12-29 International Business Machines Corporation Risk sensitive approach to strategic decision making with many agents

Non-Patent Citations (32)

* Cited by examiner, † Cited by third party
Title
BAKER, B.KANITSCHEIDER, I.MARKOV, T.WU, Y.POWELL, G.MCGREW, B.MORDATCH, I.: "Emergent tool use from multi-agent autocurricula", INTERNATIONAL CONFERENCE ON LEARNING REPRESENTATIONS, 2019
BALDUZZI, D.GARNELO, M.BACHRACH, Y.CZARNECKI, W.PEROLAT, J.JADERBERG, M.GRAEPEL, T.: "Open-ended learning in symmetric zero-sum games", ICML, vol. 97, 2019, pages 434 - 443
BALDUZZI, D.RACANIERE, S.MARTENS, J.FOERSTER, J.TUYLS, K.GRAEPEL, T.: "The mechanics of n-player differentiable games", ICML, vol. 80, 2018, pages 363 - 372
BANZHAF, W.BAUMGAERTNER, B.BESLON, G.DOURSAT, R.FOSTER, J. A.MCMULLIN, B.DE MELO, V. V.MICONI, T.SPECTOR, L.STEPNEY, S. ET AL.: "Defining and simulating open-ended novelty: requirements, guidelines, and challenges", THEORY IN BIOSCIENCES, vol. 135, no. 3, 2016, pages 131 - 161, XP036050102, DOI: 10.1007/s12064-016-0229-7
BROWN, G. W.: "Iterative solution of games by fictitious play", ACTIVITY ANALYSIS OF PRODUCTION AND ALLOCATION, vol. 13, no. 1, 1951, pages 374 - 376
CANDOGAN, O.MENACHE, I.OZDAGLAR, A.PARRILO, P. A.: "Flows and decompositions of games: Harmonic and potential games", MATHEMATICS OF OPERATIONS RESEARCH, vol. 36, no. 3, 2011, pages 474 - 503
CZARNECKI, W. M.GIDEL, G.TRACEY, B.TUYLS, K.OMIDSHAFIEI, S.BALDUZZI, D.JADERBERG, M.: "Real world games look like spinning tops", ARXIV, PP. ARXIV-2004, 2020
DAVID BALDUZZI ET AL: "Open-ended Learning in Symmetric Zero-sum Games", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 23 January 2019 (2019-01-23), XP081007431 *
DAVIS, T.BURCH, N.BOWLING, M.: "Using response functions to measure strategy strength", PROCEEDINGS OF THE AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, vol. 28, 2014
DURHAM, W. H.: "Coevolution: Genes, culture, and human diversity", 1991, STANFORD UNIVERSITY PRESS
FUDENBERG, D.LEVINE, D.: "Consistency and cautious fictitious play", JOURNAL OF ECONOMIC DYNAMICS AND CONTROL, 1995
JADERBERG, M.CZARNECKI, W. M.DUNNING, I.MARRIS, L.LEVER, G.CASTANEDA, A. G.BEATTIE, C.RABINOWITZ, N. C.MORCOS, A. S.RUDERMAN, A. E: "Human-level performance in 3d multiplayer games with population based reinforcement learning", SCIENCE, vol. 364, no. 6443, 2019, pages 859 - 865
KURACH, K.RAICHUK, A.STANCZYK, P.ZAJAC, M.BACHEM, O.ESPEHOLT, L.RIQUELME, C.VINCENT, D.MICHALSKI, M.BOUSQUET, O. ET AL.: "Google research football: A novel reinforcement learning environment", PROCEEDINGS OF THE AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, vol. 34, 2020, pages 4501 - 4510
LANCTOT, M.ZAMBALDI, V.GRUSLYS, A.LAZARIDOU, A.TUYLS, K.PEROLAT, J.SILVER, D.GRAEPEL, T.: "A unified game-theoretic approach to multiagent reinforcement learning", ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS, 2017, pages 4190 - 4203
LEHMAN, J.STANLEY, K. O.: "Exploiting open-endedness to solve problems through the search for novelty", ALIFE, 2008, pages 329 - 336
LEIBO, J. Z.HUGHES, E.LANCTOT, M.GRAEPEL, T.: "Autocurricula and the emergence of innovation from social interaction: A manifesto for multi-agent intelligence research", ARXIV, 2019, pages arXiv-1903
LESLIE, D. S.COLLINS, E. J.: "Generalised weakened fictitious play", GAMES AND ECONOMIC BEHAVIOR, vol. 56, no. 2, 2006, pages 285 - 298, XP024911102, DOI: 10.1016/j.geb.2005.08.005
LIU, S.LEVER, G.MEREL, J.TUNYASUVUNAKOOL, S.HEESS, N.GRAEPEL, T.: "Emergent coordination through competition", INTERNATIONAL CONFERENCE ON LEARNING REPRESENTATIONS, 2018
MACCHI, O.: "Transactions of the Seventh Prague Conference on Information Theory, Statistical Decision Functions, Random Processes and of the 1974 European Meeting of Statisticians", 1977, SPRINGER, article "The fermion process - a model of stochastic point process with repulsive points", pages: 391 - 398
MCALEER, S.LANIER, J.FOX, R.BALDI, P.: "Pipeline psro: A scalable approach for finding approximate nash equilibria in large games", ARXIV PREPRINT ARXIV:2006.08555, 2020
MCMAHAN, H. B.GORDON, G. J.BLUM, A.: "Planning in the presence of cost functions controlled by an adversary", PROCEEDINGS OF THE 20TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING (ICML-03, 2003, pages 536 - 543, XP008098344
MULLER, P.OMIDSHAFIEI, S.ROWLAND, M.TUYLS, K.PEROLAT, J.LIU, S.HENNES, D.MARRIS, L.LANCTOT, M.HUGHES, E. ET AL.: "A generalized training approach for multiagent learning", INTERNATIONAL CONFERENCE ON LEARNING REPRESENTATIONS, 2019
NASH, J. F. ET AL.: "Equilibrium points in n-person games", PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES, vol. 36, no. 1, 1950, pages 48 - 49
OMIDSHAFIEI, S.PAPADIMITRIOU, C.PILIOURAS, G.TUYLS, K.ROWLAND, M.LESPIAU, J.-B.CZARNECKI, W. M.LANCTOT, M.PEROLAT, J.MUNOS, R.: "a-rank: Multi-agent evaluation by evolution", SCIENTIFIC REPORTS, vol. 9, no. 1, 2019, pages 1 - 29
PAREDIS, J.: "Coevolutionary computation", ARTIFICIAL LIFE, vol. 2, no. 4, 1995, pages 355 - 375
PEREZ NIEVES NICOLAS ET AL: "Modelling Behavioural Diversity for Learning in Open-Ended Games", 14 March 2021 (2021-03-14), pages 1 - 28, XP055871842, Retrieved from the Internet <URL:https://arxiv.org/pdf/2103.07927v1.pdf> [retrieved on 20211210] *
STANDISH, R. K: "Open-ended artificial evolution", INTERNATIONAL JOURNAL OF COMPUTATIONAL INTELLIGENCE AND APPLICATIONS, vol. 3, no. 02, 2003, pages 167 - 175
STEPHEN MCALEER ET AL: "Pipeline PSRO: A Scalable Approach for Finding Approximate Nash Equilibria in Large Games", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 18 February 2021 (2021-02-18), XP081882238 *
VINYALS, O.BABUSCHKIN, I.CHUNG, J.MATHIEU, M.JADERBERG, M.CZARNECKI, W. M.DUDZIK, A.HUANG, A.GEORGIEV, P.POWELL, R. ET AL.: "Alphastar: Mastering the real-time strategy game starcraft", DEEPMIND BLOG, vol. 2, 2019
VINYALS, O.BABUSCHKIN, I.CZARNECKI, W. M.MATHIEU, M.DUDZIK, A.CHUNG, J.CHOI, D. H.POWELL, R.EWALDS, T.GEORGIEV, P. ET AL.: "Grandmaster level in starcraft ii using multi-agent reinforcement learning", NATURE, vol. 575, no. 7782, 2019, pages 350 - 354, XP036927623, DOI: 10.1038/s41586-019-1724-z
YE, D.CHEN, G.ZHANG, W.CHEN, S.YUAN, B.LIU, B.CHEN, J.LIU, Z.QIU, F.YU, H. ET AL.: "Towards playing full moba games with deep reinforcement learning", ARXIV E-PRINTS, 2020, pages arXiv-2011


Also Published As

Publication number Publication date
CN117083617A (en) 2023-11-17
EP4298552A1 (en) 2024-01-03


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21717001

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 202180096388.8

Country of ref document: CN

WWE Wipo information: entry into national phase

Ref document number: 2021717001

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2021717001

Country of ref document: EP

Effective date: 20230928

NENP Non-entry into the national phase

Ref country code: DE