WO2022207087A1 - Device and method for approximating nash equilibrium in two-player zero-sum games - Google Patents

Device and method for approximating nash equilibrium in two-player zero-sum games Download PDF

Info

Publication number
WO2022207087A1
Authority
WO
WIPO (PCT)
Prior art keywords
agents
input
machine learning
learning process
aggregate functions
Prior art date
Application number
PCT/EP2021/058392
Other languages
French (fr)
Inventor
Yaodong YANG
Nicolas PEREZ NIEVES
Oliver SLUMBERS
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. filed Critical Huawei Technologies Co., Ltd.
Priority to PCT/EP2021/058392 priority Critical patent/WO2022207087A1/en
Priority to EP21717001.8A priority patent/EP4298552A1/en
Priority to CN202180096388.8A priority patent/CN117083617A/en
Publication of WO2022207087A1 publication Critical patent/WO2022207087A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/004Artificial life, i.e. computing arrangements simulating life
    • G06N3/006Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Definitions

  • This invention relates to a computer-implemented device and method for application in two-player zero-sum game frameworks, particularly to approximating Nash equilibrium and promoting the diversity of policies in such frameworks.
  • a desirable configuration is known as a fixed point. This is a configuration in which no agent can improve their payoff by unilaterally changing their current policy behaviour. This concept is known as a Nash equilibrium (NE).
  • a simple example of a two-player zero-sum game is the Rock-Paper-Scissors game, where Rock beats Scissors, Scissors beats Paper, and Paper beats Rock; the Nash equilibrium is to play the three strategies uniformly (1/3, 1/3, 1/3). A player who plays the Nash strategy can no longer be exploited.
  • in more sophisticated two-player zero-sum games such as Texas Hold'em Poker or StarCraft, where the strategy space is much larger (for example, StarCraft has 10^26 atomic actions at every time step), approximate solvers are required to compute the Nash equilibrium.
  • the transitive part of a game represents the structure in which the rule of winning is transitive (i.e., if strategy A beats B, B beats C, then A beats C), and the non-transitive part refers to the structure in which the set of strategies follows a cyclic rule (for example, the endless cycles among Rock, Paper and Scissors). Diversity matters, especially for the non-transitive part simply because there is no consistent winner in such part of a game: if a player only plays Rock, he can be exploited by Paper, but not so if he has a diverse strategy set of Rock and Scissor.
  • a computer-implemented device for processing a two-agent system input to form multiple at least partially optimised outputs, each output indicative of an action policy for each of the two agents
  • the device comprising one or more processors configured to perform the steps of: receiving the two-agent system input, the two-agent system input comprising a definition of a two-agent system and defining behaviour patterns of two agents based on system states; receiving an indication of an input system state; and performing an iterative machine learning process to estimate multiple aggregate functions, each representing the behaviour patterns of the two agents over a set of system states, wherein the multiple aggregate functions are determined in a single iteration of the iterative machine learning process.
  • This may allow for an iterative method that is scalable for approximating Nash equilibria in two-player zero-sum game frameworks.
  • the multiple aggregate functions may correspond to multiple best-response policies for the agents.
  • the processor may be configured to iteratively process the multiple aggregate functions for the input system state to estimate multiple at least partially optimised set of actions for each of the two agents in the input system state.
  • the multiple aggregate functions may be iteratively processed until a predefined level of convergence is reached.
  • the multiple aggregate functions may be determined in a single iteration of the iterative machine learning process in parallel.
  • the device may implement a parallel double-oracle scheme that is designed to find multiple best-response policies in a distributed way at the same time.
  • Multiple aggregate functions may be determined in each iteration of the machine learning process.
  • the multiple aggregate functions may be refined in subsequent iterations of the iterative machine learning process. This may allow the device to keep finding best-response strategies in an iterative manner.
  • the iterative machine learning process may be performed so as to promote behavioural diversity among the multiple aggregate functions determined in each iteration. Promoting diversity of best-response policies may strengthen the performance of a model trained by the iterative machine learning process.
  • the iterative machine learning process may be performed in dependence on a diversity measure.
  • the diversity measure may be modelled by a determinantal point process.
  • the diversity measure may be based on the expected cardinality of a determinantal point process. This may allow diverse best-response policies to be determined.
  • the multiple at least partially optimised outputs may each comprise a collectively optimal action policy for each of the two agents in the input system state. This may allow for optimal behaviour of the agents.
  • the multiple at least partially optimised outputs may each represent a Nash equilibrium behaviour pattern of the two agents in the input system state. This may allow policies corresponding to the Nash equilibrium to be learned.
  • the step of performing an iterative machine learning process may comprise repeatedly performing the following steps until a predetermined level of convergence is reached: generating a set of random system states; estimating based on the two-agent system input the behaviour patterns of the two agents in the system states; estimating an error between the estimated behaviour patterns and the behaviour patterns predicted by each of multiple predetermined candidate aggregate functions, the error representing the level of convergence; and adapting the multiple predetermined candidate aggregate functions based on the estimated behaviour patterns. This can enable the device to find suitable aggregate functions in a manageable time period.
  • the set of random system states may be generated based on a predetermined probability distribution. This may be convenient for generating the system states.
  • the agents may be autonomous vehicles and the system states may be vehicular system states. This may allow the device to be implemented in a driverless car.
  • the agents may be data processing devices and the system states may be computation tasks. This may allow the device to be implemented in a communication system.
  • a method for processing a two-agent system input to form multiple at least partially optimised outputs, each indicative of an action policy, the method comprising the steps of: receiving the two-agent system input, the two-agent system input comprising a definition of a two-agent system and defining behaviour patterns of two agents based on system states; receiving an indication of an input system state; and performing an iterative machine learning process to estimate multiple aggregate functions, each representing the behaviour patterns of the two agents over a set of system states, wherein the multiple aggregate functions are determined in a single iteration of the iterative machine learning process.
  • the method may allow for an iterative method that is scalable for approximating Nash equilibria in two-player zero-sum game frameworks.
  • the method may further comprise the step of causing each of the agents to implement a respective action of the at least partially optimised set of actions. This can result in efficient operation of the agents. In this way the method can be used to control the actions of a physical entity.
  • according to a third aspect there is provided a computer readable medium storing in non-transient form a set of instructions for causing one or more processors to perform the method described above.
  • the method may be performed by a computer system comprising one or more processors programmed with executable code stored non-transiently in one or more memories.
  • Figure 1 shows an algorithm for general meta-game solvers.
  • Figure 2 shows a summary of prior methods.
  • Figure 3 shows an example of a determinantal point process.
  • Figure 4 shows an example of the pseudo-code for one implementation of the method described herein.
  • Figure 5 schematically illustrates the main goal of the approach described herein.
  • Figure 6 summarises an example of a method of processing a two-agent system input to form multiple at least partially optimised outputs, each indicative of an action policy.
  • Figure 7 schematically illustrates an example of a procedural diagram of the training of the solver.
  • Figure 8 summarises an example of the steps of the iterative machine learning process described herein.
  • Figure 9 shows an example of a computing device configured to perform the methods described herein.
  • Described herein is a computer-implemented device and method for application in two-player zero-sum game frameworks, implementing a general Nash solver suitable for large-scale two-player zero-sum games.
  • the approach can provide a parallel implementation to keep finding best-response strategies for the two agents in an iterative manner. Furthermore, the approach can find policies that are diverse in behaviours. In other words, the solver promotes behavioural diversity during the learning process.
  • a DPP is a type of point process, which measures the probability of selecting a random subset from a ground set where only diverse subsets are desired.
  • DPPs have origins in modelling repulsive quantum particles in physics (see Macchi, O., “The fermion process - a model of stochastic point process with repulsive points”, Transactions of the Seventh Prague Conference on Information Theory, Statistical Decision Functions, Random Processes and of the 1974 European Meeting of Statisticians, pp. 391-398. Springer, 1977).
  • the expected cardinality of a DPP is formulated as the diversity metric.
  • the diversity metric is a general tool for game solvers.
  • the diversity metric is incorporated into the best-response dynamics, and diversity-aware extensions of fictitious play (FP) (see Brown, 1951) and policy-space response oracles (PSRO) (see Lanctot et al., 2017) are developed.
  • maximising the DPP-based diversity metric guarantees an expansion of the gamescape (convex polytopes spanned by agents’ mixtures of policies).
  • the diversity-aware learning methods may converge to the respective solution concept of Nash equilibrium and α-Rank (see Omidshafiei, S., Papadimitriou, C., Piliouras, G., Tuyls, K., Rowland, M., Lespiau, J.-B., Czarnecki, W. M., Lanctot, M., Perolat, J., and Munos, R., “α-rank: Multi-agent evaluation by evolution”, Scientific reports, 9(1): 1-29, 2019) in two-player games.
  • a further preferred implementation of the method involves a parallel double-oracle scheme that is designed to find multiple best-responses in a distributed way at the same time.
  • the method defines and promotes behavioural diversity among the multiple best-response policies using a distributed version of the solver in which multiple best-responses can be found in one iteration.
  • Nash equilibrium exists in all finite games (see Nash, J. F. et al. , “Equilibrium points in n-person games”, Proceedings of the national academy of sciences, 36(1): 48-49, 1950).
  • the NE is a joint mixed-strategy profile p in which each player i ∈ N plays the best-response to the other players.
  • The framework of NFGs is often limited in describing real-world games. In solving games such as StarCraft or GO, it is inefficient to list all atomic actions. Instead, of more interest are games at the policy level, where a policy can be a “higher-level” strategy (e.g., an RL model powered by a DNN), and the resulting game is a meta-game. A meta-game payoff table, M, is constructed by simulating games that cover different policy combinations.
  • Meta-games are often open-ended because there could exist an infinite number of policies to play a game. The openness also refers to the fact that new strategies will be continuously discovered and added to agents’ policy sets during training; the dimension of M will grow.
  • In solving NFGs, fictitious play (FP) describes the learning process where each player chooses a best-response to their opponents’ time-average strategies, and the resulting strategies are guaranteed to converge to the NE in two-player zero-sum, or potential, games.
  • Generalised weakened fictitious play (GWFP) (see Leslie, D. S. and Collins, E. J., “Generalised weakened fictitious play”, Games and Economic Behavior , 56(2):285-298, 2006) generalises FP by allowing for approximate best-responses and perturbed average strategy updates.
  • GWFP is a process of strategy updates following the updating rule of Eq. (4) below.
  • a general solver for open-ended (meta-)games involves an iterative process of solving the equilibrium (meta-)policy first, and then, based on the (meta-)policy, finding a new better-performing policy to augment the existing population.
  • the (meta-)policy solver computes a joint (meta-)policy profile p based on the current payoff M (or G), where different solution concepts can be adopted (for example, NE or α-Rank).
  • in two-player zero-sum cases, an Oracle returns a best-response to the opponent’s (meta-)policy. Generally, Oracles can be implemented through optimisation subroutines such as gradient-descent methods or RL algorithms. After a new policy is learned, the payoff table is expanded, and the missing entries will be filled by running new game simulations. The above process loops over each player at every iteration, and it terminates if no player can find a new best-response policy (i.e., Eq. (1) reaches zero).
  • Algorithm 1 in Figure 1 shows an exemplary algorithm for general meta-game solvers. The step of finding a new policy is shown in step 5.
  • smooth FP (Fudenberg, D. and Levine, D., “Consistency and cautious fictitious play”, Journal of Economic Dynamics and Control, 1995) is a solver that accounts for diversity through adopting a policy entropy term in the original FP (Brown et al., 1951, see above).
  • Double Oracle (DO) (McMahan, H. B., Gordon, G. J., and Blum, A., “Planning in the presence of cost functions controlled by an adversary”, Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 536-543, 2003) provides an iterative method where agents progressively expand their policy pool by, at each iteration, adding one best-response versus the opponent’s Nash strategy.
  • PSRO generalises FP and DO via adopting a RL subroutine to approximate the best-response (Lanctot, M., Zambaldi, V., Gruslys, A., Lazaridou, A., Tuyls, K., Perolat, J., Silver, D., and Graepel, T., “A unified game-theoretic approach to multiagent reinforcement learning”, Advances in neural information processing systems, pp. 4190-4203, 2017).
  • Pipeline-PSRO (McAleer, S., Lanier, J., Fox, R., and Baldi, P., “Pipeline psro: A scalable approach for finding approximate nash equilibria in large games”, arXiv preprint arXiv:2006.08555, 2020) trains multiple best-responses in parallel and efficiently solves games of size 10^50.
  • PSRO_rN (Balduzzi, D., Garnelo, M., Bachrach, Y., Czarnecki, W., Perolat, J., Jaderberg, M., and Graepel, T., “Open-ended learning in symmetric zero-sum games”, ICML, volume 97, pp. 434-443, PMLR, 2019) is a variation of PSRO that accounts for diversity.
  • α-PSRO (Muller, P., Omidshafiei, S., Rowland, M., Tuyls, K., Perolat, J., Liu, S., Hennes, D., Marris, L., Lanctot, M., Hughes, E., et al., “A generalized training approach for multiagent learning”, International Conference on Learning Representations, 2019) replaces NE with α-Rank. Yet, how to promote diversity in the context of α-PSRO is still unknown.
  • a DPP is a probabilistic framework that characterises how likely a subset of items is to be sampled from a ground set where diverse subsets are preferred.
  • a DPP defines a probability measure P on the power set of the ground set (i.e., 2^Y) such that, given an M x M positive semi-definite (PSD) kernel L that measures the pairwise similarity of the items, and letting Y be a random subset drawn from the DPP, the probability of sampling a subset Y is proportional to det(L_Y), where L_Y denotes the submatrix of L whose entries are indexed by the items included in Y.
  • given a PSD kernel L = WW^T, each row W_i represents a P-dimensional feature vector of item i.
  • the geometric meaning of det(L_Y) is the squared volume of the parallelepiped spanned by the rows of W that correspond to the sampled items in Y.
  • a PSD matrix ensures all principal minors of L are non-negative (det(L_Y) ≥ 0), which suffices for a proper probability distribution.
  • the normaliser of the DPP can be computed as the sum of det(L_Y) over all subsets, which equals det(L + I), where I is the M x M identity matrix.
  • the entries of L are pairwise inner products between item vectors.
  • the kernel can intuitively be thought of as representing dual effects: the diagonal elements L_ii aim to capture the quality of item i, whereas the off-diagonal elements L_ij capture the similarity between the items i and j.
  • a DPP models the repulsive connections among the items in a sampled subset. For example, in a two-item subset {i, j}, the sampling probability is proportional to L_ii L_jj - L_ij L_ji, so two perfectly similar items will not co-occur.
  • the target is to find a population of diverse policies, with each of them performing differently from other policies due to their unique characteristics. Therefore, when modelling the behavioural diversity in games, the payoff matrix can be used to construct a DPP kernel so that the similarity between two policies depends on their performance in terms of payoffs against different types of opponents.
  • a game DPP (G-DPP) for each player is a DPP in which the ground set is the strategy population and the DPP kernel L is written by Eq. (10), which is a Gram matrix based on the payoff table M (see Figure 3): L = MM^T.
  • a diversity measure can be designed based on the expected cardinality of random samples from a G-DPP, i.e. E[|Y|].
  • the diversity metric, defined as the expected cardinality of a G-DPP, can be computed in polynomial time by the following equation: E[|Y|] = Tr(I - (L + I)^(-1)).
  • Figure 3 shows an example of a G-DPP.
  • the squared volume of the grey cube 300 is equal to the determinant of the corresponding kernel submatrix.
  • the payoff vectors of the three strategies are shown at 301, 302 and 303 respectively. Since the second and third strategies have similar payoff vectors (302 and 303), this leads to a smaller shaded area 304, and thus the probability of these two strategies co-occurring is low, i.e. the probability of selecting this pair (the shaded area 304) from the G-DPP is smaller than that of selecting a pair with orthogonal payoff vectors.
  • the diversity values given by Eq. (11) for the populations shown are 0, 1 and 1.2 respectively.
  • the diversity measure is therefore based on the expected cardinality of a determinantal point process.
  • the classical FP approach can be expanded to a diverse version such that at each iteration, the player not only considers a best-response, but also considers how this new strategy can help enrich the existing strategy pool after the update.
  • the diverse FP method maintains the same update rule as Eq. (4), but with the best-response of Eq. (12) changed to maximise the payoff plus τ times the diversity measure of the enlarged population, where τ is a tuneable constant.
  • Eq. (12) has a unique solution at each iteration.
  • the algorithm maintains a population of policies learned so far by player i.
  • the goal here is to design an Oracle to train a new strategy S_θ, parameterised by θ ∈ R^d (for example, a deep neural net), which both maximises player i’s payoff and is diverse from all strategies in the existing population. Therefore, the ground set of the G-DPP at iteration t can be defined to be the union of the existing population and the new model to add.
  • the diversity measure can be computed by Eq. (11).
  • the objective of an Oracle can be written as maximising the payoff against the opponent’s (meta-)policy plus τ times the diversity measure of the enlarged population, where the opponent’s (meta-)policy is that of player two. Depending on the game solver, it can be the NE, UNIFORM, etc.
  • the general solver may therefore approximate the Nash strategy in large-scale two-player zero-sum games.
  • Figure 4 shows an example of the pseudo-code for one implementation of the method.
  • the steps concerning the learning of multiple best-response policies are indicated at 401 , and for promoting diversity among all best-responses at 402.
  • Figure 5 illustrates a summary of the main goal of the approach.
  • a black-box multi-agent game engine 501 takes as input a joint strategy, shown at 502, and outputs the reward 503.
  • the output is multiple “good” strategies, as shown at 505.
  • Figure 6 summarises an example of a computer-implemented method 600 for processing a two-agent system input to form an at least partially optimised output indicative of an action policy.
  • the method comprises receiving the two-agent system input, the two-agent system input comprising a definition of a two-agent system and defining behaviour patterns of two agents based on system states.
  • the method comprises receiving an indication of an input system state.
  • the method comprises performing an iterative machine learning process to estimate multiple aggregate functions, each representing the behaviour patterns of the two agents over a set of system states, wherein the multiple aggregate functions are determined in a single iteration of the iterative machine learning process.
  • the multiple at least partially optimised outputs each comprise a collectively optimal action policy for each of the two agents in the input system state and a Nash equilibrium behaviour pattern of the two agents in the input system state.
  • the procedural diagram for the training of the solver can be visualized in the plot shown in Figure 7.
  • at each time step (i.e. each iteration of the iterative machine learning process), two policies are trained in parallel.
  • Each new policy is trained against all existing policies.
  • the first policy, shown at 701, is fixed at all time steps.
  • the policy shown at 702 is trained against the policy shown at 701 at time step 0, leading to the new policy shown at 703 at time step 1; the policy shown at 704 is trained against both of these, which leads to the policy shown at 705.
  • each newly generated policy will be diverse in the sense that it will be different from all existing policies. For example, the policy at 705 is diverse from the policies at 703 and 701 at time step 1. Once a newly generated policy converges in the training, it is kept fixed and unchanged in the pool. At time step 2, the policy shown at 706 converges, and its parameters will be fixed and stay unchanged in later time steps, as indicated at 707.
  • DBR denotes the diverse Oracle (best-response) function.
  • the approach described herein therefore offers a geometric interpretation of behavioural diversity for learning in game frameworks by introducing a new diversity measure built upon the expected cardinality of a DPP.
  • the diversity metric can be used as part of a general solver for normalform games and open-ended (meta)games.
  • the method can converge to NE and ⁇ -Rank in two-player games and show theoretical guarantees of expanding the gamescapes.
  • Figure 8 summarises an example of the process performed as part of the step of performing an iterative machine learning process.
  • the process comprises repeatedly performing the following steps until a predetermined level of convergence is reached.
  • the method comprises generating a set of random system states.
  • the set of random system states may be initially generated based on a predetermined probability distribution.
  • the method comprises estimating based on the two-agent system input the behaviour patterns of the two agents in the system states.
  • the method comprises estimating an error between the estimated behaviour patterns and the behaviour patterns predicted by each of multiple predetermined candidate aggregate functions, the error representing the level of convergence.
  • the method comprises adapting the multiple predetermined candidate aggregate functions based on the estimated behaviour patterns. This iterative machine learning process can be used to enable the device to find suitable aggregate functions in a manageable time period.
  • each of the agents can implement a respective action of the at least partially optimised set of actions.
  • the iterative method is scalable for approximating Nash equilibria in two-player zero-sum games.
  • the method preferably involves a parallel double-oracle scheme that is designed to find multiple best-responses in a distributed way at the same time.
  • the preferred implementation of the method defines and promotes so-called behavioural diversity among the multiple best-responses based on a determinantal point process.
  • the method has been shown in some embodiments to demonstrate state-of-the-art performance, outperforming existing baselines, in approximating Nash equilibrium in large-scale two-player zero-sum games.
  • Figure 9 shows a schematic diagram of a computing device 900 configured to implement the computer implemented method described above and its associated components.
  • the device may comprise a processor 901 and a non-volatile memory 902.
  • the system may comprise more than one processor and more than one memory.
  • the memory may store data that is executable by the processor.
  • the processor may be configured to operate in accordance with a computer program stored in non-transitory form on a machine readable storage medium.
  • the computer program may store instructions for causing the processor to perform its methods in the manner described herein.
  • the agents may be autonomous vehicles and the system states may be vehicular system states.
  • the agents may be communications routing devices and the system states may be data flows.
  • the agents may be data processing devices and the system states may be computation tasks.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

Described is a computer-implemented device (900) for processing a two-agent system input to form multiple at least partially optimised outputs, each output indicative of an action policy for each of the two agents, the device comprising one or more processors (901) configured to perform the steps of: receiving (601) the two-agent system input, the two-agent system input comprising a definition of a two-agent system and defining behaviour patterns of two agents based on system states; receiving (602) an indication of an input system state; and performing (603) an iterative machine learning process (800) to estimate multiple aggregate functions, each representing the behaviour patterns of the two agents over a set of system states, wherein the multiple aggregate functions are determined in a single iteration of the iterative machine learning process. This may allow for an iterative method that is scalable for approximating Nash equilibria in two-player zero-sum games.

Description

DEVICE AND METHOD FOR APPROXIMATING NASH EQUILIBRIUM IN TWO-PLAYER
ZERO-SUM GAMES
FIELD OF THE INVENTION
This invention relates to a computer-implemented device and method for application in two-player zero-sum game frameworks, particularly to approximating Nash equilibrium and promoting the diversity of policies in such frameworks.
BACKGROUND
Computing the strategic configuration in which the agents in a system are executing their best-response actions is difficult because of the interdependencies between each of the agents’ actions. In particular, a desirable configuration is known as a fixed point. This is a configuration in which no agent can improve their payoff by unilaterally changing their current policy behaviour. This concept is known as a Nash equilibrium (NE).
A simple example of a two-player zero-sum game is the Rock-Paper-Scissors game, where Rock beats Scissors, Scissors beats Paper, and Paper beats Rock; the Nash equilibrium is to play the three strategies uniformly (1/3, 1/3, 1/3). When a player plays the Nash strategy, he can no longer be exploited. However, in more sophisticated two-player zero-sum games, such as Texas Hold'em Poker or StarCraft, where the strategy space is much larger (for example, StarCraft has 10^26 atomic actions at every time step), it is required to design approximate solvers to compute the Nash equilibrium.
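As a concrete illustration (not part of the original disclosure), the following minimal Python sketch builds the Rock-Paper-Scissors payoff matrix and checks that no pure strategy gains a positive expected payoff against the uniform mixture, confirming that (1/3, 1/3, 1/3) cannot be exploited:

import numpy as np

# Row player's payoff for Rock-Paper-Scissors (win = +1, loss = -1, tie = 0);
# rows and columns are ordered (Rock, Paper, Scissors).
G = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]])

uniform = np.array([1/3, 1/3, 1/3])
payoffs_vs_uniform = G @ uniform   # expected payoff of each pure strategy vs the uniform mix
print(payoffs_vs_uniform)          # [0. 0. 0.]
print(payoffs_vs_uniform.max())    # 0.0: no deviation gains anything, so (1/3, 1/3, 1/3) is a NE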
Many real-world applications, such as designing gaming AI, involve solving an approximate Nash equilibrium in two-player zero-sum games. Often, these games comprise a large number of dimensions which make traditional Linear Programming solvers infeasible and require scalable methods to solve them. In designing scalable approximating solutions, promoting behavioural diversity during training is very important. Promoting behavioural diversity is particularly important for solving games with non-transitive dynamics where strategic cycles exist, and there is no consistent winner. For example, a player who can only play Rock can never win a Rock-Paper-Scissors game. Yet, there is a lack of rigorous treatment for defining diversity and constructing diversity-aware learning dynamics. In essence, existing solvers either cannot solve large-scale zero-sum games, or they do not promote behavioural diversity when approximating Nash equilibrium. In general, an arbitrary game of either the normal-form type (for example, see Candogan, O., Menache, I., Ozdaglar, A., and Parrilo, P. A., “Flows and decompositions of games: Harmonic and potential games”, Mathematics of Operations Research, 36(3):474-503, 2011) or the differential type (for example, see Balduzzi, D., Racaniere, S., Martens, J., Foerster, J., Tuyls, K., and Graepel, T., “The mechanics of n-player differentiable games”, ICML, volume 80, pp. 363-372. JMLR.org, 2018a) can always be decomposed into a sum of two components: a transitive part and a non-transitive part. The transitive part of a game represents the structure in which the rule of winning is transitive (i.e., if strategy A beats B, B beats C, then A beats C), and the non-transitive part refers to the structure in which the set of strategies follows a cyclic rule (for example, the endless cycles among Rock, Paper and Scissors). Diversity matters, especially for the non-transitive part simply because there is no consistent winner in such part of a game: if a player only plays Rock, he can be exploited by Paper, but not so if he has a diverse strategy set of Rock and Scissors.
Many real-world games demonstrate strong nontransitivity (for example, see Czarnecki, W. M., Gidel, G., Tracey, B., Tuyls, K., Omidshafiei, S., Balduzzi, D., and Jaderberg, M., “Real world games look like spinning tops”, arXiv, pp. arXiv-2004, 2020). Therefore, it is highly desirable to design objectives in the learning framework that can lead to behavioural diversity. In multi-agent reinforcement learning (MARL), promoting diversity not only prevents AI agents from checking the same policies repeatedly, but more importantly, helps them discover niche skills, avoid being exploited and maintain robust performance when encountering unfamiliar types of opponents. In the examples of building AIs to play StarCraft (see Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H., Powell, R., Ewalds, T., Georgiev, P., et al., “Grandmaster level in starcraft ii using multi-agent reinforcement learning”, Nature, 575(7782):350-354, 2019b), Honour of Kings (Ye, D., Chen, G., Zhang, W., Chen, S., Yuan, B., Liu, B., Chen, J., Liu, Z., Qiu, F., Yu, H., et al., “Towards playing full moba games with deep reinforcement learning”, arXiv e-prints, pp. arXiv-2011, 2020) and Soccer (Kurach, K., Raichuk, A., Stanczyk, P., Zajac, M., Bachem, O., Espeholt, L., Riquelme, C., Vincent, D., Michalski, M., Bousquet, O., et al., “Google research football: A novel reinforcement learning environment”, Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 4501-4510, 2020), learning a diverse set of strategies has been reported as an imperative step in strengthening AI’s performance.
Despite the importance of diversity, there is very little prior work that offers a rigorous treatment in even defining diversity. The majority of work so far has followed a heuristic approach. For example, the idea of co-evolution (see Durham, W. H., “Coevolution: Genes, culture, and human diversity”, Stanford University Press, 1991, and Paredis, J. Coevolutionary computation. Artificial life, 2(4): 355-375, 1995) has drawn forth a series of effective methods, such as open-ended evolution (see Standish, R. K, Open-ended artificial evolution”, International Journal of Computational Intelligence and Applications, 3(02): 167— 175, 2003, Banzhaf, W., Baumgaertner, B., Beslon, G., Doursat, R., Foster, J. A., McMullin, B., De Melo, V. V., Miconi, T., Spector, L, Stepney, S., et al., “Defining and simulating open-ended novelty: requirements, guidelines, and challenges”, Theory in Biosciences, 135(3):131— 161 , 2016, and Lehman, J. and Stanley, K. O., “Exploiting open-endedness to solve problems through the search for novelty”, ALIFE, pp. 329-336, 2008), population based training methods (see Jaderberg, M., Czarnecki, W. M., Dunning, I., Marris, L., Lever, G., Castaneda, A. G., Beattie, C., Rabinowitz, N. C., Morcos, A. S., Ruderman, A., et al., “Human-level performance in 3d multiplayer games with population based reinforcement learning”, Science, 364(6443): 859- 865, 2019, and Liu, S., Lever, G., Merel, J., Tunyasuvunakool, S., Heess, N., and Graepel, T., “Emergent coordination through competition”, International Conference on Learning Representations, 2018), and auto-curricula (see Leibo, J. Z., Hughes, E., Lanctot, M., and Graepel, T., “Autocurricula and the emergence of innovation from social interaction: A manifesto for multi-agent intelligence research”, arXiv, pp. arXiv-1903, 2019 and Baker, B., Kanitscheider, I., Markov, T., Wu, Y., Powell, G., McGrew, B., and Mordatch, I., “Emergent tool use from multi-agent autocurricula”, International Conference on Learning Representations, 2019).
Despite many empirical successes, the lack of rigorous treatment for behavioural diversity still hinders one from developing a principled approach.
It is desirable to develop a method that overcomes such problems.
SUMMARY OF THE INVENTION
According to one aspect there is provided a computer-implemented device for processing a two-agent system input to form multiple at least partially optimised outputs, each output indicative of an action policy for each of the two agents, the device comprising one or more processors configured to perform the steps of: receiving the two-agent system input, the two-agent system input comprising a definition of a two-agent system and defining behaviour patterns of two agents based on system states; receiving an indication of an input system state; and performing an iterative machine learning process to estimate multiple aggregate functions, each representing the behaviour patterns of the two agents over a set of system states, wherein the multiple aggregate functions are determined in a single iteration of the iterative machine learning process. This may allow for an iterative method that is scalable for approximating Nash equilibria in two-player zero-sum game frameworks.
The multiple aggregate functions may correspond to multiple best-response policies for the agents.
The processor may be configured to iteratively process the multiple aggregate functions for the input system state to estimate multiple at least partially optimised set of actions for each of the two agents in the input system state. The multiple aggregate functions may be iteratively processed until a predefined level of convergence is reached.
The multiple aggregate functions may be determined in a single iteration of the iterative machine learning process in parallel. For example, the device may implement a parallel double-oracle scheme that is designed to find multiple best-response policies in a distributed way at the same time.
Multiple aggregate functions may be determined in each iteration of the machine learning process. The multiple aggregate functions may be refined in subsequent iterations of the iterative machine learning process. This may allow the device to keep finding best-response strategies in an iterative manner.
The iterative machine learning process may be performed so as to promote behavioural diversity among the multiple aggregate functions determined in each iteration. Promoting diversity of best-response policies may strengthen the performance of a model trained by the iterative machine learning process.
The iterative machine learning process may be performed in dependence on a diversity measure. The diversity measure may be modelled by a determinantal point process. The diversity measure may be based on the expected cardinality of a determinantal point process. This may allow diverse best-response policies to be determined.
The multiple at least partially optimised outputs may each comprise a collectively optimal action policy for each of the two agents in the input system state. This may allow for optimal behaviour of the agents. The multiple at least partially optimised outputs may each represent a Nash equilibrium behaviour pattern of the two agents in the input system state. This may allow policies corresponding to the Nash equilibrium to be learned.
The step of performing an iterative machine learning process may comprise repeatedly performing the following steps until a predetermined level of convergence is reached: generating a set of random system states; estimating based on the two-agent system input the behaviour patterns of the two agents in the system states; estimating an error between the estimated behaviour patterns and the behaviour patterns predicted by each of multiple predetermined candidate aggregate functions, the error representing the level of convergence; and adapting the multiple predetermined candidate aggregate functions based on the estimated behaviour patterns. This can enable the device to find suitable aggregate functions in a manageable time period.
The set of random system states may be generated based on a predetermined probability distribution. This may be convenient for generating the system states.
The agents may be autonomous vehicles and the system states may be vehicular system states. This may allow the device to be implemented in a driverless car.
The agents may be data processing devices and the system states may be computation tasks. This may allow the device to be implemented in a communication system.
According to a second aspect there is provided a method for processing a two-agent system input to form multiple at least partially optimised outputs, each indicative of an action policy, the method comprising the steps of: receiving the two-agent system input, the two-agent system input comprising a definition of a two-agent system and defining behaviour patterns of two agents based on system states; receiving an indication of an input system state; and performing an iterative machine learning process to estimate multiple aggregate functions, each representing the behaviour patterns of the two agents over a set of system states, wherein the multiple aggregate functions are determined in a single iteration of the iterative machine learning process.
This may allow for an iterative method that is scalable for approximating Nash equilibria in two-player zero-sum game frameworks. The method may further comprise the step of causing each of the agents to implement a respective action of the at least partially optimised set of actions. This can result in efficient operation of the agents. In this way the method can be used to control the actions of a physical entity.
According to a third aspect there is provided a computer readable medium storing in non-transient form a set of instructions for causing one or more processors to perform the method described above. The method may be performed by a computer system comprising one or more processors programmed with executable code stored non-transiently in one or more memories.
BRIEF DESCRIPTION OF THE FIGURES
The present invention will now be described by way of example with reference to the accompanying drawings. In the drawings:
Figure 1 shows an algorithm for general meta-game solvers.
Figure 2 shows a summary of prior methods.
Figure 3 shows an example of a determinantal point process.
Figure 4 shows an example of the pseudo-code for one implementation of the method described herein.
Figure 5 schematically illustrates the main goal of the approach described herein.
Figure 6 summarises an example of a method of processing a two-agent system input to form multiple at least partially optimised outputs, each indicative of an action policy.
Figure 7 schematically illustrates an example of a procedural diagram of the training of the solver.
Figure 8 summarises an example of the steps of the iterative machine learning process described herein.
Figure 9 shows an example of a computing device configured to perform the methods described herein.
DETAILED DESCRIPTION
Described herein is a computer-implemented device and method for application in two-player zero-sum game frameworks, implementing a general Nash solver suitable for large-scale two-player zero-sum games.
As will be described in more detail below, the approach can provide a parallel implementation to keep finding best-response strategies for the two agents in an iterative manner. Furthermore, the approach can find policies that are diverse in behaviours. In other words, the solver promotes behavioural diversity during the learning process.
The preferred implementation of the approach described herein offers a geometric interpretation of behavioural diversity in game frameworks and introduces a diversity metric based on determinantal point processes (DPP).
A DPP is a type of point process, which measures the probability of selecting a random subset from a ground set where only diverse subsets are desired. DPPs have origins in modelling repulsive quantum particles in physics (see Macchi, O., “The fermion process - a model of stochastic point process with repulsive points”, Transactions of the Seventh Prague Conference on Information Theory, Statistical Decision Functions, Random Processes and of the 1974 European Meeting of Statisticians, pp. 391-398. Springer, 1977).
In the preferred implementation of the method described herein, the expected cardinality of a DPP is formulated as the diversity metric. The diversity metric is a general tool for game solvers. The diversity metric is incorporated into the best-response dynamics, and diversity-aware extensions of fictitious play (FP) (see Brown, G. W., “Iterative solution of games by fictitious play”, Activity analysis of production and allocation, 13(1):374-376, 1951) and policy-space response oracles (PSRO) (see Lanctot, M., Zambaldi, V., Gruslys, A., Lazaridou, A., Tuyls, K., Perolat, J., Silver, D., and Graepel, T., “A unified game-theoretic approach to multiagent reinforcement learning”, Advances in neural information processing systems, pp. 4190-4203, 2017) are developed. By incorporating the diversity metric into the best-response dynamics, diverse FP and diverse PSRO may be developed for solving normal-form games and open-ended games.
Theoretically, maximising the DPP-based diversity metric guarantees an expansion of the gamescape (convex polytopes spanned by agents’ mixtures of policies). Meanwhile, the diversity-aware learning methods may converge to the respective solution concept of Nash equilibrium and α-Rank (see Omidshafiei, S., Papadimitriou, C., Piliouras, G., Tuyls, K., Rowland, M., Lespiau, J.-B., Czarnecki, W. M., Lanctot, M., Perolat, J., and Munos, R., “α-rank: Multi-agent evaluation by evolution”, Scientific reports, 9(1): 1-29, 2019) in two-player games.
A further preferred implementation of the method involves a parallel double-oracle scheme that is designed to find multiple best-responses in a distributed way at the same time. The method defines and promotes behavioural diversity among the multiple best-response policies using a distributed version of the solver in which multiple best-responses can be found in one iteration.
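One way such a parallel Oracle step might be organised is sketched below. This is a hypothetical illustration only: the function names, the random-search "training" stub and the thread pool are assumptions rather than the patented implementation. It simply shows several best-responses being produced within a single iteration.

import numpy as np
from concurrent.futures import ThreadPoolExecutor

def train_best_response(opponent_meta_policy, payoff_against, n_candidates=100, seed=0):
    """Hypothetical Oracle: returns one approximate best-response "policy".
    payoff_against(candidate) is assumed to return the candidate's payoff vector
    against each opponent policy; a real Oracle would run an RL subroutine instead."""
    rng = np.random.default_rng(seed)
    candidates = [rng.standard_normal(8) for _ in range(n_candidates)]  # toy policy parameters
    scores = [payoff_against(c) @ opponent_meta_policy for c in candidates]
    return candidates[int(np.argmax(scores))]

def parallel_oracle(opponent_meta_policy, payoff_against, n_best_responses=4):
    # Several best-responses are produced within one iteration, "at the same time".
    # A thread pool stands in here for the distributed workers of the full scheme.
    with ThreadPoolExecutor(max_workers=n_best_responses) as pool:
        futures = [pool.submit(train_best_response, opponent_meta_policy,
                               payoff_against, 100, seed)
                   for seed in range(n_best_responses)]
        return [f.result() for f in futures]

In a genuinely distributed deployment, each worker would typically run on its own device or process and train an RL model rather than performing a random search.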
The following basic notations are first introduced to aid understanding and to highlight differences of the present approach over the prior art.
Consider a normal-form game (NFG) denoted by $\mathcal{G} = (N, \{S^i\}_{i \in N}, \{G^i\}_{i \in N})$, where each player $i \in N$ has a finite set of pure strategies $S^i$. Let $S = \prod_{i \in N} S^i$ denote the space of joint pure-strategy profiles, and $S^{-i}$ denote the set of joint strategy profiles except the i-th player. A mixed strategy of player i is written as $p^i \in \Delta_{S^i}$, where $\Delta$ is a probability simplex. A joint mixed-strategy profile is $p = \prod_{i \in N} p^i$, where $p(S)$ represents the probability of joint strategy profile $S$. For each $S \in S$, let $G(S) = (G^1(S), \dots, G^{|N|}(S))$ denote the vector of payoff values for each player. The expected payoff of player i under a joint mixed-strategy profile p is thus written as $G^i(p) = \mathbb{E}_{S \sim p}[G^i(S)] = \sum_{S} p(S)\, G^i(S)$.
Nash equilibrium (NE) exists in all finite games (see Nash, J. F., “Equilibrium points in n-person games”, Proceedings of the National Academy of Sciences, 36(1): 48-49, 1950). The NE is a joint mixed-strategy profile p in which each player $i \in N$ plays the best-response to the other players, i.e. $G^i(p^i, p^{-i}) \geq G^i(\hat{p}^i, p^{-i})$ for all $\hat{p}^i \in \Delta_{S^i}$. For $\epsilon > 0$, an $\epsilon$-best-response to $p^{-i}$ is a profile $p^i$ satisfying $G^i(p^i, p^{-i}) \geq \max_{\hat{p}^i} G^i(\hat{p}^i, p^{-i}) - \epsilon$. The exploitability (see Davis, T., Burch, N., and Bowling, M., “Using response functions to measure strategy strength”, Proceedings of the AAAI Conference on Artificial Intelligence, volume 28, 2014) measures the distance of a joint strategy profile p to a NE, written as:

$$\mathrm{Exploitability}(p) = \sum_{i \in N}\Big(\max_{\hat{p}^i} G^i(\hat{p}^i, p^{-i}) - G^i(p)\Big) \qquad (1)$$

When the exploitability reaches zero, all players reach their best-responses, and thus p is a NE.
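For a finite two-player zero-sum game given by a payoff matrix, the exploitability of Eq. (1) can be evaluated directly. The sketch below is illustrative only; G holds player 1's payoffs and player 2 receives the negative.

import numpy as np

def exploitability(G, p1, p2):
    """Exploitability of the joint profile (p1, p2); G[i, j] is player 1's payoff,
    player 2 receives -G[i, j] (zero-sum)."""
    value = p1 @ G @ p2
    br1_value = np.max(G @ p2)         # best pure response of player 1 to p2
    br2_value = np.max(-(p1 @ G))      # best pure response of player 2 to p1
    # Sum of what each player could still gain; zero exactly at a Nash equilibrium.
    return (br1_value - value) + (br2_value + value)

G = np.array([[0., -1., 1.], [1., 0., -1.], [-1., 1., 0.]])       # Rock-Paper-Scissors
print(exploitability(G, np.ones(3) / 3, np.ones(3) / 3))          # 0.0 at the Nash equilibrium
print(exploitability(G, np.array([1., 0., 0.]), np.ones(3) / 3))  # 1.0: "always Rock" is exploitable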
The framework of NFGs is often limited in describing real-world games. In solving games such as StarCraft or GO, it is inefficient to list all atomic actions. Instead, of more interest are games at the policy level, where a policy can be a “higher-level” strategy (e.g., an RL model powered by a DNN), and the resulting game is a meta-game. A meta-game payoff table, M, is constructed by simulating games that cover different policy combinations. In meta-games, $\mathbb{S}^i$ can be used to denote the policy set (e.g., a population of deep RL models), and $\pi^i \in \Delta_{\mathbb{S}^i}$ may be used to denote the meta-policy (e.g., player i plays [RL-Model 1, RL-Model 2] with probability [0.3, 0.7]), and thus $p = (\pi^1, \dots, \pi^{|N|})$ is a joint meta-policy profile. Meta-games are often open-ended because there could exist an infinite number of policies to play a game. The openness also refers to the fact that new strategies will be continuously discovered and added to agents’ policy sets during training; the dimension of M will grow.
In the meta-game analysis (a.k.a. empirical game-theoretic analysis), traditional solution concepts (for example, NE or α-Rank) can still be computed based on M, even in a more scalable manner. This is because the number of “higher-level” strategies in the meta-game is usually far smaller than the number of atomic actions of the underlying game. For example, in tackling StarCraft (see Vinyals, O., Babuschkin, I., Chung, J., Mathieu, M., Jaderberg, M., Czarnecki, W. M., Dudzik, A., Huang, A., Georgiev, P., Powell, R., et al. Alphastar: Mastering the real-time strategy game starcraft ii. DeepMind blog, 2, 2019a), hundreds of deep RL models were trained, which is a trivial amount compared to the number of atomic actions: 10^26 at every timestep.
Many real-world games such as Poker, GO and StarCraft can be described through an open-ended zero-sum meta-game. Given a game engine $\phi: \mathbb{S}^1 \times \mathbb{S}^2 \to \mathbb{R}$, where $\phi(S^1, S^2) > 0$ means $S^1$ beats $S^2$, and $\phi < 0$, $\phi = 0$ refer to losses and ties, the meta-game payoff is:

$$\mathbf{M} = \big\{\phi(S^1, S^2) : (S^1, S^2) \in \mathbb{S}^1 \times \mathbb{S}^2\big\}$$

A game is symmetric if $\mathbb{S}^1 = \mathbb{S}^2$ and $\phi(S^1, S^2) = -\phi(S^2, S^1)$. It is transitive if there is a monotonic rating function $f$ such that $\phi(S^1, S^2) = f(S^1) - f(S^2), \forall S^1, S^2$, meaning that performance on the game is the difference in ratings. It is non-transitive if $\sum_{S^2 \in \mathbb{S}^2} \phi(S^1, S^2) = 0, \forall S^1 \in \mathbb{S}^1$, meaning that winning against some strategies will be counterbalanced by losses against others; the game has no consistent winner. Lastly, the gamescape of a population of strategies $\mathbb{S}$ (see Balduzzi, D., Garnelo, M., Bachrach, Y., Czarnecki, W., Perolat, J., Jaderberg, M., and Graepel, T., “Open-ended learning in symmetric zero-sum games”, ICML, volume 97, pp. 434-443, PMLR, 2019) in a meta-game is defined as the convex hull of the payoff vectors of all policies in $\mathbb{S}$, written as:

$$\mathcal{GS}(\mathbb{S}) := \Big\{\sum_{k} \alpha_k \mathbf{M}_k : \alpha_k \geq 0, \sum_{k} \alpha_k = 1, \ \mathbf{M}_k \text{ the k-th row of } \mathbf{M}\Big\}$$
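The construction of a meta-game payoff table from a game engine φ can be illustrated with a toy engine. This is a sketch under the assumption that the engine simply implements the Rock-Paper-Scissors rule; a real engine would run full game simulations.

import numpy as np

def phi(s1, s2):
    """Toy game engine: phi > 0 if s1 beats s2, < 0 if it loses, 0 for a tie.
    Strategies are encoded 0 = Rock, 1 = Paper, 2 = Scissors."""
    return float((s1 - s2) % 3 == 1) - float((s2 - s1) % 3 == 1)

population_1 = [0, 1, 2]   # player 1's current policy set
population_2 = [0, 1, 2]   # player 2's current policy set

# Meta-game payoff table M: simulate every policy combination once.
M = np.array([[phi(s1, s2) for s2 in population_2] for s1 in population_1])

# For a symmetric zero-sum meta-game, phi(s1, s2) = -phi(s2, s1).
assert np.allclose(M, -M.T)
print(M)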
In solving NFGs, fictitious play (FP) describes the learning process where each player chooses a best-response to their opponents’ time-average strategies, and the resulting strategies are guaranteed to converge to the NE in two-player zero-sum, or potential, games. Generalised weakened fictitious play (GWFP) (see Leslie, D. S. and Collins, E. J., “Generalised weakened fictitious play”, Games and Economic Behavior, 56(2):285-298, 2006) generalises FP by allowing for approximate best-responses and perturbed average strategy updates.

GWFP is a process of mixed strategies $\{p_t\}$ following the updating rule:

$$p^i_{t+1} \in (1 - \alpha_{t+1})\, p^i_t + \alpha_{t+1}\big(\mathrm{BR}^i_{\epsilon_t}(p^{-i}_t) + M^i_{t+1}\big) \qquad (4)$$

where $\alpha_t \to 0$ and $\epsilon_t \to 0$ as $t \to \infty$, $\sum_{t \geq 1} \alpha_t = \infty$, and $\{M_t\}$ is a sequence of perturbations that satisfies, $\forall T > 0$:

$$\lim_{t \to \infty}\ \sup_{k}\Big\{\Big\|\sum_{j=t}^{k-1} \alpha_{j+1} M_{j+1}\Big\| : \sum_{j=t}^{k-1} \alpha_{j+1} \leq T\Big\} = 0$$

GWFP recovers FP if $\alpha_t = 1/t$, $\epsilon_t = 0$ and $M_t = 0, \forall t$.
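A plain fictitious-play loop consistent with the update rule above (taking the 1/t-style averaging with ε_t = 0 and M_t = 0, i.e. classical FP) might be sketched as follows; this is an illustration, not the claimed method.

import numpy as np

def fictitious_play(G, iterations=5000):
    """Classical fictitious play on a zero-sum game with payoff matrix G
    (player 1 maximises G, player 2 maximises -G)."""
    n, m = G.shape
    counts1, counts2 = np.ones(n), np.ones(m)       # initial (arbitrary) play counts
    for t in range(1, iterations + 1):
        avg1, avg2 = counts1 / counts1.sum(), counts2 / counts2.sum()
        # Each player best-responds to the opponent's time-average strategy.
        br1 = int(np.argmax(G @ avg2))
        br2 = int(np.argmax(-(avg1 @ G)))
        counts1[br1] += 1
        counts2[br2] += 1
    return counts1 / counts1.sum(), counts2 / counts2.sum()

G = np.array([[0., -1., 1.], [1., 0., -1.], [-1., 1., 0.]])   # Rock-Paper-Scissors
p1, p2 = fictitious_play(G)
print(p1, p2)   # both time-averages approach the uniform Nash strategy (1/3, 1/3, 1/3)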
A general solver for open-ended (meta-)games involves an iterative process of solving the equilibrium (meta-)policy first, and then, based on the (meta-)policy, finding a new better-performing policy to augment the existing population. The (meta-)policy solver computes a joint (meta-)policy profile p based on the current payoff M (or G), where different solution concepts can be adopted (for example, NE or α-Rank). With p, each agent then finds a new best-response policy, which is equivalent to solving a single-player optimisation problem against the opponents’ (meta-)policies $p^{-i}$. One can regard a best-response policy as given by an Oracle, denoted by $\mathcal{O}$. In two-player zero-sum cases, an Oracle represents

$$\mathcal{O}^i(p^{-i}) := \arg\max_{S^i \in \mathbb{S}^i} G^i(S^i, p^{-i})$$

Generally, Oracles can be implemented through optimisation subroutines such as gradient-descent methods or RL algorithms. After a new policy is learned, the payoff table is expanded, and the missing entries will be filled by running new game simulations. The above process loops over each player at every iteration, and it terminates if no players can find new best-response policies (i.e., Eq. (1) reaches zero). Algorithm 1 in Figure 1 shows an exemplary algorithm for general meta-game solvers. The step of finding a new policy is shown in step 5.
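The loop of Algorithm 1 can be sketched for a small normal-form game with an exact best-response Oracle. The code below is illustrative only: it uses a linear-programming Nash meta-solver on the restricted payoff table, whereas the described approach may equally plug in α-Rank meta-solvers and RL-based Oracles.

import numpy as np
from scipy.optimize import linprog

def solve_zero_sum_nash(A):
    """Maximin (Nash) strategy of the row player for payoff matrix A, via an LP."""
    n, m = A.shape
    c = np.zeros(n + 1); c[-1] = -1.0                      # maximise the game value v
    A_ub = np.hstack([-A.T, np.ones((m, 1))])              # v - (A^T x)_j <= 0 for all j
    A_eq = np.hstack([np.ones((1, n)), np.zeros((1, 1))])  # x lies on the simplex
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(m), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * n + [(None, None)])
    return res.x[:n]

def meta_game_solver(G, max_iter=50):
    """Double-oracle style loop: solve the restricted meta-game, then expand each
    population with a best response to the opponent's meta-policy."""
    pop1, pop2 = [0], [0]                                  # start from arbitrary policies
    for _ in range(max_iter):
        M = G[np.ix_(pop1, pop2)]                          # restricted payoff table
        meta1 = solve_zero_sum_nash(M)                     # meta-policy of player 1
        meta2 = solve_zero_sum_nash(-M.T)                  # meta-policy of player 2
        br1 = int(np.argmax(G[:, pop2] @ meta2))           # Oracle: best response to meta2
        br2 = int(np.argmax(-(meta1 @ G[pop1, :])))        # Oracle: best response to meta1
        expanded = False
        if br1 not in pop1: pop1.append(br1); expanded = True
        if br2 not in pop2: pop2.append(br2); expanded = True
        if not expanded:                                   # no new best responses: terminate
            break
    return pop1, pop2

G = np.array([[0., -1., 1.], [1., 0., -1.], [-1., 1., 0.]])   # Rock-Paper-Scissors
print(meta_game_solver(G))   # both populations expand to cover all three strategies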
With the above notations, the prior art can be summarised in the table shown in Figure 2. These prior methods cannot promote behavioural diversity in large-scale games.
For two-player zero-sum games, smooth FP (Fudenberg, D. and Levine, D., “Consistency and cautious fictitious play”, Journal of Economic Dynamics and Control, 1995) is a solver that accounts for diversity through adopting a policy entropy term in the original FP (Brown et al., 1951, see above).
When the game size is large, Double Oracle (DO) (McMahan, H. B., Gordon, G. J., and Blum, A., “Planning in the presence of cost functions controlled by an adversary”, Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 536-543, 2003) provides an iterative method where agents progressively expand their policy pool by, at each iteration, adding one best-response versus the opponent’s Nash strategy.
PSRO generalises FP and DO via adopting an RL subroutine to approximate the best-response (Lanctot, M., Zambaldi, V., Gruslys, A., Lazaridou, A., Tuyls, K., Perolat, J., Silver, D., and Graepel, T., “A unified game-theoretic approach to multiagent reinforcement learning”, Advances in neural information processing systems, pp. 4190-4203, 2017). Pipeline-PSRO (McAleer, S., Lanier, J., Fox, R., and Baldi, P., “Pipeline psro: A scalable approach for finding approximate nash equilibria in large games”, arXiv preprint arXiv:2006.08555, 2020) trains multiple best-responses in parallel and efficiently solves games of size 10^50. PSRO_rN (Balduzzi, D., Garnelo, M., Bachrach, Y., Czarnecki, W., Perolat, J., Jaderberg, M., and Graepel, T., “Open-ended learning in symmetric zero-sum games”, ICML, volume 97, pp. 434-443. PMLR, 2019) is a specific variation of PSRO that accounts for diversity. However, it suffers from poor performance in a selection of tasks. Since computing NE is PPAD-hard, another important extension of PSRO is α-PSRO (Muller, P., Omidshafiei, S., Rowland, M., Tuyls, K., Perolat, J., Liu, S., Hennes, D., Marris, L., Lanctot, M., Hughes, E., et al., “A generalized training approach for multiagent learning”, International Conference on Learning Representations, 2019), which replaces NE with α-Rank. Yet, how to promote diversity in the context of α-PSRO is still unknown.
Using the method described herein, diversity-aware extensions of FP, PSRO and α-PSRO can be developed. The method introduces two primary differences over the prior art. Firstly, a distributed implementation of zero-sum game solvers is designed. At each iteration of the training process, the algorithm can search for multiple best-response policies. In contrast, prior methods only consider finding one new best-response policy at each iteration. Secondly, behavioural diversity of the many policies can be defined based on the determinantal point process. Therefore, algorithms can be developed that promote behavioural diversity when the many best-response policies are searched for.
Therefore, instead of choosing between amplifying strengths or overcoming weaknesses, a different approach is adopted of modelling the behavioural diversity in games.
As noted above, a DPP is a probabilistic framework that characterises how likely a subset of items is to be sampled from a ground set where diverse subsets are preferred.
Formally, for a ground set $\mathcal{Y} = \{1, 2, ..., M\}$, a DPP defines a probability measure $\mathbb{P}$ on the power set of $\mathcal{Y}$ (i.e., $2^{\mathcal{Y}}$) as follows. Given an $M \times M$ positive semi-definite (PSD) kernel $\mathcal{L}$ that measures the pairwise similarity for items in $\mathcal{Y}$, and letting $\mathbf{Y}$ be a random subset drawn from the DPP, the probability of sampling any $Y \subseteq \mathcal{Y}$ is written as:

$$\mathbb{P}_{\mathcal{L}}(\mathbf{Y} = Y) \propto \det(\mathcal{L}_Y),$$

where $\mathcal{L}_Y := [\mathcal{L}_{i,j}]_{i,j \in Y}$ denotes the submatrix of $\mathcal{L}$ whose entries are indexed by the items included in $Y$. Given a PSD kernel decomposed as $\mathcal{L} = WW^{\top}$ with $W \in \mathbb{R}^{M \times P}$, where each row $w_i$ represents a $P$-dimensional feature vector of item $i \in \mathcal{Y}$, the geometric meaning of $\det(\mathcal{L}_Y) = \mathrm{Vol}^2\!\left(\{w_i\}_{i \in Y}\right)$ is the squared volume of the parallelepiped spanned by the rows of $W$ that correspond to the sampled items in $Y$.
A PSD kernel ensures that all principal minors of $\mathcal{L}$ are non-negative, i.e. $\det(\mathcal{L}_Y) \geq 0$, which suffices for $\mathbb{P}_{\mathcal{L}}$ to be a proper probability distribution. The normaliser of $\mathbb{P}_{\mathcal{L}}$ can be computed as $\sum_{Y \subseteq \mathcal{Y}} \det(\mathcal{L}_Y) = \det(\mathcal{L} + I)$, where $I$ is the $M \times M$ identity matrix.
The entries of $\mathcal{L} = WW^{\top}$ are pairwise inner products between item feature vectors, $\mathcal{L}_{i,j} = w_i \cdot w_j$. The kernel can intuitively be thought of as representing dual effects: the diagonal elements $\mathcal{L}_{i,i}$ aim to capture the quality of item $i$, whereas the off-diagonal elements $\mathcal{L}_{i,j}$ capture the similarity between the items $i$ and $j$. A DPP models the repulsive connections among the items in a sampled subset. For example, for a two-item subset, since $\mathbb{P}_{\mathcal{L}}(\{i, j\}) \propto \mathcal{L}_{i,i}\mathcal{L}_{j,j} - \mathcal{L}_{i,j}\mathcal{L}_{j,i}$, if item $i$ and item $j$ are perfectly similar such that $w_i = w_j$, then $\det(\mathcal{L}_{\{i,j\}}) = 0$ and these two items will not co-occur; hence such a subset $Y = \{i, j\}$ will be sampled with probability zero.
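This repulsion property can be checked numerically. The following sketch (the toy feature matrix W is an arbitrary illustration, not taken from the document) computes the normalised DPP probability $\det(\mathcal{L}_Y)/\det(\mathcal{L}+I)$ and shows that a subset containing two identical items has probability zero.

```python
import numpy as np

def dpp_probability(W, subset):
    """P(Y = subset) = det(L_subset) / det(L + I) for the kernel L = W @ W.T."""
    L = W @ W.T
    idx = np.array(subset)
    L_sub = L[np.ix_(idx, idx)]
    return np.linalg.det(L_sub) / np.linalg.det(L + np.eye(len(W)))

# Three item feature vectors; items 1 and 2 are identical (w_1 == w_2).
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 1.0]])

print(dpp_probability(W, [0, 1]))  # orthogonal rows -> non-zero probability
print(dpp_probability(W, [1, 2]))  # duplicated rows -> determinant, and probability, is zero
```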
In embodiments of the present invention, the target is to find a population of diverse policies, with each of them performing differently from other policies due to their unique characteristics. Therefore, when modelling the behavioural diversity in games, the payoff matrix can be used to construct a DPP kernel so that the similarity between two policies depends on their performance in terms of payoffs against different types of opponents.
A game DPP (G-DPP) for each player is a DPP in which the ground set is the strategy population and the DPP kernel $\mathcal{L}$ is given by Eq. (10), a Gram matrix based on the payoff table $\mathbf{M}$ (see Figure 3):

$$\mathcal{L} = \mathbf{M}\mathbf{M}^{\top}. \quad (10)$$
For learning in open-ended games, it is desirable to keep adding diverse policies to the population. In other words, at each iteration, if a random sample is taken from the G-DPP that consists of all existing policies, it is desirable that the cardinality of such a random sample is large (since policies with similar payoff vectors are unlikely to co-occur). In this sense, a diversity measure can be designed based on the expected cardinality of random samples from a G-DPP, i.e. $\mathbb{E}_{Y \sim \mathbb{P}_{\mathcal{L}}}[|Y|]$. The diversity metric, defined as the expected cardinality of a G-DPP, can be computed in polynomial time by the following equation:

$$\mathrm{Diversity}(\mathcal{Y}) := \mathbb{E}_{Y \sim \mathbb{P}_{\mathcal{L}}}\left[|Y|\right] = \mathrm{Tr}\!\left(I - (\mathcal{L} + I)^{-1}\right). \quad (11)$$
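Eq. (11) is cheap to evaluate with standard linear algebra. The sketch below (the toy payoff tables are illustrative assumptions) builds the G-DPP kernel of Eq. (10) from a payoff table and returns the expected cardinality; a population of duplicated payoff vectors scores lower than one with orthogonal payoff vectors.

```python
import numpy as np

def gdpp_diversity(M):
    """Diversity of a population, Eq. (11): expected cardinality of the G-DPP
    whose kernel is the Gram matrix L = M @ M.T of the payoff table (Eq. (10))."""
    L = M @ M.T
    n = L.shape[0]
    return np.trace(np.eye(n) - np.linalg.inv(L + np.eye(n)))

# Toy payoff tables (rows = own strategies, columns = opponent strategies).
identical = np.array([[1.0, 0.0],
                      [1.0, 0.0]])   # two strategies with identical payoff vectors
orthogonal = np.array([[1.0, 0.0],
                       [0.0, 1.0]])  # two strategies with orthogonal payoff vectors

print(gdpp_diversity(identical))    # lower expected cardinality
print(gdpp_diversity(orthogonal))   # higher expected cardinality
```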
Figure 3 shows an example of a G-DPP. The squared volume of the grey cube 300 is equal to $\det(\mathcal{L}_Y)$. The payoff vectors of the three strategies are shown at 301, 302 and 303 respectively. Since the strategies whose payoff vectors are shown at 302 and 303 are similar, the shaded area 304 they span is small, and thus the probability of these two strategies co-occurring is low; i.e., the probability of selecting this pair (the shaded area 304) from the G-DPP is smaller than that of selecting a pair with orthogonal payoff vectors. In this example, the diversity values of Eq. (11) for the populations are 0, 1 and 1.2 respectively.
The diversity measure is therefore based on the expected cardinality of a determinantal point process.
An advantageous property of this diversity measure is that it is well defined even in the case when $\mathcal{Y}$ has duplicated policies. Dealing with redundant policies can be a challenge for game evaluation. Here, redundancy also prevents one from directly using $\det(\mathcal{L})$ as the diversity measure, because the determinant becomes zero with duplicated entries.
With the diversity measure of Eq. (11), diversity-aware learning algorithms can now be designed.
The classical FP approach can be expanded to a diverse version such that, at each iteration, the player not only considers a best-response, but also considers how this new strategy can help enrich the existing strategy pool after the update. Formally, the diverse FP method maintains the same update rule as Eq. (4), but with the best-response changing into one that augments the expected payoff against the opponent's empirical strategy with the diversity bonus of Eq. (11):

$$\mathrm{BR}^{i}\!\left(\pi^{-i}_{t}\right) = \arg\max_{\pi^{i}} \; \mathbb{E}_{a^{i} \sim \pi^{i},\; a^{-i} \sim \pi^{-i}_{t}}\!\left[\mathbf{M}^{i}\!\left(a^{i}, a^{-i}\right)\right] + \tau \cdot \mathrm{Diversity}\!\left(\mathcal{S}^{i}_{t} \cup \{\pi^{i}\}\right), \quad (12)$$

where $\tau$ is a tuneable constant. For diverse FP, the expected cardinality is guaranteed to be a strictly concave function. Therefore, Eq. (12) has a unique solution at each iteration.
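One possible reading of this diverse best-response, for a normal-form game with a finite strategy set, is sketched below. The helper names, the use of payoff rows as DPP features, and the choice of which strategies form the existing pool (chosen_rows) are assumptions made purely for illustration.

```python
import numpy as np

def expected_cardinality(vectors):
    """Expected cardinality of a DPP whose kernel is the Gram matrix of `vectors`."""
    V = np.array(vectors)
    L = V @ V.T
    n = L.shape[0]
    return np.trace(np.eye(n) - np.linalg.inv(L + np.eye(n)))

def diverse_best_response(M, opponent_mixture, chosen_rows, tau=0.5):
    """Pick the row strategy maximising expected payoff against the opponent's
    empirical mixture plus tau times the diversity of the already-chosen rows
    augmented with the candidate (a simplified reading of Eq. (12))."""
    scores = []
    for a in range(M.shape[0]):
        payoff = M[a] @ opponent_mixture
        diversity = expected_cardinality([M[r] for r in chosen_rows] + [M[a]])
        scores.append(payoff + tau * diversity)
    return int(np.argmax(scores))

# Rock-Paper-Scissors payoff table for the row player.
M = np.array([[0.0, -1.0, 1.0],
              [1.0, 0.0, -1.0],
              [-1.0, 1.0, 0.0]])
print(diverse_best_response(M, opponent_mixture=np.ones(3) / 3, chosen_rows=[0]))
```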
In solving open-ended games, at the $t$-th iteration, the algorithm maintains a population $\mathcal{S}^{i}_{t}$ of policies learned so far by player $i$. The goal here is to design an Oracle to train a new strategy $S_{\theta}$, parameterised by $\theta \in \mathbb{R}^{d}$ (for example, a deep neural network), which both maximises player $i$'s payoff and is diverse from all strategies in $\mathcal{S}^{i}_{t}$. Therefore, the ground set of the G-DPP at iteration $t$ can be defined to be the union of the existing population and the new model to add:

$$\mathcal{Y}^{i}_{t} = \mathcal{S}^{i}_{t} \cup \{S_{\theta}\}.$$
With the ground set at each iteration, the diversity measure can be computed by Eq. (11). Subsequently, the objective of an Oracle can be written as:

$$\max_{\theta} \;\; \mathbb{E}_{a^{-i} \sim \pi^{-i}}\!\left[\mathbf{M}^{i}\!\left(S_{\theta}, a^{-i}\right)\right] + \tau \cdot \mathrm{Diversity}\!\left(\mathcal{Y}^{i}_{t}\right),$$

where $\pi^{-i}$ is the policy of player two (the opponent); depending on the game solver, it can be NE, UNIFORM, etc.
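A hedged sketch of this Oracle objective follows: for a candidate parameter vector θ it evaluates the expected payoff against the opponent's meta-policy plus τ times the diversity of the augmented ground set. The interface (a play(policy, opponent) payoff function supplied by the game engine) is an assumption; in practice the objective would be maximised over θ with an RL or gradient-based subroutine rather than merely evaluated.

```python
import numpy as np

def oracle_objective(theta, play, own_pool, opponent_pool, opponent_meta, tau=0.5):
    """Sketch of the diverse Oracle objective: expected payoff of the candidate
    policy against the opponent's meta-strategy, plus tau times the diversity
    (Eq. (11)) of the existing population united with the candidate."""
    # Payoff vector of the candidate against every opponent policy.
    candidate_row = np.array([play(theta, opp) for opp in opponent_pool])
    exp_payoff = candidate_row @ opponent_meta

    # Payoff vectors of the existing population plus the candidate -> G-DPP kernel.
    rows = [np.array([play(p, opp) for opp in opponent_pool]) for p in own_pool]
    V = np.vstack(rows + [candidate_row])
    L = V @ V.T
    diversity = np.trace(np.eye(len(V)) - np.linalg.inv(L + np.eye(len(V))))
    return exp_payoff + tau * diversity
```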
The general solver may therefore approximate the Nash strategy in large-scale two-player zero-sum games.

Figure 4 shows an example of the pseudo-code for one implementation of the method. The steps concerning the learning of multiple best-response policies are indicated at 401, and those promoting diversity among all best-responses at 402.
Figure 5 illustrates a summary of the main goal of the approach. A black-box multi-agent game engine 501 takes as input a joint strategy, shown at 502, and outputs the reward 503. Using the described algorithm, shown at 504, the output is multiple “good” strategies, as shown at 505.
Figure 6 summarises an example of a computer-implemented method 600 for processing a two-agent system input to form an at least partially optimised output indicative of an action policy. At step 601, the method comprises receiving the two-agent system input, the two-agent system input comprising a definition of a two-agent system and defining behaviour patterns of two agents based on system states. At step 602, the method comprises receiving an indication of an input system state. At step 603, the method comprises performing an iterative machine learning process to estimate multiple aggregate functions, each representing the behaviour patterns of the two agents over a set of system states, wherein the multiple aggregate functions are determined in a single iteration of the iterative machine learning process.
The multiple at least partially optimised outputs each comprise a collectively optimal action policy for each of the two agents in the input system state and a Nash equilibrium behaviour pattern of the two agents in the input system state.
The procedural diagram for the training of the solver can be visualised in Figure 7. At each time step (i.e. each iteration of the iterative machine learning process), multiple new policies are trained in parallel; in the example shown in Figure 7, two policies are trained in parallel at a time. Each new policy is trained against all existing policies. The policy shown at 701 is fixed at all time steps. For example, the policy shown at 702 is trained against the policy shown at 701 at time step 0, leading to the new policy shown at 703 at time step 1, while the policy shown at 704 is trained against both of these, which leads to the policy shown at 705. During the training process, through a diverse Oracle function (denoted DBR), each newly generated policy is diverse in the sense that it differs from all existing policies. For example, the policy at 705 is diverse from the policies at 703 and 701 at time step 1. Once a newly generated policy converges in training, it is kept fixed and unchanged in the pool. In time step 2, the policy shown at 706 converges, and its parameters are fixed and stay unchanged in later time steps, as indicated at 707.
The approach described herein therefore offers a geometric interpretation of behavioural diversity for learning in game frameworks by introducing a new diversity measure built upon the expected cardinality of a DPP. The diversity metric can be used as part of a general solver for normal-form games and open-ended (meta-)games. The method can converge to NE and α-Rank in two-player games and offers theoretical guarantees of expanding the gamescapes.
Figure 8 summarises an example of the process performed as part of the step of performing an iterative machine learning process. The process comprises repeatedly performing the following steps until a predetermined level of convergence is reached. At step 801, the method comprises generating a set of random system states. The set of random system states may be initially generated based on a predetermined probability distribution. At step 802, the method comprises estimating based on the two-agent system input the behaviour patterns of the two agents in the system states. At step 803, the method comprises estimating an error between the estimated behaviour patterns and the behaviour patterns predicted by each of multiple predetermined candidate aggregate functions, the error representing the level of convergence. At step 804, the method comprises adapting the multiple predetermined candidate aggregate functions based on the estimated behaviour patterns. This iterative machine learning process can be used to enable the device to find suitable aggregate functions in a manageable time period.
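The loop of Figure 8 can be sketched as follows. The interface (a system(state) observation function, candidate objects exposing predict and update methods, a sample_states generator, and the batch size) is assumed purely for illustration and is not prescribed by the document.

```python
import numpy as np

def fit_aggregate_functions(system, candidates, sample_states,
                            tolerance=1e-3, max_iters=100, batch=32):
    """Sketch of the iterative machine learning process 800 (Figure 8)."""
    for _ in range(max_iters):
        states = sample_states(batch)                       # step 801: random system states
        observed = [system(s) for s in states]              # step 802: behaviour patterns
        errors = []                                         # step 803: per-candidate error
        for f in candidates:
            predicted = [f.predict(s) for s in states]
            errors.append(np.mean([np.linalg.norm(np.asarray(o) - np.asarray(p))
                                   for o, p in zip(observed, predicted)]))
        if max(errors) < tolerance:                         # predetermined convergence level
            break
        for f in candidates:                                # step 804: adapt the candidates
            for s, o in zip(states, observed):
                f.update(s, o)
    return candidates
```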
When the predetermined level of convergence is reached, each of the agents can implement a respective action of the at least partially optimised set of actions.
The iterative method is scalable for approximating Nash equilibria in two-player zero-sum games. As described above, the method preferably involves a parallel double-oracle scheme that is designed to find multiple best-responses in a distributed way at the same time. The preferred implementation of the method defines and promotes so-called behavioural diversity among the multiple best-responses based on a determinantal point process. The method has been shown in some embodiments to demonstrate state-of-the-art performance, outperforming existing baselines, in approximating Nash equilibrium in large-scale two-player zero-sum games.
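The parallel search for multiple best-responses within one iteration can be sketched with a thread pool. This is a simplification: a real deployment would typically distribute the trainings over processes or machines, and the seed argument passed to the Oracle is an assumed convention for obtaining different best-responses.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_best_responses(oracle, opponent_pool, opponent_meta, num_responses=2):
    """Launch several best-response trainings at the same iteration and collect
    the resulting policies; `oracle` is any best-response trainer (e.g. an RL
    subroutine) taking (opponent_pool, opponent_meta, seed)."""
    with ThreadPoolExecutor(max_workers=num_responses) as executor:
        futures = [executor.submit(oracle, opponent_pool, opponent_meta, seed)
                   for seed in range(num_responses)]
        return [f.result() for f in futures]
```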
Figure 9 shows a schematic diagram of a computing device 900 configured to implement the computer implemented method described above and its associated components. The device may comprise a processor 901 and a non-volatile memory 902. The system may comprise more than one processor and more than one memory. The memory may store data that is executable by the processor. The processor may be configured to operate in accordance with a computer program stored in non-transitory form on a machine readable storage medium. The computer program may store instructions for causing the processor to perform its methods in the manner described herein.
Other examples of applications of this approach in practical applications include but are not limited to: driverless cars/autonomous vehicles, unmanned locomotive devices, packet delivery and routing devices, computer servers and ledgers in blockchains. For example, the agents may be autonomous vehicles and the system states may be vehicular system states. The agents may be communications routing devices and the system states may be data flows. The agents may be data processing devices and the system states may be computation tasks.
The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in the light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problems disclosed herein, and without limitation to the scope of the claims. The applicant indicates that aspects of the present invention may consist of any such individual feature or combination of features. In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the invention.

Claims

1. A computer-implemented device (900) for processing a two-agent system input to form multiple at least partially optimised outputs, each output indicative of an action policy for each of the two agents, the device comprising one or more processors (901) configured to perform the steps of: receiving (601) the two-agent system input, the two-agent system input comprising a definition of a two-agent system and defining behaviour patterns of two agents based on system states; receiving (602) an indication of an input system state; and performing (603) an iterative machine learning process (800) to estimate multiple aggregate functions, each representing the behaviour patterns of the two agents over a set of system states, wherein the multiple aggregate functions are determined in a single iteration of the iterative machine learning process.
2. A device (900) as claimed in claim 1, wherein the processor (901) is configured to iteratively process the multiple aggregate functions for the input system state to estimate multiple at least partially optimised sets of actions for each of the two agents in the input system state.
3. A device (900) as claimed in claim 1 or claim 2, wherein the multiple aggregate functions are determined in a single iteration of the iterative machine learning process in parallel.
4. A device (900) as claimed in any preceding claim, wherein multiple aggregate functions are determined in each iteration of the machine learning process.
5. A device (900) as claimed in any preceding claim, wherein the iterative machine learning process (800) is performed so as to promote behavioural diversity among the multiple aggregate functions determined in each iteration.
6. A device (900) as claimed in any preceding claim, wherein the iterative machine learning process (800) is performed in dependence on a diversity measure, wherein the diversity measure is modelled by a determinantal point process.
7. A device (900) as claimed in any preceding claim, wherein the multiple at least partially optimised outputs each comprise a collectively optimal action policy for each of the two agents in the input system state.
8. A device (900) as claimed in any preceding claim, wherein the multiple at least partially optimised outputs each represent a Nash equilibrium behaviour pattern of the two agents in the input system state.
9. A device (900) as claimed in any preceding claim, wherein the step of performing an iterative machine learning process comprises repeatedly performing the following steps until a predetermined level of convergence is reached: generating (801) a set of random system states; estimating (802) based on the two-agent system input the behaviour patterns of the two agents in the system states; estimating (803) an error between the estimated behaviour patterns and the behaviour patterns predicted by each of multiple predetermined candidate aggregate functions, the error representing the level of convergence; and adapting (804) the multiple predetermined candidate aggregate functions based on the estimated behaviour patterns.
10. A device (900) as claimed in claim 9, wherein the set of random system states are generated based on a predetermined probability distribution.
11. A device (900) as claimed in any preceding claim, wherein the agents are autonomous vehicles and the system states are vehicular system states.
12. A device (900) as claimed in any of claims 1 to 10, wherein the agents are data processing devices and the system states are computation tasks.
13. A method (600) for processing a two-agent system input to form multiple at least partially optimised outputs, each indicative of an action policy, the method comprising the steps of: receiving (601) the two-agent system input, the two-agent system input comprising a definition of a two-agent system and defining behaviour patterns of two agents based on system states; receiving (602) an indication of an input system state; and performing (603) an iterative machine learning process to estimate multiple aggregate functions, each representing the behaviour patterns of the two agents over a set of system states, wherein the multiple aggregate functions are determined in a single iteration of the iterative machine learning process.
14. The method (600) of claim 13, further comprising the step of causing each of the agents to implement a respective action of the at least partially optimised set of actions.
15. A computer readable medium (902) storing in non-transient form a set of instructions for causing one or more processors to perform the method (600) of claim 13 or 14.
PCT/EP2021/058392 2021-03-31 2021-03-31 Device and method for approximating nash equilibrium in two-player zero-sum games WO2022207087A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
PCT/EP2021/058392 WO2022207087A1 (en) 2021-03-31 2021-03-31 Device and method for approximating nash equilibrium in two-player zero-sum games
EP21717001.8A EP4298552A1 (en) 2021-03-31 2021-03-31 Device and method for approximating nash equilibrium in two-player zero-sum games
CN202180096388.8A CN117083617A (en) 2021-03-31 2021-03-31 Apparatus and method for approximating Nash equalization in two-person zero and gaming

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2021/058392 WO2022207087A1 (en) 2021-03-31 2021-03-31 Device and method for approximating nash equilibrium in two-player zero-sum games

Publications (1)

Publication Number Publication Date
WO2022207087A1 true WO2022207087A1 (en) 2022-10-06

Family

ID=75426585

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2021/058392 WO2022207087A1 (en) 2021-03-31 2021-03-31 Device and method for approximating nash equilibrium in two-player zero-sum games

Country Status (3)

Country Link
EP (1) EP4298552A1 (en)
CN (1) CN117083617A (en)
WO (1) WO2022207087A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220410878A1 (en) * 2021-06-23 2022-12-29 International Business Machines Corporation Risk sensitive approach to strategic decision making with many agents

Non-Patent Citations (32)

* Cited by examiner, † Cited by third party
Title
BAKER, B.KANITSCHEIDER, I.MARKOV, T.WU, Y.POWELL, G.MCGREW, B.MORDATCH, I.: "Emergent tool use from multi-agent autocurricula", INTERNATIONAL CONFERENCE ON LEARNING REPRESENTATIONS, 2019
BALDUZZI, D.GARNELO, M.BACHRACH, Y.CZARNECKI, W.PEROLAT, J.JADERBERG, M.GRAEPEL, T.: "Open-ended learning in symmetric zero-sum games", ICML, vol. 97, 2019, pages 434 - 443
BALDUZZI, D.RACANIERE, S.MARTENS, J.FOERSTER, J.TUYLS, K.GRAEPEL, T.: "The mechanics of n-player differentiable games", ICML, vol. 80, 2018, pages 363 - 372
BANZHAF, W.BAUMGAERTNER, B.BESLON, G.DOURSAT, R.FOSTER, J. A.MCMULLIN, B.DE MELO, V. V.MICONI, T.SPECTOR, L.STEPNEY, S. ET AL.: "Defining and simulating open-ended novelty: requirements, guidelines, and challenges", THEORY IN BIOSCIENCES, vol. 135, no. 3, 2016, pages 131 - 161, XP036050102, DOI: 10.1007/s12064-016-0229-7
BROWN, G. W.: "Iterative solution of games by fictitious play", ACTIVITY ANALYSIS OF PRODUCTION AND ALLOCATION, vol. 13, no. 1, 1951, pages 374 - 376
CANDOGAN, O.MENACHE, I.OZDAGLAR, A.PARRILO, P. A.: "Flows and decompositions of games: Harmonic and potential games", MATHEMATICS OF OPERATIONS RESEARCH, vol. 36, no. 3, 2011, pages 474 - 503
CZARNECKI, W. M.GIDEL, G.TRACEY, B.TUYLS, K.OMIDSHAFIEI, S.BALDUZZI, D.JADERBERG, M.: "Real world games look like spinning tops", ARXIV, PP. ARXIV-2004, 2020
DAVID BALDUZZI ET AL: "Open-ended Learning in Symmetric Zero-sum Games", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 23 January 2019 (2019-01-23), XP081007431 *
DAVIS, T.BURCH, N.BOWLING, M.: "Using response functions to measure strategy strength", PROCEEDINGS OF THE AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, vol. 28, 2014
DURHAM, W. H.: "Coevolution: Genes, culture, and human diversity", 1991, STANFORD UNIVERSITY PRESS
FUDENBERG, D.LEVINE, D.: "Consistency and cautious fictitious play", JOURNAL OF ECONOMIC DYNAMICS AND CONTROL, 1995
JADERBERG, M.CZARNECKI, W. M.DUNNING, I.MARRIS, L.LEVER, G.CASTANEDA, A. G.BEATTIE, C.RABINOWITZ, N. C.MORCOS, A. S.RUDERMAN, A. E: "Human-level performance in 3d multiplayer games with population based reinforcement learning", SCIENCE, vol. 364, no. 6443, 2019, pages 859 - 865
KURACH, K.RAICHUK, A.STANCZYK, P.ZAJAC, M.BACHEM, O.ESPEHOLT, L.RIQUELME, C.VINCENT, D.MICHALSKI, M.BOUSQUET, O. ET AL.: "Google research football: A novel reinforcement learning environment", PROCEEDINGS OF THE AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, vol. 34, 2020, pages 4501 - 4510
LANCTOT, M.ZAMBALDI, V.GRUSLYS, A.LAZARIDOU, A.TUYLS, K.PEROLAT, J.SILVER, D.GRAEPEL, T.: "A unified game-theoretic approach to multiagent reinforcement learning", ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS, 2017, pages 4190 - 4203
LEHMAN, J.STANLEY, K. O.: "Exploiting open-endedness to solve problems through the search for novelty", ALIFE, 2008, pages 329 - 336
LEIBO, J. Z.HUGHES, E.LANCTOT, M.GRAEPEL, T.: "Autocurricula and the emergence of innovation from social interaction: A manifesto for multi-agent intelligence research", ARXIV, 2019, pages arXiv-1903
LESLIE, D. S.COLLINS, E. J.: "Generalised weakened fictitious play", GAMES AND ECONOMIC BEHAVIOR, vol. 56, no. 2, 2006, pages 285 - 298, XP024911102, DOI: 10.1016/j.geb.2005.08.005
LIU, S.LEVER, G.MEREL, J.TUNYASUVUNAKOOL, S.HEESS, N.GRAEPEL, T.: "Emergent coordination through competition", INTERNATIONAL CONFERENCE ON LEARNING REPRESENTATIONS, 2018
MACCHI, O.: "Transactions of the Seventh Prague Conference on Information Theory, Statistical Decision Functions, Random Processes and of the 1974 European Meeting of Statisticians", 1977, SPRINGER, article "The fermion process - a model of stochastic point process with repulsive points", pages: 391 - 398
MCALEER, S.LANIER, J.FOX, R.BALDI, P.: "Pipeline psro: A scalable approach for finding approximate nash equilibria in large games", ARXIV PREPRINT ARXIV:2006.08555, 2020
MCMAHAN, H. B.GORDON, G. J.BLUM, A.: "Planning in the presence of cost functions controlled by an adversary", PROCEEDINGS OF THE 20TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING (ICML-03, 2003, pages 536 - 543, XP008098344
MULLER, P.OMIDSHAFIEI, S.ROWLAND, M.TUYLS, K.PEROLAT, J.LIU, S.HENNES, D.MARRIS, L.LANCTOT, M.HUGHES, E. ET AL.: "A generalized training approach for multiagent learning", INTERNATIONAL CONFERENCE ON LEARNING REPRESENTATIONS, 2019
NASH, J. F. ET AL.: "Equilibrium points in n-person games", PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES, vol. 36, no. 1, 1950, pages 48 - 49
OMIDSHAFIEI, S.PAPADIMITRIOU, C.PILIOURAS, G.TUYLS, K.ROWLAND, M.LESPIAU, J.-B.CZARNECKI, W. M.LANCTOT, M.PEROLAT, J.MUNOS, R.: "a-rank: Multi-agent evaluation by evolution", SCIENTIFIC REPORTS, vol. 9, no. 1, 2019, pages 1 - 29
PAREDIS, J.: "Coevolutionary computation", ARTIFICIAL LIFE, vol. 2, no. 4, 1995, pages 355 - 375
PEREZ NIEVES NICOLAS ET AL: "Modelling Behavioural Diversity for Learning in Open-Ended Games", 14 March 2021 (2021-03-14), pages 1 - 28, XP055871842, Retrieved from the Internet <URL:https://arxiv.org/pdf/2103.07927v1.pdf> [retrieved on 20211210] *
STANDISH, R. K: "Open-ended artificial evolution", INTERNATIONAL JOURNAL OF COMPUTATIONAL INTELLIGENCE AND APPLICATIONS, vol. 3, no. 02, 2003, pages 167 - 175
STEPHEN MCALEER ET AL: "Pipeline PSRO: A Scalable Approach for Finding Approximate Nash Equilibria in Large Games", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 18 February 2021 (2021-02-18), XP081882238 *
VINYALS, O.BABUSCHKIN, I.CHUNG, J.MATHIEU, M.JADERBERG, M.CZARNECKI, W. M.DUDZIK, A.HUANG, A.GEORGIEV, P.POWELL, R. ET AL.: "Alphastar: Mastering the real-time strategy game starcraft", DEEPMIND BLOG, vol. 2, 2019
VINYALS, O.BABUSCHKIN, I.CZARNECKI, W. M.MATHIEU, M.DUDZIK, A.CHUNG, J.CHOI, D. H.POWELL, R.EWALDS, T.GEORGIEV, P. ET AL.: "Grandmaster level in starcraft ii using multi-agent reinforcement learning", NATURE, vol. 575, no. 7782, 2019, pages 350 - 354, XP036927623, DOI: 10.1038/s41586-019-1724-z
YE, D.CHEN, G.ZHANG, W.CHEN, S.YUAN, B.LIU, B.CHEN, J.LIU, Z.QIU, F.YU, H. ET AL.: "Towards playing full moba games with deep reinforcement learning", ARXIV E-PRINTS, 2020, pages arXiv-2011


Also Published As

Publication number Publication date
CN117083617A (en) 2023-11-17
EP4298552A1 (en) 2024-01-03


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21717001

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 202180096388.8

Country of ref document: CN

WWE Wipo information: entry into national phase

Ref document number: 2021717001

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2021717001

Country of ref document: EP

Effective date: 20230928

NENP Non-entry into the national phase

Ref country code: DE