WO2022207087A1 - Device and method for approximating nash equilibrium in two-player zero-sum games - Google Patents
Device and method for approximating Nash equilibrium in two-player zero-sum games
- Publication number
- WO2022207087A1 (PCT/EP2021/058392)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- agents
- input
- machine learning
- learning process
- aggregate functions
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/004—Artificial life, i.e. computing arrangements simulating life
- G06N3/006—Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Definitions
- This invention relates to a computer-implemented device and method for application in two-player zero-sum game frameworks, particularly to approximating Nash equilibria and promoting the diversity of policies in such frameworks.
- a desirable configuration is known as a fixed point. This is a configuration in which no agent can improve their payoff by unilaterally changing their current policy behaviour. This concept is known as a Nash equilibrium (NE).
- a simple example of a two-player zero-sum game is Rock-Paper-Scissors, where Rock beats Scissors, Scissors beats Paper, and Paper beats Rock. It is easy to see that the Nash equilibrium is to play the three strategies uniformly (1/3, 1/3, 1/3). When a player plays the Nash strategy, he can no longer be exploited.
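- A minimal sketch of this check: against the uniform strategy and the standard antisymmetric Rock-Paper-Scissors payoff matrix, every pure response earns the same zero payoff, so no deviation is profitable.

```python
import numpy as np

# Antisymmetric payoff matrix for the row player; rows/columns ordered
# as [Rock, Paper, Scissors].
M = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]])

nash = np.array([1/3, 1/3, 1/3])

# Against the uniform Nash strategy every pure response earns the same
# payoff (zero), so no unilateral deviation can exploit it.
print(M @ nash)         # -> [0. 0. 0.]
print(nash @ M @ nash)  # game value: 0.0
```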
- in two-player zero-sum games such as Texas Hold'em Poker or StarCraft, where the strategy space is much larger (for example, StarCraft has about 10^26 atomic actions at every time step), it is necessary to design approximate solvers to compute the Nash equilibrium.
- the transitive part of a game represents the structure in which the rule of winning is transitive (i.e., if strategy A beats B and B beats C, then A beats C), and the non-transitive part refers to the structure in which the set of strategies follows a cyclic rule (for example, the endless cycles among Rock, Paper and Scissors). Diversity matters especially for the non-transitive part, simply because there is no consistent winner in that part of a game: if a player only plays Rock, he can be exploited by Paper, but not so if he has a diverse strategy set of Rock and Scissors.
- a computer-implemented device for processing a two-agent system input to form multiple at least partially optimised outputs, each output indicative of an action policy for each of the two agents
- the device comprising one or more processors configured to perform the steps of: receiving the two-agent system input, the two-agent system input comprising a definition of a two-agent system and defining behaviour patterns of two agents based on system states; receiving an indication of an input system state; and performing an iterative machine learning process to estimate multiple aggregate functions, each representing the behaviour patterns of the two agents over a set of system states, wherein the multiple aggregate functions are determined in a single iteration of the iterative machine learning process.
- This may allow for an iterative method that is scalable for approximating Nash equilibria in two-player zero-sum game frameworks.
- the multiple aggregate functions may correspond to multiple best-response policies for the agents.
- the processor may be configured to iteratively process the multiple aggregate functions for the input system state to estimate multiple at least partially optimised set of actions for each of the two agents in the input system state.
- the multiple aggregate functions may be iteratively processed until a predefined level of convergence is reached.
- the multiple aggregate functions may be determined in a single iteration of the iterative machine learning process in parallel.
- the device may implement a parallel double-oracle scheme that is designed to find multiple best-response policies in a distributed way at the same time.
- Multiple aggregate functions may be determined in each iteration of the machine learning process.
- the multiple aggregate functions may be refined in subsequent iterations of the iterative machine learning process. This may allow the device to keep finding best-response strategies in an iterative manner.
- the iterative machine learning process may be performed so as to promote behavioural diversity among the multiple aggregate functions determined in each iteration. Promoting diversity of best-response policies may strengthen the performance of a model trained by the iterative machine learning process.
- the iterative machine learning process may be performed in dependence on a diversity measure.
- the diversity measure may be modelled by a determinantal point process.
- the diversity measure may be based on the expected cardinality of a determinantal point process. This may allow diverse best-response policies to be determined.
- the multiple at least partially optimised outputs may each comprise a collectively optimal action policy for each of the two agents in the input system state. This may allow for optimal behaviour of the agents.
- the multiple at least partially optimised outputs may each represent a Nash equilibrium behaviour pattern of the two agents in the input system state. This may allow policies corresponding to the Nash equilibrium to be learned.
- the step of performing an iterative machine learning process may comprise repeatedly performing the following steps until a predetermined level of convergence is reached: generating a set of random system states; estimating based on the two-agent system input the behaviour patterns of the two agents in the system states; estimating an error between the estimated behaviour patterns and the behaviour patterns predicted by each of multiple predetermined candidate aggregate functions, the error representing the level of convergence; and adapting the multiple predetermined candidate aggregate functions based on the estimated behaviour patterns. This can enable the device to find suitable aggregate functions in a manageable time period.
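- A minimal sketch of this loop, assuming illustrative `system`, `candidates` and `sample_states` interfaces (none of these names are defined in the patent):

```python
import numpy as np

def fit_aggregate_functions(system, candidates, sample_states, tol=1e-3, max_iters=1000):
    """Sketch of the iterative process: sample random system states, estimate the
    two agents' behaviour patterns from the system definition, measure each
    candidate aggregate function's prediction error, and adapt the candidates
    until the error (the convergence level) falls below a threshold."""
    for _ in range(max_iters):
        states = sample_states()                     # generate a set of random system states
        targets = system.behaviour_patterns(states)  # estimated behaviour of the two agents
        errors = [np.mean((f.predict(states) - targets) ** 2) for f in candidates]
        if max(errors) < tol:                        # predefined level of convergence reached
            break
        for f in candidates:
            f.adapt(states, targets)                 # e.g. one gradient step toward the targets
    return candidates
```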
- the set of random system states may be generated based on a predetermined probability distribution. This may be convenient for generating the system states.
- the agents may be autonomous vehicles and the system states may be vehicular system states. This may allow the device to be implemented in a driverless car.
- the agents may be data processing devices and the system states may be computation tasks. This may allow the device to be implemented in a communication system.
- a method for processing a two-agent system input to form multiple at least partially optimised outputs, each indicative of an action policy, the method comprising the steps of: receiving the two-agent system input, the two-agent system input comprising a definition of a two-agent system and defining behaviour patterns of two agents based on system states; receiving an indication of an input system state; and performing an iterative machine learning process to estimate multiple aggregate functions, each representing the behaviour patterns of the two agents over a set of system states, wherein the multiple aggregate functions are determined in a single iteration of the iterative machine learning process.
- the method may allow for an iterative method that is scalable for approximating Nash equilibria in two-player zero-sum game frameworks.
- the method may further comprise the step of causing each of the agents to implement a respective action of the at least partially optimised set of actions. This can result in efficient operation of the agents. In this way the method can be used to control the actions of a physical entity.
- according to a third aspect, there is provided a computer-readable medium storing in non-transient form a set of instructions for causing one or more processors to perform the method described above.
- the method may be performed by a computer system comprising one or more processors programmed with executable code stored non-transiently in one or more memories.
- Figure 1 shows an algorithm for general meta-game solvers.
- Figure 2 shows a summary of prior methods.
- Figure 3 shows an example of a determinantal point process.
- Figure 4 shows an example of the pseudo-code for one implementation of the method described herein.
- Figure 5 schematically illustrates the main goal of the approach described herein.
- Figure 6 summarises an example of a method of processing a two-agent system input to form multiple at least partially optimised outputs, each indicative of an action policy.
- Figure 7 schematically illustrates an example of a procedural diagram of the training of the solver.
- Figure 8 summarises an example of the steps of the iterative machine learning process described herein.
- Figure 9 shows an example of a computing device configured to perform the methods described herein.
- Described herein is a computer-implemented device and method for application in two-player zero-sum game frameworks, implementing a general Nash solver suitable for large-scale two-player zero-sum games.
- the approach can provide a parallel implementation to keep finding best-response strategies for the two agents in an iterative manner. Furthermore, the approach can find policies that are diverse in behaviours. In other words, the solver promotes behavioural diversity during the learning process.
- a DPP is a type of point process, which measures the probability of selecting a random subset from a ground set where only diverse subsets are desired.
- DPPs have origins in modelling repulsive quantum particles in physics (see Macchi, O., "The fermion process - a model of stochastic point process with repulsive points", Transactions of the Seventh Prague Conference on Information Theory, Statistical Decision Functions, Random Processes and of the 1974 European Meeting of Statisticians, pp. 391-398, Springer, 1977).
- the expected cardinality of a DPP is formulated as the diversity metric.
- the diversity metric is a general tool for game solvers.
- the diversity metric is incorporated into the best-response dynamics, giving diversity-aware extensions of fictitious play (FP) (see Brown, G. W., "Iterative solution of games by fictitious play", Activity Analysis of Production and Allocation, 13(1): 374-376, 1951).
- maximising the DPP-based diversity metric guarantees an expansion of the gamescape (convex polytopes spanned by agents’ mixtures of policies).
- the diversity-aware learning methods may converge to the respective solution concepts of Nash equilibrium and α-Rank (see Omidshafiei, S., Papadimitriou, C., Piliouras, G., Tuyls, K., Rowland, M., Lespiau, J.-B., Czarnecki, W. M., Lanctot, M., Perolat, J., and Munos, R., "α-rank: Multi-agent evaluation by evolution", Scientific Reports, 9(1): 1-29, 2019) in two-player games.
- a further preferred implementation of the method involves a parallel double-oracle scheme that is designed to find multiple best-responses in a distributed way at the same time.
- the method defines and promotes behavioural diversity among the multiple best-response policies using a distributed version of solvers where at each iteration, multiple best- responses can be found in one iteration.
- Nash equilibrium exists in all finite games (see Nash, J. F. et al., "Equilibrium points in n-person games", Proceedings of the National Academy of Sciences, 36(1): 48-49, 1950).
- the NE is a joint mixed-strategy profile $\pi^*$ in which each player $i \in N$ plays a best-response to the strategies of the other players.
- The framework of NFGs is often limited in describing real-world games. In solving games such as StarCraft or Go, it is inefficient to list all atomic actions. Instead, of more interest are games at the policy level, where a policy can be a "higher-level" strategy (e.g., an RL model powered by a DNN); the resulting game is a meta-game. A meta-game payoff table, M, is constructed by simulating games that cover different policy combinations.
- the policy set is, e.g., a population of deep RL models;
- the meta-policy is a distribution over the policy set, e.g., player i plays [RL-Model 1, RL-Model 2] with probabilities [0.3, 0.7];
- $\pi$ is a joint meta-policy profile.
- Meta-games are often open-ended because there could exist an infinite number of policies to play a game. The openness also refers to the fact that new strategies will be continuously discovered and added to agents’ policy sets during training; the dimension of M will grow.
- Fictitious play. In solving NFGs, fictitious play (FP) describes the learning process where each player chooses a best-response to their opponents' time-average strategies; the resulting strategies are guaranteed to converge to the NE in two-player zero-sum games or potential games.
- Generalised weakened fictitious play (GWFP) (see Leslie, D. S. and Collins, E. J., “Generalised weakened fictitious play”, Games and Economic Behavior , 56(2):285-298, 2006) generalises FP by allowing for approximate best-responses and perturbed average strategy updates.
- GWFP is a process following an updating rule of the form: $\pi_{t+1}^{i} \in (1 - \alpha_{t+1})\,\pi_{t}^{i} + \alpha_{t+1}\big(\mathcal{BR}_{\epsilon_{t}}^{i}(\pi_{t}^{-i}) + M_{t+1}^{i}\big)$, where $\alpha_{t} \to 0$ and $\epsilon_{t} \to 0$ as $t \to \infty$, $\sum_{t} \alpha_{t} = \infty$, and $\{M_{t}\}$ is a sequence of perturbations satisfying a vanishing-noise condition; GWFP recovers ordinary FP when $\alpha_{t} = 1/t$, $\epsilon_{t} = 0$ and $M_{t} = 0$.
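- A minimal sketch (not the patent's algorithm) of ordinary FP on Rock-Paper-Scissors, showing the players' time-average strategies approaching the uniform Nash equilibrium:

```python
import numpy as np

# Rock-Paper-Scissors payoff for the row player; the column player's payoff is -M.
M = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])
counts = [np.ones(3), np.ones(3)]  # pseudo-counts of each player's past actions

for _ in range(100000):
    avg = [c / c.sum() for c in counts]       # opponents' time-average strategies
    counts[0][np.argmax(M @ avg[1])] += 1     # row player's best-response
    counts[1][np.argmax(-(avg[0] @ M))] += 1  # column player's best-response

print(counts[0] / counts[0].sum())  # approaches the Nash strategy [1/3, 1/3, 1/3]
```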
- a general solver for open-ended (meta-)games involves an iterative process of solving the equilibrium (meta-)policy first, and then, based on the (meta-)policy, finding a new better-performing policy to augment the existing population.
- the (meta-)policy solver computes a joint (meta-)policy profile $\pi$ based on the current payoff table M (or G), where different solution concepts can be adopted (for example, NE or α-Rank).
- In two-player zero-sum cases, an Oracle represents a best-response computation against the opponent's current (meta-)policy. Generally, Oracles can be implemented through optimisation subroutines such as gradient-descent methods or RL algorithms. After a new policy is learned, the payoff table is expanded, and the missing entries are filled by running new game simulations. The above process loops over each player at every iteration, and it terminates if no player can find a new best-response policy (i.e., Eq. (1) reaches zero).
- Algorithm 1 in Figure 1 shows an exemplary algorithm for general meta-game solvers. The step of finding a new policy is shown in step 5.
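- A hedged sketch of this solver loop; `simulate`, `meta_solver` and `oracle` are assumed callables standing in for the game simulator, the (meta-)policy solver and the best-response Oracle, not an API taken from the patent:

```python
import numpy as np

def meta_game_solver(simulate, meta_solver, oracle, pops, max_iters=50):
    """Sketch of Algorithm 1: repeatedly solve the meta-game on the current payoff
    table, then ask an Oracle for a new best-response policy per player; stop when
    no player can find one."""
    meta = None
    for _ in range(max_iters):
        # Fill the payoff table by simulating all policy combinations seen so far.
        M = np.array([[simulate(a, b) for b in pops[1]] for a in pops[0]])
        meta = meta_solver(M)  # joint (meta-)policy profile, e.g. NE of the meta-game
        improved = False
        for i in (0, 1):
            new_policy = oracle(i, pops, meta)  # step 5 of Algorithm 1: find a new policy
            if new_policy is not None:
                pops[i].append(new_policy)
                improved = True
        if not improved:  # no player found a new best-response (Eq. (1) reaches zero)
            break
    return pops, meta
```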
- smooth FP (Fudenberg, D. and Levine, D., "Consistency and cautious fictitious play", Journal of Economic Dynamics and Control, 1995) is a solver that accounts for diversity by adopting a policy entropy term in the original FP (Brown, 1951, see above).
- Double Oracle (DO) (McMahan, H. B., Gordon, G. J., and Blum, A., “Planning in the presence of cost functions controlled by an adversary”, Proceedings of the 20th International Conference on Machine Learning (ICML-03), pp. 536-543, 2003) provides an iterative method where agents progressively expand their policy pool by, at each iteration, adding one best-response versus the opponent’s Nash strategy.
- PSRO generalises FP and DO via adopting a RL subroutine to approximate the best-response (Lanctot, M., Zambaldi, V., Gruslys, A., Lazaridou, A., Tuyls, K., Perolat, J., Silver, D., and Graepel, T., “A unified game-theoretic approach to multiagent reinforcement learning”, Advances in neural information processing systems, pp. 4190-4203, 2017).
- Pipeline-PSRO (McAleer, S., Lanier, J., Fox, R., and Baldi, P., "Pipeline PSRO: A scalable approach for finding approximate Nash equilibria in large games", arXiv preprint arXiv:2006.08555, 2020) trains multiple best-responses in parallel and efficiently solves games of size 10^50.
- PSRO_rN (Balduzzi, D., Garnelo, M., Bachrach, Y., Czarnecki, W., Perolat, J., Jaderberg, M., and Graepel, T., "Open-ended learning in symmetric zero-sum games", ICML, volume 97, pp. 434-443, PMLR, 2019) promotes diversity among best-responses via a rectified-Nash objective.
- α-PSRO (Muller, P., Omidshafiei, S., Rowland, M., Tuyls, K., Perolat, J., Liu, S., Hennes, D., Marris, L., Lanctot, M., Hughes, E., et al., "A generalized training approach for multiagent learning", International Conference on Learning Representations, 2019) replaces NE with α-Rank. Yet, how to promote diversity in the context of α-PSRO is still unknown.
- a DPP is a probabilistic framework that characterises how likely a subset of items is to be sampled from a ground set where diverse subsets are preferred.
- a DPP defines a probability measure $P$ on the power set of $\mathcal{Y}$ (i.e., $2^{\mathcal{Y}}$) such that, given an $M \times M$ positive semi-definite (PSD) kernel $\mathcal{L}$ that measures the pairwise similarity of the items in $\mathcal{Y}$, and letting $\mathbf{Y}$ be a random subset drawn from the DPP, the probability of sampling $Y \subseteq \mathcal{Y}$ is written as: $P(\mathbf{Y} = Y) \propto \det(\mathcal{L}_Y)$, where $\mathcal{L}_Y$ denotes the submatrix of $\mathcal{L}$ whose entries are indexed by the items included in $Y$.
- when the kernel is written as a Gram matrix $\mathcal{L} = WW^\top$, each row $w_i$ represents a $P$-dimensional feature vector of item $i \in \mathcal{Y}$.
- the geometric meaning of $\det(\mathcal{L}_Y)$ is the squared volume of the parallelepiped spanned by the rows of $W$ that correspond to the sampled items in $Y$.
- a PSD kernel ensures that all principal minors of $\mathcal{L}$ are non-negative, which suffices for $P$ to be a proper probability distribution.
- the normaliser of $P$ can be computed in closed form as $\sum_{Y \subseteq \mathcal{Y}} \det(\mathcal{L}_Y) = \det(\mathcal{L} + I)$, where $I$ is the $M \times M$ identity matrix.
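- A small numerical check of this identity (a sketch using a random Gram kernel, with the determinant of the empty submatrix taken to be 1):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
L = W @ W.T  # a random PSD Gram kernel on M = 4 items

# Sum det(L_Y) over all 2^4 subsets Y of the ground set.
total = sum(
    np.linalg.det(L[np.ix_(Y, Y)])
    for k in range(5)
    for Y in combinations(range(4), k)
)
print(np.isclose(total, np.linalg.det(L + np.eye(4))))  # True
```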
- the entries of $\mathcal{L} = WW^\top$ are pairwise inner products between item feature vectors.
- the kernel can intuitively be thought of as representing dual effects: the diagonal elements $\mathcal{L}_{ii}$ aim to capture the quality of item $i$, whereas the off-diagonal elements $\mathcal{L}_{ij}$ capture the similarity between items $i$ and $j$.
- a DPP models the repulsive connections among the items in a sampled subset. For example, for a two-item subset $\{i, j\}$: $P(\mathbf{Y} = \{i, j\}) \propto \mathcal{L}_{ii}\mathcal{L}_{jj} - \mathcal{L}_{ij}^{2}$, so two similar items are unlikely to co-occur.
- the target is to find a population of diverse policies, with each of them performing differently from other policies due to their unique characteristics. Therefore, when modelling the behavioural diversity in games, the payoff matrix can be used to construct a DPP kernel so that the similarity between two policies depends on their performance in terms of payoffs against different types of opponents.
- a game DPP (G-DPP) for each player is a DPP in which the ground set is the strategy population and the DPP kernel $\mathcal{L}$ is given by Eq. (10), a Gram matrix based on the payoff table $M$ (see Figure 3): $\mathcal{L} = MM^\top$.
- a diversity measure can be designed based on the expected cardinality of random samples from a G-DPP, i.e. $\mathrm{Div} := \mathbb{E}_{Y \sim \text{G-DPP}}\big[|Y|\big]$.
- the diversity metric, defined as the expected cardinality of a G-DPP, can be computed in $O(M^3)$ time by the following equation (Eq. (11)): $\mathbb{E}\big[|Y|\big] = \mathrm{Tr}\big(I - (\mathcal{L} + I)^{-1}\big)$.
- Figure 3 shows an example of a G-DPP.
- the squared volume of the grey cube 300 is equal to the determinant $\det(\mathcal{L}_Y)$ of the corresponding G-DPP submatrix.
- the payoff vectors of the three strategies are shown at 301, 302 and 303 respectively. Since the second and third strategies have similar payoff vectors (302 and 303), this leads to a smaller shaded area 304, and thus the probability of these two strategies co-occurring is low: the probability of selecting this pair (the shaded area 304) from the G-DPP is smaller than that of selecting a pair with orthogonal payoff vectors.
- the diversities in Eq. (11) of the populations are 0, 1 and 1.2 respectively.
- the diversity measure is therefore based on the expected cardinality of a determinantal point process.
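- A minimal sketch of this computation, with an illustrative payoff table (the kernel follows the Gram construction of Eq. (10)):

```python
import numpy as np

def diversity(L):
    """Expected cardinality of a DPP with kernel L:
    E[|Y|] = Tr(I - (L + I)^{-1}), an O(M^3) dense computation (Eq. (11))."""
    m = L.shape[0]
    return np.trace(np.eye(m) - np.linalg.inv(L + np.eye(m)))

# Illustrative payoff table: the G-DPP kernel is the Gram matrix L = M M^T.
payoffs = np.array([[1.0, 0.0],
                    [0.9, 0.1],   # near-duplicate payoff vector: little diversity gain
                    [0.0, 1.0]])  # orthogonal payoff vector: large diversity gain
L = payoffs @ payoffs.T
print(diversity(L))          # diversity of the full population
print(diversity(L[:2, :2]))  # lower when only the two similar strategies remain
```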
- the classical FP approach can be expanded to a diverse version such that at each iteration, the player not only considers a best-response, but also considers how this new strategy can help enrich the existing strategy pool after the update.
- the diverse FP method maintains the same update rule as Eq. (4), but with the best-response changed into a diversity-aware best-response that maximises the expected payoff plus $\tau$ times the diversity measure of Eq. (11), where $\tau$ is a tuneable constant (Eq. (12)).
- Eq. (12) has a unique solution at each iteration.
- the algorithm maintains a population of policies $\mathcal{S}_t^i$ learned so far by player $i$.
- the goal here is to design an Oracle to train a new strategy $\mathcal{S}_\theta$, parameterised by $\theta \in \mathbb{R}^d$ (for example, a deep neural net), which both maximises player $i$'s payoff and is diverse from all strategies in $\mathcal{S}_t^i$. Therefore, the ground set of the G-DPP at iteration $t$ can be defined to be the union of the existing population and the new model to add: $\mathcal{Y}_t = \mathcal{S}_t^i \cup \{\mathcal{S}_\theta\}$.
- the diversity measure can be computed by Eq. (11).
- the objective of an Oracle can be written as: $\theta^{*} = \arg\max_{\theta \in \mathbb{R}^{d}} \; \mathbb{E}_{\mathcal{S}_\theta,\, \pi^{-i}}[\text{payoff}] + \tau \cdot \mathrm{Div}(\mathcal{Y}_t)$, where $\pi^{-i}$ is the policy of player two. Depending on the game solver, $\pi^{-i}$ can be NE, UNIFORM, etc.
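- A hedged sketch of evaluating this objective for a candidate policy, representing each policy by its payoff row against the opponent population; all names are illustrative, not the patent's API:

```python
import numpy as np

def expected_cardinality(L):
    m = L.shape[0]
    return np.trace(np.eye(m) - np.linalg.inv(L + np.eye(m)))

def diverse_oracle_objective(candidate_payoffs, population_payoffs, opponent_meta, tau=0.1):
    """Sketch of the diverse Oracle objective: expected payoff of the new policy
    against player two's (meta-)policy, plus tau times the diversity (Eq. (11))
    of the ground set formed by the existing population and the candidate."""
    exploit = candidate_payoffs @ opponent_meta           # E[payoff] vs player two
    W = np.vstack([population_payoffs, candidate_payoffs])
    return exploit + tau * expected_cardinality(W @ W.T)  # G-DPP kernel L = W W^T

# Usage: score a candidate's payoff row against a uniform opponent meta-policy.
pop = np.array([[1.0, 0.0], [0.9, 0.1]])
print(diverse_oracle_objective(np.array([0.0, 1.0]), pop, np.array([0.5, 0.5])))
```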
- the general solver may therefore approximate the Nash strategy in large-scale two-player zero-sum games.
- Figure 4 shows an example of the pseudo-code for one implementation of the method.
- the steps concerning the learning of multiple best-response policies are indicated at 401, and those for promoting diversity among all best-responses at 402.
- Figure 5 illustrates a summary of the main goal of the approach.
- a black-box multi-agent game engine 501 takes as input a joint strategy, shown at 502, and outputs the reward 503.
- the output is multiple “good” strategies, as shown at 505.
- Figure 6 summarises an example of a computer-implemented method 600 for processing a two-agent system input to form multiple at least partially optimised outputs, each indicative of an action policy.
- the method comprises receiving the two-agent system input, the two-agent system input comprising a definition of a two-agent system and defining behaviour patterns of two agents based on system states.
- the method comprises receiving an indication of an input system state.
- the method comprises performing an iterative machine learning process to estimate multiple aggregate functions, each representing the behaviour patterns of the two agents over a set of system states, wherein the multiple aggregate functions are determined in a single iteration of the iterative machine learning process.
- the multiple at least partially optimised outputs each comprise a collectively optimal action policy for each of the two agents in the input system state and a Nash equilibrium behaviour pattern of the two agents in the input system state.
- the procedural diagram for the training of the solver can be visualized in the plot shown in Figure 7.
- at each time step (i.e. each iteration of the iterative machine learning process), two policies are trained in parallel.
- Each new policy is trained against all existing policies.
- the policy shown at 701 is fixed at all time steps.
- the policy shown at 702 is trained against the policy shown at 701 at time step 0, leading to the new policy shown at 703 at time step 1, while the policy shown at 704 is trained against both of these, which leads to the policy shown at 705.
- the newly generated policy will be diverse in the sense that it will be different from all existing policies. For example, the policy at 705 is diverse from those at 703 and 701 at time step 1. Once a newly generated policy converges in training, it is kept fixed and unchanged in the pool. At time step 2, the policy shown at 706 converges, and its parameters will be fixed and stay unchanged in later time steps, as indicated at 707.
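- A minimal sketch of one time step of this pipeline, assuming `train_against` and `has_converged` callbacks that the patent does not define:

```python
def pipeline_step(pool, train_against, has_converged):
    """Sketch of one time step of Figure 7: every policy that is not yet fixed
    trains against all existing policies below it in the pool (in parallel in
    practice), and is frozen once its training converges."""
    for idx, policy in enumerate(pool):
        if policy.get("fixed"):
            continue                       # e.g. 701: fixed at all time steps
        train_against(policy, pool[:idx])  # train vs all existing policies
        if has_converged(policy):
            policy["fixed"] = True         # kept unchanged in later steps (707)
    return pool
```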
- DBR denotes the diverse best-response Oracle function.
- the approach described herein therefore offers a geometric interpretation of behavioural diversity for learning in game frameworks by introducing a new diversity measure built upon the expected cardinality of a DPP.
- the diversity metric can be used as part of a general solver for normalform games and open-ended (meta)games.
- the method can converge to NE and ⁇ -Rank in two-player games and show theoretical guarantees of expanding the gamescapes.
- Figure 8 summarises an example of the process performed as part of the step of performing an iterative machine learning process.
- the process comprises repeatedly performing the following steps until a predetermined level of convergence is reached.
- the method comprises generating a set of random system states.
- the set of random system states may be initially generated based on a predetermined probability distribution.
- the method comprises estimating based on the two-agent system input the behaviour patterns of the two agents in the system states.
- the method comprises estimating an error between the estimated behaviour patterns and the behaviour patterns predicted by each of multiple predetermined candidate aggregate functions, the error representing the level of convergence.
- the method comprises adapting the multiple predetermined candidate aggregate functions based on the estimated behaviour patterns. This iterative machine learning process can be used to enable the device to find suitable aggregate functions in a manageable time period.
- each of the agents can implement a respective action of the at least partially optimised set of actions.
- the iterative method is scalable for approximating Nash equilibria in two-player zero-sum games.
- the method preferably involves a parallel double-oracle scheme that is designed to find multiple best-responses in a distributed way at the same time.
- the preferred implementation of the method defines and promotes so-called behavioural diversity among the multiple best-responses based on a determinantal point process.
- the method has been shown in some embodiments to demonstrate state-of-the-art performance, outperforming existing baselines, in approximating Nash equilibrium in large-scale two-player zero-sum games.
- Figure 9 shows a schematic diagram of a computing device 900 configured to implement the computer implemented method described above and its associated components.
- the device may comprise a processor 901 and a non-volatile memory 902.
- the system may comprise more than one processor and more than one memory.
- the memory may store data that is executable by the processor.
- the processor may be configured to operate in accordance with a computer program stored in non-transitory form on a machine readable storage medium.
- the computer program may store instructions for causing the processor to perform its methods in the manner described herein.
- the agents may be autonomous vehicles and the system states may be vehicular system states.
- the agents may be communications routing devices and the system states may be data flows.
- the agents may be data processing devices and the system states may be computation tasks.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- Computing Systems (AREA)
- Artificial Intelligence (AREA)
- Mathematical Physics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- General Engineering & Computer Science (AREA)
- Biomedical Technology (AREA)
- Molecular Biology (AREA)
- General Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Medical Informatics (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
Description
Claims
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/EP2021/058392 WO2022207087A1 (en) | 2021-03-31 | 2021-03-31 | Device and method for approximating nash equilibrium in two-player zero-sum games |
EP21717001.8A EP4298552A1 (en) | 2021-03-31 | 2021-03-31 | Device and method for approximating nash equilibrium in two-player zero-sum games |
CN202180096388.8A CN117083617A (en) | 2021-03-31 | 2021-03-31 | Apparatus and method for approximating Nash equalization in two-person zero and gaming |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/EP2021/058392 WO2022207087A1 (en) | 2021-03-31 | 2021-03-31 | Device and method for approximating nash equilibrium in two-player zero-sum games |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022207087A1 true WO2022207087A1 (en) | 2022-10-06 |
Family
ID=75426585
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/EP2021/058392 WO2022207087A1 (en) | 2021-03-31 | 2021-03-31 | Device and method for approximating nash equilibrium in two-player zero-sum games |
Country Status (3)
Country | Link |
---|---|
EP (1) | EP4298552A1 (en) |
CN (1) | CN117083617A (en) |
WO (1) | WO2022207087A1 (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220410878A1 (en) * | 2021-06-23 | 2022-12-29 | International Business Machines Corporation | Risk sensitive approach to strategic decision making with many agents |
-
2021
- 2021-03-31 WO PCT/EP2021/058392 patent/WO2022207087A1/en active Application Filing
- 2021-03-31 CN CN202180096388.8A patent/CN117083617A/en active Pending
- 2021-03-31 EP EP21717001.8A patent/EP4298552A1/en active Pending
Non-Patent Citations (32)
Title |
---|
BAKER, B.KANITSCHEIDER, I.MARKOV, T.WU, Y.POWELL, G.MCGREW, B.MORDATCH, I.: "Emergent tool use from multi-agent autocurricula", INTERNATIONAL CONFERENCE ON LEARNING REPRESENTATIONS, 2019 |
BALDUZZI, D.GARNELO, M.BACHRACH, Y.CZARNECKI, W.PEROLAT, J.JADERBERG, M.GRAEPEL, T.: "Open-ended learning in symmetric zero-sum games", ICML, vol. 97, 2019, pages 434 - 443 |
BALDUZZI, D.RACANIERE, S.MARTENS, J.FOERSTER, J.TUYLS, K.GRAEPEL, T.: "The mechanics of n-player differentiable games", ICML, vol. 80, 2018, pages 363 - 372 |
BANZHAF, W.BAUMGAERTNER, B.BESLON, G.DOURSAT, R.FOSTER, J. A.MCMULLIN, B.DE MELO, V. V.MICONI, T.SPECTOR, L.STEPNEY, S. ET AL.: "Defining and simulating open-ended novelty: requirements, guidelines, and challenges", THEORY IN BIOSCIENCES, vol. 135, no. 3, 2016, pages 131 - 161, XP036050102, DOI: 10.1007/s12064-016-0229-7 |
BROWN, G. W.: "Iterative solution of games by fictitious play", ACTIVITY ANALYSIS OF PRODUCTION AND ALLOCATION, vol. 13, no. 1, 1951, pages 374 - 376 |
CANDOGAN, O.MENACHE, I.OZDAGLAR, A.PARRILO, P. A.: "Flows and decompositions of games: Harmonic and potential games", MATHEMATICS OF OPERATIONS RESEARCH, vol. 36, no. 3, 2011, pages 474 - 503 |
CZARNECKI, W. M.GIDEL, G.TRACEY, B.TUYLS, K.OMIDSHAFIEI, S.BALDUZZI, D.JADERBERG, M.: "Real world games look like spinning tops", ARXIV, PP. ARXIV-2004, 2020 |
DAVID BALDUZZI ET AL: "Open-ended Learning in Symmetric Zero-sum Games", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 23 January 2019 (2019-01-23), XP081007431 * |
DAVIS, T.BURCH, N.BOWLING, M.: "Using response functions to measure strategy strength", PROCEEDINGS OF THE AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, vol. 28, 2014 |
DURHAM, W. H.: "Coevolution: Genes, culture, and human diversity", 1991, STANFORD UNIVERSITY PRESS |
FUDENBERG, D.LEVINE, D.: "Consistency and cautious fictitious play", JOURNAL OF ECONOMIC DYNAMICS AND CONTROL, 1995 |
JADERBERG, M.CZARNECKI, W. M.DUNNING, I.MARRIS, L.LEVER, G.CASTANEDA, A. G.BEATTIE, C.RABINOWITZ, N. C.MORCOS, A. S.RUDERMAN, A. E: "Human-level performance in 3d multiplayer games with population based reinforcement learning", SCIENCE, vol. 364, no. 6443, 2019, pages 859 - 865 |
KURACH, K.RAICHUK, A.STANCZYK, P.ZAJAC, M.BACHEM,'O.ESPEHOLT, L.RIQUELME, C.VINCENT, D.MICHALSKI, M.BOUSQUET, O. ET AL.: "Google research football: A novel reinforcement learning environment", PROCEEDINGS OF THE AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, vol. 34, 2020, pages 4501 - 4510 |
LANCTOT, M.ZAMBALDI, V.GRUSLYS, A.LAZARIDOU, A.TUYLS, K.PEROLAT, J.SILVER, D.GRAEPEL, T.: "A unified game-theoretic approach to multiagent reinforcement learning", ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS, 2017, pages 4190 - 4203 |
LEHMAN, J.STANLEY, K. O.: "Exploiting open-endedness to solve problems through the search for novelty", ALIFE, 2008, pages 329 - 336 |
LEIBO, J. Z.HUGHES, E.LANCTOT, M.GRAEPEL, T.: "Autocurricula and the emergence of innovation from social interaction: A manifesto for multi-agent intelligence research", ARXIV, 2019, pages arXiv-1903 |
LESLIE, D. S.COLLINS, E. J.: "Generalised weakened fictitious play", GAMES AND ECONOMIC BEHAVIOR, vol. 56, no. 2, 2006, pages 285 - 298, XP024911102, DOI: 10.1016/j.geb.2005.08.005 |
LIU, S.LEVER, G.MEREL, J.TUNYASUVUNAKOOL, S.HEESS, N.GRAEPEL, T.: "Emergent coordination through competition", INTERNATIONAL CONFERENCE ON LEARNING REPRESENTATIONS, 2018 |
MACCHI, O.: "Transactions of the Seventh Prague Conference on Information Theory, Statistical Decision Functions, Random Processes and of the 1974 European Meeting of Statisticians", 1977, SPRINGER, article "The fermion process - a model of stochastic point process with repulsive points", pages: 391 - 398 |
MCALEER, S.LANIER, J.FOX, R.BALDI, P.: "Pipeline psro: A scalable approach for finding approximate nash equilibria in large games", ARXIV PREPRINT ARXIV:2006.08555, 2020 |
MCMAHAN, H. B.GORDON, G. J.BLUM, A.: "Planning in the presence of cost functions controlled by an adversary", PROCEEDINGS OF THE 20TH INTERNATIONAL CONFERENCE ON MACHINE LEARNING (ICML-03, 2003, pages 536 - 543, XP008098344 |
MULLER, P.OMIDSHAFIEI, S.ROWLAND, M.TUYLS, K.PEROLAT, J.LIU, S.HENNES, D.MARRIS, L.LANCTOT, M.HUGHES, E. ET AL.: "A generalized training approach for multiagent learning", INTERNATIONAL CONFERENCE ON LEARNING REPRESENTATIONS, 2019 |
NASH, J. F. ET AL.: "Equilibrium points in n-person games", PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES, vol. 36, no. 1, 1950, pages 48 - 49 |
OMIDSHAFIEI, S.PAPADIMITRIOU, C.PILIOURAS, G.TUYLS, K.ROWLAND, M.LESPIAU, J.-B.CZARNECKI, W. M.LANCTOT, M.PEROLAT, J.MUNOS, R.: "α-rank: Multi-agent evaluation by evolution", SCIENTIFIC REPORTS, vol. 9, no. 1, 2019, pages 1 - 29
PAREDIS, J.: "Revolutionary computation", ARTIFICIAL LIFE, vol. 2, no. 4, 1995, pages 355 - 375 |
PEREZ NIEVES NICOLAS ET AL: "Modelling Behavioural Diversity for Learning in Open-Ended Games", 14 March 2021 (2021-03-14), pages 1 - 28, XP055871842, Retrieved from the Internet <URL:https://arxiv.org/pdf/2103.07927v1.pdf> [retrieved on 20211210] * |
STANDISH, R. K: "Open-ended artificial evolution", INTERNATIONAL JOURNAL OF COMPUTATIONAL INTELLIGENCE AND APPLICATIONS, vol. 3, no. 02, 2003, pages 167 - 175 |
STEPHEN MCALEER ET AL: "Pipeline PSRO: A Scalable Approach for Finding Approximate Nash Equilibria in Large Games", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 18 February 2021 (2021-02-18), XP081882238 * |
VINYALS, O.BABUSCHKIN, I.CHUNG, J.MATHIEU, M.JADERBERG, M.CZARNECKI, W. M.DUDZIK, A.HUANG, A.GEORGIEV, P.POWELL, R. ET AL.: "Alphastar: Mastering the real-time strategy game starcraft", DEEPMIND BLOG, vol. 2, 2019 |
VINYALS, O.BABUSCHKIN, I.CZARNECKI, W. M.MATHIEU, M.DUDZIK, A.CHUNG, J.CHOI, D. H.POWELL, R.EWALDS, T.GEORGIEV, P. ET AL.: "Grandmaster level in starcraft ii using multi-agent reinforcement learning", NATURE, vol. 575, no. 7782, 2019, pages 350 - 354, XP036927623, DOI: 10.1038/s41586-019-1724-z |
YE, D.CHEN, G.ZHANG, W.CHEN, S.YUAN, B.LIU, B.CHEN, J.LIU, Z.QIU, F.YU, H. ET AL.: "Towards playing full moba games with deep reinforcement learning", ARXIV E-PRINTS, 2020, pages arXiv-2011 |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220410878A1 (en) * | 2021-06-23 | 2022-12-29 | International Business Machines Corporation | Risk sensitive approach to strategic decision making with many agents |
Also Published As
Publication number | Publication date |
---|---|
CN117083617A (en) | 2023-11-17 |
EP4298552A1 (en) | 2024-01-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Perez-Nieves et al. | Modelling behavioural diversity for learning in open-ended games | |
Świechowski et al. | Monte Carlo tree search: A review of recent modifications and applications | |
Yang et al. | Hierarchical cooperative multi-agent reinforcement learning with skill discovery | |
Ozair et al. | Vector quantized models for planning | |
Khadka et al. | Evolutionary reinforcement learning | |
Ilhan et al. | Monte Carlo tree search with temporal-difference learning for general video game playing | |
Khan et al. | Transformer-based value function decomposition for cooperative multi-agent reinforcement learning in starcraft | |
WO2022207087A1 (en) | Device and method for approximating nash equilibrium in two-player zero-sum games | |
Liu et al. | A unified diversity measure for multiagent reinforcement learning | |
Suri | Off-policy evolutionary reinforcement learning with maximum mutations | |
Mathieu et al. | AlphaStar Unplugged: Large-Scale Offline Reinforcement Learning | |
Ha | Neuroevolution for deep reinforcement learning problems | |
Reis et al. | An Adversarial Approach for Automated Pokémon Team Building and Meta-Game Balance | |
McAleer et al. | Team-PSRO for Learning Approximate TMECor in Large Team Games via Cooperative Reinforcement Learning | |
Callaghan et al. | Evolutionary strategy guided reinforcement learning via multibuffer communication | |
Liu et al. | Soft-Actor-Attention-Critic Based on Unknown Agent Action Prediction for Multi-Agent Collaborative Confrontation | |
Berthet | Review of Deep Reinforcement Learning Algorithms | |
Parker-Holder | Towards truly open-ended reinforcement learning | |
Yang | Integrating Domain Knowledge into Monte Carlo Tree Search for Real-Time Strategy Games | |
Gillberg et al. | Technical challenges of deploying reinforcement learning agents for game testing in aaa games | |
Mahmud et al. | Implementation of reinforcement learning architecture to augment an AI that can self-learn to play video games | |
Lundberg | Evaluating behaviour tree integration in the option critic framework in Starcraft 2 mini-games with training restricted by consumer level hardware | |
Wang et al. | Quality-Diversity with Limited Resources | |
Huang | Warm-Starting Networks for Sample-Efficient Continuous Adaptation to Parameter Perturbations in Multi-Agent Reinforcement Learning | |
Feng et al. | Neural auto-curricula |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 21717001 Country of ref document: EP Kind code of ref document: A1 |
|
WWE | Wipo information: entry into national phase |
Ref document number: 202180096388.8 Country of ref document: CN |
|
WWE | Wipo information: entry into national phase |
Ref document number: 2021717001 Country of ref document: EP |
|
ENP | Entry into the national phase |
Ref document number: 2021717001 Country of ref document: EP Effective date: 20230928 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |