CN110404264B - Multi-person non-complete information game strategy solving method, device and system based on virtual self-game and storage medium - Google Patents

Multi-person non-complete information game strategy solving method, device and system based on virtual self-game and storage medium

Info

Publication number
CN110404264B
CN110404264B (application CN201910676407.9A)
Authority
CN
China
Prior art keywords
strategy
game
agent
nfsp
player
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910676407.9A
Other languages
Chinese (zh)
Other versions
CN110404264A (en
Inventor
王轩
漆舒汉
蒋琳
胡书豪
毛建博
廖清
李化乐
张加佳
刘洋
夏文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Graduate School Harbin Institute of Technology
Original Assignee
Shenzhen Graduate School Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Graduate School Harbin Institute of Technology filed Critical Shenzhen Graduate School Harbin Institute of Technology
Priority to CN201910676407.9A priority Critical patent/CN110404264B/en
Publication of CN110404264A publication Critical patent/CN110404264A/en
Application granted granted Critical
Publication of CN110404264B publication Critical patent/CN110404264B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • AHUMAN NECESSITIES
    • A63SPORTS; GAMES; AMUSEMENTS
    • A63FCARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/60Generating or modifying game content before or while executing the game program, e.g. authoring tools specially adapted for game development or game-integrated level editor
    • A63F13/67Generating or modifying game content before or while executing the game program, e.g. authoring tools specially adapted for game development or game-integrated level editor adaptively or by learning from player actions, e.g. skill level adjustment or by storing successful combat sequences for re-use
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • AHUMAN NECESSITIES
    • A63SPORTS; GAMES; AMUSEMENTS
    • A63FCARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F2300/00Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game
    • A63F2300/60Methods for processing data by generating or executing the game program
    • A63F2300/6027Methods for processing data by generating or executing the game program using adaptive systems learning from user actions, e.g. for skill level adjustment

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a method, a device, a system, and a storage medium for solving multi-person non-complete information game strategies based on virtual self-game, wherein the method comprises the following steps: for the two-player game case, the average strategy is generated using multi-class logistic regression and reservoir sampling, and the best-response strategy is generated using DQN and a circular buffer; for the multi-player game case, the multi-agent proximal policy optimization algorithm MAPPO is used to realize the best-response strategy, while a multi-agent NFSP coordinates the training of the agents. The beneficial effects of the invention are: the invention introduces the algorithmic framework of virtual self-play, divides the Texas hold'em strategy optimization process into a best-response-strategy learning part and an average-strategy learning part, and realizes them with deep reinforcement learning and imitation learning respectively, thereby designing a more general multi-agent optimal-strategy learning method.

Description

Multi-person non-complete information game strategy solving method, device and system based on virtual self-game and storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a method, a device, a system, and a storage medium for solving multi-person non-complete information game strategies based on virtual self-game.
Background
Machine gaming and artificial intelligence are inseparably connected, and machine gaming is an important showcase of the development of artificial intelligence. Many well-known scholars in the computer field have conducted research on machine gaming: John von Neumann, the father of the computer, and the mathematician Oskar Morgenstern proposed the minimax approach in games; Alan Turing, the father of artificial intelligence, provided a theoretical basis for developing computer chess programs, and this theory was later used to design the world's first checkers program on the ENIAC. For more than half a century, many major research results in the field of machine gaming have been recognized as important milestones in the development of artificial intelligence. At present, the research results of machine gaming have been widely applied in fields such as robot control, dialogue systems, resource scheduling, traffic signal control, automatic driving, diplomatic negotiation, and financial decision-making.
According to the completeness of game information, games can be divided into perfect information games and imperfect information games. In a perfect information game, every player can completely and immediately obtain all information relevant to the game decision during the decision-making process; many board games belong to perfect information games, such as Go, Chinese chess, and chess. In an imperfect information game, the information obtained by the players is incomplete or delayed, so each player holds private information that the opponents cannot obtain, such as the hole cards in poker games, the vehicle's own field of view in autonomous driving, or the bargaining chips exchanged in diplomatic negotiations.
The incompleteness of information makes solving for the optimal strategy more complicated. To date, many of the more complex imperfect information game problems, such as multi-player non-zero-sum games, have no theoretical method for solving the optimal solution. Take Go, a typical perfect information game, as an example: all players can obtain all information about the current game, so a minimax algorithm can be used to traverse the game tree and find the current optimal strategy. But in imperfect information games, such as poker, the game information is not fully visible. Taking Texas hold'em as an example, each player's hole cards are invisible to the other players, so an agent must reason about and guess the opponent's unknown information during play, and a player can bluff by exploiting private information that the opponent cannot observe. These characteristics make solving imperfect information games much more difficult.
Many real-world decision problems can be abstracted as strategy optimization problems in imperfect information games, but current imperfect-information strategy optimization algorithms, such as Libratus (冷扑大师), can only handle two-player games with discrete actions and simple states, and cannot be applied well to real-world decision problems. Therefore, research on imperfect-information strategy optimization algorithms that support multiple players, continuous actions, and complex states has important theoretical and practical significance.
Disclosure of Invention
The invention provides a multi-person non-perfect information game strategy solving method based on virtual self-game. For the two-player game case, the average strategy is generated using multi-class logistic regression and reservoir sampling, and the best-response strategy is generated using DQN and a circular buffer; for the multi-player game case, the multi-agent proximal policy optimization algorithm MAPPO is used to realize the best-response strategy, while a multi-agent NFSP coordinates the training of the agents. NFSP: neural-network virtual self-play (Neural Fictitious Self-Play); DQN: Deep Q-Network.
As a further improvement of the invention, for the two-player game case, the agent uses the memory segments generated under its best-response strategy as data and trains a fully connected shallow neural network with the reservoir sampling method; the input of the shallow neural network is the current poker game state, and the output is the probability of taking each action in that state. The online NFSP algorithm is adopted, so the two agents play and update their strategies at the same time.
As a further improvement of the invention, in the multi-player game case, the multi-agent proximal policy optimization algorithm comprises: using centralized advantage estimation; when updating the policy network, MAPPO uses the clipped surrogate objective function; during training, the agent continuously updates the policy network with decision sequences obtained by exploring the environment, and updates the target policy network at regular intervals.
As a further improvement of the present invention, in the multi-player game case, the multi-agent NFSP comprises: after each sampled action, executing the action in the environment and updating the current state; and after the game ends, updating the reward values in the last memory segments of all agents in the memory bank.
The invention also provides a multi-person non-complete information game strategy solving device based on virtual self-game, which comprises:
a two-player game module: the average strategy is generated using multi-class logistic regression and reservoir sampling, and the best-response strategy is generated using DQN and a circular buffer;
a multi-player game module: the multi-agent proximal policy optimization algorithm MAPPO is used to realize the best-response strategy, while a multi-agent NFSP coordinates the training of the agents;
NFSP: neural-network virtual self-play (Neural Fictitious Self-Play); DQN: Deep Q-Network.
In the two-player game module, the agent uses the memory segments generated under its best-response strategy as data and trains a fully connected shallow neural network with the reservoir sampling method; the input of the shallow neural network is the current poker game state, and the output is the probability of taking each action in that state. The online NFSP algorithm is adopted, so the two agents play and update their strategies simultaneously.
As a further improvement of the invention, in the multi-player game module, the multi-agent proximal policy optimization algorithm comprises: using centralized advantage estimation; when updating the policy network, MAPPO uses the clipped surrogate objective function; during training, the agent continuously updates the policy network with decision sequences obtained by exploring the environment, and updates the target policy network at regular intervals.
As a further development of the invention, in the multi-player game module, the multi-agent NFSP comprises: after each sampled action, executing the action in the environment and updating the current state; and after the game ends, updating the reward values in the last memory segments of all agents in the memory bank.
The invention also provides a multi-person non-complete information game strategy solving system based on virtual self-game, comprising: a memory, a processor, and a computer program stored on the memory, the computer program being configured to implement the steps of the method of the invention when invoked by the processor.
The invention also provides a computer-readable storage medium having stored thereon a computer program configured to, when invoked by a processor, perform the steps of the method of the invention.
The invention has the beneficial effects that: the invention introduces the algorithmic framework of virtual self-play, divides the Texas hold'em strategy optimization process into a best-response-strategy learning part and an average-strategy learning part, and realizes them with deep reinforcement learning and imitation learning respectively, thereby designing a more general multi-agent optimal-strategy learning method.
Drawings
FIG. 1 is a flow chart of the virtual self-play algorithm;
FIG. 2 is a framework diagram of the NFSP algorithm;
FIG. 3 is a diagram of the communication framework of the ACPC competition;
FIG. 4 is a data communication flow diagram of the Texas hold'em machine gaming system.
Detailed Description
1.1 The invention discloses a multi-person non-complete information game strategy solving method based on virtual self-game; taking multi-player no-limit Texas hold'em as an example, the invention provides a strategy solving algorithm for multi-player no-limit Texas hold'em. The invention is based on virtual self-play, combines techniques such as deep learning and multi-agent reinforcement learning, and uses Texas hold'em and the multi-agent particle environment as experimental platforms. When traditional methods are used to solve the imperfect information game of Texas hold'em, the scale of the game tree must be reduced with methods such as card abstraction, and the transferability is poor. The invention introduces the algorithmic framework of virtual self-play, divides the Texas hold'em strategy optimization process into a best-response-strategy learning part and an average-strategy learning part, and realizes them with deep reinforcement learning and imitation learning respectively, thereby designing a more general multi-agent optimal-strategy learning algorithm.
1.2 detailed description of the method
1.2.1 Neural-network virtual self-play (NFSP)
1.2.1.1 Algorithm framework
Virtual self-play (Fictitious Self-Play, FSP) can be used to solve strategy optimization in large imperfect information games such as Texas hold'em. FSP is a machine learning algorithm that has been shown to converge to Nash equilibrium in two-player zero-sum games; it realizes generalized weakened fictitious play based on behavioral strategies and sampling. Conventional fictitious play faces the curse of dimensionality as the size of the game increases, because each iteration requires computing over all states of the game. FSP, however, requires only an approximate best response and also tolerates some perturbation in the updates. FSP replaces the computation of the average strategy and the best response in fictitious play with machine learning algorithms: updating the average strategy is an imitation learning process, for which a common imitation learning algorithm can be used, while computing the best response is replaced by reinforcement learning.
In the learning process of FSP, two virtual players continuously optimize their respective game strategies by playing against each other. Each player maintains two strategies: a best-response strategy β_j against the opponent and its own average strategy π_j. The training process of FSP consists of the following steps: each player initializes its average strategy, which can be done with any random scheme; each player plays games using its own best-response strategy against the opponent's average strategy, obtaining data D; each player updates its reinforcement learning memory M_rl and its imitation learning memory M_sl according to the data D; each player updates its best-response strategy β_{j+1} with a reinforcement learning algorithm and its average strategy π_{j+1} with an imitation learning algorithm; when a given number of iterations m is reached, the average strategy of each player is the learned approximate Nash equilibrium strategy.
Neural-network virtual self-play (Neural Fictitious Self-Play, NFSP) builds on FSP by using neural networks to fit both the best-response strategy and the average strategy. Whenever the agent takes an action according to its behavior strategy (a mixture of the best response and the average strategy) and receives environment feedback, the memory segment (s, a, r, s') is stored in the reinforcement learning memory M_rl and the imitation learning memory M_sl. Subsequently, the memories in M_rl are used to update the parameters of the best-response network Q(s, a | θ_Q), and the memories in M_sl are used to update the parameters of the average-strategy network Π(s, a | θ_Π). In NFSP, Q(s, a | θ_Q) fits the agent's best response, while Π(s, a | θ_Π) fits the average of the agent's best-response strategies, and the agent uses continuous-time dynamic fictitious play to anticipate the opponent's average strategy. Let π_{N,t} denote the normal-form average strategy and β_{N,t} the normal-form best-response strategy; then in continuous time:
dπ_{N,t}/dt ∝ β_{N,t} − π_{N,t}    (3-1)
Δπ_{N,t} = π_{N,t+1} − π_{N,t} ∝ β_{N,t} − π_{N,t}    (3-2)
thus
π_{N,t+1} ≈ η β_{N,t} + (1 − η) π_{N,t}    (3-3)
where η ∈ R is called the anticipatory parameter. Fig. 1 shows the overall flow of the virtual self-play algorithm.
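For illustration only, the anticipatory mixing of equation (3-3) can be sketched as follows: the agent acts from its best-response network with probability η and from its average-strategy network otherwise, and (as described in section 1.2.2.2) supervised data for the average strategy come from the segments generated under the best response. All class and method names below (state.legal_actions, q_net.value, avg_net.sample) are hypothetical, not the patented implementation.

    import random

    def choose_policy(eta):
        # With probability eta act from the best-response network, otherwise from the average strategy.
        return "best_response" if random.random() < eta else "average"

    def act(state, mode, q_net, avg_net, epsilon):
        # state.legal_actions, q_net.value and avg_net.sample are hypothetical interfaces.
        if mode == "best_response":
            # epsilon-greedy over the Q-network, as in equation (3-5)
            if random.random() < epsilon:
                return random.choice(state.legal_actions)
            return max(state.legal_actions, key=lambda a: q_net.value(state, a))
        # sample an action from the average-strategy network Pi(s, a | theta_Pi)
        return avg_net.sample(state)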
1.2.2 Solving the two-player no-limit Texas hold'em strategy with NFSP
1.2.2.1 Learning the best-response strategy with a deep Q-network
First, the agent's best-response strategy is learned with a Deep Q-Network (DQN). DQN is an off-policy reinforcement learning algorithm for finding the optimal policy of an MDP, so it can be applied within the FSP framework.
DQN derives an ε-greedy strategy from experience replay using Q-learning. Similarly, when solving Texas hold'em under the FSP framework, DQN can be used to learn the best response to the opponent's average strategy. The FSP agent uses the DQN algorithm to learn, from the experience segments in its reinforcement learning memory, a neural network Q(s, a | θ_Q) that predicts state-action values and thus constructs a best response to the opponent's (most recent) anticipated average strategy. The loss function of the neural network may be set as:
L(θ_Q) = E_{(s,a,r,s')~M_rl} [ ( r + γ max_{a'} Q(s', a' | θ_Q') − Q(s, a | θ_Q) )^2 ]    (3-4)
where θ_Q' is a target network whose weights are periodically overwritten with θ_Q during training. The final network determines the agent's approximate best-response strategy:
β = ε-greedy[ Q(· | θ_Q) ]    (3-5)
This means that the agent chooses an action uniformly at random with probability ε, and otherwise chooses the action with the largest predicted Q value.
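A minimal sketch of the loss in equation (3-4), assuming a PyTorch-style q_net(s) that returns Q-values for all actions and a separately held target network; the tensor layout of the mini-batch is an assumption, not part of the patent.

    import torch
    import torch.nn.functional as F

    def dqn_loss(q_net, target_net, batch, gamma):
        # batch: tensors (s, a, r, s_next, done) drawn from the circular buffer M_rl
        s, a, r, s_next, done = batch
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)       # Q(s, a | theta_Q)
        with torch.no_grad():
            q_next = target_net(s_next).max(dim=1).values          # max_a' Q(s', a' | theta_Q')
            target = r + gamma * (1.0 - done) * q_next             # bootstrapped target
        return F.mse_loss(q_sa, target)                            # squared error of equation (3-4)

    # The target network theta_Q' is refreshed periodically, e.g.:
    #   target_net.load_state_dict(q_net.state_dict())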
1.2.2.2 Generating the average strategy with imitation learning
In FSP, the average strategy Π^k is the average of the best-response strategies the player has taken over the past k iterations, namely:
Π^k = (1/k) Σ_{j=1}^{k} β^j    (3-6)
Suppose player i wishes to learn, from its own normal-form mixed strategy Π, a behavioral strategy whose realization is equivalent. This amounts to learning a model of its behavioral strategy by sampling actions from Π. The simplest way to learn the average strategy is to count the actions taken on the different information sets. In the formulas below, N(s_t, a_t) is the accumulated probability with which the player takes action a_t in information set s_t, and ρ_t is the policy the agent was using when the sampled segment was recorded:
N(s_t, a_t) = Σ_{sampled segments} ρ_t(a_t | s_t)    (3-7)
Π(a | s) = N(s, a) / Σ_{a'} N(s, a')    (3-8)
However, for a game like Texas hold'em, whose state space is enormous, existing storage cannot hold all states. NFSP therefore introduces the idea of Learning from Demonstration (LfD) and fits the average strategy with a neural network. The average-strategy network Π(s, a | θ_Π) updates its parameters with the loss in equation (3-9); since the best-response strategy provides the training data, the update of the average strategy can be regarded as a form of imitation learning.
L(θ_Π) = E_{(s,a)~M_sl} [ −log Π(s, a | θ_Π) ]    (3-9)
To ensure that Π(s, a | θ_Π) is an unbiased estimate of the average of the best-response strategies, every memory segment in M_sl must be sampled with the same probability. However, M_sl is implemented as a fixed-size list, and a naive insertion and sampling scheme would not give every segment in the stream the same chance of being retained and drawn. The invention therefore designs the imitation learning memory M_sl with reservoir sampling, ensuring that all memory segments are kept and drawn with equal probability.
The invention uses the memory segments generated while the agent follows its best-response strategy as data and trains a fully connected shallow neural network Π(s, a | θ_Π) with reservoir sampling; the input of the network is the current poker game state s, and the output is the probability of taking each action a in that state.
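A minimal reservoir-sampling buffer of the kind described above (a sketch, not the patented code): after n segments have been offered, every segment has been retained with equal probability capacity/n, which keeps the supervised data for Π(s, a | θ_Π) unbiased.

    import random

    class ReservoirBuffer:
        def __init__(self, capacity):
            self.capacity = capacity
            self.items = []
            self.seen = 0                      # number of segments offered so far

        def add(self, segment):
            self.seen += 1
            if len(self.items) < self.capacity:
                self.items.append(segment)
            else:
                j = random.randint(0, self.seen - 1)
                if j < self.capacity:          # keep each segment with probability capacity/seen
                    self.items[j] = segment

        def sample(self, batch_size):
            return random.sample(self.items, min(batch_size, len(self.items)))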
1.2.2.3 Online NFSP Algorithm
As shown in Fig. 2, the invention uses the online NFSP algorithm, i.e. the two agents play and update their strategies at the same time instead of updating alternately. The online algorithm has two advantages in practical training: first, sampling experience simultaneously is in theory n times more efficient than the alternating approach, where n is the number of agents; second, simultaneously learning agents can be applied to real black-box environments, such as a traffic light system.
(The pseudocode of the online NFSP algorithm is rendered as images in the original document.)
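Since the pseudocode is available only as an image, the following is a hedged reconstruction of the online NFSP loop from the description above, reusing the hypothetical helpers sketched in section 1.2.1.1; the environment, buffer, and update interfaces are assumptions, not the original code.

    def online_nfsp_episode(env, agents, eta, epsilon):
        # env, agents[i].q_net / avg_net / m_rl / m_sl and the update_* methods are hypothetical.
        state = env.reset()
        while not env.done():
            i = env.current_player()
            agent = agents[i]
            mode = choose_policy(eta)                              # best response vs. average strategy
            action = act(state, mode, agent.q_net, agent.avg_net, epsilon)
            next_state, reward = env.step(action)
            agent.m_rl.add((state, action, reward, next_state))    # circular buffer feeding DQN
            if mode == "best_response":
                agent.m_sl.add((state, action))                    # reservoir buffer feeding Pi
            state = next_state
        for agent in agents:                                       # both agents learn in the same episode
            agent.update_q_network()
            agent.update_average_network()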
1.2.2.4 Multi-agent proximal policy optimization algorithm
The invention provides a Multi-Agent Proximal Policy Optimization algorithm (MAPPO). To address the amplification of policy-performance perturbations among agents and the difficulty of tuning the learning rate in the multi-agent setting, the clipped surrogate objective function replaces the Actor's optimization objective in MADDPG.
A common policy gradient implementation, also suitable for recurrent neural networks, runs the policy for T time steps (with T much smaller than the episode length) and then performs a policy update using the T collected time steps. This implementation requires an advantage estimator that evaluates sequences of at most T time steps; the estimator used in the present invention is
Â_t = −V(s_t) + r_t + γ r_{t+1} + ⋯ + γ^{T−t−1} r_{T−1} + γ^{T−t} V(s_T)    (4-1)
where t denotes the time index in [0, T]. More generally, a truncated version of the generalized advantage estimate can be used, as shown in equation (4-2); the two coincide when λ = 1:
Â_t = δ_t + (γλ) δ_{t+1} + ⋯ + (γλ)^{T−t−1} δ_{T−1}    (4-2)
δ_t = r_t + γ V(s_{t+1}) − V(s_t)    (4-3)
Under the centralized-training, decentralized-execution framework of multi-agent reinforcement learning, the centralized Critic and the decentralized Actors have different observation fields, so a parameter-sharing network structure cannot be used; the Critic and Actor losses in MAPPO are therefore computed separately, and two independent networks realize the policy and the value estimate. The main difference between MAPPO and PPO is the use of centralized advantage estimation, which can see the observations and actions of all agents and therefore yields a more accurate estimate; the TD error used for the centralized advantage estimation is shown in equation (4-4).
δ_t^i = r_t^i + γ V_i(x_{t+1}) − V_i(x_t)    (4-4)
where x_t collects the observations (and actions) of all agents at time t.
The Critic loss L(φ_i) is computed as shown in equation (4-5):
L(φ_i) = (1/S) Σ_{j=1}^{S} (δ_j^i)^2    (4-5)
where S is the size of the mini-batch and i denotes agent i. MAPPO can use an online network together with a periodically updated target network to keep the Critic parameters from diverging during updates.
MAPPO uses the clipped surrogate objective function instead of the plain policy gradient when updating the policy network. Unlike PPO, MAPPO uses the centralized advantage function Â_t^i to guide the policy update. The policy update objective of MAPPO is
L^{CLIP}(θ_i) = E_t [ min( r_t(θ_i) Â_t^i, clip(r_t(θ_i), 1−ε, 1+ε) Â_t^i ) ],  with  r_t(θ_i) = π_{θ_i}(a_t | o_t) / π_{θ_i,old}(a_t | o_t)    (4-6)
During training, the agent continuously updates the policy network with decision sequences obtained by exploring the environment and copies its parameters into the target policy network at regular intervals. (The pseudocode of the MAPPO algorithm is rendered as images in the original document.)
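As an illustrative sketch of the clipped surrogate update in equation (4-6) and the Critic loss in equation (4-5) (not the patented code), the losses can be written as follows; the tensor arguments, the clipping threshold clip_eps = 0.2, and the use of one independent critic per agent are assumptions.

    import torch

    def mappo_actor_loss(log_prob_new, log_prob_old, advantage, clip_eps=0.2):
        # advantage is the centralized advantage estimate A_t^i computed from the joint critic
        ratio = torch.exp(log_prob_new - log_prob_old)
        unclipped = ratio * advantage
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
        return -torch.min(unclipped, clipped).mean()   # maximizing the surrogate = minimizing its negative

    def mappo_critic_loss(values, returns):
        # mean squared error over the mini-batch of size S, one independent critic per agent
        return ((returns - values) ** 2).mean()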
1.2.3 Multi-agent NFSP Algorithm
1.2.3.1 partially observable Markov games
A multi-agent Markov decision process can be viewed as a partially observable Markov game. A partially observable Markov game consists of the following components:
(1) Sets of actions A_1, …, A_N;
(2) Sets of observations O_1, …, O_N;
(3) Stochastic policies π_{θ_i}: O_i × A_i → [0, 1];
(4) State transition function Γ: S × A_1 × … × A_N → S;
(5) Reward functions r_i: S × A_i → R;
(6) Private observations o_i: S → O_i.
The expected return of agent i is
J_i = E [ Σ_{t=0}^{T} γ^t r_i^t ]
where γ is the discount rate of rewards over time.
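As a small worked example of the expected return above (illustrative only), the discounted return of one sampled episode can be computed as follows; the reward list and the value of γ are hypothetical.

    def discounted_return(rewards, gamma):
        # One-episode estimate of agent i's return: sum over t of gamma^t * r_i^t
        g = 0.0
        for r in reversed(rewards):
            g = r + gamma * g
        return g

    # Example: rewards 0, 0, 10 with gamma = 0.99 give 0.99^2 * 10 = 9.801
    print(discounted_return([0.0, 0.0, 10.0], 0.99))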
1.2.3.2 Multi-agent NFSP Algorithm
In multi-player Texas hold'em there is a betting order, and the environment changes after each agent acts. In addition, all agents receive their final payoff only when a hand is over. The invention therefore adapts the NFSP algorithm to learning multi-player Texas hold'em strategies: after each sampled action, the action is executed in the environment and the current state is updated, and after the hand ends, the reward values in the last memory segments of all agents in the memory bank are updated.
(The pseudocode of the multi-agent NFSP algorithm is rendered as images in the original document.)
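A minimal sketch of the deferred reward update described above, under the assumption of a memory-buffer interface with last() and replace_last() methods (hypothetical names): each agent's provisional reward is overwritten with its final payoff once the hand ends.

    def finish_hand(agents, payoffs):
        # payoffs[i] is agent i's net chip result, known only once the hand ends;
        # m_rl.last() / replace_last() are hypothetical buffer methods.
        for i, agent in enumerate(agents):
            s, a, _, s_next = agent.m_rl.last()
            agent.m_rl.replace_last((s, a, payoffs[i], s_next))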
2.1 Experimental setup
2.1.1 Texas hold'em gaming environment under ACPC rules
The invention realizes game play among the agents based on the communication framework of the ACPC competition, as shown in Fig. 3. When a game starts, each agent establishes a TCP/IP connection with the server running the dealer through a specified port and confirms that its ACPC communication protocol version is consistent with that of the server. After the game starts, the server sends the game state information, encoded as character strings, to each agent and waits for a message from the agent that needs to act. After receiving the string, each agent parses it into the corresponding game state; if the game state indicates that it must act, the agent selects an action through its strategy, encodes the action into a string, and sends it to the server. After receiving the string sent by an agent, the server parses it into a legal action, executes it, and sends the resulting game state information to every agent. This loop continues until the game ends.
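For illustration, a client skeleton following the described flow (connect, handshake, parse the dealer's state string, reply with an encoded action) might look as follows; parse_state, encode_action, the agent interface, and the exact handshake string are assumptions rather than the real ACPC API.

    import socket

    def run_client(host, port, agent):
        # parse_state and encode_action are hypothetical helpers, not the real ACPC message format.
        sock = socket.create_connection((host, port))
        f = sock.makefile("rw", newline="")  # no newline translation; ACPC messages end with CRLF
        f.write("VERSION:2.0.0\r\n")         # version handshake (exact string assumed)
        f.flush()
        for line in f:                        # one state string per line from the dealer
            state = parse_state(line.strip())
            if state.our_turn:
                reply = encode_action(line.strip(), agent.act(state))
                f.write(reply + "\r\n")
                f.flush()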
Fig. 4 is a framework diagram of the gaming system according to the invention; experiments on two-player and multi-player games use different methods to generate and train the average strategy and the best-response strategy. Specifically, for the two-player game problem, the average strategy is generated with multi-class logistic regression and reservoir sampling, and the best-response strategy is generated with DQN and a circular buffer; for the multi-player game problem, the best-response strategy is realized with the multi-agent proximal policy optimization algorithm provided by the invention, and the multi-agent NFSP replaces the original NFSP to coordinate the training of the agents.
2.1.2 Experimental design
In a two-player zero-sum game, the exploitability ε_σ = b_1(σ_2) + b_2(σ_1) is an index measuring the distance between a strategy profile and a Nash equilibrium, where b_i(σ_{−i}) is the value of player i's best response to the opponent's strategy. In poker games, mbb/h (milli-big-blinds per hand) is often used to express exploitability, i.e. how many thousandths of a big blind each player would lose per hand in the worst case. The invention adopts the same exploitability measure to test the convergence speed and solution accuracy of the approximate Nash equilibrium strategy solved by NFSP on Leduc poker. Table 5-1 lists the parameters used in the experiment.
TABLE 5-1 two-person Leduc poker experiment parameters
(Table 5-1 is rendered as an image in the original; the parameter values are not recoverable from the text.)
2.1.3 experiments
2.1.3.1 Exploitability and error analysis
The training process of NFSP consists of two parts, reinforcement learning and imitation learning, and the invention investigates the relationship between the decrease of the training errors of these two parts and the change in exploitability. The tested error is the average of the mean squared error of the strategy over all nodes at the beginning of the game. Let H be the set of histories h in which the action sequences of all players are empty; then the errors are:
E_Q = (1/|H|) Σ_{h∈H} (1/|A(h)|) Σ_{a∈A(h)} ( Q(h, a) − Q*(h, a) )^2
E_Π = (1/|H|) Σ_{h∈H} (1/|A(h)|) Σ_{a∈A(h)} ( Π(h, a) − Π*(h, a) )^2
where Q and Π are the strategies produced by reinforcement learning and imitation learning, and Q* and Π* are the best-response strategy computed by CFR and the extended average strategy mentioned above. The nodes at the beginning of the game are chosen because they are the farthest from the leaf nodes, so the strategy computed by NFSP differs most clearly from the ideal strategy there.
The imitation learning error experiments and reinforcement learning error experiments show that both learning components converge faster than the strategy generated by NFSP, because NFSP must learn against moving targets. At time t, one NFSP agent uses reinforcement learning to update the best response to the opponent's average strategy while using imitation learning to update the average of its own best-response strategies; the target strategies of both parts therefore keep changing throughout training. The error can thus be viewed as two components: the update error and the fitting error. At the beginning of training, the update error is larger than the fitting error, which means the error mainly comes from computing the best-response strategy with the reinforcement learning algorithm; as the total training error decreases, the update error becomes smaller than the fitting error, which means the error mainly comes from fitting the average of the best-response strategies with the imitation learning algorithm.
2.1.3.2 visualization of poker gaming networks
To verify that NFSP learns valid poker features, the invention visualizes the output of the last hidden layer of the policy network with the t-SNE algorithm. Specifically, simulated sampling with the average strategy finally learned by NFSP produces a data set containing the information set, the output of the last hidden layer, and some hand-designed features (such as the pair feature). The t-SNE algorithm then reduces the hidden-layer output to two dimensions; the data points are labeled according to the designed features, and their distribution is observed.
In the t-SNE embedding experiment on the Leduc poker policy network, the data points are colored orange or blue according to whether the hand forms a pair; the policy network can thus be seen to distinguish hands of different strength even though the network input is only the raw card representation. This shows that the NFSP agent successfully learns hand-strength knowledge without relying on domain knowledge, demonstrating that NFSP's learning is effective.
2.1.3.3 results of the multiplayer poker experiment
The invention implements MAPPO and runs experiments on three-player Leduc poker and six-player Texas hold'em. To avoid using domain knowledge as far as possible, the invention encodes the state as objectively as possible, i.e. with as little feature engineering as possible.
State coding: poker games typically consist of several rounds, and new cards are revealed to the players in each round. The invention uses a k-of-n form to encode the cards of each round. For example, Texas hold'em always uses 52 cards, and 3 community cards are revealed in the second round, so that round is encoded as a length-52 vector with 1 at the 3 positions of the community cards and 0 elsewhere. Three actions can be chosen in Texas hold'em: {fold, call, raise}. In the experiments, the raise action is divided into a fixed number of discrete bet sizes via action abstraction, and the number of betting actions per round is limited (in practice, far fewer betting actions occur in a Texas hold'em game). Accordingly, the betting history can be encoded as a 4-dimensional vector {player, round, betting-action index, action taken}.
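A minimal sketch of the k-of-n card encoding described above (the card-to-index mapping is a hypothetical choice):

    def encode_cards(card_indices, deck_size=52):
        # k-of-n encoding: a length-52 vector with 1s at the positions of the revealed cards
        v = [0] * deck_size
        for c in card_indices:
            v[c] = 1
        return v

    # Example: three community cards revealed on the second round (hypothetical card indices)
    flop_vector = encode_cards([7, 23, 48])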
Action coding: each player's chip count in the poker game is fixed, so a raise becomes illegal once the maximum amount has been reached; in addition, folding is an illegal action when no bet has been made. Because the policy network implemented by the invention has a fixed output action set, the environment is adjusted to handle illegal outputs: any illegal action defaults to a call (check).
Reward coding: the invention directly takes the net chip balance after the agent acts as the reward value. For example, when an agent calls to 500 having already put 400 into the pot, it receives an immediate reward of −100. In poker, all agents learn their losses and winnings only after the whole hand is over, so the reward in the terminal state is the sum of the immediate reward and the terminal game value. For example, if the agent goes all-in to 20000 in the last round having already contributed 19000 to the pot, and wins 40000 at showdown because it holds the strongest hand, the reward of the terminal state is −1000 + 40000 = 39000.
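The net-chip reward rule amounts to the following arithmetic (an illustrative sketch with hypothetical helper names):

    def immediate_reward(chips_committed_by_action):
        # Net chip balance change caused by the agent's own action (negative when betting)
        return -chips_committed_by_action

    def terminal_reward(last_immediate_reward, winnings):
        # At the end of the hand the terminal value is added to the last immediate reward
        return last_immediate_reward + winnings

    # Calling to 500 having already put in 400 commits another 100 chips: reward -100.
    # Committing a further 1000 on the last action and winning 40000 at showdown: -1000 + 40000 = 39000.
    print(immediate_reward(100))            # -100
    print(terminal_reward(-1000, 40000))    # 39000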
The experiments use the ACPC Random_Player and CFR5000 (i.e. an agent obtained after 5000 iterations of the CFR algorithm on the same game) as opponent agents. The match results use mbb/g as the evaluation criterion, where mbb/g denotes how many thousandths of a big blind are won per game. An agent that always folds loses 750 mbb/g, and professional poker players can obtain an expected profit of 40-60 mbb/g in large online games. Multi-player agents trained for 10000, 50000 and 100000 iterations were selected for the experiments; each played 3000 hands and the average winnings were computed. The results against Random_Player are shown in Table 5-3.
TABLE 5-3 Results of the multi-player game agents against Random_Player
(Table 5-3 is rendered as an image in the original; the numerical results are not recoverable from the text.)
It can be seen that the three agents achieve a clear advantage against the two Random_Player opponents, but the two NFSP-based agents perform no better than the Pure CFR agent at the beginning of training, because the algorithm used in the invention is sample-based, sampling only one segment per round until the game terminates; the number of iterations required for training is large, but each iteration is short. In the experiments, 50000 iterations of NFSP + MAPPO were roughly equivalent to 3000 iterations of Pure CFR. In addition, the NFSP + MAPPO agent outperforms NFSP at every iteration count, which shows that adding the improved multi-agent reinforcement learning algorithm to learn the best-response strategy effectively alleviates the non-stationary environment problem and ultimately raises the playing level of the agent.
TABLE 5-4 Results of the multi-player game agents against CFR5000
(Table 5-4 is rendered as an image in the original; the numerical results are not recoverable from the text.)
From the match results it can be seen that the NFSP + MAPPO agent achieves performance similar to CFR5000 after 50000 iterations.
HITSZ_Jaysen is a six-player Texas hold'em agent based on a hand-designed strategy over hand-strength features; it took third place in the six-player no-limit Texas hold'em event of the 2018 ACPC competition. The experiment seats three agents designed by the invention and three HITSZ_Jaysen agents at a six-player Texas hold'em table; 3000 hands are played and the average big-blind winnings per hand are computed. As shown in Table 5-5, the multi-player game agent of the invention achieves better performance than HITSZ_Jaysen in the six-player no-limit Texas hold'em experiments. The multi-player game agent provided by the invention thus attains a high-level game strategy through end-to-end learning without requiring poker domain knowledge.
TABLE 5-5 Results of the multi-player game agents against HITSZ_Jaysen
(Table 5-5 is rendered as an image in the original; the numerical results are not recoverable from the text.)
The foregoing is a further detailed description of the invention in connection with specific preferred embodiments and it is not intended to limit the invention to the specific embodiments described. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.

Claims (4)

1. A multi-person non-complete information game strategy solving method based on virtual self-game is characterized in that:
aiming at the two-person game situation, the average strategy is generated by using multi-class logistic regression and reservoir sampling, and the best-response strategy is generated by using DQN and a circular buffer;
aiming at the multi-person game situation, a multi-agent proximal policy optimization algorithm MAPPO is used for realizing the best-response strategy, and meanwhile, a multi-agent NFSP is used for adjusting the training of the agents;
NFSP: neural-network virtual self-play (Neural Fictitious Self-Play), DQN: deep Q-network;
aiming at the two-player game situation, the agent adopts memory segments of the best-response strategy as data, and adopts a reservoir sampling method to train a fully-connected shallow neural network, wherein the input of the shallow neural network is the current poker game situation, and the output is the probability of taking each action in the state; and the online NFSP algorithm is adopted, and the two agents play and update strategies at the same time;
in the case of multi-player gaming, the multi-agent proximal policy optimization algorithm comprises the following steps: using centralized advantage estimation; when updating the policy network, the MAPPO uses a clipped surrogate objective function; and in the training process, the agent continuously updates the policy network by using decision sequences obtained by exploration in the environment and updates the target policy network at regular intervals;
in the case of a multi-player game, the multi-agent NFSP comprises: executing and updating the current state in the environment after each sampled action, and updating the reward values in the last memory segments of all the agents in the memory bank after the game is finished;
the multi-player non-complete information game strategy solving method is applied to the Texas hold'em game.
2. A multi-person non-complete information game strategy solving device based on virtual self-game, characterized by comprising:
a two-player game module: the average strategy is generated by using multi-class logistic regression and reservoir sampling, and the best-response strategy is generated by using DQN and a circular buffer;
a multi-player game module: using a multi-agent proximal policy optimization algorithm MAPPO to realize the best-response strategy, and simultaneously using a multi-agent NFSP to adjust the training of the agents;
NFSP: neural-network virtual self-play (Neural Fictitious Self-Play), DQN: deep Q-network;
in the two-player game module, an intelligent body adopts a memory segment of an optimal response strategy as data, and trains a fully-connected shallow neural network by adopting a reservoir sampling method, wherein the input of the shallow neural network is the current poker game situation, and the output is the probability of taking each action in the state; the online NFSP algorithm is adopted, and the two agents play chess and update strategies simultaneously;
in the multi-player gaming module, the multi-agent near-end strategy optimization algorithm comprises the following steps: by using centralized advantage estimation, when updating the strategy network, the MAPPO uses the tailored proxy objective function, and in the training process, the intelligent agent continuously updates the strategy network by using a decision sequence obtained by exploring in the environment and updates the target strategy network at regular time;
in the multi-player gaming module, the multi-agent NFSP comprises: executing and updating the current state in the environment after each sampling action, and updating the reward values in the last memory segments of all the intelligent agents in the memory bank after the game is finished;
the multi-player non-complete information game strategy solving device is applied to the Texas poker game.
3. A multi-person non-complete information game strategy solving system based on virtual self-game, characterized by comprising: a memory, a processor and a computer program stored on the memory, the computer program being configured to implement the steps of the method of claim 1 when invoked by the processor.
4. A computer-readable storage medium, characterized in that: the computer-readable storage medium stores a computer program configured to implement the steps of the method of claim 1 when invoked by a processor.
CN201910676407.9A 2019-07-25 2019-07-25 Multi-person non-complete information game strategy solving method, device and system based on virtual self-game and storage medium Active CN110404264B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910676407.9A CN110404264B (en) 2019-07-25 2019-07-25 Multi-person non-complete information game strategy solving method, device and system based on virtual self-game and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910676407.9A CN110404264B (en) 2019-07-25 2019-07-25 Multi-person non-complete information game strategy solving method, device and system based on virtual self-game and storage medium

Publications (2)

Publication Number Publication Date
CN110404264A CN110404264A (en) 2019-11-05
CN110404264B true CN110404264B (en) 2022-11-01

Family

ID=68363135

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910676407.9A Active CN110404264B (en) 2019-07-25 2019-07-25 Multi-person non-complete information game strategy solving method, device and system based on virtual self-game and storage medium

Country Status (1)

Country Link
CN (1) CN110404264B (en)

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111291890B (en) * 2020-05-13 2021-01-01 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Game strategy optimization method, system and storage medium
CN111667075A (en) * 2020-06-12 2020-09-15 杭州浮云网络科技有限公司 Service execution method, device and related equipment
CN112001071A (en) * 2020-08-14 2020-11-27 广州市百果园信息技术有限公司 Method, device, equipment and medium for determining simulated guess data
CN112329348B (en) * 2020-11-06 2023-09-15 东北大学 Intelligent decision-making method for military countermeasure game under incomplete information condition
CN112396180B (en) * 2020-11-25 2021-06-29 中国科学院自动化研究所 Deep Q learning network optimization method based on dynamic teaching data and behavior cloning
CN112507104B (en) * 2020-12-18 2022-07-22 北京百度网讯科技有限公司 Dialog system acquisition method, apparatus, storage medium and computer program product
CN112870727B (en) * 2021-01-18 2022-02-22 浙江大学 Training and control method for intelligent agent in game
CN113159313B (en) * 2021-03-02 2022-09-09 北京达佳互联信息技术有限公司 Data processing method and device of game model, electronic equipment and storage medium
CN113359480B (en) * 2021-07-16 2022-02-01 中国人民解放军火箭军工程大学 Multi-unmanned aerial vehicle and user cooperative communication optimization method based on MAPPO algorithm
CN113805568B (en) * 2021-08-17 2024-04-09 北京理工大学 Man-machine collaborative perception method based on multi-agent space-time modeling and decision
CN113791634B (en) * 2021-08-22 2024-02-02 西北工业大学 Multi-agent reinforcement learning-based multi-machine air combat decision method
CN113706197A (en) * 2021-08-26 2021-11-26 西安交通大学 Multi-microgrid electric energy transaction pricing strategy and system based on reinforcement and simulation learning
CN113689001B (en) * 2021-08-30 2023-12-05 浙江大学 Virtual self-playing method and device based on counter-facts regretation minimization
CN113827946A (en) * 2021-09-10 2021-12-24 网易(杭州)网络有限公司 Game game-play decision-making method and device, electronic equipment and storage medium
CN114048833B (en) * 2021-11-05 2023-01-17 哈尔滨工业大学(深圳) Multi-person and large-scale incomplete information game method and device based on neural network virtual self-game
CN114053712B (en) * 2022-01-17 2022-04-22 中国科学院自动化研究所 Action generation method, device and equipment of virtual object
US11995380B2 (en) * 2022-04-29 2024-05-28 Hadi KERAMATI System and method for heat exchanger shape optimization
CN114839884B (en) * 2022-07-05 2022-09-30 山东大学 Underwater vehicle bottom layer control method and system based on deep reinforcement learning
CN117151224A (en) * 2023-07-27 2023-12-01 中国科学院自动化研究所 Strategy evolution training method, device, equipment and medium for strong random game of soldiers
CN117439794B (en) * 2023-11-09 2024-05-14 浙江大学 CPPS optimal defense strategy game method for uncertainty attack

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106296006A (en) * 2016-08-10 2017-01-04 哈尔滨工业大学深圳研究生院 The minimum sorry appraisal procedure of non-perfect information game risk and Revenue Reconciliation
CN106469317A (en) * 2016-09-20 2017-03-01 哈尔滨工业大学深圳研究生院 A kind of method based on carrying out Opponent Modeling in non-perfect information game
CN107038477A (en) * 2016-08-10 2017-08-11 哈尔滨工业大学深圳研究生院 A kind of neutral net under non-complete information learns the estimation method of combination with Q

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7892080B1 (en) * 2006-10-24 2011-02-22 Fredrik Andreas Dahl System and method for conducting a game including a computer-controlled player

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106296006A (en) * 2016-08-10 2017-01-04 哈尔滨工业大学深圳研究生院 The minimum sorry appraisal procedure of non-perfect information game risk and Revenue Reconciliation
CN107038477A (en) * 2016-08-10 2017-08-11 哈尔滨工业大学深圳研究生院 A kind of neutral net under non-complete information learns the estimation method of combination with Q
CN106469317A (en) * 2016-09-20 2017-03-01 哈尔滨工业大学深圳研究生院 A kind of method based on carrying out Opponent Modeling in non-perfect information game

Also Published As

Publication number Publication date
CN110404264A (en) 2019-11-05

Similar Documents

Publication Publication Date Title
CN110404264B (en) Multi-person non-complete information game strategy solving method, device and system based on virtual self-game and storage medium
CN111291890B (en) Game strategy optimization method, system and storage medium
CN111282267B (en) Information processing method, information processing apparatus, information processing medium, and electronic device
Lee et al. The computational intelligence of MoGo revealed in Taiwan's computer Go tournaments
CN113688977B (en) Human-computer symbiotic reinforcement learning method and device oriented to countermeasure task, computing equipment and storage medium
Togelius How to run a successful game-based AI competition
Whitehouse Monte Carlo tree search for games with hidden information and uncertainty
Buro Statistical feature combination for the evaluation of game positions
Reis et al. Vgc ai competition-a new model of meta-game balance ai competition
CN114404975A (en) Method, device, equipment, storage medium and program product for training decision model
Dobre et al. Online learning and mining human play in complex games
CN110598853B (en) Model training method, information processing method and related device
Zhang et al. A neural model for automatic bidding of contract bridge
CN112870722B (en) Method, device, equipment and medium for generating fighting AI (AI) game model
PRICOPE A view on deep reinforcement learning in imperfect information games
CN114404976A (en) Method and device for training decision model, computer equipment and storage medium
Vieira et al. Exploring Deep Reinforcement Learning for Battling in Collectible Card Games
Ring et al. Replicating deepmind starcraft ii reinforcement learning benchmark with actor-critic methods
CN117883788B (en) Intelligent body training method, game fight method and device and electronic equipment
Kitchen et al. ExIt-OOS: Towards learning from planning in imperfect information games
Guan et al. Learning to Play Koi-Koi Hanafuda Card Games With Transformers
Reis et al. Automatic generation of a sub-optimal agent population with learning
Chen et al. A Novel Reward Shaping Function for Single-Player Mahjong
Li et al. Speedup training artificial intelligence for mahjong via reward variance reduction
Shan ShengJi+: Playing Tractor with Deep Reinforcement Learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant