CN110404264B - Multi-person non-complete information game strategy solving method, device and system based on virtual self-game and storage medium - Google Patents

Multi-person non-complete information game strategy solving method, device and system based on virtual self-game and storage medium

Info

Publication number
CN110404264B
CN110404264B (application CN201910676407.9A)
Authority
CN
China
Prior art keywords
strategy
game
agent
nfsp
player
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910676407.9A
Other languages
Chinese (zh)
Other versions
CN110404264A (en
Inventor
王轩
漆舒汉
蒋琳
胡书豪
毛建博
廖清
李化乐
张加佳
刘洋
夏文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Graduate School Harbin Institute of Technology
Original Assignee
Shenzhen Graduate School Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Graduate School Harbin Institute of Technology filed Critical Shenzhen Graduate School Harbin Institute of Technology
Priority to CN201910676407.9A priority Critical patent/CN110404264B/en
Publication of CN110404264A publication Critical patent/CN110404264A/en
Application granted granted Critical
Publication of CN110404264B publication Critical patent/CN110404264B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • AHUMAN NECESSITIES
    • A63SPORTS; GAMES; AMUSEMENTS
    • A63FCARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/60Generating or modifying game content before or while executing the game program, e.g. authoring tools specially adapted for game development or game-integrated level editor
    • A63F13/67Generating or modifying game content before or while executing the game program, e.g. authoring tools specially adapted for game development or game-integrated level editor adaptively or by learning from player actions, e.g. skill level adjustment or by storing successful combat sequences for re-use
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • AHUMAN NECESSITIES
    • A63SPORTS; GAMES; AMUSEMENTS
    • A63FCARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F2300/00Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game
    • A63F2300/60Methods for processing data by generating or executing the game program
    • A63F2300/6027Methods for processing data by generating or executing the game program using adaptive systems learning from user actions, e.g. for skill level adjustment

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides a method, a device, a system, and a storage medium for solving multi-person non-complete information game strategies based on virtual self-game, wherein the method comprises the following steps: for the two-player game case, the average strategy is generated using multi-class logistic regression and reservoir sampling, and the best-response strategy is generated using DQN and a circular buffer; for the multi-player game case, the multi-agent proximal policy optimization algorithm MAPPO is used to realize the best-response strategy, while a multi-agent NFSP coordinates the training of the agents. The beneficial effects of the invention are: the invention introduces the algorithmic framework of virtual self-play, divides the Texas hold'em strategy optimization process into a best-response-strategy learning part and an average-strategy learning part, and realizes them with deep reinforcement learning and imitation learning respectively, thereby designing a more general multi-agent optimal-strategy learning method.

Description

Multi-person non-complete information game strategy solving method, device and system based on virtual self-game and storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a method, a device, a system, and a storage medium for solving multi-person non-complete information game strategies based on virtual self-game.
Background
Machine gaming and artificial intelligence are inseparably connected, and machine gaming is an important showcase of the development of artificial intelligence. Many well-known scholars in the computer field have conducted research on machine gaming: John von Neumann, the father of the computer, and the mathematician Oskar Morgenstern proposed the minimax approach in games; Alan Turing, the father of artificial intelligence, provided a theoretical basis for developing computer chess programs, and this theory was later used to design the world's first checkers program on the ENIAC. For more than half a century, many major research results in the field of machine gaming have been recognized as important milestones in the development of artificial intelligence. At present, the research results of machine gaming have been widely applied in fields such as robot control, dialogue systems, resource scheduling, traffic signal control, automatic driving, diplomatic negotiation, and financial decision-making.
According to the completeness of game information, games can be divided into perfect information games and imperfect information games. In a perfect information game, every player can completely and immediately obtain all information relevant to the game decision during the decision-making process; many board games belong to perfect information games, such as Go, Chinese chess, and chess. In an imperfect information game, the information obtained by the players is incomplete or delayed, so each player holds private information that the opponents cannot obtain, such as the hole cards in poker games, the vehicle's own field of view in autonomous driving, or the bargaining chips exchanged in diplomatic negotiations.
The incompleteness of information makes solving for the optimal strategy more complicated. To date, many of the more complex imperfect information game problems, such as multi-player non-zero-sum games, have no theoretical method for solving the optimal solution. Take Go, a typical perfect information game, as an example: all players can obtain all information about the current game, so a minimax algorithm can be used to traverse the game tree and find the current optimal strategy. But in imperfect information games, such as poker, the game information is not fully visible. Taking Texas hold'em as an example, each player's hole cards are invisible to the other players, so an agent must reason about and guess the opponent's unknown information during play, and a player can bluff by exploiting private information that the opponent cannot observe. These characteristics make solving imperfect information games much more difficult.
Many real-world decision problems can be abstracted as strategy optimization problems in imperfect information games, but current imperfect-information strategy optimization algorithms, such as Libratus (冷扑大师), can only handle two-player games with discrete actions and simple states, and cannot be applied well to real-world decision problems. Therefore, research on imperfect-information strategy optimization algorithms that support multiple players, continuous actions, and complex states has important theoretical and practical significance.
Disclosure of Invention
The invention provides a multi-person non-perfect information game strategy solving method based on virtual self-game. For the two-player game case, the average strategy is generated using multi-class logistic regression and reservoir sampling, and the best-response strategy is generated using DQN and a circular buffer; for the multi-player game case, the multi-agent proximal policy optimization algorithm MAPPO is used to realize the best-response strategy, while a multi-agent NFSP coordinates the training of the agents. NFSP: neural-network virtual self-play (Neural Fictitious Self-Play); DQN: Deep Q-Network.
As a further improvement of the invention, for the two-player game case, the agent uses the memory segments generated under its best-response strategy as data and trains a fully connected shallow neural network with the reservoir sampling method; the input of the shallow neural network is the current poker game state, and the output is the probability of taking each action in that state. The online NFSP algorithm is adopted, so the two agents play and update their strategies at the same time.
As a further improvement of the invention, in the multi-player game case, the multi-agent proximal policy optimization algorithm comprises: using centralized advantage estimation; when updating the policy network, MAPPO uses the clipped surrogate objective function; during training, the agent continuously updates the policy network with decision sequences obtained by exploring the environment, and updates the target policy network at regular intervals.
As a further improvement of the present invention, in the multi-player game case, the multi-agent NFSP comprises: after each sampled action, executing the action in the environment and updating the current state; and after the game ends, updating the reward values in the last memory segments of all agents in the memory bank.
The invention also provides a multi-person non-complete information game strategy solving device based on virtual self-game, which comprises:
a two-player game module: the average strategy is generated using multi-class logistic regression and reservoir sampling, and the best-response strategy is generated using DQN and a circular buffer;
a multi-player game module: the multi-agent proximal policy optimization algorithm MAPPO is used to realize the best-response strategy, while a multi-agent NFSP coordinates the training of the agents;
NFSP: neural-network virtual self-play (Neural Fictitious Self-Play); DQN: Deep Q-Network.
In the two-player game module, the agent uses the memory segments generated under its best-response strategy as data and trains a fully connected shallow neural network with the reservoir sampling method; the input of the shallow neural network is the current poker game state, and the output is the probability of taking each action in that state. The online NFSP algorithm is adopted, so the two agents play and update their strategies simultaneously.
As a further improvement of the invention, in the multi-player game module, the multi-agent proximal policy optimization algorithm comprises: using centralized advantage estimation; when updating the policy network, MAPPO uses the clipped surrogate objective function; during training, the agent continuously updates the policy network with decision sequences obtained by exploring the environment, and updates the target policy network at regular intervals.
As a further development of the invention, in the multi-player game module, the multi-agent NFSP comprises: after each sampled action, executing the action in the environment and updating the current state; and after the game ends, updating the reward values in the last memory segments of all agents in the memory bank.
The invention also provides a multi-person non-complete information game strategy solving system based on virtual self-game, comprising: a memory, a processor, and a computer program stored on the memory, the computer program being configured to implement the steps of the method of the invention when invoked by the processor.
The invention also provides a computer-readable storage medium having stored thereon a computer program configured to, when invoked by a processor, perform the steps of the method of the invention.
The invention has the beneficial effects that: the invention introduces the algorithmic framework of virtual self-play, divides the Texas hold'em strategy optimization process into a best-response-strategy learning part and an average-strategy learning part, and realizes them with deep reinforcement learning and imitation learning respectively, thereby designing a more general multi-agent optimal-strategy learning method.
Drawings
FIG. 1 is a flow chart of the virtual self-play algorithm;
FIG. 2 is a framework diagram of the NFSP algorithm;
FIG. 3 is a diagram of the communication framework of the ACPC competition;
FIG. 4 is a data communication flow diagram of the Texas hold'em machine gaming system.
Detailed Description
1.1 The invention discloses a multi-person non-complete information game strategy solving method based on virtual self-game; taking multi-player no-limit Texas hold'em as an example, the invention provides a strategy solving algorithm for multi-player no-limit Texas hold'em. The invention is based on virtual self-play, combines techniques such as deep learning and multi-agent reinforcement learning, and uses Texas hold'em and the multi-agent particle environment as experimental platforms. When traditional methods are used to solve the imperfect information game of Texas hold'em, the scale of the game tree must be reduced with methods such as card abstraction, and the transferability is poor. The invention introduces the algorithmic framework of virtual self-play, divides the Texas hold'em strategy optimization process into a best-response-strategy learning part and an average-strategy learning part, and realizes them with deep reinforcement learning and imitation learning respectively, thereby designing a more general multi-agent optimal-strategy learning algorithm.
1.2 detailed description of the method
1.2.1 Neural-network virtual self-play (NFSP)
1.2.1.1 Algorithm framework
Virtual self-play (Fictitious Self-Play, FSP) can be used to solve strategy optimization in large imperfect information games such as Texas hold'em. FSP is a machine learning algorithm that has been shown to converge to Nash equilibrium in two-player zero-sum games; it realizes generalized weakened fictitious play based on behavioral strategies and sampling. Conventional fictitious play faces the curse of dimensionality as the size of the game increases, because each iteration requires computing over all states of the game. FSP, however, requires only an approximate best response and also tolerates some perturbation in the updates. FSP replaces the computation of the average strategy and the best response in fictitious play with machine learning algorithms: updating the average strategy is an imitation learning process, for which a common imitation learning algorithm can be used, while computing the best response is replaced by reinforcement learning.
In the learning process of FSP, two virtual players continuously optimize their respective game strategies by playing against each other. Each player maintains two strategies: a best-response strategy β_j against the opponent and its own average strategy π_j. The training process of FSP consists of the following steps: each player initializes its average strategy, which can be done with any random scheme; each player plays games using its own best-response strategy against the opponent's average strategy, obtaining data D; each player updates its reinforcement learning memory M_rl and its imitation learning memory M_sl according to the data D; each player updates its best-response strategy β_{j+1} with a reinforcement learning algorithm and its average strategy π_{j+1} with an imitation learning algorithm; when a given number of iterations m is reached, the average strategy of each player is the learned approximate Nash equilibrium strategy.
Neural-network virtual self-play (Neural Fictitious Self-Play, NFSP) builds on FSP by using neural networks to fit both the best-response strategy and the average strategy. Whenever the agent takes an action according to its behavior strategy (a mixture of the best response and the average strategy) and receives environment feedback, the memory segment (s, a, r, s') is stored in the reinforcement learning memory M_rl and the imitation learning memory M_sl. Subsequently, the memories in M_rl are used to update the parameters of the best-response network Q(s, a | θ_Q), and the memories in M_sl are used to update the parameters of the average-strategy network Π(s, a | θ_Π). In NFSP, Q(s, a | θ_Q) fits the agent's best response, while Π(s, a | θ_Π) fits the average of the agent's best-response strategies, and the agent uses continuous-time dynamic fictitious play to anticipate the opponent's average strategy. Let π_{N,t} denote the normal-form average strategy and β_{N,t} the normal-form best-response strategy; then in continuous time:
dπ_{N,t}/dt ∝ β_{N,t} − π_{N,t}    (3-1)
Δπ_{N,t} = π_{N,t+1} − π_{N,t} ∝ β_{N,t} − π_{N,t}    (3-2)
thus
π_{N,t+1} ≈ η β_{N,t} + (1 − η) π_{N,t}    (3-3)
where η ∈ R is called the anticipatory parameter. Fig. 1 shows the overall flow of the virtual self-play algorithm.
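For illustration only, the anticipatory mixing of equation (3-3) can be sketched as follows: the agent acts from its best-response network with probability η and from its average-strategy network otherwise, and (as described in section 1.2.2.2) supervised data for the average strategy come from the segments generated under the best response. All class and method names below (state.legal_actions, q_net.value, avg_net.sample) are hypothetical, not the patented implementation.

    import random

    def choose_policy(eta):
        # With probability eta act from the best-response network, otherwise from the average strategy.
        return "best_response" if random.random() < eta else "average"

    def act(state, mode, q_net, avg_net, epsilon):
        # state.legal_actions, q_net.value and avg_net.sample are hypothetical interfaces.
        if mode == "best_response":
            # epsilon-greedy over the Q-network, as in equation (3-5)
            if random.random() < epsilon:
                return random.choice(state.legal_actions)
            return max(state.legal_actions, key=lambda a: q_net.value(state, a))
        # sample an action from the average-strategy network Pi(s, a | theta_Pi)
        return avg_net.sample(state)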
1.2.2 Solving the two-player no-limit Texas hold'em strategy with NFSP
1.2.2.1 Learning the best-response strategy with a deep Q-network
First, the agent's best-response strategy is learned with a Deep Q-Network (DQN). DQN is an off-policy reinforcement learning algorithm for finding the optimal policy of an MDP, so it can be applied within the FSP framework.
DQN derives an ε-greedy strategy from experience replay using Q-learning. Similarly, when solving Texas hold'em under the FSP framework, DQN can be used to learn the best response to the opponent's average strategy. The FSP agent uses the DQN algorithm to learn, from the experience segments in its reinforcement learning memory, a neural network Q(s, a | θ_Q) that predicts state-action values and thus constructs a best response to the opponent's (most recent) anticipated average strategy. The loss function of the neural network may be set as:
L(θ_Q) = E_{(s,a,r,s')~M_rl} [ ( r + γ max_{a'} Q(s', a' | θ_Q') − Q(s, a | θ_Q) )^2 ]    (3-4)
where θ_Q' is a target network whose weights are periodically overwritten with θ_Q during training. The final network determines the agent's approximate best-response strategy:
β = ε-greedy[ Q(· | θ_Q) ]    (3-5)
This means that the agent chooses an action uniformly at random with probability ε, and otherwise chooses the action with the largest predicted Q value.
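A minimal sketch of the loss in equation (3-4), assuming a PyTorch-style q_net(s) that returns Q-values for all actions and a separately held target network; the tensor layout of the mini-batch is an assumption, not part of the patent.

    import torch
    import torch.nn.functional as F

    def dqn_loss(q_net, target_net, batch, gamma):
        # batch: tensors (s, a, r, s_next, done) drawn from the circular buffer M_rl
        s, a, r, s_next, done = batch
        q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)       # Q(s, a | theta_Q)
        with torch.no_grad():
            q_next = target_net(s_next).max(dim=1).values          # max_a' Q(s', a' | theta_Q')
            target = r + gamma * (1.0 - done) * q_next             # bootstrapped target
        return F.mse_loss(q_sa, target)                            # squared error of equation (3-4)

    # The target network theta_Q' is refreshed periodically, e.g.:
    #   target_net.load_state_dict(q_net.state_dict())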
1.2.2.2 Generating the average strategy with imitation learning
In FSP, the average strategy Π^k is the average of the best-response strategies the player has taken over the past k iterations, namely:
Π^k = (1/k) Σ_{j=1}^{k} β^j    (3-6)
Suppose player i wishes to learn, from its own normal-form mixed strategy Π, a behavioral strategy whose realization is equivalent. This amounts to learning a model of its behavioral strategy by sampling actions from Π. The simplest way to learn the average strategy is to count the actions taken on the different information sets. In the formulas below, N(s_t, a_t) is the accumulated probability with which the player takes action a_t in information set s_t, and ρ_t is the policy the agent was using when the sampled segment was recorded:
N(s_t, a_t) = Σ_{sampled segments} ρ_t(a_t | s_t)    (3-7)
Π(a | s) = N(s, a) / Σ_{a'} N(s, a')    (3-8)
However, for a game like Texas hold'em, whose state space is enormous, existing storage cannot hold all states. NFSP therefore introduces the idea of Learning from Demonstration (LfD) and fits the average strategy with a neural network. The average-strategy network Π(s, a | θ_Π) updates its parameters with the loss in equation (3-9); since the best-response strategy provides the training data, the update of the average strategy can be regarded as a form of imitation learning.
L(θ_Π) = E_{(s,a)~M_sl} [ −log Π(s, a | θ_Π) ]    (3-9)
To ensure that Π(s, a | θ_Π) is an unbiased estimate of the average of the best-response strategies, every memory segment in M_sl must be sampled with the same probability. However, M_sl is implemented as a fixed-size list, and a naive insertion and sampling scheme would not give every segment in the stream the same chance of being retained and drawn. The invention therefore designs the imitation learning memory M_sl with reservoir sampling, ensuring that all memory segments are kept and drawn with equal probability.
The invention uses the memory segments generated while the agent follows its best-response strategy as data and trains a fully connected shallow neural network Π(s, a | θ_Π) with reservoir sampling; the input of the network is the current poker game state s, and the output is the probability of taking each action a in that state.
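A minimal reservoir-sampling buffer of the kind described above (a sketch, not the patented code): after n segments have been offered, every segment has been retained with equal probability capacity/n, which keeps the supervised data for Π(s, a | θ_Π) unbiased.

    import random

    class ReservoirBuffer:
        def __init__(self, capacity):
            self.capacity = capacity
            self.items = []
            self.seen = 0                      # number of segments offered so far

        def add(self, segment):
            self.seen += 1
            if len(self.items) < self.capacity:
                self.items.append(segment)
            else:
                j = random.randint(0, self.seen - 1)
                if j < self.capacity:          # keep each segment with probability capacity/seen
                    self.items[j] = segment

        def sample(self, batch_size):
            return random.sample(self.items, min(batch_size, len(self.items)))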
1.2.2.3 Online NFSP Algorithm
As shown in Fig. 2, the invention uses the online NFSP algorithm, i.e. the two agents play and update their strategies at the same time instead of updating alternately. The online algorithm has two advantages in practical training: first, sampling experience simultaneously is in theory n times more efficient than the alternating approach, where n is the number of agents; second, simultaneously learning agents can be applied to real black-box environments, such as a traffic light system.
(The pseudocode of the online NFSP algorithm is rendered as images in the original document.)
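Since the pseudocode is available only as an image, the following is a hedged reconstruction of the online NFSP loop from the description above, reusing the hypothetical helpers sketched in section 1.2.1.1; the environment, buffer, and update interfaces are assumptions, not the original code.

    def online_nfsp_episode(env, agents, eta, epsilon):
        # env, agents[i].q_net / avg_net / m_rl / m_sl and the update_* methods are hypothetical.
        state = env.reset()
        while not env.done():
            i = env.current_player()
            agent = agents[i]
            mode = choose_policy(eta)                              # best response vs. average strategy
            action = act(state, mode, agent.q_net, agent.avg_net, epsilon)
            next_state, reward = env.step(action)
            agent.m_rl.add((state, action, reward, next_state))    # circular buffer feeding DQN
            if mode == "best_response":
                agent.m_sl.add((state, action))                    # reservoir buffer feeding Pi
            state = next_state
        for agent in agents:                                       # both agents learn in the same episode
            agent.update_q_network()
            agent.update_average_network()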
1.2.2.4 Multi-agent proximal policy optimization algorithm
The invention provides a Multi-Agent Proximal Policy Optimization algorithm (MAPPO). To address the amplification of policy-performance perturbations among agents and the difficulty of tuning the learning rate in the multi-agent setting, the clipped surrogate objective function replaces the Actor's optimization objective in MADDPG.
A common policy gradient implementation, also suitable for recurrent neural networks, runs the policy for T time steps (with T much smaller than the episode length) and then performs a policy update using the T collected time steps. This implementation requires an advantage estimator that evaluates sequences of at most T time steps; the estimator used in the present invention is
Â_t = −V(s_t) + r_t + γ r_{t+1} + ⋯ + γ^{T−t−1} r_{T−1} + γ^{T−t} V(s_T)    (4-1)
where t denotes the time index in [0, T]. More generally, a truncated version of the generalized advantage estimate can be used, as shown in equation (4-2); the two coincide when λ = 1:
Â_t = δ_t + (γλ) δ_{t+1} + ⋯ + (γλ)^{T−t−1} δ_{T−1}    (4-2)
δ_t = r_t + γ V(s_{t+1}) − V(s_t)    (4-3)
Under the centralized-training, decentralized-execution framework of multi-agent reinforcement learning, the centralized Critic and the decentralized Actors have different observation fields, so a parameter-sharing network structure cannot be used; the Critic and Actor losses in MAPPO are therefore computed separately, and two independent networks realize the policy and the value estimate. The main difference between MAPPO and PPO is the use of centralized advantage estimation, which can see the observations and actions of all agents and therefore yields a more accurate estimate; the TD error used for the centralized advantage estimation is shown in equation (4-4).
δ_t^i = r_t^i + γ V_i(x_{t+1}) − V_i(x_t)    (4-4)
where x_t collects the observations (and actions) of all agents at time t.
The Critic loss L(φ_i) is computed as shown in equation (4-5):
L(φ_i) = (1/S) Σ_{j=1}^{S} (δ_j^i)^2    (4-5)
where S is the size of the mini-batch and i denotes agent i. MAPPO can use an online network together with a periodically updated target network to keep the Critic parameters from diverging during updates.
MAPPO uses the clipped surrogate objective function instead of the plain policy gradient when updating the policy network. Unlike PPO, MAPPO uses the centralized advantage function Â_t^i to guide the policy update. The policy update objective of MAPPO is
L^{CLIP}(θ_i) = E_t [ min( r_t(θ_i) Â_t^i, clip(r_t(θ_i), 1−ε, 1+ε) Â_t^i ) ],  with  r_t(θ_i) = π_{θ_i}(a_t | o_t) / π_{θ_i,old}(a_t | o_t)    (4-6)
During training, the agent continuously updates the policy network with decision sequences obtained by exploring the environment and copies its parameters into the target policy network at regular intervals. (The pseudocode of the MAPPO algorithm is rendered as images in the original document.)
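As an illustrative sketch of the clipped surrogate update in equation (4-6) and the Critic loss in equation (4-5) (not the patented code), the losses can be written as follows; the tensor arguments, the clipping threshold clip_eps = 0.2, and the use of one independent critic per agent are assumptions.

    import torch

    def mappo_actor_loss(log_prob_new, log_prob_old, advantage, clip_eps=0.2):
        # advantage is the centralized advantage estimate A_t^i computed from the joint critic
        ratio = torch.exp(log_prob_new - log_prob_old)
        unclipped = ratio * advantage
        clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
        return -torch.min(unclipped, clipped).mean()   # maximizing the surrogate = minimizing its negative

    def mappo_critic_loss(values, returns):
        # mean squared error over the mini-batch of size S, one independent critic per agent
        return ((returns - values) ** 2).mean()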
1.2.3 Multi-agent NFSP Algorithm
1.2.3.1 partially observable Markov games
A multi-agent Markov decision process can be viewed as a partially observable Markov game. A partially observable Markov game consists of the following components:
(1) Sets of actions A_1, …, A_N;
(2) Sets of observations O_1, …, O_N;
(3) Stochastic policies π_{θ_i}: O_i × A_i → [0, 1];
(4) State transition function Γ: S × A_1 × … × A_N → S;
(5) Reward functions r_i: S × A_i → R;
(6) Private observations o_i: S → O_i.
The expected return of agent i is
J_i = E [ Σ_{t=0}^{T} γ^t r_i^t ]
where γ is the discount rate of rewards over time.
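As a small worked example of the expected return above (illustrative only), the discounted return of one sampled episode can be computed as follows; the reward list and the value of γ are hypothetical.

    def discounted_return(rewards, gamma):
        # One-episode estimate of agent i's return: sum over t of gamma^t * r_i^t
        g = 0.0
        for r in reversed(rewards):
            g = r + gamma * g
        return g

    # Example: rewards 0, 0, 10 with gamma = 0.99 give 0.99^2 * 10 = 9.801
    print(discounted_return([0.0, 0.0, 10.0], 0.99))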
1.2.3.2 Multi-agent NFSP Algorithm
In multi-player Texas hold'em there is a betting order, and the environment changes after each agent acts. In addition, all agents receive their final payoff only when a hand is over. The invention therefore adapts the NFSP algorithm to learning multi-player Texas hold'em strategies: after each sampled action, the action is executed in the environment and the current state is updated, and after the hand ends, the reward values in the last memory segments of all agents in the memory bank are updated.
(The pseudocode of the multi-agent NFSP algorithm is rendered as images in the original document.)
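A minimal sketch of the deferred reward update described above, under the assumption of a memory-buffer interface with last() and replace_last() methods (hypothetical names): each agent's provisional reward is overwritten with its final payoff once the hand ends.

    def finish_hand(agents, payoffs):
        # payoffs[i] is agent i's net chip result, known only once the hand ends;
        # m_rl.last() / replace_last() are hypothetical buffer methods.
        for i, agent in enumerate(agents):
            s, a, _, s_next = agent.m_rl.last()
            agent.m_rl.replace_last((s, a, payoffs[i], s_next))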
2.1 Experimental setup
2.1.1 Texas hold'em gaming environment under ACPC rules
The invention realizes game play among the agents based on the communication framework of the ACPC competition, as shown in Fig. 3. When a game starts, each agent establishes a TCP/IP connection with the server running the dealer through a specified port and confirms that its ACPC communication protocol version is consistent with that of the server. After the game starts, the server sends the game state information, encoded as character strings, to each agent and waits for a message from the agent that needs to act. After receiving the string, each agent parses it into the corresponding game state; if the game state indicates that it must act, the agent selects an action through its strategy, encodes the action into a string, and sends it to the server. After receiving the string sent by an agent, the server parses it into a legal action, executes it, and sends the resulting game state information to every agent. This loop continues until the game ends.
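For illustration, a client skeleton following the described flow (connect, handshake, parse the dealer's state string, reply with an encoded action) might look as follows; parse_state, encode_action, the agent interface, and the exact handshake string are assumptions rather than the real ACPC API.

    import socket

    def run_client(host, port, agent):
        # parse_state and encode_action are hypothetical helpers, not the real ACPC message format.
        sock = socket.create_connection((host, port))
        f = sock.makefile("rw", newline="")  # no newline translation; ACPC messages end with CRLF
        f.write("VERSION:2.0.0\r\n")         # version handshake (exact string assumed)
        f.flush()
        for line in f:                        # one state string per line from the dealer
            state = parse_state(line.strip())
            if state.our_turn:
                reply = encode_action(line.strip(), agent.act(state))
                f.write(reply + "\r\n")
                f.flush()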
Fig. 4 is a framework diagram of the gaming system according to the invention; experiments on two-player and multi-player games use different methods to generate and train the average strategy and the best-response strategy. Specifically, for the two-player game problem, the average strategy is generated with multi-class logistic regression and reservoir sampling, and the best-response strategy is generated with DQN and a circular buffer; for the multi-player game problem, the best-response strategy is realized with the multi-agent proximal policy optimization algorithm provided by the invention, and the multi-agent NFSP replaces the original NFSP to coordinate the training of the agents.
2.1.2 Experimental design
In a two-player zero-sum game, the exploitability ε_σ = b_1(σ_2) + b_2(σ_1) is an index measuring the distance between a strategy profile and a Nash equilibrium, where b_i(σ_{−i}) is the value of player i's best response to the opponent's strategy. In poker games, mbb/h (milli-big-blinds per hand) is often used to express exploitability, i.e. how many thousandths of a big blind each player would lose per hand in the worst case. The invention adopts the same exploitability measure to test the convergence speed and solution accuracy of the approximate Nash equilibrium strategy solved by NFSP on Leduc poker. Table 5-1 lists the parameters used in the experiment.
TABLE 5-1 two-person Leduc poker experiment parameters
(Table 5-1 is rendered as an image in the original; the parameter values are not recoverable from the text.)
2.1.3 experiments
2.1.3.1 Exploitability and error analysis
The training process of NFSP consists of two parts, reinforcement learning and imitation learning, and the invention investigates the relationship between the decrease of the training errors of these two parts and the change in exploitability. The tested error is the average of the mean squared error of the strategy over all nodes at the beginning of the game. Let H be the set of histories h in which the action sequences of all players are empty; then the errors are:
E_Q = (1/|H|) Σ_{h∈H} (1/|A(h)|) Σ_{a∈A(h)} ( Q(h, a) − Q*(h, a) )^2
E_Π = (1/|H|) Σ_{h∈H} (1/|A(h)|) Σ_{a∈A(h)} ( Π(h, a) − Π*(h, a) )^2
where Q and Π are the strategies produced by reinforcement learning and imitation learning, and Q* and Π* are the best-response strategy computed by CFR and the extended average strategy mentioned above. The nodes at the beginning of the game are chosen because they are the farthest from the leaf nodes, so the strategy computed by NFSP differs most clearly from the ideal strategy there.
The imitation learning error experiments and reinforcement learning error experiments show that both learning components converge faster than the strategy generated by NFSP, because NFSP must learn against moving targets. At time t, one NFSP agent uses reinforcement learning to update the best response to the opponent's average strategy while using imitation learning to update the average of its own best-response strategies; the target strategies of both parts therefore keep changing throughout training. The error can thus be viewed as two components: the update error and the fitting error. At the beginning of training, the update error is larger than the fitting error, which means the error mainly comes from computing the best-response strategy with the reinforcement learning algorithm; as the total training error decreases, the update error becomes smaller than the fitting error, which means the error mainly comes from fitting the average of the best-response strategies with the imitation learning algorithm.
2.1.3.2 visualization of poker gaming networks
To verify that NFSP learns valid poker features, the invention visualizes the output of the last hidden layer of the policy network with the t-SNE algorithm. Specifically, simulated sampling with the average strategy finally learned by NFSP produces a data set containing the information set, the output of the last hidden layer, and some hand-designed features (such as the pair feature). The t-SNE algorithm then reduces the hidden-layer output to two dimensions; the data points are labeled according to the designed features, and their distribution is observed.
In the t-SNE embedding experiment on the Leduc poker policy network, the data points are colored orange or blue according to whether the hand forms a pair; the policy network can thus be seen to distinguish hands of different strength even though the network input is only the raw card representation. This shows that the NFSP agent successfully learns hand-strength knowledge without relying on domain knowledge, demonstrating that NFSP's learning is effective.
2.1.3.3 results of the multiplayer poker experiment
The invention implements MAPPO and runs experiments on three-player Leduc poker and six-player Texas hold'em. To avoid using domain knowledge as far as possible, the invention encodes the state as objectively as possible, i.e. with as little feature engineering as possible.
State coding: poker games typically consist of several rounds, and new cards are revealed to the players in each round. The invention uses a k-of-n form to encode the cards of each round. For example, Texas hold'em always uses 52 cards, and 3 community cards are revealed in the second round, so that round is encoded as a length-52 vector with 1 at the 3 positions of the community cards and 0 elsewhere. Three actions can be chosen in Texas hold'em: {fold, call, raise}. In the experiments, the raise action is divided into a fixed number of discrete bet sizes via action abstraction, and the number of betting actions per round is limited (in practice, far fewer betting actions occur in a Texas hold'em game). Accordingly, the betting history can be encoded as a 4-dimensional vector {player, round, betting-action index, action taken}.
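A minimal sketch of the k-of-n card encoding described above (the card-to-index mapping is a hypothetical choice):

    def encode_cards(card_indices, deck_size=52):
        # k-of-n encoding: a length-52 vector with 1s at the positions of the revealed cards
        v = [0] * deck_size
        for c in card_indices:
            v[c] = 1
        return v

    # Example: three community cards revealed on the second round (hypothetical card indices)
    flop_vector = encode_cards([7, 23, 48])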
Action coding: each player's chip count in the poker game is fixed, so a raise becomes illegal once the maximum amount has been reached; in addition, folding is an illegal action when no bet has been made. Because the policy network implemented by the invention has a fixed output action set, the environment is adjusted to handle illegal outputs: any illegal action defaults to a call (check).
Reward coding: the invention directly takes the net chip balance after the agent acts as the reward value. For example, when an agent calls to 500 having already put 400 into the pot, it receives an immediate reward of −100. In poker, all agents learn their losses and winnings only after the whole hand is over, so the reward in the terminal state is the sum of the immediate reward and the terminal game value. For example, if the agent goes all-in to 20000 in the last round having already contributed 19000 to the pot, and wins 40000 at showdown because it holds the strongest hand, the reward of the terminal state is −1000 + 40000 = 39000.
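The net-chip reward rule amounts to the following arithmetic (an illustrative sketch with hypothetical helper names):

    def immediate_reward(chips_committed_by_action):
        # Net chip balance change caused by the agent's own action (negative when betting)
        return -chips_committed_by_action

    def terminal_reward(last_immediate_reward, winnings):
        # At the end of the hand the terminal value is added to the last immediate reward
        return last_immediate_reward + winnings

    # Calling to 500 having already put in 400 commits another 100 chips: reward -100.
    # Committing a further 1000 on the last action and winning 40000 at showdown: -1000 + 40000 = 39000.
    print(immediate_reward(100))            # -100
    print(terminal_reward(-1000, 40000))    # 39000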
The experiments use the ACPC Random_Player and CFR5000 (i.e. an agent obtained after 5000 iterations of the CFR algorithm on the same game) as opponent agents. The match results use mbb/g as the evaluation criterion, where mbb/g denotes how many thousandths of a big blind are won per game. An agent that always folds loses 750 mbb/g, and professional poker players can obtain an expected profit of 40-60 mbb/g in large online games. Multi-player agents trained for 10000, 50000 and 100000 iterations were selected for the experiments; each played 3000 hands and the average winnings were computed. The results against Random_Player are shown in Table 5-3.
TABLE 5-3 Results of the multi-player game agents against Random_Player
(Table 5-3 is rendered as an image in the original; the numerical results are not recoverable from the text.)
It can be seen that the three agents achieve a clear advantage against the two Random_Player opponents, but the two NFSP-based agents perform no better than the Pure CFR agent at the beginning of training, because the algorithm used in the invention is sample-based, sampling only one segment per round until the game terminates; the number of iterations required for training is large, but each iteration is short. In the experiments, 50000 iterations of NFSP + MAPPO were roughly equivalent to 3000 iterations of Pure CFR. In addition, the NFSP + MAPPO agent outperforms NFSP at every iteration count, which shows that adding the improved multi-agent reinforcement learning algorithm to learn the best-response strategy effectively alleviates the non-stationary environment problem and ultimately raises the playing level of the agent.
TABLE 5-4 Results of the multi-player game agents against CFR5000
(Table 5-4 is rendered as an image in the original; the numerical results are not recoverable from the text.)
From the match results it can be seen that the NFSP + MAPPO agent achieves performance similar to CFR5000 after 50000 iterations.
HITSZ_Jaysen is a six-player Texas hold'em agent based on a hand-designed strategy over hand-strength features; it took third place in the six-player no-limit Texas hold'em event of the 2018 ACPC competition. The experiment seats three agents designed by the invention and three HITSZ_Jaysen agents at a six-player Texas hold'em table; 3000 hands are played and the average big-blind winnings per hand are computed. As shown in Table 5-5, the multi-player game agent of the invention achieves better performance than HITSZ_Jaysen in the six-player no-limit Texas hold'em experiments. The multi-player game agent provided by the invention thus attains a high-level game strategy through end-to-end learning without requiring poker domain knowledge.
TABLE 5-5 Results of the multi-player game agents against HITSZ_Jaysen
(Table 5-5 is rendered as an image in the original; the numerical results are not recoverable from the text.)
The foregoing is a further detailed description of the invention in connection with specific preferred embodiments and it is not intended to limit the invention to the specific embodiments described. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.

Claims (4)

1. A multi-person non-complete information game strategy solving method based on virtual self-game is characterized in that:
aiming at the two-person game situation, the average strategy is generated by using multi-class logistic regression and reservoir sampling, and the best-response strategy is generated by using DQN and a circular buffer;
aiming at the multi-person game situation, a multi-agent proximal policy optimization algorithm MAPPO is used for realizing the best-response strategy, and meanwhile, a multi-agent NFSP is used for adjusting the training of the agents;
NFSP: neural-network virtual self-play (Neural Fictitious Self-Play), DQN: deep Q-network;
aiming at the two-player game situation, the agent adopts memory segments of the best-response strategy as data, and adopts a reservoir sampling method to train a fully-connected shallow neural network, wherein the input of the shallow neural network is the current poker game situation, and the output is the probability of taking each action in the state; and the online NFSP algorithm is adopted, and the two agents play and update strategies at the same time;
in the case of multi-player gaming, the multi-agent proximal policy optimization algorithm comprises the following steps: using centralized advantage estimation; when updating the policy network, the MAPPO uses a clipped surrogate objective function; and in the training process, the agent continuously updates the policy network by using decision sequences obtained by exploration in the environment and updates the target policy network at regular intervals;
in the case of a multi-player game, the multi-agent NFSP comprises: executing and updating the current state in the environment after each sampled action, and updating the reward values in the last memory segments of all the agents in the memory bank after the game is finished;
the multi-player non-complete information game strategy solving method is applied to the Texas hold'em game.
2. A multi-person non-complete information game strategy solving device based on virtual self-game, characterized by comprising:
a two-player game module: the average strategy is generated by using multi-class logistic regression and reservoir sampling, and the best-response strategy is generated by using DQN and a circular buffer;
a multi-player game module: using a multi-agent proximal policy optimization algorithm MAPPO to realize the best-response strategy, and simultaneously using a multi-agent NFSP to adjust the training of the agents;
NFSP: neural-network virtual self-play (Neural Fictitious Self-Play), DQN: deep Q-network;
in the two-player game module, an intelligent body adopts a memory segment of an optimal response strategy as data, and trains a fully-connected shallow neural network by adopting a reservoir sampling method, wherein the input of the shallow neural network is the current poker game situation, and the output is the probability of taking each action in the state; the online NFSP algorithm is adopted, and the two agents play chess and update strategies simultaneously;
in the multi-player gaming module, the multi-agent near-end strategy optimization algorithm comprises the following steps: by using centralized advantage estimation, when updating the strategy network, the MAPPO uses the tailored proxy objective function, and in the training process, the intelligent agent continuously updates the strategy network by using a decision sequence obtained by exploring in the environment and updates the target strategy network at regular time;
in the multi-player gaming module, the multi-agent NFSP comprises: executing and updating the current state in the environment after each sampling action, and updating the reward values in the last memory segments of all the intelligent agents in the memory bank after the game is finished;
the multi-player non-complete information game strategy solving device is applied to the Texas poker game.
3. A multi-person non-complete information game strategy solving system based on virtual self-game, characterized by comprising: a memory, a processor and a computer program stored on the memory, the computer program being configured to implement the steps of the method of claim 1 when invoked by the processor.
4. A computer-readable storage medium, characterized in that: the computer-readable storage medium stores a computer program configured to implement the steps of the method of claim 1 when invoked by a processor.
CN201910676407.9A 2019-07-25 2019-07-25 Multi-person non-complete information game strategy solving method, device and system based on virtual self-game and storage medium Active CN110404264B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910676407.9A CN110404264B (en) 2019-07-25 2019-07-25 Multi-person non-complete information game strategy solving method, device and system based on virtual self-game and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910676407.9A CN110404264B (en) 2019-07-25 2019-07-25 Multi-person non-complete information game strategy solving method, device and system based on virtual self-game and storage medium

Publications (2)

Publication Number Publication Date
CN110404264A CN110404264A (en) 2019-11-05
CN110404264B true CN110404264B (en) 2022-11-01

Family

ID=68363135

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910676407.9A Active CN110404264B (en) 2019-07-25 2019-07-25 Multi-person non-complete information game strategy solving method, device and system based on virtual self-game and storage medium

Country Status (1)

Country Link
CN (1) CN110404264B (en)

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111291890B (en) * 2020-05-13 2021-01-01 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Game strategy optimization method, system and storage medium
CN111667075A (en) * 2020-06-12 2020-09-15 杭州浮云网络科技有限公司 Service execution method, device and related equipment
CN112001071A (en) * 2020-08-14 2020-11-27 广州市百果园信息技术有限公司 Method, device, equipment and medium for determining simulated guess data
CN112329348B (en) * 2020-11-06 2023-09-15 东北大学 Intelligent decision-making method for military countermeasure game under incomplete information condition
CN112396180B (en) * 2020-11-25 2021-06-29 中国科学院自动化研究所 Deep Q learning network optimization method based on dynamic teaching data and behavior cloning
CN112507104B (en) * 2020-12-18 2022-07-22 北京百度网讯科技有限公司 Dialog system acquisition method, apparatus, storage medium and computer program product
CN112870727B (en) * 2021-01-18 2022-02-22 浙江大学 Training and control method for intelligent agent in game
CN113159313B (en) * 2021-03-02 2022-09-09 北京达佳互联信息技术有限公司 Data processing method and device of game model, electronic equipment and storage medium
CN113359480B (en) * 2021-07-16 2022-02-01 中国人民解放军火箭军工程大学 Multi-unmanned aerial vehicle and user cooperative communication optimization method based on MAPPO algorithm
CN113805568B (en) * 2021-08-17 2024-04-09 北京理工大学 Man-machine collaborative perception method based on multi-agent space-time modeling and decision
CN113791634B (en) * 2021-08-22 2024-02-02 西北工业大学 Multi-agent reinforcement learning-based multi-machine air combat decision method
CN113706197A (en) * 2021-08-26 2021-11-26 西安交通大学 Multi-microgrid electric energy transaction pricing strategy and system based on reinforcement and simulation learning
CN113689001B (en) * 2021-08-30 2023-12-05 浙江大学 Virtual self-playing method and device based on counter-facts regretation minimization
CN113827946A (en) * 2021-09-10 2021-12-24 网易(杭州)网络有限公司 Game game-play decision-making method and device, electronic equipment and storage medium
CN114048833B (en) * 2021-11-05 2023-01-17 哈尔滨工业大学(深圳) Multi-person and large-scale incomplete information game method and device based on neural network virtual self-game
CN114053712B (en) * 2022-01-17 2022-04-22 中国科学院自动化研究所 Action generation method, device and equipment of virtual object
US11995380B2 (en) * 2022-04-29 2024-05-28 Hadi KERAMATI System and method for heat exchanger shape optimization
CN114839884B (en) * 2022-07-05 2022-09-30 山东大学 Underwater vehicle bottom layer control method and system based on deep reinforcement learning
CN117151224A (en) * 2023-07-27 2023-12-01 中国科学院自动化研究所 Strategy evolution training method, device, equipment and medium for strong random game of soldiers
CN117439794B (en) * 2023-11-09 2024-05-14 浙江大学 CPPS optimal defense strategy game method for uncertainty attack

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106296006A (en) * 2016-08-10 2017-01-04 哈尔滨工业大学深圳研究生院 The minimum sorry appraisal procedure of non-perfect information game risk and Revenue Reconciliation
CN106469317A (en) * 2016-09-20 2017-03-01 哈尔滨工业大学深圳研究生院 A kind of method based on carrying out Opponent Modeling in non-perfect information game
CN107038477A (en) * 2016-08-10 2017-08-11 哈尔滨工业大学深圳研究生院 A kind of neutral net under non-complete information learns the estimation method of combination with Q

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7892080B1 (en) * 2006-10-24 2011-02-22 Fredrik Andreas Dahl System and method for conducting a game including a computer-controlled player

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106296006A (en) * 2016-08-10 2017-01-04 哈尔滨工业大学深圳研究生院 The minimum sorry appraisal procedure of non-perfect information game risk and Revenue Reconciliation
CN107038477A (en) * 2016-08-10 2017-08-11 哈尔滨工业大学深圳研究生院 A kind of neutral net under non-complete information learns the estimation method of combination with Q
CN106469317A (en) * 2016-09-20 2017-03-01 哈尔滨工业大学深圳研究生院 A kind of method based on carrying out Opponent Modeling in non-perfect information game

Also Published As

Publication number Publication date
CN110404264A (en) 2019-11-05

Similar Documents

Publication Publication Date Title
CN110404264B (en) Multi-person non-complete information game strategy solving method, device and system based on virtual self-game and storage medium
CN111291890B (en) Game strategy optimization method, system and storage medium
CN111282267B (en) Information processing method, information processing apparatus, information processing medium, and electronic device
Lee et al. The computational intelligence of MoGo revealed in Taiwan's computer Go tournaments
CN113688977B (en) Human-computer symbiotic reinforcement learning method and device oriented to countermeasure task, computing equipment and storage medium
Togelius How to run a successful game-based AI competition
Whitehouse Monte Carlo tree search for games with hidden information and uncertainty
Buro Statistical feature combination for the evaluation of game positions
Reis et al. Vgc ai competition-a new model of meta-game balance ai competition
CN114404975A (en) Method, device, equipment, storage medium and program product for training decision model
Dobre et al. Online learning and mining human play in complex games
CN110598853B (en) Model training method, information processing method and related device
Zhang et al. A neural model for automatic bidding of contract bridge
CN112870722B (en) Method, device, equipment and medium for generating fighting AI (AI) game model
PRICOPE A view on deep reinforcement learning in imperfect information games
CN114404976A (en) Method and device for training decision model, computer equipment and storage medium
Vieira et al. Exploring Deep Reinforcement Learning for Battling in Collectible Card Games
Ring et al. Replicating deepmind starcraft ii reinforcement learning benchmark with actor-critic methods
CN117883788B (en) Intelligent body training method, game fight method and device and electronic equipment
Kitchen et al. ExIt-OOS: Towards learning from planning in imperfect information games
Guan et al. Learning to Play Koi-Koi Hanafuda Card Games With Transformers
Reis et al. Automatic generation of a sub-optimal agent population with learning
Chen et al. A Novel Reward Shaping Function for Single-Player Mahjong
Li et al. Speedup training artificial intelligence for mahjong via reward variance reduction
Shan ShengJi+: Playing Tractor with Deep Reinforcement Learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant