CN112691383A - Texas poker AI training method based on virtual regret minimization algorithm - Google Patents

Texas poker AI training method based on virtual regret minimization algorithm

Info

Publication number
CN112691383A
Authority
CN
China
Prior art keywords
game
virtual
player
regret
minimization algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110048898.XA
Other languages
Chinese (zh)
Inventor
张轶飞
程帆
张冬梅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN202110048898.XA priority Critical patent/CN112691383A/en
Publication of CN112691383A publication Critical patent/CN112691383A/en
Pending legal-status Critical Current

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a Texas poker AI training method based on a virtual regret minimization algorithm, which comprises the following steps: 1) obtaining private hand information and public game information, and performing game feature abstraction; 2) establishing a strategy prediction neural network model for a player based on the player's historical game logs; 3) training with the virtual regret minimization algorithm, using the player's strategy prediction neural network model as the opponent, to obtain an AI behavior strategy; 4) playing against game players in real time with the trained AI behavior strategy, and recording each game after it finishes. Compared with the prior art, the invention introduces abstract embeddings of game information and integrates a regret matching mechanism and a local minimal regret value calculation method into the regret minimization algorithm, improving both computational efficiency and the actual win rate.

Description

Texas poker AI training method based on virtual regret minimization algorithm
Technical Field
The invention relates to the field of machine gaming in artificial intelligence, and in particular to a Texas poker AI training method based on a virtual regret minimization algorithm (known in the literature as counterfactual regret minimization, CFR).
Background
Machine gaming has long been an important research problem in the field of artificial intelligence and an important means of gauging its level of development. In recent years, with the development of artificial intelligence, and of deep learning in particular, many traditional machine game problems have been solved; notably, algorithms represented by AlphaGo, which combine deep reinforcement learning with Monte Carlo game tree search, have achieved remarkable results on perfect-information machine games. Many problems in incomplete-information machine games, however, remain unsolved. As a complex and typical incomplete-information machine game problem, the research and realization of an efficient poker AI algorithm is of great significance both in theory and in application.
The regret minimization algorithm is currently an effective algorithm for solving normal-form games. During AI self-play and AI-versus-human play, it absorbs the lessons of past decision errors, and by computing the minimum regret over the whole game tree and iterating continuously it trains the AI toward the best game strategy, so that the actions the AI takes in the future carry the minimum expected regret. In extensive-form game problems, however, the game state space grows exponentially, and the traditional regret minimization algorithm would consume enormous computing resources to compute the overall minimum regret of a game tree at such a scale, which is impractical. Meanwhile, some branches of the game tree occur with very small probability in practice, and computing their regrets wastes computing resources. The algorithm is therefore inefficient and of limited effect in incomplete-information extensive-form games such as Texas poker.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a Texas poker AI training method based on a virtual regret minimization algorithm.
The purpose of the invention can be realized by the following technical scheme:
a Texas poker AI training method based on a virtual regret minimization algorithm comprises the following steps:
1) obtaining private hand information and game display information, and performing game feature abstraction;
2) establishing a strategy prediction neural network model aiming at a player based on a historical game log of the player;
3) training by adopting a virtual regret minimization algorithm and taking a strategy prediction neural network model of a player as an opponent and obtaining an AI behavior strategy;
4) and (4) carrying out real-time match-making with the game player by adopting the AI behavior strategy obtained by training, and recording the match after finishing.
The step 1) specifically comprises the following steps:
11) acquiring the AI's hole cards and the publicly displayed community cards from the real-time game interface;
12) mapping the AI's hole cards and the community cards through the abstract state space of Texas poker with a 9-bucketing strategy to obtain a corresponding card force value;
13) constructing a strategy prediction neural network model that takes the obtained card force value and the game interface information at that moment as input and outputs the abstract features of the current game information, where the game interface information comprises the AI's current hole cards, the publicly displayed community cards, and the current bets of all players in the game.
In step 13), the strategy prediction neural network model consists of a convolution layer with 3×3 kernels, a max pooling layer, a convolution layer with 5×5 kernels, and a max pooling layer connected in sequence; the resulting feature matrix is flattened and concatenated with the AI's card force value into a vector that feeds two fully connected layers, whose node counts are 1326 and 169 respectively.
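A minimal PyTorch sketch of this architecture follows. The patent does not specify the input encoding, channel widths, or pooling sizes, so this sketch assumes a 4 (suits) × 13 (ranks) card matrix with one channel each for the AI's hand, the community cards, and the betting state, 16/32 convolution channels, and 2×2 max pooling.

    import torch
    import torch.nn as nn

    class AbstractionNet(nn.Module):
        """Game-feature abstraction network of step 13) (assumed sizes)."""
        def __init__(self, in_channels: int = 3):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(in_channels, 16, kernel_size=3, padding=1),  # 3x3 convolution
                nn.MaxPool2d(2),                                       # max pooling
                nn.Conv2d(16, 32, kernel_size=5, padding=2),           # 5x5 convolution
                nn.MaxPool2d(2),                                       # max pooling
            )
            # Flattened feature map (32 x 1 x 3 for a 4 x 13 input) plus the
            # scalar card force value feeds the 1326- and 169-node layers.
            self.fc1 = nn.Linear(32 * 1 * 3 + 1, 1326)
            self.fc2 = nn.Linear(1326, 169)

        def forward(self, cards: torch.Tensor, strength: torch.Tensor) -> torch.Tensor:
            x = self.features(cards).flatten(1)        # "elongate" the matrix
            x = torch.cat([x, strength], dim=1)        # append the card force value
            return self.fc2(torch.relu(self.fc1(x)))   # abstract game features

    net = AbstractionNet()
    out = net(torch.zeros(1, 3, 4, 13), torch.zeros(1, 1))  # output shape: (1, 169)

The two layer widths match the structure of Texas poker: 1326 = C(52, 2) is the number of distinct two-card starting hands, and 169 is the number of strategically distinct starting-hand classes (13 pairs plus 78 suited and 78 offsuit combinations).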
The step 2) specifically comprises the following steps:
21) obtaining the player's past game records through the player's game ID;
22) obtaining a trained strategy prediction neural network model corresponding to the player ID, where the network architecture may be chosen from existing neural network models based on reinforcement learning algorithms such as Q-learning or policy gradient;
23) predicting the player's next action with the strategy prediction neural network model and, after the game ends, fine-tuning the model parameters with the player's actual actions as new training data (a sketch of this step follows).
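A minimal sketch of the fine-tuning in step 23), under stated assumptions: the opponent model maps a game-state tensor to action logits, the game log is a list of (state, action-index) pairs, and the optimizer and learning rate are illustrative choices not specified by the patent.

    import torch
    import torch.nn as nn

    def fine_tune(model: nn.Module, game_log, lr: float = 1e-4) -> nn.Module:
        """Fine-tune the opponent model on the player's actions observed in one game."""
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        model.train()
        for state, action in game_log:               # (state_tensor, action_index) pairs
            optimizer.zero_grad()
            logits = model(state.unsqueeze(0))       # predicted action distribution
            loss = loss_fn(logits, torch.tensor([action]))  # observed action as the label
            loss.backward()
            optimizer.step()                          # small parameter adjustment
        return model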
In step 3), the virtual value in the virtual regret minimization algorithm at a non-leaf node h of the game tree is defined as:

$$v_i(\delta, h) = \sum_{z \in Z} \pi_{-i}^{\delta}(h)\, \pi^{\delta}(h, z)\, \mu_i(z)$$

wherein Z is the set of all leaf nodes in the game tree, h is a non-leaf node in the game tree, μ_i(z) is the utility value of player i at leaf node z, δ is the strategy currently used by the players, π_{-i}^δ(h) is the probability that play reaches node h according to the other players' strategies when player i uses strategy δ, and π^δ(h, z) is the probability that player i reaches leaf node z from node h according to strategy δ.
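For illustration, a minimal Python sketch of this definition under assumed data structures (the tree encoding is hypothetical, not from the patent): a non-leaf node holds action probabilities and children, a leaf holds the utility vector, and the opponents' reach probability π_{-i}^δ(h) is supplied by the caller.

    def expected_utility(node, i):
        """pi^delta(h, z)-weighted utility of the subtree below node h for player i."""
        if "utility" in node:                           # leaf node z: return mu_i(z)
            return node["utility"][i]
        return sum(p * expected_utility(child, i)
                   for p, child in zip(node["strategy"], node["children"]))

    def counterfactual_value(node, i, reach_others):
        """v_i(delta, h): opponents' reach probability times expected utility below h."""
        return reach_others * expected_utility(node, i)

    # Toy example: at h, player 0 plays left (prob 0.7, utility +1) or right
    # (prob 0.3, utility -2); the other players reach h with probability 0.5.
    h = {"player": 0,
         "strategy": [0.7, 0.3],
         "children": [{"utility": [1.0, -1.0]}, {"utility": [-2.0, 2.0]}]}
    print(counterfactual_value(h, 0, reach_others=0.5))  # 0.5 * (0.7*1 - 0.3*2) = 0.05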
In the virtual regret minimization algorithm, the virtual regret value r(h, a) of not taking action a at non-leaf node h is:

$$r(h, a) = v_i(\delta_{I \to a}, h) - v_i(\delta, h)$$

wherein δ_{I→a} denotes the strategy identical to δ except that action a is always taken at information set I, and v_i(δ_{I→a}, h) is the virtual value obtained by player i at non-leaf node h of the game tree under this strategy.
The virtual regret value r(I, a) of not taking action a on the information partition set I is:

$$r(I, a) = \sum_{h \in I} r(h, a)$$

wherein I is an information partition set; each component I_i of I represents a set of decision nodes of player i, i.e., for non-leaf nodes h, I_i ⊆ H, where H is the set of all non-leaf nodes; and P(h) = i indicates that the player acting at node h is player i.
The cumulative virtual regret value over T iterations, R^T(I, a), is:

$$R^T(I, a) = \sum_{t=1}^{T} r^t(I, a)$$

wherein r^t(I, a) is the virtual regret value obtained in the t-th iteration.
In step 3), the behavior strategy of the current AI is obtained through the regret matching mechanism of the virtual regret minimization algorithm:

$$\delta^{T+1}(I, a) = \begin{cases} \dfrac{R^{T,+}(I, a)}{\sum_{a' \in A(I)} R^{T,+}(I, a')}, & \text{if } \sum_{a' \in A(I)} R^{T,+}(I, a') > 0 \\ \dfrac{1}{|A(I)|}, & \text{otherwise} \end{cases}$$

wherein δ^{T+1}(I, a) is the probability of executing action a in the AI behavior strategy at iteration T+1, R^{T,+}(I, a) is the cumulative virtual regret of executing action a over T iterations, with the superscript + indicating that a negative virtual regret is replaced by 0, and A(I) is the set of actions available at information set I (when no action has positive cumulative regret, the strategy defaults to the uniform distribution over A(I), the standard regret matching convention).
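A short NumPy sketch of this update (the action set {fold, call, raise} and the sample regret values are illustrative only):

    import numpy as np

    def regret_matching(cum_regret: np.ndarray) -> np.ndarray:
        """Next strategy from cumulative regrets R^T(I, a) by regret matching."""
        positive = np.maximum(cum_regret, 0.0)        # the "+" superscript
        total = positive.sum()
        if total > 0:
            return positive / total                   # proportional to positive regret
        return np.full_like(cum_regret, 1.0 / len(cum_regret))  # uniform fallback

    # Accumulate per-iteration regrets r^t(I, a) into R^T(I, a), then update.
    cum_regret = np.zeros(3)                          # A(I) = {fold, call, raise}
    for r_t in [np.array([1.0, -0.5, 2.0]), np.array([-1.0, 0.5, 1.0])]:
        cum_regret += r_t                             # R^T = sum over t of r^t
    print(regret_matching(cum_regret))                # [0. 0. 1.]: only "raise" has positive regret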
The step 4) specifically comprises the following steps:
41) playing against real players with the AI behavior strategy obtained by the virtual regret minimization algorithm;
42) recording the hands and bets of both the AI and the players during play, and storing them as game logs.
Compared with the prior art, the invention has the following advantages:
firstly, the core virtual regret minimization algorithm of the invention is an improvement obtained by combining the Monte Carlo CFR algorithm with the pure CFR algorithm. Compared with existing CFR variants, it has advantages in computation speed and training time; in simulation, it also wins more game rewards in Texas poker than the other algorithms.
secondly, a Texas poker game uses 52 distinct cards, each game has 4 betting rounds, and the total game state space grows exponentially as the number of players varies from 2 to 10. The invention therefore uses a neural network model to carry out the abstraction of game information and to extract card-type features, which greatly simplifies the game state space and reduces the training time required for the subsequent virtual regret minimization algorithm to converge.
thirdly, the proposed virtual regret minimization algorithm applies not only to Texas poker but also to a large class of incomplete-information game problems, and therefore has strong generalization capability.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention.
Fig. 2 is a diagram of the model architecture of the present invention.
Fig. 3 shows a specific embodiment of the present invention.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments.
Examples
As shown in fig. 1, the present invention provides a Texas poker AI training method based on a virtual regret minimization algorithm, comprising the following steps:
1) obtaining private hand information and public game information, and performing game feature abstraction;
2) establishing a strategy prediction neural network model for a specific player based on that player's historical game logs;
3) training with the virtual regret minimization algorithm, using the player's strategy prediction neural network model as the opponent, to obtain an AI behavior strategy model;
4) having the AI play against game players in real time, and recording each game after it finishes.
The specific process of each step is as follows:
step 1) obtaining private hand information and game display information, and performing game feature abstraction, wherein the game feature abstraction specifically comprises the following steps:
11) Acquire the AI's hole cards and the other publicly displayed cards from the real-time game interface.
12) Map the AI's current cards and the cards shown on the table through the abstract state space of Texas poker with a 9-bucketing strategy to obtain the corresponding card force value (a sketch of this bucketing follows step 13).
13) Use the obtained AI card force value and the game interface information at that moment as the input of a neural network model whose output is the abstract features of the current game information. Specifically, the network connects a convolution layer with 3×3 kernels to a max pooling layer and a convolution layer with 5×5 kernels to a max pooling layer; the resulting feature matrix is flattened and concatenated with the AI's card force value into a vector that feeds two fully connected layers with 1326 and 169 nodes respectively.
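A hedged sketch of the 9-bucket mapping in step 12): the patent does not specify how the underlying hand strength is computed, so this sketch assumes a Monte Carlo estimate of the win rate against one random opponent hand, using an external 7-card evaluator passed in as `score` (any standard evaluator could fill this role); the strength is then quantized into 9 equal-width buckets.

    import random

    RANKS = "23456789TJQKA"
    SUITS = "cdhs"
    DECK = [r + s for r in RANKS for s in SUITS]

    def hand_strength(hand, board, score, trials=1000):
        """Estimated win rate of `hand` vs. one random opponent hand.

        `score` is an assumed 7-card evaluator: higher value = stronger hand.
        """
        wins = 0.0
        for _ in range(trials):
            rest = [c for c in DECK if c not in hand and c not in board]
            random.shuffle(rest)
            opp = rest[:2]                            # random opponent hole cards
            runout = rest[2:2 + (5 - len(board))]     # complete the board to 5 cards
            mine = score(hand + board + runout)
            theirs = score(opp + board + runout)
            wins += 1.0 if mine > theirs else 0.5 if mine == theirs else 0.0
        return wins / trials

    def bucket(strength: float, n_buckets: int = 9) -> int:
        """Quantize a strength in [0, 1] into one of 9 equal-width buckets (0..8)."""
        return min(int(strength * n_buckets), n_buckets - 1)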
And 2) establishing a strategy prediction neural network model aiming at a specific player based on the historical game log of the player.
The method specifically comprises the following steps:
21) Retrieve the player's past game records through the player's game ID.
22) Retrieve the trained strategy prediction neural network model corresponding to the player ID; the network architecture may be a mature neural network model based on a reinforcement learning algorithm such as Q-learning or policy gradient.
23) Predict the player's next action with the strategy prediction neural network model and, after the game ends, fine-tune the model parameters with the player's actual actions as new training data.
Step 3) In the virtual regret minimization algorithm, the virtual value at a non-leaf node h of the game tree is defined as:

$$v_i(\delta, h) = \sum_{z \in Z} \pi_{-i}^{\delta}(h)\, \pi^{\delta}(h, z)\, \mu_i(z)$$

wherein Z represents the set of all leaf nodes in the game tree, h represents a non-leaf node in the game tree, and μ_i(z) represents the utility value of player i at leaf node z.
Further, the specific calculation method of the different virtual regret values in the virtual regret minimization algorithm is as follows:
virtual regret value of not taking action a at node h:

$$r(h, a) = v_i(\delta_{I \to a}, h) - v_i(\delta, h)$$

virtual regret value of not taking action a on the information partition set I:

$$r(I, a) = \sum_{h \in I} r(h, a)$$

cumulative virtual regret value over T iterations:

$$R^T(I, a) = \sum_{t=1}^{T} r^t(I, a)$$
further, the action strategy of the current AI is obtained by utilizing a regret matching mechanism in the virtual regret minimization algorithm:
$$\delta^{T+1}(I, a) = \begin{cases} \dfrac{R^{T,+}(I, a)}{\sum_{a' \in A(I)} R^{T,+}(I, a')}, & \text{if } \sum_{a' \in A(I)} R^{T,+}(I, a') > 0 \\ \dfrac{1}{|A(I)|}, & \text{otherwise} \end{cases}$$
and 4) obtaining an AI real-time game strategy file by using the AI decision model obtained in the step 3) and the game real-time information obtained in the step 1).
The method specifically comprises the following steps:
41) playing against real players with the AI behavior strategy obtained by the virtual regret minimization algorithm;
42) recording the hands and bets of both the AI and the players during play, and storing them as game logs.
By introducing abstract embeddings of game information, the invention integrates a regret matching mechanism and a local minimal regret value calculation method into the regret minimization algorithm. Compared with the prior art, the method improves both computational efficiency and the actual win rate.
While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A Texas poker AI training method based on a virtual regret minimization algorithm is characterized by comprising the following steps:
1) obtaining private hand information and public game information, and performing game feature abstraction;
2) establishing a strategy prediction neural network model for a player based on the player's historical game logs;
3) training with the virtual regret minimization algorithm, using the player's strategy prediction neural network model as the opponent, to obtain an AI behavior strategy;
4) playing against game players in real time with the trained AI behavior strategy, and recording each game after it finishes.
2. The Texas poker AI training method based on the virtual regret minimization algorithm as claimed in claim 1, wherein the step 1) specifically comprises the following steps:
11) acquiring the AI's hole cards and the publicly displayed community cards from the real-time game interface;
12) mapping the AI's hole cards and the community cards through the abstract state space of Texas poker with a 9-bucketing strategy to obtain a corresponding card force value;
13) constructing a strategy prediction neural network model that takes the obtained card force value and the game interface information at that moment as input and outputs the abstract features of the current game information, where the game interface information comprises the AI's current hole cards, the publicly displayed community cards, and the current bets of all players in the game.
3. The Texas poker AI training method based on the virtual regret minimization algorithm as claimed in claim 2, wherein in step 13) the strategy prediction neural network model consists of a convolution layer with 3×3 kernels, a max pooling layer, a convolution layer with 5×5 kernels, and a max pooling layer connected in sequence; the resulting feature matrix is flattened and concatenated with the AI's card force value into a vector that feeds two fully connected layers, whose node counts are 1326 and 169 respectively.
4. The Texas poker AI training method based on the virtual regret minimization algorithm as claimed in claim 1, wherein the step 2) specifically comprises the following steps:
21) obtaining the player's past game records through the player's game ID;
22) obtaining a trained strategy prediction neural network model corresponding to the player ID, where the network architecture may be chosen from existing neural network models based on reinforcement learning algorithms such as Q-learning or policy gradient;
23) predicting the player's next action with the strategy prediction neural network model and, after the game ends, fine-tuning the model parameters with the player's actual actions as new training data.
5. The Texas poker AI training method based on the virtual regret minimization algorithm as claimed in claim 1, wherein in the step 3), the virtual value at the non-leaf node h of the game tree is defined as:

$$v_i(\delta, h) = \sum_{z \in Z} \pi_{-i}^{\delta}(h)\, \pi^{\delta}(h, z)\, \mu_i(z)$$

wherein Z is the set of all leaf nodes in the game tree, h is a non-leaf node in the game tree, μ_i(z) is the utility value of player i at leaf node z, δ is the strategy currently used by the players, π_{-i}^δ(h) is the probability that play reaches node h according to the other players' strategies when player i uses strategy δ, and π^δ(h, z) is the probability that player i reaches leaf node z from node h according to strategy δ.
6. The Texas poker AI training method based on the virtual regret minimization algorithm of claim 5, wherein in the virtual regret minimization algorithm, the expression of the virtual regret value r(h, a) of not taking action a at non-leaf node h is:

$$r(h, a) = v_i(\delta_{I \to a}, h) - v_i(\delta, h)$$

wherein δ_{I→a} denotes the strategy identical to δ except that action a is always taken at information set I, and v_i(δ_{I→a}, h) is the virtual value obtained by player i at non-leaf node h of the game tree under this strategy.
7. The Texas poker AI training method based on the virtual regret minimization algorithm according to claim 6, wherein the expression of the virtual regret value r(I, a) of not taking action a on the information partition set I is:

$$r(I, a) = \sum_{h \in I} r(h, a)$$

wherein I is an information partition set; each component I_i of I represents a set of decision nodes of player i, i.e., for non-leaf nodes h, I_i ⊆ H, where H is the set of all non-leaf nodes; and P(h) = i indicates that the player acting at node h is player i.
8. The Texas poker AI training method based on the virtual regret minimization algorithm of claim 7, wherein the cumulative virtual regret value over T iterations, R^T(I, a), is:

$$R^T(I, a) = \sum_{t=1}^{T} r^t(I, a)$$

wherein r^t(I, a) is the virtual regret value obtained in the t-th iteration.
9. The Texas poker AI training method based on the virtual regret minimization algorithm as claimed in claim 8, wherein in the step 3), the behavior strategy of the current AI is obtained through the regret matching mechanism of the virtual regret minimization algorithm:

$$\delta^{T+1}(I, a) = \begin{cases} \dfrac{R^{T,+}(I, a)}{\sum_{a' \in A(I)} R^{T,+}(I, a')}, & \text{if } \sum_{a' \in A(I)} R^{T,+}(I, a') > 0 \\ \dfrac{1}{|A(I)|}, & \text{otherwise} \end{cases}$$

wherein δ^{T+1}(I, a) is the probability of executing action a in the AI behavior strategy at iteration T+1, R^{T,+}(I, a) is the cumulative virtual regret of executing action a over T iterations, with the superscript + indicating that a negative virtual regret is replaced by 0, and A(I) is the set of actions available at information set I.
10. The Texas poker AI training method based on the virtual regret minimization algorithm as claimed in claim 1, wherein the step 4) comprises the following steps:
41) playing against real players with the AI behavior strategy obtained by the virtual regret minimization algorithm;
42) recording the hands and bets of both the AI and the players during play, and storing them as game logs.
CN202110048898.XA 2021-01-14 2021-01-14 Texas poker AI training method based on virtual regret minimization algorithm Pending CN112691383A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110048898.XA CN112691383A (en) 2021-01-14 2021-01-14 Texas poker AI training method based on virtual regret minimization algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110048898.XA CN112691383A (en) 2021-01-14 2021-01-14 Texas poker AI training method based on virtual regret minimization algorithm

Publications (1)

Publication Number Publication Date
CN112691383A true CN112691383A (en) 2021-04-23

Family

ID=75514706

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110048898.XA Pending CN112691383A (en) 2021-01-14 2021-01-14 Texas poker AI training method based on virtual regret minimization algorithm

Country Status (1)

Country Link
CN (1) CN112691383A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113448994A * 2021-07-07 2021-09-28 南京航空航天大学 Continuous regret minimization query method based on core set
CN116028817A (en) * 2023-01-13 2023-04-28 哈尔滨工业大学(深圳) CFR strategy solving method based on single evaluation value network and related equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080248849A1 (en) * 2007-04-05 2008-10-09 Lutnick Howard W Sorting Games of Chance
CN110826717A (en) * 2019-11-12 2020-02-21 腾讯科技(深圳)有限公司 Game service execution method, device, equipment and medium based on artificial intelligence
CN111738294A (en) * 2020-05-21 2020-10-02 深圳海普参数科技有限公司 AI model training method, use method, computer device and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080248849A1 (en) * 2007-04-05 2008-10-09 Lutnick Howard W Sorting Games of Chance
CN110826717A (en) * 2019-11-12 2020-02-21 腾讯科技(深圳)有限公司 Game service execution method, device, equipment and medium based on artificial intelligence
CN111738294A (en) * 2020-05-21 2020-10-02 深圳海普参数科技有限公司 AI model training method, use method, computer device and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
滕雯娟 (Teng Wenjuan): "Research on Texas Hold'em machine gaming based on the virtual regret minimization algorithm", China Excellent Master's Theses Full-text Database (Information Science and Technology) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113448994A * 2021-07-07 2021-09-28 南京航空航天大学 Continuous regret minimization query method based on core set
CN116028817A (en) * 2023-01-13 2023-04-28 哈尔滨工业大学(深圳) CFR strategy solving method based on single evaluation value network and related equipment

Similar Documents

Publication Publication Date Title
CN111291890B (en) Game strategy optimization method, system and storage medium
CN111353582B (en) Particle swarm algorithm-based distributed deep learning parameter updating method
CN109657780A (en) A kind of model compression method based on beta pruning sequence Active Learning
CN112691383A (en) Texas poker AI training method based on virtual regret minimization algorithm
CN113688977B (en) Human-computer symbiotic reinforcement learning method and device oriented to countermeasure task, computing equipment and storage medium
CN111242841A (en) Image background style migration method based on semantic segmentation and deep learning
CN112926744A (en) Incomplete information game method and system based on reinforcement learning and electronic equipment
Gruslys et al. The advantage regret-matching actor-critic
CN114048834B (en) Continuous reinforcement learning non-complete information game method and device based on after-the-fact review and progressive expansion
Borovikov et al. Winning isn’t everything: Training agents to playtest modern games
CN108970119A (en) The adaptive game system strategic planning method of difficulty
CN103559363A (en) Method for calculating optimum response strategy in imperfect information extensive game
CN111667043B (en) Chess game playing method, system, terminal and storage medium
CN115033878A (en) Rapid self-game reinforcement learning method and device, computer equipment and storage medium
CN110727870A (en) Novel single-tree Monte Carlo search method for sequential synchronous game
CN111310918A (en) Data processing method and device, computer equipment and storage medium
CN111330255B (en) Amazon chess-calling generation method based on deep convolutional neural network
CN113230650A (en) Data processing method and device and computer readable storage medium
CN111617479B (en) Acceleration method and system of game artificial intelligence system
PRICOPE A view on deep reinforcement learning in imperfect information games
CN114004359A (en) Mahjong-to-custom-cut prediction method and device, storage medium and equipment
CN111178541B (en) Game artificial intelligence system and performance improving system and method thereof
ZHANG et al. A Texas Hold’em decision model based on Reinforcement Learning
CN117648585B (en) Intelligent decision model generalization method and device based on task similarity
Orenbas et al. Analysing the Lottery Ticket Hypothesis on Face Recognition for Structured and Unstructured Pruning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210423
