CN112691383A - Texas poker AI training method based on virtual regret minimization algorithm - Google Patents

Texas poker AI training method based on virtual regret minimization algorithm

Info

Publication number
CN112691383A
Authority
CN
China
Prior art keywords
game
virtual
player
regret
minimization algorithm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110048898.XA
Other languages
Chinese (zh)
Inventor
张轶飞
程帆
张冬梅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN202110048898.XA priority Critical patent/CN112691383A/en
Publication of CN112691383A publication Critical patent/CN112691383A/en
Pending legal-status Critical Current

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a Texas poker AI training method based on a virtual regret minimization algorithm, which comprises the following steps: 1) obtaining private hand information and public game information, and performing game feature abstraction; 2) establishing a strategy prediction neural network model for a player based on the player's historical game logs; 3) training with the virtual regret minimization algorithm, using the player's strategy prediction neural network model as the opponent, to obtain an AI behavior strategy; 4) playing against game players in real time with the trained AI behavior strategy, and recording each game after it finishes. Compared with the prior art, the invention introduces abstract embeddings of game information and integrates a regret matching mechanism and a local minimal regret value calculation method into the regret minimization algorithm, improving both computational efficiency and the actual win rate.

Description

Texas poker AI training method based on virtual regret minimization algorithm
Technical Field
The invention relates to the field of machine gaming in artificial intelligence, and in particular to a Texas poker AI training method based on a virtual regret minimization algorithm (known in the literature as counterfactual regret minimization, CFR).
Background
Machine gaming has long been an important research problem in the field of artificial intelligence and an important means of gauging its level of development. In recent years, with the development of artificial intelligence, and of deep learning in particular, many traditional machine game problems have been solved; notably, algorithms represented by AlphaGo, which combine deep reinforcement learning with Monte Carlo game tree search, have achieved remarkable results on perfect-information machine games. Many problems in incomplete-information machine games, however, remain unsolved. As a complex and typical incomplete-information machine game problem, the research and realization of an efficient poker AI algorithm is of great significance both in theory and in application.
The regret minimization algorithm is currently an effective algorithm for solving normal-form games. During AI self-play and AI-versus-human play, it absorbs the lessons of past decision errors, and by computing the minimum regret over the whole game tree and iterating continuously it trains the AI toward the best game strategy, so that the actions the AI takes in the future carry the minimum expected regret. In extensive-form game problems, however, the game state space grows exponentially, and the traditional regret minimization algorithm would consume enormous computing resources to compute the overall minimum regret of a game tree at such a scale, which is impractical. Meanwhile, some branches of the game tree occur with very small probability in practice, and computing their regrets wastes computing resources. The algorithm is therefore inefficient and of limited effect in incomplete-information extensive-form games such as Texas poker.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a Texas poker AI training method based on a virtual regret minimization algorithm.
The purpose of the invention can be realized by the following technical scheme:
a Texas poker AI training method based on a virtual regret minimization algorithm comprises the following steps:
1) obtaining private hand information and game display information, and performing game feature abstraction;
2) establishing a strategy prediction neural network model aiming at a player based on a historical game log of the player;
3) training by adopting a virtual regret minimization algorithm and taking a strategy prediction neural network model of a player as an opponent and obtaining an AI behavior strategy;
4) and (4) carrying out real-time match-making with the game player by adopting the AI behavior strategy obtained by training, and recording the match after finishing.
The step 1) specifically comprises the following steps:
11) acquiring the AI's hole cards and the publicly displayed community cards from the real-time game interface;
12) mapping the AI's hole cards and the community cards through the abstract state space of Texas poker with a 9-bucketing strategy to obtain a corresponding card force value;
13) constructing a strategy prediction neural network model that takes the obtained card force value and the game interface information at that moment as input and outputs the abstract features of the current game information, where the game interface information comprises the AI's current hole cards, the publicly displayed community cards, and the current bets of all players in the game.
In step 13), the strategy prediction neural network model consists of a convolution layer with 3×3 kernels, a max pooling layer, a convolution layer with 5×5 kernels, and a max pooling layer connected in sequence; the resulting feature matrix is flattened and concatenated with the AI's card force value into a vector that feeds two fully connected layers, whose node counts are 1326 and 169 respectively.
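A minimal PyTorch sketch of this architecture follows. The patent does not specify the input encoding, channel widths, or pooling sizes, so this sketch assumes a 4 (suits) × 13 (ranks) card matrix with one channel each for the AI's hand, the community cards, and the betting state, 16/32 convolution channels, and 2×2 max pooling.

    import torch
    import torch.nn as nn

    class AbstractionNet(nn.Module):
        """Game-feature abstraction network of step 13) (assumed sizes)."""
        def __init__(self, in_channels: int = 3):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(in_channels, 16, kernel_size=3, padding=1),  # 3x3 convolution
                nn.MaxPool2d(2),                                       # max pooling
                nn.Conv2d(16, 32, kernel_size=5, padding=2),           # 5x5 convolution
                nn.MaxPool2d(2),                                       # max pooling
            )
            # Flattened feature map (32 x 1 x 3 for a 4 x 13 input) plus the
            # scalar card force value feeds the 1326- and 169-node layers.
            self.fc1 = nn.Linear(32 * 1 * 3 + 1, 1326)
            self.fc2 = nn.Linear(1326, 169)

        def forward(self, cards: torch.Tensor, strength: torch.Tensor) -> torch.Tensor:
            x = self.features(cards).flatten(1)        # "elongate" the matrix
            x = torch.cat([x, strength], dim=1)        # append the card force value
            return self.fc2(torch.relu(self.fc1(x)))   # abstract game features

    net = AbstractionNet()
    out = net(torch.zeros(1, 3, 4, 13), torch.zeros(1, 1))  # output shape: (1, 169)

The two layer widths match the structure of Texas poker: 1326 = C(52, 2) is the number of distinct two-card starting hands, and 169 is the number of strategically distinct starting-hand classes (13 pairs plus 78 suited and 78 offsuit combinations).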
The step 2) specifically comprises the following steps:
21) obtaining the player's past game records through the player's game ID;
22) obtaining a trained strategy prediction neural network model corresponding to the player ID, where the network architecture may be chosen from existing neural network models based on reinforcement learning algorithms such as Q-learning or policy gradient;
23) predicting the player's next action with the strategy prediction neural network model and, after the game ends, fine-tuning the model parameters with the player's actual actions as new training data (a sketch of this step follows).
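A minimal sketch of the fine-tuning in step 23), under stated assumptions: the opponent model maps a game-state tensor to action logits, the game log is a list of (state, action-index) pairs, and the optimizer and learning rate are illustrative choices not specified by the patent.

    import torch
    import torch.nn as nn

    def fine_tune(model: nn.Module, game_log, lr: float = 1e-4) -> nn.Module:
        """Fine-tune the opponent model on the player's actions observed in one game."""
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        model.train()
        for state, action in game_log:               # (state_tensor, action_index) pairs
            optimizer.zero_grad()
            logits = model(state.unsqueeze(0))       # predicted action distribution
            loss = loss_fn(logits, torch.tensor([action]))  # observed action as the label
            loss.backward()
            optimizer.step()                          # small parameter adjustment
        return model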
In step 3), the virtual value in the virtual regret minimization algorithm at a non-leaf node h of the game tree is defined as:

$$v_i(\delta, h) = \sum_{z \in Z} \pi_{-i}^{\delta}(h)\, \pi^{\delta}(h, z)\, \mu_i(z)$$

wherein Z is the set of all leaf nodes in the game tree, h is a non-leaf node in the game tree, μ_i(z) is the utility value of player i at leaf node z, δ is the strategy currently used by the players, π_{-i}^δ(h) is the probability that play reaches node h according to the other players' strategies when player i uses strategy δ, and π^δ(h, z) is the probability that player i reaches leaf node z from node h according to strategy δ.
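For illustration, a minimal Python sketch of this definition under assumed data structures (the tree encoding is hypothetical, not from the patent): a non-leaf node holds action probabilities and children, a leaf holds the utility vector, and the opponents' reach probability π_{-i}^δ(h) is supplied by the caller.

    def expected_utility(node, i):
        """pi^delta(h, z)-weighted utility of the subtree below node h for player i."""
        if "utility" in node:                           # leaf node z: return mu_i(z)
            return node["utility"][i]
        return sum(p * expected_utility(child, i)
                   for p, child in zip(node["strategy"], node["children"]))

    def counterfactual_value(node, i, reach_others):
        """v_i(delta, h): opponents' reach probability times expected utility below h."""
        return reach_others * expected_utility(node, i)

    # Toy example: at h, player 0 plays left (prob 0.7, utility +1) or right
    # (prob 0.3, utility -2); the other players reach h with probability 0.5.
    h = {"player": 0,
         "strategy": [0.7, 0.3],
         "children": [{"utility": [1.0, -1.0]}, {"utility": [-2.0, 2.0]}]}
    print(counterfactual_value(h, 0, reach_others=0.5))  # 0.5 * (0.7*1 - 0.3*2) = 0.05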
In the virtual regret minimization algorithm, the virtual regret value r(h, a) of not taking action a at non-leaf node h is:

$$r(h, a) = v_i(\delta_{I \to a}, h) - v_i(\delta, h)$$

wherein δ_{I→a} denotes the strategy identical to δ except that action a is always taken at information set I, and v_i(δ_{I→a}, h) is the virtual value obtained by player i at non-leaf node h of the game tree under this strategy.
The virtual regret value r(I, a) of not taking action a on the information partition set I is:

$$r(I, a) = \sum_{h \in I} r(h, a)$$

wherein I is an information partition set; each component I_i of I represents a set of decision nodes of player i, i.e., for non-leaf nodes h, I_i ⊆ H, where H is the set of all non-leaf nodes; and P(h) = i indicates that the player acting at node h is player i.
The cumulative virtual regret value over T iterations, R^T(I, a), is:

$$R^T(I, a) = \sum_{t=1}^{T} r^t(I, a)$$

wherein r^t(I, a) is the virtual regret value obtained in the t-th iteration.
In step 3), the behavior strategy of the current AI is obtained through the regret matching mechanism of the virtual regret minimization algorithm:

$$\delta^{T+1}(I, a) = \begin{cases} \dfrac{R^{T,+}(I, a)}{\sum_{a' \in A(I)} R^{T,+}(I, a')}, & \text{if } \sum_{a' \in A(I)} R^{T,+}(I, a') > 0 \\ \dfrac{1}{|A(I)|}, & \text{otherwise} \end{cases}$$

wherein δ^{T+1}(I, a) is the probability of executing action a in the AI behavior strategy at iteration T+1, R^{T,+}(I, a) is the cumulative virtual regret of executing action a over T iterations, with the superscript + indicating that a negative virtual regret is replaced by 0, and A(I) is the set of actions available at information set I (when no action has positive cumulative regret, the strategy defaults to the uniform distribution over A(I), the standard regret matching convention).
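A short NumPy sketch of this update (the action set {fold, call, raise} and the sample regret values are illustrative only):

    import numpy as np

    def regret_matching(cum_regret: np.ndarray) -> np.ndarray:
        """Next strategy from cumulative regrets R^T(I, a) by regret matching."""
        positive = np.maximum(cum_regret, 0.0)        # the "+" superscript
        total = positive.sum()
        if total > 0:
            return positive / total                   # proportional to positive regret
        return np.full_like(cum_regret, 1.0 / len(cum_regret))  # uniform fallback

    # Accumulate per-iteration regrets r^t(I, a) into R^T(I, a), then update.
    cum_regret = np.zeros(3)                          # A(I) = {fold, call, raise}
    for r_t in [np.array([1.0, -0.5, 2.0]), np.array([-1.0, 0.5, 1.0])]:
        cum_regret += r_t                             # R^T = sum over t of r^t
    print(regret_matching(cum_regret))                # [0. 0. 1.]: only "raise" has positive regret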
The step 4) specifically comprises the following steps:
41) playing against real players with the AI behavior strategy obtained by the virtual regret minimization algorithm;
42) recording the hands and bets of both the AI and the players during play, and storing them as game logs.
Compared with the prior art, the invention has the following advantages:
firstly, the core virtual regret minimization algorithm of the invention is an improvement obtained by combining the Monte Carlo CFR algorithm with the pure CFR algorithm. Compared with existing CFR variants, it has advantages in computation speed and training time; in simulation, it also wins more game rewards in Texas poker than the other algorithms.
secondly, a Texas poker game uses 52 distinct cards, each game has 4 betting rounds, and the total game state space grows exponentially as the number of players varies from 2 to 10. The invention therefore uses a neural network model to carry out the abstraction of game information and to extract card-type features, which greatly simplifies the game state space and reduces the training time required for the subsequent virtual regret minimization algorithm to converge.
thirdly, the proposed virtual regret minimization algorithm applies not only to Texas poker but also to a large class of incomplete-information game problems, and therefore has strong generalization capability.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention.
Fig. 2 is a diagram of the model architecture of the present invention.
Fig. 3 shows a specific embodiment of the present invention.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments.
Examples
As shown in fig. 1, the present invention provides a Texas poker AI training method based on a virtual regret minimization algorithm, comprising the following steps:
1) obtaining private hand information and public game information, and performing game feature abstraction;
2) establishing a strategy prediction neural network model for a specific player based on that player's historical game logs;
3) training with the virtual regret minimization algorithm, using the player's strategy prediction neural network model as the opponent, to obtain an AI behavior strategy model;
4) having the AI play against game players in real time, and recording each game after it finishes.
The specific process of each step is as follows:
step 1) obtaining private hand information and game display information, and performing game feature abstraction, wherein the game feature abstraction specifically comprises the following steps:
11) Acquire the AI's hole cards and the other publicly displayed cards from the real-time game interface.
12) Map the AI's current cards and the cards shown on the table through the abstract state space of Texas poker with a 9-bucketing strategy to obtain the corresponding card force value (a sketch of this bucketing follows step 13).
13) Use the obtained AI card force value and the game interface information at that moment as the input of a neural network model whose output is the abstract features of the current game information. Specifically, the network connects a convolution layer with 3×3 kernels to a max pooling layer and a convolution layer with 5×5 kernels to a max pooling layer; the resulting feature matrix is flattened and concatenated with the AI's card force value into a vector that feeds two fully connected layers with 1326 and 169 nodes respectively.
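A hedged sketch of the 9-bucket mapping in step 12): the patent does not specify how the underlying hand strength is computed, so this sketch assumes a Monte Carlo estimate of the win rate against one random opponent hand, using an external 7-card evaluator passed in as `score` (any standard evaluator could fill this role); the strength is then quantized into 9 equal-width buckets.

    import random

    RANKS = "23456789TJQKA"
    SUITS = "cdhs"
    DECK = [r + s for r in RANKS for s in SUITS]

    def hand_strength(hand, board, score, trials=1000):
        """Estimated win rate of `hand` vs. one random opponent hand.

        `score` is an assumed 7-card evaluator: higher value = stronger hand.
        """
        wins = 0.0
        for _ in range(trials):
            rest = [c for c in DECK if c not in hand and c not in board]
            random.shuffle(rest)
            opp = rest[:2]                            # random opponent hole cards
            runout = rest[2:2 + (5 - len(board))]     # complete the board to 5 cards
            mine = score(hand + board + runout)
            theirs = score(opp + board + runout)
            wins += 1.0 if mine > theirs else 0.5 if mine == theirs else 0.0
        return wins / trials

    def bucket(strength: float, n_buckets: int = 9) -> int:
        """Quantize a strength in [0, 1] into one of 9 equal-width buckets (0..8)."""
        return min(int(strength * n_buckets), n_buckets - 1)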
And 2) establishing a strategy prediction neural network model aiming at a specific player based on the historical game log of the player.
The method specifically comprises the following steps:
21) Retrieve the player's past game records through the player's game ID.
22) Retrieve the trained strategy prediction neural network model corresponding to the player ID; the network architecture may be a mature neural network model based on a reinforcement learning algorithm such as Q-learning or policy gradient.
23) Predict the player's next action with the strategy prediction neural network model and, after the game ends, fine-tune the model parameters with the player's actual actions as new training data.
Step 3) In the virtual regret minimization algorithm, the virtual value at a non-leaf node h of the game tree is defined as:

$$v_i(\delta, h) = \sum_{z \in Z} \pi_{-i}^{\delta}(h)\, \pi^{\delta}(h, z)\, \mu_i(z)$$

wherein Z represents the set of all leaf nodes in the game tree, h represents a non-leaf node in the game tree, and μ_i(z) represents the utility value of player i at leaf node z.
Further, the specific calculation method of the different virtual regret values in the virtual regret minimization algorithm is as follows:
virtual regret value of not taking action a at node h:

$$r(h, a) = v_i(\delta_{I \to a}, h) - v_i(\delta, h)$$

virtual regret value of not taking action a on the information partition set I:

$$r(I, a) = \sum_{h \in I} r(h, a)$$

cumulative virtual regret value over T iterations:

$$R^T(I, a) = \sum_{t=1}^{T} r^t(I, a)$$
further, the action strategy of the current AI is obtained by utilizing a regret matching mechanism in the virtual regret minimization algorithm:
$$\delta^{T+1}(I, a) = \begin{cases} \dfrac{R^{T,+}(I, a)}{\sum_{a' \in A(I)} R^{T,+}(I, a')}, & \text{if } \sum_{a' \in A(I)} R^{T,+}(I, a') > 0 \\ \dfrac{1}{|A(I)|}, & \text{otherwise} \end{cases}$$
and 4) obtaining an AI real-time game strategy file by using the AI decision model obtained in the step 3) and the game real-time information obtained in the step 1).
The method specifically comprises the following steps:
41) playing against real players with the AI behavior strategy obtained by the virtual regret minimization algorithm;
42) recording the hands and bets of both the AI and the players during play, and storing them as game logs.
By introducing abstract embeddings of game information, the invention integrates a regret matching mechanism and a local minimal regret value calculation method into the regret minimization algorithm. Compared with the prior art, the method improves both computational efficiency and the actual win rate.
While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A Texas poker AI training method based on a virtual regret minimization algorithm is characterized by comprising the following steps:
1) obtaining private hand information and public game information, and performing game feature abstraction;
2) establishing a strategy prediction neural network model for a player based on the player's historical game logs;
3) training with the virtual regret minimization algorithm, using the player's strategy prediction neural network model as the opponent, to obtain an AI behavior strategy;
4) playing against game players in real time with the trained AI behavior strategy, and recording each game after it finishes.
2. The Texas poker AI training method based on the virtual regret minimization algorithm as claimed in claim 1, wherein the step 1) specifically comprises the following steps:
11) acquiring the AI's hole cards and the publicly displayed community cards from the real-time game interface;
12) mapping the AI's hole cards and the community cards through the abstract state space of Texas poker with a 9-bucketing strategy to obtain a corresponding card force value;
13) constructing a strategy prediction neural network model that takes the obtained card force value and the game interface information at that moment as input and outputs the abstract features of the current game information, where the game interface information comprises the AI's current hole cards, the publicly displayed community cards, and the current bets of all players in the game.
3. The Texas poker AI training method based on the virtual regret minimization algorithm as claimed in claim 2, wherein in step 13) the strategy prediction neural network model consists of a convolution layer with 3×3 kernels, a max pooling layer, a convolution layer with 5×5 kernels, and a max pooling layer connected in sequence; the resulting feature matrix is flattened and concatenated with the AI's card force value into a vector that feeds two fully connected layers, whose node counts are 1326 and 169 respectively.
4. The Texas poker AI training method based on the virtual regret minimization algorithm as claimed in claim 1, wherein the step 2) specifically comprises the following steps:
21) obtaining the player's past game records through the player's game ID;
22) obtaining a trained strategy prediction neural network model corresponding to the player ID, where the network architecture may be chosen from existing neural network models based on reinforcement learning algorithms such as Q-learning or policy gradient;
23) predicting the player's next action with the strategy prediction neural network model and, after the game ends, fine-tuning the model parameters with the player's actual actions as new training data.
5. The Texas poker AI training method based on the virtual regret minimization algorithm as claimed in claim 1, wherein in the step 3), the virtual value at the non-leaf node h of the game tree is defined as:

$$v_i(\delta, h) = \sum_{z \in Z} \pi_{-i}^{\delta}(h)\, \pi^{\delta}(h, z)\, \mu_i(z)$$

wherein Z is the set of all leaf nodes in the game tree, h is a non-leaf node in the game tree, μ_i(z) is the utility value of player i at leaf node z, δ is the strategy currently used by the players, π_{-i}^δ(h) is the probability that play reaches node h according to the other players' strategies when player i uses strategy δ, and π^δ(h, z) is the probability that player i reaches leaf node z from node h according to strategy δ.
6. The Texas poker AI training method based on the virtual regret minimization algorithm of claim 5, wherein in the virtual regret minimization algorithm, the expression of the virtual regret value r(h, a) of not taking action a at non-leaf node h is:

$$r(h, a) = v_i(\delta_{I \to a}, h) - v_i(\delta, h)$$

wherein δ_{I→a} denotes the strategy identical to δ except that action a is always taken at information set I, and v_i(δ_{I→a}, h) is the virtual value obtained by player i at non-leaf node h of the game tree under this strategy.
7. The Texas poker AI training method based on the virtual regret minimization algorithm according to claim 6, wherein the expression of the virtual regret value r(I, a) of not taking action a on the information partition set I is:

$$r(I, a) = \sum_{h \in I} r(h, a)$$

wherein I is an information partition set; each component I_i of I represents a set of decision nodes of player i, i.e., for non-leaf nodes h, I_i ⊆ H, where H is the set of all non-leaf nodes; and P(h) = i indicates that the player acting at node h is player i.
8. The Texas poker AI training method based on the virtual regret minimization algorithm of claim 7, wherein the cumulative virtual regret value over T iterations, R^T(I, a), is:

$$R^T(I, a) = \sum_{t=1}^{T} r^t(I, a)$$

wherein r^t(I, a) is the virtual regret value obtained in the t-th iteration.
9. The Texas poker AI training method based on the virtual regret minimization algorithm as claimed in claim 8, wherein in the step 3), the behavior strategy of the current AI is obtained through the regret matching mechanism of the virtual regret minimization algorithm:

$$\delta^{T+1}(I, a) = \begin{cases} \dfrac{R^{T,+}(I, a)}{\sum_{a' \in A(I)} R^{T,+}(I, a')}, & \text{if } \sum_{a' \in A(I)} R^{T,+}(I, a') > 0 \\ \dfrac{1}{|A(I)|}, & \text{otherwise} \end{cases}$$

wherein δ^{T+1}(I, a) is the probability of executing action a in the AI behavior strategy at iteration T+1, R^{T,+}(I, a) is the cumulative virtual regret of executing action a over T iterations, with the superscript + indicating that a negative virtual regret is replaced by 0, and A(I) is the set of actions available at information set I.
10. The Texas poker AI training method based on the virtual regret minimization algorithm as claimed in claim 1, wherein the step 4) comprises the following steps:
41) playing against real players with the AI behavior strategy obtained by the virtual regret minimization algorithm;
42) recording the hands and bets of both the AI and the players during play, and storing them as game logs.
CN202110048898.XA 2021-01-14 2021-01-14 Texas poker AI training method based on virtual regret minimization algorithm Pending CN112691383A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110048898.XA CN112691383A (en) 2021-01-14 2021-01-14 Texas poker AI training method based on virtual regret minimization algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110048898.XA CN112691383A (en) 2021-01-14 2021-01-14 Texas poker AI training method based on virtual regret minimization algorithm

Publications (1)

Publication Number Publication Date
CN112691383A true CN112691383A (en) 2021-04-23

Family

ID=75514706

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110048898.XA Pending CN112691383A (en) 2021-01-14 2021-01-14 Texas poker AI training method based on virtual regret minimization algorithm

Country Status (1)

Country Link
CN (1) CN112691383A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113448994A * 2021-07-07 2021-09-28 南京航空航天大学 Continuous regret minimization query method based on core set
CN116028817A (en) * 2023-01-13 2023-04-28 哈尔滨工业大学(深圳) CFR strategy solving method based on single evaluation value network and related equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080248849A1 (en) * 2007-04-05 2008-10-09 Lutnick Howard W Sorting Games of Chance
CN110826717A (en) * 2019-11-12 2020-02-21 腾讯科技(深圳)有限公司 Game service execution method, device, equipment and medium based on artificial intelligence
CN111738294A (en) * 2020-05-21 2020-10-02 深圳海普参数科技有限公司 AI model training method, use method, computer device and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080248849A1 (en) * 2007-04-05 2008-10-09 Lutnick Howard W Sorting Games of Chance
CN110826717A (en) * 2019-11-12 2020-02-21 腾讯科技(深圳)有限公司 Game service execution method, device, equipment and medium based on artificial intelligence
CN111738294A (en) * 2020-05-21 2020-10-02 深圳海普参数科技有限公司 AI model training method, use method, computer device and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
滕雯娟 (Teng Wenjuan): "Research on Texas Hold'em machine gaming based on the virtual regret minimization algorithm", China Excellent Master's Theses Full-text Database (Information Science and Technology) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113448994A * 2021-07-07 2021-09-28 南京航空航天大学 Continuous regret minimization query method based on core set
CN116028817A (en) * 2023-01-13 2023-04-28 哈尔滨工业大学(深圳) CFR strategy solving method based on single evaluation value network and related equipment

Similar Documents

Publication Publication Date Title
CN111291890B (en) Game strategy optimization method, system and storage medium
CN111353582B (en) Particle swarm algorithm-based distributed deep learning parameter updating method
CN109657780A (en) A kind of model compression method based on beta pruning sequence Active Learning
CN112691383A (en) Texas poker AI training method based on virtual regret minimization algorithm
CN113688977B (en) Human-computer symbiotic reinforcement learning method and device oriented to countermeasure task, computing equipment and storage medium
CN111242841A (en) Image background style migration method based on semantic segmentation and deep learning
CN112926744A (en) Incomplete information game method and system based on reinforcement learning and electronic equipment
Gruslys et al. The advantage regret-matching actor-critic
CN114048834B (en) Continuous reinforcement learning non-complete information game method and device based on after-the-fact review and progressive expansion
Borovikov et al. Winning isn’t everything: Training agents to playtest modern games
CN108970119A (en) The adaptive game system strategic planning method of difficulty
CN103559363A (en) Method for calculating optimum response strategy in imperfect information extensive game
CN111667043B (en) Chess game playing method, system, terminal and storage medium
CN115033878A (en) Rapid self-game reinforcement learning method and device, computer equipment and storage medium
CN110727870A (en) Novel single-tree Monte Carlo search method for sequential synchronous game
CN111310918A (en) Data processing method and device, computer equipment and storage medium
CN111330255B (en) Amazon chess-calling generation method based on deep convolutional neural network
CN113230650A (en) Data processing method and device and computer readable storage medium
CN111617479B (en) Acceleration method and system of game artificial intelligence system
PRICOPE A view on deep reinforcement learning in imperfect information games
CN114004359A (en) Mahjong-to-custom-cut prediction method and device, storage medium and equipment
CN111178541B (en) Game artificial intelligence system and performance improving system and method thereof
ZHANG et al. A Texas Hold’em decision model based on Reinforcement Learning
CN117648585B (en) Intelligent decision model generalization method and device based on task similarity
Orenbas et al. Analysing the Lottery Ticket Hypothesis on Face Recognition for Structured and Unstructured Pruning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210423
