CN112691383A - Texas poker AI training method based on virtual regret minimization algorithm
- Publication number
- CN112691383A (application CN202110048898.XA)
- Authority
- CN
- China
- Prior art keywords
- game
- virtual
- player
- regret
- minimization algorithm
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention relates to a Texas poker AI training method based on a virtual regret minimization algorithm, which comprises the following steps: 1) obtaining private hand information and public game information, and performing game feature abstraction; 2) establishing a strategy prediction neural network model for a player based on that player's historical game logs; 3) training with the virtual regret minimization algorithm, using the player's strategy prediction neural network model as the opponent, to obtain an AI behavior strategy; 4) playing real-time matches against game players with the trained AI behavior strategy, and recording each match after it ends. Compared with the prior art, the invention introduces abstract embeddings of game information and integrates a regret matching mechanism and a local minimum regret value calculation method into the regret minimization algorithm, which improves computational efficiency and the actual win rate in play.
Description
Technical Field
The invention relates to the field of machine gaming in artificial intelligence, in particular to a Texas poker AI training method based on a virtual regret minimization algorithm.
Background
Machine gaming has long been an important research problem in artificial intelligence and serves as an important benchmark for the field's level of development. In recent years, with the development of artificial intelligence, and of deep learning in particular, many traditional machine gaming problems have been solved. Notably, algorithms combining deep reinforcement learning with Monte Carlo game tree search, represented by AlphaGo, have achieved remarkable results on perfect-information games. Imperfect-information games, however, still contain many unsolved problems. As a complex and typical imperfect-information game, Texas poker makes the research and realization of an efficient poker AI algorithm significant in both theory and application.
The regret minimization algorithm is currently an effective algorithm for solving normal-form games. During matches of AI against itself and of AI against humans, it absorbs the lessons of past decision errors: by computing the minimum regret over the whole game tree and iterating, it trains the AI toward the best game strategy, so that the decisions the AI makes in the future achieve minimum expected regret. In extensive-form games, however, the game state space grows exponentially, and the traditional regret minimization algorithm consumes enormous computing resources to compute a global minimum over a game tree of such scale, which is impractical. Moreover, some branches of the game tree occur with very small probability in practice, and computing regrets for them wastes computing resources, so the algorithm is inefficient and of limited effect on imperfect-information extensive-form problems such as Texas poker.
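The regret-matching update at the heart of such algorithms can be illustrated on a small normal-form game. The following sketch (illustrative only, not from the patent) trains a regret-matching player against a fixed, rock-heavy opponent in rock-paper-scissors:

```python
# Illustrative sketch (not the patent's implementation): regret matching,
# the per-decision building block that regret-minimization algorithms
# such as CFR apply at every information set of a game tree.
import random

ACTIONS = 3  # 0 = rock, 1 = paper, 2 = scissors

def payoff(a, b):
    """Payoff of action a against action b (+1 win, -1 loss, 0 tie)."""
    if a == b:
        return 0
    return 1 if (a - b) % 3 == 1 else -1

def strategy_from_regrets(regrets):
    """Regret matching: play each action in proportion to its positive regret."""
    positives = [max(r, 0.0) for r in regrets]
    total = sum(positives)
    if total > 0:
        return [p / total for p in positives]
    return [1.0 / ACTIONS] * ACTIONS  # no positive regret: play uniformly

def train(iterations, opponent=(0.4, 0.3, 0.3), seed=0):
    rng = random.Random(seed)
    regrets = [0.0] * ACTIONS
    strategy_sum = [0.0] * ACTIONS
    for _ in range(iterations):
        strat = strategy_from_regrets(regrets)
        my = rng.choices(range(ACTIONS), weights=strat)[0]
        opp = rng.choices(range(ACTIONS), weights=opponent)[0]
        # regret: how much better each alternative would have done
        for a in range(ACTIONS):
            regrets[a] += payoff(a, opp) - payoff(my, opp)
        for a in range(ACTIONS):
            strategy_sum[a] += strat[a]
    total = sum(strategy_sum)
    return [s / total for s in strategy_sum]  # average strategy

avg = train(20000)
print(avg)  # leans toward paper (index 1) against a rock-heavy opponent
```

Against a stationary opponent the average strategy approaches a best response; CFR applies this same update at every information set of the extensive-form game tree.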
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a Texas poker AI training method based on a virtual regret minimization algorithm.
The purpose of the invention can be realized by the following technical scheme:
a Texas poker AI training method based on a virtual regret minimization algorithm comprises the following steps:
1) obtaining private hand information and game display information, and performing game feature abstraction;
2) establishing a strategy prediction neural network model aiming at a player based on a historical game log of the player;
3) training by adopting a virtual regret minimization algorithm and taking a strategy prediction neural network model of a player as an opponent and obtaining an AI behavior strategy;
4) playing real-time matches against game players with the trained AI behavior strategy, and recording each match after it ends.
The step 1) specifically comprises the following steps:
11) acquiring the AI's hand cards and the public cards shown on the table from the real-time game interface;
12) mapping the AI's hand cards and the public cards, via the abstract state space of Texas poker and a 9-bucketing strategy, to a corresponding hand strength value;
13) constructing a strategy prediction neural network model that takes the obtained hand strength value and the current game interface information as input and outputs abstract features of the current game information, wherein the game interface information specifically comprises the AI's current hand cards, the public cards shown on the table, and the current betting information of all players in the game.
In step 13), the structure of the strategy prediction neural network model is, in sequence: a convolution layer with 3 x 3 kernels, a max pooling layer, a convolution layer with 5 x 5 kernels, and a max pooling layer. The resulting feature matrix is flattened and concatenated with the AI's hand strength value into a vector, which is fed to two fully-connected layers with 1326 and 169 nodes respectively.
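The 9-bucketing strategy of step 12) can be sketched as mapping a normalized hand strength value into one of 9 abstract buckets, so that the game state space seen downstream stays small. The equal-width bucket boundaries below are an assumption for illustration; the patent does not specify them:

```python
# Hypothetical sketch of 9-bucketing: hands are collapsed into 9 abstract
# strength buckets. Equal-width boundaries are an assumption, not from
# the patent.
def bucket_of(hand_strength, n_buckets=9):
    """Map a hand strength in [0, 1] to a bucket index in 0..n_buckets-1."""
    if not 0.0 <= hand_strength <= 1.0:
        raise ValueError("hand strength must lie in [0, 1]")
    return min(int(hand_strength * n_buckets), n_buckets - 1)

# A mediocre hand and a very strong hand land in different buckets.
print(bucket_of(0.42))  # -> 3
print(bucket_of(0.99))  # -> 8
```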
The step 2) specifically comprises the following steps:
21) obtaining the player's past match records through the player's game ID;
22) obtaining a trained strategy prediction neural network model corresponding to the player ID, the model architecture being selected from existing neural network models based on reinforcement learning algorithms such as Q-learning or policy gradient;
23) predicting the player's next action with the strategy prediction neural network model, and, after each game ends, fine-tuning the model's parameters using the player's actual actions as new training data.
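As a hedged illustration of step 23), the sketch below stands in a one-feature logistic model for the strategy prediction network: after each finished game, the player's observed actions become new training data and the model is nudged toward them. The feature (`pot_odds`) and all names are assumptions for illustration only:

```python
# Toy opponent model (stand-in for the patent's neural network): predicts
# whether a player calls, and is fine-tuned on each game's observed actions.
import math

class OpponentModel:
    def __init__(self):
        self.w, self.b = 0.0, 0.0

    def predict_call(self, pot_odds):
        """Probability that this player calls, given a pot-odds feature."""
        return 1.0 / (1.0 + math.exp(-(self.w * pot_odds + self.b)))

    def fine_tune(self, observations, lr=0.5):
        """One gradient pass over (pot_odds, did_call) pairs from the last game."""
        for x, y in observations:
            p = self.predict_call(x)
            self.w += lr * (y - p) * x   # logistic-loss gradient step
            self.b += lr * (y - p)

model = OpponentModel()
for _ in range(200):  # replaying the history of one loose-calling opponent
    model.fine_tune([(0.8, 1), (0.2, 0)])
print(round(model.predict_call(0.8), 2))  # much higher than for poor odds
```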
In step 3), in the virtual regret minimization algorithm, the virtual value at a non-leaf node h of the game tree is defined as:
v_i(δ, h) = Σ_{z∈Z} π_{-i}^δ(h) π^δ(h, z) u_i(z)
wherein Z is the set of all leaf nodes in the game tree, h is a non-leaf node in the game tree, u_i(z) is the utility of player i at leaf node z, δ is the strategy profile currently in use, π_{-i}^δ(h) is the probability that play reaches node h under the other players' strategies when player i plays δ, and π^δ(h, z) is the probability of reaching leaf node z from node h under δ.
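The virtual value just defined can be computed by a recursive walk of the game tree. The toy sketch below (matching pennies, not the patent's implementation) takes the opponents' reach probability π_{-i}(h) from the caller as a fixed factor and accumulates π(h, z)·u_i(z) over the leaves:

```python
# Sketch of v_i(delta, h) = sum_z pi_{-i}(h) * pi(h, z) * u_i(z) on a toy
# two-move game (matching pennies). Node layout is an assumption.
def virtual_value(node, strategies, player, opp_reach=1.0):
    """Virtual (counterfactual) value of `player` at `node`; `opp_reach` is pi_{-i}(h)."""
    if "utility" in node:  # leaf node z
        return opp_reach * node["utility"][player]
    strat = strategies[node["player"]][node["infoset"]]
    # every edge below h contributes its probability to pi^delta(h, z),
    # regardless of which player acts on it
    return sum(strat[a] * virtual_value(child, strategies, player, opp_reach)
               for a, child in node["children"].items())

leaf = lambda u0, u1: {"utility": (u0, u1)}
tree = {
    "player": 0, "infoset": "root",
    "children": {
        "H": {"player": 1, "infoset": "p1",
              "children": {"H": leaf(1, -1), "T": leaf(-1, 1)}},
        "T": {"player": 1, "infoset": "p1",
              "children": {"H": leaf(-1, 1), "T": leaf(1, -1)}},
    },
}
strategies = {0: {"root": {"H": 0.5, "T": 0.5}},
              1: {"p1": {"H": 0.5, "T": 0.5}}}
print(virtual_value(tree, strategies, player=0))  # -> 0.0 under uniform play
```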
In the virtual regret minimization algorithm, the expression of the virtual regret value r (h, a) for which no action a is taken at the non-leaf node h is:
r(h, a) = v_i(δ_{I→a}, h) - v_i(δ, h)
wherein v_i(δ_{I→a}, h) is the virtual value obtained by player i at node h when strategy δ is modified to always take action a at the information set I containing h, and v_i(δ, h) is the virtual value under δ itself.
The expression of the virtual regret value r(I, a) of not taking action a on an information set I is:
r(I, a) = Σ_{h∈I} r(h, a)
wherein I is the information partition, and each component I_i of I represents one set of decision nodes of player i; that is, for non-leaf nodes h, I_i ⊆ H, where H is the set of all non-leaf nodes, and P(h) = i indicates that the player acting at node h is player i.
The cumulative virtual regret value over T iterations is:
R^T(I, a) = Σ_{t=1}^{T} r^t(I, a)
wherein r^t(I, a) is the virtual regret value obtained in the t-th iteration.
In step 3), the behavior strategy of the current AI is obtained with the regret matching mechanism of the virtual regret minimization algorithm:
δ^{T+1}(I, a) = R^{T,+}(I, a) / Σ_{a'∈A(I)} R^{T,+}(I, a') if Σ_{a'∈A(I)} R^{T,+}(I, a') > 0, and δ^{T+1}(I, a) = 1/|A(I)| otherwise
wherein δ^{T+1}(I, a) is the probability of performing action a in the AI behavior strategy at the (T+1)-th iteration, R^{T,+}(I, a) is the cumulative virtual regret of action a after T iterations, the superscript + indicating that negative regret values are set to 0, and A(I) is the set of actions available at information set I.
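The regret matching rule above can be sketched directly: the next strategy plays each action with probability proportional to the positive part of its cumulative virtual regret, falling back to uniform when no regret is positive. The action names are illustrative:

```python
# Sketch of regret matching: map cumulative regrets R^T(I, a) to the
# next behavior strategy delta^{T+1}(I, a). Action names are assumptions.
def regret_matching(cumulative_regrets):
    """Map {action: R^T(I, a)} to the next strategy {action: probability}."""
    positive = {a: max(r, 0.0) for a, r in cumulative_regrets.items()}
    total = sum(positive.values())
    if total > 0.0:
        return {a: p / total for a, p in positive.items()}
    n = len(cumulative_regrets)
    return {a: 1.0 / n for a in cumulative_regrets}  # uniform fallback

print(regret_matching({"fold": -2.0, "call": 1.0, "raise": 3.0}))
# -> {'fold': 0.0, 'call': 0.25, 'raise': 0.75}
```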
The step 4) specifically comprises the following steps:
41) playing matches against real players with the AI behavior strategy obtained by the virtual regret minimization algorithm;
42) recording game information such as the AI's and players' hands and bets during play, and storing it as a game log.
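Steps 41)-42) can be sketched as an append-only JSON-lines game log; the field names below are assumptions for illustration, not taken from the patent:

```python
# Minimal game-log writer (hypothetical schema): one JSON record per
# finished hand, appended to a JSON-lines file.
import json, time

def log_hand(path, ai_hand, player_hand, bets, result):
    """Append one finished hand to a JSON-lines game log."""
    record = {
        "timestamp": time.time(),
        "ai_hand": ai_hand,          # e.g. ["As", "Kd"]
        "player_hand": player_hand,
        "bets": bets,                # per-round betting actions
        "result": result,            # chips won (+) or lost (-) by the AI
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

log_hand("game_log.jsonl", ["As", "Kd"], ["7h", "2c"],
         {"preflop": ["raise", "call"]}, result=40)
```

Logs in this form can be replayed later as the training data for the per-player strategy prediction model of step 2).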
Compared with the prior art, the invention has the following advantages:
First, the core virtual regret minimization algorithm of the invention improves on a combination of the Monte Carlo CFR algorithm and the pure CFR algorithm. Compared with existing CFR variants, it has advantages in computation speed and training time; in simulations, it also wins more game rewards in Texas poker than other algorithms.
Second, considering that Texas poker is played with 52 distinct cards, that each game has 4 betting rounds, and that the total game state space grows exponentially as the number of players varies from 2 to 10, the invention uses a neural network model to abstract the game information and extract card-type features. This greatly simplifies the game state space and reduces the training time required for the subsequent virtual regret minimization algorithm to converge.
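The fully-connected layer sizes quoted earlier (1326 and 169) plausibly correspond to standard Texas hold'em counts, though the patent does not say so explicitly: there are C(52, 2) = 1326 possible private two-card hands, which collapse to 169 strategically distinct starting hands once suits are abstracted (13 pairs, 78 suited and 78 offsuit rank combinations). A quick check:

```python
# Counting Texas hold'em starting hands: 1326 raw two-card combinations
# versus 169 suit-abstracted hand classes.
from math import comb

private_hands = comb(52, 2)     # all two-card combinations
pairs = 13                      # AA, KK, ..., 22
suited = comb(13, 2)            # e.g. AKs: 78 rank combinations
offsuit = comb(13, 2)           # e.g. AKo: another 78
print(private_hands, pairs + suited + offsuit)  # -> 1326 169
```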
The virtual regret minimization algorithm provided by the invention can be applied not only to Texas poker but also to a large class of imperfect-information game problems, and thus has strong generalization capability.
Drawings
FIG. 1 is a schematic flow chart of the method of the present invention.
Fig. 2 is a diagram of the model architecture of the present invention.
Fig. 3 shows a specific embodiment of the present invention.
Detailed Description
The invention is described in detail below with reference to the figures and specific embodiments.
Examples
As shown in fig. 1, the present invention provides a texas poker AI training method based on a virtual regret minimization algorithm, comprising the following steps:
1) obtaining private hand information and game display information, and performing game feature abstraction;
2) establishing a strategy prediction neural network model for a specific player based on a historical game log of the player;
3) training with the virtual regret minimization algorithm, using the player's strategy prediction neural network model as the opponent, to obtain an AI (artificial intelligence) counter-strategy model;
4) having the AI play real-time matches against game players, and recording each match after it ends.
The specific process of each step is as follows:
step 1) obtaining private hand information and game display information, and performing game feature abstraction, wherein the game feature abstraction specifically comprises the following steps:
11) and obtaining the card type condition of the hand and other displayed card types on the field according to the game real-time interface.
12) And mapping the information of the current card type and the field card type of the AI by using the abstract state space of the Texas poker through a 9-Buckering strategy to obtain a corresponding card force value.
13) The obtained AI card force value and the game interface information at the moment are used as the input of a neural network model, the output is the abstract characteristic of the current game information, the specific neural network model is that a convolution layer with the convolution kernel size of 3 x 3 is connected with a maximum pooling layer and a convolution layer with the convolution kernel size of 5 x 5 is connected with a maximum pooling layer, the obtained matrix elongation and the AI card force value at the moment are combined into a vector to be used as the input of two fully-connected layers, and the node number of the fully-connected layers is 1326 and 169 respectively.
Step 2) establishes a strategy prediction neural network model for a specific player based on that player's historical game logs.
The method specifically comprises the following steps:
21) finding the player's past match records through the player's game ID.
22) Finding a trained strategy prediction neural network model corresponding to the player ID; the model architecture may be a mature neural network model based on a reinforcement learning algorithm such as Q-learning or policy gradient.
23) Predicting the player's next action with the strategy prediction neural network model and, after each game ends, fine-tuning the model's parameters using the player's actual actions as new training data.
Step 3): in the virtual regret minimization algorithm, the virtual value at a non-leaf node h of the game tree is defined as:
v_i(δ, h) = Σ_{z∈Z} π_{-i}^δ(h) π^δ(h, z) u_i(z)
wherein Z is the set of all leaf nodes in the game tree, h is a non-leaf node in the game tree, and u_i(z) is the utility of player i at leaf node z.
Further, the specific calculation method of the different virtual regret values in the virtual regret minimization algorithm is as follows:
virtual regret value of no action a taken at node h:
r(h, a) = v_i(δ_{I→a}, h) - v_i(δ, h)
Virtual regret value of not taking action a on the information set I:
r(I, a) = Σ_{h∈I} r(h, a)
Virtual regret value accumulated over T iterations:
R^T(I, a) = Σ_{t=1}^{T} r^t(I, a)
Further, the behavior strategy of the current AI is obtained with the regret matching mechanism of the virtual regret minimization algorithm:
δ^{T+1}(I, a) = R^{T,+}(I, a) / Σ_{a'∈A(I)} R^{T,+}(I, a') if Σ_{a'∈A(I)} R^{T,+}(I, a') > 0, and δ^{T+1}(I, a) = 1/|A(I)| otherwise.
and 4) obtaining an AI real-time game strategy file by using the AI decision model obtained in the step 3) and the game real-time information obtained in the step 1).
The method specifically comprises the following steps:
41) playing matches against real players with the AI behavior strategy obtained by the virtual regret minimization algorithm;
42) recording game information such as the AI's and the players' hands and bets during play, and storing it as a game log.
By introducing abstract embeddings of game information, the invention integrates a regret matching mechanism and a local minimum regret value calculation method into the regret minimization algorithm. Compared with the prior art, this improves computational efficiency and the actual win rate in play.
While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (10)
1. A Texas poker AI training method based on a virtual regret minimization algorithm is characterized by comprising the following steps:
1) obtaining private hand information and game display information, and performing game feature abstraction;
2) establishing a strategy prediction neural network model aiming at a player based on a historical game log of the player;
3) training by adopting a virtual regret minimization algorithm and taking a strategy prediction neural network model of a player as an opponent and obtaining an AI behavior strategy;
4) playing real-time matches against game players with the trained AI behavior strategy, and recording each match after it ends.
2. The texas poker AI training method based on the virtual regret minimization algorithm as claimed in claim 1, wherein the step 1) specifically comprises the following steps:
11) acquiring the AI's hand cards and the public cards shown on the table from the real-time game interface;
12) mapping the AI's hand cards and the public cards, via the abstract state space of Texas poker and a 9-bucketing strategy, to a corresponding hand strength value;
13) constructing a strategy prediction neural network model that takes the obtained hand strength value and the current game interface information as input and outputs abstract features of the current game information, wherein the game interface information specifically comprises the AI's current hand cards, the public cards shown on the table, and the current betting information of all players in the game.
3. The Texas poker AI training method based on the virtual regret minimization algorithm as claimed in claim 1, wherein in step 13), the structure of the strategy prediction neural network model is, in sequence: a convolution layer with 3 x 3 kernels, a max pooling layer, a convolution layer with 5 x 5 kernels, and a max pooling layer; the resulting matrix is flattened and concatenated with the AI's hand strength value into a vector fed to two fully-connected layers with 1326 and 169 nodes respectively.
4. The texas poker AI training method based on the virtual regret minimization algorithm as claimed in claim 1, wherein the step 2) specifically comprises the following steps:
21) obtaining the player's past match records through the player's game ID;
22) obtaining a trained strategy prediction neural network model corresponding to the player ID, the model architecture being selected from existing neural network models based on reinforcement learning algorithms such as Q-learning or policy gradient;
23) predicting the player's next action with the strategy prediction neural network model, and, after each game ends, fine-tuning the model's parameters using the player's actual actions as new training data.
5. The texas poker AI training method based on the virtual regret minimization algorithm as claimed in claim 1, wherein in the step 3), the virtual value at the non-leaf node h of the game tree is defined as:
v_i(δ, h) = Σ_{z∈Z} π_{-i}^δ(h) π^δ(h, z) u_i(z)
wherein Z is the set of all leaf nodes in the game tree, h is a non-leaf node in the game tree, u_i(z) is the utility of player i at leaf node z, δ is the strategy profile currently in use, π_{-i}^δ(h) is the probability that play reaches node h under the other players' strategies when player i plays δ, and π^δ(h, z) is the probability of reaching leaf node z from node h under δ.
6. The texas poker AI training method based on the virtual regret minimization algorithm of claim 5, wherein in the virtual regret minimization algorithm, the expression of the virtual regret value r (h, a) for not taking the action a at the non-leaf node h is:
r(h, a) = v_i(δ_{I→a}, h) - v_i(δ, h)
wherein v_i(δ_{I→a}, h) is the virtual value obtained by player i at node h when strategy δ is modified to always take action a at the information set I containing h, and v_i(δ, h) is the virtual value under δ itself.
7. The virtual regret minimization algorithm-based texas poker AI training method according to claim 6, wherein the expression of the virtual regret value r (I, a) for which no action a is taken in the information segmentation set I is:
r(I, a) = Σ_{h∈I} r(h, a)
wherein I is the information partition, and each component I_i of I represents one set of decision nodes of player i; that is, for non-leaf nodes h, I_i ⊆ H, where H is the set of all non-leaf nodes, and P(h) = i indicates that the player acting at node h is player i.
8. The virtual regret minimization algorithm-based Texas poker AI training method of claim 7, wherein the expression of the virtual regret value accumulated over T iterations is:
R^T(I, a) = Σ_{t=1}^{T} r^t(I, a)
9. The Texas poker AI training method based on the virtual regret minimization algorithm as claimed in claim 8, wherein in step 3), the behavior strategy of the current AI is obtained with the regret matching mechanism of the virtual regret minimization algorithm:
δ^{T+1}(I, a) = R^{T,+}(I, a) / Σ_{a'∈A(I)} R^{T,+}(I, a') if Σ_{a'∈A(I)} R^{T,+}(I, a') > 0, and δ^{T+1}(I, a) = 1/|A(I)| otherwise
wherein δ^{T+1}(I, a) is the probability of performing action a in the AI behavior strategy at the (T+1)-th iteration, R^{T,+}(I, a) is the cumulative virtual regret of action a after T iterations, the superscript + indicating that negative regret values are set to 0, and A(I) is the set of actions available at information set I.
10. The texas poker AI training method based on the virtual regret minimization algorithm as claimed in claim 1, wherein the step 4) comprises the following steps:
41) playing matches against real players with the AI behavior strategy obtained by the virtual regret minimization algorithm;
42) recording game information such as the AI's and players' hands and bets during play, and storing it as a game log.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110048898.XA CN112691383A (en) | 2021-01-14 | 2021-01-14 | Texas poker AI training method based on virtual regret minimization algorithm |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112691383A true CN112691383A (en) | 2021-04-23 |
Family
ID=75514706
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110048898.XA Pending CN112691383A (en) | 2021-01-14 | 2021-01-14 | Texas poker AI training method based on virtual regret minimization algorithm |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112691383A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113448994A (en) * | 2021-07-07 | 2021-09-28 | 南京航空航天大学 | Continuous regrettage minimization query method based on core set |
CN116028817A (en) * | 2023-01-13 | 2023-04-28 | 哈尔滨工业大学(深圳) | CFR strategy solving method based on single evaluation value network and related equipment |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080248849A1 (en) * | 2007-04-05 | 2008-10-09 | Lutnick Howard W | Sorting Games of Chance |
CN110826717A (en) * | 2019-11-12 | 2020-02-21 | 腾讯科技(深圳)有限公司 | Game service execution method, device, equipment and medium based on artificial intelligence |
CN111738294A (en) * | 2020-05-21 | 2020-10-02 | 深圳海普参数科技有限公司 | AI model training method, use method, computer device and storage medium |
Non-Patent Citations (1)
Title |
---|
TENG Wenjuan: "Research on Texas Poker Machine Gaming Based on the Virtual Regret Minimization Algorithm", China Master's Theses Full-text Database, Information Science and Technology Series * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20210423 |