CN111905373A - Artificial intelligence decision method and system based on game theory and Nash equilibrium - Google Patents
- Publication number: CN111905373A (application CN202010744288.9A)
- Authority: CN (China)
- Prior art keywords: value, game, strategy, counterfactual, player
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- A63F13/67 — Generating or modifying game content adaptively or by learning from player actions, e.g. skill level adjustment
- A63F13/70 — Game security or game management aspects
- G06F18/23213 — Non-hierarchical clustering using statistics or function optimisation, with a fixed number of clusters, e.g. K-means clustering
- G06N3/047 — Probabilistic or stochastic neural networks
- G06N3/08 — Neural network learning methods
- A63F2300/60 — Methods for processing data by generating or executing the game program
- A63F2300/6027 — Adaptive systems learning from user actions, e.g. for skill level adjustment
Abstract
The invention provides an artificial intelligence decision method and system based on game theory and Nash equilibrium, wherein the artificial intelligence decision method comprises the following steps: step S1, dividing the cards into different buckets according to the strength of the hands; step S2, initializing an average strategy; step S3, sampling states in the game tree, sampling actions, and calculating the counterfactual profit value and the expected profit value of each state in information set I; step S4, calculating the counterfactual regret value, and updating the strategy value to train the neural network; step S5, judging whether the player has finished the iteration: if so, outputting the current average strategy value; otherwise, returning to step S3. The invention reduces the amount and time of calculation, realizes the updating of the strategy value through this series of steps, and judges whether the iteration is finished, thereby improving the calculation accuracy.
Description
Technical Field
The invention relates to a decision training method for computer chess-and-card games, in particular to an artificial intelligence decision method based on game theory and Nash equilibrium, and further to an artificial intelligence decision system adopting the artificial intelligence decision method based on game theory and Nash equilibrium.
Background
The poker-game AI Libratus (known in Chinese coverage as the "Cold Poker Master") was developed in 2017 by Professor Noam Brown and Professor Tuomas Sandholm of Carnegie Mellon University. Libratus uses the common Counterfactual Regret Minimization (CFR) algorithm for poker AI to find a Nash-equilibrium betting strategy, also known as GTO betting, in each betting round of each hand. The Libratus algorithm system is divided into three modules. (1) Abstracting the poker game, which reduces the complexity of the game through abstraction. (2) Subdividing the game scenario; this module finds which of the computed game scenarios the AI and the player are currently in. (3) Self-play training, which searches for and repairs strategy loopholes by analysing the data and strategies generated during self-play.
However, in the prior art, real-time calculation is needed for every possible situation of the poker game, so the amount of computation is too large, a large-scale server is needed, the calculation takes too long, and the requirement of real-time man-machine play cannot be met. In addition, the conventional Counterfactual Regret Minimization (CFR) algorithm cannot exploit the specific structure of the poker game to reduce the amount of computation, save calculation time and improve calculation accuracy.
Disclosure of Invention
The technical problem to be solved by the invention is to provide an artificial intelligence decision method based on game theory and Nash equilibrium, which can reduce the calculation amount and time and improve the calculation accuracy, and further provide an artificial intelligence decision system adopting the artificial intelligence decision method.
Therefore, the invention provides an artificial intelligence decision method based on game theory and Nash equilibrium, which comprises the following steps:
step S1, dividing the cards into different buckets according to the strength of the hands;
step S2, initializing an average strategy;
step S3, sampling states in a game tree, sampling actions, and calculating counterfactual income values and expected income values of all the states in an information set I, wherein the information set I is a set of public cards and public action information;
step S4, calculating the regret value of the counterfactual, and updating the strategy value to train the neural network;
and step S5, judging whether the player finishes the iteration, if so, outputting the current average strategy value, and if not, returning to the step S3.
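The control flow of steps S1-S5 can be sketched in code. The following is a minimal Python skeleton, not taken from the patent: the function names `sample_and_evaluate` and `update_regrets` are hypothetical placeholders for the sampling/valuation of step S3 and the regret-based update of step S4.

```python
def train_cfr(num_iterations, init_strategy, sample_and_evaluate, update_regrets):
    """Hypothetical skeleton of steps S2-S5: iterate sampling, value
    computation and regret-based strategy updates until the iteration
    budget is exhausted, then output the current average strategy."""
    avg_strategy = dict(init_strategy)           # step S2: initialise average strategy
    for t in range(1, num_iterations + 1):       # step S5: iteration counter
        sample = sample_and_evaluate(t)          # step S3: sample states/actions, get CFV and EV
        avg_strategy = update_regrets(avg_strategy, sample)  # step S4: regret update
    return avg_strategy                          # step S5: output average strategy
```

The bucketing of step S1 would run once before this loop to prepare the inputs of `sample_and_evaluate`.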
A further improvement of the present invention is that said step S1 of dividing the cards into different buckets according to the hand strength means iterating each hand to obtain the card type and the card score of the hand, and creating the feature vector corresponding to the hand based on the hand, the card type and the card score.
In a further improvement of the present invention, in the step S1, the hand is trained by a K-means clustering model according to the feature vector corresponding to the hand, and then the hand is divided into different buckets.
In a further improvement of the present invention, in the step S3, let Z be the set of all terminal histories in the game, where a terminal history is a sequence from the root node to a leaf node; for $z' \in Z$, $h \sqsubseteq z'$ denotes that the non-terminal history h is a prefix of the terminal history $z'$; $u_i(z')$ is player i's utility at the terminal history $z'$. The counterfactual profit value CFV at the non-terminal history h is calculated from the counterfactual probability, the probabilities under the strategy combination, and the utility value at the terminal history $z'$; the expected profit value EV in the current state is the product of the counterfactual profit value CFV and the arrival probability of the current state.
In a further improvement of the present invention, in the step S3, the formula
$v_i(\sigma, h) = \sum_{z' \in Z,\, h \sqsubseteq z'} \pi_{-i}^{\sigma}(h)\, \pi^{\sigma}(h, z')\, u_i(z')$
is used to calculate the counterfactual profit value CFV at the non-terminal history h, where $v_i(\sigma, h)$ is the counterfactual profit value CFV for player i reaching the non-terminal history h under the strategy combination $\sigma$; the counterfactual probability $\pi_{-i}^{\sigma}(h)$ is the probability of reaching the non-terminal history h under the strategy combination $\sigma$ without considering player i; and $\pi^{\sigma}(h, z')$ is the probability of reaching the terminal history $z'$, given that the non-terminal history h has been reached, under the strategy combination $\sigma$.
In a further improvement of the present invention, in the step S3, the formula
$u_i(\sigma, h) = \pi_i^{\sigma}(h)\, v_i(\sigma, h)$
is used to calculate the expected profit value EV in the current state, where $u_i(\sigma, h)$ is the expected profit value EV for player i reaching the non-terminal history h under the strategy combination $\sigma$, and $\pi_i^{\sigma}(h)$ is the probability that player i reaches the non-terminal history h under the strategy combination $\sigma$.
In a further development of the invention, in step S4, the formula $r(h, a) = v_i(\sigma_{I \to a}, h) - v_i(\sigma, h)$ is used to calculate the counterfactual regret value CFR of not taking action a at the non-terminal history h, where $r(h, a)$ is the counterfactual regret value of the non-terminal history h without action a, and $v_i(\sigma_{I \to a}, h)$ is the utility resulting from always performing action a in information set I at the non-terminal history h. The formula $r(I, a) = \sum_{h \in I} r(h, a)$ calculates the counterfactual regret of not taking action a in information set I, and the formula $R_i^{T}(I, a) = \sum_{t=1}^{T} r_i^{t}(I, a)$ calculates the cumulative regret, where $r_i^{t}(I, a)$ is player i's regret of not taking action a in information set I under the strategy combination $\sigma^{t}$.
In a further improvement of the present invention, in the step S4, the regret-matching formula
$\sigma_i^{T+1}(I, a) = \dfrac{R_i^{T,+}(I, a)}{\sum_{a' \in A(I)} R_i^{T,+}(I, a')}$ if $\sum_{a' \in A(I)} R_i^{T,+}(I, a') > 0$, and $\sigma_i^{T+1}(I, a) = \dfrac{1}{|A(I)|}$ otherwise,
is used to update the strategy value for training the neural network, where $R_i^{T,+}(I, a) = \max(R_i^{T}(I, a), 0)$, A(I) is the legal action set under information set I, and the denominator sums the positive cumulative regrets of all actions under information set I.
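The strategy update of step S4 can be illustrated with a short Python sketch (illustrative, not from the patent): each action's probability is its positive cumulative regret normalised by the sum of positive regrets, falling back to a uniform strategy when no regret is positive.

```python
def regret_matching(regrets):
    """Regret-matching update: probability proportional to positive
    cumulative regret; uniform if no action has positive regret."""
    positive = {a: max(r, 0.0) for a, r in regrets.items()}
    total = sum(positive.values())
    if total > 0:
        return {a: p / total for a, p in positive.items()}
    return {a: 1.0 / len(regrets) for a in regrets}
```

For example, with cumulative regrets {bet: 3, fold: 1, call: -2}, "bet" receives probability 3/4, "fold" 1/4, and "call" 0.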
In a further improvement of the present invention, in step S5, a game-iteration count is configured; completion of all configured iterations marks the termination of the iteration, and otherwise the iteration continues with the count incremented.
The invention also provides an artificial intelligence decision-making system based on the game theory and the Nash equilibrium, which adopts the artificial intelligence decision-making method based on the game theory and the Nash equilibrium and comprises the following steps:
the hand card dividing module is used for dividing the cards into different buckets according to the strength of the hand cards;
the initialization module is used for realizing an initialization average strategy;
the profit calculation module is used for sampling the state in the game tree, sampling the action based on the strategy value and calculating an expected profit value of the sampled action and a counterfactual profit value of the non-sampled action;
the anti-fact regret value calculation module calculates the anti-fact regret value and updates the strategy value to carry out neural network training;
and the iteration module is used for judging whether the player completes the iteration, outputting the current average strategy value if the player completes the iteration, and returning to the income calculation module if the player does not complete the iteration.
Compared with the prior art, the invention has the following beneficial effects. By calculating an expected profit value for the sampled action and a counterfactual profit value for the non-sampled actions, and then calculating the counterfactual regret value so as to update the strategy value for training the neural network, repeated recalculation is not needed each time: for part of the card types only the cached results need to be looked up and reused, which greatly reduces the amount of computation for the flop. The amount and time of calculation are therefore reduced; and because the strategy value is updated through this series of steps and the completion of the iteration is checked, the calculation accuracy is also improved.
Drawings
FIG. 1 is a schematic workflow diagram of one embodiment of the present invention;
fig. 2 is a schematic diagram of the principle of feature vectors in an embodiment of the present invention.
Detailed Description
Preferred embodiments of the present invention will be described in further detail below with reference to the accompanying drawings.
Before explaining the working process of the example, the parameters and abbreviations are defined. CFR+ is Counterfactual Regret Minimization Plus; CFR-D is Counterfactual Regret Minimization Decomposition; Adam stochastic gradient descent is Adam Gradient Descent; K-means clustering is K-means Clustering, KMC for short; the Nash equilibrium strategy is Nash Equilibrium Strategy, NES for short; the game-optimal strategy is Game Theory Optimal, GTO for short; the Counterfactual Regret Minimization algorithm is CFR for short; the game refers to an incomplete-information game, i.e. Incomplete-Information Games; the continual solving algorithm is Continual Re-solving; the deep counterfactual regret value network is Deep Counterfactual Regret Value Neural Network, DCFRVNN for short; the look-ahead tree is Forward-looking Game Tree, FGT for short; the sparsity of the look-ahead tree is Forward-looking Game Tree Sparsity, FGTS for short; and the bucket abstraction method is Bucket Abstraction, BA for short.
As shown in fig. 1, this example provides an artificial intelligence decision method based on game theory and nash equilibrium, which includes:
step S1, dividing the cards into different buckets according to the strength of the hands;
step S2, initializing an average strategy;
step S3, sampling states in a game tree, sampling actions, and calculating expected income values of the sampled actions and counterfactual income values of the actions which are not sampled;
step S4, calculating the regret value of the counterfactual, and updating the strategy value to train the neural network;
and step S5, judging whether the player finishes the iteration, if so, outputting the current average strategy value, and if not, returning to the step S3.
The game theory described in this embodiment refers to an incomplete-information game. FastNet is a poker-game artificial intelligence (AI) applied to incomplete-information games (Incomplete-Information Games), and is an innovative AI algorithm based on Continual Re-solving.
This example first describes how the continual solving algorithm (Continual Re-solving) is implemented by the incomplete-information game AI (FastNet) and how the deep counterfactual regret value network (DCFRVNN) is trained, which is the neural-network training of step S4.
The continual solving algorithm of FastNet can be applied to the Nash Equilibrium Strategy (NES) solution of general extensive-form games. Taking a poker game as an example, a Nash-equilibrium best strategy is solved in each round of each hand, thereby forming a Game Theory Optimal (GTO) strategy. By playing strictly according to the Nash equilibrium strategy, it can be mathematically proven that the other players can only lose if they deviate from the optimal equilibrium strategy.
The goal of solving the extensive-form game is to make a strategy for the player that minimizes the opponent's counterfactual regret value. FastNet's algorithm for solving the extensive-form game is a mixture of general CFR and CFR+ (Counterfactual Regret Minimization Plus): it uses a regret-matching algorithm similar to CFR+, but with uniform weighting and real-time updating, and it ignores the early CFR iterations when computing the final average strategy and average counterfactual values, so as to reduce the amount of computation, save calculation time and improve calculation accuracy.
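The CFR+-style accumulation mentioned above can be sketched as follows. This is an illustrative Python fragment, not the patent's implementation: the defining feature of CFR+ shown here is that cumulative regrets are floored at zero after each update, so early negative regret cannot suppress an action indefinitely.

```python
def cfr_plus_update(cum_regret, instant_regret):
    """CFR+-style accumulation: add the instantaneous counterfactual
    regret, then clip the cumulative regret at zero (the '+' in CFR+)."""
    actions = set(cum_regret) | set(instant_regret)
    return {a: max(cum_regret.get(a, 0.0) + instant_regret.get(a, 0.0), 0.0)
            for a in actions}
```

Under plain CFR the same update would omit the `max(..., 0.0)` clipping and keep negative accumulated regret.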
Meanwhile, FastNet uses a modification of the original CFR-D (Counterfactual Regret Minimization Decomposition) tool for the extensive-form game, so the calculation can be further optimized according to the situation of the game. Testing showed that while the Maxmargin tool can improve the poor strategy of the abstract agent towards the end of the game, the CFR-D tool performed better in early testing.
One of the main design goals of the FastNet implementation is to meet human-machine play requirements, that is, to let the computer play at the same speed as a human player using general-purpose commodity hardware and a single GPU. The look-ahead tree sparsity (FGTS) and the number of iterations of continual solving are the main attributes FastNet tunes to achieve this goal. These attributes are selected separately for each round of the poker game so that each round runs at a consistent speed. It should be noted that FastNet places no special requirements on the density of its look-ahead tree beyond the hard constraints of hardware and speed.
The look-ahead tree (FGT) used by FastNet is built from the actions the player can take, the response actions the opponent can take, and the actions any player can take in the remaining rounds. FastNet usually limits the depth of the look-ahead tree to the end of the current round, unless the remaining game can be solved exactly within that round, in which case the depth of the look-ahead tree extends to the end of the entire game. That is, the game tree in the step S3 is preferably the forward-looking game tree (FGT) used by FastNet.
In this example, in the pre-flop round (Preflop) and the flop round (Flop), FastNet uses a deep counterfactual regret value network trained in advance, because the amount of computation before the flop is very large: to complete the calculation, all 22,100 possible public-card combinations at the flop would need to be enumerated and each evaluated, so a pre-trained deep counterfactual regret value network effectively increases the data-processing speed. To speed up the pre-flop game, FastNet trains an additional auxiliary neural network in advance to estimate the expected values of the flop network over all possible flops (e.g. the expected profit value of the sampled action calculated in step S3) and applies it during the initial iterations of CFR. In the last iterations, which compute the counterfactual values of the average strategy, FastNet performs card enumeration and neural-network evaluation of the values.
In addition, FastNet caches the continual-solving calculations for each observed pre-flop card type and game context. When the same card type and game context occur again, only the cached results need to be looked up and reused instead of recalculated, which further increases the calculation speed. In the river round, after the last river card (River Card) we do not use the neural network but apply the continual solving algorithm until the end of the game. We use Bucket Abstraction (BA) for all actions of the river round (River); the bucket in step S1 refers to the bucket in this Bucket Abstraction (BA).
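The caching described above amounts to memoisation keyed by the observed situation. The following Python sketch is illustrative only; `solver` stands in for the (unspecified) continual-solving routine, and keying by card type and pot size is a simplifying assumption about what identifies a game context.

```python
# Hypothetical memoisation of the continual re-solving step: results are
# keyed by the observed pre-flop card type and game context, so a repeated
# situation is answered from the cache instead of being re-solved.
_solve_cache = {}

def solve_situation(card_type, pot_size, solver):
    key = (card_type, pot_size)
    if key not in _solve_cache:
        _solve_cache[key] = solver(card_type, pot_size)  # expensive re-solve
    return _solve_cache[key]                             # cached on repeat
```

On a cache hit no solving work is performed, which is the source of the speed-up claimed in the text.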
The details of the neural network used by FastNet are as follows. All networks were trained with the Torch7 library, using the Adam stochastic gradient descent (Adam Gradient Descent) algorithm to minimize the mean Huber loss over the counterfactual-value error. The minibatch size was 1,000; the learning rate was 0.001, reduced to 0.0001 after the first 200 epochs. The networks were trained for approximately 350 epochs, about two days on a single GPU, and the epoch with the lowest validation loss was selected.
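The two training hyperparameters stated above (Huber loss on the value error, and the stepped learning rate) can be written out explicitly. This is a plain-Python sketch under the stated settings, not the Torch7 code; the function names are illustrative.

```python
def huber_loss(pred, target, delta=1.0):
    """Huber loss on a single counterfactual-value error: quadratic for
    small errors, linear for large ones (delta=1.0 is an assumption)."""
    err = abs(pred - target)
    if err <= delta:
        return 0.5 * err * err
    return delta * (err - 0.5 * delta)

def learning_rate(epoch, base=0.001, reduced=0.0001, switch_epoch=200):
    """Step schedule from the text: 0.001 for the first 200 epochs,
    0.0001 afterwards."""
    return base if epoch <= switch_epoch else reduced
```

The Huber loss's linear tail makes training less sensitive to occasional large value errors than a squared loss would be.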
In the neural-network training performed in step S4 of this example, the neural network used by FastNet can adapt to the strategies of different human players. To increase versatility in dealing with players' strategies, this example maps the distribution over individual hands (combinations of public and private cards) into a distribution over buckets, i.e. the hand splitting implemented by step S1. The buckets are generated with an abstraction technique based on a clustering method that groups strategically similar hands by K-means clustering (KMC); the distance between buckets produced by K-means clustering can be viewed as a hand-strength comparison. For the flop network and the turn network, we use 1,000 clusters and map the original range to a distribution over these clusters as the first layer of the neural network. Since there are only 169 strategically distinct hand types before the flop, this bucketing step is not used in the auxiliary network, and the distribution over the distinct hands can be entered directly.
Unlike the prior art, this example is also implemented with a game model and an averaging strategy, where the game AI (the incomplete-information game AI, FastNet) uses two deep counterfactual regret value networks, one for the flop and one for the turn, plus the auxiliary neural network described above, which provides counterfactual values at the end of the flop. To train a deep counterfactual regret value network, this example generates random poker situations at the beginning of the turn. The context of each poker game is defined by the pot size, the ranges of actions both players can take, and the public cards that have been dealt. A complete game history is not necessary, since the pot and the ranges of possible actions suffice to represent the current game context.
The output of the deep counterfactual regret value network is a vector of counterfactual values, in which each component corresponds to one player. This example interprets the output value as a fraction of the pot size won, to accommodate the full variety of poker games.
In this example, the step S1 of dividing the cards into different buckets according to hand strength means iterating over each hand to obtain its card type and card score, and creating the feature vector corresponding to the hand based on the hand, the card type and the card score.
Before the formal calculation, step S1 in this example iterates over each hand based on rule programming to derive its card type and card score. The card type is determined using sorting, conditionals, loops and similar algorithms. The card-scoring rule can be set according to actual needs: for example, the lowest hand "2s3s4d5s7h" (or another non-suited combination of the same values, likewise below) is assigned score 1, the next hand "2s3s4d6s7h" score 2, the next "2s3s5d6s7h" score 3, and so on up to the highest royal straight flush, thereby obtaining the card score. Of course, this is only one rule for setting the card score; in practice it can be set and adjusted according to actual needs.
Feature vectors are then created based on the hands, card types and scores, as shown in Fig. 2. For example, the feature-vector representation of the hand "2s3s4d6s7h" is divided into three parts as shown in Fig. 2: the first part, Score, indicates the card score; the second part, High Card to Straight Flush, indicates the card type; and the third, trailing part indicates the specific cards. The feature vector corresponding to a hand is one data point $x_i$.
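One plausible encoding of the three-part feature vector of Fig. 2 is sketched below in Python. The layout (a scalar score, a one-hot card-type segment from High Card to Straight Flush, and a multi-hot segment over the 52 cards) and the segment sizes are assumptions for illustration; the patent figure does not fix exact dimensions.

```python
def hand_feature_vector(score, type_index, card_codes, num_types=9, deck_size=52):
    """Hypothetical encoding of Fig. 2:
    [card score | one-hot card type | multi-hot of the specific cards]."""
    type_part = [0] * num_types
    type_part[type_index] = 1            # one-hot card type
    card_part = [0] * deck_size
    for c in card_codes:                 # card indices 0..51
        card_part[c] = 1                 # multi-hot specific cards
    return [score] + type_part + card_part
```

Each such vector is one data point $x_i$ fed to the clustering step that follows.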
Based on the data obtained above, i.e. the feature vectors corresponding to the hands in step S1, a K-means model is trained to separate the hands into different buckets (clusters). The iterative operation minimizes the Euclidean distances from the sample points to their corresponding centre points, expressed by the objective
$J = \sum_{j=1}^{c} \sum_{x_i \in C_j} \lVert x_i - v_j \rVert^2$,
where $x_i$ is a data point of the data set $X = \{x_0, x_1, x_2, \ldots, x_n\}$, n is the number of samples, $v_j$ is the j-th bucket centre point (also called cluster centre point) in $V = \{v_0, v_1, v_2, \ldots, v_c\}$, $\lVert x_i - v_j \rVert$ is the Euclidean distance between the data point $x_i$ and the cluster centre point $v_j$, $c_i$ is the number of data points of the i-th bucket (cluster), and c is the number of bucket centres (cluster centres). The centre point $v_i$ is calculated as
$v_i = \frac{1}{c_i} \sum_{x \in C_i} x$.
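The two formulas above (nearest-centre assignment by Euclidean distance, then centre update as the bucket mean) can be implemented directly. The following is a minimal plain-Python K-means, shown for illustration; a real system would use an optimised library implementation.

```python
import math

def kmeans(points, centers, iters=10):
    """Plain K-means matching the formulas above: assign each feature
    vector to the nearest bucket centre by Euclidean distance, then move
    each centre to the mean of its assigned points."""
    for _ in range(iters):
        buckets = [[] for _ in centers]
        for x in points:
            j = min(range(len(centers)),
                    key=lambda k: math.dist(x, centers[k]))
            buckets[j].append(x)
        centers = [
            [sum(col) / len(b) for col in zip(*b)] if b else c
            for b, c in zip(buckets, centers)
        ]
    return centers, buckets
```

The returned buckets play the role of the hand buckets of step S1; hands landing in the same bucket are treated as strategically similar.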
step S2 in this example performs initialization operation on the averaging strategy, which is not described herein.
In step S3, let I be an information set, i.e. a set of public cards and public action information; A(I) is the legal action set under information set I; t and T are time parameters, where t is associated with each information set and increases with the number of visits to that information set, and T is the maximum value of t, i.e. the total number of iterations. The strategy $\sigma_i^t$ of player i maps each information set $I_i$ and action $a \in A(I_i)$ to the probability that the player selects action a at information set $I_i$ at time t. At time t, the strategies of all players constitute the strategy combination $\sigma^t$; i is the index of the current player. The strategy combination $\sigma$ is the combination of the strategies of all players. An action a is a specific action, such as "bet 4 chips" or "fold"; the set of actions a forms the action set A.
The non-terminal history h is an action sequence starting from the root node of one game; the set of non-terminal histories is the non-terminal history set H (the action-sequence set), which includes chance outcomes. $\pi^{\sigma}(h)$ is the probability of reaching the non-terminal history h under the strategy combination $\sigma$; the counterfactual probability $\pi_{-i}^{\sigma}(I)$ is the probability of reaching information set I under the strategy combination $\sigma$, assuming that the probability of the current player i reaching this state is 1.
Step S3 in this example performs Monte Carlo sampling (MC) of the states in the game tree, and samples actions based on the strategy values in order to reduce the counterfactual regret value CFR; on the premise that the mathematical expectation of the instantaneous regret value is unchanged, traversal of the entire game tree during an iteration is avoided.
This example limits the number of terminal histories considered per iteration. Let $Q = \{Q_1, \ldots, Q_r\}$ be a set of subsets of the terminal-history set Z, evenly distributed over Z; one such subset is called a "block". In each iteration a block is chosen at random, and only the terminal histories in this block are considered. Let $q_j > 0$ be the probability of selecting block $Q_j$ in the current iteration (satisfying $\sum_{j=1}^{r} q_j = 1$).
Let $q(z') = \sum_{j : z' \in Q_j} q_j$ be the probability that the terminal history $z'$ is selected in the current iteration. The sampled counterfactual value when block j is updated is calculated as
$\tilde{v}_i(\sigma, I \mid j) = \sum_{z' \in Q_j \cap Z_I} \frac{1}{q(z')}\, \pi_{-i}^{\sigma}(z'[I])\, \pi^{\sigma}(z'[I], z')\, u_i(z')$,
where the sum runs over the intersection of block $Q_j$ with the terminal-history set under information set I; $\pi_{-i}^{\sigma}(z'[I])$ is the probability that all players other than player i, acting according to the strategy combination $\sigma$, reach the terminal history $z'$ under information set I; $\pi^{\sigma}(z'[I], z')$ is the probability of reaching the terminal history $z'$ from the non-terminal prefix $z'[I]$ when the players act according to the strategy combination $\sigma$; and the set containing the terminal histories $z'$ is the terminal-history set Z.
The subset collection Q and the sampling probabilities define a complete Monte Carlo CFR algorithm: a block is sampled rather than the entire game tree being traversed, and only the terminal histories in this block are analysed.
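The block-sampling step and the importance-weighted sampled value can be sketched in Python. This is illustrative only: `blocks` holds the $Q_j$, `probs` the $q_j$, and the callbacks stand in for the reach probabilities $\pi_{-i}^{\sigma}$, $\pi^{\sigma}$ and the utility $u_i$ from the formula above.

```python
import random

def sample_block(blocks, probs, rng=None):
    """Pick one block Q_j of terminal histories with probability q_j
    (the q_j are assumed to sum to 1); only this block is traversed."""
    rng = rng or random.Random()
    r, acc = rng.random(), 0.0
    for block, q in zip(blocks, probs):
        acc += q
        if r < acc:
            return block
    return blocks[-1]

def sampled_value(block, q_of, weight, utility):
    """Sampled counterfactual value: sum of u_i(z') times the reach
    weight, corrected by 1/q(z') so the expectation is unchanged."""
    return sum(weight(z) * utility(z) / q_of(z) for z in block)
```

The `1/q(z')` correction is what keeps the mathematical expectation of the sampled value equal to the full-traversal value, as the text requires.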
Let Z be the set of all terminal histories $z'$ (from the root node to a leaf node) in the game; a non-terminal history h is a sequence from the root node that stops short of the final leaf node. For $z' \in Z$, $h \sqsubset z'$ indicates that the non-terminal history h is a prefix of the terminal history $z'$. $u_i(z')$ is the utility value of player i at the terminal history $z'$, i.e. the number of chips (or the score) won or lost by the player according to the game rules. The counterfactual value CFV under the non-terminal history h is calculated from the counterfactual probability, the strategy-profile probabilities and the utility values; the expected value EV in the current state is expressed as the product of the counterfactual value CFV and the probability of reaching the current state.
Specifically, in step S3 in this example, the counterfactual value CFV under the non-terminal history h is calculated by the formula $v_i(\sigma, h) = \sum_{z' \in Z,\, h \sqsubset z'} \pi_{-i}^\sigma(h)\, \pi^\sigma(h, z')\, u_i(z')$, where $v_i(\sigma, h)$ represents the counterfactual value CFV of player i reaching the non-terminal history h when the strategy profile $\sigma$ is adopted. $\pi_{-i}^\sigma(h)$ is the counterfactual probability: under the strategy profile $\sigma$, on the premise that the probability of the current player i reaching the non-terminal history h is 1, it is the probability that the other players act so as to reach h; it is obtained as the product of the reach contributions of every player except player i, i.e. $\pi_{-i}^\sigma(h) = \prod_{j \neq i} \pi_j^\sigma(h)$. $\pi^\sigma(h, z')$ is the probability of reaching the terminal history $z'$, on the premise that the non-terminal history h has already been reached, when the strategy profile $\sigma$ is adopted; it is obtained as $\pi^\sigma(h, z') = \pi^\sigma(z') / \pi^\sigma(h)$, where the non-terminal history h is a prefix of the terminal history $z'$, $\pi^\sigma(z')$ is the probability of reaching the terminal history $z'$ under the strategy profile $\sigma$, and $\pi^\sigma(h)$ is the probability of reaching the non-terminal history h under the strategy profile $\sigma$.
In step S3 of the present example, the expected value EV in the current state is calculated by the formula $u_i(\sigma, h) = \pi_i^\sigma(h)\, v_i(\sigma, h)$, where $u_i(\sigma, h)$ is the expected value EV of player i reaching the non-terminal history h when the strategy profile $\sigma$ is adopted, $\pi_i^\sigma(h)$ is the probability contributed by player i's own actions of reaching the non-terminal history h under $\sigma$, and $v_i(\sigma, h)$ is the counterfactual value CFV of player i reaching the non-terminal history h under $\sigma$. The value is calculated according to the elementary multiplication rule of probability; a preset specific numerical value can also be adopted, and custom modification and adjustment are also possible. $v_i(\sigma, h)$ is the counterfactual value CFV in the current state, calculated as $v_i(\sigma, h) = \sum_{z' \in Z,\, h \sqsubset z'} \pi_{-i}^\sigma(h)\, \pi^\sigma(h, z')\, u_i(z')$, where the counterfactual probability $\pi_{-i}^\sigma(h)$ is, under the strategy profile $\sigma$ and on the premise that the probability of the current player i reaching the non-terminal history h is 1, the probability that the other players act so as to reach h, and $u_i(z')$ is the utility value of player i at the terminal history $z'$.
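The CFV and EV formulas above can be checked numerically with a minimal sketch (the terminal histories, probabilities and utilities below are a made-up toy example, not the patent's game):

```python
# v_i(sigma, h) = sum over z' with prefix h of
#   pi_{-i}^sigma(h) * pi^sigma(h, z') * u_i(z')
# u_i(sigma, h) = pi_i^sigma(h) * v_i(sigma, h)

def cfv(terminals, cf_reach_h, tail_probs, utils):
    """Counterfactual value CFV at non-terminal history h."""
    return sum(cf_reach_h * tail_probs[z] * utils[z] for z in terminals)

def ev(own_reach_h, cfv_h):
    """Expected value EV: CFV scaled by player i's own reach probability."""
    return own_reach_h * cfv_h

terminals = ["z1", "z2"]
tails = {"z1": 0.6, "z2": 0.4}    # pi^sigma(h, z')
utils = {"z1": 3.0, "z2": -1.0}   # u_i(z') in chips
v = cfv(terminals, 0.25, tails, utils)  # 0.25 * (0.6*3.0 + 0.4*(-1.0))
print(v, ev(0.5, v))
```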
In step S4 described in this example, the counterfactual regret CFR of not taking action a at the non-terminal history h is calculated by the formula $r(h, a) = v_i(\sigma_{I \to a}, h) - v_i(\sigma, h)$, where $r(h, a)$ is that counterfactual regret and $v_i(\sigma_{I \to a}, h)$ is the value obtained by taking action a at the non-terminal history h in information set I. The counterfactual regret of not taking action a in information set I is calculated by the formula $r(I, a) = \sum_{h \in I} r(h, a)$; the accumulated regret is calculated by the formula $R_i^T(I, a) = \sum_{t=1}^{T} r_i^t(I, a)$, where $r_i^t(I, a)$ is the regret of player i not taking action a in information set I under the strategy profile $\sigma^t$.
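The three regret quantities above can be sketched as follows (a minimal illustration with made-up numbers; the function names are not from the patent):

```python
# r(h, a): per-history regret; r(I, a): summed over the information set;
# R^T(I, a): accumulated over iterations t = 1..T.

def action_regret(v_if_a, v_current):
    """r(h, a) = v_i(sigma_{I->a}, h) - v_i(sigma, h)."""
    return v_if_a - v_current

def infoset_regret(per_history_regrets):
    """r(I, a) = sum over histories h in I of r(h, a)."""
    return sum(per_history_regrets)

cum = 0.0
for r_t in [0.5, -0.25, 0.75]:   # r^t(I, a) across iterations t = 1..3
    cum += r_t                    # R^T(I, a) = sum_t r^t(I, a)
print(cum)
```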
In step S4 described in this example, under the strategy profile $\sigma$, the regret of an action is always the difference between the value of taking action a and the expected value, weighted by the probability that the other players (including chance) reach the current node. The non-negative counterfactual regret is defined as $R_i^{T,+}(I, a) = \max(R_i^T(I, a), 0)$, and regret matching is then applied to the accumulated counterfactual regret to obtain a new strategy.
That is, step S4 in this example updates the strategy values for neural network training (training with a deep learning network) by the formula $\sigma_i^{T+1}(I, a) = \frac{R_i^{T,+}(I, a)}{\sum_{a' \in A(I)} R_i^{T,+}(I, a')}$ when the denominator is positive, and $\sigma_i^{T+1}(I, a) = \frac{1}{|A(I)|}$ otherwise, where A(I) is the set of legal actions under information set I and the denominator is the sum of the positive accumulated regrets of all actions in information set I.
Through the formula above, this example calculates, for each information set, action probabilities proportional to the positive accumulated regret: for each action, the greater the probability value, the more likely the action is to be selected. CFR generates the next state of the game and recursively calculates the value of the selected action, computes the regret values from the returned values, and finally calculates and returns the value of the current node.
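The regret-matching update described above can be sketched as follows (a minimal sketch; the patent feeds the resulting strategy values to a neural network, which is omitted here, and the action names are illustrative):

```python
# Regret matching: action probabilities proportional to the positive part of
# the accumulated counterfactual regret; uniform when no regret is positive.

def regret_matching(cum_regret):
    """Map {action: R^T(I, a)} to a new strategy {action: sigma^{T+1}(I, a)}."""
    positive = {a: max(r, 0.0) for a, r in cum_regret.items()}
    total = sum(positive.values())
    if total > 0.0:
        return {a: rp / total for a, rp in positive.items()}
    n = len(cum_regret)
    return {a: 1.0 / n for a in cum_regret}

print(regret_matching({"fold": -2.0, "call": 1.0, "raise": 3.0}))
# positive regrets 0, 1, 3 -> probabilities 0, 1/4, 3/4
print(regret_matching({"fold": -1.0, "call": -3.0}))
# no positive regret -> uniform over the legal action set
```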
In this example, one iteration of the CFR algorithm in step S5 represents one hand of the game. A number of hands is configured, e.g. $10^{11}$; the completion of all hands marks the termination of the iteration, otherwise the iteration continues with additional hands.
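The iteration loop of step S5 can be sketched as follows (a sketch under assumptions: `play_one_hand` stands in for one CFR traversal and is hypothetical; a real run would use a very large number of hands such as $10^{11}$, while a tiny number is used here for illustration; the output is the average strategy):

```python
# One iteration == one hand; after all hands, output the average strategy.

def train(num_hands, play_one_hand):
    strategy_sum = {}
    for t in range(num_hands):
        # play_one_hand(t) returns {info_set: {action: probability}}
        for info_set, strat in play_one_hand(t).items():
            acc = strategy_sum.setdefault(info_set, {})
            for a, p in strat.items():
                acc[a] = acc.get(a, 0.0) + p
    # average strategy over all completed hands
    return {I: {a: s / num_hands for a, s in acc.items()}
            for I, acc in strategy_sum.items()}

def toy_hand(t):
    # alternate between two fixed strategies to show the averaging
    return {"I0": {"call": 1.0, "fold": 0.0} if t % 2 == 0
                  else {"call": 0.0, "fold": 1.0}}

print(train(4, toy_hand))
```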
The present embodiment also provides an artificial intelligence decision system based on game theory and nash equilibrium, which adopts the above artificial intelligence decision method based on game theory and nash equilibrium, and includes:
the hand card dividing module is used for dividing the cards into different buckets according to the strength of the hand cards;
the initialization module is used for realizing an initialization average strategy;
the profit calculation module is used for sampling the state in the game tree, sampling the action based on the strategy value and calculating an expected profit value of the sampled action and a counterfactual profit value of the non-sampled action;
the counterfactual regret calculation module is used for calculating the counterfactual regret value and updating the strategy value for neural network training;
and the iteration module is used for judging whether the player completes the iteration, outputting the current average strategy value if the player completes the iteration, and returning to the income calculation module if the player does not complete the iteration.
In summary, in this embodiment, the expected value of the sampled action and the counterfactual value of the non-sampled actions are calculated, and the counterfactual regret is further calculated so as to update the strategy values for neural network training. It is therefore unnecessary to recompute everything each time: for part of the cards only the cached results need to be looked up and reused, which greatly reduces the amount of computation for dealing the cards. The aims of reducing the amount of computation and the computation time are thus well achieved, and updating the strategy values through the above series of steps, together with judging whether the iteration is complete, improves the accuracy of the calculation.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments, and the invention is not to be construed as being limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all such shall be considered as belonging to the protection scope of the invention.
Claims (10)
1. An artificial intelligence decision method based on game theory and Nash equilibrium is characterized by comprising the following steps:
step S1, dividing the cards into different buckets according to the strength of the hands;
step S2, initializing an average strategy;
step S3, sampling states in a game tree, sampling actions, and calculating the counterfactual values and expected values of all the states in the information set I;
step S4, calculating the counterfactual regret value, and updating the strategy value to train the neural network;
and step S5, judging whether the player finishes the iteration, if so, outputting the current average strategy value, and if not, returning to the step S3.
2. The method of claim 1, wherein the step S1 of dividing the cards into different buckets according to the hand strength includes iterating each hand to obtain the type and score of the hand, and creating the feature vector corresponding to the hand based on the hand, the type and the score.
3. The artificial intelligence decision method based on game theory and Nash equilibrium as claimed in claim 2, wherein in the step S1, a K-means clustering model is trained on the feature vectors corresponding to the hands, and the hands are thereby sorted into different buckets.
4. A game theory and Nash equilibrium based artificial intelligence decision method according to any one of claims 1 to 3, characterized in that in the step S3, Z is the set of all terminal histories in the game, a terminal history being the sequence from the root node to a leaf node; for $z' \in Z$, $h \sqsubset z'$ represents that the non-terminal history h is a prefix of the terminal history $z'$; $u_i(z')$ is the utility value of player i at the terminal history $z'$; the counterfactual value CFV under the non-terminal history h is then calculated according to the counterfactual probability, the strategy-profile probability and the utility value; and the expected value EV in the current state is expressed as the product of the counterfactual value CFV and its arrival probability in the current state.
5. The artificial intelligence decision method based on game theory and Nash equilibrium as claimed in claim 4, wherein in the step S3, the counterfactual value CFV under the non-terminal history h is calculated by the formula $v_i(\sigma, h) = \sum_{z' \in Z,\, h \sqsubset z'} \pi_{-i}^\sigma(h)\, \pi^\sigma(h, z')\, u_i(z')$, wherein $v_i(\sigma, h)$ represents the counterfactual value CFV of player i reaching the non-terminal history h when the strategy profile $\sigma$ is adopted; $\pi_{-i}^\sigma(h)$ is the counterfactual probability, namely the probability of reaching the non-terminal history h without considering player i when the strategy profile $\sigma$ is adopted; and $\pi^\sigma(h, z')$ is the probability of reaching the terminal history $z'$ on the premise that the non-terminal history h has been reached when the strategy profile $\sigma$ is adopted.
6. The artificial intelligence decision method based on game theory and Nash equilibrium of claim 5, wherein in the step S3, the expected value EV in the current state is calculated by the formula $u_i(\sigma, h) = \pi_i^\sigma(h)\, v_i(\sigma, h)$, wherein $u_i(\sigma, h)$ is the expected value EV of player i reaching the non-terminal history h when the strategy profile $\sigma$ is adopted, and $\pi_i^\sigma(h)$ is the probability of player i reaching the non-terminal history h when the strategy profile $\sigma$ is adopted.
7. The method for artificial intelligence decision based on game theory and Nash equilibrium as claimed in claim 6, wherein in said step S4, the counterfactual regret CFR of not taking action a at the non-terminal history h is calculated by the formula $r(h, a) = v_i(\sigma_{I \to a}, h) - v_i(\sigma, h)$, wherein $r(h, a)$ is that counterfactual regret and $v_i(\sigma_{I \to a}, h)$ is the utility value produced by performing action a in information set I at the non-terminal history h; the counterfactual regret of not taking action a in information set I is calculated by the formula $r(I, a) = \sum_{h \in I} r(h, a)$; and the accumulated regret is calculated by the formula $R_i^T(I, a) = \sum_{t=1}^{T} r_i^t(I, a)$, wherein $r_i^t(I, a)$ is the regret of player i not taking action a in information set I under the strategy profile $\sigma^t$.
8. The artificial intelligence decision method based on game theory and Nash equilibrium of claim 7, wherein in the step S4, the strategy values are updated for neural network training by the formula $\sigma_i^{T+1}(I, a) = R_i^{T,+}(I, a) / \sum_{a' \in A(I)} R_i^{T,+}(I, a')$ when the denominator is positive, and $\sigma_i^{T+1}(I, a) = 1/|A(I)|$ otherwise, wherein A(I) is the set of legal actions under information set I and the denominator is the sum of the positive accumulated regrets of all actions in information set I.
9. A game theory and Nash equilibrium based artificial intelligence decision method according to any one of claims 1 to 3, characterized in that in the step S5, a number of game hands is configured, and the completion of all game hands marks the termination of the iteration; otherwise the iteration is continued with additional game hands.
10. An artificial intelligence decision system based on game theory and Nash equilibrium, characterized in that it adopts the artificial intelligence decision method based on game theory and Nash equilibrium of any one of claims 1 to 9, and comprises:
the hand card dividing module is used for dividing the cards into different buckets according to the strength of the hand cards;
the initialization module is used for realizing an initialization average strategy;
the profit calculation module is used for sampling the state in the game tree, sampling the action based on the strategy value and calculating an expected profit value of the sampled action and a counterfactual profit value of the non-sampled action;
the counterfactual regret calculation module is used for calculating the counterfactual regret value and updating the strategy value for neural network training;
and the iteration module is used for judging whether the player completes the iteration, outputting the current average strategy value if the player completes the iteration, and returning to the income calculation module if the player does not complete the iteration.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010719081 | 2020-07-23 | ||
CN2020107190816 | 2020-07-23 |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111905373A true CN111905373A (en) | 2020-11-10 |
Family
ID=73286714
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010744288.9A Pending CN111905373A (en) | 2020-07-23 | 2020-07-29 | Artificial intelligence decision method and system based on game theory and Nash equilibrium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111905373A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113779870A (en) * | 2021-08-24 | 2021-12-10 | 清华大学 | Parallelization imperfect information game strategy generation method and device, electronic equipment and storage medium |
CN113869501A (en) * | 2021-10-19 | 2021-12-31 | 京东科技信息技术有限公司 | Neural network generation method and device, electronic equipment and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103559363A (en) * | 2013-11-15 | 2014-02-05 | 南京大学 | Method for calculating optimum response strategy in imperfect information extensive game |
CN106296006A (en) * | 2016-08-10 | 2017-01-04 | 哈尔滨工业大学深圳研究生院 | The minimum sorry appraisal procedure of non-perfect information game risk and Revenue Reconciliation |
CN110404265A (en) * | 2019-07-25 | 2019-11-05 | 哈尔滨工业大学(深圳) | A kind of non-complete information machine game method of more people based on game final phase of a chess game online resolution, device, system and storage medium |
CN110489668A (en) * | 2019-09-11 | 2019-11-22 | 东北大学 | Synchronous game monte carlo search sets mutation method more under non-complete information |
US10675537B1 (en) * | 2019-05-15 | 2020-06-09 | Alibaba Group Holding Limited | Determining action selection policies of an execution device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110404265B (en) | Multi-user non-complete information machine game method, device and system based on game incomplete on-line resolving and storage medium | |
Brown et al. | Deep counterfactual regret minimization | |
CN110404264B (en) | Multi-person non-complete information game strategy solving method, device and system based on virtual self-game and storage medium | |
Ponsen et al. | Integrating opponent models with monte-carlo tree search in poker | |
Bošanský et al. | Algorithms for computing strategies in two-player simultaneous move games | |
JP2006204921A (en) | Bayesian scoring | |
Zhang et al. | Improving hearthstone AI by learning high-level rollout policies and bucketing chance node events | |
CN111905373A (en) | Artificial intelligence decision method and system based on game theory and Nash equilibrium | |
Rubin et al. | Similarity-based retrieval and solution re-use policies in the game of Texas Hold’em | |
CN112926744A (en) | Incomplete information game method and system based on reinforcement learning and electronic equipment | |
Sismanis | How I won the "Chess Ratings: Elo vs the Rest of the World" competition | |
Sturtevant et al. | Prob-max^ n: Playing n-player games with opponent models | |
Dobre et al. | Online learning and mining human play in complex games | |
CN112691383A (en) | Texas poker AI training method based on virtual regret minimization algorithm | |
CN115282604A (en) | On-line explicit reconstruction method and device for adversary strategy of incomplete information repeated game | |
Kocsis et al. | Universal parameter optimisation in games based on SPSA | |
CN110909890B (en) | Game artificial intelligence training method and device, server and storage medium | |
Wang | Cfr-p: Counterfactual regret minimization with hierarchical policy abstraction, and its application to two-player mahjong | |
Nestoruk et al. | Prediction of Football Games Results. | |
CN114048833B (en) | Multi-person and large-scale incomplete information game method and device based on neural network virtual self-game | |
Chen et al. | A Novel Reward Shaping Function for Single-Player Mahjong | |
Bucquet et al. | The Bank is Open: AI in Sports Gambling | |
Chen et al. | Learning Strategies for Imperfect Information Board Games Using Depth-Limited Counterfactual Regret Minimization and Belief State | |
Agrawal et al. | Targeted upskilling framework based on player mistake context in online skill gaming platforms | |
Li et al. | Scalable sub-game solving for imperfect-information games |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20201110 |