CN111905373A - Artificial intelligence decision method and system based on game theory and Nash equilibrium - Google Patents

Artificial intelligence decision method and system based on game theory and Nash equilibrium Download PDF

Info

Publication number
CN111905373A
CN111905373A (application CN202010744288.9A)
Authority
CN
China
Prior art keywords
value
game
strategy
counterfactual
player
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010744288.9A
Other languages
Chinese (zh)
Inventor
李波
鲍凌威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Aiwenzhesi Technology Co ltd
Original Assignee
Shenzhen Aiwenzhesi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Aiwenzhesi Technology Co ltd filed Critical Shenzhen Aiwenzhesi Technology Co ltd
Publication of CN111905373A
Legal status: Pending

Classifications

    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00 Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/60 Generating or modifying game content before or while executing the game program, e.g. authoring tools specially adapted for game development or game-integrated level editor
    • A63F13/67 Generating or modifying game content before or while executing the game program, e.g. authoring tools specially adapted for game development or game-integrated level editor adaptively or by learning from player actions, e.g. skill level adjustment or by storing successful combat sequences for re-use
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00 Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/70 Game security or game management aspects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F2300/00 Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game
    • A63F2300/60 Methods for processing data by generating or executing the game program
    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F2300/00 Features of games using an electronically generated display having two or more dimensions, e.g. on a television screen, showing representations related to the game
    • A63F2300/60 Methods for processing data by generating or executing the game program
    • A63F2300/6027 Methods for processing data by generating or executing the game program using adaptive systems learning from user actions, e.g. for skill level adjustment

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Business, Economics & Management (AREA)
  • Computer Security & Cryptography (AREA)
  • General Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides an artificial intelligence decision method and system based on game theory and Nash equilibrium. The artificial intelligence decision method comprises the following steps: step S1, dividing the cards into different buckets according to the strength of the hands; step S2, initializing an average strategy; step S3, sampling states in the game tree, sampling actions, and calculating the counterfactual profit value and the expected profit value of each state in information set I; step S4, calculating the counterfactual regret values and updating the strategy values to train the neural network; step S5, judging whether the player has finished the iteration, and if so, outputting the current average strategy value, otherwise returning to step S3. The invention reduces the amount of calculation and the calculation time, updates the strategy values through the above series of steps, and judges whether the iteration is finished, thereby improving the calculation accuracy.

Description

Artificial intelligence decision method and system based on game theory and Nash equilibrium
Technical Field
The invention relates to a decision training method for electronic chess, in particular to an artificial intelligence decision method based on a game theory and Nash equilibrium, and an artificial intelligence decision system adopting the artificial intelligence decision method based on the game theory and Nash equilibrium.
Background
The poker AI Libratus (Chinese nickname "Cold Poker Master") was developed in 2017 by Noam Brown and Professor Tuomas Sandholm of Carnegie Mellon University. Libratus uses the common Counterfactual Regret Minimization (CFR) algorithm for poker AI to find a Nash equilibrium betting method, also known as GTO betting, in each betting round of each hand. The Libratus algorithm system is divided into three modules. (1) Abstracting the poker game, which reduces the complexity of the poker game through abstraction. (2) Subdividing the game scenario; this module finds which of the pre-computed game scenarios the AI and the player are currently in. (3) Self-play training, which searches for and repairs loopholes in the game strategy by analysing the data and strategies generated during self-play.
However, in the prior art, real-time calculation is needed for every possible situation of the poker game, so the amount of computation is too large, a large-scale server is required, the calculation time is long, and the requirement of real-time human-machine play cannot be met. In addition, the conventional Counterfactual Regret Minimization (CFR) algorithm cannot make good use of the specific structure of the poker game to reduce the amount of computation, save calculation time and improve calculation accuracy.
Disclosure of Invention
The technical problem to be solved by the invention is to provide an artificial intelligence decision method based on game theory and Nash equilibrium, which can reduce the calculation amount and time and improve the calculation accuracy, and further provide an artificial intelligence decision system adopting the artificial intelligence decision method.
Therefore, the invention provides an artificial intelligence decision method based on game theory and Nash equilibrium, which comprises the following steps:
step S1, dividing the cards into different buckets according to the strength of the hands;
step S2, initializing an average strategy;
step S3, sampling states in a game tree, sampling actions, and calculating counterfactual income values and expected income values of all the states in an information set I, wherein the information set I is a set of public cards and public action information;
step S4, calculating the regret value of the counterfactual, and updating the strategy value to train the neural network;
and step S5, judging whether the player finishes the iteration, if so, outputting the current average strategy value, and if not, returning to the step S3.
A further improvement of the present invention is that said step S1 of dividing the cards into different buckets according to the hand strength means iterating each hand to obtain the card type and the card score of the hand, and creating the feature vector corresponding to the hand based on the hand, the card type and the card score.
In a further improvement of the present invention, in the step S1, the hand is trained by a K-means clustering model according to the feature vector corresponding to the hand, and then the hand is divided into different buckets.
In a further improvement of the present invention, in the step S3, let Z be the set of all end histories in the game, where an end history is a sequence from the root node to a leaf node; for z′ ∈ Z, h ⊑ z′ denotes that the non-end history information h is a prefix history of the end history z′; u_i(z′) is the utility value of player i at the end history z′. The counterfactual profit value CFV under the non-end history information h is then calculated according to the counterfactual probability, the probability of the strategy combination and the utility value; the expected profit value EV in the current state is expressed as the product of the counterfactual profit value CFV in the current state and its arrival probability.
In a further improvement of the present invention, in the step S3, the counterfactual profit value CFV under the non-end history information h is calculated by the formula

v_i(σ, h) = Σ_{z′∈Z, h⊑z′} π^σ_{-i}(h) · π^σ(h, z′) · u_i(z′)

wherein v_i(σ, h) represents the counterfactual profit value CFV of player i reaching the non-end history information h when the strategy combination σ is adopted; π^σ_{-i}(h) is the counterfactual probability, i.e. the probability of reaching the non-end history information h when the strategy combination σ is adopted and player i is not taken into account; π^σ(h, z′) is the probability of reaching the end history z′, on the premise that the non-end history information h has already been reached, when the strategy combination σ is adopted.
In a further improvement of the present invention, in the step S3, the expected profit value EV in the current state is calculated by the formula

u_i(σ, h) = π^σ_i(h) · v_i(σ, h)

wherein u_i(σ, h) is the expected profit value EV of player i reaching the non-end history information h when the strategy combination σ is adopted, and π^σ_i(h) is the probability that player i reaches the non-end history information h when the strategy combination σ is adopted.
In a further improvement of the invention, in step S4, the counterfactual regret value CFR of not taking action a under the non-end history information h is calculated by the formula r(h, a) = v_i(σ_{I→a}, h) − v_i(σ, h), wherein r(h, a) is the counterfactual regret value CFR of not taking action a under the non-end history information h, and v_i(σ_{I→a}, h) is the utility obtained by always performing action a in information set I under the non-end history information h; the counterfactual regret value of not taking action a in information set I is calculated by the formula

r(I, a) = Σ_{h∈I} r(h, a)

and the cumulative regret value is calculated by the formula

R^T_i(I, a) = Σ_{t=1}^{T} r^{σ^t}_i(I, a)

wherein r^{σ^t}_i(I, a) is the regret value of player i not taking action a in information set I under the strategy combination σ^t.
In a further improvement of the present invention, in the step S4, the strategy values are updated for neural network training by the formula

σ^{T+1}_i(I, a) = R^{T,+}_i(I, a) / Σ_{a′∈A(I)} R^{T,+}_i(I, a′)  if Σ_{a′∈A(I)} R^{T,+}_i(I, a′) > 0,  and 1/|A(I)| otherwise,

wherein R^{T,+}_i(I, a) = max(R^T_i(I, a), 0) is the non-negative cumulative regret, A(I) is the legal action set under information set I, and Σ_{a′∈A(I)} R^{T,+}_i(I, a′) is the sum of the positive regret values of all actions under information set I.
In a further improvement of the present invention, in step S5, a number of game plays is configured; completion of all configured game plays marks the termination of the iteration, otherwise the play count is incremented and the iteration continues.
The invention also provides an artificial intelligence decision-making system based on the game theory and the Nash equilibrium, which adopts the artificial intelligence decision-making method based on the game theory and the Nash equilibrium and comprises the following steps:
the hand card dividing module is used for dividing the cards into different buckets according to the strength of the hand cards;
the initialization module is used for realizing an initialization average strategy;
the profit calculation module is used for sampling the state in the game tree, sampling the action based on the strategy value and calculating an expected profit value of the sampled action and a counterfactual profit value of the non-sampled action;
the counterfactual regret value calculation module, which calculates the counterfactual regret value and updates the strategy values to carry out neural network training;
and the iteration module is used for judging whether the player completes the iteration, outputting the current average strategy value if the player completes the iteration, and returning to the income calculation module if the player does not complete the iteration.
Compared with the prior art, the invention has the following beneficial effects: the expected profit value of the sampled action and the counterfactual profit value of the non-sampled actions are calculated, and the counterfactual regret value is then calculated so as to update the strategy values for neural network training. As a result, no repeated recalculation is needed each time; for part of the card types only the cached results need to be looked up and reused, which greatly reduces the amount of computation when community cards are revealed. The purpose of reducing the amount of calculation and the calculation time is therefore well achieved, the strategy values are updated through the above series of steps, and whether the iteration is finished is judged, which improves the calculation accuracy.
Drawings
FIG. 1 is a schematic workflow diagram of one embodiment of the present invention;
fig. 2 is a schematic diagram of the principle of feature vectors in an embodiment of the present invention.
Detailed Description
Preferred embodiments of the present invention will be described in further detail below with reference to the accompanying drawings.
Before explaining the working process of this example, some parameters and abbreviations are clarified: CFR+ is Counterfactual Regret Minimization Plus; CFR-D is Counterfactual Regret Minimization with Decomposition; Adam stochastic gradient descent is the Adam Gradient Descent algorithm; K-means clustering is K-means Clustering, KMC for short; the Nash equilibrium strategy is Nash Equilibrium Strategy, NES for short; the game-theory optimal strategy is Game Theory Optimal, GTO for short; the Counterfactual Regret Minimization algorithm is CFR for short; the game refers to an incomplete-information game, i.e. Incomplete-Information Games; the continuous solving algorithm is Continuous Re-solving; the deep counterfactual regret value network is the Deep Counterfactual Regret Value Neural Network, DCFRVNN for short; the look-ahead tree is the Forward-looking Game Tree, FGT for short; the sparsity of the look-ahead tree is Forward-looking Game Tree Sparsity, FGTS for short; the bucket abstraction method is Bucket Abstraction, BA for short.
As shown in fig. 1, this example provides an artificial intelligence decision method based on game theory and nash equilibrium, which includes:
step S1, dividing the cards into different buckets according to the strength of the hands;
step S2, initializing an average strategy;
step S3, sampling states in a game tree, sampling actions, and calculating expected income values of the sampled actions and counterfactual income values of the actions which are not sampled;
step S4, calculating the regret value of the counterfactual, and updating the strategy value to train the neural network;
and step S5, judging whether the player finishes the iteration, if so, outputting the current average strategy value, and if not, returning to the step S3.
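Read purely as control flow, steps S1 to S5 amount to the loop sketched below. The stage functions are passed in as placeholders because the patent describes them at a higher level than code; this is a minimal illustrative sketch under those assumptions, not the claimed implementation.

```python
def run_training(num_plays, split_into_buckets, init_average_strategy,
                 sample_and_evaluate, update_regrets_and_train,
                 all_hands, game_tree):
    """S1-S5 control flow; every stage function is a caller-supplied placeholder."""
    buckets = split_into_buckets(all_hands)            # S1: bucket hands by strength
    avg_strategy = init_average_strategy(game_tree)    # S2: initialise the average strategy
    for play in range(num_plays):                      # S5: iterate over the configured plays
        samples = sample_and_evaluate(game_tree, avg_strategy, buckets)  # S3: CFV/EV per info set I
        avg_strategy = update_regrets_and_train(samples, avg_strategy)   # S4: regrets, strategy, NN
    return avg_strategy                                # S5: output the current average strategy
```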
The game theory described in this embodiment refers to incomplete-information games. FastNet is a poker-playing artificial intelligence (AI) algorithm applied to incomplete-information games (Incomplete-Information Games) and is an innovative AI algorithm based on Continuous Re-solving.
This example first describes how the incomplete-information game program (FastNet) implements the continuous solving algorithm (Continuous Re-solving) and how the Deep Counterfactual Regret Value Network (DCFRVNN) is trained, which is the neural network training of step S4.
The continuous solving algorithm of FastNet can be applied to solving the Nash Equilibrium Strategy (NES) of general extensive-form games. Taking a poker game as an example, a Nash equilibrium best strategy is solved in each betting round of each hand, thereby forming a Game Theory Optimal (GTO) strategy. By playing strictly according to the Nash equilibrium strategy, it can be mathematically proven that the other players can only lose if they deviate from the optimal equilibrium strategy.
The goal of solving the extensive-form game is to construct a strategy for the player that minimizes the opponent's counterfactual regret value. The algorithm FastNet uses to solve the extensive-form game is a mixture of plain CFR and CFR+ (Counterfactual Regret Minimization Plus): it uses a regret-matching rule similar to CFR+, but applies uniform weighting and real-time updates, and ignores the early CFR iterations when computing the final average strategy and average counterfactual values, so as to reduce the amount of computation, save calculation time and improve calculation accuracy.
Meanwhile, FastNet uses a modification of the original CFR-D (Counterfactual Regret Decomposition) tool for the extensive-form game, so that the calculation can be further optimized according to the game situation. Testing showed that, although the Max-margin tool can improve the poor strategies of the abstracted agent towards the end of the game, the CFR-D tool performed better in early testing.
One of the main design goals of the FastNet implementation is to meet the requirements of human-machine play, that is, to achieve the same playing speed as a human player on a computer using general-purpose commodity hardware and a single GPU. The sparsity of the look-ahead tree (FGTS) and the number of iterations of continuous solving are the main parameters FastNet adjusts to achieve this goal. These attributes are selected and designed separately for each round of the poker game so that every round runs at a consistent speed. It should be noted that FastNet places no additional special requirements on its look-ahead tree density other than the hard constraints imposed by hardware and speed limits.
The look-ahead tree (FGT) used by FastNet is adjusted according to the actions the acting player can take, the response actions the opponent can take, and the actions either player can take in the remaining rounds. FastNet usually limits the depth of the look-ahead tree to the end of the current round, unless the remainder of the game can be computed exactly within that round, in which case the depth of the look-ahead tree extends to the end of the entire game. That is, the game tree in the step S3 is preferably the Forward-looking Game Tree (FGT) used by FastNet.
In this example, in the pre-flop round (Preflop) and the flop round (Flop), FastNet uses a deep counterfactual regret value network trained in advance, because the amount of computation before the flop is very large: to complete the calculation, all 22,100 possible public-card combinations at the flop would need to be enumerated so that each combination can be evaluated, so using a pre-trained deep counterfactual regret value network effectively increases the data-processing speed. To speed up play before the flop, FastNet additionally trains an auxiliary neural network in advance to estimate the expected values of the flop network over all possible flops (e.g., the expected profit value of the sampled action calculated in step S3) and applies it during the initial iterations of CFR. In the last iterations, which compute the counterfactual values of the average strategy, FastNet performs card enumeration and neural-network evaluation of the values.
In addition, FastNet caches the continuous-solving computation for every observed pre-flop card combination and game situation. When the same card combination and game situation occurs again, only the cached result needs to be looked up and reused instead of being recomputed, which further increases the calculation speed. Within a hand, we do not use the neural network after the last community card (the River card), but use the continuous solving algorithm until the end of the game. We use Bucket Abstraction (BA) for all actions of the river round (River), and the buckets in step S1 refer to the buckets of this Bucket Abstraction (BA).
The details of the neural networks used by FastNet are as follows. All networks were trained with the Torch7 library, using the Adam stochastic gradient descent algorithm to minimize the mean Huber loss of the counterfactual-value error. The mini-batch size used for training was 1,000, and the learning rate was 0.001, reduced to 0.0001 after the first 200 epochs. The networks were trained for approximately 350 epochs, about two days on a single GPU, and the epoch with the lowest validation loss was selected.
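The patent reports Torch7 training with Adam, a Huber loss on the counterfactual-value error, mini-batches of 1,000, a learning rate of 0.001 dropping to 0.0001 after 200 epochs, and roughly 350 epochs. Purely as an illustration of that schedule, a PyTorch sketch could look as follows; the network shape, the input/output sizes and the data loader are assumptions, not the patented architecture.

```python
import torch
import torch.nn as nn

# Assumed toy network: bucket distribution plus pot feature in, counterfactual values out.
net = nn.Sequential(nn.Linear(1001, 500), nn.PReLU(), nn.Linear(500, 1000))
loss_fn = nn.HuberLoss()                                   # Huber loss on the counterfactual error
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)    # Adam, learning rate 0.001
# Drop the learning rate to 0.0001 after the first 200 epochs, as reported in the text.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[200], gamma=0.1)

def run_epoch(loader):
    """One pass over mini-batches of (features, target CFV) pairs from a hypothetical loader."""
    for features, target_cfv in loader:
        optimizer.zero_grad()
        loss = loss_fn(net(features), target_cfv)
        loss.backward()
        optimizer.step()

# Training loop sketch: ~350 epochs, keeping the epoch with the lowest validation loss.
# for epoch in range(350):
#     run_epoch(train_loader)
#     scheduler.step()
```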
In the neural network training performed in step S4 of this example, the neural network used by FastNet can adapt to the strategies of different human players. To increase generality in dealing with a player's strategy, this example maps the distribution over individual hands (combinations of public and private cards) into a distribution over buckets, i.e., the hand splitting implemented in step S1. The buckets are generated with an abstraction technique based on a clustering method that groups strategically similar hands by K-means Clustering (KMC), and the distance between the buckets produced by K-means clustering can be viewed as a comparison of hand strength. For the flop network and the turn network, we use 1,000 clusters and map the original range to a distribution over these clusters as the first layer of the neural network. Since there are only 169 strategically distinct hand types before the flop, this bucketing step is not used for the auxiliary network, and the distribution over the distinct hands can be entered directly.
Unlike the prior art, this example is also implemented on the basis of the game program and the average strategy. The game program (the incomplete-information game program FastNet) uses two deep counterfactual regret value networks, one for the flop and one for the turn, plus the auxiliary neural network described above, which provides counterfactual values at the end of the pre-flop. To train a deep counterfactual regret value network, this example generates random poker situations at the start of the round. Each poker situation is defined by the pot size, the ranges of actions both players can take and the public cards that have already been dealt. A complete game history is not necessary, because the pot and the ranges of actions that can be taken are sufficient to represent the current game situation.
The output of the deep counterfactual regret value network is a vector of counterfactual values, where each component corresponds to one player. This example interprets the output values as fractions of the pot size, to generalise across the whole poker game. In the pre-flop round (Preflop) and the flop round (Flop), FastNet uses the deep counterfactual regret value networks trained in advance, because the amount of computation before the flop is extremely large: to complete the calculation, all 22,100 possible public-card combinations at the flop would need to be enumerated so that each combination can be evaluated.
To speed up play before the flop, FastNet trains an additional auxiliary neural network in advance to estimate the expected values of the flop network over all possible flops and applies it during the initial iterations of CFR. In the last iterations, which compute the counterfactual values of the average strategy, FastNet performs card enumeration and neural-network evaluation of the values.
In addition, FastNet caches the continuous-solving computation for every observed pre-flop card combination and game situation. When the same card combination and game situation occurs again, only the cached result needs to be looked up and reused instead of being recomputed. Within a hand, this example does not use a neural network after the last community card (the River card), but uses the continuous solving algorithm until the end of the game. This example uses Bucket Abstraction (BA) for all actions of the river round (River).
In this example, the step S1 of dividing the cards into different buckets according to the hand strength refers to iterating each hand to obtain the card type and the card score of the hand, and creating the feature vector corresponding to the hand based on the hand, the card type and the card score.
Before the formal calculation, step S1 in this example iterates over each hand using rule-based programming to obtain its card type and card score. The judgement of the card type is implemented with sorting, conditional and loop algorithms. The rule for assigning card scores can be set according to actual needs: for example, the lowest hand "2s3s4d5s7h" (or another non-suited combination of the same values, the same below) is assigned score 1, the next hand "2s3s4d6s7h" is assigned score 2, the next hand "2s3s5d6s7h" is assigned score 3, and so on up to the highest hand, the Royal Flush, thereby obtaining the card scores. Of course, this is only one rule for setting the card scores; in practice the scoring can be set and adjusted according to actual needs.
Feature vectors are then created based on the hands, card types and scores, as shown in Fig. 2. For example, the feature vector representation of the hand "2s3s4d6s7h" is divided into the 3 parts shown in Fig. 2: the first part, Score, indicates the card score; the second part, High Card to Straight Flush, indicates the card type; and the third, trailing part indicates the specific cards. The feature vector corresponding to a hand is a data point x_i.
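A minimal sketch of the three-part feature vector of Fig. 2 (score, card type, specific cards). The encoding widths and the way the score and type are supplied are assumptions made only for illustration; a rule-based evaluator as described above would produce them.

```python
import numpy as np

RANKS = "23456789TJQKA"
SUITS = "shdc"
TYPES = ["HighCard", "Pair", "TwoPair", "Trips", "Straight",
         "Flush", "FullHouse", "Quads", "StraightFlush"]   # "High Card" .. "Straight Flush"

def hand_feature_vector(cards, card_type, score):
    """cards e.g. ["2s","3s","4d","6s","7h"]; card_type and score come from the rule-based evaluator."""
    score_part = np.array([score], dtype=float)                         # part 1: the card score
    type_part = np.zeros(len(TYPES))
    type_part[TYPES.index(card_type)] = 1.0                             # part 2: the card type
    card_part = np.zeros(len(RANKS) * len(SUITS))                       # part 3: the specific cards
    for c in cards:
        card_part[RANKS.index(c[0]) * len(SUITS) + SUITS.index(c[1])] = 1.0
    return np.concatenate([score_part, type_part, card_part])           # the data point x_i

x_i = hand_feature_vector(["2s", "3s", "4d", "6s", "7h"], "HighCard", 2.0)
```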
Based on the data obtained above, i.e. the feature vectors corresponding to the hands in the step S1, a K-means model is trained to separate the hands into different buckets (clusters). The iterations minimize the Euclidean distances from the sample points to their corresponding centre points, expressed by the following objective:

J(V) = Σ_{j=1}^{c} Σ_{x_i ∈ bucket j} ||x_i − v_j||²

in the formula, x_i is a data point of the data set X = {x_0, x_1, x_2, …, x_n}, n is the number of samples, v_j is the j-th bucket centre point (also called cluster centre point) in V = {v_0, v_1, v_2, …, v_c}, ||x_i − v_j|| is the Euclidean distance between the data point x_i and the cluster centre point v_j, c_j is the number of data points of the j-th bucket (cluster), and c is the number of bucket centres (cluster centres). The bucket centre point v_j is calculated as:

v_j = (1/c_j) Σ_{x_i ∈ bucket j} x_i
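A hedged sketch of the bucketing step using scikit-learn's KMeans over the feature vectors x_i; the 1,000 clusters follow the cluster count used later for the flop and turn networks, while the feature matrix here is random placeholder data (62 columns, matching the sketch above) rather than real hands.

```python
import numpy as np
from sklearn.cluster import KMeans

# X: one row per hand, built from the feature vectors x_i (placeholder data here).
X = np.random.rand(20000, 62)

kmeans = KMeans(n_clusters=1000, n_init=10, random_state=0).fit(X)
buckets = kmeans.labels_            # bucket (cluster) index assigned to every hand
centers = kmeans.cluster_centers_   # bucket centre points v_j

# The k-means objective J(V): sum of squared Euclidean distances from each data point
# to the centre of the bucket it was assigned to.
J = sum(np.sum((X[buckets == j] - centers[j]) ** 2) for j in range(kmeans.n_clusters))
```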
step S2 in this example performs initialization operation on the averaging strategy, which is not described herein.
In step S3, let I be an information set, i.e. a set of public cards and public action information; A(I) is the legal action set under information set I; t and T are time parameters, where t is associated with each information set and increases with the number of visits to that information set, and T is the maximum value of the time parameter t, i.e. the total number of iterations. The strategy of player i, σ^t_i(I_i, a), maps each information set I_i and legal action a ∈ A(I_i) to the probability that the player selects action a at information set I_i at time t. At time t, the strategies of all players form the strategy combination σ^t; i is the index of the current player. The strategy combination σ refers to the combination made up of the strategies of all players. Action a refers to a specific action, such as "bet 4 chips" or "fold"; the set of actions a is the action set A.
The non-end history information h is an action sequence starting from the root node within one game; the set containing the non-end history information h is the non-end history set H, and the non-end history set H (the set of action sequences) includes chance outcomes. π^σ(h) is the probability of reaching the non-end history information h under the strategy combination σ; the counterfactual probability π^σ_{-i}(h) is the probability of reaching information set I under the strategy combination σ, assuming that the probability of the current player i reaching this state is 1.
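To make the bookkeeping concrete, a common way to store the per-information-set quantities (the legal action set A(I), the cumulative regrets R^T_i(I, a) and the cumulative strategy used for the average strategy) is a small node object. This is a generic CFR data-structure sketch under those assumptions, not the patented representation.

```python
from dataclasses import dataclass, field

@dataclass
class InfoSetNode:
    """Bookkeeping for one information set I (public cards plus public action history)."""
    actions: list                                       # legal action set A(I)
    regret_sum: dict = field(default_factory=dict)      # cumulative counterfactual regret R_i^T(I, a)
    strategy_sum: dict = field(default_factory=dict)    # weighted sum of sigma_i^t(I, a) over t

    def __post_init__(self):
        for a in self.actions:
            self.regret_sum.setdefault(a, 0.0)
            self.strategy_sum.setdefault(a, 0.0)

    def average_strategy(self):
        """Average strategy: normalised cumulative strategy, uniform if nothing has accumulated yet."""
        total = sum(self.strategy_sum.values())
        n = len(self.actions)
        return {a: (self.strategy_sum[a] / total if total > 0 else 1.0 / n) for a in self.actions}

node = InfoSetNode(actions=["fold", "call", "bet4"])    # e.g. "fold", "call", "bet 4 chips"
```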
Step S3 in this example performs Monte Carlo sampling (MC) of the states in the game tree and samples actions based on the strategy values in order to reduce the counterfactual regret value CFR; on the premise that the mathematical expectation of the instantaneous regret value remains unchanged, this avoids traversing the entire game tree during each iteration.
This example limits the number of end histories that are visited. Let Q = {Q_1, …, Q_r} be a set of subsets of the end-history set Z, evenly distributed over Z; each subset is called a "block". In each iteration a block is chosen at random, and only the end histories in this block are considered. Let q_j > 0 be the probability of choosing block Q_j in the current iteration (satisfying Σ_{j=1}^{r} q_j = 1).

Let q(z′) = Σ_{j: z′∈Q_j} q_j, so that q(z′) is the probability that the end history z′ is selected in the current iteration. The sampled counterfactual value used when block Q_j is updated is calculated as follows:

ṽ_i(σ, I) = Σ_{z′ ∈ Q_j ∩ Z_I} (1/q(z′)) · u_i(z′) · π^σ_{-i}(z′[I]) · π^σ(z′[I], z′)

where the sum runs over the intersection of block Q_j with the set Z_I of end histories passing through information set I, π^σ_{-i}(z′[I]) is the probability that all players other than player i, acting according to the strategy combination σ, reach the prefix z′[I] of the end history z′ in information set I, and π^σ(z′[I], z′) is the probability of reaching the end history z′ from the non-end history z′[I] when the players act according to the strategy combination σ; the set containing the end histories z′ is the end-history set Z.
The choice of the subset collection Q and the sampling probabilities defines a complete Monte Carlo CFR algorithm: a block is sampled instead of traversing the entire game tree, and only the end histories in this block are processed.
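A minimal sketch of the sampled counterfactual value just described: every end history z′ in the chosen block Q_j that passes through information set I contributes u_i(z′) · π^σ_{-i}(z′[I]) · π^σ(z′[I], z′), weighted by 1/q(z′). The record type and its fields are assumptions used only to show the weighting.

```python
from dataclasses import dataclass

@dataclass
class EndHistory:
    """One end history z' with the quantities needed for the sampled value (assumed fields)."""
    utility_i: float        # u_i(z'): chips won or lost by player i at z'
    reach_others: float     # pi^sigma_{-i}(z'[I]): probability the other players reach z'[I]
    reach_tail: float       # pi^sigma(z'[I], z'): probability of reaching z' from z'[I]
    q: float                # q(z'): probability that z' is sampled in this iteration
    passes_through_I: bool  # whether z' belongs to Z_I, i.e. passes through information set I

def sampled_counterfactual_value(block):
    """tilde v_i(sigma, I): importance-weighted sum over the end histories of the sampled block Q_j."""
    return sum(z.utility_i * z.reach_others * z.reach_tail / z.q
               for z in block if z.passes_through_I)
```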
Let Z be the set of all end histories z′ (from the root node to a leaf node) in the game, and let the non-end history information h be a sequence from the root node that stops short of the last-level child node; for z′ ∈ Z, h ⊑ z′ denotes that the non-end history information h is a prefix history of the end history z′; u_i(z′) is the utility value of player i at the end history z′, i.e. the number of chips (or the score) won or lost by the player according to the game rules. The counterfactual profit value CFV under the non-end history information h is calculated from the counterfactual probability, the probability of the strategy combination and the utility value; the expected profit value EV in the current state is expressed as the product of the counterfactual profit value CFV in the current state and its arrival probability.
Specifically, in step S3 in this example, the counterfactual profit value CFV under the non-end history information h is calculated by the formula

v_i(σ, h) = Σ_{z′∈Z, h⊑z′} π^σ_{-i}(h) · π^σ(h, z′) · u_i(z′)

wherein v_i(σ, h) represents the counterfactual profit value CFV of player i reaching the non-end history information h when the strategy combination σ is adopted; π^σ_{-i}(h) is the counterfactual probability, i.e. based on the strategy combination σ and on the premise that the probability of the current player i reaching the non-end history information h is 1, the probability that the other players' actions lead to the non-end history information h, obtained as the product of the reach probabilities of every player except player i under the strategy combination σ:

π^σ_{-i}(h) = Π_{p≠i} π^σ_p(h)

π^σ(h, z′) is the probability of reaching the end history z′, on the premise that the non-end history information h has already been reached, when the strategy combination σ is adopted; it is obtained as

π^σ(h, z′) = π^σ(z′) / π^σ(h)

where the non-end history information h is a prefix history of the end history z′, π^σ(z′) is the probability of reaching the end history z′ under the strategy combination σ, and π^σ(h) is the probability of reaching the non-end history information h under the strategy combination σ.
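A minimal sketch of the two formulas above: v_i(σ, h) sums, over every end history z′ with prefix h, the others' reach probability π^σ_{-i}(h), the tail probability π^σ(h, z′) = π^σ(z′)/π^σ(h) and the utility u_i(z′); the expected value is then u_i(σ, h) = π^σ_i(h) · v_i(σ, h). The terminal records passed in are illustrative assumptions.

```python
def counterfactual_value(terminals, reach_others_h):
    """v_i(sigma, h): `terminals` holds (pi_sigma_z, pi_sigma_h, u_i_z) for every z' that has h as a prefix."""
    cfv = 0.0
    for pi_sigma_z, pi_sigma_h, u_i_z in terminals:
        tail_prob = pi_sigma_z / pi_sigma_h           # pi^sigma(h, z') = pi^sigma(z') / pi^sigma(h)
        cfv += reach_others_h * tail_prob * u_i_z     # pi^sigma_{-i}(h) * pi^sigma(h, z') * u_i(z')
    return cfv

def expected_value(reach_i_h, cfv_h):
    """u_i(sigma, h) = pi^sigma_i(h) * v_i(sigma, h): CFV weighted by player i's own reach probability."""
    return reach_i_h * cfv_h
```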
In step S3 of this example, the expected profit value EV in the current state is calculated by the formula

u_i(σ, h) = π^σ_i(h) · v_i(σ, h)

wherein u_i(σ, h) is the expected profit value EV of player i reaching the non-end history information h when the strategy combination σ is adopted, π^σ_i(h) is the probability that player i reaches the non-end history information h when the strategy combination σ is adopted, and v_i(σ, h) is the counterfactual profit value CFV of player i reaching the non-end history information h when the strategy combination σ is adopted. The value follows from the elementary multiplication rule of probability; π^σ_i(h) can also be given a preset value or be modified and adjusted as required. v_i(σ, h) is the counterfactual profit value CFV in the current state and is computed as

v_i(σ, h) = Σ_{z′∈Z, h⊑z′} π^σ_{-i}(h) · π^σ(h, z′) · u_i(z′)

wherein the counterfactual probability π^σ_{-i}(h) is, based on the strategy combination σ and on the premise that the probability of the current player i reaching the non-end history information h is 1, the probability that the other players act so as to reach the non-end history information h; u_i(z′) is the utility value of player i at the end history z′.
In step S4 described in this example, the counterfactual regret value CFR of not taking action a under the non-end history information h is calculated by the formula r(h, a) = v_i(σ_{I→a}, h) − v_i(σ, h), wherein r(h, a) is the counterfactual regret value CFR of not taking action a under the non-end history information h, and v_i(σ_{I→a}, h) is the profit value obtained by taking action a at information set I under the non-end history information h; the counterfactual regret value r(I, a) of not taking action a in information set I is calculated by the formula

r(I, a) = Σ_{h∈I} r(h, a)

and the cumulative regret value is calculated by the formula

R^T_i(I, a) = Σ_{t=1}^{T} r^{σ^t}_i(I, a)

wherein r^{σ^t}_i(I, a) is the regret value of player i not taking action a in information set I under the strategy combination σ^t.

In step S4 described in this example, under the strategy combination σ, the regret value of an action is the difference between the profit of always choosing action a and the expected profit, weighted by the probability that the other players (including chance) reach the current node. The non-negative counterfactual regret value is defined as

R^{T,+}_i(I, a) = max(R^T_i(I, a), 0)

and regret matching is then applied to the accumulated counterfactual regret values to obtain a new strategy.
That is, step S4 in this example updates the strategy values for neural network training (training with a deep learning network) by the formula

σ^{T+1}_i(I, a) = R^{T,+}_i(I, a) / Σ_{a′∈A(I)} R^{T,+}_i(I, a′)  if Σ_{a′∈A(I)} R^{T,+}_i(I, a′) > 0,  and 1/|A(I)| otherwise,

wherein R^{T,+}_i(I, a) = max(R^T_i(I, a), 0), A(I) is the legal action set under information set I, and Σ_{a′∈A(I)} R^{T,+}_i(I, a′) is the sum of the positive regret values of all actions in information set I.
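A small sketch of the regret-matching update just stated: the new strategy is proportional to the positive part of the cumulative regret and falls back to the uniform strategy 1/|A(I)| when no action has positive regret.

```python
def regret_matching(regret_sum):
    """sigma^{T+1}_i(I, a) from cumulative regrets R^T_i(I, a) via their positive parts."""
    positive = {a: max(r, 0.0) for a, r in regret_sum.items()}      # R^{T,+}_i(I, a)
    total = sum(positive.values())
    if total > 0.0:
        return {a: p / total for a, p in positive.items()}
    return {a: 1.0 / len(regret_sum) for a in regret_sum}           # uniform over A(I)

print(regret_matching({"fold": -2.0, "call": 3.0, "bet4": 1.0}))    # -> call 0.75, bet4 0.25, fold 0.0
```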
Through the formula above, this example calculates, for each information set, action probabilities proportional to the positive accumulated regret values. For each action, the larger the probability value, the more likely the action is to be selected. CFR then generates the next state of the game and recursively calculates the profit value of the selected action, computes the regret values from the returned profit values, and finally calculates and returns the profit value of the current node.
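A compact sketch of the recursion described in the paragraph above for a two-player game tree: the current strategy is formed by regret matching, child values are computed recursively, counterfactual regrets are accumulated with the opponent's reach probability, and the node's expected value is returned. It reuses the regret_matching and InfoSetNode sketches above; the node accessors (is_terminal, utility, to_act, child, info_set_key) are assumptions, and this is generic vanilla CFR rather than the patented FastNet code.

```python
def cfr(node, player, reach_p0, reach_p1, nodes):
    """Expected value of `node` for `player`; `nodes` maps info-set keys to InfoSetNode objects."""
    if node.is_terminal():
        return node.utility(player)                      # u_i(z') at an end history
    info = nodes[node.info_set_key()]
    strategy = regret_matching(info.regret_sum)          # sigma^t(I, a) from cumulative regrets

    action_values, node_value = {}, 0.0
    for a in info.actions:                               # recursively evaluate every legal action
        if node.to_act() == 0:
            v = cfr(node.child(a), player, reach_p0 * strategy[a], reach_p1, nodes)
        else:
            v = cfr(node.child(a), player, reach_p0, reach_p1 * strategy[a], nodes)
        action_values[a] = v
        node_value += strategy[a] * v                    # expected profit value of the current node

    if node.to_act() == player:                          # accumulate regrets for the acting player
        opp_reach = reach_p1 if player == 0 else reach_p0
        own_reach = reach_p0 if player == 0 else reach_p1
        for a in info.actions:
            info.regret_sum[a] += opp_reach * (action_values[a] - node_value)
            info.strategy_sum[a] += own_reach * strategy[a]
    return node_value
```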
In this example, one iteration of the CFR algorithm in step S5 corresponds to playing one hand of the game. A number of game plays is configured, e.g. 10^11; the completion of all game plays marks the termination of the iteration, otherwise the play counter is incremented and the iteration continues.
The present embodiment also provides an artificial intelligence decision system based on game theory and nash equilibrium, which adopts the above artificial intelligence decision method based on game theory and nash equilibrium, and includes:
the hand card dividing module is used for dividing the cards into different buckets according to the strength of the hand cards;
the initialization module is used for realizing an initialization average strategy;
the profit calculation module is used for sampling the state in the game tree, sampling the action based on the strategy value and calculating an expected profit value of the sampled action and a counterfactual profit value of the non-sampled action;
the counterfactual regret value calculation module, which calculates the counterfactual regret value and updates the strategy values to carry out neural network training;
and the iteration module is used for judging whether the player completes the iteration, outputting the current average strategy value if the player completes the iteration, and returning to the income calculation module if the player does not complete the iteration.
In summary, in this embodiment the expected profit value of the sampled action and the counterfactual profit value of the non-sampled actions are calculated, and the counterfactual regret value is then calculated so as to update the strategy values for neural network training. Therefore, no repeated recalculation is needed each time; for part of the card types only the cached results need to be looked up and reused, which greatly reduces the amount of computation when community cards are revealed. The purpose of reducing the amount of calculation and the calculation time is thus well achieved, the strategy values are updated through the above series of steps, and whether the iteration is complete is judged, thereby improving the calculation accuracy.
The foregoing is a more detailed description of the invention in connection with specific preferred embodiments and it is not intended that the invention be limited to these specific details. For those skilled in the art to which the invention pertains, several simple deductions or substitutions can be made without departing from the spirit of the invention, and all shall be considered as belonging to the protection scope of the invention.

Claims (10)

1. An artificial intelligence decision method based on game theory and Nash equilibrium is characterized by comprising the following steps:
step S1, dividing the cards into different buckets according to the strength of the hands;
step S2, initializing an average strategy;
step S3, sampling states in a game tree, sampling actions, and calculating counterfactual income values and expected income values of all the states in the information set I;
step S4, calculating the regret value of the counterfactual, and updating the strategy value to train the neural network;
and step S5, judging whether the player finishes the iteration, if so, outputting the current average strategy value, and if not, returning to the step S3.
2. The method of claim 1, wherein the step S1 of dividing the cards into different buckets according to the hand strength includes iterating each hand to obtain the type and score of the hand, and creating the feature vector corresponding to the hand based on the hand, the type and the score.
3. The artificial intelligence decision method based on game theory and nash equilibrium as claimed in claim 2, wherein in step S1, the hand is trained by a K-means clustering model according to the feature vector corresponding to the hand, and then the hand is sorted into different buckets.
4. A game theory and Nash equilibrium based artificial intelligence decision method according to any one of claims 1 to 3, characterized in that in the step S3, Z is the set of all end histories in the game, where an end history is a sequence from the root node to a leaf node; for z′ ∈ Z, h ⊑ z′ denotes that the non-end history information h is a prefix history of the end history z′; u_i(z′) is the utility value of player i at the end history z′, and the counterfactual profit value CFV under the non-end history information h is then calculated according to the counterfactual probability, the probability of the strategy combination and the utility value; the expected profit value EV in the current state is expressed as the product of the counterfactual profit value CFV in the current state and its arrival probability.
5. The artificial intelligence decision method based on game theory and Nash equilibrium as claimed in claim 4, wherein in the step S3, the counterfactual profit value CFV under the non-end history information h is calculated by the formula

v_i(σ, h) = Σ_{z′∈Z, h⊑z′} π^σ_{-i}(h) · π^σ(h, z′) · u_i(z′)

wherein v_i(σ, h) represents the counterfactual profit value CFV of player i reaching the non-end history information h when the strategy combination σ is adopted; π^σ_{-i}(h) is the counterfactual probability, i.e. the probability of reaching the non-end history information h when the strategy combination σ is adopted and player i is not taken into account; π^σ(h, z′) is the probability of reaching the end history z′, on the premise that the non-end history information h has already been reached, when the strategy combination σ is adopted.
6. The artificial intelligence decision method based on game theory and Nash equilibrium of claim 5, wherein in the step S3, the expected profit value EV in the current state is calculated by the formula

u_i(σ, h) = π^σ_i(h) · v_i(σ, h)

wherein u_i(σ, h) is the expected profit value EV of player i reaching the non-end history information h when the strategy combination σ is adopted, and π^σ_i(h) is the probability that player i reaches the non-end history information h when the strategy combination σ is adopted.
7. The artificial intelligence decision method based on game theory and Nash equilibrium as claimed in claim 6, wherein in said step S4, the counterfactual regret value CFR of not taking action a under the non-end history information h is calculated by the formula r(h, a) = v_i(σ_{I→a}, h) − v_i(σ, h), wherein r(h, a) is the counterfactual regret value CFR of not taking action a under the non-end history information h, and v_i(σ_{I→a}, h) is the utility value generated by performing action a in information set I under the non-end history information h; the counterfactual regret value of not taking action a in information set I is calculated by the formula

r(I, a) = Σ_{h∈I} r(h, a)

and the cumulative regret value is calculated by the formula

R^T_i(I, a) = Σ_{t=1}^{T} r^{σ^t}_i(I, a)

wherein r^{σ^t}_i(I, a) is the regret value of player i not taking action a in information set I under the strategy combination σ^t.
8. The artificial intelligence decision method based on game theory and Nash equilibrium of claim 7, wherein in the step S4, the strategy values are updated for neural network training by the formula

σ^{T+1}_i(I, a) = R^{T,+}_i(I, a) / Σ_{a′∈A(I)} R^{T,+}_i(I, a′)  if Σ_{a′∈A(I)} R^{T,+}_i(I, a′) > 0,  and 1/|A(I)| otherwise,

wherein R^{T,+}_i(I, a) = max(R^T_i(I, a), 0), A(I) is the legal action set under information set I, and Σ_{a′∈A(I)} R^{T,+}_i(I, a′) is the sum of the positive regret values of all actions under information set I.
9. A game theory and Nash equilibrium based artificial intelligence decision method according to any one of claims 1 to 3, characterized in that in the step S5, a number of game plays is configured, and the completion of all game plays marks the termination of the iteration; otherwise the play count is increased and the iteration continues.
10. An artificial intelligence decision system based on game theory and Nash equilibrium, characterized in that it adopts the artificial intelligence decision method based on game theory and Nash equilibrium of any one of claims 1 to 9, and comprises:
the hand card dividing module is used for dividing the cards into different buckets according to the strength of the hand cards;
the initialization module is used for realizing an initialization average strategy;
the profit calculation module is used for sampling the state in the game tree, sampling the action based on the strategy value and calculating an expected profit value of the sampled action and a counterfactual profit value of the non-sampled action;
the counterfactual regret value calculation module, which calculates the counterfactual regret value and updates the strategy values to carry out neural network training;
and the iteration module is used for judging whether the player completes the iteration, outputting the current average strategy value if the player completes the iteration, and returning to the income calculation module if the player does not complete the iteration.
CN202010744288.9A 2020-07-23 2020-07-29 Artificial intelligence decision method and system based on game theory and Nash equilibrium Pending CN111905373A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010719081 2020-07-23
CN2020107190816 2020-07-23

Publications (1)

Publication Number Publication Date
CN111905373A true CN111905373A (en) 2020-11-10

Family

ID=73286714

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010744288.9A Pending CN111905373A (en) 2020-07-23 2020-07-29 Artificial intelligence decision method and system based on game theory and Nash equilibrium

Country Status (1)

Country Link
CN (1) CN111905373A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113779870A (en) * 2021-08-24 2021-12-10 清华大学 Parallelization imperfect information game strategy generation method and device, electronic equipment and storage medium
CN113869501A (en) * 2021-10-19 2021-12-31 京东科技信息技术有限公司 Neural network generation method and device, electronic equipment and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103559363A (en) * 2013-11-15 2014-02-05 南京大学 Method for calculating optimum response strategy in imperfect information extensive game
CN106296006A (en) * 2016-08-10 2017-01-04 哈尔滨工业大学深圳研究生院 The minimum sorry appraisal procedure of non-perfect information game risk and Revenue Reconciliation
US10675537B1 (en) * 2019-05-15 2020-06-09 Alibaba Group Holding Limited Determining action selection policies of an execution device
CN110404265A (en) * 2019-07-25 2019-11-05 哈尔滨工业大学(深圳) A kind of non-complete information machine game method of more people based on game final phase of a chess game online resolution, device, system and storage medium
CN110489668A (en) * 2019-09-11 2019-11-22 东北大学 Synchronous game monte carlo search sets mutation method more under non-complete information

Similar Documents

Publication Publication Date Title
CN110404265B (en) Multi-user non-complete information machine game method, device and system based on game incomplete on-line resolving and storage medium
Brown et al. Deep counterfactual regret minimization
CN110404264B (en) Multi-person non-complete information game strategy solving method, device and system based on virtual self-game and storage medium
Ponsen et al. Integrating opponent models with monte-carlo tree search in poker
Bošanský et al. Algorithms for computing strategies in two-player simultaneous move games
JP2006204921A (en) Bayesian scoring
Zhang et al. Improving hearthstone AI by learning high-level rollout policies and bucketing chance node events
CN111905373A (en) Artificial intelligence decision method and system based on game theory and Nash equilibrium
Rubin et al. Similarity-based retrieval and solution re-use policies in the game of Texas Hold’em
CN112926744A (en) Incomplete information game method and system based on reinforcement learning and electronic equipment
Sismanis How i won the" chess ratings-elo vs the rest of the world" competition
Sturtevant et al. Prob-max^ n: Playing n-player games with opponent models
Dobre et al. Online learning and mining human play in complex games
CN112691383A (en) Texas poker AI training method based on virtual regret minimization algorithm
CN115282604A (en) On-line explicit reconstruction method and device for adversary strategy of incomplete information repeated game
Kocsis et al. Universal parameter optimisation in games based on SPSA
CN110909890B (en) Game artificial intelligence training method and device, server and storage medium
Wang Cfr-p: Counterfactual regret minimization with hierarchical policy abstraction, and its application to two-player mahjong
Nestoruk et al. Prediction of Football Games Results.
CN114048833B (en) Multi-person and large-scale incomplete information game method and device based on neural network virtual self-game
Chen et al. A Novel Reward Shaping Function for Single-Player Mahjong
Bucquet et al. The Bank is Open: AI in Sports Gambling
Chen et al. Learning Strategies for Imperfect Information Board Games Using Depth-Limited Counterfactual Regret Minimization and Belief State
Agrawal et al. Targeted upskilling framework based on player mistake context in online skill gaming platforms
Li et al. Scalable sub-game solving for imperfect-information games

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20201110)