CN113779870A - Parallelization imperfect information game strategy generation method and device, electronic equipment and storage medium

Parallelization imperfect information game strategy generation method and device, electronic equipment and storage medium

Info

Publication number
CN113779870A
CN113779870A
Authority
CN
China
Prior art keywords
strategy
game
blueprint
algorithm
abstract
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110975035.7A
Other languages
Chinese (zh)
Inventor
刘启涵
杨君
梁斌
芦维宁
陈章
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202110975035.7A priority Critical patent/CN113779870A/en
Publication of CN113779870A publication Critical patent/CN113779870A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 Computer-aided design [CAD]
    • G06F30/20 Design optimisation, verification or simulation
    • G06F30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/231 Hierarchical techniques, i.e. dividing or merging pattern sets so as to obtain a dendrogram

Abstract

The application belongs to the technical field of machine learning, and particularly relates to a parallelization imperfect information game strategy generation method and device, electronic equipment and a storage medium. The method comprises the following steps: compressing the original feature space of an imperfect information game with an imperfect-recall clustering method to obtain an abstract feature space; iteratively generating a blueprint strategy through self-play in the abstract feature space with the Monte Carlo counterfactual regret minimization (MCCFR) method; and performing distributed storage and updating of the blueprint strategy with a hash algorithm over feature strings. The method uses imperfect recall for feature-space abstraction, which improves strategy robustness. On the basis of the MCCFR algorithm, the overall expected payoff replaces the counterfactual return for interval-based regret updates, the sampled action frequencies are used to generate the final strategy, and feature mapping is combined with a parallel framework, which improves the convergence speed of the algorithm and shortens its training time.

Description

Parallelization imperfect information game strategy generation method and device, electronic equipment and storage medium
Technical Field
The application belongs to the technical field of machine learning, and particularly relates to a parallelization imperfect information game strategy generation method and device, electronic equipment and a storage medium.
Background
Gaming has long been one of the important application scenarios in the field of artificial intelligence. With the rapid development of reinforcement learning and deep learning, many perfect information game problems have been solved in the last decade, and by drawing on that success, the exploration of imperfect information game problems has made remarkable progress.
Compared with perfect information games, the incomplete knowledge of state information in imperfect information games makes many conventional iteration and search algorithms no longer applicable. How to reasonably exploit the partial invisibility of information and avoid the dilemma of circular reasoning in strategy exploration has always been the focus of imperfect information game research, and many algorithms have been derived on this basis. Among them, the counterfactual regret minimization algorithm (hereinafter CFR) is the most widely used practical method, generating an approximate Nash equilibrium strategy through self-play training. CFR approaches a Nash equilibrium strategy of an extensive-form game through repeated self-play between two regret-minimizing learners, but the game tree needs to be fully traversed in each iteration, so the algorithm complexity is high.
Disclosure of Invention
The present disclosure is directed to solving, at least to some extent, one of the technical problems in the related art, based on the inventors' recognition and understanding of the following facts and problems: the Monte Carlo counterfactual regret minimization algorithm (MCCFR), based on Monte Carlo sampling, reduces the traversal overhead as far as possible by choosing the sampling strategy appropriately, but suffers from large variance and slow convergence. The improved CFR+ algorithm optimizes the computation of the average strategy and improves convergence efficiency. In addition, the CFR-D algorithm combined with online subgame solving, and CFR-like algorithms that use deep neural networks for regret estimation, have stood out in solving large-scale imperfect information game problems. However, these strategy generation methods still have the following disadvantages: when facing large-scale game problems, the computational overhead is large and the number of game states grows exponentially; as the game depth increases, the reach probability tends to zero, so strategy updates become negligible and convergence is slow; and the sampling updates have high variance, so the final strategy easily falls into local optima.
In view of this, the present disclosure provides a parallelized imperfect information game strategy generation method, apparatus, electronic device and storage medium, so as to solve technical problems in the related art.
According to a first aspect of the disclosure, a parallelization imperfect information game strategy generation method is provided, which includes the following steps:
compressing the original feature space of the imperfect information game with an imperfect-recall clustering method to obtain an abstract feature space;
iteratively generating a blueprint strategy through self-play in the abstract feature space with the MCCFR method;
and performing distributed storage and updating of the blueprint strategy with a hash algorithm over feature strings.
The parallelized imperfect information game strategy generation method uses imperfect recall for feature-space abstraction, which improves strategy robustness. On the basis of the MCCFR algorithm, the overall expected payoff replaces the counterfactual return for interval-based regret updates, the sampled action frequencies are used to generate the final strategy, and feature mapping is combined with a parallel framework, which improves the convergence speed of the algorithm and shortens its training time.
Optionally, compressing the original feature space of the imperfect information game with the imperfect-recall clustering method to obtain the abstract feature space includes:
(1) At initialization, given that the current game has l rounds in total, the game tree to be abstracted by clustering has l layers; the clustering algorithm of layer k in the game tree is M_k (k = 1, …, l) and the number of clusters at layer k is C_k, where M denotes a clustering algorithm;
(2) Using a metric function D_l, cluster the original feature space of layer l of the game tree to obtain a clustering result A_l; compute the mean μ_i^l of the i-th class of A_l, and the distance between the i-th and j-th classes of A_l, d^l(i, j) = D_l(μ_i^l, μ_j^l);
(3) Let the current layer of the game tree be n = l - 1, and for each node x_n of this layer compute its potential histogram H_n(x_n), whose i-th element represents the probability that node x_n falls into the i-th class of the next-layer clustering result A_{n+1}; then, using the Earth Mover's Distance (EMD) between histograms as the metric function and M_n as the clustering algorithm, compute the clustering result A_n of the current layer n;
(4) Decrease n by 1 and repeat the computation of step (3) until n = 1, obtaining the clustering results A = (A_1, …, A_l) of all layers of the game tree, i.e., the imperfect-recall compressed features, and establish the mapping F(·) between the compressed abstract feature space and the original feature space.
Optionally, iteratively generating the blueprint strategy through self-play in the abstract feature space using the MCCFR method includes:
(1) Set a blueprint strategy, initialized as a random strategy, and set up repeatable game matches, the number of matches being P;
(2) Input the blueprint strategy into the game matches, generate samples through self-play, and, using the external-sampling MCCFR algorithm, compute the regrets of the self-play samples with the overall expected payoff in the abstract feature space in place of the counterfactual return of the original MCCFR algorithm;
assuming that the action taken by the player in this sample is a, the overall regret of the current player p for action a on the abstract information set I_p is then obtained;
(3) updating the blueprint strategy of the current game player p in the next iteration turn according to the regret value calculation method;
(4) Alternately update the strategies of the different players using interval updating and parallel Monte Carlo sampling, and set a regret-change-magnitude threshold R_switch and a strategy-update-count threshold T_switch: after the strategy of player p has been iterated, if the change in the accumulated total regret exceeds the threshold and the accumulated number of updates reaches the threshold, switch to the other player p' and repeat steps (2) to (3) to update the strategy of p'.
(5) Compute the final output blueprint strategy using action-sampling frequencies in place of the reach-probability weighting of the original MCCFR algorithm: after each strategy-update iteration, create a new game environment, input the current round's instantaneous strategy, run K simulated matches, record the number of times each action a is selected under each abstract information set I in each match, and obtain the blueprint strategy output after T rounds of iteration.
Optionally, performing distributed storage and updating of the blueprint strategy using a hash algorithm over feature strings includes:
(1) According to the abstract feature space and the mapping relation obtained by the imperfect-recall clustering method, encode the original feature space as characters, i.e., convert each original feature into a fixed-length hexadecimal string with the MD5 hash algorithm, and use key-value mapping in the string space to assign a fixed index to each original feature;
(2) store the regret values and the blueprint strategy in a distributed manner according to the assigned fixed indices;
(3) set a blueprint-strategy update-count threshold; for each blueprint strategy in the distributed store, whenever its number of updates reaches the threshold, save the current blueprint strategy and run a simulated-adversary test against the previously saved blueprint strategy, until the payoff of the current blueprint strategy in the simulated matches no longer improves significantly, at which point the imperfect information game strategy has been generated.
According to a second aspect of the present disclosure, a parallelized imperfect information game strategy generation apparatus is provided, including:
the space compression module is used for compressing the original feature space of the imperfect information game with the imperfect-recall clustering method to obtain an abstract feature space;
the self-play module is used for iteratively generating a blueprint strategy through self-play in the abstract feature space with the MCCFR (Monte Carlo CFR) method;
and the computing module is used for performing distributed storage and updating of the blueprint strategy with a hash algorithm over feature strings.
According to a third aspect of the present disclosure, an electronic device is presented, comprising:
a memory for storing processor-executable instructions;
a processor configured to perform:
compressing the original feature space of the imperfect information game with an imperfect-recall clustering method to obtain an abstract feature space;
iteratively generating a blueprint strategy through self-play in the abstract feature space with the MCCFR method;
and performing distributed storage and updating of the blueprint strategy with a hash algorithm over feature strings.
According to a fourth aspect of the present disclosure, a computer-readable storage medium is presented, having stored thereon a computer program for causing a computer to execute:
compressing the original feature space of the imperfect information game with an imperfect-recall clustering method to obtain an abstract feature space;
iteratively generating a blueprint strategy through self-play in the abstract feature space with the MCCFR method;
and performing distributed storage and updating of the blueprint strategy with a hash algorithm over feature strings.
According to the embodiments of the disclosure, feature-space abstraction with imperfect recall improves strategy robustness. On the basis of the MCCFR algorithm, the overall expected payoff replaces the counterfactual return for interval-based regret updates, the sampled action frequencies are used to generate the final strategy, and feature mapping is combined with a parallel framework, which improves the convergence speed of the algorithm and shortens its training time.
Additional aspects and advantages of the disclosure will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the disclosure.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is apparent that the drawings in the following description are only some embodiments of the present disclosure, and that other drawings can be derived from those drawings without inventive effort for a person skilled in the art.
The foregoing and/or additional aspects and advantages of the present disclosure will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
Fig. 1 is a flow block diagram of a parallelized imperfect information game strategy generation method according to an embodiment of the present disclosure.
Fig. 2 is a detailed flow diagram according to one embodiment of the present disclosure.
Fig. 3 is a schematic block diagram of a parallelized imperfect information game strategy generation apparatus according to an embodiment of the present disclosure.
FIG. 4 is a training process image according to one embodiment of the present disclosure.
Fig. 5 is a training result presentation image according to one embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Fig. 1 is a schematic flow diagram illustrating a parallelized imperfect information game strategy generation method according to one embodiment of the present disclosure. The parallelized imperfect information game strategy generation method of this embodiment can be applied to user equipment such as a mobile phone or a tablet computer.
As shown in fig. 1 and fig. 2, the parallelized imperfect information game strategy generation method may include the following steps:
in the step, according to a non-complete recall clustering method, the original feature space of an imperfect information game is compressed to obtain an abstract feature space.
In an embodiment, the compressing the original feature space of the imperfect information game by using a non-perfect recall clustering method to obtain an abstract feature space includes:
(1) At initialization, given that the current game has l rounds in total, the game tree to be abstracted by clustering has l layers; the clustering algorithm of layer k in the game tree is M_k (k = 1, …, l) and the number of clusters at layer k is C_k, where M denotes a clustering algorithm (e.g., the k-means clustering algorithm);
(2) Using a metric function D_l, cluster the original feature space of layer l of the game tree to obtain a clustering result A_l; compute the mean μ_i^l of the i-th class of A_l, and the distance between the i-th and j-th classes of A_l, d^l(i, j) = D_l(μ_i^l, μ_j^l);
(3) Let the current layer of the game tree be n = l - 1, and for each node x_n of this layer compute its potential histogram H_n(x_n). The potential histogram H_n(x_n) is a one-dimensional vector whose length equals the number of clusters C_{n+1} of layer n + 1 of the game tree, and its i-th element represents the probability that node x_n falls into the i-th class of the next-layer clustering result A_{n+1}. Using the Earth Mover's Distance (EMD) between histograms as the metric function and M_n as the clustering algorithm, compute the clustering result A_n of the current layer n;
(4) Decrease n by 1 and repeat the computation of step (3) until n = 1, obtaining the clustering results A = (A_1, …, A_l) of all layers of the game tree, i.e., the imperfect-recall compressed features, and establish the mapping F(·) between the compressed abstract feature space and the original feature space.
In one embodiment, the main unexpected technical effect of using the imperfect-recall clustering method with the histogram EMD as the metric function is the following: compared with the traditional perfect-recall abstraction, imperfect-recall abstraction allows information acquired at earlier decision points (such as the opponent's historical actions) to be ignored when clustering the current-round information sets of a player. In the imperfect information game problem, imperfect-recall clustering thus blurs the historical trajectory, so that a player can, to a certain extent, forget its own or the opponent's historical actions; this reduces the possibility of being influenced by the opponent's deceptive actions and improves the robustness of play. A code sketch of this clustering procedure is given below.
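The sketch below, in Python, outlines the bottom-up, potential-aware clustering just described. The function names, the use of scikit-learn's KMeans for the last layer, and the small k-medoids routine standing in for EMD-based clustering of the potential histograms are assumptions of this illustration, not details fixed by the present disclosure.

```python
import numpy as np
from sklearn.cluster import KMeans


def emd_1d(h1, h2):
    # 1-D Earth Mover's Distance between two normalized histograms over the
    # same ordered support: the L1 distance between their cumulative sums.
    return np.abs(np.cumsum(h1 - h2)).sum()


def cluster_last_layer(features, n_clusters, seed=0):
    # Layer l: ordinary k-means on the raw last-round features (Euclidean metric).
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(features)


def potential_histograms(successor_labels, n_next_clusters):
    # successor_labels[i] holds the next-layer cluster labels of node i's successors;
    # the i-th row is the empirical distribution over those clusters.
    hists = np.zeros((len(successor_labels), n_next_clusters))
    for i, labels in enumerate(successor_labels):
        for c in labels:
            hists[i, c] += 1.0
        hists[i] /= max(len(labels), 1)
    return hists


def emd_kmedoids(hists, n_clusters, n_iter=20, seed=0):
    # Minimal k-medoids under EMD, standing in for the "k-means with histogram
    # EMD" step of the abstraction (an assumption of this sketch).
    rng = np.random.default_rng(seed)
    n = len(hists)
    dist = np.array([[emd_1d(hists[i], hists[j]) for j in range(n)] for i in range(n)])
    medoids = rng.choice(n, size=n_clusters, replace=False)
    for _ in range(n_iter):
        labels = dist[:, medoids].argmin(axis=1)
        new_medoids = medoids.copy()
        for k in range(n_clusters):
            members = np.where(labels == k)[0]
            if len(members):
                new_medoids[k] = members[dist[np.ix_(members, members)].sum(axis=1).argmin()]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return dist[:, medoids].argmin(axis=1)
```

Run bottom-up, cluster_last_layer gives A_l; then, for n = l - 1, …, 1, the potential histograms of layer n are built from the labels of layer n + 1 and clustered with emd_kmedoids to give A_n, which realizes the imperfect-recall abstraction.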
In step 2, a blueprint strategy is iteratively generated through self-play in the abstract feature space using the MCCFR method.
In one embodiment, iteratively generating the blueprint strategy through self-play in the abstract feature space using the MCCFR method includes:
(1) Set a blueprint strategy, initialized as a random strategy, and set up repeatable game matches, the number of matches being P;
(2) Input the blueprint strategy into the game matches and generate samples through self-play; using the external-sampling MCCFR algorithm, compute the regrets of the self-play samples with the overall expected payoff in the abstract feature space in place of the counterfactual return of the original MCCFR algorithm. Here T is the current iteration round, p is the current player, I_p is the abstract information set, i.e., the abstract feature of the generated sample under the mapping F(·), and σ^T is the blueprint strategy at the current iteration round; the overall expected payoff is computed from the approximate access probability $\tilde{\pi}^{\sigma^T}(I_p)$ of I_p under the blueprint strategy σ^T and the approximate payoff expectation $\tilde{u}_p^{\sigma^T}(I_p)$ of player p on I_p under σ^T.
Assuming that the action taken by the player in this sample is a, the overall regret of the current player p for action a on the abstract information set I_p accumulates as

$$R^T(a\,|\,I_p)=\sum_{t=1}^{T}\Bigl(\tilde{v}_p^{\sigma^t}(I_p,a)-\tilde{v}_p^{\sigma^t}(I_p)\Bigr),$$

where $\tilde{v}_p^{\sigma^t}(I_p)$ denotes the overall expected payoff of player p on I_p under σ^t and $\tilde{v}_p^{\sigma^t}(I_p, a)$ the overall expected payoff when action a is taken at I_p.
(3) According to the above regret calculation, the blueprint strategy σ^{T+1} of the current player p for the next iteration round T + 1 is updated by regret matching:

$$\sigma^{T+1}(a\,|\,I_p)=\begin{cases}\dfrac{\max\{R^T(a\,|\,I_p),\,0\}}{\sum_{a'\in A}\max\{R^T(a'\,|\,I_p),\,0\}}, & \text{if }\sum_{a'\in A}\max\{R^T(a'\,|\,I_p),\,0\}>0,\\[2ex]\dfrac{1}{|A|}, & \text{otherwise,}\end{cases}$$

where R^T(a | I_p) is the overall regret value, A is the set of available actions at I_p, and |A| is the number of available actions.
(4) Alternately update the strategies of the different players using interval updating and parallel Monte Carlo sampling, and set a regret-change-magnitude threshold R_switch and a strategy-update-count threshold T_switch: after the strategy of player p has been iterated, if the change in the accumulated total regret exceeds R_switch and the accumulated number of updates reaches T_switch, switch to the other player p' and repeat steps (2) to (3) to update the strategy of p'.
(5) Compute the final output blueprint strategy using action-sampling frequencies in place of the reach-probability weighting of the original MCCFR algorithm: after each strategy-update iteration, create a new game environment, input the current round's instantaneous strategy, and run K simulated matches, recording the number of times each action a is selected under each abstract information set I, i.e., the action frequency f(a | t, σ^t, I). The blueprint strategy output after T rounds of iteration is

$$\bar{\sigma}^T(a\,|\,I)=\frac{\sum_{t=1}^{T} f(a\,|\,t,\sigma^t,I)}{\sum_{a'\in A}\sum_{t=1}^{T} f(a'\,|\,t,\sigma^t,I)}.$$

A code sketch of these updates follows.
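The sketch below, in Python, gives a tabular form of the regret-matching update of step (3) and the action-frequency averaging of step (5). The class name, the dictionary-based storage keyed by abstract information set, and the way sampled payoffs are fed in are assumptions of this sketch rather than the implementation of the present disclosure.

```python
import numpy as np
from collections import defaultdict


class BlueprintTables:
    """Tabular regrets, current strategy, and action-frequency counts keyed by
    abstract information set (e.g. a feature string); a simplified sketch."""

    def __init__(self, n_actions):
        self.n_actions = n_actions
        self.regret = defaultdict(lambda: np.zeros(n_actions))  # R^T(a | I_p)
        self.freq = defaultdict(lambda: np.zeros(n_actions))    # f(a | t, sigma^t, I)

    def regret_matching(self, info_set):
        # sigma^{T+1}(a | I_p): positive-regret matching with a uniform fallback.
        positive = np.maximum(self.regret[info_set], 0.0)
        total = positive.sum()
        if total > 0.0:
            return positive / total
        return np.full(self.n_actions, 1.0 / self.n_actions)

    def accumulate_regret(self, info_set, action_payoffs, baseline_payoff):
        # Add the per-action advantage of the sampled overall expected payoffs
        # over the information set's payoff to the cumulative regret R^T(a | I_p).
        self.regret[info_set] += np.asarray(action_payoffs) - baseline_payoff

    def record_action(self, info_set, action):
        # Count the actions chosen in the K simulated matches of each round.
        self.freq[info_set][action] += 1.0

    def final_strategy(self, info_set):
        # Output strategy: normalized action frequencies instead of
        # reach-probability-weighted averaging.
        counts = self.freq[info_set]
        total = counts.sum()
        if total > 0.0:
            return counts / total
        return np.full(self.n_actions, 1.0 / self.n_actions)
```

In an actual trainer, accumulate_regret would receive the per-action overall expected payoffs estimated from the external-sampling traversal, record_action would be called for the actions chosen in the K simulated matches of each round, and final_strategy would be read out after T rounds.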
in one embodiment, the overall expected revenue replaces the counterfactual returns in the original MCCFR algorithm to compute the regret value of the self-game generated samples, the different game player strategies are alternately updated using interval update and concurrent monte carlo sampling methods, the final output blueprint strategy is computed using action sampling to replace the arrival probability weighting in the original MCCFR algorithm, resulting in unexpected technical effects including:
A. As the scale of the game problem grows and the game tree deepens, the original MCCFR algorithm cannot traverse all possible actions, and the access probability of an abstract information set I_p keeps decreasing as the exploration depth increases, so the absolute values of the corresponding counterfactual return and of the overall regret R^T(a | I_p) keep shrinking; the regret updates then become too small and convergence slows down. Sampling the overall expected payoff in place of the counterfactual return, i.e., not weighting by the reach probability of the information set, prevents the update magnitude from collapsing to zero during strategy iteration, which markedly improves the convergence speed and avoids update stagnation.
B. A single iteration of the original MCCFR algorithm gives an unreliable regret update, so the expected behaviour of the regret updates has to be ensured by increasing the number of iteration rounds. Alternately updating the strategies of the different players with interval updating and parallel Monte Carlo sampling markedly reduces the randomness caused by sampling each chance event's action only once in the original MCCFR algorithm, and increases the algorithm's coverage of the game tree while keeping the time cost of a single iteration essentially unchanged. This makes the method better suited to game problems whose chance events have highly dispersed actions, and improves the quality of the finally generated strategy.
C. Using action-sampling frequencies in place of the reach-probability weighting of the original MCCFR algorithm to compute the final output blueprint strategy, compared with the original averaging based on reach probabilities and regret values, reduces the bias that an abnormal overall regret in some iteration round could introduce into the final strategy, and improves the stability and robustness of the algorithm.
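As a small illustration of the interval-updating rule of step (4), the Python check below switches the player being updated once both thresholds are met; the function and argument names are assumptions of this sketch, and the default threshold values follow the embodiment described later.

```python
def should_switch_player(total_regret_change: float,
                         updates_since_switch: int,
                         r_switch: float = 200000.0,
                         t_switch: int = 100) -> bool:
    # Interval updating: move on to the other player once the accumulated
    # total-regret change exceeds R_switch and the number of strategy updates
    # since the last switch reaches T_switch.
    return total_regret_change > r_switch and updates_since_switch >= t_switch
```

The outer training loop would alternate between p and p', running parallel Monte Carlo sampling for the active player until this check fires, then resetting the counters.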
In step 3, the blueprint strategy is stored and updated in a distributed manner using a hash algorithm over feature strings.
In one embodiment, the distributed storage and updating of the blueprint strategy using the hash algorithm over feature strings includes:
(1) According to the abstract feature space and the mapping relation obtained by the imperfect-recall clustering method, encode the original feature space as characters, i.e., convert each original feature into a fixed-length hexadecimal string with the MD5 hash algorithm, and use key-value mapping in the string space to assign a fixed index to each original feature (see the code sketch after this list);
(2) store the regret values and the blueprint strategy in a distributed manner according to the assigned fixed indices;
(3) set a blueprint-strategy update-count threshold; for each blueprint strategy in the distributed store, whenever its number of updates reaches the threshold, save the current blueprint strategy and run a simulated-adversary test against the previously saved blueprint strategy, until the payoff of the current blueprint strategy in the simulated matches no longer improves significantly, at which point the imperfect information game strategy has been generated.
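A minimal sketch of the feature-string hashing of step (1), assuming Python's standard hashlib module; the digest truncation, the example feature string, and the modulo mapping to a shard index are illustrative assumptions, not details fixed by the present disclosure.

```python
import hashlib


def feature_key(feature_string: str, digest_len: int = 16) -> str:
    # Fixed-length hexadecimal key for an (abstracted) original feature.
    return hashlib.md5(feature_string.encode("utf-8")).hexdigest()[:digest_len]


def shard_index(feature_string: str, n_shards: int = 10) -> int:
    # Map the hexadecimal key to one of the distributed regret / strategy stores.
    return int(feature_key(feature_string), 16) % n_shards


# Hypothetical usage: "round2|cluster_17|bet_history_b" is a made-up feature encoding.
key = feature_key("round2|cluster_17|bet_history_b")
store = shard_index("round2|cluster_17|bet_history_b")
```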
Corresponding to the parallelization imperfect information game strategy generation method, the disclosure also provides a parallelization imperfect information game strategy generation device.
Fig. 3 is a schematic block diagram of a parallelized imperfect information game strategy generating device, which includes:
the space compression module is used for compressing the original feature space of the imperfect information game with the imperfect-recall clustering method to obtain an abstract feature space;
the self-play module is used for iteratively generating a blueprint strategy through self-play in the abstract feature space with the MCCFR (Monte Carlo CFR) method;
and the computing module is used for performing distributed storage and updating of the blueprint strategy with a hash algorithm over feature strings.
An embodiment of the present disclosure also provides an electronic device, including:
a memory for storing processor-executable instructions; and
a processor configured to perform:
compressing the original feature space of the imperfect information game with an imperfect-recall clustering method to obtain an abstract feature space;
iteratively generating a blueprint strategy through self-play in the abstract feature space with the MCCFR method;
and performing distributed storage and updating of the blueprint strategy with a hash algorithm over feature strings.
Embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon a computer program for causing a computer to execute:
compressing the original feature space of the imperfect information game with an imperfect-recall clustering method to obtain an abstract feature space;
iteratively generating a blueprint strategy through self-play in the abstract feature space with the MCCFR method;
and performing distributed storage and updating of the blueprint strategy with a hash algorithm over feature strings.
It should be noted that, in the embodiments of the present disclosure, the Processor may be a Central Processing Unit (CPU), or may be another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The memory may be used for storing the computer program and/or modules, and the processor may realize the various functions of the parallelized imperfect information game strategy generation apparatus by running or executing the computer program and/or modules stored in the memory and calling data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data created according to the use of the device (such as audio data, a phonebook, etc.), and the like. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or another non-volatile solid-state storage device. If the modules/units of the parallelized imperfect information game strategy generation apparatus are realized in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on such understanding, all or part of the flow of the method of the embodiments described above can be realized by the present disclosure, and the method can also be realized by instructing the relevant hardware through a computer program, which can be stored in a computer-readable storage medium; when the computer program is executed by a processor, the steps of the method embodiments described above can be realized. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, etc. It should be noted that the above-described device embodiments are merely illustrative, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units; they may be located in one place or distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, in the drawings of the embodiment of the device provided by the present disclosure, the connection relationship between the modules indicates that there is a communication connection between them, and may be specifically implemented as one or more communication buses or signal lines. One of ordinary skill in the art can understand and implement it without inventive effort.
The parallelization imperfect information game strategy generation method is explained in detail through a specific embodiment.
Take heads-up no-limit Texas hold'em as an example. This game is a typical two-player zero-sum imperfect-information non-cooperative dynamic game, on which the traditional CFR algorithm iterates extremely slowly and has difficulty converging to an approximate Nash equilibrium strategy. For this problem, the total number of game-tree layers is l = 4. When computing the abstract feature space, the final number of clusters of layer 1 of the game tree is set to C_1 = 169, and the final number of clusters of the other three layers is C_2 = C_3 = C_4 = 200. The clustering algorithm M_4 of the last layer is the k-means algorithm, with the distance metric D_4 being the Euclidean distance between expected hand strength (EHS) distributions; the other three layers also use the k-means algorithm, with the histogram EMD as the distance metric. The self-play iteration that generates the blueprint strategy uses 200 parallel simulation environments, the number of distributed memories for regret storage is 10, the player regret-change-magnitude threshold is set to R_switch = 200000, and the strategy-update-count threshold is set to T_switch = 100.
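For reference, the settings listed above can be collected into a configuration sketch; the key names are assumptions of this illustration, while the values are those stated in the embodiment.

```python
blueprint_config = {
    "game_tree_layers": 4,                        # l = 4 rounds
    "clusters_per_layer": [169, 200, 200, 200],   # C_1 = 169, C_2 = C_3 = C_4 = 200
    "last_layer_metric": "euclidean_on_EHS",      # D_4: Euclidean distance on EHS distributions
    "other_layer_metric": "histogram_EMD",        # D_1..D_3: Earth Mover's Distance on histograms
    "parallel_environments": 200,
    "regret_store_shards": 10,
    "R_switch": 200000,
    "T_switch": 100,
}
```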
Fig. 4 compares the training process of the above algorithm with that of the original MCCFR algorithm in the heads-up no-limit Texas hold'em environment; the parallelized imperfect information strategy generation algorithm is superior to the original MCCFR algorithm both in convergence speed and in the performance of the final converged strategy.
Fig. 5 shows the online match record of the strategy generated by the algorithm against the champion program of the 2017 international poker AI competition; as the number of iterations increases, the performance of the strategy gradually stabilizes and surpasses that of the champion program.
Although embodiments of the present disclosure have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present disclosure, and that changes, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present disclosure.

Claims (7)

1. A parallelized imperfect information game strategy generation method is characterized by comprising the following steps:
compressing the original feature space of the imperfect information game with an imperfect-recall clustering method to obtain an abstract feature space;
iteratively generating a blueprint strategy through self-play in the abstract feature space with the MCCFR method;
and performing distributed storage and updating of the blueprint strategy with a hash algorithm over feature strings.
2. The parallelized imperfect information game strategy generation method according to claim 1, wherein compressing the original feature space of the imperfect information game with the imperfect-recall clustering method to obtain the abstract feature space comprises:
(1) at initialization, given that the current game has l rounds in total, the game tree to be abstracted by clustering has l layers; the clustering algorithm of layer k in the game tree is M_k (k = 1, …, l) and the number of clusters at layer k is C_k, where M denotes a clustering algorithm;
(2) using a metric function D_l, clustering the original feature space of layer l of the game tree to obtain a clustering result A_l, computing the mean μ_i^l of the i-th class of A_l, and computing the distance between the i-th and j-th classes of A_l, d^l(i, j) = D_l(μ_i^l, μ_j^l);
(3) letting the current layer of the game tree be n = l - 1, computing for each node x_n of this layer its potential histogram H_n(x_n), whose i-th element represents the probability that node x_n falls into the i-th class of the next-layer clustering result A_{n+1}, and, using the Earth Mover's Distance (EMD) between histograms as the metric function and M_n as the clustering algorithm, computing the clustering result A_n of the current layer n;
(4) decreasing n by 1 and repeating the computation of step (3) until n = 1, obtaining the clustering results A = (A_1, …, A_l) of all layers of the game tree, i.e., the imperfect-recall compressed features, and establishing the mapping F(·) between the compressed abstract feature space and the original feature space.
3. The parallelized imperfect information game strategy generation method according to claim 1, wherein iteratively generating the blueprint strategy through self-play in the abstract feature space using the MCCFR method comprises:
(1) setting a blueprint strategy, initialized as a random strategy, and setting up repeatable game matches, the number of matches being P;
(2) inputting the blueprint strategy into the game matches, generating samples through self-play, and, using the external-sampling MCCFR algorithm, computing the regrets of the self-play samples with the overall expected payoff in the abstract feature space in place of the counterfactual return of the original MCCFR algorithm;
assuming that the action taken by the player in this sample is a, the overall regret of the current player p for action a on the abstract information set I_p is then obtained;
(3) updating the blueprint strategy of the current game player p in the next iteration turn according to the regret value calculation method;
(4) alternately updating the strategies of the different players using interval updating and parallel Monte Carlo sampling, setting a regret-change-magnitude threshold and a strategy-update-count threshold, and, after the strategy of player p has been iterated, if the change in the accumulated total regret exceeds the threshold and the accumulated number of updates reaches the threshold, switching to the other player p' and repeating steps (2) to (3) to update the strategy of p';
(5) computing the final output blueprint strategy using action-sampling frequencies in place of the reach-probability weighting of the original MCCFR algorithm: after each round of strategy-update iteration, creating a new game environment, inputting the current round's instantaneous strategy, running K simulated matches, recording the number of times each action a is selected under each abstract information set I in each match, and outputting the blueprint strategy after T rounds of iteration.
4. The parallelized imperfect information game strategy generation method according to claim 1, wherein the distributed storage and updating of the blueprint strategy using the hash algorithm over feature strings comprises:
(1) according to the abstract feature space and the mapping relation obtained by the imperfect-recall clustering method, encoding the original feature space as characters, i.e., converting each original feature into a fixed-length hexadecimal string with the MD5 hash algorithm, and using key-value mapping in the string space to assign a fixed index to each original feature;
(2) storing the regret values and the blueprint strategy in a distributed manner according to the assigned fixed indices;
(3) setting a blueprint-strategy update-count threshold, and, for each blueprint strategy in the distributed store, whenever its number of updates reaches the threshold, saving the current blueprint strategy and running a simulated-adversary test against the previously saved blueprint strategy, until the payoff of the current blueprint strategy in the simulated matches no longer improves significantly, whereby the imperfect information game strategy is generated.
5. A parallelized imperfect information game strategy generation device is characterized by comprising:
the space compression module is used for compressing the original feature space of the imperfect information game with the imperfect-recall clustering method to obtain an abstract feature space;
the self-play module is used for iteratively generating a blueprint strategy through self-play in the abstract feature space with the MCCFR (Monte Carlo CFR) method;
and the computing module is used for performing distributed storage and updating of the blueprint strategy with a hash algorithm over feature strings.
6. An electronic device, comprising:
a memory for storing processor-executable instructions;
a processor configured to perform the parallelized imperfect information game strategy generation method according to any one of claims 1 to 4.
7. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program for causing a computer to execute the parallelized imperfect information game strategy generation method according to any one of claims 1 to 4.
CN202110975035.7A 2021-08-24 2021-08-24 Parallelization imperfect information game strategy generation method and device, electronic equipment and storage medium Pending CN113779870A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110975035.7A CN113779870A (en) 2021-08-24 2021-08-24 Parallelization imperfect information game strategy generation method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110975035.7A CN113779870A (en) 2021-08-24 2021-08-24 Parallelization imperfect information game strategy generation method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113779870A true CN113779870A (en) 2021-12-10

Family

ID=78838840

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110975035.7A Pending CN113779870A (en) 2021-08-24 2021-08-24 Parallelization imperfect information game strategy generation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113779870A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170060930A1 (en) * 2015-08-24 2017-03-02 Palantir Technologies Inc. Feature clustering of users, user correlation database access, and user interface generation system
CN110404265A (en) * 2019-07-25 2019-11-05 哈尔滨工业大学(深圳) A kind of non-complete information machine game method of more people based on game final phase of a chess game online resolution, device, system and storage medium
US20210182718A1 (en) * 2019-12-12 2021-06-17 Alipay (Hangzhou) Information Technology Co., Ltd. Determining action selection policies of an execution device
CN111905373A (en) * 2020-07-23 2020-11-10 深圳艾文哲思科技有限公司 Artificial intelligence decision method and system based on game theory and Nash equilibrium
CN112329348A (en) * 2020-11-06 2021-02-05 东北大学 Intelligent decision-making method for military countermeasure game under incomplete information condition

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HU, XY ET AL.: "A Fast-Convergence Method of Monte Carlo Counterfactual Regret Minimization for Imperfect Information Dynamic Games", 9TH IEEE DATA DRIVEN CONTROL AND LEARNING SYSTEMS CONFERENCE (DDCLS), pages 1048 - 1053 *

Similar Documents

Publication Publication Date Title
US20230191229A1 (en) Method and System for Interactive, Interpretable, and Improved Match and Player Performance Predictions in Team Sports
CN109902706B (en) Recommendation method and device
Antonoglou et al. Planning in stochastic environments with a learned model
CN103942571A (en) Graphic image sorting method based on genetic programming algorithm
Yankelevsky et al. Finding gems: Multi-scale dictionaries for high-dimensional graph signals
CN116090536A (en) Neural network optimization method, device, computer equipment and storage medium
Chen et al. Linearity grafting: Relaxed neuron pruning helps certifiable robustness
Liu et al. Model-free neural counterfactual regret minimization with bootstrap learning
Gummadi et al. Mean field analysis of multi-armed bandit games
Cho et al. Espn: Extremely sparse pruned networks
CN113779870A (en) Parallelization imperfect information game strategy generation method and device, electronic equipment and storage medium
CN111957053A (en) Game player matching method and device, storage medium and electronic equipment
CN111178541B (en) Game artificial intelligence system and performance improving system and method thereof
Huang et al. Optimizer amalgamation
Chouliaras et al. Feed Forward Neural Network Sparsificationwith Dynamic Pruning
Oertell et al. RL for Consistency Models: Faster Reward Guided Text-to-Image Generation
CN115392456B (en) Fusion optimization algorithm asymptotically normal high migration countermeasure sample generation method
Sun et al. Hiabp: Hierarchical initialized abp for unsupervised representation learning
Tzikas et al. Incremental relevance vector machine with kernel learning
Deja et al. Adapt & Align: Continual Learning with Generative Models Latent Space Alignment
Habara et al. Convergence analysis and acceleration of the smoothing methods for solving extensive-form games
CN117094769A (en) Click rate prediction method, click rate prediction device, computer equipment and storage medium
Li et al. RL-CFR: Improving Action Abstraction for Imperfect Information Extensive-Form Games with Reinforcement Learning
Berger et al. Learning Discrete Weights and Activations Using the Local Reparameterization Trick
Xu et al. Deep Variational Incomplete Multi-View Clustering: Exploring Shared Clustering Structures

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination