CN113779870A - Parallelization imperfect information game strategy generation method and device, electronic equipment and storage medium

Parallelization imperfect information game strategy generation method and device, electronic equipment and storage medium

Info

Publication number
CN113779870A
CN113779870A
Authority
CN
China
Prior art keywords
strategy
game
blueprint
algorithm
abstract
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110975035.7A
Other languages
Chinese (zh)
Inventor
刘启涵
杨君
梁斌
芦维宁
陈章
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN202110975035.7A priority Critical patent/CN113779870A/en
Publication of CN113779870A publication Critical patent/CN113779870A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 Computer-aided design [CAD]
    • G06F30/20 Design optimisation, verification or simulation
    • G06F30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/231 Hierarchical techniques, i.e. dividing or merging pattern sets so as to obtain a dendrogram

Abstract

The application belongs to the technical field of machine learning, and particularly relates to a parallelization imperfect information game strategy generation method and device, electronic equipment and a storage medium. The method comprises the following steps: compressing the original feature space of an imperfect information game with an imperfect-recall clustering method to obtain an abstract feature space; iteratively generating a blueprint strategy through self-play in the abstract feature space with the Monte Carlo counterfactual regret minimization (MCCFR) method; and performing distributed storage and updating of the blueprint strategy with a hash algorithm over feature strings. The method uses imperfect recall for feature-space abstraction, which improves strategy robustness. On the basis of the MCCFR algorithm, the overall expected payoff replaces the counterfactual return for interval-based regret updates, the sampled action frequencies are used to generate the final strategy, and feature mapping is combined with a parallel framework, which improves the convergence speed of the algorithm and shortens its training time.

Description

Parallelization imperfect information game strategy generation method and device, electronic equipment and storage medium
Technical Field
The application belongs to the technical field of machine learning, and particularly relates to a parallelization imperfect information game strategy generation method and device, electronic equipment and a storage medium.
Background
Gaming has long been one of the important application scenarios in the field of artificial intelligence. With the rapid development of reinforcement learning and deep learning, many perfect information game problems have been solved in the last decade, and by drawing on that success, the exploration of imperfect information game problems has made remarkable progress.
Compared with perfect information games, the incomplete knowledge of state information in imperfect information games makes many conventional iteration and search algorithms no longer applicable. How to reasonably exploit the partial invisibility of information and avoid the dilemma of circular reasoning in strategy exploration has always been the focus of imperfect information game research, and many algorithms have been derived on this basis. Among them, the counterfactual regret minimization algorithm (hereinafter CFR) is the most widely used practical method, generating an approximate Nash equilibrium strategy through self-play training. CFR approaches a Nash equilibrium strategy of an extensive-form game through repeated self-play between two regret-minimizing learners, but the game tree needs to be fully traversed in each iteration, so the algorithm complexity is high.
Disclosure of Invention
The present disclosure is directed to solving, at least to some extent, one of the technical problems in the related art, based on the inventors' recognition and understanding of the following facts and problems: the Monte Carlo counterfactual regret minimization algorithm (MCCFR), based on Monte Carlo sampling, reduces the traversal overhead as far as possible by choosing the sampling strategy appropriately, but suffers from large variance and slow convergence. The improved CFR+ algorithm optimizes the computation of the average strategy and improves convergence efficiency. In addition, the CFR-D algorithm combined with online subgame solving, and CFR-like algorithms that use deep neural networks for regret estimation, have stood out in solving large-scale imperfect information game problems. However, these strategy generation methods still have the following disadvantages: when facing large-scale game problems, the computational overhead is large and the number of game states grows exponentially; as the game depth increases, the reach probability tends to zero, so strategy updates become negligible and convergence is slow; and the sampling updates have high variance, so the final strategy easily falls into local optima.
In view of this, the present disclosure provides a parallelized imperfect information game strategy generation method, apparatus, electronic device and storage medium, so as to solve technical problems in the related art.
According to a first aspect of the disclosure, a parallelization imperfect information game strategy generation method is provided, which includes the following steps:
compressing the original feature space of the imperfect information game with an imperfect-recall clustering method to obtain an abstract feature space;
iteratively generating a blueprint strategy through self-play in the abstract feature space with the MCCFR method;
and performing distributed storage and updating of the blueprint strategy with a hash algorithm over feature strings.
The parallelized imperfect information game strategy generation method uses imperfect recall for feature-space abstraction, which improves strategy robustness. On the basis of the MCCFR algorithm, the overall expected payoff replaces the counterfactual return for interval-based regret updates, the sampled action frequencies are used to generate the final strategy, and feature mapping is combined with a parallel framework, which improves the convergence speed of the algorithm and shortens its training time.
Optionally, compressing the original feature space of the imperfect information game with the imperfect-recall clustering method to obtain the abstract feature space includes:
(1) At initialization, given that the current game has l rounds in total, the game tree to be abstracted by clustering has l layers; the clustering algorithm of layer k in the game tree is M_k (k = 1, …, l) and the number of clusters at layer k is C_k, where M denotes a clustering algorithm;
(2) Using a metric function D_l, cluster the original feature space of layer l of the game tree to obtain a clustering result A_l; compute the mean μ_i^l of the i-th class of A_l, and the distance between the i-th and j-th classes of A_l, d^l(i, j) = D_l(μ_i^l, μ_j^l);
(3) Let the current layer of the game tree be n = l - 1, and for each node x_n of this layer compute its potential histogram H_n(x_n), whose i-th element represents the probability that node x_n falls into the i-th class of the next-layer clustering result A_{n+1}; then, using the Earth Mover's Distance (EMD) between histograms as the metric function and M_n as the clustering algorithm, compute the clustering result A_n of the current layer n;
(4) Decrease n by 1 and repeat the computation of step (3) until n = 1, obtaining the clustering results A = (A_1, …, A_l) of all layers of the game tree, i.e., the imperfect-recall compressed features, and establish the mapping F(·) between the compressed abstract feature space and the original feature space.
Optionally, iteratively generating the blueprint strategy through self-play in the abstract feature space using the MCCFR method includes:
(1) Set a blueprint strategy, initialized as a random strategy, and set up repeatable game matches, the number of matches being P;
(2) Input the blueprint strategy into the game matches, generate samples through self-play, and, using the external-sampling MCCFR algorithm, compute the regrets of the self-play samples with the overall expected payoff in the abstract feature space in place of the counterfactual return of the original MCCFR algorithm;
assuming that the action taken by the player in this sample is a, the overall regret of the current player p for action a on the abstract information set I_p is then obtained;
(3) updating the blueprint strategy of the current game player p in the next iteration turn according to the regret value calculation method;
(4) Alternately update the strategies of the different players using interval updating and parallel Monte Carlo sampling, and set a regret-change-magnitude threshold R_switch and a strategy-update-count threshold T_switch: after the strategy of player p has been iterated, if the change in the accumulated total regret exceeds the threshold and the accumulated number of updates reaches the threshold, switch to the other player p' and repeat steps (2) to (3) to update the strategy of p'.
(5) Compute the final output blueprint strategy using action-sampling frequencies in place of the reach-probability weighting of the original MCCFR algorithm: after each strategy-update iteration, create a new game environment, input the current round's instantaneous strategy, run K simulated matches, record the number of times each action a is selected under each abstract information set I in each match, and obtain the blueprint strategy output after T rounds of iteration.
Optionally, performing distributed storage and updating of the blueprint strategy using a hash algorithm over feature strings includes:
(1) According to the abstract feature space and the mapping relation obtained by the imperfect-recall clustering method, encode the original feature space as characters, i.e., convert each original feature into a fixed-length hexadecimal string with the MD5 hash algorithm, and use key-value mapping in the string space to assign a fixed index to each original feature;
(2) store the regret values and the blueprint strategy in a distributed manner according to the assigned fixed indices;
(3) set a blueprint-strategy update-count threshold; for each blueprint strategy in the distributed store, whenever its number of updates reaches the threshold, save the current blueprint strategy and run a simulated-adversary test against the previously saved blueprint strategy, until the payoff of the current blueprint strategy in the simulated matches no longer improves significantly, at which point the imperfect information game strategy has been generated.
According to a second aspect of the present disclosure, a parallelized imperfect information game strategy generation apparatus is provided, including:
the space compression module is used for compressing the original feature space of the imperfect information game with the imperfect-recall clustering method to obtain an abstract feature space;
the self-play module is used for iteratively generating a blueprint strategy through self-play in the abstract feature space with the MCCFR (Monte Carlo CFR) method;
and the computing module is used for performing distributed storage and updating of the blueprint strategy with a hash algorithm over feature strings.
According to a third aspect of the present disclosure, an electronic device is presented, comprising:
a memory for storing processor-executable instructions;
a processor configured to perform:
compressing the original feature space of the imperfect information game with an imperfect-recall clustering method to obtain an abstract feature space;
iteratively generating a blueprint strategy through self-play in the abstract feature space with the MCCFR method;
and performing distributed storage and updating of the blueprint strategy with a hash algorithm over feature strings.
According to a fourth aspect of the present disclosure, a computer-readable storage medium is presented, having stored thereon a computer program for causing a computer to execute:
compressing the original feature space of the imperfect information game with an imperfect-recall clustering method to obtain an abstract feature space;
iteratively generating a blueprint strategy through self-play in the abstract feature space with the MCCFR method;
and performing distributed storage and updating of the blueprint strategy with a hash algorithm over feature strings.
According to the embodiments of the disclosure, feature-space abstraction with imperfect recall improves strategy robustness. On the basis of the MCCFR algorithm, the overall expected payoff replaces the counterfactual return for interval-based regret updates, the sampled action frequencies are used to generate the final strategy, and feature mapping is combined with a parallel framework, which improves the convergence speed of the algorithm and shortens its training time.
Additional aspects and advantages of the disclosure will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the disclosure.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is apparent that the drawings in the following description are only some embodiments of the present disclosure, and that other drawings can be derived from those drawings without inventive effort for a person skilled in the art.
The foregoing and/or additional aspects and advantages of the present disclosure will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
Fig. 1 is a flow block diagram of a parallelized imperfect information game strategy generation method according to an embodiment of the present disclosure.
Fig. 2 is a detailed flow diagram according to one embodiment of the present disclosure.
Fig. 3 is a schematic block diagram of a parallelized imperfect information game strategy generation apparatus according to an embodiment of the present disclosure.
FIG. 4 is a training process image according to one embodiment of the present disclosure.
Fig. 5 is a training result presentation image according to one embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Fig. 1 is a schematic flow diagram illustrating a parallelized imperfect information game strategy generation method according to one embodiment of the present disclosure. The parallelized imperfect information game strategy generation method of this embodiment can be applied to user equipment such as a mobile phone or a tablet computer.
As shown in fig. 1 and fig. 2, the parallelized imperfect information game strategy generation method may include the following steps:
in the step, according to a non-complete recall clustering method, the original feature space of an imperfect information game is compressed to obtain an abstract feature space.
In an embodiment, the compressing the original feature space of the imperfect information game by using a non-perfect recall clustering method to obtain an abstract feature space includes:
(1) At initialization, given that the current game has l rounds in total, the game tree to be abstracted by clustering has l layers; the clustering algorithm of layer k in the game tree is M_k (k = 1, …, l) and the number of clusters at layer k is C_k, where M denotes a clustering algorithm (e.g., the k-means clustering algorithm);
(2) Using a metric function D_l, cluster the original feature space of layer l of the game tree to obtain a clustering result A_l; compute the mean μ_i^l of the i-th class of A_l, and the distance between the i-th and j-th classes of A_l, d^l(i, j) = D_l(μ_i^l, μ_j^l);
(3) Let the current layer of the game tree be n = l - 1, and for each node x_n of this layer compute its potential histogram H_n(x_n). The potential histogram H_n(x_n) is a one-dimensional vector whose length equals the number of clusters C_{n+1} of layer n + 1 of the game tree, and its i-th element represents the probability that node x_n falls into the i-th class of the next-layer clustering result A_{n+1}. Using the Earth Mover's Distance (EMD) between histograms as the metric function and M_n as the clustering algorithm, compute the clustering result A_n of the current layer n;
(4) Decrease n by 1 and repeat the computation of step (3) until n = 1, obtaining the clustering results A = (A_1, …, A_l) of all layers of the game tree, i.e., the imperfect-recall compressed features, and establish the mapping F(·) between the compressed abstract feature space and the original feature space.
In one embodiment, the main unexpected technical effect of using the imperfect-recall clustering method with the histogram EMD as the metric function is the following: compared with the traditional perfect-recall abstraction, imperfect-recall abstraction allows information acquired at earlier decision points (such as the opponent's historical actions) to be ignored when clustering the current-round information sets of a player. In the imperfect information game problem, imperfect-recall clustering thus blurs the historical trajectory, so that a player can, to a certain extent, forget its own or the opponent's historical actions; this reduces the possibility of being influenced by the opponent's deceptive actions and improves the robustness of play. A code sketch of this clustering procedure is given below.
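The sketch below, in Python, outlines the bottom-up, potential-aware clustering just described. The function names, the use of scikit-learn's KMeans for the last layer, and the small k-medoids routine standing in for EMD-based clustering of the potential histograms are assumptions of this illustration, not details fixed by the present disclosure.

```python
import numpy as np
from sklearn.cluster import KMeans


def emd_1d(h1, h2):
    # 1-D Earth Mover's Distance between two normalized histograms over the
    # same ordered support: the L1 distance between their cumulative sums.
    return np.abs(np.cumsum(h1 - h2)).sum()


def cluster_last_layer(features, n_clusters, seed=0):
    # Layer l: ordinary k-means on the raw last-round features (Euclidean metric).
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(features)


def potential_histograms(successor_labels, n_next_clusters):
    # successor_labels[i] holds the next-layer cluster labels of node i's successors;
    # the i-th row is the empirical distribution over those clusters.
    hists = np.zeros((len(successor_labels), n_next_clusters))
    for i, labels in enumerate(successor_labels):
        for c in labels:
            hists[i, c] += 1.0
        hists[i] /= max(len(labels), 1)
    return hists


def emd_kmedoids(hists, n_clusters, n_iter=20, seed=0):
    # Minimal k-medoids under EMD, standing in for the "k-means with histogram
    # EMD" step of the abstraction (an assumption of this sketch).
    rng = np.random.default_rng(seed)
    n = len(hists)
    dist = np.array([[emd_1d(hists[i], hists[j]) for j in range(n)] for i in range(n)])
    medoids = rng.choice(n, size=n_clusters, replace=False)
    for _ in range(n_iter):
        labels = dist[:, medoids].argmin(axis=1)
        new_medoids = medoids.copy()
        for k in range(n_clusters):
            members = np.where(labels == k)[0]
            if len(members):
                new_medoids[k] = members[dist[np.ix_(members, members)].sum(axis=1).argmin()]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return dist[:, medoids].argmin(axis=1)
```

Run bottom-up, cluster_last_layer gives A_l; then, for n = l - 1, …, 1, the potential histograms of layer n are built from the labels of layer n + 1 and clustered with emd_kmedoids to give A_n, which realizes the imperfect-recall abstraction.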
In step 2, a blueprint strategy is iteratively generated through self-play in the abstract feature space using the MCCFR method.
In one embodiment, iteratively generating the blueprint strategy through self-play in the abstract feature space using the MCCFR method includes:
(1) Set a blueprint strategy, initialized as a random strategy, and set up repeatable game matches, the number of matches being P;
(2) Input the blueprint strategy into the game matches and generate samples through self-play; using the external-sampling MCCFR algorithm, compute the regrets of the self-play samples with the overall expected payoff in the abstract feature space in place of the counterfactual return of the original MCCFR algorithm. Here T is the current iteration round, p is the current player, I_p is the abstract information set, i.e., the abstract feature of the generated sample under the mapping F(·), and σ^T is the blueprint strategy at the current iteration round; the overall expected payoff is computed from the approximate access probability $\tilde{\pi}^{\sigma^T}(I_p)$ of I_p under the blueprint strategy σ^T and the approximate payoff expectation $\tilde{u}_p^{\sigma^T}(I_p)$ of player p on I_p under σ^T.
Assuming that the action taken by the player in this sample is a, the overall regret of the current player p for action a on the abstract information set I_p accumulates as

$$R^T(a\,|\,I_p)=\sum_{t=1}^{T}\Bigl(\tilde{v}_p^{\sigma^t}(I_p,a)-\tilde{v}_p^{\sigma^t}(I_p)\Bigr),$$

where $\tilde{v}_p^{\sigma^t}(I_p)$ denotes the overall expected payoff of player p on I_p under σ^t and $\tilde{v}_p^{\sigma^t}(I_p, a)$ the overall expected payoff when action a is taken at I_p.
(3) According to the above regret calculation, the blueprint strategy σ^{T+1} of the current player p for the next iteration round T + 1 is updated by regret matching:

$$\sigma^{T+1}(a\,|\,I_p)=\begin{cases}\dfrac{\max\{R^T(a\,|\,I_p),\,0\}}{\sum_{a'\in A}\max\{R^T(a'\,|\,I_p),\,0\}}, & \text{if }\sum_{a'\in A}\max\{R^T(a'\,|\,I_p),\,0\}>0,\\[2ex]\dfrac{1}{|A|}, & \text{otherwise,}\end{cases}$$

where R^T(a | I_p) is the overall regret value, A is the set of available actions at I_p, and |A| is the number of available actions.
(4) Alternately update the strategies of the different players using interval updating and parallel Monte Carlo sampling, and set a regret-change-magnitude threshold R_switch and a strategy-update-count threshold T_switch: after the strategy of player p has been iterated, if the change in the accumulated total regret exceeds R_switch and the accumulated number of updates reaches T_switch, switch to the other player p' and repeat steps (2) to (3) to update the strategy of p'.
(5) Compute the final output blueprint strategy using action-sampling frequencies in place of the reach-probability weighting of the original MCCFR algorithm: after each strategy-update iteration, create a new game environment, input the current round's instantaneous strategy, and run K simulated matches, recording the number of times each action a is selected under each abstract information set I, i.e., the action frequency f(a | t, σ^t, I). The blueprint strategy output after T rounds of iteration is

$$\bar{\sigma}^T(a\,|\,I)=\frac{\sum_{t=1}^{T} f(a\,|\,t,\sigma^t,I)}{\sum_{a'\in A}\sum_{t=1}^{T} f(a'\,|\,t,\sigma^t,I)}.$$

A code sketch of these updates follows.
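The sketch below, in Python, gives a tabular form of the regret-matching update of step (3) and the action-frequency averaging of step (5). The class name, the dictionary-based storage keyed by abstract information set, and the way sampled payoffs are fed in are assumptions of this sketch rather than the implementation of the present disclosure.

```python
import numpy as np
from collections import defaultdict


class BlueprintTables:
    """Tabular regrets, current strategy, and action-frequency counts keyed by
    abstract information set (e.g. a feature string); a simplified sketch."""

    def __init__(self, n_actions):
        self.n_actions = n_actions
        self.regret = defaultdict(lambda: np.zeros(n_actions))  # R^T(a | I_p)
        self.freq = defaultdict(lambda: np.zeros(n_actions))    # f(a | t, sigma^t, I)

    def regret_matching(self, info_set):
        # sigma^{T+1}(a | I_p): positive-regret matching with a uniform fallback.
        positive = np.maximum(self.regret[info_set], 0.0)
        total = positive.sum()
        if total > 0.0:
            return positive / total
        return np.full(self.n_actions, 1.0 / self.n_actions)

    def accumulate_regret(self, info_set, action_payoffs, baseline_payoff):
        # Add the per-action advantage of the sampled overall expected payoffs
        # over the information set's payoff to the cumulative regret R^T(a | I_p).
        self.regret[info_set] += np.asarray(action_payoffs) - baseline_payoff

    def record_action(self, info_set, action):
        # Count the actions chosen in the K simulated matches of each round.
        self.freq[info_set][action] += 1.0

    def final_strategy(self, info_set):
        # Output strategy: normalized action frequencies instead of
        # reach-probability-weighted averaging.
        counts = self.freq[info_set]
        total = counts.sum()
        if total > 0.0:
            return counts / total
        return np.full(self.n_actions, 1.0 / self.n_actions)
```

In an actual trainer, accumulate_regret would receive the per-action overall expected payoffs estimated from the external-sampling traversal, record_action would be called for the actions chosen in the K simulated matches of each round, and final_strategy would be read out after T rounds.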
in one embodiment, the overall expected revenue replaces the counterfactual returns in the original MCCFR algorithm to compute the regret value of the self-game generated samples, the different game player strategies are alternately updated using interval update and concurrent monte carlo sampling methods, the final output blueprint strategy is computed using action sampling to replace the arrival probability weighting in the original MCCFR algorithm, resulting in unexpected technical effects including:
A. As the scale of the game problem grows and the game tree deepens, the original MCCFR algorithm cannot traverse all possible actions, and the access probability of an abstract information set I_p keeps decreasing as the exploration depth increases, so the absolute values of the corresponding counterfactual return and of the overall regret R^T(a | I_p) keep shrinking; the regret updates then become too small and convergence slows down. Sampling the overall expected payoff in place of the counterfactual return, i.e., not weighting by the reach probability of the information set, prevents the update magnitude from collapsing to zero during strategy iteration, which markedly improves the convergence speed and avoids update stagnation.
B. A single iteration of the original MCCFR algorithm gives an unreliable regret update, so the expected behaviour of the regret updates has to be ensured by increasing the number of iteration rounds. Alternately updating the strategies of the different players with interval updating and parallel Monte Carlo sampling markedly reduces the randomness caused by sampling each chance event's action only once in the original MCCFR algorithm, and increases the algorithm's coverage of the game tree while keeping the time cost of a single iteration essentially unchanged. This makes the method better suited to game problems whose chance events have highly dispersed actions, and improves the quality of the finally generated strategy.
C. Using action-sampling frequencies in place of the reach-probability weighting of the original MCCFR algorithm to compute the final output blueprint strategy, compared with the original averaging based on reach probabilities and regret values, reduces the bias that an abnormal overall regret in some iteration round could introduce into the final strategy, and improves the stability and robustness of the algorithm.
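As a small illustration of the interval-updating rule of step (4), the Python check below switches the player being updated once both thresholds are met; the function and argument names are assumptions of this sketch, and the default threshold values follow the embodiment described later.

```python
def should_switch_player(total_regret_change: float,
                         updates_since_switch: int,
                         r_switch: float = 200000.0,
                         t_switch: int = 100) -> bool:
    # Interval updating: move on to the other player once the accumulated
    # total-regret change exceeds R_switch and the number of strategy updates
    # since the last switch reaches T_switch.
    return total_regret_change > r_switch and updates_since_switch >= t_switch
```

The outer training loop would alternate between p and p', running parallel Monte Carlo sampling for the active player until this check fires, then resetting the counters.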
In step 3, the blueprint strategy is stored and updated in a distributed manner using a hash algorithm over feature strings.
In one embodiment, the distributed storage and updating of the blueprint strategy using the hash algorithm over feature strings includes:
(1) According to the abstract feature space and the mapping relation obtained by the imperfect-recall clustering method, encode the original feature space as characters, i.e., convert each original feature into a fixed-length hexadecimal string with the MD5 hash algorithm, and use key-value mapping in the string space to assign a fixed index to each original feature (see the code sketch after this list);
(2) store the regret values and the blueprint strategy in a distributed manner according to the assigned fixed indices;
(3) set a blueprint-strategy update-count threshold; for each blueprint strategy in the distributed store, whenever its number of updates reaches the threshold, save the current blueprint strategy and run a simulated-adversary test against the previously saved blueprint strategy, until the payoff of the current blueprint strategy in the simulated matches no longer improves significantly, at which point the imperfect information game strategy has been generated.
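A minimal sketch of the feature-string hashing of step (1), assuming Python's standard hashlib module; the digest truncation, the example feature string, and the modulo mapping to a shard index are illustrative assumptions, not details fixed by the present disclosure.

```python
import hashlib


def feature_key(feature_string: str, digest_len: int = 16) -> str:
    # Fixed-length hexadecimal key for an (abstracted) original feature.
    return hashlib.md5(feature_string.encode("utf-8")).hexdigest()[:digest_len]


def shard_index(feature_string: str, n_shards: int = 10) -> int:
    # Map the hexadecimal key to one of the distributed regret / strategy stores.
    return int(feature_key(feature_string), 16) % n_shards


# Hypothetical usage: "round2|cluster_17|bet_history_b" is a made-up feature encoding.
key = feature_key("round2|cluster_17|bet_history_b")
store = shard_index("round2|cluster_17|bet_history_b")
```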
Corresponding to the parallelization imperfect information game strategy generation method, the disclosure also provides a parallelization imperfect information game strategy generation device.
Fig. 3 is a schematic block diagram of a parallelized imperfect information game strategy generating device, which includes:
the space compression module is used for compressing the original feature space of the imperfect information game with the imperfect-recall clustering method to obtain an abstract feature space;
the self-play module is used for iteratively generating a blueprint strategy through self-play in the abstract feature space with the MCCFR (Monte Carlo CFR) method;
and the computing module is used for performing distributed storage and updating of the blueprint strategy with a hash algorithm over feature strings.
An embodiment of the present disclosure also provides an electronic device, including:
a memory for storing processor-executable instructions; and
a processor configured to perform:
compressing the original feature space of the imperfect information game with an imperfect-recall clustering method to obtain an abstract feature space;
iteratively generating a blueprint strategy through self-play in the abstract feature space with the MCCFR method;
and performing distributed storage and updating of the blueprint strategy with a hash algorithm over feature strings.
Embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon a computer program for causing a computer to execute:
compressing the original feature space of the imperfect information game with an imperfect-recall clustering method to obtain an abstract feature space;
iteratively generating a blueprint strategy through self-play in the abstract feature space with the MCCFR method;
and performing distributed storage and updating of the blueprint strategy with a hash algorithm over feature strings.
It should be noted that, in the embodiments of the present disclosure, the Processor may be a Central Processing Unit (CPU), or may be another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The memory may be used for storing the computer program and/or modules, and the processor may realize the various functions of the parallelized imperfect information game strategy generation apparatus by running or executing the computer program and/or modules stored in the memory and calling data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data created according to the use of the device (such as audio data, a phonebook, etc.), and the like. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), at least one magnetic disk storage device, a Flash memory device, or another non-volatile solid-state storage device. If the modules/units of the parallelized imperfect information game strategy generation apparatus are realized in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on such understanding, all or part of the flow of the method of the embodiments described above can be realized by the present disclosure, and the method can also be realized by instructing the relevant hardware through a computer program, which can be stored in a computer-readable storage medium; when the computer program is executed by a processor, the steps of the method embodiments described above can be realized. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, etc. It should be noted that the above-described device embodiments are merely illustrative, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units; they may be located in one place or distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, in the drawings of the embodiment of the device provided by the present disclosure, the connection relationship between the modules indicates that there is a communication connection between them, and may be specifically implemented as one or more communication buses or signal lines. One of ordinary skill in the art can understand and implement it without inventive effort.
The parallelization imperfect information game strategy generation method is explained in detail through a specific embodiment.
Take heads-up no-limit Texas hold'em as an example. This game is a typical two-player zero-sum imperfect-information non-cooperative dynamic game, on which the traditional CFR algorithm iterates extremely slowly and has difficulty converging to an approximate Nash equilibrium strategy. For this problem, the total number of game-tree layers is l = 4. When computing the abstract feature space, the final number of clusters of layer 1 of the game tree is set to C_1 = 169, and the final number of clusters of the other three layers is C_2 = C_3 = C_4 = 200. The clustering algorithm M_4 of the last layer is the k-means algorithm, with the distance metric D_4 being the Euclidean distance between expected hand strength (EHS) distributions; the other three layers also use the k-means algorithm, with the histogram EMD as the distance metric. The self-play iteration that generates the blueprint strategy uses 200 parallel simulation environments, the number of distributed memories for regret storage is 10, the player regret-change-magnitude threshold is set to R_switch = 200000, and the strategy-update-count threshold is set to T_switch = 100.
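For reference, the settings listed above can be collected into a configuration sketch; the key names are assumptions of this illustration, while the values are those stated in the embodiment.

```python
blueprint_config = {
    "game_tree_layers": 4,                        # l = 4 rounds
    "clusters_per_layer": [169, 200, 200, 200],   # C_1 = 169, C_2 = C_3 = C_4 = 200
    "last_layer_metric": "euclidean_on_EHS",      # D_4: Euclidean distance on EHS distributions
    "other_layer_metric": "histogram_EMD",        # D_1..D_3: Earth Mover's Distance on histograms
    "parallel_environments": 200,
    "regret_store_shards": 10,
    "R_switch": 200000,
    "T_switch": 100,
}
```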
Fig. 4 compares the training process of the above algorithm with that of the original MCCFR algorithm in the heads-up no-limit Texas hold'em environment; the parallelized imperfect information strategy generation algorithm is superior to the original MCCFR algorithm both in convergence speed and in the performance of the final converged strategy.
Fig. 5 shows the online match record of the strategy generated by the algorithm against the champion program of the 2017 international poker AI competition; as the number of iterations increases, the performance of the strategy gradually stabilizes and surpasses that of the champion program.
Although embodiments of the present disclosure have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present disclosure, and that changes, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present disclosure.

Claims (7)

1. A parallelized imperfect information game strategy generation method is characterized by comprising the following steps:
compressing the original feature space of the imperfect information game with an imperfect-recall clustering method to obtain an abstract feature space;
iteratively generating a blueprint strategy through self-play in the abstract feature space with the MCCFR method;
and performing distributed storage and updating of the blueprint strategy with a hash algorithm over feature strings.
2. The parallelized imperfect information game strategy generation method according to claim 1, wherein compressing the original feature space of the imperfect information game with the imperfect-recall clustering method to obtain the abstract feature space comprises:
(1) at initialization, given that the current game has l rounds in total, the game tree to be abstracted by clustering has l layers; the clustering algorithm of layer k in the game tree is M_k (k = 1, …, l) and the number of clusters at layer k is C_k, where M denotes a clustering algorithm;
(2) using a metric function D_l, clustering the original feature space of layer l of the game tree to obtain a clustering result A_l, computing the mean μ_i^l of the i-th class of A_l, and computing the distance between the i-th and j-th classes of A_l, d^l(i, j) = D_l(μ_i^l, μ_j^l);
(3) letting the current layer of the game tree be n = l - 1, computing for each node x_n of this layer its potential histogram H_n(x_n), whose i-th element represents the probability that node x_n falls into the i-th class of the next-layer clustering result A_{n+1}, and, using the Earth Mover's Distance (EMD) between histograms as the metric function and M_n as the clustering algorithm, computing the clustering result A_n of the current layer n;
(4) decreasing n by 1 and repeating the computation of step (3) until n = 1, obtaining the clustering results A = (A_1, …, A_l) of all layers of the game tree, i.e., the imperfect-recall compressed features, and establishing the mapping F(·) between the compressed abstract feature space and the original feature space.
3. The parallelized imperfect information game strategy generation method according to claim 1, wherein iteratively generating the blueprint strategy through self-play in the abstract feature space using the MCCFR method comprises:
(1) setting a blueprint strategy, initialized as a random strategy, and setting up repeatable game matches, the number of matches being P;
(2) inputting the blueprint strategy into the game matches, generating samples through self-play, and, using the external-sampling MCCFR algorithm, computing the regrets of the self-play samples with the overall expected payoff in the abstract feature space in place of the counterfactual return of the original MCCFR algorithm;
assuming that the action taken by the player in this sample is a, the overall regret of the current player p for action a on the abstract information set I_p is then obtained;
(3) updating the blueprint strategy of the current game player p in the next iteration turn according to the regret value calculation method;
(4) alternately updating the strategies of the different players using interval updating and parallel Monte Carlo sampling, setting a regret-change-magnitude threshold and a strategy-update-count threshold, and, after the strategy of player p has been iterated, if the change in the accumulated total regret exceeds the threshold and the accumulated number of updates reaches the threshold, switching to the other player p' and repeating steps (2) to (3) to update the strategy of p';
(5) computing the final output blueprint strategy using action-sampling frequencies in place of the reach-probability weighting of the original MCCFR algorithm: after each round of strategy-update iteration, creating a new game environment, inputting the current round's instantaneous strategy, running K simulated matches, recording the number of times each action a is selected under each abstract information set I in each match, and outputting the blueprint strategy after T rounds of iteration.
4. The parallelized imperfect information game strategy generation method according to claim 1, wherein the distributed storage and updating of the blueprint strategy using the hash algorithm over feature strings comprises:
(1) according to the abstract feature space and the mapping relation obtained by the imperfect-recall clustering method, encoding the original feature space as characters, i.e., converting each original feature into a fixed-length hexadecimal string with the MD5 hash algorithm, and using key-value mapping in the string space to assign a fixed index to each original feature;
(2) storing the regret values and the blueprint strategy in a distributed manner according to the assigned fixed indices;
(3) setting a blueprint-strategy update-count threshold, and, for each blueprint strategy in the distributed store, whenever its number of updates reaches the threshold, saving the current blueprint strategy and running a simulated-adversary test against the previously saved blueprint strategy, until the payoff of the current blueprint strategy in the simulated matches no longer improves significantly, whereby the imperfect information game strategy is generated.
5. A parallelized imperfect information game strategy generation device is characterized by comprising:
the space compression module is used for compressing the original feature space of the imperfect information game with the imperfect-recall clustering method to obtain an abstract feature space;
the self-play module is used for iteratively generating a blueprint strategy through self-play in the abstract feature space with the MCCFR (Monte Carlo CFR) method;
and the computing module is used for performing distributed storage and updating of the blueprint strategy with a hash algorithm over feature strings.
6. An electronic device, comprising:
a memory for storing processor-executable instructions;
a processor configured to perform the parallelized imperfect information game strategy generation method according to any one of claims 1 to 4.
7. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program for causing a computer to execute the parallelized imperfect information game strategy generation method according to any one of claims 1 to 4.
CN202110975035.7A 2021-08-24 2021-08-24 Parallelization imperfect information game strategy generation method and device, electronic equipment and storage medium Pending CN113779870A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110975035.7A CN113779870A (en) 2021-08-24 2021-08-24 Parallelization imperfect information game strategy generation method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110975035.7A CN113779870A (en) 2021-08-24 2021-08-24 Parallelization imperfect information game strategy generation method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113779870A true CN113779870A (en) 2021-12-10

Family

ID=78838840

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110975035.7A Pending CN113779870A (en) 2021-08-24 2021-08-24 Parallelization imperfect information game strategy generation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113779870A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170060930A1 (en) * 2015-08-24 2017-03-02 Palantir Technologies Inc. Feature clustering of users, user correlation database access, and user interface generation system
CN110404265A (en) * 2019-07-25 2019-11-05 哈尔滨工业大学(深圳) A kind of non-complete information machine game method of more people based on game final phase of a chess game online resolution, device, system and storage medium
US20210182718A1 (en) * 2019-12-12 2021-06-17 Alipay (Hangzhou) Information Technology Co., Ltd. Determining action selection policies of an execution device
CN111905373A (en) * 2020-07-23 2020-11-10 深圳艾文哲思科技有限公司 Artificial intelligence decision method and system based on game theory and Nash equilibrium
CN112329348A (en) * 2020-11-06 2021-02-05 东北大学 Intelligent decision-making method for military countermeasure game under incomplete information condition

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
HU, XY ET AL.: "A Fast-Convergence Method of Monte Carlo Counterfactual Regret Minimization for Imperfect Information Dynamic Games", 9TH IEEE DATA DRIVEN CONTROL AND LEARNING SYSTEMS CONFERENCE (DDCLS), pages 1048 - 1053 *

Similar Documents

Publication Publication Date Title
US20230191229A1 (en) Method and System for Interactive, Interpretable, and Improved Match and Player Performance Predictions in Team Sports
CN109902706B (en) Recommendation method and device
Antonoglou et al. Planning in stochastic environments with a learned model
CN103942571A (en) Graphic image sorting method based on genetic programming algorithm
Yankelevsky et al. Finding gems: Multi-scale dictionaries for high-dimensional graph signals
CN116090536A (en) Neural network optimization method, device, computer equipment and storage medium
Chen et al. Linearity grafting: Relaxed neuron pruning helps certifiable robustness
Liu et al. Model-free neural counterfactual regret minimization with bootstrap learning
Gummadi et al. Mean field analysis of multi-armed bandit games
Cho et al. Espn: Extremely sparse pruned networks
CN113779870A (en) Parallelization imperfect information game strategy generation method and device, electronic equipment and storage medium
CN111957053A (en) Game player matching method and device, storage medium and electronic equipment
CN111178541B (en) Game artificial intelligence system and performance improving system and method thereof
Huang et al. Optimizer amalgamation
Chouliaras et al. Feed Forward Neural Network Sparsificationwith Dynamic Pruning
Oertell et al. RL for Consistency Models: Faster Reward Guided Text-to-Image Generation
CN115392456B (en) Fusion optimization algorithm asymptotically normal high migration countermeasure sample generation method
Sun et al. Hiabp: Hierarchical initialized abp for unsupervised representation learning
Tzikas et al. Incremental relevance vector machine with kernel learning
Deja et al. Adapt & Align: Continual Learning with Generative Models Latent Space Alignment
Habara et al. Convergence analysis and acceleration of the smoothing methods for solving extensive-form games
CN117094769A (en) Click rate prediction method, click rate prediction device, computer equipment and storage medium
Li et al. RL-CFR: Improving Action Abstraction for Imperfect Information Extensive-Form Games with Reinforcement Learning
Berger et al. Learning Discrete Weights and Activations Using the Local Reparameterization Trick
Xu et al. Deep Variational Incomplete Multi-View Clustering: Exploring Shared Clustering Structures

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination