CN113779870B - Parallelized imperfect information game strategy generation method and device, electronic device, and storage medium
- Publication number: CN113779870B (application number CN202110975035.7A)
- Authority: CN (China)
- Prior art keywords: strategy, game, blueprint, feature space, algorithm
- Prior art date: 2021-08-24
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/20—Design optimisation, verification or simulation
- G06F30/27—Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/231—Hierarchical techniques, i.e. dividing or merging pattern sets so as to obtain a dendrogram
Abstract
The application belongs to the technical field of machine learning, and in particular relates to a parallelized imperfect information game strategy generation method and device, an electronic device, and a storage medium. The method comprises the following steps: compressing the original feature space of the imperfect information game with an incomplete-recall clustering method to obtain an abstract feature space; iteratively generating a blueprint strategy within the abstract feature space through self-play using the Monte Carlo counterfactual regret minimization (MCCFR) method; and storing and updating the blueprint strategy in a distributed manner using a hash algorithm over feature strings. The method abstracts the feature space with incomplete recall, which improves strategy robustness; on the basis of the MCCFR algorithm, the overall expected payoff replaces the counterfactual regret value and is updated at intervals, the sampled action frequency is used to generate the final strategy, and feature mapping is combined with a parallel framework, which speeds up convergence and shortens training time.
Description
Technical Field
The application belongs to the technical field of machine learning, and in particular relates to a parallelized imperfect information game strategy generation method and device, an electronic device, and a storage medium.
Background
Games have long been one of the important application scenarios in the field of artificial intelligence. With the rapid development of reinforcement learning and deep learning, many perfect information game problems have been solved over the past decade, and exploration of imperfect information game problems has also progressed significantly by drawing on those successes.
Compared with perfect information games, the incomplete knowledge of the game state in imperfect information games makes many conventional iterative and search algorithms unsuitable. How to reasonably exploit the invisibility of partial information and avoid the dilemma of circular reasoning during strategy exploration has long been the focus of imperfect information game research, and many algorithms have been derived on this basis. The counterfactual regret minimization algorithm (hereinafter referred to as CFR) is the most widely used method; it generates an approximate Nash equilibrium strategy through self-play training. CFR approximates the Nash equilibrium strategy of an extensive-form game through repeated self-play between two regret-minimizing learners, but the game tree has to be fully traversed in every iteration, so the algorithm complexity is high.
Disclosure of Invention
The present disclosure aims to solve, at least to some extent, one of the technical problems in the related art, based on the inventors' knowledge and understanding of the following facts and problems. The Monte Carlo counterfactual regret minimization algorithm (MCCFR), based on Monte Carlo sampling, greatly reduces traversal cost by choosing the sampling strategy appropriately, but it suffers from larger variance and slower convergence. The improved CFR+ algorithm optimizes the computation of the average strategy and improves convergence efficiency. In addition, the CFR-D algorithm for online sub-game solving and CFR-style algorithms that compute regret values with deep neural networks both perform well on large-scale imperfect information game problems. However, these strategy generation methods still have the following disadvantages: the algorithm cost is high for large-scale game problems and grows exponentially with the number of players; as the game depth increases, the reach probability tends to zero, so strategy updates become insignificant and convergence slows down; and the sampling update variance is large, so the final strategy easily falls into local optima.
In view of this, the present disclosure proposes a parallelized imperfect information game strategy generation method and device, an electronic device, and a storage medium to solve the technical problems in the related art.
According to a first aspect of the present disclosure, a parallelized imperfect information game strategy generation method is provided, including the following steps:
compressing the original feature space of the imperfect information game with an incomplete-recall clustering method to obtain an abstract feature space;
iteratively generating a blueprint strategy within the abstract feature space through self-play using the MCCFR minimization method;
and storing and updating the blueprint strategy in a distributed manner using a hash algorithm over feature strings.
The parallelized imperfect information game strategy generation method abstracts the feature space with incomplete recall, which improves strategy robustness; on the basis of the MCCFR algorithm, the overall expected payoff replaces the counterfactual regret value and is updated at intervals, the sampled action frequency is used to generate the final strategy, and feature mapping is combined with a parallel framework, which speeds up convergence and shortens training time.
Optionally, compressing the original feature space of the imperfect information game with the incomplete-recall clustering method to obtain an abstract feature space includes:
(1) At initialization, setting the current game to have l rounds in total and performing abstract clustering on the l layers of the game tree, where the clustering algorithm of layer k of the game tree is M_k (k = 1, …, l) and the number of clusters of layer k is C_k;
(2) Clustering the original feature space of layer l of the game tree with a metric function D_l to obtain a clustering result A_l, denoting the mean of the i-th class in A_l as μ_i^l, and computing the distance D_l(μ_i^l, μ_j^l) between the i-th and j-th classes in A_l;
(3) Letting the current layer of the game tree be n = l-1 and computing the potential histogram H_n(x_n) of every node x_n of the current layer, where the i-th element of the potential histogram is the probability that node x_n of layer n belongs to the i-th class of the next layer's clustering result A_{n+1}, and obtaining the clustering result A_n of the current layer n with the clustering algorithm M_n, using the earth mover's distance between histograms as the metric;
(4) Repeating step (3) until n = 1, obtaining the clustering results A = (A_1, …, A_l) of all layers of the game tree, i.e., the features compressed with incomplete recall, and establishing the mapping F(·) between the compressed abstract feature space and the original feature space.
Optionally, iteratively generating a blueprint strategy within the abstract feature space through self-play using the MCCFR minimization method includes:
(1) Setting a blueprint strategy, which is a random strategy at initialization, and setting a repeatable game containing P game players;
(2) Feeding the blueprint strategy into the game, generating samples through self-play, and computing the regret values of the self-play samples with an external-sampling MCCFR minimization algorithm in which the overall expected payoff in the abstract feature space replaces the counterfactual payoff of the original MCCFR algorithm;
assuming the action taken by the game player in this sample is a, obtaining the overall regret of the current player p for a under the abstract information set I_p;
(3) Updating the blueprint strategy of the current game player p for the next iteration round according to this regret calculation;
(4) Alternately updating the strategies of different game players with interval updating and parallel Monte Carlo sampling, setting a regret-change threshold R_switch and a strategy-update-count threshold T_switch, and, after the strategy iterations of game player p, repeating (2) to (3) to update the strategy of another player p' once the accumulated overall regret change exceeds the threshold and the accumulated number of updates reaches the threshold;
(5) Computing the final output blueprint strategy with action sampling instead of the reach-probability weighting of the original MCCFR algorithm: after each round of strategy-update iterations, creating a new game environment, feeding in the current round's instantaneous strategy, running K simulated matches, and recording how many times each action a is selected in each match under the different abstract information sets I, which yields the blueprint strategy output after T iteration rounds.
Optionally, storing and updating the blueprint strategy in a distributed manner with a hash algorithm over feature strings includes:
(1) Character-encoding the original feature space according to the abstract feature space and the mapping obtained by the incomplete-recall clustering method, i.e., converting each original feature into a fixed-length hexadecimal string with the MD5 hash algorithm and assigning each original feature a fixed number through a key-value mapping in the string space;
(2) Storing the regret values and the blueprint strategy in a distributed manner according to the assigned fixed numbers;
(3) Setting a blueprint-strategy update-count threshold; for every blueprint strategy stored in the distributed store, saving the current blueprint strategy whenever the update count reaches the threshold and running a simulated adversarial test against the previously saved blueprint strategy, until the payoff of the current blueprint strategy in the simulated matches no longer improves noticeably, at which point the imperfect information game strategy is generated.
According to a second aspect of the present disclosure, a parallelized imperfect information game strategy generation device is provided, including:
a space compression module, configured to compress the original feature space of the imperfect information game with the incomplete-recall clustering method to obtain an abstract feature space;
a self-play module, configured to iteratively generate a blueprint strategy within the abstract feature space through self-play using the MCCFR minimization method;
and a calculation module, configured to store and update the blueprint strategy in a distributed manner using a hash algorithm over feature strings.
According to a third aspect of the present disclosure, an electronic device is provided, comprising:
a memory for storing instructions executable by the processor; and
a processor configured to perform:
compressing the original feature space of the imperfect information game with the incomplete-recall clustering method to obtain an abstract feature space;
iteratively generating a blueprint strategy within the abstract feature space through self-play using the MCCFR minimization method;
and storing and updating the blueprint strategy in a distributed manner using a hash algorithm over feature strings.
According to a fourth aspect of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored, the computer program causing the computer to perform:
compressing the original feature space of the imperfect information game with the incomplete-recall clustering method to obtain an abstract feature space;
iteratively generating a blueprint strategy within the abstract feature space through self-play using the MCCFR minimization method;
and storing and updating the blueprint strategy in a distributed manner using a hash algorithm over feature strings.
According to the embodiments of the present disclosure, the feature space is abstracted with incomplete recall, which improves strategy robustness; on the basis of the MCCFR algorithm, the overall expected payoff replaces the counterfactual regret value and is updated at intervals, the sampled action frequency is used to generate the final strategy, and feature mapping is combined with a parallel framework, which speeds up convergence and shortens training time.
Additional aspects and advantages of the disclosure will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the disclosure.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the prior art, the drawings that are used in the description of the embodiments or the prior art will be briefly described below. It is evident that the drawings in the following description are only some embodiments of the present disclosure, and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art.
The foregoing and/or additional aspects and advantages of the present disclosure will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a schematic flow diagram of a parallelized imperfect information game strategy generation method according to one embodiment of the present disclosure.
FIG. 2 is a detailed flow diagram according to one embodiment of the present disclosure.
FIG. 3 is a schematic block diagram of a parallelized imperfect information game strategy generation device according to one embodiment of the present disclosure.
FIG. 4 is a training-process plot according to one embodiment of the present disclosure.
FIG. 5 is a plot of training results according to one embodiment of the present disclosure.
Detailed Description
The embodiments of the present application are described below clearly and completely with reference to the accompanying drawings; the embodiments described are obviously only some, not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of the application.
Fig. 1 is a schematic flow diagram of a parallelized imperfect information game strategy generation method according to one embodiment of the present disclosure. The parallelized imperfect information game strategy generation method can be applied to user equipment, such as mobile phones and tablet computers.
As shown in fig. 1 and 2, the method for generating the parallelized imperfect information game policy may include the following steps:
In step 1, the original feature space of the imperfect information game is compressed with the incomplete-recall clustering method to obtain an abstract feature space.
In one embodiment, compressing the original feature space of the imperfect information game with the incomplete-recall clustering method to obtain an abstract feature space includes the following steps (a code sketch of this layer-by-layer procedure is given after the list):
(1) At initialization, setting the current game to have l rounds in total and performing abstract clustering on the l layers of the game tree, where the clustering algorithm of layer k of the game tree is M_k (k = 1, …, l) and the number of clusters of layer k is C_k (M_k may be, for example, the k-means clustering algorithm);
(2) Clustering the original feature space of layer l of the game tree with a metric function D_l to obtain a clustering result A_l, denoting the mean of the i-th class in A_l as μ_i^l, and computing the distance D_l(μ_i^l, μ_j^l) between the i-th and j-th classes in A_l;
(3) Letting the current layer of the game tree be n = l-1 and computing the potential histogram H_n(x_n) of every node x_n of the current layer, where H_n(x_n) is a one-dimensional vector whose length equals the number of clusters C_{n+1} of layer n+1 and whose i-th element is the probability that node x_n of layer n belongs to the i-th class of the next layer's clustering result A_{n+1}, and obtaining the clustering result A_n of the current layer n with the clustering algorithm M_n, using the earth mover's distance (EMD) between histograms as the metric;
(4) Repeating step (3) until n = 1, obtaining the clustering results A = (A_1, …, A_l) of all layers of the game tree, i.e., the features compressed with incomplete recall, and establishing the mapping F(·) between the compressed abstract feature space and the original feature space.
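The following is a minimal Python sketch of this layer-by-layer abstraction, under assumptions not fixed by the text: NumPy is available, `k_medoids` stands in for whichever clustering algorithm M_k an implementation chooses (the text names k-means as one option), `emd_1d` is the cumulative-sum form of the 1-D earth mover's distance between normalized histograms, and the `successors` structure that links each node to its next-layer nodes is an assumed interface supplied by the game model.

```python
import numpy as np

def emd_1d(h1, h2):
    # Earth mover's distance between two normalized 1-D histograms over the same
    # ordered bins, computed from the cumulative difference.
    h1, h2 = np.asarray(h1, dtype=float), np.asarray(h2, dtype=float)
    return float(np.abs(np.cumsum(h1 - h2)).sum())

def k_medoids(points, k, metric, iters=20, seed=0):
    # Tiny k-medoids loop standing in for the per-layer clustering algorithm M_k.
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(points), size=k, replace=False))
    labels = None
    for _ in range(iters):
        dist = np.array([[metric(p, points[m]) for m in medoids] for p in points])
        new_labels = dist.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for c in range(k):
            members = np.where(labels == c)[0]
            if len(members):
                cost = [sum(metric(points[i], points[j]) for j in members) for i in members]
                medoids[c] = int(members[int(np.argmin(cost))])
    return labels

def potential_histogram(successor_labels, n_next_clusters):
    # H_n(x_n): distribution of a node's successors over the next layer's clusters A_{n+1}.
    h = np.bincount(np.asarray(successor_labels, dtype=int), minlength=n_next_clusters)
    return h / h.sum()

def incomplete_recall_abstraction(last_layer_features, successors, cluster_sizes):
    """Layers are 0-indexed here, so index l-1 is layer l of the text.

    last_layer_features: raw feature vectors of the layer-l nodes.
    successors[n][i]: indices of the layer-(n+1) nodes reachable from node i of layer n
                      (an assumed interface, not specified in the text).
    cluster_sizes: [C_1, ..., C_l].
    Returns A = [A_1, ..., A_l], the per-layer cluster labels.
    """
    l = len(cluster_sizes)
    A = [None] * l
    euclid = lambda a, b: float(np.linalg.norm(np.asarray(a, float) - np.asarray(b, float)))
    A[l - 1] = k_medoids(list(last_layer_features), cluster_sizes[l - 1], euclid)
    for n in range(l - 2, -1, -1):          # layers l-1 down to 1 of the text
        hists = [potential_histogram(A[n + 1][successors[n][i]], cluster_sizes[n + 1])
                 for i in range(len(successors[n]))]
        A[n] = k_medoids(hists, cluster_sizes[n], emd_1d)
    return A
```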
In one embodiment, the main technical effect of using the incomplete-recall clustering method with the histogram EMD as the metric is as follows. Compared with the traditional perfect-recall abstraction, incomplete-recall abstraction allows information acquired at earlier decision points (such as the opponent's historical actions) to be ignored when clustering a game player's current-round information sets. In imperfect information game problems, incomplete-recall abstraction thus blurs the historical trajectory, letting a game player forget its own or its opponent's past actions to some extent, which reduces the chance that the player's strategy is influenced by deceptive opponent actions and improves the robustness of the strategy, as illustrated by the small example below.
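As a concrete illustration of this forgetting (a sketch only; the field names and key layout are invented for the example, not taken from the patent), two raw histories that differ only in information the abstraction discards, here the opponent's earlier betting line, collapse onto the same abstract information set:

```python
# Two raw histories: same abstract card bucket and same current-round actions,
# but different (forgotten) earlier opponent behaviour.
h1 = {"bucket": 42, "round": 2, "round_actions": ("check", "raise"),
      "earlier_history": ("opponent raised in round 1",)}
h2 = {"bucket": 42, "round": 2, "round_actions": ("check", "raise"),
      "earlier_history": ("opponent called in round 1",)}

def infoset_key(h):
    # Incomplete recall: the key keeps only the current round, the abstract bucket
    # and the current round's actions; earlier history is deliberately dropped.
    return (h["round"], h["bucket"], h["round_actions"])

assert infoset_key(h1) == infoset_key(h2)   # both map to one abstract infoset
```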
In step 2, the MCCFR minimization method is used to iteratively generate a blueprint strategy through self-play within the abstract feature space.
In one embodiment, iteratively generating a blueprint strategy within the abstract feature space through self-play using the MCCFR minimization method includes the following steps (a code sketch of the traversal and regret update is given after the list):
(1) Setting a blueprint strategy, which is a random strategy at initialization, and setting a repeatable game containing P game players;
(2) Feeding the blueprint strategy into the game, generating samples through self-play, and computing the regret values of the self-play samples with an external-sampling MCCFR minimization algorithm in which the overall expected payoff in the abstract feature space replaces the counterfactual payoff of the original MCCFR algorithm:
r^T(a | I_p) = v_p^{σ^T}(I_p, a) − v_p^{σ^T}(I_p),
where T is the current iteration round, p is the current game player, I_p is the abstract feature of the sample generated by the current game player under the mapping F(·), i.e. the abstract information set, σ^T is the blueprint strategy in the current iteration round, π̃^{σ^T}(I_p) is the approximate reach probability of the abstract information set I_p under blueprint strategy σ^T, and v_p^{σ^T}(I_p) is the approximate expected payoff of player p on the abstract information set I_p under blueprint strategy σ^T.
Assuming the action taken by the game player in this sample is a, the overall regret of the current player p for a under the abstract information set I_p is accumulated as R^T(a | I_p) = Σ_{t ≤ T} r^t(a | I_p).
(3) Updating the blueprint strategy σ^{T+1} of the current game player p for the next iteration round T+1 according to the regret values, with positive-regret matching:
σ^{T+1}(a | I_p) = R^{T,+}(a | I_p) / Σ_{a'∈A} R^{T,+}(a' | I_p) if Σ_{a'∈A} R^{T,+}(a' | I_p) > 0, and σ^{T+1}(a | I_p) = 1/|A| otherwise,
where R^T(a | I_p) is the overall regret value, R^{T,+} is its positive part, A is the set of all optional actions, and |A| is the number of optional actions.
(4) Alternately updating the strategies of different game players with interval updating and parallel Monte Carlo sampling, setting a regret-change threshold R_switch and a strategy-update-count threshold T_switch, and, after the strategy iterations of game player p, repeating (2) to (3) to update the strategy of another player p' once the accumulated overall regret change exceeds the threshold and the accumulated number of updates reaches the threshold.
(5) Computing the final output blueprint strategy with action sampling instead of the reach-probability weighting of the original MCCFR algorithm: after each round of strategy-update iterations, a new game environment is created, the current round's instantaneous strategy is fed in, K simulated matches are played, and the number of times each action a is selected in each match under the different abstract information sets I, i.e. the action frequency f(a | t, σ^t, I), is recorded; the blueprint strategy output after T iteration rounds is
σ̄^T(a | I) = Σ_{t=1}^{T} f(a | t, σ^t, I) / Σ_{a'∈A} Σ_{t=1}^{T} f(a' | t, σ^t, I).
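A minimal Python sketch of steps (2) and (3) follows, under several assumptions not fixed by the text: `game` is an assumed abstract interface (`initial_state`, `is_terminal`, `payoff`, `legal_actions`, `to_play`, `next_state`, `infoset_features`, with chance events resolved inside `next_state`), `feature_map` plays the role of F(·), and the external-sampling traversal explores all of the traverser's actions while sampling the opponent's, accumulating regrets against the sampled overall expected payoff rather than a counterfactual value.

```python
import random
from collections import defaultdict

class RegretTable:
    # Per-player storage: cumulative overall regrets R^T(a|I_p) and sampled-action
    # counts f(a|t, sigma^t, I), keyed by abstract infoset -> action.
    def __init__(self):
        self.regret = defaultdict(lambda: defaultdict(float))
        self.freq = defaultdict(lambda: defaultdict(float))

def regret_matching(regrets, actions):
    # sigma^{T+1}(a|I_p): positive-regret matching with a uniform fallback over |A|.
    pos = {a: max(regrets.get(a, 0.0), 0.0) for a in actions}
    total = sum(pos.values())
    if total <= 0.0:
        return {a: 1.0 / len(actions) for a in actions}
    return {a: pos[a] / total for a in actions}

def external_sampling_update(game, player, tables, feature_map):
    # One external-sampling traversal for `player`: opponents' actions are sampled
    # from their current strategies, the traverser's actions are all explored, and
    # regrets are accumulated against the sampled overall expected payoff.
    def traverse(state):
        if game.is_terminal(state):
            return game.payoff(state, player)
        actions = game.legal_actions(state)
        infoset = feature_map(game.infoset_features(state))        # F(.) abstraction
        acting = game.to_play(state)
        sigma = regret_matching(tables[acting].regret[infoset], actions)
        if acting != player:                                       # opponent: sample
            a = random.choices(actions, weights=[sigma[b] for b in actions])[0]
            return traverse(game.next_state(state, a))
        values = {a: traverse(game.next_state(state, a)) for a in actions}
        ev = sum(sigma[a] * values[a] for a in actions)            # overall expected payoff
        for a in actions:
            tables[player].regret[infoset][a] += values[a] - ev    # accumulate R^T(a|I_p)
        return ev
    return traverse(game.initial_state())
```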
In one embodiment, the overall expected payoff replaces the counterfactual payoff of the original MCCFR algorithm when computing the regret values of the self-play samples, the strategies of different game players are updated alternately with interval updating and parallel Monte Carlo sampling, and the final output blueprint strategy is computed with action sampling instead of the reach-probability weighting of the original MCCFR algorithm; the resulting technical effects include the following (the interval switching and frequency readout are sketched in code after these paragraphs):
A. As the scale of the game problem grows and the game tree deepens, the original MCCFR algorithm does not traverse all possible actions, and the probability of reaching the abstract information set I_p keeps decreasing with the exploration depth, so the corresponding counterfactual payoff and the overall regret R^T(a | I_p) keep shrinking; the regret updates then become too small and convergence slows down. Replacing the counterfactual payoff with the sampled overall expected payoff, i.e. no longer weighting by the reach probability of the information set, prevents the update magnitude from collapsing toward zero during strategy iteration, noticeably speeds up convergence, and avoids update stagnation.
B. The original MCCFR algorithm updates the regret values within a single iteration, so the number of iteration rounds often has to be increased for the regret updates to meet expectations. Alternately updating different players' strategies with interval updating and parallel Monte Carlo sampling markedly reduces the randomness caused by the original MCCFR algorithm sampling each chance event's action only once, and increases how thoroughly the algorithm traverses the game tree while keeping the time cost of a single iteration essentially unchanged. This makes the method better suited to game problems in which the chance events' actions are highly discrete, and improves the performance of the final strategy the algorithm generates.
C. Using action sampling instead of the reach-probability weighting of the original MCCFR algorithm to compute the final output blueprint strategy reduces, compared with the original averaging based on reach probabilities and regret values, the deviation of the final strategy that an abnormal overall regret value in some iteration round might otherwise cause, and improves the stability and robustness of the algorithm.
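Continuing the same sketch, the following illustrates step (4)'s interval switching between players and step (5)'s frequency-based blueprint readout; `R_switch`, `T_switch`, `K` and `T` are the thresholds and counts named in the text, while the particular way the accumulated regret change is measured here is an assumption of the sketch.

```python
def simulate_and_count(game, tables, feature_map, K):
    # Step (5): K self-play matches in a fresh environment under the current
    # instantaneous strategies; the blueprint is read off these action frequencies
    # instead of reach-probability-weighted averages.
    for _ in range(K):
        state = game.initial_state()
        while not game.is_terminal(state):
            actions = game.legal_actions(state)
            infoset = feature_map(game.infoset_features(state))
            p = game.to_play(state)
            sigma = regret_matching(tables[p].regret[infoset], actions)
            a = random.choices(actions, weights=[sigma[b] for b in actions])[0]
            tables[p].freq[infoset][a] += 1.0
            state = game.next_state(state, a)

def blueprint(table, infoset, actions):
    # Final blueprint strategy: normalized sampled-action frequencies.
    f = table.freq[infoset]
    total = sum(f.get(a, 0.0) for a in actions)
    if total <= 0.0:
        return {a: 1.0 / len(actions) for a in actions}
    return {a: f.get(a, 0.0) / total for a in actions}

def train(game, n_players, feature_map, T, K, R_switch, T_switch):
    # Step (4): interval switching -- keep updating one player until both the
    # accumulated regret change and the update count cross their thresholds.
    tables = [RegretTable() for _ in range(n_players)]
    p, regret_drift, updates = 0, 0.0, 0
    for _ in range(T):
        before = sum(sum(d.values()) for d in tables[p].regret.values())
        external_sampling_update(game, p, tables, feature_map)
        after = sum(sum(d.values()) for d in tables[p].regret.values())
        regret_drift += abs(after - before)
        updates += 1
        simulate_and_count(game, tables, feature_map, K)
        if regret_drift > R_switch and updates >= T_switch:
            p, regret_drift, updates = (p + 1) % n_players, 0.0, 0
    return tables
```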
In step 3, the blueprint strategy is stored and updated in a distributed manner with a hash algorithm over feature strings.
In one embodiment, storing and updating the blueprint strategy in a distributed manner with a hash algorithm over feature strings includes the following steps (a minimal sketch is given after the list):
(1) Character-encoding the original feature space according to the abstract feature space and the mapping obtained by the incomplete-recall clustering method, i.e., converting each original feature into a fixed-length hexadecimal string with the MD5 hash algorithm and assigning each original feature a fixed number through a key-value mapping in the string space;
(2) Storing the regret values and the blueprint strategy in a distributed manner according to the assigned fixed numbers;
(3) Setting a blueprint-strategy update-count threshold; for every blueprint strategy stored in the distributed store, saving the current blueprint strategy whenever the update count reaches the threshold and running a simulated adversarial test against the previously saved blueprint strategy, until the payoff of the current blueprint strategy in the simulated matches no longer improves noticeably, at which point the imperfect information game strategy is generated.
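A minimal sketch of the feature-string hashing and sharded storage described above, using Python's standard hashlib; mapping the MD5-derived number onto a shard by modulo, and the in-memory dictionaries standing in for the distributed stores, are assumptions of the sketch rather than details given in the text.

```python
import hashlib

N_SHARDS = 10                      # number of distributed regret/strategy stores

def feature_number(raw_feature_string):
    # MD5 -> fixed-length hexadecimal string -> fixed integer number for the feature.
    digest = hashlib.md5(raw_feature_string.encode("utf-8")).hexdigest()
    return int(digest, 16)

def shard_of(raw_feature_string):
    # The fixed number decides which store holds this infoset's regrets and strategy.
    return feature_number(raw_feature_string) % N_SHARDS

shards = [dict() for _ in range(N_SHARDS)]   # stand-ins for the distributed memories

def store_entry(raw_feature_string, regrets, strategy):
    key = feature_number(raw_feature_string)
    shards[shard_of(raw_feature_string)][key] = {"regret": regrets, "strategy": strategy}

def load_entry(raw_feature_string):
    key = feature_number(raw_feature_string)
    return shards[shard_of(raw_feature_string)].get(key)
```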
Corresponding to the above parallelized imperfect information game strategy generation method, the present invention also provides a parallelized imperfect information game strategy generation device.
Fig. 3 is a schematic block diagram of the parallelized imperfect information game strategy generation device, which includes:
a space compression module, configured to compress the original feature space of the imperfect information game with the incomplete-recall clustering method to obtain an abstract feature space;
a self-play module, configured to iteratively generate a blueprint strategy within the abstract feature space through self-play using the MCCFR minimization method;
and a calculation module, configured to store and update the blueprint strategy in a distributed manner using a hash algorithm over feature strings.
An embodiment of the present disclosure also provides an electronic device, including:
a memory for storing instructions executable by the processor; and
a processor configured to perform:
compressing the original feature space of the imperfect information game with the incomplete-recall clustering method to obtain an abstract feature space;
iteratively generating a blueprint strategy within the abstract feature space through self-play using the MCCFR minimization method;
and storing and updating the blueprint strategy in a distributed manner using a hash algorithm over feature strings.
An embodiment of the present disclosure also provides a computer-readable storage medium having stored thereon a computer program for causing the computer to perform:
compressing the original feature space of the imperfect information game with the incomplete-recall clustering method to obtain an abstract feature space;
iteratively generating a blueprint strategy within the abstract feature space through self-play using the MCCFR minimization method;
and storing and updating the blueprint strategy in a distributed manner using a hash algorithm over feature strings.
It should be noted that in the embodiments of the present disclosure, the processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor or any conventional processor. The memory may be used to store the computer program and/or modules, and the processor implements the various functions of the parallelized imperfect information game strategy generation device by running or executing the computer program and/or modules stored in the memory and invoking data stored in the memory. The memory may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and the application programs required for at least one function (such as a sound playing function or an image playing function), and the data storage area may store data created according to the use of the device (such as audio data or a phone book). In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, internal memory, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, a flash card, at least one disk storage device, a flash memory device, or another solid-state storage device. If the modules/units of the device are implemented as software functional units and sold or used as stand-alone products, they may be stored in a computer-readable storage medium. Based on this understanding, the present disclosure may implement all or part of the flow of the methods of the above embodiments by instructing related hardware with a computer program; the computer program may be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of each of the method embodiments described above. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so on. It should also be noted that the above-described apparatus embodiments are merely illustrative: the units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, in the device embodiment drawings provided by the present disclosure, the connection relationship between modules indicates that they are communicatively connected, which may be specifically implemented as one or more communication buses or signal lines. Those of ordinary skill in the art can understand and implement the present invention without undue effort.
The parallelized imperfect information game strategy generation method is described in detail below through a specific embodiment.
Take two-player no-limit Texas hold'em as an example. This game problem is a typical two-player zero-sum, imperfect information, non-cooperative dynamic game; on this problem the traditional CFR algorithm iterates extremely slowly and can hardly converge to an approximate Nash equilibrium strategy. For this problem, the constructed game tree has l = 4 layers in total; when computing the abstract feature space, the final number of clusters of layer 1 of the game tree is set to C_1 = 169, and the final number of clusters of the other three layers is set to C_2 = C_3 = C_4 = 200. The clustering algorithm of the last layer, M_4, uses the k-means algorithm with the Euclidean distance between expected hand strength (EHS) distributions as the distance metric D_4; the other three layers also use k-means, with the histogram EMD as the distance metric. The self-play blueprint-strategy generation algorithm uses 200 parallel simulation environments and 10 distributed memories for regret storage, and sets the regret-change threshold of a game player to R_switch = 200000 and the strategy-update-count threshold to T_switch = 100 (these settings are collected in the configuration sketch below).
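Collected as a plain configuration object, the hyper-parameters stated above look as follows; the dictionary layout and key names are illustrative, not part of the patent.

```python
HUNL_CONFIG = {
    "game_tree_layers": 4,                      # l = 4
    "cluster_sizes": [169, 200, 200, 200],      # C_1 = 169, C_2 = C_3 = C_4 = 200
    "last_layer_metric": "euclidean_on_EHS_distribution",   # D_4
    "upper_layer_metric": "histogram_EMD",
    "cluster_algorithm": "k-means",             # M_1 ... M_4
    "parallel_simulation_envs": 200,
    "regret_store_shards": 10,
    "R_switch": 200_000,                        # regret-change threshold for switching players
    "T_switch": 100,                            # update-count threshold for switching players
}
```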
FIG. 4 compares the training process of the above algorithm with the original MCCFR algorithm in the two-player no-limit Texas hold'em environment; the parallelized imperfect information strategy generation algorithm outperforms the original MCCFR algorithm in both convergence speed and the performance of the final converged strategy.
FIG. 5 shows the online match record of the algorithm-generated strategy against Slumbot, the 2017 international computer poker competition champion; the strategy's performance becomes progressively more stable as the number of iterations increases and is better than Slumbot.
Although embodiments of the present disclosure have been shown and described above, it will be understood that the above embodiments are illustrative and are not to be construed as limiting the present disclosure; variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present disclosure.
Claims (6)
1. A parallelized imperfect information game strategy generation method, characterized by comprising the following steps:
compressing the original feature space of the imperfect information game with an incomplete-recall clustering method to obtain an abstract feature space;
iteratively generating a blueprint strategy within the abstract feature space through self-play using the MCCFR minimization method;
storing and updating the blueprint strategy in a distributed manner using a hash algorithm over feature strings;
wherein iteratively generating a blueprint strategy within the abstract feature space through self-play using the MCCFR minimization method comprises:
(1) Setting a blueprint strategy, which is a random strategy at initialization, and setting a repeatable game containing P game players;
(2) Feeding the blueprint strategy into the game, generating samples through self-play, and computing the regret values of the self-play samples with an external-sampling MCCFR minimization algorithm in which the overall expected payoff in the abstract feature space replaces the counterfactual payoff of the original MCCFR algorithm;
assuming the action taken by the game player in this sample is a, obtaining the overall regret of the current player p for a under the abstract information set I_p;
(3) Updating the blueprint strategy of the current game player p for the next iteration round according to this regret calculation;
(4) Alternately updating the strategies of different game players with interval updating and parallel Monte Carlo sampling, setting a regret-change threshold and a strategy-update-count threshold, and, after each set of strategy iterations of game player p, switching to another player p' and repeating (2) to (3) to update the strategy of player p' once the accumulated overall regret change exceeds the threshold and the accumulated number of updates reaches the threshold;
(5) Computing the final output blueprint strategy with action sampling instead of the reach-probability weighting of the original MCCFR algorithm: after each round of strategy-update iterations, creating a new game environment, feeding in the current round's instantaneous strategy, running K simulated matches, recording how many times each action a is selected in each match under the different abstract information sets I, and outputting the blueprint strategy after T iteration rounds.
2. The parallelized imperfect information game strategy generation method according to claim 1, wherein compressing the original feature space of the imperfect information game with the incomplete-recall clustering method to obtain the abstract feature space comprises:
(1) At initialization, setting the current game to have l rounds in total and performing abstract clustering on the l layers of the game tree, where the clustering algorithm of layer k of the game tree is M_k (k = 1, …, l) and the number of clusters of layer k is C_k;
(2) Clustering the original feature space of layer l of the game tree with a metric function D_l to obtain a clustering result A_l, denoting the mean of the i-th class in A_l as μ_i^l, and computing the distance D_l(μ_i^l, μ_j^l) between the i-th and j-th classes in A_l;
(3) Letting the current layer of the game tree be n = l-1 and computing the potential histogram H_n(x_n) of every node x_n of the current layer, where the i-th element of the potential histogram is the probability that node x_n of layer n belongs to the i-th class of the next layer's clustering result A_{n+1}, and obtaining the clustering result A_n of the current layer n with the clustering algorithm M_n, using the earth mover's distance between histograms as the metric;
(4) Repeating step (3) until n = 1, obtaining the clustering results A = (A_1, …, A_l) of all layers of the game tree, i.e., the features compressed with incomplete recall, and establishing the mapping F(·) between the compressed abstract feature space and the original feature space.
3. The parallelized imperfect information game strategy generation method according to claim 1, wherein storing and updating the blueprint strategy in a distributed manner with the hash algorithm over feature strings comprises:
(1) Character-encoding the original feature space according to the abstract feature space and the mapping obtained by the incomplete-recall clustering method, i.e., converting each original feature into a fixed-length hexadecimal string with the MD5 hash algorithm and assigning each original feature a fixed number through a key-value mapping in the string space;
(2) Storing the regret values and the blueprint strategy in a distributed manner according to the assigned fixed numbers;
(3) Setting a blueprint-strategy update-count threshold; for every blueprint strategy stored in the distributed store, saving the current blueprint strategy whenever the update count reaches the threshold and running a simulated adversarial test against the previously saved blueprint strategy, until the payoff of the current blueprint strategy in the simulated matches no longer improves noticeably, at which point the imperfect information game strategy is generated.
4. A parallelized imperfect information game strategy generation device, comprising:
a space compression module, configured to compress the original feature space of the imperfect information game with an incomplete-recall clustering method to obtain an abstract feature space;
a self-play module, configured to iteratively generate a blueprint strategy within the abstract feature space through self-play using the MCCFR minimization method;
and a calculation module, configured to store and update the blueprint strategy in a distributed manner using a hash algorithm over feature strings;
wherein the self-play module is further configured to: (1) set a blueprint strategy, which is a random strategy at initialization, and set a repeatable game containing P game players;
(2) feed the blueprint strategy into the game, generate samples through self-play, and compute the regret values of the self-play samples with an external-sampling MCCFR minimization algorithm in which the overall expected payoff in the abstract feature space replaces the counterfactual payoff of the original MCCFR algorithm;
assuming the action taken by the game player in this sample is a, obtain the overall regret of the current player p for a under the abstract information set I_p;
(3) update the blueprint strategy of the current game player p for the next iteration round according to this regret calculation;
(4) alternately update the strategies of different game players with interval updating and parallel Monte Carlo sampling, set a regret-change threshold and a strategy-update-count threshold, and, after each set of strategy iterations of game player p, switch to another player p' and repeat (2) to (3) to update the strategy of player p' once the accumulated overall regret change exceeds the threshold and the accumulated number of updates reaches the threshold;
(5) compute the final output blueprint strategy with action sampling instead of the reach-probability weighting of the original MCCFR algorithm: after each round of strategy-update iterations, create a new game environment, feed in the current round's instantaneous strategy, run K simulated matches, record how many times each action a is selected in each match under the different abstract information sets I, and output the blueprint strategy after T iteration rounds.
5. An electronic device, comprising:
a memory for storing instructions executable by the processor; and
a processor configured to perform the parallelized imperfect information game strategy generation method according to any one of claims 1-3.
6. A computer-readable storage medium having stored thereon a computer program for causing a computer to perform the parallelized imperfect information game strategy generation method according to any one of claims 1-3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN202110975035.7A | 2021-08-24 | 2021-08-24 | Parallelized imperfect information game strategy generation method and device, electronic device and storage medium
Publications (2)
Publication Number | Publication Date
---|---
CN113779870A | 2021-12-10
CN113779870B | 2024-08-23
Family
ID=78838840
Families Citing this family (1)
Publication Number | Priority Date | Publication Date | Assignee | Title
---|---|---|---|---
CN117892843B | 2024-03-18 | 2024-06-04 | Ocean University of China | Machine learning data forgetting method based on game theory and cryptography
Citations (2)
Publication Number | Priority Date | Publication Date | Assignee | Title
---|---|---|---|---
CN110404265A | 2019-07-25 | 2019-11-05 | Harbin Institute of Technology (Shenzhen) | Multi-player imperfect information machine game method, device, system and storage medium based on online endgame solving
CN111905373A | 2020-07-23 | 2020-11-10 | Shenzhen Aiwenzhesi Technology Co., Ltd. | Artificial intelligence decision method and system based on game theory and Nash equilibrium
Family Cites Families (3)
Publication Number | Priority Date | Publication Date | Assignee | Title
---|---|---|---|---
US10140327B2 | 2015-08-24 | 2018-11-27 | Palantir Technologies Inc. | Feature clustering of users, user correlation database access, and user interface generation system
SG11202010172WA | 2019-12-12 | 2020-11-27 | Alipay (Hangzhou) Information Technology Co., Ltd. | Determining action selection policies of execution device
CN112329348B | 2020-11-06 | 2023-09-15 | Northeastern University | Intelligent decision-making method for military countermeasure game under incomplete information condition
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |