CN110489668A - Multi-tree variant method of Monte Carlo tree search for synchronous games under imperfect information - Google Patents
Multi-tree variant method of Monte Carlo tree search for synchronous games under imperfect information
- Publication number: CN110489668A (application CN201910860992.8A)
- Authority
- CN
- China
- Prior art keywords: action, game, player, opponent, information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- A63F 13/58 — Controlling game characters or game objects based on the game progress by computing conditions of game characters, e.g. stamina, strength, motivation or energy level
- A63F 13/67 — Generating or modifying game content adaptively or by learning from player actions, e.g. skill level adjustment or storing successful combat sequences for re-use
- A63F 13/70 — Game security or game management aspects
- G06F 16/2246 — Indexing structures; trees, e.g. B+ trees
- G06F 16/24566 — Query execution; recursive queries
- G06F 16/9536 — Search customisation based on social or collaborative filtering
- A63F 2300/60 — Methods for processing data by generating or executing the game program
- A63F 2300/6027 — Methods for processing data using adaptive systems learning from user actions, e.g. for skill level adjustment
- A63F 2300/64 — Methods for processing data for computing dynamical parameters of game objects
- A63F 2300/65 — Methods for processing data for computing the condition of a game character
Abstract
The invention discloses a multi-tree variant method of Monte Carlo tree search for synchronous games under imperfect information, comprising: S1: for the player's strategy, inferring the remaining hidden information from the known posterior information; S2: sampling all information before the game tree is expanded and screening out effective actions; S3: after the game-tree search, jointly training the search results of the individual game trees to predict the final dominant strategy; S4: building two game trees, one per player, which are linked to each other; in every round the trees are expanded simultaneously on identical pre-expansion samples, each tree starts its expansion from its own player's information set, and the opponent's action is obtained directly by mapping from the other tree.
Description
Technical field
The present invention relates to the technical field of machine game playing, and more particularly to a multi-tree variant method of Monte Carlo tree search for synchronous games under imperfect information.
Background technique
Machine game playing studies how to let a computer imitate human game play, and is one of the most challenging research directions in artificial intelligence. Many famous scholars have worked in this field, such as John von Neumann, the father of the computer; John McCarthy, the father of artificial intelligence; Claude E. Shannon, the founder of information theory; Norbert Wiener, the founder of cybernetics; and the famous computer scientist Alan Turing. Machine game playing is an abstraction and refinement of human games; it is simple and convenient, economical and practical, rich in content, and provides an ideal test bed for research on logical reasoning in artificial intelligence, earning it the name "the drosophila of artificial intelligence". Beyond its theoretical significance, machine game playing also has wide application prospects, especially in fields such as war simulation, city planning and network security. How to make game decisions intelligently, however, remains an open problem, whose solution depends on the development of machine game-playing theory and technology.
Game-tree search is the most effective way to solve machine game-playing problems: it searches for the best path in the game tree so as to maximize the overall payoff. However, the game trees of practical games are enormous, which makes searching them extremely difficult; the game-tree complexity of chess is about 10^123 and that of Go reaches 10^360, while the number of atoms on Earth is estimated at only about 10^132. In addition, in imperfect-information games the missing opponent information makes the states at the game-tree nodes highly uncertain, so expanding and solving the game tree becomes even harder. In short, machine game playing in complex environments is characterized by huge state spaces, unknown information and uncertain action payoffs; despite its broad application prospects, it faces enormous challenges.
Monte Carlo tree search based on sampling is mainly used to solve imperfect-information games of high complexity. Opponent modeling is another important research topic in imperfect-information machine game playing: there is a strong connection between the opponent's state information and the opponent's behavior. By building an opponent model, the opponent's state and behavior can be predicted, the state space compressed, and the information uncertainty reduced.
Current research on imperfect-information games focuses mainly on card games and mostly uses solution methods based on abstraction and equilibrium finding. These have a drawback: when the other side deviates from the equilibrium strategy or cheats, the optimal strategy cannot be obtained, and they are limited to two-player zero-sum games; for n-player games, cooperative games and synchronous games, existing algorithms still have many deficiencies. The present method therefore uses multiple trees to model the game from the angle of each player, learns from and extracts knowledge from the observed and hidden information of the game process, screens out effective information, compensates for the missing information under imperfect information, effectively estimates and predicts the opponent's state and decisions, improves the structure of the synchronous Monte Carlo search variant under imperfect information, and supplements the strategy of the game tree.
Summary of the invention
In view of the problems in the prior art, the invention discloses a multi-tree variant method of Monte Carlo tree search for synchronous games under imperfect information, which specifically comprises the following steps:
S1: For the player's strategy, infer the remaining hidden information from the known posterior information and screen out effective actions; then transfer the opponent-strategy estimation used in perfect-information games to the inferred and observed information of the imperfect-information game; in addition to the search, record the opponent's habitual action in each state and build a strategy auxiliary function;
S2: Sample all information before the game tree is expanded and screen out effective actions: record the actions the opponent executed in previous games, set a threshold according to actual needs, screen the actions whose payoff lies within that threshold, mark the actions with higher payoffs for the player and for the opponent, and build an action information library to store them;
S3: After the game-tree search, jointly train the search results of the individual game trees to predict the final dominant strategy: combine the search results, compare the results of the game trees built from different players' angles and from different action samples, and use a convergent decision method to choose the final value toward which the solutions of all game trees tend;
S4: Build two game trees, one per player, which are linked to each other. In every round the trees are expanded simultaneously, and the samples drawn before expansion are identical. Each tree starts its expansion from its own player's information set, while the opponent's action is obtained directly by mapping from the other tree; the player's tree and the opponent's tree search online together, communicating with and influencing each other. The purpose of this arrangement is to guarantee that the actions of the different players execute synchronously and that the state transition after each action execution is identical, fitting the characteristics of a synchronous game.
Sampling all information before the game tree is expanded and screening out effective actions specifically comprises a sample phase, a selection phase, an expansion phase, a simulation phase and an update phase, carried out as follows:
Sample phase: Before each expansion, randomly sample actions from the information library and expand the game tree only over the sampled subset. The types and number of actions sampled before each expansion are random, but the sample size is the same for every draw;
Selection phase: After the sample phase completes, each game tree, built from its player's angle, starts selecting according to the screened action information. Each player selects an action from the information set of the parent node of its own tree, while the opponent's action is mapped over from the selection result in the other tree;
Expansion phase: The state transition is generated only after both players' actions have been executed;
Simulation phase: After the state transition is generated, the trees simulate simultaneously; each tree evaluates only the actions of the player whose angle it represents;
Update phase: After evaluation, each tree backs up and updates the payoff and visit count of the action.
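The five phases above can be sketched as one synchronized round over two player trees. This is a minimal illustration under stated assumptions, not the patented implementation: the `PlayerTree` structure, the uniform selection policy over the sample, and the zero-sum payoff convention are all introduced here for concreteness.

```python
import random

class PlayerTree:
    """One Monte Carlo tree, built from a single player's viewpoint.

    Hypothetical minimal structure; the patent does not name these fields.
    """
    def __init__(self, player):
        self.player = player
        self.visits = {}   # action -> visit count N(a)
        self.returns = {}  # action -> cumulative payoff

    def select(self, sampled_actions):
        # Selection phase: pick this tree's own action from the sampled
        # subset; uniform choice stands in for the real selection policy.
        return random.choice(sampled_actions)

    def update(self, action, payoff):
        # Update phase: back up visit count and payoff for the action.
        self.visits[action] = self.visits.get(action, 0) + 1
        self.returns[action] = self.returns.get(action, 0.0) + payoff

def one_round(tree_a, tree_b, action_library, sample_size, simulate):
    """One synchronized expansion round over both trees (per S4).

    `simulate` is an assumed rollout function mapping a joint action pair
    to a payoff for player A (zero-sum assumed: B receives the negation).
    """
    # Sample phase: both trees see the SAME pre-expansion sample; the
    # library itself is untouched, so later rounds resample from it all.
    sample = random.sample(action_library, sample_size)
    # Selection: each tree picks only its own player's action; the
    # opponent's action is mapped over from the other tree, not searched.
    a = tree_a.select(sample)
    b = tree_b.select(sample)
    # Expansion + simulation: the joint move fires one synchronized
    # state transition, so both trees stay consistent.
    payoff_a = simulate(a, b)
    # Update: each tree backs up only its own player's action value.
    tree_a.update(a, payoff_a)
    tree_b.update(b, -payoff_a)
    return a, b, payoff_a
```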
Further, the prediction and estimation of decisions are carried out from two angles, the opponent's and the player's own (i.e., the sample phase estimates the selection situation of each action and predicts the opponent's tactics):
S31: First, from the player's angle, use prior knowledge: take the product of the selection probability of each action and the probability of that action in the given state as the numerator, and the probability of each state occurring as the denominator, to compute the probability of selecting each action in each of the player's states;
S32: From the opponent's angle, observe which actions the opponent selected most often in previous rounds; such actions are called habitual actions, and for them the player should select the corresponding action that obtains the best payoff;
S33: For a given state, determine the action the player should select from the player's angle by combining the player's own payoff with the prior knowledge;
The prediction and estimation of decisions proceed as follows. Let P(a) be the prior probability of action a appearing, and P(s|a) the probability of the state in which action a appeared in the previous game round. After normalizing the product of the two, choose the combination of this normalized prior probability and the action payoff U(s_i, a) with the maximum value. This maximum combines prior knowledge and action payoff; it is a kind of trust value the player places in an action, and the more an action is trusted, the larger the probability of choosing it. Because the actions of the player and the opponent correspond to each other as relative actions, their action counts in the game process are equal:

N(a_i) = N(a_j)

so it suffices to compute the opponent's most common action in the previous rounds, and an adjusting parameter λ is added directly to the habitual action and the action payoff in the strategy formula, where y denotes the mixed strategy and |A(I)| the number of actions in the player's action set; the strategy formula is a mixture of the trust in the opponent's habitual action and the player's own action.
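The trust-value computation can be sketched as follows. The normalization of P(a)·P(s|a) and the payoff term U(s_i, a) follow the text above, but the patent does not spell out how λ enters the formula, so the additive combination and the uniform λ/|A(I)| exploration term in the mixed strategy are assumptions, not the patented equation.

```python
def trust_scores(prior, state_given_action, payoff, lam):
    """Trust value per action, combining prior knowledge and payoff.

    prior:              P(a), prior probability of each action a
    state_given_action: P(s|a) for the state observed in the last round
    payoff:             U(s_i, a), payoff of action a in the current state
    lam:                the adjusting parameter λ; its additive use here
                        is an assumed reading of the text.
    """
    # Normalize the product P(a) * P(s|a) as the text describes.
    prod = {a: prior[a] * state_given_action[a] for a in prior}
    z = sum(prod.values()) or 1.0
    norm = {a: p / z for a, p in prod.items()}
    # Combine normalized habit probability with payoff via λ (assumed form).
    return {a: norm[a] + lam * payoff[a] for a in prior}

def mixed_strategy(scores, lam):
    """Mixed strategy y: weight the max-trust action, spreading λ/|A(I)|
    probability over all actions of the set A(I) (assumed form)."""
    n = len(scores)
    best = max(scores, key=scores.get)
    return {a: (1 - lam) * (1.0 if a == best else 0.0) + lam / n
            for a in scores}
```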
Solving imperfect-information synchronous games in this way is of greater practical significance. Multiple trees model the different players while retaining mapping relations between the trees; screening the sampled action information of the action library for effective actions avoids an overly large action space, which would make the game-tree search difficult, inefficient and of low quality. The mapping between the game trees during the search guarantees that the state transitions stay synchronized, perfectly fitting the characteristics of a synchronous game. Moreover, the game trees are searched after different samplings, and the solutions are compared and screened; the convergent decision method chooses the final value toward which the solutions of all game trees tend, guaranteeing the accuracy and reasonableness of the result, so that the final strategy is not decided one-sidedly by the error of a single tree's solution and the player's selection strategy and executed actions in the final game process become more reasonable. At the same time, by estimating information on both the opponent's and the player's level, the player does not select actions based on payoff alone: in a real game neither player chooses actions only by payoff, since habitual actions also arise from personal preference for certain actions. The opponent's habitual actions must therefore be taken into account: if, in some state, an opponent action appears with high frequency, that action is very likely a preferred or habitual action. Considering payoff and habitual actions together in the decision greatly improves the accuracy and flexibility of the strategy.
Detailed description of the invention
In order to explain the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments recorded in the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is the technical solution diagram of the Monte Carlo tree search variant for imperfect-information synchronous games;
Fig. 2 is the technical route map of game knowledge extraction based on sampling;
Fig. 3 is a schematic diagram of the synchronous Monte Carlo search under imperfect information;
Fig. 4 is the opponent action-analysis modeling diagram under imperfect information.
Specific embodiment
To make the technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments of the present invention:
The multi-tree variant method of Monte Carlo tree search for synchronous games under imperfect information disclosed by the invention extracts the partially observable information and the information deliberately hidden by the players in an imperfect-information synchronous game, and adds it to the game tree to assist strategy selection. This compensates for the difficulty of solving an imperfect-information game directly, which stems from the incomplete knowledge of the information, while retaining the characteristics of a synchronous game: the tree structure is transformed so that trees built from the different players' angles are searched synchronously. The method specifically comprises the following steps:
Step 1-1: For the player's strategy, infer the remaining hidden information from the known posterior information, narrow the action range and screen out effective actions in order to reduce the search scale of the game tree; then transfer the opponent-strategy estimation of perfect-information games to the inferred and observed information of the imperfect-information game. In addition to the search, record the opponent's habitual action in each state and build a strategy auxiliary function.
Step 1-2: For the information observed and guessed during the game, in order to prevent the opponent from interfering with wrong information, the hidden information in the estimated opponent information must be made interference-proof: wrong information is removed and real information retained. The method is to sample all information before the game tree is expanded and to screen out effective actions: record the actions the opponent executed in previous games, set a threshold according to actual needs, screen the actions whose payoff lies within this threshold, mark the actions with higher payoffs for the player and the opponent, and build an action information library to store this information, in preparation for the game tree's screening of effective actions.
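Step 1-2's action information library can be sketched as follows. The aggregation by mean observed payoff and the "at or above the threshold" reading of "within the threshold" are assumptions introduced here; the patent only says a user-set threshold screens the recorded actions.

```python
from collections import defaultdict

class ActionLibrary:
    """Records actions seen in past games and screens them by payoff.

    Hypothetical minimal structure for the patent's action information
    library; field names and the mean-payoff rule are assumptions.
    """
    def __init__(self, threshold):
        self.threshold = threshold
        self.records = defaultdict(list)  # action -> observed payoffs

    def record(self, action, payoff):
        # Record one executed action and the payoff it produced.
        self.records[action].append(payoff)

    def screened(self):
        # Keep actions whose mean observed payoff clears the threshold;
        # these are the "effective actions" fed to the sample phase.
        return [a for a, ps in self.records.items()
                if sum(ps) / len(ps) >= self.threshold]
```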
Step 1-3: After the game-tree search, jointly train the search results of the individual game trees and predict the final dominant strategy. Combine the search results, compare the results of the trees built from different players' angles and from different action samples, and use a convergent decision method to choose the final value toward which the solutions of all game trees tend. This guarantees the accuracy and reasonableness of the result, prevents the final strategy from being decided one-sidedly by the error of a single tree's solution, and makes the player's selection strategy and executed actions in the final game process more reasonable.
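One simple reading of the "convergent decision method" in step 1-3 is a vote over the best actions recommended by the individual trees, keeping the recommendation most trees converge on. Majority vote is an assumption here, not the patented procedure.

```python
from collections import Counter

def convergent_decision(tree_recommendations):
    """Pick the recommendation most trees converge on.

    `tree_recommendations` is a list of best actions, one per searched
    tree (different player angles, different action samples); the
    most common one is taken as the final value the solutions tend to.
    """
    counts = Counter(tree_recommendations)
    action, _ = counts.most_common(1)[0]
    return action
```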
Step 1-4: The game trees are built as an improvement on the traditional Monte Carlo tree. Two trees are built, one per player, and the trees of the different players are linked to each other; in every round these trees are expanded simultaneously, and the samples drawn before expansion are identical. Each tree starts its expansion from its own player's information set, while the opponent's action is mapped over directly from the other tree; the player's tree and the opponent's tree search online together, communicating with and influencing each other. The purpose of this arrangement is to guarantee that the actions of the different players execute synchronously and that the state transition after each action execution is identical, fitting the characteristics of a synchronous game.
Step 2-1: Sample phase. Before each expansion, randomly sample actions from the information library and expand the game tree only over the sampled subset. The types and number of actions sampled before each expansion are random, but the sample size must be the same for every draw, for consistency. The samples are put back after sampling so that the next pre-expansion draw can use the information again; actions are sampled anew before each game-tree search, then again for the second search, and so on for N rounds. Because some information is left unsampled in every round, a certain flexibility is retained: the unsampled portion prevents the search solution from becoming mechanical and rigid, and the retained information prepares for other states that arise in the actual game against the player.
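The sample-phase draw described in step 2-1 amounts to taking a fixed-size random subset without consuming the library. A minimal sketch:

```python
import random

def sample_actions(library, k, rng=random):
    """Draw k actions at random from the action library (step 2-1).

    The library is not modified, so every later round resamples from
    the full library ("samples are put back"); with k < len(library),
    some actions are left unsampled each round, preserving the
    flexibility the text describes.
    """
    return rng.sample(library, k)
```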
Step 2-2: Selection phase. After the sample phase completes, each game tree, built from its player's angle, starts selecting according to the screened action information. Each player selects an action from the information set of the parent node of its own tree, while the opponent's action is mapped over from the selection result in the other tree.
Step 2-3: Expansion phase. The state transition is generated only after both players' actions have been executed. Because the actions are mapped between the trees and the transition fires only once both actions are in, the states of the two trees stay consistent with each other, which guarantees that the actions are synchronous: their order of execution has no influence, since the state does not change before both have executed.
Step 2-4: Simulation phase. After the state transition is generated, the games seen from the respective players' angles are simulated simultaneously; each tree evaluates only the actions of the player whose angle it represents.
Step 2-5: Update phase. After evaluation, each tree backs up and updates the payoff and visit count of the action. These processes are completed simultaneously across the trees.
Further, before the game decision, the opponent's situation is pre-estimated and the opponent's action strategy predicted, i.e., opponent modeling is carried out. The prediction and estimation of decisions are performed from two angles, the opponent's and the player's own.
Step 3-1: First, from the player's angle, use prior knowledge: take the product of the selection probability of each action and the probability of that action in the given state as the numerator, and the probability of each state occurring as the denominator, to compute the probability of selecting each action in each of the player's states. From the player's angle one may consider the action with the maximum payoff in one's current state, or consider from experience which action should be selected in a given state; if the action with the maximum selection probability in the experience knowledge is at the same time the action with the maximum payoff, the player will very probably select that action.
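The numerator/denominator description in step 3-1 amounts to Bayes' rule, P(a|s) = P(a)·P(s|a)/P(s). A minimal sketch, with all inputs being hypothetical estimated probabilities:

```python
def action_posterior(p_select, p_state_given_action, p_state):
    """Step 3-1: probability that each action is chosen in a given state.

    Numerator: selection probability of the action times the probability
    of the action in this state; denominator: probability of the state.
    """
    return {a: p_select[a] * p_state_given_action[a] / p_state
            for a in p_select}
```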
Step 3-2: From the opponent's angle, we can observe which actions the opponent selected most often previously; such actions are called habitual actions, and for them the player should select the corresponding action that obtains the best payoff. Because the count of the player's corresponding actions and the count of the opponent's habitual actions are equal, the action to execute can be inferred from these counts.
Step 3-3: For a given state, determine the action the player should select by combining the player's own payoff with the prior knowledge. In a real game, however, the opponent may not select actions mechanically by payoff and may have habitual actions of its own; considering habitual actions retains a certain capacity to respond to variations in the opponent's actions.
Step 3-4: The specific calculation is as follows. Let P(a) be the prior probability of action a appearing, and P(s|a) the probability of the state in which action a appeared in the previous game round. After normalizing the product of the two, choose the combination of this normalized prior probability and the action payoff U(s_i, a) with the maximum value; this maximum combines prior knowledge and action payoff and is a kind of trust value the player places in an action, and the more an action is trusted, the larger the probability of choosing it. Because the actions of the player and the opponent correspond to each other as relative actions, their action counts in the game process are equal:

N(a_i) = N(a_j)

so it suffices to compute the opponent's most common action in the previous rounds; an adjusting parameter λ is added directly to the habitual action and the action payoff in the strategy formula, where y denotes the mixed strategy and |A(I)| the number of actions in the player's action set, and the strategy formula is a mixture of the trust in the opponent's habitual action and the player's own action.
Most existing work on synchronous games studies search methods under perfect information, but in practice many synchronous games are imperfect, so solving imperfect-information synchronous games as this method does is of greater practical significance. Multiple trees model the different players while retaining mapping relations between the trees; screening the sampled action information of the action library for effective actions avoids an overly large action space, which would make the game-tree search difficult, inefficient and of low quality. The mapping between the game trees during the search guarantees the synchronization of the state transitions and perfectly fits the characteristics of a synchronous game. Moreover, the game trees are searched after different samplings, and the solutions are compared and screened; the convergent decision method chooses the final value toward which the solutions of all game trees tend, guaranteeing the accuracy and reasonableness of the result, so that the final strategy is not decided one-sidedly by the error of a single tree's solution and the player's selection strategy and executed actions in the final game process are more reasonable. At the same time, by estimating information on both the opponent's and the player's level, the player does not select actions based on payoff alone: in a real game neither player chooses actions only by payoff, since habitual actions also arise from personal preference for certain actions. The opponent's habitual actions must therefore be taken into account: if, in some state, an opponent action appears with high frequency, that action is very likely a preferred or habitual action. Considering payoff and habitual actions together in the decision greatly improves the accuracy and flexibility of the strategy.
Embodiment:
Fig. 1 is the overall technical solution diagram. The imperfect-information synchronous game problem is modeled over the players' game process and expanded and solved from each player's angle; on this basis, the multi-tree Monte Carlo search variant for imperfect-information games has the sample, selection, expansion, simulation and update phases, and the search of the trees is interrelated and synchronous. Before the search, key information is extracted and the opponent pre-estimated, and the uncertain problem is solved during the search.
First, the legal actions are screened to reduce the action space, and the action information remaining after screening is sampled. As in Fig. 2, the
search process is entered; then the search results of the multiple Monte Carlo tree variants are combined, and the most representative
solution value is elected.
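The screening-then-sampling step can be sketched as follows. The threshold-based filter follows the description of S2; the function names, the income log, and the concrete incomes are illustrative, not taken from the patent text.

```python
import random

def screen_actions(action_log, threshold):
    """Keep only actions whose recorded income meets the threshold,
    i.e. the 'legal action' screening that shrinks the action space.
    action_log maps action -> observed income (illustrative structure)."""
    return [a for a, income in action_log.items() if income >= threshold]

def sample_actions(legal_actions, k):
    """Randomly sample k actions (without replacement) from the screened
    pool before each tree expansion; the sample size is fixed per draw."""
    k = min(k, len(legal_actions))
    return random.sample(legal_actions, k)

log = {"raise": 5.0, "fold": -1.0, "call": 2.0, "bluff": 0.5}
pool = screen_actions(log, threshold=0.0)   # "fold" is screened out
hand = sample_actions(pool, k=2)            # expansion uses only these
```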
The specific search process of a game tree is shown in Fig. 3. The player and the opponent each sample from the action information bank,
reducing the action space for the expansion of the game tree. Player and opponent select according to the sampled information. Suppose the player executes
action a1 while the opponent executes action b1. After the respective actions have been executed, tree 1 maps action b1 over from tree 2; likewise, a1 is mapped into tree 2 from the other tree. This online mapping during the search improves its efficiency.
Once both trees have carried out both sides' actions, the state transfer is completed and the income is calculated, preserving the synchronous character.
Because information is missing under imperfect information, the opponent must be estimated in order to improve strategy formulation: useful information
reflecting the regularities of the opponent's states is retained, and the opponent's behavior is estimated from this key information. As shown in Fig. 4, the player first
weighs decision schemes using its own empirical knowledge, the historical game states, and its own action incomes; it then takes the opponent's angle,
observes the opponent's information, and infers the behavioral preferences the opponent has. The two aspects are combined to estimate the opponent and construct the selection strategy formula.
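The two-sided estimation just described can be sketched as a small blend of the player's own incomes with a bonus for countering the opponent's habitual action. The `best_reply` table, the `weight` parameter, and the rock-paper-scissors actions are hypothetical placeholders, not part of the patent.

```python
from collections import Counter

def habitual_action(history):
    """The opponent's most frequent past action is treated as habitual."""
    return Counter(history).most_common(1)[0][0]

def choose(own_income, opponent_history, weight=0.5):
    """Score each of the player's actions by its own income, plus a
    bonus (weight) for the action that best answers the opponent's
    habitual action; return the highest-scoring action."""
    habit = habitual_action(opponent_history)
    best_reply = {"rock": "paper", "paper": "scissors", "scissors": "rock"}[habit]
    scored = {a: inc + (weight if a == best_reply else 0.0)
              for a, inc in own_income.items()}
    return max(scored, key=scored.get)

# Incomes alone are a three-way tie; the opponent's rock habit breaks it.
pick = choose({"rock": 0.1, "paper": 0.1, "scissors": 0.1},
              ["rock", "rock", "paper", "rock"])
```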
The foregoing is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto.
Any equivalent substitution or change that a person skilled in the art makes, within the technical scope disclosed by the present invention, according to its
technical scheme and inventive concept, shall be covered by the protection scope of the present invention.
Claims (4)
1. A multi-game-tree variant Monte Carlo search method for simultaneous games under incomplete information, characterized by comprising:
S1: for the player's strategy, inferring the remaining information from the known posterior information and screening the legal actions; then transferring the opponent-strategy estimation mode of perfect-information
games into inference and observation of the information in the imperfect-information game; outside the search, recording
the opponent's habitual action in each state and establishing a strategy auxiliary function;
S2: before the game tree is expanded, sampling all information and screening the legal actions: recording the actions the opponent
executed in earlier game play, setting a threshold according to actual needs, screening the action incomes within the threshold, marking the actions of
high income for the player and the opponent, and establishing an action information bank to store them;
S3: after the game-tree search, training again on the search result of each game tree to predict the final dominant strategy: combining the
search results, comparing the search results of these game trees expanded from different angles and with different sampled actions, and
using a convergent decision method to select the final value toward which all game-tree solving results tend;
S4: setting up two game trees, one per player, wherein the game trees of the different players are interlinked; in every round the
multiple game trees are expanded simultaneously, the sample content before expansion is identical, and each tree expands actions from its own player's angle according to its own
information set, while the opponent's actions are obtained directly by mapping from the other tree.
2. The multi-game-tree variant Monte Carlo search method for simultaneous games under incomplete information according to claim 1, characterized in that
sampling all information before the game tree is expanded and screening the legal actions specifically comprises a sampling phase, a selection
phase, an expansion phase, a simulation phase, and an update phase, embodied as follows:
sampling phase: before each expansion, randomly sampling actions from the information bank and expanding the game tree only over the sampled actions;
the types of actions sampled before each expansion are random, while the number of samples
drawn each time is the same;
selection phase: in the game tree of each player's angle, after the sampling phase is completed, selection begins from the screened action
information; each player selects its action from the information set of the parent node of its own tree, while the opponent's action is
mapped over from the selection result in the other tree;
expansion phase: generating the action transfer after both players' actions have been carried out;
simulation phase: in the game tree of each player's angle, after the state transfer has been generated and simulated, each tree evaluates only the
actions of the player whose angle that tree takes;
update phase: after evaluation, backtracking through each tree to update the action incomes and the visit counts of the actions.
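The five phases of claim 2 can be sketched as one search iteration. The dictionary node layout, the UCB selection rule, and the stub rollout are illustrative assumptions; the claim fixes the phases but not these concrete choices.

```python
import math
import random

def mcts_iteration(tree, info_bank, sample_size, c=1.4):
    """One iteration: sample, select, expand, simulate, update."""
    # Sampling phase: draw a fixed-size random subset of screened actions.
    actions = random.sample(info_bank, min(sample_size, len(info_bank)))
    node = tree["root"]

    # Selection phase: UCB over the sampled actions (one common choice).
    def ucb(a):
        st = node["children"].setdefault(a, {"n": 0, "w": 0.0})
        if st["n"] == 0:
            return float("inf")  # always try unvisited actions first
        return st["w"] / st["n"] + c * math.sqrt(math.log(node["n"] + 1) / st["n"])

    a = max(actions, key=ucb)
    child = node["children"][a]

    # Expansion + simulation phases: a stub rollout returns a random income.
    value = random.random()

    # Update phase: back up the income and the visit counts.
    child["n"] += 1
    child["w"] += value
    node["n"] += 1
    return a, value

tree = {"root": {"n": 0, "w": 0.0, "children": {}}}
for _ in range(20):
    mcts_iteration(tree, ["a1", "a2", "a3"], sample_size=2)
```

A real rollout would play the sampled joint actions to a terminal state and score it; here the payoff is random purely to keep the sketch self-contained.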
3. The multi-game-tree variant Monte Carlo search method for simultaneous games under incomplete information according to claim 1, characterized in that
the prediction and estimation of the decision are carried out from both the opponent's angle and the player's own angle:
S31: first, from the player's angle, using prior knowledge, taking the product of each action's selection probability and the action's probability
in this state as the numerator, and the probability of each state occurring as the denominator, to compute the probability of selecting a given
action in each player-specific state;
S32: from the opponent's angle, observing which actions the opponent selected most often in earlier play; such actions are called habitual actions, and
against them the player should select the corresponding action that obtains the best income;
S33: from the player's angle, combining the player's own income and prior knowledge to determine which action the player should select for a given
state.
4. The multi-game-tree variant Monte Carlo search method for simultaneous games under incomplete information according to claim 3, characterized in that
the prediction and estimation of the decision are carried out specifically in the following way:
let P(a) be the prior probability of action a appearing, and P(s|a) the probability of the state in which action a appeared in the previous game round;
the product of the two is normalized, and the normalized prior probability is combined with the action income U(s_i, a), choosing the maximum of the combination, so that
the larger the income U(s_i, a) of an action, the larger the probability of choosing it; the action the opponent most
often performed in earlier rounds is computed, and an adjusting parameter λ is added directly between the habitual action and the action income, i.e., the strategy formula is as follows:

N(a_i) = N(a_j)

where y is the mixed strategy and |A(I)| is the number of actions in the player's action set; the strategy formula is a mixture of the opponent's habitual action and
the player's trusted action.
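Since the published text only names the ingredients (P(a), P(s|a), the income U, and the adjusting parameter λ) while the displayed formula is garbled, the concrete combination below is an assumption: normalized prior-weighted income plus a λ bonus on the habit-related action.

```python
def mixed_strategy(actions, P_a, P_s_given_a, U, habit_reply, lam=0.3):
    """Score each action by its normalized P(a) * P(s|a) weight times its
    income U(a), adding lam to the action that answers the opponent's
    habit. This exact mixture is assumed, not taken from the patent."""
    weights = {a: P_a[a] * P_s_given_a[a] for a in actions}
    z = sum(weights.values()) or 1.0  # normalization denominator
    scores = {a: (weights[a] / z) * U[a] + (lam if a == habit_reply else 0.0)
              for a in actions}
    return max(scores, key=scores.get)

best = mixed_strategy(
    ["a1", "a2"],
    P_a={"a1": 0.6, "a2": 0.4},
    P_s_given_a={"a1": 0.5, "a2": 0.5},
    U={"a1": 1.0, "a2": 2.0},
    habit_reply="a2",
)
```

With these numbers the normalized weights are 0.6 and 0.4; a2 scores 0.4 * 2.0 + 0.3 = 1.1 against a1's 0.6, so the habit bonus and the higher income together select a2.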
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910860992.8A CN110489668A (en) | 2019-09-11 | 2019-09-11 | Synchronous game monte carlo search sets mutation method more under non-complete information |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110489668A true CN110489668A (en) | 2019-11-22 |
Family
ID=68557628
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910860992.8A Pending CN110489668A (en) | 2019-09-11 | 2019-09-11 | Synchronous game monte carlo search sets mutation method more under non-complete information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110489668A (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111905373A (en) * | 2020-07-23 | 2020-11-10 | 深圳艾文哲思科技有限公司 | Artificial intelligence decision method and system based on game theory and Nash equilibrium |
CN112463992A (en) * | 2021-02-04 | 2021-03-09 | 中至江西智能技术有限公司 | Decision-making auxiliary automatic question-answering method and system based on knowledge graph in mahjong field |
CN112463992B (en) * | 2021-02-04 | 2021-06-11 | 中至江西智能技术有限公司 | Decision-making auxiliary automatic question-answering method and system based on knowledge graph in mahjong field |
CN116039956A (en) * | 2022-11-02 | 2023-05-02 | 哈尔滨工业大学 | Spacecraft sequence game method, device and medium based on Monte Carlo tree search |
CN116039956B (en) * | 2022-11-02 | 2023-11-14 | 哈尔滨工业大学 | Spacecraft sequence game method, device and medium based on Monte Carlo tree search |
CN116039957A (en) * | 2022-12-30 | 2023-05-02 | 哈尔滨工业大学 | Spacecraft online game planning method, device and medium considering barrier constraint |
CN116039957B (en) * | 2022-12-30 | 2024-01-30 | 哈尔滨工业大学 | Spacecraft online game planning method, device and medium considering barrier constraint |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110489668A (en) | Synchronous game monte carlo search sets mutation method more under non-complete information | |
Ye et al. | Towards playing full moba games with deep reinforcement learning | |
Winands et al. | Monte Carlo tree search in lines of action | |
Samothrakis et al. | Fast approximate max-n monte carlo tree search for ms pac-man | |
CN108985458A (en) | A kind of double tree monte carlo search algorithms of sequential synchronous game | |
Gaina et al. | Population seeding techniques for rolling horizon evolution in general video game playing | |
Ramanujan et al. | Understanding sampling style adversarial search methods | |
Schauenberg | Opponent modelling and search in poker | |
Baier et al. | Guiding multiplayer MCTS by focusing on yourself | |
CN109002893A (en) | A kind of sequential synchronous sequence monte carlo search algorithm | |
Zhang et al. | AlphaZero | |
Heinrich et al. | Self-play Monte-Carlo tree search in computer poker | |
Barthet et al. | Go-blend behavior and affect | |
CN110727870A (en) | Novel single-tree Monte Carlo search method for sequential synchronous game | |
Fu | Markov decision processes, AlphaGo, and Monte Carlo tree search: Back to the future | |
Dobre et al. | Online learning and mining human play in complex games | |
Szczepański et al. | Case-based reasoning for improved micromanagement in Real-time strategy games. | |
Schadd et al. | Addressing NP-complete puzzles with Monte-Carlo methods | |
Maes et al. | Monte carlo search algorithm discovery for single-player games | |
Schaeffer et al. | Learning to play strong poker | |
Leece et al. | Sequential pattern mining in Starcraft: Brood War for short and long-term goals | |
Dobre et al. | Exploiting action categories in learning complex games | |
Liu et al. | An improved minimax-Q algorithm based on generalized policy iteration to solve a Chaser-Invader game | |
Gaina et al. | Project Thyia: A forever gameplayer | |
Ameneyro et al. | Playing carcassonne with monte carlo tree search |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20191122 |