CN111047053A - Monte Carlo search game decision method and system for opponents with unknown strategies

Info

Publication number
CN111047053A
Authority
CN
China
Prior art keywords
decision
game
monte carlo
unknown
situation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911142537.0A
Other languages
Chinese (zh)
Inventor
芦维宁
杨君
赵千川
梁斌
谢鸣洲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN201911142537.0A
Publication of CN111047053A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 - Machine learning
    • G06N 20/10 - Machine learning using kernel methods, e.g. support vector machines [SVM]


Abstract

The invention discloses a Monte Carlo search game decision method and system for opponents with unknown strategies. The method comprises the following steps: designing a state evaluation function based on expert experience, according to the rules of the air-combat game, to estimate the situations of the enemy side and our side in the engagement; and providing a Monte Carlo search algorithm fused with the minimax game idea, which outputs game decisions when facing an air-combat opponent whose decision method is unknown. By borrowing the game-strategy idea of alternating decisions between the two sides from the minimax algorithm, the Monte Carlo search game decision method can fully consider the various possibilities of the opponent's decisions and make corresponding counter-decisions without needing to know the opponent's strategy, thereby improving the win rate in the confrontation.

Description

Monte Carlo search game decision method and system for opponents with unknown strategies
Technical Field
The invention relates to the technical field of game decision-making, and in particular to a game decision method and system for an opponent with an unknown strategy in a one-to-one air-combat game environment.
Background
Machine gaming is a discipline that studies how to solve adversarial decision problems using machine learning methods. With the rapid development of artificial intelligence technology, machine game theory and its applications have spread into many fields of society, such as politics, finance and the military. For example, the AlphaGo and AlphaGo Zero systems introduced by DeepMind in 2016 and 2017 defeated top human players at the game of Go, which demonstrated the promise of the machine gaming discipline and its potential to provide decision support to humans in many areas of future life.
Depending on whether the opponent's information is complete during play, game problems can be roughly divided into perfect-information games and imperfect-information games. Under imperfect information, the missing game information increases the difficulty of solving the problem. In practical applications, however, missing game information is the norm for a variety of subjective and objective reasons, so studying games with imperfect information has practical significance.
The one-to-one air-combat game is a typical game decision problem, and an adversarial environment in which the opponent's strategy is unknown is its usual setting, so it is a very representative experimental platform for imperfect-information game algorithms.
Disclosure of Invention
The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, one objective of the present invention is to provide a Monte Carlo search game decision method for an opponent with an unknown strategy, which can fully consider the various possibilities of the opponent's decisions and make corresponding counter-decisions to improve the win rate, without needing to learn the opponent's game strategy.
Another objective of the present invention is to provide a Monte Carlo search game decision system for an opponent with an unknown strategy.
In order to achieve the above object, an embodiment of the present invention provides a Monte Carlo search game decision method for an opponent with an unknown strategy, including the following steps: designing a composite state evaluation function in the air-combat game environment, taking as reference criteria the influence of the current situation of the two sides on the subsequent confrontation and the physical constraints of the two sides; describing, through the composite state evaluation function and from multiple dimensions, how the situations of the two sides change as the confrontation progresses; and outputting game decisions, according to these changes, through a Monte Carlo search algorithm fused with the minimax game idea when facing an air-combat opponent whose decision method is unknown.
According to the Monte Carlo search game decision method for opponents with unknown strategies of the embodiment of the invention, by borrowing the game-strategy idea of alternating decisions between the two sides from the minimax algorithm, our decision rounds select the action with the largest upper confidence bound under our own evaluation system, and the enemy decision rounds select the action with the largest upper confidence bound under the enemy evaluation system. On the premise of not knowing the opponent's game strategy, the various possibilities of the opponent's decisions can thus be fully considered and corresponding counter-decisions made, improving the win rate in the confrontation.
In addition, the Monte Carlo search game decision method for opponents with unknown strategies according to the embodiment of the invention may also have the following additional technical features:
Further, in an embodiment of the present invention, the composite state evaluation function includes an immediate return term, a persistent return term and a physical-relationship rationality constraint term.
Further, in one embodiment of the present invention, the persistent return term includes: a return calculated from the duration of the superior or inferior situation of the two sides, wherein the longer one side's superiority or inferiority lasts, the larger the reward/penalty value given by the situation evaluation function.
Further, in an embodiment of the present invention, the method further includes: selecting, in our decision round, the action with the largest upper confidence bound under our evaluation system, and selecting, in the enemy decision round, the action with the largest upper confidence bound under the enemy evaluation system.
In order to achieve the above object, another embodiment of the present invention provides a Monte Carlo search game decision system for an opponent with an unknown strategy, including: a design module for designing a composite state evaluation function in the air-combat game environment, taking as reference criteria the influence of the current situation of the two sides on the subsequent confrontation and the physical constraints of the two sides; a description module for describing, through the composite state evaluation function and from multiple dimensions, how the situations of the two sides change as the confrontation progresses; and an output module for outputting game decisions, according to these changes, through a Monte Carlo search algorithm fused with the minimax game idea when facing an air-combat opponent whose decision method is unknown.
According to the Monte Carlo search game decision system for opponents with unknown strategies of the embodiment of the invention, by borrowing the game-strategy idea of alternating decisions between the two sides from the minimax algorithm, our decision rounds select the action with the largest upper confidence bound under our own evaluation system, and the enemy decision rounds select the action with the largest upper confidence bound under the enemy evaluation system. On the premise of not needing to know the opponent's game strategy, the various possibilities of the opponent's decisions can thus be fully considered and corresponding counter-decisions made, improving the win rate in the confrontation.
In addition, the Monte Carlo search game decision system for opponents with unknown strategies according to the above embodiment of the present invention may also have the following additional technical features:
Further, in an embodiment of the present invention, the composite state evaluation function includes an immediate return term, a persistent return term and a physical-relationship rationality constraint term.
Further, in one embodiment of the present invention, the persistent return term includes: a return calculated from the duration of the superior or inferior situation of the two sides, wherein the longer one side's superiority or inferiority lasts, the larger the reward/penalty value given by the situation evaluation function.
Optionally, in an embodiment of the present invention, the system further includes: a selection module for selecting, in our decision round, the action with the largest upper confidence bound under our evaluation system, and for selecting, in the enemy decision round, the action with the largest upper confidence bound under the enemy evaluation system.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
FIG. 1 is a flow chart of a Monte Carlo search game decision method for an opponent with an unknown strategy according to one embodiment of the invention;
FIG. 2 is a flow chart of the model construction of the Monte Carlo search game decision method according to one embodiment of the present invention;
FIG. 3 is a schematic diagram of the aircraft combat model in the one-to-one air-combat game platform employed in accordance with one embodiment of the present invention;
FIG. 4 is a diagram illustrating the variable definitions associated with the positional relationship of the two sides in accordance with an embodiment of the invention;
FIG. 5 is a schematic structural diagram of a Monte Carlo search game decision system for an opponent with an unknown strategy according to an embodiment of the invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are illustrative and intended to be illustrative of the invention and are not to be construed as limiting the invention.
The Monte Carlo search game decision method and system for opponents with unknown strategies proposed by the embodiments of the invention are described below with reference to the accompanying drawings, starting with the method.
FIG. 1 is a flow chart of a Monte Carlo search game decision method for an opponent with an unknown strategy according to an embodiment of the present invention.
As shown in FIG. 1, the Monte Carlo search game decision method for an opponent with an unknown strategy includes the following steps:
In step S101, a composite state evaluation function is designed in the air-combat game environment, taking as reference criteria the influence of the current situation of the two sides on the subsequent confrontation and the physical constraints of the two sides.
In step S102, how the situations of the two sides change as the confrontation progresses is described from multiple dimensions by the composite state evaluation function.
That is to say, as shown in FIG. 2, with the influence of the current situation of the two sides and their physical constraints as the main reference criteria, a composite state evaluation function is designed, and the changes in the two sides' situations as the engagement progresses are described from multiple dimensions, which ensures that the optimization direction of our side's game decision method is reasonable.
Further, designing the state evaluation function is one of the important links in realizing the game decision method. On the basis of comprehensively considering the influence of the two sides' situation on the final result in terms of return timeliness, physical constraints and other aspects, the embodiment of the invention proposes a composite state evaluation function comprising an immediate return term, a persistent return term and a physical-relationship rationality constraint term.
The immediate return term is a real-time evaluation of the current positions of the two sides; the persistent return term is a return given on the basis of how long the advantage or disadvantage situation of the two sides has lasted; and the physical-relationship rationality constraint term is a penalty for a collision between the two aircraft.
Further, the persistent return term includes: a return calculated from the duration of the superior or inferior situation of the two sides, wherein the longer one side's superiority or inferiority lasts, the larger the reward/penalty value given by the situation evaluation function.
In step S103, according to these changes, game decisions are output through a Monte Carlo search algorithm fused with the minimax game idea when facing an air-combat opponent whose decision method is unknown.
Further, in an embodiment of the present invention, the method further includes: selecting, in our decision round, the action with the largest upper confidence bound under our evaluation system, and selecting, in the enemy decision round, the action with the largest upper confidence bound under the enemy evaluation system.
Specifically, through the Monte Carlo search algorithm fused with the minimax game idea, our decision rounds select the action with the largest upper confidence bound under our evaluation system, and the enemy decision rounds select the action with the largest upper confidence bound under the enemy evaluation system. With the opponent's strategy unknown, the various possible influences of both sides' decisions on the subsequent confrontation are fully considered, so that corresponding decisions are made and the win rate of the confrontation is improved.
The embodiment of the invention will be further explained, without being limited, by the exemplary implementation described below with reference to the drawings.
The two-dimensional air-combat simulation platform adopted by the embodiment of the invention comprises a dynamic model and a combat model of the aircraft. In the two-dimensional plane, the state of each aircraft is represented by the five-tuple s = (x, y, v, θ, σ), whose variables have the following meanings:
aircraft position (x, y): the position of the aircraft in the top view;
aircraft velocity v: the current flight speed of the aircraft;
aircraft yaw angle θ: the current nose heading of the aircraft;
fuselage roll angle σ: the angle by which the fuselage is rolled about its longitudinal axis away from the horizontal plane.
These variables have their respective range limits, matched to the game simulation platform adopted in the embodiment of the invention; since the details of the simulation platform are not part of the invention, they are not expanded on here.
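For concreteness, a minimal sketch of how such a five-tuple state could be represented is given below; the field names, units and range limits are assumptions, since the actual limits depend on the simulation platform used in the embodiment.

```python
from dataclasses import dataclass
import math

@dataclass
class AircraftState:
    """Five-tuple s = (x, y, v, theta, sigma) for one aircraft in the
    two-dimensional air-combat simulation (illustrative sketch)."""
    x: float       # position, top-view x coordinate
    y: float       # position, top-view y coordinate
    v: float       # current flight speed
    theta: float   # yaw angle: current nose heading, in radians
    sigma: float   # roll angle about the longitudinal axis, in radians

    def clipped(self, v_min=50.0, v_max=400.0, sigma_max=math.radians(80)):
        """Return a copy with speed and roll clipped to assumed platform limits."""
        return AircraftState(
            self.x, self.y,
            min(max(self.v, v_min), v_max),
            self.theta % (2 * math.pi),
            min(max(self.sigma, -sigma_max), sigma_max),
        )
```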
As shown in FIG. 3, each aircraft has a fan-shaped attack zone directly in front of it, with length r_atk and angle θ_atk, and a fan-shaped vulnerable blind zone directly behind it, with length r_df and angle θ_df.
As shown in FIG. 4, the relative position of the two sides is described by their centroid distance r, azimuth angle AA and antenna deflection angle ATA; the relative position (r, AA, ATA) can be calculated from the positions of the two aircraft.
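As an illustration of how (r, AA, ATA) might be computed from two aircraft states, the sketch below uses the line of sight between the two aircraft; the exact sign and reference conventions of AA and ATA are assumptions, since the patent defines them only through FIG. 4.

```python
import math

def relative_geometry(own, enemy):
    """Compute (r, AA, ATA) of the own aircraft relative to the enemy.

    r   : centroid distance between the two aircraft
    ATA : angle between the own nose direction and the line of sight to the enemy
    AA  : angle between the enemy heading and the same line of sight
    Angle conventions are illustrative assumptions (absolute values, radians).
    """
    dx, dy = enemy.x - own.x, enemy.y - own.y
    r = math.hypot(dx, dy)
    los = math.atan2(dy, dx)            # bearing of the line of sight

    def angle_diff(a, b):
        # smallest absolute difference between two angles
        d = (a - b + math.pi) % (2 * math.pi) - math.pi
        return abs(d)

    ata = angle_diff(own.theta, los)
    aa = angle_diff(enemy.theta, los)
    return r, aa, ata
```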
Consider the one-to-one engagement of two aircraft. In the two-dimensional plane, the ultimate goal of each aircraft's actions is: 1) to place the enemy aircraft inside its own attack zone; and 2) at the same time to be positioned inside the enemy aircraft's rear blind zone. If both conditions are met simultaneously, the aircraft can be considered to have entered the advantageous state in which it can attack the enemy while being difficult to counterattack.
The essence of the Monte Carlo tree search (MCTS) algorithm is to obtain more information through sampling in order to approximate the optimal solution. Discretizing the continuous decision space of the two-dimensional air-combat problem yields a discrete decision set:

D = {d_1, d_2, d_3, ..., d_n}

where d_i is the i-th discretized action for controlling the aircraft.
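Purely as an illustration, the discrete decision set D could consist of a handful of commanded roll angles held for one decision interval; the specific actions and their number n below are assumptions, not values from the patent.

```python
import math

# Illustrative discrete decision set D = {d_1, ..., d_n}: each action is a
# commanded roll angle (and hence turn direction) held for one decision step.
DECISIONS = [
    math.radians(-60),   # hard left roll
    math.radians(-30),   # gentle left roll
    0.0,                 # keep wings level
    math.radians(30),    # gentle right roll
    math.radians(60),    # hard right roll
]
n = len(DECISIONS)       # the patent's n; here n = 5 by assumption
```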
At each decision time t, both sides make decisions based on the current state (s_self^t, s_enemy^t), each picking an action from the decision set:

d_self^t ∈ D,  d_enemy^t ∈ D

At the next decision time t+1, the state transitions according to the decisions made, giving

(s_self^{t+1}, s_enemy^{t+1}) = f(s_self^t, s_enemy^t, d_self^t, d_enemy^t)
The state at the next moment after the decisions are made can be determined by simulation if the transfer function f is known. Given any state (s_self, s_enemy), there are n² possible combinations of decisions that the two sides can make; that is, starting from one state, n² new states can be reached through different decision combinations.
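A sketch of how, given a known transfer function f, the n² joint successor states could be enumerated by simulation is shown below; the simple kinematic step (reusing the AircraftState and DECISIONS sketches above) is an assumption standing in for the platform's actual dynamics.

```python
import math

def step(state, decision, dt=1.0):
    """Illustrative one-step transfer function for a single aircraft: the
    commanded roll sets a turn rate, which updates heading and position.
    This kinematic model is an assumed stand-in for the platform dynamics."""
    g = 9.81
    turn_rate = g * math.tan(decision) / max(state.v, 1e-6)
    theta = (state.theta + turn_rate * dt) % (2 * math.pi)
    return AircraftState(
        state.x + state.v * math.cos(theta) * dt,
        state.y + state.v * math.sin(theta) * dt,
        state.v,
        theta,
        decision,
    )

def successors(s_self, s_enemy, decisions):
    """Enumerate all n^2 joint successor states reachable from (s_self, s_enemy)."""
    return [
        (step(s_self, d_self), step(s_enemy, d_enemy), d_self, d_enemy)
        for d_self in decisions
        for d_enemy in decisions
    ]
```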
Under the ideal condition of unlimited computing resources, starting from any initial state at t = 0, all decisions could be exhaustively enumerated to obtain all possible states at t = 1; exhaustively enumerating the decisions for each possible state at t = 1 then gives all possible states at t = 2, and so on, either until a time limit is reached or until final states corresponding to victory or defeat are obtained, from which one could backtrack to determine the specific decision to take at each decision time.
This approach essentially traverses a decision tree whose root node is the initial state in order to derive the optimal decision. However, it has two limitations:
First, computing resources are limited, so the number of states that can be explored by simulation is limited; the depth and breadth of exploration in the decision tree are therefore restricted, and it is difficult to explore all the way to the terminal states.
Second, each decision step is composed of the decisions of both sides, (d_self^t, d_enemy^t). Our side can only decide d_self^t and cannot control the opponent's d_enemy^t; that is, because a game is being played, our side cannot completely control how the decision state evolves.
For the first limitation, the embodiment of the present invention uses the UCT (Upper Confidence Bound applied to Trees) algorithm to balance the depth and breadth of decision-tree exploration under limited computing resources. UCT is a classic game-tree search algorithm, and since its content is outside the scope of the present invention it is not expanded on here. For the second limitation, the embodiment of the present invention proposes a corresponding improvement to MCTS.
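As background, the UCT selection rule scores each candidate child with the standard UCB1 formula, trading off average return against an exploration bonus; a minimal sketch is given below (the function name and argument layout are assumptions).

```python
import math

def ucb_score(total_reward, visits, parent_visits, c):
    """Standard UCB1 score used by UCT: average reward plus an exploration
    bonus that grows for rarely visited children."""
    if visits == 0:
        return float("inf")          # always try unvisited children first
    return total_reward / visits + c * math.sqrt(math.log(parent_visits) / visits)
```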
During its search over the game strategy space, the UCT algorithm needs to evaluate the nodes to be expanded through a scoring system. In the MCTS algorithm of this embodiment, scoring is performed by an evaluation function based on expert experience. The evaluation function is the basis of the UCT search and needs to reflect the quality of the current state.
Ideally, the evaluation function would also reflect the quality of the subsequent states reachable from the current state. This property is not necessary, however, because the UCT algorithm can revise the score of a state by expanding its nodes. Therefore, the MCTS algorithm can achieve the expected effect with an evaluation function that only reflects the current state.
The evaluation function actually adopted in the embodiment of the invention is denoted R(s_self, s_enemy) and can be split into three parts:

R(s_self, s_enemy) = R_imd + R_adv + R_col
R_imd is an immediate return based on the current relative positions of the two sides; its concrete expression, reproduced only as a formula image in the original publication, is a function of r, AA and ATA, the relative orientation parameters of the two sides, and of R_d, the collision radius of the aircraft.
R_adv is a return based on the duration of the advantage state of either side; its concrete form is likewise reproduced only as a formula image in the original publication.
the IsAdvance is used for judging whether the I plane or the enemy plane is in a judgment function of an advantage state, and the judgment conditions are as follows:
1) the distance between the two enemy and my machines is less than a certain range (relevant to the adopted confrontation simulation platform);
2) the AA value of the two machines is less than 60 degrees;
3) the ATA value of the two machines is less than 120 degrees;
t is the duration of time after the enemy or my plane enters the dominant state.
R_col is the penalty applied when the two aircraft collide; its concrete form is likewise reproduced only as a formula image in the original publication.
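Because the algebraic forms of R_imd, R_adv and R_col appear only as images in the original, the sketch below uses placeholder forms that merely respect the stated structure: an immediate term driven by (r, AA, ATA), a duration-scaled advantage term gated by the IsAdvance conditions, and a collision penalty inside the radius R_d. All constants and exact functional shapes are assumptions.

```python
import math

def is_advantage(r, aa, ata, r_max=500.0):
    """Judgment function IsAdvance: distance below a platform-dependent
    range (r_max is an assumed value), AA below 60 degrees, ATA below 120 degrees."""
    return r < r_max and aa < math.radians(60) and ata < math.radians(120)

def evaluate(r, aa, ata, advantage_duration, r_d=10.0,
             w_adv=0.1, collision_penalty=-100.0):
    """Composite evaluation R = R_imd + R_adv + R_col (placeholder forms)."""
    # R_imd: immediate return from the current relative geometry; smaller
    # angles off the enemy's tail score higher (assumed form).
    r_imd = 1.0 - (aa + ata) / (2.0 * math.pi)
    # R_adv: return that grows with how long the advantage state has lasted.
    r_adv = w_adv * advantage_duration if is_advantage(r, aa, ata) else 0.0
    # R_col: penalty if the two aircraft are within the collision radius R_d.
    r_col = collision_penalty if r < r_d else 0.0
    return r_imd + r_adv + r_col
```

A symmetric penalty term for the case where the enemy holds the advantage would mirror r_adv with a negative weight; it is omitted here for brevity.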
if the adversary strategy is unknown, in a decision tree in which both the enemy and the my participate in decision making, the nodes at the odd level make decisions d for the enemyselfThis decision is made only so that my state sselfAfter the state is transferred, the decision tree is expanded one layer; when the enemy comes to the even layer, the enemy carries out decision again under the condition that the enemy knows the decision of the enemy. In the decision tree, the paths of the odd layers are determined by the policy of our party, the policies of the even layers are determined by the policy of the enemy party, and both parties cannot completely control the expansion direction of the decision tree. The MiniMax (MiniMax) algorithm proposed by McGrew provides a way to make decisions in this case. The MiniMax algorithm is based on finite-step search and evaluation:
in decision making, a search depth is determined first, and then each node of the decision tree at the search depth is traversed. The scores of these nodes reflect the state of the war after a finite look-ahead. At the odd level of the decision tree, my party will choose an action that is favorable to my party; on even layers, the adversary chooses the action that is not good for my party as much as possible. Thus, when both the enemy and the my party select the action, the user wants to 'make the next state transition to the state favorable to my party as far as possible no matter how the other party selects'. Thus, the action is evaluated as "worst result per action" and "best result among worst results" is selected to obtain assurance of the results.
The MiniMax strategy described above is inherently very conservative and limited. Nevertheless, borrowing its idea that our side decides at the odd layers and the enemy decides at the even layers, the embodiment of the present invention applies the UCT algorithm when expanding the decision tree:
For each node state, each of the two sides has its own evaluation, and these evaluations are back-propagated to every node on the decision path, so each node accumulates returns under both sides' evaluations. During selection and expansion, at the odd layers our side selects the action with the largest upper confidence bound computed from our own evaluation; at the even layers, the enemy selects the action with the largest upper confidence bound computed from the enemy's evaluation. In this way the MiniMax idea is fused with MCTS, yielding an MCTS game decision algorithm for the case where the opponent's strategy is unknown. The specific algorithm flow is as follows:
Input:
Root node state s_root = (s_self, s_enemy); decision set D = {d_1, d_2, ..., d_n}; number of expansion nodes N; our evaluation function R_self; enemy evaluation function R_enemy; constant C balancing exploration depth and breadth.
(The step-by-step pseudocode of the algorithm is reproduced only as images in the original publication.)
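Since the pseudocode itself is reproduced only as images in the original, the following sketch reconstructs the described flow under stated assumptions: the tree alternates between our (odd) layers and enemy (even) layers, each node accumulates returns under both evaluation functions, selection uses the upper confidence bound computed from the statistics of the side to move, and after N expansions the most-visited root action is returned. All class, function and field names here are illustrative assumptions.

```python
import math, random

class Node:
    def __init__(self, s_self, s_enemy, own_turn, parent=None, action=None):
        self.s_self, self.s_enemy = s_self, s_enemy
        self.own_turn = own_turn            # True on odd (own) layers
        self.parent, self.action = parent, action
        self.children = []
        self.visits = 0
        self.q_self = 0.0                   # accumulated return under R_self
        self.q_enemy = 0.0                  # accumulated return under R_enemy

def ucb(child, parent_visits, c, own_turn):
    if child.visits == 0:
        return float("inf")
    q = child.q_self if own_turn else child.q_enemy
    return q / child.visits + c * math.sqrt(math.log(parent_visits) / child.visits)

def mcts_decision(s_root_self, s_root_enemy, decisions, n_expansions,
                  r_self, r_enemy, step_fn, c=1.4):
    """MCTS fused with the MiniMax idea: own layers select by the own-side
    evaluation, enemy layers by the enemy-side evaluation (illustrative)."""
    root = Node(s_root_self, s_root_enemy, own_turn=True)

    for _ in range(n_expansions):
        node = root
        # 1. Selection: descend while fully expanded, taking the child with
        #    the largest UCB under the evaluation of the side to move.
        while node.children and len(node.children) == len(decisions):
            node = max(node.children,
                       key=lambda ch: ucb(ch, node.visits, c, node.own_turn))
        # 2. Expansion: add one untried action of the side to move.
        tried = {ch.action for ch in node.children}
        untried = [d for d in decisions if d not in tried]
        if untried:
            d = random.choice(untried)
            if node.own_turn:
                child_state = (step_fn(node.s_self, d), node.s_enemy)
            else:
                child_state = (node.s_self, step_fn(node.s_enemy, d))
            node = Node(*child_state, own_turn=not node.own_turn,
                        parent=node, action=d)
            node.parent.children.append(node)
        # 3. Evaluation: score the new state under both evaluation systems.
        v_self = r_self(node.s_self, node.s_enemy)
        v_enemy = r_enemy(node.s_self, node.s_enemy)
        # 4. Backpropagation: both returns are propagated along the path.
        while node is not None:
            node.visits += 1
            node.q_self += v_self
            node.q_enemy += v_enemy
            node = node.parent

    # Output: the root (own) action of the most visited child.
    best = max(root.children, key=lambda ch: ch.visits)
    return best.action
```

In use, r_self and r_enemy would wrap the composite evaluation computed from each side's point of view, and step_fn would be the platform's one-step transfer function f; a call such as mcts_decision(s_self, s_enemy, DECISIONS, N, r_self, r_enemy, step_fn) then yields our action for the current decision time.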
To demonstrate the effectiveness of the method, the embodiment of the invention was tested on the one-to-one air-combat game platform. The experimental setting was as follows: our side uses the proposed Monte Carlo game decision method for the case of an unknown opponent strategy, while the enemy uses the Minimax algorithm as its game decision method. With identical aircraft models for both sides, the test was run from 2048 initial states (initial positions and orientations of the two sides). The test results are shown in the following table:
(The win/loss statistics of the test are reproduced only as table images in the original publication.)
from the above experimental results, compared with the classic Minimax game decision algorithm, the game algorithm proposed by the embodiment of the invention has more or less victory results, and the effectiveness of the method proposed by the embodiment of the invention is shown.
According to the Monte Carlo search game decision method for opponents with unknown strategies of the embodiment of the invention, by borrowing the game-strategy idea of alternating decisions between the two sides from the minimax algorithm, our decision rounds select the action with the largest upper confidence bound under our own evaluation system and the enemy decision rounds select the action with the largest upper confidence bound under the enemy evaluation system. On the premise of not obtaining the opponent's game strategy, the various possibilities of the opponent's decisions can thus be fully considered and corresponding counter-decisions made, improving the win rate.
The Monte Carlo search game decision system for opponents with unknown strategies proposed by the embodiment of the invention is described next with reference to the accompanying drawings.
FIG. 5 is a structural diagram of a Monte Carlo search game decision system for an opponent with an unknown strategy according to an embodiment of the invention.
As shown in FIG. 5, the Monte Carlo search game decision system 10 for an opponent with an unknown strategy includes: a design module 100, a description module 200 and an output module 300.
The design module 100 is configured to design a composite state evaluation function in the air-combat game environment, taking as reference criteria the influence of the current situation of the two sides on the subsequent confrontation and the physical constraints of the two sides. The description module 200 is configured to describe, through the composite state evaluation function and from multiple dimensions, how the situations of the two sides change as the confrontation progresses. The output module 300 is configured to output game decisions, according to these changes, through a Monte Carlo search algorithm fused with the minimax game idea when facing an air-combat opponent whose decision method is unknown.
Further, in an embodiment of the present invention, the composite state evaluation function includes an immediate return term, a persistent return term and a physical-relationship rationality constraint term.
Further, in one embodiment of the present invention, the persistent return term includes: a return calculated from the duration of the superior or inferior situation of the two sides, wherein the longer one side's superiority or inferiority lasts, the larger the reward/penalty value given by the situation evaluation function.
Further, in an embodiment of the present invention, the system further includes: a selection module 400 configured to select, in our decision round, the action with the largest upper confidence bound under our evaluation system, and to select, in the enemy decision round, the action with the largest upper confidence bound under the enemy evaluation system.
According to the Monte Carlo search game decision system for opponents with unknown strategies of the embodiment of the invention, by borrowing the game-strategy idea of alternating decisions between the two sides from the minimax algorithm, our decision rounds select the action with the largest upper confidence bound under our evaluation system and the enemy decision rounds select the action with the largest upper confidence bound under the enemy evaluation system. On the premise of not needing to obtain the opponent's game strategy, the various possibilities of the opponent's decisions can thus be fully considered and corresponding counter-decisions made, improving the win rate.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims (8)

1. A Monte Carlo search game decision method for opponents with unknown strategies, characterized by comprising the following steps:
designing a composite state evaluation function in the air-combat game environment, taking as reference criteria the influence of the current situation of the two sides on the subsequent confrontation and the physical constraints of the two sides;
describing, through the composite state evaluation function and from multiple dimensions, how the situations of the two sides change as the confrontation progresses; and
outputting game decisions, according to these changes, through a Monte Carlo search algorithm fused with the minimax game idea when facing an air-combat opponent whose decision method is unknown.
2. The Monte Carlo search game decision method for opponents with unknown strategies according to claim 1, wherein the composite state evaluation function comprises an immediate return term, a persistent return term and a physical-relationship rationality constraint term.
3. The Monte Carlo search game decision method for opponents with unknown strategies according to claim 2, wherein the persistent return term comprises:
a return calculated from the duration of the superior or inferior situation of the two sides, wherein the longer one side's superiority or inferiority lasts, the larger the reward/penalty value given by the situation evaluation function.
4. The Monte Carlo search game decision method for opponents with unknown strategies according to claim 1, further comprising:
selecting, in our decision round, the action with the largest upper confidence bound under our evaluation system, and selecting, in the enemy decision round, the action with the largest upper confidence bound under the enemy evaluation system.
5. A Monte Carlo search game decision system for opponents with unknown strategies, characterized by comprising:
a design module for designing a composite state evaluation function in the air-combat game environment, taking as reference criteria the influence of the current situation of the two sides on the subsequent confrontation and the physical constraints of the two sides;
a description module for describing, through the composite state evaluation function and from multiple dimensions, how the situations of the two sides change as the confrontation progresses; and
an output module for outputting game decisions, according to these changes, through a Monte Carlo search algorithm fused with the minimax game idea when facing an air-combat opponent whose decision method is unknown.
6. The Monte Carlo search game decision system for opponents with unknown strategies according to claim 5, wherein the composite state evaluation function comprises an immediate return term, a persistent return term and a physical-relationship rationality constraint term.
7. The Monte Carlo search game decision system for opponents with unknown strategies according to claim 6, wherein the persistent return term comprises:
a return calculated from the duration of the superior or inferior situation of the two sides, wherein the longer one side's superiority or inferiority lasts, the larger the reward/penalty value given by the situation evaluation function.
8. The Monte Carlo search game decision system for opponents with unknown strategies according to claim 6, further comprising:
a selection module for selecting, in our decision round, the action with the largest upper confidence bound under our evaluation system, and for selecting, in the enemy decision round, the action with the largest upper confidence bound under the enemy evaluation system.
CN201911142537.0A 2019-11-20 2019-11-20 Monte Carlo search game decision method and system for opponents with unknown strategies Pending CN111047053A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911142537.0A CN111047053A (en) Monte Carlo search game decision method and system for opponents with unknown strategies

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911142537.0A CN111047053A (en) Monte Carlo search game decision method and system for opponents with unknown strategies

Publications (1)

Publication Number Publication Date
CN111047053A true CN111047053A (en) 2020-04-21

Family

ID=70232482

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911142537.0A Pending CN111047053A (en) Monte Carlo search game decision method and system for opponents with unknown strategies

Country Status (1)

Country Link
CN (1) CN111047053A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112329348A (en) * 2020-11-06 2021-02-05 东北大学 Intelligent decision-making method for military countermeasure game under incomplete information condition
CN112612298A (en) * 2020-11-27 2021-04-06 合肥工业大学 Multi-target game method and device for multi-unmanned aerial vehicle tactical decision under countermeasure environment
CN113599832A (en) * 2021-07-20 2021-11-05 北京大学 Adversary modeling method, apparatus, device and storage medium based on environment model

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107463094A (en) * 2017-07-13 2017-12-12 江西洪都航空工业集团有限责任公司 A kind of multiple no-manned plane air battle dynamic game method under uncertain information
CN108446801A (en) * 2018-03-22 2018-08-24 成都大象分形智能科技有限公司 A kind of more people's Under Asymmetry Information game decision making systems

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107463094A (en) * 2017-07-13 2017-12-12 江西洪都航空工业集团有限责任公司 A kind of multiple no-manned plane air battle dynamic game method under uncertain information
CN108446801A (en) * 2018-03-22 2018-08-24 成都大象分形智能科技有限公司 A kind of more people's Under Asymmetry Information game decision making systems

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
何旭 et al.: "Air combat maneuver decision-making based on the Monte Carlo tree search method", Journal of Air Force Engineering University (Natural Science Edition) *
陈鹏: "Intelligent game decision methods for micro-management in real-time strategy games", China Masters' Theses Full-text Database (Electronic Journal), Information Science and Technology *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112329348A (en) * 2020-11-06 2021-02-05 东北大学 Intelligent decision-making method for military countermeasure game under incomplete information condition
CN112329348B (en) * 2020-11-06 2023-09-15 东北大学 Intelligent decision-making method for military countermeasure game under incomplete information condition
CN112612298A (en) * 2020-11-27 2021-04-06 合肥工业大学 Multi-target game method and device for multi-unmanned aerial vehicle tactical decision under countermeasure environment
CN112612298B (en) * 2020-11-27 2023-06-09 合肥工业大学 Multi-target game method and device for tactical decisions of multiple unmanned aerial vehicles in countermeasure environment
CN113599832A (en) * 2021-07-20 2021-11-05 北京大学 Adversary modeling method, apparatus, device and storage medium based on environment model
CN113599832B (en) * 2021-07-20 2023-05-16 北京大学 Opponent modeling method, device, equipment and storage medium based on environment model

Similar Documents

Publication Publication Date Title
CN110119773B (en) Global situation assessment method, system and device of strategic gaming system
CN111047053A (en) Monte Carlo search game decision method and system facing to opponents with unknown strategies
CN113791634A (en) Multi-aircraft air combat decision method based on multi-agent reinforcement learning
US6763325B1 (en) Heightened realism for computer-controlled units in real-time activity simulation
US6195626B1 (en) Heightened realism for computer-controlled units in real-time simulation
CN106861190B (en) AI construction method and device, game control method and device and AI system
Smith et al. RETALIATE: Learning winning policies in first-person shooter games
CN110928329A (en) Multi-aircraft track planning method based on deep Q learning algorithm
CN106779210A (en) Algorithm of Firepower Allocation based on ant group algorithm
Jaidee et al. Case-based goal-driven coordination of multiple learning agents
Straatman et al. Hierarchical AI for multiplayer bots in Killzone 3
CN112906233A (en) Distributed near-end strategy optimization method based on cognitive behavior knowledge and application thereof
CN116858039A (en) Hypersonic aircraft game guidance method, system, equipment and medium
CN113282100A (en) Unmanned aerial vehicle confrontation game training control method based on reinforcement learning
CN113435598A (en) Knowledge-driven intelligent strategy deduction decision method
CN116700079A (en) Unmanned aerial vehicle countermeasure occupation maneuver control method based on AC-NFSP
CN115951695A (en) Dynamic tactical control domain resolving method based on three-party game in air combat simulation environment
CN115933717A (en) Unmanned aerial vehicle intelligent air combat maneuver decision training system and method based on deep reinforcement learning
US6179618B1 (en) Heightened realism for computer-controlled units in real-time activity simulation
CN113741186A (en) Double-machine air combat decision method based on near-end strategy optimization
Mora et al. Evolving the cooperative behaviour in Unreal™ bots
Sheeba et al. Optimal resource allocation and redistribution strategy in military conflicts with Lanchester square law attrition
Adams Fundamentals of strategy game design
CN114202175A (en) Combat mission planning method and system based on artificial intelligence
CN113705828A (en) Battlefield game strategy reinforcement learning training method based on cluster influence degree

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200421

RJ01 Rejection of invention patent application after publication