CN112755538B - Real-time strategy game match method based on multiple intelligent agents - Google Patents

Real-time strategy game match method based on multiple intelligent agents

Info

Publication number
CN112755538B
Authority
CN
China
Prior art keywords
search
node
value
strategy
winning probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110370381.2A
Other languages
Chinese (zh)
Other versions
CN112755538A (en)
Inventor
张俊格
尹奇跃
于彤彤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202110370381.2A priority Critical patent/CN112755538B/en
Publication of CN112755538A publication Critical patent/CN112755538A/en
Application granted granted Critical
Publication of CN112755538B publication Critical patent/CN112755538B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • A HUMAN NECESSITIES
    • A63 SPORTS; GAMES; AMUSEMENTS
    • A63F CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F 13/00 Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F 13/80 Special adaptations for executing a specific game genre or game mode
    • A63F 13/822 Strategy games; Role-playing games
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Measuring Pulse, Heart Rate, Blood Pressure Or Blood Flow (AREA)

Abstract

The invention provides a real-time strategy game match method based on multiple intelligent agents, which comprises the following. AERUCT search algorithm: a forward search is carried out while the exploration ratio is adaptively adjusted according to the current blood volume and winning rate; an evaluation value of each search direction is calculated from the current state, and the next search direction is selected according to that evaluation value. The AERUCT search algorithm is an improved UCT search algorithm. AERUCT improves performance in small-scale game scenes, but in large-scale game scenes the number of nodes involved in the search decision grows while the available time is limited. The UCTRL algorithm therefore stores and updates strategies with good performance, compares them with the result of the AERUCT search through an evaluation function, selects the child node with the higher winning probability, and reversely updates the state information. Repeating these steps ensures that the current strategy is never worse than the previous one, so that each intelligent agent becomes more intelligent and its learning ability improves.

Description

Real-time strategy game match method based on multiple intelligent agents
Technical Field
The application relates to the field of reinforcement learning, man-machine confrontation and multi-agent games, in particular to a real-time strategy game match method based on multiple agents.
Background
A real-time strategy (RTS) game is a video game that, unlike a turn-based game, runs continuously. Players manage resources, build different types of structures, and command their units to fight opponents. Current research mainly focuses on micro-operations, game strategies, optimal paths, and similar aspects. The game strategy becomes particularly important when both sides have the same number of agents and the same attack capabilities. Researchers have therefore carried out a great deal of research into multi-agent game strategies.
Script-based algorithms and search tree algorithms are commonly used in real-time strategy games. Classical script-based strategy algorithms apply a fixed strategy throughout a round of play, such as attacking the nearest enemy or attacking the weakest enemy first. The PGS algorithm selects the best action by evaluating multiple scripts. Script-based strategy algorithms can make decisions quickly and are suitable for game scenes with many intelligent agents, but they cannot update the strategy as the real-time scene changes; once the opponent knows the script, the scripted player cannot win. Search tree algorithms, such as MCTS, Alpha-Beta, and UCT, obtain better strategies as the search depth increases. The MCTS algorithm fixes the search tree depth and traverses all possible child nodes to select the best one. The Alpha-Beta algorithm prunes child nodes that cannot yield the best result, which improves search efficiency, but the optimal value is obtained only after the search is completed. The UCT algorithm combines the UCB and MCTS algorithms and has advantages in time and space over traditional search algorithms in very large-scale games. Game strategies based on search tree algorithms typically make better decisions based on the real-time scene, but as the number of agents increases the search depth becomes shallower and the resulting search decision degrades.
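For readers unfamiliar with the UCT family mentioned above, the following is a minimal sketch of UCB-style child selection as typically used in UCT. It is illustrative only and not taken from the patent; all function and parameter names are assumptions.

```python
import math

def ucb_score(mean_value, child_visits, parent_visits, c=1.41):
    """Standard UCB-style score: exploitation term plus exploration bonus."""
    if child_visits == 0:
        return float("inf")  # always try an unvisited child first
    return mean_value + c * math.sqrt(math.log(parent_visits) / child_visits)

def select_child(children, parent_visits):
    """children: list of (mean_value, visits) pairs; returns the index to expand next."""
    scores = [ucb_score(v, n, parent_visits) for v, n in children]
    return scores.index(max(scores))
```

The AERUCT algorithm described below replaces the fixed constant c with a ratio that adapts to the real-time game state.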
Application publication No. CN 111111220 A relates to a self-play model training method, apparatus, computer device and storage medium for a multiplayer battle game. The method comprises: acquiring historical battle video data; extracting training battle state features from the state feature regions of the battle video frames and the corresponding operation labels from the battle operation regions of those frames; training a battle strategy model on the state features and operation labels; using the battle strategy model to predict operations during a battle based on the in-battle state features; acquiring the in-battle state features and the corresponding predicted operation value labels; training a battle operation value model on those state features and value labels; and constructing and training a self-play model from the battle strategy model and the battle operation value model. With this method, the training efficiency of the self-play model can be improved.
Application publication No. CN 111437608A provides a game match method, apparatus, device and storage medium based on artificial intelligence. The method comprises: in response to a received instruction to join a game match, acquiring the game match data streams of all participants; performing prediction on the game match data stream with a trained neural network model that contains at least a self-attention coding module to obtain a prediction result; determining a target game strategy based on the prediction result; and sending the target game strategy to a server. In this way, the accuracy of the game strategy can be improved.
Disclosure of Invention
In view of the above, in a first aspect, the present invention provides a multi-agent based real-time strategy game match method, including:
AERUCT search algorithm: carrying out a forward search while adaptively adjusting the exploration ratio according to the current blood volume, calculating an evaluation value of the search direction from the value, traversal times and exploration ratio of the current node, wherein the evaluation value is the winning probability value calculated by the AERUCT search algorithm, and selecting the next search direction according to the evaluation value of the search direction.
Preferably, the AERUCT search algorithm is an improved UCT search algorithm, and specifically includes:
(1) starting from the root node, the forward search selects a child node at each non-leaf node;
(2) an exploration ratio is calculated according to the current blood volume;
(3) the evaluation value of each child node in the search direction is calculated from the value, the traversal times and the exploration ratio of the current node;
(4) if the child node with the maximum node value is currently required, the child node with the maximum evaluation value is selected; if the child node with the minimum node value is currently required, the child node with the minimum evaluation value is selected;
(5) after the forward search is finished, the values and traversal times of the nodes on all search paths are updated by reverse value transfer.
Preferably, the exploration ratio is a positive correlation function of the blood volume.
Preferably, in the forward search, the evaluation value of each child node in the search direction is calculated as:

$$I_i = v_i + E\sqrt{\frac{\ln T}{T_i}}, \qquad E = c \cdot \frac{hp_i}{HP}$$

wherein:
$I_i$: the evaluation value of the search direction;
$v_i$: the value of node $i$;
$T_i$: the number of traversals of the current child node $i$;
$T$: the number of traversals of the parent node;
$E$: the exploration ratio;
$HP$: the sum of all blood volumes;
$hp_i$: the blood volume of the current child node $i$;
$c$: a constant that adjusts the exploration ratio.
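As a minimal sketch of the two formulas above (the Python rendering and all names are illustrative assumptions, not part of the patent):

```python
import math

def exploration_ratio(c, child_hp, total_hp):
    # E = c * hp_i / HP: a child with more remaining blood volume is explored more
    return c * child_hp / total_hp

def evaluation_value(v_i, t_i, t_parent, e_ratio):
    # I_i = v_i + E * sqrt(ln(T) / T_i)
    if t_i == 0:
        return float("inf")  # unvisited children are evaluated first
    return v_i + e_ratio * math.sqrt(math.log(t_parent) / t_i)
```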
Preferably, the values and traversal times of the nodes on all search paths are updated by reverse value transfer as:

$$v' = \frac{\sum_i v_i T_i}{T}$$

wherein:
$v'$: the updated value of a node on the search path;
$T$: the number of traversals of the parent node;
$v_i$: the value of child node $i$;
$T_i$: the number of traversals of the current child node $i$.
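A sketch of the reverse value transfer, under the assumption stated above that a node's updated value is the traversal-weighted average of its children's values; the Node attributes and the visit bookkeeping are illustrative assumptions.

```python
def reverse_update(path):
    """path: nodes from the expanded leaf back to the root; each node exposes
    .children, and every child carries .value and .visits."""
    for node in path:
        t_parent = sum(child.visits for child in node.children)
        if t_parent > 0:
            # v' = sum_i(v_i * T_i) / T
            node.value = sum(child.value * child.visits for child in node.children) / t_parent
            node.visits = t_parent
```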
In a second aspect, the present invention provides another multi-agent based real-time strategy game play method, comprising:
the UCTRL algorithm:
(1) the AERUCT search module is applied: the exploration ratio is adaptively adjusted according to the current blood volume, a forward search is carried out, and the evaluation value of the search direction is calculated from the value, traversal times and exploration ratio of the current node, where the evaluation value is the winning probability value calculated by the AERUCT search algorithm; the AERUCT search module applies part of the AERUCT search algorithm;
(2) the evaluation function selects the winning probability value of the search direction of the current state from the strategy pool;
(3) the evaluation function compares the winning probability value of the search direction selected from the strategy pool with the winning probability value calculated by the AERUCT search algorithm, selects the node with the larger winning probability value as the update node, and updates the state of the strategy pool;
(4) the currently selected action and the updated node are then passed to the AERUCT search module, and a new search is initiated from this updated node.
Preferably, the partial AERUCT search algorithm comprises:
(1) starting from the root node, the forward search selects a child node at each non-leaf node;
(2) an exploration ratio is calculated according to the current blood volume;
(3) the evaluation value of each child node in the search direction is calculated from the value, the traversal times and the exploration ratio of the current node.
Preferably, the exploration ratio is a positive correlation function of the blood volume and the winning rate.
Preferably, in the forward search, the evaluation value of each child node in the search direction is calculated as:

$$I_i = v_i + E\sqrt{\frac{\ln T}{T_i}}, \qquad E = c \cdot \frac{hp_i}{HP}$$

wherein:
$I_i$: the evaluation value of the search direction;
$v_i$: the value of node $i$;
$T_i$: the number of traversals of the current child node $i$;
$T$: the number of traversals of the parent node;
$E$: the exploration ratio;
$HP$: the sum of all blood volumes;
$hp_i$: the blood volume of the current child node $i$;
$c$: a constant that adjusts the exploration ratio.
Preferably, the strategy pool comprises a memory pool and a forgetting pool; the winning probability value of the search direction of the current state is calculated by the memory pool as follows:
the memory pool records the state s' most similar to the current state s, and the winning probability value of s' is taken as the winning probability value of the current state s; the winning probability value of s' is stored in the memory pool.
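A minimal sketch of such a memory-pool lookup; the state representation (feature vectors) and the similarity measure (Euclidean distance) are assumptions made for illustration, since the patent does not fix them, and all names are illustrative.

```python
import math

class MemoryPool:
    """Stores (state_features, winning_probability) pairs and answers
    nearest-state queries for the current state s."""
    def __init__(self):
        self.entries = []

    def store(self, state, win_prob):
        self.entries.append((tuple(state), win_prob))

    def lookup(self, state):
        if not self.entries:
            return 0.5  # illustrative default when nothing has been recorded yet
        # pick the stored state s' closest to s and return its winning probability
        _, win_prob = min(self.entries, key=lambda e: math.dist(e[0], state))
        return win_prob
```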
Preferably, the method for selecting the child node with the largest winning probability value is as follows: the winning probability value of the search direction of the child node selected by the AERUCT search algorithm is compared with the winning probability value of the search direction in the strategy pool, and the child node with the higher winning probability value is selected as the optimal child node.
Preferably, when the strategy pool state is updated, the values of all child nodes passed through are updated as:

$$v' = \frac{\sum_i v_i T_i}{T}$$

wherein:
$v'$: the updated value of a node on the search path;
$T$: the number of traversals of the parent node;
$v_i$: the value of child node $i$;
$T_i$: the number of traversals of the current child node $i$.
Compared with the prior art, the method provided by the embodiments of the present application has the following advantages:
(1) in small-scale game scenes, the AERUCT search algorithm makes better decisions and can update the exploration-exploitation ratio according to the real-time state of the game;
(2) the UCTRL algorithm introduces the idea of reinforcement learning and a memory pool, so that knowledge is continuously learned from the reward values obtained while the agent interacts with the environment; the algorithm thus adapts to the environment, and better strategies can be stored for subsequent decisions in game scenes of various scales.
Drawings
Fig. 1 is a diagram of a UCTRL algorithm structure according to an embodiment of the present invention;
fig. 2 is a data flow diagram of the UCTRL algorithm provided by the embodiment of the present invention;
in the figure: 1-AERUCT search module, 2-strategy pool, 21-memory pool, 22-forgetting pool, 3-evaluation function, and 4-reverse update module.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
The UCT search algorithm is suitable for continuous real-time strategy games and can give action feedback to multiple agents in a continuous space. However, its exploration ratio in search decisions is fixed and cannot change adaptively as the real-time scene changes.
In some embodiments, in a small-scale game scenario, the embodiment of the present application provides a multi-agent-based real-time strategy game match method, wherein the AERUCT search algorithm comprises:
carrying out a forward search while adaptively adjusting the exploration ratio according to the current blood volume and winning rate, calculating the evaluation value of the search direction from the value, traversal times and exploration ratio of the current node, and selecting the next search direction according to the evaluation value of the search direction.
The AERUCT search algorithm is an improved UCT search algorithm, and specifically comprises the following steps:
(1) starting from the root node, the forward search selects a child node at each non-leaf node;
(2) an exploration ratio is calculated according to the current blood volume and the winning rate; the exploration ratio is a positive correlation function of the blood volume and the winning rate;
(3) the evaluation value of each child node in the search direction is calculated from the value, the traversal times and the exploration ratio of the current node;
the current state is the ambient state. If the game environment is information such as the situation of the game, each node is in a different state. The value of a node is the value of the current state and the preservation of the value and traversal times of each node is the calculation of the new value of the node for subsequent updates.
The evaluation value of each child node in the search direction is calculated in the forward search as:

$$I_i = v_i + E\sqrt{\frac{\ln T}{T_i}}, \qquad E = c \cdot \frac{hp_i}{HP}$$

wherein:
$I_i$: the evaluation value of the search direction;
$v_i$: the value of node $i$;
$T_i$: the number of traversals of the current child node $i$;
$T$: the number of traversals of the parent node;
$E$: the exploration ratio;
$HP$: the sum of all blood volumes;
$hp_i$: the blood volume of the current child node $i$;
$c$: a constant that adjusts the exploration ratio.
(4) If the child node with the maximum value of the node is required to be selected at present, the child node with the maximum evaluation value is selected; if the child node with the minimum value of the node is required to be selected at present, the child node with the minimum evaluation value is selected;
(5) after the forward search is finished, the values and traversal times of the nodes on all search paths are updated by reverse value transfer:

$$v' = \frac{\sum_i v_i T_i}{T}$$

wherein:
$v'$: the updated value of a node on the search path;
$T$: the number of traversals of the parent node;
$v_i$: the value of child node $i$;
$T_i$: the number of traversals of the current child node $i$.
The depth of the search tree is limited. When a child node is expanded, its v value and T value must be initialized: the initial v value of the expanded child node is taken to be the average of multiple simulation results of the search tree, and T is initialized to 0.
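Putting the pieces of the AERUCT forward search together, a minimal sketch follows. The Node class, the alternation between maximising and minimising levels, and all names are illustrative assumptions rather than the patent's reference implementation.

```python
import math

class Node:
    def __init__(self, state, hp, parent=None):
        self.state, self.hp, self.parent = state, hp, parent
        self.children = []
        self.value = 0.0   # v: set to the average of several simulations on expansion
        self.visits = 0    # T: initialised to 0 on expansion

def aeruct_score(child, parent_visits, total_hp, c):
    e_ratio = c * child.hp / total_hp          # adaptive exploration ratio E
    if child.visits == 0:
        return float("inf")
    return child.value + e_ratio * math.sqrt(math.log(parent_visits) / child.visits)

def forward_search(root, total_hp, c, maximise=True):
    """Walk down from the root; at every non-leaf node pick the child with the
    largest (own move) or smallest (opponent move) evaluation value."""
    node = root
    while node.children:
        parent_visits = max(node.visits, 1)
        key = lambda ch: aeruct_score(ch, parent_visits, total_hp, c)
        node = max(node.children, key=key) if maximise else min(node.children, key=key)
        maximise = not maximise
    return node
```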
In some embodiments, for large-scale game scenes the number of nodes involved in the search decision increases while the decision time is limited, so the search depth decreases and it becomes difficult to make a good decision strategy. To address this, strategies with good performance are stored and updated, compared with the result of the AERUCT search, the child node with the higher winning probability is selected by the evaluation function, the state information is reversely updated, and these steps are repeated so that the current strategy is never worse than the previous one; each intelligent agent thus becomes more intelligent and its learning ability improves. In the multi-agent real-time strategy game match method, the UCTRL algorithm introduces reinforcement learning and a memory pool, so that knowledge can be continuously learned from the reward values obtained while the agent interacts with the environment, and the algorithm adapts to the environment. A reinforcement learning algorithm learns to update its own model from previous sample experience and uses the current model to guide the next action; the model is then updated again after that action, and the algorithm iterates until the model converges. In a reinforcement learning algorithm the agents have a definite goal: all agents perceive their environment and direct their behavior toward their goals, so the reinforcement learning algorithm treats the agent and the uncertain environment as one complete problem. Each action of the algorithm is related not only to the current action and environment of the current time period but also to the historical feedback of previous time periods.
As shown in fig. 1 and 2, the UCTRL algorithm includes:
(1) the AERUCT search module 1 is applied: the exploration ratio is adaptively adjusted according to the current blood volume, a forward search is carried out, and the evaluation value of the search direction is calculated from the value, traversal times and exploration ratio of the current node; the AERUCT search module 1 applies part of the AERUCT search algorithm;
in some embodiments, the partial AERUCT search algorithm comprises:
(a) starting from the root node, the forward search selects a child node at each non-leaf node;
(b) an exploration ratio is calculated according to the current blood volume; the exploration ratio is a positive correlation function of the blood volume and the winning rate;
(c) the evaluation value of each child node in the search direction is calculated from the value, the traversal times and the exploration ratio of the current node.
The evaluation value of each child node in the search direction is calculated as:

$$I_i = v_i + E\sqrt{\frac{\ln T}{T_i}}, \qquad E = c \cdot \frac{hp_i}{HP}$$

wherein:
$I_i$: the evaluation value of the search direction;
$v_i$: the value of node $i$;
$T_i$: the number of traversals of the current child node $i$;
$T$: the number of traversals of the parent node;
$E$: the exploration ratio;
$HP$: the sum of all blood volumes;
$hp_i$: the blood volume of the current child node $i$;
$c$: a constant that adjusts the exploration ratio.
(2) the winning probability value of the search direction of the current state is selected from the strategy pool 2; the strategy pool comprises a memory pool 21 and a forgetting pool 22, and the winning probability value of the search direction of the current state is calculated by the memory pool 21 as follows:
the memory pool 21 records the state s' most similar to the current state s, and the winning probability value of s' is taken as the winning probability value of the current state s; the winning probability value of s' is stored in the memory pool;
(3) comparing the winning probability value of the selected search direction in the strategy pool with the winning probability value calculated by the AERUCT search algorithm by using an evaluation function, selecting the node with the maximum winning probability value as an update node, and updating the state of the strategy pool;
the method for updating the value of all the child nodes passing through the strategy pool state comprises the following steps:
Figure 710990DEST_PATH_IMAGE010
wherein,
Figure 542549DEST_PATH_IMAGE011
: nodes on updated search path
Figure 671042DEST_PATH_IMAGE005
The value of (D);
T: the traversal times of the father node;
Figure 364191DEST_PATH_IMAGE004
: child node
Figure 956846DEST_PATH_IMAGE005
The value of (D);
Figure 444460DEST_PATH_IMAGE006
: current child node
Figure 856986DEST_PATH_IMAGE005
The number of traversals.
(4) The currently selected action and the updated node are then passed to the AERUCT search module, and a new search is initiated from this updated node.
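A minimal sketch of one UCTRL iteration wiring steps (1)-(4) together, reusing the forward_search and MemoryPool sketches above; how the comparison result is mapped to the update node is an assumption, and all names are illustrative.

```python
def uctrl_step(root, memory_pool, total_hp, c):
    # (1) AERUCT search module: forward search with the adaptive exploration ratio
    search_node = forward_search(root, total_hp, c)
    search_win_prob = search_node.value            # evaluation value as winning probability

    # (2) evaluation function: winning probability of the current state from the strategy pool
    pool_win_prob = memory_pool.lookup(root.state)

    # (3) keep whichever candidate has the larger winning probability and update the pool
    if search_win_prob >= pool_win_prob:
        update_node, win_prob = search_node, search_win_prob
    else:
        update_node, win_prob = root, pool_win_prob
    memory_pool.store(update_node.state, win_prob)

    # (4) the selected action / update node is handed back to the AERUCT module,
    #     which starts the next search from it
    return update_node
```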
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, these information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present invention. The word "if" as used herein may be interpreted as "at … …" or "when … …" or "in response to a determination", depending on the context.
While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. In other instances, features described in connection with one embodiment may be implemented as discrete components or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. Further, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing may be advantageous.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (6)

1. A multi-agent based real-time strategy game match method, wherein when an agent interacts with the environment, the exploration ratio is updated according to the real-time status of the game, comprising:
(1) applying an AERUCT search module, adaptively adjusting an exploration ratio according to the current blood volume, carrying out forward search, and calculating an evaluation value of a search direction, namely a winning probability value according to the value, the traversal times and the exploration ratio of the current node; the AERUCT search module applies an AERUCT search algorithm;
(2) selecting a winning probability value of the search direction of the current state in the strategy pool by applying an evaluation function;
(3) comparing the winning probability value of the selected search direction in the strategy pool with the winning probability value calculated by the AERUCT search algorithm by using an evaluation function, selecting the node with the maximum winning probability value as an update node, and updating the state of the strategy pool;
(4) then the currently selected action and the updating node are transmitted to an AERUCT searching module, and new searching is started from the updating node;
the AERUCT search algorithm comprises:
(11) the forward search selects a child node for each non-leaf node starting from a root node;
(12) calculating an exploration ratio according to the current blood volume;
(13) calculating the evaluation value of each child node in the search direction according to the value, the traversal times and the exploration ratio of the current node.
2. The multi-agent based real-time strategy game match method according to claim 1, wherein the exploration ratio is a positive correlation function of the blood volume.
3. The multi-agent based real-time strategy game match method according to claim 2, wherein in the forward search the evaluation value of each child node in the search direction is calculated as:

$$I_i = v_i + E\sqrt{\frac{\ln T}{T_i}}, \qquad E = c \cdot \frac{hp_i}{HP}$$

wherein:
$I_i$: the evaluation value of the search direction;
$v_i$: the value of child node $i$;
$T_i$: the number of traversals of the current child node $i$;
$T$: the number of traversals of the parent node;
$E$: the exploration ratio;
$HP$: the sum of all blood volumes;
$hp_i$: the blood volume of the current child node $i$;
$c$: a constant that adjusts the exploration ratio.
4. The multi-agent based real-time strategy game match method according to claim 3, wherein the strategy pool comprises: a memory pool and a forgetting pool; the winning probability value of the search direction of the current state is calculated by the memory pool; and the method for calculating the winning probability value of the search direction of the current state comprises the following steps:
recording in the memory pool a state s' most similar to the current state s, and taking the winning probability value of s' as the winning probability value of the current state s; the winning probability value of s' is stored in the memory pool.
5. The multi-agent based real-time strategy game match method according to claim 4, wherein the method of selecting the node with the highest winning probability value is: comparing the winning probability value of the search direction of the child node selected by the AERUCT search algorithm with the winning probability value of the search direction in the strategy pool, and selecting the child node with the higher winning probability value as the optimal child node.
6. The multi-agent based real-time strategy game match method according to claim 5, wherein when the strategy pool state is updated, the values of all child nodes passed through are updated as:

$$v' = \frac{\sum_i v_i T_i}{T}$$

wherein:
$v'$: the updated value of a node on the search path;
$T$: the number of traversals of the parent node;
$v_i$: the value of child node $i$;
$T_i$: the number of traversals of the current child node $i$.
CN202110370381.2A 2021-04-07 2021-04-07 Real-time strategy game match method based on multiple intelligent agents Active CN112755538B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110370381.2A CN112755538B (en) 2021-04-07 2021-04-07 Real-time strategy game match method based on multiple intelligent agents

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110370381.2A CN112755538B (en) 2021-04-07 2021-04-07 Real-time strategy game match method based on multiple intelligent agents

Publications (2)

Publication Number Publication Date
CN112755538A (en) 2021-05-07
CN112755538B (en) 2021-08-31

Family

ID=75691416

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110370381.2A Active CN112755538B (en) 2021-04-07 2021-04-07 Real-time strategy game match method based on multiple intelligent agents

Country Status (1)

Country Link
CN (1) CN112755538B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113420226A (en) * 2021-07-20 2021-09-21 网易(杭州)网络有限公司 Card recommendation method and device, electronic equipment and computer readable medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105999689A (en) * 2016-05-30 2016-10-12 北京理工大学 AI algorithm for game of the Amazons based on computer game playing
CN110083748A (en) * 2019-04-30 2019-08-02 南京邮电大学 A kind of searching method based on adaptive Dynamic Programming and the search of Monte Carlo tree

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6415274B1 (en) * 1999-06-24 2002-07-02 Sandia Corporation Alpha-beta coordination method for collective search
US9147316B2 (en) * 2012-07-19 2015-09-29 David Hardcastle Method and apparatus that facilitates pooling lottery winnings via a relational structure
CN107038477A (en) * 2016-08-10 2017-08-11 哈尔滨工业大学深圳研究生院 Estimation method combining a neural network with Q-learning under incomplete information
CN107050839A (en) * 2017-04-14 2017-08-18 安徽大学 Amazon chess game playing by machine system based on UCT algorithms

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105999689A (en) * 2016-05-30 2016-10-12 北京理工大学 AI algorithm for game of the Amazons based on computer game playing
CN110083748A (en) * 2019-04-30 2019-08-02 南京邮电大学 A kind of searching method based on adaptive Dynamic Programming and the search of Monte Carlo tree

Also Published As

Publication number Publication date
CN112755538A (en) 2021-05-07

Similar Documents

Publication Publication Date Title
Wu et al. Training agent for first-person shooter game with actor-critic curriculum learning
Justesen et al. Learning macromanagement in starcraft from replays using deep learning
CN108211362B (en) Non-player character combat strategy learning method based on deep Q learning network
CN110141867B (en) Game intelligent agent training method and device
CA3060900A1 (en) System and method for deep reinforcement learning
Andrade et al. Challenge-sensitive action selection: an application to game balancing
Liu et al. Evolving game skill-depth using general video game ai agents
CN104102522B (en) The artificial emotion driving method of intelligent non-player roles in interactive entertainment
CN112870721B (en) Game interaction method, device, equipment and storage medium
CN109925717B (en) Game victory rate prediction method, model generation method and device
Knegt et al. Opponent modelling in the game of Tron using reinforcement learning
Tang et al. A review of computational intelligence for StarCraft AI
Gemine et al. Imitative learning for real-time strategy games
Gajurel et al. Neuroevolution for rts micro
CN112755538B (en) Real-time strategy game match method based on multiple intelligent agents
Nam et al. Generation of diverse stages in turn-based role-playing game using reinforcement learning
CN114404975A (en) Method, device, equipment, storage medium and program product for training decision model
Singal et al. Modeling decisions in games using reinforcement learning
Zhen et al. Neuroevolution for micromanagement in the real-time strategy game StarCraft: Brood War
CN114344889B (en) Game strategy model generation method and control method of intelligent agent in game
Justesen et al. Learning a behavioral repertoire from demonstrations
CN111882072A (en) Intelligent model automatic course training method for playing chess with rules
CN114581834A (en) Curling decision method for deep reinforcement learning based on Monte Carlo tree search
Ansó et al. Deep reinforcement learning for pellet eating in Agar.io
Langenhoven et al. Swarm tetris: Applying particle swarm optimization to tetris

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant