CN108809713A

CN108809713A - Monte Carlo tree searching method based on optimal resource allocation algorithm

Info

Publication number: CN108809713A
Application number: CN201810593129.6A
Authority: CN
Inventors: 陈子豪; 李斌; 李厚强
Original assignee: University of Science and Technology of China USTC
Current assignee: University of Science and Technology of China USTC
Priority date: 2018-06-08
Filing date: 2018-06-08
Publication date: 2018-11-13
Anticipated expiration: 2038-06-08
Also published as: CN108809713B

Abstract

The invention discloses a Monte Carlo tree searching method based on an optimal resource allocation algorithm. Only the selection strategy for the child nodes of the root node of the Monte Carlo tree is adjusted: the optimal resource allocation algorithm allocates simulation computing resources to the Monte Carlo sub-tree of each child node, while the search method within each sub-tree, such as the tree policy, remains unchanged. The method can therefore be combined easily with existing Monte Carlo tree search methods and improves the decision performance of Monte Carlo tree search when computing resources are limited. The method is applicable to Monte Carlo tree search methods of all specific forms and has a wide range of applications.

Description

Monte Carlo tree searching method based on optimal resource allocation algorithm

Technical Field

The invention relates to the technical field of games, in particular to a Monte Carlo tree searching method based on an optimal resource allocation algorithm.

Background

The Markov Decision Process (MDP) models a sequential decision problem in a known environment using a quadruple of { state set, action set, transition model, reward function }. The complete decision process can be described by a sequence of { state, action } pairs, where each next state s' is drawn from a probability distribution that depends on the current state s and the chosen action a. A policy in an MDP is a mapping from the state space to the action space, i.e. the rule for choosing a specific action in each state. The goal of an MDP is to find the policy that maximizes the expected return. When the number of states in the environment is too large or is difficult to know, the policy cannot be evaluated efficiently. One effective remedy is to use Monte Carlo Tree Search (MCTS) to estimate the value function of each { state, action } pair in place of policy evaluation.

Monte Carlo tree search is a method of finding the best decision in a given domain by randomly sampling the decision space and building a search tree from the results. It has had a profound impact on Artificial Intelligence (AI), and in theory MCTS can be applied in any domain that can be described by { state, action } pairs and whose outcomes can be predicted through simulation. Interest in MCTS research has risen dramatically because of the great success MCTS has achieved in the game of Go and its potential application to many other problems.

The origins of MCTS date back to 1928, when John von Neumann proposed the minimax theorem, which paved the way for adversarial tree search methods. The Monte Carlo method was then formally adopted in the 1940s as a way of handling, through random sampling, problems that are poorly suited to tree search. Finally, in 2006 Rémi Coulom combined the two approaches and proposed MCTS for move planning in Go.

Since then, MCTS has been studied extensively and many variants have emerged, such as Upper Confidence bounds applied to Trees (UCT), single-player and multi-player MCTS, real-time MCTS, and so on. The tree policy of MCTS has also been improved and enhanced in various ways. However, all Monte Carlo-based methods share one property: the nature of the problem at hand has to be estimated statistically through a large number of simulation experiments. When computing resources are scarce, some critical state nodes or action edges may never be visited during the Monte Carlo tree search, even for problems of moderate complexity, which is why MCTS performs poorly with limited computing resources.

Disclosure of Invention

The invention aims to provide a Monte Carlo tree searching method based on an optimal resource allocation algorithm, which can greatly improve the Monte Carlo tree searching performance under the condition of limited computing resources.

The purpose of the invention is realized by the following technical scheme:

A Monte Carlo tree searching method based on an optimal resource allocation algorithm comprises the following steps:

taking the initial state of the problem to be decided as the root node R₀ of the Monte Carlo tree; if n actions exist in the corresponding action space, n child nodes of the root node R₀ are formed, each child node serving as the root node of a sub-Monte Carlo tree and as a decision scheme of the optimal resource allocation algorithm;

allocating initial computing resources to each decision scheme, performing Monte Carlo tree search iterations of the corresponding computing resource amount on the sub-Monte Carlo tree of each decision scheme, and recording the return of each iteration;

determining whether the sum of the computing resources used by all decision schemes after the l-th round, $\sum_{i=1}^{n} N_i^{l}$, is not less than the maximum available computing resource T, where $N_i^{l}$ denotes the total computing resources of decision scheme θ_i after the l-th round of computing resource allocation;

if not, increasing the computing resources by Δ, using the optimal resource allocation algorithm to determine, from the historical returns of each decision scheme, the computing resources actually available to each decision scheme in round l+1, and performing the same iterative computation as in the preceding step;

if yes, ending the Monte Carlo tree search process and determining the action corresponding to the decision scheme with the best average performance.

It can be seen from the above technical solution that the invention adjusts only the selection policy for the child nodes of the root node of the Monte Carlo tree: the optimal resource allocation algorithm allocates simulation computing resources to the Monte Carlo sub-tree of each child node, while the search method within each sub-tree, such as the tree policy, remains unchanged. The method can therefore be combined easily with existing Monte Carlo tree search methods and improves the decision performance of Monte Carlo tree search when computing resources are limited. The method is applicable to Monte Carlo tree search methods of all specific forms and has a wide range of applications.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on the drawings without creative efforts.

Fig. 1 is a flowchart of a Monte Carlo tree searching method based on an optimal resource allocation algorithm according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of Monte Carlo tree search based on an optimal resource allocation algorithm according to an embodiment of the present invention;

Fig. 3 is a schematic diagram of the process of performing Monte Carlo tree search on a child node according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.

The embodiment of the invention provides a Monte Carlo tree searching method based on the optimal resource allocation algorithm (Optimal Computing Budget Allocation, OCBA), which addresses the poor decision performance of Monte Carlo tree search when computing resources are limited.

The main process of the invention is shown in figure 1, which mainly comprises the following parts:

1. Take the initial state of the problem to be decided as the root node R₀ of the Monte Carlo tree. If n actions exist in the corresponding action space, n child nodes of the root node R₀ are formed; each child node serves as the root node of a sub-Monte Carlo tree and as a decision scheme of the optimal resource allocation algorithm.

In the embodiment of the invention, assuming that n actions exist in the corresponding action space, executing each of these actions leads to one of n new states, which form the n child nodes of the root node R₀. Each child node serves as the root node of one sub-Monte Carlo tree, giving n mutually independent sub-Monte Carlo trees SMCT_i in total, and each child node serves as a decision scheme θ_i of the optimal resource allocation algorithm.

2. Allocate initial computing resources to each decision scheme, perform Monte Carlo tree search iterations of the corresponding computing resource amount on the sub-Monte Carlo tree of each decision scheme, and record the return of each iteration.

In the embodiment of the present invention, initially, that is, when l is equal to 0, initial computing resources are allocated to each decision scheme, and Monte Carlo tree search iterations of the corresponding computing resource amount are performed on the sub-Monte Carlo tree of each decision scheme.

For ease of understanding, in the embodiment of the present invention the computing resource may be regarded as the number of Monte Carlo tree search iterations. With l = 0, set $N_i^{0} = N_0$ for every decision scheme θ_i, so that each sub-Monte Carlo tree SMCT_i performs N₀ search iterations, and record the return of each iteration.

In fact, in different environments, computing resources may also be understood as computing time, storage space, and the like.
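The bookkeeping in step 2 can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the class `SchemeStats`, the function `initial_round`, and the callback `simulate(i)` (assumed to run one search iteration in sub-tree SMCT_i and return its return value) are hypothetical names, and the running mean and sample variance are kept with Welford's update so that they are ready for the later allocation step.

```python
from dataclasses import dataclass
import random


@dataclass
class SchemeStats:
    """Running return statistics for one decision scheme (one root child)."""
    count: int = 0      # iterations spent on this scheme so far
    mean: float = 0.0   # running mean of the recorded returns
    m2: float = 0.0     # running sum of squared deviations (Welford)

    def record(self, ret: float) -> None:
        """Fold the return of one search iteration into the statistics."""
        self.count += 1
        delta = ret - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (ret - self.mean)

    @property
    def variance(self) -> float:
        # Sample variance; reported as zero until two returns are recorded.
        return self.m2 / (self.count - 1) if self.count > 1 else 0.0


def initial_round(simulate, n, n0):
    """Run n0 iterations on each of the n schemes and record every return."""
    stats = [SchemeStats() for _ in range(n)]
    for i in range(n):
        for _ in range(n0):
            stats[i].record(simulate(i))
    return stats


# Toy stand-in for a sub-tree search: scheme i "wins" with probability 0.3 + 0.1 * i.
rng = random.Random(0)
stats = initial_round(lambda i: float(rng.random() < 0.3 + 0.1 * i), n=3, n0=10)
for i, s in enumerate(stats):
    print(i, s.count, round(s.mean, 2), round(s.variance, 3))
```

After the initial round every scheme has the same count N₀; only the means and variances differ, and these are what drive the allocation in later rounds.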

3. Determine whether the sum of the computing resources used by all decision schemes after the l-th round, $\sum_{i=1}^{n} N_i^{l}$, is not less than the maximum available computing resource T.

In the embodiment of the present invention, $N_i^{l}$ denotes the total computing resources of decision scheme θ_i after the l-th round of computing resource allocation, i.e., the sum of the computing resources used by that decision scheme in the l-th round and in every round before the l-th round.

4. If not, increase the computing resources by Δ. Using the optimal resource allocation algorithm and the historical returns of each decision scheme, determine the total computing resource amount $N_i^{l+1}$ of each decision scheme over rounds 1 to l+1, derive from it the computing resources actually available to each decision scheme in round l+1, and perform the same Monte Carlo tree search iterations as in step 2.

In the embodiment of the invention, the optimal resource allocation algorithm distributes total available computing resources of amount $\sum_{i=1}^{n} N_i^{l} + \Delta$ among the decision schemes according to the mean and variance of each scheme's historical returns. The computing resource amount obtained by decision scheme θ_i for round l+1 is $N_i^{l+1}$, and the difference between $N_i^{l+1}$ and $N_i^{l}$ determines the computing resources actually available to that scheme in the simulation of round l+1.

Specifically, for all decision schemes θ_i, i ∈ I = {1, 2, …, n}, denote any non-optimal decision scheme by θ_j, the optimal decision scheme by θ_b, and the other decision schemes by θ_x, x ∈ X, where j, b ∈ I, j ≠ b, and X = I − {j, b}. Likewise, the symbols j, b, and x are used as subscripts on the quantities belonging to the non-optimal decision scheme, the optimal decision scheme, and the other decision schemes, respectively. For example, if j = 1 and b = 2, then x ∈ X = {3, 4, 5, …, n}.

Then, for all i ∈ I = {1, 2, …, n}, with j, b ∈ I, j ≠ b, and x ∈ X = I − {j, b}, the following formulas hold:

wherein:

In the above formulas, $N_j^{l+1}$, $N_b^{l+1}$, and $N_x^{l+1}$ denote the computing resource amounts obtained in round l+1 by the non-optimal decision scheme θ_j, the optimal decision scheme θ_b, and the other decision schemes θ_x, respectively. N denotes the total computing resources of the decision scheme indicated by its subscript after resources have been allocated in the round indicated by its superscript. μ_k(θ_i) denotes the return of decision scheme θ_i at the k-th calculation. b is the label of the decision scheme with the highest average historical return after the l-th round of search iterations. μ denotes the mean of the historical returns and δ their variance; the superscript l is the round number and the subscript is the label of the decision scheme. For example, $\mu_i^{l}$ and $\delta_i^{l}$ are the mean and variance of the historical returns of decision scheme θ_i over rounds 1 to l. The remaining two quantities are intermediate parameters: they can be read as the ratios of the total computing resource amounts obtained by the other schemes θ_x and by the optimal scheme θ_b, respectively, to the amount obtained by the selected non-optimal decision scheme θ_j in the l-th round, with the amount obtained by θ_j taken as one unit. For the ratio belonging to θ_x, the formula shows that the larger the historical mean return of θ_x (better performance) and the larger its variance (uncertain performance, so that more calculations are needed to determine the true performance), the larger the ratio, and hence the more computation is assigned to θ_x.
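For orientation, the standard OCBA allocation rule from the simulation-optimization literature can be written in this notation as below. This is a sketch under stated assumptions rather than a reproduction of the formulas referred to above: the ratio symbols $\alpha_x$ and $\alpha_b$ are placeholder names for the two intermediate parameters, δ is taken to be a variance as defined in the text, and whether the invention's own formulas agree term for term is not confirmed by the source.

```latex
% Ratios relative to the reference non-optimal scheme \theta_j (\alpha_j = 1):
\alpha_x = \frac{\delta_x^{l}\,\bigl(\mu_b^{l}-\mu_j^{l}\bigr)^{2}}
                {\delta_j^{l}\,\bigl(\mu_b^{l}-\mu_x^{l}\bigr)^{2}},
\qquad
\alpha_b = \sqrt{\delta_b^{l}\sum_{i \in I,\; i \neq b}\frac{\alpha_i^{2}}{\delta_i^{l}}},
\qquad
\alpha_j = 1 .

% Split the enlarged budget in proportion to the ratios:
N_j^{l+1} = \frac{\sum_{i=1}^{n} N_i^{l} + \Delta}{1 + \alpha_b + \sum_{x \in X}\alpha_x},
\qquad
N_x^{l+1} = \alpha_x\,N_j^{l+1},
\qquad
N_b^{l+1} = \alpha_b\,N_j^{l+1}.
```

Under this rule a scheme whose mean is close to that of the current best, or whose variance is large, receives a larger ratio, which matches the qualitative behaviour described in the preceding paragraph.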

The computing resources actually available to each decision scheme θ_i, i ∈ I, in round l+1 are determined from the difference between $N_i^{l+1}$ and $N_i^{l}$. That is, each sub-Monte Carlo tree SMCT_i performs Monte Carlo tree search iterations amounting to this difference, after which the total computing resources of each decision scheme following the allocation of round l+1 are updated accordingly.
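The allocation step can be sketched in Python as follows. The function name `ocba_increments`, the small constant `eps`, the choice of the reference scheme j, and the clipping of negative differences at zero are all assumptions made for this illustration; the ratios follow the standard OCBA rule from the literature, which may differ in detail from the formulas of the embodiment.

```python
import math


def ocba_increments(means, variances, totals, delta):
    """Return the extra iterations each decision scheme receives next round.

    means, variances: historical return statistics per decision scheme.
    totals: iterations already spent per scheme (the current N_i^l).
    delta: the budget increase for the coming round.
    Requires at least two decision schemes.
    """
    n = len(means)
    budget = sum(totals) + delta
    b = max(range(n), key=lambda i: means[i])        # current best scheme
    eps = 1e-9                                       # assumption: guards zero gaps and variances
    var = [max(v, eps) for v in variances]
    gap2 = [max((means[b] - means[i]) ** 2, eps) for i in range(n)]
    j = next(i for i in range(n) if i != b)          # reference non-optimal scheme
    ratio = [0.0] * n
    for i in range(n):
        if i != b:
            # Noisier schemes and schemes closer to the best get a larger share.
            ratio[i] = (var[i] / gap2[i]) / (var[j] / gap2[j])
    ratio[b] = math.sqrt(
        var[b] * sum(ratio[i] ** 2 / var[i] for i in range(n) if i != b)
    )
    unit = budget / sum(ratio)
    targets = [r * unit for r in ratio]              # target totals N_i^{l+1}
    # Assumption: a scheme already above its target simply gets nothing.
    return [max(0, round(t - c)) for t, c in zip(targets, totals)]


extra = ocba_increments(
    means=[0.5, 0.4, 0.1],
    variances=[0.25, 0.24, 0.09],
    totals=[10, 10, 10],
    delta=30,
)
print(extra)
```

In this toy call the third scheme is far behind the best one, so its target total falls below the 10 iterations it has already used and it receives no further iterations, while the two close contenders share the new budget. Because of rounding and clipping, the increments need not sum exactly to `delta`.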

After step 4 has been executed, the check in step 3 is performed again. If the result is negative, step 4 is executed once more, and this repeats until the result is positive, whereupon step 5 is executed.

5. If yes, end the Monte Carlo tree search process and determine the action corresponding to the decision scheme with the best average performance.

After the Monte Carlo tree search has ended, the action is selected according to the average performance of the corresponding decision schemes.
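Steps 2 to 5 can be put together as a short driver loop. The sketch below is illustrative only: `budgeted_root_search`, `simulate`, and `allocate` are hypothetical names, the allocator is passed in as a callback so that any allocation rule can be plugged in, and the `uniform_allocate` function used in the demonstration is a deliberately naive stand-in, not the optimal resource allocation algorithm.

```python
import random
from statistics import mean, pvariance


def budgeted_root_search(simulate, n, n0, delta, max_budget, allocate):
    """Spread a fixed simulation budget over the n children of the root.

    simulate(i): runs one search iteration in sub-tree i and returns its return.
    allocate(means, variances, totals, delta): extra iterations per child.
    """
    # Step 2: initial round, n0 iterations per decision scheme.
    returns = [[simulate(i) for _ in range(n0)] for i in range(n)]
    # Step 3: stop once the used budget is not less than max_budget.
    while sum(len(r) for r in returns) < max_budget:
        means = [mean(r) for r in returns]
        variances = [pvariance(r) for r in returns]
        totals = [len(r) for r in returns]
        remaining = max_budget - sum(totals)
        # Step 4: ask the allocator how to spend the next slice of budget.
        extra = allocate(means, variances, totals, min(delta, remaining))
        if sum(extra) == 0:
            break                      # nothing was allocated; avoid looping forever
        for i, k in enumerate(extra):
            returns[i].extend(simulate(i) for _ in range(k))
    # Step 5: pick the action whose decision scheme performed best on average.
    best = max(range(n), key=lambda i: mean(returns[i]))
    return best, [len(r) for r in returns]


def uniform_allocate(means, variances, totals, delta):
    """Naive stand-in allocator: split the new budget evenly."""
    return [delta // len(totals)] * len(totals)


rng = random.Random(1)
win_prob = [0.2, 0.5, 0.8]
best, used = budgeted_root_search(
    simulate=lambda i: float(rng.random() < win_prob[i]),
    n=3, n0=5, delta=6, max_budget=60, allocate=uniform_allocate,
)
print(best, used)
```

Keeping the allocator behind a callback mirrors the point made in the text: only the budget split among the root's children changes, while whatever happens inside `simulate` (the tree policy of each sub-tree) is left untouched.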

In the above solution of the embodiment of the present invention, only the selection policy for the child nodes of the root node of the Monte Carlo tree is adjusted: the optimal resource allocation algorithm allocates simulation computing resources to the Monte Carlo sub-tree of each child node, while the search method within each sub-tree, such as the tree policy, remains unchanged. The method can therefore be combined easily with existing Monte Carlo tree search methods and improves the decision performance of Monte Carlo tree search when computing resources are limited. The method is applicable to Monte Carlo tree search methods of all specific forms and has a wide range of applications.

For ease of understanding, the following description is made in connection with an example.

The technical solution of the embodiment of the invention is applicable to Monte Carlo tree search methods of all specific forms. In this example, the move-placement problem in the game of Reversi (Othello) is the object of study, and the specific form of Monte Carlo tree search is Upper Confidence bounds applied to Trees (UCT). The root node R₀ of the UCT is the board state in which a move is to be made, the action space is the set of all positions where the player can place a piece in the current board state, and each action corresponds to one placement position, giving n placement actions in total.

Each child node of the root node is the new board state reached after action a_i has been executed, i.e., after the piece has been placed. A UCT search is performed with each child node as a new root node, producing a new Monte Carlo tree SMCT_i, which is a sub-tree of the Monte Carlo tree rooted at the node R₀ described above.

During a game of Reversi, if the simulation carried out after the move ends in a win, the return is recorded as 1; if it ends in a loss, the return is recorded as 0; otherwise, the return is recorded as 0.5.
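The return recorded for one simulated game can be expressed as a small mapping. This is a sketch with hypothetical names: `reversi_return` and its disc-count arguments are illustrative, the values 0 for a loss and 0.5 otherwise come from the text, and the value 1 for a win is an assumption (the usual convention).

```python
def reversi_return(own_discs: int, opponent_discs: int) -> float:
    """Map the final disc counts of one simulated game to its recorded return."""
    if own_discs > opponent_discs:
        return 1.0   # win (assumed value)
    if own_discs < opponent_discs:
        return 0.0   # loss
    return 0.5       # neither a win nor a loss, i.e. a draw


print(reversi_return(40, 24), reversi_return(24, 40), reversi_return(32, 32))
```

With returns confined to 0, 0.5, and 1, the mean of a child's returns is its estimated score rate and the variance is bounded by 0.25, which keeps the statistics of different children directly comparable.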

The mean and variance of all simulation results of each child node are used as the input of the optimal resource allocation algorithm, which computes the computing resources of each decision scheme for the next round.

In this example, the computing resource is the number of iterations, or simulations, of the UCT search performed on the sub-Monte Carlo tree SMCT_i. Fig. 2 shows the MCTS search process based on the optimal resource allocation algorithm, and Fig. 3 shows the iterative process of performing Monte Carlo tree search on a child node.

After the whole method has been executed, the optimal action for the current board position in the game is returned.

Through the above description of the embodiments, it is clear to those skilled in the art that the above embodiments can be implemented by software, and can also be implemented by software plus a necessary general hardware platform. With this understanding, the technical solutions of the embodiments can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.), and includes several instructions for enabling a computer device (which can be a personal computer, a server, or a network device, etc.) to execute the methods according to the embodiments of the present invention.

The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A Monte Carlo tree searching method based on an optimal resource allocation algorithm is characterized by comprising the following steps:

2. The method of claim 1, wherein executing the n actions leads to n new states, which form the n child nodes of the root node R₀;

each child node serves as the root node of one sub-Monte Carlo tree, giving n mutually independent sub-Monte Carlo trees SMCT_i in total, and each child node serves as a decision scheme θ_i of the optimal resource allocation algorithm.

3. The method of claim 1, wherein:

initially, initial computing resources N₀ are allocated to each decision scheme, that is, $N_i^{0} = N_0$ for every decision scheme θ_i;

for the sub-Monte Carlo tree SMCT_i of each decision scheme, Monte Carlo tree search iterations amounting to N₀ computing resources are performed, and the return of each iteration is recorded.

4. The method of claim 1, wherein determining, with the optimal resource allocation algorithm and according to the historical returns of each decision scheme, the total computing resource amount $N_i^{l+1}$ of each decision scheme over rounds 1 to l+1, and thereby determining the computing resources actually available to each decision scheme in round l+1, comprises:

using the optimal resource allocation algorithm to distribute total available computing resources of amount $\sum_{i=1}^{n} N_i^{l} + \Delta$ among the decision schemes according to the mean and variance of the historical returns of each decision scheme, each decision scheme obtaining a computing resource amount of $N_i^{l+1}$;

denoting any non-optimal decision scheme by θ_j, the optimal decision scheme by θ_b, and the other decision schemes by θ_x, then for all i ∈ I = {1, 2, …, n}, with j, b ∈ I, j ≠ b, and x ∈ X = I − {j, b}, the following formulas hold:

wherein:

in the above formulas, $N_j^{l+1}$, $N_b^{l+1}$, and $N_x^{l+1}$ denote the computing resource amounts obtained in round l+1 by the non-optimal decision scheme θ_j, the optimal decision scheme θ_b, and the other decision schemes θ_x, respectively; N denotes the total computing resources of the decision scheme indicated by its subscript after resources have been allocated in the round indicated by its superscript; μ_k(θ_i) denotes the return of decision scheme θ_i at the k-th calculation; b is the label of the decision scheme with the highest average historical return after the l-th round of search iterations; μ denotes the mean of the historical returns, δ denotes the variance of the historical returns, the superscript l is the round number, and the subscript is the label of the decision scheme; the remaining two quantities are intermediate parameters;

the computing resources actually available to each decision scheme θ_i, i ∈ I, in round l+1 are determined from the difference between $N_i^{l+1}$ and $N_i^{l}$; that is, each sub-Monte Carlo tree SMCT_i performs Monte Carlo tree search iterations amounting to this difference, after which the total computing resources of each decision scheme following the allocation of round l+1 are updated accordingly.