CN113139644B - Information source navigation method and device based on deep Monte Carlo tree search - Google Patents


Info

Publication number
CN113139644B
Authority
CN
China
Prior art keywords
action
neural network
time step
information
monte carlo
Prior art date
Legal status
Active
Application number
CN202110316103.9A
Other languages
Chinese (zh)
Other versions
CN113139644A (en)
Inventor
徐诚
何昊
段世红
殷楠
Current Assignee
Shunde Graduate School of USTB
Original Assignee
Shunde Graduate School of USTB
Priority date
Filing date
Publication date
Application filed by Shunde Graduate School of USTB filed Critical Shunde Graduate School of USTB
Priority to CN202110316103.9A priority Critical patent/CN113139644B/en
Publication of CN113139644A publication Critical patent/CN113139644A/en
Application granted granted Critical
Publication of CN113139644B publication Critical patent/CN113139644B/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G01MEASURING; TESTING
    • G01CMEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/20Instruments for performing navigational calculations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Probability & Statistics with Applications (AREA)
  • Automation & Control Theory (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an information source navigation method and device based on deep Monte Carlo tree search. The method comprises the following steps: acquiring the environment information observed and the actions executed by an agent to be navigated in historical time steps; predicting, through a preset first neural network, the probability of the agent taking each directional action at the current time step based on the environment information and action information of the historical time steps; using the predicted action probabilities as prior knowledge for a Monte Carlo tree search algorithm and selecting the best action for the agent to execute at the current time step; and combining the best actions of all time steps to obtain the optimal path along which the agent moves to the information source. The invention provides an integrated path planning framework that embeds a recurrent neural network in the Monte Carlo tree, which helps to improve the stability and performance of navigation control and, by processing temporal action sequence data, solves the path planning problem in continuous space.

Description

Information source navigation method and device based on deep Monte Carlo tree search
Technical Field
The invention relates to the technical field of computer science, and in particular to an information source navigation method and device based on deep Monte Carlo tree search.
Background
Decision-making and problem-solving in environments with incomplete information assume that the state of the system cannot be directly observed and is only partially known; the system with incomplete state information is therefore modeled, and decisions are made from the currently available, incomplete state information. For example, in many environmental and geoscience applications, an expert wishes to collect the samples of greatest scientific value (e.g., the source of an oil spill) and to make the next decision from the samples already collected, but the distribution of the phenomenon is initially unknown. Typically, samples are collected by a technician or a mobile platform at predetermined locations along predetermined coverage trajectories. Such non-adaptive strategies lead to sparse sampling and may be infeasible when the geometry of the environment is unknown (e.g., a boulder field) or changing (e.g., a tidal zone); maximizing the number of valuable samples therefore requires adaptive positioning and navigation.
Positioning and navigation are performed in a partially observable environment; because states are not directly visible, decisions cannot be made from the states themselves. For such decision processes, Monte Carlo tree search is a heuristic best-first search algorithm that has achieved breakthroughs in many games since it was proposed, as it searches while balancing exploration and exploitation. During the search, states must first be predicted; research has shown that states in a discrete state space can be predicted through Gaussian process regression. However, for a continuous state space, how to predict states effectively and accurately remains an open problem.
In addition, in complex environments the positioning and navigation of an agent can only be controlled by searching through visual information; vision-based positioning and navigation attempts to imitate how humans reason about a perceived environment. During the search, the reward function is often sparse, and planning under sparse rewards requires long-term information collection, which remains a challenging problem for current agent technology. Moreover, with the development of computer vision, conventional navigation methods show deficiencies in predicting image states, and recurrent learning is needed as an effective solution for vision-based navigation design.
In summary, for the problem of agent path planning in partially observable environments, the stability and performance of the prior art are not ideal, so a new information source navigation method needs to be developed that can efficiently and stably find the optimal path for the agent once a signal field containing the signal source is given.
Disclosure of Invention
The invention provides an information source navigation method and device based on deep Monte Carlo tree search, which are used for solving the technical problem that the stability and performance of the prior art are not ideal when planning an agent's path in a partially observable environment.
In order to solve the technical problems, the invention provides the following technical scheme:
in one aspect, the invention provides an information source navigation method based on deep Monte Carlo tree search, which comprises the following steps:
acquiring environment information and executed action information of an agent to be navigated in a historical time step;
predicting the action probability of the intelligent agent in each direction of the current time step based on the environmental information in the historical time step and the executed action information through a preset first neural network;
taking the predicted action probability of each direction of the current time step as prior knowledge of a Monte Carlo tree search algorithm, and selecting the best execution action of the intelligent agent in the current time step through the Monte Carlo tree search algorithm;
and combining the optimal execution action of each time step to obtain an optimal path for the intelligent agent to move to the information source.
Optionally, the preset first neural network is a long short-term memory (LSTM) neural network.
Further, when selecting an optimal execution action of the agent in the current time step by the Monte Carlo tree search algorithm, in a simulation phase of the Monte Carlo tree search algorithm, the method further includes:
inputting the predicted action probability and the state information of the current node into a preset second neural network, distributing a reward value to the current node through the preset second neural network, and then back-propagating the reward value to the root node.
Optionally, the preset second neural network is a convolutional neural network.
Further, after the reward value is distributed to the current node through the preset second neural network and back-propagated to the root node, the method further includes:
continuing to train the preset second neural network with the acquired reward values so as to improve its prediction capability.
On the other hand, the invention also provides an information source navigation device based on the deep Monte Carlo tree search, which comprises:
the historical environment information and action information acquisition module is used for acquiring environment information and executed action information of an intelligent agent to be navigated in historical time steps;
the action probability prediction module is used for predicting, through a preset first neural network, the action probability of the intelligent agent in each direction of the current time step based on the environment information and executed action information of the historical time steps acquired by the historical environment information and action information acquisition module;
the optimal execution action decision module is used for taking the action probability of each direction of the current time step predicted by the action probability prediction module as prior knowledge of a Monte Carlo tree search algorithm, and selecting the optimal execution action of the intelligent agent in the current time step through the Monte Carlo tree search algorithm;
and the optimal path acquisition module is used for combining the optimal execution action of each time step output by the optimal execution action decision module to obtain an optimal path for the intelligent agent to move to the information source.
Optionally, the preset first neural network is a long short-term memory (LSTM) neural network.
Further, when selecting the best execution action of the agent in the current time step through the Monte Carlo tree search algorithm, the best execution action decision module is further configured to:
inputting the predicted action probability and the state information of the current node into a preset second neural network, distributing a reward value to the current node through the preset second neural network, and then back-propagating the reward value to the root node.
Optionally, the preset second neural network is a convolutional neural network.
Further, after the reward value is distributed to the current node through the preset second neural network and back-propagated to the root node, the optimal execution action decision module is further configured to:
continue training the preset second neural network with the acquired reward values so as to improve its prediction capability.
In yet another aspect, the present invention also provides an electronic device including a processor and a memory; wherein the memory stores at least one instruction that is loaded and executed by the processor to implement the above-described method.
In yet another aspect, the present invention also provides a computer readable storage medium having at least one instruction stored therein, the instruction being loaded and executed by a processor to implement the above method.
The technical scheme provided by the invention has the beneficial effects that at least:
aiming at the problem of agent path planning in a partially observable environment, the invention uses a learning-based social response model to predict the agent's dynamics throughout the action space planning process. The method is applied to an agent system: while moving and observing environmental information, the agent continuously trains the parameters of the recurrent neural network, so that its ability to predict states during movement is improved and the reward distribution becomes more reasonable, and by combining Monte Carlo tree search to process the temporal action sequence data, the agent's ability to plan a path to the information source position is improved. The problem of efficient agent path planning in partially observable environments is thereby solved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flow chart of a source navigation method based on deep monte carlo tree search according to an embodiment of the present invention;
fig. 2 is an algorithm framework diagram of the Monte Carlo tree search and neural network based method provided by an embodiment of the present invention;
fig. 3 is a flowchart of the execution of the monte carlo tree search algorithm.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the embodiments of the present invention will be described in further detail with reference to the accompanying drawings.
First embodiment
In a partially observable environment, the agent is equipped with sensors that perceive environmental information only within a limited range, and the agent must plan its path based on this limited information. This raises the questions of how to make an action decision at each time step and how to define the reward of each time step, since there is no explicit value function as in conventional reinforcement learning. For these problems, this embodiment provides an information source navigation method based on deep Monte Carlo tree search, applied to efficient path planning when an agent obtains only finite state information in a complex, partially observable environment. The method determines the action sequence of the agent over a future period under the given conditions, and all of these actions together form an optimal path through the environment.
The information source navigation method of this embodiment may be implemented by an electronic device, which may be a terminal or a server. The execution flow of the method is shown in fig. 1 and comprises the following steps:
s101, acquiring environment information and executed action information of an agent to be navigated in a historical time step;
s102, predicting the action probability of the intelligent agent in all directions of the current time step based on the environmental information in the historical time step and the executed action information through a preset first neural network;
s103, taking the predicted action probability of each direction of the current time step as priori knowledge of a Monte Carlo tree search algorithm, and selecting the best execution action of the intelligent agent in the current time step through Monte Carlo tree search;
s104, combining the optimal execution actions of each time step to obtain an optimal path for the intelligent agent to move to the information source.
Further, when selecting an optimal execution action of the agent in the current time step by the Monte Carlo tree search algorithm, in a simulation phase of the Monte Carlo tree search algorithm, the method further includes:
inputting the predicted action probability and the state information of the current node into a preset second neural network, distributing a reward value to the current node through the preset second neural network, and then back-propagating the reward value to the root node.
By adopting the above technical scheme, after training in a given signal field the agent can predict the state at the next moment while moving, and can travel from its initial position to the information source position at low cost and with high efficiency.
The information source navigation method of this embodiment integrates a recurrent neural network into the Monte Carlo tree search, and the Monte Carlo tree search uses the action probabilities predicted by the recurrent neural network to estimate the value of each state in a search tree. As more simulations are performed, the search tree grows larger and the associated value estimates become more accurate. By choosing subtrees with higher values, the policy probability used to select actions keeps improving over the course of the search. By combining supervised learning and reinforcement learning to train the recurrent neural network, a new search algorithm is introduced that successfully integrates neural network evaluation with Monte Carlo tree simulation, so that low-cost and efficient information source navigation can be realized.
Moreover, as reinforcement learning systems become more common, designing reward mechanisms that induce the desired behavior becomes both more important and more difficult. In reward distribution, reward shaping is critical to the speed of reinforcement learning, and the reward range is an important parameter that governs the effectiveness of shaping and has a strong influence on the running time of a simple reinforcement learning algorithm. Therefore, to obtain a reasonable reward distribution, this embodiment learns the reward distribution with a neural network: a convolutional neural network predicts the reward from the acquired state information and action information, and through continuous training its output can approximate the real reward.
Furthermore, path planning in a dynamic environment can be expressed as a sequential decision problem. Sequences play an important role in many applications and systems; in this embodiment, the temporal action sequence means that the recurrent neural network produces an action probability distribution for the next time step from the state information and action information of the K historical time steps, thereby providing prior knowledge for the Monte Carlo tree search. The Monte Carlo tree search then selects the best action for the current time step according to the action probabilities predicted by the recurrent neural network and the current state.
Moreover, the simulation operation of this method differs from that of conventional Monte Carlo tree search. The conventional simulation operation simulates to a final state: starting from the state of the added node, the simulation is run, either randomly or according to a heuristic policy, until a final state is reached. The simulation operation of this method instead simulates the distribution of the reward value: the state of the current node is taken as the input of the neural network, which outputs the reward value that should be distributed in the current state, and this reward value is back-propagated to the root node.
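A minimal sketch of this modified simulation step follows; the reward_net.predict interface and the node fields (state, parent, visits, value) are illustrative assumptions rather than the disclosed implementation, with value kept as the node's running average of assigned rewards.

```python
def simulate_and_backpropagate(node, reward_net, action_probs):
    """Modified simulation step: the reward network assigns a value to the
    expanded node's state instead of rolling out to a terminal state, and
    that value is back-propagated from the node up to the root."""
    reward = reward_net.predict(node.state, action_probs)
    current = node
    while current is not None:
        current.visits += 1
        # incremental mean: `value` stays the node's average assigned reward
        current.value += (reward - current.value) / current.visits
        current = current.parent
    return reward
```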
The action output of the Monte Carlo tree search algorithm serves as the decision main line, and the recurrent neural network's processing of the temporal action sequence serves as the time main line; combining the two promotes the online training and learning of the agent.
Specifically, in this embodiment, the preset first neural network is a long short-term memory (LSTM) neural network, and the preset second neural network is a convolutional neural network.
The framework of the information source navigation method based on deep Monte Carlo tree search in this embodiment is shown in fig. 2. Monte Carlo tree search works well in games, so this embodiment introduces it into the navigation problem; unlike a game, however, the navigation process is an action decision problem over a continuous state space. For this problem, this embodiment proposes to integrate a neural network into the Monte Carlo tree search and to use the neural network to process the huge and complex state space data. The method comprises the following specific steps:
(1) the agent executes an action and observes the environmental information after each action,
(2) the environmental information observed by the agent in the first K time steps (excluding the current time step) and the actions executed are input into the LSTM network, which outputs the action probability vector over all directions for the current time step,
(3) the predicted action probability information and the currently observed state information are input into the Monte Carlo tree search as the root node information,
(4) selection, expansion, simulation and back propagation are performed from the root node,
(5) during the simulation, the action probability information and the state information of the current node are passed into the convolutional neural network, the reward value is distributed to the current node, and the reward value is back-propagated,
(6) steps (4) and (5) are repeated until the number of Monte Carlo tree searches is met, and the best next action of the current time step is output,
(7) steps (1)-(6) are cycled until the program iteration termination condition is met.
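A minimal Python sketch of how steps (1)-(7) above might be organized; the env, lstm_policy and mcts objects and their methods, the step budget and the termination flag are illustrative assumptions, not the disclosed implementation.

```python
from collections import deque

K = 6            # number of historical time steps fed to the LSTM
MAX_STEPS = 500  # assumed iteration termination condition

def navigate_to_source(env, lstm_policy, mcts, n_simulations=100):
    """Illustrative outer loop: LSTM prior + Monte Carlo tree search per time step."""
    history = deque(maxlen=K)              # (observation, action) pairs of the last K steps
    obs = env.reset()                      # step (1): initial observation
    path = []
    for _ in range(MAX_STEPS):
        # step (2): action probabilities for the current time step from the history
        action_probs = lstm_policy.predict(list(history))
        # steps (3)-(6): tree search seeded with the prediction as the root-node prior
        best_action = mcts.search(root_state=obs, prior=action_probs,
                                  n_simulations=n_simulations)
        next_obs, reached_source = env.step(best_action)   # execute and observe
        history.append((obs, best_action))
        path.append(best_action)
        obs = next_obs
        if reached_source:                 # step (7): termination condition
            break
    return path                            # the combined best actions form the path
```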
The following describes the Monte Carlo tree search and illustrates the process of time series data processing and prize distribution in the algorithm.
Monte Carlo tree search:
Monte Carlo tree search is a best-first search method based on random sampling (Monte Carlo simulation) of the state space of a specific domain, meaning that decisions are made according to the results of random simulations. The MCTS execution flow is shown in fig. 3 and consists of four steps that are executed repeatedly until a computation threshold is reached, i.e. a set number of iterations, an upper limit on memory use, or a time limit. The four steps of each iteration are:
selection: starting from the root node, the child nodes are selected in a recursive manner according to a selection policy. When a leaf node is reached that does not represent the state of the terminal, its selection is extended.
In this step, a strategy is needed to explore the tree so as to make meaningful decisions and eventually converge to the most valuable subtree. The Upper Confidence bound applied to Trees (UCT) is used for this purpose: it maximizes the return of a multi-armed bandit, balancing the exploitation of high-reward nodes against the exploration of rarely visited ones. Given the current node, the child that is selected is the one maximizing the following expression (a code sketch of this selection rule is given after the four steps):
UCT_i = V_i + C * sqrt(ln(n_p) / n_i)
where V_i is the score of the current child node under the defined default policy; in the second term, n_p is the visit count of the parent node and n_i the visit count of the current child node; and C is an experimentally determined exploration constant. UCT is applied when the visit count of a child node is above a threshold T; when the visit count of a node is below this threshold, a child node is selected for expansion at random.
Expansion: given the available sequence of actions, all child nodes will be added to the selected leaf node.
Simulation: starting from the state of the added node, a simulation is run, either randomly or according to a heuristic policy, until a final state is reached.
Back propagation: the result of the simulation is propagated immediately from the selected node back to the root node. For each node chosen in the selection phase, the statistics along the tree are updated and its visit count is increased.
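A short sketch of the selection step under the UCT rule above; the Node class, the constants C and T, and the random choice below the visit threshold are illustrative assumptions consistent with the description, not the disclosed implementation.

```python
import math
import random

C = 1.4  # exploration constant, determined experimentally
T = 1    # visit-count threshold below which a child is chosen at random

class Node:
    """Search-tree node; `value` holds the node's score V_i and `visits` its visit count n_i."""
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0

def uct(child, parent_visits):
    # UCT_i = V_i + C * sqrt(ln(n_p) / n_i)
    return child.value + C * math.sqrt(math.log(parent_visits) / child.visits)

def select_child(node):
    """Selection step: pick the child maximizing UCT; children below the
    visit threshold T are chosen at random for expansion instead."""
    rarely_visited = [c for c in node.children if c.visits < T]
    if rarely_visited:
        return random.choice(rarely_visited)
    return max(node.children, key=lambda c: uct(c, node.visits))
```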
Time action sequence data processing:
in complex dynamic environments, many ongoing tasks and motion planning problems require efficient exploration of possible future environments. In real-world sequential decision problems (e.g. robotics), the order in which samples are collected is critical, especially when the robot must optimize a non-stationary objective function over time. Model-free reinforcement learning has proven successful on many challenging tasks but performs poorly on tasks requiring long-term planning. In the present invention, the Monte Carlo tree search is fused with a long short-term memory (LSTM) neural network so that reinforcement learning and deep learning complement each other. The procedure for predicting the action probabilities by processing the temporal action sequence with the LSTM is as follows:
(1) while travelling, the agent saves the state information observed at each time step T_k together with the action decision information output by the Monte Carlo tree search,
(2) at the current time T_t, the relevant information of the previous 6 time steps is read and used as the input of the LSTM, which outputs the probability prediction vector of each action at the current time,
(3) the Monte Carlo tree search takes the action prediction probabilities as prior knowledge to make the best action decision at the current time.
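A sketch, assuming PyTorch, of how the LSTM could map the last six (observation, action) pairs to an action probability vector; the feature sizes, the one-hot action encoding and the network layout are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class LSTMPolicy(nn.Module):
    """Predicts a probability vector over movement directions from the
    observations and actions of the previous K time steps."""
    def __init__(self, obs_dim, n_actions, hidden_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(input_size=obs_dim + n_actions,
                            hidden_size=hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs_seq, act_seq):
        # obs_seq: (batch, K, obs_dim); act_seq: (batch, K, n_actions), one-hot actions
        x = torch.cat([obs_seq, act_seq], dim=-1)
        out, _ = self.lstm(x)
        logits = self.head(out[:, -1])            # hidden state of the last time step
        return torch.softmax(logits, dim=-1)      # prior over actions for the MCTS

# example: a prior over 8 movement directions from K = 6 past steps
policy = LSTMPolicy(obs_dim=16, n_actions=8)
obs_seq = torch.randn(1, 6, 16)                   # placeholder observations
act_seq = torch.zeros(1, 6, 8)                    # placeholder one-hot actions
prior = policy(obs_seq, act_seq)                  # shape (1, 8), rows sum to 1
```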
Reward distribution:
from the Monte Carlo tree search process described above, it can be seen that the feedback after the simulation phase is critical to the overall search. In the simulation stage, this embodiment uses a convolutional neural network to estimate the reward value obtained by the agent while travelling, so that the convolutional neural network replaces the reinforcement learning value function in approximating the real reward. The reward distribution process is as follows:
(1) the Monte Carlo tree search expands to the current node, on which the simulation is performed,
(2) the state information S_i observed by the current node and the action prediction probability of this time step are acquired,
(3) this information is input into the convolutional neural network, which outputs the reward value distributed to the current node; meanwhile, the convolutional neural network is continuously trained with the obtained reward values so as to improve its prediction capability.
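A sketch, again assuming PyTorch, of the reward distribution in steps (1)-(3): a convolutional network takes the state S_i observed at the current node together with the action prediction probabilities and outputs the reward value assigned to that node, and is then trained further on the collected rewards. The input shapes, layer sizes and regression target below are assumptions, not the disclosed implementation.

```python
import torch
import torch.nn as nn

class RewardCNN(nn.Module):
    """Assigns a reward value to a node from its observed state map and the
    action probability vector predicted for that time step."""
    def __init__(self, n_actions):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten())      # -> 32 * 4 * 4 features
        self.head = nn.Sequential(
            nn.Linear(32 * 4 * 4 + n_actions, 64), nn.ReLU(),
            nn.Linear(64, 1))

    def forward(self, state_map, action_probs):
        # state_map: (batch, 1, H, W) local signal-field observation S_i
        # action_probs: (batch, n_actions) prior predicted by the LSTM
        features = self.conv(state_map)
        return self.head(torch.cat([features, action_probs], dim=-1))

# assigning a reward to the current node during the simulation phase
reward_net = RewardCNN(n_actions=8)
reward = reward_net(torch.randn(1, 1, 32, 32), torch.rand(1, 8))

# continued training with the collected reward values (illustrative target)
optimizer = torch.optim.Adam(reward_net.parameters(), lr=1e-3)
target = torch.tensor([[1.0]])                  # e.g. reward observed for this node
loss = nn.functional.mse_loss(reward, target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```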
To sum up, for the problem of agent path planning in a partially observable environment, the present embodiment uses a learning-based social response model to predict the agent's dynamics throughout the action space planning process. The method is applied to an agent system: while moving and observing environmental information, the agent continuously trains the parameters of the recurrent neural network, so that its ability to predict states during movement is improved and the reward distribution becomes more reasonable, and by combining Monte Carlo tree search to process the temporal action sequence data, the agent's ability to plan a path to the information source position is improved. The problem of efficient agent path planning in partially observable environments is thereby solved.
Second embodiment
The embodiment provides an information source navigation device based on deep Monte Carlo tree search, which comprises:
the historical environment information and action information acquisition module is used for acquiring environment information and executed action information of an intelligent agent to be navigated in historical time steps;
the action probability prediction module is used for predicting, through a preset first neural network, the action probability of the intelligent agent in each direction of the current time step based on the environment information and executed action information of the historical time steps acquired by the historical environment information and action information acquisition module;
the optimal execution action decision module is used for taking the action probability of each direction of the current time step predicted by the action probability prediction module as prior knowledge of a Monte Carlo tree search algorithm, and selecting the optimal execution action of the intelligent agent in the current time step through the Monte Carlo tree search algorithm;
and the optimal path acquisition module is used for combining the optimal execution action of each time step output by the optimal execution action decision module to obtain an optimal path for the intelligent agent to move to the information source.
The information source navigation device based on deep Monte Carlo tree search in this embodiment corresponds to the information source navigation method based on deep Monte Carlo tree search in the first embodiment; the functions realized by the functional modules of the device correspond one to one to the flow steps of the method; therefore, the description is not repeated here.
Third embodiment
The embodiment provides an electronic device, which comprises a processor and a memory; wherein the memory stores at least one instruction that is loaded and executed by the processor to implement the method of the first embodiment.
The electronic device may vary considerably in configuration or performance and may include one or more processors (central processing units, CPU) and one or more memories having at least one instruction stored therein that is loaded by the processors and performs the methods described above.
Fourth embodiment
The present embodiment provides a computer-readable storage medium having stored therein at least one instruction that is loaded and executed by a processor to implement the method of the first embodiment described above. The computer readable storage medium may be, among other things, ROM, random access memory, CD-ROM, magnetic tape, floppy disk, optical data storage device, etc. The instructions stored therein may be loaded by a processor in the terminal and perform the methods described above.
Furthermore, it should be noted that the present invention can be provided as a method, an apparatus, or a computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the invention may take the form of a computer program product on one or more computer-usable storage media having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article or terminal device comprising the element.
Finally, it should be noted that the above describes preferred embodiments of the invention. It will be understood that, although preferred embodiments have been described, those skilled in the art, once aware of the basic inventive concepts, may make several modifications and adaptations without departing from the principles of the invention, and these modifications and adaptations are intended to fall within the scope of the invention. It is therefore intended that the appended claims be interpreted as covering the preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.

Claims (2)

1. The information source navigation method based on the deep Monte Carlo tree search is characterized by comprising the following steps of:
acquiring environment information and executed action information of an agent to be navigated in a historical time step;
predicting the action probability of the intelligent agent in each direction of the current time step based on the environmental information in the historical time step and the executed action information through a preset first neural network;
taking the predicted action probability of each direction of the current time step as prior knowledge of a Monte Carlo tree search algorithm, and selecting the best execution action of the intelligent agent in the current time step through the Monte Carlo tree search algorithm;
combining the optimal execution action of each time step to obtain an optimal path for the intelligent agent to move to the information source;
the preset first neural network is a long short-term memory (LSTM) neural network;
when selecting the best execution action of the agent in the current time step by the Monte Carlo tree search algorithm, in the simulation phase of the Monte Carlo tree search algorithm, the method further comprises:
inputting the predicted action probability and the state information of the current node into a preset second neural network, distributing a reward value to the current node through the preset second neural network, and back-propagating the reward value to the root node;
the preset second neural network is a convolutional neural network;
after distributing the reward value to the current node through the preset second neural network and back-propagating the reward value to the root node, the method further comprises:
and continuing training the preset second neural network by using the acquired reward value so as to improve the prediction capability.
2. An information source navigation device based on deep Monte Carlo tree search, comprising:
the historical environment information and action information acquisition module is used for acquiring environment information and executed action information of an intelligent agent to be navigated in historical time steps;
the action probability prediction module is used for predicting, through a preset first neural network, the action probability of the intelligent agent in each direction of the current time step based on the environment information and executed action information of the historical time steps acquired by the historical environment information and action information acquisition module;
the optimal execution action decision module is used for taking the action probability of each direction of the current time step predicted by the action probability prediction module as prior knowledge of a Monte Carlo tree search algorithm, and selecting the optimal execution action of the intelligent agent in the current time step through the Monte Carlo tree search algorithm;
the optimal path acquisition module is used for combining the optimal execution action of each time step output by the optimal execution action decision module to obtain an optimal path for the intelligent agent to move to the information source;
the preset first neural network is a long short-term memory (LSTM) neural network;
when the optimal execution action of the agent in the current time step is selected through the Monte Carlo tree search algorithm, the optimal execution action decision module is further used for:
inputting the predicted action probability and the state information of the current node into a preset second neural network, distributing a reward value to the current node through the preset second neural network, and back-propagating the reward value to the root node;
the preset second neural network is a convolutional neural network;
after distributing the reward value to the current node through the preset second neural network and back propagating the reward value to the root node, the best execution action decision module is further configured to:
and continuing training the preset second neural network by using the acquired reward value so as to improve the prediction capability.
CN202110316103.9A 2021-03-24 2021-03-24 Information source navigation method and device based on deep Monte Carlo tree search Active CN113139644B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110316103.9A CN113139644B (en) 2021-03-24 2021-03-24 Information source navigation method and device based on deep Monte Carlo tree search

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110316103.9A CN113139644B (en) 2021-03-24 2021-03-24 Information source navigation method and device based on deep Monte Carlo tree search

Publications (2)

Publication Number Publication Date
CN113139644A CN113139644A (en) 2021-07-20
CN113139644B (en) 2024-02-09

Family

ID=76810034

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110316103.9A Active CN113139644B (en) 2021-03-24 2021-03-24 Information source navigation method and device based on deep Monte Carlo tree search

Country Status (1)

Country Link
CN (1) CN113139644B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113334395B (en) * 2021-08-09 2021-11-26 常州唯实智能物联创新中心有限公司 Multi-clamp mechanical arm disordered grabbing method and system

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR19990038244A (en) * 1997-11-04 1999-06-05 김덕중 Vehicle Navigation System and Path Matching Method Using Route Coupling Hypothesis
CN109063823A (en) * 2018-07-24 2018-12-21 北京工业大学 A kind of intelligent body explores batch A3C intensified learning method in the labyrinth 3D
CN109682392A (en) * 2018-12-28 2019-04-26 山东大学 Vision navigation method and system based on deeply study
CN110427261A (en) * 2019-08-12 2019-11-08 电子科技大学 A kind of edge calculations method for allocating tasks based on the search of depth Monte Carlo tree
CN110989352A (en) * 2019-12-06 2020-04-10 上海应用技术大学 Group robot collaborative search method based on Monte Carlo tree search algorithm
CN111242246A (en) * 2020-04-27 2020-06-05 北京同方软件有限公司 Image classification method based on reinforcement learning
CN111506980A (en) * 2019-01-30 2020-08-07 斯特拉德视觉公司 Method and device for generating traffic scene for virtual driving environment
CN111780777A (en) * 2020-07-13 2020-10-16 江苏中科智能制造研究院有限公司 Unmanned vehicle route planning method based on improved A-star algorithm and deep reinforcement learning
CN111832501A (en) * 2020-07-20 2020-10-27 中国人民解放军战略支援部队航天工程大学 Remote sensing image text intelligent description method for satellite on-orbit application

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108429259B (en) * 2018-03-29 2019-10-18 山东大学 A kind of online dynamic decision method and system of unit recovery

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR19990038244A (en) * 1997-11-04 1999-06-05 김덕중 Vehicle Navigation System and Path Matching Method Using Route Coupling Hypothesis
CN109063823A (en) * 2018-07-24 2018-12-21 北京工业大学 A kind of intelligent body explores batch A3C intensified learning method in the labyrinth 3D
CN109682392A (en) * 2018-12-28 2019-04-26 山东大学 Vision navigation method and system based on deeply study
CN111506980A (en) * 2019-01-30 2020-08-07 斯特拉德视觉公司 Method and device for generating traffic scene for virtual driving environment
CN110427261A (en) * 2019-08-12 2019-11-08 电子科技大学 A kind of edge calculations method for allocating tasks based on the search of depth Monte Carlo tree
CN110989352A (en) * 2019-12-06 2020-04-10 上海应用技术大学 Group robot collaborative search method based on Monte Carlo tree search algorithm
CN111242246A (en) * 2020-04-27 2020-06-05 北京同方软件有限公司 Image classification method based on reinforcement learning
CN111780777A (en) * 2020-07-13 2020-10-16 江苏中科智能制造研究院有限公司 Unmanned vehicle route planning method based on improved A-star algorithm and deep reinforcement learning
CN111832501A (en) * 2020-07-20 2020-10-27 中国人民解放军战略支援部队航天工程大学 Remote sensing image text intelligent description method for satellite on-orbit application

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
3D path planning algorithm based on deep reinforcement learning; Huang Dongjin; Jiang Chenfeng; Han Kaili; Computer Engineering and Applications (No. 15); full text *
Multi-agent decision-making method based on Monte Carlo Q-value function; Zhang Jian; Pan Yaozong; Yang Haitao; Sun Shu; Zhao Hongli; Control and Decision (No. 03); full text *
Design and implementation of an autonomous localization and navigation system for mobile robots; Zhang Xin; Zhang Yu; Su Xiaoming; Machine Tool & Hydraulics (No. 10); full text *

Also Published As

Publication number Publication date
CN113139644A (en) 2021-07-20

Similar Documents

Publication Publication Date Title
CN110119844B (en) Robot motion decision method, system and device introducing emotion regulation and control mechanism
Gronauer et al. Multi-agent deep reinforcement learning: a survey
Wiering Explorations in E cient Reinforcement Learning
Van Otterlo The logic of adaptive behavior: Knowledge representation and algorithms for adaptive sequential decision making under uncertainty in first-order and relational domains
CN105637540A (en) Methods and apparatus for reinforcement learning
CN112784949A (en) Neural network architecture searching method and system based on evolutionary computation
Liu et al. Libero: Benchmarking knowledge transfer for lifelong robot learning
Tang et al. A review of computational intelligence for StarCraft AI
CN113139644B (en) Information source navigation method and device based on deep Monte Carlo tree search
Tan et al. Optimized deep reinforcement learning approach for dynamic system
Yang et al. Abstract demonstrations and adaptive exploration for efficient and stable multi-step sparse reward reinforcement learning
Kujanpää et al. Hierarchical imitation learning with vector quantized models
He et al. Influence-augmented online planning for complex environments
Raffert et al. Optimally designing games for cognitive science research
Dockhorn Prediction-based search for autonomous game-playing
Massi et al. Model-Based and Model-Free Replay Mechanisms for Reinforcement Learning in Neurorobotics
Guo Deep learning and reward design for reinforcement learning
Jones et al. Data Driven Control of Interacting Two Tank Hybrid System using Deep Reinforcement Learning
Prescott Explorations in reinforcement and model-based learning
Venturini Distributed deep reinforcement learning for drone swarm control
Van Otterlo The Logic of Adaptive Behavior-Knowledge Representation and Algorithms for the Markov Decision Process Framework in First-Order Domains
Ba et al. Monte Carlo Tree Search with variable simulation periods for continuously running tasks
Ge Solving planning problems with deep reinforcement learning and tree search
Davide S-MARL: An Algorithm for Single-To-Multi-Agent Reinforcement Learning: Case Study: Formula 1 Race Strategies
DE GEUS et al. Utilizing Available Data to Warm Start Online Reinforcement Learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant