CN113139644B - Information source navigation method and device based on deep Monte Carlo tree search - Google Patents


Info

Publication number
CN113139644B
Authority
CN
China
Prior art keywords
action
neural network
time step
information
monte carlo
Prior art date
Legal status
Active
Application number
CN202110316103.9A
Other languages
Chinese (zh)
Other versions
CN113139644A (en)
Inventor
徐诚
何昊
段世红
殷楠
Current Assignee
Shunde Graduate School of USTB
Original Assignee
Shunde Graduate School of USTB
Priority date
Filing date
Publication date
Application filed by Shunde Graduate School of USTB filed Critical Shunde Graduate School of USTB
Priority to CN202110316103.9A priority Critical patent/CN113139644B/en
Publication of CN113139644A publication Critical patent/CN113139644A/en
Application granted granted Critical
Publication of CN113139644B publication Critical patent/CN113139644B/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G01MEASURING; TESTING
    • G01CMEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
    • G01C21/00Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
    • G01C21/20Instruments for performing navigational calculations
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Radar, Positioning & Navigation (AREA)
  • Remote Sensing (AREA)
  • Probability & Statistics with Applications (AREA)
  • Automation & Control Theory (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an information source navigation method and device based on deep Monte Carlo tree search. The method comprises the following steps: acquiring the environment information observed and the actions executed by an agent to be navigated in historical time steps; predicting, through a preset first neural network, the probability of the agent taking each directional action at the current time step based on the environment information and action information of the historical time steps; using the predicted action probabilities as prior knowledge for a Monte Carlo tree search algorithm and selecting the best action for the agent to execute at the current time step; and combining the best actions of all time steps to obtain the optimal path along which the agent moves to the information source. The invention provides an integrated path planning framework that embeds a recurrent neural network in the Monte Carlo tree, which helps to improve the stability and performance of navigation control and, by processing temporal action sequence data, solves the path planning problem in continuous space.

Description

Information source navigation method and device based on deep Monte Carlo tree search
Technical Field
The invention relates to the technical field of computer science, and in particular to an information source navigation method and device based on deep Monte Carlo tree search.
Background
Decision-making and problem-solving in environments with incomplete information assume that the state of the system cannot be directly observed and is only partially known; the system with incomplete state information is therefore modeled, and decisions are made from the currently available, incomplete state information. For example, in many environmental and geoscience applications, an expert wishes to collect the samples of greatest scientific value (e.g., the source of an oil spill) and to make the next decision from the samples already collected, but the distribution of the phenomenon is initially unknown. Typically, samples are collected by a technician or a mobile platform at predetermined locations along predetermined coverage trajectories. Such non-adaptive strategies lead to sparse sampling and may be infeasible when the geometry of the environment is unknown (e.g., a boulder field) or changing (e.g., a tidal zone); maximizing the number of valuable samples therefore requires adaptive positioning and navigation.
Positioning and navigation are performed in a partially observable environment; because states are not directly visible, decisions cannot be made from the states themselves. For such decision processes, Monte Carlo tree search is a heuristic best-first search algorithm that has achieved breakthroughs in many games since it was proposed, as it searches while balancing exploration and exploitation. During the search, states must first be predicted; research has shown that states in a discrete state space can be predicted through Gaussian process regression. However, for a continuous state space, how to predict states effectively and accurately remains an open problem.
In addition, in complex environments the positioning and navigation of an agent can only be controlled by searching through visual information; vision-based positioning and navigation attempts to imitate how humans reason about a perceived environment. During the search, the reward function is often sparse, and planning under sparse rewards requires long-term information collection, which remains a challenging problem for current agent technology. Moreover, with the development of computer vision, conventional navigation methods show deficiencies in predicting image states, and recurrent learning is needed as an effective solution for vision-based navigation design.
In summary, for the problem of agent path planning in partially observable environments, the stability and performance of the prior art are not ideal, so a new information source navigation method needs to be developed that can efficiently and stably find the optimal path for the agent once a signal field containing the signal source is given.
Disclosure of Invention
The invention provides an information source navigation method and device based on deep Monte Carlo tree search, which are used for solving the technical problem that the stability and performance of the prior art are not ideal when planning an agent's path in a partially observable environment.
In order to solve the technical problems, the invention provides the following technical scheme:
in one aspect, the invention provides an information source navigation method based on deep Monte Carlo tree search, which comprises the following steps:
acquiring environment information and executed action information of an agent to be navigated in a historical time step;
predicting the action probability of the intelligent agent in each direction of the current time step based on the environmental information in the historical time step and the executed action information through a preset first neural network;
taking the predicted action probability of each direction of the current time step as prior knowledge of a Monte Carlo tree search algorithm, and selecting the best execution action of the intelligent agent in the current time step through the Monte Carlo tree search algorithm;
and combining the optimal execution action of each time step to obtain an optimal path for the intelligent agent to move to the information source.
Optionally, the preset first neural network is a long short-term memory (LSTM) neural network.
Further, when selecting an optimal execution action of the agent in the current time step by the Monte Carlo tree search algorithm, in a simulation phase of the Monte Carlo tree search algorithm, the method further includes:
inputting the predicted action probability and the state information of the current node into a preset second neural network, distributing a reward value to the current node through the preset second neural network, and then back-propagating the reward value to the root node.
Optionally, the preset second neural network is a convolutional neural network.
Further, after the reward value is distributed to the current node through the preset second neural network and back-propagated to the root node, the method further includes:
continuing to train the preset second neural network with the acquired reward values so as to improve its prediction capability.
On the other hand, the invention also provides an information source navigation device based on the deep Monte Carlo tree search, which comprises:
the historical environment information and action information acquisition module is used for acquiring environment information and executed action information of an intelligent agent to be navigated in historical time steps;
the action probability prediction module is used for predicting, through a preset first neural network, the action probability of the intelligent agent in each direction of the current time step based on the environment information and executed action information of the historical time steps acquired by the historical environment information and action information acquisition module;
the optimal execution action decision module is used for taking the action probability of each direction of the current time step predicted by the action probability prediction module as prior knowledge of a Monte Carlo tree search algorithm, and selecting the optimal execution action of the intelligent agent in the current time step through the Monte Carlo tree search algorithm;
and the optimal path acquisition module is used for combining the optimal execution action of each time step output by the optimal execution action decision module to obtain an optimal path for the intelligent agent to move to the information source.
Optionally, the preset first neural network is a long short-term memory (LSTM) neural network.
Further, when selecting the best execution action of the agent in the current time step through the Monte Carlo tree search algorithm, the best execution action decision module is further configured to:
inputting the predicted action probability and the state information of the current node into a preset second neural network, distributing a reward value to the current node through the preset second neural network, and then back-propagating the reward value to the root node.
Optionally, the preset second neural network is a convolutional neural network.
Further, after the reward value is distributed to the current node through the preset second neural network and back-propagated to the root node, the optimal execution action decision module is further configured to:
continue training the preset second neural network with the acquired reward values so as to improve its prediction capability.
In yet another aspect, the present invention also provides an electronic device including a processor and a memory; wherein the memory stores at least one instruction that is loaded and executed by the processor to implement the above-described method.
In yet another aspect, the present invention also provides a computer readable storage medium having at least one instruction stored therein, the instruction being loaded and executed by a processor to implement the above method.
The technical scheme provided by the invention has the beneficial effects that at least:
aiming at the problem of agent path planning in a partially observable environment, the invention uses a learning-based social response model to predict the agent's dynamics throughout the action space planning process. The method is applied to an agent system: while moving and observing environmental information, the agent continuously trains the parameters of the recurrent neural network, so that its ability to predict states during movement is improved and the reward distribution becomes more reasonable, and by combining Monte Carlo tree search to process the temporal action sequence data, the agent's ability to plan a path to the information source position is improved. The problem of efficient agent path planning in partially observable environments is thereby solved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flow chart of a source navigation method based on deep monte carlo tree search according to an embodiment of the present invention;
fig. 2 is an algorithm framework diagram of the Monte Carlo tree search and neural network based method provided by an embodiment of the present invention;
fig. 3 is a flowchart of the execution of the monte carlo tree search algorithm.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the embodiments of the present invention will be described in further detail with reference to the accompanying drawings.
First embodiment
In a partially observable environment, the agent is equipped with sensors that perceive environmental information only within a limited range, and the agent must plan its path based on this limited information. This raises the questions of how to make an action decision at each time step and how to define the reward of each time step, since there is no explicit value function as in conventional reinforcement learning. For these problems, this embodiment provides an information source navigation method based on deep Monte Carlo tree search, applied to efficient path planning when an agent obtains only finite state information in a complex, partially observable environment. The method determines the action sequence of the agent over a future period under the given conditions, and all of these actions together form an optimal path through the environment.
The information source navigation method of this embodiment may be implemented by an electronic device, which may be a terminal or a server. The execution flow of the method is shown in fig. 1 and comprises the following steps:
s101, acquiring environment information and executed action information of an agent to be navigated in a historical time step;
s102, predicting the action probability of the intelligent agent in all directions of the current time step based on the environmental information in the historical time step and the executed action information through a preset first neural network;
s103, taking the predicted action probability of each direction of the current time step as priori knowledge of a Monte Carlo tree search algorithm, and selecting the best execution action of the intelligent agent in the current time step through Monte Carlo tree search;
s104, combining the optimal execution actions of each time step to obtain an optimal path for the intelligent agent to move to the information source.
Further, when selecting an optimal execution action of the agent in the current time step by the Monte Carlo tree search algorithm, in a simulation phase of the Monte Carlo tree search algorithm, the method further includes:
inputting the predicted action probability and the state information of the current node into a preset second neural network, distributing a reward value to the current node through the preset second neural network, and then back-propagating the reward value to the root node.
By adopting the above technical scheme, after training in a given signal field the agent can predict the state at the next moment while moving, and can travel from its initial position to the information source position at low cost and with high efficiency.
The information source navigation method of this embodiment integrates a recurrent neural network into the Monte Carlo tree search, and the Monte Carlo tree search uses the action probabilities predicted by the recurrent neural network to estimate the value of each state in a search tree. As more simulations are performed, the search tree grows larger and the associated value estimates become more accurate. By choosing subtrees with higher values, the policy probability used to select actions keeps improving over the course of the search. By combining supervised learning and reinforcement learning to train the recurrent neural network, a new search algorithm is introduced that successfully integrates neural network evaluation with Monte Carlo tree simulation, so that low-cost and efficient information source navigation can be realized.
Moreover, as reinforcement learning systems become more common, designing reward mechanisms that induce the desired behavior becomes both more important and more difficult. In reward distribution, reward shaping is critical to the speed of reinforcement learning, and the reward range is an important parameter that governs the effectiveness of shaping and has a strong influence on the running time of a simple reinforcement learning algorithm. Therefore, to obtain a reasonable reward distribution, this embodiment learns the reward distribution with a neural network: a convolutional neural network predicts the reward from the acquired state information and action information, and through continuous training its output can approximate the real reward.
Furthermore, path planning in a dynamic environment can be expressed as a sequential decision problem. Sequences play an important role in many applications and systems; in this embodiment, the temporal action sequence means that the recurrent neural network produces an action probability distribution for the next time step from the state information and action information of the K historical time steps, thereby providing prior knowledge for the Monte Carlo tree search. The Monte Carlo tree search then selects the best action for the current time step according to the action probabilities predicted by the recurrent neural network and the current state.
Moreover, the simulation operation of this method differs from that of conventional Monte Carlo tree search. The conventional simulation operation simulates to a final state: starting from the state of the added node, the simulation is run, either randomly or according to a heuristic policy, until a final state is reached. The simulation operation of this method instead simulates the distribution of the reward value: the state of the current node is taken as the input of the neural network, which outputs the reward value that should be distributed in the current state, and this reward value is back-propagated to the root node.
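A minimal sketch of this modified simulation step follows; the reward_net.predict interface and the node fields (state, parent, visits, value) are illustrative assumptions rather than the disclosed implementation, with value kept as the node's running average of assigned rewards.

```python
def simulate_and_backpropagate(node, reward_net, action_probs):
    """Modified simulation step: the reward network assigns a value to the
    expanded node's state instead of rolling out to a terminal state, and
    that value is back-propagated from the node up to the root."""
    reward = reward_net.predict(node.state, action_probs)
    current = node
    while current is not None:
        current.visits += 1
        # incremental mean: `value` stays the node's average assigned reward
        current.value += (reward - current.value) / current.visits
        current = current.parent
    return reward
```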
The action output of the Monte Carlo tree search algorithm serves as the decision main line, and the recurrent neural network's processing of the temporal action sequence serves as the time main line; combining the two promotes the online training and learning of the agent.
Specifically, in this embodiment, the preset first neural network is a long short-term memory (LSTM) neural network, and the preset second neural network is a convolutional neural network.
The framework of the information source navigation method based on deep Monte Carlo tree search in this embodiment is shown in fig. 2. Monte Carlo tree search works well in games, so this embodiment introduces it into the navigation problem; unlike a game, however, the navigation process is an action decision problem over a continuous state space. For this problem, this embodiment proposes to integrate a neural network into the Monte Carlo tree search and to use the neural network to process the huge and complex state space data. The method comprises the following specific steps:
(1) the agent executes an action and observes the environmental information after each action,
(2) the environmental information observed by the agent in the first K time steps (excluding the current time step) and the actions executed are input into the LSTM network, which outputs the action probability vector over all directions for the current time step,
(3) the predicted action probability information and the currently observed state information are input into the Monte Carlo tree search as the root node information,
(4) selection, expansion, simulation and back propagation are performed from the root node,
(5) during the simulation, the action probability information and the state information of the current node are passed into the convolutional neural network, the reward value is distributed to the current node, and the reward value is back-propagated,
(6) steps (4) and (5) are repeated until the number of Monte Carlo tree searches is met, and the best next action of the current time step is output,
(7) steps (1)-(6) are cycled until the program iteration termination condition is met.
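A minimal Python sketch of how steps (1)-(7) above might be organized; the env, lstm_policy and mcts objects and their methods, the step budget and the termination flag are illustrative assumptions, not the disclosed implementation.

```python
from collections import deque

K = 6            # number of historical time steps fed to the LSTM
MAX_STEPS = 500  # assumed iteration termination condition

def navigate_to_source(env, lstm_policy, mcts, n_simulations=100):
    """Illustrative outer loop: LSTM prior + Monte Carlo tree search per time step."""
    history = deque(maxlen=K)              # (observation, action) pairs of the last K steps
    obs = env.reset()                      # step (1): initial observation
    path = []
    for _ in range(MAX_STEPS):
        # step (2): action probabilities for the current time step from the history
        action_probs = lstm_policy.predict(list(history))
        # steps (3)-(6): tree search seeded with the prediction as the root-node prior
        best_action = mcts.search(root_state=obs, prior=action_probs,
                                  n_simulations=n_simulations)
        next_obs, reached_source = env.step(best_action)   # execute and observe
        history.append((obs, best_action))
        path.append(best_action)
        obs = next_obs
        if reached_source:                 # step (7): termination condition
            break
    return path                            # the combined best actions form the path
```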
The following describes the Monte Carlo tree search and illustrates the process of time series data processing and prize distribution in the algorithm.
Monte Carlo tree search:
Monte Carlo tree search is a best-first search method based on random sampling (Monte Carlo simulation) of the state space of a specific domain, meaning that decisions are made according to the results of random simulations. The MCTS execution flow is shown in fig. 3 and consists of four steps that are executed repeatedly until a computation threshold is reached, i.e. a set number of iterations, an upper limit on memory use, or a time limit. The four steps of each iteration are:
selection: starting from the root node, the child nodes are selected in a recursive manner according to a selection policy. When a leaf node is reached that does not represent the state of the terminal, its selection is extended.
In this step, a strategy is needed to explore the tree so as to make meaningful decisions and eventually converge to the most valuable subtree. The Upper Confidence bound applied to Trees (UCT) is used for this purpose: it maximizes the return of a multi-armed bandit, balancing the exploitation of high-reward nodes against the exploration of rarely visited ones. Given the current node, the child that is selected is the one maximizing the following expression (a code sketch of this selection rule is given after the four steps):
UCT_i = V_i + C * sqrt(ln(n_p) / n_i)
where V_i is the score of the current child node under the defined default policy; in the second term, n_p is the visit count of the parent node and n_i the visit count of the current child node; and C is an experimentally determined exploration constant. UCT is applied when the visit count of a child node is above a threshold T; when the visit count of a node is below this threshold, a child node is selected for expansion at random.
Expansion: given the available sequence of actions, all child nodes will be added to the selected leaf node.
Simulation: starting from the state of the added node, a simulation is run, either randomly or according to a heuristic policy, until a final state is reached.
Back propagation: the result of the simulation is propagated immediately from the selected node back to the root node. For each node chosen in the selection phase, the statistics along the tree are updated and its visit count is increased.
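A short sketch of the selection step under the UCT rule above; the Node class, the constants C and T, and the random choice below the visit threshold are illustrative assumptions consistent with the description, not the disclosed implementation.

```python
import math
import random

C = 1.4  # exploration constant, determined experimentally
T = 1    # visit-count threshold below which a child is chosen at random

class Node:
    """Search-tree node; `value` holds the node's score V_i and `visits` its visit count n_i."""
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0

def uct(child, parent_visits):
    # UCT_i = V_i + C * sqrt(ln(n_p) / n_i)
    return child.value + C * math.sqrt(math.log(parent_visits) / child.visits)

def select_child(node):
    """Selection step: pick the child maximizing UCT; children below the
    visit threshold T are chosen at random for expansion instead."""
    rarely_visited = [c for c in node.children if c.visits < T]
    if rarely_visited:
        return random.choice(rarely_visited)
    return max(node.children, key=lambda c: uct(c, node.visits))
```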
Time action sequence data processing:
in complex dynamic environments, many ongoing tasks and motion planning problems require efficient exploration of possible future environments. In real-world sequential decision problems (e.g. robotics), the order in which samples are collected is critical, especially when the robot must optimize a non-stationary objective function over time. Model-free reinforcement learning has proven successful on many challenging tasks but performs poorly on tasks requiring long-term planning. In the present invention, the Monte Carlo tree search is fused with a long short-term memory (LSTM) neural network so that reinforcement learning and deep learning complement each other. The procedure for predicting the action probabilities by processing the temporal action sequence with the LSTM is as follows:
(1) while travelling, the agent saves the state information observed at each time step T_k together with the action decision information output by the Monte Carlo tree search,
(2) at the current time T_t, the relevant information of the previous 6 time steps is read and used as the input of the LSTM, which outputs the probability prediction vector of each action at the current time,
(3) the Monte Carlo tree search takes the action prediction probabilities as prior knowledge to make the best action decision at the current time.
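A sketch, assuming PyTorch, of how the LSTM could map the last six (observation, action) pairs to an action probability vector; the feature sizes, the one-hot action encoding and the network layout are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class LSTMPolicy(nn.Module):
    """Predicts a probability vector over movement directions from the
    observations and actions of the previous K time steps."""
    def __init__(self, obs_dim, n_actions, hidden_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(input_size=obs_dim + n_actions,
                            hidden_size=hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs_seq, act_seq):
        # obs_seq: (batch, K, obs_dim); act_seq: (batch, K, n_actions), one-hot actions
        x = torch.cat([obs_seq, act_seq], dim=-1)
        out, _ = self.lstm(x)
        logits = self.head(out[:, -1])            # hidden state of the last time step
        return torch.softmax(logits, dim=-1)      # prior over actions for the MCTS

# example: a prior over 8 movement directions from K = 6 past steps
policy = LSTMPolicy(obs_dim=16, n_actions=8)
obs_seq = torch.randn(1, 6, 16)                   # placeholder observations
act_seq = torch.zeros(1, 6, 8)                    # placeholder one-hot actions
prior = policy(obs_seq, act_seq)                  # shape (1, 8), rows sum to 1
```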
Reward distribution:
from the Monte Carlo tree search process described above, it can be seen that the feedback after the simulation phase is critical to the overall search. In the simulation stage, this embodiment uses a convolutional neural network to estimate the reward value obtained by the agent while travelling, so that the convolutional neural network replaces the reinforcement learning value function in approximating the real reward. The reward distribution process is as follows:
(1) the Monte Carlo tree search expands to the current node, on which the simulation is performed,
(2) the state information S_i observed by the current node and the action prediction probability of this time step are acquired,
(3) this information is input into the convolutional neural network, which outputs the reward value distributed to the current node; meanwhile, the convolutional neural network is continuously trained with the obtained reward values so as to improve its prediction capability.
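A sketch, again assuming PyTorch, of the reward distribution in steps (1)-(3): a convolutional network takes the state S_i observed at the current node together with the action prediction probabilities and outputs the reward value assigned to that node, and is then trained further on the collected rewards. The input shapes, layer sizes and regression target below are assumptions, not the disclosed implementation.

```python
import torch
import torch.nn as nn

class RewardCNN(nn.Module):
    """Assigns a reward value to a node from its observed state map and the
    action probability vector predicted for that time step."""
    def __init__(self, n_actions):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten())      # -> 32 * 4 * 4 features
        self.head = nn.Sequential(
            nn.Linear(32 * 4 * 4 + n_actions, 64), nn.ReLU(),
            nn.Linear(64, 1))

    def forward(self, state_map, action_probs):
        # state_map: (batch, 1, H, W) local signal-field observation S_i
        # action_probs: (batch, n_actions) prior predicted by the LSTM
        features = self.conv(state_map)
        return self.head(torch.cat([features, action_probs], dim=-1))

# assigning a reward to the current node during the simulation phase
reward_net = RewardCNN(n_actions=8)
reward = reward_net(torch.randn(1, 1, 32, 32), torch.rand(1, 8))

# continued training with the collected reward values (illustrative target)
optimizer = torch.optim.Adam(reward_net.parameters(), lr=1e-3)
target = torch.tensor([[1.0]])                  # e.g. reward observed for this node
loss = nn.functional.mse_loss(reward, target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```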
To sum up, for the problem of agent path planning in a partially observable environment, the present embodiment uses a learning-based social response model to predict the agent's dynamics throughout the action space planning process. The method is applied to an agent system: while moving and observing environmental information, the agent continuously trains the parameters of the recurrent neural network, so that its ability to predict states during movement is improved and the reward distribution becomes more reasonable, and by combining Monte Carlo tree search to process the temporal action sequence data, the agent's ability to plan a path to the information source position is improved. The problem of efficient agent path planning in partially observable environments is thereby solved.
Second embodiment
The embodiment provides an information source navigation device based on deep Monte Carlo tree search, which comprises:
the historical environment information and action information acquisition module is used for acquiring environment information and executed action information of an intelligent agent to be navigated in historical time steps;
the action probability prediction module is used for predicting, through a preset first neural network, the action probability of the intelligent agent in each direction of the current time step based on the environment information and executed action information of the historical time steps acquired by the historical environment information and action information acquisition module;
the optimal execution action decision module is used for taking the action probability of each direction of the current time step predicted by the action probability prediction module as prior knowledge of a Monte Carlo tree search algorithm, and selecting the optimal execution action of the intelligent agent in the current time step through the Monte Carlo tree search algorithm;
and the optimal path acquisition module is used for combining the optimal execution action of each time step output by the optimal execution action decision module to obtain an optimal path for the intelligent agent to move to the information source.
The information source navigation device based on deep Monte Carlo tree search in this embodiment corresponds to the information source navigation method based on deep Monte Carlo tree search in the first embodiment; the functions realized by the functional modules of the device correspond one to one to the flow steps of the method; therefore, the description is not repeated here.
Third embodiment
The embodiment provides an electronic device, which comprises a processor and a memory; wherein the memory stores at least one instruction that is loaded and executed by the processor to implement the method of the first embodiment.
The electronic device may vary considerably in configuration or performance and may include one or more processors (central processing units, CPU) and one or more memories having at least one instruction stored therein that is loaded by the processors and performs the methods described above.
Fourth embodiment
The present embodiment provides a computer-readable storage medium having stored therein at least one instruction that is loaded and executed by a processor to implement the method of the first embodiment described above. The computer readable storage medium may be, among other things, ROM, random access memory, CD-ROM, magnetic tape, floppy disk, optical data storage device, etc. The instructions stored therein may be loaded by a processor in the terminal and perform the methods described above.
Furthermore, it should be noted that the present invention can be provided as a method, an apparatus, or a computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the invention may take the form of a computer program product on one or more computer-usable storage media having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article or terminal device comprising the element.
Finally, it should be noted that the above describes preferred embodiments of the invention. It will be understood that, although preferred embodiments have been described, those skilled in the art, once aware of the basic inventive concepts, may make several modifications and adaptations without departing from the principles of the invention, and these modifications and adaptations are intended to fall within the scope of the invention. It is therefore intended that the appended claims be interpreted as covering the preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.

Claims (2)

1. The information source navigation method based on the deep Monte Carlo tree search is characterized by comprising the following steps of:
acquiring environment information and executed action information of an agent to be navigated in a historical time step;
predicting the action probability of the intelligent agent in each direction of the current time step based on the environmental information in the historical time step and the executed action information through a preset first neural network;
taking the predicted action probability of each direction of the current time step as prior knowledge of a Monte Carlo tree search algorithm, and selecting the best execution action of the intelligent agent in the current time step through the Monte Carlo tree search algorithm;
combining the optimal execution action of each time step to obtain an optimal path for the intelligent agent to move to the information source;
the preset first neural network is a long short-term memory (LSTM) neural network;
when selecting the best execution action of the agent in the current time step by the Monte Carlo tree search algorithm, in the simulation phase of the Monte Carlo tree search algorithm, the method further comprises:
inputting the predicted action probability and the state information of the current node into a preset second neural network, distributing a reward value to the current node through the preset second neural network, and back-propagating the reward value to the root node;
the preset second neural network is a convolutional neural network;
after distributing the reward value to the current node through the preset second neural network and back-propagating the reward value to the root node, the method further comprises:
and continuing training the preset second neural network by using the acquired reward value so as to improve the prediction capability.
2. An information source navigation device based on deep Monte Carlo tree search, comprising:
the historical environment information and action information acquisition module is used for acquiring environment information and executed action information of an intelligent agent to be navigated in historical time steps;
the action probability prediction module is used for predicting, through a preset first neural network, the action probability of the intelligent agent in each direction of the current time step based on the environment information and executed action information of the historical time steps acquired by the historical environment information and action information acquisition module;
the optimal execution action decision module is used for taking the action probability of each direction of the current time step predicted by the action probability prediction module as prior knowledge of a Monte Carlo tree search algorithm, and selecting the optimal execution action of the intelligent agent in the current time step through the Monte Carlo tree search algorithm;
the optimal path acquisition module is used for combining the optimal execution action of each time step output by the optimal execution action decision module to obtain an optimal path for the intelligent agent to move to the information source;
the preset first neural network is a long short-term memory (LSTM) neural network;
when the optimal execution action of the agent in the current time step is selected through the Monte Carlo tree search algorithm, the optimal execution action decision module is further used for:
inputting the predicted action probability and the state information of the current node into a preset second neural network, distributing a reward value to the current node through the preset second neural network, and back-propagating the reward value to the root node;
the preset second neural network is a convolutional neural network;
after distributing the reward value to the current node through the preset second neural network and back propagating the reward value to the root node, the best execution action decision module is further configured to:
and continuing training the preset second neural network by using the acquired reward value so as to improve the prediction capability.
CN202110316103.9A 2021-03-24 2021-03-24 Information source navigation method and device based on deep Monte Carlo tree search Active CN113139644B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110316103.9A CN113139644B (en) 2021-03-24 2021-03-24 Information source navigation method and device based on deep Monte Carlo tree search

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110316103.9A CN113139644B (en) 2021-03-24 2021-03-24 Information source navigation method and device based on deep Monte Carlo tree search

Publications (2)

Publication Number Publication Date
CN113139644A CN113139644A (en) 2021-07-20
CN113139644B (en) 2024-02-09

Family

ID=76810034

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110316103.9A Active CN113139644B (en) 2021-03-24 2021-03-24 Information source navigation method and device based on deep Monte Carlo tree search

Country Status (1)

Country Link
CN (1) CN113139644B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113334395B (en) * 2021-08-09 2021-11-26 常州唯实智能物联创新中心有限公司 Multi-clamp mechanical arm disordered grabbing method and system

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR19990038244A (en) * 1997-11-04 1999-06-05 김덕중 Vehicle Navigation System and Path Matching Method Using Route Coupling Hypothesis
CN109063823A (en) * 2018-07-24 2018-12-21 北京工业大学 A kind of intelligent body explores batch A3C intensified learning method in the labyrinth 3D
CN109682392A (en) * 2018-12-28 2019-04-26 山东大学 Vision navigation method and system based on deeply study
CN110427261A (en) * 2019-08-12 2019-11-08 电子科技大学 A kind of edge calculations method for allocating tasks based on the search of depth Monte Carlo tree
CN110989352A (en) * 2019-12-06 2020-04-10 上海应用技术大学 Group robot collaborative search method based on Monte Carlo tree search algorithm
CN111242246A (en) * 2020-04-27 2020-06-05 北京同方软件有限公司 Image classification method based on reinforcement learning
CN111506980A (en) * 2019-01-30 2020-08-07 斯特拉德视觉公司 Method and device for generating traffic scene for virtual driving environment
CN111780777A (en) * 2020-07-13 2020-10-16 江苏中科智能制造研究院有限公司 Unmanned vehicle route planning method based on improved A-star algorithm and deep reinforcement learning
CN111832501A (en) * 2020-07-20 2020-10-27 中国人民解放军战略支援部队航天工程大学 Remote sensing image text intelligent description method for satellite on-orbit application

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108429259B (en) * 2018-03-29 2019-10-18 山东大学 A kind of online dynamic decision method and system of unit recovery

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR19990038244A (en) * 1997-11-04 1999-06-05 김덕중 Vehicle Navigation System and Path Matching Method Using Route Coupling Hypothesis
CN109063823A (en) * 2018-07-24 2018-12-21 北京工业大学 A kind of intelligent body explores batch A3C intensified learning method in the labyrinth 3D
CN109682392A (en) * 2018-12-28 2019-04-26 山东大学 Vision navigation method and system based on deeply study
CN111506980A (en) * 2019-01-30 2020-08-07 斯特拉德视觉公司 Method and device for generating traffic scene for virtual driving environment
CN110427261A (en) * 2019-08-12 2019-11-08 电子科技大学 A kind of edge calculations method for allocating tasks based on the search of depth Monte Carlo tree
CN110989352A (en) * 2019-12-06 2020-04-10 上海应用技术大学 Group robot collaborative search method based on Monte Carlo tree search algorithm
CN111242246A (en) * 2020-04-27 2020-06-05 北京同方软件有限公司 Image classification method based on reinforcement learning
CN111780777A (en) * 2020-07-13 2020-10-16 江苏中科智能制造研究院有限公司 Unmanned vehicle route planning method based on improved A-star algorithm and deep reinforcement learning
CN111832501A (en) * 2020-07-20 2020-10-27 中国人民解放军战略支援部队航天工程大学 Remote sensing image text intelligent description method for satellite on-orbit application

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
3D path planning algorithm based on deep reinforcement learning; Huang Dongjin; Jiang Chenfeng; Han Kaili; Computer Engineering and Applications (No. 15); full text *
Multi-agent decision-making method based on Monte Carlo Q-value function; Zhang Jian; Pan Yaozong; Yang Haitao; Sun Shu; Zhao Hongli; Control and Decision (No. 03); full text *
Design and implementation of an autonomous localization and navigation system for mobile robots; Zhang Xin; Zhang Yu; Su Xiaoming; Machine Tool & Hydraulics (No. 10); full text *

Also Published As

Publication number Publication date
CN113139644A (en) 2021-07-20

Similar Documents

Publication Publication Date Title
CN110119844B (en) Robot motion decision method, system and device introducing emotion regulation and control mechanism
Gronauer et al. Multi-agent deep reinforcement learning: a survey
Wiering Explorations in E cient Reinforcement Learning
Van Otterlo The logic of adaptive behavior: Knowledge representation and algorithms for adaptive sequential decision making under uncertainty in first-order and relational domains
CN105637540A (en) Methods and apparatus for reinforcement learning
CN112784949A (en) Neural network architecture searching method and system based on evolutionary computation
Liu et al. Libero: Benchmarking knowledge transfer for lifelong robot learning
Tang et al. A review of computational intelligence for StarCraft AI
CN113139644B (en) Information source navigation method and device based on deep Monte Carlo tree search
Tan et al. Optimized deep reinforcement learning approach for dynamic system
Yang et al. Abstract demonstrations and adaptive exploration for efficient and stable multi-step sparse reward reinforcement learning
Kujanpää et al. Hierarchical imitation learning with vector quantized models
He et al. Influence-augmented online planning for complex environments
Raffert et al. Optimally designing games for cognitive science research
Dockhorn Prediction-based search for autonomous game-playing
Massi et al. Model-Based and Model-Free Replay Mechanisms for Reinforcement Learning in Neurorobotics
Guo Deep learning and reward design for reinforcement learning
Jones et al. Data Driven Control of Interacting Two Tank Hybrid System using Deep Reinforcement Learning
Prescott Explorations in reinforcement and model-based learning
Venturini Distributed deep reinforcement learning for drone swarm control
Van Otterlo The Logic of Adaptive Behavior-Knowledge Representation and Algorithms for the Markov Decision Process Framework in First-Order Domains
Ba et al. Monte Carlo Tree Search with variable simulation periods for continuously running tasks
Ge Solving planning problems with deep reinforcement learning and tree search
Davide S-MARL: An Algorithm for Single-To-Multi-Agent Reinforcement Learning: Case Study: Formula 1 Race Strategies
DE GEUS et al. Utilizing Available Data to Warm Start Online Reinforcement Learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant