JP2020009122A

JP2020009122A - Control program, control method and system

Info

Publication number: JP2020009122A
Application number: JP2018129322A
Authority: JP
Inventors: セイドゥバ; Seydou Ba; 慶雅鶴岡; Yoshimasa Tsuruoka
Original assignee: National Institute of Advanced Industrial Science and Technology AIST
Current assignee: National Institute of Advanced Industrial Science and Technology AIST
Priority date: 2018-07-06
Filing date: 2018-07-06
Publication date: 2020-01-16
Anticipated expiration: 2038-07-06
Also published as: JP7093547B2

Abstract

To provide a control program, a control method, and a system capable of selecting and instructing an action according to a task at an appropriate time interval. SOLUTION: A control method includes the steps of: (A) in a search tree for a tree search that selects an action and a time, in which each edge between nodes is associated with an action to be executed next and a time until the subsequent action is selected, executing search processing that searches for nodes by the value of a first evaluation expression while adding new nodes, during the time associated with the edge to the node selected in the previously executed search processing; and (B) causing a control object to execute the action associated with the edge to the node selected in the search processing executed this time. SELECTED DRAWING: Figure 2

Description

The present invention relates to an optimal control technique.

Monte Carlo Tree Search (MCTS) determines the best action to take in the current state by representing the decision space as an incrementally growing search tree. Through random simulation, the search tree is updated with newly simulated states (nodes) and actions (edges). By repeatedly adding nodes and edges, a simulation transitions from the current state to a terminal state. Each node of the search tree holds the expected reward of proceeding from that node's state. The expected reward of a node represents the average outcome of all simulations that pass through that node.

An MCTS simulation is divided into four phases: selection, expansion, roll-out, and backpropagation. FIG. 1 schematically shows the four phases. (1) In the selection phase, nodes are selected from the root downward until a node that has not yet been fully expanded is reached. In the example of FIG. 1, an arrow indicates that a node two levels below the root node has been selected. (2) In the expansion phase, one node is added to the node selected in the selection phase. (3) In the roll-out phase, a simulation is performed according to the default roll-out policy. (4) In the backpropagation phase, the result of the simulation is propagated in the reverse direction, that is, from the leaf node toward the root node. The example of FIG. 1 schematically shows the simulation result being propagated back through the added node, the selected node, its parent node, and the root node.
As mentioned above, a node represents the state of the simulated task, and an edge corresponds to the action executed to move from the state of the current node to the state of a child node.

Each MCTS simulation starts from the root node, which corresponds to the current state. In the selection phase described above, the edges of the search tree are followed according to a selection policy defined over node values and visit counts. When a node to which a child node can still be added is reached, the search tree is extended by adding a new leaf node to that node. Adding this leaf node is the processing of the expansion phase. In the roll-out phase, the roll-out policy is then applied starting from the state of the new leaf node. The roll-out simulation is repeated until the maximum permitted number of steps has been executed or a terminal state is reached. A policy of straightforward random actions is widely used as the roll-out policy. Finally, the simulation result is back-propagated so as to update the information held in the subtree from the leaf node to the root node.

The key to the selection phase of MCTS is balancing the exploitation of promising actions against the exploration of the decision space. The Upper Confidence Bounds applied to Trees (UCT) algorithm provides a good solution to this problem and is commonly used.

A typical MCTS is applied to finite sequential decision problems. However, when the decision space is very large, every action available from a given state is explored at least once, so the look-ahead of the search tree becomes very shallow. This happens in environments where the decision space is infinite and continuous. For such problems, an approach called Hierarchical Optimistic Optimization applied to Tree (HOOT) has been proposed. HOOT integrates the HOO algorithm into tree-search planning in order to overcome the limits of UCT. HOO is a selection policy for the bandit problem that improves the regret bounds, that is, the loss incurred by arm selections, with respect to the results of previous simulations. HOO forms a general topological representation of the decision space by decomposing it as a tree.

When an action is queried, the HOO algorithm searches along the path that maximizes a score called the B-value, which is computed at each node. At the leaf node, an action is sampled from the range of the decision space that the leaf node represents. Two child nodes are then added to that leaf node, each covering half of the range of the decision space represented by its parent.
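The sample-then-split step at a HOO leaf can be sketched in one dimension as follows. This is a minimal illustration, not the referenced implementation: the class `HooNode`, the function `sample_and_split`, and the use of uniform sampling inside the leaf's range are all assumptions made for this example.

```python
import random
from dataclasses import dataclass, field


@dataclass
class HooNode:
    """A node covering the action range [lo, hi] (hypothetical structure)."""
    lo: float
    hi: float
    children: list = field(default_factory=list)


def sample_and_split(leaf: HooNode, rng: random.Random) -> float:
    """Sample an action inside the leaf's range, then add two half-range children."""
    # Assumption: the action is drawn uniformly from the leaf's range.
    action = rng.uniform(leaf.lo, leaf.hi)
    mid = (leaf.lo + leaf.hi) / 2.0
    # Each child covers one half of the range of the node being split.
    leaf.children = [HooNode(leaf.lo, mid), HooNode(mid, leaf.hi)]
    return action
```

Repeated calls on the leaf reached by the B-value descent refine the tree where sampling is concentrated.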

HOOT is similar to UCT, but it places HOO, a bandit algorithm for continuous actions, at each node of the search tree. To overcome UCT's limitation to discrete actions, HOO is used to sample the decision space when selecting an action.

In conventional MCTS, the time interval between selecting one action and selecting the next is fixed. Consequently, in an environment whose state changes from moment to moment, the only way to keep up is to shorten this interval. However, a shorter interval limits the number of simulations that can be run, so a reliable search tree cannot be generated.

Furthermore, conventional HOOT almost always handles a single type of action, and no concrete proposal exists on how to handle two or more types of actions.

Remi Coulom, "Efficient selectivity and backup operators in monte-carlo tree search", In International conference on computers and games, pages 72-83, Springer, 2006
Christopher R. Mansley, Ari Weinstein, and Michael L. Littman, "Sample-based planning for continuous action markov decision processes", In Proceedings of the 21st International Conference on Automated Planning and Scheduling, 2011
Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer, "Finite-time analysis of the multiarmed bandit problem", Machine learning, 47(2-3):235-256, 2002
Levente Kocsis and Csaba Szepesvari, "Bandit based monte-carlo planning", In European conference on machine learning, pages 282-293, Springer, 2006
Kristof Van Moffaert, "Multi-Criteria Reinforcement Learning for Sequential Decision Making Problems", Uitgeverij VUBPRESS Brussels University Press, 2016

Accordingly, an object of the present invention, in one aspect, is to provide a technique that enables an action suited to a task to be selected and instructed at an appropriate time interval.

A control method according to the present invention includes the steps of: (A) in a search tree for a tree search that selects an action and a time, in which each edge between nodes is associated with an action to be executed next and a time until the subsequent action is selected, executing search processing that searches for nodes by the value of a first evaluation expression while adding new nodes, during the time associated with the edge to the node selected in the previously executed search processing; and (B) causing a control object to execute the action associated with the edge to the node selected in the search processing executed this time.

According to one aspect, an action suited to a task can be selected and instructed at an appropriate time interval.

FIG. 1 is a diagram for explaining MCTS.
FIG. 2 is a diagram illustrating an overview of the entire system according to the embodiment.
FIG. 3 is a diagram illustrating an example of a time sequence according to the embodiment.
FIG. 4 is a diagram illustrating an example of an MCTS search tree according to the present embodiment.
FIG. 5 is a diagram illustrating the processing flow executed by the Monte Carlo tree search unit.
FIG. 6 is a diagram illustrating the processing flow executed by the 2DHOOT unit.
FIG. 7A is a diagram illustrating a first phase in a specific example of the processing.
FIG. 7B is a diagram illustrating a second phase in a specific example of the processing.
FIG. 7C is a diagram illustrating a third phase in a specific example of the processing.
FIG. 7D is a diagram illustrating a fourth phase in a specific example of the processing.
FIG. 7E is a diagram illustrating a first phase in a specific example of the processing.
FIG. 8 is a block diagram of a computer device.

FIG. 2 shows an overview of the system according to the present embodiment. The system includes a control device 100 and a control target 200 that executes a task continuously. The control device 100 functions as a controller for the control target 200.

The control device 100 includes a Monte Carlo tree search unit 110, a 2DHOOT unit 120, a simulation unit 130, and an interface unit 140. The interface unit 140 receives, from the Monte Carlo tree search unit 110, the action a_i to be executed at time t_i and the time τ_i until the next action is selected, and causes the control target 200 to execute action a_i during time τ_i. In return, the interface unit 140 receives the state S_i of the task at time t_i from the control target 200 and outputs it to the Monte Carlo tree search unit 110.

The simulation unit 130 is a simulation model that replicates the behavior of the task executed by the control target 200. Thus, when the Monte Carlo tree search unit 110 specifies a state S and instructs that action a be executed for time τ, the simulation unit 130 predicts the expected resulting state S and the reward r and returns them to the Monte Carlo tree search unit 110.
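The request and response of such a simulation model can be illustrated with a toy stand-in. Everything below is hypothetical: the function name `simulate`, the one-dimensional state, the interpretation of the action as a constant velocity, and the distance-based reward are assumptions chosen only to show the (state, action, time) to (state, reward) shape of the exchange.

```python
from typing import Tuple


def simulate(state: float, action: float, tau: float,
             target: float = 1.0) -> Tuple[float, float]:
    """Toy model: apply `action` as a constant velocity for duration `tau`.

    Returns the predicted next state and a reward that is larger
    (closer to zero) the nearer the next state is to `target`.
    """
    next_state = state + action * tau
    reward = -abs(next_state - target)
    return next_state, reward
```

A real model of the controlled task would replace the body, while the caller would still pass a state, an action, and a duration and read back a state and a reward.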

The 2DHOOT unit 120 receives, from the Monte Carlo tree search unit 110, the state S and the average reward r_a of the node of interest in the MCTS search tree, performs the processing described below, and returns a corresponding action a and time τ.

The Monte Carlo tree search unit 110 cooperates with the 2DHOOT unit 120 and the simulation unit 130 to perform the processing described below, thereby determining the action a_i to be executed next and the time τ_i until the subsequent action is selected.

A concrete time sequence is described with reference to FIG. 3. Time progresses from left to right; the upper row shows the transitions of actions and the lower row shows the transitions of time intervals. At time t_i, the actual state of the task is S_i, but the Monte Carlo tree search unit 110 works from an estimated state S'_i and selects action a_i and time τ_i at time t_i. During the time τ_i in which the control target 200 executes action a_i, the Monte Carlo tree search unit 110 performs the processing described below to estimate state S'_{i+1} (denoted MCTS(S'_{i+1}) in FIG. 3) and selects action a_{i+1} and time τ_{i+1}. At time t_{i+1}, which is τ_i after time t_i, the actual state of the task is S_{i+1}. This sequence is repeated. As FIG. 3 shows, τ_i and τ_{i+1} may or may not be equal; each is selected anew every time.

An example of the search tree searched by the Monte Carlo tree search unit 110 in the present embodiment is described with reference to FIG. 4. In the present embodiment, each edge between nodes is associated with a pair of an action and a time. The state of a child node is the state estimated to result when the action associated with the edge is applied, starting from the state of the parent node, for the time associated with the edge. In the example of FIG. 4, during the time τ_i in which action a_i is executed from state S_i, the next state S'_{i+1} is estimated. From there, one node is reached by the first candidate action-time pair (a^1_{i+1}, τ^1_{i+1}), another by the second candidate pair (a^2_{i+1}, τ^2_{i+1}), and another by the third candidate pair (a^3_{i+1}, τ^3_{i+1}).
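A tree whose edges carry an action-time pair could be represented as in the following sketch. The names `TreeNode`, `add_child`, and the `predict` callback are hypothetical, and storing the edge's pair on the child node is only one possible layout.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class TreeNode:
    state: float
    # The (action, tau) pair on the edge from the parent; None at the root.
    action: Optional[float] = None
    tau: Optional[float] = None
    parent: Optional["TreeNode"] = None
    children: List["TreeNode"] = field(default_factory=list)


def add_child(parent: TreeNode, action: float, tau: float,
              predict: Callable[[float, float, float], float]) -> TreeNode:
    """Add a child whose state is the one predicted when `action` is applied
    from the parent's state for duration `tau`."""
    child = TreeNode(state=predict(parent.state, action, tau),
                     action=action, tau=tau, parent=parent)
    parent.children.append(child)
    return child
```

For example, `add_child(root, 2.0, 0.5, lambda s, a, t: s + a * t)` attaches a child reached by action 2.0 held for 0.5 time units under a toy linear model.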

In the present embodiment, the Monte Carlo tree search unit 110 adopts the UCT described above and selects the node that maximizes a value called UCB1. For a child node j, UCB1 is expressed as follows.

$$\mathrm{UCB1} = \bar{X}_j + C \sqrt{\frac{2 \log n}{n_j}}$$

Here, $\bar{X}_j$ is the average of all rewards of the simulations that pass through child node j, n is the number of visits to the parent node of child node j, and n_j is the number of visits to child node j. C is a positive constant that is determined experimentally.

In FIG. 4, suppose the values attached to the three nodes reached by the first to third candidate action-time pairs are their UCB1 values. Then, as the arrow indicates, the first candidate action-time pair, which lies on the edge leading to the node with the largest UCB1 value, is selected.
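The UCB1 computation and the choice of the child with the largest value can be sketched as follows. The function names, the dictionary layout of a child, and the convention of returning infinity for an unvisited child are assumptions of this example.

```python
import math


def ucb1(mean_reward: float, parent_visits: int, child_visits: int,
         c: float = 1.0) -> float:
    """UCB1 = mean + C * sqrt(2 * log(n) / n_j)."""
    if child_visits == 0:
        # Assumption: an unvisited child is always tried first.
        return math.inf
    return mean_reward + c * math.sqrt(2.0 * math.log(parent_visits) / child_visits)


def best_child(children, parent_visits: int, c: float = 1.0):
    """Return the child (a dict with 'mean' and 'visits') with the largest UCB1."""
    return max(children, key=lambda ch: ucb1(ch["mean"], parent_visits, ch["visits"], c))
```

With a rarely visited child, the exploration term dominates, so that child can be preferred even when its mean reward is lower.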

The present embodiment solves the conventional problems by extending HOOT to two dimensions. That is, 2DHOOT is applied to the two-dimensional space spanned by time and action in order to sample time-action pairs effectively. 2DHOOT uses a separate search tree that represents the containment relationships among regions of the two-dimensional space hierarchically and has a node for each region. For example, the root node represents the entire two-dimensional space; its four child nodes represent the four large regions obtained by dividing the space into four; the four grandchild nodes under a child node represent the four medium regions obtained by dividing a large region into four; and the four great-grandchild nodes under a grandchild node represent the four small regions obtained by dividing a medium region into four.

The number of such pairs is, however, limited by the known technique of progressive widening (PW). To keep the search tree from becoming shallow, PW makes MCTS focus initially on a limited number of pairs. As the number of simulations increases, the number of pairs under consideration is increased so that the decision space is covered more widely. At any given point, once the number of child nodes of a node reaches the maximum number of pairs permitted for that node's visit count, the node is regarded as fully expanded.
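A progressive-widening test can be sketched as below. The text does not state how the permitted maximum depends on the visit count; the power-law form `ceil(c_pw * visits ** alpha)` used here is a common choice in the literature and is an assumption, as are the names `max_children`, `is_fully_expanded`, `c_pw`, and `alpha`.

```python
import math


def max_children(visits: int, c_pw: float = 1.0, alpha: float = 0.5) -> int:
    """Permitted number of children for a node visited `visits` times
    (assumed power-law schedule, at least one child)."""
    return max(1, math.ceil(c_pw * visits ** alpha))


def is_fully_expanded(num_children: int, visits: int,
                      c_pw: float = 1.0, alpha: float = 0.5) -> bool:
    """A node counts as fully expanded once it holds the permitted maximum."""
    return num_children >= max_children(visits, c_pw, alpha)
```

With these defaults a node visited 16 times may hold up to 4 children, so the branching factor grows slowly as simulations accumulate.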

2DHOOT acts as an additional filter on the set of action-time pairs evaluated during simulation. With 2DHOOT sampling, the selection of action-time pairs to add to the search tree is steered toward regions where promising pairs are more likely to be found. As a result, the efficiency with which MCTS selects action-time pairs improves even when the number of simulations is limited. A specific example is described in detail later.

Next, the processing performed by the control device 100 is described with reference to FIGS. 5 to 7E.

First, the MCTS processing for a given state of the task is described with reference to FIG. 5.

The Monte Carlo tree search unit 110 sets the root node as the current node (that is, the node of interest) (FIG. 5: step S1). Construction of the search tree, which is carried out within each time τ selected together with the previously selected action (τ varies), starts from the root node.

The Monte Carlo tree search unit 110 then determines whether the current node has been fully expanded (step S3). Whether a node has been fully expanded is determined based on, for example, PW. If the node is judged to be fully expanded, the Monte Carlo tree search unit 110 sets the child node with the largest evaluation value UCB1 as the current node (step S5), and the processing returns to step S3. The definition of UCB1 is the same as the conventional one.

If, on the other hand, the current node is not judged to be fully expanded, the Monte Carlo tree search unit 110 calls the 2DHOOT unit 120, passing the state S and the average reward r_a, to have it sample a pair (a, τ), and adds to the search tree a leaf node whose edge carries that pair (a, τ) (step S7). The processing of the 2DHOOT unit 120 is described later. A pointer to the node sampled in 2DHOOT is stored in the added leaf node.

The Monte Carlo tree search unit 110 then executes the roll-out by passing the sampled pair (a, τ) and the state S to the simulation unit 130 and having it simulate the task (step S9). The simulation runs on a model that imitates the task executed by the control target 200. The simulation unit 130 returns the state S and the reward r as the simulation result.

The Monte Carlo tree search unit 110 also executes backup processing that back-propagates the simulation result from the added leaf node (step S11). The backpropagation in this step is the same as the conventional one, so its description is omitted.
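A conventional backup step can be sketched as follows, assuming each node keeps a visit count and a running mean of the rewards that passed through it. The class `Node` and the incremental-mean update are assumptions of this example; the text only states that the step is conventional.

```python
class Node:
    """Minimal node with the statistics the backup step maintains."""

    def __init__(self, parent=None):
        self.parent = parent
        self.visits = 0
        self.mean_reward = 0.0


def backup(leaf: Node, reward: float) -> None:
    """Propagate `reward` from the added leaf up to the root."""
    node = leaf
    while node is not None:
        node.visits += 1
        # Incremental update of the running average reward.
        node.mean_reward += (reward - node.mean_reward) / node.visits
        node = node.parent
```

After two simulations with rewards 1.0 and 0.0 through different children, the root holds a visit count of 2 and a mean reward of 0.5.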

The Monte Carlo tree search unit 110 then determines whether the time τ_{i-1} associated with the edge to the previously selected node has elapsed since the start of the current processing (step S13). If the time τ_{i-1} has not elapsed since the start of the current processing, the processing returns to step S1.
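The repeat-until-the-budget-elapses structure of steps S1 to S13 can be sketched as an anytime loop. The function `search_until_deadline`, its parameters, and the injectable `clock` are hypothetical; `run_one_simulation` stands for one pass of selection, expansion, roll-out, and backup.

```python
import time
from typing import Callable


def search_until_deadline(run_one_simulation: Callable[[], None],
                          budget_seconds: float,
                          clock: Callable[[], float] = time.monotonic) -> int:
    """Run simulations until `budget_seconds` (the previous tau) has elapsed.

    The elapsed-time check comes after each simulation, mirroring the
    position of step S13, so at least one simulation always runs.
    """
    deadline = clock() + budget_seconds
    iterations = 0
    while True:
        run_one_simulation()
        iterations += 1
        if clock() >= deadline:
            break
    return iterations
```

Because the budget equals the duration chosen for the action now being executed, a longer τ directly buys more simulations for the next decision.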

If, on the other hand, the time τ_{i-1} has elapsed since the start of the processing, the Monte Carlo tree search unit 110 calculates the evaluation value UCB1 for each node, selects the node with the largest UCB1, and selects the pair (a_i, τ_i) associated with the edge to that node (step S15).

The Monte Carlo tree search unit 110 then outputs a_i and τ_i to the interface unit 140, and the interface unit 140 instructs the control target 200 to execute action a_i during time τ_i (step S17). The control target 200 executes the task as instructed and returns the state S_i as the result of the task.

By repeating this processing, the control target 200 can be made to execute an appropriately selected action in each appropriately selected time interval.

Next, the processing of the 2DHOOT unit 120, which is called in step S7, is described.

In the present embodiment, each node of the MCTS search tree contains a 2DHOOT search tree. In the 2DHOOT search tree, the root node holds an optimistic estimate of the quality of the entire action space (decision space), and its child nodes hold more accurate estimates for narrower regions of the action space. In general, the child nodes of a parent node represent distinct regions of equal size, and together those regions cover the whole region represented by the parent node. Under the 2DHOOT algorithm, a leaf node is selected, and a pair of an action a and a time τ is sampled from the region that the leaf node represents.

The processing is described in more detail with reference to FIG. 6.

First, the 2DHOOT unit 120 selects the root node (FIG. 6: step S31). The 2DHOOT unit 120 then sets the selected node as the current node (that is, the node of interest) (step S33). Next, the 2DHOOT unit 120 determines whether the current node has a child node whose score and related values have not been updated (step S35). If it does (step S35: Yes route), the 2DHOOT unit 120 selects such a child node (step S37), and the processing returns to step S33. In this way, loop L311 recursively searches downward from the root node for lower-level nodes whose scores and related values have not been updated.

When, on the other hand, the current node is one that has no child node whose score and related values have not been updated (step S35: No route), the 2DHOOT unit 120 updates the score and related values of the current node (step S39).

The expected reward of an action-time pair selected from the 2DHOOT search tree depends on the average reward r_a of the simulations performed in MCTS. In addition, the value of a node in the 2DHOOT search tree is affected by all the nodes belonging to its subtree. That is, the estimated reward R^e_i of node i is updated iteratively by the following equation.

[Equation (1)]

Here, j denotes each child node of node i, n_i is the number of visits to node i, n_j is the number of visits to node j, and R^e_j is the estimated reward of node j.

The U-value is defined as follows, based on the estimated reward R^e_i of node i.

[Equation (2)]

Here, n is the number of iterations of the processing in FIG. 6, and the constants v_1 and ρ satisfy v_1 > 0 and 0 < ρ < 1 and are set to experimentally appropriate values. h_i is the depth of node i in the search tree.

The score B-value (also written as score B) is defined as follows.

$$B_i = \min\left\{ U_i,\ \max_j B_j \right\} \quad (3)$$

Here, B_j is the score B-value of a child node j of node i. Equation (3) sets B_i to the smaller of U_i and the maximum of B_j over all child nodes j. For nodes that have not yet been sampled, B_i and U_i are set to infinity.
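The B-value rule described above (the smaller of U_i and the largest child B-value) can be sketched as follows. The U-value formula in this example is the one from the standard HOO algorithm, mean plus an exploration term plus v_1 ρ^h; this text does not reproduce equation (2), so treating it as having that form is an assumption. The function names and default constants are also hypothetical.

```python
import math
from typing import Sequence


def u_value(est_reward: float, visits: int, n_iter: int, depth: int,
            v1: float = 1.0, rho: float = 0.5) -> float:
    """Assumed standard HOO form: R + sqrt(2 ln n / n_i) + v1 * rho**h_i.

    `n_iter` must be at least 1. Unsampled nodes get infinity.
    """
    if visits == 0:
        return math.inf
    return est_reward + math.sqrt(2.0 * math.log(n_iter) / visits) + v1 * rho ** depth


def b_value(u: float, child_b_values: Sequence[float]) -> float:
    """B_i = min(U_i, max_j B_j).

    Assumption: a node without children is treated as if its absent
    children had infinite B-values, so its B-value equals its U-value.
    """
    if not child_b_values:
        return u
    return min(u, max(child_b_values))
```

Because the v1 * rho**depth term shrinks with depth, deeper nodes (smaller regions) receive a tighter optimistic bound.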

In step S39, values such as the score defined in this way are updated.

The 2DHOOT unit 120 then determines whether the current node is the root node (step S41). If the current node is not the root node, the 2DHOOT unit 120 selects its parent node (step S43), and the processing moves to step S33.

In this way, the processing in loop L31 updates values such as the score in the direction from the leaf nodes toward the root node.

If, on the other hand, the current node is the root node, the updating of values such as the score is complete, so the 2DHOOT unit 120 determines whether the current node is a leaf node (step S45). The first time the processing reaches step S45, the current node is the root node, so it is judged not to be a leaf node.

現在ノードがリーフノードではない場合には、２ＤＨＯＯＴ部１２０は、スコアB-valueが、最大の子ノードを選択して、現在ノードに設定する（ステップＳ４７）。また、選択されたノードの被訪問回数を１インクリメントする。そして、処理はステップＳ４５に移行する。このようなループＬ３３は、スコアB-valueが最大であるノードをリーフノードまでたどるパスを探索する処理である。 If the current node is not a leaf node, the 2D HOOT unit 120 selects a child node having the largest score B-value and sets it as the current node (step S47). Also, the number of times the selected node is visited is incremented by one. Then, the process proceeds to step S45. Such a loop L33 is a process of searching for a path that follows the node having the largest score B-value to the leaf node.
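The descent of loop L33 can be illustrated with a short sketch. The dictionary-based node record and the names `make_node` and `select_leaf` are hypothetical and chosen only for this example; breaking ties at random follows the pseudocode given later in this description.

```python
import math
import random


def make_node(b, children=()):
    """Hypothetical node record: B-value, visit count and child list."""
    return {"b": b, "visits": 0, "children": list(children)}


def select_leaf(root, rng=random):
    """Follow the largest B-value from the root down to a leaf."""
    node = root
    node["visits"] += 1
    while node["children"]:
        best = max(child["b"] for child in node["children"])
        # Ties (for example several unsampled children at infinity) are broken at random.
        candidates = [child for child in node["children"] if child["b"] == best]
        node = rng.choice(candidates)
        node["visits"] += 1
    return node
```

Each node on the selected path, including the root and the leaf, has its visit count incremented exactly once per descent.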

On the other hand, if the current node is a leaf node, the 2DHOOT unit 120 samples (a, τ) in the region covered by that leaf node (step S49). For the action a, a value within the range [a_min, a_max] covered by the selected leaf node is sampled. Likewise, for the time τ, a value within the range [τ_min, τ_max] covered by the selected leaf node is sampled.

Then, the 2DHOOT unit 120 expands the 2DHOOT search tree (step S51). Specifically, four child nodes are added to the selected leaf node. The four child nodes correspond to the four regions obtained by dividing the region covered by the selected leaf node into four equal parts. Because the space is the two-dimensional space of the action a and the time τ, the number of parts is the number of divisions per dimension (= 2) raised to the power of the number of dimensions (= 2), that is, 4. The number of divisions per dimension is a natural number of 2 or more and is determined by a setting made by the user or the like.
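Steps S49 and S51 can be sketched as follows for the two-dimensional case with two divisions per dimension. The `Region` class is hypothetical, and uniform sampling is an assumption of this sketch: the description only states that a value inside the covered range is sampled.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    a_min: float
    a_max: float
    tau_min: float
    tau_max: float

    def sample(self, rng=random):
        """Draw (a, tau) inside the region (uniform sampling is assumed)."""
        return (rng.uniform(self.a_min, self.a_max),
                rng.uniform(self.tau_min, self.tau_max))

    def split(self):
        """Divide the region into 2 x 2 = 4 equal, non-overlapping parts."""
        a_mid = (self.a_min + self.a_max) / 2.0
        tau_mid = (self.tau_min + self.tau_max) / 2.0
        return [
            Region(a_lo, a_hi, t_lo, t_hi)
            for a_lo, a_hi in ((self.a_min, a_mid), (a_mid, self.a_max))
            for t_lo, t_hi in ((self.tau_min, tau_mid), (tau_mid, self.tau_max))
        ]
```

For the normalized decision space, `Region(0.0, 1.0, 0.0, 1.0).split()` yields four quadrants, each with area 0.25.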

To update the visit counts of the selected nodes from the leaf node up to the root node, a backup of the 2DHOOT search tree is performed. That is, a visit count is held for each node and is updated.

Then, the 2DHOOT unit 120 outputs a pointer to the leaf node and the sampled (a, τ) to the Monte Carlo tree search unit 110 (step S53). As a result, more promising pairs of action and time are sampled.

A specific example of the processing will be described with reference to FIGS. 7A to 7E.

When node p in the MCTS search tree is visited for the first time (upper part of (a) of FIG. 7A), root node 1 of the 2DHOOT search tree is selected (upper part of (b) of FIG. 7A), and point 1 (a¹, τ¹) is sampled from the entire decision space spanned by the action a and the time τ, which root node 1 represents ((c) of FIG. 7A).

In the present embodiment, both the action a and the time τ are continuous values within finite ranges, but here they are shown normalized to [0, 1].

The 2DHOOT search tree is then expanded, that is, four child nodes are added (lower part of (b) of FIG. 7A). As shown in (c) of FIG. 7A, each of these four child nodes corresponds to one of the regions A to D, which do not overlap one another and have the same area.

The sampled (a¹, τ¹) is used to expand the MCTS search tree with node a. That is, as shown in the lower part of (a) of FIG. 7A, the edge from node p to node a is associated with (a¹, τ¹). When the value of node a is updated, the value of node 1 is also updated.

Next, when node p in the MCTS search tree is expanded again (upper part of (a) of FIG. 7B), child node 2 of root node 1 is selected in the 2DHOOT search tree (upper part of (b) of FIG. 7B), and point 2 (a², τ²) is sampled from region B represented by node 2 ((c) of FIG. 7B).

The 2DHOOT search tree is then expanded, that is, four child nodes are added (lower part of (b) of FIG. 7B). As shown in (c) of FIG. 7B, each of these four child nodes corresponds to one of the regions B1 to B4, which do not overlap one another and have the same area.

The sampled (a², τ²) is used to expand the MCTS search tree with node b. That is, as shown in the lower part of (a) of FIG. 7B, the edge from node p to node b is associated with (a², τ²).

Next, when node p in the MCTS search tree is expanded again (upper part of (a) of FIG. 7C), child node 3 of root node 1 is selected in the 2DHOOT search tree (upper part of (b) of FIG. 7C), and point 3 (a³, τ³) is sampled from region A represented by node 3 ((c) of FIG. 7C).

The 2DHOOT search tree is then expanded, that is, four child nodes are added (lower part of (b) of FIG. 7C). As shown in (c) of FIG. 7C, each of these four child nodes corresponds to one of the regions A1 to A4, which do not overlap one another and have the same area.

The sampled (a³, τ³) is used to expand the MCTS search tree with node c. That is, as shown in the lower part of (a) of FIG. 7C, the edge from node p to node c is associated with (a³, τ³).

This processing is repeated. After five nodes a to e have been added to the MCTS search tree, when node p in the MCTS search tree is expanded once more (upper part of (a) of FIG. 7D), child node 6 of node 2 is selected in the 2DHOOT search tree (upper part of (b) of FIG. 7D), and point 6 (a⁶, τ⁶) is sampled from region B1 represented by node 6 ((c) of FIG. 7D).

The 2DHOOT search tree is then expanded, that is, four child nodes are added (lower part of (b) of FIG. 7D). As shown in (c) of FIG. 7D, each of these four child nodes corresponds to one of the regions B11 to B14, which do not overlap one another and have the same area.

The sampled (a⁶, τ⁶) is used to expand the MCTS search tree with node f. That is, as shown in the lower part of (a) of FIG. 7D, the edge from node p to node f is associated with (a⁶, τ⁶).

Up to this point, node a is associated with node 1, node b with node 2, node c with node 3, node d with node 4, node e with node 5, and node f with node 6.

In the 2DHOOT search tree, all child nodes of the root node are updated before the search moves on to the level of the grandchild nodes. More generally, all children of a node are updated before the search descends from that node to the next level below.

The processing is repeated further. After ten nodes a to j have been added to the MCTS search tree, when node p in the MCTS search tree is expanded once more (upper part of (a) of FIG. 7E), child node 11 of node 7 is selected in the 2DHOOT search tree (upper part of (b) of FIG. 7E), and point 11 (a¹¹, τ¹¹) is sampled from region B23 represented by node 11 ((c) of FIG. 7E).

The 2DHOOT search tree is then expanded, that is, four child nodes are added (lower part of (b) of FIG. 7E). As shown in (c) of FIG. 7E, each of these four child nodes corresponds to one of the regions B231 to B234, which do not overlap one another and have the same area.

The sampled (a¹¹, τ¹¹) is used to expand the MCTS search tree with node k. That is, as shown in the lower part of (a) of FIG. 7E, the edge from node p to node k is associated with (a¹¹, τ¹¹).

In this way, a node in the 2DHOOT search tree is associated with a node in the MCTS search tree. When the value of a node in the MCTS search tree is updated, the value of the associated node in the 2DHOOT search tree is also updated.

Each node added to the MCTS search tree holds a pointer to the selected leaf node in the 2DHOOT search tree of its parent node, and initializes a new 2DHOOT search tree of its own. In other words, each node of the MCTS search tree holds a 2DHOOT search tree for selecting action and time pairs at that node, and a pointer to the leaf node, in the 2DHOOT search tree of its parent node, from which its own action and time pair was sampled.

The pointer to the leaf node is used to update the average reward r_a. As shown in expression (1), the average reward r_a is used to compute the estimated reward R^e_i of node i in the 2DHOOT search tree. Using the average reward r_a in this way steers node selection in the 2DHOOT search tree toward regions of the decision space, such as actions, that are expected to perform well over the long term.
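The link between the two trees can be sketched as a pair of records. The class names `MctsNode` and `HootNode` and their fields are hypothetical. The backup below copies the accumulated reward into the linked leaf, mirroring the BACKUP pseudocode of this description; how the average reward r_a is derived from it is not shown here.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class HootNode:
    reward: float = 0.0
    visits: int = 0


@dataclass
class MctsNode:
    parent: Optional["MctsNode"] = None
    action: Optional[float] = None      # a on the edge from the parent
    tau: Optional[float] = None         # time on the edge from the parent
    h_leaf: Optional[HootNode] = None   # leaf in the parent's 2DHOOT tree
    h_root: HootNode = field(default_factory=HootNode)  # this node's own 2DHOOT tree
    reward: float = 0.0
    visits: int = 0


def back_up(node, reward):
    """Propagate a reward to the root and mirror it into each linked leaf."""
    while node is not None:
        node.visits += 1
        node.reward += reward
        if node.h_leaf is not None:     # the root has no parent tree to update
            node.h_leaf.reward = node.reward
        node = node.parent
```

Holding the pointer on the child means that a reward observed deep in the MCTS search tree reaches the exact region of the parent's decision space from which the edge was sampled.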

The present embodiment has been described above, but the present invention is not limited to it. For example, the functional block configuration is merely an example and may not match the program module configuration. In the processing flows as well, the order of steps may be changed, or a plurality of steps may be executed in parallel, as long as the processing result does not change.

In the example described above, a two-dimensional space of the action a and the time τ is handled, but HOOT can also handle more dimensions. That is, an n-dimensional space of actions a_1 to a_{n-1} and the time τ may be handled. An n-dimensional space that excludes the time τ is also possible. In that case, when the search tree is expanded, the n-dimensional HOOT adds as many child nodes as the specified number of divisions per dimension (a natural number of 2 or more) raised to the n-th power.
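The n-dimensional expansion can be sketched with a Cartesian product over the per-dimension intervals. The function name `split_box` and the representation of a region as a list of (low, high) pairs are assumptions of this example.

```python
from itertools import product


def split_box(bounds, k=2):
    """Split an n-dimensional box into k**n equal sub-boxes.

    bounds is a list of (low, high) pairs, one per dimension.
    """
    axes = []
    for low, high in bounds:
        step = (high - low) / k
        axes.append([(low + i * step, low + (i + 1) * step) for i in range(k)])
    # One sub-box for every combination of per-dimension intervals.
    return [list(cell) for cell in product(*axes)]
```

With three dimensions and two divisions per dimension this produces 8 child regions, and with two dimensions and three divisions it produces 9.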

The control target unit 200 may be anything; for example, it is a device that performs a continuous task such as automated driving of a vehicle.

The contents of the processing shown in FIG. 6 are expressed in pseudocode as follows.

procedure UCTSEARCH(root, τ)
    while time < τ do                                        // step S13
        front = TreeSearch(root)                             // step S1
        reward = DefaultPolicy(front.state, front.horizon)   // step S9
        BackUp(front, reward)                                // step S11
    return BestChild(root, 0)                                // select the best (a, τ)

procedure TREESEARCH(node)
    while not node.state.terminal() do       // repeat the search until a terminal state is reached
        if not node.fully.expanded() then    // step S3
            return Expand(node)              // step S7
        else
            node = BestChild(node, scale)    // step S5
    return node

procedure DEFAULTPOLICY(state, h)            // details of step S9
    reward = state.reward
    done = state.terminal()
    steps = h
    while done == false && steps < MAXSTEPS do
        a = sample.actionSpace()
        obs, rew, done = env.step(a)
        reward = reward + rew
        steps = steps + 1
    return reward

procedure EXPAND(node)
    a, τ, Hleaf = HOOSearch(node.Hroot)      // step S7: sample (a, τ)
    addedNode = node.AddChild(a, τ, Hleaf)   // step S7: add the node
    return addedNode                         // step S7: the search tree is expanded

procedure BESTCHILD(node, C)                 // select the node with the largest UCB1
    bestchildren = []
    UCB1_max = −∞
    for x in node.children do
        UCB1 = x.reward/x.visits + C × {2 log(node.visits)/x.visits}^0.5
        if UCB1 == UCB1_max then
            bestchildren.append(x)
        if UCB1 > UCB1_max then
            bestchildren = [x]
            UCB1_max = UCB1
    return random.choice(bestchildren)       // ties are broken at random

procedure BACKUP(node, reward)               // details of step S11
    while node do
        node.visits = node.visits + 1        // increment the visit count
        node.reward = node.reward + reward   // update the reward
        node.Hleaf.reward = node.reward      // update the 2DHOOT node value
        node = node.parent
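The BESTCHILD selection in the pseudocode above can be written as a small runnable sketch. Representing each child as a `(name, reward, visits)` tuple is an assumption made for this example, and every child is assumed to have been visited at least once.

```python
import math
import random


def ucb1(child_reward, child_visits, parent_visits, c):
    """Mean reward plus the exploration term scaled by c."""
    return (child_reward / child_visits
            + c * math.sqrt(2.0 * math.log(parent_visits) / child_visits))


def best_child(children, parent_visits, c, rng=random):
    """Return the child with the largest UCB1; ties are broken at random.

    children is a list of (name, reward, visits) tuples with visits > 0.
    """
    scores = [ucb1(r, v, parent_visits, c) for _, r, v in children]
    top = max(scores)
    return rng.choice([child for child, s in zip(children, scores) if s == top])
```

Calling it with `c = 0`, as UCTSEARCH does for its final choice, ranks the children by mean reward alone, whereas a positive `c` favors children that have been visited less often.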

The contents of the processing shown in FIG. 7 are expressed in pseudocode as follows.

procedure HOOSEARCH(root)
    HOO-Update(root, root.visits)            // loop L31: update the tree
    leaf = HOOPolicy(root)                   // loop L33: search for the path of maximum score
    a, τ = sample.leaf.region()              // step S49
    leaf.expand()                            // step S51
    return a, τ, leaf                        // return (a, τ) and the pointer

procedure HOO-UPDATE(node, N)                // details of loop L31 (recursive)
    childB_max = 0
    for x in node.children do                // loop L311: update the child nodes recursively
        HOO-Update(x, N)
    for x in node.children do                // step S39
        node.reward = node.reward + x.reward
        if x.Bvalue > childB_max then
            childB_max = x.Bvalue            // used to update Bvalue
    Uvalue = node.reward/node.visits + {2 log(N)/node.visits}^0.5 + v₁ρ^h   // step S39: update the U-value
    if Uvalue < childB_max then              // step S39: update the B-value
        node.Bvalue = Uvalue
    else
        node.Bvalue = childB_max

procedure HOOPOLICY(node)
    while node.children do                   // loop L33
        node.visits = node.visits + 1        // increment the visit count of the node
        bestchildren = []
        Bvalue_max = −∞
        for x in node.children do
            if x.Bvalue == Bvalue_max then
                bestchildren.append(x)
            if x.Bvalue > Bvalue_max then
                bestchildren = [x]
                Bvalue_max = x.Bvalue
        node = random.choice(bestchildren)   // ties are broken at random
    node.visits = node.visits + 1            // increment the visit count of the leaf
    return node                              // return the selected leaf node

The control device 100 described above is a computer device in which, as shown in FIG. 8, a memory 2501, a CPU (Central Processing Unit) 2503, a hard disk drive (HDD) 2505, a display control unit 2507 connected to a display device 2509, a drive device 2513 for a removable disk 2511, an input device 2515, a communication control unit 2517 for connecting to a network, and a speaker 2518 are connected by a bus 2519. The HDD may instead be a storage device such as a solid state drive (SSD). An operating system (OS) and an application program for carrying out the processing of the present embodiment are stored in the HDD 2505 and are read from the HDD 2505 into the memory 2501 when executed by the CPU 2503. The CPU 2503 controls the display control unit 2507, the communication control unit 2517, and the drive device 2513 according to the processing of the application program so that they perform predetermined operations. Data in the course of processing is stored mainly in the memory 2501, but may be stored in the HDD 2505. In the embodiment of the present technology, the application program for carrying out the processing described above is stored on and distributed via the computer-readable removable disk 2511, and is installed from the drive device 2513 onto the HDD 2505. It may also be installed onto the HDD 2505 via a network such as the Internet and the communication control unit 2517. Such a computer device realizes the various functions described above through the organic cooperation of hardware such as the CPU 2503 and the memory 2501 with programs such as the OS and the application program.

Data used in executing the processing described above, whether intermediate or final, is stored in a storage device such as the memory 2501 or the HDD 2505. For example, the data representing the search trees and node values such as visit counts, evaluation values, and scores are stored in the memory 2501 or the like.

The embodiment described above can be summarized as follows.

The control method according to the present embodiment includes the steps of: (A) in a search tree for a tree search that selects an action and a time, in which each edge between nodes is associated with an action to be executed next and a time until the action to be executed after that is selected, executing search processing that searches for a node by the value of a first evaluation expression (for example, the node with the largest UCB1 value) while adding new nodes, during the time associated with the edge to the node selected in the previously executed search processing; and (B) causing a control object to execute the action associated with the edge to the node selected in the search processing executed this time.

In this way, using the novel search tree, an action suited to the task performed by the control object can be selected and instructed at appropriate time intervals. The search processing of the tree search follows, for example, MCTS.
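The interleaving of steps (A) and (B) can be sketched as a simple loop. The callables `plan` and `execute` are hypothetical stand-ins for the search processing and for the interface to the control object, and the time budget is passed to the planner as a number rather than measured on a clock, which is a simplification made for this example.

```python
def run_control(plan, execute, first_action, first_tau, steps):
    """Alternate between instructing an action and planning the next one.

    plan(budget) returns the next (action, tau) pair and is expected to use
    at most `budget` time; execute(action) sends the action to the control object.
    """
    action, tau = first_action, first_tau
    log = []
    for _ in range(steps):
        execute(action)              # (B) instruct the action selected by the latest search
        log.append((action, tau))
        action, tau = plan(tau)      # (A) search during the time tied to the selected edge
    return log
```

The time selected together with one action therefore becomes the search budget for choosing the next action, so the planner itself decides how long it may deliberate.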

The action to be executed next may correspond to one value among finite continuous values, and the time until the action to be executed after that is selected may be one value within a finite continuous time range. These values are selected by appropriate sampling.

The processing of adding the new node included in the search processing described above may include the steps of: (a1) in a second search tree that hierarchically represents the inclusion relations of regions contained in the space spanned by the action to be executed next and the time, and that has a node corresponding to each region, selecting nodes from the root node of the second search tree to a leaf node by the value of a second evaluation expression (for example, the node with the largest score B); (a2) selecting a point within the region corresponding to the selected leaf node; and (a3) adding, to the search tree, a node connected by an edge associated with the action and time of the selected point.

In this way, processing that uses the second search tree described above may be adopted in order to sample action and time pairs appropriately. Such processing follows, for example, HOOT extended to multiple dimensions.

Furthermore, the processing of adding the new node included in the search processing described above may include the step of: (a4) adding, to the second search tree, leaf nodes corresponding to new regions generated by dividing the region corresponding to the selected leaf node into a number of parts equal to a predetermined number (a natural number of 2 or more) raised to the power of the number of dimensions of the space. This makes it possible to divide the space appropriately while representing the relations between regions appropriately in the second search tree.

A program for causing a computer to execute the control method described above can be created, and the program is stored in various storage media.

A control device that executes the control method described above may be implemented by a single computer or by a plurality of computers; these are collectively referred to as a control system or simply a system.

REFERENCE SIGNS LIST
100 Control device
110 Monte Carlo tree search unit
120 2DHOOT unit
130 Simulation unit
140 Interface unit
200 Control object

Claims

A control program for causing one or more processors to execute:
a step of, in a search tree for a tree search that selects an action and a time, in which each edge between nodes is associated with an action to be executed next and a time until the action to be executed after that is selected, executing search processing that searches for a node by the value of a first evaluation expression while adding a new node, during the time associated with the edge to the node selected in the previously executed search processing; and
a step of causing a control object to execute the action associated with the edge to the node selected in the search processing executed this time.

The control program according to claim 1, wherein the action to be executed next corresponds to one value among finite continuous values, and
the time until the action to be executed after that is selected is one value within a finite continuous time range.

The control program according to claim 2, wherein the processing of adding the new node included in the search processing includes:
a step of, in a second search tree that hierarchically represents the inclusion relations of regions contained in the space spanned by the action to be executed next and the time, and that has a node corresponding to each region, selecting nodes from the root node of the second search tree to a leaf node by the value of a second evaluation expression;
a step of selecting a point within the region corresponding to the selected leaf node; and
a step of adding, to the search tree, a node connected by an edge associated with the action and time of the selected point.

The control program according to claim 3, wherein the processing of adding the new node included in the search processing further includes:
a step of adding, to the second search tree, leaf nodes corresponding to new regions generated by dividing the region corresponding to the selected leaf node into a number of parts equal to a predetermined number (a natural number of 2 or more) raised to the power of the number of dimensions of the space.

A control method in which one or more processors execute:
a step of, in a search tree for a tree search that selects an action and a time, in which each edge between nodes is associated with an action to be executed next and a time until the action to be executed after that is selected, executing search processing that searches for a node by the value of a first evaluation expression while adding a new node, during the time associated with the edge to the node selected in the previously executed search processing; and
a step of causing a control object to execute the action associated with the edge to the node selected in the search processing executed this time.

A system comprising:
a search unit that, in a search tree for a tree search that selects an action and a time, in which each edge between nodes is associated with an action to be executed next and a time until the action to be executed after that is selected, executes search processing that searches for a node by the value of a first evaluation expression while adding a new node, during the time associated with the edge to the node selected in the previously executed search processing; and
an instruction unit that causes a control object to execute the action associated with the edge to the node selected in the search processing executed this time.