JP6285849B2

JP6285849B2 - Behavior control system, method and program thereof

Info

Publication number: JP6285849B2
Application number: JP2014232549A
Authority: JP
Inventors: 洋川野
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2014-11-17
Filing date: 2014-11-17
Publication date: 2018-02-28
Anticipated expiration: 2034-11-17
Also published as: JP2016095754A

Description

本発明は、複数の制御対象物の行動を制御する技術に関する。例えば、複数のロボットを、開始位置における隊列形成状態から協調して移動させ、障害物を回避させ、目標位置で隊列形成をさせるための各ロボットの行動計画を求めるロボット協調制御技術に関する。 The present invention relates to a technique for controlling actions of a plurality of control objects. For example, the present invention relates to a robot cooperative control technique for obtaining an action plan for each robot for moving a plurality of robots in a coordinated manner from a formation state at a start position, avoiding an obstacle, and forming a formation at a target position.

近年、多数の自律移動ロボットを効率的に制御にするための研究が活発に行われている。その任務内容は、人の入れない箇所の監視、物品の搬送などさまざまであるが、多数のロボットの協調動作による隊列形成を効率的に行わせるための技術が求められており盛んに研究が行われている（例えば、非特許文献１参照）。多数のロボットによる効率的な隊列形成を実現するには、それぞれのロボットの配置、動作順序などを事前に計画することが重要である。このような計画においては、当然ながら、複数のロボットが動作する実環境における障害物の存在や経路の形状なども十分に考慮しなければならない。 In recent years, research has been actively conducted to efficiently control a large number of autonomous mobile robots. Their missions vary, such as monitoring places where people can't enter, transporting goods, etc., but technology is being sought for efficient formation of platoons through the coordinated operation of many robots. (For example, refer nonpatent literature 1). In order to realize efficient formation of a formation by a large number of robots, it is important to plan the arrangement and operation order of each robot in advance. In such a plan, as a matter of course, it is necessary to sufficiently consider the presence of obstacles and the shape of a route in an actual environment where a plurality of robots operate.

このような計画計算を行うための効果的な手法の一つとして、マルコフ決定過程における動的計画法や強化学習の手法があり、さまざまな研究が行われている（例えば、非特許文献２参照）。 As an effective method for performing such a plan calculation, there are a dynamic programming method and a reinforcement learning method in a Markov decision process, and various studies have been conducted (for example, see Non-Patent Document 2). ).

また、ロボットの隊列制御の中でも、ロボット同士が互いに接したままの状態で、アメーバのように全体で移動を行うという仮定の下でのロボット隊列制御においては、ロボット同士の相対的な位置関係から、各ロボットの絶対位値の決定が可能であるという利点と、付加的な位置計測用の装備を必要としないという利点があり、そのようなロボットの研究もおこなわれている。例えば、非特許文献３に示すものでは任意の矩形形状隊列から他の矩形形状隊列までの隊列制御が示されている。 Also, in the robot row control, the robot row control under the assumption that the robot moves as a whole in a state where the robots are in contact with each other, the relative positional relationship between the robots The advantages of being able to determine the absolute position of each robot and the advantage of not requiring additional position measurement equipment are being studied. For example, in the non-patent document 3, the row control from an arbitrary rectangular shape row to another rectangular shape row is shown.

また、非特許文献４に示す研究に至る一連の研究や非特許文献５では、ある隊列から他の隊列に変化する隊列制御が示されている。 In addition, in a series of studies leading to the research shown in Non-Patent Document 4 and Non-Patent Document 5, the formation control that changes from one formation to another formation is shown.

M.Shimizu, A.Ishiguro, T.Kawakatsu, Y.Masubuchi, “Coherent Swarming from Local Interaction by Exploiting Molecular Dynamics and Stokesian Dynamics Methods”, Proceeaings of the 2003 IEE/RSJ International Conference on intelligent Robots and Systems, Las Veqas, pp.1614-1619, October 2003.M. Shimizu, A. Ishiguro, T. Kawakatsu, Y. Masubuchi, “Coherent Swarming from Local Interaction by Exploiting Molecular Dynamics and Stokesian Dynamics Methods”, Proceeaings of the 2003 IEE / RSJ International Conference on intelligent Robots and Systems, Las Veqas, pp.1614-1619, October 2003. Y.Wang, C.W.de Silva, “Multi-Robot Box-pushing: Single-Agent Q-Learning vs. Team Q-Learning”, Proceedings of the 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems, Beijing, China, pp.3694-3699, October 2006.Y.Wang, CWde Silva, “Multi-Robot Box-pushing: Single-Agent Q-Learning vs. Team Q-Learning”, Proceedings of the 2006 IEEE / RSJ International Conference on Intelligent Robots and Systems, Beijing, China, pp .3694-3699, October 2006. A.Becker, G.Habibi, J.Werfel, M.Rubenstein, and J.McLurkin, “Massive Uniform Manipulation: Controlling Large Populations of Simple Robots with a Common Input Signal”, Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems, Japan, pp.520-527, November, 2013.A. Becker, G. Habibi, J. Werfel, M. Rubenstein, and J. McLurkin, “Massive Uniform Manipulation: Controlling Large Populations of Simple Robots with a Common Input Signal”, Proceedings of the IEEE / RSJ International Conference on Intelligent Robots and Systems, Japan, pp.520-527, November, 2013. Stanton Wong1 and Jennifer Walter ”Deterministic Distributed Algorithm for Self-Reconfiguration of Modular Robots from Arbitrary to Straight Chain Configurations”, Proceedings of the 2013 IEEE International Conference on Robotics and Automation (ICRA), Karlsruhe, Germany, pp.537-543, May 6-10, 2013.Stanton Wong1 and Jennifer Walter ”Deterministic Distributed Algorithm for Self-Reconfiguration of Modular Robots from Arbitrary to Straight Chain Configurations”, Proceedings of the 2013 IEEE International Conference on Robotics and Automation (ICRA), Karlsruhe, Germany, pp.537-543, May 6-10, 2013. Michael Rubenstein, Alejandro Cornejo, Radhika Nagpal, “Programmable self-assembly in a thousand-robot swarm”, SCIENCE, 2014, Vol. 345, Issue 6198, pp.795-799.Michael Rubenstein, Alejandro Cornejo, Radhika Nagpal, “Programmable self-assembly in a thousand-robot swarm”, SCIENCE, 2014, Vol. 345, Issue 6198, pp.795-799.

しかしながら、非特許文献１の手法では、流体力学的な特性をロボット動作に組み込む手法を用いて群ロボットの動作を制御しており、低い計算負荷での制御を可能にしている利点があるが、任意の形状の隊列形成をすることができるとは限らない。 However, in the method of Non-Patent Document 1, the operation of the group robot is controlled using a method of incorporating the hydrodynamic characteristics into the robot operation, and there is an advantage that enables control with a low calculation load. It is not always possible to form a formation of any shape.

また、非特許文献２の手法のように、マルコフ決定過程における動的計画法や強化学習を使用してこのような計画を行おうとすると、単体のロボットを使用する場合に比べて複数のロボットを使用する場合には、その計算に要する時間や計算機の記憶容量がロボットの数に対して指数関数的に増大してしまう。その主たる原因となるのが、探索計算のためのマルコフ状態空間内の状態数の莫大な増加である。非特許文献２では、検証された強化学習の手法では、ロボット数の増加に伴い、指数関数的に計算負荷が増加するという、マルコフ状態空間内の爆発問題への解決策は示されていない。 In addition, as in the method of Non-Patent Document 2, when trying to perform such a plan using dynamic programming or reinforcement learning in the Markov decision process, a plurality of robots are used compared to the case of using a single robot. When used, the time required for the calculation and the storage capacity of the computer increase exponentially with respect to the number of robots. The main cause is the enormous increase in the number of states in the Markov state space for search computation. Non-Patent Document 2 does not show a solution to the explosion problem in the Markov state space in which the verified reinforcement learning method increases the computational load exponentially as the number of robots increases.

また、非特許文献１，２の手法ともに、付加的な位置計測用の装備を必要とする。 Further, both the methods of Non-Patent Documents 1 and 2 require additional equipment for position measurement.

また、非特許文献３では、ロボットが接したままの状態を維持するという条件を考慮して、付加的な位置計測用の装備を必要としないが、その実現には障害物の存在を必要としており、動作計画に必要な計算量が、ロボットの台数の２乗に比例してしまい、ロボットの台数の増加とともに急激に上昇する。 In Non-Patent Document 3, no additional equipment for position measurement is required in consideration of the condition of maintaining the state in which the robot is in contact, but the presence of an obstacle is required for its realization. Therefore, the amount of calculation required for the operation plan is proportional to the square of the number of robots, and increases rapidly as the number of robots increases.

非特許文献４の手法においては、一度、線形隊列への変換をしなければならず、可能な隊列形成動作そのものへの制約が大きい。 In the method of Non-Patent Document 4, conversion to a linear formation must be performed once, and there is a great restriction on the possible formation operation itself.

非特許文献５の手法においては、開始隊列と目標隊列が共有する点がなければならないことや、障害物を考慮していないなどの問題点がある。 In the method of Non-Patent Document 5, there are problems that the starting row and the target row must share a point, and that obstacles are not considered.

このような現状に鑑みて、本発明では、多数のロボットの存在を考慮しつつも、計画計算に必要な計算時間や計算機の記憶容量を一台のロボットを扱うときと同様に少ないものに低減可能で、かつ、ロボット同士が接したままの状態を維持しつつ任意の開始位置における隊列形成状態から、他の任意の目標位置における隊列形成状態へ障害物のある環境にて変形動作を行うことを可能とする、ロボット協調隊列形成技術を提供することを目的とする。 In view of such a current situation, the present invention reduces the calculation time required for the planned calculation and the storage capacity of the computer to be as small as when handling a single robot, while considering the presence of a large number of robots. It is possible to perform a deformation operation in an environment where there is an obstacle from a formation formation state at an arbitrary start position to a formation formation state at any other target position while maintaining the state where the robots are in contact with each other. The purpose is to provide a robot coordination corps formation technology.

上記の課題を解決するために、本発明の一態様によれば、行動制御システムは、開始位置の集合に配置された複数の制御対象物を所定の入口位置を含む目標位置の集合に移動させるための行動制御を行う。行動制御システムは、第一方向に対して平行でない方向を第二方向とし、第一方向に対して反対の方向を第三方向とし、第二方向に対して反対の方向を第四方向とし、各開始位置及び各目標位置は、それぞれ第一方向〜第四方向の少なくとも何れかの方向において他の開始位置及び目標位置と隣接し、目標位置の集合及び開始位置の集合はそれぞれ一塊の任意の形状を成し、制御対象物は、当該制御対象物の２次元平面上の第一方向において隣接する第一位置、第二方向において隣接する第二位置、第三方向において隣接する第三位置、及び、第四方向において隣接する第四位置、第一位置に第二方向において隣接する第五位置、第二位置に第三方向において隣接する第六位置、第三位置に第四方向において隣接する第七位置、及び、第四位置に第一方向において隣接する第八位置に存在する他の制御対象物と通信するための通信手段を備え、制御対象物がその制御対象物の現在の位置sにおいて各行動aを取ったときの適切さを表す１個の価値関数に基づいて制御され、静止するか、または、二次元平面上の第一〜第四位置の何れかに移動するように制御されるものとし、価値関数が記憶される記憶部と、制御対象物の移動に伴って生じる、または、制御対象物の移動する方向と反対の方向に移動する仮想的な存在をボイドとし、価値関数を用いて、(G-1)制御対象物が目標位置の何れかに存在する場合、その制御対象物が、所定の入口位置から遠い他の目標位置に移動するように制御し、(G-2)制御対象物が目標位置の集合に含まれない位置に存在する場合、その制御対象物が、所定の入口位置に向かって移動するように制御し、(V-1)ボイドが目標位置の集合に含まれない位置または所定の入口位置に位置する場合、そのボイドが所定の入口位置から遠い目標位置の集合に含まれない位置に移動するように制御し、(V-2)ボイドが所定の入口位置を除いた目標位置の何れかに位置する場合、そのボイドが所定の入口位置に向かって移動するように制御する行動選択部を含む。行動選択部は、入口位置に近い先頭制御対象物を決め、(G-1)及び（G-2）の制御により移動させることでボイドを発生させる第一制御と、第一制御の結果、発生したボイドを(V-1)及び（V-2）の制御に従う限り移動させる第二制御とを実行する。 In order to solve the above problems, according to one aspect of the present invention, the behavior control system moves a plurality of control objects arranged in a set of start positions to a set of target positions including a predetermined entrance position. To control the behavior. The action control system has a direction that is not parallel to the first direction as a second direction, a direction opposite to the first direction as a third direction, a direction opposite to the second direction as a fourth direction, Each start position and each target position are adjacent to other start positions and target positions in at least one of the first to fourth directions, and the set of target positions and the set of start positions are each a set of arbitrary Forming a shape, the control object is a first position adjacent in the first direction on the two-dimensional plane of the control object, a second position adjacent in the second direction, a third position adjacent in the third direction, And a fourth position adjacent in the fourth direction, a fifth position adjacent to the first position in the second direction, a sixth position adjacent to the second position in the third direction, and adjacent to the third position in the fourth direction. 7th and 4th positions Communication means for communicating with another control object existing in the eighth position adjacent in the first direction, when the control object takes each action a at the current position s of the control object It is controlled based on one value function representing appropriateness, and is controlled to move to any one of the first to fourth positions on the two-dimensional plane, and the value function is stored. And a virtual existence that occurs with the movement of the control object or moves in the direction opposite to the movement direction of the control object as a void, and uses the value function (G-1 ) When the control object exists at any of the target positions, control is performed so that the control object moves to another target position far from the predetermined entrance position, and (G-2) the control object is the target position. If the target object is not included in the set of (V-1) When the void is located at a position not included in the set of target positions or at a predetermined entrance position, the void is located at a target position far from the predetermined entrance position. (V-2) If the void is located at any of the target positions excluding the predetermined entrance position, the void moves toward the predetermined entrance position. The action selection part which controls is included. The action selection unit determines the first control object close to the entrance position, and generates a void by moving it under the control of (G-1) and (G-2). The second control is performed to move the void as long as it follows the control of (V-1) and (V-2).

上記の課題を解決するために、本発明の他の態様によれば、行動制御方法は、開始位置の集合に配置された複数の制御対象物を所定の入口位置を含む目標位置の集合に移動させるための行動制御を行う。行動制御方法は、第一方向に対して平行でない方向を第二方向とし、第一方向に対して反対の方向を第三方向とし、第二方向に対して反対の方向を第四方向とし、各開始位置及び各目標位置は、それぞれ第一方向〜第四方向の少なくとも何れかの方向において他の開始位置及び目標位置と隣接し、目標位置の集合及び開始位置の集合はそれぞれ一塊の任意の形状を成し、制御対象物は、当該制御対象物の２次元平面上の第一方向において隣接する第一位置、第二方向において隣接する第二位置、第三方向において隣接する第三位置、及び、第四方向において隣接する第四位置、第一位置に第二方向において隣接する第五位置、第二位置に第三方向において隣接する第六位置、第三位置に第四方向において隣接する第七位置、及び、第四位置に第一方向において隣接する第八位置に存在する他の制御対象物と通信するための通信手段を備え、制御対象物がその制御対象物の現在の位置sにおいて各行動aを取ったときの適切さを表す１個の価値関数に基づいて制御され、静止するか、または、二次元平面上の第一〜第四位置の何れかに移動するように制御されるものとし、行動選択部が、制御対象物の移動に伴って生じる、または、制御対象物の移動する方向と反対の方向に移動する仮想的な存在をボイドとし、価値関数を用いて、(G-1)制御対象物が目標位置の何れかに存在する場合、その制御対象物が、所定の入口位置から遠い他の目標位置に移動するように制御し、(G-2)制御対象物が目標位置の集合に含まれない位置に存在する場合、その制御対象物が、所定の入口位置に向かって移動するように制御し、(V-1)ボイドが目標位置の集合に含まれない位置または所定の入口位置に位置する場合、そのボイドが所定の入口位置から遠い目標位置の集合に含まれない位置に移動するように制御し、(V-2)ボイドが所定の入口位置を除いた目標位置の何れかに位置する場合、そのボイドが所定の入口位置に向かって移動するように制御する行動選択ステップを含む。行動選択ステップは、入口位置に近い先頭制御対象物を決め、(G-1)及び（G-2）の制御により移動させることでボイドを発生させる第一制御と、第一制御の結果、発生したボイドを(V-1)及び（V-2）の制御に従う限り移動させる第二制御とを実行する。 In order to solve the above problem, according to another aspect of the present invention, a behavior control method moves a plurality of control objects arranged in a set of start positions to a set of target positions including a predetermined entrance position. To control the behavior. In the behavior control method, the direction that is not parallel to the first direction is the second direction, the direction opposite to the first direction is the third direction, the direction opposite to the second direction is the fourth direction, Each start position and each target position are adjacent to other start positions and target positions in at least one of the first to fourth directions, and the set of target positions and the set of start positions are each a set of arbitrary Forming a shape, the control object is a first position adjacent in the first direction on the two-dimensional plane of the control object, a second position adjacent in the second direction, a third position adjacent in the third direction, And a fourth position adjacent in the fourth direction, a fifth position adjacent to the first position in the second direction, a sixth position adjacent to the second position in the third direction, and adjacent to the third position in the fourth direction. 7th and 4th positions Appropriateness when the control object has each action a at the current position s of the control object with a communication means for communicating with another control object existing at the eighth position adjacent in one direction It is controlled on the basis of one value function that represents and is controlled to move to any one of the first to fourth positions on the two-dimensional plane. A virtual existence that occurs with the movement of the object or moves in the direction opposite to the direction in which the control object moves is defined as a void, and (G-1) the control object is moved to the target position using the value function. The control object is controlled to move to another target position far from the predetermined entrance position, and (G-2) a position where the control object is not included in the set of target positions. The control object moves toward a predetermined entrance position. (V-1) If the void is located at a position that is not included in the target position set or at a predetermined entrance position, the void is not included in the target position set that is far from the predetermined entrance position. (V-2) When the void is located at any of the target positions excluding the predetermined inlet position, the action selection is controlled so that the void moves toward the predetermined inlet position. Includes steps. In the action selection step, the first control object that determines the head control object close to the entrance position and moves by the control of (G-1) and (G-2), and the generation of the first control result The second control is performed to move the void as long as it follows the control of (V-1) and (V-2).

本発明に拠れば、詳しくは後述するが、一台のロボットに必要な分だけのマルコフ状態空間を用意し、それを用いて動的計画法を利用して各位置でのロボットの行動方策を計算し、その行動方策を利用することで、ロボットに任意の隊列形状と、任務環境内の任意の障害物形状に対応した、ロボット同士が接した状態を維持したうえでの多数ロボットのための隊列形成アルゴリズムを獲得することができる。すなわち、ロボット数に依存せずにロボット一台分の計画計算負荷での自己位置座標定義型隊列形成アルゴリズム獲得ができる。 According to the present invention, as will be described in detail later, a Markov state space as much as necessary for one robot is prepared, and using it, a robot's action policy at each position is determined using dynamic programming. By calculating and using its action strategy, it is possible for a large number of robots to maintain a state where the robots are in contact with each other, corresponding to an arbitrary formation shape of the robot and an arbitrary obstacle shape in the mission environment. A formation algorithm can be obtained. That is, the self-position coordinate definition formation formation algorithm can be obtained with the planned calculation load of one robot without depending on the number of robots.

多数のロボットが協調して開始位置における隊列形成状態から移動を行い、目標位置での隊列形成を行う任務を説明するための図。The figure for demonstrating the mission which many robots move from the formation formation state in a starting position in cooperation, and form formation at a target position. 伸縮型ロボットの移動を説明するための図。The figure for demonstrating the movement of an expansion-contraction type robot. 不変型ロボットの移動を説明するための図。The figure for demonstrating the movement of an invariant robot. 開始位置からのルートが二つ以上考えられる場合のQ関数の計算を説明するための図。The figure for demonstrating calculation of Q function when two or more routes from a starting position are considered. 多数の伸縮型ロボットが協調して開始位置における隊列形成状態から移動を行い、目標位置での隊列形成を行う任務を説明するための図。The figure for demonstrating the mission which a large number of expansion-contraction type robots move from the formation formation state in a start position in cooperation, and form formation in a target position. 多数の伸縮型ロボットが協調して開始位置における隊列形成状態から移動を行い、目標位置での隊列形成を行う任務を説明するための図。The figure for demonstrating the mission which a large number of expansion-contraction type robots move from the formation formation state in a start position in cooperation, and form formation in a target position. 多数の伸縮型ロボットが協調して開始位置における隊列形成状態から移動を行い、目標位置での隊列形成を行う任務を説明するための図。The figure for demonstrating the mission which a large number of expansion-contraction type robots move from the formation formation state in a start position in cooperation, and form formation in a target position. 多数の伸縮型ロボットが協調して開始位置における隊列形成状態から移動を行い、目標位置での隊列形成を行う任務を説明するための図。The figure for demonstrating the mission which a large number of expansion-contraction type robots move from the formation formation state in a start position in cooperation, and form formation in a target position. 多数の伸縮型ロボットが協調して開始位置における隊列形成状態から移動を行い、目標位置での隊列形成を行う任務を説明するための図。The figure for demonstrating the mission which a large number of expansion-contraction type robots move from the formation formation state in a start position in cooperation, and form formation in a target position. 多数のロボットが価値関数Q(s,a)を利用して隊列移動する方法を説明するための図。The figure for demonstrating the method of many robots moving in a formation using value function Q (s, a). 伸縮型ロボットのみを制御する場合の第一制御を説明するための処理フローの例を示す図。The figure which shows the example of the processing flow for demonstrating 1st control in the case of controlling only an expansion-contraction type robot. 伸縮型ロボットのみを制御する場合の第二制御を説明するための処理フローの例を示す図。The figure which shows the example of the processing flow for demonstrating 2nd control in the case of controlling only an expansion-contraction type robot. 多数の不変型ロボットが協調して開始位置における隊列形成状態から移動を行い、目標位置での隊列形成を行う任務を説明するための図。The figure for demonstrating the mission which many invariant type robots cooperate and move from the formation formation state in a starting position, and form a formation in a target position. 多数の不変型ロボットが協調して開始位置における隊列形成状態から移動を行い、目標位置での隊列形成を行う任務を説明するための図。The figure for demonstrating the mission which many invariant type robots cooperate and move from the formation formation state in a starting position, and form a formation in a target position. 多数の不変型ロボットが協調して開始位置における隊列形成状態から移動を行い、目標位置での隊列形成を行う任務を説明するための図。The figure for demonstrating the mission which many invariant type robots cooperate and move from the formation formation state in a starting position, and form a formation in a target position. 多数の不変型ロボットが協調して開始位置における隊列形成状態から移動を行い、目標位置での隊列形成を行う任務を説明するための図。The figure for demonstrating the mission which many invariant type robots cooperate and move from the formation formation state in a starting position, and form a formation in a target position. 不変型ロボットのみを制御する場合、及び、伸縮型ロボットと不変型ロボットとが混在するときに制御する場合の第一制御を説明するための処理フローの例を示す図。The figure which shows the example of the processing flow for demonstrating the 1st control in the case of controlling when controlling only an invariant robot, and when a telescopic robot and an invariant robot coexist. 不変型ロボットのみを制御する場合、及び、伸縮型ロボットと不変型ロボットとが混在するときに制御する場合の第二制御を説明するための処理フローの例を示す図。The figure which shows the example of the processing flow for demonstrating the 2nd control in the case of controlling when controlling only an invariant robot, and when a telescopic robot and an invariant robot coexist. 第一実施形態に係る行動制御システムの機能ブロック図。The functional block diagram of the action control system which concerns on 1st embodiment. 第一実施形態に係る行動制御システムの処理フローの例を示す図。The figure which shows the example of the processing flow of the action control system which concerns on 1st embodiment. 図２１Ａは不変型ロボット及び収縮状態の伸縮型ロボットの隣接する通信可能な位置を示す図、図２１Ｂは伸長状態の伸縮型ロボットの隣接する通信可能な位置を示す図。FIG. 21A is a diagram illustrating adjacent communicable positions of the invariable robot and the contracted telescopic robot, and FIG. 21B is a diagram illustrating adjacent communicable positions of the expanded telescopic robot. 各格子が菱形での場合の例を示す図。The figure which shows the example in case each lattice is a rhombus.

以下、本発明の実施形態について説明する。なお、以下の説明に用いる図面では、同じ機能を持つ構成部や同じ処理を行うステップには同一の符号を記し、重複説明を省略する。 Hereinafter, embodiments of the present invention will be described. In the drawings used for the following description, constituent parts having the same function and steps for performing the same process are denoted by the same reference numerals, and redundant description is omitted.

＜第一実施形態＞
［理論的背景］
まず、行動制御システム及び方法の理論的背景について説明する。以下、行動制御の対象である制御対象物が、ロボットである場合を例に挙げて説明するが、制御対象物は、制御の対象となり得るものであれば、ロボット以外であってもよい。 <First embodiment>
[Theoretical background]
First, the theoretical background of the behavior control system and method will be described. Hereinafter, a case where the control target that is the target of behavior control is a robot will be described as an example, but the control target may be other than the robot as long as it can be a target of control.

多数のロボットが協調して開始位置における隊列形成状態から移動を行い、目標位置での隊列形成を行う任務は、例えば図１に例示するような壁で区切られた部屋においての開始位置から目標位置まで複数のロボットの移動によって実現するものである。 A number of robots move from the formation state at the start position in cooperation with each other, and the task of forming the formation at the target position is, for example, from the start position in the room separated by walls as illustrated in FIG. This is realized by moving a plurality of robots.

図１において、Rが記載された格子はロボットが存在する位置を示し、Oが記載された格子は障害物が存在する位置を示す。また、太線の破線で囲まれた領域は開始位置の集合を示し、太線の一点鎖線で囲まれた領域は目標位置の集合Gで表される領域（以下、この領域のことを「目標隊列エリアG」ともいう）を示し、太線の実線で囲まれた領域は後述する目標隊列エリアGの入口位置Peを示す。このように、各開始位置及び各目標位置は、それぞれ上下左右方向の少なくとも何れかの方向において他の開始位置及び目標位置と隣接し、ロボットの開始位置及び目標位置での隊列形状はそれぞれ一塊の任意の形状である。 In FIG. 1, a grid in which R is described indicates a position where the robot exists, and a grid in which O is described indicates a position where an obstacle exists. An area surrounded by a thick broken line indicates a set of start positions, and an area surrounded by a thick dashed line is an area represented by a set G of target positions (hereinafter, this area is referred to as “target row area”). G ”), and an area surrounded by a thick solid line indicates an entrance position Pe of a target platoon area G to be described later. In this way, each start position and each target position are adjacent to other start positions and target positions in at least one of the vertical and horizontal directions, respectively, and the formation at the start position and the target position of the robot is a lump. Any shape.

任務を行うロボットは、p台（pは2以上の整数であり、例えばp≧50）であり、各ロボットは、二次元平面におけるX軸方向及びY軸方向のそれぞれに移動可能とする。すなわち、この例では、各ロボットは、図１の紙面に対して上下左右の四方向に移動可能とする。ただし、各ロボットは、少なくとも1つの他のロボットと隣接し、p台のロボットで１つの群れを成すものとする。ここで、あるロボットと他のロボットが隣接するとは、あるロボットと他のロボットとが、一辺もしくは一つの頂点を共有することを意味する。 The robots that perform the mission are p units (p is an integer of 2 or more, for example, p ≧ 50), and each robot can move in the X-axis direction and the Y-axis direction on the two-dimensional plane. That is, in this example, each robot can move in four directions, up, down, left, and right with respect to the paper surface of FIG. However, each robot is adjacent to at least one other robot, and p robots form one group. Here, that a certain robot and another robot are adjacent means that a certain robot and another robot share one side or one vertex.

なお、p個のロボットのうちq個（qを0以上p以下の整数の何れかとする）のロボットを伸縮型ロボットとし、p個のロボットのうち(p-q)個のロボットを不変型ロボットとする。つまり、q=0のとき不変型ロボットのみとなり、q=pのとき伸縮型ロボットのみとなり、qが1以上p未満の整数の何れかのとき、不変型ロボットと伸縮型ロボットとの混在となる。 Of the p robots, q robots (q is any integer between 0 and p) are telescopic robots, and (pq) robots of the p robots are invariant robots. . That is, when q = 0, only invariant robots are available, when q = p, only telescopic robots are available, and when q is an integer between 1 and less than p, immutable robots and telescopic robots are mixed. .

伸縮型ロボットは、例えば正方形型のロボットであり、図２に示すように、進行方向に平行な２辺の長さを、ロボットの一辺の長さ分だけ伸長させて後、伸長させた辺をもとの長さに収縮させることで、1セル分の移動を行うというものを想定する。言い換えると、伸縮型ロボットは、図２に示すように、１つの位置のみを占有する収縮状態と上下左右方向の何れかの方向において隣接する２つの位置を占有する伸長状態とを備える。ある位置のみを占有する収縮状態から、ある位置と他の位置とを占有する伸長状態に変形し、他の位置のみを占有する収縮状態に変形することで、ある位置から他の位置へ移動する。伸縮型ロボットは、１回の行動制御により、静止するか、または、収縮状態から伸長状態への変形若しくは伸長状態から収縮状態への変形を行う。 The telescopic robot is, for example, a square robot. As shown in FIG. 2, the length of two sides parallel to the traveling direction is extended by the length of one side of the robot, and the extended side is then extended. Assume that one cell is moved by contracting to the original length. In other words, as shown in FIG. 2, the telescopic robot includes a contracted state that occupies only one position and an extended state that occupies two adjacent positions in any of the up, down, left, and right directions. Move from one position to another by transforming from a contracted state that occupies only a certain position to an expanded state that occupies a certain position and another position, and deforming to a contracted state that occupies only another position . The telescopic robot is either stationary or deformed from the contracted state to the extended state or deformed from the extended state to the contracted state by one action control.

一方、不変型ロボットは、図３に示すように、変形せず、１つの位置のみを占有し、１回の行動制御により、静止するか、または、上下左右方向の隣接する位置の何れかに移動するように制御される。 On the other hand, as shown in FIG. 3, the invariant robot does not deform and occupies only one position and either remains stationary or moves to an adjacent position in the up / down / left / right directions by one action control. Controlled to move.

図１の各格子にはロボットは一台しか存在することができない。それぞれのロボットは、移動しようとする方向に障害物か他のロボットがある場合には、静止をするものと仮定する。ただし、他のロボットが移動した結果、その格子が空く場合には、その格子に移動できるものとする。 There can be only one robot in each grid in FIG. Each robot is assumed to be stationary if there are obstacles or other robots in the direction of movement. However, if the lattice becomes empty as a result of the movement of another robot, it can be moved to that lattice.

何れのロボットもロボットの周囲のマス(不変型ロボット及び伸縮型ロボットの収縮状態の場合、周囲のマスは8マス、伸縮型ロボットの伸長状態の場合、周囲のマスは10マス)のうち一つに他のロボットが存在している状態を維持しながら移動をするものとする。また、何れのロボットもロボットの周囲のマスに存在する他のロボットと通信できるものとする。この手法では１つのロボット自身が、その移動量を正確に測ることができるというメリットがあり、また、１つの辺や頂点を共有する隣り合うロボットとの相対的な位置を計測しあうことで、ロボットの群れ全体の中での各ロボットの位置も容易に知ることができる。例えば、複数のロボットを連結して、先頭のロボットから、伸長と収縮を伝播させていくことで、全体としての移動を行っていくことが可能である。 Each robot is one of the surrounding masses of the robot (8 masses when the invariant robot and the telescopic robot are contracted, and 10 masses when the telescopic robot is expanded). It is assumed that the robot moves while maintaining the state where another robot exists. Also, it is assumed that any robot can communicate with other robots existing in the mass around the robot. This method has the advantage that one robot itself can accurately measure the amount of movement, and by measuring the relative position of neighboring robots that share one side or vertex, The position of each robot in the entire group of robots can be easily known. For example, it is possible to move as a whole by connecting a plurality of robots and propagating expansion and contraction from the leading robot.

図１に示すように、ロボットの開始位置と目標位置での隊列形状は連続な任意形状である。言い換えると、各開始位置及び各目標位置は、それぞれ上下左右方向の少なくとも何れかの方向において他の開始位置及び目標位置と隣接し、目標位置全体及び開始位置全体がそれぞれ一塊の任意の形状である。 As shown in FIG. 1, the formation shape at the start position and the target position of the robot is a continuous arbitrary shape. In other words, each start position and each target position are adjacent to other start positions and target positions in at least one of the up, down, left, and right directions, respectively, and the entire target position and the entire start position are each in an arbitrary shape. .

それぞれのロボットi（iはロボット番号を表し、i=0,1,2,…,p-1とする）の初期位置を（Xr0[i],Yr0[i]）とし、目標位置を（Xre[i],Yre[i]）とするとき、初期位置に配置されたロボットが、目標位置まで移動するための行動計画を求める問題を考える。なお、目標位置（Xre[i],Yre[i]）の集合をGとする。 The initial position of each robot i (i represents the robot number, i = 0,1,2, ..., p-1) is (Xr0 [i], Yr0 [i]), and the target position is (Xre [i], Yre [i]), consider the problem of obtaining an action plan for the robot placed at the initial position to move to the target position. Note that G is a set of target positions (Xre [i], Yre [i]).

このような問題に対して単純にマルコフ状態遷移モデルを適用しようとする場合、マルコフ状態空間は、ロボットiの位置（Xr[i],Yr[i]）、ロボットiの行動a[i]によって構成される。各状態（ロボットの位置と行動）は離散値で表現される。部屋をX,Yの直交座標系からなる２次元平面で表すと、X軸、Y軸をそれぞれ離散化表現した値により各位置を表現する。つまり、図１のように部屋（２次元平面）は格子で区切られ、各格子が各位置に対応する。また、各格子において、障害物の「ある／なし」が予め設定されている。 When a Markov state transition model is simply applied to such a problem, the Markov state space depends on the position of robot i (Xr [i], Yr [i]) and the action a [i] of robot i. Composed. Each state (robot position and action) is represented by discrete values. When a room is represented by a two-dimensional plane composed of an orthogonal coordinate system of X and Y, each position is represented by a discrete representation of the X axis and the Y axis. That is, as shown in FIG. 1, the room (two-dimensional plane) is divided by a grid, and each grid corresponds to each position. In each grid, “present / none” of the obstacle is set in advance.

また、この例では、制御対象物である行動主体は部屋に配置されている各ロボットとなる。ロボットiの行動a[i]∈D[i]は、静止、上下左右方向への１格子分の移動の計５種類のうちの何れかを取る。すなわち、D[ｉ]∈{0,1,2,3,4}として、各行動は例えば以下のように定義される。
0: 静止
1: 二次元平面上で右方向に１格子だけ移動する
2: 二次元平面上で上方向に１格子だけ移動する
3: 二次元平面上で左方向に１格子だけ移動する
4: 二次元平面上で下方向に１格子だけ移動する In this example, the action subject that is the control target is each robot arranged in the room. The action a [i] ∈D [i] of the robot i takes one of a total of five types, that is, stationary and movement of one lattice in the vertical and horizontal directions. That is, each action is defined as follows, for example, with D [i] ε {0,1,2,3,4}.
0: stationary
1: Move one grid to the right on the 2D plane
2: Move one grid upward on a two-dimensional plane
3: Move one grid to the left on the 2D plane
4: Move down one grid on a two-dimensional plane

このような任務環境におけるマルコフ状態空間は、ロボット数×２の次元数の状態を持ち、かつ選択可能な行動数は、ロボットの行動（＝５通り）のロボット数乗だけ存在する。例えば、ロボット数が５０で、部屋の縦横方向の格子数がそれぞれ２０であるとすれば状態数は２０の100乗個にもなり、探索計算に要する資源の量は膨大なものとなる。さらにロボット数が１台増えるごとに、その状態数は400倍増加していくことになり、ロボット同士が接しているという拘束条件を取り入れても、根本的な計算量の削減は難しく、複数ロボットを使用する場合の大きな問題となっている。 The Markov state space in such a mission environment has a state of the number of dimensions of the number of robots × 2, and the number of selectable actions exists by the number of robot actions (= 5). For example, if the number of robots is 50 and the number of grids in the vertical and horizontal directions of the room is 20, the number of states becomes 20 to the 100th power, and the amount of resources required for the search calculation becomes enormous. Furthermore, as the number of robots increases by one, the number of states will increase by 400 times. Even if the constraint condition that the robots are in contact with each other is incorporated, it is difficult to reduce the fundamental amount of computation, and multiple robots It has become a big problem when using.

そこで、この実施形態では、このような状態空間の爆発をさけるために、学習に使用するマルコフ状態空間を、一台分のロボットの状態変数のみで構成することにする。すなわち、状態変数及び行動変数を以下のように定義する。 Therefore, in this embodiment, in order to avoid such an explosion of the state space, the Markov state space used for learning is configured only by the state variables of one robot. That is, state variables and behavior variables are defined as follows.

状態変数s=（Xr,Yr），行動変数a∈{0,1,2,3,4}
p台あるすべてのロボットは、この状態変数を引数とした１個の価値関数Q(s,a)を共有し、行動決定を行う。 State variable s = (Xr, Yr), action variable a∈ {0,1,2,3,4}
All the p robots share a single value function Q (s, a) with this state variable as an argument, and make an action decision.

なお、価値関数Q（s,ａ）の計算は、例えば、動的計画法を使用して、任務の事前に行うものとする。まず、各ロボットの目標位置をここに厳密に割り振ることをせず、目標位置全体の集合を、目標隊列エリアGと定義する。すなわち、
（Xre[i],Yre[i]）∈G …(3)
として、各ロボットはG内の全ての全ての位置を自由に目標位置とすることができるものとする。つまり、Gをちょうど流体を注ぐ器のようなものとして扱い、その入口位置をPeとする。すなわち、各ロボットは、Gの境界上に一点だけ設定された入口位置PeからGに入ることが可能であり、一度G内に入ったロボットは、Gを出る行動をとることができないものとする。本実施形態では、Gで構成される一塊の任意の形状（星形）の開始位置に近い１点を選んで入口位置Peとし、入口位置Peにロボットが移動した場合において、高報酬値が与えられ、その他は低報酬が与えられる。入口位置Peの位置はGの内部であって、Gで構成される一塊の任意の形状のいずれかの１点であればどこでも構わないが、ロボットの開始位置から近い位置を選ぶのがロボットの動作をスムーズにするうえで効果的である。 Note that the calculation of the value function Q (s, a) is performed in advance of the task using, for example, dynamic programming. First, the target position of each robot is not strictly allocated here, and the set of the entire target position is defined as a target platoon area G. That is,
(Xre [i], Yre [i]) ∈G… (3)
Assuming that each robot can freely set all the positions in G as target positions. In other words, G is treated just like a fluid pourer, and its inlet position is Pe. In other words, each robot can enter G from the entrance position Pe set at one point on the boundary of G, and once entered into G, the robot cannot take action to exit G . In the present embodiment, a point close to the start position of an arbitrary shape (star shape) composed of G is selected as the entrance position Pe, and a high reward value is given when the robot moves to the entrance position Pe. Others receive low rewards. The position of the entrance position Pe is inside G, and any point can be used as long as it is one point of an arbitrary shape of a lump formed by G. It is effective for smooth operation.

（Q関数の計算について）
Q関数を計算するために必要なマルコフ状態空間での状態遷移確率と報酬値の設定は、行動aにより、状態(位置)がsからs'に移動したとして、その遷移確率P及び報酬Rとすると、例えば、遷移確率P(s'|s,a)(状態sから行動aにより状態s'に遷移する確率)を以下のように設定する。
(1)sが障害物でなく、s'が行動aの行先のセルのとき、
P(s'|s,a)←1
(2)sが障害物でなく、s'が行動aの行先のセルでないとき、
P(s'|s,a)←0
(3)sが障害物のときで、s'＝sのとき、
P(s'|s,a)←1
(4)sが障害物のときで、s'＝sではないとき、
P(s'|s,a)←0
また、報酬R(s',s,a) (状態sから行動aにより状態s'に遷移したときの報酬)を以下のように設定する。
(1)If s'=Pe then R(s',s,a) ← r_plus
(状態s'と入口位置とが一致するとき、その報酬R(s',s,a)をr_plusとする。)
(2)Else If sがG内かつs'がG外 then R(s',s,a) ← -r_minus
(つまり、(1)の条件を満たさない場合であって、目標位置の集合内からその外に出ようとするとき、その報酬R(s',s,a)を-r_minusとする。)
(3)Else If sがG外かつs'がG内でかつs'=PeでないthenR(s',s,a)=-r_minus
（つまり、(1)及び(2)の条件を満たさない場合であって、入口位置Pe以外の位置から目標位置の集合内に入ろうとするとき、その報酬R(s',s,a)を-r_minusとする。）
(4)Else R(s',s,a)=0
ここで、r_minusとr_plusとはともに正値とし、r_minus > 10 × r_plusとするのが望ましい。なお、ロボットが障害物には入れず、また、入口位置Pe以外からGに入れず、G内から出られないという条件が記述できていれば、遷移確率と報酬の設定方法はこの限りではない。 (About Q function calculation)
The state transition probability and reward value setting in the Markov state space necessary for calculating the Q function are determined by assuming that the state (position) has moved from s to s' by action a. Then, for example, the transition probability P (s ′ | s, a) (probability of transition from state s to state s ′ by action a) is set as follows.
(1) When s is not an obstacle and s' is the destination cell of action a,
P (s' | s, a) ← 1
(2) When s is not an obstacle and s' is not the destination cell of action a,
P (s' | s, a) ← 0
(3) When s is an obstacle and s' = s,
P (s' | s, a) ← 1
(4) When s is an obstacle and not s' = s,
P (s' | s, a) ← 0
Also, reward R (s ′, s, a) (reward when transitioning from state s to state s ′ by action a) is set as follows.
(1) If s '= Pe then R (s', s, a) ← r_plus
(When the state s ′ and the entrance position match, the reward R (s ′, s, a) is set to r_plus.)
(2) Else If s is in G and s 'is out of G then R (s', s, a) ← -r_minus
(That is, when the condition of (1) is not satisfied and the user wants to go out of the set of target positions, the reward R (s ′, s, a) is set to −r_minus.)
(3) thenR (s', s, a) =-r_minus where Else If s is outside G and s' is within G and s' = Pe is not
(In other words, when the conditions of (1) and (2) are not satisfied and the user tries to enter the set of target positions from a position other than the entrance position Pe, the reward R (s ′, s, a) is -r_minus)
(4) Else R (s', s, a) = 0
Here, it is desirable that r_minus and r_plus are both positive values and r_minus> 10 × r_plus. Note that the transition probability and reward setting method is not limited to this as long as the conditions that the robot does not enter the obstacle, cannot enter the G other than the entrance position Pe, and cannot exit the G are described. .

このようにして、Q関数を、全ての位置において、制御対象物が所定の入口位置Peに移動した場合に高報酬値を得られるように学習する。 In this way, the Q function is learned so that a high reward value can be obtained when the control object moves to the predetermined entrance position Pe at all positions.

(開始位置からロボット群がたどるルートが二つ以上考えられる場合のQ関数の計算)
複数の開始位置から入口位置Peまでの最短経路を考えたとき、複数の開始位置の設定や、障害物の配置によっては、ロボット群のたどる最短経路が２つ以上になる可能性がある。例えば、図４の場合、ロボットＡにとっての最短経路はＲＡであるが、ロボットＢにとっての最短経路はＲＢである。この場合、１つのQ関数により全てのロボットの経路を決定すると、１つの群れを成さず、分裂する可能性がある。そうすると、p台のロボットで１つの群れを成すという前提が崩れ、自己位置座標を取得することが困難になる場合が生ずる。そこで、ロボット動作の開始前の動作計画において、以下のように、Q関数を学習することが有効である。
(1)まず、前述の方法により、全ての位置について動的計画法による探索計算を行い、各位置のQ関数の値を求める。
(2)続いて、開始位置の集合のみを対象として動的計画法による探索計算を行う。その際に、(1)の結果、計算されたQ関数の値において、開始位置の集合の中で、その位置での全行動の中でのQ値の最大値（Qmax_start(s)とする。）が、最大となる位置のQ関数の値は固定とし、その値の更新は行わないものとする。例えば、図４の場合、開始位置の集合に属する各位置におけるQmax_startの値の中で、ロボットＡの開始位置P_Aに対するQmax_startの値が最も大きい場合、位置P_Aに対するQ関数の値を固定とし、その値の更新は行わない。さらに、遷移確率Pとして、行動aによって、位置P_Aを経由せずに開始位置の外に出てしまう行動を禁止する。例えば、遷移確率P(s'|s,a)を以下のように設定する。
(A-1)sが障害物でなく、s'が行動aの行先のセルで、s'が開始位置の集合に含まれるとき、
P(s'|s,a)←1
(A-2)sが障害物でなく、s'が行動aの行先のセルで、s'が開始位置の集合に含まれない（開始隊列の外の）とき、
P(s'|s,a)←0
(A-3)sが障害物でなく、s'が行動aの行先のセルでなく、ｓ'が開始位置の集合に含まれるとき、
P(s'|s,a)←0
(A-4)ｓが障害物でなく、s'が行動aの行先のセルでなく、s'が開始隊列の集合に含まれないとき、
s'=sならば、P(s'|s,a)←1
そうでなければ、P(s'|s,a)←0
(A-5)sが障害物のときで、s'=sのとき、
P(s'|s,a)←1
(A-6)sが障害物のときで、s'=sではないとき、
P(s'|s,a)←0 (Calculation of Q function when two or more routes the robot group follows from the start position)
When considering the shortest path from a plurality of start positions to the entrance position Pe, there may be two or more shortest paths that the robot group follows depending on the setting of the plurality of start positions and the arrangement of obstacles. For example, in the case of FIG. 4, the shortest path for robot A is RA, but the shortest path for robot B is RB. In this case, if the paths of all the robots are determined by one Q function, there is a possibility that the robot does not form one group and splits. Then, the premise that a group of p robots forms one group breaks down, and it may be difficult to acquire self-position coordinates. Therefore, it is effective to learn the Q function as follows in the motion plan before starting the robot motion.
(1) First, search calculation by dynamic programming is performed for all positions by the method described above, and the value of the Q function at each position is obtained.
(2) Next, search calculation by dynamic programming is performed only on the set of start positions. At this time, in the value of the calculated Q function as a result of (1), the maximum Q value (Qmax_start (s)) of all actions at that position in the set of start positions. ) However, the value of the Q function at the maximum position is fixed, and the value is not updated. For example, in the case of FIG. 4, in the values of Qmax_start at each position belonging to the set start position, if the value of Qmax_start for the start position P _A of the robot A is the largest, and a fixed value of Q function for the position P _A The value is not updated. Furthermore, the transition probability P, the action a, prohibits action escapes out of the starting position without passing through the position P _A. For example, the transition probability P (s ′ | s, a) is set as follows.
(A-1) When s is not an obstacle, s 'is the destination cell of action a, and s' is included in the set of start positions,
P (s' | s, a) ← 1
(A-2) When s is not an obstacle, s 'is the destination cell of action a, and s' is not included in the set of starting positions (outside the starting line),
P (s' | s, a) ← 0
(A-3) When s is not an obstacle, s 'is not the destination cell of action a, and s' is included in the set of start positions,
P (s' | s, a) ← 0
(A-4) When s is not an obstacle, s 'is not the destination cell of action a, and s' is not included in the set of starting lines
If s '= s, P (s' | s, a) ← 1
Otherwise, P (s' | s, a) ← 0
(A-5) When s is an obstacle and s' = s,
P (s' | s, a) ← 1
(A-6) When s is an obstacle and not s' = s,
P (s' | s, a) ← 0

複数の開始位置の設定や障害物の配置によっては、動的計算法によって計算される行動方策に従って、拘束なしに自由に行動するロボットの群れが二つ以上の群れに分かれてしまうような場合がある。そのような場合は、隣接するロボット同士の接続を保ちたい場合に不都合であり、それを解消するために以上に述べたような探索計算を行う。すなわち、開始位置の集合の内で最もQmax_startの値が高い位置を全てのロボットが目指すように行動方策を決定し、開始位置の外にロボットが出るのは、開始位置内で最もQmax_startの値の高い位置を経由してのみとなるように設定する。言い換えると、最終的に得られる価値関数Qは、全ての開始位置の中で(1)で学習したQmax_startの値が最も大きい一点の位置を目指す行動の価値関数の値が高くなるように（さらに、言い換えると、その位置を経由して、ロボットが、開始隊列の内から開始位置の外に移動するように、）学習して得られる。 Depending on the setting of multiple start positions and the placement of obstacles, there may be cases in which a group of robots that freely move without restraints is divided into two or more groups according to the action policy calculated by the dynamic calculation method. is there. In such a case, it is inconvenient when it is desired to maintain the connection between adjacent robots, and the search calculation as described above is performed in order to solve the problem. That is, the action policy is determined so that all robots aim at the position where the Qmax_start value is the highest in the set of start positions, and the robot comes out of the start position because the Qmax_start value is the most within the start position. Set to be only via high positions. In other words, the finally obtained value function Q is such that the value function of the action aiming at the position of the point with the largest Qmax_start value learned in (1) among all the start positions becomes higher (and (In other words, the robot moves through the position so that the robot moves out of the starting position from the starting position).

（行動選択の方法と２つのポリシーについて）
通常は、各状態sにあるロボットが、価値関数Q（s,a）を使用して行動選択を行う場合は、各状態sにおいて、Q関数の値を最大にする行動aを選択する。しかし、以上に提示した条件設定で、Q関数を計算した場合、目標隊列エリアGの外にいるロボットも目標隊列エリアGの内にいるロボットもともに、入口位置Peを目指して移動することになる。そこで、以下のルールにより行動選択を行うものとする。この行動方策を行動方策Policy_Gとする。 (About action selection methods and two policies)
Normally, when a robot in each state s selects an action using the value function Q (s, a), the action a that maximizes the value of the Q function is selected in each state s. However, if the Q function is calculated with the condition settings presented above, both the robot outside the target platoon area G and the robot inside the target platoon area G will move toward the entrance position Pe. . Therefore, action selection is performed according to the following rules. This action policy is defined as action policy Policy_G.

（行動方策Policy_G）
(1)ｓがG内のとき、ロボットは、他のロボットの位置や障害物位置に移動しない行動であって、かつ、G外に出ていかない行動であり、かつ、Q(s,a)<Q(s,0)を満たす行動aの中でQ(s,a)を最小化する行動を選択する。そのような行動がない場合は行動a=0（静止）を選択する。
(2)ｓがG外のとき、ロボットは、、他のロボットの位置や障害物位置に移動しない行動であって、かつ、Q(s,a)>Q(s,0)を満たす行動aの中でQ(s,a)を最大化する行動を選択する。そのような行動がない場合は行動a=0（静止）を選択する。 (Action Policy Policy_G)
(1) When s is in G, the robot is an action that does not move to the position of other robots or obstacles, and does not go out of G, and Q (s, a) <The action that minimizes Q (s, a) is selected from actions a that satisfy Q (s, 0). If there is no such action, action a = 0 (still) is selected.
(2) When s is out of G, the robot does not move to the position of other robots or obstacles and satisfies Q (s, a)> Q (s, 0) a Select the action that maximizes Q (s, a). If there is no such action, action a = 0 (still) is selected.

このようにすることで、(1)により、一度、G内に入ったロボットは入口位置Peからなるべく離れた（遠い他の）G内の位置に移動する。一方、(2)により、G外のロボットは、入口位置Peに向かって移動し、入口位置PeよりG内に入ろうとする行動をとるようにできる。また、他のロボットや障害物との衝突も避けることができる。 By doing in this way, according to (1), once the robot enters G, it moves to a position in G that is as far away as possible from the entrance position Pe. On the other hand, according to (2), the robot outside G moves toward the entrance position Pe and can take an action to enter the G from the entrance position Pe. Also, collisions with other robots and obstacles can be avoided.

（行動方策Policy_V）
また、ボイドの位置を制御するという概念を使用する。ここでいうボイドとは、あるロボットが別の位置に移動した後に、元いた位置にできる空隙のことである。なお、伸縮型ロボットがある位置のみを占有する収縮状態から伸長し、もともと占有していた位置と他の位置とを占有する伸長状態に変形したときに、もともと占有していた位置にボイドが生じる、または、もともと占有していた位置にボイドが移動するものとする（図２参照）。一方、不変型ロボットがある位置から他の位置へ移動するときに、ある位置にボイドが生じる、または、移動するものとする（図３参照）。このボイドの動きを制御することで、１つの群れを維持する。後に説明するボイドの制御のための行動方策Policy_Vは、
(3)ｓがG外であるか、入口位置Peであるとき、ボイドは、移動先が他のロボットの位置に重なる位置となる行動であって、かつ、G内に入っていかない行動であって、かつ、Q(s,a)<Q(s,0)を満たす行動aの中でQ(s,a)を最小化する行動を選択する。そのような行動がない場合は行動a=0（静止）を選択する。このような選択をすることで、ボイドが入口位置Peから遠い目標位置の集合に含まれない位置に移動するように制御される。
(4)ｓがG内でかつ入口位置Peでないとき、ボイドは、移動先が他のロボットの位置に重なる位置となる行動であって、かつ、Q(s,a)>Q(s,0)を満たす行動aの中でQ(s,a)を最大化する行動を選択する。そのような行動がない場合は行動a=0（静止）を選択する。このような選択をすることで、ボイドが入口位置Peに向かって移動するように制御される。 (Action Policy Policy_V)
The concept of controlling the position of the void is used. The term “void” as used herein refers to a gap that can be made to the original position after a certain robot moves to another position. In addition, when the telescopic robot expands from a contracted state that occupies only a certain position and deforms to an expanded state that occupies the originally occupied position and another position, a void is generated at the originally occupied position. Alternatively, it is assumed that the void moves to the position originally occupied (see FIG. 2). On the other hand, when the invariant robot moves from one position to another, a void is generated or moved at a certain position (see FIG. 3). One flock is maintained by controlling the movement of this void. The action policy Policy_V for void control, which will be explained later,
(3) When s is outside G or at the entrance position Pe, the void is an action in which the moving destination overlaps with the position of another robot and does not enter the G. In addition, an action that minimizes Q (s, a) is selected from actions a that satisfy Q (s, a) <Q (s, 0). If there is no such action, action a = 0 (still) is selected. By making such a selection, the void is controlled to move to a position not included in the set of target positions far from the entrance position Pe.
(4) When s is within G and not at the entrance position Pe, the void is an action whose destination overlaps the position of another robot, and Q (s, a)> Q (s, 0 The action that maximizes Q (s, a) is selected from the actions a that satisfy). If there is no such action, action a = 0 (still) is selected. By making such a selection, the void is controlled to move toward the entrance position Pe.

(拘束条件)
本実施形態において、ロボットは各々のロボット同士の相対位置関係から、各々の絶対位置を計測することができ、また隣の位置に他のロボットが存在しているか否か、障害物があるか否か、そして、自身が目標位置上にいるかどうかを知ることができるものとする。これを実現するための拘束条件は制御対象となるロボットの構成により異なるため、以下、それぞれ説明する。 (Restraint condition)
In this embodiment, the robot can measure the absolute position of each robot based on the relative positional relationship between the robots, and whether there is another robot at an adjacent position or whether there is an obstacle. And whether it is on the target position. Since the constraint conditions for realizing this vary depending on the configuration of the robot to be controlled, each will be described below.

(伸縮型ロボットのみの制御する場合の拘束条件)
伸縮型ロボットのみの制御する場合の位置関係と移動に関する拘束条件は、各ロボットの周囲の格子(伸長状態の場合、周囲の10マス、収縮状態の場合周囲の8マス)のいずれかに他のロボットが存在し、ロボット全体で一つの集合をなすという拘束条件を満たさなければならないことである。この拘束条件を維持しつつ目標位置での隊列形成を行う動作の例を図５〜図９に示す。以下、この条件を維持しながらの、価値関数Q(s,a)を利用した隊列移動の方法について、図１０〜図１２を用いて説明する。 (Restriction condition when controlling only telescopic robot)
When controlling only a telescopic robot, the positional relationship and the constraint conditions related to movement are as follows: one of the grids around each robot (10 squares in the stretched state, 8 squares in the contracted state) The robot must exist and the constraint condition that the entire robot forms one set must be satisfied. Examples of operations for forming a formation at the target position while maintaining this constraint condition are shown in FIGS. Hereinafter, a method of platoon movement using the value function Q (s, a) while maintaining this condition will be described with reference to FIGS.

全てのロボットを開始位置の集合に含まれる位置(開始隊列位置ともいう)に置き、先頭移動フラグflg←0とする（ｓ１）。 All robots are placed at positions (also referred to as start row positions) included in the set of start positions, and the head movement flag flg ← 0 is set (s1).

先頭移動フラグflgの値を判定する（ｓ３）。先頭移動フラグは、入口位置に近いロボット（以下「先頭ロボット」ともいう）を移動させるか否かを表す情報であり、flg=0のとき、先頭ロボットを移動させる制御（以下、この処理を「第一制御」ともいう）（ｓ５）、を行う。flg=1のとき、先頭ロボット以外のロボットを移動させる制御（以下、この処理を二「第二制御」ともいう）（ｓ６）を行う。なお、第一制御は先頭ロボットを決め、行動方策Policy_Gにより移動させることでボイドを発生させる制御ともいえ、第二制御は第一制御の結果、発生したボイドを行動方策Policy_Vに従う限り移動させる制御とも言える。 The value of the head movement flag flg is determined (s3). The head movement flag is information indicating whether or not the robot close to the entrance position (hereinafter also referred to as “head robot”) is to be moved. When flg = 0, control for moving the head robot (hereinafter, this process is referred to as “ (Also referred to as “first control”) (s5). When flg = 1, control (hereinafter, this process is also referred to as “second“ second control ”) (s6) for moving a robot other than the head robot is performed. Note that the first control is a control that generates a void by deciding the head robot and moving it according to the action policy Policy_G, and the second control is a control that moves the void generated as a result of the first control as long as it follows the action policy Policy_V. I can say that.

第一制御の詳細を説明する。第一制御では、まず、ロボットの順位付け、先頭ロボットの選択、ボイドの発生（設定）、行動方針Policy_Gにより先頭ロボットの行動選択を行う（ｓ５１）。 Details of the first control will be described. In the first control, first, robot ranking, selection of the leading robot, generation of voids (setting), and behavior selection of the leading robot are performed based on the behavior policy Policy_G (s51).

例えばｓ５１では、以下のようにしてロボットの順位付けを行う。
(1)G内の各ロボットiについて、ロボットiの位置sにおける全行動aの中でのQ(s,a)の最小値Qmin(i)を計算する。
(2)G内の各ロボットiについて、Qmin(i)の値が小さい順に順位付けを行う。
(3)G外の各ロボットiについて、ロボットiの位置sにおける全行動aの中でのQ(s,a)の最大値Qmax(i)を計算する。
(4)G外の各ロボットiについて、Qmax(i)の値が大きい順に順位付けを行う。
(5)G外の各ロボットiの順位値にG内にあるロボット数を加算する。 For example, in s51, the robots are ranked as follows.
(1) For each robot i in G, the minimum value Qmin (i) of Q (s, a) in all actions a at the position s of the robot i is calculated.
(2) For each robot i in G, ranking is performed in ascending order of Qmin (i).
(3) For each robot i outside G, the maximum value Qmax (i) of Q (s, a) in all actions a at the position s of the robot i is calculated.
(4) For each robot i outside G, ranking is performed in descending order of Qmax (i).
(5) Add the number of robots in G to the ranking value of each robot i outside G.

次に、例えば以下のようにして先頭ロボットを選択する。行動方策Policy_Gにより、a=0（静止）以外の行動を選択するロボットのうち、上述の順位付けで求めた順位の最も高いロボットを選択し、i_selectとする。 Next, for example, the top robot is selected as follows. Of the robots that select an action other than a = 0 (stationary) by the action policy Policy_G, the robot with the highest rank obtained in the above ranking is selected and is set as i_select.

次に、例えば以下のようにしてボイドの発生（設定）を行う。ロボットi_selectの位置を現在のボイドの位置（void_x,void_y）として設定する。 Next, for example, voids are generated (set) as follows. Set the position of robot i_select as the current void position (void_x, void_y).

その後、行動方策Policy_Gによりロボットi_selectの行動選択を行う（ｓ５１ここまで）。 Thereafter, the action selection of the robot i_select is performed by the action policy Policy_G (s51 so far).

第一制御の続いての処理では、先頭移動フラグflgを更新し、flg←1とし、行動方策Policy_Gにより選択された行動に基づき選択したロボットi_selectを移動させる（ｓ５６）。ロボットi_selectを収縮状態から伸長状態に変形させる。 In the process subsequent to the first control, the head movement flag flg is updated to flg ← 1, and the robot i_select selected based on the action selected by the action policy Policy_G is moved (s56). The robot i_select is deformed from the contracted state to the extended state.

次に、第二制御（ｓ６）について説明する。 Next, the second control (s6) will be described.

第二制御では、ロボットi_selectを伸長状態から収縮状態に変形させると同時に、現在のボイド位置に後続のロボットを伸長移動させる。そのために、以下の処理ｓ６１〜ｓ６４を行う。現在のボイド位置（void_x,void_y）をｓとしたとき、行動方策Policy_Vによりボイドの行動を選択する（ｓ６１）。選択された行動a_voidが移動する（a_void≠0）か否かを判定する（ｓ６２）。a_void≠0の場合、行動方策Policy_Vによりボイドを移動させ、移動先の位置にあるロボットを選択し、そのロボットをロボットi_select2とする。ロボットi_select2の行動を、行動a_voidとは逆方向に移動するものと決定する（ｓ６３）。ロボットi_select2を収縮状態から伸長状態に変形させると同時にロボットi_selectを同時に収縮させる。続いて、i_selectをi_select2に置き換える（i_select←i_select2）。a_void=0の場合(ボイドが静止する場合)、flg←0とする（ｓ６４）。 In the second control, the robot i_select is deformed from the extended state to the contracted state, and at the same time, the subsequent robot is extended to the current void position. For this purpose, the following processes s61 to s64 are performed. When the current void position (void_x, void_y) is s, the action of the void is selected by the action policy Policy_V (s61). It is determined whether or not the selected action a_void moves (a_void ≠ 0) (s62). When a_void ≠ 0, the void is moved by the action policy Policy_V, the robot at the position of the movement destination is selected, and the robot is set as the robot i_select2. The action of the robot i_select2 is determined to move in the direction opposite to the action a_void (s63). The robot i_select2 is deformed from the contracted state to the extended state, and at the same time, the robot i_select is contracted simultaneously. Subsequently, i_select is replaced with i_select2 (i_select ← i_select2). When a_void = 0 (when the void is stationary), flg ← 0 is set (s64).

移動後の位置に基づき、全てのロボットの位置の情報を更新する（ｓ８）。 Based on the moved position, the information on the positions of all the robots is updated (s8).

全てのロボットがG内に入ったか否かを判定し（ｓ９）、入った場合には行動制御処理を終了する。入っていないロボットがある場合には、ｓ３〜ｓ９の処理を繰り返す。なお、全てのロボットがG内に入った状態とは、伸長状態のロボットの一部がG内に入っている状態ではなく、収縮状態のロボットがG内に入っている状態を意味する。 It is determined whether or not all the robots have entered G (s9). If entered, the behavior control process is terminated. If there is a robot that has not entered, the processing of s3 to s9 is repeated. The state where all the robots are in G means a state in which a part of the extended robot is not in G and a contracted robot is in G.

以上の方法により実現される動作を説明する。まず開始隊列形態にあるロボットのうち、もっとも入口位置Peに近いロボットが隊列の中から選択され、入口位置Peに向かって１ステップ移動する（第一制御）。そのあとの時間ステップにおいては、前のステップでできたボイドの位置を埋めるように後続のロボットが一ステップずつ移動していく。それをボイドがこれ以上Q関数の値の低い位置に移動不可能になるまで繰り返す（第二制御）。ボイドが移動不可能になったら、また、もっとも入口位置Peに近いロボットが隊列の中から選択される（第一制御）。第一制御と第二制御を繰り返していくうちに、いつしか、先頭のロボットが入口位置Peに到達して、G内に入る。そのあとは、今度は先頭のロボットは、G内においてもっともQ関数の値の低い位置を目指して移動を続け、後続のロボットは引き続き、先頭のロボットの作ったボイドを埋めるように移動していく。 The operation realized by the above method will be described. First, among the robots in the starting formation form, the robot closest to the entrance position Pe is selected from the formation and moved one step toward the entry position Pe (first control). In subsequent time steps, subsequent robots move step by step so as to fill in the void positions created in the previous step. This is repeated until the void cannot move to a position where the Q function value is lower than this (second control). When the void becomes impossible to move, the robot closest to the entrance position Pe is selected from the platoon (first control). As the first control and the second control are repeated, the leading robot reaches the entrance position Pe and enters G. After that, the leading robot will continue to move toward the position with the lowest Q function value in G, and the following robot will continue to move to fill the void created by the leading robot. .

以上のような動作で、ロボットの群れが目標隊列に近づいていく。 With the above operation, the group of robots approaches the target line.

(不変型ロボットのみの制御する場合の拘束条件)
つづいて、不変型ロボットのみの制御する場合の拘束条件について説明する。この場合、ロボットは、自分が移動する方向と逆方向の向きで接している他のロボットをできるだけ多く引き連れて移動を行う。 (Restriction conditions when controlling only invariant robots)
Next, a constraint condition in the case of controlling only the invariant robot will be described. In this case, the robot moves as many other robots as possible in contact with the robot in the direction opposite to the direction in which the robot moves.

この場合は、各ロボットの周囲の８つの格子のいずれかに他のロボットが存在し、ロボット全体で一つの集合をなすという拘束条件を満たさなければならない。この拘束条件を維持しつつ目標位置での隊列形成を行う動作の例を図１３〜図１６に示す。以下、この条件を維持しながらの、価値関数Q(s,a)を利用した隊列移動の方法について、伸縮型ロボットのみの制御する場合と、第一制御及び第二制御の処理内容が主に異なるため、この部分を中心に図１７、図１８を用いて説明する。 In this case, it is necessary to satisfy the constraint that another robot exists in one of the eight grids around each robot, and that the entire robot forms one set. Examples of operations for forming a formation at the target position while maintaining this constraint condition are shown in FIGS. In the following, regarding the method of platoon movement using the value function Q (s, a) while maintaining this condition, the processing contents of the first control and the second control are mainly the case of controlling only the telescopic robot. Since this is different, this portion will be mainly described with reference to FIGS.

第一制御では、まず、ロボットの順位付け、先頭ロボットの選択、ボイドの発生（設定）、行動方針Policy_Gにより先頭ロボットの行動選択を行う（ｓ５１）。このｓ５１に関しては伸縮型ロボットのみの制御する場合と同様である。ただし不変型ロボットを使用するこの場合においては以下の処理ｓ５１１をｓ５１に続いて行う。 In the first control, first, robot ranking, selection of the leading robot, generation of voids (setting), and behavior selection of the leading robot are performed based on the behavior policy Policy_G (s51). This s51 is the same as the case where only the telescopic robot is controlled. However, in this case where an invariant robot is used, the following processing s511 is performed following s51.

変数Ir←0とし、第一制御にて一度に動かすロボットの番号を格納する配列をi_select_moveとする。ｓ５１で選択されたロボットi_selectをi_select_move[Ir]とし、ロボットi_select_move[Ir]の選択する行動をロボットi_selectの選択した行動と同じとし、a_selectとする（ｓ５１１）。 The variable Ir ← 0 is set, and an array for storing the numbers of robots to be moved at a time in the first control is i_select_move. The robot i_select selected in s51 is i_select_move [Ir], the action selected by the robot i_select_move [Ir] is the same as the action selected by the robot i_select, and is a_select (s511).

続いて、現在のボイド位置（void_x,void_y）をｓとしたときの行動方策Policy_Vによりボイドの行動を選択する（ｓ５３）。選択された行動a_voidが、行動a_selectとはちょうど逆向きの行動であるかどうかの判定を行う（ｓ５４）。 Subsequently, the action of the void is selected by the action policy Policy_V when the current void position (void_x, void_y) is s (s53). It is determined whether or not the selected action a_void is an action opposite to the action a_select (s54).

逆向きの行動以外の場合、先頭移動フラグflgを更新し、flg←1とし、ロボットi_select_move[0]〜i_select_move[Ir]を行動a_selectの方向に移動させる（ｓ５６）。 If the action is not an action in the reverse direction, the head movement flag flg is updated to flg ← 1, and the robots i_select_move [0] to i_select_move [Ir] are moved in the direction of the action a_select (s56).

逆向きの行動の場合、変数Irをインクリメントし、行動a_voidによってボイドを移動させる。さらに、移動先のボイドの位置にあるロボットを選択し、そのロボットをi_select_move[Ir]とする。ロボットi_select_move[Ir]の選択すべき行動を、ｓ５１、ｓ５１１で求めた行動a_selectと同じとする。ｓ５３に戻る（ｓ５５）。 In the case of a reverse action, the variable Ir is incremented and the void is moved by the action a_void. Further, the robot at the position of the destination void is selected, and the robot is set as i_select_move [Ir]. The behavior to be selected by the robot i_select_move [Ir] is the same as the behavior a_select obtained in s51 and s511. The process returns to s53 (s55).

ｓ５４において逆向きの行動以外になるまで、ｓ５３〜ｓ５５を繰り返す。 Steps s53 to s55 are repeated until an action other than a reverse action is performed in s54.

第二制御では、変数Ir←0とし、現在のボイド位置（void_x,void_y）をsとしたときの行動方策Policy_Vにより行動a_voidを選択する（ｓ６１）。選択された行動a_voidが移動する（a_void≠0）か否かを判定する（ｓ６２）。a_void=0の場合(ボイドが静止する場合)、先頭移動フラグflgを更新し、flg←0とし（ｓ６４）、第二制御を終了する。a_void≠0の場合、行動方策Policy_Vによりボイドを移動させ、移動先の位置にあるロボットを選択し、そのロボットをロボットi_select_move[Ir]とする。その後、ロボットi_select_move[Ir]のとるべき行動を、行動a_voidとは逆方向の行動とする（ｓ６３）。このときのロボットi_select_move[Ir]の移動方向をa_selectとする。 In the second control, the action a_void is selected by the action policy Policy_V when the variable Ir ← 0 and the current void position (void_x, void_y) is s (s61). It is determined whether or not the selected action a_void moves (a_void ≠ 0) (s62). When a_void = 0 (when the void is stationary), the head movement flag flg is updated, flg ← 0 is set (s64), and the second control is terminated. If a_void ≠ 0, the void is moved by the action policy Policy_V, the robot at the position of the movement destination is selected, and the robot is set as robot i_select_move [Ir]. Thereafter, the action to be taken by the robot i_select_move [Ir] is set to an action opposite to the action a_void (s63). The moving direction of the robot i_select_move [Ir] at this time is a_select.

続いて、移動後の現在のボイド位置（void_x,void_y）をｓとしたときの行動方策Policy_Vにより行動a_voidを選択する（ｓ６６）。選択された行動a_voidが、行動a_selectとはちょうど逆向きの行動であるかどうかの判定を行う（ｓ６７）。逆向きの行動以外の場合、ｓ６３で選択された行動a_selectに基づき選択したロボットi_select_move[0]〜i_select_move[Ir]を移動させる。（ｓ６９）。逆向きの行動の場合、変数Irをインクリメントし、行動a_voidによってボイドを移動させる。さらに、移動先のボイドの位置にあるロボットを選択し、そのロボットをi_select_move[Ir]とする。ロボットi_select_move[Ir]の選択すべき行動を、ｓ６３で求めた行動a_selectと同じとする。ｓ６６に戻る（ｓ６８）。 Subsequently, the action a_void is selected by the action policy Policy_V when the current void position (void_x, void_y) after the movement is s (s66). It is determined whether or not the selected action a_void is an action opposite to the action a_select (s67). If the action is not in the reverse direction, the robot i_select_move [0] to i_select_move [Ir] selected based on the action a_select selected in s63 is moved. (S69). In the case of a reverse action, the variable Ir is incremented and the void is moved by the action a_void. Further, the robot at the position of the destination void is selected, and the robot is set as i_select_move [Ir]. The action to be selected by the robot i_select_move [Ir] is the same as the action a_select obtained in s63. The process returns to s66 (s68).

Ｓ６７において逆向きの行動以外になるまで、ｓ６６〜ｓ６８を繰り返す。 S66-s68 are repeated until it becomes other than a reverse action in S67.

以上のプロセスで、ある先行する不変型ロボットの移動方向と逆方向において隣接する後続する不変型ロボットが、その先行する不変型ロボットの移動によって生じたボイドが行動方策Policy_Vの制御に従って移動する場合の移動方向と逆方向に移動したときに、その移動方向が、先行する不変型ロボットの移動方向と同じ限りは、同時に移動するように制御している。具体的には、まず開始隊列形態にあるロボットのうち、もっとも入口位置Peに近いロボットが隊列の中から選択され、入口位置Peに向かうための行動が選択される。そして、そのロボットの移動方向と逆側にて接しているロボットを可能な限り（ｓ５３にて行動方策Policy_Vにより選択されたボイドの行動a_voidとｓ５１１にて行動方策Policy_Gにより選択されたロボットの行動a_selectとが逆向きの移動である限り）引き連れて1ステップ移動する（第一制御）。そのあと以降の時間ステップにおいては、前のステップでできたボイドの位置を埋めるように同じ方向に動ける(ｓ６３、ｓ６６にて行動方策Policy_Vにより選択された行動a_voidの移動方向が同じ分だけ)後続のロボットが、同時に一ステップずつ移動していく。ボイドがこれ以上Q関数の値の低い位置に移動不可能になるまで処理を繰り返す（第二制御）。ボイドが移動不可能になったら、また、もっとも入口位置Peに近いロボットを隊列の中から選択する（第一制御）。第一制御と第二制御とを繰り返していくうちに、いつしか、先頭のロボットが入口位置Peに到達して、G内に入る。そのあとは、今度は先頭のロボットは、G内においてもっともQ関数の値の低い位置を目指して後続ロボットを引き連れて移動を続け、後続のロボットは引き続き、先頭のロボットの作ったボイドを埋めるように後続ロボットを引き連れて移動していく。 In the above process, when a subsequent invariant robot adjacent in the direction opposite to the movement direction of a certain previous invariant robot moves a void generated by the movement of the preceding invariant robot according to the control of the action policy Policy_V When moving in the direction opposite to the moving direction, as long as the moving direction is the same as the moving direction of the preceding invariant robot, it is controlled to move simultaneously. Specifically, first, the robot closest to the entrance position Pe is selected from the formation among the robots in the starting formation form, and the action to go to the entrance position Pe is selected. As far as possible, the robot abutting on the opposite side of the moving direction of the robot (the action a_void of the void selected by the action policy Policy_V in s53 and the action a_select of the robot selected by the action policy Policy_G in s511) (As long as and are the movements in the opposite direction), it moves one step (first control). In subsequent time steps, it can move in the same direction so as to fill the position of the void created in the previous step (the movement direction of action a_void selected by action policy Policy_V in s63 and s66 is the same) The robot moves one step at a time. The process is repeated until the void cannot move to a position where the Q function value is lower than this (second control). When the void becomes impossible to move, the robot closest to the entrance position Pe is selected from the formation (first control). As the first control and the second control are repeated, the leading robot reaches the entrance position Pe and enters G. After that, this time, the leading robot will continue to move with the following robot aiming at the position with the lowest Q function value in G, and the following robot will continue to fill the void created by the leading robot To move the following robot.

(伸縮型ロボットと不変型ロボットとが混在するときに制御する場合の拘束条件)
さらに、伸縮型ロボットと不変型ロボットが混在する場合について述べる。 (Restriction condition when controlling when telescopic robot and immutable robot are mixed)
Furthermore, the case where a telescopic robot and an immutable robot are mixed will be described.

以下、各ロボットの周囲の格子(伸縮型ロボットの伸長状態の場合、周囲の10マス、伸縮型ロボットの収縮状態及び不変型ロボットの場合周囲の8マス)のいずれかに他のロボットが存在し、ロボット全体で一つの集合をなすという拘束条件を満たさなければならない。以下、この条件を維持しながらの、価値関数Q(s,a)を利用した隊列移動の方法について、不変型ロボットのみの制御する場合と、第一制御及び第二制御の処理内容が主に異なるため、この部分を中心に図１７、図１８を用いて説明する。なお、新たな変数として、各ロボットiが伸縮型ロボットか不変型ロボットかを表す変数としてcompress[i]を定義する。compress[i]=1のときロボットiは収縮型ロボットであり、compress[i]=0のときロボットiは不変型ロボットである。 In the following, there are other robots in either of the grids around each robot (10 squares in the stretched state of the extendable robot, 8 squares in the contracted state of the telescopic robot and 8 squares in the case of the invariant robot). The constraint condition that the entire robot forms one set must be satisfied. In the following, regarding the method of platoon movement using the value function Q (s, a) while maintaining this condition, the control contents of the invariant type robot and the processing contents of the first control and the second control are mainly described. Since this is different, this portion will be mainly described with reference to FIGS. As a new variable, compress [i] is defined as a variable indicating whether each robot i is a telescopic robot or an invariant robot. When compress [i] = 1, the robot i is a contractive robot, and when compress [i] = 0, the robot i is an invariant robot.

第一制御では、以下の処理ｓ５１〜ｓ５６を行う。なお、このｓ５１、ｓ５１１、及びｓ５３〜５６に関しては不変型ロボットのみの制御する場合と同様であるため、説明を簡略化する。 In the first control, the following processes s51 to s56 are performed. Since s51, s511, and s53 to 56 are the same as in the case of controlling only the invariant robot, the description will be simplified.

ロボットの順位付け、先頭ロボットの選択、ボイドの発生（設定）、行動方針Policy_Gにより先頭ロボットの行動選択を行い（ｓ５１）、移動するロボットの配列、行動を設定する（ｓ５１１）。 The robot is ranked, the head robot is selected, the void is generated (set), and the head robot is selected based on the action policy Policy_G (s51), and the moving robot array and action are set (s511).

選択されたロボットが不変型ロボットか否か(compress[i]=0?)を判定する（ｓ５２）。 It is determined whether or not the selected robot is an immutable robot (compress [i] = 0?) (S52).

伸縮型ロボットの場合、先頭移動フラグflgを更新し、flg←1とし、行動方策Policy_Gにより選択された行動a_selectに基づき選択したロボットi_select_move[0]〜i_select_move[Ir]を移動させる。（ｓ５６）。ここで、ロボットi_select_move[0]〜i_select_move[Ir]に含まれるロボットが、不変型ロボットの場合にはロボットを変形させずに隣接する格子に移動させ、伸縮型ロボットの場合にはロボットを収縮状態から伸長状態に変形させる。 In the case of the telescopic robot, the head movement flag flg is updated, flg ← 1, and the robots i_select_move [0] to i_select_move [Ir] selected based on the action a_select selected by the action policy Policy_G are moved. (S56). Here, if the robots included in the robots i_select_move [0] to i_select_move [Ir] are immutable robots, the robots are moved to the adjacent grid without being deformed, and if they are telescopic robots, the robots are contracted. It is deformed from the extended state.

不変型ロボットの場合、不変型ロボットのみを制御する場合で説明したｓ５３（行動方策Policy_Vによるボイドの行動選択）とｓ５４（選択したロボットの選択した移動方向とボイドの移動方向とが逆方向か否かの判定）とを行う。 In the case of an invariant robot, s53 (action selection of a void by action policy Policy_V) and s54 described in the case of controlling only an invariant robot and s54 (whether the movement direction selected by the selected robot is opposite to the movement direction of the void) Whether or not).

逆向きの行動以外の場合、先頭移動フラグflgを更新し、flg←1とし、行動方策Policy_Gにより選択された行動a_selectに基づき選択したロボットi_select_move[0]〜i_select_move[Ir]を移動させる。（ｓ５６）。逆向きの行動の場合、ボイドを移動させ、ボイドの移動先にあるロボットを選択し、その行動を、前のロボットと同じ方向の移動とする（ｓ５５）。 If the action is other than the reverse action, the head movement flag flg is updated, flg ← 1, and the robots i_select_move [0] to i_select_move [Ir] selected based on the action a_select selected by the action policy Policy_G are moved. (S56). In the case of an action in the opposite direction, the void is moved, the robot at the destination of the void is selected, and the action is set to move in the same direction as the previous robot (s55).

伸縮型ロボットが選択されるまで（ｓ５２）、または、選択したロボットの選択した移動方向とボイドの移動方向とが逆方向以外になるまで（ｓ５４）、ｓ５２〜ｓ５５を繰り返す。 Steps s52 to s55 are repeated until the telescopic robot is selected (s52) or until the selected moving direction of the selected robot is different from the moving direction of the void (s54).

なお、後述する第二制御において、伸縮型ロボットが選択されたことを理由として、第二制御を終了した場合には、上述の処理ｓ５１〜ｓ５６と同時に伸縮型ロボットを伸長状態から収縮状態に変形させる。 In the second control described later, when the second control is terminated because the telescopic robot is selected, the telescopic robot is deformed from the expanded state to the contracted state simultaneously with the above-described processes s51 to s56. Let

第二制御では、以下の処理ｓ６１〜ｓ６９を行う。なお、このｓ６１〜ｓ６４及びｓ６６〜６９に関しては不変型ロボットのみの制御する場合と同様であるため、説明を簡略化する。 In the second control, the following processes s61 to s69 are performed. Since s61 to s64 and s66 to 69 are the same as the case of controlling only the invariant robot, the description will be simplified.

行動方策Policy_Vにより行動a_voidを選択する（ｓ６１）。選択された行動a_voidが移動する（a_void≠0）か否かを判定し（ｓ６２）、a_void≠0の場合、ボイドを移動させ、移動先の位置にあるロボットを選択し、そのロボットi_select[Ir]のとるべき行動を、行動a_voidとは逆方向の行動とする（ｓ６３）。なお、a_void=0の場合(ボイドが静止する場合)、先頭移動フラグflgを更新し、flg←0とし（ｓ６４）、第二制御を終了する。 Action a_void is selected by action policy Policy_V (s61). It is determined whether or not the selected action a_void moves (a_void ≠ 0) (s62). If a_void ≠ 0, the void is moved, the robot at the destination position is selected, and the robot i_select [Ir The action to be taken is the action opposite to the action a_void (s63). If a_void = 0 (when the void is stationary), the head movement flag flg is updated, flg ← 0 is set (s64), and the second control is terminated.

選択されたロボットが不変型ロボットか否か(compress[i]=0?)を判定する（ｓ６５）。 It is determined whether or not the selected robot is an immutable robot (compress [i] = 0?) (S65).

伸縮型ロボットの場合、ｓ６３で選択された行動a_selectに基づき選択したロボットi_select_move[0]〜i_select_move[Ir]を移動させる（ｓ６９）。伸縮型ロボットの場合には、ロボットi_select[Ir]を収縮状態から伸長状態に変形させる。 In the case of the telescopic robot, the robots i_select_move [0] to i_select_move [Ir] selected based on the action a_select selected in s63 are moved (s69). In the case of a telescopic robot, the robot i_select [Ir] is deformed from the contracted state to the extended state.

ｓ６５の判定の結果、不変型ロボットの場合、続いて、行動方策Policy_Vにより行動a_voidを選択する（ｓ６６）。選択された行動a_voidが、行動a_selectとはちょうど逆向きの行動であるかどうかの判定を行い（ｓ６７）、逆向きの行動以外の場合、ｓ６３で選択された行動a_selectに基づき選択したロボットi_select_move[0]〜i_select_move[Ir]を移動させる（ｓ６９）。逆向きの行動の場合、ボイドを移動させ、移動先のボイドの位置にあるロボットを選択し、そのロボットの選択すべき行動を、ｓ６３で求めた行動a_selectと同じとする（ｓ６８）。 If the result of the determination in s65 is an invariant robot, the action a_void is subsequently selected by the action policy Policy_V (s66). It is determined whether or not the selected action a_void is an action opposite to the action a_select (s67). If the action a_void is not the opposite action, the robot i_select_move [selected based on the action a_select selected in s63 is determined. [0] to i_select_move [Ir] are moved (s69). In the case of a backward action, the void is moved, the robot at the position of the destination void is selected, and the action to be selected by the robot is the same as the action a_select obtained in s63 (s68).

伸縮型ロボットが選択されるまで（ｓ６５）、または、選択したロボットの選択した移動方向とボイドの移動方向とが逆方向以外になるまで（ｓ６７）、ｓ６５〜ｓ６８を繰り返す。 Steps s65 to s68 are repeated until the telescopic robot is selected (s65) or until the selected moving direction of the selected robot and the moving direction of the void are not in opposite directions (s67).

なお、第一制御において伸縮型ロボットが選択されたことを理由として、第一制御を終了した場合には、上述の処理ｓ６１〜ｓ６９と同時に伸縮型ロボットを伸長状態から収縮状態に変形させる。 When the first control is terminated because the telescopic robot is selected in the first control, the telescopic robot is deformed from the expanded state to the contracted state simultaneously with the above-described processes s61 to s69.

以上のプロセスで、伸縮型ロボットの移動を制御するとともに、ある先行する不変型ロボットの移動方向と逆方向において隣接する後続する不変型ロボットが、その先行する不変型ロボットの移動によって生じたボイドが行動方策Policy_Vの制御に従って移動する場合の移動方向と逆方向に移動したときに、その移動方向が、先行する不変型ロボットの移動方向と同じ限りは、同時に移動するように制御している。具体的には、まず開始隊列形態にあるロボットのうち、もっとも入口位置Peに近いロボットが隊列の中から選択され、入口位置Peに向かうための行動が選択される。そして、そのロボットの移動方向と逆側にて接しているロボットを引き連れて1ステップ移動するが、その中に、収縮可能ロボットがある場合は、その収縮可能ロボットを後端として移動を行う（第一制御）。そのあとの時間ステップ以降においては、前のステップでできたボイドの位置を埋めるように同じ方向に動ける後続のロボットが、同時に一ステップずつ移動していく、ここでも同様に、後続ロボット中に収縮可能ロボットがある場合は、その収縮可能ロボットを後端として移動を行う。ボイドがこれ以上Q関数の値の低い位置に移動不可能になるまで処理を繰り返す（第二制御）。ボイドが移動不可能となったら、また、もっとも入口位置Peに近いロボットが隊列の中から選択し第一制御を行う。第一制御と第二制御とを繰り返していくうちに、いつしか、先頭のロボットが入口位置Peに到達して、G内に入る。そのあとは、今度は先頭のロボットは、G内においてもっともQ関数の値の低い位置を目指して後続ロボットを引き連れて移動を続け、後続のロボットは引き続き、先頭のロボットの作ったボイドを埋めるように後続ロボットを引き連れて移動していく。 In the above process, the movement of the telescopic robot is controlled, and the subsequent invariant robot adjacent in the direction opposite to the direction of movement of a certain preceding invariant robot has a void generated by the movement of the preceding invariant robot. When moving according to the control of the action policy Policy_V, the movement direction is controlled to move simultaneously as long as the movement direction is the same as the movement direction of the preceding invariant robot. Specifically, first, the robot closest to the entrance position Pe is selected from the formation among the robots in the starting formation form, and the action to go to the entrance position Pe is selected. Then, the robot in contact with the robot in the direction opposite to the moving direction of the robot moves and moves one step. If there is a retractable robot in the robot, the robot moves with the retractable robot as the rear end. One control). In subsequent time steps, subsequent robots that move in the same direction so as to fill in the void positions created in the previous step move one step at a time. If there is a robot that can be moved, the robot moves with the retractable robot as the rear end. The process is repeated until the void cannot move to a position where the Q function value is lower than this (second control). When the void becomes impossible to move, the robot closest to the entrance position Pe is selected from the platoon and performs the first control. As the first control and the second control are repeated, the leading robot reaches the entrance position Pe and enters G. After that, this time, the leading robot will continue to move with the following robot aiming at the position with the lowest Q function value in G, and the following robot will continue to fill the void created by the leading robot To move the following robot.

以下、ロボット移動の制御方法、拘束条件を実現するための処理について述べる。
＜第一実施形態に係る行動制御システム１００＞
図１９は第一実施形態に係る行動制御システム１００の機能ブロック図を、図２０はその処理フローの例を示す。行動制御システム１００は、図１９に示すように、動作計画部１１０と、行動選択部１２０と、隣接状態判定部１２１と、位置更新部１２５と、位置判定部１２６と、記憶部１４０と、通信部１５０と、入力部１６０とを含む。 Hereinafter, a robot movement control method and processing for realizing the constraint conditions will be described.
<Action control system 100 according to the first embodiment>
FIG. 19 is a functional block diagram of the behavior control system 100 according to the first embodiment, and FIG. 20 shows an example of the processing flow. As illustrated in FIG. 19, the behavior control system 100 includes an operation planning unit 110, a behavior selection unit 120, an adjacent state determination unit 121, a position update unit 125, a position determination unit 126, a storage unit 140, and communication. Part 150 and input part 160.

以下では、制御の対象となる制御対象物が、ロボットである場合を例に挙げて説明する。もちろん、制御対象物は、制御の対象となり得るものであれば、ロボット以外であってもよい。 Hereinafter, a case where the control target to be controlled is a robot will be described as an example. Of course, the control object may be other than the robot as long as it can be a control target.

本実施形態では、行動制御システム１００は、p台のロボットの行動を制御し、p台のロボットの内の１つのロボット上に実装される。なお、p個のロボットのうちq個（qを0以上p以下の整数の何れかとする）のロボットを伸縮型ロボットとし、p個のロボットのうち(p-q)個のロボットを不変型ロボットとする。行動制御システム１００が実装されるロボットは不変型ロボットでも伸縮型ロボットでもよい。なお、行動制御システム１００が実装されていないp-１台のロボットについても、通信部１５０と、隣接状態判定部１２１とを含む。 In the present embodiment, the behavior control system 100 controls the behavior of p robots and is mounted on one of the p robots. Of the p robots, q robots (q is any integer between 0 and p) are telescopic robots, and (pq) robots of the p robots are invariant robots. . The robot on which the behavior control system 100 is mounted may be an invariant robot or a telescopic robot. Note that the p-1 robot on which the behavior control system 100 is not mounted also includes the communication unit 150 and the adjacent state determination unit 121.

＜動作計画部１１０＞
動作計画部１１０は、価値関数Q(s,a)の値を、動的計画法により、ロボットの任務行動開始前に事前に計算し（Ｓ１１０）、記憶部１４０に格納する。ここで、動作計画部１１０の計算は、一台のロボットを使用したQ学習に置き換えてもよい。なお、別装置で価値関数Q(s,a)を計算しておき、ロボットの任務行動開始前に事前に記憶部１４０に格納しておけば、行動制御システム１００は、動作計画部１１０を備えなくともよい。 <Operation Planning Unit 110>
The motion planning unit 110 calculates the value of the value function Q (s, a) in advance by dynamic programming before starting the mission action of the robot (S110) and stores it in the storage unit 140. Here, the calculation of the motion planning unit 110 may be replaced with Q learning using a single robot. If the value function Q (s, a) is calculated by a separate device and stored in the storage unit 140 in advance before starting the robot's mission behavior, the behavior control system 100 includes the motion planning unit 110. Not necessary.

なお、価値関数Q(s,a)の計算については、（Q関数の計算について）、(開始位置からのロボット群のルートが二つ以上考えられる場合のQ関数の計算)で説明した通りである。 The calculation of the value function Q (s, a) is as described in (Calculation of the Q function) and (Calculation of the Q function when two or more routes of the robot group from the start position are considered). is there.

＜入力部１６０＞
入力部１６０には、p台のロボットiのそれぞれの初期位置(Xr0[i],Xr0[i])及びp個の目標位置の集合G={(Xre[0],Yre[0]),(Xre[1],Yre[1]),…,(Xre[p-1],Yre[p-1])}が入力され、記憶部１４０に記憶される。 <Input unit 160>
The input unit 160 includes an initial position (Xr0 [i], Xr0 [i]) and a set of p target positions G = {(Xre [0], Yre [0]), (Xre [1], Yre [1]),..., (Xre [p-1], Yre [p-1])} are input and stored in the storage unit 140.

なお、目標位置は、所定の入口位置Peを含むとする。この入口位置Peについての情報も、入力部１６０から入力され、記憶部１４０に記憶されるとする。 Note that the target position includes a predetermined entrance position Pe. Information about the entrance position Pe is also input from the input unit 160 and stored in the storage unit 140.

＜記憶部１４０＞
記憶部１４０には、位置s及びa∈{0,1,2,3,4}の組み合わせのそれぞれについての価値関数Q(s,a)が記憶されているとする。sの取りうる範囲は、対象となる二次元平面上の領域内のロボットiが存在しうる全ての座標である。 <Storage unit 140>
It is assumed that the storage unit 140 stores a value function Q (s, a) for each of the combinations of the position s and aε {0,1,2,3,4}. The range that s can take is all the coordinates where the robot i in the region on the target two-dimensional plane can exist.

各位置sの報酬r(s)についても、記憶部１４０に記憶されているとする。各位置sの報酬r(s)についての情報は、例えば入力部１６０から入力される。 It is assumed that the reward r (s) at each position s is also stored in the storage unit 140. Information about the reward r (s) at each position s is input from the input unit 160, for example.

＜通信部１５０＞
行動制御システム１００が実装されているロボットも含め、全てのロボットは、通信部１５０を介して、二次元平面上の上、右上、右、右下、下、左下、左、左上方向(以下「８方向」ともいう)において隣接する他のロボットと通信することができる。図２１Ａの梨地部分は不変型ロボット及び収縮状態の伸縮型ロボットの隣接する通信可能な位置を、図２１Ｂの梨地部分は伸長状態の伸縮型ロボットの隣接する通信可能な位置を示す。 <Communication unit 150>
All robots including the robot in which the behavior control system 100 is mounted are connected to the upper, upper right, right, lower right, lower, lower left, left, upper left directions (hereinafter “ It is also possible to communicate with other adjacent robots in the “8 directions”. The satin portion in FIG. 21A shows the communicable position of the invariable robot and the retractable telescopic robot adjacent to each other, and the satin portion in FIG. 21B shows the communicable position of the stretchable telescopic robot adjacent to each other.

＜行動選択部１２０＞
行動選択部１２０は、記憶部１４０から価値関数Qを取り出す。行動選択部１２０は、（伸縮型ロボットのみの制御する場合の拘束条件）、（不変型ロボットのみの制御する場合の拘束条件）、（伸縮型ロボットと不変型ロボットとが混在するときに制御する場合の拘束条件）の部分で説明した方法で、p台のロボットを制御する（ｓ１２０）。具体的には、図１０のｓ１（初期設定）、ｓ３（先頭移動フラグの判定）、ｓ５（第一制御）、ｓ６（第二制御）を行う。なお、行動方策Policy_G、行動方策Policy_Vにおいて、各ロボットの位置や障害物位置を必要とするが、各ロボットの位置の初期位置(Xr0[i],Xr0[i])については、前述の通り、入力部１６０を介して入力され記憶部１４０に記憶されているものを利用し、移動後の各ロボットの位置については、後述する位置判定部１２６で求め記憶部１４０に記憶されているものを利用すればよい。障害物の位置に関しては、価値関数Q(s,a)を学習する際に利用した障害物の位置を利用してもよいし、後述する隣接状態判定部１２１で判定した判定結果を用いて得られるものを利用してもよい。 <Action selection unit 120>
The action selection unit 120 extracts the value function Q from the storage unit 140. The action selection unit 120 performs control when (constraint condition when controlling only the telescopic robot), (restraint condition when controlling only the invariant robot), and (extractable robot and immutable robot are mixed). P robots are controlled by the method described in the section (Restrictive conditions in case) (s120). Specifically, s1 (initial setting), s3 (determination of the head movement flag), s5 (first control), and s6 (second control) in FIG. 10 are performed. In addition, in the action policy Policy_G and the action policy Policy_V, the position of each robot and the position of an obstacle are required, but the initial position (Xr0 [i], Xr0 [i]) of each robot is as described above. The information input via the input unit 160 and stored in the storage unit 140 is used, and the position of each robot after movement is obtained by the position determination unit 126 described later and stored in the storage unit 140. do it. As for the position of the obstacle, the position of the obstacle used when learning the value function Q (s, a) may be used, or obtained using the determination result determined by the adjacent state determination unit 121 described later. You may use what is available.

＜位置更新部１２５＞
位置更新部１２５は、各i=0,1,…,p-1について、i番目のロボットの現在の位置（Xr[i],Yr[i]）において、行動選択部１２０で決定した行動を実行した場合のロボットの移動後（行動後）の位置（Xr'[i],Yr'[i]）を計算し、計算された（Xr'[i],Yr'[i]）で記憶部１４０に格納されたi番目のロボットの位置を更新する（Ｓ１２５）。言い換えれば、位置更新部１２５は、行動選択部１２０で決定した行動に基づいて、ロボットが行動した場合に想定される位置（以下、「想定位置」ともいう）を計算し、ロボットの位置を更新し記憶部１４０に格納する。具体的には、図１０のｓ８（位置の更新）を行う。 <Location update unit 125>
For each i = 0, 1,..., P−1, the position update unit 125 performs the action determined by the action selection unit 120 at the current position (Xr [i], Yr [i]) of the i-th robot. Calculates the position (Xr '[i], Yr' [i]) after movement (after action) of the robot when executed, and stores the calculated (Xr '[i], Yr' [i]) The position of the i-th robot stored in 140 is updated (S125). In other words, the position update unit 125 calculates a position assumed when the robot behaves (hereinafter also referred to as “assumed position”) based on the action determined by the action selection unit 120 and updates the position of the robot. Stored in the storage unit 140. Specifically, s8 (update of position) in FIG. 10 is performed.

＜隣接状態判定部１２１＞
隣接状態判定部１２１は、ロボットの２次元平面上の上下左右の隣接する位置に、障害物または他のロボットが存在するか否かを判定し、ロボットの２次元平面上の右上、左上、左下、右下の隣接する位置に他のロボットが存在するか否かを判定し（Ｓ１２１），判定結果を記憶部１４０に格納する。 <Adjacent state determination unit 121>
The adjacency state determination unit 121 determines whether an obstacle or another robot exists in the upper, lower, left, and right adjacent positions on the two-dimensional plane of the robot, and the upper right, upper left, lower left on the two-dimensional plane of the robot. Then, it is determined whether another robot is present at the lower right adjacent position (S121), and the determination result is stored in the storage unit 140.

なお、上述の通り、行動制御システム１００が実装されていないp-１台のロボットについても、通信部１５０と、隣接状態判定部１２１とを含む。そのため、各ロボットiは隣接状態判定部１２１において、自身の上下左右方向に障害物があるかどうか、周囲８方向に他のロボットがいるかどうかを判定し、通信部１５０を介して判定結果を行動制御システム１００に出力することができる。行動制御システム１００は、通信部１５０を介して各ロボットiから判定結果を受け取り、行動制御システム１００に含まれる隣接状態判定部１２１の判定結果と一緒に記憶部１４０に格納する。なお、p台のロボットは、各ロボットの隣り合う位置（８方向）に必ず、他のロボットが存在し、隣り合うロボット同士がなす群れは、一塊なので、各ロボットiは通信部１５０を介してp-1個の判定結果を直接、または、他のロボットを介して、行動制御システム１００に送信することができる。また、行動制御システム１００は、通信部１５０を介して、直接、または、他のロボットを介して、各ロボットiに選択した行動を実行するように制御信号を送信することができる。また、他の情報もp台のロボット間で送受信可能となる。 As described above, the p-1 robot on which the behavior control system 100 is not mounted also includes the communication unit 150 and the adjacent state determination unit 121. Therefore, each robot i determines whether there is an obstacle in its up / down / left / right direction in the adjacent state determination unit 121 and whether there is another robot in the surrounding eight directions, and determines the result of the determination via the communication unit 150. It can be output to the control system 100. The behavior control system 100 receives the determination result from each robot i via the communication unit 150 and stores it in the storage unit 140 together with the determination result of the adjacent state determination unit 121 included in the behavior control system 100. Note that the p robots always have other robots in adjacent positions (eight directions) of each robot, and the group formed by the adjacent robots is a lump, so each robot i is connected via the communication unit 150. The p-1 determination results can be transmitted to the behavior control system 100 directly or via another robot. In addition, the behavior control system 100 can transmit a control signal so as to execute the selected behavior to each robot i directly or via another robot via the communication unit 150. In addition, other information can be transmitted and received between the p robots.

なお、ロボットが移動するように制御したとしても、何らかのトラブル（感知できなかった障害物の存在や、機器の故障等）により、行動選択部１２０の制御通りに移動できるとは限らない。一方、動かなかったロボットの位置は、制御する前と変わらない。よって、動かなかったロボットの位置を基準として、隣接状態判定部１２１による判定結果を用いて、移動するように制御されたロボットの、実際に行動した後の位置（以下「行動後位置」ともいう）（Xr"[i],Yr"[i]）を求めることができる。なお、不変型ロボットが進行方向に対して一列に並んでいる場合には、全てのロボットが同一方向に移動するため、基準となるロボットが存在しないことになる。そのような場合には、その状態を検知し、１つ以上のロボットを進行方向に対して直交する方向に１マス分移動させ、その移動させたロボットを基準として、他のロボットを元々の進行方向に向かって移動させればよい。 Even if the robot is controlled to move, it may not be able to move as controlled by the action selection unit 120 due to some trouble (the presence of an obstacle that could not be detected, equipment failure, etc.). On the other hand, the position of the robot that did not move is the same as before the control. Therefore, using the determination result by the adjacent state determination unit 121 based on the position of the robot that has not moved, the position of the robot that has been controlled to move (hereinafter also referred to as “post-action position”). ) (Xr "[i], Yr" [i]). When the invariant robots are arranged in a line with respect to the traveling direction, since all the robots move in the same direction, there is no reference robot. In such a case, the state is detected, one or more robots are moved by one square in a direction orthogonal to the traveling direction, and the other robots are originally moved with reference to the moved robots. What is necessary is just to move toward a direction.

＜位置判定部１２６＞
位置判定部１２６は、隣接状態判定部１２１の判定結果を用いて、行動後位置を求め、行動後位置（Xr"[i],Yr"[i]）と想定位置（Xr'[i],Yr'[i]）とが一致するか否かを判定する（Ｓ１２６）。なお、一致しない場合には、移動するように制御されたロボットが何らかのトラブルにより、制御通りに移動できなかったと考えられる。この場合、行動後位置（Xr"[i],Yr"[i]）と想定位置（Xr'[i],Yr'[i]）との少なくとも一方を補正すればよい。補正方法としては様々な手法が考えられる。例えば、移動した全てのロボットに対して、制御前の位置に戻るように指示し、行動後位置（Xr"[i],Yr"[i]）を補正してもよいし、想定位置（Xr'[i],Yr'[i]）を行動後位置（Xr"[i],Yr"[i]）に合わせて補正してもよい。 <Position determination unit 126>
The position determination unit 126 obtains the post-behavior position using the determination result of the adjacent state determination unit 121, and the post-behavior position (Xr "[i], Yr" [i]) and the assumed position (Xr '[i], It is determined whether or not Yr ′ [i]) matches (S126). If they do not match, it is considered that the robot controlled to move could not move as controlled due to some trouble. In this case, at least one of the post-behavior position (Xr "[i], Yr" [i]) and the assumed position (Xr '[i], Yr' [i]) may be corrected. Various methods can be considered as the correction method. For example, all the robots that have moved may be instructed to return to the pre-control position, and the post-behavior position (Xr "[i], Yr" [i]) may be corrected, or the assumed position (Xr '[i], Yr' [i]) may be corrected according to the post-action position (Xr "[i], Yr" [i]).

各時刻ステップごとに、すべてのロボットがG内にあるかどうかを判定し（Ｓ１２７、図１０のｓ９に相当）、すべてのロットがG内にあるときは、任務を終了する。そうでないときは、任務を継続する。 At each time step, it is determined whether all robots are in G (S127, corresponding to s9 in FIG. 10). If all lots are in G, the mission is terminated. If not, continue the mission.

例えば、図示しない目標位置到達判定部において、各i=0,1,…,p-1について、位置判定部１２６から出力された行動後位置（Xr"[i],Yr"[i]）∈Gであるか否かを判定し、全てのiについて（Xr"[i],Yr"[i]）∈Gである場合には、任務を終了する。少なくとも１つ以上のiについて（Xr"[i],Yr"[i]）∈Gを満たさない場合には、行動選択部１２０を再度実行するよう制御する。 For example, in the target position arrival determination unit (not shown), the post-action position (Xr "[i], Yr" [i]) ∈ output from the position determination unit 126 for each i = 0, 1,. It is determined whether or not G, and if (Xr "[i], Yr" [i]) ∈ G for all i, the mission is terminated. If (Xr ″ [i], Yr ″ [i]) εG is not satisfied for at least one i, the behavior selection unit 120 is controlled to be executed again.

以上に述べた処理ｓ１２０〜ｓ１２７を毎時刻ステップごとに行う。 The processes s120 to s127 described above are performed for each time step.

＜効果＞
このような構成により、一台のロボットに必要な分だけのマルコフ状態空間を用意し、それを用いて動的計画法を利用して各位置でのロボットの行動方策を計算し、その行動方策を利用することで、ロボットに任意の隊列形状と、任務環境内の任意の障害物形状に対応した、ロボット同士が接した状態を維持したうえでの多数ロボットのための隊列形成アルゴリズムを獲得することができる。すなわち、ロボット数に依存せずにロボット一台分の計画計算負荷での自己位置座標定義型隊列形成アルゴリズム獲得ができる。また、静止しているロボットに対する相対的な位置を判定することで、絶対的な位置を取得することができるため、付加的な位置計測用の装備を必要としない。 <Effect>
With such a configuration, the Markov state space required for one robot is prepared, and using it, the robot's action policy at each position is calculated using dynamic programming, and the action policy By using, we obtain a formation algorithm for a large number of robots while maintaining a state where the robots are in contact with each other, corresponding to an arbitrary formation shape of the robot and an arbitrary obstacle shape in the mission environment. be able to. That is, the self-position coordinate definition formation formation algorithm can be obtained with the planned calculation load of one robot without depending on the number of robots. Moreover, since an absolute position can be acquired by determining a relative position with respect to a stationary robot, no additional position measurement equipment is required.

なお、前述の通り、伸縮型ロボットは、尺取虫のように伸長収縮により移動する。このような移動は、足や車輪による移動に比べ不整地や地中などの移動に有利であることが知られており、災害現場等での利用する際に特に有効である。本実施形態では、そのような伸縮型ロボットのみを制御する場合や不変型ロボットと組合せた場合の制御方法を提示している。 As described above, the telescopic robot moves by stretching and contracting like a scale insect. Such movement is known to be advantageous for movement on rough terrain and underground compared to movement by feet and wheels, and is particularly effective when used at a disaster site or the like. In the present embodiment, a control method for controlling only such a telescopic robot or a combination with an invariant robot is presented.

＜変形例＞
本実施形態では、各格子（マス）は、正方形であるが、他の形状であってもよい。格子は左右方向及び上下方向に連続して配置される。また、各格子は左右方向で他の二つの格子と隣接し、上下方向で他の二つの格子と隣接する。言い換えると、各格子は、ロボットの移動できる方向と同じ方向においてのみ、他の格子と隣接する。この条件を満たせば、各格子はどのような形状であってもよい。また、「直交」とは、厳密に「垂直に交わること」を意味しなくともよく、例えば、図２２のように、各格子は、菱形であってもよく、各格子が他の二つの格子と隣接する方向の一方を上下方向とし、他方を左右方向とすればよく、このとき、上下方向と左右方向とは直交するものとする。 <Modification>
In the present embodiment, each lattice (mass) is a square, but may have other shapes. The lattice is continuously arranged in the left-right direction and the up-down direction. Each lattice is adjacent to the other two lattices in the left-right direction and adjacent to the other two lattices in the vertical direction. In other words, each grid is adjacent to other grids only in the same direction that the robot can move. Each lattice may have any shape as long as this condition is satisfied. Further, the term “orthogonal” does not necessarily mean strictly “vertically intersecting”. For example, as shown in FIG. 22, each lattice may be a rhombus, and each lattice is the other two lattices. One of the adjacent directions may be the vertical direction and the other is the horizontal direction. In this case, the vertical direction and the horizontal direction are orthogonal to each other.

別の言い方をすると、制御対象物は、二次元平面上の、第一方向（例えば右方向）、第一方向に対して平行でない方向である第二方向（例えば上方向）、第一方向に対して反対方向である第三方向（例えば左方向）、第二方向に対して反対方向である第四方向（例えば下方向）に移動可能であり、一回の行動制御により、現在いる領域（格子、マス）から、現在いる領域に対して、第一方向、第二方向、第三方向、第四方向において隣接する領域の何れかに移動するように制御される。また、この場合、ロボットの２次元平面上の、第一方向において隣接する位置を第一位置、第二方向において隣接する位置を第二位置、第三方向において隣接する位置を第三位置、第四方向において隣接する位置を第四位置、第一位置に第二方向において隣接する位置を第五位置、第二位置に第三方向において隣接する位置を第六位置、第三位置に第四方向において隣接する位置を第七位置、第四位置に第一方向において隣接する位置を第八位置と呼んでもよい。例えば第一〜第八位置は、それぞれ図２１の「１」〜「８」の位置に対応する。 In other words, the control object is in the first direction (for example, the right direction), the second direction (for example, the upward direction) that is not parallel to the first direction, and the first direction on the two-dimensional plane. On the other hand, it is possible to move in the third direction (for example, the left direction) opposite to the second direction, and in the fourth direction (for example, the downward direction) opposite to the second direction. Control is performed so that the current area is moved to one of the adjacent areas in the first direction, the second direction, the third direction, and the fourth direction from the grid. In this case, on the two-dimensional plane of the robot, the position adjacent in the first direction is the first position, the position adjacent in the second direction is the second position, the position adjacent in the third direction is the third position, Positions that are adjacent in the four directions are the fourth position, positions that are adjacent to the first position in the second direction are the fifth positions, positions that are adjacent to the second position in the third direction are the sixth positions, and positions that are adjacent to the third position are the fourth direction May be referred to as the seventh position, and the position adjacent to the fourth position in the first direction as the eighth position. For example, the first to eighth positions correspond to the positions “1” to “8” in FIG.

本実施形態では、行動制御システム１００が、p台のロボットの内の１つのロボット上に実装される例を示したが、コンピュータ上の制御対象物に対して行動制御を行ってもよい。その場合、意図的にトラブルを発生させない限り、制御した通りに制御対象物は移動するため、想定位置（Xr'[i],Yr'[i]）と移動後位置（Xr"[i],Yr"[i]）とが一致する。そのような状態で、各制御対象物i毎の理想的な行動計画を行ってもよい。その結果得られた任務を終了するまでの行動計画を実ロボットに実行させる。実ロボットは、通信部１５０と隣接状態判定部１２１を備え、各行動を終えた後に、隣接状態を判定し、判定結果を用いて、行動後位置を求め、行動後位置（Xr"[i],Yr"[i]）と想定位置（Xr'[i],Yr'[i]）とが一致するか否かを判定する。なお、一致しない場合には、移動するように制御されたロボットが何らかのトラブルにより、制御通りに移動できなかったと考えられる。その後の処理については第一実施形態と同様でもよいし、他の処理を行ってもよい。このような構成により、少なくとも制御通りに移動できなかったことを検知することができる。 In the present embodiment, the example in which the behavior control system 100 is mounted on one of the p robots has been shown, but behavior control may be performed on a control target on a computer. In that case, unless a trouble occurs intentionally, the controlled object moves as controlled, so the assumed position (Xr '[i], Yr' [i]) and the post-movement position (Xr "[i], Yr "[i]) matches. In such a state, an ideal action plan for each control object i may be performed. The actual robot is made to execute the action plan until the mission obtained as a result is completed. The real robot includes a communication unit 150 and an adjacent state determination unit 121. After finishing each action, the real robot determines an adjacent state, obtains a post-behavior position using the determination result, and determines a post-behavior position (Xr "[i] , Yr "[i]) and the assumed position (Xr '[i], Yr' [i]) are determined to match. If they do not match, it is considered that the robot controlled to move could not move as controlled due to some trouble. Subsequent processing may be the same as in the first embodiment, or other processing may be performed. With such a configuration, it is possible to detect at least that movement was not possible as controlled.

＜その他の変形例＞
本発明は上記の実施形態及び変形例に限定されるものではない。例えば、上述の各種の処理は、記載に従って時系列に実行されるのみならず、処理を実行する装置の処理能力あるいは必要に応じて並列的にあるいは個別に実行されてもよい。その他、本発明の趣旨を逸脱しない範囲で適宜変更が可能である。 <Other variations>
The present invention is not limited to the above-described embodiments and modifications. For example, the various processes described above are not only executed in time series according to the description, but may also be executed in parallel or individually as required by the processing capability of the apparatus that executes the processes. In addition, it can change suitably in the range which does not deviate from the meaning of this invention.

＜プログラム及び記録媒体＞
また、上記の実施形態及び変形例で説明した各装置における各種の処理機能をコンピュータによって実現してもよい。その場合、各装置が有すべき機能の処理内容はプログラムによって記述される。そして、このプログラムをコンピュータで実行することにより、上記各装置における各種の処理機能がコンピュータ上で実現される。 <Program and recording medium>
In addition, various processing functions in each device described in the above embodiments and modifications may be realized by a computer. In that case, the processing contents of the functions that each device should have are described by a program. Then, by executing this program on a computer, various processing functions in each of the above devices are realized on the computer.

この処理内容を記述したプログラムは、コンピュータで読み取り可能な記録媒体に記録しておくことができる。コンピュータで読み取り可能な記録媒体としては、例えば、磁気記録装置、光ディスク、光磁気記録媒体、半導体メモリ等どのようなものでもよい。 The program describing the processing contents can be recorded on a computer-readable recording medium. As the computer-readable recording medium, for example, any recording medium such as a magnetic recording device, an optical disk, a magneto-optical recording medium, and a semiconductor memory may be used.

また、このプログラムの流通は、例えば、そのプログラムを記録したＤＶＤ、ＣＤ−ＲＯＭ等の可搬型記録媒体を販売、譲渡、貸与等することによって行う。さらに、このプログラムをサーバコンピュータの記憶装置に格納しておき、ネットワークを介して、サーバコンピュータから他のコンピュータにそのプログラムを転送することにより、このプログラムを流通させてもよい。 The program is distributed by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM in which the program is recorded. Further, the program may be distributed by storing the program in a storage device of the server computer and transferring the program from the server computer to another computer via a network.

このようなプログラムを実行するコンピュータは、例えば、まず、可搬型記録媒体に記録されたプログラムもしくはサーバコンピュータから転送されたプログラムを、一旦、自己の記憶部に格納する。そして、処理の実行時、このコンピュータは、自己の記憶部に格納されたプログラムを読み取り、読み取ったプログラムに従った処理を実行する。また、このプログラムの別の実施形態として、コンピュータが可搬型記録媒体から直接プログラムを読み取り、そのプログラムに従った処理を実行することとしてもよい。さらに、このコンピュータにサーバコンピュータからプログラムが転送されるたびに、逐次、受け取ったプログラムに従った処理を実行することとしてもよい。また、サーバコンピュータから、このコンピュータへのプログラムの転送は行わず、その実行指示と結果取得のみによって処理機能を実現する、いわゆるＡＳＰ（Application Service Provider）型のサービスによって、上述の処理を実行する構成としてもよい。なお、プログラムには、電子計算機による処理の用に供する情報であってプログラムに準ずるもの（コンピュータに対する直接の指令ではないがコンピュータの処理を規定する性質を有するデータ等）を含むものとする。 A computer that executes such a program first stores, for example, a program recorded on a portable recording medium or a program transferred from a server computer in its storage unit. When executing the process, this computer reads the program stored in its own storage unit and executes the process according to the read program. As another embodiment of this program, a computer may read a program directly from a portable recording medium and execute processing according to the program. Further, each time a program is transferred from the server computer to the computer, processing according to the received program may be executed sequentially. Also, the program is not transferred from the server computer to the computer, and the above-described processing is executed by a so-called ASP (Application Service Provider) type service that realizes the processing function only by the execution instruction and result acquisition. It is good. Note that the program includes information provided for processing by the electronic computer and equivalent to the program (data that is not a direct command to the computer but has a property that defines the processing of the computer).

また、コンピュータ上で所定のプログラムを実行させることにより、各装置を構成することとしたが、これらの処理内容の少なくとも一部をハードウェア的に実現することとしてもよい。 In addition, although each device is configured by executing a predetermined program on a computer, at least a part of these processing contents may be realized by hardware.

Claims

A behavior control system for performing behavior control for moving a plurality of control objects arranged in a set of start positions to a set of target positions including a predetermined entrance position,
The direction not parallel to the first direction is the second direction, the opposite direction to the first direction is the third direction, the opposite direction to the second direction is the fourth direction, and each start position and each The target positions are adjacent to other start positions and target positions in at least one of the first direction to the fourth direction, respectively, and the set of the target positions and the set of the start positions each have an arbitrary shape in a lump. And
The control object includes a first position adjacent in the first direction on the two-dimensional plane of the control object, a second position adjacent in the second direction, a third position adjacent in the third direction, and a fourth position. A fourth position adjacent in the direction, a fifth position adjacent to the first position in the second direction, a sixth position adjacent to the second position in the third direction, a seventh position adjacent to the third position in the fourth direction, And a communication means for communicating with another control object existing in the eighth position adjacent to the fourth position in the first direction, and the control object is a behavior a at the current position s of the control object. It is controlled based on a single value function that represents the appropriateness when taking the measurement, and is controlled to move to any one of the first to fourth positions on the two-dimensional plane. ,
The Markov state space used for learning is composed of only one robot state variable, and all p robots share one value function with this state variable as an argument.
The value function first learns that a high reward value can be obtained when the control object moves to a predetermined entrance position at all positions, and then the value function It was obtained by learning so that the value function of the action aiming at the position of the point with the largest value becomes high,
A storage unit for storing the value function;
A virtual existence that occurs with the movement of the control object or moves in a direction opposite to the direction in which the control object moves is defined as a void, and using the value function, (G-1) If it exists at any of the target positions, the control object is controlled to move to another target position far from the predetermined entrance position, and (G-2) the control object is added to the set of target positions. If it exists at a position not included, the control object is controlled to move toward the predetermined entrance position, and (V-1) a position where a void is not included in the set of target positions or the predetermined position The void is moved to a position that is far from the predetermined inlet position and not included in the set of target positions, and (V-2) the void is excluded from the predetermined inlet position. If it is located at any of the target positions, the void is Comprises action selecting section for controlling to move towards the fixed inlet position,
The action selection unit determines a leading control object close to the entrance position, and generates a void by moving the control object according to the control of (G-1) and (G-2), and the result of the first control The second control for moving the generated void as long as it follows the control of (V-1) and (V-2) is executed.
Behavior control system.

A behavior control method for performing behavior control for moving a plurality of control objects arranged in a set of start positions to a set of target positions including a predetermined entrance position,
The direction not parallel to the first direction is the second direction, the opposite direction to the first direction is the third direction, the opposite direction to the second direction is the fourth direction, and each start position and each The target positions are adjacent to other start positions and target positions in at least one of the first direction to the fourth direction, respectively, and the set of the target positions and the set of the start positions each have an arbitrary shape in a lump. And
The control object includes a first position adjacent in the first direction on the two-dimensional plane of the control object, a second position adjacent in the second direction, a third position adjacent in the third direction, and a fourth position. A fourth position adjacent in the direction, a fifth position adjacent to the first position in the second direction, a sixth position adjacent to the second position in the third direction, a seventh position adjacent to the third position in the fourth direction, And a communication means for communicating with another control object existing in the eighth position adjacent to the fourth position in the first direction, and the control object is a behavior a at the current position s of the control object. It is controlled based on a single value function that represents the appropriateness when taking the measurement, and is controlled to move to any one of the first to fourth positions on the two-dimensional plane. ,
The Markov state space used for learning is composed of only one robot state variable, and all p robots share one value function with this state variable as an argument.
The value function first learns that a high reward value can be obtained when the control object moves to a predetermined entrance position at all positions, and then the value function It was obtained by learning so that the value function of the action aiming at the position of the point with the largest value becomes high,
The action selection unit uses a virtual existence that occurs with the movement of the control object or moves in a direction opposite to the direction of movement of the control object as a void, and uses the value function (G-1 ) When the control object exists at any of the target positions, the control object is controlled to move to another target position far from the predetermined entrance position, and (G-2) the control object is If it exists at a position that is not included in the set of target positions, the control object is controlled to move toward the predetermined entrance position, and (V-1) void is included in the set of target positions. The void is moved to a position not included in the set of target positions far from the predetermined entrance position, and (V-2) the void is the predetermined position. If it is located at any of the target positions excluding the entrance position of Comprises action selecting step of controlling such voids are moved toward the predetermined inlet position,
In the action selection step, the first control object that determines the head control object close to the entrance position, and generates a void by being moved by the control of (G-1) and (G-2), and the result of the first control The second control for moving the generated void as long as it follows the control of (V-1) and (V-2) is executed.
Behavior control method.

A program for causing a computer to function as the behavior control system according to claim 1 .