JP2021154481A

JP2021154481A - Controller and method for controlling operation of robot executing tasks

Info

Publication number: JP2021154481A
Application number: JP2021025324A
Authority: JP
Inventors: Jeroen van Baar; Devesh Jha; Diego Romeres
Original assignee: Mitsubishi Electric Corp
Current assignee: Mitsubishi Electric Corp
Priority date: 2020-03-27
Filing date: 2021-02-19
Publication date: 2021-10-07

Abstract

To provide a method for controlling the operation of a robot executing a task, which simplifies training of the robot and thereby enables the robot to execute a new task more efficiently. SOLUTION: A controller comprises an input interface configured to receive the current state of a robot for each control step, and a memory. The memory is configured to store a library of skills of the robot, the skills being task-agnostic, and to store a learned function for a task. The controller further comprises a processor. For each control step in a sequence of control steps that reaches a termination condition, the processor is configured to execute the learned function on the current state to select a skill from the skill library, and to execute the selected skill on the current state to select an action that transitions the robot from the current state to the next state. SELECTED DRAWING: Figure 1

Description

The present invention relates to robot control, and more particularly to training and controlling a robot to perform a particular task.

Robotic manipulation is central to achieving the promise of robotics. The very definition of a robot requires that it have actuators that it can use to effect change in the world. The potential of autonomous manipulation applications is immense: robots capable of manipulating their environment could be deployed in hospitals, elder care and child care, factories, outer space, restaurants, service industries, and homes. This wide variety of deployment scenarios, together with the pervasive and unsystematic environmental variation found even in highly specialized scenarios such as food preparation, suggests a need for robots that learn quickly. Many methods focus on the question of how a robot should learn to manipulate the world around it.

For example, some methods train a robot to learn individual manipulation skills from human demonstrations in order to perform a new task. Some solutions use carefully designed demonstrations tailored to a particular robot. However, these methods are very laborious, time-consuming, and/or resource-intensive. Some other methods attempt to use hierarchical learning, curriculum learning, and the options framework. These methods learn more complex tasks from a previously learned set of less complex tasks. This makes the objective of the new task more complex instead of performing the new task.

Some other methods use reinforcement learning (RL) to train a robot to learn how to perform a new task effectively. However, when learning a new task for robotic manipulation, many actions are effectively unbounded, whereas RL expects to select from a finite set of actions.

Accordingly, there is a need for a method that simplifies the training of a robot, thereby enabling the robot to perform new tasks more efficiently.

It is an object of some embodiments to provide a system and a method for training and controlling a robot to perform a task. Additionally or alternatively, it is an object of some embodiments to modify the training process of the robot. Some embodiments are based on the recognition that a robot that would otherwise be trained to learn new skills for performing a new task can instead be trained to learn to reuse existing task-agnostic skills to "mimic" the new task. This formulation allows the embodiments to replace the objective of learning new skills with a selection objective. Moreover, this formulation allows the embodiments to distance themselves from the specifics of the new task and to "mimic" only the effects of the new task. Additionally or alternatively, this formulation allows some embodiments to select skills adaptively in real time to account for possible obstacles and the nature of the new task. Additionally or alternatively, this formulation allows some embodiments to use reinforcement learning (RL) to train the robot to learn how to perform the task effectively.

For example, it is an object of some embodiments to train a robot to perform a task from multiple demonstrations of the task in which states and actions are observed but skills are not. Indeed, some embodiments are based on the recognition that state/action pairs are often observable from a demonstration, whereas the skills are unknown or undefined. Moreover, state/action pairs are more important than skills, because the state/action pairs actually perform the task in the spatial domain of the robot, while the skills connect sequences of state/action pairs in the time domain.

To that end, some embodiments train a function to minimize the probability of a difference between the action selected for a state and the corresponding action from multiple demonstrations of the task. The embodiments train the function to select the skill that performs the action. Training the function can be regarded as training the robot. The selected skill can be task-agnostic as long as it returns the action required for the input state. Indeed, some embodiments are based on the recognition that, for a new task that prescribes a particular action for a particular state, there are many existing skills that happen to return the correct action for one state value while returning the wrong action for a different state value. For that different state value, however, there may be another skill that returns the correct action.
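As an illustration only, the selection objective can be sketched with a tabular stand-in for the trained function. The patent describes a parameterized function; the sketch below instead picks, for each demonstrated state, the skill that assigns the most total probability to the demonstrated actions, which is one simple surrogate for minimizing the probability of a mismatch. The skill names, state labels, and the helper `fit_selector` are hypothetical.

```python
from collections import defaultdict

# Hypothetical task-agnostic skills: each maps a state to a distribution over actions.
def skill_push(state):
    return {"left": 0.8, "right": 0.2} if state == "s0" else {"left": 0.1, "right": 0.9}

def skill_pull(state):
    return {"left": 0.3, "right": 0.7} if state == "s0" else {"left": 0.9, "right": 0.1}

SKILLS = {"push": skill_push, "pull": skill_pull}

def fit_selector(demonstrations, skills):
    """Tabular surrogate (an assumption, not the patent's method): for each state,
    keep the skill that gives the demonstrated actions the most probability mass."""
    score = defaultdict(lambda: defaultdict(float))
    for state, action in demonstrations:
        for name, skill in skills.items():
            score[state][name] += skill(state).get(action, 0.0)
    return {state: max(by_skill, key=by_skill.get) for state, by_skill in score.items()}

# Demonstrations are (state, action) pairs; the skills themselves are never observed.
demos = [("s0", "left"), ("s1", "left"), ("s0", "left")]
selector = fit_selector(demos, SKILLS)
```

In this toy case neither skill reproduces the demonstration on its own: `push` matches it in state `s0` and `pull` matches it in state `s1`, so the selector switches skills by state.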

Accordingly, some embodiments train the function to reuse existing skills and to replace the learning objective with a selection objective. Such a selection can be performed deterministically or adaptively to account for possible obstacles and the nature of the task. A side effect of the embodiments may be that the learned function must be executed at every control step in order to execute a skill, even when the selected skill is a multi-step skill configured to run over multiple control steps. However, the selection objective makes it possible to simplify the reward function for reinforcement learning.

Specifically, reinforcement learning (RL) is an area of machine learning concerned with how a software agent should take actions in an environment so as to maximize some notion of cumulative reward. An RL agent interacts with its environment in discrete time steps. At each time t, the agent receives an observation o_t, which typically includes a reward r_t. The RL controller then selects an action a_t from the set of available actions so as to increase the reward, and the action a_t is subsequently sent to the environment. However, when learning a new task for robotic manipulation, many actions are effectively unbounded, whereas RL expects to select from a finite set of actions. To that end, the selection-based RL of some embodiments replaces the unbounded selection of actions with a bounded selection of skills from a finite library of skills. In effect, such selection-based RL simplifies the training of a task-specific function that uses a library of task-agnostic skills. For example, in one embodiment, a deep reinforcement learner (DRL) determines a value for each skill and its actions.
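A minimal sketch of the idea, under stated assumptions: the learner below chooses among a finite set of skills, while each skill emits a real-valued action. It uses a tabular Q-learning target with exhaustive sweeps over states and skills in place of exploration, purely to keep the example deterministic; the patent mentions deep RL methods instead. The 1-D environment, the skill names, and the constants are all hypothetical.

```python
GOAL, GAMMA = 4, 0.9  # hypothetical goal position and discount factor

# Hypothetical skills: each returns a real-valued displacement for the current position.
SKILLS = {
    "step_right": lambda pos: +1.0,
    "step_left": lambda pos: -1.0,
}

def env_step(pos, action):
    """Toy 1-D environment: apply the displacement, clamp to [0, GOAL], reward 1 at the goal."""
    nxt = int(min(max(pos + action, 0), GOAL))
    done = nxt == GOAL
    return nxt, (1.0 if done else 0.0), done

def learn_skill_values(sweeps=20):
    """Learn a value per (state, skill); the learner's choice set is the finite skill library."""
    q = {s: {name: 0.0 for name in SKILLS} for s in range(GOAL + 1)}
    for _ in range(sweeps):
        for s in range(GOAL):
            for name, skill in SKILLS.items():
                nxt, reward, done = env_step(s, skill(s))
                # Q-learning target with learning rate 1 (valid here because the toy
                # environment is deterministic).
                q[s][name] = reward if done else reward + GAMMA * max(q[nxt].values())
    return q

q_values = learn_skill_values()
greedy = {s: max(q_values[s], key=q_values[s].get) for s in range(GOAL)}
```

The learner never searches over displacements directly; it only ranks the two skills, which is the bounded selection described above.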

Different embodiments use different methods to train the parameterized function that forms the RL controller. For example, in some embodiments, the parameterized function is trained using one of a deep deterministic policy gradient method, an advantage actor-critic method, a proximal policy optimization method, a deep Q-network method, or a Monte Carlo policy gradient method.

Accordingly, one embodiment discloses a controller for controlling the operation of a robot to perform a task. The controller includes an input interface configured to accept the current state of the robot for each control step, and a memory. The memory is configured to store (1) a library of skills of the robot, wherein each skill is a probabilistic function of the state of the robot that returns a distribution over actions in response to submission of the current state, and wherein the skills are task-agnostic; and (2) a learned function for the task, wherein the learned function is a probabilistic function of the state of the robot that returns a distribution over skills in response to submission of the state of the robot. The controller further includes a processor configured, for each control step in a sequence of control steps that reaches a termination condition, to execute the learned function on the current state to select the skill having the highest probability in the distribution over skills, and to execute the selected skill on the current state to select the action having the highest probability in the distribution over actions, in order to transition the state of the robot from the current state to the next state. The controller further includes an output interface configured to command the robot to execute the selected action in order to perform the task.
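The per-step selection described in this embodiment can be sketched as a short loop. This is an illustrative reading, not the patented implementation: distributions are plain dictionaries, and the names `run_controller`, `most_probable`, `transition`, and the toy skills are hypothetical.

```python
def most_probable(distribution):
    """Return the key with the highest probability in an {item: probability} mapping."""
    return max(distribution, key=distribution.get)

def run_controller(state, learned_function, skills, transition, is_terminal, max_steps=100):
    """At every control step: pick the most probable skill, then its most probable action."""
    trace = []
    for _ in range(max_steps):
        if is_terminal(state):
            break
        skill_name = most_probable(learned_function(state))   # distribution over skills
        action = most_probable(skills[skill_name](state))     # distribution over actions
        trace.append((state, skill_name, action))
        state = transition(state, action)                      # stand-in for the robot and its environment
    return state, trace

# Toy setup (all hypothetical): the state is the gripper height in whole units.
skills = {
    "lower_gripper": lambda s: {"down": 0.9, "stay": 0.1},
    "hold": lambda s: {"stay": 1.0},
}

def learned_function(s):
    return {"lower_gripper": 0.8, "hold": 0.2} if s > 0 else {"lower_gripper": 0.1, "hold": 0.9}

def transition(s, action):
    return s - 1 if action == "down" else s

final_state, trace = run_controller(2, learned_function, skills, transition, lambda s: s == 0)
```

Note that the learned function is queried on every pass through the loop, even if the same skill is chosen repeatedly.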

Another embodiment discloses a method for controlling the operation of a robot to perform a task. The method uses a processor coupled to a memory. The memory stores (1) a library of skills of the robot, wherein each skill is a probabilistic function of the state of the robot that returns a distribution over actions in response to submission of the current state, and wherein the skills are task-agnostic; and (2) a learned function for the task, wherein the learned function is a probabilistic function of the state of the robot that returns a distribution over skills in response to submission of the state of the robot. The processor is coupled with stored instructions that, when executed by the processor, carry out the steps of the method, including: accepting the current state of the robot for each control step; for each control step in a sequence of control steps that reaches a termination condition, executing the learned function on the current state to select the skill having the highest probability in the distribution over skills; executing the selected skill on the current state to select the action having the highest probability in the distribution over actions, in order to transition the state of the robot from the current state to the next state; and outputting a command to the robot to execute the selected action in order to perform the task.

Definitions
In describing embodiments of the invention, the following definitions apply throughout (including above).

"Robot" refers to a machine whose fundamental purpose is to perform one or more tasks automatically.

"Task" refers to work to be performed by a robot, where the work is defined by a sequence of skills in the time domain. Alternatively, a task refers to a sequence of state/action pairs to be executed by the robot in the time domain.

"Skill" refers to a probabilistic function of the state of the robot that returns an action, or a distribution over actions, in response to submission of the current state.

"State" refers to the values of the robot and its environment at a particular time t. These values define the configuration of the robot at the particular time t. Examples of the configuration of the robot include motor settings, joint settings, gripper settings, tool settings, and the position of the robot. Examples of values of the environment include the positions of one or more objects, the positions of one or more obstacles, the distance to a target position, and an image of the environment.

"Action" refers to a vector of values that moves the robot from the current state to the next state at a particular time t. The vector of values determines the next configuration of the robot at the particular time t.

"Control step" refers to taking an action for the state at a particular time t.

"Computer" refers to any apparatus capable of accepting a structured input, processing the structured input according to prescribed rules, and producing the results of the processing as output. Examples of a computer include a general-purpose computer, a supercomputer, a mainframe, a super minicomputer, a minicomputer, a workstation, a microcomputer, a server, an interactive television, a hybrid combination of a computer and an interactive television, and application-specific hardware that emulates a computer and/or software. A computer may have a single processor or multiple processors, which may or may not operate in parallel. A computer also refers to two or more computers connected together via a network for transmitting or receiving information between the computers. An example of such a computer is a distributed computer system that processes information via computers linked by a network.

"Central processing unit (CPU)" or "processor" refers to a computer, or a component of a computer, that reads and executes software instructions.

"Memory" or "computer-readable medium" refers to any storage for storing data that is accessible by a computer. Examples include a magnetic hard disk, a floppy (registered trademark) disk, an optical disk such as a CD-ROM or a DVD, a magnetic tape, a memory chip, a carrier wave used to carry computer-readable electronic data, such as those used in sending and receiving e-mail or in accessing a network, and a computer memory such as random access memory (RAM).

"Software" refers to prescribed rules for operating a computer. Examples of software include code segments, instructions, computer programs, and programmed logic. The software of an intelligent system may be capable of self-learning.

"Module" or "unit" refers to a basic component in a computer that performs a task or part of a task. A "module" or "unit" can be implemented in either software or hardware.

"Controller" and/or "control system" refers to a device, or a set of devices, for managing, commanding, directing, or regulating the behavior of other devices or systems. A controller can be implemented in hardware, in a processor whose operation is configured by software, or in a combination thereof. A controller can be an embedded system.

The embodiments disclosed herein are further described with reference to the accompanying drawings. The drawings shown are not necessarily to scale; emphasis is instead generally placed on illustrating the principles of the embodiments disclosed herein.

A diagram showing a schematic overview of the principles used by some embodiments to train and control the operation of a robot to perform a task.
A block diagram of a robot control system for training and controlling the operation of a robot to perform a task.
A diagram showing exemplary skills stored in the skill library module of the robot control system.
A diagram showing a task demonstration for a robot using a remote control device, according to some embodiments.
A diagram showing a task demonstration for a robot by manual manipulation, according to some embodiments.
A diagram showing exemplary task demonstrations stored in the task demonstration module, according to some embodiments.
An exemplary architecture diagram of a learning module that trains a function for selecting skills, according to some embodiments.
A diagram showing an exemplary architecture for training the function using selection-based reinforcement learning, according to some embodiments.
An exemplary architecture diagram of an execution module that executes a task, according to some embodiments.
A flowchart showing operations performed by the execution module, according to some embodiments.
A diagram showing an exemplary training phase and execution phase for performing a task, according to some embodiments.

In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure. It will be apparent to those skilled in the art, however, that the present disclosure may be practiced without these specific details. In other instances, devices and methods are shown only in block diagram form in order to avoid obscuring the present disclosure.

As used in this specification and the claims, the terms "for example," "for instance," and "such as," and the verbs "comprising," "having," "including," and their other verb forms, when used in conjunction with a list of one or more components or other items, are each to be construed as open-ended, meaning that the list is not to be considered as excluding other, additional components or items. The term "based on" means at least partially based on. Further, it is to be understood that the phraseology and terminology employed herein are for the purpose of description and should not be regarded as limiting. Any headings used in this description are for convenience only and have no legal or limiting effect.

FIG. 1 shows a schematic overview of the principles used by some embodiments to train and control the operation of a robot to perform a task. A robot is a machine whose fundamental purpose is to perform one or more tasks automatically. A task is work that is difficult or repetitive for humans to perform. For example, tasks correspond to industrial work, household chores, serving in a restaurant, and the like. In some embodiments, the untrained robot 102 refers to a robot that has not been trained or is being trained. For a robot (for example, the untrained robot 102) to perform a task, the robot needs to be trained extensively for the task. However, tasks are complex and involve several steps to be performed. Therefore, to assist in training the robot, a task of the robot can be divided into subtasks (hereinafter referred to as skills). A task can be to assemble several parts into some composite object. The skills needed for the assembly can include moving the arm of the robot to the left, opening the gripper, lowering the gripper, picking up a part, raising the arm, moving the arm to the right to a specific position, lowering the arm, inserting the part into another part, and so on.

Some embodiments are based on the recognition that a robot that has been trained for certain tasks can be operated to perform a new task. For example, a robot trained to do household chores such as washing dishes and cleaning floors can be operated to wash dishes, clean floors, dispose of trash, and so on. However, in order to be operated in this way, the robot further requires rapid and extensive training for the new task. To that end, it is an object of some embodiments to train the robot to learn a new task. As shown in FIG. 1, the untrained robot 102 is trained by the robot control system 104, thereby forming a trained robot that performs the new task. In some embodiments, the robot control system 104 includes a learning module 104a that trains the untrained robot 102 to learn individual manipulation skills from human demonstrations. For example, the learning module 104a uses carefully designed demonstrations tailored to the specifics of the robot. However, this approach requires additional work to design the demonstrations, which is a long and time-consuming process. In some other embodiments, the learning module 104a uses reinforcement learning (RL) to train the robot to learn the new task effectively. In RL, an RL agent interacts with its environment in discrete time steps. At each time t, the agent receives an observation o_t, which typically includes a reward r_t. The RL controller then selects an action a_t from the set of available actions so as to increase the reward, and the action a_t is subsequently sent to the environment. However, when learning a new task for robotic manipulation, many actions are effectively unbounded, whereas RL expects to select from a finite set of actions. Moreover, RL requires designing a reward function suitable for learning the new task, and designing such reward functions is difficult.

To that end, it is an object of some embodiments to provide an improved training method for performing a new task. Accordingly, the robot control system 104, which includes the learning module 104a with a learning objective, is modified with a learning module 108a having a selection objective. The learning module 108a trains the untrained robot 102 to reuse existing task-agnostic skills, thereby converting the untrained robot into the trained robot 106. Moreover, the learning module 108a allows the embodiments to distance themselves from the specifics of the new task and to "mimic" only the effects of the new task. Additionally or alternatively, the learning module 108a allows some embodiments to select skills adaptively in real time to account for possible obstacles and the nature of the new task. Moreover, in some embodiments, the selection objective makes it possible to simplify the reward function for reinforcement learning. In some embodiments, the learning module 108a uses RL to train the untrained robot 102 to perform the new task efficiently. Specifically, the learning module 108a uses RL to train the untrained robot 102 to select skills from a finite library of existing task-agnostic skills. To that end, the selection-based RL of some embodiments replaces the unbounded selection of actions with a bounded selection of skills from a finite library of skills.

FIG. 2 shows a block diagram of a robot control system 200 for training and controlling the operation of a robot to perform a task, according to some embodiments. Hereinafter, "robot control system" and "controller" may be used interchangeably and mean the same thing. The controller 200 includes an input interface and an output interface for connecting the controller 200 to other systems and devices. In some embodiments, the input interface includes a human machine interface (HMI) 202 and an imaging interface 220. The HMI 202 may connect the controller 200 to input devices such as a keyboard 230 and/or a pointing device/medium 232. The pointing device 232 may include, for example, a mouse, a trackball, a touchpad, a joystick, a pointing stick, a stylus, or a touchscreen. The imaging interface 220 may be adapted to connect the controller 200 to an imaging device 226. The imaging device 226 may include an RGBD camera, a depth camera, an RGB camera, a grayscale camera, a computer, a scanner, a mobile device, a webcam, or any combination thereof. According to some embodiments, the input interface may be configured to receive the current state of the robot 236 for each control step of the robot 236. The input interface further includes a network interface controller (NIC) 212 adapted to connect the controller 200 to a network 234 via a bus 210. The controller 200 receives the current state of the robot 236 for each control step via the network 234, either wirelessly or by wire.

According to some embodiments, the input interface may be further configured to receive a set of executions of the task. The set of executions is a demonstration of the task that the robot is to perform. Hereinafter, "set of task executions" and "task demonstration" are used interchangeably and mean the same thing. Some embodiments are based on the recognition that the imaging interface 220 may receive a task demonstration for the robot 236 from the imaging device 226. In some embodiments, the imaging device 226 may provide the task demonstration in image and/or video format. Additionally or alternatively, the controller 200 may include an image processing module 218. The image processing module 218 is configured to process images and video in order to provide the task demonstration as input to the learning module 206d or to store the task demonstration in the task demonstration module 206b. In some embodiments, the image processing module 218 may augment the task demonstration input to the learning module 206d or the task demonstration stored in module 206b. Additionally or alternatively, the controller 200 includes a task processing module 216 that may be configured to receive task demonstrations from the robot 236 directly or via the network 234. In some embodiments, the task processing module 216 may be further configured to process the task demonstration in order to provide it as input to the learning module 206d or to store it in the task demonstration module 206b. The task processing module 216 acquires a task demonstration in some format and converts it into a sequence of state/action pairs. For example, a task may be received as robot commands, and the task processing module 216 may convert those commands into state/action pairs. The task processing module 216 may also be used to augment the task demonstration with elements provided by the image processing module 218.
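The conversion of robot commands into state/action pairs can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that a demonstration arrives as a list of joint displacement commands and that the state is the vector of joint values. The name `commands_to_pairs` and the additive state update are hypothetical and are not taken from the specification.

```python
from typing import List, Tuple

State = Tuple[float, ...]
Action = Tuple[float, ...]


def commands_to_pairs(
    initial_state: State, commands: List[Action]
) -> Tuple[List[Tuple[State, Action]], State]:
    """Turn a command log into state/action pairs plus the final state.

    Assumption: each command is a joint displacement that is added to the
    current joint values to obtain the next state.
    """
    pairs = []
    state = initial_state
    for action in commands:
        pairs.append((state, action))
        # Hypothetical transition model: next state = state + displacement.
        state = tuple(s + a for s, a in zip(state, action))
    return pairs, state


pairs, final_state = commands_to_pairs((0.0, 0.0), [(1.0, 0.0), (0.0, 2.0)])
```

A real system would record the states from the robot rather than integrate commands; the sketch only shows the shape of the resulting data.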

コントローラ２００は、プロセッサ２０４と、プロセッサ２０４によって実行可能な命令を格納するメモリ２０８とをさらに含む。プロセッサ２０４は、シングルコアプロセッサ、マルチコアプロセッサ、コンピューティングクラスタ、または、任意の数の他の構成であり得る。メモリ２０８は、ランダムアクセスメモリ（ＲＡＭ: random access memory）、リードオンリメモリ（ＲＯＭ: read only memory）、フラッシュメモリ、または、任意の他の適切なメモリシステムを含み得る。プロセッサ２０４は、バス２１０を介して１つ以上の入力デバイスおよび出力デバイスに接続される。格納された命令は、タスクを実行するようにロボット（たとえば、ロボット２３６）の動作を学習および制御するための方法を実現する。 The controller 200 further includes a processor 204 and a memory 208 that stores instructions that can be executed by the processor 204. Processor 204 can be a single-core processor, a multi-core processor, a computing cluster, or any number of other configurations. Memory 208 may include random access memory (RAM), read only memory (ROM), flash memory, or any other suitable memory system. Processor 204 is connected to one or more input and output devices via bus 210. The stored instructions provide a way to learn and control the behavior of a robot (eg, robot 236) to perform a task.

In some embodiments, the memory 208 may be further extended to include a storage 206. In some embodiments, the storage 206 includes a learned skill library module 206a, a task demonstration module 206b, a robot controller 206c, a learning module 206d, and an execution module 206e. The skill library module 206a may be configured to store a library of skills (i.e., a plurality of skills). A skill is a probabilistic function of the state of the robot (e.g., the robot 236) that returns an action, or a distribution over actions, in response to being given the current state of the robot. Some embodiments are based on the recognition that a skill in module 206a is a subsequence of a task, or a subtask. Furthermore, each skill may be a single state/action pair or a sequence of state/action pairs in the time domain. A skill in module 206a can therefore be a single-step skill or a multi-step skill. For example, a skill corresponding to one state/action pair may be termed a single-step skill, and a skill corresponding to a sequence of state/action pairs may be termed a multi-step skill. The skills stored in module 206a are described in detail with reference to FIG. 3.

いくつかの実施形態は、タスクデモンストレーションモジュール２０６ｂがタスクの実行のセット（すなわち、ロボット２３６のためのタスクデモンストレーション）を格納するように構成され得るという認識に基づいている。いくつかの実施形態では、タスクデモンストレーションは、状態／アクションの対のシーケンスとして入力デバイスによって観察または記録される一方、スキルは未知または未定義である。そのため、タスクデモンストレーションは、状態／アクションの対のシーケンスとしてモジュール２０６ｂに格納される。さらに、状態／アクションの対は、スキルよりも重要である。その理由は、状態／アクションの対がロボットの空間領域において実際にタスクを実行する一方で、スキルは、時間領域において状態／アクションの対のシーケンスを接続するからである。ロボットのためのタスクデモンストレーションは、図４Ａ〜図４Ｃを参照して詳細に説明される。 Some embodiments are based on the recognition that the task demonstration module 206b can be configured to store a set of task executions (ie, task demonstrations for robot 236). In some embodiments, the task demonstration is observed or recorded by the input device as a sequence of state / action pairs, while the skill is unknown or undefined. Therefore, the task demonstration is stored in module 206b as a sequence of state / action pairs. In addition, state / action pairs are more important than skills. The reason is that the state / action pair actually performs the task in the robot's spatial domain, while the skill connects a sequence of state / action pairs in the time domain. Task demonstrations for robots will be described in detail with reference to FIGS. 4A-4C.

ロボットコントローラ２０６ｃは、プロセッサ２０４による実行時に、ストレージ２０６内の１つ以上のモジュールを実行する命令を格納するように構成され得る。いくつかの実施形態は、ロボットコントローラ２０６ｃがロボットの学習フェーズおよび実行フェーズを管理するという認識に基づいている。 Robot controller 206c may be configured to store instructions to execute one or more modules in storage 206 when executed by processor 204. Some embodiments are based on the recognition that the robot controller 206c manages the learning and execution phases of the robot.

The learning module 206d may be configured to store instructions that, when executed by the processor 204, cause the processor 204 to train a function for selecting skills. Some embodiments are based on the recognition that the learning module 206d stores instructions for receiving a set of task executions, each defining a sequence of state/action pairs; instructions for determining, for each state, from the sequences of state/action pairs, a distribution of actions that fits the actions received for that state; and instructions for determining, for a state, the skill with the highest probability of returning an action sampled from the distribution of actions determined for that state. An object of some embodiments is to train the function to select skills in a way that penalizes switching between skills. To that end, some embodiments use reinforcement learning (RL) to train the function. For example, the RL uses a reward function that includes a penalty for switching between skills and a penalty based on the difference between the action selected by the RL and the action of the demonstrated task. The learning module 206d for training the robot to perform a task (i.e., the learning phase) is described in detail with reference to FIGS. 5-6.
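A reward of the kind described here, with one penalty for switching skills and one for deviating from the demonstrated action, might be sketched as follows. The weights `switch_penalty` and `action_weight`, the use of Euclidean distance as the action difference, and the function name `reward` are all assumptions made for illustration; the text does not specify them.

```python
import math
from typing import Optional, Sequence


def reward(
    selected_skill: int,
    previous_skill: Optional[int],
    selected_action: Sequence[float],
    demonstrated_action: Sequence[float],
    switch_penalty: float = 1.0,
    action_weight: float = 1.0,
) -> float:
    """Return a non-positive reward built from the two penalties."""
    # A switch only counts when there is a previous skill to switch from.
    switched = previous_skill is not None and selected_skill != previous_skill
    # Assumed action difference: Euclidean distance to the demonstrated action.
    mismatch = math.dist(selected_action, demonstrated_action)
    return -(switch_penalty * switched) - action_weight * mismatch
```

Under these assumptions, keeping the same skill and reproducing the demonstrated action yields a reward of zero, and any switch or deviation lowers it.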

実行モジュール２０６ｅは、スキルを選択するための学習済関数を格納するように構成され得る。学習済関数は、単一の引数関数である。いくつかの実施形態では、学習済関数は、ロボットの状態を提出することに応答して、スキルの分布を返すロボットの状態の確率関数である。さらに、実行モジュール２０６ｅは、選択されたスキルからアクションを選択する命令を格納するように構成され得る。さらに、実行モジュール２０６ｅは、各制御ステップについてスキル選択を実行する命令を格納するように構成され、当該スキルはマルチステップスキルに対応する。さらに、タスクを実行するための実行モジュール２０６ｅ（すなわち実行フェーズ）は、図７〜図９を参照して詳細に説明される。 Execution module 206e may be configured to store learned functions for selecting skills. A trained function is a single argument function. In some embodiments, the learned function is a probability function of the robot's state that returns a distribution of skills in response to submitting the robot's state. Further, the execution module 206e may be configured to store an instruction to select an action from the selected skill. Further, the execution module 206e is configured to store an instruction to execute skill selection for each control step, and the skill corresponds to a multi-step skill. Further, the execution module 206e (ie, the execution phase) for executing the task will be described in detail with reference to FIGS. 7-9.
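The execution phase can be sketched as a loop that, at each control step, queries the learned function for a distribution over skills, picks a skill, and lets that skill choose the action. The sketch below is a toy under stated assumptions: the state is a single number, the skill is chosen greedily (the text also allows sampling from the distribution), and the names `run_task`, `transition`, and `terminated` are hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def run_task(
    state: float,
    select_skill: Callable[[float], Dict[str, float]],
    skills: Dict[str, Callable[[float], float]],
    transition: Callable[[float, float], float],
    terminated: Callable[[float], bool],
    max_steps: int = 100,
) -> Tuple[float, List[Tuple[str, float]]]:
    """Run control steps until the termination condition holds."""
    trace = []
    for _ in range(max_steps):
        if terminated(state):
            break
        skill_probs = select_skill(state)  # learned function: state -> skill distribution
        skill_name = max(skill_probs, key=skill_probs.get)  # greedy choice (assumption)
        action = skills[skill_name](state)  # selected skill: state -> action
        trace.append((skill_name, action))
        state = transition(state, action)  # move the robot to the next state
    return state, trace


# Toy 1-D example: move right in unit steps until position 3 is reached.
toy_skills = {"move_right": lambda s: 1.0, "stop": lambda s: 0.0}
final_state, trace = run_task(
    0.0,
    lambda s: {"move_right": 0.9, "stop": 0.1},
    toy_skills,
    lambda s, a: s + a,
    lambda s: s >= 3.0,
)
```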

出力インターフェイスは、タスクを実行するために、選択されたアクションを実行するようにロボットに命令するように構成され得る。出力インターフェイスは、ディスプレイインターフェイス２１４およびアプリケーションインターフェイス２２２を含み得る。ディスプレイインターフェイス２１４は、コントローラ２００をディスプレイデバイス２２４に接続し得る。ディスプレイデバイス２２４はたとえば、コンピュータモニタ、カメラ、テレビジョン、プロジェクタ、または、モバイルデバイスを含み得る。アプリケーションインターフェイス２２２は、コントローラ２００をアプリケーションデバイス２２８に接続し得る。いくつかの実施形態では、アプリケーションデバイスは、コントローラ２００内において具現化されてもよい。そのため、アプリケーションデバイス２２８は、タスクを実行するために、選択されたアクションに基づいて動作する。たとえば、アプリケーションデバイス２２８は、タスクを実行するために、選択されたアクションを実行するよう、ロボットのための制御信号を提供する。 The output interface can be configured to instruct the robot to perform selected actions in order to perform a task. The output interface may include a display interface 214 and an application interface 222. The display interface 214 may connect the controller 200 to the display device 224. The display device 224 may include, for example, a computer monitor, camera, television, projector, or mobile device. The application interface 222 may connect the controller 200 to the application device 228. In some embodiments, the application device may be embodied within the controller 200. Therefore, the application device 228 operates based on the selected action to perform the task. For example, application device 228 provides a control signal for a robot to perform a selected action to perform a task.

図３は、いくつかの実施形態に従った、スキルライブラリモジュール２０６ａに格納される例示的なスキルを示す。いくつかの実施形態は、スキルライブラリモジュール２０６ａに格納されるスキルが、移動オプション、回転オプション、停止オプション、ピックオプション、および、配置オプションなどといった基本スキルであるという認識に基づいている。そのため、スキルライブラリモジュール２０６ａは、左への移動３０２、右への移動３０４、ＸＹＺ方向への移動３０６、上への移動３０８、下への移動３１０、停止３１２、対象の配置３１４、対象のピック３１６、および、工具の回転３１８などといったスキルを含む。いくつかの実施形態では、スキルライブラリモジュール２０６ａに格納されるスキルは、複雑なスキルに繋がる２つ以上の基本的スキルから構成されてもよい。たとえば、複雑なスキルは、ＸＹＺ方向への移動３０６および対象の配置３１６といった基本スキルを含み得る。いくつかの実施形態は、１つのタスクまたは複数のタスクを実行するためにモジュール２０６ａ内にスキルの複数のライブラリが存在するという認識に基づいている。たとえば、タスクが複数のライブラリからのスキルを要求する場合、モジュール２０６ａは、スキルの複数のライブラリを格納する。いくつかの他の実施形態では、いくつかのタスクデモンストレーションを実行するために新しいスキルをラーニングする必要性があり得る。そのため、ラーニングされた新しいスキルは、既存のスキルのライブラリに格納され得るか、または、ラーニングされた新しいスキルを格納するためにスキルの新しいライブラリが作成され得る。新しいスキルをラーニングする決定は、人間の専門家の決定に完全に依存する。いくつかの実施形態の目的は、タスクを実行するために、スキルライブラリモジュール２０６ａにおけるすべての必要なスキルを維持することである。 FIG. 3 shows exemplary skills stored in skill library module 206a, according to some embodiments. Some embodiments are based on the recognition that the skills stored in the skill library module 206a are basic skills such as move options, rotation options, stop options, pick options, and placement options. Therefore, the skill library module 206a has a left movement 302, a right movement 304, an XYZ movement 306, an up movement 308, a down movement 310, a stop 312, a target arrangement 314, and a target pick. Includes skills such as 316 and tool rotation 318. In some embodiments, the skill stored in the skill library module 206a may consist of two or more basic skills that lead to complex skills. For example, complex skills may include basic skills such as movement 306 in the XYZ direction and placement of objects 316. Some embodiments are based on the recognition that there are multiple libraries of skills within module 206a to perform one task or multiple tasks. For example, if the task requests skills from multiple libraries, module 206a stores multiple libraries of skills. 
In some other embodiments, it may be necessary to learn new skills in order to perform some task demonstrations. Therefore, the new learned skill can be stored in a library of existing skills, or a new library of skills can be created to store the new learned skill. The decision to learn a new skill depends entirely on the decision of a human expert. An object of some embodiments is to maintain all required skills in skill library module 206a in order to perform a task.

さらに、各スキルは確率関数としてモジュール２０６ａに格納される。そのため、モジュール２０６ａにおけるスキルは、現在の状態を提出することに応答して、アクションまたはアクションにわたる分布を返すロボットの状態の確率関数である。したがって、数学的に以下のように定義される。 Further, each skill is stored in the module 206a as a probability function. Thus, the skill in module 206a is a probability function of the robot's state that returns an action or distribution over the action in response to submitting the current state. Therefore, it is mathematically defined as follows.

したがって、モジュール２０６ａにおけるスキルは、 Therefore, the skill in module 206a is

などのように数学的に表され得る。 It can be expressed mathematically as such.
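As an illustration of a skill stored as a probabilistic function of the state, the following Python sketch represents one skill as a function that returns a discrete distribution over actions, together with a helper that samples an action from it. The action labels, the probabilities, and the names `move_left` and `sample_action` are hypothetical and are not taken from the specification.

```python
import random
from typing import Callable, Dict, Tuple

State = Tuple[float, ...]


def move_left(state: State) -> Dict[str, float]:
    """Hypothetical skill: a distribution over discrete actions for a state."""
    return {"dx=-1": 0.9, "dx=0": 0.1}


def sample_action(
    skill: Callable[[State], Dict[str, float]], state: State, rng: random.Random
) -> str:
    """Draw one action from the distribution the skill returns for the state."""
    dist = skill(state)
    actions = list(dist)
    weights = [dist[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


action = sample_action(move_left, (0.0,), random.Random(0))
```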

図４Ａは、いくつかの実施形態に従った、リモート制御デバイスを使用するロボットのためのタスクデモンストレーションを示す。図４Ａに示されるように、リモート制御デバイス４０２は、ロボット２３６のためのタスクをデモンストレーションするために使用され得る。いくつかの実施形態は、オペレータ４０４（たとえば、技術者およびユーザなど）がリモート制御デバイス４０２を動作し、次いでリモート制御デバイス４０２が、特定のタスクをデモンストレーションするようロボット２３６を動作するという認識に基づいている。いくつかの実施形態では、リモート制御デバイス４０２は、特定のタスクをデモンストレーションするために、ロボット構成設定（すなわち、ロボットの設定）をロボット２３６に送信する。たとえば、リモート制御デバイス４０２は、特定のタスクをデモンストレーションするために、ＸＹＺ方向への移動のような制御コマンド、速度制御コマンド、および、ジョイント位置コマンドなどを送信する。たとえば、図４Ａに示されるように、特定のタスクは、対象をピックするスキルおよび当該対象を上方向に移動するスキルといった２つのスキルによって定義され得る。リモート制御デバイス４０２は、特定のタスクをデモンストレーションするために、両方のスキルについてのロボット構成設定を送信する。いくつかの他の実施形態では、特定のタスクのデモンストレーションは、制御ステップにおいて行われる。たとえば、リモート制御デバイス４０２は、各制御ステップについてロボット構成設定を送信する。そのため、ロボット２３６のためにデモンストレーションされる特定のタスクは、入力デバイスによって状態／アクションの対で記録され得る。入力デバイスは、記録されたデータをコントローラ２００に提供し得る。 FIG. 4A shows a task demonstration for a robot using a remote control device, according to some embodiments. As shown in FIG. 4A, the remote control device 402 can be used to demonstrate a task for robot 236. Some embodiments are based on the recognition that an operator 404 (eg, a technician and a user) operates a remote control device 402, and then the remote control device 402 operates a robot 236 to demonstrate a particular task. ing. In some embodiments, the remote control device 402 sends robot configuration settings (ie, robot settings) to the robot 236 to demonstrate a particular task. For example, the remote control device 402 sends control commands such as movement in the XYZ directions, speed control commands, joint position commands, and the like to demonstrate a particular task. For example, as shown in FIG. 4A, a particular task can be defined by two skills: a skill that picks a target and a skill that moves the target upwards. The remote control device 402 sends robot configuration settings for both skills to demonstrate a particular task. In some other embodiments, the demonstration of a particular task is done in a control step. 
For example, the remote control device 402 transmits robot configuration settings for each control step. As such, the particular task demonstrated for Robot 236 can be recorded in state / action pairs by the input device. The input device may provide the recorded data to the controller 200.

図４Ｂは、いくつかの実施形態に従った、手動操作によるロボットのためのタスクデモンストレーションを示す。図４Ｂに示されるように、ロボット２３６は、特定のタスクをデモンストレーションするように手動で操作され得る。いくつかの実施形態は、特定のタスクをデモンストレーションするために、オペレータ４０４が、たとえばモータ、ジョイント、グリッパ、工具、ロボット２３６の位置といったロボット２３６の設定を手動で制御するという認識に基づいている。いくつかの実施形態では、手動操作によってデモンストレーションされるタスクは、デモンストレーションされるタスクについてスキルおよびスキル構成を定義しなくてもよい。そのため、タスクデモンストレーションは、時間領域において状態／アクションの対で入力デバイスによって記録される。入力デバイスは、記録されたデータをコントローラ２００に提供し得る。 FIG. 4B shows a task demonstration for a manually operated robot according to some embodiments. As shown in FIG. 4B, the robot 236 can be manually manipulated to demonstrate a particular task. Some embodiments are based on the recognition that the operator 404 manually controls the robot 236 settings, such as the position of the motor, joints, grippers, tools, robot 236, to demonstrate a particular task. In some embodiments, the manually demonstrated task does not have to define skills and skill configurations for the demonstrated task. As such, task demonstrations are recorded by the input device in state / action pairs in the time domain. The input device may provide the recorded data to the controller 200.

いくつかの他の実施形態では、タスクのデモンストレーションはシミュレーションによって行われ得る。いくつかの他の実施形態では、タスクデモンストレーションは、たとえば工具といったロボットの端部の動作について、タスクデモンストレーション中に到達する３Ｄ空間におけるリストポイントの実行であり得る。いくつかの他の実施形態では、タスクデモンストレーションは、手動操作とリモート制御デバイス操作との組み合わせによって達成され得る。いくつかの他の実施形態では、イメージングデバイス２２６は、タスクデモンストレーションを記録し得、画像処理２１８は、視覚的要素によりタスクデモンストレーションを増強するよう、記録されたタスクデモンストレーションを処理し得る。さらに、ロボットのためのタスクデモンストレーションは、上述の例示的な説明に限定され得ない。代わりに、本発明の範囲内にあるすべての種類のデモンストレーションが考慮されるべきである。 In some other embodiments, the task demonstration can be done by simulation. In some other embodiments, the task demonstration can be the execution of wrist points in 3D space reached during the task demonstration for the movement of the end of the robot, such as a tool. In some other embodiments, task demonstrations can be achieved by a combination of manual operation and remote control device operation. In some other embodiments, the imaging device 226 may record the task demonstration and the image processing 218 may process the recorded task demonstration so as to enhance the task demonstration by visual elements. Moreover, task demonstrations for robots cannot be limited to the exemplary description described above. Instead, all kinds of demonstrations within the scope of the invention should be considered.

図４Ｃは、いくつかの実施形態に従った、タスクデモンストレーションモジュール２０６ｂに格納される例示的なタスクデモンストレーションを示す。いくつかの実施形態の目的は、入力デバイスから受け取られたタスクの実行のセット（すなわち、ロボットのためのタスクデモンストレーション）を格納することである。いくつかの実施形態は、タスクデモンストレーションが状態／アクションの対で受け取られる一方、スキルは未知または未定義であるという認識に基づいている。さらに、状態／アクションの対は、スキルよりも重要である。その理由は、状態／アクションの対は、ロボットの空間領域において実際にタスクを実行する一方、スキルは時間領域において状態／アクションの対のシーケンスを接続するからである。そのため、受け取られたタスクデモンストレーションは、状態／アクションの対のシーケンスとしてタスクデモンストレーションモジュール２０６ｂに格納され得る。いくつかの実施形態では、状態／アクションの対のシーケンスは、時間領域において定義される。たとえば、タスクデモンストレーションは、開始時間から終了時間までの状態／アクションの対のシーケンスとして表される。 FIG. 4C shows an exemplary task demonstration stored in the task demonstration module 206b, according to some embodiments. An object of some embodiments is to store a set of task executions (ie, task demonstrations for a robot) received from an input device. Some embodiments are based on the recognition that the skill is unknown or undefined, while the task demonstration is received in a state / action pair. In addition, state / action pairs are more important than skills. The reason is that state / action pairs actually perform tasks in the robot's spatial domain, while skills connect a sequence of state / action pairs in the time domain. Therefore, the received task demonstration can be stored in the task demonstration module 206b as a sequence of state / action pairs. In some embodiments, the sequence of state / action pairs is defined in the time domain. For example, a task demonstration is represented as a sequence of state / action pairs from start time to end time.

A state S_t at a particular time t contains values that define the current configuration of the robot and may be augmented with elements from its environment. In some embodiments, the robot configuration includes, but is not limited to, settings of motors, joints, grippers, tools, positions, images, or any combination thereof. The values represent, for example, the joint values for a particular robot configuration. Examples of values from the environment are the position of one or more objects, the position of one or more obstacles, the distance to a target position, and an image of the environment. An action A_t at a particular time t contains a vector of values that determines a different configuration of the robot. Some embodiments are based on the recognition that an action transitions the robot from one state to the next. An action may be discrete or continuous in nature, and may be a single value, a vector of values, and so on.

In FIG. 4C, block 406 represents a task demonstration stored in the task demonstration module 206b. The task demonstration module 206b stores not only the task demonstration 406 but several task demonstrations. For example, the task demonstration module 206b stores a finite number of task demonstrations. Block 460 may represent the last task demonstration stored for the robot. The task demonstration 406 starts at time t_0 and stores the state and action 406-0 at time t_0 (e.g., the state/action pair S_t0, A_t0). In some embodiments, the action (i.e., A_t0) to be taken for the state (i.e., S_t0) is defined as the control step for time t_0. The task demonstration module 206b continues to store the task demonstration 406 until its end time. At the end time t_N, the task demonstration module 206b stores the state and action 406-N (e.g., the state/action pair S_tN, A_tN). The task demonstration 406 therefore contains N+1 state/action pairs for N+1 time instances. Once the final action A_tN has been demonstrated at time t_N, the robot is in state S_tN+1, so the state 406-N+1 is also stored as part of the task demonstration 406. According to some embodiments, the stored task demonstration 406 may be represented as a trajectory, i.e., as the sequence S_t0, A_t0, ..., S_tN, A_tN, S_tN+1.

Further, the task demonstration module 206b stores each of the finite number of task demonstrations in the manner described for the task demonstration 406. According to some embodiments, the task demonstrations in the task demonstration module 206b may have the same or different sequences of state/action pairs. For example, the state/action pairs 460-0 to 460-O may be the same as or different from the state/action pairs 406-0 to 406-N. Likewise, the first state/action pairs of the demonstrations may be the same or different. The task demonstrations in the task demonstration module 206b may also have the same length or different lengths. For example, the task demonstration 460 may contain O+1 state/action pairs for O+1 time instances, while the task demonstration 406 contains N+1 state/action pairs for N+1 time instances. The task demonstration module 206b thus stores a finite number M of task demonstrations, where M ≥ 1. According to some embodiments, the M task demonstrations may be represented as a set of trajectories D, i.e.,

として表され得る。 Can be expressed as.
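One possible in-memory layout for the stored demonstrations is sketched below in Python: each demonstration holds its state/action pairs plus the state reached after the last action, and the set D is a list of M such demonstrations that may differ in length. The class name `Demonstration`, its field names, and the toy values are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

State = Tuple[float, ...]
Action = Tuple[float, ...]


@dataclass
class Demonstration:
    pairs: List[Tuple[State, Action]]  # N+1 state/action pairs, in time order
    final_state: State                 # state reached after the last action


# Two toy demonstrations of different lengths (N+1 = 2 and O+1 = 1).
demo_406 = Demonstration(
    pairs=[((0.0,), (1.0,)), ((1.0,), (1.0,))], final_state=(2.0,)
)
demo_460 = Demonstration(pairs=[((0.0,), (2.0,))], final_state=(2.0,))

D = [demo_406, demo_460]  # the set of trajectories, here with M = 2
```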

図５は、いくつかの実施形態に従った、スキルを選択するための関数を学習させるための学習モジュール２０６ｄの例示的なアーキテクチャ図を示す。学習モジュール２０６ｄは、モジュール２０６ｂにおけるタスクデモンストレーションに基づいてスキルライブラリモジュール２０６ａからスキルを選択するための関数を学習させるように構成され得る。関数の学習は、スキルを選択するためにロボットを学習させることとして考えられ得る。いくつかの実施形態では、スキルライブラリモジュール２０６ａにおけるスキルは、タスクを実行するのにタスク不可知論的である。いくつかの実施形態は、特定の状態について特定のアクションを規定するタスクデモンストレーションモジュール２０６ｂにおけるタスクデモンストレーションが、タスクを実行するためにどのスキルをスキルライブラリモジュール２０６ａから選択するべきかについての指示を有さないという認識に基づいている。そのため、いくつかの実施形態の目的は、タスクを実行するようにスキルを選択するための関数を学習させることである。 FIG. 5 shows an exemplary architectural diagram of a learning module 206d for training a function for selecting a skill, according to some embodiments. The learning module 206d may be configured to train a function for selecting a skill from the skill library module 206a based on a task demonstration in the module 206b. Learning a function can be thought of as training a robot to select a skill. In some embodiments, the skill in the skill library module 206a is task agnostic to perform the task. In some embodiments, the task demonstration in the task demonstration module 206b, which defines a particular action for a particular state, has instructions as to which skill should be selected from the skill library module 206a to perform the task. It is based on the recognition that it does not exist. Therefore, the purpose of some embodiments is to train a function for selecting a skill to perform a task.

Some embodiments are based on the recognition that the learning module 206d is configured to receive the set of task executions (i.e., the task demonstrations for the robot) from at least one of the task processing module 216, the image processing module 218, or the task demonstration module 206b. A received task demonstration is a sequence of state/action pairs, as illustrated in FIG. 4C. The received demonstrations are demonstrations of the task that the robot needs to perform at runtime. In some embodiments, all task demonstrations (i.e., all M task demonstrations for a particular task) may be received in order to train the function. In some embodiments, the M received task demonstrations are represented as a set of trajectories D, i.e.,

として表され、当該式は、 The formula is expressed as

と等価である。 Is equivalent to.

Some embodiments are based on the recognition that the learning module 206d is further configured to determine, for each state in the set of trajectories D, a distribution of actions that fits the actions received for that state. For purposes of explanation, consider the state S_t0. At time t_0, the state S_t0 receives M actions A_t0 from the M task demonstrations in the set D. In some embodiments, the M actions A_t0 from the M task demonstrations may be the same or different. The learning module 206d therefore determines, for the state S_t0, a distribution over the M actions A_t0. The learning module 206d repeats this determination, as described for the state S_t0, for each state in the set of trajectories D, and thereby determines the distributions of actions for all states in the set D. The determination by the learning module 206d, for each state in the set of trajectories D, of a distribution of actions that fits the actions received for that state is described in detail with reference to FIG. 9.
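A minimal Python sketch of fitting a distribution to the M actions observed at each time step follows. It assumes scalar actions, demonstrations of equal length, and a Gaussian fit described by its mean and standard deviation; none of these choices is stated in the text, and the function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def fit_action_distribution(actions: Sequence[float]) -> Tuple[float, float]:
    """Fit a 1-D Gaussian (mean, standard deviation) to the M actions of one state."""
    return mean(actions), pstdev(actions)


def fit_all_steps(demo_actions: List[List[float]]) -> List[Tuple[float, float]]:
    """Fit one distribution per time step across all demonstrations."""
    # zip(*...) groups the M actions that were taken at the same time step.
    return [fit_action_distribution(step) for step in zip(*demo_actions)]


# Two demonstrations (M = 2), each with two time steps.
dists = fit_all_steps([[1.0, 0.0], [3.0, 0.0]])
```

When all demonstrations agree at a time step, the fitted standard deviation is zero, so a practical implementation would likely add a small floor to it.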

In some embodiments, the learning module 206d is configured to sample, for each state in the set of trajectories D, an action from the distribution of actions determined for that state. Some embodiments are based on the recognition that the sampled action corresponds to the action with the highest probability under the distribution of actions. Indeed, an object of some embodiments is to determine, for a state in the set of trajectories D, a skill from module 206a based on the action sampled for that state. It is therefore necessary to train a function that determines, for a state, the skill from module 206a that has the highest probability of returning the action sampled from the distribution of actions determined for that state.
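Choosing the skill most likely to return a given sampled action can be sketched as an argmax over the skills' likelihoods. In the Python sketch below, each skill is modeled as a Gaussian density over a scalar action; the skill names, means, and standard deviation are hypothetical stand-ins for the learned skills in module 206a.

```python
import math
from typing import Callable, Dict


def gaussian_pdf(x: float, mu: float, sigma: float) -> float:
    """Density of a 1-D Gaussian at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


# Each hypothetical skill maps (action, state) to the likelihood of that action.
skills: Dict[str, Callable[[float, float], float]] = {
    "move_left": lambda action, state: gaussian_pdf(action, -1.0, 0.5),
    "move_right": lambda action, state: gaussian_pdf(action, 1.0, 0.5),
    "stop": lambda action, state: gaussian_pdf(action, 0.0, 0.5),
}


def most_likely_skill(
    state: float, action: float, library: Dict[str, Callable[[float, float], float]]
) -> str:
    """Return the name of the skill that assigns the action the highest likelihood."""
    return max(library, key=lambda name: library[name](action, state))


best = most_likely_skill(0.0, 0.8, skills)
```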

いくつかの実施形態では、学習モジュール２０６ｄは、モジュール２０６ａからのスキルを決定するための関数を学習させるように構成される。さらに、関数の学習は、以下に詳細に記載される。 In some embodiments, the learning module 206d is configured to train a function for determining skills from module 206a. In addition, function learning is described in detail below.

説明の明確性のために、タスクの単一のデモンストレーション、すなわち、Ｍ＝１を考える。この問題は以下のように定式化される。 For clarity of explanation, consider a single demonstration of the task, namely M = 1. This problem is formulated as follows.

Available information:
The sequence of actions is represented as A = A_{t0:tN}.

The sequence of states is represented as S = S_{t0:tN+1}.

The discrete set of skills in module 206a is represented as H = {H_1, H_2, ..., H_j}.

Information to be derived:
The determination of the sequence of selected skills that maximizes the likelihood is expressed as L_θ(H, S, A). L_θ(H, S, A) expresses that the discrete skills selected in sequence from the set of skills H make the observations S, A most likely, that is, give them the highest probability. In some embodiments, the discrete skills selected in the sequence are a subset of the set of skills H or equal to the set of skills H. The maximization over the sequence of skills H can therefore be achieved by maximum likelihood optimization. Furthermore, using Bayesian probability theory and dropping the terms that play no role in the optimization, maximizing L_θ(H, S, A) is equivalent to

Since H is unknown in L_θ(H, S, A), it is ignored as follows.

In equation (1), π^h(A|S) is a previously learned skill from the skill library module 206a, and f_θ(S) is a function that returns a skill distribution given a state. For example, the function f_θ(S) is a probability distribution function of the state that, given an input state S, returns a distribution over the skills. According to some embodiments, f_θ(S) provides the ability to formulate a termination function. Given a state S, the termination function indicates whether the current skill terminates (represented by the value 1) or continues (represented by the value 0). For example, the termination function is defined as the function β(S) → 0/1.
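The roles of f_θ(S) and β(S) can be illustrated with a minimal sketch. It assumes, purely for illustration, that f_θ is a softmax over one linear score per skill and that β(S) returns 1 whenever the current skill is no longer the most probable one; the names `f_theta`, `beta` and the parameter matrix `theta` are hypothetical and are not taken from the description.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def f_theta(theta, state):
    """Skill distribution for a state: softmax of one linear score per skill
    (an assumed parameterization, chosen only to keep the sketch small)."""
    scores = [sum(w * x for w, x in zip(row, state)) for row in theta]
    return softmax(scores)

def beta(theta, state, current_skill):
    """Termination function: 1 if the current skill ends, 0 if it continues.
    Assumption: a skill ends when it is no longer the most probable skill."""
    probs = f_theta(theta, state)
    best = max(range(len(probs)), key=probs.__getitem__)
    return 0 if best == current_skill else 1

theta = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical parameters for two skills
state = [2.0, 0.5]                # hypothetical two-dimensional state
probs = f_theta(theta, state)
```

With these values, skill 0 receives the larger probability, so `beta(theta, state, 0)` signals continuation and `beta(theta, state, 1)` signals termination.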

Some embodiments are based on the recognition that the action, or distribution of actions, given by π^h(A|S) for a state S_t should match as closely as possible the distribution of actions determined for the state S_t (that is, the distribution of actions determined for the state S_t from the M demonstrations), so that the skill π^h(A|S) can be determined for the state S_t. The learning module 206d therefore generates a constraint for equation (1); for example, the constraint can be

The action A_t in the constraint can be the sampled action determined for the state S_t. In some embodiments, the action A_t can be one of the actions in the distribution of actions determined for the state S_t.

can be the minimum value of the difference (for example, a metric such as a probability value or a distance) between the action returned by the skill π^h(A|S) for the state S_t and the distribution of actions determined for the state S_t. Equation (1) is therefore transformed as follows.

Equation (2) is a single-argument function with the constraint

In equation (2), the fact that the task demonstrations are stored at a finite number of time instances is omitted for clarity. To take the time instances into account, equation (2) can be extended by a sum over each time instance t. Equation (2) can further be extended, as follows, to a sum over all state-action pairs in a demonstration d, where d belongs to the set of demonstrations D.

Equation (2) can be learned in a variety of ways. In one embodiment, equation (2) can be learned by the well-known technique of expectation-maximization (EM) or some variant of EM. In another embodiment, equation (2) can be learned by a reinforcement learning (RL) technique with a reward function that minimizes the deviation of the result of the task execution. The function learned with the RL technique is described in detail with reference to FIG. 6. The function is thus trained (learned function 502) to determine, for the state S_t, the skill from module 206a with the highest probability of returning the action sampled from the distribution of actions determined for the state S_t. For example, in equation (2),

determines, for the state S_t, the skill π^h(A|S) with the highest probability of returning the action sampled from the distribution of actions determined for the state S_t.

In addition, the constraint

can be given as a soft constraint as follows.

In equation (3), p denotes the norm, for example the 0-norm or the 2-norm, and typically n = 1 or n = 2. Here, p and n are so-called hyperparameters, that is, parameters chosen by the person who designs the final optimization function. In some embodiments, equation (3) can be the output of the learning module 206d as the learned function 502. In some other embodiments, the output of the learning module 206d can be the sequence of skills determined by the learned function 502 for all states in the set D. For example, a set H can be generated to store the determined skills in sequence, that is, H = {H_4, H_7, ..., H_1}. The skills determined in the sequence can also be stored in a list L. For example, the list L can be generated to store the actions or distributions of actions in sequence, that is, L = {π_4(s), π_7(s), ..., π_1(s)}. Furthermore, in some embodiments, the learning module 206d returns the termination function β(S). Whenever the termination function β(S) returns 1, the probability function π^{hl} associated with the next skill in the list L is selected and then executed. In some embodiments, the termination function β(S) determines the time at which the skill currently selected in the list L is switched to the next skill in order to perform the task.

FIG. 6 shows an exemplary architecture for training the function using selection-based reinforcement learning, according to some embodiments. Reinforcement learning (RL) is an area of machine learning concerned with how a software agent should take actions in an environment so as to maximize some notion of cumulative reward. An RL agent interacts with its environment in discrete time steps. At each time t, the agent receives an observation o_t, which typically includes a reward r_t. The RL controller then selects an action a_t from the set of available actions so as to increase the reward, and the action a_t is sent to the environment. However, when learning a new task for robot manipulation, many actions are effectively unbounded, whereas RL expects a selection from a finite set of actions. FIG. 6 therefore shows a selection-based RL architecture that replaces the unbounded selection of actions with a bounded selection of skills from a finite library of skills.

In FIG. 6, block 602 may represent the task demonstration module 206b. In block 604, a state observer records the state S_t at time t in module 206b. In block 606, an action observer records the actions received for the state S_t from the multiple sets of executions (that is, the M task demonstrations). In block 608, a reward function is defined for the state S_t that minimizes the deviation of the result of the task execution. Some embodiments are based on the recognition that the reward function includes a penalty for switching skills. To that end, the reward function includes the termination function β(S) for switching skills. For example, the termination function specifies the time at which the current skill is switched to another skill in module 206b. In other embodiments, the reward function includes a penalty based on the difference between the actions recorded in module 206b and the generated action 614.
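A reward with the two penalties mentioned for block 608 could look like the following sketch. The squared-error form, the function name `reward` and the default `switch_penalty` value are assumptions made for illustration; the description does not fix a particular formula.

```python
def reward(recorded_action, generated_action, switched_skill,
           switch_penalty=0.1):
    """Reward for one control step (illustrative form only).

    The first term penalizes the difference between the action recorded in
    the demonstration and the action generated by the selected skill; the
    second term penalizes a change of skill at this step.
    """
    error = sum((r - g) ** 2
                for r, g in zip(recorded_action, generated_action))
    return -error - (switch_penalty if switched_skill else 0.0)
```

A step that reproduces the recorded action without switching skills receives the maximum reward of zero; any deviation or switch lowers it.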

In block 610, the RL controller samples an action from the actions received for the state S_t. The sampled action corresponds to the highest-probability action for the state S_t. The RL controller then selects from module 206a, for the state S_t, the skill with the highest probability of returning the sampled action or the actions received for the state S_t. Accordingly, in block 612, the RL controller selects a skill from module 206a, which contains a bounded, finite number of skills. The skill returns an action, or a bounded, finite distribution of actions, for the given state S_t. In block 614, the action with the highest probability in the finite distribution of actions is selected. The selection-based RL of some embodiments thus replaces the unbounded selection of actions with a bounded selection of skills from a finite library of skills. In effect, such selection-based RL simplifies the learning of a task-specific function that uses a library of task-agnostic skills. For example, in one embodiment, a deep reinforcement learner (DRL) determines a value for each skill and its actions.
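The idea of choosing among a finite set of skills rather than among unbounded actions can be sketched with a small tabular learner. This is a deliberately simplified contextual-bandit stand-in, not the deep reinforcement learner of the embodiment: the skill tables, the +1/-1 reward and all names (`SKILLS`, `DEMONSTRATED`, `train_skill_selector`) are hypothetical.

```python
import random

# Hypothetical library: each skill maps the states it covers to one action.
SKILLS = {
    "H1": {"S0": "A0"},
    "H2": {"S1": "A1"},
    "H3": {"S2": "A2", "S3": "A3"},
}
# Highest-probability demonstrated action per state (assumed given).
DEMONSTRATED = {"S0": "A0", "S1": "A1", "S2": "A2", "S3": "A3"}

def train_skill_selector(episodes=50, alpha=0.5, epsilon=0.1, seed=0):
    """Learn one value per (state, skill) pair and return the greedy choice."""
    rng = random.Random(seed)
    names = list(SKILLS)
    q = {s: {h: 0.0 for h in names} for s in DEMONSTRATED}
    for _ in range(episodes):
        for state, target in DEMONSTRATED.items():
            # Bounded selection: the choice is one of len(SKILLS) skills.
            if rng.random() < epsilon:
                skill = rng.choice(names)
            else:
                skill = max(names, key=q[state].get)
            action = SKILLS[skill].get(state)  # None if the skill does not cover the state
            r = 1.0 if action == target else -1.0
            q[state][skill] += alpha * (r - q[state][skill])
    return {s: max(names, key=q[s].get) for s in q}

policy = train_skill_selector()
```

Because the learner only ever picks one of three skills per state, the search space stays finite regardless of how large the underlying action space is.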

Different embodiments may use different methods to train the parameterized function that forms the RL controller. For example, in some embodiments, the parameterized function is trained using one of a deep deterministic policy gradient method, an advantage actor-critic method, a proximal policy optimization method, a deep Q-network method, or a Monte Carlo policy gradient method.

FIG. 7 shows an exemplary architecture of the execution module 206e for performing a task, according to some embodiments. As shown in FIG. 7, in block 702, the execution module 206e is configured to receive the current state of the robot 236 for each control step via the input interface. In block 704, the execution module 206e is configured to store the learned function for performing the task. As shown in FIG. 5, the learned function is a distribution over skills returned in response to a submitted robot state. In block 706, the execution module 206e is configured to execute the learned function on the current state, for each control step in a sequence of control steps that reaches a termination condition, to select the skill with the highest probability in the distribution over skills. As shown in FIG. 3, a skill is a probability function of the robot state that returns an action, or a distribution over actions, in response to the submitted current state. In block 708, the execution module 206e is configured to execute the selected skill on the current state to select the action with the highest probability in the distribution of actions, which transitions the robot from the current state to the next state. In block 710, the execution module 206e is configured to output the selected robot action via the output interface to instruct the robot to perform the task.

FIG. 8 is a flowchart showing the operations performed by the execution module 206e, according to some embodiments. In block 802, the execution module 206e is configured to obtain the state of the robot for a control step. In some embodiments, the state of the robot is obtained via the input interface. In block 804, the execution module 206e is configured to execute the learned function on the obtained state to select the skill with the highest probability in the distribution over skills. Alternatively, the execution module 206e executes the learned function on the obtained state to select the skill corresponding to the highest probability in the list L. The selected skill can be a single-step skill or a multi-step skill. In some embodiments, a multi-step skill is a skill that can be configured to execute on multiple states of multiple control steps and return the corresponding multiple actions. In block 806, the execution module 206e is configured to execute the selected skill on the obtained state to select the action with the highest probability in the distribution of actions. In block 808, the execution module 206e is configured to output the selected action to instruct the robot to perform it.

It should be understood that when the robot performs the selected action, the robot transitions from the obtained state to the next state. The execution module 206e then obtains, in block 802, the next state for the next control step. In some embodiments, the execution module 206e executes, in block 804, the termination function β(S) on the next state before executing the learned function on the next state. For the next state, the termination function β(S) returns 0 to indicate that the current skill continues, or 1 to indicate that a skill is to be selected (that is, the termination condition). In some embodiments, the termination function β(S) returns 0 if the previously selected skill (that is, the skill selected for the obtained state) is a multi-step skill and has the highest probability of returning the sampled action for the next state; otherwise, the termination function β(S) returns 1. When the termination function returns 0, the execution module 206e skips, in block 804, the execution of the learned function on the next state. When the termination function returns 1, the execution module 206e executes, in block 804, the learned function on the next state. In some other embodiments, the execution module 206e executes, in block 804, the learned function on the next state without executing the termination function β(S). In that case, the execution module 206e repeats the execution of the learned function on the next state even if the previously selected skill is a multi-step skill that has the highest probability of returning the sampled action for the next state. In block 806, the execution module 206e executes the selected skill to select the action with the highest probability in the distribution of actions. In block 808, the execution module 206e outputs the selected action. The flowchart 800 is repeated until all control steps of the task have been performed.
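The control loop of flowchart 800 can be sketched as follows. The toy states, the skill tables and every name in the block (`run_task`, `learned_function`, `beta`, `step`, `ASSIGNED`) are hypothetical stand-ins; in particular, `step` merely simulates the robot moving to the next state.

```python
def run_task(initial_state, learned_function, skills, beta, step, is_done,
             max_steps=100):
    """Run control steps in the style of flowchart 800 and return the trace."""
    state, skill, trace = initial_state, None, []
    for _ in range(max_steps):
        if is_done(state):
            break
        # Block 804: select a skill only if none is active or beta returns 1.
        if skill is None or beta(state, skill) == 1:
            probs = learned_function(state)  # {skill name: probability}
            skill = max(probs, key=probs.get)
        # Block 806: the skill returns a distribution over actions.
        action_probs = skills[skill](state)
        action = max(action_probs, key=action_probs.get)
        trace.append((state, skill, action))
        # Block 808: output the action; `step` stands in for the robot.
        state = step(state, action)
    return trace

# Toy setup: H3 is a multi-step skill covering S2 and S3.
SKILLS = {
    "H1": lambda s: {"A0": 1.0},
    "H2": lambda s: {"A1": 1.0},
    "H3": lambda s: ({"A2": 0.9, "A3": 0.1} if s == "S2"
                     else {"A2": 0.1, "A3": 0.9}),
}
ASSIGNED = {"S0": "H1", "S1": "H2", "S2": "H3", "S3": "H3"}

def learned_function(state):
    return {h: (1.0 if h == ASSIGNED[state] else 0.0) for h in SKILLS}

def beta(state, skill):
    return 0 if ASSIGNED[state] == skill else 1

def step(state, action):
    return "S" + str(int(state[1:]) + 1)

trace = run_task("S0", learned_function, SKILLS, beta, step,
                 lambda s: s == "S4")
```

In this run the learned function is skipped at S3 because `beta` returns 0 there, so H3 stays active across two control steps.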

FIG. 9 shows exemplary learning and execution phases for performing a task, according to some embodiments. Portion 902 is the preparation phase for performing the task. As shown in FIG. 9, the preparation phase 902 includes the skill library module 206a and the task demonstration module 206b. Module 206a stores a bounded, finite number of skills in the form of one library or multiple libraries. For example, module 206a stores skill 906a (that is, skill H1), skill 908a (that is, skill H2), and skill 910a (that is, skill H3). Each skill in the module is a probability function of the robot state that returns an action, or a distribution of actions, in response to the submitted current state. For example, skill 906a returns action A_0 for state S_0, skill 908a returns action A_1 for state S_1, and skill 910a returns a distribution of actions for states S_2 and S_3. In the distribution returned by skill 910a, A_2 has the highest probability for state S_2 and A_3 has the highest probability for state S_3. Some embodiments are based on the recognition that the skills stored in module 206a are task-agnostic.

The task demonstration module 206b stores demonstrations (that is, multiple executions) of the task as state-action pairs (that is, {S_0-A_0, S_1-A_1, S_2-A_2, S_3-A_3, S_4}), while the skills are unknown or undefined. For example, the task demonstration module 206b stores task demonstration 906b (that is, {S_0-A_0, S_1-A_1, S_2-A_2, S_3-A_3, S_4}), task demonstration 908b (that is, {S_0-A_0, S_1-A_1, S_2-A_2, S_3-A_3, S_4}), and task demonstration 910b (that is, {S_0-A_1, S_2-A_2, S_3-A_3, S_4}). The state S_0 is the configuration of the robot at time t_0, and the action A_0 represents a vector of values at time t_0 that moves the robot from the state S_0 to the next state S_1. As shown in FIG. 9, task demonstrations 906b and 908b are recorded correctly. In task demonstration 910b, however, the state S_1 is not recorded because of some random error.

いくつかの実施形態は、タスクデモンストレーション９０６ｂ、９０８ｂおよび９１０ｂが状態／アクションの対のシーケンスであるが、スキルに関する指示を有さないという認識に基づいている。そのため、学習モジュール２０６ｄは、モジュール２０６ｂにおけるタスクデモンストレーション９０６ｂ、９０８ｂおよび９１０ｂに基づいてタスクを実行するために、モジュール２０６ａからスキルを選択するよう関数を学習させる。 Some embodiments are based on the recognition that task demonstrations 906b, 908b and 910b are a sequence of state / action pairs but do not have instructions regarding skills. Therefore, the learning module 206d trains a function to select a skill from the module 206a in order to perform a task based on the task demonstrations 906b, 908b and 910b in the module 206b.

Portion 904 represents the learning phase for performing the task. As shown in FIG. 9, the learning module 206d is configured to receive the task demonstrations in block 906d. For example, the learning module 206d receives the three demonstrations 906b, 908b and 910b. In block 908d, the learning module 206d is configured to determine, for each state, a distribution of actions that fits the actions received for that state from the received task demonstrations (that is, the three demonstrations 906b, 908b and 910b). Block 908d-0 represents the distribution of actions determined for state S_0, block 908d-1 the distribution determined for state S_1, block 908d-2 the distribution determined for state S_2, and block 908d-3 the distribution determined for state S_3. For example, the state S_0 receives two A_0 actions from task demonstrations 906b and 908b and one action A_1 from task demonstration 910b. The distribution 908d-0 is therefore determined for the state S_0 so as to fit the actions A_0 and A_1. The distribution 908d-0 returns action A_0 with a probability of 2/3 and action A_1 with a probability of 1/3. The distributions 908d-1, 908d-2 and 908d-3 are determined in the same way. In some embodiments, the learning module 206d is configured, in block 908d, to sample for each state in the task demonstrations an action from the distribution of actions determined for that state. In some embodiments, the sampled action corresponds to the highest-probability action in the distribution of actions. The sampled actions for states S_0, S_1, S_2 and S_3 are therefore A_0, A_1, A_2 and A_3, respectively.
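The per-state distributions of block 908d can be reproduced with a short sketch that treats each distribution as the relative frequency of the demonstrated actions. The string encoding of states and actions and the names `demos`, `action_distributions` and `most_likely_action` are assumptions for illustration; fitting by relative frequency is one simple choice among several.

```python
from collections import Counter, defaultdict

# Hypothetical encoding of the three demonstrations of FIG. 9 as
# (state, action) pairs; the terminal state S4 carries no action.
demos = [
    [("S0", "A0"), ("S1", "A1"), ("S2", "A2"), ("S3", "A3")],  # 906b
    [("S0", "A0"), ("S1", "A1"), ("S2", "A2"), ("S3", "A3")],  # 908b
    [("S0", "A1"), ("S2", "A2"), ("S3", "A3")],                # 910b, S1 missing
]

def action_distributions(demonstrations):
    """Return {state: {action: relative frequency}} over all demonstrations."""
    counts = defaultdict(Counter)
    for demo in demonstrations:
        for state, action in demo:
            counts[state][action] += 1
    return {
        state: {a: n / sum(c.values()) for a, n in c.items()}
        for state, c in counts.items()
    }

def most_likely_action(distribution):
    """Pick the action with the highest probability in one distribution."""
    return max(distribution, key=distribution.get)

dists = action_distributions(demos)
sampled = {s: most_likely_action(d) for s, d in dists.items()}
```

For state S0 this yields A0 with probability 2/3 and A1 with probability 1/3, matching distribution 908d-0, and the highest-probability actions for S0 to S3 are A0 to A3.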

In some embodiments, the states can differ between demonstrations 906b, 908b and 910b. The indices and notation used here are for clarity of explanation. The same argument applies to the actions. For example, when the states differ, an additional process outside the scope of the invention may be invoked to determine an alignment according to the most similar states among task demonstrations 906b, 908b and 910b. An example of such an alignment process is the well-known dynamic time warping.
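Dynamic time warping itself is a standard algorithm, and a minimal version for two state sequences is sketched below. Representing states as scalars and using the absolute difference as the distance are assumptions made to keep the example small; `dtw_alignment` is a hypothetical helper name.

```python
def dtw_alignment(seq_a, seq_b, dist):
    """Return (total cost, aligned index pairs) via dynamic time warping."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # advance seq_a only
                                 cost[i][j - 1],      # advance seq_b only
                                 cost[i - 1][j - 1])  # advance both
    # Backtrack from the end to recover which indices were matched.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path

# A complete demonstration and one with a missing second state.
full = [0.0, 1.0, 2.0, 3.0]
partial = [0.0, 2.0, 3.0]
total, path = dtw_alignment(full, partial, lambda a, b: abs(a - b))
```

Here the unmatched state of the longer sequence is mapped to its nearest neighbour in the shorter one, which is how states that differ between demonstrations could be paired before their actions are pooled.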

In block 910d, the learning module 206d is configured to determine, for a state, the skill with the highest probability of returning the action sampled from the distribution of actions determined for that state. To that end, the learning module 206d trains, as described with reference to FIGS. 5 to 6, the function

In the learned function,

determines, for the state S_t, the skill with the highest probability of returning the action sampled from the distribution of actions determined for the state S_t. For example, the learning module 206d determines skills H1 and H2 for states S_0 and S_1, respectively, and skill H3 for states S_2 and S_3. Skill H3 is a multi-step skill. Alternatively, the skills determined in the learning phase can be stored as a sequence of skills in the set H. For example, the set H is H = {H_1, H_2, H_3}, and the task is completed in the execution phase when the robot executes the skills in the order given in the set H.
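The skill assignment of block 910d can be sketched as an argmax over the library: for each state, pick the skill whose π^h(A|S) gives the sampled action the highest probability. The probability tables below are invented for illustration, and a direct argmax stands in for the trained function; `SKILLS`, `assign_skills` and `skill_sequence` are hypothetical names.

```python
# Hypothetical skill library: pi^h(A|S) as {state: {action: probability}}.
SKILLS = {
    "H1": {"S0": {"A0": 1.0}},
    "H2": {"S1": {"A1": 1.0}},
    "H3": {"S2": {"A2": 0.9, "A3": 0.1}, "S3": {"A2": 0.1, "A3": 0.9}},
}

def skill_likelihood(skill, state, action):
    """Probability that `skill` returns `action` in `state` (0 if uncovered)."""
    return SKILLS[skill].get(state, {}).get(action, 0.0)

def assign_skills(sampled_actions):
    """For each state, choose the skill most likely to return its action."""
    return {
        state: max(SKILLS, key=lambda h: skill_likelihood(h, state, action))
        for state, action in sampled_actions.items()
    }

def skill_sequence(assignment, ordered_states):
    """Collapse consecutive repeats so a multi-step skill appears once."""
    seq = []
    for s in ordered_states:
        if not seq or seq[-1] != assignment[s]:
            seq.append(assignment[s])
    return seq

sampled = {"S0": "A0", "S1": "A1", "S2": "A2", "S3": "A3"}
assignment = assign_skills(sampled)
sequence = skill_sequence(assignment, ["S0", "S1", "S2", "S3"])
```

The result assigns H1 and H2 to S0 and S1 and the multi-step skill H3 to both S2 and S3, giving the sequence H1, H2, H3.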

It should be understood that once the learning phase is complete, the flow continues to the execution phase. The execution module 206e is provided to perform the task.

The execution module 206e is configured to receive the current state. For example, the execution module 206e receives the state S_0. The execution module 206e is then configured to execute the learned function on the state S_0 to select the skill with the highest probability in the distribution over skills. For example, because skill H1 is the skill determined for the state S_0 in the learning phase, the execution module 206e selects skill H1 for the state S_0, and skill H1 returns the action A_0 with the highest probability. The execution module 206e is further configured to execute the selected skill H1 on the state S_0 to select the action A_0. The execution module 206e instructs the robot to perform the action A_0 for the state S_0.

It should be understood that when the robot performs the action A_0, the robot transitions to the state S_1. The execution module 206e then receives the state S_1 and executes the termination function β(S) on the state S_1 to decide whether to select another skill for the state S_1. Skill H1 provides an action or distribution of actions only for the state S_0, so the termination function returns 1 for the state S_1. When the termination function returns 1, the execution module 206e is configured to select another skill for the state S_1, as described for the state S_0. This process is repeated until the execution module receives the state S_4. The execution module 206e thus selects skill H2 for the state S_1 and skill H3 for the states S_2 and S_3. Skill H3 is a multi-step skill that the execution module 206e executes on the states S_2 and S_3 to return the actions A_2 and A_3, respectively. When the action A_3 has been performed, the robot transitions to the state S_4, which indicates that the task execution is complete. The skills to be executed by the robot to perform the task are thereby selected from the library of task-agnostic skills, which replaces the learning objective with a selection objective.

上記の説明は、例示的な実施形態のみを提供し、本開示の範囲、利用可能性、または、構成を制限することを意図していない。むしろ、例示的な実施形態の以下の記載は、１つ以上の例示的な実施形態を実現するための実施可能な記載を当業者に提供する。添付の特許請求の範囲に記載されるように開示される主題の精神および範囲から逸脱することがなければ、要素の機能および構成において行われ得るさまざまな変更が企図される。 The above description provides only exemplary embodiments and is not intended to limit the scope, availability, or configuration of the present disclosure. Rather, the following description of an exemplary embodiment provides one of ordinary skill in the art with an operable description for realizing one or more exemplary embodiments. Various changes can be made in the function and composition of the elements, provided that they do not deviate from the spirit and scope of the subject matter disclosed as described in the appended claims.

Specific details are given in the following description to provide a thorough understanding of the embodiments. However, as will be understood by those skilled in the art, the embodiments may be practiced without these specific details. For example, systems, processes, and other elements in the disclosed subject matter may be shown as components in block diagram form so as not to obscure the embodiments in unnecessary detail. In other instances, well-known processes, structures, and techniques may be shown without unnecessary detail to avoid obscuring the embodiments. Further, like reference numbers and designations in the various drawings indicate like elements.

Further, individual embodiments may be described as a process depicted as a flowchart, a flow diagram, a data flow diagram, a structure diagram, or a block diagram. Although a flowchart may describe the operations as a sequential process, many of the operations can be performed in parallel or concurrently. In addition, the order of the operations may be rearranged. A process may be terminated when its operations are completed, but may have additional steps that are not discussed or not included in a figure. Furthermore, not all operations in any particularly described process may occur in all embodiments. A process may correspond to a method, a function, a procedure, a subroutine, a subprogram, and so on. When a process corresponds to a function, the termination of the function may correspond to a return of the function to the calling function or the main function.

Furthermore, embodiments of the disclosed subject matter may be implemented, at least in part, either manually or automatically. Manual or automatic implementations may be executed, or at least assisted, through the use of machines, hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof. When implemented in software, firmware, middleware, or microcode, the program code or code segments that perform the necessary tasks may be stored on a machine-readable medium. A processor may perform the necessary tasks.

The various methods or processes outlined herein may be coded as software that is executable on one or more processors employing any one of a variety of operating systems or platforms. In addition, such software may be written using any of a number of suitable programming languages and/or programming or scripting tools, and may be compiled as executable machine language code or intermediate code that is executed on a framework or virtual machine. Typically, the functionality of the program modules may be combined or distributed as desired in various embodiments.

Embodiments of the present disclosure may be embodied as a method, of which an example has been provided. The operations performed as part of the method may be ordered in any suitable way. Accordingly, embodiments may be constructed in which operations are performed in an order different from that illustrated, which may include performing some operations simultaneously, even though they are shown as sequential operations in the illustrative embodiments.

Although the present disclosure has been described with reference to certain preferred embodiments, it is to be understood that various other adaptations and modifications can be made within the spirit and scope of the present disclosure. Therefore, it is the aspect of the appended claims to cover all such variations and modifications as come within the true spirit and scope of the present disclosure.

Alternatively, when state S_1 is received, the execution module 206e executes the learned function for state S_1 with the termination function having been executed.

Claims

A controller for controlling the operation of a robot executing a task, the controller comprising:
an input interface configured to accept a current state of the robot for each control step; and
a memory,
wherein the memory is configured to store a library of skills of the robot, each skill being a probabilistic function of the state of the robot that returns a distribution of actions in response to submission of the current state, the skills being task-agnostic,
wherein the memory is configured to store a learned function for the task, the learned function being a probabilistic function of the state of the robot that returns a distribution of skills in response to submission of the state of the robot,
the controller further comprising:
a processor configured, for each control step in a sequence of control steps reaching a termination condition, to execute the learned function for the current state so as to select the skill having the highest probability in the distribution of skills, and to execute the selected skill for the current state so as to select the action having the highest probability in the distribution of actions, in order to transition the state of the robot from the current state to a next state; and
an output interface configured to command the robot to execute the selected action so as to perform the task.

The controller of claim 1, wherein the parameters of the learned function are chosen to maximize the probability of returning, from the library, the skill that minimizes the difference between the action returned by the skill for each possible state and the corresponding action, for the corresponding state, in multiple executions of the task, each execution of the task being represented by a sequence of pairs of state and action values.

The controller according to claim 1, wherein the learned function is a single argument function.

The controller of claim 1, wherein at least some of the skills are multi-step skills configured to be executed over multiple control steps, and the processor repeats the selection of a skill for each control step even when the currently selected skill is a multi-step skill.

The controller of claim 1, wherein the function for selecting skills is learned to impose a penalty on switching between skills.

The controller according to claim 1, wherein the function is learned based on reinforcement learning (RL) having a reward function including a penalty for switching between skills.

The controller according to claim 1, wherein the function is learned based on reinforcement learning (RL) having a reward function including a penalty based on the distance between the selected action and one or more demonstrated actions.
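As an illustration of the two penalties recited in the preceding claims, the following Python sketch computes a shaped reward. The claims recite the switching penalty and the distance penalty separately; combining them in one function, the names `switch_penalty` and `distance_weight`, the default weights, and the choice of Euclidean distance to the nearest demonstrated action are all hypothetical.

```python
import math
from typing import Optional, Sequence


def shaped_reward(
    task_reward: float,
    skill: str,
    previous_skill: Optional[str],
    action: Sequence[float],
    demonstrated_actions: Sequence[Sequence[float]],
    switch_penalty: float = 0.1,    # hypothetical weight
    distance_weight: float = 1.0,   # hypothetical weight
) -> float:
    """Task reward minus a skill-switching penalty and a distance penalty."""
    reward = task_reward
    # Penalize changing skills between consecutive control steps.
    if previous_skill is not None and skill != previous_skill:
        reward -= switch_penalty
    # Penalize distance from the selected action to the nearest demonstration.
    if demonstrated_actions:
        nearest = min(math.dist(action, demo) for demo in demonstrated_actions)
        reward -= distance_weight * nearest
    return reward
```

For example, with a task reward of 1.0, a switch from H1 to H2, and a selected action at distance 5 from the only demonstrated action, this sketch returns 1.0 - 0.1 - 5.0.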

The controller according to claim 1, wherein the function is learned by maximizing the expected value.

The controller of claim 1, wherein the learned function selects the skill by returning a time to switch the current skill from the sequence of skills.

The controller according to claim 1, further comprising a learning module configured to train the function to improve skill selection using a reinforcement learning (RL) model with a reward function that minimizes the deviation of the result of performing the task.

The controller of claim 1, further comprising a learning module configured to train the function for selecting the skill,
wherein, to train the function, the learning module is configured to:
receive a set of executions of the task, each execution defining a sequence of state/action pairs;
determine, for each state, from the sequences of state/action pairs, a distribution of actions that fits the actions received for that state; and
determine, for each state, the skill having the highest probability of returning an action sampled from the distribution of actions determined for that state.
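The training steps recited for the learning module can be sketched in Python as follows. This is an illustrative reading under stated assumptions, not the claimed implementation. It assumes discrete, hashable states and actions, fits the per-state action distribution by empirical frequency, and scores each skill by the probability that it returns an action sampled from that fitted distribution. The function names and data layout are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, List, Tuple

Execution = List[Tuple[Hashable, Hashable]]        # sequence of (state, action)
Skill = Dict[Hashable, Dict[Hashable, float]]      # state -> action distribution


def fit_action_distributions(executions: List[Execution]):
    """Empirical distribution of demonstrated actions for each state."""
    counts = defaultdict(Counter)
    for execution in executions:
        for state, action in execution:
            counts[state][action] += 1
    return {
        state: {a: n / sum(c.values()) for a, n in c.items()}
        for state, c in counts.items()
    }


def best_skill_per_state(executions: List[Execution], skills: Dict[str, Skill]):
    """For each state, pick the skill most likely to return a sampled action."""
    fitted = fit_action_distributions(executions)
    labels = {}
    for state, demo_dist in fitted.items():
        def score(name: str) -> float:
            skill_dist = skills[name].get(state, {})
            # P(skill returns a) averaged over a drawn from the fitted distribution.
            return sum(p * skill_dist.get(a, 0.0) for a, p in demo_dist.items())
        labels[state] = max(skills, key=score)
    return labels
```

The resulting state-to-skill labels could then serve as training targets for the function that selects skills, although how the labels are consumed is not specified here.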

A method for controlling the operation of a robot executing a task, wherein the method uses a processor coupled to a memory, the memory storing (1) a library of skills of the robot, each skill being a probabilistic function of the state of the robot that returns a distribution of actions in response to submission of the current state, the skills being task-agnostic, and (2) a learned function for the task, the learned function being a probabilistic function of the state of the robot that returns a distribution of skills in response to submission of the state of the robot, and wherein the processor is coupled with stored instructions that, when executed by the processor, carry out the steps of the method, the method comprising:
accepting the current state of the robot for each control step;
for each control step in a sequence of control steps reaching a termination condition, executing the learned function for the current state to select the skill having the highest probability in the distribution of skills;
executing the selected skill for the current state to select the action having the highest probability in the distribution of actions, in order to transition the state of the robot from the current state to a next state; and
outputting a command to the robot to execute the selected action so as to perform the task.

The method of claim 12, wherein the parameters of the learned function are chosen to maximize the probability of returning, from the library, the skill that minimizes the difference between the action returned by the skill for each possible state and the corresponding action, for the corresponding state, in multiple executions of the task, each execution of the task being represented by a sequence of pairs of state and action values.

The method of claim 12, wherein the learned function is a single argument function.

The method of claim 12, wherein at least some of the skills are multi-step skills configured to be executed over multiple control steps, and the processor repeats the selection of a skill for each control step even when the currently selected skill is a multi-step skill.

The method of claim 12, wherein the function is learned to select a skill based on a function that imposes a penalty on switching between skills.

The method of claim 12, wherein the function is learned to select a skill based on the resulting action being penalized according to a function of the distance to one or more demonstrated actions.

The method of claim 12, wherein the function is learned based on reinforcement learning (RL) having a reward function that includes a penalty for switching between skills.

The method of claim 12, wherein the function is learned by maximizing the expected value.

The method of claim 12, wherein the learned function selects the skill by returning the time to switch the current skill from the sequence of skills.

The method of claim 12, wherein the memory is further configured to store a learning module that, when executed by the processor, trains the function to improve skill selection using reinforcement learning (RL) with a reward function that minimizes the deviation of the result of performing the task.

The method of claim 12, wherein the memory is further configured to store a learning module, and the method further comprises, when the learning module is executed by the processor, training the function for selecting the skill,
wherein training the function comprises:
receiving a set of executions of the task, each execution defining a sequence of state/action pairs;
determining, for each state, from the sequences of state/action pairs, a distribution of actions that fits the actions received for that state; and
determining, for each state, the skill having the highest probability of returning an action sampled from the distribution of actions determined for that state.