WO2024059285A1 - System(s) and method(s) of using behavioral cloning value approximation in training and refining robotic control policies - Google Patents
System(s) and method(s) of using behavioral cloning value approximation in training and refining robotic control policies
- Publication number
- WO2024059285A1 (PCT/US2023/032900)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- robot
- failure
- performance
- task
- robotic
- Prior art date
Classifications
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B25—HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
- B25J—MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
- B25J9/00—Programme-controlled manipulators
- B25J9/16—Programme controls
- B25J9/1628—Programme controls characterised by the control loop
- B25J9/163—Programme controls characterised by the control loop learning, adaptive, model based, rule based expert control
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B25—HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
- B25J—MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
- B25J9/00—Programme-controlled manipulators
- B25J9/16—Programme controls
- B25J9/1694—Programme controls characterised by use of sensors other than normal servo-feedback from position, speed or acceleration sensors, perception control, multi-sensor controlled systems, sensor fusion
- B25J9/1697—Vision controlled systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N7/00—Computing arrangements based on specific mathematical models
- G06N7/01—Probabilistic graphical models, e.g. probabilistic networks
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05B—CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
- G05B2219/00—Program-control systems
- G05B2219/30—Nc systems
- G05B2219/33—Director till display
- G05B2219/33034—Online learning, training
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05B—CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
- G05B2219/00—Program-control systems
- G05B2219/30—Nc systems
- G05B2219/39—Robotics, robotics to robotics hand
- G05B2219/39271—Ann artificial neural network, ffw-nn, feedforward neural network
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05B—CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
- G05B2219/00—Program-control systems
- G05B2219/30—Nc systems
- G05B2219/39—Robotics, robotics to robotics hand
- G05B2219/39289—Adaptive ann controller
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05B—CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
- G05B2219/00—Program-control systems
- G05B2219/30—Nc systems
- G05B2219/40—Robotics, robotics mapping to robotics vision
- G05B2219/40153—Teleassistance, operator assists, controls autonomous robot
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05B—CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
- G05B2219/00—Program-control systems
- G05B2219/30—Nc systems
- G05B2219/40—Robotics, robotics mapping to robotics vision
- G05B2219/40298—Manipulator on vehicle, wheels, mobile
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05B—CONTROL OR REGULATING SYSTEMS IN GENERAL; FUNCTIONAL ELEMENTS OF SUCH SYSTEMS; MONITORING OR TESTING ARRANGEMENTS FOR SUCH SYSTEMS OR ELEMENTS
- G05B2219/00—Program-control systems
- G05B2219/30—Nc systems
- G05B2219/40—Robotics, robotics mapping to robotics vision
- G05B2219/40391—Human to robot skill transfer
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
Definitions
- a human may physically manipulate a given robot, or an end effector thereof, to cause a reference point of the given robot or the end effector to traverse the particular trajectory – and that particular traversed trajectory may thereafter be repeatable by using a robotic control policy trained based on the physical manipulation by the human.
- the human may control a given robot, or end effector thereof, using one or more teleoperation techniques to perform a given task, and the given task may thereafter be repeatable by using a robotic control policy trained based on one or more of the teleoperation techniques.
- Such robotic task(s) can include, for example, door opening, door closing, drawer opening, drawer closing, picking up an object, placing an object, and/or other robotic task(s).
- the failure output can indicate the likelihood the robot will fail in performance of the robotic task at some point in the future.
- the system can determine to request for a human operator to intervene in the robot’s performance of the task, where the determination to request the human intervention can be based on the failure output.
- Some implementations relate to training a failure neural network (NN) model to generate the failure output.
- vision data can capture the environment of a robot during the performance of a robotic task by the robot.
- vision data can capture the environment of a robot while the robot performs the task of picking up a cup.
- an instance of vision data capturing the state of the robot performing the task can be processed using an embedding model to generate an embedding.
- the embedding can be processed using a robotic control policy to generate action output corresponding to an action to be performed by the robot in continuance of the performance of the task.
- the action output can indicate corresponding action for one or more components of the robot.
- the components of the robot can include a robot base, a robot arm, a robot end effector, and/or one or more additional or alternative robotic components.
- the action output can indicate an action for each of the one or more components.
- the embedding can be processed using the failure NN model to generate the failure output.
- the same embedding processed using the robotic control policy to generate the action output can be processed using the failure NN model to generate the failure output.
- the system can determine whether the robot will fail in performance of the robotic task based on the failure output. For example, the system can determine the robot will fail the task if the failure output satisfies a threshold value. For example, the system can determine the robot will fail the task if the failure output exceeds 75 percent, 80 percent, 90 percent, and/or one or more additional or alternative values.
- the system can determine whether the failure output has exceeded a threshold value for several states.
- the failure output can indicate the robot will fail the task when the failure output corresponding to the state and one or more previous failure outputs corresponding to one or more previous states are above a threshold value.
- the system can determine the robot will fail the task if the failure output exceeds a threshold value for two states, three sequential states, three of the last five states, five total states, five sequential states, and/or additional or alternative combinations of previous states.
- the system can determine whether the robot will fail the task based on multiple threshold values.
- the system can determine the robot will fail the task if the failure output exceeds a given threshold value or if the failure output exceeds an additional threshold value for multiple states. For example, the system can determine the robot will fail the task if the failure output ever exceeds a threshold value (e.g., 95%) or if the failure output corresponding to a sequence of states exceeds an additional threshold value (e.g., three sequential states have a corresponding failure output that exceeds 75%). In other words, the system can determine the robot will fail the task if any individual failure output indicates such a high likelihood the robot will fail, or if several failure outputs indicate a lower likelihood the robot will fail.
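As a non-limiting sketch of the multi-threshold determination above (the 0.95 and 0.75 thresholds and the three-state window mirror the example values; all function and variable names are illustrative, not from the specification):

```python
from collections import deque

# Example values drawn from the passage above; the specification leaves the
# exact thresholds and window sizes open.
HIGH_THRESHOLD = 0.95   # any single failure output above this flags failure
LOW_THRESHOLD = 0.75    # sustained outputs above this also flag failure
SEQUENTIAL_STATES = 3   # how many sequential states must exceed LOW_THRESHOLD

recent_outputs = deque(maxlen=SEQUENTIAL_STATES)

def robot_will_fail(failure_output: float) -> bool:
    """Return True if the failure output(s) indicate a predicted task failure."""
    recent_outputs.append(failure_output)
    if failure_output > HIGH_THRESHOLD:
        return True
    # Alternatively: all of the last SEQUENTIAL_STATES outputs exceed the
    # lower threshold.
    return (len(recent_outputs) == SEQUENTIAL_STATES
            and all(o > LOW_THRESHOLD for o in recent_outputs))
```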
- the system can determine the status of availability of computing devices to intervene in robotic task performance when selecting a particular threshold value. Additionally or alternatively, the system can select a particular threshold value based on the robotic task and/or category of robotic task being performed. For example, the system can select a particular threshold value when the robot is performing grasping tasks, and the system can select an additional particular threshold value when the robot is performing locomotion tasks. [0009] In some implementations, the system can select the particular threshold based on whether the robot has detected one or more particular types of objects in the environment.
- the one or more objects can include obstacle(s) for the robot to avoid while performing the task (e.g., a wall, a table, a door, a shelf, another robot, a human, one or more additional or alternative obstacles, and/or combinations thereof) as well as objects used by the robot in performance of the task (e.g., a tool the robot picks up, an object to retrieve off a shelf, a door to close, one or more objects used in performance of the robotic task, and/or combinations thereof).
- the system can select the particular threshold based on determining whether the robot has detected a door in the environment. Additionally or alternatively, the system can select a particular threshold based on whether one or more objects are within a threshold distance of the robot.
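A minimal sketch of context-dependent threshold selection, assuming hypothetical task categories, object labels, and numeric values (none of which are specified by the document; it only says the threshold can depend on operator availability, task category, and detected objects):

```python
# All category names, object labels, and numbers below are illustrative
# assumptions.
def select_threshold(task_category: str,
                     detected_objects: set[str],
                     operator_available: bool) -> float:
    threshold = 0.90 if task_category == "grasping" else 0.80
    if "door" in detected_objects or "human" in detected_objects:
        threshold -= 0.10   # ask for help earlier near doors or humans
    if not operator_available:
        threshold += 0.05   # no operator free: flag only near-certain failures
    return threshold
```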
- various implementations set forth techniques for predicting a future failure of a robot to complete a robotic task.
- the robot can request help from a human operator to complete the task.
- instructions to complete the task provided by the human operator can be used as additional training data to further refine the policy network and/or failure NN of the system.
- the robot can pause execution of the task as soon as it determines a predicted failure instead of waiting to detect a failure after it has occurred.
- the robot can be damaged by the task failure (e.g., the robot can fall, resulting in damage to one or more components of the robot, such as motors, gears, etc.).
- a method implemented by one or more processors includes receiving an instance of vision data capturing an environment of a robot during performance of a robotic task by the robot, where the instance of vision data is captured via a vision component.
- the method includes generating an embedding based on processing the instance of vision data using an encoder model, the encoder model being a trained neural network (NN) model.
- the method includes processing the embedding using a robotic control policy to generate action output that indicates, for each of one or more components of the robot, a corresponding action to be performed by the component, the robotic control policy being a trained NN model.
- the method includes processing the embedding using a failure NN model to generate failure output indicating a likelihood of the robot successfully completing the task.
- the method includes determining, based on the failure output, whether the robot will fail in performance of the robotic task.
- the method in response to determining the robot will fail in performance of the robotic task, includes causing a user of a computing device to intervene in performance of the robotic task.
- the method includes receiving, from the user and via the computing device, user interface input that intervenes with performance of the robotic task. In some implementations, the method includes causing the robot to complete performance of the task based on the user interface input. [0014] These and other implementations of the technology can include one or more of the following features. [0015] In some implementations, the robotic control policy is trained using imitation learning. In some versions of those implementations, the robotic control policy is a Behavior Cloning model. [0016] In some implementations, determining whether the robot will fail in performance of the robotic task based on the failure output includes determining whether the failure output satisfies a threshold value.
- In some implementations, the method further includes determining whether a previous failure output satisfies the threshold value, wherein the previous failure output was generated by processing a previous embedding using the failure model, and wherein the previous embedding was generated based on processing a previous instance of vision data captured by the vision component during the performance of the robotic task by the robot. In some implementations, the method further includes determining whether the robot will fail in performance of the robotic task based on both whether the failure output satisfies the threshold likelihood value and whether the previous failure output satisfies the threshold likelihood value.
- the previous instance of vision data is an immediately preceding instance of vision data, captured most recently by the vision component relative to the instance of vision data, and wherein the previous embedding, generated based on the previous instance of vision data, is utilized in determining whether the robot will fail in performance of the robotic task based on the previous instance of vision data being the immediately preceding instance of vision data.
- the previous instance of vision data is captured by the vision component within a threshold amount of time relative to the instance of vision data, and wherein the previous embedding, generated based on the previous instance of vision data, is utilized in determining whether the robot will fail in performance of the robotic task based on the previous instance of vision data being captured within the threshold amount of time.
- the previous embedding was processed, using the robotic control policy, to generate previous action output that indicated, for each of the plurality of components of the robot, a corresponding previous action to be performed by the component, and wherein the previous actions were already implemented by the robot, or were being implemented by the robot, during processing of the embedding using the failure NN model to generate the failure output.
- determining whether the robot will fail in performance of the robotic task based on both whether the failure output satisfies the threshold likelihood value and whether the previous failure output satisfies the threshold likelihood value includes determining that the robot will not fail in performance of the task if either one of the failure output or the previous failure output fails to satisfy the threshold. [0020] In some implementations, determining whether the robot will fail in performance of the robotic task based on both whether the failure output satisfies the threshold likelihood value and whether the previous failure output satisfies the threshold likelihood value includes determining that the robot will fail in performance of the task only when both the failure output and the previous failure output satisfy the threshold.
- determining whether the robot will fail in performance of the robotic task based on the failure output includes selecting, from a plurality of candidate thresholds, a particular threshold. In some implementations, the method further includes determining whether the failure output satisfies the selected particular threshold. In some implementations, the method further includes determining whether the robot will fail in performance of the robotic task based on whether the failure output satisfies the selected particular threshold. In some versions of those implementations, selecting the particular threshold includes determining a current status of availability of computing devices to intervene in robotic task performance. In some implementations, the method further includes selecting the particular threshold based on the current status.
- selecting the particular threshold includes selecting the particular threshold based on a category assigned to the robotic task that is being performed. [0023] In some implementations, selecting the particular threshold includes selecting the particular threshold based on whether the robot has detected one or more particular types of objects in the environment. [0024] In some implementations, selecting the particular threshold includes selecting the particular threshold based on whether the robot has detected one or more particular types of objects to be within a threshold distance of the robot. [0025] In some implementations, the computing device is in the environment of the robot. [0026] In some implementations, the computing device is remote from the robot and is not in the environment of the robot.
- the method further includes causing the robotic control policy and/or the failure NN model to be updated based on the performance of the task based on the user interface input.
- the failure NN model was previously trained based on a plurality of supervised training instances from a previous episode of robotic performance of a task that was determined to be a failure.
- each of the supervised training instances includes training instance input of a corresponding embedding, the corresponding embedding being generated during the episode using the encoder model and being processed, using the robotic control policy during the episode, to generate corresponding actions implemented during the episode, and training instance output that includes a corresponding failure measure.
- a plurality of the corresponding failure measures are discounted and indicate a corresponding reduced degree of failure.
- a given failure measure, of the corresponding failure measures, and of a given training instance, of the supervised training instances, is generated based on temporal separation between a first time corresponding to generation of the corresponding embedding of the given training instance and a second time corresponding to the failure.
- a given training instance of the training instances includes: a given embedding, of the corresponding embeddings, that was generated based on a first vision data instance of the episode, and a given failure measure, of the corresponding failure measures, that was generated based on a difference between the first vision data instance and a failure vision data instance, of the episode, that corresponds to the failure.
- a given training instance of the training instances includes: a given embedding, of the corresponding embeddings, that was generated at a first time of the episode, and a given failure measure, of the corresponding failure measures, that was generated based on a difference between a robot state at the first time and an alternate robot state at a failure time corresponding to the failure.
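As a hedged sketch of how such discounted failure measures could be generated for a failed episode, here using exponential discounting of the temporal separation from the failure (the exponential form and the gamma value are assumptions; the passages above also contemplate vision-data and robot-state differences as alternatives):

```python
# Assign each step of a failed episode a failure measure that shrinks with
# temporal separation from the failure; gamma and the exponential form are
# illustrative assumptions.
def failure_measures(num_steps: int, gamma: float = 0.98) -> list[float]:
    failure_step = num_steps - 1   # the failure occurs at the final step
    return [gamma ** (failure_step - t) for t in range(num_steps)]

# A 5-step failed episode: earlier steps carry reduced degrees of failure.
print(failure_measures(5))  # [0.9224, 0.9412, 0.9604, 0.98, 1.0] (approx.)
```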
- in response to determining that the robot will fail in performance of the robotic task, the method further includes halting performance of the task by the robot.
- processing the embedding using the robotic control policy to generate the action output that indicates, for each of the one or more components of the robot, the corresponding action to be performed by the component includes generating, as output of a first head of the model, a first portion of the action output, wherein the first portion of the action output indicates a first action to be performed by a first component of the robot.
- the method further includes generating, as output of a second head of the model, a second portion of the action output, wherein the second portion of the action output indicates a second action to be performed by a second component of the robot.
- the first component is a robot arm and the second component is a robot base.
- a method implemented by one or more processors of a robot includes receiving an instance of vision data capturing an environment of the robot during performance of a robotic task by the robot, where the instance of vision data is captured via one or more vision components of the robot.
- the method includes generating an embedding based on processing the instance of vision data using an encoder model, the encoder model being a trained neural network (NN) model.
- the method includes processing the embedding using a robotic control policy to generate action output that indicates, for each of a plurality of components of the robot, a corresponding action to be performed by the component.
- the method includes processing the embedding using a failure NN model to generate failure output indicating a likelihood of the robot successfully completing the task. In some implementations, the method includes determining, based on the failure output, whether the robot will fail in performance of the robotic task. In some implementations, in response to determining that the robot will fail in performance of the robotic task, the method includes halting performance of the task by the robot. In some implementations, in response to determining that the robot will not fail in performance of the robotic task, the method includes continuing performance of the task by the robot, continuing performance of the task by the robot comprising causing implementation of the corresponding actions by the components of the robot. [0032] These and other implementations of the technology can include one or more of the following features.
- the method in response to determining that the robot will fail in performance of the robotic task, the method further includes causing a prompt to be rendered via an interface of a computing device or the robot, the prompt requesting intervention in performance of the robotic task.
- the method in response to determining that the robot will fail in performance of the robotic task, the method further includes transmitting, to a remote computing device, the vision data and/or additional vision data captured by at least one of the vision components.
- the method includes causing the robot to complete performance of the task based on user interface input received via the remote computing device responsive to the transmitting.
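The per-step flow described by these claims can be sketched as follows; `encoder`, `policy`, `failure_model`, the `robot` interface, and the threshold are hypothetical stand-ins, and only the control flow is taken from the text:

```python
# Illustrative per-step loop: embed vision data, generate action output and
# failure output from the shared embedding, then halt/ask for help or
# continue based on the failure determination.
def control_step(robot, encoder, policy, failure_model, threshold: float) -> bool:
    vision = robot.capture_vision_data()        # instance of vision data
    embedding = encoder(vision)                 # shared state embedding
    action_output = policy(embedding)           # per-component actions
    failure_output = failure_model(embedding)   # likelihood of failure

    if failure_output > threshold:
        robot.halt()                            # stop before the failure occurs
        robot.request_intervention(vision)      # prompt a (remote) operator
        return False
    robot.apply(action_output)                  # continue performing the task
    return True
```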
- FIG.1 illustrates an example environment in which implementations described herein can be implemented.
- FIG.2 is a block diagram illustrating an example of generating action output and failure output in accordance with various implementations described herein.
- FIGS.3A-3E illustrate examples of determining whether failure output satisfies a threshold likelihood value in accordance with various implementations described herein.
- FIGS.4A-4B illustrate a flowchart of an example process in accordance with various implementations described herein.
- FIG.5 schematically depicts an example architecture of a robot, in accordance with various implementations disclosed herein.
- FIG.6 schematically depicts an example architecture of a computer system, in accordance with various implementations disclosed herein.
- multiple robots (e.g., a fleet of robots) asking for help when needed can allow a remote operator (e.g., a human operator) to supervise multiple robots and intervene when needed.
- Behavioral Cloning Value Approximation (BCVA) is an approach to learning a state value function based on, and trained jointly with, a Behavioral Cloning policy that can be used to predict failures.
- BCVA can be used to complete a variety of challenging mobile manipulation tasks such as (but not limited to) latched-door opening.
- the field of robotics has seen significant developments in recent years on mobile manipulation tasks. A variety of techniques have made it possible to learn from simulated and real-world experiences together. Being able to fuse data from different sensor modalities, such as RGB and depth, has resulted in greatly improved action-taking decisions. Additionally or alternatively, fusing data from different sensor modalities can improve manipulation behaviors. With these improvements, it has become possible to deploy such mobile manipulation agents in the real world to solve practical real-life tasks.
- Imitation Learning (IL) approaches such as Behavioral Cloning have been established as a practical solution to many mobile manipulation tasks, such as latched door opening.
- deploying Imitation Learning models in the real world can consist of two phases: a training phase where a human operator performs demonstrations in order to learn a policy, and an operational phase where the robot can operate, ideally unattended, in an open environment to perform its task.
- Imitation Learning approaches have some shortcomings that can make their naive implementation problematic for such real-world deployments. For example, agents trained using this paradigm tend to perform poorly in out-of-distribution states. Therefore, in the operational phase, the policy can continue to execute with compounding error under out-of-distribution states until failure can be externally detected post-factum, e.g. by a human operator, a sensor, etc., which can lead to damage to the robot and the environment.
- the robots may need continuous one-on-one human supervision, blurring the lines between the training and operational phases.
- One approach to this problem is to make sure as much of the robot’s state space is covered by the training data as possible. In the training phase where one-on-one supervision is available, this can be achieved by applying data collection regimens, such as DAgger, that iteratively include more of the state space in the training distribution.
- it is desirable for the robot policy itself to be able to identify that it is going to be unable to solve the task successfully.
- the robot can preemptively stop and ask for help, potentially avoiding any damage to the robot and/or the environment based on allowing human experts to provide corrective demonstrations.
- this mode of operation can allow the robots to operate outside of one-on-one human-to-robot supervision, by allowing remote supervision where human operators at an off-site control center can loosely monitor multiple robots and intervene only when asked for help.
- this can allow retraining in the real world in an incremental manner: initial policies from the training phase are deployed on real robots, additional data is collected during the operational phase through both regular operations and expert corrections, and the policies are relearned periodically from the continuously-growing dataset.
- the robot can quantify a confidence in its ability to solve the task given its current state in order to predict failure. While this role can be filled by a state-value function in the Reinforcement Learning paradigm, Behavioral Cloning does not provide an explicit value function representation. In some implementations, this gap can be filled by applying Reinforcement Learning approaches alongside Behavioral Cloning. In some of those implementations, Reinforcement Learning approaches can be applied alongside Behavioral Cloning for the purpose of obtaining a state-value representation, which comes at great computational cost and low data efficiency. [0045] In some implementations, BCVA can be used to determine state-value representations, where the state-value representations are conditioned by a history of policies learned through incremental updates to the training data and the resulting evaluations from said policies.
- BCVA has three advantages: (1) BCVA can allow the state values to be batch-computed offline using policy evaluation data (and variable reward discounting regimes) prior to training, rather than through exploration in expensive paradigms such as Reinforcement Learning. (2) BCVA can be learned simply as a regression head on top of an existing Behavioral Cloning model, allowing for low-cost training and inference as well as weight sharing. (3) When trained jointly with the Behavioral Cloning model, BCVA can allow failure examples to also be used for representation learning, increasing the data available for learning a state embedding that helps the policy generalize better.
- BCVA on a mobile manipulation robot can be used in solving the latched door opening task, where the robot must approach a closed latched door, grasp the handle and rotate it to release the latch, drive forward to open the door, and enter the room.
- This task is challenging due to the large number of failure modes that stem from minor errors, e.g., being unable to find or grasp the handle, grasping it too close to the pivot point to be able to apply the necessary force, or colliding with the door frame or the door itself.
- the problem of Failure Detection and Prediction in robotics has received considerable interest in the past decades.
- approaches based on Partially-Observable Markov Decision Processes (POMDPs) can be impractical for vision-based mobile manipulation since they largely rely on tabular data and the presence of explicit transition models, even if approximate, which makes them incompatible with Behavioral Cloning.
- the problem of detecting success/failure probabilities has been studied in the context of Imitation Learning policies. For example, Imitation Learning policies have been used in detecting out-of-distribution states to prevent robot failures.
- a related problem in the context of Imitation Learning is Dataset Collection, i.e., how to collect the right demonstrations that allow a Behavioral Cloning policy to improve on failure scenarios with the highest possible data efficiency.
- DAgger is a well-tested mechanism that can be used at the time of data collection (in the training phase) to make sure that frequently-encountered but out-of-training-distribution states get new labelled data to help the policy recover from such states.
- the system can jointly learn: (1) a policy π(a|s) that outputs an action a (e.g., a joint velocity) given a state s (e.g., RGB and depth images) in order to complete the task, and (2) a state value estimate of that state
- the system can use Behavioral Cloning to learn to imitate this policy, where the objective is to minimize the divergence between the learned policy π(a|s) and the expert demonstrations
- the policy π(a|s) is trained jointly with a value estimate V̂(s) from previous episodes' labelled rollouts.
- the system can define the distance function as the normalized sum of the absolute pixel differences:

  d(s_i, s_j) = (1 / (w · h · c)) Σ_{x,y,k} |s_i(x, y, k) − s_j(x, y, k)|     (4)

  [0080] where w is the RGB sensor width, h the RGB sensor height, c the number of channels, and s_i(x, y, k) the value of the k'th channel of the pixel at (x, y) of state s_i.
- the sample mean ⁇ and standard deviation ⁇ are used to scale the values such that one standard deviation falls in the (0.5, 1.5) range, suitable for use as the discount factor exponent.
- the system can determine the kinematic state difference as the normalized sum of differences in the joint states and robot base pose:

  d_kin(s_a, s_b) = Σ_{j ∈ J} |θ_a(j) − θ_b(j)| + Σ_w |v_a(w) − v_b(w)|

  [0083] where J is the set of all joints of the robot, θ(j) indicates the angular position of joint j, and v(w) indicates the Cartesian velocity of the robot base along world axis w.
- the joint angles and Cartesian coordinates can be scaled separately since they have different units.
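A sketch of the two distance measures under the reconstructions above, assuming NumPy arrays and simple mean-based normalization (the exact normalization constants are not recoverable from the text):

```python
import numpy as np

def pixel_distance(s_i: np.ndarray, s_j: np.ndarray) -> float:
    """Normalized sum of absolute per-pixel differences between two
    (h, w, c) RGB states."""
    h, w, c = s_i.shape
    diff = np.abs(s_i.astype(float) - s_j.astype(float))
    return float(diff.sum() / (w * h * c))

def kinematic_distance(joints_a, joints_b, base_vel_a, base_vel_b) -> float:
    """Normalized sum of joint-angle and base-velocity differences; the two
    terms are scaled separately since they have different units."""
    joint_term = np.abs(np.asarray(joints_a) - np.asarray(joints_b)).mean()
    base_term = np.abs(np.asarray(base_vel_a) - np.asarray(base_vel_b)).mean()
    return float(joint_term + base_term)
```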
- the goal is to jointly learn a policy π(a|s) and a state value estimate V̂(s)
- the system can combine a stochastic state encoder network q(z|s) with an action decoder and a value decoder that share the resulting state embedding z
- the system can learn a policy that will replicate the expert actions, by applying a Huber loss between the demonstrated actions a* and predictions â obtained through Monte Carlo sampling from a neural network decoder, as the Behavioral Cloning loss:

  L_BC = Huber(a*, â)

  [0087] where a ∈ R^10 is the action command that includes: 2 DoF for the robot base, 9 DoF for the robot arm, and 1 DoF for the task termination prediction.
- the Behavioral Cloning loss is only applied to training instances that are expert demonstrations, e.g. evaluation instances from previous policies are not used for policy learning.
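A minimal sketch of the Huber-based Behavioral Cloning loss, with plain arrays standing in for Monte Carlo samples from the decoder; note that only expert-demonstration instances contribute:

```python
import numpy as np

def huber_loss(a_star: np.ndarray, a_hat: np.ndarray, delta: float = 1.0) -> float:
    """Huber loss between demonstrated actions a* and predictions a-hat."""
    err = np.abs(a_star - a_hat)
    quadratic = 0.5 * err ** 2                 # small errors: quadratic
    linear = delta * (err - 0.5 * delta)       # large errors: linear tail
    return float(np.where(err <= delta, quadratic, linear).mean())

def bc_loss(batch, predict) -> float:
    """Average Huber loss over expert demonstrations only; evaluation
    rollouts from previous policies are skipped."""
    demos = [(s, a) for s, a, is_demo in batch if is_demo]
    return float(np.mean([huber_loss(a, predict(s)) for s, a in demos]))
```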
- state values in the discrete-state context can be computed by averaging the discounted returns over each time the state is visited, where T_s is the set of trajectories that visit state s and G_τ(s) is the discounted return from s under trajectory τ:

  V(s) = (1 / |T_s|) Σ_{τ ∈ T_s} G_τ(s)

  [0090] However, since the state representation (RGB or depth images) is continuous, each state s typically appears in exactly one trajectory; but the system wants the learned state value information to be shared between similar states.
- the system can use the stochastic value decoder p(V|z) to share learned state value information between similar states
- the system can apply a Kullback–Leibler divergence loss L_KL between the state embedding posterior q(z|s) and a learned prior p(z)
- the posterior q(z|s) is a 2-layer MLP, and the value decoder p(V|z) operates on the shared state embedding z
- the learned prior p(z) is a multivariate Gaussian mixture with 512 components and learnable parameters.
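A conceptual sketch of the joint objective: the Behavioral Cloning loss plus a value-regression term and the KL term between the posterior q(z|s) and the learned prior p(z). The squared-error value loss, the loss weights, and the Monte Carlo KL estimate are assumptions (a 512-component Gaussian mixture prior admits no closed-form KL):

```python
# `log_q` and `log_p` are hypothetical log-density callables for the
# posterior q(z|s) and the learned mixture prior p(z); z_samples are
# embeddings sampled from the posterior.
def total_loss(bc, value_pred, value_target, z_samples, log_q, log_p,
               w_value: float = 1.0, w_kl: float = 0.01) -> float:
    n = max(len(value_pred), 1)
    value_loss = sum((vp - vt) ** 2 for vp, vt in zip(value_pred, value_target)) / n
    # KL(q || p) estimated by Monte Carlo: E_{z~q}[log q(z) - log p(z)]
    kl = sum(log_q(z) - log_p(z) for z in z_samples) / max(len(z_samples), 1)
    return bc + w_value * value_loss + w_kl * kl
```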
- the system is able to output the policy action a_t and the state value V(s_t) given a state s_t.
- various criteria can be used to decide when to ask for help: if the state value V(s_t) < φ, for some constant φ, for more than some constant N past frames, the system stops and asks for help.
- otherwise, the system continues executing a_t, observing s_{t+1}, and re-evaluating the asking-for-help criteria.
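The asking-for-help criterion lends itself to a short sketch: stop once the state value has stayed below φ for N consecutive frames (the φ and N values here are placeholders for the tuned constants):

```python
from collections import deque

class AskForHelp:
    """Flags a stop when V(s_t) < phi for n_frames consecutive frames."""
    def __init__(self, phi: float = 0.3, n_frames: int = 5):
        self.phi = phi
        self.values = deque(maxlen=n_frames)

    def should_ask(self, state_value: float) -> bool:
        self.values.append(state_value)
        return (len(self.values) == self.values.maxlen
                and all(v < self.phi for v in self.values))
```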
- the system can tune the thresholds φ and N by running the value estimate on rollouts from the (human-labelled) validation set, computing the episode-level confusion matrix across different values of φ and N, and picking appropriate values such that the model satisfies requirements in terms of both overall precision and recall, as well as being able to correctly flag a small, hand-selected sample of failures of concern.
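The tuning step can likewise be sketched as a grid search over (φ, N) on human-labelled validation rollouts, computing episode-level precision and recall (the episode representation and sweep ranges are illustrative):

```python
# `episodes` is a list of (per-step state values, failed?) pairs; an episode
# is flagged if any window of n consecutive values falls below phi.
def tune_thresholds(episodes, phis, ns):
    results = {}
    for phi in phis:
        for n in ns:
            tp = fp = fn = 0
            for values, failed in episodes:
                flagged = any(all(v < phi for v in values[i:i + n])
                              for i in range(len(values) - n + 1))
                tp += flagged and failed
                fp += flagged and not failed
                fn += (not flagged) and failed
            precision = tp / (tp + fp) if (tp + fp) else 0.0
            recall = tp / (tp + fn) if (tp + fn) else 0.0
            results[(phi, n)] = (precision, recall)
    return results
```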
- the model can be trained using a real-world dataset of ~2900 expert demonstrations, and the value estimate can be bootstrapped with ~9000 episodes of policy rollouts under a fully-supervised human operator setting.
- all expert demonstrations were success cases, whereas each of the policy rollouts was manually labelled as success or failure at the end of the episode, and discounted rewards for each step can be calculated offline post-factum.
- the policy rollouts were executed and labelled in a span of multiple months using 100+ different Behavioral Cloning models each independently trained with the action decoder head only.
- FIG.1 illustrates an example environment in which implementations related to training failure neural network (NN) models can be implemented.
- the example environment includes a robot 100, a robot system 104, a training system 126, and a user input system 136.
- FIG.1 One or more of these components of FIG.1 can be communicatively coupled over one or more networks 102, such as local area networks (LANs), wide area networks (WANs), and/or any other communication network.
- the environment also includes a non-limiting example of a failure NN model 114, an embedding NN model 112, and a control policy 116.
- the robot 100 illustrated in FIG.1 is a particular real-world mobile robot. However, additional and/or alternative robots can be utilized with techniques disclosed herein, such as additional robots that vary in one or more respects from robot 100 illustrated in FIG.1.
- a stationary robot arm, a mobile telepresence robot, a mobile forklift robot, an unmanned aerial vehicle (UAV), and/or a humanoid robot can be utilized instead of or in addition to robot 100, in techniques described herein.
- the robot 100 may include one or more engines implemented by processor(s) of the robot and/or by one or more processor(s) that are remote from, but in communication with, the robot 100.
- the robot 100 includes one or more vision components that can generate instances of vision data (e.g., images, point clouds, etc.) related to shape, color, depth, and/or other features of object(s) that are in the line of sight of the vision components.
- the instances of the vision data generated by one or more of the vision components can form some or all of state data (e.g., environmental state data and/or robot state data).
- the robot 100 can also include position sensor(s), torque sensor(s), and/or other sensor(s) that can generate data and such data, or data derived therefrom, can form some or all of state data (if any).
- one or more vision components that can generate the instances of the vision data may be located external from the robot.
- One or more of the vision components 142 may be, for example, a monocular camera, a stereographic camera (active or passive), and/or a light detection and ranging (LIDAR) component.
- a LIDAR component can generate vision data that is a 3D point cloud with each of the points of the 3D point cloud defining a position of a point of a surface in 3D space.
- a monocular camera may include a single sensor (e.g., a charge-coupled device (CCD)), and generate, based on physical properties sensed by the sensor, images that each include a plurality of data points defining color values and/or grayscale values. For instance, the monocular camera may generate images that include red, blue, and/or green channels.
- a stereographic camera may include two or more sensors, each at a different vantage point, and can optionally include a projector (e.g., infrared projector).
- the stereographic camera generates, based on characteristics sensed by the two sensors (e.g., based on captured projection from the projector), images that each includes a plurality of data points defining depth values and color values and/or grayscale values. For example, the stereographic camera may generate images that include a depth channel and red, blue, and/or green channels.
- the robot 100 also includes a base 113 with wheels 148A, 148B provided on opposed sides thereof for locomotion of the robot 100.
- the base 113 may include, for example, one or more motors for driving wheels 148A, 148B of the robot 100 to achieve a desired direction, velocity, and/or acceleration of movement for the robot 100.
- the robot 100 also includes one or more processors that, for example, provide control commands to actuators and/or other operational components thereof (e.g., control policy engine 108 as described herein).
- the robot 100 also includes robot arm 144 with end effector 146 that takes the form of a gripper with two opposing “fingers” or “digits” 146A, 146B. Additional and/or alternative end effectors can be utilized, or even no end effector.
- a robotic control policy 116 can be initially trained based on human demonstrations of various robotic tasks. As the human demonstrations are performed, demonstration data can be generated via the user input system 136, and can be stored in demonstration database 124.
- the demonstration data can include, for example, instances of vision data generated by one or more of the vision components 142 during the performance of a given human demonstration of a given robotic task, state data of the robot 100 and/or the environment corresponding to the instances of the vision data captured during the given human demonstration of the given robotic task, and corresponding sets of values for controlling respective components of the robot 100 corresponding to the instances of the vision data captured during the given human demonstration.
- user input engine 138 can detect user input to control the robot 100, and intervention engine 140 can generate the corresponding sets of values for controlling the respective components of the robot 100.
- the corresponding sets of values utilized in controlling a respective component of the robot 100 can be, for example, a vector that describes a translational displacement and/or rotation (e.g., a sine-cosine encoding of the change in orientation about an axis of the respective component) of the respective component, lower-level control command(s) (e.g., individual torque commands that control corresponding actuator(s) of the robot 100, individual joint angles of component(s) of the robot, etc.), binary values for component(s) of the robot (e.g., indicative of whether a robot gripper should be opened or closed), other values for component(s) of the robot (e.g., robot arm movement, robot base movement, etc.), and/or other values that can be utilized to control the robot 100.
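As a small sketch of the sine-cosine orientation encoding mentioned above, which represents a rotation as (sin θ, cos θ) so the value stays continuous across the ±π wrap-around:

```python
import math

def encode_rotation(theta: float) -> tuple[float, float]:
    """Encode a change in orientation about an axis as (sin, cos)."""
    return math.sin(theta), math.cos(theta)

def decode_rotation(s: float, c: float) -> float:
    """Recover the angle from its sine-cosine encoding."""
    return math.atan2(s, c)
```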
- a human can utilize one or more computing devices, or input devices thereof (not depicted), to control the robot 100 to perform the human demonstrations of the robotic task.
- the user can utilize a controller associated with the computing device to control the robot 100, an input device associated with an additional computing device, or any other input device of any computing device in communication with the robot 100, and the demonstration data can be generated based on the instances of the vision data captured by one or more of the vision components 142, and based on the user control of robot 100.
- the user can physically manipulate the robot 100 or one or more components thereof (e.g., the base 113, the robot arm 144, and/or other components).
- the user can physically manipulate the robot arm 144, and the demonstration data can be generated based on the instances of the vision data captured by one of the vision components 142, and based on the physical manipulation of the robot 100.
- the user can repeat this process to generate demonstration data for performance of various robotic tasks.
- the human demonstrations can be performed in a real-world environment of the robot 100.
- Robot system 104 can include embedding engine 106, control policy engine 108, failure engine 118, threshold engine 120, action engine 122, one or more additional or alternative engines, and/or combinations thereof.
- Embedding engine 106 can process one or more instances of vision data (e.g., data captured via one or more vision sensors 142) using embedding NN model 112 to generate an embedding.
- the embedding can be a stochastic embedding that parameterizes the means and covariances of a multivariate distribution over possible embeddings.
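A minimal sketch of sampling from such a stochastic embedding, assuming the encoder outputs a mean and a log-variance of a multivariate Gaussian over possible embeddings (the diagonal covariance is an assumption here):

```python
import numpy as np

def sample_embedding(mean: np.ndarray, log_var: np.ndarray,
                     rng: np.random.Generator) -> np.ndarray:
    """Draw one concrete embedding from the parameterized distribution."""
    std = np.exp(0.5 * log_var)
    eps = rng.standard_normal(mean.shape)   # reparameterization-style noise
    return mean + std * eps
```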
- Control policy engine 108 can process one or more embeddings (e.g., embeddings generated using embedding engine 106) using control policy 116 to generate action output indicating one or more actions for one or more components of the robot to perform.
- action engine 122 can process the action output in causing the robot to perform the corresponding action(s).
- failure engine 118 can process the embedding (e.g., the embedding generated using embedding engine 106) using failure NN model 114 to generate failure output indicating the likelihood the robot will fail performance of the robotic task.
- the same embedding, generated based on the same instance of vision data can be processed using the failure NN model 114 and the control policy 116.
- threshold engine 120 can determine whether the failure output, generated using failure engine 118, satisfies a threshold value indicating the likelihood of a failure.
- a user can utilize one or more computing devices (not depicted), the training system 126, the user input system 136, and the robot system 104 to train a robotic control policy for controlling the robot 100 in performance of various robotic tasks, to train the failure NN model 114 for determining the likelihood the robot 100 will fail in performance of the various robotic tasks, and/or to train the embedding NN model 112 to generate an embedding based on vision data captured in the environment of the robot.
- the robotic control policy can correspond to one or more machine learning (ML) models and a system that utilizes output, generated using the one or more ML models, in controlling the robot system 104 and/or various engines thereof.
- the techniques described herein relate to training and refining robotic control policies and/or failure NN models using imitation learning techniques.
- the robotic control policy can initially be trained based on demonstration data 124 (e.g., demonstration data stored in a database) and that is based on human demonstrations of various robotic tasks. Further, and subsequent to the initial training, the robotic control policy can be refined based on human interventions that are received during performance of various robotic tasks by the robot 100. Moreover, and subsequent to the refining, the robotic control policy can be deployed for use in controlling the robot 100 during future robotic tasks.
- the embedding NN model 112 and/or the failure NN model 114 can be trained based on the demonstration data 124, and refined based on human interventions that are received during performance of various robotic tasks.
- a robotic control policy, a failure NN model, and/or an embedding NN model can be initially trained based on human demonstrations of various robotic tasks. As the human demonstrations are performed, demonstration data can be generated via the user input system 136, and can be stored as demonstration data 124.
- the demonstration data can include, for example, instances of vision data generated by one or more of the vision components 142 during performance of a given human demonstration of a given robotic task, state data of the robot 100 and/or the environment corresponding to the instances of the vision data captured during the given human demonstration of the given robotic task, corresponding sets of values for controlling respective components of the robot 100 corresponding to the instances of the vision data captured during the human demonstration.
- user input engine 138 can detect user input to control the robot 100, and intervention engine 140 can generate the corresponding sets of values for controlling the respective components of the robot 100.
- the corresponding sets of values utilized in controlling a respective component of the robot 100 can be, for example, a vector that describes a translational displacement and/or rotation (e.g., a sine-cosine encoding of the change in orientation about an axis of the respective component) of the respective component, lower-level control command(s) (e.g., individual torque commands that control corresponding actuator(s) of the robot 100, individual joint angles of component(s) of the robot, etc.), binary values for component(s) of the robot (e.g., indicative of whether a robot gripper should be opened or closed), other values for component(s) of the robot 100 (e.g., indicative of an extent to which the robot gripper 146 should be opened or closed), velocities and/or accelerations of the component(s) of the robot 100 (e.g., robot arm movement, robot base movement, etc.), and/or other values that can be utilized to control the robot 100.
- FIG.2 illustrates an example 200 of generating failure output and/or action output in accordance with various implementations described herein.
- Example 200 includes processing an instance of vision data 202 using an embedding NN model 112 to generate an embedding 206.
- the instance of vision data can be captured via one or more vision sensors of the robot such as (but not limited to) one or more cameras, one or more RGB cameras, one or more depth cameras, one or more additional or alternative sensors, and/or combinations thereof.
- the vision sensor(s) can be affixed to the robot.
- one or more cameras can be mounted onto the robot to capture vision data of the environment of the robot.
- the vision sensor(s) can be fixed to object(s) in the environment of the robot.
- a camera can be affixed to a wall in the environment with the robot, to an additional robot in the environment, to one or more stationary objects in the environment, to one or more mobile objects in the environment, and/or combinations thereof.
- vision data can be captured via vision sensors affixed to the robot and to object(s) in the environment of the robot.
- vision data can be captured via one or more vision sensors 142 of robot 100 as described herein with respect to FIG.1.
- the embedding 206 can be a stochastic embedding that parameterizes a distribution.
- the stochastic embedding can parameterize the means and covariances of a multivariate distribution over possible embeddings.
- embedding 206 can be processed using robot control policy 116 to generate action output 214.
- Action output can include one or more corresponding actions to be performed by each of one or more components of the robot.
- the action output can include action(s) for a robot base, one or more robot arms, one or more end effectors, etc.
- embedding 206 can be processed using a failure NN model 114 to generate failure output 210.
- the failure output 210 can indicate a likelihood of the robot failing to perform the robotic task immediately and/or at some point in the future.
- FIGS.3A-3E illustrate examples of determining whether failure output satisfies a threshold in accordance with various implementations.
- FIG.3A includes failure output 210A and a threshold likelihood value 302.
- the system can determine whether failure output 210A satisfies the threshold likelihood value 302 based on whether the failure output is larger than the threshold likelihood value, is smaller than the threshold likelihood value, is equal to the threshold likelihood value, based on one or more additional or alternative comparisons with the threshold likelihood value, and/or combinations thereof.
- the system can determine failure output 210A satisfies the threshold likelihood value 302 thus indicating the robot will fail in performance of the robotic task.
- FIG.3B illustrates an example of failure output 210B and threshold likelihood value 304. In FIG.3B, the system determines failure output 210B does not satisfy the threshold likelihood value 304. In some implementations, the system can determine whether the robot will fail the task based on processing several instances of failure output.
- FIG.3C illustrates a threshold likelihood value 306, a first instance of failure output 210C, and a second instance of failure output 210D.
- the first instance of failure output 210C can be generated based on processing a first instance of vision data captured while a robot is performing a given task.
- the second instance of failure output 210D can be generated based on processing a second instance of vision data captured while the robot is performing the given task.
- the first instance of vision data can precede the second instance of vision data.
- the first instance of vision data can capture the robot prior to performing one or more actions for the given task, and the second instance of vision data can capture the robot performing the task 10 seconds after the first instance.
- the first instance of vision data can be captured immediately prior to the second instance of vision data.
- the system can determine the robot will fail the task based on whether multiple instances of failure output 210 satisfy a threshold value. For example, the system can determine the first instance of failure output 210C and the second instance of failure output 210D both satisfy the threshold likelihood value 306, thus the system can determine the robot will fail performance of the task.
- FIG.3D illustrates a first failure output 210E, a second failure output 210F, and a threshold likelihood value 308, where the first failure output 210E satisfies the threshold likelihood value 308, but the second failure output 210F does not satisfy the threshold likelihood value 308.
- the system will not determine the robot will fail the task based on the first failure output 210E satisfying the threshold likelihood value 308 while the second failure output 210F does not satisfy the threshold likelihood value 308.
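- One plausible realization of this multi-instance check is a sliding window that reports predicted failure only when every recent failure output satisfies the threshold; the window size and the direction of comparison below are assumptions for illustration:

```python
from collections import deque

def make_failure_checker(threshold: float, window: int = 2):
    """Return a callable that flags failure only when the last `window`
    failure outputs all exceed the threshold likelihood value."""
    recent = deque(maxlen=window)

    def will_fail(failure_output: float) -> bool:
        recent.append(failure_output)
        return len(recent) == window and all(f > threshold for f in recent)

    return will_fail

check = make_failure_checker(threshold=0.8)
assert check(0.9) is False   # a single satisfying instance is not sufficient
assert check(0.95) is True   # two consecutive satisfying instances (as in FIG. 3C)
```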
- the threshold likelihood value utilized by the system in determining whether the robot will fail performance of the robotic task can be determined based on one or more factors.
- FIG.3E includes a first failure output 210G, a second failure output 210H, a first threshold likelihood value 310, a second threshold likelihood value 312, and a third threshold likelihood value 314.
- the system can select a given threshold value from a plurality of threshold likelihood values based on the status of a computing device (e.g., the computing device via which a user can intervene with the performance of the robotic task). Additionally or alternatively, the system can select a threshold value based on whether one or more objects are detected in the environment of the robot.
- Objects can include stationary objects (doors, walls, tables, doorknobs, etc.), one or more objects for the robot to manipulate (e.g., a tool for the robot to manipulate with an end effector), and/or one or more mobile objects (e.g., one or more additional robots, one or more people, etc.).
- the system can select a threshold value based on whether one or more objects are detected within a threshold distance of the robot.
- the example illustrated in FIG. 3E includes failure output 210G satisfying the first threshold likelihood value 310, the second threshold likelihood value 312, and the third threshold likelihood value 314. However, second failure output 210H does not satisfy the first threshold likelihood value 310 while second failure output 210H does satisfy the second threshold likelihood value 312 and the third threshold likelihood value 314.
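- A context-dependent threshold selection along these lines could be sketched as follows; the particular values and the decision rule are illustrative assumptions only:

```python
def select_threshold(device_available: bool, object_nearby: bool,
                     thresholds=(0.9, 0.7, 0.5)) -> float:
    """Pick one of several candidate threshold likelihood values based on
    whether a user's computing device is available to intervene and whether
    object(s) are detected within a threshold distance of the robot."""
    high, mid, low = thresholds
    if object_nearby:
        return low    # most sensitive: flag failure earlier near objects/people
    if not device_available:
        return mid    # no user readily available to intervene
    return high       # a user can intervene, so tolerate higher likelihoods

threshold = select_threshold(device_available=True, object_nearby=False)  # 0.9
```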
- FIGS.4A-4B are a flowchart illustrating an example process 400 in accordance with a variety of implementations described herein.
- The system that performs the operations of process 400 may include one or more processors, such as processor(s) of robot 100, robot 520, and/or computing device 610.
- While operations of process 400 are shown in a particular order, this is not meant to be limiting.
- One or more operations may be reordered, omitted and/or added.
- At block 402, the system begins performing a robotic task.
- the system can receive an instance of vision data capturing an environment of a robot during performance of the robotic task.
- the robot can receive an instance of vision data captured by one or more vision sensors 142 of robot 100 as described herein with respect to FIG.1.
- At block 404, the system generates an embedding based on processing the instance of vision data using an encoder neural network (NN) model.
- the system can use embedding engine 106 of robot 100 to process the instance of vision data using embedding NN model 112 to generate the embedding as described herein with respect to FIG. 1.
- At block 406, the system processes the embedding using a robotic control policy to generate action output.
- the system can use control policy engine 108 of robot 100 to generate the action output as described herein with respect to FIG.1.
- At block 408, the system processes the embedding using a failure NN model to generate failure output.
- the failure output indicates a likelihood of the robot failing to complete the robotic task.
- the system can use failure engine 118 of robot 100 to process the embedding using failure NN model 114 to generate the failure output as described herein with respect to FIG.1.
- At block 410, the system determines whether the failure output indicates the robot will fail performance of the robotic task. In some implementations, the system can determine, based on the failure output, that the robot will fail performance of the robotic task based on the action output.
- the system can determine, based on the failure output, the robot will fail performance of the robotic task at some point in the future. If the system determines the failure output indicates the robot will fail performance of the robotic task, the system proceeds to block 412. If the system determines the failure output does not indicate the robot will fail performance of the robotic task, the system proceeds to block 416.
- the system can use threshold engine 120 of robot 100 as described herein with respect to FIG.1 in determining whether the failure output indicates the robot will fail performance of the robotic task.
- At block 412, the system receives user interface input from a user of a computing device. In some implementations, the user interface input intervenes with performance of the robotic task.
- the system can receive user interface input intervening with performance of the robotic task using user input engine 138 of robot 100 as described herein with respect to FIG.1. Additionally or alternatively, the system can receive one or more instances of action output using intervention engine 140 of robot 100 described herein with respect to FIG.1.
- At block 414, the system causes the robot to complete performance of the robotic task based on the user interface input. Once the robot completes performance of the robotic task, the process ends.
- At block 416, the system causes the robot to perform one or more actions based on the action output. In some of those implementations, the one or more actions can be in furtherance of the robot performing the robotic task.
- the system can use action engine 122 of robot 100 as described herein with respect to FIG.1 in performance of the one or more actions based on the action output.
- At block 418, the system determines whether the robot has completed performance of the task. If so, the process ends. If the system determines the robot has not completed performance of the robotic task, the system proceeds back to block 402, and receives an additional instance of vision data capturing the environment of the robot. In some implementations, the additional instance of vision data can reflect one or more actions performed by the robot in the previous iteration. Subsequently, the process can proceed to blocks 404, 406, 408, 410, 412, 414, and 416 based on the additional instance of vision data.
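- The loop of process 400 can be summarized in the following sketch; every callable parameter is a hypothetical stand-in for the corresponding engine described with respect to FIG.1, and the block labels in the comments follow the numbering used in the flowchart discussion above:

```python
def run_process_400(receive_vision, embed_fn, policy_fn, failure_fn,
                    will_fail, get_user_input, execute, task_complete,
                    max_steps: int = 1000) -> None:
    """Sketch of process 400: iterate until the task completes, a predicted
    failure triggers user intervention, or max_steps is exhausted."""
    for _ in range(max_steps):
        vision = receive_vision()                # block 402
        embedding = embed_fn(vision)             # block 404
        action_output = policy_fn(embedding)     # block 406
        failure_output = failure_fn(embedding)   # block 408
        if will_fail(failure_output):            # block 410
            execute(get_user_input())            # blocks 412 and 414
            return                               # task completed via user input
        execute(action_output)                   # block 416
        if task_complete():                      # block 418
            return
```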
- FIG.5 schematically depicts an example architecture of a robot 520.
- the robot 520 includes a robot control system 560, one or more operational components 504a-n, and one or more sensors 508a-m.
- the sensors 508a-m can include, for example, vision components, pressure sensors, positional sensors, pressure wave sensors (e.g., microphones), proximity sensors, accelerometers, gyroscopes, thermometers, barometers, and so forth.
- Operational components 504a-n can include, for example, one or more end effectors (e.g., grasping end effectors) and/or one or more servo motors or other actuators to effectuate movement of one or more components of the robot.
- the robot 520 can have multiple degrees of freedom and each of the actuators can control actuation of the robot 520 within one or more of the degrees of freedom responsive to control commands provided by the robot control system 560 (e.g., torque and/or other commands generated based on action outputs from a trained action ML model).
- the term actuator encompasses a mechanical or electrical device that creates motion (e.g., a motor), in addition to any driver(s) that may be associated with the actuator and that translate received control commands into one or more signals for driving the actuator. Accordingly, providing a control command to an actuator can comprise providing the control command to a driver that translates the control command into appropriate signals for driving an electrical or mechanical device to create desired motion.
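- As a toy illustration of this actuator/driver relationship, the stub below (entirely hypothetical, with an assumed linear torque-to-signal mapping) shows a control command being handed to a driver that converts it into a drive signal:

```python
class StubDriver:
    """Hypothetical driver that translates a torque command into a
    normalized duty-cycle signal for an electric motor."""

    def torque_to_signal(self, torque_nm: float) -> float:
        # Assumed linear mapping with saturation at +/- 10 N*m.
        return max(-1.0, min(1.0, torque_nm / 10.0))

    def apply(self, signal: float) -> None:
        print(f"applying duty cycle {signal:+.2f}")

def provide_control_command(driver: StubDriver, torque_nm: float) -> None:
    """Providing a control command to an 'actuator' here means providing it
    to the driver, which generates the signals that create motion."""
    driver.apply(driver.torque_to_signal(torque_nm))

provide_control_command(StubDriver(), torque_nm=3.5)
```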
- the robot control system 560 can be implemented in one or more processors, such as a CPU, GPU, and/or other controller(s) of the robot 520. In some implementations, the robot 520 may comprise a “brain box” that may include all or aspects of the control system 560.
- the brain box may provide real time bursts of data to the operational components 504a-n, with each of the real time bursts comprising a set of one or more control commands that dictate, inter alia, the parameters of motion (if any) for each of one or more of the operational components 504a-n.
- the control commands can be at least selectively generated by the control system 560 based at least in part on final predicted action outputs and/or other determination(s) made using action machine learning model(s) that are stored locally on the robot 520, such as those described herein.
- control system 560 is illustrated in FIG.5 as an integral part of the robot 520, in some implementations, all or aspects of the control system 560 can be implemented in a component that is separate from, but in communication with, robot 520.
- all or aspects of control system 560 may be implemented on one or more computing devices that are in wired and/or wireless communication with the robot 520, such as computing device 610 of FIG.6.
- FIG.6 is a block diagram of an example computing device 610 that can optionally be utilized to perform one or more aspects of techniques described herein.
- Computing device 610 typically includes at least one processor 614 which communicates with a number of peripheral devices via bus subsystem 612.
- peripheral devices may include a storage subsystem 624, including, for example, a memory subsystem 625 and a file storage subsystem 626, user interface output devices 620, user interface input devices 622, and a network interface subsystem 616.
- the input and output devices allow user interaction with computing device 610.
- Network interface subsystem 616 provides an interface to outside networks and is coupled to corresponding interface devices in other computing devices.
- User interface input devices 622 can include a keyboard, pointing devices such as a mouse, trackball, touchpad, or graphics tablet, a scanner, a touchscreen incorporated into the display, audio input devices such as voice recognition systems, microphones, and/or other types of input devices.
- User interface output devices 620 may include a display subsystem, a printer, a fax machine, or non-visual displays such as audio output devices.
- the display subsystem may include a cathode ray tube (CRT), a flat-panel device such as a liquid crystal display (LCD), a projection device, or some other mechanism for creating a visible image.
- the display subsystem may also provide non-visual display such as via audio output devices.
- the term “output device” is intended to include all possible types of devices and ways to output information from computing device 610 to the user or to another machine or computing device.
- Storage subsystem 624 stores programming and data constructs that provide the functionality of some or all of the modules described herein.
- the storage subsystem 624 may include the logic to perform selected aspects of one or more methods described herein.
- These software modules are generally executed by processor 614 alone or in combination with other processors.
- Memory 625 used in the storage subsystem 624 can include a number of memories including a main random access memory (RAM) 630 for storage of instructions and data during program execution and a read only memory (ROM) 632 in which fixed instructions are stored.
- a file storage subsystem 626 can provide persistent storage for program and data files, and may include a hard disk drive, a floppy disk drive along with associated removable media, a CD-ROM drive, an optical drive, or removable media cartridges.
- the modules implementing the functionality of certain implementations may be stored by file storage subsystem 626 in the storage subsystem 624, or in other machines accessible by the processor(s) 614.
- Bus subsystem 612 provides a mechanism for letting the various components and subsystems of computing device 610 communicate with each other as intended. Although bus subsystem 612 is shown schematically as a single bus, alternative implementations of the bus subsystem may use multiple busses.
- Computing device 610 can be of varying types including a workstation, server, computing cluster, blade server, server farm, or any other data processing system or computing device. Due to the ever-changing nature of computers and networks, the description of computing device 610 depicted in FIG.6 is intended only as a specific example for purposes of illustrating some implementations. Many other configurations of computing device 610 are possible having more or fewer components than the computing device depicted in FIG.6.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Robotics (AREA)
- Software Systems (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Mechanical Engineering (AREA)
- Mathematical Physics (AREA)
- Computing Systems (AREA)
- Health & Medical Sciences (AREA)
- Pure & Applied Mathematics (AREA)
- Algebra (AREA)
- Mathematical Analysis (AREA)
- Computational Mathematics (AREA)
- Probability & Statistics with Applications (AREA)
- Mathematical Optimization (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computational Linguistics (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Manipulator (AREA)
Abstract
Implementations of the present disclosure relate to training and refining failure neural network (NN) models and robotic control policies using imitation learning techniques. A failure NN model and a robotic control policy can initially be trained based on human demonstrations of various robotic tasks. In many implementations, an instance of vision data capturing the environment of the robot can be processed using an embedding model to generate an embedding. The given embedding can be processed using the failure NN model to generate failure output indicating the likelihood of the robot failing to complete the robotic task. In various implementations, the given embedding can also be processed using the robotic control policy to generate action output for use in controlling the robot in performance of the robotic task.
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US202263407134P | 2022-09-15 | 2022-09-15 | |
US63/407,134 | 2022-09-15 | ||
US202263408714P | 2022-09-21 | 2022-09-21 | |
US63/408,714 | 2022-09-21 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024059285A1 (fr) | 2024-03-21 |
Family
ID=88315956
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2023/032900 WO2024059285A1 (fr) | System(s) and method(s) of using behavioral cloning value approximation in training and refining robotic control policies |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2024059285A1 (fr) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20210031364A1 (en) * | 2019-07-29 | 2021-02-04 | TruPhysics GmbH | Backup control based continuous training of robots |
US20220105624A1 (en) * | 2019-01-23 | 2022-04-07 | Google Llc | Efficient adaption of robot control policy for new task using meta-learning based on meta-imitation learning and meta-reinforcement learning |
- 2023-09-15: WO PCT/US2023/032900 patent/WO2024059285A1 (fr), status unknown
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220105624A1 (en) * | 2019-01-23 | 2022-04-07 | Google Llc | Efficient adaption of robot control policy for new task using meta-learning based on meta-imitation learning and meta-reinforcement learning |
US20210031364A1 (en) * | 2019-07-29 | 2021-02-04 | TruPhysics GmbH | Backup control based continuous training of robots |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20200361082A1 (en) | Machine learning methods and apparatus for robotic manipulation and that utilize multi-task domain adaptation | |
EP3414710B1 (fr) | Procédés et appareil d'apprentissage automatique profond pour préhension robotique | |
US11717959B2 (en) | Machine learning methods and apparatus for semantic robotic grasping | |
US20210325894A1 (en) | Deep reinforcement learning-based techniques for end to end robot navigation | |
Shabbir et al. | A survey of deep learning techniques for mobile robot applications | |
US20180272529A1 (en) | Apparatus and methods for haptic training of robots | |
EP3784451A1 (fr) | Apprentissage profond par renforcement pour manipulation robotique | |
US20240308068A1 (en) | Data-efficient hierarchical reinforcement learning | |
US20240173854A1 (en) | System and methods for pixel based model predictive control | |
US11607802B2 (en) | Robotic control using action image(s) and critic network | |
US11772272B2 (en) | System(s) and method(s) of using imitation learning in training and refining robotic control policies | |
Ochi et al. | Deep learning scooping motion using bilateral teleoperations | |
US12061481B2 (en) | Robot navigation using a high-level policy model and a trained low-level policy model | |
US20240100693A1 (en) | Using embeddings, generated using robot action models, in controlling robot to perform robotic task | |
WO2024059285A1 (fr) | System(s) and method(s) of using behavioral cloning value approximation in training and refining robotic control policies | |
Gokmen et al. | Asking for help: Failure prediction in behavioral cloning through value approximation | |
US20220245503A1 (en) | Training a policy model for a robotic task, using reinforcement learning and utilizing data that is based on episodes, of the robotic task, guided by an engineered policy | |
US11610153B1 (en) | Generating reinforcement learning data that is compatible with reinforcement learning for a robotic task | |
US20240094736A1 (en) | Robot navigation in dependence on gesture(s) of human(s) in environment with robot | |
Tolani | Visual model predictive control | |
Bozdogana et al. | Multi-step planning with learned effects of (possibly partial) action executions. | |
Dag | Comparison of monolithic and hybrid controllers for multi-objective sim-to-real learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 23786863; Country of ref document: EP; Kind code of ref document: A1 |