WO2017129200A1

WO2017129200A1 - A system for real-world continuous motion optimization and control

Info

Publication number: WO2017129200A1
Application number: PCT/EP2016/000144
Authority: WO
Inventors: Nathan RATLIFF
Original assignee: MAX-PLANCK-Gesellschaft zur Förderung der Wissenschaften e.V.
Priority date: 2016-01-28
Filing date: 2016-01-28
Publication date: 2017-08-03

Abstract

A method for controlling a robot, comprising the steps - estimating a true acceleration of the robot; - determining a torque offset (w), based on a difference between a acceleration and the true acceleration of the robot - controlling the robot, based on the torque offset (w).

Description

A System for Real- World Continuous Motion

Optimization and Control The present invention relates to a kinematic motion optimization system for real world robotics.

Prior technologies in motion control typically revolve around various forms of trajectory following using PID control. There are a number of other techniques out there, but very few actually work well in practice to produce the required precision. The key is that these control techniques are various forms of online gradient descent. But before that was understood, may compliant control techniques revolved around using inverse dynamics models to calculate feed forward terms within a PID control loop on various configuration space or task space variables.

Because PID control works, many modern methodologies that attempt to follow acceleration policies use the acceleration policy as a "trajectory generator". The acceleration policy, whether it's a DMP differential equation or something more general (such as an LQR using acceleration controls), is typically integrated forward to form a trajectory with a current desired position, velocity, and acceleration that a PID controller can try to catch. Such systems can adapt to external changes conveyed through data, whether it's the visual recognition of a new obstacle that creates repulsive acceleration that superimpose with the original DMP, or repulsive accelerations created through actual detected contact force. But they can't be inherently compliant in a natural, easily tunable way. For in- stance, perturbations need to be explicitly classified as either disturbances from modeling errors (which should be rejected) or external disturbances from external interaction. And tuning that classification can be hard.

Acceleration policies are natural representations of motion, indicating how the robot should accelerate if it finds itself in a given configuration moving with a particular velocity. Writing down these policies is easy, especially for manipulation platforms with invert- ible dynamics. And, importantly, it's very easy to shape these policies as desired. Simulation is as easy as simple kinematic forward integration, which means that motion optimizers, especially those generating second-order approximations to the problems (LQRs), can analyze in advance what following these policies means. Additionally, LQRs can effectively shape a whole (infinite) bundle of integral curve trajectories simultaneously by adjusting (maybe through a training process) the strengths and shape of various cost terms. DMPs are another form of acceleration policy shaped through training. And importantly, acceleration policies are naturally reactive. They know what to do from any position and velocity the robot might find itself in.

Unfortunately, in practice following acceleration policies is hard. The fidelity of the dy- namics model defines how accurately we can simulate the behavior of the robot, and it's it is widely understood that one can never actually simulate the robot well enough to sufficiently predict its behavior on the real system.

Worse, much of the prior art in continuous optimization (often following the Model Pre- dictive Control (MPC) formalism) integrates the dynamics model directly into the optimization. Since, in many cases, the dynamics are either fully actuated and therefore in- vertible, or the dynamics are approximated as invertible, the resulting policy is, for the purpose of understanding its real world behavior, effectively a planned acceleration policy composed with an invalid approximate inverse dynamics model. And for that reason, all of these attempts at real-world control of a robot have failed.

There are various adaptive control methodologies aimed at similar problems in principle, but those methods generally assume that the system is changing only slowly over time (such as fuel expenditure of a rocket or airplane, or the changing tides or unexpected loads of a ship). There are some intricate adaptive control approaches for robot manipulators from the 8o's, but those applications make specific assumptions about the mass model (for instance, that the robot is simply holding something of unknown mass and the PID control need to adjust). The prior work lacks the real time immediacy, low latencies, and generality of our technique. And they certainly haven't been applied to robotics for the purpose of tracking general acceleration policies, the problem we're addressing here.

In general, Model Predictive Control is most successful as a control methodology. Only a very small set of roboticists are actually exploring those applications. Instead, practitioners of motion planning for real world systems reduce the problem to the hammer that works: PID control around a trajectory. Rather than calculating an entire policy, the typical literature, whether traditional randomized motion planning or optimization-based motion planning, reduce the problem to calculating a single trajectory to send down to the tried and true trajectory following controller. Even re-planning methodologies such as typically re-plan at the level of trajectories.

Re-planning that single trajectory is very difficult, especially at the point of transitioning from on planned trajectory to the re-planned trajectory. Rather than planning a single trajectory, we plan an acceleration policy - a differential equation that can be viewed as an infinite bundle of trajectories - and send that down to our adaptive controller for execution. Having an infinite collection of trajectories to choose from for control is critical both for natural reactive behaviors and for transitioning smoothly from one policy to the next when overlaying original plans with re-optimized plans whose computational laten- cy takes on the order of tenths of a second to a second. Our smooth real world performance comes from our ability to plan and re-plan differential equations (acceleration policies) rather than just single trajectories, which in turn is possible only because we use online learning adaptive control to patch up the inverse dynamics model in real time in order to accurately track the differential equation's integral curves.

There's no such thing as an accurate dynamics model for real world high degree of freedom robots. Older work in motion control, focusing on precise trajectory following for preprogrammed automation motions executed on factory robots, acknowledged this reality, and relegated torques calculated by inverse dynamics methods to helper roles known as feed forward torques. These feed forward torques are used in traditional PID methods simply as suggestions, not as a rule. Their role is solely to aid the control methodology in order to increase compliance through lowered gains. These methods have seen tremendous success in real world industry applications because precision dictates the constant control of measured kinematic variables.

However, in a movement to transition away from pre-planned kinematic trajectories, methodologies have drifted toward increased reliance on accurate dynamics models. Operational space control, for instance, attempts to follow a differential equations (desired accelerations) in the operational space by converting those signals directly into torques in the configuration space using a generalized form of inverse dynamics. Differential equations are ideal forms of policies because they tell the system what to do from any position and velocity. But with inaccurate dynamics models comes the inability to accurately follow the integral curves of these differential equations. Real-world performance, therefore, becomes an art of tweaking gains and other parameters of the operational space dif- ferential equation to compensate for these inaccuracies, largely reducing the efficacy of using an acceleration policy in the first place (over simple trajectory following).

Similarly, motivated by the optimal control literature, and especially older works in model predictive control (MPC) applied to chemical processing plants, new research has shifted toward leveraging optimization over longer time horizons to create more sophisticated anticipatory policies. These classical methods again produce full policies that in principle represent automatically generated reactive policies for the robot. However, in many of the older traditional optimal control applications, systems are lower dimensional and less articulated, making the dynamics simpler and easier to model accurately. Accordingly, the resulting policies attempt to directly map kinematic state (q,q) to an action and usually directly dictate the forces to apply to the joints. This methodology works when inverse dynamics is perfect, but it breaks down when that translation function from accelerations to torques is not exact. Each policy implicitly dictates a planned kinematic motion when executed under zero noise. But when the dynamics model is erroneous, as it necessary is in high dof real world systems (due to frictions, stictions, inaccuracies in the rigid body assumption, unmeasured variables, variables that change over time due to wear, etc.), the resulting motion in reality is systematically biased even without external disturbance.

And herein lies the problem: there is a planned kinematic motion, but the controller makes no attempt to control those kinematic variables. All modern real-world manipulation systems are build atop trajectory following control because in reality the dynamics model is never good enough to be trusted. This is a theoretical distinction that needs to be acknowledged for accurate real-world performance. And, accordingly, to attain sufficient real-world performance, state-of-the-art systems have been forced to adopt it.

Operational space control and model predictive control are ideal candidates for establish- ing a core technology behind the type of reactive and adaptive system described above. But in order to get there, we need a paradigm shift in the way we think of them that combines their attractive theoretical qualities with the robustness of simple PID trajectory control, today's real-world control workhorse. There are two lines of literature addressing issues related to motion generation from different perspectives. The first is the motion planning literature. Traditionally, this literature has focused on planning behaviors for real-world robots, which means they're forced to acknowledge the real-world realities of manipulator control. Many of the techniques, therefore, revolve around reducing the problem to trajectory following. Systems plan tra- jectories and pass those down to control system which uses a variant of PID control for robust accurate execution. Motivated by optimal control, more recent techniques have leveraged optimization for motion generation but the basic paradigm is still the same: plan or optimize (or plan then optimize) a trajectory, and pass that single thread of desired execution down to control.

However, the simulated behaviors discovered by cutting-edge Model Predictive Control techniques and other methodologies that generate full reactive policies for execution (such as the simpler operational space control) are enticing. They promise real time reac- tions to perturbations, changes in the environment, and modifications of the task. The reason why these systems work is because they effectively reason about not just a single trajectory, but an infinite bundle of trajectories, the corresponding kinematic differential equation's integral curves.

However, their real-world execution is cumbersome, because, as discussed above, to leverage traditional techniques these differential equations need to be reduced to a single execution trajectory (i.e. the current integral curve) whose kinematic profile can be singled out and controlled. But doing so thwarts the reactivity of the policy since the control system no longer actually sees it as a full policy, but just a single trajectory. So real world implementations of controllers following such policies need explicit logic for detecting perturbations away from the current integral curve in order to transition to a new integral curve when needed. But the same system needs to be robust to smaller disturbances produced simply by inaccurate dynamics in order to prevent unwanted drift way from the current integral curve. The next section describes a methodology for fixing the dynamics model in real time which enables the direct execution of such reactive differential equation policies of the form q - f(q, q) (frequently referred to as acceleration policies in this document). It is therefore an object of the present invention to provide a method and a device for controlling a robot that essentially avoids the above-mentioned difficulties.

This object is achieved by a method and a device according to the independent claims. Advantageous embodiments are defined in the dependent claims.

At a high level, the invention is a system for generating reactive and adaptive motions and/or manipulation behaviors on a real world robotic platform in the presence of (possibly moving) obstacles and changing targets. More, the system can handle multiple candidate manipulation objects simultaneously through parallelization to enable quick reac- tion to unpredictable choices of human co-workers. Throughout this document, as a running example, we will consider the traditional high degree-of-freedom collaborative manipulation robot as embodied by Max Planck's Apollo platform.

The system generates predictable, smooth, motions (often deemed "natural" to human onlookers) around obstacles in real time using (possibly continuously running) motion optimization, and integrates the resulting behaviors together with local controllers to implement interaction policies and react to perturbations and changes in the environment. The primary problems in prior art solved by this system are as follows:

The system exposes and controls the core kinematic quantity of control policies in fully actuated systems enabling real world execution of planned reactive policies. This control methodology bridges the gap between modern optimal control techniques that work well in simulation and the robust time honored methodologies designed around variants of trajectory following PID control that have become a staple of precise real world robot control. What distinguishes this technique from prior art is that it neither follows a single trajectory or embeds a dynamics model into the planned controller (which requires unre- alistically high accuracy inverse dynamics models). Instead, it separates out a more flexible reactive representation at the kinematic level (treating accelerations as controls), and leverages a feedback loop around the translation of those accelerations to torques which adapts the inverse dynamics model in real time to eliminate errors locally. Section 3 describes the details of the technical differences in full.

The system uses a class of novel adaptive control algorithms that enable direct execution of acceleration policies. These control methods are build on online learning and designed to directly estimate the current offset to the inverse dynamics model output needed to create the desired accelerations. Previous control methods adjusted for inaccurate mod- els of inverse dynamics in practice by reducing all policies to trajectories and controlling tightly around the trajectory. Such control, however, is restrictive since the reduction removes the advantages inherent in reactive policies. Our class of adaptive control operates directly on the inverse dynamics model to, in real time, locally patch up errors in how desired accelerations are translated to torques. This adaptive control enables direct execu- tion of planned acceleration policies.

The system leverages successive mathematical approximations to the same fully modeled motion optimization problem to mitigate the conflicting requirements of computationally complex high fidelity modeling and quick real time control. The full problem includes a large number of objective terms trading off criteria such as the speed of various body points, proximity to environment and joint limits, redundancy resolution biases, etc. as well as a large number of constraints such as environmental penetration constraints, end-effector orientation constraints, joint limit constraints, etc. The optimization at this level is computationally complex and, therefore, slow. Each layer, though, increases the control loop rate while successively increasing the degree of approximation. Importantly, all of these approximations are various forms of second-order Taylor approximation to the original problem, so each control loop is solving the same problem faster leveraging the computation from the layer above. The middle layer of the three layer architecture is novel QP representation of the original problem leveraging a cached Hessian computation from the layer above and fast band-diagonal solvers to resolve the constrained finite- horizon an order of magnitude faster than the top level optimization can. We introduce a task pipelining execution system that enables designers to easily implement and string together sequences of tasks. The system pipelines the optimizations so that each task's optimizer start as soon as possible, using predicted resulting states of previous tasks as starting points. It works for either one-shot task optimization or continuous re-optimization setups following an MPC-style control methodology.

We address the core computational bottleneck of the inner-most loop of the motion optimization with a parallel processing methodology designed to leverage hybrid architectures with both multi-core and many-core (GPU) components. Our system makes the computation of Hessian pullbacks (generalized Gauss-Newton approximations), which are the core computational bottleneck in kth-order Markov motion optimization systems of the form, efficient.

Our system is designed to accept and directly execute any form of acceleration policy using the adaptive control technique outlined below. This interface is the key ingredient that enables both the real time reactive behavior and the use of smoothly executable continuous re-optimization. Acceleration policies, in general, are full policies that specify accelerations for any configuration the robot might find itself in. The motion optimizer, for instance, returns an entire second-order approximation to the problem in the form of a Linear Quadratic Regulator (LQR) - a time varying affine policy - outputting accelera- tions. Re-optimizing the LQR produces an updated policy whose behavior at the current point of execution is very similar to the behavior of the original LQR, despite it potentially being very different further in the future. Transitioning from one LQR to the next is, therefore, very smooth since they both represent an infinite bundle of trajectories and not just a single trajectory (see Figure 2).

This section lists much of the terminology used throughout the document.

1. Acceleration policy. A mapping of the form q = f(q, q). This mapping assigns a desired acceleration to each position q and velocity q . In Optimal Control theory, we can view such a mapping as a control policy, with q playing the role of the system's output or action.

2. Integral curve. The curve through space resulting from integrating forward a dif- ferential equation. Each acceleration policy q = f(q, q) is a simple kinematic differential equation; the integral curves are formed by starting from some initial position and velocity q₀, q₀ and recursively applying the formulas q_M = q_t + Atq, + - 1 At 2 q·.,

Acceleration policies as infinite bundles of trajectories. Each integral curve of an acceleration policy can also be called a trajectory through the space. Since each initial condition or trajectory starting point (q₀ ,q₀ ) is associated with a potentially distinct integral curve, we can view an acceleration policy as an infinite bundle of trajectories. Acceleration policies are, therefore, distinct from single trajectories in that they represent multiple trajectories (actually an infinitude of trajectories) simultaneously.

PID control. A basic form of control used to follow trajectories. PID controllers consist of P (proportional), I (integral), and D (derivative) terms. The P and D terms update the control signal based on divergence of the system's position and velocity, respectively, from the desired values; the I term updates the control signal based on the integral of the position errors over time. See [29, 28].

Gains. Often various terms in control are added together with scaling factors scaling their relative strengths. For instance, each term (P, I, and D) of PID control (see above) are scaled by scaling factors before being added together to form the full control signal. Those scaling factors are called gains.

Differentiable function. Function of multiple inputs and a single output that is differentiable in both outputs. In this document, we typically assume the requisite number of derivatives exist for all dimensions to satisfy the preconditions of any given statement. (For instance, if a section refers to a function's Hessian (see below), it's assumed the function is at least second-order differentiable).

Gradient. The (column) vector of first derivatives of a differentiable function. Hessian. Matrix of second derivatives of a differentiable function. Differentiable map. A differentiable function mapping from multiple inputs to multiple outputs. Again, differentiable maps are assumed to have as many derivatives as needed to fit the context.

10. Jacobian. The matrix of first derivatives of a differentiable map. The rows of the Jacobian correspond to outputs, and the columns to inputs; one may view each row as that output's Gradient. If φ is a differentiable map, the Jacobian is often denot-

11. Pullback function/map. When φ : ¾"→ 9Γ and / : 9Γ→¾ is a function defined on the map's range space, the composed function f(x) = /(φ(χ)) is often referred to as the pullback function. Similarly if ψ 9l^m→9i^k is a differentiable map on the range, the composition of the two maps π(χ) = ψ(φ{χ)) is known as the pull- back map.

12. Pullback metric/Hessian. The Hessian of a pullback function has two terms.

That second term is often difficult to calculate with because it is dependent on second derivatives of the map, and the product represented is actually a tensor product. The first term, known in some contexts as a generalized Gauss-Newton Hessian approximation, however, fully represents how first-order Riemannian geometry, transforms from the range space to the domain when pulled back through the differentiable map φ . For that reason, it's often referred to as the Pullback Hessian. When the range space Hessian is positive definite, it can be viewed as a Riemannian metric, so it's also called the Pullback metric. See [24] for details. 13. Model Predictive Control (MPC). A classical optimal control technique characterized by repeated optimization of a control policy into the future during execution.

These and other aspects of the present invention will be understood more readily when studying the following detailed description of the invention, in connection with the drawing, in which

Fig. i: shows an overview of a system according to a first embodiment of the invention. The novelty of the invention is summarized in the section enclosed by the dashed box, focusing on the optimization decompositions, online learning adaptive control, and other aspects surrounding the real- world implementation of continuous optimization based motion generation on physical robotic platforms.

Fig. 2 is a simple 2D experiment demonstrating the robustness of switching between different but related LQRs. 10 different LQRs were generated each tracking a slightly perturbed variant of the same elliptical trajectory (sinusoidal in each dimension). For each LQR, the perturbation is a constant random vector offset applied across all time. During execution the execution thread randomly switched to a different LQR every :oi seconds. The figure depicts io separate execution rollouts from each of io randomly sampled initial positions and velocities. Despite switching randomly between LQRs at IOOHZ, the resulting execution rollouts remain smooth. In all plots, the switching between LQRs is depicted by the blue line, which, at every moment in time, plots the position of the (perturbed) trajectory being tracked by the current LQR. Left: The full 2D depiction of the tracked noisy trajectory and corresponding rollouts. Right: The separate x (top) and y (bottom) dimensions with the independent axis denoting time.

Fig. 3: depicts the three levels of mathematical approximation to the full optimization problem used to satisfy the increasing control-loop rate requirements. Note that the bottom-most layer, the real-time lkHz layer, is represented simply as a sequence of affine policies of the state ( qd = Btst + ct, where st = [qt; q_ t]). This layer is can leverage dynamic programming (as is common for these sorts of problems in Optimal Control) to cache intermediate portions of the solution across the entire trajectory so that the problem can be resolved in real-time from the current state by a simple affine transformation. Even though this bottom layer is quite simple, it's still solving an approximation to the full motion optimization problem given in the top layer in the same way optimizing any quadratic function can be done analytically without requiring an iterative optimizer. Fig. 4 depicts the pipelining of future task optimizers so that optimized task policies are up-to-date and ready to be run once the execution thread transitions into the each new task. Fig. 5 shows a 2D simulation of online adaptation to severe model inaccuracies and underlying friction (which varies sinusoidally with each dimension in this case). The time axes (lower axis of the second and third plots) are in units of seconds, and all spacial axes are in units of meters. The controller updated at a rate of lkHz.

Figure l shows a system according to first embodiment of the invention. The novelty of the invention is summarized in section enclosed by the dashed box, focusing on the optimization decompositions, online learning adaptive control, and other aspects surround- ing the real-world implementation of continuous optimization based motion generation on physical robotic platforms.

Lagrangian mechanics states that the equations of motion can be summarized in the general form (eq. l):

_ d_ dL _ &£

dt dq dq where L is the Lagrangian

(2) with KiQ_' O the system's kinetic energy and V(q) the_^ system's potential energy. When the kinetic energy takes on a quadratic form 2 (as it does for all classical mechanical systems), we can write

(3)

In general, we don't know what M and h are, but we can estimate them using rigid body assumptions as ¾ and These assumptions often severely restrict the class of represent- able models, so even using parameter estimation techniques with friction and stiction models, the fitted models can't accurately describe the true behavior of the system. This disparity constitutes the observed discrepancy between simulation and real-world performance. Equation 3 is actually written in inverse dynamics form. Given an acceleration 9 , it tells us what torque τ would produce that acceleration. Inverting the expression gives the equation for forward dynamics:

This equation expresses the kinematic effect of applied torques τ in terms of the accelerations ? they generate.

As the model is only approximate, given any desired acceleration A d, we can calculate torques ^{T =} a ^{+ ,f|} using our approximate inverse dynamics model, but when we apply those torques we have to push them through the true system which may differ substantially from the model: q_CL = M-H(Mq_a + h) - h)_j (5) where here ¾ denotes the actual observed accelerations of the true system. This problem is particularly acute in the presence of nonlinear frictions and strong stic- tions. Often the calculated torque doesn't correctly compensate for those forces, which means in many cases the robot doesn't even move, or it moves much differently than expected, especially when commanded to move slowly for meticulous manipulation motions.

The typical solution is to instead reduce the problem to our hammer: PD control with feed-forward inverse dynamics. Any acceleration policy ¾ ⁼ A? ¹?) represents an infinite collection of trajectories. For every initialization (ic "?Q), forward integration produces a desired integral curve that we'd like the system to follow. If all we want to do is follow that integral curve well (which holds even when feedback comes in the form of modifications to future portions of our time varying differential equation), this technique works. Independent of the fidelity of the inverse dynamics model, we can tune the PD gains to reject disturbances resulting from modeling errors. DMPs often follow this pattern: they're used as trajectory generators and we can follow the resulting trajectories robustly using traditional methods.

However, these systems can't be robust to physical perturbations because they aren't actually following the policy mapped out by the differential equation. When the robot is pushed, and it can't distinguish between that perturbation and one resulting from a simple model inaccuracy. In the best case, one has a pretty good dynamics model enabling low PD gains, and the system will perturb away from the trajectory, but since it's following a trajectory it'll always then move back to continue along its path.

The only way to circumvent these issues when reducing the problem to trajectory following is to somehow detect when physical perturbations occur, acknowledge new states as seeds for new integral curves, and devise a policy for blending from one trajectory to the next. This system is complicated and requires a lot of tuning since unpredictable pertur- bations have unpredictable profiles. Often we want to choose an epsilon trough around the trajectory and predict physical perturbations as deviations beyond that epsilon trough. In practice, that epsilon is both too big and too small. It's too big because perturbations ultimately lock onto a new trajectory and we see typical trajectory following artifacts within that epsilon such as low-frequency oscillations around the trajectory or slight overshoots given the unpredictability of the perturbation profile. But it's also too small, and when the model inaccuracies are most severe, we see spurious blending triggered simply from attempts to follow the integral curves.

According to the invention, one may fix the model locally in real time using the ιοοο data points per second one gets back from the robot. If we can fix the model in real time, we don't have to worry about any of the complexity introduced by reducing the problem to trajectory following. We can simply execute the acceleration policy using the patched inverse dynamics model. The key will be a new gradient descent online learning method that attempts to directly minimize the error between desired and observed accelerations.

Returning to the equation expressing the how desired accelerations translate to applied torques through inverse dynamics and then push through the true system to create observed accelerations, we now propose adding a torque offset ω to the applied torques that we'll attempt to track over time as the difference between the torques our inaccurate model predicts and the torque we should have predicted to generate the desired accelerations: qM = M^'1 [(Miq)q + h{q_f q) + ω) - h] (₆) The observed accelerations q are then a function of co. Increasing the added applied torque increases the accelerations, etc. We can measure the error between desired accelerations and observed accelerations as /(■ω) = ^ ||¾ - ¾_β (ω)|| _Γ

(7)

Here we're expressing the error as a squared norm scaled by M, the true mass matrix. This error is a common metric most famously used in Gauss's Principle of Least Con- straint. In this case, we don't actually know M (or h, which appears in 4 ^' (^ω), zo we can't evaluate the objective directly.

However, if we evaluate the gradient of the expression, we see that

= -M~^%M(q _d— ξ_α(ω)) (10)

(11) which is just a difference between the acceleration we wanted and the acceleration we actually measured. So even though we can't evaluate either objective function or it's gra- dient directly, we can measure it's gradient at every control cycle by estimating the true accelerations.

Adding regularization to the objective gives

/(ω) = ΐΙ¾ - ¾(ω)|||_ί + ^ 1|ω|]² _> so the resulting online gradient descent update rule becomes

This expression is essentially an integral term on the acceleration error with a forgetting factor. The advantage of writing it as online learning is that now we have a box of tricks from the learning community we can leverage to improve performance. Parameter oscillations in neural networks are a problem resulting from ill-conditioning of the objective. The objective, as seen from the parameter space (under Euclidean geometry, which is a common easy choice), is quite elongated, meaning that it's extremely stretched with a highly diverse Hessian Eigenspectrum. Gradient descent alone in those settings undergoes severe oscillations making it's progress slow. More importantly, in our case, oscillations resulting from the ill-conditioning of the problem manifest physically as oscillations in the controller when the step size is too large. Fortunately, the learning community has a number of tricks to prevent these oscillations and promote fast conver- gence.

The most commonly used method for preventing oscillations is the use of momentum. Denoting g_t as the gradient at time t, the momentum update is "r+i = V-t + l - Χ⁾{-3: (14)

"t +i = «% (15) effectively, we treat the parameter w as the location of a particle with mass and treat the objective as a potential field generating forces on the particle. The amount of mass affects its perturbation response to the force field, so larger mass results in smoother motion.

This update is equivalent to an update that can be viewed as an exponential smoother:

Note that g_t is still being evaluated at wt, so it's not exactly equivalent to running simply a smoother on gradient descent (gradients in our case come from evaluations at smoothed points), but it's similar.

This latter interpretation is nice because it shows that we're taking the gradients and 1. literally smoothing them over time, and 2. effectively operating on a slower time scale. That second point is important: this technique works because the time scale of the changing system across motions generated by the acceleration policies is fundamentally slower than that of the controller. This enables the controller (online learning) to use hundreds or even thousands of examples to adjust to new changes as it moves between different areas of the configuration space. Another, common trick found in the machine learning literature (especially recently due to it's utility in deep learning training), is to scale the space by the observed variance of the error signal. When the error signal has high variance in a given dimension, the length scale of variation is smaller (small perturbations result in large changes). In that case, the step size should decrease. Similarly, when the observed variance is small, we can increase the step size to some maximal value. In our case, we care primarily about variance in the actual accelerations <7a (which measures the baseline noise, too, in the estimates) since we can assume the desired acceleration signal is changing only slowly relative to the lms control loop. Denoting this <?n variance estimate as v_t we scale the update as ur+i = ^ur - (/ + a diagiv )-¹^ (₂8)

Note that this is equivalent to using an estimated metric or Hessian approximation of the form A = I + adiagCv .

This combination of exponential smoothing (momentum) and a space metric built on an estimate of variance results in a smoothly changing w that's still able to track changes in the dynamics model errors.

Indirect approaches to adaptive control (essentially online regressions of linear dynamics models) often tune their forgetting factor based on the magnitude of the error they're seeing. Larger errors mean that the previous model is bad and we should forget it fast to best leverage the latest data. Smaller errors mean that we're doing pretty well, and we should use as much data as possible to fully converge to zero tracking error. Adaptively tuning the forgetting factor, which manifests as adaptive tuning of the regularizer in our case, enables fast response to new modeling errors while simultaneously promoting accurate convergence to meticulous manipulation targets. For controlling the end-effector to a fixed Cartesian point the forgetting factor converges to l (no forgetting; zero regulariza- tion) and within a couple second and we quickly achieve accuracies of around io^~5.

Additionally, since handling noisy data is a fundamental problem to machine learning in general, these same tools enable us to handle noisy acceleration measurements. Momentum acts as a damper to the forces generated by the objective. The interpretation as an exponential smoother shows that white noise in the estimates cancels over time averaging to a more clear acceleration signal.

Additionally, we get ιοοο training points per second. Physically the robot doesn't move very far in a second, and we can assume that the errors in the dynamics model will be changing at a time scale of tenths of a second, .1, .5, or even 1 second. That's anywhere between 100 and 1000 training examples available to track how errors in the dynamics model change as the robot executes its policy. That's plenty of data to average out noise and run a sufficient number of gradient descent iterations. Traditional methods of online dynamics learning are often categorized as indirect adaptive control in the literature. The dynamics parameters affect the performance of the controller, but rather than adjusting the performance of the controller directly, these meth- ods take streams of dynamics data and adjust the accuracy of the dynamics model under the understanding that accurate dynamics models should lead to accurate tracking. However, this section argues that the inverse dynamics problem is actually fairly difficult when frictions and especially stictions are involved since even invertible rigid body models end up being many to one forward models under stictions, and the data we get back from the system is actually slightly off policy. And we argue that the above outlined direct adaptive control approach, which directly attempts to minimize the error between desired and actual accelerations, is a more straightforward signal to train on.

Consider the situation of starting the online learning process from zero motion. With ze- ro motion, stictions are problematic. That means there are many torques that we might apply that are too small. All of these torques produce zero acceleration, and therefore the forward dynamics mapping becomes many to one and, therefore, non-invertible. Intuitively, this problem manifests as a learning process that needs to train the inverse model to adjust it's prediction of the torque needed to produce zero acceleration over a period of a number of trials that all produce zero accelerations (hopefully with increasing torque commands) until the predicted torque finally breaks through the stiction barrier to produce a nonzero acceleration target to train on.

More concretely, consider an objective of the form:

(19) where f is any class of function approximators parameterized by w predicting the inverse dynamics. In this case, t is the actual applied torques, and r is the corresponding actu- al accelerations that were observed. For instance, we might use just a simple torque offset w from the rigid body inverse dynamics model, paralleling what we outlined above:

(20)

= τ _e + ω (21)

Here we denote c ^_ M(<?)¾ + .¾(¾. <?) as the underlying rigid body inverse dynamics prediction. The gradient of this expression is just the torque error: νΙ_τ(ω) = -(τ₆ - (f_t Hh ω)_ (22)

That seems natural enough. But this expression is actually operating on a different set of data. This algorithm attempts to make the model's predicted torque on the actual ob- served acceleration more accurate. But those observed accelerations are the wrong accelerations. We didn't want them. We wanted the desired accelerations. We want the model to be accurate on the desired accelerations, but we're fixing it on the observed accelerations. This discrepancy is particularly problematic when stictions are involved: all actual observed accelerations are zero when we aren't applying enough torque. The breakdown of these terms is as follows: t is the torque we actually applied. The acceleration we actually achieved is it. If we push that back through the inverse dynamics, we get /^'(¾. i = f_t + ω it should have predicted t at that acceleration point, so a gradi- ent descent algorithm updates the model accordingly.

While the direct approach tries to explicitly find the torque that will produce a non-zero desired acceleration (if we don't see the desired acceleration, then just adjust the torque until we do), this indirect approach (for this stiction example) requires a feedback se- quence that keeps updating the torque predicted for zero accelerations to (hopefully) higher and higher values until we successfully start seeing data with non-zero acceleration. The hope is that we're actually increasing those torques over time. The basic assumption is that non-zero desired accelerations under our rigid body dynamics model will always predict higher torques (assuming positive desired accelerations) than zero accelerations would. So those non-zero desired accelerations predict a torque that in turn creates and actual acceleration (which would be zero at first before breaking through the stiction barrier) that in turn predicts a smaller torque under our erroneous model. The algorithm then updates the predictions for zero accelerations to be higher and that in turn increases the prediction for the desired acceleration so that the next cycle actually applies higher torque.

The direct approach, on the other hand, tries to directly minimize the error in accelerations, whereas the indirect approach updates the model to predict that a larger torque is actually the thing that generates the observed accelerations. The hope is that iterating that process reduces tracking error (the error between desired and actual accelerations), but the analysis of this process is difficult.

More generally, PID terms can also be interpreted as online learning. The true acceleration of a system can be described as

¾_β(ω) = Af^_1 + h + ω) - h (₂₅) where 4 ^' a is the actual acceleration the system undergoes under the torques calculated by an estimated inverse dynamics model ^ ⁼ Mija ⁺ ^ with torque offset ω. The true dynamics of the classical mechanical system (represented under the Lagrangian mechanics framework) is?e ^{~ ¾} — ¾)_; which is an inversion of the common form^T = + ¾. Note that because we representing the applied torque ^ταψ≠ίΒ& ⁼ (M¾ + ¾ ^{+ ω} as an inverse dynamics term plus the offset, this model automatically is using a feed forward term: ^Tff = ^j^¾d + ^» The learning technique is finding ω as an offset to that over time. Note that we would have equivalently used simply ^Ta??ued = ^ω so that ω represented the entire applied torque. That case would be equivalent to not using the feed forward term.

We can rewrite PID control as

(26) so we have three terms to explain: the position integral, the position error, and the velocity error. We'll start with the position integral term and progress to the position and velocity error terms.

If we desire to be at q , and are currently at q we can choose an acceleration for this current time step to try to minimize that difference. The position the system will be in after ^seconds executing an acceleration ¾ is

Thus, if <?α(^ω) defines the actual realized acceleration applied at time step t to get the system to time step t + l, we can write the resulting location

as a function of the current weight vector. Given a desired position signal q, we can write down an objective measuring the discrepancy between where we are at time t and where we wanted to be:

where Mt is the mass matrix at time t which defines the real system dynamics at that time. The derivative is

(31) since .¾· 2

This gradient is a difference of positions, so taking gradient steps in an online learning fashion gives the traditional integral term on the positions:

In other words, the typical I term of a PID controller is gradient descent on the objective given by Equation 29.

Similarly, we can define the actual velocity at time t + 1 in terms of the integrated velocity from time t under the true system acceleration q (wt) at that time: ϊί₊iGWi) = q_t + AiitfOe). (33)

The following velocity error objective

therefore, has the desired derivative

V S? (35)

Summing up the negative gradients over time (under constant step size of ffe= ff t), therefore, defines an Euler integration of these velocities, which is equivalent to a position difference:

fi (36) = «ί?τ - 9?) (37)

And finally, we can do the same thing for acceleration error objectives:

,2

^{¾) = ; ΙΙ^ - ¾(^} |,_{¾ (38);} with gradient v c*¾) = (39)

Gradient descent with step size ^s = ^βΔέ again defines an Euler integration of these acceleration difference terms, which this time amounts to a position error:

(4₀)

= «Wf - 4 ) (₄i)

In short, PID control is online gradient descent on the objective

since its negative gradient is

This perspective allows us to understand this method of control in the broader context of machine learning, specifically an online learning method explicitly attempting to patch up the inverse dynamics model by running gradient descent on objective terms leveraging three sources of information: desired and actual positions, velocities, and accelerations.

The relationship between momentum updates and exponential smoothing is

where w_t is the sequence of weight vectors generated by the algorithm and ^ωί - the expo- nentially smoothed evaluation point. Similar in structure is the momentum update, which takes the form

"m = r¾ + (¹ ~ Υ)

u¾ = a»_t ^m - iju_t

This time the gradient is evaluated at the momentum point and then smoothed with past gradients before the step is taken. With some algebra, they both expand out to be of the form:

= ⁽¹ + Y ⁺ r² &-s + Cl + r⁾g_t-i * St. (44)

Figure 5 shows a 2D simulation of online adaptation to severe model inaccuracies and underlying friction (which varies sinusoidally with each dimension in this case). The time axes (lower axis of the second and third plots) are in units of seconds, and all spatial axes are in units of meters. The controller updated at a rate of lkHz.

The larger the step size, the quicker the adaptive control strategy adjusts to errors between desired and actual accelerations. That means it will fight physical perturbations of the system stronger with larger step size. To accurately track desired accelerations, we either need large step sizes for fast adaptation, or a good underlying dynamics model. If we have a really good dynamics model, we can get away with smaller That means the better the dynamics model, the easier physical interaction with the robot becomes. Bad models require larger step sizes which manifests as a feeling of \tightness" in the robot's joints. We can always push the robot around, and it'll always follow its underlying acceleration policy from wherever it ends up, but the more we rely on the adaptive control techniques, the tighter the robot becomes and the more force we need to apply to physically push it around. This behavior parallels the trade-offs we see in the choice of PD gains for trajectory following. Bad (or no) dynamics models require hefty PD gains, which means it can be near impossible to perturb the robot off it's path. But the better the dynamics model, the better the feedforward term is, and the looser we can make the PD gains while still achieving good tracking. We're able to push the robot around more easily (and it's safer). The difference is whether or not we need that trajectory. The proposed direct adaptive control method attempts to follow the desired acceleration policy well, which means when we perturb it, it always continues from where it finds itself rather than attempting in any sense to return to a predefined trajectory or integral curve.

The purpose of this control system is to directly execute acceleration policies of the form q = f(q,q) (e.g. kinematic LQRs) despite insufficient fidelity of the dynamics model. These acceleration policies represent an infinite collection of trajectories (each integral curve of the differential equation is a trajectory). In the absence of external perturbation, the system should follow as closely as possible one of these integral curves. However, when the inverse dynamics model τ = d^~ _p ^] _prox(q, q, q_d) is insufficiently accurate, if we denote the true dynamics of the system as q = d_lni (q,q, r) , the true realized acceleration of the system

may differ significantly from the desired acceleration. I.e.

The key observation is that we can parameterize the inverse dynamics model with a parameter vector w and actually measure the derivative of a natural error metric in real time to run an online gradient descent to tune the parameter vector w over time from the real world data streaming from the robot.

Parameterizing the inverse dynamics model as d^~ _pprm(q,q,q_d) + f{q, q, q , w) (simply adding on a function approximator to model the offset between the given approximate dynamics model and the true dynamics model) makes the true acceleration an implicit function of the parameters as well: q_a{w). The objective function in question is then W) = fe/ - 4 ^' a (^WT^M(<,)> ( 6) where M(q) is the true mass matrix of the system. It's easy to show that the derivative of this objective is just

The adaptive control methodology then simply runs a variant of online gradient descent using these gradient updates. The simplest variant, and the one currently implemented on the Max Planck system, uses f(q, q, q_d, w) = w . I.e. the function approximator only represents literally the current offset between the output of the inverse dynamics model and the torque needed to produce the instantaneously desired acceleration. In this case, the function approximator's Jacobian is simply the identity matrix, so the gradient is just the difference between desired and observed accelerations. The online learning system is able to track that offset as it changes (sometimes significantly) throughout the course of a single motion. (Note that a single motion that takes 2-3 seconds to complete receives 2000-3000 data points from the robot over the course of its execution. That's plenty of data to adjust and track a single vector of offset values.)

The most non-obvious aspect to this algorithm is that, even though we cannot evaluate either the objective, itself, or its gradient, we can measure the gradient from the data streaming back from the robot in real time. This algorithm reduces observed discrepancies between desired and actual accelerations in real time by adjusting the inverse dynamics model online. Figure 2 shows a simple 2D experiment demonstrates the robustness of switching between different but related LQRs. 10 different LQRs were generated each tracking a slightly perturbed variant of the same elliptical trajectory (sinusoidal in each dimension). For each LQR, the perturbation is a constant random vector offset applied across all time. During execution the execution thread randomly switched to a different LQR every .01 seconds. The figure depicts 10 separate execution rollouts from each of 10 randomly sampled initial positions and velocities. Despite switching randomly between LQRs at 100Hz, the resulting execution rollouts remain smooth. In all plots, the switching between LQRs is depicted by the blue line, which, at every moment in time, plots the position of the (perturbed) trajectory being tracked by the current LQR. Left: The full 2D de- piction of the tracked noisy trajectory and corresponding rollouts. Right: The separate x (top) and y (bottom) dimensions with the independent axis denoting time.

Figure 3 depicts the three levels of mathematical approximation to the full optimization problem used to satisfy the increasing control-loop rate requirements. Note that the bot- tom-most layer, the real-time lkHz layer, is represented simply as a sequence of affine policies of the state {q_d = B_tSi + c_t , where s, = [q_t ; q_t ]) . This layer is can leverage dynamic programming (as is common for these sorts of problems in Optimal Control) to cache intermediate portions of the solution across the entire trajectory so that the problem can be resolved in real-time from the current state by a simple affine transformation. We emphasize here that even though this bottom layer is quite simple, it's still solving an ap- proximation to the full motion optimization problem given in the top layer in the same way optimizing any quadratic function can be done analytically without requiring an iterative optimizer.

The motion optimizer discussed in this document can optimize a high quality model of the problem. It can take current robot state information (positions and velocities), current and predicted environmental state information (configurations of objects over the time duration under consideration), and a number of other modeling factors as described in [31, 24]. Ideally, the right action to take is the acceleration the optimizer prescribes as the initial acceleration when optimizing from the current state under the current envi- ronmental configuration. If the optimizer could run fast enough (once per millisecond) that Model Predictive Control style policy would be optimal.

Unfortunately, making this optimization fast enough for real-time control is usually unrealistic. But also, we never really want it to be fast enough. Even when we allow tenths of a second of computation, the modeling is still constrained by efficiency restrictions: the discretization is set at At = .1 sec, energy terms are only approximate, we approximate the collision surface of the robot's arm with generalized ellipsoids.2 For any given amount of computation, constraining this optimizer to run in real time within a lkHz loop is too restrictive. That computation should be used for more modeling, not unneces- sarily fast speeds.

Instead, our system is constructed around a multi-tiered architecture of second-order approximations to the original kinematic motion optimization problem in order to simultaneously pack as many modeling components as possible into the motion problem at the highest fidelity (which will be too slow), while still achieving real time lkHz loop rates on an approximation to that same problem at the lowest level control. In brief, the highest level operates on the order of half a second to a second (O(.i)) to solve the original problem. It sends a QP approximation to that problem (of the same dimensionality defined over the same time horizon) down to a mid-layer operating at hundredths of a second (O(.oi)). This level leverages fast band-diagonal solvers to solve this high-dimensional QP very fast (the computational burden of the top level optimization is in computing the second-order approximation for the inner loop Newton step this layer can skip that step since it's objective's Hessian (as a quadratic) is fixed, and augmented with only quickly calculable added components coming from penalty terms on the linear constraints. And finally, that layer fixes the values of the final Lagrange multipliers and sends the solution to the QP's final inner loop unconstrained problem down to the lowest level control loop as an LQR (sequence of time-varying affine policies) to execute in real time at lkHz. The key feature of this architecture is that the approximations done at each stage are mathematical in nature (Taylor approximations) and not modeling approximations. This allows each subsequent layer to operate on various representations of the same problem.

Figure 3 depicts these three layers of optimization approximation. If we denote a clique of 3 configuration variables around the time step r as q_t ^c = (<?,_, , q_t , q_l+x ) , and the whole trajectory as ξ = (q₀ , q_x,...,q_T+ ), , we can write the top level full optimization problem as

The middle layer, which uses a quadratic programming approximations to the problem, can be represented as

where

es, and b_t and d_t are vectors. Note that this is a quadratic program across the entire finite-horizon problem and not just an instantaneous quadratic program as is typically used in many modern local control techniques [7]. Finally, at the bottom-most layer, we have an unconstrained quadratic approximation of the problem which we can denote

ς (=1 where Q, q_t° j is a quadratic function of the clique. As is common in Optimal Control so- lutions, we typically use dynamic programming to solve and cache off intermediate solutions to this lowest-level optimization problem: τ

t = l where s, =

B_t is a matrix, and c_t is a vector. Applying the t' affine function to a state s_t fully re-solves the optimization problem from time step f onward and returns the optimal first acceleration q_d = B_ts_t +c_t . We emphasize here that, despite the simplicity of simply multiplying by an affine function (often called an affine policy), this bottom-most layer is still running a version of continuous optimization on this quadratic approximation by re-solving the full problem in the same way that minimizing any quadratic analytical is still minimizing the quadratic.

Most practical motion problems, such as manipulation problems, are sequences of multiple tasks. For instance, the problem of moving an object in the environment from one location to another can be decomposed into five subtasks: l. reach to a preshape around the object;

2. establish the grasp (including turning on the grasp controller);

3. move the object to the desired location;

4. release the object (including turning off the grasp controller);

5. retract the arm.

All five of these subtasks are motion problems that should be optimized. But optimizing each subsequent task in preparation for execution only once the previous task has completed adds latencies. Fortunately, the optimization of a task typically runs faster than real time, i.e. faster than it takes to execute the task. The longest tasks in the grasping de- composition typically take 3 seconds to run; optimization of those tasks only take 1 second. Naive implementations optimizing each next task only upon completion of the previous task are therefore very wasteful. The optimizer simply waits idle for the previous task to finish executing before starting the next task's optimization. In the above example, that's 2 seconds of idle time.

Figure 4 depicts the pipelining of a task optimizer according to a further embodiment of the invention, so that optimized task policies are up-to-date and ready to be run once the execution thread transitions into the each new task. A task pipelining system component may string together the optimizations separately to run parallel to the task execution, handling both synchronization with the execution thread and updating of each task's optimization problem in the case of MPC-style continuous re-optimization. Each task is a state of a state machine. The execution thread operating on the state, waits idle until an LQR (acceleration policy) is available to be executed. When it finds one, it extracts the LQR from the task and sends it to the robot for exe- cution. The robot then executes the LQR for the duration of the LQR's lifespan (e.g. 3 seconds for the typical motion task). Every new LQR received by the robot preempts the previously running policy, and the LQR starts running immediately. Each LQR has a time index synchronized with the robot's internal clock, so the robot knows where along the LQR to start executing, independent of any delays introduced by communication and op- timization. While the robot is executing the LQR, the execution thread waits idle until it finds a new LQR that's ready for execution. There is a well defined task termination criterion which signals that the current task has completed and triggers the execution thread to transition to the next task. In parallel to the execution thread, each task has an LQR optimization thread that handles optimizing the task's acceleration policy. It receives a stream of time stamped state information from the robot (which enables the optimized LQR to maintain synchronized time stamps), that it uses to seed the optimization. Once the task optimization has completed, the LQR optimizer constructs the LQR and hands it down to the task state, for the execution thread to find and send to the robot for execution.

The LQR optimizer additionally, at that point, predicts where the robot will be once the LQR execution has completed, and sends that prediction to the next task's LQR optimizer, enabling that optimizer to start optimizing immediately from that resultant state pre- diction. There is only a single execution thread, and it only points to the current active task state, so all subsequent task LQR optimizers simply save the optimized LQR off within the task state so it's ready to be executed immediately once the execution thread reaches that state. This system is designed to facilitate both pre-optimization of tasks from predicted task start states as well continued communication between the optimizers to enable prere- optimization of those tasks from updates occurring in real time. The simple rule is that each task's LQR optimizer, running in a separate thread, consumes the predicted start state information sent by the previous task's LQR optimizer and integrates it into the cur- rent task optimization. More, each LQR optimizer, either periodically or upon completion of the optimization, sends its predicted information to the next task's LQR optimizer. This continuous communication and re-optimization between optimizers enables two behavioral modes: The system can be configured to optimize each task only once upon completion of the previous task's optimization and receipt of the predicted start state information. This configuration enables all task optimizers to complete in a chain as quickly as possible, independent of the rate of task execution, ensuring that each subsequent task's LQR is ready and available for execution once the execution thread enters the state. This behavioral mode enables a degree of reactivity, since the LQRs are full acceleration policies implementing solutions to second-order approximations to the original problem. But it cannot react to drastic perturbations since the approximation center around which the LQR is built is always fixed. This mode, however, is simpler and suitable for many motion sequences; more, occasional re-optimization upon detected substantial perturbation is straightforward to add, although such re-optimization will incur initial costs since the system would need to transition to an idle state while the initial re-optimization is running.

The system can also be configured so that each task is running its optimization continuously, always receiving the latest predicted start state information from the previous task's optimizer integrating it into the current optimization for online refinement, and sending the new predicted resultant state to the next optimizer. These optimizers are al- ways running and updating their LQRs with the latest information. The currently executing task, updates its LQR with the latest state information streamed from the robot; all subsequent optimizers update their LQRs with the latest predicted information streaming from the prior LQR. Since the task execution thread sends every LQR update to the robot for execution, the robot continuously receives a stream of LQRs; it executes the be- ginning of the LQR until a new LQR preempts it. This configuration enables full reactivity to both changes in the environment and substantial perturbation. All optimizers work with the current/predicted information, whether that be current/predicted robot state information or current/predicted environmental state information. When the robot is perturbed, or perception observes changes in the environmental constraints, the current optimizer is updated, it modifies the LQR accordingly, and ends the result both to the task (for either execution or storage) and to the next task's optimizer. The task pipelining and inter-optimization communication work in conjunction to enable low latency, reactive, behavior. Note that multiple pipelines can be run in parallel to calculate the relative trade-offs of multiple candidate objects to manipulate to be robust to uncertain decisions of human co-workers. If a co-worker makes a choice that blocks the current best hypothesis the system can quickly and smoothly transition to the second best hypothesis, and so on down the line, to maintain uninterrupted fluent task execution. According to a further embodiment of the invention, a parallel processing architecture for speeding the inner most computational bottleneck of motion optimization may be provided. The parallelization technique described in this section is not tied to the above described system as a whole.

Motion generation has traditionally been heavily sequential. Most motion planners such as RRTs, PRMs, lattice planners, etc. are designed for execution on a single processor, largely for legacy reasons of limited computation. But motion optimization, and global optimization in general, is inherently parallelizable.

There are two parts to motion optimization - the local optimization of a single motion policy (trajectory + surrounding second order information constituting an LQR); the global optimization of the problem throughout the full space - and both of these components can exploit parallelization.

Parallelization in the context of local optimization is described in detail in the document Kinematics for Motion Optimization, but briefly, the speedup results from the simultaneous exploitation of both CPU and GPU parallelization during the evaluation of trajectory objective functions and its derivatives. Evaluation requires three stages:

1. Forward evaluation of kinematic maps to task spaces and Jacobians.

2. Evaluation of task space objectives, and their gradients and Hessians.

3. Pullback of gradient and Hessian information through the kinematics maps. Both stages 1 and 2 can exploit parallelization on the GPU and CPU, respectively. The computational bottleneck during those stages comes both from Jacobian computation and task space evaluation (the latters complexity is a function of modeling complexity), but given forward kinematics evaluations, the Jacobian computation (which is just a bunch of cross products) can be calculated on the GPU in parallel with the task space evaluation which we can relegate to the CPU (enabling the exploitation of all available libraries and tools). Once those Jacobians and task space derivatives are complete, the pullbacks, which comprise the majority of the computation overall, can easily all be calculated using matrix libraries that exploit GPU processing. Back-of-the-envelope calculations detailed in the above linked design doc suggest we can get major speedups using this computational architecture.

Parallelization for global optimization is perhaps even more straightforward. The simplest technique to enable global optimization parallelization is simply multi-threaded parallel local optimization from random or calculated initial configurations. Each optimization will take a different amount of time, so we can easily choose the best policy from the collection of optimized motions successfully obtained within a given computational budget. From then on out, one of the threads could be the re-optimization thread adapting the current policy to changing scenarios, while the rest of the threads could be exploration threads. At the point when exploration threads find better solutions than the re- optimization thread, we can switch execution to the best of those exploration policies, thereby enabling switching of the global policy such as to a solution within a different homotopy class. Note that we can allocate multiple processors to local re-optimization as well, depending on exploration needs.

This parallelization architecture is designed to mitigate the primary computational bottleneck of motion optimization: Pullback calculations. The task spaces referenced in this document are primarily kinematic spaces (spaces defined as physical points on the robot or axes extending from physical points), as the computational properties of their forward evaluation and (especially) Jacobian evaluation are of concern.

Gauss-Newton Hessians are pullback metrics. Consider an objective term defined entirely in mapped space as c (g) = c(<fi(g)) , where j) : g -> I is a differentiable function. We often call a task map mapping configurations to a task space X; we say that c is defined on the task space X. The true Hessian is

where J =— is the Jacobian of φ and A denotes the additional terms dependent on higher order derivatives of the differentiable map φ . The Gauss-Newton approximation simplifies this expression by removing the second term A . The result is a pullback metric pulling the Hessian Riemannian metric V² _xc back into the configuration space. The resulting function, therefore, can be expressed using a map φ : Q^kM — X^k+X , where k is the order of the highest derivative, mapping a kth-order sequential clique of configuration along the trajectory to a corresponding kth-order clique of mapped variables in the task space: φ ^_ι__] ^ ς_ι+ι ) = (φ^_ι__] },φ^_ι ), ^_ι+ι )) . If c₃ (x, x,x) , for instance, is a task space function defined over positions, velocities, and accelerations, we can express the full function as c - ε[ρ J, where D is a constant finite-differencing matrix that calculates the requisite finite-differenced derivative approximations from the sequential variables in the mapped clique. In aggregate, the expression first maps the clique of configurations to a task space clique of configurations using φ , then finds the finite-difference derivative approximations, and ultimately evaluates the desired task space function on the resulting points. Gauss-Newton Hessian approximations are again pullback metrics of the form

where x denotes the mapped task space clique. Note that in this case, the Jacobian is larger in both dimensions. Calculating these pullbacks are the most computationally consuming portion each motion objective evaluation. This document describes a parallel architecture designed to leverage both CPU and GPU parallelization to speed this computa- tional bottleneck.

There are four components to the evaluation of objective terms and the calculation of gradient and Hessian pullbacks that we need to juggle: · Stage l: [GPU or CPU] Forward mapping computation

• Stage 2: [GPU] Jacobian computations (which amounts to a giant collection of cross products) [CPU(s)] Task space objective evaluations (includes task space gradients and Hessians)

Stage 3: [GPU] Pullback calculations: J^TVc and J^TV²cJ .

We've lumped two computations into stage 2 as they can be done in parallel, distributed between both the GPU and CPU. Both require only the forward mapping computation, and neither require each other (Jacobian computation calculates the pathway back from the task space to the configuration space, and the task space evaluation operates only within the task space, itself). We now describe each stage in more detail.

Stage 1 [GPU or CPU]: Forward mapping computation.

This stage isn't a bottleneck. Usually forward evaluation along a kinematic chain involves only a single sweep homogeneous transformations from the root to the most distal leaf (or leaves or a tree). These are fast, and can most likely be done in negligible time just within the CPU. GPUs may be beneficial for very long trajectories (finely discretized), but these computations need to be extracted from the GPU for the next stage of computation, so memory latencies will probably dominate any computational advantage offered by the GPU. This stage calculates all task space trajectories, from which we can retrieve all cliques and, accordingly, all finite differenced kinematic quantities needed for task space evaluation. Moreover, this stage calculates all joint locations and axes needed for Jacobi- an computation.

Stage 2 [CPU(s)]: Task space evaluation.

It's critical to run task-space evaluations on a multi-core CPU architecture. They can be trivially parallelized across multiple cores, but, importantly, these task space evaluations are sophisticated. From a modeling and usability standpoint requiring these calculations to be GPU compatible is too severe of a restriction for practical execution; we need access to all available CPU libraries and tools. These computations include geometric map eval- uation, distance field access (typically discrete voxel fields made analytic through the use of LWRs), energy evaluations, etc.

Stage 2 [GPU]: Jacobian evaluation.

Given joint positions and body frame axis configurations, Jacobian computations consist oi θ(η² ) cross products, where n is the dimensionality of the configuration space. Given three axes (a full coordinate system) for each key point (usually joint location), the constant factor is significant; moreover, given that we need to evaluate these across the entire trajectory, the fully computational complexity is θ(Τη² ). Fortunately, all Jacobian column computations are either trivial (in the case of prismatic joints) or simply well- defined cross products (in the case of revolute joints). Relegating this portion of the computation to a cross product machine of sorts implemented on the GPU may have significant speed benefits. Jacobian computational complexity is still dominated by the below pullback stage complexity, but their evaluation isnt trivial and leveraging the GPU allows us to compute them in parallel with the task space evaluations. Task space evaluations will likely dominate the Jacobian computation in terms of complexity, so we get the Jacobians essentially for free. Additionally, the Jacobians will need to be used later in the GPU anyway during the pullback calculation, so theres no need to ever pull them out of GPU memory. Stage 3 [GPU]: Pullback calculations:

J' Vc and J^rV²cJ . Pullback calculations are simply a collection of matrix-matrix or matrix-vector multiplies. If we calculate the Jacobians on the GPU and leave them there, we can avoid the bulk of the memory transfer overhead; we need only transfer the task space gradients and Hessians, and those are typically relatively small (most kinematic task spaces are just 3-dimensional). All of these pullback calculations, then can be done efficiently simply using GPU matrix libraries. The result would be a packed representation of a band-diagonal matrix which we can read from GPU memory. Solving the resulting line- ar system on a CPU using LAPACK is very fast Oust 1 ms using a basic installation of BLAS).

Example Applications

Rethink Robotics was the first to enter into the new robotics domain of Collaborative Ro- botics systems for small to medium scale manufacturing, and Universal Robotics took hold of the market soon thereafter. They've made Collaborative Robotics systems highly desirable by opening up new markets for robot arms that are safe enough to operate around human co-workers and easy enough to use that can be programmed by unskilled technicians, usually by simple kinesthetic demonstration. Collaborative robots diverge drastically from more traditional robotic automation solutions which are extremely dangerous while operating (requiring a cage to separate physically separate them from human co-workers) and require a lot of skill to program.

Our system forms an ideal platform for developing new Collaborative Robotic applica- tions. Existing Collaborative Robotic systems are technology limited. They don't react to the environment aside from stopping for safety when an anomaly is detected, they don't operate naturally with co-workers, and programming is limited to physically moving the robot to waypoints (the simplest of programming by demonstration methodologies). These robots are largely easily programmed safe versions of their more traditional coun- terparts.

Ideally, collaborative robotic systems should adapt to perceived constraints in their environment, focusing on the task rather than the raw motion of the arm through the configuration space. With the introduction of a new constraint, whether perceived through vi- sion or through a slight bump, the system should automatically update its internal representation and continue attempting to complete its task until it deems the problem un- solvable given the existing constraints. Bumps will happen in close proximity with coworkers and the environment will always be changing when human beings populate the robot's workspace. The robot should maintain safety and efficiency both through adapta- tion to the changes. The above described system makes this new functionality possible.

Likewise, rather than demonstrating full arm configurations that the robot should go through, it would be much more natural for a user to demonstrate how the task itself should be solved perhaps with a simple tool mimicking the robot's end-effector that the programmer can manipulate. If the tool measures desired forces and the system analyzes how the specified operations change the environment (perhaps with input from the programmer to ease the vision and analysis task), the robot obtains a target task-space pro- file of the sequence of problems it is tasked with solving. The motion optimizer can then create solutions to the problems on the fly that adapt to the changing environment. During the programming stage, the robot can optimize a sequence of nominal policies with the help of programmer who can provide adjustments as needed. The automated motion generation capabilities of the above described system enable such a programming by demonstration application. More, collections of arms operating in the same factory can learn general preferences over time across multiple programming by demonstration sessions [23, 17, 32, 4] methodologies since the motion system is designed around optimization. This system can revolutionize the space of Collaborative Robotics, making robot coworkers more natural, more adaptive, easier to use and program, and ultimately less scary.

Both drones and autonomous vehicles need to generation motion plans quickly in high dynamic and changing environments. Existing methodologies are effectively engineered around the difficulty of this problem. Our system provides a clean and general solution to implementing real-world MPC accounting for sophisticated time-varying modeling elements, thereby effectively reducing the problem to vision. Companies developing systems or intuitive drone collision-free control and autonomous vehicles can focus on the vision component, letting the above described system solve the motion generation problem giv- en the task description and predicted environmental constraints.

Rehabilitation robotics is starting to use robot arms to empower wheelchair-bound patients with increased autonomy. These patients can be incapable of caring for themselves and performing simple tasks such as reaching a cup of water or even pointing to an ob- ject. The above described system can easily be used as a platform atop which any number of natural behaviors that interact with the environment can be implemented. The system excels at generating clean and natural movements amid an unstructured and changing environment simplifying the sophistication of control needed to produce motion in clutter. Handicapped patients are often restricted in the degree to which they can offer input to a robotic help system; automatic obstacle avoidance and natural, predictable, reaction to common changes in the environment can greatly ease the burden of control for these users. The entertainment industry and the social robotics industry are both interested in lifelike and reactive motion for humanoid robots. In this case, the emphasis is on natural behavior and communication of emotional content. The motions generated by the above described system are much more natural that those produced by competing motion genera- tion methodologies, largely because the algorithms are covariant in the Riemannian geometric sense. The only way to achieve covariance is through judicious use of pullback metrics (Gauss-Newton Hessian approximations) to propagate geometric information between the various modeling spaces. Competing techniques often represent geometry only as Euclidean geometry in the configuration space; the resulting motion tends to em- phasize emphasizes loopy circular motions because of the propensity of revolute joints comprising most arms, which at best are unnatural and unpredictable and at worse can be dangerous to bystanders when factoring in modeling and environmental measurement uncertainties. Our system easily makes natural motions that can adapt in predictable ways to changes in the environment and perturbations. It can be used to bring to life entertainment and social robots in ways heretofore unseen.

The online learning adaptive control algorithm is applied above to correcting erroneous dynamics models in real time. One source of error can be the inaccuracies or even inconsistency introduced by lower quality motors and parts in cheaper manipulators. As long as we can measure the kinematic variables accurately, the online learning algorithm can adjust to the errors in real time enabling precise control of even low cost robots. This feature can greatly affect the price point of manipulators sold while maintaining requisite precision even as the manipulator's components change (e.g. cables or belts stretch or shift) over time.

Claims

1. Method for controlling a robot, comprising the steps

- estimating a true acceleration of the robot;

- determining a torque offset (w), based on a difference between a desired acceleration and the true acceleration of the robot

- controlling the robot, based on the torque offset (w).

The method of claim 1 , wherein the torque offset (w) is determined using a history of measured past robot states and gradients of a inertia-weighted squared norm difference between the desired and actual accelerations, for example by gradient descent.

The method of claim 2, wherein the torque offset (w) is determined based on a momentum.

The method of claim 3, wherein the momentum is scaled by a variance of the actual acceleration (q).

The method of claim 1 , wherein the torque offset (w) is a function of a position, a velocity and a desired acceleration.

A method for optimizing robot motion, comprising the steps of

- approximating an optimal robot motion at a first rate, e.g. in the order between half a second to a second;

- approximating a second optimal robot motion, based on the first approximation, at a second rate faster than the first rate, e.g. in the order of hundreds of a second;

- determining an optimal robot motion, based on the second approximation, in realtime.

7. The method of claim 6, wherein the first approximation is obtained as a quadratic program.

8. The method of claim 7, wherein the second approximation is obtained using fast band-diagonal solvers.

9. The method of claim 8, wherein the optimal robot motion is obtained using a linear quadratic regulator.

10. Method for executing subtasks in a robot, comprising the steps of

- waiting until an acceleration policy for the robot becomes available;

- extracting the acceleration policy and sending it to the robot for execution;

- executing the acceleration policy for the duration of its life span.

11. The method of claim 10, wherein every new acceleration policy received by the robot preempts the previously running policy.

12. The method of claim 10, wherein each acceleration policy has a time index synchronized with the robot's internal clock.

13. The method of claim 10, wherein the acceleration policy is optimized in a separate thread before it is executed.

14. The method of claim 10, wherein the acceleration policy optimizing thread predicts where the robot will be once the execution of the acceleration policy has completed and sends said prediction to the next tasks optimizer, enabling said optimizer to start optimizing immediately from said result and state prediction.

15. Device for controlling a robot, comprising

- an estimating module for estimating a true acceleration of the robot;

- a torque determining module for determining a torque offset (w), based on a difference between a desired acceleration and the true acceleration of the robot - a controller for controlling the robot, based on the torque offset (w); and adapted to execute a method according to claim 1.