WO2021008691A1 - Learning to robustly control a system - Google Patents

Learning to robustly control a system Download PDF

Info

Publication number
WO2021008691A1
WO2021008691A1 (PCT/EP2019/069101)
Authority
WO
WIPO (PCT)
Prior art keywords
values
candidate solution
model
quality
candidate
Prior art date
Application number
PCT/EP2019/069101
Other languages
French (fr)
Inventor
Mohammed Abdullah
Haitham AMMAR
Hang REN
Mingtian ZHANG
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. filed Critical Huawei Technologies Co., Ltd.
Priority to CN201980098404.XA priority Critical patent/CN114270375A/en
Priority to PCT/EP2019/069101 priority patent/WO2021008691A1/en
Publication of WO2021008691A1 publication Critical patent/WO2021008691A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/004Artificial life, i.e. computing arrangements simulating life
    • G06N3/006Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks

Definitions

  • the approach described herein helps to solve the problem of brittleness in reinforcement learning, and in particular can provide a system and method for learning to control a system in a way that is robust to variations in dynamics. Therefore, the present invention may allow a user to train on a simulator and deploy in the real world with good performance.
  • Embodiments of the present invention may provide advantages over previous approaches.
  • the present approach performs better in experiments and can be used for continuous state and action spaces.
  • the approach described in ‘Non-Stationary Markov Decision Processes a Worst-Case Approach using Model-Based Reinforcement Learning', Lecarpentier and Rachelson, arXiv:1904.10090, 2019, is only applicable to scenarios with finite discrete state and action spaces.
  • the described approach operates with continuous state and action spaces, which is an essential requirement to be practically applicable in real world settings.
  • One particularly advantageous situation to which the present invention may be applied is a self-driving car.
  • the dynamics of the car will vary due to a multitude of factors, from variations of road surfaces, road inclines, tire pressure, frictional forces and variations due to weight carried. It is clear in this example that there is no single set of dynamics that the car will experience throughout its lifetime or even over a short period of time.
  • the present algorithm is more robust to such variations and can cope with novel dynamics without having to learn to cope with them in the environment. Therefore, the solution disclosed herein may learn a controlling policy that is robust to variations in dynamics between the environment it was trained on and the environment it is deployed on.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A system for performing reinforcement learning to generate a solution set of values usable as parameters in a model so as to cause the model to provide a level of performance against a performance metric, the system being configured to, having formed a candidate solution comprising a candidate set of parameter values, repeatedly perform the steps of: making a first assessment of the quality of the candidate solution by assessing the extent to which a model having the values of the candidate solution provides a high level of performance against the performance metric; making a second assessment of the quality of the candidate solution by assessing the extent to which a model having the values of the candidate solution fails to provide a low level of performance against the performance metric; and forming a further candidate solution in dependence on the first and second assessments.

Description

LEARNING TO ROBUSTLY CONTROL A SYSTEM
FIELD OF THE INVENTION
This disclosure relates to avoiding brittleness in reinforcement learning, in particular to learning to control a system in a way that is robust to variations in dynamics.
BACKGROUND
In reinforcement learning (RL), a system is modelled as a Markov Decision Problem (MDP). This is defined as a tuple (X, U, p, r, γ), where X is a state space, U is an action space, p(· | x, u) is a probability distribution over the next state for each state-action pair (x, u), r(x, u) is a reward (a positive or negative real number), and γ is a discount factor. The probability distributions p are called the dynamics.
The controller for the MDP is called a policy. It is often implemented as a probability distribution over actions given the current state x, denoted π(· | x). An MDP equipped with a starting state distribution and a policy gives rise to a Markov Reward Process (MRP). It induces a probability distribution over trajectories (a trajectory is a sequence of states, actions and rewards).
The standard objective in RL is to optimise the expected return, that is, the expected discounted sum of rewards:

    J(π) = E[ Σ_{t≥0} γ^t r(x_t, u_t) ],

where the initial state x_0 is drawn from an initial distribution μ, and where the trajectory is generated by u_t ∼ π(· | x_t) and x_{t+1} ∼ p(· | x_t, u_t).
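By way of illustration, the return above can be estimated by rolling a policy forward through the dynamics and accumulating discounted rewards. The following is a minimal sketch only: the policy, dynamics and reward callables, the horizon and the discount factor are assumptions introduced here for illustration and are not taken from the original filing.

    def rollout_return(policy, dynamics, reward, x0, gamma=0.99, horizon=200):
        """Sample one trajectory of an MDP and compute its discounted return.

        policy(x) is assumed to sample an action u ~ pi(. | x), dynamics(x, u) to sample
        a next state x' ~ p(. | x, u), and reward(x, u) to return the scalar r(x, u).
        """
        x, ret = x0, 0.0
        for t in range(horizon):
            u = policy(x)                         # u_t ~ pi(. | x_t)
            ret += (gamma ** t) * reward(x, u)    # accumulate gamma^t r(x_t, u_t)
            x = dynamics(x, u)                    # x_{t+1} ~ p(. | x_t, u_t)
        return ret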
The dynamics, p, are assumed to be fixed, meaning that if the same action is taken in a given state, the distribution over next states is the same as it would be had that action been taken in that state at some other time. Dynamics are independent of whatever controlling policy is applied to the MDP.
This aspect of fixed dynamics of MDPs is a fundamental assumption in standard RL algorithms. However, there are several problems that result from this.
If a policy is trained on some MDP, and then deployed on an MDP which has different dynamics, the policy will usually perform poorly, i.e., policies tend to be brittle with respect to variations in dynamics. For example, where an RL agent is trained using a simulator (for example, of a car, or robot), and then deployed in a real physical system, the simulator will not be perfect and the dynamics it simulates will not be exactly the same as the real-world dynamics.
Another problem is that the variation in dynamics that occur in the real world at different times will affect the results. For example, the driver of a car experiences different dynamics due to, for example, the road surface, the differences in loads carried, or the tire pressure. Any machine may experience differences in friction due to changes in temperature or lubrication. Any algorithm which produces controlling policies that are brittle to these variations in dynamics is clearly not going to be practically applicable. One of the reasons that RL has not thus far been particularly successful outside the lab or outside of controlled environments such as games is due to this lack of robustness.
Previous approaches have attempted to produce policies that are more robust to this variation in dynamics. For example, in ‘Action Robust Reinforcement Learning and Applications in Continuous Control’, Tessler et al., ICML 2019, the problem is framed as a zero-sum game. Given an action chosen by the policy, the following robustness criteria are considered: (i) with fixed probability, a different, possibly adversarial, action is taken instead, or (ii) a perturbation is added to the action itself. However, although the algorithms perform well in some Mujoco tasks, they perform poorly in others, such as an inverted pendulum.
Another approach, as described in ‘Robust Adversarial Reinforcement Learning’, Pinto et al., ICML 2017, also frames the problem as a zero-sum game, and robustness is learned by alternating adversary and agent policy iterations. ‘Non-Stationary Markov Decision Processes a Worst-Case Approach using Model-Based Reinforcement Learning’, Lecarpentier and Rachelson, arXiv:1904.10090, 2019, models dynamics which vary over time, constrained in terms of Wasserstein distance per unit time. A tree-search algorithm is solved for the worst case, where the environment is adversarial. Experiments are made on grid worlds, i.e., small-scale examples, and the approach does not appear to be extendable to continuous state and action spaces.
Further approaches are described in JP 3465236 B2, which is based on H infinity techniques in classical control theory, CN 107856035 A, which is specifically targeted to a particular class of problem rather than being a general RL algorithm, and US 6665651 B2, which relies on having a controller to train the neural network, with emphasis on the stability of the learning process.
It is desirable to learn to control a system in a way that has improved robustness to variations in dynamics.
SUMMARY OF THE INVENTION
There is provided a system for performing reinforcement learning to generate a solution set of values usable as parameters in a model so as to cause the model to provide a level of performance against a performance metric, the system being configured to, having formed a candidate solution comprising a candidate set of parameter values, repeatedly perform the steps of: making a first assessment of the quality of the candidate solution by assessing the extent to which a model having the values of the candidate solution provides a high level of performance against the performance metric; making a second assessment of the quality of the candidate solution by assessing the extent to which a model having the values of the candidate solution fails to provide a low level of performance against the performance metric; and forming a further candidate solution in dependence on the first and second assessments.
The system may therefore assess the overall performance of the policy to optimise the performance of the model, as well as evaluating the model having those parameters at the worst-case dynamic to aim to minimise the occurrences of a low level of performance. The system can then iteratively form further candidate solutions of the policy parameters in dependence on these first and second assessments.
The system may be configured to assess the quality of the candidate solution by testing the behaviour of the model as configured in accordance with the values of the candidate solution when applied to a set of reference values. This may allow for a convenient way of assessing the quality of the candidate solution.
The system may be further configured to assess the quality of the candidate solution by generating a set of adapted reference values comprising one or more items of adapted reference data in the vicinity of at least some of the set of reference values and testing the behaviour of the model as configured in accordance with the values of the candidate solution when applied to the set of adapted reference values. This may allow the quality of the candidate solution to be tested against dynamics that differ from the reference dynamics.
The one or more items of adapted reference data may be within a predetermined Wasserstein distance of the set of reference values. The set of adapted reference values may represent the worst-case values of the set of reference values. The quality of a policy can therefore be evaluated at the worst-case dynamic within the Wasserstein ball. By evaluating policies on their worst-case dynamic, robustness of the policy can be promoted. The Wasserstein metric may therefore assist in measuring how “wrong” a model or simulator is in a useful and intuitively reasonable way. The set of reference values may comprise the parameters of a neural network. This approach may allow for the efficient computation of the updated reference values and further candidate solution.
The set of reference values may comprise values output from a simulator or differential equation solver. This approach may allow for the efficient computation of the updated reference values and further candidate solution and may allow for the reference values to be varied in ways which are consistent with some set of rules. For example, a simulator for a physical system such as a robot or car will allow variations of quantities such as friction, mass, length etc., but the system is expected to obey Newton’s laws.
The set of reference values may comprise a set of reference dynamics. This may allow the system to be applied in real-world dynamic situations.
The system may be configured to perform an optimisation comprising the first and second assessments of the quality of the candidate solution.
The model may be a trained artificial intelligence model. The model may be a neural network.
According to a second aspect there is provided a method for performing reinforcement learning to generate a solution set of values usable as parameters in a model so as to cause the model to provide a level of performance against a performance metric, the method comprising: forming a candidate solution comprising a candidate set of parameter values; and repeatedly performing the steps of: making a first assessment of the quality of the candidate solution by assessing the extent to which a model having the values of the candidate solution provides a high level of performance against the performance metric; making a second assessment of the quality of the candidate solution by assessing the extent to which a model having the values of the candidate solution fails to provide a low level of performance against the performance metric; and forming a further candidate solution in dependence on the first and second assessments. The method may therefore assess the overall performance of the policy to optimise the performance of the model, as well as evaluating the model having those parameters at the worst-case dynamic to aim to minimise the occurrences of a low level of performance. The method can then iteratively form further candidate solutions of the policy parameters in dependence on these first and second assessments.
The assessment of the quality of the candidate solution may comprise testing the behaviour of the model as configured in accordance with the values of the candidate solution when applied to a set of reference values. This may allow for a convenient way of assessing the quality of the candidate solution.
The assessment of the quality of the candidate solution may further comprise generating a set of adapted reference values comprising one or more items of adapted reference data in the vicinity of at least some of the set of reference values and testing the behaviour of the model as configured in accordance with the values of the candidate solution when applied to the set of adapted reference values. This may allow the quality of the candidate solution to be tested against dynamics that differ from the reference dynamics.
The one or more items of adapted reference data may be within a predetermined Wasserstein distance of the set of reference values. The quality of a policy can therefore be evaluated at the worst-case dynamic within the Wasserstein ball. By evaluating policies on their worst-case dynamic, robustness of the policy can be promoted. The Wasserstein metric may therefore assist in measuring how “wrong” a model or simulator is in a useful and intuitively reasonable way.
BRIEF DESCRIPTION OF THE FIGURES
The present invention will now be described by way of example with reference to the accompanying drawings. In the drawings:
Figure 1 illustrates iteratively updating the parameterised policy and dynamics.
Figure 2 shows a flowchart illustrating a method according to an embodiment of the present invention.
Figure 3 shows an example of a system for implementing the method illustrated in Figure 2.
Figure 4 shows a flowchart illustrating a method according to a further embodiment of the present invention.
Figure 5 shows a flowchart illustrating a method according to a subroutine performed as part of the method illustrated in Figure 4.
Figure 6 shows an example of a system for implementing the methods illustrated in Figures 4 and 5.
Figure 7 illustrates an example of a method for performing reinforcement learning to generate a solution set of values usable as parameters in a model.
DETAILED DESCRIPTION OF THE INVENTION
The present invention relates to a system and method for performing reinforcement learning to generate a solution set of values usable as parameters in a model that are robust to changes in dynamics.
In standard RL algorithms, the quality of a candidate solution for the parameters of the model is assessed by testing the behaviour of the model as configured in accordance with the values of the candidate solution when applied to a set of reference dynamics, p0, which are input to the system. In standard algorithms, a controlling policy is trained using these reference dynamics without consideration of other possible dynamics. In embodiments of the present invention, a set of possible dynamics is considered that is distributed about (for example, centered around) p0. These reference values represent a starting point from which to train the model.
In one embodiment, dynamics within a predetermined Wasserstein distance of the reference dynamics are considered during training of the controlling policy. The Wasserstein distance is used as a measure of divergence between dynamics and is defined for probability measures that are defined with respect to metric spaces. In particular, it is assumed that the state space X is a metric space with the metric denoted by d(·, ·). Letting M(S) denote the set of probability measures over a set S, the set of couplings on probability measures μ and ν is defined as:

    Π(μ, ν) = { κ ∈ M(S × S) : κ(A × S) = μ(A) and κ(S × B) = ν(B) for all measurable sets A, B }.

That is, it is the set of probability measures over the product space which marginalize to μ along one dimension and ν along the other. The p-Wasserstein distance is defined as:

    W_p(μ, ν) = ( inf_{κ ∈ Π(μ, ν)} ∫ d(x, y)^p dκ(x, y) )^(1/p).
In general, the Wasserstein distance has no closed-form solution but can be numerically estimated. However, the squared 2-Wasserstein between Gaussians has a closed form.
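For two multivariate Gaussians that closed form is W_2²(N(m1, S1), N(m2, S2)) = ||m1 − m2||² + Tr(S1 + S2 − 2(S2^(1/2) S1 S2^(1/2))^(1/2)). A minimal numerical sketch of this quantity is given below; the function name and the use of NumPy/SciPy are illustrative choices rather than part of the original disclosure.

    import numpy as np
    from scipy.linalg import sqrtm

    def w2_squared_gaussians(m1, S1, m2, S2):
        """Closed-form squared 2-Wasserstein distance between N(m1, S1) and N(m2, S2)."""
        mean_term = np.sum((m1 - m2) ** 2)                    # ||m1 - m2||^2
        root_S2 = sqrtm(S2)                                   # S2^{1/2}
        cross = sqrtm(root_S2 @ S1 @ root_S2)                 # (S2^{1/2} S1 S2^{1/2})^{1/2}
        cov_term = np.trace(S1 + S2 - 2.0 * np.real(cross))   # discard tiny imaginary parts from sqrtm
        return mean_term + cov_term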
The set of dynamics used consists of dynamics p whose distance from p0, when measured by a Wasserstein metric, is within some predefined limit ε. The Wasserstein metric may be any one of the Wasserstein metric class. The predefined limit may be referred to as the epsilon-Wasserstein ball around p0, defined as:

    B_ε(p0) := { p : W(p, p0) ≤ ε }.
The system is configured to learn the best controlling policy, where the quality of a policy π is the standard RL objective function, but evaluated at the worst-case dynamic (for π) within the Wasserstein ball. By being “pessimistic” about the possible dynamics that may be encountered (i.e. by evaluating policies on their worst-case dynamic), robustness of the policy can be promoted. The Wasserstein metric may therefore assist in measuring how “wrong” a model or simulator is in a useful and intuitively reasonable way.
The Wasserstein distance has the form (distance × probability mass). Therefore, if this product is constrained by some number, it implies that if the distance is large, the probability is small, and if the probability is large, the distance is small. This means that the model can be greatly wrong (large distance), but this is unlikely, or it can be very likely wrong (high probability), but then it cannot be too inaccurate. If the reference dynamic were frequently and highly inaccurate, it would be useless: training on it would be pointless and attempting to achieve robustness a futile endeavor.
Not all dynamics within a Wasserstein ball are plausible; some may violate Newton’s laws of motion, for example. In the present invention, dynamics may be perturbed but remain plausible. To account for this, the policy π is parameterised with a vector θ ∈ R^d1, and written as π_θ. The θ parameters are the parameters (weights) of a neural network and may be updated over time.
The dynamics are also parameterized with another vector ψ ∈ R^d2. Again, these parameters ψ may be updated over time. These parameters may be, for example, the parameters of a neural network, a simulator, or some other processor that implements the system dynamics (for example, a differential equation solver, or a real-life system). The parameter vector corresponding to the reference dynamic p0 is denoted ψ0.
The system may generate a set of adapted reference dynamics parameters comprising adapted reference data in the vicinity of ψ0, within a predetermined Wasserstein distance. The system then assesses the quality of the candidate solution policy parameters by testing the behaviour of the model as configured in accordance with the values of the candidate solution when applied to the adapted reference dynamics parameters.
The system makes a first assessment of the quality of the candidate solution by assessing the extent to which a model having the values of the candidate solution provides a high level of performance against a performance metric, i.e. the system assesses the overall performance of the policy to optimise the performance of the model. The system also makes a second assessment of the quality of the candidate solution by assessing the extent to which a model having the values of the candidate solution fails to provide a low level of performance against the performance metric, i.e. the system evaluates the model having those parameters at the worst-case dynamic and aims to minimise the occurrences of a low level of performance. The system iteratively forms further candidate solutions of the policy parameters in dependence on these first and second assessments.
The optimisation problem to be solved by the system may be expressed as:

    max_θ  min_{ψ : p_ψ ∈ B_ε(p0)}  J(θ, ψ),        (6)

where J(θ, ψ) = E_{(x,u) ∼ ρ_{θ,ψ}}[ r(x, u) ] and ρ_{θ,ψ} is an occupation measure over state-action pairs induced by the policy π_θ and the dynamics p_ψ. In the continuing RL setting this occupation measure is the stationary distribution of the Markov chain induced by the MDP with dynamics p_ψ and policy π_θ, and in the episodic setting it is a probability distribution that serves a similar purpose to the stationary distribution.
The system iteratively solves an approximation of the above optimisation problem, as represented by Figure 1. The policy parameters of the previous candidate solution are updated, shown at 101, in dependence on the updated dynamics parameters ψk, shown at 102, and the updated policy parameters θk, shown at 103, are used in the next iteration of the optimisation problem, shown at 104.
The inner optimisation problem is given by:

    min_ψ  g^T (ψ − ψ0)    subject to    ½ (ψ − ψ0)^T H0 (ψ − ψ0) ≤ ε,

where g denotes the gradient ∇_ψ J(θ, ψ) of the objective with respect to the dynamics parameters. The term H0 is an estimate of the Hessian of the function F evaluated at ψ0, where:

    F(ψ) = E_{(x,u)}[ W_2²( p_ψ(· | x, u), p_ψ0(· | x, u) ) ].        (10)

The solution to the above problem is given by:

    ψ* = ψ0 − sqrt( 2ε / (g^T H0^{-1} g) ) H0^{-1} g,

where:

    g = ∇_ψ J(θ, ψ), evaluated at the current dynamics parameters.
The above-defined optimisation problems can be solved to compute an update of the dynamics parameters ψ, under certain assumptions (for example, that H0 exists and is symmetric positive definite). The updated dynamics parameters are then used to update the policy parameters, which are then used in the subsequent iteration of the optimisation, as shown in Figure 1.
Therefore, the high-level strategy for solving the optimisation problem in Equation (6) is to have an outer loop which updates θ and an inner loop which solves the minimization problem for a given fixed θ.
The system therefore learns the best controlling policy, where the quality of a policy is the standard RL objective function, but the quality is evaluated at the worst-case dynamic within the Wasserstein ball. By being pessimistic about the possible dynamics that may be encountered by the model, robustness of the policy may be promoted.
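The outer/inner structure described above can be sketched as the following skeleton. It assumes hypothetical helper functions (estimate_grad_J_wrt_psi, estimate_hessian_H0 and policy_gradient_update) that are not named in the filing, and it mirrors the update rule given earlier under the assumption that H0 is symmetric positive definite.

    import numpy as np

    def robust_training_loop(theta0, psi0, epsilon, n_iters,
                             estimate_grad_J_wrt_psi,   # assumed helper: g = dJ(theta, psi)/dpsi
                             estimate_hessian_H0,       # assumed helper: Hessian of the Wasserstein term at psi0
                             policy_gradient_update):   # assumed helper: one policy-improvement step on theta
        """Alternate a pessimistic dynamics update (inner step) with a policy update (outer step)."""
        theta, psi = theta0, psi0
        H0 = estimate_hessian_H0(psi0)                       # estimated once, at the reference dynamics
        for k in range(n_iters):
            g = estimate_grad_J_wrt_psi(theta, psi)          # gradient of the return w.r.t. the current dynamics
            Hinv_g = np.linalg.solve(H0, g)                  # H0^{-1} g (conjugate gradient could be used instead)
            step = np.sqrt(2.0 * epsilon / (g @ Hinv_g))     # scale so the quadratic constraint holds with equality
            psi = psi0 - step * Hinv_g                       # move towards the worst-case dynamics in the ball
            theta = policy_gradient_update(theta, psi)       # improve the policy against those pessimistic dynamics
        return theta, psi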
Two exemplary embodiments of the present invention will now be described.
One generic approach using neural networks to define Gaussian distributions will first be described with reference to Figures 2 and 3. This approach allows for the efficient computation of the updated dynamics parameters ψk+1.

In this embodiment, ψk are the weights of a neural network f_ψk that outputs the mean and covariance of a Gaussian distribution, i.e.:

    f_ψk(x, u) = ( μ_ψk(x, u), Σ_ψk(x, u) ),

and the quantity H0^{-1} ∇_ψ J(θk, ψ) is efficiently computed using the conjugate gradient algorithm and automatic differentiation (for example, autograd, see: https://github.com/HIPS/autograd) applied to the following optimisation problem:

    min_v  ½ v^T H0 v − v^T ∇_ψ J(θk, ψ).
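A minimal sketch of that computation follows: the Hessian-vector product is obtained with autograd by differentiating the inner product of the gradient with a fixed vector, and the linear system is then solved with conjugate gradient so the full Hessian never has to be formed. The function names and the callback-style interface are assumptions of this sketch.

    import autograd.numpy as np
    from autograd import grad

    def hessian_vector_product(f, x, v):
        """Compute H(x) @ v for a scalar-valued function f, using two reverse-mode passes."""
        grad_f = grad(f)
        g_dot_v = lambda x_: np.dot(grad_f(x_), v)   # scalar: grad f(x)^T v
        return grad(g_dot_v)(x)                      # d/dx [grad f(x)^T v] = H(x) v

    def conjugate_gradient(hvp, b, iters=20, tol=1e-8):
        """Approximately solve H x = b, given only a callback computing H @ v."""
        x = np.zeros_like(b)
        r = b.copy()                                 # residual b - H x (x starts at zero)
        p = r.copy()
        rs_old = r @ r
        for _ in range(iters):
            Hp = hvp(p)
            alpha = rs_old / (p @ Hp)
            x = x + alpha * p
            r = r - alpha * Hp
            rs_new = r @ r
            if np.sqrt(rs_new) < tol:
                break
            p = r + (rs_new / rs_old) * p
            rs_old = rs_new
        return x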
At step 201 in the flowchart of Figure 2, the system 300 has access to parameters ψ0, which are the parameters of a neural network (NN) NN2, shown at 301 in the system diagram of Figure 3, that represent the reference dynamic p0. NN2 301 takes as input a state-action pair (x, u) and outputs a mean vector and covariance matrix μ_ψ0(x, u), Σ_ψ0(x, u). These are fed into a sampler 302 which samples a next state x' from a multivariate Gaussian distribution which has μ_ψ0 and Σ_ψ0 as mean and covariance, respectively. This is a standard approach to model dynamics in reinforcement learning.
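A sampler of this kind can be sketched as below, under the simplifying assumption of a small fully connected network with a diagonal covariance (the description above refers to a full covariance matrix); the parameter layout in psi is an assumption made for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_next_state(psi, x, u):
        """Sample x' from a Gaussian dynamics model parameterised by the weights in psi."""
        inp = np.concatenate([x, u])
        hidden = np.tanh(psi["W1"] @ inp + psi["b1"])          # single hidden layer, for illustration only
        mean = psi["W_mu"] @ hidden + psi["b_mu"]              # mean of the next-state distribution
        log_var = psi["W_var"] @ hidden + psi["b_var"]         # log-variance keeps the (diagonal) covariance positive
        std = np.exp(0.5 * log_var)
        return mean + std * rng.standard_normal(mean.shape)    # x' ~ N(mean, diag(std^2))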
The system also assumes it has access to a fixed policy π_θk, which is implemented as a neural network NN1, shown at 303 in Figure 3. NN1 takes as input a state x and gives some parameters to a sampler 304 that uses them to sample an action according to a probability distribution. This is a standard method of implementing a stochastic policy in reinforcement learning.
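For concreteness, such a stochastic policy could be implemented as in the sketch below, where the network outputs the mean of a diagonal Gaussian over actions together with a learned, state-independent log-standard-deviation; the Gaussian form and the parameter names are assumptions of this sketch rather than details taken from the filing.

    import numpy as np

    rng = np.random.default_rng(1)

    def sample_action(theta, x):
        """Sample an action u ~ pi_theta(. | x) from a Gaussian stochastic policy."""
        hidden = np.tanh(theta["W1"] @ x + theta["b1"])
        mean = theta["W_mu"] @ hidden + theta["b_mu"]          # mean action proposed by the network
        std = np.exp(theta["log_std"])                         # state-independent standard deviation
        return mean + std * rng.standard_normal(mean.shape)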
At step 202 in Figure 2, the system initializes the policy parameter vector arbitrarily as θ0, which is a Euclidean vector. The system initializes the Hessian matrix estimate H to the d × d zero matrix, where d is the dimension of the dynamics parameters ψk.
At step 203, the system then samples a batch B of trajectories using the latest parameters ψk and θk. This is done using NN3 and NN4, shown at 305 and 306 respectively in Figure 3, and their associated samplers 302 and 307. Since a trajectory is a sequence of (state, action, reward) tuples, each new state-action pair is fed as input to NN3 305 (which feeds into its sampler) to sample a new state, and each new state is fed to NN4 306 (which feeds into its sampler) to sample a new action. It is assumed that there is a mechanism to extract the rewards from the sampled trajectories; for example, the reward function may be a known function of state-action pairs, or a simulator may be applied.
The gradient is given by:

    g = ∇_ψ J(θk, ψ), evaluated at ψ = ψk,

which can be estimated with the aid of the formula:

    ∇_ψ J(θk, ψ) = E_τ [ ( Σ_t ∇_ψ log p_ψ(x_{t+1} | x_t, u_t) ) ( Σ_t γ^t r(x_t, u_t) ) ],        (15)

where τ denotes a trajectory. That is, an empirical estimate of the right-hand side of Equation (15) can be made by averaging the quantity in the square brackets using the batch B. This function is performed by the gradient estimator, 308 in Figure 3, at step 204 of Figure 2.
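A likelihood-ratio estimate of this kind can be computed directly with automatic differentiation, as in the sketch below. Here log_prob_fn is an assumed helper returning the summed log-probability that the parameterised distribution assigns to the sampled transitions of a trajectory, written with autograd-compatible operations, and the dictionary layout of each trajectory is likewise an assumption.

    import autograd.numpy as np
    from autograd import grad

    def score_function_gradient(log_prob_fn, params, trajectories, gamma=0.99):
        """Likelihood-ratio (score-function) gradient estimate from a batch of trajectories."""
        def surrogate(p):
            total = 0.0
            for traj in trajectories:
                ret = sum(gamma ** t * r for t, r in enumerate(traj["rewards"]))   # discounted return
                total = total + log_prob_fn(p, traj) * ret                         # log-probability x return
            return total / len(trajectories)
        return grad(surrogate)(params)   # gradient of the surrogate is the estimator above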
As can be seen from the flowchart of Figure 2, the next stage after initialization at step 205 is a loop comprising steps 206-210, the ultimate purpose of which is to estimate the inverse-Hessian-vector product H0^{-1} g. This is done by generating, and eventually averaging, estimates v_1, ..., v_n, where each v_i is an estimate of that quantity. In order to generate a v_i, the following is done. At step 206, a sample state-action pair (x, u) is drawn.
This sample is drawn through application of NN1 303 and its sampler 304, and NN3 305 and its sampler 302. The sample is input into NN3 305 to obtain the mean vector and covariance matrix μ_ψ(x, u), Σ_ψ(x, u) as the output of NN3. This is fed as the input to the Gaussian Wasserstein computation network (GWCN) 309. As shown in Figure 3, the GWCN also takes as input μ_ψ0(x, u), Σ_ψ0(x, u). At step 207, internally, it computes and outputs:

    ||μ_ψ(x, u) − μ_ψ0(x, u)||² + Tr( Σ_ψ(x, u) + Σ_ψ0(x, u) − 2( Σ_ψ0(x, u)^(1/2) Σ_ψ(x, u) Σ_ψ0(x, u)^(1/2) )^(1/2) ).

This is equal to W_2²( N(μ_ψ, Σ_ψ), N(μ_ψ0, Σ_ψ0) ), i.e., the square of the 2-Wasserstein distance between multivariate normal distributions. Since NN3 305 feeds directly into GWCN 309, at step 208 an automatic differentiation engine (for example, autograd: https://github.com/HIPS/autograd) is used to efficiently compute the inverse Hessian-vector product v_i.
The loop continues until the desired number of training scenarios are completed. Each v_i is given to the dynamics parameters update calculator (DPUC) 310, which computes the average of the v_i and uses it as an estimate of H0^{-1} g. This is used by the DPUC at step 211 to compute the updated dynamics parameters:

    ψk+1 = ψ0 − sqrt( 2ε / (g^T H0^{-1} g) ) H0^{-1} g.
The DPUC feeds into the policy parameters update engine (PPUE) 311 at step 212, which computes the new policy parameters θk+1 and feeds them into NN4 306. This completes the cycle. The new policy parameters θk+1 are then used in the next iteration of the optimisation problem and the above steps are repeated.
In another example, which will now be described with reference to Figures 4-6, ψk corresponds to the parameters of a simulator. The system 600 has access to a simulator, shown at 601 in Figure 6, that allows for the parameterization of the dynamics with a vector ψk ∈ R^d.
At step 401 in Figure 4, the system accesses parameters ψ0. ψ0 are the parameters of the simulator that represent the reference dynamic. θ0 is set to the zero vector and the Hessian estimate H is set to the zero matrix at step 402. The system enters a loop which ends with an estimate of H0. It does this through application of a method known in the literature as “evolutionary strategies”.
In this embodiment, H0 is estimated using an evolutionary-strategies (zero-order) formula based on random perturbations of the reference parameters ψ0, with F (see Equation (10)) being estimated using Wasserstein distance calculations for the empirical distributions formed from the points generated by the samples.

The dynamics parameters are d-dimensional vectors that parameterise the simulator. A random perturbation vector is sampled by the multivariate Gaussian sampler 603, and the system sets ψ to the reference parameters ψ0 perturbed by this vector and passes it to the simulator 601, as shown at step 404. At step 405, the system then enters sub-routine A, illustrated in the flowchart of Figure 5, which takes ψ and π_θ as its input at step 501; at step 502, ψ and π_θ are used to draw state-action samples.
The neural network NN1, shown at 604 in Figure 6, and its sampler 605 are used to sample actions from π_θ, and the simulator 601 parameterised with ψ is used to perform the sampling above. At steps 503 and 504 respectively, the above samples are fed into the simulator 601 to generate a number of next-state samples under the dynamics parameterised by ψ and under the reference dynamics parameterised by ψ0. These are held in storage in data sets, shown at steps 505 and 506 respectively. These data sets are fed into the Wasserstein computation engine (WCE) 606 in order to compute an empirical Wasserstein distance at step 507, which is considered to be an estimate of the Wasserstein distance between the dynamics p_ψ and the reference dynamics p_ψ0.
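One simple way of computing such an empirical Wasserstein distance between two equal-sized sample sets is via an optimal one-to-one matching, as sketched below; treating the two sample sets as uniformly weighted empirical distributions of the same size, and the function name itself, are assumptions of this sketch.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def empirical_wasserstein(samples_p, samples_q, p=2):
        """Empirical p-Wasserstein distance between two sample sets of equal size (n, d)."""
        diffs = samples_p[:, None, :] - samples_q[None, :, :]   # pairwise differences, shape (n, n, d)
        costs = np.linalg.norm(diffs, axis=-1) ** p             # d(x_i, y_j)^p
        rows, cols = linear_sum_assignment(costs)               # optimal matching (Hungarian algorithm)
        return costs[rows, cols].mean() ** (1.0 / p)            # ((1/n) * sum of matched costs)^(1/p)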
This cycle of steps 502-508 is repeated a number of times and an average of the estimates is taken at step 510. This is done by passing the estimates from the memory store 607 to the arithmetic mean calculator 608, illustrated by step 509. The result is referred to as h_i in Figure 4. The outer cycle (which calls subroutine A) is repeated for each sampled perturbation, and the average is computed and set as the estimate of H0.
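The zero-order estimation of H0 can be sketched with the standard Gaussian-smoothing identity H0 ≈ E[ F(ψ0 + σ ε)(ε ε^T − I) ] / σ², which holds to leading order for smooth F; this particular identity, the smoothing scale σ and the sample count are assumptions of this sketch and not necessarily the exact formula used in the filing.

    import numpy as np

    def es_hessian_estimate(F, psi0, sigma=0.05, n_samples=128, rng=None):
        """Evolutionary-strategies style (zero-order) estimate of the Hessian of F at psi0."""
        rng = np.random.default_rng() if rng is None else rng
        d = psi0.shape[0]
        H = np.zeros((d, d))
        for _ in range(n_samples):
            eps = rng.standard_normal(d)
            f_val = F(psi0 + sigma * eps)                    # e.g. an empirical Wasserstein distance estimate
            H += f_val * (np.outer(eps, eps) - np.eye(d))
        H /= n_samples * sigma ** 2
        return 0.5 * (H + H.T)                               # symmetrise the finite-sample estimate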
As can be seen from Figure 4, the system then enters a cycle with an entry point at step 409, which samples a batch B of trajectories using θk and γk. This is done using the simulator 601 along with NN2 602 and its sampler 610. At step 410, B is used by the system to estimate [equation not reproduced] using the formula [equation not reproduced].
This function is performed by the gradient estimator 609 in Figure 6.
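Because the exact estimator is given by the formula above, which is not reproduced here, the sketch below only illustrates the general shape of such a batch estimate: a likelihood-ratio (score-function) form in which each trajectory's log-probability gradient is weighted by its discounted return and the results are averaged over the batch B. The helper grad_log_prob is hypothetical.

# Generic likelihood-ratio gradient estimate over a batch of trajectories;
# each trajectory is a list of (state, action, reward, next_state) tuples
# sampled with the current policy and dynamics parameters.
import numpy as np

def batch_gradient_estimate(batch, grad_log_prob, discount=0.99):
    """grad_log_prob(s, a, s_next): gradient, with respect to the parameters
    of interest, of the log-probability of the sampled transition/action."""
    grads = []
    for traj in batch:
        ret = sum(discount**t * r for t, (_, _, r, _) in enumerate(traj))
        score = sum(grad_log_prob(s, a, s_next) for (s, a, _, s_next) in traj)
        grads.append(ret * score)
    return np.mean(grads, axis=0)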
At step 411, the system then computes an estimate of [equation not reproduced] by applying the conjugate gradient algorithm in conjugate gradient estimator 611 to the optimisation problem [equation not reproduced].
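One plausible reading of this step, consistent with trust-region-style methods, is that the constrained update reduces to solving a linear system in the estimated Hessian, which the conjugate gradient method handles without inverting H0 explicitly. The sketch below is an assumption along those lines, using an off-the-shelf solver and a small damping term, rather than the exact routine of conjugate gradient estimator 611.

# Sketch: given an explicit estimate H0 (d x d) and a gradient estimate g,
# obtain a search direction by solving (H0 + damping*I) x = g with
# conjugate gradient.
import numpy as np
from scipy.sparse.linalg import cg

def solve_step(H0, g, damping=1e-4):
    d = H0.shape[0]
    direction, info = cg(H0 + damping * np.eye(d), g)
    if info != 0:
        raise RuntimeError("conjugate gradient did not converge")
    return direction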
At step 412, the result is passed into the new dynamics parameters calculation engine 612, which performs the computation [equation not reproduced]. This determines the new dynamics parameters [equation not reproduced]. Subsequently, the new parameters are passed into the policy parameters update engine 613, which computes the new policy parameters [equation not reproduced]
at step 413. The updating of the policy parameters may be performed by algorithms such as PPO (as described in Schulman, John, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017)), TRPO (as described in Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P. (2015, June). Trust region policy optimization. In International Conference on Machine Learning (pp. 1889-1897)) or “vanilla” policy gradient updates.
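As an illustration of the first of the cited options, the clipped surrogate objective at the heart of PPO can be written as below. Real PPO implementations add minibatching, a value-function loss and an entropy bonus, all of which are omitted from this sketch.

# PPO-style clipped surrogate objective; all arguments are arrays over the
# sampled (state, action) pairs, and the objective is maximised with respect
# to the new policy parameters.
import numpy as np

def ppo_clip_objective(log_prob_new, log_prob_old, advantages, clip_eps=0.2):
    ratio = np.exp(log_prob_new - log_prob_old)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Element-wise minimum of the unclipped and clipped terms, averaged.
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))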
The new policy parameters θk+1 are then used in the next iteration of the optimisation and the above steps are repeated.
Using the parameters from a simulator has an advantage over the generic method described previously because, generally, simulators allow their parameters to be varied in ways which are consistent with some set of rules. For example, a simulator for a physical system such as a robot or car will allow variations of quantities such as friction, mass, length etc., but the system is expected to obey Newton's laws. Alternatively, the parameters used in this embodiment may come from a differential equation solver that allows for the parameterization of the dynamics with a vector γk ∈ Rd.
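A hypothetical illustration of such a parameterisation is given below; the class name and the particular physical quantities are invented for the example and do not come from this disclosure. The point is only that the dynamics are exposed as a vector in Rd while the simulator itself continues to enforce physically consistent behaviour.

# Hypothetical simulator dynamics exposed as a parameter vector.
from dataclasses import dataclass
import numpy as np

@dataclass
class CartDynamics:
    friction: float = 0.05
    cart_mass: float = 1.0
    pole_mass: float = 0.1
    pole_length: float = 0.5

    def to_vector(self):
        return np.array([self.friction, self.cart_mass,
                         self.pole_mass, self.pole_length])

    @classmethod
    def from_vector(cls, gamma):
        return cls(*gamma)

Perturbing the vector and rebuilding the dynamics object then corresponds to exploring nearby dynamics while the laws of motion remain built into the simulator's step function.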
Figure 7 summarises a method for performing reinforcement learning to generate a solution set of values usable as parameters in a model so as to cause the model to provide a level of performance against a performance metric. At step 701, the method comprises forming a candidate solution comprising a candidate set of parameter values. The method then comprises repeatedly performing the following steps 702-704. At step 702, a first assessment is made of the quality of the candidate solution by assessing the extent to which a model having the values of the candidate solution provides a high level of performance against the performance metric. At step 703, a second assessment is made of the quality of the candidate solution by assessing the extent to which a model having the values of the candidate solution fails to provide a low level of performance against the performance metric. At step 704, a further candidate solution is formed in dependence on the first and second assessments.
In the approaches described above, a model is formed for analysing data and providing indications of one or more properties of that input data. The model is generalised, and operates in dependence on values that control the performance of the model. For example, the model could be a neural network and the values could be weights applied to the network. The system described above selects the values by training against reference or training data. The training data comprises a set of possible input data to the model and, for each one, a corresponding expected output from the model. To select the values, the system operates a first loop and a second loop. The second loop runs inside the first loop. In the first loop, a candidate set of values is formed by selecting for high performance of the model having those values against the training data. Put another way, the candidate set of values are selected or assessed in dependence on a determination of whether there is relatively high conformity between the outputs of the model configured with those values taking the training data as input and the expected outputs for that training data. In the inner loop the candidate set of values is tested for low performance of the model having those values against the training data. Put another way, the candidate set of values are selected or assessed in dependence on a determination of whether there is relatively low conformity between the outputs of the model configured with those values taking the training data as input and the expected outputs for that training data. This process is repeated numerous times, with the candidate set of values for each iteration being selected such that it has been determined to exhibit a relatively (e.g. relative to a previous candidate set of data) high propensity for good performance and a relatively low propensity for poor performance.
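The two-loop structure just described can be summarised in the following schematic sketch; the three callback functions are placeholders for the machinery described earlier (the performance assessment, the worst-case assessment over nearby dynamics, and the parameter update), not a complete implementation.

# Schematic outline of the outer/inner loop structure.
def robust_training_loop(initial_values, assess_high, assess_low,
                         update_values, n_iterations=100):
    """initial_values: candidate set of parameter values for the model.
    assess_high(values): first assessment - how well the model performs.
    assess_low(values): second assessment - how badly it can be made to
    perform, e.g. under worst-case dynamics near the reference.
    update_values(values, high, low): forms the next candidate solution."""
    candidate = initial_values
    for _ in range(n_iterations):
        high_score = assess_high(candidate)   # outer-loop assessment
        low_score = assess_low(candidate)     # inner-loop assessment
        candidate = update_values(candidate, high_score, low_score)
    return candidate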
The approach described herein helps to solve the problem of brittleness in reinforcement learning, and in particular can provide a system and method for learning to control a system in a way that is robust to variations in dynamics. Therefore, the present invention may allow a user to train on a simulator and deploy in the real world with good performance.
Embodiments of the present invention may provide advantages over previous approaches. The present approach performs better in experiments and can be used for continuous state and action spaces. In particular, the approach described in 'Non-Stationary Markov Decision Processes, a Worst-Case Approach using Model-Based Reinforcement Learning', Lecarpentier and Rachelson, arXiv:1904.10090, 2019, is only applicable to scenarios with finite discrete state and action spaces. The described approach operates with continuous state and action spaces, which is an essential requirement for practical applicability in real-world settings.
One particularly advantageous situation to which the present invention may be applied is a self-driving car. The dynamics of the car will vary due to a multitude of factors, from variations in road surfaces, road inclines, tire pressure and frictional forces to variations due to the weight carried. It is clear in this example that there is no single set of dynamics that the car will experience throughout its lifetime, or even over a short period of time. The present algorithm is more robust to such variations and can cope with novel dynamics without having to learn to handle them in the deployment environment. Therefore, the solution disclosed herein may learn a controlling policy that is robust to variations in dynamics between the environment it was trained on and the environment it is deployed in.
In the above description, dynamics within a predetermined Wasserstein distance of the reference dynamics are considered during training of the controlling policy. However, other metrics may also be used as the distance function.
The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in the light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problems disclosed herein, and without limitation to the scope of the claims. The applicant indicates that aspects of the present invention may consist of any such individual feature or combination of features. In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the invention.

Claims

1. A system for performing reinforcement learning to generate a solution set of values usable as parameters in a model so as to cause the model to provide a level of performance against a performance metric, the system being configured to, having formed a candidate solution comprising a candidate set of parameter values, repeatedly perform the steps of:
making a first assessment of the quality of the candidate solution by assessing the extent to which a model having the values of the candidate solution provides a high level of performance against the performance metric;
making a second assessment of the quality of the candidate solution by assessing the extent to which a model having the values of the candidate solution fails to provide a low level of performance against the performance metric; and
forming a further candidate solution in dependence on the first and second assessments.
2. The system as claimed in claim 1, wherein the system is configured to assess the quality of the candidate solution by testing the behaviour of the model as configured in accordance with the values of the candidate solution when applied to a set of reference values.
3. The system as claimed in claim 2, wherein the system is further configured to assess the quality of the candidate solution by generating a set of adapted reference values comprising one or more items of adapted reference data in the vicinity of at least some of the set of reference values and testing the behaviour of the model as configured in accordance with the values of the candidate solution when applied to the set of adapted reference values.
4. The system as claimed in claim 3, wherein the one or more items of adapted reference data are within a predetermined Wasserstein distance of the set of reference values.
5. The system as claimed in claim 3 or claim 4, wherein the set of adapted reference values represents the worst-case values of the set of reference values.
6. The system as claimed in any of claims 2 to 5, wherein the set of reference values comprises the parameters of a neural network.
7. The system as claimed in any of claims 2 to 5, wherein the set of reference values comprises values output from a simulator or differential equation solver.
8. The system as claimed in any of claims 2 to 7, wherein the set of reference values comprises a set of reference dynamics.
9. The system as claimed in any one of the preceding claims, wherein the system is configured to perform an optimisation comprising the first and second assessments of the quality of the candidate solution.
10. The system as claimed in any one of the preceding claims, wherein the model is a trained artificial intelligence model.
11. The system as claimed in claim 10, wherein the model is a neural network.
12. A method for performing reinforcement learning to generate a solution set of values usable as parameters in a model so as to cause the model to provide a level of performance against a performance metric, the method comprising:
forming a candidate solution comprising a candidate set of parameter values; and repeatedly performing the steps of:
making a first assessment of the quality of the candidate solution by assessing the extent to which a model having the values of the candidate solution provides a high level of performance against the performance metric;
making a second assessment of the quality of the candidate solution by assessing the extent to which a model having the values of the candidate solution fails to provide a low level of performance against the performance metric; and forming a further candidate solution in dependence on the first and second assessments.
13. The method as claimed in claim 12, wherein the assessment of the quality of the candidate solution comprises testing the behaviour of the model as configured in accordance with the values of the candidate solution when applied to a set of reference values.
14. The method as claimed in claim 13, wherein the assessment of the quality of the candidate solution further comprises generating a set of adapted reference values comprising one or more items of adapted reference data in the vicinity of at least some of the set of reference values and testing the behaviour of the model as configured in accordance with the values of the candidate solution when applied to the set of adapted reference values.
15. The method as claimed in claim 14, wherein the one or more items of adapted reference data are within a predetermined Wasserstein distance of the set of reference values.

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201980098404.XA CN114270375A (en) 2019-07-16 2019-07-16 Learning to robustly control a system
PCT/EP2019/069101 WO2021008691A1 (en) 2019-07-16 2019-07-16 Learning to robustly control a system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2019/069101 WO2021008691A1 (en) 2019-07-16 2019-07-16 Learning to robustly control a system

Publications (1)

Publication Number Publication Date
WO2021008691A1 true WO2021008691A1 (en) 2021-01-21

Family

ID=67314771

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2019/069101 WO2021008691A1 (en) 2019-07-16 2019-07-16 Learning to robustly control a system

Country Status (2)

Country Link
CN (1) CN114270375A (en)
WO (1) WO2021008691A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3465236B2 (en) 2000-12-20 2003-11-10 科学技術振興事業団 Robust reinforcement learning method
US6665651B2 (en) 2001-07-18 2003-12-16 Colorado State University Research Foundation Control system and technique employing reinforcement learning having stability and learning phases
CN107856035A (en) 2017-11-06 2018-03-30 深圳市唯特视科技有限公司 A kind of robustness dynamic motion method based on intensified learning and whole body controller

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
Smirnova, Elena et al.: "Distributionally Robust Reinforcement Learning", arXiv.org, Cornell University Library, 23 February 2019 (2019-02-23), XP081032496 *
Lecarpentier, Erwan et al.: "Non-Stationary Markov Decision Processes, a Worst-Case Approach using Model-Based Reinforcement Learning, Extended version", 24 May 2019 (2019-05-24), XP055678929, Retrieved from the Internet <URL:https://arxiv.org/pdf/1904.10090v2.pdf> [retrieved on 20200323] *
Lecarpentier and Rachelson: "Non-Stationary Markov Decision Processes, a Worst-Case Approach using Model-Based Reinforcement Learning", arXiv:1904.10090, 2019
Pinto et al.: "Robust Adversarial Reinforcement Learning", ICML, 2017
Schulman, J., Levine, S., Abbeel, P., Jordan, M. and Moritz, P.: "Trust Region Policy Optimization", International Conference on Machine Learning, June 2015 (2015-06-01), pages 1889-1897
Schulman, John, Filip Wolski, Prafulla Dhariwal, Alec Radford and Oleg Klimov: "Proximal Policy Optimization Algorithms", arXiv preprint arXiv:1707.06347, 2017
Tessler et al.: "Action Robust Reinforcement Learning and Applications in Continuous Control", ICML, 2019

Also Published As

Publication number Publication date
CN114270375A (en) 2022-04-01

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19740558

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19740558

Country of ref document: EP

Kind code of ref document: A1