WO2021008691A1 - Learning to robustly control a system - Google Patents

Learning to robustly control a system Download PDF

Info

Publication number
WO2021008691A1
WO2021008691A1 (PCT/EP2019/069101)
Authority
WO
WIPO (PCT)
Prior art keywords
values
candidate solution
model
quality
candidate
Prior art date
Application number
PCT/EP2019/069101
Other languages
French (fr)
Inventor
Mohammed Abdullah
Haitham AMMAR
Hang REN
Mingtian ZHANG
Original Assignee
Huawei Technologies Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. filed Critical Huawei Technologies Co., Ltd.
Priority to CN201980098404.XA priority Critical patent/CN114270375A/en
Priority to PCT/EP2019/069101 priority patent/WO2021008691A1/en
Publication of WO2021008691A1 publication Critical patent/WO2021008691A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/004Artificial life, i.e. computing arrangements simulating life
    • G06N3/006Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks

Definitions

  • the approach described herein helps to solve the problem of brittleness in reinforcement learning, and in particular can provide a system and method for learning to control a system in a way that is robust to variations in dynamics. Therefore, the present invention may allow a user to train on a simulator and deploy in the real world with good performance.
  • Embodiments of the present invention may provide advantages over previous approaches.
  • the present approach performs better in experiments and can be used for continuous state and action spaces.
  • the approach described in ‘Non-Stationary Markov Decision Processes a Worst-Case Approach using Model-Based Reinforcement Learning', Lecarpentier and Rachelson, arXiv:1904.10090, 2019, is only applicable to scenarios with finite discrete state and action spaces.
  • the described approach operates with continuous state and action spaces, which is an essential requirement to be practically applicable in real world settings.
  • One particularly advantageous situation to which the present invention may be applied is a self-driving car.
  • the dynamics of the car will vary due to a multitude of factors, from variations of road surfaces, road inclines, tire pressure, frictional forces and variations due to weight carried. It is clear in this example that there is no single set of dynamics that the car will experience throughout its lifetime or even over a short period of time.
  • the present algorithm is more robust to such variations and can cope with novel dynamics without having to learn to cope with them in the environment. Therefore, the solution disclosed herein may learn a controlling policy that is robust to variations in dynamics between the environment it was trained on and the environment it is deployed on.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A system for performing reinforcement learning to generate a solution set of values usable as parameters in a model so as to cause the model to provide a level of performance against a performance metric, the system being configured to, having formed a candidate solution comprising a candidate set of parameter values, repeatedly perform the steps of: making a first assessment of the quality of the candidate solution by assessing the extent to which a model having the values of the candidate solution provides a high level of performance against the performance metric; making a second assessment of the quality of the candidate solution by assessing the extent to which a model having the values of the candidate solution fails to provide a low level of performance against the performance metric; and forming a further candidate solution in dependence on the first and second assessments.

Description

LEARNING TO ROBUSTLY CONTROL A SYSTEM
FIELD OF THE INVENTION
This disclosure relates to avoiding brittleness in reinforcement learning, in particular to learning to control a system in a way that is robust to variations in dynamics.
BACKGROUND
In reinforcement learning (RL), a system is modelled as a Markov Decision Problem (MDP). This is defined as a tuple (X, U, p, r, γ), where X is a state space, U is an action space, p(· | x, u) is a probability distribution over the next state for each state-action pair (x, u), r(x, u) is a reward (a positive or negative real number), and γ is a discount factor. The probability distributions p are called the dynamics.
The controller for the MDP is called a policy. It is often implemented as a probability distribution over actions given the current state x, denoted π(· | x). An MDP equipped with a starting state distribution and a policy gives rise to a Markov Reward Process (MRP). It induces a probability distribution over trajectories (a trajectory is a sequence of states, actions and rewards).
The standard objective in RL is to optimise the expected return, that is, the expected discounted sum of rewards:

    J(π) = E[ Σ_{t≥0} γ^t r(x_t, u_t) ],

where the initial state x_0 is drawn from an initial distribution μ, and where the trajectory is generated by u_t ∼ π(· | x_t) and x_{t+1} ∼ p(· | x_t, u_t).
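By way of illustration, the return above can be estimated by rolling a policy forward through the dynamics and accumulating discounted rewards. The following is a minimal sketch only: the policy, dynamics and reward callables, the horizon and the discount factor are assumptions introduced here for illustration and are not taken from the original filing.

    def rollout_return(policy, dynamics, reward, x0, gamma=0.99, horizon=200):
        """Sample one trajectory of an MDP and compute its discounted return.

        policy(x) is assumed to sample an action u ~ pi(. | x), dynamics(x, u) to sample
        a next state x' ~ p(. | x, u), and reward(x, u) to return the scalar r(x, u).
        """
        x, ret = x0, 0.0
        for t in range(horizon):
            u = policy(x)                         # u_t ~ pi(. | x_t)
            ret += (gamma ** t) * reward(x, u)    # accumulate gamma^t r(x_t, u_t)
            x = dynamics(x, u)                    # x_{t+1} ~ p(. | x_t, u_t)
        return ret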
The dynamics, p, are assumed to be fixed, meaning that if the same action is taken in a given state, the distribution over next states is the same as it would be had that action been taken in that state at some other time. Dynamics are independent of whatever controlling policy is applied to the MDP.
This aspect of fixed dynamics of MDPs is a fundamental assumption in standard RL algorithms. However, there are several problems that result from this.
If a policy is trained on some MDP, and then deployed on an MDP which has different dynamics, the policy will usually perform poorly, i.e., policies tend to be brittle with respect to variations in dynamics. For example, where an RL agent is trained using a simulator (for example, of a car, or robot), and then deployed in a real physical system, the simulator will not be perfect and the dynamics it simulates will not be exactly the same as the real-world dynamics.
Another problem is that the variation in dynamics that occur in the real world at different times will affect the results. For example, the driver of a car experiences different dynamics due to, for example, the road surface, the differences in loads carried, or the tire pressure. Any machine may experience differences in friction due to changes in temperature or lubrication. Any algorithm which produces controlling policies that are brittle to these variations in dynamics is clearly not going to be practically applicable. One of the reasons that RL has not thus far been particularly successful outside the lab or outside of controlled environments such as games is due to this lack of robustness.
Previous approaches have attempted to produce policies that are more robust to this variation in dynamics. For example, in ‘Action Robust Reinforcement Learning and Applications in Continuous Control’, Tessler et al., ICML 2019, the problem is framed as a zero-sum game. Given an action chosen by the policy, the following robustness criteria are considered: (i) with fixed probability, a different, possibly adversarial, action is taken instead, or (ii) a perturbation is added to the action itself. However, although the algorithms perform well in some Mujoco tasks, they perform poorly in others, such as an inverted pendulum.
Another approach, as described in ‘Robust Adversarial Reinforcement Learning’, Pinto et al., ICML 2017, also frames the problem as a zero-sum game, and robustness is learned by alternating adversary and agent policy iterations. ‘Non-Stationary Markov Decision Processes a Worst-Case Approach using Model-Based Reinforcement Learning’, Lecarpentier and Rachelson, arXiv:1904.10090, 2019, models dynamics which vary over time, constrained in terms of Wasserstein distance per unit time. A tree-search algorithm is solved for the worst case, where the environment is adversarial. Experiments are made on grid worlds, i.e., small-scale examples, and the approach does not appear to be extendable to continuous state and action spaces.
Further approaches are described in JP 3465236 B2, which is based on H infinity techniques in classical control theory, CN 107856035 A, which is specifically targeted to a particular class of problem rather than being a general RL algorithm, and US 6665651 B2, which relies on having a controller to train the neural network, with emphasis on the stability of the learning process.
It is desirable to learn to control a system in a way that has improved robustness to variations in dynamics.
SUMMARY OF THE INVENTION
There is provided a system for performing reinforcement learning to generate a solution set of values usable as parameters in a model so as to cause the model to provide a level of performance against a performance metric, the system being configured to, having formed a candidate solution comprising a candidate set of parameter values, repeatedly perform the steps of: making a first assessment of the quality of the candidate solution by assessing the extent to which a model having the values of the candidate solution provides a high level of performance against the performance metric; making a second assessment of the quality of the candidate solution by assessing the extent to which a model having the values of the candidate solution fails to provide a low level of performance against the performance metric; and forming a further candidate solution in dependence on the first and second assessments.
The system may therefore assess the overall performance of the policy to optimise the performance of the model, as well as evaluating the model having those parameters at the worst-case dynamic to aim to minimise the occurrences of a low level of performance. The system can then iteratively form further candidate solutions of the policy parameters in dependence on these first and second assessments.
The system may be configured to assess the quality of the candidate solution by testing the behaviour of the model as configured in accordance with the values of the candidate solution when applied to a set of reference values. This may allow for a convenient way of assessing the quality of the candidate solution.
The system may be further configured to assess the quality of the candidate solution by generating a set of adapted reference values comprising one or more items of adapted reference data in the vicinity of at least some of the set of reference values and testing the behaviour of the model as configured in accordance with the values of the candidate solution when applied to the set of adapted reference values. This may allow the quality of the candidate solution to be tested against dynamics that differ from the reference dynamics.
The one or more items of adapted reference data may be within a predetermined Wasserstein distance of the set of reference values. The set of adapted reference values may represent the worst-case values of the set of reference values. The quality of a policy can therefore be evaluated at the worst-case dynamic within the Wasserstein ball. By evaluating policies on their worst-case dynamic, robustness of the policy can be promoted. The Wasserstein metric may therefore assist in measuring how “wrong” a model or simulator is in a useful and intuitively reasonable way. The set of reference values may comprise the parameters of a neural network. This approach may allow for the efficient computation of the updated reference values and further candidate solution.
The set of reference values may comprise values output from a simulator or differential equation solver. This approach may allow for the efficient computation of the updated reference values and further candidate solution and may allow for the reference values to be varied in ways which are consistent with some set of rules. For example, a simulator for a physical system such as a robot or car will allow variations of quantities such as friction, mass, length etc., but the system is expected to obey Newton’s laws.
The set of reference values may comprise a set of reference dynamics. This may allow the system to be applied in real-world dynamic situations.
The system may be configured to perform an optimisation comprising the first and second assessments of the quality of the candidate solution.
The model may be a trained artificial intelligence model. The model may be a neural network.
According to a second aspect there is provided a method for performing reinforcement learning to generate a solution set of values usable as parameters in a model so as to cause the model to provide a level of performance against a performance metric, the method comprising: forming a candidate solution comprising a candidate set of parameter values; and repeatedly performing the steps of: making a first assessment of the quality of the candidate solution by assessing the extent to which a model having the values of the candidate solution provides a high level of performance against the performance metric; making a second assessment of the quality of the candidate solution by assessing the extent to which a model having the values of the candidate solution fails to provide a low level of performance against the performance metric; and forming a further candidate solution in dependence on the first and second assessments. The method may therefore assess the overall performance of the policy to optimise the performance of the model, as well as evaluating the model having those parameters at the worst-case dynamic to aim to minimise the occurrences of a low level of performance. The method can then iteratively form further candidate solutions of the policy parameters in dependence on these first and second assessments.
The assessment of the quality of the candidate solution may comprise testing the behaviour of the model as configured in accordance with the values of the candidate solution when applied to a set of reference values. This may allow for a convenient way of assessing the quality of the candidate solution.
The assessment of the quality of the candidate solution may further comprise generating a set of adapted reference values comprising one or more items of adapted reference data in the vicinity of at least some of the set of reference values and testing the behaviour of the model as configured in accordance with the values of the candidate solution when applied to the set of adapted reference values. This may allow the quality of the candidate solution to be tested against dynamics that differ from the reference dynamics.
The one or more items of adapted reference data may be within a predetermined Wasserstein distance of the set of reference values. The quality of a policy can therefore be evaluated at the worst-case dynamic within the Wasserstein ball. By evaluating policies on their worst-case dynamic, robustness of the policy can be promoted. The Wasserstein metric may therefore assist in measuring how “wrong” a model or simulator is in a useful and intuitively reasonable way.
BRIEF DESCRIPTION OF THE FIGURES
The present invention will now be described by way of example with reference to the accompanying drawings. In the drawings:
Figure 1 illustrates iteratively updating the parameterised policy and dynamics.
Figure 2 shows a flowchart illustrating a method according to an embodiment of the present invention.
Figure 3 shows an example of a system for implementing the method illustrated in Figure 2.
Figure 4 shows a flowchart illustrating a method according to a further embodiment of the present invention.
Figure 5 shows a flowchart illustrating a method according to a subroutine performed as part of the method illustrated in Figure 4.
Figure 6 shows an example of a system for implementing the methods illustrated in Figures 4 and 5.
Figure 7 illustrates an example of a method for performing reinforcement learning to generate a solution set of values usable as parameters in a model.
DETAILED DESCRIPTION OF THE INVENTION
The present invention relates to a system and method for performing reinforcement learning to generate a solution set of values usable as parameters in a model that are robust to changes in dynamics.
In standard RL algorithms, the quality of a candidate solution for the parameters of the model is assessed by testing the behaviour of the model as configured in accordance with the values of the candidate solution when applied to a set of reference dynamics, p0, which are input to the system. In standard algorithms, a controlling policy is trained using these reference dynamics without consideration of other possible dynamics. In embodiments of the present invention, a set of possible dynamics is considered that is distributed about (for example, centered around) p0. These reference values represent a starting point from which to train the model.
In one embodiment, dynamics within a predetermined Wasserstein distance of the reference dynamics are considered during training of the controlling policy. The Wasserstein distance is used as a measure of divergence between dynamics and is defined for probability measures that are defined with respect to metric spaces. In particular, it is assumed that the state space X is a metric space with the metric denoted by d(·, ·). Letting M(S) denote the set of probability measures over a set S, the set of couplings on probability measures μ and ν is defined as:

    Π(μ, ν) = { κ ∈ M(S × S) : κ(A × S) = μ(A) and κ(S × B) = ν(B) for all measurable sets A, B }.

That is, it is the set of probability measures over the product space which marginalize to μ along one dimension and ν along the other. The p-Wasserstein distance is defined as:

    W_p(μ, ν) = ( inf_{κ ∈ Π(μ, ν)} ∫ d(x, y)^p dκ(x, y) )^(1/p).
In general, the Wasserstein distance has no closed-form solution but can be numerically estimated. However, the squared 2-Wasserstein between Gaussians has a closed form.
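For two multivariate Gaussians that closed form is W_2²(N(m1, S1), N(m2, S2)) = ||m1 − m2||² + Tr(S1 + S2 − 2(S2^(1/2) S1 S2^(1/2))^(1/2)). A minimal numerical sketch of this quantity is given below; the function name and the use of NumPy/SciPy are illustrative choices rather than part of the original disclosure.

    import numpy as np
    from scipy.linalg import sqrtm

    def w2_squared_gaussians(m1, S1, m2, S2):
        """Closed-form squared 2-Wasserstein distance between N(m1, S1) and N(m2, S2)."""
        mean_term = np.sum((m1 - m2) ** 2)                    # ||m1 - m2||^2
        root_S2 = sqrtm(S2)                                   # S2^{1/2}
        cross = sqrtm(root_S2 @ S1 @ root_S2)                 # (S2^{1/2} S1 S2^{1/2})^{1/2}
        cov_term = np.trace(S1 + S2 - 2.0 * np.real(cross))   # discard tiny imaginary parts from sqrtm
        return mean_term + cov_term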
The set of dynamics used consists of dynamics p whose distance from p0, when measured by a Wasserstein metric, is within some predefined limit ε. The Wasserstein metric may be any one of the Wasserstein metric class. The predefined limit may be referred to as the epsilon-Wasserstein ball around p0, defined as:

    B_ε(p0) := { p : W(p, p0) ≤ ε }.
The system is configured to learn the best controlling policy, where the quality of a policy π is the standard RL objective function, but evaluated at the worst-case dynamic (for π) within the Wasserstein ball. By being “pessimistic” about the possible dynamics that may be encountered (i.e. by evaluating policies on their worst-case dynamic), robustness of the policy can be promoted. The Wasserstein metric may therefore assist in measuring how “wrong” a model or simulator is in a useful and intuitively reasonable way.
The Wasserstein distance has the form (distance × probability mass). Therefore, if this product is constrained by some number, it implies that if the distance is large, the probability is small, and if the probability is large, the distance is small. This means that the model can be greatly wrong (large distance), but this is unlikely, or it can be very likely wrong (high probability), but then it cannot be too inaccurate. If the reference dynamic were frequently and highly inaccurate, it would be useless: training on it would be pointless and attempting to achieve robustness a futile endeavor.
Not all dynamics within a Wasserstein ball are plausible; some may violate Newton’s laws of motion, for example. In the present invention, dynamics may be perturbed but remain plausible. To account for this, the policy π is parameterised with a vector θ ∈ R^d1, and written as π_θ. The θ parameters are the parameters (weights) of a neural network and may be updated over time.
The dynamics are also parameterized with another vector ψ ∈ R^d2. Again, these parameters ψ may be updated over time. These parameters may be, for example, the parameters of a neural network, a simulator, or some other processor that implements the system dynamics (for example, a differential equation solver, or a real-life system). The parameter vector corresponding to the reference dynamic p0 is denoted ψ0.
The system may generate a set of adapted reference dynamics parameters comprising adapted reference data in the vicinity of ψ0, within a predetermined Wasserstein distance. The system then assesses the quality of the candidate solution policy parameters by testing the behaviour of the model as configured in accordance with the values of the candidate solution when applied to the adapted reference dynamics parameters.
The system makes a first assessment of the quality of the candidate solution by assessing the extent to which a model having the values of the candidate solution provides a high level of performance against a performance metric, i.e. the system assesses the overall performance of the policy to optimise the performance of the model. The system also makes a second assessment of the quality of the candidate solution by assessing the extent to which a model having the values of the candidate solution fails to provide a low level of performance against the performance metric, i.e. the system evaluates the model having those parameters at the worst-case dynamic and aims to minimise the occurrences of a low level of performance. The system iteratively forms further candidate solutions of the policy parameters in dependence on these first and second assessments.
The optimisation problem to be solved by the system may be expressed as:

    max_θ  min_{ψ : p_ψ ∈ B_ε(p0)}  J(θ, ψ),        (6)

where J(θ, ψ) = E_{(x,u) ∼ ρ_{θ,ψ}}[ r(x, u) ] and ρ_{θ,ψ} is an occupation measure over state-action pairs induced by the policy π_θ and the dynamics p_ψ. In the continuing RL setting this occupation measure is the stationary distribution of the Markov chain induced by the MDP with dynamics p_ψ and policy π_θ, and in the episodic setting it is a probability distribution that serves a similar purpose to the stationary distribution.
The system iteratively solves an approximation of the above optimisation problem, as represented by Figure 1. The policy parameters of the previous candidate solution are updated, shown at 101, in dependence on the updated dynamics parameters ψk, shown at 102, and the updated policy parameters θk, shown at 103, are used in the next iteration of the optimisation problem, shown at 104.
The inner optimisation problem is given by:

    min_ψ  g^T (ψ − ψ0)    subject to    ½ (ψ − ψ0)^T H0 (ψ − ψ0) ≤ ε,

where g denotes the gradient ∇_ψ J(θ, ψ) of the objective with respect to the dynamics parameters. The term H0 is an estimate of the Hessian of the function F evaluated at ψ0, where:

    F(ψ) = E_{(x,u)}[ W_2²( p_ψ(· | x, u), p_ψ0(· | x, u) ) ].        (10)

The solution to the above problem is given by:

    ψ* = ψ0 − sqrt( 2ε / (g^T H0^{-1} g) ) H0^{-1} g,

where:

    g = ∇_ψ J(θ, ψ), evaluated at the current dynamics parameters.
The above-defined optimisation problems can be solved to compute an update of the dynamics parameters ψ, under certain assumptions (for example, that H0 exists and is symmetric positive definite). The updated dynamics parameters are then used to update the policy parameters, which are then used in the subsequent iteration of the optimisation, as shown in Figure 1.
Therefore, the high-level strategy for solving the optimisation problem in Equation (6) is to have an outer loop which updates θ and an inner loop which solves the minimization problem for a given fixed θ.
The system therefore learns the best controlling policy, where the quality of a policy is the standard RL objective function, but the quality is evaluated at the worst-case dynamic within the Wasserstein ball. By being pessimistic about the possible dynamics that may be encountered by the model, robustness of the policy may be promoted.
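The outer/inner structure described above can be sketched as the following skeleton. It assumes hypothetical helper functions (estimate_grad_J_wrt_psi, estimate_hessian_H0 and policy_gradient_update) that are not named in the filing, and it mirrors the update rule given earlier under the assumption that H0 is symmetric positive definite.

    import numpy as np

    def robust_training_loop(theta0, psi0, epsilon, n_iters,
                             estimate_grad_J_wrt_psi,   # assumed helper: g = dJ(theta, psi)/dpsi
                             estimate_hessian_H0,       # assumed helper: Hessian of the Wasserstein term at psi0
                             policy_gradient_update):   # assumed helper: one policy-improvement step on theta
        """Alternate a pessimistic dynamics update (inner step) with a policy update (outer step)."""
        theta, psi = theta0, psi0
        H0 = estimate_hessian_H0(psi0)                       # estimated once, at the reference dynamics
        for k in range(n_iters):
            g = estimate_grad_J_wrt_psi(theta, psi)          # gradient of the return w.r.t. the current dynamics
            Hinv_g = np.linalg.solve(H0, g)                  # H0^{-1} g (conjugate gradient could be used instead)
            step = np.sqrt(2.0 * epsilon / (g @ Hinv_g))     # scale so the quadratic constraint holds with equality
            psi = psi0 - step * Hinv_g                       # move towards the worst-case dynamics in the ball
            theta = policy_gradient_update(theta, psi)       # improve the policy against those pessimistic dynamics
        return theta, psi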
Two exemplary embodiments of the present invention will now be described.
One generic approach using neural networks to define Gaussian distributions will first be described with reference to Figures 2 and 3. This approach allows for the efficient computation of the updated dynamics parameters ψk+1.

In this embodiment, ψk are the weights of a neural network f_ψk that outputs the mean and covariance of a Gaussian distribution, i.e.:

    f_ψk(x, u) = ( μ_ψk(x, u), Σ_ψk(x, u) ),

and the quantity H0^{-1} ∇_ψ J(θk, ψ) is efficiently computed using the conjugate gradient algorithm and automatic differentiation (for example, autograd, see: https://github.com/HIPS/autograd) applied to the following optimisation problem:

    min_v  ½ v^T H0 v − v^T ∇_ψ J(θk, ψ).
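A minimal sketch of that computation follows: the Hessian-vector product is obtained with autograd by differentiating the inner product of the gradient with a fixed vector, and the linear system is then solved with conjugate gradient so the full Hessian never has to be formed. The function names and the callback-style interface are assumptions of this sketch.

    import autograd.numpy as np
    from autograd import grad

    def hessian_vector_product(f, x, v):
        """Compute H(x) @ v for a scalar-valued function f, using two reverse-mode passes."""
        grad_f = grad(f)
        g_dot_v = lambda x_: np.dot(grad_f(x_), v)   # scalar: grad f(x)^T v
        return grad(g_dot_v)(x)                      # d/dx [grad f(x)^T v] = H(x) v

    def conjugate_gradient(hvp, b, iters=20, tol=1e-8):
        """Approximately solve H x = b, given only a callback computing H @ v."""
        x = np.zeros_like(b)
        r = b.copy()                                 # residual b - H x (x starts at zero)
        p = r.copy()
        rs_old = r @ r
        for _ in range(iters):
            Hp = hvp(p)
            alpha = rs_old / (p @ Hp)
            x = x + alpha * p
            r = r - alpha * Hp
            rs_new = r @ r
            if np.sqrt(rs_new) < tol:
                break
            p = r + (rs_new / rs_old) * p
            rs_old = rs_new
        return x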
At step 201 in the flowchart of Figure 2, the system 300 has access to parameters ψ0, which are the parameters of a neural network (NN) NN2, shown at 301 in the system diagram of Figure 3, that represent the reference dynamic p0. NN2 301 takes as input a state-action pair (x, u) and outputs a mean vector and covariance matrix μ_ψ0(x, u), Σ_ψ0(x, u). These are fed into a sampler 302 which samples a next state x' from a multivariate Gaussian distribution which has μ_ψ0 and Σ_ψ0 as mean and covariance, respectively. This is a standard approach to model dynamics in reinforcement learning.
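A sampler of this kind can be sketched as below, under the simplifying assumption of a small fully connected network with a diagonal covariance (the description above refers to a full covariance matrix); the parameter layout in psi is an assumption made for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_next_state(psi, x, u):
        """Sample x' from a Gaussian dynamics model parameterised by the weights in psi."""
        inp = np.concatenate([x, u])
        hidden = np.tanh(psi["W1"] @ inp + psi["b1"])          # single hidden layer, for illustration only
        mean = psi["W_mu"] @ hidden + psi["b_mu"]              # mean of the next-state distribution
        log_var = psi["W_var"] @ hidden + psi["b_var"]         # log-variance keeps the (diagonal) covariance positive
        std = np.exp(0.5 * log_var)
        return mean + std * rng.standard_normal(mean.shape)    # x' ~ N(mean, diag(std^2))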
The system also assumes it has access to a fixed policy π_θk, which is implemented as a neural network NN1, shown at 303 in Figure 3. NN1 takes as input a state x and gives some parameters to a sampler 304 that uses them to sample an action according to a probability distribution. This is a standard method of implementing a stochastic policy in reinforcement learning.
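For concreteness, such a stochastic policy could be implemented as in the sketch below, where the network outputs the mean of a diagonal Gaussian over actions together with a learned, state-independent log-standard-deviation; the Gaussian form and the parameter names are assumptions of this sketch rather than details taken from the filing.

    import numpy as np

    rng = np.random.default_rng(1)

    def sample_action(theta, x):
        """Sample an action u ~ pi_theta(. | x) from a Gaussian stochastic policy."""
        hidden = np.tanh(theta["W1"] @ x + theta["b1"])
        mean = theta["W_mu"] @ hidden + theta["b_mu"]          # mean action proposed by the network
        std = np.exp(theta["log_std"])                         # state-independent standard deviation
        return mean + std * rng.standard_normal(mean.shape)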
At step 202 in Figure 2, the system initializes the policy parameter vector arbitrarily as θ0, which is a Euclidean vector. The system initializes the Hessian matrix estimate H to the d × d zero matrix, where d is the dimension of the dynamics parameters ψk.
At step 203, the system then samples a batch B of trajectories using the latest parameters ψk and θk. This is done using NN3 and NN4, shown at 305 and 306 respectively in Figure 3, and their associated samplers 302 and 307. Since a trajectory is a sequence of (state, action, reward) tuples, each new state-action pair is fed as input to NN3 305 (which feeds into its sampler) to sample a new state, and each new state is fed to NN4 306 (which feeds into its sampler) to sample a new action. It is assumed that there is a mechanism to extract the rewards from the sampled trajectories; for example, the reward function may be a known function of state-action pairs, or a simulator may be applied.
The gradient is given by:

    g = ∇_ψ J(θk, ψ), evaluated at ψ = ψk,

which can be estimated with the aid of the formula:

    ∇_ψ J(θk, ψ) = E_τ [ ( Σ_t ∇_ψ log p_ψ(x_{t+1} | x_t, u_t) ) ( Σ_t γ^t r(x_t, u_t) ) ],        (15)

where τ denotes a trajectory. That is, an empirical estimate of the right-hand side of Equation (15) can be made by averaging the quantity in the square brackets using the batch B. This function is performed by the gradient estimator, 308 in Figure 3, at step 204 of Figure 2.
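A likelihood-ratio estimate of this kind can be computed directly with automatic differentiation, as in the sketch below. Here log_prob_fn is an assumed helper returning the summed log-probability that the parameterised distribution assigns to the sampled transitions of a trajectory, written with autograd-compatible operations, and the dictionary layout of each trajectory is likewise an assumption.

    import autograd.numpy as np
    from autograd import grad

    def score_function_gradient(log_prob_fn, params, trajectories, gamma=0.99):
        """Likelihood-ratio (score-function) gradient estimate from a batch of trajectories."""
        def surrogate(p):
            total = 0.0
            for traj in trajectories:
                ret = sum(gamma ** t * r for t, r in enumerate(traj["rewards"]))   # discounted return
                total = total + log_prob_fn(p, traj) * ret                         # log-probability x return
            return total / len(trajectories)
        return grad(surrogate)(params)   # gradient of the surrogate is the estimator above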
As can be seen from the flowchart of Figure 2, the next stage after initialization at step 205 is a loop comprising steps 206-210, the ultimate purpose of which is to estimate the inverse-Hessian-vector product H0^{-1} g. This is done by generating, and eventually averaging, estimates v_1, ..., v_n, where each v_i is an estimate of that quantity. In order to generate a v_i, the following is done. At step 206, a sample state-action pair (x, u) is drawn.
This sample is drawn through application of NN1 303 and its sampler 304, and NN3 305 and its sampler 302. The sample is input into NN3 305 to obtain the mean vector and covariance matrix μ_ψ(x, u), Σ_ψ(x, u) as the output of NN3. This is fed as the input to the Gaussian Wasserstein computation network (GWCN) 309. As shown in Figure 3, the GWCN also takes as input μ_ψ0(x, u), Σ_ψ0(x, u). At step 207, internally, it computes and outputs:

    ||μ_ψ(x, u) − μ_ψ0(x, u)||² + Tr( Σ_ψ(x, u) + Σ_ψ0(x, u) − 2( Σ_ψ0(x, u)^(1/2) Σ_ψ(x, u) Σ_ψ0(x, u)^(1/2) )^(1/2) ).

This is equal to W_2²( N(μ_ψ, Σ_ψ), N(μ_ψ0, Σ_ψ0) ), i.e., the square of the 2-Wasserstein distance between multivariate normal distributions. Since NN3 305 feeds directly into GWCN 309, at step 208 an automatic differentiation engine (for example, autograd: https://github.com/HIPS/autograd) is used to efficiently compute the inverse Hessian-vector product v_i.
The loop continues until the desired number of training scenarios are completed. Each v_i is given to the dynamics parameters update calculator (DPUC) 310, which computes the average of the v_i and uses it as an estimate of H0^{-1} g. This is used by the DPUC at step 211 to compute the updated dynamics parameters:

    ψk+1 = ψ0 − sqrt( 2ε / (g^T H0^{-1} g) ) H0^{-1} g.
The DPUC feeds into the policy parameters update engine (PPUE) 311 at step 212, which computes the new policy parameters θk+1 and feeds them into NN4 306. This completes the cycle. The new policy parameters θk+1 are then used in the next iteration of the optimisation problem and the above steps are repeated.
In another example, which will now be described with reference to Figures 4-6, ψk corresponds to the parameters of a simulator. The system 600 has access to a simulator, shown at 601 in Figure 6, that allows for the parameterization of the dynamics with a vector ψk ∈ R^d.
At step 401 in Figure 4, the system accesses parameters ψ0. ψ0 are the parameters of the simulator that represent the reference dynamic. θ0 is set to the zero vector and the Hessian estimate H is set to the zero matrix at step 402. The system enters a loop which ends with an estimate of H0. It does this through application of a method known in the literature as “evolutionary strategies”.
In this embodiment, H0 is estimated using an evolutionary-strategies (zero-order) formula based on random perturbations of the reference parameters ψ0, with F (see Equation (10)) being estimated using Wasserstein distance calculations for the empirical distributions formed from the points generated by the samples.

The dynamics parameters are d-dimensional vectors that parameterise the simulator. A random perturbation vector is sampled by the multivariate Gaussian sampler 603, and the system sets ψ to the reference parameters ψ0 perturbed by this vector and passes it to the simulator 601, as shown at step 404. At step 405, the system then enters sub-routine A, illustrated in the flowchart of Figure 5, which takes ψ and π_θ as its input at step 501; at step 502, ψ and π_θ are used to draw state-action samples.
The neural network NN1, shown at 604 in Figure 6, and its sampler 605 are used to sample actions from π_θ, and the simulator 601 parameterised with ψ is used to perform the sampling above. At steps 503 and 504 respectively, the above samples are fed into the simulator 601 to generate a number of next-state samples under the dynamics parameterised by ψ and under the reference dynamics parameterised by ψ0. These are held in storage in data sets, shown at steps 505 and 506 respectively. These data sets are fed into the Wasserstein computation engine (WCE) 606 in order to compute an empirical Wasserstein distance at step 507, which is considered to be an estimate of the Wasserstein distance between the dynamics p_ψ and the reference dynamics p_ψ0.
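One simple way of computing such an empirical Wasserstein distance between two equal-sized sample sets is via an optimal one-to-one matching, as sketched below; treating the two sample sets as uniformly weighted empirical distributions of the same size, and the function name itself, are assumptions of this sketch.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def empirical_wasserstein(samples_p, samples_q, p=2):
        """Empirical p-Wasserstein distance between two sample sets of equal size (n, d)."""
        diffs = samples_p[:, None, :] - samples_q[None, :, :]   # pairwise differences, shape (n, n, d)
        costs = np.linalg.norm(diffs, axis=-1) ** p             # d(x_i, y_j)^p
        rows, cols = linear_sum_assignment(costs)               # optimal matching (Hungarian algorithm)
        return costs[rows, cols].mean() ** (1.0 / p)            # ((1/n) * sum of matched costs)^(1/p)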
This cycle of steps 502-508 is repeated a number of times and an average of the estimates is taken at step 510. This is done by passing the estimates from the memory store 607 to the arithmetic mean calculator 608, illustrated by step 509. The result is referred to as h_i in Figure 4. The outer cycle (which calls subroutine A) is repeated for each sampled perturbation, and the average is computed and set as the estimate of H0.
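The zero-order estimation of H0 can be sketched with the standard Gaussian-smoothing identity H0 ≈ E[ F(ψ0 + σ ε)(ε ε^T − I) ] / σ², which holds to leading order for smooth F; this particular identity, the smoothing scale σ and the sample count are assumptions of this sketch and not necessarily the exact formula used in the filing.

    import numpy as np

    def es_hessian_estimate(F, psi0, sigma=0.05, n_samples=128, rng=None):
        """Evolutionary-strategies style (zero-order) estimate of the Hessian of F at psi0."""
        rng = np.random.default_rng() if rng is None else rng
        d = psi0.shape[0]
        H = np.zeros((d, d))
        for _ in range(n_samples):
            eps = rng.standard_normal(d)
            f_val = F(psi0 + sigma * eps)                    # e.g. an empirical Wasserstein distance estimate
            H += f_val * (np.outer(eps, eps) - np.eye(d))
        H /= n_samples * sigma ** 2
        return 0.5 * (H + H.T)                               # symmetrise the finite-sample estimate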
As can be seen from Figure 4, the system then enters a cycle with an entry point at step 409, which samples a batch B of trajectories using θk and γk. This is done using the simulator 601 along with NN2 602 and its sampler 610. At step 410, B is used by the system to estimate [equation not reproduced] using the formula [equation not reproduced].
This function is performed by the gradient estimator 609 in Figure 6.
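Because the exact estimator is given by the formula above, which is not reproduced here, the sketch below only illustrates the general shape of such a batch estimate: a likelihood-ratio (score-function) form in which each trajectory's log-probability gradient is weighted by its discounted return and the results are averaged over the batch B. The helper grad_log_prob is hypothetical.

# Generic likelihood-ratio gradient estimate over a batch of trajectories;
# each trajectory is a list of (state, action, reward, next_state) tuples
# sampled with the current policy and dynamics parameters.
import numpy as np

def batch_gradient_estimate(batch, grad_log_prob, discount=0.99):
    """grad_log_prob(s, a, s_next): gradient, with respect to the parameters
    of interest, of the log-probability of the sampled transition/action."""
    grads = []
    for traj in batch:
        ret = sum(discount**t * r for t, (_, _, r, _) in enumerate(traj))
        score = sum(grad_log_prob(s, a, s_next) for (s, a, _, s_next) in traj)
        grads.append(ret * score)
    return np.mean(grads, axis=0)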
At step 411, the system then computes an estimate of [equation not reproduced] by applying the conjugate gradient algorithm in conjugate gradient estimator 611 to the optimisation problem [equation not reproduced].
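One plausible reading of this step, consistent with trust-region-style methods, is that the constrained update reduces to solving a linear system in the estimated Hessian, which the conjugate gradient method handles without inverting H0 explicitly. The sketch below is an assumption along those lines, using an off-the-shelf solver and a small damping term, rather than the exact routine of conjugate gradient estimator 611.

# Sketch: given an explicit estimate H0 (d x d) and a gradient estimate g,
# obtain a search direction by solving (H0 + damping*I) x = g with
# conjugate gradient.
import numpy as np
from scipy.sparse.linalg import cg

def solve_step(H0, g, damping=1e-4):
    d = H0.shape[0]
    direction, info = cg(H0 + damping * np.eye(d), g)
    if info != 0:
        raise RuntimeError("conjugate gradient did not converge")
    return direction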
At step 412, the result is passed into the new dynamics parameters calculation engine 612, which performs the computation [equation not reproduced]. This determines the new dynamics parameters [equation not reproduced]. Subsequently, the new parameters are passed into the policy parameters update engine 613, which computes the new policy parameters [equation not reproduced]
at step 413. The updating of the policy parameters may be performed by algorithms such as PPO (as described in Schulman, John, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017)), TRPO (as described in Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P. (2015, June). Trust region policy optimization. In International Conference on Machine Learning (pp. 1889-1897)) or “vanilla” policy gradient updates.
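As an illustration of the first of the cited options, the clipped surrogate objective at the heart of PPO can be written as below. Real PPO implementations add minibatching, a value-function loss and an entropy bonus, all of which are omitted from this sketch.

# PPO-style clipped surrogate objective; all arguments are arrays over the
# sampled (state, action) pairs, and the objective is maximised with respect
# to the new policy parameters.
import numpy as np

def ppo_clip_objective(log_prob_new, log_prob_old, advantages, clip_eps=0.2):
    ratio = np.exp(log_prob_new - log_prob_old)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Element-wise minimum of the unclipped and clipped terms, averaged.
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))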
The new policy parameters θk+1 are then used in the next iteration of the optimisation and the above steps are repeated.
Using the parameters from a simulator has an advantage over the generic method described previously because, generally, simulators allow their parameters to be varied in ways which are consistent with some set of rules. For example, a simulator for a physical system such as a robot or car will allow variations of quantities such as friction, mass, length etc., but the system is expected to obey Newton's laws. Alternatively, the parameters used in this embodiment may come from a differential equation solver that allows for the parameterization of the dynamics with a vector γk ∈ Rd.
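A hypothetical illustration of such a parameterisation is given below; the class name and the particular physical quantities are invented for the example and do not come from this disclosure. The point is only that the dynamics are exposed as a vector in Rd while the simulator itself continues to enforce physically consistent behaviour.

# Hypothetical simulator dynamics exposed as a parameter vector.
from dataclasses import dataclass
import numpy as np

@dataclass
class CartDynamics:
    friction: float = 0.05
    cart_mass: float = 1.0
    pole_mass: float = 0.1
    pole_length: float = 0.5

    def to_vector(self):
        return np.array([self.friction, self.cart_mass,
                         self.pole_mass, self.pole_length])

    @classmethod
    def from_vector(cls, gamma):
        return cls(*gamma)

Perturbing the vector and rebuilding the dynamics object then corresponds to exploring nearby dynamics while the laws of motion remain built into the simulator's step function.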
Figure 7 summarises a method for performing reinforcement learning to generate a solution set of values usable as parameters in a model so as to cause the model to provide a level of performance against a performance metric. At step 701, the method comprises forming a candidate solution comprising a candidate set of parameter values. The method then comprises repeatedly performing the following steps 702-704. At step 702, a first assessment is made of the quality of the candidate solution by assessing the extent to which a model having the values of the candidate solution provides a high level of performance against the performance metric. At step 703, a second assessment is made of the quality of the candidate solution by assessing the extent to which a model having the values of the candidate solution fails to provide a low level of performance against the performance metric. At step 704, a further candidate solution is formed in dependence on the first and second assessments.
In the approaches described above, a model is formed for analysing data and providing indications of one or more properties of that input data. The model is generalised, and operates in dependence on values that control the performance of the model. For example, the model could be a neural network and the values could be weights applied to the network. The system described above selects the values by training against reference or training data. The training data comprises a set of possible input data to the model and, for each one, a corresponding expected output from the model. To select the values, the system operates a first loop and a second loop. The second loop runs inside the first loop. In the first loop, a candidate set of values is formed by selecting for high performance of the model having those values against the training data. Put another way, the candidate set of values are selected or assessed in dependence on a determination of whether there is relatively high conformity between the outputs of the model configured with those values taking the training data as input and the expected outputs for that training data. In the inner loop the candidate set of values is tested for low performance of the model having those values against the training data. Put another way, the candidate set of values are selected or assessed in dependence on a determination of whether there is relatively low conformity between the outputs of the model configured with those values taking the training data as input and the expected outputs for that training data. This process is repeated numerous times, with the candidate set of values for each iteration being selected such that it has been determined to exhibit a relatively (e.g. relative to a previous candidate set of data) high propensity for good performance and a relatively low propensity for poor performance.
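The two-loop structure just described can be summarised in the following schematic sketch; the three callback functions are placeholders for the machinery described earlier (the performance assessment, the worst-case assessment over nearby dynamics, and the parameter update), not a complete implementation.

# Schematic outline of the outer/inner loop structure.
def robust_training_loop(initial_values, assess_high, assess_low,
                         update_values, n_iterations=100):
    """initial_values: candidate set of parameter values for the model.
    assess_high(values): first assessment - how well the model performs.
    assess_low(values): second assessment - how badly it can be made to
    perform, e.g. under worst-case dynamics near the reference.
    update_values(values, high, low): forms the next candidate solution."""
    candidate = initial_values
    for _ in range(n_iterations):
        high_score = assess_high(candidate)   # outer-loop assessment
        low_score = assess_low(candidate)     # inner-loop assessment
        candidate = update_values(candidate, high_score, low_score)
    return candidate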
The approach described herein helps to solve the problem of brittleness in reinforcement learning, and in particular can provide a system and method for learning to control a system in a way that is robust to variations in dynamics. Therefore, the present invention may allow a user to train on a simulator and deploy in the real world with good performance.
Embodiments of the present invention may provide advantages over previous approaches. The present approach performs better in experiments and can be used for continuous state and action spaces. In particular, the approach described in 'Non-Stationary Markov Decision Processes, a Worst-Case Approach using Model-Based Reinforcement Learning', Lecarpentier and Rachelson, arXiv:1904.10090, 2019, is only applicable to scenarios with finite discrete state and action spaces. The described approach operates with continuous state and action spaces, which is an essential requirement for practical applicability in real-world settings.
One particularly advantageous situation to which the present invention may be applied is a self-driving car. The dynamics of the car will vary due to a multitude of factors, from variations in road surfaces, road inclines, tire pressure and frictional forces to variations due to the weight carried. It is clear in this example that there is no single set of dynamics that the car will experience throughout its lifetime, or even over a short period of time. The present algorithm is more robust to such variations and can cope with novel dynamics without having to learn to handle them in the deployment environment. Therefore, the solution disclosed herein may learn a controlling policy that is robust to variations in dynamics between the environment it was trained on and the environment it is deployed in.
In the above description, dynamics within a predetermined Wasserstein distance of the reference dynamics are considered during training of the controlling policy. However, other metrics may also be used as the distance function.
The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in the light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problems disclosed herein, and without limitation to the scope of the claims. The applicant indicates that aspects of the present invention may consist of any such individual feature or combination of features. In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the invention.

Claims

1. A system for performing reinforcement learning to generate a solution set of values usable as parameters in a model so as to cause the model to provide a level of performance against a performance metric, the system being configured to, having formed a candidate solution comprising a candidate set of parameter values, repeatedly perform the steps of:
making a first assessment of the quality of the candidate solution by assessing the extent to which a model having the values of the candidate solution provides a high level of performance against the performance metric;
making a second assessment of the quality of the candidate solution by assessing the extent to which a model having the values of the candidate solution fails to provide a low level of performance against the performance metric; and
forming a further candidate solution in dependence on the first and second assessments.
2. The system as claimed in claim 1, wherein the system is configured to assess the quality of the candidate solution by testing the behaviour of the model as configured in accordance with the values of the candidate solution when applied to a set of reference values.
3. The system as claimed in claim 2, wherein the system is further configured to assess the quality of the candidate solution by generating a set of adapted reference values comprising one or more items of adapted reference data in the vicinity of at least some of the set of reference values and testing the behaviour of the model as configured in accordance with the values of the candidate solution when applied to the set of adapted reference values.
4. The system as claimed in claim 3, wherein the one or more items of adapted reference data are within a predetermined Wasserstein distance of the set of reference values.
5. The system as claimed in claim 3 or claim 4, wherein the set of adapted reference values represents the worst-case values of the set of reference values.
6. The system as claimed in any of claims 2 to 5, wherein the set of reference values comprises the parameters of a neural network.
7. The system as claimed in any of claims 2 to 5, wherein the set of reference values comprises values output from a simulator or differential equation solver.
8. The system as claimed in any of claims 2 to 7, wherein the set of reference values comprises a set of reference dynamics.
9. The system as claimed in any one of the preceding claims, wherein the system is configured to perform an optimisation comprising the first and second assessments of the quality of the candidate solution.
10. The system as claimed in any one of the preceding claims, wherein the model is a trained artificial intelligence model.
11. The system as claimed in claim 10, wherein the model is a neural network.
12. A method for performing reinforcement learning to generate a solution set of values usable as parameters in a model so as to cause the model to provide a level of performance against a performance metric, the method comprising:
forming a candidate solution comprising a candidate set of parameter values; and repeatedly performing the steps of:
making a first assessment of the quality of the candidate solution by assessing the extent to which a model having the values of the candidate solution provides a high level of performance against the performance metric;
making a second assessment of the quality of the candidate solution by assessing the extent to which a model having the values of the candidate solution fails to provide a low level of performance against the performance metric; and forming a further candidate solution in dependence on the first and second assessments.
13. The method as claimed in claim 12, wherein the assessment of the quality of the candidate solution comprises testing the behaviour of the model as configured in accordance with the values of the candidate solution when applied to a set of reference values.
14. The method as claimed in claim 13, wherein the assessment of the quality of the candidate solution further comprises generating a set of adapted reference values comprising one or more items of adapted reference data in the vicinity of at least some of the set of reference values and testing the behaviour of the model as configured in accordance with the values of the candidate solution when applied to the set of adapted reference values.
15. The method as claimed in claim 14, wherein the one or more items of adapted reference data are within a predetermined Wasserstein distance of the set of reference values.

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201980098404.XA CN114270375A (en) 2019-07-16 2019-07-16 Learning to robustly control a system
PCT/EP2019/069101 WO2021008691A1 (en) 2019-07-16 2019-07-16 Learning to robustly control a system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2019/069101 WO2021008691A1 (en) 2019-07-16 2019-07-16 Learning to robustly control a system

Publications (1)

Publication Number Publication Date
WO2021008691A1 true WO2021008691A1 (en) 2021-01-21

Family

ID=67314771

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2019/069101 WO2021008691A1 (en) 2019-07-16 2019-07-16 Learning to robustly control a system

Country Status (2)

Country Link
CN (1) CN114270375A (en)
WO (1) WO2021008691A1 (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3465236B2 (en) 2000-12-20 2003-11-10 科学技術振興事業団 Robust reinforcement learning method
US6665651B2 (en) 2001-07-18 2003-12-16 Colorado State University Research Foundation Control system and technique employing reinforcement learning having stability and learning phases
CN107856035A (en) 2017-11-06 2018-03-30 深圳市唯特视科技有限公司 A kind of robustness dynamic motion method based on intensified learning and whole body controller

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
Smirnova, Elena et al.: "Distributionally Robust Reinforcement Learning", arXiv.org, Cornell University Library, 23 February 2019 (2019-02-23), XP081032496 *
Lecarpentier, Erwan et al.: "Non-Stationary Markov Decision Processes, a Worst-Case Approach using Model-Based Reinforcement Learning, Extended version", 24 May 2019 (2019-05-24), XP055678929, Retrieved from the Internet <URL:https://arxiv.org/pdf/1904.10090v2.pdf> [retrieved on 20200323] *
Lecarpentier and Rachelson: "Non-Stationary Markov Decision Processes, a Worst-Case Approach using Model-Based Reinforcement Learning", arXiv:1904.10090, 2019
Pinto et al.: "Robust Adversarial Reinforcement Learning", ICML, 2017
Schulman, J., Levine, S., Abbeel, P., Jordan, M. and Moritz, P.: "Trust Region Policy Optimization", International Conference on Machine Learning, June 2015 (2015-06-01), pages 1889-1897
Schulman, John, Filip Wolski, Prafulla Dhariwal, Alec Radford and Oleg Klimov: "Proximal Policy Optimization Algorithms", arXiv preprint arXiv:1707.06347, 2017
Tessler et al.: "Action Robust Reinforcement Learning and Applications in Continuous Control", ICML, 2019

Also Published As

Publication number Publication date
CN114270375A (en) 2022-04-01

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19740558

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19740558

Country of ref document: EP

Kind code of ref document: A1