CN114270375A - Learning to robustly control a system - Google Patents

Learning to robustly control a system

Info

Publication number
CN114270375A
CN114270375A (application CN201980098404.XA)
Authority
CN
China
Prior art keywords
candidate solution
model
values
evaluating
performance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201980098404.XA
Other languages
Chinese (zh)
Inventor
Mohammed Abdullah
Haitham Bou Ammar
Hang Ren
Mingtian Zhang
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Publication of CN114270375A publication Critical patent/CN114270375A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/004 Artificial life, i.e. computing arrangements simulating life
    • G06N 3/006 Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/047 Probabilistic or stochastic networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A system for performing reinforcement learning to generate a solution set of values that can be used as parameters in a model to cause the model to provide a level of performance with respect to a performance metric, the system being configured to: form a candidate solution comprising a set of candidate parameter values; and repeatedly perform the following steps: first evaluating the quality of the candidate solution by evaluating the degree to which a model having the values of the candidate solution provides a high level of performance for the performance metric; second evaluating the quality of the candidate solution by evaluating the degree to which a model having the values of the candidate solution fails to provide a low level of performance for the performance metric; and forming another candidate solution based on the first evaluation and the second evaluation.

Description

Learning to robustly control a system
Technical Field
The present invention relates to avoiding fragility in reinforcement learning, and in particular to learning to control a system in a manner that is robust to changes in dynamics.
Background
In Reinforcement Learning (RL), a system is modeled as a Markov Decision Process (MDP). This is defined as the tuple ⟨X, U, p, r, γ⟩, where X is the state space, U is the action space, p(·|x, u) is the probability distribution over the next state for each state-action pair (x, u), r(x, u) is the reward (a positive or negative real number), and γ is the discount factor. The probability distribution p is called the dynamics.
The controller of the MDP is called a policy. It is typically implemented as a probability distribution over actions given the current state x, denoted π(·|x). An MDP equipped with an initial state distribution and a policy yields a Markov reward process (MRP). This induces a probability distribution over trajectories (a trajectory is a sequence of states, actions and rewards).
The standard goal in RL is to optimize the expected return, i.e. the total discounted reward:
J(π) = E_{x_0 ~ μ, u_t ~ π(·|x_t), x_{t+1} ~ p(·|x_t, u_t)} [ Σ_{t=0}^∞ γ^t r(x_t, u_t) ]
where μ is the initial state distribution.
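For concreteness, the expected return can be estimated by Monte Carlo rollouts. The following sketch is illustrative only and is not taken from the patent; the tabular MDP representation and the truncated horizon are assumptions made for the example.

```python
import numpy as np

def rollout_return(p, r, pi, mu, gamma=0.99, horizon=200, rng=None):
    """Sample one trajectory from the MDP and return its discounted return.

    p[x, u] : next-state distribution for state x and action u, shape (X, U, X)
    r[x, u] : reward for taking action u in state x, shape (X, U)
    pi[x]   : action distribution of the policy in state x, shape (X, U)
    mu      : initial state distribution, shape (X,)
    """
    rng = rng or np.random.default_rng()
    x = rng.choice(len(mu), p=mu)
    ret, discount = 0.0, 1.0
    for _ in range(horizon):          # truncation approximates the infinite sum
        u = rng.choice(pi.shape[1], p=pi[x])
        ret += discount * r[x, u]
        discount *= gamma
        x = rng.choice(p.shape[2], p=p[x, u])
    return ret

def expected_return(p, r, pi, mu, n_rollouts=1000, **kwargs):
    """Monte Carlo estimate of J(pi) = E[ sum_t gamma^t r(x_t, u_t) ]."""
    return float(np.mean([rollout_return(p, r, pi, mu, **kwargs)
                          for _ in range(n_rollouts)]))
```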
the dynamic p is assumed to be fixed, i.e. if the same action is taken in a given state, the distribution in the following state is the same as if the action was taken in that state at other times. The dynamics are independent of any control strategy applied to the MDP.
This aspect of the fixed dynamics of the MDP is a basic assumption in the standard RL algorithm. However, several problems arise from this.
If a policy is trained on one MDP and then deployed on an MDP with different dynamics, the policy typically underperforms, i.e. policies tend to be fragile with respect to changes in dynamics. For example, if an RL agent is trained using a simulator (e.g. a simulator of a car or a robot) and then deployed in a real physical system, the simulator will be imperfect and its simulated dynamics will not be exactly the same as the real-world dynamics.
Another problem is that changes in dynamics occurring at different times in the real world will affect the results. For example, the driver of a car may experience different dynamics due to, for example, the road surface, differences in the load carried, or tire pressure. Any machine may experience differences in friction due to changes in temperature or lubrication. Any algorithm that produces control policies which break down under such changes in dynamics is clearly impractical. One of the reasons why RL has so far not been particularly successful outside the laboratory, or outside controlled environments such as games, is this lack of robustness.
Previous approaches have attempted to develop policies that are more robust to such changes in dynamics. For example, in Tessler et al., "Action Robust Reinforcement Learning and Applications in Continuous Control" (ICML 2019), the problem is framed as a zero-sum game. The policy is made robust according to one of the following action criteria: (i) taking a different, possibly adversarial, action with fixed probability, or (ii) adding perturbations to the actions themselves. However, while the algorithm performs well in some MuJoCo tasks, it performs poorly in others (such as swing-up).
Another approach, described by Pinto et al. in "Robust Adversarial Reinforcement Learning" (ICML 2017), also frames the problem as a zero-sum game, with robustness learned iteratively by alternating updates of the adversary and protagonist policies. "Non-Stationary Markov Decision Processes: a Worst-Case Approach using Model-Based Reinforcement Learning" by Lecarpentier and Rachelson (arXiv:1904.10090, 2019) models dynamics that change over time, limited by a Wasserstein distance per unit time. A tree-search algorithm is used to solve for the worst case, in which the environment is adversarial. Experiments were conducted in grid worlds, i.e. on small-scale examples, and the method does not appear to scale to continuous state and action spaces.
Other methods are described in the following documents: JP 3465236 B2, based on the H-infinity technique from classical control theory; CN 107856035 A, which is specific to a particular class of problems rather than a general RL algorithm; and US 6665651 B2, which relies on a controller to train the neural network, with an emphasis on the stability of the learning process.
It is desirable to learn to control a system in a manner that improves robustness to changes in dynamics.
Disclosure of Invention
There is provided a system for performing reinforcement learning to generate a solution set of values that can be used as parameters in a model to cause the model to provide a level of performance with respect to a performance metric, the system being configured to: form a candidate solution comprising a set of candidate parameter values; and repeatedly perform the following steps: first evaluating the quality of the candidate solution by evaluating the degree to which a model having the values of the candidate solution provides a high level of performance for the performance metric; second evaluating the quality of the candidate solution by evaluating the degree to which a model having the values of the candidate solution fails to provide a low level of performance for the performance metric; and forming another candidate solution based on the first evaluation and the second evaluation.
Thus, the system can evaluate the overall performance of the policy, to optimize the performance of the model, and can evaluate the model with these parameters under worst-case dynamics, to minimize the occurrence of low-level performance. The system may then iteratively form another candidate solution for the policy parameters based on these first and second evaluations.
The system may be used to evaluate the quality of the candidate solution by testing the behavior of the model configured from the values of the candidate solution when applied to a set of reference values. This may enable a convenient way of assessing the quality of the candidate solution.
The system may be further configured to evaluate the quality of the candidate solution by: generating a set of adapted reference values comprising one or more adapted reference data items in the vicinity of at least some of the reference values in the set of reference values; testing the behavior of the model configured according to the values of the candidate solution when applied to the set of adapted reference values. This may enable testing the quality of the candidate solution for dynamics different from the reference dynamics.
The one or more adapted reference data items may be within a predetermined Wasserstein distance of the reference value set. The adapted set of reference values may represent worst case values of the set of reference values. Thus, the quality of the strategy can be evaluated in the worst case scenario within the Wasserstein ball. By evaluating the worst case dynamics of the strategy, the robustness of the strategy can be improved. Therefore, the Wasserstein metric may help measure the degree of "error" of a model or simulator in a useful and intuitively reasonable manner.
The set of reference values may comprise parameters of a neural network. This approach may enable efficient computation of the updated reference value and the further candidate solution.
The set of reference values may include values output from a simulator or a differential equation solver. Such an approach may enable efficient computation of updated reference values and of a further candidate solution, and may enable the reference values to vary in a manner consistent with a certain set of rules. For example, a simulator for a physical system such as a robot or a car would allow quantities such as friction, mass and length to vary, but the system should still obey Newton's laws.
The set of reference values may comprise a set of reference dynamics. This may enable the system to be applied to real-world dynamic situations.
The system may be configured to perform an optimization including the first and second evaluations of the quality of the candidate solution.
The model may be a trained artificial intelligence model. The model may be a neural network.
According to a second aspect, there is provided a method of performing reinforcement learning to generate a solution set of values that can be used as parameters in a model to cause the model to provide a level of performance with respect to a performance metric, the method comprising: forming a candidate solution comprising a set of candidate parameter values; repeatedly executing the following steps: first evaluating the quality of the candidate solution by evaluating a degree to which a model having the value of the candidate solution provides a high level of performance for the performance metric; second evaluating the quality of the candidate solution by evaluating a degree to which a model having the value of the candidate solution fails to provide a low level of performance for the performance metric; forming another candidate solution based on the first evaluation and the second evaluation.
Thus, the method can evaluate the overall performance of the policy, to optimize the performance of the model, and can evaluate the model with these parameters under worst-case dynamics, to minimize the occurrence of low-level performance. The method may then iteratively form another candidate solution for the policy parameters based on these first and second evaluations.
The evaluating the quality of the candidate solution may include testing a behavior of the model configured according to the values of the candidate solution when applied to a set of reference values. This may enable a convenient way of assessing the quality of the candidate solution.
The evaluating the quality of the candidate solution may further comprise: generating a set of adapted reference values comprising one or more adapted reference data items in the vicinity of at least some of the reference values in the set of reference values; testing the behavior of the model configured according to the values of the candidate solution when applied to the set of adapted reference values. This may enable testing the quality of the candidate solution for dynamics different from the reference dynamics.
The one or more adapted reference data items may be within a predetermined Wasserstein distance of the reference value set. Thus, the quality of the strategy can be evaluated in the worst case scenario within the Wasserstein ball. By evaluating the worst case dynamics of the strategy, the robustness of the strategy can be improved. Therefore, the Wasserstein metric may help measure the degree of "error" of a model or simulator in a useful and intuitively reasonable manner.
Drawings
The invention will now be described by way of example with reference to the accompanying drawings.
In the drawings:
FIG. 1 illustrates iteratively updating parameterized policies and dynamics;
FIG. 2 shows a flow chart of a method provided by an embodiment of the invention;
FIG. 3 illustrates an example of a system for implementing the method illustrated in FIG. 2;
FIG. 4 shows a flow chart of a method provided by yet another embodiment of the present invention;
FIG. 5 shows a flow diagram of a method provided by a subroutine executed as part of the method shown in FIG. 4;
FIG. 6 illustrates an example of a system for implementing the methods shown in FIGS. 4 and 5;
FIG. 7 illustrates an example of a method of performing reinforcement learning to generate a solution set that can be used as values for parameters in a model.
Detailed Description
The present invention relates to systems and methods for performing reinforcement learning to generate a solution set of values that can be used as parameters in a model, in a manner that is robust to variations in dynamics.
In the standard RL algorithm, the quality of a candidate solution for the model parameters is evaluated by testing the behavior of the model, configured according to the values of the candidate solution, when applied to a set of reference dynamics p_0 (the input to the system). In the standard algorithm these reference dynamics are used to train the control policy, without regard to other possible dynamics. In an embodiment of the present invention, a set of possible dynamics distributed around (e.g. centred on) p_0 is considered. These reference values represent the starting point for training the model.
In one embodiment, dynamics within a predetermined Wasserstein distance of the reference dynamics are considered during training of the control policy. The Wasserstein distance is used as a measure of divergence between dynamics, and is defined for probability measures over a metric space. Specifically, assume that the state space X is a metric space with metric d(·,·). Let M(S) denote the set of probability measures over a set S. The set of couplings of the probability measures μ and ν is defined as:
C(μ, ν) = { κ ∈ M(X × X) : κ(A × X) = μ(A), κ(X × B) = ν(B) for all measurable A, B ⊆ X }
That is, it is the set of probability measures on the product space whose marginal along one dimension is μ and along the other dimension is ν. The p-Wasserstein distance is defined as:
W_p(μ, ν) = ( inf_{κ ∈ C(μ, ν)} ∫_{X × X} d(x, y)^p dκ(x, y) )^{1/p}
in general, the Wasserstein distance has no closed form solution, but can be numerically estimated. However, the square 2-Wasserstein between Gauss has a closed form.
The set of dynamics considered consists of those dynamics p that, when measured by the Wasserstein metric, are within some predefined limit of p_0. The Wasserstein metric may be any member of the class of Wasserstein metrics. The predefined limit may be referred to as the ε-Wasserstein ball around p_0, defined as
B_ε(p_0) = { p : W(p, p_0) ≤ ε }
where W denotes the chosen Wasserstein metric.
The system is used to learn the best control policy, where the quality of a policy π is the standard RL objective function, but evaluated at the worst-case dynamics (for π) within the Wasserstein ball. By remaining "pessimistic" about the dynamics that may be encountered (i.e. by evaluating the policy under worst-case dynamics), the robustness of the policy can be increased. The Wasserstein metric therefore helps measure the degree to which a model or simulator is "wrong" in a useful and intuitively reasonable manner.
The Wasserstein distance has the form (distance × probability mass). Thus, if this product is constrained to be below a certain value, then where the distance is large the probability must be small, and where the probability is large the distance must be small. That is, the model may be severely wrong (large distance), but only with low probability; or it may be wrong with high probability, but then it cannot be too inaccurate. If the reference dynamics were frequently and severely inaccurate, they would be useless, training on them would be meaningless, and attempting to achieve robustness would ultimately be futile.
Not all dynamics in the Wasserstein ball are plausible; for example, some dynamics may violate Newton's laws of motion. In the present invention, the dynamics may be perturbed but remain plausible. With this in mind, the policy π is parameterized using a vector θ and written as π_θ. The parameters θ are the parameters (weights) of a neural network, and may be updated over time.
The dynamics are likewise parameterized using another vector ψ. Again, these parameters ψ may be updated over time. They may be, for example, the parameters of a neural network, of a simulator, or of another processor that implements the system dynamics (e.g. a differential equation solver or a real system). The parameter vector corresponding to the reference dynamics p_0 is denoted ψ_0.
The system may generate an adapted set of reference dynamics parameters comprising adapted reference data within a predetermined Wasserstein distance of ψ_0. The system then evaluates the quality of the candidate solution's policy parameters by testing the behavior of the model, configured according to the values of the candidate solution, when applied to the adapted reference dynamics parameters.
The system performs a first evaluation of the quality of the candidate solution by evaluating the extent to which the model with the values of the candidate solution provides a high level of performance with respect to the performance metric, i.e. the system evaluates the overall performance of the policy in order to optimize the performance of the model. The system also performs a second evaluation of the quality of the candidate solution by evaluating the extent to which the model with the values of the candidate solution fails to provide a low level of performance with respect to the performance metric, i.e. the system evaluates the model with these parameters under worst-case dynamics, aiming to minimize the occurrence of low-level performance. The system iteratively forms another candidate solution for the policy parameters based on the first and second evaluations.
The optimization problem to be solved by the system can be expressed as:
max_θ min_{ψ : p_ψ ∈ B_ε(p_{ψ_0})} J(θ, ψ)    (6)
where, in the continuing RL setting,
J(θ, ψ) = E_{(x, u) ~ d_{θ, ψ}} [ r(x, u) ]
and in the episodic RL setting J(θ, ψ) is defined analogously in terms of the episodic visitation distribution. Here the d_{θ, ψ} are "occupancy measures": in the continuing case, d_{θ, ψ} is the stationary distribution of the Markov chain induced by the MDP with dynamics ψ and policy π_θ; in the episodic case, it is a probability distribution that serves a purpose similar to that of the stationary distribution.
The system iteratively solves an approximation of the above optimization problem, as shown in FIG. 1. The updated dynamics parameters ψ_{k+1} (indicated at 101, 102) are used to update the policy parameters θ_{k-1} of the previous candidate solution and, as shown at 104, the updated policy parameters θ_k (shown at 103) are used in the next iteration of the optimization problem.
The inner optimization problem is given by the following formula:
min_ψ J(θ_k, ψ)  subject to  (1/2) (ψ - ψ_0)^T H_0 (ψ - ψ_0) ≤ ε
which approximates the Wasserstein constraint by a second-order expansion around ψ_0. The term H_0 is an estimate of the Hessian of the function F evaluated at ψ_0, where
F(ψ) = W_2^2(p_ψ, p_{ψ_0})    (10)
The solution to the above problem is given by the following equation:
ψ_{k+1} = ψ_0 - sqrt( 2ε / (g_k^T H_0^{-1} g_k) ) H_0^{-1} g_k
where
g_k = ∇_ψ J(θ_k, ψ) |_{ψ = ψ_k}
The optimization problem defined above can therefore, under certain assumptions (e.g. that H_0^{-1} exists and is symmetric positive definite), be used to compute an update of the dynamics parameters ψ.
The updated dynamic parameters are then used to update the policy parameters, which are then used in subsequent iterations of the optimization, as shown in FIG. 1.
Thus, a high-level strategy for solving the optimization problem in equation (6) has an outer loop that updates θ and an inner loop that solves the minimization problem for a given fixed θ.
Thus, the system learns the optimal control policy, where the quality of a policy is the standard RL objective function, but evaluated under the worst-case dynamics within the Wasserstein ball. By being pessimistic about the dynamics that the model may encounter, the robustness of the policy can be increased.
Two exemplary embodiments of the present invention will now be described.
A general method in which a Gaussian distribution is defined using a neural network will first be described with reference to FIGS. 2 and 3. The method enables efficient computation of the updated dynamics parameters ψ_{k+1}.
In this embodiment, ψ_k parameterizes a neural network which outputs the mean and covariance of a Gaussian, i.e. μ_{ψ_k}(x, u) and Σ_{ψ_k}(x, u).
The required quantities, in particular inverse-Hessian-vector products of the form H_0^{-1} v, are computed efficiently using a conjugate gradient algorithm and automatic differentiation (e.g. autograd, see https://github.com/HIPS/autograd) applied to the following optimization problem, whose unique minimizer is H_0^{-1} v:
min_z ( (1/2) z^T H_0 z - z^T v )
In step 201 of the flow chart of FIG. 2, the system 300 may access the parameters ψ_0. These are the parameters of a neural network (NN) NN2, shown at 301 in the system diagram of FIG. 3, representing the reference dynamics p_0. NN2 301 takes a state-action pair (x, u) as input and outputs a mean vector and covariance matrix, μ_{ψ_0}(x, u) and Σ_{ψ_0}(x, u). These are fed into a sampler 302, which samples the next state x' from the multivariate Gaussian distribution with that mean and covariance. This is a standard approach to dynamics modeling in reinforcement learning.
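A minimal sketch of such a Gaussian dynamics network is given below. The single hidden layer and diagonal covariance are simplifying assumptions made for illustration only, not features taken from the patent.

```python
import numpy as np

class GaussianDynamicsNet:
    """Minimal dynamics network: (x, u) -> mean and (diagonal) covariance of a
    Gaussian over the next state. The parameters psi are the network weights."""

    def __init__(self, state_dim, action_dim, hidden=64, rng=None):
        rng = rng or np.random.default_rng(0)
        in_dim = state_dim + action_dim
        self.W1 = 0.1 * rng.standard_normal((in_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = 0.1 * rng.standard_normal((hidden, 2 * state_dim))
        self.b2 = np.zeros(2 * state_dim)
        self.state_dim = state_dim

    def forward(self, x, u):
        h = np.tanh(np.concatenate([x, u]) @ self.W1 + self.b1)
        out = h @ self.W2 + self.b2
        mean = out[: self.state_dim]
        var = np.exp(out[self.state_dim :])   # diagonal covariance
        return mean, var

    def sample_next_state(self, x, u, rng=None):
        rng = rng or np.random.default_rng()
        mean, var = self.forward(x, u)
        return mean + np.sqrt(var) * rng.standard_normal(self.state_dim)
```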
The system is also assumed to have access to a fixed policy π_θ, implemented as a neural network NN1, shown at 303 in FIG. 3. NN1 takes the state x as input and provides parameters to a sampler 304, which uses those parameters to sample an action according to a probability distribution. This is a standard way of implementing a stochastic policy in reinforcement learning.
In step 202 of FIG. 2, the system arbitrarily initializes the policy parameter vector θ_0 as a Euclidean vector. The system initializes the estimate of the Hessian matrix H_0 to a d × d zero matrix, where d is the dimension of the dynamics parameters ψ_k.
In step 203, the system then samples a batch B of trajectories using the latest parameters ψ_k and θ_k. This is done using NN3 and NN4, shown at 305 and 306 in FIG. 3, and their associated samplers 302 and 307. Since a trajectory is a sequence of (state, action, reward) triples, each new state-action pair is fed as input to NN3 305 (and on into its sampler) to sample the new state, and each new state is fed to NN4 306 (and on into its sampler) to sample the new action. It is assumed that there is a mechanism by which rewards can be extracted from the sampled trajectories, for example because the reward function is a known function of the state-action pair, or because a simulator is applied.
The gradient is given by the following equation:
g_k = ∇_ψ J(θ_k, ψ) |_{ψ = ψ_k}    (14)
and can be estimated by means of the following formula:
g_k ≈ (1/|B|) Σ_{τ ∈ B} [ ( Σ_t ∇_ψ log p_{ψ_k}(x_{t+1} | x_t, u_t) ) ( Σ_t γ^t r(x_t, u_t) ) ]    (15)
That is, the right-hand side of equation (15) is estimated empirically by averaging the quantity in square brackets over the batch B. This function is performed by the gradient estimator (308 in FIG. 3) in step 204 of FIG. 2.
As can be seen from the flow chart of FIG. 2, the next stage after the initialization in step 205 is a loop comprising steps 206 to 210, whose ultimate purpose is to estimate the inverse-Hessian-vector product H_0^{-1} g_k. This is done by generating v_1, v_2, …, v_M and finally averaging them, where each v_i is an estimate of the inverse-Hessian-vector product at a single sampled state-action pair. To generate v_i, the following operations are performed: in step 206, a sample state-action pair (x, u) is drawn. The samples are drawn by applying NN1 303 and its sampler 304, and NN3 305 and its sampler 302. The sample is input to NN3 305 to obtain the mean and covariance μ_{ψ_k}(x, u), Σ_{ψ_k}(x, u)
as the output of NN3. This is fed as an input to a Gaussian Wasserstein Computing Network (GWCN) 309. As shown in FIG. 3, the GWCN also takes the corresponding reference output μ_{ψ_0}(x, u), Σ_{ψ_0}(x, u) as an input. Internally, in step 207, it computes and outputs
W_2^2( N(μ_{ψ_k}(x, u), Σ_{ψ_k}(x, u)), N(μ_{ψ_0}(x, u), Σ_{ψ_0}(x, u)) )
i.e. the square of the 2-Wasserstein distance between the two multivariate normal distributions, using the closed form mentioned above. Since NN3 305 feeds directly into the GWCN 309, in step 208 the inverse-Hessian-vector product v_i is computed efficiently using an auto-differentiation engine (e.g. autograd: https://github.com/HIPS/autograd).
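The inverse-Hessian-vector products referred to in step 208 can be obtained by combining automatic differentiation with conjugate gradients, since conjugate gradients only require Hessian-vector products. The following sketch is illustrative only; the scalar function f standing in for the squared 2-Wasserstein distance as a function of ψ, and the use of scipy's solver, are assumptions.

```python
import autograd.numpy as np
from autograd import grad
from scipy.sparse.linalg import LinearOperator, cg

def inverse_hessian_vector_product(f, psi0, g):
    """Solve H z = g, where H is the Hessian of the scalar function f at psi0.

    f : function psi -> scalar (e.g. W_2^2 between the dynamics at psi and at psi0),
        written with autograd.numpy so it can be differentiated twice.
    """
    grad_f = grad(f)

    def hvp(v):
        # Hessian-vector product as the gradient of (grad f . v)
        return grad(lambda p: np.dot(grad_f(p), v))(psi0)

    n = psi0.shape[0]
    H = LinearOperator((n, n), matvec=hvp, dtype=np.float64)
    z, info = cg(H, g)     # conjugate gradients needs only matrix-vector products
    return z
```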
The loop continues until the required number of training samples has been drawn. Each of v_1, v_2, …, v_M is provided to a Dynamic Parameter Update Calculator (DPUC) 310, which calculates their average v̄ as an estimate of H_0^{-1} g_k. In step 211, this is used by the DPUC to calculate ψ_{k+1}:
ψ_{k+1} = ψ_0 - sqrt( 2ε / (g_k^T v̄) ) v̄
In step 212, the DPUC feeds ψ_{k+1} into a Policy Parameters Update Engine (PPUE) 311, which computes θ_{k+1} and feeds it into NN4 306. This completes the loop.
The new policy parameters θ_{k+1} are then used in the next iteration of the optimization problem, and the above steps are repeated.
In another example, which will now be described with reference to FIGS. 4 to 6, ψ_k corresponds to the parameters of a simulator.
The system 600 may have access to a simulator, shown at 601 in FIG. 6, which may implement dynamics parameterized using a vector ψ_k ∈ R^d.
In step 401 of FIG. 4, the system accesses the parameters ψ_0, which are the parameters of the simulator representing the reference dynamics. In step 402, θ_0 is set to a zero vector and the estimate of H_0 is set to a zero matrix. The system then enters a loop that ends with the estimation of H_0. This is achieved by applying what is known in the literature as an "evolution strategy" approach.
In this embodiment, H_0 is estimated using the following formula:
H_0 ≈ (1/σ^4) E_{e ~ N(0, σ^2 I)} [ F(ψ_0 + e) ( e e^T - σ^2 I ) ]
where F (see equation (10)) is estimated, for each sample ψ_0 + e, by a Wasserstein distance calculation on the empirical distributions of points generated under the corresponding dynamics.
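A sketch of such an evolution-strategies estimate of H_0 is given below. It assumes the zero-order estimator written above; the sample count and σ are illustrative values.

```python
import numpy as np

def es_hessian_estimate(F, psi0, sigma=0.1, n_samples=100, rng=None):
    """Evolution-strategies style estimate of the Hessian of F at psi0.

    Assumes the zero-order estimator
        H ~= (1 / sigma^4) * E_{e ~ N(0, sigma^2 I)} [ F(psi0 + e) (e e^T - sigma^2 I) ]
    approximated by averaging over n_samples perturbations.
    """
    rng = rng or np.random.default_rng(0)
    d = psi0.shape[0]
    H = np.zeros((d, d))
    for _ in range(n_samples):
        e = sigma * rng.standard_normal(d)
        H += F(psi0 + e) * (np.outer(e, e) - sigma ** 2 * np.eye(d))
    return H / (n_samples * sigma ** 4)
```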
The dynamics parameters of the simulator form a d-dimensional vector. A random vector e ~ N(0, σ^2 I) ∈ R^d is sampled by the multivariate Gaussian sampler 603; the system sets ψ ← ψ_0 + e and passes ψ to the simulator 601, as shown in step 404.
In step 405, the system then enters subroutine A, shown in the flowchart of FIG. 5, which takes ψ and ψ_0 as its inputs and uses both of them for sampling.
the neural network NN1 and its sampler 605 shown in 604 in fig. 6 are used to derive the neural network NN from
Figure BDA0003467178990000079
The action is sampled and a simulator 601 parameterized with psi is used to perform the sampling. The samples are fed into a simulator 601 in steps 503 and 504, respectively, to produce a plurality of samples
Figure BDA00034671789900000710
And
Figure BDA00034671789900000711
the samples are collected as a data set
Figure BDA00034671789900000712
And
Figure BDA00034671789900000713
(shown in steps 505 and 506, respectively) is stored in memory. These data sets are fed into a Wasserstein Calculation Engine (WCE) 606 in order to calculate an empirical Wasserstein distance in step 507, which is considered an estimate of:
Figure BDA00034671789900000714
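The empirical Wasserstein distance between the two sample sets can be computed with an optimal-transport solver. The patent does not name a particular library; the sketch below uses the POT package (`ot`) as one possible choice, with uniform weights on the samples.

```python
import numpy as np
import ot  # POT: Python Optimal Transport (an implementation choice, not named in the patent)

def empirical_w2_squared(samples_a, samples_b):
    """Squared 2-Wasserstein distance between two empirical distributions.

    samples_a, samples_b : arrays of shape (n, d) and (m, d) of sampled next states.
    """
    n, m = len(samples_a), len(samples_b)
    a = np.full(n, 1.0 / n)            # uniform weights on each sample
    b = np.full(m, 1.0 / m)
    M = ot.dist(samples_a, samples_b)  # pairwise squared Euclidean costs
    return ot.emd2(a, b, M)            # optimal transport cost = W_2^2
```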
this loop of steps 502 through 508 is repeated a plurality of times and an average of the estimates is taken in step 510. This is done by passing the estimate from the memory 607 to the arithmetic mean calculator 608, as shown in step 509. The result is referred to as h in FIG. 4i. The outer loop is repeated for each sample e (call subroutine A), the average is calculated and set to H0An estimate of (d).
As can be seen in FIG. 4, the system then enters a loop at step 409, in which a batch B of trajectories is sampled using θ_k and ψ_k. This is done using the simulator 601 and NN2 602 and its sampler 610. In step 410, the system uses B to estimate the gradient
g_k = ∇_ψ J(θ_k, ψ) |_{ψ = ψ_k}
using the empirical estimator of equation (15).
this function is performed by the gradient estimator 609 in fig. 6.
In step 411, the system then calculates an estimate of H_0^{-1} g_k by applying the conjugate gradient algorithm, in the conjugate gradient estimator 611, to the optimization problem:
min_z ( (1/2) z^T H_0 z - z^T g_k )
in step 412, the results are passed to a new dynamic parameter calculation engine 612 that performs the following calculations:
Figure BDA0003467178990000085
This determines the new dynamics parameters ψ_{k+1}. The new parameters are then passed to the policy parameter update engine 613, which computes the new policy parameters θ_{k+1} in step 413. The update of the policy parameters may be performed with algorithms such as PPO (the proximal policy optimization algorithm of Schulman, John, Filip Wolski, Prafulla Dhariwal, Alec Radford and Oleg Klimov, "Proximal Policy Optimization Algorithms", arXiv preprint arXiv:1707.06347, 2017), TRPO (Schulman, J., Levine, S., Abbeel, P., Jordan, M. and Moritz, P., "Trust Region Policy Optimization", International Conference on Machine Learning, June 2015, pp. 1889-1897), or with "vanilla" policy gradient updates.
The new policy parameters θ_{k+1} are then used in the next iteration of the optimization, and the above steps are repeated.
Using the parameters of a simulator is advantageous compared with the general approach described above because, in general, a simulator can have its parameters varied in a manner consistent with a certain set of rules. For example, a simulator for a physical system such as a robot or a car would allow quantities such as friction, mass and length to vary, but the system should still obey Newton's laws.
Alternatively, the parameters used in this embodiment may come from a differential equation solver, which may implement dynamics parameterized using a vector ψ_k ∈ R^d.
FIG. 7 summarizes a method for performing reinforcement learning to generate a solution set of values that may be used as parameters in a model such that the model provides a level of performance with respect to a performance metric. In step 701, the method includes forming a candidate solution including a set of candidate parameter values. The method then includes repeating steps 702-704 below. In step 702, a first evaluation is made of the quality of the candidate solution by evaluating the degree to which the model with the values of the candidate solution provides a high level of performance for the performance metric. In step 703, a second evaluation is made of the quality of the candidate solution by evaluating the extent to which the model with the value of the candidate solution fails to provide a low level of performance with respect to the performance metric. In step 704, another candidate solution is formed based on the first evaluation and the second evaluation.
In the above method, a model is formed for analyzing data and providing an indication of one or more attributes of the input data. The model is generic and operates according to values that control its performance. For example, the model may be a neural network and the values may be the weights applied in the network. The system described above selects the values by training against reference or training data. The training data comprise a set of possible input data for the model, together with a corresponding expected output for each input. To select the values, the system runs a first loop and a second loop, the second loop being run inside the first. In the first loop, a set of candidate values is formed by selecting for high performance of the model configured with those values on the training data. In other words, the set of candidate values is selected or evaluated according to a determination of whether there is a relatively high correspondence between the output of the model, configured with those values and given the training data as input, and the expected output of the training data. In the inner loop, the set of candidate values is tested for whether the model configured with those values performs poorly on the training data. In other words, the set of candidate values is selected or evaluated according to a determination of whether there is a relatively low correspondence between the output of the model, configured with those values and given the training data as input, and the expected output of the training data. The process is repeated a number of times, with the set of candidate values for each iteration being selected such that it has been determined to exhibit a relatively high propensity for good performance (e.g. relative to the previous set of candidate values) and a relatively low propensity for poor performance.
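Organized as code, the two evaluations and the iterative formation of new candidate solutions might look like the following skeleton. It is illustrative only; the three callables are placeholders standing in for the components described above.

```python
def robust_training_loop(theta0, psi0, evaluate_return, worst_case_dynamics,
                         update_policy, n_iterations=100):
    """Skeleton of the max-min training procedure (illustrative only).

    evaluate_return(theta, psi)      -> scalar performance of policy theta under dynamics psi
    worst_case_dynamics(theta, psi0) -> dynamics parameters (approximately) minimizing the
                                        return within the Wasserstein ball around psi0
    update_policy(theta, psi)        -> new policy parameters improving the return under psi
    """
    theta, psi = theta0, psi0
    for k in range(n_iterations):
        # First evaluation: performance under the reference-centred dynamics.
        nominal = evaluate_return(theta, psi0)
        # Second evaluation: performance under approximately worst-case dynamics.
        psi = worst_case_dynamics(theta, psi0)
        worst = evaluate_return(theta, psi)
        # Form the next candidate solution using both evaluations.
        theta = update_policy(theta, psi)
        print(f"iter {k}: nominal return {nominal:.3f}, worst-case return {worst:.3f}")
    return theta
```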
The approach described herein helps to address the fragility problem in reinforcement learning, and in particular may provide a system and method for learning to control a system in a manner that is robust to dynamic changes. Thus, the present invention can enable users to train on simulators and deploy in the real world with good performance.
Embodiments of the present invention may provide advantages over previous approaches. The method of the invention performs better in experiments and can be used with continuous state and action spaces. In particular, "Non-Stationary Markov Decision Processes: a Worst-Case Approach using Model-Based Reinforcement Learning" by Lecarpentier and Rachelson (arXiv:1904.10090, 2019) is applicable only to scenarios with finite, discrete state and action spaces. The described method operates in continuous state and action spaces, which is a fundamental requirement for practical adaptation to real-world environments.
A particularly advantageous situation in which the invention can be applied is the case of an autonomous vehicle. The dynamics of a car can vary due to a number of factors, including changes in the road surface, road inclination, tire pressure and friction, and changes due to the weight carried. Clearly, in this example, the car does not experience any single set of dynamics throughout its life cycle, or even over a short period of time. The algorithm of the invention is more robust to such changes and can cope with new dynamics without having had to learn to cope with them in the environment.
Thus, the approaches disclosed herein may learn control strategies that are robust to dynamic variations between the environment in which they are trained and the environment in which they are deployed.
In the above description, dynamics within a predetermined Wasserstein distance of the reference dynamics are considered during training of the control strategy. However, other metrics may be used as the distance function.
The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features. Such features or combinations of features can be carried out based on the present description as a whole, in the light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problem disclosed herein, and without limitation to the scope of the claims. This application is intended to cover any adaptations or combinations of the various aspects of the invention. In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the invention.

Claims (15)

1. A system for performing reinforcement learning to generate a solution set of values that can be used as parameters in a model to cause the model to provide a level of performance with respect to a performance metric, the system being configured to: forming a candidate solution comprising a set of candidate parameter values; repeatedly executing the following steps:
first evaluating the quality of the candidate solution by evaluating a degree to which a model having the value of the candidate solution provides a high level of performance for the performance metric;
second evaluating the quality of the candidate solution by evaluating a degree to which a model having the value of the candidate solution fails to provide a low level of performance for the performance metric;
forming another candidate solution based on the first evaluation and the second evaluation.
2. The system of claim 1, wherein the system is configured to evaluate the quality of the candidate solution by testing the behavior of the model configured according to the values of the candidate solution when applied to a set of reference values.
3. The system of claim 2, wherein the system is further configured to evaluate the quality of the candidate solution by: generating a set of adapted reference values comprising one or more adapted reference data items in the vicinity of at least some of the reference values in the set of reference values; testing the behavior of the model configured according to the values of the candidate solution when applied to the set of adapted reference values.
4. The system of claim 3, wherein the one or more adapted reference data items are within a predetermined Wasserstein distance of the set of reference values.
5. System according to claim 3 or 4, wherein the adapted set of reference values represents worst case values in the set of reference values.
6. The system according to any one of claims 2 to 5, wherein the set of reference values comprises parameters of a neural network.
7. The system of any one of claims 2 to 5, wherein the set of reference values comprises values output from a simulator or differential equation solver.
8. The system according to any one of claims 2 to 7, wherein the set of reference values comprises a set of reference dynamics.
9. The system of any preceding claim, wherein the system is configured to perform an optimization comprising the first and second evaluations of the quality of the candidate solution.
10. The system of any preceding claim, wherein the model is a trained artificial intelligence model.
11. The system of claim 10, wherein the model is a neural network.
12. A method for performing reinforcement learning to generate a solution set of values that can be used as parameters in a model to cause the model to provide a level of performance with respect to a performance metric, the method comprising:
forming a candidate solution comprising a set of candidate parameter values; repeatedly executing the following steps:
first evaluating the quality of the candidate solution by evaluating a degree to which a model having the value of the candidate solution provides a high level of performance for the performance metric;
second evaluating the quality of the candidate solution by evaluating a degree to which a model having the value of the candidate solution fails to provide a low level of performance for the performance metric;
forming another candidate solution based on the first evaluation and the second evaluation.
13. The method of claim 12, wherein the evaluating the quality of the candidate solution comprises testing a behavior of the model configured according to the values of the candidate solution when applied to a set of reference values.
14. The method of claim 13, wherein said evaluating the quality of said candidate solution further comprises: generating a set of adapted reference values comprising one or more adapted reference data items in the vicinity of at least some of the reference values in the set of reference values; testing the behavior of the model configured according to the values of the candidate solution when applied to the set of adapted reference values.
15. The method of claim 14, wherein the one or more adapted reference data items are within a predetermined Wasserstein distance of the set of reference values.
CN201980098404.XA 2019-07-16 2019-07-16 Learning to robustly control a system Pending CN114270375A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2019/069101 WO2021008691A1 (en) 2019-07-16 2019-07-16 Learning to robustly control a system

Publications (1)

Publication Number Publication Date
CN114270375A true CN114270375A (en) 2022-04-01

Family

ID=67314771

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201980098404.XA Pending CN114270375A (en) 2019-07-16 2019-07-16 Learning to robustly control a system

Country Status (2)

Country Link
CN (1) CN114270375A (en)
WO (1) WO2021008691A1 (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3465236B2 (en) 2000-12-20 2003-11-10 科学技術振興事業団 Robust reinforcement learning method
US6665651B2 (en) 2001-07-18 2003-12-16 Colorado State University Research Foundation Control system and technique employing reinforcement learning having stability and learning phases
CN107856035A (en) 2017-11-06 2018-03-30 深圳市唯特视科技有限公司 A kind of robustness dynamic motion method based on intensified learning and whole body controller

Also Published As

Publication number Publication date
WO2021008691A1 (en) 2021-01-21

Similar Documents

Publication Publication Date Title
Nguyen-Tuong et al. Model learning with local gaussian process regression
Kamalapurkar et al. Reinforcement learning for optimal feedback control
Khansari-Zadeh et al. BM: An iterative algorithm to learn stable non-linear dynamical systems with gaussian mixture models
Cheng et al. Control regularization for reduced variance reinforcement learning
CN108153153B (en) Learning variable impedance control system and control method
Parmas et al. PIPPS: Flexible model-based policy search robust to the curse of chaos
Xu et al. Reinforcement learning algorithms with function approximation: Recent advances and applications
Xu et al. Kernel-based approximate dynamic programming for real-time online learning control: An experimental study
Romeres et al. Derivative-free online learning of inverse dynamics models
JP2020535562A (en) Devices and methods to control the system
JP2013242761A (en) Method, and controller and control program thereof, for updating policy parameters under markov decision process system environment
CN109827579B (en) Method and system for real-time correction of filtering model in combined positioning
Duell et al. Solving partially observable reinforcement learning problems with recurrent neural networks
Heim et al. A learnable safety measure
Inga et al. Online inverse linear-quadratic differential games applied to human behavior identification in shared control
Senn et al. Reducing the computational effort of optimal process controllers for continuous state spaces by using incremental learning and post-decision state formulations
JP7378836B2 (en) Summative stochastic gradient estimation method, apparatus, and computer program
Possas et al. Online bayessim for combined simulator parameter inference and policy improvement
Polydoros et al. A reservoir computing approach for learning forward dynamics of industrial manipulators
US11614718B2 (en) System and method for the autonomous construction and/or design of at least one component part for a component
Baert et al. Maximum causal entropy inverse constrained reinforcement learning
CN112836439A (en) Method and apparatus for processing sensor data
CN114270375A (en) Learning to robustly control a system
Handoyo et al. Implementation of particle swarm optimization (PSO) algorithm for estimating parameter of arma model via maximum likelihood method
Guzman et al. Adaptive model predictive control by learning classifiers

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination