CN114270375A - Learning to robustly control a system - Google Patents
- Publication number
- CN114270375A (application number CN201980098404.XA)
- Authority
- CN
- China
- Prior art keywords
- candidate solution
- model
- values
- evaluating
- performance
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/004—Artificial life, i.e. computing arrangements simulating life
- G06N3/006—Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/047—Probabilistic or stochastic networks
Abstract
A system for performing reinforcement learning to generate a solution set of values that can be used as parameters in a model to cause the model to provide a level of performance with respect to a performance metric, the system for: forming a candidate solution comprising a set of candidate parameter values; repeatedly executing the following steps: first evaluating the quality of the candidate solution by evaluating a degree to which a model having the value of the candidate solution provides a high level of performance for the performance metric; second evaluating the quality of the candidate solution by evaluating a degree to which a model having the value of the candidate solution fails to provide a low level of performance for the performance metric; forming another candidate solution based on the first evaluation and the second evaluation.
Description
Technical Field
The present invention relates to avoiding fragility in reinforcement learning, and in particular to learning to control a system in a manner that is robust to dynamic changes.
Background
In Reinforcement Learning (RL), a system is modeled as a Markov Decision Process (MDP). This is defined as the tuple <X, U, p, r, γ>, where X is the state space, U is the action space, p(· | x, u) is the probability distribution of the next state for each state-action pair (x, u), r(x, u) is the reward (a positive or negative real number), and γ is the discount factor. The probability distribution p is called the dynamics.
The controller of the MDP is called a policy. It is typically implemented as a probability distribution over actions given the current state x, denoted π(· | x). An MDP equipped with an initial state distribution and a policy yields a Markov Reward Process (MRP). This induces a probability distribution over traces (a trace is a sequence of states, actions and rewards).
The standard goal in RL is to optimize the expected return, i.e., the total discounted reward:
$$J(\pi) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(x_t, u_t)\right],$$
where μ is the initial distribution, and where:
$$x_0 \sim \mu, \qquad u_t \sim \pi(\cdot \mid x_t), \qquad x_{t+1} \sim p(\cdot \mid x_t, u_t).$$
The dynamics p are assumed to be fixed, i.e., if the same action is taken in a given state, the distribution of the next state is the same whenever that action is taken in that state. The dynamics are independent of any control policy applied to the MDP.
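As a minimal sketch of the expected return described above, the following Python computes the discounted sum of rewards for one trace and averages it over sampled traces. The function names and the use of plain reward lists are assumptions made for illustration; the discount factor `gamma` is the usual one from the RL literature.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return sum_t gamma**t * rewards[t] for a single trace."""
    total = 0.0
    # Accumulate backwards so each step is one multiply-add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def expected_return(reward_traces: Sequence[Sequence[float]], gamma: float) -> float:
    """Monte Carlo estimate of the expected return from sampled reward sequences."""
    return sum(discounted_return(rs, gamma) for rs in reward_traces) / len(reward_traces)


# 1 + 0.5 * 1 + 0.25 * 1 = 1.75
print(discounted_return([1.0, 1.0, 1.0], 0.5))
```

In practice the reward sequences would come from traces sampled with the policy and the dynamics, so the estimate depends on both.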
This aspect of the fixed dynamics of the MDP is a basic assumption in the standard RL algorithm. However, several problems arise from this.
If a policy is trained on a certain MDP and then deployed on an MDP with different dynamics, the policy typically underperforms, i.e., the policy tends to be fragile with respect to changes in dynamics. For example, if the RL agent is trained using a simulator (e.g., a simulator for an automobile or a robot) and then deployed on a real physical system, the simulator will be imperfect and its simulated dynamics will not be exactly the same as the real-world dynamics.
Another problem is that changes in dynamics occurring at different times in the real world will affect the results. For example, the driver of an automobile may experience different dynamics due to, for example, the road surface, differences in load, or tire pressure. Any machine may experience differences in friction due to changes in temperature or lubrication. Any algorithm that produces a control policy that breaks down under such changes in dynamics is clearly impractical. This lack of robustness is one of the reasons why RL has so far had limited success outside the laboratory or outside controlled environments such as games.
Previous approaches have attempted to develop policies that are more robust to such changes in dynamics. For example, in Tessler et al., "Action Robust Reinforcement Learning and Applications in Continuous Control" (ICML 2019), the problem is framed as a zero-sum game. The policy is trained against the following action-robustness criteria: (i) a different, possibly adversarial, action is taken with fixed probability; (ii) a perturbation is added to the action itself. However, while the algorithm performs well in some MuJoCo tasks, it performs poorly in others (such as the inverted pendulum).
Another approach, described by Pinto et al. in "Robust Adversarial Reinforcement Learning" (ICML 2017), also frames the problem as a zero-sum game, with robustness learned iteratively by alternating between training the adversary's policy and the agent's policy. Lecarpentier and Rachelson, in "Non-Stationary Markov Decision Processes, a Worst-Case Approach using Model-Based Reinforcement Learning" (arXiv:1904.10090, 2019), model dynamics that vary over time, with the variation per unit time bounded in Wasserstein distance. A tree search algorithm solves for the worst case, in which the environment is adversarial. Experiments were conducted in a grid world, i.e., on small-scale examples, and the method does not appear to scale to continuous state and action spaces.
Other methods are described in the following documents: JP 3465236 B2, which is based on the H-infinity technique from classical control theory; CN 107856035 A, which is specific to a particular class of problems rather than being a general RL algorithm; and US 6665651 B2, which relies on a controller to train the neural network, with an emphasis on the stability of the learning process.
It is desirable to learn to control a system in a manner that improves robustness to changes in dynamics.
Disclosure of Invention
There is provided a system for performing reinforcement learning to generate a solution set of values that can be used as parameters in a model to cause the model to provide a level of performance with respect to a performance metric, the system for: forming a candidate solution comprising a set of candidate parameter values; repeatedly executing the following steps: first evaluating the quality of the candidate solution by evaluating a degree to which a model having the value of the candidate solution provides a high level of performance for the performance metric; second evaluating the quality of the candidate solution by evaluating a degree to which a model having the value of the candidate solution fails to provide a low level of performance for the performance metric; forming another candidate solution based on the first evaluation and the second evaluation.
Thus, the system can evaluate the overall performance of the strategy to optimize the performance of the model and evaluate the model with these parameters under worst case dynamics to minimize the occurrence of low level performance. The system may then iteratively form another candidate solution for the policy parameter based on these first and second evaluations.
The system may be used to evaluate the quality of the candidate solution by testing the behavior of the model configured from the values of the candidate solution when applied to a set of reference values. This may enable a convenient way of assessing the quality of the candidate solution.
The system may be further configured to evaluate the quality of the candidate solution by: generating a set of adapted reference values comprising one or more adapted reference data items in the vicinity of at least some of the reference values in the set of reference values; testing the behavior of the model configured according to the values of the candidate solution when applied to the set of adapted reference values. This may enable testing the quality of the candidate solution for dynamics different from the reference dynamics.
The one or more adapted reference data items may be within a predetermined Wasserstein distance of the reference value set. The adapted set of reference values may represent worst case values of the set of reference values. Thus, the quality of the strategy can be evaluated in the worst case scenario within the Wasserstein ball. By evaluating the worst case dynamics of the strategy, the robustness of the strategy can be improved. Therefore, the Wasserstein metric may help measure the degree of "error" of a model or simulator in a useful and intuitively reasonable manner.
The set of reference values may comprise parameters of a neural network. This approach may enable efficient computation of the updated reference value and the further candidate solution.
The reference value set may include values output from a simulator or a differential equation solver. Such an approach may enable efficient computation of updated reference values and of another candidate solution, and may enable the reference values to vary in a manner consistent with a certain set of rules. For example, a simulator for a physical system such as a robot or automobile would allow quantities such as friction, mass and length to vary, while the system still obeys Newton's laws.
The set of reference values may comprise a set of reference dynamics. This may enable the system to be applied to real-world dynamic situations.
The system may be configured to perform an optimization including the first and second evaluations of the quality of the candidate solution.
The model may be a trained artificial intelligence model. The model may be a neural network.
According to a second aspect, there is provided a method of performing reinforcement learning to generate a solution set of values that can be used as parameters in a model to cause the model to provide a level of performance with respect to a performance metric, the method comprising: forming a candidate solution comprising a set of candidate parameter values; repeatedly executing the following steps: first evaluating the quality of the candidate solution by evaluating a degree to which a model having the value of the candidate solution provides a high level of performance for the performance metric; second evaluating the quality of the candidate solution by evaluating a degree to which a model having the value of the candidate solution fails to provide a low level of performance for the performance metric; forming another candidate solution based on the first evaluation and the second evaluation.
Thus, the method can evaluate the overall performance of the strategy to optimize the performance of the model and evaluate the model with these parameters under worst case dynamics to minimize the occurrence of low level performance. The method may then iteratively form another candidate solution for the policy parameter based on these first and second evaluations.
The evaluating the quality of the candidate solution may include testing a behavior of the model configured according to the values of the candidate solution when applied to a set of reference values. This may enable a convenient way of assessing the quality of the candidate solution.
The evaluating the quality of the candidate solution may further comprise: generating a set of adapted reference values comprising one or more adapted reference data items in the vicinity of at least some of the reference values in the set of reference values; testing the behavior of the model configured according to the values of the candidate solution when applied to the set of adapted reference values. This may enable testing the quality of the candidate solution for dynamics different from the reference dynamics.
The one or more adapted reference data items may be within a predetermined Wasserstein distance of the reference value set. Thus, the quality of the strategy can be evaluated in the worst case scenario within the Wasserstein ball. By evaluating the worst case dynamics of the strategy, the robustness of the strategy can be improved. Therefore, the Wasserstein metric may help measure the degree of "error" of a model or simulator in a useful and intuitively reasonable manner.
Drawings
The invention will now be described by way of example with reference to the accompanying drawings.
In the drawings:
FIG. 1 illustrates iteratively updating parameterized policies and dynamics;
FIG. 2 shows a flow chart of a method provided by an embodiment of the invention;
FIG. 3 illustrates an example of a system for implementing the method illustrated in FIG. 2;
FIG. 4 shows a flow chart of a method provided by yet another embodiment of the present invention;
FIG. 5 shows a flow diagram of a method provided by a subroutine executed as part of the method shown in FIG. 4;
FIG. 6 illustrates an example of a system for implementing the methods shown in FIGS. 4 and 5;
FIG. 7 illustrates an example of a method of performing reinforcement learning to generate a solution set that can be used as values for parameters in a model.
Detailed Description
The present invention relates to a system and method for performing reinforcement learning to generate a solution set of values that can be used as parameters in a model that are robust to dynamic variations.
In standard RL algorithms, the quality of a candidate solution for the model parameters is evaluated by testing the behavior of the model, configured according to the values of the candidate solution, when applied to a set of reference dynamics p_0 (as input to the system). In the standard algorithm, these reference dynamics are used to train the control policy without regard to other possible dynamics. In embodiments of the present invention, a set of possible dynamics distributed around p_0 (e.g., centered on it) is considered. These reference values represent the starting point for training the model.
In one embodiment, dynamics within a predetermined Wasserstein distance of the reference dynamics are considered during training of the control policy. The Wasserstein distance is used as a measure of divergence between dynamics and is defined for probability measures on a metric space. Specifically, assume that the state space X is a metric space, with the metric denoted d(·, ·). Let M(S) denote the set of probability measures on a set S. The set of couplings of the probability measures μ and ν is defined as:
$$\Gamma(\mu, \nu) = \{\kappa \in M(X \times X) : \kappa(A \times X) = \mu(A),\ \kappa(X \times B) = \nu(B) \text{ for all measurable } A, B \subseteq X\},$$
that is, it is the set of probability measures on the product space whose marginal along one dimension is μ and whose marginal along the other dimension is ν. The p-Wasserstein distance is defined as:
$$W_p(\mu, \nu) = \left(\inf_{\kappa \in \Gamma(\mu, \nu)} \int_{X \times X} d(x, y)^{p}\, \mathrm{d}\kappa(x, y)\right)^{1/p}.$$
In general, the Wasserstein distance has no closed-form solution, but it can be estimated numerically. However, the squared 2-Wasserstein distance between Gaussians has a closed form.
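To illustrate the numerical estimation mentioned above, here is a small sketch for the one-dimensional case only. For two samples of equal size on the real line with d(x, y) = |x - y| and p >= 1, the optimal coupling matches the sorted values in order, so the empirical p-Wasserstein distance reduces to a sort. The function name is hypothetical, and higher-dimensional state spaces would need a general optimal-transport solver instead.

```python
from typing import Sequence


def empirical_wasserstein_1d(xs: Sequence[float], ys: Sequence[float], p: float = 2.0) -> float:
    """p-Wasserstein distance between two equal-size 1-D empirical distributions."""
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("need two non-empty samples of equal size")
    # In one dimension the optimal coupling pairs the i-th smallest x
    # with the i-th smallest y (valid for p >= 1).
    xs_sorted, ys_sorted = sorted(xs), sorted(ys)
    cost = sum(abs(a - b) ** p for a, b in zip(xs_sorted, ys_sorted)) / len(xs_sorted)
    return cost ** (1.0 / p)


# Shifting every point by 1 gives a distance of 1 for any p.
print(empirical_wasserstein_1d([0.0, 1.0], [2.0, 1.0], p=1))
```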
The set of dynamics considered consists of the dynamics p that lie within some predefined limit of p_0 when measured by the Wasserstein metric. The Wasserstein metric may be any one of the class of Wasserstein metrics. This set may be referred to as the ε-Wasserstein ball around p_0, defined as:
$$B_\epsilon(p_0) = \{\, p : W(p, p_0) \le \epsilon \,\}.$$
The system is used to learn the best control policy, where the quality of a policy π is the standard RL objective function, but evaluated at the worst-case dynamics (for π) within the Wasserstein ball. By remaining "pessimistic" about the dynamics that may be encountered (i.e., by evaluating the policy under worst-case dynamics), the robustness of the policy may be increased. The Wasserstein metric may therefore help measure the degree of "error" of a model or simulator in a useful and intuitively reasonable manner.
The Wasserstein distance has the form (distance × probability mass). Thus, if this product is bounded by some number, then where the distance is large the probability must be small, and where the probability is large the distance must be small. That is, the model may be badly wrong (large distance) but only rarely, or it may be wrong often (high probability) but then not by much. If the reference dynamics were often highly inaccurate, they would be useless, training on them would be meaningless and, ultimately, attempting to achieve robustness would be futile.
Not all dynamics in the Wasserstein ball are plausible; for example, some dynamics may violate Newton's laws of motion. In the present invention, the dynamics may be perturbed but remain plausible. In view of this, the policy π is parameterized by a vector θ and written π_θ. The parameters θ are the parameters (weights) of a neural network and may be updated over time.
The dynamics are likewise parameterized by another vector ψ. These parameters ψ may also be updated over time. They may be, for example, parameters of a neural network, a simulator, or another processor that implements the system dynamics (e.g., a differential equation solver or a real system). The parameters corresponding to the reference dynamics p_0 are denoted ψ_0.
The system may generate a set of adapted reference dynamics parameters comprising adapted reference data in the vicinity of ψ_0, within a predetermined Wasserstein distance. The system then evaluates the quality of the candidate solution for the policy parameters by testing the behavior of the model, configured according to the values of the candidate solution, when applied to the adapted reference dynamics parameters.
The system performs a first evaluation of the quality of the candidate solution by evaluating the extent to which the model with the values of the candidate solution provides a high level of performance with respect to the performance metric, i.e., the system evaluates the overall performance of the policy in order to optimize the performance of the model. The system also performs a second evaluation of the quality of the candidate solution by evaluating the extent to which the model with the values of the candidate solution fails to provide a low level of performance with respect to the performance metric, i.e., the system evaluates the model with these parameters under worst-case dynamics, with the aim of minimizing the occurrence of low-level performance. The system iteratively forms another candidate solution for the policy parameters based on the first and second evaluations.
The optimization problem to be solved by the system can be expressed as:
where, in the continuing RL setting:
and in the episodic RL setting:
These are "occupancy measures": in the continuing case, the stationary distribution of the Markov chain induced by the MDP with dynamics ψ and policy π_θ; in the episodic case, a probability distribution that serves a similar purpose to the stationary distribution.
The system iteratively solves an approximation of the optimization problem described above, as shown in FIG. 1. Based on the updated dynamics parameters ψ_{k+1}, indicated at 101, the policy parameters θ_{k-1} of the previous candidate solution are updated at 102, and the updated policy parameters θ_k, shown at 103, are used in the next iteration of the optimization problem, as shown at 104.
The internal optimization problem is given by the following formula:
The term H_0 is an estimate of the Hessian of the function F evaluated at ψ_0, where
The solution to the above problem is given by the following equation:
The optimization problem defined above may be solved under certain assumptions (e.g., that H_0 exists and is symmetric positive definite) to compute an update of the dynamics parameters ψ.
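As a hedged illustration of how such an inner problem can admit a closed-form solution, suppose the return is linearized in ψ around ψ_0 and the Wasserstein constraint is replaced by its second-order expansion with Hessian H_0. The symbols J and g and the exact form of the constraint are assumptions made for this sketch, not the patent's own equations.

```latex
% Assumed notation: J(\theta,\psi) is the expected return of \pi_\theta under dynamics \psi,
% g = \nabla_\psi J(\theta,\psi)\big|_{\psi=\psi_0}, and \epsilon is the radius of the Wasserstein ball.
\begin{aligned}
&\min_{\psi}\; g^{\top}(\psi-\psi_0)
\quad\text{s.t.}\quad \tfrac{1}{2}(\psi-\psi_0)^{\top} H_0\,(\psi-\psi_0)\le\epsilon .\\[4pt]
&\text{Stationarity of the Lagrangian with multiplier } \lambda>0:\quad
g+\lambda H_0(\psi-\psi_0)=0
\;\Longrightarrow\;
\psi-\psi_0=-\tfrac{1}{\lambda}H_0^{-1}g .\\[4pt]
&\text{The constraint is active at the optimum:}\quad
\tfrac{1}{2\lambda^{2}}\,g^{\top}H_0^{-1}g=\epsilon
\;\Longrightarrow\;
\tfrac{1}{\lambda}=\sqrt{\frac{2\epsilon}{g^{\top}H_0^{-1}g}} .\\[4pt]
&\text{Hence}\quad
\psi^{\star}=\psi_0-\sqrt{\frac{2\epsilon}{g^{\top}H_0^{-1}g}}\;H_0^{-1}g .
\end{aligned}
```

Under this reading, the requirement that H_0 be symmetric positive definite is what makes the inverse exist and the square root well defined, and the only expensive quantity is the inverse Hessian vector product H_0^{-1} g.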
The updated dynamic parameters are then used to update the policy parameters, which are then used in subsequent iterations of the optimization, as shown in FIG. 1.
Thus, a high-level strategy for solving the optimization problem in equation (6) has an outer loop that updates θ and an inner loop that solves the minimization problem for a given fixed θ.
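The outer and inner loops can be sketched on a toy problem. This is an illustration under stated assumptions, not the method of the embodiments: the policy parameter `theta` and dynamics parameter `psi` are scalars, the return is the made-up function J(theta, psi) = -(theta - psi)^2, and the Wasserstein ball is replaced by the interval |psi - psi0| <= radius so that the inner minimization can be solved exactly.

```python
def worst_case_dynamics(theta: float, psi0: float, radius: float) -> float:
    """Inner loop: minimise J(theta, psi) = -(theta - psi)**2 over |psi - psi0| <= radius."""
    lo, hi = psi0 - radius, psi0 + radius
    # The return is lowest at the end of the interval farthest from theta.
    return lo if abs(theta - lo) >= abs(theta - hi) else hi


def robust_policy_search(theta: float, psi0: float, radius: float,
                         lr: float = 0.1, iters: int = 200) -> float:
    """Outer loop: gradient ascent on theta against the current worst-case psi."""
    for _ in range(iters):
        psi = worst_case_dynamics(theta, psi0, radius)  # inner minimisation for fixed theta
        grad = -2.0 * (theta - psi)                     # dJ/dtheta evaluated at the worst case
        theta += lr * grad                              # outer ascent step
    return theta


theta_robust = robust_policy_search(theta=3.0, psi0=1.0, radius=0.5)
print(theta_robust)  # hovers near psi0, the robust optimum of this toy problem
```

With a fixed step size the iterate oscillates in a small band around psi0, because the worst-case endpoint switches sides; a decaying step size would shrink that band.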
Thus, the system learns the optimal control policy, where the quality of the policy is the standard RL objective function, but evaluated under the worst-case dynamics within the Wasserstein ball. By being pessimistic about the dynamics that a model may encounter, the robustness of the policy may be increased.
Two exemplary embodiments of the present invention will now be described.
A general method in which a Gaussian distribution is defined using a neural network will first be described with reference to FIGS. 2 and 3. The method enables efficient computation of the updated dynamics parameters ψ_{k+1}.
In this embodiment, ψ_k are the parameters of a neural network that outputs the mean and covariance of a Gaussian distribution. The required quantities are computed efficiently using the conjugate gradient algorithm and automatic differentiation (e.g., autograd, see https://github.com/HIPS/autograd) applied to the following optimization problem:
In step 201 of the flow chart of FIG. 2, the system 300 has access to the parameters ψ_0. These are the parameters of a neural network (NN) NN2, shown at 301 in the system diagram of FIG. 3, which represents the reference dynamics p_0. NN2 301 takes a state-action pair (x, u) as input and outputs a mean vector and a covariance matrix. These are fed into a sampler 302, which samples the next state x' from the multivariate Gaussian distribution having that mean and covariance. This is a standard approach to modeling dynamics in reinforcement learning.
The system is also assumed to have access to a fixed policy. This policy is implemented as a neural network NN1, shown at 303 in FIG. 3. NN1 takes the state x as input and provides parameters to a sampler 304, which uses them to sample an action according to a probability distribution. This is a standard method for implementing a stochastic policy in reinforcement learning.
In step 202 in FIG. 2, the system initializes the policy parameter vector arbitrarily to θ_0, as a Euclidean vector. The system initializes the Hessian estimate to a d × d zero matrix, where d is the dimension of the dynamics parameters ψ_k.
In step 203, the system then samples a batch B of traces using the latest parameters ψ_k and θ_k. This is done using NN3 and NN4, shown at 305 and 306 in FIG. 3 respectively, and their associated samplers 302 and 307. Since a trace is a sequence of (state, action, reward) tuples, each new state-action pair is fed as input to NN3 305 (which feeds its sampler) to sample a new state, and each new state is fed to NN4 306 (which feeds its sampler) to sample a new action. It is assumed that there is a mechanism by which rewards can be extracted from the sampled trajectory. This is the case, for example, if the reward function is a known function of the state-action pair, or if a simulator is used.
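The trace-sampling step can be sketched as follows. This is a simplified stand-in, not the networks of FIG. 3: the state and action are scalars, the "policy network" and "dynamics network" are replaced by hypothetical linear-Gaussian models with parameters `(gain, std)` and `(a, b, std)`, and the reward is assumed to be a known function of the state-action pair.

```python
import random
from typing import Callable, List, Tuple

Trace = List[Tuple[float, float, float]]  # sequence of (state, action, reward)


def sample_trace(x0: float,
                 policy_params: Tuple[float, float],
                 dynamics_params: Tuple[float, float, float],
                 reward_fn: Callable[[float, float], float],
                 horizon: int,
                 rng: random.Random) -> Trace:
    """Roll out one trace by alternating policy sampling and dynamics sampling."""
    gain, policy_std = policy_params          # stands in for theta
    a, b, dynamics_std = dynamics_params      # stands in for psi
    trace: Trace = []
    x = x0
    for _ in range(horizon):
        # Policy: map the state to a Gaussian over actions and sample from it.
        u = rng.gauss(gain * x, policy_std)
        trace.append((x, u, reward_fn(x, u)))
        # Dynamics: map (state, action) to a Gaussian over next states and sample.
        x = rng.gauss(a * x + b * u, dynamics_std)
    return trace


def quadratic_cost_reward(x: float, u: float) -> float:
    """Example reward: negative quadratic cost on state and action."""
    return -(x * x + u * u)


batch = [sample_trace(2.0, (-0.5, 0.1), (1.0, 1.0, 0.05),
                      quadratic_cost_reward, horizon=20, rng=random.Random(seed))
         for seed in range(8)]
print(len(batch), len(batch[0]))
```

A batch B is then simply a list of such traces, from which averages such as the gradient estimate can be formed.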
The gradient is given by the following equation:
The estimation can be done by means of the following formula:
That is, the right-hand side of equation (15) can be estimated empirically by averaging the quantities in square brackets over the batch B. This function is performed by the gradient estimator (308 in FIG. 3) in step 204 of FIG. 2.
As can be seen from the flow chart of FIG. 2, the next stage after the initialization in step 205 is a loop comprising steps 206 to 210, the ultimate purpose of which is to produce the estimate used in the dynamics update. This is done by generating v_1, v_2, …, v_M and finally averaging them, where each v_i is an individual estimate. To generate v_i, the following operations are performed. In step 206, a sample is drawn, given by the following equation:
The samples are drawn by applying NN1 303 and its sampler 304, and NN3 305 and its sampler 302. The samples are input into NN3 305, and the output of NN3 is fed as an input to a Gaussian Wasserstein Computing Network (GWCN) 309. As shown in FIG. 3, the GWCN also takes the corresponding output for the reference dynamics as an input. Internally, in step 207, it calculates and outputs:
This is equal to the square of the 2-Wasserstein distance between multivariate normal distributions. Since NN3 305 feeds directly into GWCN 309, in step 208 the inverse Hessian vector product is efficiently computed using an auto-differentiation engine (e.g., autograd: https:// github
The loop continues until the required number of training scenarios has been completed. Each of v_1, v_2, …, v_M is provided to a Dynamic Parameter Update Calculator (DPUC) 310, which calculates their average and takes it as the estimate. In step 211, this is used by the DPUC to calculate ψ_{k+1}:
In step 212, the DPUC feeds ψ_{k+1} into a Policy Parameters Update Engine (PPUE) 311, which computes θ_{k+1} and feeds it into NN4 306. This completes the loop.
Then, the new policy parameters θ_{k+1} are used in the next iteration of the optimization problem, and the above steps are repeated.
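The quantity computed in step 207 has a well-known closed form. For two multivariate normal distributions with means m_a, m_b and covariances S_a, S_b, the squared 2-Wasserstein distance is ||m_a - m_b||^2 + Tr(S_a + S_b - 2 (S_b^{1/2} S_a S_b^{1/2})^{1/2}). The sketch below covers only the special case of diagonal covariances, an assumption made here to keep the example free of matrix square roots, where the trace term reduces to a sum of squared differences of standard deviations. The function name is hypothetical.

```python
from typing import Sequence


def gaussian_w2_squared_diag(mean_a: Sequence[float], std_a: Sequence[float],
                             mean_b: Sequence[float], std_b: Sequence[float]) -> float:
    """Squared 2-Wasserstein distance between two Gaussians with diagonal covariances.

    std_a and std_b hold per-dimension standard deviations (square roots of the
    diagonal covariance entries).
    """
    if not (len(mean_a) == len(std_a) == len(mean_b) == len(std_b)):
        raise ValueError("all inputs must have the same dimension")
    mean_term = sum((ma - mb) ** 2 for ma, mb in zip(mean_a, mean_b))
    # For commuting (here diagonal) covariances the trace term collapses to this sum.
    cov_term = sum((sa - sb) ** 2 for sa, sb in zip(std_a, std_b))
    return mean_term + cov_term


# Same covariance, means 5 apart: the squared distance is 25.
print(gaussian_w2_squared_diag([0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [1.0, 1.0]))
```

Because the expression is built from differentiable operations on the network outputs, an automatic differentiation engine can propagate derivatives through it back to the dynamics parameters.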
In another example, which will now be described with reference to fig. 4 to 6, ψkCorresponding to the parameters of the simulator.
In step 401 in FIG. 4, the system accesses the parameters ψ_0, which are the parameters of the simulator that represent the reference dynamics. In step 402, θ_0 is set to a zero vector and the Hessian estimate is set to a zero matrix. The system then enters a loop that ends with an estimate of H_0. This is achieved by applying what is known in the literature as an "evolution strategy".
In this embodiment, H_0 is estimated using the following formula:
where x = ψ_0, and F (see equation (10)) is estimated by computing the Wasserstein distance between empirical distributions of points generated using the sample ψ_0 + ε.
The dynamics are given by a d-dimensional vector that parameterizes the simulator. A random vector ε ~ N(0, σ²I) ∈ R^d is sampled by the multivariate Gaussian sampler 603; the system sets ψ ← ψ_0 + ε and passes it to the simulator 601, as shown in step 404.
In step 405, the system then enters subroutine A, shown in the flow chart of FIG. 5, which takes ψ as an input and draws samples as given by the following equation:
The neural network NN1, shown at 604 in FIG. 6, and its sampler 605 are used to sample the action, and the simulator 601 parameterized with ψ is used to perform the sampling. The samples are fed into the simulator 601 in steps 503 and 504, respectively, to produce two sets of samples. The samples are collected as data sets (shown in steps 505 and 506, respectively) and stored in memory. These data sets are fed into a Wasserstein Calculation Engine (WCE) 606 in order to calculate an empirical Wasserstein distance in step 507, which is taken as an estimate of:
This loop of steps 502 to 508 is repeated a number of times, and the average of the estimates is taken in step 510. This is done by passing the estimates from the memory 607 to the arithmetic mean calculator 608, as shown in step 509. The result is referred to as h_i in FIG. 4. The outer loop is repeated for each sample ε (calling subroutine A each time), and the average is calculated and taken as the estimate of H_0.
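A Hessian estimate of this evolution-strategy kind can be sketched with a standard Gaussian-smoothing identity: for ε = σz with z ~ N(0, I), the Hessian of the smoothed function is E[(z zᵀ - I) F(x + σz)] / σ². This identity is assumed here as a stand-in for the formula of the embodiment. The function `es_hessian` is hypothetical, and a cheap quadratic replaces the Wasserstein-based F so that the result can be checked.

```python
import random
from typing import Callable, List, Sequence


def es_hessian(f: Callable[[Sequence[float]], float], x: Sequence[float],
               sigma: float, n_samples: int, rng: random.Random) -> List[List[float]]:
    """Monte Carlo estimate of the Hessian of f at x from Gaussian perturbations."""
    d = len(x)
    f0 = f(x)  # baseline: E[z z^T - I] = 0, so subtracting it only reduces variance
    acc = [[0.0] * d for _ in range(d)]
    for _ in range(n_samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # Perturbation eps = sigma * z is distributed as N(0, sigma^2 I).
        delta = f([xi + sigma * zi for xi, zi in zip(x, z)]) - f0
        for i in range(d):
            for j in range(d):
                acc[i][j] += (z[i] * z[j] - (1.0 if i == j else 0.0)) * delta
    scale = 1.0 / (n_samples * sigma ** 2)
    return [[value * scale for value in row] for row in acc]


def toy_objective(v: Sequence[float]) -> float:
    """Quadratic with Hessian diag(2, 1), used only to sanity-check the estimator."""
    return v[0] ** 2 + 0.5 * v[1] ** 2


H = es_hessian(toy_objective, [0.0, 0.0], sigma=0.1, n_samples=100_000, rng=random.Random(0))
print(H)  # close to [[2, 0], [0, 1]], up to Monte Carlo noise
```

The estimator needs only function evaluations, which is why it suits a simulator whose internals cannot be differentiated; its variance is high, so many perturbations are averaged.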
As can be seen in FIG. 4, the system then enters a loop, the entry point being step 409, which samples a batch B of traces using θ_k and ψ_k. This is done using the simulator 601 and NN2 602 and its sampler 610. In step 410, the system uses B to estimate:
using the formula:
This function is performed by the gradient estimator 609 in FIG. 6.
In step 411, the system then calculates the required estimate by applying the conjugate gradient algorithm, in the conjugate gradient estimator 611, to the optimization problem:
In step 412, the result is passed to a new dynamics parameter calculation engine 612, which performs the following calculation:
This determines the new dynamics parameters ψ_{k+1}. The new parameters are then passed to the policy parameter update engine 613, which calculates the new policy parameters θ_{k+1} in step 413. The update of the policy parameters may be performed by algorithms such as PPO (described in Schulman, John, Filip Wolski, Prafulla Dhariwal, Alec Radford and Oleg Klimov, "Proximal Policy Optimization Algorithms", arXiv preprint arXiv:1707.06347 (2017)) or TRPO (described in Schulman, J., Levine, S., Abbeel, P., Jordan, M. and Moritz, P. (2015, June), "Trust Region Policy Optimization", International Conference on Machine Learning (pp. 1889-1897)), or by "vanilla" policy gradient updates.
Then, the new policy parameters θ_{k+1} are used in the next iteration of the optimization, and the above steps are repeated.
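The conjugate gradient step can be sketched as below. The routine solves H v = g for a symmetric positive definite H using only Hessian-vector products, which is the usual reason for choosing conjugate gradient when H is never formed explicitly. The names `conjugate_gradient` and `hvp` are illustrative, and the 2 × 2 matrix is a made-up example rather than a quantity from the embodiment.

```python
from typing import Callable, List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def conjugate_gradient(hvp: Callable[[Vector], Vector], g: Vector,
                       iters: int = 50, tol: float = 1e-20) -> Vector:
    """Solve H v = g for symmetric positive definite H, given only v -> H v."""
    v = [0.0] * len(g)
    r = list(g)            # residual g - H v, with v = 0 initially
    p = list(r)            # search direction
    rs_old = dot(r, r)
    for _ in range(iters):
        if rs_old < tol:   # residual small enough: converged
            break
        hp = hvp(p)
        alpha = rs_old / dot(p, hp)
        v = [vi + alpha * pi for vi, pi in zip(v, p)]
        r = [ri - alpha * hpi for ri, hpi in zip(r, hp)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return v


H_example = [[4.0, 1.0], [1.0, 3.0]]


def hvp_example(vec: Vector) -> Vector:
    """Hessian-vector product for the explicit example matrix."""
    return [dot(row, vec) for row in H_example]


v = conjugate_gradient(hvp_example, [1.0, 2.0])
print(v)  # approximately [1/11, 7/11]
```

In exact arithmetic the method converges in at most d iterations for a d-dimensional problem, so a small iteration budget is usually enough.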
Using the parameters of a simulator is advantageous over the general approach described above because, in general, a simulator can have its parameters varied in a manner consistent with a certain set of rules. For example, a simulator for a physical system such as a robot or automobile would allow quantities such as friction, mass and length to vary, while the system still obeys Newton's laws.
Alternatively, the parameters used in this embodiment may come from a differential equation solver, which may enable the dynamics to be parameterized by a vector ψ_k ∈ R^d.
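To make the point about rule-consistent perturbations concrete, here is a minimal sketch of a parameterized simulator. It is a made-up example: a point mass with viscous friction integrated with an explicit Euler step, where the hypothetical `SimParams` plays the role of ψ. Whatever values the parameters take, every step still follows Newton's second law.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SimParams:
    mass: float      # kg
    friction: float  # viscous friction coefficient, N*s/m


def step(pos: float, vel: float, force: float,
         params: SimParams, dt: float = 0.01) -> Tuple[float, float]:
    """Advance the point mass by one explicit Euler step."""
    # Newton's second law: m * a = F - c * v, for any admissible parameter values.
    acc = (force - params.friction * vel) / params.mass
    return pos + dt * vel, vel + dt * acc


def perturb(params: SimParams, d_mass: float, d_friction: float) -> SimParams:
    """Return a nearby parameter vector, e.g. psi0 plus a small perturbation."""
    return SimParams(params.mass + d_mass, params.friction + d_friction)


psi0 = SimParams(mass=2.0, friction=0.0)
psi = perturb(psi0, d_mass=2.0, d_friction=0.0)
print(step(0.0, 1.0, 2.0, psi0, dt=0.5))  # (0.5, 1.5)
print(step(0.0, 1.0, 2.0, psi, dt=0.5))   # (0.5, 1.25): heavier mass, smaller acceleration
```

Perturbing the parameters changes the dynamics, but only within the family of physically consistent systems, which is the advantage over perturbing the weights of an unconstrained neural network.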
FIG. 7 summarizes a method for performing reinforcement learning to generate a solution set of values that may be used as parameters in a model such that the model provides a level of performance with respect to a performance metric. In step 701, the method includes forming a candidate solution including a set of candidate parameter values. The method then includes repeating steps 702-704 below. In step 702, a first evaluation is made of the quality of the candidate solution by evaluating the degree to which the model with the values of the candidate solution provides a high level of performance for the performance metric. In step 703, a second evaluation is made of the quality of the candidate solution by evaluating the extent to which the model with the value of the candidate solution fails to provide a low level of performance with respect to the performance metric. In step 704, another candidate solution is formed based on the first evaluation and the second evaluation.
In the above method, a model is formed for analyzing data and providing an indication of one or more attributes of the input data. The model is generic and operates according to values that control its behavior. For example, the model may be a neural network and the values may be weights applied to the network. The above system selects the values by training against reference or training data. The training data includes a set of possible input data for the model and a corresponding expected output for each input. To select these values, the system runs a first loop and a second loop, the second loop running inside the first. In the first (outer) loop, a set of candidate values is formed by selecting for high performance of the model configured with those values on the training data. In other words, the set of candidate values is selected or evaluated according to whether there is a relatively high correspondence between the expected output of the training data and the output of the model configured with those values when given the training data as input. In the second (inner) loop, the set of candidate values is tested for whether the model configured with those values performs poorly on the training data. In other words, the set of candidate values is selected or evaluated according to whether there is a relatively low correspondence between the expected output of the training data and the output of the model configured with those values when given the training data as input. The process is repeated a number of times, with the set of candidate values for each iteration being selected such that it has been determined to exhibit a relatively high propensity for good performance (e.g., relative to the previous set of candidate values) and a relatively low propensity for poor performance.
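A minimal Python sketch of the two nested loops is given below. It assumes a one-weight linear model `w * x`, a squared-error performance measure, and an inner loop that probes each training input at a small hypothetical offset `radius` on either side; none of these choices come from the description.

```python
def worst_case_loss_and_grad(w, x, y, radius):
    """Inner loop: among inputs near x, find the one where the model w * x fits y worst."""
    worst = None
    for delta in (-radius, 0.0, radius):
        x_adapted = x + delta
        error = w * x_adapted - y
        if worst is None or error * error > worst[0]:
            # Keep the squared error and its derivative with respect to w.
            worst = (error * error, 2.0 * error * x_adapted)
    return worst


def train_robust(data, radius=0.1, lr=0.005, n_iters=500):
    """Outer loop: improve the weight against the worst case found by the inner loop."""
    w = 0.0
    for _ in range(n_iters):
        grad = sum(worst_case_loss_and_grad(w, x, y, radius)[1] for x, y in data)
        w -= lr * grad
    return w


training_data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # inputs with expected outputs y = 2x
w = train_robust(training_data)
```

On this toy data the learned weight stays close to 2, the value that reproduces the expected outputs, while each update is driven by the worst-performing nearby input rather than by the training input itself.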
The approach described herein helps to address the fragility problem in reinforcement learning and, in particular, may provide a system and method for learning to control a system in a manner that is robust to changes in dynamics. Thus, the present invention can enable users to train on simulators and deploy in the real world with good performance.
Embodiments of the present invention may provide advantages over previous approaches. The method of the invention performs better in experiments and can be used with continuous state and action spaces. In particular, the method of Lecarpentier and Rachelson, "Non-Stationary Markov Decision Processes, a Worst-Case Approach using Model-Based Reinforcement Learning" (arXiv:1904.10090, 2019), is applicable only to scenarios with finite, discrete state and action spaces. The described method operates in continuous state and action spaces, which is a fundamental requirement for practical adaptation to real-world environments.
A particularly advantageous application of the invention is the autonomous vehicle. The dynamics of an automobile can vary due to a number of factors, including changes in road surface, road inclination, tire pressure, friction, and the weight carried. In this example, the car clearly does not experience a single fixed set of dynamics over its lifetime, or even over a short period of time. The inventive algorithm is more robust to such changes and can cope with new dynamics without having to learn to cope with them in the environment.
Thus, the approaches disclosed herein may learn control policies that are robust to variations in dynamics between the environment in which they are trained and the environment in which they are deployed.
In the above description, dynamics within a predetermined Wasserstein distance of the reference dynamics are considered during training of the control strategy. However, other metrics may be used as the distance function.
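To make the distance constraint concrete, the Python sketch below computes the Wasserstein-1 distance between two equal-size one-dimensional samples, for which the distance reduces to the mean absolute difference of the sorted values. Treating the dynamics as such samples (for example, next-state values drawn under the reference and candidate dynamics) and the helper names are assumptions for illustration. A different metric could be substituted by replacing `wasserstein_1d`.

```python
def wasserstein_1d(samples_a, samples_b):
    """W1 distance between two equal-size empirical distributions on the real line."""
    if len(samples_a) != len(samples_b) or not samples_a:
        raise ValueError("expected two non-empty samples of equal size")
    # In one dimension the optimal coupling matches sorted samples in order.
    pairs = zip(sorted(samples_a), sorted(samples_b))
    return sum(abs(a - b) for a, b in pairs) / len(samples_a)


def within_ball(reference, candidate, radius):
    """True if the candidate samples lie within the given distance of the reference."""
    return wasserstein_1d(reference, candidate) <= radius
```

For example, `within_ball([0.0, 1.0, 2.0], [0.1, 1.1, 2.1], 0.2)` accepts a candidate whose samples are shifted by 0.1, whereas a shift of 1.0 would fall outside a radius of 0.5.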
The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations can be carried out based on the present description as a whole in light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any of the problems disclosed herein, and without limitation to the scope of the claims. The applicant indicates that aspects of the invention may consist of any such individual feature or combination of features. In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the invention.
Claims (15)
1. A system for performing reinforcement learning to generate a solution set of values that can be used as parameters in a model to cause the model to provide a level of performance with respect to a performance metric, the system being configured to: forming a candidate solution comprising a set of candidate parameter values; repeatedly executing the following steps:
first evaluating the quality of the candidate solution by evaluating a degree to which a model having the value of the candidate solution provides a high level of performance for the performance metric;
second evaluating the quality of the candidate solution by evaluating a degree to which a model having the value of the candidate solution fails to provide a low level of performance for the performance metric;
forming another candidate solution based on the first evaluation and the second evaluation.
2. The system of claim 1, wherein the system is configured to evaluate the quality of the candidate solution by testing the behavior of the model configured according to the values of the candidate solution when applied to a set of reference values.
3. The system of claim 2, wherein the system is further configured to evaluate the quality of the candidate solution by: generating a set of adapted reference values comprising one or more adapted reference data items in the vicinity of at least some of the reference values in the set of reference values; testing the behavior of the model configured according to the values of the candidate solution when applied to the set of adapted reference values.
4. The system of claim 3, wherein the one or more adapted reference data items are within a predetermined Wasserstein distance of the set of reference values.
5. The system according to claim 3 or 4, wherein the set of adapted reference values represents worst-case values in the set of reference values.
6. The system according to any one of claims 2 to 5, wherein the set of reference values comprises parameters of a neural network.
7. The system of any one of claims 2 to 5, wherein the set of reference values comprises values output from a simulator or differential equation solver.
8. The system according to any one of claims 2 to 7, wherein the set of reference values comprises a set of reference dynamics.
9. The system of any preceding claim, wherein the system is configured to perform an optimization comprising the first and second evaluations of the quality of the candidate solution.
10. The system of any preceding claim, wherein the model is a trained artificial intelligence model.
11. The system of claim 10, wherein the model is a neural network.
12. A method for performing reinforcement learning to generate a solution set of values that can be used as parameters in a model to cause the model to provide a level of performance with respect to a performance metric, the method comprising:
forming a candidate solution comprising a set of candidate parameter values; repeatedly executing the following steps:
first evaluating the quality of the candidate solution by evaluating a degree to which a model having the value of the candidate solution provides a high level of performance for the performance metric;
second evaluating the quality of the candidate solution by evaluating a degree to which a model having the value of the candidate solution fails to provide a low level of performance for the performance metric;
forming another candidate solution based on the first evaluation and the second evaluation.
13. The method of claim 12, wherein the evaluating the quality of the candidate solution comprises testing a behavior of the model configured according to the values of the candidate solution when applied to a set of reference values.
14. The method of claim 13, wherein said evaluating the quality of said candidate solution further comprises: generating a set of adapted reference values comprising one or more adapted reference data items in the vicinity of at least some of the reference values in the set of reference values; testing the behavior of the model configured according to the values of the candidate solution when applied to the set of adapted reference values.
15. The method of claim 14, wherein the one or more adapted reference data items are within a predetermined Wasserstein distance of the set of reference values.
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/EP2019/069101 WO2021008691A1 (en) | 2019-07-16 | 2019-07-16 | Learning to robustly control a system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114270375A (en) | 2022-04-01 |
Family
ID=67314771
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201980098404.XA (Pending; published as CN114270375A (en)) | 2019-07-16 | 2019-07-16 | Learning to robustly control a system |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN114270375A (en) |
WO (1) | WO2021008691A1 (en) |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP3465236B2 (en) | 2000-12-20 | 2003-11-10 | 科学技術振興事業団 | Robust reinforcement learning method |
US6665651B2 (en) | 2001-07-18 | 2003-12-16 | Colorado State University Research Foundation | Control system and technique employing reinforcement learning having stability and learning phases |
CN107856035A (en) | 2017-11-06 | 2018-03-30 | 深圳市唯特视科技有限公司 | A kind of robustness dynamic motion method based on intensified learning and whole body controller |
-
2019
- 2019-07-16 CN CN201980098404.XA patent/CN114270375A/en active Pending
- 2019-07-16 WO PCT/EP2019/069101 patent/WO2021008691A1/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
WO2021008691A1 (en) | 2021-01-21 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||