CN114270375A - Learning to robustly control a system - Google Patents

Learning to robustly control a system

Info

Publication number
CN114270375A
CN114270375A (application CN201980098404.XA)
Authority
CN
China
Prior art keywords
candidate solution
model
values
evaluating
performance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201980098404.XA
Other languages
Chinese (zh)
Inventor
Mohammed Abdullah
Haitham Bou Ammar
Hang Ren
Mingtian Zhang
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Publication of CN114270375A publication Critical patent/CN114270375A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/004 Artificial life, i.e. computing arrangements simulating life
    • G06N 3/006 Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/047 Probabilistic or stochastic networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A system for performing reinforcement learning to generate a solution set of values that can be used as parameters in a model to cause the model to provide a level of performance with respect to a performance metric, the system being configured to: form a candidate solution comprising a set of candidate parameter values; and repeatedly perform the following steps: first evaluating the quality of the candidate solution by evaluating the degree to which a model having the values of the candidate solution provides a high level of performance for the performance metric; second evaluating the quality of the candidate solution by evaluating the degree to which a model having the values of the candidate solution fails to provide a low level of performance for the performance metric; and forming another candidate solution based on the first evaluation and the second evaluation.

Description

Learning to robustly control a system
Technical Field
The present invention relates to avoiding fragility in reinforcement learning, and in particular to learning to control a system in a manner that is robust to changes in dynamics.
Background
In Reinforcement Learning (RL), a system is modeled as a Markov Decision Process (MDP). This is defined as the tuple ⟨X, U, p, r, γ⟩, where X is the state space, U is the action space, p(·|x, u) is the probability distribution over the next state for each state-action pair (x, u), r(x, u) is the reward (a positive or negative real number), and γ is the discount factor. The probability distribution p is called the dynamics.
The controller of the MDP is called a policy. It is typically implemented as a probability distribution over actions given the current state x, denoted π(·|x). An MDP equipped with an initial state distribution and a policy yields a Markov reward process (MRP). This induces a probability distribution over trajectories (a trajectory is a sequence of states, actions and rewards).
The standard goal in RL is to optimize the expected return, i.e. the total discounted reward:
J(π) = E_{x_0 ~ μ, u_t ~ π(·|x_t), x_{t+1} ~ p(·|x_t, u_t)} [ Σ_{t=0}^∞ γ^t r(x_t, u_t) ]
where μ is the initial state distribution.
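For concreteness, the expected return can be estimated by Monte Carlo rollouts. The following sketch is illustrative only and is not taken from the patent; the tabular MDP representation and the truncated horizon are assumptions made for the example.

```python
import numpy as np

def rollout_return(p, r, pi, mu, gamma=0.99, horizon=200, rng=None):
    """Sample one trajectory from the MDP and return its discounted return.

    p[x, u] : next-state distribution for state x and action u, shape (X, U, X)
    r[x, u] : reward for taking action u in state x, shape (X, U)
    pi[x]   : action distribution of the policy in state x, shape (X, U)
    mu      : initial state distribution, shape (X,)
    """
    rng = rng or np.random.default_rng()
    x = rng.choice(len(mu), p=mu)
    ret, discount = 0.0, 1.0
    for _ in range(horizon):          # truncation approximates the infinite sum
        u = rng.choice(pi.shape[1], p=pi[x])
        ret += discount * r[x, u]
        discount *= gamma
        x = rng.choice(p.shape[2], p=p[x, u])
    return ret

def expected_return(p, r, pi, mu, n_rollouts=1000, **kwargs):
    """Monte Carlo estimate of J(pi) = E[ sum_t gamma^t r(x_t, u_t) ]."""
    return float(np.mean([rollout_return(p, r, pi, mu, **kwargs)
                          for _ in range(n_rollouts)]))
```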
the dynamic p is assumed to be fixed, i.e. if the same action is taken in a given state, the distribution in the following state is the same as if the action was taken in that state at other times. The dynamics are independent of any control strategy applied to the MDP.
This aspect of the fixed dynamics of the MDP is a basic assumption in the standard RL algorithm. However, several problems arise from this.
If a policy is trained on one MDP and then deployed on an MDP with different dynamics, the policy typically underperforms, i.e. policies tend to be fragile with respect to changes in dynamics. For example, if an RL agent is trained using a simulator (e.g. a simulator of a car or a robot) and then deployed in a real physical system, the simulator will be imperfect and its simulated dynamics will not be exactly the same as the real-world dynamics.
Another problem is that changes in dynamics occurring at different times in the real world will affect the results. For example, the driver of a car may experience different dynamics due to, for example, the road surface, differences in the load carried, or tire pressure. Any machine may experience differences in friction due to changes in temperature or lubrication. Any algorithm that produces control policies which break down under such changes in dynamics is clearly impractical. One of the reasons why RL has so far not been particularly successful outside the laboratory, or outside controlled environments such as games, is this lack of robustness.
Previous approaches have attempted to develop policies that are more robust to such changes in dynamics. For example, in Tessler et al., "Action Robust Reinforcement Learning and Applications in Continuous Control" (ICML 2019), the problem is framed as a zero-sum game. The policy is made robust according to one of the following action criteria: (i) taking a different, possibly adversarial, action with fixed probability, or (ii) adding perturbations to the actions themselves. However, while the algorithm performs well in some MuJoCo tasks, it performs poorly in others (such as swing-up).
Another approach, described by Pinto et al. in "Robust Adversarial Reinforcement Learning" (ICML 2017), also frames the problem as a zero-sum game, with robustness learned iteratively by alternating updates of the adversary and protagonist policies. "Non-Stationary Markov Decision Processes: a Worst-Case Approach using Model-Based Reinforcement Learning" by Lecarpentier and Rachelson (arXiv:1904.10090, 2019) models dynamics that change over time, limited by a Wasserstein distance per unit time. A tree-search algorithm is used to solve for the worst case, in which the environment is adversarial. Experiments were conducted in grid worlds, i.e. on small-scale examples, and the method does not appear to scale to continuous state and action spaces.
Other methods are described in the following documents: JP 3465236 B2, based on the H-infinity technique from classical control theory; CN 107856035 A, which is specific to a particular class of problems rather than a general RL algorithm; and US 6665651 B2, which relies on a controller to train the neural network, with an emphasis on the stability of the learning process.
It is desirable to learn to control a system in a manner that improves robustness to changes in dynamics.
Disclosure of Invention
There is provided a system for performing reinforcement learning to generate a solution set of values that can be used as parameters in a model to cause the model to provide a level of performance with respect to a performance metric, the system being configured to: form a candidate solution comprising a set of candidate parameter values; and repeatedly perform the following steps: first evaluating the quality of the candidate solution by evaluating the degree to which a model having the values of the candidate solution provides a high level of performance for the performance metric; second evaluating the quality of the candidate solution by evaluating the degree to which a model having the values of the candidate solution fails to provide a low level of performance for the performance metric; and forming another candidate solution based on the first evaluation and the second evaluation.
Thus, the system can evaluate the overall performance of the policy, to optimize the performance of the model, and can evaluate the model with these parameters under worst-case dynamics, to minimize the occurrence of low-level performance. The system may then iteratively form another candidate solution for the policy parameters based on these first and second evaluations.
The system may be used to evaluate the quality of the candidate solution by testing the behavior of the model configured from the values of the candidate solution when applied to a set of reference values. This may enable a convenient way of assessing the quality of the candidate solution.
The system may be further configured to evaluate the quality of the candidate solution by: generating a set of adapted reference values comprising one or more adapted reference data items in the vicinity of at least some of the reference values in the set of reference values; testing the behavior of the model configured according to the values of the candidate solution when applied to the set of adapted reference values. This may enable testing the quality of the candidate solution for dynamics different from the reference dynamics.
The one or more adapted reference data items may be within a predetermined Wasserstein distance of the reference value set. The adapted set of reference values may represent worst case values of the set of reference values. Thus, the quality of the strategy can be evaluated in the worst case scenario within the Wasserstein ball. By evaluating the worst case dynamics of the strategy, the robustness of the strategy can be improved. Therefore, the Wasserstein metric may help measure the degree of "error" of a model or simulator in a useful and intuitively reasonable manner.
The set of reference values may comprise parameters of a neural network. This approach may enable efficient computation of the updated reference value and the further candidate solution.
The set of reference values may include values output from a simulator or a differential equation solver. Such an approach may enable efficient computation of updated reference values and of a further candidate solution, and may enable the reference values to vary in a manner consistent with a certain set of rules. For example, a simulator for a physical system such as a robot or a car would allow quantities such as friction, mass and length to vary, but the system should still obey Newton's laws.
The set of reference values may comprise a set of reference dynamics. This may enable the system to be applied to real-world dynamic situations.
The system may be configured to perform an optimization including the first and second evaluations of the quality of the candidate solution.
The model may be a trained artificial intelligence model. The model may be a neural network.
According to a second aspect, there is provided a method of performing reinforcement learning to generate a solution set of values that can be used as parameters in a model to cause the model to provide a level of performance with respect to a performance metric, the method comprising: forming a candidate solution comprising a set of candidate parameter values; repeatedly executing the following steps: first evaluating the quality of the candidate solution by evaluating a degree to which a model having the value of the candidate solution provides a high level of performance for the performance metric; second evaluating the quality of the candidate solution by evaluating a degree to which a model having the value of the candidate solution fails to provide a low level of performance for the performance metric; forming another candidate solution based on the first evaluation and the second evaluation.
Thus, the method can evaluate the overall performance of the policy, to optimize the performance of the model, and can evaluate the model with these parameters under worst-case dynamics, to minimize the occurrence of low-level performance. The method may then iteratively form another candidate solution for the policy parameters based on these first and second evaluations.
The evaluating the quality of the candidate solution may include testing a behavior of the model configured according to the values of the candidate solution when applied to a set of reference values. This may enable a convenient way of assessing the quality of the candidate solution.
The evaluating the quality of the candidate solution may further comprise: generating a set of adapted reference values comprising one or more adapted reference data items in the vicinity of at least some of the reference values in the set of reference values; testing the behavior of the model configured according to the values of the candidate solution when applied to the set of adapted reference values. This may enable testing the quality of the candidate solution for dynamics different from the reference dynamics.
The one or more adapted reference data items may be within a predetermined Wasserstein distance of the reference value set. Thus, the quality of the strategy can be evaluated in the worst case scenario within the Wasserstein ball. By evaluating the worst case dynamics of the strategy, the robustness of the strategy can be improved. Therefore, the Wasserstein metric may help measure the degree of "error" of a model or simulator in a useful and intuitively reasonable manner.
Drawings
The invention will now be described by way of example with reference to the accompanying drawings.
In the drawings:
FIG. 1 illustrates iteratively updating parameterized policies and dynamics;
FIG. 2 shows a flow chart of a method provided by an embodiment of the invention;
FIG. 3 illustrates an example of a system for implementing the method illustrated in FIG. 2;
FIG. 4 shows a flow chart of a method provided by yet another embodiment of the present invention;
FIG. 5 shows a flow diagram of a method provided by a subroutine executed as part of the method shown in FIG. 4;
FIG. 6 illustrates an example of a system for implementing the methods shown in FIGS. 4 and 5;
FIG. 7 illustrates an example of a method of performing reinforcement learning to generate a solution set that can be used as values for parameters in a model.
Detailed Description
The present invention relates to systems and methods for performing reinforcement learning to generate a solution set of values that can be used as parameters in a model, in a manner that is robust to variations in dynamics.
In the standard RL algorithm, the quality of a candidate solution for the model parameters is evaluated by testing the behavior of the model, configured according to the values of the candidate solution, when applied to a set of reference dynamics p_0 (the input to the system). In the standard algorithm these reference dynamics are used to train the control policy, without regard to other possible dynamics. In an embodiment of the present invention, a set of possible dynamics distributed around (e.g. centred on) p_0 is considered. These reference values represent the starting point for training the model.
In one embodiment, dynamics within a predetermined Wasserstein distance of the reference dynamics are considered during training of the control policy. The Wasserstein distance is used as a measure of divergence between dynamics, and is defined for probability measures over a metric space. Specifically, assume that the state space X is a metric space with metric d(·,·). Let M(S) denote the set of probability measures over a set S. The set of couplings of the probability measures μ and ν is defined as:
C(μ, ν) = { κ ∈ M(X × X) : κ(A × X) = μ(A), κ(X × B) = ν(B) for all measurable A, B ⊆ X }
That is, it is the set of probability measures on the product space whose marginal along one dimension is μ and along the other dimension is ν. The p-Wasserstein distance is defined as:
W_p(μ, ν) = ( inf_{κ ∈ C(μ, ν)} ∫_{X × X} d(x, y)^p dκ(x, y) )^{1/p}
in general, the Wasserstein distance has no closed form solution, but can be numerically estimated. However, the square 2-Wasserstein between Gauss has a closed form.
The set of dynamics considered consists of those dynamics p that, when measured by the Wasserstein metric, are within some predefined limit of p_0. The Wasserstein metric may be any member of the class of Wasserstein metrics. The predefined limit may be referred to as the ε-Wasserstein ball around p_0, defined as
B_ε(p_0) = { p : W(p, p_0) ≤ ε }
where W denotes the chosen Wasserstein metric.
The system is used to learn the best control policy, where the quality of a policy π is the standard RL objective function, but evaluated at the worst-case dynamics (for π) within the Wasserstein ball. By remaining "pessimistic" about the dynamics that may be encountered (i.e. by evaluating the policy under worst-case dynamics), the robustness of the policy can be increased. The Wasserstein metric therefore helps measure the degree to which a model or simulator is "wrong" in a useful and intuitively reasonable manner.
The Wasserstein distance has the form (distance × probability mass). Thus, if this product is constrained to be below a certain value, then where the distance is large the probability must be small, and where the probability is large the distance must be small. That is, the model may be severely wrong (large distance), but only with low probability; or it may be wrong with high probability, but then it cannot be too inaccurate. If the reference dynamics were frequently and severely inaccurate, they would be useless, training on them would be meaningless, and attempting to achieve robustness would ultimately be futile.
Not all dynamics in the Wasserstein ball are plausible; for example, some dynamics may violate Newton's laws of motion. In the present invention, the dynamics may be perturbed but remain plausible. With this in mind, the policy π is parameterized using a vector θ and written as π_θ. The parameters θ are the parameters (weights) of a neural network, and may be updated over time.
The dynamics are likewise parameterized using another vector ψ. Again, these parameters ψ may be updated over time. They may be, for example, the parameters of a neural network, of a simulator, or of another processor that implements the system dynamics (e.g. a differential equation solver or a real system). The parameter vector corresponding to the reference dynamics p_0 is denoted ψ_0.
The system may generate an adapted set of reference dynamics parameters comprising adapted reference data within a predetermined Wasserstein distance of ψ_0. The system then evaluates the quality of the candidate solution's policy parameters by testing the behavior of the model, configured according to the values of the candidate solution, when applied to the adapted reference dynamics parameters.
The system performs a first evaluation of the quality of the candidate solution by evaluating the extent to which the model with the values of the candidate solution provides a high level of performance with respect to the performance metric, i.e. the system evaluates the overall performance of the policy in order to optimize the performance of the model. The system also performs a second evaluation of the quality of the candidate solution by evaluating the extent to which the model with the values of the candidate solution fails to provide a low level of performance with respect to the performance metric, i.e. the system evaluates the model with these parameters under worst-case dynamics, aiming to minimize the occurrence of low-level performance. The system iteratively forms another candidate solution for the policy parameters based on the first and second evaluations.
The optimization problem to be solved by the system can be expressed as:
max_θ min_{ψ : p_ψ ∈ B_ε(p_{ψ_0})} J(θ, ψ)    (6)
where, in the continuing RL setting,
J(θ, ψ) = E_{(x, u) ~ d_{θ, ψ}} [ r(x, u) ]
and in the episodic RL setting J(θ, ψ) is defined analogously in terms of the episodic visitation distribution. Here the d_{θ, ψ} are "occupancy measures": in the continuing case, d_{θ, ψ} is the stationary distribution of the Markov chain induced by the MDP with dynamics ψ and policy π_θ; in the episodic case, it is a probability distribution that serves a purpose similar to that of the stationary distribution.
The system iteratively solves an approximation of the above optimization problem, as shown in FIG. 1. The updated dynamics parameters ψ_{k+1} (indicated at 101, 102) are used to update the policy parameters θ_{k-1} of the previous candidate solution and, as shown at 104, the updated policy parameters θ_k (shown at 103) are used in the next iteration of the optimization problem.
The inner optimization problem is given by the following formula:
min_ψ J(θ_k, ψ)  subject to  (1/2) (ψ - ψ_0)^T H_0 (ψ - ψ_0) ≤ ε
which approximates the Wasserstein constraint by a second-order expansion around ψ_0. The term H_0 is an estimate of the Hessian of the function F evaluated at ψ_0, where
F(ψ) = W_2^2(p_ψ, p_{ψ_0})    (10)
The solution to the above problem is given by the following equation:
ψ_{k+1} = ψ_0 - sqrt( 2ε / (g_k^T H_0^{-1} g_k) ) H_0^{-1} g_k
where
g_k = ∇_ψ J(θ_k, ψ) |_{ψ = ψ_k}
The optimization problem defined above can therefore, under certain assumptions (e.g. that H_0^{-1} exists and is symmetric positive definite), be used to compute an update of the dynamics parameters ψ.
The updated dynamic parameters are then used to update the policy parameters, which are then used in subsequent iterations of the optimization, as shown in FIG. 1.
Thus, a high-level strategy for solving the optimization problem in equation (6) has an outer loop that updates θ and an inner loop that solves the minimization problem for a given fixed θ.
Thus, the system learns the optimal control policy, where the quality of a policy is the standard RL objective function, but evaluated under the worst-case dynamics within the Wasserstein ball. By being pessimistic about the dynamics that the model may encounter, the robustness of the policy can be increased.
Two exemplary embodiments of the present invention will now be described.
A general method in which a Gaussian distribution is defined using a neural network will first be described with reference to FIGS. 2 and 3. The method enables efficient computation of the updated dynamics parameters ψ_{k+1}.
In this embodiment, ψ_k parameterizes a neural network which outputs the mean and covariance of a Gaussian, i.e. μ_{ψ_k}(x, u) and Σ_{ψ_k}(x, u).
The required quantities, in particular inverse-Hessian-vector products of the form H_0^{-1} v, are computed efficiently using a conjugate gradient algorithm and automatic differentiation (e.g. autograd, see https://github.com/HIPS/autograd) applied to the following optimization problem, whose unique minimizer is H_0^{-1} v:
min_z ( (1/2) z^T H_0 z - z^T v )
In step 201 of the flow chart of FIG. 2, the system 300 may access the parameters ψ_0. These are the parameters of a neural network (NN) NN2, shown at 301 in the system diagram of FIG. 3, representing the reference dynamics p_0. NN2 301 takes a state-action pair (x, u) as input and outputs a mean vector and covariance matrix, μ_{ψ_0}(x, u) and Σ_{ψ_0}(x, u). These are fed into a sampler 302, which samples the next state x' from the multivariate Gaussian distribution with that mean and covariance. This is a standard approach to dynamics modeling in reinforcement learning.
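A minimal sketch of such a Gaussian dynamics network is given below. The single hidden layer and diagonal covariance are simplifying assumptions made for illustration only, not features taken from the patent.

```python
import numpy as np

class GaussianDynamicsNet:
    """Minimal dynamics network: (x, u) -> mean and (diagonal) covariance of a
    Gaussian over the next state. The parameters psi are the network weights."""

    def __init__(self, state_dim, action_dim, hidden=64, rng=None):
        rng = rng or np.random.default_rng(0)
        in_dim = state_dim + action_dim
        self.W1 = 0.1 * rng.standard_normal((in_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = 0.1 * rng.standard_normal((hidden, 2 * state_dim))
        self.b2 = np.zeros(2 * state_dim)
        self.state_dim = state_dim

    def forward(self, x, u):
        h = np.tanh(np.concatenate([x, u]) @ self.W1 + self.b1)
        out = h @ self.W2 + self.b2
        mean = out[: self.state_dim]
        var = np.exp(out[self.state_dim :])   # diagonal covariance
        return mean, var

    def sample_next_state(self, x, u, rng=None):
        rng = rng or np.random.default_rng()
        mean, var = self.forward(x, u)
        return mean + np.sqrt(var) * rng.standard_normal(self.state_dim)
```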
The system is also assumed to have access to a fixed policy π_θ, implemented as a neural network NN1, shown at 303 in FIG. 3. NN1 takes the state x as input and provides parameters to a sampler 304, which uses those parameters to sample an action according to a probability distribution. This is a standard way of implementing a stochastic policy in reinforcement learning.
In step 202 of FIG. 2, the system arbitrarily initializes the policy parameter vector θ_0 as a Euclidean vector. The system initializes the estimate of the Hessian matrix H_0 to a d × d zero matrix, where d is the dimension of the dynamics parameters ψ_k.
In step 203, the system then samples a batch B of trajectories using the latest parameters ψ_k and θ_k. This is done using NN3 and NN4, shown at 305 and 306 in FIG. 3, and their associated samplers 302 and 307. Since a trajectory is a sequence of (state, action, reward) triples, each new state-action pair is fed as input to NN3 305 (and on into its sampler) to sample the new state, and each new state is fed to NN4 306 (and on into its sampler) to sample the new action. It is assumed that there is a mechanism by which rewards can be extracted from the sampled trajectories, for example because the reward function is a known function of the state-action pair, or because a simulator is applied.
The gradient is given by the following equation:
g_k = ∇_ψ J(θ_k, ψ) |_{ψ = ψ_k}    (14)
and can be estimated by means of the following formula:
g_k ≈ (1/|B|) Σ_{τ ∈ B} [ ( Σ_t ∇_ψ log p_{ψ_k}(x_{t+1} | x_t, u_t) ) ( Σ_t γ^t r(x_t, u_t) ) ]    (15)
That is, the right-hand side of equation (15) is estimated empirically by averaging the quantity in square brackets over the batch B. This function is performed by the gradient estimator (308 in FIG. 3) in step 204 of FIG. 2.
As can be seen from the flow chart of FIG. 2, the next stage after the initialization in step 205 is a loop comprising steps 206 to 210, whose ultimate purpose is to estimate the inverse-Hessian-vector product H_0^{-1} g_k. This is done by generating v_1, v_2, …, v_M and finally averaging them, where each v_i is an estimate of the inverse-Hessian-vector product at a single sampled state-action pair. To generate v_i, the following operations are performed: in step 206, a sample state-action pair (x, u) is drawn. The samples are drawn by applying NN1 303 and its sampler 304, and NN3 305 and its sampler 302. The sample is input to NN3 305 to obtain the mean and covariance μ_{ψ_k}(x, u), Σ_{ψ_k}(x, u)
as the output of NN3. This is fed as an input to a Gaussian Wasserstein Computing Network (GWCN) 309. As shown in FIG. 3, the GWCN also takes the corresponding reference output μ_{ψ_0}(x, u), Σ_{ψ_0}(x, u) as an input. Internally, in step 207, it computes and outputs
W_2^2( N(μ_{ψ_k}(x, u), Σ_{ψ_k}(x, u)), N(μ_{ψ_0}(x, u), Σ_{ψ_0}(x, u)) )
i.e. the square of the 2-Wasserstein distance between the two multivariate normal distributions, using the closed form mentioned above. Since NN3 305 feeds directly into the GWCN 309, in step 208 the inverse-Hessian-vector product v_i is computed efficiently using an auto-differentiation engine (e.g. autograd: https://github.com/HIPS/autograd).
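The inverse-Hessian-vector products referred to in step 208 can be obtained by combining automatic differentiation with conjugate gradients, since conjugate gradients only require Hessian-vector products. The following sketch is illustrative only; the scalar function f standing in for the squared 2-Wasserstein distance as a function of ψ, and the use of scipy's solver, are assumptions.

```python
import autograd.numpy as np
from autograd import grad
from scipy.sparse.linalg import LinearOperator, cg

def inverse_hessian_vector_product(f, psi0, g):
    """Solve H z = g, where H is the Hessian of the scalar function f at psi0.

    f : function psi -> scalar (e.g. W_2^2 between the dynamics at psi and at psi0),
        written with autograd.numpy so it can be differentiated twice.
    """
    grad_f = grad(f)

    def hvp(v):
        # Hessian-vector product as the gradient of (grad f . v)
        return grad(lambda p: np.dot(grad_f(p), v))(psi0)

    n = psi0.shape[0]
    H = LinearOperator((n, n), matvec=hvp, dtype=np.float64)
    z, info = cg(H, g)     # conjugate gradients needs only matrix-vector products
    return z
```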
The loop continues until the required number of training samples has been drawn. Each of v_1, v_2, …, v_M is provided to a Dynamic Parameter Update Calculator (DPUC) 310, which calculates their average v̄ as an estimate of H_0^{-1} g_k. In step 211, this is used by the DPUC to calculate ψ_{k+1}:
ψ_{k+1} = ψ_0 - sqrt( 2ε / (g_k^T v̄) ) v̄
In step 212, the DPUC feeds ψ_{k+1} into a Policy Parameters Update Engine (PPUE) 311, which computes θ_{k+1} and feeds it into NN4 306. This completes the loop.
The new policy parameters θ_{k+1} are then used in the next iteration of the optimization problem, and the above steps are repeated.
In another example, which will now be described with reference to FIGS. 4 to 6, ψ_k corresponds to the parameters of a simulator.
The system 600 may have access to a simulator, shown at 601 in FIG. 6, which may implement dynamics parameterized using a vector ψ_k ∈ R^d.
In step 401 of FIG. 4, the system accesses the parameters ψ_0, which are the parameters of the simulator representing the reference dynamics. In step 402, θ_0 is set to a zero vector and the estimate of H_0 is set to a zero matrix. The system then enters a loop that ends with the estimation of H_0. This is achieved by applying what is known in the literature as an "evolution strategy" approach.
In this embodiment, H_0 is estimated using the following formula:
H_0 ≈ (1/σ^4) E_{e ~ N(0, σ^2 I)} [ F(ψ_0 + e) ( e e^T - σ^2 I ) ]
where F (see equation (10)) is estimated, for each sample ψ_0 + e, by a Wasserstein distance calculation on the empirical distributions of points generated under the corresponding dynamics.
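A sketch of such an evolution-strategies estimate of H_0 is given below. It assumes the zero-order estimator written above; the sample count and σ are illustrative values.

```python
import numpy as np

def es_hessian_estimate(F, psi0, sigma=0.1, n_samples=100, rng=None):
    """Evolution-strategies style estimate of the Hessian of F at psi0.

    Assumes the zero-order estimator
        H ~= (1 / sigma^4) * E_{e ~ N(0, sigma^2 I)} [ F(psi0 + e) (e e^T - sigma^2 I) ]
    approximated by averaging over n_samples perturbations.
    """
    rng = rng or np.random.default_rng(0)
    d = psi0.shape[0]
    H = np.zeros((d, d))
    for _ in range(n_samples):
        e = sigma * rng.standard_normal(d)
        H += F(psi0 + e) * (np.outer(e, e) - sigma ** 2 * np.eye(d))
    return H / (n_samples * sigma ** 4)
```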
The dynamics parameters of the simulator form a d-dimensional vector. A random vector e ~ N(0, σ^2 I) ∈ R^d is sampled by the multivariate Gaussian sampler 603; the system sets ψ ← ψ_0 + e and passes ψ to the simulator 601, as shown in step 404.
In step 405, the system then enters subroutine A, shown in the flowchart of FIG. 5, which takes ψ and ψ_0 as its inputs and uses both of them for sampling.
the neural network NN1 and its sampler 605 shown in 604 in fig. 6 are used to derive the neural network NN from
Figure BDA0003467178990000079
The action is sampled and a simulator 601 parameterized with psi is used to perform the sampling. The samples are fed into a simulator 601 in steps 503 and 504, respectively, to produce a plurality of samples
Figure BDA00034671789900000710
And
Figure BDA00034671789900000711
the samples are collected as a data set
Figure BDA00034671789900000712
And
Figure BDA00034671789900000713
(shown in steps 505 and 506, respectively) is stored in memory. These data sets are fed into a Wasserstein Calculation Engine (WCE) 606 in order to calculate an empirical Wasserstein distance in step 507, which is considered an estimate of:
Figure BDA00034671789900000714
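The empirical Wasserstein distance between the two sample sets can be computed with an optimal-transport solver. The patent does not name a particular library; the sketch below uses the POT package (`ot`) as one possible choice, with uniform weights on the samples.

```python
import numpy as np
import ot  # POT: Python Optimal Transport (an implementation choice, not named in the patent)

def empirical_w2_squared(samples_a, samples_b):
    """Squared 2-Wasserstein distance between two empirical distributions.

    samples_a, samples_b : arrays of shape (n, d) and (m, d) of sampled next states.
    """
    n, m = len(samples_a), len(samples_b)
    a = np.full(n, 1.0 / n)            # uniform weights on each sample
    b = np.full(m, 1.0 / m)
    M = ot.dist(samples_a, samples_b)  # pairwise squared Euclidean costs
    return ot.emd2(a, b, M)            # optimal transport cost = W_2^2
```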
this loop of steps 502 through 508 is repeated a plurality of times and an average of the estimates is taken in step 510. This is done by passing the estimate from the memory 607 to the arithmetic mean calculator 608, as shown in step 509. The result is referred to as h in FIG. 4i. The outer loop is repeated for each sample e (call subroutine A), the average is calculated and set to H0An estimate of (d).
As can be seen in FIG. 4, the system then enters a loop at step 409, in which a batch B of trajectories is sampled using θ_k and ψ_k. This is done using the simulator 601 and NN2 602 and its sampler 610. In step 410, the system uses B to estimate the gradient
g_k = ∇_ψ J(θ_k, ψ) |_{ψ = ψ_k}
using the empirical estimator of equation (15).
this function is performed by the gradient estimator 609 in fig. 6.
In step 411, the system then calculates an estimate of H_0^{-1} g_k by applying the conjugate gradient algorithm, in the conjugate gradient estimator 611, to the optimization problem:
min_z ( (1/2) z^T H_0 z - z^T g_k )
in step 412, the results are passed to a new dynamic parameter calculation engine 612 that performs the following calculations:
Figure BDA0003467178990000085
This determines the new dynamics parameters ψ_{k+1}. The new parameters are then passed to the policy parameter update engine 613, which computes the new policy parameters θ_{k+1} in step 413. The update of the policy parameters may be performed with algorithms such as PPO (the proximal policy optimization algorithm of Schulman, John, Filip Wolski, Prafulla Dhariwal, Alec Radford and Oleg Klimov, "Proximal Policy Optimization Algorithms", arXiv preprint arXiv:1707.06347, 2017), TRPO (Schulman, J., Levine, S., Abbeel, P., Jordan, M. and Moritz, P., "Trust Region Policy Optimization", International Conference on Machine Learning, June 2015, pp. 1889-1897), or with "vanilla" policy gradient updates.
The new policy parameters θ_{k+1} are then used in the next iteration of the optimization, and the above steps are repeated.
Using the parameters of a simulator is advantageous compared with the general approach described above because, in general, a simulator can have its parameters varied in a manner consistent with a certain set of rules. For example, a simulator for a physical system such as a robot or a car would allow quantities such as friction, mass and length to vary, but the system should still obey Newton's laws.
Alternatively, the parameters used in this embodiment may come from a differential equation solver, which may implement dynamics parameterized using a vector ψ_k ∈ R^d.
FIG. 7 summarizes a method for performing reinforcement learning to generate a solution set of values that may be used as parameters in a model such that the model provides a level of performance with respect to a performance metric. In step 701, the method includes forming a candidate solution including a set of candidate parameter values. The method then includes repeating steps 702-704 below. In step 702, a first evaluation is made of the quality of the candidate solution by evaluating the degree to which the model with the values of the candidate solution provides a high level of performance for the performance metric. In step 703, a second evaluation is made of the quality of the candidate solution by evaluating the extent to which the model with the value of the candidate solution fails to provide a low level of performance with respect to the performance metric. In step 704, another candidate solution is formed based on the first evaluation and the second evaluation.
In the above method, a model is formed for analyzing data and providing an indication of one or more attributes of the input data. The model is generic and operates according to values that control its performance. For example, the model may be a neural network and the values may be the weights applied in the network. The system described above selects the values by training against reference or training data. The training data comprise a set of possible input data for the model, together with a corresponding expected output for each input. To select the values, the system runs a first loop and a second loop, the second loop being run inside the first. In the first loop, a set of candidate values is formed by selecting for high performance of the model configured with those values on the training data. In other words, the set of candidate values is selected or evaluated according to a determination of whether there is a relatively high correspondence between the output of the model, configured with those values and given the training data as input, and the expected output of the training data. In the inner loop, the set of candidate values is tested for whether the model configured with those values performs poorly on the training data. In other words, the set of candidate values is selected or evaluated according to a determination of whether there is a relatively low correspondence between the output of the model, configured with those values and given the training data as input, and the expected output of the training data. The process is repeated a number of times, with the set of candidate values for each iteration being selected such that it has been determined to exhibit a relatively high propensity for good performance (e.g. relative to the previous set of candidate values) and a relatively low propensity for poor performance.
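Organized as code, the two evaluations and the iterative formation of new candidate solutions might look like the following skeleton. It is illustrative only; the three callables are placeholders standing in for the components described above.

```python
def robust_training_loop(theta0, psi0, evaluate_return, worst_case_dynamics,
                         update_policy, n_iterations=100):
    """Skeleton of the max-min training procedure (illustrative only).

    evaluate_return(theta, psi)      -> scalar performance of policy theta under dynamics psi
    worst_case_dynamics(theta, psi0) -> dynamics parameters (approximately) minimizing the
                                        return within the Wasserstein ball around psi0
    update_policy(theta, psi)        -> new policy parameters improving the return under psi
    """
    theta, psi = theta0, psi0
    for k in range(n_iterations):
        # First evaluation: performance under the reference-centred dynamics.
        nominal = evaluate_return(theta, psi0)
        # Second evaluation: performance under approximately worst-case dynamics.
        psi = worst_case_dynamics(theta, psi0)
        worst = evaluate_return(theta, psi)
        # Form the next candidate solution using both evaluations.
        theta = update_policy(theta, psi)
        print(f"iter {k}: nominal return {nominal:.3f}, worst-case return {worst:.3f}")
    return theta
```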
The approach described herein helps to address the fragility problem in reinforcement learning, and in particular may provide a system and method for learning to control a system in a manner that is robust to dynamic changes. Thus, the present invention can enable users to train on simulators and deploy in the real world with good performance.
Embodiments of the present invention may provide advantages over previous approaches. The method of the invention performs better in experiments and can be used with continuous state and action spaces. In particular, "Non-Stationary Markov Decision Processes: a Worst-Case Approach using Model-Based Reinforcement Learning" by Lecarpentier and Rachelson (arXiv:1904.10090, 2019) is applicable only to scenarios with finite, discrete state and action spaces. The described method operates in continuous state and action spaces, which is a fundamental requirement for practical adaptation to real-world environments.
A particularly advantageous situation in which the invention can be applied is the case of an autonomous vehicle. The dynamics of a car can vary due to a number of factors, including changes in the road surface, road inclination, tire pressure and friction, and changes due to the weight carried. Clearly, in this example, the car does not experience any single set of dynamics throughout its life cycle, or even over a short period of time. The algorithm of the invention is more robust to such changes and can cope with new dynamics without having had to learn to cope with them in the environment.
Thus, the approaches disclosed herein may learn control strategies that are robust to dynamic variations between the environment in which they are trained and the environment in which they are deployed.
In the above description, dynamics within a predetermined Wasserstein distance of the reference dynamics are considered during training of the control strategy. However, other metrics may be used as the distance function.
The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features. Such features or combinations of features can be carried out based on the present description as a whole, in the light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problem disclosed herein, and without limitation to the scope of the claims. This application is intended to cover any adaptations or combinations of the various aspects of the invention. In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the invention.

Claims (15)

1. A system for performing reinforcement learning to generate a solution set of values that can be used as parameters in a model to cause the model to provide a level of performance with respect to a performance metric, the system being configured to: forming a candidate solution comprising a set of candidate parameter values; repeatedly executing the following steps:
first evaluating the quality of the candidate solution by evaluating a degree to which a model having the value of the candidate solution provides a high level of performance for the performance metric;
second evaluating the quality of the candidate solution by evaluating a degree to which a model having the value of the candidate solution fails to provide a low level of performance for the performance metric;
forming another candidate solution based on the first evaluation and the second evaluation.
2. The system of claim 1, wherein the system is configured to evaluate the quality of the candidate solution by testing the behavior of the model configured according to the values of the candidate solution when applied to a set of reference values.
3. The system of claim 2, wherein the system is further configured to evaluate the quality of the candidate solution by: generating a set of adapted reference values comprising one or more adapted reference data items in the vicinity of at least some of the reference values in the set of reference values; testing the behavior of the model configured according to the values of the candidate solution when applied to the set of adapted reference values.
4. The system of claim 3, wherein the one or more adapted reference data items are within a predetermined Wasserstein distance of the set of reference values.
5. System according to claim 3 or 4, wherein the adapted set of reference values represents worst case values in the set of reference values.
6. The system according to any one of claims 2 to 5, wherein the set of reference values comprises parameters of a neural network.
7. The system of any one of claims 2 to 5, wherein the set of reference values comprises values output from a simulator or differential equation solver.
8. The system according to any one of claims 2 to 7, wherein the set of reference values comprises a set of reference dynamics.
9. The system of any preceding claim, wherein the system is configured to perform an optimization comprising the first and second evaluations of the quality of the candidate solution.
10. The system of any preceding claim, wherein the model is a trained artificial intelligence model.
11. The system of claim 10, wherein the model is a neural network.
12. A method for performing reinforcement learning to generate a solution set of values that can be used as parameters in a model to cause the model to provide a level of performance with respect to a performance metric, the method comprising:
forming a candidate solution comprising a set of candidate parameter values; repeatedly executing the following steps:
first evaluating the quality of the candidate solution by evaluating a degree to which a model having the value of the candidate solution provides a high level of performance for the performance metric;
second evaluating the quality of the candidate solution by evaluating a degree to which a model having the value of the candidate solution fails to provide a low level of performance for the performance metric;
forming another candidate solution based on the first evaluation and the second evaluation.
13. The method of claim 12, wherein the evaluating the quality of the candidate solution comprises testing a behavior of the model configured according to the values of the candidate solution when applied to a set of reference values.
14. The method of claim 13, wherein said evaluating the quality of said candidate solution further comprises: generating a set of adapted reference values comprising one or more adapted reference data items in the vicinity of at least some of the reference values in the set of reference values; testing the behavior of the model configured according to the values of the candidate solution when applied to the set of adapted reference values.
15. The method of claim 14, wherein the one or more adapted reference data items are within a predetermined Wasserstein distance of the set of reference values.
CN201980098404.XA 2019-07-16 2019-07-16 Learning to robustly control a system Pending CN114270375A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2019/069101 WO2021008691A1 (en) 2019-07-16 2019-07-16 Learning to robustly control a system

Publications (1)

Publication Number Publication Date
CN114270375A true CN114270375A (en) 2022-04-01

Family

ID=67314771

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201980098404.XA Pending CN114270375A (en) 2019-07-16 2019-07-16 Learning to robustly control a system

Country Status (2)

Country Link
CN (1) CN114270375A (en)
WO (1) WO2021008691A1 (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3465236B2 (en) 2000-12-20 2003-11-10 科学技術振興事業団 Robust reinforcement learning method
US6665651B2 (en) 2001-07-18 2003-12-16 Colorado State University Research Foundation Control system and technique employing reinforcement learning having stability and learning phases
CN107856035A (en) 2017-11-06 2018-03-30 深圳市唯特视科技有限公司 A kind of robustness dynamic motion method based on intensified learning and whole body controller

Also Published As

Publication number Publication date
WO2021008691A1 (en) 2021-01-21

Similar Documents

Publication Publication Date Title
Nguyen-Tuong et al. Model learning with local gaussian process regression
Kamalapurkar et al. Reinforcement learning for optimal feedback control
Khansari-Zadeh et al. BM: An iterative algorithm to learn stable non-linear dynamical systems with gaussian mixture models
Cheng et al. Control regularization for reduced variance reinforcement learning
CN108153153B (en) Learning variable impedance control system and control method
Parmas et al. PIPPS: Flexible model-based policy search robust to the curse of chaos
Xu et al. Reinforcement learning algorithms with function approximation: Recent advances and applications
Xu et al. Kernel-based approximate dynamic programming for real-time online learning control: An experimental study
Romeres et al. Derivative-free online learning of inverse dynamics models
JP2020535562A (en) Devices and methods to control the system
JP2013242761A (en) Method, and controller and control program thereof, for updating policy parameters under markov decision process system environment
CN109827579B (en) Method and system for real-time correction of filtering model in combined positioning
Duell et al. Solving partially observable reinforcement learning problems with recurrent neural networks
Heim et al. A learnable safety measure
Inga et al. Online inverse linear-quadratic differential games applied to human behavior identification in shared control
Senn et al. Reducing the computational effort of optimal process controllers for continuous state spaces by using incremental learning and post-decision state formulations
JP7378836B2 (en) Summative stochastic gradient estimation method, apparatus, and computer program
Possas et al. Online bayessim for combined simulator parameter inference and policy improvement
Polydoros et al. A reservoir computing approach for learning forward dynamics of industrial manipulators
US11614718B2 (en) System and method for the autonomous construction and/or design of at least one component part for a component
Baert et al. Maximum causal entropy inverse constrained reinforcement learning
CN112836439A (en) Method and apparatus for processing sensor data
CN114270375A (en) Learning to robustly control a system
Handoyo et al. Implementation of particle swarm optimization (PSO) algorithm for estimating parameter of arma model via maximum likelihood method
Guzman et al. Adaptive model predictive control by learning classifiers

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination