CN108885721B - Direct inverse reinforcement learning using density ratio estimation - Google Patents

Direct inverse reinforcement learning using density ratio estimation

Info

Publication number
CN108885721B
Authority
CN
China
Legal status
Active
Application number
CN201780017406.2A
Other languages
Chinese (zh)
Other versions
CN108885721A (en)
Inventor
Eiji Uchibe
Kenji Doya
Current Assignee
Okinawa Institute of Science and Technology School Corp
Original Assignee
Okinawa Institute of Science and Technology School Corp
Application filed by Okinawa Institute of Science and Technology School Corp filed Critical Okinawa Institute of Science and Technology School Corp
Publication of CN108885721A
Application granted
Publication of CN108885721B
Legal status: Active

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 — Machine learning
    • G06N 7/00 — Computing arrangements based on specific mathematical models
    • G06N 7/01 — Probabilistic graphical models, e.g. probabilistic networks

Abstract

A method of inverse reinforcement learning for estimating a reward function and a value function of a behavior of an object, the method comprising: acquiring data representing changes in state variables that define the behavior of the object; applying a modified Bellman equation given by equation (1) to the acquired data:

r(x) + γV(y) − V(x) = ln [π(y|x)/b(y|x)]    (1)
                    = ln [π(x,y)/b(x,y)] − ln [π(x)/b(x)],    (2)

where r(x) and V(x) denote a reward function and a value function, respectively, in state x, γ denotes a discount factor, and b(y|x) and π(y|x) denote the state transition probabilities before and after learning, respectively; estimating the logarithm of the density ratio π(x)/b(x) in equation (2); estimating r(x) and V(x) in equation (2) from the result of estimating the logarithm of the density ratio π(x,y)/b(x,y); and outputting the estimated r(x) and V(x).

Description

Direct inverse reinforcement learning using density ratio estimation
Technical Field
The present invention relates to inverse reinforcement learning, and more particularly to systems and methods of inverse reinforcement learning. This application claims the benefit of U.S. Provisional Application No. 62/308,722, filed March 15, 2016, which is incorporated herein by reference.
Background
Understanding human behavior from observation is crucial for developing artificial systems that can interact with humans. Because our decision-making processes are influenced by the rewards/costs associated with the selected actions, the problem can be formulated as estimating the rewards/costs from observed behavior.
The idea of inverse reinforcement learning was originally proposed by Ng and Russell (2000) (NPL 14). The OptV algorithm proposed by Dvijotham and Todorov (2010) (NPL 6) is a prior work showing that the demonstrator's policy is approximated by the value function, which is a solution of the linearized Bellman equation.
In general, reinforcement learning (RL) is a computational framework for studying the decision-making processes of both biological and artificial systems that can learn an optimal policy by interacting with the environment. There remain several open problems in RL, one of which is how to design and prepare an appropriate reward/cost function. It is easy to design a sparse reward function that gives a positive reward when a task is accomplished and zero otherwise, but such a reward makes it difficult to find an optimal policy.
In some situations, it is easier to prepare examples of desired behavior than to hand-craft an appropriate reward/cost function. Recently, several inverse reinforcement learning (IRL) methods (Ng & Russell, 2000, NPL 14) and apprenticeship learning methods (Abbeel & Ng, 2004, NPL 1) have been proposed in order to derive a reward/cost function from the demonstrator's performance and to implement imitation learning. However, most of the existing studies (Abbeel & Ng, 2004, NPL 1; Ratliff et al., 2009, NPL 16; Ziebart et al., 2008, NPL 26) require a routine that solves the forward reinforcement learning problem with the estimated reward/cost function. This process is usually very time-consuming, even when an environmental model is available.
Recently, the concept of linearly solvable Markov decision processes (LMDPs) (Todorov, 2007; 2009, NPL 23-NPL 24), which is a subclass of Markov decision processes obtained by restricting the form of the cost function, has been introduced. This restriction plays an important role in IRL. LMDPs are also known as KL control, and path-integral methods (Kappen et al., 2012, NPL 10; Theodorou et al., 2010, NPL 21) and similar ideas have been proposed in the field of control theory (Fleming and Soner, 2006, NPL 7). Aghasadeghi & Bretl (2011) (NPL 2) and Kalakrishnan et al. (2013) (NPL 8) proposed model-free IRL algorithms based on the path-integral method. Since the likelihood of an optimal trajectory is parameterized by the cost function, the cost parameters can be optimized by maximizing the likelihood. However, their methods require entire trajectory data. A model-based IRL method was proposed by Dvijotham and Todorov (2010) (NPL 6) based on the LMDP framework, in which the likelihood of the optimal state transition is represented by the value function. In contrast to the path-integral approaches to IRL, it can be optimized from any dataset of state transitions. A major drawback is the evaluation of integrals that cannot be solved analytically. In practice, they discretized the state space to replace the integrals with sums, but this is not feasible in high-dimensional continuous problems.
CITATION LIST
Patent document
PTL 1: U.S. Pat. No.8,756,177, Methods and systems for evaluating subject intent from perspective.
PTL 2: U.S. Pat. No.7,672,739 System for multiresolution analysis assisted recovery from run-by-run control.
PTL 3: japanese patent No. 5815458. A method and program for simulating a simulation device.
Non-patent document
NPL 1:Abbeel,P.and Ng,A.Y.Apprenticeship learning via inverse reinforcement learning.In Proc.of the 21st International Conference on Machine Learning,2004.
NPL 2:Aghasadeghi,N.and Bretl,T.Maximum entropy inverse reinforcement learning in continuous state spaces with path integrals.In Proc.of IEEE/RSJ International Conference on Intelligent Robots and Systems,pp.1561-1566,2011.
NPL 3:Boularias,A.,Kober,J.,and Peters,J.Relative entropy inverse reinforcement learning.In Proc.of the 14th International Conference on Artificial Intelligence and Statistics,volume 15,2011.
NPL 4:Deisenroth,M.P.,Rasmussen,C.E.,and Peters,J.Gaussian process dynamic programming.Neurocomputing,72(7-9):1508-1524,2009.
NPL 5:Doya,K.Reinforcement learning in continuous time and space.Neural Computation,12:219-245,2000.
NPL 6:Dvijotham,K.and Todorov,E.Inverse optimal control with linearly solvable MDPs.In Proc.of the 27th International Conference on Machine Learning,2010.
NPL 7:Fleming,W.H.and Soner,H.M.Controlled Markov Processes and Viscosity Solutions.Springer,second edition,2006.
NPL 8:Kalakrishnan,M.,Pastor,P.,Righetti,L.,and Schaal,S.Learning objective functions for manipulation.In Proc.of IEEE International Conference on Robotics and Automation,pp.1331-1336,2013.
NPL 9:Kanamori,T.,Hido,S.,and Sugiyama,M.A Least-squares Approach to Direct Importance Estimation.Journal of Machine Learning Research,10:1391-1445,2009.
NPL 10:Kappen,H.J.,Gomez,V.,and Opper,M.Optimal control as a graphical model inference problem.Machine Learning,87(2):159-182,2012.
NPL 11:Kinjo,K.,Uchibe,E.,and Doya,K.Evaluation of linearly solvable Markov decision process with dynamic model learning in a mobile robot navigation task.Frontiers in Neurorobotics,7(7),2013.
NPL 12:Levine,S.and Koltun,V.Continuous inverse optimal control with locally optimal examples.In Proc.of the 27th International Conference on Machine Learning,2012.
NPL 13:Levine,S.,Popovic,Z.,and Koltun,V.Nonlinear inverse reinforcement learning with Gaussian processes.Advances in Neural Information Processing Systems 24,pp.19-27.2011.
NPL 14:Ng,A.Y.and Russell,S.Algorithms for inverse reinforcement learning.In Proc.of the 17th International Conference on Machine Learning,2000.
NPL 15:Rasmussen,C.E.and Williams,C.K.I.Gaussian Processes for Machine Learning.MIT Press,2006.
NPL 16:Ratliff,N.D.,Silver,D.,and Bagnell,J.A.Learning to search:Functional gradient techniques for imitation learning.Autonomous Robots,27(1):25-53,2009.
NPL 17:Stulp,F.and Sigaud,O.Path integral policy improvement with covariance matrix adaptation.In Proc.of the 10th European Workshop on Reinforcement Learning,2012.
NPL 18:Sugimoto,N.and Morimoto,J.Phase-dependent trajectory optimization for periodic movement using path integral reinforcement learning.In Proc.of the 21st Annual Conference of the Japanese Neural Network Society,2011.
NPL 19:Sugiyama,M.,Takeuchi,I.,Suzuki,T.,Kanamori,T.,Hachiya,H.,and Okanohara,D.Least-squares conditional density estimation.IEICE Transactions on Information and Systems,E93-D(3):583-594,2010.
NPL 20:Sugiyama,M.,Suzuki,T.,and Kanamori,T.Density ratio estimation in machine learning.Cambridge University Press,2012.
NPL 21:Theodorou,E.,Buchli,J.,and Schaal,S.A generalized path integral control approach to reinforcement learning.Journal of Machine Learning Research,11:3137-3181,2010.
NPL 22:Theodorou,E.A and Todorov,E.Relative entropy and free energy dualities:Connections to path integral and KL control.In Proc.of the 51st IEEE Conference on Decision and Control,pp.1466-1473,2012.
NPL 23:Todorov,E.Linearly-solvable Markov decision problems.Advances in Neural Information Processing Systems 19,pp.1369-1376.MIT Press,2007.
NPL 24:Todorov,E.Efficient computation of optimal actions.Proceedings of the National Academy of Sciences of the United States of America,106(28):11478-83,2009.
NPL 25:Todorov,E.Eigenfunction approximation methods for linearly-solvable optimal control problems.In Proc.of the 2nd IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning,pp.161-168,2009.
NPL 26:Ziebart,B.D.,Maas,A.,Bagnell,J.A.,and Dey,A.K.Maximum entropy inverse reinforcement learning.In Proc.of the 23rd AAAI Conference on Artificial Intelligence,2008.
NPL 27:Vroman,M.(2014).Maximum likelihood inverse reinforcement learning.PhD Thesis,Rutgers University,2014.
NPL 28:Raita,H.(2012).On the performance of maximum likelihood inverse reinforcement learning.arXiv preprint.
NPL 29:Choi,J.and Kim,K.(2012).Nonparametric Bayesian inverse reinforcement learning for multiple reward functions.NIPS 25.
NPL 30:Choi,J.and Kim,K.(2011).Inverse reinforcement learning in partially observable environments.Journal of Machine Learning Research.
NPL 31:Neu,G.and Szepesvari,C.(2007).Apprenticeship learning using inverse reinforcement learning and gradient methods.In Proc.of UAI.
NPL 32:Mahadevan,S.(2005).Proto-value functions:developmental reinforcement learning.In Proc.of the 22nd ICML.
Disclosure of Invention
Technical problem
Inverse reinforcement learning is a framework for solving the above problems, but as described above, the existing methods have the following drawbacks: (1) they are intractable when the state is continuous, (2) they are computationally expensive, and (3) entire state trajectories are needed for the estimation. The methods disclosed in the present disclosure address these drawbacks. In particular, the previous method proposed in NPL 14 is not as effective as reported in many previous studies. Furthermore, the method proposed in NPL 6 cannot solve continuous problems in practice, because its algorithm involves a complicated evaluation of integrals.
The present invention is directed to systems and methods for inverse reinforcement learning.
An object of the present invention is to provide a new and improved inverse reinforcement learning system and method that obviate one or more of the problems of the prior art.
Solution to the problem
To achieve these and other advantages and in accordance with the purpose of the present invention, as embodied and broadly described, in one aspect, the present invention provides a method of inverse reinforcement learning for estimating a reward function and a value function of a behavior of an object, the method comprising: acquiring data representing changes in state variables that define the behavior of the object; applying a modified Bellman equation given by equation (1) to the acquired data:

r(x) + γV(y) − V(x) = ln [π(y|x)/b(y|x)]    (1)
                    = ln [π(x,y)/b(x,y)] − ln [π(x)/b(x)],    (2)

where r(x) and V(x) denote a reward function and a value function, respectively, in state x, γ denotes a discount factor, and b(y|x) and π(y|x) denote the state transition probabilities before and after learning, respectively; estimating the logarithm of the density ratio π(x)/b(x) in equation (2); estimating r(x) and V(x) in equation (2) from the result of estimating the logarithm of the density ratio π(x,y)/b(x,y); and outputting the estimated r(x) and V(x).
In another aspect, the present invention provides a method of inverse reinforcement learning for estimating a reward function and a value function of a behavior of an object, the method comprising: obtaining data representing state transitions with actions that define the behavior of the object; applying a modified Bellman equation given by equation (3) to the acquired data:

r(x) + γV(y) − V(x) = ln [π(u|x)/b(u|x)]    (3)
                    = ln [π(x,u)/b(x,u)] − ln [π(x)/b(x)],    (4)

where r(x) and V(x) denote a reward function and a value function, respectively, in state x, γ denotes a discount factor, y denotes a state to which the state x transitions, and b(u|x) and π(u|x) denote stochastic policies before and after learning, respectively, representing the probability of selecting action u in state x; estimating the logarithm of the density ratio π(x)/b(x) in equation (4); estimating r(x) and V(x) in equation (4) based on the result of estimating the logarithm of the density ratio π(x,u)/b(x,u); and outputting the estimated r(x) and V(x).
In another aspect, the invention provides a non-transitory storage medium storing instructions for causing a processor to execute an algorithm of inverse reinforcement learning for estimating a reward function and a value function of a behavior of an object, the instructions causing the processor to perform the following steps: acquiring data representing changes in state variables that define the behavior of the object; applying a modified Bellman equation given by equation (1) to the acquired data:

r(x) + γV(y) − V(x) = ln [π(y|x)/b(y|x)]    (1)
                    = ln [π(x,y)/b(x,y)] − ln [π(x)/b(x)],    (2)

where r(x) and V(x) denote a reward function and a value function, respectively, in state x, γ denotes a discount factor, and b(y|x) and π(y|x) denote the state transition probabilities before and after learning, respectively; estimating the logarithm of the density ratio π(x)/b(x) in equation (2); estimating r(x) and V(x) in equation (2) based on the result of estimating the logarithm of the density ratio π(x,y)/b(x,y); and outputting the estimated r(x) and V(x).
In another aspect, the present invention provides an inverse reinforcement learning system for estimating a reward function and a value function of a behavior of an object, the system comprising: a data acquisition unit for acquiring data representing changes in state variables that define the behavior of the object; a processor with a memory, the processor and the memory being configured to: apply a modified Bellman equation given by equation (1) to the acquired data:

r(x) + γV(y) − V(x) = ln [π(y|x)/b(y|x)]    (1)
                    = ln [π(x,y)/b(x,y)] − ln [π(x)/b(x)],    (2)

where r(x) and V(x) denote a reward function and a value function, respectively, in state x, γ denotes a discount factor, and b(y|x) and π(y|x) denote the state transition probabilities before and after learning, respectively; estimate the logarithm of the density ratio π(x)/b(x) in equation (2); and estimate r(x) and V(x) in equation (2) based on the result of estimating the logarithm of the density ratio π(x,y)/b(x,y); and an output interface that outputs the estimated r(x) and V(x).
In another aspect, the present invention provides a system for predicting a preference in topics of articles that a user is likely to read, based on a series of articles selected by the user while browsing the Internet, the system comprising: the system for inverse reinforcement learning according to claim 8, implemented in a computer connected to the Internet, wherein the object is the user and the state variables defining the behavior of the object include topics of the articles selected by the user while browsing each webpage, and wherein the processor causes an interface of an Internet website that the user is browsing to display articles recommended for the user to read in accordance with the estimated reward function and value function.
In another aspect, the present invention provides a method for programming a robot to perform a complex task, the method comprising: controlling a first robot to accomplish the task so as to record a sequence of states and actions; estimating the reward function and the value function with the system for inverse reinforcement learning according to claim 8 based on the recorded sequence of states and actions; and providing the estimated reward function and value function to a forward reinforcement learning controller of a second robot so as to program the second robot with the estimated reward function and value function.
[Advantageous Effects of Invention]
According to one or more aspects of the present invention, inverse reinforcement learning can be efficiently and effectively performed. In some embodiments, there is no need to know the environment dynamics in advance and no need to perform integration.
Additional or separate features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are intended to provide further explanation of the invention as claimed.
Drawings
Fig. 1 shows the normalized squared errors of the swing-up inverted pendulum experiments to which embodiments of the present invention are applied, for each of the following density ratio estimation methods: (1) LSCDE-IRL, (2) uLSIF-IRL, (3) LogReg-IRL, (4) Gauss-IRL, (5) LSCDE-OptV, and (6) Gauss-OptV. Panels (a)-(d) differ from each other in the sampling method and other settings.
Fig. 2 is a graph showing the cross-validation errors of the various density ratio estimation methods in the swing-up inverted pendulum experiments.
Fig. 3 shows the experimental setup of the pole balancing task for the long pole; left: start position, middle: target position, right: state variables.
Fig. 4 shows the learning curves of the subjects in the pole balancing task experiments according to an embodiment of the present invention; solid line: long pole, dashed line: short pole.
Fig. 5 shows the estimated cost functions of subjects 4, 5, and 7, derived from the pole balancing task experiments according to an embodiment of the present invention, projected onto a defined subspace.
Fig. 6 shows the negative log-likelihoods of the test datasets in the pole balancing task experiments for subjects 4 and 7, used to evaluate the estimated cost functions.
Fig. 7 schematically shows the framework of inverse reinforcement learning according to embodiment 1 of the present invention, which can infer an objective function from observed state transitions generated by a demonstrator.
Fig. 8 is a schematic block diagram showing an example of an implementation of the inverse reinforcement learning of the present invention for imitation learning of robot behaviors.
Fig. 9 is a schematic block diagram showing an example of an implementation of the inverse reinforcement learning of the present invention for interpreting human behaviors.
Fig. 10 schematically shows a series of click actions by a web visitor, illustrating the visitor's topic preferences while browsing the web.
Fig. 11 schematically illustrates an example of an inverse reinforcement learning system according to an embodiment of the present invention.
Fig. 12 schematically shows the difference between embodiment 1 and embodiment 2 of the present invention.
Fig. 13 schematically illustrates a calculation scheme of the second DRE of step (2) in embodiment 2.
Fig. 14 shows the results of the swing-up inverted pendulum experiments comparing embodiment 2 with embodiment 1 and other methods.
Fig. 15 shows experimental results of the robot navigation task using embodiments 1 and 2 and RelEnt-IRL.
Detailed Description
The present disclosure provides a novel inverse reinforcement learning method and system based on density ratio estimation under the framework of linearly solvable Markov decision processes (LMDPs). In an LMDP, the logarithm of the ratio between the controlled and uncontrolled state transition densities is represented by the state-dependent cost and value functions. Previously, the present inventors devised a novel inverse reinforcement learning method and system as described in PCT International Application No. PCT/JP2015/004001, in which a density ratio estimation method is used to estimate the transition density ratio, and a least-squares method with regularization is used to estimate the state-dependent cost and value functions that satisfy the relation. That method can avoid computing integrals such as the evaluation of a partition function. The present disclosure includes a description of the invention described in PCT/JP2015/004001 as embodiment 1 below, and further describes a new embodiment, embodiment 2, which has features improved in several respects relative to embodiment 1. The subject matter described and/or claimed in PCT/JP2015/004001 may or may not be prior art with respect to embodiment 2, depending on the national laws of each jurisdiction. As described below, for embodiment 1, a simple numerical simulation of the swing-up pendulum was performed, and its superiority over conventional methods was demonstrated. The inventors further applied the method to human behavior in performing a pole balancing task and showed that the estimated cost functions can predict the performance of the subjects in new trials or environments in a satisfactory manner.
One aspect of the present invention is based on the framework of linearly solvable Markov decision processes, like the OptV algorithm. In embodiment 1, the inventors derived a novel Bellman equation given by:

ln [π(y|x)/p(y|x)] = −q(x) + V(x) − γV(y),

where q(x) and V(x) denote the cost and value functions at state x, and γ denotes the discount factor. p(y|x) and π(y|x) denote the state transition probabilities before and after learning, respectively. The density ratio (the left-hand side of the above equation) is efficiently computed from the observed behaviors by density ratio estimation methods. Once the density ratio is estimated, the cost and value functions can be estimated by regularized least squares. An important feature is that our method can avoid computing the integrals that are usually evaluated at a high computational cost. The present inventors applied this method to human behaviors in performing a pole balancing task and show that the estimated cost functions can predict the performance of the subjects in new trials or environments. This verifies the general applicability and effectiveness of this new computational technique of inverse reinforcement learning, which has well-recognized wide applicability in control systems, machine learning, operations research, information theory, etc.
<I. Embodiment 1>
<1. Linearly solvable Markov decision processes>
<1.1. Forward Reinforcement learning >
The present disclosure provides a brief introduction to Markov decision processes and their simplification in the discrete-time continuous-space domain. Let X and U denote the continuous state space and the continuous action space, respectively. At time step t, the learning agent observes the current state of the environment x_t ∈ X and executes an action u_t ∈ U sampled from a stochastic policy π(u_t|x_t). Consequently, an immediate cost c(x_t, u_t) is given by the environment, and the environment makes a state transition from x_t to y ∈ X according to the state transition probability P_T(y|x_t, u_t) under the action u_t. The goal of reinforcement learning is to construct an optimal policy π(u|x) that minimizes a given objective function. There exist several objective functions, and the most widely used one is the discounted sum of costs given by:

V(x) = E[ Σ_{t=0}^{∞} γ^t c(x_t, u_t) | x_0 = x ],    (1)

where γ ∈ (0,1) is called the discount factor. It is known that the optimal value function satisfies the following Bellman equation:

V(x) = min_u [ c(x, u) + γ E_{P_T(y|x,u)}[ V(y) ] ].    (2)

Equation (2) is a nonlinear equation due to the min operator.
The linearly solvable Markov decision process (LMDP) simplifies equation (2) under certain assumptions (Todorov, 2007; 2009a, NPL 23-NPL 24). The key trick of LMDP is to optimize the state transition probability directly instead of optimizing the policy. More specifically, two conditional probability density functions are introduced. One is the uncontrolled probability, denoted by p(y|x), which can be regarded as an innate state transition. p(y|x) is arbitrary, and it can be constructed by p(y|x) = ∫ P_T(y|x,u) π_0(u|x) du, where π_0(u|x) is a stochastic policy. The other is the controlled probability, denoted by π(y|x), which can be interpreted as an optimal state transition. The cost function is then restricted to the following form:

c(x,u) = q(x) + KL(π(·|x) || p(·|x)),    (3)

where q(x) denotes the state-dependent cost function and KL(π(·|x) || p(·|x)) denotes the Kullback-Leibler divergence between the controlled and uncontrolled state transition densities. In this case, the Bellman equation (2) is simplified to the following equation:

exp(−V(x)) = exp(−q(x)) ∫ p(y|x) exp(−γV(y)) dy.    (4)

The optimal controlled probability is given by:

π(y|x) = p(y|x) exp(−γV(y)) / ∫ p(y'|x) exp(−γV(y')) dy'.    (5)

It should be noted that equation (4) remains nonlinear even if the desirability function Z(x) = exp(−V(x)) is introduced, because of the discount factor γ. In forward reinforcement learning under the LMDP framework, V(x) is computed by solving equation (4), and π(y|x) is then computed by equation (5) (Todorov, 2009, NPL 25).
<1.2. Inverse reinforcement learning>
An inverse reinforcement learning (IRL) algorithm under LMDP was proposed by Dvijotham and Todorov (2010) (NPL 6). In particular, OptV works well for discrete-state problems. The advantage of OptV is that the optimal state transition is explicitly represented by the value function, so that the maximum likelihood method can be applied to estimate the value function. It is assumed that the observed trajectories are generated by the optimal state transition density (5). The value function is approximated by the following linear model:

V(x; w_V) = w_V^T ψ_V(x),    (6)

where w_V and ψ_V(x) denote the learning weights and the basis function vector, respectively.
Since the controlled probability is given by equation (5), the weight vector w_V can be optimized by maximizing the likelihood. Suppose a dataset of state transitions:

D^π = {(x_j^π, y_j^π)}_{j=1}^{N^π},    (7)

where N^π denotes the number of data from the controlled probability. Then the log-likelihood and its derivative are given by:

L(w_V) = Σ_{j=1}^{N^π} ln π(y_j^π | x_j^π; w_V),
∂L(w_V)/∂w_V = γ Σ_{j=1}^{N^π} [ E_{π(y|x_j^π; w_V)}[ψ_V(y)] − ψ_V(y_j^π) ],    (8)

where π(y|x; w_V) is the controlled probability in which the value function is parameterized by equation (6). Once the gradient is evaluated, the weight vector w_V is updated by gradient ascent.
After the value function is estimated, the cost function can be derived using the simplified Bellman equation (4). This means that, given V(x; w_V) and γ, the cost function q(x) is uniquely determined, and q(x) is represented by the basis functions used in the value function. Although the representation of the cost function is not important for imitation learning, a simpler representation of the cost is preferable for analysis. Therefore, the inventors introduce the approximator:

q(x; w_q) = w_q^T ψ_q(x),    (9)

where w_q and ψ_q(x) denote the learning weights and the basis function vector, respectively. The objective function with L1 regularization for optimizing w_q is given by:

J(w_q) = Σ_j [ q(x_j; w_q) − V(x_j; w_V) − ln ∫ p(y|x_j) exp(−γV(y; w_V)) dy ]² + λ_q ||w_q||_1,    (10)

where λ_q is a regularization constant. A simple gradient descent algorithm is adopted, and J(w_q) is evaluated at the observed states.
The most significant problem with the approach of Dvijotham and Todorov (2010) (NPL 6) is the integrals in equations (8) and (10), which cannot be solved analytically; they discretized the state space and replaced the integrals with sums. However, as they suggested, this is infeasible in high-dimensional continuous problems. In addition, the uncontrolled probability p(y|x) is not necessarily Gaussian. In at least some embodiments of the present invention, the Metropolis-Hastings algorithm is applied to evaluate the gradient of the log-likelihood, in which the uncontrolled probability p(y|x) is used as the proposal density.
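As an illustration of this sampling step, the following sketch (a hypothetical one-dimensional example; the densities, features, and sample sizes are assumptions made only for illustration) shows how an independence Metropolis-Hastings sampler with proposal p(y|x) can approximate the expectation over π(y|x; w_V) that appears in the gradient (8); because the proposal equals the uncontrolled density, the acceptance ratio depends only on exp(−γV).

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_features_mh(x, sample_p, V, psi, gamma, n_samples=1000):
    """Estimate E_{pi(y|x)}[psi(y)], where pi(y|x) is proportional to
    p(y|x)*exp(-gamma*V(y)), with an independence Metropolis-Hastings
    sampler whose proposal is the uncontrolled density p(y|x); the
    proposal terms cancel in the acceptance ratio."""
    y = sample_p(x)
    total = np.zeros_like(psi(y), dtype=float)
    for _ in range(n_samples):
        y_prop = sample_p(x)
        if rng.random() < np.exp(-gamma * (V(y_prop) - V(y))):
            y = y_prop
        total += psi(y)
    return total / n_samples

# hypothetical 1-D example: p(y|x) = N(x, 1), quadratic value, Gaussian features
centers = np.linspace(-2.0, 2.0, 5)
psi = lambda y: np.exp(-(y - centers) ** 2 / 2.0)
print(expected_features_mh(0.0, lambda x: rng.normal(x, 1.0),
                           lambda y: y ** 2, psi, gamma=0.9))
```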
<2. Inverse reinforcement learning by density ratio estimation>
<2.1. Bellman equation for IRL >
From equations (4) and (5), the present inventors derived the following important relation for the discounted-cost problem:

q(x) + γV(y) − V(x) = −ln [π(y|x)/p(y|x)].    (11)

Equation (11) plays an important role in the IRL algorithm according to the embodiments of the present invention. Similar equations can be derived for the first-exit, average-cost, and finite-horizon problems. It should be noted that the left-hand side of equation (11) is not a temporal difference error, because q(x) is the state-dependent part of the cost function shown in equation (3). Our IRL is still an ill-posed problem: although the form of the cost function is restricted by equation (3) under LMDP, the cost function is not uniquely determined. More specifically, if the state-dependent cost function is modified by:

q′(x) = q(x) + C,    (12)

then the corresponding value function changes to:

V′(x) = V(x) + C/(1 − γ),    (13)

where C is a constant. The controlled probability derived from V(x) is then identical to that derived from V′(x). This property is useful when estimating the cost function, as described below. In one aspect of the present invention, the disclosed IRL method consists of two parts. One is to estimate the density ratio on the right-hand side of equation (11), as described below. The other is to estimate q(x) and V(x) by least squares with regularization, as shown below.
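To make relation (11) concrete, the following self-contained sketch (an illustrative toy computation on an assumed random discrete LMDP) solves the forward problem numerically and verifies that q(x) + γV(y) − V(x) coincides with the negative log ratio of the controlled and uncontrolled transition probabilities.

```python
import numpy as np

rng = np.random.default_rng(1)
n, gamma = 4, 0.9
p = rng.random((n, n)); p /= p.sum(axis=1, keepdims=True)  # uncontrolled p(y|x)
q = rng.random(n)                                          # state-dependent cost

V = np.zeros(n)                        # solve the linearized Bellman equation (4)
for _ in range(5000):
    V = q - np.log(p @ np.exp(-gamma * V))

pi = p * np.exp(-gamma * V)[None, :]   # optimal controlled probability (5)
pi /= pi.sum(axis=1, keepdims=True)

lhs = q[:, None] + gamma * V[None, :] - V[:, None]   # q(x) + gamma*V(y) - V(x)
rhs = -np.log(pi / p)                                # -ln pi(y|x)/p(y|x)
print(np.max(np.abs(lhs - rhs)))       # ~0: relation (11) holds
```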
<2.2. Density ratio estimation for IRL >
Estimating the ratio of the controlled and uncontrolled transition probability densities can be regarded as a problem of density ratio estimation (Sugiyama et al., 2012, NPL 20). According to the setting of the problem, the present disclosure considers the following formulations.
<2.2.1. general case >
First, the general setting is considered. Suppose we have two datasets of state transitions: one is D^π shown in equation (7), and the other is a dataset from the uncontrolled probability:

D^p = {(x_j^p, y_j^p)}_{j=1}^{N^p},

where N^p denotes the number of data. Then, we are interested in estimating the ratio π(y|x)/p(y|x) from D^p and D^π.
From equation (11), we can consider the following two decompositions:

ln [π(y|x)/p(y|x)] = ln π(y|x) − ln p(y|x)    (14)
                   = ln [π(x,y)/p(x,y)] − ln [π(x)/p(x)].    (15)

The first decomposition (14) shows the difference of the logarithms of the conditional probability densities. To estimate equation (14), the present disclosure considers two implementations. One is LSCDE-IRL, which estimates π(y|x) and p(y|x) by least-squares conditional density estimation (LSCDE) (Sugiyama et al., 2010, NPL 19). The other is Gauss-IRL, which uses a Gaussian process (Rasmussen & Williams, 2006, NPL 15) to estimate the conditional densities in equation (14).
The second decomposition (15) shows the difference of the logarithms of density ratios. The advantage of the second decomposition is that ln π(x)/p(x) can be neglected if π(x) = p(x). This condition may be satisfied depending on the setting. Currently, two methods are implemented to estimate π(x)/p(x) and π(x,y)/p(x,y). One is uLSIF-IRL, which uses unconstrained least-squares importance fitting (uLSIF) (Kanamori et al., 2009, NPL 9). The other is LogReg, which uses logistic regression in a different way. Their implementations are described in section 2.3 below.
<2.2.2. when p (y | x) is unknown >
The state transition probability P_T(y|x,u) is assumed to be known in advance in the standard IRL problem setting, and this corresponds to the assumption that the uncontrolled probability p(y|x) is given in the LMDP setting. This can be regarded as model-based IRL. In this case, decomposition (14) is appropriate, and it is sufficient to estimate the controlled probability π(y|x) from the dataset D^π.
In some situations, we have neither an analytical model nor a dataset from the uncontrolled probability density. Then p(y|x) is replaced by a uniform distribution, which is an improper distribution for unbounded variables. Without loss of generality, p(y|x) is set to 1, because the difference can be compensated by shifting the cost and value functions through equations (12) and (13).
<2.3. Density ratio estimation algorithms>
This section describes the density ratio estimation algorithms suitable for the IRL methods disclosed in the present disclosure.
<2.3.1.uLSIF>
uLSIF (Kanamori et al., 2009, NPL 9) is a least-squares method for direct density ratio estimation. The goal of uLSIF is to estimate the ratios of two densities, π(x)/p(x) and π(x,y)/p(x,y). Hereinafter, the present disclosure explains how to estimate r(z) = π(z)/p(z) from D^p and D^π, where z = (x, y) for simplicity. The ratio is approximated by the linear model:

r̂(z) = α^T φ(z),

where α and φ(z) denote the parameters to be learned and the basis function vector, respectively. The objective function is given by:

J(α) = (1/2) α^T Ĥ α − ĥ^T α + (λ/2) α^T α,    (16)

where λ is a regularization constant, and

Ĥ = (1/N^p) Σ_{j=1}^{N^p} φ(z_j^p) φ(z_j^p)^T,
ĥ = (1/N^π) Σ_{j=1}^{N^π} φ(z_j^π).

It should be noted that Ĥ is estimated from D^p, whereas ĥ is estimated from D^π. Equation (16) can be minimized analytically as

α̃ = (Ĥ + λI)^{−1} ĥ,

but this minimizer ignores the non-negativity constraint of the density ratio. To compensate for this problem, uLSIF modifies the solution as follows:

α̂ = max(0, α̃),    (17)
where the above max operator is applied element-wise. Following the recommendation of Kanamori et al. (2009) (NPL 9), Gaussian functions centered at states of D^π are used as the basis functions, described by:

φ_j(x, y) = exp( −(||x − x_j||² + ||y − y_j||²) / (2σ²) ),    (18)

where σ is the width parameter and (x_j, y_j) is a center randomly selected from D^π. The parameters λ and σ are selected by leave-one-out cross-validation.
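The following Python sketch illustrates the uLSIF procedure described above on an assumed one-dimensional toy problem; the Gaussian width σ, regularization constant λ, and number of basis functions are placeholders that would in practice be chosen by leave-one-out cross-validation.

```python
import numpy as np

def ulsif(z_p, z_pi, sigma=1.0, lam=0.1, n_basis=100, rng=None):
    """Sketch of uLSIF as described above: estimate r(z) = pi(z)/p(z) from
    samples z_p ~ p and z_pi ~ pi with Gaussian basis functions centred on
    randomly chosen samples from z_pi (cf. equation (18))."""
    rng = rng or np.random.default_rng(0)
    centers = z_pi[rng.choice(len(z_pi), size=min(n_basis, len(z_pi)),
                              replace=False)]
    def phi(z):
        d2 = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2 / (2.0 * sigma ** 2))           # shape (N, n_basis)
    H = phi(z_p).T @ phi(z_p) / len(z_p)                  # estimated from D^p
    h = phi(z_pi).mean(axis=0)                            # estimated from D^pi
    alpha = np.linalg.solve(H + lam * np.eye(len(h)), h)  # analytic minimiser
    alpha = np.maximum(0.0, alpha)                        # equation (17)
    return lambda z: phi(np.atleast_2d(z)) @ alpha

# toy check: p = N(0, 1), pi = N(0.5, 0.8)
rng = np.random.default_rng(0)
z_p = rng.normal(0.0, 1.0, size=(500, 1))
z_pi = rng.normal(0.5, 0.8, size=(500, 1))
r_hat = ulsif(z_p, z_pi)
print(r_hat(np.array([[0.5], [-2.0]])))   # larger near the mode of pi than in the left tail
```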
<2.3.2.LSCDE>
LSCDE (Sugiyama et al., 2010, NPL 19) can be regarded as a special case of uLSIF for estimating a conditional probability density function. For example, the objective function for estimating π(y|x) = π(x,y)/π(x) from D^π is given by:

J(α) = (1/2) α^T Ĥ α − ĥ^T α + (λ/2) α^T α,

where r̂(x, y) = α^T φ(x, y) is a linear model and λ is a regularization constant. The computation of Ĥ and ĥ in LSCDE is slightly different from that in uLSIF, and they are computed as follows:

Ĥ = (1/N^π) Σ_{j=1}^{N^π} Φ̄(x_j^π),
ĥ = (1/N^π) Σ_{j=1}^{N^π} φ(x_j^π, y_j^π),

where Φ̄(x) is defined as:

Φ̄(x) = ∫ φ(x, y) φ(x, y)^T dy.

Since the basis functions shown in equation (18) are used, this integral can be computed analytically. The estimated weights of LSCDE are given by equation (17). To ensure that the estimated ratio is a conditional density, the solution should be normalized when it is used to estimate the cost and value functions.
<2.3.3.LogReg>
LogReg is a method of density ratio estimation that uses logistic regression. Let us assign the selector variable η = −1 to samples from the uncontrolled probability and η = +1 to samples from the controlled probability:

p(z) = Pr(z | η = −1),  π(z) = Pr(z | η = +1).

The density ratio can be expressed by applying the Bayes rule as follows:

π(z)/p(z) = [Pr(η = −1) / Pr(η = +1)] · [Pr(η = +1 | z) / Pr(η = −1 | z)].

The first ratio Pr(η = −1)/Pr(η = +1) is estimated by N^p/N^π, and the second ratio is computed after estimating the conditional probability Pr(η | z) with a logistic regression classifier:

Pr(η | z) = 1 / (1 + exp(−η w^T φ(z))),

where η can be regarded as a label. It should be noted that, in the case of LogReg, the logarithm of the density ratio is given by a linear model:

ln [π(z)/p(z)] = w^T φ(z) + ln (N^p/N^π).

The second term ln N^p/N^π can be ignored in our IRL formulation shown in equation (15).
The objective function is derived from the negative regularized log-likelihood expressed by:

J(w) = Σ_{j=1}^{N^p} ln(1 + exp(w^T φ(z_j^p))) + Σ_{j=1}^{N^π} ln(1 + exp(−w^T φ(z_j^π))) + λ w^T w.

A closed-form solution cannot be derived, but the objective can be minimized efficiently by standard nonlinear optimization methods because it is a convex function.
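The following sketch illustrates the LogReg estimator described above with a plain gradient descent on the convex objective; the toy densities, features, and hyper-parameters are assumptions made only for illustration.

```python
import numpy as np

def logreg_log_ratio(z_p, z_pi, sigma=1.0, lam=0.1, n_basis=100,
                     lr=0.1, n_iter=5000, rng=None):
    """Sketch of the LogReg density-ratio estimator described above: fit a
    logistic regression separating samples of p (label -1) from samples of
    pi (label +1); the fitted linear score w^T phi(z) then approximates
    ln pi(z)/p(z) up to the constant ln N^p/N^pi."""
    rng = rng or np.random.default_rng(0)
    centers = z_pi[rng.choice(len(z_pi), size=min(n_basis, len(z_pi)),
                              replace=False)]
    def phi(z):
        d2 = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    Z = np.vstack([phi(z_p), phi(z_pi)])
    eta = np.concatenate([-np.ones(len(z_p)), np.ones(len(z_pi))])
    w = np.zeros(Z.shape[1])
    for _ in range(n_iter):                 # gradient descent on the convex
        s = eta * (Z @ w)                   # regularized logistic loss
        grad = -(Z * (eta * (1.0 - 1.0 / (1.0 + np.exp(-s))))[:, None]).sum(axis=0)
        grad += 2.0 * lam * w
        w -= lr * grad / len(Z)
    return lambda z: phi(np.atleast_2d(z)) @ w

# toy check: p = N(0,1), pi = N(1,1)  ->  the true log ratio is z - 0.5
rng = np.random.default_rng(0)
z_p = rng.normal(0.0, 1.0, size=(500, 1))
z_pi = rng.normal(1.0, 1.0, size=(500, 1))
log_r = logreg_log_ratio(z_p, z_pi)
print(log_r(np.array([[1.0]])))             # roughly 0.5
```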
<2.4. Estimating the cost and value functions>
Once the density ratio π(y|x)/p(y|x) is estimated, least squares with regularization is applied to estimate the state-dependent cost function q(x) and the value function V(x). Suppose that r̂(x, y) is an approximation of the negative log ratio:

r̂(x, y) = −ln [π(y|x)/p(y|x)],

and consider the linear approximators of q(x) and V(x) defined in equations (9) and (6), respectively. The objective function is given by:

J(w_q, w_V) = Σ_j [ r̂(x_j, y_j) − q(x_j; w_q) − γV(y_j; w_V) + V(x_j; w_V) ]² + λ_q ||w_q||_1 + λ_V ||w_V||²_2,

where λ_q and λ_V are regularization constants. L2 regularization is used for w_V because it is an effective means of achieving numerical stability. On the other hand, L1 regularization is used for w_q in order to yield a sparse model that is easier to interpret. If sparsity is not important, L2 regularization can also be used for w_q. In addition, non-negativity constraints on w_q and w_V are not introduced, because equation (12) can be used by setting

C = −min_x q(x; w_q)

to satisfy the non-negativity of the cost function effectively.
In theory, any basis functions can be chosen. In one embodiment of the present invention, for simplicity, Gaussian functions of the form shown in equation (18) are used:

ψ_j(x) = exp( −||x − c_j||² / (2σ²) ),

where σ is the width parameter and the center positions c_j are randomly selected from D^π.
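The following sketch illustrates the regularized least-squares step described in this section: given r̂(x_j, y_j) for a set of transitions, it fits w_q with L1 regularization (by iterative soft-thresholding) and w_V with L2 regularization; the one-dimensional features and the synthetic targets are assumptions made for illustration.

```python
import numpy as np

def fit_cost_value(x, y, r_hat, psi_q, psi_v, gamma,
                   lam_q=0.01, lam_v=0.01, lr=0.05, n_iter=5000):
    """Sketch of the regularized least-squares step described above: given
    transitions (x_j, y_j) and targets r_hat_j ~ -ln pi(y_j|x_j)/p(y_j|x_j),
    fit q(x) = w_q^T psi_q(x) and V(x) = w_V^T psi_v(x) so that
    q(x_j) + gamma*V(y_j) - V(x_j) matches r_hat_j, with L1 on w_q
    (iterative soft-thresholding) and L2 on w_V."""
    Pq, Pvx, Pvy = psi_q(x), psi_v(x), psi_v(y)
    w_q = np.zeros(Pq.shape[1])
    w_v = np.zeros(Pvx.shape[1])
    for _ in range(n_iter):
        resid = Pq @ w_q + gamma * (Pvy @ w_v) - Pvx @ w_v - r_hat
        g_q = Pq.T @ resid / len(r_hat)
        g_v = (gamma * Pvy - Pvx).T @ resid / len(r_hat) + lam_v * w_v
        w_v -= lr * g_v
        w_q -= lr * g_q
        w_q = np.sign(w_q) * np.maximum(np.abs(w_q) - lr * lam_q, 0.0)  # soft threshold
    return w_q, w_v

# hypothetical 1-D example with Gaussian features shared by q and V
rng = np.random.default_rng(0)
centers = np.linspace(-2.0, 2.0, 15)
feat = lambda s: np.exp(-(s - centers[None, :]) ** 2 / (2 * 0.5 ** 2))
x = rng.uniform(-2, 2, size=(400, 1))
y = rng.uniform(-2, 2, size=(400, 1))
q_true = lambda s: s[:, 0] ** 2          # arbitrary "true" cost
V_true = lambda s: 2.0 * s[:, 0] ** 2    # arbitrary "true" value
r_hat = q_true(x) + 0.9 * V_true(y) - V_true(x)
w_q, w_v = fit_cost_value(x, y, r_hat, feat, feat, gamma=0.9)
resid = feat(x) @ w_q + 0.9 * (feat(y) @ w_v) - feat(x) @ w_v - r_hat
print(np.sqrt(np.mean(resid ** 2)))      # small: the Bellman relation is reproduced
```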
<3. experiment >
<3.1. Swing-up inverted pendulum>
<3.1.1. task description >
In order to demonstrate and confirm the effectiveness of the above-described embodiment belonging to embodiment 1 of the present invention, the present inventors studied the swing-up inverted pendulum problem, in which the state vector is given by the two-dimensional vector x = [θ, ω]^T, where θ and ω denote the angle and the angular velocity of the pole, respectively. The equation of motion is given by the following stochastic differential equation:

ml² dω = ( −κω + mgl sin θ + u ) dt + σ_e dw,

where l, m, g, κ, σ_e, and w denote the length of the pole, the mass, the gravitational acceleration, the coefficient of friction, the scaling parameter of the noise, and Brownian noise, respectively. In contrast to previous studies (Deisenroth et al., 2009, NPL 4; Doya, 2000, NPL 5), the applied torque u is not restricted, and the pole can be swung up directly. The corresponding state transition probability P_T(y|x,u), which is represented by a Gaussian distribution, is obtained by discretizing the time axis with step h. In the simulation, these parameters were set to fixed values.
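For illustration, the following sketch generates state transitions of a swing-up pendulum by an Euler-Maruyama discretization; the exact form of the dynamics and all parameter values in the sketch are assumptions and are not the settings used in the experiments described herein.

```python
import numpy as np

def pendulum_step(x, u, h=0.01, l=1.0, m=1.0, g=9.8, kappa=0.05,
                  sigma_e=0.5, rng=None):
    """One Euler-Maruyama step of assumed swing-up pendulum dynamics of the
    form m*l^2*domega = (-kappa*omega + m*g*l*sin(theta) + u)*dt + sigma_e*dW;
    the dynamics and every parameter value here are illustrative assumptions.
    The resulting transition y ~ P_T(y|x, u) is Gaussian."""
    rng = rng or np.random.default_rng()
    theta, omega = x
    domega = (-kappa * omega + m * g * l * np.sin(theta) + u) / (m * l ** 2)
    noise = sigma_e * np.sqrt(h) * rng.normal() / (m * l ** 2)
    return np.array([theta + h * omega, omega + h * domega + noise])

# roll out one trajectory under a random (uncontrolled) policy
rng = np.random.default_rng(0)
x = np.array([np.pi, 0.0])          # illustrative initial state
states = [x]
for t in range(500):
    u = rng.normal(0.0, 1.0)        # random torque
    x = pendulum_step(x, u, rng=rng)
    states.append(x)
print(states[-1])
```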
the inventors varied (1) the state dependent cost function q (x), (2) the uncontrolled probability p (y | x), and (3) the data set D bypAnd DπA series of experiments were performed.
< Cost functions >
The goal is to keep the pole upright, and the following three cost functions are prepared:

q_cos(x) = 1 − cos θ,  q_quad(x) = x^T Q x,  q_exp(x) = 1 − exp( −(1/2) x^T Q x ),    (19)

where Q = diag[1, 0.2]. q_cos(x) was used by Doya (2000) (NPL 5), and q_exp(x) was used by Deisenroth et al. (2009) (NPL 4).
< Uncontrolled probability >
Two densities p_G(y|x) and p_M(y|x) are considered. p_G(y|x) is constructed with a stochastic policy π(u|x) represented by a Gaussian distribution. Since the equation of motion in discrete time is given by a Gaussian, p_G(y|x) is also a Gaussian distribution. In the case of p_M(y|x), a mixture of Gaussian distributions is used as the stochastic policy.
< Preparation of the datasets >
Two sampling methods are considered. One is uniform sampling, and the other is trajectory-based sampling. In the uniform sampling method, x is sampled from a uniform distribution defined over the entire state space; in other words, p(x) and π(x) are regarded as uniform distributions. Then, y is sampled from the uncontrolled and controlled probabilities to construct D^p and D^π, respectively. In the trajectory-based sampling method, trajectories of states are generated from the same initial state x_0 using p(y|x) and π(y|x). Then, pairs of state transitions are randomly selected from the trajectories to construct D^p and D^π. In this case, p(x) is expected to be different from π(x).
For each cost function, the corresponding value function is computed by solving equation (4), and the corresponding optimal controlled probability is evaluated by equation (5). In the previous method (Todorov, 2009b, NPL 25), exp(−V(x)) is represented by a linear model, but this is difficult under the objective function (1) because the discount factor γ complicates the linear model. Therefore, the value function is approximated by the linear model shown in equation (6), and the integral is evaluated by the Metropolis-Hastings algorithm.
The method according to the embodiment of the present invention in embodiment 1 can be compared with OptV, because the assumptions of OptV are the same as those of the method according to the embodiment of the present invention. As described above, there are several variants depending on the choice of the density ratio estimation method. More specifically, the following six algorithms are considered: (1) LSCDE-IRL, (2) uLSIF-IRL, (3) LogReg-IRL, (4) Gauss-IRL, (5) LSCDE-OptV, which is the OptV method in which p(y|x) is estimated by LSCDE, and (6) Gauss-OptV, in which a Gaussian process method is used to estimate p(y|x).
The numbers of samples in D^p and D^π are set to N^p = N^π = 300. The parameters λ_q, λ_V, σ, and γ are optimized by cross-validation over the following grids: log λ_q, log λ_V ∈ linspace(−3, 1, 9), log σ ∈ linspace(−1.5, 1.5, 9), and log γ ∈ linspace(−0.2, 0, 9), where linspace(x_min, x_max, n) generates a set of n points equally spaced between x_min and x_max.
<3.1.2. Experimental results >
The accuracy of the estimated cost functions is measured by the normalized squared error for the test samples:

error = Σ_{j=1}^{N} ( q(x_j) − q̂(x_j) )² / Σ_{j=1}^{N} q(x_j)²,

where q(x_j) is one of the true cost functions shown in equation (19) evaluated at state x_j, and q̂(x_j) is the estimated cost function. Figs. 1(a)-(d) compare the accuracy of the IRL methods of this embodiment. The results show that our methods (1)-(4) performed better than the OptV methods (5)-(6) in all settings. More specifically, LogReg-IRL showed the best performance, although there were no significant differences among our methods (1)-(3). If the stochastic policy π(u|x) is given by a mixture of Gaussians, the error of the cost estimated by Gauss-IRL increases significantly, because the standard Gaussian process cannot represent a mixture of Gaussians.
Fig. 2 shows the cross-validation error with respect to the discount factor γ, where the other parameters such as λ_q, λ_V, and σ are set to their optimal values. In this simulation, the cross-validation error was minimized in all methods at the true discount factor γ = 10^{−0.025} ≈ 0.94.
As shown in Fig. 2, and also illustrated in Fig. 1, the embodiments of the present invention were shown to have sufficiently small errors, confirming the effectiveness of the present invention.
<3.2. analysis of human behavior >
<3.2.1. task description >
To evaluate our IRL algorithms in a realistic situation, the inventors conducted a dynamic motor control experiment, the pole balancing problem. Fig. 3 shows the experimental setup. A subject can move the base left, right, up, and down in order to swing the pole several times and decelerate it so as to balance it in the upright position. The dynamics are described by the six-dimensional state vector

x = [θ, θ̇, x, ẋ, y, ẏ]^T,

where θ and θ̇ denote the angle and the angular velocity of the pole, x and y denote the horizontal and vertical positions of the base, and ẋ and ẏ denote their time derivatives, respectively.
The task was performed under two conditions: long pole (73 cm) and short pole (29 cm). Each subject performed 15 trials to balance the pole under each condition. Each trial ended when the subject could keep the pole upright for 3 seconds or when 40 seconds elapsed. Data were collected from 7 subjects (5 right-handed and 2 left-handed), and the trajectory-based sampling method was used to construct the following two controlled probability datasets: D^{π,i}_train for training and D^{π,i}_test for testing the i-th subject.
It is assumed that all subjects share a unique uncontrolled probability p(y|x), which is generated by a random policy. This means that the datasets D^p_train (for training) and D^p_test (for testing) are shared among the subjects. The number of samples in each dataset is 300.
<3.2.2. Experimental results >
Fig. 4 shows the learning curves of the seven subjects, which indicate that the learning processes differed considerably among the subjects. Subjects 1 and 3 could not accomplish the task. Since the IRL algorithms should use a set of successful trajectories, we used the data from the five subjects 2 and 4-7.
The experimental results obtained with LogReg-IRL are described below (LSCDE-IRL and uLSIF-IRL showed similar results). Fig. 5 shows the estimated cost functions of subjects 4, 5, and 7 projected onto the subspace (θ, θ̇), where x, y, ẋ, and ẏ are set to zero for visualization. In the case of subject 7, the cost function under the long-pole condition is not very different from that under the short-pole condition, whereas the cost functions of subject 5, who performed poorly under the short-pole condition as shown in Fig. 4, differ significantly between the conditions.
To evaluate the cost functions estimated from the training datasets, the inventors applied forward reinforcement learning to find the optimal controlled transition probability for each estimated cost function, and then computed the negative log-likelihood of the test datasets:

NLL = −(1/N_test) Σ_{(x,y)∈D^{π,i}_test} ln π̂(y|x),

where N_test is the number of samples in D^{π,i}_test.
Fig. 6 shows the results. In the left panel (a), the test dataset of subject 4 under the long-pole condition, D^{π,4}_test, attained the minimum negative log-likelihood with the cost function estimated from the training datasets D^{π,4}_train and D^p_train under the same condition. The right panel (b) of Fig. 6 shows that the test data of subject 7 under both the long-pole and short-pole conditions were best predicted by the cost function estimated from the training dataset of the same subject 7 under the long-pole condition only. Thus, the effectiveness and utility of the embodiments of the present invention were also demonstrated and confirmed by this experiment.
The present disclosure presents a novel inverse reinforcement learning method under the LMDP framework. One feature of the present invention is relation (11), which means that the temporal difference error is zero for the optimal value function with the corresponding cost function. Since the right-hand side of equation (11) can be estimated from samples by efficient density ratio estimation methods, the IRL of the present invention results in a simple least-squares method with regularization. In addition, the method according to the embodiment of the present invention in embodiment 1 does not need to compute integrals, which are often intractable in high-dimensional continuous problems. As a result, the disclosed method is computationally less expensive than OptV.
LMDPs and path-integral methods have recently attracted attention in the fields of robotics and machine learning (Theodorou & Todorov, 2012, NPL 22), because the linearized Bellman equation has many interesting properties (Todorov, 2009a, NPL 24). They have been successfully applied to learning stochastic policies for robots with many degrees of freedom (Kinjo et al., 2013, NPL 11; Stulp & Sigaud, 2012, NPL 17; Sugimoto and Morimoto, 2011, NPL 18; Theodorou et al., 2010, NPL 21). The IRL method according to the embodiments of the present invention can be integrated with such existing forward reinforcement learning methods to design complex controllers.
As noted above, in at least some aspects of embodiment 1 of the present invention, the present disclosure provides a computational algorithm that can efficiently infer a reward/cost function from observed behaviors. The algorithms of the embodiments of the present invention may be implemented in general-purpose computer systems with appropriate hardware and software as well as in specially designed proprietary hardware/software. Various advantages according to at least some embodiments of the present invention include the following:
A) Model-free method/system: the method and system do not need to know the environmental dynamics in advance; that is, the method/system is regarded as model-free. Although some prior methods assume that the environmental dynamics are known in advance, it is not necessary to model the target dynamics explicitly.
B) Data-efficient: the dataset for the method and system according to the embodiments of the present invention consists of a set of state transitions, whereas many previous methods require a set of state trajectories. Therefore, it is easier to collect the data in the method and system according to the embodiments of the present invention.
C) Computationally efficient (1): the method and system according to the embodiments of the present invention do not need to solve the (forward) reinforcement learning problem. In contrast, some previous methods require solving the forward reinforcement learning problem many times with an estimated reward/cost function. Such a computation must be performed for each candidate, and it usually takes a long time to find the optimal solution.
D) Computationally efficient (2): the method and system according to the embodiments of the present invention use two optimization algorithms: (a) density ratio estimation and (b) regularized least squares. In contrast, some previous methods use stochastic gradient methods or Markov chain Monte Carlo methods, which usually take more time to optimize than least squares.
As described above, in one aspect, the present invention provides inverse reinforcement learning that can infer the objective function from observed state transitions generated by a demonstrator. Fig. 7 schematically shows the framework of the method according to embodiment 1 of the present invention. The embodiment of inverse reinforcement learning according to embodiment 1 of the present invention includes two components: (1) learning the ratio of the state transition probabilities with and without control by density ratio estimation, and (2) estimating the cost and value functions that are compatible with the ratio of the transition probabilities by regularized least squares. By using efficient algorithms for each step, the embodiments of the present invention are more data-efficient and computationally efficient than other inverse reinforcement learning methods.
The industrial applicability and usefulness of inverse reinforcement learning are well understood and appreciated. Examples of systems/configurations to which the embodiments of the present invention can be applied are described below.
< Imitation learning of robot behaviors >
Programming robots to perform complex tasks is difficult with standard methods such as motion planning. In many situations, it is much easier to demonstrate the desired behaviors to the robot. However, a major drawback of classical imitation learning is that the obtained controller cannot cope with new situations, because it merely reproduces the demonstrated movements. Embodiments of the present invention can estimate the objective function from the demonstrated behaviors, and the estimated objective function can then be used to learn different behaviors for different situations.
Fig. 8 schematically shows such an implementation of the present invention. First, the demonstrator controls a robot to accomplish a task, and the sequence of states and actions is recorded. Then, the inverse reinforcement learning component according to an embodiment of the present invention estimates the cost and value functions, which are then given to forward reinforcement learning controllers of different robots.
< Interpretation of human behaviors >
Understanding the human intention behind behaviors is a fundamental issue in building user-friendly support systems. In general, a behavior is represented by a sequence of states, which are extracted by a motion tracking system. The cost function estimated by the inverse reinforcement learning method/system according to the embodiments of the present invention can be regarded as a compact representation for explaining a given behavioral dataset. Through pattern classification of the estimated cost functions, the user's expertise or preference can be estimated. Fig. 9 schematically shows such an implementation according to an embodiment of the present invention.
< Analysis of the web experience >
In order to increase the probability that a visitor reads the articles presented to the visitor, the designer of an online news website, for example, should investigate the web experience of the visitor from the viewpoint of decision making. In particular, recommendation systems attract attention as an important business application of personalized services. However, previous methods such as collaborative filtering do not explicitly consider the sequential nature of decision making. Embodiments of the present invention can provide a different and effective way to model the behaviors of visitors during web browsing. Fig. 10 shows an example of a series of click actions by a user, indicating which topics the user visited in what order. The topic the visitor is reading is regarded as the state, and clicking a link is regarded as the action. Then, inverse reinforcement learning according to embodiments of the present invention can analyze the decision making in the user's web browsing. Since the estimated cost function represents the preference of the visitor, a list of articles can be recommended to the user.
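As a simple illustration of this modeling, the following sketch (with purely hypothetical topic labels and click logs) converts per-visitor click sequences into the dataset of state transitions D^π used by the inverse reinforcement learning method, where the topic being read is the state and the next topic is the result of the click action.

```python
from collections import Counter

# hypothetical per-visitor click logs: each entry is the sequence of article
# topics read during one browsing session (topic labels are illustrative)
click_logs = [
    ["politics", "economy", "economy", "sports"],
    ["economy", "economy", "technology"],
]

# D^pi = {(x_j, y_j)}: current topic (state) paired with the next topic
transitions = [(log[i], log[i + 1]) for log in click_logs
               for i in range(len(log) - 1)]
print(Counter(transitions))
```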
As described above, the inverse reinforcement learning schemes according to embodiment 1 of the present invention are applicable to a wide variety of industrial and/or commercial systems. Fig. 11 shows an example of an implementation using a general-purpose computer system and a sensor system. The methods explained above with mathematical equations may, for example, be implemented in such a general-purpose computer system. As shown in the figure, the system of this example includes a sensor system 111 (an example of a data acquisition unit) for receiving information about state transitions, i.e., the observed behaviors, from the observed object. The sensor system 111 may include one or more of the following: an image capture device with image processing software/hardware, displacement sensors, velocity sensors, acceleration sensors, microphones, keyboards, and any other input devices. The sensor system 111 is connected to a computer 112 having a processor 113 with an appropriate memory 114, so that the received data can be analyzed according to the embodiments of the present invention. The result of the analysis is output to any output system 115 (an example of an output interface), such as a display monitor, controller, or driver, or to an object to be controlled in the case where the result is used for control. The result may also be used to program, or be transferred to, another system, such as another robot or computer, or website software that responds to the user's interactions, as described above.
In the case of predicting a user's web article preferences as described above, the implemented system may include a system for inverse reinforcement learning as described in any of the embodiments above, implemented in a computer connected to the internet. The state variables that define the user's behavior include the topics of the articles that the user selects while browsing each web page. The results of the inverse reinforcement learning are then used to cause the interface through which the user browses the internet website (e.g., on a smartphone, a personal computer, etc.) to display articles recommended for the user, as in the ranking sketch below.
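A hypothetical sketch of the final recommendation step follows: once a reward has been estimated over topic states, candidate articles can be ranked by the estimated reward of their topics. The reward values and article titles below are invented for illustration.

# Estimated reward per topic state (synthetic values).
estimated_r = {"politics": 0.2, "sports": 1.3, "technology": 0.7}

candidate_articles = [("Match report", "sports"),
                      ("New GPU released", "technology"),
                      ("Budget debate", "politics")]

# Articles whose topics have higher estimated reward are shown first.
ranked = sorted(candidate_articles, key=lambda a: estimated_r[a[1]], reverse=True)
print([title for title, _ in ranked])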
< II. embodiment 2 >
Next, embodiment 2, which has characteristics superior to those of embodiment 1 in some respects, will be described. Fig. 12 schematically shows the difference between embodiment 1 and embodiment 2. As described above, and as shown in (a) of fig. 12, embodiment 1 uses a density ratio estimation algorithm twice together with the regularized least-squares method. In contrast, in embodiment 2 of the present invention, the logarithm of the density ratio π(x)/b(x) is estimated using a standard density ratio estimation (DRE) algorithm, and r(x) and V(x), the reward function and the cost function, respectively, are then obtained by estimating the logarithm of the density ratio π(x,y)/b(x,y) using the Bellman equation. In more detail, embodiment 1 requires the following three steps: (1) estimate π(x)/b(x) by a standard DRE algorithm; (2) estimate π(x,y)/b(x,y) by a standard DRE algorithm; and (3) compute r(x) and V(x) by regularized least squares using the Bellman equation. In contrast, embodiment 2 uses only a two-step optimization: (1) estimate ln π(x)/b(x) by a standard density ratio estimation (DRE) algorithm, and (2) compute r(x) and V(x) through a second DRE for ln π(x,y)/b(x,y) using the Bellman equation. A minimal numerical sketch of these two steps is given below.
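The following is a minimal sketch of the two-step scheme under simplifying assumptions (1-D states, Gaussian basis functions, equal sample sizes for π and b, and plain gradient ascent in place of a tuned optimizer). All data are synthetic, and the helper names are illustrative; this is not the patented implementation itself, only one way the two steps could be instantiated with logistic-regression DRE.

import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9

# Synthetic state transitions under the baseline policy b and the learned
# policy pi (x = current state, y = next state).
xb = rng.normal(0.0, 1.0, 500)
yb = xb + rng.normal(0.0, 0.5, 500)
xp = rng.normal(0.5, 1.0, 500)
yp = xp + rng.normal(0.3, 0.5, 500)

centers = np.linspace(-2, 2, 9)

def phi(x):
    # Gaussian basis functions over the state space.
    return np.exp(-0.5 * (np.asarray(x)[:, None] - centers) ** 2)

def fit_logreg(F1, F0, iters=2000, lr=0.1):
    # Logistic-regression density ratio estimation: rows of F1 come from pi
    # (label 1), rows of F0 from b (label 0); with equal sample sizes the
    # fitted discriminant approximates the log density ratio.
    F = np.vstack([F1, F0])
    t = np.r_[np.ones(len(F1)), np.zeros(len(F0))]
    w = np.zeros(F.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-F @ w))
        w += lr * F.T @ (t - p) / len(F)
    return w

# Step (1): estimate ln pi(x)/b(x) over states.
w_state = fit_logreg(np.c_[phi(xp), np.ones(len(xp))],
                     np.c_[phi(xb), np.ones(len(xb))])

def log_ratio_state(x):
    return np.c_[phi(x), np.ones(len(x))] @ w_state

# Step (2): fit ln pi(x,y)/b(x,y) with the Bellman-structured model
#   log_ratio_state(x) + r(x) + gamma*V(y) - V(x),
# where r(x) = theta_r . phi(x) and V(x) = theta_v . phi(x).  The step-1
# estimate enters as a fixed offset, so only theta_r and theta_v are learned.
def features(x, y):
    return np.c_[phi(x), gamma * phi(y) - phi(x)]   # [theta_r | theta_v] layout

def fit_step2(iters=2000, lr=0.1):
    F = np.vstack([features(xp, yp), features(xb, yb)])
    off = np.r_[log_ratio_state(xp), log_ratio_state(xb)]
    t = np.r_[np.ones(len(xp)), np.zeros(len(xb))]
    w = np.zeros(F.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(F @ w + off)))
        w += lr * F.T @ (t - p) / len(F)
    return w[:len(centers)], w[len(centers):]        # theta_r, theta_v

theta_r, theta_v = fit_step2()
x_test = np.array([0.0, 0.5, 1.0])
print("r:", phi(x_test) @ theta_r)
print("V:", phi(x_test) @ theta_v)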
Fig. 13 schematically illustrates the calculation scheme of the second DRE of step (2) in embodiment 2. As shown in fig. 13, because the first DRE already provides an estimate of ln π(x)/b(x), the second DRE for ln π(x,y)/b(x,y) yields an estimate of r(x) + γV(y) − V(x) through the following equations.
r(x) + γV(y) − V(x) = ln[π(x,y)/b(x,y)] − ln[π(x)/b(x)],
ln[π(x,y)/b(x,y)] = ln[π(x)/b(x)] + r(x) + γV(y) − V(x)
These equations are substantially the same as equations (11) and (15) above. Therefore, in embodiment 2, the third step (3) of embodiment 1, computed by the regularized least-squares method, is not needed, and the computational cost can be significantly reduced compared with embodiment 1. Moreover, in embodiment 2, to perform the second step (2), i.e., to compute r(x) and V(x) through the second DRE for ln π(x,y)/b(x,y) using the Bellman equation, the basis functions are designed over the state space, which reduces the number of parameters to be optimized; one such parameterization is shown in the display below. In contrast, in embodiment 1, step (2) of estimating π(x,y)/b(x,y) by a standard DRE algorithm requires basis functions designed as products over the state space, which requires the optimization of a relatively large number of parameters. Therefore, embodiment 2 requires relatively low memory usage compared with embodiment 1. Embodiment 2 thus has these distinct and significant advantages over embodiment 1. Other features and settings of embodiment 2 are the same as the various methods and schemes described above for embodiment 1, unless specifically noted otherwise below.
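To make the parameter-count comparison concrete, the display below gives one possible parameterization of the second DRE with K basis functions ψ and φ defined on the state space; this is an assumption consistent with the description above, not a quotation of equations (11) or (15).

\[
\ln\frac{\pi(x,y)}{b(x,y)} \;\approx\; \widehat{\ln\frac{\pi(x)}{b(x)}} \;+\; \theta_r^{\top}\psi(x) \;+\; \gamma\,\theta_V^{\top}\phi(y) \;-\; \theta_V^{\top}\phi(x)
\]

Under this parameterization, only on the order of 2K coefficients (θ_r and θ_V) are optimized in step (2), whereas a generic model of π(x,y)/b(x,y) built from products ψ_i(x)φ_j(y) of state basis functions would require on the order of K² coefficients.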
Table 1 below shows a general comparison of embodiment 2 with various conventional methods. Specifically, embodiment 2 is compared, feature by feature, with the above-described OptV, maximum entropy IRL (MaxEnt-IRL), and relative entropy IRL (RelEnt-IRL). As shown in table 1, embodiment 2 of the present invention has various advantages over these conventional methods.
[Table 1: feature-by-feature comparison of embodiment 2 with OptV, MaxEnt-IRL, and RelEnt-IRL]
In order to demonstrate and confirm the effectiveness of embodiment 2 of the present invention, the aforementioned inverted pendulum problem was studied. FIG. 14 shows the results of experiments comparing embodiment 2 and embodiment 1 with MaxEnt-IRL, RelEnt-IRL, and OptV. In the figure, embodiment 2 is denoted as "new invention" and embodiment 1 as "PCT/JP2015/004001". As shown in fig. 14, even with a small number of samples, embodiment 2 recovered the observed policy better than the other methods, including embodiment 1.
< robot navigation task experiment >
To further demonstrate and confirm the effectiveness of embodiment 2 of the present invention, a robot navigation task was studied for embodiment 2, embodiment 1, and RelEnt-IRL. Three target objects, red (r), green (g), and blue (b), are placed in front of a programmable robot with camera eyes. The goal is to reach the green (g) target among the three targets. Five predetermined starting positions A-E are arranged in front of the three objects. Training data is collected from the starting positions A-C and E, while test data is acquired from the starting position D. The state vector is x = [θ_r, N_r, θ_g, N_g, θ_b, N_b, θ_pan, θ_tilt]^T, where θ_i (i = r, g, b) is the angle to target i, N_i (i = r, g, b) is the blob size of target i, and θ_pan and θ_tilt are the pan and tilt angles of the robot camera. The basis functions for V(x) are given as follows:
ψ_V,i(x) = exp(−‖x − c_i‖² / (2σ²))
where c_i is a center location selected from the data set. The basis functions for r(x) are given by:
ψ_q(x) = [f_g(θ_r), f_s(N_r), f_g(θ_g), f_s(N_g), f_g(θ_b), f_s(N_b)]^T
where f_g is a Gaussian function and f_s is a sigmoid function. In this experiment, π and b were given by the experimenter, and 10 trajectories were collected from each starting point to create a data set. Fig. 15 shows the experimental results. In this figure, embodiment 2 is denoted as "new invention" and embodiment 1 as "PCT/JP2015/004001". The results are compared with those of RelEnt-IRL, as described above. As shown in fig. 15, embodiment 2 produced significantly better results. This also shows that the cost function estimated according to embodiment 2 can be used as a potential function for reward shaping. A sketch of this basis-function design is given below.
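The sketch below constructs feature vectors of the kind described above, with invented widths and centers; the state layout follows x = [θ_r, N_r, θ_g, N_g, θ_b, N_b, θ_pan, θ_tilt], and the specific parameter values are assumptions, not the values used in the experiment.

import numpy as np

def f_g(theta, sigma=0.5):
    # Gaussian feature of an angle (width sigma is invented).
    return np.exp(-0.5 * (theta / sigma) ** 2)

def f_s(n, scale=50.0):
    # Sigmoid feature of a blob size (scale is invented).
    return 1.0 / (1.0 + np.exp(-n / scale))

def psi_q(x):
    # Basis for the reward r(x): angles through f_g, blob sizes through f_s.
    th_r, n_r, th_g, n_g, th_b, n_b = x[:6]
    return np.array([f_g(th_r), f_s(n_r), f_g(th_g), f_s(n_g), f_g(th_b), f_s(n_b)])

def psi_v(x, centers, sigma=1.0):
    # Gaussian basis for the cost V(x), centered at points chosen from the data.
    x = np.asarray(x)
    return np.exp(-np.sum((x - centers) ** 2, axis=1) / (2.0 * sigma ** 2))

centers = np.tile(np.linspace(-1.0, 1.0, 5)[:, None], (1, 8))   # 5 toy centers in R^8
x_example = np.array([0.1, 120.0, -0.2, 80.0, 0.3, 60.0, 0.0, 0.1])
print(psi_q(x_example))
print(psi_v(x_example, centers))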
The calculation time (in minutes) for the inverted pendulum task discussed above was also measured. The LogReg-IRL and KLIEP-IRL variants of embodiment 2 each require only about 2.5 minutes of computation. The uLSIF-IRL, LSCDE-IRL, and LogReg-IRL variants of embodiment 1 take about 4 to 9.5 minutes. Thus, embodiment 2 requires significantly less computation time than the various versions of embodiment 1 discussed above.
It is easily understood that the applications of embodiment 2 are substantially the same as the various applications described above for embodiment 1. Specifically, as described above, the various versions of embodiment 2 are particularly applicable to interpreting human behavior, analyzing web experiences, and designing robot controllers by demonstration, in which the objective function to be estimated is the immediate reward underlying the demonstrated behavior. The robot can then use the estimated reward to generate behavior for conditions it has not experienced, through forward reinforcement learning. Therefore, a highly economical and reliable system and method can be constructed according to embodiment 2 of the present invention. In particular, as described above, embodiment 2 can recover the observed policy from a small number of observations better than other methods, which is a significant advantage.
It will be apparent to those skilled in the art that various modifications and variations can be made in the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention cover the modifications and variations of this invention provided they come within the scope of the appended claims and their equivalents. In particular, it is expressly contemplated that any portion or all of any two or more of the above-described embodiments and modifications thereof may be combined and considered to be within the scope of the present invention.

Claims (10)

1. A method for inverse reinforcement learning to estimate a reward function and a cost function of a behavior of an object, wherein the object includes a demonstrator and a robot, the method comprising:
obtaining data representing changes in state variables defining behavior of the object;
applying the modified Bellman equation given by equation (1) to the acquired data:
r(x) + γV(y) − V(x) = ln[π(y|x)/b(y|x)]    (1)
= ln[π(x,y)/b(x,y)] − ln[π(x)/b(x)]    (2)
where r(x) and V(x) represent a reward function and a cost function, respectively, in state x, γ represents a discount factor, and b(y|x) and π(y|x) represent state transition probabilities before and after learning, respectively;
estimating the logarithm of the density ratio π(x)/b(x) in equation (2);
estimating r(x) and V(x) in equation (2) based on the result of estimating the logarithm of the density ratio π(x,y)/b(x,y); and
outputting the estimated r(x) and V(x) to estimate the behavior of the object.
2. The method of claim 1, wherein the step of estimating the logarithm of the density ratios π(x)/b(x) and π(x,y)/b(x,y) comprises using the Kullback-Leibler Importance Estimation Procedure (KLIEP) with a log-linear model.
3. The method of claim 1, wherein the step of estimating the logarithm of the density ratios π(x)/b(x) and π(x,y)/b(x,y) comprises using logistic regression.
4. A method for inverse reinforcement learning to estimate a reward function and a cost function of a behavior of an object, wherein the object includes a demonstrator and a robot, the method comprising:
obtaining data representing a state transition with an action defining a behavior of the object;
applying the modified Bellman equation given by equation (3) to the acquired data:
r(x) + γV(y) − V(x) = ln[π(u|x)/b(u|x)]    (3)
= ln[π(x,u)/b(x,u)] − ln[π(x)/b(x)]    (4)
where r(x) and V(x) represent a reward function and a cost function, respectively, for state x, γ represents a discount factor, and b(u|x) and π(u|x) represent stochastic policies before and after learning, respectively, each representing the probability of selecting action u in state x;
estimating the logarithm of the density ratio π(x)/b(x) in equation (4);
estimating r(x) and V(x) in equation (4) based on the result of estimating the logarithm of the density ratio π(x,u)/b(x,u); and
outputting the estimated r(x) and V(x) to estimate the behavior of the object.
5. The method of claim 4, wherein the step of estimating the logarithm of the density ratios π(x)/b(x) and π(x,u)/b(x,u) comprises using the Kullback-Leibler Importance Estimation Procedure (KLIEP) with a log-linear model.
6. The method of claim 4, wherein the step of estimating the logarithm of the density ratios π(x)/b(x) and π(x,u)/b(x,u) comprises using logistic regression.
7. A non-transitory storage medium storing instructions for causing a processor to execute an algorithm for inverse reinforcement learning for estimating a reward function and a cost function of a behavior of an object, wherein the object includes a demonstrator and a robot, the instructions causing the processor to perform the following steps:
obtaining data representing changes in state variables defining behavior of the object;
applying the modified Bellman equation given by equation (1) to the acquired data:
r(x) + γV(y) − V(x) = ln[π(y|x)/b(y|x)]    (1)
= ln[π(x,y)/b(x,y)] − ln[π(x)/b(x)]    (2)
where r(x) and V(x) represent a reward function and a cost function, respectively, in state x, γ represents a discount factor, and b(y|x) and π(y|x) represent state transition probabilities before and after learning, respectively;
estimating the logarithm of the density ratio π(x)/b(x) in equation (2);
estimating r(x) and V(x) in equation (2) based on the result of estimating the logarithm of the density ratio π(x,y)/b(x,y); and
outputting the estimated r(x) and V(x) to estimate the behavior of the object.
8. A system for inverse reinforcement learning to estimate a reward function and a cost function of a behavior of an object, wherein the object includes a demonstrator and a robot, the system comprising:
a data acquisition unit for acquiring data representing a change in a state variable defining a behavior of the object;
a processor having a memory, the processor and the memory configured to:
apply the modified Bellman equation given by equation (1) to the acquired data:
r(x) + γV(y) − V(x) = ln[π(y|x)/b(y|x)]    (1)
= ln[π(x,y)/b(x,y)] − ln[π(x)/b(x)]    (2)
where r(x) and V(x) represent a reward function and a cost function, respectively, in state x, γ represents a discount factor, and b(y|x) and π(y|x) represent state transition probabilities before and after learning, respectively;
estimate the logarithm of the density ratio π(x)/b(x) in equation (2);
estimate r(x) and V(x) in equation (2) based on the result of estimating the logarithm of the density ratio π(x,y)/b(x,y); and
an output interface that outputs the estimated r(x) and V(x) to estimate the behavior of the object.
9. A system for predicting a preference regarding topics of articles that a user is likely to read, from a series of articles selected by the user while browsing the internet, the system being an inverse reinforcement learning system for estimating a reward function and a cost function of a behavior of an object, implemented in a computer connected to the internet, the system comprising:
a data acquisition unit for acquiring data representing a change in a state variable defining a behavior of the object;
a processor having a memory, the processor and the memory configured to:
apply the modified Bellman equation given by equation (1) to the acquired data:
r(x) + γV(y) − V(x) = ln[π(y|x)/b(y|x)]    (1)
= ln[π(x,y)/b(x,y)] − ln[π(x)/b(x)]    (2)
where r(x) and V(x) represent a reward function and a cost function, respectively, in state x, γ represents a discount factor, and b(y|x) and π(y|x) represent state transition probabilities before and after learning, respectively;
estimate the logarithm of the density ratio π(x)/b(x) in equation (2);
estimate r(x) and V(x) in equation (2) based on the result of estimating the logarithm of the density ratio π(x,y)/b(x,y); and
an output interface that outputs the estimated r(x) and V(x) to estimate the behavior of the object,
wherein the object is the user and the state variables defining the behavior of the object include a topic of an article selected by the user while browsing each web page, and
wherein the processor causes an interface through which the user browses the internet website to display articles recommended for the user to read, in accordance with the estimated reward function and the estimated cost function.
10. A method for programming a robot to perform complex tasks, the method comprising:
controlling a first robot to complete a task so as to record a sequence of states and actions;
estimating the reward function and cost function with the system for inverse reinforcement learning of claim 8 based on the recorded sequence of states and actions; and
providing the estimated reward function and cost function to a forward reinforcement learning controller of a second robot to program the second robot with the estimated reward function and cost function.
CN201780017406.2A 2016-03-15 2017-02-07 Direct inverse reinforcement learning using density ratio estimation Active CN108885721B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US201662308722P 2016-03-15 2016-03-15
US62/308,722 2016-03-15
PCT/JP2017/004463 WO2017159126A1 (en) 2016-03-15 2017-02-07 Direct inverse reinforcement learning with density ratio estimation

Publications (2)

Publication Number Publication Date
CN108885721A CN108885721A (en) 2018-11-23
CN108885721B true CN108885721B (en) 2022-05-06

Family

ID=59851115

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201780017406.2A Active CN108885721B (en) 2016-03-15 2017-02-07 Direct inverse reinforcement learning using density ratio estimation

Country Status (5)

Country Link
EP (1) EP3430578A4 (en)
JP (1) JP6910074B2 (en)
KR (1) KR102198733B1 (en)
CN (1) CN108885721B (en)
WO (1) WO2017159126A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7464115B2 (en) * 2020-05-11 2024-04-09 日本電気株式会社 Learning device, learning method, and learning program

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8756177B1 (en) * 2011-04-18 2014-06-17 The Boeing Company Methods and systems for estimating subject intent from surveillance
CN104573621A (en) * 2014-09-30 2015-04-29 李文生 Dynamic gesture learning and identifying method based on Chebyshev neural network
WO2016021210A1 (en) * 2014-08-07 2016-02-11 Okinawa Institute Of Science And Technology School Corporation Inverse reinforcement learning by density ratio estimation

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8359226B2 (en) * 2006-01-20 2013-01-22 International Business Machines Corporation System and method for marketing mix optimization for brand equity management
US9090255B2 (en) * 2012-07-12 2015-07-28 Honda Motor Co., Ltd. Hybrid vehicle fuel efficiency using inverse reinforcement learning

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8756177B1 (en) * 2011-04-18 2014-06-17 The Boeing Company Methods and systems for estimating subject intent from surveillance
WO2016021210A1 (en) * 2014-08-07 2016-02-11 Okinawa Institute Of Science And Technology School Corporation Inverse reinforcement learning by density ratio estimation
CN104573621A (en) * 2014-09-30 2015-04-29 李文生 Dynamic gesture learning and identifying method based on Chebyshev neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A Density-ratio Framework for Statistical Data Processing; Masashi Sugiyama et al.; IPSJ Transactions on Computer Vision and Applications; 2009-09-01; entire document *
Multi-robot inverse reinforcement learning under occlusion with interactions; Kenneth Bogert et al.; AAMAS '14: Proceedings of the 2014 International Conference on Autonomous Agents and Multi-Agent Systems; 2014-05-05; entire document *

Also Published As

Publication number Publication date
JP2019508817A (en) 2019-03-28
WO2017159126A1 (en) 2017-09-21
KR20180113587A (en) 2018-10-16
EP3430578A4 (en) 2019-11-13
KR102198733B1 (en) 2021-01-05
EP3430578A1 (en) 2019-01-23
JP6910074B2 (en) 2021-07-28
CN108885721A (en) 2018-11-23

Similar Documents

Publication Publication Date Title
CN106575382B (en) Computer method and system for estimating object behavior, system and medium for predicting preference
Chatzis et al. Echo state Gaussian process
US10896383B2 (en) Direct inverse reinforcement learning with density ratio estimation
Moreno-Muñoz et al. Heterogeneous multi-output Gaussian process prediction
Rothkopf et al. Modular inverse reinforcement learning for visuomotor behavior
Zhe et al. Scalable high-order gaussian process regression
Wang et al. Focused model-learning and planning for non-Gaussian continuous state-action systems
Osa Motion planning by learning the solution manifold in trajectory optimization
Chatzis et al. The copula echo state network
Stojkovic et al. Distance Based Modeling of Interactions in Structured Regression.
Amini et al. POMCP-based decentralized spatial task allocation algorithms for partially observable environments
Wang et al. Dynamic-resolution model learning for object pile manipulation
CN108885721B (en) Direct inverse reinforcement learning using density ratio estimation
Yamaguchi et al. Model-based multi-objective reinforcement learning with unknown weights
Obukhov et al. Neural network method for automatic data generation in adaptive information systems
Zhou et al. Bayesian inference for data-efficient, explainable, and safe robotic motion planning: A review
Vien et al. A covariance matrix adaptation evolution strategy for direct policy search in reproducing kernel Hilbert space
Matsumoto et al. Mobile robot navigation using learning-based method based on predictive state representation in a dynamic environment
Theodoropoulos et al. Cyber-physical systems in non-rigid assemblies: A methodology for the calibration of deformable object reconstruction models
Okadome et al. Predictive control method for a redundant robot using a non-parametric predictor
Liu et al. Distributional reinforcement learning with epistemic and aleatoric uncertainty estimation
Meden et al. First steps towards state representation learning for cognitive robotics
Pinto et al. One-shot learning in the road sign problem
Keurulainen Improving the sample efficiency of few-shot reinforcement learning with policy embeddings
Hanesz et al. Meta-Learning Path Planning Networks

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant