CN108885721B - Direct inverse reinforcement learning using density ratio estimation - Google Patents

Direct inverse reinforcement learning using density ratio estimation

Info

Publication number
CN108885721B
Authority
CN
China
Legal status
Active
Application number
CN201780017406.2A
Other languages
Chinese (zh)
Other versions
CN108885721A (en)
Inventor
Eiji Uchibe
Kenji Doya
Current Assignee
Okinawa Institute of Science and Technology School Corp
Original Assignee
Okinawa Institute of Science and Technology School Corp
Application filed by Okinawa Institute of Science and Technology School Corp filed Critical Okinawa Institute of Science and Technology School Corp
Publication of CN108885721A
Application granted
Publication of CN108885721B
Legal status: Active

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 — Machine learning
    • G06N 7/00 — Computing arrangements based on specific mathematical models
    • G06N 7/01 — Probabilistic graphical models, e.g. probabilistic networks

Abstract

A method of inverse reinforcement learning for estimating a reward function and a value function of a behavior of an object, the method comprising: acquiring data representing changes in state variables that define the behavior of the object; applying a modified Bellman equation given by equation (1) to the acquired data:

r(x) + γV(y) − V(x) = ln [π(y|x)/b(y|x)]    (1)
                    = ln [π(x,y)/b(x,y)] − ln [π(x)/b(x)],    (2)

where r(x) and V(x) denote a reward function and a value function, respectively, in state x, γ denotes a discount factor, and b(y|x) and π(y|x) denote the state transition probabilities before and after learning, respectively; estimating the logarithm of the density ratio π(x)/b(x) in equation (2); estimating r(x) and V(x) in equation (2) from the result of estimating the logarithm of the density ratio π(x,y)/b(x,y); and outputting the estimated r(x) and V(x).

Description

Direct inverse reinforcement learning using density ratio estimation
Technical Field
The present invention relates to inverse reinforcement learning, and more particularly to systems and methods of inverse reinforcement learning. This application claims the benefit of U.S. Provisional Application No. 62/308,722, filed March 15, 2016, which is incorporated herein by reference.
Background
Understanding human behavior from observation is crucial for developing artificial systems that can interact with humans. Because our decision-making processes are influenced by the rewards/costs associated with the selected actions, the problem can be formulated as estimating the rewards/costs from observed behavior.
The idea of inverse reinforcement learning was originally proposed by Ng and Russell (2000) (NPL 14). The OptV algorithm proposed by Dvijotham and Todorov (2010) (NPL 6) is a prior work showing that the demonstrator's policy is approximated by the value function, which is a solution of the linearized Bellman equation.
In general, reinforcement learning (RL) is a computational framework for studying the decision-making processes of both biological and artificial systems that can learn an optimal policy by interacting with the environment. There remain several open problems in RL, one of which is how to design and prepare an appropriate reward/cost function. It is easy to design a sparse reward function that gives a positive reward when a task is accomplished and zero otherwise, but such a reward makes it difficult to find an optimal policy.
In some situations, it is easier to prepare examples of desired behavior than to hand-craft an appropriate reward/cost function. Recently, several inverse reinforcement learning (IRL) methods (Ng & Russell, 2000, NPL 14) and apprenticeship learning methods (Abbeel & Ng, 2004, NPL 1) have been proposed in order to derive a reward/cost function from the demonstrator's performance and to implement imitation learning. However, most of the existing studies (Abbeel & Ng, 2004, NPL 1; Ratliff et al., 2009, NPL 16; Ziebart et al., 2008, NPL 26) require a routine that solves the forward reinforcement learning problem with the estimated reward/cost function. This process is usually very time-consuming, even when an environmental model is available.
Recently, the concept of linearly solvable Markov decision processes (LMDPs) (Todorov, 2007; 2009, NPL 23-NPL 24), which is a subclass of Markov decision processes obtained by restricting the form of the cost function, has been introduced. This restriction plays an important role in IRL. LMDPs are also known as KL control, and path-integral methods (Kappen et al., 2012, NPL 10; Theodorou et al., 2010, NPL 21) and similar ideas have been proposed in the field of control theory (Fleming and Soner, 2006, NPL 7). Aghasadeghi & Bretl (2011) (NPL 2) and Kalakrishnan et al. (2013) (NPL 8) proposed model-free IRL algorithms based on the path-integral method. Since the likelihood of an optimal trajectory is parameterized by the cost function, the cost parameters can be optimized by maximizing the likelihood. However, their methods require entire trajectory data. A model-based IRL method was proposed by Dvijotham and Todorov (2010) (NPL 6) based on the LMDP framework, in which the likelihood of the optimal state transition is represented by the value function. In contrast to the path-integral approaches to IRL, it can be optimized from any dataset of state transitions. A major drawback is the evaluation of integrals that cannot be solved analytically. In practice, they discretized the state space to replace the integrals with sums, but this is not feasible in high-dimensional continuous problems.
CITATION LIST
Patent document
PTL 1: U.S. Pat. No.8,756,177, Methods and systems for evaluating subject intent from perspective.
PTL 2: U.S. Pat. No.7,672,739 System for multiresolution analysis assisted recovery from run-by-run control.
PTL 3: japanese patent No. 5815458. A method and program for simulating a simulation device.
Non-patent document
NPL 1:Abbeel,P.and Ng,A.Y.Apprenticeship learning via inverse reinforcement learning.In Proc.of the 21st International Conference on Machine Learning,2004.
NPL 2:Aghasadeghi,N.and Bretl,T.Maximum entropy inverse reinforcement learning in continuous state spaces with path integrals.In Proc.of IEEE/RSJ International Conference on Intelligent Robots and Systems,pp.1561-1566,2011.
NPL 3:Boularias,A.,Kober,J.,and Peters,J.Relative entropy inverse reinforcement learning.In Proc.of the 14th International Conference on Artificial Intelligence and Statistics,volume 15,2011.
NPL 4:Deisenroth,M.P.,Rasmussen,C.E.,and Peters,J.Gaussian process dynamic programming.Neurocomputing,72(7-9):1508-1524,2009.
NPL 5:Doya,K.Reinforcement learning in continuous time and space.Neural Computation,12:219-245,2000.
NPL 6:Dvijotham,K.and Todorov,E.Inverse optimal control with linearly solvable MDPs.In Proc.of the 27th International Conference on Machine Learning,2010.
NPL 7:Fleming,W.H.and Soner,H.M.Controlled Markov Processes and Viscosity Solutions.Springer,second edition,2006.
NPL 8:Kalakrishnan,M.,Pastor,P.,Righetti,L.,and Schaal,S.Learning objective functions for manipulation.In Proc.of IEEE International Conference on Robotics and Automation,pp.1331-1336,2013.
NPL 9:Kanamori,T.,Hido,S.,and Sugiyama,M.A Least-squares Approach to Direct Importance Estimation.Journal of Machine Learning Research,10:1391-1445,2009.
NPL 10:Kappen,H.J.,Gomez,V.,and Opper,M.Optimal control as a graphical model inference problem.Machine Learning,87(2):159-182,2012.
NPL 11:Kinjo,K.,Uchibe,E.,and Doya,K.Evaluation of linearly solvable Markov decision process with dynamic model learning in a mobile robot navigation task.Frontiers in Neurorobotics,7(7),2013.
NPL 12:Levine,S.and Koltun,V.Continuous inverse optimal control with locally optimal examples.In Proc.of the 27th International Conference on Machine Learning,2012.
NPL 13:Levine,S.,Popovic,Z.,and Koltun,V.Nonlinear inverse reinforcement learning with Gaussian processes.Advances in Neural Information Processing Systems 24,pp.19-27.2011.
NPL 14:Ng,A.Y.and Russell,S.Algorithms for inverse reinforcement learning.In Proc.of the 17th International Conference on Machine Learning,2000.
NPL 15:Rasmussen,C.E.and Williams,C.K.I.Gaussian Processes for Machine Learning.MIT Press,2006.
NPL 16:Ratliff,N.D.,Silver,D.,and Bagnell,J.A.Learning to search:Functional gradient techniques for imitation learning.Autonomous Robots,27(1):25-53,2009.
NPL 17:Stulp,F.and Sigaud,O.Path integral policy improvement with covariance matrix adaptation.In Proc.of the 10th European Workshop on Reinforcement Learning,2012.
NPL 18:Sugimoto,N.and Morimoto,J.Phase-dependent trajectory optimization for periodic movement using path integral reinforcement learning.In Proc.of the 21st Annual Conference of the Japanese Neural Network Society,2011.
NPL 19:Sugiyama,M.,Takeuchi,I.,Suzuki,T.,Kanamori,T.,Hachiya,H.,and Okanohara,D.Least-squares conditional density estimation.IEICE Transactions on Information and Systems,E93-D(3):583-594,2010.
NPL 20:Sugiyama,M.,Suzuki,T.,and Kanamori,T.Density ratio estimation in machine learning.Cambridge University Press,2012.
NPL 21:Theodorou,E.,Buchli,J.,and Schaal,S.A generalized path integral control approach to reinforcement learning.Journal of Machine Learning Research,11:3137-3181,2010.
NPL 22:Theodorou,E.A and Todorov,E.Relative entropy and free energy dualities:Connections to path integral and KL control.In Proc.of the 51st IEEE Conference on Decision and Control,pp.1466-1473,2012.
NPL 23:Todorov,E.Linearly-solvable Markov decision problems.Advances in Neural Information Processing Systems 19,pp.1369-1376.MIT Press,2007.
NPL 24:Todorov,E.Efficient computation of optimal actions.Proceedings of the National Academy of Sciences of the United States of America,106(28):11478-83,2009.
NPL 25:Todorov,E.Eigenfunction approximation methods for linearly-solvable optimal control problems.In Proc.of the 2nd IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning,pp.161-168,2009.
NPL 26:Ziebart,B.D.,Maas,A.,Bagnell,J.A.,and Dey,A.K.Maximum entropy inverse reinforcement learning.In Proc.of the 23rd AAAI Conference on Artificial Intelligence,2008.
NPL 27:Vroman,M.(2014).Maximum likelihood inverse reinforcement learning.PhD Thesis,Rutgers University,2014.
NPL 28:Raita,H.(2012).On the performance of maximum likelihood inverse reinforcement learning.arXiv preprint.
NPL 29:Choi,J.and Kim,K.(2012).Nonparametric Bayesian inverse reinforcement learning for multiple reward functions.NIPS 25.
NPL 30:Choi,J.and Kim,K.(2011).Inverse reinforcement learning in partially observable environments.Journal of Machine Learning Research.
NPL 31:Neu,G.and Szepesvari,C.(2007).Apprenticeship learning using inverse reinforcement learning and gradient methods.In Proc.of UAI.
NPL 32:Mahadevan,S.(2005).Proto-value functions:developmental reinforcement learning.In Proc.of the 22nd ICML.
Disclosure of Invention
Technical problem
Inverse reinforcement learning is a framework for solving the above problems, but as described above, the existing methods have the following drawbacks: (1) they are intractable when the state is continuous, (2) they are computationally expensive, and (3) entire state trajectories are needed for the estimation. The methods disclosed in the present disclosure address these drawbacks. In particular, the previous method proposed in NPL 14 is not as effective as reported in many previous studies. Furthermore, the method proposed in NPL 6 cannot solve continuous problems in practice, because its algorithm involves a complicated evaluation of integrals.
The present invention is directed to systems and methods for inverse reinforcement learning.
An object of the present invention is to provide a new and improved inverse reinforcement learning system and method that obviate one or more of the problems of the prior art.
Solution to the problem
To achieve these and other advantages and in accordance with the purpose of the present invention, as embodied and broadly described, in one aspect, the present invention provides a method of inverse reinforcement learning for estimating a reward function and a value function of a behavior of an object, the method comprising: acquiring data representing changes in state variables that define the behavior of the object; applying a modified Bellman equation given by equation (1) to the acquired data:

r(x) + γV(y) − V(x) = ln [π(y|x)/b(y|x)]    (1)
                    = ln [π(x,y)/b(x,y)] − ln [π(x)/b(x)],    (2)

where r(x) and V(x) denote a reward function and a value function, respectively, in state x, γ denotes a discount factor, and b(y|x) and π(y|x) denote the state transition probabilities before and after learning, respectively; estimating the logarithm of the density ratio π(x)/b(x) in equation (2); estimating r(x) and V(x) in equation (2) from the result of estimating the logarithm of the density ratio π(x,y)/b(x,y); and outputting the estimated r(x) and V(x).
In another aspect, the present invention provides a method of inverse reinforcement learning for estimating a reward function and a value function of a behavior of an object, the method comprising: obtaining data representing state transitions with actions that define the behavior of the object; applying a modified Bellman equation given by equation (3) to the acquired data:

r(x) + γV(y) − V(x) = ln [π(u|x)/b(u|x)]    (3)
                    = ln [π(x,u)/b(x,u)] − ln [π(x)/b(x)],    (4)

where r(x) and V(x) denote a reward function and a value function, respectively, in state x, γ denotes a discount factor, y denotes a state to which the state x transitions, and b(u|x) and π(u|x) denote stochastic policies before and after learning, respectively, representing the probability of selecting action u in state x; estimating the logarithm of the density ratio π(x)/b(x) in equation (4); estimating r(x) and V(x) in equation (4) based on the result of estimating the logarithm of the density ratio π(x,u)/b(x,u); and outputting the estimated r(x) and V(x).
In another aspect, the invention provides a non-transitory storage medium storing instructions for causing a processor to execute an algorithm of inverse reinforcement learning for estimating a reward function and a value function of a behavior of an object, the instructions causing the processor to perform the following steps: acquiring data representing changes in state variables that define the behavior of the object; applying a modified Bellman equation given by equation (1) to the acquired data:

r(x) + γV(y) − V(x) = ln [π(y|x)/b(y|x)]    (1)
                    = ln [π(x,y)/b(x,y)] − ln [π(x)/b(x)],    (2)

where r(x) and V(x) denote a reward function and a value function, respectively, in state x, γ denotes a discount factor, and b(y|x) and π(y|x) denote the state transition probabilities before and after learning, respectively; estimating the logarithm of the density ratio π(x)/b(x) in equation (2); estimating r(x) and V(x) in equation (2) based on the result of estimating the logarithm of the density ratio π(x,y)/b(x,y); and outputting the estimated r(x) and V(x).
In another aspect, the present invention provides an inverse reinforcement learning system for estimating a reward function and a value function of a behavior of an object, the system comprising: a data acquisition unit for acquiring data representing changes in state variables that define the behavior of the object; a processor with a memory, the processor and the memory being configured to: apply a modified Bellman equation given by equation (1) to the acquired data:

r(x) + γV(y) − V(x) = ln [π(y|x)/b(y|x)]    (1)
                    = ln [π(x,y)/b(x,y)] − ln [π(x)/b(x)],    (2)

where r(x) and V(x) denote a reward function and a value function, respectively, in state x, γ denotes a discount factor, and b(y|x) and π(y|x) denote the state transition probabilities before and after learning, respectively; estimate the logarithm of the density ratio π(x)/b(x) in equation (2); and estimate r(x) and V(x) in equation (2) based on the result of estimating the logarithm of the density ratio π(x,y)/b(x,y); and an output interface that outputs the estimated r(x) and V(x).
In another aspect, the present invention provides a system for predicting a preference in topics of articles that a user is likely to read, based on a series of articles selected by the user while browsing the Internet, the system comprising: the system for inverse reinforcement learning according to claim 8, implemented in a computer connected to the Internet, wherein the object is the user and the state variables defining the behavior of the object include topics of the articles selected by the user while browsing each webpage, and wherein the processor causes an interface of an Internet website that the user is browsing to display articles recommended for the user to read in accordance with the estimated reward function and value function.
In another aspect, the present invention provides a method for programming a robot to perform a complex task, the method comprising: controlling a first robot to accomplish the task so as to record a sequence of states and actions; estimating the reward function and the value function with the system for inverse reinforcement learning according to claim 8 based on the recorded sequence of states and actions; and providing the estimated reward function and value function to a forward reinforcement learning controller of a second robot so as to program the second robot with the estimated reward function and value function.
[Advantageous Effects of Invention]
According to one or more aspects of the present invention, inverse reinforcement learning can be efficiently and effectively performed. In some embodiments, there is no need to know the environment dynamics in advance and no need to perform integration.
Additional or separate features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are intended to provide further explanation of the invention as claimed.
Drawings
Fig. 1 shows the normalized squared errors of the swing-up inverted pendulum experiments to which embodiments of the present invention are applied, for each of the following density ratio estimation methods: (1) LSCDE-IRL, (2) uLSIF-IRL, (3) LogReg-IRL, (4) Gauss-IRL, (5) LSCDE-OptV, and (6) Gauss-OptV. Panels (a)-(d) differ from each other in the sampling method and other settings.
Fig. 2 is a graph showing the cross-validation errors of the various density ratio estimation methods in the swing-up inverted pendulum experiments.
Fig. 3 shows the experimental setup of the pole balancing task for the long pole; left: start position, middle: target position, right: state variables.
Fig. 4 shows the learning curves of the subjects in the pole balancing task experiments according to an embodiment of the present invention; solid line: long pole, dashed line: short pole.
Fig. 5 shows the estimated cost functions of subjects 4, 5, and 7, derived from the pole balancing task experiments according to an embodiment of the present invention, projected onto a defined subspace.
Fig. 6 shows the negative log-likelihoods of the test datasets in the pole balancing task experiments for subjects 4 and 7, used to evaluate the estimated cost functions.
Fig. 7 schematically shows the framework of inverse reinforcement learning according to embodiment 1 of the present invention, which can infer an objective function from observed state transitions generated by a demonstrator.
Fig. 8 is a schematic block diagram showing an example of an implementation of the inverse reinforcement learning of the present invention for imitation learning of robot behaviors.
Fig. 9 is a schematic block diagram showing an example of an implementation of the inverse reinforcement learning of the present invention for interpreting human behaviors.
Fig. 10 schematically shows a series of click actions by a web visitor, illustrating the visitor's topic preferences while browsing the web.
Fig. 11 schematically illustrates an example of an inverse reinforcement learning system according to an embodiment of the present invention.
Fig. 12 schematically shows the difference between embodiment 1 and embodiment 2 of the present invention.
Fig. 13 schematically illustrates a calculation scheme of the second DRE of step (2) in embodiment 2.
Fig. 14 shows the results of the swing-up inverted pendulum experiments comparing embodiment 2 with embodiment 1 and other methods.
Fig. 15 shows experimental results of the robot navigation task using embodiments 1 and 2 and RelEnt-IRL.
Detailed Description
The present disclosure provides a novel inverse reinforcement learning method and system based on density ratio estimation under the framework of linearly solvable Markov decision processes (LMDPs). In an LMDP, the logarithm of the ratio between the controlled and uncontrolled state transition densities is represented by the state-dependent cost and value functions. Previously, the present inventors devised a novel inverse reinforcement learning method and system as described in PCT International Application No. PCT/JP2015/004001, in which a density ratio estimation method is used to estimate the transition density ratio, and a least-squares method with regularization is used to estimate the state-dependent cost and value functions that satisfy the relation. That method can avoid computing integrals such as the evaluation of a partition function. The present disclosure includes a description of the invention described in PCT/JP2015/004001 as embodiment 1 below, and further describes a new embodiment, embodiment 2, which has features improved in several respects relative to embodiment 1. The subject matter described and/or claimed in PCT/JP2015/004001 may or may not be prior art with respect to embodiment 2, depending on the national laws of each jurisdiction. As described below, for embodiment 1, a simple numerical simulation of the swing-up pendulum was performed, and its superiority over conventional methods was demonstrated. The inventors further applied the method to human behavior in performing a pole balancing task and showed that the estimated cost functions can predict the performance of the subjects in new trials or environments in a satisfactory manner.
One aspect of the present invention is based on the framework of linearly solvable Markov decision processes, like the OptV algorithm. In embodiment 1, the inventors derived a novel Bellman equation given by:

ln [π(y|x)/p(y|x)] = −q(x) + V(x) − γV(y),

where q(x) and V(x) denote the cost and value functions at state x, and γ denotes the discount factor. p(y|x) and π(y|x) denote the state transition probabilities before and after learning, respectively. The density ratio (the left-hand side of the above equation) is efficiently computed from the observed behaviors by density ratio estimation methods. Once the density ratio is estimated, the cost and value functions can be estimated by regularized least squares. An important feature is that our method can avoid computing the integrals that are usually evaluated at a high computational cost. The present inventors applied this method to human behaviors in performing a pole balancing task and show that the estimated cost functions can predict the performance of the subjects in new trials or environments. This verifies the general applicability and effectiveness of this new computational technique of inverse reinforcement learning, which has well-recognized wide applicability in control systems, machine learning, operations research, information theory, etc.
<I. Embodiment 1>
<1. Linearly solvable Markov decision processes>
<1.1. Forward Reinforcement learning >
The present disclosure provides a brief introduction to Markov decision processes and their simplification in the discrete-time continuous-space domain. Let X and U denote the continuous state space and the continuous action space, respectively. At time step t, the learning agent observes the current state of the environment x_t ∈ X and executes an action u_t ∈ U sampled from a stochastic policy π(u_t|x_t). Consequently, an immediate cost c(x_t, u_t) is given by the environment, and the environment makes a state transition from x_t to y ∈ X according to the state transition probability P_T(y|x_t, u_t) under the action u_t. The goal of reinforcement learning is to construct an optimal policy π(u|x) that minimizes a given objective function. There exist several objective functions, and the most widely used one is the discounted sum of costs given by:

V(x) = E[ Σ_{t=0}^{∞} γ^t c(x_t, u_t) | x_0 = x ],    (1)

where γ ∈ (0,1) is called the discount factor. It is known that the optimal value function satisfies the following Bellman equation:

V(x) = min_u [ c(x, u) + γ E_{P_T(y|x,u)}[ V(y) ] ].    (2)

Equation (2) is a nonlinear equation due to the min operator.
The linearly solvable Markov decision process (LMDP) simplifies equation (2) under certain assumptions (Todorov, 2007; 2009a, NPL 23-NPL 24). The key trick of LMDP is to optimize the state transition probability directly instead of optimizing the policy. More specifically, two conditional probability density functions are introduced. One is the uncontrolled probability, denoted by p(y|x), which can be regarded as an innate state transition. p(y|x) is arbitrary, and it can be constructed by p(y|x) = ∫ P_T(y|x,u) π_0(u|x) du, where π_0(u|x) is a stochastic policy. The other is the controlled probability, denoted by π(y|x), which can be interpreted as an optimal state transition. The cost function is then restricted to the following form:

c(x,u) = q(x) + KL(π(·|x) || p(·|x)),    (3)

where q(x) denotes the state-dependent cost function and KL(π(·|x) || p(·|x)) denotes the Kullback-Leibler divergence between the controlled and uncontrolled state transition densities. In this case, the Bellman equation (2) is simplified to the following equation:

exp(−V(x)) = exp(−q(x)) ∫ p(y|x) exp(−γV(y)) dy.    (4)

The optimal controlled probability is given by:

π(y|x) = p(y|x) exp(−γV(y)) / ∫ p(y'|x) exp(−γV(y')) dy'.    (5)

It should be noted that equation (4) remains nonlinear even if the desirability function Z(x) = exp(−V(x)) is introduced, because of the discount factor γ. In forward reinforcement learning under the LMDP framework, V(x) is computed by solving equation (4), and π(y|x) is then computed by equation (5) (Todorov, 2009, NPL 25).
<1.2. Inverse reinforcement learning>
An inverse reinforcement learning (IRL) algorithm under LMDP was proposed by Dvijotham and Todorov (2010) (NPL 6). In particular, OptV works well for discrete-state problems. The advantage of OptV is that the optimal state transition is explicitly represented by the value function, so that the maximum likelihood method can be applied to estimate the value function. It is assumed that the observed trajectories are generated by the optimal state transition density (5). The value function is approximated by the following linear model:

V(x; w_V) = w_V^T ψ_V(x),    (6)

where w_V and ψ_V(x) denote the learning weights and the basis function vector, respectively.
Since the controlled probability is given by equation (5), the weight vector w_V can be optimized by maximizing the likelihood. Suppose a dataset of state transitions:

D^π = {(x_j^π, y_j^π)}_{j=1}^{N^π},    (7)

where N^π denotes the number of data from the controlled probability. Then the log-likelihood and its derivative are given by:

L(w_V) = Σ_{j=1}^{N^π} ln π(y_j^π | x_j^π; w_V),
∂L(w_V)/∂w_V = γ Σ_{j=1}^{N^π} [ E_{π(y|x_j^π; w_V)}[ψ_V(y)] − ψ_V(y_j^π) ],    (8)

where π(y|x; w_V) is the controlled probability in which the value function is parameterized by equation (6). Once the gradient is evaluated, the weight vector w_V is updated by gradient ascent.
After the value function is estimated, the cost function can be derived using the simplified Bellman equation (4). This means that, given V(x; w_V) and γ, the cost function q(x) is uniquely determined, and q(x) is represented by the basis functions used in the value function. Although the representation of the cost function is not important for imitation learning, a simpler representation of the cost is preferable for analysis. Therefore, the inventors introduce the approximator:

q(x; w_q) = w_q^T ψ_q(x),    (9)

where w_q and ψ_q(x) denote the learning weights and the basis function vector, respectively. The objective function with L1 regularization for optimizing w_q is given by:

J(w_q) = Σ_j [ q(x_j; w_q) − V(x_j; w_V) − ln ∫ p(y|x_j) exp(−γV(y; w_V)) dy ]² + λ_q ||w_q||_1,    (10)

where λ_q is a regularization constant. A simple gradient descent algorithm is adopted, and J(w_q) is evaluated at the observed states.
The most significant problem with the approach of Dvijotham and Todorov (2010) (NPL 6) is the integrals in equations (8) and (10), which cannot be solved analytically; they discretized the state space and replaced the integrals with sums. However, as they suggested, this is infeasible in high-dimensional continuous problems. In addition, the uncontrolled probability p(y|x) is not necessarily Gaussian. In at least some embodiments of the present invention, the Metropolis-Hastings algorithm is applied to evaluate the gradient of the log-likelihood, in which the uncontrolled probability p(y|x) is used as the proposal density.
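As an illustration of this sampling step, the following sketch (a hypothetical one-dimensional example; the densities, features, and sample sizes are assumptions made only for illustration) shows how an independence Metropolis-Hastings sampler with proposal p(y|x) can approximate the expectation over π(y|x; w_V) that appears in the gradient (8); because the proposal equals the uncontrolled density, the acceptance ratio depends only on exp(−γV).

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_features_mh(x, sample_p, V, psi, gamma, n_samples=1000):
    """Estimate E_{pi(y|x)}[psi(y)], where pi(y|x) is proportional to
    p(y|x)*exp(-gamma*V(y)), with an independence Metropolis-Hastings
    sampler whose proposal is the uncontrolled density p(y|x); the
    proposal terms cancel in the acceptance ratio."""
    y = sample_p(x)
    total = np.zeros_like(psi(y), dtype=float)
    for _ in range(n_samples):
        y_prop = sample_p(x)
        if rng.random() < np.exp(-gamma * (V(y_prop) - V(y))):
            y = y_prop
        total += psi(y)
    return total / n_samples

# hypothetical 1-D example: p(y|x) = N(x, 1), quadratic value, Gaussian features
centers = np.linspace(-2.0, 2.0, 5)
psi = lambda y: np.exp(-(y - centers) ** 2 / 2.0)
print(expected_features_mh(0.0, lambda x: rng.normal(x, 1.0),
                           lambda y: y ** 2, psi, gamma=0.9))
```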
<2. Inverse reinforcement learning by density ratio estimation>
<2.1. Bellman equation for IRL >
From equations (4) and (5), the present inventors derived the following important relation for the discounted-cost problem:

q(x) + γV(y) − V(x) = −ln [π(y|x)/p(y|x)].    (11)

Equation (11) plays an important role in the IRL algorithm according to the embodiments of the present invention. Similar equations can be derived for the first-exit, average-cost, and finite-horizon problems. It should be noted that the left-hand side of equation (11) is not a temporal difference error, because q(x) is the state-dependent part of the cost function shown in equation (3). Our IRL is still an ill-posed problem: although the form of the cost function is restricted by equation (3) under LMDP, the cost function is not uniquely determined. More specifically, if the state-dependent cost function is modified by:

q′(x) = q(x) + C,    (12)

then the corresponding value function changes to:

V′(x) = V(x) + C/(1 − γ),    (13)

where C is a constant. The controlled probability derived from V(x) is then identical to that derived from V′(x). This property is useful when estimating the cost function, as described below. In one aspect of the present invention, the disclosed IRL method consists of two parts. One is to estimate the density ratio on the right-hand side of equation (11), as described below. The other is to estimate q(x) and V(x) by least squares with regularization, as shown below.
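To make relation (11) concrete, the following self-contained sketch (an illustrative toy computation on an assumed random discrete LMDP) solves the forward problem numerically and verifies that q(x) + γV(y) − V(x) coincides with the negative log ratio of the controlled and uncontrolled transition probabilities.

```python
import numpy as np

rng = np.random.default_rng(1)
n, gamma = 4, 0.9
p = rng.random((n, n)); p /= p.sum(axis=1, keepdims=True)  # uncontrolled p(y|x)
q = rng.random(n)                                          # state-dependent cost

V = np.zeros(n)                        # solve the linearized Bellman equation (4)
for _ in range(5000):
    V = q - np.log(p @ np.exp(-gamma * V))

pi = p * np.exp(-gamma * V)[None, :]   # optimal controlled probability (5)
pi /= pi.sum(axis=1, keepdims=True)

lhs = q[:, None] + gamma * V[None, :] - V[:, None]   # q(x) + gamma*V(y) - V(x)
rhs = -np.log(pi / p)                                # -ln pi(y|x)/p(y|x)
print(np.max(np.abs(lhs - rhs)))       # ~0: relation (11) holds
```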
<2.2. Density ratio estimation for IRL >
Estimating the ratio of the controlled and uncontrolled transition probability densities can be regarded as a problem of density ratio estimation (Sugiyama et al., 2012, NPL 20). According to the setting of the problem, the present disclosure considers the following formulations.
<2.2.1. general case >
First, the general setting is considered. Suppose we have two datasets of state transitions: one is D^π shown in equation (7), and the other is a dataset from the uncontrolled probability:

D^p = {(x_j^p, y_j^p)}_{j=1}^{N^p},

where N^p denotes the number of data. Then, we are interested in estimating the ratio π(y|x)/p(y|x) from D^p and D^π.
From equation (11), we can consider the following two decompositions:

ln [π(y|x)/p(y|x)] = ln π(y|x) − ln p(y|x)    (14)
                   = ln [π(x,y)/p(x,y)] − ln [π(x)/p(x)].    (15)

The first decomposition (14) shows the difference of the logarithms of the conditional probability densities. To estimate equation (14), the present disclosure considers two implementations. One is LSCDE-IRL, which estimates π(y|x) and p(y|x) by least-squares conditional density estimation (LSCDE) (Sugiyama et al., 2010, NPL 19). The other is Gauss-IRL, which uses a Gaussian process (Rasmussen & Williams, 2006, NPL 15) to estimate the conditional densities in equation (14).
The second decomposition (15) shows the difference of the logarithms of density ratios. The advantage of the second decomposition is that ln π(x)/p(x) can be neglected if π(x) = p(x). This condition may be satisfied depending on the setting. Currently, two methods are implemented to estimate π(x)/p(x) and π(x,y)/p(x,y). One is uLSIF-IRL, which uses unconstrained least-squares importance fitting (uLSIF) (Kanamori et al., 2009, NPL 9). The other is LogReg, which uses logistic regression in a different way. Their implementations are described in section 2.3 below.
<2.2.2. when p (y | x) is unknown >
The state transition probability P_T(y|x,u) is assumed to be known in advance in the standard IRL problem setting, and this corresponds to the assumption that the uncontrolled probability p(y|x) is given in the LMDP setting. This can be regarded as model-based IRL. In this case, decomposition (14) is appropriate, and it is sufficient to estimate the controlled probability π(y|x) from the dataset D^π.
In some situations, we have neither an analytical model nor a dataset from the uncontrolled probability density. Then p(y|x) is replaced by a uniform distribution, which is an improper distribution for unbounded variables. Without loss of generality, p(y|x) is set to 1, because the difference can be compensated by shifting the cost and value functions through equations (12) and (13).
<2.3. Density ratio estimation algorithms>
This section describes the density ratio estimation algorithms suitable for the IRL methods disclosed in the present disclosure.
<2.3.1.uLSIF>
uLSIF (Kanamori et al., 2009, NPL 9) is a least-squares method for direct density ratio estimation. The goal of uLSIF is to estimate the ratios of two densities, π(x)/p(x) and π(x,y)/p(x,y). Hereinafter, the present disclosure explains how to estimate r(z) = π(z)/p(z) from D^p and D^π, where z = (x, y) for simplicity. The ratio is approximated by the linear model:

r̂(z) = α^T φ(z),

where α and φ(z) denote the parameters to be learned and the basis function vector, respectively. The objective function is given by:

J(α) = (1/2) α^T Ĥ α − ĥ^T α + (λ/2) α^T α,    (16)

where λ is a regularization constant, and

Ĥ = (1/N^p) Σ_{j=1}^{N^p} φ(z_j^p) φ(z_j^p)^T,
ĥ = (1/N^π) Σ_{j=1}^{N^π} φ(z_j^π).

It should be noted that Ĥ is estimated from D^p, whereas ĥ is estimated from D^π. Equation (16) can be minimized analytically as

α̃ = (Ĥ + λI)^{−1} ĥ,

but this minimizer ignores the non-negativity constraint of the density ratio. To compensate for this problem, uLSIF modifies the solution as follows:

α̂ = max(0, α̃),    (17)
where the above max operator is applied element-wise. Following the recommendation of Kanamori et al. (2009) (NPL 9), Gaussian functions centered at states of D^π are used as the basis functions, described by:

φ_j(x, y) = exp( −(||x − x_j||² + ||y − y_j||²) / (2σ²) ),    (18)

where σ is the width parameter and (x_j, y_j) is a center randomly selected from D^π. The parameters λ and σ are selected by leave-one-out cross-validation.
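The following Python sketch illustrates the uLSIF procedure described above on an assumed one-dimensional toy problem; the Gaussian width σ, regularization constant λ, and number of basis functions are placeholders that would in practice be chosen by leave-one-out cross-validation.

```python
import numpy as np

def ulsif(z_p, z_pi, sigma=1.0, lam=0.1, n_basis=100, rng=None):
    """Sketch of uLSIF as described above: estimate r(z) = pi(z)/p(z) from
    samples z_p ~ p and z_pi ~ pi with Gaussian basis functions centred on
    randomly chosen samples from z_pi (cf. equation (18))."""
    rng = rng or np.random.default_rng(0)
    centers = z_pi[rng.choice(len(z_pi), size=min(n_basis, len(z_pi)),
                              replace=False)]
    def phi(z):
        d2 = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2 / (2.0 * sigma ** 2))           # shape (N, n_basis)
    H = phi(z_p).T @ phi(z_p) / len(z_p)                  # estimated from D^p
    h = phi(z_pi).mean(axis=0)                            # estimated from D^pi
    alpha = np.linalg.solve(H + lam * np.eye(len(h)), h)  # analytic minimiser
    alpha = np.maximum(0.0, alpha)                        # equation (17)
    return lambda z: phi(np.atleast_2d(z)) @ alpha

# toy check: p = N(0, 1), pi = N(0.5, 0.8)
rng = np.random.default_rng(0)
z_p = rng.normal(0.0, 1.0, size=(500, 1))
z_pi = rng.normal(0.5, 0.8, size=(500, 1))
r_hat = ulsif(z_p, z_pi)
print(r_hat(np.array([[0.5], [-2.0]])))   # larger near the mode of pi than in the left tail
```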
<2.3.2.LSCDE>
LSCDE (Sugiyama et al., 2010, NPL 19) can be regarded as a special case of uLSIF for estimating a conditional probability density function. For example, the objective function for estimating π(y|x) = π(x,y)/π(x) from D^π is given by:

J(α) = (1/2) α^T Ĥ α − ĥ^T α + (λ/2) α^T α,

where r̂(x, y) = α^T φ(x, y) is a linear model and λ is a regularization constant. The computation of Ĥ and ĥ in LSCDE is slightly different from that in uLSIF, and they are computed as follows:

Ĥ = (1/N^π) Σ_{j=1}^{N^π} Φ̄(x_j^π),
ĥ = (1/N^π) Σ_{j=1}^{N^π} φ(x_j^π, y_j^π),

where Φ̄(x) is defined as:

Φ̄(x) = ∫ φ(x, y) φ(x, y)^T dy.

Since the basis functions shown in equation (18) are used, this integral can be computed analytically. The estimated weights of LSCDE are given by equation (17). To ensure that the estimated ratio is a conditional density, the solution should be normalized when it is used to estimate the cost and value functions.
<2.3.3.LogReg>
LogReg is a method of density ratio estimation that uses logistic regression. Let us assign the selector variable η = −1 to samples from the uncontrolled probability and η = +1 to samples from the controlled probability:

p(z) = Pr(z | η = −1),  π(z) = Pr(z | η = +1).

The density ratio can be expressed by applying the Bayes rule as follows:

π(z)/p(z) = [Pr(η = −1) / Pr(η = +1)] · [Pr(η = +1 | z) / Pr(η = −1 | z)].

The first ratio Pr(η = −1)/Pr(η = +1) is estimated by N^p/N^π, and the second ratio is computed after estimating the conditional probability Pr(η | z) with a logistic regression classifier:

Pr(η | z) = 1 / (1 + exp(−η w^T φ(z))),

where η can be regarded as a label. It should be noted that, in the case of LogReg, the logarithm of the density ratio is given by a linear model:

ln [π(z)/p(z)] = w^T φ(z) + ln (N^p/N^π).

The second term ln N^p/N^π can be ignored in our IRL formulation shown in equation (15).
The objective function is derived from the negative regularized log-likelihood expressed by:

J(w) = Σ_{j=1}^{N^p} ln(1 + exp(w^T φ(z_j^p))) + Σ_{j=1}^{N^π} ln(1 + exp(−w^T φ(z_j^π))) + λ w^T w.

A closed-form solution cannot be derived, but the objective can be minimized efficiently by standard nonlinear optimization methods because it is a convex function.
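The following sketch illustrates the LogReg estimator described above with a plain gradient descent on the convex objective; the toy densities, features, and hyper-parameters are assumptions made only for illustration.

```python
import numpy as np

def logreg_log_ratio(z_p, z_pi, sigma=1.0, lam=0.1, n_basis=100,
                     lr=0.1, n_iter=5000, rng=None):
    """Sketch of the LogReg density-ratio estimator described above: fit a
    logistic regression separating samples of p (label -1) from samples of
    pi (label +1); the fitted linear score w^T phi(z) then approximates
    ln pi(z)/p(z) up to the constant ln N^p/N^pi."""
    rng = rng or np.random.default_rng(0)
    centers = z_pi[rng.choice(len(z_pi), size=min(n_basis, len(z_pi)),
                              replace=False)]
    def phi(z):
        d2 = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    Z = np.vstack([phi(z_p), phi(z_pi)])
    eta = np.concatenate([-np.ones(len(z_p)), np.ones(len(z_pi))])
    w = np.zeros(Z.shape[1])
    for _ in range(n_iter):                 # gradient descent on the convex
        s = eta * (Z @ w)                   # regularized logistic loss
        grad = -(Z * (eta * (1.0 - 1.0 / (1.0 + np.exp(-s))))[:, None]).sum(axis=0)
        grad += 2.0 * lam * w
        w -= lr * grad / len(Z)
    return lambda z: phi(np.atleast_2d(z)) @ w

# toy check: p = N(0,1), pi = N(1,1)  ->  the true log ratio is z - 0.5
rng = np.random.default_rng(0)
z_p = rng.normal(0.0, 1.0, size=(500, 1))
z_pi = rng.normal(1.0, 1.0, size=(500, 1))
log_r = logreg_log_ratio(z_p, z_pi)
print(log_r(np.array([[1.0]])))             # roughly 0.5
```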
<2.4. Estimating the cost and value functions>
Once the density ratio π(y|x)/p(y|x) is estimated, least squares with regularization is applied to estimate the state-dependent cost function q(x) and the value function V(x). Suppose that r̂(x, y) is an approximation of the negative log ratio:

r̂(x, y) = −ln [π(y|x)/p(y|x)],

and consider the linear approximators of q(x) and V(x) defined in equations (9) and (6), respectively. The objective function is given by:

J(w_q, w_V) = Σ_j [ r̂(x_j, y_j) − q(x_j; w_q) − γV(y_j; w_V) + V(x_j; w_V) ]² + λ_q ||w_q||_1 + λ_V ||w_V||²_2,

where λ_q and λ_V are regularization constants. L2 regularization is used for w_V because it is an effective means of achieving numerical stability. On the other hand, L1 regularization is used for w_q in order to yield a sparse model that is easier to interpret. If sparsity is not important, L2 regularization can also be used for w_q. In addition, non-negativity constraints on w_q and w_V are not introduced, because equation (12) can be used by setting

C = −min_x q(x; w_q)

to satisfy the non-negativity of the cost function effectively.
In theory, any basis functions can be chosen. In one embodiment of the present invention, for simplicity, Gaussian functions of the form shown in equation (18) are used:

ψ_j(x) = exp( −||x − c_j||² / (2σ²) ),

where σ is the width parameter and the center positions c_j are randomly selected from D^π.
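The following sketch illustrates the regularized least-squares step described in this section: given r̂(x_j, y_j) for a set of transitions, it fits w_q with L1 regularization (by iterative soft-thresholding) and w_V with L2 regularization; the one-dimensional features and the synthetic targets are assumptions made for illustration.

```python
import numpy as np

def fit_cost_value(x, y, r_hat, psi_q, psi_v, gamma,
                   lam_q=0.01, lam_v=0.01, lr=0.05, n_iter=5000):
    """Sketch of the regularized least-squares step described above: given
    transitions (x_j, y_j) and targets r_hat_j ~ -ln pi(y_j|x_j)/p(y_j|x_j),
    fit q(x) = w_q^T psi_q(x) and V(x) = w_V^T psi_v(x) so that
    q(x_j) + gamma*V(y_j) - V(x_j) matches r_hat_j, with L1 on w_q
    (iterative soft-thresholding) and L2 on w_V."""
    Pq, Pvx, Pvy = psi_q(x), psi_v(x), psi_v(y)
    w_q = np.zeros(Pq.shape[1])
    w_v = np.zeros(Pvx.shape[1])
    for _ in range(n_iter):
        resid = Pq @ w_q + gamma * (Pvy @ w_v) - Pvx @ w_v - r_hat
        g_q = Pq.T @ resid / len(r_hat)
        g_v = (gamma * Pvy - Pvx).T @ resid / len(r_hat) + lam_v * w_v
        w_v -= lr * g_v
        w_q -= lr * g_q
        w_q = np.sign(w_q) * np.maximum(np.abs(w_q) - lr * lam_q, 0.0)  # soft threshold
    return w_q, w_v

# hypothetical 1-D example with Gaussian features shared by q and V
rng = np.random.default_rng(0)
centers = np.linspace(-2.0, 2.0, 15)
feat = lambda s: np.exp(-(s - centers[None, :]) ** 2 / (2 * 0.5 ** 2))
x = rng.uniform(-2, 2, size=(400, 1))
y = rng.uniform(-2, 2, size=(400, 1))
q_true = lambda s: s[:, 0] ** 2          # arbitrary "true" cost
V_true = lambda s: 2.0 * s[:, 0] ** 2    # arbitrary "true" value
r_hat = q_true(x) + 0.9 * V_true(y) - V_true(x)
w_q, w_v = fit_cost_value(x, y, r_hat, feat, feat, gamma=0.9)
resid = feat(x) @ w_q + 0.9 * (feat(y) @ w_v) - feat(x) @ w_v - r_hat
print(np.sqrt(np.mean(resid ** 2)))      # small: the Bellman relation is reproduced
```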
<3. experiment >
<3.1. Swing-up inverted pendulum>
<3.1.1. task description >
In order to demonstrate and confirm the effectiveness of the above-described embodiment belonging to embodiment 1 of the present invention, the present inventors studied the swing-up inverted pendulum problem, in which the state vector is given by the two-dimensional vector x = [θ, ω]^T, where θ and ω denote the angle and the angular velocity of the pole, respectively. The equation of motion is given by the following stochastic differential equation:

ml² dω = ( −κω + mgl sin θ + u ) dt + σ_e dw,

where l, m, g, κ, σ_e, and w denote the length of the pole, the mass, the gravitational acceleration, the coefficient of friction, the scaling parameter of the noise, and Brownian noise, respectively. In contrast to previous studies (Deisenroth et al., 2009, NPL 4; Doya, 2000, NPL 5), the applied torque u is not restricted, and the pole can be swung up directly. The corresponding state transition probability P_T(y|x,u), which is represented by a Gaussian distribution, is obtained by discretizing the time axis with step h. In the simulation, these parameters were set to fixed values.
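For illustration, the following sketch generates state transitions of a swing-up pendulum by an Euler-Maruyama discretization; the exact form of the dynamics and all parameter values in the sketch are assumptions and are not the settings used in the experiments described herein.

```python
import numpy as np

def pendulum_step(x, u, h=0.01, l=1.0, m=1.0, g=9.8, kappa=0.05,
                  sigma_e=0.5, rng=None):
    """One Euler-Maruyama step of assumed swing-up pendulum dynamics of the
    form m*l^2*domega = (-kappa*omega + m*g*l*sin(theta) + u)*dt + sigma_e*dW;
    the dynamics and every parameter value here are illustrative assumptions.
    The resulting transition y ~ P_T(y|x, u) is Gaussian."""
    rng = rng or np.random.default_rng()
    theta, omega = x
    domega = (-kappa * omega + m * g * l * np.sin(theta) + u) / (m * l ** 2)
    noise = sigma_e * np.sqrt(h) * rng.normal() / (m * l ** 2)
    return np.array([theta + h * omega, omega + h * domega + noise])

# roll out one trajectory under a random (uncontrolled) policy
rng = np.random.default_rng(0)
x = np.array([np.pi, 0.0])          # illustrative initial state
states = [x]
for t in range(500):
    u = rng.normal(0.0, 1.0)        # random torque
    x = pendulum_step(x, u, rng=rng)
    states.append(x)
print(states[-1])
```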
the inventors varied (1) the state dependent cost function q (x), (2) the uncontrolled probability p (y | x), and (3) the data set D bypAnd DπA series of experiments were performed.
< Cost functions >
The goal is to keep the pole upright, and the following three cost functions are prepared:

q_cos(x) = 1 − cos θ,  q_quad(x) = x^T Q x,  q_exp(x) = 1 − exp( −(1/2) x^T Q x ),    (19)

where Q = diag[1, 0.2]. q_cos(x) was used by Doya (2000) (NPL 5), and q_exp(x) was used by Deisenroth et al. (2009) (NPL 4).
< Uncontrolled probability >
Two densities p_G(y|x) and p_M(y|x) are considered. p_G(y|x) is constructed with a stochastic policy π(u|x) represented by a Gaussian distribution. Since the equation of motion in discrete time is given by a Gaussian, p_G(y|x) is also a Gaussian distribution. In the case of p_M(y|x), a mixture of Gaussian distributions is used as the stochastic policy.
< Preparation of the datasets >
Two sampling methods are considered. One is uniform sampling, and the other is trajectory-based sampling. In the uniform sampling method, x is sampled from a uniform distribution defined over the entire state space; in other words, p(x) and π(x) are regarded as uniform distributions. Then, y is sampled from the uncontrolled and controlled probabilities to construct D^p and D^π, respectively. In the trajectory-based sampling method, trajectories of states are generated from the same initial state x_0 using p(y|x) and π(y|x). Then, pairs of state transitions are randomly selected from the trajectories to construct D^p and D^π. In this case, p(x) is expected to be different from π(x).
For each cost function, the corresponding value function is computed by solving equation (4), and the corresponding optimal controlled probability is evaluated by equation (5). In the previous method (Todorov, 2009b, NPL 25), exp(−V(x)) is represented by a linear model, but this is difficult under the objective function (1) because the discount factor γ complicates the linear model. Therefore, the value function is approximated by the linear model shown in equation (6), and the integral is evaluated by the Metropolis-Hastings algorithm.
The method according to the embodiment of the present invention in embodiment 1 can be compared with OptV, because the assumptions of OptV are the same as those of the method according to the embodiment of the present invention. As described above, there are several variants depending on the choice of the density ratio estimation method. More specifically, the following six algorithms are considered: (1) LSCDE-IRL, (2) uLSIF-IRL, (3) LogReg-IRL, (4) Gauss-IRL, (5) LSCDE-OptV, which is the OptV method in which p(y|x) is estimated by LSCDE, and (6) Gauss-OptV, in which a Gaussian process method is used to estimate p(y|x).
The numbers of samples in D^p and D^π are set to N^p = N^π = 300. The parameters λ_q, λ_V, σ, and γ are optimized by cross-validation over the following grids: log λ_q, log λ_V ∈ linspace(−3, 1, 9), log σ ∈ linspace(−1.5, 1.5, 9), and log γ ∈ linspace(−0.2, 0, 9), where linspace(x_min, x_max, n) generates a set of n points equally spaced between x_min and x_max.
<3.1.2. Experimental results >
The accuracy of the estimated cost functions is measured by the normalized squared error for the test samples:

error = Σ_{j=1}^{N} ( q(x_j) − q̂(x_j) )² / Σ_{j=1}^{N} q(x_j)²,

where q(x_j) is one of the true cost functions shown in equation (19) evaluated at state x_j, and q̂(x_j) is the estimated cost function. Figs. 1(a)-(d) compare the accuracy of the IRL methods of this embodiment. The results show that our methods (1)-(4) performed better than the OptV methods (5)-(6) in all settings. More specifically, LogReg-IRL showed the best performance, although there were no significant differences among our methods (1)-(3). If the stochastic policy π(u|x) is given by a mixture of Gaussians, the error of the cost estimated by Gauss-IRL increases significantly, because the standard Gaussian process cannot represent a mixture of Gaussians.
Fig. 2 shows the cross-validation error with respect to the discount factor γ, where the other parameters such as λ_q, λ_V, and σ are set to their optimal values. In this simulation, the cross-validation error was minimized in all methods at the true discount factor γ = 10^{−0.025} ≈ 0.94.
As shown in Fig. 2, and also illustrated in Fig. 1, the embodiments of the present invention were shown to have sufficiently small errors, confirming the effectiveness of the present invention.
<3.2. analysis of human behavior >
<3.2.1. task description >
To evaluate our IRL algorithms in a realistic situation, the inventors conducted a dynamic motor control experiment, the pole balancing problem. Fig. 3 shows the experimental setup. A subject can move the base left, right, up, and down in order to swing the pole several times and decelerate it so as to balance it in the upright position. The dynamics are described by the six-dimensional state vector

x = [θ, θ̇, x, ẋ, y, ẏ]^T,

where θ and θ̇ denote the angle and the angular velocity of the pole, x and y denote the horizontal and vertical positions of the base, and ẋ and ẏ denote their time derivatives, respectively.
The task was performed under two conditions: long pole (73 cm) and short pole (29 cm). Each subject performed 15 trials to balance the pole under each condition. Each trial ended when the subject could keep the pole upright for 3 seconds or when 40 seconds elapsed. Data were collected from 7 subjects (5 right-handed and 2 left-handed), and the trajectory-based sampling method was used to construct the following two controlled probability datasets: D^{π,i}_train for training and D^{π,i}_test for testing the i-th subject.
It is assumed that all subjects share a unique uncontrolled probability p(y|x), which is generated by a random policy. This means that the datasets D^p_train (for training) and D^p_test (for testing) are shared among the subjects. The number of samples in each dataset is 300.
<3.2.2. Experimental results >
Fig. 4 shows the learning curves of the seven subjects, which indicate that the learning processes differed considerably among the subjects. Subjects 1 and 3 could not accomplish the task. Since the IRL algorithms should use a set of successful trajectories, we used the data from the five subjects 2 and 4-7.
The experimental results obtained with LogReg-IRL are described below (LSCDE-IRL and uLSIF-IRL showed similar results). Fig. 5 shows the estimated cost functions of subjects 4, 5, and 7 projected onto the subspace (θ, θ̇), where x, y, ẋ, and ẏ are set to zero for visualization. In the case of subject 7, the cost function under the long-pole condition is not very different from that under the short-pole condition, whereas the cost functions of subject 5, who performed poorly under the short-pole condition as shown in Fig. 4, differ significantly between the conditions.
To evaluate the cost functions estimated from the training datasets, the inventors applied forward reinforcement learning to find the optimal controlled transition probability for each estimated cost function, and then computed the negative log-likelihood of the test datasets:

NLL = −(1/N_test) Σ_{(x,y)∈D^{π,i}_test} ln π̂(y|x),

where N_test is the number of samples in D^{π,i}_test.
Fig. 6 shows the results. In the left panel (a), the test dataset of subject 4 under the long-pole condition, D^{π,4}_test, attained the minimum negative log-likelihood with the cost function estimated from the training datasets D^{π,4}_train and D^p_train under the same condition. The right panel (b) of Fig. 6 shows that the test data of subject 7 under both the long-pole and short-pole conditions were best predicted by the cost function estimated from the training dataset of the same subject 7 under the long-pole condition only. Thus, the effectiveness and utility of the embodiments of the present invention were also demonstrated and confirmed by this experiment.
The present disclosure presents a novel inverse reinforcement learning method under the LMDP framework. One feature of the present invention is relation (11), which means that the temporal difference error is zero for the optimal value function with the corresponding cost function. Since the right-hand side of equation (11) can be estimated from samples by efficient density ratio estimation methods, the IRL of the present invention results in a simple least-squares method with regularization. In addition, the method according to the embodiment of the present invention in embodiment 1 does not need to compute integrals, which are often intractable in high-dimensional continuous problems. As a result, the disclosed method is computationally less expensive than OptV.
LMDPs and path-integral methods have recently attracted attention in the fields of robotics and machine learning (Theodorou & Todorov, 2012, NPL 22), because the linearized Bellman equation has many interesting properties (Todorov, 2009a, NPL 24). They have been successfully applied to learning stochastic policies for robots with many degrees of freedom (Kinjo et al., 2013, NPL 11; Stulp & Sigaud, 2012, NPL 17; Sugimoto and Morimoto, 2011, NPL 18; Theodorou et al., 2010, NPL 21). The IRL method according to the embodiments of the present invention can be integrated with such existing forward reinforcement learning methods to design complex controllers.
As noted above, in at least some aspects of embodiment 1 of the present invention, the present disclosure provides a computational algorithm that can efficiently infer a reward/cost function from observed behaviors. The algorithms of the embodiments of the present invention may be implemented in general-purpose computer systems with appropriate hardware and software as well as in specially designed proprietary hardware/software. Various advantages according to at least some embodiments of the present invention include the following:
A) Model-free method/system: the method and system do not need to know the environmental dynamics in advance; that is, the method/system is regarded as model-free. Although some prior methods assume that the environmental dynamics are known in advance, it is not necessary to model the target dynamics explicitly.
B) Data-efficient: the dataset for the method and system according to the embodiments of the present invention consists of a set of state transitions, whereas many previous methods require a set of state trajectories. Therefore, it is easier to collect the data in the method and system according to the embodiments of the present invention.
C) Computationally efficient (1): the method and system according to the embodiments of the present invention do not need to solve the (forward) reinforcement learning problem. In contrast, some previous methods require solving the forward reinforcement learning problem many times with an estimated reward/cost function. Such a computation must be performed for each candidate, and it usually takes a long time to find the optimal solution.
D) Computationally efficient (2): the method and system according to the embodiments of the present invention use two optimization algorithms: (a) density ratio estimation and (b) regularized least squares. In contrast, some previous methods use stochastic gradient methods or Markov chain Monte Carlo methods, which usually take more time to optimize than least squares.
As described above, in one aspect, the present invention provides inverse reinforcement learning that can infer the objective function from observed state transitions generated by a demonstrator. Fig. 7 schematically shows the framework of the method according to embodiment 1 of the present invention. The embodiment of inverse reinforcement learning according to embodiment 1 of the present invention includes two components: (1) learning the ratio of the state transition probabilities with and without control by density ratio estimation, and (2) estimating the cost and value functions that are compatible with the ratio of the transition probabilities by regularized least squares. By using efficient algorithms for each step, the embodiments of the present invention are more data-efficient and computationally efficient than other inverse reinforcement learning methods.
The industrial applicability and usefulness of inverse reinforcement learning are well understood and appreciated. Examples of systems/configurations to which the embodiments of the present invention can be applied are described below.
< Imitation learning of robot behaviors >
Programming robots to perform complex tasks is difficult with standard methods such as motion planning. In many situations, it is much easier to demonstrate the desired behaviors to the robot. However, a major drawback of classical imitation learning is that the obtained controller cannot cope with new situations, because it merely reproduces the demonstrated movements. Embodiments of the present invention can estimate the objective function from the demonstrated behaviors, and the estimated objective function can then be used to learn different behaviors for different situations.
Fig. 8 schematically shows such an implementation of the present invention. First, the demonstrator controls a robot to accomplish a task, and the sequence of states and actions is recorded. Then, the inverse reinforcement learning component according to an embodiment of the present invention estimates the cost and value functions, which are then given to forward reinforcement learning controllers of different robots.
< Interpretation of human behaviors >
Understanding the human intention behind behaviors is a fundamental issue in building user-friendly support systems. In general, a behavior is represented by a sequence of states, which are extracted by a motion tracking system. The cost function estimated by the inverse reinforcement learning method/system according to the embodiments of the present invention can be regarded as a compact representation for explaining a given behavioral dataset. Through pattern classification of the estimated cost functions, the user's expertise or preference can be estimated. Fig. 9 schematically shows such an implementation according to an embodiment of the present invention.
< Analysis of the web experience >
In order to increase the probability that a visitor reads the articles presented to the visitor, the designer of an online news website, for example, should investigate the web experience of the visitor from the viewpoint of decision making. In particular, recommendation systems attract attention as an important business application of personalized services. However, previous methods such as collaborative filtering do not explicitly consider the sequential nature of decision making. Embodiments of the present invention can provide a different and effective way to model the behaviors of visitors during web browsing. Fig. 10 shows an example of a series of click actions by a user, indicating which topics the user visited in what order. The topic the visitor is reading is regarded as the state, and clicking a link is regarded as the action. Then, inverse reinforcement learning according to embodiments of the present invention can analyze the decision making in the user's web browsing. Since the estimated cost function represents the preference of the visitor, a list of articles can be recommended to the user.
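As a simple illustration of this modeling, the following sketch (with purely hypothetical topic labels and click logs) converts per-visitor click sequences into the dataset of state transitions D^π used by the inverse reinforcement learning method, where the topic being read is the state and the next topic is the result of the click action.

```python
from collections import Counter

# hypothetical per-visitor click logs: each entry is the sequence of article
# topics read during one browsing session (topic labels are illustrative)
click_logs = [
    ["politics", "economy", "economy", "sports"],
    ["economy", "economy", "technology"],
]

# D^pi = {(x_j, y_j)}: current topic (state) paired with the next topic
transitions = [(log[i], log[i + 1]) for log in click_logs
               for i in range(len(log) - 1)]
print(Counter(transitions))
```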
As described above, the inverse reinforcement learning schemes according to embodiment 1 of the present invention are applicable to a wide variety of industrial and/or commercial systems. Fig. 11 shows an example of an implementation using a general-purpose computer system and a sensor system. The methods explained above with mathematical equations may, for example, be implemented in such a general-purpose computer system. As shown in the figure, the system of this example includes a sensor system 111 (an example of a data acquisition unit) for receiving information about state transitions, i.e., the observed behaviors, from the observed object. The sensor system 111 may include one or more of the following: an image capture device with image processing software/hardware, displacement sensors, velocity sensors, acceleration sensors, microphones, keyboards, and any other input devices. The sensor system 111 is connected to a computer 112 having a processor 113 with an appropriate memory 114, so that the received data can be analyzed according to the embodiments of the present invention. The result of the analysis is output to any output system 115 (an example of an output interface), such as a display monitor, controller, or driver, or to an object to be controlled in the case where the result is used for control. The result may also be used to program, or be transferred to, another system, such as another robot or computer, or website software that responds to the user's interactions, as described above.
In the case of predicting a user's web article preferences as described above, the implemented system may include a system for inverse reinforcement learning as described in any of the embodiments above, implemented in a computer connected to the internet. The state variables that define the user's behavior include the topics of the articles that the user selects while browsing each web page. The results of the inverse reinforcement learning are then used to cause the interface through which the user browses the internet website (e.g., on a smartphone, a personal computer, etc.) to display articles recommended for the user, as in the ranking sketch below.
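A hypothetical sketch of the final recommendation step follows: once a reward has been estimated over topic states, candidate articles can be ranked by the estimated reward of their topics. The reward values and article titles below are invented for illustration.

# Estimated reward per topic state (synthetic values).
estimated_r = {"politics": 0.2, "sports": 1.3, "technology": 0.7}

candidate_articles = [("Match report", "sports"),
                      ("New GPU released", "technology"),
                      ("Budget debate", "politics")]

# Articles whose topics have higher estimated reward are shown first.
ranked = sorted(candidate_articles, key=lambda a: estimated_r[a[1]], reverse=True)
print([title for title, _ in ranked])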
< II. embodiment 2 >
Next, embodiment 2, which has characteristics superior to those of embodiment 1 in some respects, will be described. Fig. 12 schematically shows the difference between embodiment 1 and embodiment 2. As described above, and as shown in (a) of fig. 12, embodiment 1 uses a density ratio estimation algorithm twice together with the regularized least-squares method. In contrast, in embodiment 2 of the present invention, the logarithm of the density ratio π(x)/b(x) is estimated using a standard density ratio estimation (DRE) algorithm, and r(x) and V(x), the reward function and the cost function, respectively, are then obtained by estimating the logarithm of the density ratio π(x,y)/b(x,y) using the Bellman equation. In more detail, embodiment 1 requires the following three steps: (1) estimate π(x)/b(x) by a standard DRE algorithm; (2) estimate π(x,y)/b(x,y) by a standard DRE algorithm; and (3) compute r(x) and V(x) by regularized least squares using the Bellman equation. In contrast, embodiment 2 uses only a two-step optimization: (1) estimate ln π(x)/b(x) by a standard density ratio estimation (DRE) algorithm, and (2) compute r(x) and V(x) through a second DRE for ln π(x,y)/b(x,y) using the Bellman equation. A minimal numerical sketch of these two steps is given below.
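The following is a minimal sketch of the two-step scheme under simplifying assumptions (1-D states, Gaussian basis functions, equal sample sizes for π and b, and plain gradient ascent in place of a tuned optimizer). All data are synthetic, and the helper names are illustrative; this is not the patented implementation itself, only one way the two steps could be instantiated with logistic-regression DRE.

import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9

# Synthetic state transitions under the baseline policy b and the learned
# policy pi (x = current state, y = next state).
xb = rng.normal(0.0, 1.0, 500)
yb = xb + rng.normal(0.0, 0.5, 500)
xp = rng.normal(0.5, 1.0, 500)
yp = xp + rng.normal(0.3, 0.5, 500)

centers = np.linspace(-2, 2, 9)

def phi(x):
    # Gaussian basis functions over the state space.
    return np.exp(-0.5 * (np.asarray(x)[:, None] - centers) ** 2)

def fit_logreg(F1, F0, iters=2000, lr=0.1):
    # Logistic-regression density ratio estimation: rows of F1 come from pi
    # (label 1), rows of F0 from b (label 0); with equal sample sizes the
    # fitted discriminant approximates the log density ratio.
    F = np.vstack([F1, F0])
    t = np.r_[np.ones(len(F1)), np.zeros(len(F0))]
    w = np.zeros(F.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-F @ w))
        w += lr * F.T @ (t - p) / len(F)
    return w

# Step (1): estimate ln pi(x)/b(x) over states.
w_state = fit_logreg(np.c_[phi(xp), np.ones(len(xp))],
                     np.c_[phi(xb), np.ones(len(xb))])

def log_ratio_state(x):
    return np.c_[phi(x), np.ones(len(x))] @ w_state

# Step (2): fit ln pi(x,y)/b(x,y) with the Bellman-structured model
#   log_ratio_state(x) + r(x) + gamma*V(y) - V(x),
# where r(x) = theta_r . phi(x) and V(x) = theta_v . phi(x).  The step-1
# estimate enters as a fixed offset, so only theta_r and theta_v are learned.
def features(x, y):
    return np.c_[phi(x), gamma * phi(y) - phi(x)]   # [theta_r | theta_v] layout

def fit_step2(iters=2000, lr=0.1):
    F = np.vstack([features(xp, yp), features(xb, yb)])
    off = np.r_[log_ratio_state(xp), log_ratio_state(xb)]
    t = np.r_[np.ones(len(xp)), np.zeros(len(xb))]
    w = np.zeros(F.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(F @ w + off)))
        w += lr * F.T @ (t - p) / len(F)
    return w[:len(centers)], w[len(centers):]        # theta_r, theta_v

theta_r, theta_v = fit_step2()
x_test = np.array([0.0, 0.5, 1.0])
print("r:", phi(x_test) @ theta_r)
print("V:", phi(x_test) @ theta_v)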
Fig. 13 schematically illustrates the calculation scheme of the second DRE of step (2) in embodiment 2. As shown in fig. 13, because the first DRE already provides an estimate of ln π(x)/b(x), the second DRE for ln π(x,y)/b(x,y) yields an estimate of r(x) + γV(y) − V(x) through the following equations.
r(x) + γV(y) − V(x) = ln[π(x,y)/b(x,y)] − ln[π(x)/b(x)],
ln[π(x,y)/b(x,y)] = ln[π(x)/b(x)] + r(x) + γV(y) − V(x)
These equations are substantially the same as equations (11) and (15) above. Therefore, in embodiment 2, the third step (3) of embodiment 1, computed by the regularized least-squares method, is not needed, and the computational cost can be significantly reduced compared with embodiment 1. Moreover, in embodiment 2, to perform the second step (2), i.e., to compute r(x) and V(x) through the second DRE for ln π(x,y)/b(x,y) using the Bellman equation, the basis functions are designed over the state space, which reduces the number of parameters to be optimized; one such parameterization is shown in the display below. In contrast, in embodiment 1, step (2) of estimating π(x,y)/b(x,y) by a standard DRE algorithm requires basis functions designed as products over the state space, which requires the optimization of a relatively large number of parameters. Therefore, embodiment 2 requires relatively low memory usage compared with embodiment 1. Embodiment 2 thus has these distinct and significant advantages over embodiment 1. Other features and settings of embodiment 2 are the same as the various methods and schemes described above for embodiment 1, unless specifically noted otherwise below.
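To make the parameter-count comparison concrete, the display below gives one possible parameterization of the second DRE with K basis functions ψ and φ defined on the state space; this is an assumption consistent with the description above, not a quotation of equations (11) or (15).

\[
\ln\frac{\pi(x,y)}{b(x,y)} \;\approx\; \widehat{\ln\frac{\pi(x)}{b(x)}} \;+\; \theta_r^{\top}\psi(x) \;+\; \gamma\,\theta_V^{\top}\phi(y) \;-\; \theta_V^{\top}\phi(x)
\]

Under this parameterization, only on the order of 2K coefficients (θ_r and θ_V) are optimized in step (2), whereas a generic model of π(x,y)/b(x,y) built from products ψ_i(x)φ_j(y) of state basis functions would require on the order of K² coefficients.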
Table 1 below shows a general comparison of embodiment 2 with various conventional methods. Specifically, embodiment 2 is compared, feature by feature, with the above-described OptV, maximum entropy IRL (MaxEnt-IRL), and relative entropy IRL (RelEnt-IRL). As shown in table 1, embodiment 2 of the present invention has various advantages over these conventional methods.
[Table 1: feature-by-feature comparison of embodiment 2 with OptV, MaxEnt-IRL, and RelEnt-IRL]
In order to demonstrate and confirm the effectiveness of embodiment 2 of the present invention, the aforementioned inverted pendulum problem was studied. FIG. 14 shows the results of experiments comparing embodiment 2 and embodiment 1 with MaxEnt-IRL, RelEnt-IRL, and OptV. In the figure, embodiment 2 is denoted as "new invention" and embodiment 1 as "PCT/JP2015/004001". As shown in fig. 14, even with a small number of samples, embodiment 2 recovered the observed policy better than the other methods, including embodiment 1.
< robot navigation task experiment >
To further demonstrate and confirm the effectiveness of embodiment 2 of the present invention, a robot navigation task was studied for embodiment 2, embodiment 1, and RelEnt-IRL. Three target objects, red (r), green (g), and blue (b), are placed in front of a programmable robot with camera eyes. The goal is to reach the green (g) target among the three targets. Five predetermined starting positions A-E are arranged in front of the three objects. Training data is collected from the starting positions A-C and E, while test data is acquired from the starting position D. The state vector is x = [θ_r, N_r, θ_g, N_g, θ_b, N_b, θ_pan, θ_tilt]^T, where θ_i (i = r, g, b) is the angle to target i, N_i (i = r, g, b) is the blob size of target i, and θ_pan and θ_tilt are the pan and tilt angles of the robot camera. The basis functions for V(x) are given as follows:
ψ_V,i(x) = exp(−‖x − c_i‖² / (2σ²))
where c_i is a center location selected from the data set. The basis functions for r(x) are given by:
ψ_q(x) = [f_g(θ_r), f_s(N_r), f_g(θ_g), f_s(N_g), f_g(θ_b), f_s(N_b)]^T
where f_g is a Gaussian function and f_s is a sigmoid function. In this experiment, π and b were given by the experimenter, and 10 trajectories were collected from each starting point to create a data set. Fig. 15 shows the experimental results. In this figure, embodiment 2 is denoted as "new invention" and embodiment 1 as "PCT/JP2015/004001". The results are compared with those of RelEnt-IRL, as described above. As shown in fig. 15, embodiment 2 produced significantly better results. This also shows that the cost function estimated according to embodiment 2 can be used as a potential function for reward shaping. A sketch of this basis-function design is given below.
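The sketch below constructs feature vectors of the kind described above, with invented widths and centers; the state layout follows x = [θ_r, N_r, θ_g, N_g, θ_b, N_b, θ_pan, θ_tilt], and the specific parameter values are assumptions, not the values used in the experiment.

import numpy as np

def f_g(theta, sigma=0.5):
    # Gaussian feature of an angle (width sigma is invented).
    return np.exp(-0.5 * (theta / sigma) ** 2)

def f_s(n, scale=50.0):
    # Sigmoid feature of a blob size (scale is invented).
    return 1.0 / (1.0 + np.exp(-n / scale))

def psi_q(x):
    # Basis for the reward r(x): angles through f_g, blob sizes through f_s.
    th_r, n_r, th_g, n_g, th_b, n_b = x[:6]
    return np.array([f_g(th_r), f_s(n_r), f_g(th_g), f_s(n_g), f_g(th_b), f_s(n_b)])

def psi_v(x, centers, sigma=1.0):
    # Gaussian basis for the cost V(x), centered at points chosen from the data.
    x = np.asarray(x)
    return np.exp(-np.sum((x - centers) ** 2, axis=1) / (2.0 * sigma ** 2))

centers = np.tile(np.linspace(-1.0, 1.0, 5)[:, None], (1, 8))   # 5 toy centers in R^8
x_example = np.array([0.1, 120.0, -0.2, 80.0, 0.3, 60.0, 0.0, 0.1])
print(psi_q(x_example))
print(psi_v(x_example, centers))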
The calculation time (in minutes) for the inverted pendulum task discussed above was also measured. The LogReg-IRL and KLIEP-IRL variants of embodiment 2 each require only about 2.5 minutes of computation. The uLSIF-IRL, LSCDE-IRL, and LogReg-IRL variants of embodiment 1 take about 4 to 9.5 minutes. Thus, embodiment 2 requires significantly less computation time than the various versions of embodiment 1 discussed above.
It is easily understood that the applications of embodiment 2 are substantially the same as the various applications described above for embodiment 1. Specifically, as described above, the various versions of embodiment 2 are particularly applicable to interpreting human behavior, analyzing web experiences, and designing robot controllers by demonstration, in which the objective function to be estimated is the immediate reward underlying the demonstrated behavior. The robot can then use the estimated reward to generate behavior for conditions it has not experienced, through forward reinforcement learning. Therefore, a highly economical and reliable system and method can be constructed according to embodiment 2 of the present invention. In particular, as described above, embodiment 2 can recover the observed policy from a small number of observations better than other methods, which is a significant advantage.
It will be apparent to those skilled in the art that various modifications and variations can be made in the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention cover the modifications and variations of this invention provided they come within the scope of the appended claims and their equivalents. In particular, it is expressly contemplated that any portion or all of any two or more of the above-described embodiments and modifications thereof may be combined and considered to be within the scope of the present invention.

Claims (10)

1. A method for inverse reinforcement learning to estimate a reward function and a cost function of a behavior of an object, wherein the object includes a demonstrator and a robot, the method comprising:
obtaining data representing changes in state variables defining behavior of the object;
applying the modified Bellman equation given by equation (1) to the acquired data:
r(x) + γV(y) − V(x) = ln[π(y|x)/b(y|x)]    (1)
= ln[π(x,y)/b(x,y)] − ln[π(x)/b(x)]    (2)
where r(x) and V(x) represent a reward function and a cost function, respectively, in state x, γ represents a discount factor, and b(y|x) and π(y|x) represent state transition probabilities before and after learning, respectively;
estimating the logarithm of the density ratio π(x)/b(x) in equation (2);
estimating r(x) and V(x) in equation (2) based on the result of estimating the logarithm of the density ratio π(x,y)/b(x,y); and
outputting the estimated r(x) and V(x) to estimate the behavior of the object.
2. The method of claim 1, wherein the step of estimating the logarithm of the density ratios π(x)/b(x) and π(x,y)/b(x,y) comprises using the Kullback-Leibler Importance Estimation Procedure (KLIEP) with a log-linear model.
3. The method of claim 1, wherein the step of estimating the logarithm of the density ratios π(x)/b(x) and π(x,y)/b(x,y) comprises using logistic regression.
4. A method for inverse reinforcement learning to estimate a reward function and a cost function of a behavior of an object, wherein the object includes a demonstrator and a robot, the method comprising:
obtaining data representing a state transition with an action defining a behavior of the object;
applying the modified Bellman equation given by equation (3) to the acquired data:
r(x) + γV(y) − V(x) = ln[π(u|x)/b(u|x)]    (3)
= ln[π(x,u)/b(x,u)] − ln[π(x)/b(x)]    (4)
where r(x) and V(x) represent a reward function and a cost function, respectively, for state x, γ represents a discount factor, and b(u|x) and π(u|x) represent stochastic policies before and after learning, respectively, each representing the probability of selecting action u in state x;
estimating the logarithm of the density ratio π(x)/b(x) in equation (4);
estimating r(x) and V(x) in equation (4) based on the result of estimating the logarithm of the density ratio π(x,u)/b(x,u); and
outputting the estimated r(x) and V(x) to estimate the behavior of the object.
5. The method of claim 4, wherein the step of estimating the logarithm of the density ratios π(x)/b(x) and π(x,u)/b(x,u) comprises using the Kullback-Leibler Importance Estimation Procedure (KLIEP) with a log-linear model.
6. The method of claim 4, wherein the step of estimating the logarithm of the density ratios π(x)/b(x) and π(x,u)/b(x,u) comprises using logistic regression.
7. A non-transitory storage medium storing instructions for causing a processor to execute an algorithm for inverse reinforcement learning for estimating a reward function and a cost function of a behavior of an object, wherein the object includes a demonstrator and a robot, the instructions causing the processor to perform the following steps:
obtaining data representing changes in state variables defining behavior of the object;
applying the modified Bellman equation given by equation (1) to the acquired data:
r(x) + γV(y) − V(x) = ln[π(y|x)/b(y|x)]    (1)
= ln[π(x,y)/b(x,y)] − ln[π(x)/b(x)]    (2)
where r(x) and V(x) represent a reward function and a cost function, respectively, in state x, γ represents a discount factor, and b(y|x) and π(y|x) represent state transition probabilities before and after learning, respectively;
estimating the logarithm of the density ratio π(x)/b(x) in equation (2);
estimating r(x) and V(x) in equation (2) based on the result of estimating the logarithm of the density ratio π(x,y)/b(x,y); and
outputting the estimated r(x) and V(x) to estimate the behavior of the object.
8. A system for inverse reinforcement learning to estimate a reward function and a cost function of a behavior of an object, wherein the object includes a demonstrator and a robot, the system comprising:
a data acquisition unit for acquiring data representing a change in a state variable defining a behavior of the object;
a processor having a memory, the processor and the memory configured to:
apply the modified Bellman equation given by equation (1) to the acquired data:
r(x) + γV(y) − V(x) = ln[π(y|x)/b(y|x)]    (1)
= ln[π(x,y)/b(x,y)] − ln[π(x)/b(x)]    (2)
where r(x) and V(x) represent a reward function and a cost function, respectively, in state x, γ represents a discount factor, and b(y|x) and π(y|x) represent state transition probabilities before and after learning, respectively;
estimate the logarithm of the density ratio π(x)/b(x) in equation (2);
estimate r(x) and V(x) in equation (2) based on the result of estimating the logarithm of the density ratio π(x,y)/b(x,y); and
an output interface that outputs the estimated r(x) and V(x) to estimate the behavior of the object.
9. A system for predicting a preference regarding topics of articles that a user is likely to read, from a series of articles selected by the user while browsing the internet, the system being an inverse reinforcement learning system for estimating a reward function and a cost function of a behavior of an object, implemented in a computer connected to the internet, the system comprising:
a data acquisition unit for acquiring data representing a change in a state variable defining a behavior of the object;
a processor having a memory, the processor and the memory configured to:
apply the modified Bellman equation given by equation (1) to the acquired data:
r(x) + γV(y) − V(x) = ln[π(y|x)/b(y|x)]    (1)
= ln[π(x,y)/b(x,y)] − ln[π(x)/b(x)]    (2)
where r(x) and V(x) represent a reward function and a cost function, respectively, in state x, γ represents a discount factor, and b(y|x) and π(y|x) represent state transition probabilities before and after learning, respectively;
estimate the logarithm of the density ratio π(x)/b(x) in equation (2);
estimate r(x) and V(x) in equation (2) based on the result of estimating the logarithm of the density ratio π(x,y)/b(x,y); and
an output interface that outputs the estimated r(x) and V(x) to estimate the behavior of the object,
wherein the object is the user and the state variables defining the behavior of the object include a topic of an article selected by the user while browsing each web page, and
wherein the processor causes an interface through which the user browses the internet website to display articles recommended for the user to read, in accordance with the estimated reward function and the estimated cost function.
10. A method for programming a robot to perform complex tasks, the method comprising:
controlling a first robot to complete a task so as to record a sequence of states and actions;
estimating the reward function and cost function with the system for inverse reinforcement learning of claim 8 based on the recorded sequence of states and actions; and
providing the estimated reward function and cost function to a forward reinforcement learning controller of a second robot to program the second robot with the estimated reward function and cost function.
CN201780017406.2A 2016-03-15 2017-02-07 Direct inverse reinforcement learning using density ratio estimation Active CN108885721B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US201662308722P 2016-03-15 2016-03-15
US62/308,722 2016-03-15
PCT/JP2017/004463 WO2017159126A1 (en) 2016-03-15 2017-02-07 Direct inverse reinforcement learning with density ratio estimation

Publications (2)

Publication Number Publication Date
CN108885721A CN108885721A (en) 2018-11-23
CN108885721B true CN108885721B (en) 2022-05-06

Family

ID=59851115

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201780017406.2A Active CN108885721B (en) 2016-03-15 2017-02-07 Direct inverse reinforcement learning using density ratio estimation

Country Status (5)

Country Link
EP (1) EP3430578A4 (en)
JP (1) JP6910074B2 (en)
KR (1) KR102198733B1 (en)
CN (1) CN108885721B (en)
WO (1) WO2017159126A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP7464115B2 (en) * 2020-05-11 2024-04-09 日本電気株式会社 Learning device, learning method, and learning program

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8756177B1 (en) * 2011-04-18 2014-06-17 The Boeing Company Methods and systems for estimating subject intent from surveillance
CN104573621A (en) * 2014-09-30 2015-04-29 李文生 Dynamic gesture learning and identifying method based on Chebyshev neural network
WO2016021210A1 (en) * 2014-08-07 2016-02-11 Okinawa Institute Of Science And Technology School Corporation Inverse reinforcement learning by density ratio estimation

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8359226B2 (en) * 2006-01-20 2013-01-22 International Business Machines Corporation System and method for marketing mix optimization for brand equity management
US9090255B2 (en) * 2012-07-12 2015-07-28 Honda Motor Co., Ltd. Hybrid vehicle fuel efficiency using inverse reinforcement learning

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8756177B1 (en) * 2011-04-18 2014-06-17 The Boeing Company Methods and systems for estimating subject intent from surveillance
WO2016021210A1 (en) * 2014-08-07 2016-02-11 Okinawa Institute Of Science And Technology School Corporation Inverse reinforcement learning by density ratio estimation
CN104573621A (en) * 2014-09-30 2015-04-29 李文生 Dynamic gesture learning and identifying method based on Chebyshev neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A Density-ratio Framework for Statistical Data Processing; Masashi Sugiyama et al.; IPSJ Transactions on Computer Vision and Applications; 2009-09-01; entire document *
Multi-robot inverse reinforcement learning under occlusion with interactions; Kenneth Bogert et al.; AAMAS '14: Proceedings of the 2014 International Conference on Autonomous Agents and Multi-Agent Systems; 2014-05-05; entire document *

Also Published As

Publication number Publication date
JP2019508817A (en) 2019-03-28
WO2017159126A1 (en) 2017-09-21
KR20180113587A (en) 2018-10-16
EP3430578A4 (en) 2019-11-13
KR102198733B1 (en) 2021-01-05
EP3430578A1 (en) 2019-01-23
JP6910074B2 (en) 2021-07-28
CN108885721A (en) 2018-11-23

Similar Documents

Publication Publication Date Title
CN106575382B (en) Computer method and system for estimating object behavior, system and medium for predicting preference
Chatzis et al. Echo state Gaussian process
US10896383B2 (en) Direct inverse reinforcement learning with density ratio estimation
Moreno-Muñoz et al. Heterogeneous multi-output Gaussian process prediction
Rothkopf et al. Modular inverse reinforcement learning for visuomotor behavior
Zhe et al. Scalable high-order gaussian process regression
Wang et al. Focused model-learning and planning for non-Gaussian continuous state-action systems
Osa Motion planning by learning the solution manifold in trajectory optimization
Chatzis et al. The copula echo state network
Stojkovic et al. Distance Based Modeling of Interactions in Structured Regression.
Amini et al. POMCP-based decentralized spatial task allocation algorithms for partially observable environments
Wang et al. Dynamic-resolution model learning for object pile manipulation
CN108885721B (en) Direct inverse reinforcement learning using density ratio estimation
Yamaguchi et al. Model-based multi-objective reinforcement learning with unknown weights
Obukhov et al. Neural network method for automatic data generation in adaptive information systems
Zhou et al. Bayesian inference for data-efficient, explainable, and safe robotic motion planning: A review
Vien et al. A covariance matrix adaptation evolution strategy for direct policy search in reproducing kernel Hilbert space
Matsumoto et al. Mobile robot navigation using learning-based method based on predictive state representation in a dynamic environment
Theodoropoulos et al. Cyber-physical systems in non-rigid assemblies: A methodology for the calibration of deformable object reconstruction models
Okadome et al. Predictive control method for a redundant robot using a non-parametric predictor
Liu et al. Distributional reinforcement learning with epistemic and aleatoric uncertainty estimation
Meden et al. First steps towards state representation learning for cognitive robotics
Pinto et al. One-shot learning in the road sign problem
Keurulainen Improving the sample efficiency of few-shot reinforcement learning with policy embeddings
Hanesz et al. Meta-Learning Path Planning Networks

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant