WO2022230019A1 - Learning device, learning method, and learning program - Google Patents

Learning device, learning method, and learning program

Info

Publication number
WO2022230019A1
Authority
WO
WIPO (PCT)
Prior art keywords
updating
likelihood
reward function
trajectory
regularization term
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/JP2021/016630
Other languages
English (en)
French (fr)
Japanese (ja)
Inventor
力 江藤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NEC Corp
Original Assignee
NEC Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NEC Corp
Priority to EP21939182.8A (EP4332845A4)
Priority to JP2023516874A (JP7529144B2)
Priority to US18/287,546 (US20240211767A1)
Priority to PCT/JP2021/016630 (WO2022230019A1)
Publication of WO2022230019A1
Anticipated expiration
Ceased

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00 Computing arrangements based on specific mathematical models
    • G06N7/01 Probabilistic graphical models, e.g. probabilistic networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/092 Reinforcement learning
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Definitions

  • The present invention relates to a learning device, a learning method, and a learning program that perform inverse reinforcement learning.
  • Reinforcement learning is known as one of the machine learning methods. Reinforcement learning is a method of learning actions that maximize value through trial and error of various actions. Reinforcement learning sets a reward function for evaluating this value, and searches for actions that maximize this reward function. However, setting the reward function is generally difficult.
  • Inverse Reinforcement Learning is known as a method to facilitate the setting of this reward function.
  • Inverse reinforcement learning generates a reward function that reflects the expert's intentions by repeatedly performing optimization with the reward function and updating the parameters of the reward function using the expert's decision-making history data.
  • Non-Patent Document 1 describes maximum entropy inverse reinforcement learning (ME-IRL: Maximum Entropy-IRL), which is one type of inverse reinforcement learning.
  • ME-IRL uses the maximum entropy principle to specify the trajectory distribution and learns the reward function by approximating the true distribution (i.e., maximum likelihood estimation). This resolves the ambiguity that multiple reward functions can reproduce the expert's trajectory (behavior history).
  • Non-Patent Document 2 describes GCL (Guided Cost Learning), which is one of the methods of inverse reinforcement learning that improves maximum entropy inverse reinforcement learning.
  • In GCL (Guided Cost Learning), importance sampling is used to update the weights of the reward function.
  • An object of the present invention is to provide a learning device, a learning method, and a learning program that can perform inverse reinforcement learning applicable to mathematical optimization problems such as combinatorial optimization while resolving the ambiguity of inverse reinforcement learning.
  • A learning device according to the present invention includes: function input means for receiving input of a reward function whose feature amounts are set so as to satisfy the Lipschitz continuity condition; estimating means for estimating a trajectory that minimizes the Wasserstein distance, which represents the distance between the probability distribution of the expert's trajectories and the probability distribution of trajectories determined based on the parameters of the reward function; and updating means for updating the parameters of the reward function so as to maximize the log-likelihood of the Boltzmann distribution derived from the maximum entropy principle, based on the estimated trajectory. The updating means derives, as a lower bound of the log-likelihood, an expression that subtracts from the Wasserstein distance an entropy regularization term defined as the maximum reward value for the parameters minus the average reward value for the parameters, and updates the parameters of the reward function so as to maximize the derived lower bound of the log-likelihood.
  • In a learning method according to the present invention, a computer receives input of a reward function whose feature amounts are set so as to satisfy the Lipschitz continuity condition; the computer estimates a trajectory that minimizes the Wasserstein distance, which represents the distance between the probability distribution of the expert's trajectories and the probability distribution of trajectories determined based on the parameters of the reward function; and the computer updates the parameters of the reward function so as to maximize the log-likelihood of the Boltzmann distribution derived from the maximum entropy principle, based on the estimated trajectory. In the updating, the computer derives, as a lower bound of the log-likelihood, an expression that subtracts from the Wasserstein distance an entropy regularization term defined as the maximum reward value for the parameters minus the average reward value for the parameters, and updates the parameters of the reward function so as to maximize the derived lower bound of the log-likelihood.
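  • A learning program according to the present invention causes a computer to execute: function input processing for receiving input of a reward function whose feature amounts are set so as to satisfy the Lipschitz continuity condition; estimation processing for estimating a trajectory that minimizes the Wasserstein distance, which represents the distance between the probability distribution of the expert's trajectories and the probability distribution of trajectories determined based on the parameters of the reward function; and update processing for updating the parameters of the reward function so as to maximize the log-likelihood of the Boltzmann distribution derived from the maximum entropy principle, based on the estimated trajectory. In the update processing, the program causes the computer to derive, as a lower bound of the log-likelihood, an expression that subtracts from the Wasserstein distance an entropy regularization term defined as the maximum reward value for the parameters minus the average reward value for the parameters, and to update the parameters of the reward function so as to maximize the derived lower bound of the log-likelihood.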
  • FIG. 1 is a block diagram showing a configuration example of an embodiment of a learning device according to the present disclosure.
  • FIG. 2 is a flowchart showing an operation example of the learning device.
  • FIG. 3 is a block diagram showing a configuration example of an embodiment of a robot control system.
  • FIG. 4 is a block diagram showing an overview of a learning device according to the present disclosure.
  • FIG. 5 is a schematic block diagram showing a configuration of a computer according to at least one embodiment.
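  • In ME-IRL, the reward function is expressed using a state s and an action a as R(s,a) = θ·f(s,a), where θ is the weight (parameter) of the reward function and f(s,a) is the feature amount.
  • The trajectory τ is expressed by Equation 1, and the probability model representing the trajectory distribution p_θ(τ) is expressed by Equation 2. θ^T f_τ in Equation 2 represents the reward function (see Equation 3), and Z represents the sum of rewards for all trajectories (see Equation 4).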
  • The rule for updating the weights of the reward function by maximum likelihood estimation (specifically, the gradient ascent method) is represented by Equations 5 and 6.
  • α in Equation 5 is the step size, and L(θ) is the distance measure between distributions used in ME-IRL.
  • The second term in Equation 6 is the sum of rewards for all trajectories.
  • ME-IRL assumes that the value of the second term can be strictly calculated. However, in reality, it is difficult to calculate the sum of rewards for all trajectories. The above is the ME-IRL problem setting, method, and problem.
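  • Equations 1 to 6 are not reproduced in this text. As a hedged reconstruction, the standard ME-IRL formulation that matches the surrounding definitions can be sketched as follows. The trajectory length T and the number N of expert trajectories τ^(n) are assumed notation and may differ from the published equations.

```latex
\begin{align}
\tau &= \{(s_1, a_1), \ldots, (s_T, a_T)\} \tag{1} \\
p_\theta(\tau) &= \frac{1}{Z} \exp\!\left(\theta^\top f_\tau\right) \tag{2} \\
\theta^\top f_\tau &= \sum_{t=1}^{T} \theta^\top f(s_t, a_t) \tag{3} \\
Z &= \sum_{\tau} \exp\!\left(\theta^\top f_\tau\right) \tag{4} \\
\theta &\leftarrow \theta + \alpha \nabla_\theta L(\theta) \tag{5} \\
\nabla_\theta L(\theta) &= \frac{1}{N} \sum_{n=1}^{N} f_{\tau^{(n)}}
  - \frac{1}{Z} \sum_{\tau} \exp\!\left(\theta^\top f_\tau\right) f_\tau \tag{6}
\end{align}
```

  • In this form, the second term of the gradient sums over every possible trajectory, which is the quantity that is hard to compute exactly in practice.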
  • FIG. 1 is a block diagram showing a configuration example of one embodiment of a learning device according to the present disclosure.
  • The learning device 100 of the present embodiment is a device that performs inverse reinforcement learning, which estimates a reward function from the behavior of a target person (expert) by machine learning. Specifically, it is a device that performs information processing based on the behavioral characteristics of the expert.
  • the learning device 100 includes a storage unit 10, an input unit 20, a feature amount setting unit 30, a weight initial value setting unit 40, a mathematical optimization execution unit 50, a weight update unit 60, a convergence determination unit 70, and an output unit 80 .
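  • Inverse reinforcement learning, which will be described later, is performed by the mathematical optimization execution unit 50, the weight update unit 60, and the convergence determination unit 70. Therefore, a device including the mathematical optimization execution unit 50, the weight update unit 60, and the convergence determination unit 70 can be called an inverse reinforcement learning device.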
  • the storage unit 10 stores information necessary for the learning device 100 to perform various processes.
  • the storage unit 10 may store expert decision-making history data (trajectory) received by the input unit 20, which will be described later. Further, the storage unit 10 may store candidates for the feature amount of the reward function used for learning by the mathematical optimization execution unit 50 and the weight update unit 60, which will be described later.
  • feature amount candidates do not necessarily have to be feature amounts used for the objective function.
  • the storage unit 10 may store a mathematical optimization solver for realizing the mathematical optimization executing unit 50, which will be described later.
  • the content of the mathematical optimization solver is arbitrary, and may be determined according to the execution environment and device.
  • the input unit 20 accepts input of information necessary for the learning device 100 to perform various processes.
  • The input unit 20 may, for example, receive input of the expert's decision-making history data (specifically, pairs of states and actions) described above. Further, the input unit 20 may receive input of an initial state and constraint conditions used when the inverse reinforcement learning device, which will be described later, performs inverse reinforcement learning.
  • The feature amount setting unit 30 sets the feature amounts of the reward function from data including states and actions. Specifically, so that the inverse reinforcement learning device, which will be described later, can use the Wasserstein distance as the distance measure between distributions, the feature amount setting unit 30 sets the feature amounts of the reward function so that the slope of the tangent line is finite over the entire function. For example, the feature amount setting unit 30 may set the feature amounts of the reward function so as to satisfy the Lipschitz continuity condition.
  • the feature amount setting unit 30 may set the feature amount so that the reward function becomes a linear function.
  • The function in Equation 7 exemplified below has an infinite gradient at a_0, so it is an inappropriate reward function in the present disclosure.
  • the feature quantity setting unit 30 may determine a reward function with a feature quantity set according to a user's instruction, or may acquire a reward function that satisfies the Lipschitz continuity condition from the storage unit 10 .
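  • The difference between a Lipschitz continuous reward and one with an unbounded gradient can be checked numerically. The sketch below is only an illustration: Equation 7 is not reproduced in this text, so the square-root function is used as a hypothetical stand-in for a reward whose gradient becomes infinite at one point, and `max_slope` is a helper name introduced here.

```python
import math
from typing import Callable


def max_slope(func: Callable[[float], float], lo: float, hi: float, samples: int) -> float:
    """Largest absolute slope between neighbouring grid points on [lo, hi]."""
    step = (hi - lo) / samples
    xs = [lo + i * step for i in range(samples + 1)]
    return max(abs(func(b) - func(a)) / (b - a) for a, b in zip(xs, xs[1:]))


def linear_reward(a: float) -> float:
    """A linear reward: its slope is bounded, so it is Lipschitz continuous."""
    return 2.0 * a + 1.0


def root_reward(a: float) -> float:
    """Hypothetical stand-in for an unsuitable reward: infinite gradient at a = 0."""
    return math.sqrt(a)


# Refining the grid leaves the linear slope estimate unchanged,
# while the square-root estimate keeps growing without bound.
coarse_linear = max_slope(linear_reward, 0.0, 1.0, 10)
fine_linear = max_slope(linear_reward, 0.0, 1.0, 10000)
coarse_root = max_slope(root_reward, 0.0, 1.0, 10)
fine_root = max_slope(root_reward, 0.0, 1.0, 10000)
```

  • A slope estimate that stays fixed under grid refinement suggests a finite Lipschitz constant; an estimate that grows with the resolution indicates that no such constant exists.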
  • the weight initial value setting unit 40 initializes the weight of the reward function. Specifically, the weight initial value setting unit 40 sets weights for individual feature amounts included in the reward function. Note that the method of initializing the weight is not particularly limited, and the weight may be initialized based on an arbitrary method predetermined according to the user or the like.
  • The mathematical optimization execution unit 50 derives a trajectory τ^ (τ^ denotes τ with a circumflex) that minimizes the distance between the probability distribution of the expert's trajectories (action history) and the probability distribution of trajectories determined based on the optimized parameters of the reward function. Specifically, the mathematical optimization execution unit 50 uses the Wasserstein distance as the distance measure between distributions and performs mathematical optimization so as to minimize the Wasserstein distance, thereby estimating the expert's trajectory τ^.
  • the Wasserstein distance is defined by Equation 8 exemplified below. That is, the Wasserstein distance represents the distance between the probability distribution of the expert's trajectory and the probability distribution of the trajectory determined based on the parameters of the reward function.
  • The reward function θ^T f_τ must be a function that satisfies the Lipschitz continuity condition due to the restriction of the Wasserstein distance.
  • In the present embodiment, the feature amounts of the reward function are set so as to satisfy the Lipschitz continuity condition, so the mathematical optimization execution unit 50 can use the Wasserstein distance as exemplified below.
  • The Wasserstein distance defined by Equation 8 takes a value of 0 or less, and increasing this value corresponds to bringing the distributions closer together. In the second term of Equation 8, τ_θ^(n) represents the n-th trajectory optimized with the parameter θ. The second term of Equation 8 can be calculated even in a combinatorial optimization problem. Therefore, by using the Wasserstein distance exemplified in Equation 8 as the distance measure between distributions, inverse reinforcement learning that can be applied to mathematical optimization problems such as combinatorial optimization problems can be performed.
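  • Equation 8 is not reproduced in this text. The sketch below assumes one form that is consistent with the description above: the mean reward of the expert's trajectories minus the mean reward of the trajectories optimized under θ, with a linear reward θ^T f. The function names and the toy feature vectors are hypothetical.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(theta: Vector, f: Vector) -> float:
    """Linear reward theta^T f for one trajectory feature vector."""
    return sum(t * x for t, x in zip(theta, f))


def mean_reward(theta: Vector, features: Sequence[Vector]) -> float:
    """Average reward over a set of trajectory feature vectors."""
    return sum(dot(theta, f) for f in features) / len(features)


def wasserstein_like_distance(theta: Vector,
                              expert_features: Sequence[Vector],
                              optimized_features: Sequence[Vector]) -> float:
    """ASSUMED form of Equation 8: expert mean reward minus optimized mean reward.

    When the optimized trajectories maximize the reward under theta and the
    expert trajectories are feasible, this value is 0 or less.
    """
    return mean_reward(theta, expert_features) - mean_reward(theta, optimized_features)


theta = [1.0, -0.5]
expert = [[2.0, 1.0], [1.0, 0.0]]       # features of the expert's trajectories
optimized = [[3.0, 1.0], [3.0, 1.0]]    # features of trajectories optimized under theta
distance = wasserstein_like_distance(theta, expert, optimized)
```

  • Only the optimized trajectories are needed for the second term, so a mathematical optimization solver is sufficient and no sum over all trajectories is required.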
  • The weight updating unit 60 updates the parameter θ of the reward function so as to maximize the distance measure between distributions based on the estimated expert's trajectory τ^.
  • In maximum entropy inverse reinforcement learning (that is, ME-IRL), the trajectory τ is assumed to follow the Boltzmann distribution according to the maximum entropy principle. Therefore, as in ME-IRL, the weight updating unit 60 updates the parameter θ of the reward function so as to maximize the log-likelihood of the Boltzmann distribution derived from the maximum entropy principle, based on the estimated expert's trajectory τ^, as shown in Equations 5 and 6 above.
  • In doing so, the weight updating unit 60 of the present embodiment derives a lower bound of the log-likelihood using the upper bound of the log sum exponential (hereinafter referred to as logSumExp). That is, the weight updating unit 60 derives the lower bound L_(θ) (L_ denotes L with an underline) of the inter-distribution distance measure used in ME-IRL, as shown in Equation 9 below.
  • the derived formula may simply be referred to as the lower bound of the logarithmic likelihood.
  • The second term of Equation 9, which represents the lower bound of the log-likelihood, is the maximum reward value for the current parameter θ, and the third term is the logarithm of the number of possible trajectories (N_τ).
  • In other words, based on the log-likelihood of ME-IRL, the weight updating unit 60 derives a lower bound of the log-likelihood that is calculated by subtracting the maximum reward value for the current parameter θ and the logarithm of the number of possible trajectories (N_τ) from the term for the probability distribution of the trajectories.
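  • Equation 9 is not reproduced in this text. The following is a hedged sketch of how such a lower bound follows from the standard upper bound of logSumExp; N (the number of expert trajectories τ^(n)) is assumed notation, and the published equation may be arranged differently.

```latex
\begin{align}
L(\theta) &= \frac{1}{N} \sum_{n=1}^{N} \theta^\top f_{\tau^{(n)}}
  - \log \sum_{\tau} \exp\!\left(\theta^\top f_\tau\right) \\
\log \sum_{\tau} \exp\!\left(\theta^\top f_\tau\right)
  &\le \max_{\tau} \theta^\top f_\tau + \log N_\tau \\
L(\theta) &\ge \underline{L}(\theta)
  = \frac{1}{N} \sum_{n=1}^{N} \theta^\top f_{\tau^{(n)}}
  - \max_{\tau} \theta^\top f_\tau - \log N_\tau
\end{align}
```

  • The bound holds because each of the N_τ terms inside the sum is at most the exponential of the maximum reward, so the intractable sum over all trajectories is replaced by a maximization that a solver can carry out.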
  • the weight updating unit 60 transforms the derived lower bound formula for the logarithmic likelihood of ME-IRL into a formula for subtracting the entropy regularization term from the Wasserstein distance.
  • a formula obtained by decomposing the formula for the lower bound of the log-likelihood of ME-IRL into the Wasserstein distance and the entropy regularization term is expressed as Formula 10 illustrated below.
  • the expression in the first parenthesis of Expression 10 represents the Wasserstein distance, like Expression 8 above.
  • the expression in the second parenthesis of Equation 10 represents an entropy regularization term that contributes to the increase in the logarithmic likelihood of the Boltzmann distribution derived from the maximum entropy principle.
  • In the expression in the second parenthesis of Equation 10, the first term represents the maximum reward value for the current parameter θ, and the second term represents the mean value of the reward for the current parameter θ.
  • The reason why the expression in the second parenthesis functions as an entropy regularization term is as follows.
  • To maximize the lower bound of the log-likelihood of ME-IRL, the value of the expression in the second parenthesis should be small, which corresponds to a small difference between the maximum reward value and the mean value. A smaller difference between the maximum reward value and the mean value indicates smaller variability among the trajectories.
  • a smaller difference between the maximum reward value and the average value means an increase in entropy, so entropy regularization works and contributes to entropy maximization. This contributes to the maximization of the log-likelihood of the Boltzmann distribution and, as a result, contributes to resolution of ambiguity in inverse reinforcement learning.
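  • The decomposition can be checked on a toy problem small enough to enumerate every trajectory. The sketch below assumes that Equation 10 has the form (expert mean reward minus optimized mean reward) minus (maximum reward minus optimized mean reward) minus log N_τ, which is consistent with the description but not confirmed by this text. All feature vectors are hypothetical.

```python
import math
from typing import Sequence, Tuple

Vector = Sequence[float]


def dot(theta: Vector, f: Vector) -> float:
    return sum(t * x for t, x in zip(theta, f))


def mean_reward(theta: Vector, features: Sequence[Vector]) -> float:
    return sum(dot(theta, f) for f in features) / len(features)


def log_sum_exp(values: Sequence[float]) -> float:
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def exact_log_likelihood(theta: Vector, expert: Sequence[Vector],
                         all_features: Sequence[Vector]) -> float:
    """ME-IRL log-likelihood; feasible only when all trajectories can be listed."""
    return mean_reward(theta, expert) - log_sum_exp([dot(theta, f) for f in all_features])


def lower_bound_terms(theta: Vector, expert: Sequence[Vector],
                      optimized: Sequence[Vector],
                      all_features: Sequence[Vector]) -> Tuple[float, float, float]:
    """Return (Wasserstein term, entropy regularization term, log N_tau)."""
    optimized_mean = mean_reward(theta, optimized)
    wasserstein = mean_reward(theta, expert) - optimized_mean
    entropy_reg = max(dot(theta, f) for f in all_features) - optimized_mean
    return wasserstein, entropy_reg, math.log(len(all_features))


theta = [1.0, 0.5]
all_features = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # every possible trajectory
expert = [[1.0, 0.0], [1.0, 1.0]]
optimized = [[1.0, 1.0], [1.0, 0.0]]  # hypothetical solver outputs under theta

w, reg, log_n = lower_bound_terms(theta, expert, optimized, all_features)
lower_bound = w - reg - log_n
exact = exact_log_likelihood(theta, expert, all_features)
```

  • The optimized mean reward cancels between the two parentheses, so the decomposed value equals the expert mean reward minus the maximum reward minus log N_τ, and it never exceeds the exact log-likelihood.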
  • The weight update unit 60 fixes the estimated trajectory τ^ and updates the parameter θ by, for example, the gradient ascent method based on Equation 10 shown above.
  • However, the normal gradient ascent method may not converge.
  • In the entropy regularization term, the feature amount (f_τmax) of the trajectory with the maximum reward value does not match the average value of the feature amounts (f_τ(n)) of the other trajectories (i.e., the difference between the two does not become 0). Therefore, with the normal gradient ascent method, the log-likelihood oscillates without converging and is unstable, which makes it difficult to determine convergence appropriately (see Equation 11 below).
  • Therefore, when using the gradient method, the weight updating unit 60 of the present embodiment may update the parameter θ so as to gradually attenuate the portion that contributes to entropy regularization (that is, the portion corresponding to the entropy regularization term).
  • Specifically, the weight update unit 60 defines an update formula in which a damping coefficient β_t indicating the degree of damping is set in the portion that contributes to entropy regularization.
  • For example, the weight updating unit 60 differentiates Equation 10 above with respect to θ, separates the result into the portion corresponding to the term indicating the Wasserstein distance (that is, the portion contributing to the process of increasing the Wasserstein distance) and the portion corresponding to the entropy regularization term, and defines Equation 12 exemplified below, in which the damping coefficient is set in the portion corresponding to the entropy regularization term.
  • The damping coefficient is predefined according to how the portion corresponding to the entropy regularization term is to be attenuated. For example, in the case of smooth attenuation, β_t is defined as in Equation 13 exemplified below.
  • In Equation 13, β_1 is set to 1 and β_2 is set to 0 or greater. Also, t indicates the number of iterations. As a result, the damping coefficient β_t functions as a coefficient that reduces the portion corresponding to the entropy regularization term as the number of iterations t increases.
  • Alternatively, the weight update unit 60 may update the parameter θ without attenuating the portion corresponding to the entropy regularization term in the initial stage of updating and, at the timing when the log-likelihood starts to oscillate, update the parameter θ so as to reduce the effect of the portion corresponding to the entropy regularization term.
  • The weight update unit 60 may determine that the log-likelihood has started to oscillate, for example, when the moving average of the log-likelihood becomes constant. Specifically, when the change in the moving average of the lower bound of the log-likelihood over a time window (several points from the current value back into the past) is small (for example, 1e-3 or less), the weight updating unit 60 may judge that the moving average has become constant.
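  • The moving-average criterion can be sketched as follows. The window length, the number of recent averages compared, and the helper names are assumptions made for illustration; only the 1e-3 tolerance is taken from the text.

```python
from typing import List, Sequence


def moving_average(values: Sequence[float], window: int) -> List[float]:
    """Simple moving average over a sliding window of the given length."""
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]


def has_plateaued(history: Sequence[float], window: int = 4,
                  lookback: int = 3, tol: float = 1e-3) -> bool:
    """True when the last `lookback` moving averages differ by at most `tol`.

    `history` holds the lower bound of the log-likelihood at each iteration.
    A flat moving average over an oscillating history is read as the onset
    of oscillation.
    """
    averages = moving_average(history, window)
    if len(averages) < lookback:
        return False  # not enough data to judge yet
    recent = averages[-lookback:]
    return max(recent) - min(recent) <= tol


oscillating = [-1.0, -0.9] * 10           # bounces between two values
improving = [float(i) for i in range(20)]  # still increasing steadily
```

  • With an even window, an alternating history averages to a constant, so `has_plateaued(oscillating)` reports the plateau while `has_plateaued(improving)` does not.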
  • The method of determining the timing at which oscillation starts is the same as the method described above.
  • Furthermore, the weight updating unit 60 may change the update method of the parameter θ at the timing when the log-likelihood starts to oscillate again after the damping coefficient has been changed as in Equation 13 shown above. Specifically, the weight updating unit 60 may update the parameter θ using a momentum method as exemplified in Equation 14 below.
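  • Equations 12 to 14 are not reproduced in this text, so the following is only a sketch under stated assumptions. The gradient is assumed to be (expert mean features minus optimized mean features) minus β_t times (maximum-reward features minus optimized mean features); the schedule β_t = β_1 / (1 + β_2 t) is a hypothetical smooth decay that respects β_1 = 1 and β_2 of 0 or greater; and the momentum step is the common textbook form, not necessarily the one in Equation 14.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def damping_coefficient(t: int, beta1: float = 1.0, beta2: float = 0.1) -> float:
    """ASSUMED schedule: equals beta1 at t = 0 and decays smoothly as t grows."""
    return beta1 / (1.0 + beta2 * t)


def damped_gradient(expert_mean: Vector, optimized_mean: Vector,
                    max_features: Vector, beta_t: float) -> List[float]:
    """Gradient of the lower bound with the entropy regularization part damped."""
    return [(e - o) - beta_t * (m - o)
            for e, o, m in zip(expert_mean, optimized_mean, max_features)]


def gradient_ascent_step(theta: Vector, grad: Vector, alpha: float) -> List[float]:
    """Plain gradient ascent: theta <- theta + alpha * grad."""
    return [th + alpha * g for th, g in zip(theta, grad)]


def momentum_step(theta: Vector, velocity: Vector, grad: Vector,
                  alpha: float, gamma: float = 0.9) -> Tuple[List[float], List[float]]:
    """Textbook momentum update (assumed stand-in for Equation 14)."""
    new_velocity = [gamma * v + alpha * g for v, g in zip(velocity, grad)]
    new_theta = [th + v for th, v in zip(theta, new_velocity)]
    return new_theta, new_velocity


# Example: with beta_t = 0 only the Wasserstein part of the gradient remains.
grad_undamped = damped_gradient([1.0, 2.0], [0.5, 1.0], [2.0, 2.0], 1.0)
grad_fully_damped = damped_gradient([1.0, 2.0], [0.5, 1.0], [2.0, 2.0], 0.0)
theta_next, velocity_next = momentum_step([0.0, 0.0], [0.0, 0.0], [1.0, 1.0], alpha=0.1)
```

  • Setting β_2 to 0 keeps β_t at 1, which leaves the entropy regularization part untouched; larger values of β_2 shrink its influence more quickly.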
  • trajectory estimation processing by the mathematical optimization execution unit 50 and the parameter ⁇ update processing by the weight update unit 60 are repeated until the convergence determination unit 70, which will be described later, determines that the lower limit of the logarithmic likelihood has converged.
  • The convergence determination unit 70 determines whether or not the distance measure between distributions has converged. Specifically, the convergence determination unit 70 determines whether or not the lower bound of the log-likelihood has converged. Any determination method may be used. For example, the convergence determination unit 70 may determine that the distance measure between distributions has converged when the absolute value of the lower bound of the log-likelihood becomes smaller than a predetermined threshold value.
  • the convergence determination unit 70 determines that the distance measure between distributions has not converged, the processing by the mathematical optimization execution unit 50 and the weight update unit 60 is continued. On the other hand, when the convergence determination unit 70 determines that the distance measure between distributions has converged, the processing by the mathematical optimization execution unit 50 and the weight update unit 60 is terminated.
  • the output unit 80 outputs the learned reward function.
  • The input unit 20, the feature amount setting unit 30, the weight initial value setting unit 40, the mathematical optimization execution unit 50, the weight updating unit 60, the convergence determination unit 70, and the output unit 80 are implemented by a computer processor (for example, a CPU (Central Processing Unit)) that operates according to a program (learning program).
  • For example, the program may be stored in the storage unit 10 provided in the learning device 100, and the processor may read the program and operate, according to the program, as the input unit 20, the feature amount setting unit 30, the weight initial value setting unit 40, the mathematical optimization execution unit 50, the weight update unit 60, the convergence determination unit 70, and the output unit 80.
  • the functions of the learning device 100 may be provided in a SaaS (Software as a Service) format.
  • The input unit 20, the feature amount setting unit 30, the weight initial value setting unit 40, the mathematical optimization execution unit 50, the weight updating unit 60, the convergence determination unit 70, and the output unit 80 may each be realized by dedicated hardware. Also, part or all of each component of each device may be implemented by general-purpose or dedicated circuitry, processors, etc., or combinations thereof. These may be composed of a single chip, or may be composed of multiple chips connected via a bus. Part or all of each component of each device may be implemented by a combination of the above-described circuits and the like and programs.
  • When part or all of the components of the learning device 100 are realized by a plurality of information processing devices, circuits, etc., these may be arranged centrally or in a distributed manner.
  • For example, the information processing devices, circuits, and the like may be implemented in a form in which each is connected via a communication network, such as a client-server system or a cloud computing system.
  • FIG. 2 is a flowchart showing an operation example of the learning device 100 of this embodiment.
  • the input unit 20 receives an input of expert data (that is, an expert's trajectory/decision-making history data) (step S11).
  • the feature amount setting unit 30 sets the feature amount of the reward function so as to satisfy the Lipschitz continuity condition from the data including the state and action (step S12).
  • the weight initial value setting unit 40 also initializes the weights (parameters) of the reward function (step S13).
  • The mathematical optimization execution unit 50 receives input of a reward function whose feature amounts are set so as to satisfy the Lipschitz continuity condition (step S14). Then, the mathematical optimization execution unit 50 executes mathematical optimization so as to minimize the Wasserstein distance (step S15). Specifically, the mathematical optimization execution unit 50 estimates a trajectory that minimizes the Wasserstein distance, which represents the distance between the probability distribution of the expert's trajectories and the probability distribution of trajectories determined based on the parameters of the reward function.
  • the weight updating unit 60 updates the parameters of the reward function so as to maximize the logarithmic likelihood of the Boltzmann distribution based on the estimated trajectory (step S16). At this time, the weight updating unit 60 derives the lower bound of the logarithmic likelihood and updates the parameters of the reward function so as to maximize the derived lower bound of the logarithmic likelihood.
  • the convergence determination unit 70 determines whether or not the lower limit of the logarithmic likelihood has converged (step S17). If it is determined that the lower limit of the logarithmic likelihood has not converged (No in step S17), the processes after step S15 are repeated using the updated trajectory. On the other hand, when it is determined that the lower limit of the logarithmic likelihood has converged (Yes in step S17), the output unit 80 outputs the learned reward function (step S18).
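  • The flow of steps S11 to S18 can be sketched end to end on a toy combinatorial problem. Everything below is hypothetical: the problem (choose exactly 2 of 4 items), the brute-force search standing in for a mathematical optimization solver, and the convergence test on the change of the lower bound. With a single problem instance the optimized trajectory is also the maximum-reward trajectory, so the entropy regularization part of the gradient is zero and only the Wasserstein part appears in the update.

```python
import itertools
import math
from typing import List, Sequence, Tuple

N_ITEMS = 4        # hypothetical toy problem: choose exactly 2 of 4 items
SUBSET_SIZE = 2
CANDIDATES: List[Tuple[int, ...]] = list(
    itertools.combinations(range(N_ITEMS), SUBSET_SIZE))


def features(subset: Sequence[int]) -> List[float]:
    """Step S12: indicator features, linear and therefore Lipschitz continuous."""
    return [1.0 if i in subset else 0.0 for i in range(N_ITEMS)]


def reward(theta: Sequence[float], subset: Sequence[int]) -> float:
    return sum(t * f for t, f in zip(theta, features(subset)))


def best_subset(theta: Sequence[float]) -> Tuple[int, ...]:
    """Step S15: brute-force stand-in for the mathematical optimization solver."""
    return max(CANDIDATES, key=lambda c: reward(theta, c))


def lower_bound(theta: Sequence[float],
                expert_choices: Sequence[Sequence[int]]) -> float:
    """Expert mean reward minus maximum reward minus log of the trajectory count."""
    expert_mean = sum(reward(theta, c) for c in expert_choices) / len(expert_choices)
    max_reward = max(reward(theta, c) for c in CANDIDATES)
    return expert_mean - max_reward - math.log(len(CANDIDATES))


def learn_reward_weights(expert_choices: Sequence[Sequence[int]], alpha: float = 0.1,
                         tol: float = 1e-6, max_iter: int = 100) -> List[float]:
    theta = [0.0] * N_ITEMS                                   # step S13
    expert_rows = [features(c) for c in expert_choices]       # step S11
    expert_mean = [sum(col) / len(expert_rows) for col in zip(*expert_rows)]
    previous = None
    for _ in range(max_iter):
        optimized = features(best_subset(theta))              # step S15
        theta = [t + alpha * (e - o)                          # step S16
                 for t, e, o in zip(theta, expert_mean, optimized)]
        bound = lower_bound(theta, expert_choices)            # step S17
        if previous is not None and abs(bound - previous) < tol:
            break
        previous = bound
    return theta                                              # step S18


learned_theta = learn_reward_weights([(2, 3)])
```

  • After learning from an expert who always picks items 2 and 3, optimizing under `learned_theta` reproduces that choice, which is the behavior the loop is meant to recover.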
  • As described above, in the present embodiment, the mathematical optimization execution unit 50 receives input of a reward function whose feature amounts are set so as to satisfy the Lipschitz continuity condition, and estimates a trajectory that minimizes the Wasserstein distance, which represents the distance between the probability distribution of the expert's trajectories and the probability distribution of trajectories determined based on the parameters of the reward function.
  • Then, the weight updating unit 60 updates the parameters of the reward function so as to maximize the log-likelihood of the Boltzmann distribution, based on the estimated trajectory.
  • In doing so, the weight updating unit 60 derives, as the lower bound of the log-likelihood, an expression that subtracts the entropy regularization term from the Wasserstein distance, and updates the parameters of the reward function so as to maximize the derived lower bound of the log-likelihood. Therefore, inverse reinforcement learning that can be applied to mathematical optimization problems such as combinatorial optimization can be performed while resolving the ambiguity of inverse reinforcement learning.
  • the learning device 100 (weight updating unit 60) of the present embodiment derives the lower bound of the logarithmic likelihood of maximum entropy inverse reinforcement learning and decomposes it into the Wasserstein distance and the entropy regularization term. Then, learning device 100 updates the parameters of the reward function so as to maximize the lower bound of the derived logarithmic likelihood. Therefore, the ambiguity in inverse reinforcement learning can be resolved, and the setting of the sampling distribution is unnecessary, so it can be applied to various mathematical optimizations, especially combinatorial optimizations.
  • Typical combinatorial optimization problems include routing problems, scheduling problems, cutting/packing problems, and assignment/matching problems.
  • Specifically, a routing problem is, for example, a transportation routing problem or a traveling salesman problem, and a scheduling problem is, for example, a job shop problem or a work schedule problem. A cutting/packing problem is, for example, a knapsack problem or a bin-packing problem, and an assignment/matching problem is, for example, a maximum matching problem or a generalized assignment problem.
  • FIG. 3 is a block diagram showing a configuration example of an embodiment of the robot control system.
  • a robot control system 2000 illustrated in FIG. 3 includes a learning device 100 , a learning data storage unit 2200 and a robot 2300 .
  • the learning device 100 illustrated in FIG. 3 is the same as the learning device 100 in the above embodiment.
  • the learning device 100 stores the reward function created as a result of learning in the storage unit 2310 of the robot 2300, which will be described later.
  • the learning data storage unit 2200 stores learning data that the learning device 100 uses for learning.
  • the learning data storage unit 2200 may store decision-making history data of experts, for example.
  • a robot 2300 is a device that operates based on a reward function. It should be noted that the robots here are not limited to devices shaped like humans or animals, and include devices that perform automatic work (automatic operation, automatic control, etc.). Robot 2300 includes a storage unit 2310 , an input unit 2320 and a control unit 2330 .
  • the storage unit 2310 stores the reward function learned by the learning device 100.
  • the input unit 2320 accepts input of data indicating the state when the robot is operated.
  • the control unit 2330 determines the action to be performed by the robot 2300 based on the received data (indicating the state) and the reward function stored in the storage unit 2310.
  • the method by which the control unit 2330 determines the control action based on the reward function is widely known, and detailed description thereof will be omitted here.
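  • As one simple possibility, the control unit could pick the action with the highest learned reward for the current state. The sketch below is an assumption for illustration, not the method of the disclosure: it uses a greedy one-step choice, and the feature function, state, and candidate actions are hypothetical.

```python
from typing import Callable, List, Sequence


def choose_action(state: float, actions: Sequence[float], theta: Sequence[float],
                  feature_fn: Callable[[float, float], List[float]]) -> float:
    """Greedy one-step control: maximize the linear reward theta^T f(s, a)."""
    def reward(action: float) -> float:
        return sum(t * f for t, f in zip(theta, feature_fn(state, action)))
    return max(actions, key=reward)


def toy_features(state: float, action: float) -> List[float]:
    """Hypothetical features: closeness to a target at 0 and control effort."""
    return [-abs(state + action), -abs(action)]


learned_theta = [1.0, 0.1]   # stands in for weights read from storage unit 2310
chosen = choose_action(2.0, [-2.0, -1.0, 0.0, 1.0], learned_theta, toy_features)
```

  • Here the action that moves the state exactly onto the target wins because its closeness reward outweighs the small effort penalty.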
  • a device that performs automatic work such as the robot 2300, can be controlled based on a reward function that reflects the intention of the expert.
  • FIG. 4 is a block diagram outlining a learning device according to the present disclosure.
  • A learning device 90 (for example, the learning device 100) according to the present disclosure includes: function input means 91 (for example, the mathematical optimization execution unit 50) that receives input of a reward function whose feature amounts are set so as to satisfy the Lipschitz continuity condition; estimating means 92 (for example, the mathematical optimization execution unit 50) that estimates a trajectory that minimizes the Wasserstein distance, which represents the distance between the probability distribution of the expert's trajectories and the probability distribution of trajectories determined based on the parameters of the reward function; and updating means 93 (for example, the weight updating unit 60) that updates the parameters of the reward function so as to maximize the log-likelihood of the Boltzmann distribution derived from the maximum entropy principle, based on the estimated trajectory.
  • The updating means 93 derives, as the lower bound of the log-likelihood, an expression that subtracts from the Wasserstein distance an entropy regularization term defined as the maximum reward value for the parameters minus the average reward value for the parameters, and updates the parameters of the reward function so as to maximize the derived lower bound of the log-likelihood.
  • The updating means 93 may set, in the portion corresponding to the entropy regularization term (for example, the expression in the second parenthesis of Equation 10), a damping coefficient (for example, β_t) that attenuates the degree to which that portion contributes to maximization of the lower bound of the log-likelihood, define an update formula with this coefficient (for example, Equation 12), and update the parameters of the reward function so as to maximize the lower bound of the log-likelihood.
  • The updating means 93 may change the attenuation coefficient when it determines that the moving average of the log-likelihood has become constant (for example, when the change in the moving average is small).
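The plateau-triggered change of the attenuation coefficient could be sketched as follows. This is only one possible reading: the multiplicative decay, the window length, the tolerance, and the names `AttenuationSchedule` and `lower_bound` are hypothetical choices, not values or identifiers taken from the disclosure.

```python
from collections import deque

class AttenuationSchedule:
    """Shrink the entropy-term weight when the log-likelihood moving average plateaus."""

    def __init__(self, initial=1.0, decay=0.5, window=5, tol=1e-3):
        self.coefficient = initial        # weight on the entropy regularization term
        self.decay = decay                # hypothetical multiplicative decay factor
        self.tol = tol                    # "constant" is read as |change| < tol
        self.history = deque(maxlen=window)
        self.prev_average = None

    def step(self, log_likelihood):
        """Record one iteration's log-likelihood and return the current coefficient."""
        self.history.append(log_likelihood)
        if len(self.history) == self.history.maxlen:
            average = sum(self.history) / len(self.history)
            if self.prev_average is not None and abs(average - self.prev_average) < self.tol:
                self.coefficient *= self.decay   # attenuate the entropy term
                self.history.clear()             # restart the window after a change
                self.prev_average = None
            else:
                self.prev_average = average
        return self.coefficient

def lower_bound(wasserstein_term, entropy_term, coefficient):
    """Objective to maximize: Wasserstein term minus the attenuated entropy term."""
    return wasserstein_term - coefficient * entropy_term

schedule = AttenuationSchedule(initial=1.0, decay=0.5, window=3, tol=1e-3)
coefficients = [schedule.step(v) for v in [-5.0, -4.0, -3.0, -2.0, -2.0, -2.0, -2.0]]
```

With this toy sequence the coefficient stays at its initial value while the log-likelihood is still improving and is halved once two consecutive moving averages agree within the tolerance.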
  • The updating means 93 may derive the lower bound of the log-likelihood based on the upper bound of the log-sum-exp function.
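The role of the log-sum-exp upper bound can be checked numerically with a small sketch. It assumes a finite set of candidate trajectories and uses the standard inequality log-sum-exp(x) <= max(x) + log(M); the reward values and function names are hypothetical.

```python
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_likelihood(expert_rewards, candidate_rewards):
    """Mean expert reward minus the log partition function over the candidates."""
    return sum(expert_rewards) / len(expert_rewards) - log_sum_exp(candidate_rewards)

def log_likelihood_lower_bound(expert_rewards, candidate_rewards):
    """Replace log-sum-exp by its upper bound max + log(M), giving a lower bound."""
    m = len(candidate_rewards)
    mean_expert = sum(expert_rewards) / len(expert_rewards)
    return mean_expert - max(candidate_rewards) - math.log(m)

expert = [1.8, 2.1]             # hypothetical rewards of expert trajectories
candidates = [0.2, 1.5, 2.0]    # hypothetical rewards of candidate trajectories
exact = log_likelihood(expert, candidates)
bound = log_likelihood_lower_bound(expert, candidates)
```

Because the partition term is replaced by something at least as large, `bound` never exceeds `exact`, and it avoids evaluating the exponential sum.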
  • The function input means 91 may receive an input of a reward function whose feature amount is set to be a linear function.
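To illustrate why a linear feature amount is compatible with the Lipschitz continuity condition, the sketch below assumes that the feature is a linear map f(x) = A x of a trajectory vector x and that the reward is theta . f(x). The matrix `A`, the vector representation of a trajectory, and the function names are assumptions made for this example.

```python
import numpy as np

def reward(theta, A, x):
    """Reward with linear features f(x) = A @ x, i.e. r(x) = theta . (A @ x)."""
    return float(theta @ (A @ x))

def lipschitz_constant(theta, A):
    """Lipschitz constant of r with respect to the Euclidean norm: ||A^T theta||."""
    return float(np.linalg.norm(theta @ A))

theta = np.array([1.0, -2.0])                 # hypothetical reward weights
A = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 0.0]])               # hypothetical linear feature map
L = lipschitz_constant(theta, A)

x = np.array([1.0, 0.0, 0.0])
y = np.array([0.0, 0.0, 0.0])
gap = abs(reward(theta, A, x) - reward(theta, A, y))
```

By the Cauchy-Schwarz inequality, `gap` is at most `L` times the distance between `x` and `y` for any pair of inputs, so the reward is Lipschitz continuous with constant `L`.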
  • FIG. 5 is a schematic block diagram showing the configuration of a computer according to at least one embodiment.
  • A computer 1000 comprises a processor 1001, a main storage device 1002, an auxiliary storage device 1003, and an interface 1004.
  • The learning device 90 described above is implemented in the computer 1000.
  • The operation of each processing unit described above is stored in the auxiliary storage device 1003 in the form of a program (learning program).
  • The processor 1001 reads the program from the auxiliary storage device 1003, loads it into the main storage device 1002, and executes the above processing according to the program.
  • The auxiliary storage device 1003 is an example of a non-transitory tangible medium.
  • Other examples of non-transitory tangible media include magnetic disks, magneto-optical disks, CD-ROMs (Compact Disc Read-Only Memory), DVD-ROMs (Read-Only Memory), and semiconductor memories connected via the interface 1004.
  • When the program is distributed to the computer 1000, the computer 1000 that receives the distribution may load the program into the main storage device 1002 and execute the above processing.
  • The program may be for realizing part of the functions described above.
  • The program may also be a so-called difference file (difference program) that implements the above-described functions in combination with another program already stored in the auxiliary storage device 1003.
  • (Appendix 1) A learning device comprising: function input means for receiving an input of a reward function whose feature amount is set so as to satisfy the Lipschitz continuity condition; estimating means for estimating a trajectory that minimizes the Wasserstein distance representing the distance between the probability distribution of the trajectory of the expert and the probability distribution of the trajectory determined based on the parameters of the reward function; and updating means for updating the parameters of the reward function, based on the estimated trajectory, so as to maximize the log-likelihood of the Boltzmann distribution derived by the principle of maximum entropy,
  • wherein the updating means derives, as the lower bound of the log-likelihood, an expression that subtracts from the Wasserstein distance an entropy regularization term, defined as the maximum reward value for the parameter minus the average reward value for the parameter, and updates the parameters of the reward function so as to maximize the derived lower bound of the log-likelihood.
  • The updating means sets, for the entropy regularization term, an attenuation coefficient that attenuates the extent to which the portion corresponding to the entropy regularization term contributes to maximization of the lower bound of the log-likelihood as the parameter updating process is repeated, and updates the parameters of the reward function so as to maximize the lower bound of the log-likelihood thus set.
  • The learning device according to Appendix 1, wherein the updating means sets, in the portion corresponding to the entropy regularization term, an attenuation coefficient that attenuates the extent to which the entropy regularization term contributes to maximization of the lower bound of the log-likelihood, and, during the iterations of the parameter update, changes the attenuation coefficient so as to attenuate the extent to which the portion corresponding to the entropy regularization term contributes to maximization of the lower bound of the log-likelihood.
  • (Appendix 6) The learning device according to any one of Appendices 1 to 5, wherein the function input means receives an input of a reward function whose feature amount is set to be a linear function.
  • A computer receives an input of a reward function whose feature amount is set so as to satisfy the Lipschitz continuity condition; the computer estimates a trajectory that minimizes the Wasserstein distance, which represents the distance between the probability distribution of the trajectory of the expert and the probability distribution of the trajectory determined based on the parameters of the reward function;
  • the computer updates, based on the estimated trajectory, the parameters of the reward function so as to maximize the log-likelihood of the Boltzmann distribution derived by the principle of maximum entropy; and
  • in the updating, the computer derives, as the lower bound of the log-likelihood, an expression that subtracts from the Wasserstein distance an entropy regularization term, defined as the maximum reward value for the parameter minus the average reward value for the parameter, and updates the parameters of the reward function so as to maximize the derived lower bound of the log-likelihood.
  • the computer sets an attenuation coefficient that attenuates the extent to which the entropy regularization term contributes to maximizing the lower bound of the logarithmic likelihood in the portion corresponding to the entropy regularization term, and repeats the process of updating the parameters.
  • The program storage medium according to Appendix 10, which stores a learning program for setting, for the entropy regularization term, an attenuation coefficient that attenuates the extent to which the portion corresponding to the entropy regularization term contributes to maximization of the lower bound of the log-likelihood as the process of updating the parameters is repeated, and for updating the parameters of the reward function so as to maximize the lower bound of the log-likelihood thus set.
  • the entropy regularization term is set with an attenuation coefficient that attenuates the extent to which the portion corresponding to the entropy regularization term contributes to maximization of the lower bound of the log-likelihood as the process of updating the parameters is repeated,

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Algebra (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
PCT/JP2021/016630 2021-04-26 2021-04-26 Learning device, learning method, and learning program Ceased WO2022230019A1 (ja)

Priority Applications (4)

Application Number Priority Date Filing Date Title
EP21939182.8A EP4332845A4 (en) 2021-04-26 2021-04-26 LEARNING DEVICE, LEARNING METHOD AND LEARNING PROGRAM
JP2023516874A JP7529144B2 (ja) 2021-04-26 2021-04-26 Learning device, learning method, and learning program
US18/287,546 US20240211767A1 (en) 2021-04-26 2021-04-26 Learning device, learning method, and learning program
PCT/JP2021/016630 WO2022230019A1 (ja) 2021-04-26 2021-04-26 Learning device, learning method, and learning program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2021/016630 WO2022230019A1 (ja) 2021-04-26 2021-04-26 Learning device, learning method, and learning program

Publications (1)

Publication Number Publication Date
WO2022230019A1 true WO2022230019A1 (ja) 2022-11-03

Family

ID=83846792

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2021/016630 Ceased WO2022230019A1 (ja) 2021-04-26 2021-04-26 Learning device, learning method, and learning program

Country Status (4)

Country Link
US (1) US20240211767A1
EP (1) EP4332845A4
JP (1) JP7529144B2
WO (1) WO2022230019A1

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
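JP7815840B2 (ja) * 2022-02-22 2026-02-18 Fujitsu Limited Function generation program, function generation device, control device, and function generation method
CN119045292A (zh) * 2024-10-31 2024-11-29 Zhejiang University Inverse lithography method based on a multilayer perceptron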

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
JP7315007B2 (ja) * 2019-08-29 2023-07-26 NEC Corporation Learning device, learning method, and learning program

Non-Patent Citations (4)

Title
B. D. ZIEBART; A. MAAS; J. A. BAGNELL; A. K. DEY: "Maximum entropy inverse reinforcement learning", AAAI, AAAI '08, 2008
CHELSEA FINN; SERGEY LEVINE; PIETER ABBEEL: "Guided Cost Learning: Deep Inverse Optimal Control via Policy Optimization", PROCEEDINGS OF THE 33RD INTERNATIONAL CONFERENCE ON MACHINE LEARNING, PMLR, vol. 48, 2016, pages 49 - 58
HUANG XIAO; MICHAEL HERMAN; JOERG WAGNER; SEBASTIAN ZIESCHE; JALAL ETESAMI; THAI HONG LINH: "Wasserstein Adversarial Imitation Learning", ARXIV.ORG, 19 June 2019 (2019-06-19), pages 1 - 18, XP081377972 *
See also references of EP4332845A4

Also Published As

Publication number Publication date
JPWO2022230019A1 2022-11-03
EP4332845A1 (en) 2024-03-06
EP4332845A4 (en) 2024-06-12
JP7529144B2 (ja) 2024-08-06
US20240211767A1 (en) 2024-06-27

Similar Documents

Publication Publication Date Title
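CN108089921B (zh) Server for a cloud big data computing architecture and method for optimizing its computing resources
US11562223B2 (en) Deep reinforcement learning for workflow optimization
US20190324822A1 (en) Deep Reinforcement Learning for Workflow Optimization Using Provenance-Based Simulation
JP7315007B2 (ja) Learning device, learning method, and learning program
CN110795246A (zh) Method and device for predicting resource utilization
Hellwig et al. Evolution under strong noise: A self-adaptive evolution strategy can reach the lower performance bound - the pcCMSA-ES
CN113641445B (zh) Cloud resource adaptive configuration method and system based on a deep deterministic policy
JP7687041B2 (ja) Satellite observation planning system, satellite observation planning method, and satellite observation planning program
US20190332933A1 (en) Optimization of model generation in deep learning neural networks using smarter gradient descent calibration
US11182689B2 (en) Adaptive learning rate schedule in distributed stochastic gradient descent
WO2022230019A1 (ja) Learning device, learning method, and learning program
Zhai et al. Deep q-learning with prioritized sampling
WO2019235551A1 (en) Total stochastic gradient estimation method, device and computer program
CN119739534A (zh) Method and system for dynamic optimization of operator chains driven by a computing engine model
WO2022230038A1 (ja) Learning device, learning method, and learning program
CN115688893B (zh) Memory scheduling method and apparatus, electronic device, and storage medium
CN117396850A (zh) System, method, and medium for elastically allocating resources to deep learning jobs
CN114675975A (zh) Job scheduling method, apparatus, and device based on reinforcement learning
JP2020126511A (ja) Optimization device, method, and program
KR20200109917A (ko) Method for predicting the training speed of a GPU-based distributed deep learning model, and recording medium
JP7537517B2 (ja) Learning device, learning method, and learning program
KR100935361B1 (ko) Weight-based multi-queue load-balancing parallel processing system and method
Decuyper et al. Tuning nonlinear state-space models using unconstrained multiple shooting
JP2005049922A (ja) Job execution plan evaluation system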
Maranjyan et al. ATA: Adaptive task allocation for efficient resource management in distributed machine learning

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 21939182; Country of ref document: EP; Kind code of ref document: A1)
WWE WIPO information: entry into national phase (Ref document number: 2023516874; Country of ref document: JP)
WWE WIPO information: entry into national phase (Ref document number: 18287546; Country of ref document: US)
WWE WIPO information: entry into national phase (Ref document number: 2021939182; Country of ref document: EP)
NENP Non-entry into the national phase (Ref country code: DE)
ENP Entry into the national phase (Ref document number: 2021939182; Country of ref document: EP; Effective date: 20231127)