WO2023144961A1 - Information processing device, information processing method, and program - Google Patents

Information processing device, information processing method, and program Download PDF

Info

Publication number
WO2023144961A1
Authority
WO
WIPO (PCT)
Prior art keywords
information processing
reward function
data
parameter
reinforcement learning
Prior art date
Application number
PCT/JP2022/003100
Other languages
French (fr)
Japanese (ja)
Inventor
力 江藤
Original Assignee
日本電気株式会社 (NEC Corporation)
Priority date
Filing date
Publication date
Application filed by 日本電気株式会社 (NEC Corporation)
Priority to PCT/JP2022/003100 priority Critical patent/WO2023144961A1/en
Publication of WO2023144961A1 publication Critical patent/WO2023144961A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning

Definitions

  • the present invention relates to an information processing device, an information processing method, and a program.
  • In reinforcement learning (RL), a policy is learned based on a reward function. Inverse reinforcement learning (IRL) is known as a method for generating this reward function.
  • Non-Patent Document 1 describes maximum entropy inverse reinforcement learning (ME-IRL: Maximum Entropy-IRL), which is one type of inverse reinforcement learning.
  • ME-IRL uses the maximum entropy principle to specify the trajectory distribution and learns the reward function by approximating the true distribution (i.e., maximum likelihood estimation).
  • Non-Patent Document 2 describes GCL (Guided Cost Learning), one method of inverse reinforcement learning that improves on maximum entropy inverse reinforcement learning.
  • In GCL, importance sampling is used to update the weights of the reward function.
  • However, both the techniques described in Non-Patent Document 1 and Non-Patent Document 2 have room for improvement in terms of generating an appropriate reward function.
  • One aspect of the present invention has been made in view of the above problems, and an example of its purpose is to provide a technique capable of generating a more appropriate reward function.
  • An information processing apparatus according to one aspect of the present invention includes acquisition means for acquiring reference data, and determination means for determining a reward function including a weighting factor and a feature amount parameter by inverse reinforcement learning that uses the reference data and that includes the feature amount parameter as an operation target.
  • An information processing apparatus according to one aspect of the present invention includes acquisition means for acquiring target data, and generation means for generating output data corresponding to the target data by solving an optimization problem using the target data acquired by the acquisition means and a reward function that includes a weighting factor and a feature amount parameter and that is determined by inverse reinforcement learning including the feature amount parameter as an operation target.
  • An information processing method according to one aspect of the present invention is an information processing method using an information processing apparatus, and includes acquiring reference data, and determining a reward function including a weighting factor and a feature amount parameter by inverse reinforcement learning that uses the reference data and that includes the feature amount parameter as an operation target.
  • An information processing method according to one aspect of the present invention is an information processing method using an information processing apparatus, and includes acquiring target data, and generating output data according to the target data by solving an optimization problem using the acquired target data and a reward function that includes a weighting factor and a feature amount parameter and that is determined by inverse reinforcement learning including the feature amount parameter as an operation target.
  • A program according to one aspect of the present invention causes a computer to function as an information processing apparatus including acquisition means for acquiring reference data, and determination means for determining a reward function including a weighting factor and a feature amount parameter by inverse reinforcement learning that uses the reference data and that includes the feature amount parameter as an operation target.
  • A program according to one aspect of the present invention causes a computer to function as an information processing apparatus including acquisition means for acquiring target data, and generation means for generating output data according to the target data by solving an optimization problem using the target data acquired by the acquisition means and a reward function that includes a weighting factor and a feature amount parameter and that is determined by inverse reinforcement learning including the feature amount parameter as an operation target.
  • According to one aspect of the present invention, a more appropriate reward function can be generated.
  • FIG. 1 is a block diagram showing the configuration of an information processing device according to exemplary Embodiment 1 of the present invention.
  • FIG. 2 is a flow diagram showing the flow of an information processing method according to exemplary Embodiment 1 of the present invention.
  • FIG. 3 is a block diagram showing the configuration of an information processing device according to exemplary Embodiment 1 of the present invention.
  • FIG. 4 is a flow diagram showing the flow of an information processing method according to exemplary Embodiment 1 of the present invention.
  • FIG. 5 is a block diagram showing the configuration of an information processing apparatus according to exemplary Embodiment 2 of the present invention.
  • FIG. 6 is a flow diagram showing the flow of an information processing method according to exemplary Embodiment 2 of the present invention.
  • FIG. 7 is a diagram showing a display example generated by a display control unit according to exemplary Embodiment 2 of the present invention.
  • FIG. 8 is a block diagram showing the configuration of an information processing apparatus according to exemplary Embodiment 3 of the present invention.
  • FIG. 9 is a diagram showing a display example generated by a display control unit according to exemplary Embodiment 3 of the present invention.
  • FIG. 10 is a diagram showing a second display example by the information processing apparatus according to exemplary Embodiment 3 of the present invention.
  • FIG. 11 is a diagram showing an application example of an information processing apparatus according to exemplary Embodiment 3 of the present invention.
  • FIG. 12 is a diagram showing an example of a computer that implements an information processing apparatus according to each exemplary embodiment of the present invention.
  • the information processing device 1 is a device that determines a reward function including weighting factors and feature amount parameters by inverse reinforcement learning using reference data.
  • reference data refers to data referred to by inverse reinforcement learning, and includes, as an example, a set of state data and action data.
  • the reference data may include state data representing the state of a certain system, and action data representing actions taken by a specific expert in that state.
  • As an example, the reference data τ can be represented as τ = ((s_1, a_1), (s_2, a_2), ..., (s_N, a_N)).
  • N is any natural number.
  • the reference data can include one or more sets of state data and action data, as an example.
  • some or all of the data included in the reference data are also called explanatory variables, which are arguments of the reward function.
  • The action data is not limited to data indicating actions taken by a specific expert, and may be data indicating actions taken by any subject who executes the actions associated with the action data a_i.
  • state data may be simply referred to as state
  • action data may simply be referred to as action
  • inverse reinforcement learning refers to learning for determining a reward function.
  • the reference data is referred to, and the reward function is determined by updating the feature parameter included in the reward function as an operation target.
  • the reference data may be referred to and the weighting factor included in the reward function may be updated as an operation target.
  • the reward function is, as an example, a function for evaluating the value of each of various actions.
  • the reward function includes a weighting factor and a feature amount parameter as parameters.
  • a weighting factor is, for example, a weight by which each of one or more feature quantities included in the reward function is multiplied.
  • a feature amount parameter is, for example, a parameter that characterizes one or more feature amounts included in a reward function.
  • x_1, x_2, and x_3 are variables each of which can correspond to either the state data (s_i) or the action data (a_i).
  • the explanatory variable itself may constitute the feature amount, or the function of the explanatory variable may constitute the feature amount.
  • The reward function has an inverse relationship to the cost function: the smaller the cost, the larger the reward.
  • The cost function includes one or more cost terms, each including a feature amount represented using explanatory variables and a weighting factor representing the weight of that feature amount. At least some of the one or more cost terms include a feature amount parameter that, together with the explanatory variables, characterizes the cost term.
  • As an example, the reward function can be expressed as Reward = θ^T f, where T represents the transpose of the vector.
  • θ is sometimes called a weighting coefficient vector or simply a parameter, and f is sometimes called a feature amount vector.
  • As an example, the information processing device 1 determines the reward function Reward by updating the feature amount parameter included in the reward function with reference to the reference data τ.
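  • To make the structure of such a reward function concrete, the following is a minimal Python sketch (not taken from the publication) of a reward of the form Reward = θ^T f, in which one of the feature amounts depends on a feature amount parameter; the specific feature functions and values are hypothetical.

```python
import numpy as np

def features(x, phi):
    """Feature amount vector f(x; phi).

    x   : explanatory variables (state/action data), here three values
    phi : feature amount parameter characterizing one of the feature amounts
          (hypothetical example; the actual features are not given here)
    """
    x1, x2, x3 = x
    return np.array([
        x1,                      # the explanatory variable itself as a feature amount
        np.tanh(phi * x2) * x3,  # a feature amount characterized by the parameter phi
    ])

def reward(x, theta, phi):
    """Reward = theta^T f(x; phi); the smaller the cost, the larger the reward."""
    return theta @ features(x, phi)

theta = np.array([0.5, 1.5])   # weighting coefficient vector
phi = 0.8                      # feature amount parameter
print(reward(np.array([1.0, 2.0, 0.5]), theta, phi))
```

  • In the inverse reinforcement learning described here, both the weighting coefficient vector and the feature amount parameter would be treated as operation targets and updated from the reference data.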
  • FIG. 1 is a block diagram showing the configuration of an information processing device 1 according to this exemplary embodiment.
  • the information processing device 1 includes an acquisition unit 11 and a determination unit 12 .
  • Acquisition unit 11 and determination unit 12 are configured to implement acquisition means and determination means, respectively, in this exemplary embodiment.
  • the acquisition unit 11 acquires reference data.
  • the acquisition unit 11 supplies the acquired reference data to the determination unit 12 .
  • the determining unit 12 determines the reward function including the weighting factor and the feature amount parameter by inverse reinforcement learning using the reference data and including the feature amount parameter as an operation target.
  • As described above, the information processing device 1 according to this exemplary embodiment adopts a configuration including the acquisition unit 11 that acquires the reference data, and the determination unit 12 that determines the reward function including the weighting coefficient and the feature amount parameter by inverse reinforcement learning that uses the reference data and that includes the feature amount parameter as an operation target.
  • Since the operation target includes the feature amount parameter that determines the feature amount, the result of a prediction model or the like can be adopted as a feature amount. Therefore, according to the information processing device 1 according to this exemplary embodiment, a more appropriate reward function can be generated.
  • FIG. 2 is a flow diagram showing the flow of the information processing method S1 according to this exemplary embodiment.
  • Step S11 the acquisition unit 11 acquires reference data.
  • the acquisition unit 11 supplies the acquired reference data to the determination unit 12 .
  • In step S12, the determination unit 12 determines the reward function including the weighting factor and the feature amount parameter by inverse reinforcement learning that uses the reference data supplied from the acquisition unit 11 and that includes the feature amount parameter as an operation target.
  • As described above, in the information processing method S1 according to this exemplary embodiment, the acquisition unit 11 acquires the reference data in step S11, and in step S12 the determination unit 12 determines the reward function including the weighting coefficient and the feature amount parameter by inverse reinforcement learning that uses the reference data supplied from the acquisition unit 11 and that includes the feature amount parameter as an operation target. Therefore, according to the information processing method S1 according to this exemplary embodiment, the same effects as those of the information processing apparatus 1 can be obtained.
  • the information processing device 2 is a device that generates output data according to target data by solving an optimization problem using target data and a reward function determined by inverse reinforcement learning.
  • the reward function and inverse reinforcement learning are as described above.
  • As an example, the target data includes at least part of state data representing the state of a certain system and action data indicating actions taken by a specific expert in that state.
  • solving the optimization problem means maximizing the reward function by manipulating the data to be manipulated with the target data as input.
  • As an example, the target data TD can be represented as (s_1, s_2, ...).
  • The information processing apparatus 2 according to this exemplary embodiment maximizes a reward function having the target data TD and the manipulated data MD as explanatory variables by manipulating the manipulated data MD.
  • the information processing apparatus 2 solves the optimization problem with the target data as input, and generates the data of the operation target that maximizes the reward function as the output data.
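  • As a hedged illustration of this step (the publication does not specify a concrete solver), the sketch below fixes the target data TD, searches over candidate values of the manipulated data MD, and returns the candidate that maximizes a given reward function; the reward form and search grid are assumptions.

```python
import numpy as np

def reward(td, md, theta, phi):
    # Hypothetical reward with the target data td fixed and the manipulated data md free.
    return theta[0] * td[0] + theta[1] * np.tanh(phi * td[1]) * md

def generate_output(td, theta, phi, candidates):
    # Return the manipulated value that maximizes the reward for the given target data.
    values = [reward(td, md, theta, phi) for md in candidates]
    return candidates[int(np.argmax(values))]

td = np.array([1.0, 2.0])                 # target data (fixed input)
candidates = np.linspace(-1.0, 1.0, 201)  # search space for the manipulated data
print("output data:", generate_output(td, np.array([0.5, 1.5]), 0.8, candidates))
```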
  • FIG. 3 is a block diagram showing the configuration of the information processing device 2 according to this exemplary embodiment.
  • the information processing device 2 includes an acquisition unit 11 and a generation unit 22.
  • the acquisition unit 11 and the generation unit 22 are configured to implement acquisition means and generation means, respectively, in this exemplary embodiment.
  • the acquisition unit 11 acquires target data.
  • the acquisition unit 11 supplies the acquired target data to the generation unit 22 .
  • The generation unit 22 generates output data according to the target data by solving an optimization problem using the target data acquired by the acquisition unit 11 and a reward function that includes a weighting factor and a feature amount parameter and that is determined by inverse reinforcement learning including the feature amount parameter as an operation target.
  • As this reward function, the reward function determined by the inverse reinforcement learning described above can be used.
  • As described above, the information processing device 2 according to this exemplary embodiment adopts a configuration including the acquisition unit 11 that acquires the target data, and the generation unit 22 that generates output data according to the target data by solving an optimization problem using the target data acquired by the acquisition unit 11 and a reward function that includes the weighting factor and the feature amount parameter and that is determined by inverse reinforcement learning including the feature amount parameter as an operation target.
  • FIG. 4 is a flow diagram showing the flow of the information processing method S2 according to this exemplary embodiment.
  • step S21 the acquisition unit 11 acquires target data.
  • In step S22, the generation unit 22 generates output data according to the target data by solving an optimization problem using the target data acquired by the acquisition unit 11 and a reward function that includes a weighting factor and a feature amount parameter and that is determined by inverse reinforcement learning including the feature amount parameter as an operation target.
  • the reward function determined in step S12 included in the information processing method S1 described above can be used.
  • As described above, in the information processing method S2 according to this exemplary embodiment, the acquisition unit 11 acquires the target data in step S21, and in step S22 the generation unit 22 generates output data according to the target data by solving an optimization problem using the target data acquired by the acquisition unit 11 and the reward function that includes the weighting coefficient and the feature amount parameter and that is determined by inverse reinforcement learning including the feature amount parameter as an operation target. Therefore, according to the information processing method S2 according to this exemplary embodiment, the same effects as those of the information processing apparatus 2 can be obtained.
  • the information processing device 3 is a device that determines a reward function including a weighting factor WF and a feature amount parameter FP by inverse reinforcement learning using reference data RD.
  • the information processing device 3 also displays information corresponding to at least one of the determined weighting factor WF, feature parameter FP, and reward function.
  • the reference data, inverse reinforcement learning, reward function, weighting factor, and feature parameters are as described above.
  • FIG. 5 is a block diagram showing the configuration of the information processing device 3 according to this exemplary embodiment.
  • the information processing device 3 includes a storage unit 31, an input unit 32, an output unit 33, a communication unit 34, and a control unit 35.
  • the storage unit 31 is a memory that stores various data referred to by the control unit 35, which will be described later. Examples of data stored in the storage unit 31 include reference data RD, weighting factors WF, and feature amount parameters FP. As an example of the reference data RD, expert decision-making history data (trajectory) received by the input unit 32, which will be described later, may be stored. Further, the storage unit 31 may store candidates for the feature amount of the reward function that the determination unit 12 uses for learning. However, feature amount candidates do not necessarily have to be feature amounts used in the reward function.
  • the storage unit 31 may store a mathematical optimization solver for realizing the processing by the determination unit 12. Note that the content of the mathematical optimization solver is arbitrary, and may be determined according to the execution environment and apparatus.
  • the input unit 32 accepts various data input to the information processing device 3 .
  • The input unit 32 may, for example, receive input of the expert's decision-making history data (specifically, pairs of states and actions) described above. Further, the input unit 32 may receive input of an initial state and constraint conditions used when the inverse reinforcement learning, which will be described later, is performed.
  • the input unit 32 is configured with input devices such as a keyboard, mouse, and touch panel.
  • the input unit 32 may also function as an interface for acquiring data from other connected devices.
  • the input unit 32 supplies data acquired from another device to the control unit 35, which will be described later.
  • the output unit 33 is configured to output the calculation result by the information processing device 3 .
  • the output unit 33 includes a display panel (display unit), and displays the calculation result on the display panel.
  • the output unit 33 may function as an interface that outputs data to other connected devices. In this configuration, the output unit 33 outputs data supplied from the control unit 35, which will be described later, to other connected devices.
  • The communication unit 34 is a communication module that communicates with other devices via a network (not shown). As an example, the communication unit 34 outputs data supplied from the control unit 35, which will be described later, to another device via the network, and supplies data acquired from another device via the network to the control unit 35.
  • The specific configuration of the network does not limit this embodiment; as an example, a wireless LAN (Local Area Network), a wired LAN, a WAN (Wide Area Network), a public line network, a mobile data communication network, or a combination of these networks can be used.
  • the calculation result is displayed via at least one of the output unit 33 and the communication unit 34.
  • control unit 35 controls each unit included in the information processing device 3 .
  • the control unit 35 stores data acquired from the input unit 32 or the communication unit 34 in the storage unit 31, and supplies data stored in the storage unit 31 to the output unit 33 or the communication unit 34. .
  • the control unit 35 also functions as the acquisition unit 11, the determination unit 12, and the display control unit 13, as shown in FIG.
  • the acquisition unit 11, the determination unit 12, and the display control unit 13 are configured to implement acquisition means, determination means, and first display means, respectively, in this exemplary embodiment.
  • the acquisition unit 11 acquires the reference data RD via the input unit 32 or the communication unit 34.
  • the acquisition unit 11 stores the acquired reference data RD in the storage unit 31 .
  • The determination unit 12 obtains the reference data RD stored in the storage unit 31, and determines the reward function including the weighting factor WF and the feature amount parameter FP by inverse reinforcement learning that uses the reference data RD and that includes the feature amount parameter FP as an operation target.
  • the determination unit 12 stores the determined feature parameter FP in the storage unit 31 .
  • the operation target in the inverse reinforcement learning by the determination unit 12 may include a weighting factor WF included in at least one of one or a plurality of cost terms.
  • the determining unit 12 stores the post-operation weighting factor WF in the storage unit 31 .
  • the display control unit 13 displays, via the output unit 33, information corresponding to at least one of the weighting factor WF, the feature amount parameter FP, and the reward function.
  • The inverse reinforcement learning by the determination unit 12 is based on maximum entropy inverse reinforcement learning (ME-IRL).
  • θ is a weighting coefficient vector whose components are the weighting factors WF.
  • f(s, a) is a feature amount vector, which can include multiple components corresponding to the respective feature amounts.
  • The total number of weighting factors WF included in the weighting coefficient vector θ is determined according to the number of components of the feature amount vector f(s, a).
  • The trajectory τ is expressed by Equation A1.
  • Equation A2 is a probability model representing the distribution p_θ(τ) of the trajectory.
  • θ^T f in Equation A2 represents the reward function (see Equation A3).
  • Z represents the sum of rewards for all trajectories (see Equation A4).
  • In Equation A5, the coefficient multiplying the gradient is the step size, and L(θ) is the distance measure between distributions used in ME-IRL.
  • The second term in Equation A6 is the sum of rewards for all trajectories.
  • ME-IRL assumes that the value of the second term can be strictly calculated. However, in reality, there is also the problem that it is difficult to calculate the total sum of rewards for all trajectories.
  • In contrast, in the maximum entropy inverse reinforcement learning according to this exemplary embodiment, the reward function includes the feature amount parameter FP that characterizes its terms.
  • In the maximum entropy inverse reinforcement learning according to this exemplary embodiment, not only θ described above but also the feature amount parameter FP is estimated. Therefore, the maximum entropy inverse reinforcement learning according to this exemplary embodiment can also be referred to as improved maximum entropy inverse reinforcement learning.
  • the “improved maximum entropy inverse reinforcement learning” is hereinafter simply referred to as “maximum entropy inverse reinforcement learning (ME-IRL)”.
  • the determination unit 12 sets the feature amount of the reward function from the reference data including the state and action.
  • The determination unit 12 may be configured to set the feature amounts of the reward function so that the gradient of the tangent line is finite throughout the function, so that the Wasserstein distance can be used as a distance measure between distributions in the inverse reinforcement learning process.
  • the determination unit 12 may set the feature amount of the reward function so as to satisfy the Lipschitz continuity condition, for example.
  • the determination unit 12 may set the feature amount so that the reward function becomes a linear function.
  • The reward function of Equation 4 illustrated below has an infinite gradient at a = 0, and can therefore be said to be an inappropriate reward function in the present disclosure.
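  • Equation 4 itself is not reproduced here, but the following illustrative pair of functions (an assumption, not the publication's Equation 4) shows the distinction being drawn: a square-root feature has an unbounded gradient at a = 0, whereas a linear feature satisfies the Lipschitz continuity condition.

```latex
% Illustrative only: a feature whose gradient diverges at a = 0
R_1(a) = \theta \sqrt{|a|}, \qquad
\left|\frac{dR_1}{da}\right| = \frac{|\theta|}{2\sqrt{|a|}} \to \infty \quad (a \to 0),
% whereas a linear feature is Lipschitz continuous
R_2(a) = \theta a, \qquad |R_2(a) - R_2(a')| \le |\theta|\,|a - a'| .
```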
  • The determination unit 12 may, for example, use a reward function in which the feature amounts are set according to the user's instruction, or may otherwise obtain a reward function satisfying the above condition.
  • the determination unit 12 may be configured to initialize the weighting factor WF.
  • the method by which the determination unit 12 initializes the weighting factor WF is not particularly limited, and the weighting factor WF may be initialized based on an arbitrary method predetermined according to the user or the like.
  • Next, the determination unit 12 derives an estimate of the expert's trajectory τ (denoted below with a superscript hat). Specifically, the determination unit 12 uses the Wasserstein distance as a distance measure between distributions and estimates the expert's trajectory by performing mathematical optimization that minimizes the Wasserstein distance.
  • the Wasserstein distance is defined by Equation 5 exemplified below. That is, the Wasserstein distance represents the distance between the probability distribution of the expert's trajectory and the probability distribution of the trajectory determined based on the parameters of the reward function.
  • the reward function ⁇ T f ⁇ must be a function that satisfies the Lipschitz continuity condition due to the restriction of the Wasserstein distance.
  • the determining unit 12 sets the feature amount of the reward function so as to satisfy the Lipschitz continuity condition, so it is possible to use the Wasserstein distance as exemplified below.
  • The Wasserstein distance defined by Equation 5 exemplified above takes a value of 0 or less, and increasing this value corresponds to bringing the distributions closer together. In the second term of Equation 5, τ^(n) represents the n-th trajectory optimized with the parameter θ. The second term of Equation 5 is a term that can be calculated even in a combinatorial optimization problem. Therefore, by using the Wasserstein distance exemplified in Equation 5 as a distance measure between distributions, inverse reinforcement learning that can be applied to mathematical optimization problems such as combinatorial optimization problems can be performed.
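  • The exact form of Equation 5 is not reproduced here; the following sketch is a sample-based surrogate consistent with the description above (a non-positive value that increases as the expert trajectories become as rewarding, under the current parameters, as the optimized trajectories). The trajectory representation and feature amounts are hypothetical.

```python
import numpy as np

def trajectory_reward(traj, theta, phi):
    # Sum of per-step rewards theta^T f(x; phi) along one trajectory (hypothetical features).
    return sum(theta @ np.array([x[0], np.tanh(phi * x[1])]) for x in traj)

def wasserstein_surrogate(expert_trajs, optimized_trajs, theta, phi):
    """Mean reward of the expert trajectories minus the mean reward of trajectories
    optimized with the current parameters: non-positive, and closer to 0 when the
    two trajectory distributions are closer."""
    expert = np.mean([trajectory_reward(t, theta, phi) for t in expert_trajs])
    optimized = np.mean([trajectory_reward(t, theta, phi) for t in optimized_trajs])
    return expert - optimized

theta, phi = np.array([0.5, 1.5]), 0.8
expert = [[(1.0, 0.2), (0.8, 0.1)]]       # one expert trajectory of (x1, x2) steps
optimized = [[(1.2, 0.3), (0.9, 0.2)]]    # one trajectory optimized with (theta, phi)
print(wasserstein_surrogate(expert, optimized, theta, phi))
```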
  • Next, based on the estimated expert trajectory, the determination unit 12 updates the parameter θ of the reward function and the feature amount parameter FP so as to maximize the distance measure between the distributions.
  • Here, the trajectory τ follows the Boltzmann distribution according to the maximum entropy principle. Therefore, similarly to ME-IRL, the determination unit 12 updates the parameter θ and the feature amount parameter FP based on the estimated expert trajectory.
  • As described above, the second term in Equation A6 is the sum of rewards for all trajectories, and ME-IRL assumes that this value can be strictly calculated. In reality, however, it is difficult to calculate the sum of rewards for all trajectories.
  • Therefore, the determination unit 12 sets a lower bound of the logarithmic likelihood represented using the reward function, and updates the operation targets (the parameter θ and the feature amount parameter FP) so as to maximize that lower bound.
  • Specifically, the determination unit 12 sets the lower bound of L(θ) as follows; hereinafter, this is also called the lower bound of the log-likelihood.
  • In Equation 6, which expresses the lower bound of the log-likelihood, the second term is the maximum reward value for the current parameter θ, and the third term is the logarithmic value of the number of possible trajectories (N_τ).
  • That is, the determination unit 12 derives the lower bound of the log-likelihood by subtracting the maximum reward value for the current parameter θ and the logarithmic value of the number of possible trajectories (N_τ) from the first term of Equation 6.
  • The determination unit 12 may transform the derived lower bound of the log-likelihood of ME-IRL into a formula in which an entropy regularization term is subtracted from the Wasserstein distance.
  • The formula obtained by decomposing the lower bound of the log-likelihood of ME-IRL into the Wasserstein distance and the entropy regularization term is expressed as Equation 7 illustrated below.
  • The expression in the first parenthesis of Equation 7 represents the Wasserstein distance. That is, the Wasserstein distance represents the distance between the probability distribution of the expert's trajectory and the probability distribution of the trajectory determined based on the parameters of the reward function.
  • the reward function ⁇ T f ⁇ must be a function that satisfies the Lipschitz continuity condition due to the restriction of the Wasserstein distance.
  • Since the determination unit 12 sets the feature amounts of the reward function so as to satisfy the Lipschitz continuity condition, it can use the Wasserstein distance.
  • The expression in the second parenthesis of Equation 7 represents an entropy regularization term that contributes to increasing the logarithmic likelihood of the Boltzmann distribution derived from the maximum entropy principle.
  • Within this parenthesis, the first term represents the maximum reward value for the current parameter θ, and the second term represents the mean value of the reward for the current parameter θ.
  • As described above, the inverse reinforcement learning by the determination unit 12 includes update processing that updates the operation targets (the parameter θ and the feature amount parameter FP) so as to maximize the lower bound of the logarithmic likelihood represented using the reward function.
  • The lower bound of the logarithmic likelihood used in the update processing includes the Wasserstein distance, which represents the distance between the reference probability distribution and the probability distribution represented using the reward function, and a regularization term representing the difference between the maximum value and the average value of the reward function.
  • In Equation 7, this regularization term functions as an entropy regularization term.
  • To maximize the lower bound, the value of this term should be small, which corresponds to a small difference between the maximum reward value and the mean value. A smaller difference between the maximum reward value and the average value indicates smaller variability of the reward across trajectories.
  • Since a smaller difference between the maximum reward value and the average value means an increase in entropy, entropy regularization works and contributes to entropy maximization. This contributes to the maximization of the log-likelihood of the Boltzmann distribution and, as a result, contributes to resolving ambiguity in inverse reinforcement learning.
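  • Putting the two parts together, a non-authoritative numerical sketch of the lower bound (the Wasserstein term minus the entropy regularization term given by the maximum reward minus the mean reward) could look as follows; constants such as the logarithm of the number of trajectories in Equation 6 are omitted here.

```python
import numpy as np

def log_likelihood_lower_bound(expert_rewards, optimized_rewards):
    """expert_rewards    : rewards of the (estimated) expert trajectories
    optimized_rewards : rewards of trajectories optimized with the current parameters"""
    wasserstein_term = np.mean(expert_rewards) - np.mean(optimized_rewards)
    regularization_term = np.max(optimized_rewards) - np.mean(optimized_rewards)
    return wasserstein_term - regularization_term

# Purely illustrative reward values
print(log_likelihood_lower_bound(np.array([1.0, 1.2, 0.9]), np.array([1.3, 1.1, 1.4])))
```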
  • For example, the determination unit 12 fixes the estimated trajectory and updates the parameter θ and the feature amount parameter FP by the gradient ascent method based on Equation 7 shown above.
  • However, the normal gradient ascent method may not converge.
  • This is because the feature amount (f^max) of the trajectory with the maximum reward value does not match the average value of the feature amounts (f^(n)) of the other trajectories (i.e., the difference between the two does not become 0). Therefore, with the normal gradient ascent method, the logarithmic likelihood oscillates and does not converge, which makes the learning unstable and makes it difficult to appropriately determine convergence (see Equation 8 below for updating the parameter θ).
  • Therefore, the determination unit 12 may update the parameter θ and the feature amount parameter FP so as to gradually attenuate the portion that contributes to entropy regularization (that is, the portion corresponding to the entropy regularization term).
  • the lower bound of the log-likelihood may include a damping factor that is multiplied by the regularization term to dampen the contribution of the regularization term as the update process is repeated.
  • To this end, the determination unit 12 defines an update formula in which a damping coefficient indicating the degree of damping is applied to the portion that contributes to entropy regularization.
  • Specifically, when Equation 7 above is differentiated with respect to θ, the result contains a portion corresponding to the term indicating the Wasserstein distance (that is, a portion contributing to processing for increasing the Wasserstein distance) and a portion corresponding to the entropy regularization term; the determination unit 12 defines Equation 9 exemplified below, in which the damping coefficient is applied to the portion corresponding to the entropy regularization term.
  • The damping coefficient is predefined according to how the portion corresponding to the entropy regularization term is to be damped. For example, for smooth attenuation, the damping coefficient is defined as in Equation 10 exemplified below.
  • In Equation 10, the first constant is set to 1 and the second constant is set to 0 or greater, and t indicates the number of iterations. As a result, the damping coefficient functions as a coefficient that reduces the portion corresponding to the entropy regularization term as the number of iterations t increases.
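  • Equation 10 is not reproduced above; the sketch below shows one smooth decay schedule with the stated properties (the first constant set to 1, the second to 0 or greater, and a value that shrinks as the number of iterations t grows). The functional form is an assumption.

```python
def damping_coefficient(t, c1=1.0, c2=0.1):
    # Smoothly decaying factor multiplied onto the entropy regularization part.
    # c1 corresponds to the constant set to 1, c2 to the constant set to 0 or greater.
    return c1 / (1.0 + c2 * t)

# With c2 = 0 the regularization part is never attenuated; with c2 > 0 its
# influence decreases as the number of iterations t increases.
print([round(damping_coefficient(t), 3) for t in range(0, 50, 10)])
```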
  • The determination unit 12 may update the parameter θ and the feature amount parameter FP without attenuating the portion corresponding to the entropy regularization term in the initial stage of updating, and, at the timing when the logarithmic likelihood begins to oscillate, may update the parameter θ and the feature amount parameter FP so as to reduce the influence of the portion corresponding to the entropy regularization term.
  • The determination unit 12 may determine that the logarithmic likelihood has started to oscillate when the moving average of the logarithmic likelihood becomes constant. Specifically, when the change in the moving average over a time window of the lower bound of the logarithmic likelihood (several points from the current value into the past) is small (for example, 1e-3 or less), the determination unit 12 can judge the moving average to be constant.
  • The method of determining the timing at which oscillation starts is the same as the method described above.
  • The determination unit 12 may further change the update method of the parameter θ and the feature amount parameter FP at the timing when the logarithmic likelihood starts to oscillate again after the damping coefficient has been changed as in Equation 10 shown above. Specifically, the determination unit 12 may update the parameter θ and the feature amount parameter FP using a momentum method as exemplified in Equation 11 below.
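  • Equation 11 is likewise not reproduced here; a standard momentum-method update of the operation targets, which is one plausible reading of the passage, is sketched below with hypothetical gradients.

```python
import numpy as np

def momentum_step(params, grads, velocity, lr=0.01, beta=0.9):
    # One momentum step: accumulate the gradient into a velocity term and
    # move the operation targets (theta and the feature amount parameter) along it.
    velocity = beta * velocity + grads
    params = params + lr * velocity  # gradient ascent on the lower bound
    return params, velocity

params = np.array([0.5, 1.5, 0.8])    # [theta_1, theta_2, feature amount parameter]
velocity = np.zeros_like(params)
grads = np.array([0.1, -0.05, 0.02])  # hypothetical gradient of the lower bound
params, velocity = momentum_step(params, grads, velocity)
print(params)
```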
  • the determination unit 12 repeats the trajectory estimation process and the update process of the parameter ⁇ and the feature amount parameter FP until it determines that the lower limit of the logarithmic likelihood has converged.
  • As an example, the determination unit 12 may determine that the distance measure between the distributions has converged when the absolute value of the lower bound of the logarithmic likelihood becomes smaller than a predetermined threshold.
  • When determining that the distance measure between distributions has not converged, the determination unit 12 continues the trajectory estimation processing and the update processing of the parameter θ and the feature amount parameter FP. On the other hand, when determining that the distance measure between distributions has converged, the determination unit 12 ends the trajectory estimation processing and the update processing of the parameter θ and the feature amount parameter FP.
  • Each variable in Equations 12 and 13 has the following meaning, as described in Exemplary Embodiment 1.
  • Reward: reward function
  • Cost: cost function
  • θ_1, θ_2, θ_3: weighting factors
  • x_1, x_2, x_3: explanatory variables
  • The above exemplary explanatory variables x_1, x_2, and x_3 are variables each of which can correspond to either the state data (s_i) or the action data (a_i).
  • the explanatory variable itself may constitute the feature amount, or the function of the explanatory variable may constitute the feature amount.
  • In the following, a case is described in which inverse reinforcement learning is performed with the weighting factor θ_1, the weighting factor θ_2, and the feature amount parameter as operation targets.
  • When the determination unit 12 has derived the lower bound of the logarithmic likelihood shown in Equation 9 described above, it updates the operation targets of the reward function. For example, when the reward function is given by Equation 12 below, the determination unit 12 updates the weighting factor θ_1, the weighting factor θ_2, and the feature amount parameter using Equations 14 to 17 exemplified below.
  • Here, the damping coefficient in Equations 14 to 17 is the coefficient defined by Equation 10 above, and the remaining coefficient is a parameter indicating the learning rate.
  • In this way, the determination unit 12 updates the parameter θ of the reward function and the feature amount parameter FP so as to maximize the logarithmic likelihood of the Boltzmann distribution derived from the principle of maximum entropy.
  • Inverse reinforcement learning is thus performed with not only the weighting coefficients but also the feature amount parameter as update targets, so that a more appropriate reward function can be generated.
  • FIG. 6 is a flow diagram showing the flow of the information processing method S3 according to this exemplary embodiment.
  • Step S31: The acquisition unit 11 acquires the reference data RD via the input unit 32 or the communication unit 34.
  • the acquisition unit 11 stores the acquired reference data RD in the storage unit 31 . Since the reference data RD has been described above, the description thereof is omitted here.
  • step S32 the determination unit 12 initializes the weighting factor and the feature amount parameter, which are the operation targets in the inverse reinforcement learning, among the parameters included in the reward function.
  • the determining unit 12 may use the initial values stored in the storage unit 31 to initialize the weighting coefficients and feature amount parameters that are the operation targets in the inverse reinforcement learning.
  • step S33 the determination unit 12 performs mathematical optimization to minimize the Wasserstein distance.
  • the determination unit 12 estimates the trajectory that minimizes the Wasserstein distance, which represents the distance between the probability distribution of the trajectory of the expert and the probability distribution of the trajectory determined based on the parameters of the reward function.
  • In step S34, the determination unit 12 updates the parameter θ of the reward function and the feature amount parameter FP so as to maximize the logarithmic likelihood of the Boltzmann distribution derived from the principle of maximum entropy. Since a specific example of the update processing has been described above, the description is omitted here.
  • Step S35 the determination unit 12 determines whether or not the lower limit of the logarithmic likelihood has converged. If it is determined that the lower limit of the logarithmic likelihood has converged (YES in S35), the process proceeds to step S36; otherwise (NO in S35), the process returns to step S33.
  • Step S36: When determining that the lower bound of the logarithmic likelihood has converged, the determination unit 12 outputs the determined reward function in step S36.
  • As an example, the parameters (the weighting factor WF and the feature amount parameter FP) included in the reward function output by the determination unit 12 are stored in the storage unit 31.
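  • The overall loop of steps S32 to S36 can be organized as in the following non-authoritative sketch; the lower bound and its gradient are simple placeholders so that the loop is runnable, not the publication's Equations 6 to 11.

```python
import numpy as np

def lower_bound(params, target):
    # Placeholder for the lower bound of the log-likelihood (stands in for Equation 7).
    return -float(np.sum((params - target) ** 2))

def gradient(params, target):
    return -2.0 * (params - target)

rng = np.random.default_rng(0)
params = rng.normal(size=3)            # Step S32: initialize weighting factors and feature parameter
target = np.array([0.5, 1.5, 0.8])     # stands in for statistics of the estimated expert trajectory
prev = -np.inf
for t in range(1000):
    # Step S33: the trajectory estimate is kept fixed in this toy example.
    params = params + 0.05 * gradient(params, target)   # Step S34: gradient-ascent update
    current = lower_bound(params, target)
    if abs(current - prev) < 1e-9:                       # Step S35: convergence judgment
        break
    prev = current
print("Step S36: output reward-function parameters:", params)
```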
  • the output unit 33 may include a display panel (display unit) and display various information on the display panel.
  • the information displayed on the display panel may include information corresponding to at least one of the weighting factor WF, the feature amount parameter FP, and the reward function.
  • the display content displayed by the output unit 33 is generated by the display control unit 13 as an example.
  • FIG. 7 is a diagram showing a display example generated by the display control unit 13. As shown in FIG. 7, a display screen may be generated that shows the relationship between the values of at least some of the parameters to be operated (the weighting factor WF and the feature amount parameter FP) and the number of steps. In other words, a display screen may be generated that shows how the parameter values of the operation targets change as the number of steps of the update processing increases. The example shown in FIG. 7 is a display screen showing the relationship between the number of steps, the weighting factor θ_1, and the feature amount parameter.
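  • As a rough sketch of such a display (the actual screen layout of FIG. 7 is not reproduced), the parameter values recorded at each update step could simply be plotted against the number of steps, for example:

```python
import matplotlib.pyplot as plt

# Hypothetical learning histories: the weighting factor and the feature amount
# parameter recorded at each step of the update processing.
steps = list(range(10))
weighting_factor_history = [0.1 * s for s in steps]
feature_parameter_history = [1.0 - 0.05 * s for s in steps]

plt.plot(steps, weighting_factor_history, label="weighting factor")
plt.plot(steps, feature_parameter_history, label="feature amount parameter")
plt.xlabel("number of steps")
plt.ylabel("parameter value")
plt.legend()
plt.show()
```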
  • As described above, the information processing apparatus 3 displays information corresponding to at least one of the weighting factor WF, the feature amount parameter FP, and the reward function, and can thereby suitably present to the user how the inverse reinforcement learning is progressing.
  • FIG. 8 is a block diagram showing the configuration of the information processing device 4 according to this exemplary embodiment.
  • the information processing device 4 includes a control section 45 instead of the control section 35 provided in the information processing device 3 .
  • the control unit 45 includes a generation unit 14 in addition to each configuration included in the control unit 35 .
  • the information processing device 4 includes a storage unit 41 instead of the storage unit 31 included in the information processing device 3.
  • the storage unit 41 stores target data TD.
  • the acquisition unit 11 included in the information processing device 4 further acquires target data TD in addition to various data acquired by the acquisition unit 11 according to the second exemplary embodiment.
  • the acquired target data TD is stored in the storage unit 41 described above as an example.
  • As described above, the target data TD includes at least part of state data representing the state of a certain system and action data representing actions taken by a specific expert in that state.
  • As an example, the target data TD can be represented as (s_1, s_2, ...).
  • the generation unit 14 included in the information processing device 4 maximizes a reward function having the target data TD acquired by the acquisition unit 11 and the operation target data MD as explanatory variables by operating the operation target data MD. .
  • the information processing device 4 solves the optimization problem with the target data as input, and generates as output data the data of the operation target that maximizes the reward function.
  • The generation unit 14 generates output data according to the target data by solving an optimization problem using the target data acquired by the acquisition unit 11 and a reward function that includes a weighting factor and a feature amount parameter and that is determined by inverse reinforcement learning including the feature amount parameter as an operation target.
  • the reward function determined by the determining unit 12 through the processing described in the second exemplary embodiment can be used as the above reward function.
  • In this way, the information processing device 4 adopts a configuration including the acquisition unit 11 that acquires the target data, and the generation unit 14 that generates output data using the reward function that includes the weighting factor and the feature amount parameter and that is determined by inverse reinforcement learning including the feature amount parameter as an operation target.
  • the output unit 33 may include a display panel (display unit) and display various information on the display panel.
  • the information displayed on the display panel may include at least part of the data included in the output data generated by the generator 14 .
  • FIG. 9 is a diagram showing a display example generated by the display control unit 13.
  • Specifically, FIG. 9 shows an example of a display screen generated by the display control unit 13 in a case where: the reward function is given by Equation 12 described in Exemplary Embodiment 2; the weighting coefficients θ_1, θ_2, and θ_3 and the feature amount parameter are determined by the inverse reinforcement learning by the determination unit 12; the target data TD includes x_1 and x_2; and the generation unit 14 generates output data corresponding to the target data by solving an optimization problem with x_3 as the data to be manipulated.
  • As shown in FIG. 9, the display screen generated by the display control unit 13 displays the values of the explanatory variables x_1 and x_2 included in the target data TD, and the value of the operation target data x_3 determined by the generation unit 14 according to those values, that is, the value recommended to the user.
  • the information processing device 4 can suitably present the solution of the optimization problem to the user by displaying the output data generated by the generation unit 14 as described above.
  • As an example, the acquisition unit 11 receives input of at least one of an explanatory variable, a weighting coefficient, and a feature amount parameter from the user via the input unit 32. Then, as shown in the upper part of FIG. 10, the display control unit 13 may display, in a comparable manner, the at least one of the explanatory variable, weighting coefficient, and feature amount parameter input by the user and at least one of the explanatory variable, weighting coefficient, and feature amount parameter values obtained by inverse reinforcement learning using the reference data RD for one or more experts.
  • the display control unit 13 may be configured to generate a GUI (Graphical User Interface) including operation objects that can be operated by the user and display it on the output unit 33 .
  • Such a GUI is shown in the lower left part of FIG. 10. By sliding a bar included in the GUI, the user can change the value of at least one of the explanatory variable, the weighting coefficient, and the feature amount parameter corresponding to that bar.
  • the display control unit 13 may rank at least one of explanatory variables, weighting coefficients, and feature parameters, and display the variables together with the ranking.
  • the information processing device 4 generates an operation plan regarding the water distribution plan of the water supply infrastructure.
  • the water infrastructure includes, by way of example, multiple sites such as reservoirs, distribution reservoirs, water intake facilities, water purification plants, water stations, and demand points.
  • the operation plan includes, for example, information indicating the operation pattern of pumps at each site.
  • the acquisition unit 11 acquires the target data TD and the reference data RD.
  • the acquisition unit 11 acquires the target data TD and the reference data RD from another device via the communication unit 34 .
  • the acquisition unit 11 may acquire the target data TD and the reference data RD input via the input unit 32 .
  • the acquisition unit 11 may acquire the target data TD and the reference data RD by reading the target data TD and the reference data RD from the storage unit 41 or an externally connected storage device. Details of the target data TD and the reference data RD according to this example will be described later.
  • the determination unit 12 determines a reward function used in the optimization problem for generating the operation plan OP regarding the target water distribution plan by inverse reinforcement learning with reference to the reference data RD.
  • Inverse reinforcement learning of the reward function includes, as described above, update processing with the weighting factor WF and the feature amount parameter FP as the manipulation targets.
  • the generation unit 14 solves the optimization problem using the reward function determined by inverse reinforcement learning using the reference data RD related to the reference water distribution plan and the target data TD acquired by the acquisition unit 11, Generate an operation plan OP for the target water distribution plan.
  • the operation plan OP generation processing executed by the generation unit 14 will be described later.
  • the storage unit 41 stores the target data TD and the reference data RD acquired by the acquisition unit 11 .
  • the storage unit 41 also stores the operation plan OP generated by the generation unit 14 .
  • the storage unit 41 also stores the reward function determined by the determination unit 12 and the constraint condition LC.
  • storing a reward function in the storage unit 41 means that a parameter defining the reward function is stored in the storage unit 41 .
  • the target data TD is data used by the generating unit 14 to generate the operation plan OP.
  • the target data TD includes information indicating the state of the target water supply infrastructure.
  • the target data TD includes information about pumps, distribution networks, water pipelines and/or demand points in the target water infrastructure.
  • the target data TD includes, as an example, at least one of the following data (i) to (x) in the water supply infrastructure that is the target of the operation plan.
  • the data included in the target data TD is not limited to these, and may include other data.
  • the power consumption at each base indicates the power consumption at each base such as water purification plants and water supply stations.
  • demand forecast margin indicates the extent to which supply exceeds demand;
  • Reservoir margin indicates the extent to which the designed reservoir capacity exceeds the actual reservoir capacity.
  • Water distribution loss indicates the extent to which water is not being distributed to each demand point.
  • the number of operating personnel indicates the number of operating personnel at each site.
  • the reference data RD is data used when the determination unit 12 determines the reward function.
  • the reference data RD includes information representing the state of the reference water supply infrastructure.
  • the reference water infrastructure may be the same as or different from the water infrastructure for which the operation plan is generated.
  • the reference data RD includes, as an example, information on at least one of pumps, distribution networks, water pipelines, and demand points in the reference water infrastructure.
  • the reference data RD also includes, as an example, information on at least one of pump operating patterns and personnel in the reference water supply infrastructure. Each item included in the reference data RD may be treated as state data, or may be treated as action data.
  • the reference data RD includes, as an example, at least one of the following data (i) to (x) in the reference water infrastructure.
  • the data included in the reference data RD is not limited to these, and may include other data.
  • the reference data RD includes, as an example, data indicating an operation plan created by a skilled person for reference water infrastructure. More specifically, the reference data RD includes, as an example, data represented by variables controlled based on operation rules, such as opening/closing of valves, intake of water, thresholds of pumps, and the like. Such data can also be said to be data representing the decision-making history (expert's intention) of the expert who created the operational plan for reference.
  • the operational plan OP includes, by way of example, information about the operating pattern of the pumps in the water infrastructure of interest.
  • the operation plan OP also includes, as an example, information about the personnel involved in the target water supply infrastructure.
  • the reward function includes each cost term including each variable corresponding to each item included in the reference data RD.
  • the generality of the reward function was described in the exemplary embodiment above.
  • The constraint condition LC is a constraint condition of the optimization problem that the generation unit 14 solves. The constraint conditions LC include, for example, the following (i) to (iv). Note that the constraint conditions LC are not limited to these, and may include other conditions.
  • the water storage volume of the reservoir/distribution reservoir is greater than or equal to threshold X and less than Y.
  • the determining unit 12 determines a reward function to be used in the optimization problem for generating the operation plan for the target water distribution plan by inverse reinforcement learning with reference to the reference data RD. As an example, the determining unit 12 determines the weighting factor of the cost term included in the reward function and the feature parameter that characterizes the cost term by inverse reinforcement learning using the state data and action data included in the reference data RD. .
  • An example of inverse reinforcement learning by the determination unit 12 is as described above.
  • the determination unit 12 outputs the determined reward function.
  • the determination unit 12 may output the reward function by writing it in the storage unit 41 or an external storage device, or output it to the output unit 33 .
  • the generation unit 14 generates an operation plan OP related to the target water distribution plan by solving an optimization problem using the reward function and the target data TD under the constraint LC.
  • More specifically, the generation unit 14 generates the operation plan OP relating to the target water distribution plan by solving an optimization problem using the reward function in which the target data TD acquired by the acquisition unit 11 is treated as fixed variables and the variables included in each cost term of the reward function other than the fixed variables are treated as manipulated variables.
  • the generation unit 14 also outputs the generated operation plan OP.
  • the generation unit 14 may output the operation plan OP by writing it in the storage unit 41 or an external storage device, or may output it to the output unit 33 .
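  • As a hedged sketch of this step (the actual cost terms, pump model, solver, and constraint values are not specified here), the generation unit could fix the target data TD, enumerate candidate pump on/off patterns, discard candidates that violate the constraint conditions LC, and keep the candidate that maximizes the learned reward:

```python
import itertools
import numpy as np

def learned_reward(td, pump_pattern, theta):
    # Hypothetical learned reward: trades off power consumption against the demand margin.
    power = float(np.sum(pump_pattern))            # more pumps on -> more power consumed
    margin = td["demand_margin"] + 0.5 * power     # more pumps on -> larger supply margin
    return -theta[0] * power + theta[1] * margin

def satisfies_constraints(td, pump_pattern):
    # Hypothetical constraint LC: keep the reservoir volume within [X, Y).
    volume = td["reservoir"] + 0.2 * float(np.sum(pump_pattern)) - td["demand"]
    return 3.0 <= volume < 8.0

td = {"reservoir": 5.0, "demand": 1.0, "demand_margin": 0.3}   # fixed target data TD
theta = np.array([1.0, 2.0])                                   # determined by inverse reinforcement learning

feasible = [p for p in itertools.product([0, 1], repeat=4) if satisfies_constraints(td, p)]
best = max(feasible, key=lambda p: learned_reward(td, p, theta))
print("operation plan OP (pump on/off pattern):", best)
```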
  • FIG. 11 is a diagram for explaining a specific example of setting the optimization problem according to this example.
  • The operation plan OP needs to be determined in consideration of various viewpoints, such as how much margin should be provided over the forecasted demand, how much power consumption should be suppressed, and how much weight should be given to the water level of the distribution reservoir. Setting weights for these viewpoints is difficult, because which viewpoint is emphasized, and to what degree, varies depending on the operator who operates the water supply infrastructure and is not uniquely determined. For example, there is a case where local government A, which creates a certain operation plan, places importance on power consumption, while local government B places importance on the water level of the distribution reservoir.
  • the generation unit 14 under the constraint condition LC, generates a reward function whose weighting factor and feature amount parameter of each cost term are determined by inverse reinforcement learning with reference to the reference data RD, and the target data TD Solve the optimization problem using
  • Since the weighting factors and the feature amount parameters of each cost term included in the reward function are determined by inverse reinforcement learning with reference to the reference data RD, they take values that reflect the action data included in the reference data RD, in other words, values that reflect the intention of the expert who generated the reference operation plan.
  • For example, the weighting coefficients λ1 to λ6 and the feature amount parameters included in the reward function used to generate the operation plan OP of municipality A take values that reflect the intention of the expert or the like who generated the reference operation plan used to determine that reward function.
  • Similarly, the weighting coefficients λ1 to λ6 and the feature amount parameters included in the reward function used to generate the operation plan OP of municipality B take values that reflect the intention of the expert who generated the reference operation plan used to determine that reward function. By comparing the weighting factors and feature amount parameters of municipality A with those of municipality B, it becomes easy to grasp which viewpoints each municipality emphasizes.
  • As an example, the determination unit 12 may determine the reward function by referring to reference data RD including an operation plan created by the expert a1 of municipality A, and the generation unit 14 may generate a future operation plan OP using that reward function and the target data TD of municipality A. In this case, the generation unit 14 can generate a future operation plan OP for municipality A that reflects the intention of the expert a1.
  • Alternatively, the determination unit 12 may determine the reward function by referring to reference data RD including an operation plan created by the expert a1 of municipality A, and the generation unit 14 may generate a future operation plan OP using that reward function and the target data TD of municipality B. In this case, the generation unit 14 can generate an operation plan OP for municipality B that reflects the intention of the expert a1.
  • Some or all of the functions of the information processing apparatuses 1, 2, 3, and 4 may be implemented by hardware such as integrated circuits (IC chips), or may be implemented by software.
  • In the latter case, the information processing apparatuses 1, 2, 3, and 4 are implemented by, for example, a computer that executes the instructions of a program, which is software implementing each function.
  • An example of such a computer (hereinafter referred to as computer C) is shown in FIG.
  • Computer C comprises at least one processor C1 and at least one memory C2.
  • a program P for operating the computer C as the information processing apparatuses 1, 2, 3, and 4 is recorded in the memory C2.
  • the processor C1 reads the program P from the memory C2 and executes it, thereby realizing each function of the information processing apparatuses 1, 2, 3, and 4.
  • As the processor C1, for example, a CPU (Central Processing Unit), a GPU (Graphic Processing Unit), a DSP (Digital Signal Processor), an MPU (Micro Processing Unit), an FPU (Floating point number Processing Unit), a PPU (Physics Processing Unit), a microcontroller, or a combination thereof can be used.
  • As the memory C2, for example, a flash memory, an HDD (Hard Disk Drive), an SSD (Solid State Drive), or a combination thereof can be used.
  • the computer C may further include a RAM (Random Access Memory) for expanding the program P during execution and temporarily storing various data.
  • Computer C may further include a communication interface for sending and receiving data to and from other devices.
  • Computer C may further include an input/output interface for connecting input/output devices such as a keyboard, mouse, display, and printer.
  • The program P can be recorded on a non-transitory tangible recording medium M that is readable by the computer C.
  • As the recording medium M, for example, a tape, a disk, a card, a semiconductor memory, a programmable logic circuit, or the like can be used.
  • the computer C can acquire the program P via such a recording medium M.
  • the program P can be transmitted via a transmission medium.
  • As the transmission medium, for example, a communication network or broadcast waves can be used.
  • Computer C can also acquire program P via such a transmission medium.
  • Appendix 1 An information processing apparatus comprising: acquisition means for acquiring reference data; and determination means for determining a reward function including a weighting factor and a feature amount parameter by inverse reinforcement learning that uses the reference data and includes the feature amount parameter as an operation target.
  • Appendix 2 The information processing apparatus according to appendix 1, wherein the reward function includes one or more cost terms each including a feature value represented using explanatory variables and the weighting factor representing the weight of the feature value, and at least one of the one or more cost terms includes, together with the explanatory variables, the feature amount parameter that characterizes that cost term.
  • Appendix 3 The information processing apparatus according to appendix 2, wherein an operation target in the inverse reinforcement learning by the determining means includes the weighting factor included in at least one of the one or more cost terms.
  • Appendix 4 The information processing device according to any one of appendices 1 to 3, wherein the inverse reinforcement learning by the determining means includes an update process of updating the operation target so as to maximize the lower limit of the logarithmic likelihood represented using the reward function.
  • The information processing device according to appendix 4, wherein the lower limit of the log-likelihood is expressed using the Wasserstein distance, which represents the distance between the reference probability distribution and the probability distribution represented using the reward function, and a regularization term, which represents the difference between the maximum value of the reward function and the average value of the reward function.
  • Appendix 7 The information processing apparatus according to any one of appendices 1 to 6, further comprising first display means for displaying information corresponding to at least one of the weighting coefficient, the feature quantity parameter, and the reward function.
  • Appendix 8 The information processing apparatus according to any one of appendices 1 to 7, wherein the acquisition means further acquires target data, and the information processing apparatus further comprises generating means for generating output data according to the target data by solving an optimization problem using the reward function determined by the determination means and the target data acquired by the acquisition means.
  • appendix 9 The information processing apparatus according to appendix 8, further comprising second display means for displaying the output data.
  • Appendix 10 An information processing apparatus comprising: acquisition means for acquiring target data; and generating means for generating output data according to the target data by solving an optimization problem using the target data acquired by the acquisition means and a reward function including a weighting factor and a feature amount parameter, the reward function being determined by inverse reinforcement learning including the feature amount parameter as an operation target.
  • Appendix 12 An information processing method by an information processing device, comprising: obtaining target data; and generating output data according to the target data by solving an optimization problem using the target data obtained in the obtaining and a reward function including a weighting factor and a feature amount parameter, the reward function being determined by inverse reinforcement learning including the feature amount parameter as an operation target.
  • Appendix 13 A program for causing a computer to function as an information processing apparatus, the program causing the computer to function as: acquisition means for acquiring reference data; and determination means for determining a reward function including a weighting factor and a feature amount parameter by inverse reinforcement learning that uses the reference data and includes the feature amount parameter as an operation target.
  • Appendix 14 A program for causing a computer to function as an information processing apparatus, the program causing the computer to function as: acquisition means for acquiring target data; and generating means for generating output data according to the target data by solving an optimization problem using the target data acquired by the acquisition means and a reward function including a weighting factor and a feature amount parameter, the reward function being determined by inverse reinforcement learning including the feature amount parameter as an operation target.
  • An information processing device comprising at least one processor, the processor executing: an acquisition process of acquiring reference data; and a determination process of determining a reward function including a weighting factor and a feature amount parameter by inverse reinforcement learning that uses the reference data and includes the feature amount parameter as an operation target.
  • Note that the information processing device may further include a memory, and the memory may store a program for causing the processor to execute the acquisition process and the determination process. This program may also be recorded on a computer-readable non-transitory tangible recording medium.
  • An information processing device comprising at least one processor, the processor executing: an acquisition process of acquiring target data; and a generation process of generating output data corresponding to the target data by solving an optimization problem using the target data acquired in the acquisition process and a reward function including a weighting factor and a feature amount parameter, the reward function being determined by inverse reinforcement learning including the feature amount parameter as an operation target.
  • Note that the information processing device may further include a memory, and the memory may store a program for causing the processor to execute the acquisition process and the generation process. This program may also be recorded on a computer-readable non-transitory tangible recording medium.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

In order to generate a more suitable reward function, an information processing device (1) is provided with: an acquisition unit (11) that acquires reference data; and a determination unit (12) that determines a reward function including weighting factors and feature quantity parameters by inverse reinforcement learning that uses the reference data and includes the feature quantity parameters as operation targets.

Description

Information processing device, information processing method, and program

The present invention relates to an information processing device, an information processing method, and a program.

In reinforcement learning (RL), which is one machine learning method, a reward function is used to evaluate the value of various actions. Inverse reinforcement learning (IRL) is known as a method for generating this reward function.

Non-Patent Document 1 describes maximum entropy inverse reinforcement learning (ME-IRL: Maximum Entropy IRL), which is one type of inverse reinforcement learning. ME-IRL specifies the distribution of trajectories using the maximum entropy principle and learns the reward function by bringing that distribution close to the true distribution (that is, by maximum likelihood estimation).

Non-Patent Document 2 describes GCL (Guided Cost Learning), which is one inverse reinforcement learning method that improves on maximum entropy inverse reinforcement learning. In the method described in Non-Patent Document 2, importance sampling is used to update the weights of the reward function.

However, both of the techniques described in Non-Patent Document 1 and Non-Patent Document 2 have room for improvement in terms of generating an appropriate reward function.

One aspect of the present invention has been made in view of the above problem, and an example of its purpose is to provide a technique capable of generating a more appropriate reward function.

An information processing device according to one aspect of the present invention includes: acquisition means for acquiring reference data; and determination means for determining a reward function including a weighting factor and a feature amount parameter by inverse reinforcement learning that uses the reference data and includes the feature amount parameter as an operation target.

An information processing device according to one aspect of the present invention includes: acquisition means for acquiring target data; and generation means for generating output data according to the target data by solving an optimization problem using the target data acquired by the acquisition means and a reward function including a weighting factor and a feature amount parameter, the reward function being determined by inverse reinforcement learning including the feature amount parameter as an operation target.

An information processing method according to one aspect of the present invention is an information processing method by an information processing device, and includes: acquiring reference data; and determining a reward function including a weighting factor and a feature amount parameter by inverse reinforcement learning that uses the reference data and includes the feature amount parameter as an operation target.

An information processing method according to one aspect of the present invention is an information processing method by an information processing device, and includes: acquiring target data; and generating output data according to the target data by solving an optimization problem using the target data acquired in the acquiring and a reward function including a weighting factor and a feature amount parameter, the reward function being determined by inverse reinforcement learning including the feature amount parameter as an operation target.

A program according to one aspect of the present invention is a program for causing a computer to function as an information processing device, and causes the computer to function as: acquisition means for acquiring reference data; and determination means for determining a reward function including a weighting factor and a feature amount parameter by inverse reinforcement learning that uses the reference data and includes the feature amount parameter as an operation target.

A program according to one aspect of the present invention is a program for causing a computer to function as an information processing device, and causes the computer to function as: acquisition means for acquiring target data; and generation means for generating output data according to the target data by solving an optimization problem using the target data acquired by the acquisition means and a reward function including a weighting factor and a feature amount parameter, the reward function being determined by inverse reinforcement learning including the feature amount parameter as an operation target.

According to one aspect of the present invention, a more appropriate reward function can be generated.
FIG. 1 is a block diagram showing the configuration of an information processing device according to exemplary embodiment 1 of the present invention.
FIG. 2 is a flow diagram showing the flow of an information processing method according to exemplary embodiment 1 of the present invention.
FIG. 3 is a block diagram showing the configuration of an information processing device according to exemplary embodiment 1 of the present invention.
FIG. 4 is a flow diagram showing the flow of an information processing method according to exemplary embodiment 1 of the present invention.
FIG. 5 is a block diagram showing the configuration of an information processing device according to exemplary embodiment 2 of the present invention.
FIG. 6 is a flow diagram showing the flow of an information processing method according to exemplary embodiment 2 of the present invention.
FIG. 7 is a diagram showing a display example generated by a display control unit according to exemplary embodiment 2 of the present invention.
FIG. 8 is a block diagram showing the configuration of an information processing device according to exemplary embodiment 3 of the present invention.
FIG. 9 is a diagram showing a display example generated by a display control unit according to exemplary embodiment 3 of the present invention.
FIG. 10 is a diagram showing a second display example by the information processing device according to exemplary embodiment 3 of the present invention.
FIG. 11 is a diagram showing an application example of the information processing device according to exemplary embodiment 3 of the present invention.
FIG. 12 is a diagram showing an example of a computer that implements the information processing devices according to the exemplary embodiments of the present invention.
[Exemplary embodiment 1]
A first exemplary embodiment of the present invention will be described in detail with reference to the drawings. This exemplary embodiment is the basis for the exemplary embodiments described later.

(Overview of information processing device 1)
The information processing device 1 according to this exemplary embodiment is a device that determines a reward function including weighting factors and feature amount parameters by inverse reinforcement learning using reference data.
Here, reference data refers to data referred to in the inverse reinforcement learning, and includes, as an example, sets of state data and action data. For example, the reference data may include state data representing the state of a certain system and action data representing an action taken by a specific expert in that state. As an example, the reference data τ can be represented by {τ_1, τ_2, ..., τ_N} (where τ_i = ((s_1, a_1), (s_2, a_2), ..., (s_N, a_N))). Here, N is an arbitrary natural number, s_i (i = 1 to N) represents state data indicating the state of the system, and a_i (i = 1 to N) represents the action data selected in the state indicated by that state data. Thus, the reference data can include, as an example, one or more sets of state data and action data. Some or all of the data included in the reference data are also called explanatory variables, which are the arguments of the reward function.

Note that the action data is not limited to data indicating an action taken by a specific expert, and may be data indicating an action taken by any subject that executes the action associated with the action data a_i; for example, it may be data indicating an action taken by a robot.

In the following description, unless confusion arises, state data may be referred to simply as a state, and action data may be referred to simply as an action.

In this exemplary embodiment, inverse reinforcement learning refers to learning for determining a reward function. In the inverse reinforcement learning according to this exemplary embodiment, the reward function is determined by referring to the reference data and updating the feature amount parameters included in the reward function as operation targets. In the inverse reinforcement learning according to this exemplary embodiment, the weighting factors included in the reward function may also be updated as operation targets with reference to the reference data.

Here, the reward function is, as an example, a function for evaluating the value of each of various actions. The reward function includes weighting factors and feature amount parameters as its parameters. A weighting factor is, as an example, a weight by which each of one or more feature values included in the reward function is multiplied. A feature amount parameter is, as an example, a parameter that characterizes one or more feature values included in the reward function.
A simple example of the reward function Reward is as follows.
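Equations 1 and 2 appear only as images in the published application. A form consistent with the surrounding description (the reward is the sign-inverted cost, and the cost is a weighted sum of terms in the explanatory variables) would be, as an assumed reconstruction rather than the filed equations:

```latex
\[
\begin{aligned}
\mathrm{Reward} &= -\,\mathrm{Cost} && \text{(Equation 1)}\\
\mathrm{Cost} &= \lambda_1 x_1 + \lambda_2 x_2 + \lambda_3 x_3 && \text{(Equation 2)}
\end{aligned}
\]
```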
The variables in Equation 1 and Equation 2 are as follows.
Reward: reward function
Cost: cost function
λ1, λ2, λ3: weighting factors
x1, x2, x3: explanatory variables
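As a concrete illustration of the cost-term structure described above, the following minimal Python sketch evaluates a reward of this form. The threshold-type third feature and its parameter p are hypothetical additions used only to show what the later embodiments treat as a feature amount parameter; they are not part of the published example.

```python
# Sketch: reward = -cost, cost = weighted sum of feature values (cf. Equations 1 and 2).
import numpy as np

def cost(x, lam, p):
    # x = (x1, x2, x3): explanatory variables (state/action data)
    # lam = (lam1, lam2, lam3): weighting factors
    # p: hypothetical feature amount parameter characterizing the third cost term
    f1 = x[0]
    f2 = x[1]
    f3 = max(x[2] - p, 0.0)   # feature value shaped by parameter p
    return lam[0] * f1 + lam[1] * f2 + lam[2] * f3

def reward(x, lam, p):
    return -cost(x, lam, p)   # the reward is the sign-inverted cost

print(reward(np.array([1.0, 2.0, 3.0]), np.array([0.5, 0.3, 0.2]), p=2.5))
```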
Here, the above exemplary explanatory variables x1, x2, and x3 are variables that can each correspond to either state data (s_i) or action data (a_i). As in this example, an explanatory variable itself may constitute a feature value, or a function of an explanatory variable may constitute a feature value.

Also, as shown in Equation 1, in this exemplary embodiment the reward function is the cost function with its sign inverted. Therefore, the smaller the cost, the larger the reward.

Also, as shown in Equation 2, the cost function includes one or more cost terms each including a feature value represented using explanatory variables and a weighting factor representing the weight of that feature value, and at least one of the one or more cost terms includes, together with the explanatory variables, the feature amount parameter that characterizes that cost term.
Also, when the weighting coefficients and feature values in Equation 1 are collected into a weighting coefficient vector θ and a feature amount vector f_τ, respectively, the reward function can be expressed as Reward(τ) = θ^T f_τ. Here, "T" represents the transpose of a vector. θ is sometimes called the weighting coefficient vector, or simply the parameter, and f_τ is sometimes called the feature amount vector.
In the above reward function, the information processing device 1 determines the reward function Reward by using the reference data τ to update, as an example, the feature amount parameters.
(Configuration of information processing device 1)
The configuration of the information processing device 1 according to this exemplary embodiment will be described with reference to FIG. 1. FIG. 1 is a block diagram showing the configuration of the information processing device 1 according to this exemplary embodiment.

As shown in FIG. 1, the information processing device 1 includes an acquisition unit 11 and a determination unit 12. In this exemplary embodiment, the acquisition unit 11 and the determination unit 12 are the components that implement acquisition means and determination means, respectively.

The acquisition unit 11 acquires reference data. The acquisition unit 11 supplies the acquired reference data to the determination unit 12.

The determination unit 12 determines the reward function including the weighting factors and the feature amount parameters by inverse reinforcement learning that uses the reference data and includes the feature amount parameters as operation targets.

As described above, the information processing device 1 according to this exemplary embodiment employs a configuration including the acquisition unit 11 that acquires reference data and the determination unit 12 that determines a reward function including weighting factors and feature amount parameters by inverse reinforcement learning that uses the reference data and includes the feature amount parameters as operation targets. Because the feature amount parameters that determine the feature values are included in the operation targets, the output of a prediction model or the like can be adopted as a feature value. Therefore, the information processing device 1 according to this exemplary embodiment can generate a more appropriate reward function.
(Flow of information processing method S1)
The flow of the information processing method S1 according to this exemplary embodiment will be described with reference to FIG. 2. FIG. 2 is a flow diagram showing the flow of the information processing method S1 according to this exemplary embodiment.

(Step S11)
In step S11, the acquisition unit 11 acquires reference data. The acquisition unit 11 supplies the acquired reference data to the determination unit 12.

(Step S12)
In step S12, the determination unit 12 determines the reward function including the weighting factors and the feature amount parameters by inverse reinforcement learning that uses the reference data supplied from the acquisition unit 11 and includes the feature amount parameters as operation targets.

As described above, in the information processing method S1 according to this exemplary embodiment, the acquisition unit 11 acquires reference data in step S11, and the determination unit 12 determines, in step S12, the reward function including the weighting factors and the feature amount parameters by inverse reinforcement learning that uses the reference data supplied from the acquisition unit 11 and includes the feature amount parameters as operation targets. Therefore, the information processing method S1 according to this exemplary embodiment provides the same effects as the information processing device 1.
(Overview of information processing device 2)
The information processing device 2 according to this exemplary embodiment is a device that generates output data according to target data by solving an optimization problem using the target data and a reward function determined by inverse reinforcement learning. Here, the reward function and inverse reinforcement learning are as described above.

In this exemplary embodiment, the target data includes at least part of state data representing the state of a certain system and action data indicating an action taken by a specific expert in that state.

Here, solving the optimization problem means maximizing the reward function by manipulating the data to be manipulated, with the target data as input.

As an example, the target data TD can be represented by {s_1, s_2, ..., s_N}, and the manipulated data MD can be represented by {a_1, a_2, ..., a_N}. Here, s_i (i = 1 to N) represents state data indicating the state of the system, and a_i (i = 1 to N) represents action data that can be selected in the state indicated by the target data TD. In this example, the information processing device 2 according to this exemplary embodiment maximizes a reward function whose explanatory variables are the target data TD and the manipulated data MD by manipulating the manipulated data MD. In other words, the information processing device 2 solves the optimization problem with the target data as input and generates, as output data, the manipulated data that maximizes the reward function.
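A minimal sketch of this generation step, assumed rather than taken from the publication, is shown below: the target data TD (states) is held fixed, candidate manipulated data MD (action sequences) are enumerated, and the sequence that maximizes a linear reward θ·f(TD, MD) is returned. The two feature terms and the candidate action grid are hypothetical.

```python
# Sketch: generate output data by maximizing theta . f over the manipulated data.
import itertools
import numpy as np

def feature_vector(states, actions):
    states, actions = np.asarray(states), np.asarray(actions)
    return np.array([
        -np.sum(np.abs(actions)),           # effort-like feature term
        -np.sum((actions - states) ** 2),   # tracking-like feature term
    ])

def generate_output(states, candidate_actions, theta, horizon):
    best_actions, best_reward = None, -np.inf
    for actions in itertools.product(candidate_actions, repeat=horizon):
        r = float(theta @ feature_vector(states, actions))
        if r > best_reward:
            best_actions, best_reward = actions, r
    return best_actions

theta = np.array([0.3, 1.0])   # e.g. a weighting coefficient vector determined by IRL
plan = generate_output(states=[0.2, 0.8, 0.5],
                       candidate_actions=[0.0, 0.5, 1.0], theta=theta, horizon=3)
print(plan)
```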
(Configuration of information processing device 2)
The configuration of the information processing device 2 according to this exemplary embodiment will be described with reference to FIG. 3. FIG. 3 is a block diagram showing the configuration of the information processing device 2 according to this exemplary embodiment.

As shown in FIG. 3, the information processing device 2 includes an acquisition unit 11 and a generation unit 22. In this exemplary embodiment, the acquisition unit 11 and the generation unit 22 are the components that implement acquisition means and generation means, respectively.

The acquisition unit 11 acquires target data. The acquisition unit 11 supplies the acquired target data to the generation unit 22.

The generation unit 22 generates output data according to the target data by solving an optimization problem using the target data acquired by the acquisition unit 11 and a reward function including weighting factors and feature amount parameters, the reward function being determined by inverse reinforcement learning that includes the feature amount parameters as operation targets.

Here, as an example of the reward function determined by inverse reinforcement learning, the reward function determined by the determination unit 12 of the information processing device 1 described above can be used.

As described above, the information processing device 2 according to this exemplary embodiment employs a configuration including the acquisition unit 11 that acquires target data and the generation unit 22 that generates output data according to the target data by solving an optimization problem using the target data acquired by the acquisition unit 11 and a reward function including weighting factors and feature amount parameters, the reward function being determined by inverse reinforcement learning that includes the feature amount parameters as operation targets.

Therefore, since the information processing device 2 according to this exemplary embodiment solves the optimization problem using a reward function determined by inverse reinforcement learning that includes the feature amount parameters as operation targets, it can generate output data that maximizes a more appropriate reward function.
(Flow of information processing method S2)
The flow of the information processing method S2 according to this exemplary embodiment will be described with reference to FIG. 4. FIG. 4 is a flow diagram showing the flow of the information processing method S2 according to this exemplary embodiment.

(Step S21)
In step S21, the acquisition unit 11 acquires target data.

(Step S22)
In step S22, the generation unit 22 generates output data according to the target data by solving an optimization problem using the target data acquired by the acquisition unit 11 and a reward function including weighting factors and feature amount parameters, the reward function being determined by inverse reinforcement learning that includes the feature amount parameters as operation targets.

Here, as an example of the reward function determined by inverse reinforcement learning, the reward function determined in step S12 of the information processing method S1 described above can be used.

As described above, in the information processing method S2 according to this exemplary embodiment, the acquisition unit 11 acquires target data in step S21, and the generation unit 22 generates, in step S22, output data according to the target data by solving an optimization problem using the target data acquired by the acquisition unit 11 and a reward function including weighting factors and feature amount parameters, the reward function being determined by inverse reinforcement learning that includes the feature amount parameters as operation targets. Therefore, the information processing method S2 according to this exemplary embodiment provides the same effects as the information processing device 2.
[Exemplary embodiment 2]
A second exemplary embodiment of the present invention will be described in detail with reference to the drawings. Components having the same functions as the components described in exemplary embodiment 1 are denoted by the same reference signs, and descriptions thereof are omitted as appropriate.

(Overview of information processing device 3)
The information processing device 3 according to this exemplary embodiment is a device that determines a reward function including a weighting factor WF and a feature amount parameter FP by inverse reinforcement learning using reference data RD. The information processing device 3 also displays information corresponding to at least one of the determined weighting factor WF, the feature amount parameter FP, and the reward function.

The reference data, inverse reinforcement learning, reward function, weighting factor, and feature amount parameter are as described above.
(Configuration of information processing device 3)
The configuration of the information processing device 3 according to this exemplary embodiment will be described with reference to FIG. 5. FIG. 5 is a block diagram showing the configuration of the information processing device 3 according to this exemplary embodiment.

As shown in FIG. 5, the information processing device 3 includes a storage unit 31, an input unit 32, an output unit 33, a communication unit 34, and a control unit 35.

The storage unit 31 is a memory that stores various data referred to by the control unit 35, which will be described later. Examples of the data stored in the storage unit 31 include the reference data RD, the weighting factor WF, and the feature amount parameter FP. As an example of the reference data RD, decision-making history data (trajectories) of an expert received by the input unit 32, which will be described later, may be stored. The storage unit 31 may also store candidates for the feature values of the reward function that the determination unit 12 uses for learning. However, the feature value candidates do not necessarily have to be feature values actually used in the reward function.

The storage unit 31 may also store a mathematical optimization solver for realizing the processing by the determination unit 12. The content of the mathematical optimization solver is arbitrary and may be determined according to the environment and apparatus in which it is executed.

The input unit 32 accepts various data input to the information processing device 3. For example, the input unit 32 may accept input of the expert decision-making history data described above (specifically, pairs of states and actions). The input unit 32 may also accept input of the initial state and the constraint conditions used when the inverse reinforcement learning described later is performed.

The input unit 32 includes, as an example, input devices such as a keyboard, a mouse, and a touch panel. The input unit 32 may also function as an interface for acquiring data from other connected devices. In this configuration, the input unit 32 supplies data acquired from other devices to the control unit 35, which will be described later.

The output unit 33 is a component that outputs the results of computation by the information processing device 3. As an example, the output unit 33 includes a display panel (display unit) and displays the computation results on the display panel. The output unit 33 may also function as an interface that outputs data to other connected devices. In this configuration, the output unit 33 outputs data supplied from the control unit 35, which will be described later, to other connected devices.

The communication unit 34 is a communication module that communicates with other devices via a network (not shown). As an example, the communication unit 34 outputs data supplied from the control unit 35, which will be described later, to other devices via the network, and acquires data output from other devices via the network and supplies it to the control unit 35.

The specific configuration of the network does not limit this embodiment; as an example, a wireless LAN (Local Area Network), a wired LAN, a WAN (Wide Area Network), a public line network, a mobile data communication network, or a combination of these networks can be used.

In this exemplary embodiment, the computation results are displayed via at least one of the output unit 33 and the communication unit 34.
(Control unit 35)
The control unit 35 controls each unit included in the information processing device 3. As an example, the control unit 35 stores data acquired from the input unit 32 or the communication unit 34 in the storage unit 31, and supplies data stored in the storage unit 31 to the output unit 33 or the communication unit 34.

As shown in FIG. 5, the control unit 35 also functions as the acquisition unit 11, the determination unit 12, and the display control unit 13. In this exemplary embodiment, the acquisition unit 11, the determination unit 12, and the display control unit 13 are the components that implement acquisition means, determination means, and first display means, respectively.

The acquisition unit 11 acquires the reference data RD via the input unit 32 or the communication unit 34. The acquisition unit 11 stores the acquired reference data RD in the storage unit 31.

The determination unit 12 acquires the reference data RD stored in the storage unit 31 and determines the reward function including the weighting factor WF and the feature amount parameter FP by inverse reinforcement learning that uses the reference data RD and includes the feature amount parameter FP as an operation target. The determination unit 12 stores the determined feature amount parameter FP in the storage unit 31.

The operation targets in the inverse reinforcement learning by the determination unit 12 may also include the weighting factor WF included in at least one of the one or more cost terms. The determination unit 12 stores the updated weighting factor WF in the storage unit 31.

An example of the processing executed by the determination unit 12 will be described later.

The display control unit 13 displays, via the output unit 33, information corresponding to at least one of the weighting factor WF, the feature amount parameter FP, and the reward function.

An example of the processing executed by the display control unit 13 will be described later.
<Explanation of problem setting and method>
In the following, to facilitate understanding, the problem setting and method of the maximum entropy inverse reinforcement learning according to this exemplary embodiment are described first. Maximum entropy inverse reinforcement learning (ME-IRL) assumes the following problem setting: from expert data D = {τ_1, τ_2, ..., τ_N} (where τ_i = ((s_1, a_1), (s_2, a_2), ..., (s_N, a_N))), a single reward function R(s, a) = θ·f(s, a) is estimated. In ME-IRL, the decision-making of the expert can be reproduced by estimating θ.

Here, θ is a weighting coefficient vector whose components are the weighting factors WF. Also, f(s, a) is a feature amount vector and can include a plurality of terms corresponding to the respective feature values. The total number of weighting factors WF included in the weighting coefficient vector θ is determined according to the number of components of the feature amount vector f(s, a).
Next, the ME-IRL method is described. In ME-IRL, a trajectory τ is represented by Equation A1 exemplified below, and the probability model representing the distribution p_θ(τ) of trajectories is represented by Equation A2 exemplified below. In Equation A2, θ^T f_τ represents the reward function (see Equation A3), and Z represents the sum of the rewards for all trajectories (see Equation A4).
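Equations A1 to A4 appear only as images in the published application. A standard maximum-entropy formulation consistent with the description above (a trajectory as a state-action sequence, a Boltzmann-type trajectory distribution whose exponent is the linear reward θ^T f_τ, and Z as the normalizing sum over all trajectories) would be, as an assumed reconstruction:

```latex
\[
\begin{aligned}
\tau &= \bigl((s_1, a_1), (s_2, a_2), \ldots, (s_N, a_N)\bigr) && \text{(A1)}\\
p_\theta(\tau) &= \frac{\exp\bigl(\theta^{\top} f_\tau\bigr)}{Z} && \text{(A2)}\\
R(\tau) &= \theta^{\top} f_\tau && \text{(A3)}\\
Z &= \sum_{\tau} \exp\bigl(\theta^{\top} f_\tau\bigr) && \text{(A4)}
\end{aligned}
\]
```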
Then, the update rule for the weights of the reward function by maximum likelihood estimation (specifically, gradient ascent) is represented by Equations A5 and A6 exemplified below. In Equation A5, α is the step size, and L(θ) is the distance measure between distributions used in ME-IRL.
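Equations A5 and A6 are likewise published only as images. The standard ME-IRL gradient-ascent update, which matches the remark that follows (the second term of A6 requires a summation over all possible trajectories), would be, as an assumed reconstruction:

```latex
\[
\begin{aligned}
\theta &\leftarrow \theta + \alpha\, \nabla_{\theta} L(\theta) && \text{(A5)}\\
\nabla_{\theta} L(\theta) &= \frac{1}{N}\sum_{n=1}^{N} f_{\tau_n} \;-\; \sum_{\tau} p_\theta(\tau)\, f_\tau && \text{(A6)}
\end{aligned}
\]
```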
The second term in Equation A6 is the sum of the rewards for all trajectories. ME-IRL assumes that the value of this second term can be calculated exactly. In reality, however, there is also the problem that it is difficult to calculate the sum of the rewards for all trajectories.

In addition, there is the problem that merely updating the weighting coefficient vector θ, as in Equation A5, leaves room for improvement in terms of generating an appropriate reward function.

In the maximum entropy inverse reinforcement learning according to this exemplary embodiment, at least one of the plurality of terms of the feature amount vector f(s, a) includes a feature amount parameter FP that characterizes that term. In addition, in the maximum entropy inverse reinforcement learning according to this exemplary embodiment, not only θ described above but also the feature amount parameter FP is a target of estimation. Therefore, the maximum entropy inverse reinforcement learning according to this exemplary embodiment could also be called improved maximum entropy inverse reinforcement learning. However, to avoid complicating the terminology, this "improved maximum entropy inverse reinforcement learning" is hereinafter also simply referred to as "maximum entropy inverse reinforcement learning (ME-IRL)".
(Example of processing executed by the determination unit 12)
Next, an example of the processing executed by the determination unit 12 will be described.

The determination unit 12 sets the feature values of the reward function from the reference data including states and actions. As an example, the determination unit 12 may set the feature values of the reward function such that the gradient of the tangent is finite over the entire function, so that the Wasserstein distance can be used as the distance measure between distributions in the inverse reinforcement learning process. The determination unit 12 may also, for example, set the feature values of the reward function so as to satisfy the Lipschitz continuity condition.

For example, let f_τ be the feature amount vector of a trajectory τ. When the reward function θ^T f_τ is linear, if the mapping F: τ → f_τ is Lipschitz continuous, then θ^T f_τ is also Lipschitz continuous. Therefore, the determination unit 12 may set the feature values such that the reward function is a linear function.
Note that, for example, Equation 4 exemplified below has an infinite gradient at a_0, and can therefore be said to be an inappropriate reward function in the present disclosure.

The determination unit 12 may, for example, determine a reward function whose feature values are set in accordance with a user instruction; in this case, the acquisition unit 11 may acquire a reward function that satisfies the Lipschitz continuity condition via the input unit 32 or the communication unit 34.

The determination unit 12 may also be configured to initialize the weighting factor WF. The method by which the determination unit 12 initializes the weighting factor WF is not particularly limited, and the weighting factor WF may be initialized based on any method predetermined according to the user or the like.
The determination unit 12 also derives a trajectory τ^ (τ^ denotes τ with a superscript hat) that minimizes the distance between the probability distribution of the reference data RD and the probability distribution of the optimal solution determined based on the optimized parameters (of the reward function). Specifically, the determination unit 12 uses the Wasserstein distance as the distance measure between distributions and estimates the expert trajectory τ^ by performing mathematical optimization so as to minimize that Wasserstein distance.

The Wasserstein distance is defined by Equation 5 exemplified below. That is, the Wasserstein distance represents the distance between the probability distribution of the expert trajectories and the probability distribution of the trajectories determined based on the parameters of the reward function. Note that, because of the constraints of the Wasserstein distance, the reward function θ^T f_τ needs to be a function that satisfies the Lipschitz continuity condition. In this exemplary embodiment, since the determination unit 12 sets the feature values of the reward function so as to satisfy the Lipschitz continuity condition, it becomes possible to use the Wasserstein distance as exemplified below.
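Equation 5 is published only as an image. A form consistent with the description around it (a first term over the N expert trajectories, a second term over the trajectories τ_θ(n) optimized under the current parameter θ, and a value of 0 or less) would be, as an assumed reconstruction:

```latex
\[
W(\theta) \;=\; \frac{1}{N}\sum_{n=1}^{N} \theta^{\top} f_{\tau_n}
\;-\; \frac{1}{N}\sum_{n=1}^{N} \theta^{\top} f_{\tau_{\theta}(n)}
\qquad \text{(Equation 5)}
\]
```

Under this reading, each τ_θ(n) attains a reward at least as large as that of the corresponding expert trajectory, so W(θ) ≤ 0, and increasing W(θ) corresponds to bringing the two distributions closer together, as stated below.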
The Wasserstein distance defined by Equation 5 exemplified above takes a value of 0 or less, and increasing this value corresponds to bringing the distributions closer together. In the second term of Equation 5, τ_θ(n) represents the n-th trajectory optimized with the parameter θ. The second term of Equation 5 is a term that can also be calculated in a combinatorial optimization problem. Therefore, by using the Wasserstein distance exemplified in Equation 5 as the distance measure between distributions, inverse reinforcement learning that is applicable even to mathematical optimization problems such as combinatorial optimization problems can be performed.

The determination unit 12 also updates the parameter θ of the reward function and the feature amount parameter FP so as to maximize the distance measure between the distributions, based on the estimated expert trajectory τ^. Here, in maximum entropy inverse reinforcement learning (that is, ME-IRL), the trajectory τ is considered to follow a Boltzmann distribution according to the maximum entropy principle. Therefore, as in ME-IRL, the determination unit 12 updates the parameter θ of the reward function and the feature amount parameter FP, based on the estimated expert trajectory τ^, so as to maximize the log-likelihood of the Boltzmann distribution derived from the maximum entropy principle.

As described above, the second term in Equation A6 is the sum of the rewards for all trajectories. ME-IRL assumes that the value of this second term can be calculated exactly. In reality, however, there is the problem that it is difficult to calculate the sum of the rewards for all trajectories.
Therefore, the determination unit 12 sets a lower bound of the log-likelihood represented using the reward function, and updates the operation targets (the parameter θ and the feature amount parameter FP) so as to maximize that lower bound. As an example, the determination unit 12 sets the lower bound of L(θ) as in Equation 6 exemplified below; hereinafter, this quantity is also referred to as the lower bound of the log-likelihood.
In Equation 6, which expresses the lower bound of the log-likelihood, the second term is the maximum reward value for the current parameter θ, and the third term is the logarithm of the number of possible trajectories (Nτ). In this way, based on the log-likelihood of ME-IRL, the determination unit 12 derives the lower bound of the log-likelihood, which is calculated by subtracting the maximum reward value for the current parameter θ and the logarithm of the number of possible trajectories (Nτ) from the term corresponding to the probability distribution of the trajectories.
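For reference, a lower bound of this kind can be derived from the standard ME-IRL log-likelihood as follows; this reconstruction is consistent with the description above (second term: maximum reward, third term: log of the number of possible trajectories), but it is not necessarily identical to the patent's Equation 6.

    \[
    L(\theta)
      = \frac{1}{N}\sum_{n=1}^{N} \theta^{\mathsf{T}} f_{\tau_E^{(n)}}
        - \log \sum_{\tau} \exp\bigl(\theta^{\mathsf{T}} f_{\tau}\bigr)
      \;\ge\;
      \frac{1}{N}\sum_{n=1}^{N} \theta^{\mathsf{T}} f_{\tau_E^{(n)}}
        - \max_{\tau} \theta^{\mathsf{T}} f_{\tau}
        - \log N_{\tau}
      \;=:\; \underline{L}(\theta),
    \]
    \[
    \text{since}\quad
    \sum_{\tau} \exp\bigl(\theta^{\mathsf{T}} f_{\tau}\bigr)
      \le N_{\tau}\,\exp\bigl(\max_{\tau} \theta^{\mathsf{T}} f_{\tau}\bigr).
    \]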
Further, the determination unit 12 may use a form obtained by transforming the derived expression for the lower bound of the ME-IRL log-likelihood into an expression in which an entropy regularization term is subtracted from the Wasserstein distance. The expression obtained by decomposing the lower bound of the ME-IRL log-likelihood into the Wasserstein distance and the entropy regularization term is given by Equation 7 exemplified below.
The expression in the first parentheses of Equation 7 represents the Wasserstein distance. That is, the Wasserstein distance represents the distance between the probability distribution of the expert trajectories and the probability distribution of the trajectories determined based on the parameters of the reward function. Because of the constraints of the Wasserstein distance, the reward function θ^T f_τ must satisfy the Lipschitz continuity condition. In this exemplary embodiment, the determination unit 12 sets the features of the reward function so as to satisfy the Lipschitz continuity condition, which makes it possible to use the Wasserstein distance.
The expression in the second parentheses of Equation 7 represents the entropy regularization term, which contributes to increasing the log-likelihood of the Boltzmann distribution derived from the maximum entropy principle. Specifically, in the entropy regularization term exemplified in Equation 7 (that is, the expression in the second parentheses of Equation 7), the first term represents the maximum reward value for the current parameter θ, and the second term represents the average reward value for the current parameter θ.
In this way, the inverse reinforcement learning performed by the determination unit 12 includes update processing that updates the manipulated variables (the parameter θ and the feature parameter FP) so as to maximize the lower bound of the log-likelihood expressed using the reward function. As shown in Equation 7, the lower bound of the log-likelihood in the update processing is expressed using the Wasserstein distance, which represents the distance between the reference probability distribution and the probability distribution expressed using the reward function, and a regularization term, which represents the difference between the maximum value and the average value of the reward function.
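Written out under the same assumptions as the reconstruction above, the decomposition into a Wasserstein-style term and an entropy regularization term takes the following form, in which f_{τ_θ^(n)} denotes the feature vector of the n-th trajectory optimized under the current parameters (again a sketch consistent with the prose, not the patent's exact Equation 7):

    \[
    \underline{L}(\theta)
      = \Bigl(\frac{1}{N}\sum_{n} \theta^{\mathsf{T}} f_{\tau_E^{(n)}}
              - \frac{1}{N}\sum_{n} \theta^{\mathsf{T}} f_{\tau_\theta^{(n)}}\Bigr)
        - \Bigl(\max_{\tau} \theta^{\mathsf{T}} f_{\tau}
              - \frac{1}{N}\sum_{n} \theta^{\mathsf{T}} f_{\tau_\theta^{(n)}}\Bigr)
        - \log N_{\tau}.
    \]

Here the first parenthesis corresponds to the Wasserstein distance and the second parenthesis to the entropy regularization term (maximum reward minus average reward); the two mean-reward terms cancel, so this agrees with the lower bound sketched above.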
Here, the reason why the second term in the second parentheses of Equation 7 functions as an entropy regularization term is explained. To maximize the lower bound of the ME-IRL log-likelihood, the value of this second term must be made small, which corresponds to reducing the difference between the maximum reward value and the average value. A smaller difference between the maximum reward value and the average value indicates smaller variation among the trajectories.
In other words, a smaller difference between the maximum reward value and the average value means an increase in entropy; the entropy regularization therefore takes effect and contributes to maximizing the entropy. This in turn contributes to maximizing the log-likelihood of the Boltzmann distribution and, as a result, to resolving the indeterminacy in inverse reinforcement learning.
Based on Equation 7 shown above, the determination unit 12, for example, fixes the estimated trajectory τ̂ and updates the parameter θ and the feature parameter FP by gradient ascent. However, ordinary gradient ascent may fail to converge. In the entropy regularization term, the feature vector of the trajectory with the maximum reward value (f_τθmax) never coincides with the average of the feature vectors of the other trajectories (f_τ(n)) (that is, their difference does not become 0). Consequently, with ordinary gradient ascent the log-likelihood oscillates without converging, making the procedure unstable and making it difficult to perform an appropriate convergence test (for the update of the parameter θ, see Equation 8 below).
Therefore, when using the gradient method, the determination unit 12 may update the parameter θ and the feature parameter FP while gradually attenuating the part that contributes to the entropy regularization (that is, the part corresponding to the entropy regularization term). In other words, the lower bound of the log-likelihood may include a damping coefficient that is multiplied by the regularization term and that attenuates the contribution of the regularization term as the update processing is repeated.
Specifically, the determination unit 12 defines an update rule in which a damping coefficient βt indicating the degree of attenuation is applied to the part contributing to the entropy regularization. For example, the determination unit 12 differentiates Equation 7 above with respect to θ and defines Equation 9, exemplified below, in which, of the part corresponding to the term indicating the Wasserstein distance (that is, the part contributing to increasing the Wasserstein distance) and the part corresponding to the entropy regularization term, the damping coefficient is applied to the part corresponding to the entropy regularization term.
The damping coefficient is defined in advance according to how the part corresponding to the entropy regularization term is to be attenuated. For example, for smooth attenuation, βt is defined as in Equation 10 exemplified below.
In Equation 10, β1 is set to 1, β2 is set to a value of 0 or greater, and t denotes the number of iterations. As a result, the damping coefficient βt functions as a coefficient that reduces the part corresponding to the entropy regularization term as the number of iterations t increases.
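A minimal Python sketch of this damped update is shown below. The decay schedule beta_t = beta1 / (1 + beta2 * t) is only one smooth form consistent with the description of Equation 10 (beta1 = 1, beta2 >= 0), and the gradient arguments are placeholders for the Wasserstein part and the entropy-regularization part of Equation 9.

    import numpy as np

    def damping_coefficient(t, beta1=1.0, beta2=0.1):
        # Decays toward 0 as the iteration count t grows; beta2 = 0 disables the decay.
        return beta1 / (1.0 + beta2 * t)

    def damped_gradient_step(theta, grad_wasserstein, grad_entropy_reg, t, lr=0.01):
        # The part of the gradient corresponding to the entropy regularization term is
        # attenuated by beta_t, while the Wasserstein part is kept at full strength.
        beta_t = damping_coefficient(t)
        return np.asarray(theta) + lr * (np.asarray(grad_wasserstein)
                                         + beta_t * np.asarray(grad_entropy_reg))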
Since the Wasserstein distance induces a weaker topology than the log-likelihood, which is a KL divergence, bringing the log-likelihood close to 0 also brings the Wasserstein distance close to 0. Therefore, the determination unit 12 may update the parameter θ and the feature parameter FP without attenuating the part corresponding to the entropy regularization term in the initial stage of the update, and may then update the parameter θ and the feature parameter FP while reducing the influence of the part corresponding to the entropy regularization term at the timing when the log-likelihood starts to oscillate.
Specifically, using Equation 9 shown above, the determination unit 12 updates the parameter θ and the feature parameter FP with the damping coefficient βt = 1 in the initial stage. Thereafter, the determination unit 12 may change the damping coefficient to βt = 0 at the timing when the log-likelihood starts to oscillate, thereby eliminating the influence of the part corresponding to the entropy regularization term when updating the parameter θ and the feature parameter FP.
The determination unit 12 may determine, for example, that the log-likelihood has started to oscillate when the moving average of the log-likelihood becomes constant. Specifically, the determination unit 12 may determine that the moving average has become constant when the change in the moving average of the lower bound of the log-likelihood over a time window (several points from the current value into the past) is minute (for example, 1e-3 or less).
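One possible implementation of this oscillation test, using the 1e-3 tolerance mentioned above and an assumed window length, is the following sketch:

    def moving_average_is_flat(lower_bound_history, window=5, tol=1e-3):
        # lower_bound_history: values of the lower bound of the log-likelihood, newest last.
        # Returns True when the moving average over the last `window` values has
        # essentially stopped changing, i.e. the lower bound is judged to be
        # oscillating around a constant level.
        if len(lower_bound_history) < window + 1:
            return False
        previous_average = sum(lower_bound_history[-window - 1:-1]) / window
        current_average = sum(lower_bound_history[-window:]) / window
        return abs(current_average - previous_average) < tol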
Alternatively, at the timing when the log-likelihood starts to oscillate, the determination unit 12 may first change the damping coefficient as in Equation 10 shown above, instead of immediately setting βt = 0. Then, after this change, the determination unit 12 may change the damping coefficient to βt = 0 at the timing when the log-likelihood starts to oscillate again. The timing at which oscillation starts is determined in the same way as described above.
Furthermore, after changing the damping coefficient as in Equation 10 shown above, the determination unit 12 may change the method of updating the parameter θ and the feature parameter FP at the timing when the log-likelihood starts to oscillate again. Specifically, the determination unit 12 may update the parameter θ and the feature parameter FP using a momentum method as exemplified in Equation 11 below. The values of γ1 and α in Equation 11 are determined in advance; for example, γ1 = 0.9 and α = 0.001 may be used.
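Assuming that Equation 11 has the standard momentum form, a sketch of the update with the stated constants γ1 = 0.9 and α = 0.001 is:

    def momentum_step(theta, velocity, gradient, gamma1=0.9, alpha=0.001):
        # The velocity accumulates past gradients, which smooths the oscillation
        # of the lower bound of the log-likelihood.
        velocity = gamma1 * velocity + alpha * gradient
        return theta + velocity, velocity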
Thereafter, the determination unit 12 repeats the trajectory estimation processing and the update processing of the parameter θ and the feature parameter FP until it determines that the lower bound of the log-likelihood has converged.
As an example of the processing by which the determination unit 12 determines that the lower bound of the log-likelihood has converged, the distance measure between the distributions may be judged to have converged when the absolute value of the lower bound of the log-likelihood becomes smaller than a predetermined threshold.
If the determination unit 12 judges that the distance measure between the distributions has not converged, it continues the trajectory estimation processing and the update processing of the parameter θ and the feature parameter FP. If the determination unit 12 judges that the distance measure between the distributions has converged, it ends the trajectory estimation processing and the update processing of the parameter θ and the feature parameter FP.
<More specific processing example>
In the following, the processing performed by the determination unit 12 described above is explained using a more concrete example. In the following example, the reward function (Reward) and the cost function (Cost) are given by Equations 12 and 13 below.
Each variable in Equations 12 and 13 has the following meaning, as described in Exemplary Embodiment 1.
Reward: reward function
Cost: cost function
λ1, λ2, λ3: weighting factors
x1, x2, x3: explanatory variables
Here, each of the above exemplary explanatory variables x1, x2, and x3 is a variable that can correspond to either the state data (si) or the action data (ai). As in this example, an explanatory variable itself may constitute a feature, or a function of explanatory variables may constitute a feature. In this example, a case is described in which inverse reinforcement learning is performed with the weighting factor λ1, the weighting factor λ2, and the feature parameter as the manipulated variables.
In the setting described above, once the determination unit 12 has derived the lower bound of the log-likelihood used in Equation 9 described above, it updates the manipulated variables of the reward function. For example, when the reward function is given by Equation 12 below, the determination unit 12 updates the weighting factor λ1, the weighting factor λ2, and the feature parameter using Equations 14 to 17 below. Here, βt is the coefficient defined by Equation 10 described above, and α is a parameter indicating the learning rate.
In this way, the determination unit 12 updates the parameter θ of the reward function and the feature parameter FP so as to maximize the log-likelihood of the Boltzmann distribution derived from the maximum entropy principle. In this update processing, inverse reinforcement learning is performed with not only the weighting factors but also the feature parameter as update targets, so a more appropriate reward function can be generated.
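Since Equations 12 through 17 themselves are not reproduced in this text, the following sketch is purely illustrative: it assumes a hypothetical reward of the form Reward = λ1*x1 + λ2*max(0, x2 - p) + λ3*x3 with a feature parameter p, and moves λ1, λ2, and p in the direction that increases a user-supplied lower bound of the log-likelihood via a numerical gradient. It demonstrates only the mechanism of treating the feature parameter as a manipulated variable alongside the weighting factors; it is not the update rule of Equations 14 to 17.

    def hypothetical_reward(params, x1, x2, x3, lam3=1.0):
        # params: {"lam1": ..., "lam2": ..., "p": ...}; lam3 is treated as fixed here.
        return params["lam1"] * x1 + params["lam2"] * max(0.0, x2 - params["p"]) + lam3 * x3

    def update_weights_and_feature_parameter(params, lower_bound_fn, lr=0.01, eps=1e-4):
        # lower_bound_fn(params) returns the lower bound of the log-likelihood.
        # Both the weighting factors and the feature parameter p are updated.
        updated = dict(params)
        for key in params:
            plus, minus = dict(params), dict(params)
            plus[key] = params[key] + eps
            minus[key] = params[key] - eps
            gradient = (lower_bound_fn(plus) - lower_bound_fn(minus)) / (2.0 * eps)
            updated[key] = params[key] + lr * gradient
        return updated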
(Flow of the information processing method S3)
Next, the flow of the information processing method S3 according to this exemplary embodiment will be described with reference to FIG. 6. FIG. 6 is a flowchart showing the flow of the information processing method S3 according to this exemplary embodiment.
(Step S31)
In step S31, the acquisition unit 11 acquires the reference data RD via the input unit 32 or the communication unit 34. The acquisition unit 11 stores the acquired reference data RD in the storage unit 31. Since the reference data RD has been described above, its description is omitted here.
(Step S32)
In step S32, the determination unit 12 initializes, among the parameters included in the reward function, the weighting factors and the feature parameters that are the manipulated variables in the inverse reinforcement learning. As an example, the determination unit 12 may initialize these weighting factors and feature parameters using initial values stored in the storage unit 31.
(Step S33)
In step S33, the determination unit 12 performs mathematical optimization so as to minimize the Wasserstein distance. As an example, the determination unit 12 estimates the trajectory that minimizes the Wasserstein distance, which represents the distance between the probability distribution of the expert trajectories and the probability distribution of the trajectories determined based on the parameters of the reward function.
(Step S34)
In step S34, the determination unit 12 updates the parameter θ of the reward function and the feature parameter FP so as to maximize the log-likelihood of the Boltzmann distribution derived from the maximum entropy principle. Since a specific example of this update processing has been described above, its description is omitted here.
(Step S35)
In step S35, the determination unit 12 determines whether the lower bound of the log-likelihood has converged. If it determines that the lower bound of the log-likelihood has converged (YES in S35), the process proceeds to step S36; otherwise (NO in S35), the process returns to step S33.
(Step S36)
If the determination unit 12 determines in step S35 that the lower bound of the log-likelihood has converged, the determination unit 12 outputs the reward function in step S36.
The parameters included in the reward function output by the determination unit 12 (the weighting factor WF and the feature parameter FP) are, as one example, stored in the storage unit 31.
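Putting steps S31 to S36 together, the overall learning loop can be sketched as follows. The callables estimate_trajectories and update_parameters stand for the Wasserstein-minimizing trajectory estimation and the lower-bound-maximizing update described above; they are placeholders, not interfaces defined in this document.

    def inverse_rl_loop(reference_data, init_weights, init_feature_params,
                        estimate_trajectories, update_parameters,
                        max_iterations=1000, tol=1e-3):
        # S32: initialize the manipulated variables (weighting factors and feature parameters).
        weights, feature_params = init_weights, init_feature_params
        for t in range(max_iterations):
            # S33: estimate trajectories minimizing the Wasserstein distance to the
            #      distribution of the expert (reference) trajectories.
            trajectories = estimate_trajectories(weights, feature_params, reference_data)
            # S34: update the weighting factors and the feature parameters so as to
            #      maximize the lower bound of the Boltzmann-distribution log-likelihood.
            weights, feature_params, lower_bound = update_parameters(
                weights, feature_params, trajectories, reference_data, t)
            # S35: convergence test on the lower bound of the log-likelihood.
            if abs(lower_bound) < tol:
                break
        # S36: output the parameters of the learned reward function.
        return weights, feature_params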
(Display example)
Next, a display example by the information processing device 3 according to this exemplary embodiment will be described with reference to FIG. 7. As described above, the output unit 33 may include a display panel (display unit) and display various kinds of information on the display panel. The information displayed on the display panel may include information corresponding to at least one of the weighting factor WF, the feature parameter FP, and the reward function. The display content displayed by the output unit 33 is, as one example, generated by the display control unit 13.
FIG. 7 is a diagram showing a display example generated by the display control unit 13. As shown in FIG. 7, a display screen may be generated that shows the relationship between the number of steps and the values of at least some of the manipulated parameters (the weighting factor WF and the feature parameter FP). In other words, a display screen may be generated that shows how the values of the manipulated parameters change as the number of steps of the update processing increases. The example shown in FIG. 7 is a display screen showing the relationship among the number of steps, the weighting factor λ1, and the feature parameter.
By displaying information corresponding to at least one of the weighting factor WF, the feature parameter FP, and the reward function as described above, the information processing device 3 according to this exemplary embodiment can suitably present to the user whether the inverse reinforcement learning is proceeding appropriately.
[Exemplary Embodiment 3]
A third exemplary embodiment of the present invention will be described in detail with reference to the drawings. Components having the same functions as those described in Exemplary Embodiments 1 and 2 are denoted by the same reference signs, and their description is omitted as appropriate.
(Configuration of the information processing device 4)
The configuration of the information processing device 4 according to this exemplary embodiment will be described with reference to FIG. 8. FIG. 8 is a block diagram showing the configuration of the information processing device 4 according to this exemplary embodiment.
As shown in FIG. 8, the information processing device 4 includes a control unit 45 instead of the control unit 35 included in the information processing device 3. In addition to the components of the control unit 35, the control unit 45 includes a generation unit 14.
As shown in FIG. 8, the information processing device 4 also includes a storage unit 41 instead of the storage unit 31 included in the information processing device 3. In addition to the various kinds of information stored in the storage unit 31, the storage unit 41 stores target data TD.
The acquisition unit 11 included in the information processing device 4 further acquires the target data TD in addition to the various kinds of data acquired by the acquisition unit 11 according to Exemplary Embodiment 2. The acquired target data TD is, as one example, stored in the storage unit 41 described above.
Here, in this exemplary embodiment, the target data TD includes at least part of state data representing the state of a certain system and action data representing actions taken by a specific expert in that state.
As an example, the target data TD can be represented by {s1, s2, ..., sN}, and the data MD to be manipulated can be represented by {a1, a2, ..., aN}. Here, si (i = 1 to N) represents state data indicating the state of the system, and ai (i = 1 to N) represents action data that can be selected in the state indicated by the state data si.
(Generation unit 14)
The generation unit 14 included in the information processing device 4 maximizes a reward function whose explanatory variables are the target data TD acquired by the acquisition unit 11 and the data MD to be manipulated, by manipulating the data MD. In other words, the information processing device 4 solves an optimization problem with the target data as input and generates, as output data, the manipulated data that maximizes the reward function.
In other words, the generation unit 14 generates output data corresponding to the target data by solving an optimization problem using (i) a reward function that includes a weighting factor and a feature parameter and that has been determined by inverse reinforcement learning in which the feature parameter is included among the manipulated variables, and (ii) the target data acquired by the acquisition unit 11.
Here, the reward function determined by the determination unit 12 through the processing described in Exemplary Embodiment 2 can be used as the above reward function.
As described above, the information processing device 4 according to this exemplary embodiment employs a configuration including the acquisition unit 11, which acquires target data, and the generation unit 14, which generates output data corresponding to the target data by solving an optimization problem using the target data acquired by the acquisition unit 11 and a reward function that includes a weighting factor and a feature parameter and that has been determined by inverse reinforcement learning in which the feature parameter is included among the manipulated variables.
Therefore, according to the information processing device 4 of this exemplary embodiment, the optimization problem is solved using a reward function determined by inverse reinforcement learning in which the feature parameter is included among the manipulated variables, so that output data maximizing a more appropriate reward function can be generated.
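As a hypothetical illustration of this generation step, the sketch below fixes the target data (here x1 and x2) and selects, from a finite candidate set, the value of the manipulated variable x3 that maximizes the learned reward; the variable names follow the example in Display example 1 below and are not part of this document's definitions.

    def generate_output(reward_fn, x1, x2, candidate_x3):
        # reward_fn(x1, x2, x3) is the learned reward function; x1 and x2 are fixed
        # target data; candidate_x3 is the set of admissible values of the manipulated data.
        return max(candidate_x3, key=lambda x3: reward_fn(x1, x2, x3))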
(Display example 1)
Next, a first display example by the information processing device 4 according to this exemplary embodiment will be described with reference to FIG. 9. As described above, the output unit 33 may include a display panel (display unit) and display various kinds of information on the display panel. In this exemplary embodiment, the information displayed on the display panel may include at least part of the data included in the output data generated by the generation unit 14.
FIG. 9 is a diagram showing a display example generated by the display control unit 13. The example shown in FIG. 9 is a display screen generated by the display control unit 13 in a case where:
- the reward function is given by Equation 12 described in Exemplary Embodiment 2,
- the weighting factors λ1, λ2, and λ3 and the feature parameter have been determined by inverse reinforcement learning by the determination unit 12,
- the target data TD includes x1 and x2, and
- the generation unit 14 has generated output data corresponding to the target data by solving the optimization problem with x3 as the data to be manipulated.
As shown in FIG. 9, the display screen generated by the display control unit 13 includes the values of the explanatory variables x1 and x2 included in the target data TD, and the value of the manipulated data x3 determined by the generation unit 14 in accordance with those values (that is, the value recommended to the user).
By displaying the output data generated by the generation unit 14 as described above, the information processing device 4 according to this exemplary embodiment can suitably present the solution of the optimization problem to the user.
(Display example 2)
Next, a second display example by the information processing device 4 according to this exemplary embodiment will be described with reference to FIG. 10. In this example, the acquisition unit 11 receives, from the user via the input unit 32, an input of at least one of an explanatory variable, a weighting factor, and a feature parameter. Then, as shown in the upper part of FIG. 10, the display control unit 13 may display the value of at least one of the explanatory variable, the weighting factor, and the feature parameter input by the user so that it can be compared with the value of at least one of the explanatory variables, the weighting factors, and the feature parameters obtained by inverse reinforcement learning using the reference data RD of one or more experts.
The display control unit 13 may also generate a GUI (Graphical User Interface) including operation objects that the user can operate, and cause the output unit 33 to display it. Such a GUI is shown in the lower left part of FIG. 10. By sliding a bar included in the GUI, the value of at least one of the explanatory variable, the weighting factor, and the feature parameter corresponding to that bar can be changed.
The display control unit 13 may also rank at least one of the explanatory variables, the weighting factors, and the feature parameters, and display these variables together with their ranks.
<Application example>
An application example of the information processing device 4 according to this exemplary embodiment will be described below with reference to FIG. 11.
In this application example, the information processing device 4 generates an operation plan relating to a water distribution plan for water supply infrastructure. The water supply infrastructure according to this exemplary embodiment includes, as an example, a plurality of sites such as reservoirs, distribution reservoirs, water intake facilities, water purification plants, water supply stations, and demand points. The operation plan includes, as an example, information indicating the operating patterns of the pumps at each site.
(Acquisition unit 11)
The acquisition unit 11 acquires the target data TD and the reference data RD. As an example, the acquisition unit 11 acquires the target data TD and the reference data RD from another device via the communication unit 34. As another example, the acquisition unit 11 may acquire the target data TD and the reference data RD input via the input unit 32. The acquisition unit 11 may also acquire the target data TD and the reference data RD by reading them from the storage unit 41 or from an externally connected storage device. Details of the target data TD and the reference data RD according to this example will be described later.
(Determination unit 12)
The determination unit 12 determines, by inverse reinforcement learning referring to the reference data RD, the reward function used in the optimization problem for generating the operation plan OP relating to the target water distribution plan. As described above, the inverse reinforcement learning of the reward function includes update processing in which the weighting factor WF and the feature parameter FP are the manipulated variables.
(Generation unit 14)
The generation unit 14 generates the operation plan OP relating to the target water distribution plan by solving an optimization problem using the reward function, which has been determined by inverse reinforcement learning using the reference data RD relating to a reference water distribution plan, and the target data TD acquired by the acquisition unit 11. The processing by which the generation unit 14 generates the operation plan OP will be described later.
(Storage unit 41)
The storage unit 41 stores the target data TD and the reference data RD acquired by the acquisition unit 11. The storage unit 41 also stores the operation plan OP generated by the generation unit 14. The storage unit 41 further stores the reward function determined by the determination unit 12 and the constraint conditions LC. Here, storing the reward function in the storage unit 41 means that the parameters defining the reward function are stored in the storage unit 41.
(Target data TD)
The target data TD is data used by the generation unit 14 to generate the operation plan OP. The target data TD includes information indicating the state of the target water supply infrastructure. As an example, the target data TD includes information on at least one of the pumps, the distribution network, the water pipelines, and the demand points in the target water supply infrastructure.
Specifically, the target data TD includes, as an example, at least one of the following data (i) to (x) for the water supply infrastructure that is the target of the operation plan. However, the data included in the target data TD are not limited to these and may include other data.
(i) power consumption at each site, (ii) demand forecast margin, (iii) distribution reservoir margin, (iv) water distribution loss, (v) number of operating personnel at each site, (vi) electricity rates at each site, (vii) voltage at each site, (viii) water level at each site, (ix) water pressure at each site, (x) water volume at each site.
(i) The power consumption at each site indicates the power consumption at each site such as water purification plants and water supply stations. (ii) The demand forecast margin indicates the extent to which supply exceeds demand. (iii) The distribution reservoir margin indicates the extent to which the designed storage volume of a distribution reservoir exceeds the actual amount of stored water. (iv) The water distribution loss indicates the extent to which water is not being distributed to each demand point. (v) The number of operating personnel indicates the number of operating personnel at each site.
(Reference data RD)
The reference data RD is data used when the determination unit 12 determines the reward function. The reference data RD includes information representing the state of reference water supply infrastructure. Here, the reference water supply infrastructure may be the same as or different from the water supply infrastructure for which the operation plan is generated. More specifically, the reference data RD includes, as an example, information on at least one of the pumps, the distribution network, the water pipelines, and the demand points in the reference water supply infrastructure. The reference data RD also includes, as an example, information on at least one of the pump operating patterns and the personnel in the reference water supply infrastructure. Each item included in the reference data RD may be treated as state data or as action data.
Specifically, the reference data RD includes, as an example, at least one of the following data (i) to (x) for the reference water supply infrastructure. However, the data included in the reference data RD are not limited to these and may include other data.
(i) power consumption at each site, (ii) demand forecast margin, (iii) distribution reservoir margin, (iv) water distribution loss, (v) number of operating personnel at each site, (vi) electricity rates at each site, (vii) voltage at each site, (viii) water level at each site, (ix) water pressure at each site, (x) water volume at each site.
The reference data RD also includes, as an example, data indicating an operation plan created by an expert for the reference water supply infrastructure. More specifically, the reference data RD includes, as an example, data represented by variables controlled on the basis of operation rules, such as the opening and closing of valves, the intake of water, and pump thresholds. Such data can also be said to represent the decision-making history (the intention) of the expert or the like who created the reference operation plan.
(Operation plan OP)
The operation plan OP includes, as an example, information on the operating patterns of the pumps in the target water supply infrastructure. The operation plan OP also includes, as an example, information on the personnel involved in the target water supply infrastructure.
(Reward function)
The reward function includes cost terms each including a variable corresponding to each item included in the reference data RD. The general form of the reward function is as described in the exemplary embodiments above.
(Constraint conditions LC)
The constraint conditions LC are the constraints of the optimization problem solved by the generation unit 14. The constraint conditions LC include, for example, the following (i) to (iv). However, the constraint conditions LC are not limited to these and may include other conditions.
(i) The amount of water stored in each reservoir/distribution reservoir is at least a threshold X and less than a threshold Y.
(ii) The supply exceeds the demand by at least X%.
(iii) Water is distributed to all demand points.
(iv) Routes under construction are not used.
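A hypothetical checker for constraints (i) to (iv) could look like the sketch below; the threshold values, the percentage margin, and the dictionary layout of the plan are assumptions for illustration only.

    def satisfies_constraints(plan, min_storage, max_storage, supply_margin_ratio):
        # (i) storage of every reservoir/distribution reservoir is in [min_storage, max_storage)
        if not all(min_storage <= s < max_storage for s in plan["storage_levels"]):
            return False
        # (ii) supply exceeds demand by at least the given margin
        if plan["supply"] < plan["demand"] * (1.0 + supply_margin_ratio):
            return False
        # (iii) every demand point receives water
        if not all(plan["delivered"].get(point, 0.0) > 0.0 for point in plan["demand_points"]):
            return False
        # (iv) routes under construction are not used
        if any(route in plan["routes_in_use"] for route in plan["routes_under_construction"]):
            return False
        return True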
<Processing executed by the determination unit 12>
The determination unit 12 determines, by inverse reinforcement learning referring to the reference data RD, the reward function used in the optimization problem for generating the operation plan relating to the target water distribution plan. As an example, the determination unit 12 determines the weighting factors of the cost terms included in the reward function and the feature parameters characterizing those cost terms by inverse reinforcement learning using the state data and the action data included in the reference data RD. An example of the inverse reinforcement learning performed by the determination unit 12 is as described above.
The determination unit 12 also outputs the determined reward function. The determination unit 12 may output the reward function by writing it to the storage unit 41 or to an external storage device, or may output it to the output unit 33.
<Processing executed by the generation unit 14>
The generation unit 14 generates the operation plan OP relating to the target water distribution plan by solving, under the constraint conditions LC, an optimization problem using the reward function and the target data TD. In this exemplary embodiment, the generation unit 14 generates the operation plan OP relating to the target water distribution plan by solving an optimization problem using the reward function, in which the target data TD acquired by the acquisition unit 11 are treated as fixed variables and, among the variables included in the cost terms of the reward function, the variables other than the fixed variables are treated as manipulated variables.
The generation unit 14 also outputs the generated operation plan OP. The generation unit 14 may output the operation plan OP by writing it to the storage unit 41 or to an external storage device, or may output it to the output unit 33.
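The sketch below illustrates, under strong simplifying assumptions, how such an optimization could be posed: pump on/off patterns are enumerated by brute force, infeasible patterns are discarded using a constraint checker such as the one sketched earlier, and the pattern maximizing the learned reward is returned. Exhaustive enumeration is only for illustration; this document does not specify the solver, and the callables passed in are placeholders.

    import itertools

    def generate_operation_plan(reward_fn, target_data, n_pumps, n_time_slots, feasible):
        # reward_fn(target_data, pattern) evaluates the learned reward for a candidate
        # pump pattern; feasible(target_data, pattern) checks the constraint conditions LC.
        best_pattern, best_reward = None, float("-inf")
        for pattern in itertools.product((0, 1), repeat=n_pumps * n_time_slots):
            if not feasible(target_data, pattern):
                continue
            reward = reward_fn(target_data, pattern)
            if reward > best_reward:
                best_pattern, best_reward = pattern, reward
        return best_pattern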
<Setting of the optimization problem>
FIG. 11 is a diagram for explaining a specific example of the setting of the optimization problem according to this example. The operation plan OP needs to be determined in consideration of various viewpoints, such as how much margin to keep over the forecast demand, how much to reduce power consumption, and how much to take the water level of the distribution reservoir into account. Setting the weights of these viewpoints is difficult, because which viewpoint is emphasized, and to what extent, varies among the operators who run the water supply infrastructure and is not uniquely determined. For example, municipality A, the creator of a certain operation plan, may place emphasis on power consumption, whereas municipality B may place emphasis on the water level of the distribution reservoir.
In this exemplary embodiment, the generation unit 14 solves, under the constraint conditions LC, an optimization problem using the target data TD and a reward function whose cost-term weighting factors and feature parameters have been determined by inverse reinforcement learning referring to the reference data RD. Because these weighting factors and feature parameters are determined by inverse reinforcement learning referring to the reference data RD, they are values that reflect the action data included in the reference data RD, that is, the intention of the expert or the like who created the reference operation plan. By solving the optimization problem using a reward function including such weighting factors and feature parameters, it is possible to generate an operation plan that reflects the intention of the expert or the like who created the reference operation plan.
For example, in the example of FIG. 11, the weighting factors α1 to α6 and the feature parameters included in the reward function used to generate the operation plan OP of municipality A are values that reflect the intention of the expert or the like who created the reference operation plan used to determine that reward function. Similarly, the weighting factors α1 to α6 and the feature parameters included in the reward function used to generate the operation plan OP of municipality B are values that reflect the intention of the expert or the like who created the reference operation plan used to determine that reward function. By comparing the weighting factors and feature parameters of municipality A with those of municipality B, it becomes easier to grasp which viewpoints each municipality emphasizes.
For example, the determination unit 12 may determine the reward function by referring to reference data RD including an operation plan created by expert a1 in municipality A, and the generation unit 14 may generate a future operation plan OP using the reward function determined by the determination unit 12 and the target data TD of municipality A. In this case, the generation unit 14 can generate a future operation plan OP for municipality A that reflects the intention of expert a1.
According to this exemplary embodiment, the intention of the creator of an operation plan in one municipality can also be reflected in the operation plan of another municipality. For example, the determination unit 12 may determine the reward function by referring to reference data RD including an operation plan created by expert a1 in municipality A, and the generation unit 14 may generate a future operation plan OP using the reward function determined by the determination unit 12 and the target data TD of municipality B. In this case, the generation unit 14 can generate an operation plan OP for municipality B that reflects the intention of expert a1.
[Example of implementation by software]
Some or all of the functions of the information processing devices 1, 2, 3, and 4 may be implemented by hardware such as an integrated circuit (IC chip), or may be implemented by software.
In the latter case, the information processing devices 1, 2, 3, and 4 are implemented by, for example, a computer that executes the instructions of a program, which is software implementing each function. An example of such a computer (hereinafter referred to as computer C) is shown in FIG. 12. The computer C includes at least one processor C1 and at least one memory C2. The memory C2 stores a program P for causing the computer C to operate as the information processing devices 1, 2, 3, and 4. In the computer C, the processor C1 reads the program P from the memory C2 and executes it, thereby implementing each function of the information processing devices 1, 2, 3, and 4.
As the processor C1, for example, a CPU (Central Processing Unit), a GPU (Graphic Processing Unit), a DSP (Digital Signal Processor), an MPU (Micro Processing Unit), an FPU (Floating point number Processing Unit), a PPU (Physics Processing Unit), a microcontroller, or a combination of these can be used. As the memory C2, for example, a flash memory, an HDD (Hard Disk Drive), an SSD (Solid State Drive), or a combination of these can be used.
The computer C may further include a RAM (Random Access Memory) for loading the program P at the time of execution and for temporarily storing various data. The computer C may further include a communication interface for transmitting and receiving data to and from other devices. The computer C may further include an input/output interface for connecting input/output devices such as a keyboard, a mouse, a display, and a printer.
The program P can be recorded on a non-transitory tangible recording medium M readable by the computer C. As such a recording medium M, for example, a tape, a disk, a card, a semiconductor memory, or a programmable logic circuit can be used. The computer C can acquire the program P via such a recording medium M. The program P can also be transmitted via a transmission medium. As such a transmission medium, for example, a communication network or broadcast waves can be used. The computer C can also acquire the program P via such a transmission medium.
[Additional remarks 1]
The present invention is not limited to the embodiments described above, and various modifications are possible within the scope of the claims. For example, embodiments obtained by appropriately combining the technical means disclosed in the embodiments described above are also included in the technical scope of the present invention.
[Additional remarks 2]
Some or all of the embodiments described above may also be described as follows. However, the present invention is not limited to the aspects described below.
(Appendix 1)
An information processing device comprising: acquisition means for acquiring reference data; and determination means for determining a reward function including a weighting factor and a feature parameter by inverse reinforcement learning that uses the reference data and includes the feature parameter among the manipulated variables.
(Appendix 2)
The information processing device according to Appendix 1, wherein the reward function includes one or more cost terms each including a feature expressed using explanatory variables and the weighting factor representing the weight of that feature, and at least one of the one or more cost terms includes, together with the explanatory variables, the feature parameter characterizing that cost term.
(Appendix 3)
The information processing device according to Appendix 2, wherein the manipulated variables in the inverse reinforcement learning by the determination means include the weighting factor included in at least one of the one or more cost terms.
(Appendix 4)
The information processing device according to any one of Appendices 1 to 3, wherein the inverse reinforcement learning by the determination means includes update processing for updating the manipulated variables so as to maximize a lower bound of a log-likelihood expressed using the reward function.
(Appendix 5)
The information processing device according to Appendix 4, wherein the lower bound of the log-likelihood is expressed using a Wasserstein distance representing the distance between a reference probability distribution and a probability distribution expressed using the reward function, and a regularization term representing the difference between the maximum value of the reward function and the average value of the reward function.
(Appendix 6)
The information processing device according to Appendix 5, wherein the lower bound of the log-likelihood includes a damping coefficient that is multiplied by the regularization term and that attenuates the contribution of the regularization term as the update processing is repeated.
 (Appendix 7)
 The information processing apparatus according to any one of Appendices 1 to 6, further comprising first display means for displaying information corresponding to at least one of the weighting factor, the feature amount parameter, and the reward function.
 (Appendix 8)
 The information processing apparatus according to any one of Appendices 1 to 7, wherein the acquisition means further acquires target data, and the information processing apparatus further comprises generating means for generating output data corresponding to the target data by solving an optimization problem using the reward function determined by the determining means and the target data acquired by the acquisition means.
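 Appendix 8 (and likewise Appendix 10) uses the learned reward function in the forward direction: given newly acquired target data, output data is generated by solving an optimization problem built from the reward function and that data. The sketch below simply maximizes the learned reward over candidate outputs with SciPy; the choice of decision variables, the unconstrained formulation, and the assumed `reward_fn(x, target_data)` signature are illustrative assumptions, and a real application may impose its own constraints.

```python
import numpy as np
from scipy.optimize import minimize

def generate_output(target_data, reward_fn):
    """Generate output data for the given target data by maximizing the learned
    reward function (equivalently, minimizing its negative).

    reward_fn is assumed to close over the learned weighting factors and
    feature amount parameters and to take (candidate_output, target_data)."""
    x0 = np.asarray(target_data, dtype=float)  # start the search at the target data

    def objective(x):
        return -reward_fn(x, target_data)

    result = minimize(objective, x0, method="L-BFGS-B")
    return result.x  # output data corresponding to the target data
```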
 (Appendix 9)
 The information processing apparatus according to Appendix 8, further comprising second display means for displaying the output data.
 (Appendix 10)
 An information processing apparatus comprising: acquisition means for acquiring target data; and generating means for generating output data corresponding to the target data by solving an optimization problem using the target data acquired by the acquisition means and a reward function that includes a weighting factor and a feature amount parameter and that has been determined by inverse reinforcement learning including the feature amount parameter as an operation target.
 (Appendix 11)
 An information processing method using an information processing apparatus, the method comprising: acquiring reference data; and determining a reward function including a weighting factor and a feature amount parameter by inverse reinforcement learning that uses the reference data and includes the feature amount parameter as an operation target.
 (Appendix 12)
 An information processing method using an information processing apparatus, the method comprising: acquiring target data; and generating output data corresponding to the target data by solving an optimization problem using the acquired target data and a reward function that includes a weighting factor and a feature amount parameter and that has been determined by inverse reinforcement learning including the feature amount parameter as an operation target.
 (Appendix 13)
 A program for causing a computer to function as an information processing apparatus, the program causing the computer to function as: acquisition means for acquiring reference data; and determining means for determining a reward function including a weighting factor and a feature amount parameter by inverse reinforcement learning that uses the reference data and includes the feature amount parameter as an operation target.
 (Appendix 14)
 A program for causing a computer to function as an information processing apparatus, the program causing the computer to function as: acquisition means for acquiring target data; and generating means for generating output data corresponding to the target data by solving an optimization problem using the target data acquired by the acquisition means and a reward function that includes a weighting factor and a feature amount parameter and that has been determined by inverse reinforcement learning including the feature amount parameter as an operation target.
 [Additional Remark 3]
 Some or all of the above-described embodiments may also be expressed as follows.
 An information processing apparatus comprising at least one processor, the processor executing: an acquisition process of acquiring reference data; and a determination process of determining a reward function including a weighting factor and a feature amount parameter by inverse reinforcement learning that uses the reference data and includes the feature amount parameter as an operation target.
 The information processing apparatus may further comprise a memory, and the memory may store a program for causing the processor to execute the acquisition process and the determination process. The program may also be recorded on a computer-readable, non-transitory, tangible recording medium.
 An information processing apparatus comprising at least one processor, the processor executing: an acquisition process of acquiring target data; and a generation process of generating output data corresponding to the target data by solving an optimization problem using the target data acquired in the acquisition process and a reward function that includes a weighting factor and a feature amount parameter and that has been determined by inverse reinforcement learning including the feature amount parameter as an operation target.
 The information processing apparatus may further comprise a memory, and the memory may store a program for causing the processor to execute the acquisition process and the generation process. The program may also be recorded on a computer-readable, non-transitory, tangible recording medium.
 1, 2, 3, 4  ... information processing apparatus
 11          ... acquisition unit (acquisition means)
 12          ... determination unit (determining means)
 13          ... display control unit (display means)
 22, 14      ... generation unit (generating means)

Claims (14)

  1.  An information processing apparatus comprising:
     acquisition means for acquiring reference data; and
     determining means for determining a reward function including a weighting factor and a feature amount parameter by inverse reinforcement learning that uses the reference data and includes the feature amount parameter as an operation target.
  2.  The information processing apparatus according to claim 1, wherein
     the reward function includes one or more cost terms each including a feature amount represented using explanatory variables and the weighting factor representing a weight of the feature amount, and
     at least one of the one or more cost terms includes, together with the explanatory variables, the feature amount parameter that characterizes that cost term.
  3.  The information processing apparatus according to claim 2, wherein the operation targets in the inverse reinforcement learning by the determining means include the weighting factor included in at least one of the one or more cost terms.
  4.  The information processing apparatus according to any one of claims 1 to 3, wherein the inverse reinforcement learning by the determining means includes an update process of updating the operation targets so as to maximize a lower bound of a log-likelihood expressed using the reward function.
  5.  The information processing apparatus according to claim 4, wherein the lower bound of the log-likelihood is expressed using a Wasserstein distance representing a distance between a reference probability distribution and a probability distribution expressed using the reward function, and a regularization term representing a difference between a maximum value of the reward function and an average value of the reward function.
  6.  The information processing apparatus according to claim 5, wherein the lower bound of the log-likelihood includes a decay coefficient that is multiplied by the regularization term and that attenuates the contribution of the regularization term as the update process is repeated.
  7.  The information processing apparatus according to any one of claims 1 to 6, further comprising first display means for displaying information corresponding to at least one of the weighting factor, the feature amount parameter, and the reward function.
  8.  The information processing apparatus according to any one of claims 1 to 7, wherein
     the acquisition means further acquires target data, and
     the information processing apparatus further comprises generating means for generating output data corresponding to the target data by solving an optimization problem using the reward function determined by the determining means and the target data acquired by the acquisition means.
  9.  The information processing apparatus according to claim 8, further comprising second display means for displaying the output data.
  10.  An information processing apparatus comprising:
     acquisition means for acquiring target data; and
     generating means for generating output data corresponding to the target data by solving an optimization problem using the target data acquired by the acquisition means and a reward function that includes a weighting factor and a feature amount parameter and that has been determined by inverse reinforcement learning including the feature amount parameter as an operation target.
  11.  An information processing method using an information processing apparatus, the method comprising:
     acquiring reference data; and
     determining a reward function including a weighting factor and a feature amount parameter by inverse reinforcement learning that uses the reference data and includes the feature amount parameter as an operation target.
  12.  An information processing method using an information processing apparatus, the method comprising:
     acquiring target data; and
     generating output data corresponding to the target data by solving an optimization problem using the acquired target data and a reward function that includes a weighting factor and a feature amount parameter and that has been determined by inverse reinforcement learning including the feature amount parameter as an operation target.
  13.  A program for causing a computer to function as an information processing apparatus, the program causing the computer to function as:
     acquisition means for acquiring reference data; and
     determining means for determining a reward function including a weighting factor and a feature amount parameter by inverse reinforcement learning that uses the reference data and includes the feature amount parameter as an operation target.
  14.  A program for causing a computer to function as an information processing apparatus, the program causing the computer to function as:
     acquisition means for acquiring target data; and
     generating means for generating output data corresponding to the target data by solving an optimization problem using the target data acquired by the acquisition means and a reward function that includes a weighting factor and a feature amount parameter and that has been determined by inverse reinforcement learning including the feature amount parameter as an operation target.

PCT/JP2022/003100 2022-01-27 2022-01-27 Information processing device, information processing method, and program WO2023144961A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/003100 WO2023144961A1 (en) 2022-01-27 2022-01-27 Information processing device, information processing method, and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/003100 WO2023144961A1 (en) 2022-01-27 2022-01-27 Information processing device, information processing method, and program

Publications (1)

Publication Number Publication Date
WO2023144961A1 true WO2023144961A1 (en) 2023-08-03

Family

ID=87471309

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2022/003100 WO2023144961A1 (en) 2022-01-27 2022-01-27 Information processing device, information processing method, and program

Country Status (1)

Country Link
WO (1) WO2023144961A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020065808A1 (en) * 2018-09-27 2020-04-02 日本電気株式会社 Information processing device and system, and non-temporary computer-readable medium for storing model adaptation method and program
JP2021033466A (en) * 2019-08-20 2021-03-01 国立大学法人電気通信大学 Encoding device, decoding device, parameter learning device, and program

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020065808A1 (en) * 2018-09-27 2020-04-02 日本電気株式会社 Information processing device and system, and non-temporary computer-readable medium for storing model adaptation method and program
JP2021033466A (en) * 2019-08-20 2021-03-01 国立大学法人電気通信大学 Encoding device, decoding device, parameter learning device, and program

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
NAKAGUCHI YUKI, ETO RIKI, NISHIOKA ITARU: "Construction of Inverse Reinforcement Dynamics Learning Framework based on Maximum Entropy Principle", THE 33RD ANNUAL CONFERENCE OF THE JAPANESE SOCIETY FOR ARTIFICIAL INTELLIGENCE, THE JAPANESE SOCIETY FOR ARTIFICIAL INTELLIGENCE, 1 January 2019 (2019-01-01), XP093080785, DOI: 10.11517/pjsai.JSAI2019.0_1Q2J204 *

Similar Documents

Publication Publication Date Title
JP5768834B2 (en) Plant model management apparatus and method
EP3446260B1 (en) Memory-efficient backpropagation through time
US8849737B1 (en) Prediction method of predicting a future state of a system
EP2981866B1 (en) Methods and systems for reservoir history matching for improved estimation of reservoir performance
Matarazzo et al. STRIDE for structural identification using expectation maximization: Iterative output-only method for modal identification
Rarità et al. Numerical schemes and genetic algorithms for the optimal control of a continuous model of supply chains
US20090172057A1 (en) Computer system for predicting the evolution of a chronological set of numerical values
CN102023570A (en) Method for computer-supported learning of a control and/or regulation of a technical system
CN107239589A (en) Reliability of slope analysis method based on MRVM AFOSM
US20170016354A1 (en) Output efficiency optimization in production systems
Wan Ahmad et al. Arima model and exponential smoothing method: A comparison
JP2020086778A (en) Machine learning model construction device and machine learning model construction method
Kanjilal et al. Cross entropy-based importance sampling for first-passage probability estimation of randomly excited linear structures with parameter uncertainty
US20210342691A1 (en) System and method for neural time series preprocessing
JP7497516B2 (en) A projection method for imposing equality constraints on algebraic models.
WO2023144961A1 (en) Information processing device, information processing method, and program
US8700686B1 (en) Robust estimation of time varying parameters
NO20200978A1 (en) Optimized methodology for automatic history matching of a petroleum reservoir model with ensemble kalman filter
Kim et al. Direct use of design criteria in genetic algorithm‐based controller optimization
WO2020180303A1 (en) Reservoir simulation systems and methods to dynamically improve performance of reservoir simulations
CN115577787A (en) Quantum amplitude estimation method, device, equipment and storage medium
KR102261055B1 (en) Method and system for optimizing design parameter of image to maximize click through rate
WO2016017171A1 (en) Flow rate prediction device, mixing ratio estimation device, method, and computer-readable recording medium
KR20220078243A (en) Apparatus and method for predicting based on time series network data
Zhao et al. Reduction of Carbon Footprint of Dynamical System Simulation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22923826

Country of ref document: EP

Kind code of ref document: A1