WO2023144961A1 - Information processing device, information processing method, and program - Google Patents

Information processing device, information processing method, and program Download PDF

Info

Publication number
WO2023144961A1
Authority
WO
WIPO (PCT)
Prior art keywords
information processing
reward function
data
parameter
reinforcement learning
Prior art date
Application number
PCT/JP2022/003100
Other languages
French (fr)
Japanese (ja)
Inventor
力 江藤
Original Assignee
日本電気株式会社 (NEC Corporation)
Priority date
Filing date
Publication date
Application filed by 日本電気株式会社 (NEC Corporation)
Priority to PCT/JP2022/003100 priority Critical patent/WO2023144961A1/en
Publication of WO2023144961A1 publication Critical patent/WO2023144961A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning

Definitions

  • the present invention relates to an information processing device, an information processing method, and a program.
  • In reinforcement learning (RL), a policy is learned based on a reward function. Inverse reinforcement learning (IRL) is known as a method for generating this reward function.
  • Non-Patent Document 1 describes maximum entropy inverse reinforcement learning (ME-IRL: Maximum Entropy-IRL), which is one type of inverse reinforcement learning.
  • ME-IRL uses the maximum entropy principle to specify the trajectory distribution and learns the reward function by approximating the true distribution (i.e., maximum likelihood estimation).
  • Non-Patent Document 2 describes GCL (Guided Cost Learning), one method of inverse reinforcement learning that improves on maximum entropy inverse reinforcement learning.
  • In GCL, importance sampling is used to update the weights of the reward function.
  • However, both the techniques described in Non-Patent Document 1 and Non-Patent Document 2 have room for improvement in terms of generating an appropriate reward function.
  • One aspect of the present invention has been made in view of the above problems, and an example of its purpose is to provide a technique capable of generating a more appropriate reward function.
  • An information processing apparatus according to one aspect of the present invention includes acquisition means for acquiring reference data, and determination means for determining a reward function including a weighting factor and a feature amount parameter by inverse reinforcement learning that uses the reference data and that includes the feature amount parameter as an operation target.
  • An information processing apparatus according to one aspect of the present invention includes acquisition means for acquiring target data, and generation means for generating output data corresponding to the target data by solving an optimization problem using the target data acquired by the acquisition means and a reward function that includes a weighting factor and a feature amount parameter and that is determined by inverse reinforcement learning including the feature amount parameter as an operation target.
  • An information processing method according to one aspect of the present invention is an information processing method using an information processing apparatus, and includes acquiring reference data, and determining a reward function including a weighting factor and a feature amount parameter by inverse reinforcement learning that uses the reference data and that includes the feature amount parameter as an operation target.
  • An information processing method according to one aspect of the present invention is an information processing method using an information processing apparatus, and includes acquiring target data, and generating output data according to the target data by solving an optimization problem using the acquired target data and a reward function that includes a weighting factor and a feature amount parameter and that is determined by inverse reinforcement learning including the feature amount parameter as an operation target.
  • A program according to one aspect of the present invention causes a computer to function as an information processing apparatus including acquisition means for acquiring reference data, and determination means for determining a reward function including a weighting factor and a feature amount parameter by inverse reinforcement learning that uses the reference data and that includes the feature amount parameter as an operation target.
  • A program according to one aspect of the present invention causes a computer to function as an information processing apparatus including acquisition means for acquiring target data, and generation means for generating output data according to the target data by solving an optimization problem using the target data acquired by the acquisition means and a reward function that includes a weighting factor and a feature amount parameter and that is determined by inverse reinforcement learning including the feature amount parameter as an operation target.
  • According to one aspect of the present invention, a more appropriate reward function can be generated.
  • FIG. 1 is a block diagram showing the configuration of an information processing device according to exemplary Embodiment 1 of the present invention.
  • FIG. 2 is a flow diagram showing the flow of an information processing method according to exemplary Embodiment 1 of the present invention.
  • FIG. 3 is a block diagram showing the configuration of an information processing device according to exemplary Embodiment 1 of the present invention.
  • FIG. 4 is a flow diagram showing the flow of an information processing method according to exemplary Embodiment 1 of the present invention.
  • FIG. 5 is a block diagram showing the configuration of an information processing apparatus according to exemplary Embodiment 2 of the present invention.
  • FIG. 6 is a flow diagram showing the flow of an information processing method according to exemplary Embodiment 2 of the present invention.
  • FIG. 7 is a diagram showing a display example generated by a display control unit according to exemplary Embodiment 2 of the present invention.
  • FIG. 8 is a block diagram showing the configuration of an information processing apparatus according to exemplary Embodiment 3 of the present invention.
  • FIG. 9 is a diagram showing a display example generated by a display control unit according to exemplary Embodiment 3 of the present invention.
  • FIG. 10 is a diagram showing a second display example by the information processing apparatus according to exemplary Embodiment 3 of the present invention.
  • FIG. 11 is a diagram showing an application example of an information processing apparatus according to exemplary Embodiment 3 of the present invention.
  • FIG. 12 is a diagram showing an example of a computer that implements an information processing apparatus according to each exemplary embodiment of the present invention.
  • the information processing device 1 is a device that determines a reward function including weighting factors and feature amount parameters by inverse reinforcement learning using reference data.
  • reference data refers to data referred to by inverse reinforcement learning, and includes, as an example, a set of state data and action data.
  • the reference data may include state data representing the state of a certain system, and action data representing actions taken by a specific expert in that state.
  • As an example, the reference data τ can be represented as τ = ((s_1, a_1), (s_2, a_2), ..., (s_N, a_N)).
  • N is any natural number.
  • the reference data can include one or more sets of state data and action data, as an example.
  • some or all of the data included in the reference data are also called explanatory variables, which are arguments of the reward function.
  • The action data is not limited to data indicating actions taken by a specific expert, and may be data indicating actions taken by any subject who executes the actions associated with the action data a_i.
  • state data may be simply referred to as state
  • action data may simply be referred to as action
  • inverse reinforcement learning refers to learning for determining a reward function.
  • the reference data is referred to, and the reward function is determined by updating the feature parameter included in the reward function as an operation target.
  • the reference data may be referred to and the weighting factor included in the reward function may be updated as an operation target.
  • the reward function is, as an example, a function for evaluating the value of each of various actions.
  • the reward function includes a weighting factor and a feature amount parameter as parameters.
  • a weighting factor is, for example, a weight by which each of one or more feature quantities included in the reward function is multiplied.
  • a feature amount parameter is, for example, a parameter that characterizes one or more feature amounts included in a reward function.
  • x_1, x_2, and x_3 are variables each of which can correspond to either the state data (s_i) or the action data (a_i).
  • the explanatory variable itself may constitute the feature amount, or the function of the explanatory variable may constitute the feature amount.
  • The reward function has an inverse relationship to the cost function: the smaller the cost, the larger the reward.
  • The cost function includes one or more cost terms, each including a feature amount represented using explanatory variables and a weighting factor representing the weight of that feature amount. At least some of the one or more cost terms include a feature amount parameter that, together with the explanatory variables, characterizes the cost term.
  • As an example, the reward function can be expressed as Reward = θ^T f, where T represents the transpose of the vector.
  • θ is sometimes called a weighting coefficient vector or simply a parameter, and f is sometimes called a feature amount vector.
  • As an example, the information processing device 1 determines the reward function Reward by updating the feature amount parameter included in the reward function with reference to the reference data τ.
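  • To make the structure of such a reward function concrete, the following is a minimal Python sketch (not taken from the publication) of a reward of the form Reward = θ^T f, in which one of the feature amounts depends on a feature amount parameter; the specific feature functions and values are hypothetical.

```python
import numpy as np

def features(x, phi):
    """Feature amount vector f(x; phi).

    x   : explanatory variables (state/action data), here three values
    phi : feature amount parameter characterizing one of the feature amounts
          (hypothetical example; the actual features are not given here)
    """
    x1, x2, x3 = x
    return np.array([
        x1,                      # the explanatory variable itself as a feature amount
        np.tanh(phi * x2) * x3,  # a feature amount characterized by the parameter phi
    ])

def reward(x, theta, phi):
    """Reward = theta^T f(x; phi); the smaller the cost, the larger the reward."""
    return theta @ features(x, phi)

theta = np.array([0.5, 1.5])   # weighting coefficient vector
phi = 0.8                      # feature amount parameter
print(reward(np.array([1.0, 2.0, 0.5]), theta, phi))
```

  • In the inverse reinforcement learning described here, both the weighting coefficient vector and the feature amount parameter would be treated as operation targets and updated from the reference data.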
  • FIG. 1 is a block diagram showing the configuration of an information processing device 1 according to this exemplary embodiment.
  • the information processing device 1 includes an acquisition unit 11 and a determination unit 12 .
  • Acquisition unit 11 and determination unit 12 are configured to implement acquisition means and determination means, respectively, in this exemplary embodiment.
  • the acquisition unit 11 acquires reference data.
  • the acquisition unit 11 supplies the acquired reference data to the determination unit 12 .
  • the determining unit 12 determines the reward function including the weighting factor and the feature amount parameter by inverse reinforcement learning using the reference data and including the feature amount parameter as an operation target.
  • As described above, the information processing device 1 according to this exemplary embodiment adopts a configuration including the acquisition unit 11 that acquires the reference data, and the determination unit 12 that determines the reward function including the weighting coefficient and the feature amount parameter by inverse reinforcement learning that uses the reference data and that includes the feature amount parameter as an operation target.
  • Since the operation target includes the feature amount parameter that determines the feature amount, the result of a prediction model or the like can be adopted as a feature amount. Therefore, according to the information processing device 1 according to this exemplary embodiment, a more appropriate reward function can be generated.
  • FIG. 2 is a flow diagram showing the flow of the information processing method S1 according to this exemplary embodiment.
  • Step S11 the acquisition unit 11 acquires reference data.
  • the acquisition unit 11 supplies the acquired reference data to the determination unit 12 .
  • In step S12, the determination unit 12 determines the reward function including the weighting factor and the feature amount parameter by inverse reinforcement learning that uses the reference data supplied from the acquisition unit 11 and that includes the feature amount parameter as an operation target.
  • As described above, in the information processing method S1 according to this exemplary embodiment, the acquisition unit 11 acquires the reference data in step S11, and in step S12 the determination unit 12 determines the reward function including the weighting coefficient and the feature amount parameter by inverse reinforcement learning that uses the reference data supplied from the acquisition unit 11 and that includes the feature amount parameter as an operation target. Therefore, according to the information processing method S1 according to this exemplary embodiment, the same effects as those of the information processing apparatus 1 can be obtained.
  • the information processing device 2 is a device that generates output data according to target data by solving an optimization problem using target data and a reward function determined by inverse reinforcement learning.
  • the reward function and inverse reinforcement learning are as described above.
  • As an example, the target data includes at least part of state data representing the state of a certain system and action data indicating actions taken by a specific expert in that state.
  • solving the optimization problem means maximizing the reward function by manipulating the data to be manipulated with the target data as input.
  • As an example, the target data TD can be represented as (s_1, s_2, ...).
  • The information processing apparatus 2 according to this exemplary embodiment maximizes a reward function having the target data TD and the manipulated data MD as explanatory variables by manipulating the manipulated data MD.
  • the information processing apparatus 2 solves the optimization problem with the target data as input, and generates the data of the operation target that maximizes the reward function as the output data.
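  • As a hedged illustration of this step (the publication does not specify a concrete solver), the sketch below fixes the target data TD, searches over candidate values of the manipulated data MD, and returns the candidate that maximizes a given reward function; the reward form and search grid are assumptions.

```python
import numpy as np

def reward(td, md, theta, phi):
    # Hypothetical reward with the target data td fixed and the manipulated data md free.
    return theta[0] * td[0] + theta[1] * np.tanh(phi * td[1]) * md

def generate_output(td, theta, phi, candidates):
    # Return the manipulated value that maximizes the reward for the given target data.
    values = [reward(td, md, theta, phi) for md in candidates]
    return candidates[int(np.argmax(values))]

td = np.array([1.0, 2.0])                 # target data (fixed input)
candidates = np.linspace(-1.0, 1.0, 201)  # search space for the manipulated data
print("output data:", generate_output(td, np.array([0.5, 1.5]), 0.8, candidates))
```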
  • FIG. 3 is a block diagram showing the configuration of the information processing device 2 according to this exemplary embodiment.
  • the information processing device 2 includes an acquisition unit 11 and a generation unit 22.
  • the acquisition unit 11 and the generation unit 22 are configured to implement acquisition means and generation means, respectively, in this exemplary embodiment.
  • the acquisition unit 11 acquires target data.
  • the acquisition unit 11 supplies the acquired target data to the generation unit 22 .
  • The generation unit 22 generates output data according to the target data by solving an optimization problem using the target data acquired by the acquisition unit 11 and a reward function that includes a weighting factor and a feature amount parameter and that is determined by inverse reinforcement learning including the feature amount parameter as an operation target.
  • As this reward function, the reward function determined by the inverse reinforcement learning described above can be used.
  • As described above, the information processing device 2 according to this exemplary embodiment adopts a configuration including the acquisition unit 11 that acquires the target data, and the generation unit 22 that generates output data according to the target data by solving an optimization problem using the target data acquired by the acquisition unit 11 and a reward function that includes the weighting factor and the feature amount parameter and that is determined by inverse reinforcement learning including the feature amount parameter as an operation target.
  • FIG. 4 is a flow diagram showing the flow of the information processing method S2 according to this exemplary embodiment.
  • step S21 the acquisition unit 11 acquires target data.
  • In step S22, the generation unit 22 generates output data according to the target data by solving an optimization problem using the target data acquired by the acquisition unit 11 and a reward function that includes a weighting factor and a feature amount parameter and that is determined by inverse reinforcement learning including the feature amount parameter as an operation target.
  • the reward function determined in step S12 included in the information processing method S1 described above can be used.
  • As described above, in the information processing method S2 according to this exemplary embodiment, the acquisition unit 11 acquires the target data in step S21, and in step S22 the generation unit 22 generates output data according to the target data by solving an optimization problem using the target data acquired by the acquisition unit 11 and the reward function that includes the weighting coefficient and the feature amount parameter and that is determined by inverse reinforcement learning including the feature amount parameter as an operation target. Therefore, according to the information processing method S2 according to this exemplary embodiment, the same effects as those of the information processing apparatus 2 can be obtained.
  • the information processing device 3 is a device that determines a reward function including a weighting factor WF and a feature amount parameter FP by inverse reinforcement learning using reference data RD.
  • the information processing device 3 also displays information corresponding to at least one of the determined weighting factor WF, feature parameter FP, and reward function.
  • the reference data, inverse reinforcement learning, reward function, weighting factor, and feature parameters are as described above.
  • FIG. 5 is a block diagram showing the configuration of the information processing device 3 according to this exemplary embodiment.
  • the information processing device 3 includes a storage unit 31, an input unit 32, an output unit 33, a communication unit 34, and a control unit 35.
  • the storage unit 31 is a memory that stores various data referred to by the control unit 35, which will be described later. Examples of data stored in the storage unit 31 include reference data RD, weighting factors WF, and feature amount parameters FP. As an example of the reference data RD, expert decision-making history data (trajectory) received by the input unit 32, which will be described later, may be stored. Further, the storage unit 31 may store candidates for the feature amount of the reward function that the determination unit 12 uses for learning. However, feature amount candidates do not necessarily have to be feature amounts used in the reward function.
  • the storage unit 31 may store a mathematical optimization solver for realizing the processing by the determination unit 12. Note that the content of the mathematical optimization solver is arbitrary, and may be determined according to the execution environment and apparatus.
  • the input unit 32 accepts various data input to the information processing device 3 .
  • The input unit 32 may, for example, receive input of the expert's decision-making history data (specifically, pairs of states and actions) described above. Further, the input unit 32 may receive input of an initial state and constraint conditions used when the inverse reinforcement learning, which will be described later, is performed.
  • the input unit 32 is configured with input devices such as a keyboard, mouse, and touch panel.
  • the input unit 32 may also function as an interface for acquiring data from other connected devices.
  • the input unit 32 supplies data acquired from another device to the control unit 35, which will be described later.
  • the output unit 33 is configured to output the calculation result by the information processing device 3 .
  • the output unit 33 includes a display panel (display unit), and displays the calculation result on the display panel.
  • the output unit 33 may function as an interface that outputs data to other connected devices. In this configuration, the output unit 33 outputs data supplied from the control unit 35, which will be described later, to other connected devices.
  • The communication unit 34 is a communication module that communicates with other devices via a network (not shown). As an example, the communication unit 34 outputs data supplied from the control unit 35, which will be described later, to another device via the network, and supplies data acquired from another device via the network to the control unit 35.
  • The specific configuration of the network does not limit this embodiment; as an example, a wireless LAN (Local Area Network), a wired LAN, a WAN (Wide Area Network), a public line network, a mobile data communication network, or a combination of these networks can be used.
  • the calculation result is displayed via at least one of the output unit 33 and the communication unit 34.
  • control unit 35 controls each unit included in the information processing device 3 .
  • the control unit 35 stores data acquired from the input unit 32 or the communication unit 34 in the storage unit 31, and supplies data stored in the storage unit 31 to the output unit 33 or the communication unit 34. .
  • the control unit 35 also functions as the acquisition unit 11, the determination unit 12, and the display control unit 13, as shown in FIG.
  • the acquisition unit 11, the determination unit 12, and the display control unit 13 are configured to implement acquisition means, determination means, and first display means, respectively, in this exemplary embodiment.
  • the acquisition unit 11 acquires the reference data RD via the input unit 32 or the communication unit 34.
  • the acquisition unit 11 stores the acquired reference data RD in the storage unit 31 .
  • The determination unit 12 obtains the reference data RD stored in the storage unit 31, and determines the reward function including the weighting factor WF and the feature amount parameter FP by inverse reinforcement learning that uses the reference data RD and that includes the feature amount parameter FP as an operation target.
  • the determination unit 12 stores the determined feature parameter FP in the storage unit 31 .
  • the operation target in the inverse reinforcement learning by the determination unit 12 may include a weighting factor WF included in at least one of one or a plurality of cost terms.
  • the determining unit 12 stores the post-operation weighting factor WF in the storage unit 31 .
  • the display control unit 13 displays, via the output unit 33, information corresponding to at least one of the weighting factor WF, the feature amount parameter FP, and the reward function.
  • The inverse reinforcement learning by the determination unit 12 is based on maximum entropy inverse reinforcement learning (ME-IRL).
  • θ is a weighting coefficient vector whose components are the weighting factors WF.
  • f(s, a) is a feature amount vector, which can include multiple components corresponding to the respective feature amounts.
  • The total number of weighting factors WF included in the weighting coefficient vector θ is determined according to the number of components of the feature amount vector f(s, a).
  • The trajectory τ is expressed by Equation A1.
  • Equation A2 is a probability model representing the distribution p_θ(τ) of the trajectory.
  • θ^T f in Equation A2 represents the reward function (see Equation A3).
  • Z represents the sum of rewards for all trajectories (see Equation A4).
  • In Equation A5, the coefficient multiplying the gradient is the step size, and L(θ) is the distance measure between distributions used in ME-IRL.
  • The second term in Equation A6 is the sum of rewards for all trajectories.
  • ME-IRL assumes that the value of the second term can be strictly calculated. However, in reality, there is also the problem that it is difficult to calculate the total sum of rewards for all trajectories.
  • In contrast, in the maximum entropy inverse reinforcement learning according to this exemplary embodiment, the reward function includes the feature amount parameter FP that characterizes its terms.
  • In the maximum entropy inverse reinforcement learning according to this exemplary embodiment, not only θ described above but also the feature amount parameter FP is estimated. Therefore, the maximum entropy inverse reinforcement learning according to this exemplary embodiment can also be referred to as improved maximum entropy inverse reinforcement learning.
  • the “improved maximum entropy inverse reinforcement learning” is hereinafter simply referred to as “maximum entropy inverse reinforcement learning (ME-IRL)”.
  • the determination unit 12 sets the feature amount of the reward function from the reference data including the state and action.
  • The determination unit 12 may be configured to set the feature amounts of the reward function so that the gradient of the tangent line is finite throughout the function, so that the Wasserstein distance can be used as a distance measure between distributions in the inverse reinforcement learning process.
  • the determination unit 12 may set the feature amount of the reward function so as to satisfy the Lipschitz continuity condition, for example.
  • the determination unit 12 may set the feature amount so that the reward function becomes a linear function.
  • The reward function of Equation 4 illustrated below has an infinite gradient at a = 0, and can therefore be said to be an inappropriate reward function in the present disclosure.
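  • Equation 4 itself is not reproduced here, but the following illustrative pair of functions (an assumption, not the publication's Equation 4) shows the distinction being drawn: a square-root feature has an unbounded gradient at a = 0, whereas a linear feature satisfies the Lipschitz continuity condition.

```latex
% Illustrative only: a feature whose gradient diverges at a = 0
R_1(a) = \theta \sqrt{|a|}, \qquad
\left|\frac{dR_1}{da}\right| = \frac{|\theta|}{2\sqrt{|a|}} \to \infty \quad (a \to 0),
% whereas a linear feature is Lipschitz continuous
R_2(a) = \theta a, \qquad |R_2(a) - R_2(a')| \le |\theta|\,|a - a'| .
```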
  • The determination unit 12 may, for example, use a reward function in which the feature amounts are set according to the user's instruction, or may otherwise obtain a reward function satisfying the above condition.
  • the determination unit 12 may be configured to initialize the weighting factor WF.
  • the method by which the determination unit 12 initializes the weighting factor WF is not particularly limited, and the weighting factor WF may be initialized based on an arbitrary method predetermined according to the user or the like.
  • Next, the determination unit 12 derives an estimate of the expert's trajectory τ (denoted below with a superscript hat). Specifically, the determination unit 12 uses the Wasserstein distance as a distance measure between distributions and estimates the expert's trajectory by performing mathematical optimization that minimizes the Wasserstein distance.
  • the Wasserstein distance is defined by Equation 5 exemplified below. That is, the Wasserstein distance represents the distance between the probability distribution of the expert's trajectory and the probability distribution of the trajectory determined based on the parameters of the reward function.
  • the reward function ⁇ T f ⁇ must be a function that satisfies the Lipschitz continuity condition due to the restriction of the Wasserstein distance.
  • the determining unit 12 sets the feature amount of the reward function so as to satisfy the Lipschitz continuity condition, so it is possible to use the Wasserstein distance as exemplified below.
  • The Wasserstein distance defined by Equation 5 exemplified above takes a value of 0 or less, and increasing this value corresponds to bringing the distributions closer together. In the second term of Equation 5, τ^(n) represents the n-th trajectory optimized with the parameter θ. The second term of Equation 5 is a term that can be calculated even in a combinatorial optimization problem. Therefore, by using the Wasserstein distance exemplified in Equation 5 as a distance measure between distributions, inverse reinforcement learning that can be applied to mathematical optimization problems such as combinatorial optimization problems can be performed.
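  • The exact form of Equation 5 is not reproduced here; the following sketch is a sample-based surrogate consistent with the description above (a non-positive value that increases as the expert trajectories become as rewarding, under the current parameters, as the optimized trajectories). The trajectory representation and feature amounts are hypothetical.

```python
import numpy as np

def trajectory_reward(traj, theta, phi):
    # Sum of per-step rewards theta^T f(x; phi) along one trajectory (hypothetical features).
    return sum(theta @ np.array([x[0], np.tanh(phi * x[1])]) for x in traj)

def wasserstein_surrogate(expert_trajs, optimized_trajs, theta, phi):
    """Mean reward of the expert trajectories minus the mean reward of trajectories
    optimized with the current parameters: non-positive, and closer to 0 when the
    two trajectory distributions are closer."""
    expert = np.mean([trajectory_reward(t, theta, phi) for t in expert_trajs])
    optimized = np.mean([trajectory_reward(t, theta, phi) for t in optimized_trajs])
    return expert - optimized

theta, phi = np.array([0.5, 1.5]), 0.8
expert = [[(1.0, 0.2), (0.8, 0.1)]]       # one expert trajectory of (x1, x2) steps
optimized = [[(1.2, 0.3), (0.9, 0.2)]]    # one trajectory optimized with (theta, phi)
print(wasserstein_surrogate(expert, optimized, theta, phi))
```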
  • Next, based on the estimated expert trajectory, the determination unit 12 updates the parameter θ of the reward function and the feature amount parameter FP so as to maximize the distance measure between the distributions.
  • Here, the trajectory τ follows the Boltzmann distribution according to the maximum entropy principle. Therefore, similarly to ME-IRL, the determination unit 12 updates the parameter θ and the feature amount parameter FP based on the estimated expert trajectory.
  • As described above, the second term in Equation A6 is the sum of rewards for all trajectories, and ME-IRL assumes that this value can be strictly calculated. In reality, however, it is difficult to calculate the sum of rewards for all trajectories.
  • Therefore, the determination unit 12 sets a lower bound of the logarithmic likelihood represented using the reward function, and updates the operation targets (the parameter θ and the feature amount parameter FP) so as to maximize that lower bound.
  • Specifically, the determination unit 12 sets the lower bound of L(θ) as follows; hereinafter, this is also called the lower bound of the log-likelihood.
  • In Equation 6, which expresses the lower bound of the log-likelihood, the second term is the maximum reward value for the current parameter θ, and the third term is the logarithmic value of the number of possible trajectories (N_τ).
  • That is, the determination unit 12 derives the lower bound of the log-likelihood by subtracting the maximum reward value for the current parameter θ and the logarithmic value of the number of possible trajectories (N_τ) from the first term of Equation 6.
  • The determination unit 12 may transform the derived lower bound of the log-likelihood of ME-IRL into a formula in which an entropy regularization term is subtracted from the Wasserstein distance.
  • The formula obtained by decomposing the lower bound of the log-likelihood of ME-IRL into the Wasserstein distance and the entropy regularization term is expressed as Equation 7 illustrated below.
  • The expression in the first parenthesis of Equation 7 represents the Wasserstein distance. That is, the Wasserstein distance represents the distance between the probability distribution of the expert's trajectory and the probability distribution of the trajectory determined based on the parameters of the reward function.
  • the reward function ⁇ T f ⁇ must be a function that satisfies the Lipschitz continuity condition due to the restriction of the Wasserstein distance.
  • Since the determination unit 12 sets the feature amounts of the reward function so as to satisfy the Lipschitz continuity condition, it can use the Wasserstein distance.
  • The expression in the second parenthesis of Equation 7 represents an entropy regularization term that contributes to increasing the logarithmic likelihood of the Boltzmann distribution derived from the maximum entropy principle.
  • Within this parenthesis, the first term represents the maximum reward value for the current parameter θ, and the second term represents the mean value of the reward for the current parameter θ.
  • As described above, the inverse reinforcement learning by the determination unit 12 includes update processing that updates the operation targets (the parameter θ and the feature amount parameter FP) so as to maximize the lower bound of the logarithmic likelihood represented using the reward function.
  • The lower bound of the logarithmic likelihood used in the update processing includes the Wasserstein distance, which represents the distance between the reference probability distribution and the probability distribution represented using the reward function, and a regularization term representing the difference between the maximum value and the average value of the reward function.
  • In Equation 7, this regularization term functions as an entropy regularization term.
  • To maximize the lower bound, the value of this term should be small, which corresponds to a small difference between the maximum reward value and the mean value. A smaller difference between the maximum reward value and the average value indicates smaller variability of the reward across trajectories.
  • Since a smaller difference between the maximum reward value and the average value means an increase in entropy, entropy regularization works and contributes to entropy maximization. This contributes to the maximization of the log-likelihood of the Boltzmann distribution and, as a result, contributes to resolving ambiguity in inverse reinforcement learning.
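  • Putting the two parts together, a non-authoritative numerical sketch of the lower bound (the Wasserstein term minus the entropy regularization term given by the maximum reward minus the mean reward) could look as follows; constants such as the logarithm of the number of trajectories in Equation 6 are omitted here.

```python
import numpy as np

def log_likelihood_lower_bound(expert_rewards, optimized_rewards):
    """expert_rewards    : rewards of the (estimated) expert trajectories
    optimized_rewards : rewards of trajectories optimized with the current parameters"""
    wasserstein_term = np.mean(expert_rewards) - np.mean(optimized_rewards)
    regularization_term = np.max(optimized_rewards) - np.mean(optimized_rewards)
    return wasserstein_term - regularization_term

# Purely illustrative reward values
print(log_likelihood_lower_bound(np.array([1.0, 1.2, 0.9]), np.array([1.3, 1.1, 1.4])))
```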
  • For example, the determination unit 12 fixes the estimated trajectory and updates the parameter θ and the feature amount parameter FP by the gradient ascent method based on Equation 7 shown above.
  • However, the normal gradient ascent method may not converge.
  • This is because the feature amount (f^max) of the trajectory with the maximum reward value does not match the average value of the feature amounts (f^(n)) of the other trajectories (i.e., the difference between the two does not become 0). Therefore, with the normal gradient ascent method, the logarithmic likelihood oscillates and does not converge, which makes the learning unstable and makes it difficult to appropriately determine convergence (see Equation 8 below for updating the parameter θ).
  • Therefore, the determination unit 12 may update the parameter θ and the feature amount parameter FP so as to gradually attenuate the portion that contributes to entropy regularization (that is, the portion corresponding to the entropy regularization term).
  • the lower bound of the log-likelihood may include a damping factor that is multiplied by the regularization term to dampen the contribution of the regularization term as the update process is repeated.
  • To this end, the determination unit 12 defines an update formula in which a damping coefficient indicating the degree of damping is applied to the portion that contributes to entropy regularization.
  • Specifically, when Equation 7 above is differentiated with respect to θ, the result contains a portion corresponding to the term indicating the Wasserstein distance (that is, a portion contributing to processing for increasing the Wasserstein distance) and a portion corresponding to the entropy regularization term; the determination unit 12 defines Equation 9 exemplified below, in which the damping coefficient is applied to the portion corresponding to the entropy regularization term.
  • The damping coefficient is predefined according to how the portion corresponding to the entropy regularization term is to be damped. For example, for smooth attenuation, the damping coefficient is defined as in Equation 10 exemplified below.
  • In Equation 10, the first constant is set to 1 and the second constant is set to 0 or greater, and t indicates the number of iterations. As a result, the damping coefficient functions as a coefficient that reduces the portion corresponding to the entropy regularization term as the number of iterations t increases.
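  • Equation 10 is not reproduced above; the sketch below shows one smooth decay schedule with the stated properties (the first constant set to 1, the second to 0 or greater, and a value that shrinks as the number of iterations t grows). The functional form is an assumption.

```python
def damping_coefficient(t, c1=1.0, c2=0.1):
    # Smoothly decaying factor multiplied onto the entropy regularization part.
    # c1 corresponds to the constant set to 1, c2 to the constant set to 0 or greater.
    return c1 / (1.0 + c2 * t)

# With c2 = 0 the regularization part is never attenuated; with c2 > 0 its
# influence decreases as the number of iterations t increases.
print([round(damping_coefficient(t), 3) for t in range(0, 50, 10)])
```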
  • The determination unit 12 may update the parameter θ and the feature amount parameter FP without attenuating the portion corresponding to the entropy regularization term in the initial stage of updating, and, at the timing when the logarithmic likelihood begins to oscillate, may update the parameter θ and the feature amount parameter FP so as to reduce the influence of the portion corresponding to the entropy regularization term.
  • The determination unit 12 may determine that the logarithmic likelihood has started to oscillate when the moving average of the logarithmic likelihood becomes constant. Specifically, when the change in the moving average over a time window of the lower bound of the logarithmic likelihood (several points from the current value into the past) is small (for example, 1e-3 or less), the determination unit 12 can judge the moving average to be constant.
  • The method of determining the timing at which oscillation starts is the same as the method described above.
  • The determination unit 12 may further change the update method of the parameter θ and the feature amount parameter FP at the timing when the logarithmic likelihood starts to oscillate again after the damping coefficient has been changed as in Equation 10 shown above. Specifically, the determination unit 12 may update the parameter θ and the feature amount parameter FP using a momentum method as exemplified in Equation 11 below.
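  • Equation 11 is likewise not reproduced here; a standard momentum-method update of the operation targets, which is one plausible reading of the passage, is sketched below with hypothetical gradients.

```python
import numpy as np

def momentum_step(params, grads, velocity, lr=0.01, beta=0.9):
    # One momentum step: accumulate the gradient into a velocity term and
    # move the operation targets (theta and the feature amount parameter) along it.
    velocity = beta * velocity + grads
    params = params + lr * velocity  # gradient ascent on the lower bound
    return params, velocity

params = np.array([0.5, 1.5, 0.8])    # [theta_1, theta_2, feature amount parameter]
velocity = np.zeros_like(params)
grads = np.array([0.1, -0.05, 0.02])  # hypothetical gradient of the lower bound
params, velocity = momentum_step(params, grads, velocity)
print(params)
```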
  • the determination unit 12 repeats the trajectory estimation process and the update process of the parameter ⁇ and the feature amount parameter FP until it determines that the lower limit of the logarithmic likelihood has converged.
  • As an example, the determination unit 12 may determine that the distance measure between the distributions has converged when the absolute value of the lower bound of the logarithmic likelihood becomes smaller than a predetermined threshold.
  • When determining that the distance measure between distributions has not converged, the determination unit 12 continues the trajectory estimation processing and the update processing of the parameter θ and the feature amount parameter FP. On the other hand, when determining that the distance measure between distributions has converged, the determination unit 12 ends the trajectory estimation processing and the update processing of the parameter θ and the feature amount parameter FP.
  • Each variable in Equations 12 and 13 has the following meaning, as described in Exemplary Embodiment 1.
  • Reward: reward function
  • Cost: cost function
  • θ_1, θ_2, θ_3: weighting factors
  • x_1, x_2, x_3: explanatory variables
  • The above exemplary explanatory variables x_1, x_2, and x_3 are variables each of which can correspond to either the state data (s_i) or the action data (a_i).
  • the explanatory variable itself may constitute the feature amount, or the function of the explanatory variable may constitute the feature amount.
  • In the following, a case is described in which inverse reinforcement learning is performed with the weighting factor θ_1, the weighting factor θ_2, and the feature amount parameter as operation targets.
  • When the determination unit 12 has derived the lower bound of the logarithmic likelihood shown in Equation 9 described above, it updates the operation targets of the reward function. For example, when the reward function is given by Equation 12 below, the determination unit 12 updates the weighting factor θ_1, the weighting factor θ_2, and the feature amount parameter using Equations 14 to 17 exemplified below.
  • Here, the damping coefficient in Equations 14 to 17 is the coefficient defined by Equation 10 above, and the remaining coefficient is a parameter indicating the learning rate.
  • In this way, the determination unit 12 updates the parameter θ of the reward function and the feature amount parameter FP so as to maximize the logarithmic likelihood of the Boltzmann distribution derived from the principle of maximum entropy.
  • Inverse reinforcement learning is thus performed with not only the weighting coefficients but also the feature amount parameter as update targets, so that a more appropriate reward function can be generated.
  • FIG. 6 is a flow diagram showing the flow of the information processing method S3 according to this exemplary embodiment.
  • Step S31: The acquisition unit 11 acquires the reference data RD via the input unit 32 or the communication unit 34.
  • the acquisition unit 11 stores the acquired reference data RD in the storage unit 31 . Since the reference data RD has been described above, the description thereof is omitted here.
  • step S32 the determination unit 12 initializes the weighting factor and the feature amount parameter, which are the operation targets in the inverse reinforcement learning, among the parameters included in the reward function.
  • the determining unit 12 may use the initial values stored in the storage unit 31 to initialize the weighting coefficients and feature amount parameters that are the operation targets in the inverse reinforcement learning.
  • step S33 the determination unit 12 performs mathematical optimization to minimize the Wasserstein distance.
  • the determination unit 12 estimates the trajectory that minimizes the Wasserstein distance, which represents the distance between the probability distribution of the trajectory of the expert and the probability distribution of the trajectory determined based on the parameters of the reward function.
  • In step S34, the determination unit 12 updates the parameter θ of the reward function and the feature amount parameter FP so as to maximize the logarithmic likelihood of the Boltzmann distribution derived from the principle of maximum entropy. Since a specific example of the update processing has been described above, the description is omitted here.
  • Step S35 the determination unit 12 determines whether or not the lower limit of the logarithmic likelihood has converged. If it is determined that the lower limit of the logarithmic likelihood has converged (YES in S35), the process proceeds to step S36; otherwise (NO in S35), the process returns to step S33.
  • Step S36: When determining that the lower bound of the logarithmic likelihood has converged, the determination unit 12 outputs the determined reward function in step S36.
  • As an example, the parameters (the weighting factor WF and the feature amount parameter FP) included in the reward function output by the determination unit 12 are stored in the storage unit 31.
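  • The overall loop of steps S32 to S36 can be organized as in the following non-authoritative sketch; the lower bound and its gradient are simple placeholders so that the loop is runnable, not the publication's Equations 6 to 11.

```python
import numpy as np

def lower_bound(params, target):
    # Placeholder for the lower bound of the log-likelihood (stands in for Equation 7).
    return -float(np.sum((params - target) ** 2))

def gradient(params, target):
    return -2.0 * (params - target)

rng = np.random.default_rng(0)
params = rng.normal(size=3)            # Step S32: initialize weighting factors and feature parameter
target = np.array([0.5, 1.5, 0.8])     # stands in for statistics of the estimated expert trajectory
prev = -np.inf
for t in range(1000):
    # Step S33: the trajectory estimate is kept fixed in this toy example.
    params = params + 0.05 * gradient(params, target)   # Step S34: gradient-ascent update
    current = lower_bound(params, target)
    if abs(current - prev) < 1e-9:                       # Step S35: convergence judgment
        break
    prev = current
print("Step S36: output reward-function parameters:", params)
```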
  • the output unit 33 may include a display panel (display unit) and display various information on the display panel.
  • the information displayed on the display panel may include information corresponding to at least one of the weighting factor WF, the feature amount parameter FP, and the reward function.
  • the display content displayed by the output unit 33 is generated by the display control unit 13 as an example.
  • FIG. 7 is a diagram showing a display example generated by the display control unit 13. As shown in FIG. 7, a display screen may be generated that shows the relationship between the values of at least some of the parameters to be operated (the weighting factor WF and the feature amount parameter FP) and the number of steps. In other words, a display screen may be generated that shows how the parameter values of the operation targets change as the number of steps of the update processing increases. The example shown in FIG. 7 is a display screen showing the relationship between the number of steps, the weighting factor θ_1, and the feature amount parameter.
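  • As a rough sketch of such a display (the actual screen layout of FIG. 7 is not reproduced), the parameter values recorded at each update step could simply be plotted against the number of steps, for example:

```python
import matplotlib.pyplot as plt

# Hypothetical learning histories: the weighting factor and the feature amount
# parameter recorded at each step of the update processing.
steps = list(range(10))
weighting_factor_history = [0.1 * s for s in steps]
feature_parameter_history = [1.0 - 0.05 * s for s in steps]

plt.plot(steps, weighting_factor_history, label="weighting factor")
plt.plot(steps, feature_parameter_history, label="feature amount parameter")
plt.xlabel("number of steps")
plt.ylabel("parameter value")
plt.legend()
plt.show()
```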
  • As described above, the information processing apparatus 3 displays information corresponding to at least one of the weighting factor WF, the feature amount parameter FP, and the reward function, and can thereby suitably present to the user how the inverse reinforcement learning is progressing.
  • FIG. 8 is a block diagram showing the configuration of the information processing device 4 according to this exemplary embodiment.
  • the information processing device 4 includes a control section 45 instead of the control section 35 provided in the information processing device 3 .
  • the control unit 45 includes a generation unit 14 in addition to each configuration included in the control unit 35 .
  • the information processing device 4 includes a storage unit 41 instead of the storage unit 31 included in the information processing device 3.
  • the storage unit 41 stores target data TD.
  • the acquisition unit 11 included in the information processing device 4 further acquires target data TD in addition to various data acquired by the acquisition unit 11 according to the second exemplary embodiment.
  • the acquired target data TD is stored in the storage unit 41 described above as an example.
  • As described above, the target data TD includes at least part of state data representing the state of a certain system and action data representing actions taken by a specific expert in that state.
  • As an example, the target data TD can be represented as (s_1, s_2, ...).
  • the generation unit 14 included in the information processing device 4 maximizes a reward function having the target data TD acquired by the acquisition unit 11 and the operation target data MD as explanatory variables by operating the operation target data MD. .
  • the information processing device 4 solves the optimization problem with the target data as input, and generates as output data the data of the operation target that maximizes the reward function.
  • The generation unit 14 generates output data according to the target data by solving an optimization problem using the target data acquired by the acquisition unit 11 and a reward function that includes a weighting factor and a feature amount parameter and that is determined by inverse reinforcement learning including the feature amount parameter as an operation target.
  • the reward function determined by the determining unit 12 through the processing described in the second exemplary embodiment can be used as the above reward function.
  • In this way, the information processing device 4 adopts a configuration including the acquisition unit 11 that acquires the target data, and the generation unit 14 that generates output data using the reward function that includes the weighting factor and the feature amount parameter and that is determined by inverse reinforcement learning including the feature amount parameter as an operation target.
  • the output unit 33 may include a display panel (display unit) and display various information on the display panel.
  • the information displayed on the display panel may include at least part of the data included in the output data generated by the generator 14 .
  • FIG. 9 is a diagram showing a display example generated by the display control unit 13.
  • Specifically, FIG. 9 shows an example of a display screen generated by the display control unit 13 in a case where: the reward function is given by Equation 12 described in Exemplary Embodiment 2; the weighting coefficients θ_1, θ_2, and θ_3 and the feature amount parameter are determined by the inverse reinforcement learning by the determination unit 12; the target data TD includes x_1 and x_2; and the generation unit 14 generates output data corresponding to the target data by solving an optimization problem with x_3 as the data to be manipulated.
  • As shown in FIG. 9, the display screen generated by the display control unit 13 displays the values of the explanatory variables x_1 and x_2 included in the target data TD, and the value of the operation target data x_3 determined by the generation unit 14 according to those values, that is, the value recommended to the user.
  • the information processing device 4 can suitably present the solution of the optimization problem to the user by displaying the output data generated by the generation unit 14 as described above.
  • As an example, the acquisition unit 11 receives input of at least one of an explanatory variable, a weighting coefficient, and a feature amount parameter from the user via the input unit 32. Then, as shown in the upper part of FIG. 10, the display control unit 13 may display, in a comparable manner, the at least one of the explanatory variable, weighting coefficient, and feature amount parameter input by the user and at least one of the explanatory variable, weighting coefficient, and feature amount parameter values obtained by inverse reinforcement learning using the reference data RD for one or more experts.
  • the display control unit 13 may be configured to generate a GUI (Graphical User Interface) including operation objects that can be operated by the user and display it on the output unit 33 .
  • Such a GUI is shown in the lower left part of FIG. 10. By sliding a bar included in the GUI, the user can change the value of at least one of the explanatory variable, the weighting coefficient, and the feature amount parameter corresponding to that bar.
  • the display control unit 13 may rank at least one of explanatory variables, weighting coefficients, and feature parameters, and display the variables together with the ranking.
  • the information processing device 4 generates an operation plan regarding the water distribution plan of the water supply infrastructure.
  • the water infrastructure includes, by way of example, multiple sites such as reservoirs, distribution reservoirs, water intake facilities, water purification plants, water stations, and demand points.
  • the operation plan includes, for example, information indicating the operation pattern of pumps at each site.
  • the acquisition unit 11 acquires the target data TD and the reference data RD.
  • the acquisition unit 11 acquires the target data TD and the reference data RD from another device via the communication unit 34 .
  • the acquisition unit 11 may acquire the target data TD and the reference data RD input via the input unit 32 .
  • the acquisition unit 11 may acquire the target data TD and the reference data RD by reading the target data TD and the reference data RD from the storage unit 41 or an externally connected storage device. Details of the target data TD and the reference data RD according to this example will be described later.
  • the determination unit 12 determines a reward function used in the optimization problem for generating the operation plan OP regarding the target water distribution plan by inverse reinforcement learning with reference to the reference data RD.
  • Inverse reinforcement learning of the reward function includes, as described above, update processing with the weighting factor WF and the feature amount parameter FP as the manipulation targets.
  • the generation unit 14 solves the optimization problem using the reward function determined by inverse reinforcement learning using the reference data RD related to the reference water distribution plan and the target data TD acquired by the acquisition unit 11, Generate an operation plan OP for the target water distribution plan.
  • the operation plan OP generation processing executed by the generation unit 14 will be described later.
  • the storage unit 41 stores the target data TD and the reference data RD acquired by the acquisition unit 11 .
  • the storage unit 41 also stores the operation plan OP generated by the generation unit 14 .
  • the storage unit 41 also stores the reward function determined by the determination unit 12 and the constraint condition LC.
  • storing a reward function in the storage unit 41 means that a parameter defining the reward function is stored in the storage unit 41 .
  • the target data TD is data used by the generating unit 14 to generate the operation plan OP.
  • the target data TD includes information indicating the state of the target water supply infrastructure.
  • the target data TD includes information about pumps, distribution networks, water pipelines and/or demand points in the target water infrastructure.
  • the target data TD includes, as an example, at least one of the following data (i) to (x) in the water supply infrastructure that is the target of the operation plan.
  • the data included in the target data TD is not limited to these, and may include other data.
  • the power consumption at each base indicates the power consumption at each base such as water purification plants and water supply stations.
  • demand forecast margin indicates the extent to which supply exceeds demand;
  • Reservoir margin indicates the extent to which the designed reservoir capacity exceeds the actual reservoir capacity.
  • Water distribution loss indicates the extent to which water is not being distributed to each demand point.
  • the number of operating personnel indicates the number of operating personnel at each site.
  • the reference data RD is data used when the determination unit 12 determines the reward function.
  • the reference data RD includes information representing the state of the reference water supply infrastructure.
  • the reference water infrastructure may be the same as or different from the water infrastructure for which the operation plan is generated.
  • the reference data RD includes, as an example, information on at least one of pumps, distribution networks, water pipelines, and demand points in the reference water infrastructure.
  • the reference data RD also includes, as an example, information on at least one of pump operating patterns and personnel in the reference water supply infrastructure. Each item included in the reference data RD may be treated as state data, or may be treated as action data.
  • the reference data RD includes, as an example, at least one of the following data (i) to (x) in the reference water infrastructure.
  • the data included in the reference data RD is not limited to these, and may include other data.
  • the reference data RD includes, as an example, data indicating an operation plan created by a skilled person for reference water infrastructure. More specifically, the reference data RD includes, as an example, data represented by variables controlled based on operation rules, such as opening/closing of valves, intake of water, thresholds of pumps, and the like. Such data can also be said to be data representing the decision-making history (expert's intention) of the expert who created the operational plan for reference.
  • the operational plan OP includes, by way of example, information about the operating pattern of the pumps in the water infrastructure of interest.
  • the operation plan OP also includes, as an example, information about the personnel involved in the target water supply infrastructure.
  • the reward function includes each cost term including each variable corresponding to each item included in the reference data RD.
  • the generality of the reward function was described in the exemplary embodiment above.
  • The constraint condition LC is a constraint condition of the optimization problem that the generation unit 14 solves. The constraint conditions LC include, for example, the following (i) to (iv). Note that the constraint conditions LC are not limited to these, and may include other conditions.
  • the water storage volume of the reservoir/distribution reservoir is greater than or equal to threshold X and less than Y.
  • the determining unit 12 determines a reward function to be used in the optimization problem for generating the operation plan for the target water distribution plan by inverse reinforcement learning with reference to the reference data RD. As an example, the determining unit 12 determines the weighting factor of the cost term included in the reward function and the feature parameter that characterizes the cost term by inverse reinforcement learning using the state data and action data included in the reference data RD. .
  • An example of inverse reinforcement learning by the determination unit 12 is as described above.
  • the determination unit 12 outputs the determined reward function.
  • the determination unit 12 may output the reward function by writing it in the storage unit 41 or an external storage device, or output it to the output unit 33 .
  • the generation unit 14 generates an operation plan OP related to the target water distribution plan by solving an optimization problem using the reward function and the target data TD under the constraint LC.
  • More specifically, the generation unit 14 generates the operation plan OP relating to the target water distribution plan by solving an optimization problem using the reward function in which the target data TD acquired by the acquisition unit 11 is treated as fixed variables and the variables included in each cost term of the reward function other than the fixed variables are treated as manipulated variables.
  • the generation unit 14 also outputs the generated operation plan OP.
  • the generation unit 14 may output the operation plan OP by writing it in the storage unit 41 or an external storage device, or may output it to the output unit 33 .
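  • As a hedged sketch of this step (the actual cost terms, pump model, solver, and constraint values are not specified here), the generation unit could fix the target data TD, enumerate candidate pump on/off patterns, discard candidates that violate the constraint conditions LC, and keep the candidate that maximizes the learned reward:

```python
import itertools
import numpy as np

def learned_reward(td, pump_pattern, theta):
    # Hypothetical learned reward: trades off power consumption against the demand margin.
    power = float(np.sum(pump_pattern))            # more pumps on -> more power consumed
    margin = td["demand_margin"] + 0.5 * power     # more pumps on -> larger supply margin
    return -theta[0] * power + theta[1] * margin

def satisfies_constraints(td, pump_pattern):
    # Hypothetical constraint LC: keep the reservoir volume within [X, Y).
    volume = td["reservoir"] + 0.2 * float(np.sum(pump_pattern)) - td["demand"]
    return 3.0 <= volume < 8.0

td = {"reservoir": 5.0, "demand": 1.0, "demand_margin": 0.3}   # fixed target data TD
theta = np.array([1.0, 2.0])                                   # determined by inverse reinforcement learning

feasible = [p for p in itertools.product([0, 1], repeat=4) if satisfies_constraints(td, p)]
best = max(feasible, key=lambda p: learned_reward(td, p, theta))
print("operation plan OP (pump on/off pattern):", best)
```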
  • FIG. 11 is a diagram for explaining a specific example of setting the optimization problem according to this example.
  • The operation plan OP needs to be determined in consideration of various viewpoints, such as how much margin should be provided over the forecasted demand, how much power consumption should be suppressed, and how much weight should be given to the water level of the distribution reservoir. Setting weights for these viewpoints is difficult, because which viewpoint is emphasized, and to what degree, varies depending on the operator who operates the water supply infrastructure and is not uniquely determined. For example, there is a case where local government A, which creates a certain operation plan, places importance on power consumption, while local government B places importance on the water level of the distribution reservoir.
  • the generation unit 14 under the constraint condition LC, generates a reward function whose weighting factor and feature amount parameter of each cost term are determined by inverse reinforcement learning with reference to the reference data RD, and the target data TD Solve the optimization problem using
  • Since the weighting factors and the feature amount parameters of each cost term included in the reward function are determined by inverse reinforcement learning with reference to the reference data RD, they take values that reflect the action data included in the reference data RD, in other words, values that reflect the intention of the expert who generated the reference operation plan.
  • For example, the weighting coefficients λ1 to λ6 and the feature amount parameters included in the reward function used to generate the operation plan OP of municipality A take values that reflect the intention of the expert or the like who generated the reference operation plan used to determine that reward function.
  • Similarly, the weighting coefficients λ1 to λ6 and the feature amount parameters included in the reward function used to generate the operation plan OP of municipality B take values that reflect the intention of the expert who generated the reference operation plan used to determine that reward function. By comparing the weighting factors and feature amount parameters of municipality A with those of municipality B, it becomes easy to grasp which viewpoints each municipality emphasizes.
  • As an example, the determination unit 12 may determine the reward function by referring to reference data RD including an operation plan created by the expert a1 of municipality A, and the generation unit 14 may generate a future operation plan OP using that reward function and the target data TD of municipality A. In this case, the generation unit 14 can generate a future operation plan OP for municipality A that reflects the intention of the expert a1.
  • Alternatively, the determination unit 12 may determine the reward function by referring to reference data RD including an operation plan created by the expert a1 of municipality A, and the generation unit 14 may generate a future operation plan OP using that reward function and the target data TD of municipality B. In this case, the generation unit 14 can generate an operation plan OP for municipality B that reflects the intention of the expert a1.
  • Some or all of the functions of the information processing apparatuses 1, 2, 3, and 4 may be implemented by hardware such as integrated circuits (IC chips), or may be implemented by software.
  • In the latter case, the information processing apparatuses 1, 2, 3, and 4 are implemented by, for example, a computer that executes the instructions of a program, which is software implementing each function.
  • An example of such a computer (hereinafter referred to as computer C) is shown in FIG.
  • Computer C comprises at least one processor C1 and at least one memory C2.
  • a program P for operating the computer C as the information processing apparatuses 1, 2, 3, and 4 is recorded in the memory C2.
  • the processor C1 reads the program P from the memory C2 and executes it, thereby realizing each function of the information processing apparatuses 1, 2, 3, and 4.
  • As the processor C1, for example, a CPU (Central Processing Unit), a GPU (Graphic Processing Unit), a DSP (Digital Signal Processor), an MPU (Micro Processing Unit), an FPU (Floating point number Processing Unit), a PPU (Physics Processing Unit), a microcontroller, or a combination thereof can be used.
  • As the memory C2, for example, a flash memory, an HDD (Hard Disk Drive), an SSD (Solid State Drive), or a combination thereof can be used.
  • the computer C may further include a RAM (Random Access Memory) for expanding the program P during execution and temporarily storing various data.
  • Computer C may further include a communication interface for sending and receiving data to and from other devices.
  • Computer C may further include an input/output interface for connecting input/output devices such as a keyboard, mouse, display, and printer.
  • The program P can be recorded on a non-transitory tangible recording medium M that is readable by the computer C.
  • As the recording medium M, for example, a tape, a disk, a card, a semiconductor memory, a programmable logic circuit, or the like can be used.
  • the computer C can acquire the program P via such a recording medium M.
  • the program P can be transmitted via a transmission medium.
  • As the transmission medium, for example, a communication network or broadcast waves can be used.
  • Computer C can also acquire program P via such a transmission medium.
  • Appendix 1 An information processing apparatus comprising: acquisition means for acquiring reference data; and determination means for determining a reward function including a weighting factor and a feature amount parameter by inverse reinforcement learning that uses the reference data and includes the feature amount parameter as an operation target.
  • Appendix 2 The information processing apparatus according to appendix 1, wherein the reward function includes one or more cost terms each including a feature value represented using explanatory variables and the weighting factor representing the weight of the feature value, and at least one of the one or more cost terms includes, together with the explanatory variables, the feature amount parameter that characterizes that cost term.
  • Appendix 3 The information processing apparatus according to appendix 2, wherein an operation target in the inverse reinforcement learning by the determining means includes the weighting factor included in at least one of the one or more cost terms.
  • Appendix 4 The information processing device according to any one of appendices 1 to 3, wherein the inverse reinforcement learning by the determining means includes an update process of updating the operation target so as to maximize the lower limit of the logarithmic likelihood represented using the reward function.
  • The information processing device according to appendix 4, wherein the lower limit of the log-likelihood is expressed using the Wasserstein distance, which represents the distance between the reference probability distribution and the probability distribution represented using the reward function, and a regularization term, which represents the difference between the maximum value of the reward function and the average value of the reward function.
  • Appendix 7 The information processing apparatus according to any one of appendices 1 to 6, further comprising first display means for displaying information corresponding to at least one of the weighting coefficient, the feature quantity parameter, and the reward function.
  • Appendix 8 The information processing apparatus according to any one of appendices 1 to 7, wherein the acquisition means further acquires target data, and the information processing apparatus further comprises generating means for generating output data according to the target data by solving an optimization problem using the reward function determined by the determination means and the target data acquired by the acquisition means.
  • appendix 9 The information processing apparatus according to appendix 8, further comprising second display means for displaying the output data.
  • Appendix 10 An information processing apparatus comprising: acquisition means for acquiring target data; and generating means for generating output data according to the target data by solving an optimization problem using the target data acquired by the acquisition means and a reward function including a weighting factor and a feature amount parameter, the reward function being determined by inverse reinforcement learning including the feature amount parameter as an operation target.
  • Appendix 12 An information processing method by an information processing device, comprising: obtaining target data; and generating output data according to the target data by solving an optimization problem using the target data obtained in the obtaining and a reward function including a weighting factor and a feature amount parameter, the reward function being determined by inverse reinforcement learning including the feature amount parameter as an operation target.
  • Appendix 13 A program for causing a computer to function as an information processing apparatus, the program causing the computer to function as: acquisition means for acquiring reference data; and determination means for determining a reward function including a weighting factor and a feature amount parameter by inverse reinforcement learning that uses the reference data and includes the feature amount parameter as an operation target.
  • Appendix 14 A program for causing a computer to function as an information processing apparatus, the program causing the computer to function as: acquisition means for acquiring target data; and generating means for generating output data according to the target data by solving an optimization problem using the target data acquired by the acquisition means and a reward function including a weighting factor and a feature amount parameter, the reward function being determined by inverse reinforcement learning including the feature amount parameter as an operation target.
  • An information processing device comprising at least one processor, the processor executing: an acquisition process of acquiring reference data; and a determination process of determining a reward function including a weighting factor and a feature amount parameter by inverse reinforcement learning that uses the reference data and includes the feature amount parameter as an operation target.
  • Note that the information processing device may further include a memory, and the memory may store a program for causing the processor to execute the acquisition process and the determination process. This program may also be recorded on a computer-readable non-transitory tangible recording medium.
  • An information processing device comprising at least one processor, the processor executing: an acquisition process of acquiring target data; and a generation process of generating output data corresponding to the target data by solving an optimization problem using the target data acquired in the acquisition process and a reward function including a weighting factor and a feature amount parameter, the reward function being determined by inverse reinforcement learning including the feature amount parameter as an operation target.
  • Note that the information processing device may further include a memory, and the memory may store a program for causing the processor to execute the acquisition process and the generation process. This program may also be recorded on a computer-readable non-transitory tangible recording medium.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

In order to generate a more suitable reward function, an information processing device (1) is provided with: an acquisition unit (11) that acquires reference data; and a determination unit (12) that determines a reward function including weighting factors and feature quantity parameters by inverse reinforcement learning that uses the reference data and includes the feature quantity parameters as operation targets.

Description

Information processing device, information processing method, and program

The present invention relates to an information processing device, an information processing method, and a program.

In reinforcement learning (RL), which is one machine learning method, a reward function is used to evaluate the value of various actions. Inverse reinforcement learning (IRL) is known as a method for generating this reward function.

Non-Patent Document 1 describes maximum entropy inverse reinforcement learning (ME-IRL: Maximum Entropy IRL), which is one type of inverse reinforcement learning. ME-IRL specifies the distribution of trajectories using the maximum entropy principle and learns the reward function by bringing that distribution close to the true distribution (that is, by maximum likelihood estimation).

Non-Patent Document 2 describes GCL (Guided Cost Learning), which is one inverse reinforcement learning method that improves on maximum entropy inverse reinforcement learning. In the method described in Non-Patent Document 2, importance sampling is used to update the weights of the reward function.

However, both of the techniques described in Non-Patent Document 1 and Non-Patent Document 2 have room for improvement in terms of generating an appropriate reward function.

One aspect of the present invention has been made in view of the above problem, and an example of its purpose is to provide a technique capable of generating a more appropriate reward function.

An information processing device according to one aspect of the present invention includes: acquisition means for acquiring reference data; and determination means for determining a reward function including a weighting factor and a feature amount parameter by inverse reinforcement learning that uses the reference data and includes the feature amount parameter as an operation target.

An information processing device according to one aspect of the present invention includes: acquisition means for acquiring target data; and generation means for generating output data according to the target data by solving an optimization problem using the target data acquired by the acquisition means and a reward function including a weighting factor and a feature amount parameter, the reward function being determined by inverse reinforcement learning including the feature amount parameter as an operation target.

An information processing method according to one aspect of the present invention is an information processing method by an information processing device, and includes: acquiring reference data; and determining a reward function including a weighting factor and a feature amount parameter by inverse reinforcement learning that uses the reference data and includes the feature amount parameter as an operation target.

An information processing method according to one aspect of the present invention is an information processing method by an information processing device, and includes: acquiring target data; and generating output data according to the target data by solving an optimization problem using the target data acquired in the acquiring and a reward function including a weighting factor and a feature amount parameter, the reward function being determined by inverse reinforcement learning including the feature amount parameter as an operation target.

A program according to one aspect of the present invention is a program for causing a computer to function as an information processing device, and causes the computer to function as: acquisition means for acquiring reference data; and determination means for determining a reward function including a weighting factor and a feature amount parameter by inverse reinforcement learning that uses the reference data and includes the feature amount parameter as an operation target.

A program according to one aspect of the present invention is a program for causing a computer to function as an information processing device, and causes the computer to function as: acquisition means for acquiring target data; and generation means for generating output data according to the target data by solving an optimization problem using the target data acquired by the acquisition means and a reward function including a weighting factor and a feature amount parameter, the reward function being determined by inverse reinforcement learning including the feature amount parameter as an operation target.

According to one aspect of the present invention, a more appropriate reward function can be generated.
FIG. 1 is a block diagram showing the configuration of an information processing device according to exemplary embodiment 1 of the present invention.
FIG. 2 is a flow diagram showing the flow of an information processing method according to exemplary embodiment 1 of the present invention.
FIG. 3 is a block diagram showing the configuration of an information processing device according to exemplary embodiment 1 of the present invention.
FIG. 4 is a flow diagram showing the flow of an information processing method according to exemplary embodiment 1 of the present invention.
FIG. 5 is a block diagram showing the configuration of an information processing device according to exemplary embodiment 2 of the present invention.
FIG. 6 is a flow diagram showing the flow of an information processing method according to exemplary embodiment 2 of the present invention.
FIG. 7 is a diagram showing a display example generated by a display control unit according to exemplary embodiment 2 of the present invention.
FIG. 8 is a block diagram showing the configuration of an information processing device according to exemplary embodiment 3 of the present invention.
FIG. 9 is a diagram showing a display example generated by a display control unit according to exemplary embodiment 3 of the present invention.
FIG. 10 is a diagram showing a second display example by the information processing device according to exemplary embodiment 3 of the present invention.
FIG. 11 is a diagram showing an application example of the information processing device according to exemplary embodiment 3 of the present invention.
FIG. 12 is a diagram showing an example of a computer that implements the information processing devices according to the exemplary embodiments of the present invention.
[Exemplary embodiment 1]
A first exemplary embodiment of the present invention will be described in detail with reference to the drawings. This exemplary embodiment is the basis for the exemplary embodiments described later.

(Overview of information processing device 1)
The information processing device 1 according to this exemplary embodiment is a device that determines a reward function including weighting factors and feature amount parameters by inverse reinforcement learning using reference data.
Here, reference data refers to data referred to in the inverse reinforcement learning, and includes, as an example, sets of state data and action data. For example, the reference data may include state data representing the state of a certain system and action data representing an action taken by a specific expert in that state. As an example, the reference data τ can be represented by {τ_1, τ_2, ..., τ_N} (where τ_i = ((s_1, a_1), (s_2, a_2), ..., (s_N, a_N))). Here, N is an arbitrary natural number, s_i (i = 1 to N) represents state data indicating the state of the system, and a_i (i = 1 to N) represents the action data selected in the state indicated by that state data. Thus, the reference data can include, as an example, one or more sets of state data and action data. Some or all of the data included in the reference data are also called explanatory variables, which are the arguments of the reward function.

Note that the action data is not limited to data indicating an action taken by a specific expert, and may be data indicating an action taken by any subject that executes the action associated with the action data a_i; for example, it may be data indicating an action taken by a robot.

In the following description, unless confusion arises, state data may be referred to simply as a state, and action data may be referred to simply as an action.

In this exemplary embodiment, inverse reinforcement learning refers to learning for determining a reward function. In the inverse reinforcement learning according to this exemplary embodiment, the reward function is determined by referring to the reference data and updating the feature amount parameters included in the reward function as operation targets. In the inverse reinforcement learning according to this exemplary embodiment, the weighting factors included in the reward function may also be updated as operation targets with reference to the reference data.

Here, the reward function is, as an example, a function for evaluating the value of each of various actions. The reward function includes weighting factors and feature amount parameters as its parameters. A weighting factor is, as an example, a weight by which each of one or more feature values included in the reward function is multiplied. A feature amount parameter is, as an example, a parameter that characterizes one or more feature values included in the reward function.
A simple example of the reward function Reward is as follows.
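Equations 1 and 2 appear only as images in the published application. A form consistent with the surrounding description (the reward is the sign-inverted cost, and the cost is a weighted sum of terms in the explanatory variables) would be, as an assumed reconstruction rather than the filed equations:

```latex
\[
\begin{aligned}
\mathrm{Reward} &= -\,\mathrm{Cost} && \text{(Equation 1)}\\
\mathrm{Cost} &= \lambda_1 x_1 + \lambda_2 x_2 + \lambda_3 x_3 && \text{(Equation 2)}
\end{aligned}
\]
```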
The variables in Equation 1 and Equation 2 are as follows.
Reward: reward function
Cost: cost function
λ1, λ2, λ3: weighting factors
x1, x2, x3: explanatory variables
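As a concrete illustration of the cost-term structure described above, the following minimal Python sketch evaluates a reward of this form. The threshold-type third feature and its parameter p are hypothetical additions used only to show what the later embodiments treat as a feature amount parameter; they are not part of the published example.

```python
# Sketch: reward = -cost, cost = weighted sum of feature values (cf. Equations 1 and 2).
import numpy as np

def cost(x, lam, p):
    # x = (x1, x2, x3): explanatory variables (state/action data)
    # lam = (lam1, lam2, lam3): weighting factors
    # p: hypothetical feature amount parameter characterizing the third cost term
    f1 = x[0]
    f2 = x[1]
    f3 = max(x[2] - p, 0.0)   # feature value shaped by parameter p
    return lam[0] * f1 + lam[1] * f2 + lam[2] * f3

def reward(x, lam, p):
    return -cost(x, lam, p)   # the reward is the sign-inverted cost

print(reward(np.array([1.0, 2.0, 3.0]), np.array([0.5, 0.3, 0.2]), p=2.5))
```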
Here, the above exemplary explanatory variables x1, x2, and x3 are variables that can each correspond to either state data (s_i) or action data (a_i). As in this example, an explanatory variable itself may constitute a feature value, or a function of an explanatory variable may constitute a feature value.

Also, as shown in Equation 1, in this exemplary embodiment the reward function is the cost function with its sign inverted. Therefore, the smaller the cost, the larger the reward.

Also, as shown in Equation 2, the cost function includes one or more cost terms each including a feature value represented using explanatory variables and a weighting factor representing the weight of that feature value, and at least one of the one or more cost terms includes, together with the explanatory variables, the feature amount parameter that characterizes that cost term.
Also, when the weighting coefficients and feature values in Equation 1 are collected into a weighting coefficient vector θ and a feature amount vector f_τ, respectively, the reward function can be expressed as Reward(τ) = θ^T f_τ. Here, "T" represents the transpose of a vector. θ is sometimes called the weighting coefficient vector, or simply the parameter, and f_τ is sometimes called the feature amount vector.
In the above reward function, the information processing device 1 determines the reward function Reward by using the reference data τ to update, as an example, the feature amount parameters.
(Configuration of information processing device 1)
The configuration of the information processing device 1 according to this exemplary embodiment will be described with reference to FIG. 1. FIG. 1 is a block diagram showing the configuration of the information processing device 1 according to this exemplary embodiment.

As shown in FIG. 1, the information processing device 1 includes an acquisition unit 11 and a determination unit 12. In this exemplary embodiment, the acquisition unit 11 and the determination unit 12 are the components that implement acquisition means and determination means, respectively.

The acquisition unit 11 acquires reference data. The acquisition unit 11 supplies the acquired reference data to the determination unit 12.

The determination unit 12 determines the reward function including the weighting factors and the feature amount parameters by inverse reinforcement learning that uses the reference data and includes the feature amount parameters as operation targets.

As described above, the information processing device 1 according to this exemplary embodiment employs a configuration including the acquisition unit 11 that acquires reference data and the determination unit 12 that determines a reward function including weighting factors and feature amount parameters by inverse reinforcement learning that uses the reference data and includes the feature amount parameters as operation targets. Because the feature amount parameters that determine the feature values are included in the operation targets, the output of a prediction model or the like can be adopted as a feature value. Therefore, the information processing device 1 according to this exemplary embodiment can generate a more appropriate reward function.
(Flow of information processing method S1)
The flow of the information processing method S1 according to this exemplary embodiment will be described with reference to FIG. 2. FIG. 2 is a flow diagram showing the flow of the information processing method S1 according to this exemplary embodiment.

(Step S11)
In step S11, the acquisition unit 11 acquires reference data. The acquisition unit 11 supplies the acquired reference data to the determination unit 12.

(Step S12)
In step S12, the determination unit 12 determines the reward function including the weighting factors and the feature amount parameters by inverse reinforcement learning that uses the reference data supplied from the acquisition unit 11 and includes the feature amount parameters as operation targets.

As described above, in the information processing method S1 according to this exemplary embodiment, the acquisition unit 11 acquires reference data in step S11, and the determination unit 12 determines, in step S12, the reward function including the weighting factors and the feature amount parameters by inverse reinforcement learning that uses the reference data supplied from the acquisition unit 11 and includes the feature amount parameters as operation targets. Therefore, the information processing method S1 according to this exemplary embodiment provides the same effects as the information processing device 1.
(Overview of information processing device 2)
The information processing device 2 according to this exemplary embodiment is a device that generates output data according to target data by solving an optimization problem using the target data and a reward function determined by inverse reinforcement learning. Here, the reward function and inverse reinforcement learning are as described above.

In this exemplary embodiment, the target data includes at least part of state data representing the state of a certain system and action data indicating an action taken by a specific expert in that state.

Here, solving the optimization problem means maximizing the reward function by manipulating the data to be manipulated, with the target data as input.

As an example, the target data TD can be represented by {s_1, s_2, ..., s_N}, and the manipulated data MD can be represented by {a_1, a_2, ..., a_N}. Here, s_i (i = 1 to N) represents state data indicating the state of the system, and a_i (i = 1 to N) represents action data that can be selected in the state indicated by the target data TD. In this example, the information processing device 2 according to this exemplary embodiment maximizes a reward function whose explanatory variables are the target data TD and the manipulated data MD by manipulating the manipulated data MD. In other words, the information processing device 2 solves the optimization problem with the target data as input and generates, as output data, the manipulated data that maximizes the reward function.
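A minimal sketch of this generation step, assumed rather than taken from the publication, is shown below: the target data TD (states) is held fixed, candidate manipulated data MD (action sequences) are enumerated, and the sequence that maximizes a linear reward θ·f(TD, MD) is returned. The two feature terms and the candidate action grid are hypothetical.

```python
# Sketch: generate output data by maximizing theta . f over the manipulated data.
import itertools
import numpy as np

def feature_vector(states, actions):
    states, actions = np.asarray(states), np.asarray(actions)
    return np.array([
        -np.sum(np.abs(actions)),           # effort-like feature term
        -np.sum((actions - states) ** 2),   # tracking-like feature term
    ])

def generate_output(states, candidate_actions, theta, horizon):
    best_actions, best_reward = None, -np.inf
    for actions in itertools.product(candidate_actions, repeat=horizon):
        r = float(theta @ feature_vector(states, actions))
        if r > best_reward:
            best_actions, best_reward = actions, r
    return best_actions

theta = np.array([0.3, 1.0])   # e.g. a weighting coefficient vector determined by IRL
plan = generate_output(states=[0.2, 0.8, 0.5],
                       candidate_actions=[0.0, 0.5, 1.0], theta=theta, horizon=3)
print(plan)
```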
(Configuration of information processing device 2)
The configuration of the information processing device 2 according to this exemplary embodiment will be described with reference to FIG. 3. FIG. 3 is a block diagram showing the configuration of the information processing device 2 according to this exemplary embodiment.

As shown in FIG. 3, the information processing device 2 includes an acquisition unit 11 and a generation unit 22. In this exemplary embodiment, the acquisition unit 11 and the generation unit 22 are the components that implement acquisition means and generation means, respectively.

The acquisition unit 11 acquires target data. The acquisition unit 11 supplies the acquired target data to the generation unit 22.

The generation unit 22 generates output data according to the target data by solving an optimization problem using the target data acquired by the acquisition unit 11 and a reward function including weighting factors and feature amount parameters, the reward function being determined by inverse reinforcement learning that includes the feature amount parameters as operation targets.

Here, as an example of the reward function determined by inverse reinforcement learning, the reward function determined by the determination unit 12 of the information processing device 1 described above can be used.

As described above, the information processing device 2 according to this exemplary embodiment employs a configuration including the acquisition unit 11 that acquires target data and the generation unit 22 that generates output data according to the target data by solving an optimization problem using the target data acquired by the acquisition unit 11 and a reward function including weighting factors and feature amount parameters, the reward function being determined by inverse reinforcement learning that includes the feature amount parameters as operation targets.

Therefore, since the information processing device 2 according to this exemplary embodiment solves the optimization problem using a reward function determined by inverse reinforcement learning that includes the feature amount parameters as operation targets, it can generate output data that maximizes a more appropriate reward function.
(Flow of information processing method S2)
The flow of the information processing method S2 according to this exemplary embodiment will be described with reference to FIG. 4. FIG. 4 is a flow diagram showing the flow of the information processing method S2 according to this exemplary embodiment.

(Step S21)
In step S21, the acquisition unit 11 acquires target data.

(Step S22)
In step S22, the generation unit 22 generates output data according to the target data by solving an optimization problem using the target data acquired by the acquisition unit 11 and a reward function including weighting factors and feature amount parameters, the reward function being determined by inverse reinforcement learning that includes the feature amount parameters as operation targets.

Here, as an example of the reward function determined by inverse reinforcement learning, the reward function determined in step S12 of the information processing method S1 described above can be used.

As described above, in the information processing method S2 according to this exemplary embodiment, the acquisition unit 11 acquires target data in step S21, and the generation unit 22 generates, in step S22, output data according to the target data by solving an optimization problem using the target data acquired by the acquisition unit 11 and a reward function including weighting factors and feature amount parameters, the reward function being determined by inverse reinforcement learning that includes the feature amount parameters as operation targets. Therefore, the information processing method S2 according to this exemplary embodiment provides the same effects as the information processing device 2.
[Exemplary embodiment 2]
A second exemplary embodiment of the present invention will be described in detail with reference to the drawings. Components having the same functions as the components described in exemplary embodiment 1 are denoted by the same reference signs, and descriptions thereof are omitted as appropriate.

(Overview of information processing device 3)
The information processing device 3 according to this exemplary embodiment is a device that determines a reward function including a weighting factor WF and a feature amount parameter FP by inverse reinforcement learning using reference data RD. The information processing device 3 also displays information corresponding to at least one of the determined weighting factor WF, the feature amount parameter FP, and the reward function.

The reference data, inverse reinforcement learning, reward function, weighting factor, and feature amount parameter are as described above.
(Configuration of information processing device 3)
The configuration of the information processing device 3 according to this exemplary embodiment will be described with reference to FIG. 5. FIG. 5 is a block diagram showing the configuration of the information processing device 3 according to this exemplary embodiment.

As shown in FIG. 5, the information processing device 3 includes a storage unit 31, an input unit 32, an output unit 33, a communication unit 34, and a control unit 35.

The storage unit 31 is a memory that stores various data referred to by the control unit 35, which will be described later. Examples of the data stored in the storage unit 31 include the reference data RD, the weighting factor WF, and the feature amount parameter FP. As an example of the reference data RD, decision-making history data (trajectories) of an expert received by the input unit 32, which will be described later, may be stored. The storage unit 31 may also store candidates for the feature values of the reward function that the determination unit 12 uses for learning. However, the feature value candidates do not necessarily have to be feature values actually used in the reward function.

The storage unit 31 may also store a mathematical optimization solver for realizing the processing by the determination unit 12. The content of the mathematical optimization solver is arbitrary and may be determined according to the environment and apparatus in which it is executed.

The input unit 32 accepts various data input to the information processing device 3. For example, the input unit 32 may accept input of the expert decision-making history data described above (specifically, pairs of states and actions). The input unit 32 may also accept input of the initial state and the constraint conditions used when the inverse reinforcement learning described later is performed.

The input unit 32 includes, as an example, input devices such as a keyboard, a mouse, and a touch panel. The input unit 32 may also function as an interface for acquiring data from other connected devices. In this configuration, the input unit 32 supplies data acquired from other devices to the control unit 35, which will be described later.

The output unit 33 is a component that outputs the results of computation by the information processing device 3. As an example, the output unit 33 includes a display panel (display unit) and displays the computation results on the display panel. The output unit 33 may also function as an interface that outputs data to other connected devices. In this configuration, the output unit 33 outputs data supplied from the control unit 35, which will be described later, to other connected devices.

The communication unit 34 is a communication module that communicates with other devices via a network (not shown). As an example, the communication unit 34 outputs data supplied from the control unit 35, which will be described later, to other devices via the network, and acquires data output from other devices via the network and supplies it to the control unit 35.

The specific configuration of the network does not limit this embodiment; as an example, a wireless LAN (Local Area Network), a wired LAN, a WAN (Wide Area Network), a public line network, a mobile data communication network, or a combination of these networks can be used.

In this exemplary embodiment, the computation results are displayed via at least one of the output unit 33 and the communication unit 34.
(Control unit 35)
The control unit 35 controls each unit included in the information processing device 3. As an example, the control unit 35 stores data acquired from the input unit 32 or the communication unit 34 in the storage unit 31, and supplies data stored in the storage unit 31 to the output unit 33 or the communication unit 34.

As shown in FIG. 5, the control unit 35 also functions as the acquisition unit 11, the determination unit 12, and the display control unit 13. In this exemplary embodiment, the acquisition unit 11, the determination unit 12, and the display control unit 13 are the components that implement acquisition means, determination means, and first display means, respectively.

The acquisition unit 11 acquires the reference data RD via the input unit 32 or the communication unit 34. The acquisition unit 11 stores the acquired reference data RD in the storage unit 31.

The determination unit 12 acquires the reference data RD stored in the storage unit 31 and determines the reward function including the weighting factor WF and the feature amount parameter FP by inverse reinforcement learning that uses the reference data RD and includes the feature amount parameter FP as an operation target. The determination unit 12 stores the determined feature amount parameter FP in the storage unit 31.

The operation targets in the inverse reinforcement learning by the determination unit 12 may also include the weighting factor WF included in at least one of the one or more cost terms. The determination unit 12 stores the updated weighting factor WF in the storage unit 31.

An example of the processing executed by the determination unit 12 will be described later.

The display control unit 13 displays, via the output unit 33, information corresponding to at least one of the weighting factor WF, the feature amount parameter FP, and the reward function.

An example of the processing executed by the display control unit 13 will be described later.
<Explanation of problem setting and method>
In the following, to facilitate understanding, the problem setting and method of the maximum entropy inverse reinforcement learning according to this exemplary embodiment are described first. Maximum entropy inverse reinforcement learning (ME-IRL) assumes the following problem setting: from expert data D = {τ_1, τ_2, ..., τ_N} (where τ_i = ((s_1, a_1), (s_2, a_2), ..., (s_N, a_N))), a single reward function R(s, a) = θ·f(s, a) is estimated. In ME-IRL, the decision-making of the expert can be reproduced by estimating θ.

Here, θ is a weighting coefficient vector whose components are the weighting factors WF. Also, f(s, a) is a feature amount vector and can include a plurality of terms corresponding to the respective feature values. The total number of weighting factors WF included in the weighting coefficient vector θ is determined according to the number of components of the feature amount vector f(s, a).
Next, the ME-IRL method is described. In ME-IRL, a trajectory τ is represented by Equation A1 exemplified below, and the probability model representing the distribution p_θ(τ) of trajectories is represented by Equation A2 exemplified below. In Equation A2, θ^T f_τ represents the reward function (see Equation A3), and Z represents the sum of the rewards for all trajectories (see Equation A4).
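Equations A1 to A4 appear only as images in the published application. A standard maximum-entropy formulation consistent with the description above (a trajectory as a state-action sequence, a Boltzmann-type trajectory distribution whose exponent is the linear reward θ^T f_τ, and Z as the normalizing sum over all trajectories) would be, as an assumed reconstruction:

```latex
\[
\begin{aligned}
\tau &= \bigl((s_1, a_1), (s_2, a_2), \ldots, (s_N, a_N)\bigr) && \text{(A1)}\\
p_\theta(\tau) &= \frac{\exp\bigl(\theta^{\top} f_\tau\bigr)}{Z} && \text{(A2)}\\
R(\tau) &= \theta^{\top} f_\tau && \text{(A3)}\\
Z &= \sum_{\tau} \exp\bigl(\theta^{\top} f_\tau\bigr) && \text{(A4)}
\end{aligned}
\]
```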
Then, the update rule for the weights of the reward function by maximum likelihood estimation (specifically, gradient ascent) is represented by Equations A5 and A6 exemplified below. In Equation A5, α is the step size, and L(θ) is the distance measure between distributions used in ME-IRL.
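Equations A5 and A6 are likewise published only as images. The standard ME-IRL gradient-ascent update, which matches the remark that follows (the second term of A6 requires a summation over all possible trajectories), would be, as an assumed reconstruction:

```latex
\[
\begin{aligned}
\theta &\leftarrow \theta + \alpha\, \nabla_{\theta} L(\theta) && \text{(A5)}\\
\nabla_{\theta} L(\theta) &= \frac{1}{N}\sum_{n=1}^{N} f_{\tau_n} \;-\; \sum_{\tau} p_\theta(\tau)\, f_\tau && \text{(A6)}
\end{aligned}
\]
```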
The second term in Equation A6 is the sum of the rewards for all trajectories. ME-IRL assumes that the value of this second term can be calculated exactly. In reality, however, there is also the problem that it is difficult to calculate the sum of the rewards for all trajectories.

In addition, there is the problem that merely updating the weighting coefficient vector θ, as in Equation A5, leaves room for improvement in terms of generating an appropriate reward function.

In the maximum entropy inverse reinforcement learning according to this exemplary embodiment, at least one of the plurality of terms of the feature amount vector f(s, a) includes a feature amount parameter FP that characterizes that term. In addition, in the maximum entropy inverse reinforcement learning according to this exemplary embodiment, not only θ described above but also the feature amount parameter FP is a target of estimation. Therefore, the maximum entropy inverse reinforcement learning according to this exemplary embodiment could also be called improved maximum entropy inverse reinforcement learning. However, to avoid complicating the terminology, this "improved maximum entropy inverse reinforcement learning" is hereinafter also simply referred to as "maximum entropy inverse reinforcement learning (ME-IRL)".
(Example of processing executed by the determination unit 12)
Next, an example of the processing executed by the determination unit 12 will be described.

The determination unit 12 sets the feature values of the reward function from the reference data including states and actions. As an example, the determination unit 12 may set the feature values of the reward function such that the gradient of the tangent is finite over the entire function, so that the Wasserstein distance can be used as the distance measure between distributions in the inverse reinforcement learning process. The determination unit 12 may also, for example, set the feature values of the reward function so as to satisfy the Lipschitz continuity condition.

For example, let f_τ be the feature amount vector of a trajectory τ. When the reward function θ^T f_τ is linear, if the mapping F: τ → f_τ is Lipschitz continuous, then θ^T f_τ is also Lipschitz continuous. Therefore, the determination unit 12 may set the feature values such that the reward function is a linear function.
Note that, for example, Equation 4 exemplified below has an infinite gradient at a_0, and can therefore be said to be an inappropriate reward function in the present disclosure.

The determination unit 12 may, for example, determine a reward function whose feature values are set in accordance with a user instruction; in this case, the acquisition unit 11 may acquire a reward function that satisfies the Lipschitz continuity condition via the input unit 32 or the communication unit 34.

The determination unit 12 may also be configured to initialize the weighting factor WF. The method by which the determination unit 12 initializes the weighting factor WF is not particularly limited, and the weighting factor WF may be initialized based on any method predetermined according to the user or the like.
The determination unit 12 also derives a trajectory τ^ (τ^ denotes τ with a superscript hat) that minimizes the distance between the probability distribution of the reference data RD and the probability distribution of the optimal solution determined based on the optimized parameters (of the reward function). Specifically, the determination unit 12 uses the Wasserstein distance as the distance measure between distributions and estimates the expert trajectory τ^ by performing mathematical optimization so as to minimize that Wasserstein distance.

The Wasserstein distance is defined by Equation 5 exemplified below. That is, the Wasserstein distance represents the distance between the probability distribution of the expert trajectories and the probability distribution of the trajectories determined based on the parameters of the reward function. Note that, because of the constraints of the Wasserstein distance, the reward function θ^T f_τ needs to be a function that satisfies the Lipschitz continuity condition. In this exemplary embodiment, since the determination unit 12 sets the feature values of the reward function so as to satisfy the Lipschitz continuity condition, it becomes possible to use the Wasserstein distance as exemplified below.
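Equation 5 is published only as an image. A form consistent with the description around it (a first term over the N expert trajectories, a second term over the trajectories τ_θ(n) optimized under the current parameter θ, and a value of 0 or less) would be, as an assumed reconstruction:

```latex
\[
W(\theta) \;=\; \frac{1}{N}\sum_{n=1}^{N} \theta^{\top} f_{\tau_n}
\;-\; \frac{1}{N}\sum_{n=1}^{N} \theta^{\top} f_{\tau_{\theta}(n)}
\qquad \text{(Equation 5)}
\]
```

Under this reading, each τ_θ(n) attains a reward at least as large as that of the corresponding expert trajectory, so W(θ) ≤ 0, and increasing W(θ) corresponds to bringing the two distributions closer together, as stated below.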
The Wasserstein distance defined by Equation 5 exemplified above takes a value of 0 or less, and increasing this value corresponds to bringing the distributions closer together. In the second term of Equation 5, τ_θ(n) represents the n-th trajectory optimized with the parameter θ. The second term of Equation 5 is a term that can also be calculated in a combinatorial optimization problem. Therefore, by using the Wasserstein distance exemplified in Equation 5 as the distance measure between distributions, inverse reinforcement learning that is applicable even to mathematical optimization problems such as combinatorial optimization problems can be performed.

The determination unit 12 also updates the parameter θ of the reward function and the feature amount parameter FP so as to maximize the distance measure between the distributions, based on the estimated expert trajectory τ^. Here, in maximum entropy inverse reinforcement learning (that is, ME-IRL), the trajectory τ is considered to follow a Boltzmann distribution according to the maximum entropy principle. Therefore, as in ME-IRL, the determination unit 12 updates the parameter θ of the reward function and the feature amount parameter FP, based on the estimated expert trajectory τ^, so as to maximize the log-likelihood of the Boltzmann distribution derived from the maximum entropy principle.

As described above, the second term in Equation A6 is the sum of the rewards for all trajectories. ME-IRL assumes that the value of this second term can be calculated exactly. In reality, however, there is the problem that it is difficult to calculate the sum of the rewards for all trajectories.
Therefore, the determination unit 12 sets a lower bound of the log-likelihood represented using the reward function, and updates the operation targets (the parameter θ and the feature amount parameter FP) so as to maximize that lower bound. As an example, the determination unit 12 sets the lower bound of L(θ) as in Equation 6 exemplified below; hereinafter, this quantity is also referred to as the lower bound of the log-likelihood.
In Equation 6, which expresses the lower bound of the log-likelihood, the second term is the maximum reward value for the current parameter θ, and the third term is the logarithm of the number of possible trajectories (Nτ). In this way, based on the log-likelihood of ME-IRL, the determination unit 12 derives the lower bound of the log-likelihood, which is calculated by subtracting the maximum reward value for the current parameter θ and the logarithm of the number of possible trajectories (Nτ) from the term corresponding to the probability distribution of the trajectories.
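For reference, a lower bound of this kind can be derived from the standard ME-IRL log-likelihood as follows; this reconstruction is consistent with the description above (second term: maximum reward, third term: log of the number of possible trajectories), but it is not necessarily identical to the patent's Equation 6.

    \[
    L(\theta)
      = \frac{1}{N}\sum_{n=1}^{N} \theta^{\mathsf{T}} f_{\tau_E^{(n)}}
        - \log \sum_{\tau} \exp\bigl(\theta^{\mathsf{T}} f_{\tau}\bigr)
      \;\ge\;
      \frac{1}{N}\sum_{n=1}^{N} \theta^{\mathsf{T}} f_{\tau_E^{(n)}}
        - \max_{\tau} \theta^{\mathsf{T}} f_{\tau}
        - \log N_{\tau}
      \;=:\; \underline{L}(\theta),
    \]
    \[
    \text{since}\quad
    \sum_{\tau} \exp\bigl(\theta^{\mathsf{T}} f_{\tau}\bigr)
      \le N_{\tau}\,\exp\bigl(\max_{\tau} \theta^{\mathsf{T}} f_{\tau}\bigr).
    \]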
Further, the determination unit 12 may use a form obtained by transforming the derived expression for the lower bound of the ME-IRL log-likelihood into an expression in which an entropy regularization term is subtracted from the Wasserstein distance. The expression obtained by decomposing the lower bound of the ME-IRL log-likelihood into the Wasserstein distance and the entropy regularization term is given by Equation 7 exemplified below.
The expression in the first parentheses of Equation 7 represents the Wasserstein distance. That is, the Wasserstein distance represents the distance between the probability distribution of the expert trajectories and the probability distribution of the trajectories determined based on the parameters of the reward function. Because of the constraints of the Wasserstein distance, the reward function θ^T f_τ must satisfy the Lipschitz continuity condition. In this exemplary embodiment, the determination unit 12 sets the features of the reward function so as to satisfy the Lipschitz continuity condition, which makes it possible to use the Wasserstein distance.
The expression in the second parentheses of Equation 7 represents the entropy regularization term, which contributes to increasing the log-likelihood of the Boltzmann distribution derived from the maximum entropy principle. Specifically, in the entropy regularization term exemplified in Equation 7 (that is, the expression in the second parentheses of Equation 7), the first term represents the maximum reward value for the current parameter θ, and the second term represents the average reward value for the current parameter θ.
In this way, the inverse reinforcement learning performed by the determination unit 12 includes update processing that updates the manipulated variables (the parameter θ and the feature parameter FP) so as to maximize the lower bound of the log-likelihood expressed using the reward function. As shown in Equation 7, the lower bound of the log-likelihood in the update processing is expressed using the Wasserstein distance, which represents the distance between the reference probability distribution and the probability distribution expressed using the reward function, and a regularization term, which represents the difference between the maximum value and the average value of the reward function.
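Written out under the same assumptions as the reconstruction above, the decomposition into a Wasserstein-style term and an entropy regularization term takes the following form, in which f_{τ_θ^(n)} denotes the feature vector of the n-th trajectory optimized under the current parameters (again a sketch consistent with the prose, not the patent's exact Equation 7):

    \[
    \underline{L}(\theta)
      = \Bigl(\frac{1}{N}\sum_{n} \theta^{\mathsf{T}} f_{\tau_E^{(n)}}
              - \frac{1}{N}\sum_{n} \theta^{\mathsf{T}} f_{\tau_\theta^{(n)}}\Bigr)
        - \Bigl(\max_{\tau} \theta^{\mathsf{T}} f_{\tau}
              - \frac{1}{N}\sum_{n} \theta^{\mathsf{T}} f_{\tau_\theta^{(n)}}\Bigr)
        - \log N_{\tau}.
    \]

Here the first parenthesis corresponds to the Wasserstein distance and the second parenthesis to the entropy regularization term (maximum reward minus average reward); the two mean-reward terms cancel, so this agrees with the lower bound sketched above.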
Here, the reason why the second term in the second parentheses of Equation 7 functions as an entropy regularization term is explained. To maximize the lower bound of the ME-IRL log-likelihood, the value of this second term must be made small, which corresponds to reducing the difference between the maximum reward value and the average value. A smaller difference between the maximum reward value and the average value indicates smaller variation among the trajectories.
In other words, a smaller difference between the maximum reward value and the average value means an increase in entropy; the entropy regularization therefore takes effect and contributes to maximizing the entropy. This in turn contributes to maximizing the log-likelihood of the Boltzmann distribution and, as a result, to resolving the indeterminacy in inverse reinforcement learning.
Based on Equation 7 shown above, the determination unit 12, for example, fixes the estimated trajectory τ̂ and updates the parameter θ and the feature parameter FP by gradient ascent. However, ordinary gradient ascent may fail to converge. In the entropy regularization term, the feature vector of the trajectory with the maximum reward value (f_τθmax) never coincides with the average of the feature vectors of the other trajectories (f_τ(n)) (that is, their difference does not become 0). Consequently, with ordinary gradient ascent the log-likelihood oscillates without converging, making the procedure unstable and making it difficult to perform an appropriate convergence test (for the update of the parameter θ, see Equation 8 below).
Therefore, when using the gradient method, the determination unit 12 may update the parameter θ and the feature parameter FP while gradually attenuating the part that contributes to the entropy regularization (that is, the part corresponding to the entropy regularization term). In other words, the lower bound of the log-likelihood may include a damping coefficient that is multiplied by the regularization term and that attenuates the contribution of the regularization term as the update processing is repeated.
Specifically, the determination unit 12 defines an update rule in which a damping coefficient βt indicating the degree of attenuation is applied to the part contributing to the entropy regularization. For example, the determination unit 12 differentiates Equation 7 above with respect to θ and defines Equation 9, exemplified below, in which, of the part corresponding to the term indicating the Wasserstein distance (that is, the part contributing to increasing the Wasserstein distance) and the part corresponding to the entropy regularization term, the damping coefficient is applied to the part corresponding to the entropy regularization term.
The damping coefficient is defined in advance according to how the part corresponding to the entropy regularization term is to be attenuated. For example, for smooth attenuation, βt is defined as in Equation 10 exemplified below.
In Equation 10, β1 is set to 1, β2 is set to a value of 0 or greater, and t denotes the number of iterations. As a result, the damping coefficient βt functions as a coefficient that reduces the part corresponding to the entropy regularization term as the number of iterations t increases.
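A minimal Python sketch of this damped update is shown below. The decay schedule beta_t = beta1 / (1 + beta2 * t) is only one smooth form consistent with the description of Equation 10 (beta1 = 1, beta2 >= 0), and the gradient arguments are placeholders for the Wasserstein part and the entropy-regularization part of Equation 9.

    import numpy as np

    def damping_coefficient(t, beta1=1.0, beta2=0.1):
        # Decays toward 0 as the iteration count t grows; beta2 = 0 disables the decay.
        return beta1 / (1.0 + beta2 * t)

    def damped_gradient_step(theta, grad_wasserstein, grad_entropy_reg, t, lr=0.01):
        # The part of the gradient corresponding to the entropy regularization term is
        # attenuated by beta_t, while the Wasserstein part is kept at full strength.
        beta_t = damping_coefficient(t)
        return np.asarray(theta) + lr * (np.asarray(grad_wasserstein)
                                         + beta_t * np.asarray(grad_entropy_reg))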
Since the Wasserstein distance induces a weaker topology than the log-likelihood, which is a KL divergence, bringing the log-likelihood close to 0 also brings the Wasserstein distance close to 0. Therefore, the determination unit 12 may update the parameter θ and the feature parameter FP without attenuating the part corresponding to the entropy regularization term in the initial stage of the update, and may then update the parameter θ and the feature parameter FP while reducing the influence of the part corresponding to the entropy regularization term at the timing when the log-likelihood starts to oscillate.
Specifically, using Equation 9 shown above, the determination unit 12 updates the parameter θ and the feature parameter FP with the damping coefficient βt = 1 in the initial stage. Thereafter, the determination unit 12 may change the damping coefficient to βt = 0 at the timing when the log-likelihood starts to oscillate, thereby eliminating the influence of the part corresponding to the entropy regularization term when updating the parameter θ and the feature parameter FP.
The determination unit 12 may determine, for example, that the log-likelihood has started to oscillate when the moving average of the log-likelihood becomes constant. Specifically, the determination unit 12 may determine that the moving average has become constant when the change in the moving average of the lower bound of the log-likelihood over a time window (several points from the current value into the past) is minute (for example, 1e-3 or less).
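One possible implementation of this oscillation test, using the 1e-3 tolerance mentioned above and an assumed window length, is the following sketch:

    def moving_average_is_flat(lower_bound_history, window=5, tol=1e-3):
        # lower_bound_history: values of the lower bound of the log-likelihood, newest last.
        # Returns True when the moving average over the last `window` values has
        # essentially stopped changing, i.e. the lower bound is judged to be
        # oscillating around a constant level.
        if len(lower_bound_history) < window + 1:
            return False
        previous_average = sum(lower_bound_history[-window - 1:-1]) / window
        current_average = sum(lower_bound_history[-window:]) / window
        return abs(current_average - previous_average) < tol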
Alternatively, at the timing when the log-likelihood starts to oscillate, the determination unit 12 may first change the damping coefficient as in Equation 10 shown above, instead of immediately setting βt = 0. Then, after this change, the determination unit 12 may change the damping coefficient to βt = 0 at the timing when the log-likelihood starts to oscillate again. The timing at which oscillation starts is determined in the same way as described above.
Furthermore, after changing the damping coefficient as in Equation 10 shown above, the determination unit 12 may change the method of updating the parameter θ and the feature parameter FP at the timing when the log-likelihood starts to oscillate again. Specifically, the determination unit 12 may update the parameter θ and the feature parameter FP using a momentum method as exemplified in Equation 11 below. The values of γ1 and α in Equation 11 are determined in advance; for example, γ1 = 0.9 and α = 0.001 may be used.
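Assuming that Equation 11 has the standard momentum form, a sketch of the update with the stated constants γ1 = 0.9 and α = 0.001 is:

    def momentum_step(theta, velocity, gradient, gamma1=0.9, alpha=0.001):
        # The velocity accumulates past gradients, which smooths the oscillation
        # of the lower bound of the log-likelihood.
        velocity = gamma1 * velocity + alpha * gradient
        return theta + velocity, velocity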
Thereafter, the determination unit 12 repeats the trajectory estimation processing and the update processing of the parameter θ and the feature parameter FP until it determines that the lower bound of the log-likelihood has converged.
As an example of the processing by which the determination unit 12 determines that the lower bound of the log-likelihood has converged, the distance measure between the distributions may be judged to have converged when the absolute value of the lower bound of the log-likelihood becomes smaller than a predetermined threshold.
If the determination unit 12 judges that the distance measure between the distributions has not converged, it continues the trajectory estimation processing and the update processing of the parameter θ and the feature parameter FP. If the determination unit 12 judges that the distance measure between the distributions has converged, it ends the trajectory estimation processing and the update processing of the parameter θ and the feature parameter FP.
<More specific processing example>
In the following, the processing performed by the determination unit 12 described above is explained using a more concrete example. In the following example, the reward function (Reward) and the cost function (Cost) are given by Equations 12 and 13 below.
Each variable in Equations 12 and 13 has the following meaning, as described in Exemplary Embodiment 1.
Reward: reward function
Cost: cost function
λ1, λ2, λ3: weighting factors
x1, x2, x3: explanatory variables
Here, each of the above exemplary explanatory variables x1, x2, and x3 is a variable that can correspond to either the state data (si) or the action data (ai). As in this example, an explanatory variable itself may constitute a feature, or a function of explanatory variables may constitute a feature. In this example, a case is described in which inverse reinforcement learning is performed with the weighting factor λ1, the weighting factor λ2, and the feature parameter as the manipulated variables.
In the setting described above, once the determination unit 12 has derived the lower bound of the log-likelihood used in Equation 9 described above, it updates the manipulated variables of the reward function. For example, when the reward function is given by Equation 12 below, the determination unit 12 updates the weighting factor λ1, the weighting factor λ2, and the feature parameter using Equations 14 to 17 below. Here, βt is the coefficient defined by Equation 10 described above, and α is a parameter indicating the learning rate.
In this way, the determination unit 12 updates the parameter θ of the reward function and the feature parameter FP so as to maximize the log-likelihood of the Boltzmann distribution derived from the maximum entropy principle. In this update processing, inverse reinforcement learning is performed with not only the weighting factors but also the feature parameter as update targets, so a more appropriate reward function can be generated.
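Since Equations 12 through 17 themselves are not reproduced in this text, the following sketch is purely illustrative: it assumes a hypothetical reward of the form Reward = λ1*x1 + λ2*max(0, x2 - p) + λ3*x3 with a feature parameter p, and moves λ1, λ2, and p in the direction that increases a user-supplied lower bound of the log-likelihood via a numerical gradient. It demonstrates only the mechanism of treating the feature parameter as a manipulated variable alongside the weighting factors; it is not the update rule of Equations 14 to 17.

    def hypothetical_reward(params, x1, x2, x3, lam3=1.0):
        # params: {"lam1": ..., "lam2": ..., "p": ...}; lam3 is treated as fixed here.
        return params["lam1"] * x1 + params["lam2"] * max(0.0, x2 - params["p"]) + lam3 * x3

    def update_weights_and_feature_parameter(params, lower_bound_fn, lr=0.01, eps=1e-4):
        # lower_bound_fn(params) returns the lower bound of the log-likelihood.
        # Both the weighting factors and the feature parameter p are updated.
        updated = dict(params)
        for key in params:
            plus, minus = dict(params), dict(params)
            plus[key] = params[key] + eps
            minus[key] = params[key] - eps
            gradient = (lower_bound_fn(plus) - lower_bound_fn(minus)) / (2.0 * eps)
            updated[key] = params[key] + lr * gradient
        return updated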
(Flow of the information processing method S3)
Next, the flow of the information processing method S3 according to this exemplary embodiment will be described with reference to FIG. 6. FIG. 6 is a flowchart showing the flow of the information processing method S3 according to this exemplary embodiment.
(Step S31)
In step S31, the acquisition unit 11 acquires the reference data RD via the input unit 32 or the communication unit 34. The acquisition unit 11 stores the acquired reference data RD in the storage unit 31. Since the reference data RD has been described above, its description is omitted here.
(Step S32)
In step S32, the determination unit 12 initializes, among the parameters included in the reward function, the weighting factors and the feature parameters that are the manipulated variables in the inverse reinforcement learning. As an example, the determination unit 12 may initialize these weighting factors and feature parameters using initial values stored in the storage unit 31.
(Step S33)
In step S33, the determination unit 12 performs mathematical optimization so as to minimize the Wasserstein distance. As an example, the determination unit 12 estimates the trajectory that minimizes the Wasserstein distance, which represents the distance between the probability distribution of the expert trajectories and the probability distribution of the trajectories determined based on the parameters of the reward function.
(Step S34)
In step S34, the determination unit 12 updates the parameter θ of the reward function and the feature parameter FP so as to maximize the log-likelihood of the Boltzmann distribution derived from the maximum entropy principle. Since a specific example of this update processing has been described above, its description is omitted here.
(Step S35)
In step S35, the determination unit 12 determines whether the lower bound of the log-likelihood has converged. If it determines that the lower bound of the log-likelihood has converged (YES in S35), the process proceeds to step S36; otherwise (NO in S35), the process returns to step S33.
(Step S36)
If the determination unit 12 determines in step S35 that the lower bound of the log-likelihood has converged, the determination unit 12 outputs the reward function in step S36.
The parameters included in the reward function output by the determination unit 12 (the weighting factor WF and the feature parameter FP) are, as one example, stored in the storage unit 31.
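Putting steps S31 to S36 together, the overall learning loop can be sketched as follows. The callables estimate_trajectories and update_parameters stand for the Wasserstein-minimizing trajectory estimation and the lower-bound-maximizing update described above; they are placeholders, not interfaces defined in this document.

    def inverse_rl_loop(reference_data, init_weights, init_feature_params,
                        estimate_trajectories, update_parameters,
                        max_iterations=1000, tol=1e-3):
        # S32: initialize the manipulated variables (weighting factors and feature parameters).
        weights, feature_params = init_weights, init_feature_params
        for t in range(max_iterations):
            # S33: estimate trajectories minimizing the Wasserstein distance to the
            #      distribution of the expert (reference) trajectories.
            trajectories = estimate_trajectories(weights, feature_params, reference_data)
            # S34: update the weighting factors and the feature parameters so as to
            #      maximize the lower bound of the Boltzmann-distribution log-likelihood.
            weights, feature_params, lower_bound = update_parameters(
                weights, feature_params, trajectories, reference_data, t)
            # S35: convergence test on the lower bound of the log-likelihood.
            if abs(lower_bound) < tol:
                break
        # S36: output the parameters of the learned reward function.
        return weights, feature_params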
(Display example)
Next, a display example by the information processing device 3 according to this exemplary embodiment will be described with reference to FIG. 7. As described above, the output unit 33 may include a display panel (display unit) and display various kinds of information on the display panel. The information displayed on the display panel may include information corresponding to at least one of the weighting factor WF, the feature parameter FP, and the reward function. The display content displayed by the output unit 33 is, as one example, generated by the display control unit 13.
FIG. 7 is a diagram showing a display example generated by the display control unit 13. As shown in FIG. 7, a display screen may be generated that shows the relationship between the number of steps and the values of at least some of the manipulated parameters (the weighting factor WF and the feature parameter FP). In other words, a display screen may be generated that shows how the values of the manipulated parameters change as the number of steps of the update processing increases. The example shown in FIG. 7 is a display screen showing the relationship among the number of steps, the weighting factor λ1, and the feature parameter.
By displaying information corresponding to at least one of the weighting factor WF, the feature parameter FP, and the reward function as described above, the information processing device 3 according to this exemplary embodiment can suitably present to the user whether the inverse reinforcement learning is proceeding appropriately.
[Exemplary Embodiment 3]
A third exemplary embodiment of the present invention will be described in detail with reference to the drawings. Components having the same functions as those described in Exemplary Embodiments 1 and 2 are denoted by the same reference signs, and their description is omitted as appropriate.
(Configuration of the information processing device 4)
The configuration of the information processing device 4 according to this exemplary embodiment will be described with reference to FIG. 8. FIG. 8 is a block diagram showing the configuration of the information processing device 4 according to this exemplary embodiment.
As shown in FIG. 8, the information processing device 4 includes a control unit 45 instead of the control unit 35 included in the information processing device 3. In addition to the components of the control unit 35, the control unit 45 includes a generation unit 14.
As shown in FIG. 8, the information processing device 4 also includes a storage unit 41 instead of the storage unit 31 included in the information processing device 3. In addition to the various kinds of information stored in the storage unit 31, the storage unit 41 stores target data TD.
The acquisition unit 11 included in the information processing device 4 further acquires the target data TD in addition to the various kinds of data acquired by the acquisition unit 11 according to Exemplary Embodiment 2. The acquired target data TD is, as one example, stored in the storage unit 41 described above.
Here, in this exemplary embodiment, the target data TD includes at least part of state data representing the state of a certain system and action data representing actions taken by a specific expert in that state.
As an example, the target data TD can be represented by {s1, s2, ..., sN}, and the data MD to be manipulated can be represented by {a1, a2, ..., aN}. Here, si (i = 1 to N) represents state data indicating the state of the system, and ai (i = 1 to N) represents action data that can be selected in the state indicated by the state data si.
(Generation unit 14)
The generation unit 14 included in the information processing device 4 maximizes a reward function whose explanatory variables are the target data TD acquired by the acquisition unit 11 and the data MD to be manipulated, by manipulating the data MD. In other words, the information processing device 4 solves an optimization problem with the target data as input and generates, as output data, the manipulated data that maximizes the reward function.
In other words, the generation unit 14 generates output data corresponding to the target data by solving an optimization problem using (i) a reward function that includes a weighting factor and a feature parameter and that has been determined by inverse reinforcement learning in which the feature parameter is included among the manipulated variables, and (ii) the target data acquired by the acquisition unit 11.
Here, the reward function determined by the determination unit 12 through the processing described in Exemplary Embodiment 2 can be used as the above reward function.
As described above, the information processing device 4 according to this exemplary embodiment employs a configuration including the acquisition unit 11, which acquires target data, and the generation unit 14, which generates output data corresponding to the target data by solving an optimization problem using the target data acquired by the acquisition unit 11 and a reward function that includes a weighting factor and a feature parameter and that has been determined by inverse reinforcement learning in which the feature parameter is included among the manipulated variables.
Therefore, according to the information processing device 4 of this exemplary embodiment, the optimization problem is solved using a reward function determined by inverse reinforcement learning in which the feature parameter is included among the manipulated variables, so that output data maximizing a more appropriate reward function can be generated.
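As a hypothetical illustration of this generation step, the sketch below fixes the target data (here x1 and x2) and selects, from a finite candidate set, the value of the manipulated variable x3 that maximizes the learned reward; the variable names follow the example in Display example 1 below and are not part of this document's definitions.

    def generate_output(reward_fn, x1, x2, candidate_x3):
        # reward_fn(x1, x2, x3) is the learned reward function; x1 and x2 are fixed
        # target data; candidate_x3 is the set of admissible values of the manipulated data.
        return max(candidate_x3, key=lambda x3: reward_fn(x1, x2, x3))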
(Display example 1)
Next, a first display example by the information processing device 4 according to this exemplary embodiment will be described with reference to FIG. 9. As described above, the output unit 33 may include a display panel (display unit) and display various kinds of information on the display panel. In this exemplary embodiment, the information displayed on the display panel may include at least part of the data included in the output data generated by the generation unit 14.
FIG. 9 is a diagram showing a display example generated by the display control unit 13. The example shown in FIG. 9 is a display screen generated by the display control unit 13 in a case where:
- the reward function is given by Equation 12 described in Exemplary Embodiment 2,
- the weighting factors λ1, λ2, and λ3 and the feature parameter have been determined by inverse reinforcement learning by the determination unit 12,
- the target data TD includes x1 and x2, and
- the generation unit 14 has generated output data corresponding to the target data by solving the optimization problem with x3 as the data to be manipulated.
As shown in FIG. 9, the display screen generated by the display control unit 13 includes the values of the explanatory variables x1 and x2 included in the target data TD, and the value of the manipulated data x3 determined by the generation unit 14 in accordance with those values (that is, the value recommended to the user).
By displaying the output data generated by the generation unit 14 as described above, the information processing device 4 according to this exemplary embodiment can suitably present the solution of the optimization problem to the user.
(Display example 2)
Next, a second display example by the information processing device 4 according to this exemplary embodiment will be described with reference to FIG. 10. In this example, the acquisition unit 11 receives, from the user via the input unit 32, an input of at least one of an explanatory variable, a weighting factor, and a feature parameter. Then, as shown in the upper part of FIG. 10, the display control unit 13 may display the value of at least one of the explanatory variable, the weighting factor, and the feature parameter input by the user so that it can be compared with the value of at least one of the explanatory variables, the weighting factors, and the feature parameters obtained by inverse reinforcement learning using the reference data RD of one or more experts.
The display control unit 13 may also generate a GUI (Graphical User Interface) including operation objects that the user can operate, and cause the output unit 33 to display it. Such a GUI is shown in the lower left part of FIG. 10. By sliding a bar included in the GUI, the value of at least one of the explanatory variable, the weighting factor, and the feature parameter corresponding to that bar can be changed.
The display control unit 13 may also rank at least one of the explanatory variables, the weighting factors, and the feature parameters, and display these variables together with their ranks.
<Application example>
An application example of the information processing device 4 according to this exemplary embodiment will be described below with reference to FIG. 11.
In this application example, the information processing device 4 generates an operation plan relating to a water distribution plan for water supply infrastructure. The water supply infrastructure according to this exemplary embodiment includes, as an example, a plurality of sites such as reservoirs, distribution reservoirs, water intake facilities, water purification plants, water supply stations, and demand points. The operation plan includes, as an example, information indicating the operating patterns of the pumps at each site.
(Acquisition unit 11)
The acquisition unit 11 acquires the target data TD and the reference data RD. As an example, the acquisition unit 11 acquires the target data TD and the reference data RD from another device via the communication unit 34. As another example, the acquisition unit 11 may acquire the target data TD and the reference data RD input via the input unit 32. The acquisition unit 11 may also acquire the target data TD and the reference data RD by reading them from the storage unit 41 or from an externally connected storage device. Details of the target data TD and the reference data RD according to this example will be described later.
(Determination unit 12)
The determination unit 12 determines, by inverse reinforcement learning referring to the reference data RD, the reward function used in the optimization problem for generating the operation plan OP relating to the target water distribution plan. As described above, the inverse reinforcement learning of the reward function includes update processing in which the weighting factor WF and the feature parameter FP are the manipulated variables.
(Generation unit 14)
The generation unit 14 generates the operation plan OP relating to the target water distribution plan by solving an optimization problem using the reward function, which has been determined by inverse reinforcement learning using the reference data RD relating to a reference water distribution plan, and the target data TD acquired by the acquisition unit 11. The processing by which the generation unit 14 generates the operation plan OP will be described later.
(Storage unit 41)
The storage unit 41 stores the target data TD and the reference data RD acquired by the acquisition unit 11. The storage unit 41 also stores the operation plan OP generated by the generation unit 14. The storage unit 41 further stores the reward function determined by the determination unit 12 and the constraint conditions LC. Here, storing the reward function in the storage unit 41 means that the parameters defining the reward function are stored in the storage unit 41.
(Target data TD)
The target data TD is data used by the generation unit 14 to generate the operation plan OP. The target data TD includes information indicating the state of the target water supply infrastructure. As an example, the target data TD includes information on at least one of the pumps, the distribution network, the water pipelines, and the demand points in the target water supply infrastructure.
Specifically, the target data TD includes, as an example, at least one of the following data (i) to (x) for the water supply infrastructure that is the target of the operation plan. However, the data included in the target data TD are not limited to these and may include other data.
(i) power consumption at each site, (ii) demand forecast margin, (iii) distribution reservoir margin, (iv) water distribution loss, (v) number of operating personnel at each site, (vi) electricity rates at each site, (vii) voltage at each site, (viii) water level at each site, (ix) water pressure at each site, (x) water volume at each site.
(i) The power consumption at each site indicates the power consumption at each site such as water purification plants and water supply stations. (ii) The demand forecast margin indicates the extent to which supply exceeds demand. (iii) The distribution reservoir margin indicates the extent to which the designed storage volume of a distribution reservoir exceeds the actual amount of stored water. (iv) The water distribution loss indicates the extent to which water is not being distributed to each demand point. (v) The number of operating personnel indicates the number of operating personnel at each site.
(Reference data RD)
The reference data RD is data used when the determination unit 12 determines the reward function. The reference data RD includes information representing the state of reference water supply infrastructure. Here, the reference water supply infrastructure may be the same as or different from the water supply infrastructure for which the operation plan is generated. More specifically, the reference data RD includes, as an example, information on at least one of the pumps, the distribution network, the water pipelines, and the demand points in the reference water supply infrastructure. The reference data RD also includes, as an example, information on at least one of the pump operating patterns and the personnel in the reference water supply infrastructure. Each item included in the reference data RD may be treated as state data or as action data.
Specifically, the reference data RD includes, as an example, at least one of the following data (i) to (x) for the reference water supply infrastructure. However, the data included in the reference data RD are not limited to these and may include other data.
(i) power consumption at each site, (ii) demand forecast margin, (iii) distribution reservoir margin, (iv) water distribution loss, (v) number of operating personnel at each site, (vi) electricity rates at each site, (vii) voltage at each site, (viii) water level at each site, (ix) water pressure at each site, (x) water volume at each site.
The reference data RD also includes, as an example, data indicating an operation plan created by an expert for the reference water supply infrastructure. More specifically, the reference data RD includes, as an example, data represented by variables controlled on the basis of operation rules, such as the opening and closing of valves, the intake of water, and pump thresholds. Such data can also be said to represent the decision-making history (the intention) of the expert or the like who created the reference operation plan.
(Operation plan OP)
The operation plan OP includes, as an example, information on the operating patterns of the pumps in the target water supply infrastructure. The operation plan OP also includes, as an example, information on the personnel involved in the target water supply infrastructure.
(Reward function)
The reward function includes cost terms each including a variable corresponding to each item included in the reference data RD. The general form of the reward function is as described in the exemplary embodiments above.
(Constraint conditions LC)
The constraint conditions LC are the constraints of the optimization problem solved by the generation unit 14. The constraint conditions LC include, for example, the following (i) to (iv). However, the constraint conditions LC are not limited to these and may include other conditions.
(i) The amount of water stored in each reservoir/distribution reservoir is at least a threshold X and less than a threshold Y.
(ii) The supply exceeds the demand by at least X%.
(iii) Water is distributed to all demand points.
(iv) Routes under construction are not used.
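A hypothetical checker for constraints (i) to (iv) could look like the sketch below; the threshold values, the percentage margin, and the dictionary layout of the plan are assumptions for illustration only.

    def satisfies_constraints(plan, min_storage, max_storage, supply_margin_ratio):
        # (i) storage of every reservoir/distribution reservoir is in [min_storage, max_storage)
        if not all(min_storage <= s < max_storage for s in plan["storage_levels"]):
            return False
        # (ii) supply exceeds demand by at least the given margin
        if plan["supply"] < plan["demand"] * (1.0 + supply_margin_ratio):
            return False
        # (iii) every demand point receives water
        if not all(plan["delivered"].get(point, 0.0) > 0.0 for point in plan["demand_points"]):
            return False
        # (iv) routes under construction are not used
        if any(route in plan["routes_in_use"] for route in plan["routes_under_construction"]):
            return False
        return True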
<Processing executed by the determination unit 12>
The determination unit 12 determines, by inverse reinforcement learning referring to the reference data RD, the reward function used in the optimization problem for generating the operation plan relating to the target water distribution plan. As an example, the determination unit 12 determines the weighting factors of the cost terms included in the reward function and the feature parameters characterizing those cost terms by inverse reinforcement learning using the state data and the action data included in the reference data RD. An example of the inverse reinforcement learning performed by the determination unit 12 is as described above.
The determination unit 12 also outputs the determined reward function. The determination unit 12 may output the reward function by writing it to the storage unit 41 or to an external storage device, or may output it to the output unit 33.
<Processing executed by the generation unit 14>
The generation unit 14 generates the operation plan OP relating to the target water distribution plan by solving, under the constraint conditions LC, an optimization problem using the reward function and the target data TD. In this exemplary embodiment, the generation unit 14 generates the operation plan OP relating to the target water distribution plan by solving an optimization problem using the reward function, in which the target data TD acquired by the acquisition unit 11 are treated as fixed variables and, among the variables included in the cost terms of the reward function, the variables other than the fixed variables are treated as manipulated variables.
The generation unit 14 also outputs the generated operation plan OP. The generation unit 14 may output the operation plan OP by writing it to the storage unit 41 or to an external storage device, or may output it to the output unit 33.
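The sketch below illustrates, under strong simplifying assumptions, how such an optimization could be posed: pump on/off patterns are enumerated by brute force, infeasible patterns are discarded using a constraint checker such as the one sketched earlier, and the pattern maximizing the learned reward is returned. Exhaustive enumeration is only for illustration; this document does not specify the solver, and the callables passed in are placeholders.

    import itertools

    def generate_operation_plan(reward_fn, target_data, n_pumps, n_time_slots, feasible):
        # reward_fn(target_data, pattern) evaluates the learned reward for a candidate
        # pump pattern; feasible(target_data, pattern) checks the constraint conditions LC.
        best_pattern, best_reward = None, float("-inf")
        for pattern in itertools.product((0, 1), repeat=n_pumps * n_time_slots):
            if not feasible(target_data, pattern):
                continue
            reward = reward_fn(target_data, pattern)
            if reward > best_reward:
                best_pattern, best_reward = pattern, reward
        return best_pattern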
<Setting of the optimization problem>
FIG. 11 is a diagram for explaining a specific example of the setting of the optimization problem according to this example. The operation plan OP needs to be determined in consideration of various viewpoints, such as how much margin to keep over the forecast demand, how much to reduce power consumption, and how much to take the water level of the distribution reservoir into account. Setting the weights of these viewpoints is difficult, because which viewpoint is emphasized, and to what extent, varies among the operators who run the water supply infrastructure and is not uniquely determined. For example, municipality A, the creator of a certain operation plan, may place emphasis on power consumption, whereas municipality B may place emphasis on the water level of the distribution reservoir.
In this exemplary embodiment, the generation unit 14 solves, under the constraint conditions LC, an optimization problem using the target data TD and a reward function whose cost-term weighting factors and feature parameters have been determined by inverse reinforcement learning referring to the reference data RD. Because these weighting factors and feature parameters are determined by inverse reinforcement learning referring to the reference data RD, they are values that reflect the action data included in the reference data RD, that is, the intention of the expert or the like who created the reference operation plan. By solving the optimization problem using a reward function including such weighting factors and feature parameters, it is possible to generate an operation plan that reflects the intention of the expert or the like who created the reference operation plan.
For example, in the example of FIG. 11, the weighting factors α1 to α6 and the feature parameters included in the reward function used to generate the operation plan OP of municipality A are values that reflect the intention of the expert or the like who created the reference operation plan used to determine that reward function. Similarly, the weighting factors α1 to α6 and the feature parameters included in the reward function used to generate the operation plan OP of municipality B are values that reflect the intention of the expert or the like who created the reference operation plan used to determine that reward function. By comparing the weighting factors and feature parameters of municipality A with those of municipality B, it becomes easier to grasp which viewpoints each municipality emphasizes.
For example, the determination unit 12 may determine the reward function by referring to reference data RD including an operation plan created by expert a1 in municipality A, and the generation unit 14 may generate a future operation plan OP using the reward function determined by the determination unit 12 and the target data TD of municipality A. In this case, the generation unit 14 can generate a future operation plan OP for municipality A that reflects the intention of expert a1.
According to this exemplary embodiment, the intention of the creator of an operation plan in one municipality can also be reflected in the operation plan of another municipality. For example, the determination unit 12 may determine the reward function by referring to reference data RD including an operation plan created by expert a1 in municipality A, and the generation unit 14 may generate a future operation plan OP using the reward function determined by the determination unit 12 and the target data TD of municipality B. In this case, the generation unit 14 can generate an operation plan OP for municipality B that reflects the intention of expert a1.
[Example of implementation by software]
Some or all of the functions of the information processing devices 1, 2, 3, and 4 may be implemented by hardware such as an integrated circuit (IC chip), or may be implemented by software.
In the latter case, the information processing devices 1, 2, 3, and 4 are implemented by, for example, a computer that executes the instructions of a program, which is software implementing each function. An example of such a computer (hereinafter referred to as computer C) is shown in FIG. 12. The computer C includes at least one processor C1 and at least one memory C2. The memory C2 stores a program P for causing the computer C to operate as the information processing devices 1, 2, 3, and 4. In the computer C, the processor C1 reads the program P from the memory C2 and executes it, thereby implementing each function of the information processing devices 1, 2, 3, and 4.
As the processor C1, for example, a CPU (Central Processing Unit), a GPU (Graphic Processing Unit), a DSP (Digital Signal Processor), an MPU (Micro Processing Unit), an FPU (Floating point number Processing Unit), a PPU (Physics Processing Unit), a microcontroller, or a combination of these can be used. As the memory C2, for example, a flash memory, an HDD (Hard Disk Drive), an SSD (Solid State Drive), or a combination of these can be used.
The computer C may further include a RAM (Random Access Memory) for loading the program P at the time of execution and for temporarily storing various data. The computer C may further include a communication interface for transmitting and receiving data to and from other devices. The computer C may further include an input/output interface for connecting input/output devices such as a keyboard, a mouse, a display, and a printer.
The program P can be recorded on a non-transitory tangible recording medium M readable by the computer C. As such a recording medium M, for example, a tape, a disk, a card, a semiconductor memory, or a programmable logic circuit can be used. The computer C can acquire the program P via such a recording medium M. The program P can also be transmitted via a transmission medium. As such a transmission medium, for example, a communication network or broadcast waves can be used. The computer C can also acquire the program P via such a transmission medium.
[Additional remarks 1]
The present invention is not limited to the embodiments described above, and various modifications are possible within the scope of the claims. For example, embodiments obtained by appropriately combining the technical means disclosed in the embodiments described above are also included in the technical scope of the present invention.
[Additional remarks 2]
Some or all of the embodiments described above may also be described as follows. However, the present invention is not limited to the aspects described below.
(Appendix 1)
An information processing device comprising: acquisition means for acquiring reference data; and determination means for determining a reward function including a weighting factor and a feature parameter by inverse reinforcement learning that uses the reference data and includes the feature parameter among the manipulated variables.
(Appendix 2)
The information processing device according to Appendix 1, wherein the reward function includes one or more cost terms each including a feature expressed using explanatory variables and the weighting factor representing the weight of that feature, and at least one of the one or more cost terms includes, together with the explanatory variables, the feature parameter characterizing that cost term.
(Appendix 3)
The information processing device according to Appendix 2, wherein the manipulated variables in the inverse reinforcement learning by the determination means include the weighting factor included in at least one of the one or more cost terms.
(Appendix 4)
The information processing device according to any one of Appendices 1 to 3, wherein the inverse reinforcement learning by the determination means includes update processing for updating the manipulated variables so as to maximize a lower bound of a log-likelihood expressed using the reward function.
(Appendix 5)
The information processing device according to Appendix 4, wherein the lower bound of the log-likelihood is expressed using a Wasserstein distance representing the distance between a reference probability distribution and a probability distribution expressed using the reward function, and a regularization term representing the difference between the maximum value of the reward function and the average value of the reward function.
(Appendix 6)
The information processing device according to Appendix 5, wherein the lower bound of the log-likelihood includes a damping coefficient that is multiplied by the regularization term and that attenuates the contribution of the regularization term as the update processing is repeated.
 (Appendix 7)
 The information processing apparatus according to any one of Appendices 1 to 6, further comprising first display means for displaying information corresponding to at least one of the weighting factor, the feature amount parameter, and the reward function.
 (Appendix 8)
 The information processing apparatus according to any one of Appendices 1 to 7, wherein the acquisition means further acquires target data, and the information processing apparatus further comprises generating means for generating output data corresponding to the target data by solving an optimization problem using the reward function determined by the determining means and the target data acquired by the acquisition means.
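 Appendix 8 (and likewise Appendix 10) uses the learned reward function in the forward direction: given newly acquired target data, output data is generated by solving an optimization problem built from the reward function and that data. The sketch below simply maximizes the learned reward over candidate outputs with SciPy; the choice of decision variables, the unconstrained formulation, and the assumed `reward_fn(x, target_data)` signature are illustrative assumptions, and a real application may impose its own constraints.

```python
import numpy as np
from scipy.optimize import minimize

def generate_output(target_data, reward_fn):
    """Generate output data for the given target data by maximizing the learned
    reward function (equivalently, minimizing its negative).

    reward_fn is assumed to close over the learned weighting factors and
    feature amount parameters and to take (candidate_output, target_data)."""
    x0 = np.asarray(target_data, dtype=float)  # start the search at the target data

    def objective(x):
        return -reward_fn(x, target_data)

    result = minimize(objective, x0, method="L-BFGS-B")
    return result.x  # output data corresponding to the target data
```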
 (Appendix 9)
 The information processing apparatus according to Appendix 8, further comprising second display means for displaying the output data.
 (Appendix 10)
 An information processing apparatus comprising: acquisition means for acquiring target data; and generating means for generating output data corresponding to the target data by solving an optimization problem using the target data acquired by the acquisition means and a reward function that includes a weighting factor and a feature amount parameter and that has been determined by inverse reinforcement learning including the feature amount parameter as an operation target.
 (Appendix 11)
 An information processing method using an information processing apparatus, the method comprising: acquiring reference data; and determining a reward function including a weighting factor and a feature amount parameter by inverse reinforcement learning that uses the reference data and includes the feature amount parameter as an operation target.
 (Appendix 12)
 An information processing method using an information processing apparatus, the method comprising: acquiring target data; and generating output data corresponding to the target data by solving an optimization problem using the acquired target data and a reward function that includes a weighting factor and a feature amount parameter and that has been determined by inverse reinforcement learning including the feature amount parameter as an operation target.
 (Appendix 13)
 A program for causing a computer to function as an information processing apparatus, the program causing the computer to function as: acquisition means for acquiring reference data; and determining means for determining a reward function including a weighting factor and a feature amount parameter by inverse reinforcement learning that uses the reference data and includes the feature amount parameter as an operation target.
 (Appendix 14)
 A program for causing a computer to function as an information processing apparatus, the program causing the computer to function as: acquisition means for acquiring target data; and generating means for generating output data corresponding to the target data by solving an optimization problem using the target data acquired by the acquisition means and a reward function that includes a weighting factor and a feature amount parameter and that has been determined by inverse reinforcement learning including the feature amount parameter as an operation target.
 [Additional Remark 3]
 Some or all of the above-described embodiments may also be expressed as follows.
 An information processing apparatus comprising at least one processor, the processor executing: an acquisition process of acquiring reference data; and a determination process of determining a reward function including a weighting factor and a feature amount parameter by inverse reinforcement learning that uses the reference data and includes the feature amount parameter as an operation target.
 The information processing apparatus may further comprise a memory, and the memory may store a program for causing the processor to execute the acquisition process and the determination process. The program may also be recorded on a computer-readable, non-transitory, tangible recording medium.
 An information processing apparatus comprising at least one processor, the processor executing: an acquisition process of acquiring target data; and a generation process of generating output data corresponding to the target data by solving an optimization problem using the target data acquired in the acquisition process and a reward function that includes a weighting factor and a feature amount parameter and that has been determined by inverse reinforcement learning including the feature amount parameter as an operation target.
 The information processing apparatus may further comprise a memory, and the memory may store a program for causing the processor to execute the acquisition process and the generation process. The program may also be recorded on a computer-readable, non-transitory, tangible recording medium.
 1, 2, 3, 4  ... information processing apparatus
 11          ... acquisition unit (acquisition means)
 12          ... determination unit (determining means)
 13          ... display control unit (display means)
 22, 14      ... generation unit (generating means)

Claims (14)

  1.  An information processing apparatus comprising:
     acquisition means for acquiring reference data; and
     determining means for determining a reward function including a weighting factor and a feature amount parameter by inverse reinforcement learning that uses the reference data and includes the feature amount parameter as an operation target.
  2.  The information processing apparatus according to claim 1, wherein
     the reward function includes one or more cost terms each including a feature amount represented using explanatory variables and the weighting factor representing a weight of the feature amount, and
     at least one of the one or more cost terms includes, together with the explanatory variables, the feature amount parameter that characterizes that cost term.
  3.  The information processing apparatus according to claim 2, wherein the operation targets in the inverse reinforcement learning by the determining means include the weighting factor included in at least one of the one or more cost terms.
  4.  The information processing apparatus according to any one of claims 1 to 3, wherein the inverse reinforcement learning by the determining means includes an update process of updating the operation targets so as to maximize a lower bound of a log-likelihood expressed using the reward function.
  5.  The information processing apparatus according to claim 4, wherein the lower bound of the log-likelihood is expressed using a Wasserstein distance representing a distance between a reference probability distribution and a probability distribution expressed using the reward function, and a regularization term representing a difference between a maximum value of the reward function and an average value of the reward function.
  6.  The information processing apparatus according to claim 5, wherein the lower bound of the log-likelihood includes a decay coefficient that is multiplied by the regularization term and that attenuates the contribution of the regularization term as the update process is repeated.
  7.  The information processing apparatus according to any one of claims 1 to 6, further comprising first display means for displaying information corresponding to at least one of the weighting factor, the feature amount parameter, and the reward function.
  8.  The information processing apparatus according to any one of claims 1 to 7, wherein
     the acquisition means further acquires target data, and
     the information processing apparatus further comprises generating means for generating output data corresponding to the target data by solving an optimization problem using the reward function determined by the determining means and the target data acquired by the acquisition means.
  9.  The information processing apparatus according to claim 8, further comprising second display means for displaying the output data.
  10.  An information processing apparatus comprising:
     acquisition means for acquiring target data; and
     generating means for generating output data corresponding to the target data by solving an optimization problem using the target data acquired by the acquisition means and a reward function that includes a weighting factor and a feature amount parameter and that has been determined by inverse reinforcement learning including the feature amount parameter as an operation target.
  11.  An information processing method using an information processing apparatus, the method comprising:
     acquiring reference data; and
     determining a reward function including a weighting factor and a feature amount parameter by inverse reinforcement learning that uses the reference data and includes the feature amount parameter as an operation target.
  12.  An information processing method using an information processing apparatus, the method comprising:
     acquiring target data; and
     generating output data corresponding to the target data by solving an optimization problem using the acquired target data and a reward function that includes a weighting factor and a feature amount parameter and that has been determined by inverse reinforcement learning including the feature amount parameter as an operation target.
  13.  A program for causing a computer to function as an information processing apparatus, the program causing the computer to function as:
     acquisition means for acquiring reference data; and
     determining means for determining a reward function including a weighting factor and a feature amount parameter by inverse reinforcement learning that uses the reference data and includes the feature amount parameter as an operation target.
  14.  A program for causing a computer to function as an information processing apparatus, the program causing the computer to function as:
     acquisition means for acquiring target data; and
     generating means for generating output data corresponding to the target data by solving an optimization problem using the target data acquired by the acquisition means and a reward function that includes a weighting factor and a feature amount parameter and that has been determined by inverse reinforcement learning including the feature amount parameter as an operation target.

PCT/JP2022/003100 2022-01-27 2022-01-27 Information processing device, information processing method, and program WO2023144961A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/003100 WO2023144961A1 (en) 2022-01-27 2022-01-27 Information processing device, information processing method, and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/003100 WO2023144961A1 (en) 2022-01-27 2022-01-27 Information processing device, information processing method, and program

Publications (1)

Publication Number Publication Date
WO2023144961A1 true WO2023144961A1 (en) 2023-08-03

Family

ID=87471309

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2022/003100 WO2023144961A1 (en) 2022-01-27 2022-01-27 Information processing device, information processing method, and program

Country Status (1)

Country Link
WO (1) WO2023144961A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020065808A1 (en) * 2018-09-27 2020-04-02 日本電気株式会社 Information processing device and system, and non-temporary computer-readable medium for storing model adaptation method and program
JP2021033466A (en) * 2019-08-20 2021-03-01 国立大学法人電気通信大学 Encoding device, decoding device, parameter learning device, and program

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020065808A1 (en) * 2018-09-27 2020-04-02 日本電気株式会社 Information processing device and system, and non-temporary computer-readable medium for storing model adaptation method and program
JP2021033466A (en) * 2019-08-20 2021-03-01 国立大学法人電気通信大学 Encoding device, decoding device, parameter learning device, and program

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
NAKAGUCHI YUKI, ETO RIKI, NISHIOKA ITARU: "Construction of Inverse Reinforcement Dynamics Learning Framework based on Maximum Entropy Principle", THE 33RD ANNUAL CONFERENCE OF THE JAPANESE SOCIETY FOR ARTIFICIAL INTELLIGENCE, THE JAPANESE SOCIETY FOR ARTIFICIAL INTELLIGENCE, 1 January 2019 (2019-01-01), XP093080785, DOI: 10.11517/pjsai.JSAI2019.0_1Q2J204 *

Similar Documents

Publication Publication Date Title
JP5768834B2 (en) Plant model management apparatus and method
EP3446260B1 (en) Memory-efficient backpropagation through time
US8849737B1 (en) Prediction method of predicting a future state of a system
EP2981866B1 (en) Methods and systems for reservoir history matching for improved estimation of reservoir performance
Matarazzo et al. STRIDE for structural identification using expectation maximization: Iterative output-only method for modal identification
Rarità et al. Numerical schemes and genetic algorithms for the optimal control of a continuous model of supply chains
US20090172057A1 (en) Computer system for predicting the evolution of a chronological set of numerical values
CN102023570A (en) Method for computer-supported learning of a control and/or regulation of a technical system
CN107239589A (en) Reliability of slope analysis method based on MRVM AFOSM
US20170016354A1 (en) Output efficiency optimization in production systems
Wan Ahmad et al. Arima model and exponential smoothing method: A comparison
JP2020086778A (en) Machine learning model construction device and machine learning model construction method
Kanjilal et al. Cross entropy-based importance sampling for first-passage probability estimation of randomly excited linear structures with parameter uncertainty
US20210342691A1 (en) System and method for neural time series preprocessing
JP7497516B2 (en) A projection method for imposing equality constraints on algebraic models.
WO2023144961A1 (en) Information processing device, information processing method, and program
US8700686B1 (en) Robust estimation of time varying parameters
NO20200978A1 (en) Optimized methodology for automatic history matching of a petroleum reservoir model with ensemble kalman filter
Kim et al. Direct use of design criteria in genetic algorithm‐based controller optimization
WO2020180303A1 (en) Reservoir simulation systems and methods to dynamically improve performance of reservoir simulations
CN115577787A (en) Quantum amplitude estimation method, device, equipment and storage medium
KR102261055B1 (en) Method and system for optimizing design parameter of image to maximize click through rate
WO2016017171A1 (en) Flow rate prediction device, mixing ratio estimation device, method, and computer-readable recording medium
KR20220078243A (en) Apparatus and method for predicting based on time series network data
Zhao et al. Reduction of Carbon Footprint of Dynamical System Simulation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22923826

Country of ref document: EP

Kind code of ref document: A1