JP5046149B2

JP5046149B2 - Technology to determine the most appropriate measures to get rewards

Info

Publication number: JP5046149B2
Application number: JP2006209593A
Authority: JP
Inventors: 力矢高橋; 貴行恐神
Original assignee: International Business Machines Corp
Current assignee: International Business Machines Corp
Priority date: 2006-08-01
Filing date: 2006-08-01
Publication date: 2012-10-10
Anticipated expiration: 2026-08-01
Also published as: JP2008040522A

Description

The present invention relates to a system that determines measures for maximizing reward while keeping risk low. In particular, it relates to a system that determines measures so as to maximize the future cumulative reward obtained from a target whose state changes as measures are executed.

Portfolio theory has long been studied (see Non-Patent Document 1). Portfolio theory determines how to allocate investment among multiple products, such as stocks and bonds, whose returns carry risk. For example, when a user wants to obtain a desired return as an expected value, applying portfolio theory yields the allocation that minimizes the risk of obtaining that return. Markov decision processes have also been studied (see Non-Patent Documents 2 to 8). A Markov decision process problem asks for the cumulative return obtained from a target that can change state when actions are taken on it multiple times according to a predetermined rule. With existing solution methods for Markov decision process problems, given a measure that determines the actions, the expected value of the cumulative return can be calculated.

Non-Patent Document 1: H. Markowitz, "Portfolio Selection," Journal of Finance, vol. 7, pp. 77-91, Mar. 1952.
Non-Patent Document 2: G. Tirenni, A. Labbi, A. Elisseeff, and C. Berrospi, "Efficient allocation of marketing resources using dynamic programming," in Proceedings of the SIAM International Conference on Data Mining, 2005.
Non-Patent Document 3: J. A. Filar, L. C. M. Kallenberg, and H. Lee, "Variance-penalized markov decision processes," Mathematics of Operations Research, vol. 14, pp. 147-161, 1989.
Non-Patent Document 4: D. J. White, "Mean, variance, and probabilistic criteria in finite Markov decision processes: A review," Journal of Optimization Theory and Applications, vol. 56, no. 1, pp. 1-29, 1988.
Non-Patent Document 5: R. Munos and A. W. Moore, "Variable Resolution Discretization for High-Accuracy Solutions of Optimal Control Problems," in Proceedings of the International Joint Conference on Artificial Intelligence, 1999, pp. 1348-1355.
Non-Patent Document 6: R. Neuneier, "Enhancing Q-learning for Optimal Asset Allocation," in Advances in Neural Information Processing Systems, 1998, vol. 10, pp. 936-942.
Non-Patent Document 7: H. Kawai, "A variance minimization problem for a Markov decision process," European Journal of Operational Research, vol. 31, pp. 140-145, 1987.
Non-Patent Document 8: M. L. Puterman, Markov Decision Process, John Wiley and Sons, 1994.

In an environment with many agents (for example, customers), when deciding which measures (such as marketing campaigns) to apply to each agent, a measure that maximizes short-term reward is not necessarily optimal in the long term, because measures change the states of the agents. In addition, the reward an agent brings is not constant and is reasonably modeled as a random variable. Maximizing only the expected reward can therefore entail large risk. In practice, it is desirable to determine the optimal measure from the viewpoints of both return and risk.

One conceivable approach to this problem is to calculate, for each of many measures, the risk of the cumulative reward obtained under that measure, and to select the measure with the smallest calculated risk as the optimal measure. Conventionally, however, calculating the risk incurred in trying to obtain a given expected value from a measure has required time-consuming simulation. Moreover, if the measure differs for each state of the target agents, and if measures may be probabilistic rather than deterministic, the number of simulations grows explosively and the calculation is unlikely to finish in a realistic time.

To address the same problem, changes in customer state could be modeled as a Markov model, and portfolio theory could be applied to choose measures that take both return and risk into account. However, techniques that combine Markov decision process problems with portfolio theory have not been studied sufficiently. For example, techniques have been proposed for reducing risk when obtaining a given reward from a target modeled as a Markov model (see Non-Patent Documents 2 to 6). The technique of Non-Patent Document 2 can reduce risk to some extent in some cases, but in certain situations the risk becomes extremely high. The techniques of Non-Patent Documents 3 to 6 can avoid extremely high risk, but what they can find is a measure that uniquely fixes the action taken for an agent in a given state (hereinafter, a deterministic measure), not a measure under which the action for an agent in a given state is selected from multiple candidate actions with predetermined probabilities (hereinafter, a probabilistic measure).

In contrast, the technique of Non-Patent Document 7 can find a probabilistic measure that minimizes the risk of the per-period reward obtained from a single agent in the steady state. However, this technique minimizes only the risk of reward in the steady state; it does not minimize the risk of the cumulative reward, which includes the transient stages before the steady state is reached. Its problem setting therefore differs from the problem that actually needs to be solved. Furthermore, in real problems such as choosing marketing measures, multiple agents may change state in a correlated manner, so restricting the analysis to a single agent is not appropriate.

An object of the present invention is therefore to provide a system, a method, and a program that solve the above problems. This object is achieved by the combinations of features recited in the independent claims. The dependent claims define further advantageous specific examples of the present invention.

To solve the above problems, the present invention provides a system that calculates the probability distribution of the cumulative reward obtained as a result of sequentially taking multiple actions on multiple agents whose states change in response to the actions. The system comprises: a probability storage unit that stores, for each of the multiple states an agent can take, the transition probability of moving to each state when each action is taken on an agent in that state; a parameter storage unit that stores, for each of the multiple states, parameters of the probability distribution of the reward obtained when multiple agents that are all in that state move to each state as a result of each action being taken on them; a measure acquisition unit that accepts input of a measure that defines, for each of the multiple states, the action probability with which each action is taken on an agent in that state; a first calculation unit that calculates the parameters of the probability distribution of the cumulative reward; and an output unit that outputs the calculated parameters as information indicating the probability distribution of the cumulative reward. The first calculation unit generates a recurrence formula that gives the parameters of the probability distribution of the cumulative reward obtained from the multiple agents from the current period onward. The recurrence weights, by the action probability of the current period's action and the transition probability to the next period's state, a value based on the parameters of the probability distribution of the reward obtained from the current period's action and the parameters of the probability distribution of the cumulative reward obtained from the next period's state onward, and sums the weighted values over all actions and next-period states. The first calculation unit then calculates the parameters by solving the equation obtained by assuming that, when the initial state is the same, the parameters of the cumulative reward distribution from the current period onward and from the next period onward converge to the same value. The invention also provides a program that causes a computer to function as this system, and a method of calculating the probability distribution using this system.
The above summary does not enumerate all necessary features of the present invention, and sub-combinations of these features may also constitute inventions.

According to the present invention, a probabilistic measure can be found that minimizes the risk of the cumulative reward obtained from multiple targets that undergo state transitions.

The present invention is described below through the best mode for carrying out the invention (hereinafter, the embodiment). The embodiment does not limit the invention according to the claims, and not all combinations of features described in the embodiment are essential to the solution provided by the invention.

FIG. 1 shows the overall configuration of the information system 10. The information system 10 includes a probability storage unit 20, a parameter storage unit 30, and a measure determination system 40. The probability storage unit 20 stores, for each of the multiple states an agent can take, the transition probability of moving to each state when each action is taken on an agent in that state. An agent is, for example, a consumer targeted by marketing, and an action is, for example, a marketing action directed at such consumers. A state represents a behavioral characteristic of a consumer, such as an attribute of the consumer that can change. In marketing, for example, state 1 may be membership in the customer segment whose monthly spending falls within one range, and state 2 may be membership in another customer segment whose monthly spending falls within a different range. A state transition thus means, for example, that a consumer comes to belong to a different customer segment as a result of an action such as a discount or a campaign directed at that consumer.

The parameter storage unit 30 stores, for each of these states, parameters indicating the probability distribution of the reward obtained when multiple agents that are all in that state move to each state as a result of each action being taken on them. A reward is, for example, an amount of sales or profit. In the marketing example, a reward is the amount of sales obtained from multiple consumers in a given state as a result of running a given campaign for them. The parameters indicating the probability distribution of the reward are, for example, the mean and the variance when the distribution is normal. The data stored in the probability storage unit 20 or the parameter storage unit 30 may be generated in advance by analyzing information such as past marketing history.

The measure determination system 40 first selects multiple expected values for the cumulative reward obtained as a result of sequentially taking multiple actions on the agents. For each of these expected values, it determines a measure that defines the actions minimizing the risk. On a plane whose coordinate axes represent the expected value and a risk index, the measure determination system 40 plots, for each expected value, a point representing that expected value and the minimum risk index for obtaining that cumulative reward, and connects the points to draw an efficient frontier curve 60, which it displays to the user. The user selects a desired combination of expected value and risk index from the points on the efficient frontier curve 60. The measure determination system 40 then outputs the measure for obtaining the selected expected value to the user as the optimal measure 70, which minimizes the risk.

In this way, the information system 10 according to the present embodiment aims to output the optimal measure that minimizes the risk of obtaining a desired cumulative reward from multiple agents, given data such as the agents' state transition probabilities and the reward for a single action.

FIG. 2 shows an example of the data structure of the probability storage unit 20. The probability storage unit 20 stores, for each of the multiple states an agent can take, the transition probability of moving to each state when each action is taken on an agent in that state. The state of an agent is represented by the variable s; specifically, the states are s_1, s_2, ..., s_m. The set of states an agent can take is S, that is, s ∈ S. The actions that can be taken are represented by the variable a; specifically, the actions are a_1, a_2, ..., a_n. The set of possible actions is A, that is, a ∈ A. The probability storage unit 20 stores a transition probability for each combination of a pair (s, a) of source state and action with a destination state s′. For example, when action a_1 is taken on an agent in state s_1, the probability that the agent remains in state s_1 is 25%, and the probability that it moves to state s_2 is 40%. In the following description, this transition probability is written p_{s′|s,a}, where s is the source state, a is the action, and s′ is the destination state.
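The table of FIG. 2 can be pictured as a lookup keyed by source state and action. The sketch below is only an illustration under stated assumptions: the names `transition_prob`, `validate_transitions`, and `p` are hypothetical, and the third state `s_3` with its 35% entry is an assumed filler so that the row sums to 1. Only the 25% and 40% values come from the text.

```python
# Hypothetical layout: (source state, action) -> {destination state: probability}.
# The 0.25 and 0.40 entries are from the text; "s_3": 0.35 is an assumed filler.
transition_prob = {
    ("s_1", "a_1"): {"s_1": 0.25, "s_2": 0.40, "s_3": 0.35},
}


def validate_transitions(table, tol=1e-9):
    """Return the (state, action) keys whose rows are not valid distributions."""
    bad = []
    for key, row in table.items():
        has_negative = any(prob < 0.0 for prob in row.values())
        wrong_total = abs(sum(row.values()) - 1.0) > tol
        if has_negative or wrong_total:
            bad.append(key)
    return bad


def p(s_next, s, a, table=transition_prob):
    """Look up p_{s'|s,a}; destinations missing from a row have probability 0."""
    return table[(s, a)].get(s_next, 0.0)
```

Checking that every row sums to 1 before any calculation is a cheap guard, since the later recurrences implicitly rely on each row being a probability distribution.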

FIG. 3 shows an example of the data structure of the parameter storage unit 30. The parameter storage unit 30 stores, for each of these states, parameters indicating the probability distribution of the reward obtained when multiple agents that are all in that state move to each state as a result of each action being taken on them. FIG. 3 shows the mean reward as an example of a parameter that defines the probability distribution. For each combination of a pair (s, a) of source state and action with a destination state s′, the parameter storage unit 30 stores the mean of the reward obtained as a result of the transition from that source state to that destination state.

As an example, suppose there are 10,000 agents, all in state s_1. If action a_1 is taken on all of them and all of them move to state s_2, the mean of the reward obtained is $2.10. The mean here is the mean as a parameter of the probability distribution, that is, the mean of the rewards that would be obtained by taking action a_1 repeatedly under the same conditions. For some reward distributions the mean alone does not define the probability distribution, so the parameter storage unit 30 also stores tables like that of FIG. 3 for the variance and for each other parameter. Other parameters include, for example, the characteristic exponent and skewness of a stable distribution. The data structures for storing these parameters are substantially the same as in FIG. 3, except that they hold the other parameters instead of the mean, so their description is omitted.

In the following description, the reward obtained when action a is taken on an agent in state s and the agent moves to state s′ is written r_{s′|s,a}. Its probability distribution is written P(r_{s′|s,a}), and its mean or position parameter is written μ_{s′|s,a}.
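One way to hold the per-transition reward parameters is a record per (s, a, s′) triple. This is a minimal sketch, assuming a normal reward distribution described by a mean and a variance; the class `RewardParams`, the function `expected_immediate_reward`, and the variance value are hypothetical. Only the $2.10 mean comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardParams:
    mean: float      # plays the role of mu_{s'|s,a}
    variance: float  # assumed second parameter; the value below is illustrative


# Keyed by (source state, action, destination state).
reward_params = {
    ("s_1", "a_1", "s_2"): RewardParams(mean=2.10, variance=0.50),
}


def expected_immediate_reward(s, a, transition_row, params):
    """Weight each mean reward mu_{s'|s,a} by the transition probability p_{s'|s,a}."""
    return sum(
        prob * params[(s, a, s_next)].mean
        for s_next, prob in transition_row.items()
    )
```

Keeping one record per triple mirrors the description that a separate table like FIG. 3 exists for each parameter, while making it harder to mix up entries that belong to the same transition.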

FIG. 4 shows an example of the optimal measure 70. The optimal measure 70 defines, for agents in each state, the action probability with which each action should be taken. For example, in FIG. 4, the value 20% at the intersection of the row for state s_1 and the column for action a_1 is the action probability of taking action a_1 for agents in state s_1. If there are 100 agents in state s_1, the user may take action a_1 on 20 of them, or may take action a_1 on each of the 100 agents with a probability of 20%. Acting in accordance with the optimal measure 70 minimizes the risk for the desired expected value.
In the following description, the action probability, defined by a measure, of taking action a for an agent in state s is written π_{s,a}. The values of π_{s,a} for all states s ∈ S and actions a ∈ A are collectively written π = {π_{s,a}; s ∈ S, a ∈ A}.
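The two ways of applying an action probability mentioned for FIG. 4 (a fixed split of the agents, or an independent draw per agent) can be sketched as follows. The names `policy`, `split_agents`, and `sample_action` are hypothetical, and the 80% entry for a second action `a_2` is an assumption added so that the probabilities sum to 1.

```python
import random

# pi_{s,a} as a nested mapping; only the 20% for a_1 is from the text.
policy = {"s_1": {"a_1": 0.20, "a_2": 0.80}}


def split_agents(n_agents, action_probs):
    """Fixed split: apply each action to round(n * probability) agents.

    Rounding can make the counts miss n_agents by a few when the
    products are not whole numbers.
    """
    return {a: int(round(n_agents * prob)) for a, prob in action_probs.items()}


def sample_action(action_probs, rng):
    """Per-agent draw: choose one action with probability pi_{s,a}."""
    actions = list(action_probs)
    weights = [action_probs[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


counts = split_agents(100, policy["s_1"])  # 20 agents get a_1, 80 get a_2
one_draw = sample_action(policy["s_1"], random.Random(0))
```

Both approaches give the same expected number of agents per action; the per-agent draw adds sampling variation around that number.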

Next, the processing by which the measure determination system 40 finds the optimal measure from these transition probabilities and parameters is described in detail. First, the expected value of the cumulative reward is defined. The cumulative reward is the total reward obtained from all agents in the current period and in all future periods. The current period is the period, among multiple sequentially elapsing periods, in which execution of the optimal measure being sought begins, and the next period is the period following the current period. In real problems, a reward obtained earlier is worth more than the same amount obtained later, so future rewards are multiplied by a discount rate and valued lower. Specifically, the expected value of the cumulative reward from period 0 to period (T−1) under a measure π is expressed by formula (1).

Here, γ is the discount rate and takes a value greater than 0 and less than 1; r_t is the reward obtained in period t under the measure π, and E_π[r_t] is its expected value.
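Formula (1) itself is not reproduced in this text. A form consistent with the definitions of γ, r_t, and E_π[r_t] given here, offered as a reconstruction from those definitions rather than as the original typesetting, would be:

```latex
\sum_{t=0}^{T-1} \gamma^{t} \, E_{\pi}[r_t] \tag{1}
```

Under this form, a reward expected in period t is weighted by γ^t, so with γ = 0.9 a reward expected two periods ahead counts for 0.81 of its face value.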

FIG. 5 shows the functional configuration of the measure determination system 40. The measure determination system 40 includes a position parameter range calculation unit 400, a position parameter acquisition unit 410, a measure acquisition unit 420, a first calculation unit 430, a second calculation unit 440, a convergence determination unit 450, an output unit 460, and a display control unit 470. Based on the transition probabilities stored in the probability storage unit 20 and the parameters stored in the parameter storage unit 30, the position parameter range calculation unit 400 calculates the maximum and minimum values of the position parameter of the probability distribution of the cumulative reward obtained from the multiple agents. The calculation is described in detail later; in outline, a recurrence formula is generated that gives the minimum (or maximum) position parameter of the cumulative reward from the current period onward from the position parameter of the cumulative reward from the next period onward and the position parameter of the current period's reward, and the value to which it converges is found by value iteration.
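The range calculation outlined above can be illustrated with a textbook value iteration run twice, once taking the minimum over actions and once the maximum. This is a minimal sketch under stated assumptions, not the patent's detailed procedure (which is described later): the function name `position_range`, the table layouts `P[(s, a)][s2]` and `R[(s, a)][s2]`, and the use of the mean reward as the position parameter are all assumptions made for illustration.

```python
def position_range(states, actions, P, R, gamma=0.9, tol=1e-10, max_iter=100_000):
    """Return (lowest, highest) per-state expected discounted cumulative reward.

    P[(s, a)][s2] is the transition probability p_{s2|s,a}.
    R[(s, a)][s2] is the mean reward mu_{s2|s,a}.
    """

    def iterate(pick):
        value = {s: 0.0 for s in states}
        for _ in range(max_iter):
            new_value = {
                s: pick(
                    sum(
                        P[(s, a)][s2] * (R[(s, a)][s2] + gamma * value[s2])
                        for s2 in states
                    )
                    for a in actions
                )
                for s in states
            }
            # Stop once no state's value moved by more than the tolerance.
            if max(abs(new_value[s] - value[s]) for s in states) < tol:
                return new_value
            value = new_value
        return value

    return iterate(min), iterate(max)
```

Because γ is less than 1, each sweep shrinks the remaining error by at least a factor of γ, which is why the loop is expected to terminate well before `max_iter` for typical discount rates.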

The processing differs depending on how the position parameter is specified, so each case is described below.
As described above, an agent can take states s_1 to s_m, and the reward obtained from an agent depends on the source state of the transition. Therefore, even under the same measure, agents with different initial states yield different cumulative rewards. For this reason, it is sometimes desirable to specify the position parameter of the cumulative reward for each initial state. In the marketing example, this is useful when there are few customer segments and the desired reward is to be specified in detail for each segment. In other cases, it is desirable to specify the position parameter of the probability distribution of the total reward obtained from multiple agents with various initial states. In the marketing example, this is useful when there are so many customer segments that specifying the desired reward for each is difficult. Each of these cases is described below.

(1) When the position parameter is specified for each initial state
For each of the multiple states, the position parameter acquisition unit 410 acquires a value within the range from the minimum to the maximum calculated by the position parameter range calculation unit 400, as the position parameter of the probability distribution of the reward to be obtained from multiple agents whose initial state is that state. The position parameter acquisition unit 410 may accept input specifying the position parameter from the user, or may acquire multiple values between the minimum and the maximum according to a predetermined rule.

For each of the multiple states, the measure acquisition unit 420 generates one of the measures under which the position parameter of the probability distribution of the cumulative reward, obtained from multiple agents whose initial state is that state, matches the acquired position parameter, and acquires that measure as the initial measure. The initial measure only needs to make the position parameter match the given value; it does not need to minimize the risk. For each state, the first calculation unit 430 calculates the position parameter and the scale parameter of the probability distribution of the cumulative reward obtained from multiple agents whose initial state is that state when actions are taken according to the initial measure. The method of calculating these parameters is described later.
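For the position parameter alone, the evaluation of a given measure can be sketched with the recurrence described in the summary: weight the current reward plus the discounted future value by the action probability and the transition probability, and sum over actions and next states. The sketch below iterates that recurrence to its fixed point instead of solving the equivalent linear equation directly, which is a substitution made here to keep the example dependency-free. The function name `evaluate_position` and the table layouts are assumptions, and the scale parameter is left out because its recurrence is not given in this part of the text.

```python
def evaluate_position(states, actions, P, R, policy, gamma=0.9,
                      tol=1e-10, max_iter=100_000):
    """Per-state expected discounted cumulative reward under a probabilistic measure.

    policy[s][a] is pi_{s,a}; P[(s, a)][s2] is p_{s2|s,a}; R[(s, a)][s2] is mu_{s2|s,a}.
    """
    mu = {s: 0.0 for s in states}
    for _ in range(max_iter):
        new_mu = {
            s: sum(
                policy[s][a] * P[(s, a)][s2] * (R[(s, a)][s2] + gamma * mu[s2])
                for a in actions
                for s2 in states
            )
            for s in states
        }
        # At the fixed point, "from this period on" equals "from next period on".
        if max(abs(new_mu[s] - mu[s]) for s in states) < tol:
            return new_mu
        mu = new_mu
    return mu
```

With a single state, two actions chosen with probability 0.5 each, mean rewards 1 and 2, and γ = 0.5, the fixed point satisfies μ = 1.5 + 0.5 μ, so the function returns a value close to 3.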

For each of the multiple states, the second calculation unit 440 takes as variables the action probabilities of the actions that can be taken on an agent, and calculates a measure defining those action probabilities by solving a linear programming problem that minimizes an objective function giving the scale parameter of the probability distribution of the cumulative reward, based on the scale parameter calculated by the first calculation unit 430. The linear programming problem is subject to the constraint that the position parameter of the probability distribution of the cumulative reward obtained by acting according to the action probabilities matches the position parameter calculated by the first calculation unit 430. It is also subject to the constraints that the action probabilities for the same state sum to 1 and that each action probability is 0 or greater.
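The three constraints named above can be written as a simple feasibility check on a candidate measure. This sketch covers only the constraints, not the objective function or the solver, whose details are not given in this part of the text. The function name `satisfies_constraints` and its arguments are hypothetical, and the achieved position parameter is assumed to have been computed separately.

```python
def satisfies_constraints(policy, achieved_position, target_position, tol=1e-6):
    """Check the constraints of the linear program for a candidate measure.

    policy[s][a] is pi_{s,a}. achieved_position[s] and target_position[s] are
    the position parameters obtained under the candidate and required for it.
    """
    for s, action_probs in policy.items():
        # Each action probability must be 0 or greater.
        if any(prob < 0.0 for prob in action_probs.values()):
            return False
        # Action probabilities for the same state must sum to 1.
        if abs(sum(action_probs.values()) - 1.0) > tol:
            return False
        # The position parameter must match the required value.
        if abs(achieved_position[s] - target_position[s]) > tol:
            return False
    return True
```

A check like this is useful after a numerical solver returns, since solvers typically satisfy equality constraints only up to a tolerance.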

When the scale parameter calculated by the first calculation unit 430 and the scale parameter calculated by the second calculation unit 440 have converged to values within a predetermined range, the convergence determination unit 450 outputs the measure calculated by the second calculation unit 440 to the output unit 460. When they have not converged, it gives the measure calculated by the second calculation unit 440 to the first calculation unit 430 in place of the initial measure. The first calculation unit 430 then calculates again the position parameter and the scale parameter of the probability distribution of the cumulative reward obtained by acting according to that measure.
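The alternation among the first calculation unit 430, the second calculation unit 440, and the convergence determination unit 450 can be sketched as a simple loop. The Python below is illustrative only: `evaluate` and `improve` are hypothetical callables standing in for units 430 and 440, and the tolerance `tol` and iteration cap `max_iter` are assumed parameters that the description does not specify.

```python
from typing import Callable, Tuple, TypeVar

Policy = TypeVar("Policy")


def iterate_until_converged(
    initial_policy: Policy,
    evaluate: Callable[[Policy], float],
    improve: Callable[[Policy, float], Tuple[Policy, float]],
    tol: float = 1e-9,
    max_iter: int = 1000,
) -> Tuple[Policy, float]:
    """Alternate evaluation and improvement until the scale parameter settles.

    evaluate(policy) -> scale parameter under that policy (role of unit 430).
    improve(policy, scale) -> (new policy, new scale) (role of unit 440).
    """
    policy = initial_policy
    for _ in range(max_iter):
        scale_before = evaluate(policy)
        new_policy, scale_after = improve(policy, scale_before)
        # Role of unit 450: stop once the two scale values agree within tol.
        if abs(scale_after - scale_before) <= tol:
            return new_policy, scale_after
        policy = new_policy
    raise RuntimeError("scale parameter did not converge within max_iter")
```

In this sketch a measure that fails the convergence check simply replaces the previous one and is evaluated again, which mirrors the hand-off from unit 450 back to unit 430.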

The output unit 460 outputs the converged scale parameter and the initially acquired position parameter to the user. The output unit 460 may also calculate and output a risk index value, such as value at risk, based on the converged scale parameter and the acquired position parameter. Furthermore, when a plurality of position parameters have been acquired, the output unit 460 outputs each position parameter and the corresponding scale parameter to the display control unit 470.
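As one possible illustration of deriving a risk index value from the two parameters, the following Python sketch computes a value at risk under the assumption that the cumulative reward is normally distributed, with the position parameter as its mean and the scale parameter as its standard deviation. The function name, the `confidence` argument, and the sign convention (risk expressed as a loss) are assumptions made for this example, not details taken from the description.

```python
from statistics import NormalDist


def value_at_risk(location: float, scale: float, confidence: float = 0.95) -> float:
    """Loss threshold exceeded with probability 1 - confidence.

    Assumes the cumulative reward is normal with mean `location` and
    standard deviation `scale`; the result is expressed as a loss, so a
    larger value means more risk.
    """
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0 and 1")
    if scale < 0.0:
        raise ValueError("scale must be non-negative")
    # Lower-tail quantile of the standard normal at level 1 - confidence.
    z = NormalDist().inv_cdf(1.0 - confidence)
    worst_case_reward = location + scale * z
    return -worst_case_reward
```

For a fixed position parameter, a smaller scale parameter gives a smaller value at risk under this convention.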

The display control unit 470 draws an efficient frontier curve based on the position parameters and scale parameters received from the output unit 460 and outputs it to the user. For example, each time a position parameter is acquired by the position parameter acquisition unit 410, the display control unit 470 plots a point on a plane formed by one coordinate axis for the position parameter and another for the risk index value. The coordinates of the point are the position parameter and the risk index value based on the scale parameter that was calculated by the second calculation unit 440 for that position parameter and judged to have converged by the convergence determination unit 450. The display control unit 470 then draws and displays a curve by interpolating between the plotted points.

In this case, the output unit 460 may accept from the user a designation of a coordinate value on the displayed curve. In response, the output unit 460 outputs the pair of position parameter and risk index represented by that coordinate value in association with the measure calculated by the second calculation unit 440 for obtaining the cumulative reward whose probability distribution is indicated by that position parameter and risk index. In this way, the user can not only obtain the optimal measure for a single specified position parameter but also freely select a desired reward amount from the various position parameters shown on the curve and obtain the corresponding measure.

(2) When the position parameter of the total reward over all initial states is specified
The position parameter acquisition unit 410 acquires a position parameter of the probability distribution of the total cumulative reward obtained from a plurality of agents, each of which may have a different state as its initial state. This position parameter is preferably no less than the sum, over the states, of the minimum value of the position parameter of the probability distribution of the reward obtained from agents whose initial state is that state, weighted by the value given in advance as the proportion of agents whose initial state is that state. Likewise, it is preferably no greater than the sum, over the states, of the maximum value of that position parameter, weighted in the same way.
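The preferred range for the specified position parameter can be illustrated with a short Python sketch. The function and argument names are hypothetical; the sketch assumes that the per-state minimum and maximum position parameters and the proportion of agents starting in each state are already known.

```python
from typing import Sequence, Tuple


def feasible_location_range(
    weights: Sequence[float],
    min_location: Sequence[float],
    max_location: Sequence[float],
) -> Tuple[float, float]:
    """Return (lower, upper) bounds for the total-reward position parameter.

    weights[i]      : proportion of agents whose initial state is i
    min_location[i] : smallest position parameter attainable from state i
    max_location[i] : largest position parameter attainable from state i
    """
    if not (len(weights) == len(min_location) == len(max_location)):
        raise ValueError("all inputs must have one entry per state")
    lower = sum(w * lo for w, lo in zip(weights, min_location))
    upper = sum(w * hi for w, hi in zip(weights, max_location))
    return lower, upper
```

A target position parameter outside the returned interval could not be matched by any measure under these assumptions, which is why a value inside it is preferred.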

The measure acquisition unit 420 generates one measure that makes the position parameter of the probability distribution of the total cumulative reward obtained from the plurality of agents equal to the acquired position parameter, and acquires it as the initial measure. The initial measure only has to match the position parameter to the given value; it does not matter whether it minimizes risk. The first calculation unit 430 calculates the position parameter and the scale parameter of the probability distribution of the cumulative reward obtained from the plurality of agents as a result of acting according to the initial measure.

For each of the plurality of states, the second calculation unit 440 treats the action probability of each action that can be taken toward an agent as a variable and solves a linear programming problem to calculate a measure that specifies each action probability. The objective function to be minimized, which is formed based on the scale parameter calculated by the first calculation unit 430, is the sum over the states of the scale parameter of the probability distribution of the cumulative reward obtained by acting according to the action probabilities with that state as the initial state, weighted by the number of agents whose initial state is that state. The problem is subject to the constraint that the sum over the states of the position parameter of the probability distribution of the cumulative reward obtained by acting according to the action probabilities with that state as the initial state, weighted by the number of agents whose initial state is that state, equals the position parameter calculated by the first calculation unit 430. It is also subject to the constraints that the action probabilities for the same state sum to 1 and that each action probability is 0 or greater.

When the scale parameter calculated by the first calculation unit 430 and the scale parameter calculated by the second calculation unit 440 have converged to values within a predetermined range, the convergence determination unit 450 outputs the measure calculated by the second calculation unit 440. When they have not converged, the convergence determination unit 450 gives the measure calculated by the second calculation unit 440 to the first calculation unit 430 in place of the initial measure. The first calculation unit 430 then calculates again the position parameter and the scale parameter of the probability distribution of the cumulative reward obtained by acting according to that measure.
The functions of the output unit 460 and the display control unit 470 are the same as those described in (1). The processing differs in part from case (1); the details are described later with reference to a flowchart.

FIG. 6 shows the functional configuration of the first calculation unit 430. The first calculation unit 430 has a first unit 500, a second unit 530, and a third unit 570. The first unit 500 has an average value calculation unit 510 and a variance value calculation unit 520. The purpose of the first unit 500 is to calculate, for a given measure (for example, the initial measure), the mean and variance that define the probability distribution of the cumulative reward in the case where the reward amounts obtained probabilistically from the individual agents are determined independently and the total reward obtained in one period from a plurality of agents in the same state follows a normal distribution. In this case, the probability distribution of the reward obtained when all agents are in state s, take action a, and all move to state s′ is expressed by equation (2) below.

Here, σ_{s′|s,a} is the standard deviation. The total reward from state s is the sum of minute rewards from a countless number of mutually independent agents, so the central limit theorem applies.

When the reward in each period follows a normal distribution, the cumulative reward R_s(π) also follows a normal distribution, so R_s(π) is expressed by equation (3) below.

Here, M_s(π) denotes the mean and S_s(π) the standard deviation. These parameters are determined only once the measure π_{s,a} has been fixed for every state s ∈ S and action a ∈ A. Because they are therefore functions of π = {π_{s,a}; s ∈ S, a ∈ A}, they are written in the form ·_s(π).

The mean M_s(π) is the mean of the cumulative reward obtained from all of the agents from the current period onward. Because the mean (or expected value) is linear, it can be expressed as a recurrence that adds the mean reward of the current period to the mean of the cumulative reward obtained from the next period onward. Specifically, the recurrence takes the mean of the cumulative reward obtained from the next period onward starting from next-period state s′, multiplies it by the discount rate r, adds the mean μ_{s′|s,a} of the reward obtained from the current period's action, weights the result by the action probability π_{s,a} of the current period's action and the transition probability p_{s′|s,a} to the next-period state, and sums over every action a (∈ A) and next-period state s′ (∈ S). The result is the mean of the cumulative reward from the current period onward.
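Written out with the symbols defined in the text, the recurrence for the mean would take roughly the following form. This is a reconstruction from the verbal description only; the superscripts (t) and (t+1), which distinguish "from the current period onward" from "from the next period onward", are notation introduced here for illustration.

```latex
M_s^{(t)}(\pi) = \sum_{a \in A} \pi_{s,a} \sum_{s' \in S} p_{s'|s,a}
  \left( \mu_{s'|s,a} + r \, M_{s'}^{(t+1)}(\pi) \right)
```

Dropping the period superscripts, so that the same unknown M_s(π) appears on both sides, turns this into one linear equation per state.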

In this recurrence, if the mean of the probability distribution of the cumulative reward is regarded as converging to the same value whenever the initial state is the same, whether counted from the current period or from the next period, a system of |S| simultaneous equations in the means M_s(π) is obtained, where |S| is the number of states. This system is a Bellman equation and is shown as equation (4).

The average value calculation unit 510 solves this equation to calculate the mean M_s(π) of the cumulative reward for each initial state s.
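One way such a system could be solved numerically is by repeated substitution, as in the Python sketch below. It assumes the equation has the form described above (a weighted sum of the one-period mean plus the discounted mean of the next state) and a discount rate with 0 ≤ r < 1. The function name, the nested-list data layout, and the stopping rule are choices made for this example; the description does not say which solution method unit 510 uses.

```python
from typing import List


def evaluate_mean(
    policy: List[List[float]],             # policy[s][a]: probability of action a in state s
    transition: List[List[List[float]]],   # transition[s][a][t]: probability of moving s -> t
    mean_reward: List[List[List[float]]],  # mean_reward[s][a][t]: mean one-period reward
    discount: float,                       # discount rate r, assumed 0 <= r < 1
    tol: float = 1e-12,
    max_iter: int = 100_000,
) -> List[float]:
    """Solve M_s = sum_a pi[s][a] sum_t p[t|s,a] (mu[t|s,a] + r M_t) by iteration."""
    n = len(policy)
    values = [0.0] * n
    for _ in range(max_iter):
        updated = [
            sum(
                policy[s][a] * transition[s][a][t]
                * (mean_reward[s][a][t] + discount * values[t])
                for a in range(len(policy[s]))
                for t in range(n)
            )
            for s in range(n)
        ]
        # Stop when no state's value changed by more than tol.
        if max(abs(u - v) for u, v in zip(updated, values)) <= tol:
            return updated
        values = updated
    raise RuntimeError("fixed-point iteration did not converge")
```

Because the system is linear in M_s(π), a direct linear solve would serve equally well; iteration is used here only to keep the sketch free of external libraries.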

The standard deviation S_s(π) is the standard deviation of the cumulative reward obtained from all of the agents from the current period onward. Its square, the variance, can be expressed as a recurrence computed from the variance of the current period's reward and the variance of the cumulative reward obtained from the next period onward. Specifically, the recurrence takes the variance of the cumulative reward obtained from the next period onward starting from next-period state s′, multiplies it by the square of the discount rate r, adds the variance σ^2_{s′|s,a} of the reward obtained from the current period's action, weights the result by the action probability π_{s,a} of the current period's action and the transition probability p_{s′|s,a} to the next-period state, and sums over every action a (∈ A) and next-period state s′ (∈ S). The result is the variance of the cumulative reward from the current period onward.

In this recurrence, if the variance of the probability distribution of the cumulative reward is regarded as converging to the same value whenever the initial state is the same, whether counted from the current period or from the next period, a system of |S| simultaneous equations in the variances S^2_s(π) is obtained, where |S| is the number of states. This system is a Bellman equation and is shown as equation (5).

The variance value calculation unit 520 solves this equation to calculate the variance S^2_s(π) of the cumulative reward for each initial state s. The result is output to the second calculation unit 440.
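Following the verbal description, the variance equation would read approximately as below. This is a reconstruction using the symbols defined in the text, not a copy of the published formula.

```latex
S_s^2(\pi) = \sum_{a \in A} \pi_{s,a} \sum_{s' \in S} p_{s'|s,a}
  \left( \sigma_{s'|s,a}^2 + r^2 \, S_{s'}^2(\pi) \right), \qquad s \in S
```

Like the equation for the mean, this is linear in the unknowns, here the variances S^2_s(π), with r^2 playing the role of the discount factor.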

The second unit 530 has a skewness calculation unit 540, a position parameter calculation unit 550, and a scale parameter calculation unit 560. The purpose of the second unit 530 is to calculate, for a given measure (for example, the initial measure), the mean and variance that define the probability distribution of the cumulative reward in the case where the reward amounts obtained probabilistically from the individual agents are determined independently and the total reward obtained in one period from a plurality of agents in the same state follows a stable distribution. Extending the reward distribution from a normal distribution to a stable distribution makes it possible to handle heavy-tailed distributions, that is, distributions whose tails are thicker than those of a normal distribution and whose variance is not finite. This allows realistic problems that must be taken into account, such as stock price crashes and loan losses from chain-reaction bankruptcies, to be addressed.

In this case, the probability distribution of the reward obtained when all agents are in state s, take action a, and all move to state s′ is expressed by equation (6) below.

α_{s′|s,a} is the characteristic exponent of the stable distribution and indicates how quickly the probability density decays in the region of large rewards. β_{s′|s,a} is the skewness and indicates the asymmetry of the distribution. μ_{s′|s,a} corresponds to the expected value or mean of a normal distribution; because the expected value of a stable distribution may not exist, it is called the position parameter. σ_{s′|s,a} is the scale parameter. When α = 2, the variance is finite and the stable distribution coincides with the normal distribution. When 1 < α ≤ 2, the expected value exists and the position parameter equals it. The total reward from a plurality of agents in the same state is the sum of minute rewards from a countless number of mutually independent agents, so the generalized central limit theorem applies. That is, if α_{s′|s,a} is assumed to be the same for all s, s′, and a (this common value is written simply as α), the cumulative reward follows a stable distribution. Under this assumption, the cumulative reward R_s(π) is expressed by equation (7) below.

Here, Β_s(π) denotes the skewness, M_s(π) the position parameter, and S_s(π) the scale parameter.

The skewness Β_s(π) is expressed by a recurrence based on the skewness of the probability distribution of the current period's reward and the skewness of the probability distribution of the rewards from the next period onward. Specifically, the recurrence takes a value based on the skewness β_{s′|s,a} and the α-th power of the scale parameter, σ^α_{s′|s,a}, of the probability distribution of the reward obtained from the current period's action, together with the skewness and scale parameter of the probability distribution of the cumulative reward obtained from the next period onward starting from next-period state s′. It weights this value by the action probability π_{s,a} of the current period's action and the transition probability p_{s′|s,a} to the next-period state, and sums over every action a and next-period state s′. The result is the skewness Β_s(π) of the probability distribution of the cumulative reward from the current period onward.

In this recurrence, if the skewness of the probability distribution of the cumulative reward is regarded as converging to the same value whenever the initial state is the same, whether counted from the current period or from the next period, a system of |S| simultaneous equations in the skewness values Β_s(π) is obtained, where |S| is the number of states. This system is a Bellman equation and is shown as equation (8).

The scale parameter is assumed to be calculated by solving equation (10), described later. The skewness calculation unit 540 solves equation (8) to calculate Β_s(π), the skewness of the probability distribution of the cumulative reward from the current period onward.

The position parameter M_s(π) is likewise expressed by a recurrence based on the position parameter of the probability distribution of the current period's reward and the position parameter of the probability distribution of the rewards from the next period onward. Specifically, the recurrence takes the position parameter of the cumulative reward obtained from the next period onward starting from the next-period state, multiplies it by the discount rate r, adds the position parameter μ_{s′|s,a} of the reward obtained from the current period's action, weights the result by the action probability π_{s,a} of the current period's action and the transition probability p_{s′|s,a} to the next-period state, and sums over every action a and next-period state s′. The result is the position parameter M_s(π) of the probability distribution of the cumulative reward from the current period onward.

In this recurrence, if the position parameter of the probability distribution of the cumulative reward is regarded as converging to the same value whenever the initial state is the same, whether counted from the current period or from the next period, a system of |S| simultaneous equations in the position parameters M_s(π) is obtained, where |S| is the number of states. This system is a Bellman equation and is shown as equation (9).

The position parameter calculation unit 550 solves equation (9) to calculate M_s(π), the position parameter of the probability distribution of the cumulative reward from the current period onward.

Similarly, the α-th power of the scale parameter, S_s^α(π), is expressed by a recurrence based on the α-th power of the scale parameter of the probability distribution of the current period's reward and the α-th power of the scale parameter of the probability distribution of the rewards from the next period onward. Specifically, the recurrence takes the α-th power of the scale parameter of the cumulative reward obtained from the next period onward starting from the next-period state, multiplies it by the α-th power of the discount rate γ, adds the α-th power σ^α_{s′|s,a} of the scale parameter of the reward obtained from the current period's action, weights the result by the action probability π_{s,a} of the current period's action and the transition probability p_{s′|s,a} to the next-period state, and sums over every action a and next-period state s′. The result is S_s^α(π), the α-th power of the scale parameter of the probability distribution of the cumulative reward from the current period onward. This equation is shown as equation (10).

The scale parameter calculation unit 560 solves equation (10) to calculate S_s^α(π), which gives the scale parameter of the probability distribution of the cumulative reward from the current period onward.
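The scale-parameter equation described above is linear in the quantities S_s^α(π), so it can be solved in the same way as a discounted linear system. The Python sketch below does this by repeated substitution and then takes the α-th root. It assumes a discount rate with 0 ≤ γ < 1 and α > 0; the function name, data layout, and stopping rule are choices made for this example rather than details from the description.

```python
from typing import List


def evaluate_scale(
    policy: List[List[float]],            # policy[s][a]: probability of action a in state s
    transition: List[List[List[float]]],  # transition[s][a][t]: probability of moving s -> t
    sigma: List[List[List[float]]],       # sigma[s][a][t]: one-period scale parameter
    discount: float,                      # discount rate, assumed 0 <= discount < 1
    alpha: float,                         # common characteristic exponent, assumed > 0
    tol: float = 1e-12,
    max_iter: int = 100_000,
) -> List[float]:
    """Return S_s for each state, where x_s = S_s ** alpha solves
    x_s = sum_a pi[s][a] sum_t p[t|s,a] (sigma[t|s,a]**alpha + discount**alpha * x_t).
    """
    n = len(policy)
    factor = discount ** alpha
    x = [0.0] * n  # x[s] approximates S_s ** alpha
    for _ in range(max_iter):
        updated = [
            sum(
                policy[s][a] * transition[s][a][t]
                * (sigma[s][a][t] ** alpha + factor * x[t])
                for a in range(len(policy[s]))
                for t in range(n)
            )
            for s in range(n)
        ]
        if max(abs(u - v) for u, v in zip(updated, x)) <= tol:
            # Convert from the alpha-th power back to the scale parameter.
            return [u ** (1.0 / alpha) for u in updated]
        x = updated
    raise RuntimeError("fixed-point iteration did not converge")
```

With `alpha=2.0` the same routine reproduces the standard-deviation calculation of the normal case, since the quantity being propagated is then the variance.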

The third unit 570 has an average value calculation unit 580 and a scale parameter calculation unit 590. The purpose of the third unit 570 is to calculate, for a given measure (for example, the initial measure), the mean and variance that define the probability distribution of the cumulative reward in the case where the reward amounts obtained probabilistically from the individual agents are correlated and the total reward obtained in one period from a plurality of agents in the same state follows a normal distribution. Because the reward in each period follows a normal distribution, the reward obtained when all agents are in state s, take action a, and all move to state s′ can be modeled by equation (11), just as in the independent, uncorrelated case.

Because agents in the same state are correlated with one another, the total reward obtained from a plurality of agents in the same state is the sum of minute rewards from a countless number of mutually correlated agents. In this case, the central limit theorem cannot be applied as it is.
To model the distribution of the total reward obtained from such mutually correlated agents, first consider what probability distribution the sum of mutually correlated random variables X_1, X_2, ..., X_n follows. One way to model it is equation (12) below.

H is a correlation index value indicating the degree to which the reward obtained from each agent is correlated with the reward obtained from each of the other agents; it is called the Hurst exponent. H = 1/2 corresponds to the case where the reward amounts are mutually independent, which is the situation in which the central limit theorem applies. At the opposite extreme, one can imagine a situation in which every other X_j (i ≠ j) moves in step with X_i. Since each random variable on its own has the same mean and variance, this amounts to the situation in which each of X_2, X_3, ..., X_n is always equal to X_1. In that case H = 1; that is, n times the standard deviation of X_1 appears directly in the total. Including the case of inverse correlation, the domain of H is 0 < H ≤ 1.
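The two special cases described above (H = 1/2 for independent variables and H = 1 for perfectly linked variables) are both consistent with the standard deviation of a sum of n identically distributed terms scaling as n^H. The short Python sketch below assumes that scaling for illustration; the function name and signature are hypothetical, and equation (12) itself is not reproduced.

```python
def std_of_sum(sigma: float, n: int, hurst: float) -> float:
    """Standard deviation of the sum of n terms, each with standard deviation
    sigma, assuming the sum scales as n ** hurst."""
    if not 0.0 < hurst <= 1.0:
        raise ValueError("hurst must satisfy 0 < H <= 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return sigma * n ** hurst
```

For 100 terms with standard deviation 2, H = 1/2 gives 20 (square-root growth, as under the central limit theorem), while H = 1 gives 200 (n times the individual standard deviation).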

This model is now applied. Suppose that there are n_s agents in state s. Among the n_s·π_{s,a} agents to which the measure π_{s,a} is applied, the distribution of the total reward of those whose next-period state is s′ is considered to have a dependence characterized by the Hurst exponent H_{s′|s,a}. When H_{s′|s,a} is equal for all s′ and a (this value is written H_s), the form of the distribution is relatively simple, and the reward per period r_{s′|s,π_s} when the measure π_s is applied is expressed by equation (13) below.

This means that additivity holds not for the variance σ^2_{s′|s,a} but for σ^{1/H_s}_{s′|s,a}, the standard deviation raised to the power 1/H_s. The properties of the cumulative reward R_s(π) must be understood on the basis of equation (13), and there are two points to consider when time evolution is taken into account. They are as follows.
1. Whether the rewards produced by different states s_1 and s_2 are correlated. That is, whether a correlation is observed such that when a high reward is obtained from agents in state s_1 a high reward is also obtained from agents in state s_2, and when a low reward is obtained from agents in state s_1 a low reward is also obtained from agents in state s_2.
2. Whether the rewards in period t and period t+1 are independent or linked. That is, whether a situation exists in which, even for the same current state, the reward in period t+1 is high when the reward in period t is high, and low when the reward in period t is low.
In fields such as marketing, both of these correlations have been confirmed to exist. For example, when a global phenomenon that affects all states occurs and is not taken into account in the definition of the states, the correlation described in point 1 exists. When there are seasonal fluctuations or upward and downward trends that are not taken into account in the definition of the states, the correlation described in point 2 exists.

H_s is the Hurst exponent indicating the correlation among agents belonging to state s in the same period. Suppose that H_s = H for all states s. If, in addition, this Hurst exponent H is the same as the Hurst exponents describing the linkage in points 1 and 2 above, the Bellman equation shown as equation (14) below can be derived.

That is, this equation is generated from a recurrence that calculates the scale parameter S^{1/H}_s(π_s) of the cumulative reward obtained from the current period onward from the scale parameter σ^{1/H}_{s′|s,a} of the reward obtained from the current period's action and the scale parameter of the cumulative reward obtained from the next period onward, by regarding the scale parameter of the cumulative reward as converging to the same value whether counted from the current period or from the next period. Specifically, the recurrence takes the scale parameter of the cumulative reward obtained from the next period onward starting from the next-period state, multiplies it by the discount rate r raised to the power of the reciprocal of the Hurst exponent H, adds the scale parameter σ^{1/H}_{s′|s,a} of the reward obtained from the current period's action, weights the result by the action probability π_{s,a} of the current period's action and the transition probability p_{s′|s,a} to the next-period state, and sums over every action a and next-period state s′. The result is the scale parameter S^{1/H}_s(π) of the probability distribution of the cumulative reward from the current period onward.
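Read literally, the description above corresponds to an equation of roughly the following form. This is a reconstruction from the wording, using the symbols defined in the text, and may differ in detail from the published equation.

```latex
S_s^{1/H}(\pi) = \sum_{a \in A} \pi_{s,a} \sum_{s' \in S} p_{s'|s,a}
  \left( \sigma_{s'|s,a}^{1/H} + r^{1/H} \, S_{s'}^{1/H}(\pi) \right), \qquad s \in S
```

For H = 1/2 the exponent 1/H equals 2, and this form reduces to the variance equation of the independent normal case.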

The scale parameter calculation unit 590 solves Equation (14) to calculate S^{1/H}_s(π), the scale parameter of the probability distribution of the cumulative reward from the current period onward. The average value calculation performed by the average value calculation unit 580 is identical to that performed by the average value calculation unit 510, so its description is omitted.
As described above with reference to FIG. 6, the measure determination system 40 according to the present embodiment can analytically calculate the parameters that define the probability distribution of the cumulative reward for a given measure. Consequently, even when the parameters are calculated repeatedly for many different measures, the computation time can be kept short.
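As a minimal sketch of how the scale parameter could be evaluated for a fixed measure, the following Python function iterates the recurrence described for Equation (14) until it stops changing. The function name, the nested-list layout (`trans[s][a][s2]` for p_{s′|s,a}, `sigma[s][a][s2]` for σ_{s′|s,a}, `policy[s][a]` for π_{s,a}), and the use of fixed-point iteration instead of a direct linear solve are all assumptions made for illustration.

```python
def scale_parameter(policy, trans, sigma, r, hurst, tol=1e-12, max_iter=100_000):
    """Return x[s] = S_s^(1/H) for a fixed policy by iterating the recurrence.

    Assumes 0 <= r < 1 so that r**(1/H) < 1 and the iteration contracts.
    """
    n = len(policy)
    k = 1.0 / hurst          # exponent 1/H
    disc = r ** k            # discount rate raised to 1/H
    x = [0.0] * n
    for _ in range(max_iter):
        new = []
        for s in range(n):
            total = 0.0
            for a, pi in enumerate(policy[s]):
                for s2 in range(n):
                    # one-period term plus discounted next-period term
                    total += pi * trans[s][a][s2] * (sigma[s][a][s2] ** k + disc * x[s2])
            new.append(total)
        if max(abs(u - v) for u, v in zip(new, x)) < tol:
            return new
        x = new
    return x


# One state, one action, H = 0.5: the recurrence is x = sigma**2 + r**2 * x.
x = scale_parameter([[1.0]], [[[1.0]]], [[[2.0]]], r=0.9, hurst=0.5)
scale = [v ** 0.5 for v in x]  # S_s itself is x[s] raised to the power H
```

Because the recurrence is linear in S^{1/H}, a direct solver would give the same values; the iteration is used here only to keep the sketch free of dependencies.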

FIG. 7 shows an example of the efficient frontier curve 60. For each of a plurality of position parameters, the display control unit 470 plots a point at the coordinates given by the position parameter and the risk index value based on the scale parameter calculated for it. The curve obtained by connecting the plotted points, for example by spline interpolation, is the efficient frontier curve 60. FIG. 7 shows efficient frontier curves for three values of the Hurst index in the case where the rewards obtained from the agents are correlated. Specifically, the horizontal axis represents the expected value, the vertical axis represents the value at risk, and curves are drawn for Hurst indices of 0.5, 0.556, and 0.667.

That is, for H = 0.667 for example, whichever point on the efficient frontier curve 60 is selected, a measure is obtained that minimizes the risk incurred in achieving the corresponding expected value.
In this way, given the transition probabilities, the distribution of the reward for one period, and so on, the information system 10 according to the present embodiment finds the measures that minimize risk in that environment and displays the set of those measures as a frontier curve. This increases the flexibility of measure determination; for example, users can select the measure that maximizes profit in accordance with their own risk tolerance.

Hereinafter, the processing is described in further detail with reference to FIGS. 8 and 9.
FIG. 8 shows a flowchart of the processing by which the measure determination system 40 determines the optimum measure. Based on the transition probabilities stored in the probability storage unit 20 and the parameters stored in the parameter storage unit 30, the position parameter range calculation unit 400 calculates the maximum and minimum values of the position parameter of the probability distribution of the cumulative reward obtained from the plurality of agents (S800). This calculation is shown in Equation (15).

The position parameter range calculation unit 400 calculates, as the minimum value of the position parameter, the value to which value iteration based on Equation (15) converges. It likewise calculates, as the maximum value of the position parameter, the value to which value iteration converges when "min" in Equation (15) is replaced with "max". The variables are defined as above.
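The value iteration of S800 can be sketched as follows. Equation (15) is not reproduced in this text, so the update used here, M_s ← min_a Σ_{s′} p_{s′|s,a} (μ_{s′|s,a} + r M_{s′}), is an assumed form based on the standard Bellman backup; the function name and the nested-list layout are likewise hypothetical.

```python
def position_parameter_bound(trans, mu, r, pick=min, tol=1e-10, max_iter=100_000):
    """Value iteration for the smallest (pick=min) or largest (pick=max)
    attainable position parameter of the cumulative reward, per state.

    trans[s][a][s2] is the transition probability, mu[s][a][s2] the
    one-period position parameter, and r the discount rate (0 <= r < 1).
    """
    n = len(trans)
    m = [0.0] * n
    for _ in range(max_iter):
        new = [
            pick(
                sum(trans[s][a][s2] * (mu[s][a][s2] + r * m[s2]) for s2 in range(n))
                for a in range(len(trans[s]))
            )
            for s in range(n)
        ]
        if max(abs(u - v) for u, v in zip(new, m)) < tol:
            return new
        m = new
    return m


# One state with two self-looping actions whose one-period rewards differ.
trans = [[[1.0], [1.0]]]
mu = [[[1.0], [3.0]]]
m_min = position_parameter_bound(trans, mu, r=0.5, pick=min)
m_max = position_parameter_bound(trans, mu, r=0.5, pick=max)
```

Passing `pick=max` corresponds to replacing "min" with "max" as the text describes.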

As with the case division in FIG. 4, the processing differs depending on how the position parameter is given, so each case is described separately below.

(1) When a position parameter is defined for each initial state
For each of the plurality of states, the position parameter acquisition unit 410 acquires a value within the range from the minimum value to the maximum value calculated by the position parameter range calculation unit 400, as the position parameter of the probability distribution of the reward that should be obtained from the plurality of agents whose initial state is that state (S810). Letting M_s^obj denote the acquired position parameter, M_s^min its minimum value, and M_s^max its maximum value, the following Equation (16) is satisfied.

To draw the efficient frontier curve efficiently, it is desirable that the position parameter acquisition unit 410 acquire, as the plurality of position parameters, a predetermined number of values in this range one after another at equal intervals.
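The equally spaced sampling of target position parameters can be illustrated with a short helper. The function name and the choice to include both endpoints are assumptions; the text only requires a predetermined number of values with equal differences.

```python
def position_parameter_targets(m_min, m_max, count):
    """Return `count` equally spaced targets from m_min to m_max inclusive."""
    if count < 2:
        raise ValueError("count must be at least 2")
    step = (m_max - m_min) / (count - 1)
    # Append m_max explicitly so rounding never pushes the last target out of range.
    return [m_min + i * step for i in range(count - 1)] + [m_max]


targets = position_parameter_targets(2.0, 6.0, 5)
```

Each target would then be passed through S820 to S850 to obtain one point of the frontier curve.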

Next, for each of the plurality of states, the measure acquisition unit 420 generates one of the measures under which the position parameter of the probability distribution of the cumulative reward obtained from the plurality of agents whose initial state is that state matches the acquired position parameter, and acquires it as the initial measure (S820). The initial measure only needs to make the position parameter match the given value; it does not matter whether it minimizes risk. Specifically, for example, the measure acquisition unit 420 can obtain the initial measure π^(0)_{s,a} by solving a linear programming problem that has the constraints shown in Equation (17) below and a predetermined objective function.

Next, for each state, the first calculation unit 430 calculates the position parameter and the scale parameter of the probability distribution of the cumulative reward obtained, as a result of acting in accordance with the initial measure, from the plurality of agents whose initial state is that state (S830). Depending on the probability distribution of the reward in one period, this calculation is performed by the first unit 500, the second unit 530, or the third unit 570 described with reference to FIG. 6. Next, for each of the plurality of states, the second calculation unit 440 calculates a measure that determines the action probabilities by solving a linear programming problem (S840). The variables of this problem are the action probabilities of the actions that can be taken toward the agents. Its objective function gives the scale parameter from the current period onward, computed from those variables and from the scale parameter calculated by the first calculation unit 430, on the premise that the scale parameter of the cumulative reward distribution from the next period onward equals that calculated value. The problem minimizes this objective function. It is constrained so that the position parameter of the cumulative reward distribution obtained by acting in accordance with the action probabilities matches the position parameter calculated by the first calculation unit 430. It is further constrained so that the action probabilities for the same state sum to 1 and each action probability is 0 or more.

Specifically, the second calculation unit 440 first obtains the values of Equations (18) and (19).

Using these values, the objective function of the linear programming problem is expressed as Equation (20), and its constraints are expressed as Equation (21).

This objective function holds fixed the scale parameter of the probability distribution of the cumulative reward from the next period onward. That is, it assumes that the action probabilities from the next period onward are determined by the measure previously calculated by the second calculation unit 440 (the initial measure on the first pass). It then calculates the scale parameter of the new cumulative reward distribution obtained when only the action probabilities of the current period are replaced with the variable action probabilities.
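To make the per-state problem of S840 concrete, here is a small sketch that solves it without an LP library. The lists `mean_coef` and `scale_coef` are hypothetical stand-ins for the per-action values of Equations (18) and (19), whose exact form is not reproduced in this text. The sketch relies on a standard property of linear programs: with only two equality constraints (probabilities sum to 1, position parameter matches the target), some optimal solution uses at most two actions, so checking every single action and every pair is enough for small action sets. A real system would more likely call a general LP solver.

```python
def improve_state(mean_coef, scale_coef, target, eps=1e-9):
    """Minimize sum(pi[a] * scale_coef[a]) subject to
    sum(pi[a] * mean_coef[a]) == target, sum(pi) == 1, pi >= 0.

    Returns (pi, objective value); raises ValueError if infeasible.
    """
    n = len(mean_coef)
    best_value, best_pi = None, None
    for i in range(n):
        for j in range(i, n):
            if i == j:
                # Pure action: feasible only if it hits the target by itself.
                if abs(mean_coef[i] - target) > eps:
                    continue
                weight = 1.0
            else:
                gap = mean_coef[i] - mean_coef[j]
                if abs(gap) <= eps:
                    continue  # equal means: covered by the pure-action cases
                weight = (target - mean_coef[j]) / gap  # probability of action i
                if weight < -eps or weight > 1.0 + eps:
                    continue
                weight = min(1.0, max(0.0, weight))
            value = weight * scale_coef[i] + (1.0 - weight) * scale_coef[j]
            if best_value is None or value < best_value:
                pi = [0.0] * n
                pi[i] += weight
                pi[j] += 1.0 - weight
                best_value, best_pi = value, pi
    if best_pi is None:
        raise ValueError("target position parameter is not attainable")
    return best_pi, best_value


# Action 2 hits the target alone but is risky; mixing actions 0 and 1 is cheaper.
pi, value = improve_state([1.0, 3.0, 2.0], [1.0, 1.0, 5.0], target=2.0)
```

The example shows why the measure is expressed as action probabilities: a mixture of two actions can meet the position parameter constraint at a lower scale parameter than any single action.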

Note that the value to which the second calculation unit 440 minimizes the objective function does not have to be the theoretical minimum. For example, the second calculation unit 440 may improve the objective value step by step in the decreasing direction, that is, obtain a value smaller than the scale parameter it calculated previously, so that the scale parameter eventually converges to the neighborhood of the minimum.
The linear programming problem may further include the constraint that the cost required for each action, weighted by the action probability of that action and summed over the actions, be no more than a predetermined budget (C_s). This constraint is expressed as Equation (22).

This makes it possible to limit the cost that can be spent on minimizing risk, and thus to obtain a solution better suited to real-world problems.

The convergence determination unit 450 determines whether the scale parameter calculated by the first calculation unit 430 and the scale parameter calculated by the second calculation unit 440 have converged to within a predetermined range (S850). If they have converged (S850: YES), the measure calculated by the second calculation unit 440 is output to the output unit 460. At this time, the second calculation unit 440 separately records the calculated scale parameter and the corresponding position parameter for later use in drawing the frontier curve.

If, on the other hand, they have not converged (S850: NO), the convergence determination unit 450 gives the measure calculated by the second calculation unit 440 to the first calculation unit 430 in place of the measure previously calculated by the second calculation unit 440 (the initial measure on the first pass). The first calculation unit 430 then calculates again the position parameter and the scale parameter of the probability distribution of the cumulative reward obtained by acting in accordance with the given measure (S830). In addition, the measure assumed for the cumulative reward from the next period onward in the objective function given by Equations (20) and (19) is replaced with the measure just calculated by the second calculation unit 440, a new linear programming problem is generated, and the second calculation unit 440 solves it (S840).
By repeating S830 through S850 in this way, measures π^(1)_{s,a}, π^(2)_{s,a}, ..., π^(n)_{s,a} are calculated one after another until the scale parameter converges. Once it has converged, the scale parameter is that of the reward obtained under the measure that minimizes risk.
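The S830 to S850 loop can be summarized as a generic driver. The callables `evaluate` (standing in for the first calculation unit 430) and `improve` (standing in for the second calculation unit 440), their return shapes, and the tolerance are hypothetical; the sketch shows only the control flow, not the calculations themselves.

```python
def iterate_until_converged(initial_measure, evaluate, improve, tol=1e-8, max_rounds=1000):
    """Alternate evaluation (S830) and improvement (S840) until the two
    scale parameter vectors agree to within `tol` (S850)."""
    measure = initial_measure
    for _ in range(max_rounds):
        position, scale = evaluate(measure)             # S830
        measure, new_scale = improve(position, scale)   # S840
        gap = max(abs(a - b) for a, b in zip(scale, new_scale))
        if gap <= tol:                                  # S850: YES
            return measure, position, new_scale
    raise RuntimeError("scale parameter did not converge")


# Toy stand-ins: each improvement halves the distance of the scale parameter to 2.
toy_evaluate = lambda m: ([0.0], [m[0]])
toy_improve = lambda pos, sc: ([sc[0] / 2 + 1], [sc[0] / 2 + 1])
measure, position, scale = iterate_until_converged([10.0], toy_evaluate, toy_improve)
```

Recording `position` and `scale` at the moment of convergence corresponds to the values that are later plotted on the frontier curve.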

When the scale parameter has converged (S850: YES), the position parameter range calculation unit 400 then determines whether any position parameter that should be acquired remains within the range the position parameter can take (S860). If one remains (S860: NO), the position parameter range calculation unit 400 returns the processing to S810 to acquire the next position parameter and calculate the corresponding optimum measure. If none remains (S860: YES), the display control unit 470 draws the efficient frontier curve and displays it to the user (S870). An efficient frontier curve is drawn for each initial state of the agents.

When the user designates a coordinate value on the displayed curve, the output unit 460 outputs the pair of the position parameter and the risk index represented by that coordinate value, in association with the measure that the second calculation unit 440 calculated for obtaining a cumulative reward whose probability distribution is indicated by that position parameter and risk index (S880). The measure may be one recorded each time convergence was determined in S850, or may be calculated when the coordinate value on the curve is designated. For example, the output unit 460 may give the position parameter indicated by the designated coordinate value to the position parameter acquisition unit 410, have S810 through S850 performed again, and output the measure obtained when the scale parameter converges.

This concludes the processing for case (1), in which a position parameter is defined for each initial state. It allows the user to determine the optimum measure for each customer segment (that is, each initial state), for example.
Next, the case in which a position parameter is defined for the total reward over all initial states is described.

(2) When a position parameter is defined for the total reward over all initial states
The position parameter acquisition unit 410 acquires the position parameter of the probability distribution of the total cumulative reward obtained from a plurality of agents, each of which may start from a different initial state (S810). This position parameter M_obj is expressed by Equation (23) below, where w_s is the proportion of agents whose initial state is state s. In a marketing application, for example, w_s may be the number of customers.

It is desirable that this position parameter be no less than the sum, over the states, of the minimum value of the position parameter of the reward distribution obtained from agents whose initial state is that state, weighted by the value given in advance as the proportion of agents whose initial state is that state. Likewise, it is desirable that it be no more than the corresponding weighted sum of the maximum values. That is, the range of the position parameter M_obj is expressed by Equation (24) below.

The measure acquisition unit 420 generates one of the measures under which the position parameter of the probability distribution of the total cumulative reward obtained from the plurality of agents matches the acquired position parameter, and acquires it as the initial measure (S820). The initial measure only needs to make the position parameter match the given value; it does not matter whether it minimizes risk. For example, the measure acquisition unit 420 first generates, for each state s, a value M_s^tmp that lies within the range the position parameter of the cumulative reward distribution can take when that state is the initial state, chosen so that weighting the values by w_s and summing them gives the position parameter M_obj. The measure acquisition unit 420 can calculate M_s^tmp by Equation (25) below, for example.
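One simple way to produce per-state values M_s^tmp with the required properties is to move every state the same fraction of the way from its minimum to its maximum. This interpolation rule is an assumption chosen for illustration and is not necessarily the rule given by Equation (25), which is not reproduced in this text; the function name is hypothetical.

```python
def split_total_target(m_obj, weights, m_min, m_max):
    """Return per-state targets m_tmp with m_min[s] <= m_tmp[s] <= m_max[s]
    and sum(weights[s] * m_tmp[s]) == m_obj (up to rounding)."""
    low = sum(w * lo for w, lo in zip(weights, m_min))
    high = sum(w * hi for w, hi in zip(weights, m_max))
    if not low <= m_obj <= high:
        raise ValueError("m_obj lies outside the attainable range")
    # Common fraction of the way from each state's minimum to its maximum.
    theta = 0.0 if high == low else (m_obj - low) / (high - low)
    return [lo + theta * (hi - lo) for lo, hi in zip(m_min, m_max)]


m_tmp = split_total_target(7.0, [0.25, 0.75], [0.0, 4.0], [8.0, 12.0])
```

The range check mirrors the bounds described for M_obj, and each resulting M_s^tmp can then serve as the per-state target when the initial measure is generated.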

The measure acquisition unit 420 then obtains, for each state s, an initial measure π^(0)_{s,a} that satisfies the constraints shown in Equation (26) below. This is done by solving a linear programming problem having the constraints shown in Equation (26).

Next, the first calculation unit 430 calculates the position parameter and the scale parameter of the probability distribution of the cumulative reward obtained from the plurality of agents as a result of acting in accordance with the initial measure (S830). To calculate a measure, the second calculation unit 440 first calculates the values of Equations (27) and (28) based on the parameters calculated in S830.

The second calculation unit 440 then solves a linear programming problem whose variables are the action probabilities of the actions that can be taken toward the agents, and which minimizes an objective function giving the scale parameter of the probability distribution of the total reward obtained from agents whose initial states are the respective states. For each of the plurality of states, this objective function computes the scale parameter of the cumulative reward distribution from the current period onward that results from starting in that state and acting in accordance with the action probabilities, on the premise that the scale parameter of the cumulative reward distribution from the next period onward equals the scale parameter calculated by the first calculation unit 430. It then weights each result by the number of agents whose initial state is that state and sums the results. The second calculation unit 440 can thereby calculate a measure that determines the action probabilities minimizing the objective function. Specifically, the objective function of this linear programming problem is expressed by Equation (29).

This objective function holds fixed the scale parameter of the probability distribution of the cumulative reward from the next period onward. That is, it assumes that the action probabilities from the next period onward are determined by the measure previously calculated by the second calculation unit 440 (the initial measure on the first pass). It then calculates the scale parameter of the new cumulative reward distribution obtained when only the action probabilities of the current period are replaced with the variable action probabilities.
The linear programming problem has the constraint that the position parameters of the cumulative reward distributions obtained by starting in each state and acting in accordance with the action probabilities, weighted by the number of agents whose initial state is that state and summed, match the position parameter calculated by the first calculation unit 430. It also has the constraints that the action probabilities for the same state sum to 1 and that each action probability is 0 or more. These constraints are expressed by Equation (30).

The linear programming problem may further include the constraint that the cost required for each action, weighted by the action probability of that action and by the proportion of agents whose initial state is the corresponding state and summed over all combinations of state and action, be no more than a predetermined budget C_total. This constraint is expressed as Equation (31) below.
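The budget constraint can be checked for a candidate measure as sketched below. The names, the list layout (`cost[s][a]` for the cost of action a in state s), and the small tolerance are assumptions; inside the linear program itself the same weighted sum would appear as one inequality row.

```python
def expected_total_cost(weights, policy, cost):
    """Sum of weights[s] * policy[s][a] * cost[s][a] over all states and actions."""
    return sum(
        w * pi * c
        for w, row_pi, row_c in zip(weights, policy, cost)
        for pi, c in zip(row_pi, row_c)
    )


def within_budget(weights, policy, cost, budget, eps=1e-9):
    """True if the measure's expected total cost does not exceed the budget."""
    return expected_total_cost(weights, policy, cost) <= budget + eps


weights = [0.5, 0.5]
policy = [[1.0, 0.0], [0.5, 0.5]]
cost = [[2.0, 10.0], [2.0, 10.0]]
ok = within_budget(weights, policy, cost, budget=4.0)
```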

The convergence determination unit 450 determines whether the scale parameter calculated by the first calculation unit 430 and the scale parameter calculated by the second calculation unit 440 have converged to within a predetermined range (S850). If they have not converged (S850: NO), the convergence determination unit 450 gives the measure calculated by the second calculation unit 440 to the first calculation unit 430 in place of the measure previously calculated by the second calculation unit 440 (the initial measure on the first pass). The first calculation unit 430 then calculates again the position parameter and the scale parameter of the probability distribution of the cumulative reward obtained by acting in accordance with this measure (S830). In addition, the measure assumed for the cumulative reward from the next period onward in the objective function given by Equations (29) and (28) is replaced with the measure just calculated by the second calculation unit 440, a new linear programming problem is generated, and the second calculation unit 440 solves it (S840).

If, on the other hand, the scale parameter has converged (S850: YES), the convergence determination unit 450 separately stores the converged scale parameter and the corresponding position parameter in association with the measure calculated in S840. The position parameter range calculation unit 400 then determines whether any position parameter that should be acquired remains within the range the position parameter can take (S860). If one remains (S860: NO), the position parameter range calculation unit 400 returns the processing to S810 to acquire the next position parameter and calculate the corresponding optimum measure. If none remains (S860: YES), the display control unit 470 draws the efficient frontier curve and displays it to the user (S870). Unlike case (1), only one efficient frontier curve is drawn.

When the user designates a coordinate value on the displayed curve, the output unit 460 outputs the pair of the position parameter and the risk index represented by that coordinate value, in association with the measure that the second calculation unit 440 calculated for obtaining a cumulative reward whose probability distribution is indicated by that position parameter and risk index (S880). It is desirable that the measure be one recorded each time convergence was determined in S850. This is because, although the calculation procedure of S830 through S850 is guaranteed to minimize risk, the position parameter acquired in S810 does not necessarily match the position parameter of the reward distribution produced by the measure calculated in S840.

This concludes the processing for case (2), in which a position parameter is defined for the total reward obtained from agents having various initial states. It allows the user to determine a measure that optimizes the relationship between the total reward and the risk even when there are so many customer segments that defining a position parameter for each one is difficult.

FIG. 9 is a flowchart showing the details of the processing in S830. FIG. 9 covers the processing for obtaining the scale parameter from a measure, for the total reward obtained from the plurality of agents in one period, in three cases: (A) the variance is finite (that is, the distribution is normal) and the agents are independent of one another; (B) the variance may be infinite but the agents are independent of one another; and (C) the variance is finite and the agents are correlated in accordance with the Hurst index. Each case is described below.

The first calculation unit 430 determines whether the variance of the reward obtained from the plurality of agents in one period is finite (S900). If the variance is finite (S900: YES), the first calculation unit 430 determines whether the reward amounts obtained from the respective agents are independent (S910). If they are independent (S910: YES), the first calculation unit 430 causes the first unit 500 to calculate the mean and the variance of the probability distribution of the cumulative reward (S920). This calculation is performed by solving the simultaneous equations given by Equations (4) and (5) above, which are expressed as Equation (32). That is, the first unit 500 calculates the parameters that define the cumulative reward distribution by solving the equation for M and the equation for S in Equation (32) with an existing numerical method such as LU decomposition or Gaussian elimination.

Here, as shown in Equation (33), M, S^2, μ, and σ are vectors whose dimension is the number of states |S|, and P is an |S| × |S| matrix.
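As a hedged illustration of solving for M and S^2 by Gaussian elimination, the sketch below assumes that, for a fixed measure, Equation (32) can be written as (I − rP)M = μ and (I − r^2 P)S^2 = σ^2, where P is the state transition matrix under the measure and μ and σ^2 are the per-state one-period mean and variance already averaged over actions and next states. That matrix form is an assumption, since Equation (32) is not reproduced in this text, and all names are hypothetical.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix copy
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(aug[i][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):  # back substitution
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def cumulative_parameters(p, mu, var, r):
    """Return (M, S^2) for transition matrix p, one-period means mu,
    one-period variances var, and discount rate r (assumed form, see lead-in)."""
    n = len(p)

    def system(discount):
        # Build I - discount * P.
        return [[(1.0 if i == j else 0.0) - discount * p[i][j] for j in range(n)]
                for i in range(n)]

    return solve_linear(system(r), mu), solve_linear(system(r * r), var)


mean, variance = cumulative_parameters(
    [[0.5, 0.5], [0.0, 1.0]], mu=[1.0, 2.0], var=[1.0, 4.0], r=0.5
)
```

Both systems share the same structure and differ only in the discount factor, which is why a single solver routine serves for the mean and the variance.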

If, on the other hand, the variance need not be finite (S900: NO), the first calculation unit 430 causes the second unit 530 to calculate the parameters that define the probability distribution of the cumulative reward (S930). This calculation is performed by solving the equations given by Equations (9) and (10) above, which are expressed as Equation (34) below. That is, the second unit 530 calculates the position parameter and the scale parameter that define the cumulative reward distribution by solving the equation for M and the equation for S in Equation (34) with an existing numerical method such as LU decomposition or Gaussian elimination.

Here, as shown in Equation (35), M, S^2, μ, and σ are vectors whose dimension is the number of states |S|, and P is an |S| × |S| matrix.

In addition, the second unit 530 calculates the product of the skewness Β and the scale parameter S by solving the equation shown in Expression (36) below, where each variable is defined as in Expression (37).

The second unit 530 can then calculate the skewness Β by evaluating Expression (38).
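The sketch below illustrates case (B) under stated assumptions. Expressions (34), (36), and (38) are not reproduced in this text, so the recurrences are inferred from the claims: the position parameter is discounted by the discount rate, and the scale parameter by the discount rate raised to the power of the characteristic index. The recurrence for the product of skewness and scale parameter is an assumption modeled on the usual addition rule for independent stable variables. For brevity the code uses plain fixed-point iteration instead of LU decomposition or Gaussian elimination; all names (`fixed_point`, `stable_parameters`, `alpha`, `gamma`) and the example numbers are hypothetical.

```python
def fixed_point(p, r, discount, tol=1e-12, max_iter=100_000):
    """Iterate x <- r + discount * P x until successive iterates agree."""
    n = len(r)
    x = [0.0] * n
    for _ in range(max_iter):
        nxt = [r[i] + discount * sum(p[i][j] * x[j] for j in range(n))
               for i in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, x)) < tol:
            return nxt
        x = nxt
    raise RuntimeError("did not converge")


def stable_parameters(p, loc, scale, skew, alpha, gamma):
    """Return position M, scale S, and skewness B of the cumulative reward.

    Assumed recurrences (not taken verbatim from the patent):
      M  = loc          + gamma        * P M
      S  = scale        + gamma**alpha * P S
      BS = skew * scale + gamma**alpha * P BS,  then  B = BS / S
    Every entry of `scale` must be positive so the final division is defined.
    """
    g_alpha = gamma ** alpha
    m = fixed_point(p, loc, gamma)
    s = fixed_point(p, scale, g_alpha)
    bs = fixed_point(p, [b * c for b, c in zip(skew, scale)], g_alpha)
    beta = [z / c for z, c in zip(bs, s)]
    return m, s, beta


# Hypothetical two-state example; all numbers are made up for illustration.
P_B = [[0.9, 0.1], [0.5, 0.5]]
M_B, S_B, B_B = stable_parameters(P_B, loc=[1.0, 0.0], scale=[0.5, 1.0],
                                  skew=[0.5, 0.5], alpha=1.5, gamma=0.9)
```

Under these assumptions the cumulative skewness is a weighted average of the per-period skewness values, so it stays within the range of the inputs; with a characteristic index of 2 the scale recurrence reduces to the variance recurrence of case (A).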

On the other hand, if the agents are correlated with one another (S910: NO), the first calculation unit 430 causes the third unit 570 to calculate the parameters that define the probability distribution of the cumulative reward (S940). This calculation is carried out by solving the simultaneous equations of Expressions (4) and (14) above, which are expressed together as Expression (39). That is, the third unit 570 solves the equation for M and the equation for S shown in Expression (39) by an existing numerical method such as LU decomposition or Gaussian elimination, thereby calculating the parameters that define the probability distribution of the cumulative reward.

Here, as shown in Expression (40), M, S², μ, and σ are vectors whose dimension is the number of states |S|, and P is an |S| × |S| matrix.
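Expression (39) is not reproduced in this text. As a hedged reconstruction based only on the wording of the claims (the scale parameter of the cumulative reward is multiplied by the discount rate raised to the power of the reciprocal of the correlation index value), the recurrences for case (C) may be sketched as follows. The symbols π (action probability), p (transition probability), γ (discount rate), and H (Hurst exponent) are assumed notation, and μ and σ on the second line denote the per-state quantities already averaged over actions and next states.

```latex
\begin{align}
M(s) &= \sum_{a} \pi(a \mid s) \sum_{s'} p(s' \mid s, a)
        \left[ \mu(s, a, s') + \gamma\, M(s') \right],
        & M &= (I - \gamma P)^{-1} \mu, \\
S(s) &= \sum_{a} \pi(a \mid s) \sum_{s'} p(s' \mid s, a)
        \left[ \sigma(s, a, s') + \gamma^{1/H} S(s') \right],
        & S &= (I - \gamma^{1/H} P)^{-1} \sigma .
\end{align}
```

Under this reading, H = 1/2 gives γ^{1/H} = γ², which matches the squared discount rate applied to the variance value in the independent case (A); a larger H raises the effective discount factor and therefore the scale parameter.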

FIG. 10 shows an example of the hardware configuration of an information processing apparatus 600 that functions as the information system 10. The information processing apparatus 600 includes a CPU peripheral section having a CPU 1000, a RAM 1020, and a graphic controller 1075 that are interconnected by a host controller 1082; an input/output section having a communication interface 1030, a hard disk drive 1040, and a CD-ROM drive 1060 that are connected to the host controller 1082 by an input/output controller 1084; and a legacy input/output section having a ROM 1010, a flexible disk drive 1050, and an input/output chip 1070 that are connected to the input/output controller 1084.

The host controller 1082 connects the RAM 1020 to the CPU 1000 and the graphic controller 1075, which access the RAM 1020 at a high transfer rate. The CPU 1000 operates based on programs stored in the ROM 1010 and the RAM 1020 and controls each section. The graphic controller 1075 acquires image data that the CPU 1000 or the like generates in a frame buffer provided in the RAM 1020 and displays it on a display device 1080. Alternatively, the graphic controller 1075 may internally include a frame buffer that stores the image data generated by the CPU 1000 or the like.

The input/output controller 1084 connects the host controller 1082 to the communication interface 1030, the hard disk drive 1040, and the CD-ROM drive 1060, which are relatively high-speed input/output devices. The communication interface 1030 communicates with external devices via a network. The hard disk drive 1040 stores the programs and data used by the information processing apparatus 600. The CD-ROM drive 1060 reads a program or data from a CD-ROM 1095 and provides it to the RAM 1020 or the hard disk drive 1040.

The input/output controller 1084 is also connected to the ROM 1010 and to relatively low-speed input/output devices such as the flexible disk drive 1050 and the input/output chip 1070. The ROM 1010 stores a boot program that the CPU 1000 executes when the information processing apparatus 600 starts up, programs that depend on the hardware of the information processing apparatus 600, and the like. The flexible disk drive 1050 reads a program or data from a flexible disk 1090 and provides it to the RAM 1020 or the hard disk drive 1040 via the input/output chip 1070. The input/output chip 1070 connects the flexible disk 1090 and various input/output devices via, for example, a parallel port, a serial port, a keyboard port, a mouse port, and the like.

A program provided to the information processing apparatus 600 is stored on a recording medium such as the flexible disk 1090, the CD-ROM 1095, or an IC card, and is supplied by the user. The program is read from the recording medium via the input/output chip 1070 and/or the input/output controller 1084, installed in the information processing apparatus 600, and executed. The operations that the program causes the information processing apparatus 600 and the like to perform are the same as the operations of the information system 10 described with reference to FIGS. 1 to 9, so their description is omitted.

The program described above may be stored on an external storage medium. In addition to the flexible disk 1090 and the CD-ROM 1095, usable storage media include optical recording media such as a DVD or PD, magneto-optical recording media such as an MD, tape media, and semiconductor memory such as an IC card. A storage device such as a hard disk or RAM provided in a server system connected to a dedicated communication network or the Internet may also be used as the recording medium, and the program may be provided to the information processing apparatus 600 via the network.

As described above, the information system 10 according to the present embodiment can obtain an optimal measure for a plurality of agents and can therefore provide an appropriate solution to real-world problems such as marketing. It can obtain a measure that takes full account not only of the expected return but also of risk and budget, which makes it easier to apply to real-world problems. Because it can obtain a probabilistic measure rather than a deterministic one, the actions that can be taken in the same state can be mixed, which makes it possible to obtain a better measure. Finally, it considers not only the case where the agents are independent but also the case where they act in correlation with one another, so a measure can be determined for problems that are closer to reality.

Although the present invention has been described above using an embodiment, the technical scope of the present invention is not limited to the scope described in that embodiment. It will be apparent to those skilled in the art that various modifications or improvements can be made to the above embodiment. It is apparent from the claims that embodiments incorporating such modifications or improvements can also fall within the technical scope of the present invention.

FIG. 1 shows the overall configuration of the information system 10.
FIG. 2 shows an example of the data structure of the probability storage unit 20.
FIG. 3 shows an example of the data structure of the parameter storage unit 30.
FIG. 4 shows an example of the optimal measure 70.
FIG. 5 shows the functional configuration of the measure determination system 40.
FIG. 6 shows the functional configuration of the first calculation unit 430.
FIG. 7 shows an example of the efficient frontier curve 60.
FIG. 8 shows a flowchart of the processing in which the optimal measure is determined by the measure determination system 40.
FIG. 9 is a flowchart showing the details of the processing in S830.
FIG. 10 shows an example of the hardware configuration of the information processing apparatus 600 that functions as the measure determination system 40.

Explanation of symbols

10 Information system
20 Probability storage unit
30 Parameter storage unit
40 Measure determination system
60 Efficient frontier curve
70 Optimal measure
400 Position parameter range calculation unit
410 Position parameter acquisition unit
420 Measure acquisition unit
430 First calculation unit
440 Second calculation unit
450 Convergence determination unit
460 Output unit
470 Display control unit
500 First unit
510 Average value calculation unit
520 Variance value calculation unit
530 Second unit
540 Skewness calculation unit
550 Position parameter calculation unit
560 Scale parameter calculation unit
570 Third unit
580 Average value calculation unit
590 Scale parameter calculation unit
600 Information processing apparatus

Claims

A system for calculating a probability distribution of the cumulative reward obtained as a result of sequentially taking a plurality of actions on a plurality of agents, the system comprising:
a probability storage unit that stores, for each of a plurality of states that an agent can take, the transition probabilities of transitioning to each state when each action is taken on an agent in that state;
a parameter storage unit that stores, for each of the plurality of states, a parameter of the probability distribution of the reward obtained when the plurality of agents in that state transition to each state as a result of each action being taken;
a measure acquisition unit that acquires a measure that defines, in association with each of the plurality of states, the action probability of taking each action on an agent in that state;
a first calculation unit that generates a recurrence formula in which the parameter of the probability distribution of the cumulative reward obtained from the plurality of agents from the current period onward is calculated by weighting, by the action probability of the current-period action and the transition probability to the next-period state, a value based on the parameter of the probability distribution of the reward obtained by the current-period action and on the parameter of the probability distribution of the cumulative reward obtained from the next-period state onward, and summing over each action and each next-period state, and that calculates the parameter by solving the equation obtained on the condition that, in the recurrence formula, the parameter of the probability distribution of the cumulative reward converges to the same value from the current period onward and from the next period onward when the initial state is the same; and
an output unit that outputs the calculated parameter as information indicating the probability distribution of the cumulative reward.

The system according to claim 1, wherein, when the total of the rewards obtained from the plurality of agents follows a normal distribution, the parameter storage unit stores the average value and the variance value of the reward as the parameters of the probability distribution of the reward, and
the first calculation unit includes:
an average value calculation unit that calculates the average value of the cumulative reward from the current period onward based on a recurrence formula in which the average value of the cumulative reward obtained from the plurality of agents from the current period onward is calculated by weighting, by the action probability of the current-period action and the transition probability to the next-period state, the value obtained by multiplying the average value of the cumulative reward obtained from the next-period state onward by the discount rate and adding the average value of the reward obtained by the current-period action, and summing over each action and each next-period state; and
a variance value calculation unit that calculates the variance value of the cumulative reward from the current period onward based on a recurrence formula in which the variance value of the cumulative reward obtained from the plurality of agents from the current period onward is calculated by weighting, by the action probability of the current-period action and the transition probability to the next-period state, the value obtained by multiplying the variance value of the cumulative reward obtained from the next-period state onward by the square of the discount rate and adding the variance value of the reward obtained by the current-period action, and summing over each action and each next-period state.

The system according to claim 2, wherein, when the total of the rewards obtained from the plurality of agents follows a normal distribution, the parameter storage unit stores the average value and the variance value of the reward, and a correlation index value indicating the degree of correlation between the reward obtained from each agent and the reward obtained from each of the other agents, and
the first calculation unit includes:
an average value calculation unit that calculates the average value of the cumulative reward from the current period onward based on a recurrence formula in which the average value of the cumulative reward obtained from the plurality of agents from the current period onward is calculated by weighting, by the action probability of the current-period action and the transition probability to the next-period state, the value obtained by multiplying the average value of the cumulative reward obtained from the next-period state onward by the discount rate and adding the average value of the reward obtained by the current-period action, and summing over each action and each next-period state; and
a scale parameter calculation unit that calculates the scale parameter of the probability distribution of the cumulative reward from the current period onward based on a recurrence formula in which the scale parameter of the probability distribution of the cumulative reward obtained from the plurality of agents from the current period onward is calculated by weighting, by the action probability of the current-period action and the transition probability to the next-period state, the value obtained by multiplying the scale parameter of the probability distribution of the cumulative reward obtained from the next-period state onward by the discount rate raised to the power of the reciprocal of the correlation index value and adding the scale parameter of the probability distribution of the reward obtained by the current-period action, and summing over each action and each next-period state.

The system according to claim 1, wherein, when the probability distribution of the reward obtained from the plurality of agents follows a stable distribution, the parameter storage unit stores a characteristic index indicating the degree to which the probability density decays in the region of the stable distribution where the reward is large, a skewness indicating the asymmetry of the stable distribution, a position parameter of the stable distribution, and a scale parameter of the stable distribution, and
the first calculation unit includes:
a skewness calculation unit that calculates the skewness of the probability distribution of the cumulative reward from the current period onward based on a recurrence formula in which the skewness of the probability distribution of the cumulative reward obtained from the plurality of agents from the current period onward is calculated by weighting, by the action probability of the current-period action and the transition probability to the next-period state, a value based on the skewness and the scale parameter of the probability distribution of the reward obtained by the current-period action and on the skewness and the scale parameter of the probability distribution of the cumulative reward obtained from the next-period state onward, and summing over each action and each next-period state;
a position parameter calculation unit that calculates the position parameter of the probability distribution of the cumulative reward from the current period onward based on a recurrence formula in which the position parameter of the probability distribution of the cumulative reward obtained from the plurality of agents from the current period onward is calculated by weighting, by the action probability of the current-period action and the transition probability to the next-period state, the value obtained by multiplying the position parameter of the probability distribution of the cumulative reward obtained from the next-period state onward by the discount rate and adding the position parameter of the probability distribution of the reward obtained by the current-period action, and summing over each action and each next-period state; and
a scale parameter calculation unit that calculates the scale parameter of the probability distribution of the cumulative reward from the current period onward based on a recurrence formula in which the scale parameter of the probability distribution of the cumulative reward obtained from the plurality of agents from the current period onward is calculated by weighting, by the action probability of the current-period action and the transition probability to the next-period state, the value obtained by multiplying the scale parameter of the probability distribution of the cumulative reward obtained from the next-period state onward by the discount rate raised to the power of the characteristic index and adding the scale parameter of the probability distribution of the reward obtained by the current-period action, and summing over each action and each next-period state.

The system according to claim 1, further comprising a position parameter acquisition unit that acquires, for each of the plurality of states, the position parameter of the probability distribution of the reward to be obtained from the plurality of agents having that state as an initial state, wherein
the measure acquisition unit generates, for each of the plurality of states, one measure that makes the position parameter of the probability distribution of the cumulative reward obtained from the plurality of agents having that state as an initial state match the acquired position parameter, and acquires it as an initial measure,
the first calculation unit calculates the position parameter and the scale parameter of the probability distribution of the cumulative reward obtained from the plurality of agents as a result of acting according to the initial measure, and
the system further comprises:
a second calculation unit that calculates a measure defining each action probability by solving a linear programming problem that, with the action probability of each action that can be taken on an agent in each of the plurality of states as a variable, and under the constraint that the position parameter of the probability distribution of the cumulative reward obtained as a result of acting according to the action probabilities matches the position parameter calculated by the first calculation unit, minimizes the value of an objective function that calculates the scale parameter of the probability distribution of the cumulative reward from the current period onward based on the value of the scale parameter on the assumption that the scale parameter of the probability distribution of the cumulative reward from the next period onward matches the scale parameter calculated by the first calculation unit; and
a convergence determination unit that, on the condition that the scale parameter calculated by the first calculation unit and the scale parameter calculated by the second calculation unit have converged to within a predetermined range, outputs the measure calculated by the second calculation unit, and that, on the condition that they have not converged, gives the measure calculated by the second calculation unit to the first calculation unit in place of the initial measure and causes the first calculation unit to calculate the position parameter and the scale parameter of the probability distribution of the cumulative reward obtained as a result of acting according to that measure.

The system according to claim 5, further comprising a position parameter range calculation unit that calculates, based on the transition probabilities stored in the probability storage unit and the parameters stored in the parameter storage unit, the maximum value and the minimum value of the position parameter of the probability distribution of the cumulative reward obtained from the plurality of agents, wherein
the position parameter acquisition unit accepts input of a value within the range from the calculated minimum value to the calculated maximum value and does not accept input of a value outside that range.

The system according to claim 5 or 6, wherein the second calculation unit solves the linear programming problem with the further constraint that the value obtained by weighting the cost required for each action by the action probability of that action and summing over the actions is equal to or less than a predetermined reference budget.

The system according to claim 1, further comprising a position parameter acquisition unit that acquires the position parameter of the probability distribution of the total cumulative reward obtained from the plurality of agents, each of which can have a different state as its initial state, wherein
the measure acquisition unit generates one measure that makes the position parameter of the probability distribution of the total cumulative reward obtained from the plurality of agents match the acquired position parameter, and acquires it as an initial measure,
the first calculation unit calculates the position parameter and the scale parameter of the probability distribution of the cumulative reward obtained from the plurality of agents as a result of acting according to the initial measure, and
the system further comprises:
a second calculation unit that calculates a measure defining each action probability by solving a linear programming problem that, with the action probability of each action that can be taken on an agent in each of the plurality of states as a variable, and under the constraint that the sum, weighted by the number of agents having each state as an initial state, of the position parameters of the probability distribution of the cumulative reward obtained as a result of acting according to the action probabilities with each state as the initial state matches the position parameter calculated by the first calculation unit, minimizes the value of an objective function that obtains, based on the value of the scale parameter on the assumption that the scale parameter of the probability distribution of the cumulative reward from the next period onward matches the scale parameter calculated by the first calculation unit, the scale parameter of the probability distribution of the cumulative reward from the current period onward obtained as a result of acting according to the action probabilities with each state as the initial state, and sums these scale parameters weighted by the number of agents having that state as the initial state; and
a convergence determination unit that, on the condition that the scale parameter calculated by the first calculation unit and the scale parameter calculated by the second calculation unit have converged to within a predetermined range, outputs the measure calculated by the second calculation unit, and that, on the condition that they have not converged, gives the measure calculated by the second calculation unit to the first calculation unit in place of the initial measure and causes the first calculation unit to calculate the position parameter and the scale parameter of the probability distribution of the cumulative reward obtained as a result of acting according to that measure.

The system according to any one of claims 5 to 8, further comprising a display control unit that, each time a position parameter is acquired, plots a point, on a plane defined by a coordinate axis indicating the position parameter and a coordinate axis indicating a risk index value, at the coordinates given by the acquired position parameter and the risk index value based on the converged scale parameter, and that draws and displays a curve by interpolating between the plotted points, wherein
when a coordinate value on the displayed curve is designated by a user, the output unit outputs the pair of position parameter and risk index represented by that coordinate value in association with the measure calculated by the second calculation unit for obtaining a cumulative reward whose probability distribution has that position parameter and risk index.

A method of calculating, by a system, a probability distribution of the cumulative reward obtained as a result of sequentially taking a plurality of actions on a plurality of agents, wherein
the system
stores, in a probability storage unit, for each of a plurality of states that an agent can take, the transition probabilities of transitioning to each state when each action is taken on an agent in that state, and
stores, in a parameter storage unit, for each of the plurality of states, a parameter of the probability distribution of the reward obtained when the plurality of agents in that state transition to each state as a result of each action being taken, and
the method comprises:
a step in which a measure acquisition unit of the system acquires a measure that defines, in association with each of the plurality of states, the action probability of taking each action on an agent in that state;
a step in which a first calculation unit of the system generates a recurrence formula in which the parameter of the probability distribution of the cumulative reward obtained from the plurality of agents from the current period onward is calculated by weighting, by the action probability of the current-period action and the transition probability to the next-period state, a value based on the parameter of the probability distribution of the reward obtained by the current-period action and on the parameter of the probability distribution of the cumulative reward obtained from the next-period state onward, and summing over each action and each next-period state, and calculates the parameter by solving the equation obtained on the condition that, in the recurrence formula, the parameter of the probability distribution of the cumulative reward converges to the same value from the current period onward and from the next period onward when the initial state is the same; and
a step in which an output unit of the system outputs the calculated parameter as information indicating the probability distribution of the cumulative reward.

A program for causing an information processing apparatus to function as a system for calculating a probability distribution of the cumulative reward obtained as a result of sequentially taking a plurality of actions on a plurality of agents, the program causing the information processing apparatus to function as:
a probability storage unit that stores, for each of a plurality of states that an agent can take, the transition probabilities of transitioning to each state when each action is taken on an agent in that state;
a parameter storage unit that stores, for each of the plurality of states, a parameter of the probability distribution of the reward obtained when the plurality of agents in that state transition to each state as a result of each action being taken;
a measure acquisition unit that acquires a measure that defines, in association with each of the plurality of states, the action probability of taking each action on an agent in that state;
a first calculation unit that generates a recurrence formula in which the parameter of the probability distribution of the cumulative reward obtained from the plurality of agents from the current period onward is calculated by weighting, by the action probability of the current-period action and the transition probability to the next-period state, a value based on the parameter of the probability distribution of the reward obtained by the current-period action and on the parameter of the probability distribution of the cumulative reward obtained from the next-period state onward, and summing over each action and each next-period state, and that calculates the parameter by solving the equation obtained on the condition that, in the recurrence formula, the parameter of the probability distribution of the cumulative reward converges to the same value from the current period onward and from the next period onward when the initial state is the same; and
an output unit that outputs the calculated parameter as information indicating the probability distribution of the cumulative reward.