WO2020234913A1 - Decision-making device, decision-making method, and storage medium - Google Patents

Decision-making device, decision-making method, and storage medium Download PDF

Info

Publication number
WO2020234913A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
state
policy
decision
action
Prior art date
Application number
PCT/JP2019/019636
Other languages
French (fr)
Japanese (ja)
Inventor
慧 竹村
悠輝 中口
Original Assignee
日本電気株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電気株式会社 (NEC Corporation)
Priority to PCT/JP2019/019636 priority Critical patent/WO2020234913A1/en
Priority to JP2021520487A priority patent/JP7279782B2/en
Publication of WO2020234913A1 publication Critical patent/WO2020234913A1/en

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Definitions

  • the present invention relates to a method of determining an action for a user based on a user's action history.
  • MDP: Markov Decision Process
  • One of the objects of the present invention is to provide a decision-making method capable of avoiding falling into a specific unfavorable state in the process of determining a policy for a user.
  • The decision-making device comprises: an estimation unit that estimates parameters of a model defining user state transitions based on the action histories of multiple users; a randomization unit that gives the estimated parameters randomness for each user; a policy calculation unit that uses the randomized parameters to calculate, for each user, a policy, which is a function that determines an action for the user; and a policy management unit that excludes, from the policies applied to the users, any policy that has caused a user's state to transition to a predetermined irreparable state.
  • The decision-making method estimates parameters of a model defining user state transitions based on the action histories of multiple users, gives the estimated parameters randomness for each user, uses the randomized parameters to calculate, for each user, a policy, which is a function that determines an action for the user, and excludes, from the policies applied to the users, any calculated policy that has caused a user's state to transition to a predetermined irreparable state.
  • The recording medium records a program that causes a computer to execute a process of: estimating parameters of a model defining user state transitions based on the action histories of multiple users; giving the estimated parameters randomness for each user; calculating, for each user, a policy, which is a function that determines an action for the user, using the randomized parameters; and excluding, from the policies applied to the users, any calculated policy that has caused a user's state to transition to a predetermined irreparable state.
  • the hardware configuration of the decision-making apparatus according to the first embodiment is shown.
  • The functional configuration of the decision-making apparatus according to the first embodiment is shown, together with a flowchart of the decision-making process according to the first embodiment.
  • a first example of a system to which a decision-making device is applied is shown.
  • A second example of a system to which a decision-making device is applied is shown, together with a figure explaining the effect obtained by the decision-making process.
  • the functional configuration of the decision-making apparatus according to the second embodiment is shown.
  • MDP uses "state”, “action”, “reward” and “transition probability” as parameters.
  • Each “state” and “action” is a finite set or an element thereof.
  • “Reward” is a function that outputs a real value when a state and an action are input.
  • the “transition probability” is a function that outputs the probability of transition to each state when the current state and the action are input.
  • The decision-making process using MDP basically repeats (a) a step of selecting and executing an action based on the current state, and (b) a step of obtaining a reward and the next state as a result of the executed action.
  • the next state is determined based on the current state, the action, and the transition probability.
  • the reward and the transition probability are unknown, and the decision-making process repeats the above steps (a) and (b) while estimating the reward and the transition probability.
  • the decision-making process calculates the optimum policy based on the estimated reward and the transition probability.
  • the policy is a function that maps a state to an action and outputs the action when the state is input.
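  • As a minimal sketch of the repeated steps (a) and (b), assuming a placeholder `env.step` interface rather than any interface defined by the patent:

```python
def run_steps(policy, env, state, num_steps):
    """Repeat steps (a) and (b): select an action from the policy, then observe
    the reward and the next state.  `env.step` stands in for however the system
    actually executes an action for a user and observes the outcome."""
    history = []
    for _ in range(num_steps):
        action = policy[state]                         # (a) the policy maps the state to an action
        reward, next_state = env.step(state, action)   # (b) the reward and next state are obtained
        history.append((state, action, reward, next_state))
        state = next_state
    return history
```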
  • In that process, the user's state may fall into a specific unfavorable state, more specifically an "irreparable state".
  • The "irreparable state" means a state in which the reward is zero or unacceptably small, and from which no transition to another state is possible. Whether the reward is unacceptably small is determined in practice by whether the reward value is smaller than a predetermined threshold value. Once a user falls into such an "irreparable state", the user cannot return to a normal state in which rewards can be obtained.
  • In a system that continuously provides a service to users, the decision-making process determines an appropriate policy for enhancing the users' long-term satisfaction and executes actions on the users, thereby obtaining payments, contract renewals, and the like from the users.
  • An example of the "irreparable state" is the state in which the user has canceled the contract, or more precisely, the state in which the user has canceled and no subsequent action can win the contract back. That is, the decision-making process must avoid driving the user to cancellation while it tries various actions to increase the user's satisfaction. If there is an "action that always leads to cancellation", that is, if it turns out that executing a certain action in a certain state always makes the user cancel, the decision-making process must avoid executing that action in that state.
  • Therefore, policies are determined and actions are executed for each user so that users avoid falling into an "irreparable state" as much as possible.
  • the decision-making process of the present embodiment is executed by combining the following three approaches.
  • (Approach A) Handle a plurality of users at the same time. If there is only one user and that user falls into an "irreparable state" while various actions are being tried, there is no chance of recovery. Handling a plurality of users at the same time creates opportunities for recovery as the number of users increases. However, if the user "state" has only two values, "not canceled" and "canceled", and there is an "action that always leads to cancellation", every user will eventually try that action and cancel the contract. Therefore, Approach A alone is not sufficient.
  • the "policy” is a function that outputs an "action” when a "state” is input.
  • The decision-making process calculates the policy using the reward and transition probability estimated from past data. By giving randomness to at least one of the reward and the transition probability, the calculated policies themselves acquire randomness, so that different actions are performed for different users who are in the same state. This makes it far less likely that an "action that always leads to cancellation" is executed for all users at the same time. As a result, it is possible to prevent all users from falling into an "irreparable state" at the same time.
  • the decision-making process repeats (a) a step of selecting and executing an action based on the current state, and (b) a step of obtaining a reward and the next state as a result of the executed action.
  • the user's action history is accumulated, and the reward and transition probability are estimated using the action history. Therefore, when the accumulated behavior history is small and the estimation confidence level is low, the decision-making process strengthens the randomness given to the reward and the transition probability.
  • Conversely, when the accumulated action history is large and the estimation confidence is high, the decision-making process weakens the randomness given to the reward and the transition probability.
  • In the early stage of the decision-making process, little action history has accumulated, so the randomness given to the reward and the transition probability is strong and various actions are tried.
  • As the accumulated action history grows and the estimation becomes more reliable, the randomness given to the reward and the transition probability becomes weaker, and actions closer to the optimum are selected and executed more often.
  • The decision-making device basically repeats (a) a step of selecting and executing an action based on the current state, and (b) a step of obtaining a reward and the next state as a result of the executed action.
  • the decision-making device updates the policy for each user at a predetermined timing based on the action history of the plurality of users obtained during that period.
  • When a user falls into an "irreparable state", the decision-making device does not use the policy that was applied at that time thereafter.
  • Specifically, the decision-making device stores the state immediately before the user fell into the "irreparable state" and the action executed at that time, and thereafter does not execute that action in that state. As a result, the decision-making device can take appropriate actions for each user while avoiding, as much as possible, users falling into an "irreparable state".
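  • A minimal sketch of this bookkeeping is shown below, with illustrative names: the pair of the state and the action that immediately preceded an irreparable state is remembered and never allowed again.

```python
class NGPolicyMemory:
    """Remembers the (state, action) pair used just before a user fell into an
    irreparable state so that it is never executed again; names are illustrative."""

    def __init__(self, irreparable_states):
        self.irreparable_states = set(irreparable_states)
        self.forbidden = set()   # (state, action) pairs that led to an irreparable state

    def record_transition(self, state, action, next_state):
        if next_state in self.irreparable_states:
            self.forbidden.add((state, action))

    def is_allowed(self, state, action):
        return (state, action) not in self.forbidden
```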
  • the decision-making apparatus according to the first embodiment will be described in detail.
  • FIG. 1 is a block diagram showing a hardware configuration of the decision-making device according to the first embodiment.
  • the decision-making device 1 includes a communication unit 11, a processor 12, a memory 13, a recording medium 14, and an action history database (DB) 15.
  • the communication unit 11 communicates with an external device. Specifically, the communication unit 11 transmits the action determined by the decision-making device 1 to the terminal device of each user, or receives the state of each user from the terminal device of each user.
  • the processor 12 is a computer such as a CPU (Central Processing Unit), and controls the entire decision-making device 1 by executing a program prepared in advance.
  • the memory 13 is composed of a ROM (Read Only Memory), a RAM (Random Access Memory), and the like.
  • the memory 13 stores various programs executed by the processor 12.
  • the memory 13 is also used as a working memory during execution of various processes by the processor 12.
  • the recording medium 14 is a non-volatile, non-temporary recording medium such as a disk-shaped recording medium or a semiconductor memory, and is configured to be removable from the decision-making device 1.
  • the recording medium 14 records various programs executed by the processor 12. When the decision-making apparatus 1 executes the decision-making process, the program recorded in the recording medium 14 is loaded into the memory 13 and executed by the processor 12.
  • the action history DB 15 stores the action history of a plurality of users.
  • the action history includes the user's "state", "action” and "reward”.
  • The action history is obtained while the decision-making device 1 repeats (a) a step of selecting and executing an action based on the current state, and (b) a step of obtaining a reward and the next state as a result of the executed action, and is accumulated in the action history DB 15.
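  • One possible in-memory layout for such an action history record is sketched below; the field names are assumptions, not the patent's schema.

```python
from dataclasses import dataclass

@dataclass
class ActionHistoryEntry:
    """One row of the action history DB 15; field names are illustrative assumptions."""
    user_id: str
    state: str        # state observed before the action
    action: str       # action selected and executed for the user
    reward: float     # reward obtained as a result of the action
    next_state: str   # state observed after the action

# The DB itself can be viewed as an append-only collection shared by all users.
action_history_db: list[ActionHistoryEntry] = []
```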
  • FIG. 2 is a block diagram showing a functional configuration of the decision-making device 1.
  • the decision-making device 1 functionally includes a data storage unit 21, a policy management unit 22, an MDP estimation unit 23, a randomization unit 24, a policy calculation unit 25, and a policy execution unit 26.
  • the data storage unit 21 is realized by the action history DB 15.
  • the policy management unit 22, the MDP estimation unit 23, the randomization unit 24, the policy calculation unit 25, and the policy execution unit 26 are realized by the processor 12 executing a program prepared in advance.
  • the data storage unit 21 receives the status of each user from the external device, and acquires the action executed for each user from the policy management unit 22 and the reward obtained by the action. Then, the data storage unit 21 stores them as the action history of each user. The data storage unit 21 accumulates the entire action history of a plurality of users targeted by the decision-making device 1.
  • the policy management unit 22 determines the current policy for each user based on the total action history of each user stored in the data storage unit 21. Specifically, the policy management unit 22 first acquires the entire action history of each user stored in the data storage unit 21 and supplies it to the MDP estimation unit 23.
  • the MDP estimation unit 23 estimates MDP parameters (hereinafter referred to as "MDP parameters"), specifically, “reward” and "transition probability” using the entire action history of each user. Then, the MDP estimation unit 23 supplies the estimated MDP parameter (hereinafter, referred to as “estimated MDP parameter”) and the number of samples of the user's action history used for estimating the MDP parameter to the randomization unit 24. That is, the MDP estimation unit 23 supplies the estimated value of the “reward” and the estimated value of the “transition probability” to the randomization unit 24.
  • the randomization unit 24 randomizes the estimated MDP parameters. Specifically, the randomization unit 24 imparts randomness by adding noise to the estimated MDP parameters.
  • the randomizing unit 24 may randomize at least one of the estimated MDP parameters. That is, the randomization unit 24 may randomize at least one of the estimated value of the reward supplied from the MDP estimation unit 23 and the estimated value of the transition probability.
  • the randomization unit 24 independently adds noise to the estimated MDP parameters for each user.
  • As a result, the randomized estimated MDP parameters (hereinafter referred to as "randomized MDP parameters") are basically different for each user. The randomization unit 24 then supplies the randomized MDP parameters to the policy calculation unit 25.
  • The randomization unit 24 preferably changes the strength of the randomness given to the estimated MDP parameters according to the number of samples of the user action history used to estimate them: when the number of samples is small, the randomness is made stronger, and when the number of samples is large, it is made weaker.
  • the policy calculation unit 25 calculates a policy for each user using the randomized MDP parameters supplied from the randomization unit 24. For example, the policy calculation unit 25 calculates the optimum policy by the value iterative method using the reward and the transition probability.
  • the policy calculated by the policy calculation unit 25 for each user has randomness.
  • Since the randomization unit 24 randomizes the parameters independently for each user, the policies calculated by the policy calculation unit 25 differ from user to user. As a result, the same policy is not applied to all users at the same time, and it is possible to avoid all users falling into an "irreparable state" at the same time.
  • the policy calculation unit 25 supplies the calculated policy for each user to the policy management unit 22.
  • the policy management unit 22 supplies the policy for each user calculated by the policy calculation unit 25 to the policy execution unit 26. However, the policy management unit 22 excludes the policy that has caused the user to be in an "irreparable state" even once in the past, and does not supply the policy to the policy execution unit 26.
  • the policy management unit 22 stores in advance a specific state corresponding to the "irreparable state".
  • the policy management unit 22 may store a plurality of states as “irreparable states”.
  • the policy management unit 22 refers to the user's action history stored in the data storage unit 21, and stores the policy that has made the user transition to the “irreparable state” in the past as an “NG policy”.
  • If the policies for each user supplied from the policy calculation unit 25 include the NG policy, the policy management unit 22 does not apply the NG policy.
  • Instead, the policy management unit 22 applies policies other than the NG policy to each user. That is, among the policies for each user supplied from the policy calculation unit 25, the policy management unit 22 supplies only those other than the NG policy to the policy execution unit 26. As a result, the NG policy is not applied to the user, and it is possible to prevent the user from falling into an "irreparable state".
  • The policy execution unit 26 receives the policy for each user from the policy management unit 22 and acquires the current state of each user from the data storage unit 21. The policy execution unit 26 then determines and executes an action for each user based on that user's policy and current state. As described above, since a "policy" is a function that outputs an "action" when a "state" is input, the policy execution unit 26 determines and executes the action for each user by inputting the user's current state into that user's policy.
  • Since the randomization unit 24 randomizes the estimated MDP parameters independently for each user and the policy calculation unit 25 calculates each user's policy from the randomized MDP parameters, the same policy is not applied to all users at the same time. This prevents the NG policy from being applied to all users at the same time, and thus prevents all users from falling into an "irreparable state" at the same time.
  • Further, since the policy management unit 22 does not reapply an NG policy that has put a user in an "irreparable state" in the past, one or a very small number of users may be sacrificed, but a large number of users can thereafter be prevented from falling into an "irreparable state".
  • the policy management unit 22 acquires the user's action history stored in the data storage unit 21 and supplies it to the MDP estimation unit 23.
  • The MDP estimation unit 23 estimates the reward and the transition probability as the MDP parameters using the users' action histories. Specifically, the MDP estimation unit 23 calculates the estimated reward r^(s, a) obtained when an action a is performed in a state s by the following equation (1):
  • r^(s, a) = R(s, a) / N(s, a)   ... (1)
  • Here, R(s, a) is the sum of the rewards obtained when the action a was performed in the state s in the action history of all users.
  • N(s, a) is the number of times the action a was performed in the state s in the action history of all users.
  • Similarly, the MDP estimation unit 23 calculates the estimated transition probability p^(s'|s, a) of transitioning to a state s' when the action a is performed in the state s, as p^(s'|s, a) = P(s, a, s') / N(s, a).
  • P(s, a, s') is the number of times the state transitioned to the state s' as a result of performing the action a in the state s in the action history of all users.
  • The MDP estimation unit 23 supplies the reward estimate r^(s, a) and the transition probability estimate p^(s'|s, a) obtained in this way, together with the corresponding numbers of samples, to the randomization unit 24.
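  • A small sketch of this estimation step is shown below; it computes the count ratios described above over the pooled action history of all users.

```python
from collections import defaultdict

def estimate_mdp(history):
    """Empirical estimates of the MDP parameters from the pooled action history.

    `history` is an iterable of (state, action, reward, next_state) tuples; the
    ratios below correspond to equation (1) for the reward and the analogous
    count ratio for the transition probability."""
    R = defaultdict(float)   # R(s, a): summed reward
    N = defaultdict(int)     # N(s, a): number of times a was performed in s
    P = defaultdict(int)     # P(s, a, s'): number of observed transitions s --a--> s'
    for state, action, reward, next_state in history:
        R[(state, action)] += reward
        N[(state, action)] += 1
        P[(state, action, next_state)] += 1
    r_hat = {sa: R[sa] / N[sa] for sa in N}            # r^(s, a) = R(s, a) / N(s, a)
    p_hat = {sas: P[sas] / N[sas[:2]] for sas in P}    # p^(s'|s, a) = P(s, a, s') / N(s, a)
    return r_hat, p_hat, N, P
```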
  • The randomization unit 24 randomizes at least one of the reward estimate r^(s, a) and the transition probability estimate p^(s'|s, a).
  • For example, the randomization unit 24 adds noise to the reward estimate r^(s, a) to generate a randomized reward estimate r~(s, a).
  • This noise is determined so that the mean of the randomized reward estimate r~(s, a) is r^(s, a).
  • The standard deviation σ of the noise is set according to the number of samples N(s, a) in the action history, so that the randomness is stronger when fewer samples are available.
  • As another example, the randomization unit 24 adds noise to the transition probability estimate p^(s'|s, a) to generate a randomized transition probability estimate p~(s'|s, a).
  • This noise is determined so that the randomized transition probability estimate p~(s'|s, a) follows a distribution specified by a parameter "α".
  • The parameter "α" is an A-dimensional parameter, and the value of its component corresponding to the action "a" is N(s, a, s') + 1.
  • N(s, a, s') is the number of times the state transitioned to s' as a result of performing the action a in the state s in the action history of all users.
  • A is the total number of actions as described above. In this case as well, if the number of samples N(s, a, s') in the action history is small, the randomness of the transition probability estimate is strong, and if the number of samples is large, the randomness is weak. Thus, as described above, the randomness given to the transition probability estimate changes according to the number of samples in the action history. In this example, the reward estimate r^(s, a) is not randomized and is used as it is.
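  • The sketch below illustrates one way the randomization unit 24 could draw per-user randomized parameters. The 1/sqrt(N) noise scale for the reward and the use of a Dirichlet distribution over next states for the transition probabilities are assumptions consistent with the description above, not the patent's exact formulas.

```python
import numpy as np

rng = np.random.default_rng()

def randomize_for_user(r_hat, N, P, states, actions):
    """Per-user randomized MDP parameters (r~, p~).

    The text above says the randomized reward has mean r^(s, a) and a standard
    deviation that is larger when N(s, a) is small; the 1/sqrt(N) form used here
    is an assumed choice.  The transition probabilities are drawn from a Dirichlet
    distribution whose parameters are N(s, a, s') + 1, matching the description
    of the parameter alpha."""
    r_tilde, p_tilde = {}, {}
    for (s, a), r in r_hat.items():
        sigma = 1.0 / np.sqrt(N[(s, a)])     # assumed noise scale: fewer samples -> more noise
        r_tilde[(s, a)] = r + rng.normal(0.0, sigma)
    for s in states:
        for a in actions:
            alpha = np.array([P.get((s, a, s2), 0) + 1.0 for s2 in states])
            sample = rng.dirichlet(alpha)     # randomized p~(. | s, a)
            for s2, prob in zip(states, sample):
                p_tilde[(s, a, s2)] = prob
    return r_tilde, p_tilde
```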
  • The policy calculation unit 25 then calculates the optimal policy from the randomized MDP parameters by the value iteration method.
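  • A sketch of this step is shown below; the discounted formulation and the values of gamma and the tolerance are illustrative assumptions, since the text only names the value iteration method.

```python
def value_iteration(r, p, states, actions, gamma=0.95, tol=1e-6):
    """Optimal policy for the (randomized) MDP parameters by value iteration."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(
                r.get((s, a), 0.0)
                + gamma * sum(p.get((s, a, s2), 0.0) * V[s2] for s2 in states)
                for a in actions
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # The policy maps each state to the action with the highest Q-value.
    return {
        s: max(actions, key=lambda a: r.get((s, a), 0.0)
               + gamma * sum(p.get((s, a, s2), 0.0) * V[s2] for s2 in states))
        for s in states
    }
```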
  • FIG. 3 is a flowchart of the decision-making process according to the first embodiment. This process is realized by the processor 12 shown in FIG. 1 executing a program prepared in advance and operating as each element shown in FIG.
  • First, the decision-making device 1 calculates a policy for each user (step S11). Since essentially no action history exists in the data storage unit 21 at the start of the decision-making process, the policy for each user is calculated from randomized MDP parameters to which the randomization unit 24 has given strong randomness.
  • the decision-making device 1 applies the calculated policy for each user once (step S12). Specifically, the policy execution unit 26 determines and executes an action for each user based on the policy for each user and the current state of each user. By executing the action, each user transitions to the next state.
  • Next, the decision-making device 1 detects the state of each user after the action is executed and determines whether any user has fallen into an "irreparable state" (step S13). As described above, since the decision-making device 1 stores in advance the specific states corresponding to the "irreparable state", it determines whether the state of any user after the action matches one of those specific states. When a user has fallen into an "irreparable state" (step S13: Yes), the process returns to step S11 and the decision-making device 1 updates the policy for each user.
  • At this time, the policy management unit 22 stores the policy that caused the transition to the "irreparable state" as an NG policy and thereafter prohibits applying the NG policy to any user. That is, the policies applied in step S12 from then on are policies other than the NG policy.
  • In step S14, it is determined whether the users' action history has sufficiently increased. Specifically, this determination is made by the policy management unit 22 observing the action history of each user. If the action history has not sufficiently increased (step S14: No), the process returns to step S12 and steps S12 to S14 are executed again. That is, the decision-making device 1 determines and executes an action using the same policy and each user's new "state", and the action history grows as actions are executed. In this way, steps S12 to S14 are repeated until the users' action history has sufficiently increased.
  • When the action history has sufficiently increased (step S14: Yes), the process returns to step S11 and the policy for each user is updated using the accumulated action history. That is, step S14 determines whether the time to update the policies has come.
  • Whether or not the users' action history has sufficiently increased can be determined, for example, by the following method. As an example, while any user k satisfies the following equation (5), the decision-making device 1 determines that the action history has not sufficiently increased (step S14: No).
  • ⁇ k is the policy of the user k
  • ⁇ k (s) is the action in the state s of the user k
  • v (s, ⁇ k (s)) is the sum of all users who have performed the action ⁇ k (s) in the state s since the last update of the policy.
  • N ⁇ (s, ⁇ k (s)) is the total number of times the action ⁇ k (s) was performed in the state s before the last policy update.
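  • Since equation (5) itself is not reproduced here, the following sketch assumes a condition of the form ν(s, π_k(s)) < max(1, N~(s, π_k(s))), modeled on the episode-stopping rule of Jaksch et al., which uses the same quantities; this is an interpretation, not the patent's stated criterion.

```python
def history_sufficiently_increased(policies, current_states, nu, N_tilde):
    """Step S14 check under the assumed form of equation (5)."""
    for user, policy in policies.items():
        s = current_states[user]
        a = policy[s]                     # pi_k(s)
        if nu.get((s, a), 0) < max(1, N_tilde.get((s, a), 0)):
            return False                  # some user k still satisfies eq. (5): not sufficient yet
    return True                           # step S14: Yes -> return to step S11 and update the policies
```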
  • Alternatively, the decision-making device 1 may determine in step S14 whether the same policy has been applied a predetermined number of times (for example, X times). In this case, the determination in step S14 is "No" until the same policy has been applied X times, and steps S12 to S14 are repeated. When the same policy has been applied X times, the determination in step S14 becomes "Yes", the process returns to step S11, and the policies are updated.
  • Since the policy for each user is calculated and updated from randomized MDP parameters in step S11, the same policy is not applied to all users at the same time. This prevents all users from executing an NG policy at the same time. Further, when it is detected in step S13 that a user has fallen into an "irreparable state", the policy of each user is updated and the NG policy that caused it is prohibited from being applied thereafter, so other users are prevented from subsequently falling into an "irreparable state" through the same policy. Further, when it is determined in step S14 that the users' action history has sufficiently increased, the decision-making device 1 updates the policy for each user using that action history. Therefore, as the users' action history grows, the estimation of the MDP parameters in the MDP estimation unit 23 improves, and more appropriate actions are executed for each user.
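  • Putting the pieces together, an illustrative version of the S11-S14 loop might look as follows; `compute_policies` and `update_time_has_come` are hypothetical helpers standing in for the step S11 pipeline (estimate, randomize per user, value iteration, NG filtering) and the step S14 check.

```python
def decision_making_loop(users, env, states, actions, irreparable_states, num_rounds):
    """Illustrative orchestration of steps S11-S14, reusing the NGPolicyMemory sketch above."""
    history = []
    ng_memory = NGPolicyMemory(irreparable_states)
    current_states = {u: env.initial_state(u) for u in users}
    policies = compute_policies(history, users, states, actions, ng_memory)      # S11
    for _ in range(num_rounds):
        someone_fell = False
        for u in users:                                                          # S12: apply once per user
            s = current_states[u]
            a = policies[u][s]
            reward, s_next = env.step(u, s, a)
            history.append((s, a, reward, s_next))
            ng_memory.record_transition(s, a, s_next)
            current_states[u] = s_next
            someone_fell = someone_fell or s_next in irreparable_states
        if someone_fell or update_time_has_come(history):                        # S13 / S14
            policies = compute_policies(history, users, states, actions, ng_memory)  # back to S11
    return policies
```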
  • FIG. 4 shows a first example of a system to which the decision-making apparatus of the embodiment is applied.
  • This system provides a certain service to a user, and includes a server 50 and a plurality of user terminals 60.
  • the server 50 operates as the decision-making device 1 of the embodiment by executing a program prepared in advance.
  • the user terminal 60 is a terminal prepared for each user, and is, for example, a PC (Personal Computer), a mobile terminal, a tablet, a smartphone, or the like.
  • the server 50 notifies the user terminal 60 of the action determined by the policy execution unit 26 of the decision-making device 1.
  • the user terminal 60 executes the received action. For example, the user terminal 60 proposes a new service or plan to the user as an action.
  • The user terminal 60 then transmits to the server 50 a state indicating, for example, that the user has made a contract.
  • the transmitted state is stored in the data storage unit 21 and stored as a part of the user's action history. In this way, it is possible to take appropriate measures for the user by using the decision-making device 1, improve the satisfaction of the user, and improve the sales and profits on the service providing side.
  • FIG. 5 shows a second example of a system to which the decision-making device of the embodiment is applied.
  • the system provides a certain service to the user, and includes a server 70 and a plurality of user terminals 80.
  • the decision-making device 1x configured by the server 70 does not have a policy execution unit.
  • the user terminal 80 includes an AI (Artificial Intelligence) agent 81, and the AI agent 81 operates as a policy execution unit 82.
  • the AI agent 81 is actually realized by the computer constituting the user terminal 80 executing a program prepared in advance.
  • the server 70 transmits the policy determined by the policy management unit 22 of the decision-making device 1x to the user terminal 80.
  • the user terminal 80 receives the policy.
  • the policy is a function that outputs an action when a state is input
  • the policy execution unit 82 determines and executes the action based on the received policy and the current state of the user. For example, the user terminal 80 proposes a new service or plan to the user as an action.
  • The user terminal 80 then transmits to the server 70 a state indicating, for example, that the user has made a contract.
  • the transmitted state is stored in the data storage unit 21 and stored as a part of the user's action history. In this way, it is possible to take appropriate measures for the user by using the decision-making device 1x, improve the satisfaction of the user, and improve the sales and profits on the service providing side.
  • FIG. 6A shows a state transition diagram of the environment to which the decision-making process is applied.
  • In this environment, the user can take three actions, "x", "y", and "z", and the state transition probabilities are as shown in FIG. 6(A). That is, when the action x or y is performed in the state p, the user remains in the state p, and when the action z is performed in the state p, the user transitions to the state q.
  • In the state q, the state remains q regardless of which of the actions x, y, and z the user performs.
  • each user has an AI agent, and the AI agent determines the action of each user.
  • FIG. 6B shows the reward obtained by each action.
  • In the state p, the reward "1" is obtained when the action x is performed, the reward "2" when the action y is performed, and the reward "3" when the action z is performed.
  • In the state q, the reward is "0" regardless of which action the user performs. Therefore, the state q corresponds to the "irreparable state", since the reward is "0" and the state cannot transition to any other state.
  • the actions taken by each user are distributed by randomizing the MDP parameters, so that not all of them take the action z at the same time in the state p. Further, since the information is shared among a plurality of users, when one user falls into the state q, the information is reflected in all the other policy decisions. Therefore, it is possible to prevent other users from falling into the state q in the same manner thereafter. Further, since the AI agent of each user advances learning based on the user's action history as described above, the policy adopted by the AI agent is optimized, and only the optimum action is selected.
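  • Written out as data, the environment of FIG. 6 makes the trap explicit: action z is the most rewarding choice in state p but leads to the absorbing state q.

```python
# The environment of FIG. 6: action z looks best in state p (reward 3) but moves
# the user to the absorbing state q, where every action yields reward 0.
TRANSITIONS = {
    ("p", "x"): "p", ("p", "y"): "p", ("p", "z"): "q",
    ("q", "x"): "q", ("q", "y"): "q", ("q", "z"): "q",
}
REWARDS = {
    ("p", "x"): 1, ("p", "y"): 2, ("p", "z"): 3,
    ("q", "x"): 0, ("q", "y"): 0, ("q", "z"): 0,
}

def step(state, action):
    return REWARDS[(state, action)], TRANSITIONS[(state, action)]

# A purely greedy choice in state p would be z; once one user's AI agent tries it
# and reaches q, recording (p, z) as forbidden lets every other user avoid it.
```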
  • FIG. 7 is a block diagram showing the functional configuration of the decision-making device 90 according to the second embodiment.
  • the decision-making device 90 includes an estimation unit 91, a randomization unit 92, a policy calculation unit 93, and a policy management unit 94.
  • the estimation unit 91 estimates the parameters of the model that defines the transition of the user's state based on the behavior history of a plurality of users.
  • An example of this model is the MDP.
  • the randomization unit 92 gives randomness to the parameters estimated by the estimation unit 91 for each user. By giving randomness, the parameters of each user are not all the same.
  • The policy calculation unit 93 calculates a policy for each user using the randomized parameters.
  • the policy is a function that determines the action for each user.
  • Among the policies calculated by the policy calculation unit 93, the policy management unit 94 excludes, from the policies applied to the users, any policy that has caused a user's state to transition to a predetermined irreparable state.
  • the decision-making device 90 can take an appropriate action for each user while avoiding the user from falling into an irreparable state as much as possible.
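  • The composition of the estimation unit 91, the randomization unit 92, the policy calculation unit 93, and the policy management unit 94 can be sketched as follows; the class and method names are hypothetical, not taken from the patent.

```python
class DecisionMakingDevice:
    """Illustrative composition of the four units of the second embodiment."""

    def __init__(self, estimator, randomizer, policy_calculator, policy_manager):
        self.estimator = estimator                  # estimation unit 91: model parameters from history
        self.randomizer = randomizer                # randomization unit 92: per-user randomness
        self.policy_calculator = policy_calculator  # policy calculation unit 93: one policy per user
        self.policy_manager = policy_manager        # policy management unit 94: excludes NG policies

    def update_policies(self, history, users):
        model = self.estimator.estimate(history)
        policies = {}
        for user in users:
            randomized = self.randomizer.randomize(model)         # independent randomization per user
            policies[user] = self.policy_calculator.calculate(randomized)
        return self.policy_manager.filter(policies, history)      # drop NG policies before application
```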
  • (Appendix 1) A decision-making device comprising: an estimation unit that estimates parameters of a model defining user state transitions based on the action histories of a plurality of users; a randomization unit that gives the estimated parameters randomness for each user; a policy calculation unit that uses the randomized parameters to calculate, for each user, a policy, which is a function that determines an action for the user; and a policy management unit that excludes, from the policies applied to the users, any calculated policy that has caused a user's state to transition to a predetermined irreparable state.
  • the user's action history includes the user's state and actions for the user.
  • The decision-making device according to Appendix 1, wherein the policy is a function that outputs an action for the user when the user's state is input.
  • The decision-making device according to any one of Appendices 1 to 4, wherein the irreparable state is a state in which the reward obtained by executing an action is zero or smaller than a predetermined value, and from which execution of an action cannot cause a transition to another state.
  • the parameters of the model include the user's state, actions against the user, rewards, and transition probabilities.
  • The decision-making device according to any one of Appendices 1 to 5, wherein the randomization unit randomizes at least one of the reward and the transition probability.
  • (Appendix 7) The decision-making device according to Appendix 6, wherein the randomization unit strengthens the randomness given to the parameters as the number of samples of the user action history becomes smaller, and weakens it as the number of samples becomes larger.
  • (Appendix 8) The decision-making device according to any one of Appendices 1 to 7, further comprising a policy execution unit that determines and executes an action for each user based on the policy applied to that user.
  • A policy, which is a function that determines the action for each user, is calculated for each user using the randomized parameters.
  • A recording medium recording a program that causes a computer to execute a process of excluding, from the policies applied to the users, any calculated policy that has caused a user's state to transition to a predetermined irreparable state.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

Provided is a decision-making device, wherein an estimation unit estimates, on the basis of behavior histories of a plurality of users, a parameter of a model that defines a user state transition. A randomization unit imparts randomness to each user with respect to the estimated parameter. A policy calculation unit uses the randomized parameter to calculate, for each user, a policy, which is a function for determining an action for each user. A policy management unit excludes, from the policies applied to the users, a policy causing a user state to transition to a predetermined irrevocable state.

Description

Decision-making device, decision-making method, and recording medium
The present invention relates to a method of determining an action for a user based on the user's action history.
A decision-making method using a Markov decision process (MDP) is known. In a typical method using an MDP, a policy that maximizes the long-term cumulative reward is learned, and the selection and execution of actions for the user are repeated. An example of such a method is described in Patent Document 1.
Near-optimal Regret Bounds for Reinforcement Learning, Thomas Jaksch, Ronald Ortner & Peter Auer, Journal of Machine Learning Research 11 (2010) 1563-1600
In the above method, various actions are tried in various states in order to find a policy that maximizes the cumulative reward, but in that process the system may fall into a specific unfavorable state.
One of the objects of the present invention is to provide a decision-making method capable of avoiding falling into a specific unfavorable state in the process of determining a policy for a user.
In order to solve the above problems, in one aspect of the present invention, a decision-making device comprises:
an estimation unit that estimates parameters of a model defining user state transitions based on the action histories of a plurality of users;
a randomization unit that gives the estimated parameters randomness for each user;
a policy calculation unit that uses the randomized parameters to calculate, for each user, a policy, which is a function that determines an action for the user; and
a policy management unit that excludes, from the policies applied to the users, any calculated policy that has caused a user's state to transition to a predetermined irreparable state.
In another aspect of the present invention, a decision-making method:
estimates parameters of a model defining user state transitions based on the action histories of a plurality of users;
gives the estimated parameters randomness for each user;
calculates, for each user, a policy, which is a function that determines an action for the user, using the randomized parameters; and
excludes, from the policies applied to the users, any calculated policy that has caused a user's state to transition to a predetermined irreparable state.
In still another aspect of the present invention, a recording medium records a program that causes a computer to execute a process of:
estimating parameters of a model defining user state transitions based on the action histories of a plurality of users;
giving the estimated parameters randomness for each user;
calculating, for each user, a policy, which is a function that determines an action for the user, using the randomized parameters; and
excluding, from the policies applied to the users, any calculated policy that has caused a user's state to transition to a predetermined irreparable state.
According to the present invention, it is possible to avoid falling into a specific unfavorable state in the process of determining a policy for a user.
FIG. 1 shows the hardware configuration of the decision-making apparatus according to the first embodiment. FIG. 2 shows the functional configuration of the decision-making apparatus according to the first embodiment. FIG. 3 is a flowchart of the decision-making process according to the first embodiment. FIG. 4 shows a first example of a system to which the decision-making device is applied. FIG. 5 shows a second example of a system to which the decision-making device is applied. FIG. 6 is a diagram explaining the effect obtained by the decision-making process. FIG. 7 shows the functional configuration of the decision-making apparatus according to the second embodiment.
Hereinafter, preferred embodiments of the present invention will be described with reference to the drawings. In the following description, for convenience of notation, the symbol "A" with "^" placed above it is written as "A^", and the symbol "A" with "~" placed above it is written as "A~".
[Basic method]
First, the basic decision-making method according to the embodiments of the present invention will be described. The following embodiments are characterized in that, in a system modeled by an MDP, they avoid a specific unfavorable state, specifically an "irreparable state".
An MDP uses "state", "action", "reward", and "transition probability" as parameters. "State" and "action" are each a finite set or an element thereof. "Reward" is a function that outputs a real value when a state and an action are input. "Transition probability" is a function that, given the current state and an action, outputs the probability of transitioning to each state.
The decision-making process using an MDP basically repeats (a) a step of selecting and executing an action based on the current state and (b) a step of obtaining a reward and the next state as a result of the executed action. The next state is determined by the current state, the action, and the transition probability. At the start of the decision-making process, the reward and the transition probability are unknown, and the decision-making process repeats steps (a) and (b) while estimating them. The decision-making process then calculates the optimal policy based on the estimated reward and transition probability. The policy is a function that maps a state to an action; when a state is input, it outputs an action.
Data are required to estimate the reward and the transition probability. Therefore, in the initial stage of the decision-making process, various actions are tried in various states in order to collect data for estimation. In that process, a user's state may fall into a specific unfavorable state, more specifically an "irreparable state". Here, the "irreparable state" means a state in which the reward is zero or unacceptably small and from which no transition to another state is possible. Whether the reward is unacceptably small is determined in practice by whether the reward value is smaller than a predetermined threshold value. Once a user falls into such an "irreparable state", the user can no longer return to a normal state in which rewards can be obtained.
As a specific example, consider a system that continuously provides a service to users, such as a mobile phone service or a membership-based e-commerce site. In this case, the decision-making process determines appropriate policies for enhancing the users' long-term satisfaction and executes actions on the users, thereby obtaining payments, contract renewals, and the like from the users. In this example, the "irreparable state" is the state in which the user has canceled the contract, or more precisely, the state in which the user has canceled and no subsequent action can win the contract back. That is, the decision-making process must avoid driving users to cancellation while it tries various actions to increase their satisfaction. If there is an "action that always leads to cancellation", that is, if it turns out that executing a certain action in a certain state always makes the user cancel, the decision-making process must avoid executing that action in that state.
Therefore, in the present embodiment, policies are determined and actions are executed for each user so that users avoid falling into an "irreparable state" as much as possible. Specifically, the decision-making process of the present embodiment is executed by combining the following three approaches.
(Approach A): Handle a plurality of users at the same time.
If there is only one user and that user falls into an "irreparable state" while various actions are being tried, there is no chance of recovery. Therefore, a plurality of users are handled at the same time; as the number of users increases, opportunities for recovery arise. However, if the user "state" has only two values, "not canceled" and "canceled", and there is an "action that always leads to cancellation", every user will eventually try that action and cancel the contract. Therefore, Approach A alone is not sufficient.
(Approach B): Share the action histories of all users and treat them as a single problem.
When a user falls into an "irreparable state", the state immediately before it and the action executed at that time are stored, and the decision-making process no longer selects that action in that state. As a result, although one user falls into the "irreparable state", other users can be prevented from falling into it afterward. Even in this case, however, the possibility remains that all users fall into an "irreparable state" at the same time: if the decision-making process is configured to execute the same "action" for all users, it may execute an "action that always leads to cancellation" for all users simultaneously. Therefore, Approaches A and B alone are not sufficient.
(Approach C): Spread the policies applied to the users to some extent.
As described above, a "policy" is a function that outputs an "action" when a "state" is input. The decision-making process calculates the policy using the reward and transition probability estimated from past data. By giving randomness to at least one of the reward and the transition probability, the calculated policies themselves acquire randomness, so that different actions are performed for different users who are in the same state. This makes it far less likely that an "action that always leads to cancellation" is executed for all users at the same time. As a result, all users can be prevented from falling into an "irreparable state" at the same time.
At this time, it is preferable to change the degree of randomness given to the reward and the transition probability according to the confidence of their estimation. As described above, the decision-making process accumulates the action histories of a plurality of users by repeating (a) a step of selecting and executing an action based on the current state and (b) a step of obtaining a reward and the next state as a result of the executed action, and estimates the reward and the transition probability from those histories. When the accumulated action history is small and the confidence of the estimation is low, the decision-making process strengthens the randomness given to the reward and the transition probability; when the accumulated action history is large and the confidence is high, it weakens the randomness. In the early stage of the decision-making process, little action history has accumulated, so the randomness given to the reward and the transition probability is strong and various actions are tried. As the accumulated action history grows and the reliability of the estimation increases, the randomness becomes weaker, and actions closer to the optimum are selected and executed more often.
As described above, in the present embodiment, Approaches A to C are executed in combination so that users avoid falling into an "irreparable state" as much as possible.
[First Embodiment]
Next, the first embodiment of the present invention will be described. The decision-making device according to the first embodiment basically repeats (a) a step of selecting and executing an action based on the current state and (b) a step of obtaining a reward and the next state as a result of the executed action. In addition, the decision-making device updates the policy for each user at predetermined timings based on the action histories of the plurality of users obtained in the meantime. Further, when a user falls into an "irreparable state", the decision-making device no longer uses the policy that was applied at that time. Specifically, the decision-making device stores the state immediately before the user fell into the "irreparable state" and the action executed at that time, and thereafter does not execute that action in that state. As a result, the decision-making device can take appropriate actions for each user while avoiding, as much as possible, users falling into an "irreparable state". The decision-making apparatus according to the first embodiment is described in detail below.
(Hardware configuration)
FIG. 1 is a block diagram showing the hardware configuration of the decision-making device according to the first embodiment. As illustrated, the decision-making device 1 includes a communication unit 11, a processor 12, a memory 13, a recording medium 14, and an action history database (DB) 15.
The communication unit 11 communicates with external devices. Specifically, the communication unit 11 transmits actions determined by the decision-making device 1 to each user's terminal device and receives each user's state from that terminal device.
The processor 12 is a computer such as a CPU (Central Processing Unit) and controls the entire decision-making device 1 by executing a program prepared in advance. The memory 13 is composed of a ROM (Read Only Memory), a RAM (Random Access Memory), and the like. The memory 13 stores various programs executed by the processor 12 and is also used as a working memory while the processor 12 executes various processes.
The recording medium 14 is a non-volatile, non-transitory recording medium such as a disk-shaped recording medium or a semiconductor memory, and is configured to be removable from the decision-making device 1. The recording medium 14 records various programs executed by the processor 12. When the decision-making device 1 executes the decision-making process, a program recorded in the recording medium 14 is loaded into the memory 13 and executed by the processor 12.
The action history DB 15 stores the action histories of a plurality of users. An action history includes a user's "state", "action", and "reward". The action history is accumulated in the action history DB 15 while the decision-making device 1 repeats (a) a step of selecting and executing an action based on the current state and (b) a step of obtaining a reward and the next state as a result of the executed action.
(Functional configuration)
FIG. 2 is a block diagram showing the functional configuration of the decision-making device 1. Functionally, the decision-making device 1 includes a data storage unit 21, a policy management unit 22, an MDP estimation unit 23, a randomization unit 24, a policy calculation unit 25, and a policy execution unit 26. The data storage unit 21 is realized by the action history DB 15. The policy management unit 22, the MDP estimation unit 23, the randomization unit 24, the policy calculation unit 25, and the policy execution unit 26 are realized by the processor 12 executing a program prepared in advance.
The data storage unit 21 receives each user's state from an external device and acquires, from the policy management unit 22, the action executed for each user and the reward obtained by that action. The data storage unit 21 then stores these as each user's action history, and thereby accumulates the entire action history of all users targeted by the decision-making device 1.
The policy management unit 22 determines the current policy for each user based on the entire action history of each user stored in the data storage unit 21. Specifically, the policy management unit 22 first acquires the entire action history of each user from the data storage unit 21 and supplies it to the MDP estimation unit 23.
The MDP estimation unit 23 uses the entire action history of each user to estimate the parameters of the MDP (hereinafter, "MDP parameters"), specifically the "reward" and the "transition probability". The MDP estimation unit 23 then supplies the estimated MDP parameters (hereinafter, "estimated MDP parameters") and the number of action-history samples used for the estimation to the randomization unit 24. That is, the MDP estimation unit 23 supplies the estimated value of the reward and the estimated value of the transition probability to the randomization unit 24.
The randomization unit 24 randomizes the estimated MDP parameters. Specifically, the randomization unit 24 imparts randomness by adding noise to the estimated MDP parameters. It suffices for the randomization unit 24 to randomize at least one of the estimated MDP parameters, that is, at least one of the estimated reward and the estimated transition probability supplied from the MDP estimation unit 23. The randomization unit 24 adds noise to the estimated MDP parameters independently for each user, so the randomized estimated MDP parameters (hereinafter, "randomized MDP parameters") generally differ from user to user. The randomization unit 24 then supplies the randomized MDP parameters to the policy calculation unit 25.
As described above, the randomization unit 24 preferably changes the strength of the randomness given to the estimated MDP parameters according to the number of action-history samples used for the estimation: it strengthens the randomness when the number of samples is small and weakens it when the number of samples is large.
The policy calculation unit 25 calculates a policy for each user using the randomized MDP parameters supplied from the randomization unit 24. For example, the policy calculation unit 25 calculates an optimal policy by value iteration using the reward and the transition probability. Because at least one of the reward and the transition probability has been randomized, the policy calculated for each user has randomness, and because the randomization unit 24 randomizes independently for each user, the policies calculated for different users differ from one another. As a result, the same policy is not applied to all users at the same time, which avoids all users simultaneously falling into an "irreversible state". The policy calculation unit 25 supplies the calculated policy for each user to the policy management unit 22.
The policy management unit 22 supplies the policy calculated by the policy calculation unit 25 for each user to the policy execution unit 26. However, the policy management unit 22 excludes any policy that has, even once in the past, driven a user into the "irreversible state", and does not supply such a policy to the policy execution unit 26.
Specifically, the policy management unit 22 stores in advance the specific state corresponding to the "irreversible state"; it may store a plurality of such states. The policy management unit 22 refers to the users' action histories stored in the data storage unit 21 and records, as an "NG policy", any policy that has caused a user to transition to the irreversible state in the past. If a policy supplied from the policy calculation unit 25 for some user is an NG policy, the policy management unit 22 does not apply it; only policies other than NG policies are applied to the users. That is, among the policies supplied from the policy calculation unit 25, the policy management unit 22 supplies only those other than NG policies to the policy execution unit 26. This prevents an NG policy from being applied to a user and that user from falling into the irreversible state.
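The exclusion step can be sketched roughly as follows (a hypothetical illustration, not taken from the publication; it reuses the HistoryEntry records sketched earlier, and a policy is represented as a hashable tuple mapping each state index to an action so that it can be stored in a set):

# States registered in advance as irreversible.
IRREVERSIBLE_STATES = {1}

# Policies that have, at least once, driven some user into an irreversible state.
ng_policies: set = set()

def record_ng_policies(history, applied_policies):
    """Mark as NG every policy under which some user reached an irreversible state."""
    for entry in history:
        if entry.next_state in IRREVERSIBLE_STATES:
            ng_policies.add(applied_policies[entry.user_id])

def filter_policies(per_user_policies):
    """Drop NG policies before they are handed to the policy execution unit."""
    return {user: pi for user, pi in per_user_policies.items() if pi not in ng_policies}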
The policy execution unit 26 receives the policy for each user from the policy management unit 22 and acquires each user's current state from the data storage unit 21. The policy execution unit 26 then determines and executes an action for each user based on that user's policy and current state. As described above, a "policy" is a function that outputs an "action" when a "state" is input, so the policy execution unit 26 determines the action for each user by inputting the user's current state into the user's policy, and executes it.
As described above, in the decision-making device 1 of the present embodiment, the randomization unit 24 randomizes the estimated MDP parameters independently for each user and the policy calculation unit 25 calculates each user's policy from the randomized MDP parameters, so the same policy is never applied to all users at the same time. This prevents an NG policy from being applied to all users simultaneously and all users falling into the "irreversible state" at once. In addition, because the policy management unit 22 does not apply an NG policy that has put a user into the irreversible state in the past, one or a very small number of users may be sacrificed, but the many users that follow are prevented from falling into the irreversible state.
(Specific example of the policy calculation method)
Next, a specific example of the policy calculation method performed by the decision-making device 1 will be described with reference to FIG. 2. The following calculation is executed for each user, and a policy is determined for each user.
The policy management unit 22 acquires the users' action histories stored in the data storage unit 21 and supplies them to the MDP estimation unit 23. The MDP estimation unit 23 uses the action histories to estimate the reward and the transition probability as MDP parameters. Specifically, the MDP estimation unit 23 calculates the estimated reward r^(s, a) for performing an action a in a state s by the following equation (1).

  r^(s, a) = R(s, a) / N(s, a)   ... (1)

Here, R(s, a) is the sum, over the action histories of all users, of the rewards obtained when action a was performed in state s, and N(s, a) is the number of times action a was performed in state s in the action histories of all users.
The MDP estimation unit 23 also calculates the estimated probability p^(s'|s, a) that the state transitions to a state s' as a result of performing action a in state s by the following equation (2).

  p^(s'|s, a) = P(s, a, s') / N(s, a)   ... (2)

Here, P(s, a, s') is the number of times, in the action histories of all users, that the state transitioned to state s' as a result of performing action a in state s.
The MDP estimation unit 23 supplies the estimated reward r^(s, a) and the estimated transition probability p^(s'|s, a) obtained in this way to the randomization unit 24 as the estimated MDP parameters. In addition, the MDP estimation unit 23 supplies the number of action-history samples N(s, a) used for the estimation to the randomization unit 24.
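A minimal sketch of equations (1) and (2) as count-based estimates is given below (hypothetical code, not from the publication; it assumes states and actions are encoded as small integer indices and reuses the HistoryEntry records introduced earlier):

import numpy as np

def estimate_mdp(history, n_states, n_actions):
    """Estimate r^(s, a) and p^(s'|s, a) from the action histories of all users."""
    R = np.zeros((n_states, n_actions))             # total reward per (s, a)
    N = np.zeros((n_states, n_actions))             # visit count per (s, a)
    P = np.zeros((n_states, n_actions, n_states))   # transition count per (s, a, s')
    for e in history:
        R[e.state, e.action] += e.reward
        N[e.state, e.action] += 1
        P[e.state, e.action, e.next_state] += 1
    N_safe = np.maximum(N, 1)                       # avoid division by zero for unvisited pairs
    r_hat = R / N_safe                              # equation (1)
    p_hat = P / N_safe[:, :, None]                  # equation (2)
    return r_hat, p_hat, N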
The randomization unit 24 randomizes at least one of the estimated reward r^(s, a) and the estimated transition probability p^(s'|s, a).
(A) When only the estimated reward is randomized
The randomization unit 24 adds noise to the estimated reward r^(s, a) to generate a randomized estimated reward r~(s, a). The noise is determined so that the randomized estimated reward r~(s, a) becomes a random variable following a normal distribution whose mean is the estimated reward r^(s, a) and whose standard deviation σ is given by the following equation (3).

  [Equation (3): σ as a function of S, A, t, δ and N(s, a)]

In equation (3), "S" and "A" are the total numbers of states and actions, respectively, "t" is the count of iterations, and "δ" is a hyperparameter. The standard deviation σ may instead be given by the following equation (4) so that it does not depend on the iteration count t.

  [Equation (4): σ as a function of β and N(s, a)]

In equation (4), "β" is a hyperparameter.
In equations (3) and (4), when the number of action-history samples N(s, a) is small, the standard deviation σ becomes large and the randomness of the estimated reward becomes strong; when N(s, a) is large, σ becomes small and the randomness becomes weak. Thus, as described above, the randomness given to the estimated reward changes according to the number of action-history samples. In this case, the estimated transition probability p^(s'|s, a) is not randomized and is used as it is.
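A hedged sketch of case (A) follows. The closed form of σ is not reproduced above, so the expression sqrt(β / N(s, a)) used below is an assumption consistent with equation (4) (β is a hyperparameter, and the noise weakens as N(s, a) grows); the draw is made independently for each user:

import numpy as np

def randomize_reward(r_hat, N, beta, rng):
    """Case (A): add per-user Gaussian noise to the estimated reward."""
    sigma = np.sqrt(beta / np.maximum(N, 1))     # assumed form of equation (4)
    return rng.normal(loc=r_hat, scale=sigma)    # mean r^(s, a), standard deviation sigma

# One independent draw per user, for example:
# rng = np.random.default_rng()
# r_tilde_for_user_k = randomize_reward(r_hat, N, beta=1.0, rng=rng)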
(B) When only the transition probability is randomized
The randomization unit 24 adds noise to the estimated transition probability p^(s'|s, a) to generate a randomized estimated transition probability p~(s'|s, a). The noise is determined so that the randomized estimated transition probability p~(s'|s, a) becomes a random variable following a Dirichlet distribution with parameter α; that is, the noise is determined using random numbers generated from a Dirichlet distribution. The parameter "α" is an A-dimensional parameter, and the value of the dimension corresponding to action "a" is N(s, a, s') + 1, where N(s, a, s') is the number of times, in the action histories of all users, that the state transitioned to s' as a result of performing action a in state s, and "A" is the total number of actions as described above. In this case as well, the randomness of the estimated transition probability becomes strong when the number of action-history samples N(s, a, s') is small and weak when it is large, so the randomness given to the estimated transition probability changes according to the number of action-history samples. In this case, the estimated reward r^(s, a) is not randomized and is used as it is.
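The Dirichlet draw of case (B) can be sketched as follows (a hypothetical illustration; the indexing used here, with one concentration value N(s, a, s') + 1 per next state s', is a common construction assumed for this sketch, and one independent draw is made per user):

import numpy as np

def randomize_transition(P_counts, rng):
    """Case (B): draw a randomized next-state distribution p~(.|s, a) from a Dirichlet."""
    n_states, n_actions, _ = P_counts.shape
    p_tilde = np.zeros_like(P_counts, dtype=float)
    for s in range(n_states):
        for a in range(n_actions):
            alpha = P_counts[s, a, :] + 1.0      # assumed concentration: N(s, a, s') + 1
            p_tilde[s, a, :] = rng.dirichlet(alpha)
    return p_tilde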
(C) When both the reward and the transition probability are randomized
In this case, the randomization unit 24 performs both (A) and (B) above to calculate the randomized estimated reward r~(s, a) and the randomized estimated transition probability p~(s'|s, a).
When the randomization unit 24 has randomized at least one of the reward and the transition probability to generate the randomized MDP parameters as described above, the policy calculation unit 25 uses the randomized MDP parameters to calculate an optimal policy by value iteration.
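A compact sketch of this value-iteration step is shown below (hypothetical code; the discount factor gamma and the convergence tolerance are assumptions, since they are not specified here):

import numpy as np

def value_iteration(r, p, gamma=0.95, tol=1e-6):
    """Compute a greedy policy from reward r[s, a] and transitions p[s, a, s']."""
    n_states, n_actions = r.shape
    v = np.zeros(n_states)
    while True:
        q = r + gamma * (p @ v)        # q[s, a] = r[s, a] + gamma * sum over s' of p[s, a, s'] * v[s']
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            break
        v = v_new
    return q.argmax(axis=1)            # policy: one action per state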
(Decision-making process)
Next, the decision-making process according to the present embodiment will be described. FIG. 3 is a flowchart of the decision-making process according to the first embodiment. This process is realized by the processor 12 shown in FIG. 1 executing a program prepared in advance and operating as the elements shown in FIG. 2.
First, the decision-making device 1 calculates a policy for each user (step S11). At the start of the decision-making process, the data storage unit 21 basically contains no action history for any user, so the policy for each user is calculated based on randomized MDP parameters to which the randomization unit 24 has given strong randomness.
Next, the decision-making device 1 applies the calculated policy for each user once (step S12). Specifically, the policy execution unit 26 determines and executes an action for each user based on that user's policy and current state. By the execution of the action, each user transitions to the next state.
Next, the decision-making device 1 detects the state of each user after the action has been executed and determines whether any user has fallen into the "irreversible state" (step S13). As described above, the decision-making device 1 stores in advance the specific state corresponding to the irreversible state, so it determines whether a user's state after the action matches that specific state. If a user has fallen into the irreversible state (step S13: Yes), the process returns to step S11 and the decision-making device 1 updates the policy for each user. At this time, the policy management unit 22 stores the policy that caused the user to transition to the irreversible state as an NG policy and thereafter prohibits applying that NG policy, in the same state, to any user. That is, the policies subsequently applied in step S12 are policies other than that NG policy.
On the other hand, if no user has fallen into the irreversible state (step S13: No), the decision-making device 1 determines whether the users' action histories have increased sufficiently (step S14). Specifically, this determination is made by the policy management unit 22 observing each user's action history. If the action histories have not increased sufficiently (step S14: No), the process returns to step S12 and steps S12 to S14 are executed again. That is, the decision-making device 1 determines and executes actions using the same policies and each user's new "state", and the executed actions are added to the users' action histories. Steps S12 to S14 are repeated in this way until the action histories have increased sufficiently. When the action histories have increased sufficiently (step S14: Yes), the process returns to step S11 and the policy for each user is updated using the enlarged action histories. In other words, step S14 determines whether the time to update the policies has arrived.
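The loop of FIG. 3 might be orchestrated as sketched below (hypothetical code; method names such as calculate_policies, apply_policies_once, and history_grew_enough merely stand for the processing described above and are not taken from the publication):

def decision_making_loop(device):
    """Steps S11 to S14: calculate, apply, watch for irreversible states, and update policies."""
    while True:
        policies = device.calculate_policies()            # step S11: randomized MDP -> per-user policies
        policies = device.exclude_ng_policies(policies)   # NG policies are never handed out
        while True:
            device.apply_policies_once(policies)          # step S12: one action per user
            if device.any_user_in_irreversible_state():   # step S13
                device.register_ng_policies(policies)     # remember the offending policy
                break                                     # back to step S11 to recompute policies
            if device.history_grew_enough():              # step S14: time to update the policies
                break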
Whether the users' action histories have increased sufficiently can be determined, for example, by the following method. As one example, when any user k satisfies the following expression (5), the decision-making device 1 determines that the action histories have not increased sufficiently (step S14: No).

  [Equation (5): an inequality relating v(s, π_k(s)) and N^(s, π_k(s))]

Here, π_k is the policy of user k, π_k(s) is the action taken in state s under the policy of user k, v(s, π_k(s)) is the total number of times, over all users, that the action π_k(s) has been performed in state s since the policy was last updated, and N^(s, π_k(s)) is the total number of times, over all users, that the action π_k(s) was performed in state s before the last policy update.
As another example, in step S14 the decision-making device 1 may determine whether the same policy has been applied a predetermined number of times (for example, X times). In this case, the determination in step S14 remains "No" until the same policy has been applied X times, and steps S12 to S14 are repeated. When the same policy has been applied X times, the determination in step S14 becomes "Yes", the process returns to step S11, and the policy is updated.
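Both criteria for step S14 can be sketched as follows. The inequality of equation (5) is not reproduced above, so the doubling-style reading below (the update is triggered once, for some user k and state s, the visits to (s, π_k(s)) since the last update reach the count available at that update) is only an assumption; the fixed-count alternative is also shown:

def history_grew_enough(v, N_prev, per_user_policies):
    """Assumed reading of equation (5): update once some (s, pi_k(s)) has been visited
    as many times since the last update as before it."""
    for k, pi in per_user_policies.items():
        for s, a in enumerate(pi):            # pi is a tuple: one action per state
            if v[s, a] >= max(1, N_prev[s, a]):
                return True
    return False

def applied_enough_times(applications, X):
    """Alternative criterion: the same policy has been applied X times."""
    return applications >= X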
As described above, in the decision-making process of the present embodiment, the policy for each user is calculated and updated in step S11 based on randomized MDP parameters, so the same policy is never applied to all users at the same time and all users never execute an NG policy simultaneously. When it is detected in step S13 that a user has fallen into the irreversible state, the policy of each user is updated and subsequent application of the NG policy that caused it is prohibited, which prevents other users from later falling into the irreversible state through the same policy. Furthermore, when it is determined in step S14 that the users' action histories have increased sufficiently, the decision-making device 1 updates the policy for each user using those histories. Therefore, as the users' action-history data grows, the estimation of the MDP parameters by the MDP estimation unit 23 is progressively refined and more appropriate actions come to be executed for each user.
(Application examples)
Next, examples in which the decision-making device according to the embodiment is applied to actual systems will be described. FIG. 4 shows a first example of a system to which the decision-making device of the embodiment is applied. This system provides a service to users and includes a server 50 and a plurality of user terminals 60. The server 50 operates as the decision-making device 1 of the embodiment by executing a program prepared in advance. The user terminal 60 is a terminal prepared for each user, such as a PC (Personal Computer), a mobile terminal, a tablet, or a smartphone.
In operation, the server 50 notifies the user terminal 60 of the action determined by the policy execution unit 26 of the decision-making device 1, and the user terminal 60 executes the received action. For example, as an action, the user terminal 60 proposes a new service or plan to the user. When the user subscribes to that service or plan, the user terminal 60 transmits to the server 50 a state indicating that the user has subscribed. The transmitted state is stored in the data storage unit 21 as part of the user's action history. In this way, appropriate measures can be taken for the users by means of the decision-making device 1, improving user satisfaction and the sales and profits of the service provider.
FIG. 5 shows a second example of a system to which the decision-making device of the embodiment is applied. In the second example as well, the system provides a service to users and includes a server 70 and a plurality of user terminals 80. However, in the second example, the decision-making device 1x configured by the server 70 does not have a policy execution unit. Instead, each user terminal 80 includes an AI (Artificial Intelligence) agent 81, and the AI agent 81 operates as a policy execution unit 82. The AI agent 81 is realized by the computer constituting the user terminal 80 executing a program prepared in advance.
In operation, the server 70 transmits the policy determined by the policy management unit 22 of the decision-making device 1x to the user terminal 80, which receives it. As described above, a policy is a function that outputs an action when a state is input, so the policy execution unit 82 determines and executes an action based on the received policy and the user's current state. For example, as an action, the user terminal 80 proposes a new service or plan to the user. When the user subscribes to that service or plan, the user terminal 80 transmits to the server 70 a state indicating that the user has subscribed. The transmitted state is stored in the data storage unit 21 as part of the user's action history. In this way, appropriate measures can be taken for the users by means of the decision-making device 1x, improving user satisfaction and the sales and profits of the service provider.
(Effects)
Next, the effects obtained by the decision-making process of the embodiment will be described. FIG. 6(A) shows a state transition diagram of an environment to which the decision-making process is applied. There are two states, "p" and "q", and actions start from state p. The actions a user can take are "x", "y", and "z", and the state transitions are as shown in FIG. 6(A): performing action x or y in state p returns the user to state p, performing action z in state p moves the user to state q, and in state q the state returns to q whichever of actions x, y, and z the user performs. In practice, each user is accompanied by an AI agent, and the AI agent determines the user's actions.
FIG. 6(B) shows the reward obtained by each action. In state p, the user obtains a reward of "1" for action x, "2" for action y, and "3" for action z. In state q, the reward is "0" for every action. State q therefore corresponds to the "irreversible state": its reward is "0" and no transition to any other state is possible from it.
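The two-state environment of FIG. 6 can be written down directly, as in the small sketch below (the numeric encoding of states and actions is an assumption made only for illustration):

import numpy as np

# States: 0 = p, 1 = q (q is the irreversible state). Actions: 0 = x, 1 = y, 2 = z.
reward = np.array([[1.0, 2.0, 3.0],    # rewards in state p for x, y, z (FIG. 6(B))
                   [0.0, 0.0, 0.0]])   # every action in state q yields reward 0
transition = np.zeros((2, 3, 2))
transition[0, 0, 0] = 1.0   # x in p -> stay in p
transition[0, 1, 0] = 1.0   # y in p -> stay in p
transition[0, 2, 1] = 1.0   # z in p -> fall into q
transition[1, :, 1] = 1.0   # any action in q -> stay in q (no way back)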
Now suppose there are multiple users. Existing techniques give no particular consideration to the "irreversible state", so eventually every user selects action z in state p and all of them fall into the irreversible state q. Specifically, if the users learn independently, each of them will sooner or later try action z in state p and fall into state q. If, instead, the users share information and learn together, they basically all adopt the same policy, so at some point all of them try action z in state p at the same time and fall into state q together.
In contrast, according to the decision-making process of the embodiment described above, the actions taken by the users are dispersed by the randomization of the MDP parameters, so the users never all take action z in state p at the same time. Moreover, because information is shared among the users, when one user falls into state q, that information is reflected in all subsequent policy decisions, which prevents other users from falling into state q in the same way afterwards. Furthermore, since each user's AI agent advances its learning based on the users' action histories as described above, the policies adopted by the AI agents are eventually optimized and only optimal actions come to be selected.
[Second Embodiment]
Next, a second embodiment of the present invention will be described. FIG. 6 is a block diagram showing the functional configuration of a decision-making device 90 according to the second embodiment. As shown in the figure, the decision-making device 90 includes an estimation unit 91, a randomization unit 92, a policy calculation unit 93, and a policy management unit 94.
The estimation unit 91 estimates the parameters of a model that defines transitions of a user's state based on the action histories of a plurality of users; one example of such a model is an MDP. The randomization unit 92 gives randomness, for each user, to the parameters estimated by the estimation unit 91; by giving randomness, the parameters are no longer identical for all users. The policy calculation unit 93 calculates a policy for each user using the randomized parameters; a policy is a function that determines the action for each user. The policy management unit 94 then excludes, from the policies to be applied to each user, any policy calculated by the policy calculation unit 93 that has caused a user's state to transition to a predetermined irreversible state. In this way, the decision-making device 90 can take appropriate actions for each user while avoiding, as far as possible, users falling into the irreversible state.
Part or all of the above embodiments may also be described as in the following supplementary notes, but are not limited thereto.
(Appendix 1)
A decision-making device comprising:
an estimation unit that estimates parameters of a model defining transitions of a user's state, based on the action histories of a plurality of users;
a randomization unit that gives randomness to the estimated parameters for each user;
a policy calculation unit that calculates, for each user and using the randomized parameters, a policy, which is a function that determines an action for the user; and
a policy management unit that excludes, from the policies applied to each user, any calculated policy that has caused a user's state to transition to a predetermined irreversible state.
(Appendix 2)
The decision-making device according to Appendix 1, wherein the user's action history includes the user's state and an action for the user, and the policy is a function that outputs an action for the user when the user's state is input.
(Appendix 3)
The decision-making device according to Appendix 1 or 2, wherein the policy management unit refers to the action histories of the plurality of users and, when detecting that a user's state has transitioned to the irreversible state, excludes the policy that caused the transition from the policies applied to each user.
(Appendix 4)
The decision-making device according to any one of Appendices 1 to 3, wherein the policy management unit stores in advance one or more states corresponding to the irreversible state.
(Appendix 5)
The decision-making device according to any one of Appendices 1 to 4, wherein the irreversible state is a state in which the reward obtained by executing an action is zero or not more than a predetermined value and from which no transition to another state can be made by executing an action.
(Appendix 6)
The decision-making device according to any one of Appendices 1 to 5, wherein the parameters of the model include the user's state, an action for the user, a reward, and a transition probability, and the randomization unit randomizes at least one of the reward and the transition probability.
(Appendix 7)
The decision-making device according to Appendix 6, wherein the randomization unit strengthens the randomness given to the parameters as the number of samples of the user's action history becomes smaller, and weakens it as the number of samples becomes larger.
(Appendix 8)
The decision-making device according to any one of Appendices 1 to 7, further comprising a policy execution unit that determines and executes an action for each user based on the policy applied to that user.
(Appendix 9)
The decision-making device according to any one of Appendices 1 to 8, wherein the policy calculation unit updates the policy for each user every time the policy has been applied a predetermined number of times.
(Appendix 10)
A decision-making method comprising:
estimating parameters of a model defining transitions of a user's state, based on the action histories of a plurality of users;
giving randomness to the estimated parameters for each user;
calculating, for each user and using the randomized parameters, a policy, which is a function that determines an action for the user; and
excluding, from the policies applied to each user, any calculated policy that has caused a user's state to transition to a predetermined irreversible state.
(Appendix 11)
A recording medium recording a program that causes a computer to execute processing comprising:
estimating parameters of a model defining transitions of a user's state, based on the action histories of a plurality of users;
giving randomness to the estimated parameters for each user;
calculating, for each user and using the randomized parameters, a policy, which is a function that determines an action for the user; and
excluding, from the policies applied to each user, any calculated policy that has caused a user's state to transition to a predetermined irreversible state.
Although the present invention has been described above with reference to the embodiments and examples, the present invention is not limited to them. Various changes that can be understood by those skilled in the art can be made to the configuration and details of the present invention within the scope of the present invention.
1, 1x, 90 Decision-making device
11 Communication unit
12 Processor
13 Memory
14 Recording medium
15 Action history database
21 Data storage unit
22, 94 Policy management unit
23 MDP estimation unit
24, 92 Randomization unit
25, 93 Policy calculation unit
26 Policy execution unit

Claims (11)

1. A decision-making device comprising:
an estimation unit that estimates parameters of a model defining transitions of a user's state, based on the action histories of a plurality of users;
a randomization unit that gives randomness to the estimated parameters for each user;
a policy calculation unit that calculates, for each user and using the randomized parameters, a policy, which is a function that determines an action for the user; and
a policy management unit that excludes, from the policies applied to each user, any calculated policy that has caused a user's state to transition to a predetermined irreversible state.
2. The decision-making device according to claim 1, wherein the user's action history includes the user's state and an action for the user, and the policy is a function that outputs an action for the user when the user's state is input.
3. The decision-making device according to claim 1 or 2, wherein the policy management unit refers to the action histories of the plurality of users and, when detecting that a user's state has transitioned to the irreversible state, excludes the policy that caused the transition from the policies applied to each user.
4. The decision-making device according to any one of claims 1 to 3, wherein the policy management unit stores in advance one or more states corresponding to the irreversible state.
5. The decision-making device according to any one of claims 1 to 4, wherein the irreversible state is a state in which the reward obtained by executing an action is zero or not more than a predetermined value and from which no transition to another state can be made by executing an action.
6. The decision-making device according to any one of claims 1 to 5, wherein the parameters of the model include the user's state, an action for the user, a reward, and a transition probability, and the randomization unit randomizes at least one of the reward and the transition probability.
7. The decision-making device according to claim 6, wherein the randomization unit strengthens the randomness given to the parameters as the number of samples of the user's action history becomes smaller, and weakens it as the number of samples becomes larger.
8. The decision-making device according to any one of claims 1 to 7, further comprising a policy execution unit that determines and executes an action for each user based on the policy applied to that user.
9. The decision-making device according to any one of claims 1 to 8, wherein the policy calculation unit updates the policy for each user every time the policy has been applied a predetermined number of times.
10. A decision-making method comprising:
estimating parameters of a model defining transitions of a user's state, based on the action histories of a plurality of users;
giving randomness to the estimated parameters for each user;
calculating, for each user and using the randomized parameters, a policy, which is a function that determines an action for the user; and
excluding, from the policies applied to each user, any calculated policy that has caused a user's state to transition to a predetermined irreversible state.
11. A recording medium recording a program that causes a computer to execute processing comprising:
estimating parameters of a model defining transitions of a user's state, based on the action histories of a plurality of users;
giving randomness to the estimated parameters for each user;
calculating, for each user and using the randomized parameters, a policy, which is a function that determines an action for the user; and
excluding, from the policies applied to each user, any calculated policy that has caused a user's state to transition to a predetermined irreversible state.
PCT/JP2019/019636 2019-05-17 2019-05-17 Decision-making device, decision-making method, and storage medium WO2020234913A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2019/019636 WO2020234913A1 (en) 2019-05-17 2019-05-17 Decision-making device, decision-making method, and storage medium
JP2021520487A JP7279782B2 (en) 2019-05-17 2019-05-17 Decision-making device, decision-making method, and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2019/019636 WO2020234913A1 (en) 2019-05-17 2019-05-17 Decision-making device, decision-making method, and storage medium

Publications (1)

Publication Number Publication Date
WO2020234913A1 true WO2020234913A1 (en) 2020-11-26

Family

ID=73459213

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2019/019636 WO2020234913A1 (en) 2019-05-17 2019-05-17 Decision-making device, decision-making method, and storage medium

Country Status (2)

Country Link
JP (1) JP7279782B2 (en)
WO (1) WO2020234913A1 (en)

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8788439B2 (en) * 2012-12-21 2014-07-22 InsideSales.com, Inc. Instance weighted learning machine learning model
WO2019021401A1 (en) * 2017-07-26 2019-01-31 日本電気株式会社 Reinforcement learning device, reinforcement learning method, and reinforcement learning program recording medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
STREHL ALEXANDER L. ET AL.: "An Empirical Evaluation of Interval Estimation for Markov Decision Processes", PROCEEDINGS OF 16TH IEEE INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE, 17 November 2004 (2004-11-17), pages 128 - 135, XP010759667, Retrieved from the Internet <URL:https://ieeexplore.ieee.org/document/1374179> [retrieved on 20190808] *

Also Published As

Publication number Publication date
JP7279782B2 (en) 2023-05-23
JPWO2020234913A1 (en) 2020-11-26

Similar Documents

Publication Publication Date Title
US7882075B2 (en) System, method and program product for forecasting the demand on computer resources
US20080255910A1 (en) Method and System for Adaptive Project Risk Management
US20130124258A1 (en) Methods and Systems for Identifying Customer Status for Developing Customer Retention and Loyality Strategies
CN111325417A (en) Method and device for realizing privacy protection and realizing multi-party collaborative updating of business prediction model
O'Neil et al. Newsvendor problems with demand shocks and unknown demand distributions
Fang et al. Effective confidence interval estimation of fault-detection process of software reliability growth models
US9524481B2 (en) Time series technique for analyzing performance in an online professional network
Hines et al. Preference elicitation for risky prospects
CN112494935B (en) Cloud game platform pooling method, electronic equipment and storage medium
WO2020234913A1 (en) Decision-making device, decision-making method, and storage medium
Bahati et al. Adapting to run-time changes in policies driving autonomic management
US10783449B2 (en) Continual learning in slowly-varying environments
US20210158257A1 (en) Estimating a result of configuration change(s) in an enterprise
Azaron et al. Lower bound for the mean project completion time in dynamic PERT networks
US20200364555A1 (en) Machine learning system
CN112312173A (en) Anchor recommendation method and device, electronic equipment and readable storage medium
Story et al. Anticipation and choice heuristics in the dynamic consumption of pain relief
EP2312516A1 (en) Denoising explicit feedback for recommender systems
CN113179224A (en) Traffic scheduling method and device for content distribution network
CN116362415B (en) Airport ground staff oriented shift scheme generation method and device
CN118154252A (en) Data prediction method, computer device, and storage medium
CN115659377B (en) Interface abnormal access identification method and device, electronic equipment and storage medium
Schwartz et al. Evaluating investments in disruptive technologies
CN117634855B (en) Project risk decision method, system, equipment and medium based on self-adaptive simulation
Papaioannou et al. Optimizing an incentives’ mechanism for truthful feedback in virtual communities

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19929377

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2021520487

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19929377

Country of ref document: EP

Kind code of ref document: A1