WO2020234913A1 - Decision-making device, decision-making method, and storage medium - Google Patents

Decision-making device, decision-making method, and storage medium Download PDF

Info

Publication number
WO2020234913A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
state
policy
decision
action
Prior art date
Application number
PCT/JP2019/019636
Other languages
French (fr)
Japanese (ja)
Inventor
慧 竹村
悠輝 中口
Original Assignee
日本電気株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電気株式会社 (NEC Corporation)
Priority to PCT/JP2019/019636 priority Critical patent/WO2020234913A1/en
Priority to JP2021520487A priority patent/JP7279782B2/en
Publication of WO2020234913A1 publication Critical patent/WO2020234913A1/en

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Definitions

  • the present invention relates to a method of determining an action for a user based on a user's action history.
  • MDP: Markov Decision Process
  • One of the objects of the present invention is to provide a decision-making method capable of avoiding falling into a specific unfavorable state in the process of determining a policy for a user.
  • The decision-making device comprises: an estimation unit that estimates parameters of a model defining user state transitions based on the action histories of multiple users; a randomization unit that gives the estimated parameters randomness for each user; a policy calculation unit that uses the randomized parameters to calculate, for each user, a policy, which is a function that determines an action for the user; and a policy management unit that excludes, from the policies applied to the users, any policy that has caused a user's state to transition to a predetermined irreparable state.
  • The decision-making method estimates parameters of a model defining user state transitions based on the action histories of multiple users, gives the estimated parameters randomness for each user, uses the randomized parameters to calculate, for each user, a policy, which is a function that determines an action for the user, and excludes, from the policies applied to the users, any calculated policy that has caused a user's state to transition to a predetermined irreparable state.
  • The recording medium records a program that causes a computer to execute a process of: estimating parameters of a model defining user state transitions based on the action histories of multiple users; giving the estimated parameters randomness for each user; calculating, for each user, a policy, which is a function that determines an action for the user, using the randomized parameters; and excluding, from the policies applied to the users, any calculated policy that has caused a user's state to transition to a predetermined irreparable state.
  • the hardware configuration of the decision-making apparatus according to the first embodiment is shown.
  • The functional configuration of the decision-making apparatus according to the first embodiment is shown, together with a flowchart of the decision-making process according to the first embodiment.
  • a first example of a system to which a decision-making device is applied is shown.
  • A second example of a system to which a decision-making device is applied is shown, together with a figure explaining the effect obtained by the decision-making process.
  • the functional configuration of the decision-making apparatus according to the second embodiment is shown.
  • MDP uses "state”, “action”, “reward” and “transition probability” as parameters.
  • Each “state” and “action” is a finite set or an element thereof.
  • “Reward” is a function that outputs a real value when a state and an action are input.
  • the “transition probability” is a function that outputs the probability of transition to each state when the current state and the action are input.
  • The decision-making process using MDP basically repeats (a) a step of selecting and executing an action based on the current state, and (b) a step of obtaining a reward and the next state as a result of the executed action.
  • the next state is determined based on the current state, the action, and the transition probability.
  • the reward and the transition probability are unknown, and the decision-making process repeats the above steps (a) and (b) while estimating the reward and the transition probability.
  • the decision-making process calculates the optimum policy based on the estimated reward and the transition probability.
  • the policy is a function that maps a state to an action and outputs the action when the state is input.
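  • As a minimal sketch of the repeated steps (a) and (b), assuming a placeholder `env.step` interface rather than any interface defined by the patent:

```python
def run_steps(policy, env, state, num_steps):
    """Repeat steps (a) and (b): select an action from the policy, then observe
    the reward and the next state.  `env.step` stands in for however the system
    actually executes an action for a user and observes the outcome."""
    history = []
    for _ in range(num_steps):
        action = policy[state]                         # (a) the policy maps the state to an action
        reward, next_state = env.step(state, action)   # (b) the reward and next state are obtained
        history.append((state, action, reward, next_state))
        state = next_state
    return history
```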
  • In that process, the user's state may fall into a specific unfavorable state, more specifically an "irreparable state".
  • The "irreparable state" means a state in which the reward is zero or unacceptably small, and from which no transition to another state is possible. Whether the reward is unacceptably small is determined in practice by whether the reward value is smaller than a predetermined threshold value. Once a user falls into such an "irreparable state", the user cannot return to a normal state in which rewards can be obtained.
  • In a system that continuously provides a service to users, the decision-making process determines an appropriate policy for enhancing the users' long-term satisfaction and executes actions on the users, thereby obtaining payments, contract renewals, and the like from the users.
  • An example of the "irreparable state" is the state in which the user has canceled the contract, or more precisely, the state in which the user has canceled and no subsequent action can win the contract back. That is, the decision-making process must avoid driving the user to cancellation while it tries various actions to increase the user's satisfaction. If there is an "action that always leads to cancellation", that is, if it turns out that executing a certain action in a certain state always makes the user cancel, the decision-making process must avoid executing that action in that state.
  • Therefore, policies are determined and actions are executed for each user so that users avoid falling into an "irreparable state" as much as possible.
  • the decision-making process of the present embodiment is executed by combining the following three approaches.
  • (Approach A) Handle a plurality of users at the same time. If there is only one user and that user falls into an "irreparable state" while various actions are being tried, there is no chance of recovery. Handling a plurality of users at the same time creates opportunities for recovery as the number of users increases. However, if the user "state" has only two values, "not canceled" and "canceled", and there is an "action that always leads to cancellation", every user will eventually try that action and cancel the contract. Therefore, Approach A alone is not sufficient.
  • the "policy” is a function that outputs an "action” when a "state” is input.
  • The decision-making process calculates the policy using the reward and transition probability estimated from past data. By giving randomness to at least one of the reward and the transition probability, the calculated policies themselves acquire randomness, so that different actions are performed for different users who are in the same state. This makes it far less likely that an "action that always leads to cancellation" is executed for all users at the same time. As a result, it is possible to prevent all users from falling into an "irreparable state" at the same time.
  • the decision-making process repeats (a) a step of selecting and executing an action based on the current state, and (b) a step of obtaining a reward and the next state as a result of the executed action.
  • the user's action history is accumulated, and the reward and transition probability are estimated using the action history. Therefore, when the accumulated behavior history is small and the estimation confidence level is low, the decision-making process strengthens the randomness given to the reward and the transition probability.
  • Conversely, when the accumulated action history is large and the estimation confidence is high, the decision-making process weakens the randomness given to the reward and the transition probability.
  • In the early stage of the decision-making process, little action history has accumulated, so the randomness given to the reward and the transition probability is strong and various actions are tried.
  • As the accumulated action history grows and the estimation becomes more reliable, the randomness given to the reward and the transition probability becomes weaker, and actions closer to the optimum are selected and executed more often.
  • The decision-making device basically repeats (a) a step of selecting and executing an action based on the current state, and (b) a step of obtaining a reward and the next state as a result of the executed action.
  • the decision-making device updates the policy for each user at a predetermined timing based on the action history of the plurality of users obtained during that period.
  • When a user falls into an "irreparable state", the decision-making device does not use the policy that was applied at that time thereafter.
  • Specifically, the decision-making device stores the state immediately before the user fell into the "irreparable state" and the action executed at that time, and thereafter does not execute that action in that state. As a result, the decision-making device can take appropriate actions for each user while avoiding, as much as possible, users falling into an "irreparable state".
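  • A minimal sketch of this bookkeeping is shown below, with illustrative names: the pair of the state and the action that immediately preceded an irreparable state is remembered and never allowed again.

```python
class NGPolicyMemory:
    """Remembers the (state, action) pair used just before a user fell into an
    irreparable state so that it is never executed again; names are illustrative."""

    def __init__(self, irreparable_states):
        self.irreparable_states = set(irreparable_states)
        self.forbidden = set()   # (state, action) pairs that led to an irreparable state

    def record_transition(self, state, action, next_state):
        if next_state in self.irreparable_states:
            self.forbidden.add((state, action))

    def is_allowed(self, state, action):
        return (state, action) not in self.forbidden
```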
  • the decision-making apparatus according to the first embodiment will be described in detail.
  • FIG. 1 is a block diagram showing a hardware configuration of the decision-making device according to the first embodiment.
  • the decision-making device 1 includes a communication unit 11, a processor 12, a memory 13, a recording medium 14, and an action history database (DB) 15.
  • the communication unit 11 communicates with an external device. Specifically, the communication unit 11 transmits the action determined by the decision-making device 1 to the terminal device of each user, or receives the state of each user from the terminal device of each user.
  • the processor 12 is a computer such as a CPU (Central Processing Unit), and controls the entire decision-making device 1 by executing a program prepared in advance.
  • the memory 13 is composed of a ROM (Read Only Memory), a RAM (Random Access Memory), and the like.
  • the memory 13 stores various programs executed by the processor 12.
  • the memory 13 is also used as a working memory during execution of various processes by the processor 12.
  • the recording medium 14 is a non-volatile, non-temporary recording medium such as a disk-shaped recording medium or a semiconductor memory, and is configured to be removable from the decision-making device 1.
  • the recording medium 14 records various programs executed by the processor 12. When the decision-making apparatus 1 executes the decision-making process, the program recorded in the recording medium 14 is loaded into the memory 13 and executed by the processor 12.
  • the action history DB 15 stores the action history of a plurality of users.
  • the action history includes the user's "state", "action” and "reward”.
  • The action history is obtained while the decision-making device 1 repeats (a) a step of selecting and executing an action based on the current state, and (b) a step of obtaining a reward and the next state as a result of the executed action, and is accumulated in the action history DB 15.
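  • One possible in-memory layout for such an action history record is sketched below; the field names are assumptions, not the patent's schema.

```python
from dataclasses import dataclass

@dataclass
class ActionHistoryEntry:
    """One row of the action history DB 15; field names are illustrative assumptions."""
    user_id: str
    state: str        # state observed before the action
    action: str       # action selected and executed for the user
    reward: float     # reward obtained as a result of the action
    next_state: str   # state observed after the action

# The DB itself can be viewed as an append-only collection shared by all users.
action_history_db: list[ActionHistoryEntry] = []
```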
  • FIG. 2 is a block diagram showing a functional configuration of the decision-making device 1.
  • the decision-making device 1 functionally includes a data storage unit 21, a policy management unit 22, an MDP estimation unit 23, a randomization unit 24, a policy calculation unit 25, and a policy execution unit 26.
  • the data storage unit 21 is realized by the action history DB 15.
  • the policy management unit 22, the MDP estimation unit 23, the randomization unit 24, the policy calculation unit 25, and the policy execution unit 26 are realized by the processor 12 executing a program prepared in advance.
  • the data storage unit 21 receives the status of each user from the external device, and acquires the action executed for each user from the policy management unit 22 and the reward obtained by the action. Then, the data storage unit 21 stores them as the action history of each user. The data storage unit 21 accumulates the entire action history of a plurality of users targeted by the decision-making device 1.
  • the policy management unit 22 determines the current policy for each user based on the total action history of each user stored in the data storage unit 21. Specifically, the policy management unit 22 first acquires the entire action history of each user stored in the data storage unit 21 and supplies it to the MDP estimation unit 23.
  • the MDP estimation unit 23 estimates MDP parameters (hereinafter referred to as "MDP parameters"), specifically, “reward” and "transition probability” using the entire action history of each user. Then, the MDP estimation unit 23 supplies the estimated MDP parameter (hereinafter, referred to as “estimated MDP parameter”) and the number of samples of the user's action history used for estimating the MDP parameter to the randomization unit 24. That is, the MDP estimation unit 23 supplies the estimated value of the “reward” and the estimated value of the “transition probability” to the randomization unit 24.
  • the randomization unit 24 randomizes the estimated MDP parameters. Specifically, the randomization unit 24 imparts randomness by adding noise to the estimated MDP parameters.
  • the randomizing unit 24 may randomize at least one of the estimated MDP parameters. That is, the randomization unit 24 may randomize at least one of the estimated value of the reward supplied from the MDP estimation unit 23 and the estimated value of the transition probability.
  • the randomization unit 24 independently adds noise to the estimated MDP parameters for each user.
  • As a result, the randomized estimated MDP parameters (hereinafter referred to as "randomized MDP parameters") are basically different for each user. The randomization unit 24 then supplies the randomized MDP parameters to the policy calculation unit 25.
  • The randomization unit 24 preferably changes the strength of the randomness given to the estimated MDP parameters according to the number of samples of the user action history used to estimate them: when the number of samples is small, the randomness is made stronger, and when the number of samples is large, it is made weaker.
  • the policy calculation unit 25 calculates a policy for each user using the randomized MDP parameters supplied from the randomization unit 24. For example, the policy calculation unit 25 calculates the optimum policy by the value iterative method using the reward and the transition probability.
  • the policy calculated by the policy calculation unit 25 for each user has randomness.
  • Since the randomization unit 24 randomizes the parameters independently for each user, the policies calculated by the policy calculation unit 25 differ from user to user. As a result, the same policy is not applied to all users at the same time, and it is possible to avoid all users falling into an "irreparable state" at the same time.
  • the policy calculation unit 25 supplies the calculated policy for each user to the policy management unit 22.
  • the policy management unit 22 supplies the policy for each user calculated by the policy calculation unit 25 to the policy execution unit 26. However, the policy management unit 22 excludes the policy that has caused the user to be in an "irreparable state" even once in the past, and does not supply the policy to the policy execution unit 26.
  • the policy management unit 22 stores in advance a specific state corresponding to the "irreparable state".
  • the policy management unit 22 may store a plurality of states as “irreparable states”.
  • the policy management unit 22 refers to the user's action history stored in the data storage unit 21, and stores the policy that has made the user transition to the “irreparable state” in the past as an “NG policy”.
  • If the policies for each user supplied from the policy calculation unit 25 include the NG policy, the policy management unit 22 does not apply the NG policy.
  • Instead, the policy management unit 22 applies policies other than the NG policy to each user. That is, among the policies for each user supplied from the policy calculation unit 25, the policy management unit 22 supplies only those other than the NG policy to the policy execution unit 26. As a result, the NG policy is not applied to the user, and it is possible to prevent the user from falling into an "irreparable state".
  • The policy execution unit 26 receives the policy for each user from the policy management unit 22 and acquires the current state of each user from the data storage unit 21. The policy execution unit 26 then determines and executes an action for each user based on that user's policy and current state. As described above, since a "policy" is a function that outputs an "action" when a "state" is input, the policy execution unit 26 determines and executes the action for each user by inputting the user's current state into that user's policy.
  • Since the randomization unit 24 randomizes the estimated MDP parameters independently for each user and the policy calculation unit 25 calculates each user's policy from the randomized MDP parameters, the same policy is not applied to all users at the same time. This prevents the NG policy from being applied to all users at the same time, and thus prevents all users from falling into an "irreparable state" at the same time.
  • Further, since the policy management unit 22 does not reapply an NG policy that has put a user in an "irreparable state" in the past, one or a very small number of users may be sacrificed, but a large number of users can thereafter be prevented from falling into an "irreparable state".
  • the policy management unit 22 acquires the user's action history stored in the data storage unit 21 and supplies it to the MDP estimation unit 23.
  • The MDP estimation unit 23 estimates the reward and the transition probability as the MDP parameters using the users' action histories. Specifically, the MDP estimation unit 23 calculates the estimated reward r^(s, a) obtained when an action a is performed in a state s by the following equation (1):
  • r^(s, a) = R(s, a) / N(s, a)   ... (1)
  • Here, R(s, a) is the sum of the rewards obtained when the action a was performed in the state s in the action history of all users.
  • N(s, a) is the number of times the action a was performed in the state s in the action history of all users.
  • Similarly, the MDP estimation unit 23 calculates the estimated transition probability p^(s'|s, a) of transitioning to a state s' when the action a is performed in the state s, as p^(s'|s, a) = P(s, a, s') / N(s, a).
  • P(s, a, s') is the number of times the state transitioned to the state s' as a result of performing the action a in the state s in the action history of all users.
  • The MDP estimation unit 23 supplies the reward estimate r^(s, a) and the transition probability estimate p^(s'|s, a) obtained in this way, together with the corresponding numbers of samples, to the randomization unit 24.
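  • A small sketch of this estimation step is shown below; it computes the count ratios described above over the pooled action history of all users.

```python
from collections import defaultdict

def estimate_mdp(history):
    """Empirical estimates of the MDP parameters from the pooled action history.

    `history` is an iterable of (state, action, reward, next_state) tuples; the
    ratios below correspond to equation (1) for the reward and the analogous
    count ratio for the transition probability."""
    R = defaultdict(float)   # R(s, a): summed reward
    N = defaultdict(int)     # N(s, a): number of times a was performed in s
    P = defaultdict(int)     # P(s, a, s'): number of observed transitions s --a--> s'
    for state, action, reward, next_state in history:
        R[(state, action)] += reward
        N[(state, action)] += 1
        P[(state, action, next_state)] += 1
    r_hat = {sa: R[sa] / N[sa] for sa in N}            # r^(s, a) = R(s, a) / N(s, a)
    p_hat = {sas: P[sas] / N[sas[:2]] for sas in P}    # p^(s'|s, a) = P(s, a, s') / N(s, a)
    return r_hat, p_hat, N, P
```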
  • The randomization unit 24 randomizes at least one of the reward estimate r^(s, a) and the transition probability estimate p^(s'|s, a).
  • For example, the randomization unit 24 adds noise to the reward estimate r^(s, a) to generate a randomized reward estimate r~(s, a).
  • This noise is determined so that the mean of the randomized reward estimate r~(s, a) is r^(s, a).
  • The standard deviation σ of the noise is set according to the number of samples N(s, a) in the action history, so that the randomness is stronger when fewer samples are available.
  • As another example, the randomization unit 24 adds noise to the transition probability estimate p^(s'|s, a) to generate a randomized transition probability estimate p~(s'|s, a).
  • This noise is determined so that the randomized transition probability estimate p~(s'|s, a) follows a distribution specified by a parameter "α".
  • The parameter "α" is an A-dimensional parameter, and the value of its component corresponding to the action "a" is N(s, a, s') + 1.
  • N(s, a, s') is the number of times the state transitioned to s' as a result of performing the action a in the state s in the action history of all users.
  • A is the total number of actions as described above. In this case as well, if the number of samples N(s, a, s') in the action history is small, the randomness of the transition probability estimate is strong, and if the number of samples is large, the randomness is weak. Thus, as described above, the randomness given to the transition probability estimate changes according to the number of samples in the action history. In this example, the reward estimate r^(s, a) is not randomized and is used as it is.
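  • The sketch below illustrates one way the randomization unit 24 could draw per-user randomized parameters. The 1/sqrt(N) noise scale for the reward and the use of a Dirichlet distribution over next states for the transition probabilities are assumptions consistent with the description above, not the patent's exact formulas.

```python
import numpy as np

rng = np.random.default_rng()

def randomize_for_user(r_hat, N, P, states, actions):
    """Per-user randomized MDP parameters (r~, p~).

    The text above says the randomized reward has mean r^(s, a) and a standard
    deviation that is larger when N(s, a) is small; the 1/sqrt(N) form used here
    is an assumed choice.  The transition probabilities are drawn from a Dirichlet
    distribution whose parameters are N(s, a, s') + 1, matching the description
    of the parameter alpha."""
    r_tilde, p_tilde = {}, {}
    for (s, a), r in r_hat.items():
        sigma = 1.0 / np.sqrt(N[(s, a)])     # assumed noise scale: fewer samples -> more noise
        r_tilde[(s, a)] = r + rng.normal(0.0, sigma)
    for s in states:
        for a in actions:
            alpha = np.array([P.get((s, a, s2), 0) + 1.0 for s2 in states])
            sample = rng.dirichlet(alpha)     # randomized p~(. | s, a)
            for s2, prob in zip(states, sample):
                p_tilde[(s, a, s2)] = prob
    return r_tilde, p_tilde
```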
  • The policy calculation unit 25 then calculates the optimal policy from the randomized MDP parameters by the value iteration method.
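  • A sketch of this step is shown below; the discounted formulation and the values of gamma and the tolerance are illustrative assumptions, since the text only names the value iteration method.

```python
def value_iteration(r, p, states, actions, gamma=0.95, tol=1e-6):
    """Optimal policy for the (randomized) MDP parameters by value iteration."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(
                r.get((s, a), 0.0)
                + gamma * sum(p.get((s, a, s2), 0.0) * V[s2] for s2 in states)
                for a in actions
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # The policy maps each state to the action with the highest Q-value.
    return {
        s: max(actions, key=lambda a: r.get((s, a), 0.0)
               + gamma * sum(p.get((s, a, s2), 0.0) * V[s2] for s2 in states))
        for s in states
    }
```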
  • FIG. 3 is a flowchart of the decision-making process according to the first embodiment. This process is realized by the processor 12 shown in FIG. 1 executing a program prepared in advance and operating as each element shown in FIG.
  • First, the decision-making device 1 calculates a policy for each user (step S11). Since essentially no action history exists in the data storage unit 21 at the start of the decision-making process, the policy for each user is calculated from randomized MDP parameters to which the randomization unit 24 has given strong randomness.
  • the decision-making device 1 applies the calculated policy for each user once (step S12). Specifically, the policy execution unit 26 determines and executes an action for each user based on the policy for each user and the current state of each user. By executing the action, each user transitions to the next state.
  • Next, the decision-making device 1 detects the state of each user after the action is executed and determines whether any user has fallen into an "irreparable state" (step S13). As described above, since the decision-making device 1 stores in advance the specific states corresponding to the "irreparable state", it determines whether the state of any user after the action matches one of those specific states. When a user has fallen into an "irreparable state" (step S13: Yes), the process returns to step S11 and the decision-making device 1 updates the policy for each user.
  • At this time, the policy management unit 22 stores the policy that caused the transition to the "irreparable state" as an NG policy and thereafter prohibits applying the NG policy to any user. That is, the policies applied in step S12 from then on are policies other than the NG policy.
  • In step S14, it is determined whether the users' action history has sufficiently increased. Specifically, this determination is made by the policy management unit 22 observing the action history of each user. If the action history has not sufficiently increased (step S14: No), the process returns to step S12 and steps S12 to S14 are executed again. That is, the decision-making device 1 determines and executes an action using the same policy and each user's new "state", and the action history grows as actions are executed. In this way, steps S12 to S14 are repeated until the users' action history has sufficiently increased.
  • When the action history has sufficiently increased (step S14: Yes), the process returns to step S11 and the policy for each user is updated using the accumulated action history. That is, step S14 determines whether the time to update the policies has come.
  • Whether or not the users' action history has sufficiently increased can be determined, for example, by the following method. As an example, while any user k satisfies the following equation (5), the decision-making device 1 determines that the action history has not sufficiently increased (step S14: No).
  • ⁇ k is the policy of the user k
  • ⁇ k (s) is the action in the state s of the user k
  • v (s, ⁇ k (s)) is the sum of all users who have performed the action ⁇ k (s) in the state s since the last update of the policy.
  • N ⁇ (s, ⁇ k (s)) is the total number of times the action ⁇ k (s) was performed in the state s before the last policy update.
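  • Since equation (5) itself is not reproduced here, the following sketch assumes a condition of the form ν(s, π_k(s)) < max(1, N~(s, π_k(s))), modeled on the episode-stopping rule of Jaksch et al., which uses the same quantities; this is an interpretation, not the patent's stated criterion.

```python
def history_sufficiently_increased(policies, current_states, nu, N_tilde):
    """Step S14 check under the assumed form of equation (5)."""
    for user, policy in policies.items():
        s = current_states[user]
        a = policy[s]                     # pi_k(s)
        if nu.get((s, a), 0) < max(1, N_tilde.get((s, a), 0)):
            return False                  # some user k still satisfies eq. (5): not sufficient yet
    return True                           # step S14: Yes -> return to step S11 and update the policies
```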
  • Alternatively, the decision-making device 1 may determine in step S14 whether the same policy has been applied a predetermined number of times (for example, X times). In this case, the determination in step S14 is "No" until the same policy has been applied X times, and steps S12 to S14 are repeated. When the same policy has been applied X times, the determination in step S14 becomes "Yes", the process returns to step S11, and the policies are updated.
  • Since the policy for each user is calculated and updated from randomized MDP parameters in step S11, the same policy is not applied to all users at the same time. This prevents all users from executing an NG policy at the same time. Further, when it is detected in step S13 that a user has fallen into an "irreparable state", the policy of each user is updated and the NG policy that caused it is prohibited from being applied thereafter, so other users are prevented from subsequently falling into an "irreparable state" through the same policy. Further, when it is determined in step S14 that the users' action history has sufficiently increased, the decision-making device 1 updates the policy for each user using that action history. Therefore, as the users' action history grows, the estimation of the MDP parameters in the MDP estimation unit 23 improves, and more appropriate actions are executed for each user.
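  • Putting the pieces together, an illustrative version of the S11-S14 loop might look as follows; `compute_policies` and `update_time_has_come` are hypothetical helpers standing in for the step S11 pipeline (estimate, randomize per user, value iteration, NG filtering) and the step S14 check.

```python
def decision_making_loop(users, env, states, actions, irreparable_states, num_rounds):
    """Illustrative orchestration of steps S11-S14, reusing the NGPolicyMemory sketch above."""
    history = []
    ng_memory = NGPolicyMemory(irreparable_states)
    current_states = {u: env.initial_state(u) for u in users}
    policies = compute_policies(history, users, states, actions, ng_memory)      # S11
    for _ in range(num_rounds):
        someone_fell = False
        for u in users:                                                          # S12: apply once per user
            s = current_states[u]
            a = policies[u][s]
            reward, s_next = env.step(u, s, a)
            history.append((s, a, reward, s_next))
            ng_memory.record_transition(s, a, s_next)
            current_states[u] = s_next
            someone_fell = someone_fell or s_next in irreparable_states
        if someone_fell or update_time_has_come(history):                        # S13 / S14
            policies = compute_policies(history, users, states, actions, ng_memory)  # back to S11
    return policies
```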
  • FIG. 4 shows a first example of a system to which the decision-making apparatus of the embodiment is applied.
  • This system provides a certain service to a user, and includes a server 50 and a plurality of user terminals 60.
  • the server 50 operates as the decision-making device 1 of the embodiment by executing a program prepared in advance.
  • the user terminal 60 is a terminal prepared for each user, and is, for example, a PC (Personal Computer), a mobile terminal, a tablet, a smartphone, or the like.
  • the server 50 notifies the user terminal 60 of the action determined by the policy execution unit 26 of the decision-making device 1.
  • the user terminal 60 executes the received action. For example, the user terminal 60 proposes a new service or plan to the user as an action.
  • The user terminal 60 then transmits to the server 50 a state indicating, for example, that the user has made a contract.
  • the transmitted state is stored in the data storage unit 21 and stored as a part of the user's action history. In this way, it is possible to take appropriate measures for the user by using the decision-making device 1, improve the satisfaction of the user, and improve the sales and profits on the service providing side.
  • FIG. 5 shows a second example of a system to which the decision-making device of the embodiment is applied.
  • the system provides a certain service to the user, and includes a server 70 and a plurality of user terminals 80.
  • the decision-making device 1x configured by the server 70 does not have a policy execution unit.
  • the user terminal 80 includes an AI (Artificial Intelligence) agent 81, and the AI agent 81 operates as a policy execution unit 82.
  • the AI agent 81 is actually realized by the computer constituting the user terminal 80 executing a program prepared in advance.
  • the server 70 transmits the policy determined by the policy management unit 22 of the decision-making device 1x to the user terminal 80.
  • the user terminal 80 receives the policy.
  • the policy is a function that outputs an action when a state is input
  • the policy execution unit 82 determines and executes the action based on the received policy and the current state of the user. For example, the user terminal 80 proposes a new service or plan to the user as an action.
  • The user terminal 80 then transmits to the server 70 a state indicating, for example, that the user has made a contract.
  • the transmitted state is stored in the data storage unit 21 and stored as a part of the user's action history. In this way, it is possible to take appropriate measures for the user by using the decision-making device 1x, improve the satisfaction of the user, and improve the sales and profits on the service providing side.
  • FIG. 6A shows a state transition diagram of the environment to which the decision-making process is applied.
  • In this environment, the user can take three actions, "x", "y", and "z", and the state transition probabilities are as shown in FIG. 6(A). That is, when the action x or y is performed in the state p, the user remains in the state p, and when the action z is performed in the state p, the user transitions to the state q.
  • In the state q, the state remains q regardless of which of the actions x, y, and z the user performs.
  • each user has an AI agent, and the AI agent determines the action of each user.
  • FIG. 6B shows the reward obtained by each action.
  • In the state p, the reward "1" is obtained when the action x is performed, the reward "2" when the action y is performed, and the reward "3" when the action z is performed.
  • In the state q, the reward is "0" regardless of which action the user performs. Therefore, the state q corresponds to the "irreparable state", since the reward is "0" and the state cannot transition to any other state.
  • the actions taken by each user are distributed by randomizing the MDP parameters, so that not all of them take the action z at the same time in the state p. Further, since the information is shared among a plurality of users, when one user falls into the state q, the information is reflected in all the other policy decisions. Therefore, it is possible to prevent other users from falling into the state q in the same manner thereafter. Further, since the AI agent of each user advances learning based on the user's action history as described above, the policy adopted by the AI agent is optimized, and only the optimum action is selected.
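  • Written out as data, the environment of FIG. 6 makes the trap explicit: action z is the most rewarding choice in state p but leads to the absorbing state q.

```python
# The environment of FIG. 6: action z looks best in state p (reward 3) but moves
# the user to the absorbing state q, where every action yields reward 0.
TRANSITIONS = {
    ("p", "x"): "p", ("p", "y"): "p", ("p", "z"): "q",
    ("q", "x"): "q", ("q", "y"): "q", ("q", "z"): "q",
}
REWARDS = {
    ("p", "x"): 1, ("p", "y"): 2, ("p", "z"): 3,
    ("q", "x"): 0, ("q", "y"): 0, ("q", "z"): 0,
}

def step(state, action):
    return REWARDS[(state, action)], TRANSITIONS[(state, action)]

# A purely greedy choice in state p would be z; once one user's AI agent tries it
# and reaches q, recording (p, z) as forbidden lets every other user avoid it.
```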
  • FIG. 7 is a block diagram showing the functional configuration of the decision-making device 90 according to the second embodiment.
  • the decision-making device 90 includes an estimation unit 91, a randomization unit 92, a policy calculation unit 93, and a policy management unit 94.
  • the estimation unit 91 estimates the parameters of the model that defines the transition of the user's state based on the behavior history of a plurality of users.
  • An example of this model is the MDP.
  • the randomization unit 92 gives randomness to the parameters estimated by the estimation unit 91 for each user. By giving randomness, the parameters of each user are not all the same.
  • The policy calculation unit 93 calculates a policy for each user using the randomized parameters.
  • the policy is a function that determines the action for each user.
  • Among the policies calculated by the policy calculation unit 93, the policy management unit 94 excludes, from the policies applied to the users, any policy that has caused a user's state to transition to a predetermined irreparable state.
  • the decision-making device 90 can take an appropriate action for each user while avoiding the user from falling into an irreparable state as much as possible.
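  • The composition of the estimation unit 91, the randomization unit 92, the policy calculation unit 93, and the policy management unit 94 can be sketched as follows; the class and method names are hypothetical, not taken from the patent.

```python
class DecisionMakingDevice:
    """Illustrative composition of the four units of the second embodiment."""

    def __init__(self, estimator, randomizer, policy_calculator, policy_manager):
        self.estimator = estimator                  # estimation unit 91: model parameters from history
        self.randomizer = randomizer                # randomization unit 92: per-user randomness
        self.policy_calculator = policy_calculator  # policy calculation unit 93: one policy per user
        self.policy_manager = policy_manager        # policy management unit 94: excludes NG policies

    def update_policies(self, history, users):
        model = self.estimator.estimate(history)
        policies = {}
        for user in users:
            randomized = self.randomizer.randomize(model)         # independent randomization per user
            policies[user] = self.policy_calculator.calculate(randomized)
        return self.policy_manager.filter(policies, history)      # drop NG policies before application
```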
  • (Appendix 1) A decision-making device comprising: an estimation unit that estimates parameters of a model defining user state transitions based on the action histories of a plurality of users; a randomization unit that gives the estimated parameters randomness for each user; a policy calculation unit that uses the randomized parameters to calculate, for each user, a policy, which is a function that determines an action for the user; and a policy management unit that excludes, from the policies applied to the users, any calculated policy that has caused a user's state to transition to a predetermined irreparable state.
  • the user's action history includes the user's state and actions for the user.
  • The decision-making device according to Appendix 1, wherein the policy is a function that outputs an action for the user when the user's state is input.
  • The decision-making device according to any one of Appendices 1 to 4, wherein the irreparable state is a state in which the reward obtained by executing an action is zero or smaller than a predetermined value, and from which execution of an action cannot cause a transition to another state.
  • the parameters of the model include the user's state, actions against the user, rewards, and transition probabilities.
  • The decision-making device according to any one of Appendices 1 to 5, wherein the randomization unit randomizes at least one of the reward and the transition probability.
  • (Appendix 7) The decision-making device according to Appendix 6, wherein the randomization unit strengthens the randomness given to the parameters as the number of samples of the user action history becomes smaller, and weakens it as the number of samples becomes larger.
  • (Appendix 8) The decision-making device according to any one of Appendices 1 to 7, further comprising a policy execution unit that determines and executes an action for each user based on the policy applied to that user.
  • A policy, which is a function that determines the action for each user, is calculated for each user using the randomized parameters.
  • A recording medium recording a program that causes a computer to execute a process of excluding, from the policies applied to the users, any calculated policy that has caused a user's state to transition to a predetermined irreparable state.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

Provided is a decision-making device, wherein an estimation unit estimates, on the basis of behavior histories of a plurality of users, a parameter of a model that defines a user state transition. A randomization unit imparts randomness to each user with respect to the estimated parameter. A policy calculation unit uses the randomized parameter to calculate, for each user, a policy, which is a function for determining an action for each user. A policy management unit excludes, from the policies applied to the users, a policy causing a user state to transition to a predetermined irrevocable state.

Description

Decision-making device, decision-making method, and recording medium
The present invention relates to a method of determining an action for a user based on the user's action history.
A decision-making method using a Markov decision process (MDP) is known. In a typical method using an MDP, a policy that maximizes the long-term cumulative reward is learned, and the selection and execution of actions for the user are repeated. An example of such a method is described in Patent Document 1.
Near-optimal Regret Bounds for Reinforcement Learning, Thomas Jaksch, Ronald Ortner & Peter Auer, Journal of Machine Learning Research 11 (2010) 1563-1600
In the above method, various actions are tried in various states in order to find a policy that maximizes the cumulative reward, but in that process the system may fall into a specific unfavorable state.
One of the objects of the present invention is to provide a decision-making method capable of avoiding falling into a specific unfavorable state in the process of determining a policy for a user.
In order to solve the above problems, in one aspect of the present invention, a decision-making device comprises:
an estimation unit that estimates parameters of a model defining user state transitions based on the action histories of a plurality of users;
a randomization unit that gives the estimated parameters randomness for each user;
a policy calculation unit that uses the randomized parameters to calculate, for each user, a policy, which is a function that determines an action for the user; and
a policy management unit that excludes, from the policies applied to the users, any calculated policy that has caused a user's state to transition to a predetermined irreparable state.
In another aspect of the present invention, a decision-making method:
estimates parameters of a model defining user state transitions based on the action histories of a plurality of users;
gives the estimated parameters randomness for each user;
calculates, for each user, a policy, which is a function that determines an action for the user, using the randomized parameters; and
excludes, from the policies applied to the users, any calculated policy that has caused a user's state to transition to a predetermined irreparable state.
In still another aspect of the present invention, a recording medium records a program that causes a computer to execute a process of:
estimating parameters of a model defining user state transitions based on the action histories of a plurality of users;
giving the estimated parameters randomness for each user;
calculating, for each user, a policy, which is a function that determines an action for the user, using the randomized parameters; and
excluding, from the policies applied to the users, any calculated policy that has caused a user's state to transition to a predetermined irreparable state.
According to the present invention, it is possible to avoid falling into a specific unfavorable state in the process of determining a policy for a user.
FIG. 1 shows the hardware configuration of the decision-making apparatus according to the first embodiment. FIG. 2 shows the functional configuration of the decision-making apparatus according to the first embodiment. FIG. 3 is a flowchart of the decision-making process according to the first embodiment. FIG. 4 shows a first example of a system to which the decision-making device is applied. FIG. 5 shows a second example of a system to which the decision-making device is applied. FIG. 6 is a diagram explaining the effect obtained by the decision-making process. FIG. 7 shows the functional configuration of the decision-making apparatus according to the second embodiment.
Hereinafter, preferred embodiments of the present invention will be described with reference to the drawings. In the following description, for convenience of notation, the symbol "A" with "^" placed above it is written as "A^", and the symbol "A" with "~" placed above it is written as "A~".
[Basic method]
First, the basic decision-making method according to the embodiments of the present invention will be described. The following embodiments are characterized in that, in a system modeled by an MDP, they avoid a specific unfavorable state, specifically an "irreparable state".
An MDP uses "state", "action", "reward", and "transition probability" as parameters. "State" and "action" are each a finite set or an element thereof. "Reward" is a function that outputs a real value when a state and an action are input. "Transition probability" is a function that, given the current state and an action, outputs the probability of transitioning to each state.
The decision-making process using an MDP basically repeats (a) a step of selecting and executing an action based on the current state and (b) a step of obtaining a reward and the next state as a result of the executed action. The next state is determined by the current state, the action, and the transition probability. At the start of the decision-making process, the reward and the transition probability are unknown, and the decision-making process repeats steps (a) and (b) while estimating them. The decision-making process then calculates the optimal policy based on the estimated reward and transition probability. The policy is a function that maps a state to an action; when a state is input, it outputs an action.
Data are required to estimate the reward and the transition probability. Therefore, in the initial stage of the decision-making process, various actions are tried in various states in order to collect data for estimation. In that process, a user's state may fall into a specific unfavorable state, more specifically an "irreparable state". Here, the "irreparable state" means a state in which the reward is zero or unacceptably small and from which no transition to another state is possible. Whether the reward is unacceptably small is determined in practice by whether the reward value is smaller than a predetermined threshold value. Once a user falls into such an "irreparable state", the user can no longer return to a normal state in which rewards can be obtained.
As a specific example, consider a system that continuously provides a service to users, such as a mobile phone service or a membership-based e-commerce site. In this case, the decision-making process determines appropriate policies for enhancing the users' long-term satisfaction and executes actions on the users, thereby obtaining payments, contract renewals, and the like from the users. In this example, the "irreparable state" is the state in which the user has canceled the contract, or more precisely, the state in which the user has canceled and no subsequent action can win the contract back. That is, the decision-making process must avoid driving users to cancellation while it tries various actions to increase their satisfaction. If there is an "action that always leads to cancellation", that is, if it turns out that executing a certain action in a certain state always makes the user cancel, the decision-making process must avoid executing that action in that state.
Therefore, in the present embodiment, policies are determined and actions are executed for each user so that users avoid falling into an "irreparable state" as much as possible. Specifically, the decision-making process of the present embodiment is executed by combining the following three approaches.
(Approach A): Handle a plurality of users at the same time.
If there is only one user and that user falls into an "irreparable state" while various actions are being tried, there is no chance of recovery. Therefore, a plurality of users are handled at the same time; as the number of users increases, opportunities for recovery arise. However, if the user "state" has only two values, "not canceled" and "canceled", and there is an "action that always leads to cancellation", every user will eventually try that action and cancel the contract. Therefore, Approach A alone is not sufficient.
(Approach B): Share the action histories of all users and treat them as a single problem.
When a user falls into an "irreparable state", the state immediately before it and the action executed at that time are stored, and the decision-making process no longer selects that action in that state. As a result, although one user falls into the "irreparable state", other users can be prevented from falling into it afterward. Even in this case, however, the possibility remains that all users fall into an "irreparable state" at the same time: if the decision-making process is configured to execute the same "action" for all users, it may execute an "action that always leads to cancellation" for all users simultaneously. Therefore, Approaches A and B alone are not sufficient.
(Approach C): Spread the policies applied to the users to some extent.
As described above, a "policy" is a function that outputs an "action" when a "state" is input. The decision-making process calculates the policy using the reward and transition probability estimated from past data. By giving randomness to at least one of the reward and the transition probability, the calculated policies themselves acquire randomness, so that different actions are performed for different users who are in the same state. This makes it far less likely that an "action that always leads to cancellation" is executed for all users at the same time. As a result, all users can be prevented from falling into an "irreparable state" at the same time.
At this time, it is preferable to change the degree of randomness given to the reward and the transition probability according to the confidence of their estimation. As described above, the decision-making process accumulates the action histories of a plurality of users by repeating (a) a step of selecting and executing an action based on the current state and (b) a step of obtaining a reward and the next state as a result of the executed action, and estimates the reward and the transition probability from those histories. When the accumulated action history is small and the confidence of the estimation is low, the decision-making process strengthens the randomness given to the reward and the transition probability; when the accumulated action history is large and the confidence is high, it weakens the randomness. In the early stage of the decision-making process, little action history has accumulated, so the randomness given to the reward and the transition probability is strong and various actions are tried. As the accumulated action history grows and the reliability of the estimation increases, the randomness becomes weaker, and actions closer to the optimum are selected and executed more often.
As described above, in the present embodiment, Approaches A to C are executed in combination so that users avoid falling into an "irreparable state" as much as possible.
[First Embodiment]
Next, the first embodiment of the present invention will be described. The decision-making device according to the first embodiment basically repeats (a) a step of selecting and executing an action based on the current state and (b) a step of obtaining a reward and the next state as a result of the executed action. In addition, the decision-making device updates the policy for each user at predetermined timings based on the action histories of the plurality of users obtained in the meantime. Further, when a user falls into an "irreparable state", the decision-making device no longer uses the policy that was applied at that time. Specifically, the decision-making device stores the state immediately before the user fell into the "irreparable state" and the action executed at that time, and thereafter does not execute that action in that state. As a result, the decision-making device can take appropriate actions for each user while avoiding, as much as possible, users falling into an "irreparable state". The decision-making apparatus according to the first embodiment is described in detail below.
(Hardware configuration)
FIG. 1 is a block diagram showing the hardware configuration of the decision-making device according to the first embodiment. As illustrated, the decision-making device 1 includes a communication unit 11, a processor 12, a memory 13, a recording medium 14, and an action history database (DB) 15.
The communication unit 11 communicates with external devices. Specifically, the communication unit 11 transmits actions determined by the decision-making device 1 to each user's terminal device and receives each user's state from that terminal device.
The processor 12 is a computer such as a CPU (Central Processing Unit) and controls the entire decision-making device 1 by executing a program prepared in advance. The memory 13 is composed of a ROM (Read Only Memory), a RAM (Random Access Memory), and the like. The memory 13 stores various programs executed by the processor 12 and is also used as a working memory while the processor 12 executes various processes.
The recording medium 14 is a non-volatile, non-transitory recording medium such as a disk-shaped recording medium or a semiconductor memory, and is configured to be removable from the decision-making device 1. The recording medium 14 records various programs executed by the processor 12. When the decision-making device 1 executes the decision-making process, a program recorded in the recording medium 14 is loaded into the memory 13 and executed by the processor 12.
The action history DB 15 stores the action histories of a plurality of users. An action history includes a user's "state", "action", and "reward". The action history is accumulated in the action history DB 15 while the decision-making device 1 repeats (a) a step of selecting and executing an action based on the current state and (b) a step of obtaining a reward and the next state as a result of the executed action.
(Functional configuration)
FIG. 2 is a block diagram showing the functional configuration of the decision-making device 1. Functionally, the decision-making device 1 includes a data storage unit 21, a policy management unit 22, an MDP estimation unit 23, a randomization unit 24, a policy calculation unit 25, and a policy execution unit 26. The data storage unit 21 is realized by the action history DB 15. The policy management unit 22, the MDP estimation unit 23, the randomization unit 24, the policy calculation unit 25, and the policy execution unit 26 are realized by the processor 12 executing a program prepared in advance.
The data storage unit 21 receives each user's state from an external device and acquires, from the policy management unit 22, the action executed for each user and the reward obtained by that action. The data storage unit 21 then stores these as each user's action history, and thereby accumulates the entire action history of all users targeted by the decision-making device 1.
The policy management unit 22 determines the current policy for each user based on the entire action history of each user stored in the data storage unit 21. Specifically, the policy management unit 22 first acquires the entire action history of each user from the data storage unit 21 and supplies it to the MDP estimation unit 23.
The MDP estimation unit 23 uses the entire action history of each user to estimate the parameters of the MDP (hereinafter, "MDP parameters"), specifically the "reward" and the "transition probability". The MDP estimation unit 23 then supplies the estimated MDP parameters (hereinafter, "estimated MDP parameters") and the number of action-history samples used for the estimation to the randomization unit 24. That is, the MDP estimation unit 23 supplies the estimated value of the reward and the estimated value of the transition probability to the randomization unit 24.
The randomization unit 24 randomizes the estimated MDP parameters. Specifically, the randomization unit 24 imparts randomness by adding noise to the estimated MDP parameters. It suffices for the randomization unit 24 to randomize at least one of the estimated MDP parameters, that is, at least one of the estimated reward and the estimated transition probability supplied from the MDP estimation unit 23. The randomization unit 24 adds noise to the estimated MDP parameters independently for each user, so the randomized estimated MDP parameters (hereinafter, "randomized MDP parameters") generally differ from user to user. The randomization unit 24 then supplies the randomized MDP parameters to the policy calculation unit 25.
As described above, the randomization unit 24 preferably changes the strength of the randomness given to the estimated MDP parameters according to the number of action-history samples used for the estimation: it strengthens the randomness when the number of samples is small and weakens it when the number of samples is large.
The policy calculation unit 25 calculates a policy for each user using the randomized MDP parameters supplied from the randomization unit 24. For example, the policy calculation unit 25 calculates an optimal policy by value iteration using the reward and the transition probability. Because at least one of the reward and the transition probability has been randomized, the policy calculated for each user has randomness, and because the randomization unit 24 randomizes independently for each user, the policies calculated for different users differ from one another. As a result, the same policy is not applied to all users at the same time, which avoids all users simultaneously falling into an "irreversible state". The policy calculation unit 25 supplies the calculated policy for each user to the policy management unit 22.
The policy management unit 22 supplies the policy calculated by the policy calculation unit 25 for each user to the policy execution unit 26. However, the policy management unit 22 excludes any policy that has, even once in the past, driven a user into the "irreversible state", and does not supply such a policy to the policy execution unit 26.
Specifically, the policy management unit 22 stores in advance the specific state corresponding to the "irreversible state"; it may store a plurality of such states. The policy management unit 22 refers to the users' action histories stored in the data storage unit 21 and records, as an "NG policy", any policy that has caused a user to transition to the irreversible state in the past. If a policy supplied from the policy calculation unit 25 for some user is an NG policy, the policy management unit 22 does not apply it; only policies other than NG policies are applied to the users. That is, among the policies supplied from the policy calculation unit 25, the policy management unit 22 supplies only those other than NG policies to the policy execution unit 26. This prevents an NG policy from being applied to a user and that user from falling into the irreversible state.
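The exclusion step can be sketched roughly as follows (a hypothetical illustration, not taken from the publication; it reuses the HistoryEntry records sketched earlier, and a policy is represented as a hashable tuple mapping each state index to an action so that it can be stored in a set):

# States registered in advance as irreversible.
IRREVERSIBLE_STATES = {1}

# Policies that have, at least once, driven some user into an irreversible state.
ng_policies: set = set()

def record_ng_policies(history, applied_policies):
    """Mark as NG every policy under which some user reached an irreversible state."""
    for entry in history:
        if entry.next_state in IRREVERSIBLE_STATES:
            ng_policies.add(applied_policies[entry.user_id])

def filter_policies(per_user_policies):
    """Drop NG policies before they are handed to the policy execution unit."""
    return {user: pi for user, pi in per_user_policies.items() if pi not in ng_policies}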
The policy execution unit 26 receives the policy for each user from the policy management unit 22 and acquires each user's current state from the data storage unit 21. The policy execution unit 26 then determines and executes an action for each user based on that user's policy and current state. As described above, a "policy" is a function that outputs an "action" when a "state" is input, so the policy execution unit 26 determines the action for each user by inputting the user's current state into the user's policy, and executes it.
As described above, in the decision-making device 1 of the present embodiment, the randomization unit 24 randomizes the estimated MDP parameters independently for each user and the policy calculation unit 25 calculates each user's policy from the randomized MDP parameters, so the same policy is never applied to all users at the same time. This prevents an NG policy from being applied to all users simultaneously and all users falling into the "irreversible state" at once. In addition, because the policy management unit 22 does not apply an NG policy that has put a user into the irreversible state in the past, one or a very small number of users may be sacrificed, but the many users that follow are prevented from falling into the irreversible state.
(Specific example of the policy calculation method)
Next, a specific example of the policy calculation method performed by the decision-making device 1 will be described with reference to FIG. 2. The following calculation is executed for each user, and a policy is determined for each user.
The policy management unit 22 acquires the users' action histories stored in the data storage unit 21 and supplies them to the MDP estimation unit 23. The MDP estimation unit 23 uses the action histories to estimate the reward and the transition probability as MDP parameters. Specifically, the MDP estimation unit 23 calculates the estimated reward r^(s, a) for performing an action a in a state s by the following equation (1).

  r^(s, a) = R(s, a) / N(s, a)   ... (1)

Here, R(s, a) is the sum, over the action histories of all users, of the rewards obtained when action a was performed in state s, and N(s, a) is the number of times action a was performed in state s in the action histories of all users.
The MDP estimation unit 23 also calculates the estimated probability p^(s'|s, a) that the state transitions to a state s' as a result of performing action a in state s by the following equation (2).

  p^(s'|s, a) = P(s, a, s') / N(s, a)   ... (2)

Here, P(s, a, s') is the number of times, in the action histories of all users, that the state transitioned to state s' as a result of performing action a in state s.
The MDP estimation unit 23 supplies the estimated reward r^(s, a) and the estimated transition probability p^(s'|s, a) obtained in this way to the randomization unit 24 as the estimated MDP parameters. In addition, the MDP estimation unit 23 supplies the number of action-history samples N(s, a) used for the estimation to the randomization unit 24.
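A minimal sketch of equations (1) and (2) as count-based estimates is given below (hypothetical code, not from the publication; it assumes states and actions are encoded as small integer indices and reuses the HistoryEntry records introduced earlier):

import numpy as np

def estimate_mdp(history, n_states, n_actions):
    """Estimate r^(s, a) and p^(s'|s, a) from the action histories of all users."""
    R = np.zeros((n_states, n_actions))             # total reward per (s, a)
    N = np.zeros((n_states, n_actions))             # visit count per (s, a)
    P = np.zeros((n_states, n_actions, n_states))   # transition count per (s, a, s')
    for e in history:
        R[e.state, e.action] += e.reward
        N[e.state, e.action] += 1
        P[e.state, e.action, e.next_state] += 1
    N_safe = np.maximum(N, 1)                       # avoid division by zero for unvisited pairs
    r_hat = R / N_safe                              # equation (1)
    p_hat = P / N_safe[:, :, None]                  # equation (2)
    return r_hat, p_hat, N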
The randomization unit 24 randomizes at least one of the estimated reward r^(s, a) and the estimated transition probability p^(s'|s, a).
(A) When only the estimated reward is randomized
The randomization unit 24 adds noise to the estimated reward r^(s, a) to generate a randomized estimated reward r~(s, a). The noise is determined so that the randomized estimated reward r~(s, a) becomes a random variable following a normal distribution whose mean is the estimated reward r^(s, a) and whose standard deviation σ is given by the following equation (3).

  [Equation (3): σ as a function of S, A, t, δ and N(s, a)]

In equation (3), "S" and "A" are the total numbers of states and actions, respectively, "t" is the count of iterations, and "δ" is a hyperparameter. The standard deviation σ may instead be given by the following equation (4) so that it does not depend on the iteration count t.

  [Equation (4): σ as a function of β and N(s, a)]

In equation (4), "β" is a hyperparameter.
In equations (3) and (4), when the number of action-history samples N(s, a) is small, the standard deviation σ becomes large and the randomness of the estimated reward becomes strong; when N(s, a) is large, σ becomes small and the randomness becomes weak. Thus, as described above, the randomness given to the estimated reward changes according to the number of action-history samples. In this case, the estimated transition probability p^(s'|s, a) is not randomized and is used as it is.
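A hedged sketch of case (A) follows. The closed form of σ is not reproduced above, so the expression sqrt(β / N(s, a)) used below is an assumption consistent with equation (4) (β is a hyperparameter, and the noise weakens as N(s, a) grows); the draw is made independently for each user:

import numpy as np

def randomize_reward(r_hat, N, beta, rng):
    """Case (A): add per-user Gaussian noise to the estimated reward."""
    sigma = np.sqrt(beta / np.maximum(N, 1))     # assumed form of equation (4)
    return rng.normal(loc=r_hat, scale=sigma)    # mean r^(s, a), standard deviation sigma

# One independent draw per user, for example:
# rng = np.random.default_rng()
# r_tilde_for_user_k = randomize_reward(r_hat, N, beta=1.0, rng=rng)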
(B) When only the transition probability is randomized
The randomization unit 24 adds noise to the estimated transition probability p^(s'|s, a) to generate a randomized estimated transition probability p~(s'|s, a). The noise is determined so that the randomized estimated transition probability p~(s'|s, a) becomes a random variable following a Dirichlet distribution with parameter α; that is, the noise is determined using random numbers generated from a Dirichlet distribution. The parameter "α" is an A-dimensional parameter, and the value of the dimension corresponding to action "a" is N(s, a, s') + 1, where N(s, a, s') is the number of times, in the action histories of all users, that the state transitioned to s' as a result of performing action a in state s, and "A" is the total number of actions as described above. In this case as well, the randomness of the estimated transition probability becomes strong when the number of action-history samples N(s, a, s') is small and weak when it is large, so the randomness given to the estimated transition probability changes according to the number of action-history samples. In this case, the estimated reward r^(s, a) is not randomized and is used as it is.
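The Dirichlet draw of case (B) can be sketched as follows (a hypothetical illustration; the indexing used here, with one concentration value N(s, a, s') + 1 per next state s', is a common construction assumed for this sketch, and one independent draw is made per user):

import numpy as np

def randomize_transition(P_counts, rng):
    """Case (B): draw a randomized next-state distribution p~(.|s, a) from a Dirichlet."""
    n_states, n_actions, _ = P_counts.shape
    p_tilde = np.zeros_like(P_counts, dtype=float)
    for s in range(n_states):
        for a in range(n_actions):
            alpha = P_counts[s, a, :] + 1.0      # assumed concentration: N(s, a, s') + 1
            p_tilde[s, a, :] = rng.dirichlet(alpha)
    return p_tilde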
(C) When both the reward and the transition probability are randomized
In this case, the randomization unit 24 performs both (A) and (B) above to calculate the randomized estimated reward r~(s, a) and the randomized estimated transition probability p~(s'|s, a).
When the randomization unit 24 has randomized at least one of the reward and the transition probability to generate the randomized MDP parameters as described above, the policy calculation unit 25 uses the randomized MDP parameters to calculate an optimal policy by value iteration.
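A compact sketch of this value-iteration step is shown below (hypothetical code; the discount factor gamma and the convergence tolerance are assumptions, since they are not specified here):

import numpy as np

def value_iteration(r, p, gamma=0.95, tol=1e-6):
    """Compute a greedy policy from reward r[s, a] and transitions p[s, a, s']."""
    n_states, n_actions = r.shape
    v = np.zeros(n_states)
    while True:
        q = r + gamma * (p @ v)        # q[s, a] = r[s, a] + gamma * sum over s' of p[s, a, s'] * v[s']
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            break
        v = v_new
    return q.argmax(axis=1)            # policy: one action per state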
(Decision-making process)
Next, the decision-making process according to the present embodiment will be described. FIG. 3 is a flowchart of the decision-making process according to the first embodiment. This process is realized by the processor 12 shown in FIG. 1 executing a program prepared in advance and operating as the elements shown in FIG. 2.
First, the decision-making device 1 calculates a policy for each user (step S11). At the start of the decision-making process, the data storage unit 21 basically contains no action history for any user, so the policy for each user is calculated based on randomized MDP parameters to which the randomization unit 24 has given strong randomness.
Next, the decision-making device 1 applies the calculated policy for each user once (step S12). Specifically, the policy execution unit 26 determines and executes an action for each user based on that user's policy and current state. By the execution of the action, each user transitions to the next state.
Next, the decision-making device 1 detects the state of each user after the action has been executed and determines whether any user has fallen into the "irreversible state" (step S13). As described above, the decision-making device 1 stores in advance the specific state corresponding to the irreversible state, so it determines whether a user's state after the action matches that specific state. If a user has fallen into the irreversible state (step S13: Yes), the process returns to step S11 and the decision-making device 1 updates the policy for each user. At this time, the policy management unit 22 stores the policy that caused the user to transition to the irreversible state as an NG policy and thereafter prohibits applying that NG policy, in the same state, to any user. That is, the policies subsequently applied in step S12 are policies other than that NG policy.
On the other hand, if no user has fallen into the irreversible state (step S13: No), the decision-making device 1 determines whether the users' action histories have increased sufficiently (step S14). Specifically, this determination is made by the policy management unit 22 observing each user's action history. If the action histories have not increased sufficiently (step S14: No), the process returns to step S12 and steps S12 to S14 are executed again. That is, the decision-making device 1 determines and executes actions using the same policies and each user's new "state", and the executed actions are added to the users' action histories. Steps S12 to S14 are repeated in this way until the action histories have increased sufficiently. When the action histories have increased sufficiently (step S14: Yes), the process returns to step S11 and the policy for each user is updated using the enlarged action histories. In other words, step S14 determines whether the time to update the policies has arrived.
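The loop of FIG. 3 might be orchestrated as sketched below (hypothetical code; method names such as calculate_policies, apply_policies_once, and history_grew_enough merely stand for the processing described above and are not taken from the publication):

def decision_making_loop(device):
    """Steps S11 to S14: calculate, apply, watch for irreversible states, and update policies."""
    while True:
        policies = device.calculate_policies()            # step S11: randomized MDP -> per-user policies
        policies = device.exclude_ng_policies(policies)   # NG policies are never handed out
        while True:
            device.apply_policies_once(policies)          # step S12: one action per user
            if device.any_user_in_irreversible_state():   # step S13
                device.register_ng_policies(policies)     # remember the offending policy
                break                                     # back to step S11 to recompute policies
            if device.history_grew_enough():              # step S14: time to update the policies
                break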
Whether the users' action histories have increased sufficiently can be determined, for example, by the following method. As one example, when any user k satisfies the following expression (5), the decision-making device 1 determines that the action histories have not increased sufficiently (step S14: No).

  [Equation (5): an inequality relating v(s, π_k(s)) and N^(s, π_k(s))]

Here, π_k is the policy of user k, π_k(s) is the action taken in state s under the policy of user k, v(s, π_k(s)) is the total number of times, over all users, that the action π_k(s) has been performed in state s since the policy was last updated, and N^(s, π_k(s)) is the total number of times, over all users, that the action π_k(s) was performed in state s before the last policy update.
As another example, in step S14 the decision-making device 1 may determine whether the same policy has been applied a predetermined number of times (for example, X times). In this case, the determination in step S14 remains "No" until the same policy has been applied X times, and steps S12 to S14 are repeated. When the same policy has been applied X times, the determination in step S14 becomes "Yes", the process returns to step S11, and the policy is updated.
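Both criteria for step S14 can be sketched as follows. The inequality of equation (5) is not reproduced above, so the doubling-style reading below (the update is triggered once, for some user k and state s, the visits to (s, π_k(s)) since the last update reach the count available at that update) is only an assumption; the fixed-count alternative is also shown:

def history_grew_enough(v, N_prev, per_user_policies):
    """Assumed reading of equation (5): update once some (s, pi_k(s)) has been visited
    as many times since the last update as before it."""
    for k, pi in per_user_policies.items():
        for s, a in enumerate(pi):            # pi is a tuple: one action per state
            if v[s, a] >= max(1, N_prev[s, a]):
                return True
    return False

def applied_enough_times(applications, X):
    """Alternative criterion: the same policy has been applied X times."""
    return applications >= X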
As described above, in the decision-making process of the present embodiment, the policy for each user is calculated and updated in step S11 based on randomized MDP parameters, so the same policy is never applied to all users at the same time and all users never execute an NG policy simultaneously. When it is detected in step S13 that a user has fallen into the irreversible state, the policy of each user is updated and subsequent application of the NG policy that caused it is prohibited, which prevents other users from later falling into the irreversible state through the same policy. Furthermore, when it is determined in step S14 that the users' action histories have increased sufficiently, the decision-making device 1 updates the policy for each user using those histories. Therefore, as the users' action-history data grows, the estimation of the MDP parameters by the MDP estimation unit 23 is progressively refined and more appropriate actions come to be executed for each user.
(Application examples)
Next, examples in which the decision-making device according to the embodiment is applied to actual systems will be described. FIG. 4 shows a first example of a system to which the decision-making device of the embodiment is applied. This system provides a service to users and includes a server 50 and a plurality of user terminals 60. The server 50 operates as the decision-making device 1 of the embodiment by executing a program prepared in advance. The user terminal 60 is a terminal prepared for each user, such as a PC (Personal Computer), a mobile terminal, a tablet, or a smartphone.
In operation, the server 50 notifies the user terminal 60 of the action determined by the policy execution unit 26 of the decision-making device 1, and the user terminal 60 executes the received action. For example, as an action, the user terminal 60 proposes a new service or plan to the user. When the user subscribes to that service or plan, the user terminal 60 transmits to the server 50 a state indicating that the user has subscribed. The transmitted state is stored in the data storage unit 21 as part of the user's action history. In this way, appropriate measures can be taken for the users by means of the decision-making device 1, improving user satisfaction and the sales and profits of the service provider.
FIG. 5 shows a second example of a system to which the decision-making device of the embodiment is applied. In the second example as well, the system provides a service to users and includes a server 70 and a plurality of user terminals 80. However, in the second example, the decision-making device 1x configured by the server 70 does not have a policy execution unit. Instead, each user terminal 80 includes an AI (Artificial Intelligence) agent 81, and the AI agent 81 operates as a policy execution unit 82. The AI agent 81 is realized by the computer constituting the user terminal 80 executing a program prepared in advance.
In operation, the server 70 transmits the policy determined by the policy management unit 22 of the decision-making device 1x to the user terminal 80, which receives it. As described above, a policy is a function that outputs an action when a state is input, so the policy execution unit 82 determines and executes an action based on the received policy and the user's current state. For example, as an action, the user terminal 80 proposes a new service or plan to the user. When the user subscribes to that service or plan, the user terminal 80 transmits to the server 70 a state indicating that the user has subscribed. The transmitted state is stored in the data storage unit 21 as part of the user's action history. In this way, appropriate measures can be taken for the users by means of the decision-making device 1x, improving user satisfaction and the sales and profits of the service provider.
(Effects)
Next, the effects obtained by the decision-making process of the embodiment will be described. FIG. 6(A) shows a state transition diagram of an environment to which the decision-making process is applied. There are two states, "p" and "q", and actions start from state p. The actions a user can take are "x", "y", and "z", and the state transitions are as shown in FIG. 6(A): performing action x or y in state p returns the user to state p, performing action z in state p moves the user to state q, and in state q the state returns to q whichever of actions x, y, and z the user performs. In practice, each user is accompanied by an AI agent, and the AI agent determines the user's actions.
FIG. 6(B) shows the reward obtained by each action. In state p, the user obtains a reward of "1" for action x, "2" for action y, and "3" for action z. In state q, the reward is "0" for every action. State q therefore corresponds to the "irreversible state": its reward is "0" and no transition to any other state is possible from it.
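The two-state environment of FIG. 6 can be written down directly, as in the small sketch below (the numeric encoding of states and actions is an assumption made only for illustration):

import numpy as np

# States: 0 = p, 1 = q (q is the irreversible state). Actions: 0 = x, 1 = y, 2 = z.
reward = np.array([[1.0, 2.0, 3.0],    # rewards in state p for x, y, z (FIG. 6(B))
                   [0.0, 0.0, 0.0]])   # every action in state q yields reward 0
transition = np.zeros((2, 3, 2))
transition[0, 0, 0] = 1.0   # x in p -> stay in p
transition[0, 1, 0] = 1.0   # y in p -> stay in p
transition[0, 2, 1] = 1.0   # z in p -> fall into q
transition[1, :, 1] = 1.0   # any action in q -> stay in q (no way back)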
Now suppose there are multiple users. Existing techniques give no particular consideration to the "irreversible state", so eventually every user selects action z in state p and all of them fall into the irreversible state q. Specifically, if the users learn independently, each of them will sooner or later try action z in state p and fall into state q. If, instead, the users share information and learn together, they basically all adopt the same policy, so at some point all of them try action z in state p at the same time and fall into state q together.
In contrast, according to the decision-making process of the embodiment described above, the actions taken by the users are dispersed by the randomization of the MDP parameters, so the users never all take action z in state p at the same time. Moreover, because information is shared among the users, when one user falls into state q, that information is reflected in all subsequent policy decisions, which prevents other users from falling into state q in the same way afterwards. Furthermore, since each user's AI agent advances its learning based on the users' action histories as described above, the policies adopted by the AI agents are eventually optimized and only optimal actions come to be selected.
[Second Embodiment]
Next, a second embodiment of the present invention will be described. FIG. 6 is a block diagram showing the functional configuration of a decision-making device 90 according to the second embodiment. As shown in the figure, the decision-making device 90 includes an estimation unit 91, a randomization unit 92, a policy calculation unit 93, and a policy management unit 94.
The estimation unit 91 estimates the parameters of a model that defines transitions of a user's state based on the action histories of a plurality of users; one example of such a model is an MDP. The randomization unit 92 gives randomness, for each user, to the parameters estimated by the estimation unit 91; by giving randomness, the parameters are no longer identical for all users. The policy calculation unit 93 calculates a policy for each user using the randomized parameters; a policy is a function that determines the action for each user. The policy management unit 94 then excludes, from the policies to be applied to each user, any policy calculated by the policy calculation unit 93 that has caused a user's state to transition to a predetermined irreversible state. In this way, the decision-making device 90 can take appropriate actions for each user while avoiding, as far as possible, users falling into the irreversible state.
Part or all of the above embodiments may also be described as in the following supplementary notes, but are not limited thereto.
(Appendix 1)
A decision-making device comprising:
an estimation unit that estimates parameters of a model defining transitions of a user's state, based on the action histories of a plurality of users;
a randomization unit that gives randomness to the estimated parameters for each user;
a policy calculation unit that calculates, for each user and using the randomized parameters, a policy, which is a function that determines an action for the user; and
a policy management unit that excludes, from the policies applied to each user, any calculated policy that has caused a user's state to transition to a predetermined irreversible state.
(Appendix 2)
The decision-making device according to Appendix 1, wherein the user's action history includes the user's state and an action for the user, and the policy is a function that outputs an action for the user when the user's state is input.
(Appendix 3)
The decision-making device according to Appendix 1 or 2, wherein the policy management unit refers to the action histories of the plurality of users and, when detecting that a user's state has transitioned to the irreversible state, excludes the policy that caused the transition from the policies applied to each user.
(Appendix 4)
The decision-making device according to any one of Appendices 1 to 3, wherein the policy management unit stores in advance one or more states corresponding to the irreversible state.
(Appendix 5)
The decision-making device according to any one of Appendices 1 to 4, wherein the irreversible state is a state in which the reward obtained by executing an action is zero or not more than a predetermined value and from which no transition to another state can be made by executing an action.
(Appendix 6)
The decision-making device according to any one of Appendices 1 to 5, wherein the parameters of the model include the user's state, an action for the user, a reward, and a transition probability, and the randomization unit randomizes at least one of the reward and the transition probability.
(Appendix 7)
The decision-making device according to Appendix 6, wherein the randomization unit strengthens the randomness given to the parameters as the number of samples of the user's action history becomes smaller, and weakens it as the number of samples becomes larger.
(Appendix 8)
The decision-making device according to any one of Appendices 1 to 7, further comprising a policy execution unit that determines and executes an action for each user based on the policy applied to that user.
(Appendix 9)
The decision-making device according to any one of Appendices 1 to 8, wherein the policy calculation unit updates the policy for each user every time the policy has been applied a predetermined number of times.
(Appendix 10)
A decision-making method comprising:
estimating parameters of a model defining transitions of a user's state, based on the action histories of a plurality of users;
giving randomness to the estimated parameters for each user;
calculating, for each user and using the randomized parameters, a policy, which is a function that determines an action for the user; and
excluding, from the policies applied to each user, any calculated policy that has caused a user's state to transition to a predetermined irreversible state.
(Appendix 11)
A recording medium recording a program that causes a computer to execute processing comprising:
estimating parameters of a model defining transitions of a user's state, based on the action histories of a plurality of users;
giving randomness to the estimated parameters for each user;
calculating, for each user and using the randomized parameters, a policy, which is a function that determines an action for the user; and
excluding, from the policies applied to each user, any calculated policy that has caused a user's state to transition to a predetermined irreversible state.
Although the present invention has been described above with reference to the embodiments and examples, the present invention is not limited to them. Various changes that can be understood by those skilled in the art can be made to the configuration and details of the present invention within the scope of the present invention.
1, 1x, 90 Decision-making device
11 Communication unit
12 Processor
13 Memory
14 Recording medium
15 Action history database
21 Data storage unit
22, 94 Policy management unit
23 MDP estimation unit
24, 92 Randomization unit
25, 93 Policy calculation unit
26 Policy execution unit

Claims (11)

1. A decision-making device comprising:
an estimation unit that estimates parameters of a model defining transitions of a user's state, based on the action histories of a plurality of users;
a randomization unit that gives randomness to the estimated parameters for each user;
a policy calculation unit that calculates, for each user and using the randomized parameters, a policy, which is a function that determines an action for the user; and
a policy management unit that excludes, from the policies applied to each user, any calculated policy that has caused a user's state to transition to a predetermined irreversible state.
2. The decision-making device according to claim 1, wherein the user's action history includes the user's state and an action for the user, and the policy is a function that outputs an action for the user when the user's state is input.
3. The decision-making device according to claim 1 or 2, wherein the policy management unit refers to the action histories of the plurality of users and, when detecting that a user's state has transitioned to the irreversible state, excludes the policy that caused the transition from the policies applied to each user.
4. The decision-making device according to any one of claims 1 to 3, wherein the policy management unit stores in advance one or more states corresponding to the irreversible state.
5. The decision-making device according to any one of claims 1 to 4, wherein the irreversible state is a state in which the reward obtained by executing an action is zero or not more than a predetermined value and from which no transition to another state can be made by executing an action.
6. The decision-making device according to any one of claims 1 to 5, wherein the parameters of the model include the user's state, an action for the user, a reward, and a transition probability, and the randomization unit randomizes at least one of the reward and the transition probability.
7. The decision-making device according to claim 6, wherein the randomization unit strengthens the randomness given to the parameters as the number of samples of the user's action history becomes smaller, and weakens it as the number of samples becomes larger.
8. The decision-making device according to any one of claims 1 to 7, further comprising a policy execution unit that determines and executes an action for each user based on the policy applied to that user.
9. The decision-making device according to any one of claims 1 to 8, wherein the policy calculation unit updates the policy for each user every time the policy has been applied a predetermined number of times.
10. A decision-making method comprising:
estimating parameters of a model defining transitions of a user's state, based on the action histories of a plurality of users;
giving randomness to the estimated parameters for each user;
calculating, for each user and using the randomized parameters, a policy, which is a function that determines an action for the user; and
excluding, from the policies applied to each user, any calculated policy that has caused a user's state to transition to a predetermined irreversible state.
11. A recording medium recording a program that causes a computer to execute processing comprising:
estimating parameters of a model defining transitions of a user's state, based on the action histories of a plurality of users;
giving randomness to the estimated parameters for each user;
calculating, for each user and using the randomized parameters, a policy, which is a function that determines an action for the user; and
excluding, from the policies applied to each user, any calculated policy that has caused a user's state to transition to a predetermined irreversible state.
PCT/JP2019/019636 2019-05-17 2019-05-17 Decision-making device, decision-making method, and storage medium WO2020234913A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2019/019636 WO2020234913A1 (en) 2019-05-17 2019-05-17 Decision-making device, decision-making method, and storage medium
JP2021520487A JP7279782B2 (en) 2019-05-17 2019-05-17 Decision-making device, decision-making method, and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2019/019636 WO2020234913A1 (en) 2019-05-17 2019-05-17 Decision-making device, decision-making method, and storage medium

Publications (1)

Publication Number Publication Date
WO2020234913A1 true WO2020234913A1 (en) 2020-11-26

Family

ID=73459213

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2019/019636 WO2020234913A1 (en) 2019-05-17 2019-05-17 Decision-making device, decision-making method, and storage medium

Country Status (2)

Country Link
JP (1) JP7279782B2 (en)
WO (1) WO2020234913A1 (en)

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8788439B2 (en) * 2012-12-21 2014-07-22 InsideSales.com, Inc. Instance weighted learning machine learning model
WO2019021401A1 (en) * 2017-07-26 2019-01-31 日本電気株式会社 Reinforcement learning device, reinforcement learning method, and reinforcement learning program recording medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
STREHL ALEXANDER L. ET AL.: "An Empirical Evaluation of Interval Estimation for Markov Decision Processes", PROCEEDINGS OF 16TH IEEE INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE, 17 November 2004 (2004-11-17), pages 128 - 135, XP010759667, Retrieved from the Internet <URL:https://ieeexplore.ieee.org/document/1374179> [retrieved on 20190808] *

Also Published As

Publication number Publication date
JP7279782B2 (en) 2023-05-23
JPWO2020234913A1 (en) 2020-11-26

Similar Documents

Publication Publication Date Title
US7882075B2 (en) System, method and program product for forecasting the demand on computer resources
US20080255910A1 (en) Method and System for Adaptive Project Risk Management
US20130124258A1 (en) Methods and Systems for Identifying Customer Status for Developing Customer Retention and Loyality Strategies
CN111325417A (en) Method and device for realizing privacy protection and realizing multi-party collaborative updating of business prediction model
O'Neil et al. Newsvendor problems with demand shocks and unknown demand distributions
Fang et al. Effective confidence interval estimation of fault-detection process of software reliability growth models
US9524481B2 (en) Time series technique for analyzing performance in an online professional network
Hines et al. Preference elicitation for risky prospects
CN112494935B (en) Cloud game platform pooling method, electronic equipment and storage medium
WO2020234913A1 (en) Decision-making device, decision-making method, and storage medium
Bahati et al. Adapting to run-time changes in policies driving autonomic management
US10783449B2 (en) Continual learning in slowly-varying environments
US20210158257A1 (en) Estimating a result of configuration change(s) in an enterprise
Azaron et al. Lower bound for the mean project completion time in dynamic PERT networks
US20200364555A1 (en) Machine learning system
CN112312173A (en) Anchor recommendation method and device, electronic equipment and readable storage medium
Story et al. Anticipation and choice heuristics in the dynamic consumption of pain relief
EP2312516A1 (en) Denoising explicit feedback for recommender systems
CN113179224A (en) Traffic scheduling method and device for content distribution network
CN116362415B (en) Airport ground staff oriented shift scheme generation method and device
CN118154252A (en) Data prediction method, computer device, and storage medium
CN115659377B (en) Interface abnormal access identification method and device, electronic equipment and storage medium
Schwartz et al. Evaluating investments in disruptive technologies
CN117634855B (en) Project risk decision method, system, equipment and medium based on self-adaptive simulation
Papaioannou et al. Optimizing an incentives’ mechanism for truthful feedback in virtual communities

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19929377

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2021520487

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19929377

Country of ref document: EP

Kind code of ref document: A1