WO2022038655A1 - Value function derivation method, value function derivation device, and program - Google Patents

Value function derivation method, value function derivation device, and program Download PDF

Info

Publication number
WO2022038655A1
Authority
WO
WIPO (PCT)
Prior art keywords
value function
state
learner
function
environment
Prior art date
Application number
PCT/JP2020/030975
Other languages
French (fr)
Japanese (ja)
Inventor
匡宏 幸島
公海 高橋
浩之 戸田
Original Assignee
日本電信電話株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電信電話株式会社 (Nippon Telegraph and Telephone Corporation)
Priority to PCT/JP2020/030975 priority Critical patent/WO2022038655A1/en
Priority to JP2022543822A priority patent/JPWO2022038655A1/ja
Publication of WO2022038655A1 publication Critical patent/WO2022038655A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Definitions

  • the present invention relates to a value function derivation method, a value function derivation device, and a program.
  • The Markov decision process (MDP) is a general-purpose framework for handling sequential decision making and is widely used in reinforcement learning (RL) (Non-Patent Document 1).
  • Following successful examples of reinforcement learning in the game field (Non-Patent Document 2), in recent years various efforts have been made with the aim of applying it to real systems such as vehicle dispatch in ride sharing, mobile communication network operations, and traffic signal control.
  • The existing MDP does not consider the existence of external systems such as human intervention and risk-avoidance measures, so it cannot cope with situations in which an external system exists. Even if a planning algorithm (for example, value iteration or policy iteration) or a reinforcement learning algorithm (TD learning, Q-learning) built on the existing MDP is applied, the value function cannot be estimated accurately.
  • the present invention has been made in view of the above points, and an object of the present invention is to improve the estimation accuracy of the value function in a situation where an external system exists.
  • The learner makes decisions over a third set of states, obtained by excluding, from a first set of states in the environment, a second set of states in which the external system makes decisions based on a predetermined policy; in this Markov decision process, the computer executes a derivation procedure for deriving the learner's value function.
  • The motivation of this embodiment is to construct a new Markov decision process (MDP) that assumes the use of an external system.
  • The external system refers to human intervention or a predetermined policy for avoiding danger.
  • Such a framework is considered important not only for Safe RL but also when humans and machines must work together to achieve a goal. For example, consider the case of searching, by reinforcement learning, for the optimal way to control a chemical plant or an automobile. When the system enters a dangerous state, such as when a heavy load is placed on the plant's equipment or there is a possibility of an accident due to a damaged vehicle sensor, it can be useful to avoid the danger by handing over control to a human or to a predetermined workaround.
  • CeMDP (censored MDP) expresses the interaction among the learner, which is the acting agent, the environment in which the learner exists, and the external system (human intervention, risk-avoidance measures, etc.).
  • FIG. 1 shows the interaction of the three.
  • Depending on the learner's state, either the learner or the external system makes a decision.
  • the following algorithm is disclosed in order to derive the optimum value function and the optimum policy in CeMDP.
  • (i) When prior knowledge of the environment and the external system is available (the reward function R_M and the state transition probability P_M are known): an algorithm obtained by modifying an existing planning algorithm such as value iteration or policy iteration.
  • (ii) When no such prior knowledge is available (R_M and P_M are unknown, but rewards and (stochastic) state transitions can be observed through interaction such as simulation): an algorithm obtained by modifying a reinforcement learning algorithm such as Temporal Difference (TD) learning or Q-learning.
  • This embodiment describes an algorithm obtained by modifying value iteration (hereinafter "CenVI (Censored Value Iteration)") and an algorithm obtained by modifying Q-learning (hereinafter "CenQ (Censored Q-learning)").
  • In CeMDP, the algorithms are constructed for the case of a discrete state-action space and discrete time, but almost the same framework can be used even when a continuous state-action space or continuous time is considered.
  • For continuous state-action spaces, methods can be built on algorithms that use function approximation, such as Least-Squares TD learning (LSTD), Least-Squares TDQ learning (LSTDQ), and Least-Squares Policy Iteration (LSPI).
  • The Markov decision process is defined by a five-element tuple {S, A, P_M, R_M, γ}.
  • S = {1, 2, ..., |S|} represents the set of states, and A = {1, 2, ..., |A|} represents the set of actions.
  • P_M: S × A × S → [0,1] is the state transition probability, and the probability of transitioning from state s to state s' by action a is written P_M(s'|s, a).
  • R_M: S × A → R is the reward function, and the reward obtained when action a is executed in state s is written R_M(s, a).
  • γ ∈ [0,1) is the discount rate.
  • The state transition probability P_M and the reward function R_M under an action a are represented by the matrix P^a ∈ [0,1]^{|S|×|S|} and the vector R^a ∈ R^{|S|}, where R denotes the set of all real numbers.
  • π: S × A → [0,1] is the learner's policy (decision-making rule); π(a|s) represents the probability that the action a is executed in the state s.
  • At time t, the learner in the state s_t determines the action a_t according to the policy π(a_t|s_t).
  • The reward r_{t+1} received by the learner and the state s_{t+1} at the next time are determined according to the reward function R_M(s_t, a_t) and the state transition probability P_M(s_{t+1}|s_t, a_t).
  • The sequence of states visited by the learner, {s_t}_t, can be regarded as a Markov chain with transition probability matrix P̄^π, whose elements are [P̄^π]_ij = Σ_{a∈A} π(a|s=i)[P^a]_ij. (P̄ corresponds to the symbol written as P with a bar above it in the mathematical formulas described later.)
  • The state value function V^π_M and the action value function Q^π_M are defined as the expected value of the sum of discounted rewards obtained when actions are determined according to the policy π.
  • A policy π* that attains the optimal value for all state-action pairs (s, a) is called the optimal policy.
  • Planning algorithms such as value iteration and policy iteration and reinforcement learning algorithms such as Temporal Difference (TD) learning and Q-learning have been constructed, and the optimal policy can be obtained by using these algorithms.
  • An algorithm that finds the optimal policy when the reward function R_M and the state transition probability P_M are known is called a planning algorithm.
  • An algorithm that finds the optimal policy when rewards and (stochastic) state transitions can be observed through interaction such as simulation is called a reinforcement learning algorithm.
  • the state value function V ⁇ M can also be expressed as follows.
  • <Definition 1 (Option)> An option is defined by a triple <μ, β, I> consisting of a policy μ: S × A → [0,1], a termination condition β: S → [0,1], and an initiation set I ⊆ S ("Richard S Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2):181-211, 1999.").
  • Ordinary actions can also be regarded as options that terminate immediately after execution.
  • The set of options is denoted O.
  • The policy π represents the execution probability of options (π: S × O → [0,1]).
  • The interaction between the learner and the environment in OMDP is expressed as follows.
  • The learner in the state s_t executes an option o_t (whose initiation set I contains the state s_t) according to the policy π(o_t|s_t).
  • The action a_t to be executed next is determined according to the option's policy μ(a_t|s_t).
  • The learner who executed the action a_t receives the reward r_{t+1} according to the reward function R_M(s_t, a_t) and transitions to the state s_{t+1} at the next time according to the state transition probability P_M(s_{t+1}|s_t, a_t).
  • After the transition, the currently executing option terminates with probability β(s_{t+1}). If the option does not terminate and continues, the action at the next time is determined according to the policy μ(a_{t+1}|s_{t+1}) of the currently executing option; if it terminates, the next option is determined according to π(o_{t+1}|s_{t+1}).
  • the timing of decision-making (by policy ⁇ ) is not constant.
  • FIGS. 2(a) and 2(b) show examples of the decision-making timing of MDP and OMDP.
  • Each point in (a) indicates the timing of MDP decision making based on the policy π.
  • The white circles in (b) indicate the timing of OMDP decision making based on the policy π(o_t|s_t), and each point in (b) indicates the timing of decision making based on the option's policy μ.
  • The advantage of OMDP is that, by redefining the original reward function and state transition probability, which are defined in terms of actions, into ones defined in terms of options, the value function and the optimal policy can be obtained in almost the same way as in an ordinary MDP.
  • ρ(s', k) represents the probability that the option o terminates in the state s' after k steps.
  • t + k represents the (probabilistically determined) termination time of the option o.
  • Planning algorithms such as value iteration and policy iteration, created for ordinary MDPs, can be used to estimate the value function of OMDP and to acquire the optimal policy (Reference 1).
  • Reinforcement learning algorithms such as TD learning and Q-learning can also be used in OMDP.
  • When Q-learning is used, the estimate ^Q_O of the optimal value function Q*_O may be updated at the timing when an option terminates. (^Q corresponds to the symbol written as Q with a hat above it.)
  • η_t represents the learning rate, and t + k' represents the termination time of the option o_t.
  • The difference from Q-learning in an ordinary MDP is that the discounted cumulative sum of rewards over the k' steps is used.
  • CeMDP is defined by a seven-element tuple {S, A, P_M, R_M, γ, Ε, μ}.
  • S, A, P_M, R_M, and γ are the same as in the MDP: the set of states, the set of actions, the state transition probability, the reward function, and the discount rate, respectively.
  • Ε ⊆ S is the set of states in which the external system makes decisions.
  • μ: Ε × A → [0,1] is the fixed policy that the external system follows.
  • L denotes the set of states in which the learner makes decisions; L and Ε are disjoint and their union equals S, i.e., L is the set S excluding Ε.
  • μ will also be written π_e for convenience of explanation.
  • The interaction of the three parties shown in FIG. 1 is expressed as follows. At each time t, if the state s_t ∈ L, the action a_t is determined according to the learner's policy π_l(a_t|s_t); otherwise, if s_t ∈ Ε, it is determined according to the external system's policy μ(a_t|s_t).
  • The learner's policy π_l is the target of optimization in this embodiment. On the other hand, the policy μ of the external system is fixed.
  • The learner receives the reward r_{t+1} and transitions to the next state s_{t+1} according to the reward function R_M(s_t, a_t) and the state transition probability P_M(s_{t+1}|s_t, a_t).
  • FIG. 2 (c) shows an example of the learner's decision-making timing in CeMDP.
  • The white rectangles indicate the timing of decision making based on the learner's policy π_l(a_t|s_t).
  • Each point indicates the timing of decision making based on the external system's policy μ(a_t|s_t).
  • A subscript l or e denotes the vector obtained by extracting from a given vector the elements corresponding to the states in L or Ε, respectively.
  • The state transition probability P_C (over the state set L) in the redefined environment is given, in its matrix representation P^a_C under a given action a, by the equation of Theorem 1.
  • The above state transition probability P_C and reward function R_C can be obtained by matrix operations (using inverse matrix calculation) or by using power iteration.
  • π_l represents a policy over the states in the set L. Therefore, the optimal policy and the optimal value function can be obtained by the same algorithms as for MDP.
  • FIG. 4 is a diagram for explaining a processing procedure including the CenVI algorithm.
  • The redefined state transition probability P_C(s'|s, a) and reward function R_C(s, a) are first calculated, and then the optimal value function is obtained by CenVI, which is based on value iteration. Since the number of states over which value iteration is performed is reduced from |S| to |L|, value iteration only needs to be run over a smaller state space. Algorithms based on algorithms other than value iteration can be constructed in almost the same manner.
  • FIG. 5 is a diagram for explaining CenQ learning.
  • The estimate ^Q_C of the optimal action value function Q*_C is updated each time the learner visits a state belonging to L. (^Q denotes the symbol written as Q with a hat above it.)
  • η_t represents the learning rate.
  • t + k' represents the first time after time t at which the learner visits a state belonging to L. It can be shown that CenQ learning converges to the true value function under certain conditions, as do ordinary Q-learning and Q-learning in OMDP.
  • The optimal value function is derived by updating the value function using the state when the learner visits a state belonging to L (s_{t+k'}), the state when the learner last visited a state belonging to L before that time (s_t), and the discounted sum of the rewards obtained in between (the first term on the right-hand side of the second equation of equation (10)).
  • FIG. 6 is a diagram showing a hardware configuration example of the value function deriving device 10 according to the first embodiment.
  • The value function derivation device 10 of FIG. 6 has a drive device 100, an auxiliary storage device 102, a memory device 103, a processor 104, an interface device 105, and the like, which are connected to one another by a bus B.
  • The program that realizes the processing in the value function derivation device 10 is provided by a recording medium 101 such as a CD-ROM.
  • the program is installed in the auxiliary storage device 102 from the recording medium 101 via the drive device 100.
  • the program does not necessarily have to be installed from the recording medium 101, and may be downloaded from another computer via the network.
  • the auxiliary storage device 102 stores the installed program and also stores necessary files, data, and the like.
  • the memory device 103 reads a program from the auxiliary storage device 102 and stores it when there is an instruction to start the program.
  • the processor 104 is a CPU or GPU (Graphics Processing Unit), or a CPU and GPU, and executes a function related to the value function derivation device 10 according to a program stored in the memory device 103.
  • the interface device 105 is used as an interface for connecting to a network.
  • FIG. 7 is a diagram showing a functional configuration example of the value function deriving device 10 according to the first embodiment.
  • In order to derive the optimal value function and the optimal policy by a planning algorithm, the value function derivation device 10 of the first embodiment has a parameter input unit 11, a CeMDP parameter calculation unit 12, a planning algorithm execution unit 13, an execution result processing unit 14, and the like.
  • Each of these parts is realized by a process of causing the processor 104 to execute one or more programs installed in the value function derivation device 10.
  • the value function derivation device 10 also uses an input parameter storage unit 121, a setting parameter storage unit 122, a CeMDP parameter storage unit 123, and an execution result storage unit 124.
  • Each of these storage units can be realized by using, for example, an auxiliary storage device 102, a storage device that can be connected to the value function derivation device 10 via a network, or the like.
  • FIG. 8 is a flowchart for explaining an example of the processing procedure executed by the value function deriving device 10 in the first embodiment.
  • The parameter input unit 11 receives the parameters of the MDP (the state transition probability P_M, the reward function R_M, and the discount rate γ) and the parameters used when executing the algorithm (such as the maximum number of iterations; hereinafter, "setting parameters"), records the MDP parameters in the input parameter storage unit 121, and records the setting parameters in the setting parameter storage unit 122.
  • The CeMDP parameter calculation unit 12 takes as input the state transition probability P_M, the reward function R_M, and the discount rate γ recorded in the input parameter storage unit 121 and the setting parameters recorded in the setting parameter storage unit 122, and calculates the redefined state transition probability P_C(s'|s, a) and reward function R_C(s, a) for all combinations of s, s' ∈ L and a (S102).
  • the CeMDP parameter calculation unit 12 records the state transition probability PC and the reward function RC of the calculation result in the CeMDP parameter storage unit 123.
  • Step S102 corresponds to the processing of line numbers 2 and 3 in FIG.
  • The state transition probability P_C is calculated based on Theorem 1, and the reward function R_C is calculated based on Theorem 2. All s, s' ∈ L and a can be identified based on the state transition probability P_M.
  • The planning algorithm execution unit 13 takes as input the state transition probability P_C and the reward function R_C recorded in the CeMDP parameter storage unit 123, the discount rate γ recorded in the input parameter storage unit 121, and the setting parameters recorded in the setting parameter storage unit 122.
  • The (optimal) value function and the (optimal) policy are derived (calculated) by a planning algorithm such as value iteration, and the (optimal) value function and the (optimal) policy are recorded in the execution result storage unit 124 (S103).
  • Step S103 corresponds to the processing of line numbers 5 to 8 in FIG. 4 (the method of calculating the value function of the Markov decision process based on the state transition probability P_C and the reward function R_C).
  • Although FIG. 4 does not describe the process of calculating the (optimal) policy, this is omitted merely for convenience: once the (optimal) value function is known, the (optimal) policy can be derived by a known method.
  • The execution result processing unit 14 outputs the optimal value function or the optimal policy recorded in the execution result storage unit 124 (S104).
  • the second embodiment will explain the differences from the first embodiment.
  • the points not particularly mentioned in the second embodiment may be the same as those in the first embodiment.
  • FIG. 9 is a diagram showing a functional configuration example of the value function deriving device 10 in the second embodiment.
  • The value function derivation device 10 of the second embodiment has a parameter input unit 11, a reinforcement learning algorithm execution unit 15, an execution result processing unit 14, and the like, in order to derive the optimal value function and the optimal policy by a reinforcement learning algorithm.
  • Each of these parts is realized by a process of causing the processor 104 to execute one or more programs installed in the value function derivation device 10.
  • the value function derivation device 10 also uses an input parameter storage unit 121, a setting parameter storage unit 122, and an execution result storage unit 124. Each of these storage units can be realized by using, for example, an auxiliary storage device 102, a storage device that can be connected to the value function derivation device 10 via a network, or the like.
  • FIG. 10 is a flowchart for explaining an example of a processing procedure executed by the value function deriving device 10 in the second embodiment.
  • The parameter input unit 11 receives a simulator for describing the interaction with the MDP (a simulator, for example a program, that can calculate the rewards obtained by the learner and the state transitions based on the MDP) and the MDP parameter (the discount rate γ), and records these in the input parameter storage unit 121.
  • The parameter input unit 11 also receives the parameters used when executing the algorithm (such as the learning rate and the maximum number of iterations; hereinafter, "setting parameters") and records these in the setting parameter storage unit 122.
  • The reinforcement learning algorithm execution unit 15 takes as input the simulator and the discount rate γ recorded in the input parameter storage unit 121 and the setting parameters recorded in the setting parameter storage unit 122.
  • The (optimal) value function Q_C and the (optimal) policy are calculated by a reinforcement learning algorithm such as CenQ learning, and the (optimal) value function Q_C and the (optimal) policy are recorded in the execution result storage unit 124 (S202). That is, in step S202, the algorithm shown in FIG. 5 is executed (a training-loop sketch for this step is given after this list).
  • The execution result processing unit 14 outputs the optimal value function or the optimal policy recorded in the execution result storage unit 124 (S203).
  • S is an example of a set of first states.
  • The policy μ followed by the external system is an example of the predetermined policy.
  • Ε is an example of the set of second states.
  • L is an example of a set of third states.
  • RM is an example of the first reward function.
  • PM is an example of the first state transition probability.
  • RC is an example of the second reward function.
  • PC is an example of the second state transition probability.
  • the CeMDP parameter calculation unit 12, the planning algorithm execution unit 13, or the reinforcement learning algorithm execution unit 15 are examples of the derivation unit.
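
The training-loop sketch referenced above illustrates how the second embodiment's step S202 could interact with a simulator while applying the CenQ update each time a state in L is visited. It is a minimal sketch under several assumptions not specified in the patent: a tabular problem, a hypothetical simulator interface env_step(s, a) -> (reward, next_state), and epsilon-greedy exploration for the learner.

```python
import numpy as np

def cenq_training(env_step, s0, L, pi_e, n_states, n_actions, gamma,
                  eta=0.1, eps=0.1, episodes=1000, horizon=200, seed=0):
    """Sketch of step S202: learn Q_C from simulator interaction.
    L is the set of learner-controlled states; pi_e[s] is the external system's
    (fixed) action distribution in states outside L."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, prev, rewards = s0, None, []          # prev = (state, action) at last visit to L
        for _ in range(horizon):
            if s in L:
                if prev is not None:             # CenQ-style update on returning to L
                    k = len(rewards)
                    g = sum(gamma**j * r for j, r in enumerate(rewards))
                    target = g + gamma**k * Q[s].max()
                    Q[prev] += eta * (target - Q[prev])
                # epsilon-greedy action choice by the learner (an assumption of this sketch)
                a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
                prev, rewards = (s, a), []
            else:
                a = rng.choice(n_actions, p=pi_e[s])   # external system acts outside L
            r, s_next = env_step(s, a)
            rewards.append(r)
            s = s_next
    return Q
```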

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

In the present invention, a computer executes a derivation procedure for deriving a learner's value function in a Markov decision process in which the learner makes decisions over a third set of states, obtained by excluding, from a first set of states in an environment, a second set of states in which an external system makes decisions on the basis of a prescribed policy, whereby the estimation accuracy of the value function when the external system is present is improved.

Description

Value function derivation method, value function derivation device, and program
The present invention relates to a value function derivation method, a value function derivation device, and a program.
The Markov decision process (MDP) is a general-purpose framework for handling sequential decision making and is widely used in reinforcement learning (RL) (Non-Patent Document 1). Following successful examples in the game field (Non-Patent Document 2), in recent years various efforts have been made with the aim of applying reinforcement learning to real systems such as vehicle dispatch in ride sharing, mobile communication network operations, and traffic signal control.
Among such efforts, research on so-called safe reinforcement learning (Safe RL), which aims to learn the optimal reward without damaging the acting agent or the systems around it, has become active. Examples include methods that use evaluation metrics accounting for worst cases or risk as the objective function (see Non-Patent Document 3) and methods that make use of human advice (see Non-Patent Documents 4 and 5).
However, the existing MDP does not take into account the existence of external systems such as human intervention or risk-avoidance measures, and therefore cannot handle situations in which an external system is present. Even if a planning algorithm (for example, value iteration or policy iteration) or a reinforcement learning algorithm (TD learning, Q-learning) built on the existing MDP is applied, the value function cannot be estimated accurately.
The present invention has been made in view of the above points, and an object of the present invention is to improve the estimation accuracy of the value function in a situation where an external system exists.
In order to solve the above problem, a computer executes a derivation procedure for deriving a learner's value function in a Markov decision process in which the learner makes decisions over a third set of states, obtained by excluding, from a first set of states in an environment, a second set of states in which an external system makes decisions based on a predetermined policy.
This makes it possible to improve the estimation accuracy of the value function in a situation where an external system exists.
FIG. 1 is a diagram showing the interaction in the censored Markov decision process (CeMDP).
FIG. 2 is a diagram showing a comparison of the decision-making timing of each MDP.
FIG. 3 is a schematic diagram of the redefined environment.
FIG. 4 is a diagram for explaining a processing procedure including the CenVI algorithm.
FIG. 5 is a diagram for explaining CenQ learning.
FIG. 6 is a diagram showing a hardware configuration example of the value function derivation device 10 in the first embodiment.
FIG. 7 is a diagram showing a functional configuration example of the value function derivation device 10 in the first embodiment.
FIG. 8 is a flowchart for explaining an example of the processing procedure executed by the value function derivation device 10 in the first embodiment.
FIG. 9 is a diagram showing a functional configuration example of the value function derivation device 10 in the second embodiment.
FIG. 10 is a flowchart for explaining an example of the processing procedure executed by the value function derivation device 10 in the second embodiment.
The motivation for this embodiment is to construct a new Markov decision process (MDP) that assumes the use of an external system. Here, the external system refers to human intervention or a predetermined policy for avoiding danger. Such a framework is considered important not only for Safe RL but also when humans and machines need to work together to achieve a goal. For example, consider the case of searching, by reinforcement learning, for the optimal policy for controlling a chemical plant or an automobile. When the system enters a dangerous state, for example when a heavy load is placed on the plant's equipment or there is a possibility of an accident because a vehicle sensor is damaged, it is considered useful to avoid the danger by handing over control to a human or to a predetermined workaround.
Therefore, in this embodiment, a new framework called the censored Markov decision process (CeMDP: censored MDP) is disclosed. CeMDP expresses the interaction among three parties: the learner, which is the acting agent; the environment in which the learner exists; and the external system (human intervention, risk-avoidance measures, etc.). FIG. 1 shows the interaction of the three. Depending on the learner's state, either the learner or the external system makes a decision.
Further, in this embodiment, the following algorithms are disclosed in order to derive the optimal value function and the optimal policy in CeMDP.
(i) When prior knowledge of the environment and the external system is available (the reward function R_M and the state transition probability P_M are known): an algorithm obtained by modifying an existing planning algorithm such as value iteration or policy iteration.
(ii) When prior knowledge of the environment and the external system is not available (R_M and P_M are unknown, but rewards and (stochastic) state transitions can be observed through interaction such as simulation): an algorithm obtained by modifying a reinforcement learning algorithm such as Temporal Difference (TD) learning or Q-learning.
In this embodiment, an algorithm obtained by modifying value iteration (hereinafter, "CenVI (Censored Value Iteration)") and an algorithm obtained by modifying Q-learning (hereinafter, "CenQ (Censored Q-learning)") are described, but algorithms obtained by modifying other algorithms used in (ordinary) MDPs can also be constructed. Further, in this embodiment, the CeMDP algorithms are constructed for the case of a discrete state-action space and discrete time, but almost the same framework can be used even when a continuous state-action space or continuous time is considered. For example, when a CeMDP over a continuous state-action space is considered, methods can be constructed on the basis of any reinforcement learning algorithm that uses function approximation, such as Least-Squares TD learning (LSTD), Least-Squares TDQ learning (LSTDQ), and Least-Squares Policy Iteration (LSPI), which estimate the value function with a linear function approximator, as well as Deep Q-Network (DQN) and Double DQN, which use deep learning.
First, the prerequisite knowledge for this embodiment is explained.
[Markov Decision Process]
A Markov decision process is defined by a five-element tuple {S, A, P_M, R_M, γ}. S = {1, 2, ..., |S|} is the set of states, and A = {1, 2, ..., |A|} is the set of actions. P_M: S × A × S → [0,1] is the state transition probability; the probability of transitioning from state s to state s' by action a is written P_M(s'|s, a). R_M: S × A → R is the reward function; the reward obtained when action a is executed in state s is written R_M(s, a). γ ∈ [0,1) is the discount rate. The state transition probability P_M and the reward function R_M under an action a are represented by the matrix P^a ∈ [0,1]^{|S|×|S|} and the vector R^a ∈ R^{|S|}, respectively, where the "R" in R^{|S|} denotes the set of all real numbers.
Let π: S × A → [0,1] be the learner's policy (decision-making rule). π(a|s) represents the probability that action a is executed in state s. When a policy π is fixed, the interaction between the learner and the environment is expressed as follows.
At time t, the learner in state s_t determines an action a_t according to the policy π(a_t|s_t). The reward r_{t+1} received by the learner and the state s_{t+1} at the next time are determined according to the reward function R_M(s_t, a_t) and the state transition probability P_M(s_{t+1}|s_t, a_t). The sequence of states visited by the learner, {s_t}_t, can be regarded as a Markov chain with transition probability matrix P̄^π, whose elements are defined by [P̄^π]_ij = Σ_{a∈A} π(a|s=i) [P^a]_ij. (P̄ corresponds to the symbol written as P with a bar above it in the formulas described later.)
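As a concrete illustration of the tabular quantities just defined, the following sketch builds the policy-averaged transition matrix P̄^π and the policy-averaged reward vector R̄^π from per-action arrays. The array shapes and names are assumptions made for this example (a small tabular MDP held in numpy arrays), not notation taken from the patent.

```python
import numpy as np

# Assumed tabular representation:
#   P[a, s, s'] = P_M(s'|s, a)   shape (|A|, |S|, |S|)
#   R[s, a]     = R_M(s, a)      shape (|S|, |A|)
#   pi[s, a]    = pi(a|s)        shape (|S|, |A|), rows sum to 1

def policy_averaged_transition(P: np.ndarray, pi: np.ndarray) -> np.ndarray:
    """[P_bar]_{ij} = sum_a pi(a|s=i) [P^a]_{ij}."""
    return np.einsum("ia,aij->ij", pi, P)

def policy_averaged_reward(R: np.ndarray, pi: np.ndarray) -> np.ndarray:
    """[R_bar]_i = sum_a pi(a|s=i) R_M(i, a)."""
    return (pi * R).sum(axis=1)
```

The vector returned by policy_averaged_reward corresponds to R̄^π, which appears in the matrix form of the Bellman equation given below.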
The state value function V^π_M and the action value function Q^π_M are defined as the expected value of the sum of discounted rewards obtained when actions are determined according to the policy π, as follows:
[Math 1]
  V^\pi_M(s) = \lim_{T \to \infty} \mathbb{E}^\pi_T\left[ \sum_{t=0}^{T} \gamma^t R_M(s_t, a_t) \mid s_0 = s \right], \quad Q^\pi_M(s, a) = \lim_{T \to \infty} \mathbb{E}^\pi_T\left[ \sum_{t=0}^{T} \gamma^t R_M(s_t, a_t) \mid s_0 = s, a_0 = a \right]
where the expectation operator \mathbb{E}^\pi_T[\cdot] denotes the expectation over the realization of the T-step sequence of states and actions \{s_t, a_t\}_{t=0}^{T} obtained by following the policy π. It is widely known that these value functions satisfy the following Bellman expectation equations:
[Math 4]
  V^\pi_M(s) = \sum_{a \in A} \pi(a \mid s) \left[ R_M(s,a) + \gamma \sum_{s' \in S} P_M(s' \mid s,a)\, V^\pi_M(s') \right]
  Q^\pi_M(s,a) = R_M(s,a) + \gamma \sum_{s' \in S} P_M(s' \mid s,a) \sum_{a' \in A} \pi(a' \mid s')\, Q^\pi_M(s',a')
Similarly, the optimal action value function Q*_M, defined as
[Math 5]
  Q^*_M(s,a) = \max_\pi Q^\pi_M(s,a),
satisfies the following Bellman optimality equation:
[Math 6]
  Q^*_M(s,a) = R_M(s,a) + \gamma \sum_{s' \in S} P_M(s' \mid s,a) \max_{a' \in A} Q^*_M(s',a')
A policy π* that satisfies, for all state-action pairs (s, a),
[Math 7]
  Q^{\pi^*}_M(s,a) = \max_\pi Q^\pi_M(s,a)
is called the optimal policy. Based on these equations, planning algorithms such as value iteration and policy iteration and reinforcement learning algorithms such as Temporal Difference (TD) learning and Q-learning have been constructed, and the optimal policy can be obtained by using these algorithms. An algorithm that finds the optimal policy when the reward function R_M and the state transition probability P_M are known is called a planning algorithm. On the other hand, an algorithm that finds the optimal policy when R_M and P_M are unknown but rewards and (stochastic) state transitions can be observed through interaction such as simulation is called a reinforcement learning algorithm.
If the Bellman equation is written using vectors and matrices, the state value function V^π_M can also be expressed as follows:
[Math 8]
  V^\pi_M = (I - \gamma \bar{P}^\pi)^{-1} \bar{R}^\pi
where
[Math 9]
  [\bar{R}^\pi]_i = \sum_{a \in A} \pi(a \mid s = i)\, R_M(i, a)
In practice, the state value function V^π_M can also be computed, without calculating the inverse matrix, by iterative computation similar to value iteration.
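The two ways of computing V^π_M mentioned above (a direct linear solve in place of the explicit inverse, and fixed-point iteration) can be sketched as follows; P_bar and R_bar are the policy-averaged quantities from the previous sketch, and this is only an illustration of the standard computation, not code from the patent.

```python
import numpy as np

def policy_evaluation_direct(P_bar, R_bar, gamma):
    """Solve (I - gamma * P_bar) V = R_bar, i.e. V = (I - gamma * P_bar)^{-1} R_bar."""
    n = P_bar.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_bar, R_bar)

def policy_evaluation_iterative(P_bar, R_bar, gamma, tol=1e-8, max_iter=100_000):
    """Repeat the Bellman backup V <- R_bar + gamma * P_bar @ V until convergence."""
    V = np.zeros_like(R_bar)
    for _ in range(max_iter):
        V_new = R_bar + gamma * P_bar @ V
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V
```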
[Options and MDPs Using Options (OMDP)]
In an (ordinary) MDP, each action ends at each time step. In contrast, an MDP using options (OMDP) makes use of macro actions, called options, that are executed over multiple time steps. The definition of an option is given below.
<Definition 1 (Option)>
An option is defined by a triple <μ, β, I> consisting of a policy μ: S × A → [0,1], a termination condition β: S → [0,1], and an initiation set I ⊆ S ("Richard S Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2):181-211, 1999."; hereinafter referred to as Reference 1).
(Normal MDP) actions can also be regarded as options that terminate immediately after execution. Hereinafter, the set of options is denoted O. Further, here, the policy π is taken to represent the execution probability of options (π: S × O → [0,1]).
The interaction between the learner and the environment in OMDP is expressed as follows. The learner in state s_t executes an option o_t (whose initiation set I contains the state s_t) according to the policy π(o_t|s_t). Once the option o_t to be executed is determined, the action a_t to be executed next is determined according to the option's policy μ(a_t|s_t). The learner who executed the action a_t receives the reward r_{t+1} according to the reward function R_M(s_t, a_t) and transitions to the state s_{t+1} at the next time according to the state transition probability P_M(s_{t+1}|s_t, a_t). After the transition, the currently executing option terminates with probability β(s_{t+1}). If the option does not terminate and continues, the action at the next time is determined according to the policy μ(a_{t+1}|s_{t+1}) of the currently executing option. If the option terminates, the option at the next time is determined according to the policy π(o_{t+1}|s_{t+1}).
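The interaction described in this paragraph can be sketched as a rollout loop. The simulator interface env_step(s, a) -> (reward, next_state), the dictionary-based policy over options, and the representation of an option as a tuple (mu, beta, init_set) are assumptions introduced only for this illustration.

```python
import numpy as np

def omdp_rollout(s0, pi_over_options, options, env_step, horizon, rng):
    """One rollout of the learner-environment interaction with options described above.
    options[o] = (mu, beta, init_set); mu[s] is a distribution over actions,
    beta[s'] is the termination probability in s'."""
    s, current, history = s0, None, []
    for _ in range(horizon):
        if current is None:
            # Choose an option available in s according to pi(o|s).
            probs = pi_over_options(s)                       # dict: option id -> probability
            current = rng.choice(list(probs), p=list(probs.values()))
        mu, beta, init_set = options[current]
        a = rng.choice(len(mu[s]), p=mu[s])                  # intra-option policy mu(a|s)
        r, s_next = env_step(s, a)
        history.append((s, current, a, r, s_next))
        if rng.random() < beta[s_next]:                      # terminate with probability beta(s')
            current = None                                   # next step picks a new option
        s = s_next
    return history
```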
In this way, in OMDP the timing of decision making (by the policy π) is no longer constant. FIGS. 2(a) and 2(b) show examples of the decision-making timing of MDP and OMDP. Each point in (a) indicates the timing of MDP decision making based on the policy π. The white circles in (b) indicate the timing of OMDP decision making based on the policy π(o_t|s_t), and each point in (b) indicates the timing of decision making based on the policy μ(a_{t+1}|s_{t+1}).
The advantage of OMDP is that, by redefining the original reward function and state transition probability, which are defined in terms of actions, into ones defined in terms of options as shown below, the value function and the optimal policy can be obtained in almost the same way as in an ordinary MDP.
<Definition 2 (State transition probability when using options (Reference 1))>
[Math 10]
  P_O(s' \mid s, o) = \sum_{k=1}^{\infty} \gamma^k \rho(s', k)
where ρ(s', k) denotes the probability that the option o terminates in state s' after k steps.
<Definition 3 (Reward function when using options (Reference 1))>
[Math 11]
  R_O(s, o) = \mathbb{E}\left[ r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{k-1} r_{t+k} \mid \text{option } o \text{ is initiated in state } s \text{ at time } t \right]
where t + k denotes the (probabilistically determined) termination time of the option o.
If the value function of OMDP is defined in the same way as for MDP, the following Bellman equation holds:
[Math 12]
  Q^\pi_O(s, o) = R_O(s, o) + \sum_{s'} P_O(s' \mid s, o) \sum_{o'} \pi(o' \mid s')\, Q^\pi_O(s', o')
Therefore, planning algorithms such as value iteration and policy iteration, created for ordinary MDPs, can be used to estimate the value function of OMDP and to obtain the optimal policy (Reference 1). Similarly, reinforcement learning algorithms such as TD learning and Q-learning can be used in OMDP. When Q-learning is used, the estimate ^Q_O of the optimal value function Q*_O may be updated as follows at the timing when an option terminates. (^Q corresponds to the symbol written as Q with a hat above it in the following equation.)
[Math 13]
  \hat{Q}_O(s_t, o_t) \leftarrow \hat{Q}_O(s_t, o_t) + \eta_t \left[ \sum_{j=1}^{k'} \gamma^{j-1} r_{t+j} + \gamma^{k'} \max_{o'} \hat{Q}_O(s_{t+k'}, o') - \hat{Q}_O(s_t, o_t) \right]
where η_t denotes the learning rate and t + k' denotes the termination time of the option o_t. The difference from Q-learning in an ordinary MDP is that the discounted cumulative sum of rewards over the k' steps is used.
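A minimal sketch of the option-level update above, applied at the moment the running option terminates; the rewards collected while the option ran are assumed to be passed in as a list, and the tabular layout of Q is an assumption of this example.

```python
import numpy as np

def option_q_update(Q, s_start, o, rewards, s_end, gamma, eta):
    """Q has shape (|S|, |O|).  rewards = [r_{t+1}, ..., r_{t+k'}] were observed while
    option o ran from s_start and terminated in s_end after k' steps."""
    k = len(rewards)
    g = sum(gamma**j * r for j, r in enumerate(rewards))     # k'-step discounted return
    target = g + gamma**k * np.max(Q[s_end])                 # bootstrap over options at s_end
    Q[s_start, o] += eta * (target - Q[s_start, o])
    return Q
```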
[Outline of the Present Embodiment]
Next, the outline of the present embodiment is described. First, the censored Markov decision process (CeMDP) is defined; it expresses the interaction among the learner that is the acting agent, the environment in which the learner exists, and the external system (human intervention, risk-avoidance measures, etc.).
[Definition of the Censored Markov Decision Process (CeMDP)]
A CeMDP is defined by a seven-element tuple {S, A, P_M, R_M, γ, Ε, μ}. S, A, P_M, R_M, and γ are the same as in the MDP: the set of states, the set of actions, the state transition probability, the reward function, and the discount rate, respectively. Ε ⊆ S is the set of states in which the external system makes decisions, and μ: Ε × A → [0,1] is the fixed policy that the external system follows. The set of states in which the learner makes decisions is hereinafter denoted "L". The elements of L and Ε do not overlap, and the union of L and Ε equals the state set S (Ε ∪ L = S, Ε ∩ L = φ). That is, L is the set obtained by removing Ε from S. In the following, μ is also written π_e for convenience of explanation.
The interaction of the three parties shown in FIG. 1 is expressed as follows. At each time t, if the state s_t ∈ L, the action a_t is determined according to the learner's policy π_l(a_t|s_t). Otherwise, if the state s_t ∈ Ε, the action a_t is determined according to the external system's policy μ(a_t|s_t). The learner's policy π_l is the target of optimization in this embodiment, whereas the policy μ of the external system is fixed. When the action is executed, the learner receives the reward r_{t+1} according to the reward function R_M(s_t, a_t) and transitions to the state s_{t+1} at the next time according to the state transition probability P_M(s_{t+1}|s_t, a_t).
FIG. 2(c) shows an example of the learner's decision-making timing in CeMDP. In FIG. 2(c), the white rectangles indicate the timing of decision making based on the policy π_l(a_t|s_t), and each point indicates the timing of decision making based on the policy μ(a_t|s_t). Note that if Ε is an empty set, this interaction is exactly the same as that of an ordinary MDP.
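The three-party interaction can be sketched as the following rollout loop; the simulator interface env_step(s, a) -> (reward, next_state) and the array/dict representation of the two policies are assumptions of this sketch, not details from the patent.

```python
import numpy as np

def cemdp_rollout(s0, L, pi_l, pi_e, env_step, horizon, rng):
    """At each step the learner decides in states s in L; the external system's fixed
    policy pi_e decides in states s in E (the complement of L)."""
    s, trajectory = s0, []
    for _ in range(horizon):
        policy = pi_l if s in L else pi_e        # who decides depends on the current state
        a = rng.choice(len(policy[s]), p=policy[s])
        r, s_next = env_step(s, a)
        trajectory.append((s, a, r, s_next))
        s = s_next
    return trajectory
```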
[Redefining the Environment]
In constructing algorithms for CeMDP, it is considered useful, as was the case for OMDP, to take the approach of redefining the environment (the reward function and the state transition probability) so that MDP algorithms can be applied. The inventor therefore considered redefining the environment of CeMDP. The key idea in the redefinition is that, in the CeMDP interaction, the learner needs to make decisions only in states s ∈ L. Accordingly, by redefining the environment and the external system of the CeMDP in FIG. 1 as the environment of an MDP whose state space is L, as shown in FIG. 1, the optimal policy and the optimal value function can be obtained by algorithms that are (almost) the same as the MDP algorithms. FIG. 3 shows a schematic diagram of the redefined environment.
Based on the above approach, the inventor derived the state transition probability and the reward function in the redefined environment. Without loss of generality, assume that the states are reordered so that the state transition probability P^a and the matrix P̄^π, obtained by averaging over the action a using the policy π = (π_l, π_e), are expressed by the following block matrix:
[Math 14]
Similarly, a subscript l or e denotes the vector obtained by extracting from a given vector the elements corresponding to the states in L or Ε, respectively. For example:
[Math 15]
It can then be shown that the state transition probability and the reward function in the redefined environment are given by the following equations.
<Theorem 1 (Redefined state transition probability)>
The state transition probability P_C (over the state set L) in the redefined environment is given, in its matrix representation P^a_C under a given action a, by the following equation.
[Math 16]
<Theorem 2 (Redefined reward function)>
The reward function R_C (over the state set L) in the redefined environment is given, in its matrix representation R^a_C under a given action a, by the following equation.
[Math 17]
The above state transition probability P_C and reward function R_C can be obtained either by matrix operations (using inverse matrix computation) or by using power iteration.
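The closed forms of Theorems 1 and 2 are given as [Math 16] and [Math 17] in the original publication; as the text notes, evaluating them amounts to solving a linear system, which can be done either with a direct solve or with power iteration. The following generic helper is a sketch of that computation under the assumption that the quantity of interest satisfies a fixed-point relation X = B + M X; the concrete blocks M and B come from the theorems themselves and are not reproduced here.

```python
import numpy as np

def solve_fixed_point(M, B, use_power_iteration=False, tol=1e-10, max_iter=100_000):
    """Return X with X = B + M @ X, i.e. X = (I - M)^{-1} B.
    The power-iteration branch converges when the spectral radius of M is below 1."""
    if not use_power_iteration:
        return np.linalg.solve(np.eye(M.shape[0]) - M, B)
    X = np.zeros_like(B)
    for _ in range(max_iter):
        X_new = B + M @ X
        if np.max(np.abs(X_new - X)) < tol:
            return X_new
        X = X_new
    return X
```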
[Value Function and Bellman Equation]
If a value function is defined using the state transition probability and the reward function of the redefined environment, it satisfies the following Bellman equation:
[Math 18]
where π_l denotes a policy over the states in the set L. Therefore, the optimal policy and the optimal value function can be obtained by the same algorithms as for an MDP.
[Planning Algorithm]
Here, as an example of a planning algorithm, an algorithm based on value iteration is shown. As described above, this algorithm is referred to in this embodiment as "CenVI (Censored MDP Value Iteration)".
FIG. 4 is a diagram for explaining a processing procedure including the CenVI algorithm. In FIG. 4, the state transition probability P_C(s'|s, a) and the reward function R_C(s, a) of the redefined environment are first calculated, and then the optimal value function is obtained by CenVI, which is based on value iteration. Since the number of states over which value iteration is performed is reduced from |S| to |L|, value iteration only needs to be run over a smaller state space. Algorithms based on algorithms other than value iteration can also be constructed in much the same way.
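A minimal sketch of the CenVI step of FIG. 4 (lines 5 to 8), assuming the redefined quantities have already been tabulated as arrays indexed by the states in L. Whether a separate discount factor still multiplies the backup depends on how the discounting is folded into P_C in [Math 16]; that detail follows the original equations, so gamma is kept as an explicit parameter here (pass 1.0 if it is already folded in).

```python
import numpy as np

def cenvi(P_C, R_C, gamma, tol=1e-8, max_iter=10_000):
    """Value iteration over the reduced state set L.
    P_C[a, s, s'] and R_C[s, a] are the redefined transition probability and reward
    (Theorems 1 and 2), with s, s' ranging over L only."""
    n_states, n_actions = R_C.shape
    V = np.zeros(n_states)
    Q = np.zeros((n_states, n_actions))
    for _ in range(max_iter):
        Q = R_C + gamma * np.einsum("asx,x->sa", P_C, V)   # backup restricted to L
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    policy = Q.argmax(axis=1)                              # greedy policy on L
    return V, Q, policy
```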
[Reinforcement Learning Algorithm]
Next, as an example of a reinforcement learning algorithm, an algorithm based on Q-learning is shown. As described above, this algorithm is referred to in this embodiment as "CenQ learning (Censored MDP Q-Learning)".
FIG. 5 is a diagram for explaining CenQ learning. In CenQ learning, each time the learner visits a state belonging to L, the estimate ^Q_C of the optimal action value function Q*_C is updated based on the following equation. (^Q denotes the symbol written as Q with a hat above it.)
[Math 19]
where η_t denotes the learning rate and t + k' denotes the first time after time t at which the learner visits a state belonging to L. It can be shown that CenQ learning converges to the true value function under certain conditions, as do ordinary Q-learning and Q-learning in OMDP. That is, the optimal value function is derived by updating the value function using the state when the learner visits a state belonging to L (s_{t+k'}), the state when the learner last visited a state belonging to L before that time (s_t), and the discounted sum of the rewards obtained in between (the first term on the right-hand side of the second equation of equation (10)).
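A minimal sketch of the CenQ update described above, applied each time the learner arrives at a state in L; it follows the textual description of [Math 19] (the exact equation is in the original publication) and mirrors the option-level update shown earlier, so the precise form should be checked against [Math 19].

```python
import numpy as np

def cenq_update(Q, s_prev, a_prev, rewards_between, s_now, gamma, eta):
    """s_prev, a_prev: state and action at the learner's previous visit to L.
    rewards_between: the k' rewards observed since then.  s_now: the state in L
    visited now (s_{t+k'} in the text)."""
    k = len(rewards_between)
    g = sum(gamma**j * r for j, r in enumerate(rewards_between))
    target = g + gamma**k * np.max(Q[s_now])
    Q[s_prev, a_prev] += eta * (target - Q[s_prev, a_prev])
    return Q
```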
[Value Function Derivation Device 10]
The value function derivation device 10 that realizes the above is described below. FIG. 6 is a diagram showing a hardware configuration example of the value function derivation device 10 in the first embodiment. The value function derivation device 10 of FIG. 6 has a drive device 100, an auxiliary storage device 102, a memory device 103, a processor 104, an interface device 105, and the like, which are connected to one another by a bus B.
The program that realizes the processing in the value function derivation device 10 is provided by a recording medium 101 such as a CD-ROM. When the recording medium 101 storing the program is set in the drive device 100, the program is installed in the auxiliary storage device 102 from the recording medium 101 via the drive device 100. However, the program does not necessarily have to be installed from the recording medium 101 and may instead be downloaded from another computer via a network. The auxiliary storage device 102 stores the installed program and also stores necessary files, data, and the like.
The memory device 103 reads the program from the auxiliary storage device 102 and stores it when an instruction to start the program is given. The processor 104 is a CPU or a GPU (Graphics Processing Unit), or a CPU and a GPU, and executes the functions of the value function derivation device 10 according to the program stored in the memory device 103. The interface device 105 is used as an interface for connecting to a network.
 図7は、第1の実施の形態における価値関数導出装置10の機能構成例を示す図である。第1の実施の形態の価値関数導出装置10は、プランニングアルゴリズムによって最適価値関数及び最適方策を導出ために、パラメタ入力部11、CeMDPパラメタ計算部12、プランニングアルゴリズム実行部13及び実行結果処理部14及等を有する。これら各部は、価値関数導出装置10にインストールされた1以上のプログラムが、プロセッサ104に実行させる処理により実現される。価値関数導出装置10は、また、入力パラメタ記憶部121、設定パラメタ記憶部122、CeMDPパラメタ記憶部123及び実行結果記憶部124を利用する。これら各記憶部は、例えば、補助記憶装置102、又は価値関数導出装置10にネットワークを介して接続可能な記憶装置等を用いて実現可能である。 FIG. 7 is a diagram showing a functional configuration example of the value function deriving device 10 according to the first embodiment. The value function derivation device 10 of the first embodiment has a parameter input unit 11, a CeMDP parameter calculation unit 12, a planning algorithm execution unit 13, and an execution result processing unit 14 in order to derive an optimum value function and an optimum measure by a planning algorithm. And so on. Each of these parts is realized by a process of causing the processor 104 to execute one or more programs installed in the value function derivation device 10. The value function derivation device 10 also uses an input parameter storage unit 121, a setting parameter storage unit 122, a CeMDP parameter storage unit 123, and an execution result storage unit 124. Each of these storage units can be realized by using, for example, an auxiliary storage device 102, a storage device that can be connected to the value function derivation device 10 via a network, or the like.
FIG. 8 is a flowchart for explaining an example of the processing procedure executed by the value function derivation device 10 in the first embodiment.
In step S101, the parameter input unit 11 receives the parameters of the MDP (the state transition probability P_M, the reward function R_M, and the discount rate γ) and the parameters used when the algorithm is executed (such as the maximum number of iterations; hereinafter, "setting parameters"), records the MDP parameters in the input parameter storage unit 121, and records the setting parameters in the setting parameter storage unit 122. A sketch of these inputs is shown below.
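For concreteness, the following is a minimal sketch of the data handled in step S101, assuming a tabular MDP; the container names (MDPParams, AlgorithmSettings) and the convergence threshold are illustrative assumptions and do not appear in the specification.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MDPParams:
    P_M: np.ndarray   # state transition probabilities, shape (num_states, num_actions, num_states)
    R_M: np.ndarray   # reward function, shape (num_states, num_actions)
    gamma: float      # discount rate

@dataclass
class AlgorithmSettings:
    max_iterations: int = 1000   # maximum number of iterations
    tolerance: float = 1e-6      # convergence threshold (an assumed additional setting)
```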
Subsequently, the CeMDP parameter calculation unit 12 takes as input the state transition probability P_M, the reward function R_M, and the discount rate γ recorded in the input parameter storage unit 121, together with the setting parameters recorded in the setting parameter storage unit 122, and calculates the redefined environment's state transition probability P_C(s'|s,a) and reward function R_C(s,a) for all combinations of s, s' ∈ L, and a (S102). The CeMDP parameter calculation unit 12 records the calculated state transition probability P_C and reward function R_C in the CeMDP parameter storage unit 123. Step S102 corresponds to the processing of line numbers 2 and 3 in FIG. 4. The state transition probability P_C is calculated based on Theorem 1, and the reward function R_C is calculated based on Theorem 2. All combinations of s, s' ∈ L, and a can be identified based on the state transition probability P_M. A sketch of this computation is given after this paragraph.
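The exact formulas for P_C and R_C are those of Theorems 1 and 2, which are not reproduced in this part of the specification. The sketch below therefore only illustrates one plausible shape of the computation, under the assumption that P_C and R_C are obtained by marginalizing over excursions through the external-system states E that follow the fixed policy μ; the function name, the indexing conventions, and the particular treatment of discounting are assumptions made for illustration, not the patent's own formulas.

```python
import numpy as np

def compute_cemdp_params(P_M, R_M, gamma, L, E, mu):
    """Illustrative computation of P_C and R_C (not the patent's exact formulas).

    P_M: (S, A, S) transition probabilities, R_M: (S, A) rewards,
    L, E: index arrays of learner states and external-system states,
    mu: dict mapping each state in E to the action chosen by the external system.
    """
    A = P_M.shape[1]
    # Dynamics and rewards of the external system acting under mu.
    P_EE = np.array([[P_M[e, mu[e], e2] for e2 in E] for e in E])   # E -> E
    P_EL = np.array([[P_M[e, mu[e], l] for l in L] for e in E])     # E -> L
    r_E = np.array([R_M[e, mu[e]] for e in E])
    # Expected number of visits inside E before returning to L (assumes every
    # excursion ends with probability 1), and the expected discounted reward
    # accumulated during one excursion.
    N = np.linalg.inv(np.eye(len(E)) - P_EE)
    w = np.linalg.inv(np.eye(len(E)) - gamma * P_EE) @ r_E
    P_C = np.zeros((len(L), A, len(L)))
    R_C = np.zeros((len(L), A))
    for i, s in enumerate(L):
        for a in range(A):
            direct = P_M[s, a, L]                 # reach another learner state in one step
            via_E = P_M[s, a, E] @ N @ P_EL       # reach it after an excursion through E
            P_C[i, a] = direct + via_E
            R_C[i, a] = R_M[s, a] + gamma * (P_M[s, a, E] @ w)
    return P_C, R_C
```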
Subsequently, the planning algorithm execution unit 13 takes as input the state transition probability P_C and the reward function R_C recorded in the CeMDP parameter storage unit 123, the discount rate γ recorded in the input parameter storage unit 121, and the setting parameters recorded in the setting parameter storage unit 122, derives (calculates) the (optimal) value function and the (optimal) policy by a planning algorithm such as the value iteration method, and records the (optimal) value function and the (optimal) policy in the execution result storage unit 124 (S103). Step S103 corresponds to the processing of line numbers 5 to 8 in FIG. 4 (the method of calculating the value function in the Markov decision process based on the state transition probability P_C and the reward function R_C). Although, for convenience, FIG. 4 does not describe the processing of calculating the (optimal) policy, this processing is omitted because the (optimal) policy can be derived by a known method once the (optimal) value function is known.
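As one concrete instance of such a planning algorithm, the following is a standard value iteration sketch run on the redefined parameters (P_C, R_C). Whether γ is applied exactly as written depends on the discounting convention of Theorems 1 and 2 and of the algorithm in FIG. 4, so this should be read as an assumption-laden illustration rather than the patent's own procedure.

```python
import numpy as np

def value_iteration(P_C, R_C, gamma, max_iterations=1000, tolerance=1e-6):
    """Value iteration over the learner states only (an illustrative sketch)."""
    n_L, n_actions, _ = P_C.shape
    V = np.zeros(n_L)
    Q = np.zeros((n_L, n_actions))
    for _ in range(max_iterations):
        Q = R_C + gamma * (P_C @ V)   # expected return of each (state, action) pair
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tolerance:
            V = V_new
            break
        V = V_new
    policy = Q.argmax(axis=1)         # greedy ("optimal") policy on the learner states
    return V, policy
```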
Subsequently, the execution result processing unit 14 outputs the optimal value function or the optimal policy recorded in the execution result storage unit 124 (S104).
Next, the second embodiment will be described. For the second embodiment, the differences from the first embodiment are described; points not specifically mentioned in the second embodiment may be the same as in the first embodiment.
FIG. 9 is a diagram showing a functional configuration example of the value function derivation device 10 in the second embodiment. In FIG. 9, parts that are the same as or correspond to those in FIG. 7 are given the same reference numerals. In order to derive an optimal value function and an optimal policy by a reinforcement learning algorithm, the value function derivation device 10 of the second embodiment has a parameter input unit 11, a reinforcement learning algorithm execution unit 15, an execution result processing unit 14, and the like. Each of these units is realized by processing that one or more programs installed in the value function derivation device 10 cause the processor 104 to execute. The value function derivation device 10 also uses an input parameter storage unit 121, a setting parameter storage unit 122, and an execution result storage unit 124. Each of these storage units can be realized by using, for example, the auxiliary storage device 102 or a storage device that can be connected to the value function derivation device 10 via a network.
FIG. 10 is a flowchart for explaining an example of the processing procedure executed by the value function derivation device 10 in the second embodiment.
In step S201, the parameter input unit 11 receives a simulator that describes the interaction with the MDP (that is, something capable of computing the rewards obtained by the learner and the state transitions based on the MDP, for example a program) and the MDP parameter (the discount rate γ), and records them in the input parameter storage unit 121. The parameter input unit 11 also receives the parameters used when the algorithm is executed (such as the learning rate and the maximum number of iterations; hereinafter, "setting parameters") and records them in the setting parameter storage unit 122. One assumed shape of the simulator interface is sketched below.
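The simulator interface is not specified in detail here, so the following sketch merely records one assumption about what such a simulator could return to the learner: the next state in which the learner again makes a decision, together with the rewards and the number of environment steps elapsed in between (the latter two anticipate the update described for FIG. 5 and claim 3). The class and method names are illustrative, not the patent's own API.

```python
from typing import Protocol, Tuple

class CeMDPSimulator(Protocol):
    """Assumed interface of the simulator supplied in step S201."""

    def reset(self) -> int:
        """Return an initial state in which the learner makes a decision."""
        ...

    def step(self, state: int, action: int) -> Tuple[int, float, int, bool]:
        """Return (next learner state, discounted sum of rewards obtained until
        that state is reached, number of elapsed environment steps, done flag)."""
        ...
```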
Subsequently, the reinforcement learning algorithm execution unit 15 takes as input the simulator and the discount rate γ recorded in the input parameter storage unit 121 and the setting parameters recorded in the setting parameter storage unit 122, calculates the (optimal) value function Q_C and the (optimal) policy by a reinforcement learning algorithm such as CenQ learning, and stores the (optimal) value function Q_C and the (optimal) policy in the execution result storage unit 124 (S202). That is, in step S202, the algorithm shown in FIG. 5 is executed.
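FIG. 5 itself is not reproduced in this part of the specification, so the following is only a hedged sketch of a Q-learning-style update in the spirit of CenQ learning: the value function Q_C is maintained only over the learner states, and each update uses the discounted reward sum collected between two consecutive visits to a learner state, discounting the bootstrap term by γ^k for the k elapsed environment steps (cf. claim 3). The function name, the ε-greedy exploration, and the hyperparameter defaults are assumptions; the simulator is the assumed interface sketched above.

```python
import numpy as np

def cenq_learning_sketch(sim, n_learner_states, n_actions, gamma,
                         alpha=0.1, epsilon=0.1, episodes=1000,
                         max_decisions=200, seed=0):
    """Illustrative reinforcement learning loop for step S202 (not FIG. 5 itself)."""
    rng = np.random.default_rng(seed)
    Q_C = np.zeros((n_learner_states, n_actions))
    for _ in range(episodes):
        s = sim.reset()
        for _ in range(max_decisions):
            # epsilon-greedy action selection on the learner states
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(Q_C[s].argmax())
            # g: discounted reward sum until the next learner state, k: elapsed steps
            s_next, g, k, done = sim.step(s, a)
            target = g if done else g + (gamma ** k) * Q_C[s_next].max()
            Q_C[s, a] += alpha * (target - Q_C[s, a])
            s = s_next
            if done:
                break
    policy = Q_C.argmax(axis=1)
    return Q_C, policy
```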
Subsequently, the execution result processing unit 14 outputs the optimal value function or the optimal policy recorded in the execution result storage unit 124 (S203).
As described above, according to each of the above embodiments, the value function can be estimated with high accuracy and an optimal policy can be obtained in a situation in which three parties interact: the learner that is the acting agent, the environment in which the learner exists, and the external system (human intervention, danger-avoidance measures, and the like). That is, the estimation accuracy of the value function can be improved in a situation where an external system exists. As a result, existing planning algorithms such as the value iteration method and the policy iteration method, and reinforcement learning algorithms such as Temporal Difference (TD) learning and Q-learning, can be extended so that they can be used in situations where intervention by humans or manually designed controllers exists.
In the present embodiment, S is an example of the set of first states. The policy μ followed by the external system is an example of the predetermined measure. E is an example of the set of second states. L is an example of the set of third states. R_M is an example of the first reward function. P_M is an example of the first state transition probability. R_C is an example of the second reward function. P_C is an example of the second state transition probability. The CeMDP parameter calculation unit 12 and the planning algorithm execution unit 13, or the reinforcement learning algorithm execution unit 15, are examples of the derivation unit.
Although the embodiments of the present invention have been described in detail above, the present invention is not limited to these specific embodiments, and various modifications and changes are possible within the scope of the gist of the present invention described in the claims.
10  Value function derivation device
11  Parameter input unit
12  CeMDP parameter calculation unit
13  Planning algorithm execution unit
14  Execution result processing unit
15  Reinforcement learning algorithm execution unit
100  Drive device
101  Recording medium
102  Auxiliary storage device
103  Memory device
104  Processor
105  Interface device
121  Input parameter storage unit
122  Setting parameter storage unit
123  CeMDP parameter storage unit
124  Execution result storage unit
B  Bus

Claims (7)

  1.  A value function derivation method, wherein a computer executes a derivation procedure of deriving a value function of a learner in a Markov decision process in which the learner makes decisions for a set of third states, the set of third states being obtained by excluding, from a set of first states in an environment, a set of second states in which an external system makes decisions based on a predetermined measure.
  2.  The value function derivation method according to claim 1, wherein, based on a first reward function and a first state transition probability in the environment, the derivation procedure calculates a second reward function and a second state transition probability obtained when the environment and the external system are redefined as an environment of a Markov decision process whose state space is the set of third states, and derives the value function by a method of calculating a value function in the Markov decision process based on the second reward function and the second state transition probability.
  3.  The value function derivation method according to claim 1, wherein the derivation procedure derives the value function by updating the value function using the state at a first time at which the learner visits a state belonging to the set of third states, the state at a second time at which the learner last visited a state belonging to the set of third states before the first time, and a discounted sum of rewards obtained between the first time and the second time.
  4.  A value function derivation device comprising a derivation unit that derives a value function of a learner in a Markov decision process in which the learner makes decisions for a set of third states, the set of third states being obtained by excluding, from a set of first states in an environment, a set of second states in which an external system makes decisions based on a predetermined measure.
  5.  The value function derivation device according to claim 4, wherein, based on a first reward function and a first state transition probability in the environment, the derivation unit calculates a second reward function and a second state transition probability obtained when the environment and the external system are redefined as an environment of a Markov decision process whose state space is the set of third states, and derives the value function by a method of calculating a value function in the Markov decision process based on the second reward function and the second state transition probability.
  6.  The value function derivation device according to claim 4, wherein the derivation unit derives the value function by updating the value function using the state at a first time at which the learner visits a state belonging to the set of third states, the state at a second time at which the learner last visited a state belonging to the set of third states before the first time, and a discounted sum of rewards obtained between the first time and the second time.
  7.  A program that causes a computer to execute the value function derivation method according to any one of claims 1 to 3.