WO2022038655A1 - Value function derivation method, value function derivation device, and program - Google Patents

Value function derivation method, value function derivation device, and program Download PDF

Info

Publication number
WO2022038655A1
Authority
WO
WIPO (PCT)
Prior art keywords
value function
state
learner
function
environment
Prior art date
Application number
PCT/JP2020/030975
Other languages
French (fr)
Japanese (ja)
Inventor
匡宏 幸島
公海 高橋
浩之 戸田
Original Assignee
日本電信電話株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電信電話株式会社 (Nippon Telegraph and Telephone Corporation)
Priority to PCT/JP2020/030975 priority Critical patent/WO2022038655A1/en
Priority to JP2022543822A priority patent/JPWO2022038655A1/ja
Publication of WO2022038655A1 publication Critical patent/WO2022038655A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Definitions

  • the present invention relates to a value function derivation method, a value function derivation device, and a program.
  • The Markov decision process (MDP) is a general-purpose framework for handling sequential decision making and is widely used in reinforcement learning (RL) (Non-Patent Document 1).
  • Following successful examples of reinforcement learning in the game field (Non-Patent Document 2), in recent years various efforts have been made with the aim of applying it to real systems such as vehicle dispatch in ride sharing, mobile communication network operations, and traffic signal control.
  • The existing MDP does not consider the existence of external systems such as human intervention and risk-avoidance measures, so it cannot cope with situations in which an external system exists. Even if a planning algorithm (for example, value iteration or policy iteration) or a reinforcement learning algorithm (TD learning, Q-learning) built on the existing MDP is applied, the value function cannot be estimated accurately.
  • the present invention has been made in view of the above points, and an object of the present invention is to improve the estimation accuracy of the value function in a situation where an external system exists.
  • The learner makes decisions over a third set of states, obtained by excluding, from a first set of states in the environment, a second set of states in which the external system makes decisions based on a predetermined policy; in this Markov decision process, the computer executes a derivation procedure for deriving the learner's value function.
  • The motivation of this embodiment is to construct a new Markov decision process (MDP) that assumes the use of an external system.
  • The external system refers to human intervention or a predetermined policy for avoiding danger.
  • Such a framework is considered important not only for Safe RL but also when humans and machines must work together to achieve a goal. For example, consider the case of searching, by reinforcement learning, for the optimal way to control a chemical plant or an automobile. When the system enters a dangerous state, such as when a heavy load is placed on the plant's equipment or there is a possibility of an accident due to a damaged vehicle sensor, it can be useful to avoid the danger by handing over control to a human or to a predetermined workaround.
  • CeMDP (censored MDP) expresses the interaction among the learner, which is the acting agent, the environment in which the learner exists, and the external system (human intervention, risk-avoidance measures, etc.).
  • FIG. 1 shows the interaction of the three.
  • Depending on the learner's state, either the learner or the external system makes a decision.
  • the following algorithm is disclosed in order to derive the optimum value function and the optimum policy in CeMDP.
  • (i) When prior knowledge of the environment and the external system is available (the reward function R_M and the state transition probability P_M are known): an algorithm obtained by modifying an existing planning algorithm such as value iteration or policy iteration.
  • (ii) When no such prior knowledge is available (R_M and P_M are unknown, but rewards and (stochastic) state transitions can be observed through interaction such as simulation): an algorithm obtained by modifying a reinforcement learning algorithm such as Temporal Difference (TD) learning or Q-learning.
  • This embodiment describes an algorithm obtained by modifying value iteration (hereinafter "CenVI (Censored Value Iteration)") and an algorithm obtained by modifying Q-learning (hereinafter "CenQ (Censored Q-learning)").
  • In CeMDP, the algorithms are constructed for the case of a discrete state-action space and discrete time, but almost the same framework can be used even when a continuous state-action space or continuous time is considered.
  • For continuous state-action spaces, methods can be built on algorithms that use function approximation, such as Least-Squares TD learning (LSTD), Least-Squares TDQ learning (LSTDQ), and Least-Squares Policy Iteration (LSPI).
  • The Markov decision process is defined by a five-element tuple {S, A, P_M, R_M, γ}.
  • S = {1, 2, ..., |S|} represents the set of states, and A = {1, 2, ..., |A|} represents the set of actions.
  • P_M: S × A × S → [0,1] is the state transition probability, and the probability of transitioning from state s to state s' by action a is written P_M(s'|s, a).
  • R_M: S × A → R is the reward function, and the reward obtained when action a is executed in state s is written R_M(s, a).
  • γ ∈ [0,1) is the discount rate.
  • The state transition probability P_M and the reward function R_M under an action a are represented by the matrix P^a ∈ [0,1]^{|S|×|S|} and the vector R^a ∈ R^{|S|}, where R denotes the set of all real numbers.
  • π: S × A → [0,1] is the learner's policy (decision-making rule); π(a|s) represents the probability that the action a is executed in the state s.
  • At time t, the learner in the state s_t determines the action a_t according to the policy π(a_t|s_t).
  • The reward r_{t+1} received by the learner and the state s_{t+1} at the next time are determined according to the reward function R_M(s_t, a_t) and the state transition probability P_M(s_{t+1}|s_t, a_t).
  • The sequence of states visited by the learner, {s_t}_t, can be regarded as a Markov chain with transition probability matrix P̄^π, whose elements are [P̄^π]_ij = Σ_{a∈A} π(a|s=i)[P^a]_ij. (P̄ corresponds to the symbol written as P with a bar above it in the mathematical formulas described later.)
  • The state value function V^π_M and the action value function Q^π_M are defined as the expected value of the sum of discounted rewards obtained when actions are determined according to the policy π.
  • A policy π* that attains the optimal value for all state-action pairs (s, a) is called the optimal policy.
  • Planning algorithms such as value iteration and policy iteration and reinforcement learning algorithms such as Temporal Difference (TD) learning and Q-learning have been constructed, and the optimal policy can be obtained by using these algorithms.
  • An algorithm that finds the optimal policy when the reward function R_M and the state transition probability P_M are known is called a planning algorithm.
  • An algorithm that finds the optimal policy when rewards and (stochastic) state transitions can be observed through interaction such as simulation is called a reinforcement learning algorithm.
  • the state value function V ⁇ M can also be expressed as follows.
  • <Definition 1 (Option)> An option is defined by a triple <μ, β, I> consisting of a policy μ: S × A → [0,1], a termination condition β: S → [0,1], and an initiation set I ⊆ S ("Richard S Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2):181-211, 1999.").
  • Ordinary actions can also be regarded as options that terminate immediately after execution.
  • The set of options is denoted O.
  • The policy π represents the execution probability of options (π: S × O → [0,1]).
  • The interaction between the learner and the environment in OMDP is expressed as follows.
  • The learner in the state s_t executes an option o_t (whose initiation set I contains the state s_t) according to the policy π(o_t|s_t).
  • The action a_t to be executed next is determined according to the option's policy μ(a_t|s_t).
  • The learner who executed the action a_t receives the reward r_{t+1} according to the reward function R_M(s_t, a_t) and transitions to the state s_{t+1} at the next time according to the state transition probability P_M(s_{t+1}|s_t, a_t).
  • After the transition, the currently executing option terminates with probability β(s_{t+1}). If the option does not terminate and continues, the action at the next time is determined according to the policy μ(a_{t+1}|s_{t+1}) of the currently executing option; if it terminates, the next option is determined according to π(o_{t+1}|s_{t+1}).
  • the timing of decision-making (by policy ⁇ ) is not constant.
  • FIGS. 2(a) and 2(b) show examples of the decision-making timing of MDP and OMDP.
  • Each point in (a) indicates the timing of MDP decision making based on the policy π.
  • The white circles in (b) indicate the timing of OMDP decision making based on the policy π(o_t|s_t), and each point in (b) indicates the timing of decision making based on the option's policy μ.
  • The advantage of OMDP is that, by redefining the original reward function and state transition probability, which are defined in terms of actions, into ones defined in terms of options, the value function and the optimal policy can be obtained in almost the same way as in an ordinary MDP.
  • ρ(s', k) represents the probability that the option o terminates in the state s' after k steps.
  • t + k represents the (probabilistically determined) termination time of the option o.
  • Planning algorithms such as value iteration and policy iteration, created for ordinary MDPs, can be used to estimate the value function of OMDP and to acquire the optimal policy (Reference 1).
  • Reinforcement learning algorithms such as TD learning and Q-learning can also be used in OMDP.
  • When Q-learning is used, the estimate ^Q_O of the optimal value function Q*_O may be updated at the timing when an option terminates. (^Q corresponds to the symbol written as Q with a hat above it.)
  • η_t represents the learning rate, and t + k' represents the termination time of the option o_t.
  • The difference from Q-learning in an ordinary MDP is that the discounted cumulative sum of rewards over the k' steps is used.
  • CeMDP is defined by a seven-element tuple {S, A, P_M, R_M, γ, Ε, μ}.
  • S, A, P_M, R_M, and γ are the same as in the MDP: the set of states, the set of actions, the state transition probability, the reward function, and the discount rate, respectively.
  • Ε ⊆ S is the set of states in which the external system makes decisions.
  • μ: Ε × A → [0,1] is the fixed policy that the external system follows.
  • L denotes the set of states in which the learner makes decisions; L and Ε are disjoint and their union equals S, i.e., L is the set S excluding Ε.
  • μ will also be written π_e for convenience of explanation.
  • The interaction of the three parties shown in FIG. 1 is expressed as follows. At each time t, if the state s_t ∈ L, the action a_t is determined according to the learner's policy π_l(a_t|s_t); otherwise, if s_t ∈ Ε, it is determined according to the external system's policy μ(a_t|s_t).
  • The learner's policy π_l is the target of optimization in this embodiment. On the other hand, the policy μ of the external system is fixed.
  • The learner receives the reward r_{t+1} and transitions to the next state s_{t+1} according to the reward function R_M(s_t, a_t) and the state transition probability P_M(s_{t+1}|s_t, a_t).
  • FIG. 2 (c) shows an example of the learner's decision-making timing in CeMDP.
  • The white rectangles indicate the timing of decision making based on the learner's policy π_l(a_t|s_t).
  • Each point indicates the timing of decision making based on the external system's policy μ(a_t|s_t).
  • A subscript l or e denotes the vector obtained by extracting from a given vector the elements corresponding to the states in L or Ε, respectively.
  • The state transition probability P_C (over the state set L) in the redefined environment is given, in its matrix representation P^a_C under a given action a, by the equation of Theorem 1.
  • The above state transition probability P_C and reward function R_C can be obtained by matrix operations (using inverse matrix calculation) or by using power iteration.
  • π_l represents a policy over the states in the set L. Therefore, the optimal policy and the optimal value function can be obtained by the same algorithms as for MDP.
  • FIG. 4 is a diagram for explaining a processing procedure including the CenVI algorithm.
  • The redefined state transition probability P_C(s'|s, a) and reward function R_C(s, a) are first calculated, and then the optimal value function is obtained by CenVI, which is based on value iteration. Since the number of states over which value iteration is performed is reduced from |S| to |L|, value iteration only needs to be run over a smaller state space. Algorithms based on algorithms other than value iteration can be constructed in almost the same manner.
  • FIG. 5 is a diagram for explaining CenQ learning.
  • The estimate ^Q_C of the optimal action value function Q*_C is updated each time the learner visits a state belonging to L. (^Q denotes the symbol written as Q with a hat above it.)
  • η_t represents the learning rate.
  • t + k' represents the first time after time t at which the learner visits a state belonging to L. It can be shown that CenQ learning converges to the true value function under certain conditions, as do ordinary Q-learning and Q-learning in OMDP.
  • The optimal value function is derived by updating the value function using the state when the learner visits a state belonging to L (s_{t+k'}), the state when the learner last visited a state belonging to L before that time (s_t), and the discounted sum of the rewards obtained in between (the first term on the right-hand side of the second equation of equation (10)).
  • FIG. 6 is a diagram showing a hardware configuration example of the value function deriving device 10 according to the first embodiment.
  • The value function derivation device 10 of FIG. 6 has a drive device 100, an auxiliary storage device 102, a memory device 103, a processor 104, an interface device 105, and the like, which are connected to one another by a bus B.
  • The program that realizes the processing in the value function derivation device 10 is provided by a recording medium 101 such as a CD-ROM.
  • the program is installed in the auxiliary storage device 102 from the recording medium 101 via the drive device 100.
  • the program does not necessarily have to be installed from the recording medium 101, and may be downloaded from another computer via the network.
  • the auxiliary storage device 102 stores the installed program and also stores necessary files, data, and the like.
  • the memory device 103 reads a program from the auxiliary storage device 102 and stores it when there is an instruction to start the program.
  • the processor 104 is a CPU or GPU (Graphics Processing Unit), or a CPU and GPU, and executes a function related to the value function derivation device 10 according to a program stored in the memory device 103.
  • the interface device 105 is used as an interface for connecting to a network.
  • FIG. 7 is a diagram showing a functional configuration example of the value function deriving device 10 according to the first embodiment.
  • In order to derive the optimal value function and the optimal policy by a planning algorithm, the value function derivation device 10 of the first embodiment has a parameter input unit 11, a CeMDP parameter calculation unit 12, a planning algorithm execution unit 13, an execution result processing unit 14, and the like.
  • Each of these parts is realized by a process of causing the processor 104 to execute one or more programs installed in the value function derivation device 10.
  • the value function derivation device 10 also uses an input parameter storage unit 121, a setting parameter storage unit 122, a CeMDP parameter storage unit 123, and an execution result storage unit 124.
  • Each of these storage units can be realized by using, for example, an auxiliary storage device 102, a storage device that can be connected to the value function derivation device 10 via a network, or the like.
  • FIG. 8 is a flowchart for explaining an example of the processing procedure executed by the value function deriving device 10 in the first embodiment.
  • The parameter input unit 11 receives the parameters of the MDP (the state transition probability P_M, the reward function R_M, and the discount rate γ) and the parameters used when executing the algorithm (such as the maximum number of iterations; hereinafter, "setting parameters"), records the MDP parameters in the input parameter storage unit 121, and records the setting parameters in the setting parameter storage unit 122.
  • The CeMDP parameter calculation unit 12 takes as input the state transition probability P_M, the reward function R_M, and the discount rate γ recorded in the input parameter storage unit 121 and the setting parameters recorded in the setting parameter storage unit 122, and calculates the redefined state transition probability P_C(s'|s, a) and reward function R_C(s, a) for all combinations of s, s' ∈ L and a (S102).
  • the CeMDP parameter calculation unit 12 records the state transition probability PC and the reward function RC of the calculation result in the CeMDP parameter storage unit 123.
  • Step S102 corresponds to the processing of line numbers 2 and 3 in FIG.
  • The state transition probability P_C is calculated based on Theorem 1, and the reward function R_C is calculated based on Theorem 2. All s, s' ∈ L and a can be identified based on the state transition probability P_M.
  • The planning algorithm execution unit 13 takes as input the state transition probability P_C and the reward function R_C recorded in the CeMDP parameter storage unit 123, the discount rate γ recorded in the input parameter storage unit 121, and the setting parameters recorded in the setting parameter storage unit 122.
  • The (optimal) value function and the (optimal) policy are derived (calculated) by a planning algorithm such as value iteration, and the (optimal) value function and the (optimal) policy are recorded in the execution result storage unit 124 (S103).
  • Step S103 corresponds to the processing of line numbers 5 to 8 in FIG. 4 (the method of calculating the value function of the Markov decision process based on the state transition probability P_C and the reward function R_C).
  • Although FIG. 4 does not describe the process of calculating the (optimal) policy, this is omitted merely for convenience: once the (optimal) value function is known, the (optimal) policy can be derived by a known method.
  • The execution result processing unit 14 outputs the optimal value function or the optimal policy recorded in the execution result storage unit 124 (S104).
  • the second embodiment will explain the differences from the first embodiment.
  • the points not particularly mentioned in the second embodiment may be the same as those in the first embodiment.
  • FIG. 9 is a diagram showing a functional configuration example of the value function deriving device 10 in the second embodiment.
  • The value function derivation device 10 of the second embodiment has a parameter input unit 11, a reinforcement learning algorithm execution unit 15, an execution result processing unit 14, and the like, in order to derive the optimal value function and the optimal policy by a reinforcement learning algorithm.
  • Each of these parts is realized by a process of causing the processor 104 to execute one or more programs installed in the value function derivation device 10.
  • the value function derivation device 10 also uses an input parameter storage unit 121, a setting parameter storage unit 122, and an execution result storage unit 124. Each of these storage units can be realized by using, for example, an auxiliary storage device 102, a storage device that can be connected to the value function derivation device 10 via a network, or the like.
  • FIG. 10 is a flowchart for explaining an example of a processing procedure executed by the value function deriving device 10 in the second embodiment.
  • The parameter input unit 11 receives a simulator for describing the interaction with the MDP (a simulator, for example a program, that can calculate the rewards obtained by the learner and the state transitions based on the MDP) and the MDP parameter (the discount rate γ), and records these in the input parameter storage unit 121.
  • The parameter input unit 11 also receives the parameters used when executing the algorithm (such as the learning rate and the maximum number of iterations; hereinafter, "setting parameters") and records these in the setting parameter storage unit 122.
  • The reinforcement learning algorithm execution unit 15 takes as input the simulator and the discount rate γ recorded in the input parameter storage unit 121 and the setting parameters recorded in the setting parameter storage unit 122.
  • The (optimal) value function Q_C and the (optimal) policy are calculated by a reinforcement learning algorithm such as CenQ learning, and the (optimal) value function Q_C and the (optimal) policy are recorded in the execution result storage unit 124 (S202). That is, in step S202, the algorithm shown in FIG. 5 is executed (a training-loop sketch for this step is given after this list).
  • The execution result processing unit 14 outputs the optimal value function or the optimal policy recorded in the execution result storage unit 124 (S203).
  • S is an example of a set of first states.
  • The policy μ followed by the external system is an example of the predetermined policy.
  • Ε is an example of the set of second states.
  • L is an example of a set of third states.
  • RM is an example of the first reward function.
  • PM is an example of the first state transition probability.
  • RC is an example of the second reward function.
  • PC is an example of the second state transition probability.
  • the CeMDP parameter calculation unit 12, the planning algorithm execution unit 13, or the reinforcement learning algorithm execution unit 15 are examples of the derivation unit.
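
The training-loop sketch referenced above illustrates how the second embodiment's step S202 could interact with a simulator while applying the CenQ update each time a state in L is visited. It is a minimal sketch under several assumptions not specified in the patent: a tabular problem, a hypothetical simulator interface env_step(s, a) -> (reward, next_state), and epsilon-greedy exploration for the learner.

```python
import numpy as np

def cenq_training(env_step, s0, L, pi_e, n_states, n_actions, gamma,
                  eta=0.1, eps=0.1, episodes=1000, horizon=200, seed=0):
    """Sketch of step S202: learn Q_C from simulator interaction.
    L is the set of learner-controlled states; pi_e[s] is the external system's
    (fixed) action distribution in states outside L."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, prev, rewards = s0, None, []          # prev = (state, action) at last visit to L
        for _ in range(horizon):
            if s in L:
                if prev is not None:             # CenQ-style update on returning to L
                    k = len(rewards)
                    g = sum(gamma**j * r for j, r in enumerate(rewards))
                    target = g + gamma**k * Q[s].max()
                    Q[prev] += eta * (target - Q[prev])
                # epsilon-greedy action choice by the learner (an assumption of this sketch)
                a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
                prev, rewards = (s, a), []
            else:
                a = rng.choice(n_actions, p=pi_e[s])   # external system acts outside L
            r, s_next = env_step(s, a)
            rewards.append(r)
            s = s_next
    return Q
```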

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

In the present invention, a computer executes a derivation procedure for deriving a learner's value function in a Markov decision process in which the learner makes decisions over a third set of states, obtained by excluding, from a first set of states in an environment, a second set of states in which an external system makes decisions on the basis of a prescribed policy, whereby the estimation accuracy of the value function when the external system is present is improved.

Description

Value function derivation method, value function derivation device, and program
The present invention relates to a value function derivation method, a value function derivation device, and a program.
The Markov decision process (MDP) is a general-purpose framework for handling sequential decision making and is widely used in reinforcement learning (RL) (Non-Patent Document 1). Following successful examples in the game field (Non-Patent Document 2), in recent years various efforts have been made with the aim of applying reinforcement learning to real systems such as vehicle dispatch in ride sharing, mobile communication network operations, and traffic signal control.
Among such efforts, research on so-called safe reinforcement learning (Safe RL), which aims to learn the optimal reward without damaging the acting agent or the systems around it, has become active. Examples include methods that use evaluation metrics accounting for worst cases or risk as the objective function (see Non-Patent Document 3) and methods that make use of human advice (see Non-Patent Documents 4 and 5).
However, the existing MDP does not take into account the existence of external systems such as human intervention or risk-avoidance measures, and therefore cannot handle situations in which an external system is present. Even if a planning algorithm (for example, value iteration or policy iteration) or a reinforcement learning algorithm (TD learning, Q-learning) built on the existing MDP is applied, the value function cannot be estimated accurately.
The present invention has been made in view of the above points, and an object of the present invention is to improve the estimation accuracy of the value function in a situation where an external system exists.
In order to solve the above problem, a computer executes a derivation procedure for deriving a learner's value function in a Markov decision process in which the learner makes decisions over a third set of states, obtained by excluding, from a first set of states in an environment, a second set of states in which an external system makes decisions based on a predetermined policy.
This makes it possible to improve the estimation accuracy of the value function in a situation where an external system exists.
FIG. 1 is a diagram showing the interaction in the censored Markov decision process (CeMDP).
FIG. 2 is a diagram showing a comparison of the decision-making timing of each MDP.
FIG. 3 is a schematic diagram of the redefined environment.
FIG. 4 is a diagram for explaining a processing procedure including the CenVI algorithm.
FIG. 5 is a diagram for explaining CenQ learning.
FIG. 6 is a diagram showing a hardware configuration example of the value function derivation device 10 in the first embodiment.
FIG. 7 is a diagram showing a functional configuration example of the value function derivation device 10 in the first embodiment.
FIG. 8 is a flowchart for explaining an example of the processing procedure executed by the value function derivation device 10 in the first embodiment.
FIG. 9 is a diagram showing a functional configuration example of the value function derivation device 10 in the second embodiment.
FIG. 10 is a flowchart for explaining an example of the processing procedure executed by the value function derivation device 10 in the second embodiment.
The motivation for this embodiment is to construct a new Markov decision process (MDP) that assumes the use of an external system. Here, the external system refers to human intervention or a predetermined policy for avoiding danger. Such a framework is considered important not only for Safe RL but also when humans and machines need to work together to achieve a goal. For example, consider the case of searching, by reinforcement learning, for the optimal policy for controlling a chemical plant or an automobile. When the system enters a dangerous state, for example when a heavy load is placed on the plant's equipment or there is a possibility of an accident because a vehicle sensor is damaged, it is considered useful to avoid the danger by handing over control to a human or to a predetermined workaround.
Therefore, in this embodiment, a new framework called the censored Markov decision process (CeMDP: censored MDP) is disclosed. CeMDP expresses the interaction among three parties: the learner, which is the acting agent; the environment in which the learner exists; and the external system (human intervention, risk-avoidance measures, etc.). FIG. 1 shows the interaction of the three. Depending on the learner's state, either the learner or the external system makes a decision.
Further, in this embodiment, the following algorithms are disclosed in order to derive the optimal value function and the optimal policy in CeMDP.
(i) When prior knowledge of the environment and the external system is available (the reward function R_M and the state transition probability P_M are known): an algorithm obtained by modifying an existing planning algorithm such as value iteration or policy iteration.
(ii) When prior knowledge of the environment and the external system is not available (R_M and P_M are unknown, but rewards and (stochastic) state transitions can be observed through interaction such as simulation): an algorithm obtained by modifying a reinforcement learning algorithm such as Temporal Difference (TD) learning or Q-learning.
In this embodiment, an algorithm obtained by modifying value iteration (hereinafter, "CenVI (Censored Value Iteration)") and an algorithm obtained by modifying Q-learning (hereinafter, "CenQ (Censored Q-learning)") are described, but algorithms obtained by modifying other algorithms used in (ordinary) MDPs can also be constructed. Further, in this embodiment, the CeMDP algorithms are constructed for the case of a discrete state-action space and discrete time, but almost the same framework can be used even when a continuous state-action space or continuous time is considered. For example, when a CeMDP over a continuous state-action space is considered, methods can be constructed on the basis of any reinforcement learning algorithm that uses function approximation, such as Least-Squares TD learning (LSTD), Least-Squares TDQ learning (LSTDQ), and Least-Squares Policy Iteration (LSPI), which estimate the value function with a linear function approximator, as well as Deep Q-Network (DQN) and Double DQN, which use deep learning.
First, the prerequisite knowledge for this embodiment is explained.
[Markov Decision Process]
A Markov decision process is defined by a five-element tuple {S, A, P_M, R_M, γ}. S = {1, 2, ..., |S|} is the set of states, and A = {1, 2, ..., |A|} is the set of actions. P_M: S × A × S → [0,1] is the state transition probability; the probability of transitioning from state s to state s' by action a is written P_M(s'|s, a). R_M: S × A → R is the reward function; the reward obtained when action a is executed in state s is written R_M(s, a). γ ∈ [0,1) is the discount rate. The state transition probability P_M and the reward function R_M under an action a are represented by the matrix P^a ∈ [0,1]^{|S|×|S|} and the vector R^a ∈ R^{|S|}, respectively, where the "R" in R^{|S|} denotes the set of all real numbers.
Let π: S × A → [0,1] be the learner's policy (decision-making rule). π(a|s) represents the probability that action a is executed in state s. When a policy π is fixed, the interaction between the learner and the environment is expressed as follows.
At time t, the learner in state s_t determines an action a_t according to the policy π(a_t|s_t). The reward r_{t+1} received by the learner and the state s_{t+1} at the next time are determined according to the reward function R_M(s_t, a_t) and the state transition probability P_M(s_{t+1}|s_t, a_t). The sequence of states visited by the learner, {s_t}_t, can be regarded as a Markov chain with transition probability matrix P̄^π, whose elements are defined by [P̄^π]_ij = Σ_{a∈A} π(a|s=i) [P^a]_ij. (P̄ corresponds to the symbol written as P with a bar above it in the formulas described later.)
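As a concrete illustration of the tabular quantities just defined, the following sketch builds the policy-averaged transition matrix P̄^π and the policy-averaged reward vector R̄^π from per-action arrays. The array shapes and names are assumptions made for this example (a small tabular MDP held in numpy arrays), not notation taken from the patent.

```python
import numpy as np

# Assumed tabular representation:
#   P[a, s, s'] = P_M(s'|s, a)   shape (|A|, |S|, |S|)
#   R[s, a]     = R_M(s, a)      shape (|S|, |A|)
#   pi[s, a]    = pi(a|s)        shape (|S|, |A|), rows sum to 1

def policy_averaged_transition(P: np.ndarray, pi: np.ndarray) -> np.ndarray:
    """[P_bar]_{ij} = sum_a pi(a|s=i) [P^a]_{ij}."""
    return np.einsum("ia,aij->ij", pi, P)

def policy_averaged_reward(R: np.ndarray, pi: np.ndarray) -> np.ndarray:
    """[R_bar]_i = sum_a pi(a|s=i) R_M(i, a)."""
    return (pi * R).sum(axis=1)
```

The vector returned by policy_averaged_reward corresponds to R̄^π, which appears in the matrix form of the Bellman equation given below.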
The state value function V^π_M and the action value function Q^π_M are defined as the expected value of the sum of discounted rewards obtained when actions are determined according to the policy π, as follows:
[Math 1]
  V^\pi_M(s) = \lim_{T \to \infty} \mathbb{E}^\pi_T\left[ \sum_{t=0}^{T} \gamma^t R_M(s_t, a_t) \mid s_0 = s \right], \quad Q^\pi_M(s, a) = \lim_{T \to \infty} \mathbb{E}^\pi_T\left[ \sum_{t=0}^{T} \gamma^t R_M(s_t, a_t) \mid s_0 = s, a_0 = a \right]
where the expectation operator \mathbb{E}^\pi_T[\cdot] denotes the expectation over the realization of the T-step sequence of states and actions \{s_t, a_t\}_{t=0}^{T} obtained by following the policy π. It is widely known that these value functions satisfy the following Bellman expectation equations:
[Math 4]
  V^\pi_M(s) = \sum_{a \in A} \pi(a \mid s) \left[ R_M(s,a) + \gamma \sum_{s' \in S} P_M(s' \mid s,a)\, V^\pi_M(s') \right]
  Q^\pi_M(s,a) = R_M(s,a) + \gamma \sum_{s' \in S} P_M(s' \mid s,a) \sum_{a' \in A} \pi(a' \mid s')\, Q^\pi_M(s',a')
Similarly, the optimal action value function Q*_M, defined as
[Math 5]
  Q^*_M(s,a) = \max_\pi Q^\pi_M(s,a),
satisfies the following Bellman optimality equation:
[Math 6]
  Q^*_M(s,a) = R_M(s,a) + \gamma \sum_{s' \in S} P_M(s' \mid s,a) \max_{a' \in A} Q^*_M(s',a')
A policy π* that satisfies, for all state-action pairs (s, a),
[Math 7]
  Q^{\pi^*}_M(s,a) = \max_\pi Q^\pi_M(s,a)
is called the optimal policy. Based on these equations, planning algorithms such as value iteration and policy iteration and reinforcement learning algorithms such as Temporal Difference (TD) learning and Q-learning have been constructed, and the optimal policy can be obtained by using these algorithms. An algorithm that finds the optimal policy when the reward function R_M and the state transition probability P_M are known is called a planning algorithm. On the other hand, an algorithm that finds the optimal policy when R_M and P_M are unknown but rewards and (stochastic) state transitions can be observed through interaction such as simulation is called a reinforcement learning algorithm.
If the Bellman equation is written using vectors and matrices, the state value function V^π_M can also be expressed as follows:
[Math 8]
  V^\pi_M = (I - \gamma \bar{P}^\pi)^{-1} \bar{R}^\pi
where
[Math 9]
  [\bar{R}^\pi]_i = \sum_{a \in A} \pi(a \mid s = i)\, R_M(i, a)
In practice, the state value function V^π_M can also be computed, without calculating the inverse matrix, by iterative computation similar to value iteration.
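The two ways of computing V^π_M mentioned above (a direct linear solve in place of the explicit inverse, and fixed-point iteration) can be sketched as follows; P_bar and R_bar are the policy-averaged quantities from the previous sketch, and this is only an illustration of the standard computation, not code from the patent.

```python
import numpy as np

def policy_evaluation_direct(P_bar, R_bar, gamma):
    """Solve (I - gamma * P_bar) V = R_bar, i.e. V = (I - gamma * P_bar)^{-1} R_bar."""
    n = P_bar.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P_bar, R_bar)

def policy_evaluation_iterative(P_bar, R_bar, gamma, tol=1e-8, max_iter=100_000):
    """Repeat the Bellman backup V <- R_bar + gamma * P_bar @ V until convergence."""
    V = np.zeros_like(R_bar)
    for _ in range(max_iter):
        V_new = R_bar + gamma * P_bar @ V
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V
```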
[Options and MDPs Using Options (OMDP)]
In an (ordinary) MDP, each action ends at each time step. In contrast, an MDP using options (OMDP) makes use of macro actions, called options, that are executed over multiple time steps. The definition of an option is given below.
<Definition 1 (Option)>
An option is defined by a triple <μ, β, I> consisting of a policy μ: S × A → [0,1], a termination condition β: S → [0,1], and an initiation set I ⊆ S ("Richard S Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2):181-211, 1999."; hereinafter referred to as Reference 1).
(Normal MDP) actions can also be regarded as options that terminate immediately after execution. Hereinafter, the set of options is denoted O. Further, here, the policy π is taken to represent the execution probability of options (π: S × O → [0,1]).
The interaction between the learner and the environment in OMDP is expressed as follows. The learner in state s_t executes an option o_t (whose initiation set I contains the state s_t) according to the policy π(o_t|s_t). Once the option o_t to be executed is determined, the action a_t to be executed next is determined according to the option's policy μ(a_t|s_t). The learner who executed the action a_t receives the reward r_{t+1} according to the reward function R_M(s_t, a_t) and transitions to the state s_{t+1} at the next time according to the state transition probability P_M(s_{t+1}|s_t, a_t). After the transition, the currently executing option terminates with probability β(s_{t+1}). If the option does not terminate and continues, the action at the next time is determined according to the policy μ(a_{t+1}|s_{t+1}) of the currently executing option. If the option terminates, the option at the next time is determined according to the policy π(o_{t+1}|s_{t+1}).
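The interaction described in this paragraph can be sketched as a rollout loop. The simulator interface env_step(s, a) -> (reward, next_state), the dictionary-based policy over options, and the representation of an option as a tuple (mu, beta, init_set) are assumptions introduced only for this illustration.

```python
import numpy as np

def omdp_rollout(s0, pi_over_options, options, env_step, horizon, rng):
    """One rollout of the learner-environment interaction with options described above.
    options[o] = (mu, beta, init_set); mu[s] is a distribution over actions,
    beta[s'] is the termination probability in s'."""
    s, current, history = s0, None, []
    for _ in range(horizon):
        if current is None:
            # Choose an option available in s according to pi(o|s).
            probs = pi_over_options(s)                       # dict: option id -> probability
            current = rng.choice(list(probs), p=list(probs.values()))
        mu, beta, init_set = options[current]
        a = rng.choice(len(mu[s]), p=mu[s])                  # intra-option policy mu(a|s)
        r, s_next = env_step(s, a)
        history.append((s, current, a, r, s_next))
        if rng.random() < beta[s_next]:                      # terminate with probability beta(s')
            current = None                                   # next step picks a new option
        s = s_next
    return history
```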
In this way, in OMDP the timing of decision making (by the policy π) is no longer constant. FIGS. 2(a) and 2(b) show examples of the decision-making timing of MDP and OMDP. Each point in (a) indicates the timing of MDP decision making based on the policy π. The white circles in (b) indicate the timing of OMDP decision making based on the policy π(o_t|s_t), and each point in (b) indicates the timing of decision making based on the policy μ(a_{t+1}|s_{t+1}).
The advantage of OMDP is that, by redefining the original reward function and state transition probability, which are defined in terms of actions, into ones defined in terms of options as shown below, the value function and the optimal policy can be obtained in almost the same way as in an ordinary MDP.
<Definition 2 (State transition probability when using options (Reference 1))>
[Math 10]
  P_O(s' \mid s, o) = \sum_{k=1}^{\infty} \gamma^k \rho(s', k)
where ρ(s', k) denotes the probability that the option o terminates in state s' after k steps.
<Definition 3 (Reward function when using options (Reference 1))>
[Math 11]
  R_O(s, o) = \mathbb{E}\left[ r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{k-1} r_{t+k} \mid \text{option } o \text{ is initiated in state } s \text{ at time } t \right]
where t + k denotes the (probabilistically determined) termination time of the option o.
If the value function of OMDP is defined in the same way as for MDP, the following Bellman equation holds:
[Math 12]
  Q^\pi_O(s, o) = R_O(s, o) + \sum_{s'} P_O(s' \mid s, o) \sum_{o'} \pi(o' \mid s')\, Q^\pi_O(s', o')
Therefore, planning algorithms such as value iteration and policy iteration, created for ordinary MDPs, can be used to estimate the value function of OMDP and to obtain the optimal policy (Reference 1). Similarly, reinforcement learning algorithms such as TD learning and Q-learning can be used in OMDP. When Q-learning is used, the estimate ^Q_O of the optimal value function Q*_O may be updated as follows at the timing when an option terminates. (^Q corresponds to the symbol written as Q with a hat above it in the following equation.)
[Math 13]
  \hat{Q}_O(s_t, o_t) \leftarrow \hat{Q}_O(s_t, o_t) + \eta_t \left[ \sum_{j=1}^{k'} \gamma^{j-1} r_{t+j} + \gamma^{k'} \max_{o'} \hat{Q}_O(s_{t+k'}, o') - \hat{Q}_O(s_t, o_t) \right]
where η_t denotes the learning rate and t + k' denotes the termination time of the option o_t. The difference from Q-learning in an ordinary MDP is that the discounted cumulative sum of rewards over the k' steps is used.
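A minimal sketch of the option-level update above, applied at the moment the running option terminates; the rewards collected while the option ran are assumed to be passed in as a list, and the tabular layout of Q is an assumption of this example.

```python
import numpy as np

def option_q_update(Q, s_start, o, rewards, s_end, gamma, eta):
    """Q has shape (|S|, |O|).  rewards = [r_{t+1}, ..., r_{t+k'}] were observed while
    option o ran from s_start and terminated in s_end after k' steps."""
    k = len(rewards)
    g = sum(gamma**j * r for j, r in enumerate(rewards))     # k'-step discounted return
    target = g + gamma**k * np.max(Q[s_end])                 # bootstrap over options at s_end
    Q[s_start, o] += eta * (target - Q[s_start, o])
    return Q
```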
[Outline of the Present Embodiment]
Next, the outline of the present embodiment is described. First, the censored Markov decision process (CeMDP) is defined; it expresses the interaction among the learner that is the acting agent, the environment in which the learner exists, and the external system (human intervention, risk-avoidance measures, etc.).
[Definition of the Censored Markov Decision Process (CeMDP)]
A CeMDP is defined by a seven-element tuple {S, A, P_M, R_M, γ, Ε, μ}. S, A, P_M, R_M, and γ are the same as in the MDP: the set of states, the set of actions, the state transition probability, the reward function, and the discount rate, respectively. Ε ⊆ S is the set of states in which the external system makes decisions, and μ: Ε × A → [0,1] is the fixed policy that the external system follows. The set of states in which the learner makes decisions is hereinafter denoted "L". The elements of L and Ε do not overlap, and the union of L and Ε equals the state set S (Ε ∪ L = S, Ε ∩ L = φ). That is, L is the set obtained by removing Ε from S. In the following, μ is also written π_e for convenience of explanation.
The interaction of the three parties shown in FIG. 1 is expressed as follows. At each time t, if the state s_t ∈ L, the action a_t is determined according to the learner's policy π_l(a_t|s_t). Otherwise, if the state s_t ∈ Ε, the action a_t is determined according to the external system's policy μ(a_t|s_t). The learner's policy π_l is the target of optimization in this embodiment, whereas the policy μ of the external system is fixed. When the action is executed, the learner receives the reward r_{t+1} according to the reward function R_M(s_t, a_t) and transitions to the state s_{t+1} at the next time according to the state transition probability P_M(s_{t+1}|s_t, a_t).
FIG. 2(c) shows an example of the learner's decision-making timing in CeMDP. In FIG. 2(c), the white rectangles indicate the timing of decision making based on the policy π_l(a_t|s_t), and each point indicates the timing of decision making based on the policy μ(a_t|s_t). Note that if Ε is an empty set, this interaction is exactly the same as that of an ordinary MDP.
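The three-party interaction can be sketched as the following rollout loop; the simulator interface env_step(s, a) -> (reward, next_state) and the array/dict representation of the two policies are assumptions of this sketch, not details from the patent.

```python
import numpy as np

def cemdp_rollout(s0, L, pi_l, pi_e, env_step, horizon, rng):
    """At each step the learner decides in states s in L; the external system's fixed
    policy pi_e decides in states s in E (the complement of L)."""
    s, trajectory = s0, []
    for _ in range(horizon):
        policy = pi_l if s in L else pi_e        # who decides depends on the current state
        a = rng.choice(len(policy[s]), p=policy[s])
        r, s_next = env_step(s, a)
        trajectory.append((s, a, r, s_next))
        s = s_next
    return trajectory
```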
[Redefining the Environment]
In constructing algorithms for CeMDP, it is considered useful, as was the case for OMDP, to take the approach of redefining the environment (the reward function and the state transition probability) so that MDP algorithms can be applied. The inventor therefore considered redefining the environment of CeMDP. The key idea in the redefinition is that, in the CeMDP interaction, the learner needs to make decisions only in states s ∈ L. Accordingly, by redefining the environment and the external system of the CeMDP in FIG. 1 as the environment of an MDP whose state space is L, as shown in FIG. 1, the optimal policy and the optimal value function can be obtained by algorithms that are (almost) the same as the MDP algorithms. FIG. 3 shows a schematic diagram of the redefined environment.
Based on the above approach, the inventor derived the state transition probability and the reward function in the redefined environment. Without loss of generality, assume that the states are reordered so that the state transition probability P^a and the matrix P̄^π, obtained by averaging over the action a using the policy π = (π_l, π_e), are expressed by the following block matrix:
[Math 14]
Similarly, a subscript l or e denotes the vector obtained by extracting from a given vector the elements corresponding to the states in L or Ε, respectively. For example:
[Math 15]
It can then be shown that the state transition probability and the reward function in the redefined environment are given by the following equations.
<Theorem 1 (Redefined state transition probability)>
The state transition probability P_C (over the state set L) in the redefined environment is given, in its matrix representation P^a_C under a given action a, by the following equation.
[Math 16]
<Theorem 2 (Redefined reward function)>
The reward function R_C (over the state set L) in the redefined environment is given, in its matrix representation R^a_C under a given action a, by the following equation.
[Math 17]
The above state transition probability P_C and reward function R_C can be obtained either by matrix operations (using inverse matrix computation) or by using power iteration.
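The closed forms of Theorems 1 and 2 are given as [Math 16] and [Math 17] in the original publication; as the text notes, evaluating them amounts to solving a linear system, which can be done either with a direct solve or with power iteration. The following generic helper is a sketch of that computation under the assumption that the quantity of interest satisfies a fixed-point relation X = B + M X; the concrete blocks M and B come from the theorems themselves and are not reproduced here.

```python
import numpy as np

def solve_fixed_point(M, B, use_power_iteration=False, tol=1e-10, max_iter=100_000):
    """Return X with X = B + M @ X, i.e. X = (I - M)^{-1} B.
    The power-iteration branch converges when the spectral radius of M is below 1."""
    if not use_power_iteration:
        return np.linalg.solve(np.eye(M.shape[0]) - M, B)
    X = np.zeros_like(B)
    for _ in range(max_iter):
        X_new = B + M @ X
        if np.max(np.abs(X_new - X)) < tol:
            return X_new
        X = X_new
    return X
```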
[Value Function and Bellman Equation]
If a value function is defined using the state transition probability and the reward function of the redefined environment, it satisfies the following Bellman equation:
[Math 18]
where π_l denotes a policy over the states in the set L. Therefore, the optimal policy and the optimal value function can be obtained by the same algorithms as for an MDP.
[Planning Algorithm]
Here, as an example of a planning algorithm, an algorithm based on value iteration is shown. As described above, this algorithm is referred to in this embodiment as "CenVI (Censored MDP Value Iteration)".
FIG. 4 is a diagram for explaining a processing procedure including the CenVI algorithm. In FIG. 4, the state transition probability P_C(s'|s, a) and the reward function R_C(s, a) of the redefined environment are first calculated, and then the optimal value function is obtained by CenVI, which is based on value iteration. Since the number of states over which value iteration is performed is reduced from |S| to |L|, value iteration only needs to be run over a smaller state space. Algorithms based on algorithms other than value iteration can also be constructed in much the same way.
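A minimal sketch of the CenVI step of FIG. 4 (lines 5 to 8), assuming the redefined quantities have already been tabulated as arrays indexed by the states in L. Whether a separate discount factor still multiplies the backup depends on how the discounting is folded into P_C in [Math 16]; that detail follows the original equations, so gamma is kept as an explicit parameter here (pass 1.0 if it is already folded in).

```python
import numpy as np

def cenvi(P_C, R_C, gamma, tol=1e-8, max_iter=10_000):
    """Value iteration over the reduced state set L.
    P_C[a, s, s'] and R_C[s, a] are the redefined transition probability and reward
    (Theorems 1 and 2), with s, s' ranging over L only."""
    n_states, n_actions = R_C.shape
    V = np.zeros(n_states)
    Q = np.zeros((n_states, n_actions))
    for _ in range(max_iter):
        Q = R_C + gamma * np.einsum("asx,x->sa", P_C, V)   # backup restricted to L
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    policy = Q.argmax(axis=1)                              # greedy policy on L
    return V, Q, policy
```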
[Reinforcement Learning Algorithm]
Next, as an example of a reinforcement learning algorithm, an algorithm based on Q-learning is shown. As described above, this algorithm is referred to in this embodiment as "CenQ learning (Censored MDP Q-Learning)".
FIG. 5 is a diagram for explaining CenQ learning. In CenQ learning, each time the learner visits a state belonging to L, the estimate ^Q_C of the optimal action value function Q*_C is updated based on the following equation. (^Q denotes the symbol written as Q with a hat above it.)
[Math 19]
where η_t denotes the learning rate and t + k' denotes the first time after time t at which the learner visits a state belonging to L. It can be shown that CenQ learning converges to the true value function under certain conditions, as do ordinary Q-learning and Q-learning in OMDP. That is, the optimal value function is derived by updating the value function using the state when the learner visits a state belonging to L (s_{t+k'}), the state when the learner last visited a state belonging to L before that time (s_t), and the discounted sum of the rewards obtained in between (the first term on the right-hand side of the second equation of equation (10)).
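A minimal sketch of the CenQ update described above, applied each time the learner arrives at a state in L; it follows the textual description of [Math 19] (the exact equation is in the original publication) and mirrors the option-level update shown earlier, so the precise form should be checked against [Math 19].

```python
import numpy as np

def cenq_update(Q, s_prev, a_prev, rewards_between, s_now, gamma, eta):
    """s_prev, a_prev: state and action at the learner's previous visit to L.
    rewards_between: the k' rewards observed since then.  s_now: the state in L
    visited now (s_{t+k'} in the text)."""
    k = len(rewards_between)
    g = sum(gamma**j * r for j, r in enumerate(rewards_between))
    target = g + gamma**k * np.max(Q[s_now])
    Q[s_prev, a_prev] += eta * (target - Q[s_prev, a_prev])
    return Q
```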
[Value Function Derivation Device 10]
The value function derivation device 10 that realizes the above is described below. FIG. 6 is a diagram showing a hardware configuration example of the value function derivation device 10 in the first embodiment. The value function derivation device 10 of FIG. 6 has a drive device 100, an auxiliary storage device 102, a memory device 103, a processor 104, an interface device 105, and the like, which are connected to one another by a bus B.
The program that realizes the processing in the value function derivation device 10 is provided by a recording medium 101 such as a CD-ROM. When the recording medium 101 storing the program is set in the drive device 100, the program is installed in the auxiliary storage device 102 from the recording medium 101 via the drive device 100. However, the program does not necessarily have to be installed from the recording medium 101 and may instead be downloaded from another computer via a network. The auxiliary storage device 102 stores the installed program and also stores necessary files, data, and the like.
The memory device 103 reads the program from the auxiliary storage device 102 and stores it when an instruction to start the program is given. The processor 104 is a CPU or a GPU (Graphics Processing Unit), or a CPU and a GPU, and executes the functions of the value function derivation device 10 according to the program stored in the memory device 103. The interface device 105 is used as an interface for connecting to a network.
 図7は、第1の実施の形態における価値関数導出装置10の機能構成例を示す図である。第1の実施の形態の価値関数導出装置10は、プランニングアルゴリズムによって最適価値関数及び最適方策を導出ために、パラメタ入力部11、CeMDPパラメタ計算部12、プランニングアルゴリズム実行部13及び実行結果処理部14及等を有する。これら各部は、価値関数導出装置10にインストールされた1以上のプログラムが、プロセッサ104に実行させる処理により実現される。価値関数導出装置10は、また、入力パラメタ記憶部121、設定パラメタ記憶部122、CeMDPパラメタ記憶部123及び実行結果記憶部124を利用する。これら各記憶部は、例えば、補助記憶装置102、又は価値関数導出装置10にネットワークを介して接続可能な記憶装置等を用いて実現可能である。 FIG. 7 is a diagram showing a functional configuration example of the value function deriving device 10 according to the first embodiment. The value function derivation device 10 of the first embodiment has a parameter input unit 11, a CeMDP parameter calculation unit 12, a planning algorithm execution unit 13, and an execution result processing unit 14 in order to derive an optimum value function and an optimum measure by a planning algorithm. And so on. Each of these parts is realized by a process of causing the processor 104 to execute one or more programs installed in the value function derivation device 10. The value function derivation device 10 also uses an input parameter storage unit 121, a setting parameter storage unit 122, a CeMDP parameter storage unit 123, and an execution result storage unit 124. Each of these storage units can be realized by using, for example, an auxiliary storage device 102, a storage device that can be connected to the value function derivation device 10 via a network, or the like.
FIG. 8 is a flowchart for explaining an example of the processing procedure executed by the value function derivation device 10 in the first embodiment.
In step S101, the parameter input unit 11 receives the parameters of the MDP (the state transition probability P_M, the reward function R_M, and the discount rate γ) and the parameters used when the algorithm is executed (such as the maximum number of iterations; hereinafter, "setting parameters"), records the MDP parameters in the input parameter storage unit 121, and records the setting parameters in the setting parameter storage unit 122. A sketch of these inputs is shown below.
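For concreteness, the following is a minimal sketch of the data handled in step S101, assuming a tabular MDP; the container names (MDPParams, AlgorithmSettings) and the convergence threshold are illustrative assumptions and do not appear in the specification.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MDPParams:
    P_M: np.ndarray   # state transition probabilities, shape (num_states, num_actions, num_states)
    R_M: np.ndarray   # reward function, shape (num_states, num_actions)
    gamma: float      # discount rate

@dataclass
class AlgorithmSettings:
    max_iterations: int = 1000   # maximum number of iterations
    tolerance: float = 1e-6      # convergence threshold (an assumed additional setting)
```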
Subsequently, the CeMDP parameter calculation unit 12 takes as input the state transition probability P_M, the reward function R_M, and the discount rate γ recorded in the input parameter storage unit 121, together with the setting parameters recorded in the setting parameter storage unit 122, and calculates the redefined environment's state transition probability P_C(s'|s,a) and reward function R_C(s,a) for all combinations of s, s' ∈ L, and a (S102). The CeMDP parameter calculation unit 12 records the calculated state transition probability P_C and reward function R_C in the CeMDP parameter storage unit 123. Step S102 corresponds to the processing of line numbers 2 and 3 in FIG. 4. The state transition probability P_C is calculated based on Theorem 1, and the reward function R_C is calculated based on Theorem 2. All combinations of s, s' ∈ L, and a can be identified based on the state transition probability P_M. A sketch of this computation is given after this paragraph.
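The exact formulas for P_C and R_C are those of Theorems 1 and 2, which are not reproduced in this part of the specification. The sketch below therefore only illustrates one plausible shape of the computation, under the assumption that P_C and R_C are obtained by marginalizing over excursions through the external-system states E that follow the fixed policy μ; the function name, the indexing conventions, and the particular treatment of discounting are assumptions made for illustration, not the patent's own formulas.

```python
import numpy as np

def compute_cemdp_params(P_M, R_M, gamma, L, E, mu):
    """Illustrative computation of P_C and R_C (not the patent's exact formulas).

    P_M: (S, A, S) transition probabilities, R_M: (S, A) rewards,
    L, E: index arrays of learner states and external-system states,
    mu: dict mapping each state in E to the action chosen by the external system.
    """
    A = P_M.shape[1]
    # Dynamics and rewards of the external system acting under mu.
    P_EE = np.array([[P_M[e, mu[e], e2] for e2 in E] for e in E])   # E -> E
    P_EL = np.array([[P_M[e, mu[e], l] for l in L] for e in E])     # E -> L
    r_E = np.array([R_M[e, mu[e]] for e in E])
    # Expected number of visits inside E before returning to L (assumes every
    # excursion ends with probability 1), and the expected discounted reward
    # accumulated during one excursion.
    N = np.linalg.inv(np.eye(len(E)) - P_EE)
    w = np.linalg.inv(np.eye(len(E)) - gamma * P_EE) @ r_E
    P_C = np.zeros((len(L), A, len(L)))
    R_C = np.zeros((len(L), A))
    for i, s in enumerate(L):
        for a in range(A):
            direct = P_M[s, a, L]                 # reach another learner state in one step
            via_E = P_M[s, a, E] @ N @ P_EL       # reach it after an excursion through E
            P_C[i, a] = direct + via_E
            R_C[i, a] = R_M[s, a] + gamma * (P_M[s, a, E] @ w)
    return P_C, R_C
```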
Subsequently, the planning algorithm execution unit 13 takes as input the state transition probability P_C and the reward function R_C recorded in the CeMDP parameter storage unit 123, the discount rate γ recorded in the input parameter storage unit 121, and the setting parameters recorded in the setting parameter storage unit 122, derives (calculates) the (optimal) value function and the (optimal) policy by a planning algorithm such as the value iteration method, and records the (optimal) value function and the (optimal) policy in the execution result storage unit 124 (S103). Step S103 corresponds to the processing of line numbers 5 to 8 in FIG. 4 (the method of calculating the value function in the Markov decision process based on the state transition probability P_C and the reward function R_C). Although, for convenience, FIG. 4 does not describe the processing of calculating the (optimal) policy, this processing is omitted because the (optimal) policy can be derived by a known method once the (optimal) value function is known.
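As one concrete instance of such a planning algorithm, the following is a standard value iteration sketch run on the redefined parameters (P_C, R_C). Whether γ is applied exactly as written depends on the discounting convention of Theorems 1 and 2 and of the algorithm in FIG. 4, so this should be read as an assumption-laden illustration rather than the patent's own procedure.

```python
import numpy as np

def value_iteration(P_C, R_C, gamma, max_iterations=1000, tolerance=1e-6):
    """Value iteration over the learner states only (an illustrative sketch)."""
    n_L, n_actions, _ = P_C.shape
    V = np.zeros(n_L)
    Q = np.zeros((n_L, n_actions))
    for _ in range(max_iterations):
        Q = R_C + gamma * (P_C @ V)   # expected return of each (state, action) pair
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tolerance:
            V = V_new
            break
        V = V_new
    policy = Q.argmax(axis=1)         # greedy ("optimal") policy on the learner states
    return V, policy
```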
Subsequently, the execution result processing unit 14 outputs the optimal value function or the optimal policy recorded in the execution result storage unit 124 (S104).
Next, the second embodiment will be described. For the second embodiment, the differences from the first embodiment are described; points not specifically mentioned in the second embodiment may be the same as in the first embodiment.
FIG. 9 is a diagram showing a functional configuration example of the value function derivation device 10 in the second embodiment. In FIG. 9, parts that are the same as or correspond to those in FIG. 7 are given the same reference numerals. In order to derive an optimal value function and an optimal policy by a reinforcement learning algorithm, the value function derivation device 10 of the second embodiment has a parameter input unit 11, a reinforcement learning algorithm execution unit 15, an execution result processing unit 14, and the like. Each of these units is realized by processing that one or more programs installed in the value function derivation device 10 cause the processor 104 to execute. The value function derivation device 10 also uses an input parameter storage unit 121, a setting parameter storage unit 122, and an execution result storage unit 124. Each of these storage units can be realized by using, for example, the auxiliary storage device 102 or a storage device that can be connected to the value function derivation device 10 via a network.
FIG. 10 is a flowchart for explaining an example of the processing procedure executed by the value function derivation device 10 in the second embodiment.
In step S201, the parameter input unit 11 receives a simulator that describes the interaction with the MDP (that is, something capable of computing the rewards obtained by the learner and the state transitions based on the MDP, for example a program) and the MDP parameter (the discount rate γ), and records them in the input parameter storage unit 121. The parameter input unit 11 also receives the parameters used when the algorithm is executed (such as the learning rate and the maximum number of iterations; hereinafter, "setting parameters") and records them in the setting parameter storage unit 122. One assumed shape of the simulator interface is sketched below.
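The simulator interface is not specified in detail here, so the following sketch merely records one assumption about what such a simulator could return to the learner: the next state in which the learner again makes a decision, together with the rewards and the number of environment steps elapsed in between (the latter two anticipate the update described for FIG. 5 and claim 3). The class and method names are illustrative, not the patent's own API.

```python
from typing import Protocol, Tuple

class CeMDPSimulator(Protocol):
    """Assumed interface of the simulator supplied in step S201."""

    def reset(self) -> int:
        """Return an initial state in which the learner makes a decision."""
        ...

    def step(self, state: int, action: int) -> Tuple[int, float, int, bool]:
        """Return (next learner state, discounted sum of rewards obtained until
        that state is reached, number of elapsed environment steps, done flag)."""
        ...
```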
Subsequently, the reinforcement learning algorithm execution unit 15 takes as input the simulator and the discount rate γ recorded in the input parameter storage unit 121 and the setting parameters recorded in the setting parameter storage unit 122, calculates the (optimal) value function Q_C and the (optimal) policy by a reinforcement learning algorithm such as CenQ learning, and stores the (optimal) value function Q_C and the (optimal) policy in the execution result storage unit 124 (S202). That is, in step S202, the algorithm shown in FIG. 5 is executed.
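FIG. 5 itself is not reproduced in this part of the specification, so the following is only a hedged sketch of a Q-learning-style update in the spirit of CenQ learning: the value function Q_C is maintained only over the learner states, and each update uses the discounted reward sum collected between two consecutive visits to a learner state, discounting the bootstrap term by γ^k for the k elapsed environment steps (cf. claim 3). The function name, the ε-greedy exploration, and the hyperparameter defaults are assumptions; the simulator is the assumed interface sketched above.

```python
import numpy as np

def cenq_learning_sketch(sim, n_learner_states, n_actions, gamma,
                         alpha=0.1, epsilon=0.1, episodes=1000,
                         max_decisions=200, seed=0):
    """Illustrative reinforcement learning loop for step S202 (not FIG. 5 itself)."""
    rng = np.random.default_rng(seed)
    Q_C = np.zeros((n_learner_states, n_actions))
    for _ in range(episodes):
        s = sim.reset()
        for _ in range(max_decisions):
            # epsilon-greedy action selection on the learner states
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(Q_C[s].argmax())
            # g: discounted reward sum until the next learner state, k: elapsed steps
            s_next, g, k, done = sim.step(s, a)
            target = g if done else g + (gamma ** k) * Q_C[s_next].max()
            Q_C[s, a] += alpha * (target - Q_C[s, a])
            s = s_next
            if done:
                break
    policy = Q_C.argmax(axis=1)
    return Q_C, policy
```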
Subsequently, the execution result processing unit 14 outputs the optimal value function or the optimal policy recorded in the execution result storage unit 124 (S203).
As described above, according to each of the above embodiments, the value function can be estimated with high accuracy and an optimal policy can be obtained in a situation in which three parties interact: the learner that is the acting agent, the environment in which the learner exists, and the external system (human intervention, danger-avoidance measures, and the like). That is, the estimation accuracy of the value function can be improved in a situation where an external system exists. As a result, existing planning algorithms such as the value iteration method and the policy iteration method, and reinforcement learning algorithms such as Temporal Difference (TD) learning and Q-learning, can be extended so that they can be used in situations where intervention by humans or manually designed controllers exists.
In the present embodiment, S is an example of the set of first states. The policy μ followed by the external system is an example of the predetermined measure. E is an example of the set of second states. L is an example of the set of third states. R_M is an example of the first reward function. P_M is an example of the first state transition probability. R_C is an example of the second reward function. P_C is an example of the second state transition probability. The CeMDP parameter calculation unit 12 and the planning algorithm execution unit 13, or the reinforcement learning algorithm execution unit 15, are examples of the derivation unit.
Although the embodiments of the present invention have been described in detail above, the present invention is not limited to these specific embodiments, and various modifications and changes are possible within the scope of the gist of the present invention described in the claims.
10  Value function derivation device
11  Parameter input unit
12  CeMDP parameter calculation unit
13  Planning algorithm execution unit
14  Execution result processing unit
15  Reinforcement learning algorithm execution unit
100  Drive device
101  Recording medium
102  Auxiliary storage device
103  Memory device
104  Processor
105  Interface device
121  Input parameter storage unit
122  Setting parameter storage unit
123  CeMDP parameter storage unit
124  Execution result storage unit
B  Bus

Claims (7)

  1.  A value function derivation method, wherein a computer executes a derivation procedure of deriving a value function of a learner in a Markov decision process in which the learner makes decisions for a set of third states, the set of third states being obtained by excluding, from a set of first states in an environment, a set of second states in which an external system makes decisions based on a predetermined measure.
  2.  The value function derivation method according to claim 1, wherein, based on a first reward function and a first state transition probability in the environment, the derivation procedure calculates a second reward function and a second state transition probability obtained when the environment and the external system are redefined as an environment of a Markov decision process whose state space is the set of third states, and derives the value function by a method of calculating a value function in the Markov decision process based on the second reward function and the second state transition probability.
  3.  The value function derivation method according to claim 1, wherein the derivation procedure derives the value function by updating the value function using the state at a first time at which the learner visits a state belonging to the set of third states, the state at a second time at which the learner last visited a state belonging to the set of third states before the first time, and a discounted sum of rewards obtained between the first time and the second time.
  4.  A value function derivation device comprising a derivation unit that derives a value function of a learner in a Markov decision process in which the learner makes decisions for a set of third states, the set of third states being obtained by excluding, from a set of first states in an environment, a set of second states in which an external system makes decisions based on a predetermined measure.
  5.  The value function derivation device according to claim 4, wherein, based on a first reward function and a first state transition probability in the environment, the derivation unit calculates a second reward function and a second state transition probability obtained when the environment and the external system are redefined as an environment of a Markov decision process whose state space is the set of third states, and derives the value function by a method of calculating a value function in the Markov decision process based on the second reward function and the second state transition probability.
  6.  The value function derivation device according to claim 4, wherein the derivation unit derives the value function by updating the value function using the state at a first time at which the learner visits a state belonging to the set of third states, the state at a second time at which the learner last visited a state belonging to the set of third states before the first time, and a discounted sum of rewards obtained between the first time and the second time.
  7.  A program that causes a computer to execute the value function derivation method according to any one of claims 1 to 3.